To gamma or not to gamma? Testing the fit of rates-across-sites models
Humphries, MA and Holland, BR and Karpievitch, YV and Sumner, JG, To gamma or not to gamma? Testing the fit of rates-across-sites models, Phylomania 2012, 8-9 November 2012, University of Tasmania, Hobart, pp. 7. (2012) [Conference Extract]
Since the introduction of explicitly model-based methods of phylogenetic inference (e.g. maximum
likelihood and Bayesian approaches), the complexity and biological realism of models of sequence evolution
have increased. An important advance in this regard was the introduction of models that allowed rate
variation across sites (RAS), i.e. they modelled the fact that some sites in a gene may be more or
less likely to accept substitutions than others. The most common way of accomplishing this is to use
a discrete approximation to a gamma distribution. This has the computational advantage of allowing
several (usually 4 or 8) different rate categories while adding only a single extra parameter, the shape
of the gamma distribution, to the model.
However, overly simplistic models of RAS can cause problems for phylogenetic inference and for
estimating dates of divergence. In particular, a recent study has shown that if a small
number of sites mutate very frequently compared to other sites (so-called hot spots), this can lead
to time-dependence of rate estimates (Soubrier et al. 2012).
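As an illustration of the discrete gamma approximation described above, the following Python sketch computes equal-probability category rates for a gamma distribution with mean 1, taking the mean rate within each category as its representative. It is a minimal sketch rather than the method used in this study: the function names (reg_lower_gamma, gamma_quantile, discrete_gamma_rates) and the shape value alpha = 0.5 are illustrative assumptions, and the incomplete gamma function is evaluated with a simple power series that is adequate only for moderate arguments.

```python
import math


def reg_lower_gamma(a, x, tol=1e-12, max_terms=10_000):
    """Regularised lower incomplete gamma function P(a, x), via its power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    for n in range(1, max_terms):
        term *= x / (a + n)
        total += term
        if term < tol * total:
            break
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))


def gamma_quantile(p, alpha):
    """Quantile of Gamma(shape=alpha, rate=alpha), i.e. mean 1, found by bisection."""
    lo, hi = 0.0, 1.0
    while reg_lower_gamma(alpha, alpha * hi) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if reg_lower_gamma(alpha, alpha * mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def discrete_gamma_rates(alpha, k=4):
    """Mean rate of each of k equal-probability categories of a mean-1 gamma."""
    cuts = [0.0] + [gamma_quantile(i / k, alpha) for i in range(1, k)]
    # The partial expectation of the rate up to a cut point c is
    # P(alpha + 1, alpha * c); it equals 1 (the overall mean) at infinity.
    partial = [reg_lower_gamma(alpha + 1.0, alpha * c) for c in cuts] + [1.0]
    return [k * (partial[i + 1] - partial[i]) for i in range(k)]


# Illustrative shape value (an assumption, not an estimate from the study).
rates = discrete_gamma_rates(alpha=0.5, k=4)
print([round(r, 4) for r in rates])
```

Because all k category rates are derived from the single shape parameter alpha, moving from 4 to 8 categories adds no parameters to the model.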
In this study we used amino-acid data from a study by Grahnen et al. (2011), who simulated data
using a biophysical model of protein folding and binding. We extracted the number of mutations at
each site and fitted a variety of models to these counts. In particular:
• A constant rate across sites implies that the frequency distribution of mutation counts should follow a Poisson distribution.
• Gamma-distributed RAS implies that the counts should follow a negative binomial distribution.
• Gamma-distributed RAS with invariant sites implies that the counts should follow a zero-inflated
negative binomial distribution.
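The correspondence between RAS models and count distributions listed above suggests a simple way to compare fits by maximum likelihood. The Python sketch below fits a Poisson and a negative binomial distribution to per-site mutation counts and compares them by AIC. It also gives the zero-inflated negative binomial log-likelihood, whose fit would need a general-purpose optimiser and is not attempted here. The counts are made-up illustrative values, not data from Grahnen et al. (2011), and the function names, the mean/shape parameterisation and the use of AIC are assumptions of this sketch, not the authors' procedure.

```python
import math


def poisson_loglik(counts, lam):
    """Log-likelihood of the counts under Poisson(lam)."""
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in counts)


def negbin_logpmf(k, mean, shape):
    """Log pmf of a Poisson count whose rate is gamma distributed (mean, shape)."""
    p = shape / (shape + mean)
    return (math.lgamma(k + shape) - math.lgamma(shape) - math.lgamma(k + 1)
            + shape * math.log(p) + k * math.log(1.0 - p))


def negbin_loglik(counts, mean, shape):
    return sum(negbin_logpmf(k, mean, shape) for k in counts)


def zinb_loglik(counts, mean, shape, pi0):
    """Zero-inflated negative binomial: a fraction pi0 of sites never change."""
    total = 0.0
    for k in counts:
        if k == 0:
            nb_zero = math.exp(negbin_logpmf(0, mean, shape))
            total += math.log(pi0 + (1.0 - pi0) * nb_zero)
        else:
            total += math.log(1.0 - pi0) + negbin_logpmf(k, mean, shape)
    return total


def golden_max(f, lo, hi, iters=100):
    """Golden-section search for the maximiser of f on [lo, hi]."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    fc, fd = f(c), f(d)
    for _ in range(iters):
        if fc > fd:
            b, d, fd = d, c, fc
            c = b - g * (b - a)
            fc = f(c)
        else:
            a, c, fc = c, d, fd
            d = a + g * (b - a)
            fd = f(d)
    return 0.5 * (a + b)


def fit_negbin(counts):
    """ML fit: the mean estimate is the sample mean; the shape is found by 1-D search."""
    mean = sum(counts) / len(counts)
    log_shape = golden_max(
        lambda s: negbin_loglik(counts, mean, math.exp(s)),
        math.log(1e-3), math.log(1e3))
    return mean, math.exp(log_shape)


# Hypothetical per-site mutation counts (illustrative only, not real data).
counts = ([0] * 30 + [1] * 25 + [2] * 15 + [3] * 10 + [4] * 7
          + [5] * 5 + [7] * 3 + [9] * 2 + [12] * 2 + [15])

mean_hat, shape_hat = fit_negbin(counts)
aic_poisson = 2 * 1 - 2 * poisson_loglik(counts, mean_hat)
aic_negbin = 2 * 2 - 2 * negbin_loglik(counts, mean_hat, shape_hat)
print(f"Poisson AIC: {aic_poisson:.1f}")
print(f"Negative binomial AIC: {aic_negbin:.1f} (shape {shape_hat:.2f})")
```

A lower AIC for the negative binomial indicates overdispersion relative to the Poisson, which is what gamma-distributed rates would produce. A low AIC does not by itself show that the model fits acceptably, so a goodness-of-fit check on the fitted distribution would still be needed.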
We will discuss the merits of these models and whether or not any of them provide an acceptable fit
to data generated under biologically realistic conditions.
phylogenetic inference, maximum likelihood, Bayesian, sequence evolution, rate variation across sites, RAS, gamma distribution