Bustamante et al., Supplementary Nature Manuscript

Details of PRF Methodology

In the Poisson Random Field (PRF) model, it is assumed that non-synonymous mutations at a given gene are either very deleterious or have selective effect $s$, such that the fitness of a heterozygote is $1 + s$ relative to the wildtype and the fitness of a homozygous individual is $1 + 2s$ (ref. 1). As in most population genetic models, we can only estimate the product of the selection coefficient and the population size ($\gamma = 2N_e s$). The strength of purifying selection against very deleterious mutations is captured in the non-lethal non-synonymous mutation rate parameter $\theta_r = 2N_e \mu f_0$, where $N_e$ is the effective population size, $\mu$ is the mutation rate, and $f_0$ is the fraction of mutations that are non-lethal. Synonymous sites are assumed to evolve neutrally with mutation rate $\theta_s = 2N_e \mu$. We assume the mutation rate parameters among genes are independent of one another to capture the fact that some genes are subject to very strong purifying selection while others may experience relatively weak purifying selection. Likewise, we make no assumption about how the neutral mutation rate varies from gene to gene.

Using the results of Sawyer and Hartl (1992) (ref. 1), we model the cell entries in the McDonald-Kreitman tables (ref. 2) for a given gene as Poisson random variables with the following expected values:

$$
\begin{array}{lcc}
 & \text{Fixed} & \text{Segregating} \\
\text{Silent} & \theta_s \left( \tau + \dfrac{1}{m} + \dfrac{1}{n} \right) & \theta_s \displaystyle\sum_{i=1}^{n-1} \dfrac{1}{i} \\
\text{Replacement} & \theta_r \left( \dfrac{2\gamma}{1 - e^{-2\gamma}}\,\tau + G(\gamma,m) + G(\gamma,n) \right) & \theta_r\, F(\gamma,n)
\end{array}
\qquad (1.1)
$$

where $F(\gamma,n)$ and $G(\gamma,n)$ are integrals over the distribution of mutation frequencies that depend on the selection coefficient for the gene and the number of sequences sampled from species 1 ($n$) and species 2 ($m$) (see Sawyer and Hartl, 1992):
$$
F(\gamma,n) = \int_0^1 \frac{1 - x^n - (1-x)^n}{x(1-x)} \, \frac{1 - e^{-2\gamma(1-x)}}{1 - e^{-2\gamma}} \, dx,
\qquad
G(\gamma,n) = \int_0^1 x^{n-1} (1-x)^{-1} \, \frac{1 - e^{-2\gamma(1-x)}}{1 - e^{-2\gamma}} \, dx
\qquad (1.2)
$$

Equation (1.1) assumes that polymorphism data are modelled for only one species, as is the case for our study. The parameter $\tau$ is the number of generations since species divergence divided by twice the effective population size. We treat this quantity as constant among genes and estimate it using all of the data. Since the above equations are only strictly true under the assumption of independence among sites and a population of constant size, we use simulations to gauge the accuracy and robustness of our results to model misspecification.

We write down the joint probability of the MK data, given the selection coefficient and mutation rate parameters for each of the $G$ genes in our sample and the time since species divergence, as the product of the probabilities of the individual entries in the MK tables for all genes:

$$
\Pr\{P, D \mid \gamma, \theta, \tau\} = \prod_{i=1}^{G} \prod_{c \in \{n,s\}} \Pr\{D_{c,i} \mid \gamma_{c,i}, \tau, \theta_{c,i}\} \, \Pr\{P_{c,i} \mid \gamma_{c,i}, \theta_{c,i}\}
\qquad (1.3)
$$

where $P_{c,i}$ is the number of SNPs and $D_{c,i}$ is the number of fixed differences of type $c$ (either synonymous or non-synonymous) in gene $i$. We treat all synonymous sites as neutral ($\gamma_{S,i} = 0$ for all $i$) and obtain the probabilities on the right-hand side of (1.3) directly from the Poisson distribution with mean given by the corresponding entry in (1.1). To obtain the posterior distribution on $\gamma_i$ conditional on the species divergence time, we use a Normal prior with mean 0 and standard deviation of $\sigma = 8$ such that
$$
\begin{aligned}
\Pr\{\gamma_i \mid P_{N,i}, D_{N,i}, \tau\} \propto{}&
\left[ \frac{F(\gamma_i,n)}{\beta_i + F(\gamma_i,n) + \frac{2\gamma_i}{1 - e^{-2\gamma_i}}\,\tau + G(\gamma_i,m) + G(\gamma_i,n)} \right]^{P_{N,i}} \\
&\times \left[ \frac{\frac{2\gamma_i}{1 - e^{-2\gamma_i}}\,\tau + G(\gamma_i,m) + G(\gamma_i,n)}{\beta_i + F(\gamma_i,n) + \frac{2\gamma_i}{1 - e^{-2\gamma_i}}\,\tau + G(\gamma_i,m) + G(\gamma_i,n)} \right]^{D_{N,i}} \\
&\times \left[ \frac{1}{\beta_i + F(\gamma_i,n) + \frac{2\gamma_i}{1 - e^{-2\gamma_i}}\,\tau + G(\gamma_i,m) + G(\gamma_i,n)} \right]^{\alpha_i}
e^{-\gamma_i^2 / (2\sigma^2)}
\end{aligned}
\qquad (1.4)
$$

where $\alpha_i$ and $\beta_i$ are parameters of a Gamma prior distribution on the mutation rate for the locus (in practice we set these to 0.01 for all genes, which makes the prior uninformative). The first two terms on the right-hand side of expression (1.4) represent the conditional probability, given $\tau$, of observing $P_{N,i}$ non-synonymous polymorphisms and $D_{N,i}$ non-synonymous fixed differences, respectively; the third term comes from the prior distribution on the mutation rate, a parameter which has been integrated out of the posterior distribution; and the fourth term is the prior distribution of $\gamma$.

In order to classify individual loci as positively or negatively selected, we focus on quantifying the posterior probability for a given gene that its selection coefficient is greater (or less) than 0 given the observed data for the gene ($P_i = \Pr\{\gamma_i > 0 \mid P_{N,i}, D_{N,i}\}$). If $P_i$ is greater than 97.5%, this is mathematically equivalent to saying that the 95% highest posterior density credibility interval (Bayesian confidence interval) for the selection coefficient is above 0, and we classify such genes as positively selected. Likewise, if $\Pr\{\gamma_i < 0 \mid P_{N,i}, D_{N,i}\}$ is greater than 97.5% for a given locus, this is equivalent to the corresponding 95% CI being completely below 0, and we classify these genes as negatively selected. We estimate this quantity using the usual Monte Carlo estimator:

$$
P_i = \Pr\{\gamma_i > 0 \mid P_{N,i}, D_{N,i}\} \approx \frac{1}{M} \sum_{m=1}^{M} I\!\left(\gamma_i^{(m)} > 0\right)
\qquad (1.5)
$$
where $I(\cdot)$ is the indicator function, which takes the value 1 if its argument is true and 0 otherwise, and $\gamma_i^{(m)}$ is the value of $\gamma_i$ at step $m$ in a Markov chain Monte Carlo algorithm. All posterior probabilities reported here are from 50,000 retained draws from 10 chains, each of length 50,000 steps, sampled using the Markov chain Monte Carlo algorithm and convergence criteria previously described (refs 3-5), with the modification that the genomic distribution of selective effects is not updated. This simplification is made so that the marginal posterior distributions of the selection coefficient are conditionally independent of one another and can be pooled for further analysis in terms of molecular function and biological process.

Simulations

A potential concern is the robustness of our analysis to deviations from the assumptions of the Sawyer and Hartl Poisson Random Field model used to analyze the data. That is, could nonstandard demography produce genomic patterns of variation that we may misinterpret as signatures of selection? To address this issue, we simulated data using standard coalescent algorithms as implemented in the computer program ms (ref. 6), under complete linkage within genes and three neutral demographic scenarios. For all simulations, we used 10,000 replicates with 79 chromosomes. This mimics the sampling structure of the Celera data: 38 African-American and 40 European-American chromosomes, with 1 chromosome representing the chimpanzee sequences used (chimp SNPs were excluded from the analysis). We assumed a mutation rate of $\theta = 2$, with half of the neutral mutations synonymous and half non-synonymous. This parameter was chosen since close to half of the SNPs in our data are non-synonymous and half are synonymous (see Figure 1A).
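As a numerical check on the neutral baseline against which these simulations are compared, the integrals of equation (1.2) and the expected McDonald-Kreitman cell entries of equation (1.1) can be evaluated directly. The sketch below is illustrative only and is not the mkprf implementation: the function names (`F`, `G`, `fixation_factor`, `expected_mk_table`), the midpoint-rule quadrature, and the step count are all assumptions introduced here. At $\gamma = 0$ the integrals should reduce to the neutral values $F = \sum_{i=1}^{n-1} 1/i$ and $G = 1/n$.

```python
import math

def _weight(x, gamma):
    """(1 - e^{-2*gamma*(1-x)}) / (1 - e^{-2*gamma}); equals (1 - x) in the limit gamma -> 0."""
    if abs(gamma) < 1e-8:
        return 1.0 - x
    return math.expm1(-2.0 * gamma * (1.0 - x)) / math.expm1(-2.0 * gamma)

def _midpoint(f, steps=20000):
    """Midpoint-rule integral of f over (0, 1); f is never evaluated at 0 or 1."""
    h = 1.0 / steps
    return h * sum(f((k + 0.5) * h) for k in range(steps))

def F(gamma, n):
    """Integral F(gamma, n) of equation (1.2): segregating sites in a sample of n."""
    return _midpoint(
        lambda x: (1.0 - x**n - (1.0 - x)**n) / (x * (1.0 - x)) * _weight(x, gamma)
    )

def G(gamma, n):
    """Integral G(gamma, n) of equation (1.2): sites that appear fixed in a sample of n."""
    return _midpoint(lambda x: x**(n - 1) / (1.0 - x) * _weight(x, gamma))

def fixation_factor(gamma):
    """2*gamma / (1 - e^{-2*gamma}); equals 1 in the limit gamma -> 0."""
    if abs(gamma) < 1e-8:
        return 1.0
    return 2.0 * gamma / (-math.expm1(-2.0 * gamma))

def expected_mk_table(theta_s, theta_r, gamma, tau, n, m):
    """Expected Poisson means for the four cells of equation (1.1)."""
    return {
        "silent_fixed": theta_s * (tau + 1.0 / m + 1.0 / n),
        "silent_segregating": theta_s * sum(1.0 / i for i in range(1, n)),
        "replacement_fixed": theta_r
        * (fixation_factor(gamma) * tau + G(gamma, m) + G(gamma, n)),
        "replacement_segregating": theta_r * F(gamma, n),
    }

if __name__ == "__main__":
    # Illustrative call: 78 human chromosomes, 1 chimpanzee chromosome,
    # unit mutation rates (placeholder values, not estimates from the data).
    for cell, mean in expected_mk_table(1.0, 1.0, 0.0, 10.0, n=78, m=1).items():
        print(f"{cell}: {mean:.3f}")
```

With $\gamma = 0$ and equal mutation rates, the replacement row should match the silent row up to quadrature error, which gives a quick consistency check of the two equations. Very large negative values of $\gamma$ would overflow `math.expm1` in this simple form and would need a rescaled integrand.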
Our choice of mutation rate is twice the average estimate of the mutation rate, and makes our results conservative, since the smaller the mutation rate, the better the Poisson approximation to the cell entries. For all models, we used a human-chimpanzee divergence of $\tau = 10$. The demographic models considered are:
a) Panmixia among humans (all 78 chromosomes from a randomly mating population) with constant size (i.e., the standard neutral model).
b) Population structure model A: 40 European-American chromosomes drawn from one population and 38 African-American chromosomes drawn from another, with a migration rate of $M = 4N_e m = 1$ per generation. Going backwards in time, the European-American population undergoes a population contraction of 90% at time $0.1 \times 2N_e$ generations (40-50 thousand years ago), while the African-American population has a 50% reduction. The two human populations are then joined at time 0.25 in units of $2N_e$ generations (~100-125 thousand years ago). The human and chimpanzee populations are joined at time 10 (~5 million years ago).
c) Population structure model B: same as model A except with twice the migration rate.

In Figure 1B we report the distribution of posterior probabilities that the selection coefficient for a gene is above 0 for each of the three models considered here, as well as for the Celera data. It is important to keep in mind that posterior probabilities are not the same as P-values, so there is no theoretical reason for them to follow a uniform distribution (as would be the case for P-values if the null hypothesis is true). The Celera data have a clear excess of genes with high and low posterior probabilities (i.e., too many in the <1%, 1-5%, and >99% categories) regardless of which demographic model is used as the null. The signature is particularly strong for negative selection (this may be partly due to power). Note that in this figure we have conditioned, as in the data, on using only loci with at least 4 variable amino acids in the alignment.

Model Diagnostics

In Figure 1S, we summarize the posterior mean of the selection parameter $\hat{\gamma}$
$= E(2Ns \mid \mathrm{Data})$ for genes with at least two variable amino acid sites in the human-chimp alignments as a function of six aspects of the data (all correlations are based on the square-root transformation of the raw data). We see that $d_S$, the per synonymous site species substitution rate, is slightly correlated with the posterior mean of the selection coefficient ($r = 0.0651 \pm 0.035$; $P < 10^{-3}$). This may be explained by the fact that $\hat{\gamma}$ is strongly positively correlated with $d_N$, the non-synonymous species substitution rate ($r = 0.624 \pm 0.021$; $P < 10^{-16}$), and that $d_N$ and $d_S$ are themselves correlated ($r = 0.099 \pm 0.034$; $P < 10^{-7}$). The former correlation, between $\hat{\gamma}$ and $d_N$, is expected, since the rate of amino acid substitution should increase with the strength of selection $\gamma$. The latter correlation, between $d_N$ and $d_S$, has been previously documented for samples of size $n = 1$ from each species across a variety of methods.

We also observe a significant moderate negative correlation between $\hat{\gamma}$ and $p_S$ ($r = -0.139 \pm 0.034$; $P < 10^{-14}$) and a strong negative correlation between $\hat{\gamma}$ and $p_N$ ($r = -0.665 \pm 0.021$; $P < 10^{-16}$). The latter correlation is not expected, but can be explained by a consideration of power. That is, mkprf relies on the ratio of replacement divergence to replacement polymorphism to estimate $\gamma$: if a gene has low levels of amino acid polymorphism and high levels of amino acid divergence, then this is consistent with strong positive selection and a low mutation rate. This signal will be amplified if a gene has experienced very recent positive selection, since genetic hitchhiking will reduce amino acid polymorphism. Likewise, we observe a positive correlation between the $d_N / d_S$ ratio and the posterior mean of the selection coefficient ($r = 0.377 \pm 0.028$; $P < 10^{-16}$) and a negative correlation between $p_N / p_S$ and $\hat{\gamma}$ ($r = -0.282 \pm 0.032$; $P < 10^{-16}$). This illustrates one important aspect of our analysis that differs from previous work (ref. 7), namely, that we can detect evidence for positive selection in the presence of selective constraint.
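The correlations above are computed on square-root-transformed data. A minimal sketch of one way such a coefficient could be computed is shown below. It assumes that the transformation is applied only to the non-negative data summary (for example $d_N$ or $p_N$) while the posterior mean, which can be negative, is left untransformed; the text does not spell this out, so that choice, the function names, and the input layout are all hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)

def sqrt_transformed_r(raw_summary, posterior_means):
    """Correlate sqrt(raw_summary) with posterior_means.

    raw_summary: non-negative per-gene values (assumed layout, one per gene).
    posterior_means: per-gene posterior mean of the selection parameter.
    """
    if any(v < 0 for v in raw_summary):
        raise ValueError("square-root transform needs non-negative values")
    return pearson_r([math.sqrt(v) for v in raw_summary], posterior_means)
```

The square root is a common variance-stabilising choice for count-like quantities, which is presumably why it is used here; other transformations would change the reported coefficients.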
Our power to detect selection is dependent on the observed cell entries in the McDonald-Kreitman table. Since genes of longer length will, generally, have more mutations and, thus, more variation per gene, we were concerned that the effects we observe could be due to spurious correlation. This might occur, for example, if longer genes have more amino acid polymorphism
and are, thus, overrepresented in the set of negatively selected genes. In order to assess this issue, we plotted the distribution of the log-odds posterior of negative selection, $\log \dfrac{P(\gamma < 0 \mid \mathrm{Data})}{P(\gamma > 0 \mid \mathrm{Data})}$, as a function of the length of the aligned human-chimpanzee coding regions (see Figure 2S). There appears to be little or no correlation; therefore, differences in length among genes of different molecular functions and biological processes contribute little, if anything, to the discrepancy in the proportion of genes we classify as positively or negatively selected.

Posterior Distribution of Human-Chimpanzee Species Divergence Time

As part of our analysis, we also obtain a very precise estimate of the scaled human-chimpanzee species divergence, $\tau$. Based on 50,000 retained draws of our MCMC algorithm, we obtain a posterior mean of 9.57 in units of $2N_e$ generations (assuming the human, chimpanzee, and ancestral populations are of roughly equal size), with a 95% credibility interval of (9.37, 9.77). Using a human/chimp generation time of 25 years and a long-term effective population size of 10,000, this corresponds to 4.78 million years ago. Our 95% credibility interval, holding generation time and $2N_e$ fixed, is the narrow range of 4.685 mya to 4.885 mya. Given our uncertainty in the long-term effective population size of humans and chimpanzees, as well as variation in generation time, we have surely overestimated our confidence in the credibility interval of the divergence time. However, the likelihood function in our model is only dependent on the scaled time, which we have estimated with high precision.
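The conversion from scaled time to calendar time used above is a single multiplication, $t = \tau \times 2N_e \times g$, where $g$ is the generation time in years. The short sketch below reproduces that arithmetic; the function name and argument names are hypothetical, and the default values are simply the ones quoted in the text.

```python
def scaled_to_years(tau, n_e=10_000, generation_years=25):
    """Convert a divergence time in units of 2*N_e generations into years."""
    return tau * 2 * n_e * generation_years

if __name__ == "__main__":
    # Posterior mean and the two ends of the 95% credibility interval.
    for tau in (9.37, 9.57, 9.77):
        print(f"tau = {tau}: {scaled_to_years(tau) / 1e6:.3f} million years")
```

Because the result is linear in both $N_e$ and the generation time, any relative error in either assumed value carries over directly to the estimate in years, which is the source of the caveat above.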
Counting synonymous and non-synonymous sites

The numbers of synonymous and non-synonymous sites per gene were counted using the underlying nucleotide context-dependent mutation rates found by Hwang and Green (ref. 8), which assume that the mutation rate from one nucleotide to another is dependent on the site's two
flanking nucleotides. This method is able to account for many mutation biases, such as the hypermutability of CpG dinucleotides and transition/transversion biases, as well as many other subtle effects. Calculation of the total number of non-synonymous (or synonymous) sites in a gene is then performed by summing over the mutation rates at each site that would (or would not) result in an amino acid change, relative to the overall mutability of the site. Missing or ambiguous data in the human-chimp alignment, as well as changes to and from stop codons, were excluded.

References

1. Sawyer, S. A. & Hartl, D. L. Population genetics of polymorphism and divergence. Genetics 132, 1161-76 (1992).
2. McDonald, J. H. & Kreitman, M. Adaptive protein evolution at the Adh locus in Drosophila. Nature 351, 652-4 (1991).
3. Barrier, M., Bustamante, C. D., Yu, J. & Purugganan, M. D. Selection on rapidly evolving proteins in the Arabidopsis genome. Genetics 163, 723-33 (2003).
4. Gilad, Y., Bustamante, C. D., Lancet, D. & Paabo, S. Natural selection on the olfactory receptor gene family in humans and chimpanzees. Am J Hum Genet 73, 489-501 (2003).
5. Bustamante, C. D. et al. The cost of inbreeding in Arabidopsis. Nature 416, 531-4 (2002).
6. Hudson, R. R. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18, 337-8 (2002).
7. Clark, A. G. et al. Inferring nonneutral evolution from human-chimp-mouse orthologous gene trios. Science 302, 1960-3 (2003).
8. Hwang, D. G. & Green, P. Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution. Proc Natl Acad Sci U S A 101, 13994-4001 (2004).