Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information #


Details of PRF Methodology

In the Poisson Random Field (PRF) model, it is assumed that non-synonymous mutations at a given gene are either very deleterious or have selective effect $s$ such that the fitness of a heterozygote is $1 + s$ relative to the wildtype, and a homozygous individual has fitness $1 + 2s$ [1]. As in most population genetic models, we can only estimate the product of the selection coefficient and the population size ($\gamma = 2N_e s$). The strength of purifying selection against very deleterious mutations is captured in the non-lethal non-synonymous mutation rate parameter ($\theta_r = 2N_e \mu f_0$), where $N_e$ is the effective population size, $\mu$ is the mutation rate, and $f_0$ is the fraction of mutations that are non-lethal. Synonymous sites are assumed to evolve neutrally with mutation rate $\theta_s = 2N_e \mu$. We assume the mutation rate parameters among genes are independent of one another to capture the fact that some genes are subject to very strong purifying selection while others may experience relatively weak purifying selection. Likewise, we make no assumption about how the neutral mutation rate varies from gene to gene.

Using the results of Sawyer and Hartl (1992) [1], we model the cell entries in the McDonald-Kreitman tables [2] for a given gene as Poisson random variables with the following expected values:

$$
\begin{array}{lcc}
 & \text{Fixed} & \text{Segregating} \\
\text{Silent} & \theta_s \left( \tau + \dfrac{1}{m} + \dfrac{1}{n} \right) & \theta_s \displaystyle\sum_{i=1}^{n-1} \dfrac{1}{i} \\
\text{Replacement} & \theta_r \dfrac{2\gamma}{1 - e^{-2\gamma}} \left( \tau + G(\gamma,m) + G(\gamma,n) \right) & \theta_r \dfrac{2\gamma}{1 - e^{-2\gamma}} F(\gamma,n)
\end{array}
\qquad (1.1)
$$

where $F(\gamma,n)$ and $G(\gamma,n)$ are integrals over the distribution of mutation frequencies that depend on the selection coefficient for the gene and the number of sequences sampled from species 1 ($n$) and species 2 ($m$) (see Sawyer and Hartl, 1992):

$$
F(\gamma,n) = \int_0^1 \frac{1 - x^n - (1-x)^n}{x(1-x)} \, \frac{1 - e^{-2\gamma(1-x)}}{1 - e^{-2\gamma}} \, dx,
\qquad
G(\gamma,n) = \int_0^1 x^{n-1} (1-x)^{-1} \, \frac{1 - e^{-2\gamma(1-x)}}{1 - e^{-2\gamma}} \, dx
\qquad (1.2)
$$

Equation (1.1) assumes that polymorphism data are modelled for only one species, as is the case for our study. The parameter $\tau$ is the number of generations since species divergence divided by twice the effective population size. We treat this quantity as constant among genes and estimate it using all of the data. Since the above equations are strictly true only under the assumption of independence among sites and a population of constant size, we use simulations to gauge the accuracy and robustness of our results to model misspecification.

We write down the joint probability of the MK data given the selection coefficient and mutation rate parameters for each of the $G$ genes in our sample and the time since species divergence as the product of the individual entries in the MK tables for all genes:

$$
\Pr\{P, D \mid \gamma, \theta, \tau\} = \prod_{i=1}^{G} \prod_{c \in \{N,S\}} \Pr\{D_{c,i} \mid \gamma_{c,i}, \tau, \theta_{c,i}\} \, \Pr\{P_{c,i} \mid \gamma_{c,i}, \theta_{c,i}\}
\qquad (1.3)
$$

where $P_{c,i}$ is the number of SNPs and $D_{c,i}$ is the number of fixed differences of type $c$ (either synonymous or non-synonymous) in gene $i$. We treat all synonymous sites as neutral ($\gamma_{S,i} = 0$ for all $i$) and obtain the probabilities on the right-hand side of (1.3) directly from the Poisson distribution with mean given by the corresponding entry in (1.1). To obtain the posterior distribution of $\gamma_i$ conditional on the species divergence time, we use a Normal prior with mean 0 and standard deviation $\sigma = 8$ such that

$$
\Pr\{\gamma_i \mid P_{N,i}, D_{N,i}, \tau\} \propto
\left[ \frac{\frac{2\gamma_i}{1 - e^{-2\gamma_i}} F(\gamma_i,n)}{\beta_i + \frac{2\gamma_i}{1 - e^{-2\gamma_i}} \left( F(\gamma_i,n) + \tau + G(\gamma_i,m) + G(\gamma_i,n) \right)} \right]^{P_{N,i}}
\left[ \frac{\frac{2\gamma_i}{1 - e^{-2\gamma_i}} \left( \tau + G(\gamma_i,m) + G(\gamma_i,n) \right)}{\beta_i + \frac{2\gamma_i}{1 - e^{-2\gamma_i}} \left( F(\gamma_i,n) + \tau + G(\gamma_i,m) + G(\gamma_i,n) \right)} \right]^{D_{N,i}}
\left[ \frac{1}{\beta_i + \frac{2\gamma_i}{1 - e^{-2\gamma_i}} \left( F(\gamma_i,n) + \tau + G(\gamma_i,m) + G(\gamma_i,n) \right)} \right]^{\alpha_i}
e^{-\gamma_i^2 / (2\sigma^2)}
\qquad (1.4)
$$

where $\alpha_i$ and $\beta_i$ are parameters of a Gamma prior distribution on the mutation rate for the locus (in practice we set these to 0.01 for all genes, which makes them uninformative). The first two terms on the right-hand side of expression (1.4) represent the conditional probability given $\gamma$ of observing $P_{N,i}$ non-synonymous polymorphisms and $D_{N,i}$ non-synonymous fixed differences, respectively; the third term comes from the prior distribution on the mutation rate, a parameter which has been integrated out of the posterior distribution; and the fourth term is the prior distribution of $\gamma$.

In order to classify individual loci as positively or negatively selected, we focus on quantifying the posterior probability for a given gene that its selection coefficient is greater (or less) than 0 given the observed data for the gene ($P_i^{+} = \Pr\{\gamma_i > 0 \mid P_{N,i}, D_{N,i}\}$). If $P_i^{+}$ is greater than 97.5%, this is mathematically equivalent to saying that the 95% highest posterior density credibility interval (Bayesian confidence interval) for the selection coefficient is above 0, and we classify such genes as positively selected. Likewise, if $P_i^{-}$ is greater than 97.5% for a given locus, this is equivalent to the corresponding 95% CI being completely below 0, and we classify these genes as negatively selected. We estimate this quantity using the usual Monte Carlo estimator:

$$
P_i^{+} = \Pr\{\gamma_i > 0 \mid P_{N,i}, D_{N,i}\} \approx \frac{1}{M} \sum_{m=1}^{M} I\left( \gamma_i^{(m)} > 0 \right)
\qquad (1.5)
$$

where $I(\cdot)$ is the indicator function, which takes the value 1 if the argument is true and 0 otherwise, and $\gamma_i^{(m)}$ is the value of $\gamma_i$ at step $m$ in a Markov chain Monte Carlo (MCMC) algorithm. All posterior probabilities reported here are from 50,000 retained draws from 10 chains, each of length 50,000 steps, sampled using the MCMC algorithm and convergence criteria previously described [3-5], with the modification that the genomic distribution of selective effects is not updated. This simplification is made so that the marginal posterior distributions of the selection coefficient are conditionally independent of one another and can be pooled for further analysis in terms of molecular function and biological process.

Simulations

A potential concern is the robustness of our analysis to deviations from the assumptions of the Sawyer and Hartl Poisson Random Field model used to analyze the data. That is, could nonstandard demography produce genomic patterns of variation that we may misinterpret as signatures of selection? To address this issue, we have simulated data using standard coalescent algorithms as implemented in the computer program ms [6] under complete linkage within genes and three neutral demographic scenarios. For all simulations, we used 10,000 replicates with 79 chromosomes. This mimics the sampling structure of the Celera data: 38 African-American and 40 European-American chromosomes, with 1 chromosome representing the chimpanzee sequences used (chimp SNPs were excluded from the analysis). We assumed a mutation rate of $\theta = 2$, with half of the neutral mutations synonymous and half non-synonymous. This parameter was chosen since close to half of the SNPs in our data are non-synonymous and half are synonymous (see Figure 1A).
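The expected cell entries in equations (1.1) and (1.2) can be evaluated numerically for settings like those used in the simulations. The following is a minimal sketch, not the mkprf implementation: it assumes a plain midpoint rule for the integrals, and the function names (`F`, `G`, `expected_mk_table`) and the step count are chosen here for illustration. The example call reads $\theta = 2$ split evenly as $\theta_s = \theta_r = 1$, with $n = 78$ human chromosomes, $m = 1$ chimpanzee chromosome, and the divergence $\tau = 10$ used for all models.

```python
import math


def selection_weight(gamma, x):
    """(1 - exp(-2*gamma*(1-x))) / (1 - exp(-2*gamma)); tends to (1 - x) as gamma -> 0."""
    if abs(gamma) < 1e-8:
        return 1.0 - x
    return math.expm1(-2.0 * gamma * (1.0 - x)) / math.expm1(-2.0 * gamma)


def fixation_factor(gamma):
    """2*gamma / (1 - exp(-2*gamma)); tends to 1 as gamma -> 0."""
    if abs(gamma) < 1e-8:
        return 1.0
    return -2.0 * gamma / math.expm1(-2.0 * gamma)


def midpoint_integral(f, steps=20000):
    """Midpoint rule on (0, 1); it never evaluates f at the endpoints (assumed scheme)."""
    h = 1.0 / steps
    return h * sum(f((k + 0.5) * h) for k in range(steps))


def F(gamma, n):
    """F(gamma, n) as written in equation (1.2)."""
    return midpoint_integral(
        lambda x: (1.0 - x**n - (1.0 - x)**n) / (x * (1.0 - x)) * selection_weight(gamma, x)
    )


def G(gamma, n):
    """G(gamma, n) as written in equation (1.2)."""
    return midpoint_integral(
        lambda x: x**(n - 1) / (1.0 - x) * selection_weight(gamma, x)
    )


def expected_mk_table(theta_s, theta_r, gamma, tau, n, m):
    """Expected McDonald-Kreitman cell entries from equation (1.1)."""
    c = fixation_factor(gamma)
    return {
        "silent_fixed": theta_s * (tau + 1.0 / m + 1.0 / n),
        "silent_segregating": theta_s * sum(1.0 / i for i in range(1, n)),
        "replacement_fixed": theta_r * c * (tau + G(gamma, m) + G(gamma, n)),
        "replacement_segregating": theta_r * c * F(gamma, n),
    }


if __name__ == "__main__":
    # Neutral replacement sites (gamma = 0) versus a hypothetical gamma = 2.
    for gamma in (0.0, 2.0):
        print(gamma, expected_mk_table(1.0, 1.0, gamma, tau=10.0, n=78, m=1))
```

With $\gamma = 0$ the replacement row reduces to the silent row, which gives a quick check that the integrals are evaluated consistently with the neutral expectations.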
Our choice of mutation rate is twice the average estimate of the mutation rate and makes our results conservative, since the smaller the mutation rate, the better the Poisson approximation to the cell entries. For all models, we used a human-chimpanzee divergence of $\tau = 10$. The demographic models considered are:

a) Panmixia among humans (all 78 chromosomes from a randomly mating population) with constant size (i.e., the standard neutral model).

b) Population structure model A: 40 European-American chromosomes drawn from one population and 38 African-American chromosomes drawn from another, with a migration rate of $M = 4N_e m = 1$ per generation. The European-American population undergoes a population contraction (backwards in time) of 90% at time $0.1 \times 2N_e$ generations ago (40-50 thousand years ago), while the African-American population has a 50% reduction. The two human populations are then joined at time 0.25 in units of $2N_e$ generations (~100-125 thousand years ago). The human and chimpanzee populations are joined at time 10 (~5 million years ago).

c) Population structure model B: same as above except with twice the migration rate.

In Figure 1B we report the distribution of posterior probabilities that the selection coefficient for a gene is above 0 for each of the three models considered here, as well as for the Celera data. It is important to keep in mind that posterior probabilities are not the same as P-values, so there is no theoretical reason for them to follow a uniform distribution (as would be the case for P-values if the null hypothesis is true). The Celera data have a clear excess of genes with high and low posterior probabilities (i.e., too many in the <1%, 1-5%, and >99% categories) regardless of which demographic model is used as the null. The signature is particularly strong for negative selection (this may be partly due to power). Note that in this figure we have conditioned (as in the data) on using only loci with at least 4 variable amino acids in the alignment.

Model Diagnostics

In Figure 1S, we summarize the posterior mean of the selection parameter, $\bar{\gamma} = E(2Ns \mid \text{Data})$, for genes with at least two variable amino acid sites in the human-chimp alignments as a function of six aspects of the data (all correlations are based on the square-root transformation of the raw data). We see that $d_S$, the per-synonymous-site species substitution rate, is slightly correlated with the posterior mean of the selection coefficient ($r = 0.0651 \pm 0.035$; $P < 10^{-3}$). This may be explained by the fact that $\bar{\gamma}$ is strongly positively correlated with $d_N$, the non-synonymous species substitution rate ($r = 0.624 \pm 0.021$; $P < 10^{-16}$), and that these quantities are themselves correlated ($r = 0.099 \pm 0.034$; $P < 10^{-7}$). The former correlation between $\bar{\gamma}$ and $d_N$ is expected, since the rate of amino acid substitution should increase with the strength of selection $\gamma$. The latter correlation between $d_N$ and $d_S$ has been previously documented for samples of size $n = 1$ from each species across a variety of methods. We also observe a significant moderate negative correlation between $\bar{\gamma}$ and $p_S$ ($r = -0.139 \pm 0.034$; $P < 10^{-14}$) and a strong negative correlation between $\bar{\gamma}$ and $p_N$ ($r = -0.665 \pm 0.021$; $P < 10^{-16}$). The latter correlation is not expected, but can be explained by a consideration of power. That is, mkprf relies on the ratio of replacement divergence to replacement polymorphism to estimate $\gamma$: if a gene has low levels of amino acid polymorphism and high levels of amino acid divergence, then this is consistent with strong positive selection and a low mutation rate. This signal will be amplified if a gene has experienced very recent positive selection, since genetic hitchhiking will reduce amino acid polymorphism. Likewise, we observe a positive correlation between the $d_N / d_S$ ratio and the posterior mean of the selection coefficient ($r = 0.377 \pm 0.028$; $P < 10^{-16}$) and a negative correlation between $p_N / p_S$ and $\bar{\gamma}$ ($r = -0.282 \pm 0.032$; $P < 10^{-16}$). This illustrates one important aspect of our analysis that differs from previous work [7]: namely, that we can detect evidence for positive selection in the presence of selective constraint.
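The per-gene summaries used in these diagnostics (the posterior mean of $\gamma$ and the posterior probability that $\gamma > 0$) follow directly from retained MCMC draws via the Monte Carlo estimator in equation (1.5). The sketch below is illustrative only: the function name `summarize_gamma_draws`, the returned dictionary keys, and the synthetic draws in the example are assumptions, and no sampler is implemented here.

```python
import random


def summarize_gamma_draws(draws, threshold=0.975):
    """Summarize retained posterior draws of gamma for one gene.

    p_positive is the estimator in equation (1.5): the fraction of draws with
    gamma > 0. The 97.5% threshold follows the classification rule in the text.
    """
    draws = list(draws)
    if not draws:
        raise ValueError("at least one posterior draw is required")
    m = len(draws)
    p_positive = sum(1 for g in draws if g > 0) / m
    p_negative = sum(1 for g in draws if g < 0) / m
    if p_positive > threshold:
        label = "positively selected"
    elif p_negative > threshold:
        label = "negatively selected"
    else:
        label = "unclassified"
    return {
        "posterior_mean": sum(draws) / m,
        "p_positive": p_positive,
        "p_negative": p_negative,
        "label": label,
    }


if __name__ == "__main__":
    # Synthetic draws standing in for MCMC output (hypothetical values).
    rng = random.Random(1)
    fake_draws = [rng.gauss(3.0, 1.0) for _ in range(5000)]
    print(summarize_gamma_draws(fake_draws))
```

Because the estimator is a simple proportion, its Monte Carlo error shrinks with the number of retained draws, which is one reason to pool many draws per gene before applying the 97.5% threshold.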
Our power to detect selection is dependent on the observed cell entries in the McDonald-Kreitman table. Since longer genes will generally have more mutations and, thus, more variation per gene, we were concerned that the effects we observe could be due to spurious correlation. This might occur, for example, if longer genes have more amino acid polymorphism and are, thus, overrepresented in the set of negatively selected genes. In order to assess this issue, we plotted the distribution of the log-odds posterior of negative selection, $\log \dfrac{P(\gamma < 0 \mid \text{Data})}{P(\gamma > 0 \mid \text{Data})}$, as a function of the length of the aligned human-chimpanzee coding regions (see Figure 2S). There appears to be little or no correlation; therefore, differences in length among genes of different molecular functions and biological processes contribute little, if anything, to the discrepancy in the proportion of genes we classify as positively or negatively selected.

Posterior Distribution of Human-Chimpanzee Species Divergence Time

As part of our analysis, we also obtain a very precise estimate of the scaled human-chimpanzee species divergence, $\tau$. Based on 50,000 retained draws of our MCMC algorithm, we obtain a posterior mean of 9.57 in units of $2N_e$ generations (assuming the human, chimpanzee, and ancestral populations are of roughly equal size), with a 95% credibility interval of (9.37, 9.77). Using a human/chimp generation time of 25 years and a long-term effective population size of 10,000, this corresponds to 4.78 million years ago. Our 95% confidence interval, holding generation time and $2N_e$ fixed, is a narrow range of 4.685 mya to 4.885 mya. Given our uncertainty in the long-term effective population size of humans and chimpanzees, as well as variation in generation time, we have surely overestimated our confidence in the credibility interval of the divergence time. However, the likelihood function in our model is dependent only on the scaled time, which we have estimated with high precision.
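The conversion from scaled time to calendar years used above is a single multiplication, sketched below with the stated generation time (25 years) and effective population size (10,000). The function name `scaled_time_to_years` is an assumption made for illustration.

```python
def scaled_time_to_years(tau, n_e=10_000, generation_years=25):
    """Convert time in units of 2*N_e generations to years."""
    return tau * 2 * n_e * generation_years


if __name__ == "__main__":
    # Posterior mean and the two ends of the 95% credibility interval.
    for tau in (9.37, 9.57, 9.77):
        print(tau, scaled_time_to_years(tau) / 1e6, "million years")
```

The same conversion applies to the event times in the demographic models: 0.1, 0.25, and 10 scaled units correspond to 50 thousand, 125 thousand, and 5 million years under these assumptions.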
Counting synonymous and non-synonymous sites

The numbers of synonymous and non-synonymous sites per gene have been counted using the underlying nucleotide context-dependent mutation rates found by Hwang and Green [8], which assume that the mutation rate from one nucleotide to another depends on the site's two flanking nucleotides. This method is able to account for many mutation biases, such as the hypermutability of CpG dinucleotides and transition/transversion biases, as well as many other subtle effects. Calculation of the total number of non-synonymous (or synonymous) sites in a gene is then performed by summing over the mutation rates at each site that would (or would not) result in an amino acid change relative to the overall mutability of the site. Missing or ambiguous data in the human-chimp alignment, as well as changes to and from stop codons, were excluded.

References

1. Sawyer, S. A. & Hartl, D. L. Population genetics of polymorphism and divergence. Genetics 132, 1161-76 (1992).
2. McDonald, J. H. & Kreitman, M. Adaptive protein evolution at the Adh locus in Drosophila. Nature 351, 652-4 (1991).
3. Barrier, M., Bustamante, C. D., Yu, J. & Purugganan, M. D. Selection on rapidly evolving proteins in the Arabidopsis genome. Genetics 163, 723-33 (2003).
4. Gilad, Y., Bustamante, C. D., Lancet, D. & Paabo, S. Natural selection on the olfactory receptor gene family in humans and chimpanzees. Am J Hum Genet 73, 489-501 (2003).
5. Bustamante, C. D. et al. The cost of inbreeding in Arabidopsis. Nature 416, 531-4 (2002).
6. Hudson, R. R. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18, 337-8 (2002).
7. Clark, A. G. et al. Inferring nonneutral evolution from human-chimp-mouse orthologous gene trios. Science 302, 1960-3 (2003).
8. Hwang, D. G. & Green, P. Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution. Proc Natl Acad Sci U S A 101, 13994-4001 (2004).
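The site-counting procedure described under "Counting synonymous and non-synonymous sites" can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: `uniform_rate` is a placeholder that ignores the flanking nucleotides (the Hwang and Green rates are not reproduced here), the function names are hypothetical, and changes to stop codons are dropped from both the numerator and the site's overall mutability, which is one possible reading of the exclusion rule.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def uniform_rate(left, src, dst, right):
    """Placeholder rate (assumption): a context-dependent model would use left/right."""
    return 1.0


def count_sites(cds, rate=uniform_rate):
    """Return (non_synonymous_sites, synonymous_sites) for an in-frame coding sequence."""
    cds = cds.upper()
    n_sites = s_sites = 0.0
    for start in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[start:start + 3]
        if CODON_TABLE.get(codon, "*") == "*":
            continue  # skip ambiguous codons and stop codons
        for pos in range(3):
            site = start + pos
            left = cds[site - 1] if site > 0 else None
            right = cds[site + 1] if site + 1 < len(cds) else None
            nonsyn = syn = 0.0
            for dst in BASES:
                if dst == codon[pos]:
                    continue
                mutant = codon[:pos] + dst + codon[pos + 1:]
                if CODON_TABLE[mutant] == "*":
                    continue  # changes to stop codons are excluded
                r = rate(left, codon[pos], dst, right)
                if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                    syn += r
                else:
                    nonsyn += r
            total = nonsyn + syn  # overall mutability of the site (stops excluded)
            if total > 0:
                n_sites += nonsyn / total
                s_sites += syn / total
    return n_sites, s_sites


if __name__ == "__main__":
    print(count_sites("ATGGGTTGG"))
```

A context-dependent model can be swapped in by passing a different `rate` callable with the same four arguments; the per-site fractions then weight each possible change by its rate rather than counting changes equally.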