POPULATION GENETICS Biology 107/207L Winter 2005 Lab 5. Testing for positive Darwinian selection

A growing number of statistical approaches have been developed to detect natural selection at the DNA sequence level (reviews in Kreitman and Akashi 1995; Hughes 1999; Yang and Bielawski 2000; Nielsen 2001). One of the most powerful is the test for positive Darwinian selection, in which a protein's rate of nonsynonymous substitution (dN) is compared with its rate of synonymous substitution (dS). Provided that codon bias has not acted to constrain dS, a dN/dS ratio (also called the ω ratio) exceeding unity is strong evidence for the continued operation of natural selection favoring amino acid replacement mutations. Recently, maximum-likelihood models of codon substitution have been developed that allow ω ratios to vary among sites (Nielsen and Yang 1998; Yang et al. 2000), thus enabling the identification of positive selection at individual amino acid sites in a protein-coding gene. These methods appear to offer a number of advantages over earlier pair-wise comparisons of dN and dS among taxa, which average ω ratios over all sites and lineages. However, tests for positive selection based on ω ratios > 1 are extremely stringent and will likely fail to identify adaptive evolution when selection is weak and/or episodic, or when power is reduced by limited taxon sampling (see Anisimova et al. 2001).

In a wide range of species, elevated dN/dS ratios have commonly been reported at two broad classes of genes: those involved in host-pathogen interactions (e.g., Hughes and Nei 1988; Smith et al. 1995; Ford 2001) and those functioning in reproduction (e.g., Lee et al. 1995; Metz and Palumbi 1996; Swanson et al. 2001; see recent review by Ford 2002). Despite this emerging generality, the diversity of genes that might experience positive selection in the genome is unclear. Positive selection has been described at proteins as diverse as digestive enzymes, cytochromes, toxins, cytokines, hormones, and antifreeze proteins. The growing list of positively selected genes suggests that diversifying selection may be more common than previously estimated (e.g., Endo et al. 1996), although details of the selective process remain unknown in many cases.

In this lab, we will use maximum-likelihood approaches to test for positive selection at two nuclear genes (pantophysin and S2) and one mitochondrial gene (cytochrome b) among various species of marine fishes belonging to the family Gadidae. The main gene of interest is pantophysin (209 amino acids), an integral membrane protein found in small (< 100 nm) cytoplasmic microvesicles that function in a variety of intracellular shuttling pathways (see Haass et al. 1996; Windoffer et al. 1999). The precise role played by pantophysin in these trafficking pathways is still unknown. However, there is a strong signal of positive selection at this locus in the Atlantic cod, Gadus morhua: two common alleles are segregating in populations throughout the north Atlantic region that differ by six amino acid substitutions (and no silent changes), all clustered in one small domain of the protein (Pogson 2001). It is unclear whether the elevated rates of replacement changes observed at the PanI locus of G. morhua are due to the unusual polymorphism detected in this species or whether similar selection pressures are acting in other related species.

The second nuclear gene we will study encodes the S2 ribosomal protein (183 amino acids). This gene is highly conserved in most vertebrates and is therefore expected to serve as a control (i.e., not to exhibit positive selection). We will also include a region of the mitochondrial cytochrome b (cyt b) gene (299 amino acids) that, like S2, is not expected to experience positive selection.

We will perform tests for positive selection by running two models implemented by the codeml program of PAML (Phylogenetic Analysis by Maximum Likelihood). The null model we will use is called M7 (beta), which assumes that dN/dS ratios at different amino acid positions in a protein follow a beta distribution. Because the beta distribution is constrained to fall between zero and one, model M7 prevents any amino acid site from experiencing positive selection (since this requires that the dN/dS ratio at a position exceed unity). The likelihood score of model M7 (ℓ_M7) is compared with the score (ℓ_M8) obtained from model M8 (beta&ω>1), which allows an additional group of sites, whose proportion is estimated from the data, to have dN/dS ratios that exceed unity. The two likelihood scores are compared using a standard likelihood ratio test: χ² = 2(ℓ_M8 - ℓ_M7), with 2 d.f. because M8 has two more free parameters than M7. The codeml program will also perform a Bayes empirical Bayes (BEB) calculation of the posterior probabilities of sites identified as having dN/dS ratios greater than 1. A posterior probability above 0.95 for a site provides strong support for the action of positive selection.
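The likelihood ratio test described above can be sketched in a few lines of Python. This is only an illustration, not part of PAML: the function name lrt_m7_m8 is made up for this handout, and the sketch relies on the fact that the upper tail of a χ² distribution with 2 d.f. has the closed form exp(-x/2), so no statistics library is needed.

```python
import math


def lrt_m7_m8(lnl_m7, lnl_m8):
    """Likelihood ratio test of M8 (beta&w>1) against M7 (beta).

    lnl_m7, lnl_m8: log-likelihood scores reported by codeml for each model.
    Returns (test statistic, p-value), assuming the statistic follows a
    chi-square distribution with 2 d.f. under the null model.
    """
    statistic = 2.0 * (lnl_m8 - lnl_m7)
    # M7 is nested in M8, so the statistic should not be negative; a tiny
    # negative value can appear if the optimizer stopped early, so clamp it.
    statistic = max(statistic, 0.0)
    # For exactly 2 d.f. the chi-square survival function is exp(-x / 2).
    p_value = math.exp(-statistic / 2.0)
    return statistic, p_value
```

For example, lrt_m7_m8(-2464.48, -2455.00), with both scores being made-up placeholders, gives a statistic of about 18.96. At the 5% level the critical value for 2 d.f. is about 5.99, so a difference of that size would reject M7 in favor of M8.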

Download the PAML program from the following web site: http://abacus.gene.ucl.ac.uk/software/paml.html. Directions on downloading the program are given on this home page. There are Windows, Unix, Linux, and Mac OSX versions of PAML available. Download the appropriate archive and unzip the program into a desired folder. A total of 132 files should be extracted.

The data files

The codeml program requires a control file (codeml.ctl), a data file containing the aligned sequences (e.g., pan.nuc), and a tree file (e.g., pan.trees) in the same folder as the executable file in order to run. After unzipping the PAML files, delete the default codeml.ctl file. Copy the following control files for each gene from the Bio 107/207 class web site into the PAML folder: pancodeml.ctl, S2codeml.ctl, and cytbcodeml.ctl. Then copy the following data files into the PAML folder: pan.nuc, S2.nuc, and cytb.nuc. Also copy the following tree files into the same folder: pan.trees, S2.trees, and cytb.trees. To test for positive selection at the pantophysin gene, you must rename the pancodeml.ctl file as codeml.ctl. After the run has completed, the results are written to the output file pan.out. Rename codeml.ctl back to pancodeml.ctl. Repeat for the S2 and cyt b genes.

Running PAML

Open a Command Prompt window from the path Start > Programs > Accessories > Command Prompt. Change the directory to where the PAML program has been installed. For example, if the program is in C:\Program Files\PAML, then type cd \Program Files\PAML. Run the codeml program from the command prompt by typing codeml. The codeml program will read the data and then begin iterating through multiple rounds of parameter estimation. Endless hours of fun watching the gibberish on your screen! The time it will take to perform these runs will depend on the speed of your computer, hopefully not more than 3-4 hours per run.

Looking at the output files

Output files for the three control files are called pan.out, S2.out, and cytb.out. For the file pan.out, the results we are interested in are presented at the bottom of page 14 under Model 7: beta (10 categories). This is the output from our null model. The line beginning lnL(ntime: 35 np: 38) gives the likelihood score (with a value, hopefully, close to -2464.479416). Below this are estimates of branch lengths, the transition/transversion ratio (kappa), and parameters for the beta distribution (p and q). Then appear the estimates of dN and dS for all the branches in the tree. Note that the dN/dS ratios are constrained not to exceed 1.

The results for model M8 (beta&ω>1) appear below those for M7. Please note the likelihood score for M8; you will need this for the likelihood ratio test. Parameters for the model appear following the branch length estimates. After the line Parameters in beta&ω>1: are listed p0 (the proportion of selectively constrained sites), p1 (the proportion of sites experiencing positive selection), parameters for the beta distribution (p and q), and a mean omega ratio (ω) for the positively selected sites. A mean omega ratio significantly above 1.0 is strong evidence for positive selection. The dN and dS values listed for each branch are now maximum-likelihood estimates of the true substitution rates. Below this table is a listing of positively selected sites identified by the model and their posterior probabilities.

There is a second output file made for each run called rst. (Rename the rst file if you wish to save it after a run; otherwise it will be overwritten.) This is an extremely detailed file containing reconstructions of ancestral states and substitution patterns. It also lists the actual amino acid substitutions that have occurred along each branch of the phylogeny.
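The per-gene cycle described under "The data files" and the lookup of likelihood scores in the output can be sketched in Python as follows. This is a rough illustration under stated assumptions, not a tool shipped with PAML: the function names run_gene and extract_lnl are made up for this handout, the sketch copies each control file to codeml.ctl instead of renaming it (so the original is left untouched if a run is interrupted), it assumes the codeml executable sits in the PAML folder, and it assumes the output marks each model with a line starting "Model N:" followed later by a line starting "lnL(".

```python
import re
import shutil
import subprocess
from pathlib import Path

# Assumed layout of the codeml output: a "Model 7: ..." header line, then
# later a line such as "lnL(ntime: 35  np: 38):  <score>  <change>".
MODEL_RE = re.compile(r"^\s*Model\s+(\d+):")
LNL_RE = re.compile(r"^\s*lnL\s*\(.*?\):\s*(-?\d+(?:\.\d+)?)", re.IGNORECASE)


def run_gene(folder, gene, command=None):
    """Stage <gene>codeml.ctl as codeml.ctl, run codeml, remove the copy.

    folder:  the PAML folder holding the executable, data and tree files.
    gene:    control-file prefix, e.g. "pan", "S2" or "cytb".
    command: argument list to run; defaults to the codeml executable
             inside `folder` (an assumption about where it was unzipped).
    """
    folder = Path(folder).resolve()
    staged = folder / "codeml.ctl"
    shutil.copyfile(folder / f"{gene}codeml.ctl", staged)
    if command is None:
        command = [str(folder / "codeml")]
    try:
        # codeml reads codeml.ctl from its working directory.
        subprocess.run(list(command), cwd=str(folder), check=True)
    finally:
        # Always clear the staged copy so the next gene starts clean.
        staged.unlink()


def extract_lnl(text):
    """Return {model number: log-likelihood} found in codeml output text."""
    scores = {}
    model = None
    for line in text.splitlines():
        header = MODEL_RE.match(line)
        if header:
            model = int(header.group(1))
            continue
        score = LNL_RE.match(line)
        if score and model is not None:
            scores[model] = float(score.group(1))
    return scores
```

A run for pantophysin would then be run_gene(folder, "pan") followed by extract_lnl applied to the text of pan.out, which should return one score for model 7 and one for model 8. The same two calls can be repeated with "S2" and "cytb".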
The substitutions listed in the rst file can provide further insights into the nature of the observed amino acid changes (for example, between polar and nonpolar residues, or between charged and noncharged residues).

Assignment

1. Run models M7 and M8 on the pantophysin, S2, and cytochrome b data sets as described above. Present likelihood scores, parameter estimates, and listings of positively selected sites (if present) for each locus. Perform likelihood ratio tests for the action of positive selection at each gene. Briefly discuss the similarities and/or differences observed between the patterns of nucleotide substitutions at each gene.

2. If positive selection has been detected at a locus, determine how many branches on the tree exhibit greater numbers of nonsynonymous than synonymous changes. What does this tell us about the history of positive selection at the gene? Also examine the locations of positively selected sites. Do they appear to be clustered or random? What could cause clustering of sites experiencing positive selection?

References

Anisimova, M., J.P. Bielawski, and Z. Yang. 2001. Accuracy and power of the likelihood ratio test in detecting adaptive molecular evolution. Mol. Biol. Evol. 18: 1585-1592.
Endo, T., K. Ikeo, and T. Gojobori. 1996. Large-scale search for genes on which positive selection may operate. Mol. Biol. Evol. 13: 685-690.
Ford, M.J. 2001. Molecular evolution of transferrin: evidence for positive selection in salmonids. Mol. Biol. Evol. 18: 639-647.
Ford, M.J. 2002. Applications of selective neutrality tests to molecular ecology. Mol. Ecol. 11: 1245-1262.
Haass, N.K., J. Kartenbeck, and R.E. Leube. 1996. Pantophysin is a ubiquitously expressed synaptophysin homologue and defines constitutive transport vesicles. J. Cell Biol. 134: 731-746.
Hughes, A.L. 1999. Adaptive evolution of genes and genomes. Oxford University Press, New York.
Hughes, A.L., and M. Nei. 1988. Pattern of nucleotide substitution at major histocompatibility complex class I loci reveals overdominant selection. Nature 335: 167-170.
Kreitman, M., and H. Akashi. 1995. Molecular evidence for natural selection. Annu. Rev. Ecol. Syst. 26: 403-422.
Lee, Y.-H., T. Ota, and V.D. Vacquier. 1995. Positive selection is a general phenomenon in the evolution of abalone sperm lysin. Mol. Biol. Evol. 12: 231-238.
Metz, E.C., and S.R. Palumbi. 1996. Positive selection and sequence rearrangements generate extensive polymorphism in the gamete recognition protein bindin. Mol. Biol. Evol. 13: 397-406.
Nielsen, R. 2001. Statistical tests of neutrality in the age of genomics. Heredity 86: 641-647.
Nielsen, R., and Z. Yang. 1998. Likelihood methods for detecting positively selected sites and applications to the HIV-1 envelope gene. Genetics 148: 929-936.
Pogson, G.H. 2001. Nucleotide polymorphism and natural selection at the pantophysin (PanI) locus in the Atlantic cod, Gadus morhua (L.). Genetics 157: 317-330.
Smith, N.H., J. Maynard Smith, and B.G. Spratt. 1995. Sequence evolution of the porB gene of Neisseria gonorrhoeae and Neisseria meningitidis: evidence of positive Darwinian selection. Mol. Biol. Evol. 12: 363-370.
Swanson, W.J., and C.F. Aquadro. 2002. Positive Darwinian selection promotes heterogeneity among members of the antifreeze protein multigene family. J. Mol. Evol. 54: 403-410.
Windoffer, R., M. Borchert-Stuhlträger, N.K. Haass, S. Thomas, M. Hergt, C.J. Bulitta, and R.E. Leube. 1999. Tissue expression of the vesicle protein pantophysin. Cell Tissue Res. 296: 499-510.
Yang, Z., and J.P. Bielawski. 2000. Statistical methods for detecting molecular adaptation. Trends Ecol. Evol. 15: 496-503.
Yang, Z., R. Nielsen, N. Goldman, and A.-M. K. Pedersen. 2000. Codon-substitution models for heterogeneous selection pressures at amino acid sites. Genetics 155: 431-449.