Understanding relationship between homologous sequences

Similar documents
7. Tests for selection

Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution

Bio 1B Lecture Outline (please print and bring along) Fall, 2007

Natural selection on the molecular level

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Drosophila melanogaster and D. simulans, two fruit fly species that are nearly

Nucleotide substitution models

Lecture Notes: BIOL2007 Molecular Evolution

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

SEQUENCE DIVERGENCE,FUNCTIONAL CONSTRAINT, AND SELECTION IN PROTEIN EVOLUTION

Q1) Explain how background selection and genetic hitchhiking could explain the positive correlation between genetic diversity and recombination rate.

METHODS FOR DETERMINING PHYLOGENY. In Chapter 11, we discovered that classifying organisms into groups was, and still is, a difficult task.

Molecular Evolution & the Origin of Variation

Molecular Evolution & the Origin of Variation

The neutral theory of molecular evolution

Dr. Amira A. AL-Hosary

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Processes of Evolution

Processes of Evolution

Febuary 1 st, 2010 Bioe 109 Winter 2010 Lecture 11 Molecular evolution. Classical vs. balanced views of genome structure

CHAPTERS 24-25: Evidence for Evolution and Phylogeny

EVOLUTIONARY DISTANCES

How Molecules Evolve. Advantages of Molecular Data for Tree Building. Advantages of Molecular Data for Tree Building

Quantifying sequence similarity

8/23/2014. Phylogeny and the Tree of Life

7.36/7.91 recitation CB Lecture #4

UoN, CAS, DBSC BIOL102 lecture notes by: Dr. Mustafa A. Mansi. The Phylogenetic Systematics (Phylogeny and Systematics)

C3020 Molecular Evolution. Exercises #3: Phylogenetics

Chapter 26: Phylogeny and the Tree of Life Phylogenies Show Evolutionary Relationships

Group activities: Making animal model of human behaviors e.g. Wine preference model in mice

Genomes and Their Evolution

Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information #

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Cladistics and Bioinformatics Questions 2013

FUNDAMENTALS OF MOLECULAR EVOLUTION

Maximum Likelihood Until recently the newest method. Popularized by Joseph Felsenstein, Seattle, Washington.

Reading for Lecture 13 Release v10

Lecture 7 Mutation and genetic variation

Phylogenetic inference

D. Incorrect! That is what a phylogenetic tree intends to depict.

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

BIOINFORMATICS LAB AP BIOLOGY

Molecular phylogeny - Using molecular sequences to infer evolutionary relationships. Tore Samuelsson Feb 2016

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME

EVOLUTIONARY DISTANCE MODEL BASED ON DIFFERENTIAL EQUATION AND MARKOV PROCESS

Molecular evolution - Part 1. Pawan Dhar BII

- mutations can occur at different levels from single nucleotide positions in DNA to entire genomes.

Evolutionary Analysis of Viral Genomes

Lecture 11 Friday, October 21, 2011

Evolutionary Change in Nucleotide Sequences. Lecture 3

Probabilistic modeling and molecular phylogeny

Phylogeny and systematics. Why are these disciplines important in evolutionary biology and how are they related to each other?

Chapter 7: Covalent Structure of Proteins. Voet & Voet: Pages ,

GENETICS - CLUTCH CH.22 EVOLUTIONARY GENETICS.

Molecular evolution 2. Please sit in row K or forward

Chapter 16: Reconstructing and Using Phylogenies

Phylogenetic Trees. How do the changes in gene sequences allow us to reconstruct the evolutionary relationships between related species?

Fitness landscapes and seascapes

Warm-Up- Review Natural Selection and Reproduction for quiz today!!!! Notes on Evidence of Evolution Work on Vocabulary and Lab

Concepts and Methods in Molecular Divergence Time Estimation

Multiple Sequence Alignment. Sequences

Investigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST

Mutation, Selection, Gene Flow, Genetic Drift, and Nonrandom Mating Results in Evolution

Lecture 20 DNA Repair and Genetic Recombination (Chapter 16 and Chapter 15 Genes X)

Solutions to Even-Numbered Exercises to accompany An Introduction to Population Genetics: Theory and Applications Rasmus Nielsen Montgomery Slatkin

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26

Outline. Genome Evolution. Genome. Genome Architecture. Constraints on Genome Evolution. New Evolutionary Synthesis 11/1/18

Molecular Evolution and Comparative Genomics

6 Introduction to Population Genetics

C.DARWIN ( )

Organizing Life s Diversity

Selection against A 2 (upper row, s = 0.004) and for A 2 (lower row, s = ) N = 25 N = 250 N = 2500

Phylogeny 9/8/2014. Evolutionary Relationships. Data Supporting Phylogeny. Chapter 26

Phylogenetic Tree Reconstruction

Outline. Genome Evolution. Genome. Genome Architecture. Constraints on Genome Evolution. New Evolutionary Synthesis 11/8/16

Molecular Population Genetics

Gene expression differences in human and chimpanzee cerebral cortex

Sequence Divergence & The Molecular Clock. Sequence Divergence

BINF6201/8201. Molecular phylogenetic methods

Genetical theory of natural selection

6 Introduction to Population Genetics

Gene Families part 2. Review: Gene Families /727 Lecture 8. Protein family. (Multi)gene family

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetics: Building Phylogenetic Trees

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Full file at CHAPTER 2 Genetics

Likelihood Ratio Tests for Detecting Positive Selection and Application to Primate Lysozyme Evolution

AP Biology TEST #5 Chapters REVIEW SHEET

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

Theory a well supported testable explanation of phenomenon occurring in the natural world.

Evidence for Evolution

Major questions of evolutionary genetics. Experimental tools of evolutionary genetics. Theoretical population genetics.

Neutral behavior of shared polymorphism

Using phylogenetics to estimate species divergence times... Basics and basic issues for Bayesian inference of divergence times (plus some digression)

Molecular Evolution, course # Final Exam, May 3, 2006

Evolutionary Models. Evolutionary Models

Molecular Evolution & Phylogenetics Traits, phylogenies, evolutionary models and divergence time between sequences

Phylogeny and the Tree of Life

Frequently Asked Questions (FAQs)

Outline. Evolution: Speciation and More Evidence. Key Concepts: Evolution is a FACT. 1. Key concepts 2. Speciation 3. More evidence 4.

Transcription:

Molecular Evolution

Molecular Evolution How and when were genes and proteins created? How old is a gene? How can we calculate the age of a gene? How did the gene evolve to the present form? What selective forces (if any) influence the evolution of a gene sequence and expression? Are these changes in sequence adaptive or neutral? How do species evolve? How can evolution of a gene tell us about the evolutionary relationship of species?

Understanding relationship between homologous sequences Complete evolutionary history is depicted as phylogenetic tree Tree topology correctly identify the common ancestors of homologs sequences Tree distance identify the correct relationship in time Example: MSA Topology which sequences should be aligned first Distance how to weight the sequences when computing alignment score

Measuring evolutionary distance How long ago (relatively) did two homologous sequences diverge from a common ancestor Simple method: % divergence 100 - % identity Count up the number of places two sequences differ Used by blast: % identity, % similarity Easy to compute But the major problem is that it underestimates divergence after only a moderate amount of change. % divergence saturated with time PAM250 represent 80% divergence

Types of nucleotide substitutions

Nucleotide Substitutions Models To make functional inferences we typically don t model insertions and deletions, large or small, because of the difficulty in assigning homology. We model nucleotide substitutions under the assumptions: They occur as single, independent events They don t affect substitutions at other sites Time scale is much larger than time to fixation in a population Effective models incorporate the probability of substitutions, and hence account for different kinds of substitutions (multiple, parallel, convergent, reversion) Model sequence evolution as a markov process A- >A- >T- >C- >A

Number of substitutions Central question: given two aligned sequences what is the number of substitutions that actually occurred Assuming constant substitution rate λ the expected number of substitutions per size is 2λt t is unknown Known observed divergence

Jukes- Cantor Model (JC69) 1969 Evolution is described by a single parameter, alpha (α), the rate of substitution. Assumptions: Substitutions among 4 nucleotide types occur with equal probability (rate matrix below) Nucleotides have equal frequency at equilibrium A T C G A 1-3α α α α T α 1-3α α α C α α 1-3α α G α α α 1-3α

Jukes- Cantor Model What is probability of having nucleotide A (P A(t) ) at time t if we start with A? Derive expression P A using discrete time periods in which the rate is represented as α. P A(0) = 1, P C(0) = 0, P A(1) = 1 3α, P C(1) = α, P A(2) = (1 3α)P A(1) + α(1 P A(1) ) at time 1 there was an A, and that A had not changed at time 1 there was not an A and it changed to an A 9

Jukes- Cantor Model Recurrence equation: P A(2) = (1 3α)P A(1) + α(1 P A(1) ) P A(t+1) = (1 3α)P A(t) + α[1 P A(t) ] = (1 4α)P A(t) + α What is the change in P A over time (ΔP A )? ΔP A(t) = P A(t+1) P A(t) Subtract P A(t) from both sides we get ΔP A(t) = - 4αP A(t) + α ΔP A(t) is proportional to P A(t) 10

Jukes- Cantor Model What is the change in P A over continuous time? dp A(t) /dt = - 4αP A(t) + α Solve the differential equation P A(t) = ¼ + ( P A(0) ¼ )e - 4αt P A(t) = ¼ + ¾ e - 4αt starting at A at t=0 P A(t) = ¼ - ¼ e - 4αt starting at T,C, or G at t=0 We can generalize to these equations because all nucleotides are equivalent: P ii(t) = ¼ + ¾ e - 4αt P ij(t) = ¼ - ¼ e - 4αt 11

Jukes- Cantor Model Which equilibrium frequency of i is reached at large t? (e.g. i = A) P AA(t) = ¼ + ¾ e - 4αt P ja(t) = ¼ - ¼ e - 4αt We can think of P i as the frequency of i in a long sequence. P A(t) 0.0 0.2 0.4 0.6 0.8 1.0 0 20 40 60 80 100 time 12

Applying the Jukes- Cantor Model to estimate distance Given two sequences evolving independently for a time t what is the probability they both have an A P I(t) = P 2 AA(t) + P 2 AC(t) + P 2 AT(t) + P 2 AG(t) both remain A, both change to C, T, or G P I(t) = P 2 ii(t) + 3(P 2 ij(t)) P I(t) = (¼ + ¾ e - 4αt ) 2 +3(¼ - ¼ e - 4αt ) 2 =P I(t) = ¼ + ¾ e - 8αt 0.0 0.2 0.4 0.6 0.8 1.0 1/16 0 20 40 60 80 100 time

Estimated number of substitutions Our goal original goal was to estimate the total number of substitutions since divergence from a common ancestor E[sub]=2λt =6αt; λ =3α Estimate αt from P I(t) = ¼ + ¾ e - 8αt and P D =1- P I(t) αt = - 1/8*ln(1 (4/3)P D ) E[sub]=¾ ln(1- (4/3)P D )- - Also known as K JC

Example K JC = function(p) { - 0.75 * log(1-4*p/3) } # divergent bases = total bases identical bases 34 25 = 9 9/34 = 0.264705 ß uncorrected % divergence K JC (9/34) = 0.326488 distance K JC ß corrected K-JC 0.0 1.0 2.0 3.0 0.0 0.2 0.4 0.6 0.8 1.0 p (%divergence)

Rate variation between bases In reality, bases are not equivalent, and rates of change between them are not equal. Transitions usually outnumber transversions. cytosine thymine 16

Kimura s 2- parameter model (K2P) Models transition and transversion rates separately Two parameters α for transition rate and β for transversionrate. Assumption: Nucleotides have equal frequency at equilibrium A T C G A 1- α- 2β β β α T β 1- α- 2β α β C β α 1- α- 2β β ƒ = [0.25,0.25,0.25,0.25] G α β β 1- α- 2β

Kimura model Begin with A: P AA(0) = 1 What is P AA(1)? P AA(1) = 1 α 2β P AA(2) = (1 α 2β)P AA(1) + βp AT(1) + βp AC(1) + αp AG(1) Recursion: P AA(t+1) = (1 α 2β)P AA(t) + βp AT(t) + βp AC(t) + αp AG(t)

Calculating divergence P and Q are proportions of divergence due to transitions (P) and transversions (Q) between the 2 sequences K K2P = - ½ ln(1 2P Q) ¼ ln(1 2Q) k2p = function(p,q) { - 0.5*log(1-2*p- q) - 0.25*log(1-2*q) } 0.6 K2P Distance Two 100bp sequences have 20 transitions and 4 transversions between them. k2p(0.2,0.04) = 0.3107 changes per base pair 31 changes over the entire sequence Q 0.5 0.4 0.3 3 2 The sequences came from a snapdragon (Antirrhinum) and a monkey flower (Mimulus) whose lineages diverged 76 Mya, yielding a divergence rate of 0.2 0.1 1 0.408 substitutions per million years 0.0 0 0.0 0.1 0.2 0.3 0.4 0.5 0.6 P

Kimura K2P model Estimate the ts/tv ratio ts/tv = α/β = 2* ln(1 2p q) / ln(1 2q) 1 Mammals Nuclear DNA ts/tv 2 Mitochondrial ts/tv 15

Nucleotide substitution models

Different substitution models

Models have assumptions All nucleotide sites have same rate The substitution matrix does not change Sites change independently. There is no co- evolution or multiple mutation This is all true: Sequence changes only due to replication error Errors are randomly propagated neither advantageous nor deleterious

GC content varies over evolutionary time GC content is heterogeneous across the genome at a scale of hundreds of nucleotides GC biased gene conversion Gene conversion- is the process by which one DNA sequence replaces a homologous sequence such that the sequences become identical after the Diploid organism: 2 copies of every locus

Evolutionary rates vary according to gene region Drosophila Human - Chimp

Molecular clock hypothesis JC and Kimura models assume nucleotides accrue substitutions at a constant rate Empirical evidence Useful concept for dating divergence times Deviation indicate slowing or acceleration of evolutionary change or incorrect fossil dating

Molecular clock hypothesis Forces effecting sequence change Mutation: sequence change in a single individual Fixation: there exists at least two sequence variants (alleles) in a population and overtime only one remains Drift: change in variant frequency due to resampling Selection Negative Selective removal of deleterious mutations (alleles) Positive Increase the frequency of beneficial mutations (alleles) that increase fitness (success in reproduction) Main innovation: most changes we observe across lineages are neutral

Molecular clock hypothesis Rates vary widely for different proteins but scale with time Local clock vs global clock Rates can vary over branches and over time Selection Generation time effect Efficiency of DNA repair Some evidence suggests that DNA repair is more efficient in humans than in mice

Protein- coding sequences present opportunities to study differential rates A nonsynonymous substitution is a nucleotide mutation that alters the amino acid sequence of a protein. Synonymous substitutions do not alter amino acid sequences. Synonymous (silent) changes are thought to have relatively small effects, if any, on gene and protein function synonymous sites typically diverge at rates similar to non- functional sequences, such as pseudogenes, they are often the best molecular clock to normalize rates of substitution. 29

Synonymous and nonsynonymous substitution rates K S and K A Early methods estimated K S and K A using simple counting methods. These were sufficient as long as divergence was low (< 1 change per codon). Step 1: count # syn and nonsyn changes (M S & M A ) Step 2: normalize each by number of syn and nonsynsites (N S & N A ) for each nucleotide, sum proportion of potential changes that are syn or nonsyn determine mean # syn and nonsyn sites between sequences Step 3: use nucleotide model to compute genetic distances K S and K A K S = - ¾ ln(1 (4/3)*(M S /N S ) ) K A = - ¾ ln(1 (4/3)*(M A /N A ) ) More sophisticated models can be used to account for ts/tv rates.

Interpreting K S and K A These quantities can be powerful for making inferences about protein function. While amino acid divergence rates between proteins vary 1,000 fold, it is not clear whether rapidly evolving regions resulted from a lack of functional constraint or from positive selection for novel function. We can distinguish between these two scenarios by normalizing K A with K S

Statistical Tests K A K S use t- test to assess significance t = (K A K S ) / sqrt( V(K A ) + V(K S ) ), V is variance Cannot be used for small numbers of substitutions M<10 2 x 2 Contingency table and Fisher s exact test Because M A and M S are not corrected genetic distances this is only accurate for small numbers of substitutions ~ K < 0.2 K A / K S (a.k.a. d N /d S ) It is convenient to compare them as a ratio, in which the nonsyn rate is normalized by the syn rate. These values are comparable across genes and species. nonsyn syn changed M A M S not changed N A - M A N S - M S

dn/ds in practice Only useful for comparing close sequences Mammalian genes YES! Yeast family genes NO Synonymous sites are at saturation!

dn/ds genome wide histone Nucleosomes Cytochrome c oxidase Rad51 Pairwise comparisons between mouse and human genes. (Evolutionary insights into host pathogen interactions from mammalian sequence data) Pathogen defense Fertilization proteins Olfactory receptors Detoxification enzymes Only rarely do d N /d S ratios calculated over the entire gene exceed 1. Usually only select protein regions experience recurrent positively selected changes, so that moderately elevated values may indicate the presence of positive selection.

Immune proteins are targets of positive selection

Example MHC MHC molecules bind intercellular peptides and present them to immune cells Important for recognizing virus infected cells Overall d N /d S ratio <1 Amino acids in spacefill have a d N /d S ratio > 1.

Reproductive proteins are often under positive selection Human Chimp Seminal fluid proteins Drosophila melanogaster simulans Male accessory gland proteins Drosophila melanogaster simulans Random non- reproductive proteins Figure 1. Plots of d N Versus d S for Primate and Drosophila Seminal Fluid Genes (A) Genes encoding seminal fluid proteins identified by mass spectrometry in human versus chimpanzee. (B) Drosophila simulans male-specific accessory gland genes versus D. melanogaster [2]. The diagonal represents neutral evolution, a d N /d S ratio of one. Most genes are subject to purifying selection and fall below the diagonal, while several genes fall above or near the line suggesting positive selection. Comparison of the two plots shows elevated d N /d S ratios in seminal fluid genes of both taxonomic groups. DOI: 10.1371/journal.pgen.0010035.g001

Particular protein functional classes demonstrate frequent signs of positive selection From Human-Macaque gene comparisons of Positive Selection

Example:lysozymes Lysozyme is a bacteriolyticenzyme normally acting in host defense. It has been coopted to the foregut in vertebrate species that digest plant material namely ruminants (e.g. cow), colobine monkeys (e.g. langur), and the hoatzin, a bird. Lysozymes in these species have independently converged to the same amino acid at specific sites to the effect of increasing tolerance to the low ph of the digestive tract.