Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Similar documents
Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Quantifying sequence similarity

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Scoring Matrices. Shifra Ben-Dor Irit Orr

Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution

Local Alignment Statistics

7.36/7.91 recitation CB Lecture #4

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

BIO 285/CSCI 285/MATH 285 Bioinformatics Programming Lecture 8 Pairwise Sequence Alignment 2 And Python Function Instructor: Lei Qian Fisk University

Sequence Alignment: Scoring Schemes. COMP 571 Luay Nakhleh, Rice University

Lecture Notes: Markov chains

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Computational Biology

Substitution matrices

Algorithms in Bioinformatics

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

Scoring Matrices. Shifra Ben Dor Irit Orr

Understanding relationship between homologous sequences

Biochemistry 324 Bioinformatics. Pairwise sequence alignment

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

CSE 549: Computational Biology. Substitution Matrices

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

bioinformatics 1 -- lecture 7

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

Dr. Amira A. AL-Hosary

Lecture 4. Models of DNA and protein change. Likelihood methods

Maximum Likelihood Until recently the newest method. Popularized by Joseph Felsenstein, Seattle, Washington.

Evolutionary Analysis of Viral Genomes

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

Tools and Algorithms in Bioinformatics

Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/22

Sequence Database Search Techniques I: Blast and PatternHunter tools

Sequence Analysis '17 -- lecture 7

Advanced topics in bioinformatics

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Large-Scale Genomic Surveys

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26

Practical considerations of working with sequencing data

Pairwise sequence alignment

Single alignment: Substitution Matrix. 16 march 2017

An Introduction to Sequence Similarity ( Homology ) Searching

Reading for Lecture 13 Release v10

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

Exercise 5. Sequence Profiles & BLAST

Consensus methods. Strict consensus methods

Nucleotide substitution models

Lecture 4. Models of DNA and protein change. Likelihood methods

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

Pairwise Sequence Alignment

What Is Conservation?

Bioinformatics 1 -- lecture 9. Phylogenetic trees Distance-based tree building Parsimony

Sequence Analysis, '18 -- lecture 9. Families and superfamilies. Sequence weights. Profiles. Logos. Building a representative model for a gene.

Inferring phylogeny. Constructing phylogenetic trees. Tõnu Margus. Bioinformatics MTAT

BMI/CS 776 Lecture 4. Colin Dewey

... and searches for related sequences probably make up the vast bulk of bioinformatics activities.

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Phylogenetic inference

Chapter 2: Sequence Alignment

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

Bioinformatics Exercises

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Molecular Evolution and Comparative Genomics

Sequence analysis and comparison

C3020 Molecular Evolution. Exercises #3: Phylogenetics

EVOLUTIONARY DISTANCE MODEL BASED ON DIFFERENTIAL EQUATION AND MARKOV PROCESS

Pairwise sequence alignments. Vassilios Ioannidis (From Volker Flegel )

Lecture 6 Phylogenetic Inference

Pairwise sequence alignments

EVOLUTIONARY DISTANCES

BLAST: Basic Local Alignment Search Tool

Similarity or Identity? When are molecules similar?

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB

C.DARWIN ( )

Sequence Bioinformatics. Multiple Sequence Alignment Waqas Nasir

Evolutionary Models. Evolutionary Models

EECS730: Introduction to Bioinformatics

进化树构建方法的概率方法 第 4 章 : 进化树构建的概率方法 问题介绍. 部分 lid 修改自 i i f l 的 ih l i

Constructing Evolutionary/Phylogenetic Trees

Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17:

Introduction to Bioinformatics

Molecular Evolution & Phylogenetics Traits, phylogenies, evolutionary models and divergence time between sequences

Molecular Evolution, course # Final Exam, May 3, 2006

Reconstruire le passé biologique modèles, méthodes, performances, limites

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

Sequence analysis and Genomics

Substitution Matrices

In-Depth Assessment of Local Sequence Alignment

Tutorial 4 Substitution matrices and PSI-BLAST

How Molecules Evolve. Advantages of Molecular Data for Tree Building. Advantages of Molecular Data for Tree Building

Hidden Markov Models

Moreover, the circular logic

Molecular Evolution and DNA systematics

An Introduction to Bioinformatics Algorithms Hidden Markov Models

Transcription:

Sequence Analysis 17: lecture 5 Substitution matrices Multiple sequence alignment

Substitution matrices Used to score aligned positions, usually of amino acids. Expressed as the log-likelihood ratio of mutation (or log-odds ratio) Derived from multiple sequence alignments Two commonly used matrices: PAM and BLOSUM PAM = percent accepted mutations (Dayhoff) BLOSUM = Blocks substitution matrix (Henikoff)

PAM M Dayhoff, 1978 Evolutionary time is measured in Percent Accepted Mutations, or PAMs One PAM of evolution means 1% of the residues/bases have changed, averaged over all 20 amino acids. To get the relative frequency of each type of mutation, we count the times it was observed in a database of multiple sequence alignments. Based on global alignments Assumes a Markov model for evolution. Margaret Oakley Dayhoff

BLOSUM Henikoff & Henikoff, 1992 Based on database of ungapped local alignments (BLOCKS) Alignments have lower similarity than PAM alignments. BLOSUM number indicates the percent identity level of sequences in the alignment. For example, for BLOSUM62 sequences with approximately 62% identity were counted. Some BLOCKS represent functional units, providing validation of the alignment. Steven Henikoff

The data: Multiple Sequence Alignments A multiple sequence alignment is made using many pairwise sequence alignments

Columns in a MSA have a common evolutionary history a phylogenetic tree for one position in the alignment? N S A By aligning the sequences, we are asserting that the aligned residues in each column had a common ancestor.

A tree shows the evolutionary history of a single position Ancestral characters can be inferred by parsimony analysis. W N W W worm clam bird goat fish 7

Counting mutations without knowing ancestral sequences Naíve way: Assume any of the characters could be the ancestral one. Assume equal distance to the ancestor from each taxon. L K F L K F L K F L K F L K F L K F L K F W W N R L S K K P R L S K K P R L T K K P R L S K K P R L S R K P R L T R K P R L ~ K K P W W N If was the ancestor, then it mutated to a W twice, to N once, and stayed three times.

Subsitution matrices are symmetrical Since we don't know which sequence came first, we don't know whether w or...is correct. So we count this as one mutation of each type. P(-->W) and P(W-->) are the same number. (That's why we only show the upper triangle) w

Ask the data. Q: What is the probability of amino acid X mutating to amino acid Z? 10

Summing the substitution counts We assume the ancestor is one of the observed amino acids, but we don't know which, so we try them all. W W N N N W 3 1 2 one column of a MSA W symmetrical matrix

Next possible ancestor, again. We already counted this, so ignore it. W W N N N W 2 1 2 W

W W N N N W 2 1 W 1

W W N N N W 2 1 W 0

N W W W N N 2 0 0 W

Next... again W W N Counting as the ancestor many times as it appears recognizes the increased likelihood that (the most frequent aa at this position) is the true ancestor. N W N W 1 0 0

W W N (no counts for last seq.) N W N W 0 0 0

o to next column. Continue summing. W W N P P I N P P A Continue doing this for every column in every multiple sequence alignment... N W N W 6 4 8 0 2 TOTAL=21 1

Probability ratios are expressed as log odds Substitutions (and many other things in bioinformatics) are expressed as a "likelihood ratio", or "odds ratio" of the observed data over the expected value. Likelihood and odds are synomyms for Probability. So Log Odds is the log (usually base 2) of the odds ratio. log odds ratio = log 2 (observed/expected )

How do you calculate log-odds? P() = 4/7 = 0.57 Observed probability of -> q = P(->)=6/21 = 0.29 Expected probability of ->, e = 0.57*0.57 = 0.33 odds ratio = q /e = 0.29/0.33 log odds ratio = log 2 (q /e ) If the lod is < 0., then the mutation is less likely than expected by chance. If it is > 0., it is more likely.

Distribution matters -> P()=0.50 e = 0.25 q = 9/42 =0.21 lod = log 2 (0.21/0.25) = 0.2 A W W A N A A s spread over many columns P()=0.50 e = 0.25 q = 21/42 =0.5 lod = log 2 (0.50/0.25) = 1 W A W A W A A s concentrated

Distribution matters ->W P()=0.50, P(W)=0.14 e W = 0.07 q W = 7/42 =0.17 lod = log 2 (0.17/0.07) = 1.3 A W A W N A A and W seen together more often than expected. P()=0.50, P(W)=0.14 e W = 0.07 q = 3/42 =0.07 lod = log 2 (0.07/0.07) = 0 W A W A W A A s and W s not seen together.

In class exercise: et the substitution value for P->Q PQPP QQQP QQPP QPPP QQQP P Q P Q P(P)=, P(Q)= e PQ = q PQ = / = lod = log 2 (q PQ /e PQ ) = sequence alignment database. substitution counts expected (e), versus observed (q) for P->Q

PAM assumes Markovian evolution A Markov process is one where the likelihood of the next "state" depends only on the current state. The inference that evolution is Markovian assumes that base changes (or amino acid changes) occur at a constant rate and depend only on the identity of the current base (or amino acid). transition likelihood / 1%time current aa -> ->A ->V V->V V->.9946.0002.0021.9932.0001 A V V millions of years (MY)

Markovian evolution in PAM Amino acids found at one position in a MSA, expressed as probabilities. P(X) Substitution scores expressed as probabilities. P(X->X) 20x20, symmetrical 1x20 25

Probability distributions 0 P 1 ΣP(x) = 1 x in X P is a probability. P(x) is the probability of x. where X is the complete set of all possibilities x. ΣΣP(x)P(y) = 1 x in X y in Y where Y is a different complete set of all possibilities y. All ways of getting an Alanine at time t+1, given a lycine at time t. P(A) = P(A ) All ways of getting an Alanine at time t+1, given any amino acid at time t. P(A) = P(A)P(A A) + P(C)P(A C)... P(Y)P(A Y) = Σ P(x)P(A x) x all amino acids 26

New distribution from old distribution. P(A x) P(x) Σ P(x)P(A x) x all amino acids = row times column! 20x20, symmetrical time t=0 Time t=1 distribution can be predicted from time t=0 distribution and the substitution matrix. time t=1 27

Markovian evolution in PAM is "temporal extrapolation" Start with ly only at t=0. Wait for 1% of the sequence to be mutated. What amino acids are now found at that position? 0 PAM1 = 1 1 2 Wait another 1%-worth of time. PAM1 = 1 0 But,... PAM1 is just... PAM1 PAM1

So if we continue extrapolating, aren't we just using a matrix that is PAM taken to a power? (yes...) From 1% to 250% 250 times the amount of time for 1% mutation is 250% mutation. PAM1 PAM1 PAM1 = 250 PAM250 = PAM1 Hereafter, the number after PAM denotes the power to which PAM1 was taken. i.e. PAM250 = PAM1 250

Differences between PAM and BLOSUM PAM PAM matrices are based on global alignments of closely related proteins. The PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence. Other PAM matrices are extrapolated from PAM1 using an assumed Markov chain. BLOSUM BLOSUM matrices are based on local alignments. BLOSUM 62 is a matrix calculated from comparisons of sequences with approx 62% identity. All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins. BLOSUM 62 is the default matrix in BLAST (the database search program). It is tailored for comparisons of moderately distant proteins. Alignment of distant relatives may be more accurate with a different matrix.

Protein versus DNA alignments Are protein alignment better? Protein alphabet = 20, DNA alphabet = 4. Protein alignment is more informative Less chance of homoplasy with proteins. Homology detectable at greater edit distance Protein alignment more informative Better old Standard alignments are available for proteins. Better statistics from.s. alignments. On the other hand, DNA alignments are more sensitive to short evolutionary distances. 31

DNA evolutionary models: P-distance What is the relationship between time and the %identity? p = D L evolutionary time 0 p 1 p is a good measure of time only when p is small. 32

DNA evolutionary models: Poisson correction Corrects for multiple mutations at the same site. Unobserved mutations. p = D L 1-p = e -rt d P = rt dp = -ln(1-p) The fraction unchanged decays according to the Poisson function. In the time t since the common ancestor, rt mutations have occurred, where r is the mutation rate (r = genetic drift * selection pressure) evolutionary time 0 p 1 Poisson correction assumes p goes to 1 at t=. Where should it really go? 33

DNA evolutionary models: Jukes-Cantor What is the relationship between true evolutionary distance and p-distance? Prob(mutation in one unit of time) = α α << 1. Prob(no mutation) = 1-3α A C T A 1-3α α α α At time t, fraction identical is q(t). Fraction non-identical is p(t). p(t) + q(t) = 1 In time t+1, each of q(t) positions stays same with prob = 1-3α. C T α α α 1-3α α α α 1-3α α α α 1-3α Prob that both sequences do not mutate = (1-3α) 2 =(1-6α+9α 2 ) (1-6α). ( Since α<<1, we can safely neglect α 2. ) Prob that a mismatch mutates back to an identity = 2αp(t) q(t+1) = q(t)(1-6α) + 2α(1-q(t)) d q(t)/dt q(t+1) - q(t) = 2α - 8α q(t) Integrating: q(t) = (1/4)(1 + 3exp(-8αt)) Solving for d JC = 6αt = -(3/4)ln(1 - (4/3)p), where p is the p-distance. 34

DNA evolutionary models evolutionary time p-distance Poisson correction Jukes-Cantor 0 p 1 In Jukes-Cantor, p limits to p=0.75 at infinite evolutionary distance. 35

Transitions/transversions A C In DNA replication, errors can be transitions (purine for purine, pyrimidine for pyrimidine) or transversions (purine for pyrimidine & vice versa) R = transitions/transversions. R would be 1/2 if all mutations were equally likely. In DNA alignments, R is observed to be about 4. Transitions are greatly favored over transversions. T 36

Jukes-Cantor with correction for transitions/ transversions (Kimura 2-parameter model, dk2p) A C T A C T 1-2β-α β β α β 1-2β-α α β β α β β α 1-2β-α β 1-2β-α Split changes (D) into the two types, transition (P) and transversion (Q) p-distance = D/L = P + Q P = transitions/l, Q=transversions/L The the corrected evolutionary distance is... d K2P =-(1/2)ln(1-2P-Q)-(1/4)ln(1-2Q) 37

Further corrections are possible, but... Do we need a nucleotide substitution matrix? A C T A C T Additional corrections possible for: Position in codon frame Isochores (C-rich, AT-rich regions) Methylated, not methylated.

Review questions 1.What are we asserting about ancestry by aligning the sequences? 2.What kind of data is used to generate a substitution matrix? 3.What makes the PAM matrix Markovian? 4.When should you align amino acid sequence, when DNA? 5.When is the P-distance an accurate measure of time? 6.Why is time versus p-distance Poisson? 7.Why are transitions more common the transversions? 8.What does Kimura do that Jukes-Cantor does not? 39

PAM250

BLOSUM62

In class exercise: Which substitution matrix favors... PAM250 BLOSUM62 conservation of polar residues conservation of non-polar residues conservation of C, Y, or W polar-to-nonpolar mutations polar-to-polar mutations NO take-home exercise today...