Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).


Bioinformatics: In-depth Probability & Statistics, Spring Semester 2011, University of Zürich and ETH Zürich. Dr. Stefanie Muff. Adapted from a course by Dr. Dominic Schumacher.
So far, we have modeled the development across the different positions of one DNA sequence ("time" = position in the sequence). Today, we use Markov chains to model the development of individual positions in a DNA or protein sequence over time ("time" = time over which the sequence evolves).
Objectives for today: (1) Studying models for sequence evolution: transition matrices. (2) Deriving matrices that give a realistic score to each possible substitution in a sequence (DNA or protein): scoring matrices.
Models for sequence evolution (DNA). Each site of the DNA sequence evolves according to a first-order Markov chain with state space {A, C, G, T}.
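Simplest model for DNA evolution: Jukes-Cantor (1969).
\[
P = \begin{pmatrix} p_{a,a} & p_{a,c} & p_{a,g} & p_{a,t} \\ p_{c,a} & p_{c,c} & p_{c,g} & p_{c,t} \\ p_{g,a} & p_{g,c} & p_{g,g} & p_{g,t} \\ p_{t,a} & p_{t,c} & p_{t,g} & p_{t,t} \end{pmatrix}
= \begin{pmatrix} 1-3\alpha & \alpha & \alpha & \alpha \\ \alpha & 1-3\alpha & \alpha & \alpha \\ \alpha & \alpha & 1-3\alpha & \alpha \\ \alpha & \alpha & \alpha & 1-3\alpha \end{pmatrix}
\]
The stationary distribution is $\pi = (0.25, 0.25, 0.25, 0.25)$. Necessary: $\alpha < 1/3$. The parameter $\alpha$ depends on the time scale: if the unit of time covers fewer generations, $\alpha$ takes a smaller value than if the unit of time covers more generations.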
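The properties stated for the Jukes-Cantor model can be checked numerically. The following is a minimal sketch in plain Python; the value alpha = 0.01 and the helper names `jukes_cantor` and `step` are illustrative assumptions, not part of the lecture.

```python
def jukes_cantor(alpha):
    """Return the 4x4 Jukes-Cantor transition matrix for states (a, c, g, t)."""
    if not 0 < alpha < 1 / 3:
        raise ValueError("alpha must lie in (0, 1/3)")
    return [[1 - 3 * alpha if i == j else alpha for j in range(4)]
            for i in range(4)]


def step(dist, P):
    """One Markov-chain step: row vector dist multiplied by matrix P."""
    return [sum(dist[i] * P[i][j] for i in range(len(dist)))
            for j in range(len(P[0]))]


P = jukes_cantor(0.01)            # alpha = 0.01 is an arbitrary illustrative value

# The uniform distribution is stationary: one step leaves it unchanged.
pi = [0.25, 0.25, 0.25, 0.25]
pi_next = step(pi, P)

# Starting from state a with certainty, the chain approaches the uniform distribution.
dist = [1.0, 0.0, 0.0, 0.0]
for _ in range(2000):
    dist = step(dist, P)
```

After the loop, `dist` is numerically indistinguishable from `pi`, which illustrates why the stationary distribution does not depend on the starting nucleotide.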
Objection: The Jukes-Cantor model is not entirely realistic (e.g. all kinds of substitutions are equally likely).
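More realistic model: the Kimura model (1980). With the states ordered (a, c, g, t):
\[
P = \begin{pmatrix} 1-\alpha-2\beta & \beta & \alpha & \beta \\ \beta & 1-\alpha-2\beta & \beta & \alpha \\ \alpha & \beta & 1-\alpha-2\beta & \beta \\ \beta & \alpha & \beta & 1-\alpha-2\beta \end{pmatrix}
\]
The stationary distribution is $\pi = (0.25, 0.25, 0.25, 0.25)$. Necessary: $\alpha + 2\beta < 1$. Here $\alpha$ is the probability of a transition (purine to purine or pyrimidine to pyrimidine) and $\beta$ is the probability of each transversion (purine to pyrimidine or vice versa). Purines: a, g; pyrimidines: c, t.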
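The Kimura matrix can be built directly from the purine/pyrimidine classification. This is a sketch under assumed parameter values (alpha = 0.02, beta = 0.005 are arbitrary); the function names are hypothetical.

```python
BASES = "acgt"
PURINES = {"a", "g"}              # the remaining bases c and t are pyrimidines


def is_transition(x, y):
    """True if x -> y is a change within purines or within pyrimidines."""
    return x != y and ((x in PURINES) == (y in PURINES))


def kimura(alpha, beta):
    """Return the 4x4 Kimura transition matrix with states ordered a, c, g, t."""
    if alpha <= 0 or beta <= 0 or alpha + 2 * beta >= 1:
        raise ValueError("need alpha > 0, beta > 0 and alpha + 2*beta < 1")
    P = []
    for x in BASES:
        row = []
        for y in BASES:
            if x == y:
                row.append(1 - alpha - 2 * beta)   # no change
            elif is_transition(x, y):
                row.append(alpha)                  # transition
            else:
                row.append(beta)                   # transversion
        P.append(row)
    return P


P = kimura(0.02, 0.005)           # illustrative values only
row_sums = [sum(row) for row in P]
# One step applied to the uniform distribution returns the uniform distribution.
pi_next = [sum(0.25 * P[i][j] for i in range(4)) for j in range(4)]
```

Setting alpha equal to beta recovers the Jukes-Cantor matrix, which is one way to see Kimura's model as a generalization.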
Comments: There are even more realistic Markov models for DNA substitution (e.g. Kimura 3ST, Felsenstein, Hasegawa-Kishino-Yano, and many others). Most models assume that sites evolve independently (which is not entirely realistic). Why not use more realistic models? Because they may lack nice properties, e.g., reversibility or simple stationary distributions. For more sophisticated models, it is more complicated to compute the probabilities of interest, and there are more parameters to estimate. The simpler ones seem to give reasonably sensible results.
Evolutionary models for proteins. The concept is similar, but here it is much more important to account for the wide range of different transition probabilities associated with amino acid substitutions.
(Figure: classification of the amino acids. From: Taylor W.R. (1986) The classification of amino acid conservation: a strategy for the hierarchical analysis of residue conservation, Bioinformatics, 9:)
See the construction of substitution scoring matrices.
Substitution scoring matrices
Scoring systems. Important in bioinformatics for comparisons of DNA or protein sequences. Aim: inferring the function of molecules by finding similarity to a sequence with known function. This requires a good alignment of the sequences, which in turn requires a measure for judging the quality of an alignment relative to other possible alignments: a scoring system.
We use additive scoring systems: assign a quality score to each aligned position and sum the scores of the individual positions (ignoring gaps for now). Example: two DNA sequences; score for a match: +1, score for a mismatch: −1.
a a g t t t c t t g
a a a c t c c c t g
Individual scores: +1 +1 −1 −1 +1 −1 +1 −1 +1 +1. Cumulative score: 6 − 4 = 2.
Maybe more realistic: score for a match: +1, score for a transition: −1/2, score for a transversion: −1. All four mismatches above are transitions, so the cumulative score in that case is 6 − 2 = 4.
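Both scoring schemes from this example are easy to reproduce. The sketch below scores the two sequences of the slide position by position; the function names are illustrative choices.

```python
PURINES = {"a", "g"}              # c and t are pyrimidines


def simple_score(x, y):
    """+1 for a match, -1 for any mismatch."""
    return 1 if x == y else -1


def ts_tv_score(x, y):
    """+1 for a match, -1/2 for a transition, -1 for a transversion."""
    if x == y:
        return 1.0
    same_class = (x in PURINES) == (y in PURINES)
    return -0.5 if same_class else -1.0


def alignment_score(s, t, score):
    """Additive score of an ungapped alignment of two equally long sequences."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(score(x, y) for x, y in zip(s, t))


s = "aagtttcttg"
t = "aaactccctg"
print(alignment_score(s, t, simple_score))   # 2
print(alignment_score(s, t, ts_tv_score))    # 4.0
```

The second scheme scores higher here because all four mismatches happen to be transitions (g/a once, t/c three times).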
Scoring matrices. The scores for the individual positions can be displayed in a so-called scoring matrix (also called substitution matrix). This is usually a symmetric 4 × 4 (DNA) or 20 × 20 (protein) matrix whose entry (i, j) is the score that we assign if the nucleotides or amino acids i and j are aligned at a position. E.g. for the second example from the last slide:
\[
S = \begin{pmatrix} s_{a,a} & s_{a,c} & s_{a,g} & s_{a,t} \\ s_{c,a} & s_{c,c} & s_{c,g} & s_{c,t} \\ s_{g,a} & s_{g,c} & s_{g,g} & s_{g,t} \\ s_{t,a} & s_{t,c} & s_{t,g} & s_{t,t} \end{pmatrix}
= \begin{pmatrix} 1 & -1 & -1/2 & -1 \\ -1 & 1 & -1 & -1/2 \\ -1/2 & -1 & 1 & -1 \\ -1 & -1/2 & -1 & 1 \end{pmatrix}
\]
How can we find a biologically sensible scoring matrix? For DNA sequences, simple scoring matrices (like the one presented) are often effective; usually there is no need to worry. For protein sequences, some substitutions are clearly more likely to occur than others (presumably due to similar chemical properties of the amino acids involved), e.g. isoleucine for valine or serine for threonine. These are so-called conservative substitutions. We get considerably better alignments if we take this into account: use scoring matrices that are derived by statistical analysis of protein data.
Scoring matrices for proteins. Specifications: Identical amino acids should be given a higher score than any substitution. Conservative substitutions should be given a higher score than non-conservative ones (e.g., substitutions between hydrophobic or between charged amino acids). We want our scoring matrices to take into account the evolutionary distance between the sequences involved!
Generally: substitutions with low transition probability should be given a lower score than those that occur very frequently (like self-transitions). In fact: transition matrix + log-likelihood ratios = scoring matrix.
Approaches to find scoring matrices:
1) The PAM family of substitution matrices. Uses Markov chains and phylogenetic trees to fit an evolutionary model, and log-likelihood ratios for the construction of a scoring matrix from the estimated transition matrix.
2) The BLOSUM family of substitution matrices. Uses log-likelihood ratios for the construction of a scoring matrix from substitution frequencies estimated directly from blocks of aligned sequences.
3) WAG and WAG* substitution matrices. Combines the estimation of transition and scoring matrices by a maximum-likelihood approach. The most recent of the three methods and nowadays widely used.
The PAM family of scoring matrices (Dayhoff, Schwartz, and Orcutt, 1978). It requires the use of Markov chains and phylogenetic trees (for fitting an evolutionary model) and of log-likelihood ratios (for getting a scoring matrix from an estimated transition matrix). PAM = Point (or Percentage) Accepted Mutations. Accepted mutations are those mutations that are able to spread in the population and become dominant (typically mutations that do not disrupt the protein function or that even increase the fitness of the species).
PAM matrices (Dayhoff matrices). There are two types of PAM matrices: a PAM Markov transition matrix P (the table of estimated transition probabilities for the underlying evolutionary model), and a PAM scoring matrix (the table of scores for all possible pairs of amino acids, which is used to judge the quality of a given alignment). Again: transition matrix → scoring matrix.
Underlying model: Each site in the sequence evolves according to a reversible Markov chain, independently of the other sites. All the Markov chains have the same transition matrix P (a matrix of dimension 20 × 20). Dayhoff et al. (1978) estimated the one-step transition matrix P from protein sequence data. How?
Construction of a PAM1 transition matrix. Definition: A PAM1 transition matrix is the Markov transition matrix for a time period over which we expect 1% of the amino acids to undergo accepted point mutations. The steps involved in the estimation:
1. Find reliable data and align protein sequences that are at least 85% identical.
2. Reconstruct phylogenetic trees and infer ancestral sequences.
3. Count the amino acid replacements that occurred along the trees (i.e. count mutations accepted by natural selection).
4. Use these counts to estimate the Markov transition probabilities between amino acids.
Step 1: Find reliable data. Dayhoff et al. (1978) used ungapped multiple alignments of certain well-conserved regions from closely related proteins (71 blocks of proteins from 34 families, all in all 1572 changes).
(Figure: example blocks of aligned sequences, each with more than 85% identity.)
In any block, any two sequences did not differ by more than 15%. (The idea was to keep the number of sites that have undergone several changes low.) Note: Only a limited amount of protein sequence data was available at that time!
Step 2: Reconstruct phylogenetic trees. The aligned regions were then used to infer the underlying evolutionary tree(s) and the ancestral sequences. For this, most parsimonious trees were used. A most parsimonious tree is a tree structure such that the total number of substitutions across the tree is minimal. The protein sequences of one block are the leaves of a tree.
(Figure: example data seq1 = AA, seq2 = AE, seq3 = EE, placed as the leaves of a tree with inferred ancestral sequences.)
Why do we use trees? To avoid overcounting! In a tree the sequences are grouped in the right way (in general, very similar sequences succeed one another in the tree). We therefore mainly count transitions between similar sequences and only a few transitions to other, more different sequences, so the corresponding substitutions do not get an unnatural importance.
Step 3: Count the replacements along the trees. Example: There are five most parsimonious trees for the three sequences AA, AE and EE.
(Figure: the five most parsimonious trees, with the number of substitutions marked on each branch.)
Count all substitutions along the branches (example: 1 indicates an AE alignment).
The substitution AE (and EA) occurs exactly twice in each tree. The substitution AA (and EE) occurs a total of 15 times in the trees. Self-alignments are counted twice, thus we get 30. Divide by the number of trees (five) to get an average of 6. In summary, we get the count matrix
\[
A = \begin{pmatrix} A_{AA} & A_{AE} \\ A_{EA} & A_{EE} \end{pmatrix} = \begin{pmatrix} 6 & 2 \\ 2 & 6 \end{pmatrix}.
\]
In reality we do not have only the two letters A and E, but 20 amino acids. They are numbered from 1 to 20, and transitions between them are counted just as in the example above. Let $A_{jk}$ be the average number of times substitutions from j to k were observed in the trees:
\[
A = \begin{pmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,20} \\ A_{2,1} & A_{2,2} & \cdots & A_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ A_{20,1} & A_{20,2} & \cdots & A_{20,20} \end{pmatrix}
\]
The counts can be summed over different blocks of sequences.
Step 4: Estimate the Markov transition probabilities. The estimated probabilities for j → k are the observed relative frequencies:
\[
a_{jk} := \frac{A_{jk}}{\sum_{m=1}^{20} A_{j,m}}.
\]
Remember: PAM1 is a transition matrix under which 1% of the amino acids are expected to undergo an accepted point mutation. Thus, to get $P = (p_{jk})$, the $a_{jk}$ have to be scaled by a factor c:
\[
p_{jk} := c\, a_{jk} \ \text{ for } j \neq k, \qquad p_{j,j} := 1 - \sum_{k \neq j} c\, a_{jk},
\]
where the scaling constant c is sufficiently small so that $p_{j,j} \geq 0$ for all j.
The scaling factor c. The factor c enforces the 1% of accepted point mutations. This is useful for relatively short evolutionary distances. Such a time unit is called an evolutionary distance of 1 PAM. (Note: 1 PAM can correspond to 1 million years, but also to much more or much less, depending on the protein family.)
The determination of c. Let $Z_n$ be the amino acid present at a particular site at time n (hence $1 \leq Z_n \leq 20$). The probability that the site will change after 1 PAM time unit (i.e. after one step) is given by
\[
P(Z_{n+1} \neq Z_n) = \sum_{j=1}^{20} \sum_{k \neq j} q_j\, p_{jk},
\]
where $q_j$ is the observed frequency of the amino acid j in the original blocks of aligned proteins.
One wants the probability that the site will change after 1 PAM to be equal to 0.01. (That implies an average change of 1%.)
\[
0.01 = \sum_{j=1}^{20} \sum_{k \neq j} q_j\, p_{jk} = \sum_{j=1}^{20} \sum_{k \neq j} q_j\, c\, a_{jk} = c \sum_{j=1}^{20} \sum_{k \neq j} q_j\, a_{jk}
\quad\Longrightarrow\quad
c = \frac{0.01}{\sum_{j=1}^{20} \sum_{k \neq j} q_j\, a_{jk}}.
\]
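Step 4 and the determination of c can be sketched together in a few lines. The example below uses a toy two-letter alphabet with assumed counts A = ((6, 2), (2, 6)) and assumed frequencies q = (0.5, 0.5); these inputs and the function name `pam1_from_counts` are illustrative only, and a real PAM1 matrix would use the 20 × 20 counts.

```python
def pam1_from_counts(A, q, target=0.01):
    """Estimate a PAM1-style transition matrix from a count matrix A.

    A[j][k] : average count of substitutions j -> k (toy input, assumed)
    q[j]    : observed frequency of letter j (assumed)
    target  : desired probability that a site changes in one step
    """
    n = len(A)
    # Relative frequencies a_jk = A_jk / sum_m A_jm
    a = [[A[j][k] / sum(A[j]) for k in range(n)] for j in range(n)]
    # Unscaled probability of a change: sum_j sum_{k != j} q_j a_jk
    change = sum(q[j] * a[j][k] for j in range(n) for k in range(n) if k != j)
    c = target / change
    # Off-diagonal entries p_jk = c * a_jk; diagonal filled so rows sum to 1
    P = [[c * a[j][k] if k != j else 0.0 for k in range(n)] for j in range(n)]
    for j in range(n):
        P[j][j] = 1.0 - sum(P[j])
    return P, c


A = [[6, 2], [2, 6]]              # toy counts for a two-letter alphabet
q = [0.5, 0.5]                    # toy letter frequencies
P, c = pam1_from_counts(A, q)
# Probability that a site changes in one step under the scaled matrix
p_change = sum(q[j] * P[j][k] for j in range(2) for k in range(2) if k != j)
```

With these toy inputs the unscaled change probability is 0.25, so c = 0.04 and the off-diagonal entries of P become 0.01.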
How can the transition matrix be turned into a scoring matrix? Consider two given protein sequences $s = a_1 a_2 \cdots a_n$ and $s' = b_1 b_2 \cdots b_n$ at an evolutionary distance of 1 PAM. (Note: the evolutionary distance between sequences is difficult to determine exactly, but for closely related sequences non-optimal matrices also give good results.) The score for aligning s with s' is generated by comparing two different hypotheses $H_0$ and $H_1$:
$H_0$: s and s' are not evolutionarily related.
$H_1$: s and s' are evolutionarily related (i.e. s' depends on s via the Markov model).
Under $H_0$, we have a chance alignment of $s = a_1 a_2 \cdots a_n$ with $s' = b_1 b_2 \cdots b_n$. Suppose amino acid j appears with probability $q_j$. The probability of getting this chance alignment is
\[
P_{H_0}(\text{the alignment}) = \Big(\prod_{i=1}^{n} q_{a_i}\Big)\Big(\prod_{i=1}^{n} q_{b_i}\Big) = \prod_{i=1}^{n} \big(q_{a_i}\, q_{b_i}\big).
\]
Under $H_1$, the sites in the sequences are dependent according to the Markov model, and thus the transition probabilities are those in the PAM1 matrix. Example: $P_{H_1}(\text{align P and R in a given site}) = q_P\, p_{PR}$. Since the different sites evolve independently of each other, we get
\[
P_{H_1}(\text{the alignment}) = \prod_{i=1}^{n} \big(q_{a_i}\, p_{a_i b_i}\big).
\]
We want our score to reflect the chance (or the odds) that with s and s' we have aligned evolutionarily related sequences (i.e. we want a high score if it is very likely that we have aligned related sequences). A natural choice for the score is the likelihood ratio (odds):
\[
\text{Score} = \frac{P_{H_1}(\text{the alignment})}{P_{H_0}(\text{the alignment})} = \frac{\prod_{i=1}^{n} \big(q_{a_i}\, p_{a_i b_i}\big)}{\prod_{i=1}^{n} \big(q_{a_i}\, q_{b_i}\big)} = \prod_{i=1}^{n} \frac{q_{a_i}\, p_{a_i b_i}}{q_{a_i}\, q_{b_i}} = \prod_{i=1}^{n} \frac{p_{a_i b_i}}{q_{b_i}}.
\]
Or, equivalently, one can use the log-likelihood ratio (the "log odds"). The score S of the alignment then becomes
\[
S = \log\left(\frac{P_{H_1}(\text{the alignment})}{P_{H_0}(\text{the alignment})}\right) = \log\left(\prod_{i=1}^{n} \frac{p_{a_i b_i}}{q_{b_i}}\right) = \sum_{i=1}^{n} \log\left(\frac{p_{a_i b_i}}{q_{b_i}}\right).
\]
Advantage: the log turns products into sums, which gives an additive scoring scheme. The entry (a, b) in the PAM substitution matrix is then of the form
\[
S_{ab} = \log\left(\frac{p_{ab}}{q_b}\right)
\]
(usually rounded to the nearest integer for convenience).
Total score: $S(\text{alignment}) = \sum_{i=1}^{n} S_{a_i b_i}$. Moreover:
\[
S_{ab} = \log\left(\frac{p_{ab}}{q_b}\right) < 0 \iff \frac{p_{ab}}{q_b} < 1 \iff q_a\, p_{ab} < q_a\, q_b
\]
(i.e. $S_{ab} < 0$ if it is more likely to see a and b aligned against each other in a random alignment than to see a and b aligned at the evolutionary distance of 1 PAM). Otherwise, $S_{ab} = \log\left(\frac{p_{ab}}{q_b}\right) \geq 0$.
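The log-odds construction and the additive total score can be illustrated on a toy two-letter alphabet. The transition matrix P, the frequencies q, the example sequences, and the use of the natural logarithm are all assumptions made for this sketch; the slides do not fix the base of the logarithm.

```python
import math


def log_odds_matrix(P, q, letters):
    """Scores S_ab = log(p_ab / q_b) for every ordered pair of letters."""
    return {(a, b): math.log(P[i][k] / q[k])
            for i, a in enumerate(letters)
            for k, b in enumerate(letters)}


def score_alignment(s, t, S):
    """Total score: sum of the per-position log-odds scores."""
    return sum(S[(a, b)] for a, b in zip(s, t))


letters = "AE"                        # toy alphabet
P = [[0.99, 0.01], [0.01, 0.99]]      # toy one-step transition matrix
q = [0.5, 0.5]                        # toy background frequencies
S = log_odds_matrix(P, q, letters)

total = score_alignment("AAEE", "AAEA", S)   # three matches, one mismatch
```

Here a match scores log(0.99/0.5) > 0 and a mismatch scores log(0.01/0.5) < 0, in line with the sign rule derived above.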
PAMn substitution matrix: for sequences at an evolutionary distance of n PAM units. Careful: n PAM units does not mean that we expect n% of the amino acids to differ, because substitutions can occur at the same site many times! Let P be the 1 PAM transition matrix. As always with Markov chains, the n-step transition probabilities $p^{(n)}_{ab}$ are given by the entries of $P^n$. The scores are
\[
S^{(n)}_{ab} = \log\left(\frac{p^{(n)}_{ab}}{q_b}\right).
\]
Highest level: PAM250 (approx. 20% sequence similarity).
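The warning that n PAM units do not correspond to n% differing sites can be checked by raising a one-step matrix to the n-th power. The sketch below uses a toy two-state matrix in place of the real 20 × 20 PAM1 matrix; all numbers and helper names are illustrative assumptions.

```python
def mat_mul(X, Y):
    """Product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][m] * Y[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(P, n):
    """n-step transition matrix P^n by repeated multiplication."""
    size = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result


def expected_difference(P, q):
    """Probability that a site differs from its starting letter."""
    return sum(q[j] * (1.0 - P[j][j]) for j in range(len(q)))


P1 = [[0.99, 0.01], [0.01, 0.99]]     # toy stand-in for a PAM1 matrix
q = [0.5, 0.5]

d1 = expected_difference(P1, q)                   # one step: 1% of sites differ
d50 = expected_difference(mat_pow(P1, 50), q)     # 50 steps: clearly less than 50%
```

In this toy chain `d50` is about 0.32 rather than 0.50, because sites that changed can change back or change again.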
PAM criticism: The assumption that each site has the same mutability. The assumption that sites evolve independently. The construction of phylogenetic trees (especially most parsimonious ones) is very difficult (finding the most parsimonious tree is an NP-hard problem). Matrices for greater evolutionary distances are extrapolated from the PAM1 matrix. The matrices are based on a small set of closely related proteins.
The BLOSUM family of substitution matrices (Henikoff and Henikoff, 1992). BLOSUM = BLOcks SUbstitution Matrices. Again the scores will be logarithms of likelihood ratios, but this time there are no evolutionary models, and hence no Markov chains and no trees involved. The likelihoods are obtained by the statistical analysis of blocks of aligned sequences.
Blocks of aligned sequences. About 3000 blocks (extracted from multiple sequence alignments) involving about 800 protein families were used to derive the BLOSUM matrices (H&H developed a program called Protomat for obtaining these blocks). A difference to the construction of PAM matrices is the kind of data that was used. H&H's data was far more extensive: several hundred groups of proteins, with at least 2369 occurrences of any particular substitution. The concept is also different: Dayhoff et al. used data from closely related proteins and extrapolated (PAM1 → PAMn), whereas H&H directly used protein sequences regardless of their evolutionary distances. BLOSUM uses no evolutionary model (i.e., there is no need to order the sequences according to evolutionary distance, and hence no trees).
Four sample blocks from H&H's Blocks database (one sequence per row, one block per column group):
WWYIR  CASILRKIYIYGPV  GVSRLRTAYGGRK  NRG
WFYVR  CASILRHLYHRSPA  GVGSITKIYGGRK  RNG
WYYVR  AAAVARHIYLRKTV  GVGRLRKVHGSTK  NRG
WYFIR  AASICRHLYIRSPA  GIGSFEKIYGGRR  RRG
WYYTR  AASIARKIYLRQGI  GVGGFQKIYGGRQ  RNG
WFYKR  AASVARHIYMRKQV  GVGKLNKLYGGAK  SRG
WFYKR  AASVARHIYMRKQV  GVGKLNKLYGGAK  SRG
WYYVR  TASIARRLYVRSPT  GVDALRLVYGGSK  RRG
WYYVR  TASVARRLYIRSPT  GVGALRRVYGGNK  RRG
WFYTR  AASTARHLYLRGGA  GVGSMTKIYGGRQ  RNG
WFYTR  AASTARHLYLRGGA  GVGSMTKIYGGRQ  RNG
WWYVR  AAALLRRVYIDGPV  GVNSLRTHYGGKK  DRG
Each block stems from an ungapped multiple alignment of a relatively highly conserved region of a protein family.
Likelihoods. Count the proportion $p_a$ of times that the amino acid (AA) a occurs somewhere in any block, and the proportion $p_{ab}$ of times that the AA pair (a, b) (not necessarily distinct) occurs in the same column of any block. Note: there are a total of $N \binom{m}{2}$ pairs that have to be taken into account in all the blocks, where N is the number of columns in all blocks together and m is the number of rows in each block, if we assume that this number is the same for all blocks.
The likelihood ratios. $p_{ab}$: likelihood (estimated probability) of the substitution a ↔ b under the assumption that the sequences are related. $p_a p_b$: likelihood (estimated probability) of the substitution a → b (and also of b → a) under the assumption that the sequences are not related. Thus, for a ≠ b, the likelihood of the substitution a ↔ b is $2 p_a p_b$. A scoring matrix is obtained by the same considerations as for the PAM matrix. Set the entry at position (a, b) as
\[
S_{ab} := \begin{cases} 2 \log_2\left(\dfrac{p_{ab}}{2\, p_a\, p_b}\right) & \text{if } a \neq b, \\[2ex] 2 \log_2\left(\dfrac{p_{ab}}{p_a\, p_b}\right) & \text{if } a = b. \end{cases}
\]
($2 \log_2$ is what H&H used; using log would give essentially the same score.) This is not yet the BLOSUM we are looking for!
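The counting of letters and column pairs, and the resulting scores, can be sketched for a single block. The example uses the first sample block shown earlier (12 rows, 5 columns); the function name is hypothetical, pairs that never occur are simply left without a score, and no clustering or rounding is applied, so this is an illustration of the formula rather than H&H's full procedure.

```python
import math
from collections import Counter
from itertools import combinations


def blosum_type_scores(block):
    """Scores 2*log2(p_ab / expected) from one ungapped block of equal-length rows."""
    letter_counts = Counter()
    pair_counts = Counter()
    for row in block:
        letter_counts.update(row)
    for column in zip(*block):
        for x, y in combinations(column, 2):
            pair_counts[tuple(sorted((x, y)))] += 1   # unordered pair

    n_letters = sum(letter_counts.values())
    n_pairs = sum(pair_counts.values())               # equals N * C(m, 2)
    p = {a: count / n_letters for a, count in letter_counts.items()}

    scores = {}
    for (a, b), count in pair_counts.items():
        p_ab = count / n_pairs
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = 2 * math.log2(p_ab / expected)
    return scores, n_pairs


block = ["WWYIR", "WFYVR", "WYYVR", "WYFIR", "WYYTR", "WFYKR",
         "WFYKR", "WYYVR", "WYYVR", "WFYTR", "WFYTR", "WWYVR"]
scores, n_pairs = blosum_type_scores(block)
```

For this block N = 5 and m = 12, so 5 · 66 = 330 pairs are counted; the fully conserved first column makes the W/W score clearly positive.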
The circularity problem: iteration. To obtain the initial blocks, a multiple alignment was done, and a substitution matrix is needed for that! Henikoff and Henikoff used a simple unit matrix for this first alignment (1 for a match, 0 for a mismatch). Then, with the first BLOSUM matrix obtained by the above procedure, they constructed a second set of blocks and a second BLOSUM matrix. With this second matrix, a third matrix was constructed. Only this third matrix is recommended for use. This gives a BLOSUM100 matrix, provided that we have eliminated all identical copies of sequences from the original blocks. The BLOSUM100 matrix is not very useful!
BLOSUMx matrices. To take into account short or long evolutionary distances between sequences, different BLOSUM matrices are constructed by clustering those sequences in each block that are sufficiently close (by combining them and using a weighted contribution to the counting). The result is called a BLOSUMx matrix, where the number x determines what we mean by "sufficiently close": we cluster sequences in a block that have x% identity or more. BLOSUM80 is used to compare closely related sequences, while BLOSUM45 is suitable for diverged proteins. An intermediate BLOSUM that is often used is BLOSUM62.
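The clustering step can be sketched as follows. The linkage rule used here (a sequence joins a cluster if it has at least x% identity with any member) is an assumption for illustration, and the weighting of clustered sequences in the counts is not shown; the four sequences are taken from the first sample block.

```python
def percent_identity(s, t):
    """Percentage of positions at which two equally long sequences agree."""
    return 100.0 * sum(a == b for a, b in zip(s, t)) / len(s)


def cluster(seqs, x):
    """Group sequences; a sequence joins every cluster containing a member
    with at least x% identity, and such clusters are merged."""
    clusters = []
    for s in seqs:
        linked, rest = [], []
        for c in clusters:
            close = any(percent_identity(s, t) >= x for t in c)
            (linked if close else rest).append(c)
        merged = [s] + [t for c in linked for t in c]
        clusters = rest + [merged]
    return clusters


seqs = ["WWYIR", "WFYVR", "WYYVR", "WYFIR"]
clusters_80 = cluster(seqs, 80)   # only WFYVR and WYYVR (80% identical) are merged
clusters_60 = cluster(seqs, 60)   # all four sequences end up in one cluster
```

A lower threshold x merges more sequences, so fewer near-identical pairs are counted and the resulting matrix reflects more distant relationships.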
Note: The numbers n and x in PAMn and BLOSUMx play opposite roles. Higher values of n and lower values of x both correspond to longer evolutionary distances! The n counts the steps in the Markov chain used for the evolutionary model. The x gives the percentage of identity up to which two sequences in a block are still treated as different.
PAM vs. BLOSUM. Advantages of BLOSUM: Simpler model. Observation-based and more or less independent of other models and concepts (no Markov chain assumption, trees, or maximum parsimony needed). Tests suggest that BLOSUM matrices are generally superior to PAM matrices for detecting biological relationships (even if the same amounts of data are used). Advantages of PAM: Yields an explicit evolutionary model as a by-product. Helps to better understand biological relations.
WAG and WAG* matrices (Whelan and Goldman, 2001). WAG/WAG* is a skilful combination of an approximate maximum likelihood (ML) method and a counting method, JTT (Jones, Taylor and Thornton 1992; essentially an improved PAM), that employs most parsimonious trees. Neither ML nor counting methods are satisfactory alone: ML is very expensive and can be applied to only a limited number of sequences at a time. Counting methods using maximum parsimony have essentially all the drawbacks of PAM.
Assumptions made in the WAG models: All amino acid sites evolve independently according to the same Markov process. The Markov process is stationary, time-homogeneous and reversible. Transition probabilities $p_{jk}$ between amino acids stay constant over near-optimal branch lengths and tree topologies (i.e., given that the real tree is known). JTT gives near-optimal trees (i.e., optimal branch lengths).
In the WAG models the transition matrix P is optimized by maximum likelihood, under the assumption that the optimal tree topologies $T = (T_1, \ldots, T_n)$ are known for all aligned protein families $D = (D_1, \ldots, D_n)$. Then maximize the sum
\[
\log L = \sum_{i=1}^{n_{\text{families}}} \log\big(P \mid T_i, D_i\big).
\]
Reference to the original WAG paper: S. Whelan, N. Goldman: A General Empirical Model of Protein Evolution Derived from Multiple Protein Families Using a Maximum-Likelihood Approach, Mol. Biol. Evol. 18(5), 2001.
Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)
Bioinformatics II Probability and Statistics Universität Zürich and ETH Zürich Spring Semester 2009 Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Dr Fraser Daly adapted from
More informationSequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment
Sequence Analysis 17: lecture 5 Substitution matrices Multiple sequence alignment Substitution matrices Used to score aligned positions, usually of amino acids. Expressed as the loglikelihood ratio of
More informationBioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre
Bioinformatics Scoring Matrices David Gilbert Bioinformatics Research Centre www.brc.dcs.gla.ac.uk Department of Computing Science, University of Glasgow Learning Objectives To explain the requirement
More informationLecture Notes: Markov chains
Computational Genomics and Molecular Biology, Fall 5 Lecture Notes: Markov chains Dannie Durand At the beginning of the semester, we introduced two simple scoring functions for pairwise alignments: a similarity
More informationQuantifying sequence similarity
Quantifying sequence similarity Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, February 16 th 2016 After this lecture, you can define homology, similarity, and identity
More informationSequence Alignment: Scoring Schemes. COMP 571 Luay Nakhleh, Rice University
Sequence Alignment: Scoring Schemes COMP 571 Luay Nakhleh, Rice University Scoring Schemes Recall that an alignment score is aimed at providing a scale to measure the degree of similarity (or difference)
More informationSara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)
Bioinformática Sequence Alignment Pairwise Sequence Alignment Universidade da Beira Interior (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) 1 16/3/29 & 23/3/29 27/4/29 Outline
More informationAlgorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment
Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot
More informationLecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/22
Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) Joe Felsenstein Department of Genome Sciences and Department of Biology Lecture 24. Phylogeny methods, part 4 (Models of DNA and
More informationComputational Biology
Computational Biology Lecture 6 31 October 2004 1 Overview Scoring matrices (Thanks to Shannon McWeeney) BLAST algorithm Start sequence alignment 2 1 What is a homologous sequence? A homologous sequence,
More informationLecture 4. Models of DNA and protein change. Likelihood methods
Lecture 4. Models of DNA and protein change. Likelihood methods Joe Felsenstein Department of Genome Sciences and Department of Biology Lecture 4. Models of DNA and protein change. Likelihood methods p.1/36
More informationLecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26
Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) Joe Felsenstein Department of Genome Sciences and Department of Biology Lecture 27. Phylogeny methods, part 4 (Models of DNA and
More informationAmira A. ALHosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut
Amira A. ALHosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut UniversityEgypt Phylogenetic analysis Phylogenetic Basics: Biological
More informationSequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013
Sequence Alignments Dynamic programming approaches, scoring, and significance Lucy Skrabanek ICB, WMC January 31, 213 Sequence alignment Compare two (or more) sequences to: Find regions of conservation
More informationLecture 3: Markov chains.
1 BIOINFORMATIK II PROBABILITY & STATISTICS Summer semester 2008 The University of Zürich and ETH Zürich Lecture 3: Markov chains. Prof. Andrew Barbour Dr. Nicolas Pétrélis Adapted from a course by Dr.
More informationScoring Matrices. Shifra BenDor Irit Orr
Scoring Matrices Shifra BenDor Irit Orr Scoring matrices Sequence alignment and database searching programs compare sequences to each other as a series of characters. All algorithms (programs) for comparison
More informationLocal Alignment Statistics
Local Alignment Statistics Stephen Altschul National Center for Biotechnology Information National Library of Medicine National Institutes of Health Bethesda, MD Central Issues in Biological Sequence Comparison
More informationDr. Amira A. ALHosary
Phylogenetic analysis Amira A. ALHosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut UniversityEgypt Phylogenetic Basics: Biological
More informationTHEORY. Based on sequence Length According to the length of sequence being compared it is of following two types
Exp 11 THEORY Sequence Alignment is a process of aligning two sequences to achieve maximum levels of identity between them. This help to derive functional, structural and evolutionary relationships between
More informationWhat Is Conservation?
What Is Conservation? Lee A. Newberg February 22, 2005 A Central Dogma Junk DNA mutates at a background rate, but functional DNA exhibits conservation. Today s Question What is this conservation? Lee A.
More informationBIO 285/CSCI 285/MATH 285 Bioinformatics Programming Lecture 8 Pairwise Sequence Alignment 2 And Python Function Instructor: Lei Qian Fisk University
BIO 285/CSCI 285/MATH 285 Bioinformatics Programming Lecture 8 Pairwise Sequence Alignment 2 And Python Function Instructor: Lei Qian Fisk University Measures of Sequence Similarity Alignment with dot
More informationSubstitution matrices
Introduction to Bioinformatics Substitution matrices Jacques van Helden Jacques.vanHelden@univamu.fr Université d AixMarseille, France Lab. Technological Advances for Genomics and Clinics (TAGC, INSERM
More informationSequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
Sequence comparison: Score matrices Genome 559: Introduction to Statistical and omputational Genomics Prof James H Thomas Informal inductive proof of best alignment path onsider the last step in the best
More informationEdward Susko Department of Mathematics and Statistics, Dalhousie University. Introduction. Installation
1 dist est: Estimation of RatesAcrossSites Distributions in Phylogenetic Subsititution Models Version 1.0 Edward Susko Department of Mathematics and Statistics, Dalhousie University Introduction The
More informationPOPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics
POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics  in deriving a phylogeny our goal is simply to reconstruct the historical relationships between a group of taxa.  before we review the
More informationSequence comparison: Score matrices
Sequence comparison: Score matrices http://facultywashingtonedu/jht/gs559_2013/ Genome 559: Introduction to Statistical and omputational Genomics Prof James H Thomas FYI  informal inductive proof of best
More informationCSE 549: Computational Biology. Substitution Matrices
CSE 9: Computational Biology Substitution Matrices How should we score alignments So far, we ve looked at arbitrary schemes for scoring mutations. How can we assign scores in a more meaningful way? Are
More informationSequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
Sequence comparison: Score matrices Genome 559: Introduction to Statistical and omputational Genomics Prof James H Thomas FYI  informal inductive proof of best alignment path onsider the last step in
More informationSequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University
Sequence Alignment: A General Overview COMP 571  Fall 2010 Luay Nakhleh, Rice University Life through Evolution All living organisms are related to each other through evolution This means: any pair of
More informationPhylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University
Phylogenetics: Distance Methods COMP 571  Spring 2015 Luay Nakhleh, Rice University Outline Evolutionary models and distance corrections Distancebased methods Evolutionary Models and Distance Correction