
Moreover, the circular logic
How do we know what the right distance is without a good alignment?
And how do we construct a good alignment without knowing what substitutions were made previously?

  ATGCGT--GCAAGT
  --GCGGCGGCAGTT

  ATGCGT--GCAAGT-
  --GCGGCGGC-AGTT

Which is better? The first has fewer gaps but 7 matches; the second has more gaps but 8 matches.
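The trade-off between the two candidate alignments can be made concrete by counting columns. The sketch below is only an illustration: the function names and the default scores (match +1, mismatch -1, gap -2) are assumptions, not values given on this slide.

```python
def alignment_stats(top, bottom):
    """Count (matches, mismatches, gap columns) in a pairwise alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    matches = mismatches = gaps = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gaps += 1          # a column containing a gap symbol
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


def alignment_score(top, bottom, match=1, mismatch=-1, gap=-2):
    """Score an alignment with assumed per-column scores (hypothetical defaults)."""
    m, x, g = alignment_stats(top, bottom)
    return m * match + x * mismatch + g * gap


first = ("ATGCGT--GCAAGT", "--GCGGCGGCAGTT")
second = ("ATGCGT--GCAAGT-", "--GCGGCGGC-AGTT")
for name, (top, bottom) in (("first", first), ("second", second)):
    print(name, alignment_stats(top, bottom),
          alignment_score(top, bottom, gap=-2),
          alignment_score(top, bottom, gap=-1))
```

With a gap score of -2 the first alignment scores higher, while with -1 the second one does, so "which is better" depends entirely on the scoring system chosen.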

Substitution matrices for proteins
We need to compare DNA or protein sequences and estimate their distances (helpful, for example, for inferring molecular function by finding similarity to a sequence with known function).
For that purpose we need a good alignment of the sequences.
For a good alignment we need a scoring system: a measure for judging the quality of an alignment in relation to other possible alignments.

A practical aspect as well, since we don't study very diverged DNA sequences

Commonly multiplied by a power of 10 to deal with decimals

A sample substitution matrix

134 LQQGELDLVMTSDILPRSELHYSPMFDFEVRLVLAPDHPLASKTQITPEDLASETLLI
137 LDSNSVDLVLMGVPPRNVEVEAEAFMDNPLVVIAPPDHPLAGERAISLARLAEETFVM

BLOSUM62 (example entries: D:D = +6, D:R = -2)

C   9
S  -1   4
T  -1   1   5
P  -3  -1  -1   7
A   0   1   0  -1   4
G  -3   0  -2  -2   0   6
N  -3   1   0  -2  -2   0   6
D  -3   0  -1  -1  -2  -1   1   6
E  -4   0  -1  -1  -1  -2   0   2   5
Q  -3   0  -1  -1  -1  -2   0   0   2   5
H  -3  -1  -2  -2  -2  -2   1  -1   0   0   8
R  -3  -1  -1  -2  -1  -2   0  -2   0   1   0   5
K  -3   0  -1  -1  -1  -2   0  -1   1   1  -1   2   5
M  -1  -1  -1  -2  -1  -3  -2  -3  -2   0  -2  -1  -1   5
I  -1  -2  -1  -3  -1  -4  -3  -3  -3  -3  -3  -3  -3   1   4
L  -1  -2  -1  -3  -1  -4  -3  -4  -3  -2  -3  -2  -2   2   2   4
V  -1  -2   0  -2   0  -3  -3  -3  -2  -2  -3  -3  -2   1   3   1   4
F  -2  -2  -2  -4  -2  -3  -3  -3  -3  -3  -1  -3  -3   0   0   0  -1   6
Y  -2  -2  -2  -3  -2  -3  -2  -3  -2  -1   2  -2  -2  -1  -1  -1  -1   3   7
W  -2  -3  -2  -4  -3  -2  -4  -4  -3  -2  -2  -3  -3  -1  -3  -2  -3   1   2  11
    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W

BLOSUM: Brief Overview
Based purely on counts and alignment.
BLOSUM65 < BLOSUM45 in evolutionary distance.
Take a set of sequences $S = S_1 \ldots S_k$, where $S_i = s_{i1} \ldots s_{in}$ and $s_{ij}$ is the j-th amino acid in sequence $S_i$.
The probability of observing each amino acid $X_j$ is
\[ p(X_j) = \frac{\sum_{i=1}^{k} c(X_j, S_i)}{\sum_{j'=1}^{20} \sum_{i=1}^{k} c(X_{j'}, S_i)} \]
where $c(X_j, S_i)$ is the count of amino acid $X_j$ in sequence $S_i$.
\[ P(X_l, X_m \mid \text{random}) = 2\, p(X_l)\, p(X_m), \quad \text{or } p(X_l)^2 \text{ if } l = m \]

In general: log-odds for alignment
\[ P(X_l, X_m \mid \text{data}) = \frac{\#\text{ of } X_l, X_m \text{ combinations}}{\text{all possible pairwise combinations}} \]
Finally, take the log2-odds likelihood ratio and multiply by 2 for easy representation.
Note: all possible pairwise combinations $= k \cdot \{ n(n-1)/2 \}$.
BLOSUM does use a pseudocount to accommodate combinations not observed: add 1 to the numerator and 210 (= 20·19/2 + 20, the number of distinct unordered amino acid pairs) to the denominator, i.e.
\[ P(X_l, X_m \mid \text{data}) = \frac{\text{count}(X_l, X_m) + 1}{k \cdot \{ n(n-1)/2 \} + 210} \]
\[ \text{subs. matrix} = \lambda \log \frac{P(X_l, X_m \mid \text{data})}{P(X_l, X_m \mid \text{random})} \]
with $\lambda$ being a scaling factor for representation.
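As a rough illustration of the log-odds recipe, the sketch below builds scores from a tiny gap-free block of aligned sequences. Everything named here (`log_odds_scores`, `pseudo_num`, `pseudo_den`, the toy block) is hypothetical, and the total number of pairs is computed column by column as (number of columns) × (pairs of sequences per column), which is an assumption about how the block is counted.

```python
import math
from collections import Counter
from itertools import combinations, combinations_with_replacement


def log_odds_scores(block, scale=2.0, pseudo_num=1, pseudo_den=210):
    """Log2-odds scores from a gap-free block of equal-length sequences."""
    depth, width = len(block), len(block[0])

    # Background probability p(X): residue count over all residues in the block.
    residue_counts = Counter("".join(block))
    p = {aa: c / (depth * width) for aa, c in residue_counts.items()}

    # Observed unordered residue pairs, counted within each column.
    pair_counts = Counter()
    for col in range(width):
        column = [seq[col] for seq in block]
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total_pairs = width * depth * (depth - 1) // 2

    scores = {}
    for a, b in combinations_with_replacement(sorted(p), 2):
        p_data = (pair_counts[(a, b)] + pseudo_num) / (total_pairs + pseudo_den)
        p_random = p[a] ** 2 if a == b else 2 * p[a] * p[b]
        # Unobserved pairs without a pseudocount get a score of -infinity.
        scores[(a, b)] = (scale * math.log2(p_data / p_random)
                          if p_data > 0 else float("-inf"))
    return scores


toy_block = ["AL", "AL", "AL", "GL"]   # made-up data, for illustration only
print(log_odds_scores(toy_block, pseudo_num=0, pseudo_den=0))
print(log_odds_scores(toy_block))
```

Running it without pseudocounts shows why they are needed: any pair never seen in the block would otherwise get a score of minus infinity.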

BLOSUM vs PAM vs others?
BLOSUM
  Based on a range of evolutionary periods; each matrix is constructed separately
  Indirectly accounts for interdependence of residues
  Range of sequences, range of replacements
PAM
  Based on extrapolation from a short evolutionary period; errors in PAM1 are magnified through PAM250
  Assumes a Markov process; too much extrapolation
  Rare replacements too infrequent to be represented accurately
Others
  Incorporation of secondary structure
  Use of structural alignments
  Transmembrane protein-specific matrices

How do we use the matrices for sequence alignment?

  AGGCTATCACCTGACCTCCAGGCCGATGCCC
  TAGCTATCACGACCGCGGTCGATTTGCCCGAC

  -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
  TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition
Given two strings $x = x_1 x_2 \ldots x_M$ and $y = y_1 y_2 \ldots y_N$, an alignment of x and y is an arrangement of x and y by position, where x and y can be padded with gap symbols to achieve the same length.

Why do we bother aligning?
Finding important molecular regions that are conserved across species
Given a new sequence, inferring its function based on similarity to another sequence
Determining the evolutionary constraints at work
Mutations in a population or a family of genes
Finding similar-looking sequences in a database
Inferring the secondary or tertiary structure of a sequence of interest
  Molecular modeling using a template (homology modeling)

Alignment is a path in the alignment matrix

  X: ACACACTA
  Y: AGCACACA

  X: A-CACACTA
  Y: AGCACAC-A

\[ \binom{m+n}{m} = \frac{(m+n)!}{m!\, n!} \approx 2^{m+n} \text{ alignments possible!} \]
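To get a feel for how quickly this count grows, the expression (m+n)!/(m! n!) can be evaluated directly. The helper name below is hypothetical; `math.comb` needs Python 3.8 or later.

```python
import math


def count_alignments(m, n):
    """The slide's count of alignments for lengths m and n: (m+n)! / (m! n!)."""
    return math.comb(m + n, m)


for length in (8, 20, 100):
    exact = count_alignments(length, length)
    bound = 2 ** (2 * length)
    print(length, exact, bound)
```

For the two 8-letter sequences on this slide the count is already 12870, and it is always bounded above by 2^(m+n), so brute-force enumeration is out of reach for realistic sequence lengths.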

Dynamic programming for global alignment: simple case
Consider two sequences, x[1..M] and y[1..N]. To align x[1..i] with y[1..j], the new position to be aligned has ONLY three choices:

1. x_i aligns to y_j
     x_1 ... x_{i-1}  x_i
     y_1 ... y_{j-1}  y_j
2. x_i aligns to a gap
     x_1 ... x_{i-1}  x_i
     y_1 ... y_j      -
   x[1..(i-1)] is already aligned with y[1..j], so align x[i] with a gap in y.
3. y_j aligns to a gap
     x_1 ... x_i      -
     y_1 ... y_{j-1}  y_j
   x[1..i] is already aligned with y[1..(j-1)], so align a gap in x to y[j].

1. x_i aligns to y_j
     x_1 ... x_{i-1}  x_i
     y_1 ... y_{j-1}  y_j
   \[ F(i,j) = F(i-1, j-1) + s(x_i, y_j) \]
2. x_i aligns to a gap
     x_1 ... x_{i-1}  x_i
     y_1 ... y_j      -
   \[ F(i,j) = F(i-1, j) - \text{gap\_open\_penalty (go)} \]
3. y_j aligns to a gap
     x_1 ... x_i      -
     y_1 ... y_{j-1}  y_j
   \[ F(i,j) = F(i, j-1) - \text{gap\_open\_penalty (go)} \]

If we could make F(i, j-1), F(i-1, j) and F(i-1, j-1) optimal, then we can make the next ones optimal as well:
\[ F(i,j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ F(i-1, j) - go \\ F(i, j-1) - go \end{cases} \]
where $s(x_i, y_j)$ is the score for a match if $x_i = y_j$, and the score for a mismatch if $x_i \neq y_j$.

The Needleman-Wunsch Algorithm
A pioneering application of DP to biological sequences.

1. Initialization.
   a. F(0, 0) = 0
   b. F(0, j) = -j · go
   c. F(i, 0) = -i · go
2. Main iteration. Filling in partial alignments.
   For each i = 1 ... M
     For each j = 1 ... N
       F(i, j) = max { F(i-1, j-1) + s(x_i, y_j)   [case 1]
                       F(i-1, j) - go              [case 2]
                       F(i, j-1) - go }            [case 3]
       Ptr(i, j) = DIAG, if [case 1]
                   LEFT, if [case 2]
                   UP,   if [case 3]
3. Termination. F(M, N) is the optimal score, and from Ptr(M, N) we can trace back the optimal alignment.
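A minimal Python sketch of the three steps, assuming a simple match/mismatch score for s and a linear gap penalty go entered as a positive number. The function name and default values are illustrative; the pointer labels DIAG, LEFT and UP follow the case numbering on the slide.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, go=2):
    """Global alignment; returns (score, aligned x, aligned y)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]

    # 1. Initialization: borders hold runs of gaps.
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * go, "LEFT"   # x_i aligned to a gap
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * go, "UP"     # y_j aligned to a gap

    # 2. Main iteration: fill in partial alignments.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                (F[i - 1][j - 1] + s, "DIAG"),  # case 1
                (F[i - 1][j] - go, "LEFT"),     # case 2
                (F[i][j - 1] - go, "UP"),       # case 3
            ]
            # On ties, max() keeps the first candidate, so DIAG is preferred.
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])

    # 3. Termination: trace back from (M, N) to (0, 0).
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "LEFT":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))


print(needleman_wunsch("TCGCA", "TCCA"))
```

The traceback returns only one optimal alignment; when several cells tie, other equally good alignments exist and are not enumerated here.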

An example
s(x_i, y_j) = 1 for a match, -1 for a mismatch; go = 2
\[ F(i,j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ F(i-1, j) - go \\ F(i, j-1) - go \end{cases} \]

          T    C    G    C    A
     0   -2   -4   -6   -8  -10
T   -2    1   -1   -3   -5   -7
C   -4   -1    2    0   -2   -4
C   -6   -3    0    1    1   -1
A   -8   -5   -2   -1    0    2

Trace back the arrows from lower right to upper left, maximizing the scores.
Alignment score = 2

  T   C   G   C   A
  T   C   -   C   A
  1   1  -2   1   1

Are gaps that bad?
Current model: a gap of length n incurs penalty n · go.
A bunch of gaps together is better than individual gaps.
Convex (saturating) gap penalty function $\gamma(n)$: for all n,
\[ \gamma(n+1) - \gamma(n) \le \gamma(n) - \gamma(n-1) \]
A common function is the affine gap penalty
\[ \gamma(n) = go + (n-1)\, ge \]
where go is the gap-open penalty and ge is the gap-extend penalty.
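The difference between the two gap models is easy to see numerically. In this sketch the function names and the values go = 2, ge = 1 are assumptions chosen for illustration.

```python
def linear_gap(n, go=2):
    """Current model: a gap of length n costs n * go."""
    return n * go


def affine_gap(n, go=2, ge=1):
    """Affine penalty: go to open the gap, ge for each further position."""
    return 0 if n == 0 else go + (n - 1) * ge


def is_convex(gamma, max_n=50):
    """Check gamma(n+1) - gamma(n) <= gamma(n) - gamma(n-1) for n = 1..max_n-1."""
    return all(gamma(n + 1) - gamma(n) <= gamma(n) - gamma(n - 1)
               for n in range(1, max_n))


print("one gap of length 4:", affine_gap(4))
print("four gaps of length 1:", 4 * affine_gap(1))
print("linear model, either way:", linear_gap(4))
print("affine is convex:", is_convex(affine_gap))
```

Under the affine model one long gap is cheaper than the same number of gap positions scattered as separate gaps, which the linear model cannot express.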

Convex gap dynamic programming
Initialization: same as before
Iteration:
\[ F(i,j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ \max_{k = 0 \ldots i-1} F(k, j) - \gamma(i-k) \\ \max_{k = 0 \ldots j-1} F(i, k) - \gamma(j-k) \end{cases} \]
Termination: same
Running time: $O(N^2 M)$; space: $O(NM)$ (assume N > M)
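The iteration can be sketched directly, with the gap function passed in as a parameter. This is a score-only illustration: the function name, the match/mismatch scores, and the border initialization F(i,0) = -γ(i), F(0,j) = -γ(j) (the natural generalization of "same as before") are assumptions.

```python
def general_gap_score(x, y, gamma, match=1, mismatch=-1):
    """Optimal global alignment score for an arbitrary gap penalty gamma(n)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -gamma(i)
    for j in range(1, N + 1):
        F[0][j] = -gamma(j)

    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            best = F[i - 1][j - 1] + s
            # A gap of length i - k in y, ending at x_i.
            for k in range(i):
                best = max(best, F[k][j] - gamma(i - k))
            # A gap of length j - k in x, ending at y_j.
            for k in range(j):
                best = max(best, F[i][k] - gamma(j - k))
            F[i][j] = best
    return F[M][N]


linear = lambda n: 2 * n          # n * go with go = 2
affine = lambda n: 2 + (n - 1)    # go = 2, ge = 1
print(general_gap_score("AAATTTCCC", "AAACCC", linear))
print(general_gap_score("AAATTTCCC", "AAACCC", affine))
```

The two inner loops over k are what push the running time from quadratic to cubic, which motivates the faster affine-specific formulation.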

Fast implementation of affine gap penalty
We need three matrices for tracking scores:
Matrix 1: a[i,j] stores the maximum score of an alignment that ends in x[i] matched to y[j]
Matrix 2: b[i,j] stores the maximum score of an alignment that ends in a gap matched to y[j]
Matrix 3: c[i,j] stores the maximum score of an alignment that ends in a gap matched to x[i]

  a:  x_i     b:  -       c:  x_i
      y_j         y_j         -

Implementation, cont'd
\[ a[i,j] = \max \begin{cases} a[i-1, j-1] \\ b[i-1, j-1] \\ c[i-1, j-1] \end{cases} + s(x_i, y_j) \]
\[ b[i,j] = \max \begin{cases} a[i, j-1] + go \\ b[i, j-1] + ge \\ c[i, j-1] + go \end{cases} \]
\[ c[i,j] = \max \begin{cases} a[i-1, j] + go \\ b[i-1, j] + go \\ c[i-1, j] + ge \end{cases} \]
(Here go and ge are added, so they are entered as negative scores.)
Pointer matrices: three matrices to figure out which state within each score-matrix maximization was used to obtain the optimal alignment of position i,j.
Of course, you also need to know which matrix yielded the best score at the end (m,n).
Covered by assignment!
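A score-only sketch of the three-matrix recurrences follows. The slides do not spell out the border values, so the initialization below (a[0][0] = 0, a single growing gap along each border, minus infinity elsewhere) is an assumption, as are the function name and default scores. Gap scores are negative here because the recurrences add them.

```python
NEG_INF = float("-inf")


def affine_align_score(x, y, match=1, mismatch=-1, go=-2, ge=-1):
    """Optimal global score with affine gaps, using matrices a, b and c."""
    m, n = len(x), len(y)
    a = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends in x[i] matched to y[j]
    b = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends in gap matched to y[j]
    c = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends in gap matched to x[i]

    a[0][0] = 0
    for j in range(1, n + 1):
        b[0][j] = go + (j - 1) * ge   # one gap covering y[1..j]
    for i in range(1, m + 1):
        c[i][0] = go + (i - 1) * ge   # one gap covering x[1..i]

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1]) + s
            b[i][j] = max(a[i][j - 1] + go, b[i][j - 1] + ge, c[i][j - 1] + go)
            c[i][j] = max(a[i - 1][j] + go, b[i - 1][j] + go, c[i - 1][j] + ge)

    # The best alignment may end in any of the three states.
    return max(a[m][n], b[m][n], c[m][n])


print(affine_align_score("AAATTTCCC", "AAACCC"))
print(affine_align_score("AAATTTCCC", "AAACCC", go=-2, ge=-2))
```

Setting ge equal to go reduces the model to the linear gap penalty, which gives a quick sanity check against a plain Needleman-Wunsch score.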

For checking the results of your code:
http://www.ebi.ac.uk/tools/emboss/align/index.html
http://www.ch.embnet.org/software/lalign_form.html

Pair HMMs for alignment
An HMM for sequence alignment that incorporates affine gap scores.
Hidden states: Match (M), Insertion in x (X), Insertion in y (Y)
Observation symbols:
  Match (M): $\{(a,b) \mid a, b \in \Sigma\}$
  Insertion in x (X): $\{(a,-) \mid a \in \Sigma\}$
  Insertion in y (Y): $\{(-,a) \mid a \in \Sigma\}$

Simple representation
[State diagram: M emits $p_{x_i y_j}$, X emits $q_{x_i}$, Y emits $q_{y_j}$]
Transition probabilities (column = from state, row = to state):

        M       X       Y
M     1-2δ     1-ε     1-ε
X      δ        ε       0
Y      δ        0       ε

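Full representation
[State diagram with Begin and End states: M emits $p_{x_i y_j}$, X emits $q_{x_i}$, Y emits $q_{y_j}$]
Transitions:
  Begin -> M: 1-2δ-τ;  Begin -> X: δ;  Begin -> Y: δ;  Begin -> End: τ
  M -> M: 1-2δ-τ;  M -> X: δ;  M -> Y: δ;  M -> End: τ
  X -> X: ε;  X -> M: 1-ε-τ;  X -> End: τ
  Y -> Y: ε;  Y -> M: 1-ε-τ;  Y -> End: τ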

Algorithm: Viterbi algorithm for pair HMMs
Initialization: $v^M(0,0) = 1$, $v^X(0,0) = v^Y(0,0) = 0$, and $v^{\bullet}(-1, j) = v^{\bullet}(i, -1) = 0$.
Recurrence: for i = 0, ..., n and j = 0, ..., m, except for (0,0):
\[ v^M(i,j) = p_{x_i y_j} \max \begin{cases} (1 - 2\delta - \tau)\, v^M(i-1, j-1) \\ (1 - \varepsilon - \tau)\, v^X(i-1, j-1) \\ (1 - \varepsilon - \tau)\, v^Y(i-1, j-1) \end{cases} \]
\[ v^X(i,j) = q_{x_i} \max \begin{cases} \delta\, v^M(i-1, j) \\ \varepsilon\, v^X(i-1, j) \end{cases} \]
\[ v^Y(i,j) = q_{y_j} \max \begin{cases} \delta\, v^M(i, j-1) \\ \varepsilon\, v^Y(i, j-1) \end{cases} \]
Termination:
\[ v^E = \tau \max\big( v^M(n,m),\ v^X(n,m),\ v^Y(n,m) \big) \]
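The recurrences translate almost line by line into code. The sketch below works in plain probability space, which is fine for short sequences but would underflow on long ones (a log-space version would be needed there). The function name, the emission functions `p` and `q`, and all parameter values in the example are assumptions made for illustration.

```python
def pair_hmm_viterbi(x, y, p, q, delta, eps, tau):
    """Most probable state path of a pair HMM; returns (probability, path)."""
    n, m = len(x), len(y)
    v = {s: [[0.0] * (m + 1) for _ in range(n + 1)] for s in "MXY"}
    back = {}  # (state, i, j) -> previous state on the best path
    v["M"][0][0] = 1.0

    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                cands = [((1 - 2 * delta - tau) * v["M"][i - 1][j - 1], "M"),
                         ((1 - eps - tau) * v["X"][i - 1][j - 1], "X"),
                         ((1 - eps - tau) * v["Y"][i - 1][j - 1], "Y")]
                best, back["M", i, j] = max(cands, key=lambda c: c[0])
                v["M"][i][j] = p(x[i - 1], y[j - 1]) * best
            if i > 0:
                cands = [(delta * v["M"][i - 1][j], "M"),
                         (eps * v["X"][i - 1][j], "X")]
                best, back["X", i, j] = max(cands, key=lambda c: c[0])
                v["X"][i][j] = q(x[i - 1]) * best
            if j > 0:
                cands = [(delta * v["M"][i][j - 1], "M"),
                         (eps * v["Y"][i][j - 1], "Y")]
                best, back["Y", i, j] = max(cands, key=lambda c: c[0])
                v["Y"][i][j] = q(y[j - 1]) * best

    # Termination: pick the best final state, then trace back to (0, 0).
    state = max("MXY", key=lambda s: v[s][n][m])
    prob = tau * v[state][n][m]
    path, i, j = [], n, m
    while (i, j) != (0, 0):
        path.append(state)
        prev = back[state, i, j]
        if state in "MX":
            i -= 1
        if state in "MY":
            j -= 1
        state = prev
    return prob, "".join(reversed(path))


# Toy DNA emissions (assumed): 4 * 0.19 + 12 * 0.02 = 1 for pairs, 0.25 for singles.
p = lambda a, b: 0.19 if a == b else 0.02
q = lambda a: 0.25
print(pair_hmm_viterbi("ACGT", "AGT", p, q, delta=0.2, eps=0.1, tau=0.1))
```

In the returned path, M means a pair of residues was emitted, X means a residue of x was aligned to a gap, and Y means a residue of y was aligned to a gap.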

Viterbi for pair HMMs is equivalent to global dynamic programming with affine gaps!
Proof (sketch): add the following missing details to DEKM's summary.
Take the Viterbi matrices, divide by the terms for the random-sequence probability, and take the log of the resulting transformation, i.e.
\[ v^M(i,j) = \frac{p_{x_i y_j}}{q_{x_i} q_{y_j} (1-\eta)^2} \max \begin{cases} (1 - 2\delta - \tau)\, v^M(i-1, j-1) \\ (1 - \varepsilon - \tau)\, v^X(i-1, j-1) \\ (1 - \varepsilon - \tau)\, v^Y(i-1, j-1) \end{cases} \]
\[ v^X(i,j) = \frac{q_{x_i}}{q_{x_i} (1-\eta)} \max \begin{cases} \delta\, v^M(i-1, j) \\ \varepsilon\, v^X(i-1, j) \end{cases} \]

Multiple Sequence Alignment (MSA)
How to score? And how to align?
Scoring, common approach: sum of pairs, or its variants, for each column:
\[ S(i) = \sum_{\text{all possible pairs}} s(x_i, y_i) \]
Consider a column with L aligned in all N sequences; LL score = 5, so the total score is
\[ S(i) = 5\, \frac{N(N-1)}{2} \]
Now consider one position being G, the rest being L; GL score = -4:
\[ \frac{\Delta S(i)}{S(i)} = \frac{5(N-1) + 4(N-1)}{2.5\, N(N-1)} = \frac{9(N-1)}{2.5\, N(N-1)} = \frac{9}{2.5N} \]
Relative evidence decreases with N, as opposed to our increased confidence in L being the "correct" residue at that position.
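The arithmetic on this slide can be checked with a few lines of code. The helper names and the two-entry scoring function are hypothetical and cover only the L/G example used here (LL = 5, GL = -4).

```python
from itertools import combinations


def sum_of_pairs(column, score):
    """Sum-of-pairs score of one alignment column."""
    return sum(score(a, b) for a, b in combinations(column, 2))


def toy_score(a, b):
    """Only the two scores quoted on the slide are defined."""
    if a == "L" and b == "L":
        return 5
    if {a, b} == {"G", "L"}:
        return -4
    raise KeyError((a, b))


for n_seqs in (5, 10, 100):
    all_l = sum_of_pairs("L" * n_seqs, toy_score)
    one_g = sum_of_pairs("G" + "L" * (n_seqs - 1), toy_score)
    relative_drop = (all_l - one_g) / all_l
    print(n_seqs, all_l, one_g, relative_drop, 9 / (2.5 * n_seqs))
```

The last two printed values agree for every N, confirming that the relative penalty for the single G shrinks as 9/(2.5N) while the number of sequences supporting L grows.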

Clustal W
1. Pairwise alignment: calculation of the distance matrix
2. Rooted NJ tree (guide tree) and calculation of sequence weights
3. Progressive alignment following the guide tree

Step 1-Calculation of Distance Matrix using pairwise alignment
Use the Distance Matrix to create a Guide Tree to determine the order of the sequences.

Hbb-Hu  1    -
Hbb-Ho  2   .17    -
Hba-Hu  3   .59   .60    -
Hba-Ho  4   .59   .59   .13    -
Myg-Ph  5   .77   .77   .75   .75    -
Gib-Pe  6   .81   .82   .73   .74   .80    -
Lgb-Lu  7   .87   .86   .86   .88   .93   .90    -
             1     2     3     4     5     6     7

D = 1 - I
D = difference score
I = (# of identical aa's in pairwise global alignment) / (total number of aa's in shortest sequence)
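A small sketch of the difference score D = 1 - I for one pair of already aligned sequences. The function name is hypothetical, and the example strings are short DNA toys rather than the globin sequences in the table.

```python
def difference_score(aligned_a, aligned_b):
    """D = 1 - I, with I = identical positions / length of the shorter sequence."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned strings must have equal length")
    identical = sum(1 for a, b in zip(aligned_a, aligned_b)
                    if a == b and a != "-")
    # Sequence lengths are measured without gap symbols.
    shortest = min(len(aligned_a.replace("-", "")),
                   len(aligned_b.replace("-", "")))
    return 1 - identical / shortest


print(difference_score("TCGCA", "TC-CA"))
print(difference_score("ACGT", "AGGT"))
```

Filling the full matrix means running a pairwise global alignment for every pair of sequences and applying this function to each result.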

Step 2-Create Rooted Tree and calculate weights
Neighbor joining algorithm: simple, to be discussed later.
[Figure: guide tree; panel labels: Weight, Alignment]
Order of alignment:
1 Hba-Hu vs Hba-Ho
2 Hbb-Hu vs Hbb-Ho
3 A vs B
4 Myg-Ph vs C
5 Gib-Pe vs D
6 Lgb-Lu vs E

Step 3-Progressive alignment

Step 3-Progressive alignment Scoring during progressive alignment

Recommended MSA Programs
MUSCLE (fast and accurate)
MAVID (genome-scale alignment)
SAM (hidden Markov models; powerful, with a wide range of options)