Multiple Sequence Alignment

Given:
- A set of sequences
- A score matrix
- Gap penalties

Find: an alignment of the sequences that achieves the optimal score.

Motivation

Aligning protein families:
- Establish evolutionary relationships
- Identify important functional regions
- Yield structural clues

Aligning non-coding DNA sequences:
- Find conserved regions in DNA that control expression
- Infer evolutionary relationships
- Identify important functional regions

Motivation

- Binding sites (DNA sequence motifs) may be conserved within a species, to control expression in a concerted fashion.
- The sites may be conserved across species, which use similar control mechanisms.
- They may diverge within and across species for a special purpose or through evolutionary drift.

Scoring a Multiple Alignment

S1: GA_TCA
S2: GTCTGA
S3: GATATT

Alignment score = sum of column scores. The challenge is to find a reasonable way of obtaining a cumulative score for the substitutions in each column.

SP (Sum of Pairs) measure, a popular method: compute the pairwise scores of all pairs of symbols in the column and sum them. For the column (I, -, I, V):

$\text{SP-score}(I, -, I, V) = p(I,-) + p(I,I) + p(I,V) + p(-,I) + p(-,V) + p(I,V)$

The gap penalty can be constant, linear, or nonlinear; the choice affects the computational complexity. What is $p(-,-)$? Usually 0.

For a whole alignment $\alpha$:

$\text{SP-score}(\alpha) = \sum_{i<j} \text{score}(\alpha_{ij})$

where $\alpha_{ij}$ is the pairwise alignment induced by $\alpha$ on sequences $s_i$ and $s_j$.
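The sum-of-pairs definition can be sketched in a few lines of Python. The pair score `pair_score` and its match, mismatch and gap values are placeholders chosen for illustration, not values from the slides; a real scorer would look up a substitution matrix.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Hypothetical pairwise score p(a, b); the numbers are illustrative only."""
    if a == "-" and b == "-":
        return 0  # p(-,-) = 0, the usual convention
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sp_column(column, p=pair_score):
    """Sum p over all unordered pairs of symbols in one alignment column."""
    return sum(p(a, b) for a, b in combinations(column, 2))

def sp_score(alignment, p=pair_score):
    """SP-score of an alignment given as equal-length rows: sum of column scores."""
    return sum(sp_column(col, p) for col in zip(*alignment))

# The column (I, -, I, V) has six pairs: (I,-), (I,I), (I,V), (-,I), (-,V), (I,V).
print(sp_column(("I", "-", "I", "V")))
```

Summing over columns and summing over induced pairwise alignments give the same total, because every pair of rows contributes exactly one term per column.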

A Problem with SP-score

Example with five sequences, scored with BLOSUM62, where $s(N,N) = 6$, $s(N,C) = -3$ and $s(C,C) = 9$:

              Column A   Column B   Column C
Residues      NNNNN      NNNNC      NNNCC
N-N pairs     10         6          3
N-C pairs     0          4          6
C-C pairs     0          0          1
Score         60         24         9

The scores decrease rapidly! SP-score tends to overweight the influence of mutations.

A Problem with SP-score

For n sequences, let column A contain n copies of N and column B contain n - 1 copies of N and one C. With $s(N,N) = 6$ and $s(N,C) = -3$:

Score of column A = $6 \cdot n(n-1)/2$
Score of column B = $6 \cdot n(n-1)/2 - 9(n-1)$

Relative difference: $9(n-1) / [6 \cdot n(n-1)/2] = 3/n$, an inverse dependence on n.

This is counter-intuitive: the relative difference should increase as we gather more evidence for a conserved asparagine N.

Multiple Alignments: Scoring

- Sum of pairs (SP-score)
- Number of matches (multiple longest common subsequence score)
- Entropy score

Multiple LCS Score

A column is a match if all the letters in the column are the same.

AAA
AAA
AAT
ATC

Only good for very similar sequences.

Methods for Multiple Alignment

- Dynamic programming
- Progressive alignment
  - Star
  - CLUSTALW
- Iterative alignment
- Hidden Markov model

Aligning Three Sequences

s1: GATTCA
s2: GTCTGA
s3: GATATT
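The 3/n relationship can be checked numerically. This short sketch assumes the BLOSUM62 entries s(N,N) = 6 and s(N,C) = -3; the function names are made up for the illustration.

```python
from fractions import Fraction

S_NN, S_NC = 6, -3  # BLOSUM62 entries for the pairs N-N and N-C

def column_a(n):
    """SP-score of a column of n asparagines: n(n-1)/2 N-N pairs."""
    return S_NN * n * (n - 1) // 2

def column_b(n):
    """SP-score of a column with n-1 asparagines and one C."""
    nn_pairs = (n - 1) * (n - 2) // 2
    nc_pairs = n - 1
    return S_NN * nn_pairs + S_NC * nc_pairs

def relative_difference(n):
    """(score A - score B) / score A, kept exact with Fraction."""
    return Fraction(column_a(n) - column_b(n), column_a(n))

for n in (5, 10, 100):
    print(n, column_a(n), column_b(n), relative_difference(n))
```

As n grows, the single mismatch weighs less and less in relative terms, which is the counter-intuitive behaviour of the SP-score.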

Aligning Three Sequences

- Same strategy as aligning two sequences.
- Use a 3-D "Manhattan cube", with each axis representing a sequence to align.
- For global alignments, go from source to sink.

2-D vs 3-D Alignment Grid

[Figure: 2-D edit graph, with axes V and W and a path from source to sink, next to the 3-D edit graph]

2-D Cell versus 3-D Alignment Cell

In 2-D there are 3 edges in each unit square; in 3-D there are 7 edges in each unit cube.

Architecture of 3-D Alignment Cell

The cell has eight corners: (i-1,j-1,k-1), (i-1,j-1,k), (i-1,j,k-1), (i-1,j,k), (i,j-1,k-1), (i,j-1,k), (i,j,k-1) and (i,j,k).

Alignment Paths

x coordinate:   0   1   1   2   3   4
                  A   --  T   G   C
y coordinate:   0   1   2   3   3   4
                  A   A   T   --  C
z coordinate:   0   0   1   2   3   4
                  --  A   T   G   C

Resulting path in (x,y,z) space: (0,0,0) -> (1,1,0) -> (1,2,1) -> (2,3,2) -> (3,3,3) -> (4,4,4)

Multiple Alignment: Dynamic Programming

$$
s_{i,j,k} = \max \begin{cases}
s_{i-1,j-1,k-1} + \delta(v_i, w_j, u_k) & \text{cube diagonal: no indels} \\
s_{i-1,j-1,k} + \delta(v_i, w_j, \_) & \text{face diagonal: one indel} \\
s_{i-1,j,k-1} + \delta(v_i, \_, u_k) & \text{face diagonal: one indel} \\
s_{i,j-1,k-1} + \delta(\_, w_j, u_k) & \text{face diagonal: one indel} \\
s_{i-1,j,k} + \delta(v_i, \_, \_) & \text{edge: two indels} \\
s_{i,j-1,k} + \delta(\_, w_j, \_) & \text{edge: two indels} \\
s_{i,j,k-1} + \delta(\_, \_, u_k) & \text{edge: two indels}
\end{cases}
$$

$\delta(x, y, z)$ is an entry in the 3-D scoring matrix.
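A minimal sketch of the three-sequence recurrence follows. It assumes a sum-of-pairs column score `delta` with illustrative match, mismatch and gap values that are not taken from the slides, and it returns only the optimal score, not the alignment itself.

```python
from itertools import product

def delta(a, b, c, match=1, mismatch=-1, gap=-1):
    """Hypothetical column score: sum of pairs over the three symbols."""
    total = 0
    for x, y in ((a, b), (a, c), (b, c)):
        if x == "-" and y == "-":
            continue  # a gap against a gap scores 0
        if x == "-" or y == "-":
            total += gap
        else:
            total += match if x == y else mismatch
    return total

def align3_score(v, w, u):
    """Optimal global alignment score of three sequences by 3-D dynamic programming."""
    n, m, l = len(v), len(w), len(u)
    neg_inf = float("-inf")
    s = [[[neg_inf] * (l + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    s[0][0][0] = 0
    # The 7 incoming edges of a cell: every non-zero step in {0,1}^3.
    moves = [mv for mv in product((0, 1), repeat=3) if any(mv)]
    for i, j, k in product(range(n + 1), range(m + 1), range(l + 1)):
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            a = v[i - 1] if di else "-"
            b = w[j - 1] if dj else "-"
            c = u[k - 1] if dk else "-"
            s[i][j][k] = max(s[i][j][k], s[pi][pj][pk] + delta(a, b, c))
    return s[n][m][l]

print(align3_score("GATTCA", "GTCTGA", "GATATT"))
```

The table has (n+1)(m+1)(l+1) cells and each cell inspects 7 predecessors, which is where the 7n^3 running time for three sequences of length n comes from.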

Multiple Alignment: Running Time

- For 3 sequences of length n, the run time is $7n^3$, which is $O(n^3)$.
- For k sequences, build a k-dimensional Manhattan grid, with run time $(2^k - 1)\,n^k$, which is $O(2^k n^k)$.

Conclusion: the dynamic programming approach for aligning two sequences is easily extended to k sequences, but it is impractical because of its exponential running time.

Progressive Alignment

- Star method
- CLUSTALW

Multiple Alignment Induces Pairwise Alignments

Every multiple alignment induces pairwise alignments:

x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induces:

x: ACGCGG-C      x: AC-GCGG-C      y: AC-GCGAG
y: ACGC-GAG      z: GCCGC-GAG      z: GCCGCGAG

Reverse Problem: Constructing Multiple Alignment from Pairwise Alignments

Given 3 arbitrary pairwise alignments:

x: ACGCTGG-C     x: AC-GCTGG-C     y: AC-GC-GAG
y: ACGC--GAG     z: GCCGCA-GAG     z: GCCGCAGAG

can we construct a multiple alignment that induces them? NOT ALWAYS.

The STAR Alignment Method

Using a pairwise alignment method, find the sequence $s_i$ that is most similar to all the other sequences, that is, the one that maximizes $\sum_{j \ne i} \text{score}(\alpha_{ij})$.

Using this best sequence as the center (of a star, hence the name), align the other sequences to it following the "once a gap, always a gap" rule.

Ex:

S1: ATTGCCATT
S2: ATGGCCATT
S3: ATCCAATTTT
S4: ATCTTCTT
S5: ACTGACC

More on STAR Alignment

Assume the following similarity matrix for the pairwise comparison of the sequences:

        S1    S2    S3    S4    S5    Σ score(α_ij)
S1      -     7     -2    0     -3    2
S2      7     -     -2    0     -4    1
S3      -2    -2    -     0     -7    -11
S4      0     0     0     -     -3    -3
S5      -3    -4    -7    -3    -     -17

Choose S1 to be the center of the star!

[Figure: star with S1 at the center and S2, S3, S4, S5 as leaves]
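Projecting a multiple alignment onto two of its rows is mechanical: keep the two rows and drop every column in which both have a gap. A short sketch, with the function name `induced_pairwise` chosen here for illustration:

```python
def induced_pairwise(msa, a, b):
    """Project a multiple alignment onto rows a and b, dropping gap-gap columns."""
    cols = [(x, y) for x, y in zip(msa[a], msa[b]) if (x, y) != ("-", "-")]
    row_a = "".join(x for x, _ in cols)
    row_b = "".join(y for _, y in cols)
    return row_a, row_b

msa = {
    "x": "AC-GCGG-C",
    "y": "AC-GC-GAG",
    "z": "GCCGC-GAG",
}

for a, b in (("x", "y"), ("x", "z"), ("y", "z")):
    print(a, b, induced_pairwise(msa, a, b))
```

The reverse direction has no such simple procedure, since three arbitrary pairwise alignments need not be projections of any single multiple alignment.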

More on STAR Alignment

Next we get the best alignment between S1 and each of the other sequences:

S1: ATTGCCATT
S2: ATGGCCATT

S1: ATTGCCATT--
S3: ATC-CAATTTT

S1: ATTGCCATT
S4: ATCTTC-TT

S1: ATTGCCATT
S5: ACTGACC--

More on STAR Alignment

Build the MSA starting with S1 and S2:

ATTGCCATT
ATGGCCATT

Add S3 using "once a gap, always a gap":

ATTGCCATT--
ATGGCCATT--
ATC-CAATTTT

Repeat to include all the sequences:

ATTGCCATT--
ATGGCCATT--
ATC-CAATTTT
ATCTTC-TT--
ACTGACC----

Complexity of STAR Alignment

The time complexity of the STAR method is dominated by computing the pairwise alignments.

- For k sequences there are $O(k^2)$ pairs.
- Each pairwise alignment takes $O(n^2)$, where n is the length of each sequence.
- Cost of computing all pairwise alignments: $O((kn)^2)$.
- Cost of merging the sequences into an MSA: if $n_{max}$ is the upper bound of the alignment length, one merge takes $O(k\,n_{max})$, so all merges take $O(k^2 n_{max})$.

The total time complexity of the STAR method is $O(k^2 n^2 + k^2 n_{max})$.

Profile Alignment

Problem with the Star approach: all alignments are determined by pairwise sequence alignments.

Profile alignment uses position-specific information from a group's multiple alignment to align a new sequence to it:
- Mismatches at highly conserved positions should be penalized more.
- Gaps should be penalized more at positions where few gaps occur.

Scoring function: SP-score.

Profile Alignment

Aligning two multiple alignments (profiles) using the SP-score. The first profile holds sequences 1 to k, the second holds sequences k+1 to K:

1.    ATTGCCATT
k.    ATGGCCATT

k+1.  ATC-CAAT
K.    A-CTGAAC

Recall: $\text{SP-score}(\alpha) = \sum_{i<j} \text{score}(\alpha_{ij})$. Splitting the pairs by profile gives

$$
\text{SP-score}(\alpha) = \sum_{i<j \le k} \text{score}(\alpha_{ij}) + \sum_{k<i<j \le K} \text{score}(\alpha_{ij}) + \sum_{i \le k} \sum_{k<j \le K} \text{score}(\alpha_{ij})
$$

The first two terms are fixed by the alignments inside each profile; only the last term, over pairs with one sequence in each profile, needs to be optimized. The alignment can therefore be done exactly like a standard pairwise alignment!

CLUSTALW

CLUSTALW is a progressive method:
- Use a pairwise alignment method to determine the most related sequences.
- Progressively add less related sequences, or groups of sequences, to the initial alignment.

CLUSTAL family:
- CLUSTAL: gives equal weight to all sequences
- CLUSTALW: can give different weights to the sequences and other program parameters
- CLUSTALX: provides a GUI to CLUSTAL
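The "once a gap, always a gap" merge can be sketched as follows. The function `merge_star` is a hypothetical helper: it assumes every pairwise alignment lists the same center sequence first, and it places a new insertion in an existing gap column when both alignments have a center gap at the same position, which is one reasonable choice among several.

```python
def merge_star(center_pairs):
    """Merge pairwise alignments (aligned_center, aligned_other) into one MSA.

    Row 0 of the result is the center; gaps, once inserted, are never removed.
    """
    msa = [center_pairs[0][0].replace("-", "")]  # start from the ungapped center
    for cen, oth in center_pairs:
        master = msa[0]
        new_rows = [[] for _ in msa]
        new_oth = []
        i = j = 0
        while i < len(master) or j < len(cen):
            m_gap = i < len(master) and master[i] == "-"
            c_gap = j < len(cen) and cen[j] == "-"
            if m_gap and not c_gap:
                # Gap column already in the MSA: keep it, pad the new sequence.
                for row, old in zip(new_rows, msa):
                    row.append(old[i])
                new_oth.append("-")
                i += 1
            elif c_gap and not m_gap:
                # New gap in the center: open a gap column in every existing row.
                for row in new_rows:
                    row.append("-")
                new_oth.append(oth[j])
                j += 1
            else:
                # Columns coincide (both residues, or both gaps).
                for row, old in zip(new_rows, msa):
                    row.append(old[i])
                new_oth.append(oth[j])
                i += 1
                j += 1
        msa = ["".join(row) for row in new_rows] + ["".join(new_oth)]
    return msa

pairs = [
    ("ATTGCCATT", "ATGGCCATT"),      # S1 with S2
    ("ATTGCCATT--", "ATC-CAATTTT"),  # S1 with S3
    ("ATTGCCATT", "ATCTTC-TT"),      # S1 with S4
    ("ATTGCCATT", "ACTGACC--"),      # S1 with S5
]
for row in merge_star(pairs):
    print(row)
```

Each merge touches every column of every row once, which matches the O(k n_max) cost per merge quoted for the STAR method.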

CLUSTALW

1. Construct a distance matrix of all $k(k-1)/2$ pairs by pairwise dynamic programming alignment, and compute the distances between all pairs of sequences (p-distance: the proportion p of nucleotide sites at which the two sequences being compared are different).
2. Construct a guide tree by a neighbor-joining clustering algorithm.
3. Progressively align at nodes in order of decreasing similarity, using sequence-sequence, sequence-profile, and profile-profile alignment.

More on CLUSTALW

After computing the distance between all pairs of sequences we put them into a matrix. For example, for a set of 7 sequences we could have the following matrix:

Seq.   S1    S2    S3    S4    S5    S6    S7
S1     -
S2     .17   -
S3     .59   .60   -
S4     .59   .59   .13   -
S5     .77   .77   .75   .75   -
S6     .81   .82   .73   .74   .80   -
S7     .87   .86   .86   .88   .93   .90   -

Neighbor Joining

- Very popular method!
- Assumes additivity: the distance between a pair of leaves equals the sum of the lengths of the edges connecting them.
- Produces an unrooted tree.
- Very much like the Fitch-Margoliash method, except that the choice of which sequences to pair is made differently.

Neighbor Joining

[Figure: example tree with labeled edge lengths illustrating additivity]

Additivity: the distance between a pair of leaves equals the sum of the lengths of the edges connecting them. If k is the parent of i and j, then for any other leaf m

$d_{km} = (d_{im} + d_{jm} - d_{ij}) / 2$

How do we choose the neighboring leaves?

Neighbor Joining

Find the modified distance matrix:

- Sum the distances between sequence i and all other sequences and scale by n - 2, where n is the total number of sequences: $r_i = \sum_k d_{ik} / (n - 2)$
- Modified distance: $D_{ij} = d_{ij} - (r_i + r_j)$

Claim: a pair of leaves i, j for which $D_{ij}$ is minimal will be neighboring leaves.

Algorithm: Neighbor Joining

Initialization:
- Define T to be the set of leaf nodes, one for each given sequence.
- Let L = T.

Iteration:
- Pick i, j in L for which $D_{ij}$ is minimal.
- Define a new node k and set $d_{km} = (d_{im} + d_{jm} - d_{ij}) / 2$ for all m in L.
- Add k to T with edges of lengths $d_{ik} = (d_{ij} + r_i - r_j) / 2$ and $d_{jk} = d_{ij} - d_{ik}$.
- Remove i and j from L and add k.

Termination:
- When L consists of two leaves i and j, add the remaining edge between i and j, with length $d_{ij}$.
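The neighbor-joining steps above can be sketched directly. The function below follows that formulation (r_i scaled by n - 2); the integer node labels, the dictionary-of-frozensets distance layout and the small additive four-leaf matrix are assumptions made for the illustration. Exact fractions are used so that ties between equal D values are broken deterministically, by taking the first minimal pair.

```python
from fractions import Fraction
from itertools import combinations

def neighbor_joining(dist, leaves):
    """Return tree edges (node, node, length) built by neighbor joining.

    dist maps frozenset({a, b}) to the distance between a and b;
    leaves are integers, and new internal nodes get the next free integers.
    """
    d = dict(dist)
    L = list(leaves)
    edges = []
    next_id = max(leaves) + 1
    while len(L) > 2:
        n = len(L)
        r = {i: sum(d[frozenset((i, k))] for k in L if k != i) / (n - 2) for i in L}
        # Pick the pair minimizing D_ij = d_ij - (r_i + r_j).
        i, j = min(combinations(L, 2),
                   key=lambda p: d[frozenset(p)] - r[p[0]] - r[p[1]])
        k, next_id = next_id, next_id + 1
        d_ij = d[frozenset((i, j))]
        d_ik = (d_ij + r[i] - r[j]) / 2
        edges.append((i, k, d_ik))
        edges.append((j, k, d_ij - d_ik))
        for m in L:
            if m not in (i, j):
                d[frozenset((k, m))] = (d[frozenset((i, m))] + d[frozenset((j, m))] - d_ij) / 2
        L = [m for m in L if m not in (i, j)] + [k]
    i, j = L
    edges.append((i, j, d[frozenset((i, j))]))
    return edges

F = Fraction
example = {
    frozenset((1, 2)): F(3, 10), frozenset((1, 3)): F(5, 10), frozenset((1, 4)): F(6, 10),
    frozenset((2, 3)): F(6, 10), frozenset((2, 4)): F(5, 10), frozenset((3, 4)): F(9, 10),
}
for a, b, length in neighbor_joining(example, [1, 2, 3, 4]):
    print(a, b, float(length))
```

Because the example distances are additive, the edge lengths of the resulting tree reproduce every input distance exactly.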

Example

Initial distances d:

d      2      3      4
1      0.3    0.5    0.6
2             0.6    0.5
3                    0.9

With $r_1 = r_2 = 0.7$ and $r_3 = r_4 = 1.0$, the modified distances $D_{ij} = d_{ij} - (r_i + r_j)$ are:

D      2      3      4
1      -1.1   -1.2   -1.1
2             -1.1   -1.2
3                    -1.1

Leaves 1 and 3 give a minimal D and are joined at a new node 5. From $d_{ik} = (d_{ij} + r_i - r_j)/2$ and $d_{jk} = d_{ij} - d_{ik}$, the edge from 1 to 5 has length 0.1 and the edge from 3 to 5 has length 0.4. The distances from node 5 follow from $d_{km} = (d_{im} + d_{jm} - d_{ij})/2$:

d      2      4
5      0.2    0.5
2             0.5

D      2      4
5      -1.2   -1.2
2             -1.2

Leaves 2 and 4 are joined at a new node 6, with edge lengths 0.1 (2 to 6) and 0.4 (4 to 6). This leaves

d      6
5      0.1

Final tree: edges 1-5 (0.1), 3-5 (0.4), 5-6 (0.1), 2-6 (0.1), 4-6 (0.4).

Exercise

s1: GTTGA
s2: GTTTGA
s3: GATATT
s4: GTATA

Distances

p-distance: for a pairwise alignment, count the number of mismatches and gaps between the two sequences, then divide this value by the length of the alignment.

Ex.

N K L - O N
- M L N O N

p-distance = 3/6 = 0.5

Jukes-Cantor distance: $d = -\frac{3}{4} \log\left[1 - \frac{4}{3} p\right]$, where p is the p-distance.

More on CLUSTALW

Construct a guide tree using the Neighbor Joining method. For the distance matrix in the example we could construct the following guide tree.

[Figure: neighbor-joining guide tree for the seven sequences S1 to S7, with branch lengths]
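Both distances are easy to compute once a pairwise alignment is available. In this sketch the function names are illustrative, the logarithm is the natural logarithm, and the Jukes-Cantor correction is treated as undefined for p of 3/4 or more. Note that the Jukes-Cantor model is meant for nucleotide sequences.

```python
import math

def p_distance(a, b):
    """Fraction of alignment columns that hold a mismatch or a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    differing = sum(1 for x, y in zip(a, b) if x != y)
    return differing / len(a)

def jukes_cantor(p):
    """Jukes-Cantor distance d = -(3/4) ln(1 - (4/3) p)."""
    if p >= 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for p >= 3/4")
    return -0.75 * math.log(1 - 4 * p / 3)

p = p_distance("NKL-ON", "-MLNON")
print(p)                # 3 differing columns out of 6
print(jukes_cantor(p))
```

The correction grows faster than p itself because it accounts for multiple substitutions at the same site, which the raw proportion of differing sites cannot see.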

More on CLUSTALW

Progressively align at nodes in order of decreasing similarity, using sequence-sequence, sequence-profile, and profile-profile alignment.

In our example we first align S1 with S2 (group 1), then S3 with S4 (group 2), then align group 1 with group 2, and continue until we have only one alignment.

CLUSTALW

http://clustalw.genome.jp/

Ex: (FASTA format)

>seqa
GARFIELDTHEFASTCAT
>seqb
GARFIELDTHEVERYFASTCAT
>seqc
GARFIELDTHEFATCAT

Problem of Sequence Weights

The available sequences are not randomly sampled; they reflect biases in how we collect sequences. If we weight everything equally, closely related sequences will dominate the multiple alignment. As a result, conclusions about 1) conservation, 2) evolutionary distance, and 3) reliability of predictions would be wrong.

Sequence Weighting Example

CYEGNGHF   Human-1
CYEGNGDF   Human-2
CYHGNGDF   Human-3
CYHGNGDS   Mouse
CYHGNGQS   Rat
CFEGNGHS   Pig

Solution: don't weight the three humans equally with the others. Use a measure of similarity to down-weight their influence on the multiple alignment.

More on CLUSTALW

More heuristics of CLUSTALW:
- Sequences are weighted to compensate for biased representation in large subfamilies and for the defects of the sum-of-pairs score.
- Use different substitution matrices (BLOSUM80 for closely related sequences; BLOSUM50 for distant sequences).
- Make the gap penalty a function of the residues observed at the position (hydrophobic residues give higher gap penalties than hydrophilic or flexible residues).
- Set gap and gap extension penalties to force all the gaps to occur in the same places.

ClustalW In Summary

- Popular multiple alignment tool today.
- "W" stands for "weighted" (different parts of the alignment are weighted differently).
- Three-step process:
  1) Construct pairwise alignments
  2) Build guide tree
  3) Progressive alignment guided by the tree

Iterative Methods

Shortcomings of the progressive approach:
- Dependence upon the initial alignments
- Sub-alignments are "frozen"
- Errors in alignment are propagated

Iterative methods:
- Begin with an initial alignment.
- A sequence or a group of sequences is taken out and realigned to a profile of the remaining aligned sequences.
- The alignment is repeatedly refined until it no longer changes.