EECS730: Introduction to Bioinformatics

Similar documents
Evolution. CT Amemiya et al. Nature 496, (2013) doi: /nature12027

Dynamic Programming: Edit Distance

Quantifying sequence similarity

SA-REPC - Sequence Alignment with a Regular Expression Path Constraint

Moreover, the circular logic

Algorithms for biological sequence Comparison and Alignment

Minimum Edit Distance. Defini'on of Minimum Edit Distance

Bio nformatics. Lecture 3. Saad Mneimneh

Sequence analysis and Genomics

Pairwise & Multiple sequence alignments

Lecture 2: Pairwise Alignment. CG Ron Shamir

Welcome to CS262: Computational Genomics

Tandem Mass Spectrometry: Generating function, alignment and assembly

Lecture 5,6 Local sequence alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Introduction to Bioinformatics

An Introduction to Bioinformatics Algorithms Hidden Markov Models

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming

Hidden Markov Models. Three classic HMM problems

String Matching Problem

Introduction to spectral alignment

Hidden Markov Models

Multiple Sequence Alignment

Dynamic programming. Curs 2017

Sequence Alignment (chapter 6)

STATC141 Spring 2005 The materials are from Pairwise Sequence Alignment by Robert Giegerich and David Wheeler

Evolutionary Tree Analysis. Overview

Background: comparative genomics. Sequence similarity. Homologs. Similarity vs homology (2) Similarity vs homology. Sequence Alignment (chapter 6)

Multiple Alignment. Slides revised and adapted to Bioinformática IST Ana Teresa Freitas

Outline. Approximation: Theory and Algorithms. Motivation. Outline. The String Edit Distance. Nikolaus Augsten. Unit 2 March 6, 2009

Pairwise sequence alignment

Sequence Alignment. Johannes Starlinger

An Introduction to Bioinformatics Algorithms Hidden Markov Models

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Local Alignment: Smith-Waterman algorithm

Outline. Similarity Search. Outline. Motivation. The String Edit Distance

11.3 Decoding Algorithm

Approximation: Theory and Algorithms

EECS730: Introduction to Bioinformatics

HIDDEN MARKOV MODELS

Pairwise alignment, Gunnar Klau, November 9, 2005, 16:

Comparing whole genomes

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

Similarity Search. The String Edit Distance. Nikolaus Augsten. Free University of Bozen-Bolzano Faculty of Computer Science DIS. Unit 2 March 8, 2012

Hidden Markov Models

SEQUENCE ALIGNMENT BACKGROUND: BIOINFORMATICS. Prokaryotes and Eukaryotes. DNA and RNA

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

Computational Biology

Sequence analysis and comparison

Similarity Search. The String Edit Distance. Nikolaus Augsten.

Introduction to Bioinformatics Algorithms Homework 3 Solution

HMM applications. Applications of HMMs. Gene finding with HMMs. Using the gene finder

Aside: Golden Ratio. Golden Ratio: A universal law. Golden ratio φ = lim n = 1+ b n = a n 1. a n+1 = a n + b n, a n+b n a n

Sequence Comparison. mouse human

More Dynamic Programming

Algorithms in Bioinformatics

Pattern Matching (Exact Matching) Overview

Analysis and Design of Algorithms Dynamic Programming

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

More Dynamic Programming

Hidden Markov Models

CSE 549 Lecture 3: Sequence Similarity & Alignment. slides (w/*) courtesy of Carl Kingsford

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

8 Grundlagen der Bioinformatik, SS 09, D. Huson, April 28, 2009

Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17:

Lecture 4: September 19

NUMB3RS Activity: DNA Sequence Alignment. Episode: Guns and Roses

Multiple Sequence Alignment

Lecture 8 Multiple Alignment and Phylogeny

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Page 1. Evolutionary Trees. Why build evolutionary tree? Outline

EECS730: Introduction to Bioinformatics

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Whole Genome Alignment. Adam Phillippy University of Maryland, Fall 2012

Dynamic programming. Curs 2015

Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Sepp Hochreiter

Algorithms in Bioinformatics: A Practical Introduction. Sequence Similarity

Hidden Markov Models. Ivan Gesteira Costa Filho IZKF Research Group Bioinformatics RWTH Aachen Adapted from:

Evolutionary Models. Evolutionary Models

Hidden Markov Models

CSE 431/531: Analysis of Algorithms. Dynamic Programming. Lecturer: Shi Li. Department of Computer Science and Engineering University at Buffalo

8 Grundlagen der Bioinformatik, SoSe 11, D. Huson, April 18, 2011

Reading for Lecture 13 Release v10

Genome Rearrangements In Man and Mouse. Abhinav Tiwari Department of Bioengineering

Biochemistry 324 Bioinformatics. Pairwise sequence alignment

Graph Alignment and Biological Networks

Lecture 13. More dynamic programming! Longest Common Subsequences, Knapsack, and (if time) independent sets in trees.

RNA secondary structure prediction. Farhat Habib

UNIT 5. Protein Synthesis 11/22/16

In-Depth Assessment of Local Sequence Alignment

EVOLUTIONARY DISTANCES

Practical considerations of working with sequencing data

Algebraic Dynamic Programming. Dynamic Programming, Old Country Style

Large-Scale Genomic Surveys

17 Non-collinear alignment Motivation A B C A B C A B C A B C D A C. This exposition is based on:

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Algorithm Design and Analysis

Comparative Gene Finding. BMI/CS 776 Spring 2015 Colin Dewey

List of Code Challenges. About the Textbook Meet the Authors... xix Meet the Development Team... xx Acknowledgments... xxi

Transcription:

EECS730: Introduction to Bioinformatics Lecture 03: Edit distance and sequence alignment Slides adapted from Dr. Shaojie Zhang (University of Central Florida)

KUMC visit How many of you would like to attend my talk on metagenomics?

DNA Sequence Comparison: First Success Story Finding sequence similarities with genes of known function is a common approach to infer a newly sequenced gene s function In 1984 Russell Doolittle and colleagues found similarities between cancer-causing gene (v-sys in Simian Sarcoma Virus) and normal growth factor (PDGF) gene Cancer is caused by normal growth gene being switched on at a wrong time

The human genome is not the most complex genome!!! nearly 200 complete genomes have been sequenced

We can build the evolution history of these species

Evolutionary Rates next generation OK OK OK X X Still OK?

Sequence conservation implies important function

Sequence similarity Similar genes sequences will code for similar protein sequences Similar protein sequences should adopt similar folds (3D structures) Similar 3D structures imply similar functions Similar gene sequences may origin from the same ancestor and can provide information in evolution inference How do we quantify the sequence similarity???

Hamming distance? V : AT AT AT AT W : T AT AT AT A Hamming distance V : AT AT AT AT -- W : -- T AT AT AT A Alignment distance Hamming distance underestimate the similarity of two strings, more sophisticated algorithm is needed!

Evolution at the DNA level Deletion Mutation Insertion ACGGTGCAGTTACCA AC----CAGTCCACCA SEQUENCE EDITS REARRANGEMENTS Inversion Translocation Duplication

Sequence alignment AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition Given two strings x = x 1 x 2...x M, y = y 1 y 2 y N, an alignment is an assignment of gaps to positions 0,, M in x, and 0,, N in y, so as to line up each letter in one sequence with either a letter, or a gap in the other sequence

Sequence alignment cont. AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC What is the object function??? (and quantitative measure)

The Manhattan Tourist problem Computing similarity is detail-oriented, and we need to do some preliminary work first: The Manhattan Tourist Problem introduces grids, graphs and edit graphs See the most stuff in the least time.

Manhattan Tourist Problem (MTP) Imagine seeking a path (from source to sink) to travel (only eastward and southward) with the most number of attractions (*) in the Manhattan grid Source * * * * * * * * * * * Sink

MTP formulation Goal: Find the longest path in a weighted grid. Input: A weighted grid G with two distinct vertices, one labeled source and the other labeled sink Output: A longest path in G from source to sink

i coordinates source 0 0 1 2 3 4 0 3 2 4 0 3 1 0 2 4 3 9 j coordinates 1 3 2 4 13 2 2 4 6 0 7 3 2 1 4 1 19 MTP example 1 4 4 2 1 3 3 3 0 2 20 6 8 3 4 1 3 2 2 23 sink

i coordinates source 0 0 1 2 3 4 0 3 2 4 0 3 1 0 2 4 3 9 j coordinates 1 3 2 4 1 4 13 2 2 4 6 0 7 3 10 17 2 1 4 1 19 MTP example 2 4 4 2 1 3 3 3 0 22 2 20 6 8 3 4 1 3 2 30 32 2 34 sink

Simple recursion MT(n,m) x MT(n-1,m)+ length of the edge from (n- 1,m) to (n,m) y MT(n,m-1)+ length of the edge from (n,m-1) to (n,m) return max{x,y} Slow!!! For the same reason that RecursiveChange is slow

MTP: Dynamic Programming source j 0 1 i 0 1 1 S 0,1 = 1 1 S 1,0 = Instead of recursion, store the result in an array S

MTP: Dynamic Programming cont. source j 0 1 2 i 0 1 2 3 1 3 S 0,2 = 3 1 3-4 S 1,1 = 4 2 8 S 2,0 = 8

MTP: Dynamic Programming cont. source j 0 1 2 3 i 0 1 2 1 3 3 10 8 S 3,0 = 8 1 3-4 1 13 S 1,2 = 13 2 0 8-9 S 2,1 = 9 3 8

MTP: Dynamic Programming cont. source j 0 1 2 3 i 0 1 2 1 3 8 3 10-1 3-1 - 4-3 13 8 S 1,3 = 8 2 0 8-3 0 9 12 S 2,2 = 12 3 8 0 9

MTP: Dynamic Programming cont. source j 0 1 2 3 i 0 1 2 1 3 8 3 10-1 - 1-4 13 8 3-3 2 2 0 8-3 3 9 12 0-1 S 2,3 = 1 3 8 0 0 9 9

MTP: Dynamic Programming cont. source j 0 1 2 3 i 0 1 2 1 3 8 3 10 - Done! 1-1 - 4 13 8 3-3 2 (showing all back-traces) 2 8-3 3 9 12 1 0 0-1 3 8 0 0 9 9 0 16

MTP: recurrence function Computing the score for a point (i,j) by the recurrence relation: s i, j = max s i-1, j + weight of the edge between (i-1, j) and (i, j) s i, j-1 + weight of the edge between (i, j-1) and (i, j) the running time is n x m for a n by m grid (n = # of rows, m = # of columns)

Manhattan is not a perfect grid

Manhattan is not a perfect grid cont. A 2 A 3 A 1 B What about diagonals? The score at point B is given by: s A1 + weight of the edge (A 1, B) s B = max of s A2 + weight of the edge (A 2, B) s A3 + weight of the edge (A 3, B)

Manhattan is not a perfect grid cont. Computing the score for point x is given by the recurrence relation: s x = max of s y + weight of vertex (y, x) where y є Predecessors(x) Predecessors (x) set of vertices that have edges leading to x The running time for a graph G(V, E), (V is the set of all vertices and E is the set of all edges) is O(E) since each edge is evaluated once

Traversing the Manhattan grid a) b) 3 different strategies: a) Column by column b) Row by row c) Along diagonals c)

Aligning DNA sequences V = ATCTGATG W = TGCATAC match n = 8 m = 7 mismatch 4 1 2 3 matches mismatches insertions deletions V W A T C T G A T G T G C A T A C indels deletion insertion

The Longest Common String (LCS) problem Given two sequences v = v 1 v 2 v m and w = w 1 w 2 w n The LCS of v and w is a sequence of positions in v: 1 < i 1 < i 2 < < i t < m and a sequence of positions in w: 1 < j 1 < j 2 < < j t < n such that i t -th letter of v equals to j t -letter of w and t is maximal

LCS example i coords: 0 1 2 2 3 3 4 6 7 8 elements of v elements of w j coords: 0 A T -- C -- T G A T G -- T G C A T -- A -- C 0 1 2 3 4 6 6 7 (0,0) (1,0) (2,1) (2,2) (3,3) (3,4) (4,) (,) (6,6) (7,6) (8,7) Matches shown in red positions in v: positions in w: 2 < 3 < 4 < 6 1 < 3 < < 6 The LCS Problem can be expressed using the grid similar to MTP grid Finding the heaviest path from the source to sink!!!

LCS: dynamic programming Find the LCS of two strings Input: A weighted graph G with two distinct vertices, one labeled source one labeled sink Output: A longest path in G from source to sink Solve using an LCS edit graph with diagonals replaced with +1 edges

Edit graph for the LCS problem i T 0 1 j A T C T G A T C 0 1 2 3 4 6 7 8 G C A 2 3 4 T A C 6 7

LCS recursive function Let v i = prefix of v of length i: v 1 v i and w j = prefix of w of length j: w 1 w j The length of LCS(v i,w j ) is computed by: s i, j = max s i-1, j s i, j-1 i-1,j -1 i-1,j 1 0 i,j -1 i,j s i-1, j-1 + 1 if v i = w j 0

An issue of the LCS ATGTTAT ATCGTAC LCS= AT-GTTAT * ATCGT-AC AT-GTTAT- ATCGT-A-C LCS= Does it mean that both alignments are equally good? The second one is more gappy, which is not a good sign in alignment as it indicates frameshift Need a better object function for sequence similarity, suggestions?

The edit distance problem Levenshtein (1966) introduced edit distance of two strings as the minimum number of elementary operations (insertions, deletions, and substitutions) to transform one string into the other d(v,w) = MIN no. of elementary operations to transform v w The edit distance is considered as the evolution distance, as the edit is made by evolutionary force

Edit distance: example edit operations: TGCATAT ATCCGAT 4 edit operations: TGCATAT ATCCGAT TGCATAT (delete last T) TGCATA (delete last A) TGCAT (insert A at front) ATGCAT (substitute C for 3 rd G) ATCCAT (insert G before last A) ATCCGAT (Done) TGCATAT (insert A at front) ATGCATAT (delete 6 th T) ATGCATA (substitute G for th A) ATGCGTA (substitute C for 3 rd G) ATCCGAT (Done)

Alignment: 2 row representation Given 2 DNA sequences v and w: v : w : AT CT GAT T GCAT A m = 7 n = 6 Alignment : 2 * k matrix ( k > m, n ) letters of v letters of w A T -- G T T A T -- A T C G T -- A -- C matches 2 insertions 2 deletions

The Alignment Grid revisited 2 sequences used for grid V = ATGTTAT W = ATCGTAC Every alignment path is from source to sink

Alignments in edit graph and represent indels in v and w with edit operation 1. represent match or mismatch with edit operation of 0 or 1. The total number of edit operations of the alignment path is 4.

Alignment as path in the edit graph Every path in the edit graph corresponds to an alignment:

Equivalently good solutions may present Old Alignment 012234677 v= AT_GTTAT_ w= ATCGT_A_C 01234667 New Alignment 012234677 v= AT_GTTAT_ w= ATCG_TA_C 012344667

More details into edit distance solution Dynamic programming s i,j = s i-1, j-1 +1 if v i = w j max s i-1, j s i, j-1

Initializing the DP table 0 1 2 3 4 6 7 1 2 3 4 6 7 Initialize 1 st row and 1 st column to be all corresponding edit costs.

Filling the table 0 1 2 3 4 6 7 1 2 3 4 6 7 S i,j = S i-1, j-1 max S i-1, j S i, j-1 value from NW +1, if v i = w j value from North (top) value from West (left)

Filling the table cont. 0 1 2 3 4 6 7 1 2 3 4 6 7 0 1 2 3 4 6 1 0 1 2 3 4 2 1 2 1 2 3 4 3 2 2 2 1 2 3 4 3 3 3 2 2 3 4 4 4 3 2 3 6 4 3 3

Trace back: find the optimal edit graph and generate the alignment 1. PrintLCS(b,v,i,j) 2. if i = 0 or j = 0 3. return 4. if b i,j =. PrintLCS(b,v,i-1,j-1) 6. print v i 7. else 8. if b i,j = 9. PrintLCS(b,v,i-1,j) 10. else 11. PrintLCS(b,v,i,j-1)

Traceback 0 1 2 3 4 6 7 1 2 3 4 6 7 0 1 2 3 4 6 1 0 1 2 3 4 2 1 2 1 2 3 4 3 2 2 2 1 2 3 4 3 3 3 2 2 3 4 4 4 3 2 3 6 4 3 3 ATCG-TAC * AT-GTTAT

Running time It takes O(nm) time to fill in the n * m dynamic programming matrix. Why O(nm)? The pseudocode consists of a nested for loop inside of another for loop to set up a n * m matrix.