Dynamic Programming: Edit Distance


Dynamic Programming: Edit Distance. Bioinformatics: Issues and Algorithms, CSE 308-408, Fall 2007, Lecture 10. Lopresti.

Outline: Setting the Stage; DNA Sequence Comparison: First Successes; The Change Problem; The Manhattan Tourist Problem; The Longest Common Subsequence Problem; Sequence Alignment; The Edit Distance Problem.

Sequence comparison and alignment. Consider: [a pair of DNA fragments that differ in only a few positions] versus [a pair that differ in many]. That the two DNA fragments on the left somehow seem more similar than the two on the right could be significant. Question: how can we measure sequence similarity?

Motivation. Why is this important? Given a new DNA sequence, one of the first things a biologist will want to do is search databases of known sequences to see if anyone has recorded something similar. (As we've seen, genetic sequences are long and the databases are enormous, so efficiency will be an issue.) Sequence similarity can provide clues about function. Similarity can also provide clues about evolutionary relationships. Many other problems from computational biology incorporate some notion of sequence similarity as a basic premise.

Sequence concepts. A sequence is a linear string of symbols over a finite alphabet. (Note: string and sequence are often used synonymously.) Some basic sequence concepts: length: the number of symbols in s (written |s|). empty string: the sequence of length 0 (written ε). subsequence: a sequence that can be obtained from s by removing some symbols (for example, AA is a subsequence of ACTA). supersequence: if t is a subsequence of s, then s is a supersequence of t.

Sequence concepts. More basic sequence concepts: substring: a sequence of consecutive symbols appearing in s (for example, AA is not a substring of ACTA, but CT is). (Observation: every substring is a subsequence of the string in question, but not vice versa.) superstring: if t is a substring of s, then s is a superstring of t. interval: a set of consecutive indices of a string, e.g., [2..4] refers to the 2nd through 4th symbols of string s, or s[2]s[3]s[4].

Sequence concepts. And lastly: prefix: a substring of s of the form s[1..j] where 0 ≤ j ≤ |s| (when j = 0, the prefix is the empty string). suffix: a substring of s of the form s[i..|s|] where 1 ≤ i ≤ |s| + 1 (when i = |s| + 1, the suffix is the empty string). For example, AC is a prefix (and substring and subsequence) of ACTA, and TA is a suffix (and substring and subsequence) of ACTA.
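To make these definitions concrete, here is a small Python sketch (mine, not from the original slides) that checks the relations on the example string ACTA:

def is_subsequence(t, s):
    # True if t can be obtained from s by removing some symbols
    it = iter(s)
    return all(ch in it for ch in t)

def is_substring(t, s):
    # True if t appears as consecutive symbols of s
    return t in s

s = "ACTA"
print(is_subsequence("AA", s))               # True: remove C and T
print(is_substring("AA", s))                 # False: the two A's are not consecutive
print(s.startswith("AC"), s.endswith("TA"))  # prefix and suffix checks: True True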

Sequences. The concept of a sequence is extremely broad. In this course, we are concerned with genetic sequences. However, there are other important kinds of sequence data: ASCII text, speech, handwriting (pen-strokes). Likewise, the same algorithmic techniques turn up again and again, often under different names: approximate string matching, edit (or evolutionary) distance, dynamic time warping.

Sequence comparison: a false start. An obvious idea that comes to mind is to line up each symbol and count the number that don't match: [two equal-length DNA sequences that disagree in 2 positions] = 2. As we know, this is Hamming distance, and it forms the basis for most error-correcting codes, as well as the motif-finding problem we saw earlier. But it doesn't work for the kinds of sequences we care about: [the same kind of comparison on a shifted pair] = 6. Just one missing symbol at the start of the second sequence leads to a large distance.

Sequence manipulation at the genetic level. Genomes aren't static... sequence comparison must account for this. http://www.accessexcellence.org/ab//nhgri_pdfs/deletion.pdf http://www.accessexcellence.org/ab//nhgri_pdfs/insertion.pdf

The human factor. In addition, errors can arise during the sequencing process: "...the error rate is generally less than 1% over the first 650 bases and then rises significantly over the remaining sequence." http://genome.med.harvard.edu/dnaseq.html A hard-to-read gel (arrow marks the location where bands of similar intensity appear in two different lanes), leading to ambiguous base calls: http://hshgp.genome.washington.edu/teacher_resources/99-studentdnasequencingmodule.pdf

Sequence alignment. As we have seen, the two sequences we wish to compare may have different lengths. As a result, we need to allow for deletions and insertions. The notion of an alignment helps us visualize this: [the two sequences written one above the other, with dashes marking spaces]. Alignment permits us to incorporate spaces (represented by a dash) in one or both sequences to make them the same length.

Sequence alignment. How can we find the best alignment? [First alignment:] this alignment has 6 mismatches. Is it the best possible? No, this one has 5! [Second alignment.]

DNA sequence comparison: first successes. Finding sequence similarities with genes of known function is a common approach to infer the function of a newly sequenced gene. In 1984, Russell Doolittle and colleagues found similarities between a cancer-causing gene and the normal growth factor (PDGF) gene.

Cystic Fibrosis. Cystic fibrosis (CF) is a chronic and often fatal genetic disease of the body's mucus glands (abnormally high levels of mucus). It primarily affects the respiratory systems of children. Mucus is a slimy material that coats many epithelial surfaces and is secreted into fluids such as saliva. Inheritance of cystic fibrosis: in the early 1980's, biologists hypothesized that CF is an autosomal recessive disorder caused by mutations in a gene that remained unknown until 1989. Heterozygous carriers are asymptomatic; one must be homozygously recessive in this gene in order to be diagnosed with CF.

Cystic Fibrosis: finding the gene. [Figure.]

Cystic Fibrosis. Finding similarities between the cystic fibrosis gene and ATP-binding proteins: ATP-binding proteins are present on the cell membrane and act as transport channels. In 1989, biologists found a similarity between the cystic fibrosis gene and ATP-binding proteins, suggesting a plausible function for the cystic fibrosis gene, given that CF involves sweat secretion with abnormally high sodium levels.

Cystic Fibrosis: mutation analysis. If a high percentage of cystic fibrosis (CF) patients have a certain mutation in the gene while unaffected people don't, that could be an indicator of a mutation related to CF. Indeed, a certain mutation was found in 70% of CF patients; this is convincing evidence of a predominant genetic diagnostics marker for CF.

Cystic Fibrosis and the CFTR protein. The CFTR (Cystic Fibrosis Transmembrane conductance Regulator) protein acts in the cell membrane of the epithelial cells that secrete mucus. These cells line the airways of the nose, the lungs, the stomach wall, etc. The CFTR protein (1480 amino acids) regulates a chloride ion channel, adjusting the wateriness of fluids secreted by the cell. CF sufferers are missing a single amino acid in their CFTR. Mucus ends up being too thick, affecting many organs.

Bring in the bioinformaticians. Gene similarities between two genes, one with known and one with unknown function, alert biologists to some possibilities. Computing a similarity score between two genes tells how likely it is that they have similar functions. Dynamic programming is a technique for revealing similarities between genes. The Change Problem is a good problem with which to introduce the idea of dynamic programming.

The Change Problem. You hand the cashier $20 for a $19.23 bill. Your change could be: 3 quarters + 2 pennies, 77 pennies, etc. The Change Problem: convert some amount of money M into given denominations, using the fewest possible number of coins. Input: an amount of money M, and an array of d denominations c = (c_1, c_2, ..., c_d), with c_1 > c_2 > ... > c_d. Output: a list of d integers i_1, i_2, ..., i_d such that c_1·i_1 + c_2·i_2 + ... + c_d·i_d = M and i_1 + i_2 + ... + i_d is minimal.

The Change Problem: an example. Given denominations 1, 3, and 5, what is the minimum number of coins needed to make change for a given value?
Value:          1  2  3  4  5  6  7  8  9  10
Min # of coins: 1     1     1
Only one coin is needed to make change for the values 1, 3, and 5.

The Change Problem: an example. Given denominations 1, 3, and 5, what is the minimum number of coins needed to make change for a given value?
Value:          1  2  3  4  5  6  7  8  9  10
Min # of coins: 1  2  1  2  1  2     2      2
However, two coins are needed to make change for the values 2, 4, 6, 8, and 10.

The Change Problem: an example. Given denominations 1, 3, and 5, what is the minimum number of coins needed to make change for a given value?
Value:          1  2  3  4  5  6  7  8  9  10
Min # of coins: 1  2  1  2  1  2  3  2  3  2
Lastly, three coins are needed to make change for the values 7 and 9.

The Change Problem: recurrence. This example is expressed by the following recurrence relation:
minNumCoins(M) = min of { minNumCoins(M - 1) + 1, minNumCoins(M - 3) + 1, minNumCoins(M - 5) + 1 }.
Given denominations c_1, c_2, ..., c_d, the recurrence relation is:
minNumCoins(M) = min of { minNumCoins(M - c_1) + 1, minNumCoins(M - c_2) + 1, ..., minNumCoins(M - c_d) + 1 }.

The Change Problem: a recursive algorithm.
RecursiveChange(M, c, d)
  if M = 0
    return 0
  bestNumCoins ← infinity
  for i ← 1 to d
    if M ≥ c_i
      numCoins ← RecursiveChange(M - c_i, c, d)
      if numCoins + 1 < bestNumCoins
        bestNumCoins ← numCoins + 1
  return bestNumCoins
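A direct Python transcription of RecursiveChange (a sketch of mine, not part of the slides) makes the repeated work easy to experiment with:

def recursive_change(M, coins):
    # minimum number of coins summing to M, by naive recursion
    if M == 0:
        return 0
    best = float("inf")
    for c in coins:
        if M >= c:
            num = recursive_change(M - c, coins)
            if num + 1 < best:
                best = num + 1
    return best

print(recursive_change(9, (1, 3, 7)))   # 3, e.g. 7 + 1 + 1

Even for modest M this recursion revisits the same subproblems many times, which is exactly the inefficiency discussed next.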

The Change Problem: a recursive algorithm. The RecursiveChange algorithm works, but it is not efficient: it recalculates the optimal coin combination for a given amount of money repeatedly. E.g., if M = 77 and c = (1, 3, 7), then the optimal combination of coins for 70 cents is computed on 9 different occasions! [Figure: the tree of recursive calls for M = 77, in which the subproblem for 70 appears again and again.]

The Change Problem: we can do better. Rather than re-computing a given value more than once, save the results of each computation for 0 to M. This way, we can look up an already computed value instead of re-computing it each time. The running time is M × d, where M is the value of money and d is the number of denominations.

The Change Problem: dynamic programming.
DPChange(M, c, d)
  bestNumCoins[0] ← 0
  for m ← 1 to M
    bestNumCoins[m] ← infinity
    for i ← 1 to d
      if m ≥ c_i
        if bestNumCoins[m - c_i] + 1 < bestNumCoins[m]
          bestNumCoins[m] ← bestNumCoins[m - c_i] + 1
  return bestNumCoins[M]

DPChange example, with c = (1, 3, 7) and M = 9. The table is filled in from left to right:
value:     0  1  2  3  4  5  6  7  8  9
min coins: 0  1  2  1  2  3  2  1  2  3
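For comparison, a minimal Python sketch of DPChange (mine, not the slides') that reproduces the table above:

def dp_change(M, coins):
    # bottom-up dynamic programming: best[m] = fewest coins summing to m
    best = [0] + [float("inf")] * M
    for m in range(1, M + 1):
        for c in coins:
            if m >= c and best[m - c] + 1 < best[m]:
                best[m] = best[m - c] + 1
    return best

print(dp_change(9, (1, 3, 7)))   # [0, 1, 2, 1, 2, 3, 2, 1, 2, 3]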

The Manhattan Tourist Problem (MTP). Imagine seeking a path (from source to sink) that travels only eastward and southward and passes the most attractions (*) in the Manhattan grid. [Figure: a grid with the source at the top left, the sink at the bottom right, and attractions marked by stars.]

The Manhattan Tourist Problem (MTP). Imagine seeking a path (from source to sink) that travels only eastward and southward and passes the most attractions (*) in the Manhattan grid. [Figure: the same grid, now with a source-to-sink path drawn through many of the attractions.]

The Manhattan Tourist Problem (MTP). The Manhattan Tourist Problem: find the longest path in a weighted grid. Input: a weighted grid G with two distinct vertices, one labeled source and the other labeled sink. Output: a longest path in G from source to sink.

MTP: an example. [Figure: a weighted grid with i and j coordinates, the source at the top left and the sink at the bottom right; the vertices are annotated with the best score achievable at each, ending with 23 at the sink.]

MTP: greedy is not optimal. [Figure: a weighted grid on which always taking the heaviest outgoing edge gives a promising start, but leads to bad choices!]

MTP: a simple recursive program.
MT(n, m)
  if n = 0 and m = 0
    return 0
  if n > 0
    x ← MT(n - 1, m) + length of the edge from (n - 1, m) to (n, m)
  else
    x ← 0
  if m > 0
    y ← MT(n, m - 1) + length of the edge from (n, m - 1) to (n, m)
  else
    y ← 0
  return max(x, y)
What's wrong with this approach?

MTP: dynamic programming. [Figure: the first step on the example grid; s(0,1) = 1 and s(1,0) = 5.] Calculate the optimal path score for each vertex in the graph. Each vertex's score is the maximum of the prior vertices' scores plus the weight of the respective edge in between.

MTP: dynamic programming (continued). [Figure: the next anti-diagonal of vertices is scored; s(0,2) = 3, s(1,1) = 4, s(2,0) = 8.]

MTP: dynamic programming (continued). [Figure: the next anti-diagonal; s(1,2) = 13, s(2,1) = 9, s(3,0) = 8.]

MTP: dynamic programming (continued). [Figure: the next anti-diagonal; s(1,3) = 8, s(2,2) = 12, s(3,1) = 9.] Greedy algorithm fails!

MTP: dynamic programming (continued). [Figure: the completed grid; s(3,3) = 16.] Done!

MTP: recurrence. The score for a point (i, j) is computed by the recurrence relation:
s(i, j) = max of { s(i-1, j) + weight of the edge between (i-1, j) and (i, j), s(i, j-1) + weight of the edge between (i, j-1) and (i, j) }.
The running time is n × m for an n-by-m grid, where n is the number of rows and m is the number of columns.
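As an illustration, here is a Python sketch of this table-filling loop (my own; the edge-weight matrices below are made up rather than taken from the slides' figure):

def manhattan_tourist(down, right):
    # down[i][j]  = weight of the edge from (i, j) down to (i+1, j)
    # right[i][j] = weight of the edge from (i, j) right to (i, j+1)
    n = len(down)          # number of vertex rows minus 1
    m = len(right[0])      # number of vertex columns minus 1
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + down[i - 1][0]
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + right[0][j - 1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] + down[i - 1][j],
                          s[i][j - 1] + right[i][j - 1])
    return s[n][m]

down  = [[1, 0, 2], [4, 3, 0]]    # 2 x 3 down-edge weights for a 3 x 3 vertex grid
right = [[3, 2], [0, 5], [1, 1]]  # 3 x 2 right-edge weights
print(manhattan_tourist(down, right))   # 8 for these made-up weights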

But Manhattan is not a perfect grid... What about diagonals? [Figure: a vertex B with three predecessors A_1, A_2, and A_3.] The score at point B is given by:
s(B) = max of { s(A_1) + weight of the edge (A_1, B), s(A_2) + weight of the edge (A_2, B), s(A_3) + weight of the edge (A_3, B) }.

But Manhattan is not a perfect grid... The score s(x) for a point x is given by the recurrence relation:
s(x) = max over y in Predecessors(x) of { s(y) + weight of the edge (y, x) },
where Predecessors(x) is the set of vertices that have edges leading to x. The running time for a graph G(V, E) is O(E), since each edge is evaluated once. (Note: V is the set of all vertices and E is the set of all edges.)
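A minimal sketch of this general, predecessor-based version (assumptions mine: the vertices are already given in topological order and every non-source vertex is reachable from the source):

def longest_path(order, predecessors, weight, source):
    # order: vertices in topological order
    # predecessors[x]: vertices y with an edge y -> x
    # weight[(y, x)]: weight of the edge y -> x
    s = {source: 0}
    for x in order:
        if x != source:
            s[x] = max(s[y] + weight[(y, x)] for y in predecessors[x])
    return s

order = ["source", "a", "b", "sink"]
preds = {"a": ["source"], "b": ["source"], "sink": ["a", "b"]}
w = {("source", "a"): 2, ("source", "b"): 1, ("a", "sink"): 3, ("b", "sink"): 5}
print(longest_path(order, preds, w, "source")["sink"])   # 6, via b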

Alignment: two-row representation. Given 2 DNA sequences v and w, with m = 7 and n = 6: an alignment is a 2 × k matrix (k ≥ m, n), one row holding the letters of v and the other the letters of w, with dashes inserted where symbols are skipped. [Figure: an example alignment with 4 matches, 2 insertions, and 3 deletions.]

Aligning DNA sequences. v (n = 8) and w (m = 7). [Figure: an alignment of v and w containing 4 matches, 1 mismatch, 3 deletions, and 2 insertions; matched columns are joined by lines, and the indels are labeled as insertions or deletions.]

The Longest Common Subsequence (LCS) Problem. The Longest Common Subsequence Problem: find the longest common subsequence of two sequences. Input: two sequences v = v_1 v_2 ... v_m and w = w_1 w_2 ... w_n. Output: a sequence of positions in v, 1 ≤ i_1 < i_2 < ... < i_t ≤ m, and a sequence of positions in w, 1 ≤ j_1 < j_2 < ... < j_t ≤ n, such that the symbol at position i_k of v matches the symbol at position j_k of w and t is maximal.

LCS example. For v = ATCTGATC and w = TGCATAC, one common subsequence corresponds to the path (0,0) (1,0) (2,1) (2,2) (3,3) (3,4) (4,5) (5,5) (6,6) (7,6) (8,7), i.e., to the alignment
i coords:      0  1  2  2  3  3  4  5  6  7  8
elements of v:    A  T  -  C  -  T  G  A  T  C
elements of w:    -  T  G  C  A  T  -  A  -  C
j coords:      0  0  1  2  3  4  5  5  6  6  7
Matches (shown in red on the slide) occur at positions 2 < 3 < 4 < 6 < 8 in v and 1 < 3 < 5 < 6 < 7 in w, giving the common subsequence TCTAC. Every common subsequence is a path in a 2-D grid.

LCS dynamic programming. Find the LCS of two strings. Input: a weighted graph G with two distinct vertices, one labeled source and one labeled sink. Output: a longest path in G from source to sink. Solve using an MTP-style grid in which the diagonal edges corresponding to matching symbols are given weight +1 (and all other edges weight 0).

LCS problem as Manhattan Tourist Problem. [Figure: the alignment grid, with v (positions 0 through 8) along one axis and w (positions 0 through 7) along the other.]

MTP graph for the LCS problem. [Figure: the same grid with diagonal edges added wherever the corresponding symbols of v and w match.]

MTP graph for the LCS problem. Every path is a common subsequence. Every diagonal edge adds an extra element to the subsequence. LCS Problem: find the path with the maximum number of diagonal edges. [Figure: the grid with one such path highlighted.]

Computing LCS. Let v_i = the prefix of v of length i, that is v_1 ... v_i, and w_j = the prefix of w of length j, that is w_1 ... w_j. The length of LCS(v_i, w_j) is computed by:
s(i, j) = max of { s(i-1, j), s(i, j-1), s(i-1, j-1) + 1 if v_i = w_j }.
[Figure: cell (i, j) is reached from (i-1, j) or (i, j-1) by a 0-weight edge, or from (i-1, j-1) by a +1 edge when v_i = w_j.]
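A short Python sketch of this recurrence (mine, not from the slides), applied to the pair from the LCS example above:

def lcs_length(v, w):
    # length of the longest common subsequence of v and w
    m, n = len(v), len(w)
    s = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s[i][j] = max(s[i - 1][j],
                          s[i][j - 1],
                          s[i - 1][j - 1] + 1 if v[i - 1] == w[j - 1] else 0)
    return s[m][n]

print(lcs_length("ATCTGATC", "TGCATAC"))   # 5, the length of TCTAC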

LCS alignments. Every path in the grid corresponds to an alignment. For example, with V = ATGT and W = ATCG:
V coords: 0 1 2 2 3 4
V:          A T - G T
W:          A T C G -
W coords: 0 1 2 3 4 4

Edit distance. In 1966, Levenshtein introduced the edit distance between two strings as the minimum number of elementary operations (insertions, deletions, and substitutions) needed to transform one string into the other. d(v, w) = MIN number of elementary operations needed to transform v into w. Keep in mind that the LCS computation uses a MAX.

The Edit Distance Problem. Given two sequences, find the optimal series of deletions, insertions, and substitutions to transform one into the other. Input: two sequences v = v_1 v_2 ... v_m and w = w_1 w_2 ... w_n. Output: an optimal series of basic editing operations e_1, e_2, ..., e_t such that, when applied to one of the sequences, say v, it is transformed into the other sequence, w. Here optimal can mean any of a number of things, including fewest or lowest-/highest-cost.

Edit distance vs. Hamming distance. Hamming distance always compares the i-th letter of v with the i-th letter of w: for v = ATATATAT and w = TATATATA, the Hamming distance is d(v, w) = 8, even though just one shift would make everything line up. Edit distance may compare the i-th letter of v with the j-th letter of w: aligning v = -ATATATAT against w = TATATATA-, the edit distance is d(v, w) = 2. Computing Hamming distance is a trivial task. Computing edit distance is a non-trivial task.

Edit distance vs. Hamming distance. Hamming distance always compares the i-th letter of v with the i-th letter of w: for v = ATATATAT and w = TATATATA, the Hamming distance is d(v, w) = 8. Edit distance may compare the i-th letter of v with the j-th letter of w: aligning v = -ATATATAT against w = TATATATA-, the edit distance is d(v, w) = 2 (one insertion and one deletion). But how do we find which j goes with which i?
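To see the contrast numerically, here is a small self-contained Python sketch (mine; it uses plain unit-cost Levenshtein distance rather than the scoring scheme introduced later) comparing the two measures on this pair:

def hamming(v, w):
    # number of mismatching positions (sequences of equal length)
    return sum(a != b for a, b in zip(v, w))

def edit_distance(v, w):
    # Levenshtein distance: minimum number of insertions, deletions, substitutions
    m, n = len(v), len(w)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                           # deletion
                          d[i][j - 1] + 1,                           # insertion
                          d[i - 1][j - 1] + (v[i - 1] != w[j - 1]))  # substitution or match
    return d[m][n]

v, w = "ATATATAT", "TATATATA"
print(hamming(v, w), edit_distance(v, w))   # 8 2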

Edit distance example. TGCATAT → ATCCGAT in 5 steps:
TGCATAT → TGCATA (delete the last T)
TGCATA → TGCAT (delete the last A)
TGCAT → ATGCAT (insert A at the front)
ATGCAT → ATCCAT (substitute C for the 3rd letter, G)
ATCCAT → ATCCGAT (insert G before the last A)
Done.

Edit distance example. TGCATAT → ATCCGAT in 5 steps:
TGCATAT → TGCATA (delete the last T)
TGCATA → TGCAT (delete the last A)
TGCAT → ATGCAT (insert A at the front)
ATGCAT → ATCCAT (substitute C for the 3rd letter, G)
ATCCAT → ATCCGAT (insert G before the last A)
Done. What is the edit distance? 5?

Edit distance example. TGCATAT → ATCCGAT in 4 steps:
TGCATAT → ATGCATAT (insert A at the front)
ATGCATAT → ATGCAAT (delete the 6th letter, T)
ATGCAAT → ATGCGAT (substitute G for the 5th letter, A)
ATGCGAT → ATCCGAT (substitute C for the 3rd letter, G)
Done.

Edit distance example. TGCATAT → ATCCGAT in 4 steps:
TGCATAT → ATGCATAT (insert A at the front)
ATGCATAT → ATGCAAT (delete the 6th letter, T)
ATGCAAT → ATGCGAT (substitute G for the 5th letter, A)
ATGCGAT → ATCCGAT (substitute C for the 3rd letter, G)
Done. Can it be done in 3 steps???
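A quick check with a memoized recursion (a sketch of mine, applied to the example pair above) shows that 4 is in fact the minimum, so three steps are not enough:

from functools import lru_cache

@lru_cache(maxsize=None)
def lev(v, w):
    # Levenshtein distance via memoized recursion
    if not v:
        return len(w)
    if not w:
        return len(v)
    return min(lev(v[1:], w) + 1,                     # delete v[0]
               lev(v, w[1:]) + 1,                     # insert w[0]
               lev(v[1:], w[1:]) + (v[0] != w[0]))    # substitute or match

print(lev("TGCATAT", "ATCCGAT"))   # 4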

Alignment grid. Every alignment path runs from source to sink. [Figure: the alignment grid.]

Alignment as a path in the edit graph. For v = ATGTTAT and w = ATCGTAC, one alignment is
v coords: 0 1 2 2 3 4 5 6 7 7
v:          A T - G T T A T -
w:          A T C G T - A - C
w coords: 0 1 2 3 4 5 5 6 6 7
with corresponding path (0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7).

Alignment as a path in the edit graph. Horizontal and vertical edges represent indels in v and w and have score 0; diagonal edges representing matches have score 1. The score of this alignment path is 5.

Alignment as a path in the edit graph. Every path in the edit graph corresponds to an alignment. [Figure.]

Alignment as a path in the edit graph. Old alignment:
0122345677
v = AT_GTTAT_
w = ATCGT_A_C
0123455667
New alignment:
0122345677
v = AT_GTTAT_
w = ATCG_TA_C
0123445667

Alignment as a path in the edit graph.
0122345677
v = AT_GTTAT_
w = ATCGT_A_C
0123455667
(0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)

Sequence comparison: the basic algorithm. We can't afford to enumerate all possible alignments looking for the best one; that would be an exponential search. Fortunately, we don't have to. The optimal alignment can be found using dynamic programming, which we've just seen. Dynamic programming is based on the premise of computing solutions to smaller subproblems first and then using these to solve successively larger problems until we have our answer. (Dynamic programming was invented by Richard Bellman in the 1950's. Its application to sequence comparison came later, in the 1970's.) http://fens.sabanciuniv.edu/msie/operations_research_50_years/anniversary/or50/1526-5463-2002-50-01-0048.pdf

Sequence alignment... First, some ground rules for the columns of an alignment. Legal: a symbol over a gap (a deletion). Legal: a gap over a symbol (an insertion). Legal: a symbol over the same symbol (a match). Legal: a symbol over a different symbol (a mismatch). Not legal: a gap over a gap.

Sequence comparison: the basic algorithm. Given two sequences v and w, consider what's required to compute the optimal alignment for the prefixes v[1..i] and w[1..j]. Based on our rules for alignments, there are three possible cases for the final column: (I) deletion: an optimal alignment for v[1..i-1] and w[1..j], followed by v[i] over a gap; (II) insertion: an optimal alignment for v[1..i] and w[1..j-1], followed by a gap over w[j]; (III) substitution or match: an optimal alignment for v[1..i-1] and w[1..j-1], followed by v[i] over w[j]. So, assuming we've already computed solutions for all shorter prefixes, we can compute the alignment for v[1..i] and w[1..j].

Sequence comparison: the basic algorithm. Conceptually, this might look something like this:
optimal alignment at v[1..i] and w[1..j] = max of
  optimal alignment at v[1..i-1] and w[1..j] + cost of deleting v[i]
  optimal alignment at v[1..i] and w[1..j-1] + cost of inserting w[j]
  optimal alignment at v[1..i-1] and w[1..j-1] + cost of matching v[i] and w[j]
Here we assume that deletions, insertions, and mismatches have negative costs, while matches have positive costs.

Sequence comparison: the basic algorithm. This computation can be viewed as building a 2-D matrix, with string v labeling the rows and string w labeling the columns, each preceded by the empty string ε, and with 0 in the top-left cell. Moving down adds the cost of deleting a symbol of v; moving right adds the cost of inserting a symbol of w. Each cell is filled by
d[i,j] = max of { d[i-1,j] + indel, d[i,j-1] + indel, d[i-1,j-1] + sub(v[i], w[j]) },
where indel is the cost of a deletion or insertion, and sub is the cost of a match/mismatch involving the symbols v[i] and w[j].

Sequence comparison: the basic algorithm. Stated more generally, say that our two sequences are v[1]v[2]v[3]...v[m] and w[1]w[2]w[3]...w[n]. Then the initial conditions are:
d[0,0] = 0
d[i,0] = d[i-1,0] + c_del(v[i]) for 1 ≤ i ≤ m
d[0,j] = d[0,j-1] + c_ins(w[j]) for 1 ≤ j ≤ n
And the recurrence, for 1 ≤ i ≤ m and 1 ≤ j ≤ n, is:
d[i,j] = max of { d[i-1,j] + c_del(v[i]), d[i,j-1] + c_ins(w[j]), d[i-1,j-1] + c_sub(v[i], w[j]) },
where c_del, c_ins, and c_sub are the costs of a deletion, an insertion, and a substitution, respectively.

Sequence comparison: the basic algorithm. Example: say that c_del = -1, c_ins = -1, and c_sub = -1 for a mismatch and +1 for a match. [Figure: the resulting scoring matrix for two short DNA sequences under these costs.]

Algorithm for computing edit distance.
EditDistance1(v, w)
  d[0,0] ← 0
  for i ← 1 to m
    d[i,0] ← d[i-1,0] + c_del(v_i)
  for j ← 1 to n
    d[0,j] ← d[0,j-1] + c_ins(w_j)
  for i ← 1 to m
    for j ← 1 to n
      d[i,j] ← max of
        d[i-1,j] + c_del(v_i)
        d[i,j-1] + c_ins(w_j)
        d[i-1,j-1] + c_sub(v_i, w_j)
  return d[m,n]
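Here is a runnable Python sketch of this algorithm (my own translation, using the example costs from a few slides back: indel -1, mismatch -1, match +1):

def align_score(v, w, indel=-1, mismatch=-1, match=+1):
    # global alignment score d[m][n] under simple per-column costs
    m, n = len(v), len(w)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + indel          # deleting v[i]
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + indel          # inserting w[j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if v[i - 1] == w[j - 1] else mismatch
            d[i][j] = max(d[i - 1][j] + indel,
                          d[i][j - 1] + indel,
                          d[i - 1][j - 1] + sub)
    return d[m][n]

print(align_score("ATGTTAT", "ATCGTAC"))   # 2 under these costs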

Sequence comparison: some observations. The computation can progress in a number of ways: row by row, column by column, or anti-diagonal by anti-diagonal. For sequences of length m and n, v[1]v[2]v[3]...v[m] and w[1]w[2]w[3]...w[n]: What is the computation time? O(mn). How much memory (storage) is required? O(mn).

Sequence comparison: getting an alignment. We started with the notion of an alignment. How do we get one? By keeping track of the optimal decisions made during the algorithm, and then tracing back along the optimal path. [Figure: the scoring matrix with traceback arrows and the recovered alignment.] (The optimal alignment may not be unique.)

Maintaining the traceback.
EditDistance1(v, w)
  ...
  for i ← 1 to m
    for j ← 1 to n
      d[i,j] ← max of
        d[i-1,j] + c_del(v_i)
        d[i,j-1] + c_ins(w_j)
        d[i-1,j-1] + c_sub(v_i, w_j)
      if d[i,j] = d[i-1,j] + c_del(v_i)
        b[i,j] ← "↑"
      if d[i,j] = d[i,j-1] + c_ins(w_j)
        b[i,j] ← "←"
      if d[i,j] = d[i-1,j-1] + c_sub(v_i, w_j)
        b[i,j] ← "↖"
  return (d[m,n], b)
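A self-contained Python sketch (mine) of the same idea: fill the score and traceback matrices, then walk back from (m, n) to recover one optimal alignment, using the same toy costs as above.

def align(v, w, indel=-1, mismatch=-1, match=+1):
    # returns (score, aligned v, aligned w) for one optimal global alignment
    m, n = len(v), len(w)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    b = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0], b[i][0] = d[i - 1][0] + indel, "up"
    for j in range(1, n + 1):
        d[0][j], b[0][j] = d[0][j - 1] + indel, "left"
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if v[i - 1] == w[j - 1] else mismatch
            choices = [(d[i - 1][j] + indel, "up"),        # delete v[i]
                       (d[i][j - 1] + indel, "left"),      # insert w[j]
                       (d[i - 1][j - 1] + sub, "diag")]    # match or mismatch
            d[i][j], b[i][j] = max(choices)
    av, aw, i, j = [], [], m, n                            # trace back from (m, n)
    while i > 0 or j > 0:
        if b[i][j] == "diag":
            av.append(v[i - 1]); aw.append(w[j - 1]); i -= 1; j -= 1
        elif b[i][j] == "up":
            av.append(v[i - 1]); aw.append("-"); i -= 1
        else:
            av.append("-"); aw.append(w[j - 1]); j -= 1
    return d[m][n], "".join(reversed(av)), "".join(reversed(aw))

print(align("ATGTTAT", "ATCGTAC"))   # score 2 plus one optimal alignment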

More examples. Comparing a first pair of short sequences. [Figure: the filled-in matrix and its traceback.]

More examples. Comparing another pair of short sequences. [Figure: the filled-in matrix and its traceback.]

More examples. Comparing the two sequences from the motivating example earlier; recall the trace we saw. [Figure: the matrix and the recovered alignment.]

Optimality of the algorithm. How do we know this algorithm finds an optimal alignment? We won't dig into the finer details now, but basically it hinges on: Bellman's principle of optimality for dynamic programming, and the cost functions behaving properly (i.e., they must satisfy the triangle inequality). This could be an interesting topic for a final paper...

Wrap-up. Readings for next time: continue reading IBA Chapter 6 (sequence comparison). Remember: come to class having done the readings, and check Blackboard regularly for updates.