Algorithm Design and Analysis
LECTURE 18
Dynamic Programming (Segmented LS recap)
Longest Common Subsequence
Adam Smith
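Segmented Least Squares

Least squares.
Foundational problem in statistics and numerical analysis.
Given n points in the plane: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.
Find a line y = ax + b that minimizes the sum of the squared error:

$$\mathrm{SSE} = \sum_{i=1}^{n} (y_i - a x_i - b)^2$$

[Figure: points scattered around a fitted line; axes x and y.]

Solution. Calculus: the minimum error is achieved when

$$a = \frac{n \sum_i x_i y_i - \left(\sum_i x_i\right)\left(\sum_i y_i\right)}{n \sum_i x_i^2 - \left(\sum_i x_i\right)^2}, \qquad b = \frac{\sum_i y_i - a \sum_i x_i}{n}$$

Both a and b can be computed in O(n) time.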
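The closed-form solution above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name `fit_line` is hypothetical, and the fallback to a horizontal line when all x-values coincide is an assumption added so the sketch never divides by zero.

```python
def fit_line(points):
    """Return (a, b, sse) for the least-squares line y = a*x + b.

    `points` is a non-empty list of (x, y) pairs. One pass per sum, so O(n).
    """
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)

    denom = n * sxx - sx * sx
    if denom == 0:
        # Assumption for this sketch: if every x is identical (or n == 1),
        # fall back to the horizontal line through the mean of the y-values.
        a = 0.0
    else:
        a = (n * sxy - sx * sy) / denom
    b = (sy - a * sx) / n

    sse = sum((y - a * x - b) ** 2 for x, y in points)
    return a, b, sse
```

For example, `fit_line([(0, 1), (1, 3), (2, 5)])` recovers the line y = 2x + 1 with zero error, since the three points are collinear.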
Segmented Least Squares

Segmented least squares.
Points lie roughly on a sequence of several line segments.
Given n points in the plane $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ with $x_1 < x_2 < \cdots < x_n$, find a sequence of lines that minimizes:
  E = sum of the sums of the squared errors in all segments
  L = number of lines
Tradeoff via objective function: E + cL, for some constant c > 0.

Note: discontinuous functions permitted!

[Figure: points lying roughly along several line segments; axes x and y.]
Dynamic Programming: Multiway Choice

Notation.
  OPT(j) = minimum cost for points $p_1, p_2, \ldots, p_j$.
  e(i, j) = minimum sum of squares for points $p_i, p_{i+1}, \ldots, p_j$.

To compute OPT(j):
  Last segment uses points $p_i, p_{i+1}, \ldots, p_j$ for some i.
  Cost = e(i, j) + c + OPT(i-1).

$$\mathrm{OPT}(j) = \begin{cases} 0 & \text{if } j = 0 \\ \min_{1 \le i \le j} \{\, e(i, j) + c + \mathrm{OPT}(i-1) \,\} & \text{otherwise} \end{cases}$$
Segmented Least Squares: Algorithm

INPUT: n, p_1, ..., p_n, c

Segmented-Least-Squares() {
   M[0] = 0
   for j = 1 to n
      for i = 1 to j
         compute the least square error e_ij for the segment p_i, ..., p_j

   for j = 1 to n
      M[j] = min over 1 ≤ i ≤ j of (e_ij + c + M[i-1])

   return M[n]
}

Running time. $O(n^3)$.
  Bottleneck = computing e(i, j) for $O(n^2)$ pairs, O(n) per pair using the previous formula.
  Can be improved to $O(n^2)$ by pre-computing various statistics.
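One way to realize the $O(n^2)$ bound is sketched below in Python. It is an illustration under stated assumptions rather than the lecture's own code: the name `segmented_least_squares`, the `first` array used to recover the segments, and the choice of prefix sums as the "pre-computed statistics" are all assumptions. The expanded SSE formula used here can lose precision on badly scaled inputs, so treat it as a sketch.

```python
def segmented_least_squares(points, c):
    """Return (minimum cost, segments) for points sorted by increasing x.

    Segments are (i, j) pairs of 1-based inclusive indices, matching the
    p_i, ..., p_j notation of the recurrence.
    """
    n = len(points)
    # Prefix sums of x, y, x*y, x*x, y*y so that each e(i, j) costs O(1).
    sx, sy, sxy, sxx, syy = ([0.0] * (n + 1) for _ in range(5))
    for k, (x, y) in enumerate(points, start=1):
        sx[k] = sx[k - 1] + x
        sy[k] = sy[k - 1] + y
        sxy[k] = sxy[k - 1] + x * y
        sxx[k] = sxx[k - 1] + x * x
        syy[k] = syy[k - 1] + y * y

    def e(i, j):
        """Least-squares error of the best line through p_i, ..., p_j."""
        k = j - i + 1
        if k == 1:
            return 0.0  # a single point is fit exactly
        x, y = sx[j] - sx[i - 1], sy[j] - sy[i - 1]
        xy, xx, yy = sxy[j] - sxy[i - 1], sxx[j] - sxx[i - 1], syy[j] - syy[i - 1]
        denom = k * xx - x * x
        a = (k * xy - x * y) / denom if denom != 0 else 0.0
        b = (y - a * x) / k
        # Expansion of sum (y - a*x - b)^2 in terms of the five sums.
        sse = yy - 2 * a * xy - 2 * b * y + a * a * xx + 2 * a * b * x + k * b * b
        return max(sse, 0.0)  # clamp tiny negative rounding errors

    M = [0.0] * (n + 1)   # M[j] = OPT(j)
    first = [0] * (n + 1)  # first[j] = start index of the last segment in OPT(j)
    for j in range(1, n + 1):
        M[j], first[j] = min((e(i, j) + c + M[i - 1], i) for i in range(1, j + 1))

    # Walk back through `first` to list the chosen segments.
    segments = []
    j = n
    while j > 0:
        segments.append((first[j], j))
        j = first[j] - 1
    segments.reverse()
    return M[n], segments
```

With six points where the first three lie on y = x and the last three on y = 2x + 2, and c = 1, the sketch picks two zero-error segments for a total cost of 2.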
Longest Common Subsequence
A.k.a. sequence alignment, edit distance
Longest Common Subsequence (LCS)

Given two sequences x[1..m] and y[1..n], find a longest subsequence common to them both ("a" longest, not "the" longest).

  x: A B C B D A B
  y: B D C A B A

  BCBA = LCS(x, y); BCAB is another longest common subsequence.
Motivation

Approximate string matching [Levenshtein, 1966]
  Search for "occurance", get results for "occurrence".
Computational biology [Needleman-Wunsch, 1970s]
  Simple measure of genome similarity:
    cgtacgtacgtacgtacgtacgtatcgtacgt
    acgtacgtacgtacgtacgtacgtacgtacgt
Motivation (continued)

Computational biology: simple measure of genome similarity
    acgtacgtacgtacgtcgtacgtatcgtacgt
    aacgtacgtacgtacgtcgtacgta cgtacgt
  n - length(LCS(x, y)) is called the edit distance.
Brute-force LCS algorithm

Check every subsequence of x[1..m] to see if it is also a subsequence of y[1..n].

Analysis
  Checking = O(n) time per subsequence.
  $2^m$ subsequences of x (each bit-vector of length m determines a distinct subsequence of x).
  Worst-case running time = $O(n 2^m)$ = exponential time.
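A minimal Python sketch of the brute-force idea follows. The function names are hypothetical, and one detail is an assumption not on the slide: candidates are tried longest first so the search can stop at the first hit. The worst case is still exponential in m.

```python
from itertools import combinations


def is_subsequence(s, t):
    """Return True if s is a subsequence of t (one left-to-right scan of t)."""
    it = iter(t)
    # `ch in it` advances the iterator until it finds ch, so order is respected.
    return all(ch in it for ch in s)


def lcs_brute_force(x, y):
    """Return one longest common subsequence of x and y by exhaustive search."""
    # Each choice of index positions in x plays the role of one bit-vector.
    for k in range(len(x), -1, -1):
        for idx in combinations(range(len(x)), k):
            candidate = "".join(x[i] for i in idx)
            if is_subsequence(candidate, y):
                return candidate
    return ""
```

On the running example, `lcs_brute_force("ABCBDAB", "BDCABA")` returns a common subsequence of length 4.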
Dynamic programming algorithm

Simplification:
  1. Look at the length of a longest common subsequence.
  2. Extend the algorithm to find the LCS itself.

Notation: Denote the length of a sequence s by |s|.

Strategy: Consider prefixes of x and y.
  Define c[i, j] = |LCS(x[1..i], y[1..j])|.
  Then, c[m, n] = |LCS(x, y)|.
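Recursive formulation

$$c[i, j] = \begin{cases} c[i-1, j-1] + 1 & \text{if } x[i] = y[j], \\ \max\{\, c[i-1, j],\; c[i, j-1] \,\} & \text{otherwise.} \end{cases}$$

Base case: c[i, j] = 0 if i = 0 or j = 0.

Case x[i] = y[j]:
[Figure: x[1..m] and y[1..n] drawn as arrays, with x[i] and y[j] marked as equal.]

The second case is similar.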
Dynamic-programming hallmark #1

Optimal substructure
An optimal solution to a problem (instance) contains optimal solutions to subproblems.

If z = LCS(x, y), then any prefix of z is an LCS of a prefix of x and a prefix of y.
Recursive algorithm for LCS

LCS(x, y, i, j)    // ignoring base cases
   if x[i] = y[j]
      then c[i, j] ← LCS(x, y, i-1, j-1) + 1
      else c[i, j] ← max{LCS(x, y, i-1, j), LCS(x, y, i, j-1)}
   return c[i, j]

Worst case: x[i] ≠ y[j], in which case the algorithm evaluates two subproblems, each with only one parameter decremented.
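The recursive algorithm, with the base cases filled in, might look like this in Python. The function name and the use of default arguments to start from the full strings are assumptions of this sketch; indices are 1-based as in the pseudocode, so `x[i - 1]` is the slide's x[i].

```python
def lcs_length_recursive(x, y, i=None, j=None):
    """Length of an LCS of x[1..i] and y[1..j], computed by plain recursion."""
    if i is None:
        i, j = len(x), len(y)
    if i == 0 or j == 0:
        return 0  # base case: an empty prefix has an empty LCS
    if x[i - 1] == y[j - 1]:
        return lcs_length_recursive(x, y, i - 1, j - 1) + 1
    return max(lcs_length_recursive(x, y, i - 1, j),
               lcs_length_recursive(x, y, i, j - 1))
```

This is fine for short strings such as the running example, but the running time can grow exponentially when few characters match.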
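Recursion tree

m = 7, n = 6:

  level 0:  (7,6)
  level 1:  (6,6) (7,5)
  level 2:  (5,6) (6,5) (6,5) (7,4)
  level 3:  (4,6) (5,5) (5,5) (6,4) (5,5) (6,4) (6,4) (7,3)

The two (6,5) nodes at level 2 are the same subproblem.

Height = m + n ⇒ work potentially exponential, but we're solving subproblems already solved!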
Dynamic-programming hallmark #2

Overlapping subproblems
A recursive solution contains a "small" number of distinct subproblems repeated many times.

The number of distinct LCS subproblems for two strings of lengths m and n is only mn.
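The gap between total recursive calls and distinct subproblems can be observed directly. The Python sketch below instruments the plain recursion; the name `count_lcs_calls` and the counting scheme are assumptions for illustration. It counts base-case calls too, so the number of distinct (i, j) pairs is bounded by (m+1)(n+1) rather than mn.

```python
def count_lcs_calls(x, y):
    """Return (LCS length, total recursive calls, distinct (i, j) pairs seen)."""
    calls = 0
    seen = set()

    def lcs(i, j):
        nonlocal calls
        calls += 1
        seen.add((i, j))
        if i == 0 or j == 0:
            return 0
        if x[i - 1] == y[j - 1]:
            return lcs(i - 1, j - 1) + 1
        return max(lcs(i - 1, j), lcs(i, j - 1))

    length = lcs(len(x), len(y))
    return length, calls, len(seen)
```

Running it on two strings with no characters in common, for example `count_lcs_calls("aaaaaaa", "bbbbbb")` with m = 7 and n = 6, forces the worst case: thousands of calls, but only a few dozen distinct subproblems.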
Memoization algorithm

Memoization: After computing a solution to a subproblem, store it in a table. Subsequent calls check the table to avoid redoing work.

LCS(x, y, i, j)
   if c[i, j] = NIL
      then if x[i] = y[j]                      // same as before
              then c[i, j] ← LCS(x, y, i-1, j-1) + 1
              else c[i, j] ← max{LCS(x, y, i-1, j), LCS(x, y, i, j-1)}
   return c[i, j]

Time = Θ(mn) = constant work per table entry.
Space = Θ(mn).
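A minimal Python sketch of the memoized version, using `None` in place of NIL. The names `lcs_length_memo` and `solve` are hypothetical, and the base cases (omitted on the slide) are handled before the table lookup. Because it still recurses, very long inputs could hit Python's recursion limit.

```python
def lcs_length_memo(x, y):
    """Length of an LCS of x and y using top-down recursion plus a table."""
    m, n = len(x), len(y)
    # c[i][j] caches the answer for prefixes x[1..i], y[1..j]; None means NIL.
    c = [[None] * (n + 1) for _ in range(m + 1)]

    def solve(i, j):
        if i == 0 or j == 0:
            return 0
        if c[i][j] is None:  # not yet computed: do the work once and store it
            if x[i - 1] == y[j - 1]:
                c[i][j] = solve(i - 1, j - 1) + 1
            else:
                c[i][j] = max(solve(i - 1, j), solve(i, j - 1))
        return c[i][j]

    return solve(m, n)
```

Each table entry is filled at most once, which is where the Θ(mn) time bound comes from.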
Dynamic-programming algorithm

IDEA: Compute the table bottom-up.

(columns: x = A B C B D A B; rows: y = B D C A B A)

          A  B  C  B  D  A  B
       0  0  0  0  0  0  0  0
    B  0  0  1  1  1  1  1  1
    D  0  0  1  1  1  2  2  2
    C  0  0  1  2  2  2  2  2
    A  0  1  1  2  2  2  3  3
    B  0  1  2  2  3  3  3  4
    A  0  1  2  2  3  3  4  4
Dynamic-programming algorithm (continued)

Time = Θ(mn).
Reconstruct LCS by tracing backwards through the table.
Multiple solutions are possible.
Space = Θ(mn). KT book: O(min{m, n}).
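The bottom-up table, the backwards trace, and a reduced-space variant can be sketched in Python as follows. All function names are hypothetical. Two details are assumptions of this sketch: the table is stored as `c[i][j]` with i indexing x, so it is the transpose of the table drawn above, and ties during the trace are broken by moving up first, which picks one of the possible LCSs. The two-row variant returns only the length, not the subsequence itself.

```python
def lcs_table(x, y):
    """Fill c[i][j] = length of an LCS of x[1..i] and y[1..j], bottom-up."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]  # row 0 and column 0 are base cases
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c


def lcs_traceback(x, y):
    """Return one LCS of x and y by tracing backwards from c[m][n]."""
    c = lcs_table(x, y)
    i, j = len(x), len(y)
    out = []
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])  # matched character belongs to the LCS
            i, j = i - 1, j - 1
        elif c[i - 1][j] >= c[i][j - 1]:
            i -= 1  # tie-breaking choice; other choices yield other LCSs
        else:
            j -= 1
    return "".join(reversed(out))


def lcs_length_two_rows(x, y):
    """LCS length only, keeping two rows of length min(m, n) + 1."""
    if len(y) > len(x):
        x, y = y, x  # make y the shorter sequence
    prev = [0] * (len(y) + 1)
    for ch in x:
        cur = [0]
        for j in range(1, len(y) + 1):
            if ch == y[j - 1]:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]
```

For x = "ABCBDAB" and y = "BDCABA", `lcs_table(x, y)[7][6]` is 4, matching the bottom-right entry of the table, and `lcs_traceback(x, y)` returns one of the length-4 common subsequences.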