Dynamic programming. Course 2017


Fibonacci Recurrence.

n-th Fibonacci Term
INPUT: a natural number n
QUESTION: Compute F_n = F_{n-1} + F_{n-2}

Recursive-Fibonacci(n)
  if n = 0 then
    return 0
  else if n = 1 then
    return 1
  else
    return Fibonacci(n-1) + Fibonacci(n-2)
  end if

Computing F_7. Since F_{n+1}/F_n → (1+√5)/2 ≈ 1.61803, F_n grows like 1.6^n, and to compute F_n this way we need about 1.6^n recursive calls. [Recursion tree for F(7): F(5) is computed twice, F(4) three times, F(3) five times, ...]

A DP algorithm. To avoid recomputing the same subproblems, carry out the computation bottom-up and store the partial results in a table.

DP-Fibonacci(n)
  {Construct table}
  F[0] = 0
  F[1] = 1
  for i = 2 to n do
    F[i] = F[i-1] + F[i-2]
  end for

To get F_n we need O(n) time and O(n) space.

F[0] F[1] F[2] F[3] F[4] F[5] F[6] F[7]
  0    1    1    2    3    5    8   13

The recursive (top-down) approach is very slow: too many identical subproblems, and lots of repeated work. Therefore, go bottom-up and store the partial results in a table. This lets us look up the solutions of subproblems instead of recomputing them, which is far more efficient.
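To make the contrast concrete, here is a minimal Python sketch of both approaches (function names are ours):

# Top-down: recursion + memoization.
def fib_memo(n, memo=None):
    if memo is None:
        memo = {0: 0, 1: 1}
    if n not in memo:
        memo[n] = fib_memo(n - 1, memo) + fib_memo(n - 2, memo)
    return memo[n]

# Bottom-up: fill the table F[0..n] left to right.
def fib_table(n):
    F = [0, 1] + [0] * max(0, n - 1)
    for i in range(2, n + 1):
        F[i] = F[i - 1] + F[i - 2]
    return F[n]

assert fib_memo(7) == fib_table(7) == 13   # the table entry F[7] of the slide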

Dynamic Programming. Richard Bellman: An introduction to the theory of dynamic programming. RAND, 1953. Today it would rather be called dynamic planning. Dynamic programming is a powerful technique of divide-and-conquer type for efficiently computing recurrences, by storing partial results and re-using them when needed. It explores the space of all possible solutions by decomposing the problem into subproblems and then building up correct solutions to larger and larger problems. The number of subproblems generated can be exponential, but if the number of different subproblems is polynomial, we can get a polynomial-time solution by avoiding repeated computation of the same subproblem.

Properties of Dynamic Programming Dynamic Programming works when: Optimal substructure: an optimal solution to the problem contains optimal solutions to its subproblems. Overlapping subproblems: a recursive solution contains a small number of distinct subproblems, each repeated many times. Difference with greedy: greedy problems have the greedy-choice property: locally optimal choices lead to a globally optimal solution. For some DP problems the greedy choice is not possible: a globally optimal solution requires backtracking through many choices. I.e., in DP we consider all feasible solutions, while in greedy we are bound by the initial choice.

Guideline to implement Dynamic Programming 1. Characterize the structure of an optimal solution: make sure the space of subproblems is not exponential; define the variables. 2. Define recursively the value of an optimal solution: find the correct recurrence, expressing the solution of the larger problem as a function of the solutions of subproblems. 3. Compute, bottom-up, the cost of a solution: using the recurrence, tabulate solutions to smaller problems until arriving at the value for the whole problem. 4. Construct an optimal solution: trace back from the optimal value.

Implementation of Dynamic Programming Memoization: a technique consisting of storing the results of subproblems and returning the stored result when the same subproblem occurs again; it is used to speed up computer programs. Implementing the DP recurrence directly with recursion can be very inefficient, because it solves the same subproblems many times. But if we manage to solve and store the solutions of subproblems without repeating any computation, recursion + memoization becomes a clever implementation. To implement memoization use any dictionary data structure, usually tables or hashing.

Implementation of Dynamic Programming The other way to implement DP is with iterative algorithms. DP trades storage space for speed. Although the memoized recursive algorithm has the same asymptotic running time as the iterative version, the constant factor hidden in the O is larger because of the overhead of recursion. On the other hand, the memoized version is usually easier to program, more concise and more elegant. Top-down: recursive. Bottom-up: iterative.

Weighted Activity Selection Problem INPUT: a set S = {1, 2,..., n} of activities to be processed by a single resource. Each activity i has a start time s_i, a finish time f_i with f_i > s_i, and a weight w_i. QUESTION: find a set A of mutually compatible activities maximizing Σ_{i∈A} w_i. Recall: the greedy strategy does not always solve this problem. [Figure: a set of weighted intervals on which the greedy choice fails.]

Notation for the weighted activity selection problem We have activities {1, 2,..., n} with f_1 ≤ f_2 ≤ ... ≤ f_n and weights {w_i}. Therefore, we may need an O(n lg n) pre-processing sorting step. Define p(j) to be the largest integer i < j such that activities i and j are disjoint (p(j) = 0 if no disjoint i < j exists). Let Opt(j) be the value of the optimal solution to the problem consisting of the activities in the range 1 to j, and let O_j be the set of jobs in an optimal solution for {1,..., j}. [Figure: six intervals on the time line 0..13, with p(1)=0, p(2)=0, p(3)=1, p(4)=0, p(5)=3, p(6)=3.]

Recurrence Consider the subproblem on {1,..., j}. We have two cases: 1. j ∈ O_j: w_j is part of the solution, no job in {p(j)+1,..., j-1} is in O_j, and if O_{p(j)} is the optimal solution for {1,..., p(j)} then O_j = O_{p(j)} ∪ {j} (optimal substructure). 2. j ∉ O_j: then O_j = O_{j-1}.

Opt(j) = 0                                     if j = 0
Opt(j) = max{Opt(p(j)) + w_j, Opt(j-1)}        if j ≥ 1

Recursive algorithm Given the set of activities S, we start with a pre-processing phase: sort the activities by increasing {f_i} and compute and tabulate P[1..n] = {p(j)}. The cost of the pre-processing phase is O(n lg n + n). From now on we assume S is sorted and all p(j) are computed and tabulated in P[1..n]. To compute Opt(j):

R-Opt(j)
  if j = 0 then
    return 0
  else
    return max(w_j + R-Opt(p(j)), R-Opt(j-1))
  end if

Recursive algorithm [Recursion tree for R-Opt(6): the subproblems Opt(3), Opt(2), Opt(1) are recomputed many times.] What is the worst-case running time of this algorithm? O(2^n).

Iterative algorithm Assuming the input is the set S of n activities sorted by increasing f, each i with its s_i and w_i, and the tabulated values p(j), we fill a 1 × (n+1) table M[0..n], where M[i] contains the value of the best partial solution for activities 1 to i.

Opt-Val(n)
  Define table M[0..n]
  M[0] = 0
  for j = 1 to n do
    M[j] = max(M[P[j]] + w_j, M[j-1])
  end for
  return M[n]

Notice: this algorithm gives only the numerical maximum weight. Time complexity: O(n) (not counting the pre-processing).

Iterative algorithm: Example

i   s_i  f_i  w_i  p(i)
1    1    5    1    0
2    0    8    2    0
3    7    9    2    1
4    1   11    3    0
5    9   12    1    3
6   10   13    2    3

j:     0  1  2  3  4  5  6
M[j]:  0  1  2  3  3  4  5

Sol.: Max weight = 5

The DP algorithm: Memoization Notice we only have to solve n different subproblems: Opt(1) to Opt(n). Memoization: store the already computed values of subproblems in a dictionary data structure (tables, hashing, ...). In the algorithm below we use a table M[0..n], initially empty.

Find-Opt(j)
  if j = 0 then
    return 0
  else if M[j] is already computed then
    return M[j]
  else
    M[j] = max(Find-Opt(P[j]) + w_j, Find-Opt(j-1))
    return M[j]
  end if

Time complexity: O(n)

The DP algorithm To also get the list of selected activities:

Find-Opt(j)
  if j = 0 then
    return ∅
  else if M[P[j]] + w_j > M[j-1] then
    return {j} together with Find-Opt(P[j])
  else
    return Find-Opt(j-1)
  end if

Time complexity: O(n)
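The whole pipeline in Python, a sketch under the slide's conventions (activities given as (s, f, w) triples, p computed by binary search; all names are ours):

import bisect

def weighted_interval_scheduling(acts):
    # Sort by finish time: the O(n log n) pre-processing step.
    acts = sorted(acts, key=lambda a: a[1])
    n = len(acts)
    finish = [f for (s, f, w) in acts]
    # p[j] = largest i < j (1-based) whose finish time is <= s_j; 0 if none.
    p = [0] * (n + 1)
    for j in range(1, n + 1):
        p[j] = bisect.bisect_right(finish, acts[j - 1][0], 0, j - 1)
    # M[j] = value of the optimal solution for activities 1..j.
    M = [0] * (n + 1)
    for j in range(1, n + 1):
        M[j] = max(M[p[j]] + acts[j - 1][2], M[j - 1])
    # Trace back the selected activities.
    sol, j = [], n
    while j > 0:
        if M[p[j]] + acts[j - 1][2] > M[j - 1]:
            sol.append(acts[j - 1])
            j = p[j]
        else:
            j -= 1
    return M[n], sol[::-1]

# The example of the slides: value 5, taking activities 1, 3 and 6.
acts = [(1, 5, 1), (0, 8, 2), (7, 9, 2), (1, 11, 3), (9, 12, 1), (10, 13, 2)]
print(weighted_interval_scheduling(acts))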

0-1 Knapsack 0-1 Knapsack INPUT: a set I = {1,..., n} of items that can NOT be fractioned, each item i with weight w_i and value v_i, and a maximum permissible weight W. QUESTION: select a subset of the items with total weight at most W maximizing the total value. Recall that we can NOT take fractions of items.

Characterize the structure of an optimal solution and define the recurrence To describe a subproblem we need two variables: i (the last item considered) and w (the weight still available). Let v[i, w] be the maximum value (optimum) we can get from objects {1, 2,..., i} within total weight w. We wish to compute v[n, W].

Recurrence To compute v[i, w] we have two possibilities: either the i-th item is not part of the solution, and we fall back to v[i-1, w]; or the i-th item is part of the solution, and we add v_i and charge its weight. This gives the recurrence

v[i, w] = 0                            if i = 0 or w = 0
v[i, w] = v[i-1, w-w_i] + v_i          if i is part of the solution
v[i, w] = v[i-1, w]                    otherwise

i.e. v[i, w] = max{v[i-1, w-w_i] + v_i, v[i-1, w]}.

DP algorithm Define a table M = v[0..n, 0..W].

Knapsack(n, W)
  for w = 0 to W do
    v[0, w] := 0
  end for
  for i = 1 to n do
    v[i, 0] := 0
  end for
  for i = 1 to n do
    for w = 1 to W do
      if w_i > w then
        v[i, w] := v[i-1, w]
      else
        v[i, w] := max{v[i-1, w], v[i-1, w-w_i] + v_i}
      end if
    end for
  end for
  return v[n, W]

The number of steps is O(nW).

Example.

i    1  2  3  4  5
w_i  1  2  5  6  7
v_i  1  6 18 22 28

W = 11.

      w: 0  1  2  3  4  5  6  7  8  9 10 11
i=0:     0  0  0  0  0  0  0  0  0  0  0  0
i=1:     0  1  1  1  1  1  1  1  1  1  1  1
i=2:     0  1  6  7  7  7  7  7  7  7  7  7
i=3:     0  1  6  7  7 18 19 24 25 25 25 25
i=4:     0  1  6  7  7 18 22 24 28 29 29 40
i=5:     0  1  6  7  7 18 22 28 29 34 35 40

For instance, v[5, 11] = max{v[4, 11], v[4, 11-7] + 28} = max{40, 7 + 28} = 40.

Recovering the solution Keep the table M = v[0..n, 0..W] and, for every cell [i, w], a list L[i, w] with the items of the best solution for that cell.

Sol-Knaps(n, W)
  Initialize M
  Define a list L[i, w] for every cell [i, w]
  for i = 1 to n do
    for w = 0 to W do
      v[i, w] := max{v[i-1, w], v[i-1, w-w_i] + v_i}
      if v[i-1, w] < v[i-1, w-w_i] + v_i then
        add i and the contents of L[i-1, w-w_i] to L[i, w]
      end if
    end for
  end for
  return v[n, W] and the list L[n, W]

Complexity: O(nW) + the cost of maintaining the lists L[i, w]. Question: how would you implement the L[i, w]'s?

Solution for the previous example The optimal value v[5, 11] = 40 is achieved by the items {3, 4} (weights 5 + 6 = 11, values 18 + 22 = 40); this pair appears in the lists L[4, 11] and L[5, 11]:

      w: 0  1  2  3  4  5  6  7  8  9 10 11
i=0:     0  0  0  0  0  0  0  0  0  0  0  0
i=1:     0  1  1  1  1  1  1  1  1  1  1  1
i=2:     0  1  6  7  7  7  7  7  7  7  7  7
i=3:     0  1  6  7  7 18 19 24 25 25 25 25
i=4:     0  1  6  7  7 18 22 24 28 29 29 40 (3,4)
i=5:     0  1  6  7  7 18 22 28 29 34 35 40 (3,4)
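A Python sketch of the table-filling plus reconstruction; instead of storing the lists L[i, w], it walks the finished table backwards, which answers the question above with only O(n) extra work (names are ours):

def knapsack(weights, values, W):
    n = len(weights)
    # v[i][w] = best value using items 1..i within weight w.
    v = [[0] * (W + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for w in range(W + 1):
            if weights[i - 1] > w:
                v[i][w] = v[i - 1][w]
            else:
                v[i][w] = max(v[i - 1][w],
                              v[i - 1][w - weights[i - 1]] + values[i - 1])
    # Reconstruction: item i was taken iff the value changed at row i.
    sol, w = [], W
    for i in range(n, 0, -1):
        if v[i][w] != v[i - 1][w]:
            sol.append(i)              # 1-based item index
            w -= weights[i - 1]
    return v[n][W], sol[::-1]

# The example of the slides:
print(knapsack([1, 2, 5, 6, 7], [1, 6, 18, 22, 28], 11))   # (40, [3, 4])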

Hidden structure in a DP algorithm Every DP algorithm works because there is an underlying DAG structure, where each node represents a subproblem and each edge is a precedence constraint on the order in which the subproblems can be solved. Edges may carry weights, depending on the problem. Having nodes a_1, a_2,..., a_n point to b means that b can be solved only once a_1, a_2,..., a_n are solved. Could you come up with the DAG associated to the DP solution of the Knapsack?

DNA: The book of life DNA is the hereditary material in almost all living organisms, and it can replicate itself. It works like a program, unique to each individual organism, that rules the working and evolution of the organism. DNA is a string of about 3 × 10^9 characters over {A, T, G, C}.

DNA: Chromosomes and genes The nucleus of each cell contains the DNA molecule, packaged into thread-like structures called chromosomes. In humans, each cell normally contains 23 pairs of chromosomes, for a total of 46. A gene is the basic unit of heredity; genes are made up of DNA and act as instructions to make proteins. Humans have between 20,000 and 25,000 genes. Every person has 2 copies of each gene, one inherited from each parent. Most genes are the same in all humans, but about 0.1% of them are slightly different between people. Alleles are forms of the same gene with small differences in their sequence of DNA. These small differences contribute to each person's unique traits.

Computational genomics: Some questions When a new gene is discovered, one way to gain insight into its working is to find well-known genes (not necessarily in the same species) which match it closely. Biologists suggest a generalization of edit distance as a definition of approximate match. GenBank (https://www.ncbi.nlm.nih.gov/genbank/) has a collection of > 10^10 well-studied genes; BLAST is software to search quickly for similarities between a gene and a DB of genes. Sequencing DNA consists in determining the order of the DNA bases in a short sequence of 500-700 characters of DNA. To get the global picture of the whole DNA chain, we generate a large number of such sequences and try to assemble them into a coherent DNA sequence. This last part is usually the difficult one, as the position of each sequence in the global DNA chain is not known beforehand.

Evolution

DNA:       T A C A G T A C G
Mutation:  T A C A C T A C G
Deletion:  T C A G A C G
Insertion: A T C A G A C G

Sequence alignment problem How similar are T A C A G T A C G and A T C A G A C G?

Formalizing the problem
Longest common substring: substring = chain of characters without gaps.
  T C A T G T A G A C T
  A T C A G A
Longest common subsequence: subsequence = ordered chain of characters, possibly with gaps.
  T C A T G T A G A C T
  A T C A G A
Edit distance: convert one string into the other using a given set of operations.
  T A C A G T A C G → A T C A G A C G?

String similarity problem: The Longest Common Subsequence LCS INPUT: sequences X = ⟨x_1 ... x_m⟩ and Y = ⟨y_1 ... y_n⟩ QUESTION: compute the longest common subsequence. A sequence Z = ⟨z_1 ... z_k⟩ is a subsequence of X if there is a subsequence of integers 1 ≤ i_1 < i_2 < ... < i_k ≤ m such that z_j = x_{i_j}. If Z is a subsequence of both X and Y, then Z is a common subsequence of X and Y. Given X = ATATAT, TTT is a subsequence of X.

Greedy approach LCS: given sequences X = ⟨x_1 ... x_m⟩ and Y = ⟨y_1 ... y_n⟩, compute the longest common subsequence. A natural greedy scans the strings and repeatedly matches the next available pair of equal characters:

Greedy(X, Y)
  S := ∅
  for i = 1 to m do
    for j = i to n do
      if x_i = y_j then S := S ∪ {x_i} end if
      let y_l be such that l = min{a > j : x_i = y_a}
      let x_k be such that k = min{a > i : x_a = y_j}
      if l < k then
        S := S ∪ {x_i}, i := i + 1, j := l + 1
      else
        S := S ∪ {x_k}, i := k + 1, j := j + 1
      end if
      if no such y_l or x_k exists then
        i := i + 1 and j := j + 1
      end if
    end for
  end for

The greedy approach does not work: for X = ATCAC and Y = CGCACACACT the result of greedy is AT, but the optimal solution is ACAC.

DP approach: Characterization of the optimal solution Let X = ⟨x_1 ... x_n⟩ and Y = ⟨y_1 ... y_m⟩. Let X[i] = ⟨x_1 ... x_i⟩ and Y[j] = ⟨y_1 ... y_j⟩. Define c[i, j] = length of the LCS of X[i] and Y[j]. We want c[n, m], i.e., the length of the LCS of X and Y. What is a subproblem? Subproblem = something that goes part of the way in converting one string into the other.

Characterization of the optimal solution and recurrence If X = CGATC and Y = ATAC, then c[5, 4] = c[4, 3] + 1 (the last characters match). If X = CGAT and Y = ATA, to find c[4, 3] take either the LCS of CGAT and AT or the LCS of CGA and ATA, so c[4, 3] = max(c[3, 3], c[4, 2]). Therefore, given X and Y:

c[i, j] = c[i-1, j-1] + 1              if x_i = y_j
c[i, j] = max(c[i, j-1], c[i-1, j])    otherwise

Recursion tree

c[i, j] = c[i-1, j-1] + 1              if x_i = y_j
c[i, j] = max(c[i, j-1], c[i-1, j])    otherwise

[Recursion tree for c[3,2]: the subproblems c[2,1], c[1,1], c[1,0], ... appear several times.]

The direct top-down implementation of the recurrence

LCS(X, Y)
  if m = 0 or n = 0 then
    return 0
  else if x_m = y_n then
    return 1 + LCS(x_1 ... x_{m-1}, y_1 ... y_{n-1})
  else
    return max{LCS(x_1 ... x_{m-1}, y_1 ... y_n), LCS(x_1 ... x_m, y_1 ... y_{n-1})}
  end if

The algorithm explores a tree of depth Θ(n + m), so the time complexity is exponential: 2^{Θ(n+m)}.

Bottom-up solution Avoid the exponential running time by tabulating the subproblems and not repeating their computation. To memoize the values c[i, j] we use a table c[0..n, 0..m]. Starting from c[0, j] = 0 for 0 ≤ j ≤ m and c[i, 0] = 0 for 0 ≤ i ≤ n, fill, row by row and left to right, all the c[i, j]. [Figure: cell c[i, j] depends on c[i-1, j-1], c[i-1, j] and c[i, j-1].] Use an extra field b[i, j] attached to c[i, j] to record which of the three cases gave the value.

Bottom-up solution

LCS(X, Y)
  for i = 1 to n do c[i, 0] := 0 end for
  for j = 1 to m do c[0, j] := 0 end for
  for i = 1 to n do
    for j = 1 to m do
      if x_i = y_j then
        c[i, j] := c[i-1, j-1] + 1, b[i, j] := "↖"
      else if c[i-1, j] ≥ c[i, j-1] then
        c[i, j] := c[i-1, j], b[i, j] := "↑"
      else
        c[i, j] := c[i, j-1], b[i, j] := "←"
      end if
    end for
  end for

Time and space complexity: T = O(nm).

Example. X = (ATCTGAT); Y = (TGCATA). Therefore m = 6, n = 7.

        j: 0  1  2  3  4  5  6
              T  G  C  A  T  A
i=0        0  0  0  0  0  0  0
i=1  A     0  0  0  0  1  1  1
i=2  T     0  1  1  1  1  2  2
i=3  C     0  1  1  2  2  2  2
i=4  T     0  1  1  2  2  3  3
i=5  G     0  1  2  2  2  3  3
i=6  A     0  1  2  2  3  3  4
i=7  T     0  1  2  2  3  4  4

Construct the solution Use as input the tables c and b. The first call to the algorithm is con-lcs(c, n, m).

con-lcs(c, i, j)
  if i = 0 or j = 0 then
    STOP
  else if b[i, j] = "↖" then
    con-lcs(c, i-1, j-1)
    output x_i
  else if b[i, j] = "↑" then
    con-lcs(c, i-1, j)
  else
    con-lcs(c, i, j-1)
  end if

The algorithm has time complexity O(n + m).
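A Python sketch of the bottom-up table plus traceback; it returns the subsequence itself, reading the moves off the c table instead of a separate b table (names are ours):

def lcs(X, Y):
    n, m = len(X), len(Y)
    # c[i][j] = length of the LCS of X[:i] and Y[:j].
    c = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if X[i - 1] == Y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    # Traceback from c[n][m], O(n + m) steps.
    out, i, j = [], n, m
    while i > 0 and j > 0:
        if X[i - 1] == Y[j - 1]:
            out.append(X[i - 1]); i -= 1; j -= 1
        elif c[i - 1][j] >= c[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return ''.join(reversed(out))

print(lcs("ATCTGAT", "TGCATA"))   # an LCS of length 4: "TCTA"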

LCS: Underlying DAG Assign weight 0 to the diagonal edges (i-1, j-1) → (i, j) with x_i = y_j, and weight 1 to the remaining edges of the DAG. Find the minimum-weight path from (0, 0) to (n, m). [Figure: the grid DAG on the cells (i, j), with diagonal, horizontal and vertical edges.]

Edit Distance. The edit distance between strings X = x_1 ... x_n and Y = y_1 ... y_m is defined to be the minimum number of edit operations needed to transform X into Y.

Edit Distance: Levenshtein distance The most relevant set of operations is the one of the Levenshtein distance: insert(X, i, a) = x_1 ... x_i a x_{i+1} ... x_n; delete(X, i) = x_1 ... x_{i-1} x_{i+1} ... x_n; modify(X, i, a) = x_1 ... x_{i-1} a x_{i+1} ... x_n. The cost of each operation is 1. Main applications of the Levenshtein distance: computational genomics (similarity between strings over {A, T, G, C}) and Natural Language Processing (distance between strings over the alphabet).

Example 1 X = aabab and Y = babb.

aabab = X
X = insert(X, 0, b):  baabab
X = delete(X, 2):     babab
Y = delete(X, 4):     babb

A shorter edit sequence:

aabab = X
X = modify(X, 1, b):  babab
Y = delete(X, 4):     babb

To find the shortest one, use dynamic programming.

Example 2 E[gat → gel]: aligning g a t with g e l gives E[1,1] = 0 (g = g), E[2,2] = 1 (modify a → e), E[3,3] = 2 (modify t → l), so E[gat → gel] = 2. E[gat → pagat]: transforming character by character and then inserting the tail would give E[3,5] = 4, BUT E[3,5] = 2. How? Insert p and a in front: gat → agat → pagat, two insertions. NOTICE: E[gat → pagat] is equivalent to E[pagat → gat].

1. Characterize the structure of an optimal solution and set the recurrence. Assume we want to find the edit distance from X = TATGCAAGTA to Y = CAGTAGTC; E[10, 8] will be the minimum distance between X and Y. Consider the prefixes TATGCA and CAGT, and let E[6, 4] be the edit distance between them. The last operation is one of: delete the last A of X (D): E[6, 4] = E[5, 4] + 1; insert a T at the end of X (I): E[6, 4] = E[6, 3] + 1; modify the last A of X into a T (M): E[6, 4] = E[5, 3] + 1. Now consider the prefixes TATGCA and CAGTA: both end in A, so E[6, 5] = E[5, 4] (keep the last A and compare TATGC and CAGT).

To compute the edit distance from X = x_1 ... x_n to Y = y_1 ... y_m, let X[i] = x_1 ... x_i and Y[j] = y_1 ... y_j, and let E[i, j] = edit distance from X[i] to Y[j]. If x_i ≠ y_j, the last step from X[i] to Y[j] must be one of:
1. put y_j at the end of X: x → x_1 ... x_i y_j, and then transform x_1 ... x_i into y_1 ... y_{j-1}: E[i, j] = E[i, j-1] + 1
2. delete x_i: x → x_1 ... x_{i-1}, and then transform x_1 ... x_{i-1} into y_1 ... y_j: E[i, j] = E[i-1, j] + 1
3. change x_i into y_j: x → x_1 ... x_{i-1} y_j, and then transform x_1 ... x_{i-1} into y_1 ... y_{j-1}: E[i, j] = E[i-1, j-1] + 1
4. if x_i = y_j: E[i, j] = E[i-1, j-1] + 0

Recurrence Therefore, we have the recurrence

E[i, j] = i                                  if j = 0 (converting X[i] → λ)
E[i, j] = j                                  if i = 0 (converting λ → Y[j])
E[i, j] = min{ E[i-1, j] + 1         (D),
               E[i, j-1] + 1         (I),
               E[i-1, j-1] + d(x_i, y_j) }   otherwise

where d(x_i, y_j) = 0 if x_i = y_j and 1 otherwise.

2. Computing the optimal costs.

Edit(X, Y)
  for i = 0 to n do E[i, 0] := i end for
  for j = 0 to m do E[0, j] := j end for
  for i = 1 to n do
    for j = 1 to m do
      if x_i = y_j then d(x_i, y_j) := 0 else d(x_i, y_j) := 1 end if
      E[i, j] := E[i, j-1] + 1
      if E[i-1, j-1] + d(x_i, y_j) < E[i, j] then
        E[i, j] := E[i-1, j-1] + d(x_i, y_j), b[i, j] := "↖"
      else if E[i-1, j] + 1 < E[i, j] then
        E[i, j] := E[i-1, j] + 1, b[i, j] := "↑"
      else
        b[i, j] := "←"
      end if
    end for
  end for

Time and space complexity: T = O(nm).

Example: X = aabab; Y = babb. Therefore n = 5, m = 4.

        j: 0  1  2  3  4
           λ  b  a  b  b
i=0  λ     0  1  2  3  4
i=1  a     1  1  1  2  3
i=2  a     2  2  1  2  3
i=3  b     3  2  2  1  2
i=4  a     4  3  2  2  2
i=5  b     5  4  3  2  2

3. Construct the solution. Use as input the tables E and b. The first call to the algorithm is con-edit(E, n, m).

con-edit(E, i, j)
  if i = 0 or j = 0 then
    STOP
  else if b[i, j] = "↖" then
    if x_i ≠ y_j then modify(X, i, y_j) end if
    con-edit(E, i-1, j-1)
  else if b[i, j] = "↑" then
    delete(X, i), con-edit(E, i-1, j)
  else
    insert(X, i, y_j), con-edit(E, i, j-1)
  end if

This traceback has time complexity O(n + m).
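A Python sketch of the Levenshtein DP; it returns the distance and one optimal sequence of operations, read off the finished table backwards (names are ours):

def edit_distance(X, Y):
    n, m = len(X), len(Y)
    # E[i][j] = edit distance from X[:i] to Y[:j].
    E = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        E[i][0] = i
    for j in range(m + 1):
        E[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if X[i - 1] == Y[j - 1] else 1
            E[i][j] = min(E[i - 1][j] + 1,         # delete x_i
                          E[i][j - 1] + 1,         # insert y_j
                          E[i - 1][j - 1] + d)     # modify (or keep)
    # Traceback: recover one optimal sequence of operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        d = 0 if i > 0 and j > 0 and X[i - 1] == Y[j - 1] else 1
        if i > 0 and j > 0 and E[i][j] == E[i - 1][j - 1] + d:
            if d:
                ops.append(("modify", i, Y[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and E[i][j] == E[i - 1][j] + 1:
            ops.append(("delete", i)); i -= 1
        else:
            ops.append(("insert", i, Y[j - 1])); j -= 1
    return E[n][m], ops[::-1]

# Example 1 of the slides: distance 2, via modify(1, b) and delete(4).
print(edit_distance("aabab", "babb"))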

Sequence Alignment. Finding similarities between sequences is important in bioinformatics. For example: locate similar subsequences in DNA; locate DNA fragments which may overlap. Similar sequences may have evolved from a common ancestor; evolution modifies sequences by mutations: replacements, deletions, insertions.

Sequence Alignment. Given two sequences over the same alphabet:

GCGCATGGATTGAGCGA
TGCGCCATTGATGACCA

An alignment:

-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A

Alignments consist of: perfect matches, mismatches, insertions and deletions. There are many alignments of two sequences. Which one is better?

BLAST To compare alignments, define a scoring function:

s(i) = +1 if the i-th pair matches
s(i) = -1 if the i-th pair mismatches
s(i) = -2 if the i-th pair is an InDel

Score of an alignment = Σ_i s(i). Better = the maximum score among all alignments of the two sequences. Use the dynamic programming technique + faster heuristics.
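A sketch of the exact score-maximizing DP (Needleman-Wunsch style) with the weights above; BLAST itself replaces this exact DP with faster heuristics, and all names here are ours:

def alignment_score(X, Y, match=1, mismatch=-1, indel=-2):
    n, m = len(X), len(Y)
    # S[i][j] = best score of an alignment of X[:i] and Y[:j].
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        S[i][0] = i * indel
    for j in range(m + 1):
        S[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if X[i - 1] == Y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + pair,   # match / mismatch
                          S[i - 1][j] + indel,      # gap in Y
                          S[i][j - 1] + indel)      # gap in X
    return S[n][m]

# The two sequences of the alignment example above:
print(alignment_score("GCGCATGGATTGAGCGA", "TGCGCCATTGATGACCA"))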

Dynamic Programming in Trees Trees are nice graphs on which to bound the number of subproblems. Given a tree T = (V, A) with |V| = n, recall that there are n subtrees rooted at its vertices. Therefore, for problems defined on trees it is easy to bound the number of subproblems, which allows us to use dynamic programming to give polynomial solutions, on tree inputs, to graph problems that are difficult in general.

The Maximum Weight Independent Set (MWIS) INPUT: G = (V, E), together with a weight w : V → R. QUESTION: find the S ⊆ V of maximum total weight such that no two vertices in S are connected in G. For general G the problem is difficult: the case with all weights = 1 is already NP-complete. [Figure: a small weighted graph.]

MWIS on Trees Given a tree T = (V, E), choose an r ∈ V and root T at r. INSTANCE: a rooted tree T_r = (V, E) and a set of weights w : V → R. QUESTION: find the independent set of nodes with maximum weight. Notation: given T_r, where r is the root, for v ∈ V let T_v denote the subtree rooted at v. For v ∈ T_r, let F(v) be the set of children of v and N(v) the set of grandchildren of v. For any T_v, let S_v be a maximum weight independent set of T_v; we want the weight of S_r. [Example tree T_a used below: a (6) with children b (4), c (8), d (8); b has children e (5), f (6); c has children g (2), h (8), i (3); h has children l (5), m (4), n (6); d has children j (9), k (7); j has child o (2).]

Characterization of the optimal solution Key observation: an optimal MWIS set S_r in T_r cannot contain vertices which are father and son. If r ∈ S_r: then F(r) ∩ S_r = ∅, and S_r − {r} contains an optimum solution of T_v for each v ∈ N(r). If r ∉ S_r: S_r contains an optimum solution of T_u for each u ∈ F(r). Recursive definition of the optimal solution: for each v keep two values, M(v) = weight of the best independent set of T_v containing v, and M'(v) = weight of the best independent set of T_v not containing v:

M(v) = w(v), M'(v) = 0                          if v is a leaf
M(v) = w(v) + Σ_{u∈F(v)} M'(u)                  otherwise
M'(v) = Σ_{u∈F(v)} max{M(u), M'(u)}             otherwise

The weight of the MWIS of T_r is max{M(r), M'(r)}.

Recall: Post-order traversal of a rooted tree [Figure: the example tree T_a.] Post-order: e f b g l m n h i c o j k d a

Iterative DP algorithm to find M(r) Let v_1,..., v_n = r be the post-order traversal of T_r (so every vertex appears after its children). Define a 2 × n table A to store the values of M and M'.

WIS(T_r)
  Let v_1,..., v_n = r be the post-order traversal of T_r
  for i = 1 to n do
    if v_i is a leaf then
      M(v_i) = w(v_i), M'(v_i) = 0
    else
      M(v_i) = w(v_i) + Σ_{u∈F(v_i)} M'(u)
      M'(v_i) = Σ_{u∈F(v_i)} max{M(u), M'(u)}
    end if
    store M(v_i) and M'(v_i) in A[2i-1] and A[2i]
  end for

Complexity: space O(n), time O(n) (we visit the vertices in post-order and examine each edge a constant number of times). How can we recover the set S_r?

Top-down algorithm to recover the WIS S_r Let r = v_n,..., v_1 be the reverse (level-order) traversal of T_r, from the root down.

Recover-WIS(T_r, A)
  S_r := ∅
  for i = n downto 1 do
    if M(v_i) > M'(v_i) and the father of v_i is not in S_r then
      S_r := S_r ∪ {v_i}
    end if
  end for
  return S_r

Bottom-up

Leaves: M(e) = 5, M(f) = 6, M(g) = 2, M(l) = 5, M(m) = 4, M(n) = 6, M(i) = 3, M(o) = 2, M(k) = 7 (and M' = 0 for all of them).
M(b) = 4, M'(b) = 11
M(h) = 8, M'(h) = 15
M(c) = 23, M'(c) = 20
M(j) = 9, M'(j) = 2
M(d) = 10, M'(d) = 16
M(a) = 6 + M'(b) + M'(c) + M'(d) = 53
M'(a) = max{M(b), M'(b)} + max{M(c), M'(c)} + max{M(d), M'(d)} = 11 + 23 + 16 = 50

Top-down As M(a) = 53 > M'(a) = 50, a goes into the solution and its children b, c, d stay out; continuing down the tree, we take every vertex v with M(v) > M'(v) whose father is out. Maximum Weight Independent Set: S = {a, e, f, g, i, j, k, l, m, n}, w(S) = 53.

Complexity. Space complexity: the table A has 2n entries, O(n). Time complexity: Step 1: to compute M and M' we only need to look at the children of the vertex under consideration; summed over all n vertices this is O(n). Step 2: from the root down to the leaves, for each vertex v compare M(v) with M'(v) and decide whether to include v: Θ(n). As we have to consider every vertex at least once, there is a lower bound of Ω(n), so the total number of steps is Θ(n).
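A Python sketch of the whole procedure (names are ours; the tree is the example above, given as children lists). It computes M and M' in post-order and recovers S top-down, skipping the children of chosen vertices:

def mwis_tree(children, w, root):
    # Post-order traversal (iterative): children before their father.
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))
    order.reverse()
    M, Mx = {}, {}   # M[v]: best IS of T_v with v; Mx[v]: without v.
    for v in order:
        kids = children.get(v, [])
        M[v] = w[v] + sum(Mx[u] for u in kids)
        Mx[v] = sum(max(M[u], Mx[u]) for u in kids)
    # Top-down recovery: a vertex may be taken only if its father was not.
    S, stack = set(), [(root, True)]
    while stack:
        v, allowed = stack.pop()
        take = allowed and M[v] > Mx[v]
        if take:
            S.add(v)
        for u in children.get(v, []):
            stack.append((u, not take))
    return max(M[root], Mx[root]), S

children = {'a': ['b', 'c', 'd'], 'b': ['e', 'f'], 'c': ['g', 'h', 'i'],
            'd': ['j', 'k'], 'h': ['l', 'm', 'n'], 'j': ['o']}
w = dict(a=6, b=4, c=8, d=8, e=5, f=6, g=2, h=8, i=3,
         j=9, k=7, l=5, m=4, n=6, o=2)
print(mwis_tree(children, w, 'a'))   # (53, {a, e, f, g, i, j, k, l, m, n})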

How to win an election: Gerrymandering Gerrymandering is the practice of carving up electoral districts in carefully chosen ways, so as to lead to outcomes that favor a particular political party.

Gerrymandering The word gerrymander was coined in 1812 after Massachusetts governor Elbridge Gerry signed into law a salamander-shaped district.

Gerrymandering Ten districts and 1000 voters: 500 pink and 500 green. You want to maximize the number of districts won by the pink party. [Figure: a districting where green wins one district and pink wins nine.]

Gerrymandering Given a set of n precincts P_1,..., P_n, each containing m voters, we want to divide the precincts into two districts, each consisting of n/2 precincts. For each precinct we have an estimate of the number of votes for party A and for party B. A set of precincts is susceptible to gerrymandering if it is possible to divide it into two districts such that the same party holds a majority in both districts. Give an algorithm to determine whether a given set of precincts is susceptible to gerrymandering, with running time polynomial in n and m.

Solution: DP We ask whether there is a partition of P_1,..., P_n into two districts C_1, C_2 such that one party (wlog A) has a majority in both. (Similar to 0-1 Knapsack.) Assume that n and m are even, let a = total number of votes for A, and let a_q = votes for A in precinct P_q. If A has a majority in both C_1 and C_2, then a ≥ mn/2 + 2. (A wins in both C_1 and C_2 iff there is a subset of n/2 precincts where A gets s > nm/4 votes and in the complementary n/2 precincts A gets a − s > nm/4 votes.) So gerrymandering in favor of A is possible iff some achievable s satisfies nm/4 < s < a − nm/4. Optimal substructure: there is a choice of k precincts among {P_1,..., P_q} with s votes for A iff either there is a choice of k − 1 precincts among {P_1,..., P_{q-1}} with s − a_q votes for A (P_q chosen), or a choice of k precincts among {P_1,..., P_{q-1}} with s votes for A (P_q not chosen).

Solution: Recurrence Given precincts {P_1,..., P_q}, let T(q, k, s) = T iff there exists a subset S ⊆ {P_1,..., P_q} with |S| = k whose total number of votes for A is s = Σ_{i∈S} a_i:

T(q, 0, 0) = T   for all q ≥ 0
T(q, k, s) = T   if q ≥ 1 and T(q-1, k-1, s - a_q) = T
T(q, k, s) = T   if q ≥ 1 and T(q-1, k, s) = T
T(q, k, s) = F   otherwise

(in particular T(1, 1, a_1) = T, the base case from which the filling starts). The algorithm fills a Boolean table T of size n × (n/2) × (a + 1). Once T is filled, we check the values T[n, n/2, s] for all nm/4 < s < a − nm/4; if one of them is T, the instance is susceptible to gerrymandering. Since a ≤ nm, the complexity is O(n^3 m).
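A Python sketch of this Boolean DP; it keeps, for each k, the set of achievable sums s instead of a full three-dimensional array, which is an equivalent representation (names, and the test instance, are ours; n is assumed even):

def susceptible(a, m):
    # a[q] = votes for A in precinct q+1; every precinct has m voters.
    n, total = len(a), sum(a)
    # T[k] = set of sums s achievable by choosing exactly k precincts.
    T = [set() for _ in range(n // 2 + 1)]
    T[0].add(0)
    for q in range(n):
        # k descending, so precinct q is used at most once.
        for k in range(min(q, n // 2 - 1), -1, -1):
            T[k + 1] |= {s + a[q] for s in T[k]}
    quarter = n * m / 4
    # A needs a majority in both districts: s > nm/4 and total - s > nm/4.
    return any(quarter < s < total - quarter for s in T[n // 2])

# Hypothetical instance: 4 precincts of 100 voters each.
print(susceptible([55, 43, 60, 47], 100))   # True: districts {55,47} and {43,60}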

Limitations of Dynamic Programming Problems with the optimal substructure property are amenable to dynamic programming, and many optimization problems satisfy it. The big limitation to the use of dynamic programming is the number of different subproblems we must keep track of. Often, dynamic programming is used to speed up an exponential-time solution.