CSE 591 Foundations of Algorithms
Homework 4
Sample Solution Outlines

Problem 1

(a) Consider the situation in the figure: every edge has the same weight and $|V| = n = 2k + 2$. It is easy to check that every simple path from $s$ to $t$ is a shortest path, and the number of such paths is $2^{k+1} = 2^{n/2}$, which grows faster than any polynomial function of $n$. Many other constructions work as well.

(b) We need just a small modification of Dijkstra's algorithm.

Modified Dijkstra's algorithm

    Let dis[] be the distance vector
    Let num[] be a vector for the number of shortest paths
    Let Q be a priority queue
    for each vertex v do
        dis[v] = ∞
        num[v] = 0
    end for
    dis[s] = 0
    num[s] = 1
    Add every vertex to Q
    while Q is not empty do
        u = vertex in Q with min dis[u]
        Remove u from Q
        for each neighbor v of u do
            if dis[u] + length(u, v) < dis[v] then
                dis[v] = dis[u] + length(u, v)
                num[v] = num[u]
            else if dis[u] + length(u, v) == dis[v] then
                num[v] += num[u]
            end if
        end for
    end while
    Output num[t]

Here num[t] is the number of shortest $s$-$t$ paths. The algorithm works correctly because we update num[v] only when the current path length is the shortest found so far: a strictly shorter path resets num[v] to num[u], and a path of equal length adds num[u] to num[v]. If a predecessor $u$ of $v$ lies on a shortest path to $v$, then by optimal substructure every shortest path to $u$ extends to a shortest path to $v$, so we add num[u] to num[v]. (We assume that edge lengths are positive, so that num[u] is final when $u$ is removed from Q.) The complexity of the algorithm is the same as that of the original Dijkstra's algorithm.

Problem 2

Recall from class that, given an RNA sequence $x_1 \cdots x_n$, we computed the largest number of matching pairs $\mathrm{opt}(i, j)$ for each consecutive subsequence $x_i \cdots x_j$ with $1 \le i < j \le n$ by dynamic programming, working from the smallest values of $j - i$ to the largest. Now we want the $\kappa$ largest numbers of matching pairs in a solution; in the problem statement $\kappa = 10$, but we state it generally.

We keep track of the best $\kappa$ sizes for each choice of $i$ and $j$, so that $\mathrm{opt}(i, j)$ is now a multiset of size at most $\kappa$. The only change we make is in the recurrence for $\mathrm{opt}(i, j)$. We compute it as follows. When $j - i \le 4$, the multiset $\mathrm{opt}(i, j) = \{0\}$, because sharp turns are not permitted. For two multisets $A$ and $B$, let $A \oplus B$ be the multiset $\{a + b : a \in A, b \in B\}$. To compute $\mathrm{opt}(i, j)$, we form a multiset $X$ of candidates. Initially $X$ is empty. For every $i \le t < j - 4$ for which $\{x_j, x_t\} = \{C, G\}$ or $\{x_j, x_t\} = \{A, U\}$, add the elements of $\mathrm{opt}(i, t - 1) \oplus \mathrm{opt}(t + 1, j - 1)$ to $X$. Remove all but the $\kappa$ largest elements of $X$; then add 1 to each entry. Finally, because $x_j$ may also be left unpaired, merge in the elements of $\mathrm{opt}(i, j - 1)$ and again keep only the $\kappa$ largest.

Keeping track of the actual sets of matching pairs is now straightforward; each size in one of the multisets can be associated with a specific set of matching pairs. The running time is, up to constant factors, the same as that of the original algorithm when $\kappa$ is fixed.

Problem 3

In order to avoid kinks, we use different states. Let $\mathrm{opt}(i, j)$ be the optimal number of matching pairs between positions $i$ and $j$ with no kinks, sharp turns, or crossings.
In addition, let $f(i, j)$ be the optimal number of pairs between positions $i$ and $j$ when $x_i$ and $x_j$ can be matched; here we force $x_i$ to be matched to $x_j$, and all other pairs inside must be valid.

We can then derive a recurrence relating $\mathrm{opt}(i, j)$ and $f(i, j)$. If $x_j$ is not involved in any pair, then

$$\mathrm{opt}(i, j) = \mathrm{opt}(i, j - 1).$$

Otherwise,

$$\mathrm{opt}(i, j) = \max_t \{\mathrm{opt}(i, t - 1) + f(t, j)\}$$

over all $t$ such that $x_t$ can be matched to $x_j$ and $j > t + 4$. The idea is to enumerate the pair in which $x_j$ is involved; by definition, its contribution is $f(t, j)$. The rest of the sequence becomes the subproblem $\mathrm{opt}(i, t - 1)$.

Next we look at the structure of $f(i, j)$. By definition, $x_i$ is matched to $x_j$, and we need to handle the subsequence from $x_{i+1}$ to $x_{j-1}$. If we use $\mathrm{opt}(i + 1, j - 1)$ directly, we may fall into a trap, since $\mathrm{opt}(i + 1, j - 1)$ may be obtained from $f(i + 1, j - 2)$ or $f(i + 2, j - 1)$, either of which forms a kink. To avoid that, we enumerate the matching pairs in which $x_{i+1}$ and $x_{j-1}$ are involved. Initially we set every $f(i, j)$ to 1, and then apply the following recurrences.

First, we focus on $x_{i+1}$:

$$f(i, j) = \max\{f(i, j),\ 1 + \max_t \{f(i + 1, t) + \mathrm{opt}(t + 1, j - 1)\}\},$$

where we check all $t > i + 1 + 4$ such that $x_t$ can be paired with $x_{i+1}$ and $t \ne j - 2$. Again, we are enumerating the pair in which $x_{i+1}$ is matched, and it will not create a kink.

Similarly, we handle $x_{j-1}$:

$$f(i, j) = \max\{f(i, j),\ 1 + \max_t \{\mathrm{opt}(i + 1, t - 1) + f(t, j - 1)\}\}$$

for all $t < j - 1 - 4$ except $t = i + 2$.

Finally, we also consider the case in which neither $x_{i+1}$ nor $x_{j-1}$ is in the substructure; then we have

$$f(i, j) = \max\{f(i, j),\ 1 + \mathrm{opt}(i + 2, j - 2)\}.$$

For edge cases, we have

$$f(i, j) = 1 \text{ if } j = i + 5, \qquad \mathrm{opt}(i, j) = 0 \text{ if } i \ge j - 4.$$

All entries can be computed in $O(n^3)$ time, where $n$ is the length of the whole sequence.

Problem 4

Let $|V| = n$ and $w(e) = 1$ for every edge. Select an arbitrary vertex $r$ and treat it as the root of the tree. The distance from $r$ to every vertex can be calculated using, for example, breadth-first search in time linear in the number of edges (and hence linear in the number of vertices, because $G$ is a tree). The total of the distances from $r$ can then be calculated, and dividing this total by $n$ yields the average distance from $r$. Our objective, therefore, is to select a root $r$ that minimizes the total of the distances from $r$.

Suppose that we have rooted the tree at $r$. We calculate, for each vertex $v$, its number $\delta_v$ of descendants in the tree (counting $v$ itself) as follows. First compute the degree $d_v$ of each vertex in $O(n)$ time; in the process form a list $L$ of non-root vertices of degree 1. Initialize $\delta_v = 1$ for each vertex. Now, while $L$ is not empty, choose $v \in L$, let $w$ be the parent of $v$, add $\delta_v$ to $\delta_w$, delete $v$, and if $w$ now has degree 1 (i.e., has no other children), add $w$ to $L$ unless it is the root. When $L$ becomes empty, $\delta_v$ is the count of descendants of $v$ for every vertex $v$.

We claim that $r$ is a correct vertex at which to place the CA if and only if every child $c$ of $r$ has $\delta_c \le n/2$. To see this, first suppose that some child $c$ of $r$ has $\delta_c > n/2$. Moving the root from $r$ to $c$ adds 1 to the distance to $n - \delta_c$ vertices but subtracts 1 from the distance to $\delta_c$ vertices. Hence the total distance decreases by moving the root to $c$, and $r$ is not the correct vertex to choose.

In the other direction, suppose that every child $c$ of $r$ has $\delta_c \le n/2$. Suppose to the contrary that there is a vertex for which the total of the distances is less than from $r$; choose such a vertex $f$ that is closest to $r$. Now $f$ must be a descendant of a child $c$ of $r$, and hence $f$ has at most $n/2$ descendants. But by the argument above, moving from $f$ to its parent cannot increase the total of the distances, a contradiction. So $r$ is indeed a correct choice.

This underlies the algorithm. Having tried $r$, we check whether any child of $r$ has more than $n/2$ descendants. If none does, we respond with $r$. Otherwise we choose such a child $c$ and move the root from $r$ to $c$. To update the numbers of descendants, only two changes are needed: if $r$ had $n$ descendants and $c$ had $\delta_c$, then $r$ will now have $n - \delta_c$ and $c$ will have $n$. This takes constant time.
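The rerooting procedure described above can be sketched in Python as follows. This is a minimal illustration, not the course's reference implementation: the names `best_root`, `total_distance`, and `delta` are assumptions, vertices are assumed to be numbered `0..n-1`, and the descendant counts are computed in reverse breadth-first order rather than by the leaf-peeling list $L$ (a substitution that yields the same counts).

```python
from collections import deque


def best_root(n, edges):
    """Return a vertex of the tree on 0..n-1 minimizing total distance to all vertices."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    # Root the tree at vertex 0 (an arbitrary choice) and record a BFS order.
    parent = [-1] * n
    order = []
    seen = [False] * n
    seen[0] = True
    queue = deque([0])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            if not seen[v]:
                seen[v] = True
                parent[v] = u
                queue.append(v)

    # delta[v] = number of vertices in the subtree of v, counting v itself.
    delta = [1] * n
    for v in reversed(order):
        if parent[v] != -1:
            delta[parent[v]] += delta[v]

    r = 0
    while True:
        # Every neighbor of the current root is one of its children.
        heavy = next((c for c in adj[r] if 2 * delta[c] > n), None)
        if heavy is None:
            return r
        # Move the root from r to heavy: only these two counts change.
        delta[r] = n - delta[heavy]
        delta[heavy] = n
        r = heavy


def total_distance(n, edges, r):
    """Sum of BFS distances from r; used only to check a result by brute force."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    dist = [-1] * n
    dist[r] = 0
    queue = deque([r])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if dist[v] == -1:
                dist[v] = dist[u] + 1
                queue.append(v)
    return sum(dist)
```

On the path with edges `(0,1), (1,2), (2,3), (3,4)`, `best_root(5, ...)` walks the root from 0 to 1 to 2 and stops at the middle vertex 2.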
We can move the root no more than $n$ times, so after $O(n)$ moves, each taking constant time, we report a correct vertex.

Problem 5

Suppose that the given English word is $W = w_1 w_2 \cdots w_n$ and the Bengali word is $B = b_1 b_2 \cdots b_m$. We suppose that the standard alphabet in which each is written is $\Sigma$. We are to find a word $P = p_1 \cdots p_l$ for which

$$\max\{ed(W, P), ed(B, P)\} = \min_{Q \in \Sigma^*} \max\{ed(W, Q), ed(B, Q)\}.$$

In defining the edit distance $ed(x, y)$, we assume that the gap penalty is $\delta > 0$, that the mismatch penalty $\alpha_{aa} = 0$ for $a \in \Sigma$, and that the mismatch penalty $\alpha_{ab} = \alpha_{ba} > 0$ for distinct $a, b \in \Sigma$.

As in class, we can form a directed graph $G$ on $(n + 1)(m + 1)$ vertices, say $\{0, \ldots, n\} \times \{0, \ldots, m\}$. Draw a directed edge from $(i, j)$ to $(i + 1, j)$ with cost $\delta$, a directed edge from $(i, j)$ to $(i, j + 1)$ with cost $\delta$, and a directed edge from $(i, j)$ to $(i + 1, j + 1)$ with cost $\alpha_{w_{i+1} b_{j+1}}$. Then $ed(W, B)$ is precisely the length of a shortest path from $(0, 0)$ to $(n, m)$ in $G$. Intuitively we want to split this path in half to find a good candidate for PIE, but we need to be careful.
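Because every edge of $G$ leads from a smaller to a larger coordinate, shortest-path lengths from $(0, 0)$ can be filled in row by row. The following Python sketch illustrates this; the function name `edit_distance` and the parameters `gap` (standing for $\delta$) and `mismatch` (a function standing for $\alpha$) are assumptions made for illustration.

```python
def edit_distance(w, b, gap, mismatch):
    """Length of a shortest path from (0, 0) to (n, m) in the grid graph G."""
    n, m = len(w), len(b)
    # dist[i][j] = shortest-path length from (0, 0) to (i, j).
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * gap          # only edges (i-1, 0) -> (i, 0) reach here
    for j in range(1, m + 1):
        dist[0][j] = j * gap          # only edges (0, j-1) -> (0, j) reach here
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j] + gap,                               # edge from (i-1, j)
                dist[i][j - 1] + gap,                               # edge from (i, j-1)
                dist[i - 1][j - 1] + mismatch(w[i - 1], b[j - 1]),  # diagonal edge
            )
    return dist[n][m]


def unit_cost(a, c):
    """Example mismatch penalty: 0 for equal letters, 1 otherwise (an assumption)."""
    return 0 if a == c else 1
```

With `gap = 1` and `unit_cost` this reduces to the ordinary Levenshtein distance, so `edit_distance("kitten", "sitting", 1, unit_cost)` evaluates to 3.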
Consider a particular shortest path from $(0, 0)$ to $(n, m)$ in $G$. Let $L$ be the list of insertions, deletions, and substitutions performed on $W$ in following this path. The total cost of the operations in $L$ is denoted by $t$. No operation in $L$ can cost more than $2\delta$; if one did, it would be a substitution and could be replaced by an insertion and a deletion, lowering the total cost. Choose a subset $L'$ of $L$ whose costs total $t'$, with $t'$ as close to $t/2$ as possible. Applying the operations in $L'$ to $W$ yields a word $P$ with $ed(W, P) = t'$ and $ed(P, B) = t - t'$. We note that edit distance satisfies the triangle inequality $ed(W, B) \le ed(W, P) + ed(P, B)$, so that $P$ appears to be a good candidate if $t'$ and $t - t'$ are as equal as possible.

This is a good start, but we cannot be sure that we have selected the right path, and we have not said how to find $L'$. I would be very happy if anyone had gotten to a similar point in developing an answer.

To get an exact answer, the idea is to build paths starting from $(0, 0)$ in $G$, keeping track at each vertex $(i, j)$ of a set of ordered pairs, each of which specifies a distance from $w_1 \cdots w_i$ and from $b_1 \cdots b_j$ to a closest candidate in PIE. The concern is that there appear to be too many pairs at each vertex to keep track of; but note that if one pair has both entries at least as large as the corresponding entries in another pair, we do not need to keep it. Together with some plausible assumptions about the mismatch penalties, we can then ensure that the list of pairs at each vertex has polynomial length. Needless to say, I have omitted many details.

And how might one find all? Sketch: Enumerate paths in $G$ of close to shortest length; consider each way to split its operations into two sets of approximately equal cost.
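The pruning rule mentioned above, which discards a pair when both of its entries are at least as large as those of another pair, can be sketched in Python as follows. The function name `prune_dominated` is hypothetical, and the sketch covers only the pruning step for one vertex's list, not the full dynamic program whose details the outline omits.

```python
def prune_dominated(pairs):
    """Keep only the pairs that are not dominated by another pair.

    A pair (a, b) is dominated if some other pair (a2, b2) has a2 <= a and b2 <= b.
    """
    kept = []
    smallest_second = float("inf")
    # After sorting, any pair that could dominate the current one comes earlier.
    for a, b in sorted(set(pairs)):
        if b < smallest_second:
            kept.append((a, b))
            smallest_second = b
    return kept
```

After pruning, the first entries strictly increase while the second entries strictly decrease, so the surviving list is no longer than the number of distinct values either entry can take.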