Advanced Algorithms
Lecture Notes for April 5, 2016
Dynamic Programming, continued (HMMs); Iterative Improvement
Bernard Moret


Dynamic Programming (continued)

Finite Markov Models

A finite Markov model is a discrete-time (event-driven) stochastic model defined on a set of states $S = \{s_1, s_2, \ldots, s_n\}$ through a transition probability matrix $T = (t_{ij})$, where $t_{ij}$ is the probability of moving from state $s_i$ to state $s_j$. (In practice, many of the entries in $T$ are set to zero: only certain transitions are allowed in the model.) $T$ is a (right) stochastic matrix, i.e., each $t_{ij}$ is a real number in $[0,1]$ and every row sums to 1. A finite Markov model is sometimes also called a stochastic finite automaton.

As with finite automata, it is convenient to add an output function to the model. While we could associate an output with each transition, the normal usage is to associate an output with each state, the equivalent of a Moore machine for finite automata. If the output is a character from an alphabet $\Sigma$, we add a second (right) stochastic matrix $E = (e_{ij})$, the emission matrix, where $e_{ij}$ is the probability of producing (emitting) the $j$th character of $\Sigma$ when in state $s_i$. If we run the model, it produces some output string; for convenience, we assume that this string is finite.

Notice that we are looking at a very large number of parameters: in the worst case, the $n \times n$ transition matrix has $n(n-1)$ free parameters, to which we must add $n(|\Sigma|-1)$ free parameters for the emission matrix. In practice, therefore, we assume that we know the finite automaton (i.e., we know the number of states and we know which transitions are assumed to have zero probability), but not its parameters: we will attempt to infer (learn) the parameters from collected data.

So what can we collect about the model? Obviously, we can collect data representing the output; so denote by $x$ the observed output string, by $x_i$ the character in the $i$th position of $x$, and by $m$ the length of $x$. (The model may be capable of producing strings of unbounded length, but of course we can make only bounded observations.) In a limited number of problems, we may also be able to observe directly the state of the model, thereby giving rise, for the same $m$, to a string of observed states $y = y_1 y_2 \cdots y_m$.

Hidden Markov Models

For most problems arising from nature, the actual state of the process under observation is not observable; indeed, it may not even be well defined, since the model is certainly a simplification of nature. However, since we are observing the process, we can certainly collect the output. In our terminology, we can get $x$, but not $y$; when such is the case, we call the model a Hidden Markov Model, or HMM for short.
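To make the setup concrete, here is a minimal sketch, in Python, of one way to represent such a model and run it to produce an output string. The states, alphabet, and probabilities are made up for illustration only, and the explicit start distribution stands in for the optional start state discussed below.

```python
import random

# A toy model with n = 2 states and alphabet Sigma = {a, b}.
states = ["s1", "s2"]
sigma = ["a", "b"]

# Transition matrix T: T[i][j] = probability of moving from state i to state j.
# Each row sums to 1 (a right stochastic matrix).
T = [[0.9, 0.1],
     [0.2, 0.8]]

# Emission matrix E: E[i][k] = probability of emitting sigma[k] while in state i.
E = [[0.7, 0.3],
     [0.1, 0.9]]

# Initial state distribution.
start = [0.5, 0.5]

def sample(weights):
    """Draw an index at random according to the given probabilities."""
    r, acc = random.random(), 0.0
    for i, w in enumerate(weights):
        acc += w
        if r < acc:
            return i
    return len(weights) - 1

def run_model(m):
    """Run the model for m steps; return the state path and the emitted string."""
    path, output = [], []
    state = sample(start)
    for _ in range(m):
        path.append(states[state])
        output.append(sigma[sample(E[state])])
        state = sample(T[state])
    return path, "".join(output)

y, x = run_model(10)
print("state path y:", y)      # hidden in a true HMM
print("output string x:", x)   # what we actually observe
```

In the hidden setting described below, only the output string x is available; the state path y is exactly what the dynamic programs of this lecture try to reason about.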

HMMs are found everywhere, but are perhaps most common in two areas. They completely changed the area of voice recognition, an area that had been lavishly funded, mostly in vain, for 30 years by the military. Now we have Alexa, Cortana, Siri, and other personal assistants, and the voice-recognition part is completely routine; the challenge is in understanding the nature of the requests and formulating appropriate responses. The other area is computational biology, where HMMs are used to analyze genomes (at several billion elements each, these are quite long strings), on much the same basis: let the HMM do the work of recognizing patterns rather than attempt to develop specialized algorithms that keep failing to generalize properly.

One can use an HMM as a recognizer: by formulating and training several HMMs, one for each class of outcomes, one can compute the probability that each would have generated some new sequence of observations and then assign that new sequence to the class most likely to have generated it. One can also study the innards of the machine and ask about the probability of occupation of a certain state, the most likely path through the HMM for a given sequence of observations, etc. These all require knowledge of the entire model, including the values of all of its numerous parameters, something that is not usually preset by the designer, but inferred (through training) from actual data. Hence another fundamental question about an HMM is how to infer the values $t_{ij}$ and $e_{ik}$ from observed data. This last task is done today through the Baum-Welch algorithm, a special case of EM (expectation-maximization), in effect an approximation algorithm based on iterative improvement; we shall return to it later in the course. The former tasks (classification and detailed statistics on the functioning of the model), however, are done through dynamic programming, and so we tackle them now. We formalize three related questions:

1. How do we reconstruct (and compute the likelihood of) the most likely state sequence for a given HMM and a given output sequence?

2. How do we compute the likelihood of a given output sequence?

3. How do we compute the likelihood of a particular state at a particular step for the observed output sequence?

All three of these questions can be answered by the same basic dynamic program, with small variations. We will build a table with one row for each state and one column for each position in the observed output sequence; thus our table will be $n \times m$. (Note that we may want to introduce two additional states, i.e., two additional rows, and one additional column; the extra states are a start state and an end state, and the extra column corresponds to position 0, just before the first character of the output string. These additions do not alter what follows, but they prevent having to treat special rows or columns differently from the others and thus tend to simplify and speed up an implementation.)

1. The solution here is the so-called Viterbi algorithm. We fill the table column by column, starting at column 0 and ending at column $m$; entry $(i,j)$ of the table, that is, $P_v(i,j)$, will denote the probability of the most likely path to state $s_i$ after reading the $j$th character of $x$. We can compute $P_v(i,j)$ according to the recurrence

$$P_v(i,j) = \max_l \bigl( t_{li} \, e_{i,x_j} \, P_v(l,\, j-1) \bigr).$$

In words, we just look one step back along the path, to a previous entry $P_v(l, j-1)$ for some previous state $l$, and compute the probability of the extension from that previous step to the current state $s_i$ at step $j$ with symbol $x_j$ emitted, retaining the largest value. It should be noted that, in spite of appearances, what is computed is really a joint probability, not (at least not intentionally) a conditional probability. That is, the recurrence computes the maximum value of the probability $p(\pi, x)$, where $x$ is the output string and $\pi$ is a path through the states of the HMM that produces that string. However, it is not hard to see that computing the maximum (over choices of $\pi$) of $p(\pi, x)$ is equivalent to computing the maximum of $p(\pi \mid x)$, the conditional (posterior) probability.
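As a concrete illustration, here is a minimal sketch of the Viterbi recurrence in Python. It assumes the same representation as the sampling sketch above (row-stochastic matrices T and E and a start distribution), takes the output x as a list of symbol indices, keeps back-pointers so that the most likely path can be recovered, and omits the optional end state.

```python
def viterbi(T, E, start, x):
    """Return (probability of the best path, that path) for the output sequence x.

    T[l][i]  : transition probability from state l to state i
    E[i][k]  : probability that state i emits symbol number k
    start[i] : probability of starting in state i (plays the role of column 0)
    x        : observed output, as a list of symbol indices
    """
    n, m = len(T), len(x)
    Pv = [[0.0] * m for _ in range(n)]    # Pv[i][j]: best path ending in state i at step j
    back = [[0] * m for _ in range(n)]    # back[i][j]: maximizing predecessor state

    for i in range(n):                    # first column: leave the start distribution
        Pv[i][0] = start[i] * E[i][x[0]]

    for j in range(1, m):                 # fill the table column by column
        for i in range(n):
            best_l = max(range(n), key=lambda l: T[l][i] * Pv[l][j - 1])
            Pv[i][j] = T[best_l][i] * E[i][x[j]] * Pv[best_l][j - 1]
            back[i][j] = best_l

    # Trace the most likely path backward from the best entry in the last column.
    last = max(range(n), key=lambda i: Pv[i][m - 1])
    path = [last]
    for j in range(m - 1, 0, -1):
        path.append(back[path[-1]][j])
    path.reverse()
    return Pv[last][m - 1], path
```

With the toy model above, viterbi(T, E, start, [0, 1, 1, 0]) returns the joint probability $p(\pi, x)$ of the best path for the output "abba", together with that path given as a list of state indices.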

2. To get the likelihood of the output sequence, we could look at every state path that can generate it, compute the probability of each such path, then sum them all to obtain $p(x) = \sum_\pi p(\pi, x)$, using the notation from question 1; but this takes time proportional to the number of paths and thus could take exponential time. Instead, we implicitly look at all paths every time we add another output character. The recurrence is nearly identical to the Viterbi recurrence: we simply replace the max operator by a summation,

$$P_f(i,j) = \sum_l t_{li} \, e_{i,x_j} \, P_f(l,\, j-1).$$

(Note that the emission probability is independent of the summation index and so can be factored out of the sum.) The dynamic program resulting from this recurrence is often called the forward algorithm, hence my choice of $P_f$ for this function. The likelihood of the output is then obtained from the last column as $p(x) = \sum_i P_f(i, m)$.

3. Here we really want to compute $p(s_i, j \mid x)$, the probability that, given that the HMM produced the output sequence $x$, it was in state $s_i$ at step $j$. Because this is a conditional probability (unlike the two computed above), we proceed somewhat indirectly, using joint probabilities that we do know how to compute. We have

$$p(s_i, j, x) = p(s_i, x_1 x_2 \cdots x_j) \cdot p(x_{j+1} x_{j+2} \cdots x_m \mid s_i, x_1 x_2 \cdots x_j).$$

Note that the first factor is exactly $P_f(i,j)$ from the forward algorithm, so we know how to compute it. The second factor can be simplified: thanks to the Markov (memoryless) property, the dependency on $x_1 x_2 \cdots x_j$ disappears, so the second factor reduces to $p(x_{j+1} x_{j+2} \cdots x_m \mid s_i)$, and we can easily compute that term by the same dynamic program again, but this time running it backward (which gives us the conditioning for free). The recurrence is then

$$P_b(i,j) = \sum_l t_{il} \, e_{l,x_{j+1}} \, P_b(l,\, j+1),$$

where the subscript $b$ denotes the backward version. Now we simply have

$$p(s_i, j \mid x) = \frac{P_f(i,j) \, P_b(i,j)}{p(x)},$$

with $p(x)$ computed by the forward algorithm as above.
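Under the same assumptions as the Viterbi sketch, here are minimal sketches of the forward and backward passes and of the posterior computation of question 3; the function and variable names are illustrative only.

```python
def forward(T, E, start, x):
    """Pf[i][j] = probability of emitting x[0..j] and being in state i at step j."""
    n, m = len(T), len(x)
    Pf = [[0.0] * m for _ in range(n)]
    for i in range(n):
        Pf[i][0] = start[i] * E[i][x[0]]
    for j in range(1, m):
        for i in range(n):
            Pf[i][j] = E[i][x[j]] * sum(T[l][i] * Pf[l][j - 1] for l in range(n))
    return Pf

def backward(T, E, x):
    """Pb[i][j] = probability of emitting x[j+1..m-1] given state i at step j."""
    n, m = len(T), len(x)
    Pb = [[1.0] * m for _ in range(n)]    # an empty suffix has probability 1
    for j in range(m - 2, -1, -1):
        for i in range(n):
            Pb[i][j] = sum(T[i][l] * E[l][x[j + 1]] * Pb[l][j + 1] for l in range(n))
    return Pb

def posterior(T, E, start, x):
    """post[i][j] = p(state i at step j | output x), combining the two tables."""
    n, m = len(T), len(x)
    Pf, Pb = forward(T, E, start, x), backward(T, E, x)
    px = sum(Pf[i][m - 1] for i in range(n))   # total likelihood p(x) of the output
    return [[Pf[i][j] * Pb[i][j] / px for j in range(m)] for i in range(n)]
```

Each column of the posterior table sums to 1, a convenient sanity check; a real implementation would work with logarithms or rescale each column, since these products of probabilities underflow quickly on long output sequences.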

What is the running time of each of these three algorithms? All have to fill the entire $n \times m$ table; moreover, in order to fill it, all have to look at all transitions into (or out of) each state. Thus, to process one column of the table, every transition of the HMM must be examined and used in the recurrence, so the cost of processing one entire column is proportional to $\alpha n$, the number of nonzero entries in the transition matrix. The overall running time of the algorithms is therefore $\Theta(\alpha n m)$; this can be as low as order $nm$ if $\alpha$ (the average out-degree of a node in the state-transition diagram) is a constant, and as high as order $n^2 m$ if $\alpha$ is some fraction of $n$ (as in a dense state graph).

Clearly, there are other questions we might ask about the model itself, such as the probability that the HMM is in some given state at any time during the process, the probability that a particular transition is used at any time during the process, and many others. Most, if not all, such questions can be answered through dynamic programming approaches similar to the three above.

Iterative Improvement

Stable Marriage

Imagine you run a match-making service and introduce 20 young men and 20 young women with the goal of eventually forming 20 happy couples. After some period of getting acquainted, you ask each woman to rank-order all 20 men, from first choice to last choice as a potential husband, and similarly you ask each man to rank-order all 20 women, from first choice to last choice as a potential wife. A marriage of woman $y$ to man $x$ is said to be stable if there does not exist any other couple $(x', y')$ such that either woman $y$ prefers $x'$ to her own husband and man $x'$ prefers $y$ to his own wife, or (the symmetric situation) man $x$ prefers $y'$ to his own wife and woman $y'$ prefers $x$ to her own husband. (If either of these two situations occurs, then obviously there is an incentive for a swap of spouses, which would remedy the situation; hence the marriages $(x, y)$ and $(x', y')$ would not be stable.)
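The definition of stability translates directly into a test. Here is a minimal sketch, in Python, of such a check, assuming complete preference lists given as rankings from most to least preferred; the dictionary representation and the function name are illustrative choices, not part of the notes.

```python
def is_stable(matching, men_prefs, women_prefs):
    """Check a complete matching for stability.

    matching      : dict mapping each man to his wife
    men_prefs[m]  : list of women, most preferred first
    women_prefs[w]: list of men, most preferred first
    Returns True iff no man and woman prefer each other to their assigned partners.
    """
    husband_of = {w: m for m, w in matching.items()}

    for m, w in matching.items():
        # Walk down m's list: every woman listed before his wife w is one he prefers to w.
        for w2 in men_prefs[m]:
            if w2 == w:
                break
            # Does w2 in turn prefer m to her current husband?  If so, (m, w2) blocks.
            if women_prefs[w2].index(m) < women_prefs[w2].index(husband_of[w2]):
                return False
    return True
```

A matching is stable exactly when this function finds no blocking pair; checking from the men's side suffices, since any blocking pair involves a man who prefers the woman in question to his own wife.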

Phrased as a marital endeavor, the problem seems silly, but in fact it has quite a range of applications, although in most applications the lists of preferences will not include all members of the other set. The best-known application in the US is The Match for medical residencies: medical students about to graduate visit several residency programs (teaching hospitals with openings in this or that area of medicine) all over the USA, while these programs receive visits from a (much larger) number of medical students and interview them. (Most residency programs have more than one position in each medical area, hence the larger number of students.) At the end of this period of courtship, each medical student files with a central authority a list of her/his preferences (i.e., a rank ordering of the programs visited), while each residency program files with the same central authority a rank-ordered list of all students who visited. Then, on a fateful day known simply as Match Day, a computer program computes a stable matching, that is, a list of pairs (student, residency position), and both students and residency programs are notified of the outcome. A very large number of students participate in The Match each year, and almost all residency positions are filled through it. Shapley, one of the two scientists who formulated the stable marriage problem (the other, Gale, had died), and Roth, the scientist who applied it to the medical residency problem, received the Nobel Prize in Economics in 2012 for their work on this problem. The work dates all the way back to 1962, when Gale and Shapley published a famous paper entitled "College admissions and the stability of marriage" (American Mathematical Monthly 69:9-15), in which they defined a notion of stable marriage (matching/pairing/assignment/etc.) and showed that, in matching seekers to offerers, one can produce stable matchings that are optimal for the seekers or stable matchings that are optimal for the offerers. (In the case of the medical match, the stable matching produced is optimal for the medical graduates.) The formulation they gave was for college admissions, but they also introduced a formulation in the context of marriage.

The algorithm used in the match proceeds in rounds as follows. In the first round, every man proposes to the top woman on his list. Each woman accepts the best proposal she receives. (Naturally, a number of women will receive multiple proposals while some will receive none.) In the second round, every man not yet engaged proposes to the second choice on his list, regardless of whether she is already engaged or not. Each woman retains the best proposal she has received so far; this may entail breaking the engagement she made at the previous round because she received a proposal from a better-ranked man in this round, causing a man who was engaged at the end of the first round to be unengaged at the end of the second round. In each subsequent round, each unengaged man proposes to the next woman (engaged or not) on his list; each woman retains the best proposal she has received across all rounds so far. This continues until every man (and thus also every woman) is engaged, at which point all engagements become marriages. (A short code sketch of this proposal process is given at the end of this section.)

Readers will recognize this algorithm as the prevailing social custom in Western civilization for many centuries (up to the 19th century and even later): women could not make a move of their own (although they could certainly drop hints!), but had to wait on the pleasure of men to declare themselves. Unsurprisingly, like much about human civilizations, this proposal algorithm is optimal for men; it is also pessimal for women. (Simply flipping the roles gives us an algorithm optimal for women and pessimal for men; it turns out, however, that designing an algorithm that is fair to both sexes is much harder: it was not until the 21st century that the first such algorithm was published.)

The Match has one major difference from the marriage version: the rankings of the medical graduates include only the 5-15 residency programs where they interviewed, and the rankings of the residency programs, while larger, also fall far short of listing all medical graduates. Fortunately, this difference is easily handled by the same algorithm.
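Here is a minimal sketch, in Python, of the proposal algorithm just described, using the same preference-list representation as the stability check above; the names are again illustrative. For simplicity it processes one free man at a time rather than whole rounds, but the order in which free men propose does not change the final, man-optimal matching.

```python
def gale_shapley(men_prefs, women_prefs):
    """Men-proposing stable matching; returns a dict mapping each man to his wife.

    men_prefs[m]  : list of women in m's order of preference (best first)
    women_prefs[w]: list of men in w's order of preference (best first)
    """
    # Precompute each woman's ranking of the men (lower value = more preferred).
    rank = {w: {m: r for r, m in enumerate(prefs)} for w, prefs in women_prefs.items()}
    next_choice = {m: 0 for m in men_prefs}   # next position on m's list to propose to
    engaged_to = {}                           # woman -> man she is currently engaged to
    free_men = list(men_prefs)

    while free_men:
        m = free_men.pop()
        w = men_prefs[m][next_choice[m]]      # m never proposes to the same woman twice
        next_choice[m] += 1
        if w not in engaged_to:
            engaged_to[w] = m                 # w accepts her first proposal
        elif rank[w][m] < rank[w][engaged_to[w]]:
            free_men.append(engaged_to[w])    # w trades up; her former fiance is free again
            engaged_to[w] = m
        else:
            free_men.append(m)                # w rejects m; he will try further down his list
    return {m: w for w, m in engaged_to.items()}
```

For instance, gale_shapley({"a": ["u", "v"], "b": ["u", "v"]}, {"u": ["a", "b"], "v": ["a", "b"]}) returns {"a": "u", "b": "v"}, a matching that the is_stable sketch above confirms is stable.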

That this algorithm terminates is clear: no man ever proposes to the same woman twice, so there can be at most $n^2$ proposals in all and the process must stop. That it ends with $n$ marriages is also clear, and is due to the fact that everyone is willing to marry anyone of the opposite sex: there are preferences, but there are no outright rejections. That the marriages thus chosen are stable is due to the process: if $x$ ends up married to $y$ but prefers $y'$, then he must have proposed to $y'$ before proposing to $y$ and have been rejected, which means that $y'$ was at that time already engaged to someone higher on her list than $x$. Since the ranking of a woman's partner can only go up from her first engagement, $y'$ is now married to someone at least as high on her list as the man she already preferred to $x$ during the rounds of proposals, and so she is not interested in switching. A similar line of reasoning holds for a man whom $y$ might prefer to $x$. Note that the algorithm may require a quadratic number of proposals: it can take close to $n$ rounds, with only one new engagement per round and many proposals in each.
