CSCI-567: Machine Learning (Spring 2019)
Prof. Victor Adamchik, University of Southern California
April 2, 2019

We Live in Exciting Times

ACM (an international computing research society) has named Yoshua Bengio, Geoffrey Hinton, and Yann LeCun recipients of the 2018 ACM Turing Award for conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing. The ACM Turing Award, often referred to as the "Nobel Prize of Computing," carries a $1 million prize. It is named for Alan M. Turing, the British mathematician who articulated the mathematical foundations and limits of computing.

Outline

1 Markov chain
2 Hidden Markov Model

Markov Models

Markov models are powerful probabilistic models for analyzing sequential data. A. A. Markov (1856-1922) introduced Markov chains in 1906, when he produced the first theoretical results for stochastic processes. They are now commonly used in

text and speech recognition
stock market prediction
bioinformatics

A Markov chain can be drawn as a directed, strongly connected graph with self-loops. Each edge is labeled by a positive probability, and at each state the probabilities on the outgoing edges sum to 1. These labels form the transition (or stochastic) matrix $A$, with entries $a_{ij} = P(i \to j \text{ in one step})$.

Markov chain

Definition: given sequentially ordered random variables $X_1, X_2, \dots, X_t, \dots, X_T$, called states, a Markov chain is specified by

a transition probability, describing how the state at time $t-1$ changes to the state at time $t$: $P(X_t = \text{value} \mid X_{t-1} = \text{value})$
an initial probability, describing the initial state at time $t = 1$: $P(X_1 = \text{value})$

All $X_t$'s take values from the same discrete set $\{1, \dots, N\}$. We will assume that the transition probability does not change with time $t$, i.e., a stationary Markov chain.

The transition probabilities form a matrix $A$ whose elements are

$a_{ij} = P(X_t = j \mid X_{t-1} = i)$

and the initial probability becomes a vector $\pi$ whose elements are

$\pi_i = P(X_1 = i)$

where $i$ and $j$ range over $1, \dots, N$. We have the constraints

$\sum_j a_{ij} = 1, \qquad \sum_i \pi_i = 1$

and, additionally, all those numbers must be non-negative.
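To make these definitions concrete, here is a minimal numpy sketch (ours, not the lecture's) that stores a stationary Markov chain as the pair $(\pi, A)$, checks the constraints just stated, and samples a state sequence; the two-state weather chain is a hypothetical example.

```python
import numpy as np

rng = np.random.default_rng(0)

states = ["sunny", "rainy"]                # the discrete state set, N = 2
pi = np.array([0.6, 0.4])                  # pi_i = P(X_1 = i)
A = np.array([[0.8, 0.2],                  # a_ij = P(X_t = j | X_{t-1} = i)
              [0.4, 0.6]])

assert np.isclose(pi.sum(), 1.0)           # sum_i pi_i = 1
assert np.allclose(A.sum(axis=1), 1.0)     # each row of A sums to 1
assert (pi >= 0).all() and (A >= 0).all()  # all entries non-negative

def sample_chain(pi, A, T):
    """Draw one length-T state sequence x_1, ..., x_T from the chain."""
    x = [rng.choice(len(pi), p=pi)]                    # initial state from pi
    for _ in range(T - 1):
        x.append(rng.choice(len(pi), p=A[x[-1]]))      # next state from row x[-1]
    return x

print([states[i] for i in sample_chain(pi, A, 5)])
```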

Examples

Example 1 (Language model): states $[N]$ represent a dictionary of words, and $a_{\text{ice},\text{cream}} = P(X_{t+1} = \text{cream} \mid X_t = \text{ice})$ is an example of a transition probability.

Example 2 (Weather): states $[N]$ represent the weather on each day, e.g. $a_{\text{sunny},\text{rainy}} = P(X_{t+1} = \text{rainy} \mid X_t = \text{sunny})$.

Definition

A Markov chain is a stochastic process with the Markov property: a sequence of random variables $X_1, X_2, \dots$ such that

$P(X_{t+1} \mid X_1, X_2, \dots, X_t) = P(X_{t+1} \mid X_t)$

i.e., the next state depends only on the most recent state.

Is the Markov assumption reasonable? Not entirely, for the language model for example. Higher-order Markov chains make it more reasonable, e.g.

$P(X_{t+1} \mid X_1, X_2, \dots, X_t) = P(X_{t+1} \mid X_t, X_{t-1})$

i.e., the next word depends only on the last two words.

Exercise 1

Consider the following Markov model. [The transition diagram over the states Work (W), Coffee (C), and Facebook (F) is not reproduced in this transcript.] Given that I am now having Coffee, what is the probability that the next step is Facebook and the one after is Work?

$P(X_3 = W, X_2 = F \mid X_1 = C)$
$= \dfrac{P(X_3 = W, X_2 = F, X_1 = C)}{P(X_1 = C)}$
$= \dfrac{P(X_3 = W \mid X_2 = F, X_1 = C) \, P(X_2 = F \mid X_1 = C) \, P(X_1 = C)}{P(X_1 = C)}$ (chain rule)
$= P(X_3 = W \mid X_2 = F) \, P(X_2 = F \mid X_1 = C)$ (Markov property)
$= 0.5 \times 0.6 = 0.3$

Exercise 2

Given that I am now having Coffee, what is the probability that in two steps I am at Work?

$P(X_3 = W \mid X_1 = C) = \sum_s P(X_3 = W, X_2 = s \mid X_1 = C)$ (marginalization)
$= P(X_3 = W \mid X_2 = W) \, P(X_2 = W \mid X_1 = C)$
$\; + P(X_3 = W \mid X_2 = C) \, P(X_2 = C \mid X_1 = C)$
$\; + P(X_3 = W \mid X_2 = F) \, P(X_2 = F \mid X_1 = C)$
$= 0.3 \times 0.4 + 0.1 \times 0.3 + 0.6 \times 0.5 = 0.45$

Using the transition matrix:

$P(X_3 = j \mid X_1 = i) = \sum_{k=1}^{N} a_{ik} a_{kj} = (A^2)_{ij}$
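The closing identity $P(X_3 = j \mid X_1 = i) = (A^2)_{ij}$ is easy to check numerically. The lecture's Coffee/Facebook/Work diagram is not reproduced in this transcript, so the 3-state transition matrix below is hypothetical, but the check itself is general:

```python
import numpy as np

# Hypothetical 3-state chain; rows/columns ordered Work, Coffee, Facebook.
A = np.array([[0.4, 0.3, 0.3],
              [0.4, 0.3, 0.3],
              [0.5, 0.2, 0.3]])

i, j = 1, 0   # start at Coffee, ask for Work two steps later

# Explicit marginalization over the intermediate state, as in Exercise 2:
by_sum = sum(A[i, k] * A[k, j] for k in range(A.shape[0]))

# The same number via the matrix power:
by_power = np.linalg.matrix_power(A, 2)[i, j]

assert np.isclose(by_sum, by_power)
print(by_sum)
```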

Parameter estimation for Markov models

Now suppose we have observed $M$ sequences of examples:

$x_{1,1}, \dots, x_{1,T}$
$\vdots$
$x_{M,1}, \dots, x_{M,T}$

where for simplicity we assume each sequence has the same length $T$, and lowercase $x_{m,t}$ denotes the observed value of the random variable $X_{m,t}$. From these observations, how do we learn the model parameters $(\pi, A)$?

Finding the MLE

Same story: maximum likelihood estimation,

$\operatorname{argmax}_{\pi, A} \; \ln P(X_1 = x_1, X_2 = x_2, \dots, X_T = x_T)$

First, we need to compute this joint probability. Applying the chain rule for random variables, we get

$P(X_1, X_2, \dots, X_T) = P(X_2, X_3, \dots, X_T \mid X_1) \, P(X_1)$
$= P(X_3, \dots, X_T \mid X_1, X_2) \, P(X_2 \mid X_1) \, P(X_1)$
$= \cdots$
$= P(X_1) \prod_{t=2}^{T} P(X_t \mid X_1, \dots, X_{t-1})$
$= P(X_1) \prod_{t=2}^{T} P(X_t \mid X_{t-1})$ (Markov property)

The log-likelihood of a sequence $x_1, \dots, x_T$ is therefore

$\ln P(X_1 = x_1, X_2 = x_2, \dots, X_T = x_T) = \ln P(X_1 = x_1) + \sum_{t=2}^{T} \ln P(X_t = x_t \mid X_{t-1} = x_{t-1})$
$= \ln \pi_{x_1} + \sum_{t=2}^{T} \ln a_{x_{t-1}, x_t}$
$= \sum_n \mathbb{I}[x_1 = n] \ln \pi_n + \sum_{n, n'} \left( \sum_{t=2}^{T} \mathbb{I}[x_{t-1} = n, x_t = n'] \right) \ln a_{n, n'}$

Example: suppose we observed the following 2 sequences of length 5:

sunny, sunny, rainy, rainy, rainy
rainy, sunny, sunny, sunny, rainy

The MLE, using the counting formulas given next, is the model with $\pi_{\text{sunny}} = \pi_{\text{rainy}} = 1/2$ and transitions $a_{\text{sunny},\text{sunny}} = 3/5$, $a_{\text{sunny},\text{rainy}} = 2/5$, $a_{\text{rainy},\text{sunny}} = 1/3$, $a_{\text{rainy},\text{rainy}} = 2/3$ (verified in the sketch below).
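Here is a minimal sketch of the counting estimator, run on the two sunny/rainy sequences above (encoding sunny = 0, rainy = 1); the variable names are ours:

```python
import numpy as np

N = 2
seqs = [[0, 0, 1, 1, 1],   # sunny, sunny, rainy, rainy, rainy
        [1, 0, 0, 0, 1]]   # rainy, sunny, sunny, sunny, rainy

pi_counts = np.zeros(N)
A_counts = np.zeros((N, N))
for x in seqs:
    pi_counts[x[0]] += 1                 # count initial states
    for prev, cur in zip(x[:-1], x[1:]):
        A_counts[prev, cur] += 1         # count transitions

pi_hat = pi_counts / pi_counts.sum()                      # [0.5, 0.5]
A_hat = A_counts / A_counts.sum(axis=1, keepdims=True)    # [[3/5, 2/5], [1/3, 2/3]]
print(pi_hat)
print(A_hat)
```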

Finding the MLE

So the MLE is

$\operatorname{argmax}_{\pi, A} \; \sum_n (\#\text{initial states with value } n) \ln \pi_n + \sum_{n,n'} (\#\text{transitions from } n \text{ to } n') \ln a_{n,n'}$

We have seen this many times. The solution is (the derivation is left as an exercise):

$\pi_n = \dfrac{\#\text{sequences starting with } n}{\#\text{sequences}}$

$a_{n,n'} = \dfrac{\#\text{transitions from } n \text{ to } n'}{\#\text{transitions starting with } n}$

Outline

1 Markov chain
2 Hidden Markov Model
   Inferring HMMs
   Viterbi Algorithm
   Viterbi Algorithm: Example
   Learning HMMs

Markov model with outcomes

Now suppose each state $X_t$ also emits an outcome $Z_t \in [O]$ according to the model

$P(Z_t = o \mid X_t = s) = b_{s,o}$ (emission probability)

independent of everything else. Another example (pictured on Wikipedia): on each day, we also observe Bob's activity, walk, shop, or clean, which depends only on the weather of that day. In the language model, $Z_t$ is the speech signal for the underlying word $X_t$ (very useful for speech recognition).

Now the model parameters are $(\{\pi_s\}, \{a_{s,s'}\}, \{b_{s,o}\}) = (\pi, A, B)$.
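A minimal sketch (ours) extending the earlier Markov-chain sampler with emissions: at each step the state evolves according to $A$ and emits an outcome according to $B$. The numbers below are a hypothetical weather/activity model in the spirit of the Bob example, not the lecture's figure.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_hmm(pi, A, B, T):
    """Sample (x_1..x_T, z_1..z_T): states from (pi, A), outcomes from B."""
    x, z = [], []
    for t in range(T):
        prev = pi if t == 0 else A[x[-1]]             # distribution of next state
        x.append(rng.choice(len(pi), p=prev))
        z.append(rng.choice(B.shape[1], p=B[x[-1]]))  # emit from row b_{x_t, .}
    return x, z

pi = np.array([0.6, 0.4])                  # states: sunny, rainy
A = np.array([[0.8, 0.2], [0.4, 0.6]])
B = np.array([[0.6, 0.3, 0.1],             # b_{s,o}; outcomes: walk, shop, clean
              [0.1, 0.4, 0.5]])
print(sample_hmm(pi, A, B, 7))
```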

HMM defines a joint probability

$P(X_1, \dots, X_T, Z_1, \dots, Z_T) = P(X_1, \dots, X_T) \, P(Z_1, \dots, Z_T \mid X_1, \dots, X_T)$

The Markov assumption simplifies the first term:

$P(X_1, \dots, X_T) = P(X_1) \prod_{t=2}^{T} P(X_t \mid X_{t-1})$

The independence assumption simplifies the second term:

$P(Z_1, \dots, Z_T \mid X_1, \dots, X_T) = \prod_{t=1}^{T} P(Z_t \mid X_t)$

Namely, each $Z_t$ is conditionally independent of everything else, given $X_t$.

Joint likelihood

The joint log-likelihood is

$\ln P(X_1 = x_1, \dots, X_T = x_T, Z_1 = z_1, \dots, Z_T = z_T)$
$= \ln \left( P(X_1 = x_1) \prod_{t=2}^{T} P(X_t = x_t \mid X_{t-1} = x_{t-1}) \prod_{t=1}^{T} P(Z_t = z_t \mid X_t = x_t) \right)$
$= \ln P(X_1 = x_1) + \sum_{t=2}^{T} \ln P(X_t = x_t \mid X_{t-1} = x_{t-1}) + \sum_{t=1}^{T} \ln P(Z_t = z_t \mid X_t = x_t)$
$= \ln \pi_{x_1} + \sum_{t=2}^{T} \ln a_{x_{t-1}, x_t} + \sum_{t=1}^{T} \ln b_{x_t, z_t}$

Learning the model

If we observe $M$ state-outcome sequences $(x_{m,1}, z_{m,1}), \dots, (x_{m,T}, z_{m,T})$ for $m = 1, \dots, M$, the MLE is again very simple (verify yourself):

$\pi_s \propto \#\text{initial states with value } s$
$a_{s,s'} \propto \#\text{transitions from } s \text{ to } s'$
$b_{s,o} \propto \#\text{state-outcome pairs } (s, o)$

However, most often we do not observe the states! Think about the speech recognition example. This is called a Hidden Markov Model (HMM). Note that "hidden" refers to the states of the Markov chain, not to the parameters of the model.

How do we learn HMMs? Roadmap:

first discuss how to infer when the model is known (key: dynamic programming)
then discuss how to learn the model (key: EM)
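Before turning to inference, a minimal sketch (ours) that evaluates this joint log-likelihood for a fully observed state-outcome sequence pair; the parameter values happen to match the Studying/Video-games example used later in the lecture.

```python
import numpy as np

def joint_log_likelihood(x, z, pi, A, B):
    """ln pi_{x1} + sum_t ln a_{x_{t-1},x_t} + sum_t ln b_{x_t,z_t}."""
    ll = np.log(pi[x[0]]) + np.log(B[x[0], z[0]])
    for t in range(1, len(x)):
        ll += np.log(A[x[t - 1], x[t]]) + np.log(B[x[t], z[t]])
    return ll

pi = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2], [0.4, 0.6]])    # states: S = 0, V = 1
B = np.array([[0.5, 0.5], [0.8, 0.2]])    # outcomes: Grin = 0, Frown = 1
print(joint_log_likelihood([0, 0, 1], [0, 1, 0], pi, A, B))
```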

What can we infer about a known HMM?

In HMMs, we are often interested in the following problems:

The probability of observing a whole sequence,
$P(Z_{1:T} = z_{1:T}) = P(Z_1 = z_1, \dots, Z_T = z_T)$
e.g., the probability of observing Bob's activities walk, walk, shop, clean, walk, shop, shop for one week.

The state at some point in time, given an observation sequence,
$\gamma_s(t) = P(X_t = s \mid Z_{1:T} = z_{1:T})$
e.g., given Bob's activities for one week, what was the weather like on Wednesday?

The likelihood of two consecutive states at a given time,
$\xi_{s,s'}(t) = P(X_t = s, X_{t+1} = s' \mid Z_{1:T} = z_{1:T})$
e.g., given Bob's activities for one week, what was the weather like on Wednesday and Thursday?

The most likely hidden state path, given an observation sequence,
$\operatorname{argmax}_{x_{1:T}} P(X_{1:T} = x_{1:T} \mid Z_{1:T} = z_{1:T})$
e.g., given Bob's activities for one week, what is the most likely weather for that week?

Forward procedure Forward procedure Forward algorithm For all s [N], compute α s (1) = π s b s,z1. For t = 2,..., T for each s [N], compute α s (t) = b s,zt a s,s α s (t 1) s [N] It takes O(S 2 T ) time and O(ST ) space using dynamic programming. Oh, no, CSCI-570 again.. () April 2, 2019 29 / 48 () April 2, 2019 30 / 48 Computing backward messages Backward procedure Again establish a recursive formula β s (t) = P (Z t+1:t X t = s) = P (Z t+1:t, X t+1 = s X t = s) (marginalizing) s = P (Z t+1:t X t+1 = s, X t = s)p (X t+1 = s X t = s) (sl. 11) s = a s,s P (Z t+1:t X t+1 = s ) = a s,s P (Z t+1, Z t+2:t X t+1 = s ) s s = a s,s P (Z t+1 Z t+2:t, X t+1 = s )P (Z t+2:t X t+1 = s ) s Backward algorithm For all s [N], set β s (T ) = 1. For t = T 1,..., 1 for each s [N], compute β s (t) = a s,s b s,z t+1 β s (t + 1) s [N] = s a s,s b s,z t+1 β s (t + 1) Again it takes O(S 2 T ) time and O(ST ) space. Base case: β s (T ) = 1 (verify yourself) () April 2, 2019 31 / 48 () April 2, 2019 32 / 48

Using forward and backward messages With forward and backward messages, we can compute problems stated on slide 25. γ s (t) = P (X t = s Z 1:T ) s [N] P (X t = s, Z 1:T ) = P (X t = s, Z 1:t, Z t+1:t ) = P (X t = s, Z 1:t )P (Z t+1:t X t = s, Z 1:t ) = P (X t = s, Z 1:t )P (Z t+1:t X t = s) = α s (t)β s (t) What constant are we omitting in? It is exactly P (Z 1:T ) = P (Z 1:T, X t = s) = α s (T ) = α s (t)β s (t), s [N] s [N] the probability of observing the sequence z 1:T. This is true for any t; a good way to check correctness of your code. () April 2, 2019 33 / 48 Using forward and backward messages Slide 26: the conditional probability of transition s to s at time t ξ s,s (t) = P (X t = s, X t+1 = s Z 1:T ) P (X t = s, X t+1 = s, Z 1:T ) = P (X t = s, Z 1:t, X t+1 = s, Z t+1:t ) = P (X t = s, Z 1:t )P (X t+1 = s, Z t+1:t X t = s, Z 1:t ) = α s (t) P (X t+1 = s, Z t+1:t X t = s) = α s (t) P (Z t+1:t X t+1 = s, X t = s)p (X t+1 = s X t = s) = α s (t)p (Z t+1:t X t+1 = s ) a s,s = α s (t) a s,s P (Z t+1:t, Z t+2:t X t+1 = s ) = α s (t) a s,s P (Z t+2:t X t+1 = s, Z t+1:t )P (Z t+1 X t+1 = s ) = α s (t) a s,s P (Z t+2:t X t+1 = s )P (Z t+1 X t+1 = s ) = α s (t) a s,s b s,x t+1 β s (t + 1) The normalization constant is in fact again P (Z 1:T ) () April 2, 2019 34 / 48 Finding the most likely path Find the most likely hidden states path, given an observation sequence: This is called Viterbi decoding. argmax x 1:T P (X 1:T = x 1:T Z 1:T = z 1:T ) We can t use forward and backward messages directly, though it is very similar to the forward procedure, just replace sum to max. We will compute (think of Dijkstra algorithm) δ s (t) = max x 1:t 1 P (X t = s, X 1:t 1, Z 1:t ) s (t) = argmax x t 1 max P (X t = s, X 1:t 1 Z 1:t ) x 1:t 2 Computing δ s (t) Follow me (the goal is to get a recurrence) δ s (t) = max x 1:t 1 P (X t = s, X 1:t 1, Z 1:t ) = max x 1:t 1 P (X t = s, Z t, X 1:t 1, Z 1:t 1 ) = max s max P (X t = s, Z t X 1:t 1, Z 1:t 1 )P (X 1:t 1 = s, Z 1:t 1 ) x 1:t 2 = max s δ s (t 1)P (X t, Z t X 1:t 1, Z 1:t 1 ) = max s δ s (t 1)P (Z t, X t X 1:t 1 ) = max s δ s (t 1)P (Z t X t, X 1:t 1 )P (X t X 1:t 1 ) = max s δ s (t 1)P (Z t X t )P (X t X t 1 ) = b s,zt max s a s,sδ s (t 1) Base case: δ s (1) = P (X 1 = s, Z 1 = z 1 ) = π s b s,z1 () April 2, 2019 35 / 48 () April 2, 2019 36 / 48

Viterbi Algorithm (!) Viterbi Algorithm For each s [N], compute δ s (1) = π s b s,z1. For each t = 2,..., T, for each s [N], compute Example Arrows represent the argmax, i.e. s (t). δ s (t) = b s,zt max s a s,sδ s (t 1) s (t) = argmax s a s,sδ s (t 1) Backtracking: let z T = argmax s δ s (T ). For each t = T,..., 2: set z t 1 = z t (t). The most likely path is rainy, rainy, sunny, sunny. Output the most likely path z 1,..., z T. () April 2, 2019 37 / 48 () April 2, 2019 38 / 48 Example t = 1, the initial time Consider the HMM below. In this world, every time step (say every few minutes), you can either be Studying or playing Video games. You re also either Grinning or Frowning while doing the activity. δ s (1) = π s b s,z1. Compute δ S (1) and δ V (1). δ S (1) = p(z 1 = Grin x 1 = S)π(x 1 = S) = 0.5 0.5 = 0.25 Suppose that we believe that the initial state distribution is 50/50. We observe: Grin, Grin, Frown, Grin. What is the most likely path for this sequence of observations? δ V (1) = p(z 1 = Grin x 1 = V )π(x 1 = V ) = 0.8 0.5 = 0.4 () April 2, 2019 39 / 48 () April 2, 2019 40 / 48

t = 2 δ s (t) = b s,zt max s a s,sδ s (t 1). Compute δ S (2) and δ V (2). t = 3 δ s (t) = b s,zt max s a s,sδ s (t 1). Compute δ S (3) and δ V (3). δ S (2) = p(z 2 = Grin x 2 = S) max{p(x 2 = S x 1 = S)δ S (1), p(x 2 = S x 1 = V )δ V (1)} = 0.5 max{0.8 0.25, 0.4 0.4} = 0.01 δ V (2) = p(z 2 = Grin x 2 = V ) max{p(x 2 = V x 1 = S)δ S (1), p(x 2 = V x 1 = V )δ V (1)} 0.8 max{0.2 0.25, 0.6 0.4} = 0.192 () April 2, 2019 41 / 48 δ S (3) = p(z 3 = F rown x 3 = S) max{p(x 3 = S x 2 = S)δ S (2), p(x 3 = S x 2 = V )δ V (2)} = 0.5 max{0.8 0.1, 0.4 0.192} = 0.04 δ V (3) = p(z 3 = F rown x 3 = V ) max{p(x 3 = V x 2 = S)δ S (2), p(x 3 = V x 2 = V )δ V (2)} = 0.2 max{0.2 0.1, 0.6 0.192} = 0.023 () April 2, 2019 42 / 48 t = 4 Learning the parameters of an HMM Observation at t = 4 is Grin All previous inferences depend on knowing the parameters (π, A, B). How do we learn the parameters based on M observation sequences z m,1,..., z m,t for m = 1,..., M? δ S (4) = 0.016 δ V (4) = 0.01 Last step, since δ S (4) > δ V (4), so we choose MLE is intractable due to the hidden states (similar to GMM). Need to apply EM again! Known as the Baum Welch algorithm. z 4 = S Then z 3 = S, then z 2 = S and then z 1 = S, using the trace-back table. () April 2, 2019 43 / 48 () April 2, 2019 44 / 48

Applying EM: E-Step Recall in the E-Step we fix the parameters and find the posterior distributions q of the hidden states (for each sample), which leads to the complete log-likelihood: E x1:t q [ln(x 1:T = x 1:T, Z 1:T = z 1:T )] [ ] = E x1:t q ln π x1 + ln a xt 1,x t + ln b xt,z t = s T 1 γ s (1) ln π s + ξ s,s (t) ln a s,s + s,s This is quite similar to computations on slide 15. γ s (t) and ξ s,s (t) are defined on slides 9 and 10: γ s (t) = P (X t = s Z 1:T ) γ s (t) ln b s,zt ξ s,s (t) = P (X t = s, X t+1 = s Z 1:T ) s (slide 22) Applying EM: M-Step The maximizer of complete log-likelihood is simply doing weighted counting: where π s m a s,s m b s,o m γ (m) s (1) = E q [ #initial states with value s] 1 ξ (m) [ s,s (t) = E q #transitions from s to s ] t:z t =o γ s (m) (t) = E q [ #state-outcome pairs (s, o)] γ (m) s (t) = P (X m,t = s Z m,1:t = z m,1:t ) ξ (m) s,s (t) = P (X m,t = s, X m,t+1 = s Z m,1:t = z m,1:t ) () April 2, 2019 45 / 48 () April 2, 2019 46 / 48 Baum Welch algorithm Summary Step 0 Initialize the parameters (π, A, B) Step 1 (E-Step) Fixing the parameters, compute forward and backward messages for all sample sequences, then use these to compute γ s (m) (t) and ξ (m) s,s (t) for each m, t, s, s (see Slides 9 and 10). Step 2 (M-Step) Update parameters: π s m γ (m) s (1), a s,s m 1 ξ (m) s,s (t), b s,o m t:x t =o γ s (m) (t) Very important models: Markov chains, hidden Markov models Several algorithms: forward and backward procedures inferring HMMs based on forward and backward messages Viterbi algorithm Baum Welch (EM) algorithm Step 3 Return to Step 1 if not converged. () April 2, 2019 47 / 48 () April 2, 2019 48 / 48