Cheng Soon Ong & Christian Walder, Research Group and College of Engineering and Computer Science, Canberra, February – June 2018

Outline: Overview; Introduction; Linear Algebra; Probability; Linear Regression 1; Linear Regression 2; Linear Classification 1; Linear Classification 2; Kernel Methods; Sparse Kernel Methods; Mixture Models and EM 1; Mixture Models and EM 2; Neural Networks 1; Neural Networks 2; Principal Component Analysis; Autoencoders; Graphical Models 1; Graphical Models 2; Graphical Models 3; Sampling; Sequential Data 1; Sequential Data 2

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")
Part XVIII: Sequential Data 2

Topics: Alpha-Beta; How to train a HMM; Viterbi Algorithm
Example of a Hidden Markov Model

Assume Peter and Mary are students in Canberra and Sydney, respectively. Peter is a computer science student and is only interested in riding his bicycle, shopping for new computer gadgets, and studying. (Well, he also does other things, but because these other activities don't depend on the weather we neglect them here.) Mary does not know the current weather in Canberra, but knows the general trends of the weather in Canberra. She also knows Peter well enough to know what he does on average on rainy days and also when the skies are blue. She believes that the weather (rainy or not rainy) follows a given discrete Markov chain. She tries to guess the sequence of weather patterns for a number of days after Peter tells her on the phone what he did in the last days.
Example of a Hidden Markov Model

Mary uses the following model:

initial probability
         rainy   sunny
          0.2     0.8

transition probability (row = current state, column = next state)
         rainy   sunny
rainy     0.3     0.7
sunny     0.4     0.6

emission probability
         rainy   sunny
cycle     0.1     0.6
shop      0.4     0.3
study     0.5     0.1

Assume Peter tells Mary that the list of his activities in the last days was [cycle, shop, study].
(a) Calculate the probability of this observation sequence.
(b) Calculate the most probable sequence of hidden states for these observations.
Hidden Markov Model

The trellis for the hidden Markov model: find the probability of [cycle, shop, study]. (Figure: trellis with states Rainy/Sunny at each step, initial probabilities 0.2/0.8, transition probabilities between columns, and the emission probabilities for cycle, shop and study attached to each state.)
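Question (a) can be answered with the forward (alpha) recursion over this trellis. The sketch below hard-codes Mary's model from the table above; the function name `forward` is our own choice.

```python
# Forward (alpha) pass for Mary's HMM, summing over all weather sequences.
# The probabilities are the ones from the slide; states are rainy/sunny.
pi = {"rainy": 0.2, "sunny": 0.8}                         # initial probabilities
A = {"rainy": {"rainy": 0.3, "sunny": 0.7},               # transition probabilities
     "sunny": {"rainy": 0.4, "sunny": 0.6}}
B = {"rainy": {"cycle": 0.1, "shop": 0.4, "study": 0.5},  # emission probabilities
     "sunny": {"cycle": 0.6, "shop": 0.3, "study": 0.1}}

def forward(observations):
    """Return p(x_1, ..., x_N) via the alpha recursion."""
    states = list(pi)
    # alpha(z_1) = pi_k * p(x_1 | z_1 = k)
    alpha = {k: pi[k] * B[k][observations[0]] for k in states}
    for x in observations[1:]:
        # alpha(z_n) = p(x_n | z_n) * sum_{z_{n-1}} alpha(z_{n-1}) p(z_n | z_{n-1})
        alpha = {k: B[k][x] * sum(alpha[j] * A[j][k] for j in states)
                 for k in states}
    return sum(alpha.values())

print(forward(["cycle", "shop", "study"]))  # p(X) ≈ 0.04098
```

So the observation sequence [cycle, shop, study] has probability about 0.041 under Mary's model.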
Hidden Markov Model

Find the most probable hidden states for [cycle, shop, study]. (Figure: the same trellis as before, now searched for the single most probable path.)
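Question (b) is solved by the Viterbi algorithm, treated in detail at the end of this part: replace the sum in the forward recursion by a max and remember which predecessor achieved it. A sketch on Mary's model (the function name `viterbi` is our own):

```python
# Viterbi decoding of the most probable weather sequence for Mary's HMM.
pi = {"rainy": 0.2, "sunny": 0.8}
A = {"rainy": {"rainy": 0.3, "sunny": 0.7},
     "sunny": {"rainy": 0.4, "sunny": 0.6}}
B = {"rainy": {"cycle": 0.1, "shop": 0.4, "study": 0.5},
     "sunny": {"cycle": 0.6, "shop": 0.3, "study": 0.1}}

def viterbi(observations):
    states = list(pi)
    # delta[k]: probability of the best path ending in state k;
    # backpointers remember which previous state achieved that maximum.
    delta = {k: pi[k] * B[k][observations[0]] for k in states}
    backpointers = []
    for x in observations[1:]:
        prev, delta, bp = delta, {}, {}
        for k in states:
            best = max(states, key=lambda j: prev[j] * A[j][k])
            bp[k] = best
            delta[k] = prev[best] * A[best][k] * B[k][x]
        backpointers.append(bp)
    # Backtrack from the most probable final state.
    path = [max(states, key=lambda k: delta[k])]
    for bp in reversed(backpointers):
        path.append(bp[path[-1]])
    return list(reversed(path)), max(delta.values())

path, p = viterbi(["cycle", "shop", "study"])
print(path, p)  # ['sunny', 'sunny', 'rainy'], p ≈ 0.01728
```

The most probable explanation is two sunny days followed by a rainy one.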
Homogeneous Hidden Markov Model

Joint probability distribution over both latent and observed variables:

p(X, Z | θ) = p(z_1 | π) [ ∏_{n=2}^{N} p(z_n | z_{n-1}, A) ] ∏_{m=1}^{N} p(x_m | z_m, φ)

where X = (x_1, ..., x_N), Z = (z_1, ..., z_N), and θ = {π, A, φ}. Most of the discussion will be independent of the particular choice of emission probabilities (e.g. discrete tables, Gaussians, mixtures of Gaussians).
Maximum Likelihood for the HMM

We have observed a data set X = (x_1, ..., x_N). Assume it came from a HMM with a given structure (number of nodes, form of emission probabilities). The likelihood of the data is

p(X | θ) = Σ_Z p(X, Z | θ)

This joint distribution does not factorise over n (as it did for the mixture distribution). We have N latent variables, each with K states: K^N terms. The number of terms grows exponentially with the length of the chain. But we can use the conditional independence of the latent variables to reorder their calculation later. A further obstacle to finding a closed-form maximum likelihood solution: calculating the emission probabilities for different states z_n.
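The exponential blow-up is easy to see by computing p(X | θ) the naive way: evaluate the factorised joint p(X, Z | θ) for every one of the K^N state sequences and add them up. A sketch with hypothetical toy parameters (K = 2 states, two discrete symbols):

```python
from itertools import product

# Likelihood p(X | theta) by brute force: evaluate the joint
# p(X, Z | theta) = p(z_1) prod_n p(z_n | z_{n-1}) prod_n p(x_n | z_n)
# for every state sequence Z and sum. Toy parameters, hypothetical numbers.
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.7, 0.3], [0.1, 0.9]]   # B[k][x] = p(x | z = k)

def joint_prob(X, Z):
    p = pi[Z[0]] * B[Z[0]][X[0]]
    for n in range(1, len(X)):
        p *= A[Z[n - 1]][Z[n]] * B[Z[n]][X[n]]
    return p

X = [0, 1, 1]
# K^N = 2^3 = 8 terms here, but the count grows exponentially with N.
likelihood = sum(joint_prob(X, Z) for Z in product(range(2), repeat=len(X)))
```

The alpha-beta recursions introduced below compute the same number with cost linear in N.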
HMM - EM

Employ the EM algorithm to find the maximum likelihood solution for the HMM.

Start with some initial parameter settings θ_old.

E-step: find the posterior distribution of the latent variables, p(Z | X, θ_old).

M-step: maximise

Q(θ, θ_old) = Σ_Z p(Z | X, θ_old) ln p(X, Z | θ)

with respect to the parameters θ = {π, A, φ}.
HMM - EM

Denote the marginal posterior distribution of z_n by γ(z_n), and the joint posterior distribution of two successive latent variables by ξ(z_{n-1}, z_n):

γ(z_n) = p(z_n | X, θ_old)
ξ(z_{n-1}, z_n) = p(z_{n-1}, z_n | X, θ_old).

For each step n, γ(z_n) has K nonnegative values which sum to 1. For each step n, ξ(z_{n-1}, z_n) has K × K nonnegative values which sum to 1. Elements of these are denoted by γ(z_nk) and ξ(z_{n-1,j}, z_nk) respectively.
HMM - EM

Because the expectation of a binary random variable is the probability that it takes the value one, with this notation we get

γ(z_nk) = E[z_nk] = Σ_{z_n} γ(z_n) z_nk
ξ(z_{n-1,j}, z_nk) = E[z_{n-1,j} z_nk] = Σ_{z_{n-1}, z_n} ξ(z_{n-1}, z_n) z_{n-1,j} z_nk.

Putting it all together we get

Q(θ, θ_old) = Σ_{k=1}^{K} γ(z_1k) ln π_k + Σ_{n=2}^{N} Σ_{j=1}^{K} Σ_{k=1}^{K} ξ(z_{n-1,j}, z_nk) ln A_jk + Σ_{n=1}^{N} Σ_{k=1}^{K} γ(z_nk) ln p(x_n | φ_k).
HMM - EM

M-step: maximising

Q(θ, θ_old) = Σ_{k=1}^{K} γ(z_1k) ln π_k + Σ_{n=2}^{N} Σ_{j=1}^{K} Σ_{k=1}^{K} ξ(z_{n-1,j}, z_nk) ln A_jk + Σ_{n=1}^{N} Σ_{k=1}^{K} γ(z_nk) ln p(x_n | φ_k)

results in

π_k = γ(z_1k) / Σ_{j=1}^{K} γ(z_1j)

A_jk = Σ_{n=2}^{N} ξ(z_{n-1,j}, z_nk) / Σ_{l=1}^{K} Σ_{n=2}^{N} ξ(z_{n-1,j}, z_nl).
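The two update formulas are just normalised sums of the posterior statistics. A minimal sketch, assuming γ and ξ have already been produced by an E-step (the numbers below are hypothetical, for K = 2 states and N = 3 steps):

```python
# M-step updates for pi and A from gamma and xi, following the formulas above.
# gamma[n][k] = gamma(z_nk); xi[n][j][k] = xi(z_{n-1,j}, z_nk), 0-indexed.
# Hypothetical E-step output:
gamma = [[0.6, 0.4], [0.3, 0.7], [0.5, 0.5]]
xi = [[[0.2, 0.4], [0.1, 0.3]],
      [[0.1, 0.2], [0.4, 0.3]]]

K = len(gamma[0])

# pi_k = gamma(z_1k) / sum_j gamma(z_1j)
norm = sum(gamma[0])
pi = [gamma[0][k] / norm for k in range(K)]

# A_jk = sum_n xi(z_{n-1,j}, z_nk) / sum_l sum_n xi(z_{n-1,j}, z_nl)
A = []
for j in range(K):
    row = [sum(x[j][k] for x in xi) for k in range(K)]
    total = sum(row)
    A.append([v / total for v in row])
```

Note that each row of A is normalised by its own denominator, so A stays a valid transition matrix.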
HMM - EM

Still left: maximising Q(θ, θ_old) with respect to φ. But φ only appears in the last term, Σ_{n=1}^{N} Σ_{k=1}^{K} γ(z_nk) ln p(x_n | φ_k), and under the assumption that all the φ_k are independent of each other, this term decouples into a sum over k. Then maximise each contribution Σ_{n=1}^{N} γ(z_nk) ln p(x_n | φ_k) individually.
HMM - EM

In the case of Gaussian emission densities, p(x | φ_k) = N(x | μ_k, Σ_k), the maximising parameters for the emission densities are

μ_k = Σ_{n=1}^{N} γ(z_nk) x_n / Σ_{n=1}^{N} γ(z_nk)

Σ_k = Σ_{n=1}^{N} γ(z_nk) (x_n − μ_k)(x_n − μ_k)^T / Σ_{n=1}^{N} γ(z_nk).
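These are just responsibility-weighted versions of the usual sample mean and covariance. A one-dimensional sketch with hypothetical data and responsibilities for one fixed state k:

```python
# M-step for a Gaussian emission density: weighted mean and variance,
# as in the formulas above. 1-D case; gamma_k and x are hypothetical.
x = [1.0, 2.0, 3.0, 4.0]
gamma_k = [0.1, 0.4, 0.4, 0.1]   # gamma(z_nk) for one fixed state k

Nk = sum(gamma_k)                # effective number of points assigned to k
mu_k = sum(g * xn for g, xn in zip(gamma_k, x)) / Nk
sigma2_k = sum(g * (xn - mu_k) ** 2 for g, xn in zip(gamma_k, x)) / Nk
```

With uniform responsibilities this reduces to the ordinary maximum likelihood mean and variance.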
Forward-Backward Algorithm

We need to efficiently evaluate the γ(z_nk) and ξ(z_{n-1,j}, z_nk). The graphical model for the HMM is a tree! We know we can use a two-stage message passing algorithm to calculate the posterior distribution of the latent variables. For the HMM this is called the forward-backward algorithm (Rabiner, 1989), or the Baum-Welch algorithm (Baum, 1972). Other variants exist, differing only in the form of the messages propagated. We look at the most widely known, the alpha-beta algorithm.
Conditional Independence for the HMM

Given the data X = {x_1, ..., x_N}, the following independence relations hold:

p(X | z_n) = p(x_1, ..., x_n | z_n) p(x_{n+1}, ..., x_N | z_n)
p(x_1, ..., x_{n-1} | x_n, z_n) = p(x_1, ..., x_{n-1} | z_n)
p(x_1, ..., x_{n-1} | z_{n-1}, z_n) = p(x_1, ..., x_{n-1} | z_{n-1})
p(x_{n+1}, ..., x_N | z_n, z_{n+1}) = p(x_{n+1}, ..., x_N | z_{n+1})
p(x_{n+2}, ..., x_N | x_{n+1}, z_{n+1}) = p(x_{n+2}, ..., x_N | z_{n+1})
p(X | z_{n-1}, z_n) = p(x_1, ..., x_{n-1} | z_{n-1}) p(x_n | z_n) p(x_{n+1}, ..., x_N | z_n)
p(x_{N+1} | X, z_{N+1}) = p(x_{N+1} | z_{N+1})
p(z_{N+1} | X, z_N) = p(z_{N+1} | z_N)

(Figure: the HMM chain z_1 → z_2 → ... → z_{n-1} → z_n → z_{n+1}, with each z_i emitting x_i.)
Conditional Independence for the HMM - Example

Let's look at the following independence relation:

p(X | z_n) = p(x_1, ..., x_n | z_n) p(x_{n+1}, ..., x_N | z_n)

Any path from the set {x_1, ..., x_n} to the set {x_{n+1}, ..., x_N} has to go through z_n. In p(X | z_n) the node z_n is conditioned on (= observed). All paths from x_1, ..., x_{n-1} through z_n to x_{n+1}, ..., x_N are head-to-tail at z_n, so blocked. The path from x_n through z_n to z_{n+1} is tail-to-tail at z_n, so also blocked. Therefore x_1, ..., x_n ⊥ x_{n+1}, ..., x_N | z_n.

(Figure: the HMM chain z_1 → z_2 → ... → z_{n-1} → z_n → z_{n+1}, with each z_i emitting x_i.)
Alpha-Beta

Define the joint probability of observing all data up to step n and having z_n as latent variable:

α(z_n) = p(x_1, ..., x_n, z_n).

Define the probability of all future data given z_n:

β(z_n) = p(x_{n+1}, ..., x_N | z_n).

Then it can be shown that the following recursion holds:

α(z_n) = p(x_n | z_n) Σ_{z_{n-1}} α(z_{n-1}) p(z_n | z_{n-1})

with initial condition

α(z_1) = ∏_{k=1}^{K} {π_k p(x_1 | φ_k)}^{z_1k}.
Alpha-Beta

At step n we can efficiently calculate α(z_n) given α(z_{n-1}):

α(z_n) = p(x_n | z_n) Σ_{z_{n-1}} α(z_{n-1}) p(z_n | z_{n-1})

(Figure: lattice fragment for K = 3 states, showing α(z_{n-1,1}), α(z_{n-1,2}), α(z_{n-1,3}) at step n−1 feeding into α(z_{n,1}) via the transitions A_11, A_21, A_31, then multiplied by the emission p(x_n | z_{n,1}).)
Alpha-Beta

And for β(z_n) we get the recursion

β(z_n) = Σ_{z_{n+1}} β(z_{n+1}) p(x_{n+1} | z_{n+1}) p(z_{n+1} | z_n)

(Figure: lattice fragment for K = 3 states, showing β(z_{n,1}) computed from β(z_{n+1,1}), β(z_{n+1,2}), β(z_{n+1,3}) via the transitions A_11, A_12, A_13 and the emissions p(x_{n+1} | z_{n+1,k}).)
Alpha-Beta

How do we start the β recursion? What is β(z_N)? By the definition, β(z_N) = p(x_{N+1}, ..., x_N | z_N), a conditional probability over an empty set of observations. It can be shown that

β(z_N) = 1

is consistent with the approach.
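Both recursions can be sketched together on Mary's weather model from the example. As a correctness check, Σ_k α(z_nk) β(z_nk) should give the same value, p(X), at every step n (this identity is derived on the next slides):

```python
# Alpha and beta recursions for Mary's HMM, with beta(z_N) = 1.
pi = {"rainy": 0.2, "sunny": 0.8}
A = {"rainy": {"rainy": 0.3, "sunny": 0.7},
     "sunny": {"rainy": 0.4, "sunny": 0.6}}
B = {"rainy": {"cycle": 0.1, "shop": 0.4, "study": 0.5},
     "sunny": {"cycle": 0.6, "shop": 0.3, "study": 0.1}}
X = ["cycle", "shop", "study"]
states = list(pi)

# Forward: alpha(z_n) = p(x_n | z_n) sum_{z_{n-1}} alpha(z_{n-1}) p(z_n | z_{n-1})
alpha = [{k: pi[k] * B[k][X[0]] for k in states}]
for x in X[1:]:
    alpha.append({k: B[k][x] * sum(alpha[-1][j] * A[j][k] for j in states)
                  for k in states})

# Backward: beta(z_n) = sum_{z_{n+1}} beta(z_{n+1}) p(x_{n+1} | z_{n+1}) p(z_{n+1} | z_n)
beta = [{k: 1.0 for k in states}]            # beta(z_N) = 1
for n in range(len(X) - 2, -1, -1):
    beta.insert(0, {j: sum(beta[0][k] * B[k][X[n + 1]] * A[j][k] for k in states)
                    for j in states})

# p(X) = sum_{z_n} alpha(z_n) beta(z_n), the same at every step n.
checks = [sum(alpha[n][k] * beta[n][k] for k in states) for n in range(len(X))]
print(checks)  # every entry ≈ 0.04098
```

That all entries agree is a useful sanity check when implementing the forward-backward algorithm.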
Alpha-Beta

Now we know how to calculate α(z_n) and β(z_n) for each step. What is the probability of the data, p(X)? Use the definition of γ(z_n) and Bayes' theorem:

γ(z_n) = p(z_n | X) = p(X | z_n) p(z_n) / p(X) = p(X, z_n) / p(X)

and the following conditional independence statement from the graphical model of the HMM:

p(X | z_n) = p(x_1, ..., x_n | z_n) p(x_{n+1}, ..., x_N | z_n) = [α(z_n) / p(z_n)] β(z_n)

and therefore

γ(z_n) = α(z_n) β(z_n) / p(X).
Alpha-Beta

Marginalising γ(z_n) over z_n results in

1 = Σ_{z_n} γ(z_n) = Σ_{z_n} α(z_n) β(z_n) / p(X)

and therefore at each step n

p(X) = Σ_{z_n} α(z_n) β(z_n).

This is most conveniently evaluated at step N, where β(z_N) = 1, as

p(X) = Σ_{z_N} α(z_N).
Alpha-Beta

Finally, we need to calculate the joint posterior distribution of two successive latent variables, ξ(z_{n-1}, z_n), defined by

ξ(z_{n-1}, z_n) = p(z_{n-1}, z_n | X).

This can be calculated directly from the α and β values in the form

ξ(z_{n-1}, z_n) = α(z_{n-1}) p(x_n | z_n) p(z_n | z_{n-1}) β(z_n) / p(X).
How to train a HMM

1. Make an initial selection for the parameters θ_old, where θ = {π, A, φ}. (Often A and π are initialised uniformly or randomly. The initialisation of the φ_k depends on the emission distribution; for Gaussians, run K-means first and get μ_k and Σ_k from there.)
2. (Start of E-step) Run the forward recursion to calculate α(z_n).
3. Run the backward recursion to calculate β(z_n).
4. Calculate γ(z_n) and ξ(z_{n-1}, z_n) from α(z_n) and β(z_n).
5. Evaluate the likelihood p(X).
6. (Start of M-step) Find a θ_new maximising Q(θ, θ_old). This results in new settings for the parameters π_k, A_jk and φ_k, as described before.
7. Iterate until convergence is detected.
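The steps above can be sketched end-to-end for a discrete-emission HMM. This is a minimal, unscaled sketch (a practical implementation would normalise the alphas to avoid underflow on long sequences); all parameters and data below are hypothetical, and the emission matrix B takes the place of φ:

```python
# One EM iteration (Baum-Welch) for a discrete-emission HMM.
K, V = 2, 2                          # number of states, observation symbols
pi = [0.6, 0.4]                      # step 1: hand-picked initialisation
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]         # B[k][v] = p(x = v | z = k)
X = [0, 0, 1, 0, 1, 1]
N = len(X)

def e_step():
    # Steps 2-5: forward recursion, backward recursion, gamma/xi, likelihood.
    alpha = [[pi[k] * B[k][X[0]] for k in range(K)]]
    for n in range(1, N):
        alpha.append([B[k][X[n]] * sum(alpha[-1][j] * A[j][k] for j in range(K))
                      for k in range(K)])
    beta = [[1.0] * K]               # beta(z_N) = 1
    for n in range(N - 2, -1, -1):
        beta.insert(0, [sum(beta[0][k] * B[k][X[n + 1]] * A[j][k] for k in range(K))
                        for j in range(K)])
    pX = sum(alpha[-1])              # likelihood p(X)
    gamma = [[alpha[n][k] * beta[n][k] / pX for k in range(K)] for n in range(N)]
    xi = [[[alpha[n - 1][j] * B[k][X[n]] * A[j][k] * beta[n][k] / pX
            for k in range(K)] for j in range(K)] for n in range(1, N)]
    return gamma, xi, pX

def m_step(gamma, xi):
    # Step 6: re-estimate pi, A and the (discrete) emission table B.
    global pi, A, B
    pi = [gamma[0][k] / sum(gamma[0]) for k in range(K)]
    A = []
    for j in range(K):
        row = [sum(x[j][k] for x in xi) for k in range(K)]
        A.append([v / sum(row) for v in row])
    B = []
    for k in range(K):
        row = [sum(g[k] for n, g in enumerate(gamma) if X[n] == v) for v in range(V)]
        B.append([c / sum(row) for c in row])

gamma, xi, before = e_step()
m_step(gamma, xi)
_, _, after = e_step()               # step 7 would repeat until convergence
```

A single iteration already illustrates the EM guarantee: the likelihood `after` is never smaller than `before`.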
Alpha-Beta - Notes

To calculate the likelihood, we need to take the joint probability p(X, Z) and sum over all possible values of Z. Every particular choice of Z corresponds to one path through the lattice diagram, and there are exponentially many of them! Using the alpha-beta algorithm, the exponential cost is reduced to one linear in the length of the chain. How did we do that? By swapping the order of multiplication and summation.
HMM - Viterbi Algorithm

Motivation: the latent states can have a meaningful interpretation, e.g. phonemes in a speech recognition system where the observed variables are the acoustic signals.

Goal: after the system has been trained, find the most probable sequence of latent states for a given sequence of observations.

Warning: finding the set of states which are each individually the most probable does NOT solve this problem.
HMM - Viterbi Algorithm

Define

ω(z_n) = max_{z_1, ..., z_{n-1}} ln p(x_1, ..., x_n, z_1, ..., z_n).

From the joint distribution of the HMM, given by

p(x_1, ..., x_N, z_1, ..., z_N) = p(z_1) [ ∏_{n=2}^{N} p(z_n | z_{n-1}) ] ∏_{n=1}^{N} p(x_n | z_n),

the following recursion can be derived:

ω(z_n) = ln p(x_n | z_n) + max_{z_{n-1}} { ln p(z_n | z_{n-1}) + ω(z_{n-1}) }
ω(z_1) = ln p(z_1) + ln p(x_1 | z_1) = ln p(x_1, z_1).

(Figure: lattice over steps n−2, ..., n+1 with K = 3 states, highlighting the best path into each state.)
HMM - Viterbi Algorithm

Calculate ω(z_n) = max_{z_1, ..., z_{n-1}} ln p(x_1, ..., x_n, z_1, ..., z_n) for n = 1, ..., N. For each step n, remember which is the best transition into each state at the next step. At step N: find the state with the highest probability. Then, for n = N−1, ..., 1: backtrack which transition led to the most probable state and identify from which state it came.

(Figure: lattice over steps n−2, ..., n+1 with K = 3 states, showing the backtracked path.)
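The ω recursion and the backtrace can be sketched in log space on Mary's weather model from the example at the start of this part (working with ln-probabilities avoids underflow on long sequences):

```python
import math

# Log-space Viterbi for Mary's HMM: the omega recursion plus backpointers.
pi = {"rainy": 0.2, "sunny": 0.8}
A = {"rainy": {"rainy": 0.3, "sunny": 0.7},
     "sunny": {"rainy": 0.4, "sunny": 0.6}}
B = {"rainy": {"cycle": 0.1, "shop": 0.4, "study": 0.5},
     "sunny": {"cycle": 0.6, "shop": 0.3, "study": 0.1}}
X = ["cycle", "shop", "study"]
states = list(pi)

# omega(z_1) = ln p(z_1) + ln p(x_1 | z_1)
omega = {k: math.log(pi[k]) + math.log(B[k][X[0]]) for k in states}
backpointers = []
for x in X[1:]:
    prev, omega, bp = omega, {}, {}
    for k in states:
        # omega(z_n) = ln p(x_n | z_n) + max_{z_{n-1}} [ln p(z_n | z_{n-1}) + omega(z_{n-1})]
        best = max(states, key=lambda j: math.log(A[j][k]) + prev[j])
        bp[k] = best
        omega[k] = math.log(B[k][x]) + math.log(A[best][k]) + prev[best]
    backpointers.append(bp)

# Backtrace from the most probable final state.
path = [max(states, key=lambda k: omega[k])]
for bp in reversed(backpointers):
    path.append(bp[path[-1]])
path.reverse()
print(path)  # ['sunny', 'sunny', 'rainy']
```

Exponentiating the best final ω recovers the probability of the winning path, about 0.017, matching the probability-space calculation for question (b).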