Cheng Soon Ong & Christian Walder
Research Group and College of Engineering and Computer Science
Canberra, February - June 2018

Outline: Overview; Introduction; Linear Algebra; Probability; Linear Regression 1; Linear Regression 2; Linear Classification 1; Linear Classification 2; Kernel Methods; Sparse Kernel Methods; Mixture Models and EM 1; Mixture Models and EM 2; Neural Networks 1; Neural Networks 2; Principal Component Analysis; Autoencoders; Graphical Models 1; Graphical Models 2; Graphical Models 3; Sampling; Sequential Data 1; Sequential Data 2

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")

Part XVIII: Sequential Data 2

Topics: Alpha-Beta; How to train a HMM; HMM - Viterbi Algorithm

Example of a Hidden Markov Model

Assume Peter and Mary are students in Canberra and Sydney, respectively. Peter is a computer science student and only interested in riding his bicycle, shopping for new computer gadgets, and studying. (Well, he also does other things, but because these other activities don't depend on the weather we neglect them here.) Mary does not know the current weather in Canberra, but knows the general trends of the weather in Canberra. She also knows Peter well enough to know what he does on average on rainy days and also when the skies are blue. She believes that the weather (rainy or not rainy) follows a given discrete Markov chain. She tries to guess the sequence of weather patterns for a number of days after Peter tells her on the phone what he did in the last days.

Example of a Hidden Markov Model

Mary uses the following model:

                           rainy   sunny
  initial probability       0.2     0.8

  transition probability   rainy   sunny
    from rainy              0.3     0.7
    from sunny              0.4     0.6

  emission probability     rainy   sunny
    cycle                   0.1     0.6
    shop                    0.4     0.3
    study                   0.5     0.1

Assume Peter tells Mary that the list of his activities in the last days was [cycle, shop, study].
(a) Calculate the probability of this observation sequence.
(b) Calculate the most probable sequence of hidden states for these observations.
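To make the example concrete, here is a minimal brute-force sketch (plain Python; the variable names are ours, not the slides') that answers (a) and (b) by enumerating all 2^3 hidden state sequences. The forward and Viterbi algorithms introduced later do the same job efficiently.

```python
# Brute-force answers to (a) and (b) for the toy weather model above.
from itertools import product

states = ["rainy", "sunny"]
init = {"rainy": 0.2, "sunny": 0.8}
trans = {"rainy": {"rainy": 0.3, "sunny": 0.7},
         "sunny": {"rainy": 0.4, "sunny": 0.6}}
emit = {"rainy": {"cycle": 0.1, "shop": 0.4, "study": 0.5},
        "sunny": {"cycle": 0.6, "shop": 0.3, "study": 0.1}}
obs = ["cycle", "shop", "study"]

def joint(path, obs):
    """p(X, Z) for one hidden path Z and the observation sequence X."""
    p = init[path[0]] * emit[path[0]][obs[0]]
    for n in range(1, len(obs)):
        p *= trans[path[n - 1]][path[n]] * emit[path[n]][obs[n]]
    return p

paths = list(product(states, repeat=len(obs)))
prob_obs = sum(joint(z, obs) for z in paths)         # answer to (a)
best_path = max(paths, key=lambda z: joint(z, obs))  # answer to (b)
print(prob_obs)   # should give about 0.04098
print(best_path)  # should give ('sunny', 'sunny', 'rainy')
```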

Hidden Markov Model

The trellis for the hidden Markov model. Find the probability of [cycle, shop, study].

[Trellis diagram: states Rainy/Sunny unrolled over three time steps, annotated with the initial, transition and emission probabilities from the table above.]

Hidden Markov Model

Find the most probable hidden states for [cycle, shop, study].

[Same trellis diagram as above, now used to find the most probable state sequence.]

Homogeneous Hidden Markov Model

Joint probability distribution over both latent and observed variables:

  p(X, Z | θ) = p(z_1 | π) [ \prod_{n=2}^{N} p(z_n | z_{n-1}, A) ] \prod_{m=1}^{N} p(x_m | z_m, φ)

where X = (x_1, ..., x_N), Z = (z_1, ..., z_N), and θ = {π, A, φ}.

Most of the discussion will be independent of the particular choice of emission probabilities (e.g. discrete tables, Gaussians, mixtures of Gaussians).
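As an illustration of this factorisation, the following sketch evaluates ln p(X, Z | θ) for one particular state path, assuming integer-coded states and observations and a discrete emission table; the array names pi, A, Phi are our own convention, not from the slides.

```python
# Minimal sketch: evaluate the HMM joint factorisation for one state path.
import numpy as np

def log_joint(x, z, pi, A, Phi):
    """ln p(X, Z | theta) = ln p(z_1 | pi) + sum_n ln p(z_n | z_{n-1}, A)
                            + sum_n ln p(x_n | z_n, Phi)."""
    lp = np.log(pi[z[0]])
    lp += sum(np.log(A[z[n - 1], z[n]]) for n in range(1, len(z)))
    lp += sum(np.log(Phi[z[n], x[n]]) for n in range(len(x)))
    return lp
```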

We have observed a data set X = (x_1, ..., x_N). Assume it came from an HMM with a given structure (number of nodes, form of emission probabilities). The likelihood of the data is

  p(X | θ) = \sum_Z p(X, Z | θ)

This joint distribution does not factorise over n (as it does for the mixture distribution). We have N variables, each with K states: K^N terms. The number of terms grows exponentially with the length of the chain. But we can use the conditional independence of the latent variables to reorder their calculation later. A further obstacle to finding a closed-form maximum likelihood solution: calculating the emission probabilities for the different states z_n.

HMM - EM

Employ the EM algorithm to find the maximum likelihood solution for the HMM.

Start with some initial parameter settings θ_old.

E-step: Find the posterior distribution of the latent variables p(Z | X, θ_old).

M-step: Maximise

  Q(θ, θ_old) = \sum_Z p(Z | X, θ_old) ln p(X, Z | θ)

with respect to the parameters θ = {π, A, φ}.

HMM - EM

Denote the marginal posterior distribution of z_n by γ(z_n), and the joint posterior distribution of two successive latent variables by ξ(z_{n-1}, z_n):

  γ(z_n) = p(z_n | X, θ_old)
  ξ(z_{n-1}, z_n) = p(z_{n-1}, z_n | X, θ_old).

For each step n, γ(z_n) has K nonnegative values which sum to 1. For each step n, ξ(z_{n-1}, z_n) has K × K nonnegative values which sum to 1. Elements of these are denoted by γ(z_{nk}) and ξ(z_{n-1,j}, z_{nk}) respectively.

HMM - EM

Because the expectation of a binary random variable is the probability that it is one, we get with this notation

  γ(z_{nk}) = E[z_{nk}] = \sum_{z_n} γ(z_n) z_{nk}
  ξ(z_{n-1,j}, z_{nk}) = E[z_{n-1,j} z_{nk}] = \sum_{z_{n-1}, z_n} ξ(z_{n-1}, z_n) z_{n-1,j} z_{nk}.

Putting it all together, we get

  Q(θ, θ_old) = \sum_{k=1}^{K} γ(z_{1k}) ln π_k
              + \sum_{n=2}^{N} \sum_{j=1}^{K} \sum_{k=1}^{K} ξ(z_{n-1,j}, z_{nk}) ln A_{jk}
              + \sum_{n=1}^{N} \sum_{k=1}^{K} γ(z_{nk}) ln p(x_n | φ_k).

HMM - EM

M-step: Maximising

  Q(θ, θ_old) = \sum_{k=1}^{K} γ(z_{1k}) ln π_k
              + \sum_{n=2}^{N} \sum_{j=1}^{K} \sum_{k=1}^{K} ξ(z_{n-1,j}, z_{nk}) ln A_{jk}
              + \sum_{n=1}^{N} \sum_{k=1}^{K} γ(z_{nk}) ln p(x_n | φ_k)

results in

  π_k = γ(z_{1k}) / \sum_{j=1}^{K} γ(z_{1j})

  A_{jk} = \sum_{n=2}^{N} ξ(z_{n-1,j}, z_{nk}) / \sum_{l=1}^{K} \sum_{n=2}^{N} ξ(z_{n-1,j}, z_{nl}).
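A minimal numpy sketch of these two updates, assuming gamma and xi have already been computed in the E-step and are stored as arrays of shape (N, K) and (N-1, K, K); the shapes and names are assumptions made here for illustration.

```python
# M-step updates for the initial distribution pi and transition matrix A.
import numpy as np

def m_step_pi_A(gamma, xi):
    pi = gamma[0] / gamma[0].sum()        # pi_k from gamma(z_1k)
    A = xi.sum(axis=0)                    # sum_n xi(z_{n-1,j}, z_nk)
    A = A / A.sum(axis=1, keepdims=True)  # normalise over k for each row j
    return pi, A
```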

HMM - EM

Still left: maximising

  Q(θ, θ_old) = \sum_{k=1}^{K} γ(z_{1k}) ln π_k
              + \sum_{n=2}^{N} \sum_{j=1}^{K} \sum_{k=1}^{K} ξ(z_{n-1,j}, z_{nk}) ln A_{jk}
              + \sum_{n=1}^{N} \sum_{k=1}^{K} γ(z_{nk}) ln p(x_n | φ_k)

with respect to φ. But φ only appears in the last term, and under the assumption that all the φ_k are independent of each other, this term decouples into a sum. Then maximise each contribution \sum_{n=1}^{N} γ(z_{nk}) ln p(x_n | φ_k) individually.

HMM - EM

In the case of Gaussian emission densities p(x | φ_k) = N(x | μ_k, Σ_k), we get for the maximising parameters of the emission densities

  μ_k = \sum_{n=1}^{N} γ(z_{nk}) x_n / \sum_{n=1}^{N} γ(z_{nk})

  Σ_k = \sum_{n=1}^{N} γ(z_{nk}) (x_n - μ_k)(x_n - μ_k)^T / \sum_{n=1}^{N} γ(z_{nk}).
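For the Gaussian case, the same updates can be written compactly in numpy; the sketch below assumes observations X of shape (N, D) and responsibilities gamma of shape (N, K), both names chosen here for illustration.

```python
# M-step updates for Gaussian emission densities.
import numpy as np

def m_step_gaussian(X, gamma):
    Nk = gamma.sum(axis=0)                                # (K,) effective counts
    mu = (gamma.T @ X) / Nk[:, None]                      # (K, D) weighted means
    K, D = mu.shape
    Sigma = np.zeros((K, D, D))
    for k in range(K):
        diff = X - mu[k]                                  # (N, D)
        Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return mu, Sigma
```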

We need to efficiently evaluate the γ(z_{nk}) and ξ(z_{n-1,j}, z_{nk}). The graphical model for the HMM is a tree! We know we can use a two-stage message passing algorithm to calculate the posterior distribution of the latent variables. For the HMM this is called the forward-backward algorithm (Rabiner, 1989), or Baum-Welch algorithm (Baum, 1972). Other variants exist, differing only in the form of the messages propagated. We look at the most widely known, the alpha-beta algorithm.

Conditional Independence for the HMM

Given the data X = {x_1, ..., x_N}, the following independence relations hold:

  p(X | z_n) = p(x_1, ..., x_n | z_n) p(x_{n+1}, ..., x_N | z_n)
  p(x_1, ..., x_{n-1} | x_n, z_n) = p(x_1, ..., x_{n-1} | z_n)
  p(x_1, ..., x_{n-1} | z_{n-1}, z_n) = p(x_1, ..., x_{n-1} | z_{n-1})
  p(x_{n+1}, ..., x_N | z_n, z_{n+1}) = p(x_{n+1}, ..., x_N | z_{n+1})
  p(x_{n+2}, ..., x_N | x_{n+1}, z_{n+1}) = p(x_{n+2}, ..., x_N | z_{n+1})
  p(X | z_{n-1}, z_n) = p(x_1, ..., x_{n-1} | z_{n-1}) p(x_n | z_n) p(x_{n+1}, ..., x_N | z_n)
  p(x_{N+1} | X, z_{N+1}) = p(x_{N+1} | z_{N+1})
  p(z_{N+1} | X, z_N) = p(z_{N+1} | z_N)

[Graphical model: chain z_1 → z_2 → ... → z_{n-1} → z_n → z_{n+1}, with each latent variable z_i emitting an observation x_i.]

Conditional Independence for the HMM - Example

Let's look at the following independence relation:

  p(X | z_n) = p(x_1, ..., x_n | z_n) p(x_{n+1}, ..., x_N | z_n)

Any path from the set {x_1, ..., x_n} to the set {x_{n+1}, ..., x_N} has to go through z_n. In p(X | z_n) the node z_n is conditioned on (= observed). All paths from x_1, ..., x_{n-1} through z_n to x_{n+1}, ..., x_N are head-to-tail, and therefore blocked because z_n is observed. The path from x_n through z_n to z_{n+1} is tail-to-tail, so also blocked. Therefore x_1, ..., x_n ⊥ x_{n+1}, ..., x_N | z_n.

[Graphical model: chain z_1 → z_2 → ... → z_{n-1} → z_n → z_{n+1}, with each latent variable z_i emitting an observation x_i.]

Alpha-Beta

Define the joint probability of observing all data up to step n and having z_n as latent variable to be

  α(z_n) = p(x_1, ..., x_n, z_n).

Define the probability of all future data given z_n to be

  β(z_n) = p(x_{n+1}, ..., x_N | z_n).

Then it can be shown that the following recursion holds:

  α(z_n) = p(x_n | z_n) \sum_{z_{n-1}} α(z_{n-1}) p(z_n | z_{n-1})

with initialisation

  α(z_1) = \prod_{k=1}^{K} {π_k p(x_1 | φ_k)}^{z_{1k}}.

Alpha-Beta

At step n we can efficiently calculate α(z_n) given α(z_{n-1}):

  α(z_n) = p(x_n | z_n) \sum_{z_{n-1}} α(z_{n-1}) p(z_n | z_{n-1})

[Lattice fragment: the values α(z_{n-1,1}), α(z_{n-1,2}), α(z_{n-1,3}) at step n-1 are combined via the transition probabilities A_{11}, A_{21}, A_{31} and multiplied by p(x_n | z_{n,1}) to give α(z_{n,1}) at step n.]
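A minimal numpy sketch of the alpha recursion for discrete emissions (unscaled; a practical implementation rescales the alphas to avoid numerical underflow). The conventions pi of shape (K,), A[j, k] = p(z_n = k | z_{n-1} = j) and Phi[k, v] = p(x = v | z = k) are our own assumptions.

```python
# Forward (alpha) recursion for a discrete-emission HMM.
import numpy as np

def forward(x, pi, A, Phi):
    N, K = len(x), len(pi)
    alpha = np.zeros((N, K))
    alpha[0] = pi * Phi[:, x[0]]                      # alpha(z_1)
    for n in range(1, N):
        # p(x_n | z_n) * sum_{z_{n-1}} alpha(z_{n-1}) p(z_n | z_{n-1})
        alpha[n] = Phi[:, x[n]] * (alpha[n - 1] @ A)
    return alpha
```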

Alpha-Beta

And for β(z_n) we get the recursion

  β(z_n) = \sum_{z_{n+1}} β(z_{n+1}) p(x_{n+1} | z_{n+1}) p(z_{n+1} | z_n)

[Lattice fragment: β(z_{n,1}) at step n is obtained from β(z_{n+1,1}), β(z_{n+1,2}), β(z_{n+1,3}) at step n+1 via the transition probabilities A_{11}, A_{12}, A_{13} and the emission probabilities p(x_{n+1} | z_{n+1,k}).]

Alpha-Beta

How do we start the β recursion? What is β(z_N)? From the definition,

  β(z_N) = p(x_{N+1}, ..., x_N | z_N),

a conditional over an empty set of observations. It can be shown that setting

  β(z_N) = 1

is consistent with the approach.

[Same lattice fragment as above, illustrating the β recursion.]
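The corresponding backward pass, using the same assumed conventions as the forward sketch and starting from beta(z_N) = 1 (again unscaled, for illustration only).

```python
# Backward (beta) recursion for a discrete-emission HMM.
import numpy as np

def backward(x, A, Phi):
    N, K = len(x), A.shape[0]
    beta = np.ones((N, K))                               # beta(z_N) = 1
    for n in range(N - 2, -1, -1):
        # sum_{z_{n+1}} beta(z_{n+1}) p(x_{n+1} | z_{n+1}) p(z_{n+1} | z_n)
        beta[n] = A @ (Phi[:, x[n + 1]] * beta[n + 1])
    return beta
```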

Alpha-Beta

Now we know how to calculate α(z_n) and β(z_n) for each step. What is the probability of the data, p(X)? Use the definition of γ(z_n) and Bayes' rule

  γ(z_n) = p(z_n | X) = p(X | z_n) p(z_n) / p(X) = p(X, z_n) / p(X)

and the following conditional independence statement from the graphical model of the HMM

  p(X | z_n) = p(x_1, ..., x_n | z_n) p(x_{n+1}, ..., x_N | z_n) = [α(z_n) / p(z_n)] β(z_n)

and therefore

  γ(z_n) = α(z_n) β(z_n) / p(X).

Alpha-Beta

Marginalising over z_n results in

  1 = \sum_{z_n} γ(z_n) = \sum_{z_n} α(z_n) β(z_n) / p(X)

and therefore at each step n

  p(X) = \sum_{z_n} α(z_n) β(z_n).

This is most conveniently evaluated at step N, where β(z_N) = 1, as

  p(X) = \sum_{z_N} α(z_N).

Alpha-Beta

Finally, we need the joint posterior distribution of two successive latent variables, ξ(z_{n-1}, z_n), defined by

  ξ(z_{n-1}, z_n) = p(z_{n-1}, z_n | X).

This can be obtained directly from the α and β values in the form

  ξ(z_{n-1}, z_n) = α(z_{n-1}) p(x_n | z_n) p(z_n | z_{n-1}) β(z_n) / p(X).
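Putting alpha and beta together, a short sketch computing p(X), gamma and xi, under the same assumed conventions as the earlier sketches.

```python
# Posteriors gamma, xi and likelihood p(X) from alpha and beta.
import numpy as np

def posteriors(x, alpha, beta, A, Phi):
    pX = alpha[-1].sum()                                  # p(X) = sum_{z_N} alpha(z_N)
    gamma = alpha * beta / pX                             # (N, K)
    N, K = alpha.shape
    xi = np.zeros((N - 1, K, K))
    for n in range(1, N):
        # xi(z_{n-1,j}, z_nk) = alpha(z_{n-1}) p(x_n|z_n) p(z_n|z_{n-1}) beta(z_n) / p(X)
        xi[n - 1] = (alpha[n - 1][:, None] * A
                     * (Phi[:, x[n]] * beta[n])[None, :]) / pX
    return pX, gamma, xi
```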

How to train a HMM

1. Make an initial selection for the parameters θ_old, where θ = {π, A, φ}. (Often, A and π are initialised uniformly or randomly. The initialisation of φ_k depends on the emission distribution; for Gaussians, run K-means first and get μ_k and Σ_k from there.)
2. (Start of E-step) Run the forward recursion to calculate α(z_n).
3. Run the backward recursion to calculate β(z_n).
4. Calculate γ(z_n) and ξ(z_{n-1}, z_n) from α(z_n) and β(z_n).
5. Evaluate the likelihood p(X).
6. (Start of M-step) Find a θ_new maximising Q(θ, θ_old). This results in new settings for the parameters π_k, A_{jk} and φ_k as described before.
7. Iterate until convergence is detected.
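The steps above can be combined into a compact, self-contained Baum-Welch sketch for a discrete-emission HMM. This is an unscaled illustration under our own naming conventions (single observation sequence, fixed number of iterations instead of a convergence test), not a production implementation.

```python
# Minimal Baum-Welch sketch for a discrete-emission HMM (steps 1-7 above).
import numpy as np

def baum_welch(x, K, V, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    N = len(x)
    # 1. initial parameters: uniform pi, random row-normalised A and Phi
    pi = np.full(K, 1.0 / K)
    A = rng.random((K, K)); A /= A.sum(axis=1, keepdims=True)
    Phi = rng.random((K, V)); Phi /= Phi.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # 2.-3. forward and backward recursions
        alpha = np.zeros((N, K)); beta = np.ones((N, K))
        alpha[0] = pi * Phi[:, x[0]]
        for n in range(1, N):
            alpha[n] = Phi[:, x[n]] * (alpha[n - 1] @ A)
        for n in range(N - 2, -1, -1):
            beta[n] = A @ (Phi[:, x[n + 1]] * beta[n + 1])
        # 4.-5. posteriors gamma, xi and likelihood p(X)
        pX = alpha[-1].sum()
        gamma = alpha * beta / pX
        xi = np.stack([(alpha[n - 1][:, None] * A
                        * (Phi[:, x[n]] * beta[n])[None, :]) / pX
                       for n in range(1, N)])
        # 6. M-step
        pi = gamma[0] / gamma[0].sum()
        A = xi.sum(axis=0); A /= A.sum(axis=1, keepdims=True)
        for v in range(V):
            Phi[:, v] = gamma[x == v].sum(axis=0)
        Phi /= Phi.sum(axis=1, keepdims=True)
        # 7. here one would normally check p(X) for convergence
    return pi, A, Phi, pX
```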

Alpha-Beta - Notes

In order to calculate the likelihood, we need to use the joint probability p(X, Z) and sum over all possible values of Z. Every particular choice of Z corresponds to one path through the lattice diagram, and there are exponentially many of them. Using the alpha-beta algorithm, the exponential cost has been reduced to a cost that is linear in the length of the chain. How did we do that? By swapping the order of multiplication and summation.

HMM - Viterbi Algorithm

Motivation: The latent states can have some meaningful interpretation, e.g. phonemes in a speech recognition system where the observed variables are the acoustic signals.

Goal: After the system has been trained, find the most probable sequence of latent states for a given sequence of observations.

Warning: Finding the set of states which are each individually the most probable does NOT solve this problem.

HMM - Viterbi Algorithm

Define

  ω(z_n) = \max_{z_1, ..., z_{n-1}} ln p(x_1, ..., x_n, z_1, ..., z_n).

From the joint distribution of the HMM given by

  p(x_1, ..., x_N, z_1, ..., z_N) = p(z_1) [ \prod_{n=2}^{N} p(z_n | z_{n-1}) ] \prod_{n=1}^{N} p(x_n | z_n)

the following recursion can be derived:

  ω(z_n) = ln p(x_n | z_n) + \max_{z_{n-1}} { ln p(z_n | z_{n-1}) + ω(z_{n-1}) }
  ω(z_1) = ln p(z_1) + ln p(x_1 | z_1) = ln p(x_1, z_1).

[Lattice diagram: K = 3 states over steps n-2, n-1, n, n+1, showing the paths considered by the Viterbi recursion.]


- Viterbi Algorithm Calculate ω(z n ) = max ln p(x 1,..., x n, z 1,..., z n ) z 1,...,z n 1 for n = 1,..., N. For each step n remember which is the best transition to go into each state at the next step. At step N : Find the state with the highest probability. For n = 1,..., N 1: Backtrace which transition led to the most probable state and identify from which state it came. k = 1 Alpha-Beta How to train a - Viterbi Algorithm k = 2 k = 3 n 2 n 1 n n + 1 675of 828