CSCI-567: Machine Learning (Spring 2019)
Prof. Victor Adamchik, University of Southern California
April 2, 2019

We Live in Exciting Times

ACM (an international computing research society) has named Yoshua Bengio, Geoffrey Hinton, and Yann LeCun recipients of the 2018 ACM Turing Award for conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing. The ACM Turing Award, often referred to as the "Nobel Prize of Computing," carries a $1 million prize. It is named for Alan M. Turing, the British mathematician who articulated the mathematical foundations and limits of computing.

Outline

1 Markov chain
2 Hidden Markov Model

Markov Models

Markov models are powerful probabilistic models for analyzing sequential data. A. A. Markov (1856-1922) introduced Markov chains in 1906, when he produced the first theoretical results for stochastic processes. They are now commonly used in

text and speech recognition
stock market prediction
bioinformatics

A Markov chain can be drawn as a directed, strongly connected graph with self-loops. Each edge is labeled by a positive probability, and at each state the probabilities on the outgoing edges sum to 1. These labels form the transition (or stochastic) matrix $A$, with entries $a_{ij} = P(i \to j \text{ in one step})$.

Markov chain

Definition: given sequentially ordered random variables $X_1, X_2, \dots, X_t, \dots, X_T$, called states, a Markov chain is specified by

a transition probability, describing how the state at time $t-1$ changes to the state at time $t$: $P(X_t = \text{value} \mid X_{t-1} = \text{value})$
an initial probability, describing the initial state at time $t = 1$: $P(X_1 = \text{value})$

All $X_t$'s take values from the same discrete set $\{1, \dots, N\}$. We will assume that the transition probability does not change with time $t$, i.e., a stationary Markov chain.

The transition probabilities form a matrix $A$ whose elements are

$a_{ij} = P(X_t = j \mid X_{t-1} = i)$

and the initial probability becomes a vector $\pi$ whose elements are

$\pi_i = P(X_1 = i)$

where $i$ and $j$ range over $1, \dots, N$. We have the constraints

$\sum_j a_{ij} = 1, \qquad \sum_i \pi_i = 1$

and, additionally, all those numbers must be non-negative.
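To make these definitions concrete, here is a minimal numpy sketch (ours, not the lecture's) that stores a stationary Markov chain as the pair $(\pi, A)$, checks the constraints just stated, and samples a state sequence; the two-state weather chain is a hypothetical example.

```python
import numpy as np

rng = np.random.default_rng(0)

states = ["sunny", "rainy"]                # the discrete state set, N = 2
pi = np.array([0.6, 0.4])                  # pi_i = P(X_1 = i)
A = np.array([[0.8, 0.2],                  # a_ij = P(X_t = j | X_{t-1} = i)
              [0.4, 0.6]])

assert np.isclose(pi.sum(), 1.0)           # sum_i pi_i = 1
assert np.allclose(A.sum(axis=1), 1.0)     # each row of A sums to 1
assert (pi >= 0).all() and (A >= 0).all()  # all entries non-negative

def sample_chain(pi, A, T):
    """Draw one length-T state sequence x_1, ..., x_T from the chain."""
    x = [rng.choice(len(pi), p=pi)]                    # initial state from pi
    for _ in range(T - 1):
        x.append(rng.choice(len(pi), p=A[x[-1]]))      # next state from row x[-1]
    return x

print([states[i] for i in sample_chain(pi, A, 5)])
```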

Examples

Example 1 (Language model): states $[N]$ represent a dictionary of words, and $a_{\text{ice},\text{cream}} = P(X_{t+1} = \text{cream} \mid X_t = \text{ice})$ is an example of a transition probability.

Example 2 (Weather): states $[N]$ represent the weather on each day, e.g. $a_{\text{sunny},\text{rainy}} = P(X_{t+1} = \text{rainy} \mid X_t = \text{sunny})$.

Definition

A Markov chain is a stochastic process with the Markov property: a sequence of random variables $X_1, X_2, \dots$ such that

$P(X_{t+1} \mid X_1, X_2, \dots, X_t) = P(X_{t+1} \mid X_t)$

i.e., the next state depends only on the most recent state.

Is the Markov assumption reasonable? Not entirely, for the language model for example. Higher-order Markov chains make it more reasonable, e.g.

$P(X_{t+1} \mid X_1, X_2, \dots, X_t) = P(X_{t+1} \mid X_t, X_{t-1})$

i.e., the next word depends only on the last two words.

Exercise 1

Consider the following Markov model. [The transition diagram over the states Work (W), Coffee (C), and Facebook (F) is not reproduced in this transcript.] Given that I am now having Coffee, what is the probability that the next step is Facebook and the one after is Work?

$P(X_3 = W, X_2 = F \mid X_1 = C)$
$= \dfrac{P(X_3 = W, X_2 = F, X_1 = C)}{P(X_1 = C)}$
$= \dfrac{P(X_3 = W \mid X_2 = F, X_1 = C) \, P(X_2 = F \mid X_1 = C) \, P(X_1 = C)}{P(X_1 = C)}$ (chain rule)
$= P(X_3 = W \mid X_2 = F) \, P(X_2 = F \mid X_1 = C)$ (Markov property)
$= 0.5 \times 0.6 = 0.3$

Exercise 2

Given that I am now having Coffee, what is the probability that in two steps I am at Work?

$P(X_3 = W \mid X_1 = C) = \sum_s P(X_3 = W, X_2 = s \mid X_1 = C)$ (marginalization)
$= P(X_3 = W \mid X_2 = W) \, P(X_2 = W \mid X_1 = C)$
$\; + P(X_3 = W \mid X_2 = C) \, P(X_2 = C \mid X_1 = C)$
$\; + P(X_3 = W \mid X_2 = F) \, P(X_2 = F \mid X_1 = C)$
$= 0.3 \times 0.4 + 0.1 \times 0.3 + 0.6 \times 0.5 = 0.45$

Using the transition matrix:

$P(X_3 = j \mid X_1 = i) = \sum_{k=1}^{N} a_{ik} a_{kj} = (A^2)_{ij}$
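The closing identity $P(X_3 = j \mid X_1 = i) = (A^2)_{ij}$ is easy to check numerically. The lecture's Coffee/Facebook/Work diagram is not reproduced in this transcript, so the 3-state transition matrix below is hypothetical, but the check itself is general:

```python
import numpy as np

# Hypothetical 3-state chain; rows/columns ordered Work, Coffee, Facebook.
A = np.array([[0.4, 0.3, 0.3],
              [0.4, 0.3, 0.3],
              [0.5, 0.2, 0.3]])

i, j = 1, 0   # start at Coffee, ask for Work two steps later

# Explicit marginalization over the intermediate state, as in Exercise 2:
by_sum = sum(A[i, k] * A[k, j] for k in range(A.shape[0]))

# The same number via the matrix power:
by_power = np.linalg.matrix_power(A, 2)[i, j]

assert np.isclose(by_sum, by_power)
print(by_sum)
```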

Parameter estimation for Markov models

Now suppose we have observed $M$ sequences of examples:

$x_{1,1}, \dots, x_{1,T}$
$\vdots$
$x_{M,1}, \dots, x_{M,T}$

where for simplicity we assume each sequence has the same length $T$, and lowercase $x_{m,t}$ denotes the observed value of the random variable $X_{m,t}$. From these observations, how do we learn the model parameters $(\pi, A)$?

Finding the MLE

Same story: maximum likelihood estimation,

$\operatorname{argmax}_{\pi, A} \; \ln P(X_1 = x_1, X_2 = x_2, \dots, X_T = x_T)$

First, we need to compute this joint probability. Applying the chain rule for random variables, we get

$P(X_1, X_2, \dots, X_T) = P(X_2, X_3, \dots, X_T \mid X_1) \, P(X_1)$
$= P(X_3, \dots, X_T \mid X_1, X_2) \, P(X_2 \mid X_1) \, P(X_1)$
$= \cdots$
$= P(X_1) \prod_{t=2}^{T} P(X_t \mid X_1, \dots, X_{t-1})$
$= P(X_1) \prod_{t=2}^{T} P(X_t \mid X_{t-1})$ (Markov property)

The log-likelihood of a sequence $x_1, \dots, x_T$ is therefore

$\ln P(X_1 = x_1, X_2 = x_2, \dots, X_T = x_T) = \ln P(X_1 = x_1) + \sum_{t=2}^{T} \ln P(X_t = x_t \mid X_{t-1} = x_{t-1})$
$= \ln \pi_{x_1} + \sum_{t=2}^{T} \ln a_{x_{t-1}, x_t}$
$= \sum_n \mathbb{I}[x_1 = n] \ln \pi_n + \sum_{n, n'} \left( \sum_{t=2}^{T} \mathbb{I}[x_{t-1} = n, x_t = n'] \right) \ln a_{n, n'}$

Example: suppose we observed the following 2 sequences of length 5:

sunny, sunny, rainy, rainy, rainy
rainy, sunny, sunny, sunny, rainy

The MLE, using the counting formulas given next, is the model with $\pi_{\text{sunny}} = \pi_{\text{rainy}} = 1/2$ and transitions $a_{\text{sunny},\text{sunny}} = 3/5$, $a_{\text{sunny},\text{rainy}} = 2/5$, $a_{\text{rainy},\text{sunny}} = 1/3$, $a_{\text{rainy},\text{rainy}} = 2/3$ (verified in the sketch below).
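Here is a minimal sketch of the counting estimator, run on the two sunny/rainy sequences above (encoding sunny = 0, rainy = 1); the variable names are ours:

```python
import numpy as np

N = 2
seqs = [[0, 0, 1, 1, 1],   # sunny, sunny, rainy, rainy, rainy
        [1, 0, 0, 0, 1]]   # rainy, sunny, sunny, sunny, rainy

pi_counts = np.zeros(N)
A_counts = np.zeros((N, N))
for x in seqs:
    pi_counts[x[0]] += 1                 # count initial states
    for prev, cur in zip(x[:-1], x[1:]):
        A_counts[prev, cur] += 1         # count transitions

pi_hat = pi_counts / pi_counts.sum()                      # [0.5, 0.5]
A_hat = A_counts / A_counts.sum(axis=1, keepdims=True)    # [[3/5, 2/5], [1/3, 2/3]]
print(pi_hat)
print(A_hat)
```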

Finding the MLE

So the MLE is

$\operatorname{argmax}_{\pi, A} \; \sum_n (\#\text{initial states with value } n) \ln \pi_n + \sum_{n,n'} (\#\text{transitions from } n \text{ to } n') \ln a_{n,n'}$

We have seen this many times. The solution is (the derivation is left as an exercise):

$\pi_n = \dfrac{\#\text{sequences starting with } n}{\#\text{sequences}}$

$a_{n,n'} = \dfrac{\#\text{transitions from } n \text{ to } n'}{\#\text{transitions starting with } n}$

Outline

1 Markov chain
2 Hidden Markov Model
   Inferring HMMs
   Viterbi Algorithm
   Viterbi Algorithm: Example
   Learning HMMs

Markov model with outcomes

Now suppose each state $X_t$ also emits an outcome $Z_t \in [O]$ according to the model

$P(Z_t = o \mid X_t = s) = b_{s,o}$ (emission probability)

independent of everything else. Another example (pictured on Wikipedia): on each day, we also observe Bob's activity, walk, shop, or clean, which depends only on the weather of that day. In the language model, $Z_t$ is the speech signal for the underlying word $X_t$ (very useful for speech recognition).

Now the model parameters are $(\{\pi_s\}, \{a_{s,s'}\}, \{b_{s,o}\}) = (\pi, A, B)$.
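A minimal sketch (ours) extending the earlier Markov-chain sampler with emissions: at each step the state evolves according to $A$ and emits an outcome according to $B$. The numbers below are a hypothetical weather/activity model in the spirit of the Bob example, not the lecture's figure.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_hmm(pi, A, B, T):
    """Sample (x_1..x_T, z_1..z_T): states from (pi, A), outcomes from B."""
    x, z = [], []
    for t in range(T):
        prev = pi if t == 0 else A[x[-1]]             # distribution of next state
        x.append(rng.choice(len(pi), p=prev))
        z.append(rng.choice(B.shape[1], p=B[x[-1]]))  # emit from row b_{x_t, .}
    return x, z

pi = np.array([0.6, 0.4])                  # states: sunny, rainy
A = np.array([[0.8, 0.2], [0.4, 0.6]])
B = np.array([[0.6, 0.3, 0.1],             # b_{s,o}; outcomes: walk, shop, clean
              [0.1, 0.4, 0.5]])
print(sample_hmm(pi, A, B, 7))
```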

HMM defines a joint probability

$P(X_1, \dots, X_T, Z_1, \dots, Z_T) = P(X_1, \dots, X_T) \, P(Z_1, \dots, Z_T \mid X_1, \dots, X_T)$

The Markov assumption simplifies the first term:

$P(X_1, \dots, X_T) = P(X_1) \prod_{t=2}^{T} P(X_t \mid X_{t-1})$

The independence assumption simplifies the second term:

$P(Z_1, \dots, Z_T \mid X_1, \dots, X_T) = \prod_{t=1}^{T} P(Z_t \mid X_t)$

Namely, each $Z_t$ is conditionally independent of everything else, given $X_t$.

Joint likelihood

The joint log-likelihood is

$\ln P(X_1 = x_1, \dots, X_T = x_T, Z_1 = z_1, \dots, Z_T = z_T)$
$= \ln \left( P(X_1 = x_1) \prod_{t=2}^{T} P(X_t = x_t \mid X_{t-1} = x_{t-1}) \prod_{t=1}^{T} P(Z_t = z_t \mid X_t = x_t) \right)$
$= \ln P(X_1 = x_1) + \sum_{t=2}^{T} \ln P(X_t = x_t \mid X_{t-1} = x_{t-1}) + \sum_{t=1}^{T} \ln P(Z_t = z_t \mid X_t = x_t)$
$= \ln \pi_{x_1} + \sum_{t=2}^{T} \ln a_{x_{t-1}, x_t} + \sum_{t=1}^{T} \ln b_{x_t, z_t}$

Learning the model

If we observe $M$ state-outcome sequences $(x_{m,1}, z_{m,1}), \dots, (x_{m,T}, z_{m,T})$ for $m = 1, \dots, M$, the MLE is again very simple (verify yourself):

$\pi_s \propto \#\text{initial states with value } s$
$a_{s,s'} \propto \#\text{transitions from } s \text{ to } s'$
$b_{s,o} \propto \#\text{state-outcome pairs } (s, o)$

However, most often we do not observe the states! Think about the speech recognition example. This is called a Hidden Markov Model (HMM). Note that "hidden" refers to the states of the Markov chain, not to the parameters of the model.

How do we learn HMMs? Roadmap:

first discuss how to infer when the model is known (key: dynamic programming)
then discuss how to learn the model (key: EM)
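Before turning to inference, a minimal sketch (ours) that evaluates this joint log-likelihood for a fully observed state-outcome sequence pair; the parameter values happen to match the Studying/Video-games example used later in the lecture.

```python
import numpy as np

def joint_log_likelihood(x, z, pi, A, B):
    """ln pi_{x1} + sum_t ln a_{x_{t-1},x_t} + sum_t ln b_{x_t,z_t}."""
    ll = np.log(pi[x[0]]) + np.log(B[x[0], z[0]])
    for t in range(1, len(x)):
        ll += np.log(A[x[t - 1], x[t]]) + np.log(B[x[t], z[t]])
    return ll

pi = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2], [0.4, 0.6]])    # states: S = 0, V = 1
B = np.array([[0.5, 0.5], [0.8, 0.2]])    # outcomes: Grin = 0, Frown = 1
print(joint_log_likelihood([0, 0, 1], [0, 1, 0], pi, A, B))
```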

What can we infer about a known HMM?

In HMMs, we are often interested in the following problems:

The probability of observing a whole sequence,
$P(Z_{1:T} = z_{1:T}) = P(Z_1 = z_1, \dots, Z_T = z_T)$
e.g., the probability of observing Bob's activities walk, walk, shop, clean, walk, shop, shop for one week.

The state at some point in time, given an observation sequence,
$\gamma_s(t) = P(X_t = s \mid Z_{1:T} = z_{1:T})$
e.g., given Bob's activities for one week, what was the weather like on Wednesday?

The likelihood of two consecutive states at a given time,
$\xi_{s,s'}(t) = P(X_t = s, X_{t+1} = s' \mid Z_{1:T} = z_{1:T})$
e.g., given Bob's activities for one week, what was the weather like on Wednesday and Thursday?

The most likely hidden state path, given an observation sequence,
$\operatorname{argmax}_{x_{1:T}} P(X_{1:T} = x_{1:T} \mid Z_{1:T} = z_{1:T})$
e.g., given Bob's activities for one week, what is the most likely weather for that week?

Forward procedure Forward procedure Forward algorithm For all s [N], compute α s (1) = π s b s,z1. For t = 2,..., T for each s [N], compute α s (t) = b s,zt a s,s α s (t 1) s [N] It takes O(S 2 T ) time and O(ST ) space using dynamic programming. Oh, no, CSCI-570 again.. () April 2, 2019 29 / 48 () April 2, 2019 30 / 48 Computing backward messages Backward procedure Again establish a recursive formula β s (t) = P (Z t+1:t X t = s) = P (Z t+1:t, X t+1 = s X t = s) (marginalizing) s = P (Z t+1:t X t+1 = s, X t = s)p (X t+1 = s X t = s) (sl. 11) s = a s,s P (Z t+1:t X t+1 = s ) = a s,s P (Z t+1, Z t+2:t X t+1 = s ) s s = a s,s P (Z t+1 Z t+2:t, X t+1 = s )P (Z t+2:t X t+1 = s ) s Backward algorithm For all s [N], set β s (T ) = 1. For t = T 1,..., 1 for each s [N], compute β s (t) = a s,s b s,z t+1 β s (t + 1) s [N] = s a s,s b s,z t+1 β s (t + 1) Again it takes O(S 2 T ) time and O(ST ) space. Base case: β s (T ) = 1 (verify yourself) () April 2, 2019 31 / 48 () April 2, 2019 32 / 48

Using forward and backward messages With forward and backward messages, we can compute problems stated on slide 25. γ s (t) = P (X t = s Z 1:T ) s [N] P (X t = s, Z 1:T ) = P (X t = s, Z 1:t, Z t+1:t ) = P (X t = s, Z 1:t )P (Z t+1:t X t = s, Z 1:t ) = P (X t = s, Z 1:t )P (Z t+1:t X t = s) = α s (t)β s (t) What constant are we omitting in? It is exactly P (Z 1:T ) = P (Z 1:T, X t = s) = α s (T ) = α s (t)β s (t), s [N] s [N] the probability of observing the sequence z 1:T. This is true for any t; a good way to check correctness of your code. () April 2, 2019 33 / 48 Using forward and backward messages Slide 26: the conditional probability of transition s to s at time t ξ s,s (t) = P (X t = s, X t+1 = s Z 1:T ) P (X t = s, X t+1 = s, Z 1:T ) = P (X t = s, Z 1:t, X t+1 = s, Z t+1:t ) = P (X t = s, Z 1:t )P (X t+1 = s, Z t+1:t X t = s, Z 1:t ) = α s (t) P (X t+1 = s, Z t+1:t X t = s) = α s (t) P (Z t+1:t X t+1 = s, X t = s)p (X t+1 = s X t = s) = α s (t)p (Z t+1:t X t+1 = s ) a s,s = α s (t) a s,s P (Z t+1:t, Z t+2:t X t+1 = s ) = α s (t) a s,s P (Z t+2:t X t+1 = s, Z t+1:t )P (Z t+1 X t+1 = s ) = α s (t) a s,s P (Z t+2:t X t+1 = s )P (Z t+1 X t+1 = s ) = α s (t) a s,s b s,x t+1 β s (t + 1) The normalization constant is in fact again P (Z 1:T ) () April 2, 2019 34 / 48 Finding the most likely path Find the most likely hidden states path, given an observation sequence: This is called Viterbi decoding. argmax x 1:T P (X 1:T = x 1:T Z 1:T = z 1:T ) We can t use forward and backward messages directly, though it is very similar to the forward procedure, just replace sum to max. We will compute (think of Dijkstra algorithm) δ s (t) = max x 1:t 1 P (X t = s, X 1:t 1, Z 1:t ) s (t) = argmax x t 1 max P (X t = s, X 1:t 1 Z 1:t ) x 1:t 2 Computing δ s (t) Follow me (the goal is to get a recurrence) δ s (t) = max x 1:t 1 P (X t = s, X 1:t 1, Z 1:t ) = max x 1:t 1 P (X t = s, Z t, X 1:t 1, Z 1:t 1 ) = max s max P (X t = s, Z t X 1:t 1, Z 1:t 1 )P (X 1:t 1 = s, Z 1:t 1 ) x 1:t 2 = max s δ s (t 1)P (X t, Z t X 1:t 1, Z 1:t 1 ) = max s δ s (t 1)P (Z t, X t X 1:t 1 ) = max s δ s (t 1)P (Z t X t, X 1:t 1 )P (X t X 1:t 1 ) = max s δ s (t 1)P (Z t X t )P (X t X t 1 ) = b s,zt max s a s,sδ s (t 1) Base case: δ s (1) = P (X 1 = s, Z 1 = z 1 ) = π s b s,z1 () April 2, 2019 35 / 48 () April 2, 2019 36 / 48

Viterbi Algorithm (!) Viterbi Algorithm For each s [N], compute δ s (1) = π s b s,z1. For each t = 2,..., T, for each s [N], compute Example Arrows represent the argmax, i.e. s (t). δ s (t) = b s,zt max s a s,sδ s (t 1) s (t) = argmax s a s,sδ s (t 1) Backtracking: let z T = argmax s δ s (T ). For each t = T,..., 2: set z t 1 = z t (t). The most likely path is rainy, rainy, sunny, sunny. Output the most likely path z 1,..., z T. () April 2, 2019 37 / 48 () April 2, 2019 38 / 48 Example t = 1, the initial time Consider the HMM below. In this world, every time step (say every few minutes), you can either be Studying or playing Video games. You re also either Grinning or Frowning while doing the activity. δ s (1) = π s b s,z1. Compute δ S (1) and δ V (1). δ S (1) = p(z 1 = Grin x 1 = S)π(x 1 = S) = 0.5 0.5 = 0.25 Suppose that we believe that the initial state distribution is 50/50. We observe: Grin, Grin, Frown, Grin. What is the most likely path for this sequence of observations? δ V (1) = p(z 1 = Grin x 1 = V )π(x 1 = V ) = 0.8 0.5 = 0.4 () April 2, 2019 39 / 48 () April 2, 2019 40 / 48

t = 2 δ s (t) = b s,zt max s a s,sδ s (t 1). Compute δ S (2) and δ V (2). t = 3 δ s (t) = b s,zt max s a s,sδ s (t 1). Compute δ S (3) and δ V (3). δ S (2) = p(z 2 = Grin x 2 = S) max{p(x 2 = S x 1 = S)δ S (1), p(x 2 = S x 1 = V )δ V (1)} = 0.5 max{0.8 0.25, 0.4 0.4} = 0.01 δ V (2) = p(z 2 = Grin x 2 = V ) max{p(x 2 = V x 1 = S)δ S (1), p(x 2 = V x 1 = V )δ V (1)} 0.8 max{0.2 0.25, 0.6 0.4} = 0.192 () April 2, 2019 41 / 48 δ S (3) = p(z 3 = F rown x 3 = S) max{p(x 3 = S x 2 = S)δ S (2), p(x 3 = S x 2 = V )δ V (2)} = 0.5 max{0.8 0.1, 0.4 0.192} = 0.04 δ V (3) = p(z 3 = F rown x 3 = V ) max{p(x 3 = V x 2 = S)δ S (2), p(x 3 = V x 2 = V )δ V (2)} = 0.2 max{0.2 0.1, 0.6 0.192} = 0.023 () April 2, 2019 42 / 48 t = 4 Learning the parameters of an HMM Observation at t = 4 is Grin All previous inferences depend on knowing the parameters (π, A, B). How do we learn the parameters based on M observation sequences z m,1,..., z m,t for m = 1,..., M? δ S (4) = 0.016 δ V (4) = 0.01 Last step, since δ S (4) > δ V (4), so we choose MLE is intractable due to the hidden states (similar to GMM). Need to apply EM again! Known as the Baum Welch algorithm. z 4 = S Then z 3 = S, then z 2 = S and then z 1 = S, using the trace-back table. () April 2, 2019 43 / 48 () April 2, 2019 44 / 48

Applying EM: E-Step Recall in the E-Step we fix the parameters and find the posterior distributions q of the hidden states (for each sample), which leads to the complete log-likelihood: E x1:t q [ln(x 1:T = x 1:T, Z 1:T = z 1:T )] [ ] = E x1:t q ln π x1 + ln a xt 1,x t + ln b xt,z t = s T 1 γ s (1) ln π s + ξ s,s (t) ln a s,s + s,s This is quite similar to computations on slide 15. γ s (t) and ξ s,s (t) are defined on slides 9 and 10: γ s (t) = P (X t = s Z 1:T ) γ s (t) ln b s,zt ξ s,s (t) = P (X t = s, X t+1 = s Z 1:T ) s (slide 22) Applying EM: M-Step The maximizer of complete log-likelihood is simply doing weighted counting: where π s m a s,s m b s,o m γ (m) s (1) = E q [ #initial states with value s] 1 ξ (m) [ s,s (t) = E q #transitions from s to s ] t:z t =o γ s (m) (t) = E q [ #state-outcome pairs (s, o)] γ (m) s (t) = P (X m,t = s Z m,1:t = z m,1:t ) ξ (m) s,s (t) = P (X m,t = s, X m,t+1 = s Z m,1:t = z m,1:t ) () April 2, 2019 45 / 48 () April 2, 2019 46 / 48 Baum Welch algorithm Summary Step 0 Initialize the parameters (π, A, B) Step 1 (E-Step) Fixing the parameters, compute forward and backward messages for all sample sequences, then use these to compute γ s (m) (t) and ξ (m) s,s (t) for each m, t, s, s (see Slides 9 and 10). Step 2 (M-Step) Update parameters: π s m γ (m) s (1), a s,s m 1 ξ (m) s,s (t), b s,o m t:x t =o γ s (m) (t) Very important models: Markov chains, hidden Markov models Several algorithms: forward and backward procedures inferring HMMs based on forward and backward messages Viterbi algorithm Baum Welch (EM) algorithm Step 3 Return to Step 1 if not converged. () April 2, 2019 47 / 48 () April 2, 2019 48 / 48