10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Hidden Markov Models Matt Gormley Lecture 22 April 2, 2018 1
Reminders Homework 6: PAC Learning / Generative Models Out: Wed, Mar 28 Due: Wed, Apr 04 at 11:59pm Homework 7: HMMs Out: Wed, Apr 04 Due: Mon, Apr 16 at 11:59pm 2
DISCRIMINATIVE AND GENERATIVE CLASSIFIERS 3
Generative vs. Discriminative Generative Classifiers: Example: Naïve Bayes Define a joint model of the observations x and the labels y: p(x, y) Learning maximizes (joint) likelihood Use Bayes' Rule to classify based on the posterior: p(y|x) = p(x|y) p(y) / p(x) Discriminative Classifiers: Example: Logistic Regression Directly model the conditional: p(y|x) Learning maximizes conditional likelihood 4
Generative vs. Discriminative Whiteboard Contrast: To model p(x) or not to model p(x)? 5
Generative vs. Discriminative Finite Sample Analysis (Ng & Jordan, 2002) [Assume that we are learning from a finite training dataset] If model assumptions are correct: Naïve Bayes is a more efficient learner (requires fewer samples) than Logistic Regression If model assumptions are incorrect: Logistic Regression has lower asymptotic error, and does better than Naïve Bayes 6
solid: NB dashed: LR Slide courtesy of William Cohen 7
solid: NB dashed: LR Naïve Bayes makes stronger assumptions about the data but needs fewer examples to estimate the parameters. On Discriminative vs. Generative Classifiers, Andrew Ng and Michael Jordan, NIPS 2001. Slide courtesy of William Cohen 8
Generative vs. Discriminative Learning (Parameter Estimation) Naïve Bayes: Parameters are decoupled → Closed form solution for MLE Logistic Regression: Parameters are coupled → No closed form solution; must use iterative optimization techniques instead 9
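The contrast above can be sketched in code. This is a minimal toy setup (the data and all variable names are illustrative assumptions, not from the slides): the Naïve Bayes MLE is pure counting, while logistic regression must iterate gradient steps because its parameters are coupled through the loss.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 3))   # binary features (toy data)
y = (X[:, 0] | X[:, 1]).astype(int)     # toy labels: logical OR of two features

# Naive Bayes: parameters decouple, so the MLE is closed form (counting).
prior = y.mean()                        # p(y = 1)
theta1 = X[y == 1].mean(axis=0)         # p(x_j = 1 | y = 1), one count per feature
theta0 = X[y == 0].mean(axis=0)         # p(x_j = 1 | y = 0)

# Logistic regression: coupled parameters, so we iterate gradient descent
# on the negative conditional log-likelihood.
w, b = np.zeros(3), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))  # predicted p(y = 1 | x)
    w -= 0.1 * (X.T @ (p - y)) / len(y) # gradient step on weights
    b -= 0.1 * (p - y).mean()           # gradient step on bias
```

The Naïve Bayes estimates come from three lines of counting; the logistic regression weights only approach their optimum as the loop runs.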
Naïve Bayes vs. Logistic Reg. Learning (MAP Estimation of Parameters) Bernoulli Naïve Bayes: Parameters are probabilities → Beta prior (usually) pushes probabilities away from zero / one extremes Logistic Regression: Parameters are not probabilities → Gaussian prior encourages parameters to be close to zero (effectively pushes the probabilities away from zero / one extremes) 10
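A toy numeric illustration of the Beta-prior point (the counts are made up): with a Beta(2, 2) prior, the MAP estimate of a Bernoulli parameter adds one pseudo-count of each outcome, so even extreme data cannot push the estimate all the way to 0 or 1.

```python
# Toy counts chosen so the MLE hits the zero extreme.
heads, tails = 0, 5

# MLE: relative frequency, which here gives probability exactly 0.
mle = heads / (heads + tails)

# MAP with a Beta(alpha, beta) prior:
#   (heads + alpha - 1) / (heads + tails + alpha + beta - 2)
# Beta(2, 2) adds one pseudo-count per outcome, pulling away from 0/1.
alpha = beta = 2.0
map_est = (heads + alpha - 1) / (heads + tails + alpha + beta - 2)
```

Here the MLE is 0, while the MAP estimate is 1/7: the prior keeps the probability strictly inside (0, 1).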
Features Naïve Bayes vs. Logistic Reg. Naïve Bayes: Features x are assumed to be conditionally independent given y. (i.e. Naïve Bayes Assumption) Logistic Regression: No assumptions are made about the form of the features x. They can be dependent and correlated in any fashion. 11
MOTIVATION: STRUCTURED PREDICTION 12
Structured Prediction Most of the models we've seen so far were for classification Given observations: x = (x_1, x_2, …, x_K) Predict a (binary) label: y Many real-world problems require structured prediction Given observations: x = (x_1, x_2, …, x_K) Predict a structure: y = (y_1, y_2, …, y_J) Some classification problems benefit from latent structure 13
Structured Prediction Examples Examples of structured prediction Part-of-speech (POS) tagging Handwriting recognition Speech recognition Word alignment Congressional voting Examples of latent structure Object recognition 14
Dataset for Supervised Part-of-Speech (POS) Tagging Data: D = {(x^(n), y^(n))}_{n=1}^N
Sample 1: x^(1) = time flies like an arrow, y^(1) = n v p d n
Sample 2: x^(2) = time flies like an arrow, y^(2) = n n v d n
Sample 3: x^(3) = flies fly with their wings, y^(3) = n v p n n
Sample 4: x^(4) = with time you will see, y^(4) = p n n v v
15
Dataset for Supervised Handwriting Recognition Data: D = {(x^(n), y^(n))}_{n=1}^N Sample 1: u n e x p e c t e d y^(1) x^(1) Sample 2: v o l c a n i c y^(2) x^(2) Sample 3: e m b r a c e s y^(3) x^(3) Figures from (Chatzis & Demiris, 2013) 16
Dataset for Supervised Phoneme (Speech) Recognition Data: D = {x (n), y (n) } N n=1 Sample 1: h# dh ih s w uh z iy z iy y (1) x (1) Sample 2: f ao r ah s s h# y (2) x (2) Figures from (Jansen & Niyogi, 2013) 17
Application: Word Alignment / Phrase Extraction Variables (boolean): For each (Chinese phrase, English phrase) pair, are they linked? Interactions: Word fertilities Few jumps (discontinuities) Syntactic reorderings ITG constraint on alignment Phrases are disjoint (?) (Burkett & Klein, 2012) 18
Application: Congressional Voting Variables: Text of all speeches of a representative Local contexts of references between two representatives Interactions: Words used by representative and their vote Pairs of representatives and their local context (Stoyanov & Eisner, 2012) 19
Structured Prediction Examples Examples of structured prediction Part-of-speech (POS) tagging Handwriting recognition Speech recognition Word alignment Congressional voting Examples of latent structure Object recognition 20
Case Study: Object Recognition Data consists of images x and labels y. x (1) x (2) pigeon y (1) rhinoceros y (2) x (3) x (4) leopard y (3) llama y (4) 21
Case Study: Object Recognition Data consists of images x and labels y. Preprocess data into patches Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass) Define graphical model with these latent variables in mind z is not observed at train or test time [figure: leopard image] 22
Case Study: Object Recognition Data consists of images x and labels y. Preprocess data into patches Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass) Define graphical model with these latent variables in mind z is not observed at train or test time [figure: leopard image with latent part labels Z_2, Z_7 over patches X_2, X_7 and image label Y] 23
Case Study: Object Recognition Data consists of images x and labels y. Preprocess data into patches Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass) Define graphical model with these latent variables in mind z is not observed at train or test time [figure: factor graph with factors ψ_1 … ψ_4 connecting latent part labels Z_2, Z_7, patches X_2, X_7, and image label Y = leopard] 24
Structured Prediction Preview of challenges to come Consider the task of finding the most probable assignment to the output Classification: ŷ = argmax_y p(y|x) where y ∈ {+1, −1} Structured Prediction: ŷ = argmax_y p(y|x) where y ∈ Y and |Y| is very large 25
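A brute-force sketch of why the structured argmax is hard (the score function and tag set below are illustrative assumptions): with J output positions and K labels each, |Y| = K^J, so naive enumeration blows up exponentially.

```python
import itertools

def brute_force_argmax(score, J, labels):
    """Enumerate every y in Y = labels^J and keep the highest-scoring one."""
    return max(itertools.product(labels, repeat=J), key=score)

# Toy score: reward 'v' at position 0 and 'n' everywhere else.
def score(y):
    return sum(1.0 if (t == 0 and tag == "v") or (t > 0 and tag == "n") else 0.0
               for t, tag in enumerate(y))

best = brute_force_argmax(score, J=4, labels=("n", "v", "d"))
# 3**4 = 81 candidates here, but 3**20 is about 3.5 billion for a
# 20-word sentence -- enumeration does not scale.
```

This is exactly the gap the HMM inference algorithms later in the lecture (Viterbi, forward-backward) close: they exploit the chain structure to avoid enumerating Y.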
Machine Learning The data inspires the structures we want to predict Inference finds {best structure, marginals, partition function} for a new observation (Inference is usually called as a subroutine in learning) Domain Knowledge Combinatorial Optimization ML Mathematical Modeling Optimization Our model defines a score for each structure It also tells us what to optimize Learning tunes the parameters of the model 26
Machine Learning Data Model X 1 time flies like an arrow X2 X 3 X 4 X 5 Objective Inference Learning (Inference is usually called as a subroutine in learning) 27
BACKGROUND 28
Whiteboard Background Chain Rule of Probability Conditional Independence 29
Background: Chain Rule of Probability For random variables A and B: P(A, B) = P(A|B) P(B) For random variables X_1, X_2, X_3, X_4: P(X_1, X_2, X_3, X_4) = P(X_1 | X_2, X_3, X_4) P(X_2 | X_3, X_4) P(X_3 | X_4) P(X_4) 30
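The chain rule can be verified numerically on a made-up joint distribution over two binary variables (the probability values are arbitrary):

```python
# Toy joint distribution over binary A and B (arbitrary numbers summing to 1).
p_joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

# Marginalize out A to get P(B).
p_b = {b: sum(p for (a_, b_), p in p_joint.items() if b_ == b) for b in (0, 1)}

# Condition to get P(A | B).
p_a_given_b = {(a, b): p_joint[(a, b)] / p_b[b] for a in (0, 1) for b in (0, 1)}

# Chain rule: P(A, B) = P(A | B) P(B) holds for every assignment.
for (a, b), p in p_joint.items():
    assert abs(p - p_a_given_b[(a, b)] * p_b[b]) < 1e-12
```

The same decomposition applied repeatedly gives the four-variable expansion on the slide.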
Background: Conditional Independence Random variables A and B are conditionally independent given C if: P(A, B | C) = P(A|C) P(B|C) (1) or equivalently: P(A | B, C) = P(A | C) (2) We write this as: A ⊥ B | C (3) Later we will also write: I<A, {C}, B> 31
HIDDEN MARKOV MODEL (HMM) 32
HMM Outline Motivation Time Series Data Hidden Markov Model (HMM) Example: Squirrel Hill Tunnel Closures [courtesy of Roni Rosenfeld] Background: Markov Models From Mixture Model to HMM History of HMMs Higher-order HMMs Training HMMs (Supervised) Likelihood for HMM Maximum Likelihood Estimation (MLE) for HMM EM for HMM (aka. Baum-Welch algorithm) Forward-Backward Algorithm Three Inference Problems for HMM Great Ideas in ML: Message Passing Example: Forward-Backward on 3-word Sentence Derivation of Forward Algorithm Forward-Backward Algorithm Viterbi algorithm 33
Whiteboard Markov Models Example: Squirrel Hill Tunnel Closures [courtesy of Roni Rosenfeld] First-order Markov assumption Conditional independence assumptions 34
Mixture Model for Time Series Data We could treat each (tunnel state, travel time) pair as independent. This corresponds to a Naïve Bayes model with a single feature (travel time). p(O, S, S, O, C, 2m, 3m, 18m, 9m, 27m) = (.8 * .2 * .1 * .03 * …)
Prior p(y):
  O .8   S .1   C .1
Emission p(x|y):
       1min  2min  3min
  O     .1    .2    .3
  S     .01   .02   .03
  C     0     0     0
36
Hidden Markov Model A Hidden Markov Model (HMM) provides a joint distribution over the tunnel states / travel times with an assumption of dependence between adjacent tunnel states. p(O, S, S, O, C, 2m, 3m, 18m, 9m, 27m) = (.8 * .08 * .2 * .7 * .03 * …)
Initial p(y_1):
  O .8   S .1   C .1
Transition p(y_t | y_{t-1}) (row = previous state):
       O     S     C
  O    .9    .08   .02
  S    .2    .7    .1
  C    .9    0     .1
Emission p(x_t | y_t):
       1min  2min  3min
  O     .1    .2    .3
  S     .01   .02   .03
  C     0     0     0
37
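A sketch of the HMM joint computation using the slide's O/S/C tables where legible. Note the slide's emission table only covers 1–3 minute travel times, so the observation sequence below is chosen from that range rather than the slide's longer times; the function itself is the standard HMM factorization.

```python
# Initial, transition, and emission tables from the tunnel example
# (values copied from the slide where legible).
pi = {"O": 0.8, "S": 0.1, "C": 0.1}
A = {"O": {"O": 0.9, "S": 0.08, "C": 0.02},
     "S": {"O": 0.2, "S": 0.7,  "C": 0.1},
     "C": {"O": 0.9, "S": 0.0,  "C": 0.1}}
B = {"O": {"1min": 0.1,  "2min": 0.2,  "3min": 0.3},
     "S": {"1min": 0.01, "2min": 0.02, "3min": 0.03},
     "C": {"1min": 0.0,  "2min": 0.0,  "3min": 0.0}}

def hmm_joint(y, x):
    """p(y, x) = pi(y_1) b(x_1|y_1) * prod_t a(y_t|y_{t-1}) b(x_t|y_t)."""
    p = pi[y[0]] * B[y[0]][x[0]]
    for t in range(1, len(y)):
        p *= A[y[t - 1]][y[t]] * B[y[t]][x[t]]
    return p
```

For example, hmm_joint(["O", "S"], ["2min", "3min"]) multiplies .8 * .2 * .08 * .03, mirroring the product on the slide.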
From Mixture Model to HMM Naïve Bayes / mixture: [figure: independent pairs, each Y_t generating X_t, t = 1…5] HMM: [figure: chain Y_1 → Y_2 → … → Y_5, each Y_t generating X_t] 38
From Mixture Model to HMM Naïve Bayes / mixture: [figure: independent pairs, each Y_t generating X_t, t = 1…5] HMM: [figure: chain Y_0 → Y_1 → … → Y_5, each Y_t (t ≥ 1) generating X_t] 39
SUPERVISED LEARNING FOR HMMS 44
HMM Parameters: Hidden Markov Model
Initial probabilities:
  O .8   S .1   C .1
Transition probabilities (row = previous state):
       O     S     C
  O    .9    .08   .02
  S    .2    .7    .1
  C    .9    0     .1
Emission probabilities:
       1min  2min  3min
  O     .1    .2    .3
  S     .01   .02   .03
  C     0     0     0
[figure: chain Y_1 → … → Y_5 with emissions X_1 … X_5]
46
HMM Parameters: Hidden Markov Model Assumption: y_0 = START Generative Story: For notational convenience, we fold the initial probabilities C into the transition matrix B by this assumption. [figure: chain Y_0 → Y_1 → … → Y_5 with emissions X_1 … X_5] 47
Hidden Markov Model Joint Distribution (y_0 = START): p(x, y) = ∏_{t=1}^{T} p(y_t | y_{t-1}) p(x_t | y_t) [figure: chain Y_0 → Y_1 → … → Y_5 with emissions X_1 … X_5] 48
Whiteboard Training HMMs (Supervised) Likelihood for an HMM Maximum Likelihood Estimation (MLE) for HMM 49
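Previewing the whiteboard content: supervised MLE for an HMM reduces to normalized counts over the labeled sequences. The toy tagged data below is illustrative (loosely echoing the POS example), with y_0 = START handled as on the earlier slide.

```python
from collections import Counter

# Toy labeled sequences: (tag sequence, word sequence) pairs.
data = [(["n", "v", "d", "n"], ["time", "flies", "an", "arrow"]),
        (["n", "n", "v"],      ["fruit", "flies", "buzz"])]

# Count transitions (including from START) and emissions.
trans, emit = Counter(), Counter()
for tags, words in data:
    prev = "START"
    for tag, word in zip(tags, words):
        trans[(prev, tag)] += 1
        emit[(tag, word)] += 1
        prev = tag

def p_trans(t, prev):
    """MLE of p(y_t = t | y_{t-1} = prev): count / row total."""
    total = sum(c for (p, _), c in trans.items() if p == prev)
    return trans[(prev, t)] / total

def p_emit(w, t):
    """MLE of p(x_t = w | y_t = t): count / row total."""
    total = sum(c for (tag, _), c in emit.items() if tag == t)
    return emit[(t, w)] / total
```

Because the parameters are decoupled (as with Naïve Bayes earlier), each conditional distribution is estimated independently by counting, with no iterative optimization needed when the tags are observed.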