Hidden Markov Models


10-601 Introduction to Machine Learning, Machine Learning Department, School of Computer Science, Carnegie Mellon University. Hidden Markov Models. Matt Gormley, Lecture 19, Nov. 5, 2018. 1

Reminders: Homework 6 (PAC Learning / Generative Models), out Wed, Oct 31, due Wed, Nov 7 at 11:59pm (1 week). Homework 7 (HMMs), out Wed, Nov 7, due Mon, Nov 19 at 11:59pm. Grades are up on Canvas. 2

Q&A Q: Why would we use Naïve Bayes? Isn't it too naïve? A: Naïve Bayes has one key advantage over methods like Perceptron, Logistic Regression, and Neural Nets: training is lightning fast! While other methods require slow iterative training procedures that might require hundreds of epochs, Naïve Bayes computes its parameters in closed form by counting. 3
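
As a concrete illustration of "computes its parameters in closed form by counting", here is a minimal sketch of a Bernoulli Naïve Bayes trainer; the function names, add-alpha smoothing, and binary-feature assumption are my own choices, not from the lecture:

```python
import numpy as np

def train_bernoulli_nb(X, y, alpha=1.0):
    """Closed-form Naive Bayes training: just count and normalize.

    X: (N, D) binary feature matrix, y: (N,) labels in {0, 1}.
    alpha is an add-alpha (Beta prior) smoothing pseudocount.
    """
    priors = np.array([(y == c).mean() for c in (0, 1)])          # p(y = c)
    # p(x_d = 1 | y = c): one pass over the data, no iterative optimization
    likelihoods = np.vstack([
        (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
        for c in (0, 1)
    ])
    return priors, likelihoods

def predict(priors, likelihoods, x):
    """Classify one binary feature vector via Bayes rule (in log space)."""
    log_post = np.log(priors) + (
        x * np.log(likelihoods) + (1 - x) * np.log(1 - likelihoods)
    ).sum(axis=1)
    return int(np.argmax(log_post))
```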

DISCRIMINATIVE AND GENERATIVE CLASSIFIERS 4

Generative vs. Discriminative Generative Classifiers (example: Naïve Bayes): Define a joint model of the observations x and the labels y: p(x, y). Learning maximizes the (joint) likelihood. Use Bayes' Rule to classify based on the posterior: p(y|x) = p(x|y) p(y) / p(x). Discriminative Classifiers (example: Logistic Regression): Directly model the conditional: p(y|x). Learning maximizes the conditional likelihood. 5
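
To make the contrast concrete, the two training objectives can be written side by side (standard notation; this spelling-out is mine, not taken from the slide):

```latex
% Generative (e.g., Naive Bayes): maximize the joint likelihood
\theta_{\text{gen}} = \arg\max_{\theta} \sum_{n=1}^{N} \log p_{\theta}\big(x^{(n)}, y^{(n)}\big)

% Discriminative (e.g., Logistic Regression): maximize the conditional likelihood
\theta_{\text{disc}} = \arg\max_{\theta} \sum_{n=1}^{N} \log p_{\theta}\big(y^{(n)} \mid x^{(n)}\big)
```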

Generative vs. Discriminative Whiteboard Contrast: To model p(x) or not to model p(x)? 6

Generative vs. Discriminative Finite Sample Analysis (Ng & Jordan, 2002) [Assume that we are learning from a finite training dataset] If the model assumptions are correct: Naïve Bayes is a more efficient learner (requires fewer samples) than Logistic Regression. If the model assumptions are incorrect: Logistic Regression has lower asymptotic error, and does better than Naïve Bayes. 7

[Figure: learning curves comparing the two classifiers; solid = Naïve Bayes, dashed = Logistic Regression.] Slide courtesy of William Cohen 8

[Figure: learning curves; solid = Naïve Bayes, dashed = Logistic Regression.] Naïve Bayes makes stronger assumptions about the data, but needs fewer examples to estimate the parameters. "On Discriminative vs. Generative Classifiers", Andrew Ng and Michael Jordan, NIPS 2001. Slide courtesy of William Cohen 9

Generative vs. Discriminative Learning (Parameter Estimation) Naïve Bayes: Parameters are decoupled → closed-form solution for MLE. Logistic Regression: Parameters are coupled → no closed-form solution; must use iterative optimization techniques instead. 10
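
A minimal sketch of the iterative side of this contrast, assuming binary logistic regression trained by gradient ascent on the conditional log-likelihood (the learning rate and epoch count are arbitrary placeholder choices, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, epochs=1000):
    """No closed form: repeatedly follow the gradient of the
    conditional log-likelihood sum_n log p(y_n | x_n; w)."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(epochs):                        # iterative, unlike NB's single counting pass
        p = sigmoid(X @ w)                         # p(y = 1 | x) for every example
        grad = X.T @ (y - p) / N                   # gradient of the average conditional log-likelihood
        w += lr * grad
    return w
```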

Naïve Bayes vs. Logistic Reg. Learning (MAP Estimation of Parameters) Bernoulli Naïve Bayes: Parameters are probabilities → a Beta prior (usually) pushes the probabilities away from the zero/one extremes. Logistic Regression: Parameters are not probabilities → a Gaussian prior encourages the parameters to be close to zero (effectively pushing the probabilities away from the zero/one extremes). 11
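
In symbols, a sketch of the two MAP estimates being contrasted, assuming a Beta(α, β) prior on each Bernoulli Naïve Bayes parameter and a zero-mean Gaussian prior (i.e. L2 regularization) on the logistic regression weights; these are the standard textbook forms, not copied from the slides:

```latex
% Bernoulli Naive Bayes: MAP estimate of p(x_d = 1 | y = c) under a Beta(alpha, beta) prior
\hat{\theta}_{d|c} = \frac{\#\{x_d = 1,\, y = c\} + \alpha - 1}{\#\{y = c\} + \alpha + \beta - 2}

% Logistic regression: MAP estimate under a Gaussian prior N(0, sigma^2 I)
\hat{\mathbf{w}} = \arg\max_{\mathbf{w}} \; \sum_{n=1}^{N} \log p\big(y^{(n)} \mid x^{(n)}, \mathbf{w}\big) - \frac{1}{2\sigma^2}\|\mathbf{w}\|_2^2
```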

Naïve Bayes vs. Logistic Reg. (Features) Naïve Bayes: Features x are assumed to be conditionally independent given y (i.e. the Naïve Bayes Assumption). Logistic Regression: No assumptions are made about the form of the features x. They can be dependent and correlated in any fashion. 12

MOTIVATION: STRUCTURED PREDICTION 13

Structured Prediction Most of the models we've seen so far were for classification: given observations x = (x1, x2, …, xK), predict a (binary) label y. Many real-world problems require structured prediction: given observations x = (x1, x2, …, xK), predict a structure y = (y1, y2, …, yJ). Some classification problems benefit from latent structure. 14

Structured Prediction Examples Examples of structured prediction: part-of-speech (POS) tagging, handwriting recognition, speech recognition, word alignment, congressional voting. Examples of latent structure: object recognition. 15

Dataset for Supervised Part-of-Speech (POS) Tagging Data: D = {(x(n), y(n))} for n = 1, …, N. Sample 1: y(1) = n v p d n, x(1) = time flies like an arrow. Sample 2: y(2) = n n v d n, x(2) = time flies like an arrow. Sample 3: y(3) = n v p n n, x(3) = flies fly with their wings. Sample 4: y(4) = p n n v v, x(4) = with time you will see. 16

Dataset for Supervised Handwriting Recognition Data: D = {(x(n), y(n))} for n = 1, …, N. Sample 1: y(1) = u n e x p e c t e d, x(1) = [handwritten word image]. Sample 2: y(2) = v o l c a n i c, x(2) = [handwritten word image]. Sample 3: y(3) = e m b r a c e s, x(3) = [handwritten word image]. Figures from (Chatzis & Demiris, 2013) 17

Dataset for Supervised Phoneme (Speech) Recognition Data: D = {(x(n), y(n))} for n = 1, …, N. Sample 1: y(1) = h# dh ih s w uh z iy z iy, x(1) = [speech signal]. Sample 2: y(2) = f ao r ah s s h#, x(2) = [speech signal]. Figures from (Jansen & Niyogi, 2013) 18

Application: Word Alignment / Phrase Extraction Variables (boolean): for each (Chinese phrase, English phrase) pair, are they linked? Interactions: word fertilities; few jumps (discontinuities); syntactic reorderings; ITG constraint on alignment; phrases are disjoint (?). (Burkett & Klein, 2012) 19

Application: Congressional Voting Variables: the text of all speeches of a representative; local contexts of references between two representatives. Interactions: the words used by a representative and their vote; pairs of representatives and their local context. (Stoyanov & Eisner, 2012) 20

Structured Prediction Examples Examples of structured prediction: part-of-speech (POS) tagging, handwriting recognition, speech recognition, word alignment, congressional voting. Examples of latent structure: object recognition. 21

Case Study: Object Recognition Data consists of images x and labels y. [Figure: four example images x(1)-x(4) with labels y(1) = pigeon, y(2) = rhinoceros, y(3) = leopard, y(4) = llama.] 22

Case Study: Object Recognition Data consists of images x and labels y. Preprocess the data into patches. Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass). Define a graphical model with these latent variables in mind. z is not observed at train or test time. [Figure: leopard image divided into patches.] 23

Case Study: Object Recognition Data consists of images x and labels y. Preprocess the data into patches. Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass). Define a graphical model with these latent variables in mind. z is not observed at train or test time. [Figure: leopard image with latent patch variables (e.g. Z2, Z7), their observed patches (X2, X7), and the label Y.] 24

Case Study: Object Recognition Data consists of images x and labels y. Preprocess the data into patches. Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass). Define a graphical model with these latent variables in mind. z is not observed at train or test time. [Figure: the same example with factors ψ1, ψ2, ψ3, ψ4 connecting the latent patch variables Z, the observed patches X, and the label Y.] 25

Structured Prediction Preview of challenges to come: consider the task of finding the most probable assignment to the output. Classification: ŷ = argmax_y p(y|x), where y ∈ {+1, -1}. Structured Prediction: ŷ = argmax_{y ∈ Y} p(y|x), where y ∈ Y and Y is very large. 26
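
For a sense of how large Y gets, an illustrative calculation (mine, not from the slide): tagging a 10-word sentence with the 45 Penn Treebank POS tags gives

```latex
|\mathcal{Y}| = 45^{10} \approx 3.4 \times 10^{16} \text{ possible tag sequences.}
```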

Machine Learning The data inspires the structures we want to predict. Inference finds {best structure, marginals, partition function} for a new observation. (Inference is usually called as a subroutine in learning.) Our model defines a score for each structure; it also tells us what to optimize. Learning tunes the parameters of the model. [Figure: ML sits at the intersection of Domain Knowledge, Mathematical Modeling, Combinatorial Optimization, and Optimization.] 27

Machine Learning [Figure: the pieces of a structured ML system: Data (e.g. "time flies like an arrow" with variables X1, …, X5), Model, Objective, Inference, and Learning. Inference is usually called as a subroutine in learning.] 28

BACKGROUND 29

Background: Chain Rule of Probability For random variables A and B: P(A, B) = P(A|B) P(B). For random variables X1, X2, X3, X4: P(X1, X2, X3, X4) = P(X1|X2, X3, X4) P(X2|X3, X4) P(X3|X4) P(X4). 31

Background: Conditional Independence Random variables A and B are conditionally independent given C if: P(A, B|C) = P(A|C) P(B|C) (1), or equivalently: P(A|B, C) = P(A|C) (2). We write this as: A ⊥ B | C (3). Later we will also write: I<A, {C}, B>. 32

HIDDEN MARKOV MODEL (HMM) 33

HMM Outline Motivation: Time Series Data. Hidden Markov Model (HMM): Example: Squirrel Hill Tunnel Closures [courtesy of Roni Rosenfeld]; Background: Markov Models; From Mixture Model to HMM; History of HMMs; Higher-order HMMs. Training HMMs (Supervised): Likelihood for HMM; Maximum Likelihood Estimation (MLE) for HMM; EM for HMM (aka. Baum-Welch algorithm). Forward-Backward Algorithm: Three Inference Problems for HMM; Great Ideas in ML: Message Passing; Example: Forward-Backward on 3-word Sentence; Derivation of Forward Algorithm; Forward-Backward Algorithm; Viterbi algorithm. 34

Whiteboard Markov Models Example: Squirrel Hill Tunnel Closures [courtesy of Roni Rosenfeld] First-order Markov assumption Conditional independence assumptions 35


Mixture Model for Time Series Data We could treat each (tunnel state, travel time) pair as independent. This corresponds to a Naïve Bayes model with a single feature (travel time). p(O, S, S, O, C, 2m, 3m, 18m, 9m, 27m) = .8 * .2 * .1 * .03 * … Prior p(Y): O .8, S .1, C .1. Emission p(X|Y) (rows = state; columns = 1min, 2min, 3min): O: .1 .2 .3; S: .01 .02 .03; C: 0 0 0. Observed travel times: 2m, 3m, 18m, 9m, 27m. 38
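
The product above is the fully factored joint over the independent (tunnel state, travel time) pairs, i.e. the standard mixture-model / Naïve Bayes form, written out here for clarity:

```latex
p(y_1, \dots, y_5,\, x_1, \dots, x_5) = \prod_{t=1}^{5} p(y_t)\, p(x_t \mid y_t)
```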

Hidden Markov Model A Hidden Markov Model (HMM) provides a joint distribution over the tunnel states / travel times with an assumption of dependence between adjacent tunnel states. p(O, S, S, O, C, 2m, 3m, 18m, 9m, 27m) = .8 * .08 * .2 * .7 * .03 * … Initial p(Y1): O .8, S .1, C .1. Transition p(Yt | Yt-1) (rows = previous state; columns = O, S, C): O: .9 .08 .02; S: .2 .7 .1; C: .9 0 .1. Emission p(Xt | Yt) (columns = 1min, 2min, 3min): O: .1 .2 .3; S: .01 .02 .03; C: 0 0 0. 39

From Mixture Model to HMM [Figure: two graphical models over Y1, …, Y5 and X1, …, X5. Naïve Bayes: each Yt generates its own Xt, with the (Yt, Xt) pairs independent of one another. HMM: the Yt are linked in a chain Y1 → Y2 → … → Y5, and each Yt generates Xt.] 40

From Mixture Model to HMM [Figure: the same two models, but the HMM now includes an initial state Y0 before Y1.] 41

SUPERVISED LEARNING FOR HMMS 46

HMM Parameters: Hidden Markov Model [Figure: graphical model Y1 → … → Y5 with emissions X1, …, X5, annotated with the parameter tables.] Initial p(Y1): O .8, S .1, C .1. Transition p(Yt | Yt-1) (rows = previous state; columns = O, S, C): O: .9 .08 .02; S: .2 .7 .1; C: .9 0 .1. Emission p(Xt | Yt) (columns = 1min, 2min, 3min): O: .1 .2 .3; S: .01 .02 .03; C: 0 0 0. 48

Whiteboard Training HMMs (Supervised) Likelihood for an HMM Maximum Likelihood Estimation (MLE) for HMM 49

Supervised Learning for HMMs Learning an HMM decomposes into solving two (independent) Mixture Models: one over adjacent state pairs (Yt, Yt+1) and one over state-observation pairs (Yt, Xt). 50

HMM Parameters: Hidden Markov Model Assumption: y0 = START. Generative Story: for notational convenience, we fold the initial probabilities C into the transition matrix B by our assumption. [Figure: graphical model with Y0 = START, Y1, …, Y5 and emissions X1, …, X5.] 51
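
A minimal sketch of what that generative story looks like in code, assuming K hidden states, an emission matrix A, and a transition matrix B whose extra row is the folded-in START distribution C (the array layout and function name are my own choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hmm(B, A, T):
    """Generative story for an HMM of length T.

    B: (K+1, K) transition matrix whose last row is the START -> state
       distribution (the folded-in initial probabilities C).
    A: (K, M) emission matrix, p(x_t = m | y_t = k).
    Returns (states, observations) as index sequences.
    """
    K = A.shape[0]
    START = K                                    # index of the folded-in START row of B
    y, x = [], []
    prev = START
    for _ in range(T):
        y_t = rng.choice(K, p=B[prev])           # draw y_t ~ p(. | y_{t-1})
        x_t = rng.choice(A.shape[1], p=A[y_t])   # draw x_t ~ p(. | y_t)
        y.append(y_t)
        x.append(x_t)
        prev = y_t
    return y, x
```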

Hidden Markov Model Joint Distribution (with y0 = START): [Figure: graphical model with Y0, Y1, …, Y5 and emissions X1, …, X5.] 52
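
The factorization this figure encodes is the standard linear-chain HMM joint, using the y0 = START convention from the previous slide:

```latex
p(x_1, \dots, x_T,\, y_1, \dots, y_T) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t), \qquad y_0 = \text{START}
```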

Supervised Learning for HMMs Learning an HMM decomposes into solving two (independent) Mixture Models: one over adjacent state pairs (Yt, Yt+1) and one over state-observation pairs (Yt, Xt). 53
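
A minimal sketch of that decomposition, assuming fully labeled sequences and MLE by counting; the add-alpha smoothing, array shapes, and function name are my own choices:

```python
import numpy as np

def train_hmm_supervised(sequences, K, M, alpha=1.0):
    """Supervised MLE for an HMM: two independent counting problems.

    sequences: list of (states, observations) index-sequence pairs.
    K: number of hidden states, M: number of observation symbols.
    Returns (B, A): the transition matrix with a START row folded in,
    and the emission matrix.
    """
    START = K
    trans_counts = np.full((K + 1, K), alpha)    # counts of y_{t-1} -> y_t
    emit_counts = np.full((K, M), alpha)         # counts of y_t -> x_t
    for states, obs in sequences:
        prev = START
        for y_t, x_t in zip(states, obs):
            trans_counts[prev, y_t] += 1         # "mixture model" over (Y_{t-1}, Y_t)
            emit_counts[y_t, x_t] += 1           # "mixture model" over (Y_t, X_t)
            prev = y_t
    B = trans_counts / trans_counts.sum(axis=1, keepdims=True)
    A = emit_counts / emit_counts.sum(axis=1, keepdims=True)
    return B, A
```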

HMMs: History Markov chains: Andrey Markov (1906), random walks and Brownian motion. Used in Shannon's work on information theory (1948). Baum-Welch learning algorithm: late '60s, early '70s. Used mainly for speech in the '60s-'70s. Late '80s and '90s: David Haussler (a major player in learning theory in the '80s) began to use HMMs for modeling biological sequences. Mid-late 1990s: Dayne Freitag / Andrew McCallum. Freitag thesis with Tom Mitchell on IE from the Web using logic programs, grammar induction, etc. McCallum: multinomial Naïve Bayes for text. With McCallum, IE using HMMs on CORA. Slide from William Cohen 55

Higher-order HMMs 1st-order HMM (i.e. bigram HMM): each Yt depends on Yt-1. 2nd-order HMM (i.e. trigram HMM): each Yt depends on Yt-1 and Yt-2. 3rd-order HMM: each Yt depends on the previous three states. [Figures: chains over <START>, Y1, …, Y5 with emissions X1, …, X5 and increasingly long-range transition arcs.] 56
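
One practical cost of going higher-order (a standard counting argument, not from the slide): with K hidden states, an n-th order HMM's transition distribution conditions on the previous n states, so the number of transition parameters grows as

```latex
\underbrace{K^{2}}_{\text{1st-order}} \;\longrightarrow\; \underbrace{K^{3}}_{\text{2nd-order}} \;\longrightarrow\; \underbrace{K^{4}}_{\text{3rd-order}}
```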