Hidden Markov Models


10-601 Introduction to Machine Learning. Machine Learning Department, School of Computer Science, Carnegie Mellon University. Hidden Markov Models. Matt Gormley. Lecture 22, April 2, 2018. 1

Reminders. Homework 6: PAC Learning / Generative Models. Out: Wed, Mar 28; due: Wed, Apr 04 at 11:59pm. Homework 7: HMMs. Out: Wed, Apr 04; due: Mon, Apr 16 at 11:59pm. 2

DISCRIMINATIVE AND GENERATIVE CLASSIFIERS 3

Generative vs. Discriminative. Generative classifiers (example: Naïve Bayes) define a joint model of the observations x and the labels y, p(x, y); learning maximizes the (joint) likelihood; classification uses Bayes' rule on the posterior: p(y|x) = p(x|y) p(y) / p(x). Discriminative classifiers (example: Logistic Regression) directly model the conditional p(y|x); learning maximizes the conditional likelihood. 4
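To make the contrast concrete, here is a minimal sketch in Python using scikit-learn (the library, dataset, and variable names are illustrative additions, not part of the lecture): both classifiers are fit on the same data, but GaussianNB estimates p(y) and p(x|y) and classifies via Bayes' rule, while LogisticRegression fits p(y|x) directly.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy 2-class data: class 0 centered at (0, 0), class 1 at (2, 2)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

nb = GaussianNB().fit(X, y)          # generative: models p(x|y) and p(y)
lr = LogisticRegression().fit(X, y)  # discriminative: models p(y|x)

x_new = np.array([[1.0, 1.0]])
print(nb.predict_proba(x_new))  # posterior via p(x|y)p(y)/p(x)
print(lr.predict_proba(x_new))  # conditional modeled directly
```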

Generative vs. Discriminative (Whiteboard). Contrast: to model p(x) or not to model p(x)? 5

Generative vs. Discriminative: Finite Sample Analysis (Ng & Jordan, 2002). [Assume that we are learning from a finite training dataset.] If the model assumptions are correct: Naïve Bayes is a more efficient learner (requires fewer samples) than Logistic Regression. If the model assumptions are incorrect: Logistic Regression has lower asymptotic error, and does better than Naïve Bayes. 6

[Plot of learning curves; solid: NB, dashed: LR.] Slide courtesy of William Cohen. 7

[Plot of learning curves; solid: NB, dashed: LR.] Naïve Bayes makes stronger assumptions about the data but needs fewer examples to estimate the parameters. "On Discriminative vs. Generative Classifiers," Andrew Ng and Michael Jordan, NIPS 2001. Slide courtesy of William Cohen. 8

Generative vs. Discriminative: Learning (Parameter Estimation). Naïve Bayes: parameters are decoupled → closed-form solution for the MLE. Logistic Regression: parameters are coupled → no closed-form solution; iterative optimization techniques must be used instead. 9
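A rough sketch of this difference (binary features and labels assumed; the function names are illustrative): the Naïve Bayes MLE is just normalized counting, while logistic regression requires an iterative loop.

```python
import numpy as np

def nb_mle(X, y):
    """Bernoulli Naive Bayes MLE: closed form, just normalized counts."""
    prior = y.mean()                   # p(y = 1)
    theta1 = X[y == 1].mean(axis=0)    # p(x_j = 1 | y = 1)
    theta0 = X[y == 0].mean(axis=0)    # p(x_j = 1 | y = 0)
    return prior, theta0, theta1

def lr_mle(X, y, step=0.1, iters=1000):
    """Logistic regression MLE: no closed form; gradient ascent on the
    conditional log-likelihood."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))     # p(y = 1 | x, w)
        w += step * X.T @ (y - p) / len(y)   # log-likelihood gradient
    return w
```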

Naïve Bayes vs. Logistic Reg.: Learning (MAP Estimation of Parameters). Bernoulli Naïve Bayes: parameters are probabilities → a Beta prior (usually) pushes probabilities away from the zero/one extremes. Logistic Regression: parameters are not probabilities → a Gaussian prior encourages parameters to be close to zero (effectively pushing the probabilities away from the zero/one extremes). 10
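Concretely (a standard formulation added here for reference, not transcribed from the slides): with a Beta(α, β) prior on a Bernoulli parameter and α, β > 1, the MAP estimate

θ_MAP = (N_1 + α − 1) / (N + α + β − 2)

is pulled away from the 0/1 extremes, where N_1 counts successes out of N trials. With a Gaussian prior on the weights, the logistic regression MAP objective becomes L2-regularized maximum conditional likelihood:

w_MAP = argmax_w Σ_n log p(y^(n) | x^(n), w) − λ ‖w‖².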

Naïve Bayes vs. Logistic Reg.: Features. Naïve Bayes: the features x are assumed to be conditionally independent given y (i.e. the Naïve Bayes assumption). Logistic Regression: no assumptions are made about the form of the features x; they can be dependent and correlated in any fashion. 11

MOTIVATION: STRUCTURED PREDICTION 12

Structured Prediction. Most of the models we've seen so far were for classification: given observations x = (x_1, x_2, …, x_K), predict a (binary) label y. Many real-world problems require structured prediction: given observations x = (x_1, x_2, …, x_K), predict a structure y = (y_1, y_2, …, y_J). Some classification problems benefit from latent structure. 13

Structured Prediction Examples. Examples of structured prediction: part-of-speech (POS) tagging, handwriting recognition, speech recognition, word alignment, congressional voting. Example of latent structure: object recognition. 14

Dataset for Supervised Part-of-Speech (POS) Tagging. Data: D = {x^(n), y^(n)}_{n=1}^N.
Sample 1: x^(1) = time flies like an arrow; y^(1) = n v p d n
Sample 2: x^(2) = time flies like an arrow; y^(2) = n n v d n
Sample 3: x^(3) = flies fly with their wings; y^(3) = n v p n n
Sample 4: x^(4) = with time you will see; y^(4) = p n n v v
15

Dataset for Supervised Handwriting Recognition. Data: D = {x^(n), y^(n)}_{n=1}^N.
Sample 1: y^(1) = u n e x p e c t e d
Sample 2: y^(2) = v o l c a n i c
Sample 3: y^(3) = e m b r a c e s
[Each x^(n) is the corresponding sequence of handwritten character images.] Figures from (Chatzis & Demiris, 2013). 16

Dataset for Supervised Phoneme (Speech) Recognition. Data: D = {x^(n), y^(n)}_{n=1}^N.
Sample 1: y^(1) = h# dh ih s w uh z iy z iy
Sample 2: y^(2) = f ao r ah s s h#
[Each x^(n) is the corresponding speech signal.] Figures from (Jansen & Niyogi, 2013). 17

Application: Word Alignment / Phrase Extraction. Variables (boolean): for each (Chinese phrase, English phrase) pair, are they linked? Interactions: word fertilities; few jumps (discontinuities); syntactic reorderings; ITG constraint on alignment; phrases are disjoint (?). (Burkett & Klein, 2012) 18

Application: Congressional Voting. Variables: the text of all speeches of a representative; local contexts of references between two representatives. Interactions: the words used by a representative and their vote; pairs of representatives and their local context. (Stoyanov & Eisner, 2012) 19

Structured Prediction Examples. Examples of structured prediction: part-of-speech (POS) tagging, handwriting recognition, speech recognition, word alignment, congressional voting. Example of latent structure: object recognition. 20

Case Study: Object Recognition. Data consists of images x and labels y. [Four example images: y^(1) = pigeon, y^(2) = rhinoceros, y^(3) = leopard, y^(4) = llama.] 21

Case Study: Object Recognition. Data consists of images x and labels y. Preprocess data into patches. Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass). Define a graphical model with these latent variables in mind. z is not observed at train or test time. [Figure: leopard image.] 22

Case Study: Object Recognition. Data consists of images x and labels y. Preprocess data into patches. Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass). Define a graphical model with these latent variables in mind. z is not observed at train or test time. [Figure: leopard image with patches X_2, X_7 and latent labels Z_2, Z_7, all connected to the image label Y.] 23

Case Study: Object Recognition. Data consists of images x and labels y. Preprocess data into patches. Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass). Define a graphical model with these latent variables in mind. z is not observed at train or test time. [Figure: as above, with potentials ψ_1, ψ_2, ψ_3, ψ_4 on the edges among Z_2, Z_7, X_2, X_7, and Y.] 24

Structured Prediction. Preview of challenges to come: consider the task of finding the most probable assignment to the output.
Classification: ŷ = argmax_y p(y|x), where y ∈ {+1, −1}.
Structured prediction: ŷ = argmax_{y ∈ Y} p(y|x), where y ∈ Y and Y is very large (e.g. for tag sequences of length T over K tags, |Y| = K^T). 25

Machine Learning. The data inspires the structures we want to predict. Inference finds the {best structure, marginals, partition function} for a new observation. (Inference is usually called as a subroutine in learning.) Our model defines a score for each structure; it also tells us what to optimize. Learning tunes the parameters of the model. [Diagram: ML at the intersection of Domain Knowledge, Mathematical Modeling, Combinatorial Optimization, and Optimization.] 26

Machine Learning. [Diagram: Data ("time flies like an arrow", variables X_1, …, X_5), Model, Objective, Inference, Learning.] (Inference is usually called as a subroutine in learning.) 27

BACKGROUND 28

Background (Whiteboard): Chain Rule of Probability; Conditional Independence. 29

Background: Chain Rule of Probability. For random variables A and B: P(A, B) = P(A|B) P(B). For random variables X_1, X_2, X_3, X_4: P(X_1, X_2, X_3, X_4) = P(X_1 | X_2, X_3, X_4) P(X_2 | X_3, X_4) P(X_3 | X_4) P(X_4). 30

Background: Conditional Independence. Random variables A and B are conditionally independent given C if: P(A, B | C) = P(A|C) P(B|C) (1), or equivalently: P(A | B, C) = P(A|C) (2). We write this as: A ⊥ B | C (3). Later we will also write: I<A, {C}, B>. 31

HIDDEN MARKOV MODEL (HMM) 32

HMM Outline
- Motivation: Time Series Data
- Hidden Markov Model (HMM)
  - Example: Squirrel Hill Tunnel Closures [courtesy of Roni Rosenfeld]
  - Background: Markov Models
  - From Mixture Model to HMM
  - History of HMMs
  - Higher-order HMMs
- Training HMMs
  - (Supervised) Likelihood for HMM
  - Maximum Likelihood Estimation (MLE) for HMM
  - EM for HMM (aka. the Baum-Welch algorithm)
- Forward-Backward Algorithm
  - Three Inference Problems for HMM
  - Great Ideas in ML: Message Passing
  - Example: Forward-Backward on a 3-word Sentence
  - Derivation of the Forward Algorithm
  - Forward-Backward Algorithm
  - Viterbi Algorithm
33

Markov Models (Whiteboard): Example: Squirrel Hill Tunnel Closures [courtesy of Roni Rosenfeld]; the first-order Markov assumption; conditional independence assumptions. 34

Mixture Model for Time Series Data. We could treat each (tunnel state, travel time) pair as independent. This corresponds to a Naïve Bayes model with a single feature (travel time).

p(O, S, S, O, C, 2m, 3m, 18m, 9m, 27m) = (.8 * .2 * .1 * .03 * …)

Prior p(y):  O .8   S .1   C .1
Emission p(x|y):
      1min  2min  3min
O     .1    .2    .3
S     .01   .02   .03
C     0     0     0

[Figure: hidden states O S S O C over observed travel times 2m 3m 18m 9m 27m.] 36
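As a sanity check, a small sketch of how the independent-pairs joint multiplies out (values transcribed from the tables above; the emission table is truncated in the transcription, so only the first two factors can be read off):

```python
# O = open, S = slow, C = closed; travel times in minutes.
prior = {"O": 0.8, "S": 0.1, "C": 0.1}
emission = {  # p(travel time | tunnel state)
    "O": {1: 0.1, 2: 0.2, 3: 0.3},
    "S": {1: 0.01, 2: 0.02, 3: 0.03},
    "C": {1: 0.0, 2: 0.0, 3: 0.0},
}

# Under the mixture (Naive Bayes) model every (state, time) pair is
# independent, so the joint is a product of prior * emission factors.
p = 1.0
for state, minutes in [("O", 2), ("S", 3)]:
    p *= prior[state] * emission[state][minutes]
print(p)  # 0.8 * 0.2 * 0.1 * 0.03 = 0.00048
```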

Hidden Markov Model. A Hidden Markov Model (HMM) provides a joint distribution over the tunnel states / travel times with an assumption of dependence between adjacent tunnel states.

p(O, S, S, O, C, 2m, 3m, 18m, 9m, 27m) = (.8 * .08 * .2 * .7 * .03 * …)

Initial:  O .8   S .1   C .1
Transition p(y_t | y_{t-1}):
      O    S    C
O     .9   .08  .02
S     .2   .7   .1
C     .9   0    .1
Emission p(x_t | y_t):
      1min  2min  3min
O     .1    .2    .3
S     .01   .02   .03
C     0     0     0

[Figure: hidden states O S S O C over observed travel times 2m 3m 18m 9m 27m.] 37
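A matching sketch for the HMM joint, reusing the tables from the previous snippet plus the transition matrix (again, only the factors that fall inside the transcribed tables):

```python
# p(y_t | y_{t-1}), rows indexed by the previous state.
transition = {
    "O": {"O": 0.9, "S": 0.08, "C": 0.02},
    "S": {"O": 0.2, "S": 0.7, "C": 0.1},
    "C": {"O": 0.9, "S": 0.0, "C": 0.1},
}

# HMM joint = initial * emission at t=1, then transition * emission after:
p = 0.8 * 0.2                      # p(y1 = O) * p(x1 = 2m | O)
p *= transition["O"]["S"] * 0.03   # p(y2 = S | y1 = O) * p(x2 = 3m | S)
print(p)  # 0.8 * 0.2 * 0.08 * 0.03 = 0.000384
```

Note how this differs from the mixture model: the marginal p(y2 = S) = .1 is replaced by the transition probability p(y2 = S | y1 = O) = .08.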

From Mixture Model to HMM. Naïve Bayes: hidden states Y_1, …, Y_5 each independently emit observations X_1, …, X_5. HMM: the same emissions, plus edges Y_1 → Y_2 → Y_3 → Y_4 → Y_5 between adjacent hidden states. 38

From Mixture Model to HMM. Naïve Bayes: hidden states Y_1, …, Y_5 each independently emit observations X_1, …, X_5. HMM: the same emissions, with an additional initial state Y_0 prepended to the chain Y_0 → Y_1 → … → Y_5. 39

SUPERVISED LEARNING FOR HMMS 44

HMM Parameters: Hidden Markov Model.
Initial:  O .8   S .1   C .1
Transition p(y_t | y_{t-1}):
      O    S    C
O     .9   .08  .02
S     .2   .7   .1
C     .9   0    .1
Emission p(x_t | y_t):
      1min  2min  3min
O     .1    .2    .3
S     .01   .02   .03
C     0     0     0
[Graphical model: Y_1 → Y_2 → Y_3 → Y_4 → Y_5, each Y_t emitting X_t.] 46

HMM Parameters: Hidden Markov Model. Assumption: y_0 = START. Generative Story: for notational convenience, we fold the initial probabilities C into the transition matrix B by our assumption. [Graphical model: Y_0 → Y_1 → … → Y_5, each Y_t emitting X_t.] 47
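The generative story's equations did not survive transcription; a standard version, sketched under the slide's folding convention (B for transitions with a START row holding the initial probabilities; the emission matrix name A and the renormalized emission rows are assumptions of this sketch, since the transcribed table is truncated):

```python
import numpy as np

rng = np.random.default_rng(0)
states, times = ["O", "S", "C"], [1, 2, 3]

# B[j] = p(y_t | y_{t-1} = j); the START row folds in the initial probs.
B = {
    "START": [0.8, 0.1, 0.1],
    "O": [0.9, 0.08, 0.02],
    "S": [0.2, 0.7, 0.1],
    "C": [0.9, 0.0, 0.1],
}
# A[j] = p(x_t | y_t = j); rows renormalized so they sum to one
# (C's transcribed row is all zeros, so it gets a uniform placeholder).
A = {
    "O": np.array([0.1, 0.2, 0.3]) / 0.6,
    "S": np.array([0.01, 0.02, 0.03]) / 0.06,
    "C": np.array([1, 1, 1]) / 3,
}

def sample_hmm(T):
    """Generative story: y_0 = START; for t = 1..T, draw y_t ~ B[y_{t-1}],
    then draw x_t ~ A[y_t]."""
    y, x, prev = [], [], "START"
    for _ in range(T):
        prev = states[rng.choice(3, p=B[prev])]
        y.append(prev)
        x.append(times[rng.choice(3, p=A[prev])])
    return y, x

print(sample_hmm(5))
```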

Joint Distribution: y_0 = START. Hidden Markov Model. [Graphical model: Y_0 → Y_1 → … → Y_5, each Y_t emitting X_t.] 48
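The joint distribution itself is not transcribed; for an HMM with y_0 = START it is the standard factorization

p(x_1, …, x_T, y_1, …, y_T) = ∏_{t=1}^{T} p(y_t | y_{t-1}) p(x_t | y_t),

where the t = 1 transition factor p(y_1 | y_0) supplies the initial probabilities folded into B above.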

Training HMMs (Supervised) (Whiteboard): the likelihood for an HMM; Maximum Likelihood Estimation (MLE) for an HMM. 49
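The whiteboard derivation is not transcribed; the standard result is that with fully labeled sequences the supervised MLE is count-and-normalize. A minimal sketch (the data format and names are illustrative):

```python
from collections import Counter

def hmm_mle(sequences):
    """Supervised MLE for an HMM: transition and emission probabilities
    are normalized counts over the labeled sequences.
    sequences: a list of sequences, each a list of (state, obs) pairs."""
    trans, emit = Counter(), Counter()
    for seq in sequences:
        prev = "START"
        for state, obs in seq:
            trans[(prev, state)] += 1
            emit[(state, obs)] += 1
            prev = state

    def normalize(counts):
        totals = Counter()
        for (cond, _), c in counts.items():
            totals[cond] += c
        return {k: c / totals[k[0]] for k, c in counts.items()}

    return normalize(trans), normalize(emit)

# Toy usage on two tiny labeled tunnel sequences:
data = [[("O", "2m"), ("S", "3m")], [("O", "2m"), ("O", "1m")]]
B, A = hmm_mle(data)
print(B[("START", "O")])  # 1.0: both sequences start in state O
print(A[("O", "2m")])     # 2/3 of O's emissions are 2m
```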