Machine Learning & Data Mining CS/CNS/EE 155. Lecture 11: Hidden Markov Models


Machine Learning & Data Mining CS/CNS/EE 155 Lecture 11: Hidden Markov Models 1

Kaggle Competition Part 1 2

Kaggle Competition Part 2 3

Announcements: Updated Kaggle report due date: 9pm on Monday, Feb 13th. Next recitation: Tuesday, Feb 14th, 7pm-8pm, recap of concepts covered today. 4

Sequence Prediction (POS Tagging). x = "Fish Sleep", y = (N, V). x = "The Dog Ate My Homework", y = (D, N, V, D, N). x = "The Fox Jumped Over The Fence", y = (D, N, V, P, D, N). 5

Challenges: Multivariable output (make multiple predictions simultaneously); variable-length input/output (sentence lengths are not fixed). 6

Multivariate Outputs. x = "Fish Sleep", y = (N, V). Multiclass prediction: POS Tags: Det, Noun, Verb, Adj, Adv, Prep. How many classes? Replicate weights: $w = (w_1, w_2, \ldots, w_K)$, $b = (b_1, b_2, \ldots, b_K)$. Score all classes: $f(x \mid w, b) = (w_1^T x - b_1,\; w_2^T x - b_2,\; \ldots,\; w_K^T x - b_K)$. Predict via largest score: $\arg\max_k\, (w_k^T x - b_k)$. 7
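
As a small illustration of this scoring rule, here is a NumPy sketch (not code from the lecture) that scores K classes and predicts via the largest score; the dimensions, weights, offsets, and feature vector are made-up values for illustration only.

```python
import numpy as np

# Minimal sketch of the "replicated weights" multiclass predictor:
# score every class with its own (w_k, b_k) and take the argmax.
K, d = 6, 4                      # 6 POS tags, 4-dimensional features (assumed)
rng = np.random.default_rng(0)
W = rng.normal(size=(K, d))      # one weight vector per class
b = rng.normal(size=K)           # one offset per class
x = rng.normal(size=d)           # feature vector for a single word

scores = W @ x - b                         # score all K classes at once
predicted_class = int(np.argmax(scores))   # predict via largest score
print(scores, predicted_class)
```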

Multiclass Prediction. x = "Fish Sleep", y = (N, V). Multiclass prediction: treat every possible length-M sequence as a different class: (D, D), (D, N), (D, V), (D, Adj), (D, Adv), (D, Pr), (N, D), (N, N), (N, V), (N, Adj), (N, Adv), ... With L = 6 POS tags (Det, Noun, Verb, Adj, Adv, Prep), that is L^M classes! Length 2: 6^2 = 36. 8

Multiclass Prediction. x = "Fish Sleep", y = (N, V). Multiclass prediction: treating every possible length-M sequence as a different class is not tractable for sequence prediction: (D, D), (D, N), (D, V), (D, Adj), (D, Adv), (D, Pr), (N, D), (N, N), (N, V), (N, Adj), (N, Adv), ... With L = 6 POS tags (Det, Noun, Verb, Adj, Adv, Prep), that is L^M classes! Length 2: 6^2 = 36. Exponential explosion in #classes! 9

Why is Naïve Multiclass Intractable? x = "I fish often". POS Tags: Det, Noun, Verb, Adj, Adv, Prep. (D, D, D), (D, D, N), (D, D, V), (D, D, Adj), (D, D, Adv), (D, D, Pr), (D, N, D), (D, N, N), (D, N, V), (D, N, Adj), (D, N, Adv), (D, N, Pr), (D, V, D), (D, V, N), (D, V, V), (D, V, Adj), (D, V, Adv), (D, V, Pr), (N, D, D), (N, D, N), (N, D, V), (N, D, Adj), (N, D, Adv), (N, D, Pr), (N, N, D), (N, N, N), (N, N, V), (N, N, Adj), (N, N, Adv), (N, N, Pr), ... Assume pronouns are nouns for simplicity. 10

Why is Naïve Multiclass Intractable? x = "I fish often". POS Tags: Det, Noun, Verb, Adj, Adv, Prep. Treats every combination as a different class (learns a model for each combination): (D, D, D), (D, D, N), (D, D, V), ..., (N, N, Adv), (N, N, Pr), ... Exponentially large representation! (Exponential time to consider every class; exponential storage.) Assume pronouns are nouns for simplicity. 11

Independent Classification. x = "I fish often". POS Tags: Det, Noun, Verb, Adj, Adv, Prep. Treat each word independently (assumption): independent multiclass prediction per word. Predict for x = "I" independently, predict for x = "fish" independently, predict for x = "often" independently, then concatenate the predictions. Assume pronouns are nouns for simplicity. 12

Independent Classification. x = "I fish often". POS Tags: Det, Noun, Verb, Adj, Adv, Prep. Treat each word independently (assumption): independent multiclass prediction per word, so #Classes = #POS Tags (6 in our example). Predict for x = "I" independently, predict for x = "fish" independently, predict for x = "often" independently, then concatenate the predictions. Solvable using standard multiclass prediction. Assume pronouns are nouns for simplicity. 13

Independent Classification. x = "I fish often". POS Tags: Det, Noun, Verb, Adj, Adv, Prep. Treat each word independently: independent multiclass prediction per word.

P(y|x)     x="I"   x="fish"   x="often"
y=Det       0.0      0.0        0.0
y=Noun      1.0      0.75       0.0
y=Verb      0.0      0.25       0.0
y=Adj       0.0      0.0        0.4
y=Adv       0.0      0.0        0.6
y=Prep      0.0      0.0        0.0

Prediction: (N, N, Adv). Correct: (N, V, Adv). Why the mistake? Assume pronouns are nouns for simplicity. 14

Context Between Words. x = "I fish often". POS Tags: Det, Noun, Verb, Adj, Adv, Prep. Independent predictions ignore word pairs: in isolation, "fish" is more likely to be a Noun, but conditioned on following a (pro)noun, "fish" is more likely to be a Verb! 1st-order dependence models all pairs; 2nd order considers all triplets; arbitrary order = exponential size (naïve multiclass). Assume pronouns are nouns for simplicity. 15

1st Order Hidden Markov Model. x = (x_1, x_2, x_3, x_4, ..., x_M): sequence of words; y = (y_1, y_2, y_3, y_4, ..., y_M): sequence of POS tags. P(x_i | y_i): probability of state y_i generating x_i. P(y_{i+1} | y_i): probability of state y_i transitioning to y_{i+1}. P(y_1 | y_0): y_0 is defined to be the Start state. P(End | y_M): prior probability of y_M being the final state (not always used). 16

Graphical Model Representation. [Chain-structured graphical model: hidden states Y_0, Y_1, Y_2, ..., Y_M, Y_End, with an observation X_i attached to each Y_i; the End state is optional.] $P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$ 17

1st Order Hidden Markov Model. Joint distribution: $P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$ (the End term is optional). P(x_i | y_i): probability of state y_i generating x_i. P(y_{i+1} | y_i): probability of state y_i transitioning to y_{i+1}. P(y_1 | y_0): y_0 is defined to be the Start state. P(End | y_M): prior probability of y_M being the final state (not always used). 18
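
For concreteness, here is a minimal Python sketch (not the course's code) that evaluates this joint probability; the dictionaries A and O encode the transition and emission probabilities of the fish/sleep model used in the numerical example later in the lecture, with "Start" and "End" as the special states.

```python
# Minimal sketch: evaluate the HMM joint probability P(x, y).
A = {  # A[prev][next] = P(y_next | y_prev); numbers from the fish/sleep example
    "Start": {"Noun": 0.8, "Verb": 0.2},
    "Noun":  {"Noun": 0.1, "Verb": 0.8, "End": 0.1},
    "Verb":  {"Noun": 0.2, "Verb": 0.1, "End": 0.7},
}
O = {  # O[state][word] = P(x | y)
    "Noun": {"fish": 0.8, "sleep": 0.2},
    "Verb": {"fish": 0.5, "sleep": 0.5},
}

def joint_prob(x, y, use_end=True):
    """P(x, y) = P(End | y_M) * prod_i P(y_i | y_{i-1}) * prod_i P(x_i | y_i)."""
    p, prev = 1.0, "Start"
    for word, tag in zip(x, y):
        p *= A[prev][tag] * O[tag][word]   # transition term times emission term
        prev = tag
    if use_end:
        p *= A[prev]["End"]                # optional End term
    return p

print(joint_prob(["fish", "sleep"], ["Noun", "Verb"]))  # 0.8*0.8*0.8*0.5*0.7 = 0.1792
```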

1st Order Hidden Markov Model. Conditional distribution on x given y: $P(x \mid y) = \prod_{i=1}^{M} P(x_i \mid y_i)$. Given a POS tag sequence y, we can compute each P(x_i | y_i) independently (x_i is conditionally independent of the rest given y_i). P(x_i | y_i): probability of state y_i generating x_i. P(y_{i+1} | y_i): probability of state y_i transitioning to y_{i+1}. P(y_1 | y_0): y_0 is defined to be the Start state. P(End | y_M): prior probability of y_M being the final state (not always used). 19

1st Order Hidden Markov Model. Additional complexity of (#POS Tags)^2: models all state-state pairs (all POS tag-tag pairs). Also models all state-observation pairs (all tag-word pairs): same complexity as independent multiclass. P(x_i | y_i): probability of state y_i generating x_i. P(y_{i+1} | y_i): probability of state y_i transitioning to y_{i+1}. P(y_1 | y_0): y_0 is defined to be the Start state. P(End | y_M): prior probability of y_M being the final state (not always used). 20

Relationship to Naïve Bayes. [Two graphical models: the HMM chain Y_0 → Y_1 → ... → Y_M → Y_End with observations X_1, ..., X_M, and the same model with the transition edges removed.] Reduces to a sequence of disjoint Naïve Bayes models (if we ignore transition probabilities). 21

P(word | state/tag). Two-word language: "fish" and "sleep". Two-tag language: Noun and Verb.

P(x|y)      y=Noun   y=Verb
x="fish"     0.8      0.5
x="sleep"    0.2      0.5

Given tag sequence y: P("fish sleep" | (N, V)) = 0.8*0.5; P("fish fish" | (N, V)) = 0.8*0.5; P("sleep fish" | (V, V)) = 0.5*0.5; P("sleep sleep" | (N, N)) = 0.2*0.2. Slides borrowed from Ralph Grishman. 22

Sampling. HMMs are generative models: they model the joint distribution P(x, y), so we can generate samples from this distribution. First consider the conditional distribution P(x|y):

P(x|y)      y=Noun   y=Verb
x="fish"     0.8      0.5
x="sleep"    0.2      0.5

Given tag sequence y = (N, V), sample each word independently: sample from P(x_1 | N) (0.8 fish, 0.2 sleep), then sample from P(x_2 | V) (0.5 fish, 0.5 sleep). What about sampling from P(x, y)? 23

Forward Sampling of P(y, x). $P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$

P(x|y)      y=Noun   y=Verb
x="fish"     0.8      0.5
x="sleep"    0.2      0.5

[State transition diagram: start→noun 0.8, start→verb 0.2, noun→verb 0.8, verb→noun 0.2, verb→end 0.7.]

Initialize y_0 = Start, i = 0.
1. i = i + 1
2. Sample y_i from P(y_i | y_{i-1})
3. If y_i == End: quit
4. Sample x_i from P(x_i | y_i)
5. Go to Step 1

Exploits conditional independence. Requires P(End | y_i). Slides borrowed from Ralph Grishman. 24
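
A small Python sketch of this forward-sampling loop, under the same illustrative transition and emission tables (the dictionaries, state names, and helper function are mine, not the lecture's code):

```python
import random

# Minimal sketch of forward sampling from P(y, x), following the loop above.
# The End state is required so the sampled sequence terminates.
A = {
    "Start": {"Noun": 0.8, "Verb": 0.2},
    "Noun":  {"Noun": 0.1, "Verb": 0.8, "End": 0.1},
    "Verb":  {"Noun": 0.2, "Verb": 0.1, "End": 0.7},
}
O = {
    "Noun": {"fish": 0.8, "sleep": 0.2},
    "Verb": {"fish": 0.5, "sleep": 0.5},
}

def sample_from(dist):
    """Draw one key from a {outcome: probability} dictionary."""
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs)[0]

def forward_sample():
    x, y, state = [], [], "Start"
    while True:
        state = sample_from(A[state])     # sample y_i from P(y_i | y_{i-1})
        if state == "End":                # stop once the End state is drawn
            return x, y
        y.append(state)
        x.append(sample_from(O[state]))   # sample x_i from P(x_i | y_i)

print(forward_sample())   # e.g. (['fish', 'sleep'], ['Noun', 'Verb'])
```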

Forward Sampling of P(y, x | M) (fixed length, no End state). $P(x, y) = \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$ (End term dropped).

P(x|y)      y=Noun   y=Verb
x="fish"     0.8      0.5
x="sleep"    0.2      0.5

[State transition diagram without an End state: start→noun 0.8, start→verb 0.2, noun→verb 0.91, noun→noun 0.09, verb→noun 0.667, verb→verb 0.333.]

Initialize y_0 = Start, i = 0.
1. i = i + 1
2. If i == M: quit
3. Sample y_i from P(y_i | y_{i-1})
4. Sample x_i from P(x_i | y_i)
5. Go to Step 1

Exploits conditional independence. Assumes no P(End | y_i). Slides borrowed from Ralph Grishman. 25

1st Order Hidden Markov Model. Memory-less model: $P(x_{k+1:M}, y_{k+1:M} \mid x_{1:k}, y_{1:k}) = P(x_{k+1:M}, y_{k+1:M} \mid y_k)$, i.e., only y_k is needed to model the rest of the sequence. P(x_i | y_i): probability of state y_i generating x_i. P(y_{i+1} | y_i): probability of state y_i transitioning to y_{i+1}. P(y_1 | y_0): y_0 is defined to be the Start state. P(End | y_M): prior probability of y_M being the final state (not always used). 26

Viterbi Algorithm 27

Most Common Prediction Problem. Given an input sentence, predict the POS tag sequence: $\arg\max_y P(y \mid x)$. Naïve approach: try all possible y's and choose the one with the highest probability. Exponential time: L^M possible y's. 28

Bayes's Rule: $\arg\max_y P(y \mid x) = \arg\max_y \frac{P(y, x)}{P(x)} = \arg\max_y P(y, x) = \arg\max_y P(x \mid y)\, P(y)$, where $P(x \mid y) = \prod_{i=1}^{M} P(x_i \mid y_i)$ and $P(y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1})$. 29

$\arg\max_y P(y, x) = \arg\max_y \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i) = \arg\max_{y_M} \arg\max_{y_{1:M-1}} P(y_M \mid y_{M-1})\, P(x_M \mid y_M)\, P(y_{1:M-1}, x_{1:M-1})$, where $P(y_{1:k}, x_{1:k}) = P(x_{1:k} \mid y_{1:k})\, P(y_{1:k})$ with $P(x_{1:k} \mid y_{1:k}) = \prod_{i=1}^{k} P(x_i \mid y_i)$ and $P(y_{1:k}) = \prod_{i=1}^{k} P(y_i \mid y_{i-1})$. Exploit the memory-less property: the choice of y_M only depends on y_{1:M-1} via P(y_M | y_{M-1})! 30

Dynamic Programming. Input: x = (x_1, x_2, x_3, ..., x_M). Computed: the best length-k prefix ending in each tag. Examples: $\hat{Y}_k(V) = \left(\arg\max_{y_{1:k-1}} P(y_{1:k-1} \oplus V,\, x_{1:k})\right) \oplus V$ and $\hat{Y}_k(N) = \left(\arg\max_{y_{1:k-1}} P(y_{1:k-1} \oplus N,\, x_{1:k})\right) \oplus N$, where $\oplus$ denotes sequence concatenation. Claim: $\hat{Y}_{k+1}(V) = \left(\arg\max_{y_{1:k} \in \{\hat{Y}_k(T)\}_T} P(y_{1:k} \oplus V,\, x_{1:k+1})\right) \oplus V = \left(\arg\max_{y_{1:k} \in \{\hat{Y}_k(T)\}_T} P(y_{1:k}, x_{1:k})\, P(y_{k+1} = V \mid y_k)\, P(x_{k+1} \mid y_{k+1} = V)\right) \oplus V$, where $P(y_{1:k}, x_{1:k})$ is pre-computed. A recursive definition! 31

Solve: $\hat{Y}_2(V) = \left(\arg\max_{y_1 \in \{\hat{Y}_1(T)\}_T} P(y_1, x_1)\, P(y_2 = V \mid y_1)\, P(x_2 \mid y_2 = V)\right) \oplus V$. Store each Ŷ_1(Z) and P(Ŷ_1(Z), x_1). [Trellis diagram: candidate transitions y_1 = V, D, N from Ŷ_1(V), Ŷ_1(D), Ŷ_1(N) into Ŷ_2(V), with columns Ŷ_1(·) and Ŷ_2(·).] Ŷ_1(Z) is just Z. 32

Solve: $\hat{Y}_2(V) = \left(\arg\max_{y_1 \in \{\hat{Y}_1(T)\}_T} P(y_1, x_1)\, P(y_2 = V \mid y_1)\, P(x_2 \mid y_2 = V)\right) \oplus V$. Store each Ŷ_1(Z) and P(Ŷ_1(Z), x_1). [Trellis diagram: the best incoming transition y_1 = N is selected for Ŷ_2(V).] Ŷ_1(Z) is just Z. Ex: Ŷ_2(V) = (N, V). 33

Solve: $\hat{Y}_3(V) = \left(\arg\max_{y_{1:2} \in \{\hat{Y}_2(T)\}_T} P(y_{1:2}, x_{1:2})\, P(y_3 = V \mid y_2)\, P(x_3 \mid y_3 = V)\right) \oplus V$. Store each Ŷ_1(Z) and P(Ŷ_1(Z), x_1); store each Ŷ_2(Z) and P(Ŷ_2(Z), x_{1:2}). [Trellis diagram: candidate transitions y_2 = V, D, N into Ŷ_3(V).] Ex: Ŷ_2(V) = (N, V). 34

Solve: $\hat{Y}_3(V) = \left(\arg\max_{y_{1:2} \in \{\hat{Y}_2(T)\}_T} P(y_{1:2}, x_{1:2})\, P(y_3 = V \mid y_2)\, P(x_3 \mid y_3 = V)\right) \oplus V$. Store each Ŷ_1(Z) and P(Ŷ_1(Z), x_1); store each Ŷ_2(Z) and P(Ŷ_2(Z), x_{1:2}). Claim: we only need to check the stored solutions Ŷ_2(Z) for Z = V, D, N. Ex: Ŷ_2(V) = (N, V). Suppose instead that Ŷ_3(V) = (V, V, V); we can prove that (N, V, V) has higher probability. The proof depends on the 1st-order property: the probabilities of (V, V, V) and (N, V, V) differ in only 3 terms, P(y_1 | y_0), P(x_1 | y_1), and P(y_2 | y_1), and none of these depend on y_3! 35

$\hat{Y}_M(V) = \left(\arg\max_{y_{1:M-1} \in \{\hat{Y}_{M-1}(T)\}_T} P(y_{1:M-1}, x_{1:M-1})\, P(y_M = V \mid y_{M-1})\, P(x_M \mid y_M = V)\, P(\text{End} \mid y_M = V)\right) \oplus V$ (the End term is optional). Store each Ŷ_1(Z) and P(Ŷ_1(Z), x_1); each Ŷ_2(Z) and P(Ŷ_2(Z), x_{1:2}); each Ŷ_3(Z) and P(Ŷ_3(Z), x_{1:3}); and so on. [Trellis diagram with columns Ŷ_1(·), Ŷ_2(·), Ŷ_3(·), ..., Ŷ_M(·) for tags V, D, N.] Ex: Ŷ_2(V) = (N, V); Ŷ_3(V) = (D, N, V). 36

Viterbi Algorithm. Solve $\arg\max_y P(y \mid x) = \arg\max_y \frac{P(y, x)}{P(x)} = \arg\max_y P(y, x) = \arg\max_y P(x \mid y)\, P(y)$. For k = 1..M, iteratively solve for each Ŷ_k(Z), with Z looping over every POS tag; then predict the best Ŷ_M(Z). Also known as Maximum A Posteriori (MAP) inference. 37

Numerical Example. x = (Fish, Sleep). [State transition diagram: start→noun 0.8, start→verb 0.2, noun→verb 0.8, verb→noun 0.2, verb→end 0.7.] Slides borrowed from Ralph Grishman. 38

[Transition diagram as above.]

P(x|y)      y=Noun   y=Verb
x="fish"     0.8      0.5
x="sleep"    0.2      0.5

Viterbi trellis:

         0    1      2        3
start    1
verb     0
noun     0
end      0

Slides borrowed from Ralph Grishman. 39

[Transition diagram and P(x|y) table as above.] Token 1: fish

         0    1        2        3
start    1    0
verb     0    .2*.5
noun     0    .8*.8
end      0    0

Slides borrowed from Ralph Grishman. 40

[Transition diagram and P(x|y) table as above.] Token 1: fish

         0    1      2        3
start    1    0
verb     0    .1
noun     0    .64
end      0    0

Slides borrowed from Ralph Grishman. 41

[Transition diagram and P(x|y) table as above.] Token 2: sleep (if fish is a verb)

         0    1      2           3
start    1    0      0
verb     0    .1     .1*.1*.5
noun     0    .64    .1*.2*.2
end      0    0      -

Slides borrowed from Ralph Grishman. 42

[Transition diagram and P(x|y) table as above.] Token 2: sleep (if fish is a verb)

         0    1      2        3
start    1    0      0
verb     0    .1     .005
noun     0    .64    .004
end      0    0      -

Slides borrowed from Ralph Grishman. 43

[Transition diagram and P(x|y) table as above.] Token 2: sleep (if fish is a noun)

         0    1      2                  3
start    1    0      0
verb     0    .1     .005, .64*.8*.5
noun     0    .64    .004, .64*.1*.2
end      0    0      -

Slides borrowed from Ralph Grishman. 44

[Transition diagram and P(x|y) table as above.] Token 2: sleep (if fish is a noun)

         0    1      2              3
start    1    0      0
verb     0    .1     .005, .256
noun     0    .64    .004, .0128
end      0    0      -

Slides borrowed from Ralph Grishman. 45

[Transition diagram and P(x|y) table as above.] Token 2: sleep; take the maximum, set back pointers. [Same trellis values as above: verb .005 vs. .256, noun .004 vs. .0128.] Slides borrowed from Ralph Grishman. 46

[Transition diagram and P(x|y) table as above.] Token 2: sleep; take the maximum, set back pointers.

         0    1      2        3
start    1    0      0
verb     0    .1     .256
noun     0    .64    .0128
end      0    0      -

Slides borrowed from Ralph Grishman. 47

[Transition diagram and P(x|y) table as above.] Token 3: end

         0    1      2        3
start    1    0      0        0
verb     0    .1     .256     -
noun     0    .64    .0128    -
end      0    0      -        .256*.7, .0128*.1

Slides borrowed from Ralph Grishman. 48

[Transition diagram and P(x|y) table as above.] Token 3: end; take the maximum, set back pointers. [Same trellis as above: end-column candidates .256*.7 and .0128*.1.] Slides borrowed from Ralph Grishman. 49

[Transition diagram and P(x|y) table as above.] Decode: fish = noun, sleep = verb.

         0    1      2        3
start    1    0      0        0
verb     0    .1     .256     -
noun     0    .64    .0128    -
end      0    0      -        .256*.7

Slides borrowed from Ralph Grishman. 50

[Transition diagram, P(x|y) table, and trellis as above.] Decode: fish = noun, sleep = verb. What might go wrong for long sequences? Underflow! Small numbers get repeatedly multiplied together and become exponentially small. Slides borrowed from Ralph Grishman. 51

Viterbi Algorithm (w/ Log Probabilities). Solve $\arg\max_y P(y \mid x) = \arg\max_y \frac{P(y, x)}{P(x)} = \arg\max_y P(y, x) = \arg\max_y \left(\log P(x \mid y) + \log P(y)\right)$. For k = 1..M, iteratively solve for the log-probability of each Ŷ_k(Z), with Z looping over every POS tag; then predict the best Ŷ_M(Z). The log-probability of Ŷ_M(Z) accumulates additively, not multiplicatively. 52
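
To make the recursion concrete, here is a small log-space Viterbi sketch for the fish/sleep example; the data structures and function are illustrative rather than the course's code, but the result can be checked against the trellis above (0.256 * 0.7 = 0.1792 for the sequence (Noun, Verb)).

```python
import math

# Minimal log-space Viterbi sketch for the fish/sleep HMM used above.
A = {"Start": {"Noun": 0.8, "Verb": 0.2},
     "Noun":  {"Noun": 0.1, "Verb": 0.8, "End": 0.1},
     "Verb":  {"Noun": 0.2, "Verb": 0.1, "End": 0.7}}
O = {"Noun": {"fish": 0.8, "sleep": 0.2},
     "Verb": {"fish": 0.5, "sleep": 0.5}}
TAGS = ["Noun", "Verb"]

def viterbi(x):
    # delta[z] = best log P(y_1:k, x_1:k) over prefixes ending in tag z;
    # back[k][z] = best previous tag, used to recover the argmax sequence.
    delta = {z: math.log(A["Start"][z]) + math.log(O[z][x[0]]) for z in TAGS}
    back = []
    for word in x[1:]:
        new_delta, pointers = {}, {}
        for z in TAGS:
            best_prev = max(TAGS, key=lambda p: delta[p] + math.log(A[p][z]))
            new_delta[z] = (delta[best_prev] + math.log(A[best_prev][z])
                            + math.log(O[z][word]))
            pointers[z] = best_prev
        delta, back = new_delta, back + [pointers]
    # Optional End term, then trace the back pointers.
    last = max(TAGS, key=lambda z: delta[z] + math.log(A[z]["End"]))
    score = delta[last] + math.log(A[last]["End"])
    path = [last]
    for pointers in reversed(back):
        path.insert(0, pointers[path[0]])
    return path, math.exp(score)

print(viterbi(["fish", "sleep"]))   # (['Noun', 'Verb'], ~0.1792)
```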

Recap: Independent Classification. x = "I fish often". POS Tags: Det, Noun, Verb, Adj, Adv, Prep. Treat each word independently: independent multiclass prediction per word.

P(y|x)     x="I"   x="fish"   x="often"
y=Det       0.0      0.0        0.0
y=Noun      1.0      0.75       0.0
y=Verb      0.0      0.25       0.0
y=Adj       0.0      0.0        0.4
y=Adv       0.0      0.0        0.6
y=Prep      0.0      0.0        0.0

Prediction: (N, N, Adv). Correct: (N, V, Adv). Assume pronouns are nouns for simplicity. The mistake is due to not modeling multiple words jointly. 53

Recap: Viterbi. Models pairwise transitions between states (pairwise transitions between POS tags): a 1st-order model. $P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$. x = "I fish often". Independent: (N, N, Adv). HMM Viterbi: (N, V, Adv). *Assuming we defined P(x, y) properly. 54

Training HMMs 55

Supervised Training. Given: $S = \{(x_i, y_i)\}_{i=1}^{N}$ (word sequence / sentence, POS tag sequence). Goal: estimate P(x, y) using S, where $P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$. Maximum Likelihood! 56

Aside: Matrix Formulation. Define the transition matrix A with $A_{ab} = P(y_{i+1} = a \mid y_i = b)$, or $\log P(y_{i+1} = a \mid y_i = b)$:

P(y_next|y)    y=Noun   y=Verb
y_next=Noun     0.09     0.667
y_next=Verb     0.91     0.333

and the observation matrix O with $O_{wz} = P(x_i = w \mid y_i = z)$, or $\log P(x_i = w \mid y_i = z)$:

P(x|y)      y=Noun   y=Verb
x="fish"     0.8      0.5
x="sleep"    0.2      0.5

57

Aside: Matrix Formulation. $P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i) = A_{\text{End}, y_M} \prod_{i=1}^{M} A_{y_i, y_{i-1}} \prod_{i=1}^{M} O_{x_i, y_i}$. Log-probability formulation: $\log P(x, y) = \tilde{A}_{\text{End}, y_M} + \sum_{i=1}^{M} \tilde{A}_{y_i, y_{i-1}} + \sum_{i=1}^{M} \tilde{O}_{x_i, y_i}$, where each entry of $\tilde{A}$ is defined as $\log(A)$ (and likewise for $\tilde{O}$). 58

Maximum Likelihood. $\arg\max_{A,O} \prod_{(x,y) \in S} P(x, y) = \arg\max_{A,O} \prod_{(x,y) \in S} P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$. Estimate each component separately: $A_{ab} = \frac{\sum_{j=1}^{N} \sum_{i=0}^{M_j} 1[(y_{i+1}^j = a) \wedge (y_i^j = b)]}{\sum_{j=1}^{N} \sum_{i=0}^{M_j} 1[y_i^j = b]}$ and $O_{wz} = \frac{\sum_{j=1}^{N} \sum_{i=1}^{M_j} 1[(x_i^j = w) \wedge (y_i^j = z)]}{\sum_{j=1}^{N} \sum_{i=1}^{M_j} 1[y_i^j = z]}$ (derived by minimizing the negative log likelihood). 59
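
A minimal sketch of these counting estimates, on a tiny made-up training set; Start and End are treated as pseudo-tags so the same counts also give P(y_1 | Start) and P(End | y_M). Nothing here is from the course's code.

```python
from collections import defaultdict

# Supervised HMM training by counting (maximum likelihood); toy data.
data = [
    (["fish", "sleep"], ["Noun", "Verb"]),
    (["fish", "fish"],  ["Noun", "Verb"]),
]

trans_counts = defaultdict(lambda: defaultdict(float))   # counts for A
emit_counts  = defaultdict(lambda: defaultdict(float))   # counts for O

for words, tags in data:
    padded = ["Start"] + tags + ["End"]
    for prev, nxt in zip(padded, padded[1:]):
        trans_counts[prev][nxt] += 1.0                   # count y_i = b -> y_{i+1} = a
    for word, tag in zip(words, tags):
        emit_counts[tag][word] += 1.0                    # count (x_i = w, y_i = z)

def normalize(counts):
    # Turn raw counts into conditional probabilities (each row sums to 1).
    return {cond: {k: v / sum(row.values()) for k, v in row.items()}
            for cond, row in counts.items()}

A = normalize(trans_counts)   # A[b][a] ~ P(y_{i+1} = a | y_i = b)
O = normalize(emit_counts)    # O[z][w] ~ P(x_i = w | y_i = z)
print(A)
print(O)
```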

Recap: Supervised Training. Maximum likelihood training: $\arg\max_{A,O} \prod_{(x,y) \in S} P(x, y) = \arg\max_{A,O} \prod_{(x,y) \in S} P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$. Counting statistics: super easy! Why? What about the unsupervised case? 60

Conditional Independence Assumptions. $\arg\max_{A,O} \prod_{(x,y) \in S} P(x, y) = \arg\max_{A,O} \prod_{(x,y) \in S} P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$. Everything decomposes into products of pairs, i.e., P(y_{i+1}=a | y_i=b) doesn't depend on anything else. So we can just estimate frequencies: how often y_{i+1}=a when y_i=b over the training set. Note that P(y_{i+1}=a | y_i=b) is a common model across all locations of all sequences. 61

Conditional Independence Assumptions. $\arg\max_{A,O} \prod_{(x,y) \in S} P(x, y) = \arg\max_{A,O} \prod_{(x,y) \in S} P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$. #Parameters: transitions A: #Tags^2; observations O: #Words x #Tags. Everything decomposes into products of pairs, i.e., P(y_{i+1}=a | y_i=b) doesn't depend on anything else, so we can just estimate frequencies: how often y_{i+1}=a when y_i=b over the training set. This avoids directly modeling word/word pairings (#Tags is in the 10s, #Words in the 10,000s). Note that P(y_{i+1}=a | y_i=b) is a common model across all locations of all sequences. 62

Unsupervised Training. What if there are no y's? We have just a training set of sentences (word sequences): $S = \{x_i\}_{i=1}^{N}$. We still want to estimate P(x, y). How? Why? $\arg\max \prod_i P(x_i) = \arg\max \prod_i \sum_y P(x_i, y)$. 63

Why Unsupervised Training? Supervised data is hard to acquire: it requires annotating POS tags. Unsupervised data is plentiful: just grab some text! It might just work for POS tagging: learn y's that correspond to POS tags. It can also be used for other tasks: detecting outlier sentences (sentences with low probability), or sampling new sentences. 64

EM Algorithm (Baum-Welch). If we had the y's, we could do maximum likelihood; if we had (A, O), we could predict the y's. Chicken vs. egg! 1. Initialize A and O arbitrarily. 2. Predict the probabilities of the y's for each training x (Expectation step). 3. Use the y's to estimate a new (A, O) (Maximization step). 4. Repeat from Step 2 until convergence. http://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm 65

Expectation Step. Given (A, O), for each training x = (x_1, ..., x_M), predict P(y_i) for each y = (y_1, ..., y_M):

              x_1    x_2   ...   x_M
P(y_i=Noun)   0.5    0.4         0.05
P(y_i=Det)    0.4    0.6         0.25
P(y_i=Verb)   0.0    0.0         0.7

This encodes the current model's beliefs about y: the marginal distribution of each y_i. 66

Recall: Matrix Formulation. Define the transition matrix A with $A_{ab} = P(y_{i+1} = a \mid y_i = b)$, or $\log P(y_{i+1} = a \mid y_i = b)$:

P(y_next|y)    y=Noun   y=Verb
y_next=Noun     0.09     0.667
y_next=Verb     0.91     0.333

and the observation matrix O with $O_{wz} = P(x_i = w \mid y_i = z)$, or $\log P(x_i = w \mid y_i = z)$:

P(x|y)      y=Noun   y=Verb
x="fish"     0.8      0.5
x="sleep"    0.2      0.5

67

Maximization Step. Maximize the likelihood over the marginal distribution. Supervised: $A_{ab} = \frac{\sum_{j=1}^{N} \sum_{i=0}^{M_j} 1[(y_{i+1}^j = a) \wedge (y_i^j = b)]}{\sum_{j=1}^{N} \sum_{i=0}^{M_j} 1[y_i^j = b]}$, $O_{wz} = \frac{\sum_{j=1}^{N} \sum_{i=1}^{M_j} 1[(x_i^j = w) \wedge (y_i^j = z)]}{\sum_{j=1}^{N} \sum_{i=1}^{M_j} 1[y_i^j = z]}$. Unsupervised (using the marginals): $A_{ab} = \frac{\sum_{j=1}^{N} \sum_{i=0}^{M_j} P(y_i^j = b,\, y_{i+1}^j = a)}{\sum_{j=1}^{N} \sum_{i=0}^{M_j} P(y_i^j = b)}$, $O_{wz} = \frac{\sum_{j=1}^{N} \sum_{i=1}^{M_j} 1[x_i^j = w]\, P(y_i^j = z)}{\sum_{j=1}^{N} \sum_{i=1}^{M_j} P(y_i^j = z)}$. 68
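
To show how the E-step marginals plug into these M-step updates, here is a hedged sketch of one Baum-Welch iteration. `forward_backward` is a hypothetical helper (a version of it is sketched after the forward/backward slides below) assumed to return, for one sentence, the unigram marginals P(y_i = z | x) and the pairwise marginals P(y_{i-1} = a, y_i = b | x); Start/End handling is omitted for brevity.

```python
from collections import defaultdict

# Sketch of one EM (Baum-Welch) iteration, assuming a helper
#   forward_backward(x, A, O) -> (gamma, xi)    where
#   gamma[i][z]   ~ P(y_i = z | x)              (unigram marginals)
#   xi[i][(a, b)] ~ P(y_{i-1} = a, y_i = b | x) (pairwise marginals, i >= 1)
def baum_welch_step(sentences, A, O, forward_backward):
    trans_num = defaultdict(lambda: defaultdict(float))
    emit_num  = defaultdict(lambda: defaultdict(float))
    for x in sentences:
        gamma, xi = forward_backward(x, A, O)           # E-step for this sentence
        for i, word in enumerate(x):
            for z, p in gamma[i].items():
                emit_num[z][word] += p                  # expected emission counts
        for i in range(1, len(x)):
            for (a, b), p in xi[i].items():
                trans_num[a][b] += p                    # expected transition counts
    # M-step: normalize the expected counts into new probabilities.
    new_A = {a: {b: v / sum(row.values()) for b, v in row.items()}
             for a, row in trans_num.items()}
    new_O = {z: {w: v / sum(row.values()) for w, v in row.items()}
             for z, row in emit_num.items()}
    return new_A, new_O
```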

Computing Marginals (Forward-Backward Algorithm). Solving the E-step requires computing the marginals:

              x_1    x_2   ...   x_M
P(y_i=Noun)   0.5    0.4         0.05
P(y_i=Det)    0.4    0.6         0.25
P(y_i=Verb)   0.0    0.0         0.7

This can be solved using dynamic programming, similar to Viterbi! 69

Notation. Probability of observing the prefix x_{1:i} and having the i-th state be y_i = z: $\alpha_z(i) = P(x_{1:i}, y_i = z \mid A, O)$. Probability of observing the suffix x_{i+1:M} given the i-th state being y_i = z: $\beta_z(i) = P(x_{i+1:M} \mid y_i = z, A, O)$. Computing marginals = combining the two terms: $P(y_i = z \mid x) = \frac{\alpha_z(i)\, \beta_z(i)}{\sum_{z'} \alpha_{z'}(i)\, \beta_{z'}(i)}$. http://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm 70

Notation. Probability of observing the prefix x_{1:i} and having the i-th state be y_i = z: $\alpha_z(i) = P(x_{1:i}, y_i = z \mid A, O)$. Probability of observing the suffix x_{i+1:M} given the i-th state being y_i = z: $\beta_z(i) = P(x_{i+1:M} \mid y_i = z, A, O)$. Computing pairwise marginals = combining the two terms: $P(y_i = b, y_{i-1} = a \mid x) = \frac{\alpha_a(i-1)\, P(y_i = b \mid y_{i-1} = a)\, P(x_i \mid y_i = b)\, \beta_b(i)}{\sum_{a',b'} \alpha_{a'}(i-1)\, P(y_i = b' \mid y_{i-1} = a')\, P(x_i \mid y_i = b')\, \beta_{b'}(i)}$. http://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm 71

Forward (sub-)algorithm. Solve for every: $\alpha_z(i) = P(x_{1:i}, y_i = z \mid A, O)$. Naively, $\alpha_z(i) = \sum_{y_{1:i-1}} P(x_{1:i}, y_i = z, y_{1:i-1} \mid A, O)$: exponential time! It can be computed recursively (like Viterbi): $\alpha_z(1) = P(y_1 = z \mid y_0)\, P(x_1 \mid y_1 = z) = O_{x_1, z}\, A_{z, \text{Start}}$ and $\alpha_z(i+1) = O_{x_{i+1}, z} \sum_{j=1}^{L} \alpha_j(i)\, A_{z, j}$. Viterbi effectively replaces the sum with a max. 72
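
Here is a minimal Python sketch of this forward recursion for the fish/sleep model used earlier; the dictionaries are the same illustrative parameters as before, not code from the lecture.

```python
# Minimal sketch of the forward recursion (alpha) for the fish/sleep model.
A = {"Start": {"Noun": 0.8, "Verb": 0.2},
     "Noun":  {"Noun": 0.1, "Verb": 0.8, "End": 0.1},
     "Verb":  {"Noun": 0.2, "Verb": 0.1, "End": 0.7}}
O = {"Noun": {"fish": 0.8, "sleep": 0.2},
     "Verb": {"fish": 0.5, "sleep": 0.5}}
TAGS = ["Noun", "Verb"]

def forward(x):
    """alpha[i][z] ~ P(x_{1:i+1}, y_{i+1} = z), using 0-based list positions."""
    alpha = [{z: A["Start"][z] * O[z][x[0]] for z in TAGS}]
    for word in x[1:]:
        prev = alpha[-1]
        alpha.append({z: O[z][word] * sum(prev[j] * A[j][z] for j in TAGS)
                      for z in TAGS})
    return alpha

alpha = forward(["fish", "sleep"])
print(alpha[-1])   # ~ {'Noun': 0.0168, 'Verb': 0.261}
```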

Backward (sub-)algorithm. Solve for every: $\beta_z(i) = P(x_{i+1:M} \mid y_i = z, A, O)$. Naively, $\beta_z(i) = \sum_{y_{i+1:M}} P(x_{i+1:M}, y_{i+1:M} \mid y_i = z, A, O)$: exponential time! It can be computed recursively (like Viterbi): $\beta_z(M) = 1$ and $\beta_z(i) = \sum_{j=1}^{L} \beta_j(i+1)\, A_{j,z}\, O_{x_{i+1}, j}$. 73

Forward-Backward Algorithm. Runs forward: $\alpha_z(i) = P(x_{1:i}, y_i = z \mid A, O)$. Runs backward: $\beta_z(i) = P(x_{i+1:M} \mid y_i = z, A, O)$. For each training x = (x_1, ..., x_M), computes each P(y_i) for y = (y_1, ..., y_M): $P(y_i = z \mid x) = \frac{\alpha_z(i)\, \beta_z(i)}{\sum_{z'} \alpha_{z'}(i)\, \beta_{z'}(i)}$. 74
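
Continuing that sketch, a hedged backward pass and marginal computation for the same illustrative model; it reuses the A, O, TAGS dictionaries and the forward() function from the forward sketch above, and it takes beta at the last position to be 1 (i.e., the End term is ignored, as on the slide).

```python
# Continues the forward sketch above (same A, O, TAGS and forward()).
def backward(x):
    """beta[i][z] ~ P(x after position i+1 | y_{i+1} = z); last position = 1."""
    beta = [{z: 1.0 for z in TAGS}]
    for word in reversed(x[1:]):
        nxt = beta[0]
        beta.insert(0, {z: sum(nxt[j] * A[z][j] * O[j][word] for j in TAGS)
                        for z in TAGS})
    return beta

def marginals(x):
    """P(y_i = z | x) = alpha_z(i) * beta_z(i) / sum_z' alpha_z'(i) * beta_z'(i)."""
    alpha, beta = forward(x), backward(x)
    out = []
    for a, b in zip(alpha, beta):
        norm = sum(a[z] * b[z] for z in TAGS)
        out.append({z: a[z] * b[z] / norm for z in TAGS})
    return out

print(marginals(["fish", "sleep"]))   # position 1 mostly Noun, position 2 mostly Verb
```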

Recap: Unsupervised Training. Train using only word sequences (sentences): $S = \{x_i\}_{i=1}^{N}$. The y's are hidden states: all pairwise transitions go through the y's, hence "hidden" Markov model. Train using the EM algorithm, which converges to a local optimum. 75

Initialization. How do we choose the number of hidden states? By hand, or by cross-validation: evaluate P(x) on validation data. P(x) can be computed via the forward algorithm: $P(x) = \sum_y P(x, y) = \sum_z \alpha_z(M)\, P(\text{End} \mid y_M = z)$. 76
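
For completeness, a short snippet (reusing forward() and A from the earlier sketch, this time with the End term included) that would score a held-out sentence this way:

```python
# Marginal likelihood P(x) of a sentence under the illustrative fish/sleep model.
def sentence_prob(x):
    alpha_last = forward(x)[-1]
    return sum(alpha_last[z] * A[z]["End"] for z in alpha_last)

print(sentence_prob(["fish", "sleep"]))   # 0.0168*0.1 + 0.261*0.7 ~ 0.1844
```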

Recap: Sequence Prediction & HMMs. Models pairwise dependences in sequences. x = "I fish often". POS Tags: Det, Noun, Verb, Adj, Adv, Prep. Independent: (N, N, Adv). HMM Viterbi: (N, V, Adv). Compact: only models pairwise dependences between the y's. Main limitation: lots of independence assumptions, so poor predictive accuracy. 77

Next Week: Deep Generative Models (Recent Applications Lecture); more on Unsupervised Learning. Recitation TUESDAY (7pm): recap of Viterbi and Forward/Backward. 78