Machine Learning & Data Mining CS/CNS/EE 155. Lecture 8: Hidden Markov Models


Sequence Prediction (POS Tagging)
x = "Fish Sleep"                      y = (N, V)
x = "The Dog Ate My Homework"         y = (D, N, V, D, N)
x = "The Fox Jumped Over The Fence"   y = (D, N, V, P, D, N)

Challenges
Multivariable output: make multiple predictions simultaneously.
Variable-length input/output: sentence lengths not fixed.

Multivariate Outputs
x = "Fish Sleep"   y = (N, V)
Multiclass prediction. POS tags: Det, Noun, Verb, Adj, Adv, Prep.
Replicate weights for K classes: w = (w_1, w_2, ..., w_K), b = (b_1, b_2, ..., b_K).
Score all classes: f(x | w, b) = (w_1^T x - b_1, w_2^T x - b_2, ..., w_K^T x - b_K).
Predict via largest score: argmax_k (w_k^T x - b_k).
How many classes?
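
A minimal sketch of this replicated-weights multiclass predictor. The tag set matches the slide, but the feature dimension, weights, and input vector are placeholders for illustration.

```python
import numpy as np

# Minimal sketch of multiclass prediction with replicated weights.
# The tag set is from the slide; feature dimension and weights are placeholders.
TAGS = ["Det", "Noun", "Verb", "Adj", "Adv", "Prep"]   # K = 6 classes
D = 10                                                  # feature dimension (assumed)

rng = np.random.default_rng(0)
W = rng.normal(size=(len(TAGS), D))   # one weight vector w_k per class
b = rng.normal(size=len(TAGS))        # one bias b_k per class

def predict_tag(x):
    """Score every class with w_k^T x - b_k and return the argmax tag."""
    scores = W @ x - b
    return TAGS[int(np.argmax(scores))]

x = rng.normal(size=D)    # feature vector for a single word (placeholder)
print(predict_tag(x))     # prints one of the six tags
```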

Multiclass Prediction
x = "Fish Sleep"   y = (N, V)
Multiclass prediction: treat every possible length-M tag sequence as its own class:
(D, D), (D, N), (D, V), (D, Adj), (D, Adv), (D, Pr), (N, D), (N, N), (N, V), (N, Adj), (N, Adv), ...
POS tags: Det, Noun, Verb, Adj, Adv, Prep, so L = 6 and there are L^M classes. Even for length 2: 6^2 = 36.
Exponential explosion in the number of classes: not tractable for sequence prediction!

Why is Naïve Multiclass Intractable?
x = "I fish often"   POS tags: Det, Noun, Verb, Adj, Adv, Prep
(D, D, D), (D, D, N), (D, D, V), (D, D, Adj), (D, D, Adv), (D, D, Pr)
(D, N, D), (D, N, N), (D, N, V), (D, N, Adj), (D, N, Adv), (D, N, Pr)
(D, V, D), (D, V, N), (D, V, V), (D, V, Adj), (D, V, Adv), (D, V, Pr)
(N, D, D), (N, D, N), (N, D, V), (N, D, Adj), (N, D, Adv), (N, D, Pr)
(N, N, D), (N, N, N), (N, N, V), (N, N, Adj), (N, N, Adv), (N, N, Pr)
...
Treats every combination as a different class (learns a model for each combination).
Exponentially large representation: exponential time to consider every class, and exponential storage.
(Assume pronouns are nouns for simplicity.)

Independent Classification
x = "I fish often"   POS tags: Det, Noun, Verb, Adj, Adv, Prep
Treat each word independently (assumption): independent multiclass prediction per word.
Predict for x = "I" independently, for x = "fish" independently, for x = "often" independently, then concatenate the predictions.
#Classes = #POS tags (6 in our example), so each word is solvable using standard multiclass prediction.
(Assume pronouns are nouns for simplicity.)

Independent Classification
x = "I fish often"   POS tags: Det, Noun, Verb, Adj, Adv, Prep
Treat each word independently: independent multiclass prediction per word.

P(y|x)      x="I"   x="fish"   x="often"
y=Det        0.0     0.0        0.0
y=Noun       1.0     0.75       0.0
y=Verb       0.0     0.25       0.0
y=Adj        0.0     0.0        0.4
y=Adv        0.0     0.0        0.6
y=Prep       0.0     0.0        0.0

Prediction: (N, N, Adv)   Correct: (N, V, Adv)
Why the mistake? (Assume pronouns are nouns for simplicity.)
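
A small sketch that reproduces the independent per-word prediction using the P(y|x) table from this slide: taking each word's argmax separately yields (N, N, Adv) rather than the correct (N, V, Adv).

```python
# Sketch: independent per-word classification using the P(y|x) table from the slide.
P_y_given_x = {
    "I":     {"Det": 0.0, "Noun": 1.0,  "Verb": 0.0,  "Adj": 0.0, "Adv": 0.0, "Prep": 0.0},
    "fish":  {"Det": 0.0, "Noun": 0.75, "Verb": 0.25, "Adj": 0.0, "Adv": 0.0, "Prep": 0.0},
    "often": {"Det": 0.0, "Noun": 0.0,  "Verb": 0.0,  "Adj": 0.4, "Adv": 0.6, "Prep": 0.0},
}

sentence = ["I", "fish", "often"]
prediction = [max(P_y_given_x[w], key=P_y_given_x[w].get) for w in sentence]
print(prediction)   # ['Noun', 'Noun', 'Adv']  (ignores context between words)
```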

Context Between Words
x = "I fish often"   POS tags: Det, Noun, Verb, Adj, Adv, Prep
Independent predictions ignore word pairs. In isolation, "fish" is more likely to be a noun, but conditioned on following a (pro)noun, "fish" is more likely to be a verb!
1st-order dependence models all pairs; 2nd order considers all triplets; arbitrary order = exponential size (naïve multiclass).
(Assume pronouns are nouns for simplicity.)

1st-Order Hidden Markov Model
x = (x_1, x_2, x_3, x_4, ..., x_M)   (sequence of words)
y = (y_1, y_2, y_3, y_4, ..., y_M)   (sequence of POS tags)
P(x_i | y_i): probability of state y_i generating x_i
P(y_{i+1} | y_i): probability of state y_i transitioning to y_{i+1}
P(y_1 | y_0): y_0 is defined to be the Start state
P(End | y_M): prior probability of y_M being the final state (not always used)

Graphical Model Representation
(Figure: chain of hidden states Y_0 -> Y_1 -> Y_2 -> ... -> Y_M -> Y_End, with each Y_i emitting observation X_i; the End state is optional.)
P(x, y) = P(End | y_M) · ∏_{i=1}^{M} P(y_i | y_{i-1}) · ∏_{i=1}^{M} P(x_i | y_i)

1st-Order Hidden Markov Model
Joint distribution:  P(x, y) = P(End | y_M) · ∏_{i=1}^{M} P(y_i | y_{i-1}) · ∏_{i=1}^{M} P(x_i | y_i)   (End term optional)
P(x_i | y_i): probability of state y_i generating x_i
P(y_{i+1} | y_i): probability of state y_i transitioning to y_{i+1}
P(y_1 | y_0): y_0 is defined to be the Start state
P(End | y_M): prior probability of y_M being the final state (not always used)

1st-Order Hidden Markov Model
Conditional distribution of x given y:  P(x | y) = ∏_{i=1}^{M} P(x_i | y_i)
Given a POS tag sequence y, we can compute each P(x_i | y_i) independently (x_i is conditionally independent of everything else given y_i).
P(x_i | y_i): probability of state y_i generating x_i
P(y_{i+1} | y_i): probability of state y_i transitioning to y_{i+1}
P(y_1 | y_0): y_0 is defined to be the Start state
P(End | y_M): prior probability of y_M being the final state (not always used)

1st-Order Hidden Markov Model
Additional complexity of (#POS tags)^2: models all state-state pairs (all tag-tag pairs).
Models all state-observation pairs (all tag-word pairs): same complexity as independent multiclass.
P(x_i | y_i): probability of state y_i generating x_i
P(y_{i+1} | y_i): probability of state y_i transitioning to y_{i+1}
P(y_1 | y_0): y_0 is defined to be the Start state
P(End | y_M): prior probability of y_M being the final state (not always used)

Relationship to Naïve Bayes
(Figure: the HMM chain Y_0 -> Y_1 -> ... -> Y_M -> Y_End with emissions X_1, ..., X_M, compared to the same model with the transition edges removed.)
The HMM reduces to a sequence of disjoint Naïve Bayes models if we ignore the transition probabilities.

P(word | state/tag)
Two-word language: "fish" and "sleep". Two-tag language: Noun and Verb.

P(x|y)      y=Noun   y=Verb
x=fish       0.8      0.5
x=sleep      0.2      0.5

Given a tag sequence y:
P("fish sleep"  | (N, V)) = 0.8 * 0.5
P("fish fish"   | (N, V)) = 0.8 * 0.5
P("sleep fish"  | (V, V)) = 0.5 * 0.5
P("sleep sleep" | (N, N)) = 0.2 * 0.2
(Slides borrowed from Ralph Grishman.)
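
A quick sketch that evaluates P(x | y) as the product of the emission entries in this table; it reproduces, for example, P("fish sleep" | (N, V)) = 0.8 * 0.5.

```python
# Sketch: P(x | y) for the two-word toy language, using the emission table above.
P_x_given_y = {
    "Noun": {"fish": 0.8, "sleep": 0.2},
    "Verb": {"fish": 0.5, "sleep": 0.5},
}

def cond_prob(words, tags):
    """P(x | y) = prod_i P(x_i | y_i)."""
    p = 1.0
    for w, t in zip(words, tags):
        p *= P_x_given_y[t][w]
    return p

print(cond_prob(["fish", "sleep"], ["Noun", "Verb"]))   # 0.8 * 0.5 = 0.4
print(cond_prob(["sleep", "sleep"], ["Noun", "Noun"]))  # 0.2 * 0.2 = 0.04
```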

Sampling
HMMs are generative models: they model the joint distribution P(x, y), so we can generate samples from it.
First consider the conditional distribution P(x | y).

P(x|y)      y=Noun   y=Verb
x=fish       0.8      0.5
x=sleep      0.2      0.5

Given tag sequence y = (N, V), sample each word independently:
Sample x_1 from P(x_1 | N): (0.8 fish, 0.2 sleep)
Sample x_2 from P(x_2 | V): (0.5 fish, 0.5 sleep)
What about sampling from P(x, y)?

Forward Sampling of P(y, x)
P(x, y) = P(End | y_M) · ∏_{i=1}^{M} P(y_i | y_{i-1}) · ∏_{i=1}^{M} P(x_i | y_i)
(Figure: transition diagram over start, noun, verb, and end states, with the same emission table as before.)
Initialize y_0 = Start and i = 0.
1. i = i + 1
2. Sample y_i from P(y_i | y_{i-1})
3. If y_i == End: quit
4. Sample x_i from P(x_i | y_i)
5. Go to step 1
Exploits conditional independence. Requires P(End | y_i).
(Slides borrowed from Ralph Grishman.)
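
A minimal forward-sampling sketch for the toy fish/sleep HMM. The emission table is from the slide; the full transition table is read off the Viterbi walkthrough later in the lecture (the diagram here labels only some arrows), so treat those exact numbers as an assumption.

```python
import random

# Sketch of forward sampling from the joint P(x, y) for the toy fish/sleep HMM.
# Transition values are reconstructed from the Viterbi walkthrough; treat as illustrative.
TRANS = {   # P(y_i | y_{i-1})
    "Start": {"Noun": 0.8, "Verb": 0.2},
    "Noun":  {"Verb": 0.8, "Noun": 0.1, "End": 0.1},
    "Verb":  {"End": 0.7, "Noun": 0.2, "Verb": 0.1},
}
EMIT = {    # P(x_i | y_i)
    "Noun": {"fish": 0.8, "sleep": 0.2},
    "Verb": {"fish": 0.5, "sleep": 0.5},
}

def sample_sequence(rng=random):
    """Forward-sample (y, x) until the End state is reached."""
    y, x, state = [], [], "Start"
    while True:
        nxt = rng.choices(list(TRANS[state]), weights=list(TRANS[state].values()))[0]
        if nxt == "End":
            return y, x
        y.append(nxt)
        word = rng.choices(list(EMIT[nxt]), weights=list(EMIT[nxt].values()))[0]
        x.append(word)
        state = nxt

print(sample_sequence())   # e.g. (['Noun', 'Verb'], ['fish', 'sleep'])
```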

Forward Sampling of P(y, x | M)
P(x, y | M) = ∏_{i=1}^{M} P(y_i | y_{i-1}) · ∏_{i=1}^{M} P(x_i | y_i)
(Figure: transition diagram over start, noun, verb with no end state: P(Noun|Start)=0.8, P(Verb|Start)=0.2, P(Verb|Noun)=0.91, P(Noun|Noun)=0.09, P(Noun|Verb)=0.667, P(Verb|Verb)=0.333; emission table as before.)
Initialize y_0 = Start and i = 0.
1. i = i + 1
2. If i > M: quit
3. Sample y_i from P(y_i | y_{i-1})
4. Sample x_i from P(x_i | y_i)
5. Go to step 1
Exploits conditional independence. Assumes no P(End | y_i).
(Slides borrowed from Ralph Grishman.)

1st-Order Hidden Markov Model
P(x_{k+1:M}, y_{k+1:M} | x_{1:k}, y_{1:k}) = P(x_{k+1:M}, y_{k+1:M} | y_k)
Memory-less: the model only needs y_k to model the rest of the sequence.
P(x_i | y_i): probability of state y_i generating x_i
P(y_{i+1} | y_i): probability of state y_i transitioning to y_{i+1}
P(y_1 | y_0): y_0 is defined to be the Start state
P(End | y_M): prior probability of y_M being the final state (not always used)

Viterbi Algorithm

Most Common Prediction Problem
Given an input sentence, predict the POS tag sequence:  argmax_y P(y | x)
Naïve approach: try all possible y's and choose the one with highest probability.
Exponential time: L^M possible y's.

argmax_y P(y | x) = argmax_y P(y, x) / P(x)          (Bayes's rule)
                  = argmax_y P(y, x)
                  = argmax_y P(x | y) P(y)
where  P(x | y) = ∏_{i=1}^{M} P(x_i | y_i)   and   P(y) = P(End | y_M) · ∏_{i=1}^{M} P(y_i | y_{i-1})

argmax_y P(y, x) = argmax_y ∏_{i=1}^{M} P(y_i | y_{i-1}) · ∏_{i=1}^{M} P(x_i | y_i)
                 = argmax_{y_M} argmax_{y_{1:M-1}} P(y_M | y_{M-1}) P(x_M | y_M) P(y_{1:M-1}, x_{1:M-1})
where  P(y_{1:k}, x_{1:k}) = P(x_{1:k} | y_{1:k}) P(y_{1:k}),   P(x_{1:k} | y_{1:k}) = ∏_{i=1}^{k} P(x_i | y_i),   P(y_{1:k}) = ∏_{i=1}^{k} P(y_i | y_{i-1})
Exploit the memory-less property: the choice of y_M only depends on y_{1:M-1} via P(y_M | y_{M-1})!

Dynamic Programming
Input: x = (x_1, x_2, x_3, ..., x_M)
Computed: the best length-k prefix ending in each tag. Examples:
Ŷ_k(V) = [ argmax_{y_{1:k-1}} P(y_{1:k-1} ⊕ V, x_{1:k}) ] ⊕ V
Ŷ_k(N) = [ argmax_{y_{1:k-1}} P(y_{1:k-1} ⊕ N, x_{1:k}) ] ⊕ N
(⊕ denotes sequence concatenation.)
Claim:
Ŷ_{k+1}(V) = [ argmax_{y_{1:k} ∈ {Ŷ_k(T)}_T} P(y_{1:k} ⊕ V, x_{1:k+1}) ] ⊕ V
           = [ argmax_{y_{1:k} ∈ {Ŷ_k(T)}_T} P(y_{1:k}, x_{1:k}) · P(y_{k+1}=V | y_k) · P(x_{k+1} | y_{k+1}=V) ] ⊕ V
The P(y_{1:k}, x_{1:k}) terms are pre-computed. Recursive definition!

Solve:
Ŷ_2(V) = [ argmax_{y_1 ∈ {Ŷ_1(T)}_T} P(y_1, x_1) · P(y_2 = V | y_1) · P(x_2 | y_2 = V) ] ⊕ V
Store each Ŷ_1(Z) and P(Ŷ_1(Z), x_1).
(Figure: trellis with nodes Ŷ_1(V), Ŷ_1(D), Ŷ_1(N) on the left and Ŷ_2(V), Ŷ_2(D), Ŷ_2(N) on the right; each Ŷ_2 node considers candidates y_1 = V, D, N.)
Ŷ_1(Z) is just Z.   Example: Ŷ_2(V) = (N, V).

Solve:
Ŷ_3(V) = [ argmax_{y_{1:2} ∈ {Ŷ_2(T)}_T} P(y_{1:2}, x_{1:2}) · P(y_3 = V | y_2) · P(x_3 | y_3 = V) ] ⊕ V
Store each Ŷ_1(Z) and P(Ŷ_1(Z), x_1); store each Ŷ_2(Z) and P(Ŷ_2(Z), x_{1:2}).
(Figure: trellis with columns Ŷ_1(·), Ŷ_2(·), Ŷ_3(·); each Ŷ_3 node considers candidates y_2 = V, D, N.)
Claim: we only need to check the solutions Ŷ_2(Z) for Z = V, D, N.
Example: suppose Ŷ_2(V) = (N, V), and suppose someone proposes Ŷ_3(V) = (V, V, V); we can prove that (N, V, V) has higher probability.
The proof depends on the 1st-order property: the probabilities of (V, V, V) and (N, V, V) differ only in the three terms P(y_1 | y_0), P(x_1 | y_1), P(y_2 | y_1), and none of these depend on y_3!

Ŷ_M(V) = [ argmax_{y_{1:M-1} ∈ {Ŷ_{M-1}(T)}_T} P(y_{1:M-1}, x_{1:M-1}) · P(y_M = V | y_{M-1}) · P(x_M | y_M = V) · P(End | y_M = V) ] ⊕ V
(The P(End | y_M = V) term is optional.)
Store each Ŷ_1(Z) and P(Ŷ_1(Z), x_1); each Ŷ_2(Z) and P(Ŷ_2(Z), x_{1:2}); each Ŷ_3(Z) and P(Ŷ_3(Z), x_{1:3}); and so on.
(Figure: trellis columns Ŷ_1(·), Ŷ_2(·), Ŷ_3(·), ..., Ŷ_M(·) over the tags V, D, N.)
Examples: Ŷ_2(V) = (N, V),  Ŷ_3(V) = (D, N, V).

Viterbi Algorithm
Solve:  argmax_y P(y | x) = argmax_y P(y, x) / P(x) = argmax_y P(y, x) = argmax_y P(x | y) P(y)
For k = 1..M: iteratively solve for each Ŷ_k(Z), with Z looping over every POS tag.
Predict the best Ŷ_M(Z).
Also known as Maximum A Posteriori (MAP) inference.

Numerical Example
x = ("Fish", "Sleep")
(Figure: transition diagram over start, noun, verb, end. From the walkthrough below: P(Noun|Start)=0.8, P(Verb|Start)=0.2, P(Verb|Noun)=0.8, P(Noun|Noun)=0.1, P(End|Noun)=0.1, P(Verb|Verb)=0.1, P(Noun|Verb)=0.2, P(End|Verb)=0.7. Emissions: P(fish|Noun)=0.8, P(sleep|Noun)=0.2, P(fish|Verb)=0.5, P(sleep|Verb)=0.5.)
(Slides borrowed from Ralph Grishman.)

(Transition diagram and emission table as above.)
Viterbi table, initialization:
         0    1    2    3
start    1
verb     0
noun     0
end      0

Token 1: "fish"
verb: 0.2 * 0.5 = 0.1    (P(Verb|Start) * P(fish|Verb))
noun: 0.8 * 0.8 = 0.64   (P(Noun|Start) * P(fish|Noun))
         0    1
start    1    0
verb     0    0.1
noun     0    0.64
end      0    0

Token 2: "sleep" (if "fish" is a verb)
verb: 0.1 * 0.1 * 0.5 = 0.005   (score(verb) * P(Verb|Verb) * P(sleep|Verb))
noun: 0.1 * 0.2 * 0.2 = 0.004   (score(verb) * P(Noun|Verb) * P(sleep|Noun))

Token 2: "sleep" (if "fish" is a noun)
verb: 0.64 * 0.8 * 0.5 = 0.256    (score(noun) * P(Verb|Noun) * P(sleep|Verb))
noun: 0.64 * 0.1 * 0.2 = 0.0128   (score(noun) * P(Noun|Noun) * P(sleep|Noun))
Column 2 now has two candidates per state: verb {0.005, 0.256} and noun {0.004, 0.0128}.

Token 2: "sleep": take the maximum and set back pointers.
         0    1      2
start    1    0      0
verb     0    0.1    0.256
noun     0    0.64   0.0128
end      0    0      -

Token 3: end
end: candidates 0.256 * 0.7 = 0.1792 (from verb) and 0.0128 * 0.1 = 0.00128 (from noun); take the maximum, set back pointers.
         0    1      2        3
start    1    0      0        0
verb     0    0.1    0.256    -
noun     0    0.64   0.0128   -
end      0    0      -        0.256 * 0.7

Decode: "fish" = noun, "sleep" = verb.
Follow the back pointers from the winning end entry (0.256 * 0.7): end <- verb <- noun <- start.
         0    1      2        3
start    1    0      0        0
verb     0    0.1    0.256    -
noun     0    0.64   0.0128   -
end      0    0      -        0.256 * 0.7
(Slides borrowed from Ralph Grishman.)

What might go wrong for long sequences? Underflow!
Small numbers get repeatedly multiplied together and become exponentially small.
(Slides borrowed from Ralph Grishman.)

Viterbi Algorithm (with Log Probabilities)
Solve:  argmax_y P(y | x) = argmax_y P(y, x) / P(x) = argmax_y P(y, x) = argmax_y [ log P(x | y) + log P(y) ]
For k = 1..M: iteratively solve for each Ŷ_k(Z) and its log-probability, with Z looping over every POS tag.
Predict the best Ŷ_M(Z).
The log-probability of Ŷ_M(Z) accumulates additively, not multiplicatively, so it does not underflow.
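
A sketch of the Viterbi recursion in log space for the fish/sleep example above; it follows the back pointers to recover (Noun, Verb) and reproduces the winning score 0.256 * 0.7 = 0.1792. This is an illustrative implementation, not the lecture's reference code.

```python
import math

# Sketch of the Viterbi algorithm in log space for the fish/sleep example.
# Transition/emission numbers follow the numerical walkthrough above.
TRANS = {  # P(y_i | y_{i-1}), including the End state
    "Start": {"Noun": 0.8, "Verb": 0.2},
    "Noun":  {"Verb": 0.8, "Noun": 0.1, "End": 0.1},
    "Verb":  {"Verb": 0.1, "Noun": 0.2, "End": 0.7},
}
EMIT = {   # P(x_i | y_i)
    "Noun": {"fish": 0.8, "sleep": 0.2},
    "Verb": {"fish": 0.5, "sleep": 0.5},
}
TAGS = ["Noun", "Verb"]

def viterbi(words):
    """Return the most likely tag sequence and its log joint probability."""
    # score[z] = best log P(y_1:k ending in z, x_1:k); back holds best predecessors
    score = {z: math.log(TRANS["Start"][z]) + math.log(EMIT[z][words[0]]) for z in TAGS}
    back = []
    for word in words[1:]:
        new_score, pointers = {}, {}
        for z in TAGS:
            best_prev = max(TAGS, key=lambda p: score[p] + math.log(TRANS[p][z]))
            new_score[z] = (score[best_prev] + math.log(TRANS[best_prev][z])
                            + math.log(EMIT[z][word]))
            pointers[z] = best_prev
        score, back = new_score, back + [pointers]
    # Fold in the optional End-state term, then follow back pointers.
    last = max(TAGS, key=lambda z: score[z] + math.log(TRANS[z]["End"]))
    tags = [last]
    for pointers in reversed(back):
        tags.append(pointers[tags[-1]])
    tags.reverse()
    return tags, score[last] + math.log(TRANS[last]["End"])

tags, logp = viterbi(["fish", "sleep"])
print(tags, math.exp(logp))   # ['Noun', 'Verb'] 0.1792  (= 0.256 * 0.7)
```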

Recap: Independent Classification
x = "I fish often"   POS tags: Det, Noun, Verb, Adj, Adv, Prep
Treat each word independently: independent multiclass prediction per word.

P(y|x)      x="I"   x="fish"   x="often"
y=Det        0.0     0.0        0.0
y=Noun       1.0     0.75       0.0
y=Verb       0.0     0.25       0.0
y=Adj        0.0     0.0        0.4
y=Adv        0.0     0.0        0.6
y=Prep       0.0     0.0        0.0

Prediction: (N, N, Adv)   Correct: (N, V, Adv)
The mistake is due to not modeling multiple words jointly. (Assume pronouns are nouns for simplicity.)

Recap: Viterbi
Models pairwise transitions between states, i.e., pairwise transitions between POS tags (1st-order model):
P(x, y) = P(End | y_M) · ∏_{i=1}^{M} P(y_i | y_{i-1}) · ∏_{i=1}^{M} P(x_i | y_i)
x = "I fish often"    Independent: (N, N, Adv)    HMM Viterbi: (N, V, Adv)
(*Assuming we defined P(x, y) properly.)

Training HMMs

Supervised Training
Given: S = {(x_i, y_i)}_{i=1}^{N}   (x_i: word sequence / sentence, y_i: POS tag sequence)
Goal: estimate P(x, y) using S, where
P(x, y) = P(End | y_M) · ∏_{i=1}^{M} P(y_i | y_{i-1}) · ∏_{i=1}^{M} P(x_i | y_i)
Maximum likelihood!

Aside: Matrix Formulation
Define the transition matrix A with A_ab = P(y_{i+1}=a | y_i=b), or log P(y_{i+1}=a | y_i=b).

P(y_next|y)      y=Noun   y=Verb
y_next=Noun       0.09     0.667
y_next=Verb       0.91     0.333

Define the observation matrix O with O_wz = P(x_i=w | y_i=z), or log P(x_i=w | y_i=z).

P(x|y)      y=Noun   y=Verb
x=fish       0.8      0.5
x=sleep      0.2      0.5

Aside: Matrix Formulation
P(x, y) = P(End | y_M) · ∏_{i=1}^{M} P(y_i | y_{i-1}) · ∏_{i=1}^{M} P(x_i | y_i)
        = A_{End, y_M} · ∏_{i=1}^{M} A_{y_i, y_{i-1}} · ∏_{i=1}^{M} O_{x_i, y_i}
Log-probability formulation:
log P(x, y) = Ã_{End, y_M} + Σ_{i=1}^{M} Ã_{y_i, y_{i-1}} + Σ_{i=1}^{M} Õ_{x_i, y_i}
where each entry of Ã (and Õ) is defined as the log of the corresponding entry of A (and O).
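
A small numpy sketch of the log-probability matrix formulation, using the A and O tables from the matrix-formulation slide; the Start probabilities (0.8, 0.2) are taken from the sampling diagram and should be treated as an assumption, and the End term is omitted here.

```python
import numpy as np

# Sketch of the log-probability matrix formulation for the fish/sleep toy HMM.
# A[a, b] = log P(y_{i+1}=a | y_i=b), O[w, z] = log P(x_i=w | y_i=z).
TAGS = {"Noun": 0, "Verb": 1}
WORDS = {"fish": 0, "sleep": 1}

A = np.log(np.array([[0.09, 0.667],     # P(Noun | Noun), P(Noun | Verb)
                     [0.91, 0.333]]))   # P(Verb | Noun), P(Verb | Verb)
O = np.log(np.array([[0.8, 0.5],        # P(fish | Noun),  P(fish | Verb)
                     [0.2, 0.5]]))      # P(sleep | Noun), P(sleep | Verb)
log_start = np.log(np.array([0.8, 0.2]))  # P(y_1 | Start), assumed from the sampling diagram

def log_joint(words, tags):
    """log P(x, y) = sum of log transition and log emission entries (no End term)."""
    t = [TAGS[z] for z in tags]
    w = [WORDS[x] for x in words]
    lp = log_start[t[0]] + O[w[0], t[0]]
    for i in range(1, len(t)):
        lp += A[t[i], t[i - 1]] + O[w[i], t[i]]
    return lp

print(np.exp(log_joint(["fish", "sleep"], ["Noun", "Verb"])))  # 0.8*0.8*0.91*0.5 = 0.2912
```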

Maximum Likelihood
argmax_{A,O} ∏_{(x,y)∈S} P(x, y) = argmax_{A,O} ∏_{(x,y)∈S} P(End | y_M) · ∏_{i=1}^{M} P(y_i | y_{i-1}) · ∏_{i=1}^{M} P(x_i | y_i)
Estimate each component separately:
A_ab = [ Σ_{j=1}^{N} Σ_{i=0}^{M_j} 1(y_{i+1}^j = a) 1(y_i^j = b) ]  /  [ Σ_{j=1}^{N} Σ_{i=0}^{M_j} 1(y_i^j = b) ]
O_wz = [ Σ_{j=1}^{N} Σ_{i=1}^{M_j} 1(x_i^j = w) 1(y_i^j = z) ]  /  [ Σ_{j=1}^{N} Σ_{i=1}^{M_j} 1(y_i^j = z) ]
(Derived by minimizing the negative log likelihood.)
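
A minimal counting sketch of these maximum-likelihood estimates. The tiny training set is made up for illustration; Start and End are treated as ordinary states in the transition counts.

```python
from collections import Counter, defaultdict

# Sketch of supervised maximum-likelihood training by counting, following the
# formulas above. The toy training set is made up purely for illustration.
train = [
    (["fish", "sleep"], ["Noun", "Verb"]),
    (["fish", "fish"],  ["Noun", "Verb"]),
    (["sleep"],         ["Noun"]),
]

trans_counts = defaultdict(Counter)   # trans_counts[b][a] = #(y_i = b, y_{i+1} = a)
emit_counts = defaultdict(Counter)    # emit_counts[z][w]  = #(y_i = z, x_i = w)

for words, tags in train:
    states = ["Start"] + tags + ["End"]
    for b, a in zip(states[:-1], states[1:]):
        trans_counts[b][a] += 1
    for w, z in zip(words, tags):
        emit_counts[z][w] += 1

# Normalize counts into conditional probabilities A_ab and O_wz.
A = {b: {a: c / sum(cnt.values()) for a, c in cnt.items()} for b, cnt in trans_counts.items()}
O = {z: {w: c / sum(cnt.values()) for w, c in cnt.items()} for z, cnt in emit_counts.items()}
print(A["Noun"])   # {'Verb': 0.666..., 'End': 0.333...} for this toy data
print(O["Noun"])   # {'fish': 0.666..., 'sleep': 0.333...}
```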

Recap: Supervised Training
argmax_{A,O} ∏_{(x,y)∈S} P(x, y) = argmax_{A,O} ∏_{(x,y)∈S} P(End | y_M) · ∏_{i=1}^{M} P(y_i | y_{i-1}) · ∏_{i=1}^{M} P(x_i | y_i)
Maximum likelihood training reduces to counting statistics. Super easy! Why?
What about the unsupervised case?

Conditional Independence Assumptions
argmax_{A,O} ∏_{(x,y)∈S} P(x, y) = argmax_{A,O} ∏_{(x,y)∈S} P(End | y_M) · ∏_{i=1}^{M} P(y_i | y_{i-1}) · ∏_{i=1}^{M} P(x_i | y_i)
Everything decomposes into products of pairs, i.e., P(y_{i+1}=a | y_i=b) doesn't depend on anything else.
So we can just estimate frequencies: how often y_{i+1}=a when y_i=b over the training set.
Note that P(y_{i+1}=a | y_i=b) is a common model across all locations of all sequences.
#Parameters: transitions A: (#Tags)^2; observations O: #Words x #Tags.
This avoids directly modeling word/word pairings (#Tags is in the 10s, #Words is in the 10000s).

Unsupervised Training
What if there are no y's? Just a training set of sentences S = {x_i}_{i=1}^{N} (each x_i a word sequence).
We still want to estimate P(x, y). How? Why?
argmax_{A,O} ∏_i P(x_i) = argmax_{A,O} ∏_i Σ_y P(x_i, y)

Why Unsupervised Training?
Supervised data is hard to acquire: it requires annotating POS tags. Unsupervised data is plentiful: just grab some text!
It might just work for POS tagging, learning y's that correspond to POS tags.
It can also be used for other tasks: detecting outlier sentences (sentences with low probability), or sampling new sentences.

EM Algorithm (Baum-Welch)
If we had the y's, we could do maximum likelihood. If we had (A, O), we could predict the y's. Chicken vs. egg!
1. Initialize A and O arbitrarily.
2. Predict the probability of the y's for each training x.   (Expectation step)
3. Use the y's to estimate a new (A, O).                      (Maximization step)
4. Repeat from Step 2 until convergence.
http://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm

Expectation Step
Given (A, O), for each training x = (x_1, ..., x_M), predict P(y_i) for each y = (y_1, ..., y_M):

              x_1    x_2    ...    x_M
P(y_i=Noun)   0.5    0.4    ...    0.05
P(y_i=Det)    0.4    0.6    ...    0.25
P(y_i=Verb)   0.0    ...    ...    0.7
...
This encodes the current model's beliefs about y: the marginal distribution of each y_i.

Recall: Matrix Formulation
Transition matrix A: A_ab = P(y_{i+1}=a | y_i=b), or log P(y_{i+1}=a | y_i=b).

P(y_next|y)      y=Noun   y=Verb
y_next=Noun       0.09     0.667
y_next=Verb       0.91     0.333

Observation matrix O: O_wz = P(x_i=w | y_i=z), or log P(x_i=w | y_i=z).

P(x|y)      y=Noun   y=Verb
x=fish       0.8      0.5
x=sleep      0.2      0.5

Maximization Step
Maximum likelihood over the marginal distributions.
Supervised:
A_ab = [ Σ_{j=1}^{N} Σ_{i=0}^{M_j} 1(y_{i+1}^j = a) 1(y_i^j = b) ]  /  [ Σ_{j=1}^{N} Σ_{i=0}^{M_j} 1(y_i^j = b) ]
O_wz = [ Σ_{j=1}^{N} Σ_{i=1}^{M_j} 1(x_i^j = w) 1(y_i^j = z) ]  /  [ Σ_{j=1}^{N} Σ_{i=1}^{M_j} 1(y_i^j = z) ]
Unsupervised (replace the indicators with marginals):
A_ab = [ Σ_{j=1}^{N} Σ_{i=0}^{M_j} P(y_i^j = b, y_{i+1}^j = a) ]  /  [ Σ_{j=1}^{N} Σ_{i=0}^{M_j} P(y_i^j = b) ]
O_wz = [ Σ_{j=1}^{N} Σ_{i=1}^{M_j} 1(x_i^j = w) P(y_i^j = z) ]  /  [ Σ_{j=1}^{N} Σ_{i=1}^{M_j} P(y_i^j = z) ]
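
A minimal sketch of the unsupervised M-step for the emission matrix O, assuming per-position marginals P(y_i = z | x) have already been computed by forward-backward. The sentence and marginal values below are made up purely to illustrate the expected-count update; a real implementation would also sum over all training sequences and update A from the pairwise marginals.

```python
from collections import defaultdict

# Sketch of the unsupervised M-step for the emission probabilities O_wz,
# using per-position marginals P(y_i = z | x). Values are illustrative only.
sentence = ["I", "fish", "often"]
marginals = [   # marginals[i][z] = P(y_i = z | x), one dict per position
    {"Noun": 0.9, "Verb": 0.1},
    {"Noun": 0.3, "Verb": 0.7},
    {"Noun": 0.2, "Verb": 0.8},
]

num = defaultdict(lambda: defaultdict(float))   # num[z][w] = sum_i 1(x_i=w) P(y_i=z)
den = defaultdict(float)                        # den[z]    = sum_i P(y_i=z)
for word, post in zip(sentence, marginals):
    for z, p in post.items():
        num[z][word] += p
        den[z] += p

O = {z: {w: c / den[z] for w, c in num[z].items()} for z in num}
print(O["Noun"])   # expected-count estimate of P(word | Noun): {'I': ~0.64, 'fish': ~0.21, 'often': ~0.14}
```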

Computing Marginals (Forward-Backward Algorithm)
Solving the E-step requires computing the marginals:

              x_1    x_2    ...    x_M
P(y_i=Noun)   0.5    0.4    ...    0.05
P(y_i=Det)    0.4    0.6    ...    0.25
P(y_i=Verb)   0.0    ...    ...    0.7
...
This can be solved using dynamic programming, similar to Viterbi!

Notation
α_z(i) = P(x_{1:i}, y_i = Z | A, O): probability of observing the prefix x_{1:i} and having the i-th state be y_i = Z.
β_z(i) = P(x_{i+1:M} | y_i = Z, A, O): probability of observing the suffix x_{i+1:M} given that the i-th state is y_i = Z.
Computing marginals = combining the two terms:
P(y_i = z | x) = α_z(i) β_z(i) / Σ_{z'} α_{z'}(i) β_{z'}(i)
http://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm

Notation
Pairwise marginals are obtained by combining the same two terms:
P(y_{i-1} = a, y_i = b | x) = α_a(i-1) P(y_i = b | y_{i-1} = a) P(x_i | y_i = b) β_b(i) / Σ_{a',b'} α_{a'}(i-1) P(y_i = b' | y_{i-1} = a') P(x_i | y_i = b') β_{b'}(i)
http://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm

Forward (sub-)Algorithm
Solve for every α_z(i) = P(x_{1:i}, y_i = Z | A, O).
Naively: α_z(i) = Σ_{y_{1:i-1}} P(x_{1:i}, y_i = Z, y_{1:i-1} | A, O).   Exponential time!
It can be computed recursively (like Viterbi):
α_z(1) = P(y_1 = z | y_0) P(x_1 | y_1 = z) = O_{x_1, z} A_{z, Start}
α_z(i+1) = O_{x_{i+1}, z} Σ_{j=1}^{L} α_j(i) A_{z, j}
Viterbi effectively replaces the sum with a max.
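
A sketch of the forward recursion for the fish/sleep toy HMM, using the same transition and emission numbers as the earlier examples; note that α_2(Noun) = 0.0128 + 0.004 and α_2(Verb) = 0.256 + 0.005 are the sums of the candidate values from the Viterbi walkthrough (sum instead of max).

```python
# Sketch of the forward recursion alpha_z(i) = P(x_{1:i}, y_i = z) for the
# fish/sleep toy HMM. Treat the exact transition numbers as illustrative.
TRANS = {
    "Start": {"Noun": 0.8, "Verb": 0.2},
    "Noun":  {"Verb": 0.8, "Noun": 0.1, "End": 0.1},
    "Verb":  {"Verb": 0.1, "Noun": 0.2, "End": 0.7},
}
EMIT = {
    "Noun": {"fish": 0.8, "sleep": 0.2},
    "Verb": {"fish": 0.5, "sleep": 0.5},
}
TAGS = ["Noun", "Verb"]

def forward(words):
    """Return alphas, where alphas[i][z] = P(x_{1:i+1}, y_{i+1} = z)."""
    alphas = [{z: TRANS["Start"][z] * EMIT[z][words[0]] for z in TAGS}]
    for word in words[1:]:
        prev = alphas[-1]
        alphas.append({z: EMIT[z][word] * sum(prev[j] * TRANS[j][z] for j in TAGS)
                       for z in TAGS})
    return alphas

print(forward(["fish", "sleep"]))
# alpha_1: {'Noun': 0.64, 'Verb': 0.1}; alpha_2: {'Noun': 0.0168, 'Verb': 0.261}
```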

Backward (sub-)Algorithm
Solve for every β_z(i) = P(x_{i+1:M} | y_i = Z, A, O).
Naively: β_z(i) = Σ_{y_{i+1:M}} P(x_{i+1:M}, y_{i+1:M} | y_i = Z, A, O).   Exponential time!
It can be computed recursively (like Viterbi):
β_z(M) = 1
β_z(i) = Σ_{j=1}^{L} β_j(i+1) A_{j, z} O_{x_{i+1}, j}

Forward-Backward Algorithm
Run forward:  α_z(i) = P(x_{1:i}, y_i = Z | A, O)
Run backward: β_z(i) = P(x_{i+1:M} | y_i = Z, A, O)
For each training x = (x_1, ..., x_M), compute each P(y_i) for y = (y_1, ..., y_M):
P(y_i = z | x) = α_z(i) β_z(i) / Σ_{z'} α_{z'}(i) β_{z'}(i)
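
A self-contained sketch of the backward recursion and the resulting marginals P(y_i = z | x) for the toy HMM; the End state is ignored here to match β_z(M) = 1 on the slide, and the transition numbers are the same illustrative ones as above.

```python
# Sketch of the backward recursion and forward-backward marginals P(y_i = z | x)
# for the fish/sleep toy HMM (End state ignored, matching beta_z(M) = 1).
TRANS = {"Start": {"Noun": 0.8, "Verb": 0.2},
         "Noun":  {"Verb": 0.8, "Noun": 0.1, "End": 0.1},
         "Verb":  {"Verb": 0.1, "Noun": 0.2, "End": 0.7}}
EMIT = {"Noun": {"fish": 0.8, "sleep": 0.2},
        "Verb": {"fish": 0.5, "sleep": 0.5}}
TAGS = ["Noun", "Verb"]

def forward(words):
    alphas = [{z: TRANS["Start"][z] * EMIT[z][words[0]] for z in TAGS}]
    for word in words[1:]:
        prev = alphas[-1]
        alphas.append({z: EMIT[z][word] * sum(prev[j] * TRANS[j][z] for j in TAGS)
                       for z in TAGS})
    return alphas

def backward(words):
    betas = [{z: 1.0 for z in TAGS}]             # beta_z(M) = 1
    for i in range(len(words) - 1, 0, -1):
        nxt = betas[0]
        betas.insert(0, {z: sum(nxt[j] * TRANS[z][j] * EMIT[j][words[i]] for j in TAGS)
                         for z in TAGS})
    return betas

def marginals(words):
    """P(y_i = z | x) = alpha_z(i) beta_z(i) / sum_z' alpha_z'(i) beta_z'(i)."""
    alphas, betas = forward(words), backward(words)
    out = []
    for a, b in zip(alphas, betas):
        unnorm = {z: a[z] * b[z] for z in TAGS}
        total = sum(unnorm.values())
        out.append({z: v / total for z, v in unnorm.items()})
    return out

print(marginals(["fish", "sleep"]))
# e.g. P(y_1=Noun | x) ~ 0.97 and P(y_2=Verb | x) ~ 0.94
```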

Recap: Unsupervised Training
Train using only word sequences: S = {x_i}_{i=1}^{N} (each x_i a word sequence / sentence).
The y's are hidden states: all pairwise transitions are through the y's, hence "hidden" Markov model.
Train using the EM algorithm; it converges to a local optimum.

Initialization
How to choose the number of hidden states? By hand, or by cross-validation on P(x) for validation data.
We can compute P(x) via the forward algorithm:
P(x) = Σ_y P(x, y) = Σ_z α_z(M) P(End | y_M = z)
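
As a usage example, reusing forward() and TRANS from the forward-backward sketch above, P(x) can be computed by folding in the End probabilities at the last position (the resulting value is specific to the illustrative toy HMM).

```python
# Usage example, reusing forward() and TRANS from the forward-backward sketch above:
# P(x) = sum_z alpha_z(M) * P(End | y_M = z) for the fish/sleep toy HMM.
def sentence_prob(words):
    last_alpha = forward(words)[-1]
    return sum(last_alpha[z] * TRANS[z]["End"] for z in last_alpha)

print(sentence_prob(["fish", "sleep"]))   # 0.0168*0.1 + 0.261*0.7 = 0.18438
```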

Recap: Sequence Prediction & HMMs
HMMs model pairwise dependencies in sequences.
x = "I fish often"   POS tags: Det, Noun, Verb, Adj, Adv, Prep
Independent: (N, N, Adv)    HMM Viterbi: (N, V, Adv)
Compact: only pairwise dependencies between the y's are modeled.
Main limitation: lots of independence assumptions lead to poor predictive accuracy.

Next Week
Conditional Random Fields: a sequential version of logistic regression.
They remove many independence assumptions and are more accurate in practice, but can only be trained in the supervised setting.
Recitation tonight: recap of Viterbi and Forward/Backward.