Lecture 12: EM Algorithm


Lecture 12: EM Algorithm. Kai-Wei Chang, CS @ University of Virginia. kw@kwchang.net. Course webpage: http://kwchang.net/teaching/nlp16

Three basic problems for HMMs
- Likelihood of the input (how likely is it that the sentence "I love cat" occurs?): forward algorithm
- Decoding (tagging) the input (what are the POS tags of "I love cat"?): Viterbi algorithm
- Estimation (learning): how do we learn the model? Find the best model parameters.
  - Case 1: supervised, the tags are annotated: maximum likelihood estimation (MLE)
  - Case 2: unsupervised, only unannotated text: forward-backward algorithm

EM algorithm
- POS induction: can we tag POS without annotated data?
- An old idea
- Good mathematical intuition
- Tutorial papers: ftp://ftp.icsi.berkeley.edu/pub/techreports/1997/tr-97-021.pdf and http://people.csail.mit.edu/regina/6864/em_notes_mike.pdf

Hard EM (intuition)
- We don't know the hidden states (i.e., the POS tags)
- If we know the model...

Recap: learning from labeled data
- If we know the hidden states (labels),
- we count how often we see t_{i-1} t_i and w_i t_i, then normalize.
- Example: three tagged observation sequences, 2 3 2 / 2 3 1 / 2 3 2.
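
The count-and-normalize step can be written out in a few lines. This is a minimal sketch, not the lecture's code: the observation sequences are the three above, but the tag sequences are made-up placeholders (written with the state names C and H used in the reconstructed tables below), since the actual tag annotations are only drawn on the slide.

    from collections import Counter

    # Supervised MLE for an HMM: count t_{i-1} t_i and w_i t_i pairs, then normalize.
    # The tag sequences below are illustrative placeholders, not the slide's tags.
    labeled = [([2, 3, 2], ["C", "H", "C"]),
               ([2, 3, 1], ["C", "H", "C"]),
               ([2, 3, 2], ["C", "H", "C"])]

    trans, emit, from_tot, tag_tot = Counter(), Counter(), Counter(), Counter()
    for words, tags in labeled:
        prev = "Start"
        for w, t in zip(words, tags):
            trans[(prev, t)] += 1   # count of t_{i-1} t_i (with t_0 = Start)
            emit[(t, w)] += 1       # count of w_i together with its tag t_i
            from_tot[prev] += 1
            tag_tot[t] += 1
            prev = t

    P_trans = {(p, t): c / from_tot[p] for (p, t), c in trans.items()}  # P(t_i | t_{i-1})
    P_emit = {(t, w): c / tag_tot[t] for (t, w), c in emit.items()}     # P(w_i | t_i)
    print(P_trans)
    print(P_emit)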

Recap: tagging the input
- If we know the model, we can find the best tag sequence.

Hard EM (intuition)
- We don't know the hidden states (i.e., the POS tags)
  1. Let's guess!
  2. Then we have labels; we can estimate the model.
  3. Check whether the model is consistent with the labels we guessed; if not, go back to Step 1.

Let's make a guess

               P(·|C)   P(·|H)   P(·|Start)
    P(1|·)       ?        0          -
    P(2|·)       ?        ?          -
    P(3|·)       0        ?          -
    P(C|·)      0.8      0.2        0.5
    P(H|·)      0.2      0.8        0.5

    Tags: ? ? ?   Obs: 2 3 2
    Tags: ? ? ?   Obs: 2 3 1
    Tags: ? ? ?   Obs: 2 3 2

These are obvious
- Any position with observation 3 must be tagged H (since P(3|C) = 0), and the position with observation 1 must be tagged C (since P(1|H) = 0).
- The remaining positions (the 2's) are still unknown. (Model table as above, with the "?" emission entries still to be filled.)

Guess more
- Guess tags for more of the remaining positions (the 2's).
- (Model table as above; observations 2 3 2 / 2 3 1 / 2 3 2.)

Guess all of them
- Now every position has a guessed tag, and we can estimate the model by MLE.
- (Model table as above; the three observation sequences 2 3 2 / 2 3 1 / 2 3 2 now carry guessed tags.)

Is our guess consistent with the model?

               P(·|C)   P(·|H)   P(·|Start)
    P(1|·)      0.5       0          -
    P(2|·)      0.5     0.625        -
    P(3|·)       0      0.375        -
    P(C|·)      0.8      0.2        0.5
    P(H|·)      0.2      0.8        0.5

    Observations: 2 3 2 / 2 3 1 / 2 3 2, with the guessed tags.

How to find latent states based on our model? Viterbi!
- (Model table as estimated above; observations 2 3 2 / 2 3 1 / 2 3 2 with tags to be predicted.)
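
A minimal Viterbi sketch for this step, using the estimated table above (state labels C and H as reconstructed; an illustration, not the lecture's implementation):

    # Viterbi decoding for the toy two-state HMM, with the parameter values
    # from the estimated table above (state labels as reconstructed).
    states = ["C", "H"]
    P_start = {"C": 0.5, "H": 0.5}
    P_trans = {("C", "C"): 0.8, ("C", "H"): 0.2,      # P(next | prev)
               ("H", "C"): 0.2, ("H", "H"): 0.8}
    P_emit = {("C", 1): 0.5, ("C", 2): 0.5, ("C", 3): 0.0,
              ("H", 1): 0.0, ("H", 2): 0.625, ("H", 3): 0.375}

    def viterbi(obs):
        # v[k][q]: best score of any tag sequence for obs[:k+1] that ends in q
        v = [{q: P_start[q] * P_emit[(q, obs[0])] for q in states}]
        back = [{}]
        for w in obs[1:]:
            scores, ptrs = {}, {}
            for q in states:
                best = max(states, key=lambda p: v[-1][p] * P_trans[(p, q)])
                scores[q] = v[-1][best] * P_trans[(best, q)] * P_emit[(q, w)]
                ptrs[q] = best
            v.append(scores)
            back.append(ptrs)
        # follow the back-pointers from the best final state
        tags = [max(states, key=lambda q: v[-1][q])]
        for ptrs in reversed(back[1:]):
            tags.append(ptrs[tags[-1]])
        return list(reversed(tags))

    for obs in ([2, 3, 2], [2, 3, 1], [2, 3, 2]):
        print(obs, viterbi(obs))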

Something wrong
- Decoding with Viterbi under the estimated model gives tag sequences ("From Viterbi") that differ from the ones we guessed.
- (Model table as above; observations 2 3 2 / 2 3 1 / 2 3 2 with the Viterbi tags shown.)

It's fine. Let's do it again.

               P(·|C)   P(·|H)   P(·|Start)
    P(1|·)       1        0          -
    P(2|·)       0       0.7         -
    P(3|·)       0       0.3         -
    P(C|·)      0.8      0.2        0.5
    P(H|·)      0.2      0.8        0.5

    Observations: 2 3 2 / 2 3 1 / 2 3 2, with the new tags.

This time it is consistent
- The Viterbi output ("From Viterbi") matches the tags used for estimation.
- (Same model table as above; observations 2 3 2 / 2 3 1 / 2 3 2.)

Only one solution? No! EM is sensitive to initialization.

               P(·|C)   P(·|H)   P(·|Start)
    P(1|·)      0.22      0          -
    P(2|·)      0.77      0          -
    P(3|·)       0        1          -
    P(C|·)      0.8      0.2        0.5
    P(H|·)      0.2      0.8        0.5

    Observations: 2 3 2 / 2 3 1 / 2 3 2.

How about this?

               P(·|C)   P(·|H)   P(·|Start)
    P(1|·)       ?        0          -
    P(2|·)       ?        ?          -
    P(3|·)       0        ?          -
    P(C|·)       ?        ?         0.5
    P(H|·)       ?        ?         0.5

    Tags: ? ? ?   Obs: 2 3 2
    Tags: ? ? ?   Obs: 2 3 1
    Tags: ? ? ?   Obs: 2 3 2

Hard EM
- We don't know the hidden states (i.e., the POS tags)
  1. Let's guess based on our model! Find the best sequence using the Viterbi algorithm.
  2. Then we have labels; we can estimate the model: maximum likelihood estimation.
  3. Check whether the model is consistent with the labels we guessed; if not, go back to Step 1.
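
The whole hard EM loop just described fits in a few lines. A sketch under the assumption that viterbi(obs, model) and mle_estimate(labeled_data) are available as helpers (variants of the earlier sketches); the function names and stopping rule are illustrative, not from the lecture.

    # Hard EM: alternate (1) Viterbi labeling and (2) count-and-normalize MLE,
    # and (3) stop once the guessed labels no longer change.
    # viterbi(obs, model) and mle_estimate(labeled) are assumed helpers.
    def hard_em(corpus, model, max_iters=50):
        old_labels = None
        for _ in range(max_iters):
            labels = [viterbi(obs, model) for obs in corpus]    # step 1: guess
            if labels == old_labels:                            # step 3: consistent?
                break
            model = mle_estimate(list(zip(corpus, labels)))     # step 2: re-estimate
            old_labels = labels
        return model, labels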

Soft EM
- We don't know the hidden states (i.e., the POS tags)
  1. Let's guess based on our model!
  2. Then we have labels; we can estimate the model (maximum likelihood estimation). Let's use expected counts instead!
  3. Check whether the model is consistent with the labels we guessed; if not, go back to Step 1.

Expected counts

               P(·|C)   P(·|H)   P(·|Start)
    P(1|·)      0.8       0          -
    P(2|·)      0.2      0.2         -
    P(3|·)       0       0.8         -
    P(C|·)      0.8      0.2        0.5
    P(H|·)      0.2      0.8        0.5

    Tags: ? ? ?   Obs: 1 2 2

Expected counts
- Some sequences are more likely to occur than the others.
- (Same model table and observation sequence as above.)

Expected counts
- The possible tag sequences for obs 1 2 2, with their probabilities under the model:
    C C C : 0.01024
    C C H : 0.00256
    C H C : 0.00064
    C H H : 0.00256
- (Model table as above.)

Expected counts
- Assume we draw 100,000 random samples; then we expect roughly
    C C C : 1024
    C C H : 256
    C H C : 64
    C H H : 256
  samples of each tag sequence together with obs 1 2 2.
- (Model table as above.)

Expected counts
- Let's update the model using these counts:
    C C C : 1024,  C C H : 256,  C H C : 64,  C H H : 256
- (Model table as above.)

Expected counts
- Let's update the model. How many C→C transitions? 1024*2 + 256 = 2304.
- (Counts as above: C C C : 1024, C C H : 256, C H C : 64, C H H : 256; model table as above.)

Expected counts
- How many C→C transitions? 1024*2 + 256 = 2304.
- How many C's? 1024*3 + 256*2 + 64*2 + 256 = 3968.
- P(C|C)? 2304/3968 ≈ 0.58.
- (Counts and model table as above.)

Expected counts
- P(C|C)? 2304/3968 ≈ 0.58.
- Update the table entry P(C|C) from 0.8 to 0.58. Do this for all the other entries!

               P(·|C)        P(·|H)   P(·|Start)
    P(1|·)      0.8            0          -
    P(2|·)      0.2           0.2         -
    P(3|·)       0            0.8         -
    P(C|·)      0.8 → 0.58    0.2        0.5
    P(H|·)      0.2           0.8        0.5

    (Counts: C C C : 1024, C C H : 256, C H C : 64, C H H : 256.)
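
For a sequence this short, the expected counts can be checked by brute-force enumeration. The sketch below assumes the reconstruction above: states labeled C and H, the parameter table shown, and the observation sequence 1 2 2 (the choice that reproduces the four probabilities 0.01024, 0.00256, 0.00064, 0.00256 on the earlier slide).

    from itertools import product

    # Brute-force expected counts for the toy example, reproducing the slide's numbers.
    states = ["C", "H"]
    P_start = {"C": 0.5, "H": 0.5}
    P_trans = {("C", "C"): 0.8, ("C", "H"): 0.2, ("H", "C"): 0.2, ("H", "H"): 0.8}
    P_emit = {("C", 1): 0.8, ("C", 2): 0.2, ("C", 3): 0.0,
              ("H", 1): 0.0, ("H", 2): 0.2, ("H", 3): 0.8}
    obs = [1, 2, 2]   # inferred observation sequence (see lead-in above)

    def joint(tags):
        p, prev = 1.0, None
        for w, t in zip(obs, tags):
            p *= (P_start[t] if prev is None else P_trans[(prev, t)]) * P_emit[(t, w)]
            prev = t
        return p

    probs = {t: joint(t) for t in product(states, repeat=len(obs))}
    total = sum(probs.values())   # P(obs) = 0.016

    # Expected counts: weight each tag sequence by its posterior P(t | w) = joint / total.
    exp_CC = sum(p * sum(t[i:i + 2] == ("C", "C") for i in range(len(t) - 1))
                 for t, p in probs.items()) / total
    exp_C = sum(p * t.count("C") for t, p in probs.items()) / total
    exp_1C = sum(p * sum(w == 1 and q == "C" for w, q in zip(obs, t))
                 for t, p in probs.items()) / total

    print(probs[("C", "C", "C")])   # 0.01024, as on the slide
    print(exp_CC / exp_C)           # ~0.58, the slide's updated P(C|C)
    print(exp_1C / exp_C)           # ~0.403, the slide's P(1|C)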

Are we done yet?
- What if we have 45 tags?
- What if our sentences have 20 tokens?
- We need an efficient algorithm again!

Expected counts
- P(C|C)? 2304/3968 ≈ 0.58.
- P(1|C)? (1024 + 256 + 256 + 64) / 3968 = 1600/3968 ≈ 0.403.
- (Model table as above; counts C C C : 1024, C C H : 256, C H C : 64, C H H : 256.)

Expected counts
- P(C|C)? 2304/3968 ≈ 0.58.
- P(1|C)? (1024 + 256 + 256 + 64) / 3968 = 1600/3968 ≈ 0.403.
- (Model table as above; sequence probabilities 0.01024, 0.00064, 0.00256, 0.00256.)

In general
- (Model table as above; diagram: an observation sequence 2 2 2 with position i = k marked.)

In general
- Let's say #words = n. We want P(w_{1..n}, t_k = q) for a position k and state q.
- (Model table as above; diagram: observation sequence 2 2 2 with position i = k marked.)

In general
- The probability of w_1 ... w_k together with tag t_k = q is P(w_{1..k}, t_k = q).
- The probability of w_{k+1} ... w_n given tag t_k = q is P(w_{k+1..n} | t_k = q).
- P(w_{1..n}, t_k = q) = P(w_{1..k}, t_k = q) * P(w_{k+1..n} | t_k = q)
- (Diagram: observation sequence 2 2 2 with position i = k marked.)

In general
- P(w_{1..n}, t_k = q) = P(w_{1..k}, t_k = q) * P(w_{k+1..n} | t_k = q)
- P(w_{1..k}, t_k = q) = Σ_{t_1..k-1} P(w_{1..k}, t_{1..k-1}, t_k = q): can be computed by the forward algorithm
- P(w_{k+1..n} | t_k = q) = Σ_{t_k+1..n} P(w_{k+1..n}, t_{k+1..n} | t_k = q): can be computed by the backward algorithm
- (Diagram: observation sequence 2 2 2 with position i = k marked.)

Forward algorithm
- Induction: α_k(q) = P(w_k | q) * Σ_{q'} α_{k-1}(q') * P(q | q')
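
A direct transcription of this recursion into code, for the toy model above (state labels as reconstructed; a sketch, not the lecture's implementation):

    # Forward algorithm for the toy model (state labels as reconstructed above).
    states = ["C", "H"]
    P_start = {"C": 0.5, "H": 0.5}
    P_trans = {("C", "C"): 0.8, ("C", "H"): 0.2, ("H", "C"): 0.2, ("H", "H"): 0.8}
    P_emit = {("C", 1): 0.8, ("C", 2): 0.2, ("C", 3): 0.0,
              ("H", 1): 0.0, ("H", 2): 0.2, ("H", 3): 0.8}

    def forward(obs):
        # alpha[i][q] = P(obs[:i+1], tag at position i is q)
        alpha = [{q: P_start[q] * P_emit[(q, obs[0])] for q in states}]
        for w in obs[1:]:
            alpha.append({q: P_emit[(q, w)] *
                             sum(alpha[-1][p] * P_trans[(p, q)] for p in states)
                          for q in states})
        return alpha

    alpha = forward([1, 2, 2])
    print(sum(alpha[-1].values()))   # P(w_1..n) = 0.016 for this toy example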

Backward algorithm
- P(w_{k+1..n} | t_k = q) = Σ_{q'} P(w_{k+2..n} | t_{k+1} = q') * P(q' | q) * P(w_{k+1} | q')
- i.e., β_k(q) = Σ_{q'} β_{k+1}(q') * P(q' | q) * P(w_{k+1} | q')
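
And the mirror-image recursion, assuming the same toy model defined in the forward sketch:

    # Backward algorithm: beta[i][q] = P(obs[i+1:] | tag at position i is q),
    # with beta at the last position set to 1. Uses the model defined above.
    def backward(obs):
        beta = [{q: 1.0 for q in states}]
        for w in reversed(obs[1:]):              # w_{k+1}, moving right to left
            beta.insert(0, {q: sum(beta[0][r] * P_trans[(q, r)] * P_emit[(r, w)]
                                   for r in states)
                            for q in states})
        return beta

    beta = backward([1, 2, 2])
    print(beta[0])   # P(w_2..w_n | t_1 = q) for each state q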

In general
- P(w_{1..n}, t_k = q) = P(w_{1..k}, t_k = q) * P(w_{k+1..n} | t_k = q)
- P(w_{1..k}, t_k = q) = Σ_{t_1..k-1} P(w_{1..k}, t_{1..k-1}, t_k = q)  (forward algorithm)
- P(w_{k+1..n} | t_k = q) = Σ_{t_k+1..n} P(w_{k+1..n}, t_{k+1..n} | t_k = q)  (backward algorithm)
- (Diagram: observation sequence 2 2 2 with position i = k marked.)

Emission counts
- Expected count of (2, q): Σ_i P(w_i = 2, t_i = q, w_{1..n})
- Expected count of q: Σ_i P(t_i = q, w_{1..n})
- P(2 | q) = Σ_i P(w_i = 2, t_i = q, w_{1..n}) / Σ_i P(t_i = q, w_{1..n})
- (Diagram: observation sequence 2 2 2 with position i = k marked.)
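
Putting alpha and beta together gives the per-position state posteriors and hence the expected emission counts. A sketch that assumes the forward() and backward() functions (and model tables) from the previous sketches:

    # Expected emission counts: P(t_i = q, w_1..n) = alpha[i][q] * beta[i][q].
    # Assumes forward(), backward(), and the model tables defined above.
    def reestimate_emissions(obs, states):
        alpha, beta = forward(obs), backward(obs)
        total = sum(alpha[-1].values())                      # P(w_1..n)
        gamma = [{q: alpha[i][q] * beta[i][q] / total for q in states}
                 for i in range(len(obs))]                   # P(t_i = q | w_1..n)
        num = {(q, w): sum(g[q] for g, o in zip(gamma, obs) if o == w)
               for q in states for w in set(obs)}            # E[count of (w, q)]
        den = {q: sum(g[q] for g in gamma) for q in states}  # E[count of q]
        return {(q, w): num[(q, w)] / den[q] for (q, w) in num}

    print(reestimate_emissions([1, 2, 2], ["C", "H"]))   # P(1|C) comes out ~0.403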

How about the transition counts?
- P(w_{1..n}, t_k = q, t_{k+1} = q')
    = P(w_{1..k}, t_k = q) * P(w_{k+2..n} | t_{k+1} = q') * P(q' | q) * P(w_{k+1} | q')
    = α_k(q) * β_{k+1}(q') * P(q' | q) * P(w_{k+1} | q')
- (Diagram: positions i = k and i = k+1 marked.)
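
The corresponding expected transition counts, again assuming the forward()/backward() sketches and model tables above; dividing the C→C count by the expected number of C's reproduces the slide's updated value of roughly 0.58.

    # Expected transition counts: for adjacent positions (k, k+1),
    # P(t_k = q, t_{k+1} = r, w_1..n) = alpha[k][q] * P(r|q) * P(w_{k+1}|r) * beta[k+1][r].
    # Assumes forward(), backward(), and the model tables defined above.
    def expected_transitions(obs, states):
        alpha, beta = forward(obs), backward(obs)
        total = sum(alpha[-1].values())                      # P(w_1..n)
        xi = {(q, r): 0.0 for q in states for r in states}   # E[count of q -> r]
        for k in range(len(obs) - 1):
            for q in states:
                for r in states:
                    xi[(q, r)] += (alpha[k][q] * P_trans[(q, r)] *
                                   P_emit[(r, obs[k + 1])] * beta[k + 1][r]) / total
        return xi

    xi = expected_transitions([1, 2, 2], ["C", "H"])
    print(xi[("C", "C")])   # 1.44 expected C->C per sequence (2304 per 100,000 samples)
    # Dividing by the expected count of C (2.48) gives ~0.58, the slide's new P(C|C).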

Three basic problems for HMMs
- Likelihood of the input (how likely is it that the sentence "I love cat" occurs?): forward algorithm
- Decoding (tagging) the input (what are the POS tags of "I love cat"?): Viterbi algorithm
- Estimation (learning): how do we learn the model? Find the best model parameters.
  - Case 1: supervised, the tags are annotated: maximum likelihood estimation (MLE)
  - Case 2: unsupervised, only unannotated text: forward-backward algorithm

Trick: computing everything in log space
- Homework: write the forward, backward, and Viterbi algorithms in log space.
- Hint: you need a function to compute log(a+b).
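
The usual way to write that helper (a sketch; the function name is ours, not the homework's):

    import math

    # log(a + b) computed from log(a) and log(b) without leaving log space,
    # which is the helper the hint refers to.
    def log_add(log_a, log_b):
        if log_a == -math.inf:
            return log_b
        if log_b == -math.inf:
            return log_a
        m = max(log_a, log_b)
        return m + math.log1p(math.exp(min(log_a, log_b) - m))

    # Example: sums of tiny probabilities stay representable in log space.
    print(math.exp(log_add(math.log(0.01024), math.log(0.00256))))   # 0.0128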

Behind the scenes
- What is EM optimizing? The log-likelihood of the input: log P(w | λ).
- log P(w | λ) = log Σ_t P(w, t | λ) = log Σ_t Π_{i=1..n} P(t_i | t_{i-1}, t_{i-2}) P(w_i | t_i). This is hard.
- In contrast, in the supervised situation we optimize log P(w, t | λ):
    log P(w, t | λ) = log Π_{i=1..n} P(t_i | t_{i-1}, t_{i-2}) P(w_i | t_i)
                    = Σ_i (log P(t_i | t_{i-1}, t_{i-2}) + log P(w_i | t_i))
- The log of a sum of products (log Σ Π) is hard; the log of a product (log Π = Σ log) is easy.

Intuition of EM (from the optimization perspective)
- [Figure: the objective f(λ) = log P(w | λ) = log Σ_t P(w, t | λ), with a lower-bound curve g_m touching it at λ^(m), and successive iterates λ^(m), λ^(m+1), λ^(m+2); here λ^(m) denotes the parameters at EM iteration m.]
- Key idea:
  1. Define g_m(λ) such that f(λ) ≥ g_m(λ) for all λ, and f(λ^(m)) = g_m(λ^(m)).
  2. Optimize g_m(λ).

Intuition of EM (from the optimization perspective)
- [Figure as above: f(λ) = log P(w | λ) = log Σ_t P(w, t | λ), lower bounds g_m and g_{m+1}, iterates λ^(m), λ^(m+1), λ^(m+2).]
- Key idea:
  1. Define g_m(λ) such that f(λ) ≥ g_m(λ) for all λ, and f(λ^(m)) = g_m(λ^(m)).
  2. Optimize g_m(λ).
- Hard EM and soft EM define different g_m(λ).

g_m(λ) for soft EM
- log P(w | λ) = log Σ_t P(w, t | λ) = log Σ_t [P(w, t | λ) / P(t | w, λ^(m))] * P(t | w, λ^(m))
- Jensen's inequality: if Σ_k p(x_k) = 1, then log Σ_k f(x_k) p(x_k) ≥ Σ_k p(x_k) log f(x_k).
- Therefore log P(w | λ) ≥ Σ_t P(t | w, λ^(m)) log [P(w, t | λ) / P(t | w, λ^(m))].

g_m(λ^(m)) = f(λ^(m))?
- We have log P(w | λ) ≥ Σ_t P(t | w, λ^(m)) log [P(w, t | λ) / P(t | w, λ^(m))] = g_m(λ).
- f(λ^(m)) = log Σ_t P(w, t | λ^(m)) = log P(w | λ^(m)).
- g_m(λ^(m)) = Σ_t P(t | w, λ^(m)) log [P(w, t | λ^(m)) / P(t | w, λ^(m))]
             = Σ_t P(t | w, λ^(m)) log P(w | λ^(m))
             = (log P(w | λ^(m))) * Σ_t P(t | w, λ^(m)) = log P(w | λ^(m)).
- So the bound is tight at λ^(m): g_m(λ^(m)) = f(λ^(m)).

Intuition of EM (from the optimization perspective)
- [Figure as above: f(λ) = log P(w | λ) = log Σ_t P(w, t | λ), lower bound g_m, iterates λ^(m), λ^(m+1), λ^(m+2).]
- Key idea:
  1. Define g_m(λ) such that f(λ) ≥ g_m(λ) for all λ, and f(λ^(m)) = g_m(λ^(m)).
  2. Optimize g_m(λ).
- Soft EM defines g_m(λ) = Σ_t P(t | w, λ^(m)) log [P(w, t | λ) / P(t | w, λ^(m))].

Optimizing g_m(λ)
- g_m(λ) = Σ_t P(t | w, λ^(m)) log [P(w, t | λ) / P(t | w, λ^(m))]
         = Σ_t P(t | w, λ^(m)) (log P(w, t | λ) - log P(t | w, λ^(m)))
- The second term does not depend on λ, so
    max_λ g_m(λ) = max_λ Σ_t P(t | w, λ^(m)) log P(w, t | λ)
                 = max_λ Σ_t P(t | w, λ^(m)) Σ_i (log P(t_i | t_{i-1}, t_{i-2}) + log P(w_i | t_i))
- In contrast, in the supervised learning case:
    log P(w, t | λ) = log Π_{i=1..n} P(t_i | t_{i-1}, t_{i-2}) P(w_i | t_i)
                    = Σ_i (log P(t_i | t_{i-1}, t_{i-2}) + log P(w_i | t_i))
  We know how to solve this!
- log Σ Π is hard; log Π = Σ log is easy.
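
The derivation on the last few slides, restated compactly in standard EM notation (λ^(m) is the current parameter estimate; the E-step/M-step labels are the standard names, not the slides' wording):

    \begin{aligned}
    \log P(w \mid \lambda)
      &= \log \sum_t P(t \mid w, \lambda^{(m)})\,
              \frac{P(w, t \mid \lambda)}{P(t \mid w, \lambda^{(m)})} \\
      &\ge \sum_t P(t \mid w, \lambda^{(m)})
              \log \frac{P(w, t \mid \lambda)}{P(t \mid w, \lambda^{(m)})}
       \;=\; g_m(\lambda) \quad \text{(Jensen's inequality)} \\[6pt]
    \text{E-step:}\quad
      & \text{compute the posteriors } P(t \mid w, \lambda^{(m)})
        \text{ (expected counts, via forward-backward)} \\
    \text{M-step:}\quad
      & \lambda^{(m+1)} = \arg\max_\lambda g_m(\lambda)
        = \arg\max_\lambda \sum_t P(t \mid w, \lambda^{(m)}) \log P(w, t \mid \lambda)
    \end{aligned}

Because log P(w, t | λ) decomposes into a sum of log transition and log emission terms, the M-step reduces to count-and-normalize on the expected counts, which is exactly the update computed from the forward-backward quantities earlier in the lecture.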