Lecture 6 Hidden Markov Models and Maximum Entropy Models


Lecture 6: Hidden Markov Models and Maximum Entropy Models. CS 6320

HMM Outline: Markov Chains; Hidden Markov Model; Likelihood: Forward Algorithm; Decoding: Viterbi Algorithm; Maximum Entropy Models.

Definitions. A weighted finite-state automaton adds probabilities to the arcs; the probabilities on the arcs leaving any state must sum to one. A Markov chain is a special case of a WFSA in which the input sequence uniquely determines which states the automaton will go through. Markov chains can't represent inherently ambiguous problems; they are useful for assigning probabilities to unambiguous sequences.

Markov Chain for Weather

Markov Chain for Words

Markov Chain Model. A set of states Q = q_1, q_2, ..., q_N; the state at time t is q_t. Transition probabilities: a set of probabilities A = a_01, a_02, ..., a_n1, ..., a_nn. Each a_ij represents the probability of transitioning from state i to state j; the set of these is the transition probability matrix A. Markov Assumption: the current state depends only on the previous state: P(q_i | q_1 ... q_{i-1}) = P(q_i | q_{i-1}).

Markov Chain Model. For every state i: Σ_{j=1}^{n} a_ij = 1.

Weather example. Markov chains are useful when we need to compute the probability of a sequence of events that are all observable.

Markov Chain for Weather. What is the probability of 4 consecutive warm days? The sequence is warm-warm-warm-warm, i.e. the state sequence is 3-3-3-3. P(3,3,3,3) = π_3 · a_33 · a_33 · a_33 = 0.2 × 0.6^3 = 0.0432. But what about states that are not observable?
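
The same computation, as a minimal Python sketch that scores a state sequence under a Markov chain. Only π_3 = 0.2 and a_33 = 0.6 come from the slide; the dict-of-dicts layout is an illustrative choice, not the lecture's code.

```python
# Probability of a state sequence under a Markov chain:
# P(s1, ..., sn) = pi[s1] * product of A[s_{t-1}][s_t].

def sequence_probability(states, pi, A):
    prob = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        prob *= A[prev][cur]
    return prob

pi = {3: 0.2}        # P(start in state 3 = warm), from the slide
A = {3: {3: 0.6}}    # P(warm -> warm), from the slide
print(sequence_probability([3, 3, 3, 3], pi, A))   # 0.2 * 0.6**3 = 0.0432
```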

HMM for Ice Cream. You are a climatologist in the year 2799 studying global warming. You can't find any records of the weather in Baltimore, MD for the summer of 2007, but you find Jason Eisner's diary, which lists how many ice creams Jason ate every day that summer. Our job: figure out how hot it was.

Hidden Markov Model. For Markov chains, the output symbols are the same as the states: if we see hot weather, we're in state hot. But in part-of-speech tagging and other tasks, the output symbols are words while the hidden states are part-of-speech tags, so we need an extension. A Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states. This means we don't know which state we are in.

Hidden Markov Models. States Q = q_1, q_2, ..., q_N. Observations O = o_1, o_2, ..., o_T; each observation is a symbol from a vocabulary V = {v_1, v_2, ..., v_|V|}. Transition probabilities: transition probability matrix A = {a_ij}, where a_ij = P(q_t = j | q_{t-1} = i), 1 ≤ i, j ≤ N. Observation likelihoods: output probability matrix B = {b_i(k)}, where b_i(k) = P(X_t = o_k | q_t = i). Special initial probability vector π, with π_i = P(q_1 = i), 1 ≤ i ≤ N.

Eisner Task. Given the ice cream observation sequence 1,2,3,2,2,2,3, ... produce the weather sequence H,C,H,H,H,C, ...

HMM for Ice Cream. There are two hidden states: hot and cold. Observations are the number of ice cream events: O = {1, 2, 3}.

Transition Probabilities

Observation Likelihoods
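
The transition-probability and observation-likelihood tables referenced above are figures in the original slides and are not reproduced in this transcription. Below is a minimal Python encoding of a two-state ice-cream HMM in that spirit; the numeric values are assumptions for illustration, not numbers read from the slides. The later sketches reuse these parameters.

```python
# Illustrative HMM parameters for the hot/cold ice-cream example.
# All numeric values below are assumed for demonstration only.

states = ["hot", "cold"]
observations = [1, 2, 3]          # number of ice creams eaten in a day

pi = {"hot": 0.8, "cold": 0.2}    # initial probabilities (assumed)

A = {                              # transition probabilities P(next | current)
    "hot":  {"hot": 0.6, "cold": 0.4},
    "cold": {"hot": 0.5, "cold": 0.5},
}

B = {                              # observation likelihoods P(ice creams | state)
    "hot":  {1: 0.2, 2: 0.4, 3: 0.4},
    "cold": {1: 0.5, 2: 0.4, 3: 0.1},
}
```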

HMM: Three Basic Problems

Likelihood Computation. Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ). Problem 1: compute the probability of eating 3 1 3 ice creams. Problem 2: compute the probability of eating 3 1 3 ice creams when the hidden sequence is hot hot cold.

Likelihood Computation. For a particular hidden state sequence Q and an observation sequence O, the likelihood of the observation sequence is P(O | Q) = ∏_{i=1}^{T} P(o_i | q_i). For example, P(3 1 3 | hot hot cold) = P(3 | hot) × P(1 | hot) × P(3 | cold).

Likelihood Computation. The joint probability of being in a weather state sequence Q and a particular sequence of observations O of ice cream events is: P(O, Q) = P(O | Q) × P(Q) = ∏_{i=1}^{n} P(o_i | q_i) × ∏_{i=1}^{n} P(q_i | q_{i-1}).

We can now compute the probability of a sequence of observations O using the joint probabilities: P(O) = Σ_Q P(O, Q) = Σ_Q P(O | Q) P(Q). For example: P(3 1 3) = P(3 1 3, cold cold cold) + P(3 1 3, cold cold hot) + ... + P(3 1 3, hot hot hot).
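
To make the summation concrete, here is a brute-force Python sketch that enumerates every hidden state sequence and sums the joint probabilities, exactly as written above. It reuses the assumed illustrative parameters from the earlier sketch (repeated so the snippet is self-contained); the enumeration is exponential in the sequence length, which is what motivates the Forward algorithm next.

```python
from itertools import product

# Brute-force P(O) = sum over all hidden sequences Q of P(O | Q) P(Q).

def brute_force_likelihood(obs, states, pi, A, B):
    total = 0.0
    for seq in product(states, repeat=len(obs)):    # every hidden sequence Q
        p = pi[seq[0]] * B[seq[0]][obs[0]]          # P(q1) P(o1 | q1)
        for t in range(1, len(obs)):
            p *= A[seq[t - 1]][seq[t]] * B[seq[t]][obs[t]]
        total += p                                  # add P(O, Q)
    return total

# Illustrative (assumed) parameters, as before.
pi = {"hot": 0.8, "cold": 0.2}
A = {"hot": {"hot": 0.6, "cold": 0.4}, "cold": {"hot": 0.5, "cold": 0.5}}
B = {"hot": {1: 0.2, 2: 0.4, 3: 0.4}, "cold": {1: 0.5, 2: 0.4, 3: 0.1}}

print(brute_force_likelihood([3, 1, 3], ["hot", "cold"], pi, A, B))
```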

Forward Algorithm. For N hidden states and a sequence of T observations, the Forward Algorithm uses O(N²T) operations instead of O(N^T). α_t(j) is the probability of being in state j after seeing the first t observations: α_t(j) = P(o_1, o_2, ..., o_t, q_t = j | λ), computed recursively as α_t(j) = Σ_{i=1}^{N} α_{t-1}(i) · a_ij · b_j(o_t).

Forward trellis for the ice cream example

Forward Algorithm.
1. Initialization: α_1(j) = a_0j · b_j(o_1), 1 ≤ j ≤ N
2. Recursion: α_t(j) = Σ_{i=1}^{N} α_{t-1}(i) · a_ij · b_j(o_t); 1 ≤ j ≤ N, 1 < t ≤ T
3. Termination: P(O | λ) = α_T(q_F) = Σ_{i=1}^{N} α_T(i) · a_iF

Forward Algorithm

Forward Algorithm
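
A minimal Python sketch of the Forward algorithm as defined above (initialization, recursion, termination), using the same assumed ice-cream parameters. The special end state is omitted here, i.e. a_iF is treated as 1, so P(O) is just the sum over the final column of the trellis.

```python
# Forward algorithm: alpha[t][j] = P(o_1 .. o_t, q_t = j | lambda).

def forward(obs, states, pi, A, B):
    alpha = [{s: pi[s] * B[s][obs[0]] for s in states}]          # initialization
    for t in range(1, len(obs)):                                 # recursion
        alpha.append({
            j: sum(alpha[t - 1][i] * A[i][j] for i in states) * B[j][obs[t]]
            for j in states
        })
    return sum(alpha[-1][s] for s in states)                     # termination

# Illustrative (assumed) parameters, as before.
pi = {"hot": 0.8, "cold": 0.2}
A = {"hot": {"hot": 0.6, "cold": 0.4}, "cold": {"hot": 0.5, "cold": 0.5}}
B = {"hot": {1: 0.2, 2: 0.4, 3: 0.4}, "cold": {1: 0.5, 2: 0.4, 3: 0.1}}

print(forward([3, 1, 3], ["hot", "cold"], pi, A, B))  # matches the brute-force sum
```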

Decoding. POS tagging is such a problem, and so is the weather problem. Recall that in the case of POS tagging we need to compute t̂_1^n = argmax_{t_1^n} P(t_1^n | w_1^n). We could just enumerate all paths given the input and use the model to assign probabilities to each. Not a good idea. Luckily, dynamic programming helps us here.

Viterbi Algorithm. The Viterbi algorithm computes a trellis using dynamic programming. Observations are processed from left to right, filling out a trellis of states. v_t(j) is the probability that the HMM is in state j after seeing the first t observations and passing through the most probable state sequence: v_t(j) = max_{q_0, q_1, ..., q_{t-1}} P(q_0, q_1, ..., q_{t-1}, o_1, o_2, ..., o_t, q_t = j | λ), computed recursively as v_t(j) = max_{i=1..N} v_{t-1}(i) · a_ij · b_j(o_t).

Viterbi trellis for the ice cream example

Viterbi Algorithm.
1. Initialization: v_1(j) = a_0j · b_j(o_1), 1 ≤ j ≤ N; bt_1(j) = 0
2. Recursion: v_t(j) = max_{i=1..N} v_{t-1}(i) · a_ij · b_j(o_t); 1 ≤ j ≤ N, 1 < t ≤ T
   bt_t(j) = argmax_{i=1..N} v_{t-1}(i) · a_ij · b_j(o_t); 1 ≤ j ≤ N, 1 < t ≤ T
3. Termination:
   The best score: P* = v_T(q_F) = max_{i=1..N} v_T(i) · a_{i,F}
   The start of backtrace: q_T* = bt_T(q_F) = argmax_{i=1..N} v_T(i) · a_{i,F}

Viterbi Traceback

The Viterbi Algorithm

Viterbi Example

Viterbi Summary. Create an array with columns corresponding to inputs and rows corresponding to possible states. Sweep through the array in one pass, filling the columns left to right using our transition probabilities and observation probabilities. The dynamic programming key is that we need only store the MAX probability path to each cell, not all paths.
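
A minimal Python sketch of the Viterbi algorithm just summarized: one left-to-right pass over the trellis, keeping only the max-probability path into each cell plus a backpointer, followed by a traceback. Same assumed ice-cream parameters as before.

```python
# Viterbi: v[t][j] = best-path probability ending in state j at time t,
# bp[t][j] = the predecessor state on that best path.

def viterbi(obs, states, pi, A, B):
    v = [{s: pi[s] * B[s][obs[0]] for s in states}]              # initialization
    bp = [{s: None for s in states}]
    for t in range(1, len(obs)):                                 # recursion
        v.append({}); bp.append({})
        for j in states:
            best_i = max(states, key=lambda i: v[t - 1][i] * A[i][j])
            v[t][j] = v[t - 1][best_i] * A[best_i][j] * B[j][obs[t]]
            bp[t][j] = best_i
    last = max(states, key=lambda s: v[-1][s])                   # termination
    path = [last]
    for t in range(len(obs) - 1, 0, -1):                         # traceback
        path.append(bp[t][path[-1]])
    return list(reversed(path)), v[-1][last]

# Illustrative (assumed) parameters, as before.
pi = {"hot": 0.8, "cold": 0.2}
A = {"hot": {"hot": 0.6, "cold": 0.4}, "cold": {"hot": 0.5, "cold": 0.5}}
B = {"hot": {1: 0.2, 2: 0.4, 3: 0.4}, "cold": {1: 0.5, 2: 0.4, 3: 0.1}}

print(viterbi([3, 1, 3], ["hot", "cold"], pi, A, B))
```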

Evaluation. So once you have your POS tagger running, how do you evaluate it? Overall error rate with respect to a gold-standard test set. Error rates on particular tags. Error rates on particular words. Tag confusions...

Error Analysis. Look at a confusion matrix and see what errors are causing problems: Noun (NN) vs ProperNoun (NNP) vs Adjective (JJ); Past tense (VBD) vs Participle (VBN) vs Adjective (JJ).

Evaluation. The result is compared with a manually coded Gold Standard. Typically accuracy reaches 96-97%. This may be compared with the result for a baseline tagger, one that uses no context. Important: 100% is impossible even for human annotators.

Maximum Entropy Models

MEM Outline: Maximum Entropy Models background; Maximum Entropy Model applied to NLP classification; Maximum Entropy Markov Models.

Maximum Entropy. Probabilistic machine learning for sequence classification (POS tagging, speech recognition) and non-sequential classification (text classification, sentiment analysis). Maximum entropy extracts features from inputs, then combines them to classify inputs. It computes the probability of a class c given an observation x described by a vector of features.

Linear Regression. Problem: price a house based on vague adjectives used in the ads, e.g. fantastic, cute, charming. Figure 6.7: some made-up data on the number of vague adjectives (fantastic, cute, charming) in a real estate ad and the amount the house sold for over the asking price. Figure 6.8: a plot of the made-up points in Fig. 6.7 and the regression line that best fits them, with the equation y = -4900x + 16550.

Multiple Linear Regression. In reality, the price of a house depends on several factors:
price = w_0 + w_1 · Num_Adjectives + w_2 · Mortgage_Rate + w_3 · Num_Unsold_Houses
Linear regression in general: y = w_0 + Σ_{i=1}^{N} w_i · f_i
Dot product: a · b = Σ_{i=1}^{N} a_i b_i = a_1 b_1 + a_2 b_2 + ... + a_N b_N, so that y = w · f.

Learning in Linear Regression. Problem: learn the weights w. For example j, y_pred^(j) = Σ_{i=0}^{N} w_i · f_i^(j). Minimize the cost function produced by the weights over all M examples in the training set: cost(W) = Σ_{j=0}^{M} (y_pred^(j) − y_obs^(j))². In matrix form Y = X · W, with the closed-form solution W = (Xᵀ X)⁻¹ Xᵀ y.
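
A small numpy sketch of the closed-form solution W = (XᵀX)⁻¹Xᵀy stated above. The feature matrix and prices below are made-up illustrative numbers; they are not the Figure 6.7 data, which is not reproduced in this transcription.

```python
import numpy as np

# Closed-form least squares: W = (X^T X)^{-1} X^T y.
# Rows are training examples; columns are [bias, Num_Adjectives,
# Mortgage_Rate, Num_Unsold_Houses]. All values are made up.

X = np.array([
    [1.0, 4, 6.5, 120],
    [1.0, 3, 6.5, 110],
    [1.0, 2, 7.0, 100],
    [1.0, 2, 7.1, 105],
    [1.0, 1, 7.2,  90],
    [1.0, 0, 7.4,  80],
])
y = np.array([2000, 5000, 9000, 8500, 12000, 15000])  # price over asking (made up)

W = np.linalg.inv(X.T @ X) @ X.T @ y   # one weight per feature, w_0 is the bias
print(W)
print(X @ W)                           # predicted prices for the training rows
```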

Logistic Regression. Linear regression predicts real-valued functions; classification problems deal with discrete values or classes. We calculate the probability that an observation is in a particular class and pick the class with the highest probability. Let observation x have feature vector f and class y. A first attempt, P(y = true | x) = Σ_{i=0}^{N} w_i · f_i, is unbounded and so is not a proper probability. Instead, use the linear model to predict the odds of y being true, p(y=true | x) / (1 − p(y=true | x)), or rather its logarithm: ln[ p(y=true | x) / (1 − p(y=true | x)) ] = Σ_{i=0}^{N} w_i · f_i.

Logit Function. logit(p(x)) = ln[ p(x) / (1 − p(x)) ]. Setting ln[ p(y=true | x) / (1 − p(y=true | x)) ] = w · f and solving for the probability gives:
p(y=true | x) = e^{w·f} / (1 + e^{w·f})
p(y=false | x) = 1 / (1 + e^{w·f})
This is called the logistic function. Logistic Regression is the model in which a linear function is used to estimate the logit of a probability.

Logistic Regression -- Classification. Problem: given an observation x, decide whether it belongs to class true or class false. Choose true if p(y=true | x) > p(y=false | x), i.e. if p(y=true | x) / p(y=false | x) > 1. Substituting the logistic model, this holds exactly when e^{Σ_{i=0}^{N} w_i f_i} > 1, i.e. when Σ_{i=0}^{N} w_i f_i > 0. The equation Σ_{i=0}^{N} w_i f_i = 0 is the equation of a hyperplane: the decision boundary.
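
A minimal sketch of the decision rule just derived: compute w · f, apply the logistic function to get p(y=true | x), and classify by checking whether p > 0.5 (equivalently, whether w · f > 0). The weights and features here are hypothetical.

```python
import math

# Logistic regression decision rule. Weights and features are hypothetical.

def p_true(w, f):
    """p(y=true | x) = e^{w.f} / (1 + e^{w.f})."""
    z = sum(wi * fi for wi, fi in zip(w, f))
    return math.exp(z) / (1.0 + math.exp(z))

w = [0.5, -1.2, 2.0]   # hypothetical learned weights (w_0 is the bias)
f = [1.0, 0.3, 0.8]    # hypothetical feature vector (f_0 = 1 for the bias)

p = p_true(w, f)
label = "true" if p > 0.5 else "false"   # same test as w.f > 0 (the hyperplane)
print(p, label)
```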

Maximum Entropy Modeling. In NLP we need to classify problems with multiple classes:
p(c | x) = (1/Z) · exp( Σ_{i=0}^{N} w_ci · f_i )
p(c | x) = exp( Σ_{i=0}^{N} w_ci · f_i ) / Σ_{c'∈C} exp( Σ_{i=0}^{N} w_c'i · f_i )
In MaxEnt, instead of indicator functions we use f_i(c, x), meaning feature i for a particular class c for a given observation x:
p(c | x) = exp( Σ_{i=0}^{N} w_ci · f_i(c, x) ) / Σ_{c'∈C} exp( Σ_{i=0}^{N} w_c'i · f_i(c', x) )

Maximum Entropy Modeling. Secretariat/NNP is/BEZ expected/VBN to/TO race/?? tomorrow/
f_1(c, x) = 1 if word_i = "race" & c = NN; 0 otherwise
f_2(c, x) = 1 if t_{i-1} = TO & c = VB; 0 otherwise
f_3(c, x) = 1 if suffix(word_i) = "ing" & c = VBG; 0 otherwise
f_4(c, x) = 1 if is_lower_case(word_i) & c = VB; 0 otherwise
f_5(c, x) = 1 if word_i = "race" & c = VB; 0 otherwise
f_6(c, x) = 1 if t_{i-1} = TO & c = NN; 0 otherwise

Maximum Entropy Modeling.
P(NN | x) = (e^{0.8} e^{-1.3}) / (e^{0.8} e^{-1.3} + e^{0.8} e^{0.01} e^{0.1}) = 0.20
P(VB | x) = (e^{0.8} e^{0.01} e^{0.1}) / (e^{0.8} e^{-1.3} + e^{0.8} e^{0.01} e^{0.1}) = 0.80
ĉ = argmax_{c∈C} P(c | x)
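
A Python sketch of the MaxEnt computation above: each active feature contributes its weight to the score of its class, and the exponentiated scores are normalized over all classes. Only the features active in this two-class NN/VB comparison are included, and the weights (0.8, -1.3, 0.01, 0.1) are the ones recovered from the worked example above; treat them as illustrative, not verified training output.

```python
import math

# MaxEnt: p(c|x) = exp(sum_i w_ci f_i(c,x)) / sum_c' exp(sum_i w_c'i f_i(c',x)).

x = {"word": "race", "prev_tag": "TO", "lower": True}

def features(c, x):
    """(weight, active?) pairs for the binary features f_i(c, x)."""
    return [
        (0.8,  x["word"] == "race" and c == "NN"),    # f1
        (0.8,  x["prev_tag"] == "TO" and c == "VB"),  # f2
        (0.01, x["lower"] and c == "VB"),             # f4
        (0.1,  x["word"] == "race" and c == "VB"),    # f5
        (-1.3, x["prev_tag"] == "TO" and c == "NN"),  # f6
    ]

def p(c, x, classes=("NN", "VB")):
    score = lambda cls: math.exp(sum(w for w, active in features(cls, x) if active))
    return score(c) / sum(score(cls) for cls in classes)

print(p("NN", x), p("VB", x))   # ~0.20 and ~0.80, as in the slide
```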

Why call it Maximum Entropy? Problem: assign a tag to the word zzfish, (a) without any prior information, (b) knowing that only four tags are possible.

Entropy equation: H(X) = -Σ_x P(x) log_2 P(x). With no information beyond P(NN) + P(JJ) + P(NNS) + P(VB) = 1, maximizing H gives the uniform distribution over the possible tags. Additional constraints, such as P(t = NN or t = NNS) = 8/10 for the word zzfish, or P(VB) = 1/20, narrow the space of allowed distributions, and we pick p* = argmax_p H(p) among them. The exponential model for multinomial logistic regression also finds the maximum entropy distribution subject to constraints from the feature functions.
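
To make this concrete, a tiny sketch computing H for the unconstrained and constrained cases. The constrained solution (0.4, 0.4, 0.1, 0.1), which spreads the remaining mass evenly within each group, is the standard maximum-entropy result for the 8/10 constraint and is included here as an assumption rather than read from the slide.

```python
import math

# H(p) = -sum p(x) log2 p(x).

def H(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# Uniform over {NN, JJ, NNS, VB}: the unconstrained maximum, 2.0 bits.
print(H([0.25, 0.25, 0.25, 0.25]))

# Max-entropy solution under P(NN) + P(NNS) = 8/10 (assumed illustrative values).
print(H([0.4, 0.4, 0.1, 0.1]))
```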

Maximum Entropy Markov Models (MEMM).
HMM: T̂ = argmax_T P(T | W) = argmax_T P(W | T) P(T) = argmax_T ∏_i P(word_i | tag_i) P(tag_i | tag_{i-1})
MEMM: T̂ = argmax_T P(T | W) = argmax_T ∏_i P(tag_i | word_i, tag_{i-1})
Advantages of MEMM: 1. We estimate directly the probability of each tag given the previous tag and the observed word. 2. We can condition on any useful feature of the input observation, which was not possible with an HMM.

MEMM.
HMM: P(Q | O) = ∏_{i=1}^{n} P(o_i | q_i) ∏_{i=1}^{n} P(q_i | q_{i-1})
MEMM: P(Q | O) = ∏_{i=1}^{n} P(q_i | q_{i-1}, o_i)
Figure 6.20: The HMM (top) and MEMM (bottom) representation of the probability computation for the correct sequence of tags for the Secretariat sentence. Each arc would be associated with a probability; the HMM computes two separate probabilities for the observation likelihood and the prior, while the MEMM computes a single probability function at each state, conditioned on the previous state and current observation.

MEMM. Figure 6.21: An MEMM for part-of-speech tagging, augmenting the description in Fig. 6.20 by showing that an MEMM can condition on many features of the input, such as capitalization or morphology (ending in -s or -ed), as well as earlier words or tags. We have shown some potential additional features for the first three decisions, using different line styles for each class.
P(q_i | q_{i-1}, o_i) = (1 / Z(o, q')) · exp( Σ_i w_i f_i(o_i, q_i) )
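
A minimal sketch of the MEMM local distribution P(q_i | q_{i-1}, o_i) defined above: a log-linear score over features of the observation and the previous tag, normalized over the tag set at each step. The feature templates and weights below are hypothetical illustrations, not values from the lecture; decoding over a whole sentence would run Viterbi over these local distributions rather than over separate transition and emission tables.

```python
import math

# MEMM local model: P(q | q_prev, o) = (1/Z) exp(sum_i w_i f_i(o, q)),
# normalized over tags at each step. Weights below are hypothetical.

WEIGHTS = {
    ("word=race", "VB"): 0.5,
    ("word=race", "NN"): 0.3,
    ("prev=TO",   "VB"): 1.0,
    ("prev=TO",   "NN"): -0.8,
}
TAGS = ["NN", "VB"]

def local_p(q, q_prev, word):
    """P(q | q_prev, word) under the log-linear local model."""
    def score(tag):
        s = WEIGHTS.get(("word=" + word, tag), 0.0)    # lexical feature
        s += WEIGHTS.get(("prev=" + q_prev, tag), 0.0) # previous-tag feature
        return math.exp(s)
    return score(q) / sum(score(t) for t in TAGS)      # per-state normalizer Z

# Conditioning on the previous tag TO and the word "race":
print(local_p("VB", "TO", "race"))   # ~0.88 with these hypothetical weights
print(local_p("NN", "TO", "race"))   # ~0.12
```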