Hidden Markov Models in Language Processing


Hidden Markov Models in Language Processing
Dustin Hillard
Lecture notes courtesy of Prof. Mari Ostendorf

Outline
- Review of Markov models
- What is an HMM? Examples
- General idea of hidden variables: implications for inference and estimation
- Back to HMM details: the key questions
- Hidden-event language models

Goals: To understand the assumptions behind an HMM, so that you can decide when the model makes sense and when extensions make sense, as well as the cost of extensions.

Review: Markov Models

The next state s_{i+1} is conditionally independent of the past given the current state s_i:

p(s_1, ..., s_T) = p(s_1) \prod_{i=2}^{T} p(s_i | s_1, ..., s_{i-1}) = p(s_1) \prod_{i=2}^{T} p(s_i | s_{i-1})

Illustration of Topology

State topology pictures depict constraints on transitions, i.e. the allowable transitions in the state space, with circles corresponding to state values. The two examples below are for: a fully connected state space (as in a letter bigram with non-zero probabilities), and a strictly ordered model (as in the left-to-right model used in speech acoustic models).

Graphical Model Illustration of Time-Sequence

The picture below could apply for any topology. Each circle is a random variable S_i at a particular time i.
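As a concrete illustration of the factorization above, here is a minimal sketch in Python with an invented two-state transition table (the numbers and the names p_init, p_trans are placeholders, not from the notes):

```python
import numpy as np

# Hypothetical two-state Markov chain (states 0 and 1); all numbers are made up.
p_init = np.array([0.6, 0.4])              # p(s_1)
p_trans = np.array([[0.7, 0.3],            # p(s_i | s_{i-1}); rows index the previous state
                    [0.2, 0.8]])

def markov_sequence_prob(states):
    """p(s_1, ..., s_T) = p(s_1) * prod_{i>=2} p(s_i | s_{i-1})."""
    prob = p_init[states[0]]
    for prev, cur in zip(states[:-1], states[1:]):
        prob *= p_trans[prev, cur]
    return prob

print(markov_sequence_prob([0, 0, 1, 1]))   # 0.6 * 0.7 * 0.3 * 0.8
```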

Hidden Markov Models (HMMs)

In an HMM, we observe a sequence o = o_1, ..., o_T that is associated with a state sequence we cannot observe, s = s_1, ..., s_T. The model describes the joint state and observation sequence:

p(s_1, ..., s_T, o_1, ..., o_T) = p(s_1) p(o_1 | s_1) \prod_{i=2}^{T} p(s_i | s_{i-1}) p(o_i | s_i)

and we get the probability of the observation sequence by marginalizing over the state sequences:

p(o_1, ..., o_T) = \sum_{s} p(o_1, ..., o_T, s_1, ..., s_T)

The key assumptions in an HMM are:
- The state sequence is Markov: p(s_i | s_1, ..., s_{i-1}) = p(s_i | s_{i-1})
- The observations are conditionally independent of future and past states and observations given the current state: p(o_i | s_1, ..., s_T, o_1, ..., o_{i-1}, o_{i+1}, ..., o_T) = p(o_i | s_i)

Aside: there are two types of HMMs, with labeled states (p(o_i | s_i)) and labeled transitions (p(o_i | s_{i-1}, s_i)). You can convert from one to the other, so we will stick with the simpler and more popular labeled-state HMM framework.
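A minimal sketch of the joint probability computation for the labeled-state factorization above, assuming a small discrete HMM with invented parameters pi, A, B:

```python
import numpy as np

# Hypothetical HMM with 2 states and 3 observation symbols; all numbers are made up.
pi = np.array([0.5, 0.5])                   # p(s_1 = j)
A = np.array([[0.8, 0.2],                   # a_jk = p(s_i = k | s_{i-1} = j)
              [0.3, 0.7]])
B = np.array([[0.6, 0.3, 0.1],              # b_j(o) = p(o_i = o | s_i = j)
              [0.1, 0.2, 0.7]])

def hmm_joint_prob(states, obs):
    """p(s_1..s_T, o_1..o_T) = p(s_1) b_{s_1}(o_1) * prod_{i>=2} a_{s_{i-1} s_i} b_{s_i}(o_i)."""
    prob = pi[states[0]] * B[states[0], obs[0]]
    for i in range(1, len(states)):
        prob *= A[states[i - 1], states[i]] * B[states[i], obs[i]]
    return prob

print(hmm_joint_prob(states=[0, 0, 1], obs=[0, 1, 2]))
```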

We can extend the illustrations of Markov models to HMMs below.

Illustration of Topology

Using only the strictly ordered model for simplicity, each circle is a possible value in the state space as before, but we add dashed arrows to indicate generation of an observation.

Graphical Model Illustration of Time-Sequence

Again, the picture below could apply for any topology. Each empty circle is a random variable S_i at a particular time i, and each filled circle is a random variable O_i at time i.

HMM Examples in Language Processing

Some examples include part-of-speech tagging, identification of named entities in text, punctuation prediction, recognition of speech acts, ... More details for the first two examples are given below.

Part-of-speech (POS) Tagging

- States are POS tags; p(s_i | s_{i-1}) describes POS sequence tendencies.
- Observations are words; p(o_i | s_i) describes word-tag relations.

o_1, o_2, o_3, o_4 = I saw the boy
s_1, s_2, s_3, s_4 = pronoun verb det noun

It might be useful to have more complex representations of the words as observations to better model unknown words, i.e. o_i not in a known vocabulary, since the observation space needs to be finite in order to specify p(o_i | s_i). Consider possible observation sequences:

I saw the zlrxl
Twas brillig and the slithy toves did gyre and ...

We would probably say that zlrxl is a noun, slithy is an adjective, and toves is a plural noun, based on the endings of the words and/or neighboring POS types. So, we might want to expand o_i to a vector with the word, ending, capitalization, etc. as elements of the observation vector.

Aside: There are some limitations of an HMM for POS tagging, so extensions or other models are more often used. For example, the POS tag might be better predicted by using the previous word and not just the previous tag, e.g. some verbs do not take a direct object.
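A rough sketch of the observation-vector idea above; the specific feature choices, toy vocabulary, and helper name word_features are illustrative assumptions, not part of the notes:

```python
def word_features(word):
    """Map a word to a coarse observation tuple:
    (known word form or an unknown-word placeholder, suffix, capitalization flag)."""
    known_vocab = {"i", "saw", "the", "boy"}                 # toy vocabulary
    form = word.lower() if word.lower() in known_vocab else "<unk>"
    suffix = word[-2:].lower() if len(word) > 2 else word.lower()
    capitalized = word[0].isupper()
    return (form, suffix, capitalized)

print(word_features("toves"))    # ('<unk>', 'es', False)
print(word_features("saw"))      # ('saw', 'aw', False)
```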

Name Recognition

- States include {begin person (BP), cont person (CP), begin location (BL), cont location (CL), not a name (φ)}; p(s_i | s_{i-1}) describes name sequence tendencies.
- Observations are words; p(o_i | s_i) describes word-name relations.

o_1, o_2, o_3, o_4, o_5, o_6, o_7 = We saw Senator Pat Johnson in Boston.
s_1, s_2, s_3, s_4, s_5, s_6, s_7 = φ φ φ BP CP φ BL

Again, it might be useful to have features for characterizing new words, and it might be useful to condition the predicted state and/or observation on both the previous word and state:

p(s_i | s_{i-1}, o_{i-1});   p(o_i | s_i, o_{i-1})

Other sequence labeling problems ...

- Sentence segmentation: sequence of boundary vs. no boundary decisions after every word
- Topic segmentation: sequence of boundary vs. no boundary decisions after every sentence
- Speech act tagging on the sequence of utterances in a conversation
- ...

Hidden Variables

Hidden variables are useful for:
- modeling different sources of variability, e.g. temporal variability in an HMM, multimodal behavior in a mixture model (some random event that I can't observe determines the distribution), OR the hidden variables may be what you want to detect (as in tagging),
- learning distributions where some values are missing, as in missing labels for semi-supervised learning.

Key points in working with hidden variables:
- Computing P(observations) requires marginalizing (summing out) hidden states.
- Parameter estimation is iterative, e.g. the EM algorithm for maximum likelihood estimation.

The EM Algorithm

Given training data X = {x_1, ..., x_T} and model p(x, s | θ), the maximum likelihood parameter estimate requires

argmax_θ log p(X | θ) = argmax_θ \sum_{i=1}^{T} log \sum_{s} p(x_i, s | θ)

where the sum over the hidden variable complicates the solution.

Basic idea: directly maximizing the log likelihood log p(X | θ) is too hard, but we can maximize E[log p(X, S | θ) | X, θ^{(l)}] and thereby indirectly maximize the log likelihood. There is still no closed-form solution, though, so we need to iterate:

- E-step: Find Q(θ | θ^{(l)}) = E[log p(X, S | θ) | X, θ^{(l)}]
- M-step: Find θ^{(l+1)} = argmax_θ Q(θ | θ^{(l)})

For problems involving p(x, s) in the exponential family (most everything we work with), the EM algorithm reduces to:

- E-step: Find the expected sufficient statistics E[t^{(l)}] of the unobserved process.
- M-step: Use these in place of t in the ML update formulas for θ^{(l+1)}.

For many discrete hidden variable problems, the E-step involves estimating the probabilities of the different possible state values:

γ_t^{(l)}(j) = p(s_t = j | x, θ^{(l)})
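A minimal skeleton of the iteration, with hypothetical e_step and m_step callables standing in for the model-specific pieces; concrete instantiations follow on the next pages:

```python
from typing import Any, Callable

def em(init_params: Any,
       e_step: Callable[[Any], Any],      # returns expected sufficient statistics (e.g. the gammas)
       m_step: Callable[[Any], Any],      # returns updated parameters from those statistics
       num_iters: int = 20) -> Any:
    """Generic EM loop: alternate computing expected statistics under the
    current parameters (E-step) and re-maximizing the parameters (M-step)."""
    params = init_params
    for _ in range(num_iters):
        stats = e_step(params)
        params = m_step(stats)
    return params
```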

EM for Mixtures

In a mixture distribution, the index of the underlying mode is the hidden variable:

p(x) = \sum_{i=1}^{m} λ_i p_i(x) = \sum_{i=1}^{m} p(z = i) p(x | z = i)

Initialize: Provide {λ_i^{(0)}, p_i^{(0)}(x)}

Iterate:
- E-step: Compute γ_t^{(l)}(j) = p(z_t = j | x_t, θ^{(l)}) = p^{(l)}(z_t = j, x_t) / [\sum_k p^{(l)}(z_t = k, x_t)]
- M-step: Update the component model parameters using γ_t-weighted observation statistics, and update the mixture weights using:

  λ_j^{(l+1)} = (1/T) \sum_t γ_t^{(l)}(j)

(Essentially a relative frequency estimate using weighted counts.)

The mixture estimation algorithm works for all types of mixtures, with the difference being in the details of how the distributions p_i(x) are estimated. Important examples include:

- Mixtures of language models (for topic and genre modeling), which use weighted n-gram counts, e.g. c(w_a) = \sum_{t: w_t = w_a} γ_t(j)
- Gaussian mixtures:

  n_j = \sum_t γ_t(j);   μ_j = (1/n_j) \sum_t γ_t(j) x_t;   σ_j^2 = (1/n_j) \sum_t γ_t(j) (x_t - μ_j)^2

where I've dropped the (l) superscript to simplify the notation.

Note: You can also keep the mixture components fixed and just estimate the mixture weights, if the components come from other training data.
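A minimal numeric sketch of the Gaussian-mixture case, assuming invented 1-D data and an arbitrary initialization; it follows the γ-weighted update formulas above but is not the notes' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 1-D data drawn from two modes; purely illustrative.
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])
T, m = len(x), 2

# Initialization: equal weights, spread-out means, unit variances.
lam = np.full(m, 1.0 / m)
mu = np.array([-1.0, 1.0])
var = np.ones(m)

for _ in range(50):
    # E-step: gamma[t, j] = p(z_t = j | x_t) via Bayes' rule over the m components.
    log_joint = (np.log(lam)
                 - 0.5 * np.log(2 * np.pi * var)
                 - 0.5 * (x[:, None] - mu) ** 2 / var)      # log p(z=j) + log N(x_t; mu_j, var_j)
    gamma = np.exp(log_joint - log_joint.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)

    # M-step: gamma-weighted counts -> n_j, mu_j, sigma_j^2, lambda_j.
    n = gamma.sum(axis=0)
    mu = (gamma * x[:, None]).sum(axis=0) / n
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / n
    lam = n / T

print("weights:", lam, "means:", mu, "variances:", var)
```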

EM for HMMs

Define the HMM parameters:

π_j = p(s_1 = j);   a_{jk} = p(s_t = k | s_{t-1} = j);   b_j(o) = p(o_t = o | s_t = j)

E-step: Compute

γ_t^{(l)}(j) = p(s_t = j | o_1, ..., o_T, θ^{(l)})
ξ_t^{(l)}(j, k) = p(s_{t-1} = j, s_t = k | o_1, ..., o_T, θ^{(l)})

M-step: Update the component model parameters using γ_t-weighted observation statistics, and update the transitions using the ξ_t terms. Using the same weighted-counts idea as for mixtures:

a_{jk}^{(l+1)} = \sum_t ξ_t^{(l)}(j, k) / \sum_{k'} \sum_t ξ_t^{(l)}(j, k')

The update for π_j is similar, and the update for the observation distribution depends on the form of b_j(o), but it also uses weighted observations (continuous o) or weighted counts (discrete o).

The model-related details are primarily in the E-step, which is more complicated than in the mixture model because of the sequential dependencies. So now, let's go back to the details of the HMM.
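A sketch of just this M-step for a discrete-observation HMM, assuming the E-step posteriors γ and ξ have already been computed (e.g. with the forward-backward quantities on the next pages); the array layout, 0-based time indexing, and function name are illustrative:

```python
import numpy as np

def hmm_m_step(gamma, xi, obs, num_symbols):
    """Re-estimate (pi, A, B) for a discrete-observation HMM from E-step posteriors.

    gamma[t, j]  = p(s_t = j | o)               for t = 0..T-1  (0-based time)
    xi[t, j, k]  = p(s_t = j, s_{t+1} = k | o)  for t = 0..T-2  (same pairwise
                   posteriors as the notes' xi, just shifted to 0-based indexing)
    obs          = length-T sequence of observation symbol indices
    """
    T, N = gamma.shape
    pi = gamma[0]                                   # p(s_1 = j | o)
    A = xi.sum(axis=0)                              # sum_t xi_t(j, k)
    A /= A.sum(axis=1, keepdims=True)               # normalize over the next state k
    B = np.zeros((N, num_symbols))
    for t in range(T):
        B[:, obs[t]] += gamma[t]                    # gamma-weighted (state, symbol) counts
    B /= gamma.sum(axis=0)[:, None]
    return pi, A, B
```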

Details for HMMs

Questions people ask about HMMs:
1. How do you compute p(o)?
2. What is the most likely state sequence, argmax_s p(s | o)?
3. What is the most likely state at a particular time t, argmax_j p(s_t = j | o)?
4. How do you estimate the parameters of an HMM?

Three key algorithms are needed to answer these questions:

Forward algorithm:
α_t(k) = p(o_1, ..., o_t, s_t = k) = \sum_j α_{t-1}(j) a_{jk} b_k(o_t)

Viterbi algorithm:
δ_t(k) = max_{s_1, ..., s_{t-1}} p(o_1, ..., o_t, s_1, ..., s_{t-1}, s_t = k) = max_j δ_{t-1}(j) a_{jk} b_k(o_t)

Backward algorithm:
β_t(j) = p(o_{t+1}, ..., o_T | s_t = j) = \sum_k β_{t+1}(k) a_{jk} b_k(o_{t+1})
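A sketch of the three recursions for a discrete-observation HMM, using the parameter names pi, A, B in the sense defined on the EM page above; for clarity it omits the scaling or log-space arithmetic that a practical implementation would need to avoid underflow:

```python
import numpy as np

def forward(pi, A, B, obs):
    """alpha[t, k] = p(o_1..o_{t+1}, s_{t+1} = k)   (0-based t)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]     # sum over previous state j
    return alpha

def backward(pi, A, B, obs):
    """beta[t, j] = p(o_{t+2}..o_T | s_{t+1} = j)   (0-based t)."""
    T, N = len(obs), len(pi)
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])   # sum over next state k
    return beta

def viterbi(pi, A, B, obs):
    """Most likely state sequence via the max-product recursion plus traceback."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    back = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A               # scores[j, k] = delta_{t-1}(j) a_jk
        back[t] = scores.argmax(axis=0)                  # best previous state for each k
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    states = [int(delta[T - 1].argmax())]
    for t in range(T - 1, 0, -1):
        states.append(int(back[t, states[-1]]))
    return states[::-1]
```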

These algorithms answer the questions as follows:

1. Use the forward algorithm:

   p(o) = p(o_1, ..., o_T) = \sum_j p(o_1, ..., o_T, s_T = j) = \sum_j α_T(j)

   (we write this sum as α_T below).

2. Use the Viterbi algorithm, keeping track of the maximizing j at each time step, then trace back from the final best state.

3. Use the forward and backward algorithms to find

   γ_t(j) = p(s_t = j | o_1, ..., o_T) = p(s_t = j, o_1, ..., o_T) / p(o) = α_t(j) β_t(j) / α_T

   and use γ_t(j) = p(s_t = j | o) in finding the argmax.

4. Use the forward and backward algorithms in the E-step to find γ_t(j) as above and

   ξ_t(j, k) = p(s_{t-1} = j, s_t = k | o_1, ..., o_T) = α_{t-1}(j) a_{jk} b_k(o_t) β_t(k) / α_T
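A sketch of these posterior computations, reusing the forward and backward functions from the previous sketch (same caveat about numerical scaling; 0-based time indexing):

```python
import numpy as np

def posteriors(pi, A, B, obs):
    """Return p(o), gamma[t, j] = p(s_t = j | o), and
    xi[t, j, k] = p(s_t = j, s_{t+1} = k | o) for t = 0..T-2."""
    alpha = forward(pi, A, B, obs)      # from the previous sketch
    beta = backward(pi, A, B, obs)
    p_o = alpha[-1].sum()               # p(o) = sum_j alpha_T(j)
    gamma = alpha * beta / p_o
    T, N = alpha.shape
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * A
                 * B[:, obs[t + 1]][None, :] * beta[t + 1][None, :]) / p_o
    return p_o, gamma, xi
```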

Hidden-Event Language Models

Hidden-event language models represent a hidden event e_i (such as a sentence or topic boundary) after every word w_i. The model is a bit like an HMM because of the hidden event, but the observations are based on n-gram language models. The bigram case would be:

p(w, e) = p(w_1, e_1, w_2, e_2, ..., w_T, e_T) = p(e_1 | w_1) p(w_1) \prod_{t=2}^{T} p(e_t | w_t, w_{t-1}, e_{t-1}) p(w_t | w_{t-1}, e_{t-1})

The p(e_t | ...) terms are the hidden state sequence model, but differ from an HMM in that the state transitions depend on the words; the p(w_t | ...) terms are the observation model, which is an event-dependent language model.

For many problems in speech processing, it is useful to combine this model with another observation model (e.g. p(f_t | e_t, w_t)) that characterizes acoustic information f_t.

The hidden-event language model can be designed with the SRILM toolkit using the hidden-ngram command.
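A toy sketch of the bigram factorization above; every probability table and function below is an invented placeholder, and a real model would be estimated from event-annotated text (e.g. with SRILM) rather than written by hand:

```python
# Toy event set {"<s>": boundary, "<no-s>": no boundary} and a three-word vocabulary;
# all numbers are made-up placeholders for illustration only.
p_w1 = {"yes": 0.5, "ok": 0.3, "fine": 0.2}
p_e_given_w1 = {"yes": {"<s>": 0.4, "<no-s>": 0.6},
                "ok": {"<s>": 0.3, "<no-s>": 0.7},
                "fine": {"<s>": 0.5, "<no-s>": 0.5}}

def p_w(w, w_prev, e_prev):
    # p(w_t | w_{t-1}, e_{t-1}): after a boundary, any word is equally likely here.
    return 1.0 / 3 if e_prev == "<s>" else {"yes": 0.2, "ok": 0.5, "fine": 0.3}[w]

def p_e(e, w, w_prev, e_prev):
    # p(e_t | w_t, w_{t-1}, e_{t-1}): a fixed boundary rate, for illustration only.
    return 0.3 if e == "<s>" else 0.7

def hidden_event_joint(words, events):
    """p(w, e) for aligned word/event sequences under the bigram factorization."""
    prob = p_e_given_w1[words[0]][events[0]] * p_w1[words[0]]
    for t in range(1, len(words)):
        prob *= (p_e(events[t], words[t], words[t - 1], events[t - 1])
                 * p_w(words[t], words[t - 1], events[t - 1]))
    return prob

print(hidden_event_joint(["ok", "fine", "yes"], ["<no-s>", "<s>", "<no-s>"]))
```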