CS838-1 Advanced NLP: Hidden Markov Models

Xiaojin Zhu, 2007. Send comments to jerryzhu@cs.wisc.edu

1 Part of Speech Tagging

Tag each word in a sentence with its part-of-speech, e.g., The/AT representative/NN put/VBD chairs/NNS on/IN the/AT table/NN. It is useful in information extraction, question answering, and shallow parsing. Tagging is easier than parsing. There are many tag sets, for example (part of the Penn Treebank Tagset):

Tag   Part of Speech                          Examples
AT    article                                 the, a, an
IN    preposition                             atop, for, on, in, with, from
JJ    adjective                               big, old, good
JJR   comparative adjective                   bigger, older, better
MD    modal verb                              may, can, will
NN    singular or mass noun                   fish, lamp
NNP   singular proper noun                    Madison, John
NNS   plural noun                             cameras
PN    personal pronoun                        he, you, me
RB    adverb                                  else, about, afterwards
VB    verb, base form                         go
VBD   verb, past tense                        went
VBG   verb, present participle                going
VBN   verb, past participle                   gone
VBP   verb, non-3rd person singular present   go
VBZ   verb, 3rd person singular present       goes

The difficulty: a word might have multiple possible POS tags: I/PN can/MD can/VB a/AT can/NN. Two kinds of information can be used for POS tagging:

1. Certain tag sequences are more likely than others, e.g., AT JJ NN is quite common (a new book), while AT JJ VBP is unlikely.

2. A word can have multiple possible POS tags, but some are more likely than others, e.g., flour is more often a noun than a verb.

Question: given a word sequence x_{1:N}, how do we compute the most likely POS sequence?

2 Hidden Markov Models

An HMM has the following components:

- K states (e.g., tag types)
- an initial distribution π = (π_1, ..., π_K), a vector of size K.
- emission probabilities p(x | z), where x is an output symbol (e.g., a word) and z is a state (e.g., a tag). In this example, p(x | z) can be a multinomial for each z. Note it is possible for different states to output the same symbol; this is the source of difficulty. Let φ = {p(x | z)}.
- state transition probabilities p(z_n = j | z_{n-1} = i) ≡ A_{ij}, where A is a K × K transition matrix. This is a first-order Markov assumption on the states.

The parameters of an HMM are θ = {π, φ, A}. An HMM can be plotted as a transition diagram (note it is not a graphical model! The nodes are not random variables). However, its graphical model is a linear chain on the hidden nodes z_{1:N}, with observed nodes x_{1:N}. There is a nice urn model that explains the HMM as a generative model. We can run an HMM for N steps and produce x_{1:N}, z_{1:N}. The joint probability is

p(x_{1:N}, z_{1:N} | θ) = p(z_1 | π) p(x_1 | z_1, φ) ∏_{n=2}^{N} p(z_n | z_{n-1}, A) p(x_n | z_n, φ).   (1)

Since the same output can be generated by multiple states, there is usually more than one state sequence that can generate x_{1:N}. The problem of tagging is arg max_{z_{1:N}} p(z_{1:N} | x_{1:N}, θ), i.e., finding the most likely state sequence to explain the observation. As we see later, it is solved with the Viterbi algorithm, or more generally the max-sum algorithm. But we need good parameters θ first, which can be estimated by maximum likelihood on some (labeled and unlabeled) training data using EM (known as the Baum-Welch algorithm for HMMs). EM training involves the so-called forward-backward algorithm, or more generally the sum-product algorithm. Besides tagging, HMMs have been a huge success in acoustic modeling for speech recognition, where the observation is an acoustic feature vector in a short time window and the state is the phoneme.
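To make the joint probability (1) concrete, here is a minimal sketch (not from the notes) that multiplies the terms along the chain for one fixed state sequence; the toy parameters pi, A, phi, with two states and three output symbols, are invented for illustration.

    import numpy as np

    # Toy HMM parameters (assumptions made up for this sketch).
    pi = np.array([0.6, 0.4])                 # initial distribution over K = 2 states
    A = np.array([[0.7, 0.3],                 # A[i, j] = p(z_n = j | z_{n-1} = i)
                  [0.4, 0.6]])
    phi = np.array([[0.5, 0.4, 0.1],          # phi[k, x] = p(x | z = k), multinomial emissions
                    [0.1, 0.3, 0.6]])

    def joint_prob(x, z):
        """p(x_{1:N}, z_{1:N} | theta) as the product along the chain in Eq. (1)."""
        p = pi[z[0]] * phi[z[0], x[0]]
        for n in range(1, len(x)):
            p *= A[z[n - 1], z[n]] * phi[z[n], x[n]]
        return p

    print(joint_prob(x=[0, 2, 1], z=[0, 1, 1]))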

3 HMM Training: The Baum-Welch Algorithm

Let x_{1:N} be a single, very long training sequence of observed output (e.g., a long document), and z_{1:N} its hidden labels (e.g., the corresponding tag sequence). It is easy to extend to multiple training documents.

3.1 The trivial case: z_{1:N} observed

We find the maximum likelihood estimate of θ by maximizing the likelihood of the observed data. Since both x_{1:N} and z_{1:N} are observed, the MLE boils down to frequency estimates. In particular,

- A_{ij} is the fraction of times z = i is followed by z = j in z_{1:N}.
- φ is the MLE of the outputs x under each z. For multinomials, p(x | z) is the fraction of times x is produced under state z.
- π is the fraction of times each state is the first state of a sequence (assuming we have multiple training sequences).
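The following sketch illustrates the fully observed estimates above by simple counting; the array layout (states and symbols as integer indices), the helper name supervised_mle, and the toy sequences are assumptions of this sketch, and no smoothing is applied.

    import numpy as np

    def supervised_mle(x, z, K, V):
        """Frequency estimates of pi, A, phi from one fully observed sequence (no smoothing)."""
        pi = np.zeros(K)
        A = np.zeros((K, K))
        phi = np.zeros((K, V))
        pi[z[0]] += 1.0                       # first state of the (single) sequence
        phi[z[0], x[0]] += 1.0
        for n in range(1, len(x)):
            A[z[n - 1], z[n]] += 1.0          # count transitions i -> j
            phi[z[n], x[n]] += 1.0            # count emissions of symbol x under state z
        pi /= pi.sum()
        A /= A.sum(axis=1, keepdims=True)     # normalize each row over j (unseen states would give 0/0)
        phi /= phi.sum(axis=1, keepdims=True) # normalize each row over output symbols
        return pi, A, phi

    pi, A, phi = supervised_mle(x=[0, 2, 1, 0], z=[0, 1, 1, 0], K=2, V=3)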

3.2 The interesting case: z_{1:N} unobserved

The MLE will maximize (up to a local optimum, see below) the likelihood of the observed data

p(x_{1:N} | θ) = Σ_{z_{1:N}} p(x_{1:N}, z_{1:N} | θ),   (2)

where the summation is over all possible label sequences of length N. Already we see this is an exponential sum with K^N label sequences, and brute force will fail. HMM training uses a combination of dynamic programming and EM to handle this issue. Note the log likelihood involves summing over hidden variables, which suggests we can apply Jensen's inequality to lower bound (2):

log p(x_{1:N} | θ) = log Σ_{z_{1:N}} p(x_{1:N}, z_{1:N} | θ)   (3)
                   = log Σ_{z_{1:N}} p(z_{1:N} | x_{1:N}, θ^old) p(x_{1:N}, z_{1:N} | θ) / p(z_{1:N} | x_{1:N}, θ^old)   (4)
                   ≥ Σ_{z_{1:N}} p(z_{1:N} | x_{1:N}, θ^old) log [ p(x_{1:N}, z_{1:N} | θ) / p(z_{1:N} | x_{1:N}, θ^old) ].   (5)

We want to maximize this lower bound instead. Taking the parts that depend on θ, we define an auxiliary function

Q(θ, θ^old) = Σ_{z_{1:N}} p(z_{1:N} | x_{1:N}, θ^old) log p(x_{1:N}, z_{1:N} | θ).   (6)

The p(x_{1:N}, z_{1:N} | θ) term is defined as the product along the chain in (1). By taking the log, it becomes a summation of terms. Q is the expectation of the individual terms under the distribution p(z_{1:N} | x_{1:N}, θ^old). In fact, more variables will be marginalized out, leading to expectations under very simple distributions. For example, the first term in log p(x_{1:N}, z_{1:N} | θ) is log p(z_1 | π), and its expectation is

Σ_{z_{1:N}} p(z_{1:N} | x_{1:N}, θ^old) log p(z_1 | π) = Σ_{z_1} p(z_1 | x_{1:N}, θ^old) log p(z_1 | π).   (7)

Note we go from an exponential sum over possible sequences to a sum over the single variable z_1, which has only K values. Introducing the shorthand γ_1(k) = p(z_1 = k | x_{1:N}, θ^old), the marginal of z_1 given the input x_{1:N} and the old parameters, the above expectation is Σ_{k=1}^{K} γ_1(k) log π_k. In general, we use the shorthand

γ_n(k) = p(z_n = k | x_{1:N}, θ^old)   (8)
ξ_n(j, k) = p(z_{n-1} = j, z_n = k | x_{1:N}, θ^old)   (9)

to denote the node marginals and edge marginals (conditioned on the input x_{1:N}, under the old parameters). It will be clear that these marginal distributions play an important role. The Q function can be written as

Q(θ, θ^old) = Σ_{k=1}^{K} γ_1(k) log π_k + Σ_{n=1}^{N} Σ_{k=1}^{K} γ_n(k) log p(x_n | z_n = k, φ) + Σ_{n=2}^{N} Σ_{j=1}^{K} Σ_{k=1}^{K} ξ_n(j, k) log A_{jk}.   (10)

We are now ready to state the EM algorithm for HMMs, known as the Baum-Welch algorithm. Our goal is to find θ that maximizes p(x_{1:N} | θ), and we do so via the lower bound, i.e., Q. In particular:

1. Initialize θ randomly or smartly (e.g., using limited labeled data with smoothing).
2. E-step: Compute the node and edge marginals γ_n(k), ξ_n(j, k) for n = 1...N and j, k = 1...K. This is done using the forward-backward algorithm (or more generally the sum-product algorithm), which is discussed in a separate note.
3. M-step: Find θ to maximize Q(θ, θ^old) as in (10).
4. Iterate E-step and M-step until convergence.

As usual, the Baum-Welch algorithm finds a local optimum of θ for HMMs.

4 The M-step

The M-step is a constrained optimization problem since the parameters need to be normalized. As before, one can introduce Lagrange multipliers and set the gradient of the Lagrangian to zero to arrive at

π_k ∝ γ_1(k)   (11)
A_{jk} ∝ Σ_{n=2}^{N} ξ_n(j, k).   (12)

Note A_{jk} is normalized over k. φ is maximized depending on the particular form of the distribution p(x_n | z_n, φ). If it is multinomial, p(x | z = k, φ) is the frequency of x among all outputs, but with each output at step n weighted by γ_n(k).
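The M-step updates (11), (12) and the multinomial emission update can be written directly in terms of the marginals γ and ξ. The sketch below assumes gamma has shape (N, K), xi has shape (N, K, K) with xi[n] holding ξ_n (so xi[0] is unused), and x is a list of symbol indices; these layout choices are assumptions of the sketch, not the notes.

    import numpy as np

    def m_step(x, gamma, xi, V):
        """Re-estimate pi, A, phi from the E-step marginals (multinomial emissions)."""
        N, K = gamma.shape
        pi = gamma[0] / gamma[0].sum()                      # (11): pi_k proportional to gamma_1(k)
        A = xi[1:].sum(axis=0)                              # (12): A_jk proportional to sum_{n=2}^N xi_n(j, k)
        A /= A.sum(axis=1, keepdims=True)                   # normalize over k
        phi = np.zeros((K, V))
        for n in range(N):
            phi[:, x[n]] += gamma[n]                        # count symbol x_n, weighted by gamma_n(k)
        phi /= phi.sum(axis=1, keepdims=True)               # normalize over output symbols
        return pi, A, phi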

5 The E-step

We need to compute the marginals on each node z_n, and on each edge (z_{n-1}, z_n), conditioned on the observation x_{1:N}. Recall this is precisely what the sum-product algorithm can do. We can convert the HMM into a linear-chain factor graph with variable nodes z_{1:N} and factor nodes f_{1:N}. The graph is a chain with the order, from left to right, f_1, z_1, f_2, z_2, ..., f_N, z_N. The factors are

f_1(z_1) = p(z_1 | π) p(x_1 | z_1)   (13)
f_n(z_{n-1}, z_n) = p(z_n | z_{n-1}) p(x_n | z_n).   (14)

We note that x_{1:N} do not appear in the factor graph; instead they are absorbed into the factors. We pass messages from left to right along the chain, and from right to left. This corresponds to the forward-backward algorithm. It is worth noting that since every variable node z_n has only two factor-node neighbors, it simply repeats the message it receives from one neighbor to the other. The left-to-right messages are

μ_{f_n → z_n}(z_n) = Σ_{k=1}^{K} f_n(z_{n-1} = k, z_n) μ_{z_{n-1} → f_n}(z_{n-1} = k)   (15)
                   = Σ_{k=1}^{K} f_n(z_{n-1} = k, z_n) μ_{f_{n-1} → z_{n-1}}(z_{n-1} = k)   (16)
                   = Σ_{k=1}^{K} p(z_n | z_{n-1} = k) p(x_n | z_n) μ_{f_{n-1} → z_{n-1}}(z_{n-1} = k).   (17)

It is not hard to show that

μ_{f_n → z_n}(z_n) = p(x_1, ..., x_n, z_n),   (18)

and this message is called α in the standard forward-backward literature, corresponding to the forward pass.
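Here is a minimal sketch of the forward pass, i.e., computing α_n(k) = p(x_1, ..., x_n, z_n = k) by the recursion (15)-(18). The parameter layout follows the earlier sketches; no scaling or log-space trick is used (the notes do not cover scaling), so very long sequences would underflow.

    import numpy as np

    def forward(x, pi, A, phi):
        """alpha[n, k] = p(x_1, ..., x_n, z_n = k)."""
        N, K = len(x), len(pi)
        alpha = np.zeros((N, K))
        alpha[0] = pi * phi[:, x[0]]                         # f_1(z_1) = p(z_1 | pi) p(x_1 | z_1)
        for n in range(1, N):
            alpha[n] = (alpha[n - 1] @ A) * phi[:, x[n]]     # sum over z_{n-1}, then emit x_n
        return alpha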

Similarly, one can show that the right-to-left message is

μ_{f_{n+1} → z_n}(z_n) = p(x_{n+1}, ..., x_N | z_n).   (19)

This message is called β and corresponds to the backward pass. By multiplying the two incoming messages at any node z_n, we obtain the joint

p(x_{1:N}, z_n) = μ_{f_n → z_n}(z_n) μ_{f_{n+1} → z_n}(z_n)   (20)

since x_{1:N} are observed (see the sum-product lecture notes; we have equality because the HMM is a directed graph). If we sum over z_n, we obtain the marginal likelihood of the observed data

p(x_{1:N}) = Σ_{z_n=1}^{K} μ_{f_n → z_n}(z_n) μ_{f_{n+1} → z_n}(z_n),   (21)

which is useful for monitoring the convergence of Baum-Welch (note that summing at any node z_n will give us this). Finally, the desired node marginal is obtained by

γ_n(k) ≡ p(z_n = k | x_{1:N}) = p(x_{1:N}, z_n = k) / p(x_{1:N}).   (22)

A similar argument gives ξ_n(j, k) using the sum-product algorithm too.

6 Using HMMs for Tagging

Given an input (word) sequence x_{1:N} and an HMM with parameters θ, tagging can be formulated as finding the most likely state sequence, i.e., the one that maximizes p(z_{1:N} | x_{1:N}, θ). This is precisely the problem the max-sum algorithm solves. In the context of HMMs, the max-sum algorithm is known as the Viterbi algorithm.
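For completeness, here is a minimal sketch of the Viterbi algorithm in log space; maximizing p(z_{1:N} | x_{1:N}, θ) is equivalent to maximizing the joint p(x_{1:N}, z_{1:N} | θ), which is what the recursion below does. The parameter layout again follows the earlier sketches and is an assumption of this sketch.

    import numpy as np

    def viterbi(x, pi, A, phi):
        """Return the state sequence z_{1:N} maximizing p(x_{1:N}, z_{1:N} | theta)."""
        N, K = len(x), len(pi)
        log_delta = np.zeros((N, K))          # best log score of any path ending in state k at step n
        backptr = np.zeros((N, K), dtype=int)
        log_delta[0] = np.log(pi) + np.log(phi[:, x[0]])
        for n in range(1, N):
            scores = log_delta[n - 1][:, None] + np.log(A)   # scores[j, k]: come from j, move to k
            backptr[n] = scores.argmax(axis=0)
            log_delta[n] = scores.max(axis=0) + np.log(phi[:, x[n]])
        z = [int(log_delta[-1].argmax())]
        for n in range(N - 1, 0, -1):         # follow back-pointers to recover the best sequence
            z.append(int(backptr[n, z[-1]]))
        return z[::-1]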