Basic Text Analysis. Hidden Markov Models. Joakim Nivre. Uppsala University Department of Linguistics and Philology


Basic Text Analysis
Hidden Markov Models
Joakim Nivre
Uppsala University, Department of Linguistics and Philology
joakim.nivre@lingfil.uu.se

Hidden Markov Models
Markov models are probabilistic sequence models used for problems such as:
1. Speech recognition
2. Spell checking
3. Part-of-speech tagging
4. Named entity recognition
A Markov model runs through a sequence of states, emitting observable signals. If the state sequence cannot be determined from the observation sequence, the model is said to be hidden.

Hidden Markov Models
A Markov model consists of five elements:
1. A finite set of states Q = {q_1, ..., q_|Q|}
2. A finite signal alphabet Σ = {s_1, ..., s_|Σ|}
3. Initial probabilities P(q), defining the probability of starting in state q (for every q ∈ Q)
4. Transition probabilities P(q' | q), defining the probability of going from state q to state q' (for every (q, q') ∈ Q^2)
5. Emission probabilities P(s | q), defining the probability of emitting symbol s in state q (for every (s, q) ∈ Σ × Q)
Remember: if we add a start state, P(q) = P(q | start).

Markov Assumptions
State transitions are assumed to be independent of everything except the current state:
P(q_1, ..., q_n) = ∏_{i=1}^{n} P(q_i | q_{i-1})
Signal emissions are assumed to be independent of everything except the current state:
P(q_1, ..., q_n, s_1, ..., s_n) = P(q_1, ..., q_n) ∏_{i=1}^{n} P(s_i | q_i)
NB: subscripts on states and signals refer to sequence positions.
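To make the two assumptions concrete, here is a minimal Python sketch (not from the slides): the probability tables are stored as plain dictionaries, and the joint probability above is computed in one pass over the sequence. The toy numbers are the supervised estimates from the tagging example later in the lecture.

```python
# Minimal sketch: an HMM as three dictionaries, plus the joint probability
# P(q_1..q_n, s_1..s_n) under the two Markov assumptions above.

def joint_prob(states, signals, init, trans, emit):
    p = init[states[0]] * emit[states[0]][signals[0]]
    for i in range(1, len(states)):
        p *= trans[states[i - 1]][states[i]] * emit[states[i]][signals[i]]
    return p

# Toy parameters (the supervised estimates from the example later in the lecture)
init = {"N": 0.0, "V": 1.0}
trans = {"N": {"N": 0.5, "V": 0.5}, "V": {"N": 1.0, "V": 0.0}}
emit = {"N": {"eat": 0.0, "fish": 0.5, "drink": 0.0, "beer": 0.5},
        "V": {"eat": 0.5, "fish": 0.0, "drink": 0.5, "beer": 0.0}}

print(joint_prob(["V", "N"], ["eat", "fish"], init, trans, emit))  # 1.0 * 0.5 * 1.0 * 0.5 = 0.25
```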

Tagging with an HMM
Modeling assumptions:
1. States represent tag n-grams
2. Signals represent words
Probability model (first-order, unigram states):
P(w_1, ..., w_n, t_1, ..., t_n) = ∏_{i=1}^{n} P(w_i | t_i) P(t_i | t_{i-1})
In an nth-order model, states represent tag n-grams. This gives an (n+1)-gram language model over tags.
Ambiguity: the same word (signal) can be generated by different tags (states).
Language is ambiguous, so the Markov model is hidden.

A Simple First-Order HMM for Tagging [figure only]

Problems for HMMs
Decoding = finding the optimal state sequence:
argmax_{q_1, ..., q_n} P(q_1, ..., q_n, s_1, ..., s_n)
Learning = estimating the model parameters:
P̂(q_j | q_i) for all q_i, q_j
P̂(s | q) for all s, q

Decoding
Given an observation sequence s_1, ..., s_n, compute:
argmax_{q_1, ..., q_n} P(q_1, ..., q_n, s_1, ..., s_n)
Brute-force solution:
- For every possible state sequence q_1, ..., q_n, compute P(q_1, ..., q_n, s_1, ..., s_n) = ∏_{i=1}^{n} P(q_i | q_{i-1}) P(s_i | q_i)
- Pick the sequence with the highest probability
Anything wrong with this?
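As an illustration of the brute-force approach (a sketch only, reusing joint_prob and the toy init/trans/emit dictionaries from the earlier sketch), the decoder is just a loop over all |Q|^n candidate sequences:

```python
from itertools import product

# Brute-force decoding: enumerate every state sequence of length n and keep
# the one with the highest joint probability with the observed signals.
def decode_brute_force(signals, states, init, trans, emit):
    best_seq, best_p = None, -1.0
    for seq in product(states, repeat=len(signals)):   # |Q|^n sequences
        p = joint_prob(list(seq), signals, init, trans, emit)
        if p > best_p:
            best_seq, best_p = seq, p
    return best_seq, best_p

print(decode_brute_force(["eat", "fish"], ["N", "V"], init, trans, emit))
# (('V', 'N'), 0.25)
```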

Time Complexity
How long does it take to compute a solution?
- Actual running time depends on many factors
- Time complexity is about how time grows with input size
Analysis of the brute-force algorithm:
- Computing P(q_1, ..., q_n, s_1, ..., s_n) requires 2n multiplications
- There are |Q|^n possible state sequences of length n
- If each multiplication takes c time, T(n) = c · 2n · |Q|^n
In the long run, only the fastest growing factor matters: T(n) = O(|Q|^n)


Dynamic Programming
What is the problem really?
- The number of state sequences, |Q|^n, grows exponentially
- But the sequences have overlapping subsequences
- The brute-force algorithm does a lot of unnecessary work
Key: a solution of size n contains a solution of size n-1:
argmax_{q_1, ..., q_n} P(q_1, ..., q_n, s_1, ..., s_n) = argmax_{q_n} [ argmax_{q_1, ..., q_{n-1}} P(q_1, ..., q_{n-1}, s_1, ..., s_{n-1}) ] P(q_n | q_{n-1}) P(s_n | q_n)
We can use dynamic programming:
- Create a table for storing partial results
- Make sure that partial results are available when needed
- Avoid recomputing the same result more than once

The Trellis
[trellis diagram: states q_1, ..., q_|Q| on the vertical axis, positions 1, ..., n on the horizontal axis]
For HMMs the table is known as the trellis:
- Every row corresponds to a state q ∈ Q
- Every column corresponds to a position i (1 ≤ i ≤ n)
- The cell (q, i) represents the best way to reach state q at position i

The Viterbi Algorithm
[trellis diagram as on the previous slide]
Goal: find the best cell in column n = the best way to reach any state at position n.
Algorithm: fill the table from left to right, column by column. For each cell (q, i):
- Add q to all best paths in column i-1
- Keep the one with the highest probability
We need one trellis for probabilities (A) and one for paths (B).

The Viterbi Algorithm

for i = 1 to n
  for q = q_1 to q_|Q|
    for q' = q_1 to q_|Q|
      p ← A[q', i-1] · P(q | q') · P(s_i | q)
      if p > A[q, i] then
        A[q, i] ← p
        B[q, i] ← q'
q ← argmax_q A[q, n]
return q, B[q, n], B[B[q, n], n-1], ...
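The pseudocode translates directly into Python; the following is a sketch (again assuming the toy init, trans, and emit dictionaries from the earlier sketches), with A holding the probabilities and B the backpointers:

```python
# Viterbi: fill the trellis column by column, then follow backpointers.
def viterbi(signals, states, init, trans, emit):
    n = len(signals)
    A = [{q: 0.0 for q in states} for _ in range(n)]    # best probabilities
    B = [{q: None for q in states} for _ in range(n)]   # backpointers
    for q in states:                                    # first column
        A[0][q] = init[q] * emit[q][signals[0]]
    for i in range(1, n):                               # remaining columns
        for q in states:
            for q_prev in states:
                p = A[i - 1][q_prev] * trans[q_prev][q] * emit[q][signals[i]]
                if p > A[i][q]:
                    A[i][q] = p
                    B[i][q] = q_prev
    best = max(states, key=lambda q: A[n - 1][q])       # best cell in column n
    path = [best]
    for i in range(n - 1, 0, -1):                       # follow backpointers
        path.append(B[i][path[-1]])
    return list(reversed(path)), A[n - 1][best]

print(viterbi(["eat", "fish"], ["N", "V"], init, trans, emit))
# (['V', 'N'], 0.25)
```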

Viterbi Example [worked example on the trellis, shown step by step as figures]

Time Complexity
Analysis of the Viterbi algorithm:
- Filling one cell requires 2|Q| multiplications
- There are |Q| cells in each column
- There are n columns in the trellis
- If each multiplication takes c time, T(n) = c · 2n · |Q|^2
Worst-case complexity: T(n) = O(n · |Q|^2)


Learning
Supervised learning:
- Given a tagged training corpus, we can estimate parameters using (smoothed) relative frequencies
- This maximizes the joint likelihood of states and signals
- Transition probabilities can be smoothed like n-gram models
- Emission probabilities need special tricks for unknown words
Weakly supervised learning:
- Given a lexicon and an untagged training corpus, we can use Expectation-Maximization (EM) to estimate parameters
- This attempts to maximize the marginal likelihood of the signals (because states are hidden/unobserved)

Maximum Likelihood Estimation
With labeled data D = {(x_1, y_1), ..., (x_n, y_n)}, we set the parameters θ to maximize the joint likelihood of input and output:
argmax_θ ∏_{i=1}^{n} P_θ(y_i) P_θ(x_i | y_i)
With unlabeled data D = {x_1, ..., x_n}, we can instead set the parameters θ to maximize the marginal likelihood of the input:
argmax_θ ∏_{i=1}^{n} ∑_{y ∈ Ω_Y} P_θ(y) P_θ(x_i | y)
In this case, Y is a hidden variable that is marginalized out.

Expectation-Maximization
Joint MLE is easy: just use relative frequencies.
Marginal MLE is hard: there is no closed-form solution.
argmax_θ ∏_{i=1}^{n} ∑_{y ∈ Ω_Y} P_θ(y) P_θ(x_i | y) = ?
What can we do? Use numerical approximation methods.
The most common approach: Expectation-Maximization (EM).

Expectation-Maximization
Basic observations:
- If we know f_N(x, y), we can find the MLE θ
- If we know f_N(x) and the MLE θ, we can derive f_N(x, y)
- If P_θ(y | x) = p and f_N(x) = n, then f_N(x, y) = np
Basic idea behind EM:
1. Start by guessing θ
2. Compute expected counts E[f_N(x, y)] = P_θ(y | x) f_N(x)
3. Find the MLE θ given the expectation E[f_N(x, y)]
4. Repeat steps 2 and 3 until convergence
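Written as code, the recipe is a short loop. This is only a schematic sketch: expected_counts and mle are hypothetical helpers (computing E[f_N(x, y)] = P_θ(y | x) f_N(x) and the relative-frequency MLE, respectively), not anything defined on the slides.

```python
# Schematic EM loop: alternate the E-step and the M-step from a first guess.
def em(data, theta_init, expected_counts, mle, iterations=10):
    theta = theta_init                                  # step 1: initial guess
    for _ in range(iterations):                         # step 4: repeat
        counts = expected_counts(data, theta)           # step 2: E-step
        theta = mle(counts)                             # step 3: M-step
    return theta
```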

Supervised Learning
Lexicon: eat: V; fish: N, V; drink: N, V; beer: N, V
Corpus: <s> eat/V fish/N </s>   <s> drink/V beer/N </s>

Freqs   N      V      eat    fish   drink  beer
<s>     0      2
N       0      0      0      1      0      1
V       2      0      1      0      1      0

Probs   N      V      eat    fish   drink  beer
<s>     0.0    1.0
N       0.5    0.5    0.0    0.5    0.0    0.5
V       1.0    0.0    0.5    0.0    0.5    0.0
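The counts and probabilities above come from simple relative-frequency estimation over the tagged corpus. A small sketch follows (the 0.5/0.5 transition row for N on the slide presumably reflects a default for contexts with no outgoing counts, which this sketch does not implement):

```python
from collections import defaultdict

# Relative-frequency estimation from the tagged toy corpus.
corpus = [[("eat", "V"), ("fish", "N")],
          [("drink", "V"), ("beer", "N")]]

trans_counts = defaultdict(lambda: defaultdict(float))
emit_counts = defaultdict(lambda: defaultdict(float))
for sentence in corpus:
    prev = "<s>"
    for word, tag in sentence:
        trans_counts[prev][tag] += 1     # transition from previous tag (or <s>)
        emit_counts[tag][word] += 1      # emission of the word from its tag
        prev = tag

def normalize(counts):
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

trans_probs = {q: normalize(c) for q, c in trans_counts.items()}
emit_probs = {q: normalize(c) for q, c in emit_counts.items()}
print(trans_probs)  # {'<s>': {'V': 1.0}, 'V': {'N': 1.0}}
print(emit_probs)   # V: eat 0.5, drink 0.5; N: fish 0.5, beer 0.5
```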

EM: First Guess
Lexicon: eat: V; fish: N, V; drink: N, V; beer: N, V
Corpus: <s> eat fish </s>   <s> drink beer </s>

Freqs   N      V      eat    fish   drink  beer
<s>     ?      ?
N       ?      ?      ?      ?      ?      ?
V       ?      ?      ?      ?      ?      ?

Probs   N      V      eat    fish   drink  beer
<s>     0.5    0.5
N       0.5    0.5    0.0    0.33   0.33   0.33
V       0.5    0.5    0.33   0.33   0.33   0.0

EM: First E-Step
Lexicon: eat: V; fish: N, V; drink: N, V; beer: N, V
Corpus (tagging weights):
<s> eat/V fish/N </s> : 0.5
<s> eat/V fish/V </s> : 0.5
<s> drink/N beer/N </s> : 0.5
<s> drink/V beer/N </s> : 0.5

Freqs   N      V      eat    fish   drink  beer
<s>     ?      ?
N       ?      ?      ?      ?      ?      ?
V       ?      ?      ?      ?      ?      ?

Probs   N      V      eat    fish   drink  beer
<s>     0.5    0.5
N       0.5    0.5    0.0    0.33   0.33   0.33
V       0.5    0.5    0.33   0.33   0.33   0.0

EM: First E-Step (expected counts computed from the weights above)
Lexicon: eat: V; fish: N, V; drink: N, V; beer: N, V
Corpus (tagging weights):
<s> eat/V fish/N </s> : 0.5
<s> eat/V fish/V </s> : 0.5
<s> drink/N beer/N </s> : 0.5
<s> drink/V beer/N </s> : 0.5

Freqs   N      V      eat    fish   drink  beer
<s>     0.5    1.5
N       0.5    0      0      0.5    0.5    1
V       1      0.5    1      0.5    0.5    0

Probs   N      V      eat    fish   drink  beer
<s>     0.5    0.5
N       0.5    0.5    0.0    0.33   0.33   0.33
V       0.5    0.5    0.33   0.33   0.33   0.0

EM: First M-Step
Lexicon: eat: V; fish: N, V; drink: N, V; beer: N, V
Corpus: <s> eat fish </s>   <s> drink beer </s>

Freqs   N      V      eat    fish   drink  beer
<s>     0.5    1.5
N       0.5    0      0      0.5    0.5    1
V       1      0.5    1      0.5    0.5    0

Probs   N      V      eat    fish   drink  beer
<s>     0.25   0.75
N       1.0    0.0    0.0    0.25   0.25   0.5
V       0.67   0.33   0.5    0.25   0.25   0.0

EM: Second E-Step (new weights computed; expected counts below are still from the first E-step)
Lexicon: eat: V; fish: N, V; drink: N, V; beer: N, V
Corpus (tagging weights):
<s> eat/V fish/N </s> : 0.86
<s> eat/V fish/V </s> : 0.14
<s> drink/N beer/N </s> : 0.33
<s> drink/V beer/N </s> : 0.67

Freqs   N      V      eat    fish   drink  beer
<s>     0.5    1.5
N       0.5    0      0      0.5    0.5    1
V       1      0.5    1      0.5    0.5    0

Probs   N      V      eat    fish   drink  beer
<s>     0.25   0.75
N       1.0    0.0    0.0    0.25   0.25   0.5
V       0.67   0.33   0.5    0.25   0.25   0.0

EM: Second E-Step (expected counts updated with the new weights)
Lexicon: eat: V; fish: N, V; drink: N, V; beer: N, V
Corpus (tagging weights):
<s> eat/V fish/N </s> : 0.86
<s> eat/V fish/V </s> : 0.14
<s> drink/N beer/N </s> : 0.33
<s> drink/V beer/N </s> : 0.67

Freqs   N      V      eat    fish   drink  beer
<s>     0.33   1.67
N       0.33   0      0      0.86   0.33   1
V       1.19   0.14   1      0.14   0.67   0

Probs   N      V      eat    fish   drink  beer
<s>     0.25   0.75
N       1.0    0.0    0.0    0.25   0.25   0.5
V       0.67   0.33   0.5    0.25   0.25   0.0

EM: Second M-Step
Lexicon: eat: V; fish: N, V; drink: N, V; beer: N, V
Corpus: <s> eat fish </s>   <s> drink beer </s>

Freqs   N      V      eat    fish   drink  beer
<s>     0.33   1.67
N       0.33   0      0      0.86   0.33   1
V       1.19   0.14   1      0.14   0.67   0

Probs   N      V      eat    fish   drink  beer
<s>     0.17   0.83
N       1.0    0.0    0.0    0.39   0.15   0.46
V       0.89   0.11   0.55   0.08   0.37   0.0

EM: Third E-Step (new weights computed; expected counts below are still from the second E-step)
Lexicon: eat: V; fish: N, V; drink: N, V; beer: N, V
Corpus (tagging weights):
<s> eat/V fish/N </s> : 0.98
<s> eat/V fish/V </s> : 0.02
<s> drink/N beer/N </s> : 0.09
<s> drink/V beer/N </s> : 0.91

Freqs   N      V      eat    fish   drink  beer
<s>     0.33   1.67
N       0.33   0      0      0.86   0.33   1
V       1.19   0.14   1      0.14   0.67   0

Probs   N      V      eat    fish   drink  beer
<s>     0.17   0.83
N       1.0    0.0    0.0    0.39   0.15   0.46
V       0.89   0.11   0.55   0.08   0.37   0.0

EM: Third E-Step (expected counts updated with the new weights)
Lexicon: eat: V; fish: N, V; drink: N, V; beer: N, V
Corpus (tagging weights):
<s> eat/V fish/N </s> : 0.98
<s> eat/V fish/V </s> : 0.02
<s> drink/N beer/N </s> : 0.09
<s> drink/V beer/N </s> : 0.91

Freqs   N      V      eat    fish   drink  beer
<s>     0.09   1.91
N       0.09   0      0      0.98   0.09   1
V       1.89   0.02   1      0.02   0.91   0

Probs   N      V      eat    fish   drink  beer
<s>     0.17   0.83
N       1.0    0.0    0.0    0.39   0.15   0.46
V       0.89   0.11   0.55   0.08   0.37   0.0

EM: Third M-Step
Lexicon: eat: V; fish: N, V; drink: N, V; beer: N, V
Corpus: <s> eat fish </s>   <s> drink beer </s>

Freqs   N      V      eat    fish   drink  beer
<s>     0.09   1.91
N       0.09   0      0      0.98   0.09   1
V       1.89   0.02   1      0.02   0.91   0

Probs   N      V      eat    fish   drink  beer
<s>     0.05   0.95
N       1.0    0.0    0.0    0.47   0.04   0.48
V       0.99   0.01   0.52   0.01   0.47   0.0
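For completeness, here is a sketch of one full EM iteration for this toy setting: the E-step enumerates the taggings licensed by the lexicon and weights them by their posterior probability under the current model, and the M-step re-normalizes the expected counts. Starting from the uniform first guess, it reproduces the first iteration above; later iterations can differ slightly, since the slides presumably also track details (such as sentence-final transitions) that are not shown in the tables.

```python
from itertools import product
from collections import defaultdict

lexicon = {"eat": ["V"], "fish": ["N", "V"], "drink": ["N", "V"], "beer": ["N", "V"]}
corpus = [["eat", "fish"], ["drink", "beer"]]

def seq_prob(tags, words, start, trans, emit):
    # Joint probability P(tags, words) under the first-order model
    p = start[tags[0]] * emit[tags[0]].get(words[0], 0.0)
    for i in range(1, len(words)):
        p *= trans[tags[i - 1]][tags[i]] * emit[tags[i]].get(words[i], 0.0)
    return p

def em_step(corpus, lexicon, start, trans, emit):
    start_c = defaultdict(float)
    trans_c = defaultdict(lambda: defaultdict(float))
    emit_c = defaultdict(lambda: defaultdict(float))
    for words in corpus:                                        # E-step
        taggings = list(product(*[lexicon[w] for w in words]))  # lexicon-licensed taggings
        probs = [seq_prob(t, words, start, trans, emit) for t in taggings]
        total = sum(probs)
        for tags, p in zip(taggings, probs):
            w = p / total                                       # posterior P(tags | words)
            start_c[tags[0]] += w
            emit_c[tags[0]][words[0]] += w
            for i in range(1, len(words)):
                trans_c[tags[i - 1]][tags[i]] += w
                emit_c[tags[i]][words[i]] += w
    def norm(counts):                                           # M-step: relative frequencies
        total = sum(counts.values())
        return {k: v / total for k, v in counts.items()}
    return (norm(start_c),
            {q: norm(c) for q, c in trans_c.items()},
            {q: norm(c) for q, c in emit_c.items()})

# Uniform first guess, as on the "EM: First Guess" slide
start = {"N": 0.5, "V": 0.5}
trans = {"N": {"N": 0.5, "V": 0.5}, "V": {"N": 0.5, "V": 0.5}}
emit = {"N": {"fish": 1 / 3, "drink": 1 / 3, "beer": 1 / 3},
        "V": {"eat": 1 / 3, "fish": 1 / 3, "drink": 1 / 3}}

start, trans, emit = em_step(corpus, lexicon, start, trans, emit)
print(start)  # {'V': 0.75, 'N': 0.25} -- matches the first M-step
print(trans)  # V: N 0.67, V 0.33; N: N 1.0, V 0.0
print(emit)   # V: eat 0.5, fish 0.25, drink 0.25; N: fish 0.25, drink 0.25, beer 0.5
```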

Convergence
- EM is guaranteed to converge to a local maximum of the likelihood function
- A local maximum may not be the global maximum
- Even if we reach the global maximum, it may not be the same as for the supervised model (Why?)
- In general, EM is quite sensitive to initialization

EM for Hidden Markov Models
Computing expectations:
E[f_N(q, q')] = ∑_{i=1}^{n} P(s_1, ..., s_n, q_i = q, q_{i-1} = q')
E[f_N(s, q)] = ∑_{i=1}^{n} P(s_1, ..., s_{i-1}, s_i = s, ..., s_n, q_i = q)
Difficulty: summing over all possible state sequences
- The number of state sequences, |Q|^n, grows exponentially
- Sounds familiar? We can use dynamic programming again: the forward-backward algorithm
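As a pointer to what this computation looks like, here is a sketch of the forward pass only (reusing the toy init, trans, and emit dictionaries from the earlier sketches): it fills the same trellis as Viterbi but sums over predecessor states instead of maximizing, yielding P(s_1, ..., s_n). The full forward-backward computation for EM additionally needs the analogous backward pass.

```python
# Forward algorithm: like Viterbi, but summing instead of maximizing.
def forward(signals, states, init, trans, emit):
    alpha = [{q: init[q] * emit[q][signals[0]] for q in states}]
    for i in range(1, len(signals)):
        alpha.append({
            q: emit[q][signals[i]] * sum(alpha[i - 1][qp] * trans[qp][q] for qp in states)
            for q in states
        })
    return sum(alpha[-1].values())   # P(s_1, ..., s_n)

print(forward(["eat", "fish"], ["N", "V"], init, trans, emit))
# 0.25 with the supervised toy parameters from the first sketch
```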

Summary
- Markov models are probabilistic sequence models used for part-of-speech tagging and many similar problems
- They can be trained from labeled data using relative frequency estimation, or from unlabeled data and a lexicon using EM
- Thanks to the Markov assumptions, computation can be made efficient using dynamic programming:
  - Most probable state sequence: Viterbi
  - Probability of the signal sequence: Forward or Backward
  - Expectations for EM: Forward-Backward