
Statistical Methods for NLP: Sequence Models
Joakim Nivre
Uppsala University, Department of Linguistics and Philology
joakim.nivre@lingfil.uu.se

Introduction: Structured Classification

In many NLP tasks, the output (and the input) is structured:
- Part-of-speech tagging: input is a sequence X_1, ..., X_n of words; output is a sequence Y_1, ..., Y_n of tags.
- Syntactic parsing: input is a sequence X_1, ..., X_n of words; output is a parse tree Y consisting of nodes, edges and labels.

Models for structured classification:
- Sequence models (today)
- Stochastic grammars (lecture 6)

Introduction: Sequence Classification

Why not just take one element at a time?

  for i = 1 to n do
    y_i = argmax_y P(Y_i = y | X_i)

Because the desired solution may be the jointly optimal sequence:

  argmax_{y_1,...,y_n} P(Y_1 = y_1, ..., Y_n = y_n | X_1, ..., X_n)

Two approaches:
- Local optimization: greedy method, may be suboptimal
- Global optimization: requires independence assumptions

Hidden Markov Models

Markov models are probabilistic sequence models used for problems such as:
1. Speech recognition
2. Spell checking
3. Part-of-speech tagging
4. Named entity recognition

A (discrete) Markov model runs through a sequence of states, emitting signals. If the state sequence cannot be determined from the signal sequence, the model is said to be hidden.

Hidden Markov Models: Definition

A Markov model consists of five elements:
1. A finite set of states Ω = {s_1, ..., s_k}.
2. A finite signal alphabet Σ = {σ_1, ..., σ_m}.
3. Initial probabilities P(s) (for every s ∈ Ω), defining the probability of starting in state s.
4. Transition probabilities P(s_i | s_j) (for every (s_i, s_j) ∈ Ω²), defining the probability of going from state s_j to state s_i.
5. Emission probabilities P(σ | s) (for every (σ, s) ∈ Σ × Ω), defining the probability of emitting symbol σ in state s.
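
To make the five elements concrete, the sketch below stores such a model as plain Python dictionaries. The states, alphabet and all probability values are illustrative assumptions (loosely modelled on the dt/nn/vb example that appears later in the slides), not numbers from the lecture.

```python
# A hidden Markov model as plain dictionaries.
# All states, symbols and probabilities below are made-up illustrations.

states = ["dt", "nn", "vb"]          # Omega: finite set of states
alphabet = ["the", "can", "smells"]  # Sigma: finite signal alphabet

# Initial probabilities P(s): probability of starting in state s
initial = {"dt": 0.6, "nn": 0.3, "vb": 0.1}

# Transition probabilities P(s' | s): probability of going from s to s'
transition = {
    "dt": {"dt": 0.0, "nn": 0.9, "vb": 0.1},
    "nn": {"dt": 0.1, "nn": 0.3, "vb": 0.6},
    "vb": {"dt": 0.4, "nn": 0.4, "vb": 0.2},
}

# Emission probabilities P(sigma | s): probability of emitting sigma in state s
emission = {
    "dt": {"the": 1.0, "can": 0.0, "smells": 0.0},
    "nn": {"the": 0.0, "can": 0.7, "smells": 0.3},
    "vb": {"the": 0.0, "can": 0.5, "smells": 0.5},
}
```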

Hidden Markov Models: A Simple HMM

[Figure: state-transition diagram of a simple HMM; the diagram itself is not recoverable from the transcription.]

Hidden Markov Models: Markov Assumptions

State transitions are assumed to be independent of everything except the current state:

  P(s_1, ..., s_n) = P(s_1) ∏_{i=1}^{n-1} P(s_{i+1} | s_i)

Signal emissions are assumed to be independent of everything except the current state:

  P(s_1, ..., s_n, σ_1, ..., σ_n) = P(s_1, ..., s_n) ∏_{i=1}^{n} P(σ_i | s_i)
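
As a small illustration of the two assumptions, the function below multiplies out the factorized joint probability P(s_1, ..., s_n, σ_1, ..., σ_n) for a given state and signal sequence; a sketch that assumes the dictionary representation (initial, transition, emission) from the earlier example.

```python
def joint_probability(state_seq, signal_seq, initial, transition, emission):
    """P(s_1..s_n, sigma_1..sigma_n) under the two Markov assumptions:
    each state depends only on the previous state, and each signal
    depends only on the state that emits it."""
    p = initial[state_seq[0]] * emission[state_seq[0]][signal_seq[0]]
    for i in range(1, len(state_seq)):
        p *= transition[state_seq[i - 1]][state_seq[i]]   # P(s_i | s_{i-1})
        p *= emission[state_seq[i]][signal_seq[i]]        # P(sigma_i | s_i)
    return p

# Example (using the illustrative model above):
# joint_probability(["dt", "nn", "vb"], ["the", "can", "smells"],
#                   initial, transition, emission)
```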

Hidden Markov Models: Observation Sequences

The probability of a signal sequence is obtained by summing over state sequences:

  P(σ_1, ..., σ_n) = ∑_{s_1,...,s_n ∈ Ω^n} P(s_1, ..., s_n, σ_1, ..., σ_n)

Looks familiar? The HMM states are hidden variables, so we can use EM to train an HMM on unlabeled data.

Hidden Markov Models: Problems for HMMs

Optimal state sequence:

  argmax_{s_1,...,s_n} P(s_1, ..., s_n, σ_1, ..., σ_n)

Probability of signal sequence:

  P(σ_1, ..., σ_n) = ∑_{s_1,...,s_n ∈ Ω^n} P(s_1, ..., s_n, σ_1, ..., σ_n)

Expected counts for hidden variables (for EM):

  E[C(s)] = P(σ_1, ..., σ_n, s_1 = s)
  E[C(s, s')] = ∑_{i=1}^{n-1} P(σ_1, ..., σ_n, s_i = s, s_{i+1} = s')
  E[C(s, σ)] = ∑_{i: σ_i = σ} P(σ_1, ..., σ_n, s_i = s)

Hidden Markov Models: Problems for HMMs (continued)

Difficulty: summing (or maximizing) over all possible state sequences; the number |Ω|^n of sequences grows exponentially.

Key observation: a solution of size n contains a solution of size n-1. For the probability of a signal sequence:

  P(σ_1, ..., σ_n) = ∑_{s_n ∈ Ω} P(σ_1, ..., σ_n, s_n)
  P(σ_1, ..., σ_i, s_i) = ∑_{s_{i-1} ∈ Ω} P(σ_i | s_i) P(s_i | s_{i-1}) P(σ_1, ..., σ_{i-1}, s_{i-1})

Dynamic programming algorithms are therefore applicable.

Hidden Markov Models: Algorithms for HMMs (Viterbi)

Optimal state sequence (Viterbi):

  viterbi(i, s) ≡ max_{s_1,...,s_{i-1}} P(σ_1, ..., σ_i, s_1, ..., s_{i-1}, s_i = s)
  viterbi(1, s) = P(s) P(σ_1 | s)
  viterbi(i, s) = max_{s'} viterbi(i-1, s') P(s | s') P(σ_i | s)
  max_s viterbi(n, s) = max_{s_1,...,s_n} P(σ_1, ..., σ_n, s_1, ..., s_n)

Note: the equations above compute the max; to get the argmax, we store back-pointers to the best states.
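
A minimal sketch of the Viterbi recursion with back-pointers, assuming the dictionary-based model representation used above; it is meant to mirror the equations, not to be an efficient implementation (e.g. it works in plain probabilities rather than log space).

```python
def viterbi(signals, states, initial, transition, emission):
    """Return the most probable state sequence and its joint probability."""
    n = len(signals)
    v = [{} for _ in range(n)]     # v[i][s] = viterbi(i+1, s) in the slides
    back = [{} for _ in range(n)]  # back-pointers to the best previous state
    for s in states:
        v[0][s] = initial[s] * emission[s][signals[0]]
    for i in range(1, n):
        for s in states:
            best = max(states, key=lambda sp: v[i - 1][sp] * transition[sp][s])
            v[i][s] = v[i - 1][best] * transition[best][s] * emission[s][signals[i]]
            back[i][s] = best
    # Recover the argmax by following back-pointers from the best final state.
    last = max(states, key=lambda s: v[n - 1][s])
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path)), v[n - 1][last]
```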

Hidden Markov Models: Algorithms for HMMs (Forward and Backward)

Probability of signal sequence (Forward):

  α(i, s) ≡ P(σ_1, ..., σ_i, s_i = s)
  α(1, s) = P(s) P(σ_1 | s)
  α(i, s) = ∑_{s'} α(i-1, s') P(s | s') P(σ_i | s)
  ∑_s α(n, s) = P(σ_1, ..., σ_n)

Probability of signal sequence (Backward):

  β(i, s) ≡ P(σ_{i+1}, ..., σ_n | s_i = s)
  β(n, s) = 1
  β(i, s) = ∑_{s'} β(i+1, s') P(s' | s) P(σ_{i+1} | s')
  ∑_s β(1, s) P(s) P(σ_1 | s) = P(σ_1, ..., σ_n)
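
The same recursions written out as code, again assuming the dictionary representation above; a sketch rather than a production implementation (no scaling or log-space arithmetic).

```python
def forward(signals, states, initial, transition, emission):
    """alpha[i][s] = P(sigma_1..sigma_{i+1}, s_{i+1} = s) (0-based index i)."""
    n = len(signals)
    alpha = [{} for _ in range(n)]
    for s in states:
        alpha[0][s] = initial[s] * emission[s][signals[0]]
    for i in range(1, n):
        for s in states:
            alpha[i][s] = emission[s][signals[i]] * sum(
                alpha[i - 1][sp] * transition[sp][s] for sp in states)
    return alpha

def backward(signals, states, transition, emission):
    """beta[i][s] = P(sigma_{i+2}..sigma_n | s_{i+1} = s) (0-based index i)."""
    n = len(signals)
    beta = [{} for _ in range(n)]
    for s in states:
        beta[n - 1][s] = 1.0
    for i in range(n - 2, -1, -1):
        for s in states:
            beta[i][s] = sum(beta[i + 1][sp] * transition[s][sp] *
                             emission[sp][signals[i + 1]] for sp in states)
    return beta

# P(sigma_1..sigma_n) computed two ways (they should agree):
#   sum(forward(...)[-1].values())
#   sum(initial[s] * emission[s][signals[0]] * backward(...)[0][s] for s in states)
```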

Hidden Markov Models: Algorithms for HMMs (Forward-Backward)

Expected counts of hidden states (Forward-Backward):

  E[C(s)] = P(s) P(σ_1 | s) β(1, s)
  E[C(s, s')] = ∑_{i=1}^{n-1} α(i, s) P(s' | s) P(σ_{i+1} | s') β(i+1, s')
  E[C(s, σ)] = ∑_{i: σ_i = σ} α(i, s) β(i, s)

Algorithm analysis:
- All algorithms can be implemented to fill an |Ω| × n matrix
- All algorithms run in O(|Ω|² n) time
- Compare to O(|Ω|^n) for the naive implementation
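
Building on the forward and backward sketches above, the counts needed for the E-step can be accumulated as follows. This follows the slide's formulas and leaves the counts unnormalized; for a single training sequence the shared normalizer P(σ_1, ..., σ_n) cancels in the M-step ratios.

```python
def expected_counts(signals, states, initial, transition, emission):
    """Unnormalized expected counts E[C(s)], E[C(s,s')], E[C(s,sigma)]
    computed from the forward and backward quantities."""
    n = len(signals)
    alpha = forward(signals, states, initial, transition, emission)
    beta = backward(signals, states, transition, emission)
    init_count = {s: initial[s] * emission[s][signals[0]] * beta[0][s]
                  for s in states}
    trans_count = {
        (s, sp): sum(alpha[i][s] * transition[s][sp] *
                     emission[sp][signals[i + 1]] * beta[i + 1][sp]
                     for i in range(n - 1))
        for s in states for sp in states}
    emit_count = {
        (s, o): sum(alpha[i][s] * beta[i][s]
                    for i in range(n) if signals[i] == o)
        for s in states for o in set(signals)}
    return init_count, trans_count, emit_count
```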

Part-of-Speech Tagging

A typical sequence classification problem: given a word sequence w_1, ..., w_n, determine the corresponding part-of-speech (tag) sequence t_1, ..., t_n.

Probabilistic view of the problem:

  argmax_{t_1,...,t_n} P(t_1, ..., t_n | w_1, ..., w_n)

First-order HMM (tag bigram model):
- Signals represent words
- States represent part-of-speech tags
- Tagging amounts to finding the optimal state sequence

Part-of-Speech Tagging: Modeling

Generative model:

  argmax_{t_1,...,t_n} P(t_1, ..., t_n | w_1, ..., w_n)
  = argmax_{t_1,...,t_n} P(t_1, ..., t_n, w_1, ..., w_n)
  = argmax_{t_1,...,t_n} P(t_1, ..., t_n) P(w_1, ..., w_n | t_1, ..., t_n)

Markov assumptions:

  P(t_1, ..., t_n) = ∏_{i=1}^{n} P(t_i | t_{i-1})
  P(w_1, ..., w_n | t_1, ..., t_n) = ∏_{i=1}^{n} P(w_i | t_i)

Model parameters:
- P(t): start probabilities for all tags t
- P(t | t'): transition probabilities for all pairs of tags t, t'
- P(w | t): emission probabilities for words w and tags t

Part-of-Speech Tagging: Modeling (continued)

[Figure: a three-state tag HMM over the states dt, nn and vb, with start probability P(dt), transition probabilities P(nn|dt), P(vb|dt), P(nn|nn), P(vb|nn), P(nn|vb), P(vb|vb), and emission probabilities P(the|dt), P(can|nn), P(can|vb), P(smells|nn), P(smells|vb).]

Part-of-Speech Tagging: Learning

Supervised learning: given a tagged training corpus, we can estimate the parameters using (smoothed) relative frequencies.

Weakly supervised learning: given a lexicon and an untagged training corpus, we can use EM to estimate the parameters:
- EM for HMMs = the Baum-Welch algorithm
- E-step = Forward-Backward algorithm (see above)
- M-step = MLE given the expected counts
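
For the supervised case, a minimal sketch of (unsmoothed) relative-frequency estimation from a tagged corpus; the input format, a list of sentences given as (word, tag) pairs, is an assumption for illustration.

```python
from collections import Counter

def relative_freqs(cond_counts):
    """Turn a dict of Counters into conditional relative frequencies."""
    return {cond: {x: c / sum(ctr.values()) for x, c in ctr.items()}
            for cond, ctr in cond_counts.items()}

def estimate_hmm(tagged_sentences):
    """Estimate P(t), P(t'|t) and P(w|t) by relative frequency.
    tagged_sentences: list of sentences, each a list of (word, tag) pairs."""
    init_counts = Counter()
    trans_counts, emit_counts = {}, {}
    for sent in tagged_sentences:
        tags = [t for _, t in sent]
        init_counts[tags[0]] += 1                       # sentence-initial tag
        for prev, nxt in zip(tags, tags[1:]):
            trans_counts.setdefault(prev, Counter())[nxt] += 1
        for word, tag in sent:
            emit_counts.setdefault(tag, Counter())[word] += 1
    total = sum(init_counts.values())
    initial = {t: c / total for t, c in init_counts.items()}
    return initial, relative_freqs(trans_counts), relative_freqs(emit_counts)
```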

Part-of-Speech Tagging: Smoothing

Transition probabilities:
- Structurally similar to n-gram probabilities
- Standard methods such as additive smoothing

Emission probabilities:
- Structurally similar to the Naive Bayes likelihood
- Standard methods for known words
- Special treatment of unknown words (affixes, capitalization)
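
As one concrete option for the transition probabilities, here is a sketch of add-δ (additive) smoothing over raw transition counts; the count format matches the estimation sketch above, and δ = 0.5 is an arbitrary illustrative value.

```python
def smoothed_transitions(trans_counts, tagset, delta=0.5):
    """Add-delta smoothing:
    P(t2 | t1) = (C(t1, t2) + delta) / (C(t1) + delta * |tagset|)."""
    probs = {}
    for t1 in tagset:
        row = trans_counts.get(t1, {})
        denom = sum(row.values()) + delta * len(tagset)
        probs[t1] = {t2: (row.get(t2, 0) + delta) / denom for t2 in tagset}
    return probs
```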

Part-of-Speech Tagging: Inference and Evaluation

Inference: finding the optimal state (tag) sequence
- The Viterbi algorithm runs in O(|Ω|² n) time
- A sparse matrix may save space (and time)

Evaluation: compare the output t'_1, ..., t'_n to the gold standard t_1, ..., t_n:

  Accuracy = (∑_{i=1}^{n} [[t'_i = t_i]]) / n

- Confidence intervals and tests for proportions
- Precision and recall for specific classes (tags)
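
The accuracy measure amounts to a one-liner; a sketch, with a made-up three-token example in the comment.

```python
def accuracy(predicted_tags, gold_tags):
    """Fraction of positions where the predicted tag equals the gold tag."""
    assert len(predicted_tags) == len(gold_tags)
    return sum(p == g for p, g in zip(predicted_tags, gold_tags)) / len(gold_tags)

# accuracy(["dt", "nn", "vb"], ["dt", "nn", "nn"])  ->  2/3
```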

Conclusion: HMM Applications

HMMs are widely used in NLP (and elsewhere):
- Speech recognition
- Optical character recognition
- Spell checking
- Morphological segmentation
- Part-of-speech tagging
- Named entity recognition
- Chunking
- Disfluency detection
- Topic segmentation
- DNA sequence analysis
- Protein classification
- ...

Conclusion: Other Sequence Models

The HMM is a generative model: it models the joint distribution of inputs and outputs.

Discriminative models:
- Maximum Entropy Markov Models (MEMM)
- Conditional Random Fields (CRF)

Weighted Finite-State Transducers (WFST):
- Finite-state automata for relations
- Weighted transitions combining transition and emission
- Subsume HMMs and other probabilistic models