More on HMMs and other sequence models Intro to NLP - ETHZ - 18/03/2013

Summary. Parts-of-speech tagging. HMMs: unsupervised parameter estimation, the Forward-Backward algorithm, Bayesian variants. Discriminative sequence models (Yasemin Altun): MaxEnt and MEMM, CRF, perceptron HMM, structured SVM.

"Squad helps dog bite victim" bite -> verb? bite -> noun? POS ambiguity "Dealers will hear Car Talk at noon" car -> noun & talk -> verb? car & talk -> proper names?

PoS tagging. The process of assigning a part-of-speech tag (label) to each word in a text. Example: She/PRP promised/VBD to/TO back/VB the/DT bill/NN.

Applications. A useful pre-processing step in many tasks: speech synthesis, syntactic parsing, machine translation, information retrieval, named entity recognition, summarization. Example title: "But Siri, Apple's personal assistant application on the iPhone 4S, doesn't disappoint".

PTB tagset

Sequence classification. Dependencies between variables: P(NN | DT) >> P(VB | DT), so independent per-word tagging is suboptimal. Sequence models classify whole sequences y_1:T. HMMs: a. given an observation sequence x_1:T, predict the best tagging y_1:T (Viterbi algorithm); b. given a model theta and x_1:T, compute P(x_1:T | theta) (forward algorithm); c. given a dataset of sequences, estimate theta (ML estimation). A sketch of the Viterbi decoder from (a) is given below.
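The slides only name the Viterbi algorithm, so here is a minimal NumPy sketch of it; the parameter names pi (START transitions), A (transitions), B (emissions) and end (transitions into the final state) are my own, chosen to match the example model that appears a few slides below.

```python
import numpy as np

def viterbi(obs, pi, A, B, end):
    """Most probable state sequence for obs under an HMM with explicit END transitions."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))             # best score of any path ending in state j at time t
    back = np.zeros((T, N), dtype=int)   # backpointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A          # scores[i, j] = delta_{t-1}(i) * a_ij
        back[t] = scores.argmax(axis=0)             # best predecessor of each state j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    last = int((delta[-1] * end).argmax())          # fold in the transition into END
    path = [last]
    for t in range(T - 1, 0, -1):                   # follow backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# e.g. viterbi(obs, pi, A, B, end) with the example model defined below
# returns the most likely V/N tag sequence for "board backs plan vote".
```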

Unsupervised parameter estimation In many scenarios there is little or no annotated data How can we learn a model from unsupervised data, or a combination of labeled/unlabeled data?

HMMs. HMM = (Q, O, A, B): 1. States: Q = q_1..q_N [the part-of-speech tags] a. including special initial/final states q_0 and q_F b. lambda = (A, B). 2. Observation symbols: O = o_1..o_V [words]. 3. Transitions: a. A = {a_ij}; a_ij = P(q_t = j | q_{t-1} = i) ~ Multi(a_i). 4. Emissions: a. B = {b_ik}; b_ik = P(o_t = v_k | q_t = i) ~ Multi(b_i).

Complete data likelihood. The joint probability of a sequence of words and tags, given a model: P(o_1:T, q_1:T | lambda) = prod_{t=1..T} a_{q_{t-1} q_t} b_{q_t}(o_t), with q_0 the START state and a final transition a_{q_T, q_F}. Generative process: 1. generate a tag sequence; 2. emit the words for each tag.
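As a small illustration of that formula, a few lines of NumPy that multiply out the generative process; the parameter names pi/A/B/end are mine (the example model appears a few slides below), not from the slides.

```python
import numpy as np

def joint_prob(obs, tags, pi, A, B, end):
    """Complete-data likelihood P(x_{1:T}, y_{1:T} | lambda):
    generate the tag sequence, then emit one word per tag."""
    p = pi[tags[0]] * B[tags[0], obs[0]]                   # START -> y_1, emit x_1
    for t in range(1, len(obs)):
        p *= A[tags[t - 1], tags[t]] * B[tags[t], obs[t]]  # y_{t-1} -> y_t, emit x_t
    return p * end[tags[-1]]                               # y_T -> END
```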

Observation probability. Given an HMM theta = (A, B) and an observation sequence o_1:N, compute P(o_1:N | theta). Applications: language modeling. It is the complete data likelihood summed over all possible tag sequences: P(o_1:N | theta) = sum over q_1:N of P(o_1:N, q_1:N | theta).

Data likelihood. Given a dataset of sequences X = {x_1 ... x_N}, the likelihood function is L(theta) = prod_n P(x_n | theta). Maximum likelihood problem: find the model parameters that maximize the probability of the data, theta* = argmax_theta L(theta).

Baum-Welch algorithm. Also called the "forward-backward" algorithm; a variant of the EM (Expectation Maximization) algorithm: 1. start with an initial assignment for (A, B); 2. while the likelihood improves: a. compute the expected counts needed for the parameters under the current model (E-step), b. re-estimate the parameters (M-step). Guaranteed to improve the data likelihood (up to a local optimum).

Forward probability. alpha_t(j) = P(o_1:t, q_t = j | lambda): the probability of being in state j at time t having observed o_1:t, summing over all paths up to t-1 leading to j. Init: alpha_1(j) = a_{0j} b_j(o_1). Recursion: alpha_t(j) = [sum_i alpha_{t-1}(i) a_{ij}] b_j(o_t). Final: P(o_1:T | lambda) = sum_i alpha_T(i) a_{iF}.

Forward algorithm

Example: model
A =        V     N     END
    V      0.4   0.6   0.3
    N      0.5   0.5   0.7
    START  0.4   0.6

B =        board  backs  plan  vote
    V      0.1    0.4    0.3   0.2
    N      0.4    0.1    0.2   0.3
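A minimal NumPy sketch of the forward algorithm on exactly this model (the array and function names are mine); it reproduces the alpha values filled into the trellis on the next slides.

```python
import numpy as np

STATES = ["V", "N"]
WORDS = {"board": 0, "backs": 1, "plan": 2, "vote": 3}

pi  = np.array([0.4, 0.6])              # START -> V, N
A   = np.array([[0.4, 0.6],             # V -> V, N
                [0.5, 0.5]])            # N -> V, N
end = np.array([0.3, 0.7])              # V -> END, N -> END
B   = np.array([[0.1, 0.4, 0.3, 0.2],   # emissions from V: board backs plan vote
                [0.4, 0.1, 0.2, 0.3]])  # emissions from N

def forward(obs):
    """alpha[t, i] = P(o_1..o_t, q_t = i | lambda); returns (alpha, P(O | lambda))."""
    T = len(obs)
    alpha = np.zeros((T, len(STATES)))
    alpha[0] = pi * B[:, obs[0]]                      # init
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # sum over predecessors, then emit
    return alpha, alpha[-1] @ end                     # final transition into END

obs = [WORDS[w] for w in "board backs plan vote".split()]
alpha, p_obs = forward(obs)
print(np.round(alpha, 4))  # columns V, N: 0.04/0.24, 0.0544/0.0144, 0.0087/0.008, 0.0015/0.0028
print(round(p_obs, 4))     # 0.0024
```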

Forward computation (trellis for "board backs plan vote", filled left to right):
Time        1       2        3        4
alpha(V)    0.04    0.0544   0.0087   0.0015
alpha(N)    0.24    0.0144   0.0080   0.0028
END                                   0.0024  (= P(O | lambda))

Backward probability. beta_t(i) = P(o_{t+1:T} | q_t = i, lambda): the probability of the remaining observations o_{t+1:T} given state i at time t. Init: beta_T(i) = a_{iF}. Recursion: beta_t(i) = sum_j a_{ij} b_j(o_{t+1}) beta_{t+1}(j). Final: P(o_1:T | lambda) = sum_j a_{0j} b_j(o_1) beta_1(j).
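A matching NumPy sketch of the backward pass (the example model is repeated so the block stands alone; names are mine). It reproduces the beta values on the following slides and yields the same P(O | lambda) = 0.0024 as the forward pass.

```python
import numpy as np

# Example model repeated from the forward sketch above
pi  = np.array([0.4, 0.6])
A   = np.array([[0.4, 0.6], [0.5, 0.5]])
end = np.array([0.3, 0.7])
B   = np.array([[0.1, 0.4, 0.3, 0.2],
                [0.4, 0.1, 0.2, 0.3]])
WORDS = {"board": 0, "backs": 1, "plan": 2, "vote": 3}

def backward(obs):
    """beta[t, i] = P(o_{t+1}..o_T | q_t = i, lambda); returns (beta, P(O | lambda))."""
    T = len(obs)
    beta = np.zeros((T, len(pi)))
    beta[-1] = end                                      # init: beta_T(i) = a_{i,END}
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])  # recursion
    return beta, pi @ (B[:, obs[0]] * beta[0])          # final: fold in START and o_1

obs = [WORDS[w] for w in "board backs plan vote".split()]
beta, p_obs = backward(obs)
print(np.round(beta, 4))   # columns V, N: 0.0076/0.0086, 0.0342/0.036, 0.15/0.135, 0.3/0.7
print(round(p_obs, 4))     # 0.0024, same as the forward pass
```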

Backward computation (trellis for "board backs plan vote", filled right to left):
Time       1        2        3       4
beta(V)    0.0076   0.0342   0.150   0.3
beta(N)    0.0086   0.0360   0.135   0.7
START      0.0024  (= P(O | lambda), matching the forward pass)

Parameter estimation (supervised). Maximum likelihood estimates (MLE) on tagged data (see the counting sketch below): 1. Transition probabilities: a_ij = Count(i -> j) / Count(i -> any). 2. Emission probabilities: b_ik = Count(state i emits v_k) / Count(state i).
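A small plain-Python sketch of these counts; the boundary symbols "<s>"/"</s>" and the function names are my own choices, not from the slides.

```python
from collections import defaultdict

def mle_estimate(tagged_corpus):
    """MLE for HMM parameters from tagged sentences [(word, tag), ...]:
    a_ij = count(i -> j) / count(i -> .), b_ik = count(i emits k) / count(i)."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    for sent in tagged_corpus:
        prev = "<s>"                       # assumed start-of-sentence symbol
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
        trans[prev]["</s>"] += 1           # assumed end-of-sentence symbol

    def normalize(table):
        return {i: {k: c / sum(row.values()) for k, c in row.items()}
                for i, row in table.items()}

    return normalize(trans), normalize(emit)

# e.g. mle_estimate([[("she", "PRP"), ("promised", "VBD"), ("to", "TO"),
#                     ("back", "VB"), ("the", "DT"), ("bill", "NN")]])
```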

Parameter estimation (unsupervised). The same ratios, but with the counts replaced by expected counts under the current model: 1. transition probabilities from expected transition counts; 2. emission probabilities from expected state-occupation counts.

Parameter estimation. 1. Expected number of times in state i for observation O: gamma_t(i) = alpha_t(i) beta_t(i) / P(O | lambda). 2. Expected number of transitions from i to j for observation O: xi_t(i, j) = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / P(O | lambda). A sketch computing both from the forward/backward tables follows.
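A sketch of both expected counts computed from the forward/backward tables (alpha, beta, p_obs as returned by the sketches above; the function name is mine). With the example model it gives gamma[1, 0] of about 0.78 and xi[1, 0, 0] of about 0.41, matching the gamma_2(V) and xi_2(V, V) values on the next slides up to the rounding of P(O) used there.

```python
import numpy as np

def posteriors(alpha, beta, A, B, obs, p_obs):
    """State and transition posteriors for one observation sequence:
    gamma[t, i] = alpha_t(i) * beta_t(i) / P(O | lambda)
    xi[t, i, j] = alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j) / P(O | lambda)"""
    T, N = alpha.shape
    gamma = alpha * beta / p_obs
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / p_obs
    return gamma, xi

# e.g. gamma, xi = posteriors(alpha, beta, A, B, obs, p_obs)
# with the outputs of the forward/backward sketches above.
```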

Gamma. Example on the trellis for "board backs plan vote": gamma_2(V) = alpha_2(V) beta_2(V) / P(O | lambda) = 0.0544 * 0.0342 / 0.0024 = 0.7752.

Emission re-estimation: b_ik = (expected number of times in state i emitting word v_k) / (expected number of times in state i) = sum over {t : o_t = v_k} of gamma_t(i) / sum over t of gamma_t(i).

Xi. Example on the trellis for "board backs plan vote": xi_2(V, V) = alpha_2(V) a_VV b_V(o_3) beta_3(V) / P(O | lambda) = 0.0544 * 0.4 * 0.3 * 0.15 / 0.0024, approximately 0.40.

Transition re-estimation: a_ij = (expected number of transitions from i to j) / (expected number of transitions out of i) = sum over t of xi_t(i, j) / sum over t of gamma_t(i).

Forward-Backward
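Putting the pieces together: a sketch of one full forward-backward (Baum-Welch) iteration over a corpus, written under the same conventions as the sketches above (explicit END transitions; the array and function names are mine, not from the slides). Repeated calls should never decrease the returned log-likelihood.

```python
import numpy as np

def forward_backward_iteration(corpus, pi, A, B, end):
    """One EM (Baum-Welch) iteration over a list of observation sequences.
    E-step: accumulate expected counts via alpha/beta; M-step: re-normalize."""
    N, V = B.shape
    exp_start, exp_end = np.zeros(N), np.zeros(N)
    exp_trans, exp_emit = np.zeros((N, N)), np.zeros((N, V))
    log_likelihood = 0.0
    for obs in corpus:
        T = len(obs)
        # forward pass
        alpha = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        p_obs = alpha[-1] @ end
        # backward pass
        beta = np.zeros((T, N))
        beta[-1] = end
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        # E-step: expected counts
        gamma = alpha * beta / p_obs
        exp_start += gamma[0]
        exp_end += gamma[-1]
        for t in range(T - 1):
            exp_trans += alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / p_obs
        for t in range(T):
            exp_emit[:, obs[t]] += gamma[t]
        log_likelihood += np.log(p_obs)
    # M-step: re-estimate by normalizing the expected counts
    new_pi = exp_start / exp_start.sum()
    out_of_state = exp_trans.sum(axis=1) + exp_end
    new_A = exp_trans / out_of_state[:, None]
    new_end = exp_end / out_of_state
    new_B = exp_emit / exp_emit.sum(axis=1, keepdims=True)
    return new_pi, new_A, new_B, new_end, log_likelihood

# e.g. with the four example sentences on the next slide encoded as word indices:
# corpus = [[0, 1, 2, 3], [3, 1, 0, 2], [1, 0, 3, 2], [2, 3, 1, 0]]
# pi, A, B, end, ll = forward_backward_iteration(corpus, pi, A, B, end)
```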

Example. X = [ "board backs plan vote", "vote backs board plan", "backs board vote plan", "plan vote backs board" ]
A =        V     N     END
    V      0.4   0.6   0.3
    N      0.5   0.5   0.7
    START  0.4   0.6

B =        board  backs  plan  vote
    V      0.1    0.4    0.3   0.2
    N      0.4    0.1    0.2   0.3

Example. Optimizing only some of the parameters in each M-step also improves the log-likelihood: this is Generalized EM.

Bayesian modeling. Maximum likelihood chooses the model lambda that maximizes P(data | lambda). Bayesian methods instead model the parameter distribution: P(data | lambda) P(lambda). Advantages: they model uncertainty about the data; they can prefer models with certain properties, e.g. sparsity; non-parametric models allow the number of hidden variables to be unknown a priori.

Bayesian inference. 1. Sampling (e.g., Gibbs). 2. Variational Bayes: a. find (point-estimate) parameters lambda that minimize an upper bound on the negative log-likelihood, including the priors. 3. Priors on parameter distributions: Dirichlet (conjugate to the multinomial).

Bayesian HMMs. HMM = (Q, O, A, B): 1. States: Q = q_1..q_N [the part-of-speech tags] a. including special initial/final states q_0 and q_F. 2. Observation symbols: O = o_1..o_V [words]. 3. Transitions: a. A = {a_ij}; a_ij = P(q_t = j | q_{t-1} = i) ~ Multi(a_i) b. a_i | alpha_A ~ Dir(alpha_A). 4. Emissions: a. B = {b_ik}; b_ik = P(o_t = v_k | q_t = i) ~ Multi(b_i) b. b_i | alpha_B ~ Dir(alpha_B). The hyperparameters alpha control the sparsity of A and B.

Parameter estimation - VB. 1. Transition probabilities and 2. emission probabilities are re-estimated from the same expected counts as in EM, with the Dirichlet hyperparameters folded in (see the sketch below); F denotes the variational bound (free energy) being optimized.
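The re-estimation formulas omitted from the transcription follow the standard mean-field VB update for Dirichlet-multinomial parameters, as used for HMM POS tagging by Gao & Johnson (2008): expected counts plus the hyperparameter are pushed through an exp-digamma transform instead of being normalized directly. A minimal sketch (function name mine), usable in place of the row normalizations in the Baum-Welch sketch above:

```python
import numpy as np
from scipy.special import digamma

def vb_mstep(expected_counts, alpha_prior):
    """Variational-Bayes substitute for the EM M-step:
        theta_k = exp(digamma(E[n_k] + alpha)) / exp(digamma(sum_k (E[n_k] + alpha)))
    computed row-wise over the last axis. Small alpha (< 1) pushes the
    resulting distributions toward sparse solutions."""
    counts = expected_counts + alpha_prior
    return np.exp(digamma(counts)) / np.exp(digamma(counts.sum(axis=-1, keepdims=True)))

# e.g. A_new = vb_mstep(exp_trans, 0.1); B_new = vb_mstep(exp_emit, 0.1)
# Note: the rows sum to slightly less than one; that is a property of this point estimate.
```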

Example with varying alpha

alpha = 1.0, after FB. X = [ "board backs plan vote", "vote backs board plan", "backs board vote plan", "plan vote backs board" ]
A =        V     N     END
    V      0.34  0.66  0.12
    N      0.13  0.87  0.88
    START  0.51  0.49

B =        board  backs  plan  vote
    V      0.25   0.32   0.18  0.24
    N      0.25   0.22   0.28  0.25

alpha = 0.1, after FB. X = [ "board backs plan vote", "vote backs board plan", "backs board vote plan", "plan vote backs board" ]
A =        V     N     END
    V      0.0   1.0   0.0
    N      0.36  0.64  1.0
    START  0.2   0.8

B =        board  backs  plan  vote
    V      0.0    1.0    0.0   0.0
    N      0.33   0.0    0.33  0.33

[Figure] Tag frequencies (Johnson, 2007).

Summary. Parts-of-speech tagging. HMMs: unsupervised parameter estimation, the Forward-Backward algorithm, Bayesian variants. Discriminative sequence models (Yasemin Altun): MaxEnt and MEMM, CRF, perceptron HMM, structured SVM.

References - Gao & Johnson, "A Comparison of Bayesian Estimators for unsupervised Hidden Markov Model POS taggers". EMNLP 2008. - Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models" - Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition" - Neal & Hinton, "A view of the EM algorithm that justifies incremental, sparse and other variants"

Bayesian HMMs. Model parameters: a. a_ij ~ Multi(a_i) b. a_i | alpha_1 ~ Dir(alpha_1) c. b_ik ~ Multi(b_i) d. b_i | alpha_2 ~ Dir(alpha_2). The Dirichlet is conjugate to the multinomial, which simplifies inference: 1. Sampling (e.g., Gibbs). 2. Variational Bayes: a. find parameters lambda that minimize an upper bound on the negative log-likelihood, including the priors.