More on HMMs and other sequence models Intro to NLP - ETHZ - 18/03/2013
Summary
- Parts-of-speech tagging
- HMMs: unsupervised parameter estimation, Forward-Backward algorithm, Bayesian variants
- Discriminative sequence models (Yasemin Altun): MaxEnt and MEMM, CRF, Perceptron HMM, Structured SVM
"Squad helps dog bite victim" bite -> verb? bite -> noun? POS ambiguity "Dealers will hear Car Talk at noon" car -> noun & talk -> verb? car & talk -> proper names?
PoS tagging
The process of assigning a part-of-speech tag (label) to each word in a text.
She/PRP promised/VBD to/TO back/VB the/DT bill/NN
Applications
A useful pre-processing step in many tasks:
- Speech synthesis
- Syntactic parsing
- Machine translation
- Information retrieval
- Named entity recognition
- Summarization
Title: "But Siri, Apple's personal assistant application on the iPhone 4S, doesn't disappoint"
PTB tagset
Sequence classification
Dependencies between variables: P(NN | DT) >> P(VB | DT), so independent per-word tagging is suboptimal.
Sequence models classify whole sequences y_{1:T}. Three problems for HMMs:
a. Given observation x_{1:T}, predict the best tagging y_{1:T}: Viterbi algorithm (a compact sketch follows below)
b. Given model theta and x_{1:T}, compute P(x_{1:T} | theta): Forward algorithm
c. Given a dataset of sequences, estimate theta: ML estimation
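A minimal Python sketch of Viterbi decoding for problem (a). The dictionary encoding of A (transitions, including 'START'/'END' states) and B (emissions) is an assumption made for illustration, not notation from the slides:

    # Viterbi: best tag sequence arg max_y P(x, y | theta)
    def viterbi(words, states, A, B):
        delta = [{j: A['START'][j] * B[j][words[0]] for j in states}]
        back = [{}]
        for t in range(1, len(words)):
            delta.append({})
            back.append({})
            for j in states:
                # best predecessor state for reaching j at time t
                i_best = max(states, key=lambda i: delta[t - 1][i] * A[i][j])
                delta[t][j] = delta[t - 1][i_best] * A[i_best][j] * B[j][words[t]]
                back[t][j] = i_best
        # termination: include the transition into the final state
        last = max(states, key=lambda i: delta[-1][i] * A[i]['END'])
        tags = [last]
        for t in range(len(words) - 1, 0, -1):
            tags.append(back[t][tags[-1]])
        return list(reversed(tags))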
Unsupervised parameter estimation
In many scenarios there is little or no annotated data. How can we learn a model from unlabeled data, or from a combination of labeled and unlabeled data?
HMMs
HMM = (Q, O, A, B), with lambda = (A, B)
1. States: Q = q_1..q_N [the part-of-speech tags], including special initial/final states q_0 and q_F
2. Observation symbols: O = o_1..o_V [words]
3. Transitions: A = {a_ij}; a_ij = P(q_t = j | q_{t-1} = i), with each row a_i a multinomial distribution
4. Emissions: B = {b_ik}; b_ik = P(o_t = v_k | q_t = i), with each row b_i a multinomial distribution
Complete data likelihood
The joint probability of a sequence of words and tags, given a model:
P(x_{1:T}, y_{1:T} | theta) = prod_{t=1..T} P(y_t | y_{t-1}) P(x_t | y_t)
Generative process: 1. generate a tag sequence, 2. emit the words for each tag
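As a sketch of this generative factorization in Python (the dictionary encoding of A and B, and the function name, are illustrative assumptions):

    # P(words, tags | theta) = prod_t P(tag_t | tag_{t-1}) * P(word_t | tag_t)
    def joint_prob(words, tags, A, B):
        p = A['START'][tags[0]]
        for t, (w, tag) in enumerate(zip(words, tags)):
            if t > 0:
                p *= A[tags[t - 1]][tag]   # transition
            p *= B[tag][w]                 # emission
        return p * A[tags[-1]]['END']      # transition into the final state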
Observation probability
Given HMM theta = (A, B) and observation sequence o_{1:T}, compute P(o_{1:T} | theta)
Applications: language modeling
Sum the complete data likelihood over all possible tag sequences:
P(o_{1:T} | theta) = sum_{q_{1:T}} P(o_{1:T}, q_{1:T} | theta)
Data likelihood
- Given a dataset of sequences X = {x_1, ..., x_N}
- Likelihood function: L(theta) = P(X | theta) = prod_{n=1..N} P(x_n | theta)
- Maximum likelihood problem: theta* = argmax_theta L(theta)
- Find the model parameters that maximize the probability of the data
Baum-Welch algorithm
Also known as the "forward-backward" algorithm; a variant of the EM (Expectation-Maximization) algorithm:
1. Start with an initial assignment for (A, B)
2. While the likelihood improves:
a. Compute the expectations needed for re-estimating the parameters, based on the current model (E-step)
b. Re-estimate (A, B) from these expectations (M-step)
Guaranteed to improve the data likelihood (up to a local maximum)
Forward probability
alpha_t(j) = probability of being in state j having observed o_{1:t}:
alpha_t(j) = P(o_1, ..., o_t, q_t = j | theta)
Recursion (sum over all paths up to t-1 leading to j): alpha_t(j) = [ sum_i alpha_{t-1}(i) a_ij ] b_j(o_t)
Init: alpha_1(j) = a_{0j} b_j(o_1)
Final: P(o_{1:T} | theta) = sum_i alpha_T(i) a_{iF}
Forward algorithm
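Since the pseudocode is not reproduced here, a minimal Python sketch of the forward pass (the dictionary encoding of A and B, including 'START' and 'END' entries, is an assumption):

    # Forward algorithm: alpha[t][j] = P(o_1 .. o_t, q_t = j | theta)
    # A[i][j] = transition probability i -> j; B[j][w] = emission probability of word w from state j
    def forward(words, states, A, B):
        alpha = [{} for _ in words]
        # initialization: leave START and emit the first word
        for j in states:
            alpha[0][j] = A['START'][j] * B[j][words[0]]
        # recursion: sum over all predecessor states
        for t in range(1, len(words)):
            for j in states:
                alpha[t][j] = sum(alpha[t - 1][i] * A[i][j] for i in states) * B[j][words[t]]
        # termination: transition into the final state
        return sum(alpha[-1][i] * A[i]['END'] for i in states), alpha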
Example: model
A =        V     N     END
   V      0.4   0.6   0.3
   N      0.5   0.5   0.7
   START  0.4   0.6
B =       board  backs  plan  vote
   V       0.1    0.4    0.3   0.2
   N       0.4    0.1    0.2   0.3
Forward computation (trellis over the observation "board backs plan vote", times 1-4; alpha values filled in left to right):
       board     backs      plan       vote
V      a=0.04    a=0.0544   a=0.0087   a=0.0015
N      a=0.24    a=0.0144   a=0.0080   a=0.0028
END                                    a=0.0024  (final: P(o_{1:4} | theta), summing the last alphas into END)
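For reference, the trellis above can be checked with the forward sketch given earlier (the dictionary encoding of the toy model is an assumption):

    states = ['V', 'N']
    A = {'START': {'V': 0.4, 'N': 0.6},
         'V': {'V': 0.4, 'N': 0.6, 'END': 0.3},
         'N': {'V': 0.5, 'N': 0.5, 'END': 0.7}}
    B = {'V': {'board': 0.1, 'backs': 0.4, 'plan': 0.3, 'vote': 0.2},
         'N': {'board': 0.4, 'backs': 0.1, 'plan': 0.2, 'vote': 0.3}}
    prob, alpha = forward("board backs plan vote".split(), states, A, B)
    # alpha[0]['V'] = 0.04, alpha[0]['N'] = 0.24, alpha[1]['V'] = 0.0544, ..., prob ~ 0.0024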
Backward probability
beta_t(i): probability of the remaining observations o_{t+1:T}, starting from state i at time t:
beta_t(i) = P(o_{t+1}, ..., o_T | q_t = i, theta)
Init: beta_T(i) = a_{iF}
Recursion: beta_t(i) = sum_j a_ij b_j(o_{t+1}) beta_{t+1}(j)
Final: P(o_{1:T} | theta) = sum_j a_{0j} b_j(o_1) beta_1(j)
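A matching Python sketch of the backward pass (same assumed representation as the forward sketch above):

    # Backward probability: beta[t][i] = P(o_{t+1} .. o_T | q_t = i, theta)
    def backward(words, states, A, B):
        T = len(words)
        beta = [{} for _ in words]
        # initialization: probability of ending the sequence from each state
        for i in states:
            beta[T - 1][i] = A[i]['END']
        # recursion, moving right to left
        for t in range(T - 2, -1, -1):
            for i in states:
                beta[t][i] = sum(A[i][j] * B[j][words[t + 1]] * beta[t + 1][j] for j in states)
        # termination: gives the same observation probability as the forward pass
        return sum(A['START'][j] * B[j][words[0]] * beta[0][j] for j in states), beta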
Backward computation (same trellis; beta values filled in right to left):
        board      backs      plan      vote
V       b=0.0076   b=0.0342   b=0.150   b=0.3
N       b=0.0086   b=0.0360   b=0.135   b=0.7
START   b=0.0024  (final: P(o_{1:4} | theta), matching the forward result)
Parameter estimation (supervised)
Maximum likelihood estimates (MLE) on tagged data:
1. Transition probabilities: a_ij = Count(i -> j) / Count(transitions out of i)
2. Emission probabilities: b_ik = Count(state i emits v_k) / Count(state i)
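A small sketch of these relative-frequency estimates in Python (the corpus format, a list of (word, tag) sentences, and the function name are assumptions):

    from collections import Counter

    # Supervised MLE: relative-frequency estimates from a tagged corpus
    def mle_estimates(tagged_corpus):
        trans, emit, prev_count, tag_count = Counter(), Counter(), Counter(), Counter()
        for sent in tagged_corpus:
            tags = ['START'] + [t for _, t in sent] + ['END']
            for prev, cur in zip(tags, tags[1:]):
                trans[prev, cur] += 1
                prev_count[prev] += 1
            for w, t in sent:
                emit[t, w] += 1
                tag_count[t] += 1
        A = {(i, j): c / prev_count[i] for (i, j), c in trans.items()}   # P(j | i)
        B = {(t, w): c / tag_count[t] for (t, w), c in emit.items()}     # P(w | t)
        return A, B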
Parameter estimation (unsupervised)
1. Transition probabilities: a_ij = (expected number of transitions i -> j) / (expected number of transitions out of i)
2. Emission probabilities: b_ik = (expected number of times in state i emitting v_k) / (expected number of times in state i)
Parameter estimation
1. Expected number of times in state i for observation O:
gamma_t(i) = P(q_t = i | O, theta) = alpha_t(i) beta_t(i) / P(O | theta)
2. Expected number of transitions from i to j for observation O:
xi_t(i, j) = P(q_t = i, q_{t+1} = j | O, theta) = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / P(O | theta)
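A sketch of these two expectations in Python, built from the forward/backward tables of the earlier sketches (the representation of alpha, beta, and obs_prob is an assumption carried over from those sketches):

    # gamma[t][i] = alpha[t][i] * beta[t][i] / P(O | theta)
    # xi[t][i][j] = alpha[t][i] * A[i][j] * B[j][o_{t+1}] * beta[t+1][j] / P(O | theta)
    def expectations(words, states, A, B, alpha, beta, obs_prob):
        gamma = [{i: alpha[t][i] * beta[t][i] / obs_prob for i in states}
                 for t in range(len(words))]
        xi = [{i: {j: alpha[t][i] * A[i][j] * B[j][words[t + 1]] * beta[t + 1][j] / obs_prob
                   for j in states} for i in states}
              for t in range(len(words) - 1)]
        return gamma, xi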
Gamma
gamma_t(i) = P(q_t = i | O, theta) = alpha_t(i) beta_t(i) / P(O | theta)
Gamma example (same trellis, observation "board backs plan vote"):
gamma_2(V) = alpha_2(V) beta_2(V) / P(O | theta) = 0.0544 * 0.0342 / 0.0024 = 0.7752
Emission re-estimation
b_ik = (expected number of times in state i emitting word v_k) / (expected number of times in state i)
     = sum_{t : o_t = v_k} gamma_t(i) / sum_t gamma_t(i)
Xi
xi_t(i, j) = P(q_t = i, q_{t+1} = j | O, theta) = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / P(O | theta)
Xi example (same trellis):
xi_2(V, V) = alpha_2(V) a_VV b_V(plan) beta_3(V) / P(O | theta) ≈ 0.40
Transition re-estimation
a_ij = (expected number of transitions from i to j) / (expected number of transitions out of i)
     = sum_t xi_t(i, j) / sum_t gamma_t(i)
Forward-Backward
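Since the Forward-Backward slide itself is only a title here, a hedged sketch of one full EM iteration, reusing the forward, backward, and expectations helpers sketched above (function and variable names are assumptions):

    # One Baum-Welch (EM) iteration over a corpus of word sequences
    def baum_welch_step(corpus, states, vocab, A, B):
        trans = {i: {j: 0.0 for j in list(states) + ['END']} for i in states}
        emit = {i: {w: 0.0 for w in vocab} for i in states}
        occ = {i: 0.0 for i in states}     # expected time spent in each state
        start = {i: 0.0 for i in states}   # expected counts out of START
        for words in corpus:
            p, alpha = forward(words, states, A, B)
            _, beta = backward(words, states, A, B)
            gamma, xi = expectations(words, states, A, B, alpha, beta, p)
            for i in states:
                start[i] += gamma[0][i]
            for t, w in enumerate(words):
                for i in states:
                    occ[i] += gamma[t][i]
                    emit[i][w] += gamma[t][i]
                    if t < len(words) - 1:
                        for j in states:
                            trans[i][j] += xi[t][i][j]
                    else:
                        trans[i]['END'] += gamma[t][i]
        # M-step: turn expected counts into probabilities
        A_new = {i: {j: c / sum(trans[i].values()) for j, c in trans[i].items()} for i in states}
        A_new['START'] = {i: start[i] / len(corpus) for i in states}
        B_new = {i: {w: c / occ[i] for w, c in emit[i].items()} for i in states}
        return A_new, B_new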
Example
X = [ "board backs plan vote",
      "vote backs board plan",
      "backs board vote plan",
      "plan vote backs board" ]
A =        V     N     END
   V      0.4   0.6   0.3
   N      0.5   0.5   0.7
   START  0.4   0.6
B =       board  backs  plan  vote
   V       0.1    0.4    0.3   0.2
   N       0.4    0.1    0.2   0.3
Example: optimizing only some of the parameters in each M-step also improves the log-likelihood (LogL): Generalized EM
Bayesian modeling
Maximum likelihood: choose the model lambda that maximizes P(data | lambda)
Bayesian methods: place a distribution over the model parameters: P(data | lambda) P(lambda)
Advantages:
- Model uncertainty about the data
- Prefer models with certain properties, e.g. sparsity
- Non-parametric models: number of hidden variables unknown a priori
Bayesian inference
1. Sampling (e.g., Gibbs sampling)
2. Variational Bayes: find (point-estimate) parameters lambda that minimize an upper bound on the negative log-likelihood, including the priors
3. Priors on parameter distributions: Dirichlet (conjugate to the multinomial)
Bayesian HMMs
HMM = (Q, O, A, B)
1. States: Q = q_1..q_N [the part-of-speech tags], including special initial/final states q_0 and q_F
2. Observation symbols: O = o_1..o_V [words]
3. Transitions: A = {a_ij}; a_ij = P(q_t = j | q_{t-1} = i); each row a_i | alpha_A ~ Dir(alpha_A)
4. Emissions: B = {b_ik}; b_ik = P(o_t = v_k | q_t = i); each row b_i | alpha_B ~ Dir(alpha_B)
alpha controls the sparsity of A and B
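A tiny numpy illustration of how the Dirichlet hyperparameter controls sparsity (the dimension 5 and the seed are arbitrary, just for intuition):

    import numpy as np

    rng = np.random.default_rng(0)
    # alpha >= 1: draws tend to be fairly flat; alpha << 1: most mass on a few outcomes
    print(rng.dirichlet([1.0] * 5))   # e.g. a roughly spread-out probability vector
    print(rng.dirichlet([0.1] * 5))   # e.g. one or two entries take almost all the mass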
Parameter estimation - VB
As in EM, but the expected counts are passed through F (after adding the hyperparameter) before normalizing:
1. Transition probabilities: a_ij = F(E[n_ij] + alpha_A) / F(E[n_i] + N alpha_A)
2. Emission probabilities: b_ik = F(E[n_ik] + alpha_B) / F(E[n_i] + V alpha_B)
F: F(v) = exp(psi(v)), where psi is the digamma function
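This is the variational-Bayes recipe used by Johnson (2007) and Gao & Johnson (2008): the usual expected-count ratios, reweighted through F(v) = exp(psi(v)). A minimal sketch of that reweighting (names are illustrative; consult the papers for the exact update):

    from math import exp
    from scipy.special import digamma

    # VB "counts-to-probabilities": add the Dirichlet hyperparameter, then apply exp(digamma(.))
    # (compare with plain EM, which would just divide the raw expected counts)
    def vb_weights(expected_counts, alpha):
        total = sum(expected_counts.values()) + alpha * len(expected_counts)
        return {k: exp(digamma(c + alpha)) / exp(digamma(total))
                for k, c in expected_counts.items()}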
Example with varying alpha
alpha = 1.0, after FB
X = [ "board backs plan vote",
      "vote backs board plan",
      "backs board vote plan",
      "plan vote backs board" ]
A =        V     N     END
   V      0.34  0.66  0.12
   N      0.13  0.87  0.88
   START  0.51  0.49
B =       board  backs  plan  vote
   V       0.25   0.32   0.18  0.24
   N       0.25   0.22   0.28  0.25
alpha = 0.1, after FB
X = [ "board backs plan vote",
      "vote backs board plan",
      "backs board vote plan",
      "plan vote backs board" ]
A =        V     N     END
   V      0.0   1.0   0.0
   N      0.36  0.64  1.0
   START  0.2   0.8
B =       board  backs  plan  vote
   V       0.0    1.0    0.0   0.0
   N       0.33   0.0    0.33  0.33
tag-frequencies (Johnson, 2007)
Summary
- Parts-of-speech tagging
- HMMs: unsupervised parameter estimation, Forward-Backward algorithm, Bayesian variants
- Discriminative sequence models (Yasemin Altun): MaxEnt and MEMM, CRF, Perceptron HMM, Structured SVM
References
- Gao & Johnson, "A Comparison of Bayesian Estimators for Unsupervised Hidden Markov Model POS Taggers". EMNLP 2008.
- Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models". ICSI Technical Report TR-97-021, 1998.
- Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition". Proceedings of the IEEE, 1989.
- Neal & Hinton, "A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants". In Learning in Graphical Models, 1998.
Bayesian HMMs
Model parameters:
a. a_i: multinomial over next states, a_ij = P(q_t = j | q_{t-1} = i)
b. a_i | alpha_1 ~ Dir(alpha_1)
c. b_i: multinomial over words, b_ik = P(o_t = v_k | q_t = i)
d. b_i | alpha_2 ~ Dir(alpha_2)
The Dirichlet is conjugate to the multinomial; inference:
1. Sampling (e.g., Gibbs sampling)
2. Variational Bayes: find parameters lambda that minimize an upper bound on the negative log-likelihood, including the priors