More on HMMs and other sequence models. Intro to NLP - ETHZ - 18/03/2013

1 More on HMMs and other sequence models Intro to NLP - ETHZ - 18/03/2013

2 Summary
- Part-of-speech tagging
- HMMs:
  - Unsupervised parameter estimation
  - Forward-Backward algorithm
  - Bayesian variants
- Discriminative sequence models: Yasemin Altun
  - MaxEnt and MEMM
  - CRF
  - Perceptron HMM
  - Structured SVM

3 POS ambiguity
"Squad helps dog bite victim"
- bite -> verb?
- bite -> noun?
"Dealers will hear Car Talk at noon"
- car -> noun & talk -> verb?
- car & talk -> proper names?

4 PoS tagging
The process of assigning a part-of-speech tag (label) to each word in a text.
She/PRP promised/VBD to/TO back/VB the/DT bill/NN

5 Applications
A useful pre-processing step in many tasks:
- Speech synthesis
- Syntactic parsing
- Machine translation
- Information retrieval
- Named entity recognition
- Summarization
Title: "But Siri, Apple's personal assistant application on the iPhone 4S, doesn't disappoint"

6 PTB tagset

7 Sequence classification
- Dependencies between variables: $P(\mathrm{NN} \mid \mathrm{DT}) \gg P(\mathrm{VB} \mid \mathrm{DT})$, ...
- Independent per-word tagging is suboptimal
- Sequence models: classify whole sequences $y_{1:T}$
- HMMs:
  a. Given an observation $x_{1:T}$, predict the best tagging $y_{1:T}$: Viterbi algorithm
  b. Given a model $\theta$ and $x_{1:T}$, compute $P(x_{1:T} \mid \theta)$: Forward algorithm
  c. Given a dataset of sequences, estimate $\theta$: ML estimation
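Task (a) can be illustrated with a minimal Viterbi sketch. The two-tag model below is hypothetical: the tag names follow the deck's later example, but every probability is made up for illustration, and the function name `viterbi` is an assumption rather than anything defined in the slides.

```python
# Hypothetical toy HMM; every probability below is made up for illustration.
STATES = ["V", "N"]
start = {"V": 0.2, "N": 0.8}                      # transitions out of START
trans = {"V": {"V": 0.1, "N": 0.6, "END": 0.3},   # tag-to-tag and tag-to-END
         "N": {"V": 0.5, "N": 0.2, "END": 0.3}}
emit = {"V": {"board": 0.1, "backs": 0.4, "plan": 0.2, "vote": 0.3},
        "N": {"board": 0.4, "backs": 0.1, "plan": 0.3, "vote": 0.2}}


def viterbi(words):
    """Return (best tag sequence, its joint probability) under the toy model."""
    # delta[j]: probability of the best path that ends in tag j at the current word.
    delta = {j: start[j] * emit[j][words[0]] for j in STATES}
    back = []  # back[t][j]: best predecessor of tag j at word t + 1
    for word in words[1:]:
        pointers, new_delta = {}, {}
        for j in STATES:
            best_i = max(STATES, key=lambda i: delta[i] * trans[i][j])
            pointers[j] = best_i
            new_delta[j] = delta[best_i] * trans[best_i][j] * emit[j][word]
        back.append(pointers)
        delta = new_delta
    # Close the path with the transition into END, then follow the back-pointers.
    last = max(STATES, key=lambda i: delta[i] * trans[i]["END"])
    best_prob = delta[last] * trans[last]["END"]
    tags = [last]
    for pointers in reversed(back):
        tags.append(pointers[tags[-1]])
    tags.reverse()
    return tags, best_prob


tags, prob = viterbi(["board", "backs"])
print(tags, prob)
```

The sketch works with raw probabilities for readability; for long sentences one would normally switch to log probabilities to avoid underflow.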

8 Unsupervised parameter estimation In many scenarios there is little or no annotated data How can we learn a model from unsupervised data, or a combination of labeled/unlabeled data?

9 HMMs
HMM = (Q, O, A, B), with parameters $\lambda = (A, B)$
1. States: $Q = q_1 \ldots q_N$ [the part-of-speech tags]
   a. Including special initial/final states $q_0$ and $q_F$
2. Observation symbols: $O = o_1 \ldots o_V$ [words]
3. Transitions:
   a. $A = \{a_{ij}\}$; $a_{ij} = P(q_t = j \mid q_{t-1} = i) \sim \mathrm{Multi}(a_i)$
4. Emissions:
   a. $B = \{b_{ik}\}$; $b_{ik} = P(o_t = v_k \mid q_t = i) \sim \mathrm{Multi}(b_i)$
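As a rough illustration of this definition, the tuple can be stored in plain dictionaries and checked for consistency. The layout, the helper names `is_distribution` and `check_hmm`, and all numbers are assumptions chosen for this sketch, not values from the slides.

```python
# Hypothetical toy HMM with two tags; every probability is made up for illustration.
STATES = ["V", "N"]                            # Q (without q_0 and q_F)
VOCAB = ["board", "backs", "plan", "vote"]     # O

# Transitions out of the initial state q_0.
start = {"V": 0.2, "N": 0.8}
# Row i holds a_ij for each tag j, plus the transition into the final state q_F ("END").
trans = {"V": {"V": 0.1, "N": 0.6, "END": 0.3},
         "N": {"V": 0.5, "N": 0.2, "END": 0.3}}
# Row i holds b_ik = P(word k | tag i).
emit = {"V": {"board": 0.1, "backs": 0.4, "plan": 0.2, "vote": 0.3},
        "N": {"board": 0.4, "backs": 0.1, "plan": 0.3, "vote": 0.2}}


def is_distribution(probs, tol=1e-9):
    """True if all values are non-negative and sum to 1 (up to tol)."""
    values = list(probs.values())
    return all(v >= 0 for v in values) and abs(sum(values) - 1.0) < tol


def check_hmm(start, trans, emit):
    """Each row of A and B is one multinomial, so each row must sum to 1."""
    return (is_distribution(start)
            and all(is_distribution(row) for row in trans.values())
            and all(is_distribution(row) for row in emit.values()))


print(check_hmm(start, trans, emit))
```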

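10 Complete data likelihood
The joint probability of a sequence of words and tags, given a model:
$P(o_{1:T}, q_{1:T} \mid \lambda) = \left[ \prod_{t=1}^{T} a_{q_{t-1} q_t}\, b_{q_t}(o_t) \right] a_{q_T q_F}$, where $b_i(o_t) = P(o_t \mid q_t = i)$
Generative process:
1. generate a tag sequence
2. emit the words for each tag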

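11 Observation probability
Given an HMM $\theta = (A, B)$ and an observation sequence $o_{1:T}$, compute $P(o_{1:T} \mid \theta)$
Applications: language modeling
Complete data likelihood: $P(o_{1:T}, q_{1:T} \mid \theta)$
Sum over all possible tag sequences: $P(o_{1:T} \mid \theta) = \sum_{q_{1:T}} P(o_{1:T}, q_{1:T} \mid \theta)$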
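The sum over tag sequences can be written down literally, which makes its cost visible: with N tags and T words there are N^T terms. The sketch below uses a hypothetical two-tag model with made-up probabilities; `joint_prob` and `observation_prob_bruteforce` are illustrative names.

```python
from itertools import product

# Hypothetical toy HMM; every probability below is made up for illustration.
STATES = ["V", "N"]
start = {"V": 0.2, "N": 0.8}
trans = {"V": {"V": 0.1, "N": 0.6, "END": 0.3},
         "N": {"V": 0.5, "N": 0.2, "END": 0.3}}
emit = {"V": {"board": 0.1, "backs": 0.4, "plan": 0.2, "vote": 0.3},
        "N": {"board": 0.4, "backs": 0.1, "plan": 0.3, "vote": 0.2}}


def joint_prob(words, tags):
    """Complete data likelihood P(words, tags | model)."""
    p = start[tags[0]] * emit[tags[0]][words[0]]
    for t in range(1, len(words)):
        p *= trans[tags[t - 1]][tags[t]] * emit[tags[t]][words[t]]
    return p * trans[tags[-1]]["END"]  # transition into the final state


def observation_prob_bruteforce(words):
    """P(words | model), summing the joint over all N^T tag sequences."""
    return sum(joint_prob(words, tags)
               for tags in product(STATES, repeat=len(words)))


print(observation_prob_bruteforce(["board", "backs", "plan", "vote"]))
```

Enumeration is only practical for very short sentences; the forward algorithm computes the same quantity in time linear in the sentence length.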

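12 Data likelihood
- Given a dataset of sequences $X = \{x_1, \ldots, x_N\}$
- Likelihood function: $L(\theta) = P(X \mid \theta) = \prod_{n=1}^{N} P(x_n \mid \theta)$
- Maximum likelihood problem: $\theta^* = \arg\max_{\theta} L(\theta)$
- Find the model parameters that maximize the probability of the data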

13 Baum-Welch algorithm
Also known as the "forward-backward" algorithm; a variant of the EM (Expectation Maximization) algorithm:
1. Start with an initial assignment for (A, B)
2. While the likelihood improves:
   a. Compute expected counts for the parameters under the current model (E-step)
   b. Re-estimate (A, B) from those expected counts (M-step)
Each iteration is guaranteed not to decrease the data likelihood, but only a local optimum is reached.

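14 Forward probability
$\alpha_t(j)$ = probability of being in state j at time t having observed $o_{1:t}$: $\alpha_t(j) = P(o_{1:t}, q_t = j \mid \lambda)$
Sum over all paths up to t-1 leading to j: $\alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t)$, where $b_j(o_t) = P(o_t \mid q_t = j)$
Init: $\alpha_1(j) = a_{0j}\, b_j(o_1)$
Final: $P(o_{1:T} \mid \lambda) = \alpha_T(q_F) = \sum_{i=1}^{N} \alpha_T(i)\, a_{iF}$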

15 Forward algorithm
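A minimal sketch of the forward pass, assuming the standard recursion with explicit initial and final states. The model is hypothetical (two tags, made-up probabilities), so the printed values are not the ones shown in the deck's trellis.

```python
# Hypothetical toy HMM; every probability below is made up for illustration.
STATES = ["V", "N"]
start = {"V": 0.2, "N": 0.8}                      # a_0j
trans = {"V": {"V": 0.1, "N": 0.6, "END": 0.3},   # a_ij and a_iF
         "N": {"V": 0.5, "N": 0.2, "END": 0.3}}
emit = {"V": {"board": 0.1, "backs": 0.4, "plan": 0.2, "vote": 0.3},
        "N": {"board": 0.4, "backs": 0.1, "plan": 0.3, "vote": 0.2}}


def forward(words):
    """Return (alpha, P(words | model)); alpha[t][j] covers words[0..t]."""
    # Init: leave the initial state and emit the first word.
    alpha = [{j: start[j] * emit[j][words[0]] for j in STATES}]
    # Recursion: sum over all predecessors i, then emit the current word from j.
    for word in words[1:]:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * trans[i][j] for i in STATES) * emit[j][word]
                      for j in STATES})
    # Final: add the transition into the final state.
    total = sum(alpha[-1][i] * trans[i]["END"] for i in STATES)
    return alpha, total


alpha, total = forward(["board", "backs", "plan", "vote"])
for t, column in enumerate(alpha, start=1):
    print(t, column)
print("P(O | model) =", total)
```

Each column depends only on the previous one, so the cost is O(N^2 T) instead of the N^T terms of the explicit sum over tag sequences.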

16 Example: model
[Tables: transition matrix A with rows V, N, START and columns V, N, END; emission matrix B with rows V, N and columns board, backs, plan, vote.]

17-23 Forward computation
[Figure, built up over seven slides: trellis with the states START, N, V, END on the vertical axis and the words "board backs plan vote" along the time axis. Surviving forward values: a=0.04 and a=0.24 in the first column.]

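24 Backward probability
$\beta_t(i)$: probability of the observations $o_{t+1:T}$ given state i at time t: $\beta_t(i) = P(o_{t+1:T} \mid q_t = i, \lambda)$
Init: $\beta_T(i) = a_{iF}$
Recursion: $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$
Final: $P(o_{1:T} \mid \lambda) = \sum_{j=1}^{N} a_{0j}\, b_j(o_1)\, \beta_1(j)$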
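The backward pass mirrors the forward one, running from the last word to the first. The sketch below assumes the standard recursion with a final state and uses a hypothetical two-tag model whose probabilities are made up.

```python
# Hypothetical toy HMM; every probability below is made up for illustration.
STATES = ["V", "N"]
start = {"V": 0.2, "N": 0.8}
trans = {"V": {"V": 0.1, "N": 0.6, "END": 0.3},
         "N": {"V": 0.5, "N": 0.2, "END": 0.3}}
emit = {"V": {"board": 0.1, "backs": 0.4, "plan": 0.2, "vote": 0.3},
        "N": {"board": 0.4, "backs": 0.1, "plan": 0.3, "vote": 0.2}}


def backward(words):
    """Return (beta, P(words | model)); beta[t][i] covers words[t+1..]."""
    # Init: from the last word, only the transition into the final state remains.
    beta = [{i: trans[i]["END"] for i in STATES}]
    # Recursion, right to left: move to j, emit the next word, continue from j.
    for word in reversed(words[1:]):
        nxt = beta[0]
        beta.insert(0, {i: sum(trans[i][j] * emit[j][word] * nxt[j] for j in STATES)
                        for i in STATES})
    # Final: leave the initial state and emit the first word.
    total = sum(start[j] * emit[j][words[0]] * beta[0][j] for j in STATES)
    return beta, total


beta, total = backward(["board", "backs", "plan", "vote"])
for t, column in enumerate(beta, start=1):
    print(t, column)
print("P(O | model) =", total)
```

The total computed here should agree with the one obtained from the forward pass, which is a convenient sanity check for an implementation.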

25-31 Backward computation
[Figure, built up over seven slides: the same trellis over START, N, V, END and "board backs plan vote", filled in from right to left. Surviving backward values: b=0.3 and b=0.7 in the last column, b=0.150 and b=0.135 in the column before it.]

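32 Parameter estimation (supervised)
Maximum likelihood estimates (MLE) on labeled data, with $C(\cdot)$ denoting counts in the tagged corpus:
1. Transition probabilities: $a_{ij} = \frac{C(i \to j)}{\sum_{j'} C(i \to j')}$
2. Emission probabilities: $b_{ik} = \frac{C(i, v_k)}{C(i)}$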
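In the supervised case the estimates are plain relative frequencies. A minimal sketch follows; the tagged sentences are hypothetical (the tags were assigned by hand for illustration), and `mle_estimate` is an illustrative name.

```python
from collections import Counter, defaultdict

# Hypothetical tagged corpus; the tag assignments are made up for illustration.
TAGGED = [
    [("board", "N"), ("backs", "V"), ("plan", "N"), ("vote", "N")],
    [("vote", "N"), ("backs", "V"), ("board", "N"), ("plan", "N")],
]


def normalize(counts):
    """Turn each row of counts into relative frequencies."""
    return {ctx: {k: c / sum(row.values()) for k, c in row.items()}
            for ctx, row in counts.items()}


def mle_estimate(tagged_sentences):
    """Relative-frequency estimates of A (with START/END) and B."""
    trans_counts = defaultdict(Counter)   # C(i -> j)
    emit_counts = defaultdict(Counter)    # C(i, word)
    for sentence in tagged_sentences:
        prev = "START"
        for word, tag in sentence:
            trans_counts[prev][tag] += 1
            emit_counts[tag][word] += 1
            prev = tag
        trans_counts[prev]["END"] += 1
    return normalize(trans_counts), normalize(emit_counts)


A, B = mle_estimate(TAGGED)
print(A)
print(B)
```

Unseen transitions and words simply get no entry here; a practical tagger would add smoothing, which this sketch leaves out.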

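33 Parameter estimation (unsupervised)
Without tags, expected counts under the current model replace the observed counts:
1. Transition probabilities: $\hat{a}_{ij}$ = expected number of transitions from i to j / expected number of transitions out of i
2. Emission probabilities: $\hat{b}_{ik}$ = expected number of times in state i emitting word k / expected number of times in state i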

34-35 Parameter estimation
1. Expected number of times in state i for observation O: $\sum_{t=1}^{T} \gamma_t(i)$, with $\gamma_t(i) = P(q_t = i \mid O, \lambda)$
2. Expected number of transitions from i to j for observation O: $\sum_{t=1}^{T-1} \xi_t(i, j)$, with $\xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \lambda)$

36 Gamma

37 Gamma
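The formula on these slides did not survive; in the standard formulation (see Rabiner's tutorial), the state posterior is obtained from the forward and backward probabilities as sketched below. The notation follows the deck's $\alpha$ and $\beta$.

```latex
\gamma_t(j) \;=\; P(q_t = j \mid O, \lambda)
           \;=\; \frac{\alpha_t(j)\,\beta_t(j)}{P(O \mid \lambda)},
\qquad
\sum_{j=1}^{N} \gamma_t(j) = 1 \quad \text{for every } t .
```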

38 $\gamma_2(V)$
[Figure: trellis over START, N, V, END and "board backs plan vote", illustrating $\gamma_2(V)$.]

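39 Emission re-estimation
$\hat{b}_{ik}$ = expected number of times in state i emitting word k / expected number of times in state i, that is, $\hat{b}_{ik} = \frac{\sum_{t:\, o_t = v_k} \gamma_t(i)}{\sum_{t} \gamma_t(i)}$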

40 Xi
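The formula on this slide did not survive; in the standard formulation (see Rabiner's tutorial), the transition posterior combines a forward probability, one transition, one emission and a backward probability, as sketched below.

```latex
\xi_t(i,j) \;=\; P(q_t = i,\, q_{t+1} = j \mid O, \lambda)
          \;=\; \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)},
\qquad
\sum_{j=1}^{N} \xi_t(i,j) = \gamma_t(i) \quad \text{for } t < T .
```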

41 $\xi_2(V, V)$
[Figure: trellis over START, N, V, END and "board backs plan vote", illustrating $\xi_2(V, V)$; surviving value: xi=0.40.]

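42 Transition re-estimation
$\hat{a}_{ij}$ = expected number of transitions from i to j / expected number of transitions out of i, that is, $\hat{a}_{ij} = \frac{\sum_{t} \xi_t(i, j)}{\sum_{t} \sum_{j'} \xi_t(i, j')}$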

43 Forward-Backward

44 Example
X = [ "board backs plan vote",
      "vote backs board plan",
      "backs board vote plan",
      "plan vote backs board" ]
[Tables: transition matrix A with rows V, N, START and columns V, N, END; emission matrix B with rows V, N and columns board, backs, plan, vote.]
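Putting the E-step and M-step together on these four sentences gives the following minimal Baum-Welch sketch. The initial parameters are hypothetical (every number is made up), and the function names are illustrative rather than taken from the lecture.

```python
import math

STATES = ["V", "N"]
VOCAB = ["board", "backs", "plan", "vote"]
DATA = [s.split() for s in ("board backs plan vote", "vote backs board plan",
                            "backs board vote plan", "plan vote backs board")]

# Hypothetical initial parameters; every number is made up for illustration.
START = {"V": 0.2, "N": 0.8}
TRANS = {"V": {"V": 0.1, "N": 0.6, "END": 0.3},
         "N": {"V": 0.5, "N": 0.2, "END": 0.3}}
EMIT = {"V": {"board": 0.1, "backs": 0.4, "plan": 0.2, "vote": 0.3},
        "N": {"board": 0.4, "backs": 0.1, "plan": 0.3, "vote": 0.2}}


def forward(words, start, trans, emit):
    alpha = [{j: start[j] * emit[j][words[0]] for j in STATES}]
    for word in words[1:]:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * trans[i][j] for i in STATES) * emit[j][word]
                      for j in STATES})
    return alpha, sum(alpha[-1][i] * trans[i]["END"] for i in STATES)


def backward(words, trans, emit):
    beta = [{i: trans[i]["END"] for i in STATES}]
    for word in reversed(words[1:]):
        nxt = beta[0]
        beta.insert(0, {i: sum(trans[i][j] * emit[j][word] * nxt[j] for j in STATES)
                        for i in STATES})
    return beta


def normalize(row):
    z = sum(row.values())
    return {k: v / z for k, v in row.items()}


def em_step(data, start, trans, emit):
    """One iteration; returns new (start, trans, emit) and the old log-likelihood."""
    start_c = {j: 0.0 for j in STATES}
    trans_c = {i: {j: 0.0 for j in STATES + ["END"]} for i in STATES}
    emit_c = {i: {w: 0.0 for w in VOCAB} for i in STATES}
    log_lik = 0.0
    for words in data:
        alpha, total = forward(words, start, trans, emit)
        beta = backward(words, trans, emit)
        log_lik += math.log(total)
        last = len(words) - 1
        for t, word in enumerate(words):
            for i in STATES:
                gamma = alpha[t][i] * beta[t][i] / total       # E-step: state posterior
                emit_c[i][word] += gamma
                if t == 0:
                    start_c[i] += gamma
                if t == last:
                    trans_c[i]["END"] += gamma                 # transition into END
                    continue
                for j in STATES:                               # E-step: transition posterior
                    trans_c[i][j] += (alpha[t][i] * trans[i][j]
                                      * emit[j][words[t + 1]] * beta[t + 1][j] / total)
    # M-step: renormalize the expected counts.
    return (normalize(start_c), {i: normalize(trans_c[i]) for i in STATES},
            {i: normalize(emit_c[i]) for i in STATES}, log_lik)


def baum_welch(data, start, trans, emit, iterations=10):
    log_liks = []
    for _ in range(iterations):
        start, trans, emit, log_lik = em_step(data, start, trans, emit)
        log_liks.append(log_lik)
    return start, trans, emit, log_liks


start, trans, emit, log_liks = baum_welch(DATA, START, TRANS, EMIT)
print(log_liks)
```

The recorded log-likelihoods should never decrease from one iteration to the next; which local optimum is reached depends on the initial parameters.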

45 Example
Optimizing only some of the parameters also improves the log-likelihood: Generalized EM

46 Bayesian modeling
Maximum likelihood: choose the model $\lambda$ that maximizes $P(\mathrm{data} \mid \lambda)$
Bayesian methods: model the parameter distribution: $P(\lambda \mid \mathrm{data}) \propto P(\mathrm{data} \mid \lambda)\, P(\lambda)$
Advantages:
- Model uncertainty about the data
- Prefer models with certain properties; e.g., sparsity
- Non-parametric models: number of hidden variables unknown a priori

47 Bayesian inference
1. Sampling (e.g., Gibbs)
2. Variational Bayes:
   a. Find (point estimate) parameters $\lambda$ that minimize an upper bound on the negative log-likelihood, including the priors.
3. Priors on parameter distributions: Dirichlet (conjugate of the multinomial).

48 Bayesian HMMs
HMM = (Q, O, A, B)
1. States: $Q = q_1 \ldots q_N$ [the part-of-speech tags]
   a. Including special initial/final states $q_0$ and $q_F$
2. Observation symbols: $O = o_1 \ldots o_V$ [words]
3. Transitions:
   a. $A = \{a_{ij}\}$; $a_{ij} = P(q_t = j \mid q_{t-1} = i) \sim \mathrm{Multi}(a_i)$
   b. $a_i \mid \alpha_A \sim \mathrm{Dir}(\alpha_A)$
4. Emissions:
   a. $B = \{b_{ik}\}$; $b_{ik} = P(o_t = v_k \mid q_t = i) \sim \mathrm{Multi}(b_i)$
   b. $b_i \mid \alpha_B \sim \mathrm{Dir}(\alpha_B)$
$\alpha$: controls the sparsity of A and B
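The effect of $\alpha$ on sparsity can be seen by sampling rows from a symmetric Dirichlet. The sketch below is an illustration only: it draws Dirichlet samples by normalizing Gamma variates, and the helper names, sample size and seed are assumptions.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one row from a symmetric Dirichlet(alpha) over k outcomes."""
    while True:
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
        total = sum(draws)
        if total > 0:  # guard against numerical underflow for very small alpha
            return [d / total for d in draws]


def mean_max_prob(alpha, k=4, n=2000, seed=0):
    """Average size of the largest entry: close to 1 means sparse (peaked) rows."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n)) / n


for alpha in (0.1, 1.0, 10.0):
    print(alpha, round(mean_max_prob(alpha), 3))
```

Small values of alpha concentrate most of the mass on a single outcome per row, while large values push the rows towards uniform.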

49 Parameter estimation - VB 1. Transition probabilities: 2. Emission probabilities: F:
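The formulas on this slide did not survive. One common form of the variational Bayes update, as described by Johnson (2007), replaces the M-step normalization by $f(\mathbb{E}[n_{ij}] + \alpha) / f(\mathbb{E}[n_i] + K\alpha)$ with $f(v) = \exp(\psi(v))$, where $\psi$ is the digamma function and K is the number of outcomes. The sketch below assumes that form; the `digamma` approximation and the function names are illustrative, not taken from the lecture.

```python
import math


def digamma(x):
    """Digamma function psi(x) for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:            # shift the argument up: psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def vb_estimate(expected_counts, alpha):
    """VB-style update of one multinomial row from its expected counts."""
    k = len(expected_counts)
    denom = math.exp(digamma(sum(expected_counts.values()) + k * alpha))
    return {key: math.exp(digamma(count + alpha)) / denom
            for key, count in expected_counts.items()}


# Hypothetical expected counts for one transition row.
row = vb_estimate({"V": 9.0, "N": 1.0}, alpha=0.1)
print(row, sum(row.values()))
```

The resulting weights sum to less than one, and small counts are discounted more strongly than large ones, which is how a small alpha favors sparse solutions.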

50 Example with varying alpha

51 alpha = 1.0, after FB
X = [ "board backs plan vote",
      "vote backs board plan",
      "backs board vote plan",
      "plan vote backs board" ]
[Tables: transition matrix A with rows V, N, START and columns V, N, END; emission matrix B with rows V, N and columns board, backs, plan, vote.]

52 alpha = 0.1, after FB
X = [ "board backs plan vote",
      "vote backs board plan",
      "backs board vote plan",
      "plan vote backs board" ]
[Tables: transition matrix A with rows V, N, START and columns V, N, END; emission matrix B with rows V, N and columns board, backs, plan, vote.]

53 tag-frequencies (Johnson, 2007)

54 Summary
- Part-of-speech tagging
- HMMs:
  - Unsupervised parameter estimation
  - Forward-Backward algorithm
  - Bayesian variants
- Discriminative sequence models: Yasemin Altun
  - MaxEnt and MEMM
  - CRF
  - Perceptron HMM
  - Structured SVM

55 References
- Gao & Johnson, "A Comparison of Bayesian Estimators for Unsupervised Hidden Markov Model POS Taggers". EMNLP
- Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models"
- Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition"
- Neal & Hinton, "A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants"

56 Bayesian HMMs
Model parameters:
a. $a_{ij} \sim \mathrm{Multi}(a_i)$
b. $a_i \mid \alpha_1 \sim \mathrm{Dir}(\alpha_1)$
c. $b_{ik} \sim \mathrm{Multi}(b_i)$
d. $b_i \mid \alpha_2 \sim \mathrm{Dir}(\alpha_2)$
The Dirichlet is conjugate to the multinomial.
Inference:
1. Sampling (e.g., Gibbs)
2. Variational Bayes:
   a. Find parameters $\lambda$ that minimize an upper bound on the negative log-likelihood, including the priors
