Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation


1 Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation
Taesun Moon, Katrin Erk and Jason Baldridge
Department of Linguistics, University of Texas at Austin, 1 University Station B5100, Austin, TX, USA
Empirical Methods in Natural Language Processing 2010
Moon, Erk, Baldridge (UT@Austin) Crouching Dirichlet EMNLP 10 1 / 35

2 Introduction. Unsupervised HMM POS tagging: Problems
The standard HMM is a very poor approximation of natural language:
- The Markov independence assumption is too strong
- The parameter configuration is too restrictive

3 Introduction. A simple dichotomy in natural language
For many languages, words can generally be grouped into function words and content words:
- There are few function words by type; individually, these function words appear relatively frequently
- There are many content words by type; individually, these content words appear relatively infrequently

4 Introduction. Some numbers from the Penn Treebank WSJ
Number of word types (total tokens) per tag:
- NN = 9321 (164K)
- JJ = 8591 (75K)
- DT = 24 (101K)
- CC = 22 (29K)
Conditional probability of most frequent word given tag:
- p(company | NN) = 0.02
- p(other | JJ) = 0.02
- p(the | DT) = 0.59
- p(and | CC) = 0.69
Transition probabilities:
- p(NN | JJ) = 0.45
- p(NNP | NNP) = 0.38
- p(DT | PDT) = 0.91
- p(VB | MD) = 0.80
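
Statistics of this kind can be computed from any tagged corpus. The following is a minimal sketch, assuming the corpus is available as a list of (word, tag) pairs; the function name `tag_statistics` and the toy data are hypothetical and are not taken from the slides or from the Penn Treebank.

```python
from collections import Counter, defaultdict


def tag_statistics(tagged_tokens):
    """Return {tag: (num_types, num_tokens, top_word, p(top_word | tag))}."""
    words_by_tag = defaultdict(Counter)
    for word, tag in tagged_tokens:
        words_by_tag[tag][word.lower()] += 1

    stats = {}
    for tag, counts in words_by_tag.items():
        total = sum(counts.values())                    # tokens with this tag
        top_word, top_count = counts.most_common(1)[0]  # most frequent word type
        stats[tag] = (len(counts), total, top_word, top_count / total)
    return stats


# Toy corpus (hypothetical), only to show the shape of the output.
toy = [("the", "DT"), ("company", "NN"), ("and", "CC"), ("the", "DT"),
       ("market", "NN"), ("a", "DT"), ("share", "NN"), ("the", "DT")]
stats = tag_statistics(toy)
```

Even in this toy input, the function tag DT has few types that cover many tokens, while every NN token is a distinct type, which is the pattern the slide reports at scale.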

5 Introduction. Some assumptions in IR
- Function words are almost always stopwords
- tf-idf is predicated on the difference in variance of words across documents
- LSI, LDA, etc. capture the variance of content words across documents

6 Introduction. Solutions
- Directly model the statistical dichotomy between content and function words
- Capture possible variance of content words and tags across contexts
- Retain the HMM framework and extend from there

7 Models. Standard Hidden Markov Model
[Graphical model: a chain of tags t_{i-1}, t_i, t_{i+1}, each emitting its word w_{i-1}, w_i, w_{i+1}]
- Difficult to model sparsity of words given tag
- Is still competitive with the Bayesian HMM
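
For reference, the chain structure of the standard first-order HMM corresponds to the usual textbook factorization of the joint probability of a word sequence and a tag sequence. The start state $t_0$ and the sequence length $n$ are notational assumptions, not symbols from the slides.

```latex
p(\mathbf{w}, \mathbf{t}) \;=\; \prod_{i=1}^{n} p(t_i \mid t_{i-1}) \, p(w_i \mid t_i)
```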

8 Models. Bayesian Hidden Markov Model
[Graphical model: word w_i with emission distribution ψ and emission prior ξ; tag t_i with transition distribution δ and transition prior γ]
- Can model sparsity through hyperparameters
- Difficult to capture the content/function dichotomy

9 Models. Latent Dirichlet Allocation/Hidden Markov Model [Griffiths et al. 2005]
[Graphical model: word w_i, topic z_i, tag t_i; word/topic distribution φ with prior β; topic distribution θ with prior α; transition distribution δ with prior γ]
- An LDA that jointly handles stopword removal
- Captures the topic/non-topic dichotomy
- Conflates all topical content words into a single state

10 Model I. Crouching Dirichlet, Hidden Markov Model
[Graphical model: word w_i, tag t_i; content word distribution φ with prior β; function word distribution ψ with prior ξ; crouching Dirichlet θ with prior α; transition distribution δ with prior γ]
- Define a composite distribution that models content states and function states separately
- Content words given tag are less sparse than function words given tag
- Assume that content words and tags have greater variance across contexts

11 Model I. Crouching Dirichlet, Hidden Markov Model (Content States)
[Graphical model: word w_i, tag t_i; φ with word-over-content-state prior β (large β); θ with prior α, "the Dirichlet who crouches"; δ with state transition prior γ]

12 Model I. Crouching Dirichlet, Hidden Markov Model (Function States)
[Graphical model: word w_i, tag t_i; ψ with word-over-function-state prior ξ (small ξ); δ with state transition prior γ]

13 Models. Bayesian Hidden Markov Model vs CDHMM
[Figure: side-by-side graphical models. Bayesian HMM: w_i, t_i, ψ, ξ, δ, γ. CDHMM: w_i, t_i, φ, β, θ, α, δ, γ]

14 Models. LDAHMM vs CDHMM
[Figure: side-by-side graphical models. LDAHMM: w_i, z_i, t_i, φ, β, θ, α, δ, γ. CDHMM: w_i, t_i, φ, β, θ, α, δ, γ]

15 Model II. HMM+
[Graphical model: word w_i, tag t_i; φ with word-over-content-state prior β (large β); ψ with word-over-function-state prior ξ (small ξ); δ with state transition prior γ]
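
The effect of a large versus a small symmetric Dirichlet hyperparameter can be illustrated by sampling. This is a minimal sketch using only the Python standard library; the dimension, the two concentration values, and the helper names are illustrative assumptions and are not the settings used in the experiments.

```python
import random


def sample_symmetric_dirichlet(dim, concentration, rng):
    """Draw one sample from a symmetric Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_max_prob(dim, concentration, trials, seed=0):
    """Average probability mass of the single most probable outcome."""
    rng = random.Random(seed)
    return sum(
        max(sample_symmetric_dirichlet(dim, concentration, rng))
        for _ in range(trials)
    ) / trials


# Small concentration: mass piles onto very few outcomes (sparse, function-like).
peaked = mean_max_prob(dim=50, concentration=0.01, trials=200)
# Larger concentration: mass is spread over many outcomes (content-like).
flat = mean_max_prob(dim=50, concentration=1.0, trials=200)
```

With the smaller concentration, most sampled distributions put nearly all their mass on one outcome, which matches the behavior of function tags such as DT, where a single word accounts for most tokens.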

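16 Parameter inference: Collapsed Gibbs Sampling
Conditional distribution of interest: $p(t_i \mid \mathbf{t}_{-i}, \mathbf{w})$
Bayesian HMM conditional distribution:
$$p(t_i \mid \mathbf{t}_{-i}, \mathbf{w}) \propto \frac{N_{w_i|t_i} + \xi}{N_{t_i} + W\xi} \left( N_{t_i|t_{i-1}} + \gamma \right) \frac{N_{t_{i+1}|t_i} + I[t_{i-1} = t_i = t_{i+1}] + \gamma}{N_{t_i} + T\gamma + I[t_i = t_{i-1}]}$$
CDHMM conditional distribution:
$$p(t_i \mid \mathbf{t}_{-i}, \mathbf{w}) \propto \begin{cases} \dfrac{N_{w_i|t_i} + \beta}{N_{t_i} + W\beta} \, \dfrac{N_{t_i|d_i} + \alpha}{N_{d_i} + C\alpha} \left( N_{t_i|t_{i-1}} + \gamma \right) \dfrac{N_{t_{i+1}|t_i} + I[t_{i-1} = t_i = t_{i+1}] + \gamma}{N_{t_i} + T\gamma + I[t_i = t_{i-1}]} & t_i \in C \\[2ex] \dfrac{N_{w_i|t_i} + \xi}{N_{t_i} + W\xi} \left( N_{t_i|t_{i-1}} + \gamma \right) \dfrac{N_{t_{i+1}|t_i} + I[t_{i-1} = t_i = t_{i+1}] + \gamma}{N_{t_i} + T\gamma + I[t_i = t_{i-1}]} & t_i \in F \end{cases}$$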
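
The conditional distributions on this slide can be sketched as unnormalized weight functions over count tables. This is a hedged illustration and not the authors' implementation: the `Counts` and `Hyper` containers, the function names, and the reading of N_d as the number of content-state tokens in context d are assumptions. The value of ξ is not preserved in the slides, so the one used here is hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Counts:
    """Count tables; all must exclude the token currently being resampled."""
    word_tag: Counter = field(default_factory=Counter)  # N_{w|t}, keyed (word, tag)
    tag: Counter = field(default_factory=Counter)       # N_t
    trans: Counter = field(default_factory=Counter)     # N_{t|t'}, keyed (prev_tag, tag)
    tag_ctx: Counter = field(default_factory=Counter)   # N_{t|d}, keyed (tag, context)
    ctx: Counter = field(default_factory=Counter)       # N_d (assumed: content tokens in d)


@dataclass
class Hyper:
    W: int        # vocabulary size
    T: int        # total number of states
    beta: float   # content word prior
    xi: float     # function word prior
    alpha: float  # crouching Dirichlet prior
    gamma: float  # transition prior


def transition_term(t, prev_t, next_t, c, h):
    """Incoming and outgoing transition factors shared by both models."""
    same_all = 1 if prev_t == t == next_t else 0
    same_prev = 1 if t == prev_t else 0
    incoming = c.trans[(prev_t, t)] + h.gamma
    outgoing = (c.trans[(t, next_t)] + same_all + h.gamma) / (
        c.tag[t] + h.T * h.gamma + same_prev)
    return incoming * outgoing


def hmm_weight(t, w, prev_t, next_t, c, h):
    """Unnormalized Bayesian HMM weight for assigning tag t to word w."""
    emission = (c.word_tag[(w, t)] + h.xi) / (c.tag[t] + h.W * h.xi)
    return emission * transition_term(t, prev_t, next_t, c, h)


def cdhmm_weight(t, w, prev_t, next_t, d, content, c, h):
    """Unnormalized CDHMM weight; `content` is the set of content states."""
    if t not in content:  # function states use the Bayesian HMM form
        return hmm_weight(t, w, prev_t, next_t, c, h)
    emission = (c.word_tag[(w, t)] + h.beta) / (c.tag[t] + h.W * h.beta)
    context = (c.tag_ctx[(t, d)] + h.alpha) / (c.ctx[d] + len(content) * h.alpha)
    return emission * context * transition_term(t, prev_t, next_t, c, h)


# beta, alpha, gamma follow the parameter-settings slide; xi is hypothetical.
h = Hyper(W=10, T=4, beta=0.1, xi=0.01, alpha=1.0, gamma=0.1)
empty = Counts()
content = {0, 1}
w_content = cdhmm_weight(0, "dog", 1, 2, "d0", content, empty, h)
w_function = cdhmm_weight(2, "the", 1, 3, "d0", content, empty, h)
```

A sampler would compute such a weight for every candidate state at position i, normalize, and draw the new tag, for example with `random.choices(states, weights=...)`.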

17 Data & Experiments. Corpora
- Penn Treebank Wall Street Journal: English, 1M words
- Brown: English, 800K words
- Tiger: German, 450K words
- Floresta: Portuguese, 200K words
- Uspanteko: 70K words, transcribed text, tagged over morphemes

18 Data & Experiments. Evaluation measures
- Greedily matched accuracy (one-to-one, many-to-one)
- Pairwise token-level precision, recall, f-score
- Variation of information [Meila, 2007]
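
Two of these measures can be sketched in a few lines. The code below is a minimal illustration assuming parallel lists of induced state labels and gold tags; the function names are hypothetical, the greedy one-to-one matching is one common way to realize "greedily matched" accuracy, and the slides do not specify the exact procedure.

```python
import math
from collections import Counter


def many_to_one_accuracy(pred, gold):
    """Map each induced state to its most frequent gold tag, then score tokens."""
    joint = Counter(zip(pred, gold))
    best = Counter()
    for (p, _g), n in joint.items():
        best[p] = max(best[p], n)
    return sum(best.values()) / len(gold)


def one_to_one_accuracy(pred, gold):
    """Greedy matching: take the largest co-occurrence pairs first, each label once."""
    joint = Counter(zip(pred, gold))
    used_pred, used_gold, correct = set(), set(), 0
    for (p, g), n in joint.most_common():
        if p not in used_pred and g not in used_gold:
            used_pred.add(p)
            used_gold.add(g)
            correct += n
    return correct / len(gold)


def variation_of_information(pred, gold):
    """VI = H(pred | gold) + H(gold | pred), in nats; lower is better."""
    n = len(gold)
    joint = Counter(zip(pred, gold))
    pred_count, gold_count = Counter(pred), Counter(gold)
    vi = 0.0
    for (p, g), c in joint.items():
        vi -= (c / n) * (math.log(c / pred_count[p]) + math.log(c / gold_count[g]))
    return vi


# Toy example (hypothetical): three induced states against two gold tags.
pred = [0, 0, 1, 1, 2]
gold = ["N", "N", "V", "V", "V"]
m2o = many_to_one_accuracy(pred, gold)
o2o = one_to_one_accuracy(pred, gold)
vi = variation_of_information(pred, gold)
```

In the toy example the extra induced state is free under many-to-one mapping but is penalized by both one-to-one accuracy and VI, which is why the measures can rank models differently as the number of states grows.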

19 Data & Experiments. Models
- Bayesian HMM: does not model the content/function dichotomy
- LDAHMM: models the topic/non-topic dichotomy
- HMM+: models the content/function dichotomy with different blocked priors
- CDHMM: models the content/function dichotomy with different blocked priors and a crouching Dirichlet prior

20 Data & Experiments. Parameter settings
Hyperparameters (uninformative/symmetric priors, no hyperparameter re-estimation):
- Content word prior: β=0.1
- Function word prior: ξ=
- Topic/CD prior: α=1
- Transition prior: γ=0.1
Other settings:
- Total no. of states: 20/30/40/50
- No. of content states: 5
- Iterations: chains, single sample each

21 Full Results by Corpus. Wall Street Journal
[Figure: panels for 1-to-1 accuracy, f-score, M-to-1 accuracy, and VI]

22 Full Results by Corpus. Brown
[Figure: panels for 1-to-1 accuracy, f-score, M-to-1 accuracy, and VI]

23 Full Results by Corpus. Tiger
[Figure: panels for 1-to-1 accuracy, f-score, M-to-1 accuracy, and VI]

24 Full Results by Corpus. Floresta
[Figure: panels for 1-to-1 accuracy, f-score, M-to-1 accuracy, and VI]

25 Full Results by Corpus. Uspanteko
[Figure: panels for 1-to-1 accuracy, f-score, M-to-1 accuracy, and VI]

26 Full Results by Corpus
- CDHMM always wins or ties on accuracy
- CDHMM loses once to HMM+ on VI (Floresta)
- HMM wins once on f-score (Uspanteko)
- HMM+ ties with LDAHMM once on f-score (Floresta)
- CDHMM wins elsewhere on f-score

27 f-score dependent on number of states
[Figure: f-score by number of states for WSJ, Brown, Tiger, Floresta, Uspanteko]
If the number of model states is kept close to the number of gold tags, CDHMM wins or ties on every corpus on every measure.

28 Results: Accuracy. Learning curves, Wall Street Journal
[Figure: learning curves for hmm, hmm+, ldahmm, cdhmm; panels: Wall Street Journal one-to-one, Wall Street Journal many-to-one]

29 Results: Accuracy. Learning curves, Brown
[Figure: learning curves for hmm, hmm+, ldahmm, cdhmm; panels: Brown one-to-one, Brown many-to-one]

30 Results: Accuracy. Learning curves, Tiger (German)
[Figure: learning curves for hmm, hmm+, ldahmm, cdhmm; panels: Tiger one-to-one, Tiger many-to-one]

31 Results: Accuracy. Learning curves, Floresta (Portuguese)
[Figure: learning curves for hmm, hmm+, ldahmm, cdhmm; panels: Floresta one-to-one, Floresta many-to-one]

32 Results: Accuracy. Learning curves, Uspanteko
[Figure: learning curves for hmm, hmm+, ldahmm, cdhmm; panels: Uspanteko one-to-one, Uspanteko many-to-one]

33 Conclusion
- Modeling the content/function word dichotomy improves the HMM
- Capturing context variance of content words improves the HMM even further
- Merely modeling this dichotomy through blocked hyperparameters works as well

34 Thank you!

35 Bibliography
[Meila, 2007] Marina Meilă. Comparing clusterings: an information based distance. Journal of Multivariate Analysis, 2007.
[Griffiths et al., 2005] Thomas L. Griffiths, Mark Steyvers, David M. Blei and Joshua B. Tenenbaum. Integrating topics and syntax. Advances in Neural Information Processing Systems, 2005.
