Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation

Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation
Taesun Moon, Katrin Erk, and Jason Baldridge
Department of Linguistics, University of Texas at Austin, 1 University Station B5100, Austin, TX 78712-0198, USA
Empirical Methods in Natural Language Processing (EMNLP) 2010

Introduction / Unsupervised HMM POS tagging: problems
The standard HMM is a very poor approximation of natural language:
the Markov independence assumption is too strong, and
the parameter configuration is too restrictive.

Introduction / A simple dichotomy in natural language
For many languages, words can generally be grouped into function words and content words.
There are few function words by type; individually, these function words appear relatively frequently.
There are many content words by type; individually, these content words appear relatively infrequently.

Introduction / Some numbers from the Penn Treebank WSJ
Number of word types (total tokens) per tag: NN = 9321 (164K), JJ = 8591 (75K), DT = 24 (101K), CC = 22 (29K)
Conditional probability of the most frequent word given the tag: p(company | NN) = 0.02, p(other | JJ) = 0.02, p(the | DT) = 0.59, p(and | CC) = 0.69
Transition probabilities: p(NN | JJ) = 0.45, p(NNP | NNP) = 0.38, p(DT | PDT) = 0.91, p(VB | MD) = 0.80
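
As a concrete illustration of where such numbers come from, here is a minimal sketch (not the authors' code) that computes per-tag type/token counts and the relevant conditional probabilities from a tagged corpus; the corpus variable and its representation are assumptions.

```python
# Minimal sketch: per-tag statistics from a corpus given as a list of sentences,
# where each sentence is a list of (word, tag) pairs. Corpus loading is assumed.
from collections import Counter, defaultdict

def tag_statistics(sentences):
    emit = defaultdict(Counter)   # emit[tag][word] = count
    trans = defaultdict(Counter)  # trans[prev_tag][next_tag] = count
    for sent in sentences:
        for i, (word, tag) in enumerate(sent):
            emit[tag][word] += 1
            if i + 1 < len(sent):
                trans[tag][sent[i + 1][1]] += 1
    stats = {}
    for tag, words in emit.items():
        tokens = sum(words.values())
        top_word, top_count = words.most_common(1)[0]
        stats[tag] = {
            "types": len(words),               # e.g. DT has ~24 types
            "tokens": tokens,
            "top_word": top_word,
            "p_top_word": top_count / tokens,  # e.g. p(the | DT) ~ 0.59
        }
    return stats, trans

# Usage with a hypothetical corpus variable:
# stats, trans = tag_statistics(wsj_sentences)
# p_nn_given_jj = trans["JJ"]["NN"] / sum(trans["JJ"].values())
```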

Introduction / Some assumptions in IR
Function words are almost always stopwords.
tf-idf is predicated on the difference in variance of words across documents.
LSI, LDA, etc. capture the variance of content words across documents.

Introduction / Solutions
Directly model the statistical dichotomy between content and function words.
Capture the possible variance of content words and tags across contexts.
Retain the HMM framework and extend from there.

Models / Standard Hidden Markov Model
[Graphical model: word chain w_{i-1}, w_i, w_{i+1} emitted from tag chain t_{i-1}, t_i, t_{i+1}]
It is difficult to model the sparsity of words given a tag.
It is still competitive with the Bayesian HMM.
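
For reference, a minimal sketch of the factorization the standard first-order HMM assumes, p(w, t) = prod_i p(t_i | t_{i-1}) p(w_i | t_i); the table representation and start symbol below are illustrative assumptions, not the paper's code.

```python
# Minimal sketch: log joint probability of a word/tag sequence under a
# first-order HMM, given transition and emission tables as nested dicts.
import math

def hmm_log_joint(words, tags, trans, emit, start="<s>"):
    """trans[prev][cur] = p(cur | prev); emit[tag][word] = p(word | tag)."""
    logp = 0.0
    prev = start
    for w, t in zip(words, tags):
        logp += math.log(trans[prev][t]) + math.log(emit[t][w])
        prev = t
    return logp
```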

Models / Bayesian Hidden Markov Model
[Graphical model: word w_i drawn from emission distribution ψ with emission prior ξ; tag t_i drawn from transition distribution δ with transition prior γ]
It can model sparsity through the hyperparameters.
It is difficult to capture the content/function dichotomy.
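
A small sketch of what the symmetric Dirichlet priors buy: with a tiny emission hyperparameter, sampled emission distributions put almost all of their mass on a few word types. Dimensions and values below are illustrative, not the paper's exact setup.

```python
# Minimal sketch: symmetric Dirichlet priors on each tag's emission (psi) and
# transition (delta) distributions; a small concentration favors sparsity.
import numpy as np

rng = np.random.default_rng(0)
W, T = 1000, 20          # vocabulary size and number of tags (illustrative)
xi, gamma = 0.0001, 0.1  # emission and transition hyperparameters

psi = rng.dirichlet(np.full(W, xi), size=T)       # emission distributions
delta = rng.dirichlet(np.full(T, gamma), size=T)  # transition distributions

# With xi this small, each row of psi concentrates on very few word types.
print((psi > 0.01).sum(axis=1))  # typically a handful of "heavy" words per tag
```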

Models / Latent Dirichlet Allocation/Hidden Markov Model [Griffiths et al. 2005]
[Graphical model: word w_i drawn from topic-word distribution φ with word/topic prior β; topic z_i drawn from document topic distribution θ with topic prior α; tag t_i drawn from transition distribution δ with transition prior γ]
It is an LDA that jointly handles stopword removal.
It captures the topic/non-topic dichotomy.
It conflates all topical content words into a single state.
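
A hedged sketch of the LDAHMM generative story as summarized here (after Griffiths et al., 2005): an HMM over states in which a single "semantic" state emits words through document-specific LDA topics, while the remaining states emit words directly. Dimensions, hyperparameter values, and the prior name eta are assumptions.

```python
# Sketch of the LDAHMM generative story (illustrative, not the authors' code).
import numpy as np

rng = np.random.default_rng(1)
W, T, K = 1000, 20, 50                  # vocab size, HMM states, LDA topics
alpha, beta, gamma, eta = 1.0, 0.1, 0.1, 0.0001

phi = rng.dirichlet(np.full(W, beta), size=K)     # topic-word distributions
psi = rng.dirichlet(np.full(W, eta), size=T)      # state-word distributions
delta = rng.dirichlet(np.full(T, gamma), size=T)  # state transitions
SEM = 0                                           # the single "semantic" state

def generate_document(n_words):
    theta = rng.dirichlet(np.full(K, alpha))      # document topic distribution
    t, words = 0, []
    for _ in range(n_words):
        t = rng.choice(T, p=delta[t])
        if t == SEM:                 # all topical content words share one state
            z = rng.choice(K, p=theta)
            words.append(rng.choice(W, p=phi[z]))
        else:                        # syntactic / function-word states
            words.append(rng.choice(W, p=psi[t]))
    return words
```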

Model I / Crouching Dirichlet, Hidden Markov Model
[Graphical model: word w_i drawn from content-word distribution φ with content-word prior β or function-word distribution ψ with function-word prior ξ; tag t_i governed by θ with the crouching Dirichlet prior α and by δ with the transition prior γ]
Define a composite distribution that models content states and function states separately.
Content words given a tag are less sparse than function words given a tag.
Assume that content words and tags have greater variance across contexts.
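
A speculative sketch, under stated assumptions, of the CDHMM's composite priors: a flat emission prior for content states, a sharp emission prior for function states, and a per-document "crouching Dirichlet" distribution over the content states. The exact generative wiring follows the Gibbs conditional shown later; all names, dimensions, and values here are illustrative.

```python
# Minimal sketch of the CDHMM's composite priors (not the authors' code).
import numpy as np

rng = np.random.default_rng(2)
W, T, C = 1000, 20, 5        # vocab, total states, content states (as in the talk)
beta, xi, alpha, gamma = 0.1, 0.0001, 1.0, 0.1

phi = rng.dirichlet(np.full(W, beta), size=C)      # content emissions: many word types
psi = rng.dirichlet(np.full(W, xi), size=T - C)    # function emissions: few word types
delta = rng.dirichlet(np.full(T, gamma), size=T)   # state transitions
theta_d = rng.dirichlet(np.full(C, alpha))         # per-document distribution over
                                                   # content states (the "crouching" part)

# Content emission rows spread probability mass over far more word types than
# function emission rows, mirroring the content/function dichotomy.
print((phi > 1e-3).sum(axis=1), (psi > 1e-3).sum(axis=1))
```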

Model I / Crouching Dirichlet, Hidden Markov Model (Content States)
[Graphical model: word-over-content-state distribution φ with prior β (large); "the Dirichlet who crouches" θ with prior α; state transition distribution δ with prior γ]

Model I / Crouching Dirichlet, Hidden Markov Model (Function States)
[Graphical model: word-over-function-state distribution ψ with prior ξ (small); state transition distribution δ with prior γ]

Models / Bayesian Hidden Markov Model vs. CDHMM
[Side-by-side graphical models: Bayesian HMM (emissions ψ with prior ξ, transitions δ with prior γ) vs. CDHMM (content emissions φ with prior β, function emissions ψ with prior ξ, crouching Dirichlet θ with prior α, transitions δ with prior γ)]

Models / LDAHMM vs. CDHMM
[Side-by-side graphical models: LDAHMM (topics z_i with document distribution θ and prior α, topic-word φ with prior β, transitions δ with prior γ) vs. CDHMM (content emissions φ with prior β, crouching Dirichlet θ with prior α, transitions δ with prior γ)]

Model II / HMM+
[Graphical model: word-over-content-state distribution φ with prior β (large); word-over-function-state distribution ψ with prior ξ (small); state transition distribution δ with prior γ]

Parameter inference: Collapsed Gibbs Sampling
Conditional distribution of interest: $p(t_i \mid \mathbf{t}_{-i}, \mathbf{w})$

Bayesian HMM conditional distribution:
$$p(t_i \mid \mathbf{t}_{-i}, \mathbf{w}) \propto \frac{N_{w_i|t_i} + \xi}{N_{t_i} + W\xi}\,\bigl(N_{t_i|t_{i-1}} + \gamma\bigr)\,\frac{N_{t_{i+1}|t_i} + I[t_{i-1}=t_i=t_{i+1}] + \gamma}{N_{t_i} + T\gamma + I[t_i=t_{i-1}]}$$

CDHMM conditional distribution:
$$p(t_i \mid \mathbf{t}_{-i}, \mathbf{w}) \propto \begin{cases} \dfrac{N_{w_i|t_i} + \beta}{N_{t_i} + W\beta}\,\dfrac{N_{t_i|d_i} + \alpha}{N_{d_i} + C\alpha}\,\bigl(N_{t_i|t_{i-1}} + \gamma\bigr)\,\dfrac{N_{t_{i+1}|t_i} + I[t_{i-1}=t_i=t_{i+1}] + \gamma}{N_{t_i} + T\gamma + I[t_i=t_{i-1}]} & t_i \in C \\[1ex] \dfrac{N_{w_i|t_i} + \xi}{N_{t_i} + W\xi}\,\bigl(N_{t_i|t_{i-1}} + \gamma\bigr)\,\dfrac{N_{t_{i+1}|t_i} + I[t_{i-1}=t_i=t_{i+1}] + \gamma}{N_{t_i} + T\gamma + I[t_i=t_{i-1}]} & t_i \in F \end{cases}$$

Here $W$ is the vocabulary size, $T$ the number of states, $C$ the number of content states, and $d_i$ the document (context) containing position $i$.
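
A minimal sketch (not the authors' implementation) of one collapsed Gibbs update using the Bayesian HMM conditional above; the count-table names, and the assumption that the counts already exclude position i, are mine.

```python
# Minimal sketch of one collapsed Gibbs draw for position i (interior position
# assumed). N_wt[t][w] = word-tag counts, N_tt[prev][next] = transition counts,
# N_t[t] = tag totals; W, T, xi, gamma as defined on the slide. Counts must
# already exclude the current assignment at position i.
import numpy as np

def resample_tag(i, w, tags, N_wt, N_tt, N_t, W, T, xi, gamma, rng):
    t_prev, t_next = tags[i - 1], tags[i + 1]
    scores = np.empty(T)
    for t in range(T):
        emit = (N_wt[t][w[i]] + xi) / (N_t[t] + W * xi)
        left = N_tt[t_prev][t] + gamma
        right = (N_tt[t][t_next] + (t_prev == t == t_next) + gamma) / \
                (N_t[t] + T * gamma + (t == t_prev))
        scores[t] = emit * left * right
    scores /= scores.sum()
    return rng.choice(T, p=scores)

# In the CDHMM, the score for a content state t would additionally be multiplied
# by the document term (N_td[t][d_i] + alpha) / (N_d[d_i] + C * alpha).
```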

Data & Experiments / Corpora
Penn Treebank Wall Street Journal: English, 1M words
Brown: English, 800K words
Tiger: German, 450K words
Floresta: Portuguese, 200K words
Uspanteko: 70K words, transcribed text, tagged over morphemes

Data & Experiments / Evaluation measures
Greedily matched accuracy (one-to-one, many-to-one)
Pairwise token-level precision, recall, and f-score
Variation of information [Meilă, 2007]
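
For concreteness, a small sketch of the greedy matching accuracies: many-to-one maps each induced state to its most frequent gold tag, while one-to-one greedily assigns states to distinct gold tags by descending overlap. This is the standard construction, not necessarily the exact evaluation script the authors used.

```python
# Minimal sketch: greedy one-to-one and many-to-one matched accuracies for
# sequences of predicted states and gold tags of equal length.
from collections import Counter, defaultdict

def many_to_one(pred, gold):
    """Map each induced state to its most frequent gold tag, then score."""
    by_state = defaultdict(Counter)
    for p, g in zip(pred, gold):
        by_state[p][g] += 1
    mapping = {p: c.most_common(1)[0][0] for p, c in by_state.items()}
    return sum(mapping[p] == g for p, g in zip(pred, gold)) / len(gold)

def one_to_one(pred, gold):
    """Greedily pair states with distinct gold tags by descending overlap."""
    overlap = Counter(zip(pred, gold))
    mapping, used = {}, set()
    for (p, g), _ in overlap.most_common():
        if p not in mapping and g not in used:
            mapping[p] = g
            used.add(g)
    return sum(mapping.get(p) == g for p, g in zip(pred, gold)) / len(gold)
```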

Data & Experiments / Models
Bayesian HMM: does not model the content/function dichotomy
LDAHMM: models the topic/non-topic dichotomy
HMM+: models the content/function dichotomy with different blocked priors
CDHMM: models the content/function dichotomy with different blocked priors and a crouching Dirichlet prior

Data & Experiments / Parameter settings
[Graphical model annotated with hyperparameter values: content-word prior β = 0.1, function-word prior ξ = 0.0001, topic/CD prior α = 1, transition prior γ = 0.1]
Hyperparameters: uninformative/symmetric priors; no hyperparameter re-estimation
Other settings: total number of states 20/30/40/50; number of content states 5; 1000 iterations; 10 chains, single sample each

Full Results by Corpus / Wall Street Journal
[Figure: one-to-one accuracy, many-to-one accuracy, f-score, and variation of information (VI) for each model]

Full Results by Corpus / Brown
[Figure: one-to-one accuracy, many-to-one accuracy, f-score, and variation of information (VI) for each model]

Full Results by Corpus / Tiger
[Figure: one-to-one accuracy, many-to-one accuracy, f-score, and variation of information (VI) for each model]

Full Results by Corpus / Floresta
[Figure: one-to-one accuracy, many-to-one accuracy, f-score, and variation of information (VI) for each model]

Full Results by Corpus / Uspanteko
[Figure: one-to-one accuracy, many-to-one accuracy, f-score, and variation of information (VI) for each model]

Full Results by Corpus
CDHMM always wins or ties on accuracy.
CDHMM loses once to HMM+ on VI (Floresta).
HMM wins once on f-score (Uspanteko).
HMM+ ties with LDAHMM once on f-score (Floresta).
CDHMM wins elsewhere on f-score.

f-score dependent on the number of states
[Figure: f-score by number of states for WSJ, Brown, Tiger, Floresta, and Uspanteko]
If the number of model states is kept close to the number of gold tags, CDHMM wins or ties on every corpus on every measure.

Results: Accuracy Learning Curves / Wall Street Journal
[Figure: one-to-one and many-to-one accuracy learning curves for hmm, hmm+, ldahmm, and cdhmm]

Results: Accuracy Learning Curves / Brown
[Figure: one-to-one and many-to-one accuracy learning curves for hmm, hmm+, ldahmm, and cdhmm]

Results: Accuracy Learning Curves / Tiger (German)
[Figure: one-to-one and many-to-one accuracy learning curves for hmm, hmm+, ldahmm, and cdhmm]

Results: Accuracy Learning Curves / Floresta (Portuguese)
[Figure: one-to-one and many-to-one accuracy learning curves for hmm, hmm+, ldahmm, and cdhmm]

Results: Accuracy Learning Curves / Uspanteko
[Figure: one-to-one and many-to-one accuracy learning curves for hmm, hmm+, ldahmm, and cdhmm]

Conclusion
Modeling the content/function word dichotomy improves the HMM.
Capturing the context variance of content words improves the HMM even further.
Merely modeling this dichotomy through blocked hyperparameters works as well.

Thank you!

Bibliography
[Meilă, 2007] Marina Meilă. Comparing clusterings - an information based distance. Journal of Multivariate Analysis, 2007.
[Griffiths et al., 2005] Thomas L. Griffiths, Mark Steyvers, David M. Blei, and Joshua B. Tenenbaum. Integrating topics and syntax. Advances in Neural Information Processing Systems, 2005.