Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation

Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation
Taesun Moon, Katrin Erk, and Jason Baldridge
Department of Linguistics, University of Texas at Austin, 1 University Station B5100, Austin, TX 78712-0198, USA
Empirical Methods in Natural Language Processing (EMNLP) 2010

Introduction / Unsupervised HMM POS tagging: problems
The standard HMM is a very poor approximation of natural language:
the Markov independence assumption is too strong, and
the parameter configuration is too restrictive.

Introduction / A simple dichotomy in natural language
For many languages, words can generally be grouped into function words and content words.
There are few function words by type; individually, these function words appear relatively frequently.
There are many content words by type; individually, these content words appear relatively infrequently.

Introduction / Some numbers from the Penn Treebank WSJ
Number of word types (total tokens) per tag: NN = 9321 (164K), JJ = 8591 (75K), DT = 24 (101K), CC = 22 (29K)
Conditional probability of the most frequent word given the tag: p(company | NN) = 0.02, p(other | JJ) = 0.02, p(the | DT) = 0.59, p(and | CC) = 0.69
Transition probabilities: p(NN | JJ) = 0.45, p(NNP | NNP) = 0.38, p(DT | PDT) = 0.91, p(VB | MD) = 0.80
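
As a concrete illustration of where such numbers come from, here is a minimal sketch (not the authors' code) that computes per-tag type/token counts and the relevant conditional probabilities from a tagged corpus; the corpus variable and its representation are assumptions.

```python
# Minimal sketch: per-tag statistics from a corpus given as a list of sentences,
# where each sentence is a list of (word, tag) pairs. Corpus loading is assumed.
from collections import Counter, defaultdict

def tag_statistics(sentences):
    emit = defaultdict(Counter)   # emit[tag][word] = count
    trans = defaultdict(Counter)  # trans[prev_tag][next_tag] = count
    for sent in sentences:
        for i, (word, tag) in enumerate(sent):
            emit[tag][word] += 1
            if i + 1 < len(sent):
                trans[tag][sent[i + 1][1]] += 1
    stats = {}
    for tag, words in emit.items():
        tokens = sum(words.values())
        top_word, top_count = words.most_common(1)[0]
        stats[tag] = {
            "types": len(words),               # e.g. DT has ~24 types
            "tokens": tokens,
            "top_word": top_word,
            "p_top_word": top_count / tokens,  # e.g. p(the | DT) ~ 0.59
        }
    return stats, trans

# Usage with a hypothetical corpus variable:
# stats, trans = tag_statistics(wsj_sentences)
# p_nn_given_jj = trans["JJ"]["NN"] / sum(trans["JJ"].values())
```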

Introduction / Some assumptions in IR
Function words are almost always stopwords.
tf-idf is predicated on the difference in variance of words across documents.
LSI, LDA, etc. capture the variance of content words across documents.

Introduction / Solutions
Directly model the statistical dichotomy between content and function words.
Capture the possible variance of content words and tags across contexts.
Retain the HMM framework and extend from there.

Models / Standard Hidden Markov Model
[Graphical model: word chain w_{i-1}, w_i, w_{i+1} emitted from tag chain t_{i-1}, t_i, t_{i+1}]
It is difficult to model the sparsity of words given a tag.
It is still competitive with the Bayesian HMM.
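
For reference, a minimal sketch of the factorization the standard first-order HMM assumes, p(w, t) = prod_i p(t_i | t_{i-1}) p(w_i | t_i); the table representation and start symbol below are illustrative assumptions, not the paper's code.

```python
# Minimal sketch: log joint probability of a word/tag sequence under a
# first-order HMM, given transition and emission tables as nested dicts.
import math

def hmm_log_joint(words, tags, trans, emit, start="<s>"):
    """trans[prev][cur] = p(cur | prev); emit[tag][word] = p(word | tag)."""
    logp = 0.0
    prev = start
    for w, t in zip(words, tags):
        logp += math.log(trans[prev][t]) + math.log(emit[t][w])
        prev = t
    return logp
```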

Models / Bayesian Hidden Markov Model
[Graphical model: word w_i drawn from emission distribution ψ with emission prior ξ; tag t_i drawn from transition distribution δ with transition prior γ]
It can model sparsity through the hyperparameters.
It is difficult to capture the content/function dichotomy.
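
A small sketch of what the symmetric Dirichlet priors buy: with a tiny emission hyperparameter, sampled emission distributions put almost all of their mass on a few word types. Dimensions and values below are illustrative, not the paper's exact setup.

```python
# Minimal sketch: symmetric Dirichlet priors on each tag's emission (psi) and
# transition (delta) distributions; a small concentration favors sparsity.
import numpy as np

rng = np.random.default_rng(0)
W, T = 1000, 20          # vocabulary size and number of tags (illustrative)
xi, gamma = 0.0001, 0.1  # emission and transition hyperparameters

psi = rng.dirichlet(np.full(W, xi), size=T)       # emission distributions
delta = rng.dirichlet(np.full(T, gamma), size=T)  # transition distributions

# With xi this small, each row of psi concentrates on very few word types.
print((psi > 0.01).sum(axis=1))  # typically a handful of "heavy" words per tag
```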

Models / Latent Dirichlet Allocation/Hidden Markov Model [Griffiths et al. 2005]
[Graphical model: word w_i drawn from topic-word distribution φ with word/topic prior β; topic z_i drawn from document topic distribution θ with topic prior α; tag t_i drawn from transition distribution δ with transition prior γ]
It is an LDA that jointly handles stopword removal.
It captures the topic/non-topic dichotomy.
It conflates all topical content words into a single state.
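
A hedged sketch of the LDAHMM generative story as summarized here (after Griffiths et al., 2005): an HMM over states in which a single "semantic" state emits words through document-specific LDA topics, while the remaining states emit words directly. Dimensions, hyperparameter values, and the prior name eta are assumptions.

```python
# Sketch of the LDAHMM generative story (illustrative, not the authors' code).
import numpy as np

rng = np.random.default_rng(1)
W, T, K = 1000, 20, 50                  # vocab size, HMM states, LDA topics
alpha, beta, gamma, eta = 1.0, 0.1, 0.1, 0.0001

phi = rng.dirichlet(np.full(W, beta), size=K)     # topic-word distributions
psi = rng.dirichlet(np.full(W, eta), size=T)      # state-word distributions
delta = rng.dirichlet(np.full(T, gamma), size=T)  # state transitions
SEM = 0                                           # the single "semantic" state

def generate_document(n_words):
    theta = rng.dirichlet(np.full(K, alpha))      # document topic distribution
    t, words = 0, []
    for _ in range(n_words):
        t = rng.choice(T, p=delta[t])
        if t == SEM:                 # all topical content words share one state
            z = rng.choice(K, p=theta)
            words.append(rng.choice(W, p=phi[z]))
        else:                        # syntactic / function-word states
            words.append(rng.choice(W, p=psi[t]))
    return words
```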

Model I / Crouching Dirichlet, Hidden Markov Model
[Graphical model: word w_i drawn from content-word distribution φ with content-word prior β or function-word distribution ψ with function-word prior ξ; tag t_i governed by θ with the crouching Dirichlet prior α and by δ with the transition prior γ]
Define a composite distribution that models content states and function states separately.
Content words given a tag are less sparse than function words given a tag.
Assume that content words and tags have greater variance across contexts.
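
A speculative sketch, under stated assumptions, of the CDHMM's composite priors: a flat emission prior for content states, a sharp emission prior for function states, and a per-document "crouching Dirichlet" distribution over the content states. The exact generative wiring follows the Gibbs conditional shown later; all names, dimensions, and values here are illustrative.

```python
# Minimal sketch of the CDHMM's composite priors (not the authors' code).
import numpy as np

rng = np.random.default_rng(2)
W, T, C = 1000, 20, 5        # vocab, total states, content states (as in the talk)
beta, xi, alpha, gamma = 0.1, 0.0001, 1.0, 0.1

phi = rng.dirichlet(np.full(W, beta), size=C)      # content emissions: many word types
psi = rng.dirichlet(np.full(W, xi), size=T - C)    # function emissions: few word types
delta = rng.dirichlet(np.full(T, gamma), size=T)   # state transitions
theta_d = rng.dirichlet(np.full(C, alpha))         # per-document distribution over
                                                   # content states (the "crouching" part)

# Content emission rows spread probability mass over far more word types than
# function emission rows, mirroring the content/function dichotomy.
print((phi > 1e-3).sum(axis=1), (psi > 1e-3).sum(axis=1))
```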

Model I / Crouching Dirichlet, Hidden Markov Model (Content States)
[Graphical model: word-over-content-state distribution φ with prior β (large); "the Dirichlet who crouches" θ with prior α; state transition distribution δ with prior γ]

Model I / Crouching Dirichlet, Hidden Markov Model (Function States)
[Graphical model: word-over-function-state distribution ψ with prior ξ (small); state transition distribution δ with prior γ]

Models / Bayesian Hidden Markov Model vs. CDHMM
[Side-by-side graphical models: Bayesian HMM (emissions ψ with prior ξ, transitions δ with prior γ) vs. CDHMM (content emissions φ with prior β, function emissions ψ with prior ξ, crouching Dirichlet θ with prior α, transitions δ with prior γ)]

Models / LDAHMM vs. CDHMM
[Side-by-side graphical models: LDAHMM (topics z_i with document distribution θ and prior α, topic-word φ with prior β, transitions δ with prior γ) vs. CDHMM (content emissions φ with prior β, crouching Dirichlet θ with prior α, transitions δ with prior γ)]

Model II / HMM+
[Graphical model: word-over-content-state distribution φ with prior β (large); word-over-function-state distribution ψ with prior ξ (small); state transition distribution δ with prior γ]

Parameter inference: Collapsed Gibbs Sampling
Conditional distribution of interest: $p(t_i \mid \mathbf{t}_{-i}, \mathbf{w})$

Bayesian HMM conditional distribution:
$$p(t_i \mid \mathbf{t}_{-i}, \mathbf{w}) \propto \frac{N_{w_i|t_i} + \xi}{N_{t_i} + W\xi}\,\bigl(N_{t_i|t_{i-1}} + \gamma\bigr)\,\frac{N_{t_{i+1}|t_i} + I[t_{i-1}=t_i=t_{i+1}] + \gamma}{N_{t_i} + T\gamma + I[t_i=t_{i-1}]}$$

CDHMM conditional distribution:
$$p(t_i \mid \mathbf{t}_{-i}, \mathbf{w}) \propto \begin{cases} \dfrac{N_{w_i|t_i} + \beta}{N_{t_i} + W\beta}\,\dfrac{N_{t_i|d_i} + \alpha}{N_{d_i} + C\alpha}\,\bigl(N_{t_i|t_{i-1}} + \gamma\bigr)\,\dfrac{N_{t_{i+1}|t_i} + I[t_{i-1}=t_i=t_{i+1}] + \gamma}{N_{t_i} + T\gamma + I[t_i=t_{i-1}]} & t_i \in C \\[1ex] \dfrac{N_{w_i|t_i} + \xi}{N_{t_i} + W\xi}\,\bigl(N_{t_i|t_{i-1}} + \gamma\bigr)\,\dfrac{N_{t_{i+1}|t_i} + I[t_{i-1}=t_i=t_{i+1}] + \gamma}{N_{t_i} + T\gamma + I[t_i=t_{i-1}]} & t_i \in F \end{cases}$$

Here $W$ is the vocabulary size, $T$ the number of states, $C$ the number of content states, and $d_i$ the document (context) containing position $i$.
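
A minimal sketch (not the authors' implementation) of one collapsed Gibbs update using the Bayesian HMM conditional above; the count-table names, and the assumption that the counts already exclude position i, are mine.

```python
# Minimal sketch of one collapsed Gibbs draw for position i (interior position
# assumed). N_wt[t][w] = word-tag counts, N_tt[prev][next] = transition counts,
# N_t[t] = tag totals; W, T, xi, gamma as defined on the slide. Counts must
# already exclude the current assignment at position i.
import numpy as np

def resample_tag(i, w, tags, N_wt, N_tt, N_t, W, T, xi, gamma, rng):
    t_prev, t_next = tags[i - 1], tags[i + 1]
    scores = np.empty(T)
    for t in range(T):
        emit = (N_wt[t][w[i]] + xi) / (N_t[t] + W * xi)
        left = N_tt[t_prev][t] + gamma
        right = (N_tt[t][t_next] + (t_prev == t == t_next) + gamma) / \
                (N_t[t] + T * gamma + (t == t_prev))
        scores[t] = emit * left * right
    scores /= scores.sum()
    return rng.choice(T, p=scores)

# In the CDHMM, the score for a content state t would additionally be multiplied
# by the document term (N_td[t][d_i] + alpha) / (N_d[d_i] + C * alpha).
```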

Data & Experiments / Corpora
Penn Treebank Wall Street Journal: English, 1M words
Brown: English, 800K words
Tiger: German, 450K words
Floresta: Portuguese, 200K words
Uspanteko: 70K words, transcribed text, tagged over morphemes

Data & Experiments / Evaluation measures
Greedily matched accuracy (one-to-one, many-to-one)
Pairwise token-level precision, recall, and f-score
Variation of information [Meilă, 2007]
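
For concreteness, a small sketch of the greedy matching accuracies: many-to-one maps each induced state to its most frequent gold tag, while one-to-one greedily assigns states to distinct gold tags by descending overlap. This is the standard construction, not necessarily the exact evaluation script the authors used.

```python
# Minimal sketch: greedy one-to-one and many-to-one matched accuracies for
# sequences of predicted states and gold tags of equal length.
from collections import Counter, defaultdict

def many_to_one(pred, gold):
    """Map each induced state to its most frequent gold tag, then score."""
    by_state = defaultdict(Counter)
    for p, g in zip(pred, gold):
        by_state[p][g] += 1
    mapping = {p: c.most_common(1)[0][0] for p, c in by_state.items()}
    return sum(mapping[p] == g for p, g in zip(pred, gold)) / len(gold)

def one_to_one(pred, gold):
    """Greedily pair states with distinct gold tags by descending overlap."""
    overlap = Counter(zip(pred, gold))
    mapping, used = {}, set()
    for (p, g), _ in overlap.most_common():
        if p not in mapping and g not in used:
            mapping[p] = g
            used.add(g)
    return sum(mapping.get(p) == g for p, g in zip(pred, gold)) / len(gold)
```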

Data & Experiments / Models
Bayesian HMM: does not model the content/function dichotomy
LDAHMM: models the topic/non-topic dichotomy
HMM+: models the content/function dichotomy with different blocked priors
CDHMM: models the content/function dichotomy with different blocked priors and a crouching Dirichlet prior

Data & Experiments / Parameter settings
[Graphical model annotated with hyperparameter values: content-word prior β = 0.1, function-word prior ξ = 0.0001, topic/CD prior α = 1, transition prior γ = 0.1]
Hyperparameters: uninformative/symmetric priors; no hyperparameter re-estimation
Other settings: total number of states 20/30/40/50; number of content states 5; 1000 iterations; 10 chains, single sample each

Full Results by Corpus / Wall Street Journal
[Figure: one-to-one accuracy, many-to-one accuracy, f-score, and variation of information (VI) for each model]

Full Results by Corpus / Brown
[Figure: one-to-one accuracy, many-to-one accuracy, f-score, and variation of information (VI) for each model]

Full Results by Corpus / Tiger
[Figure: one-to-one accuracy, many-to-one accuracy, f-score, and variation of information (VI) for each model]

Full Results by Corpus / Floresta
[Figure: one-to-one accuracy, many-to-one accuracy, f-score, and variation of information (VI) for each model]

Full Results by Corpus / Uspanteko
[Figure: one-to-one accuracy, many-to-one accuracy, f-score, and variation of information (VI) for each model]

Full Results by Corpus
CDHMM always wins or ties on accuracy.
CDHMM loses once to HMM+ on VI (Floresta).
HMM wins once on f-score (Uspanteko).
HMM+ ties with LDAHMM once on f-score (Floresta).
CDHMM wins elsewhere on f-score.

f-score dependent on the number of states
[Figure: f-score by number of states for WSJ, Brown, Tiger, Floresta, and Uspanteko]
If the number of model states is kept close to the number of gold tags, CDHMM wins or ties on every corpus on every measure.

Results: Accuracy Learning Curves / Wall Street Journal
[Figure: one-to-one and many-to-one accuracy learning curves for hmm, hmm+, ldahmm, and cdhmm]

Results: Accuracy Learning Curves / Brown
[Figure: one-to-one and many-to-one accuracy learning curves for hmm, hmm+, ldahmm, and cdhmm]

Results: Accuracy Learning Curves / Tiger (German)
[Figure: one-to-one and many-to-one accuracy learning curves for hmm, hmm+, ldahmm, and cdhmm]

Results: Accuracy Learning Curves / Floresta (Portuguese)
[Figure: one-to-one and many-to-one accuracy learning curves for hmm, hmm+, ldahmm, and cdhmm]

Results: Accuracy Learning Curves / Uspanteko
[Figure: one-to-one and many-to-one accuracy learning curves for hmm, hmm+, ldahmm, and cdhmm]

Conclusion
Modeling the content/function word dichotomy improves the HMM.
Capturing the context variance of content words improves the HMM even further.
Merely modeling this dichotomy through blocked hyperparameters works as well.

Thank you!

Bibliography
[Meilă, 2007] Marina Meilă. Comparing clusterings - an information based distance. Journal of Multivariate Analysis, 2007.
[Griffiths et al., 2005] Thomas L. Griffiths, Mark Steyvers, David M. Blei, and Joshua B. Tenenbaum. Integrating topics and syntax. Advances in Neural Information Processing Systems, 2005.