Latent Variable Models in NLP


Latent Variable Models in NLP. Aria Haghighi, with Slav Petrov, John DeNero, and Dan Klein. UC Berkeley, CS Division

Latent Variable Models: Observed vs. Hidden. E.g. Word Alignment: the bitext is observed, the alignment is hidden.

Basic question for LV models. Q: How do I make latent variables behave the way I want them to? A: Very careful model design!

Recent Applications: Parsing (PCFG Annotation); Machine Translation (Learning Phrase Tables); Information Extraction / Discourse (Unsupervised Coref Resolution). Disclaimer: there are many other excellent examples of LV models in NLP; these examples are from the Berkeley group.

PCFG Annotation (e.g. distinguishing subject NP vs. object NP). Annotation refines base treebank symbols to improve the statistical fit of the grammar. Parent annotation [Johnson 98]. Head lexicalization [Collins 99]. Automatic clustering? [Matsuzaki 05, Petrov 06]

[Matsuzaki et al., 05] Learning Latent Annotations. EM algorithm: the brackets are known, the base categories are known, only the subcategories are hidden. Just like Forward-Backward for HMMs. [Figure: a parse tree with latent symbols X1-X7 over the sentence "He was right", with forward and backward passes.]
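
To make the E-step concrete, the following is a minimal sketch (in the spirit of, but not reproducing, the Matsuzaki et al. setup) of inside/outside computations on a single fixed, binarized tree with K latent subcategories per observed symbol. The toy tree for "He was right", the random parameters, and the uniform root prior are all illustrative assumptions.

import numpy as np

K = 2  # latent subcategories per observed treebank symbol (assumed)
rng = np.random.default_rng(0)

class Node:
    def __init__(self, label, children=(), word=None):
        self.label, self.children, self.word = label, children, word

# Toy binarized tree for "He was right": (S (NP He) (VP (V was) (ADJ right)))
tree = Node("S", [Node("NP", word="He"),
                  Node("VP", [Node("V", word="was"),
                              Node("ADJ", word="right")])])

# binary[(A,B,C)][x,y,z] = P(B_y C_z | A_x);
# unary[(A,w)][x] = score of emitting w from A_x (toy stand-in for P(w | A_x))
binary, unary = {}, {}

def init(node):
    if node.word is not None:
        p = rng.random(K)
        unary[(node.label, node.word)] = p / p.sum()
    else:
        l, r = node.children
        p = rng.random((K, K, K))
        binary[(node.label, l.label, r.label)] = p / p.sum(axis=(1, 2), keepdims=True)
        init(l); init(r)
init(tree)

def inside(node):
    # inside[x] = P(words under node | node carries subcategory x)
    if node.word is not None:
        node.inside = unary[(node.label, node.word)]
    else:
        l, r = node.children
        inside(l); inside(r)
        rule = binary[(node.label, l.label, r.label)]
        node.inside = np.einsum("xyz,y,z->x", rule, l.inside, r.inside)

def outside(node, score):
    # outside[x] = P(rest of the tree and words, node carries subcategory x)
    node.outside = score
    if node.word is None:
        l, r = node.children
        rule = binary[(node.label, l.label, r.label)]
        outside(l, np.einsum("xyz,x,z->y", rule, score, r.inside))
        outside(r, np.einsum("xyz,x,y->z", rule, score, l.inside))

inside(tree)
outside(tree, np.ones(K) / K)      # assumed uniform prior over root subcategories
Z = tree.inside @ tree.outside     # probability of the observed tree and words

def show_posteriors(node):
    print(node.label, np.round(node.inside * node.outside / Z, 3))  # P(subcategory | data)
    for c in node.children:
        show_posteriors(c)
show_posteriors(tree)
# An M-step would accumulate these expected counts over the whole treebank and renormalize the rules.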

Learning Latent Annotations: e.g. the treebank symbol DT is split into latent subcategories DT-1, DT-2, DT-3, DT-4.

[Petrov et al., 06] Hierarchical Split/Merge Training: repeatedly split each symbol (e.g. DT) in two and train with EM; merge back unhelpful splits.

Results (k=16):
  Model           F1
  Flat Training   87.3
  Split/Merge     89.5

Lesson: Don't give EM too much of a leash.

Linguistic Candy [Petrov et al., 06]
Proper nouns (NNP):
  NNP-14: Oct., Nov., Sept.
  NNP-12: John, Robert, James
  NNP-2:  J., E., L.
  NNP-1:  Bush, Noriega, Peters
  NNP-15: New, San, Wall
  NNP-3:  York, Francisco, Street
Personal pronouns (PRP):
  PRP-0: It, He, I
  PRP-1: it, he, they
  PRP-2: it, them, him

[Petrov et al., 07] Learning Latent Annotations: parsing results.
             Model                                 <=40 words F1   all F1
  English    Charniak & Johnson 05 (generative)        90.1         89.6
  English    Petrov et al. 07                          90.6         90.1
  German     Dubey 05                                  76.3          -
  German     Petrov et al. 07                          80.8         80.1
  Chinese    Chiang et al. 02                          80.0         76.6
  Chinese    Petrov et al. 07                          86.3         83.4
Download the parser: http://nlp.cs.berkeley.edu

Outline: Parsing (PCFG Annotation); Machine Translation (Learning Phrase Tables); Information Extraction / Discourse (Unsupervised Coref Resolution).

Phrase-Based Model: from a sentence-aligned corpus, compute directional word alignments, intersect and grow them, and read off a phrase table (translation model), e.g.:
  cat       -> chat       0.9
  the cat   -> le chat    0.8
  dog       -> chien      0.8
  house     -> maison     0.6
  my house  -> ma maison  0.9
  language  -> langue     0.9

Overview: Phrase Extraction. Example: "appelle un chat un chat" / "call a spade a spade". What's hidden: the sentence segmentation and phrasal alignment (e.g. appelle <-> call, un chat <-> a spade).

From the aligned sentence pair, all consistent sub-phrase pairs are extracted (appelle <-> call, appelle un <-> call a, appelle un chat <-> call a spade, un <-> a (x2), un chat <-> a spade (x2), un chat un chat <-> a spade a spade, chat <-> spade (x2), chat un chat <-> spade a spade, ...). Extract all phrases; there is no competition for which are more useful.
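
To make "extract all phrases" concrete, here is a minimal sketch of the standard consistency heuristic: a phrase pair is kept if none of its alignment links leave the box. The toy sentence pair, the monotone word alignment, and the length limit are assumptions.

def extract_phrases(src, tgt, alignment, max_len=4):
    # A (source span, target span) pair is "consistent" if every alignment link
    # touching either span stays inside the box formed by the two spans.
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            tgt_pos = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tgt_pos:
                continue
            j1, j2 = min(tgt_pos), max(tgt_pos)
            # reject if some link enters the target span from outside the source span
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in alignment):
                continue
            if j2 - j1 < max_len:
                pairs.append((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs

src = "appelle un chat un chat".split()
tgt = "call a spade a spade".split()
alignment = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]   # assumed word alignment (src pos, tgt pos)

for f, e in extract_phrases(src, tgt, alignment):
    print(f, "->", e)
# e.g. appelle -> call, un chat -> a spade, appelle un chat -> call a spade, ...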

Learning Phrases: instead, learn a phrase-level generative model directly from the sentence-aligned corpus to produce the phrase table (translation model). DeNero et al. (2006) argue that this underperforms the surface statistic: the latent variables are abused, and EM finds a bad solution. The DeNero et al. (2008) model outperforms the surface statistics.

Dirichlet Process (DP) Prior: a distribution over distributions, specified by a concentration parameter and a base measure. A draw G is almost surely discrete (even if the base measure isn't):
  G = \sum_{i=1}^{\infty} \pi_i \, \delta_{x_i}(\cdot)
Each atom x_i represents the parameters of a latent cluster. The model is nonparametric: the number of instantiated parameters grows with the data (roughly log N for N points).
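
A truncated stick-breaking construction makes the discreteness of G concrete. The concentration alpha, the truncation level, and the standard-normal base measure below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
alpha = 1.0    # concentration (assumed value)
T = 1000       # truncation level (assumed)

# G = sum_i pi_i * delta_{x_i}, with pi from stick-breaking (GEM(alpha)) and atoms x_i ~ P0
betas = rng.beta(1.0, alpha, size=T)                                  # stick-breaking proportions
pi = betas * np.cumprod(np.concatenate(([1.0], 1.0 - betas[:-1])))    # pi_i = beta_i * prod_{j<i}(1 - beta_j)
atoms = rng.normal(0.0, 1.0, size=T)                                  # x_i ~ P0 (assumed standard normal base)

print("weights sum to ~1:", pi.sum())
print("a few (weight, atom) pairs:", list(zip(np.round(pi[:5], 3), np.round(atoms[:5], 3))))
# Even though P0 is continuous, G places all of its mass on the discrete atoms x_i.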

Chinese Restaurant Process
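
The Chinese Restaurant Process is the sequential seating scheme induced by integrating out G: customer n+1 sits at an existing table k with probability proportional to its occupancy n_k, or opens a new table with probability proportional to the concentration alpha. A minimal sketch, with the number of customers and alpha as assumed values:

import random

def crp(n_customers, alpha, seed=0):
    random.seed(seed)
    tables = []                          # tables[k] = number of customers at table k
    for n in range(n_customers):
        weights = tables + [alpha]       # existing tables, then the option of a new table
        k = random.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(1)             # open a new table
        else:
            tables[k] += 1
    return tables

print(crp(100, alpha=1.0))   # typically a few large tables and a tail of small ones
# The number of occupied tables grows roughly like alpha * log(N).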

Bayesian Phrase Learning [DeNero et al., 2008]. Given: a sentence pair (e, f).
1. Generate the number of phrase pairs l:  P(l) = p_$ (1 - p_$)^{l-1}
2. Generate l phrase pairs {(e_i, f_i)}:  (e_i, f_i) ~ G,  G ~ DP(\alpha, P_0),  where  P_0(f, e) = P_f(f) P_e(e)  and  P_e(e) = ( p_s (1 - p_s)^{|e|-1} ) ( 1 / n_e )^{|e|}
3. Align and reorder the foreign phrases {f_i}.
Output: aligned phrase pairs {(e_i, f_i)}.
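
A minimal sketch of this generative story with G integrated out via its Chinese-restaurant representation. The vocabularies, the parameter values p_$, p_s and alpha, and the simplified symmetric base measure (geometric length, uniform words, for both languages) are illustrative assumptions rather than the exact DeNero et al. (2008) base measure; the reordering step is only a stand-in.

import random
random.seed(0)

EN = ["cat", "dog", "house", "my", "the", "language"]    # assumed toy vocabularies
FR = ["chat", "chien", "maison", "ma", "le", "langue"]
p_dollar, p_s, alpha = 0.3, 0.5, 1.0                     # assumed parameter values

def sample_phrase(vocab):
    # per-language base: geometric length p_s, words uniform over the vocabulary,
    # i.e. P(e) = p_s (1 - p_s)^(|e|-1) * (1/|vocab|)^|e|
    length = 1
    while random.random() > p_s:
        length += 1
    return tuple(random.choices(vocab, k=length))

def P0():
    return (sample_phrase(EN), sample_phrase(FR))

counts = {}   # corpus-wide usage counts of phrase pairs (the DP is shared across sentences)

def generate_sentence_pair():
    # 1. number of phrase pairs: P(l) = p_$ (1 - p_$)^(l-1)
    l = 1
    while random.random() > p_dollar:
        l += 1
    pairs = []
    for _ in range(l):
        # 2. (e_i, f_i) ~ G with G ~ DP(alpha, P0), G integrated out (Polya urn / CRP):
        #    reuse an existing pair with prob. proportional to its count, else draw from P0
        total = sum(counts.values())
        if not counts or random.random() < alpha / (alpha + total):
            pair = P0()
        else:
            pair = random.choices(list(counts), weights=list(counts.values()))[0]
        counts[pair] = counts.get(pair, 0) + 1
        pairs.append(pair)
    random.shuffle(pairs)   # 3. stand-in for "align and reorder the foreign phrases"
    return pairs

for _ in range(3):
    print(generate_sentence_pair())
# The rich-get-richer reuse of pairs is what concentrates mass on a small phrase table.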

Bayesian Phrase Learning: [DeNero et al., 2006] vs. [DeNero et al., 2008].

Gibbs Sampling Inference [DeNero et al., 2008]
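
Collapsed Gibbs sampling works with G integrated out. When a move re-scores a candidate phrase pair against the rest of the current sample, the DP's posterior predictive (a standard property of the DP, not a formula taken from the slides) has the form

  P((e, f) | all other phrase pairs) = ( n_{(e,f)} + \alpha P_0(e, f) ) / ( n + \alpha ),

where n_{(e,f)} is how often that pair is used elsewhere in the current sample and n is the total number of phrase pairs. Roughly, each Gibbs move removes part of one sentence pair's analysis, scores the alternative local analyses with this predictive, samples one, and adds its counts back.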

[DeNero et al., 2008] Phrase Learning Results (EN-ES Europarl, evaluated on BLEU):
                        10k     100k
  Heuristic Baseline    14.6    21.8
  Learned               15.5    23.2

Outline: Parsing (PCFG Annotation); Machine Translation (Learning Phrase Tables); Information Extraction / Discourse (Unsupervised Coref Resolution).

Coreference Resolution. Example mentions: "Weir group .. whose .. headquarters .. U.S. .. corporation .. power plant .. which .. Jiangsu ..". Each mention is assigned to an entity: Weir group -> Weir Group, whose -> Weir Group, headquarters -> Weir HQ, U.S. -> United States, corporation -> Weir Group, power plant -> Weir Plant, which -> Weir Plant, Jiangsu -> Jiangsu.

Finite Mixture Model [Haghighi & Klein, 2007]. An entity distribution (e.g. P(Weir Group) = 0.2, P(Weir HQ) = 0.5, ...) generates an entity assignment for each mention (Z1 = Weir Group, Z2 = Weir Group, Z3 = Weir HQ); a per-entity mention distribution (e.g. P(W | Weir Group): "Weir Group" = 0.4, "whose" = 0.2, ...) then generates the mention strings (W1 = "Weir Group", W2 = "whose", W3 = "headquart.").
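
A minimal EM sketch for a toy version of this finite mixture (an entity distribution plus per-entity mention-string distributions). The tiny mention list, K=3, the smoothing constant, and the random initialization are assumptions, and this omits the richer mention structure of the actual model.

import numpy as np

mentions = ["weir_group", "whose", "headquarters", "weir_group", "it",
            "power_plant", "which", "power_plant", "jiangsu"]
vocab = sorted(set(mentions))
w = np.array([vocab.index(m) for m in mentions])
K, V = 3, len(vocab)

rng = np.random.default_rng(0)
entity_dist = np.full(K, 1.0 / K)                   # P(z): which entity a mention refers to
mention_dist = rng.dirichlet(np.ones(V), size=K)    # P(word | z): one row per entity

for _ in range(50):
    # E-step: posterior over entities for every mention
    post = entity_dist[None, :] * mention_dist[:, w].T      # shape (num mentions, K)
    post /= post.sum(axis=1, keepdims=True)
    # M-step: re-estimate both distributions from expected counts (lightly smoothed)
    entity_dist = post.sum(axis=0) / len(w)
    counts = np.full((K, V), 1e-3)
    for n in range(len(w)):
        counts[:, w[n]] += post[n]
    mention_dist = counts / counts.sum(axis=1, keepdims=True)

for n, m in enumerate(mentions):
    print(m, "-> entity", int(post[n].argmax()))
# Mentions with identical strings get identical posteriors, so they end up sharing an entity.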

Bayesian Finite Mixture Model [Haghighi & Klein, 2007]: put priors on the parameters; the entity distribution β has dimension K, and there are K mention distributions φ. K is how many entities there are. But how do you choose K?

Infinite Mixture Model [Haghighi & Klein, 2007]: the entity distribution β is instead drawn from a Dirichlet Process (DP) prior [Teh et al., 2006], so the number of entities does not need to be fixed in advance. [Plate diagram: β and φ, with each of L mentions generating variables Z, S, T, N, G, M, W.]
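
A minimal collapsed Gibbs sketch for the infinite (DP) version: each mention's entity is resampled with a CRP prior over entities times a smoothed likelihood of its string under that entity's other mentions. The toy mentions, alpha, and the add-lambda word model are assumptions; the real model also conditions on the additional mention variables in the plate diagram.

import random
from collections import Counter, defaultdict

random.seed(0)
mentions = ["weir_group", "whose", "weir_group", "headquarters",
            "power_plant", "which", "power_plant", "jiangsu"]
vocab = set(mentions)
alpha, lam = 1.0, 0.1       # assumed DP concentration and add-lambda smoothing

z = list(range(len(mentions)))        # start: every mention in its own entity
words = defaultdict(Counter)          # words[k] = string counts for entity k
for i, m in enumerate(mentions):
    words[z[i]][m] += 1

def word_prob(m, counter):
    return (counter[m] + lam) / (sum(counter.values()) + lam * len(vocab))

for sweep in range(20):
    for i, m in enumerate(mentions):
        words[z[i]][m] -= 1                          # remove mention i from its entity
        if sum(words[z[i]].values()) == 0:
            del words[z[i]]
        cands, weights = [], []
        for k, counter in words.items():             # P(existing k) ∝ n_k * P(m | entity k)
            cands.append(k)
            weights.append(sum(counter.values()) * word_prob(m, counter))
        new_k = max(words, default=-1) + 1            # P(new entity) ∝ alpha * P(m | empty entity)
        cands.append(new_k)
        weights.append(alpha * word_prob(m, Counter()))
        z[i] = random.choices(cands, weights=weights)[0]
        words[z[i]][m] += 1                          # add it back under the sampled entity

clusters = defaultdict(list)
for i, m in enumerate(mentions):
    clusters[z[i]].append(m)
print(dict(clusters))   # repeated heads like "weir_group" tend to share an entity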

Global Coreference Resolution [Haghighi & Klein, 2007]: entities are global, shared across documents.

HDP Model: a global entity distribution β0 is drawn from a DP; each document's entity distribution is then subsampled from the global distribution. [Plate diagram with β0, ψ, β, φ and per-mention variables Z, S, T, N, G, M, W, over N documents of L mentions each.]
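
Schematically, the hierarchical Dirichlet process construction behind this (written in the standard HDP notation of Teh et al., leaving out the extra mention variables in the plate diagram) is

  \beta_0 \sim \mathrm{GEM}(\gamma),   \beta_d \sim \mathrm{DP}(\alpha, \beta_0) for each document d,   Z_i \sim \beta_d,   W_i \sim \phi_{Z_i},

so entities (and their mention distributions φ) are shared across the whole corpus, while each document reweights, i.e. "subsamples", the global entity distribution.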

Unsupervised Coref Results [Haghighi & Klein, 2007] (MUC6: 30 train/test documents):
  Data Set    # Doc    P      R      F
  MUC6        60       80.8   52.8   63.9
  +DRYRUN     251      79.1   59.7   68.0
  +NWIRE      381      80.4   62.4   70.3
Recent supervised number: 73.4 F1 [McCallum & Wellner, 2004]

Summary: Latent variable models are flexible; require careful model design; can yield state-of-the-art results for a variety of tasks. Many interesting unsupervised problems yet to explore.

Thanks http://nlp.cs.berkeley.edu

Latent Variable Models. Modeling: capture information missing in the annotation; compensate for independence assumptions. Induction: only game in town; careful structural design.

Manual Annotation [Klein & Manning, 2003]: manually split categories (NP: subject vs. object; DT: determiners vs. demonstratives; IN: sentential vs. prepositional). Advantages: fairly compact grammar; linguistic motivations. Disadvantages: performance leveled out. Manual annotation: 86.36 F1.

Latent Variable Models: Observed, Hidden. Modeling. Induction (Unsupervised Learning).

Learning Latent Annotations. Learning [Petrov et al., 06]: adaptive merging; hierarchical training. Inference [Petrov et al., 07]: coarse-to-fine approximation; max constituent parsing.

Latent Variable Models: Observed, Hidden. Use the hidden structure for a better model, or induce it from the observed data.

Linguistic Candy [Petrov et al., 06]
Proper nouns (NNP):
  NNP-14: Oct., Nov., Sept.
  NNP-12: John, Robert, James
  NNP-15: New, San, Wall
  NNP-3:  York, Francisco, Street
Personal pronouns (PRP):
  PRP-0: It, He, I
  PRP-1: it, he, they
  PRP-2: it, them, him