Latent Variable Models in NLP


Latent Variable Models in NLP. Aria Haghighi, with Slav Petrov, John DeNero, and Dan Klein. UC Berkeley, CS Division

Latent Variable Models: Observed vs. Hidden. E.g. Word Alignment: the bitext is observed, the alignment is hidden.

Basic question for LV models. Q: How do I make latent variables behave the way I want them to? A: Very careful model design!

Recent Applications: Parsing (PCFG Annotation); Machine Translation (Learning Phrase Tables); Information Extraction / Discourse (Unsupervised Coref Resolution). Disclaimer: there are many other excellent examples of LV models in NLP; these examples are from the Berkeley group.

PCFG Annotation (e.g. distinguishing subject NP vs. object NP). Annotation refines base treebank symbols to improve the statistical fit of the grammar. Parent annotation [Johnson 98]. Head lexicalization [Collins 99]. Automatic clustering? [Matsuzaki 05, Petrov 06]

[Matsuzaki et al., 05] Learning Latent Annotations. EM algorithm: the brackets are known, the base categories are known, only the subcategories are hidden. Just like Forward-Backward for HMMs. [Figure: a parse tree with latent symbols X1-X7 over the sentence "He was right", with forward and backward passes.]
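
To make the E-step concrete, the following is a minimal sketch (in the spirit of, but not reproducing, the Matsuzaki et al. setup) of inside/outside computations on a single fixed, binarized tree with K latent subcategories per observed symbol. The toy tree for "He was right", the random parameters, and the uniform root prior are all illustrative assumptions.

import numpy as np

K = 2  # latent subcategories per observed treebank symbol (assumed)
rng = np.random.default_rng(0)

class Node:
    def __init__(self, label, children=(), word=None):
        self.label, self.children, self.word = label, children, word

# Toy binarized tree for "He was right": (S (NP He) (VP (V was) (ADJ right)))
tree = Node("S", [Node("NP", word="He"),
                  Node("VP", [Node("V", word="was"),
                              Node("ADJ", word="right")])])

# binary[(A,B,C)][x,y,z] = P(B_y C_z | A_x);
# unary[(A,w)][x] = score of emitting w from A_x (toy stand-in for P(w | A_x))
binary, unary = {}, {}

def init(node):
    if node.word is not None:
        p = rng.random(K)
        unary[(node.label, node.word)] = p / p.sum()
    else:
        l, r = node.children
        p = rng.random((K, K, K))
        binary[(node.label, l.label, r.label)] = p / p.sum(axis=(1, 2), keepdims=True)
        init(l); init(r)
init(tree)

def inside(node):
    # inside[x] = P(words under node | node carries subcategory x)
    if node.word is not None:
        node.inside = unary[(node.label, node.word)]
    else:
        l, r = node.children
        inside(l); inside(r)
        rule = binary[(node.label, l.label, r.label)]
        node.inside = np.einsum("xyz,y,z->x", rule, l.inside, r.inside)

def outside(node, score):
    # outside[x] = P(rest of the tree and words, node carries subcategory x)
    node.outside = score
    if node.word is None:
        l, r = node.children
        rule = binary[(node.label, l.label, r.label)]
        outside(l, np.einsum("xyz,x,z->y", rule, score, r.inside))
        outside(r, np.einsum("xyz,x,y->z", rule, score, l.inside))

inside(tree)
outside(tree, np.ones(K) / K)      # assumed uniform prior over root subcategories
Z = tree.inside @ tree.outside     # probability of the observed tree and words

def show_posteriors(node):
    print(node.label, np.round(node.inside * node.outside / Z, 3))  # P(subcategory | data)
    for c in node.children:
        show_posteriors(c)
show_posteriors(tree)
# An M-step would accumulate these expected counts over the whole treebank and renormalize the rules.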

Learning Latent Annotations: e.g. the treebank symbol DT is split into latent subcategories DT-1, DT-2, DT-3, DT-4.

[Petrov et al., 06] Hierarchical Split/Merge Training: repeatedly split each symbol (e.g. DT) in two and train with EM; merge back unhelpful splits.

Results (k=16):
  Model           F1
  Flat Training   87.3
  Split/Merge     89.5

Lesson: Don't give EM too much of a leash.

Linguistic Candy [Petrov et al., 06]
Proper nouns (NNP):
  NNP-14: Oct., Nov., Sept.
  NNP-12: John, Robert, James
  NNP-2:  J., E., L.
  NNP-1:  Bush, Noriega, Peters
  NNP-15: New, San, Wall
  NNP-3:  York, Francisco, Street
Personal pronouns (PRP):
  PRP-0: It, He, I
  PRP-1: it, he, they
  PRP-2: it, them, him

[Petrov et al., 07] Learning Latent Annotations: parsing results.
             Model                                 <=40 words F1   all F1
  English    Charniak & Johnson 05 (generative)        90.1         89.6
  English    Petrov et al. 07                          90.6         90.1
  German     Dubey 05                                  76.3          -
  German     Petrov et al. 07                          80.8         80.1
  Chinese    Chiang et al. 02                          80.0         76.6
  Chinese    Petrov et al. 07                          86.3         83.4
Download the parser: http://nlp.cs.berkeley.edu

Outline: Parsing (PCFG Annotation); Machine Translation (Learning Phrase Tables); Information Extraction / Discourse (Unsupervised Coref Resolution).

Phrase-Based Model: from a sentence-aligned corpus, compute directional word alignments, intersect and grow them, and read off a phrase table (translation model), e.g.:
  cat       -> chat       0.9
  the cat   -> le chat    0.8
  dog       -> chien      0.8
  house     -> maison     0.6
  my house  -> ma maison  0.9
  language  -> langue     0.9

Overview: Phrase Extraction. Example: "appelle un chat un chat" / "call a spade a spade". What's hidden: the sentence segmentation and phrasal alignment (e.g. appelle <-> call, un chat <-> a spade).

From the aligned sentence pair, all consistent sub-phrase pairs are extracted (appelle <-> call, appelle un <-> call a, appelle un chat <-> call a spade, un <-> a (x2), un chat <-> a spade (x2), un chat un chat <-> a spade a spade, chat <-> spade (x2), chat un chat <-> spade a spade, ...). Extract all phrases; there is no competition for which are more useful.
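
To make "extract all phrases" concrete, here is a minimal sketch of the standard consistency heuristic: a phrase pair is kept if none of its alignment links leave the box. The toy sentence pair, the monotone word alignment, and the length limit are assumptions.

def extract_phrases(src, tgt, alignment, max_len=4):
    # A (source span, target span) pair is "consistent" if every alignment link
    # touching either span stays inside the box formed by the two spans.
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            tgt_pos = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tgt_pos:
                continue
            j1, j2 = min(tgt_pos), max(tgt_pos)
            # reject if some link enters the target span from outside the source span
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in alignment):
                continue
            if j2 - j1 < max_len:
                pairs.append((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs

src = "appelle un chat un chat".split()
tgt = "call a spade a spade".split()
alignment = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]   # assumed word alignment (src pos, tgt pos)

for f, e in extract_phrases(src, tgt, alignment):
    print(f, "->", e)
# e.g. appelle -> call, un chat -> a spade, appelle un chat -> call a spade, ...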

Learning Phrases: instead, learn a phrase-level generative model directly from the sentence-aligned corpus to produce the phrase table (translation model). DeNero et al. (2006) argue that this underperforms the surface statistic: the latent variables are abused, and EM finds a bad solution. The DeNero et al. (2008) model outperforms the surface statistics.

Dirichlet Process (DP) Prior: a distribution over distributions, specified by a concentration parameter and a base measure. A draw G is almost surely discrete (even if the base measure isn't):
  G = \sum_{i=1}^{\infty} \pi_i \, \delta_{x_i}(\cdot)
Each atom x_i represents the parameters of a latent cluster. The model is nonparametric: the number of instantiated parameters grows with the data (roughly log N for N points).
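
A truncated stick-breaking construction makes the discreteness of G concrete. The concentration alpha, the truncation level, and the standard-normal base measure below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
alpha = 1.0    # concentration (assumed value)
T = 1000       # truncation level (assumed)

# G = sum_i pi_i * delta_{x_i}, with pi from stick-breaking (GEM(alpha)) and atoms x_i ~ P0
betas = rng.beta(1.0, alpha, size=T)                                  # stick-breaking proportions
pi = betas * np.cumprod(np.concatenate(([1.0], 1.0 - betas[:-1])))    # pi_i = beta_i * prod_{j<i}(1 - beta_j)
atoms = rng.normal(0.0, 1.0, size=T)                                  # x_i ~ P0 (assumed standard normal base)

print("weights sum to ~1:", pi.sum())
print("a few (weight, atom) pairs:", list(zip(np.round(pi[:5], 3), np.round(atoms[:5], 3))))
# Even though P0 is continuous, G places all of its mass on the discrete atoms x_i.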

Chinese Restaurant Process
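
The Chinese Restaurant Process is the sequential seating scheme induced by integrating out G: customer n+1 sits at an existing table k with probability proportional to its occupancy n_k, or opens a new table with probability proportional to the concentration alpha. A minimal sketch, with the number of customers and alpha as assumed values:

import random

def crp(n_customers, alpha, seed=0):
    random.seed(seed)
    tables = []                          # tables[k] = number of customers at table k
    for n in range(n_customers):
        weights = tables + [alpha]       # existing tables, then the option of a new table
        k = random.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(1)             # open a new table
        else:
            tables[k] += 1
    return tables

print(crp(100, alpha=1.0))   # typically a few large tables and a tail of small ones
# The number of occupied tables grows roughly like alpha * log(N).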

Bayesian Phrase Learning [DeNero et al., 2008]. Given: a sentence pair (e, f).
1. Generate the number of phrase pairs l:  P(l) = p_$ (1 - p_$)^{l-1}
2. Generate l phrase pairs {(e_i, f_i)}:  (e_i, f_i) ~ G,  G ~ DP(\alpha, P_0),  where  P_0(f, e) = P_f(f) P_e(e)  and  P_e(e) = ( p_s (1 - p_s)^{|e|-1} ) ( 1 / n_e )^{|e|}
3. Align and reorder the foreign phrases {f_i}.
Output: aligned phrase pairs {(e_i, f_i)}.
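
A minimal sketch of this generative story with G integrated out via its Chinese-restaurant representation. The vocabularies, the parameter values p_$, p_s and alpha, and the simplified symmetric base measure (geometric length, uniform words, for both languages) are illustrative assumptions rather than the exact DeNero et al. (2008) base measure; the reordering step is only a stand-in.

import random
random.seed(0)

EN = ["cat", "dog", "house", "my", "the", "language"]    # assumed toy vocabularies
FR = ["chat", "chien", "maison", "ma", "le", "langue"]
p_dollar, p_s, alpha = 0.3, 0.5, 1.0                     # assumed parameter values

def sample_phrase(vocab):
    # per-language base: geometric length p_s, words uniform over the vocabulary,
    # i.e. P(e) = p_s (1 - p_s)^(|e|-1) * (1/|vocab|)^|e|
    length = 1
    while random.random() > p_s:
        length += 1
    return tuple(random.choices(vocab, k=length))

def P0():
    return (sample_phrase(EN), sample_phrase(FR))

counts = {}   # corpus-wide usage counts of phrase pairs (the DP is shared across sentences)

def generate_sentence_pair():
    # 1. number of phrase pairs: P(l) = p_$ (1 - p_$)^(l-1)
    l = 1
    while random.random() > p_dollar:
        l += 1
    pairs = []
    for _ in range(l):
        # 2. (e_i, f_i) ~ G with G ~ DP(alpha, P0), G integrated out (Polya urn / CRP):
        #    reuse an existing pair with prob. proportional to its count, else draw from P0
        total = sum(counts.values())
        if not counts or random.random() < alpha / (alpha + total):
            pair = P0()
        else:
            pair = random.choices(list(counts), weights=list(counts.values()))[0]
        counts[pair] = counts.get(pair, 0) + 1
        pairs.append(pair)
    random.shuffle(pairs)   # 3. stand-in for "align and reorder the foreign phrases"
    return pairs

for _ in range(3):
    print(generate_sentence_pair())
# The rich-get-richer reuse of pairs is what concentrates mass on a small phrase table.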

Bayesian Phrase Learning: [DeNero et al., 2006] vs. [DeNero et al., 2008].

Gibbs Sampling Inference [DeNero et al., 2008]
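
Collapsed Gibbs sampling works with G integrated out. When a move re-scores a candidate phrase pair against the rest of the current sample, the DP's posterior predictive (a standard property of the DP, not a formula taken from the slides) has the form

  P((e, f) | all other phrase pairs) = ( n_{(e,f)} + \alpha P_0(e, f) ) / ( n + \alpha ),

where n_{(e,f)} is how often that pair is used elsewhere in the current sample and n is the total number of phrase pairs. Roughly, each Gibbs move removes part of one sentence pair's analysis, scores the alternative local analyses with this predictive, samples one, and adds its counts back.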

[DeNero et al., 2008] Phrase Learning Results (EN-ES Europarl, evaluated on BLEU):
                        10k     100k
  Heuristic Baseline    14.6    21.8
  Learned               15.5    23.2

Outline: Parsing (PCFG Annotation); Machine Translation (Learning Phrase Tables); Information Extraction / Discourse (Unsupervised Coref Resolution).

Coreference Resolution. Example mentions: "Weir group .. whose .. headquarters .. U.S. .. corporation .. power plant .. which .. Jiangsu ..". Each mention is assigned to an entity: Weir group -> Weir Group, whose -> Weir Group, headquarters -> Weir HQ, U.S. -> United States, corporation -> Weir Group, power plant -> Weir Plant, which -> Weir Plant, Jiangsu -> Jiangsu.

Finite Mixture Model [Haghighi & Klein, 2007]. An entity distribution (e.g. P(Weir Group) = 0.2, P(Weir HQ) = 0.5, ...) generates an entity assignment for each mention (Z1 = Weir Group, Z2 = Weir Group, Z3 = Weir HQ); a per-entity mention distribution (e.g. P(W | Weir Group): "Weir Group" = 0.4, "whose" = 0.2, ...) then generates the mention strings (W1 = "Weir Group", W2 = "whose", W3 = "headquart.").
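
A minimal EM sketch for a toy version of this finite mixture (an entity distribution plus per-entity mention-string distributions). The tiny mention list, K=3, the smoothing constant, and the random initialization are assumptions, and this omits the richer mention structure of the actual model.

import numpy as np

mentions = ["weir_group", "whose", "headquarters", "weir_group", "it",
            "power_plant", "which", "power_plant", "jiangsu"]
vocab = sorted(set(mentions))
w = np.array([vocab.index(m) for m in mentions])
K, V = 3, len(vocab)

rng = np.random.default_rng(0)
entity_dist = np.full(K, 1.0 / K)                   # P(z): which entity a mention refers to
mention_dist = rng.dirichlet(np.ones(V), size=K)    # P(word | z): one row per entity

for _ in range(50):
    # E-step: posterior over entities for every mention
    post = entity_dist[None, :] * mention_dist[:, w].T      # shape (num mentions, K)
    post /= post.sum(axis=1, keepdims=True)
    # M-step: re-estimate both distributions from expected counts (lightly smoothed)
    entity_dist = post.sum(axis=0) / len(w)
    counts = np.full((K, V), 1e-3)
    for n in range(len(w)):
        counts[:, w[n]] += post[n]
    mention_dist = counts / counts.sum(axis=1, keepdims=True)

for n, m in enumerate(mentions):
    print(m, "-> entity", int(post[n].argmax()))
# Mentions with identical strings get identical posteriors, so they end up sharing an entity.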

Bayesian Finite Mixture Model [Haghighi & Klein, 2007]: put priors on the parameters; the entity distribution β has dimension K, and there are K mention distributions φ. K is how many entities there are. But how do you choose K?

Infinite Mixture Model [Haghighi & Klein, 2007]: the entity distribution β is instead drawn from a Dirichlet Process (DP) prior [Teh et al., 2006], so the number of entities does not need to be fixed in advance. [Plate diagram: β and φ, with each of L mentions generating variables Z, S, T, N, G, M, W.]
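
A minimal collapsed Gibbs sketch for the infinite (DP) version: each mention's entity is resampled with a CRP prior over entities times a smoothed likelihood of its string under that entity's other mentions. The toy mentions, alpha, and the add-lambda word model are assumptions; the real model also conditions on the additional mention variables in the plate diagram.

import random
from collections import Counter, defaultdict

random.seed(0)
mentions = ["weir_group", "whose", "weir_group", "headquarters",
            "power_plant", "which", "power_plant", "jiangsu"]
vocab = set(mentions)
alpha, lam = 1.0, 0.1       # assumed DP concentration and add-lambda smoothing

z = list(range(len(mentions)))        # start: every mention in its own entity
words = defaultdict(Counter)          # words[k] = string counts for entity k
for i, m in enumerate(mentions):
    words[z[i]][m] += 1

def word_prob(m, counter):
    return (counter[m] + lam) / (sum(counter.values()) + lam * len(vocab))

for sweep in range(20):
    for i, m in enumerate(mentions):
        words[z[i]][m] -= 1                          # remove mention i from its entity
        if sum(words[z[i]].values()) == 0:
            del words[z[i]]
        cands, weights = [], []
        for k, counter in words.items():             # P(existing k) ∝ n_k * P(m | entity k)
            cands.append(k)
            weights.append(sum(counter.values()) * word_prob(m, counter))
        new_k = max(words, default=-1) + 1            # P(new entity) ∝ alpha * P(m | empty entity)
        cands.append(new_k)
        weights.append(alpha * word_prob(m, Counter()))
        z[i] = random.choices(cands, weights=weights)[0]
        words[z[i]][m] += 1                          # add it back under the sampled entity

clusters = defaultdict(list)
for i, m in enumerate(mentions):
    clusters[z[i]].append(m)
print(dict(clusters))   # repeated heads like "weir_group" tend to share an entity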

Global Coreference Resolution [Haghighi & Klein, 2007]: entities are global, shared across documents.

HDP Model: a global entity distribution β0 is drawn from a DP; each document's entity distribution is then subsampled from the global distribution. [Plate diagram with β0, ψ, β, φ and per-mention variables Z, S, T, N, G, M, W, over N documents of L mentions each.]
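
Schematically, the hierarchical Dirichlet process construction behind this (written in the standard HDP notation of Teh et al., leaving out the extra mention variables in the plate diagram) is

  \beta_0 \sim \mathrm{GEM}(\gamma),   \beta_d \sim \mathrm{DP}(\alpha, \beta_0) for each document d,   Z_i \sim \beta_d,   W_i \sim \phi_{Z_i},

so entities (and their mention distributions φ) are shared across the whole corpus, while each document reweights, i.e. "subsamples", the global entity distribution.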

Unsupervised Coref Results [Haghighi & Klein, 2007] (MUC6: 30 train/test documents):
  Data Set    # Doc    P      R      F
  MUC6        60       80.8   52.8   63.9
  +DRYRUN     251      79.1   59.7   68.0
  +NWIRE      381      80.4   62.4   70.3
Recent supervised number: 73.4 F1 [McCallum & Wellner, 2004]

Summary: Latent variable models are flexible; require careful model design; can yield state-of-the-art results for a variety of tasks. Many interesting unsupervised problems yet to explore.

Thanks http://nlp.cs.berkeley.edu

Latent Variable Models. Modeling: capture information missing in the annotation; compensate for independence assumptions. Induction: only game in town; careful structural design.

Manual Annotation [Klein & Manning, 2003]: manually split categories (NP: subject vs. object; DT: determiners vs. demonstratives; IN: sentential vs. prepositional). Advantages: fairly compact grammar; linguistic motivations. Disadvantages: performance leveled out. Manual annotation: 86.36 F1.

Latent Variable Models: Observed, Hidden. Modeling. Induction (Unsupervised Learning).

Learning Latent Annotations. Learning [Petrov et al., 06]: adaptive merging; hierarchical training. Inference [Petrov et al., 07]: coarse-to-fine approximation; max constituent parsing.

Latent Variable Models: Observed, Hidden. Use the hidden structure for a better model, or induce it from the observed data.

Linguistic Candy [Petrov et al., 06]
Proper nouns (NNP):
  NNP-14: Oct., Nov., Sept.
  NNP-12: John, Robert, James
  NNP-15: New, San, Wall
  NNP-3:  York, Francisco, Street
Personal pronouns (PRP):
  PRP-0: It, He, I
  PRP-1: it, he, they
  PRP-2: it, them, him