Incorporating Domain Knowledge into Topic Modeling via Dirichlet Forest Priors

Similar documents
Latent Dirichlet Allocation (LDA)

Statistical Debugging with Latent Topic Models

Topic Modeling Using Latent Dirichlet Allocation (LDA)

Incorporating Social Context and Domain Knowledge for Entity Recognition

Graphical Models and Kernel Methods

Pachinko Allocation: DAG-Structured Mixture Models of Topic Correlations

Generative Clustering, Topic Modeling, & Bayesian Inference

Applying LDA topic model to a corpus of Italian Supreme Court decisions

Applying hlda to Practical Topic Modeling

Text Mining for Economics and Finance Latent Dirichlet Allocation

Introduction To Machine Learning

Latent Dirichlet Allocation (LDA)

A Continuous-Time Model of Topic Co-occurrence Trends

Introduction to Bayesian inference

Understanding Comments Submitted to FCC on Net Neutrality. Kevin (Junhui) Mao, Jing Xia, Dennis (Woncheol) Jeong December 12, 2014

Topic Modelling and Latent Dirichlet Allocation

Interactive Topic Modeling

Latent Dirichlet Allocation Based Multi-Document Summarization

Document and Topic Models: plsa and LDA

Unified Modeling of User Activities on Social Networking Sites

Distributed ML for DOSNs: giving power back to users

Latent Dirichlet Allocation

Information retrieval LSI, plsi and LDA. Jian-Yun Nie

Lecture 13 : Variational Inference: Mean Field Approximation

Parallelized Variational EM for Latent Dirichlet Allocation: An Experimental Evaluation of Speed and Scalability

Lecture 19, November 19, 2012

Latent Dirichlet Allocation Introduction/Overview

AN INTRODUCTION TO TOPIC MODELS

Topic Models. Charles Elkan November 20, 2008

RECSM Summer School: Facebook + Topic Models. github.com/pablobarbera/big-data-upf

Topic Models and Applications to Short Documents

PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Dirichlet Allocation

The Early Church Acts 2:42-47; 4:32-37

Efficient Tree-Based Topic Modeling

Topic Models. Brandon Malone. February 20, Latent Dirichlet Allocation Success Stories Wrap-up

Evaluation Methods for Topic Models

Lecture 22 Exploratory Text Analysis & Topic Models

IPSJ SIG Technical Report Vol.2014-MPS-100 No /9/25 1,a) 1 1 SNS / / / / / / Time Series Topic Model Considering Dependence to Multiple Topics S

Topic Models. Advanced Machine Learning for NLP Jordan Boyd-Graber OVERVIEW. Advanced Machine Learning for NLP Boyd-Graber Topic Models 1 of 1

Distributed Gibbs Sampling of Latent Topic Models: The Gritty Details THIS IS AN EARLY DRAFT. YOUR FEEDBACKS ARE HIGHLY APPRECIATED.

Time-Sensitive Dirichlet Process Mixture Models

P, NP, NP-Complete. Ruth Anderson

Gibbs Sampling. Héctor Corrada Bravo. University of Maryland, College Park, USA CMSC 644:

Language Information Processing, Advanced. Topic Models

Non-Parametric Bayes

Hybrid Models for Text and Graphs. 10/23/2012 Analysis of Social Media

Applying Latent Dirichlet Allocation to Group Discovery in Large Graphs

Latent Dirichlet Allocation and Singular Value Decomposition based Multi-Document Summarization

Kernel Density Topic Models: Visual Topics Without Visual Words

Probablistic Graphical Models, Spring 2007 Homework 4 Due at the beginning of class on 11/26/07

Undirected Graphical Models

Statistical Models. David M. Blei Columbia University. October 14, 2014

CS Lecture 18. Topic Models and LDA

Paul Writes To Philemon Philemon 1:1-25

Your Zodiac Signs Say A Lot About Your Fears And This Is Damn True!

Web Search and Text Mining. Lecture 16: Topics and Communities

Bayesian Nonparametrics for Speech and Signal Processing

Machine Learning

Sparse Stochastic Inference for Latent Dirichlet Allocation

Non-parametric Clustering with Dirichlet Processes

A Unified Posterior Regularized Topic Model with Maximum Margin for Learning-to-Rank

Chris Bishop s PRML Ch. 8: Graphical Models

Nonparametric Bayes Pachinko Allocation

CS145: INTRODUCTION TO DATA MINING

Mixtures of Multinomials

GLAD: Group Anomaly Detection in Social Media Analysis

Gaussian Mixture Model

Topic Models. Material adapted from David Mimno University of Maryland INTRODUCTION. Material adapted from David Mimno UMD Topic Models 1 / 51

Fast Collapsed Gibbs Sampling For Latent Dirichlet Allocation

Probabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov

Clustering using Mixture Models

Data Mining Techniques

Modeling User Rating Profiles For Collaborative Filtering

Collaborative Topic Modeling for Recommending Scientific Articles

Hierarchical Bayesian Nonparametrics

Introduction to Probabilistic Machine Learning

Natural Language Processing

GOLDEN TEXT-"Doth not wisdom cry? and understanding put forth her voice?" (Proverbs 8:1).

Active and Semi-supervised Kernel Classification

Fast Inference and Learning for Modeling Documents with a Deep Boltzmann Machine

Semi-Supervised Learning

UNIVERSITY OF CALIFORNIA, IRVINE. Networks of Mixture Blocks for Non Parametric Bayesian Models with Applications DISSERTATION

STA 4273H: Statistical Machine Learning

Exploiting Geographic Dependencies for Real Estate Appraisal

Improving Topic Models with Latent Feature Word Representations

28 : Approximate Inference - Distributed MCMC

University of New Mexico Department of Computer Science. Final Examination. CS 362 Data Structures and Algorithms Spring, 2007

arxiv: v1 [stat.ml] 8 Jan 2012

Jesus Heals a Blind Man

Study Notes on the Latent Dirichlet Allocation

There Is Therefore Now No Condemnation Romans 8:1-12

Design of Text Mining Experiments. Matt Taddy, University of Chicago Booth School of Business faculty.chicagobooth.edu/matt.

Mixed-membership Models (and an introduction to variational inference)

arxiv: v6 [stat.ml] 11 Apr 2017

Unsupervised machine learning

Sampling from Bayes Nets

Latent Dirichlet Alloca/on

Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process

Text Mining: Basic Models and Applications

Transcription:

Incorporating Domain Knowledge into Topic Modeling via Dirichlet Forest Priors David Andrzejewski, Xiaojin Zhu, Mark Craven University of Wisconsin Madison ICML 2009 Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 1 / 21

New Year s Wishes Goldberg et al 2009 89,574 New Year s wishes (NYC Times Square website) Example wishes: Peace on earth own a brewery I hope I get into Univ. of Penn graduate school. The safe return of my friends in Iraq find a cure for cancer To lose weight and get a boyfriend I Hope Barack Obama Wins the Presidency To win the lottery! Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 2 / 21

New Year s Wishes Goldberg et al 2009 89,574 New Year s wishes (NYC Times Square website) Example wishes: Peace on earth own a brewery I hope I get into Univ. of Penn graduate school. The safe return of my friends in Iraq find a cure for cancer To lose weight and get a boyfriend I Hope Barack Obama Wins the Presidency To win the lottery! Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 2 / 21

New Year s Wishes Goldberg et al 2009 89,574 New Year s wishes (NYC Times Square website) Example wishes: Peace on earth own a brewery I hope I get into Univ. of Penn graduate school. The safe return of my friends in Iraq find a cure for cancer To lose weight and get a boyfriend I Hope Barack Obama Wins the Presidency To win the lottery! Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 2 / 21

New Year s Wishes Goldberg et al 2009 89,574 New Year s wishes (NYC Times Square website) Example wishes: Peace on earth own a brewery I hope I get into Univ. of Penn graduate school. The safe return of my friends in Iraq find a cure for cancer To lose weight and get a boyfriend I Hope Barack Obama Wins the Presidency To win the lottery! Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 2 / 21

New Year s Wishes Goldberg et al 2009 89,574 New Year s wishes (NYC Times Square website) Example wishes: Peace on earth own a brewery I hope I get into Univ. of Penn graduate school. The safe return of my friends in Iraq find a cure for cancer To lose weight and get a boyfriend I Hope Barack Obama Wins the Presidency To win the lottery! Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 2 / 21

New Year s Wishes Goldberg et al 2009 89,574 New Year s wishes (NYC Times Square website) Example wishes: Peace on earth own a brewery I hope I get into Univ. of Penn graduate school. The safe return of my friends in Iraq find a cure for cancer To lose weight and get a boyfriend I Hope Barack Obama Wins the Presidency To win the lottery! Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 2 / 21

New Year s Wishes Goldberg et al 2009 89,574 New Year s wishes (NYC Times Square website) Example wishes: Peace on earth own a brewery I hope I get into Univ. of Penn graduate school. The safe return of my friends in Iraq find a cure for cancer To lose weight and get a boyfriend I Hope Barack Obama Wins the Presidency To win the lottery! Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 2 / 21

New Year s Wishes Goldberg et al 2009 89,574 New Year s wishes (NYC Times Square website) Example wishes: Peace on earth own a brewery I hope I get into Univ. of Penn graduate school. The safe return of my friends in Iraq find a cure for cancer To lose weight and get a boyfriend I Hope Barack Obama Wins the Presidency To win the lottery! Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 2 / 21

Topic Modeling of Wishes Topic 13 go school cancer into well free cure college... graduate... law... surgery recovery... Use topic modeling to understand common wish themes Topic 13 mixes college and illness wish topics Want to split [go school into college] and [cancer free cure well] Resulting topics separate these words, Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 3 / 21

Topic Modeling of Wishes Topic 13 go school cancer into well free cure college... graduate... law... surgery recovery... Use topic modeling to understand common wish themes Topic 13 mixes college and illness wish topics Want to split [go school into college] and [cancer free cure well] Resulting topics separate these words, Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 3 / 21

Topic Modeling of Wishes Topic 13 go school cancer into well free cure college... graduate... law... surgery recovery... Use topic modeling to understand common wish themes Topic 13 mixes college and illness wish topics Want to split [go school into college] and [cancer free cure well] Resulting topics separate these words, Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 3 / 21

Topic Modeling of Wishes Topic 13 go school cancer into well free cure college... graduate... law... surgery recovery... Topic 13(a) Topic 13(b) job go school great into good college... business graduate finish grades away law accepted... mom husband cancer hope free son well... full recovery surgery pray heaven pain aids... Use topic modeling to understand common wish themes Topic 13 mixes college and illness wish topics Want to split [go school into college] and [cancer free cure well] Resulting topics separate these words, as well as related words Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 3 / 21

Topic Modeling of Wishes Topic 13 go school cancer into well free cure college... graduate... law... surgery recovery... Topic 13(a) Topic 13(b) job go school great into good college... business graduate finish grades away law accepted... mom husband cancer hope free son well... full recovery surgery pray heaven pain aids... Use topic modeling to understand common wish themes Topic 13 mixes college and illness wish topics Want to split [go school into college] and [cancer free cure well] Resulting topics separate these words, as well as related words Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 3 / 21

Topic Modeling with Domain Knowledge Why domain knowledge? Topics may not correspond to meaningful concepts Topics may not align well with user modeling goals Possible sources of domain knowledge: Human guidance (separate school from cure ) Structured sources (encode Gene Ontology term transcription factor activity ) Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 4 / 21

Topic Modeling with Domain Knowledge Why domain knowledge? Topics may not correspond to meaningful concepts Topics may not align well with user modeling goals Possible sources of domain knowledge: Human guidance (separate school from cure ) Structured sources (encode Gene Ontology term transcription factor activity ) Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 4 / 21

Topic Modeling with Domain Knowledge Why domain knowledge? Topics may not correspond to meaningful concepts Topics may not align well with user modeling goals Possible sources of domain knowledge: Human guidance (separate school from cure ) Structured sources (encode Gene Ontology term transcription factor activity ) Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 4 / 21

Topic Modeling with Domain Knowledge Why domain knowledge? Topics may not correspond to meaningful concepts Topics may not align well with user modeling goals Possible sources of domain knowledge: Human guidance (separate school from cure ) Structured sources (encode Gene Ontology term transcription factor activity ) Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 4 / 21

Topic Modeling with Domain Knowledge Why domain knowledge? Topics may not correspond to meaningful concepts Topics may not align well with user modeling goals Possible sources of domain knowledge: Human guidance (separate school from cure ) Structured sources (encode Gene Ontology term transcription factor activity ) Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 4 / 21

Topic Modeling with Domain Knowledge Why domain knowledge? Topics may not correspond to meaningful concepts Topics may not align well with user modeling goals Possible sources of domain knowledge: Human guidance (separate school from cure ) Structured sources (encode Gene Ontology term transcription factor activity ) Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 4 / 21

Word Preferences within Topics Inspired by constrained clustering (Basu, Davidson, & Wagstaff 2008) Need a suitable language for expressing our preferences Pairwise primitives higher-level operations Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 5 / 21

Word Preferences within Topics Inspired by constrained clustering (Basu, Davidson, & Wagstaff 2008) Need a suitable language for expressing our preferences Pairwise primitives higher-level operations Operation Must-Link (school,college) Meaning topics t, P(school t) P(college t) Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 5 / 21

Word Preferences within Topics Inspired by constrained clustering (Basu, Davidson, & Wagstaff 2008) Need a suitable language for expressing our preferences Pairwise primitives higher-level operations Operation Must-Link (school,college) Cannot-Link (school,cure) Meaning topics t, P(school t) P(college t) no topic t has P(school t) and P(cure t) both high Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 5 / 21

Word Preferences within Topics Inspired by constrained clustering (Basu, Davidson, & Wagstaff 2008) Operation Must-Link (school,college) Cannot-Link (school,cure) Meaning topics t, P(school t) P(college t) no topic t has P(school t) and P(cure t) both high split [go school into college] vs [cancer free cure well] Must-Link among words for each concept Cannot-Link between words from different concepts Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 5 / 21

Word Preferences within Topics Inspired by constrained clustering (Basu, Davidson, & Wagstaff 2008) Operation Must-Link (school,college) Cannot-Link (school,cure) Meaning topics t, P(school t) P(college t) no topic t has P(school t) and P(cure t) both high split merge [go school into college] vs [cancer free cure well] Must-Link among words for each concept Cannot-Link between words from different concepts [love marry together boyfriend] in one topic [married boyfriend engaged wedding] in another Must-Link among concept words Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 5 / 21

Word Preferences within Topics Inspired by constrained clustering (Basu, Davidson, & Wagstaff 2008) Operation Must-Link (school,college) Cannot-Link (school,cure) Meaning topics t, P(school t) P(college t) no topic t has P(school t) and P(cure t) both high split merge isolate [go school into college] vs [cancer free cure well] Must-Link among words for each concept Cannot-Link between words from different concepts [love marry together boyfriend] in one topic [married boyfriend engaged wedding] in another Must-Link among concept words [the year in 2008] in many wish topics Must-Link among words to be isolated Cannot-Link vs other Top N words for each topic Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 5 / 21

Dirichlet Prior ( dice factory ) P(φ β) for K -dimensional multinomial parameter φ K -dimensional hyperparameter β ( pseudocounts ) C C C A β = [1, 1, 1] B A β = [20, 5, 5] B A β = [50, 50, 50] B Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 6 / 21

Latent Dirichlet Allocation (LDA) Blei, Ng, and Jordan 2003 β φ T w z N d θ D α For each topic t φ t Dirichlet(β) For each doc d θ d Dirichlet(α) For each word w z Multinomial(θ d ) w Multinomial(φ z ) Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 7 / 21

Latent Dirichlet Allocation (LDA) Blei, Ng, and Jordan 2003 β φ T w z N d θ D α For each topic t φ t Dirichlet(β) For each doc d θ d Dirichlet(α) For each word w z Multinomial(θ d ) w Multinomial(φ z ) Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 7 / 21

Latent Dirichlet Allocation (LDA) Blei, Ng, and Jordan 2003 β φ T w z N d θ D α For each topic t φ t Dirichlet(β) For each doc d θ d Dirichlet(α) For each word w z Multinomial(θ d ) w Multinomial(φ z ) Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 7 / 21

Latent Dirichlet Allocation (LDA) Blei, Ng, and Jordan 2003 β φ T w z N d θ D α For each topic t φ t Dirichlet(β) For each doc d θ d Dirichlet(α) For each word w z Multinomial(θ d ) w Multinomial(φ z ) Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 7 / 21

Latent Dirichlet Allocation (LDA) Blei, Ng, and Jordan 2003 β φ T w z N d θ D α For each topic t φ t Dirichlet(β) For each doc d θ d Dirichlet(α) For each word w z Multinomial(θ d ) w Multinomial(φ z ) Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 7 / 21

Latent Dirichlet Allocation (LDA) Blei, Ng, and Jordan 2003 β φ T w z N d θ D α For each topic t φ t Dirichlet(β) For each doc d θ d Dirichlet(α) For each word w z Multinomial(θ d ) w Multinomial(φ z ) Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 7 / 21

Latent Dirichlet Allocation (LDA) Blei, Ng, and Jordan 2003 β φ T w z N d θ D α For each topic t φ t Dirichlet(β) For each doc d θ d Dirichlet(α) For each word w z Multinomial(θ d ) w Multinomial(φ z ) Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 7 / 21

Latent Dirichlet Allocation (LDA) Blei, Ng, and Jordan 2003 β φ T w z N d θ D α For each topic t φ t Dirichlet(β) For each doc d θ d Dirichlet(α) For each word w z Multinomial(θ d ) w Multinomial(φ z ) Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 7 / 21

LDA with Dirichlet Forest Prior This work β φ T w z N d θ D α η For each topic t φ t Dirichlet(β) φ t DirichletForest(β, η) For each doc d θ d Dirichlet(α) For each word w z Multinomial(θ d ) w Multinomial(φ z ) Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 7 / 21

Related work: Correlated Topic Model (CTM) Blei and Lafferty 2006 D N d µ T β φ w z θ Σ For each topic t φ t Dirichlet(β) For each doc d θ d Dirichlet(α) θ d LogisticNormal(µ, Σ) For each word w z Multinomial(θ d ) w Multinomial(φ z ) Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 7 / 21

Related work: Pachinko Allocation Model (PAM) Li and McCallum 2006 D N T d β φ w z θ α For each topic t φ t Dirichlet(β) For each doc d θ d Dirichlet(α) Pachinko(θ d ) Dirichlet-DAG(α) For each word w z Multinomial(θ d ) z Pachinko(θ d ) w Multinomial(φ z ) Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 7 / 21

Must-Link (college,school) t, we want P(college t) P(school t) Must-Link is transitive Cannot be encoded by a single Dirichlet Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 8 / 21

Must-Link (college,school) t, we want P(college t) P(school t) Must-Link is transitive Cannot be encoded by a single Dirichlet Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 8 / 21

Must-Link (college,school) t, we want P(college t) P(school t) Must-Link is transitive Cannot be encoded by a single Dirichlet lottery lottery school Goal college school β = [5, 5, 0.1] college Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 8 / 21

Must-Link (college,school) t, we want P(college t) P(school t) Must-Link is transitive Cannot be encoded by a single Dirichlet lottery lottery school Goal college school β = [50, 50, 1] college Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 8 / 21

Must-Link (college,school) t, we want P(college t) P(school t) Must-Link is transitive Cannot be encoded by a single Dirichlet lottery lottery school Goal college school β = [500, 500, 100] college Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 8 / 21

Dirichlet Tree ( dice factory 2.0 ) Dennis III 1991, Minka 1999 Control variance of subsets of variables Sample Dirichlet(γ) at parent, distribute mass to children Mass reaching leaves are final multinomial parameters φ (s) = 0 for all internal node s standard Dirichlet (for our trees, true when η = 1) Conjugate to multinomial, can integrate out ( collapse ) φ {γ ηβ 2β ηβ β A B (β = 1, η = 50) C φ= Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 9 / 21

Dirichlet Tree ( dice factory 2.0 ) Dennis III 1991, Minka 1999 Control variance of subsets of variables Sample Dirichlet(γ) at parent, distribute mass to children Mass reaching leaves are final multinomial parameters φ (s) = 0 for all internal node s standard Dirichlet (for our trees, true when η = 1) Conjugate to multinomial, can integrate out ( collapse ) φ {γ ηβ 2β ηβ β 0.91 0.09 A B (β = 1, η = 50) C φ= 0.09 Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 9 / 21

Dirichlet Tree ( dice factory 2.0 ) Dennis III 1991, Minka 1999 Control variance of subsets of variables Sample Dirichlet(γ) at parent, distribute mass to children Mass reaching leaves are final multinomial parameters φ (s) = 0 for all internal node s standard Dirichlet (for our trees, true when η = 1) Conjugate to multinomial, can integrate out ( collapse ) φ {γ ηβ 2β ηβ β 0.91 0.58 0.09 0.42 A B (β = 1, η = 50) C φ= 0.53 0.38 0.09 Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 9 / 21

Dirichlet Tree ( dice factory 2.0 ) Dennis III 1991, Minka 1999 Control variance of subsets of variables Sample Dirichlet(γ) at parent, distribute mass to children Mass reaching leaves are final multinomial parameters φ (s) = 0 for all internal node s standard Dirichlet (for our trees, true when η = 1) Conjugate to multinomial, can integrate out ( collapse ) φ ( L k ) φ (k)γ(k) 1 I s p(φ γ) = ( C(s) ) Γ k γ (k) C(s) k Γ ( γ (k)) L(s) k φ (k) (s) Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 9 / 21

Dirichlet Tree ( dice factory 2.0 ) Dennis III 1991, Minka 1999 Control variance of subsets of variables Sample Dirichlet(γ) at parent, distribute mass to children Mass reaching leaves are final multinomial parameters φ (s) = 0 for all internal node s standard Dirichlet (for our trees, true when η = 1) Conjugate to multinomial, can integrate out ( collapse ) φ ( L k ) φ (k)γ(k) 1 I s p(φ γ) = ( C(s) ) Γ k γ (k) C(s) k Γ ( γ (k)) L(s) k φ (k) (s) Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 9 / 21

Dirichlet Tree ( dice factory 2.0 ) Dennis III 1991, Minka 1999 Control variance of subsets of variables Sample Dirichlet(γ) at parent, distribute mass to children Mass reaching leaves are final multinomial parameters φ (s) = 0 for all internal node s standard Dirichlet (for our trees, true when η = 1) Conjugate to multinomial, can integrate out ( collapse ) φ I Γ s p(w γ) = ( C(s) ) Γ k γ (k) C(s) Γ ( γ (k) + n (k)) ( γ (k) + n (k))) Γ(γ (k) ) ( C(s) k k Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 9 / 21

Must-Link (school,college) via Dirichlet Tree Place (school,college) beneath internal node Large edge weights beneath this node (large η) 2β β ηβ ηβ college school lottery (β = 1, η = 50) Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 10 / 21

Must-Link (school,college) via Dirichlet Tree Place (school,college) beneath internal node Large edge weights beneath this node (large η) lottery 2β β ηβ ηβ college school lottery school (β = 1, η = 50) college Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 10 / 21

Cannot-Link (school,cancer) Do not want words to co-occur as high-probability for any topic No topic-word multinomial φ t = P(w t) should have: High probability P(school t) High probability P(cancer t) Cannot-Link is non-transitive Cannot be encoded by single Dirichlet/DirichletTree Will require mixture of Dirichlet Trees (Dirichlet Forest) Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 11 / 21

Cannot-Link (school,cancer) Do not want words to co-occur as high-probability for any topic No topic-word multinomial φ t = P(w t) should have: High probability P(school t) High probability P(cancer t) Cannot-Link is non-transitive Cannot be encoded by single Dirichlet/DirichletTree Will require mixture of Dirichlet Trees (Dirichlet Forest) Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 11 / 21

Cannot-Link (school,cancer) Do not want words to co-occur as high-probability for any topic No topic-word multinomial φ t = P(w t) should have: High probability P(school t) High probability P(cancer t) Cannot-Link is non-transitive Cannot be encoded by single Dirichlet/DirichletTree Will require mixture of Dirichlet Trees (Dirichlet Forest) cure cancer school Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 11 / 21

Sampling a Tree from the Forest AB C Vocabulary [A, B, C, D, E, F, G] Must-Links (A, B) Cannot-Links (A, D), (C, D), (E, F) Cannot-Link-graph X X D E F X G Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 12 / 21

Sampling a Tree from the Forest Connected components ABCD EF G A B C D E F G AB X D C X E X F G Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 12 / 21

Sampling a Tree from the Forest Subgraph complements ABCD EF G A B C D E F G AB D E G C F Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 12 / 21

Sampling a Tree from the Forest Maximal cliques ABCD EF G A B C D E F G AB D E G C F Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 12 / 21

Sampling a Tree from the Forest Sample q (1) for first connected component q (1) =? η η AB C D AB C D A B C D E F G AB D E G C F Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 12 / 21

Sampling a Tree from the Forest q (1) = 1 (choose ABC) q (1) =1 η η AB C D AB C D A B C D E F G AB D E G C F Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 12 / 21

Sampling a Tree from the Forest Insert chosen Cannot-Link subtree q (1) =1 η A B C D E F G AB D E G C F Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 12 / 21

Sampling a Tree from the Forest Put (A, B) under Must-Link subtree q (1) =1 η η η A B C D E F G AB D E G C F Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 12 / 21

Sampling a Tree from the Forest Sample q (2) for second connected component q (1) =1 η η η q (2) =? η η A B C D E F G AB D E G C F Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 12 / 21

Sampling a Tree from the Forest q (2) = 2 (choose F) q (1) =1 η η η q (2) =2 η η A B C D E F G AB D E G C F Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 12 / 21

Sampling a Tree from the Forest Insert chosen Cannot-Link subtree q (1) =1 η η q (2) =2 η η A B C D E F G AB D E G C F Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 12 / 21

LDA with Dirichlet Forest Prior For each topic t = 1... T For each Cannot-Link-graph connected component r = 1... R Sample q (r) t clique sizes φ t DirichletTree(q t, β, η) For each doc d = 1... D θ d Dirichlet(α) For each word w z Multinomial(θ d ) w Multinomial(φ z ) η q (1) =1 η η A B C D E F G AB C D q (2) =2 η E F G Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 13 / 21

Collapsed Gibbs Sampling of (z, q) Complete Gibbs sample: z 1... z N, q (1) 1... q (R) 1,..., q (1) T Sample z i for each word position i in corpus Iv ( i) p(z i = v z i, q 1:T, w) (n (d) i,v + α) Sample q (r) j p(q (r) j for each topic j and component r = q z, q j, q ( r) j, w) M rq I j,r=q Γ β k Γ k s ( Cj (s) k ( Cj (s) k (γ (k) j s ) γ (k) j + n (k) j ) (Cv (s i)) γ v Cv (s) k ) C j (s) k... q(r) T (Cv (s i)) + n i,v ) ( γ (k) v + n (k) i,v Γ(γ (k) j + n (k) Γ(γ (k) j ) j ) Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 14 / 21

Synthetic Data - Must-Link (B,C) Prior knowledge: B and C should be in the same topic Corpus: ABAB, CDCD, EEEE, ABAB, CDCD, EEEE Standard LDA topics [φ 1, φ 2 ] do not put (B, C) together 1 [φ 1 = AB, φ 2 = CDE] 2 [φ 1 = ABE, φ 2 = CD] 3 [φ 1 = ABCD, φ 2 = E] As η increases, Must-Link (B,C) [φ 1 = ABCD, φ 2 = E] η=1 η=10 η=50 1 2 1 2 1 2 3 3 3 Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 15 / 21

Synthetic Data - isolate(b) Prior knowledge: B should be isolated from [A,C] Corpus: ABC, ABC, ABC, ABC Standard LDA topics [φ 1, φ 2 ] do not isolate B 1 [φ 1 = AC, φ 2 = B] 2 [φ 1 = A φ 2 = BC] 3 [φ 1 = AB, φ 2 = C] As η increases, Cannot-Link (A,B)+Cannot-Link (B,C) [φ 1 = AC, φ 2 = B] η=1 η=500 η=1000 1 3 1 3 1 3 2 2 2 Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 16 / 21

Original Wish Topics Topic Top words sorted by φ = p(word topic) 0 love i you me and will forever that with hope 1 and health for happiness family good my friends 2 year new happy a this have and everyone years 3 that is it you we be t are as not s will can 4 my to get job a for school husband s that into 5 to more of be and no money stop live people 6 to our the home for of from end safe all come 7 to my be i find want with love life meet man 8 a and healthy my for happy to be have baby 9 a 2008 in for better be to great job president 10 i wish that would for could will my lose can 11 peace and for love all on world earth happiness 12 may god in all your the you s of bless 2008 13 the in to of world best win 2008 go lottery 14 me a com this please at you call 4 if 2 www Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 17 / 21

Original Wish Topics Topic Top words sorted by φ = p(word topic) 0 love i you me and will forever that with hope 1 and health for happiness family good my friends 2 year new happy a this have and everyone years 3 that is it you we be t are as not s will can 4 my to get job a for school husband s that into 5 to more of be and no money stop live people 6 to our the home for of from end safe all come 7 to my be i find want with love life meet man 8 a and healthy my for happy to be have baby 9 a 2008 in for better be to great job president 10 i wish that would for could will my lose can 11 peace and for love all on world earth happiness 12 may god in all your the you s of bless 2008 13 the in to of world best win 2008 go lottery 14 me a com this please at you call 4 if 2 www Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 17 / 21

isolate([to and for]...) 50 stopwords vs Top 50 in existing topics Topic Top words sorted by φ = p(word topic) 0 love forever marry happy together mom back 1 health happiness good family friends prosperity 2 life best live happy long great time ever wonderful 3 out not up do as so what work don was like 4 go school cancer into well free cure college 5 no people stop less day every each take children 6 home safe end troops iraq bring war husband house 7 love peace true happiness hope joy everyone dreams 8 happy healthy family baby safe prosperous everyone 9 better job hope president paul great ron than person 10 make money lose weight meet finally by lots hope married Isolate and to for a the year in new all my 2008 12 god bless jesus loved know everyone love who loves 13 peace world earth win lottery around save 14 com call if 4 2 www u visit 1 3 email yahoo Isolate i to wish my for and a be that the in Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 18 / 21

isolate([to and for]...) 50 stopwords vs Top 50 in existing topics Topic Top words sorted by φ = p(word topic) 0 love forever marry happy together mom back 1 health happiness good family friends prosperity 2 life best live happy long great time ever wonderful 3 out not up do as so what work don was like MIXED go school cancer into well free cure college 5 no people stop less day every each take children 6 home safe end troops iraq bring war husband house 7 love peace true happiness hope joy everyone dreams 8 happy healthy family baby safe prosperous everyone 9 better job hope president paul great ron than person 10 make money lose weight meet finally by lots hope married Isolate and to for a the year in new all my 2008 12 god bless jesus loved know everyone love who loves 13 peace world earth win lottery around save 14 com call if 4 2 www u visit 1 3 email yahoo Isolate i to wish my for and a be that the in Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 18 / 21

split([cancer free cure well],[go school into college]) 0 love forever happy together marry fall 1 health happiness family good friends 2 life happy best live love long time 3 as not do so what like much don was 4 out make money house up work grow able 5 people no stop less day every each take 6 home safe end troops iraq bring war husband 7 love peace happiness true everyone joy 8 happy healthy family baby safe prosperous 9 better president hope paul ron than person 10 lose meet man hope boyfriend weight finally Isolate and to for a the year in new all my 2008 12 god bless jesus loved everyone know loves 13 peace world earth win lottery around save 14 com call if 4 www 2 u visit 1 email yahoo 3 Isolate i to wish my for and a be that the in me get Split job go school great into good college Split mom husband cancer hope free son well Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 19 / 21

split([cancer free cure well],[go school into college]) LOVE love forever happy together marry fall 1 health happiness family good friends 2 life happy best live love long time 3 as not do so what like much don was 4 out make money house up work grow able 5 people no stop less day every each take 6 home safe end troops iraq bring war husband 7 love peace happiness true everyone joy 8 happy healthy family baby safe prosperous 9 better president hope paul ron than person LOVE lose meet man hope boyfriend weight finally Isolate and to for a the year in new all my 2008 12 god bless jesus loved everyone know loves 13 peace world earth win lottery around save 14 com call if 4 www 2 u visit 1 email yahoo 3 Isolate i to wish my for and a be that the in me get Split job go school great into good college Split mom husband cancer hope free son well Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 19 / 21

merge([love... marry...],[meet... married...]) (10 words total) Topic Top words sorted by φ = p(word topic) Merge love lose weight together forever marry meet success health happiness family good friends prosperity life life happy best live time long wishes ever years - as do not what someone so like don much he money out make money up house work able pay own lots people no people stop less day every each other another iraq home safe end troops iraq bring war return joy love true peace happiness dreams joy everyone family happy healthy family baby safe prosperous vote better hope president paul ron than person bush Isolate and to for a the year in new all my god god bless jesus everyone loved know heart christ peace peace world earth win lottery around save spam com call if u 4 www 2 3 visit 1 Isolate i to wish my for and a be that the Split job go great school into good college hope move Split mom hope cancer free husband son well dad cure Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 20 / 21

Conclusions/Acknowledgments Conclusions DF prior expresses pairwise preferences among words Can efficiently sample from DF-LDA posterior Topics obey preferences, capture structure Future work Hierarchical domain knowledge Quantify benefits on tasks Other application domains Code http://www.cs.wisc.edu/~andrzeje/research/df_lda.html Funding Wisconsin Alumni Research Foundation (WARF) NIH/NLM grants T15 LM07359 and R01 LM07050 ICML student travel scholarship Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 21 / 21

Maximal cliques Maximal cliques of complement graph independent sets Worst-case: 3 n 3 (Moon & Moser 1965) We are only concerned with connected graphs, but still O(3 n 3 ) (Griggs et al 1988) Find cliques with Bron-Kerbosch (branch-and-bound) Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 1 / 8

Maximal cliques Maximal cliques of complement graph independent sets Worst-case: 3 n 3 (Moon & Moser 1965) We are only concerned with connected graphs, but still O(3 n 3 ) (Griggs et al 1988) Find cliques with Bron-Kerbosch (branch-and-bound) Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 1 / 8

Maximal cliques Maximal cliques of complement graph independent sets Worst-case: 3 n 3 (Moon & Moser 1965) We are only concerned with connected graphs, but still O(3 n 3 ) (Griggs et al 1988) Find cliques with Bron-Kerbosch (branch-and-bound) Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 1 / 8

log-sum-exp for sampling q Sampling q involves Γ( ) evaluations logsumexp trick (x m is largest value) Evaluate and normalize in log-domain before sampling log( e x ) = log(e xm ( e x xm )) = x m + log( e x xm ) Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 2 / 8

Synthetic 2 - Cannot-Link Corpus: ABCCABCC, ABDDABDD (x2) Standard LDA topics (T = 3) Posterior 5-way split, permutations of AB C D Cannot-Link (A,B) AB C D 1 SynData2, eta=1 1 SynData2, eta=500 1 SynData2, eta=1500 0.5 1 2 3 0.5 1 2 3 0.5 1 2 3 0 0 0 0.5 4 5 1 1 0.5 0 0.5 1 0.5 4 5 1 1 0.5 0 0.5 1 0.5 4 5 1 1 0.5 0 0.5 1 Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 3 / 8

Synthetic 4 - split Corpus: ABCDEEEE, ABCDFFFF (x3) Standard LDA topics (T = 3) Posterior concentrated at ABCD E F (not shown) Standard LDA topics (T = 4) Posterior dispersed (shown) T = 4 split(ab,cd) AB CD E F 1 0.5 SynData4, eta=1 2 3 1 1 0.5 SynData4, eta=100 2 3 1 1 0.5 SynData4, eta=500 2 3 1 0 4 5 0 4 5 0 4 5 0.5 7 6 1 1 0.5 0 0.5 1 0.5 7 6 1 1 0.5 0 0.5 1 0.5 7 6 1 1 0.5 0 0.5 1 Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 4 / 8

Knowledge-based Topic Modeling: Yeast Corpus 18,193 MEDLINE abstracts Queries on yeast genes Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 5 / 8

Domain Knowledge from the Gene Ontology (GO) Processes: transcription, translation, replication Phases: initiation, elongation, termination split the process concepts split the phase concepts Idea: want meaningful process+phase composite topics Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 6 / 8

DF-LDA (right) more aligned with target concepts 1 2 3 4 5 6 7 8 o 1 2 3 4 5 6 7 8 9 10 transcription 1 transcriptional 2 template 1 translation translational trna 1 replication 2 cycle division 3 initiation start assembly 7 elongation 1 termination disassembly release 2 stop Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 7 / 8

Synthetic Datasets Goal: understand DF prior on very simple data Well-mixed samples (label-switching, stable proportions) Label-switching heuristic φ alignment Observe changes as strength parameter η varies Visualize φ samples with PCA Andrzejewski (Wisconsin) Dirichlet Forest Priors ICML 2009 8 / 8