Tree-based Label Dependency Topic Models

Viet-An Nguyen¹, Jordan Boyd-Graber¹,²,⁴, Jonathan Chang⁵, Philip Resnik¹,³,⁴
¹Computer Science, ²iSchool, ³Linguistics, ⁴UMIACS, University of Maryland, College Park, MD
⁵Facebook, Menlo Park, CA
vietan@cs.umd.edu, jbg@umiacs.umd.edu, resnik@umd.edu, jonchang@fb.com

Abstract

Multi-labeled corpora, where each document is tagged with a set of labels, are ubiquitous. When the number of unique labels in a dataset is large, there are naturally dependencies among the labels. In this paper, we propose TREELAD, a hierarchical topic model that captures these label dependencies using a tree-structured topic hierarchy. We apply TREELAD to a real-world dataset and show promising empirical results.

1 Introduction

The past decade has seen probabilistic topic models used to study the thematic structure of documents in a wide range of forms, including news, blogs, and web pages [4]. Standard unsupervised topic models such as latent Dirichlet allocation (LDA) [6] aim to discover a set of topics from input data that consists only of a set of documents. In many settings, however, documents are associated with additional information, which motivates work that jointly models the documents' text with continuous response variables [5, 18] or categorical labels [12, 8, 16].

In this work, we focus on modeling multi-labeled data, in which each document is tagged with a set of labels. Such data are ubiquitous: web pages are tagged with multiple directories,¹ books are labeled with different categories, and political speeches are annotated with multiple issues.² Previous topic models for multi-labeled data focus on cases where the number of labels is relatively small and the labels are assumed independent [9, 10, 17]. Unfortunately, in many real-world examples the number of labels ranges from hundreds to thousands, which often makes independence assumptions too strong. Recent work captures the dependency among labels by projecting them onto a latent space [11, 13]. However, when the number of labels is very large, simply projecting the labels into a latent space may not be enough; understanding an unwieldy label space requires discovering structure within the labels. In this work, we capture the dependence between labels using a learned hierarchy. Our model, TREELAD, learns from label co-occurrence and word usage to discover a hierarchy of topics associated with user-generated labels.

2 Model

We introduce TREELAD, a tree-based label dependency topic model that captures a hierarchy of topics from a set of multi-labeled documents. The inputs of TREELAD are a set of D documents {w_d}, each tagged with a set of labels t_d. The number of unique labels is K and the word vocabulary size is V. TREELAD associates each of the K labels with a topic (a multinomial distribution over the vocabulary) and uses a tree structure to capture the relationships among the topics.

* Part of this work was done while the first author interned at Facebook.
¹ Open Directory Project (http://www.dmoz.org/)
² Policy Agenda Codebook (http://policyagendas.org/)

Each document is generated by repeatedly traversing the topic tree from the root downward to a node in the tree, whose associated topic is used to generate the corresponding word token. Figure 1 summarizes the generative process of TREELAD:

1. Create the label graph G and generate a tree T from G (see 2.1).
2. For each node k in T:
   (a) If k is the root, draw the background topic φ_k ~ Dir(β).
   (b) Otherwise, draw the topic φ_k ~ Dir(γ φ_σ(k)).
3. For each document d ∈ [1, D] having labels t_d:
   (a) Define a subtree T_d = R(T, t_d) (see 2.3).
   (b) For each node k in T_d:
      i. Draw a multinomial over k's children θ_{d,k} ~ Dir(α).
      ii. Draw a stochastic switching variable ω_{d,k} ~ Beta(m, π).
   (c) For each word n ∈ [1, N_d]:
      i. Draw z_{d,n} ~ B(θ_d, ω_d) (see 2.2).
      ii. Draw w_{d,n} ~ Mult(φ_{z_{d,n}}).

Figure 1: The generative process of our model TREELAD (the accompanying plate diagram is not reproduced here).

2.1 Tree generation

We assume that we are given a directed complete graph over the labels, where the edge between labels i and j is weighted by the conditional probability of the labels, w_{i,j} = P(i | j) = C_{i,j} / C_j, computed from label co-occurrence counts. We also assume an additional background node. Edges into the background node have weight 0, and the edge from the background node to label i is weighted by the marginal probability of label i. Because each non-root node in the tree corresponds to exactly one label, the tree has K + 1 nodes. We model the probability of a tree T generated from this weighted graph as proportional to the product of its constituent edge weights, p(T) ∝ ∏_{(i,j) ∈ T} w_{i,j}. Because the background node has no incoming edges with non-zero weight, it must be the root of the resulting tree. Given only this information (ignoring the content of the documents), we can construct the maximum-probability tree by running the Chu-Liu/Edmonds algorithm to find the maximum spanning tree (arborescence) rooted at the background node.
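To make the tree construction concrete, the following is a minimal sketch (not the authors' code) of building the weighted label graph and extracting the maximum-probability tree with networkx's implementation of the Chu-Liu/Edmonds algorithm. The inputs `label_counts`, `cooc_counts`, `n_docs`, and the node name "BACKGROUND" are illustrative assumptions; log weights are used so that maximizing the sum of edge weights maximizes the product p(T).

```python
# Sketch only: build the label graph of Section 2.1 and extract the maximum
# probability tree with the Chu-Liu/Edmonds algorithm (via networkx).
# label_counts[j]   : number of documents tagged with label j   (assumed input)
# cooc_counts[i][j] : number of documents tagged with both i, j (assumed input)
import math
import networkx as nx

def build_label_tree(labels, label_counts, cooc_counts, n_docs):
    G = nx.DiGraph()
    root = "BACKGROUND"                      # extra background node (the root)
    for i in labels:
        # Edge from the background node to label i: marginal probability of i.
        G.add_edge(root, i, weight=math.log(label_counts[i] / n_docs))
        for j in labels:
            if i == j:
                continue
            c_ij = cooc_counts.get(i, {}).get(j, 0)
            if c_ij > 0:
                # w_{i,j} = P(i | j); log weights turn the product p(T)
                # into a sum, which the arborescence solver maximizes.
                G.add_edge(j, i, weight=math.log(c_ij / label_counts[j]))
    # Maximum spanning arborescence; the background node has no incoming
    # edges, so it becomes the root of the resulting tree T.
    return nx.maximum_spanning_arborescence(G, attr="weight")
```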
2.2 Tree-structured Dirichlet, truncated stick-breaking process

Given T, we stochastically assign each token in every document to a node in the tree. This is done by a tree-structured Dirichlet, truncated stick-breaking process, denoted by B. For a document d, we associate each node k in the tree with (1) a stochastic switching variable ω_{d,k} ~ Beta(m, π) and (2) a multinomial distribution over k's children, θ_{d,k} ~ Dirichlet(α). The process works as follows: a token in document d traverses the tree from the root. When the token reaches a node k, it stops at that node with probability ω_{d,k} or moves on to one of k's children with probability 1 − ω_{d,k}. If it moves on, the token chooses a child k' of k with probability θ_{d,k,k'}. This process provides a prior distribution over all nodes in the tree. It is similar in spirit to the TSSB prior [1], in which a datum traverses downward from the root and can stop at any node in the tree. However, since the number of nodes in our tree is fixed, instead of using stick-breaking processes for the distributions over the depth and width of the tree, we use a truncated stick-breaking process and Dirichlet distributions, respectively.
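As an illustration of the process B, here is a minimal sketch (under assumed data structures, not the authors' implementation) of drawing a node assignment for one token by walking down the tree; `children`, `omega_d`, and `theta_d` are hypothetical per-document structures.

```python
# Sketch of the tree-structured truncated stick-breaking draw B(theta_d, omega_d):
# starting at the root, stop at the current node with probability omega_d[k],
# otherwise descend to a child chosen from the multinomial theta_d[k].
import random

def draw_node(root, children, omega_d, theta_d):
    """children[k]: list of k's children; omega_d[k]: stop probability at k;
    theta_d[k]: dict mapping each child of k to its selection probability."""
    k = root
    while children.get(k):                       # leaves always stop
        if random.random() < omega_d[k]:         # stop here with prob. omega_{d,k}
            return k
        kids = children[k]
        probs = [theta_d[k][c] for c in kids]
        k = random.choices(kids, weights=probs, k=1)[0]   # pick a child
    return k
```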

2.3 Restricted subtrees

With the tree T and the prior distribution B over all its nodes, we could assign each token to a node by first computing the probability of the token being at every node in the tree and then sampling from that distribution. However, doing so is inefficient given the large number of labels.

To speed things up and to leverage the information in the labels, for each document we only consider assigning its tokens to a subset of nodes, called the restricted subtree T_d, based on the document's label set. For a document d having labels t_d, we consider three ways of constructing T_d (see the sketch after this list):

1. Exact paths (T_d¹): contains the nodes on the paths from the root to the document's label nodes t_d.
2. Inside subtrees (T_d²): contains the nodes in T_d¹ and all nodes in the subtrees rooted at the document's label nodes t_d.
3. Inside-outside subtrees (T_d³): contains the nodes in T_d² and all nodes in the subtrees rooted at the first-level nodes on the paths from the root to the document's label nodes t_d.
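The following is a minimal, hypothetical sketch of building the three restricted subtrees from parent pointers; the structures `parent` and `children` and the function names are assumptions for illustration, not the paper's code.

```python
# Sketch of the three restricted subtrees of Section 2.3.
# parent[k]: parent of node k (None for the root); children[k]: list of k's children.

def path_to_root(node, parent):
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path                                   # node, ..., root

def descendants(node, children):
    stack, out = [node], set()
    while stack:
        k = stack.pop()
        out.add(k)
        stack.extend(children.get(k, []))
    return out

def restricted_subtree(labels_d, parent, children, mode="exact"):
    # T_d^1: exact paths from the root to each label node in t_d.
    nodes = set()
    for t in labels_d:
        nodes.update(path_to_root(t, parent))
    if mode == "exact":
        return nodes
    # T_d^2: additionally include every node below the label nodes.
    for t in labels_d:
        nodes.update(descendants(t, children))
    if mode == "inside":
        return nodes
    # T_d^3: additionally include subtrees rooted at the first-level nodes
    # (children of the root) lying on the paths to the label nodes.
    for t in labels_d:
        path = path_to_root(t, parent)
        if len(path) >= 2:
            first_level = path[-2]               # node just below the root
            nodes.update(descendants(first_level, children))
    return nodes
```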
3 Inference

Given a set of documents with observed words w and labels t, the inference task is to find the posterior distribution over the latent variables. We approximate TREELAD's posterior by collapsing out θ and ω and alternating between (1) sampling the node assignment z_{d,n} for each token and (2) sampling the topics φ_k. Currently, we fix the tree structure by running the Chu-Liu/Edmonds algorithm to find the maximum spanning tree using only label co-occurrence information, as described in 2.1. In principle, to incorporate word usage, the tree should be updated during inference to take the word distribution at each tree node into account. We leave this for future work.

3.1 Sampling z_{d,n}

The probability of assigning a token w_{d,n} to the node k specified by a path P = (root, ..., k) from the root is

P(z_{d,n} = k \mid \text{rest}) \propto
  \frac{N_{d,k}^{-d,n} + m\pi}{N_{d,\geq k}^{-d,n} + \pi}
  \prod_{i \in \mathcal{P} \setminus \{k\}} \frac{N_{d,>i}^{-d,n} + (1-m)\pi}{N_{d,\geq i}^{-d,n} + \pi}
  \prod_{j \in \mathcal{P} \setminus \{\text{root}\}} \frac{N_{d,\geq j}^{-d,n} + \alpha}{\sum_{j' \in \mathcal{C}_{d,\sigma(j)}} \bigl( N_{d,\geq j'}^{-d,n} + \alpha \bigr)}
  \; \phi_{k, w_{d,n}}     (1)

where N_{d,k} is the number of tokens in document d assigned to node k, and N_{d,>k} is the number of tokens in document d assigned to any node in the subtree rooted at k, excluding k itself. We define N_{d,≥k} ≡ N_{d,k} + N_{d,>k}. Conventionally, the superscript −(d,n) denotes counts excluding w_{d,n}. We also use σ(k) to denote the parent of k and C_{d,k} to denote the set of children of k in the document-specific tree T_d. A small sketch of computing this distribution appears at the end of this section.

3.2 Sampling topics φ_k

The topics along each path of the tree form a cascaded Dirichlet-multinomial chain in which the topic φ_k at node k is drawn from a Dirichlet distribution whose mean is the topic φ_σ(k) at the parent node σ(k). In our inference process, we explicitly sample the topic at each node, following the approach described in [2]. More specifically, for a node k we sample φ_k ~ Dirichlet(m_k + m̃_k + γ φ_σ(k)), where m_k is the word type count vector at node k; that is, m_{k,v} is the number of times word type v is assigned to node k. m̃_k is a smoothed count vector in which m̃_{k,v} captures the number of times node k is used when sampling v at any of k's children. m̃_k can be approximated by sampling from an Antoniak distribution [3, 14] or by summing counts from all of k's children under minimal/maximal path assumptions [7, 15]. In this work, for efficiency, we approximate the counts using the two path assumptions. More specifically, the smoothed count vector m̃_k at node k is the sum of the count vectors propagated from all of k's children, defined as follows:

Minimal path: each child i of k propagates a value of 1 to k for word type v if m_{i,v} > 0.
Maximal path: each child i of k propagates its full count vector m_i to k.

The process starts from the bottom of the tree, computing the smoothed count vectors for all nodes. After reaching the root, we perform the actual sampling in a top-down manner.
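The two path assumptions can be sketched as follows; this is a minimal illustration of the propagation rules as stated above, under assumed data structures, and is not the authors' code.

```python
# Sketch of the minimal/maximal path approximations of Section 3.2: compute the
# smoothed count vector m~_k for every node by propagating counts bottom-up.
from collections import defaultdict

def propagate_counts(root, children, m, mode="minimal"):
    """children[k]: list of k's children; m[k][v]: count of word type v at node k.
    Returns m_tilde[k][v], the smoothed counts propagated from k's children."""
    m_tilde = defaultdict(lambda: defaultdict(int))

    def visit(k):
        for child in children.get(k, []):
            visit(child)                              # bottom-up traversal
            for v, count in m[child].items():
                if count <= 0:
                    continue
                if mode == "minimal":
                    m_tilde[k][v] += 1                # propagate 1 if m_{child,v} > 0
                else:                                 # "maximal"
                    m_tilde[k][v] += count            # propagate the full count
    visit(root)
    return m_tilde
```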

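To make the Gibbs update of 3.1 concrete, here is a hypothetical sketch that scores every node of a document's restricted subtree according to Eq. (1) and samples an assignment. All data structures and names are illustrative assumptions, not the authors' implementation, and the counts are assumed to already exclude the current token.

```python
# Sketch of the Gibbs update for z_{d,n} (Eq. 1): score every node k in the
# restricted subtree T_d by walking its path from the root, then sample.
import random

def score_node(k, w, parent, children_d, n_node, n_sub, phi, m, pi, alpha):
    """Unnormalized P(z_{d,n} = k | rest) for word type w.
    n_node[j]: N_{d,j}; n_sub[j]: N_{d,>j}; phi[k][w]: topic probability."""
    n_ge = lambda j: n_node[j] + n_sub[j]                      # N_{d,>=j}
    # Probability of stopping at k.
    score = (n_node[k] + m * pi) / (n_ge(k) + pi)
    # Walk up the path: child-choice and continue probabilities.
    j = k
    while parent[j] is not None:
        p = parent[j]
        # Choosing child j among p's children in T_d.
        denom = sum(n_ge(c) + alpha for c in children_d[p])
        score *= (n_ge(j) + alpha) / denom
        # Not stopping at the internal node p.
        score *= (n_sub[p] + (1 - m) * pi) / (n_ge(p) + pi)
        j = p
    return score * phi[k][w]

def sample_z(nodes_d, w, parent, children_d, n_node, n_sub, phi, m, pi, alpha):
    weights = [score_node(k, w, parent, children_d, n_node, n_sub,
                          phi, m, pi, alpha) for k in nodes_d]
    return random.choices(nodes_d, weights=weights, k=1)[0]
```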
4 Empirical Evaluations

We run TREELAD on a set of descriptions of congressional bills discussed during the 112th U.S. Congress, each of which is labeled with a set of subjects. The dataset is collected from https://www.govtrack.us/. After performing standard pre-processing steps such as stemming, removing stop words, and removing short documents (those with fewer than 5 tokens), we have 12,299 documents, a vocabulary of 5,000 word types, and 281 unique labels. We run our Gibbs sampler for 1,000 iterations. Figure 2 shows a small portion of the label tree learned by TREELAD using the minimal path assumption and inside-outside subtrees.

[Figure 2 is not reproduced here; it shows a fragment of the learned label tree containing a Background node and nodes such as Health, Health programs administration & funding, Medical research, Health care coverage and access, Health promotion and preventive care, Medicare, Education, Higher education, Elementary and secondary education, Teaching/teachers/curricula, Taxation, Income tax deductions, Income tax credits, Foreign trade and international finance, Tariffs, and Terrorism, each displayed with its top topic words.]

Figure 2: A portion of the label tree learned by TREELAD from a collection of congressional bill descriptions during the 112th U.S. Congress.

5 Conclusion

In this paper, we propose TREELAD, a hierarchical topic model for multi-labeled documents. TREELAD aims to capture the dependency between labels using a tree-structured hierarchy. We present a Gibbs sampling inference algorithm to approximate the model's posterior distribution. Preliminary experiments on a set of multi-labeled congressional bill descriptions show promising results, and the learned label hierarchy captures qualitative relationships between the labels in the data. We are working toward evaluating TREELAD more thoroughly, both on the quality of the learned hierarchy and on downstream applications such as multi-label document classification.

Acknowledgments

This research was supported in part by NSF grants #1211153 (Resnik) and #1018625 (Boyd-Graber and Resnik). Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor.

References

[1] Ryan Adams, Zoubin Ghahramani, and Michael Jordan. Tree-structured stick breaking for hierarchical data. In NIPS, 2010.
[2] Amr Ahmed, Liangjie Hong, and Alex Smola. The nested Chinese restaurant franchise process: User tracking and document modeling. In ICML, 2013.
[3] Charles E. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2(6):1152–1174, 1974.
[4] David M. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, April 2012.
[5] David M. Blei and Jon D. McAuliffe. Supervised topic models. In NIPS, 2007.
[6] David M. Blei, Andrew Ng, and Michael Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 2003.
[7] Philip J. Cowans. Probabilistic Document Modelling. PhD thesis, University of Cambridge, 2006.
[8] Simon Lacoste-Julien, Fei Sha, and Michael I. Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. In NIPS, pages 897–904, 2008.
[9] David M. Mimno and Andrew McCallum. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In UAI, pages 411–418, 2008.

[10] Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher Manning. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In EMNLP, 2009.
[11] Daniel Ramage, Christopher D. Manning, and Susan Dumais. Partially labeled topic models for interpretable text mining. In SIGKDD, pages 457–465, 2011.
[12] Michal Rosen-Zvi, Thomas L. Griffiths, Mark Steyvers, and Padhraic Smyth. The author-topic model for authors and documents. In UAI, 2004.
[13] Timothy N. Rubin, America Chambers, Padhraic Smyth, and Mark Steyvers. Statistical topic models for multi-label document classification. Machine Learning, 88(1-2):157–208, 2012.
[14] Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. Hierarchical Dirichlet processes. JASA, 101(476), 2006.
[15] Hanna M. Wallach. Structured Topic Models for Language. PhD thesis, University of Cambridge, 2008.
[16] Chong Wang, David Blei, and Li Fei-Fei. Simultaneous image classification and annotation. In CVPR, 2009.
[17] Chong Wang, David Blei, and Li Fei-Fei. Simultaneous image classification and annotation. In CVPR, 2009.
[18] Jun Zhu, Amr Ahmed, and Eric P. Xing. MedLDA: Maximum margin supervised topic models for regression and classification. In ICML, 2009.