Text Mining for Economics and Finance Latent Dirichlet Allocation


Text Mining for Economics and Finance Latent Dirichlet Allocation Stephen Hansen Text Mining Lecture 5 1 / 45

Introduction Recall that we are interested in mixed-membership modeling, but that the pLSI model has a huge number of parameters to estimate. One solution is to adopt a Bayesian approach; the pLSI model with a prior distribution on the document-specific mixing probabilities is called Latent Dirichlet Allocation (Blei, Ng, and Jordan 2003). LDA is widely used within computer science and, increasingly, in the social sciences. LDA forms the basis of many more complicated mixed-membership models. Text Mining Lecture 5 2 / 45

Latent Dirichlet Allocation (Original)
1. Draw θ_d independently for d = 1, ..., D from Dirichlet(α).
2. Each word w_{d,n} in document d is generated from a two-step process:
   2.1 Draw topic assignment z_{d,n} from θ_d.
   2.2 Draw w_{d,n} from β_{z_{d,n}}.
Estimate hyperparameters α and term probabilities β_1, ..., β_K.
Text Mining Lecture 5 3 / 45

Latent Dirichlet Allocation (Modified)
1. Draw β_k independently for k = 1, ..., K from Dirichlet(η).
2. Draw θ_d independently for d = 1, ..., D from Dirichlet(α).
3. Each word w_{d,n} in document d is generated from a two-step process:
   3.1 Draw topic assignment z_{d,n} from θ_d.
   3.2 Draw w_{d,n} from β_{z_{d,n}}.
Fix scalar values for η and α.
Text Mining Lecture 5 4 / 45
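
The modified generative process above is straightforward to simulate. Here is a minimal sketch in Python/NumPy; the corpus sizes, document lengths, and hyperparameter values are illustrative choices, not anything from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: K topics, V vocabulary terms, D documents
K, V, D = 5, 1000, 20
eta, alpha = 0.025, 0.1            # scalar Dirichlet hyperparameters
N = rng.integers(50, 200, size=D)  # document lengths N_d

# 1. Topic-specific term probabilities beta_k ~ Dirichlet(eta)
beta = rng.dirichlet(np.full(V, eta), size=K)      # K x V

# 2. Document-specific mixing probabilities theta_d ~ Dirichlet(alpha)
theta = rng.dirichlet(np.full(K, alpha), size=D)   # D x K

# 3. For each word, draw a topic from theta_d, then a term from beta_{z_{d,n}}
docs = []
for d in range(D):
    z_d = rng.choice(K, size=N[d], p=theta[d])
    w_d = np.array([rng.choice(V, p=beta[k]) for k in z_d])
    docs.append((z_d, w_d))
```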

Example statement: Yellen, March 2006, #51 We have noticed a change in the relationship between the core CPI and the chained core CPI, which suggested to us that maybe something is going on relating to substitution bias at the upper level of the index. You focused on the nonmarket component of the PCE, and I wondered if something unusual might be happening with the core CPI relative to other measures. Text Mining Lecture 5 5 / 45

Example statement: Yellen, March 2006, #51 [Figure: the same statement with each content word annotated by its assigned topic number; the price-measurement words (core, CPI, chained, substitution, nonmarket, component, PCE, measures) are predominantly tagged with topic 25.] Text Mining Lecture 5 5 / 45

Distribution of Attention Text Mining Lecture 5 6 / 45

Topic 25 Text Mining Lecture 5 7 / 45

Topic 11 Text Mining Lecture 5 8 / 45

Advantage of Flexibility The term measur has probability 0.026 in topic 25 and probability 0.021 in topic 11. In this statement it is consistently assigned to topic 25 because of the presence of other topic-25 words; in statements containing words about evidence and numbers, it is consistently assigned to topic 11. The sampling algorithm thus helps place words in their appropriate context. Text Mining Lecture 5 8 / 45

External Validation [Figure, two panels: the BBD index plotted against the fraction of FOMC1 and FOMC2 devoted to Topic 35 (left panel) and Topic 12 (right panel), by FOMC meeting date, 1987m7 to 2007m7.] Text Mining Lecture 5 9 / 45

Pro-Cyclical Topics [Figure, two panels: minimum, median, and maximum across speakers of the fraction of a speaker's time in FOMC2 devoted to pro-cyclical topics, by FOMC meeting date, 1987m7 to 2008m7.] Text Mining Lecture 5 10 / 45

Counter-Cyclical Topics [Figure, two panels: minimum, median, and maximum across speakers of the fraction of a speaker's time in FOMC2 devoted to counter-cyclical topics, by FOMC meeting date, 1987m7 to 2008m7.] Text Mining Lecture 5 11 / 45

Graphical Models Consider a probabilistic model with joint distribution f(x) over the random variables x = (x_1, ..., x_N). In high-dimensional models, it is useful to summarize relationships among random variables with directed graphs in which nodes represent random variables and links between nodes represent dependencies. A node's parents are the set of nodes that link to it; a node's children are the set of nodes that it links to. A Bayesian network is a probabilistic model whose joint distribution can be represented by a directed acyclic graph (DAG). The nodes in a DAG can be ordered so that parents precede children. Text Mining Lecture 5 12 / 45

LDA as a Bayesian Network [Directed acyclic graph: α → θ_1, ..., θ_D; each θ_d → z_{d,1}, ..., z_{d,N_d}; each z_{d,n} → w_{d,n}; η → β_1, ..., β_K; and β_1, ..., β_K → every w_{d,n}.] Text Mining Lecture 5 13 / 45

LDA Plate Diagram [Plate notation: α → θ_d → z_{d,n} → w_{d,n}, with β_k → w_{d,n} and η → β_k; the inner plate repeats over the N_d words in document d, the outer plate over the D documents, and a separate plate repeats β_k over the K topics.] Text Mining Lecture 5 14 / 45

Graph Properties Let B = (β_1, ..., β_K) and T = (θ_1, ..., θ_D).

Object     Parents        Children
α          (none)         T
θ_d        α              z_d
z_{d,n}    θ_d            w_{d,n}
w_{d,n}    z_{d,n}, B     (none)
β_k        η              w_1, ..., w_D
η          (none)         B

Text Mining Lecture 5 15 / 45

Conditional Independence Property I In a Bayesian network, nodes are independent of their ancestors conditional on their parents. This means we can write

f(x) = \prod_{i=1}^{N} f(x_i | parents(x_i)),

which can greatly simplify joint distributions. Applying this formula to LDA yields

f(w, z, T, B | α, η) = ( \prod_{d} \prod_{n} Pr[ w_{d,n} | z_{d,n}, B ] Pr[ z_{d,n} | θ_d ] ) ( \prod_{d} Pr[ θ_d | α ] ) ( \prod_{k} Pr[ β_k | η ] ).

Text Mining Lecture 5 16 / 45

Conditional Independence Property II The Markov blanket MB(x_i) of a node x_i in a Bayesian network is the set of nodes consisting of x_i's parents, children, and children's parents. Conditional on its Markov blanket, the node x_i is independent of all nodes outside its Markov blanket. So f(x_i | x_{-i}) has the same distribution as f(x_i | MB(x_i)). Text Mining Lecture 5 17 / 45

Posterior Distribution The inference problem in LDA is to compute the posterior distribution over z, T, and B given the data w and the Dirichlet hyperparameters. Let's consider the simpler problem of inferring the latent allocation variables, taking the parameters as given. The posterior distribution is

Pr[ z = z* | w, T, B ] = Pr[ w | z = z*, T, B ] Pr[ z = z* | T, B ] / ( \sum_{z'} Pr[ w | z = z', T, B ] Pr[ z = z' | T, B ] ).

We can compute the numerator easily, and also each element of the denominator. But z ∈ {1, ..., K}^N, so there are K^N terms in the sum: an intractable problem. For example, a 100-word corpus with 50 topics has 50^100 ≈ 7.88e169 terms. Text Mining Lecture 5 18 / 45

Approximate Inference Instead of obtaining a closed-form solution for the posterior distribution, we must approximate it. Markov chain Monte Carlo methods provide a stochastic approximation to the true posterior. The general idea is to define a Markov chain whose stationary distribution is equivalent to the posterior distribution, which we then draw samples from. There are several MCMC methods, but we will consider Gibbs sampling. Text Mining Lecture 5 19 / 45

Gibbs Sampling We want to draw samples from some joint distribution f(x) over x = (x_1, ..., x_N) (e.g. a posterior distribution). Suppose we can compute the full conditional distributions f_i ≡ f(x_i | x_{-i}). Then we can use the following algorithm:
1. Randomly allocate an initial value for x, say x^0.
2. Let S be the number of iterations to run the chain. For each s ∈ {1, ..., S}, draw each x_i^s according to x_i^s ~ f(x_i | x_1^s, ..., x_{i-1}^s, x_{i+1}^{s-1}, ..., x_N^{s-1}).
3. Discard the initial iterations (burn-in), and collect samples from every mth iteration (the thinning interval) thereafter.
4. Use the collected samples to approximate the joint distribution, or related distributions and moments.
Text Mining Lecture 5 20 / 45
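
As a minimal, self-contained illustration of steps 1 to 4 (not an example from the lecture), the following Python/NumPy sketch runs Gibbs sampling for a bivariate normal with correlation rho, whose full conditionals are univariate normals; the burn-in and thinning values are arbitrary:

```python
import numpy as np

def gibbs_bivariate_normal(rho=0.8, S=10_000, burn_in=1_000, thin=5, seed=0):
    """Gibbs sampling for (x1, x2) ~ N(0, [[1, rho], [rho, 1]]).

    Each full conditional is x_i | x_{-i} ~ N(rho * x_{-i}, 1 - rho**2).
    """
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0          # step 1: arbitrary initial value
    sd = np.sqrt(1.0 - rho**2)
    samples = []
    for s in range(S):         # step 2: sweep through the coordinates
        x1 = rng.normal(rho * x2, sd)
        x2 = rng.normal(rho * x1, sd)
        if s >= burn_in and (s - burn_in) % thin == 0:
            samples.append((x1, x2))   # steps 3-4: keep thinned post-burn-in draws
    return np.array(samples)

draws = gibbs_bivariate_normal()
print(draws.mean(axis=0), np.corrcoef(draws.T)[0, 1])   # roughly (0, 0) and 0.8
```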

Sampling Equations for θ_d The Markov blanket of θ_d consists of the parent α and the children z_d. So we need to draw samples from Pr[ θ_d | α, z_d ]. This is the posterior distribution for θ_d given a fixed value for the vector of allocation variables z_d. We computed this posterior above. Let n_{d,k} be the number of words in document d that have topic allocation k. Then

Pr[ θ_d | α, z_d ] = Dir( α + n_{d,1}, ..., α + n_{d,K} ).

Text Mining Lecture 5 21 / 45

Sampling Equations for β_k The Markov blanket of β_k consists of the parent η, the children w = (w_1, ..., w_D), and the children's parents, namely z = (z_1, ..., z_D) and B_{-k}. Consider Pr[ β_k | η, w, z ]. Only the allocation variables assigned to k, and their associated words, are informative about β_k. Let m_{k,v} be the number of times topic-k allocation variables generate term v. Then

Pr[ β_k | η, w, z ] = Dir( η + m_{k,1}, ..., η + m_{k,V} ).

Text Mining Lecture 5 22 / 45

Sampling Equations for Allocations The Markov blanket of z_{d,n} consists of the parent θ_d, the child w_{d,n}, and the child's parents β_1, ..., β_K. So

Pr[ z_{d,n} = k | w_{d,n} = v, B, θ_d ] = Pr[ w_{d,n} = v | z_{d,n} = k, B, θ_d ] Pr[ z_{d,n} = k | B, θ_d ] / ( \sum_{k'} Pr[ w_{d,n} = v | z_{d,n} = k', B, θ_d ] Pr[ z_{d,n} = k' | B, θ_d ] ) = θ_{d,k} β_{k,v} / ( \sum_{k'} θ_{d,k'} β_{k',v} ).

Text Mining Lecture 5 23 / 45

Summary To complete one iteration of Gibbs sampling, we need to: 1. Sample from a multinomial distribution N times for the topic allocation variables. 2. Sample from a Dirichlet D times for the document-specific mixing probabilities. 3. Sample from a Dirichlet K times for the topic-specific term probabilities. Sampling from these distributions is standard, and implemented in many programming languages. Text Mining Lecture 5 24 / 45
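
Putting the three sampling equations together, one full iteration of the (uncollapsed) Gibbs sampler might look like the following Python/NumPy sketch; the data layout and variable names are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def gibbs_iteration(docs, z, theta, beta, alpha, eta, K, V, rng):
    """One sweep of the (uncollapsed) Gibbs sampler for LDA.

    docs:  list of 1-D integer arrays of term indices (one array per document)
    z:     list of 1-D integer arrays of current topic allocations, same shapes
    theta: D x K matrix; beta: K x V matrix; alpha, eta: scalar hyperparameters
    """
    D = len(docs)
    # 1. Resample each allocation: Pr[z_{d,n}=k | ...] ∝ theta_{d,k} * beta_{k,v}
    for d in range(D):
        for n, v in enumerate(docs[d]):
            p = theta[d] * beta[:, v]
            z[d][n] = rng.choice(K, p=p / p.sum())
    # 2. Resample theta_d ~ Dirichlet(alpha + n_{d,1}, ..., alpha + n_{d,K})
    for d in range(D):
        n_dk = np.bincount(z[d], minlength=K)
        theta[d] = rng.dirichlet(alpha + n_dk)
    # 3. Resample beta_k ~ Dirichlet(eta + m_{k,1}, ..., eta + m_{k,V})
    m_kv = np.zeros((K, V))
    for d in range(D):
        np.add.at(m_kv, (z[d], docs[d]), 1)
    for k in range(K):
        beta[k] = rng.dirichlet(eta + m_kv[k])
    return z, theta, beta
```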

Collapsed Sampling Collapsed sampling refers to analytically integrating out some variables in the joint likelihood and sampling the remainder. This tends to be more efficient because it reduces the dimensionality of the space we sample from. Griffiths and Steyvers (2004) proposed a collapsed sampler for LDA that integrates out the T and B terms and samples only z. For details see Heinrich (2009) and the technical appendix of Hansen, McMahon, and Prat (2015). Text Mining Lecture 5 25 / 45

Collapsed Sampling Equation for LDA The sampling equation for the nth allocation variable in document d is

Pr[ z_{d,n} = k | z_{-(d,n)}, w, α, η ] ∝ ( m_{k,v_{d,n}}^{-(d,n)} + η ) / ( \sum_{v} m_{k,v}^{-(d,n)} + Vη ) × ( n_{d,k}^{-(d,n)} + α ),

where the -(d,n) superscript denotes counts that exclude the (d,n) term. The probability that term n in document d is assigned to topic k is increasing in:
1. The number of other terms in document d that are currently assigned to k.
2. The number of other occurrences of the term v_{d,n} in the entire corpus that are currently assigned to k.
Both properties mean that terms that regularly co-occur in documents will be grouped together to form topics. Property 1 also means that the terms within a document will tend to be grouped into a few topics rather than spread across many separate topics. Text Mining Lecture 5 26 / 45
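
A minimal sketch of one collapsed Gibbs sweep implementing this equation, assuming the same illustrative data layout as before (a practical sampler would add burn-in, thinning, and multiple chains):

```python
import numpy as np

def collapsed_gibbs_sweep(docs, z, n_dk, m_kv, m_k, alpha, eta, rng):
    """One sweep of collapsed Gibbs sampling for LDA.

    docs: list of term-index arrays; z: list of current allocations
    n_dk: D x K document-topic counts; m_kv: K x V topic-term counts
    m_k:  length-K totals, m_k = m_kv.sum(axis=1)
    """
    K, V = m_kv.shape
    for d, doc in enumerate(docs):
        for n, v in enumerate(doc):
            k_old = z[d][n]
            # Remove the (d, n) term from all counts
            n_dk[d, k_old] -= 1
            m_kv[k_old, v] -= 1
            m_k[k_old] -= 1
            # Pr[z_{d,n}=k | ...] ∝ (m_{k,v}+eta)/(m_k+V*eta) * (n_{d,k}+alpha)
            p = (m_kv[:, v] + eta) / (m_k + V * eta) * (n_dk[d] + alpha)
            k_new = rng.choice(K, p=p / p.sum())
            # Add the term back with its new allocation
            z[d][n] = k_new
            n_dk[d, k_new] += 1
            m_kv[k_new, v] += 1
            m_k[k_new] += 1
    return z, n_dk, m_kv, m_k
```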

Predictive Distributions Collapsed sampling gives the distribution of the allocation variables, but we also care about the variables we integrated out. Their predictive distributions are easy to form given topic assignments:

β_{k,v} = ( m_{k,v} + η ) / \sum_{v'=1}^{V} ( m_{k,v'} + η )   and   θ_{d,k} = ( n_{d,k} + α ) / \sum_{k'=1}^{K} ( n_{d,k'} + α ).

Text Mining Lecture 5 27 / 45
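
Given the count matrices maintained by the collapsed sampler, these predictive distributions are one line each; a sketch using the illustrative n_dk and m_kv arrays from above:

```python
import numpy as np

def predictive_distributions(n_dk, m_kv, alpha, eta):
    """Point estimates of theta (D x K) and beta (K x V) from Gibbs counts."""
    theta = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)
    beta = (m_kv + eta) / (m_kv + eta).sum(axis=1, keepdims=True)
    return theta, beta
```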

LDA on Survey Data Recall from the first lecture that text data is one instance of count data. Although typically applied to natural language, LDA is in principle applicable to any count data. We recently applied it to CEO survey data to estimate management behaviors with K = 2. Can visualize results in terms of likelihood ratios of marginals over specific data features. Text Mining Lecture 5 28 / 45

Activity Type Text Mining Lecture 5 29 / 45

Planning; Size; Number of Functions Text Mining Lecture 5 30 / 45

Insiders vs Outsiders Text Mining Lecture 5 31 / 45

Functions Text Mining Lecture 5 32 / 45

Model Selection There are three parameters to set before running the Gibbs sampling algorithm: the number of topics K and the hyperparameters α and η. Priors don't receive much attention in the literature. Griffiths and Steyvers recommend η = 200/V and α = 50/K; smaller values will tend to generate more concentrated distributions (see also Wallach et al. 2009). The choice of K is less clear. Two potential goals:
1. Predict text well, using statistical criteria to select K.
2. Interpretability: general versus specific topics.
Text Mining Lecture 5 33 / 45

Formalizing Interpretability Chang et al. (2009) propose an objective way of determining whether topics are indeed interpretable. Two tests:
1. Word intrusion. Form a set of words consisting of the top five words from topic k plus a word with low probability in topic k. Ask subjects to identify the inserted word.
2. Topic intrusion. Show subjects a snippet of a document, the top three topics associated with it, and a randomly drawn other topic. Ask them to identify the inserted topic.
They estimate LDA and other topic models on NYT and Wikipedia articles for K = 50, 100, 150. Text Mining Lecture 5 34 / 45

Results Text Mining Lecture 5 35 / 45

Implications Topics seem objectively interpretable in many contexts. There is a tradeoff between goodness-of-fit and interpretability, and interpretability is generally more important in social science. Statistical models that explicitly maximize interpretability may be developed in the future. Text Mining Lecture 5 36 / 45

Chain Convergence and Selection Determining when a chain has converged can be tricky. One approach is to measure how well different states of the chain predict the data, and to declare convergence when this measure stabilizes. Standard practice is to run chains from different starting values, after which you can select the best-fitting chain for analysis. For LDA, a typical goodness-of-fit measure is perplexity, given by

exp( - \sum_{d=1}^{D} \sum_{v=1}^{V} x_{d,v} log( \sum_{k=1}^{K} θ_{d,k} β_{k,v} ) / \sum_{d=1}^{D} N_d ).

Text Mining Lecture 5 37 / 45
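
A sketch of the perplexity computation on a D x V document-term matrix X, using estimates of θ and β such as those returned above; the names and the small numerical guard are illustrative:

```python
import numpy as np

def perplexity(X, theta, beta, eps=1e-12):
    """Perplexity of a D x V document-term count matrix under (theta, beta)."""
    # Per-term predictive probabilities p_{d,v} = sum_k theta_{d,k} beta_{k,v}
    p = theta @ beta                       # D x V
    log_lik = np.sum(X * np.log(p + eps))  # sum_d sum_v x_{d,v} log p_{d,v}
    n_tokens = X.sum()                     # sum_d N_d
    return np.exp(-log_lik / n_tokens)
```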

Perplexity 1 Text Mining Lecture 5 38 / 45

Perplexity 2 Text Mining Lecture 5 39 / 45

Out-of-Sample Documents We are sometimes interested in obtaining the document-topic distribution for out-of-sample documents. We can perform Gibbs sampling treating the estimated topics as fixed:

Pr[ z_{d,n} = k | z_{-(d,n)}, w_d, α, η ] ∝ β_{k,v_{d,n}} ( n_{d,k}^{-(d,n)} + α )

for each out-of-sample document d. Only 10-20 iterations are necessary since the topics are already estimated. Text Mining Lecture 5 40 / 45
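
A sketch of this folding-in step for a single new document with the estimated topic-term matrix beta held fixed; the default of 20 iterations simply reflects the 10-20 range mentioned above:

```python
import numpy as np

def fold_in(doc, beta, alpha, iters=20, seed=0):
    """Estimate theta for one out-of-sample document with topics (beta) held fixed.

    doc: 1-D array of term indices; beta: K x V estimated topic-term matrix
    """
    rng = np.random.default_rng(seed)
    K = beta.shape[0]
    z = rng.integers(K, size=len(doc))          # random initial allocations
    n_k = np.bincount(z, minlength=K)
    for _ in range(iters):
        for n, v in enumerate(doc):
            n_k[z[n]] -= 1                      # remove current allocation
            p = beta[:, v] * (n_k + alpha)      # Pr[z=k] ∝ beta_{k,v} (n_k + alpha)
            z[n] = rng.choice(K, p=p / p.sum())
            n_k[z[n]] += 1
    return (n_k + alpha) / (n_k + alpha).sum()  # predictive theta for the document
```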

Dictionary Methods + LDA The terms in dictionaries come labeled, so dictionary methods can be seen as a type of supervised approach to information retrieval. One can combine dictionary methods with the output of LDA to weight word counts by topic. A recent application uses the minutes of the Federal Reserve to extract indices of the economic situation and of forward guidance. The first step is to run a 15-topic model and identify the two separate kinds of topic. Text Mining Lecture 5 41 / 45

Example Topic Text Mining Lecture 5 42 / 45

Monetary Measures of Tone
Contraction: decreas*, decelerat*, slow*, weak*, low*, loss*, contract*
Expansion: increas*, accelerat*, fast*, strong*, high*, gain*, expand*
Text Mining Lecture 5 43 / 45
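
To illustrate how dictionary counts might be weighted by topic, here is a hypothetical sketch: it assumes per-document topic shares theta, per-document counts of expansion and contraction terms, and a set of topic indices classified as economic-situation topics. It shows one way to combine the two ingredients, not the construction used in the Federal Reserve application:

```python
import numpy as np

def tone_index(theta, exp_counts, con_counts, n_tokens, situation_topics):
    """Topic-weighted tone: net (expansion - contraction) counts scaled by the
    share of each document devoted to the selected topics.

    theta: D x K topic shares; exp_counts, con_counts, n_tokens: length-D arrays
    situation_topics: list of topic indices treated as economic-situation topics
    """
    weight = theta[:, situation_topics].sum(axis=1)   # attention to selected topics
    net_tone = (exp_counts - con_counts) / n_tokens   # per-document net tone
    return weight * net_tone                          # length-D weighted index
```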

Indices Text Mining Lecture 5 44 / 45

Conclusion Key ideas from this lecture:
1. Latent Dirichlet Allocation: influential in computer science, with many potential applications in economics.
2. Graphical models to simplify and visualize complex joint likelihoods.
3. Gibbs sampling to stochastically approximate the posterior.
4. Interpretable output in several domains.
Topic for the future: incorporating economic structure into LDA.
Text Mining Lecture 5 45 / 45