RECSM Summer School: Facebook + Topic Models. github.com/pablobarbera/big-data-upf

RECSM Summer School: Facebook + Topic Models
Pablo Barberá, School of International Relations, University of Southern California
pablobarbera.com | Networked Democracy Lab: www.netdem.org
Course website: github.com/pablobarbera/big-data-upf

Collecting Facebook data

Facebook only allows access to public page data through the Graph API:
1. Posts on public pages
2. Likes, reactions, comments, replies...

Some public user data (gender, location) was available through previous versions of the API, but is no longer accessible. Access to other (anonymized) data used in published studies requires permission from Facebook.

R library: Rfacebook
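
A minimal sketch of how this looks in practice, assuming the Rfacebook package and a temporary Graph API token from developers.facebook.com (the page name and token below are placeholders, and the public-page endpoints have since been restricted, so this may no longer run unchanged):

```r
# Minimal sketch: download recent posts from a public page with Rfacebook.
# Assumptions: a valid (temporary) Graph API token; "barackobama" is just an
# example page name; the relevant endpoints have since been restricted.
library(Rfacebook)

token <- "XXXXX"  # placeholder token

# 50 most recent posts from a public page, including reaction counts
page <- getPage(page = "barackobama", token = token, n = 50, reactions = TRUE)

# Comments and likes on the first post
post <- getPost(post = page$id[1], token = token, n = 200)
```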

Overview of text as data methods

Fig. 1 in Grimmer and Stewart (2013), annotated with example methods: entity recognition (events, quotes, locations, names...), cosine similarity, Naive Bayes, mixture models and other ML methods, and models with covariates (sLDA, STM).

Latent Dirichlet allocation (LDA)

Topic models are powerful tools for exploring large data sets and for making inferences about the content of documents. They have many applications in information retrieval, document summarization, and classification.

[Figure: example from Sontag and Roy, "Complexity of Inference in Latent Dirichlet Allocation": documents are mixtures of topics such as politics (president, obama, washington...), religion (hindu, judaism, ethics, buddhism...), and sports (baseball, soccer, basketball, football...); almost all uses of topic models (unsupervised learning, information retrieval, classification) require probabilistic inference about the distribution of topics in a new document.]

LDA is one of the simplest and most widely used topic models.

Latent Dirichlet Allocation

Latent Dirichlet Allocation

Document = random mixture over latent topics
Topic = distribution over n-grams

Probabilistic model with 3 steps:
1. Choose θ_i ~ Dirichlet(α)
2. Choose β_k ~ Dirichlet(δ)
3. For each word m in document i:
   Choose a topic z_im ~ Multinomial(θ_i)
   Choose a word w_im ~ Multinomial(β_{k = z_im})

where:
α = parameter of the Dirichlet prior on the distribution of topics over documents
θ_i = topic distribution for document i
δ = parameter of the Dirichlet prior on the distribution of words over topics
β_k = word distribution for topic k

Latent Dirichlet Allocation

Key parameters:

1. θ = matrix of dimensions N documents by K topics, where θ_ik corresponds to the probability that document i belongs to topic k; e.g. assuming K = 5:

                T1    T2    T3    T4    T5
  Document 1    0.15  0.15  0.05  0.10  0.55
  Document 2    0.80  0.02  0.02  0.10  0.06
  ...
  Document N    0.01  0.01  0.96  0.01  0.01

2. β = matrix of dimensions K topics by M words, where β_km corresponds to the probability that word m belongs to topic k; e.g. assuming M = 6:

             W1    W2    W3    W4    W5    W6
  Topic 1    0.40  0.05  0.05  0.10  0.10  0.30
  Topic 2    0.10  0.10  0.10  0.50  0.10  0.10
  ...
  Topic K    0.05  0.60  0.10  0.05  0.10  0.10
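
Both matrices can be recovered from a fitted model. A minimal sketch, assuming the topicmodels R package and its bundled AssociatedPress document-term matrix (not the course data):

```r
# Fit LDA and extract theta (document-topic) and beta (topic-word) matrices.
# Assumption: the 'topicmodels' package and its example AssociatedPress data.
library(topicmodels)

data("AssociatedPress", package = "topicmodels")
dtm <- AssociatedPress[1:100, ]          # small subset to keep the example fast

lda_fit <- LDA(dtm, k = 5, control = list(seed = 123))

theta <- posterior(lda_fit)$topics       # N documents x K topics, rows sum to 1
beta  <- posterior(lda_fit)$terms        # K topics x M words, rows sum to 1

round(head(theta, 3), 2)                 # per-document topic proportions
terms(lda_fit, 10)                       # top 10 words per topic
```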

Validation

From Quinn et al, AJPS, 2010:

1. Semantic validity: Do the topics identify coherent groups of tweets that are internally homogeneous and related to each other in a meaningful way?

2. Convergent/discriminant construct validity: Do the topics match existing measures where they should match? Do they depart from existing measures where they should depart?

3. Predictive validity: Does variation in topic usage correspond with expected events?

4. Hypothesis validity: Can topic variation be used effectively to test substantive hypotheses?

Example: open-ended survey responses

Bauer, Barberá et al, Political Behavior, 2016.

Data: General Social Survey (2008) in Germany. Responses to the questions "Would you please tell me what you associate with the term 'left'?" and "Would you please tell me what you associate with the term 'right'?" Open-ended questions minimize priming and potential interviewer effects.

Model: Sparse Additive Generative model instead of LDA (more coherent topics for short text), with K = 4 topics for each question.

Example: topics in US legislators' tweets

Data: 651,116 tweets sent by US legislators from January 2013 to December 2014, aggregated into 2,920 documents = 730 days x 2 chambers x 2 parties.

Why aggregate? Applications that aggregate by author or day outperform tweet-level analyses (Hong and Davison, 2010).

K = 100 topics (more on this later). Validation: http://j.mp/lda-congress-demo
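
A minimal sketch of this kind of aggregation, assuming a hypothetical data frame tweets with columns text, date, chamber, and party (illustrative names, not the course's actual data):

```r
# Aggregate tweet-level text into day x chamber x party "documents".
# Assumption: 'tweets' is a data frame with columns text, date, chamber, party.
library(dplyr)

docs <- tweets %>%
  group_by(date, chamber, party) %>%                 # 730 days x 2 chambers x 2 parties
  summarise(text = paste(text, collapse = " "), .groups = "drop")

nrow(docs)                                           # close to 2,920 documents
```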

Choosing the number of topics

"Choosing K is one of the most difficult questions in unsupervised learning" (Grimmer and Stewart, 2013, p. 19).

We chose K = 100 based on cross-validated model fit.

[Figure: cross-validated log-likelihood and perplexity, expressed as a ratio with respect to the worst value, for K = 10 to 120 topics.]

BUT: there is often a negative relationship between the best-fitting model and the substantive information provided. Grimmer and Stewart propose choosing K based on substantive fit.
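
A minimal sketch of comparing statistical fit across values of K, assuming the topicmodels package and a single held-out split (rather than the full cross-validation used for the figure):

```r
# Compare held-out fit for several values of K.
# Assumptions: 'topicmodels' package; a simple train/test split rather than
# full cross-validation; the AssociatedPress example data, not the course data.
library(topicmodels)

data("AssociatedPress", package = "topicmodels")
dtm   <- AssociatedPress[1:200, ]
train <- dtm[1:150, ]
test  <- dtm[151:200, ]

ks  <- c(10, 20, 30, 40, 50)
fit <- sapply(ks, function(k) {
  m <- LDA(train, k = k, control = list(seed = 123))
  c(loglik = logLik(m), perplexity = perplexity(m, newdata = test))
})
colnames(fit) <- ks

round(fit, 1)   # higher log-likelihood / lower perplexity = better statistical fit
```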

Extensions of LDA

1. Structural topic model (Roberts et al, 2014, AJPS)
2. Dynamic topic model (Blei and Lafferty, 2006, ICML; Quinn et al, 2010, AJPS)
3. Hierarchical topic model (Griffiths and Tenenbaum, 2004, NIPS; Grimmer, 2010, PA)

Why? Substantive reasons: incorporate specific elements of the data-generating process into estimation. Statistical reasons: structure can lead to better topics.

Structural topic model

Prevalence: the prior on the mixture over topics is now document-specific, and can be a function of covariates (documents with similar covariates will tend to be about the same topics).

Content: the distribution over words is now document-specific, and can be a function of covariates (documents with similar covariates will tend to use similar words to refer to the same topic).
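
A minimal sketch showing both covariate types, assuming the stm R package and its bundled Gadarian example data (treatment and pid_rep are covariates in that example, not the course data):

```r
# Structural topic model with prevalence and content covariates.
# Assumption: the 'stm' package and its built-in Gadarian survey example.
library(stm)

data(gadarian)
processed <- textProcessor(gadarian$open.ended.response, metadata = gadarian)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)

fit <- stm(documents = out$documents, vocab = out$vocab, K = 10,
           prevalence = ~ treatment + s(pid_rep),  # covariates shift topic prevalence
           content    = ~ treatment,               # covariates shift word use within topics
           data = out$meta, seed = 123)

labelTopics(fit)   # most probable words per topic
```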

Dynamic topic model Source: Blei, Modeling Science

Wordfish (Slapin and Proksch, 2008, AJPS)

Goal: unsupervised scaling of ideological positions.

The ideology of politician i, θ_i, is a position on a latent scale. Word usage is drawn from a Poisson-IRT model:

W_im ~ Poisson(λ_im)
λ_im = exp(α_i + ψ_m + β_m θ_i)

where:
α_i is the loquaciousness of politician i
ψ_m is the frequency of word m
β_m is the discrimination parameter of word m

Estimation using the EM algorithm.

Identification:
Unit variance restriction for θ_i
Choose a and b such that θ_a > θ_b
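
A minimal sketch of fitting Wordfish, assuming the quanteda and quanteda.textmodels packages and their bundled Irish budget-debate corpus (not the original Slapin and Proksch data):

```r
# Unsupervised ideological scaling with Wordfish.
# Assumption: 'quanteda' / 'quanteda.textmodels' and the Irish budget 2010 corpus.
library(quanteda)
library(quanteda.textmodels)

dfm_budget <- dfm(tokens(data_corpus_irishbudget2010))

# dir = c(6, 5) is the identification constraint theta_6 > theta_5; the two
# document indices are an arbitrary choice that fixes the direction of the scale.
wf <- textmodel_wordfish(dfm_budget, dir = c(6, 5))

summary(wf)   # theta = ideal points, beta = word discrimination, psi = word frequency
```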