RECSM Summer School: Facebook + Topic Models. github.com/pablobarbera/big-data-upf
RECSM Summer School: Facebook + Topic Models
Pablo Barberá
School of International Relations, University of Southern California
pablobarbera.com / Networked Democracy Lab
Course website: github.com/pablobarbera/big-data-upf
Collecting Facebook data
Facebook only allows access to public page data through the Graph API:
1. Posts on public pages
2. Likes, reactions, comments, replies...
Some public user data (gender, location) was available through previous versions of the API (not anymore)
Access to other (anonymized) data used in published studies requires permission from Facebook
R library: Rfacebook
Overview of text as data methods
[Figure: Fig. 1 in Grimmer and Stewart (2013), a taxonomy of text-as-data methods, annotated in the slides with examples: entity recognition (events, quotes, locations, names...), supervised classification (cosine similarity, Naive Bayes, and other ML methods), unsupervised classification (mixture models), and models with covariates (sLDA, STM)]
Latent Dirichlet allocation (LDA)
Topic models are powerful tools for exploring large data sets and for making inferences about the content of documents
[Figure: example from Sontag and Roy (NYU, Cambridge), "Complexity of Inference in Latent Dirichlet Allocation": documents as mixtures of topics such as politics (president, obama, washington), religion (hindu, judaism, ethics, buddhism), and sports (baseball, soccer, basketball, football); for a new document, inference recovers its distribution over topics, e.g. weather .50, finance .49, sports .01]
Many applications in information retrieval, document summarization, and classification
Almost all uses of topic models (e.g., for unsupervised learning, information retrieval, classification) require probabilistic inference: what is this document about?
LDA is one of the simplest and most widely used topic models
Latent Dirichlet Allocation
Document = random mixture over latent topics
Topic = distribution over n-grams
Probabilistic model with 3 steps:
1. Choose θ_i ∼ Dirichlet(α)
2. Choose β_k ∼ Dirichlet(δ)
3. For each word m in document i:
   Choose a topic z_im ∼ Multinomial(θ_i)
   Choose a word w_im ∼ Multinomial(β_k), with k = z_im
where:
α = parameter of the Dirichlet prior on the distribution of topics over documents
θ_i = topic distribution for document i
δ = parameter of the Dirichlet prior on the distribution of words over topics
β_k = word distribution for topic k
Latent Dirichlet Allocation
Key parameters:
1. θ = matrix of dimensions N documents by K topics, where θ_ik is the probability that document i belongs to topic k (each row sums to 1); e.g. assuming K = 5:
[Table: example θ matrix, rows Document 1, Document 2, ..., Document N, columns T1-T5]
2. β = matrix of dimensions K topics by M words, where β_km is the probability that word m belongs to topic k (each row sums to 1); e.g. assuming M = 6:
[Table: example β matrix, rows Topic 1, Topic 2, ..., Topic K, columns W1-W6]
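The three-step generative process above can be simulated directly. The sketch below is a minimal illustration in plain Python (function names and hyperparameter values are illustrative, not from the slides): it draws each Dirichlet variate by normalizing Gamma draws, then samples a topic and a word for every token.

```python
import random

def sample_dirichlet(alpha, size, rng):
    """Dirichlet(alpha) draw via normalized Gamma variates."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(g)
    return [x / total for x in g]

def generate_corpus(n_docs, doc_len, n_topics, vocab_size,
                    alpha=0.1, delta=0.01, seed=42):
    rng = random.Random(seed)
    # Step 2: one word distribution beta_k per topic
    beta = [sample_dirichlet(delta, vocab_size, rng) for _ in range(n_topics)]
    docs, thetas = [], []
    for _ in range(n_docs):
        # Step 1: topic distribution theta_i for this document
        theta = sample_dirichlet(alpha, n_topics, rng)
        words = []
        for _ in range(doc_len):
            # Step 3: topic assignment z, then a word from topic z
            z = rng.choices(range(n_topics), weights=theta)[0]
            w = rng.choices(range(vocab_size), weights=beta[z])[0]
            words.append(w)
        docs.append(words)
        thetas.append(theta)
    return docs, thetas, beta

docs, thetas, beta = generate_corpus(n_docs=5, doc_len=20,
                                     n_topics=3, vocab_size=50)
```

Estimation runs this process in reverse: given only `docs`, infer θ and β.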
Validation
From Quinn et al, AJPS, 2010:
1. Semantic validity
Do the topics identify coherent groups of tweets that are internally homogeneous, and are related to each other in a meaningful way?
2. Convergent/discriminant construct validity
Do the topics match existing measures where they should match? Do they depart from existing measures where they should depart?
3. Predictive validity
Does variation in topic usage correspond with expected events?
4. Hypothesis validity
Can topic variation be used effectively to test substantive hypotheses?
Example: open-ended survey responses
Bauer, Barberá et al, Political Behavior, 2016
Data: General Social Survey (2008) in Germany
Responses to the questions: "Would you please tell me what you associate with the term left?" and "Would you please tell me what you associate with the term right?"
Open-ended questions minimize priming and potential interviewer effects
Sparse Additive Generative model instead of LDA (more coherent topics for short text)
K = 4 topics for each question
Example: open-ended survey responses
Bauer, Barberá et al, Political Behavior, 2016.
Example: topics in US legislators' tweets
Data: 651,116 tweets sent by US legislators from January 2013 to December 2014
2,920 documents = 730 days × 2 chambers × 2 parties
Why aggregate? Applications that aggregate by author or day outperform tweet-level analyses (Hong and Davison, 2010)
K = 100 topics (more on this later)
Validation:
Choosing the number of topics
"Choosing K is one of the most difficult questions in unsupervised learning" (Grimmer and Stewart, 2013, p. 19)
We chose K = 100 based on cross-validated model fit
[Figure: cross-validated log-likelihood and perplexity, expressed as a ratio with respect to the worst value, as a function of the number of topics]
BUT: there is often a negative relationship between the best-fitting model and the substantive information provided.
Grimmer and Stewart propose choosing K based on substantive fit.
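Perplexity is a monotone transform of the held-out log-likelihood, exp(−log-likelihood / number of tokens), so lower is better. A minimal sketch of choosing K by cross-validated fit (the held-out log-likelihood values below are hypothetical, not the ones behind the slide's figure):

```python
import math

def perplexity(log_likelihood, n_tokens):
    """Held-out perplexity: exp(-loglik / token count). Lower is better."""
    return math.exp(-log_likelihood / n_tokens)

# Hypothetical held-out log-likelihoods for candidate values of K
heldout = {20: -61500.0, 50: -58200.0, 100: -57400.0, 150: -57600.0}
n_tokens = 10000

scores = {k: perplexity(ll, n_tokens) for k, ll in heldout.items()}
best_k = min(scores, key=scores.get)   # K with lowest perplexity
```

With these (made-up) numbers the fit criterion picks K = 100; the slide's caveat still applies, since the best-fitting K is not necessarily the most interpretable one.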
Extensions of LDA
1. Structural topic model (Roberts et al, 2014, AJPS)
2. Dynamic topic model (Blei and Lafferty, 2006, ICML; Quinn et al, 2010, AJPS)
3. Hierarchical topic model (Griffiths and Tenenbaum, 2004, NIPS; Grimmer, 2010, PA)
Why?
Substantive reasons: incorporate specific elements of the data-generating process into estimation
Statistical reasons: structure can lead to better topics.
Structural topic model
Prevalence: the prior on the mixture over topics is now document-specific, and can be a function of covariates (documents with similar covariates will tend to be about the same topics)
Content: the distribution over words is now document-specific and can be a function of covariates (documents with similar covariates will tend to use similar words to refer to the same topic)
Dynamic topic model
Source: Blei, Modeling Science
Wordfish (Slapin and Proksch, 2008, AJPS)
Goal: unsupervised scaling of ideological positions
Ideology of politician i, θ_i, is a position on a latent scale.
Word usage is drawn from a Poisson-IRT model:
W_im ∼ Poisson(λ_im)
λ_im = exp(α_i + ψ_m + β_m θ_i)
where:
α_i is the loquaciousness of politician i
ψ_m is the frequency of word m
β_m is the discrimination parameter of word m
Estimation using the EM algorithm.
Identification:
Unit variance restriction for θ_i
Choose politicians a and b such that θ_a > θ_b
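The rate equation above is easy to check numerically. A minimal sketch (all parameter values are hypothetical; only the functional form λ_im = exp(α_i + ψ_m + β_m θ_i) comes from the slides) showing that a word with a positive discrimination parameter has a higher expected count for the politician with the higher ideology score:

```python
import math

def wordfish_rate(alpha_i, psi_m, beta_m, theta_i):
    """Expected word count: lambda_im = exp(alpha_i + psi_m + beta_m * theta_i)."""
    return math.exp(alpha_i + psi_m + beta_m * theta_i)

# Two hypothetical politicians at opposite ends of the latent scale
theta = {"a": 1.0, "b": -1.0}   # identification: theta_a > theta_b
alpha = {"a": 0.0, "b": 0.0}    # equal loquaciousness, to isolate ideology
psi_m = 1.0                     # baseline (log) frequency of word m
beta_m = 0.8                    # positive discrimination: a "right-coded" word

rate_a = wordfish_rate(alpha["a"], psi_m, beta_m, theta["a"])
rate_b = wordfish_rate(alpha["b"], psi_m, beta_m, theta["b"])
# rate_a > rate_b: politician a is expected to use word m more often
```

Words with β_m near zero (e.g. procedural vocabulary) carry no ideological signal; estimation of θ is driven by the high-discrimination words.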
Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish
More informationSum-Product Networks. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 17, 2017
Sum-Product Networks STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 17, 2017 Introduction Outline What is a Sum-Product Network? Inference Applications In more depth
More informationCombining Generative and Discriminative Methods for Pixel Classification with Multi-Conditional Learning
Combining Generative and Discriminative Methods for Pixel Classification with Multi-Conditional Learning B. Michael Kelm Interdisciplinary Center for Scientific Computing University of Heidelberg michael.kelm@iwr.uni-heidelberg.de
More informationLanguage Information Processing, Advanced. Topic Models
Language Information Processing, Advanced Topic Models mcuturi@i.kyoto-u.ac.jp Kyoto University - LIP, Adv. - 2011 1 Today s talk Continue exploring the representation of text as histogram of words. Objective:
More informationDistinguish between different types of scenes. Matching human perception Understanding the environment
Scene Recognition Adriana Kovashka UTCS, PhD student Problem Statement Distinguish between different types of scenes Applications Matching human perception Understanding the environment Indexing of images
More informationCOMP 551 Applied Machine Learning Lecture 5: Generative models for linear classification
COMP 55 Applied Machine Learning Lecture 5: Generative models for linear classification Instructor: (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp55 Unless otherwise noted, all material
More informationText mining and natural language analysis. Jefrey Lijffijt
Text mining and natural language analysis Jefrey Lijffijt PART I: Introduction to Text Mining Why text mining The amount of text published on paper, on the web, and even within companies is inconceivably
More informationTopic modeling with more confidence: a theory and some algorithms
Topic modeling with more confidence: a theory and some algorithms Long Nguyen Department of Statistics Department of EECS University of Michigan, Ann Arbor Pacific-Asia Knowledge Discovery and Data Mining,
More informationTitle Document Clustering. Issue Date Right.
NAOSITE: Nagasaki University's Ac Title Author(s) Comparing LDA with plsi as a Dimens Document Clustering Masada, Tomonari; Kiyasu, Senya; Mi Citation Lecture Notes in Computer Science, Issue Date 2008-03
More informationIntroduction to Machine Learning
Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 20: Expectation Maximization Algorithm EM for Mixture Models Many figures courtesy Kevin Murphy s
More informationA Continuous-Time Model of Topic Co-occurrence Trends
A Continuous-Time Model of Topic Co-occurrence Trends Wei Li, Xuerui Wang and Andrew McCallum Department of Computer Science University of Massachusetts 140 Governors Drive Amherst, MA 01003-9264 Abstract
More informationProbabilistic Topic Models Tutorial: COMAD 2011
Probabilistic Topic Models Tutorial: COMAD 2011 Indrajit Bhattacharya Assistant Professor Dept of Computer Sc. & Automation Indian Institute Of Science, Bangalore My Background Interests Topic Models Probabilistic
More informationUniversity of Washington Department of Electrical Engineering EE512 Spring, 2006 Graphical Models
University of Washington Department of Electrical Engineering EE512 Spring, 2006 Graphical Models Jeff A. Bilmes Lecture 1 Slides March 28 th, 2006 Lec 1: March 28th, 2006 EE512
More informationClick Prediction and Preference Ranking of RSS Feeds
Click Prediction and Preference Ranking of RSS Feeds 1 Introduction December 11, 2009 Steven Wu RSS (Really Simple Syndication) is a family of data formats used to publish frequently updated works. RSS
More informationParametric Mixture Models for Multi-Labeled Text
Parametric Mixture Models for Multi-Labeled Text Naonori Ueda Kazumi Saito NTT Communication Science Laboratories 2-4 Hikaridai, Seikacho, Kyoto 619-0237 Japan {ueda,saito}@cslab.kecl.ntt.co.jp Abstract
More informationNaïve Bayes classification
Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss
More informationAdditive Regularization of Topic Models for Topic Selection and Sparse Factorization
Additive Regularization of Topic Models for Topic Selection and Sparse Factorization Konstantin Vorontsov 1, Anna Potapenko 2, and Alexander Plavin 3 1 Moscow Institute of Physics and Technology, Dorodnicyn
More informationRecent Advances in Bayesian Inference Techniques
Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian
More informationLatent Topic Models for Hypertext
Latent Topic Models for Hypertext Amit Gruber School of CS and Eng. The Hebrew University Jerusalem 9194 Israel amitg@cs.huji.ac.il Michal Rosen-Zvi IBM Research Laboratory in Haifa Haifa University, Mount
More informationCollaborative User Clustering for Short Text Streams
Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17) Collaborative User Clustering for Short Text Streams Shangsong Liang, Zhaochun Ren, Emine Yilmaz, and Evangelos Kanoulas
More informationText Mining for Economics and Finance Supervised Learning
Text Mining for Economics and Finance Supervised Learning Stephen Hansen Text Mining Lecture 7 1 / 48 Introduction Supervised learning is the problem of predicting a response variable (i.e. dependent variable)
More informationDEKDIV: A Linked-Data-Driven Web Portal for Learning Analytics Data Enrichment, Interactive Visualization, and Knowledge Discovery
DEKDIV: A Linked-Data-Driven Web Portal for Learning Analytics Data Enrichment, Interactive Visualization, and Knowledge Discovery Yingjie Hu, Grant McKenzie, Jiue-An Yang, Song Gao, Amin Abdalla, and
More informationLarge-Scale Feature Learning with Spike-and-Slab Sparse Coding
Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Ian J. Goodfellow, Aaron Courville, Yoshua Bengio ICML 2012 Presented by Xin Yuan January 17, 2013 1 Outline Contributions Spike-and-Slab
More informationApplying Latent Dirichlet Allocation to Group Discovery in Large Graphs
Lawrence Livermore National Laboratory Applying Latent Dirichlet Allocation to Group Discovery in Large Graphs Keith Henderson and Tina Eliassi-Rad keith@llnl.gov and eliassi@llnl.gov This work was performed
More informationIncorporating Domain Knowledge into Topic Modeling via Dirichlet Forest Priors
Incorporating Domain Knowledge into Topic Modeling via Dirichlet Forest Priors David Andrzejewski, Xiaojin Zhu, Mark Craven University of Wisconsin Madison ICML 2009 Andrzejewski (Wisconsin) Dirichlet
More informationDecoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process
Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process Chong Wang Computer Science Department Princeton University chongw@cs.princeton.edu David M. Blei Computer Science Department
More informationNeural Networks for Web Page Classification Based on Augmented PCA
Neural Networks for Web Page Classification Based on Augmented PCA Ali Selamat and Sigeru Omatu Division of Computer and Systems Sciences, Graduate School of Engineering, Osaka Prefecture University, Sakai,
More informationClass-Specific Simplex-Latent Dirichlet Allocation for Image Classification
Class-Specific Simplex-Latent Dirichlet Allocation for Image Classification Mandar Dixit, Nikhil Rasiwasia, Nuno Vasconcelos Department of Electrical and Computer Engineering University of California,
More informationClass-Specific Simplex-Latent Dirichlet Allocation for Image Classification
3 IEEE International Conference on Computer Vision Class-Specific Simplex-Latent Dirichlet Allocation for Image Classification Mandar Dixit, Nikhil Rasiwasia, Nuno Vasconcelos Department of Electrical
More informationLanguage Models. CS6200: Information Retrieval. Slides by: Jesse Anderton
Language Models CS6200: Information Retrieval Slides by: Jesse Anderton What s wrong with VSMs? Vector Space Models work reasonably well, but have a few problems: They are based on bag-of-words, so they
More informationI L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
Introduction Edps/Psych/Stat/ 584 Applied Multivariate Statistics Carolyn J Anderson Department of Educational Psychology I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN c Board of Trustees,
More informationExpectation-Propagation for the Generative Aspect Model
Expectation-Propagation for the Generative Aspect Model Thomas Minka Department of Statistics Carnegie Mellon University Pittsburgh, PA 15213 USA minkastat.cmu.edu John Lafferty School of Computer Science
More informationMining Newsgroups Using Networks Arising From Social Behavior by Rakesh Agrawal et al. Presented by Will Lee
Mining Newsgroups Using Networks Arising From Social Behavior by Rakesh Agrawal et al. Presented by Will Lee wwlee1@uiuc.edu September 28, 2004 Motivation IR on newsgroups is challenging due to lack of
More informationBehavioral Data Mining. Lecture 2
Behavioral Data Mining Lecture 2 Autonomy Corp Bayes Theorem Bayes Theorem P(A B) = probability of A given that B is true. P(A B) = P(B A)P(A) P(B) In practice we are most interested in dealing with events
More informationGaussian Mixture Model
Case Study : Document Retrieval MAP EM, Latent Dirichlet Allocation, Gibbs Sampling Machine Learning/Statistics for Big Data CSE599C/STAT59, University of Washington Emily Fox 0 Emily Fox February 5 th,
More informationDimension Reduction (PCA, ICA, CCA, FLD,
Dimension Reduction (PCA, ICA, CCA, FLD, Topic Models) Yi Zhang 10-701, Machine Learning, Spring 2011 April 6 th, 2011 Parts of the PCA slides are from previous 10-701 lectures 1 Outline Dimension reduction
More information