Statistical Topic Models for Computational Social Science Hanna M. Wallach

Similar documents
topic modeling hanna m. wallach

Gaussian Mixture Model

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Machine Learning

Text mining and natural language analysis. Jefrey Lijffijt

Applying LDA topic model to a corpus of Italian Supreme Court decisions

Latent Dirichlet Allocation Introduction/Overview

Topic Models and Applications to Short Documents

Topic Modeling: Beyond Bag-of-Words

Topic Models. Brandon Malone. February 20, Latent Dirichlet Allocation Success Stories Wrap-up

Bayesian Nonparametrics for Speech and Signal Processing

Evaluation Methods for Topic Models

Pachinko Allocation: DAG-Structured Mixture Models of Topic Correlations

Distributed ML for DOSNs: giving power back to users

Recent Advances in Bayesian Inference Techniques

Latent Dirichlet Allocation

Collaborative Topic Modeling for Recommending Scientific Articles

A Continuous-Time Model of Topic Co-occurrence Trends

Text Mining for Economics and Finance Latent Dirichlet Allocation

Information retrieval LSI, plsi and LDA. Jian-Yun Nie

Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA)

Probabilistic Matrix Factorization

Understanding Comments Submitted to FCC on Net Neutrality. Kevin (Junhui) Mao, Jing Xia, Dennis (Woncheol) Jeong December 12, 2014

Gaussian Models

Topic Models. Advanced Machine Learning for NLP Jordan Boyd-Graber OVERVIEW. Advanced Machine Learning for NLP Boyd-Graber Topic Models 1 of 1

PROBABILISTIC LATENT SEMANTIC ANALYSIS

Replicated Softmax: an Undirected Topic Model. Stephen Turner

Study Notes on the Latent Dirichlet Allocation

Kernel Density Topic Models: Visual Topics Without Visual Words

Introduction to Graphical Models

Introduction. Chapter 1

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

WHO lunchtime seminar Mapping child growth failure in Africa between 2000 and Professor Simon I. Hay March 12, 2018

Modeling User Rating Profiles For Collaborative Filtering

Lecture 13 : Variational Inference: Mean Field Approximation

Document and Topic Models: plsa and LDA

Collaborative topic models: motivations cont

STA 414/2104: Lecture 8

Data Mining Techniques

Probability models for machine learning. Advanced topics ML4bio 2016 Alan Moses

An Alternative Prior Process for Nonparametric Bayesian Clustering

Mixtures of Multinomials

Language Information Processing, Advanced. Topic Models

Latent Dirichlet Allocation Based Multi-Document Summarization

Latent Variable Models Probabilistic Models in the Study of Language Day 4

Probabilistic Latent Semantic Analysis

Algorithm-Independent Learning Issues

Generative Clustering, Topic Modeling, & Bayesian Inference

Click Prediction and Preference Ranking of RSS Feeds

Applying Latent Dirichlet Allocation to Group Discovery in Large Graphs

Introduction: MLE, MAP, Bayesian reasoning (28/8/13)

10/17/04. Today s Main Points

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang

Content-based Recommendation

Presentation of the Cooperation Project goals. Nicola Ferrè

Bagging During Markov Chain Monte Carlo for Smoother Predictions

Fast Inference and Learning for Modeling Documents with a Deep Boltzmann Machine

CS Lecture 18. Topic Models and LDA

Using Both Latent and Supervised Shared Topics for Multitask Learning

Machine Learning (CS 567) Lecture 2

Classification Techniques with Applications in Remote Sensing

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3

Russell Hanson DFCI April 24, 2009

CELL AND MICROBIOLOGY Nadia Iskandarani

AN INTRODUCTION TO TOPIC MODELS

Tutorial on Approximate Bayesian Computation

DEKDIV: A Linked-Data-Driven Web Portal for Learning Analytics Data Enrichment, Interactive Visualization, and Knowledge Discovery

Design of Text Mining Experiments. Matt Taddy, University of Chicago Booth School of Business faculty.chicagobooth.edu/matt.

Press Release BACTERIA'S KEY INNOVATION HELPS UNDERSTAND EVOLUTION

AUTOMATIC DETECTION OF WORDS NOT SIGNIFICANT TO TOPIC CLASSIFICATION IN LATENT DIRICHLET ALLOCATION

Spatial Bayesian Nonparametrics for Natural Image Segmentation

Inference in Explicit Duration Hidden Markov Models

Towards Association Rules with Hidden Variables

Forecasting Wind Ramps

Classical Predictive Models

UN-GGIM: Strengthening Geospatial Capability

Introduction to Reliability Theory (part 2)

Ronald Christensen. University of New Mexico. Albuquerque, New Mexico. Wesley Johnson. University of California, Irvine. Irvine, California

Topic modeling with more confidence: a theory and some algorithms

Forecasting with Expert Opinions

Topic Models. Charles Elkan November 20, 2008

Welcome to CAMCOS Reports Day Fall 2011

Introduction To Machine Learning

Nonparametric Bayes Pachinko Allocation

CSCI-567: Machine Learning (Spring 2019)

Latent variable models for discrete data

Welcome Survey getting to know you Collect & log Supplies received Classroom Rules Curriculum overview. 1 : Aug 810. (3 days) 2nd: Aug (5 days)

An Alternative Prior Process for Nonparametric Bayesian Clustering

Topic Modeling Using Latent Dirichlet Allocation (LDA)

BetaZi SCIENCE AN OVERVIEW OF PHYSIO-STATISTICS FOR PRODUCTION FORECASTING. EVOLVE, ALWAYS That s geologic. betazi.com. geologic.

Approximate Bayesian Computation

Research Statement on Statistics Jun Zhang

Fast Likelihood-Free Inference via Bayesian Optimization

Topic Modelling and Latent Dirichlet Allocation

Bayesian Nonparametric Accelerated Failure Time Models for Analyzing Heterogeneous Treatment Effects

Clustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26

Machine Learning, Fall 2009: Midterm

CS145: INTRODUCTION TO DATA MINING

Transcription:

Statistical Topic Models for Computational Social Science University of Massachusetts Amherst wallach@cs.umass.edu

Complex Social Processes 2

Traditional Social Science Case studies Interviews Participant observation Survey research Social network analysis Self-reports, one-time snapshots, small scale 3

The Computer Revolution 4

Computational Social Science "A computational social science is emerging that leverages the capacity to collect and analyze data with an unprecedented breadth and depth and scale and may reveal patterns of individual and group behaviors. Lazer et al., 2009 5

Structure vs. Content 6

Products of Interactions Scientific information is both the basic raw material for, and one of the principal products of, scientific research [ ] Scientists find out what other scientists are accomplishing through [...] journals, books, abstracts and indexes, bibliographies, reviews. NSF Brochure, 1962 7

Text as Data Structured and formal: e.g., publications, patents, press releases Messy and unstructured: e.g., chat logs, OCRed documents, transcripts Large scale, robust methods for analyzing text 8

Collaborate to Study Collaboration There needs to be a greater focus on what these [interaction] data mean [...] This requires the input of social scientists, rather than just those more traditionally involved in data capture, such as computer scientists. Julia Lane, NSF, 24 March 2010 9

Different (But Overlapping) Roles Social science: specific models for specific applications, extensive post-analysis work Computer science: novel classes of models, mathematical and computational properties of models that extend across applications 10

This Talk Statistical topic models for text analysis Off-the-shelf topic models: priors, stop words Studying formerly-classified government documents 11

Statistical Modeling Modeling challenges: Aggregating and representing large data sets Handling data from sources with disparate emphases Reasoning under uncertain information Performing efficient inference Bayesian latent (hidden) variable models: Powerful and flexible This talk: statistical topic models [Wallach et al. & Adams et al., AISTATS '10] 12

Statistical Topic Modeling Three fundamental assumptions: Documents have latent semantic structure ( topics ) We can infer topics from worddocument co-occurrences Can simulate this inference algorithmically Given a data set, the goal is to Learn the composition of the topics that best represent it Learn which topics are used in each document 13

Why Topic Models? kriging covariance mean estimate weight random mse matrix conditional point vs. gaussian regression covariance prediction function bayesian process prior distribution matrix 14

probability Topics and Words human evolution disease computer genome evolutionary host models dna species bacteria information genetic organisms diseases data genes life resistance computers sequence origin bacterial system gene biology new network molecular groups strains systems sequencing phylogenetic control model map living infectious parallel............ 15

Documents and Topics 16

Mixtures vs. Admixtures topics mixture documents topics admixture documents 17

Generative Statistical Modeling Assume data was generated by a probabilistic model: Model may have hidden structure (latent variables) Model defines a joint distribution over all variables Model parameters are unknown Infer hidden structure and model parameters from data Situate new data in estimated model 18

... Generative Process probability 19

... Choose a Distribution Over Topics probability 20

... Choose a Topic probability 21

... Choose a Word probability 22

... And So On probability 23

???... Real Data: Statistical Inference?? probability 24

... The End Result... probability 25

This Talk Statistical topic models for text analysis Off-the-shelf topic models: priors, stop words Studying formerly-classified government documents 26

The State of The Art Topic models are extremely appealing but they're not always usable by non-experts Need to bridge this gap between producers and consumers of topic modeling technology: Address problems/challenges faced by practitioners Question unquestioned assumptions Explore the interplay between theory and practice 27

Off-the-Shelf Topic Modeling I want to model technology emergence by analyzing patent abstracts... I have a statistical model that you can use... 28

Off-the-Shelf Topic Modeling I want to model technology emergence by analyzing patent abstracts... I have a statistical model that you can use... a a the the field the of invention emission carbon a of an and to to electron gas and present............ 29

Off-the-Shelf Topic Modeling? Help! All my topics consist of the, and of, to, a... Now they all consist of invention, present, thereof... Wait, but how do I choose the right number of topics? Preprocess your data to remove stop words... Make a domain-specific list of stop words... Evaluate the probability of unseen data for different numbers... 30

Directed Graphical Models Nodes: random variables (latent or observed) Edges: probabilistic dependencies between variables Plates: macros that allow subgraphs to be replicated 31

Statistical Topic Modeling [Hofmann, '99] topic assignment document-specific topic distribution topics observed word 32

Latent Dirichlet Allocation (LDA) [Blei, Ng & Jordan, '03] topic assignment topics Dirichlet distribution Dirichlet distribution document-specific topic distribution observed word 33

Discrete Probability Distributions 3-dimensional discrete probability distributions can be visually represented in 2-dimensional space: A B C 34

Dirichlet Distribution Distribution over discrete probability distributions: base measure (mean) A... B C concentration parameter 35

Dirichlet Parameters 36

Dirichlet Priors for LDA symmetric priors: uniform base measures 37

Dirichlet Priors for LDA Two scalar concentration parameters: α and β Concentration parameters are usually set heuristically e.g., and Some recent work on learning optimal values for the concentration parameters from data No rigorous study of the Dirichlet priors: e.g., asymmetric vs. symmetric base measures Effects of the base measures on the inferred topics 38

Symmetric Asymmetric [Wallach et al., '09] Use prior over as a running example Uniform base measure nonuniform base measure Asymmetric prior: some topics more likely a priori 39

Hierarchical Asymmetric Dirichlet Which topics should be more probable a priori? Draw from a Dirichlet distribution: A A... B B C C 40

Putting Everything Together Asymmetric hierarchical Dirichlet priors Integrate out Learn, and base measures and concentration parameters from data 41

Data Sets Carbon nanotechnology patents: Ultimate goal: track innovation and emergence Fullerene and carbon nanotube patents 1,016 abstracts (~100 words each) 103,499 total words; 6,068 unique words 20 Newsgroups data (80,012 total words) New York Times articles (477,465 total words) 42

Inferred Topics before after a a the the field the of invention emission carbon a of an and to to electron gas and present............ the carbon metal composite a nanotubes catalytic polymer of nanotube transition matrix to catalyst catalyst weight and substrate from fiber............ 43

Sampled Concentration Parameters 44

A Theoretical Observation... Symmetric Dirichlet is a special case of the hierarchical asymmetric Dirichlet (large concentration parameter) A A... B B C C 45

Sampled Concentration Parameters 46

Intuition Topics should be distinct from each other: Asymmetric prior over topics makes topics more similar to each other (and to corpus-wide word frequencies) Want a symmetric prior to preserve topic distinctness Still have to account for power-law word usage: Asymmetric prior over document-specific topic distributions means some topics (e.g., the, a, of, to... ) can be used more often than others in all documents 47

Off-the-Shelf Topic Modeling I can model technology emergence by analyzing patent abstracts! Great! Let me know if you need any more help! the carbon metal composite a nanotubes catalytic polymer of nanotube transition matrix to catalyst catalyst weight and substrate from fiber............ 48

Polylingual Topics 49

Polylingual Topics 50

Polylingual Topic Model [Mimno et al., '09]...... tuple of aligned documents language-specific Dirichlet parameters 51

This Talk Statistical topic models for text analysis Off-the-shelf topic models: priors, stop words Studying formerly-classified government documents 52

In 2009 Alone... 52 million pages reviewed for declassification 29 million pages declassified $8.8 billion spent on administration of the US government classification system 53

How Sensitive? After a 14-year legal battle by a California history professor, the FBI has released a new cache of material from a 300-page dossier on the late rock star John Lennon, and has agreed to pay $204,000 to cover legal fees incurred in his efforts to open the file. For all the years of challenge, however, the file contains little, if any, new information about Lennon, though it does present some bizarre details, like a description of an antiwar activist trying to train a parrot to speak profanities. NYT, 25 September 2007 54

A Problematic Trade-off The more data kept secret, the less secure the data: More people need to have access to the data More storage space is required 55

What We Are NOT Studying... 56

Exploring Declassified Documents Declassification goals: Recommend documents for human review Match documents with human reviewers' expertise Transparency research goals: High-level characterization of the data Finding specific, known information of interest Finding interesting or unexpected information 57

Declassified Documents: DDRS ~88,000 formerly-classified government documents Created and declassified between 1926 and 2005 58

Available Information Sanitized? Classification level Issuer Creation date Document type Declassification date 59

Available Information Sanitized? Classification level Issuer Creation date Document type Declassification date 60

Available Information Sanitized? Classification level Issuer Creation date Document type Declassification date 61

Available Information Sanitized? Classification level Issuer Creation date Document type Declassification date 62

Available Information Sanitized? Classification level Issuer Creation date Document type Declassification date 63

Available Information Sanitized? Classification level Issuer Creation date Type Declassification date 64

Declassification Durations 65

Survival Analysis Statistical methods for evaluating time until death : Biology/medicine: organism death Engineering: component failure Social science: event durations (e.g., parolee recidivism) Goal: model effect on survival time of covariates, e.g., Vaccine treatments Temperature differences Job placement or education programs 66

Document Survival creation date 2/21/67 26 years declassification date 4/6/93 67

Survival Distribution of Documents 68

Accelerated Failure Time Models Survival analysis with covariates Linear models for the log of the duration : Parametric: a probability distribution is specified e.g., Weibull, log-normal, gamma, log-logistic... Can make predictions for unseen data 69

Classification and Content 1975 to 1989 1946 to 2003 70

Classification and Content: 1960s 71

Classification and Content: 1960s 72

Word Frequencies? rhodesia africa southern khama white african namibia 73

portion of typical document in topic Topics in Declassified Documents creation/declassification date corps, service, volunteers, men, volunteer, age, draft, selective, calls, young, manpower, year, army, deferments, induction, armed, freedom,... 74

portion of typical document in topic Topics in Declassified Documents creation/declassification date package, hostages, release, hostage, khomeini, packages, ghotbzadeh, held, released, banisadr, revolutionary, debriefing, scenario, family, date,... 75

portion of typical document in topic Topics in Declassified Documents creation/declassification date oswald, dallas, assassination, kennedy, texas, fbi, orleans, advised, lee, president, bureau, started, harvey, john, information, ruby, november,... 76

portion of typical document in topic Topics in Declassified Documents creation/declassification date artichoke, subject, drugs, techniques, work, interrogation, writer, drug, lsd, effects, hypnosis, methods, medical, physical, subjects, human,... 77

Predicting Duration Using Topics 78

Predicted Duration - Actual Duration 79

Jointly Modeling Text and Duration Topics provide information about classification durations Goal: incorporate durations into the generative model Infer latent topics using both textual and temporal information 80

Jointly Modeling Text and Duration [Shorey et al., '11] log secrecy duration normal-gamma distribution Dirichlet distribution Dirichlet distribution normal-gamma distribution creation date 81

Topic-Specific Duration Distributions 82

Topic-Specific Duration Distributions means of all LDA topics relating to Vietnam & Laos 83

What's Next? Predict durations directly from the generative model Mixture vs. admixture topics Supervised topic modeling Unseen content Subject matter experts Analysis and prediction of redactions 84

Thanks! Acknowledgements: B. Desmarais, A. McCallum, D. Mimno, R. Shorey wallach@cs.umass.edu http://www.cs.umass.edu/~wallach/