How to Read 100 Million Blogs (& Classify Deaths Without Physicians)


Gary King, Institute for Quantitative Social Science, Harvard University
(7/1/08 talk at IBM)

References

Daniel Hopkins and Gary King. "Extracting Systematic Social Science Meaning from Text" (commercialized via: [logo])
Gary King and Ying Lu. "Verbal Autopsy Methods with Multiple Causes of Death," forthcoming, Statistical Science
In use by (among others): [logos]
Copies at http://gking.harvard.edu

Inputs and Target Quantities of Interest

Input data:
- A large set of text documents (blogs, web pages, emails, etc.)
- A set of (mutually exclusive and exhaustive) categories
- A small set of documents hand-coded into the categories

Quantities of interest:
- Individual document classifications (e.g., spam filters)
- The proportion of documents in each category (e.g., the proportion of email that is spam)

Estimation:
- The second quantity can be obtained by counting the first (though this turns out not to be necessary)
- High classification accuracy does not imply unbiased category proportions
- Different methods optimize estimation of the different quantities

Blogs as a Running Example Gary King (Harvard, IQSS) Content Analysis 4 / 30

Blogs as a Running Example Blogs (web logs): web version of a daily diary, with posts listed in reverse chronological order. Gary King (Harvard, IQSS) Content Analysis 4 / 30

Blogs as a Running Example Blogs (web logs): web version of a daily diary, with posts listed in reverse chronological order. 8% of U.S. Internet users (12 million) have blogs Gary King (Harvard, IQSS) Content Analysis 4 / 30

Blogs as a Running Example Blogs (web logs): web version of a daily diary, with posts listed in reverse chronological order. 8% of U.S. Internet users (12 million) have blogs Growth: 0 in 2000; 78 108 million worldwide now. Gary King (Harvard, IQSS) Content Analysis 4 / 30

Blogs as a Running Example Blogs (web logs): web version of a daily diary, with posts listed in reverse chronological order. 8% of U.S. Internet users (12 million) have blogs Growth: 0 in 2000; 78 108 million worldwide now. A democratic technology: 6 million in China and 700,000 in Iran Gary King (Harvard, IQSS) Content Analysis 4 / 30

Blogs as a Running Example Blogs (web logs): web version of a daily diary, with posts listed in reverse chronological order. 8% of U.S. Internet users (12 million) have blogs Growth: 0 in 2000; 78 108 million worldwide now. A democratic technology: 6 million in China and 700,000 in Iran We are living through the largest expansion of expressive capability in the history of the human race Gary King (Harvard, IQSS) Content Analysis 4 / 30

Blogs as a Running Example Blogs (web logs): web version of a daily diary, with posts listed in reverse chronological order. 8% of U.S. Internet users (12 million) have blogs Growth: 0 in 2000; 78 108 million worldwide now. A democratic technology: 6 million in China and 700,000 in Iran We are living through the largest expansion of expressive capability in the history of the human race Measures classical notion of public opinion: active public expressions designed to influence policy and politics (previously: strikes, boycotts, demonstrations, editorials) Gary King (Harvard, IQSS) Content Analysis 4 / 30

Blogs as a Running Example Blogs (web logs): web version of a daily diary, with posts listed in reverse chronological order. 8% of U.S. Internet users (12 million) have blogs Growth: 0 in 2000; 78 108 million worldwide now. A democratic technology: 6 million in China and 700,000 in Iran We are living through the largest expansion of expressive capability in the history of the human race Measures classical notion of public opinion: active public expressions designed to influence policy and politics (previously: strikes, boycotts, demonstrations, editorials) (Public opinion surveys) Gary King (Harvard, IQSS) Content Analysis 4 / 30

One specific quantity of interest

Daily opinion about President Bush and 2008 candidates among all English-language blog posts.

Specific categories:
  Label  Category
   -2    extremely negative
   -1    negative
    0    neutral
    1    positive
    2    extremely positive
   NA    no opinion expressed
   NB    not a blog

Hard case:
- Part ordinal, part nominal categorization
- Sentiment categorization is more difficult than topic classification
- Informal language: "my crunchy gf thinks dubya hid the wmd's, :)!"
- Little common internal structure (no inverted pyramid)

The Conversation about John Kerry's Botched Joke

"You know, education, if you make the most of it... you can do well. If you don't, you get stuck in Iraq."

[Figure: Affect Towards John Kerry; proportion of posts in each category (-2 to 2) by day, Sept 2006 through Mar 2007]

Representing Text as Numbers

- Filter: choose English-language blog posts that mention Bush
- Preprocess: convert to lower case, remove punctuation, keep only word stems ("consist", "consisted", "consistency" -> "consist")
- Code variables: presence/absence of unique unigrams, bigrams, trigrams

Our example:
- Our 10,771 blog posts about Bush and Clinton: 201,676 unigrams, 2,392,027 bigrams, 5,761,979 trigrams
- Keep only unigrams appearing in more than 1% and fewer than 99% of documents: 3,672 variables
- This groups the infinite set of possible posts into only 2^3,672 distinct types
- More sophisticated summaries exist (we've used them), but they're not necessary
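
To make this pipeline concrete, here is a minimal Python sketch of the same steps. It assumes NLTK's Porter stemmer is available; the helper names stem_tokens and presence_matrix are hypothetical, not part of the authors' software.

```python
import re
from collections import Counter

from nltk.stem import PorterStemmer  # any stemmer would do; NLTK's is assumed here


def stem_tokens(text):
    """Lowercase, strip punctuation, and reduce each word to its stem."""
    stemmer = PorterStemmer()
    words = re.findall(r"[a-z]+", text.lower())
    return {stemmer.stem(w) for w in words}


def presence_matrix(docs, min_frac=0.01, max_frac=0.99):
    """Binary presence/absence of word stems, keeping only stems that appear
    in more than min_frac and fewer than max_frac of the documents."""
    stem_sets = [stem_tokens(d) for d in docs]
    doc_freq = Counter(s for stems in stem_sets for s in stems)
    n = len(docs)
    vocab = sorted(s for s, c in doc_freq.items() if min_frac < c / n < max_frac)
    rows = [[1 if s in stems else 0 for s in vocab] for stems in stem_sets]
    return vocab, rows


# Toy usage: two short "posts"
vocab, X = presence_matrix(["dubya hid the wmds", "bush spoke about education"])
```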

Notation

Document category:

  D_i ∈ { -2 (extremely negative), -1 (negative), 0 (neutral), 1 (positive), 2 (extremely positive), NA (no opinion expressed), NB (not a blog) }

Word stem profile S_i = (S_i1, ..., S_iK), where
  S_i1 = 1 if "awful" is used, 0 if not
  S_i2 = 1 if "good" is used, 0 if not
  ...
  S_iK = 1 if "except" is used, 0 if not

Quantities of Interest

Computer science: individual document classifications D_1, D_2, ..., D_L

Social science: the proportion of documents in each category,

  P(D) = ( P(D = -2), P(D = -1), P(D = 0), P(D = 1), P(D = 2), P(D = NA), P(D = NB) )'

Issues with Existing Statistical Approaches

1. Direct sampling
   - Biased without a random sample
   - Nonrandomness is common due to population drift, data subdivisions, etc.
   - (Classification of population documents is not necessary)

2. Aggregation of model-based individual classifications
   - Biased without a random sample
   - Models P(D | S), but the world works as P(S | D)
   - Bias unless P(D | S) encompasses the true model and S spans the space of all predictors of D (i.e., all information in the document)
   - Bias even with optimal classification and a high percentage correctly classified

Using Misclassification Rates to Correct Proportions

- Use some method to classify the unlabeled documents
- Aggregate the classifications into category proportions
- Use the labeled set to estimate misclassification rates (by cross-validation)
- Use the misclassification rates to correct the proportions

Result: vastly improved estimates of the category proportions
- (No new assumptions beyond those of the classifier)
- (Still requires random samples, individual classification, etc.)

Formalization from Epidemiology (Levy and Kass, 1970)

Accounting identity for 2 categories:

  P(D̂ = 1) = sens · P(D = 1) + (1 - spec) · P(D = 2)

Solve for the true proportion:

  P(D = 1) = [ P(D̂ = 1) - (1 - spec) ] / [ sens - (1 - spec) ]

Use this equation to correct P(D̂ = 1).
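
As a concrete illustration, here is a minimal sketch of this correction in Python. The function name is hypothetical; sens and spec would be estimated by cross-validation on the labeled set, as described on the previous slide.

```python
def corrected_proportion(p_hat, sens, spec):
    """Levy-Kass correction for two categories.

    p_hat: raw proportion of documents classified into category 1
    sens:  P(D_hat = 1 | D = 1), estimated in the labeled set
    spec:  P(D_hat = 2 | D = 2), estimated in the labeled set
    """
    denom = sens - (1.0 - spec)
    if abs(denom) < 1e-12:
        raise ValueError("sensitivity + specificity must differ from 1")
    p = (p_hat - (1.0 - spec)) / denom
    return min(max(p, 0.0), 1.0)  # clip to the unit interval


# Example: a classifier labels 40% of documents as category 1,
# with estimated sensitivity 0.80 and specificity 0.90
print(corrected_proportion(0.40, 0.80, 0.90))  # about 0.43
```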

Generalizations: J Categories, No Individual Classification (King and Lu, 2008, in press)

Accounting identity for J categories:

  P(D̂ = j) = Σ_{j'=1}^{J} P(D̂ = j | D = j') P(D = j')

Drop the D̂ calculation, since D̂ = f(S):

  P(S = s) = Σ_{j'=1}^{J} P(S = s | D = j') P(D = j')

Simplify to an equivalent matrix expression:

  P(S) = P(S | D) P(D)

Estimation

The matrix expression again:

    P(S)     =    P(S | D)    P(D)
  (2^K x 1)      (2^K x J)   (J x 1)

- P(D): document category proportions (the quantity of interest)
- P(S): word stem profile proportions (estimate in the unlabeled set by tabulation)
- P(S | D): word stem profiles by category (estimate in the labeled set by tabulation)

In alternative symbols, to emphasize the linear equation: Y = Xβ, so β = (X'X)^{-1} X'y solves for the quantity of interest (with no error term).

Technical estimation issues:
- 2^K is enormous, far larger than any existing computer
- P(S) and P(S | D) will be too sparse
- Elements of P(D) must be between 0 and 1 and sum to 1

Solutions:
- Use subsets of S and average the results (equivalent to kernel density smoothing of sparse categorical data)
- Use constrained least squares to constrain P(D) to the simplex

Result: fast, accurate, with very little (human) tuning required.
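
The following sketch illustrates this estimation strategy: tabulate P(S) in the unlabeled set and P(S | D) in the labeled set on random subsets of stems, solve the simplex-constrained least-squares problem with SciPy's SLSQP, and average over subsets. The helper names (profile_probs, estimate_proportions) and tuning values are hypothetical; this is an illustration of the idea, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize


def profile_probs(S, subset):
    """Empirical distribution over the 2^k stem profiles within a stem subset.

    S is a binary (documents x stems) integer array.
    """
    k = len(subset)
    codes = S[:, subset] @ (1 << np.arange(k))  # encode each row's profile as an int
    return np.bincount(codes, minlength=1 << k) / len(codes)


def estimate_proportions(S_labeled, D_labeled, S_unlabeled,
                         n_subsets=100, subset_size=10, seed=0):
    """Estimate P(D) by solving P(S) = P(S|D) P(D) with least squares
    constrained to the simplex, averaged over random subsets of stems.

    D_labeled is an array of category labels for the labeled documents.
    """
    rng = np.random.default_rng(seed)
    cats = np.unique(D_labeled)
    J, K = len(cats), S_labeled.shape[1]
    estimates = []
    for _ in range(n_subsets):
        subset = rng.choice(K, size=min(subset_size, K), replace=False)
        y = profile_probs(S_unlabeled, subset)                        # P(S)
        X = np.column_stack([profile_probs(S_labeled[D_labeled == c], subset)
                             for c in cats])                          # P(S | D)
        res = minimize(lambda b: np.sum((y - X @ b) ** 2),
                       x0=np.full(J, 1.0 / J), method="SLSQP",
                       bounds=[(0.0, 1.0)] * J,
                       constraints=[{"type": "eq",
                                     "fun": lambda b: b.sum() - 1.0}])
        estimates.append(res.x)
    return cats, np.mean(estimates, axis=0)  # category labels and estimated P(D)
```

Averaging over many small stem subsets is what keeps each tabulation dense enough to be usable while still exploiting most of the information in S.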

A Nonrandom Hand-coded Sample

[Figure: differences between the hand-coded sample and the population in document category frequencies (P_h(D) vs. P(D)) and in word profile frequencies (P_h(S) vs. P(S))]

All existing methods would fail with these data.

Accurate Estimates

[Figure: estimated P(D) vs. actual P(D)]

Out-of-sample Comparison: 60 Seconds vs. 8.7 Days

[Figure: affect in blogs; estimated P(D) vs. actual P(D)]

Out-of-sample Validation: Other Examples

[Figure: estimated P(D) vs. actual P(D) for congressional speeches, immigration editorials, and Enron emails]

Verbal Autopsy Methods

The problem:
- Policymakers need the cause-specific mortality rate to set research goals, budgetary priorities, and ameliorative policies
- High-quality death registration: only 23 of 192 countries

Existing approaches:
- Verbal autopsy: ask relatives or caregivers 50-100 symptom questions
- Ask physicians to determine the cause of death (low intercoder reliability)
- Apply expert algorithms (high reliability, low validity)
- Find deaths with medically certified causes from a local hospital, trace caregivers to their homes, ask the same symptom questions, and statistically classify deaths in the population (model-dependent, low accuracy)

An Alternative Approach

Document category = cause of death:

  D_i ∈ { 1 (bladder cancer), 2 (cardiovascular disease), 3 (transportation accident), ..., J (infectious respiratory) }

Word stem profile = symptoms: S_i = (S_i1, ..., S_iK), where
  S_i1 = 1 if breathing difficulties, 0 if not
  S_i2 = 1 if stomach ache, 0 if not
  ...
  S_iK = 1 if diarrhea, 0 if not

Apply the same methods.

Validation in Tanzania

[Figure: estimated vs. true cause-of-death proportions, and estimation errors, for a random split sample and a community sample]

Validation in China

[Figure: estimated vs. true cause-of-death proportions, and estimation errors, for a random split sample and two city samples]

Implications for an Individual Classifier

All existing classifiers assume: P_h(S, D) = P(S, D)

For a different quantity, we assume only: P_h(S | D) = P(S | D)

How to use this (less restrictive) assumption for classification (Bayes theorem):

  P(D_l = j | S_l = s_l) = P(S_l = s_l | D_l = j) P(D_l = j) / P(S_l = s_l)

- P(D_l = j | S_l = s_l): the goal, individual classification
- P(D_l = j): output from our estimator (described above)
- P(S_l = s_l | D_l = j): nonparametric estimate from the labeled set (an assumption)
- P(S_l = s_l): nonparametric estimate from the unlabeled set (no assumption)
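
A toy sketch of this Bayes-rule classifier follows. The helper name is hypothetical; P(S = s | D = j) is approximated here by counting exact profile matches in the labeled set with a small smoothing constant, which is only workable for very short profiles, and the denominator P(S = s) is constant across categories so it is not needed for the argmax.

```python
import numpy as np


def classify(s, S_labeled, D_labeled, p_d, cats, alpha=0.5):
    """Assign stem profile s to the category j maximizing
    P(S = s | D = j) * P(D = j).

    p_d holds the category proportions P(D) from the aggregate estimator,
    not the hand-coded sample's own proportions.
    """
    posteriors = []
    for j, c in enumerate(cats):
        Sc = S_labeled[D_labeled == c]
        matches = np.all(Sc == s, axis=1).sum()           # exact profile matches
        lik = (matches + alpha) / (len(Sc) + 2 * alpha)   # smoothed P(S = s | D = c)
        posteriors.append(lik * p_d[j])
    posteriors = np.array(posteriors)
    return cats[posteriors.argmax()], posteriors / posteriors.sum()
```

The essential point the sketch tries to capture is that the prior P(D = j) comes from the corrected aggregate estimator rather than from the hand-coded sample, which is what relaxes the P_h(S, D) = P(S, D) assumption.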

Classification with Less Restrictive Assumptions

[Figure: hand-coded vs. population quantities: P_h(D = j) vs. P(D = j) and P_h(S_k = 1) vs. P(S_k = 1)]

Classification with Less Restrictive Assumptions

[Figure: estimated P̂(D = j) vs. true P(D = j) for SVM and for the nonparametric approach]

Percent correctly classified:
- SVM (best existing classifier): 40.5%
- Our nonparametric approach: 59.8%

Misclassification Matrix for Blog Posts

         -2    -1     0     1     2    NA    NB    P(D)
 -2     .70   .10   .01   .01   .00   .02   .16    .28
 -1     .33   .25   .04   .02   .01   .01   .35    .08
  0     .13   .17   .13   .11   .05   .02   .40    .02
  1     .07   .06   .08   .20   .25   .01   .34    .03
  2     .03   .03   .03   .22   .43   .01   .25    .03
 NA     .04   .01   .00   .00   .00   .81   .14    .12
 NB     .10   .07   .02   .02   .02   .04   .75    .45

(Each row of misclassification rates sums to 1.)

SIMEX Analysis of the "Not a Blog" Category

[Figure: estimated proportion in category NB (vertical axis, 0 to 1) plotted against the SIMEX parameter α (horizontal axis, -1 to 3), built up over three slides]

SIMEX Analysis of Other Categories

[Figure: estimated proportions for categories -2, -1, 0, 1, 2, and NA (each 0 to 0.5) plotted against the SIMEX parameter α (-1 to 3)]

For more information: http://gking.harvard.edu