How to Read 100 Million Blogs (& Classify Deaths Without Physicians)
|
|
- Harold Allen
- 5 years ago
- Views:
Transcription
1 How to Read 100 Million Blogs (& Classify Deaths Without Physicians) Gary King Institute for Quantitative Social Science Harvard University (7/1/08 talk at IBM) () (7/1/08 talk at IBM) 1 / 30
2 References Daniel Hopkins and Gary King. Extracting Systematic Social Science Meaning from Text Gary King (Harvard, IQSS) Content Analysis 2 / 30
3 References Daniel Hopkins and Gary King. Extracting Systematic Social Science Meaning from Text commercialized via: Gary King (Harvard, IQSS) Content Analysis 2 / 30
4 References Daniel Hopkins and Gary King. Extracting Systematic Social Science Meaning from Text commercialized via: Gary King and Ying Lu. Verbal Autopsy Methods with Multiple Causes of Death, forthcoming, Statistical Science Gary King (Harvard, IQSS) Content Analysis 2 / 30
5 References Daniel Hopkins and Gary King. Extracting Systematic Social Science Meaning from Text commercialized via: Gary King and Ying Lu. Verbal Autopsy Methods with Multiple Causes of Death, forthcoming, Statistical Science In use by (among others): Gary King (Harvard, IQSS) Content Analysis 2 / 30
6 References Daniel Hopkins and Gary King. Extracting Systematic Social Science Meaning from Text commercialized via: Gary King and Ying Lu. Verbal Autopsy Methods with Multiple Causes of Death, forthcoming, Statistical Science In use by (among others): Copies at Gary King (Harvard, IQSS) Content Analysis 2 / 30
7 Inputs and Target Quantities of Interest Gary King (Harvard, IQSS) Content Analysis 3 / 30
8 Inputs and Target Quantities of Interest Input Data: Gary King (Harvard, IQSS) Content Analysis 3 / 30
9 Inputs and Target Quantities of Interest Input Data: Large set of text documents (blogs, web pages, s, etc.) Gary King (Harvard, IQSS) Content Analysis 3 / 30
10 Inputs and Target Quantities of Interest Input Data: Large set of text documents (blogs, web pages, s, etc.) A set of (mutually exclusive and exhaustive) categories Gary King (Harvard, IQSS) Content Analysis 3 / 30
11 Inputs and Target Quantities of Interest Input Data: Large set of text documents (blogs, web pages, s, etc.) A set of (mutually exclusive and exhaustive) categories A small set of documents hand-coded into the categories Gary King (Harvard, IQSS) Content Analysis 3 / 30
12 Inputs and Target Quantities of Interest Input Data: Large set of text documents (blogs, web pages, s, etc.) A set of (mutually exclusive and exhaustive) categories A small set of documents hand-coded into the categories Quantities of interest Gary King (Harvard, IQSS) Content Analysis 3 / 30
13 Inputs and Target Quantities of Interest Input Data: Large set of text documents (blogs, web pages, s, etc.) A set of (mutually exclusive and exhaustive) categories A small set of documents hand-coded into the categories Quantities of interest individual document classifications (spam filters) Gary King (Harvard, IQSS) Content Analysis 3 / 30
14 Inputs and Target Quantities of Interest Input Data: Large set of text documents (blogs, web pages, s, etc.) A set of (mutually exclusive and exhaustive) categories A small set of documents hand-coded into the categories Quantities of interest individual document classifications (spam filters) proportion in each category (proportion which is spam) Gary King (Harvard, IQSS) Content Analysis 3 / 30
15 Inputs and Target Quantities of Interest Input Data: Large set of text documents (blogs, web pages, s, etc.) A set of (mutually exclusive and exhaustive) categories A small set of documents hand-coded into the categories Quantities of interest individual document classifications (spam filters) proportion in each category (proportion which is spam) Estimation Gary King (Harvard, IQSS) Content Analysis 3 / 30
16 Inputs and Target Quantities of Interest Input Data: Large set of text documents (blogs, web pages, s, etc.) A set of (mutually exclusive and exhaustive) categories A small set of documents hand-coded into the categories Quantities of interest individual document classifications (spam filters) proportion in each category (proportion which is spam) Estimation Can get the 2nd by counting the 1st (turns out not to be necessary!) Gary King (Harvard, IQSS) Content Analysis 3 / 30
17 Inputs and Target Quantities of Interest Input Data: Large set of text documents (blogs, web pages, s, etc.) A set of (mutually exclusive and exhaustive) categories A small set of documents hand-coded into the categories Quantities of interest individual document classifications (spam filters) proportion in each category (proportion which is spam) Estimation Can get the 2nd by counting the 1st (turns out not to be necessary!) High classification accuracy unbiased category proportions Gary King (Harvard, IQSS) Content Analysis 3 / 30
18 Inputs and Target Quantities of Interest Input Data: Large set of text documents (blogs, web pages, s, etc.) A set of (mutually exclusive and exhaustive) categories A small set of documents hand-coded into the categories Quantities of interest individual document classifications (spam filters) proportion in each category (proportion which is spam) Estimation Can get the 2nd by counting the 1st (turns out not to be necessary!) High classification accuracy unbiased category proportions Different methods optimize estimation of the different quantities Gary King (Harvard, IQSS) Content Analysis 3 / 30
19 Blogs as a Running Example Gary King (Harvard, IQSS) Content Analysis 4 / 30
20 Blogs as a Running Example Blogs (web logs): web version of a daily diary, with posts listed in reverse chronological order. Gary King (Harvard, IQSS) Content Analysis 4 / 30
21 Blogs as a Running Example Blogs (web logs): web version of a daily diary, with posts listed in reverse chronological order. 8% of U.S. Internet users (12 million) have blogs Gary King (Harvard, IQSS) Content Analysis 4 / 30
22 Blogs as a Running Example Blogs (web logs): web version of a daily diary, with posts listed in reverse chronological order. 8% of U.S. Internet users (12 million) have blogs Growth: 0 in 2000; million worldwide now. Gary King (Harvard, IQSS) Content Analysis 4 / 30
23 Blogs as a Running Example Blogs (web logs): web version of a daily diary, with posts listed in reverse chronological order. 8% of U.S. Internet users (12 million) have blogs Growth: 0 in 2000; million worldwide now. A democratic technology: 6 million in China and 700,000 in Iran Gary King (Harvard, IQSS) Content Analysis 4 / 30
24 Blogs as a Running Example Blogs (web logs): web version of a daily diary, with posts listed in reverse chronological order. 8% of U.S. Internet users (12 million) have blogs Growth: 0 in 2000; million worldwide now. A democratic technology: 6 million in China and 700,000 in Iran We are living through the largest expansion of expressive capability in the history of the human race Gary King (Harvard, IQSS) Content Analysis 4 / 30
25 Blogs as a Running Example Blogs (web logs): web version of a daily diary, with posts listed in reverse chronological order. 8% of U.S. Internet users (12 million) have blogs Growth: 0 in 2000; million worldwide now. A democratic technology: 6 million in China and 700,000 in Iran We are living through the largest expansion of expressive capability in the history of the human race Measures classical notion of public opinion: active public expressions designed to influence policy and politics (previously: strikes, boycotts, demonstrations, editorials) Gary King (Harvard, IQSS) Content Analysis 4 / 30
26 Blogs as a Running Example Blogs (web logs): web version of a daily diary, with posts listed in reverse chronological order. 8% of U.S. Internet users (12 million) have blogs Growth: 0 in 2000; million worldwide now. A democratic technology: 6 million in China and 700,000 in Iran We are living through the largest expansion of expressive capability in the history of the human race Measures classical notion of public opinion: active public expressions designed to influence policy and politics (previously: strikes, boycotts, demonstrations, editorials) (Public opinion surveys) Gary King (Harvard, IQSS) Content Analysis 4 / 30
27 One specific quantity of interest Gary King (Harvard, IQSS) Content Analysis 5 / 30
28 One specific quantity of interest Daily opinion about President Bush and 2008 candidates among all English language blog posts Gary King (Harvard, IQSS) Content Analysis 5 / 30
29 One specific quantity of interest Daily opinion about President Bush and 2008 candidates among all English language blog posts Specific categories: Label Category 2 extremely negative 1 negative 0 neutral 1 positive 2 extremely positive NA no opinion expressed NB not a blog Gary King (Harvard, IQSS) Content Analysis 5 / 30
30 One specific quantity of interest Daily opinion about President Bush and 2008 candidates among all English language blog posts Specific categories: Label Category 2 extremely negative 1 negative 0 neutral 1 positive 2 extremely positive NA no opinion expressed NB not a blog Hard case: Gary King (Harvard, IQSS) Content Analysis 5 / 30
31 One specific quantity of interest Daily opinion about President Bush and 2008 candidates among all English language blog posts Specific categories: Label Category 2 extremely negative 1 negative 0 neutral 1 positive 2 extremely positive NA no opinion expressed NB not a blog Hard case: Part ordinal, part nominal categorization Gary King (Harvard, IQSS) Content Analysis 5 / 30
32 One specific quantity of interest Daily opinion about President Bush and 2008 candidates among all English language blog posts Specific categories: Label Category 2 extremely negative 1 negative 0 neutral 1 positive 2 extremely positive NA no opinion expressed NB not a blog Hard case: Part ordinal, part nominal categorization Sentiment categorization is more difficult than topic classification Gary King (Harvard, IQSS) Content Analysis 5 / 30
33 One specific quantity of interest Daily opinion about President Bush and 2008 candidates among all English language blog posts Specific categories: Label Category 2 extremely negative 1 negative 0 neutral 1 positive 2 extremely positive NA no opinion expressed NB not a blog Hard case: Part ordinal, part nominal categorization Sentiment categorization is more difficult than topic classification Informal language: my crunchy gf thinks dubya hid the wmd s, :)! Gary King (Harvard, IQSS) Content Analysis 5 / 30
34 One specific quantity of interest Daily opinion about President Bush and 2008 candidates among all English language blog posts Specific categories: Label Category 2 extremely negative 1 negative 0 neutral 1 positive 2 extremely positive NA no opinion expressed NB not a blog Hard case: Part ordinal, part nominal categorization Sentiment categorization is more difficult than topic classification Informal language: my crunchy gf thinks dubya hid the wmd s, :)! Little common internal structure (no inverted pyramid) Gary King (Harvard, IQSS) Content Analysis 5 / 30
35 The Conversation about John Kerry s Botched Joke Gary King (Harvard, IQSS) Content Analysis 6 / 30
36 The Conversation about John Kerry s Botched Joke You know, education if you make the most of it... you can do well. If you don t, you get stuck in Iraq. Gary King (Harvard, IQSS) Content Analysis 6 / 30
37 The Conversation about John Kerry s Botched Joke You know, education if you make the most of it... you can do well. If you don t, you get stuck in Iraq. Affect Towards John Kerry Proportion Sept Oct Nov Dec Jan Feb Mar Gary King (Harvard, IQSS) Content Analysis 6 / 30
38 Representing Text as Numbers Gary King (Harvard, IQSS) Content Analysis 7 / 30
39 Representing Text as Numbers Filter: choose English language blogs that mention Bush Gary King (Harvard, IQSS) Content Analysis 7 / 30
40 Representing Text as Numbers Filter: choose English language blogs that mention Bush Preprocess: convert to lower case, remove punctuation, keep only word stems ( consist, consisted, consistency consist ) Gary King (Harvard, IQSS) Content Analysis 7 / 30
41 Representing Text as Numbers Filter: choose English language blogs that mention Bush Preprocess: convert to lower case, remove punctuation, keep only word stems ( consist, consisted, consistency consist ) Code variables: presence/absence of unique unigrams, bigrams, trigrams Gary King (Harvard, IQSS) Content Analysis 7 / 30
42 Representing Text as Numbers Filter: choose English language blogs that mention Bush Preprocess: convert to lower case, remove punctuation, keep only word stems ( consist, consisted, consistency consist ) Code variables: presence/absence of unique unigrams, bigrams, trigrams Our Example: Gary King (Harvard, IQSS) Content Analysis 7 / 30
43 Representing Text as Numbers Filter: choose English language blogs that mention Bush Preprocess: convert to lower case, remove punctuation, keep only word stems ( consist, consisted, consistency consist ) Code variables: presence/absence of unique unigrams, bigrams, trigrams Our Example: Our 10,771 blog posts about Bush and Clinton: 201,676 unigrams, 2,392,027 bigrams, 5,761,979 trigrams. Gary King (Harvard, IQSS) Content Analysis 7 / 30
44 Representing Text as Numbers Filter: choose English language blogs that mention Bush Preprocess: convert to lower case, remove punctuation, keep only word stems ( consist, consisted, consistency consist ) Code variables: presence/absence of unique unigrams, bigrams, trigrams Our Example: Our 10,771 blog posts about Bush and Clinton: 201,676 unigrams, 2,392,027 bigrams, 5,761,979 trigrams. keep only unigrams in > 1% or < 99% of documents: 3,672 variables Gary King (Harvard, IQSS) Content Analysis 7 / 30
45 Representing Text as Numbers Filter: choose English language blogs that mention Bush Preprocess: convert to lower case, remove punctuation, keep only word stems ( consist, consisted, consistency consist ) Code variables: presence/absence of unique unigrams, bigrams, trigrams Our Example: Our 10,771 blog posts about Bush and Clinton: 201,676 unigrams, 2,392,027 bigrams, 5,761,979 trigrams. keep only unigrams in > 1% or < 99% of documents: 3,672 variables Groups infinite possible posts into only 2 3,672 distinct types Gary King (Harvard, IQSS) Content Analysis 7 / 30
46 Representing Text as Numbers Filter: choose English language blogs that mention Bush Preprocess: convert to lower case, remove punctuation, keep only word stems ( consist, consisted, consistency consist ) Code variables: presence/absence of unique unigrams, bigrams, trigrams Our Example: Our 10,771 blog posts about Bush and Clinton: 201,676 unigrams, 2,392,027 bigrams, 5,761,979 trigrams. keep only unigrams in > 1% or < 99% of documents: 3,672 variables Groups infinite possible posts into only 2 3,672 distinct types More sophisticated summaries: we ve used, but they re not necessary Gary King (Harvard, IQSS) Content Analysis 7 / 30
47 Notation Gary King (Harvard, IQSS) Content Analysis 8 / 30
48 Notation Document Category -2 extremely negative -1 negative 0 neutral D i = 1 positive 2 extremely positive NA no opinion expressed NB not a blog Gary King (Harvard, IQSS) Content Analysis 8 / 30
49 Notation Document Category -2 extremely negative -1 negative 0 neutral D i = 1 positive 2 extremely positive NA no opinion expressed NB not a blog Word Stem Profile: S i1 = 1 if awful is used, 0 if not S i2 = 1 if good is used, 0 if not S i =.. S ik = 1 if except is used, 0 if not Gary King (Harvard, IQSS) Content Analysis 8 / 30
50 Quantities of Interest Gary King (Harvard, IQSS) Content Analysis 9 / 30
51 Quantities of Interest Computer Science: individual document classifications D 1, D 2..., D L Gary King (Harvard, IQSS) Content Analysis 9 / 30
52 Quantities of Interest Computer Science: individual document classifications D 1, D 2..., D L Social Science: proportions in each category P(D = 2) P(D = 1) P(D = 0) P(D) = P(D = 1) P(D = 2) P(D = NA) P(D = NB) Gary King (Harvard, IQSS) Content Analysis 9 / 30
53 Issues with Existing Statistical Approaches Gary King (Harvard, IQSS) Content Analysis 10 / 30
54 Issues with Existing Statistical Approaches 1 Direct Sampling Gary King (Harvard, IQSS) Content Analysis 10 / 30
55 Issues with Existing Statistical Approaches 1 Direct Sampling Biased without a random sample Gary King (Harvard, IQSS) Content Analysis 10 / 30
56 Issues with Existing Statistical Approaches 1 Direct Sampling Biased without a random sample nonrandomness common due to population drift, data subdivisions, etc. Gary King (Harvard, IQSS) Content Analysis 10 / 30
57 Issues with Existing Statistical Approaches 1 Direct Sampling Biased without a random sample nonrandomness common due to population drift, data subdivisions, etc. (Classification of population documents not necessary) Gary King (Harvard, IQSS) Content Analysis 10 / 30
58 Issues with Existing Statistical Approaches 1 Direct Sampling Biased without a random sample nonrandomness common due to population drift, data subdivisions, etc. (Classification of population documents not necessary) 2 Aggregation of model-based individual classifications Gary King (Harvard, IQSS) Content Analysis 10 / 30
59 Issues with Existing Statistical Approaches 1 Direct Sampling Biased without a random sample nonrandomness common due to population drift, data subdivisions, etc. (Classification of population documents not necessary) 2 Aggregation of model-based individual classifications Biased without a random sample Gary King (Harvard, IQSS) Content Analysis 10 / 30
60 Issues with Existing Statistical Approaches 1 Direct Sampling Biased without a random sample nonrandomness common due to population drift, data subdivisions, etc. (Classification of population documents not necessary) 2 Aggregation of model-based individual classifications Biased without a random sample Models P(D S), but the world works as P(S D) Gary King (Harvard, IQSS) Content Analysis 10 / 30
61 Issues with Existing Statistical Approaches 1 Direct Sampling Biased without a random sample nonrandomness common due to population drift, data subdivisions, etc. (Classification of population documents not necessary) 2 Aggregation of model-based individual classifications Biased without a random sample Models P(D S), but the world works as P(S D) Bias unless Gary King (Harvard, IQSS) Content Analysis 10 / 30
62 Issues with Existing Statistical Approaches 1 Direct Sampling Biased without a random sample nonrandomness common due to population drift, data subdivisions, etc. (Classification of population documents not necessary) 2 Aggregation of model-based individual classifications Biased without a random sample Models P(D S), but the world works as P(S D) Bias unless P(D S) encompasses the true model. Gary King (Harvard, IQSS) Content Analysis 10 / 30
63 Issues with Existing Statistical Approaches 1 Direct Sampling Biased without a random sample nonrandomness common due to population drift, data subdivisions, etc. (Classification of population documents not necessary) 2 Aggregation of model-based individual classifications Biased without a random sample Models P(D S), but the world works as P(S D) Bias unless P(D S) encompasses the true model. S spans the space of all predictors of D (i.e., all information in the document) Gary King (Harvard, IQSS) Content Analysis 10 / 30
64 Issues with Existing Statistical Approaches 1 Direct Sampling Biased without a random sample nonrandomness common due to population drift, data subdivisions, etc. (Classification of population documents not necessary) 2 Aggregation of model-based individual classifications Biased without a random sample Models P(D S), but the world works as P(S D) Bias unless P(D S) encompasses the true model. S spans the space of all predictors of D (i.e., all information in the document) Bias even with optimal classification and high % correctly classified Gary King (Harvard, IQSS) Content Analysis 10 / 30
65 Using Misclassification Rates to Correct Proportions Gary King (Harvard, IQSS) Content Analysis 11 / 30
66 Using Misclassification Rates to Correct Proportions Use some method to classify unlabeled documents Gary King (Harvard, IQSS) Content Analysis 11 / 30
67 Using Misclassification Rates to Correct Proportions Use some method to classify unlabeled documents Aggregate classifications to category proportions Gary King (Harvard, IQSS) Content Analysis 11 / 30
68 Using Misclassification Rates to Correct Proportions Use some method to classify unlabeled documents Aggregate classifications to category proportions Use labeled set to estimate misclassification rates (by cross-validation) Gary King (Harvard, IQSS) Content Analysis 11 / 30
69 Using Misclassification Rates to Correct Proportions Use some method to classify unlabeled documents Aggregate classifications to category proportions Use labeled set to estimate misclassification rates (by cross-validation) Use misclassification rates to correct proportions Gary King (Harvard, IQSS) Content Analysis 11 / 30
70 Using Misclassification Rates to Correct Proportions Use some method to classify unlabeled documents Aggregate classifications to category proportions Use labeled set to estimate misclassification rates (by cross-validation) Use misclassification rates to correct proportions Result: vastly improved estimates of category proportions Gary King (Harvard, IQSS) Content Analysis 11 / 30
71 Using Misclassification Rates to Correct Proportions Use some method to classify unlabeled documents Aggregate classifications to category proportions Use labeled set to estimate misclassification rates (by cross-validation) Use misclassification rates to correct proportions Result: vastly improved estimates of category proportions (No new assumptions beyond that of the classifier) Gary King (Harvard, IQSS) Content Analysis 11 / 30
72 Using Misclassification Rates to Correct Proportions Use some method to classify unlabeled documents Aggregate classifications to category proportions Use labeled set to estimate misclassification rates (by cross-validation) Use misclassification rates to correct proportions Result: vastly improved estimates of category proportions (No new assumptions beyond that of the classifier) (still requires random samples, individual classification, etc) Gary King (Harvard, IQSS) Content Analysis 11 / 30
73 Formalization from Epidemiology (Levy and Kass, 1970) Gary King (Harvard, IQSS) Content Analysis 12 / 30
74 Formalization from Epidemiology (Levy and Kass, 1970) Accounting identity for 2 categories: P( ˆD = 1) = (sens)p(d = 1) + (1 spec)p(d = 2) Gary King (Harvard, IQSS) Content Analysis 12 / 30
75 Formalization from Epidemiology (Levy and Kass, 1970) Accounting identity for 2 categories: P( ˆD = 1) = (sens)p(d = 1) + (1 spec)p(d = 2) Solve: P(D = 1) = P( ˆD = 1) (1 spec) sens (1 spec) Gary King (Harvard, IQSS) Content Analysis 12 / 30
76 Formalization from Epidemiology (Levy and Kass, 1970) Accounting identity for 2 categories: Solve: P( ˆD = 1) = (sens)p(d = 1) + (1 spec)p(d = 2) P(D = 1) = P( ˆD = 1) (1 spec) sens (1 spec) Use this equation to correct P( ˆD = 1) Gary King (Harvard, IQSS) Content Analysis 12 / 30
77 Generalizations: J Categories, No Individual Classification (King and Lu, 2008, in press) Gary King (Harvard, IQSS) Content Analysis 13 / 30
78 Generalizations: J Categories, No Individual Classification (King and Lu, 2008, in press) Accounting identity for J categories P( ˆD = j) = J P( ˆD = j D = j )P(D = j ) j =1 Gary King (Harvard, IQSS) Content Analysis 13 / 30
79 Generalizations: J Categories, No Individual Classification (King and Lu, 2008, in press) Accounting identity for J categories P( ˆD = j) = J P( ˆD = j D = j )P(D = j ) j =1 Drop ˆD calculation, since ˆD = f (S): P(S = s) = J P(S = s D = j )P(D = j ) j =1 Gary King (Harvard, IQSS) Content Analysis 13 / 30
80 Generalizations: J Categories, No Individual Classification (King and Lu, 2008, in press) Accounting identity for J categories P( ˆD = j) = J P( ˆD = j D = j )P(D = j ) j =1 Drop ˆD calculation, since ˆD = f (S): P(S = s) = J P(S = s D = j )P(D = j ) j =1 Simplify to an equivalent matrix expression: P(S) = P(S D)P(D) Gary King (Harvard, IQSS) Content Analysis 13 / 30
81 Estimation The matrix expression again: P(S) 2 K 1 = P(S D) P(D) 2 K J J 1 Gary King (Harvard, IQSS) Content Analysis 14 / 30
82 Estimation The matrix expression again: P(S) 2 K 1 = P(S D) P(D) 2 K J J 1 Document category proportions (quantity of interest) Gary King (Harvard, IQSS) Content Analysis 14 / 30
83 Estimation The matrix expression again: P(S) 2 K 1 = P(S D) P(D) 2 K J J 1 Word stem profile proportions (estimate in unlabeled set by tabulation) Gary King (Harvard, IQSS) Content Analysis 14 / 30
84 Estimation The matrix expression again: P(S) 2 K 1 = P(S D) P(D) 2 K J J 1 Word stem profiles, by category (estimate in labeled set by tabulation) Gary King (Harvard, IQSS) Content Analysis 14 / 30
85 Estimation The matrix expression again: P(S) 2 K 1 = Y = X β = P(S D) P(D) 2 K J J 1 Alternative symbols (to emphasize the linear equation) Gary King (Harvard, IQSS) Content Analysis 14 / 30
86 Estimation The matrix expression again: P(S) 2 K 1 = P(S D) P(D) 2 K J J 1 = Y = X β = β = (X X ) 1 X y Solve for quantity of interest (with no error term) Gary King (Harvard, IQSS) Content Analysis 14 / 30
87 Estimation The matrix expression again: P(S) 2 K 1 = P(S D) P(D) 2 K J J 1 = Y = X β = β = (X X ) 1 X y Technical estimation issues: Gary King (Harvard, IQSS) Content Analysis 14 / 30
88 Estimation The matrix expression again: P(S) 2 K 1 = P(S D) P(D) 2 K J J 1 = Y = X β = β = (X X ) 1 X y Technical estimation issues: 2 K is enormous, far larger than any existing computer Gary King (Harvard, IQSS) Content Analysis 14 / 30
89 Estimation The matrix expression again: P(S) 2 K 1 = P(S D) P(D) 2 K J J 1 = Y = X β = β = (X X ) 1 X y Technical estimation issues: 2 K is enormous, far larger than any existing computer P(S) and P(S D) will be too sparse Gary King (Harvard, IQSS) Content Analysis 14 / 30
90 Estimation The matrix expression again: P(S) 2 K 1 = P(S D) P(D) 2 K J J 1 = Y = X β = β = (X X ) 1 X y Technical estimation issues: 2 K is enormous, far larger than any existing computer P(S) and P(S D) will be too sparse Elements of P(D) must be between 0 and 1 and sum to 1 Gary King (Harvard, IQSS) Content Analysis 14 / 30
91 Estimation The matrix expression again: P(S) 2 K 1 = P(S D) P(D) 2 K J J 1 = Y = X β = β = (X X ) 1 X y Technical estimation issues: 2 K is enormous, far larger than any existing computer P(S) and P(S D) will be too sparse Elements of P(D) must be between 0 and 1 and sum to 1 Solutions Gary King (Harvard, IQSS) Content Analysis 14 / 30
92 Estimation The matrix expression again: P(S) 2 K 1 = P(S D) P(D) 2 K J J 1 = Y = X β = β = (X X ) 1 X y Technical estimation issues: 2 K is enormous, far larger than any existing computer P(S) and P(S D) will be too sparse Elements of P(D) must be between 0 and 1 and sum to 1 Solutions Use subsets of S; average results Gary King (Harvard, IQSS) Content Analysis 14 / 30
93 Estimation The matrix expression again: P(S) 2 K 1 = P(S D) P(D) 2 K J J 1 = Y = X β = β = (X X ) 1 X y Technical estimation issues: 2 K is enormous, far larger than any existing computer P(S) and P(S D) will be too sparse Elements of P(D) must be between 0 and 1 and sum to 1 Solutions Use subsets of S; average results Equivalent to kernel density smoothing of sparse categorical data Gary King (Harvard, IQSS) Content Analysis 14 / 30
94 Estimation The matrix expression again: P(S) 2 K 1 = P(S D) P(D) 2 K J J 1 = Y = X β = β = (X X ) 1 X y Technical estimation issues: 2 K is enormous, far larger than any existing computer P(S) and P(S D) will be too sparse Elements of P(D) must be between 0 and 1 and sum to 1 Solutions Use subsets of S; average results Equivalent to kernel density smoothing of sparse categorical data Use constrained LS to constrain P(D) to simplex Gary King (Harvard, IQSS) Content Analysis 14 / 30
95 Estimation The matrix expression again: P(S) 2 K 1 = P(S D) P(D) 2 K J J 1 = Y = X β = β = (X X ) 1 X y Technical estimation issues: 2 K is enormous, far larger than any existing computer P(S) and P(S D) will be too sparse Elements of P(D) must be between 0 and 1 and sum to 1 Solutions Use subsets of S; average results Equivalent to kernel density smoothing of sparse categorical data Use constrained LS to constrain P(D) to simplex Result: fast, accurate, with very little (human) tuning required Gary King (Harvard, IQSS) Content Analysis 14 / 30
96 A Nonrandom Hand-coded Sample Differences in Document Category Frequencies Differences in Word Profile Frequencies P(D) P(S) P h (D) P h (S) All existing methods would fail with these data. Gary King (Harvard, IQSS) Content Analysis 15 / 30
97 Accurate Estimates Estimated P(D) Actual P(D) Gary King (Harvard, IQSS) Content Analysis 16 / 30
98 Out-of-sample Comparison: 60 Seconds vs. 8.7 Days Affect in Blogs Estimated P(D) Actual P(D) Gary King (Harvard, IQSS) Content Analysis 17 / 30
99 Out of Sample Validation: Other Examples Congressional Speeches Immigration Editorials Enron s Estimated P(D) Estimated P(D) Estimated P(D) Actual P(D) Actual P(D) Actual P(D) Gary King (Harvard, IQSS) Content Analysis 18 / 30
100 Verbal Autopsy Methods Gary King (Harvard, IQSS) Content Analysis 19 / 30
101 Verbal Autopsy Methods The Problem Gary King (Harvard, IQSS) Content Analysis 19 / 30
102 Verbal Autopsy Methods The Problem Policymakers need the cause-specific mortality rate to set research goals, budgetary priorities, and ameliorative policies Gary King (Harvard, IQSS) Content Analysis 19 / 30
103 Verbal Autopsy Methods The Problem Policymakers need the cause-specific mortality rate to set research goals, budgetary priorities, and ameliorative policies High quality death registration: only 23/192 countries Gary King (Harvard, IQSS) Content Analysis 19 / 30
104 Verbal Autopsy Methods The Problem Policymakers need the cause-specific mortality rate to set research goals, budgetary priorities, and ameliorative policies High quality death registration: only 23/192 countries Existing Approaches Gary King (Harvard, IQSS) Content Analysis 19 / 30
105 Verbal Autopsy Methods The Problem Policymakers need the cause-specific mortality rate to set research goals, budgetary priorities, and ameliorative policies High quality death registration: only 23/192 countries Existing Approaches Verbal Autopsy: Ask relatives or caregivers symptom questions Gary King (Harvard, IQSS) Content Analysis 19 / 30
106 Verbal Autopsy Methods The Problem Policymakers need the cause-specific mortality rate to set research goals, budgetary priorities, and ameliorative policies High quality death registration: only 23/192 countries Existing Approaches Verbal Autopsy: Ask relatives or caregivers symptom questions Ask physicians to determine cause of death (low intercoder reliability) Gary King (Harvard, IQSS) Content Analysis 19 / 30
107 Verbal Autopsy Methods The Problem Policymakers need the cause-specific mortality rate to set research goals, budgetary priorities, and ameliorative policies High quality death registration: only 23/192 countries Existing Approaches Verbal Autopsy: Ask relatives or caregivers symptom questions Ask physicians to determine cause of death (low intercoder reliability) Apply expert algorithms (high reliability, low validity) Gary King (Harvard, IQSS) Content Analysis 19 / 30
108 Verbal Autopsy Methods The Problem Policymakers need the cause-specific mortality rate to set research goals, budgetary priorities, and ameliorative policies High quality death registration: only 23/192 countries Existing Approaches Verbal Autopsy: Ask relatives or caregivers symptom questions Ask physicians to determine cause of death (low intercoder reliability) Apply expert algorithms (high reliability, low validity) Find deaths with medically certified causes from a local hospital, trace caregivers to their homes, ask the same symptom questions, and statistically classify deaths in population (model-dependent, low accuracy) Gary King (Harvard, IQSS) Content Analysis 19 / 30
109 An Alternative Approach Gary King (Harvard, IQSS) Content Analysis 20 / 30
110 An Alternative Approach Document Category, Cause of Death, 1 if bladder cancer 2 if cardiovascular disease D i = 3 if transportation accident.. J if infectious respiratory Gary King (Harvard, IQSS) Content Analysis 20 / 30
111 An Alternative Approach Document Category, Cause of Death, 1 if bladder cancer 2 if cardiovascular disease D i = 3 if transportation accident.. J if infectious respiratory Word Stem Profile, Symptoms: S i1 = 1 if breathing difficulties, 0 if not S i2 = 1 if stomach ache, 0 if not S i =.. S ik = 1 if diarrhea, 0 if not Gary King (Harvard, IQSS) Content Analysis 20 / 30
112 An Alternative Approach Document Category, Cause of Death, 1 if bladder cancer 2 if cardiovascular disease D i = 3 if transportation accident.. J if infectious respiratory Word Stem Profile, Symptoms: S i1 = 1 if breathing difficulties, 0 if not S i2 = 1 if stomach ache, 0 if not S i =.. S ik = 1 if diarrhea, 0 if not Apply the same methods Gary King (Harvard, IQSS) Content Analysis 20 / 30
113 Validation in Tanzania Random Split Sample Community Sample Estimate Error Error Estimate TRUE TRUE Gary King (Harvard, IQSS) Content Analysis 21 / 30
114 Validation in China Random Split Sample Estimate TRUE Error City Sample I Estimate TRUE Error City Sample II Estimate TRUE Error Gary King (Harvard, IQSS) Content Analysis 22 / 30
115 Implications for an Individual Classifier Gary King (Harvard, IQSS) Content Analysis 23 / 30
116 Implications for an Individual Classifier All existing classifiers assume: P h (S, D) = P(S, D) Gary King (Harvard, IQSS) Content Analysis 23 / 30
117 Implications for an Individual Classifier All existing classifiers assume: P h (S, D) = P(S, D) For a different quantity we assume: P h (S D) = P(S D) Gary King (Harvard, IQSS) Content Analysis 23 / 30
118 Implications for an Individual Classifier All existing classifiers assume: P h (S, D) = P(S, D) For a different quantity we assume: P h (S D) = P(S D) How to use this (less restrictive) assumption for classification (Bayes Theorem): Gary King (Harvard, IQSS) Content Analysis 23 / 30
119 Implications for an Individual Classifier All existing classifiers assume: P h (S, D) = P(S, D) For a different quantity we assume: P h (S D) = P(S D) How to use this (less restrictive) assumption for classification (Bayes Theorem): P(D l S l = s l ) = P(S l = s l D l = j)p(d l = j) P(S l = s l ) Gary King (Harvard, IQSS) Content Analysis 23 / 30
120 Implications for an Individual Classifier All existing classifiers assume: P h (S, D) = P(S, D) For a different quantity we assume: P h (S D) = P(S D) How to use this (less restrictive) assumption for classification (Bayes Theorem): P(D l S l = s l ) = P(S l = s l D l = j)p(d l = j) P(S l = s l ) The goal: individual classification Gary King (Harvard, IQSS) Content Analysis 23 / 30
121 Implications for an Individual Classifier All existing classifiers assume: P h (S, D) = P(S, D) For a different quantity we assume: P h (S D) = P(S D) How to use this (less restrictive) assumption for classification (Bayes Theorem): P(D l S l = s l ) = P(S l = s l D l = j)p(d l = j) P(S l = s l ) Output from our estimator (described above) Gary King (Harvard, IQSS) Content Analysis 23 / 30
122 Implications for an Individual Classifier All existing classifiers assume: P h (S, D) = P(S, D) For a different quantity we assume: P h (S D) = P(S D) How to use this (less restrictive) assumption for classification (Bayes Theorem): P(D l S l = s l ) = P(S l = s l D l = j)p(d l = j) P(S l = s l ) Nonparametric estimate from labeled set (an assumption) Gary King (Harvard, IQSS) Content Analysis 23 / 30
123 Implications for an Individual Classifier All existing classifiers assume: P h (S, D) = P(S, D) For a different quantity we assume: P h (S D) = P(S D) How to use this (less restrictive) assumption for classification (Bayes Theorem): P(D l S l = s l ) = P(S l = s l D l = j)p(d l = j) P(S l = s l ) Nonparametric estimate from unlabeled set (no assumption) Gary King (Harvard, IQSS) Content Analysis 23 / 30
124 Classification with Less Restrictive Assumptions Gary King (Harvard, IQSS) Content Analysis 24 / 30
125 Classification with Less Restrictive Assumptions P h (D=j) P(D=j) Gary King (Harvard, IQSS) Content Analysis 24 / 30
126 Classification with Less Restrictive Assumptions P h (D=j) P h (S k=1) P(D=j) P(S k=1) Gary King (Harvard, IQSS) Content Analysis 24 / 30
127 Classification with Less Restrictive Assumptions Gary King (Harvard, IQSS) Content Analysis 25 / 30
128 Classification with Less Restrictive Assumptions P^(D=j) SVM Nonparametric P(D=j) Gary King (Harvard, IQSS) Content Analysis 25 / 30
129 Classification with Less Restrictive Assumptions P^(D=j) SVM Nonparametric P(D=j) Percent correctly classified: Gary King (Harvard, IQSS) Content Analysis 25 / 30
130 Classification with Less Restrictive Assumptions P^(D=j) SVM Nonparametric P(D=j) Percent correctly classified: SVM (best existing classifier): 40.5% Gary King (Harvard, IQSS) Content Analysis 25 / 30
131 Classification with Less Restrictive Assumptions P^(D=j) SVM Nonparametric P(D=j) Percent correctly classified: SVM (best existing classifier): 40.5% Our nonparametric approach: 59.8% Gary King (Harvard, IQSS) Content Analysis 25 / 30
132 Misclassification Matrix for Blog Posts NA NB P(D 1 ) NA NB Gary King (Harvard, IQSS) Content Analysis 26 / 30
133 SIMEX Analysis of Not a Blog Category Category NB α Gary King (Harvard, IQSS) Content Analysis 27 / 30
134 SIMEX Analysis of Not a Blog Category Category NB α Gary King (Harvard, IQSS) Content Analysis 28 / 30
135 SIMEX Analysis of Not a Blog Category Category NB α Gary King (Harvard, IQSS) Content Analysis 29 / 30
136 SIMEX Analysis of Other Categories Category 2 Category 0 Category α Category α Category α Category NA α α α Gary King (Harvard, IQSS) Content Analysis 30 / 30
137 For more information Gary King (Harvard, IQSS) Content Analysis 31 / 30
How to Read 100 Million Blogs (& Classify Deaths Without Physicians)
ary King Institute for Quantitative Social Science Harvard University () How to Read 100 Million Blogs (& Classify Deaths Without Physicians) (6/19/08 talk at Google) 1 / 30 How to Read 100 Million Blogs
More informationAdvance in Statistical Theory and Methods for Social Sciences
Advance in Statistical Theory and Methods for Social Sciences Ying Lu A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in partial fulfillment of the requirements
More informationCategorization ANLP Lecture 10 Text Categorization with Naive Bayes
1 Categorization ANLP Lecture 10 Text Categorization with Naive Bayes Sharon Goldwater 6 October 2014 Important task for both humans and machines object identification face recognition spoken word recognition
More informationANLP Lecture 10 Text Categorization with Naive Bayes
ANLP Lecture 10 Text Categorization with Naive Bayes Sharon Goldwater 6 October 2014 Categorization Important task for both humans and machines 1 object identification face recognition spoken word recognition
More informationQuantification. Using Supervised Learning to Estimate Class Prevalence. Fabrizio Sebastiani
Quantification Using Supervised Learning to Estimate Class Prevalence Fabrizio Sebastiani Istituto di Scienza e Tecnologie dell Informazione Consiglio Nazionale delle Ricerche 56124 Pisa, IT E-mail: fabrizio.sebastiani@isti.cnr.it
More informationInformation Retrieval and Organisation
Information Retrieval and Organisation Chapter 13 Text Classification and Naïve Bayes Dell Zhang Birkbeck, University of London Motivation Relevance Feedback revisited The user marks a number of documents
More information2.4. Conditional Probability
2.4. Conditional Probability Objectives. Definition of conditional probability and multiplication rule Total probability Bayes Theorem Example 2.4.1. (#46 p.80 textbook) Suppose an individual is randomly
More informationBehavioral Data Mining. Lecture 2
Behavioral Data Mining Lecture 2 Autonomy Corp Bayes Theorem Bayes Theorem P(A B) = probability of A given that B is true. P(A B) = P(B A)P(A) P(B) In practice we are most interested in dealing with events
More informationFast Logistic Regression for Text Categorization with Variable-Length N-grams
Fast Logistic Regression for Text Categorization with Variable-Length N-grams Georgiana Ifrim *, Gökhan Bakır +, Gerhard Weikum * * Max-Planck Institute for Informatics Saarbrücken, Germany + Google Switzerland
More informationLanguage Modeling. Michael Collins, Columbia University
Language Modeling Michael Collins, Columbia University Overview The language modeling problem Trigram models Evaluating language models: perplexity Estimation techniques: Linear interpolation Discounting
More informationCS 188: Artificial Intelligence Spring Today
CS 188: Artificial Intelligence Spring 2006 Lecture 9: Naïve Bayes 2/14/2006 Dan Klein UC Berkeley Many slides from either Stuart Russell or Andrew Moore Bayes rule Today Expectations and utilities Naïve
More informationApplied Natural Language Processing
Applied Natural Language Processing Info 256 Lecture 5: Text classification (Feb 5, 2019) David Bamman, UC Berkeley Data Classification A mapping h from input data x (drawn from instance space X) to a
More informationLanguage Models. CS6200: Information Retrieval. Slides by: Jesse Anderton
Language Models CS6200: Information Retrieval Slides by: Jesse Anderton What s wrong with VSMs? Vector Space Models work reasonably well, but have a few problems: They are based on bag-of-words, so they
More information9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering
Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make
More informationGenerative Clustering, Topic Modeling, & Bayesian Inference
Generative Clustering, Topic Modeling, & Bayesian Inference INFO-4604, Applied Machine Learning University of Colorado Boulder December 12-14, 2017 Prof. Michael Paul Unsupervised Naïve Bayes Last week
More informationNatural Language Processing. Classification. Features. Some Definitions. Classification. Feature Vectors. Classification I. Dan Klein UC Berkeley
Natural Language Processing Classification Classification I Dan Klein UC Berkeley Classification Automatically make a decision about inputs Example: document category Example: image of digit digit Example:
More informationSTAT:5100 (22S:193) Statistical Inference I
STAT:5100 (22S:193) Statistical Inference I Week 3 Luke Tierney University of Iowa Fall 2015 Luke Tierney (U Iowa) STAT:5100 (22S:193) Statistical Inference I Fall 2015 1 Recap Matching problem Generalized
More informationIntroduction: MLE, MAP, Bayesian reasoning (28/8/13)
STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this
More informationIntroduction to AI Learning Bayesian networks. Vibhav Gogate
Introduction to AI Learning Bayesian networks Vibhav Gogate Inductive Learning in a nutshell Given: Data Examples of a function (X, F(X)) Predict function F(X) for new examples X Discrete F(X): Classification
More informationCS6375: Machine Learning Gautam Kunapuli. Support Vector Machines
Gautam Kunapuli Example: Text Categorization Example: Develop a model to classify news stories into various categories based on their content. sports politics Use the bag-of-words representation for this
More information1 Overview. 2 Learning from Experts. 2.1 Defining a meaningful benchmark. AM 221: Advanced Optimization Spring 2016
AM 1: Advanced Optimization Spring 016 Prof. Yaron Singer Lecture 11 March 3rd 1 Overview In this lecture we will introduce the notion of online convex optimization. This is an extremely useful framework
More informationAn Improved Method of Automated Nonparametric Content Analysis for Social Science
An Improved Method of Automated Nonparametric Content Analysis for Social Science Connor T. Jerzak Gary King Anton Strezhnev July 13, 2018 Abstract Computer scientists and statisticians are often interested
More informationLecture 01: Introduction
Lecture 01: Introduction Dipankar Bandyopadhyay, Ph.D. BMTRY 711: Analysis of Categorical Data Spring 2011 Division of Biostatistics and Epidemiology Medical University of South Carolina Lecture 01: Introduction
More informationClassification: Analyzing Sentiment
Classification: Analyzing Sentiment STAT/CSE 416: Machine Learning Emily Fox University of Washington April 17, 2018 Predicting sentiment by topic: An intelligent restaurant review system 1 It s a big
More informationClick Prediction and Preference Ranking of RSS Feeds
Click Prediction and Preference Ranking of RSS Feeds 1 Introduction December 11, 2009 Steven Wu RSS (Really Simple Syndication) is a family of data formats used to publish frequently updated works. RSS
More informationClassification, Linear Models, Naïve Bayes
Classification, Linear Models, Naïve Bayes CMSC 470 Marine Carpuat Slides credit: Dan Jurafsky & James Martin, Jacob Eisenstein Today Text classification problems and their evaluation Linear classifiers
More informationUniversity of Illinois at Urbana-Champaign. Midterm Examination
University of Illinois at Urbana-Champaign Midterm Examination CS410 Introduction to Text Information Systems Professor ChengXiang Zhai TA: Azadeh Shakery Time: 2:00 3:15pm, Mar. 14, 2007 Place: Room 1105,
More informationClassification: Analyzing Sentiment
Classification: Analyzing Sentiment STAT/CSE 416: Machine Learning Emily Fox University of Washington April 17, 2018 Predicting sentiment by topic: An intelligent restaurant review system 1 4/16/18 It
More informationInformal Definition: Telling things apart
9. Decision Trees Informal Definition: Telling things apart 2 Nominal data No numeric feature vector Just a list or properties: Banana: longish, yellow Apple: round, medium sized, different colors like
More informationA REVIEW ARTICLE ON NAIVE BAYES CLASSIFIER WITH VARIOUS SMOOTHING TECHNIQUES
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 10, October 2014,
More informationKey Methods in Geography / Nicholas Clifford, Shaun French, Gill Valentine
Key Methods in Geography / Nicholas Clifford, Shaun French, Gill Valentine Nicholas Clifford, Shaun French, Gill Valentine / 2010 / Key Methods in Geography / SAGE, 2010 / 144624363X, 9781446243633 / 568
More informationMining Newsgroups Using Networks Arising From Social Behavior by Rakesh Agrawal et al. Presented by Will Lee
Mining Newsgroups Using Networks Arising From Social Behavior by Rakesh Agrawal et al. Presented by Will Lee wwlee1@uiuc.edu September 28, 2004 Motivation IR on newsgroups is challenging due to lack of
More informationMaxent Models and Discriminative Estimation
Maxent Models and Discriminative Estimation Generative vs. Discriminative models (Reading: J+M Ch6) Introduction So far we ve looked at generative models Language models, Naive Bayes But there is now much
More informationLecture 10 and 11: Text and Discrete Distributions
Lecture 10 and 11: Text and Discrete Distributions Machine Learning 4F13, Spring 2014 Carl Edward Rasmussen and Zoubin Ghahramani CUED http://mlg.eng.cam.ac.uk/teaching/4f13/ Rasmussen and Ghahramani Lecture
More informationGaussian Models
Gaussian Models ddebarr@uw.edu 2016-04-28 Agenda Introduction Gaussian Discriminant Analysis Inference Linear Gaussian Systems The Wishart Distribution Inferring Parameters Introduction Gaussian Density
More informationProbabilistic Language Modeling
Predicting String Probabilities Probabilistic Language Modeling Which string is more likely? (Which string is more grammatical?) Grill doctoral candidates. Regina Barzilay EECS Department MIT November
More informationProbability Review and Naïve Bayes
Probability Review and Naïve Bayes Instructor: Alan Ritter Some slides adapted from Dan Jurfasky and Brendan O connor What is Probability? The probability the coin will land heads is 0.5 Q: what does this
More informationN-grams. Motivation. Simple n-grams. Smoothing. Backoff. N-grams L545. Dept. of Linguistics, Indiana University Spring / 24
L545 Dept. of Linguistics, Indiana University Spring 2013 1 / 24 Morphosyntax We just finished talking about morphology (cf. words) And pretty soon we re going to discuss syntax (cf. sentences) In between,
More informationMaps: Two-dimensional, scaled representations of Earth surface - show spatial data (data with locational component)
Maps: Two-dimensional, scaled representations of Earth surface - show spatial data (data with locational component) Cartography (map-making) involves 5 design decisions based on purpose of map Projection
More informationDiscrete Probability and State Estimation
6.01, Spring Semester, 2008 Week 12 Course Notes 1 MASSACHVSETTS INSTITVTE OF TECHNOLOGY Department of Electrical Engineering and Computer Science 6.01 Introduction to EECS I Spring Semester, 2008 Week
More informationDiscrete Probability and State Estimation
6.01, Fall Semester, 2007 Lecture 12 Notes 1 MASSACHVSETTS INSTITVTE OF TECHNOLOGY Department of Electrical Engineering and Computer Science 6.01 Introduction to EECS I Fall Semester, 2007 Lecture 12 Notes
More informationThe Language Modeling Problem (Fall 2007) Smoothed Estimation, and Language Modeling. The Language Modeling Problem (Continued) Overview
The Language Modeling Problem We have some (finite) vocabulary, say V = {the, a, man, telescope, Beckham, two, } 6.864 (Fall 2007) Smoothed Estimation, and Language Modeling We have an (infinite) set of
More information2. AXIOMATIC PROBABILITY
IA Probability Lent Term 2. AXIOMATIC PROBABILITY 2. The axioms The formulation for classical probability in which all outcomes or points in the sample space are equally likely is too restrictive to develop
More informationMatching for Causal Inference Without Balance Checking
Matching for ausal Inference Without Balance hecking Gary King Institute for Quantitative Social Science Harvard University joint work with Stefano M. Iacus (Univ. of Milan) and Giuseppe Porro (Univ. of
More informationGALLUP NEWS SERVICE JUNE WAVE 1
GALLUP NEWS SERVICE JUNE WAVE 1 -- FINAL TOPLINE -- Timberline: 937008 IS: 392 Princeton Job #: 15-06-006 Jeff Jones, Lydia Saad June 2-7, 2015 Results are based on telephone interviews conducted June
More informationMachine Learning 2007: Slides 1. Instructor: Tim van Erven Website: erven/teaching/0708/ml/
Machine 2007: Slides 1 Instructor: Tim van Erven (Tim.van.Erven@cwi.nl) Website: www.cwi.nl/ erven/teaching/0708/ml/ September 6, 2007, updated: September 13, 2007 1 / 37 Overview The Most Important Supervised
More informationMachine Learning. Hal Daumé III. Computer Science University of Maryland CS 421: Introduction to Artificial Intelligence 8 May 2012
Machine Learning Hal Daumé III Computer Science University of Maryland me@hal3.name CS 421 Introduction to Artificial Intelligence 8 May 2012 g 1 Many slides courtesy of Dan Klein, Stuart Russell, or Andrew
More informationUsing SPSS for One Way Analysis of Variance
Using SPSS for One Way Analysis of Variance This tutorial will show you how to use SPSS version 12 to perform a one-way, between- subjects analysis of variance and related post-hoc tests. This tutorial
More informationUnderstanding Comments Submitted to FCC on Net Neutrality. Kevin (Junhui) Mao, Jing Xia, Dennis (Woncheol) Jeong December 12, 2014
Understanding Comments Submitted to FCC on Net Neutrality Kevin (Junhui) Mao, Jing Xia, Dennis (Woncheol) Jeong December 12, 2014 Abstract We aim to understand and summarize themes in the 1.65 million
More informationCHAPTER 2: KEY ISSUE 1 Where Is the World s Population Distributed? p
CHAPTER 2: KEY ISSUE 1 Where Is the World s Population Distributed? p. 45-49 Always keep your vocabulary packet out whenever you take notes. As the term comes up in the text, add to your examples for the
More informationwhere Female = 0 for males, = 1 for females Age is measured in years (22, 23, ) GPA is measured in units on a four-point scale (0, 1.22, 3.45, etc.
Notes on regression analysis 1. Basics in regression analysis key concepts (actual implementation is more complicated) A. Collect data B. Plot data on graph, draw a line through the middle of the scatter
More informationTypical information required from the data collection can be grouped into four categories, enumerated as below.
Chapter 6 Data Collection 6.1 Overview The four-stage modeling, an important tool for forecasting future demand and performance of a transportation system, was developed for evaluating large-scale infrastructure
More informationAn Algorithms-based Intro to Machine Learning
CMU 15451 lecture 12/08/11 An Algorithmsbased Intro to Machine Learning Plan for today Machine Learning intro: models and basic issues An interesting algorithm for combining expert advice Avrim Blum [Based
More informationProperties of Probability
Econ 325 Notes on Probability 1 By Hiro Kasahara Properties of Probability In statistics, we consider random experiments, experiments for which the outcome is random, i.e., cannot be predicted with certainty.
More informationLeast Squares Classification
Least Squares Classification Stephen Boyd EE103 Stanford University November 4, 2017 Outline Classification Least squares classification Multi-class classifiers Classification 2 Classification data fitting
More informationThe Noisy Channel Model and Markov Models
1/24 The Noisy Channel Model and Markov Models Mark Johnson September 3, 2014 2/24 The big ideas The story so far: machine learning classifiers learn a function that maps a data item X to a label Y handle
More informationMath for Machine Learning Open Doors to Data Science and Artificial Intelligence. Richard Han
Math for Machine Learning Open Doors to Data Science and Artificial Intelligence Richard Han Copyright 05 Richard Han All rights reserved. CONTENTS PREFACE... - INTRODUCTION... LINEAR REGRESSION... 4 LINEAR
More informationBayesian Learning Features of Bayesian learning methods:
Bayesian Learning Features of Bayesian learning methods: Each observed training example can incrementally decrease or increase the estimated probability that a hypothesis is correct. This provides a more
More informationDM-Group Meeting. Subhodip Biswas 10/16/2014
DM-Group Meeting Subhodip Biswas 10/16/2014 Papers to be discussed 1. Crowdsourcing Land Use Maps via Twitter Vanessa Frias-Martinez and Enrique Frias-Martinez in KDD 2014 2. Tracking Climate Change Opinions
More informationPredictive Analytics on Accident Data Using Rule Based and Discriminative Classifiers
Advances in Computational Sciences and Technology ISSN 0973-6107 Volume 10, Number 3 (2017) pp. 461-469 Research India Publications http://www.ripublication.com Predictive Analytics on Accident Data Using
More informationIntroduction to Forecasting
Introduction to Forecasting Introduction to Forecasting Predicting the future Not an exact science but instead consists of a set of statistical tools and techniques that are supported by human judgment
More informationMachine Learning and Deep Learning! Vincent Lepetit!
Machine Learning and Deep Learning!! Vincent Lepetit! 1! What is Machine Learning?! 2! Hand-Written Digit Recognition! 2 9 3! Hand-Written Digit Recognition! Formalization! 0 1 x = @ A Images are 28x28
More informationGAMINGRE 8/1/ of 7
FYE 09/30/92 JULY 92 0.00 254,550.00 0.00 0 0 0 0 0 0 0 0 0 254,550.00 0.00 0.00 0.00 0.00 254,550.00 AUG 10,616,710.31 5,299.95 845,656.83 84,565.68 61,084.86 23,480.82 339,734.73 135,893.89 67,946.95
More informationUnderstanding and accessing 2011 census aggregate data
Understanding and accessing 2011 census aggregate data 4 July 11:00 to 16:00 BST Justin Hayes and Richard Wiseman UK Data Service Census Support UK censuses provide an unparalleled resource of high quality
More informationText Categorization CSE 454. (Based on slides by Dan Weld, Tom Mitchell, and others)
Text Categorization CSE 454 (Based on slides by Dan Weld, Tom Mitchell, and others) 1 Given: Categorization A description of an instance, x X, where X is the instance language or instance space. A fixed
More informationDeconstructing Data Science
econstructing ata Science avid Bamman, UC Berkeley Info 290 Lecture 6: ecision trees & random forests Feb 2, 2016 Linear regression eep learning ecision trees Ordinal regression Probabilistic graphical
More informationFORECASTING STANDARDS CHECKLIST
FORECASTING STANDARDS CHECKLIST An electronic version of this checklist is available on the Forecasting Principles Web site. PROBLEM 1. Setting Objectives 1.1. Describe decisions that might be affected
More information10/15/2015 A FAST REVIEW OF DISCRETE PROBABILITY (PART 2) Probability, Conditional Probability & Bayes Rule. Discrete random variables
Probability, Conditional Probability & Bayes Rule A FAST REVIEW OF DISCRETE PROBABILITY (PART 2) 2 Discrete random variables A random variable can take on one of a set of different values, each with an
More informationRandomized Decision Trees
Randomized Decision Trees compiled by Alvin Wan from Professor Jitendra Malik s lecture Discrete Variables First, let us consider some terminology. We have primarily been dealing with real-valued data,
More informationBayesian Learning. Examples. Conditional Probability. Two Roles for Bayesian Methods. Prior Probability and Random Variables. The Chain Rule P (B)
Examples My mood can take 2 possible values: happy, sad. The weather can take 3 possible vales: sunny, rainy, cloudy My friends know me pretty well and say that: P(Mood=happy Weather=rainy) = 0.25 P(Mood=happy
More informationData Collection. Lecture Notes in Transportation Systems Engineering. Prof. Tom V. Mathew. 1 Overview 1
Data Collection Lecture Notes in Transportation Systems Engineering Prof. Tom V. Mathew Contents 1 Overview 1 2 Survey design 2 2.1 Information needed................................. 2 2.2 Study area.....................................
More informationTypes of spatial data. The Nature of Geographic Data. Types of spatial data. Spatial Autocorrelation. Continuous spatial data: geostatistics
The Nature of Geographic Data Types of spatial data Continuous spatial data: geostatistics Samples may be taken at intervals, but the spatial process is continuous e.g. soil quality Discrete data Irregular:
More informationReview. More Review. Things to know about Probability: Let Ω be the sample space for a probability measure P.
1 2 Review Data for assessing the sensitivity and specificity of a test are usually of the form disease category test result diseased (+) nondiseased ( ) + A B C D Sensitivity: is the proportion of diseased
More informationRanked Retrieval (2)
Text Technologies for Data Science INFR11145 Ranked Retrieval (2) Instructor: Walid Magdy 31-Oct-2017 Lecture Objectives Learn about Probabilistic models BM25 Learn about LM for IR 2 1 Recall: VSM & TFIDF
More informationThe LTA Data Aggregation Program
The LTA Data Aggregation Program Brian P Flaherty The Methodology Center The Pennsylvania State University August 1, 1999 I Introduction Getting data into response pattern format (grouped data) has often
More informationCS 446 Machine Learning Fall 2016 Nov 01, Bayesian Learning
CS 446 Machine Learning Fall 206 Nov 0, 206 Bayesian Learning Professor: Dan Roth Scribe: Ben Zhou, C. Cervantes Overview Bayesian Learning Naive Bayes Logistic Regression Bayesian Learning So far, we
More informationAgreement Coefficients and Statistical Inference
CHAPTER Agreement Coefficients and Statistical Inference OBJECTIVE This chapter describes several approaches for evaluating the precision associated with the inter-rater reliability coefficients of the
More informationLinear Classifiers. Michael Collins. January 18, 2012
Linear Classifiers Michael Collins January 18, 2012 Today s Lecture Binary classification problems Linear classifiers The perceptron algorithm Classification Problems: An Example Goal: build a system that
More informationWelcome Survey getting to know you Collect & log Supplies received Classroom Rules Curriculum overview. 1 : Aug 810. (3 days) 2nd: Aug (5 days)
1st Quarter (41Days) st 1 : Aug 810 (3 days) 2nd: Aug 13-17 Reporting Categories (TEKS SEs) Skill Create and write a postcard about your favorite community activity Review 2nd Grade Vocabulary Chapter
More informationStatistics, continued
Statistics, continued Visual Displays of Data Since numbers often do not resonate with people, giving visual representations of data is often uses to make the data more meaningful. We will talk about a
More informationCSCE 478/878 Lecture 6: Bayesian Learning and Graphical Models. Stephen Scott. Introduction. Outline. Bayes Theorem. Formulas
ian ian ian Might have reasons (domain information) to favor some hypotheses/predictions over others a priori ian methods work with probabilities, and have two main roles: Naïve Nets (Adapted from Ethem
More informationChap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University
Chap 1. Overview of Statistical Learning (HTF, 2.1-2.6, 2.9) Yongdai Kim Seoul National University 0. Learning vs Statistical learning Learning procedure Construct a claim by observing data or using logics
More informationRecap: Language models. Foundations of Natural Language Processing Lecture 4 Language Models: Evaluation and Smoothing. Two types of evaluation in NLP
Recap: Language models Foundations of atural Language Processing Lecture 4 Language Models: Evaluation and Smoothing Alex Lascarides (Slides based on those from Alex Lascarides, Sharon Goldwater and Philipp
More informationSupport Vector Machines
Support Vector Machines Reading: Ben-Hur & Weston, A User s Guide to Support Vector Machines (linked from class web page) Notation Assume a binary classification problem. Instances are represented by vector
More informationStephen Scott.
1 / 28 ian ian Optimal (Adapted from Ethem Alpaydin and Tom Mitchell) Naïve Nets sscott@cse.unl.edu 2 / 28 ian Optimal Naïve Nets Might have reasons (domain information) to favor some hypotheses/predictions
More informationTable of Contents. Enrollment. Introduction
Enrollment Table of Contents Enrollment Introduction Headcount Summaries by College, Status, Gender, Citizenship, Race, and Level Headcount Enrollment by College, Department, Level, and Status by College,
More informationWHO EpiData. A monthly summary of the epidemiological data on selected Vaccine preventable diseases in the WHO European Region
A monthly summary of the epidemiological data on selected Vaccine preventable diseases in the WHO European Region Table 1: Reported cases for the period January December 2018 (data as of 01 February 2019)
More information6.036 midterm review. Wednesday, March 18, 15
6.036 midterm review 1 Topics covered supervised learning labels available unsupervised learning no labels available semi-supervised learning some labels available - what algorithms have you learned that
More informationWorld Geography Review Syllabus
Purpose Class: World Geography Review Syllabus This course is designed to help students review and remediate their understanding major themes, concepts, and facts connected to the study World Geography.
More informationLanguage Models. Philipp Koehn. 11 September 2018
Language Models Philipp Koehn 11 September 2018 Language models 1 Language models answer the question: How likely is a string of English words good English? Help with reordering p LM (the house is small)
More informationClassification 1: Linear regression of indicators, linear discriminant analysis
Classification 1: Linear regression of indicators, linear discriminant analysis Ryan Tibshirani Data Mining: 36-462/36-662 April 2 2013 Optional reading: ISL 4.1, 4.2, 4.4, ESL 4.1 4.3 1 Classification
More informationStatistical NLP Spring A Discriminative Approach
Statistical NLP Spring 2008 Lecture 6: Classification Dan Klein UC Berkeley A Discriminative Approach View WSD as a discrimination task (regression, really) P(sense context:jail, context:county, context:feeding,
More informationCOMP 328: Machine Learning
COMP 328: Machine Learning Lecture 2: Naive Bayes Classifiers Nevin L. Zhang Department of Computer Science and Engineering The Hong Kong University of Science and Technology Spring 2010 Nevin L. Zhang
More informationModern Information Retrieval
Modern Information Retrieval Chapter 8 Text Classification Introduction A Characterization of Text Classification Unsupervised Algorithms Supervised Algorithms Feature Selection or Dimensionality Reduction
More informationMachine Learning & Data Mining
Group M L D Machine Learning M & Data Mining Chapter 7 Decision Trees Xin-Shun Xu @ SDU School of Computer Science and Technology, Shandong University Top 10 Algorithm in DM #1: C4.5 #2: K-Means #3: SVM
More informationWhere Was Mars At Your Birth?
Where Was Mars At Your Birth? This chart will make it easy for you to determine your Mars sign. We ve listed each of the dates that Mars enters a new sign. If you were born after June 11, 1950, when Mars
More informationNaïve Bayes Classifiers
Naïve Bayes Classifiers Example: PlayTennis (6.9.1) Given a new instance, e.g. (Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong ), we want to compute the most likely hypothesis: v NB
More informationSupport Vector Machine & Its Applications
Support Vector Machine & Its Applications A portion (1/3) of the slides are taken from Prof. Andrew Moore s SVM tutorial at http://www.cs.cmu.edu/~awm/tutorials Mingyue Tan The University of British Columbia
More informationNatural Language Processing. Statistical Inference: n-grams
Natural Language Processing Statistical Inference: n-grams Updated 3/2009 Statistical Inference Statistical Inference consists of taking some data (generated in accordance with some unknown probability
More informationData Mining Classification: Basic Concepts and Techniques. Lecture Notes for Chapter 3. Introduction to Data Mining, 2nd Edition
Data Mining Classification: Basic Concepts and Techniques Lecture Notes for Chapter 3 by Tan, Steinbach, Karpatne, Kumar 1 Classification: Definition Given a collection of records (training set ) Each
More information