University of Iowa. Introduction to Text Analysis. Bryce J. Dietrich. Agenda. Dictionaries. Document-Term Matrices. Preprocessing

1 University of Iowa

2

3 Text and Political Science Post 2000, things have changed... Massive collections of texts are being increasingly used as a data source: Congressional speeches, press releases, newsletters,... Facebook posts, tweets, emails, text messages... Newspapers, magazines, transcripts... Foreign news sources, treaties,... Why? LOTS of unstructured text data (201 billion emails sent and received every day) LOTS of cheap storage: 1956: $10,000 per megabyte. 2016: a fraction of a cent per megabyte. LOTS of methods and software to analyze text

4 Text and Political Science Ultimately, these trends mean... Analyzing text has become bigger, faster, and stronger: Generalizable: one method can be used across a variety of text. Systematic: one method can be used again and again. Cheap: one computer can do a lot, 100 computers can do even more. Analyzing text is still important: Laws Treaties News media Campaigns Petitions Speeches Press Releases

5 What Can We Do?

6 What Can We Do? There are two things we may want to do with this haystack... Understanding the meaning of a sentence or phrase (analyzing a straw of hay): Humans = 1, Computers = 0. Classifying text (organizing the haystack): Humans = 0, Computers = 1.

7 Text = Complex? This bill has seen a long and winding road, but in the process we have worked together. We have not quit. We have worked across the aisle. The final bill has the support of over 370 groups, and they represent those from all over the country and all over the ideological spectrum.

8 Text = Complex? This bill has seen a long and winding road, but in the process we have worked together. We have not quit. We have worked across the aisle. The final bill has the support of over 370 groups, and they represent those from all over the country and all over the ideological spectrum.

9 Text = Complex? This bill has seen a long and winding road, but in the process we have worked together. We have not quit. We have worked across the aisle. The final bill has the support of over 370 groups, and they represent those from all over the country and all over the ideological spectrum.

10 Text = Complex? This bill has seen a long and winding road, but in the process we have worked together. We have not quit. We have worked across the aisle. The final bill has the support of over 370 groups, and they represent those from all over the country and all over the ideological spectrum.

11 Text = Complex? This bill has seen a long and winding road, but in the process we have worked together. We have not quit. We have worked across the aisle. The final bill has the support of over 370 groups, and they represent those from all over the country and all over the ideological spectrum.

12 Text = Simple? Speech by Barbara Mikulski (D-MD) delivered on June 28. Word counts: zika 4, million 4, emergency 3, health 3, act 2, response 2, republican 2, report 1, treat 1, world 1, organization 1, affordable 1

13 Text = Simple? The Republican conference report also doesn't treat Zika like the emergency it is. The World Health Organization declared the Zika virus a public health emergency on February 1. And Zika meets the Budget Act criteria for emergency spending: It is urgent, unforeseen, and temporary. Yet Republicans insisted that we cut $750 million to pay for the response to Zika, including $543 million from the Affordable Care Act, $100 million from the Department of Health and Human Services, HHS, nonrecurring expense fund, and $107 million from Ebola response funds.

14 Big Data Suppose we want to categorize 100 documents... Consider two documents A and B: how many ways can we cluster them? ({A,B}; {A}{B}) = 2. Consider three documents A, B, and C: how many ways can we cluster them? ({A,B,C}; {A,B}{C}; {A,C}{B}; {B,C}{A}; {A}{B}{C}) = 5. Bell(n) = number of ways to partition n objects: Bell(2) = 2, Bell(3) = 5, Bell(5) = 52, etc. Bell(100) is an astronomically large number: even at the speed R can count, counting to Bell(100) would take vastly more seconds than there are in a year. Automated methods can help with even small tasks!
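
The growth of Bell(n) is easy to check directly. Below is a minimal Python sketch (not from the slides; the function name bell is just an illustration) that computes Bell numbers with the Bell-triangle recurrence.

# Compute Bell(n), the number of ways to partition n objects, using the
# Bell triangle: each row starts with the last entry of the previous row,
# and each subsequent entry is the previous entry plus the entry above it.
def bell(n):
    row = [1]
    for _ in range(n - 1):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]

print(bell(2), bell(3), bell(5))  # 2 5 52, matching the slide
print(len(str(bell(100))))        # Bell(100) has well over 100 digits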

15 Principles of Automated Text Analysis (from Justin Grimmer) Principle 1: All Quantitative Models of Language are Wrong, But Some are Useful. Data generation process unknown. Complexity of language: Time flies like an arrow; fruit flies like a banana. Make peace, not war; Make war, not peace. Models necessarily fail to capture language. Principle 2: Quantitative Methods Augment Humans, Not Replace Them. Computer-Assisted Content Analysis: Computers suggest, Humans interpret. Principle 3: There is no Globally Best Method for Automated Text Analysis. Supervised methods = known categories; Unsupervised methods = discover categories. Principle 4: Validate, Validate, Validate

16 Principles of Automated Text Analysis (from Justin Grimmer) Principle 2: Quantitative Methods Augment Humans, Not Replace Them. Computer-Assisted Content Analysis: Computers suggest, Humans interpret

17 Principles of Automated Text Analysis (from Justin Grimmer) Principle 3: There is no Globally Best Method for Automated Text Analysis. Supervised methods = known categories; Unsupervised methods = discover categories

18 Principles of Automated Text Analysis (from Justin Grimmer) Principle 4: Validate, Validate, Validate. Few theorems to guarantee performance. Apply methods, then validate. Do not blindly use methods!

19 Principles of Automated Text Analysis (from Justin Grimmer) Principle 1: All Quantitative Models of Language are Wrong, But Some are Useful. Principle 2: Quantitative Methods Augment Humans, Not Replace Them. Principle 3: There is no Globally Best Method for Automated Text Analysis. Principle 4: Validate, Validate, Validate

20 Types of Classification Problems Topic: What is this text about? - Policy area of legislation - Party agendas Sentiment: What is said in this text? - For or against legislation - Agree or disagree with an argument - A liberal/conservative position Style/Tone: How is it said? - Positive/Negative Emotion

21 Weighted Words Weighted Words!

22 Dictionary Methods

23 Dictionary Methods Many Dictionary Methods (like LIWC) 1) Proprietary wrapped in GUI 2) Basic tasks: a) Count words b) Weight some words more than others c) Some graphics 3) Expensive!

24 Other Dictionaries 1) General Inquirer Database - Stone, P.J., Dunphy, D.C., and Ogilvie, D.M. (1966) The General Inquirer: A Computer Approach to Content Analysis - 1,915 positive words and 2,291 negative words 2) Linguistic Inquiry and Word Count (LIWC) - Creation process: 1) Generate word list for categories: We drew on common emotion rating scales...[then] brain-storming sessions among 3-6 judges were held to generate other words 2) Judge round: (a) Does the word belong? (b) What other categories might it belong to? - [count not reproduced] positive words and 499 negative words

25 Generating New Words Three ways to create dictionaries: - Statistical methods - Manual generation - Theory - Research Assistants a) Grad Students b) Undergraduates c) Mechanical Turkers - Example: {Happy, Unhappy} - Ask Turkers: how happy is elevator, car, pretty, young

26 Applying a Dictionary to NYT Articles Python!
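
The Python demo itself is not reproduced in this transcription. As a rough illustration of what a dictionary method does, the sketch below counts hits from small positive and negative word lists; the word lists and function name are made up for the example, not taken from LIWC or the course materials.

# A toy dictionary method: score a document by counting positive and
# negative word hits after simple normalization.
import re

positive_words = {"support", "together", "hope", "great"}   # illustrative only
negative_words = {"cut", "quit", "emergency", "injustice"}  # illustrative only

def dictionary_score(text):
    words = re.sub(r"\W", " ", text.lower()).split()
    pos = sum(word in positive_words for word in words)
    neg = sum(word in negative_words for word in words)
    return {"positive": pos, "negative": neg, "net": pos - neg}

print(dictionary_score("We have not quit. We have worked together."))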

27 X = N x K matrix - N = Number of documents - K = Number of features
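
One common way to build such a matrix in Python is scikit-learn's CountVectorizer; the snippet below is a sketch of the idea rather than the course's own code, and the example documents are reused from the LDA slides later in the deck.

# Build an N x K document-term matrix: rows are documents, columns are terms.
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "Donald Trump is running for president.",
    "Hillary Clinton is running for president.",
    "Hillary Clinton debated last night.",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)      # sparse N x K matrix
print(X.shape)                               # (N documents, K features)
print(vectorizer.get_feature_names_out())    # the K features (scikit-learn >= 1.0)
print(X.toarray())                           # counts per document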

28 I Have A Dream

# import modules
import requests
from bs4 import BeautifulSoup

# create url
url = "http://avalon.law.yale.edu/20th_century/mlk01.asp"

# create soup
response = requests.get(url)
contents = response.content
soup = BeautifulSoup(contents, "html.parser")

# get speech
speech = ""
lines = soup.find_all("p")
for line in lines:
    speech += line.get_text()

29 Installing Beautiful Soup

# activate our virtual environment
source activate python_UI

# install beautiful soup
python -m pip install beautifulsoup4

30 Using Beautiful Soup

# import modules
import requests
from bs4 import BeautifulSoup

# create url
url = "http://avalon.law.yale.edu/20th_century/mlk01.asp"

# create soup
response = requests.get(url)
contents = response.content
soup = BeautifulSoup(contents, "html.parser")

# print soup
print(soup.prettify()[0:])

31 HTML

<html>
 <head>
  <link href="../css/site.css" rel="stylesheet" type="text/css">
   <title>
    Avalon Project - I have a Dream by Martin Luther King, Jr; August 28, 1963
   </title>
  </link>
 </head>
 <body>
  <div class="HeaderContainer">
   <ul class="HeaderTopTools">
    <li class="Search">

32 HTML Browsers read in the HTML document, parse it into a DOM (Document Object Model) structure, and then render the DOM structure.

33 HTML Browsers read in the HTML document, parse it into a DOM (Document Object Model) structure, and then render the DOM structure.

34 I Have A Dream

35 Finding Text

<p>
I have a dream that one day on the red hills of Georgia the sons of former slaves and the sons of former slave owners will be able to sit down together at a table of brotherhood.
</p>

36 Finding Text

# import modules
import requests
from bs4 import BeautifulSoup

# create url
url = "http://avalon.law.yale.edu/20th_century/mlk01.asp"

# create soup
response = requests.get(url)
contents = response.content
soup = BeautifulSoup(contents, "html.parser")

# get lines
lines = soup.find_all("p")

# get type
print(type(lines))

37 Finding Text This is a bs4.element.ResultSet object, which is simply a collection of tags. Note, this is a bs4 object, so most standard functions will not work.

# inspect first element
print(lines[0])

# output
<p>Delivered on the steps at the Lincoln Memorial in Washington D.C. on August 28, 1963</p>

# get text
print(lines[0].get_text())

# output
Delivered on the steps at the Lincoln Memorial in Washington D.C. on August 28, 1963

38 I Have A Dream

# import modules
import requests
from bs4 import BeautifulSoup

# create url
url = "http://avalon.law.yale.edu/20th_century/mlk01.asp"

# create soup
response = requests.get(url)
contents = response.content
soup = BeautifulSoup(contents, "html.parser")

# get speech
speech = ""
lines = soup.find_all("p")
for line in lines:
    speech += line.get_text()

39 Simplify text in order to make it useful. 1 Remove capitalization, punctuation 2 Discard word order (Bag of Words Assumption) 3 Discard stop words 4 Create equivalence class: Stem, lemmatize, or synonym 5 Discard less useful features 6 Other reduction, specialization

40 Step 1: Removing Capitalization and Punctuation Removing capitalization: - Python: string.lower() - R: tolower(string) Removing punctuation: - Python: re.sub(r"\W", " ", string) - R: gsub("\\W", " ", string)

41 Step 1: Removing Capitalization and Punctuation

# import modules
import re

# create sentence
sentence = "Five score years ago, a great American, in whose symbolic shadow we stand signed the Emancipation Proclamation."
sentence2 = sentence.lower()
sentence3 = re.sub(r"\W", " ", sentence2)

# output
five score years ago  a great american  in whose symbolic shadow we stand signed the emancipation proclamation

42 Step 1: Removing Capitalization and Punctuation

# import modules
import re

# create sentence
sentence = "Five score years ago, a great American, in whose symbolic shadow we stand signed the Emancipation Proclamation."
sentence = re.sub(r"\W", " ", sentence.lower())
print(sentence)

five score years ago  a great american  in whose symbolic shadow we stand signed the emancipation proclamation

43 Step 2: Bag of Words Assumption Assumption: Discard Word Order.

Five score years ago, a great American, in whose symbolic shadow we stand signed the Emancipation Proclamation.
five score years ago a great american in whose symbolic shadow we stand signed the emancipation proclamation

Unigrams (unigram: count): the 101, of 96, to 57, and 44, a 36, be 31, will 26, that 24, is 21, freedom 19, in 19

Bigrams (bigram: count): five score 1, score years 1, years ago 1, ago a 1, a great 1, great american 1, american in 1, in whose 1, whose symbolic 1, symbolic shadow 1, shadow we 1

Trigrams: [table not reproduced in this transcription]

44 Step 2: Bag of Words Assumption

# activate virtual environment
source activate python_UI

# install nltk
python -m pip install nltk

# download stop words
nltk.download('stopwords')

# download wordnet
nltk.download('wordnet')

45 Step 2: Bag of Words Assumption

# import modules
import requests
import re
from bs4 import BeautifulSoup
from nltk import FreqDist
from nltk import word_tokenize
from nltk import bigrams
from nltk import trigrams

46 Step 2: Bag of Words Assumption

# url
url = "http://avalon.law.yale.edu/20th_century/mlk01.asp"

# create soup
response = requests.get(url)
contents = response.content
soup = BeautifulSoup(contents, "html.parser")

# get text
speech = ""
lines = soup.find_all("p")
for line in lines:
    speech += line.get_text()

# remove punctuation
speech = re.sub(r"\W", " ", speech.lower())

47 Step 2: Bag of Words Assumption

# get unigrams
words = word_tokenize(speech.lower())
words_frequency = FreqDist(words)
words_frequency.most_common(10)

# get bigrams
list(bigrams(words))

# get trigrams
list(trigrams(words))
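
The calls above only list the bigrams and trigrams; to get counts like the table on slide 43, FreqDist can be applied to them as well. A small sketch (not shown on the slides):

# count n-grams the same way as unigrams
bigram_frequency = FreqDist(bigrams(words))
bigram_frequency.most_common(10)

trigram_frequency = FreqDist(trigrams(words))
trigram_frequency.most_common(10)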

48 How Could This Possibly Work? Three answers: 1) It might not: Validation is critical (task specific). 2) Central Tendency in Text: Words often imply what a text is about (war, civil, union) or its tone (consecrate, dead, died, lives). Likely to be used repeatedly: create a theme for an article. 3) Human supervision: Inject human judgement (coders): helps methods identify subtle relationships between words and outcomes of interest. Training Sets

49 Step 3: Discard Stopwords Stop Words: English language place-holding words - the, it, if, a, able, at, be, because... - Add noise to documents (without conveying much information) - Discard stop words: focus on substantive words. Be Careful! - she, he, her, his - You may need to customize your stop word list (abbreviations, titles, etc.)

50 Step 3: Discard Stopwords

# import modules
import requests
import re
import string
from bs4 import BeautifulSoup
from nltk import FreqDist
from nltk import word_tokenize
from nltk import bigrams
from nltk import trigrams
from nltk.corpus import stopwords

51 Step 3: Discard Stopwords

# url
url = "http://avalon.law.yale.edu/20th_century/mlk01.asp"

# create soup
response = requests.get(url)
contents = response.content
soup = BeautifulSoup(contents, "html.parser")

# get text
speech = ""
lines = soup.find_all("p")
for line in lines:
    speech += line.get_text()

# remove punctuation
speech = re.sub(r"\W", " ", speech.lower())

52 Step 3: Discard Stopwords

# import modules
import string

# create word list
words = word_tokenize(speech.lower())

# remove stopwords
stopwords = stopwords.words('english')
clean_speech = filter(lambda x: x not in stopwords, words)
clean_speech2 = [word for word in words if word not in stopwords]

53 Step 4: Create an Equivalence Class of Words Reduce dimensionality further

54 Comparing Stemming and Lemmatizing Stemming algorithms: Porter: the most commonly used stemmer. Lancaster: very aggressive; stems may not be interpretable. Snowball (Porter 2): essentially a better version of Porter.

55 Comparing Stemmers

# import modules
from nltk.corpus import stopwords
from nltk.stem.porter import *
from nltk.stem.lancaster import *
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer

56 Comparing Stemmers

# sample text
five score years ago a great american in whose symbolic shadow we stand today signed the emancipation proclamation this momentous decree came as a great beacon light of hope to millions of negro slaves who had been seared in the flames of withering injustice it came as a joyous daybreak to end the long night of their captivity

57 Porter

# use porter
stemmer = PorterStemmer()
porter_text = [stemmer.stem(word) for word in sample_words]

# output
five score year ago a great american in whose symbol shadow we stand today sign the emancip proclam thi moment decre came as a great beacon light of hope to million of negro slave who had been sear in the flame of wither injustic it came as a joyou daybreak to end the long night of their captiv

58 Lancaster

# use lancaster
stemmer = LancasterStemmer()
lancaster_text = [stemmer.stem(word) for word in sample_words]

# output
fiv scor year ago a gre am in whos symbol shadow we stand today sign the emancip proclam thi mom decr cam as a gre beacon light of hop to mil of negro slav who had been sear in the flam of with injust it cam as a joy daybreak to end the long night of their capt

59 Snowball

# use snowball
stemmer = SnowballStemmer("english")
snowball_text = [stemmer.stem(word) for word in sample_words]

# output
five score year ago a great american in whose symbol shadow we stand today sign the emancip proclam this moment decre came as a great beacon light of hope to million of negro slave who had been sear in the flame of wither injustic it came as a joyous daybreak to end the long night of their captiv
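
The imports on slide 55 also bring in WordNetLemmatizer, although no lemmatizing example appears in this transcription. A minimal sketch of how it would be used on the same sample_words list (assuming the wordnet data downloaded earlier):

# use wordnet lemmatizer (illustrative; not shown on the slides)
lemmatizer = WordNetLemmatizer()
wordnet_text = [lemmatizer.lemmatize(word) for word in sample_words]

# unlike stems, lemmas are real words: "slaves" -> "slave", "flames" -> "flame";
# without part-of-speech tags, words like "withering" are left unchanged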

60 Finding Lower Dimensional Embeddings of Text 1) Task: - Embed our documents in a lower dimensional space - Visualize our documents - Inference about similarity - Inference about behavior 2) Supervised Learning: - Predict the values of one or more outputs or response variables $Y = (Y_1, \dots, Y_m)$ for a given set of input or predictor variables $X^T = (X_1, \dots, X_p)$ - $x_i^T = (x_{i1}, \dots, x_{ip})$ denotes the inputs for the $i$-th training case and $\hat{y}_i$ is the response measure. - The student presents an answer $\hat{y}_i$ for each $x_i$ in the training sample, and the supervisor or teacher grades the answer. - Usually this requires some loss function $L(y, \hat{y})$, for example, $L(y, \hat{y}) = (y - \hat{y})^2$.

61 Finding Lower Dimensional Embeddings of Text 3) Unsupervised Learning: - In this case one has a set of $N$ observations $(x_1, x_2, \dots, x_N)$ of a random $p$-vector $X$ having joint density $\Pr(X)$. - The goal is to directly infer the properties of this probability density without the help of a supervisor or teacher. - The dimension of $X$ is sometimes much higher than in supervised learning, and the properties of interest are often more complicated than simple point estimates.

62 k-means Suppose we have a dataset with two variables $x = (1, 2, 3, 4, 5)$ and $y = (1, 2, 3, 4, 5)$: 1 Randomly place $k$ centroids inside the two-dimensional space $(X, Y)$. 2 For each point $(x_i, y_i)$ find the nearest centroid by minimizing some distance measure. 3 Assign each point $(x_i, y_i)$ to cluster $j$. 4 For each cluster $j = 1 \dots K$: create a new centroid $c_j$ using the average across all points $x_i$ and $y_i$. Repeat steps 2-4 until the assignments stop changing (see the sketch below).
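
A compact NumPy sketch of these steps (illustrative code, not the slide's own; the data at the bottom just mirrors the x and y values above):

import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: randomly place k centroids inside the data's bounding box
    centroids = rng.uniform(points.min(axis=0), points.max(axis=0),
                            size=(k, points.shape[1]))
    for _ in range(n_iter):
        # steps 2-3: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # step 4: move each centroid to the average of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels

xy = np.column_stack(([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])).astype(float)
centroids, labels = kmeans(xy, k=2)
print(centroids, labels)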

63 k-means

64 k-means

65 k-means

66 k-means

67 k-means

68 The Expectation-Maximization (EM) algorithm enables parameter estimation in probabilistic models with incomplete data.

69 The Expectation-Maximization (EM) algorithm enables parameter estimation in probabilistic models with incomplete data. Exponential family of distributions: Normal, Exponential, Gamma, Chi-Squared, Beta, Dirichlet (Der-rick-let), Bernoulli, Poisson. $P_{\theta_{t+1}}(X) \geq P_{\theta_t}(X)$

70 The Expectation-Maximization (EM) algorithm enables parameter estimation in probabilistic models with incomplete data, for the exponential family of distributions. However: Not guaranteed to give $\theta_{MLE}$. Overfitting. Slow. Generally, it can't be used for non-exponential distributions.

71 Gaussian Mixture Model Let's assume we have observations $x_1 \dots x_n$: Each $x_i$ is drawn from one of two normal distributions. One of these distributions (red) has a mean of $\mu_{red}$ and a variance of $\sigma^2_{red}$. The other distribution (blue) has a mean of $\mu_{blue}$ and a variance of $\sigma^2_{blue}$. If we know the source of each observation, then estimating $\mu_{red}$, $\mu_{blue}$, $\sigma^2_{red}$, and $\sigma^2_{blue}$ is trivial.

72 Gaussian Mixture Model Let's assume we have observations $x_1 \dots x_n$ drawn from the red distribution:

$$\mu_{red} = \frac{x_1 + x_2 + \dots + x_n}{n_{red}}$$

$$\sigma^2_{red} = \frac{(x_1 - \mu_{red})^2 + \dots + (x_n - \mu_{red})^2}{n_{red}}$$

73 Gaussian Mixture Model Let's assume we have observations $x_1 \dots x_n$ drawn from the blue distribution:

$$\mu_{blue} = \frac{x_1 + x_2 + \dots + x_n}{n_{blue}}$$

$$\sigma^2_{blue} = \frac{(x_1 - \mu_{blue})^2 + \dots + (x_n - \mu_{blue})^2}{n_{blue}}$$

74 Gaussian Mixture Model

75 Gaussian Mixture Model

76 Gaussian Mixture Model However, what if we do not know the source?

77 Gaussian Mixture Model Why don't we just guess?

78 Gaussian Mixture Model Let's assume someone gave you the parameters $\mu_{red}$ and $\sigma^2_{red}$; what is the probability a given point, $x_i$, is from that distribution? (Normal PDF)

$$P(x_i \mid red) = \frac{1}{\sqrt{2\pi\sigma^2_{red}}} \exp\left( -\frac{(x_i - \mu_{red})^2}{2\sigma^2_{red}} \right) \quad (1)$$

79 Gaussian Mixture Model What is the probability the parameters $\mu_{red}$ and $\sigma^2_{red}$ are correct, given point $x_i$? (Bayes Rule)

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \quad (2)$$

$$P(Yes \mid x_i) = \frac{P(x_i \mid Yes)\,P(Yes)}{P(x_i \mid Yes)\,P(Yes) + P(x_i \mid No)\,P(No)} \quad (3)$$

80 Gaussian Mixture Model What is the probability the parameters $\mu_{red}$ and $\sigma^2_{red}$ are correct, given point $x_i$? (Bayes Rule)

$$P(red \mid x_i) = \frac{P(x_i \mid red)\,P(red)}{P(x_i \mid red)\,P(red) + P(x_i \mid blue)\,P(blue)} \quad (4)$$

$$P(blue \mid x_i) = 1 - P(red \mid x_i) \quad (5)$$

81 Gaussian Mixture Model If you knew where the points came from, you could estimate $\mu$ and $\sigma^2$ easily. Unfortunately, you do not know where the points came from. If we knew $\mu_{red}$, $\sigma^2_{red}$, $\mu_{blue}$, and $\sigma^2_{blue}$, we could figure out which distribution the points came from. EM: Start with two randomly placed normal distributions $(\mu_{red}, \sigma^2_{red})$ and $(\mu_{blue}, \sigma^2_{blue})$. For each $x_i$, determine $P(red \mid x_i)$ = the probability that the point was drawn from the red distribution. This is a soft assignment, meaning that each $x_i$ will have two probabilities: $P(red \mid x_i)$ and $P(blue \mid x_i)$. Once this is done, re-estimate $(\mu_{red}, \sigma^2_{red})$ and $(\mu_{blue}, \sigma^2_{blue})$, given what we learned. Iterate until convergence.

82 Gaussian Mixture Model If you knew where the points came from, you could estimate $\mu$ and $\sigma^2$ easily. Unfortunately, you do not know where the points came from. If we knew $\mu_{red}$, $\sigma^2_{red}$, $\mu_{blue}$, and $\sigma^2_{blue}$, we could figure out which distribution the points came from. EM: Start with two randomly placed normal distributions $(\mu_{red}, \sigma^2_{red})$ and $(\mu_{blue}, \sigma^2_{blue})$. For each $x_i$, determine $P(red \mid x_i)$ = the probability that the point was drawn from the red distribution (E-STEP). This is a soft assignment, meaning that each $x_i$ will have two probabilities: $P(red \mid x_i)$ and $P(blue \mid x_i)$. Once this is done, re-estimate $(\mu_{red}, \sigma^2_{red})$ and $(\mu_{blue}, \sigma^2_{blue})$, given what we learned (M-STEP). Iterate until convergence.
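
A NumPy sketch of this two-distribution EM loop, with synthetic data and fixed 50/50 mixing weights as in the worked example that follows (variable names and data are assumptions, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
# synthetic 1-D data from two normals; the algorithm never sees the labels
x = np.concatenate([rng.normal(2.0, 1.0, 100), rng.normal(9.0, 2.0, 100)])

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# start with two randomly placed normal distributions
mu_red, mu_blue = rng.choice(x, size=2, replace=False)
var_red = var_blue = x.var()

for _ in range(50):
    # E-step: soft assignment P(red | x_i) via Bayes rule, with P(red) = P(blue) = .50
    red_lik = normal_pdf(x, mu_red, var_red) * 0.5
    blue_lik = normal_pdf(x, mu_blue, var_blue) * 0.5
    r_red = red_lik / (red_lik + blue_lik)
    r_blue = 1.0 - r_red
    # M-step: re-estimate each mean and variance, weighting points by responsibility
    mu_red = np.sum(r_red * x) / np.sum(r_red)
    var_red = np.sum(r_red * (x - mu_red) ** 2) / np.sum(r_red)
    mu_blue = np.sum(r_blue * x) / np.sum(r_blue)
    var_blue = np.sum(r_blue * (x - mu_blue) ** 2) / np.sum(r_blue)

print(mu_red, var_red, mu_blue, var_blue)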

83 EM Example

84 EM Example To use the EM algorithm, we have to answer several questions: 1 How Likely is Each of the Points to Come From the Red Distribution?

$$P(x_i \mid red) = \frac{1}{\sqrt{2\pi\sigma^2_{red}}} \exp\left( -\frac{(x_i - \mu_{red})^2}{2\sigma^2_{red}} \right)$$

85 EM Example 1 How Likely is Each of the Points to Come From the Red Distribution ($\mu_{red} = 10$, $\sigma^2_{red} = 9$)? [table of $x_i$ and $P(x_i \mid red)$ values not reproduced in this transcription]

86 EM Example 1 How Likely is Each of the Points to Come From the Red Distribution ($\mu_{red} = 10$, $\sigma^2_{red} = 9$)? 2 How Likely is it the Red Distribution is specified correctly, given $x_i$?

$$P(red_i \mid x_i) = \frac{P(x_i \mid red)\,P(red)}{P(x_i \mid red)\,P(red) + P(x_i \mid blue)\,P(blue)}$$

$$P(red_i \mid x_i) = \frac{P(x_i \mid red) \cdot .50}{P(x_i \mid red) \cdot .50 + P(x_i \mid blue) \cdot .50}$$

87 EM Example 1 How Likely is Each of the Points to Come From the Red Distribution ($\mu_{red} = 10$, $\sigma^2_{red} = 9$)? 2 How Likely is it the Red Distribution is specified correctly, given $x_i$? [table of $x_i$ and $P(red_i \mid x_i)$ values not reproduced in this transcription]

88 EM Example 1 How Likely is Each of the Points to Come From the Red Distribution ($\mu_{red} = 10$, $\sigma^2_{red} = 9$)? 2 How Likely is it the Red Distribution is specified correctly, given $x_i$? 3 How Likely is it the Blue Distribution is specified correctly, given $x_i$?

$$P(blue_i \mid x_i) = 1 - P(red_i \mid x_i)$$

89 EM Example How Likely is it the Blue Distribution is specified correctly, given $x_i$? [table of $x_i$ and $P(blue_i \mid x_i)$ values not reproduced in this transcription]

90 EM Example 1 How Likely is Each of the Points to Come From the Red Distribution ($\mu_{red} = 10$, $\sigma^2_{red} = 9$)? 2 How Likely is it the Red Distribution is specified correctly, given $x_i$? 3 How Likely is it the Blue Distribution is specified correctly, given $x_i$? 4 Given what we know, what is the likely red mean ($\mu_{red}$) and variance ($\sigma^2_{red}$)?

$$\mu_{red} = \frac{red_1 x_1 + red_2 x_2 + \dots + red_n x_n}{red_1 + red_2 + \dots + red_n}$$

$$\sigma^2_{red} = \frac{red_1 (x_1 - \mu_{red})^2 + \dots + red_n (x_n - \mu_{red})^2}{red_1 + red_2 + \dots + red_n}$$

91 EM Example 1 How Likely is Each of the Points to Come From the Red Distribution ($\mu_{red} = 10$, $\sigma^2_{red} = 9$)? 2 How Likely is it the Red Distribution is specified correctly, given $x_i$? 3 How Likely is it the Blue Distribution is specified correctly, given $x_i$? 4 Given what we know, what is the likely red mean ($\mu_{red}$) and variance ($\sigma^2_{red}$)? [numeric estimates $\hat{\mu}_{red}$ and $\hat{\sigma}^2_{red}$ not reproduced in this transcription]

92 EM Example 1 How Likely is Each of the Points to Come From the Red Distribution ($\mu_{red} = 10$, $\sigma^2_{red} = 9$)? 2 How Likely is it the Red Distribution is specified correctly, given $x_i$? 3 How Likely is it the Blue Distribution is specified correctly, given $x_i$? 4 Given what we know, what is the likely red mean ($\mu_{red}$) and variance ($\sigma^2_{red}$)? 5 Given what we know, what is the likely blue mean ($\mu_{blue}$) and variance ($\sigma^2_{blue}$)?

$$\mu_{blue} = \frac{blue_1 x_1 + blue_2 x_2 + \dots + blue_n x_n}{blue_1 + blue_2 + \dots + blue_n}$$

$$\sigma^2_{blue} = \frac{blue_1 (x_1 - \mu_{blue})^2 + \dots + blue_n (x_n - \mu_{blue})^2}{blue_1 + blue_2 + \dots + blue_n}$$

93 EM Example 1 How Likely is Each of the Points to Come From the Red Distribution ($\mu_{red} = 10$, $\sigma^2_{red} = 9$)? 2 How Likely is it the Red Distribution is specified correctly, given $x_i$? 3 How Likely is it the Blue Distribution is specified correctly, given $x_i$? 4 Given what we know, what is the likely red mean ($\mu_{red}$) and variance ($\sigma^2_{red}$)? 5 Given what we know, what is the likely blue mean ($\mu_{blue}$) and variance ($\sigma^2_{blue}$)? [numeric estimates $\mu_{blue}$ and $\sigma^2_{blue}$ not reproduced in this transcription]

94 EM Example

95 EM Example

96 Topic and Mixed Membership Models (Grimmer) [diagram linking Doc 1, Doc 2, Doc 3, ..., Doc N to Cluster 1, Cluster 2, ..., Cluster K]

97 Latent Dirichlet Allocation (LDA) Suppose we have the following sentences: Donald Trump is running for president. Donald Trump debated last night. Hillary Clinton is running for president. Hillary Clinton debated last night. Hillary Clinton debated better than Donald Trump last night.

98 Latent Dirichlet Allocation (LDA) Suppose we have the following sentences: Donald Trump is running for president. (50% Topic A, 50% Topic C) Donald Trump debated last night. (50% Topic A, 50% Topic D) Hillary Clinton is running for president. (50% Topic B, 50% Topic C) Hillary Clinton debated last night. (50% Topic B, 50% Topic D) Hillary Clinton debated better than Donald Trump last night. (33% Topic A, 33% Topic B, 33% Topic D)

99 Latent Dirichlet Allocation (LDA) Suppose we have the following sentences: Donald Trump is running for president. (50% Topic A, 50% Topic C) Donald Trump debated last night. (50% Topic A, 50% Topic D) Hillary Clinton is running for president. (50% Topic B, 50% Topic C) Hillary Clinton debated last night. (50% Topic B, 50% Topic D) Hillary Clinton debated better than Donald Trump last night. (33% Topic A, 33% Topic B, 33% Topic D)

100 Latent Dirichlet Allocation (LDA) Suppose we have the following sentences: Donald Trump is running for president. (50% Topic A, 50% Topic C) Donald Trump debated last night. (50% Topic A, 50% Topic D) Hillary Clinton is running for president. (50% Topic B, 50% Topic C) Hillary Clinton debated last night. (50% Topic B, 50% Topic D) Hillary Clinton debated better than Donald Trump last night. (33% Topic A, 33% Topic B, 33% Topic D)

101 Latent Dirichlet Allocation (LDA) Suppose we have the following sentences: Donald Trump is running for president. (50% Topic A, 50% Topic C) Donald Trump debated last night. (50% Topic A, 50% Topic D) Hillary Clinton is running for president. (50% Topic B, 50% Topic C) Hillary Clinton debated last night. (50% Topic B, 50% Topic D) Hillary Clinton debated better than Donald Trump last night. (33% Topic A, 33% Topic B, 33% Topic D)

102 Latent Dirichlet Allocation (LDA) represents documents as mixtures of topics that produce words with certain probabilities. Imagine you are writing an article: 1 How many words, N, will your article have? (Let's assume this follows a Poisson distribution.) 2 What is the mixture of K topics? (Let's assume this follows a Dirichlet distribution.) 3 For each word in your article: First, pick a topic from the distribution outlined above. Second, given the topic you have selected, choose the word that appears the most.

103 Latent Dirichlet Allocation (LDA) represents documents as mixtures of topics that produce words with certain probabilities. Imagine you are writing an article: 1.) Let's assume your article will have 5 words. 2.) Let's assume our topic distribution will be 75% Topic B and 25% Topic A. 3.) Let's assume Hillary Clinton, president, and win appear in 75%, 50%, and 25% of the documents that are included in Topic B, respectively. 4.) Let's assume Donald Trump and president appear in 75% and 50% of the documents that are included in Topic A, respectively.

104 Latent Dirichlet Allocation (LDA) represents documents as mixtures of topics that produce words with certain probabilities. Imagine you are writing an article: 4.) Let's assume Donald Trump and president appear in 75% and 50% of the documents that are included in Topic A, respectively. 1st word comes from Topic B, which then gives you Hillary Clinton. 2nd word comes from Topic A, which then gives you Donald Trump. 3rd word comes from Topic A, which then gives you president. 4th word comes from Topic B, which then gives you president. 5th word comes from Topic B, which then gives you win. Hillary Clinton Donald Trump president president win.
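
For reference, fitting a small LDA model in Python looks roughly like the sketch below (scikit-learn here; the course demo on the later slides uses R). The documents are the example sentences from slide 97; everything else is illustrative.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "Donald Trump is running for president.",
    "Donald Trump debated last night.",
    "Hillary Clinton is running for president.",
    "Hillary Clinton debated last night.",
    "Hillary Clinton debated better than Donald Trump last night.",
]

# document-term matrix, then a 2-topic LDA fit
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(documents)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(dtm)        # per-document topic mixtures

print(doc_topics.round(2))                 # each row sums to 1
print(vectorizer.get_feature_names_out())  # the vocabulary
print(lda.components_.round(2))            # per-topic word weights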

105 Harmonic Mean Wallach et al. (2009) suggest the harmonic mean could be considered a goodness-of-fit test for the model:

$$\frac{1}{M} \sum_{m=1}^{M} \left( \frac{1}{K} \sum_{k=1}^{K} \frac{1}{\theta_{m,k}} \right)^{-1} \quad (6)$$

where $\theta_{m,k}$ is the topic distribution for document $m$ and topic $k$. The degree of differentiation of a distribution $\theta$ over all topics is theoretically captured by the harmonic mean. If the topic distribution has a high probability for only a few topics, then it would have a lower harmonic mean value: the model has a better ability to separate documents into different topics.

106 Perplexity Heinrich (2005) suggests perplexity as a way to prevent overfitting an LDA model. Perplexity is equivalent to the geometric mean per-word likelihood:

$$\text{Perplexity}(w) = \exp\left\{ -\frac{\log(p(w))}{\sum_{d=1}^{D} \sum_{j=1}^{V} n_{jd}} \right\} \quad (7)$$

where $n_{jd}$ denotes how often the $j$-th term occurred in the $d$-th document. Perplexity is essentially the reciprocal geometric mean of the likelihood of the testing data given the trained model $M$. Therefore, a lower perplexity value indicates that the model fits the testing data better.
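
Continuing the hypothetical scikit-learn objects from the earlier LDA sketch, the same quantity can be checked in code; score() returns an approximate log p(w), and dividing by the token count and exponentiating should closely match the built-in perplexity method.

import numpy as np

log_p_w = lda.score(dtm)               # approximate log p(w) for the corpus
n_tokens = dtm.sum()                   # sum of n_jd over documents and terms
print(np.exp(-log_p_w / n_tokens))     # perplexity computed from equation (7)
print(lda.perplexity(dtm))             # scikit-learn's built-in estimate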

107 Entropy The entropy measure can also be used to indicate how the topic distributions differ across models. Higher values indicate that the topic distributions are more evenly spread over the topics.

sapply(fitted_models, function(x)
  mean(apply(posterior(x)$topics, 1, function(z) -sum(z * log(z)))))

108 Entropy The entropy measure can also be used to indicate how the topic distributions differ across models. Higher values indicate that the topic distributions are more evenly spread over the topics.

entropy <- NULL
for (i in 1:length(fitted_models)) {
  temp_topics <- posterior(fitted_models[[i]])$topics
  temp_sums <- NULL
  for (j in 1:NROW(temp_topics)) {
    temp_sums <- c(temp_sums, -sum(temp_topics[j, ] * log(temp_topics[j, ])))
  }
  entropy <- c(entropy, mean(temp_sums))
}

109 Applying LDA to NYT Articles R!

110 Cluster Quality (Grimmer and King 2011) Assessing Cluster Quality with experiments - Goal: group together similar documents - Who knows if similarity measure corresponds with semantic similarity Inject human judgement on pairs of documents Design to assess cluster quality - Sample pairs of documents - Scale: (1) unrelated, (2) loosely related, (3) closely related - Cluster Quality = mean(within cluster) - mean(between clusters) - Select clustering with highest cluster quality

111 What Can We Do?

112 Additional Resources Webscraping: Web_Scraping_with_Beautiful_Soup.html. Machine Learning: The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Hastie et al. Text Analysis: [links not reproduced in this transcription]

113 Questions? Questions?


More information

Deep Sequence Models. Context Representation, Regularization, and Application to Language. Adji Bousso Dieng

Deep Sequence Models. Context Representation, Regularization, and Application to Language. Adji Bousso Dieng Deep Sequence Models Context Representation, Regularization, and Application to Language Adji Bousso Dieng All Data Are Born Sequential Time underlies many interesting human behaviors. Elman, 1990. Why

More information

Lecture 6: Gaussian Mixture Models (GMM)

Lecture 6: Gaussian Mixture Models (GMM) Helsinki Institute for Information Technology Lecture 6: Gaussian Mixture Models (GMM) Pedram Daee 3.11.2015 Outline Gaussian Mixture Models (GMM) Models Model families and parameters Parameter learning

More information

Hidden Markov Models and Gaussian Mixture Models

Hidden Markov Models and Gaussian Mixture Models Hidden Markov Models and Gaussian Mixture Models Hiroshi Shimodaira and Steve Renals Automatic Speech Recognition ASR Lectures 4&5 23&27 January 2014 ASR Lectures 4&5 Hidden Markov Models and Gaussian

More information

Subject CS1 Actuarial Statistics 1 Core Principles

Subject CS1 Actuarial Statistics 1 Core Principles Institute of Actuaries of India Subject CS1 Actuarial Statistics 1 Core Principles For 2019 Examinations Aim The aim of the Actuarial Statistics 1 subject is to provide a grounding in mathematical and

More information

Mixtures of Gaussians continued

Mixtures of Gaussians continued Mixtures of Gaussians continued Machine Learning CSE446 Carlos Guestrin University of Washington May 17, 2013 1 One) bad case for k-means n Clusters may overlap n Some clusters may be wider than others

More information

18.05 Practice Final Exam

18.05 Practice Final Exam No calculators. 18.05 Practice Final Exam Number of problems 16 concept questions, 16 problems. Simplifying expressions Unless asked to explicitly, you don t need to simplify complicated expressions. For

More information

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated

More information

Topic Models. Brandon Malone. February 20, Latent Dirichlet Allocation Success Stories Wrap-up

Topic Models. Brandon Malone. February 20, Latent Dirichlet Allocation Success Stories Wrap-up Much of this material is adapted from Blei 2003. Many of the images were taken from the Internet February 20, 2014 Suppose we have a large number of books. Each is about several unknown topics. How can

More information

Hidden Markov Models

Hidden Markov Models 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Hidden Markov Models Matt Gormley Lecture 19 Nov. 5, 2018 1 Reminders Homework

More information

K-means. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. November 19 th, Carlos Guestrin 1

K-means. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. November 19 th, Carlos Guestrin 1 EM Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University November 19 th, 2007 2005-2007 Carlos Guestrin 1 K-means 1. Ask user how many clusters they d like. e.g. k=5 2. Randomly guess

More information

N-gram Language Modeling Tutorial

N-gram Language Modeling Tutorial N-gram Language Modeling Tutorial Dustin Hillard and Sarah Petersen Lecture notes courtesy of Prof. Mari Ostendorf Outline: Statistical Language Model (LM) Basics n-gram models Class LMs Cache LMs Mixtures

More information

Text Mining for Economics and Finance Latent Dirichlet Allocation

Text Mining for Economics and Finance Latent Dirichlet Allocation Text Mining for Economics and Finance Latent Dirichlet Allocation Stephen Hansen Text Mining Lecture 5 1 / 45 Introduction Recall we are interested in mixed-membership modeling, but that the plsi model

More information

Lecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions

Lecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions DD2431 Autumn, 2014 1 2 3 Classification with Probability Distributions Estimation Theory Classification in the last lecture we assumed we new: P(y) Prior P(x y) Lielihood x2 x features y {ω 1,..., ω K

More information

Lecture 18: Learning probabilistic models

Lecture 18: Learning probabilistic models Lecture 8: Learning probabilistic models Roger Grosse Overview In the first half of the course, we introduced backpropagation, a technique we used to train neural nets to minimize a variety of cost functions.

More information

Machine Learning, Fall 2012 Homework 2

Machine Learning, Fall 2012 Homework 2 0-60 Machine Learning, Fall 202 Homework 2 Instructors: Tom Mitchell, Ziv Bar-Joseph TA in charge: Selen Uguroglu email: sugurogl@cs.cmu.edu SOLUTIONS Naive Bayes, 20 points Problem. Basic concepts, 0

More information

1 Probabilities. 1.1 Basics 1 PROBABILITIES

1 Probabilities. 1.1 Basics 1 PROBABILITIES 1 PROBABILITIES 1 Probabilities Probability is a tricky word usually meaning the likelyhood of something occuring or how frequent something is. Obviously, if something happens frequently, then its probability

More information

COMS 4721: Machine Learning for Data Science Lecture 18, 4/4/2017

COMS 4721: Machine Learning for Data Science Lecture 18, 4/4/2017 COMS 4721: Machine Learning for Data Science Lecture 18, 4/4/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University TOPIC MODELING MODELS FOR TEXT DATA

More information

CSCI 5832 Natural Language Processing. Today 2/19. Statistical Sequence Classification. Lecture 9

CSCI 5832 Natural Language Processing. Today 2/19. Statistical Sequence Classification. Lecture 9 CSCI 5832 Natural Language Processing Jim Martin Lecture 9 1 Today 2/19 Review HMMs for POS tagging Entropy intuition Statistical Sequence classifiers HMMs MaxEnt MEMMs 2 Statistical Sequence Classification

More information

CS Lecture 18. Topic Models and LDA

CS Lecture 18. Topic Models and LDA CS 6347 Lecture 18 Topic Models and LDA (some slides by David Blei) Generative vs. Discriminative Models Recall that, in Bayesian networks, there could be many different, but equivalent models of the same

More information

RECSM Summer School: Facebook + Topic Models. github.com/pablobarbera/big-data-upf

RECSM Summer School: Facebook + Topic Models. github.com/pablobarbera/big-data-upf RECSM Summer School: Facebook + Topic Models Pablo Barberá School of International Relations University of Southern California pablobarbera.com Networked Democracy Lab www.netdem.org Course website: github.com/pablobarbera/big-data-upf

More information

Event A: at least one tail observed A:

Event A: at least one tail observed A: Chapter 3 Probability 3.1 Events, sample space, and probability Basic definitions: An is an act of observation that leads to a single outcome that cannot be predicted with certainty. A (or simple event)

More information

Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch. COMP-599 Oct 1, 2015

Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch. COMP-599 Oct 1, 2015 Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch COMP-599 Oct 1, 2015 Announcements Research skills workshop today 3pm-4:30pm Schulich Library room 313 Start thinking about

More information

AN INTRODUCTION TO TOPIC MODELS

AN INTRODUCTION TO TOPIC MODELS AN INTRODUCTION TO TOPIC MODELS Michael Paul December 4, 2013 600.465 Natural Language Processing Johns Hopkins University Prof. Jason Eisner Making sense of text Suppose you want to learn something about

More information

Replicated Softmax: an Undirected Topic Model. Stephen Turner

Replicated Softmax: an Undirected Topic Model. Stephen Turner Replicated Softmax: an Undirected Topic Model Stephen Turner 1. Introduction 2. Replicated Softmax: A Generative Model of Word Counts 3. Evaluating Replicated Softmax as a Generative Model 4. Experimental

More information

The Noisy Channel Model and Markov Models

The Noisy Channel Model and Markov Models 1/24 The Noisy Channel Model and Markov Models Mark Johnson September 3, 2014 2/24 The big ideas The story so far: machine learning classifiers learn a function that maps a data item X to a label Y handle

More information

Generative Models for Discrete Data

Generative Models for Discrete Data Generative Models for Discrete Data ddebarr@uw.edu 2016-04-21 Agenda Bayesian Concept Learning Beta-Binomial Model Dirichlet-Multinomial Model Naïve Bayes Classifiers Bayesian Concept Learning Numbers

More information