University of Iowa. Introduction to Text Analysis. Bryce J. Dietrich. Agenda. Dictionaries. Document-Term Matrices. Preprocessing

1 University of Iowa

2

3 Text and Political Science Post 2000, things have changed... Massive collections of texts are being increasingly used as a data source: Congressional speeches, press releases, newsletters,... Facebook posts, tweets, emails, text messages... Newspapers, magazines, transcripts... Foreign news sources, treaties,... Why? LOTS of unstructured text data (201 billion emails sent and received every day) LOTS of cheap storage: 1956: $10,000 per megabyte. 2016: a fraction of a cent per megabyte. LOTS of methods and software to analyze text

4 Text and Political Science Ultimately, these trends mean... Analyzing text has become bigger, faster, and stronger: Generalizable: one method can be used across a variety of text. Systematic: one method can be used again and again. Cheap: one computer can do a lot, 100 computers can do even more. Analyzing text is still important: Laws Treaties News media Campaigns Petitions Speeches Press Releases

5 What Can We Do?

6 What Can We Do? There are two things we may want to do with this haystack... Understanding the meaning of a sentence or phrase (analyzing a straw of hay): Humans = 1, Computers = 0. Classifying text (organizing the haystack): Humans = 0, Computers = 1.

7 Text = Complex? This bill has seen a long and winding road, but in the process we have worked together. We have not quit. We have worked across the aisle. The final bill has the support of over 370 groups, and they represent those from all over the country and all over the ideological spectrum.

8 Text = Complex? This bill has seen a long and winding road, but in the process we have worked together. We have not quit. We have worked across the aisle. The final bill has the support of over 370 groups, and they represent those from all over the country and all over the ideological spectrum.

9 Text = Complex? This bill has seen a long and winding road, but in the process we have worked together. We have not quit. We have worked across the aisle. The final bill has the support of over 370 groups, and they represent those from all over the country and all over the ideological spectrum.

10 Text = Complex? This bill has seen a long and winding road, but in the process we have worked together. We have not quit. We have worked across the aisle. The final bill has the support of over 370 groups, and they represent those from all over the country and all over the ideological spectrum.

11 Text = Complex? This bill has seen a long and winding road, but in the process we have worked together. We have not quit. We have worked across the aisle. The final bill has the support of over 370 groups, and they represent those from all over the country and all over the ideological spectrum.

12 Text = Simple? Speech by Barbara Mikulski (D-MD) delivered on June 28. Word counts: zika 4, million 4, emergency 3, health 3, act 2, response 2, republican 2, report 1, treat 1, world 1, organization 1, affordable 1

13 Text = Simple? The Republican conference report also doesn't treat Zika like the emergency it is. The World Health Organization declared the Zika virus a public health emergency on February 1. And Zika meets the Budget Act criteria for emergency spending: It is urgent, unforeseen, and temporary. Yet Republicans insisted that we cut $750 million to pay for the response to Zika, including $543 million from the Affordable Care Act, $100 million from the Department of Health and Human Services, HHS, nonrecurring expense fund, and $107 million from Ebola response funds.

14 Big Data Suppose we want to categorize 100 documents... Consider two documents A and B: how many ways can we cluster them? ({A,B}; {A}{B}) = 2. Consider three documents A, B, and C: how many ways can we cluster them? ({A,B,C}; {A,B}{C}; {A,C}{B}; {B,C}{A}; {A}{B}{C}) = 5. Bell(n) = number of ways to partition n objects: Bell(2) = 2, Bell(3) = 5, Bell(5) = 52, etc. Bell(100) is an astronomically large number: even at the speed R can count, counting to Bell(100) would take vastly more seconds than there are in a year. Automated methods can help with even small tasks!
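
The growth of Bell(n) is easy to check directly. Below is a minimal Python sketch (not from the slides; the function name bell is just an illustration) that computes Bell numbers with the Bell-triangle recurrence.

# Compute Bell(n), the number of ways to partition n objects, using the
# Bell triangle: each row starts with the last entry of the previous row,
# and each subsequent entry is the previous entry plus the entry above it.
def bell(n):
    row = [1]
    for _ in range(n - 1):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]

print(bell(2), bell(3), bell(5))  # 2 5 52, matching the slide
print(len(str(bell(100))))        # Bell(100) has well over 100 digits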

15 Principles of Automated Text Analysis (from Justin Grimmer) Principle 1: All Quantitative Models of Language are Wrong, But Some are Useful. Data generation process unknown. Complexity of language: Time flies like an arrow; fruit flies like a banana. Make peace, not war; Make war, not peace. Models necessarily fail to capture language. Principle 2: Quantitative Methods Augment Humans, Not Replace Them. Computer-Assisted Content Analysis: Computers suggest, Humans interpret. Principle 3: There is no Globally Best Method for Automated Text Analysis. Supervised methods = known categories; Unsupervised methods = discover categories. Principle 4: Validate, Validate, Validate

16 Principles of Automated Text Analysis (from Justin Grimmer) Principle 2: Quantitative Methods Augment Humans, Not Replace Them. Computer-Assisted Content Analysis: Computers suggest, Humans interpret

17 Principles of Automated Text Analysis (from Justin Grimmer) Principle 3: There is no Globally Best Method for Automated Text Analysis. Supervised methods = known categories; Unsupervised methods = discover categories

18 Principles of Automated Text Analysis (from Justin Grimmer) Principle 4: Validate, Validate, Validate. Few theorems to guarantee performance. Apply methods, then validate. Do not blindly use methods!

19 Principles of Automated Text Analysis (from Justin Grimmer) Principle 1: All Quantitative Models of Language are Wrong, But Some are Useful. Principle 2: Quantitative Methods Augment Humans, Not Replace Them. Principle 3: There is no Globally Best Method for Automated Text Analysis. Principle 4: Validate, Validate, Validate

20 Types of Classification Problems Topic: What is this text about? - Policy area of legislation - Party agendas Sentiment: What is said in this text? - For or against legislation - Agree or disagree with an argument - A liberal/conservative position Style/Tone: How is it said? - Positive/Negative Emotion

21 Weighted Words Weighted Words!

22 Dictionary Methods

23 Dictionary Methods Many Dictionary Methods (like LIWC) 1) Proprietary wrapped in GUI 2) Basic tasks: a) Count words b) Weight some words more than others c) Some graphics 3) Expensive!

24 Other Dictionaries 1) General Inquirer Database - Stone, P.J., Dunphy, D.C., and Ogilvie, D.M. (1966) The General Inquirer: A Computer Approach to Content Analysis - 1,915 positive words and 2,291 negative words 2) Linguistic Inquiry and Word Count (LIWC) - Creation process: 1) Generate word list for categories: We drew on common emotion rating scales...[then] brain-storming sessions among 3-6 judges were held to generate other words 2) Judge round: (a) Does the word belong? (b) What other categories might it belong to? - [count not reproduced] positive words and 499 negative words

25 Generating New Words Three ways to create dictionaries: - Statistical methods - Manual generation - Theory - Research Assistants a) Grad Students b) Undergraduates c) Mechanical Turkers - Example: {Happy, Unhappy} - Ask Turkers: how happy is elevator, car, pretty, young

26 Applying a Dictionary to NYT Articles Python!
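
The Python demo itself is not reproduced in this transcription. As a rough illustration of what a dictionary method does, the sketch below counts hits from small positive and negative word lists; the word lists and function name are made up for the example, not taken from LIWC or the course materials.

# A toy dictionary method: score a document by counting positive and
# negative word hits after simple normalization.
import re

positive_words = {"support", "together", "hope", "great"}   # illustrative only
negative_words = {"cut", "quit", "emergency", "injustice"}  # illustrative only

def dictionary_score(text):
    words = re.sub(r"\W", " ", text.lower()).split()
    pos = sum(word in positive_words for word in words)
    neg = sum(word in negative_words for word in words)
    return {"positive": pos, "negative": neg, "net": pos - neg}

print(dictionary_score("We have not quit. We have worked together."))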

27 X = N x K matrix - N = Number of documents - K = Number of features
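
One common way to build such a matrix in Python is scikit-learn's CountVectorizer; the snippet below is a sketch of the idea rather than the course's own code, and the example documents are reused from the LDA slides later in the deck.

# Build an N x K document-term matrix: rows are documents, columns are terms.
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "Donald Trump is running for president.",
    "Hillary Clinton is running for president.",
    "Hillary Clinton debated last night.",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)      # sparse N x K matrix
print(X.shape)                               # (N documents, K features)
print(vectorizer.get_feature_names_out())    # the K features (scikit-learn >= 1.0)
print(X.toarray())                           # counts per document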

28 I Have A Dream

# import modules
import requests
from bs4 import BeautifulSoup

# create url
url = "http://avalon.law.yale.edu/20th_century/mlk01.asp"

# create soup
response = requests.get(url)
contents = response.content
soup = BeautifulSoup(contents, "html.parser")

# get speech
speech = ""
lines = soup.find_all("p")
for line in lines:
    speech += line.get_text()

29 Installing Beautiful Soup

# activate our virtual environment
source activate python_UI

# install beautiful soup
python -m pip install beautifulsoup4

30 Using Beautiful Soup

# import modules
import requests
from bs4 import BeautifulSoup

# create url
url = "http://avalon.law.yale.edu/20th_century/mlk01.asp"

# create soup
response = requests.get(url)
contents = response.content
soup = BeautifulSoup(contents, "html.parser")

# print soup
print(soup.prettify()[0:])

31 HTML

<html>
 <head>
  <link href="../css/site.css" rel="stylesheet" type="text/css">
   <title>
    Avalon Project - I have a Dream by Martin Luther King, Jr; August 28, 1963
   </title>
  </link>
 </head>
 <body>
  <div class="HeaderContainer">
   <ul class="HeaderTopTools">
    <li class="Search">

32 HTML Browsers read in the HTML document, parse it into a DOM (Document Object Model) structure, and then render the DOM structure.

33 HTML Browsers read in the HTML document, parse it into a DOM (Document Object Model) structure, and then render the DOM structure.

34 I Have A Dream

35 Finding Text

<p>
I have a dream that one day on the red hills of Georgia the sons of former slaves and the sons of former slave owners will be able to sit down together at a table of brotherhood.
</p>

36 Finding Text

# import modules
import requests
from bs4 import BeautifulSoup

# create url
url = "http://avalon.law.yale.edu/20th_century/mlk01.asp"

# create soup
response = requests.get(url)
contents = response.content
soup = BeautifulSoup(contents, "html.parser")

# get lines
lines = soup.find_all("p")

# get type
print(type(lines))

37 Finding Text This is a bs4.element.ResultSet object, which is simply a collection of tags. Note, this is a bs4 object, so most standard functions will not work.

# inspect first element
print(lines[0])

# output
<p>Delivered on the steps at the Lincoln Memorial in Washington D.C. on August 28, 1963</p>

# get text
print(lines[0].get_text())

# output
Delivered on the steps at the Lincoln Memorial in Washington D.C. on August 28, 1963

38 I Have A Dream

# import modules
import requests
from bs4 import BeautifulSoup

# create url
url = "http://avalon.law.yale.edu/20th_century/mlk01.asp"

# create soup
response = requests.get(url)
contents = response.content
soup = BeautifulSoup(contents, "html.parser")

# get speech
speech = ""
lines = soup.find_all("p")
for line in lines:
    speech += line.get_text()

39 Simplify text in order to make it useful. 1 Remove capitalization, punctuation 2 Discard word order (Bag of Words Assumption) 3 Discard stop words 4 Create equivalence class: Stem, lemmatize, or synonym 5 Discard less useful features 6 Other reduction, specialization

40 Step 1: Removing Capitalization and Punctuation Removing capitalization: - Python: string.lower() - R: tolower(string) Removing punctuation: - Python: re.sub(r"\W", " ", string) - R: gsub("\\W", " ", string)

41 Step 1: Removing Capitalization and Punctuation

# import modules
import re

# create sentence
sentence = "Five score years ago, a great American, in whose symbolic shadow we stand signed the Emancipation Proclamation."
sentence2 = sentence.lower()
sentence3 = re.sub(r"\W", " ", sentence2)

# output
five score years ago  a great american  in whose symbolic shadow we stand signed the emancipation proclamation

42 Step 1: Removing Capitalization and Punctuation

# import modules
import re

# create sentence
sentence = "Five score years ago, a great American, in whose symbolic shadow we stand signed the Emancipation Proclamation."
sentence = re.sub(r"\W", " ", sentence.lower())
print(sentence)

five score years ago  a great american  in whose symbolic shadow we stand signed the emancipation proclamation

43 Step 2: Bag of Words Assumption Assumption: Discard Word Order.

Five score years ago, a great American, in whose symbolic shadow we stand signed the Emancipation Proclamation.
five score years ago a great american in whose symbolic shadow we stand signed the emancipation proclamation

Unigrams (unigram: count): the 101, of 96, to 57, and 44, a 36, be 31, will 26, that 24, is 21, freedom 19, in 19

Bigrams (bigram: count): five score 1, score years 1, years ago 1, ago a 1, a great 1, great american 1, american in 1, in whose 1, whose symbolic 1, symbolic shadow 1, shadow we 1

Trigrams: [table not reproduced in this transcription]

44 Step 2: Bag of Words Assumption

# activate virtual environment
source activate python_UI

# install nltk
python -m pip install nltk

# download stop words
nltk.download('stopwords')

# download wordnet
nltk.download('wordnet')

45 Step 2: Bag of Words Assumption

# import modules
import requests
import re
from bs4 import BeautifulSoup
from nltk import FreqDist
from nltk import word_tokenize
from nltk import bigrams
from nltk import trigrams

46 Step 2: Bag of Words Assumption

# url
url = "http://avalon.law.yale.edu/20th_century/mlk01.asp"

# create soup
response = requests.get(url)
contents = response.content
soup = BeautifulSoup(contents, "html.parser")

# get text
speech = ""
lines = soup.find_all("p")
for line in lines:
    speech += line.get_text()

# remove punctuation
speech = re.sub(r"\W", " ", speech.lower())

47 Step 2: Bag of Words Assumption

# get unigrams
words = word_tokenize(speech.lower())
words_frequency = FreqDist(words)
words_frequency.most_common(10)

# get bigrams
list(bigrams(words))

# get trigrams
list(trigrams(words))
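
The calls above only list the bigrams and trigrams; to get counts like the table on slide 43, FreqDist can be applied to them as well. A small sketch (not shown on the slides):

# count n-grams the same way as unigrams
bigram_frequency = FreqDist(bigrams(words))
bigram_frequency.most_common(10)

trigram_frequency = FreqDist(trigrams(words))
trigram_frequency.most_common(10)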

48 How Could This Possibly Work? Three answers: 1) It might not: Validation is critical (task specific). 2) Central Tendency in Text: Words often imply what a text is about (war, civil, union) or its tone (consecrate, dead, died, lives). Likely to be used repeatedly: create a theme for an article. 3) Human supervision: Inject human judgement (coders): helps methods identify subtle relationships between words and outcomes of interest. Training Sets

49 Step 3: Discard Stopwords Stop Words: English language place-holding words - the, it, if, a, able, at, be, because... - Add noise to documents (without conveying much information) - Discard stop words: focus on substantive words. Be Careful! - she, he, her, his - You may need to customize your stop word list (abbreviations, titles, etc.)

50 Step 3: Discard Stopwords

# import modules
import requests
import re
import string
from bs4 import BeautifulSoup
from nltk import FreqDist
from nltk import word_tokenize
from nltk import bigrams
from nltk import trigrams
from nltk.corpus import stopwords

51 Step 3: Discard Stopwords

# url
url = "http://avalon.law.yale.edu/20th_century/mlk01.asp"

# create soup
response = requests.get(url)
contents = response.content
soup = BeautifulSoup(contents, "html.parser")

# get text
speech = ""
lines = soup.find_all("p")
for line in lines:
    speech += line.get_text()

# remove punctuation
speech = re.sub(r"\W", " ", speech.lower())

52 Step 3: Discard Stopwords

# import modules
import string

# create word list
words = word_tokenize(speech.lower())

# remove stopwords
stopwords = stopwords.words('english')
clean_speech = filter(lambda x: x not in stopwords, words)
clean_speech2 = [word for word in words if word not in stopwords]

53 Step 4: Create an Equivalence Class of Words Reduce dimensionality further

54 Comparing Stemming and Lemmatizing Stemming algorithms: Porter: the most commonly used stemmer. Lancaster: very aggressive; stems may not be interpretable. Snowball (Porter 2): essentially a better version of Porter.

55 Comparing Stemmers

# import modules
from nltk.corpus import stopwords
from nltk.stem.porter import *
from nltk.stem.lancaster import *
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer

56 Comparing Stemmers

# sample text
five score years ago a great american in whose symbolic shadow we stand today signed the emancipation proclamation this momentous decree came as a great beacon light of hope to millions of negro slaves who had been seared in the flames of withering injustice it came as a joyous daybreak to end the long night of their captivity

57 Porter

# use porter
stemmer = PorterStemmer()
porter_text = [stemmer.stem(word) for word in sample_words]

# output
five score year ago a great american in whose symbol shadow we stand today sign the emancip proclam thi moment decre came as a great beacon light of hope to million of negro slave who had been sear in the flame of wither injustic it came as a joyou daybreak to end the long night of their captiv

58 Lancaster

# use lancaster
stemmer = LancasterStemmer()
lancaster_text = [stemmer.stem(word) for word in sample_words]

# output
fiv scor year ago a gre am in whos symbol shadow we stand today sign the emancip proclam thi mom decr cam as a gre beacon light of hop to mil of negro slav who had been sear in the flam of with injust it cam as a joy daybreak to end the long night of their capt

59 Snowball

# use snowball
stemmer = SnowballStemmer("english")
snowball_text = [stemmer.stem(word) for word in sample_words]

# output
five score year ago a great american in whose symbol shadow we stand today sign the emancip proclam this moment decre came as a great beacon light of hope to million of negro slave who had been sear in the flame of wither injustic it came as a joyous daybreak to end the long night of their captiv
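
The imports on slide 55 also bring in WordNetLemmatizer, although no lemmatizing example appears in this transcription. A minimal sketch of how it would be used on the same sample_words list (assuming the wordnet data downloaded earlier):

# use wordnet lemmatizer (illustrative; not shown on the slides)
lemmatizer = WordNetLemmatizer()
wordnet_text = [lemmatizer.lemmatize(word) for word in sample_words]

# unlike stems, lemmas are real words: "slaves" -> "slave", "flames" -> "flame";
# without part-of-speech tags, words like "withering" are left unchanged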

60 Finding Lower Dimensional Embeddings of Text 1) Task: - Embed our documents in a lower dimensional space - Visualize our documents - Inference about similarity - Inference about behavior 2) Supervised Learning: - Predict the values of one or more outputs or response variables $Y = (Y_1, \dots, Y_m)$ for a given set of input or predictor variables $X^T = (X_1, \dots, X_p)$ - $x_i^T = (x_{i1}, \dots, x_{ip})$ denotes the inputs for the $i$-th training case and $\hat{y}_i$ is the response measure. - The student presents an answer $\hat{y}_i$ for each $x_i$ in the training sample, and the supervisor or teacher grades the answer. - Usually this requires some loss function $L(y, \hat{y})$, for example, $L(y, \hat{y}) = (y - \hat{y})^2$.

61 Finding Lower Dimensional Embeddings of Text 3) Unsupervised Learning: - In this case one has a set of $N$ observations $(x_1, x_2, \dots, x_N)$ of a random $p$-vector $X$ having joint density $\Pr(X)$. - The goal is to directly infer the properties of this probability density without the help of a supervisor or teacher. - The dimension of $X$ is sometimes much higher than in supervised learning, and the properties of interest are often more complicated than simple point estimates.

62 k-means Suppose we have a dataset with two variables $x = (1, 2, 3, 4, 5)$ and $y = (1, 2, 3, 4, 5)$: 1 Randomly place $k$ centroids inside the two-dimensional space $(X, Y)$. 2 For each point $(x_i, y_i)$ find the nearest centroid by minimizing some distance measure. 3 Assign each point $(x_i, y_i)$ to cluster $j$. 4 For each cluster $j = 1 \dots K$: create a new centroid $c_j$ using the average across all points $x_i$ and $y_i$. Repeat steps 2-4 until the assignments stop changing (see the sketch below).
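
A compact NumPy sketch of these steps (illustrative code, not the slide's own; the data at the bottom just mirrors the x and y values above):

import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: randomly place k centroids inside the data's bounding box
    centroids = rng.uniform(points.min(axis=0), points.max(axis=0),
                            size=(k, points.shape[1]))
    for _ in range(n_iter):
        # steps 2-3: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # step 4: move each centroid to the average of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels

xy = np.column_stack(([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])).astype(float)
centroids, labels = kmeans(xy, k=2)
print(centroids, labels)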

63 k-means

64 k-means

65 k-means

66 k-means

67 k-means

68 The Expectation-Maximization (EM) algorithm enables parameter estimation in probabilistic models with incomplete data.

69 The Expectation-Maximization (EM) algorithm enables parameter estimation in probabilistic models with incomplete data. Exponential family of distributions: Normal, Exponential, Gamma, Chi-Squared, Beta, Dirichlet (Der-rick-let), Bernoulli, Poisson. $P_{\theta_{t+1}}(X) \geq P_{\theta_t}(X)$

70 The Expectation-Maximization (EM) algorithm enables parameter estimation in probabilistic models with incomplete data, for the exponential family of distributions. However: Not guaranteed to give $\theta_{MLE}$. Overfitting. Slow. Generally, it can't be used for non-exponential distributions.

71 Gaussian Mixture Model Let's assume we have observations $x_1 \dots x_n$: Each $x_i$ is drawn from one of two normal distributions. One of these distributions (red) has a mean of $\mu_{red}$ and a variance of $\sigma^2_{red}$. The other distribution (blue) has a mean of $\mu_{blue}$ and a variance of $\sigma^2_{blue}$. If we know the source of each observation, then estimating $\mu_{red}$, $\mu_{blue}$, $\sigma^2_{red}$, and $\sigma^2_{blue}$ is trivial.

72 Gaussian Mixture Model Let's assume we have observations $x_1 \dots x_n$ drawn from the red distribution:

$$\mu_{red} = \frac{x_1 + x_2 + \dots + x_n}{n_{red}}$$

$$\sigma^2_{red} = \frac{(x_1 - \mu_{red})^2 + \dots + (x_n - \mu_{red})^2}{n_{red}}$$

73 Gaussian Mixture Model Let's assume we have observations $x_1 \dots x_n$ drawn from the blue distribution:

$$\mu_{blue} = \frac{x_1 + x_2 + \dots + x_n}{n_{blue}}$$

$$\sigma^2_{blue} = \frac{(x_1 - \mu_{blue})^2 + \dots + (x_n - \mu_{blue})^2}{n_{blue}}$$

74 Gaussian Mixture Model

75 Gaussian Mixture Model

76 Gaussian Mixture Model However, what if we do not know the source?

77 Gaussian Mixture Model Why don't we just guess?

78 Gaussian Mixture Model Let's assume someone gave you the parameters $\mu_{red}$ and $\sigma^2_{red}$; what is the probability a given point, $x_i$, is from that distribution? (Normal PDF)

$$P(x_i \mid red) = \frac{1}{\sqrt{2\pi\sigma^2_{red}}} \exp\left( -\frac{(x_i - \mu_{red})^2}{2\sigma^2_{red}} \right) \quad (1)$$

79 Gaussian Mixture Model What is the probability the parameters $\mu_{red}$ and $\sigma^2_{red}$ are correct, given point $x_i$? (Bayes Rule)

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \quad (2)$$

$$P(Yes \mid x_i) = \frac{P(x_i \mid Yes)\,P(Yes)}{P(x_i \mid Yes)\,P(Yes) + P(x_i \mid No)\,P(No)} \quad (3)$$

80 Gaussian Mixture Model What is the probability the parameters $\mu_{red}$ and $\sigma^2_{red}$ are correct, given point $x_i$? (Bayes Rule)

$$P(red \mid x_i) = \frac{P(x_i \mid red)\,P(red)}{P(x_i \mid red)\,P(red) + P(x_i \mid blue)\,P(blue)} \quad (4)$$

$$P(blue \mid x_i) = 1 - P(red \mid x_i) \quad (5)$$

81 Gaussian Mixture Model If you knew where the points came from, you could estimate $\mu$ and $\sigma^2$ easily. Unfortunately, you do not know where the points came from. If we knew $\mu_{red}$, $\sigma^2_{red}$, $\mu_{blue}$, and $\sigma^2_{blue}$, we could figure out which distribution the points came from. EM: Start with two randomly placed normal distributions $(\mu_{red}, \sigma^2_{red})$ and $(\mu_{blue}, \sigma^2_{blue})$. For each $x_i$, determine $P(red \mid x_i)$ = the probability that the point was drawn from the red distribution. This is a soft assignment, meaning that each $x_i$ will have two probabilities: $P(red \mid x_i)$ and $P(blue \mid x_i)$. Once this is done, re-estimate $(\mu_{red}, \sigma^2_{red})$ and $(\mu_{blue}, \sigma^2_{blue})$, given what we learned. Iterate until convergence.

82 Gaussian Mixture Model If you knew where the points came from, you could estimate $\mu$ and $\sigma^2$ easily. Unfortunately, you do not know where the points came from. If we knew $\mu_{red}$, $\sigma^2_{red}$, $\mu_{blue}$, and $\sigma^2_{blue}$, we could figure out which distribution the points came from. EM: Start with two randomly placed normal distributions $(\mu_{red}, \sigma^2_{red})$ and $(\mu_{blue}, \sigma^2_{blue})$. For each $x_i$, determine $P(red \mid x_i)$ = the probability that the point was drawn from the red distribution (E-STEP). This is a soft assignment, meaning that each $x_i$ will have two probabilities: $P(red \mid x_i)$ and $P(blue \mid x_i)$. Once this is done, re-estimate $(\mu_{red}, \sigma^2_{red})$ and $(\mu_{blue}, \sigma^2_{blue})$, given what we learned (M-STEP). Iterate until convergence.
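
A NumPy sketch of this two-distribution EM loop, with synthetic data and fixed 50/50 mixing weights as in the worked example that follows (variable names and data are assumptions, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
# synthetic 1-D data from two normals; the algorithm never sees the labels
x = np.concatenate([rng.normal(2.0, 1.0, 100), rng.normal(9.0, 2.0, 100)])

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# start with two randomly placed normal distributions
mu_red, mu_blue = rng.choice(x, size=2, replace=False)
var_red = var_blue = x.var()

for _ in range(50):
    # E-step: soft assignment P(red | x_i) via Bayes rule, with P(red) = P(blue) = .50
    red_lik = normal_pdf(x, mu_red, var_red) * 0.5
    blue_lik = normal_pdf(x, mu_blue, var_blue) * 0.5
    r_red = red_lik / (red_lik + blue_lik)
    r_blue = 1.0 - r_red
    # M-step: re-estimate each mean and variance, weighting points by responsibility
    mu_red = np.sum(r_red * x) / np.sum(r_red)
    var_red = np.sum(r_red * (x - mu_red) ** 2) / np.sum(r_red)
    mu_blue = np.sum(r_blue * x) / np.sum(r_blue)
    var_blue = np.sum(r_blue * (x - mu_blue) ** 2) / np.sum(r_blue)

print(mu_red, var_red, mu_blue, var_blue)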

83 EM Example

84 EM Example To use the EM algorithm, we have to answer several questions: 1 How Likely is Each of the Points to Come From the Red Distribution?

$$P(x_i \mid red) = \frac{1}{\sqrt{2\pi\sigma^2_{red}}} \exp\left( -\frac{(x_i - \mu_{red})^2}{2\sigma^2_{red}} \right)$$

85 EM Example 1 How Likely is Each of the Points to Come From the Red Distribution ($\mu_{red} = 10$, $\sigma^2_{red} = 9$)? [table of $x_i$ and $P(x_i \mid red)$ values not reproduced in this transcription]

86 EM Example 1 How Likely is Each of the Points to Come From the Red Distribution ($\mu_{red} = 10$, $\sigma^2_{red} = 9$)? 2 How Likely is it the Red Distribution is specified correctly, given $x_i$?

$$P(red_i \mid x_i) = \frac{P(x_i \mid red)\,P(red)}{P(x_i \mid red)\,P(red) + P(x_i \mid blue)\,P(blue)}$$

$$P(red_i \mid x_i) = \frac{P(x_i \mid red) \cdot .50}{P(x_i \mid red) \cdot .50 + P(x_i \mid blue) \cdot .50}$$

87 EM Example 1 How Likely is Each of the Points to Come From the Red Distribution ($\mu_{red} = 10$, $\sigma^2_{red} = 9$)? 2 How Likely is it the Red Distribution is specified correctly, given $x_i$? [table of $x_i$ and $P(red_i \mid x_i)$ values not reproduced in this transcription]

88 EM Example 1 How Likely is Each of the Points to Come From the Red Distribution ($\mu_{red} = 10$, $\sigma^2_{red} = 9$)? 2 How Likely is it the Red Distribution is specified correctly, given $x_i$? 3 How Likely is it the Blue Distribution is specified correctly, given $x_i$?

$$P(blue_i \mid x_i) = 1 - P(red_i \mid x_i)$$

89 EM Example How Likely is it the Blue Distribution is specified correctly, given $x_i$? [table of $x_i$ and $P(blue_i \mid x_i)$ values not reproduced in this transcription]

90 EM Example 1 How Likely is Each of the Points to Come From the Red Distribution ($\mu_{red} = 10$, $\sigma^2_{red} = 9$)? 2 How Likely is it the Red Distribution is specified correctly, given $x_i$? 3 How Likely is it the Blue Distribution is specified correctly, given $x_i$? 4 Given what we know, what is the likely red mean ($\mu_{red}$) and variance ($\sigma^2_{red}$)?

$$\mu_{red} = \frac{red_1 x_1 + red_2 x_2 + \dots + red_n x_n}{red_1 + red_2 + \dots + red_n}$$

$$\sigma^2_{red} = \frac{red_1 (x_1 - \mu_{red})^2 + \dots + red_n (x_n - \mu_{red})^2}{red_1 + red_2 + \dots + red_n}$$

91 EM Example 1 How Likely is Each of the Points to Come From the Red Distribution ($\mu_{red} = 10$, $\sigma^2_{red} = 9$)? 2 How Likely is it the Red Distribution is specified correctly, given $x_i$? 3 How Likely is it the Blue Distribution is specified correctly, given $x_i$? 4 Given what we know, what is the likely red mean ($\mu_{red}$) and variance ($\sigma^2_{red}$)? [numeric estimates $\hat{\mu}_{red}$ and $\hat{\sigma}^2_{red}$ not reproduced in this transcription]

92 EM Example 1 How Likely is Each of the Points to Come From the Red Distribution ($\mu_{red} = 10$, $\sigma^2_{red} = 9$)? 2 How Likely is it the Red Distribution is specified correctly, given $x_i$? 3 How Likely is it the Blue Distribution is specified correctly, given $x_i$? 4 Given what we know, what is the likely red mean ($\mu_{red}$) and variance ($\sigma^2_{red}$)? 5 Given what we know, what is the likely blue mean ($\mu_{blue}$) and variance ($\sigma^2_{blue}$)?

$$\mu_{blue} = \frac{blue_1 x_1 + blue_2 x_2 + \dots + blue_n x_n}{blue_1 + blue_2 + \dots + blue_n}$$

$$\sigma^2_{blue} = \frac{blue_1 (x_1 - \mu_{blue})^2 + \dots + blue_n (x_n - \mu_{blue})^2}{blue_1 + blue_2 + \dots + blue_n}$$

93 EM Example 1 How Likely is Each of the Points to Come From the Red Distribution ($\mu_{red} = 10$, $\sigma^2_{red} = 9$)? 2 How Likely is it the Red Distribution is specified correctly, given $x_i$? 3 How Likely is it the Blue Distribution is specified correctly, given $x_i$? 4 Given what we know, what is the likely red mean ($\mu_{red}$) and variance ($\sigma^2_{red}$)? 5 Given what we know, what is the likely blue mean ($\mu_{blue}$) and variance ($\sigma^2_{blue}$)? [numeric estimates $\mu_{blue}$ and $\sigma^2_{blue}$ not reproduced in this transcription]

94 EM Example

95 EM Example

96 Topic and Mixed Membership Models (Grimmer) [diagram linking Doc 1, Doc 2, Doc 3, ..., Doc N to Cluster 1, Cluster 2, ..., Cluster K]

97 Latent Dirichlet Allocation (LDA) Suppose we have the following sentences: Donald Trump is running for president. Donald Trump debated last night. Hillary Clinton is running for president. Hillary Clinton debated last night. Hillary Clinton debated better than Donald Trump last night.

98 Latent Dirichlet Allocation (LDA) Suppose we have the following sentences: Donald Trump is running for president. (50% Topic A, 50% Topic C) Donald Trump debated last night. (50% Topic A, 50% Topic D) Hillary Clinton is running for president. (50% Topic B, 50% Topic C) Hillary Clinton debated last night. (50% Topic B, 50% Topic D) Hillary Clinton debated better than Donald Trump last night. (33% Topic A, 33% Topic B, 33% Topic D)

99 Latent Dirichlet Allocation (LDA) Suppose we have the following sentences: Donald Trump is running for president. (50% Topic A, 50% Topic C) Donald Trump debated last night. (50% Topic A, 50% Topic D) Hillary Clinton is running for president. (50% Topic B, 50% Topic C) Hillary Clinton debated last night. (50% Topic B, 50% Topic D) Hillary Clinton debated better than Donald Trump last night. (33% Topic A, 33% Topic B, 33% Topic D)

100 Latent Dirichlet Allocation (LDA) Suppose we have the following sentences: Donald Trump is running for president. (50% Topic A, 50% Topic C) Donald Trump debated last night. (50% Topic A, 50% Topic D) Hillary Clinton is running for president. (50% Topic B, 50% Topic C) Hillary Clinton debated last night. (50% Topic B, 50% Topic D) Hillary Clinton debated better than Donald Trump last night. (33% Topic A, 33% Topic B, 33% Topic D)

101 Latent Dirichlet Allocation (LDA) Suppose we have the following sentences: Donald Trump is running for president. (50% Topic A, 50% Topic C) Donald Trump debated last night. (50% Topic A, 50% Topic D) Hillary Clinton is running for president. (50% Topic B, 50% Topic C) Hillary Clinton debated last night. (50% Topic B, 50% Topic D) Hillary Clinton debated better than Donald Trump last night. (33% Topic A, 33% Topic B, 33% Topic D)

102 Latent Dirichlet Allocation (LDA) represents documents as mixtures of topics that produce words with certain probabilities. Imagine you are writing an article: 1 How many words, N, will your article have? (Let's assume this follows a Poisson distribution.) 2 What is the mixture of K topics? (Let's assume this follows a Dirichlet distribution.) 3 For each word in your article: First, pick a topic from the distribution outlined above. Second, given the topic you have selected, choose the word that appears the most.

103 Latent Dirichlet Allocation (LDA) represents documents as mixtures of topics that produce words with certain probabilities. Imagine you are writing an article: 1.) Let's assume your article will have 5 words. 2.) Let's assume our topic distribution will be 75% Topic B and 25% Topic A. 3.) Let's assume Hillary Clinton, president, and win appear in 75%, 50%, and 25% of the documents that are included in Topic B, respectively. 4.) Let's assume Donald Trump and president appear in 75% and 50% of the documents that are included in Topic A, respectively.

104 Latent Dirichlet Allocation (LDA) represents documents as mixtures of topics that produce words with certain probabilities. Imagine you are writing an article: 4.) Let's assume Donald Trump and president appear in 75% and 50% of the documents that are included in Topic A, respectively. 1st word comes from Topic B, which then gives you Hillary Clinton. 2nd word comes from Topic A, which then gives you Donald Trump. 3rd word comes from Topic A, which then gives you president. 4th word comes from Topic B, which then gives you president. 5th word comes from Topic B, which then gives you win. Hillary Clinton Donald Trump president president win.
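
For reference, fitting a small LDA model in Python looks roughly like the sketch below (scikit-learn here; the course demo on the later slides uses R). The documents are the example sentences from slide 97; everything else is illustrative.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "Donald Trump is running for president.",
    "Donald Trump debated last night.",
    "Hillary Clinton is running for president.",
    "Hillary Clinton debated last night.",
    "Hillary Clinton debated better than Donald Trump last night.",
]

# document-term matrix, then a 2-topic LDA fit
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(documents)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(dtm)        # per-document topic mixtures

print(doc_topics.round(2))                 # each row sums to 1
print(vectorizer.get_feature_names_out())  # the vocabulary
print(lda.components_.round(2))            # per-topic word weights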

105 Harmonic Mean Wallach et al. (2009) suggest the harmonic mean could be considered a goodness-of-fit test for the model:

$$\frac{1}{M} \sum_{m=1}^{M} \left( \frac{1}{K} \sum_{k=1}^{K} \frac{1}{\theta_{m,k}} \right)^{-1} \quad (6)$$

where $\theta_{m,k}$ is the topic distribution for document $m$ and topic $k$. The degree of differentiation of a distribution $\theta$ over all topics is theoretically captured by the harmonic mean. If the topic distribution has a high probability for only a few topics, then it would have a lower harmonic mean value: the model has a better ability to separate documents into different topics.

106 Perplexity Heinrich (2005) suggests perplexity as a way to prevent overfitting an LDA model. Perplexity is equivalent to the geometric mean per-word likelihood:

$$\text{Perplexity}(w) = \exp\left\{ -\frac{\log(p(w))}{\sum_{d=1}^{D} \sum_{j=1}^{V} n_{jd}} \right\} \quad (7)$$

where $n_{jd}$ denotes how often the $j$-th term occurred in the $d$-th document. Perplexity is essentially the reciprocal geometric mean of the likelihood of the testing data given the trained model $M$. Therefore, a lower perplexity value indicates that the model fits the testing data better.
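
Continuing the hypothetical scikit-learn objects from the earlier LDA sketch, the same quantity can be checked in code; score() returns an approximate log p(w), and dividing by the token count and exponentiating should closely match the built-in perplexity method.

import numpy as np

log_p_w = lda.score(dtm)               # approximate log p(w) for the corpus
n_tokens = dtm.sum()                   # sum of n_jd over documents and terms
print(np.exp(-log_p_w / n_tokens))     # perplexity computed from equation (7)
print(lda.perplexity(dtm))             # scikit-learn's built-in estimate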

107 Entropy The entropy measure can also be used to indicate how the topic distributions differ across models. Higher values indicate that the topic distributions are more evenly spread over the topics.

sapply(fitted_models, function(x)
  mean(apply(posterior(x)$topics, 1, function(z) -sum(z * log(z)))))

108 Entropy The entropy measure can also be used to indicate how the topic distributions differ across models. Higher values indicate that the topic distributions are more evenly spread over the topics.

entropy <- NULL
for (i in 1:length(fitted_models)) {
  temp_topics <- posterior(fitted_models[[i]])$topics
  temp_sums <- NULL
  for (j in 1:NROW(temp_topics)) {
    temp_sums <- c(temp_sums, -sum(temp_topics[j, ] * log(temp_topics[j, ])))
  }
  entropy <- c(entropy, mean(temp_sums))
}

109 Applying LDA to NYT Articles R!

110 Cluster Quality (Grimmer and King 2011) Assessing Cluster Quality with experiments - Goal: group together similar documents - Who knows if similarity measure corresponds with semantic similarity Inject human judgement on pairs of documents Design to assess cluster quality - Sample pairs of documents - Scale: (1) unrelated, (2) loosely related, (3) closely related - Cluster Quality = mean(within cluster) - mean(between clusters) - Select clustering with highest cluster quality

111 What Can We Do?

112 Additional Resources Webscraping: Web_Scraping_with_Beautiful_Soup.html. Machine Learning: The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Hastie et al. Text Analysis: [links not reproduced in this transcription]

113 Questions? Questions?


More information

Deep Sequence Models. Context Representation, Regularization, and Application to Language. Adji Bousso Dieng

Deep Sequence Models. Context Representation, Regularization, and Application to Language. Adji Bousso Dieng Deep Sequence Models Context Representation, Regularization, and Application to Language Adji Bousso Dieng All Data Are Born Sequential Time underlies many interesting human behaviors. Elman, 1990. Why

More information

Lecture 6: Gaussian Mixture Models (GMM)

Lecture 6: Gaussian Mixture Models (GMM) Helsinki Institute for Information Technology Lecture 6: Gaussian Mixture Models (GMM) Pedram Daee 3.11.2015 Outline Gaussian Mixture Models (GMM) Models Model families and parameters Parameter learning

More information

Hidden Markov Models and Gaussian Mixture Models

Hidden Markov Models and Gaussian Mixture Models Hidden Markov Models and Gaussian Mixture Models Hiroshi Shimodaira and Steve Renals Automatic Speech Recognition ASR Lectures 4&5 23&27 January 2014 ASR Lectures 4&5 Hidden Markov Models and Gaussian

More information

Subject CS1 Actuarial Statistics 1 Core Principles

Subject CS1 Actuarial Statistics 1 Core Principles Institute of Actuaries of India Subject CS1 Actuarial Statistics 1 Core Principles For 2019 Examinations Aim The aim of the Actuarial Statistics 1 subject is to provide a grounding in mathematical and

More information

Mixtures of Gaussians continued

Mixtures of Gaussians continued Mixtures of Gaussians continued Machine Learning CSE446 Carlos Guestrin University of Washington May 17, 2013 1 One) bad case for k-means n Clusters may overlap n Some clusters may be wider than others

More information

18.05 Practice Final Exam

18.05 Practice Final Exam No calculators. 18.05 Practice Final Exam Number of problems 16 concept questions, 16 problems. Simplifying expressions Unless asked to explicitly, you don t need to simplify complicated expressions. For

More information

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated

More information

Topic Models. Brandon Malone. February 20, Latent Dirichlet Allocation Success Stories Wrap-up

Topic Models. Brandon Malone. February 20, Latent Dirichlet Allocation Success Stories Wrap-up Much of this material is adapted from Blei 2003. Many of the images were taken from the Internet February 20, 2014 Suppose we have a large number of books. Each is about several unknown topics. How can

More information

Hidden Markov Models

Hidden Markov Models 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Hidden Markov Models Matt Gormley Lecture 19 Nov. 5, 2018 1 Reminders Homework

More information

K-means. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. November 19 th, Carlos Guestrin 1

K-means. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. November 19 th, Carlos Guestrin 1 EM Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University November 19 th, 2007 2005-2007 Carlos Guestrin 1 K-means 1. Ask user how many clusters they d like. e.g. k=5 2. Randomly guess

More information

N-gram Language Modeling Tutorial

N-gram Language Modeling Tutorial N-gram Language Modeling Tutorial Dustin Hillard and Sarah Petersen Lecture notes courtesy of Prof. Mari Ostendorf Outline: Statistical Language Model (LM) Basics n-gram models Class LMs Cache LMs Mixtures

More information

Text Mining for Economics and Finance Latent Dirichlet Allocation

Text Mining for Economics and Finance Latent Dirichlet Allocation Text Mining for Economics and Finance Latent Dirichlet Allocation Stephen Hansen Text Mining Lecture 5 1 / 45 Introduction Recall we are interested in mixed-membership modeling, but that the plsi model

More information

Lecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions

Lecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions DD2431 Autumn, 2014 1 2 3 Classification with Probability Distributions Estimation Theory Classification in the last lecture we assumed we new: P(y) Prior P(x y) Lielihood x2 x features y {ω 1,..., ω K

More information

Lecture 18: Learning probabilistic models

Lecture 18: Learning probabilistic models Lecture 8: Learning probabilistic models Roger Grosse Overview In the first half of the course, we introduced backpropagation, a technique we used to train neural nets to minimize a variety of cost functions.

More information

Machine Learning, Fall 2012 Homework 2

Machine Learning, Fall 2012 Homework 2 0-60 Machine Learning, Fall 202 Homework 2 Instructors: Tom Mitchell, Ziv Bar-Joseph TA in charge: Selen Uguroglu email: sugurogl@cs.cmu.edu SOLUTIONS Naive Bayes, 20 points Problem. Basic concepts, 0

More information

1 Probabilities. 1.1 Basics 1 PROBABILITIES

1 Probabilities. 1.1 Basics 1 PROBABILITIES 1 PROBABILITIES 1 Probabilities Probability is a tricky word usually meaning the likelyhood of something occuring or how frequent something is. Obviously, if something happens frequently, then its probability

More information

COMS 4721: Machine Learning for Data Science Lecture 18, 4/4/2017

COMS 4721: Machine Learning for Data Science Lecture 18, 4/4/2017 COMS 4721: Machine Learning for Data Science Lecture 18, 4/4/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University TOPIC MODELING MODELS FOR TEXT DATA

More information

CSCI 5832 Natural Language Processing. Today 2/19. Statistical Sequence Classification. Lecture 9

CSCI 5832 Natural Language Processing. Today 2/19. Statistical Sequence Classification. Lecture 9 CSCI 5832 Natural Language Processing Jim Martin Lecture 9 1 Today 2/19 Review HMMs for POS tagging Entropy intuition Statistical Sequence classifiers HMMs MaxEnt MEMMs 2 Statistical Sequence Classification

More information

CS Lecture 18. Topic Models and LDA

CS Lecture 18. Topic Models and LDA CS 6347 Lecture 18 Topic Models and LDA (some slides by David Blei) Generative vs. Discriminative Models Recall that, in Bayesian networks, there could be many different, but equivalent models of the same

More information

RECSM Summer School: Facebook + Topic Models. github.com/pablobarbera/big-data-upf

RECSM Summer School: Facebook + Topic Models. github.com/pablobarbera/big-data-upf RECSM Summer School: Facebook + Topic Models Pablo Barberá School of International Relations University of Southern California pablobarbera.com Networked Democracy Lab www.netdem.org Course website: github.com/pablobarbera/big-data-upf

More information

Event A: at least one tail observed A:

Event A: at least one tail observed A: Chapter 3 Probability 3.1 Events, sample space, and probability Basic definitions: An is an act of observation that leads to a single outcome that cannot be predicted with certainty. A (or simple event)

More information

Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch. COMP-599 Oct 1, 2015

Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch. COMP-599 Oct 1, 2015 Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch COMP-599 Oct 1, 2015 Announcements Research skills workshop today 3pm-4:30pm Schulich Library room 313 Start thinking about

More information

AN INTRODUCTION TO TOPIC MODELS

AN INTRODUCTION TO TOPIC MODELS AN INTRODUCTION TO TOPIC MODELS Michael Paul December 4, 2013 600.465 Natural Language Processing Johns Hopkins University Prof. Jason Eisner Making sense of text Suppose you want to learn something about

More information

Replicated Softmax: an Undirected Topic Model. Stephen Turner

Replicated Softmax: an Undirected Topic Model. Stephen Turner Replicated Softmax: an Undirected Topic Model Stephen Turner 1. Introduction 2. Replicated Softmax: A Generative Model of Word Counts 3. Evaluating Replicated Softmax as a Generative Model 4. Experimental

More information

The Noisy Channel Model and Markov Models

The Noisy Channel Model and Markov Models 1/24 The Noisy Channel Model and Markov Models Mark Johnson September 3, 2014 2/24 The big ideas The story so far: machine learning classifiers learn a function that maps a data item X to a label Y handle

More information

Generative Models for Discrete Data

Generative Models for Discrete Data Generative Models for Discrete Data ddebarr@uw.edu 2016-04-21 Agenda Bayesian Concept Learning Beta-Binomial Model Dirichlet-Multinomial Model Naïve Bayes Classifiers Bayesian Concept Learning Numbers

More information