Probability Review and Naive Bayes. Mladen Kolar and Rob McCulloch
1 Probability Review and Naive Bayes Mladen Kolar and Rob McCulloch
2 1. Probability and Graphical Models 2. Quick Probability Review 3. The Sinus Example 4. Bayesian Networks 5. Bayes Rule Classification 6. Naive Bayes Classification 7. Sentiment Analysis: Spam or Ham
3 1. Probability and Graphical Models Joint probability distributions help us understand uncertain relationships between many variables. Let X = (X1, X2, ..., Xp) be a vector of random variables. Let x = (x1, x2, ..., xp) be an outcome. If x is continuous we write P(X ∈ A) = ∫_A p(x) dx. If x is discrete we write P(X ∈ A) = Σ_{x ∈ A} p(x). 1
4 Figuring out p(x) and understanding it in high dimensions is a very difficult and very important problem. As usual, to make things manageable we look for simplifying structure. The simplest structure is independence: p(x1, x2, ..., xp) = p(x1) p(x2) ... p(xp) = ∏_i p(xi). 2
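The factorization above is easy to check numerically. Here is a small Python sketch (the joint distribution and its values are made up for illustration) that tests whether a joint table equals the product of its marginals:

```python
from itertools import product

# A hypothetical joint distribution over two binary variables,
# constructed to satisfy p(x1, x2) = p(x1) * p(x2).
p1 = {0: 0.3, 1: 0.7}
p2 = {0: 0.6, 1: 0.4}
joint = {(a, b): p1[a] * p2[b] for a, b in product(p1, p2)}

def is_independent(joint, tol=1e-12):
    """Check p(x1, x2) == p(x1) * p(x2) for every outcome."""
    # Marginalize the joint to recover p(x1) and p(x2).
    m1, m2 = {}, {}
    for (a, b), p in joint.items():
        m1[a] = m1.get(a, 0.0) + p
        m2[b] = m2.get(b, 0.0) + p
    return all(abs(joint[(a, b)] - m1[a] * m2[b]) < tol for (a, b) in joint)

print(is_independent(joint))  # True for this product-form joint
```

A joint like {(0,0): 0.5, (1,1): 0.5} fails the same check, since learning x1 pins down x2.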
5 But complete independence is completely boring. It rules out relationships amongst the x's!!!! Instead we will look for conditional independence. A pair of variables may not be independent, but they may be independent conditional on knowledge of a third!! 3
6 In ML/DS, "graphical models" refers to a nice way to think about conditional independence structures in joint distributions. Generally, graphical models capture kinds of conditional independence. We will quickly review basic probability notions, using some of the language associated with graphical models. Later in the course we may study graphical models more systematically. 4
7 2. Quick Probability Review Note: we will use notation appropriate for discrete random variables: p(x1, x2) = P(X1 = x1, X2 = x2), and so on. Two Variables. Suppose we have the joint distribution p(x1, x2). Then we can compute the marginal p(x1) (or p(x2)) and the conditional p(x2 | x1) = p(x1, x2) / p(x1) (or p(x1 | x2)). 5
8 We can always get p(x1, x2) by first getting x1 and then getting x2 conditional on x1: p(x1, x2) = p(x1) p(x2 | x1). X1 and X2 are independent if p(x2 | x1) = p(x2) for all (x1, x2), so that p(x1, x2) = p(x1) p(x2). X1 and X2 are independent if no matter what you learn about X1 your beliefs about X2 are the same as if you knew nothing about X1. 6
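The marginal and conditional computations above can be sketched in a few lines of Python. The joint table here is hypothetical (made-up numbers), just to make the formulas concrete:

```python
# A small hypothetical joint p(x1, x2) over binary variables (made-up numbers).
joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

def marginal_x1(joint):
    """p(x1) = sum over x2 of p(x1, x2)."""
    m = {}
    for (a, b), p in joint.items():
        m[a] = m.get(a, 0.0) + p
    return m

def conditional_x2_given_x1(joint):
    """p(x2 | x1) = p(x1, x2) / p(x1), keyed as (x2, x1)."""
    m = marginal_x1(joint)
    return {(b, a): p / m[a] for (a, b), p in joint.items()}

m = marginal_x1(joint)              # p(x1=0) = 0.4, p(x1=1) = 0.6
c = conditional_x2_given_x1(joint)  # e.g. p(x2=1 | x1=0) = 0.3 / 0.4 = 0.75
```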
9 Three Variables. Have p(x1, x2, x3) = P(X1 = x1, X2 = x2, X3 = x3). Then, p(x3 | x1, x2) = p(x1, x2, x3) / p(x1, x2) and p(x3, x2 | x1) = p(x1, x2, x3) / p(x1). 7
10 Note: p(x3 | x1, x2) ∝ p(x1, x2, x3) and p(x3, x2 | x1) ∝ p(x1, x2, x3). The conditional is proportional to the joint. We always have: p(x1, x2, x3) = p(x1) p(x2, x3 | x1) = p(x1) p(x2 | x1) p(x3 | x1, x2). 8
11 Conditional Independence. X3 is independent of X2 conditional on X1 if p(x3 | x2, x1) = p(x3 | x1) for all (x1, x2, x3). Given you know X1 = x1, X2 tells you nothing about X3. We write X2 ⊥ X3 | X1. 9
12 If X2 ⊥ X3 | X1 then, p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x1, x2) = p(x1) p(x2 | x1) p(x3 | x1). It is this kind of simplification of the joint structure that graphical models are meant to capture. 10
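This simplification can be verified directly. The Python sketch below builds a joint from a hypothetical factorization p(x1) p(x2 | x1) p(x3 | x1) (all numbers made up) and then checks p(x3 | x1, x2) = p(x3 | x1) by brute force:

```python
from itertools import product

# Hypothetical CPTs; by construction X2 ⊥ X3 | X1.
p1 = {0: 0.5, 1: 0.5}
p2_given_1 = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}  # keys: (x2, x1)
p3_given_1 = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.4, (1, 1): 0.6}  # keys: (x3, x1)

joint = {(a, b, c): p1[a] * p2_given_1[(b, a)] * p3_given_1[(c, a)]
         for a, b, c in product([0, 1], repeat=3)}

def ci_given_x1(joint, tol=1e-12):
    """Check p(x3 | x1, x2) == p(x3 | x1) for every outcome."""
    for a, b, c in joint:
        p_x1x2 = sum(joint[(a, b, cc)] for cc in [0, 1])
        p_x1 = sum(joint[(a, bb, cc)] for bb in [0, 1] for cc in [0, 1])
        p_x1x3 = sum(joint[(a, bb, c)] for bb in [0, 1])
        if abs(joint[(a, b, c)] / p_x1x2 - p_x1x3 / p_x1) > tol:
            return False
    return True

print(ci_given_x1(joint))  # True by construction
```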
13 If X2 ⊥ X3 | X1 then we could have, p(x1, x2, x3) = p(x1) p(x3 | x1) p(x2 | x1). This is a directed graphical model. Each variable is a node in the graph. The arrows pointing into a node tell you which (of the previous) variables it depends on. 11
14 If X1 ⊥ X3 | X2 then we could have, p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x2). 12
15 Complete independence: Complete dependence: 13
16 Our graphs are DAGs: Directed Acyclic Graphs. Directed: The connections are arrows that point from one node (variable) to another. Acyclic: You can't start at a node and follow the arrows back to the same node. 14
17 Several Variables: Our definitions extend to many variables easily by considering groups. Suppose we have (X1, X2, ..., Xp). Let G = {i1, i2, ..., iq} ⊆ {1, 2, ..., p} and then X_G = {Xi : i ∈ G}. We can say one group of variables is independent of another group conditional on a third group: X_G3 ⊥ X_G2 | X_G1. 15
18 Conditional independence can be a very powerful way to capture the assumptions in a model. For example in the model: Y = Xβ + ε, ε ~ N(0, σ²), ε ⊥ X, where X is considered a random vector, we have: Y ⊥ X | Xβ. Given you know Xβ, X has no more information about Y. 16
19 3. The Sinus Example Suppose we know the following: The flu causes sinus inflammation. Allergies cause sinus inflammation. Sinus inflammation causes a runny nose. Sinus inflammation causes headaches. 17
20 The flu causes sinus inflammation. Allergies cause sinus inflammation. Sinus inflammation causes a runny nose. Sinus inflammation causes headaches. Note: In this example we explicitly interpret the relationships as causal. More generally they may just be observational. 18
21 Let F = 1 if a person has the flu, 0 else, A = 1 if a person has allergies, 0 else, S = 1 if a person has sinus inflammation, 0 else, H = 1 if a person has a headache, 0 else, N = 1 if a person has a runny nose, 0 else. Then, p(f, a, s, h, n) = p(f) p(a) p(s | f, a) p(h | s) p(n | s). 19
22 The absence of edges contains information!!! p(f, a, s, h, n) = p(f) p(a) p(s | f, a) p(h | s) p(n | s). So, for example, (H, N) ⊥ (F, A) | S. F affects N, but only through S. 20
23 4. Bayesian Networks A Bayesian Network is a DAG: a Directed Acyclic Graphical representation of the joint distribution of X = (X1, X2, ..., Xp). Given a specific ordering of the variables, let P_i be a set of variables such that p(xi | x1, x2, ..., x(i-1)) = p(xi | P_i), where P_i are the parents of Xi, P_i ⊆ {x1, x2, ..., x(i-1)}. Then, p(x1, x2, ..., xp) = ∏_i p(xi | P_i). Local Markov Assumption: A variable X is independent of its non-descendants given its parents (only the parents). 21
24 Bayes Nets: Algorithms for fitting DAG structures to data are an important part of Machine Learning/Data Science/Artificial Intelligence. Also important are algorithms for computing conditionals and marginals. In our sinus example we might want to know p(A = 1 | N = 1). 22
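For a network this small, p(A = 1 | N = 1) can be computed by brute-force enumeration over the joint factorization p(f) p(a) p(s | f, a) p(h | s) p(n | s). The CPT numbers below are entirely made up for illustration; the slides do not give them:

```python
from itertools import product

# Hypothetical CPTs for the sinus network (all numbers invented for illustration).
pF = {1: 0.10, 0: 0.90}                                   # p(F = f)
pA = {1: 0.20, 0: 0.80}                                   # p(A = a)
pS = {(1, 1): 0.90, (1, 0): 0.60, (0, 1): 0.70, (0, 0): 0.05}  # p(S=1 | f, a)
pH = {1: 0.80, 0: 0.10}                                   # p(H=1 | s)
pN = {1: 0.70, 0: 0.05}                                   # p(N=1 | s)

def joint(f, a, s, h, n):
    """p(f, a, s, h, n) = p(f) p(a) p(s | f, a) p(h | s) p(n | s)."""
    ps = pS[(f, a)] if s == 1 else 1 - pS[(f, a)]
    ph = pH[s] if h == 1 else 1 - pH[s]
    pn = pN[s] if n == 1 else 1 - pN[s]
    return pF[f] * pA[a] * ps * ph * pn

# p(A = 1 | N = 1): sum out the unobserved variables, then normalize.
num = sum(joint(f, 1, s, h, 1) for f, s, h in product([0, 1], repeat=3))
den = sum(joint(f, a, s, h, 1) for f, a, s, h in product([0, 1], repeat=4))
print(num / den)  # posterior probability of allergies given a runny nose
```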
25 An important example of a Machine Learning technique which exploits a certain conditional independence structure is the Naive Bayes Classifier. We will study NB below. 23
26 5. Bayes Rule Classification Consider the directed data mining problem with a categorical Y. Many methods can be viewed as an attempt to estimate: p(y | x), the conditional distribution of Y given X = x. For example in logistic regression we have: Y | x ~ Bernoulli(p(x)), with logit(p(x)) = xβ. 24
27 An alternative approach is to estimate the full joint distribution of (X, Y) by estimating the marginal for Y (p(y)) and the conditional for X (p(x | y)). We then have the joint via: p(x, y) = p(y) p(x | y). And classification is then obtained from Bayes Theorem: p(y | x) = p(y) p(x | y) / p(x) ∝ p(y) p(x | y). As usual we can predict the most probable y or make a decision based on the probabilities. 25
28 In general, it seems more straightforward to estimate p(y | x) directly, but some well known approaches take the Bayes-rule path. For example, for x numeric, classic discriminant analysis assumes p(x | y) ~ N(μ_y, Σ). Remember, Y is discrete and we have to estimate p(x | y) for each possible y. 26
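A one-dimensional Python sketch of this Bayes-rule path, with Gaussian class-conditionals and made-up parameters (class names and numbers are purely illustrative):

```python
import math

# Assume p(x | y) = N(mu_y, sigma^2) with shared variance (hypothetical values).
params = {"A": (0.0, 1.0), "B": (2.0, 1.0)}   # class -> (mu, sigma)
prior = {"A": 0.5, "B": 0.5}                  # p(y)

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def classify(x):
    # argmax_y p(y) p(x | y): Bayes rule up to the normalizing constant p(x).
    return max(prior, key=lambda y: prior[y] * normal_pdf(x, *params[y]))

print(classify(-0.5), classify(2.4))  # A B  (each x lands nearer one class mean)
```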
29 6. Naive Bayes Classification Naive Bayes classification simplifies the problem by assuming that the elements of X are conditionally independent given Y: p(x, y) = p(y) ∏_i p(xi | y), so that p(y | x) ∝ p(y) ∏_i p(xi | y). Each coordinate xi of x gets to multiply in its own contribution of evidence about y depending on how likely xi would be if Y = y. Given x, a simple approach to classification just guesses the y having the largest p(y | x). 27
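The whole method fits in a short Python sketch. The tiny labeled dataset below is made up, and the Laplace smoothing shown here anticipates the laplace=1 option used later in the slides:

```python
from collections import Counter, defaultdict

# Toy made-up data: binary feature vectors x and class labels y.
X = [(1, 0, 1), (1, 1, 1), (0, 0, 0), (0, 1, 0), (1, 0, 0)]
y = ["spam", "spam", "ham", "ham", "ham"]

def fit(X, y, laplace=1.0):
    """Estimate p(y) and each p(x_i = 1 | y) with Laplace smoothing."""
    prior = {c: n / len(y) for c, n in Counter(y).items()}
    cond = defaultdict(dict)  # cond[c][i] = p(x_i = 1 | y = c)
    for c in prior:
        rows = [x for x, lab in zip(X, y) if lab == c]
        for i in range(len(X[0])):
            ones = sum(r[i] for r in rows)
            cond[c][i] = (ones + laplace) / (len(rows) + 2 * laplace)
    return prior, cond

def predict(x, prior, cond):
    """argmax_y p(y) * prod_i p(x_i | y)."""
    def score(c):
        p = prior[c]
        for i, v in enumerate(x):
            p *= cond[c][i] if v == 1 else 1 - cond[c][i]
        return p
    return max(prior, key=score)

prior, cond = fit(X, y)
print(predict((1, 0, 1), prior, cond))  # "spam" with this toy data
```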
31 NB has some key advantages: We only have to estimate the low-dimensional p(xi | y) instead of the high-dimensional p(x | y)!!!! Many small bits of information from each xi can be combined. It is simple. The main disadvantage is that the conditional independence assumption often seems inappropriate. However, this does not seem to keep NB from working very well in practice!!! According to Mladen, NB is the single most used classifier out there. NB often performs well, even when the assumption is violated. 29
32 7. Sentiment Analysis: Spam or Ham Sentiment analysis tries to classify text documents. A popular approach is to combine bag of words with NB. Each word in the document provides an additional independent piece of evidence about the kind of document it is. A simple example is trying to classify the document as spam or not: spam or ham. 30
33 Bag of words means just that, we ignore the order of the words. The document: When the lecture is over, remember to wake up the person sitting next to you in the lecture room. is the same as the document, in is lecture lecture next over person remember room sitting the the the to to up wake when you 31
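The "same document" claim above is easy to check: a bag of words is just a multiset of tokens. A small Python sketch (the tokenizer here is a simple illustrative one, not the tm pipeline from the slides):

```python
from collections import Counter

def bag_of_words(doc):
    """Lowercase, replace punctuation with spaces, and count tokens, ignoring order."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in doc.lower())
    return Counter(cleaned.split())

d1 = ("When the lecture is over, remember to wake up the person "
      "sitting next to you in the lecture room.")
d2 = ("in is lecture lecture next over person remember room sitting "
      "the the the to to up wake when you")

print(bag_of_words(d1) == bag_of_words(d2))  # True: same bag, different order
```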
34 Sentiment Analysis has been used very creatively: Classify a bunch of discussion posts on the economy at time t and then predict whether the stock market will go up at time t + 1 given how many are classified as positive. Classify a bunch of discussion posts as favorable to candidate A or candidate B and then predict how they will do in the next poll or the election. 32
35 SMS Spam Data: Note: this follows Chapter 4 of Machine Learning with R, by Brett Lantz. Note: sms: short message service. Have 5,559 sms text message documents. Each one is labelled as spam or ham. Here is the first (ham) and fourth (spam) observation:
> smsraw[1,]
  type text
1 ham  Hope you are having a good week. Just checking in
> smsraw[4,]
  type
4 spam 4 complimentary 4 STAR Ibiza Holiday or 10,000 cash needs your URGENT collection NOW from Landline not to lose out! Box434SK38WP150PPM18+
33
36 Work flow:
- clean: tolower, kill numbers, punctuation, stopwords
- stem: (help, helped, helping, helps) becomes (help, help, help, help)
- tokenization: split a document up into single words (or "tokens" or "terms")
- document term matrix (DTM): rows indicate documents, columns are counts for terms
- train/test split
- throw away low count terms
- convert DTM to indicators: "Yes" if the word (term) is in the document, "No" else
- do Naive Bayes!!
34
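The clean/tokenize/DTM steps above can be sketched in plain Python (the slides use the tm package in R; the documents, stoplist, and variable names below are made up for illustration):

```python
# Two toy documents: the first is the ham message from the slides,
# the second is an invented spam-like message.
docs = ["Hope you are having a good week. Just checking in",
        "WINNER!! Claim your FREE prize now"]

stopwords = {"you", "are", "a", "in", "your"}   # tiny illustrative stoplist

def clean_tokens(doc):
    """Lowercase, drop non-letters, split into tokens, remove stopwords."""
    text = "".join(ch if ch.isalpha() or ch.isspace() else " " for ch in doc.lower())
    return [t for t in text.split() if t not in stopwords]

# Document-term matrix: one row per document, one count column per term.
tokenized = [clean_tokens(d) for d in docs]
terms = sorted(set(t for toks in tokenized for t in toks))
dtm = [[toks.count(t) for t in terms] for toks in tokenized]
print(len(terms))  # vocabulary size for this toy corpus
```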
37 Note: Most of the work is processing the data!!!!! This is typically the case in real world applications. Getting the data into a form that allows you to analyze it is time consuming and very important. Garbage in, garbage out!!! In class we will typically focus on getting a basic understanding of the methods and not emphasize the data wrangling. 35
38 Clean and Stem. Here are the first two documents:
> smsraw$text[1]
[1] "Hope you are having a good week. Just checking in"
> smsraw$text[2]
[1] "K..give back my thanks."
Here are the first 2 docs after cleaning. smscc is the cleaned data in the Corpus data structure from the tm R package. smscc is for sms data as a Cleaned Corpus.
> smscc[[1]][1]
$content
[1] "hope good week just check"
> smscc[[2]][1]
$content
[1] "kgive back thank"
36
39 Tokenize and get DTM. Tokenization gives us 6518 words (or terms) from all the sms documents. The i-th row of the DTM gives us the count for each term in document i.
> print(dim(smsdtm))
[1] 5559 6518
> library(slam) # for col_sums
> summary(col_sums(smsdtm)) # summarize total times a term is used
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
> terms = smsdtm$dimnames$Terms
> nterm = length(terms)
> set.seed(14)
> ii = sample(1:nterm, 20)
> terms[ii]
 [1] "effect"       "pinku"        "wikipediacom" "mundh"        "wwwsmsconet"
 [6] "marsm"        "voic"         "itz"          "logo"         "hip"
[11] "transfr"      "colin"        "leo"          "technolog"    "text"
[16] "scratch"      "graze"        "prolli"       "tech"         "ofsi"
37
40 Word frequencies for ham: Word frequencies for spam: 38
41 Train/Test
# creating training and test datasets
smstrain = smsdtm[1:4169, ]
smstest = smsdtm[4170:5559, ]
# also save the labels
smstrainy = smsraw[1:4169, ]$type
smstesty = smsraw[4170:5559, ]$type
> prop.table(table(smstrainy))
smstrainy
 ham spam
> prop.table(table(smstesty))
smstesty
 ham spam
42 Throw Away Terms with Low Frequency
# save frequently-appearing terms to a character vector
smsfreqwords = findFreqTerms(smstrain, 5)
> str(smsfreqwords)
 chr [1:1136] "abiola" "abl" "abt" "accept" "access" "account" ...
> length(smsfreqwords)
[1] 1136
# create DTMs with only the frequent terms
smsfreqtrain = smstrain[, smsfreqwords]
smsfreqtest = smstest[, smsfreqwords]
> dim(smsfreqtrain)
[1] 4169 1136
> dim(smstest)
[1] 1390 6518
40
43 Convert Counts to Indicators
# convert counts to if(count>0) (yes, no)
convertcounts <- function(x) {
  x <- ifelse(x > 0, "Yes", "No")
}
# apply() convertcounts() to columns of train/test data
# these are just matrices
smstrain = apply(smsfreqtrain, MARGIN = 2, convertcounts)
smstest <- apply(smsfreqtest, MARGIN = 2, convertcounts)
> dim(smstrain)
[1] 4169 1136
> smstrain[1:3,1:5]
    Terms
Docs abiola abl  abt  accept access
   1 "No"   "No" "No" "No"   "No"
   2 "No"   "No" "No" "No"   "No"
   3 "No"   "No" "No" "No"   "No"
41
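The same count-to-indicator conversion is a one-liner in Python (a sketch of what the R convertcounts() above does, on a made-up toy matrix):

```python
def convert_counts(row):
    """Map each term count to "Yes" (present) or "No" (absent)."""
    return ["Yes" if c > 0 else "No" for c in row]

# Toy document-term matrix: two documents, three terms.
dtm = [[0, 2, 1], [1, 0, 0]]
print([convert_counts(r) for r in dtm])
# [['No', 'Yes', 'Yes'], ['Yes', 'No', 'No']]
```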
44 We are ready for NB!!!
library(e1071)
smsnb = naiveBayes(smstrain, smstrainy)
> smsnb$tables[1:3]
$abiola
         abiola
smstrainy No Yes
     ham
     spam
$abl
         abl
smstrainy No Yes
     ham
     spam
$abt
         abt
smstrainy No Yes
     ham
     spam
The tables are our p(xi | y) terms!! y is ham or spam and the xi are the words (terms): abiola, abl, abt, ... That is, p(abiola = Yes | y = ham) is the "Yes" entry in the ham row of the abiola table. 42
45
$age
         age
smstrainy No Yes
     ham
     spam
$adult
         adult
smstrainy No Yes
     ham
     spam
What is p(y = spam | age = Yes)? (The prob the sms is spam given the word age is in it). Let's use p(y = spam) = .14, the training data proportion.
p(y = spam | age = Yes) = p(y=spam) p(age=Yes | y=spam) / [ p(y=spam) p(age=Yes | y=spam) + p(y=ham) p(age=Yes | y=ham) ]
> .14* /(.14* * )
[1]
43
46
$age
         age
smstrainy No Yes
     ham
     spam
$adult
         adult
smstrainy No Yes
     ham
     spam
What is p(y = spam | age = Yes, adult = Yes)? (The prob the sms is spam given the word age is in it and the word adult is in it).
p(y = spam | age = Yes, adult = Yes) = p(spam) p(age=Y | spam) p(adult=Y | spam) / [ p(spam) p(age=Y | spam) p(adult=Y | spam) + p(ham) p(age=Y | ham) p(adult=Y | ham) ]
> .14* * /(.14* * * * )
[1]
44
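The two-word posterior above can be written as a small Python function. The fitted conditional probabilities are left as named inputs, since the slide's actual numbers come from the estimated tables; the values passed below are made up purely to exercise the formula:

```python
def posterior_spam(p_spam, p_age_spam, p_adult_spam, p_age_ham, p_adult_ham):
    """p(spam | age=Yes, adult=Yes) under the Naive Bayes factorization."""
    num = p_spam * p_age_spam * p_adult_spam
    den = num + (1 - p_spam) * p_age_ham * p_adult_ham
    return num / den

# Illustrative made-up values: both words far more likely in spam than ham,
# so the posterior is pushed close to 1 despite the .14 prior.
print(posterior_spam(0.14, 0.05, 0.04, 0.001, 0.0005))
```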
47 Out of Sample Confusion Matrix
yhat = predict(smsnb, smstest)
library(gmodels)
CrossTable(yhat, smstesty,
  prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE,
  dnn = c("predicted", "actual"))
             actual
predicted    ham  spam  Row Total
  ham
  spam
Column Total
Misclassification rate: 36/1390 ≈ .026. % spam detected: .836. Not bad!! 45
48 Try it again with laplace=1: add 1 to each count when estimating the 2x2 tables.
> smsnb2 = naiveBayes(smstrain, smstrainy, laplace=1)
> smsnb2$tables[1:3]
$abiola
         abiola
smstrainy No Yes
     ham
     spam
$abl
         abl
smstrainy No Yes
     ham
     spam
$abt
         abt
smstrainy No Yes
     ham
     spam
Got rid of the 0/1 conditional probabilities, which are extreme. 46
49
> yhat2 = predict(smsnb2, smstest)
> CrossTable(yhat2, smstesty,
+   prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE,
+   dnn = c("predicted", "actual"))
             actual
predicted    ham  spam  Row Total
  ham
  spam
Column Total
Not too different. 47
More informationMaxent Models and Discriminative Estimation
Maxent Models and Discriminative Estimation Generative vs. Discriminative models (Reading: J+M Ch6) Introduction So far we ve looked at generative models Language models, Naive Bayes But there is now much
More informationMachine Learning. Hal Daumé III. Computer Science University of Maryland CS 421: Introduction to Artificial Intelligence 8 May 2012
Machine Learning Hal Daumé III Computer Science University of Maryland me@hal3.name CS 421 Introduction to Artificial Intelligence 8 May 2012 g 1 Many slides courtesy of Dan Klein, Stuart Russell, or Andrew
More informationIntroduction to Machine Learning
Uncertainty Introduction to Machine Learning CS4375 --- Fall 2018 a Bayesian Learning Reading: Sections 13.1-13.6, 20.1-20.2, R&N Sections 6.1-6.3, 6.7, 6.9, Mitchell Most real-world problems deal with
More informationGenerative Classifiers: Part 1. CSC411/2515: Machine Learning and Data Mining, Winter 2018 Michael Guerzhoy and Lisa Zhang
Generative Classifiers: Part 1 CSC411/2515: Machine Learning and Data Mining, Winter 2018 Michael Guerzhoy and Lisa Zhang 1 This Week Discriminative vs Generative Models Simple Model: Does the patient
More information1. what conditional independencies are implied by the graph. 2. whether these independecies correspond to the probability distribution
NETWORK ANALYSIS Lourens Waldorp PROBABILITY AND GRAPHS The objective is to obtain a correspondence between the intuitive pictures (graphs) of variables of interest and the probability distributions of
More informationPattern Recognition and Machine Learning. Learning and Evaluation of Pattern Recognition Processes
Pattern Recognition and Machine Learning James L. Crowley ENSIMAG 3 - MMIS Fall Semester 2016 Lesson 1 5 October 2016 Learning and Evaluation of Pattern Recognition Processes Outline Notation...2 1. The
More informationMachine learning: lecture 20. Tommi S. Jaakkola MIT CSAIL
Machine learning: lecture 20 ommi. Jaakkola MI CAI tommi@csail.mit.edu opics Representation and graphical models examples Bayesian networks examples, specification graphs and independence associated distribution
More informationMachine Learning, Fall 2009: Midterm
10-601 Machine Learning, Fall 009: Midterm Monday, November nd hours 1. Personal info: Name: Andrew account: E-mail address:. You are permitted two pages of notes and a calculator. Please turn off all
More informationBayesian Networks. CompSci 270 Duke University Ron Parr. Why Joint Distributions are Important
Bayesian Networks CompSci 270 Duke University Ron Parr Why Joint Distributions are Important Joint distributions gives P(X 1 X n ) Classification/Diagnosis Suppose X1=disease X2 Xn = symptoms Co-occurrence
More informationProbabilistic Graphical Models
2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector
More information1 Probabilities. 1.1 Basics 1 PROBABILITIES
1 PROBABILITIES 1 Probabilities Probability is a tricky word usually meaning the likelyhood of something occuring or how frequent something is. Obviously, if something happens frequently, then its probability
More informationMore Smoothing, Tuning, and Evaluation
More Smoothing, Tuning, and Evaluation Nathan Schneider (slides adapted from Henry Thompson, Alex Lascarides, Chris Dyer, Noah Smith, et al.) ENLP 21 September 2016 1 Review: 2 Naïve Bayes Classifier w
More informationIntroduction to AI Learning Bayesian networks. Vibhav Gogate
Introduction to AI Learning Bayesian networks Vibhav Gogate Inductive Learning in a nutshell Given: Data Examples of a function (X, F(X)) Predict function F(X) for new examples X Discrete F(X): Classification
More informationBayesian Methods: Naïve Bayes
Bayesian Methods: aïve Bayes icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Last Time Parameter learning Learning the parameter of a simple coin flipping model Prior
More informationDimensionality reduction
Dimensionality Reduction PCA continued Machine Learning CSE446 Carlos Guestrin University of Washington May 22, 2013 Carlos Guestrin 2005-2013 1 Dimensionality reduction n Input data may have thousands
More informationClassification and Regression Trees
Classification and Regression Trees Ryan P Adams So far, we have primarily examined linear classifiers and regressors, and considered several different ways to train them When we ve found the linearity
More informationIntelligent Systems (AI-2)
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 23, 2015 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,
More informationIntelligent Systems: Reasoning and Recognition. Reasoning with Bayesian Networks
Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIMAG 2 / MoSIG M1 Second Semester 2016/2017 Lesson 13 24 march 2017 Reasoning with Bayesian Networks Naïve Bayesian Systems...2 Example
More informationClassification: Naïve Bayes. Nathan Schneider (slides adapted from Chris Dyer, Noah Smith, et al.) ENLP 19 September 2016
Classification: Naïve Bayes Nathan Schneider (slides adapted from Chris Dyer, Noah Smith, et al.) ENLP 19 September 2016 1 Sentiment Analysis Recall the task: Filled with horrific dialogue, laughable characters,
More informationLearning P-maps Param. Learning
Readings: K&F: 3.3, 3.4, 16.1, 16.2, 16.3, 16.4 Learning P-maps Param. Learning Graphical Models 10708 Carlos Guestrin Carnegie Mellon University September 24 th, 2008 10-708 Carlos Guestrin 2006-2008
More informationCS 188: Artificial Intelligence Fall 2008
CS 188: Artificial Intelligence Fall 2008 Lecture 23: Perceptrons 11/20/2008 Dan Klein UC Berkeley 1 General Naïve Bayes A general naive Bayes model: C E 1 E 2 E n We only specify how each feature depends
More informationGeneral Naïve Bayes. CS 188: Artificial Intelligence Fall Example: Overfitting. Example: OCR. Example: Spam Filtering. Example: Spam Filtering
CS 188: Artificial Intelligence Fall 2008 General Naïve Bayes A general naive Bayes model: C Lecture 23: Perceptrons 11/20/2008 E 1 E 2 E n Dan Klein UC Berkeley We only specify how each feature depends
More informationCS446: Machine Learning Fall Final Exam. December 6 th, 2016
CS446: Machine Learning Fall 2016 Final Exam December 6 th, 2016 This is a closed book exam. Everything you need in order to solve the problems is supplied in the body of this exam. This exam booklet contains
More informationCHAPTER-17. Decision Tree Induction
CHAPTER-17 Decision Tree Induction 17.1 Introduction 17.2 Attribute selection measure 17.3 Tree Pruning 17.4 Extracting Classification Rules from Decision Trees 17.5 Bayesian Classification 17.6 Bayes
More informationCPSC 340: Machine Learning and Data Mining
CPSC 340: Machine Learning and Data Mining MLE and MAP Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. 1 Admin Assignment 4: Due tonight. Assignment 5: Will be released
More informationCS Lecture 3. More Bayesian Networks
CS 6347 Lecture 3 More Bayesian Networks Recap Last time: Complexity challenges Representing distributions Computing probabilities/doing inference Introduction to Bayesian networks Today: D-separation,
More informationCOS 424: Interacting with Data. Lecturer: Dave Blei Lecture #11 Scribe: Andrew Ferguson March 13, 2007
COS 424: Interacting with ata Lecturer: ave Blei Lecture #11 Scribe: Andrew Ferguson March 13, 2007 1 Graphical Models Wrap-up We began the lecture with some final words on graphical models. Choosing a
More informationLecture 6: Graphical Models
Lecture 6: Graphical Models Kai-Wei Chang CS @ Uniersity of Virginia kw@kwchang.net Some slides are adapted from Viek Skirmar s course on Structured Prediction 1 So far We discussed sequence labeling tasks:
More informationBayesian Networks. Semantics of Bayes Nets. Example (Binary valued Variables) CSC384: Intro to Artificial Intelligence Reasoning under Uncertainty-III
CSC384: Intro to Artificial Intelligence Reasoning under Uncertainty-III Bayesian Networks Announcements: Drop deadline is this Sunday Nov 5 th. All lecture notes needed for T3 posted (L13,,L17). T3 sample
More informationData Mining Prof. Pabitra Mitra Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur
Data Mining Prof. Pabitra Mitra Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur Lecture 21 K - Nearest Neighbor V In this lecture we discuss; how do we evaluate the
More information