COMP 328: Machine Learning
Lecture 2: Naive Bayes Classifiers

Nevin L. Zhang
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology
Spring 2010

Nevin L. Zhang (HKUST) COMP 328 Spring 2010 1 / 34
Two different types of classifiers

Decision tree classifiers:
  Data → decision rules → classify unseen examples using the rules.

Naive Bayes classifiers:
  Data → a probabilistic model of the relationship between the class and the attributes → classify unseen examples via inference based on the model.
Outline

1 Probabilistic Models and Classification
2 Probabilistic Independence
3 Naive Bayes Model Classifiers
4 Issues
5 Learning to Classify Text
Joint Distribution

The most general way to describe relationships among random variables.

Example: P(Temperature, Wind, PlayTennis)

  Temperature  Wind    PlayTennis  Probability
  Hot          Weak    No          0.1
  Hot          Weak    Yes         0
  Hot          Strong  No          0.2
  Hot          Strong  Yes         0.3
  Cool         Weak    No          0.1
  Cool         Weak    Yes         0.2
  Cool         Strong  No          0.1
  Cool         Strong  Yes         0

One probability value for each combination of the states of the variables.
All the probability values must sum to 1.
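As a concrete sketch (Python, with the table above hard-coded), a joint distribution is just a table mapping each combination of states to a probability, and the entries must sum to 1:

```python
# The joint distribution P(Temperature, Wind, PlayTennis) from the table,
# stored as a dictionary keyed by (temperature, wind, playtennis).
joint = {
    ("Hot",  "Weak",   "No"):  0.1,
    ("Hot",  "Weak",   "Yes"): 0.0,
    ("Hot",  "Strong", "No"):  0.2,
    ("Hot",  "Strong", "Yes"): 0.3,
    ("Cool", "Weak",   "No"):  0.1,
    ("Cool", "Weak",   "Yes"): 0.2,
    ("Cool", "Strong", "No"):  0.1,
    ("Cool", "Strong", "Yes"): 0.0,
}

# Sanity check: a valid joint distribution sums to 1.
total = sum(joint.values())
```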
Joint Distribution and Classification

Suppose we have a joint distribution. How do we classify? Calculate the posterior probability and then classify.

Play tennis on a Hot day with Strong wind?

  P(Play = y | T = h, W = s) = P(Play = y, T = h, W = s) / P(T = h, W = s)
                             = 0.3 / (0.2 + 0.3) = 0.6

  P(Play = n | T = h, W = s) = 1 − P(Play = y | T = h, W = s) = 1 − 0.6 = 0.4

Answer: yes (if we must answer with yes or no).
Joint Distribution and Classification

We can perform classification even with a subset of the attributes.

Play tennis when Wind is weak?

  P(Play = y | W = weak) = P(Play = y, W = weak) / P(W = weak)
                         = (0 + 0.2) / (0.1 + 0 + 0.1 + 0.2) = 0.5

  P(Play = n | W = weak) = 1 − 0.5 = 0.5

Answer: with this table the two classes are equally likely given weak wind, so either answer is defensible.
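The two computations above can be sketched in Python by marginalizing the joint table over the rows consistent with the evidence (the helper `prob` is an illustrative name, not from the slides):

```python
# Conditional class probabilities computed directly from the joint table.
joint = {
    ("Hot",  "Weak",   "No"):  0.1, ("Hot",  "Weak",   "Yes"): 0.0,
    ("Hot",  "Strong", "No"):  0.2, ("Hot",  "Strong", "Yes"): 0.3,
    ("Cool", "Weak",   "No"):  0.1, ("Cool", "Weak",   "Yes"): 0.2,
    ("Cool", "Strong", "No"):  0.1, ("Cool", "Strong", "Yes"): 0.0,
}

def prob(temp=None, wind=None, play=None):
    """Marginal probability of a (possibly partial) assignment:
    sum the joint over all rows consistent with the evidence."""
    return sum(p for (t, w, y), p in joint.items()
               if (temp is None or t == temp)
               and (wind is None or w == wind)
               and (play is None or y == play))

# P(Play = Yes | T = Hot, W = Strong) = 0.3 / (0.2 + 0.3) = 0.6
p_yes_hot_strong = prob("Hot", "Strong", "Yes") / prob("Hot", "Strong")

# P(Play = Yes | W = Weak) = (0 + 0.2) / (0.1 + 0 + 0.1 + 0.2) = 0.5
p_yes_weak = prob(wind="Weak", play="Yes") / prob(wind="Weak")
```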
The General Case

Suppose we have a joint distribution P(A1, A2, ..., An, C) over attributes A1, A2, ..., An and the class variable C.

For a new example with attribute values a1, a2, ..., an:
  For each possible value vj of the class variable C, compute the posterior probability P(C = vj | A1 = a1, A2 = a2, ..., An = an).
  Assign the example to the class with the highest posterior probability:

    c = arg max_{vj ∈ class labels} P(C = vj | A1 = a1, A2 = a2, ..., An = an)
A Difficulty

100 binary attributes and a binary class variable. How many entries in the joint probability table? 2^101.

Too many to handle, and too many to estimate reliably from data; this leads to overfitting.

The Naive Bayes model reduces the number of model parameters by making independence assumptions.
Outline

1 Probabilistic Models and Classification
2 Probabilistic Independence
3 Naive Bayes Model Classifiers
4 Issues
5 Learning to Classify Text
Marginal independence

Two random variables X and Y are marginally independent, written X ⊥ Y, if for any state x of X and any state y of Y,

  P(X = x | Y = y) = P(X = x), whenever P(Y = y) > 0.

Meaning: learning the value of Y gives no information about X, and vice versa.

Equivalent definition: P(X = x, Y = y) = P(X = x) P(Y = y).

Shorthand for the equations: P(X | Y) = P(X), P(X, Y) = P(X) P(Y).
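The product form of the definition is easy to check numerically. A minimal sketch (the joint tables below are hypothetical illustrations): compute both marginals from a joint table and test whether every joint entry equals the product of its marginals.

```python
from itertools import product

def is_independent(joint):
    """Check P(X=x, Y=y) == P(X=x) * P(Y=y) for every pair of states."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p   # marginal of X
        py[y] = py.get(y, 0.0) + p   # marginal of Y
    return all(abs(p - px[x] * py[y]) < 1e-12
               for (x, y), p in joint.items())

# Two tosses of a fair coin: independent (uniform joint).
fair = {(x, y): 0.25 for x, y in product("HT", repeat=2)}

# A correlated pair (think midterm and final grades): matching
# outcomes are more likely, so the product definition fails.
correlated = {("H", "H"): 0.4, ("H", "T"): 0.1,
              ("T", "H"): 0.1, ("T", "T"): 0.4}
```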
Marginal independence

Examples:
  X: result of tossing a fair coin for the first time; Y: result of the second toss of the same coin.
  X: result of the US election; Y: your grade in this course.

Counterexample: X: midterm exam grade; Y: final exam grade.
Conditional independence

Two random variables X and Y are conditionally independent given a third variable Z, written X ⊥ Y | Z, if

  P(X = x | Y = y, Z = z) = P(X = x | Z = z), whenever P(Y = y, Z = z) > 0.

Meaning: if I already know the state of Z, then learning the state of Y gives me no additional information about X. Y might contain some information about X, but all the information about X contained in Y is also contained in Z.

Shorthand for the equation: P(X | Y, Z) = P(X | Z).

Equivalent definition: P(X, Y | Z) = P(X | Z) P(Y | Z).
Example of Conditional Independence

There is a bag of 100 coins. 10 coins were made by a malfunctioning machine and are biased toward heads: tossing such a coin results in heads 80% of the time. The other 90 coins are fair.

Randomly draw a coin from the bag and toss it a few times.
  Xi: result of the i-th toss; Y: whether the coin was produced by the malfunctioning machine.

The Xi's are not marginally independent of each other:
  If I get 9 heads in the first 10 tosses, the coin is probably biased, so the next toss is more likely to result in heads than tails.
  Learning the value of Xi gives me some information about whether the coin is biased, which in turn gives me some information about Xj.
Example of Conditional Independence

However, they are conditionally independent given Y:
  If the coin is fair, the probability of heads on any one toss is 1/2 regardless of the results of the other tosses.
  If the coin is biased, the probability of heads on any one toss is 80% regardless of the results of the other tosses.
  If I already know whether the coin is biased, learning the value of Xi gives me no additional information about Xj.

Pictorially, the variables are related as follows; we will return to this picture later:

  Coin Type → Toss 1 Result, Toss 2 Result, ..., Toss n Result
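The coin-bag example can be computed exactly. The sketch below uses conditional independence given Y to factor the joint over two tosses, then shows that marginally the tosses are dependent: after one head, a second head becomes more likely.

```python
# Coin-bag model: Y = "biased" with probability 0.1;
# P(heads | biased) = 0.8, P(heads | fair) = 0.5.
p_y = {"biased": 0.1, "fair": 0.9}
p_h = {"biased": 0.8, "fair": 0.5}

# Marginal probability of heads on a single toss.
p_h1 = sum(p_y[y] * p_h[y] for y in p_y)              # 0.1*0.8 + 0.9*0.5 = 0.53

# Joint probability of heads on both tosses; the inner term factors as
# P(X1 = H | Y) * P(X2 = H | Y) by conditional independence given Y.
p_h1_h2 = sum(p_y[y] * p_h[y] * p_h[y] for y in p_y)  # 0.1*0.64 + 0.9*0.25 = 0.289

# Marginally the tosses are dependent: seeing one head raises the
# probability of a second head above 0.53.
p_h2_given_h1 = p_h1_h2 / p_h1                        # ≈ 0.545
```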
Outline

1 Probabilistic Models and Classification
2 Probabilistic Independence
3 Naive Bayes Model Classifiers
4 Issues
5 Learning to Classify Text
The Naive Bayes Model

It assumes that the attributes are mutually independent of each other given the class variable. Graphically, the class variable C is the parent of every attribute Ai (figure omitted).

The joint distribution is given by

  P(C, A1, A2, ..., An) = P(C) ∏_{i=1}^{n} P(Ai | C)
Learning the Naive Bayes Model

Learning amounts to estimating from data: P(C), P(A1 | C), ..., P(An | C).

Straightforward to compute:

  P̂(C = vj) = (# of examples with C = vj) / (total # of examples)

  P̂(Ai = ai | C = vj) = (# of examples with C = vj and Ai = ai) / (# of examples with C = vj)

Although simple, these are the maximum likelihood estimates (MLE) of the parameters, which have nice properties.
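The counting estimates above can be sketched in a few lines. The tiny dataset below is hypothetical (rows are (attribute value, class) pairs), as are the helper names:

```python
from collections import Counter

# Hypothetical training data: (value of attribute A1, class C).
data = [("a", "yes"), ("a", "yes"), ("b", "yes"),
        ("a", "no"),  ("b", "no"),  ("b", "no")]

class_counts = Counter(c for _, c in data)   # counts per class
pair_counts = Counter(data)                  # counts per (value, class) pair

def p_class(c):
    """MLE of P(C = c): (# examples with class c) / (total # of examples)."""
    return class_counts[c] / len(data)

def p_attr_given_class(a, c):
    """MLE of P(A1 = a | C = c):
    (# examples with class c and A1 = a) / (# examples with class c)."""
    return pair_counts[(a, c)] / class_counts[c]
```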
Classification with the Naive Bayes Model

For a new example with attribute values a1, a2, ..., an, assign it to the class

  vNB = arg max_{vj ∈ V} P̂(C = vj) ∏_{i=1}^{n} P̂(Ai = ai | C = vj)
Example: PlayTennis

  Day  Outlook   Temperature  Humidity  Wind    PlayTennis
  D1   Sunny     Hot          High      Weak    No
  D2   Sunny     Hot          High      Strong  No
  D3   Overcast  Hot          High      Weak    Yes
  D4   Rain      Mild         High      Weak    Yes
  D5   Rain      Cool         Normal    Weak    Yes
  D6   Rain      Cool         Normal    Strong  No
  D7   Overcast  Cool         Normal    Strong  Yes
  D8   Sunny     Mild         High      Weak    No
  D9   Sunny     Cool         Normal    Weak    Yes
  D10  Rain      Mild         Normal    Weak    Yes
  D11  Sunny     Mild         Normal    Strong  Yes
  D12  Overcast  Mild         High      Strong  Yes
  D13  Overcast  Hot          Normal    Weak    Yes
  D14  Rain      Mild         High      Strong  No
Example: Estimate parameters

  P(PlayTennis = y) = 9/14              P(PlayTennis = n) = 5/14

  P(Outlook = sunny | y) = 2/9          P(Outlook = sunny | n) = 3/5
  P(Outlook = overcast | y) = 4/9       P(Outlook = overcast | n) = 0/5
  P(Outlook = rain | y) = 3/9           P(Outlook = rain | n) = 2/5

  P(Temp = hot | y) = 2/9               P(Temp = hot | n) = 2/5
  P(Temp = mild | y) = 4/9              P(Temp = mild | n) = 2/5
  P(Temp = cool | y) = 3/9              P(Temp = cool | n) = 1/5

  P(Humidity = high | y) = 3/9          P(Humidity = high | n) = 4/5
  P(Humidity = normal | y) = 6/9        P(Humidity = normal | n) = 1/5

  P(Wind = strong | y) = 3/9            P(Wind = strong | n) = 3/5
  P(Wind = weak | y) = 6/9              P(Wind = weak | n) = 2/5
Example: Classification

New case: (Sunny, Cool, High, Strong). PlayTennis?

Inference:
  P(y) P(sunny | y) P(cool | y) P(high | y) P(strong | y) = .005
  P(n) P(sunny | n) P(cool | n) P(high | n) P(strong | n) = .021

Conclusion: vNB = n. No, don't play.
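The two scores above can be reproduced directly from the estimates on the previous slide:

```python
# Inference for the new case (Sunny, Cool, High, Strong), using the
# parameter estimates from the PlayTennis table.
p_y = 9/14 * 2/9 * 3/9 * 3/9 * 3/9   # P(y) P(sunny|y) P(cool|y) P(high|y) P(strong|y)
p_n = 5/14 * 3/5 * 1/5 * 4/5 * 3/5   # P(n) P(sunny|n) P(cool|n) P(high|n) P(strong|n)

# p_y ≈ 0.005, p_n ≈ 0.021, so the Naive Bayes prediction is n.
prediction = "y" if p_y > p_n else "n"
```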
Outline

1 Probabilistic Models and Classification
2 Probabilistic Independence
3 Naive Bayes Model Classifiers
4 Issues
5 Learning to Classify Text
Zero counts

What if none of the training instances with class vj have attribute value ai? Then P̂(ai | vj) = 0, and hence

  P̂(vj) ∏_i P̂(ai | vj) = 0.

A future example with ai then has no chance of being classified as vj, even if all its other attribute values suggest vj.

Smoothing: add a virtual count of 1 to each case (Laplace smoothing/correction):

  P̂(Ai = ai | C = vj) = ((# of examples with C = vj and Ai = ai) + 1) / ((# of examples with C = vj) + |Ai|)

where |Ai| is the number of possible values of attribute Ai.
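A quick sketch of the smoothed estimate for P(Outlook | PlayTennis = yes), using the raw counts from the PlayTennis table (sunny 2, overcast 4, rain 3, out of 9 "yes" examples):

```python
# Laplace smoothing for P(Outlook | PlayTennis = yes).
counts = {"sunny": 2, "overcast": 4, "rain": 3}
n_yes = sum(counts.values())   # 9 examples with class yes
k = len(counts)                # 3 possible values of Outlook

# Add a virtual count of 1 to each value; denominator grows by k.
smoothed = {v: (c + 1) / (n_yes + k) for v, c in counts.items()}

# No value ever gets probability 0: even a count of 0 would
# yield (0 + 1) / (9 + 3) = 1/12.
```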
Weka does just that

With Laplace smoothing, P(Outlook | yes) is {3/12, 5/12, 4/12} instead of {2/9, 4/9, 3/9} as on Slide 20.
Continuous attributes

How to handle continuous attributes?
  Discretize them: equal-width intervals; Weka also has an MDL-based method.
  Use a parametric form (e.g., Gaussian) for P(Ai | C).
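A minimal sketch of the parametric option: estimate a per-class mean and variance, then use the Gaussian density in place of a conditional probability table entry. The temperature values below are hypothetical.

```python
import math

# Hypothetical temperatures of the training examples with class "yes".
temps_yes = [21.0, 23.0, 24.0, 22.0]

# Per-class Gaussian parameters (sample mean and variance).
mu = sum(temps_yes) / len(temps_yes)
var = sum((t - mu) ** 2 for t in temps_yes) / (len(temps_yes) - 1)

def gaussian_density(x, mu, var):
    """Gaussian density, used in place of P(Ai = x | C) for a
    continuous attribute."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

d = gaussian_density(22.5, mu, var)
```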
Conditional independence assumption

Assumption: the Ai's are mutually independent of each other given C. This is often violated, and might lead to double counting.

To see this, suppose we duplicate A1, so the data contains two copies of A1: A1 and A1'. The information in the data remains the same, but the classification might change:

  arg max_{vj ∈ V} P̂(vj) P̂(A1 | vj) P̂(A1' | vj) P̂(A2 | vj) ...

The evidence from A1 is counted twice.
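The double-counting effect can be made concrete with hypothetical numbers: A1 mildly favours class y, A2 more strongly favours n. With each attribute counted once the prediction is n; counting A1's evidence twice flips it to y.

```python
# Hypothetical factors: P(observed value of each attribute | class).
p = {"y": 0.5, "n": 0.5}          # uniform prior over classes
p_a1 = {"y": 0.7,  "n": 0.3}      # A1 favours y
p_a2 = {"y": 0.25, "n": 0.7}      # A2 favours n

def argmax_class(factors):
    """Naive Bayes decision: class maximizing prior times the factors."""
    scores = {c: p[c] for c in p}
    for f in factors:
        for c in scores:
            scores[c] *= f[c]
    return max(scores, key=scores.get)

honest = argmax_class([p_a1, p_a2])              # y: .0875, n: .105  -> n
doubled = argmax_class([p_a1, p_a1, p_a2])       # y: .06125, n: .0315 -> y
```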
Conditional independence assumption

The Naive Bayes classifier works surprisingly well anyway!

Reason: although P̂(vj) ∏_i P̂(ai | vj) might be a poor estimate of P(vj, a1, ..., an), we might still have

  arg max_{vj ∈ V} P̂(vj) ∏_i P̂(ai | vj) = arg max_{vj ∈ V} P(vj, a1, ..., an)

That is, the argmax can be correct even when the probability estimates are not.
Conditional independence assumption

Note: Bayesian (belief) network classifiers relax this assumption.
Overfitting

Overfitting is not an issue for the Naive Bayes classifier because its complexity is fixed. However, it is an issue for Bayesian networks.
Outline

1 Probabilistic Models and Classification
2 Probabilistic Independence
3 Naive Bayes Model Classifiers
4 Issues
5 Learning to Classify Text
Spam and non-spam E-mails

Classify emails into spam and non-spam according to their content.

Classes: S and ¬S.
Attributes: w1, w2, ..., wn, a list of words extracted from the training set after:
  Stop-word removal: e.g., "a", "the".
  Stemming: e.g., engineering, engineered, engineer → engineer.

Parameters:
  P(wi | S): probability that word wi appears in a spam mail.
  P(wi | ¬S): probability that word wi appears in a non-spam mail.
  P(S) and P(¬S): probability of a mail being spam or non-spam.

All of these can be obtained from a training set by counting.
Spam and non-spam E-mails

A new document D contains the words X1, X2, ..., Xm, a subset of w1, w2, ..., wn:

  P(D | S) = ∏_{i=1}^{m} P(Xi | S)
  P(D | ¬S) = ∏_{i=1}^{m} P(Xi | ¬S)

The document is classified as spam if

  P(S) ∏_{i=1}^{m} P(Xi | S) > P(¬S) ∏_{i=1}^{m} P(Xi | ¬S)

Can we do this with decision trees?
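The decision rule can be sketched as follows. All word probabilities below are hypothetical stand-ins for values estimated (with smoothing) from a training set; sums of log-probabilities are used instead of products of many small numbers, which avoids floating-point underflow on long documents.

```python
import math

# Hypothetical parameters, as if estimated from a training corpus.
p_spam, p_ham = 0.4, 0.6
p_word_spam = {"free": 0.05,  "money": 0.04,  "lecture": 0.001}
p_word_ham  = {"free": 0.005, "money": 0.004, "lecture": 0.02}

def classify(words):
    """Compare P(S) * prod P(Xi | S) against P(not S) * prod P(Xi | not S),
    working in log space."""
    score_spam = math.log(p_spam) + sum(math.log(p_word_spam[w]) for w in words)
    score_ham  = math.log(p_ham)  + sum(math.log(p_word_ham[w])  for w in words)
    return "spam" if score_spam > score_ham else "non-spam"
```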
Final Remark

When to use Naive Bayes classifiers?
  A moderate or large training set is available.
  The attributes that describe instances are (approximately) conditionally independent given the classification.

Successful applications:
  Diagnosis
  Classifying text documents