Bayes and Naïve Bayes Classifiers CS434
In this lecture: 1. Review some basic probability concepts 2. Introduce a useful probabilistic rule - Bayes rule 3. Introduce the learning algorithm based on Bayes rule (thus the name Bayes classifier) and its extension, Naïve Bayes
Commonly used discrete distributions. Binary (Bernoulli) random variable: X takes values in {0, 1}, with P(X = 1) = θ and P(X = 0) = 1 − θ. Categorical distribution: X can take multiple values, X in {1, ..., k}, with P(X = i) = θ_i and θ_1 + ... + θ_k = 1.
Learning the parameters of discrete distributions. Let X denote the event of getting a head when tossing a coin (X in {0, 1}). Given a sequence of coin tosses, we estimate P(X = 1) = (# of heads) / (# of tosses). Let Y denote the outcome of rolling a die (Y in {1, ..., 6}). Given a sequence of rolls, we estimate P(Y = i) = (# of rolls showing i) / (# of rolls). Estimation using such simple counting performs what we call Maximum Likelihood Estimation (MLE). There are other methods as well.
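As a concrete illustration, here is a minimal Python sketch (not from the slides; the sample data and names are made up) of MLE-by-counting for a coin and a die:

from collections import Counter

# Hypothetical observed data (made up for illustration)
coin_tosses = [1, 0, 1, 1, 0, 1, 0, 1]        # 1 = head, 0 = tail
die_rolls   = [3, 6, 1, 3, 2, 5, 6, 3, 4, 1]  # outcomes in {1,...,6}

# MLE for the Bernoulli parameter: fraction of heads
p_head = sum(coin_tosses) / len(coin_tosses)
print("P(head) =", p_head)                    # 5/8 = 0.625

# MLE for the categorical parameters: fraction of rolls showing each face
counts = Counter(die_rolls)
p_face = {face: counts[face] / len(die_rolls) for face in range(1, 7)}
print(p_face)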
Continuous Random Variables. A continuous random variable can take any value in an interval on the real line. It usually corresponds to some real-valued measurement, e.g., today's lowest temperature. The probability of a continuous random variable taking an exact value is typically zero: P(X = 56.2) = 0. It is more meaningful to measure the probability of a random variable taking a value within an interval, e.g., [50, 60]. This is captured in the probability density function.
PDF: probability density function. The probability of X taking a value in a given range [a, b] is defined to be the area under the PDF curve between a and b: P(a ≤ X ≤ b) = integral of f(x) dx from a to b. We often use f(x) to denote the PDF of X. Note: f(x) ≥ 0, and f(x) can be larger than 1.
What is the intuitive meaning of f(x)? If f(a) = k · f(b), then when x is sampled from f, you are k times more likely to see a value near a than a value near b. In the plotted example, there is higher probability (approximately 40% more likely) to see an x near 15 than an x near 20.
Commonly Used Continuous Distributions. Uniform: f(x) = 1/(b − a) for x in [a, b]. E.g., suppose we know that in the last 10 minutes, one customer arrived at OSU Federal counter #1; the actual time of arrival X can be modeled by a uniform distribution over the interval (0, 10). Gaussian (Normal): e.g., the body temperature of a person, the average IQ of a class of 1st graders. Exponential: e.g., the time between two successive forest fires.
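A small sketch (my own illustration, not from the slides) that checks the "probability = area under the PDF" idea by sampling from the uniform(0, 10) arrival-time example and comparing the empirical frequency of an interval with the exact area:

import numpy as np

rng = np.random.default_rng(0)

# Uniform(0, 10) arrival-time model from the slide
a, b = 0.0, 10.0
samples = rng.uniform(a, b, size=100_000)

# Probability of arriving in the first 3 minutes
lo, hi = 0.0, 3.0
empirical = np.mean((samples >= lo) & (samples <= hi))
exact = (hi - lo) / (b - a)   # area under the flat PDF f(x) = 1/(b - a)

print(f"empirical ~ {empirical:.3f}, exact = {exact:.3f}")  # both ~ 0.3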
Conditional Probability. P(A|B) = probability of A being true given that B is true. H = Have a headache, F = Coming down with Flu. P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2. Headaches are rare and flu is rarer, but if you're coming down with a flu there's a 50-50 chance you'll have a headache. Copyright Andrew W. Moore
Conditional Probability. P(H|F) = (area of H and F region) / (area of F region) = P(H ^ F) / P(F). H = Have a headache, F = Coming down with Flu. P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2. Copyright Andrew W. Moore
Definition of Conditional Probability: P(A|B) = P(A ^ B) / P(B). Corollary, the Chain Rule: P(A ^ B) = P(A|B) P(B). Copyright Andrew W. Moore
Probabilistic Inference. H = Have a headache, F = Coming down with Flu. P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2. One day you wake up with a headache. You think: "Drat! 50% of flus are associated with headaches, so I must have a 50-50 chance of coming down with flu." Is this reasoning good? Copyright Andrew W. Moore
Probabilistic Inference. H = Have a headache, F = Coming down with Flu. P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2. P(F ^ H) = P(H|F) P(F) = 1/80, so P(F|H) = P(F ^ H) / P(H) = (1/80) / (1/10) = 1/8. The chance of flu given a headache is only 12.5%, not 50%. Copyright Andrew W. Moore
What we just did: P(B|A) = P(A ^ B) / P(A) = P(A|B) P(B) / P(A). This is Bayes Rule. Bayes, Thomas (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418. Copyright Andrew W. Moore
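As a sanity check, here is a minimal Python sketch (my own, reusing the headache/flu numbers from the slides) of the Bayes-rule inference above:

# Numbers from the headache/flu example
p_H = 1 / 10          # P(headache)
p_F = 1 / 40          # P(flu)
p_H_given_F = 1 / 2   # P(headache | flu)

# Bayes rule: P(F | H) = P(H | F) * P(F) / P(H)
p_F_given_H = p_H_given_F * p_F / p_H
print(p_F_given_H)    # 0.125, i.e. 1/8 -- far from the naive 50% guess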
Using Bayes Rule to Gamble. The Win envelope has a dollar and four beads in it; the Lose envelope has three beads and no money. Trivial question: someone draws an envelope at random and offers to sell it to you. How much should you pay in order to not lose money on average? Copyright Andrew W. Moore
Using Bayes Rule to Gamble. The Win envelope has a dollar and four beads in it; the Lose envelope has three beads and no money. Interesting question: before deciding, you are allowed to see one bead drawn from the envelope. Suppose it's black: how much should you pay? Suppose it's red: how much should you pay? Copyright Andrew W. Moore
Where are we? We have recalled the fundamentals of probability. We have discussed conditional probability and Bayes rule. Now we will move on to talk about the Bayes classifier.
Basic Idea. Each example is described by m input features x_1, ..., x_m; for now we assume they are discrete variables. Each example belongs to one of c possible classes, denoted y in {1, ..., c}. If we don't know anything about the features, a randomly drawn example has a fixed probability P(y = j) of belonging to class j. If an example belongs to class j, its features x_1, ..., x_m will follow some particular distribution P(x_1, ..., x_m | y = j). Given an example x_1, ..., x_m, a Bayes classifier reasons about the value of y using Bayes rule: P(y | x_1, ..., x_m) = P(x_1, ..., x_m | y) P(y) / P(x_1, ..., x_m).
Learning a Bayes Classifier. P(y | x_1, ..., x_m) = P(x_1, ..., x_m | y) P(y) / P(x_1, ..., x_m). Given a set of training data {(x^i, y^i) : i = 1, ..., n}, we learn: Class priors P(y = j) for j = 1, ..., c. Class conditional distributions P(x_1, ..., x_m | y = j) for j = 1, ..., c, estimated as P(x_1 = v_1, ..., x_m = v_m | y = j) = (# of class-j examples that have values (v_1, ..., v_m)) / (# of total class-j examples). The marginal distribution of the features, P(x_1, ..., x_m), does not need to be learned; it can be computed as the sum over classes of P(x_1, ..., x_m | y = j) P(y = j).
Learning a joint distribution. Build a JD table for your attributes in which the probabilities are unspecified:

A B C Prob
0 0 0 ?
0 0 1 ?
0 1 0 ?
0 1 1 ?
1 0 0 ?
1 0 1 ?
1 1 0 ?
1 1 1 ?

Then fill in each row with P^(row) = (# of records matching row) / (total number of records):

A B C Prob
0 0 0 0.30
0 0 1 0.05
0 1 0 0.10
0 1 1 0.05
1 0 0 0.05
1 0 1 0.10
1 1 0 0.25   (fraction of all records in which A and B are True but C is False)
1 1 1 0.10
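A minimal Python sketch (my own illustration, with made-up records) of learning such a joint distribution table by counting:

from collections import Counter
from itertools import product

# Hypothetical binary records (A, B, C) -- made up for illustration
records = [(0, 0, 0), (1, 1, 0), (0, 0, 0), (1, 0, 1), (1, 1, 0), (0, 1, 1)]

counts = Counter(records)
n = len(records)

# One probability per row of the joint distribution table
joint = {row: counts[row] / n for row in product([0, 1], repeat=3)}

for row, p in joint.items():
    print(row, p)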
Example of Learning a Joint. This joint was obtained by learning from three attributes in the UCI Adult Census Database [Kohavi 1995]: 48842 examples in total, 12363 examples with this value combination. UCI machine learning repository: http://archive.ics.uci.edu/ml/datasets/adult
Bayes Classifiers in a nutshell. Learning: 1. Estimate P(y = j) as the fraction of class-j examples, for j = 1, ..., c. 2. Learn the conditional joint distribution P(x_1, ..., x_m | y = j) for j = 1, ..., c. 3. To make a prediction for a new example x_1, ..., x_m:
y_predict = argmax_y P(y | x_1, ..., x_m) = argmax_y P(x_1, ..., x_m | y) P(y) / P(x_1, ..., x_m) = argmax_y P(x_1, ..., x_m | y) P(y)
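Below is a compact sketch (my own, not the course's code; the function and variable names are invented) of this full-joint Bayes classifier for discrete features:

from collections import Counter, defaultdict

def train_joint_bayes(X, y):
    """Estimate class priors and per-class joint tables by counting."""
    n = len(y)
    priors = {c: cnt / n for c, cnt in Counter(y).items()}
    joints = defaultdict(Counter)
    for xi, yi in zip(X, y):
        joints[yi][tuple(xi)] += 1
    class_sizes = Counter(y)
    # P(x_1,...,x_m | y) for every value combination seen in class y
    cond = {c: {v: cnt / class_sizes[c] for v, cnt in tbl.items()}
            for c, tbl in joints.items()}
    return priors, cond

def predict_joint_bayes(x, priors, cond):
    """Return argmax_y P(x | y) P(y)."""
    scores = {c: priors[c] * cond[c].get(tuple(x), 0.0) for c in priors}
    return max(scores, key=scores.get)

# Tiny made-up example
X = [(1, 1), (1, 0), (0, 0), (0, 1)]
y = [1, 1, 0, 0]
priors, cond = train_joint_bayes(X, y)
print(predict_joint_bayes((1, 1), priors, cond))  # -> 1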
Example: Spam Filtering. Use the bag-of-words representation: assume a dictionary of words and tokens, and create one binary attribute for each dictionary entry, i.e., x_j = 1 means the j-th word is used in the email. Consider a reasonable dictionary size of 5000: we have 5000 binary attributes. How many parameters do we need to learn? We need to learn the class prior P(y = 1), P(y = 0): 1 parameter. For each of the two classes, we need to learn a joint distribution table of 5000 binary variables: 2^5000 − 1 parameters. Total: 2(2^5000 − 1) + 1. Clearly we don't have nearly enough data to estimate that many parameters.
The Joint Distribution Overfits. It is common to encounter the following situation: no training examples have the exact value combination x_1 = u_1, x_2 = u_2, ..., x_m = u_m, so the estimated P(x_1 = u_1, ..., x_m = u_m | y) = 0 for all values of y. How do we make predictions for such examples? To avoid overfitting, make some bold assumptions to simplify the joint distribution.
The Naïve Bayes Assumption. Assume that each feature x_i is independent of all other features given the class label y: P(x_1, ..., x_m | y) = P(x_1 | y) · ... · P(x_m | y).
A note about independence. Assume A and B are two random events. Then A and B are independent if and only if P(A|B) = P(A).
Independence Theorems. Assume P(A|B) = P(A); then P(A ^ B) = P(A) P(B). Assume P(A|B) = P(A); then P(B|A) = P(B). Both follow from the chain rule: P(A ^ B) = P(A|B) P(B) = P(A) P(B), and P(B|A) = P(A ^ B) / P(A) = P(B).
Examples of independent events: two separate coin tosses. Consider the following four events: T = Toothache (I have a toothache); C = Catch (dentist's steel probe catches in my tooth); A = Cavity (I have a cavity); W = Weather (weather is good). Then P(T, C, A, W) = P(T, C, A) P(W).
Conditional Independence. A and B are conditionally independent given C if P(A | B, C) = P(A | C), or equivalently P(B | A, C) = P(B | C). If A and B are conditionally independent given C, then P(A, B | C) = P(A | C) P(B | C).
Example of conditional independence. T = Toothache (I have a toothache); C = Catch (dentist's steel probe catches in my tooth); A = Cavity. T and C are conditionally independent given A: P(T, C | A) = P(T | A) P(C | A). So events that are not independent of each other might be conditionally independent given some fact. It can also happen the other way around: events that are independent might become conditionally dependent given some fact. B = Burglar in your house; A = Alarm (burglar alarm) rang in your house; E = Earthquake happened. B is independent of E (ignoring some minor possible connections between them). However, if we know A is true, then B and E are no longer independent. Why? P(B | A) >> P(B | A, E): knowing E is true makes it much less likely for B to be true.
Naïve Bayes Classifier. By assuming that each attribute is independent of any other attributes given the class label, we now have a Naïve Bayes classifier. Instead of learning a joint distribution of all features, we learn P(x_i | y) separately for each feature x_i, and we compute the joint by taking the product: P(x_1, ..., x_m | y) = P(x_1 | y) · ... · P(x_m | y).
Naïve Bayes Classifiers in a nutshell. Learning: 1. Learn P(x_i | y = v) for each feature x_i and each class value v. 2. Estimate P(y = v) as the fraction of records with y = v. 3. For a new example x_1, ..., x_m:
y_predict = argmax_y P(y | x_1, ..., x_m) = argmax_y P(x_1, ..., x_m | y) P(y) = argmax_y P(y) · product over i = 1, ..., m of P(x_i | y)
Example:

X1 X2 X3 Y
1  1  1  0
1  1  0  0
0  0  0  0
0  1  0  1
0  0  1  1
0  1  1  1

Apply Naïve Bayes, and make a prediction for (X1, X2, X3) = (1, 0, 1).
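A short Python sketch (my own code, using exactly the table above) that runs the Naïve Bayes computation and predicts the class of (1, 0, 1):

from collections import Counter

# Training table from the slide: rows of (x1, x2, x3, y)
data = [(1, 1, 1, 0), (1, 1, 0, 0), (0, 0, 0, 0),
        (0, 1, 0, 1), (0, 0, 1, 1), (0, 1, 1, 1)]
query = (1, 0, 1)

class_counts = Counter(row[-1] for row in data)
scores = {}
for c, n_c in class_counts.items():
    score = n_c / len(data)                       # prior P(y = c)
    for i, v in enumerate(query):                 # product of P(x_i = v | y = c)
        n_match = sum(1 for row in data if row[-1] == c and row[i] == v)
        score *= n_match / n_c
    scores[c] = score

print(scores)                        # {0: 0.037..., 1: 0.0}
print(max(scores, key=scores.get))   # predicts class 0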
Laplace Smoothing. With the Naïve Bayes assumption, we can still end up with zero probabilities. E.g., if we receive an email that contains a word that has never appeared in the training emails, then P(x_i = 1 | y) = 0 for both y = 0 and y = 1, and as such we ignore all the other words in the email because of this single rare word. Laplace smoothing can help: P(x_i = 1 | y = 0) = (# of examples with x_i = 1 and y = 0, plus 1) / (# of examples with y = 0, plus 2).
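A minimal sketch (my own, with invented names) of the smoothed estimate for a binary feature:

def laplace_estimate(n_feature_and_class, n_class, n_values=2, alpha=1):
    """Smoothed P(x_i = v | y = c): add alpha to the count and
    alpha * n_values to the denominator (n_values = 2 for binary features)."""
    return (n_feature_and_class + alpha) / (n_class + alpha * n_values)

# A word never seen in class-0 training emails no longer gets probability zero
print(laplace_estimate(n_feature_and_class=0, n_class=40))    # 1/42 ~ 0.024
print(laplace_estimate(n_feature_and_class=10, n_class=40))   # 11/42 ~ 0.262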
Final Notes about (Naïve) Bayes Classifiers. Any density estimator can be plugged in to estimate P(x_1, x_2, ..., x_m | y) for Bayes, or P(x_i | y) for Naïve Bayes. Real-valued attributes can be discretized or directly modeled using simple continuous distributions such as the Gaussian (Normal) distribution. Naïve Bayes is wonderfully cheap and survives tens of thousands of attributes easily.
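For real-valued attributes, one option mentioned above is to model each P(x_i | y) with a Gaussian. A brief sketch (my own, not course code; the feature values are made up) of fitting and scoring a single Gaussian feature per class:

import math

def fit_gaussian(values):
    """MLE mean and variance of one real-valued feature within one class."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, var

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical body-temperature values for two classes (healthy vs. fever)
mu0, var0 = fit_gaussian([36.5, 36.8, 37.0, 36.6])
mu1, var1 = fit_gaussian([38.2, 38.9, 39.1, 38.5])

x = 38.4
print(gaussian_pdf(x, mu0, var0), gaussian_pdf(x, mu1, var1))  # class-1 density is far larger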