Bayesian Classification
http://css.engineering.uiowa.edu/~comp/

Bayesian Classification: Why?
- Probabilistic learning: computes explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
- Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct, and prior knowledge can be combined with observed data
- Probabilistic prediction: predicts multiple hypotheses, weighted by their probabilities
- Benchmark: even where Bayesian methods are computationally intractable, they provide a benchmark against which other algorithms can be judged
Bayesian Theorem

Given training data D, the posterior probability P(h|D) of a hypothesis h (the probability that h holds given D) follows from Bayes' theorem:

    P(h|D) = P(D|h) P(h) / P(D)

The MAP (maximum a posteriori) hypothesis is the one with the largest posterior; since P(D) is the same for every hypothesis, it can be dropped:

    h_MAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) P(h)

Practical difficulty: initial knowledge of many probabilities is required, and evaluating them carries a significant computational cost.

Example: a training set D describes computers by their speed and price, and h is the hypothesis that a given sample is a computer (i.e., belongs to a specific category C). P(h|D), the probability that the sample is a computer given its speed and price, is difficult to compute directly.
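As a concrete illustration, here is a minimal Python sketch of Bayes' theorem and the MAP choice; all the numeric probabilities in it are made-up values for illustration, not taken from any data set in these notes.

```python
# Bayes' theorem: P(h|D) = P(D|h) * P(h) / P(D).
def posterior(likelihood: float, prior: float, evidence: float) -> float:
    return likelihood * prior / evidence

# Hypothetical numbers: P(D|h) = 0.6, P(h) = 0.3, P(D) = 0.4.
print(posterior(0.6, 0.3, 0.4))  # 0.45

# h_MAP maximizes P(D|h) * P(h); the constant P(D) cancels out.
hypotheses = {"h1": (0.6, 0.3), "h2": (0.2, 0.7)}  # h -> (P(D|h), P(h))
h_map = max(hypotheses, key=lambda h: hypotheses[h][0] * hypotheses[h][1])
print(h_map)  # 'h1', since 0.6 * 0.3 = 0.18 > 0.2 * 0.7 = 0.14
```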
Naïve Bayes Classifier (1)

Simplifying assumption: the attribute values v_1, ..., v_n are conditionally independent given the class C_j:

    P(v_1, ..., v_n | C_j) = ∏_{i=1}^{n} P(v_i | C_j)

so the classifier assigns the class C_j ∈ V that maximizes P(C_j) ∏_{i=1}^{n} P(v_i | C_j).

    Features                                           Decision
    Object  Outlook   Temperature  Humidity  Wind      Play Ball
    1       Sunny     Hot          High      Weak      No
    2       Sunny     Hot          High      Strong    No
    3       Overcast  Hot          High      Weak      Yes
    4       Rain      Mild         High      Weak      Yes
    5       Rain      Cool         Normal    Weak      Yes
    6       Rain      Cool         Normal    Strong    No
    7       Overcast  Cool         Normal    Strong    Yes
    8       Sunny     Mild         High      Weak      No
    9       Sunny     Cool         Normal    Weak      Yes
    10      Rain      Mild         Normal    Weak      Yes
    11      Sunny     Mild         Normal    Strong    Yes
    12      Overcast  Mild         High      Strong    Yes
    13      Overcast  Hot          Normal    Weak      Yes
    14      Rain      Mild         High      Strong    No
Naive Bayesian Classifier (2)

Given a training set, we can compute the conditional probabilities (P = play, N = don't play):

    Outlook      P    N        Temperature  P    N
    sunny        2/9  3/5      hot          2/9  2/5
    overcast     4/9  0        mild         4/9  2/5
    rain         3/9  2/5      cool         3/9  1/5

    Humidity     P    N        Windy        P    N
    high         3/9  4/5      true         3/9  3/5
    normal       6/9  1/5      false        6/9  2/5

Bayesian classification

The classification problem may be formalized using a-posteriori probabilities:
- P(C|X) = probability that the sample tuple X = <x_1, ..., x_k> is of class C
- e.g., P(class = N | outlook = sunny, windy = true, ...)
- Idea: assign to sample X the class label C such that P(C|X) is maximal
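The frequency tables above come from simple counting. Below is a minimal Python sketch (standard library only) that transcribes the play-ball table and derives the relative-frequency estimates; the function name `p` and the data layout are my own choices, not part of the original notes.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# The 14 training examples, transcribed from the table in the previous slide.
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
attrs = ["Outlook", "Temperature", "Humidity", "Wind"]

class_counts = Counter(row[-1] for row in data)  # {'Yes': 9, 'No': 5}
value_counts = defaultdict(Counter)              # (attr, class) -> value counts
for row in data:
    for attr, value in zip(attrs, row[:-1]):
        value_counts[(attr, row[-1])][value] += 1

def p(attr, value, cls):
    """Relative-frequency estimate of P(attr = value | class = cls)."""
    return Fraction(value_counts[(attr, cls)][value], class_counts[cls])

print(p("Outlook", "Sunny", "Yes"))  # 2/9, matching the table
print(p("Humidity", "High", "No"))   # 4/5
```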
Estimating a-posteriori probabilities

Bayes theorem:

    P(C|X) = P(X|C) P(C) / P(X)

- P(X) is constant for all classes
- P(C) = relative frequency of class-C samples
- choose the C for which P(C|X) is maximum, i.e., the C for which P(X|C) P(C) is maximum
- Problem: computing P(X|C) directly is infeasible!

Naïve Bayesian Classification

Naïve assumption: attribute independence, i.e.

    P(x_1, ..., x_k | C) = P(x_1|C) · ... · P(x_k|C)

- If the i-th attribute is categorical: P(x_i|C) is estimated as the relative frequency of samples having value x_i for the i-th attribute within class C
- If the i-th attribute is continuous: P(x_i|C) is estimated through a Gaussian density function (see the sketch below)
- Computationally easy in both cases
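For the continuous case, a minimal sketch of the Gaussian estimate, assuming the mean and standard deviation are fit per class from the training samples; the temperature values used here are invented purely for illustration.

```python
import math

def gaussian_density(x: float, mean: float, std: float) -> float:
    """N(x; mean, std) = exp(-(x - mean)^2 / (2 std^2)) / (std * sqrt(2 pi))."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

def fit(values):
    """Estimate (mean, std) of one continuous attribute within one class."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)

# Hypothetical temperature readings observed within one class C:
mean, std = fit([21.0, 24.0, 19.5, 22.5])
print(gaussian_density(23.0, mean, std))  # plays the role of P(x_i|C)
```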
Play-tennis example: estimating P(x_i|C)

    Outlook   Temperature  Humidity  Windy  Class
    sunny     hot          high      false  N
    sunny     hot          high      true   N
    overcast  hot          high      false  P
    rain      mild         high      false  P
    rain      cool         normal    false  P
    rain      cool         normal    true   N
    overcast  cool         normal    true   P
    sunny     mild         high      false  N
    sunny     cool         normal    false  P
    rain      mild         normal    false  P
    sunny     mild         normal    true   P
    overcast  mild         high      true   P
    overcast  hot          normal    false  P
    rain      mild         high      true   N

Class priors: P(p) = 9/14, P(n) = 5/14

    outlook:      P(sunny|p) = 2/9     P(sunny|n) = 3/5
                  P(overcast|p) = 4/9  P(overcast|n) = 0
                  P(rain|p) = 3/9      P(rain|n) = 2/5
    temperature:  P(hot|p) = 2/9       P(hot|n) = 2/5
                  P(mild|p) = 4/9      P(mild|n) = 2/5
                  P(cool|p) = 3/9      P(cool|n) = 1/5
    humidity:     P(high|p) = 3/9      P(high|n) = 4/5
                  P(normal|p) = 6/9    P(normal|n) = 1/5
    windy:        P(true|p) = 3/9      P(true|n) = 3/5
                  P(false|p) = 6/9     P(false|n) = 2/5

Play-tennis example: classifying an unseen sample X = <rain, hot, high, false>

    P(X|p) P(p) = P(rain|p) P(hot|p) P(high|p) P(false|p) P(p)
                = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 ≈ 0.010582
    P(X|n) P(n) = P(rain|n) P(hot|n) P(high|n) P(false|n) P(n)
                = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 ≈ 0.018286

Since 0.018286 > 0.010582, sample X is classified into class n (don't play).
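The same computation as a minimal Python sketch, with the conditional probabilities hard-coded from the tables above (only the factors needed for this particular X are included):

```python
from functools import reduce

priors = {"p": 9 / 14, "n": 5 / 14}
cond = {  # P(value | class), read off the play-tennis tables above
    "p": {"rain": 3 / 9, "hot": 2 / 9, "high": 3 / 9, "false": 6 / 9},
    "n": {"rain": 2 / 5, "hot": 2 / 5, "high": 4 / 5, "false": 2 / 5},
}

x = ["rain", "hot", "high", "false"]  # the unseen sample X
scores = {c: reduce(lambda acc, v: acc * cond[c][v], x, priors[c]) for c in priors}

print(scores)                       # {'p': 0.01058..., 'n': 0.01828...}
print(max(scores, key=scores.get))  # 'n' -> don't play
```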
The independence hypothesis
- makes computation possible
- yields optimal classifiers when it is satisfied
- but is seldom satisfied in practice, as attributes (variables) are often correlated

Attempts to overcome this limitation:
- Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
- Decision trees, which reason on one attribute at a time, considering the most important attributes first

Bayesian Belief Networks (1)

[Figure: a belief network over the variables FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea, in which FamilyHistory and Smoker are the parents of LungCancer.]

The conditional probability table for the variable LungCancer (FH = FamilyHistory, S = Smoker, LC = LungCancer):

          (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
    LC    0.8      0.5       0.7       0.1
    ~LC   0.2      0.5       0.3       0.9
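To make the table concrete, here is a minimal sketch of how such a CPT is used: P(LC | FH, S) is a lookup keyed by the values of the parent variables. The dictionary layout is my own illustration, not from the original slides.

```python
# P(LC = true | FH, S); P(LC = false | FH, S) is the complement.
cpt_lung_cancer = {
    (True, True): 0.8,    # (FH, S)
    (True, False): 0.5,   # (FH, ~S)
    (False, True): 0.7,   # (~FH, S)
    (False, False): 0.1,  # (~FH, ~S)
}

def p_lung_cancer(lc: bool, fh: bool, s: bool) -> float:
    p_true = cpt_lung_cancer[(fh, s)]
    return p_true if lc else 1.0 - p_true

print(p_lung_cancer(True, fh=False, s=True))   # 0.7
print(p_lung_cancer(False, fh=False, s=True))  # 0.3
```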
Bayesian Belief Networks (2)
- A Bayesian belief network allows a subset of the variables to be conditionally independent
- It is a graphical model of causal relationships
- Several cases of learning Bayesian belief networks:
  - given both the network structure and all the variables: easy
  - given the network structure but only some of the variables
  - when the network structure is not known in advance

Reference
J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, CA, 2001.