Data Mining: Practical Machine Learning Tools and Techniques
Slides for a chapter of Data Mining by I. H. Witten, E. Frank and M. A. Hall
Statistical modeling

"Opposite" of 1R: use all the attributes
Two assumptions: attributes are
  equally important
  statistically independent (given the class value)
I.e., knowing the value of one attribute says nothing about the value of another (if the class is known)
The independence assumption is never correct!
But this scheme works well in practice
Probabilities for weather data

Conditional probabilities, one entry per attribute value (class "yes", class "no"):

Outlook:      sunny 2/9, 3/5;   overcast 4/9, 0/5;   rainy 3/9, 2/5
Temperature:  hot 2/9, 2/5;     mild 4/9, 2/5;       cool 3/9, 1/5
Humidity:     high 3/9, 4/5;    normal 6/9, 1/5
Windy:        false 6/9, 2/5;   true 3/9, 3/5
Play:         yes 9/14;         no 5/14
Probabilities for weather data

A new day: Outlook = sunny, Temperature = cool, Humidity = high, Windy = true, Play = ?

Likelihood of the two classes, using the conditional probabilities from the table above:
  For "yes" = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
  For "no"  = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206

Conversion into a probability by normalization:
  P("yes") = 0.0053 / (0.0053 + 0.0206) = 0.205
  P("no")  = 0.0206 / (0.0053 + 0.0206) = 0.795
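The likelihood and normalization steps can be checked with a few lines of Python (a sketch using exact fractions; the conditional probabilities are the ones read off the frequency table above):

```python
# Likelihoods for the new day (outlook=sunny, temperature=cool,
# humidity=high, windy=true) as products of table entries and the prior.
from fractions import Fraction as F

like_yes = F(2, 9) * F(3, 9) * F(3, 9) * F(3, 9) * F(9, 14)
like_no  = F(3, 5) * F(1, 5) * F(4, 5) * F(3, 5) * F(5, 14)

# Normalize so the two class probabilities sum to 1.
total = like_yes + like_no
print(round(float(like_yes), 4))          # ≈ 0.0053
print(round(float(like_no), 4))           # ≈ 0.0206
print(round(float(like_yes / total), 3))  # P(yes) ≈ 0.205
```

Exact fractions avoid any rounding until the final step, which is why the printed values match the slide's figures.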
Bayes's rule

Probability of event H given evidence E:

  Pr[H | E] = Pr[E | H] Pr[H] / Pr[E]

A priori probability of H: Pr[H]
  Probability of the event before evidence is seen
A posteriori probability of H: Pr[H | E]
  Probability of the event after evidence is seen

Thomas Bayes. Born: 1702 in London, England. Died: 1761 in Tunbridge Wells, Kent, England.
Naïve Bayes for classification

Classification learning: what's the probability of the class given an instance?
  Evidence E = instance
  Event H = class value for instance
Naïve assumption: evidence splits into parts (i.e. attributes) that are independent

  Pr[H | E] = Pr[E1 | H] Pr[E2 | H] ... Pr[En | H] Pr[H] / Pr[E]
Weather data example

Evidence E: Outlook = sunny, Temperature = cool, Humidity = high, Windy = true

Probability of class "yes":

  Pr[yes | E] = Pr[Outlook = sunny | yes]
                × Pr[Temperature = cool | yes]
                × Pr[Humidity = high | yes]
                × Pr[Windy = true | yes]
                × Pr[yes] / Pr[E]
              = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / Pr[E]
The zero-frequency problem

What if an attribute value doesn't occur with every class value? (e.g. Humidity = high for class "yes")
  The probability will be zero: Pr[Humidity = high | yes] = 0
  The a posteriori probability will also be zero: Pr[yes | E] = 0 (no matter how likely the other values are!)
Remedy: add 1 to the count for every attribute value-class combination (Laplace estimator)
Result: probabilities will never be zero! (also: stabilizes probability estimates)
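The Laplace estimator can be sketched in a few lines, here applied to a count that really is zero in the weather data: outlook = overcast never occurs among the five "no" instances:

```python
# Laplace estimator: add 1 to every value count; the denominator grows
# by the number of possible values, so the estimates still sum to 1.
counts = {"sunny": 3, "overcast": 0, "rainy": 2}  # outlook counts for class "no"
n = sum(counts.values())                          # 5 "no" instances
k = len(counts)                                   # 3 possible outlook values

smoothed = {v: (c + 1) / (n + k) for v, c in counts.items()}
print(smoothed["overcast"])     # 1/8 = 0.125 rather than 0
print(sum(smoothed.values()))   # still 1.0
```

No factor can now wipe out an entire product, and estimates based on few instances are pulled toward the uniform distribution.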
Missing values

Training: the instance is not included in the frequency count for that attribute value-class combination
Classification: the attribute is omitted from the calculation
Example: Outlook = ?, Temperature = cool, Humidity = high, Windy = true

  Likelihood of "yes" = 3/9 × 3/9 × 3/9 × 9/14 = 0.0238
  Likelihood of "no"  = 1/5 × 4/5 × 3/5 × 5/14 = 0.0343
  P("yes") = 0.0238 / (0.0238 + 0.0343) = 41%
  P("no")  = 0.0343 / (0.0238 + 0.0343) = 59%
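The same example in Python (a sketch; the fractions are the conditional probabilities from the weather-data frequency table, with the missing outlook factor simply dropped from each product):

```python
# Outlook is unknown, so its factor is omitted; the remaining factors
# are temperature=cool, humidity=high, windy=true, and the class prior.
like_yes = (3/9) * (3/9) * (3/9) * (9/14)
like_no  = (1/5) * (4/5) * (3/5) * (5/14)

total = like_yes + like_no
print(round(like_yes, 4))             # ≈ 0.0238
print(round(like_no, 4))              # ≈ 0.0343
print(round(100 * like_yes / total))  # ≈ 41 (percent)
```

Dropping a factor inflates both likelihoods by the same missing term, so the normalized probabilities are unaffected by the omission itself.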
Numeric attributes

Usual assumption: attributes have a normal or Gaussian probability distribution (given the class)
The probability density function for the normal distribution is defined by two parameters:

  Sample mean:         μ = (1/n) Σ_{i=1}^{n} x_i
  Standard deviation:  σ = sqrt( (1/(n-1)) Σ_{i=1}^{n} (x_i - μ)² )

Then the density function f(x) is

  f(x) = 1 / (sqrt(2π) σ) · e^( -(x-μ)² / (2σ²) )
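These formulas translate directly into code. A sketch, using the temperature values of the nine "yes" days from the weather data to estimate μ and σ:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """f(x) = 1 / (sqrt(2*pi) * sigma) * exp(-(x - mu)**2 / (2 * sigma**2))"""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Parameter estimation (note the n-1 in the variance, as in the formula above).
xs = [83, 70, 68, 64, 69, 75, 75, 72, 81]   # temperature on "yes" days
mu = sum(xs) / len(xs)
sigma = (sum((x - mu) ** 2 for x in xs) / (len(xs) - 1)) ** 0.5

print(mu)                                   # 73.0
print(round(sigma, 1))                      # 6.2
print(round(gaussian_pdf(66, mu, sigma), 4))
```

Each class gets its own (μ, σ) pair per numeric attribute, and the density value plays the role the table fractions play for nominal attributes.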
Statistics for weather data

Temperature ("yes"): 83, 70, 68, 64, 69, 75, 75, 72, 81    μ = 73,   σ = 6.2
Temperature ("no"):  85, 80, 65, 72, 71                     μ = 75,   σ = 7.9
Humidity ("yes"):    86, 96, 80, 65, 70, 80, 70, 90, 75     μ = 79.1, σ = 10.2
Humidity ("no"):     85, 90, 70, 95, 91                     μ = 86.2, σ = 9.7

(The nominal attributes Outlook and Windy keep the counts from the earlier frequency table; Play: 9/14 for "yes", 5/14 for "no".)

Example density value:

  f(temperature = 66 | yes) = 1 / (sqrt(2π) · 6.2) · e^( -(66-73)² / (2 · 6.2²) ) = 0.0340
Classifying a new day

A new day: Outlook = sunny, Temperature = 66, Humidity = 90, Windy = true, Play = ?

  Likelihood of "yes" = 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 = 0.000036
  Likelihood of "no"  = 3/5 × 0.0221 × 0.0381 × 3/5 × 5/14 = 0.000108
  P("yes") = 0.000036 / (0.000036 + 0.000108) = 25%
  P("no")  = 0.000108 / (0.000036 + 0.000108) = 75%

Missing values during training are not included in the calculation of mean and standard deviation
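Nominal fractions and numeric densities combine in one product. A sketch of the arithmetic, taking 0.0340, 0.0221 and 0.0381 as the density values quoted in this example:

```python
# New day: outlook=sunny, temperature=66, humidity=90, windy=true.
# Nominal attributes contribute table fractions, numeric attributes
# contribute Gaussian density values.
like_yes = (2/9) * 0.0340 * 0.0221 * (3/9) * (9/14)
like_no  = (3/5) * 0.0221 * 0.0381 * (3/5) * (5/14)

total = like_yes + like_no
print(round(100 * like_yes / total))  # ≈ 25 (percent)
print(round(100 * like_no / total))   # ≈ 75
```

Density values are not probabilities (they can exceed 1), but because both class scores are normalized at the end, they can be multiplied in exactly like the fractions.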