Soft Computing
Lecture Notes on Machine Learning

Matteo Matteucci, matteucci@elet.polimi.it
Department of Electronics and Information, Politecnico di Milano

Supervised Learning: Bayes Classifiers

Maximum Likelihood vs. Maximum A Posteriori

According to the probability we want to maximize:

- MLE (Maximum Likelihood Estimator):  Ŷ = argmax_{v_i} P(X_1, X_2, ..., X_m | Y = v_i)
- MAP (Maximum A Posteriori Estimator):  Ŷ = argmax_{v_i} P(Y = v_i | X_1, X_2, ..., X_m)

We can compute the second by applying Bayes' Theorem:

P(Y = v_i | X_1, X_2, ..., X_m) = P(X_1, X_2, ..., X_m | Y = v_i) P(Y = v_i) / P(X_1, X_2, ..., X_m)
                                = P(X_1, X_2, ..., X_m | Y = v_i) P(Y = v_i) / Σ_j P(X_1, X_2, ..., X_m | Y = v_j) P(Y = v_j)

Bayes Classifiers Unleashed

Using the MAP estimate, we get the Bayes Classifier:

- Learn the distribution over the inputs for each value of Y: this gives P(X_1, X_2, ..., X_m | Y = v_i)
- Estimate P(Y = v_i) as the fraction of records with Y = v_i
- For a new record, predict
  Ŷ = argmax_{v_i} P(Y = v_i | X_1, X_2, ..., X_m) = argmax_{v_i} P(X_1, X_2, ..., X_m | Y = v_i) P(Y = v_i)

You can plug in any density estimator to get your flavor of Bayes Classifier:

- Joint Density Estimator
- Naïve Density Estimator
- ...
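
As an illustration (not from the original slides), here is a minimal Python skeleton of this recipe with a pluggable class-conditional density estimator; the class and method names (`fit`, `prob`, `predict`) are assumptions, not an established API.

```python
from collections import Counter

class BayesClassifier:
    """MAP Bayes classifier with a pluggable class-conditional density estimator.

    `estimator_factory()` must return an object with fit(rows) and prob(row)
    methods; any density estimator (joint, naive, ...) can be plugged in.
    """

    def __init__(self, estimator_factory):
        self.estimator_factory = estimator_factory

    def fit(self, X, y):
        n = len(y)
        self.priors = {v: c / n for v, c in Counter(y).items()}   # P(Y = v_i)
        self.cond = {}                                            # P(X_1,...,X_m | Y = v_i)
        for v in self.priors:
            rows = [x for x, label in zip(X, y) if label == v]
            est = self.estimator_factory()
            est.fit(rows)
            self.cond[v] = est
        return self

    def predict(self, x):
        # MAP rule: argmax over v of P(X = x | Y = v) * P(Y = v)
        return max(self.priors, key=lambda v: self.cond[v].prob(x) * self.priors[v])
```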

Unsupervised Learning: Density Estimation

The Joint Distribution

Given two random variables X and Y, the joint distribution of X and Y is the distribution of X and Y together: P(X, Y).

How to make a joint distribution of M variables:

1. Make a truth table listing all combinations of values
2. For each combination, state/compute how probable it is
3. Check that all the probabilities sum up to 1

Example with 3 boolean variables A, B and C: the truth table has 2^3 = 8 rows, one per combination, each with its probability (see the sketch below).
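
The probability table from the original slide did not survive the transcription. The sketch below is a hypothetical Python version of such a table; its entries are a reconstruction chosen to be consistent with the probabilities worked out on the next two slides.

```python
from itertools import product

# Hypothetical joint distribution over three boolean variables (A, B, C):
# one probability per row of the 2^3-row truth table.
joint = {
    (False, False, False): 0.30, (False, False, True): 0.05,
    (False, True,  False): 0.10, (False, True,  True): 0.05,
    (True,  False, False): 0.05, (True,  False, True): 0.10,
    (True,  True,  False): 0.25, (True,  True,  True): 0.10,
}

# Step 1: every combination of values appears exactly once.
assert set(joint) == set(product([False, True], repeat=3))
# Step 3: the probabilities sum up to 1.
assert abs(sum(joint.values()) - 1.0) < 1e-9
```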

Using the Joint Distribution (I)

Compute the probability of any logical expression E: P(E) = Σ_{rows matching E} P(row).

- P(A) = 0.05 + 0.10 + 0.25 + 0.10 = 0.5
- P(A ∧ B) = 0.25 + 0.10 = 0.35
- P(¬A ∨ ¬C) = 0.30 + 0.05 + 0.10 + 0.05 + 0.05 + 0.25 = 0.8

Can't we do something more useful?

Using the Joint Distribution (II)

Use it for making inference:

P(E_1 | E_2) = P(E_1 ∧ E_2) / P(E_2) = Σ_{rows matching E_1 ∧ E_2} P(row) / Σ_{rows matching E_2} P(row).

- P(A | B) = (0.25 + 0.10) / (0.10 + 0.05 + 0.25 + 0.10) = 0.35 / 0.50 = 0.70
- P(C | A ∧ B) = 0.10 / (0.25 + 0.10) = 0.10 / 0.35 ≈ 0.286
- P(¬A | C) = (0.05 + 0.05) / (0.05 + 0.05 + 0.10 + 0.10) = 0.10 / 0.30 ≈ 0.333

Where do we get the Joint Density from?
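
Continuing the sketch above (same hypothetical `joint` table), inference by summing matching rows can be written as:

```python
def prob(joint, event):
    """P(E): sum the probabilities of the rows (a, b, c) for which event(a, b, c) holds."""
    return sum(p for (a, b, c), p in joint.items() if event(a, b, c))

def cond_prob(joint, e1, e2):
    """P(E1 | E2) = P(E1 and E2) / P(E2)."""
    return prob(joint, lambda a, b, c: e1(a, b, c) and e2(a, b, c)) / prob(joint, e2)

# `joint` is the table defined in the previous snippet.
print(prob(joint, lambda a, b, c: a and b))            # P(A and B)       -> 0.35
print(cond_prob(joint, lambda a, b, c: c,
                lambda a, b, c: a and b))              # P(C | A and B)   -> ~0.286
print(cond_prob(joint, lambda a, b, c: not a,
                lambda a, b, c: c))                    # P(not A | C)     -> ~0.333
```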

The Joint Distribution Estimator

A Density Estimator learns a mapping from a set of attributes to a probability distribution over the attribute space.

Our Joint Distribution learner is our first example of something called Density Estimation:

- Build a Joint Distribution table for your attributes in which the probabilities are unspecified
- Then fill in each row with
  P̂(row) = (number of records matching row) / (total number of records)

We will come back to its formal definition at the end of this lecture, don't worry; but for now... How can we evaluate it?
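
A minimal sketch of this counting recipe, assuming the same hypothetical fit/prob interface used in the classifier skeleton above:

```python
from collections import Counter

class JointDensityEstimator:
    """Estimate P(row) as the fraction of training records equal to that row."""

    def fit(self, rows):
        rows = [tuple(r) for r in rows]
        counts = Counter(rows)
        n = len(rows)
        self.table = {row: c / n for row, c in counts.items()}
        return self

    def prob(self, row):
        # Rows never seen in training get probability 0 (this is what will overfit).
        return self.table.get(tuple(row), 0.0)
```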

Evaluating a Density Estimator

We can use the likelihood to evaluate a density estimator:

- Given a record x, a density estimator M tells you how likely it is: P̂(x | M)
- Given a dataset with R records, a density estimator can tell you how likely the dataset is, under the assumption that all records were independently generated from it:
  P̂(dataset | M) = P̂(x_1 ∧ x_2 ∧ ... ∧ x_R | M) = ∏_{k=1}^{R} P̂(x_k | M)
- Since the likelihood can get too small, we usually use the log-likelihood:
  log P̂(dataset | M) = log ∏_{k=1}^{R} P̂(x_k | M) = Σ_{k=1}^{R} log P̂(x_k | M)
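
A short sketch of the dataset (log-)likelihood for any estimator exposing the hypothetical prob(row) method used above:

```python
import math

def log_likelihood(estimator, rows):
    """Sum of log P(x_k | M) over the dataset; -inf if any record has probability 0."""
    total = 0.0
    for row in rows:
        p = estimator.prob(row)
        if p == 0.0:
            return float("-inf")   # one unseen record makes the whole dataset "impossible"
        total += math.log(p)
    return total
```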

Joint Distribution Summary

Now we have a way to learn a Joint Density estimator from data. Joint Density estimators can do many good things:

- Sort the records by probability, and thus spot weird records (e.g., anomaly/outlier detection)
- Do inference: P(E_1 | E_2) (e.g., automatic doctor, help desk)
- Be used for Bayes Classifiers (see later)

But Joint Density estimators can badly overfit! The Joint Estimator just mirrors the training data: suppose you see a new dataset; its likelihood is going to be

log P̂(new dataset | M) = Σ_{k=1}^{R} log P̂(x_k | M) = -∞  if ∃k : P̂(x_k | M) = 0

We need something that generalizes: the Naïve Density Estimator.

Naïve Density Estimator

The naïve model assumes that each attribute is distributed independently of all the other attributes. Let x[i] denote the i-th field of record x; the Naïve Density Estimator says that

x[i] ⊥ {x[1], x[2], ..., x[i-1], x[i+1], ..., x[M]}

It is important to recall, every time we use a Naïve Density, that:

- Attributes are treated as equally important
- Knowing the value of one attribute says nothing about the value of another
- The independence assumption is almost never correct...
- ... yet this scheme works well in practice!

Naïve Density Estimator: An Example

From a Naïve Distribution you can compute the Joint Distribution. Suppose A, B, C, D are independently distributed; what is P(A ∧ ¬B ∧ C ∧ ¬D)?

P(A ∧ ¬B ∧ C ∧ ¬D) = P(A | ¬B ∧ C ∧ ¬D) P(¬B ∧ C ∧ ¬D)
                   = P(A) P(¬B ∧ C ∧ ¬D)
                   = P(A) P(¬B | C ∧ ¬D) P(C ∧ ¬D)
                   = P(A) P(¬B) P(C ∧ ¬D)
                   = P(A) P(¬B) P(C | ¬D) P(¬D)
                   = P(A) P(¬B) P(C) P(¬D)

Example: suppose you roll a green die and a red die, and build three datasets:

- Dataset 1: A = red value, B = green value
- Dataset 2: A = red value, B = sum of the values
- Dataset 3: A = sum of the values, B = difference of the values

Which of these datasets violates the naïve assumption?

Learning a Naïve Density Estimator

Suppose x[1], x[2], ..., x[M] are independently distributed. Once we have the Naïve Distribution, we can construct any row of the implied Joint Distribution on demand:

P(x[1] = u_1, x[2] = u_2, ..., x[M] = u_M) = ∏_{k=1}^{M} P(x[k] = u_k)

So we can do any inference! But how do we learn a Naïve Density Estimator?

P̂(x[i] = u) = (number of records for which x[i] = u) / (total number of records)

Please wait a few minutes, I'll get to the reason for this too!
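
A minimal sketch of a Naïve Density Estimator, again assuming the hypothetical fit/prob interface from the earlier snippets:

```python
from collections import Counter

class NaiveDensityEstimator:
    """Per-attribute frequencies; P(row) is the product of the marginals."""

    def fit(self, rows):
        n = len(rows)
        m = len(rows[0])
        # marginals[i][u] = fraction of records for which x[i] = u
        self.marginals = [Counter(row[i] for row in rows) for i in range(m)]
        for counts in self.marginals:
            for u in counts:
                counts[u] /= n
        return self

    def prob(self, row):
        p = 1.0
        for i, u in enumerate(row):
            p *= self.marginals[i].get(u, 0.0)
        return p
```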

Joint Density vs. Naïve Density

What have we got so far? Let's summarize:

Joint Distribution Estimator
- Can model anything
- Given 100 records and more than 6 Boolean attributes, it will perform poorly
- Can easily overfit the data

Naïve Distribution Estimator
- Can model only very boring distributions
- Given 100 records and 10,000 multivalued attributes, it will be fine
- Quite robust to overfitting

So far we have two simple density estimators; in other lectures we'll see vastly more impressive ones (Mixture Models, Bayesian Networks, ...). But first, why should we care about density estimation?

Joint Density Bayes Classifier

In the case of the Joint Density Bayes Classifier

Ŷ = argmax_{v_i} P(X_1, X_2, ..., X_m | Y = v_i) P(Y = v_i)

this degenerates to a very simple rule:

Ŷ = the most common value of Y among the records having X_1 = u_1, X_2 = u_2, ..., X_m = u_m

Important note: if no records have exactly the inputs X_1 = u_1, X_2 = u_2, ..., X_m = u_m, then P(X_1, X_2, ..., X_m | Y = v_i) = 0 for all values of Y. In that case we just have to guess the value of Y!
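
With per-class joint densities estimated by counting, the MAP rule above is equivalent to a majority vote among exactly matching training records; a minimal sketch (the function name is illustrative):

```python
from collections import Counter

def joint_bayes_predict(X, y, query, default=None):
    """Most common Y among the records whose inputs exactly equal `query`;
    falls back to `default` (a guess) when no record matches."""
    matches = [label for row, label in zip(X, y) if tuple(row) == tuple(query)]
    if not matches:
        return default
    return Counter(matches).most_common(1)[0][0]
```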

Naïve Bayes Classifier

In the case of the Naïve Bayes Classifier,

Ŷ = argmax_{v_i} P(X_1, X_2, ..., X_m | Y = v_i) P(Y = v_i)

can be simplified into:

Ŷ = argmax_{v_i} P(Y = v_i) ∏_{j=1}^{m} P(X_j = u_j | Y = v_i)

Technical hint: when we have 10,000 input attributes the product will underflow in floating-point math, so we should use logs:

Ŷ = argmax_{v_i} [ log P(Y = v_i) + Σ_{j=1}^{m} log P(X_j = u_j | Y = v_i) ]
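
A minimal sketch, assuming nominal attributes and the counting estimates described earlier (function and variable names are illustrative):

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(X, y):
    """Counting estimates: class counts and per-class, per-attribute value counts."""
    class_counts = Counter(y)
    value_counts = defaultdict(Counter)            # value_counts[(v, j)][u] = count
    for row, v in zip(X, y):
        for j, u in enumerate(row):
            value_counts[(v, j)][u] += 1
    return class_counts, value_counts

def predict_naive_bayes(class_counts, value_counts, query):
    """argmax_v [ log P(Y=v) + sum_j log P(X_j = u_j | Y=v) ], logs to avoid underflow."""
    n = sum(class_counts.values())
    def score(v):
        s = math.log(class_counts[v] / n)
        for j, u in enumerate(query):
            p = value_counts[(v, j)][u] / class_counts[v]
            s += math.log(p) if p > 0 else float("-inf")   # zero-frequency problem, see later
        return s
    return max(class_counts, key=score)
```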

The Example: Is this a nice day to play golf?

Outlook   Temp  Humid.  Windy  Play
sunny     85    85      false  No
sunny     80    90      true   No
overcast  83    78      false  Yes
rain      70    96      false  Yes
rain      68    80      false  Yes
rain      65    70      true   No
overcast  64    65      true   Yes
sunny     72    95      false  No
sunny     69    70      false  Yes
rain      75    80      false  Yes
sunny     75    70      true   Yes
overcast  72    90      true   Yes
overcast  81    75      false  Yes
rain      71    80      true   No

Counting attribute values per class (temperature and humidity discretized to nominal values):

Attribute  Value     Play     Don't Play
Outlook    sunny     2 (2/9)  3 (3/5)
           overcast  4 (4/9)  0 (0/5)
           rain      3 (3/9)  2 (2/5)
Temp.      hot       2 (2/9)  2 (2/5)
           mild      4 (4/9)  2 (2/5)
           cool      3 (3/9)  1 (1/5)
Humid.     high      3 (3/9)  4 (4/5)
           normal    6 (6/9)  1 (1/5)
Windy      true      3 (3/9)  3 (3/5)
           false     6 (6/9)  2 (2/5)

Play = 9 (9/14)    Don't Play = 5 (5/14)

The Example: A brand new day

You wake up and gain some new evidence about the day:

Outlook  Temp  Humid.  Windy  Play
sunny    cool  high    true   ???

Should you play golf or not? Let's ask our Naïve Bayes Classifier!

P(Evidence | yes) = P(sunny | yes) P(cool | yes) P(high | yes) P(true | yes)
                  = 2/9 · 3/9 · 3/9 · 3/9 = 0.00823
P(Evidence | no)  = P(sunny | no) P(cool | no) P(high | no) P(true | no)
                  = 3/5 · 1/5 · 4/5 · 3/5 = 0.0576

P(Play = yes | Evidence) = P(Evidence | yes) P(yes) / Σ_class P(Evidence | class) P(class) = 0.205
P(Play = no | Evidence)  = P(Evidence | no) P(no) / Σ_class P(Evidence | class) P(class) = 0.795
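
As a quick numerical check (not on the original slides), these posteriors can be reproduced directly from the counts:

```python
# Likelihoods from the count table, priors from the class frequencies.
p_e_yes = (2/9) * (3/9) * (3/9) * (3/9)      # P(Evidence | yes) ≈ 0.00823
p_e_no  = (3/5) * (1/5) * (4/5) * (3/5)      # P(Evidence | no)  = 0.0576
p_yes, p_no = 9/14, 5/14

evidence = p_e_yes * p_yes + p_e_no * p_no   # normalizing constant
print(p_e_yes * p_yes / evidence)            # P(yes | Evidence) ≈ 0.205
print(p_e_no * p_no / evidence)              # P(no  | Evidence) ≈ 0.795
```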

Missing Values (still "A brand new day")

You wake up and gain some new, partial evidence about the day:

Outlook  Temp  Humid.  Windy  Play
???      cool  high    true   ???

The sum over the possible Outlook values equals 1, so the missing attribute simply drops out of the product:

P(Evidence | yes) = [Σ_o P(Outlook = o | yes)] P(cool | yes) P(high | yes) P(true | yes)
                  = 3/9 · 3/9 · 3/9 = 0.037
P(Evidence | no)  = [Σ_o P(Outlook = o | no)] P(cool | no) P(high | no) P(true | no)
                  = 1/5 · 4/5 · 3/5 = 0.096

P(yes | Evidence) = P(Evidence | yes) P(yes) / Σ_class P(Evidence | class) P(class) = 0.41
P(no | Evidence)  = P(Evidence | no) P(no) / Σ_class P(Evidence | class) P(class) = 0.59
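
The same check for the partial-evidence case, with the Outlook factor omitted because it sums to one over its possible values:

```python
p_e_yes = (3/9) * (3/9) * (3/9)              # P(Evidence | yes) ≈ 0.037
p_e_no  = (1/5) * (4/5) * (3/5)              # P(Evidence | no)  = 0.096
p_yes, p_no = 9/14, 5/14

evidence = p_e_yes * p_yes + p_e_no * p_no
print(p_e_yes * p_yes / evidence)            # P(yes | Evidence) ≈ 0.41
print(p_e_no * p_no / evidence)              # P(no  | Evidence) ≈ 0.59
```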

The Zero Frequency Problem

What if an attribute value doesn't occur for some class value (e.g., Outlook = overcast for class no)? Its estimated probability will be zero! And no matter how likely the other values are, the a-posteriori probability will be zero too:

P(Outlook = overcast | no) = 0  ⇒  P(no | Evidence) = 0

The solution is related to something called a "smoothing prior": add 1 to the count of every attribute value-class combination. This simple approach (the Laplace estimator) solves the problem and stabilizes the probability estimates!

We can also do fancier things...
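
A minimal sketch of the Laplace estimator for a single attribute; adding 1 to every value-class count also adds k (the number of possible values) to the class total, so the smoothed estimates still sum to 1:

```python
def laplace_estimate(count_value_and_class, count_class, n_values):
    """P(attribute = value | class) with add-one (Laplace) smoothing:
    add 1 to every value-class count, so the class total grows by the
    number of possible values and the estimates still sum to 1."""
    return (count_value_and_class + 1) / (count_class + n_values)

# E.g., Outlook = overcast given class "no": 0/5 unsmoothed, but
print(laplace_estimate(0, 5, 3))   # (0 + 1) / (5 + 3) = 0.125, no longer zero
```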

M-Estimate Probability

We can use the m-estimate of probability to estimate P(Attribute | Class):

P(A | C) = (n_A + m·p) / (n + m)

- n_A: number of examples with class C and attribute value A
- n: number of examples with class C
- p = 1/k, where k is the number of possible values of attribute A
- m: a constant value

Example: for the Outlook attribute (class yes) we get:

sunny = (2 + m/3) / (9 + m),  overcast = (4 + m/3) / (9 + m),  rain = (3 + m/3) / (9 + m).

We can also use weights p_1, p_2, ..., p_k summing up to 1:

sunny = (2 + m·p_1) / (9 + m),  overcast = (4 + m·p_2) / (9 + m),  rain = (3 + m·p_3) / (9 + m).
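
A minimal sketch of the m-estimate; note that with m = k and uniform p = 1/k it reduces to the Laplace estimator above:

```python
def m_estimate(n_a, n, m, p):
    """P(A | C) = (n_A + m*p) / (n + m): counts augmented with m virtual
    examples distributed according to the prior probability p."""
    return (n_a + m * p) / (n + m)

# Outlook given class "yes" (counts 2, 4, 3 out of 9), uniform prior p = 1/3, m = 3:
for value, n_a in [("sunny", 2), ("overcast", 4), ("rain", 3)]:
    print(value, m_estimate(n_a, 9, 3, 1/3))   # (n_A + 1) / 12
```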