Naïve Bayes Lecture 6: Self-Study -----


Naïve Bayes. Lecture 6: Self-Study. Marina Santini. Acknowledgements: slides borrowed and adapted from Data Mining by I. H. Witten, E. Frank and M. A. Hall.

Lecture 6: Required Reading. Daumé III (2015: 5-59; 107-110); Witten et al. (2011: 90-99; 05-08; 14-15; 8-9).

Outline: Naïve Bayes; the zero-probability problem (smoothing); multinomial Naïve Bayes; discussion.

Statistical modeling. Use all the attributes. Two assumptions: attributes are (1) equally important and (2) statistically independent given the class value, i.e., knowing the value of one attribute says nothing about the value of another if the class is known. The independence assumption is never correct! But this scheme works well in practice.

Probabilities for weather data. Counts and relative frequencies estimated from the 14-instance weather dataset (shown below the table):

             Outlook          Temperature         Humidity            Windy             Play
             yes   no         yes   no            yes   no            yes   no          yes    no
Sunny         2     3   Hot    2     2    High     3     4    False    6     2           9      5
Overcast      4     0   Mild   4     2    Normal   6     1    True     3     3
Rainy         3     2   Cool   3     1
Sunny        2/9   3/5  Hot   2/9   2/5   High    3/9   4/5   False   6/9   2/5         9/14   5/14
Overcast     4/9   0/5  Mild  4/9   2/5   Normal  6/9   1/5   True    3/9   3/5
Rainy        3/9   2/5  Cool  3/9   1/5

The weather data:

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

Probabilities for weather data (same table as above). A new day:

Outlook = Sunny, Temp. = Cool, Humidity = High, Windy = True, Play = ?

Likelihood of the two classes:
For yes = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
For no = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
Conversion into a probability by normalization:
P(yes) = 0.0053 / (0.0053 + 0.0206) = 0.205
P(no) = 0.0206 / (0.0053 + 0.0206) = 0.795
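The calculation above can be reproduced in a few lines of Python. This is a minimal sketch, not code from the lecture; the counts are typed in by hand from the table above and the variable names (class_counts, cond_counts, new_day) are ours:

```python
# Naive Bayes on the weather data, using the counts from the table above.
class_counts = {"yes": 9, "no": 5}
cond_counts = {
    "yes": {"Outlook=Sunny": 2, "Temp=Cool": 3, "Humidity=High": 3, "Windy=True": 3},
    "no":  {"Outlook=Sunny": 3, "Temp=Cool": 1, "Humidity=High": 4, "Windy=True": 3},
}
n_total = sum(class_counts.values())  # 14 instances

new_day = ["Outlook=Sunny", "Temp=Cool", "Humidity=High", "Windy=True"]

# Likelihood of each class: prior times the product of conditional probabilities.
likelihood = {}
for c, n_c in class_counts.items():
    score = n_c / n_total
    for attr_value in new_day:
        score *= cond_counts[c][attr_value] / n_c
    likelihood[c] = score

# Normalize to obtain posterior probabilities.
total = sum(likelihood.values())
for c, score in likelihood.items():
    print(c, round(score, 4), round(score / total, 3))
# Prints approximately: yes 0.0053 0.205 / no 0.0206 0.795
```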

Bayes's rule. Probability of event H given evidence E:

Pr[H | E] = Pr[E | H] Pr[H] / Pr[E]

A priori probability of H, Pr[H]: the probability of the event before the evidence is seen. A posteriori probability of H, Pr[H | E]: the probability of the event after the evidence is seen. (Thomas Bayes, born 1702 in London, England; died 1761 in Tunbridge Wells, Kent, England.)

Naïve Bayes for classification. Classification learning: what is the probability of the class given an instance? Evidence E = the instance; event H = the class value for the instance. Naïve assumption: the evidence splits into parts (i.e. the attributes) that are independent given the class, so that

Pr[H | E] = Pr[E1 | H] × Pr[E2 | H] × ... × Pr[En | H] × Pr[H] / Pr[E]

Weather data example. Evidence E: Outlook = Sunny, Temp. = Cool, Humidity = High, Windy = True, Play = ?

Probability of class yes:

Pr[yes | E] = Pr[Outlook = Sunny | yes] × Pr[Temperature = Cool | yes] × Pr[Humidity = High | yes] × Pr[Windy = True | yes] × Pr[yes] / Pr[E]
            = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / Pr[E]

The zero-frequency problem. What if an attribute value doesn't occur with every class value (e.g. Humidity = High for class yes)? The estimated probability will be zero: Pr[Humidity = High | yes] = 0. The a posteriori probability will then also be zero: Pr[yes | E] = 0, no matter how likely the other values are! Remedy: add 1 to the count for every attribute value-class combination (the Laplace estimator). Result: probabilities will never be zero (this also stabilizes probability estimates).
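A minimal sketch of the Laplace estimator, assuming a hypothetical attribute whose counts within one class contain a zero; the value names and the function smoothed_probs are ours, not from the lecture:

```python
def smoothed_probs(value_counts):
    """Add 1 to every count before turning counts into relative frequencies."""
    total = sum(value_counts.values()) + len(value_counts)  # one extra count per value
    return {v: (n + 1) / total for v, n in value_counts.items()}

# Hypothetical counts for one attribute within class "yes"; "High" never occurs.
counts_for_yes = {"High": 0, "Normal": 6, "Low": 3}
print(smoothed_probs(counts_for_yes))
# {'High': 0.0833..., 'Normal': 0.5833..., 'Low': 0.3333...} -- no zero estimates
```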

Modified probability estimates. In some cases adding a constant different from 1 might be more appropriate. Example: attribute Outlook for class yes, with a constant µ spread evenly over the three values:

Sunny: (2 + µ/3) / (9 + µ)     Overcast: (4 + µ/3) / (9 + µ)     Rainy: (3 + µ/3) / (9 + µ)

The weights don't need to be equal, but they must sum to 1:

Sunny: (2 + µ p1) / (9 + µ)     Overcast: (4 + µ p2) / (9 + µ)     Rainy: (3 + µ p3) / (9 + µ)     with p1 + p2 + p3 = 1
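A minimal sketch of this weighted estimate (often called an m-estimate); the function name m_estimate and the choice µ = 3 with uniform weights are ours:

```python
def m_estimate(count, n_class, mu, prior_weight):
    """(count + mu * prior_weight) / (n_class + mu)"""
    return (count + mu * prior_weight) / (n_class + mu)

# Outlook counts within class "yes": 2 Sunny, 4 Overcast, 3 Rainy (9 instances).
counts = {"Sunny": 2, "Overcast": 4, "Rainy": 3}
mu = 3.0
weights = {"Sunny": 1/3, "Overcast": 1/3, "Rainy": 1/3}  # must sum to 1

for value, count in counts.items():
    print(value, round(m_estimate(count, 9, mu, weights[value]), 3))
# Sunny 0.25, Overcast 0.417, Rainy 0.333 -- same as Laplace when mu = 3 and weights are uniform
```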

Missing values. Training: the instance is simply not included in the frequency count for that attribute value-class combination. Classification: the attribute is omitted from the calculation. Example (Outlook missing):

Outlook = ?, Temp. = Cool, Humidity = High, Windy = True, Play = ?

Likelihood of yes = 3/9 × 3/9 × 3/9 × 9/14 = 0.0238
Likelihood of no = 1/5 × 4/5 × 3/5 × 5/14 = 0.0343
P(yes) = 0.0238 / (0.0238 + 0.0343) = 41%
P(no) = 0.0343 / (0.0238 + 0.0343) = 59%
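A minimal sketch of classification with a missing Outlook value: its factor is simply left out of the product. The counts are those of the table above and the variable names are ours:

```python
class_counts = {"yes": 9, "no": 5}
cond_counts = {  # Outlook is missing, so only the observed attributes appear.
    "yes": {"Temp=Cool": 3, "Humidity=High": 3, "Windy=True": 3},
    "no":  {"Temp=Cool": 1, "Humidity=High": 4, "Windy=True": 3},
}
likelihood = {}
for c, n_c in class_counts.items():
    score = n_c / 14
    for count in cond_counts[c].values():
        score *= count / n_c
    likelihood[c] = score
total = sum(likelihood.values())
print({c: round(s / total, 2) for c, s in likelihood.items()})  # {'yes': 0.41, 'no': 0.59}
```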

Numeric attributes. Usual assumption: attributes have a normal (Gaussian) probability distribution given the class. The probability density function of the normal distribution is defined by two parameters, the sample mean µ and the standard deviation σ; the density function f(x) is

f(x) = 1 / (sqrt(2π) σ) · exp( -(x - µ)² / (2σ²) )

Statistics for weather data:

Outlook: Sunny 2 yes / 3 no, Overcast 4/0, Rainy 3/2; probabilities 2/9 and 3/5, 4/9 and 0/5, 3/9 and 2/5.
Temperature: yes values 83, 70, 68, 64, 69, 75, 75, 72, 81 (µ = 73, σ = 6.2); no values 85, 80, 65, 72, 71 (µ = 75, σ = 7.9).
Humidity: yes values 86, 96, 80, 65, 70, 80, 70, 90, 75 (µ = 79, σ = 10.2); no values 85, 90, 70, 95, 91 (µ = 86, σ = 9.7).
Windy: False 6/9 and 2/5, True 3/9 and 3/5.
Play: 9/14 yes, 5/14 no.

Example density value:
f(temperature = 66 | yes) = 1 / (sqrt(2π) · 6.2) · exp( -(66 - 73)² / (2 · 6.2²) ) = 0.0340

Classifying a new day. A new day: Outlook = Sunny, Temp. = 66, Humidity = 90, Windy = true, Play = ?

Likelihood of yes = 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 = 0.000036
Likelihood of no = 3/5 × 0.0221 × 0.0381 × 3/5 × 5/14 = 0.000108
P(yes) = 0.000036 / (0.000036 + 0.000108) = 25%
P(no) = 0.000108 / (0.000036 + 0.000108) = 75%

Missing values during training are not included in the calculation of the mean and standard deviation.
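A minimal sketch of this Gaussian naïve Bayes calculation; the means and standard deviations are typed in by hand from the statistics above (using the unrounded values 74.6, 79.1 and 86.2), and all names are ours:

```python
import math

def gaussian(x, mu, sigma):
    """Normal density f(x) with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

priors = {"yes": 9 / 14, "no": 5 / 14}
# Nominal attributes use relative frequencies; numeric ones use Gaussian densities.
outlook_sunny = {"yes": 2 / 9, "no": 3 / 5}
windy_true = {"yes": 3 / 9, "no": 3 / 5}
temp_params = {"yes": (73.0, 6.2), "no": (74.6, 7.9)}       # (mean, std) per class
humidity_params = {"yes": (79.1, 10.2), "no": (86.2, 9.7)}

scores = {}
for c in priors:
    scores[c] = (priors[c]
                 * outlook_sunny[c]
                 * gaussian(66, *temp_params[c])
                 * gaussian(90, *humidity_params[c])
                 * windy_true[c])

total = sum(scores.values())
print({c: round(s / total, 2) for c, s in scores.items()})
# Roughly {'yes': 0.21, 'no': 0.79}; the slide's rounded densities give 25% / 75%.
```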

Probability densities. Relationship between probability and density for a small interval of width ε around a value c:

Pr[c - ε/2 ≤ x ≤ c + ε/2] ≈ ε · f(c)

But this doesn't change the calculation of a posteriori probabilities, because ε cancels out during normalization. Exact relationship:

Pr[a ≤ x ≤ b] = ∫_a^b f(t) dt

Multinomial naïve Bayes I. A version of naïve Bayes used for document classification with the bag-of-words model. Let n1, n2, ..., nk be the number of times each word i occurs in the document, and P1, P2, ..., Pk the probability of obtaining word i when sampling from all documents in class H. The probability of observing document E given class H (based on the multinomial distribution) is

Pr[E | H] ≈ N! × (P1^n1 / n1!) × (P2^n2 / n2!) × ... × (Pk^nk / nk!),  where N = n1 + n2 + ... + nk

This ignores the probability of generating a document of the right length (that probability is assumed constant for each class).

Multinomial naïve Bayes II. Suppose the dictionary has two words, yellow and blue, with Pr[yellow | H] = 75% and Pr[blue | H] = 25%, and suppose E is the document "blue yellow blue". The probability of observing the document is

Pr[E | H] ≈ 3! × (0.75^1 / 1!) × (0.25^2 / 2!) = 9/64 ≈ 0.14

Suppose there is another class H' with Pr[yellow | H'] = 10% and Pr[blue | H'] = 90%:

Pr[E | H'] ≈ 3! × (0.1^1 / 1!) × (0.9^2 / 2!) ≈ 0.24

We need to take the prior probability of each class into account to make the final classification. The factorials don't actually need to be computed, and underflows can be prevented by using logarithms.
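A minimal sketch of this multinomial likelihood, worked in log space to avoid underflow; the function name log_multinomial_likelihood is ours, not from the lecture:

```python
import math
from collections import Counter

def log_multinomial_likelihood(doc_words, word_probs):
    """log Pr[E | H] = log N! + sum_i (n_i * log P_i - log n_i!)"""
    counts = Counter(doc_words)
    n_total = sum(counts.values())
    log_p = math.lgamma(n_total + 1)            # log N!
    for word, n in counts.items():
        log_p += n * math.log(word_probs[word]) - math.lgamma(n + 1)
    return log_p

doc = ["blue", "yellow", "blue"]
print(round(math.exp(log_multinomial_likelihood(doc, {"yellow": 0.75, "blue": 0.25})), 3))  # 0.141
print(round(math.exp(log_multinomial_likelihood(doc, {"yellow": 0.10, "blue": 0.90})), 3))  # 0.243
```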

Naïve Bayes: discussion. Naïve Bayes works surprisingly well, even when the independence assumption is clearly violated. Why? Because classification doesn't require accurate probability estimates as long as the maximum probability is assigned to the correct class. However, adding too many redundant attributes will cause problems (e.g. identical attributes). Note also that many numeric attributes are not normally distributed (a remedy: kernel density estimators).

The end