Naive Bayes classification


Christos Dimitrakakis

December 4, 2015

1 Introduction

One of the most important methods in machine learning and statistics is that of Bayesian inference. This is the most fundamental method of drawing conclusions from data and explicit prior assumptions. In Bayesian inference, prior assumptions are represented as probabilities on a space of hypotheses. Each hypothesis is seen as a probabilistic model of all possible data that we can see.

2 The basics of Bayesian inference

Frequently, we want to draw conclusions from data. However, the conclusions are never solely inferred from the data, but also depend on prior assumptions about reality.

Example 2.1. John claims he has psychic powers and can predict a series of coin tosses. We oblige, and throw a coin 8 times. John predicts 8 out of 8 coin tosses. The probability of him doing so by chance is $2^{-8}$. If he were a medium, as he claims, then his probability of achieving this feat would be 1. Should we believe John?

Example 2.2. Traces of DNA are found at a murder scene. We perform a DNA test against a database of $10^4$ citizens registered to be living in the area. We know that the probability of a false positive (that is, the test finding a match by mistake) is $10^{-6}$. If there is a match in the database, does that mean that the citizen was at the scene of the crime?

Answering these questions requires us to clearly define the possible hypotheses we wish to consider. Taking the first example, we can define two:

1. hypothesis $M$, that John is a medium;
2. hypothesis $\neg M$, that John is not a medium.

We can also define a probability model for the number of successful predictions that John would make in either case.
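Before formalising anything, it is worth checking the two numbers quoted above. The following minimal R sketch (not part of the original notes) reproduces them; the second computation assumes that none of the $10^4$ database entries was at the scene and that the tests are independent:

# Example 2.1: probability of predicting 8 fair coin tosses correctly by chance
p_chance <- 0.5^8                      # = 2^-8 = 0.00390625

# Example 2.2: probability of at least one false positive when testing
# 10^4 citizens (assuming all are innocent and tests are independent)
p_false_match <- 1 - (1 - 1e-6)^1e4    # approximately 0.01

c(p_chance = p_chance, p_false_match = p_false_match)

So even before any Bayesian reasoning, a match in the database is far from conclusive: about one search in a hundred produces a false match.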

Let $x_t$ be 0 if John makes an incorrect prediction at time $t$ and $x_t = 1$ if he makes a correct prediction. John's claim that he can predict our tosses perfectly means that, for a sequence of tosses $x = x_1, \ldots, x_n$,
$$
P(x \mid M) =
\begin{cases}
1, & x_t = 1 \ \forall t \in [n] \\
0, & \exists t \in [n] : x_t = 0.
\end{cases}
$$
That is, the probability of perfectly correct predictions is 1, and that of one or more incorrect predictions is 0. For the other model, we can assume that all draws are independently and identically distributed from a fair coin. Consequently, no matter what John's predictions are, we have that
$$
P(x \mid \neg M) = 2^{-n}.
$$
So, for the given example, as stated, we have the following facts:

- If John makes one or more mistakes, then $P(x \mid M) = 0$ and $P(x \mid \neg M) = 2^{-n}$. Thus, we should perhaps say that John is not a medium.
- If John makes no mistakes at all, then
  $$
  P(x \mid M) = 1, \qquad P(x \mid \neg M) = 2^{-n}. \tag{2.1}
  $$

Does that mean that we must conclude that John is a medium? What if $n = 1$? What if $n = 100$? In fact, our conclusion should somehow depend on the strength of the evidence. Should it also not depend on how likely we think it is that a medium exists? It is this latter idea that we'll try to exploit. We'd like to combine the weight of the evidence with the weight of our prior beliefs about reality. To do this, we first recall the definition of conditional probability.

2.1 Conditional probability and Bayesian inference

Let $A$ and $B$ be two events, and let $P(A)$ be the probability of event $A$ and $P(B)$ the probability of event $B$. We can think of the probability of an event as the relative size of the event in the space of probabilities. (More formally, probability is a measure: a function similar to volume, area, length and mass.) The probability of both events $A$ and $B$ happening at the same time is denoted by $P(A \cap B)$. This amounts to measuring the size of the intersection of $A$ and $B$. The basic probability laws are the following.

Axioms of probability

1. The probability of the certain event is $P(\Omega) = 1$.

2. The probability of the impossible event is $P(\emptyset) = 0$.
3. The probability of any event $A$ satisfies $1 \geq P(A) \geq 0$.
4. If $A, B$ are disjoint, i.e. $A \cap B = \emptyset$, meaning that they cannot happen at the same time, then $P(A \cup B) = P(A) + P(B)$.

Sometimes we would like to calculate the probability of some event $A$ happening given that we know that some other event $B$ has happened. For this we first need to define the idea of conditional probability.

Definition 2.1 (Conditional probability). The probability of $A$ happening if we know that $B$ has happened is defined to be
$$
P(A \mid B) \triangleq \frac{P(A \cap B)}{P(B)}.
$$
Here, the probability of any event $A$ given $B$ is defined to be the probability of the intersection of the two events divided by the probability of the second event. Rewriting this definition using the corresponding expression for $P(B \mid A)$ gives

Bayes's theorem.
$$
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}.
$$

Now let us apply this idea to our specific problem. Bayes's theorem allows us to calculate the probability of John being a medium, given the data:
$$
P(M \mid x) = \frac{P(x \mid M)\, P(M)}{P(x)},
\qquad \text{where} \qquad
P(x) = P(x \mid M)\, P(M) + P(x \mid \neg M)\, P(\neg M).
$$
The only thing left to specify is $P(M)$, the probability that John is a medium before seeing the data. This is our subjective prior belief that mediums exist and that John is one of them.

More generally, we can think of Bayesian inference as follows:

- We start with a set of mutually exclusive hypotheses $H = \{M_1, \ldots, M_k\}$.
- Each hypothesis $M$ is represented by a specific probabilistic model for any possible data $x$, that is, $P(x \mid M)$.
- For each hypothesis, we have a prior probability $P(M)$ that it is correct.
- After observing the data, we can calculate a posterior probability that the hypothesis is correct:
  $$
  P(M \mid x) = \frac{P(x \mid M)\, P(M)}{\sum_{i=1}^k P(x \mid M_i)\, P(M_i)}.
  $$
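As a concrete illustration, here is a minimal R sketch (not part of the original notes; the prior values are arbitrary assumptions) that applies this update to Example 2.1 and computes the posterior probability that John is a medium after he predicts n tosses correctly:

# Posterior that John is a medium after n correct predictions,
# using P(x | M) = 1, P(x | not M) = 2^-n and a chosen prior P(M).
posterior_medium <- function(n, prior = 1e-6) {  # prior is an assumed value
  likelihood_M    <- 1        # P(x | M): a true medium never errs
  likelihood_notM <- 2^(-n)   # P(x | not M): fair-coin guessing
  evidence <- likelihood_M * prior + likelihood_notM * (1 - prior)
  likelihood_M * prior / evidence
}

posterior_medium(8)    # still tiny: 2^-8 is weak evidence against a sceptical prior
posterior_medium(100)  # overwhelming evidence: the posterior is essentially 1

With a sceptical prior such as $10^{-6}$, eight correct predictions barely move the posterior, while a long enough run of correct predictions eventually overwhelms any fixed prior.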

Combining the prior belief with the evidence is key in this procedure. Our posterior belief can then be used as a new prior belief when we get more evidence.

3 Naive Bayes classifiers

One special case of this idea is classification, where each hypothesis corresponds to a specific class. Then, given a new example vector of data $x$, we would like to calculate the probability of the different classes $C$ given the data, $P(C \mid x)$. From Bayes's theorem, we see that we can write this as
$$
P(C \mid x) = \frac{P(x \mid C)\, P(C)}{\sum_i P(x \mid C_i)\, P(C_i)}
$$
for any class $C$. This directly gives us a method for classifying new data, as long as we have a way to obtain $P(x \mid C)$ and $P(C)$.

Calculating the prior probability of classes. A simple method is to count the number of times each class appears in the training data $D_T = ((x_t, y_t))_{t=1}^T$. Then we can set
$$
P(C) = \frac{1}{T} \sum_{t=1}^T I\{y_t = C\}.
$$

The Naive Bayes classifier uses the following model for observations, in which the individual attributes are independent of each other given the class. Thus, for example, the results of three different tests for lung cancer (stethoscope, radiography and biopsy) depend only on whether you have cancer, and not on each other.

Probability model for observations.
$$
P(x \mid C) = P(x(1), \ldots, x(n) \mid C) = \prod_{k=1}^n P(x(k) \mid C).
$$
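To make these two steps concrete, here is a minimal R sketch (not part of the original notes; the toy data frame, its column names and the test point are invented for illustration) that estimates the class priors by counting and then applies the factorized likelihood to one new example:

# Toy training data: two discrete attributes and a class label (invented example)
Training <- data.frame(
  outlook = c("sunny", "sunny", "rain", "rain", "overcast", "rain"),
  windy   = c("yes",   "no",    "no",   "yes",  "no",       "no"),
  Class   = c("stay",  "play",  "play", "stay", "play",     "play")
)

# Class priors P(C): relative frequency of each class in the training data
priors <- table(Training$Class) / nrow(Training)

# Conditional tables P(x(k) = i | C), one per attribute (no smoothing here)
cond <- lapply(c("outlook", "windy"), function(k)
  prop.table(table(Training[[k]], Training$Class), margin = 2))
names(cond) <- c("outlook", "windy")

# Factorized (naive Bayes) score for a new example x: P(C) * prod_k P(x(k) | C)
x <- list(outlook = "rain", windy = "no")
scores <- sapply(names(priors), function(C)
  priors[[C]] * prod(sapply(names(cond), function(k) cond[[k]][x[[k]], C])))
scores / sum(scores)   # posterior probabilities P(C | x)

Note how the zero count for windy = "no" in class "stay" forces that class's score to zero; this is exactly the problem that the Laplace smoothing of Remark 3.1 below addresses.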

There are two different types of models we can have: one is mostly useful for discrete attributes and the other for continuous attributes. In the first, we just need to count the number of times each feature takes each value in each class.

Discrete attribute model. Here we simply compute the fraction of times that the attribute $k$ had the value $i$ when the label was $C$. This is in fact analogous to the conditional probability definition:
$$
P(x(k) = i \mid C) = \frac{\sum_{t=1}^T I\{x_t(k) = i \wedge y_t = C\}}{\sum_{t=1}^T I\{y_t = C\}} = \frac{N_k(i, C)}{N(C)},
$$
where $N_k(i, C)$ is the number of examples in class $C$ whose $k$-th attribute has the value $i$, and $N(C)$ is the number of examples in class $C$.

Sometimes we need to be able to deal with cases where a particular attribute value never occurs for some class in the training data. In that case, that value would be given probability zero under that class, and a single such zero makes the whole product $P(x \mid C)$ vanish. To get around this problem, we add fake observations to our data. This is called Laplace smoothing.

Remark 3.1. In Laplace smoothing with constant $\lambda$, our probability model is
$$
P(x(k) = i \mid C) = \frac{\sum_{t=1}^T I\{x_t(k) = i \wedge y_t = C\} + \lambda}{\sum_{t=1}^T I\{y_t = C\} + n_k \lambda} = \frac{N_k(i, C) + \lambda}{N(C) + n_k \lambda},
$$
where $n_k$ is the number of values that the $k$-th attribute can take. The $n_k \lambda$ term in the denominator is necessary because we want
$$
\sum_{i=1}^{n_k} P(x(k) = i \mid C) = 1.
$$
(You can check that this is indeed the case as a simple exercise.)

Continuous attribute model. Here we can use a Gaussian model for each continuous dimension:
$$
P(x(k) = v \mid C) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{(v - \mu)^2}{2\sigma^2}},
$$
where $\mu$ and $\sigma^2$ are the mean and variance of the Gaussian, typically calculated from the training data as
$$
\mu = \frac{\sum_{t=1}^T x_t(k)\, I\{y_t = C\}}{\sum_{t=1}^T I\{y_t = C\}},
$$
i.e. $\mu$ is the mean of the $k$-th attribute when the label is $C$, and
$$
\sigma^2 = \frac{\sum_{t=1}^T [x_t(k) - \mu]^2\, I\{y_t = C\}}{\sum_{t=1}^T I\{y_t = C\}},
$$
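To make these estimators concrete, here is a minimal R sketch (not part of the original notes; the function names and toy vectors are invented) of the Laplace-smoothed estimate for a discrete attribute and the Gaussian estimate for a continuous one, computed on the examples of a single class $C$:

# Laplace-smoothed estimate of P(x(k) = i | C) from the attribute values
# xk_in_C of the training examples whose label is C; `values` lists all n_k
# possible values of the attribute, so unseen values still get probability mass.
laplace_estimate <- function(xk_in_C, values, lambda = 1) {
  counts <- table(factor(xk_in_C, levels = values))    # N_k(i, C) for every value i
  (counts + lambda) / (length(xk_in_C) + length(values) * lambda)
}

# Gaussian estimate of P(x(k) = v | C) for a continuous attribute:
# mean and variance computed on the class-C examples only.
gaussian_estimate <- function(v, xk_in_C) {
  mu     <- mean(xk_in_C)
  sigma2 <- mean((xk_in_C - mu)^2)    # variance as in the notes (no Bessel correction)
  dnorm(v, mean = mu, sd = sqrt(sigma2))
}

laplace_estimate(c("a", "a", "b"), values = c("a", "b", "c"))
# the unseen value "c" still receives probability 1/6
gaussian_estimate(1.0, c(0.5, 1.5, 1.0, 2.0))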

i.e. $\sigma^2$ is the variance of the $k$-th attribute when the label is $C$. Sometimes we can just fix $\sigma$ to a constant value, e.g. $\sigma = 1$.

Estimates versus true probabilities. Remember that the probabilities we get from this calculation are only estimates. We do not really know the probabilities of each observation given the classes: we are only estimating them from the data. It is also possible that our assumption about the independence of features is completely wrong.

Exercise 1. This is an exercise to get you familiar with the naiveBayes implementation in the e1071 library. First, open R and install the library e1071 by doing

> install.packages("e1071", dependencies = TRUE)

Then load the library:

> library(e1071)

Then check the documentation, either by going to http://www.inside-r.org/packages/cran/e1071/docs/naivebayes or by simply typing

> ?naiveBayes

in R. Then type the commands in the tutorial. Some explanations are given below.

The following line creates a Naive Bayes model predicting the Class variable from all the other variables.

model <- naiveBayes(Class ~ ., data = Training)

There are two ways to predict new data given our model. The first method gives us the labels as outputs:

class.predictions <- predict(model, Holdout)

The second method gives us the class probabilities as outputs:

prob.predictions <- predict(model, Holdout, type = "raw")

# We can create a contingency table from our class predictions
table(class.predictions, Y$Class)

More precisely, the command

A <- table(x, y)

gives a matrix whose entry $A_{ij}$ is equal to the number of times that $x_t = i$ and $y_t = j$. Consequently, the sum of the terms on the diagonal is the number of correctly classified examples, and the sum of the remaining terms is that of the incorrectly classified examples.

Exercise 2. The purpose of this exercise is to explore the effect of the Laplace smoothing parameter $\lambda$ on Naive Bayes classification. For this exercise use the DNA data set from the mlbench package:

data(DNA, package = "mlbench")

The class labels are stored in the column Class. Use

- data points 1-1593 for training (store them in a variable called Training),
- data points 1594-2124 for holdout (store them in a variable called Holdout),
- data points 2125-3186 for testing (store them in a variable called Testing).

Then do the following:

1. Use a loop to go through the parameters $\lambda \in \{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10\}$, and for each value train a model using

   model <- naiveBayes(Class ~ ., data = Training, laplace = lambda)

   and measure the classification error of the model on the Holdout data using predict. Plot the error for all the different values of lambda and save the plot in a PDF file using the pdf command. (A minimal sketch of this loop is given at the end of these notes.)

2. Find the best value (with the lowest error) for $\lambda$ on the Holdout set.

3. Test the accuracy of the model with the best $\lambda$ on the Test set.

Then submit the following items in a private message on Piazza, using the tag [hw4]:

1. Your R code, in a single R file named MyNameBayes.R, which I should be able to run directly with source("MyNameBayes.R") to produce:
   (a) the holdout classification error for the different lambdas, in a plot (you can generate PDF plots with the pdf command; see also my example: knnexample.r);
   (b) the value of the best $\lambda$, and the resulting test classification error.

2. An answer to the following questions:
   (a) How does the testing error compare to the training and holdout error? Why do you think this is the case?
   (b) What do you think are the advantages and disadvantages of Naive Bayes over KNN and decision trees?
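For reference, here is a minimal sketch of the kind of loop Exercise 2 asks for. It is only an outline, not the official solution: it assumes Training, Holdout and Testing have been created as described above, it measures the error via a contingency table as in Exercise 1, and the output file name is hypothetical.

library(e1071)
data(DNA, package = "mlbench")
Training <- DNA[1:1593, ]
Holdout  <- DNA[1594:2124, ]
Testing  <- DNA[2125:3186, ]

lambdas <- 0:10
holdout.error <- numeric(length(lambdas))
for (i in seq_along(lambdas)) {
  model <- naiveBayes(Class ~ ., data = Training, laplace = lambdas[i])
  predictions <- predict(model, Holdout)
  A <- table(predictions, Holdout$Class)         # contingency table, as in Exercise 1
  holdout.error[i] <- 1 - sum(diag(A)) / sum(A)  # misclassification rate
}

pdf("holdout_error.pdf")                         # hypothetical output file name
plot(lambdas, holdout.error, type = "b",
     xlab = "lambda", ylab = "holdout error")
dev.off()

best.lambda <- lambdas[which.min(holdout.error)]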