Naïve Bayes, MIT 15.097 Course Notes, Cynthia Rudin

Thanks to Şeyda Ertekin. Credit: Ng, Mitchell.

The Naïve Bayes algorithm comes from a generative model. There is an important distinction between generative and discriminative models. In all cases, we want to predict the label y, given x, that is, we want P(Y = y | X = x). Throughout the paper, we'll remember that the probability distribution for measure P is over an unknown distribution over X × Y.

Naïve Bayes

Generative Model: Estimate P(X = x | Y = y) and P(Y = y) and use Bayes' rule to get P(Y = y | X = x).

Discriminative Model: Directly estimate P(Y = y | X = x).

Most of the top 10 classification algorithms are discriminative (K-NN, CART, C4.5, SVM, AdaBoost). For Naïve Bayes, we make an assumption that if we know the class label y, then we know the mechanism (the random process) of how x is generated. Naïve Bayes is great for very high dimensional problems because it makes a very strong assumption. Very high dimensional problems suffer from the curse of dimensionality: it is difficult to understand what is going on in a high dimensional space without tons of data.

Example: Constructing a spam filter. Each example is an email, and each dimension j of the vector x represents the presence of a word.

        ( 1 )   a
        ( 0 )   aardvark
        ( 0 )   aardwolf
x  =    ( ... )
        ( 1 )   buy
        ( ... )
        ( 0 )   zyxt

This x represents an email containing the words "a" and "buy", but not "aardvark" or "zyxt". The size of the vocabulary could be 50,000 words, so we are in a 50,000 dimensional space.

Naïve Bayes makes the assumption that the x^(j)'s are conditionally independent given y. Say y = 1 means spam email, word 2,087 is "buy", and word 39,831 is "price". Naïve Bayes assumes that if y = 1 (it is spam), then knowing x^(2,087) = 1 (the email contains "buy") won't affect your belief about x^(39,831) (the email contains "price"). Note: this does not mean x^(2,087) and x^(39,831) are independent, that is,

P(X^(2,087) = x^(2,087)) = P(X^(2,087) = x^(2,087) | X^(39,831) = x^(39,831)).

It only means they are conditionally independent given y.

Using the definition of conditional probability recursively,

P(X^(1) = x^(1), ..., X^(50,000) = x^(50,000) | Y = y)
    = P(X^(1) = x^(1) | Y = y) P(X^(2) = x^(2) | Y = y, X^(1) = x^(1))
      P(X^(3) = x^(3) | Y = y, X^(1) = x^(1), X^(2) = x^(2)) ...
      P(X^(50,000) = x^(50,000) | Y = y, X^(1) = x^(1), ..., X^(49,999) = x^(49,999)).

The independence assumption gives:

P(X^(1) = x^(1), ..., X^(n) = x^(n) | Y = y)
    = P(X^(1) = x^(1) | Y = y) P(X^(2) = x^(2) | Y = y) ... P(X^(n) = x^(n) | Y = y)
    = ∏_j P(X^(j) = x^(j) | Y = y).        (1)
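To make the representation and equation (1) concrete, here is a minimal sketch, not from the notes: the small vocabulary, the tokenizer, and the per-word probabilities are hypothetical stand-ins for the real 50,000-word setup.

    # Sketch: build the binary word-presence vector and evaluate the product in equation (1).
    # The 7-word vocabulary and the probabilities below are made up for illustration.
    vocabulary = ["a", "aardvark", "aardwolf", "buy", "price", "tomato", "zyxt"]

    def email_to_vector(email_text):
        """x^(j) = 1 if vocabulary word j appears in the email, else 0."""
        words_present = set(email_text.lower().split())
        return [1 if word in words_present else 0 for word in vocabulary]

    def prob_x_given_y(x, p_word_given_y):
        """Equation (1): prod_j P(X^(j) = x^(j) | Y = y), where
        P(X^(j) = 0 | Y = y) = 1 - P(X^(j) = 1 | Y = y)."""
        prob = 1.0
        for xj, pj in zip(x, p_word_given_y):
            prob *= pj if xj == 1 else 1.0 - pj
        return prob

    # Hypothetical estimates of P(X^(j) = 1 | Y = spam), one per vocabulary word.
    p_word_given_spam = [0.60, 0.01, 0.01, 0.40, 0.35, 0.02, 0.01]

    x = email_to_vector("buy at a great price")
    print(x)                                     # [1, 0, 0, 1, 1, 0, 0]
    print(prob_x_given_y(x, p_word_given_spam))  # about 0.080 with these made-up numbers

As a practical aside (not discussed in the notes), with 50,000 factors this product would underflow floating point, so an implementation would typically sum logarithms of the conditionals instead.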

Bayes' rule says

P(Y = y | X^(1) = x^(1), ..., X^(n) = x^(n)) = P(Y = y) P(X^(1) = x^(1), ..., X^(n) = x^(n) | Y = y) / P(X^(1) = x^(1), ..., X^(n) = x^(n)),

so plugging in (1), we have

P(Y = y | X^(1) = x^(1), ..., X^(n) = x^(n)) = P(Y = y) ∏_j P(X^(j) = x^(j) | Y = y) / P(X^(1) = x^(1), ..., X^(n) = x^(n)).

For a new test instance, called x_test, we want to choose the most probable value of y, that is,

y_NB ∈ argmax_y P(Y = y | X^(1) = x_test^(1), ..., X^(n) = x_test^(n))
     = argmax_y P(Y = y) ∏_j P(X^(j) = x_test^(j) | Y = y) / P(X^(1) = x_test^(1), ..., X^(n) = x_test^(n))
     = argmax_y P(Y = y) ∏_j P(X^(j) = x_test^(j) | Y = y),

where the denominator can be dropped because it does not depend on y.

So now, we just need P(Y = y) for each possible y, and P(X^(j) = x_test^(j) | Y = y) for each j and y. Of course we can't compute those. Let's use the empirical probability estimates:

P̂(Y = 1) = (1/m) Σ_i 1[y_i = 1] = fraction of data where the label is 1,

P̂(X^(j) = x_test^(j) | Y = 1) = Σ_i 1[x_i^(j) = x_test^(j), y_i = 1] / Σ_i 1[y_i = 1] = Conf(Y = 1 → X^(j) = x_test^(j)).

That is the simplest version of Naïve Bayes:

y_NB ∈ argmax_y P̂(Y = y) ∏_j P̂(X^(j) = x_test^(j) | Y = y).

There could potentially be a problem that most of the conditional probabilities are 0, because the dimensionality of the data is very high compared to the amount of data. This causes a problem because if even one P̂(X^(j) = x_test^(j) | Y = y) is zero, then the whole right side is zero. In other words, if no training examples from class "spam" have the word "tomato", we'd never classify a test example containing the word "tomato" as spam!
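Here is a minimal sketch of this simplest version, assuming binary labels y_i in {0, 1} and binary word-presence features; the tiny training arrays are hypothetical. It estimates P̂(Y = y) and each P̂(X^(j) = 1 | Y = y) by counting, then classifies by the argmax.

    def train_naive_bayes(X, y):
        """Empirical (unsmoothed) estimates.
        Returns (prior, cond) with prior[c] = P̂(Y = c) and cond[c][j] = P̂(X^(j) = 1 | Y = c)."""
        m, n = len(y), len(X[0])
        prior = {c: sum(1 for yi in y if yi == c) / m for c in (0, 1)}
        cond = {}
        for c in (0, 1):
            rows = [xi for xi, yi in zip(X, y) if yi == c]
            cond[c] = [sum(r[j] for r in rows) / len(rows) for j in range(n)]
        return prior, cond

    def predict(x_test, prior, cond):
        """y_NB in argmax_y P̂(Y = y) * prod_j P̂(X^(j) = x_test^(j) | Y = y)."""
        scores = {}
        for c in (0, 1):
            score = prior[c]
            for j, xj in enumerate(x_test):
                score *= cond[c][j] if xj == 1 else 1.0 - cond[c][j]
            scores[c] = score
        return max(scores, key=scores.get)

    # Toy data: 4 emails over a 3-word vocabulary; y = 1 means spam.
    X = [[1, 1, 0], [1, 0, 1], [0, 1, 0], [0, 0, 1]]
    y = [1, 1, 0, 0]
    prior, cond = train_naive_bayes(X, y)
    print(predict([1, 1, 0], prior, cond))  # 1 (spam) on this toy data

Note that in this toy data the first vocabulary word never appears in a non-spam email, so its estimated conditional probability given Y = 0 is zero and the non-spam score of any test email containing it collapses to zero; this is exactly the zero-probability problem described above with the word "tomato".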

To avoid this, we (sort of) set the probabilities to a small positive value when there are no data. In particular, we use a Bayesian shrinkage estimate of P(X^(j) = x_test^(j) | Y = 1) where we add some hallucinated examples. There are K hallucinated examples spread evenly over the possible values of X^(j), where K is the number of distinct values of X^(j). The probabilities are pulled toward 1/K. So, now we replace:

P̂(X^(j) = x_test^(j) | Y = 1) = ( Σ_i 1[x_i^(j) = x_test^(j), y_i = 1] + 1 ) / ( Σ_i 1[y_i = 1] + K )

P̂(Y = 1) = ( Σ_i 1[y_i = 1] + 1 ) / ( m + K )

This is called Laplace smoothing. The smoothing for P(Y = 1) is probably unnecessary and has little to no effect.

Naïve Bayes is not necessarily the best algorithm, but it is a good first thing to try, and it performs surprisingly well given its simplicity! There are extensions to continuous data and other variations too.
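A sketch of how Laplace smoothing changes the counting step in the earlier sketch, under the same hypothetical setup (binary presence features, so K = 2 distinct values per X^(j)):

    def train_naive_bayes_smoothed(X, y, K=2):
        """Laplace-smoothed estimates: add one hallucinated example per distinct feature value.
        K is the number of distinct values of X^(j) (K = 2 for binary presence features)."""
        m, n = len(y), len(X[0])
        prior = {c: (sum(1 for yi in y if yi == c) + 1) / (m + K) for c in (0, 1)}
        cond = {}
        for c in (0, 1):
            rows = [xi for xi, yi in zip(X, y) if yi == c]
            cond[c] = [(sum(r[j] for r in rows) + 1) / (len(rows) + K) for j in range(n)]
        return prior, cond

    # Same toy data as before: the word never seen in non-spam emails now gets
    # probability (0 + 1) / (2 + 2) = 0.25 instead of 0, so no class score is forced to zero.
    X = [[1, 1, 0], [1, 0, 1], [0, 1, 0], [0, 0, 1]]
    y = [1, 1, 0, 0]
    prior, cond = train_naive_bayes_smoothed(X, y)
    print(cond[0][0])  # 0.25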

MIT OpenCourseWare
http://ocw.mit.edu

15.097 Prediction: Machine Learning and Statistics
Spring 2012

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.