Learning from Data 1 : Naive Bayes

David Barber
dbarber@anc.ed.ac.uk
course page : http://anc.ed.ac.uk/ dbarber/lfd1/lfd1.html

© David Barber 2001, 2002

1 Why Naive Bayes?

Naive Bayes is one of the simplest density estimation methods from which we can form one of the standard classification methods in machine learning. Its fame is partly due to the following properties:

- Very easy to program and intuitive
- Fast to train and to use as a classifier
- Very easy to deal with missing attributes
- Very popular in certain fields such as computational linguistics/NLP

2 Understanding Conditional Independence

However, despite the simplicity of Naive Bayes, there are some pitfalls that need to be avoided, as we will describe. The pitfalls usually made are due to a poor understanding of the central assumption behind Naive Bayes, namely conditional independence. Before we explain how to use conditional independence to form a classifier, we concentrate on explaining the basic assumption of conditional independence itself.

Consider a general probability distribution of two variables, p(x_1, x_2). Using Bayes' rule, without loss of generality, we can write

p(x_1, x_2) = p(x_1 | x_2) p(x_2)    (2.1)

Similarly, if we had another class variable c, we can write, using Bayes' rule:

p(x_1, x_2 | c) = p(x_1 | x_2, c) p(x_2 | c)    (2.2)

In the above expression, we have not made any assumptions at all. Consider now the term p(x_1 | x_2, c). If knowledge of c is sufficient to determine how x_1 will be distributed, we don't need to know the state of x_2. That is, we may write p(x_1 | x_2, c) = p(x_1 | c). For example, we may write the general statement

p(cloudy, windy | storm) = p(cloudy | windy, storm) p(windy | storm)    (2.3)

where, for example, each of the variables can take the values yes or no, and now further make the assumption p(cloudy | windy, storm) = p(cloudy | storm), so that the distribution becomes

p(cloudy, windy | storm) = p(cloudy | storm) p(windy | storm)    (2.4)

We can generalise the situation of two variables to a conditional independence assumption for a set of variables x_1, ..., x_N, conditional on another variable c:

p(x | c) = ∏_{i=1}^{N} p(x_i | c)    (2.5)

A further example may help to clarify the assumptions behind conditional independence. EasySell.com considers that its customers conveniently fall into two groups: the young or the old. Based on only this information, they build general customer profiles for product preferences. EasySell.com assumes that, given the knowledge that a customer is either young or old, this is sufficient to determine whether or not a customer will like a product, independent of their likes or dislikes for any other products. Thus, given that a customer is young, she has a 95% chance to like Radio1, a 5% chance to like Radio2, a 2% chance to like Radio3 and a 20% chance to like Radio4. Similarly, they model that an old customer has a 3% chance to like Radio1, an 82% chance to like Radio2, a 34% chance to like Radio3 and a 92% chance to like Radio4. Mathematically, we would write

p(R1, R2, R3, R4 | age) = p(R1 | age) p(R2 | age) p(R3 | age) p(R4 | age)    (2.6)

where each of the variables R1, R2, R3, R4 can take the values either like or dislike, and the age variable can take the value either young or old. Thus the information about the age of the customer is so powerful that it determines the individual product preferences without needing to know anything else. Clearly, this is a rather strong assumption, but a popular one, and it sometimes leads to surprisingly good results.

In this chapter, we will take the conditioning variable to represent the class of the datapoint x. Coupled with a suitable choice for the conditional distribution p(x_i | c), we can then use Bayes' rule to form a classifier. We will consider two cases of different conditional distributions, one appropriate for discrete data and the other for continuous data. Furthermore, we will demonstrate how to learn any free parameters of these models.
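
As a quick illustration of equation (2.6), the Matlab lines below (not part of the original notes) compute the probability of one particular preference pattern for a young customer, using the numbers quoted above; the variable names pLikeYoung and likes are invented for this sketch. The same product-of-Bernoullis expression reappears in the classifier code of section 3.

% Probability that a young customer likes Radio1 and Radio3 but dislikes Radio2 and Radio4,
% under the conditional independence assumption of equation (2.6).
pLikeYoung = [0.95 0.05 0.02 0.20];   % p(Ri = like | age = young), from the text
likes      = [1 0 1 0];               % the preference pattern we are interested in
p = prod(pLikeYoung.^likes .* (1-pLikeYoung).^(1-likes))  % = p(R1,R2,R3,R4 | young)
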

3 Are they Scottish?

Consider the following vector of attributes:

(likes shortbread, likes lager, drinks whiskey, eats porridge, watched England play football)^T    (3.1)

A vector x = (1, 0, 1, 1, 0)^T would describe that a person likes shortbread, does not like lager, drinks whiskey, eats porridge, and has not watched England play football. Together with each vector x^µ, there is a class label describing the nationality of the person: Scottish or English. We wish to classify a new vector x = (1, 0, 1, 1, 0)^T as either Scottish or English. We can use Bayes' rule to calculate the probability that x is Scottish or English:

p(S | x) = p(x | S) p(S) / p(x)    (3.2)
p(E | x) = p(x | E) p(E) / p(x)    (3.3)

Since we must have p(S | x) + p(E | x) = 1, we could also write

p(S | x) = p(x | S) p(S) / ( p(x | S) p(S) + p(x | E) p(E) )    (3.4)

It is straightforward to show that the prior class probability p(S) is simply given by the fraction of people in the database that are Scottish, and similarly p(E) is given by the fraction of people in the database that are English. What about p(x | S)? This is where our density model for x comes in. In the previous chapter, we looked at using a Gaussian distribution. Here we will make a different, very strong conditional independence assumption:

p(x | S) = p(x_1 | S) p(x_2 | S) ... p(x_5 | S)    (3.5)

What this assumption means is that, knowing whether or not someone is Scottish, we don't need to know anything else to calculate the probability of their likes and dislikes. Matlab code to implement Naive Bayes on a small dataset is written below, where each column of the datasets represents a (column) vector of attributes of the form of equation (3.1).

% Naive Bayes using Bernoulli Distribution
xE=[0 1 1 1 0 0;   % english
    0 0 1 1 1 0;
    1 1 0 0 0 0;
    1 1 0 0 0 1;
    1 0 1 0 1 0];

xS=[1 1 1 1 1 1 1; % scottish
    0 1 1 1 1 0 0;
    0 0 1 0 0 1 1;
    1 0 1 1 1 1 0;
    1 1 0 0 1 0 0];

pE = size(xE,2)/(size(xE,2) + size(xS,2)); pS = 1-pE; % ML class priors pE = p(c=E), pS = p(c=S)

mE = mean(xE')'; % ML estimates of p(x=1|c=E)
mS = mean(xS')'; % ML estimates of p(x=1|c=S)

x=[1 0 1 1 0]';  % test point

npE = pE*prod(mE.^x.*(1-mE).^(1-x)); % p(x,c=E)
npS = pS*prod(mS.^x.*(1-mS).^(1-x)); % p(x,c=S)

pxE = npE/(npE+npS) % probability that x is english

3.1 Further Issues

Based on the training data in the code above, we have the following:

p(x_1 = 1 | E) = 1/2, p(x_2 = 1 | E) = 1/2, p(x_3 = 1 | E) = 1/3, p(x_4 = 1 | E) = 1/2, p(x_5 = 1 | E) = 1/2,
p(x_1 = 1 | S) = 1, p(x_2 = 1 | S) = 4/7, p(x_3 = 1 | S) = 3/7, p(x_4 = 1 | S) = 5/7, p(x_5 = 1 | S) = 3/7

and the prior probabilities are p(S) = 7/13 and p(E) = 6/13.

For the test point x* = (1, 0, 1, 1, 0)^T, we get

p(S | x*) = [ 1 × (3/7) × (3/7) × (5/7) × (4/7) × (7/13) ] / [ 1 × (3/7) × (3/7) × (5/7) × (4/7) × (7/13) + (1/2) × (1/2) × (1/3) × (1/2) × (1/2) × (6/13) ]    (3.6)

which is 0.8076. Since this is greater than 0.5, we would classify this person as being Scottish.

Consider trying to classify the vector x = (0, 1, 1, 1, 1)^T. In the training data, all Scottish people say they like shortbread. This means that p(x, S) = 0, and hence that p(S | x) = 0. This demonstrates a difficulty with sparse data: very extreme class probabilities can be made. One way to ameliorate this situation is to smooth the probabilities in some way, for example by adding a certain small number M to the frequency counts of each class:

p(x_i = 1 | c) = (number of times x_i = 1 for class c + M) / (number of times x_i = 1 for class c + M + number of times x_i = 0 for class c + M)    (3.7)

This ensures that there are no zero probabilities in the model.
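
To make the effect of equation (3.7) concrete, here is a minimal sketch (not part of the original notes) that extends the Matlab code above with a smoothing count; the variable names mEs, mSs and the choice M = 1 are illustrative assumptions, and xE, xS, pE, pS are assumed to be in the workspace from the code above. With smoothing, the problematic vector x = (0, 1, 1, 1, 1)^T receives a strictly positive probability of being Scottish rather than exactly zero.

% Smoothed estimates of p(x_i=1|c), following equation (3.7) with M=1 (illustrative choice).
M = 1;
nE = size(xE,2); nS = size(xS,2);        % number of English and Scottish datapoints
mEs = (sum(xE,2) + M)./(nE + 2*M);       % smoothed p(x_i=1|c=E)
mSs = (sum(xS,2) + M)./(nS + 2*M);       % smoothed p(x_i=1|c=S)
x = [0 1 1 1 1]';                        % the problematic test point from the text
npE = pE*prod(mEs.^x.*(1-mEs).^(1-x));   % p(x,c=E) under the smoothed model
npS = pS*prod(mSs.^x.*(1-mSs).^(1-x));   % p(x,c=S) under the smoothed model
pxS = npS/(npE+npS)                      % now strictly positive, rather than exactly 0
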
3.2 Gaussians

Fitting continuous data is also straightforward using Naive Bayes. For example, if we were to model each attribute's distribution as a Gaussian, p(x_i | c) = N(µ_i, σ_i), this would be exactly equivalent to using the conditional Gaussian density estimator of the previous chapter with the covariance matrix replaced by one in which all elements are zero except for those on the diagonal.
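
As a minimal sketch of the Gaussian case just described (not from the original notes; the data matrices cE, cS, the test point and all variable names are invented for illustration), one can estimate a mean and variance per attribute and per class, and then compare log-likelihoods plus log-priors:

% Gaussian Naive Bayes sketch: diagonal-covariance class-conditional Gaussians.
% cE and cS are hypothetical continuous datasets (attributes in rows, datapoints in columns).
cE = [1.2 0.8 1.1 0.9; 3.1 2.9 3.3 3.0];            % 2 attributes, 4 "English" datapoints (invented)
cS = [2.1 1.9 2.4 2.0 2.2; 1.1 0.9 1.0 1.2 1.3];    % 2 attributes, 5 "Scottish" datapoints (invented)
pE = size(cE,2)/(size(cE,2)+size(cS,2)); pS = 1-pE; % ML class priors
muE = mean(cE,2);  varE = var(cE,1,2);              % per-attribute ML mean and variance, class E
muS = mean(cS,2);  varS = var(cS,1,2);              % per-attribute ML mean and variance, class S
x = [2.0; 1.1];                                     % a test point (invented)
logE = sum(-0.5*log(2*pi*varE) - (x-muE).^2./(2*varE)) + log(pE); % log p(x|E) + log p(E)
logS = sum(-0.5*log(2*pi*varS) - (x-muS).^2./(2*varS)) + log(pS); % log p(x|S) + log p(S)
pSx = 1/(1+exp(logE-logS))                          % posterior probability that x is Scottish
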
3.3 Text Classification

Naive Bayes has often been applied to classify documents into classes; we outline here how this is done, and refer to a computational linguistics course for the details. Consider a set of documents about politics, and a set about sport. We search through all documents to find the, say, 100 most commonly occurring words. Each document is then represented by a 100 dimensional vector giving the number of times that each of these words occurs in that document, the so-called "bag of words" representation (this is clearly a very crude assumption since it does not take into account the order of the words). We then fit a Naive Bayes model by fitting a distribution of the number of occurrences of each word for all the documents of, first, sport, and then politics. This completes the model.

The reason Naive Bayes may be able to classify documents reasonably well in this way is that the conditional independence assumption is not so silly: if we know people are talking about politics, this is perhaps almost sufficient information to specify what kinds of other words they will be using; we don't need to know anything else. (Of course, if you ultimately want a more powerful text classifier, you need to relax this assumption.)

4 Pitfalls with Naive Bayes

So far we have described how to implement Naive Bayes for the case of binary attributes and also for the case of Gaussian continuous attributes. However, very often the software that people commonly use requires the data to be in the form of binary attributes. It is in the transformation of non-binary data to a binary form that a common mistake occurs.

Consider the following attribute: age. In a survey, a person's age is marked down using the variable a ∈ {1, 2, 3}: a = 1 means the person is between 0 and 10 years old, a = 2 means the person is between 10 and 20 years old, and a = 3 means the person is older than 20. Perhaps there would be other attributes for the data, so that each data entry is a vector of two variables (a, b)^T. One way to transform the variable a into a binary representation would be to use three binary variables (a_1, a_2, a_3): (1, 0, 0) represents a = 1, (0, 1, 0) represents a = 2 and (0, 0, 1) represents a = 3. This is called 1-of-M coding, since only 1 of the binary variables is active in encoding the M states. The problem here is that this encoding, by construction, means that the variables a_1, a_2, a_3 are dependent: for example, if we know that a_1 = 1, we know that a_2 = 0 and a_3 = 0. Regardless of any possible conditioning, these variables will always remain completely dependent, contrary to the assumption of Naive Bayes. This mistake, however, is widespread; please help preserve a little of my sanity by not making the same error. The correct approach is to simply use variables with many states, the multinomial rather than binomial distribution. This is straightforward and left as an exercise for the interested reader.
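
As a starting point for the exercise just mentioned, here is a minimal Matlab sketch (not from the original notes) of the multi-state approach for a single attribute like age: the attribute keeps its three states and we estimate p(a = k | c) directly per class, with no 1-of-M encoding. The data vectors aC0, aC1, the number of states K and the test value are invented for illustration.

% Multinomial (multi-state) Naive Bayes sketch for a single discrete attribute a in {1,...,K}.
aC0 = [1 1 2 3 1 2];          % observed states of attribute a for datapoints in class 0 (invented)
aC1 = [3 3 2 3 1 3 2];        % observed states of attribute a for datapoints in class 1 (invented)
K = 3;                        % number of states of a
p0 = arrayfun(@(k) mean(aC0==k), 1:K);   % ML estimates of p(a=k|c=0), k=1..K
p1 = arrayfun(@(k) mean(aC1==k), 1:K);   % ML estimates of p(a=k|c=1), k=1..K
pc0 = numel(aC0)/(numel(aC0)+numel(aC1)); pc1 = 1-pc0;  % ML class priors
a = 3;                        % test value of the attribute (invented)
pC1a = p1(a)*pc1/(p1(a)*pc1 + p0(a)*pc0)  % posterior p(c=1|a), no 1-of-M encoding needed
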
5 Estimation using Maximum Likelihood : Bernoulli Process

In this section we formally derive how to learn the parameters in a Naive Bayes model from data. The results are intuitive, and indeed we have already made use of them in the previous sections. However, it is instructive to carry out this procedure, and some light can also be cast on the nature of the decision boundary (at least for the case of binary attributes).

Consider a dataset X = {x^µ, µ = 1, ..., P} of binary attributes, that is, x_i^µ ∈ {0, 1}. Each datapoint x^µ has an associated class label c^µ. Based upon the class label, we can split the inputs into those that belong to each class: X^c = {x : x is in class c}. We will consider here only the case of

two classes (this is called a Bernoulli process; the case of more classes is also straightforward and is called the multinomial process). Let the number of datapoints from class c = 0 be n_0 and the number from class c = 1 be n_1. For each of the two classes, we then need to estimate the values p(x_i = 1 | c) ≡ θ_i^c. (The other probability, p(x_i = 0 | c), is simply given from the normalisation requirement: p(x_i = 0 | c) = 1 - p(x_i = 1 | c) = 1 - θ_i^c.) Using the standard assumption that the data are generated identically and independently, the likelihood of the model generating the dataset X^c (the data X belonging to class c) is

p(X^c) = ∏_{µ from class c} p(x^µ | c)    (5.1)

Using our conditional independence assumption,

p(x | c) = ∏_i p(x_i | c) = ∏_i (θ_i^c)^{x_i} (1 - θ_i^c)^{1 - x_i}    (5.2)

(remember that in each term in the above expression, x_i is either 0 or 1 and hence, for each term in the product, only one of the two factors will contribute: a factor θ_i^c if x_i = 1 and 1 - θ_i^c if x_i = 0). Putting this all together, we can find the log likelihood

L(θ^c) = Σ_{i,µ} [ x_i^µ log θ_i^c + (1 - x_i^µ) log(1 - θ_i^c) ]    (5.3)

Optimising with respect to θ_i^c (differentiate with respect to θ_i^c and equate to zero) gives

p(x_i = 1 | c) = θ_i^c = (number of times x_i = 1 for class c) / ( (number of times x_i = 1 for class c) + (number of times x_i = 0 for class c) )    (5.4)

A similar Maximum Likelihood argument gives the intuitive result

p(c) = (number of times class c occurs) / (total number of datapoints)    (5.5)

5.1 Classification Boundary

If we just wish to find the most likely class for a new point x*, we can compare the log probabilities, classifying x* as class 1 if

log p(c = 1 | x*) > log p(c = 0 | x*)    (5.6)

Using the definition of the classifier, this is equivalent to (since the normalisation constant log p(x*) can be dropped from both sides)

log p(x* | c = 1) + log p(c = 1) > log p(x* | c = 0) + log p(c = 0)    (5.7)

Using the binary encoding x_i ∈ {0, 1}, this means: classify x* as class 1 if

Σ_i { x_i* log θ_i^1 + (1 - x_i*) log(1 - θ_i^1) } + log p(c = 1) > Σ_i { x_i* log θ_i^0 + (1 - x_i*) log(1 - θ_i^0) } + log p(c = 0)    (5.8)

Note that this decision rule can be expressed in the form: classify x* as class 1 if Σ_i w_i x_i* + a > 0 for some suitable choice of weights w_i and constant a (the reader is invited to find the explicit values of these weights). The interpretation of this is that w specifies a hyperplane in x space, and x* is classified as a 1 if it lies on one side of this hyperplane. We shall talk about other such linear classifiers in a later chapter.
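
As a sketch of the exercise just mentioned (this working is not from the original notes), rearranging (5.8) suggests one consistent choice: w_i = log( θ_i^1 / (1 - θ_i^1) ) - log( θ_i^0 / (1 - θ_i^0) ) and a = Σ_i log( (1 - θ_i^1) / (1 - θ_i^0) ) + log( p(c = 1) / p(c = 0) ). The Matlab lines below check this numerically; the parameter vectors t0, t1, the priors and the test point are illustrative assumptions, and must lie strictly between 0 and 1 for the logarithms to be finite.

% Numerical check of the linear form of the Naive Bayes decision rule (illustrative values).
t1 = [0.9 0.6 0.4 0.7 0.4]';  % assumed theta_i^1 = p(x_i=1|c=1), strictly in (0,1)
t0 = [0.5 0.5 0.3 0.5 0.5]';  % assumed theta_i^0 = p(x_i=1|c=0), strictly in (0,1)
pc1 = 0.55; pc0 = 1-pc1;      % assumed class priors
w = log(t1./(1-t1)) - log(t0./(1-t0));               % candidate weights
a = sum(log((1-t1)./(1-t0))) + log(pc1/pc0);         % candidate constant
x = [1 0 1 1 0]';             % a test point
lhs = sum(x.*log(t1)+(1-x).*log(1-t1)) + log(pc1);   % left-hand side of (5.8)
rhs = sum(x.*log(t0)+(1-x).*log(1-t0)) + log(pc0);   % right-hand side of (5.8)
[w'*x + a, lhs - rhs]         % the two numbers agree; classify as class 1 if positive
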