Lecture 4: Generative Models for Discrete Data - Part 3. Luigi Freda, ALCOR Lab, DIAG, University of Rome La Sapienza.


Lecture 4 - Generative Models for Discrete Data - Part 3
Luigi Freda
ALCOR Lab, DIAG, University of Rome La Sapienza
October 6, 2017

Outline

1. Naive Bayes Classifiers
   - Basic Concepts
   - Class-Conditional Distributions
   - Likelihood
   - MLE
   - Bayesian Naive Bayes: Prior, Posterior, MAP, Posterior Predictive, Plug-in Approximation, Log-Sum-Exp Trick, Posterior Predictive Algorithm
   - Feature Selection


Generative Classifiers vs Discriminative Classifiers

probabilistic classifier
- we are given a dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^N$
- the goal is to compute the class posterior $p(y = c \mid \mathbf{x})$, which models the mapping $y = f(\mathbf{x})$

generative classifiers
- $p(y = c \mid \mathbf{x})$ is computed starting from the class-conditional density $p(\mathbf{x} \mid y = c, \boldsymbol{\theta})$ and the class prior $p(y = c \mid \boldsymbol{\theta})$, given that
  $$p(y = c \mid \mathbf{x}, \boldsymbol{\theta}) \propto p(\mathbf{x} \mid y = c, \boldsymbol{\theta}) \, p(y = c \mid \boldsymbol{\theta}) \quad (= p(y = c, \mathbf{x} \mid \boldsymbol{\theta}))$$
- this is called a generative classifier since it specifies how to generate the feature vector $\mathbf{x}$ for each class $y = c$ (by using $p(\mathbf{x} \mid y = c, \boldsymbol{\theta})$)
- the model is usually fit by maximizing the joint log-likelihood, i.e. one computes $\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} \sum_i \log p(y_i, \mathbf{x}_i \mid \boldsymbol{\theta})$

discriminative classifiers
- the model $p(y = c \mid \mathbf{x})$ is directly fit to the data
- the model is usually fit by maximizing the conditional log-likelihood, i.e. one computes $\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} \sum_i \log p(y_i \mid \mathbf{x}_i, \boldsymbol{\theta})$

Naive Bayes Classifiers - Basic Concepts

- a Naive Bayes Classifier (NBC) uses a generative approach
- let $\mathbf{x} = [x_1, \ldots, x_D]^T$ be our feature vector with $D$ components (one can have $\mathbf{x} \in \mathbb{R}^D$, $\mathbf{x} \in \{1, 2, \ldots, K\}^D$ or $\mathbf{x} \in \{0, 1\}^D$)
- let $y \in \{1, \ldots, C\}$, where $C$ is the number of classes
- assumption: the $D$ features are assumed to be conditionally independent given the class label, i.e.
  $$p(\mathbf{x} \mid y = c, \boldsymbol{\theta}) = \prod_{j=1}^D p(x_j \mid y = c, \boldsymbol{\theta}_{jc})$$
- this is the simplest approach to specifying a class-conditional density
- it is called naive since we do not actually expect the features to be independent, not even conditionally on the class label $y = c$
- even if the naive assumption is not true, an NBC often works well: the model is quite simple, depends on only $O(CD)$ parameters, and is hence relatively immune to overfitting


Naive Bayes Classifiers - Class-Conditional Distributions

the form of the class-conditional density depends on the type of each feature

- if $x_j \in \mathbb{R}$ we can use the Gaussian distribution
  $$p(\mathbf{x} \mid y = c, \boldsymbol{\theta}) = \prod_{j=1}^D \mathcal{N}(x_j \mid \mu_{jc}, \sigma_{jc}^2)$$
  where for each class $c$ we specify the mean $\mu_{jc}$ of feature $j$ and its variance $\sigma_{jc}^2$

- if $x_j \in \{0, 1\}$ we can use the Bernoulli distribution
  $$p(\mathbf{x} \mid y = c, \boldsymbol{\theta}) = \prod_{j=1}^D \mathrm{Ber}(x_j \mid \mu_{jc})$$
  where for each class $c$ we specify the probability $\mu_{jc} = p(x_j = 1 \mid y = c)$, i.e. the probability that feature $j$ occurs

Naive Bayes Classifiers - Class-Conditional Distributions (cont.)

- if $x_j \in \{1, \ldots, K\}$ we can use the categorical distribution
  $$p(\mathbf{x} \mid y = c, \boldsymbol{\theta}) = \prod_{j=1}^D \mathrm{Cat}(x_j \mid \boldsymbol{\mu}_{jc})$$
  where for each class $c$ we specify the histogram $\boldsymbol{\mu}_{jc} = [p(x_j = 1 \mid y = c), \ldots, p(x_j = K \mid y = c)]$

- other kinds of features can be conceived, and we can mix different kinds of features
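To make the factorization concrete, here is a minimal Python sketch (not from the slides) that evaluates the naive Bayes class-conditional log-density $\log p(\mathbf{x} \mid y = c, \boldsymbol{\theta}) = \sum_j \log p(x_j \mid y = c, \boldsymbol{\theta}_{jc})$ for a feature vector mixing Gaussian and Bernoulli features; all function and parameter names are illustrative.

    import numpy as np
    from scipy.stats import norm, bernoulli

    def log_class_conditional(x_real, x_bin, mu_c, sigma_c, theta_c):
        """log p(x | y=c) = sum_j log p(x_j | y=c), one term per feature."""
        log_p = 0.0
        # Gaussian features: N(x_j | mu_jc, sigma_jc^2)
        log_p += np.sum(norm.logpdf(x_real, loc=mu_c, scale=sigma_c))
        # Bernoulli features: Ber(x_j | theta_jc)
        log_p += np.sum(bernoulli.logpmf(x_bin, theta_c))
        return log_p

    # example: 2 real-valued features and 3 binary features for some class c
    print(log_class_conditional(
        x_real=np.array([0.5, -1.2]), x_bin=np.array([1, 0, 1]),
        mu_c=np.array([0.0, -1.0]), sigma_c=np.array([1.0, 0.5]),
        theta_c=np.array([0.7, 0.1, 0.4])))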


Naive Bayes Classifiers - Likelihood

probability for a single data case
$$p(\mathbf{x}_i, y_i \mid \boldsymbol{\theta}) = p(y_i \mid \boldsymbol{\pi}) \, p(\mathbf{x}_i \mid y_i, \boldsymbol{\theta}) \overset{\text{(NBC assumption)}}{=} p(y_i \mid \boldsymbol{\pi}) \prod_j p(x_{ij} \mid y_i, \boldsymbol{\theta}_j)$$
where $\boldsymbol{\theta}$ is a compound parameter vector containing $\boldsymbol{\pi}$ and the $\boldsymbol{\theta}_j$

since $y_i \sim \mathrm{Cat}(\boldsymbol{\pi})$
$$p(y_i \mid \boldsymbol{\pi}) = \prod_c \pi_c^{\mathbb{I}(y_i = c)}$$

for each class $c$ we allocate a specific set of parameters $\boldsymbol{\theta}_{jc}$
$$p(x_{ij} \mid y_i, \boldsymbol{\theta}_j) = \prod_c p(x_{ij} \mid \boldsymbol{\theta}_{jc})^{\mathbb{I}(y_i = c)}$$

hence
$$p(\mathbf{x}_i, y_i \mid \boldsymbol{\theta}) = \prod_c \pi_c^{\mathbb{I}(y_i = c)} \prod_j \prod_c p(x_{ij} \mid \boldsymbol{\theta}_{jc})^{\mathbb{I}(y_i = c)}$$

Naive Bayes Classifiers - Likelihood

the log-likelihood is given by
$$\log p(\mathcal{D} \mid \boldsymbol{\theta}) = \sum_{i=1}^N \log p(\mathbf{x}_i, y_i \mid \boldsymbol{\theta}) = \sum_{i=1}^N \sum_{c=1}^C \log \pi_c^{\mathbb{I}(y_i = c)} + \sum_{i=1}^N \sum_{j=1}^D \sum_{c=1}^C \log p(x_{ij} \mid \boldsymbol{\theta}_{jc})^{\mathbb{I}(y_i = c)}$$
$$= \sum_{c=1}^C N_c \log \pi_c + \sum_{j=1}^D \sum_{c=1}^C \sum_{i : y_i = c} \log p(x_{ij} \mid \boldsymbol{\theta}_{jc})$$
where $N_c \triangleq \sum_i \mathbb{I}(y_i = c)$ and we assumed as usual that the pairs $(\mathbf{x}_i, y_i)$ are iid
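As a sanity check, a minimal Python sketch (not from the slides) that evaluates this log-likelihood in the binary-feature case adopted later in the lecture, where $p(x_{ij} \mid \theta_{jc}) = \mathrm{Ber}(x_{ij} \mid \theta_{jc})$; names are illustrative.

    import numpy as np

    def nbc_log_likelihood(X, y, pi, theta):
        """X: (N, D) binary array, y: (N,) labels in {0,...,C-1},
        pi: (C,) class probabilities, theta: (C, D) Bernoulli parameters theta_jc."""
        ll = 0.0
        for c in range(len(pi)):
            Xc = X[y == c]                                       # samples of class c
            ll += Xc.shape[0] * np.log(pi[c])                    # N_c log pi_c
            ll += np.sum(Xc * np.log(theta[c])
                         + (1 - Xc) * np.log(1 - theta[c]))      # Bernoulli terms
        return ll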


Naive Bayes Classifiers - MLE

the log-likelihood is
$$\log p(\mathcal{D} \mid \boldsymbol{\theta}) = \sum_{c=1}^C N_c \log \pi_c + \sum_{j=1}^D \sum_{c=1}^C \sum_{i : y_i = c} \log p(x_{ij} \mid \boldsymbol{\theta}_{jc})$$
- here we have the sum of two terms: the first concerns $\boldsymbol{\pi} = [\pi_1, \ldots, \pi_C]$ and the second concerns the $DC$ sets of parameters $\boldsymbol{\theta}_{jc}$
- in order to compute the MLE we can optimize the two groups of parameters $\boldsymbol{\pi}$ and $\boldsymbol{\theta}_{jc}$ separately

Naive Bayes Classifiers - MLE

the log-likelihood is
$$\log p(\mathcal{D} \mid \boldsymbol{\theta}) = \sum_{c=1}^C N_c \log \pi_c + \sum_{j=1}^D \sum_{c=1}^C \sum_{i : y_i = c} \log p(x_{ij} \mid \boldsymbol{\theta}_{jc})$$
- the first term concerns the labels $y_i \sim \mathrm{Cat}(\boldsymbol{\pi})$; recall how we computed the MLE of the Dirichlet-multinomial model
- the MLE can be computed by optimizing the Lagrangian
  $$l(\boldsymbol{\pi}, \lambda) = \sum_c N_c \log \pi_c + \lambda \Big( 1 - \sum_c \pi_c \Big)$$
  where we enforce the constraint $\sum_c \pi_c = 1$
- we impose $\frac{\partial l}{\partial \pi_c} = 0$ and $\frac{\partial l}{\partial \lambda} = 0$ and obtain the MLE estimate
  $$\hat{\pi}_c = \frac{N_c}{N}$$
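For completeness, the intermediate step (not spelled out on the slide): the stationarity conditions give
$$\frac{\partial l}{\partial \pi_c} = \frac{N_c}{\pi_c} - \lambda = 0 \;\Rightarrow\; \pi_c = \frac{N_c}{\lambda}, \qquad \frac{\partial l}{\partial \lambda} = 1 - \sum_c \pi_c = 0 \;\Rightarrow\; \sum_c \frac{N_c}{\lambda} = 1 \;\Rightarrow\; \lambda = \sum_c N_c = N,$$
hence $\hat{\pi}_c = N_c / N$.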

Naive Bayes Classifiers - MLE

the log-likelihood is
$$\log p(\mathcal{D} \mid \boldsymbol{\theta}) = \sum_{c=1}^C N_c \log \pi_c + \sum_{j=1}^D \sum_{c=1}^C \sum_{i : y_i = c} \log p(x_{ij} \mid \boldsymbol{\theta}_{jc})$$
- as for the second-term optimization, we assume the features $x_{ij}$ are binary, i.e. $x_{ij} \in \{0, 1\}$ and $x_{ij} \mid y = c \sim \mathrm{Ber}(\theta_{jc})$, hence $\boldsymbol{\theta}_{jc} = \theta_{jc} \in [0, 1]$
- in this case, we could compute the MLE by using the analysis which was performed with the beta-binomial model
- doing the math again, we have to optimize the function
  $$J = \sum_{j=1}^D \sum_{c=1}^C \sum_{i : y_i = c} \log p(x_{ij} \mid \theta_{jc}) = \sum_{j=1}^D \sum_{c=1}^C \sum_{i : y_i = c} \Big( \mathbb{I}(x_{ij} = 1) \log \theta_{jc} + \mathbb{I}(x_{ij} = 0) \log(1 - \theta_{jc}) \Big)$$
  $$= \sum_{j=1}^D \sum_{c=1}^C N_{jc} \log \theta_{jc} + \sum_{j=1}^D \sum_{c=1}^C (N_c - N_{jc}) \log(1 - \theta_{jc})$$
  where $N_{jc} \triangleq \sum_i \mathbb{I}(x_{ij} = 1, y_i = c)$ and $N_c \triangleq \sum_i \mathbb{I}(y_i = c)$

Naive Bayes Classifiers - MLE

we have to optimize the function
$$J = \sum_{j=1}^D \sum_{c=1}^C N_{jc} \log \theta_{jc} + \sum_{j=1}^D \sum_{c=1}^C (N_c - N_{jc}) \log(1 - \theta_{jc})$$
where $N_{jc} \triangleq \sum_i \mathbb{I}(x_{ij} = 1, y_i = c)$ and $N_c \triangleq \sum_i \mathbb{I}(y_i = c)$

by imposing $\frac{\partial J}{\partial \theta_{jc}} = 0$ one obtains the MLE estimate
$$\hat{\theta}_{jc} = \frac{N_{jc}}{N_c}$$
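Again for completeness, the missing step: setting the derivative to zero gives
$$\frac{\partial J}{\partial \theta_{jc}} = \frac{N_{jc}}{\theta_{jc}} - \frac{N_c - N_{jc}}{1 - \theta_{jc}} = 0 \;\Rightarrow\; N_{jc}(1 - \theta_{jc}) = (N_c - N_{jc}) \, \theta_{jc} \;\Rightarrow\; \hat{\theta}_{jc} = \frac{N_{jc}}{N_c}.$$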

Naive Bayes Classifiers - Model Fitting

algorithm: MLE fitting of a naive Bayes classifier to binary features (i.e. $\mathbf{x}_i \in \{0, 1\}^D$)

    N_c = 0, N_jc = 0;
    for i = 1 : N do
        c := y_i;                 // get the class label of the i-th sample
        N_c := N_c + 1;
        for j = 1 : D do
            if x_ij = 1 then
                N_jc := N_jc + 1
            end
        end
    end
    pi_hat_c := N_c / N;   theta_hat_jc := N_jc / N_c;

- see the naivebayesfit script for some Matlab code
- the algorithm takes $O(ND)$ time
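Below is a minimal NumPy sketch (not the lecture's naivebayesfit Matlab script) of the same counting procedure; function and variable names are illustrative. The counts are also returned so the later sketches can reuse them.

    import numpy as np

    def naive_bayes_fit_mle(X, y, C):
        """X: (N, D) binary array, y: (N,) labels in {0,...,C-1}, C: number of classes."""
        N, D = X.shape
        N_c = np.zeros(C)                  # class counts N_c
        N_jc = np.zeros((C, D))            # feature-occurrence counts N_jc
        for i in range(N):
            c = y[i]
            N_c[c] += 1
            N_jc[c] += X[i]                # adds 1 wherever x_ij = 1
        pi_hat = N_c / N                   # pi_hat_c = N_c / N
        theta_hat = N_jc / N_c[:, None]    # theta_hat_jc = N_jc / N_c
        return pi_hat, theta_hat, N_c, N_jc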


Naive Bayes Classifiers - Bayesian Reasoning

- as we know, the MLE estimates can overfit
- recall the black swan paradox and the issue of using empirical fractions $N_i / N$
- a simple solution to overfitting is to be Bayesian


The Beta-Binomial Model - Prior

for simplicity we use a factored prior
$$p(\boldsymbol{\theta}) = p(\boldsymbol{\pi}) \prod_{j=1}^D \prod_{c=1}^C p(\theta_{jc})$$
where $\boldsymbol{\theta}$ is a compound parameter vector containing $\boldsymbol{\pi}$ and the $\theta_{jc}$

- as for the prior of $\boldsymbol{\pi}$, we use $p(\boldsymbol{\pi}) = \mathrm{Dir}(\boldsymbol{\pi} \mid \boldsymbol{\alpha})$, which is a conjugate prior w.r.t. the multinomial part
- as for the prior of each $\theta_{jc}$, we use $p(\theta_{jc}) = \mathrm{Beta}(\theta_{jc} \mid \beta_0, \beta_1)$, which is a conjugate prior w.r.t. the binomial part
- we can obtain a uniform prior by setting $\boldsymbol{\alpha} = \mathbf{1}$ and $\beta_0 = \beta_1 = 1$


The Beta-Binomial Model - Posterior

factored likelihood
$$\log p(\mathcal{D} \mid \boldsymbol{\theta}) = \log \mathrm{Cat}(\mathbf{y} \mid \boldsymbol{\pi}) + \sum_{j=1}^D \sum_{c=1}^C \sum_{i : y_i = c} \log \mathrm{Ber}(x_{ij} \mid \theta_{jc})$$

factored prior
$$p(\boldsymbol{\theta}) = \mathrm{Dir}(\boldsymbol{\pi} \mid \boldsymbol{\alpha}) \prod_{j=1}^D \prod_{c=1}^C \mathrm{Beta}(\theta_{jc} \mid \beta_0, \beta_1)$$

factored posterior
$$p(\boldsymbol{\theta} \mid \mathcal{D}) = p(\boldsymbol{\pi} \mid \mathcal{D}) \prod_{j=1}^D \prod_{c=1}^C p(\theta_{jc} \mid \mathcal{D})$$
$$p(\boldsymbol{\pi} \mid \mathcal{D}) = \mathrm{Dir}(\boldsymbol{\pi} \mid N_1 + \alpha_1, \ldots, N_C + \alpha_C)$$
$$p(\theta_{jc} \mid \mathcal{D}) = \mathrm{Beta}(\theta_{jc} \mid N_{jc} + \beta_1, (N_c - N_{jc}) + \beta_0)$$

to compute the posterior we just update the empirical counts of the likelihood with the prior counts


The Beta-Binomial Model - MAP

factored posterior
$$p(\boldsymbol{\theta} \mid \mathcal{D}) = p(\boldsymbol{\pi} \mid \mathcal{D}) \prod_{j=1}^D \prod_{c=1}^C p(\theta_{jc} \mid \mathcal{D})$$
$$p(\boldsymbol{\pi} \mid \mathcal{D}) = \mathrm{Dir}(\boldsymbol{\pi} \mid N_1 + \alpha_1, \ldots, N_C + \alpha_C)$$
$$p(\theta_{jc} \mid \mathcal{D}) = \mathrm{Beta}(\theta_{jc} \mid N_{jc} + \beta_1, (N_c - N_{jc}) + \beta_0)$$

MAP estimate of $\boldsymbol{\pi} = [\pi_1, \ldots, \pi_C]$
$$\hat{\boldsymbol{\pi}} = \arg\max_{\boldsymbol{\pi}} \mathrm{Dir}(\boldsymbol{\pi} \mid N_1 + \alpha_1, \ldots, N_C + \alpha_C) \;\Rightarrow\; \hat{\pi}_c = \frac{N_c + \alpha_c - 1}{N + \alpha_0 - C}$$

MAP estimate of $\theta_{jc}$ for $j \in \{1, \ldots, D\}$, $c \in \{1, \ldots, C\}$
$$\hat{\theta}_{jc} = \arg\max_{\theta_{jc}} \mathrm{Beta}(\theta_{jc} \mid N_{jc} + \beta_1, (N_c - N_{jc}) + \beta_0) \;\Rightarrow\; \hat{\theta}_{jc} = \frac{N_{jc} + \beta_1 - 1}{N_c + \beta_1 + \beta_0 - 2}$$

Naive Bayes Classifiers - MAP Model Fitting

algorithm: MAP fitting of a naive Bayes classifier to binary features (i.e. $\mathbf{x}_i \in \{0, 1\}^D$)

    N_c = 0, N_jc = 0;
    for i = 1 : N do
        c := y_i;                 // get the class label of the i-th sample
        N_c := N_c + 1;
        for j = 1 : D do
            if x_ij = 1 then
                N_jc := N_jc + 1
            end
        end
    end
    pi_hat_c := (N_c + alpha_c - 1) / (N + alpha_0 - C);
    theta_hat_jc := (N_jc + beta_1 - 1) / (N_c + beta_1 + beta_0 - 2);
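The counting loop is identical to the MLE case, so a minimal sketch (names illustrative) can reuse the counts returned by the naive_bayes_fit_mle sketch above and only change the final estimates:

    import numpy as np

    def map_estimates(N_c, N_jc, alpha, beta0, beta1):
        """N_c: (C,), N_jc: (C, D), alpha: (C,) Dirichlet hyperparameters,
        beta0, beta1: Beta hyperparameters."""
        N, alpha0, C = N_c.sum(), alpha.sum(), len(N_c)
        pi_hat = (N_c + alpha - 1) / (N + alpha0 - C)
        theta_hat = (N_jc + beta1 - 1) / (N_c[:, None] + beta1 + beta0 - 2)
        return pi_hat, theta_hat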


Naive Bayes Classifiers - Posterior Predictive

if we are given a new sample $\mathbf{x}$, the posterior predictive is
$$p(y = c \mid \mathbf{x}, \mathcal{D}) \propto p(\mathbf{x} \mid y = c, \mathcal{D}) \, p(y = c \mid \mathcal{D})$$

with an NBC the class-conditional density can be factorized as
$$p(\mathbf{x} \mid y = c, \mathcal{D}) = \prod_{j=1}^D p(x_j \mid y = c, \mathcal{D})$$
(since features are assumed to be conditionally independent given the class label)

combining the two above equations returns
$$p(y = c \mid \mathbf{x}, \mathcal{D}) \propto p(y = c \mid \mathcal{D}) \prod_{j=1}^D p(x_j \mid y = c, \mathcal{D})$$

Naive Bayes Classifiers - Posterior Predictive

we start from this factorization and apply the Bayesian procedure
$$p(y = c \mid \mathbf{x}, \mathcal{D}) \propto p(y = c \mid \mathcal{D}) \prod_{j=1}^D p(x_j \mid y = c, \mathcal{D})$$

first step: we integrate out the unknown $\boldsymbol{\pi}$ in the first factor
$$p(y = c \mid \mathcal{D}) = \int p(y = c, \boldsymbol{\pi} \mid \mathcal{D}) \, d\boldsymbol{\pi} = \int p(y = c \mid \boldsymbol{\pi}, \mathcal{D}) \, p(\boldsymbol{\pi} \mid \mathcal{D}) \, d\boldsymbol{\pi} = \int p(y = c \mid \boldsymbol{\pi}) \, p(\boldsymbol{\pi} \mid \mathcal{D}) \, d\boldsymbol{\pi}$$
(last step: $\boldsymbol{\pi}$ gives enough information to compute $p(y = c)$)

second step: we integrate out the unknowns $\theta_{jc}$ in each remaining factor
$$p(x_j \mid y = c, \mathcal{D}) = \int p(x_j, \theta_{jc} \mid y = c, \mathcal{D}) \, d\theta_{jc} = \int p(x_j \mid \theta_{jc}, y = c, \mathcal{D}) \, p(\theta_{jc} \mid y = c, \mathcal{D}) \, d\theta_{jc} = \int p(x_j \mid \theta_{jc}, y = c) \, p(\theta_{jc} \mid \mathcal{D}) \, d\theta_{jc}$$
(last step: the new $\mathbf{x}$ is independent of $\mathcal{D}$)

Naive Bayes Classifiers - Posterior Predictive

recollecting everything together returns
$$p(y = c \mid \mathbf{x}, \mathcal{D}) \propto \left[ \int p(y = c \mid \boldsymbol{\pi}) \, p(\boldsymbol{\pi} \mid \mathcal{D}) \, d\boldsymbol{\pi} \right] \prod_{j=1}^D \left[ \int p(x_j \mid \theta_{jc}, y = c) \, p(\theta_{jc} \mid \mathcal{D}) \, d\theta_{jc} \right]$$

and plugging in the model PDFs/PMFs we adopted
$$p(y = c \mid \mathbf{x}, \mathcal{D}) \propto \left[ \int \mathrm{Cat}(y = c \mid \boldsymbol{\pi}) \, \mathrm{Dir}(\boldsymbol{\pi} \mid N_1 + \alpha_1, \ldots, N_C + \alpha_C) \, d\boldsymbol{\pi} \right] \prod_{j=1}^D \left[ \int \mathrm{Ber}(x_j \mid \theta_{jc}) \, \mathrm{Beta}(\theta_{jc} \mid N_{jc} + \beta_1, (N_c - N_{jc}) + \beta_0) \, d\theta_{jc} \right]$$

- the first part is a Dirichlet-multinomial model
- the second part is a product of beta-binomial models

Naive Bayes Classifiers - Posterior Predictive

doing the math again for the first part
$$\int \mathrm{Cat}(y = c \mid \boldsymbol{\pi}) \, \mathrm{Dir}(\boldsymbol{\pi} \mid N_1 + \alpha_1, \ldots, N_C + \alpha_C) \, d\boldsymbol{\pi} = \int \pi_c \, \mathrm{Dir}(\boldsymbol{\pi} \mid N_1 + \alpha_1, \ldots, N_C + \alpha_C) \, d\boldsymbol{\pi} = \mathbb{E}[\pi_c \mid \mathcal{D}] = \frac{N_c + \alpha_c}{N + \alpha_0}$$
where $\alpha_0 = \sum_c \alpha_c$

this is exactly how we computed the posterior mean for the Dirichlet-multinomial model
$$\bar{\pi}_c = \mathbb{E}[\pi_c \mid \mathcal{D}] = \frac{N_c + \alpha_c}{N + \alpha_0}$$

Naive Bayes Classifiers - Posterior Predictive

doing the math again for the second part
$$\int \mathrm{Ber}(x_j \mid \theta_{jc}) \, \mathrm{Beta}(\theta_{jc} \mid N_{jc} + \beta_1, (N_c - N_{jc}) + \beta_0) \, d\theta_{jc} = \int \theta_{jc}^{\mathbb{I}(x_j = 1)} (1 - \theta_{jc})^{\mathbb{I}(x_j = 0)} \, \mathrm{Beta}(\theta_{jc} \mid N_{jc} + \beta_1, (N_c - N_{jc}) + \beta_0) \, d\theta_{jc} = (\bar{\theta}_{jc})^{\mathbb{I}(x_j = 1)} (1 - \bar{\theta}_{jc})^{\mathbb{I}(x_j = 0)}$$
where
$$\bar{\theta}_{jc} = \mathbb{E}[\theta_{jc} \mid \mathcal{D}] = \frac{N_{jc} + \beta_1}{N_c + \beta_0 + \beta_1}$$

- in the above equations we first worked on the case $x_j = 1$ and then on the case $x_j = 0$
- this is exactly how we computed the posterior mean for the beta-binomial model

Naive Bayes Classifiers - Posterior Predictive

the final posterior predictive is
$$p(y = c \mid \mathbf{x}, \mathcal{D}) \propto \bar{\pi}_c \prod_{j=1}^D (\bar{\theta}_{jc})^{\mathbb{I}(x_j = 1)} (1 - \bar{\theta}_{jc})^{\mathbb{I}(x_j = 0)}$$
with the posterior means
$$\bar{\theta}_{jc} = \mathbb{E}[\theta_{jc} \mid \mathcal{D}] = \frac{N_{jc} + \beta_1}{N_c + \beta_0 + \beta_1} \qquad \text{and} \qquad \bar{\pi}_c = \mathbb{E}[\pi_c \mid \mathcal{D}] = \frac{N_c + \alpha_c}{N + \alpha_0}$$
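A minimal sketch (names illustrative) computing these posterior means from the counts returned by the earlier naive_bayes_fit_mle sketch; note the absence of the "-1"/"-2" terms that appear in the MAP estimates:

    import numpy as np

    def posterior_mean_params(N_c, N_jc, alpha, beta0, beta1):
        """Posterior means: pi_bar_c = (N_c + alpha_c) / (N + alpha_0),
        theta_bar_jc = (N_jc + beta_1) / (N_c + beta_0 + beta_1)."""
        pi_bar = (N_c + alpha) / (N_c.sum() + alpha.sum())
        theta_bar = (N_jc + beta1) / (N_c[:, None] + beta0 + beta1)
        return pi_bar, theta_bar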


Naive Bayes Classifiers - Plug-in Approximation

- we can approximate the posterior with a single point, i.e. $p(\boldsymbol{\theta} \mid \mathcal{D}) \approx \delta_{\hat{\boldsymbol{\theta}}}(\boldsymbol{\theta})$, where $\hat{\boldsymbol{\theta}}$ can be the MAP or the MLE estimate
- we obtain in this case a plug-in approximation
  $$p(y = c \mid \mathbf{x}, \mathcal{D}) \propto \hat{\pi}_c \prod_{j=1}^D (\hat{\theta}_{jc})^{\mathbb{I}(x_j = 1)} (1 - \hat{\theta}_{jc})^{\mathbb{I}(x_j = 0)}$$
- the plug-in approximation is obviously more prone to overfitting


Naive Bayes Classifiers - Log-Sum-Exp Trick

the posterior predictive has the following form
$$p(y = c \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid y = c) \, p(y = c)}{p(\mathbf{x})} = \frac{p(\mathbf{x} \mid y = c) \, p(y = c)}{\sum_{c'} p(\mathbf{x} \mid y = c') \, p(y = c')}$$
- $p(\mathbf{x} \mid y = c)$ is often a very small number, especially if $\mathbf{x}$ is a high-dimensional vector, since we have to enforce $\sum_{\mathbf{x}} p(\mathbf{x} \mid y = c) = 1$
- this entails that a naive implementation of the posterior predictive can fail due to numerical underflow
- the obvious solution is to use logs
  $$\log p(y = c \mid \mathbf{x}) = \log p(\mathbf{x} \mid y = c) + \log p(y = c) - \log p(\mathbf{x})$$
  and if we define $b_c \triangleq \log p(\mathbf{x} \mid y = c) + \log p(y = c)$, one has
  $$\log p(y = c \mid \mathbf{x}) = b_c - \log \left[ \sum_{c'} e^{b_{c'}} \right]$$

Naive Bayes Classifiers - Log-Sum-Exp Trick

with $b_c \triangleq \log p(\mathbf{x} \mid y = c) + \log p(y = c)$ we have
$$\log p(y = c \mid \mathbf{x}) = b_c - \log \left[ \sum_{c'} e^{b_{c'}} \right]$$
- now we have the problem that computing $e^{b_c}$ can cause an overflow, since $b_c$ can be a big number
- we can use the log-sum-exp trick in order to avoid this problem
  $$\log \left[ \sum_c e^{b_c} \right] = \log \left[ \left( \sum_c e^{b_c - B} \right) e^B \right] = \log \left[ \sum_c e^{b_c - B} \right] + B \qquad \text{where } B \triangleq \max_c b_c$$
- with this trick the biggest exponent $b_c - B$ equals zero, so the largest exponentiated term is $e^0 = 1$ and no overflow can occur
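A minimal Python sketch of the trick (the function name log_sum_exp is illustrative):

    import numpy as np

    def log_sum_exp(b):
        """Compute log(sum_c exp(b_c)) without overflow."""
        B = np.max(b)                            # B = max_c b_c
        return np.log(np.sum(np.exp(b - B))) + B

    # example: a naive np.log(np.sum(np.exp(b))) would overflow here
    b = np.array([1000.0, 999.0, 998.0])
    print(log_sum_exp(b))                        # ~1000.41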


Naive Bayes Classifiers - Posterior Predictive Algorithm

the computed posterior predictive is
$$p(y = c \mid \mathbf{x}, \mathcal{D}) \propto \bar{\pi}_c \prod_{j=1}^D (\bar{\theta}_{jc})^{\mathbb{I}(x_j = 1)} (1 - \bar{\theta}_{jc})^{\mathbb{I}(x_j = 0)}$$
if we apply the log we obtain
$$\log p(y = c \mid \mathbf{x}, \mathcal{D}) = \log \bar{\pi}_c + \sum_{j=1}^D \Big( \mathbb{I}(x_j = 1) \log \bar{\theta}_{jc} + \mathbb{I}(x_j = 0) \log(1 - \bar{\theta}_{jc}) \Big) + \text{const}$$

the above log-posterior is the basis for the next algorithm

Naive Bayes Classifiers - Posterior Predictive Algorithm

algorithm: predicting with a naive Bayes classifier for binary features (i.e. $\mathbf{x} \in \{0, 1\}^D$)

    for c = 1 : C do
        L_c := log(pi_hat_c);
        for j = 1 : D do
            if x_j = 1 then
                L_c := L_c + log(theta_hat_jc)
            else
                L_c := L_c + log(1 - theta_hat_jc)
            end
        end
    end
    p_c := exp(L_c - logsumexp(L_{1:C}));     // compute p(y = c | x, D) for each c
    y_hat := arg max_c p_c;

- the above algorithm computes $\hat{y} = \arg\max_c p(y = c \mid \mathbf{x}, \mathcal{D})$
- the plug-in estimates $\hat{\pi}_c, \hat{\theta}_{jc}$ used here are best replaced with the posterior means $\bar{\pi}_c, \bar{\theta}_{jc}$, as shown in the computation of the full posterior predictive
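A minimal NumPy sketch of this prediction step (not the lecture's Matlab code), reusing the log_sum_exp helper from the earlier sketch; pi and theta can be the MLE, MAP, or posterior-mean estimates:

    import numpy as np

    def naive_bayes_predict(x, pi, theta):
        """x: (D,) binary feature vector, pi: (C,), theta: (C, D) with theta[c, j] = theta_jc."""
        L = np.log(pi) + x @ np.log(theta.T) + (1 - x) @ np.log(1 - theta.T)
        p = np.exp(L - log_sum_exp(L))     # normalized p(y = c | x, D)
        return np.argmax(p), p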


Feature Selection - By Using Mutual Information

- an NBC is commonly used to fit a joint distribution over potentially many features
- the NBC fitting algorithm is $O(ND)$, where $N$ is the dataset size and $D$ is the size of $\mathbf{x}$
- problems: $D$ can be very high and an NBC may suffer from overfitting
- a common approach to reduce these problems is to perform feature selection:
  1. evaluate the relevance of each feature
  2. keep only the $K$ most relevant features ($K$ is chosen based on some accuracy-complexity tradeoff)

Feature Selection - Mutual Information

- correlation is a very limited measure of dependence; revise the slides about correlation and independence (lecture 3, part 2)
- a more general approach is to determine how similar the joint distribution $p(X, Y)$ is to $p(X) p(Y)$ (recall the definition of $X \perp Y$)
- mutual information (MI)
  $$\mathbb{I}[X; Y] \triangleq \mathrm{KL}\big[ p(X, Y) \,\|\, p(X) \, p(Y) \big] = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x) \, p(y)}$$
- one has $\mathbb{I}[X; Y] \geq 0$, with equality iff $p(X, Y) = p(X) p(Y)$

Feature Selection - Mutual Information

we want to measure the relevance between feature $X_j$ and the class label $Y$
$$\mathbb{I}[X_j; Y] = \sum_{x_j} \sum_y p(x_j, y) \log \frac{p(x_j, y)}{p(x_j) \, p(y)}$$
for an NBC classifier with binary features one has (homework: ex. 3.21)
$$I_j \triangleq \mathbb{I}[X_j; Y] = \sum_c \left[ \theta_{jc} \pi_c \log \frac{\theta_{jc}}{\theta_j} + (1 - \theta_{jc}) \pi_c \log \frac{1 - \theta_{jc}}{1 - \theta_j} \right]$$
where the following quantities are computed by the NBC fitting algorithm: $\pi_c = p(y = c)$, $\theta_{jc} = p(x_j = 1 \mid y = c)$ and $\theta_j = p(x_j = 1) = \sum_c \pi_c \theta_{jc}$

the top $K$ features with the highest $I_j$ can then be selected and used
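A minimal Python sketch (names illustrative) that computes the scores $I_j$ from fitted NBC parameters and selects the top-$K$ features:

    import numpy as np

    def mutual_information_scores(pi, theta):
        """pi: (C,) class priors, theta: (C, D) with theta[c, j] = p(x_j = 1 | y = c)."""
        theta_j = pi @ theta                       # theta_j = sum_c pi_c * theta_jc, shape (D,)
        I = np.sum(pi[:, None] * (theta * np.log(theta / theta_j)
                                  + (1 - theta) * np.log((1 - theta) / (1 - theta_j))),
                   axis=0)                         # sum over classes c
        return I                                   # one relevance score I_j per feature

    def select_top_k(I, K):
        return np.argsort(I)[::-1][:K]             # indices of the K most relevant features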

Credits

- Kevin Murphy's book