COMP 551 Applied Machine Learning Lecture 5: Generative models for linear classification

COMP 551 Applied Machine Learning Lecture 5: Generative models for linear classification Instructor: Herke van Hoof (herke.vanhoof@mail.mcgill.ca) Slides mostly by: Joelle Pineau Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless otherwise noted, all material posted for this course is copyright of the instructors, and cannot be reused or reposted without the instructor's written permission.

Modeling for binary classification Two probabilistic approaches: 1. Generative learning: Separately model P(x|y) and P(y). Use Bayes rule to estimate P(y|x): P(y=1|x) = P(x|y=1)P(y=1) / P(x). 2. Discriminative learning: Directly estimate P(y|x). 2 Joelle Pineau

How about other types of data? Last lecture, we saw one generative approach (LDA). LDA works with continuous data. What about other types of data? 3 Joelle Pineau

How about other types of data? LDA only works with continuous input data. Let's look at an approach for handling other types of data (mainly: binary). 4 Joelle Pineau

Generative learning with binary input data Generative learning: Estimate P(x|y), P(y). Then calculate P(y|x). Simple principle: for every class y, estimate the conditional probability P(x|y) of every input pattern x. What happens if the number of input variables m is large? 5 Joelle Pineau

Generative learning with binary input data Generative learning: Estimate P(x|y), P(y). Then calculate P(y|x). Simple principle: for every class y, estimate the conditional probability P(x|y) of every input pattern x. What happens if the number of input variables m is large? O(2^m) parameters are necessary to describe the model! We need an additional assumption on the structure of the input to keep this manageable! 6 Joelle Pineau

Naïve Bayes assumption Generative learning: Estimate P(x|y), P(y). Then calculate P(y|x). Naïve Bayes: Assume the x_j are conditionally independent given y. In other words: P(x_j|y) = P(x_j|y, x_k), for all j, k. 7 Joelle Pineau

Naïve Bayes assumption Generative learning: Estimate P(x|y), P(y). Then calculate P(y|x). Naïve Bayes: Assume the x_j are conditionally independent given y. In other words: P(x_j|y) = P(x_j|y, x_k), for all j, k. Generative model structure:
P(x|y) = P(x_1, x_2, ..., x_m|y)
= P(x_1|y) P(x_2|y, x_1) P(x_3|y, x_1, x_2) ... P(x_m|y, x_1, x_2, ..., x_{m-1}) (from general rules of probabilities)
= P(x_1|y) P(x_2|y) P(x_3|y) ... P(x_m|y) (from the Naïve Bayes assumption above)
8 Joelle Pineau
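
To make the factorization concrete, here is a minimal sketch (not from the slides; the probability table theta and its values are illustrative assumptions) of how P(x|y) decomposes into a product of per-feature Bernoulli terms under the Naïve Bayes assumption:

import numpy as np

# Hypothetical conditional probability tables: theta[c][j] = P(x_j = 1 | y = c)
theta = {0: np.array([0.10, 0.10, 0.30]),   # class y = 0
         1: np.array([0.50, 0.50, 0.20])}   # class y = 1

def naive_bayes_likelihood(x, c):
    """P(x | y = c) as a product of per-feature Bernoulli terms."""
    p_j = np.where(x == 1, theta[c], 1.0 - theta[c])  # P(x_j | y = c) for each feature j
    return np.prod(p_j)

x = np.array([1, 0, 1])
print(naive_bayes_likelihood(x, 1))   # 0.5 * 0.5 * 0.2 = 0.05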

Conditional independence example: "offer" and "opportunity" might both occur often in spam e-mails. Let's say we get 50% spam and 50% regular e-mail. Let's say spam e-mails contain "offer" and "opportunity" each 50% of the time, independently of each other, and that either word appears in 10% of regular e-mails. Out of 200 e-mails:

                           Spam   Regular   Together    %    Expected % (if independent)
Contains only offer         25       9         34      17         21
Contains only opportunity   25       9         34      17         21
Contains neither            25      81        106      53         49
Contains both               25       1         26      13          9

"Offer" and "opportunity" are not independent over all e-mails! We say they are conditionally independent given the class. 9 Joelle Pineau
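
As a quick numeric check of the table (a sketch using the counts above, out of 200 e-mails), the two words are dependent overall but independent within each class:

# Overall: P(offer) = P(opportunity) = (34 + 26) / 200 = 0.30
p_offer = p_opp = 60 / 200
p_both = 26 / 200                       # observed fraction containing both words
print(p_both, p_offer * p_opp)          # 0.13 vs 0.09 -> not independent overall

# Within spam: P(offer | spam) = P(opportunity | spam) = 0.5, both = 25/100
print(25 / 100, 0.5 * 0.5)              # 0.25 vs 0.25 -> independent given the class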

Naïve Bayes graphical model: class variable y with arrows to features x_1, x_2, x_3, ..., x_m. How many parameters to estimate? Assume m binary features. 10 Joelle Pineau

Naïve Bayes graphical model: class variable y with arrows to features x_1, x_2, x_3, ..., x_m. How many parameters to estimate? Assume m binary features. Without the Naïve Bayes assumption: O(2^m) numbers to describe the model. With the Naïve Bayes assumption: O(m) numbers to describe the model. Useful when the number of features is high. 11 Joelle Pineau

Training a Naïve Bayes classifier Assume x, y are binary variables, m=1. Estimate the parameters P(x|y) and P(y) from data. Define: Θ_1 = Pr(y=1), Θ_j,1 = Pr(x_j=1|y=1), Θ_j,0 = Pr(x_j=1|y=0). [Graphical model: y with an arrow to x.] 12 Joelle Pineau

Training a Naïve Bayes classifier Assume x, y are binary variables, m=1. Estimate the parameters P(x|y) and P(y) from data. Define: Θ_1 = Pr(y=1), Θ_j,1 = Pr(x_j=1|y=1), Θ_j,0 = Pr(x_j=1|y=0). [Graphical model: y with an arrow to x.] Evaluation criterion: Find the parameters that maximize the log-likelihood function. Likelihood: Pr(y|x) ∝ Pr(y)Pr(x|y) = Π_{i=1:n} ( P(y_i) Π_{j=1:m} P(x_i,j|y_i) ). Samples i are independent, so we take the product over n. Input features are independent (conditioned on y), so we take the product over m. 13 Joelle Pineau

Training a Naïve Bayes classifier Likelihood for a binary output variable: L(Θ_1|y) = Θ_1^y (1-Θ_1)^(1-y). Log-likelihood for all parameters (like before):
log L(Θ_1, Θ_j,1, Θ_j,0 | D) = Σ_{i=1:n} [ log P(y_i) + Σ_{j=1:m} log P(x_i,j|y_i) ]
14 Joelle Pineau

Training a Naïve Bayes classifier Likelihood for a binary output variable: L(Θ_1|y) = Θ_1^y (1-Θ_1)^(1-y). Log-likelihood for all parameters (like before):
log L(Θ_1, Θ_j,1, Θ_j,0 | D) = Σ_{i=1:n} [ log P(y_i) + Σ_{j=1:m} log P(x_i,j|y_i) ]
= Σ_{i=1:n} [ y_i log Θ_1 + (1-y_i) log(1-Θ_1)
+ Σ_{j=1:m} y_i ( x_i,j log Θ_j,1 + (1-x_i,j) log(1-Θ_j,1) )
+ Σ_{j=1:m} (1-y_i) ( x_i,j log Θ_j,0 + (1-x_i,j) log(1-Θ_j,0) ) ]
15 Joelle Pineau

Training a Naïve Bayes classifier Likelihood for a binary output variable: L(Θ_1|y) = Θ_1^y (1-Θ_1)^(1-y). Log-likelihood for all parameters (like before):
log L(Θ_1, Θ_j,1, Θ_j,0 | D) = Σ_{i=1:n} [ log P(y_i) + Σ_{j=1:m} log P(x_i,j|y_i) ]
= Σ_{i=1:n} [ y_i log Θ_1 + (1-y_i) log(1-Θ_1)
+ Σ_{j=1:m} y_i ( x_i,j log Θ_j,1 + (1-x_i,j) log(1-Θ_j,1) )
+ Σ_{j=1:m} (1-y_i) ( x_i,j log Θ_j,0 + (1-x_i,j) log(1-Θ_j,0) ) ]
(This will have another form if the parameters P(x|y) have another form, e.g. Gaussian.) 16 Joelle Pineau

Training a Naïve Bayes classifier Likelihood for a binary output variable: L(Θ_1|y) = Θ_1^y (1-Θ_1)^(1-y). Log-likelihood for all parameters (like before):
log L(Θ_1, Θ_j,1, Θ_j,0 | D) = Σ_{i=1:n} [ log P(y_i) + Σ_{j=1:m} log P(x_i,j|y_i) ]
= Σ_{i=1:n} [ y_i log Θ_1 + (1-y_i) log(1-Θ_1)
+ Σ_{j=1:m} y_i ( x_i,j log Θ_j,1 + (1-x_i,j) log(1-Θ_j,1) )
+ Σ_{j=1:m} (1-y_i) ( x_i,j log Θ_j,0 + (1-x_i,j) log(1-Θ_j,0) ) ]
(This will have another form if the parameters P(x|y) have another form, e.g. Gaussian.)
Maximize to estimate Θ_1: take the derivative of log L and set it to 0:
∂L/∂Θ_1 = Σ_{i=1:n} ( y_i/Θ_1 - (1-y_i)/(1-Θ_1) ) = 0
17 Joelle Pineau

Training a Naïve Bayes classifier Solving for Θ_1 we get: Θ_1 = (1/n) Σ_{i=1:n} y_i = (number of examples where y=1) / (number of examples). Similarly, we get: Θ_j,1 = (number of examples where x_j=1 and y=1) / (number of examples where y=1), and Θ_j,0 = (number of examples where x_j=1 and y=0) / (number of examples where y=0). 18 Joelle Pineau
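
These counting-based estimates are simple to implement. A minimal sketch (not from the slides; it assumes X is an n×m binary NumPy array, y a length-n binary vector, and the names are my own):

import numpy as np

def fit_naive_bayes(X, y):
    """Maximum likelihood estimates for a Bernoulli Naive Bayes model."""
    theta_1 = y.mean()                       # Theta_1   = P(y = 1)
    theta_j1 = X[y == 1].mean(axis=0)        # Theta_j,1 = P(x_j = 1 | y = 1)
    theta_j0 = X[y == 0].mean(axis=0)        # Theta_j,0 = P(x_j = 1 | y = 0)
    return theta_1, theta_j1, theta_j0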

Naïve Bayes decision boundary The decision boundary is where the probabilities of the classes are equal: log-odds ratio = 0.
log [ Pr(y=1|x) / Pr(y=0|x) ] = log [ Pr(x|y=1)P(y=1) / (Pr(x|y=0)P(y=0)) ]
= log [ P(y=1)/P(y=0) ] + log Π_{j=1:m} [ P(x_j|y=1) / P(x_j|y=0) ]
= log [ P(y=1)/P(y=0) ] + Σ_{j=1:m} log [ P(x_j|y=1) / P(x_j|y=0) ]
19 Joelle Pineau

Naïve Bayes decision boundary Consider the case where features are binary: x_j ∈ {0, 1}. Define:
w_j,0 = log [ P(x_j=0|y=1) / P(x_j=0|y=0) ],  w_j,1 = log [ P(x_j=1|y=1) / P(x_j=1|y=0) ]
Now we have:
log [ Pr(y=1|x) / Pr(y=0|x) ] = log [ P(y=1)/P(y=0) ] + Σ_{j=1:m} log [ P(x_j|y=1) / P(x_j|y=0) ]
= log [ P(y=1)/P(y=0) ] + Σ_{j=1:m} ( w_j,0 (1-x_j) + w_j,1 x_j )
= log [ P(y=1)/P(y=0) ] + Σ_{j=1:m} w_j,0 + Σ_{j=1:m} (w_j,1 - w_j,0) x_j
This is a linear decision boundary! Constant + linear in x. 20 Joelle Pineau
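
Continuing the earlier sketch, the fitted parameters can be folded into an explicit linear rule (bias + weights), mirroring the derivation on this slide; the function name and layout are my own, not from the slides:

import numpy as np

def nb_to_linear(theta_1, theta_j1, theta_j0):
    """Convert Bernoulli Naive Bayes parameters into a linear log-odds model."""
    w_j1 = np.log(theta_j1) - np.log(theta_j0)            # log P(x_j=1|y=1) / P(x_j=1|y=0)
    w_j0 = np.log(1 - theta_j1) - np.log(1 - theta_j0)    # log P(x_j=0|y=1) / P(x_j=0|y=0)
    bias = np.log(theta_1 / (1 - theta_1)) + w_j0.sum()   # log P(y=1)/P(y=0) + sum_j w_j,0
    weights = w_j1 - w_j0
    return bias, weights                                  # predict y=1 if bias + weights @ x > 0

Note that zero counts make the logs blow up here, which is exactly the problem Laplace smoothing (a few slides below) addresses.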

Text classification example Using Naïve Bayes, we can compute probabilities for all the words which appear in the document collection. P(y=c) is the probability of class c. P(x_j|y=c) is the probability of seeing word j in documents of class c. [Graphical model: class c with arrows to word 1, word 2, word 3, ..., word m.] 21 Joelle Pineau

Text classification example Using Naïve Bayes, we can compute probabilities for all the words which appear in the document collection. P(y=c) is the probability of class c. P(x_j|y=c) is the probability of seeing word j in documents of class c. The set of classes depends on the application, e.g. topic modeling: each class corresponds to documents on a given topic, e.g. {Politics, Finance, Sports, Arts}. What happens when a word is not observed in the training data? 22 Joelle Pineau

Laplace smoothing Replace the maximum likelihood estimator: Pr(x_j=1|y=1) = (number of instances with x_j=1 and y=1) / (number of examples with y=1). 23 Joelle Pineau

Laplace smoothing Replace the maximum likelihood estimator: Pr(x_j=1|y=1) = (number of instances with x_j=1 and y=1) / (number of examples with y=1). With the following: Pr(x_j=1|y=1) = (number of instances with x_j=1 and y=1, plus 1) / (number of examples with y=1, plus 2). 24 Joelle Pineau

Laplace smoothing Replace the maximum likelihood estimator: Pr(x_j=1|y=1) = (number of instances with x_j=1 and y=1) / (number of examples with y=1). With the following: Pr(x_j=1|y=1) = (number of instances with x_j=1 and y=1, plus 1) / (number of examples with y=1, plus 2). If there are no examples from that class, the estimate reduces to a prior probability of Pr = 1/2. If all examples have x_j=1, then Pr(x_j=0|y) = 1 / (#examples + 2). If a word appears frequently, the new estimate is only slightly biased. This is a form of regularization (it decreases variance at the cost of bias). 25 Joelle Pineau
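
The same counting estimator with the +1 / +2 correction, as a sketch (not from the slides; X and y as in the earlier sketch):

import numpy as np

def fit_naive_bayes_smoothed(X, y):
    """Laplace-smoothed estimates of P(x_j = 1 | y = c) for c in {0, 1}."""
    theta_j1 = (X[y == 1].sum(axis=0) + 1) / ((y == 1).sum() + 2)
    theta_j0 = (X[y == 0].sum(axis=0) + 1) / ((y == 0).sum() + 2)
    return theta_j1, theta_j0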

Example: 20 newsgroups Given 1000 training documents from each group, learn to classify new documents according to which newsgroup they came from: comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x, alt.atheism, soc.religion.christian, talk.religion.misc, talk.politics.mideast, talk.politics.misc, misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey, sci.space, sci.crypt, sci.electronics, sci.med, talk.politics.guns. Naïve Bayes: 89% classification accuracy (comparable to other state-of-the-art methods). 26 Joelle Pineau

Gaussian Naïve Bayes Extending Naïve Bayes to continuous inputs: P(y) is still assumed to be a binomial distribution. P(x|y) is assumed to be a multivariate Gaussian (normal) distribution with mean μ ∈ R^n and covariance matrix Σ ∈ R^{n×n}. 27 Joelle Pineau

Gaussian Naïve Bayes Extending Naïve Bayes to continuous inputs: P(y) is still assumed to be a binomial distribution. P(x|y) is assumed to be a multivariate Gaussian (normal) distribution with mean μ ∈ R^n and covariance matrix Σ ∈ R^{n×n}. If we assume the same Σ for all classes: linear discriminant analysis. If Σ is distinct between classes: quadratic discriminant analysis. If Σ is diagonal (i.e. features are independent): Gaussian Naïve Bayes (linear if the same Σ is shared by all classes). 28 Joelle Pineau

Gaussian Naïve Bayes Extending Naïve Bayes to continuous inputs: P(y) is still assumed to be a binomial distribution. P(x|y) is assumed to be a multivariate Gaussian (normal) distribution with mean μ ∈ R^n and covariance matrix Σ ∈ R^{n×n}. If we assume the same Σ for all classes: linear discriminant analysis. If Σ is distinct between classes: quadratic discriminant analysis. If Σ is diagonal (i.e. features are independent): Gaussian Naïve Bayes (linear if the same Σ is shared by all classes). How do we estimate the parameters? Derive the maximum likelihood estimators for μ and Σ. 29 Joelle Pineau
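
For the diagonal-covariance case (Gaussian Naïve Bayes) the maximum likelihood estimators reduce to per-class, per-feature means and variances. A minimal sketch (not from the slides; it assumes continuous features in an n×m array X and binary labels y):

import numpy as np

def fit_gaussian_nb(X, y):
    """Per-class mean and diagonal covariance (Gaussian Naive Bayes)."""
    params = {}
    for c in (0, 1):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), Xc.var(axis=0))   # mu_c, diag(Sigma_c)
    return params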

Modeling for binary classification Two probabilistic approaches: 1. Generative learning: Separately model P(x|y) and P(y). Use Bayes rule to estimate P(y|x): P(y=1|x) = P(x|y=1)P(y=1) / P(x). 2. Discriminative learning: Directly estimate P(y|x). 30 Joelle Pineau

Discriminative learning We have seen that under several assumptions, we get linear decision boundaries: p(x|y) are Gaussian with shared covariance (LDA); p(x|y) are independent Bernoulli distributions (Naïve Bayes). Do we really need to estimate p(x|y) and p(y)? Can we directly find the parameters of the best decision boundary? E.g. the covariance matrix requires estimating O(m^2) parameters, but the decision boundary only requires O(m) parameters. 31 Joelle Pineau

Probabilistic view of discriminative learning Suppose we have 2 classes: y ∈ {0, 1}. What is the probability of a given input x having class y=1? Consider Bayes rule:
P(y=1|x) = P(x, y=1) / P(x) = P(x|y=1)P(y=1) / [ P(x|y=1)P(y=1) + P(x|y=0)P(y=0) ]
32 Joelle Pineau

Probabilistic view of discriminative learning Suppose we have 2 classes: y ∈ {0, 1}. What is the probability of a given input x having class y=1? Consider Bayes rule:
P(y=1|x) = P(x, y=1) / P(x) = P(x|y=1)P(y=1) / [ P(x|y=1)P(y=1) + P(x|y=0)P(y=0) ]
= 1 / [ 1 + P(x|y=0)P(y=0) / (P(x|y=1)P(y=1)) ]
= 1 / [ 1 + exp( ln [ P(x|y=0)P(y=0) / (P(x|y=1)P(y=1)) ] ) ]
= 1 / (1 + exp(-a)) = σ(a)
33 Joelle Pineau

Probabilistic view of discriminative learning Suppose we have 2 classes: y ∈ {0, 1}. What is the probability of a given input x having class y=1? Consider Bayes rule:
P(y=1|x) = P(x, y=1) / P(x) = P(x|y=1)P(y=1) / [ P(x|y=1)P(y=1) + P(x|y=0)P(y=0) ]
= 1 / [ 1 + P(x|y=0)P(y=0) / (P(x|y=1)P(y=1)) ]
= 1 / [ 1 + exp( ln [ P(x|y=0)P(y=0) / (P(x|y=1)P(y=1)) ] ) ]
= 1 / (1 + exp(-a)) = σ(a)
where a = ln [ P(x|y=1)P(y=1) / (P(x|y=0)P(y=0)) ] = ln [ P(y=1|x) / P(y=0|x) ] (by Bayes rule; P(x) on top and bottom cancels out).
34 Joelle Pineau

Probabilistic view of discriminative learning Suppose we have 2 classes: y ∈ {0, 1}. What is the probability of a given input x having class y=1? Consider Bayes rule:
P(y=1|x) = P(x, y=1) / P(x) = P(x|y=1)P(y=1) / [ P(x|y=1)P(y=1) + P(x|y=0)P(y=0) ]
= 1 / [ 1 + P(x|y=0)P(y=0) / (P(x|y=1)P(y=1)) ]
= 1 / [ 1 + exp( ln [ P(x|y=0)P(y=0) / (P(x|y=1)P(y=1)) ] ) ]
= 1 / (1 + exp(-a)) = σ(a)
where a = ln [ P(x|y=1)P(y=1) / (P(x|y=0)P(y=0)) ] = ln [ P(y=1|x) / P(y=0|x) ] (by Bayes rule; P(x) on top and bottom cancels out).
Here σ has a special form, called the logistic function, and a is the log-odds ratio of the data being class 1 vs. class 0.
35 Joelle Pineau

Discriminative learning: Logistic regression The logistic function (= sigmoid curve): σ(w^T x) = 1 / (1 + e^{-w^T x}). Transforms the learned function s.t. it can be interpreted as a probability. 36 Joelle Pineau

Discriminative learning: Logistic regression The logistic function (= sigmoid curve): σ(w^T x) = 1 / (1 + e^{-w^T x}). Transforms the learned function s.t. it can be interpreted as a probability. The decision boundary is the set of points for which a=0. Idea: Directly model the log-odds with a linear function:
a = ln [ P(x|y=1)P(y=1) / (P(x|y=0)P(y=0)) ] = w_0 + w_1 x_1 + ... + w_m x_m
37 Joelle Pineau

Discriminative learning: Logistic regression The logistic function (= sigmoid curve): σ(w^T x) = 1 / (1 + e^{-w^T x}). Transforms the learned function s.t. it can be interpreted as a probability. The decision boundary is the set of points for which a=0. Idea: Directly model the log-odds with a linear function:
a = ln [ P(x|y=1)P(y=1) / (P(x|y=0)P(y=0)) ] = w_0 + w_1 x_1 + ... + w_m x_m
How do we find the weights? Need an objective function! 38 Joelle Pineau
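
A sketch of the model itself (not from the slides): the predicted probability is the logistic function applied to a linear score; here the bias w_0 is assumed to be absorbed into w via a constant first column of X, which is my convention, not the slides'.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba(w, X):
    """P(y = 1 | x) = sigma(w^T x) for each row of X (first column of X assumed to be all ones)."""
    return sigmoid(X @ w)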

Fitting the weights Recall: σ(w^T x_i) is the probability that y_i=1 (given x_i), and 1-σ(w^T x_i) is the probability that y_i=0. For y ∈ {0, 1}, the likelihood function, Pr(x_1, y_1, ..., x_n, y_n | w), is:
Π_{i=1:n} σ(w^T x_i)^{y_i} (1-σ(w^T x_i))^{1-y_i}   (samples are i.i.d.)
39 Joelle Pineau

Fitting the weights Recall: σ(w^T x_i) is the probability that y_i=1 (given x_i), and 1-σ(w^T x_i) is the probability that y_i=0. For y ∈ {0, 1}, the likelihood function, Pr(x_1, y_1, ..., x_n, y_n | w), is:
Π_{i=1:n} σ(w^T x_i)^{y_i} (1-σ(w^T x_i))^{1-y_i}   (samples are i.i.d.)
Goal: Minimize the negative log-likelihood (also called the cross-entropy error function):
- Σ_{i=1:n} [ y_i log(σ(w^T x_i)) + (1-y_i) log(1-σ(w^T x_i)) ]
40 Joelle Pineau
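
The negative log-likelihood above, written out as a sketch (not from the slides; the small epsilon for numerical safety is my addition):

import numpy as np

def cross_entropy(w, X, y, eps=1e-12):
    """Negative log-likelihood (cross-entropy error) of logistic regression."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))          # sigma(w^T x_i) for each sample
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))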

Gradient descent for logistic regression Error fn: Err(w) = - Σ_{i=1:n} [ y_i log(σ(w^T x_i)) + (1-y_i) log(1-σ(w^T x_i)) ]. Take the derivative: 41 Joelle Pineau

Gradient descent for logistic regression Error fn: Err(w) = - Σ_{i=1:n} [ y_i log(σ(w^T x_i)) + (1-y_i) log(1-σ(w^T x_i)) ]. Take the derivative (using d log(σ)/dσ = 1/σ):
∂Err(w)/∂w = - [ Σ_{i=1:n} y_i (1/σ(w^T x_i)) (1-σ(w^T x_i)) σ(w^T x_i) x_i +
42 Joelle Pineau

Gradient descent for logistic regression Error fn: Err(w) = - Σ_{i=1:n} [ y_i log(σ(w^T x_i)) + (1-y_i) log(1-σ(w^T x_i)) ]. Take the derivative (using dσ(a)/da = σ(a)(1-σ(a))):
∂Err(w)/∂w = - [ Σ_{i=1:n} y_i (1/σ(w^T x_i)) (1-σ(w^T x_i)) σ(w^T x_i) x_i +
43 Joelle Pineau

Gradient descent for logistic regression Error fn: Err(w) = - Σ_{i=1:n} [ y_i log(σ(w^T x_i)) + (1-y_i) log(1-σ(w^T x_i)) ]. Take the derivative (using ∂(w^T x)/∂w = x):
∂Err(w)/∂w = - [ Σ_{i=1:n} y_i (1/σ(w^T x_i)) (1-σ(w^T x_i)) σ(w^T x_i) x_i +
44 Joelle Pineau

Gradient descent for logistic regression Error fn: Err(w) = - Σ_{i=1:n} [ y_i log(σ(w^T x_i)) + (1-y_i) log(1-σ(w^T x_i)) ]. Take the derivative (using d(1-σ(a))/da = -σ(a)(1-σ(a))):
∂Err(w)/∂w = - [ Σ_{i=1:n} y_i (1/σ(w^T x_i)) (1-σ(w^T x_i)) σ(w^T x_i) x_i + (1-y_i) (1/(1-σ(w^T x_i))) (1-σ(w^T x_i)) σ(w^T x_i) (-1) x_i ]
45 Joelle Pineau

Gradient descent for logistic regression Error fn: Err(w) = - Σ_{i=1:n} [ y_i log(σ(w^T x_i)) + (1-y_i) log(1-σ(w^T x_i)) ]. Take the derivative:
∂Err(w)/∂w = - [ Σ_{i=1:n} y_i (1/σ(w^T x_i)) (1-σ(w^T x_i)) σ(w^T x_i) x_i + (1-y_i) (1/(1-σ(w^T x_i))) (1-σ(w^T x_i)) σ(w^T x_i) (-1) x_i ]
= - Σ_{i=1:n} x_i ( y_i (1-σ(w^T x_i)) - (1-y_i) σ(w^T x_i) )
= - Σ_{i=1:n} x_i ( y_i - σ(w^T x_i) )
46 Joelle Pineau

Gradient descent for logistic regression Error fn: Err(w) = - Σ_{i=1:n} [ y_i log(σ(w^T x_i)) + (1-y_i) log(1-σ(w^T x_i)) ]. Take the derivative:
∂Err(w)/∂w = - [ Σ_{i=1:n} y_i (1/σ(w^T x_i)) (1-σ(w^T x_i)) σ(w^T x_i) x_i + (1-y_i) (1/(1-σ(w^T x_i))) (1-σ(w^T x_i)) σ(w^T x_i) (-1) x_i ]
= - Σ_{i=1:n} x_i ( y_i (1-σ(w^T x_i)) - (1-y_i) σ(w^T x_i) )
= - Σ_{i=1:n} x_i ( y_i - σ(w^T x_i) )
Now apply iteratively: w_{k+1} = w_k + α_k Σ_{i=1:n} x_i (y_i - σ(w_k^T x_i))
47 Joelle Pineau

Gradient descent for logistic regression Error fn: Err(w) = - Σ_{i=1:n} [ y_i log(σ(w^T x_i)) + (1-y_i) log(1-σ(w^T x_i)) ]. Take the derivative:
∂Err(w)/∂w = - [ Σ_{i=1:n} y_i (1/σ(w^T x_i)) (1-σ(w^T x_i)) σ(w^T x_i) x_i + (1-y_i) (1/(1-σ(w^T x_i))) (1-σ(w^T x_i)) σ(w^T x_i) (-1) x_i ]
= - Σ_{i=1:n} x_i ( y_i (1-σ(w^T x_i)) - (1-y_i) σ(w^T x_i) )
= - Σ_{i=1:n} x_i ( y_i - σ(w^T x_i) )
Now apply iteratively: w_{k+1} = w_k + α_k Σ_{i=1:n} x_i (y_i - σ(w_k^T x_i))
Can also apply other iterative methods, e.g. Newton's method, coordinate descent, L-BFGS, etc. 48 Joelle Pineau
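
Putting the gradient and the update rule together, a minimal batch gradient descent sketch (not from the slides; the fixed step size alpha, the iteration count, and the function name are illustrative assumptions):

import numpy as np

def fit_logistic_regression(X, y, alpha=0.01, n_iters=1000):
    """Batch gradient descent on the cross-entropy error."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))      # sigma(w^T x_i) for each sample
        grad = -X.T @ (y - p)                   # dErr/dw = -sum_i x_i (y_i - sigma(w^T x_i))
        w = w - alpha * grad                    # same as w + alpha * sum_i x_i (y_i - p_i)
    return w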

Multi-class classification Generally two options: 1. Learn a single classifier that can produce 20 distinct output values. 2. Learn 20 different 1-vs-all binary classifiers. 49 Joelle Pineau

Multi-class classification Generally two options: 1. Learn a single classifier that can produce 20 distinct output values. 2. Learn 20 different 1-vs-all binary classifiers. Option 1 assumes you have a multi-class version of the classifier. For Naïve Bayes, compute P(y|x) for each class, and select the class with the highest probability. 50 Joelle Pineau

Multi-class classification Generally two options: 1. Learn a single classifier that can produce 20 distinct output values. 2. Learn 20 different 1-vs-all binary classifiers. Option 1 assumes you have a multi-class version of the classifier. For Naïve Bayes, compute P(y|x) for each class, and select the class with the highest probability. Option 2 applies to all binary classifiers, so it is more flexible. But: it is often slower (need to learn many classifiers); it creates a class imbalance problem (say, 5% vs 95% for 20 classes); and what if two classifiers claim an example belongs to their class? Or none do? 51 Joelle Pineau
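
A sketch of option 2 (one-vs-all), reusing fit_logistic_regression from the earlier sketch; breaking ties by taking the class with the largest score is one common way to handle the ambiguity mentioned above, not something prescribed by the slides:

import numpy as np

def fit_one_vs_all(X, y, n_classes):
    """Train one binary logistic regression per class (class c vs. the rest)."""
    return [fit_logistic_regression(X, (y == c).astype(float)) for c in range(n_classes)]

def predict_one_vs_all(models, X):
    scores = np.column_stack([X @ w for w in models])   # higher score = higher P(y = c | x)
    return scores.argmax(axis=1)                        # pick the most confident classifier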

Comparing linear classification methods

Technique                         Error rate (training)   Error rate (test)
Linear regression                        0.48                   0.67
Linear discriminant analysis             0.32                   0.56
Quadratic discriminant analysis          0.01                   0.53
Logistic regression                      0.22                   0.51

[Figure 4.4: A two-dimensional plot of the vowel training data (Coordinate 1 vs. Coordinate 2 for the training data). There are eleven classes with X ∈ R^10, and this is the best view in terms of an LDA model (Section 4.3.3). The heavy circles are the projected mean vectors for each class. The class overlap is considerable.]
52 Joelle Pineau

Discriminative vs generative Discriminative classifiers often have fewer parameters to estimate. Discriminative classifiers often do better, but: a generative model might give us more insight into the data; it can tell us when all classes are bad (low probability). With many classes, discriminative models need to find the decision boundary between every pair. 53 Joelle Pineau

What you should know Naïve Bayes assumption. Log-odds ratio decision boundary. How to estimate parameters for Naïve Bayes. Laplace smoothing. Relation between Naïve Bayes, LDA, QDA, Gaussian Naïve Bayes. Derivation of logistic regression. Worth reading further: relation between logistic regression and LDA (Hastie et al., 4.4.5). 54 Joelle Pineau

What you should know 55 Joelle Pineau