COMP 551 Applied Machine Learning Lecture 4: Linear classification

1 COMP 551 Applied Machine Learning Lecture 4: Linear classification
Instructor: Joelle Pineau
Class web page:
Unless otherwise noted, all material posted for this course is copyright of the instructor, and cannot be reused or reposted without the instructor's written permission.

2 Today's Quiz
1. What is meant by the term overfitting? What can cause overfitting? How can one avoid overfitting?
2. Which of the following increases the chances of overfitting (assuming everything else is held constant):
a) Reducing the size of the training set.
b) Increasing the size of the training set.
c) Reducing the size of the test set.
d) Increasing the size of the test set.
e) Reducing the number of features.
f) Increasing the number of features.

3 Evaluation
Use cross-validation for model selection.
Training set is used to select a hypothesis f from a class of hypotheses F (e.g. regression of a given degree).
Validation set is used to compare the best f from each hypothesis class across different classes (e.g. different degree regression). It must be untouched during the process of looking for f within a class F.
Test set: Ideally, a separate set of (labeled) data is withheld to get a true estimate of the generalization error. (Often the validation set is called the test set, without distinction.)

4 Validation vs Train error
[From Hastie et al. textbook]
[Figure: Test and training error as a function of model complexity. Prediction error for the test sample and the training sample is plotted against model complexity, from low (high bias, low variance) to high (low bias, high variance).]

5 Bias vs Variance
Gauss-Markov Theorem says: The least-squares estimates of the parameters w have the smallest variance among all linear unbiased estimates.
Insight: Find a lower variance solution, at the expense of some bias.
E.g. Include a penalty for model complexity in the error to reduce overfitting:
$$\mathrm{Err}(w) = \sum_{i=1}^{n} (y_i - w^T x_i)^2 + \lambda \cdot \text{model\_size}$$
$\lambda$ is a hyper-parameter that controls the penalty size.

6 Ridge regression (aka L2-regularization)
Constrains the weights by imposing a penalty on their size:
$$\hat{w}^{\text{ridge}} = \arg\min_{w} \Big\{ \sum_{i=1}^{n} (y_i - w^T x_i)^2 + \lambda \sum_{j=0}^{m} w_j^2 \Big\}$$
where $\lambda$ can be selected manually, or by cross-validation.
Do a little algebra to get the solution:
$$\hat{w}^{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T Y$$
The ridge solution is not equivariant under scaling of the data, so typically need to normalize the inputs first.
Ridge gives a smooth solution, effectively shrinking the weights, but drives few weights to 0.
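
The closed-form ridge solution above can be sketched in a few lines of plain Python. This is a minimal illustration, not the course's reference code: the helper names `solve` and `ridge_fit` and the toy data are assumptions made for the example, and the inputs are not normalized here.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix [A | b]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def ridge_fit(X, y, lam):
    """Return w solving (X^T X + lam * I) w = X^T y, with X a list of rows."""
    m = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(m)] for a in range(m)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(m)]
    for j in range(m):
        XtX[j][j] += lam  # the slide's penalty covers every weight, j = 0..m
    return solve(XtX, Xty)


# Hypothetical toy data: y = 1 + 2x exactly; the first column is the constant feature.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

w_ols = ridge_fit(X, y, 0.0)     # lambda = 0 recovers ordinary least squares
w_ridge = ridge_fit(X, y, 10.0)  # a larger lambda shrinks the weights
print(w_ols, w_ridge)
```

With `lam = 0` the function reduces to ordinary least squares, so comparing the two weight vectors shows the shrinkage effect the slide describes.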

7 Lasso regression (aka L1-regularization)
Constrains the weights by penalizing the absolute value of their size:
$$\hat{w}^{\text{lasso}} = \arg\min_{w} \Big\{ \sum_{i=1}^{n} (y_i - w^T x_i)^2 + \lambda \sum_{j=1}^{m} |w_j| \Big\}$$
Now the solution is non-linear in the output y, and there is no closed-form solution.
Need to solve a quadratic programming problem instead.
More computationally expensive than Ridge regression.
Effectively sets the weights of less relevant input features to zero.

8 Comparing Ridge and Lasso
[Figure: two panels, Ridge and Lasso, in the (w1, w2) plane. Each shows contours of equal regression error and contours of equal model complexity penalty.]
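
One way to see the difference numerically is the special case of orthonormal inputs ($X^T X = I$), which is an assumption added here and not part of the slides. In that case both objectives separate per weight: ridge rescales each least-squares weight by $1/(1+\lambda)$, while lasso soft-thresholds it at $\lambda/2$ (for the objective as written above, without a factor of 1/2 on the squared error). The function names and the weight values below are illustrative.

```python
import math


def ridge_orthonormal(w_ols, lam):
    """Ridge weights when X^T X = I: uniform shrinkage, never exactly zero."""
    return [w / (1.0 + lam) for w in w_ols]


def lasso_orthonormal(w_ols, lam):
    """Lasso weights when X^T X = I: soft-thresholding at lam / 2."""
    return [math.copysign(max(abs(w) - lam / 2.0, 0.0), w) for w in w_ols]


# Hypothetical least-squares weights; the middle one belongs to a weak feature.
w_ols = [3.0, 0.4, -1.0]
lam = 1.0

w_ridge = ridge_orthonormal(w_ols, lam)
w_lasso = lasso_orthonormal(w_ols, lam)
print(w_ridge)
print(w_lasso)
```

In this sketch the weak feature's weight is only halved by ridge, while lasso sets it to zero.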

9 A quick look at evaluation functions
We call $L(Y, f_w(X))$ the loss function.
Least-square / Mean squared-error (MSE) loss:
$$L(Y, f_w(X)) = \sum_{i=1}^{n} (y_i - w^T x_i)^2$$
Other loss functions?
Absolute error loss:
$$L(Y, f_w(X)) = \sum_{i=1}^{n} |y_i - w^T x_i|$$
0-1 loss (for classification):
$$L(Y, f_w(X)) = \sum_{i=1}^{n} I\big(y_i \neq f_w(x_i)\big)$$
Different loss functions make different assumptions.
Squared error loss assumes the data can be approximated by a global linear model with Gaussian noise.
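
The three losses above can be written directly as sums over examples. This is a small sketch with made-up numbers; the function names and the 0.5 threshold used to turn scores into labels are assumptions for the example.

```python
def squared_error_loss(y, y_hat):
    """Sum of squared differences between targets and predictions."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, y_hat))


def absolute_error_loss(y, y_hat):
    """Sum of absolute differences between targets and predictions."""
    return sum(abs(yi - pi) for yi, pi in zip(y, y_hat))


def zero_one_loss(y, y_pred):
    """Number of examples whose predicted label differs from the true label."""
    return sum(1 for yi, pi in zip(y, y_pred) if yi != pi)


# Hypothetical targets and real-valued model outputs.
y_true = [1, 0, 1]
scores = [0.8, 0.3, 0.4]
labels = [1 if s > 0.5 else 0 for s in scores]  # assumed threshold of 0.5

print(squared_error_loss(y_true, scores))
print(absolute_error_loss(y_true, scores))
print(zero_one_loss(y_true, labels))
```

The squared and absolute losses are computed on the real-valued outputs, while the 0-1 loss only counts whether the thresholded label is wrong.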

10 Next: Linear models for classification
[Figure: Linear Regression of 0/1 Response]
FIGURE 2.1. A classification example in two dimensions. The classes are coded as a binary variable (BLUE = 0, ORANGE = 1), and then fit by linear regression. The line is the decision boundary defined by $x^T \hat{\beta} = 0.5$. The orange shaded region denotes that part of input space classified as ORANGE, while the blue region is classified as BLUE.

11 Classification problems
Given data set $D = \langle x_i, y_i \rangle$, $i = 1:n$, with discrete $y_i$, find a hypothesis which best fits the data.
If $y_i \in \{0, 1\}$ this is binary classification.
If $y_i$ can take more than two values, the problem is called multi-class classification.

12 Applications of classification
Text classification (spam filtering, news filtering, building web directories, etc.)
Image classification (face detection, object recognition, etc.)
Prediction of cancer recurrence.
Financial forecasting.
Many, many more!

13 Simple example
Given nucleus size, predict cancer recurrence.
Univariate input: X = nucleus size.
Binary output: Y = {NoRecurrence = 0; Recurrence = 1}
Try: Minimize the least-square error.
[Figure: two histograms over nucleus size, one of the non-recurrence count (NoRecurrence) and one of the recurrence count (Recurrence).]

14 Predicting a class from linear regression
Here the red line is: $\hat{Y} = X (X^T X)^{-1} X^T Y$
How to get a binary output?
1. Threshold the output: { y <= t for NoRecurrence, y > t for Recurrence }
2. Interpret output as probability: y = Pr(Recurrence)
Can we find a better model?
[Figure: least-squares line (red) fit to the 0/1 recurrence labels.]
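
A short sketch of option 1, thresholding a least-squares fit, is given below. The nucleus sizes and labels are invented for illustration (they are not the data behind the slide's figure), and `fit_line`, `classify`, and the threshold `t = 0.5` are assumptions of the example.

```python
def fit_line(xs, ys):
    """Univariate least squares: return (w0, w1) minimizing sum (y - w0 - w1*x)^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w1 = sxy / sxx
    w0 = mean_y - w1 * mean_x
    return w0, w1


def classify(x, w0, w1, t=0.5):
    """Threshold the regression output: 1 (Recurrence) if above t, else 0."""
    return 1 if w0 + w1 * x > t else 0


# Hypothetical nucleus sizes with 0 = NoRecurrence, 1 = Recurrence.
sizes = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

w0, w1 = fit_line(sizes, labels)
predictions = [classify(x, w0, w1) for x in sizes]
raw_output_smallest = w0 + w1 * sizes[0]
print(predictions, raw_output_smallest)
```

On this toy data the raw regression output for the smallest nucleus is negative, which hints at why reading the output as a probability (option 2) is awkward and motivates looking for a better model.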

15 Modeling for binary classification
Two probabilistic approaches:
1. Discriminative learning: Directly estimate $P(y \mid x)$.
2. Generative learning: Separately model $P(x \mid y)$ and $P(y)$. Use Bayes rule to estimate $P(y \mid x)$:
$$P(y=1 \mid x) = \frac{P(x \mid y=1)\,P(y=1)}{P(x)}$$

16 Probabilistic view of discriminative learning
Suppose we have 2 classes: $y \in \{0, 1\}$
What is the probability of a given input x having class y = 1?
Consider Bayes rule:
$$P(y=1 \mid x) = \frac{P(x, y=1)}{P(x)} = \frac{P(x \mid y=1)P(y=1)}{P(x \mid y=1)P(y=1) + P(x \mid y=0)P(y=0)}$$
$$= \frac{1}{1 + \dfrac{P(x \mid y=0)P(y=0)}{P(x \mid y=1)P(y=1)}} = \frac{1}{1 + \exp\Big(\ln \dfrac{P(x \mid y=0)P(y=0)}{P(x \mid y=1)P(y=1)}\Big)} = \frac{1}{1 + \exp(-a)} = \sigma(a)$$
where
$$a = \ln \frac{P(x \mid y=1)P(y=1)}{P(x \mid y=0)P(y=0)} = \ln \frac{P(y=1 \mid x)}{P(y=0 \mid x)}$$
(By Bayes rule; P(x) on top and bottom cancels out.)
Here $\sigma$ has a special form, called the logistic function, and a is the log-odds ratio of data being class 1 vs. class 0.
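
The identity above can be checked numerically: the posterior computed directly from Bayes rule should match the logistic function applied to the log-odds. The sketch below assumes Gaussian class-conditional densities and specific priors purely for illustration; none of these values come from the slides.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian, used here as an assumed P(x | y)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def sigmoid(a):
    """Logistic function sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))


# Assumed priors and class-conditional parameters (hypothetical values).
prior1, prior0 = 0.3, 0.7
x = 1.2
lik1 = gaussian_pdf(x, mean=2.0, std=1.0)  # P(x | y = 1)
lik0 = gaussian_pdf(x, mean=0.0, std=1.0)  # P(x | y = 0)

# Posterior straight from Bayes rule.
posterior_bayes = lik1 * prior1 / (lik1 * prior1 + lik0 * prior0)

# Posterior via the log-odds a and the logistic function.
log_odds = math.log((lik1 * prior1) / (lik0 * prior0))
posterior_sigmoid = sigmoid(log_odds)

print(posterior_bayes, posterior_sigmoid)
```

Any other choice of densities and priors should give the same agreement, since the derivation does not depend on the form of $P(x \mid y)$.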

17 Discriminative learning: Logistic regression
Idea: Directly model the log-odds with a linear function:
$$a = \ln \frac{P(x \mid y=1)P(y=1)}{P(x \mid y=0)P(y=0)} = w_0 + w_1 x_1 + \dots + w_m x_m$$
The decision boundary is the set of points for which a = 0.
The logistic function (= sigmoid curve):
$$\sigma(w^T x) = \frac{1}{1 + e^{-w^T x}}$$
How do we find the weights? Need an optimization function.
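
As a small illustration of the model above, the sketch below evaluates $\sigma(w^T x)$ and predicts class 1 when the log-odds $a = w^T x$ is positive. The weights are hypothetical, and each input is assumed to carry a leading constant 1 so that $w_0$ acts as the bias.

```python
import math


def sigmoid(a):
    """Logistic function sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))


def log_odds(w, x):
    """Linear model of the log-odds: a = w^T x."""
    return sum(wj * xj for wj, xj in zip(w, x))


def predict(w, x):
    """Predict class 1 when a > 0, i.e. when sigma(a) > 0.5."""
    return 1 if log_odds(w, x) > 0 else 0


# Hypothetical weights; x[0] = 1 is the constant feature.
w = [-3.0, 1.5]
x_boundary = [1.0, 2.0]  # a = 0: a point on the decision boundary
x_positive = [1.0, 4.0]  # a = 3
x_negative = [1.0, 0.0]  # a = -3

p_boundary = sigmoid(log_odds(w, x_boundary))
print(p_boundary, predict(w, x_positive), predict(w, x_negative))
```

Points on the decision boundary get probability exactly 0.5, which is why the boundary is the set where a = 0.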

18 Fitting the weights
Recall: $\sigma(w^T x_i)$ is the probability that $y_i = 1$ (given $x_i$), and $1 - \sigma(w^T x_i)$ is the probability that $y_i = 0$.
For $y \in \{0, 1\}$, the likelihood function, $\Pr(x_1, y_1, \dots, x_n, y_n \mid w)$, is:
$$\prod_{i=1}^{n} \sigma(w^T x_i)^{y_i} \, (1 - \sigma(w^T x_i))^{(1 - y_i)}$$
(samples are i.i.d.)
Goal: Maximize the log-likelihood, i.e. minimize the negative log-likelihood (also called the cross-entropy error function):
$$-\sum_{i=1}^{n} \big[ y_i \log \sigma(w^T x_i) + (1 - y_i) \log(1 - \sigma(w^T x_i)) \big]$$
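
The relationship between the likelihood and the cross-entropy error can be checked on a toy data set. The data, the weights, and the helper names below are assumptions for the example; each input again starts with a constant 1.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def likelihood(w, X, y):
    """Product over examples of sigma^y * (1 - sigma)^(1 - y)."""
    prod = 1.0
    for xi, yi in zip(X, y):
        p = sigmoid(dot(w, xi))
        prod *= p ** yi * (1.0 - p) ** (1 - yi)
    return prod


def cross_entropy(w, X, y):
    """Negative log-likelihood of the labels under the logistic model."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(dot(w, xi))
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return -total


# Hypothetical data and weights.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]]
y = [0, 0, 1, 1]
w = [-4.0, 2.0]

nll = cross_entropy(w, X, y)
lik = likelihood(w, X, y)
nll_zero_weights = cross_entropy([0.0, 0.0], X, y)
print(nll, -math.log(lik), nll_zero_weights)
```

The cross-entropy equals minus the log of the likelihood, and weights that fit the labels better than the all-zero weights give a lower error.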

19-22 Gradient descent for logistic regression
Error fn:
$$\mathrm{Err}(w) = -\sum_{i=1}^{n} \big[ y_i \log \sigma(w^T x_i) + (1 - y_i) \log(1 - \sigma(w^T x_i)) \big]$$
Take the derivative, using the chain rule with $a = w^T x$:
$\partial \log \sigma / \partial \sigma = 1/\sigma$
$\partial \sigma(a) / \partial a = \sigma(a)(1 - \sigma(a))$
$\partial (w^T x) / \partial w = x$
$\partial (1 - \sigma(a)) / \partial a = (1 - \sigma(a))\,\sigma(a)\,(-1)$
$$\frac{\partial \mathrm{Err}(w)}{\partial w} = -\sum_{i=1}^{n} \Big[ y_i \frac{1}{\sigma(w^T x_i)} (1 - \sigma(w^T x_i))\,\sigma(w^T x_i)\,x_i + (1 - y_i) \frac{1}{1 - \sigma(w^T x_i)} (1 - \sigma(w^T x_i))\,\sigma(w^T x_i)\,(-1)\,x_i \Big]$$

23 Gradient descent for logistic regression
Simplify the derivative by cancelling the $\sigma$ and $1 - \sigma$ factors in each term:
$$\frac{\partial \mathrm{Err}(w)}{\partial w} = -\sum_{i=1}^{n} x_i \big( y_i (1 - \sigma(w^T x_i)) - (1 - y_i)\,\sigma(w^T x_i) \big) = -\sum_{i=1}^{n} x_i \big( y_i - \sigma(w^T x_i) \big)$$
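
A quick way to gain confidence in the final expression is to compare it against a finite-difference approximation of the error function. The sketch below does this on hypothetical data; the helper names and the step size `eps` are assumptions for the example.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def error(w, X, y):
    """Cross-entropy error Err(w)."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(dot(w, xi))
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return -total


def gradient(w, X, y):
    """Analytic gradient: -sum_i x_i * (y_i - sigma(w^T x_i))."""
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        residual = yi - sigmoid(dot(w, xi))
        for j, xij in enumerate(xi):
            grad[j] -= xij * residual
    return grad


def numerical_gradient(w, X, y, eps=1e-6):
    """Central finite differences of Err(w), one coordinate at a time."""
    grad = []
    for j in range(len(w)):
        w_plus = w[:j] + [w[j] + eps] + w[j + 1:]
        w_minus = w[:j] + [w[j] - eps] + w[j + 1:]
        grad.append((error(w_plus, X, y) - error(w_minus, X, y)) / (2.0 * eps))
    return grad


# Hypothetical data (leading 1 is the constant feature) and weights.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0]]
y = [0, 0, 1, 0, 1, 1]
w = [0.3, -0.7]

g_analytic = gradient(w, X, y)
g_numeric = numerical_gradient(w, X, y)
max_gap = max(abs(a - b) for a, b in zip(g_analytic, g_numeric))
print(g_analytic, g_numeric, max_gap)
```

If the two gradients disagreed by more than the finite-difference error, that would point to a sign or cancellation mistake in the derivation.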

24 Gradient descent for logistic regression
Gradient of the error function:
$$\frac{\partial \mathrm{Err}(w)}{\partial w} = -\sum_{i=1}^{n} x_i \big( y_i - \sigma(w^T x_i) \big)$$
Now apply iteratively:
$$w_{k+1} = w_k + \alpha_k \sum_{i=1}^{n} x_i \big( y_i - \sigma(w_k^T x_i) \big)$$
Can also apply other iterative methods, e.g. Newton's method, coordinate descent, L-BFGS, etc.
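
The update rule can be turned into a short training loop. This is a minimal sketch under stated assumptions: a constant step size `alpha` stands in for the slide's $\alpha_k$, the number of iterations is fixed in advance, and the data set is a made-up, non-separable example with a leading constant feature.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def error(w, X, y):
    """Cross-entropy error Err(w)."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(dot(w, xi))
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return -total


def fit_logistic(X, y, alpha=0.1, n_iters=1000):
    """Batch gradient descent: w <- w + alpha * sum_i x_i (y_i - sigma(w^T x_i))."""
    w = [0.0] * len(X[0])
    for _ in range(n_iters):
        step = [0.0] * len(w)
        for xi, yi in zip(X, y):
            residual = yi - sigmoid(dot(w, xi))
            for j, xij in enumerate(xi):
                step[j] += xij * residual
        w = [wj + alpha * sj for wj, sj in zip(w, step)]
    return w


# Hypothetical, non-separable toy data (leading 1 is the constant feature).
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0]]
y = [0, 0, 1, 0, 1, 1]

err_before = error([0.0, 0.0], X, y)
w_fit = fit_logistic(X, y)
err_after = error(w_fit, X, y)
print(w_fit, err_before, err_after)
```

A fixed step size is the simplest choice; a decaying schedule or one of the other methods listed on the slide (Newton's method, L-BFGS) would typically need fewer iterations.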

25 Modeling for binary classification
Two probabilistic approaches:
1. Discriminative learning: Directly estimate $P(y \mid x)$.
2. Generative learning: Separately model $P(x \mid y)$ and $P(y)$. Use Bayes rule to estimate $P(y \mid x)$:
$$P(y=1 \mid x) = \frac{P(x \mid y=1)\,P(y=1)}{P(x)}$$

26 What you should know
Basic definition of linear classification problem.
Derivation of logistic regression.
Linear discriminant analysis: definition, decision boundary.
Quadratic discriminant analysis: basic idea, decision boundary.
LDA vs QDA pros/cons.
Worth reading further:
Under some conditions, linear regression for classification and LDA are the same (Hastie et al., p ).
Relation between Logistic regression and LDA (Hastie et al., 4.4.5)

27 Final notes
You don't yet have a team for Project #1? => Use myCourses.
You don't yet have a plan for Project #1? => Start planning!
Feedback on tutorial 1?
