COMP 551 Applied Machine Learning Lecture 5: Generative models for linear classification


1 COMP 551 Applied Machine Learning Lecture 5: Generative models for linear classification Instructor: Herke van Hoof Slides mostly by: Joelle Pineau Class web page: Unless otherwise noted, all material posted for this course is copyright of the instructors and cannot be reused or reposted without the instructor's written permission.

2 Modeling for binary classification Two probabilistic approaches: 1. Generative learning: Separately model P(x|y) and P(y). Use Bayes rule to estimate P(y|x): P(y=1|x) = P(x|y=1)P(y=1) / P(x) 2. Discriminative learning: Directly estimate P(y|x).

3 How about other types of data? Last lecture, we saw one generative approach (LDA). LDA works with continuous data. What about other types of data?

4 How about other types of data? LDA only works with continuous input data. Let's look at an approach for handling other types of data (mainly: binary).

5 Generative learning with binary input data Generative learning: Estimate P(x|y), P(y). Then calculate P(y|x). Simple principle: for every class y, estimate the conditional probability P(x|y) of every input pattern x. What happens if the number of input variables m is large?

6 Generative learning with binary input data Generative learning: Estimate P(x|y), P(y). Then calculate P(y|x). Simple principle: for every class y, estimate the conditional probability P(x|y) of every input pattern x. What happens if the number of input variables m is large? O(2^m) parameters necessary to describe the model! Need an additional assumption on the structure of the input to keep this manageable!

7 Naïve Bayes assumption Generative learning: Estimate P(x|y), P(y). Then calculate P(y|x). Naïve Bayes: Assume the x_j are conditionally independent given y. In other words: P(x_j|y) = P(x_j|y, x_k), for all j, k.

8 Naïve Bayes assumption Generative learning: Estimate P(x|y), P(y). Then calculate P(y|x). Naïve Bayes: Assume the x_j are conditionally independent given y. In other words: P(x_j|y) = P(x_j|y, x_k), for all j, k. Generative model structure: P(x|y) = P(x_1, x_2, ..., x_m|y) = P(x_1|y) P(x_2|y, x_1) P(x_3|y, x_1, x_2) ... P(x_m|y, x_1, x_2, ..., x_{m-1}) (from the general rules of probability) = P(x_1|y) P(x_2|y) P(x_3|y) ... P(x_m|y) (from the Naïve Bayes assumption above).
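To make the factorization concrete, here is a minimal sketch (not from the slides; the function name and all probabilities are illustrative) that evaluates P(x|y) as a product of per-feature Bernoulli conditionals:

```python
# Minimal sketch of the Naive Bayes factorization for binary features.
# theta_c[j] = P(x_j = 1 | y = c); the numbers below are made up for illustration.

def class_conditional(x, theta_c):
    """P(x | y = c) = prod_j P(x_j | y = c) under the Naive Bayes assumption."""
    p = 1.0
    for x_j, theta_j in zip(x, theta_c):
        p *= theta_j if x_j == 1 else (1.0 - theta_j)
    return p

theta = {0: [0.1, 0.1], 1: [0.5, 0.5]}      # e.g. P(word_j = 1 | regular), P(word_j = 1 | spam)
print(class_conditional([1, 0], theta[1]))  # 0.5 * 0.5 = 0.25
```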

9 Conditional independence example: "offer" and "opportunity" might both occur often in spam e-mails. Let's say we get 50% spam and 50% regular e-mail. Let's say spam e-mails contain "offer" 50% of the time and "opportunity" 50% of the time, independently. Let's say it's 10% for either in regular e-mail. The resulting percentages (derived from these assumptions):

                              Spam   Regular   Together %   Expected %
Contains only "offer"          25       9          17           21
Contains only "opportunity"    25       9          17           21
Contains neither               25      81          53           49
Contains both                  25       1          13            9

("Expected %" is what independence over all e-mails would predict, using the "Together" marginals of 30% for each word.) "Offer" and "opportunity" are not independent over all e-mails! We say they are conditionally independent given the class.
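The table entries follow from the stated assumptions; a short check (my own, not from the slides) mixes the two classes 50/50 and compares the joint frequency of "both" against the product of the overall marginals:

```python
# Check of the spam example: independent within each class,
# but not independent over all e-mails.
p_spam = 0.5
p_offer = {"spam": 0.5, "regular": 0.1}   # P(offer | class)
p_opp = {"spam": 0.5, "regular": 0.1}     # P(opportunity | class)

# P(both words) over all e-mails, mixing the two classes.
p_both = sum(p * p_offer[c] * p_opp[c]
             for c, p in [("spam", p_spam), ("regular", 1 - p_spam)])
# Overall marginals of each word.
p_offer_all = p_spam * p_offer["spam"] + (1 - p_spam) * p_offer["regular"]
p_opp_all = p_spam * p_opp["spam"] + (1 - p_spam) * p_opp["regular"]

print(p_both)                   # 0.13
print(p_offer_all * p_opp_all)  # 0.09, what independence would predict
```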

10 Naïve Bayes graphical model [Figure: class node y with arrows to feature nodes x_1, x_2, x_3, ..., x_m] How many parameters to estimate? Assume m binary features.

11 Naïve Bayes graphical model [Figure: class node y with arrows to feature nodes x_1, x_2, x_3, ..., x_m] How many parameters to estimate? Assume m binary features. Without the Naïve Bayes assumption: O(2^m) numbers to describe the model. With the Naïve Bayes assumption: O(m) numbers to describe the model. Useful when the number of features is high.

12 Training a Naïve Bayes classifier Assume x, y are binary variables, m=1. Estimate the parameters P(x|y) and P(y) from data. [Figure: node y with an arrow to node x] Define: Θ_1 = Pr(y=1), Θ_{j,1} = Pr(x_j=1|y=1), Θ_{j,0} = Pr(x_j=1|y=0).

13 Training a Naïve Bayes classifier Assume x, y are binary variables, m=1. Estimate the parameters P(x|y) and P(y) from data. [Figure: node y with an arrow to node x] Define: Θ_1 = Pr(y=1), Θ_{j,1} = Pr(x_j=1|y=1), Θ_{j,0} = Pr(x_j=1|y=0). Evaluation criteria: Find parameters that maximize the log-likelihood function. Likelihood: Pr(y|x) ∝ Pr(y)Pr(x|y) = ∏_{i=1:n} ( P(y_i) ∏_{j=1:m} P(x_{i,j}|y_i) ). Samples i are independent, so we take the product over n. Input features are independent (conditioned on y), so we take the product over m.

14 Training a Naïve Bayes classifier Likelihood for binary output variable: L(Θ_1|y) = Θ_1^y (1-Θ_1)^{1-y}. Log-likelihood for all parameters (like before): log L(Θ_1, Θ_{j,1}, Θ_{j,0} | D) = Σ_{i=1:n} [ log P(y_i) + Σ_{j=1:m} log P(x_{i,j}|y_i) ]

15 Training a Naïve Bayes classifier Likelihood for binary output variable: L(Θ_1|y) = Θ_1^y (1-Θ_1)^{1-y}. Log-likelihood for all parameters (like before): log L(Θ_1, Θ_{j,1}, Θ_{j,0} | D) = Σ_{i=1:n} [ log P(y_i) + Σ_{j=1:m} log P(x_{i,j}|y_i) ] = Σ_{i=1:n} [ y_i log Θ_1 + (1-y_i) log(1-Θ_1) + Σ_{j=1:m} y_i ( x_{i,j} log Θ_{j,1} + (1-x_{i,j}) log(1-Θ_{j,1}) ) + Σ_{j=1:m} (1-y_i) ( x_{i,j} log Θ_{j,0} + (1-x_{i,j}) log(1-Θ_{j,0}) ) ]

16 Training a Naïve Bayes classifier Likelihood for binary output variable: L(Θ_1|y) = Θ_1^y (1-Θ_1)^{1-y}. Log-likelihood for all parameters (like before): log L(Θ_1, Θ_{j,1}, Θ_{j,0} | D) = Σ_{i=1:n} [ log P(y_i) + Σ_{j=1:m} log P(x_{i,j}|y_i) ] = Σ_{i=1:n} [ y_i log Θ_1 + (1-y_i) log(1-Θ_1) + Σ_{j=1:m} y_i ( x_{i,j} log Θ_{j,1} + (1-x_{i,j}) log(1-Θ_{j,1}) ) + Σ_{j=1:m} (1-y_i) ( x_{i,j} log Θ_{j,0} + (1-x_{i,j}) log(1-Θ_{j,0}) ) ] (This will have another form if the parameters P(x|y) have another form, e.g. Gaussian.)

17 Training a Naïve Bayes classifier Likelihood for binary output variable: L(Θ_1|y) = Θ_1^y (1-Θ_1)^{1-y}. Log-likelihood for all parameters (like before): log L(Θ_1, Θ_{j,1}, Θ_{j,0} | D) = Σ_{i=1:n} [ log P(y_i) + Σ_{j=1:m} log P(x_{i,j}|y_i) ] = Σ_{i=1:n} [ y_i log Θ_1 + (1-y_i) log(1-Θ_1) + Σ_{j=1:m} y_i ( x_{i,j} log Θ_{j,1} + (1-x_{i,j}) log(1-Θ_{j,1}) ) + Σ_{j=1:m} (1-y_i) ( x_{i,j} log Θ_{j,0} + (1-x_{i,j}) log(1-Θ_{j,0}) ) ] (This will have another form if the parameters P(x|y) have another form, e.g. Gaussian.) Maximize to estimate Θ_1: take the derivative of log L and set it to 0: ∂L/∂Θ_1 = Σ_{i=1:n} ( y_i/Θ_1 - (1-y_i)/(1-Θ_1) ) = 0
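The step from this stationarity condition to the estimator on the next slide is routine algebra; written out (my addition, not on the slide):

```latex
\sum_{i=1}^{n}\left(\frac{y_i}{\Theta_1}-\frac{1-y_i}{1-\Theta_1}\right)=0
\;\Longrightarrow\;(1-\Theta_1)\sum_{i=1}^{n}y_i=\Theta_1\Bigl(n-\sum_{i=1}^{n}y_i\Bigr)
\;\Longrightarrow\;\Theta_1=\frac{1}{n}\sum_{i=1}^{n}y_i
```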

18 Training a Naïve Bayes classifier Solving for Θ_1 we get: Θ_1 = (1/n) Σ_{i=1:n} y_i = (number of examples where y=1) / (number of examples). Similarly, we get: Θ_{j,1} = (number of examples where x_j=1 and y=1) / (number of examples where y=1); Θ_{j,0} = (number of examples where x_j=1 and y=0) / (number of examples where y=0).
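These counting formulas translate directly into code; a minimal sketch (my own, with NumPy and toy data) for binary features:

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Maximum likelihood estimates for binary Naive Bayes.
    X: (n, m) 0/1 feature matrix; y: (n,) 0/1 labels."""
    theta1 = y.mean()                   # Θ_1 = (# examples with y=1) / n
    theta_j1 = X[y == 1].mean(axis=0)   # Θ_j,1 = (# with x_j=1 and y=1) / (# with y=1)
    theta_j0 = X[y == 0].mean(axis=0)   # Θ_j,0 = (# with x_j=1 and y=0) / (# with y=0)
    return theta1, theta_j1, theta_j0

X = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
y = np.array([1, 1, 0, 0])
print(fit_naive_bayes(X, y))  # (0.5, [1.0, 0.5], [0.0, 0.5])
```

Note that Θ_{j,1} = 1.0 here: an unseen event gets probability zero, which motivates the Laplace smoothing discussed later.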

19 Naïve Bayes decision boundary Decision boundary where the probabilities of the classes are equal: log-odds ratio = 0. log [Pr(y=1|x) / Pr(y=0|x)] = log [Pr(x|y=1)P(y=1) / (Pr(x|y=0)P(y=0))] = log [P(y=1)/P(y=0)] + log [∏_{j=1:m} P(x_j|y=1) / ∏_{j=1:m} P(x_j|y=0)] = log [P(y=1)/P(y=0)] + Σ_{j=1:m} log [P(x_j|y=1) / P(x_j|y=0)]

20 Naïve Bayes decision boundary Consider the case where features are binary: x_j ∈ {0, 1}. Define: w_{j,0} = log [P(x_j=0|y=1) / P(x_j=0|y=0)]; w_{j,1} = log [P(x_j=1|y=1) / P(x_j=1|y=0)]. Now we have: log [Pr(y=1|x) / Pr(y=0|x)] = log [P(y=1)/P(y=0)] + Σ_{j=1:m} log [P(x_j|y=1) / P(x_j|y=0)] = log [P(y=1)/P(y=0)] + Σ_{j=1:m} ( w_{j,0}(1-x_j) + w_{j,1} x_j ) = log [P(y=1)/P(y=0)] + Σ_{j=1:m} w_{j,0} + Σ_{j=1:m} (w_{j,1} - w_{j,0}) x_j. This is a linear decision boundary! Constant + linear in x.
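As a sanity check, one can compute the bias and slope from estimated Naïve Bayes parameters and confirm they reproduce the direct log-odds; a small sketch (my own, following the slide's w_{j,0} and w_{j,1} definitions):

```python
import numpy as np

def nb_linear_weights(theta1, theta_j1, theta_j0):
    """Express the Naive Bayes log-odds as bias + w . x for binary features."""
    w_j1 = np.log(theta_j1 / theta_j0)               # w_{j,1}
    w_j0 = np.log((1 - theta_j1) / (1 - theta_j0))   # w_{j,0}
    bias = np.log(theta1 / (1 - theta1)) + w_j0.sum()
    return bias, w_j1 - w_j0                         # constant + linear in x

theta1 = 0.5
theta_j1 = np.array([0.8, 0.6])   # made-up P(x_j=1 | y=1)
theta_j0 = np.array([0.2, 0.5])   # made-up P(x_j=1 | y=0)
bias, w = nb_linear_weights(theta1, theta_j1, theta_j0)

# Direct log-odds for x = [1, 0] agrees with the linear form.
x = np.array([1, 0])
direct = (np.log(theta1 / (1 - theta1))
          + np.log(np.where(x == 1, theta_j1, 1 - theta_j1)).sum()
          - np.log(np.where(x == 1, theta_j0, 1 - theta_j0)).sum())
print(np.isclose(direct, bias + w @ x))  # True
```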

21 Text classification example Using Naïve Bayes, we can compute probabilities for all the words which appear in the document collection. P(y=c) is the probability of class c. P(x_j|y=c) is the probability of seeing word j in documents of class c. [Figure: class node c with arrows to word 1, word 2, word 3, ..., word m]

22 Text classification example Using Naïve Bayes, we can compute probabilities for all the words which appear in the document collection. P(y=c) is the probability of class c. P(x_j|y=c) is the probability of seeing word j in documents of class c. The set of classes depends on the application, e.g. topic modeling: each class corresponds to documents on a given topic, e.g. {Politics, Finance, Sports, Arts}. [Figure: class node c with arrows to word 1, word 2, word 3, ..., word m] What happens when a word is not observed in the training data?

23 Laplace smoothing Replace the maximum likelihood estimator: Pr(x_j=1|y=1) = (number of instances with x_j=1 and y=1) / (number of examples with y=1)

24 Laplace smoothing Replace the maximum likelihood estimator: Pr(x_j=1|y=1) = (number of instances with x_j=1 and y=1) / (number of examples with y=1) with the following: Pr(x_j=1|y=1) = ((number of instances with x_j=1 and y=1) + 1) / ((number of examples with y=1) + 2)

25 Laplace smoothing Replace the maximum likelihood estimator: Pr(x_j=1|y=1) = (number of instances with x_j=1 and y=1) / (number of examples with y=1) with the following: Pr(x_j=1|y=1) = ((number of instances with x_j=1 and y=1) + 1) / ((number of examples with y=1) + 2). If there is no example from that class, it reduces to a prior probability of Pr = 1/2. If all examples have x_j=1, then Pr(x_j=0|y) has Pr = 1 / (#examples + 2). If a word appears frequently, the new estimate is only slightly biased. This is a form of regularization (it decreases variance at the cost of bias).
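A minimal sketch of the smoothed estimator (same toy setup as the earlier counting code; variable names are my own):

```python
import numpy as np

def smoothed_estimates(X, y):
    """Laplace-smoothed estimates Pr(x_j=1 | y=c) = (count + 1) / (n_c + 2)."""
    est = {}
    for c in (0, 1):
        Xc = X[y == c]
        est[c] = (Xc.sum(axis=0) + 1) / (len(Xc) + 2)
    return est

X = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
y = np.array([1, 1, 0, 0])
print(smoothed_estimates(X, y))
# {0: [0.25, 0.5], 1: [0.75, 0.5]} -- no probability is exactly 0 or 1
```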

26 Example: 20 newsgroups Given 1000 training documents from each group, learn to classify new documents according to which newsgroup they came from: comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x, alt.atheism, soc.religion.christian, talk.religion.misc, talk.politics.mideast, talk.politics.misc, misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey, sci.space, sci.crypt, sci.electronics, sci.med, talk.politics.guns. Naïve Bayes: 89% classification accuracy (comparable to other state-of-the-art methods).

27 Gaussian Naïve Bayes Extending Naïve Bayes to continuous inputs: P(y) is still assumed to be a binomial distribution. P(x|y) is assumed to be a multivariate Gaussian (normal) distribution with mean μ ∈ R^n and covariance matrix Σ ∈ R^{n×n}.

28 Gaussian Naïve Bayes Extending Naïve Bayes to continuous inputs: P(y) is still assumed to be a binomial distribution. P(x|y) is assumed to be a multivariate Gaussian (normal) distribution with mean μ ∈ R^n and covariance matrix Σ ∈ R^{n×n}. If we assume the same Σ for all classes: linear discriminant analysis. If Σ is distinct between classes: quadratic discriminant analysis. If Σ is diagonal (i.e. features are independent): Gaussian Naïve Bayes (linear if the same for all classes).

29 Gaussian Naïve Bayes Extending Naïve Bayes to continuous inputs: P(y) is still assumed to be a binomial distribution. P(x|y) is assumed to be a multivariate Gaussian (normal) distribution with mean μ ∈ R^n and covariance matrix Σ ∈ R^{n×n}. If we assume the same Σ for all classes: linear discriminant analysis. If Σ is distinct between classes: quadratic discriminant analysis. If Σ is diagonal (i.e. features are independent): Gaussian Naïve Bayes (linear if the same for all classes). How do we estimate the parameters? Derive the maximum likelihood estimators for μ and Σ.
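For the diagonal (Gaussian Naïve Bayes) case, the maximum likelihood estimators reduce to per-class, per-feature means and variances; a brief sketch under that assumption, with made-up data:

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Per-class MLEs under Gaussian Naive Bayes (diagonal covariance):
    the class mean and the per-feature variance within the class
    (np.var divides by n, i.e. the MLE rather than the unbiased estimate)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), Xc.var(axis=0))  # mu_c, diagonal of Sigma_c
    return params

X = np.array([[1.0, 2.0], [1.2, 1.8], [3.0, 0.5], [2.8, 0.7]])
y = np.array([0, 0, 1, 1])
print(fit_gaussian_nb(X, y))
```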

30 Modeling for binary classification Two probabilistic approaches: 1. Generative learning: Separately model P(x|y) and P(y). Use Bayes rule to estimate P(y|x): P(y=1|x) = P(x|y=1)P(y=1) / P(x) 2. Discriminative learning: Directly estimate P(y|x).

31 Discriminative learning We have seen that under several assumptions, we get linear decision boundaries: p(x|y) are Gaussian with shared covariance (LDA); p(x|y) are independent Bernoulli distributions (Naïve Bayes). Do we really need to estimate p(x|y) and p(y)? Can we directly find the parameters of the best decision boundary? E.g. the covariance matrix requires estimating O(m^2) parameters, but the decision boundary only requires O(m) parameters.

32 Probabilistic view of discriminative learning Suppose we have 2 classes: y ∈ {0, 1}. What is the probability of a given input x having class y=1? Consider Bayes rule: P(y=1|x) = P(x, y=1) / P(x) = P(x|y=1)P(y=1) / ( P(x|y=1)P(y=1) + P(x|y=0)P(y=0) )

33 Probabilistic view of discriminative learning Suppose we have 2 classes: y ∈ {0, 1}. What is the probability of a given input x having class y=1? Consider Bayes rule: P(y=1|x) = P(x, y=1) / P(x) = P(x|y=1)P(y=1) / ( P(x|y=1)P(y=1) + P(x|y=0)P(y=0) ) = 1 / ( 1 + P(x|y=0)P(y=0) / (P(x|y=1)P(y=1)) ) = 1 / ( 1 + exp( ln [P(x|y=0)P(y=0) / (P(x|y=1)P(y=1))] ) ) = 1 / (1 + exp(-a)) = σ(a)

34 Probabilistic view of discriminative learning Suppose we have 2 classes: y ∈ {0, 1}. What is the probability of a given input x having class y=1? Consider Bayes rule: P(y=1|x) = P(x, y=1) / P(x) = P(x|y=1)P(y=1) / ( P(x|y=1)P(y=1) + P(x|y=0)P(y=0) ) = 1 / ( 1 + P(x|y=0)P(y=0) / (P(x|y=1)P(y=1)) ) = 1 / ( 1 + exp( ln [P(x|y=0)P(y=0) / (P(x|y=1)P(y=1))] ) ) = 1 / (1 + exp(-a)) = σ(a), where a = ln [P(x|y=1)P(y=1) / (P(x|y=0)P(y=0))] = ln [P(y=1|x) / P(y=0|x)] (by Bayes rule; P(x) on top and bottom cancels out).

35 Probabilistic view of discriminative learning Suppose we have 2 classes: y ∈ {0, 1}. What is the probability of a given input x having class y=1? Consider Bayes rule: P(y=1|x) = P(x, y=1) / P(x) = P(x|y=1)P(y=1) / ( P(x|y=1)P(y=1) + P(x|y=0)P(y=0) ) = 1 / ( 1 + P(x|y=0)P(y=0) / (P(x|y=1)P(y=1)) ) = 1 / ( 1 + exp( ln [P(x|y=0)P(y=0) / (P(x|y=1)P(y=1))] ) ) = 1 / (1 + exp(-a)) = σ(a), where a = ln [P(x|y=1)P(y=1) / (P(x|y=0)P(y=0))] = ln [P(y=1|x) / P(y=0|x)] (by Bayes rule; P(x) on top and bottom cancels out). Here σ has a special form, called the logistic function, and a is the log-odds ratio of the data being class 1 vs. class 0.

36 Discriminative learning: Logistic regression The logistic function (= sigmoid curve): σ(w^T x) = 1 / (1 + e^{-w^T x}). Transforms the learned function s.t. it can be interpreted as a probability.

37 Discriminative learning: Logistic regression The logistic function (= sigmoid curve): σ(w^T x) = 1 / (1 + e^{-w^T x}). Transforms the learned function s.t. it can be interpreted as a probability. The decision boundary is the set of points for which a=0. Idea: Directly model the log-odds with a linear function: a = ln [P(x|y=1)P(y=1) / (P(x|y=0)P(y=0))] = w_0 + w_1 x_1 + ... + w_m x_m
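A tiny sketch of this model (weights and input are made up): the score a = w^T x, with a constant 1 appended to x so that w_0 acts as the bias, passes through the logistic function, and the boundary sits exactly where the score crosses zero:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.array([-1.0, 2.0])  # [w_0, w_1]: hypothetical weights
x = np.array([1.0, 0.5])   # leading 1 is the bias feature
a = w @ x                  # linear log-odds: w_0 + w_1 * x_1
print(sigmoid(a))          # 0.5 exactly on the boundary a = 0
```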

38 Discriminative learning: Logistic regression The logistic function (= sigmoid curve): σ(w^T x) = 1 / (1 + e^{-w^T x}). Transforms the learned function s.t. it can be interpreted as a probability. The decision boundary is the set of points for which a=0. Idea: Directly model the log-odds with a linear function: a = ln [P(x|y=1)P(y=1) / (P(x|y=0)P(y=0))] = w_0 + w_1 x_1 + ... + w_m x_m. How do we find the weights? We need an objective function!

39 Fitting the weights Recall: σ(w^T x_i) is the probability that y_i=1 (given x_i); 1-σ(w^T x_i) is the probability that y_i=0. For y ∈ {0, 1}, the likelihood function, Pr(x_1, y_1, ..., x_n, y_n | w), is: ∏_{i=1:n} σ(w^T x_i)^{y_i} (1-σ(w^T x_i))^{1-y_i} (samples are i.i.d.)

40 Fitting the weights Recall: σ(w^T x_i) is the probability that y_i=1 (given x_i); 1-σ(w^T x_i) is the probability that y_i=0. For y ∈ {0, 1}, the likelihood function, Pr(x_1, y_1, ..., x_n, y_n | w), is: ∏_{i=1:n} σ(w^T x_i)^{y_i} (1-σ(w^T x_i))^{1-y_i} (samples are i.i.d.) Goal: Minimize the negative log-likelihood (also called the cross-entropy error function): - Σ_{i=1:n} [ y_i log(σ(w^T x_i)) + (1-y_i) log(1-σ(w^T x_i)) ]

41 Gradient descent for logistic regression Error fn: Err(w) = - Σ_{i=1:n} [ y_i log(σ(w^T x_i)) + (1-y_i) log(1-σ(w^T x_i)) ] Take the derivative:

42 Gradient descent for logistic regression Error fn: Err(w) = - Σ_{i=1:n} [ y_i log(σ(w^T x_i)) + (1-y_i) log(1-σ(w^T x_i)) ] Take the derivative (chain rule piece: ∂log(σ)/∂σ = 1/σ): ∂Err(w)/∂w = - [ Σ_{i=1:n} y_i (1/σ(w^T x_i)) (1-σ(w^T x_i)) σ(w^T x_i) x_i + ...

43 Gradient descent for logistic regression Error fn: Err(w) = - Σ_{i=1:n} [ y_i log(σ(w^T x_i)) + (1-y_i) log(1-σ(w^T x_i)) ] Take the derivative (chain rule piece: ∂σ(a)/∂a = σ(a)(1-σ(a))): ∂Err(w)/∂w = - [ Σ_{i=1:n} y_i (1/σ(w^T x_i)) (1-σ(w^T x_i)) σ(w^T x_i) x_i + ...

44 Gradient descent for logistic regression Error fn: Err(w) = - Σ_{i=1:n} [ y_i log(σ(w^T x_i)) + (1-y_i) log(1-σ(w^T x_i)) ] Take the derivative (chain rule piece: ∂(w^T x)/∂w = x): ∂Err(w)/∂w = - [ Σ_{i=1:n} y_i (1/σ(w^T x_i)) (1-σ(w^T x_i)) σ(w^T x_i) x_i + ...

45 Gradient descent for logistic regression Error fn: Err(w) = - Σ_{i=1:n} [ y_i log(σ(w^T x_i)) + (1-y_i) log(1-σ(w^T x_i)) ] Take the derivative (chain rule piece: ∂(1-σ(a))/∂a = -(1-σ(a))σ(a)): ∂Err(w)/∂w = - [ Σ_{i=1:n} y_i (1/σ(w^T x_i)) (1-σ(w^T x_i)) σ(w^T x_i) x_i + (1-y_i) (1/(1-σ(w^T x_i))) (1-σ(w^T x_i)) σ(w^T x_i) (-1) x_i ]

46 Gradient descent for logistic regression Error fn: Err(w) = - Σ_{i=1:n} [ y_i log(σ(w^T x_i)) + (1-y_i) log(1-σ(w^T x_i)) ] Take the derivative: ∂Err(w)/∂w = - [ Σ_{i=1:n} y_i (1/σ(w^T x_i)) (1-σ(w^T x_i)) σ(w^T x_i) x_i + (1-y_i) (1/(1-σ(w^T x_i))) (1-σ(w^T x_i)) σ(w^T x_i) (-1) x_i ] = - Σ_{i=1:n} x_i ( y_i (1-σ(w^T x_i)) - (1-y_i) σ(w^T x_i) ) = - Σ_{i=1:n} x_i ( y_i - σ(w^T x_i) )

47 Gradient descent for logistic regression Error fn: Err(w) = - Σ_{i=1:n} [ y_i log(σ(w^T x_i)) + (1-y_i) log(1-σ(w^T x_i)) ] Take the derivative: ∂Err(w)/∂w = - [ Σ_{i=1:n} y_i (1/σ(w^T x_i)) (1-σ(w^T x_i)) σ(w^T x_i) x_i + (1-y_i) (1/(1-σ(w^T x_i))) (1-σ(w^T x_i)) σ(w^T x_i) (-1) x_i ] = - Σ_{i=1:n} x_i ( y_i (1-σ(w^T x_i)) - (1-y_i) σ(w^T x_i) ) = - Σ_{i=1:n} x_i ( y_i - σ(w^T x_i) ) Now apply iteratively: w_{k+1} = w_k + α_k Σ_{i=1:n} x_i ( y_i - σ(w_k^T x_i) )
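Putting the pieces together, a minimal batch gradient descent loop on the cross-entropy error, implementing exactly this update (step size, iteration count, and toy data are illustrative choices of mine):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, y, alpha=0.1, iters=1000):
    """w_{k+1} = w_k + alpha * sum_i x_i (y_i - sigma(w_k^T x_i))."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w = w + alpha * X.T @ (y - sigmoid(X @ w))
    return w

# Toy 1-D problem with a bias column prepended to X.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = fit_logistic(X, y)
print(np.round(sigmoid(X @ w)))  # [0. 0. 1. 1.]
```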

48 Gradient descent for logistic regression Error fn: Err(w) = - Σ_{i=1:n} [ y_i log(σ(w^T x_i)) + (1-y_i) log(1-σ(w^T x_i)) ] Take the derivative: ∂Err(w)/∂w = - [ Σ_{i=1:n} y_i (1/σ(w^T x_i)) (1-σ(w^T x_i)) σ(w^T x_i) x_i + (1-y_i) (1/(1-σ(w^T x_i))) (1-σ(w^T x_i)) σ(w^T x_i) (-1) x_i ] = - Σ_{i=1:n} x_i ( y_i (1-σ(w^T x_i)) - (1-y_i) σ(w^T x_i) ) = - Σ_{i=1:n} x_i ( y_i - σ(w^T x_i) ) Now apply iteratively: w_{k+1} = w_k + α_k Σ_{i=1:n} x_i ( y_i - σ(w_k^T x_i) ) We can also apply other iterative methods, e.g. Newton's method, coordinate descent, L-BFGS, etc.

49 Multi-class classification Generally two options: 1. Learn a single classifier that can produce 20 distinct output values. 2. Learn 20 different 1-vs-all binary classifiers.

50 Multi-class classification Generally two options: 1. Learn a single classifier that can produce 20 distinct output values. 2. Learn 20 different 1-vs-all binary classifiers. Option 1 assumes you have a multi-class version of the classifier. For Naïve Bayes, compute P(y|x) for each class, and select the class with the highest probability.

51 Multi-class classification Generally two options: 1. Learn a single classifier that can produce 20 distinct output values. 2. Learn 20 different 1-vs-all binary classifiers. Option 1 assumes you have a multi-class version of the classifier. For Naïve Bayes, compute P(y|x) for each class, and select the class with the highest probability. Option 2 applies to all binary classifiers, so it is more flexible. But: it is often slower (need to learn many classifiers); it creates a class imbalance problem (say, 5% vs 95% for 20 classes); and what if two classifiers say the example belongs to their class? Or zero do?
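A schematic of Option 2 with logistic regression as the base classifier (the trainer from the previous sketch is repeated so this is self-contained; data and names are made up). Ties and abstentions are resolved by taking the class whose classifier is most confident:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, y, alpha=0.1, iters=500):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w = w + alpha * X.T @ (y - sigmoid(X @ w))
    return w

def one_vs_all(X, y, n_classes):
    # One binary classifier per class: class c vs. everything else.
    return [fit_logistic(X, (y == c).astype(float)) for c in range(n_classes)]

def predict(ws, X):
    # Resolve "two classifiers claim it" / "none do" by taking the
    # highest-probability class.
    scores = np.stack([sigmoid(X @ w) for w in ws], axis=1)
    return scores.argmax(axis=1)

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0, 0, 1, 1])
ws = one_vs_all(X, y, n_classes=2)
print(predict(ws, X))  # [0 0 1 1]
```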

52 Comparing linear classification methods [FIGURE 4.4 (Hastie et al.): A two-dimensional plot of the vowel training data (Coordinate 1 vs. Coordinate 2 for the training data). There are eleven classes with X ∈ IR^10, and this is the best view in terms of an LDA model (Section 4.3.3). The heavy circles are the projected mean vectors for each class. The class overlap is considerable. An accompanying table lists training and test error rates for linear regression, linear discriminant analysis, quadratic discriminant analysis, and logistic regression.]

53 Discriminative vs generative Discriminative classifiers often have fewer parameters to estimate. Discriminative classifiers often do better, but: a generative model might give us more insight into the data; it can tell us when all classes are bad (low probability). With many classes, discriminative models need to find the decision boundary between every pair.

54 What you should know Naïve Bayes assumption. Log-odds ratio decision boundary. How to estimate parameters for Naïve Bayes. Laplace smoothing. Relation between Naïve Bayes, LDA, QDA, Gaussian Naïve Bayes. Derivation of logistic regression. Worth reading further: relation between logistic regression and LDA (Hastie et al., 4.4.5).
