UVA CS 4501-001 / 6501-007 Introduction to Machine Learning and Data Mining


Lecture 16: Generative vs. Discriminative / K-Nearest-Neighbor Classifier / LOOCV
Yanjun Qi / Jane, PhD, University of Virginia, Department of Computer Science (10/22/14)

Where are we? Five major sections of this course:
- Regression (supervised)
- Classification (supervised)
- Unsupervised models
- Learning theory
- Graphical models

Where are we? Three major sections for classification. We can divide the large variety of classification approaches into roughly three major types:
1. Discriminative: directly estimate a decision rule/boundary; e.g., logistic regression, support vector machine, decision tree.
2. Generative: build a generative statistical model; e.g., naive Bayes classifier, Bayesian networks.
3. Instance-based classifiers: use the observations directly (no model); e.g., K nearest neighbors.

A dataset for classification. The output is a discrete class label C = c_1, ..., c_L.
- Discriminative: argmax_C P(C|X)
- Generative: argmax_C P(C|X) = argmax_C P(X,C) = argmax_C P(X|C)P(C)
Terminology:
- Data / points / instances / examples / samples / records: [rows]
- Features / attributes / dimensions / independent variables / covariates / predictors / regressors: [columns, except the last]
- Target / outcome / response / label / dependent variable: the special column to be predicted [last column]

Generative: Multinomial Naive Bayes as a Stochastic Language Model. To score the sentence "the boy likes the dog" under a class model, multiply all five per-word terms. Each class model assigns a unigram probability to every vocabulary word (the slide's probability table is only partially recoverable; e.g., Model C1: the 0.2, boy 0.01, ..., garden 0.01; Model C2: the 0.2, said 0.03, likes 0.02, black 0.1, dog 0.01, ...). For a sentence s, choose class C2 over C1 when P(s|C2) P(C2) > P(s|C1) P(C1).

Discriminative, e.g., probability of disease: logistic regression models a binary target variable coded 0/1 via the logit function / logistic function:

P(c = 1 | x) = e^{α + βx} / (1 + e^{α + βx})

ln[ P(c=1|x) / P(c=0|x) ] = ln[ P(c=1|x) / (1 − P(c=1|x)) ] = α + β_1 x_1 + β_2 x_2 + ... + β_p x_p
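A minimal worked sketch of the multinomial NB scoring above; the per-word probabilities are hypothetical stand-ins, since the slide's table is only partly legible, and the comparison is done in log space to avoid underflow.

```python
import math

# Hypothetical unigram probabilities (stand-ins for the slide's partly garbled table).
model_c1 = {"the": 0.2, "boy": 0.01, "likes": 0.005, "dog": 0.005}
model_c2 = {"the": 0.2, "boy": 0.005, "likes": 0.02, "dog": 0.01}
prior = {"C1": 0.5, "C2": 0.5}

def log_score(words, model, p_class):
    # log[P(s|C) P(C)] = log P(C) + sum of per-word log probabilities
    return math.log(p_class) + sum(math.log(model[w]) for w in words)

sentence = "the boy likes the dog".split()   # multiply all five terms
s1 = log_score(sentence, model_c1, prior["C1"])
s2 = log_score(sentence, model_c2, prior["C2"])
print("pick", "C2" if s2 > s1 else "C1")     # P(s|C2)P(C2) > P(s|C1)P(C1)?
```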

Binary Logistic Regression. In summary, logistic regression tells us two things at once: transformed, the log odds (logit) ln[p/(1−p)] are linear in x, where Odds = p/(1−p); and P(Y=1|x) follows the logistic (sigmoid) curve in x. This means we use a Bernoulli distribution to model the target variable, with its Bernoulli parameter p = P(y=1|x) given by the logistic function.

Today: relevant classifiers / KNN / LOOCV
- Logistic regression (cont.)
- Gaussian Naive Bayes classifier
- K-nearest neighbor
- LOOCV

Multinomial Logistic Regression Model. The method directly models the posterior probabilities as the output of regression:

Pr(G = k | X = x) = exp(β_{k0} + β_k^T x) / (1 + Σ_{l=1}^{K−1} exp(β_{l0} + β_l^T x)),   k = 1, ..., K−1
Pr(G = K | X = x) = 1 / (1 + Σ_{l=1}^{K−1} exp(β_{l0} + β_l^T x))

Here x is a p-dimensional input vector and β_k is a p-dimensional vector for each k, so the total number of parameters is (K−1)(p+1). Note that the class boundaries are linear.

MLE for Logistic Regression Training. Let's fit the logistic regression model for K = 2, i.e., the number of classes is 2. Training set: (x_i, y_i), i = 1, ..., N. Log-likelihood:

l(β) = Σ_{i=1}^N log Pr(Y = y_i | X = x_i)
     = Σ_{i=1}^N [ y_i log Pr(Y=1|X=x_i) + (1 − y_i) log Pr(Y=0|X=x_i) ]
     = Σ_{i=1}^N [ y_i log( exp(β^T x_i) / (1 + exp(β^T x_i)) ) + (1 − y_i) log( 1 / (1 + exp(β^T x_i)) ) ]
     = Σ_{i=1}^N [ y_i β^T x_i − log(1 + exp(β^T x_i)) ]

This uses the Bernoulli distribution p(y|x) = p^y (1 − p)^{1−y}. The x_i are (p+1)-dimensional input vectors with leading entry 1, β is a (p+1)-dimensional vector, and y_i = 1 if C_i = 1, y_i = 0 if C_i = 0. We want to maximize the log-likelihood in order to estimate β.
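A one-function numpy sketch of the two-class log-likelihood just derived; using log1p for the log(1 + exp(...)) term is an implementation detail, not from the slides.

```python
import numpy as np

def log_likelihood(beta, X, y):
    """l(beta) = sum_i [ y_i * beta^T x_i - log(1 + exp(beta^T x_i)) ].

    X: (N, p+1) design matrix with a leading column of ones; y: (N,) array of 0/1 labels.
    """
    z = X @ beta
    return np.sum(y * z - np.log1p(np.exp(z)))
```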

Newton-Raphson for LR (optional). Setting the gradient to zero,

∂l(β)/∂β = Σ_{i=1}^N ( y_i − exp(β^T x_i) / (1 + exp(β^T x_i)) ) x_i = 0

gives (p+1) nonlinear equations to solve for the (p+1) unknowns. Solve by the Newton-Raphson method:

β^new ← β^old − [ ∂²l(β)/∂β∂β^T ]^{−1} ∂l(β)/∂β

where the Hessian is

∂²l(β)/∂β∂β^T = − Σ_{i=1}^N x_i x_i^T ( exp(β^T x_i) / (1 + exp(β^T x_i)) ) ( 1 / (1 + exp(β^T x_i)) ) = − Σ_{i=1}^N x_i x_i^T p(x_i; β) (1 − p(x_i; β))

Newton-Raphson for LR (optional), in matrix notation. Define:

X: N × (p+1) matrix with rows x_1^T, ..., x_N^T
y: N × 1 vector of the y_i
p: N × 1 vector of p(x_i; β^old) = exp((β^old)^T x_i) / (1 + exp((β^old)^T x_i))
W: N × N diagonal matrix with entries p(x_i; β^old)(1 − p(x_i; β^old))

Then ∂l(β)/∂β = X^T(y − p) and ∂²l(β)/∂β∂β^T = −X^T W X, so the NR rule becomes:

β^new ← β^old + (X^T W X)^{−1} X^T (y − p)

Newton-Raphson for LR. The Newton-Raphson update can be re-expressed as a weighted least squares step:

β^new = β^old + (X^T W X)^{−1} X^T (y − p)
      = (X^T W X)^{−1} X^T W ( X β^old + W^{−1}(y − p) )
      = (X^T W X)^{−1} X^T W z

with adjusted response z = X β^old + W^{−1}(y − p). This is iteratively reweighted least squares (IRLS):

β^new ← argmin_β (z − Xβ)^T W (z − Xβ)

Today: relevant classifiers / KNN / LOOCV
- Logistic regression (cont.)
- Gaussian Naive Bayes classifier
  - Gaussian distribution
  - Gaussian NBC
  - LDA, QDA
  - Discriminative vs. Generative
- K-nearest neighbor
- LOOCV
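A compact IRLS sketch of the update above; the iteration cap and convergence tolerance are arbitrary choices, not from the slides.

```python
import numpy as np

def fit_logistic_irls(X, y, n_iter=25, tol=1e-8):
    """Fit two-class logistic regression by IRLS / Newton-Raphson.

    X: (N, p+1) design matrix with a leading column of ones; y: (N,) array of 0/1 labels.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # p(x_i; beta) for every sample
        w = p * (1.0 - p)                     # diagonal of W
        grad = X.T @ (y - p)                  # X^T (y - p)
        H = X.T @ (X * w[:, None])            # X^T W X
        step = np.linalg.solve(H, grad)       # (X^T W X)^{-1} X^T (y - p)
        beta = beta + step
        if np.linalg.norm(step) < tol:
            break
    return beta
```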

The Gaussian Distribution. [Figure: Gaussian density annotated with its mean and covariance matrix. Courtesy: http://research.microsoft.com/~cmbishop/PRML/index.htm]

Multivariate Gaussian Distribution. A multivariate Gaussian model: x ~ N(µ, Σ), where µ is the mean vector and Σ is the covariance matrix. For example, if p = 2:

µ = (µ_1, µ_2),   Σ = [ var(x_1)       cov(x_1, x_2)
                        cov(x_1, x_2)  var(x_2)      ]

The covariance matrix captures linear dependences among the variables.

MLE Estimation for a Multivariate Gaussian. We can fit statistical models by maximizing the probability / likelihood of generating the observed samples:

L(x_1, ..., x_n | Θ) = p(x_1 | Θ) ··· p(x_n | Θ)

(the samples are assumed to be independent). In the Gaussian case, we simply set the mean and the variance to the sample mean and the sample variance:

µ̂ = (1/n) Σ_{i=1}^n x_i
σ̂² = (1/n) Σ_{i=1}^n (x_i − µ̂)²

Probabilistic Interpretation of Linear Regression. Let us assume that the target variable and the inputs are related by the equation

y_i = θ^T x_i + ε_i

where ε is an error term of unmodeled effects or random noise. Now assume that ε follows a Gaussian N(0, σ²); then we have:

p(y_i | x_i; θ) = (1 / (√(2π) σ)) exp( −(y_i − θ^T x_i)² / (2σ²) )

By the independence (among samples) assumption:

L(θ) = Π_{i=1}^n p(y_i | x_i; θ) = ( 1 / (√(2π) σ) )^n exp( −Σ_{i=1}^n (y_i − θ^T x_i)² / (2σ²) )
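A quick numeric check of the Gaussian MLE formulas above: the sample mean and the divide-by-n (biased) sample variance, on a toy 1-D sample.

```python
import numpy as np

x = np.array([1.2, 0.7, 1.9, 1.1, 0.4])   # toy 1-D sample
mu_hat = x.mean()                          # (1/n) * sum_i x_i
var_hat = np.mean((x - mu_hat) ** 2)       # (1/n) * sum_i (x_i - mu_hat)^2, the MLE variance
print(mu_hat, var_hat)
```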

Probabilistic Interpretation of Linear Regression (cont.). Hence the log-likelihood is:

l(θ) = n log( 1 / (√(2π) σ) ) − (1 / (2σ²)) Σ_{i=1}^n (y_i − θ^T x_i)²

Do you recognize the last term? Yes, it is the least-squares cost:

J(θ) = (1/2) Σ_{i=1}^n (x_i^T θ − y_i)²

Thus, under the independence assumption, minimizing the residual sum of squares is equivalent to MLE of θ!

Today: relevant classifiers / KNN / LOOCV
- Logistic regression (cont.)
- Gaussian Naive Bayes classifier
  - Gaussian distribution
  - Gaussian NBC
  - LDA, QDA
  - Discriminative vs. Generative
- K-nearest neighbor
- LOOCV

Gaussian Naive Bayes Classifier.

argmax_C P(C|X) = argmax_C P(X,C) = argmax_C P(X|C)P(C)

P̂(X_j | C = c_i) = (1 / (√(2π) σ_{ji})) exp( −(X_j − µ_{ji})² / (2σ_{ji}²) )

µ_{ji}: mean (average) of attribute values X_j over the examples for which C = c_i
σ_{ji}: standard deviation of attribute values X_j over the examples for which C = c_i

Naive Bayes factorization:

P(X|C) = P(X_1, X_2, ..., X_p | C)
       = P(X_1 | X_2, ..., X_p, C) P(X_2, ..., X_p | C)
       = P(X_1 | C) P(X_2, ..., X_p | C)        (conditional independence)
       = P(X_1 | C) P(X_2 | C) ··· P(X_p | C)

Gaussian Naive Bayes Classifier with continuous-valued input attributes: each conditional probability is modeled with the normal distribution above. Learning phase: for X = (X_1, ..., X_p) and C = c_1, ..., c_L, output p × L normal distributions and the priors P(C = c_i), i = 1, ..., L. Test phase: for X' = (X'_1, ..., X'_p), calculate the conditional probabilities with all the normal distributions and apply the MAP rule to make a decision.
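A minimal Gaussian naive Bayes sketch of the learning and test phases just described; it works in log space and adds no variance smoothing, which a real implementation would want for attributes with zero spread.

```python
import numpy as np

def gnb_fit(X, y):
    """Learning phase: per-class prior plus per-(attribute, class) mean and std."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),   # P(C = c)
                     Xc.mean(axis=0),    # mu_{j,c} for every attribute j
                     Xc.std(axis=0))     # sigma_{j,c} for every attribute j
    return params

def gnb_predict(params, x):
    """Test phase: MAP rule, argmax_c [ log P(c) + sum_j log N(x_j; mu_{j,c}, sigma_{j,c}) ]."""
    def log_posterior(c):
        prior, mu, sigma = params[c]
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * sigma**2) + (x - mu)**2 / sigma**2)
        return np.log(prior) + log_lik
    return max(params, key=log_posterior)
```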

What does "naive" Gaussian mean?

Not naive: P(X_1, X_2, ..., X_p | C) is a full multivariate Gaussian with a general covariance matrix.

Naive: P(X_1, X_2, ..., X_p | C = c_j) = P(X_1|C) P(X_2|C) ··· P(X_p|C), a product of univariate Gaussians (1 / (√(2π) σ_j)) exp( −(X_j − µ_j)² / (2σ_j²) ). Equivalently, each class covariance matrix is diagonal: Σ_j = Λ_j.

Today: relevant classifiers / KNN / LOOCV
- Logistic regression (cont.)
- Gaussian Naive Bayes classifier
  - Gaussian distribution
  - Gaussian NBC
  - LDA, QDA, RDA
  - Discriminative vs. Generative
- K-nearest neighbor
- LOOCV

If the covariance matrices are not the identity but are the same across classes, we get LDA (Linear Discriminant Analysis): each class covariance matrix is the same. [Figure: contours of class k and class l under a shared covariance matrix.]

Optimal classification:

argmax_k P(C_k | X) = argmax_k P(X, C_k) = argmax_k P(X | C_k) P(C_k)
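A small LDA sketch under the shared-covariance assumption above. The discriminant form δ_k(x) = x^T Σ^{−1} µ_k − ½ µ_k^T Σ^{−1} µ_k + log π_k, linear in x, follows from plugging the Gaussian class-conditional into the argmax rule (it matches Hastie et al., which these slides cite).

```python
import numpy as np

def lda_fit(X, y):
    """Estimate class priors, class means, and one pooled (shared) covariance matrix."""
    classes = np.unique(y)
    means = {c: X[y == c].mean(axis=0) for c in classes}
    priors = {c: np.mean(y == c) for c in classes}
    # Pooled covariance: centered outer products averaged over all classes.
    centered = np.vstack([X[y == c] - means[c] for c in classes])
    sigma = centered.T @ centered / (len(X) - len(classes))
    return priors, means, np.linalg.inv(sigma)

def lda_predict(model, x):
    priors, means, sigma_inv = model
    def delta(c):
        # delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log pi_k
        mu = means[c]
        return x @ sigma_inv @ mu - 0.5 * mu @ sigma_inv @ mu + np.log(priors[c])
    return max(priors, key=delta)
```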

The Decision Boundary. The boundary between classes k and l, {x : δ_k(x) = δ_l(x)}, is linear:

log [ P(C_k | X) / P(C_l | X) ] = log [ P(X | C_k) / P(X | C_l) ] + log [ P(C_k) / P(C_l) ]

Boundary points X are those where P(C_k | X) = P(C_l | X), i.e., where this log-ratio equals 0. With a shared covariance matrix the quadratic terms cancel, leaving a linear equation: the boundary is a line (hyperplane).

Visualization (three classes). [Figure: three Gaussian classes with the linear LDA boundaries between them.]

If the covariance matrices are neither the identity nor the same across classes, we get QDA (Quadratic Discriminant Analysis): each class keeps its own covariance matrix, and the decision boundaries become quadratic. [Figure: contours with class-specific covariances.]

LDA on an Expanded Basis. [Figure: LDA with a quadratic basis expansion versus QDA.]

Regularized Discriminant Analysis. [Slide content not recoverable from the transcription; RDA interpolates between the QDA and LDA covariance estimates.]

Today: relevant classifiers / KNN / LOOCV
- Logistic regression (cont.)
- Gaussian Naive Bayes classifier
  - Gaussian distribution
  - Gaussian NBC
  - LDA, QDA
  - Discriminative vs. Generative
- K-nearest neighbor
- LOOCV

LDA vs. Logistic Regression. [Figure slide; content not recoverable from the transcription.]

Discriminative vs. Generative. [Figure: Pr(class | height) as a function of height, comparing a discriminative fit (logistic regression) with a generative fit (Gaussian class-conditionals).]

Discriminative vs. Generative: Definitions.

h_gen and h_dis: generative and discriminative classifiers.
h_gen,inf and h_dis,inf: the same classifiers but trained on the entire population (asymptotic classifiers).
As n → infinity, h_gen → h_gen,inf and h_dis → h_dis,inf.

Ng, Andrew Y., and Michael I. Jordan. "On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes." Advances in Neural Information Processing Systems 14 (2002): 841.

Discriminative vs. Generative: Proposition 1 and Proposition 2. [The statements of the propositions are not recoverable from the transcription.] Notation:
- p: number of dimensions
- n: number of observations
- ε: generalization error

Logistic Regression vs. NBC.

Discriminative classifier (logistic regression):
- Smaller asymptotic error
- Slow convergence: needs a training-set size on the order of O(p)

Generative classifier (naive Bayes):
- Larger asymptotic error
- Can handle missing data (EM)
- Fast convergence: needs a training-set size on the order of O(log p)

[Figure: generalization error vs. size of training set, with curves for logistic regression and naive Bayes.]

[Figure: generalization error vs. size of training set, from the comparison study below.]

Xue, Jing-Hao, and D. Michael Titterington. "Comment on 'On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes'." Neural Processing Letters 28.3 (2008).

Logistic Regression vs. NBC. Empirically, generative classifiers approach their asymptotic error faster than discriminative ones: good for small training sets, and they handle missing data well (EM). Empirically, discriminative classifiers have lower asymptotic error than generative ones: good for larger training sets.

Today: Generative vs. Discriminative / KNN / LOOCV
- Logistic regression (cont.)
- Gaussian Naive Bayes classifier
  - Gaussian distribution
  - Gaussian NBC
  - LDA, QDA
  - Discriminative vs. Generative
- K-nearest neighbor
- LOOCV

Nearest Neighbor Classifiers. Basic idea: if it walks like a duck and quacks like a duck, then it is probably a duck. [Figure: compute the distance from a test sample to the training samples, then choose the k nearest samples.]

Nearest Neighbor Classifiers. [Figure: an unknown record among labeled training points.] They require three inputs:
1. The set of stored training samples
2. A distance metric to compute the distance between samples
3. The value of k, i.e., the number of nearest neighbors to retrieve

To classify an unknown sample:
1. Compute its distance to the other training records
2. Identify the k nearest neighbors
3. Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)

Definition of Nearest Neighbor. [Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor.] The k-nearest neighbors of a sample x are the data points that have the k smallest distances to x.

1-Nearest Neighbor: Voronoi Diagram. [Figure: the 1-NN decision regions form a Voronoi partition of the space.]

Nearest Neighbor Classification. Compute the distance between two points, for instance the Euclidean distance:

d(x, y) = sqrt( Σ_i (x_i − y_i)² )

Options for determining the class from the nearest-neighbor list (a sketch follows below):
- Take a majority vote of the class labels among the k nearest neighbors
- Weight the votes according to distance; for example, weight factor w = 1/d²

Choosing the value of k: if k is too small, the classifier is sensitive to noise points; if k is too large, the neighborhood may include points from other classes. [Figure.]
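A minimal k-NN sketch implementing the three-step procedure and the 1/d² distance weighting above; the zero-distance guard and the tie-breaking behavior (whichever label max() sees first) are arbitrary choices, not from the slides.

```python
import numpy as np
from collections import defaultdict

def knn_predict(X_train, y_train, x, k=3, weighted=False):
    """Classify x: compute distances, take the k nearest, and vote on their labels."""
    d = np.sqrt(((X_train - x) ** 2).sum(axis=1))   # Euclidean distance to every training point
    nearest = np.argsort(d)[:k]                     # indices of the k smallest distances
    votes = defaultdict(float)
    for i in nearest:
        # w = 1/d^2 (guarded against d = 0) when weighted, else a plain count
        votes[y_train[i]] += 1.0 / max(d[i] ** 2, 1e-12) if weighted else 1.0
    return max(votes, key=votes.get)
```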

Nearest Neighbor Classification: Scaling Issues. Attributes may have to be scaled to prevent the distance measure from being dominated by one of the attributes. Example:
- height of a person may vary from 1.5 m to 1.8 m
- weight of a person may vary from 90 lb to 300 lb
- income of a person may vary from $10K to $1M

Problem with the Euclidean measure: on high-dimensional data (curse of dimensionality) it can produce counter-intuitive results. [The slide's example, two pairs of high-dimensional binary vectors that come out at the same Euclidean distance, is garbled in the transcription.] One solution: normalize the vectors to unit length.
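A short sketch of the two fixes just mentioned: min-max scaling of each attribute, and normalizing each vector to unit length. Both are standard preprocessing steps, not code from these slides.

```python
import numpy as np

def min_max_scale(X):
    # Rescale every attribute (column) to [0, 1] so no single attribute dominates the distance.
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

def unit_normalize(X):
    # Scale every sample (row) to unit Euclidean length.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms > 0, norms, 1.0)
```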

The k-nearest-neighbor classifier is a lazy learner: it does not build a model explicitly, unlike eager learners such as decision tree induction and rule-based systems. As a result, classifying unknown samples is relatively expensive. The k-nearest-neighbor classifier is a local model, vs. the global model of linear classifiers.

Decision Boundaries in Global vs. Local Models. [Figure: linear regression vs. 15-nearest-neighbor vs. 1-nearest-neighbor decision boundaries.] Global models are stable but can be inaccurate; local models are accurate but unstable. What ultimately matters: GENERALIZATION.

K-Nearest-Neighbours for Classification (2). [Figure: decision boundaries for K = 3 and K = 1.]

K-Nearest-Neighbours for Classification (3). K acts as a smoother. As N → ∞, the error rate of the 1-nearest-neighbour classifier is never more than twice the optimal error (obtained from the true conditional class distributions).

Today: Generative vs. Discriminative / KNN / LOOCV
- Logistic regression (cont.)
- Gaussian Naive Bayes classifier
  - Gaussian distribution
  - Gaussian NBC
  - LDA, QDA
  - Discriminative vs. Generative
- K-nearest neighbor
- LOOCV

Dataset Cross-Validation (e.g., K = 3). k-fold cross-validation: split the dataset into K folds; in each round, train on K−1 folds and test on the held-out fold. [Figure: Train/Test fold assignments.]

Common Splitting Strategies: Leave-One-Out (n-fold cross-validation).

Leave-one-out cross-validation (LOOCV) is K-fold cross-validation taken to its logical extreme, with K equal to n, the number of data points in the set. That means that, n separate times, the model is trained on all the data except for one point, and a prediction is made for that point. As before, the average error is computed and used to evaluate the model.
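A generic LOOCV sketch of the procedure just described; `fit` and `predict` are placeholder callables for whatever classifier is being evaluated (e.g., the k-NN or Gaussian NB sketches above).

```python
import numpy as np

def loocv_error(fit, predict, X, y):
    """Train n separate times, each time holding out one point; average the 0/1 errors."""
    n = len(X)
    errors = 0
    for i in range(n):
        keep = np.arange(n) != i                 # every point except point i
        model = fit(X[keep], y[keep])            # train on the other n-1 points
        errors += predict(model, X[i]) != y[i]   # test on the held-out point
    return errors / n
```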

CV-Based Model Selection. We're trying to decide which algorithm to use. We train each machine, make a table of the cross-validated errors, and pick the algorithm with the lowest one. [The slide's table is not recoverable from the transcription.]

Which kind of cross-validation? [Slide content not recoverable from the transcription.]

Today Recap: Generative vs. Discriminative / KNN / LOOCV
- Logistic regression (cont.)
- Gaussian Naive Bayes classifier
  - Gaussian distribution
  - Gaussian NBC
  - LDA, QDA
  - Discriminative vs. Generative
- K-nearest neighbor
- LOOCV

References
- Prof. Tan, Steinbach, Kumar's "Introduction to Data Mining" slides
- Prof. Andrew Moore's slides
- Prof. Eric Xing's slides
- Hastie, Trevor, et al. The Elements of Statistical Learning. Vol. 2. New York: Springer, 2009.
