Linear Classification
CS 54: Machine Learning
Slides adapted from Lee Cooper, Joydeep Ghosh, and Sham Kakade
Review: Linear Regression
CS 54 [Spring 07] - H
Regression
Given an input vector x^T = (x_1, x_2, \ldots, x_p), we want to predict the quantitative response Y
Linear regression form: f(x) = \beta_0 + \sum_{i=1}^{p} x_i \beta_i
Least squares problem: \min_\beta \, (y - X\beta)^\top (y - X\beta) \;\Rightarrow\; \hat{\beta} = (X^\top X)^{-1} X^\top y
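As a minimal sketch of the closed-form least squares solution above, the normal equations can be solved directly with NumPy; the synthetic data and coefficient values below are purely illustrative:

```python
import numpy as np

# Synthetic data: y = 2 + 3x plus a little noise (coefficients are made up for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
X = np.column_stack([np.ones_like(x), x])   # prepend a column of 1s so beta_hat[0] is beta_0
y = 2.0 + 3.0 * x + 0.01 * rng.normal(size=50)

# Closed-form least squares: beta_hat = (X^T X)^{-1} X^T y
# (solve the normal equations rather than forming the inverse explicitly)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With near-noiseless data, `beta_hat` recovers the generating coefficients closely.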
Feature Selection
Brute force is infeasible for a large number of features
Algorithms:
Best subset selection (beyond 40 features is impractical)
Stepwise selection (forward and backward)
Regularization
Add a penalty term on the model parameters to achieve a simpler model or reduce sensitivity to the training data
Less prone to overfitting
\min_\beta \, L(X, y) + \lambda \, \text{penalty}(\beta)
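For the ridge (squared-norm) penalty, the penalized least squares problem still has a closed form; a small sketch with made-up data, showing that increasing the penalty shrinks the coefficient norm:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40)

def ridge(X, y, lam):
    """Minimizes ||y - X beta||^2 + lam * ||beta||^2; closed form (X^T X + lam I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

beta_ols = ridge(X, y, 0.0)      # lam = 0 recovers ordinary least squares
beta_shrunk = ridge(X, y, 100.0) # large lam shrinks the coefficients toward zero
```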
Ridge & Lasso Regularization
Coefficient paths for the prostate-cancer predictors (lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45): ridge coefficients plotted against df(λ), lasso coefficients against the shrinkage factor s
Figures 3.8 & 3.10 (Hastie et al.)
Thus far, regression: predict a continuous value given some inputs or features
Linear Classification
Linear Classifiers: Spam Filtering
spam vs. not spam
Linear Classifiers: Weather Prediction
Notation
Number of classes: K
A specific class: k
Set of classes: G
Prior probability of class k: \pi_k = \Pr(G = k)
\sum_{j=1}^{K} \pi_j = 1
Bayes Decision Theory
Statistical Decision Theory Revisited
Natural rule of classification: f(x) = \arg\max_{k=1,\ldots,K} \Pr(G = k \mid X = x)
Application of Bayes rule: \Pr(G = k \mid X = x) = \frac{\Pr(X = x \mid G = k)\Pr(G = k)}{\Pr(X = x)}
Since the denominator is the same across all classes: f(x) = \arg\max_{k=1,\ldots,K} \Pr(X = x \mid G = k)\, \pi_k
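The decision rule above (argmax of class-conditional likelihood times prior, with the common denominator dropped) can be sketched on a toy two-class problem; the class names, priors, and 1-D Gaussian class-conditionals are all made up for illustration:

```python
import numpy as np

# Toy setup: two classes with 1-D Gaussian class-conditional densities
priors = {"k1": 0.7, "k2": 0.3}
means = {"k1": 0.0, "k2": 2.0}
sigma = 1.0

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def bayes_classify(x):
    # argmax_k Pr(X = x | G = k) * pi_k  -- Pr(X = x) is common to all classes, so it's omitted
    scores = {k: gaussian_pdf(x, means[k], sigma) * priors[k] for k in priors}
    return max(scores, key=scores.get)
```

Points near a class mean (weighted by the prior) are assigned to that class, e.g. `bayes_classify(-1.0)` picks "k1" and `bayes_classify(3.0)` picks "k2" in this setup.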
Classification Evaluation
Misclassification Rate
Optimal decision boundary: the red area indicates the extra error incurred by a non-optimal boundary
\Pr(\text{mistake}) = \int_{R_1} \Pr(x, G_2)\,dx + \int_{R_2} \Pr(x, G_1)\,dx
Confusion Matrix & Metrics
Problems with Accuracy
Assumes equal cost for both types of error: cost(FN) = cost(FP)
Is 99% accuracy good? Depends on the problem and the domain
Compare to the base rate (i.e., always predicting the predominant class)
Receiver Operating Characteristic Curve
AUC = area under the ROC curve
https://en.wikipedia.org/wiki/Receiver_operating_characteristic
ROC Curves
The curve is monotonically increasing
Each point represents a different tradeoff (cost ratio) between FP and FN
Two non-intersecting curves: one method dominates the other
Two intersecting curves: one method is better for some cost ratios, and the other method is better for other cost ratios
Area Under ROC Curve (AUC)
> 0.9: excellent prediction (but potentially something fishy; check for information leakage)
0.8: good prediction
0.5: random prediction
< 0.5: something is wrong!
AUC is more robust in class-imbalanced situations
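AUC equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one (ties counting half), which gives a compact way to compute it without tracing the curve; a small sketch with made-up scores:

```python
def auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative
    (ties count as 1/2) -- equivalent to the area under the ROC curve."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

perfect = auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])     # every positive outranks every negative
random_ish = auc([0.9, 0.1, 0.9, 0.1], [1, 1, 0, 0])  # positives and negatives indistinguishable
```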
Discriminant Analysis
Bayes Classifier
MAP classifier (maximum a posteriori)
Outcome: partitioning of the input space
Classifier is optimal: statistically minimizes the error rate
Why not use the Bayes classifier all the time?
Discriminant Functions
Each class has a discriminant function: \delta_k(x)
Classify according to the best discriminant: \hat{G}(x) = \arg\max_{k=1,\ldots,K} \delta_k(x)
Can be formulated in terms of probabilities: \hat{G}(x) = \arg\max_{k=1,\ldots,K} \Pr(G = k \mid X = x)
Discriminant Analysis
Bayes rule: \Pr(G \mid X)\Pr(X) = \Pr(X \mid G)\Pr(G)
Application of Bayes theorem: \Pr(G = k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_{\ell=1}^{K} f_\ell(x)\,\pi_\ell}
Use the log-ratio for a two-class problem: \log \frac{\Pr(G = k \mid X = x)}{\Pr(G = \ell \mid X = x)} = \log \frac{f_k(x)}{f_\ell(x)} + \log \frac{\pi_k}{\pi_\ell}
Linear Regression Classifier
Each response category coded as an indicator variable
Fit a linear regression model to each column of the response indicator matrix simultaneously
Compute the fitted output and classify according to the largest component
Serious problems occur when the number of classes is greater than or equal to 3!
Linear Discriminant Analysis (LDA)
Assume each class density is a multivariate Gaussian:
f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k)\right)
LDA assumes the classes have a common covariance matrix \Sigma
Discriminant function: \delta_k(x) = x^\top \Sigma^{-1} \mu_k - \tfrac{1}{2}\mu_k^\top \Sigma^{-1} \mu_k + \log \pi_k
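The linear discriminant function above can be sketched directly; the shared covariance, class means, and priors below are made-up numbers, not estimated from data:

```python
import numpy as np

# Illustrative 2-class, 2-D setup with a shared (here identity) covariance
Sigma = np.array([[1.0, 0.0], [0.0, 1.0]])
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
pis = [0.5, 0.5]
Sigma_inv = np.linalg.inv(Sigma)

def lda_discriminant(x, mu, pi):
    # delta_k(x) = x^T Sigma^{-1} mu_k - (1/2) mu_k^T Sigma^{-1} mu_k + log pi_k
    return x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(pi)

def lda_classify(x):
    return int(np.argmax([lda_discriminant(x, m, p) for m, p in zip(mus, pis)]))
```

Because the quadratic term in x cancels under the common covariance, the resulting decision boundary is linear (here, the perpendicular bisector of the two means).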
LDA Decision Boundaries
True distributions with the same covariance and different means; estimated boundaries
Figure 4.5 (Hastie et al.)
LDA vs Linear Regression
Linear regression of the indicator matrix vs. linear discriminant analysis (axes X_1, X_2)
Figure 4.2 (Hastie et al.)
Quadratic Discriminant Analysis (QDA)
What if the covariances are not equal?
Quadratic discriminant functions: \delta_k(x) = -\tfrac{1}{2}\log|\Sigma_k| - \tfrac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) + \log \pi_k
Quadratic decision boundary
A covariance matrix must be estimated for each class
LDA vs. QDA Decision Boundaries
Figure 4.1 (Hastie et al.)
Gaussian Parameter Values
In practice, the parameters of the multivariate normal distribution are unknown
Estimate them using the training data
Prior: \hat{\pi}_k = N_k / N
Mean: \hat{\mu}_k = \sum_{g_i = k} x_i / N_k
Covariance (pooled): \hat{\Sigma} = \sum_{k=1}^{K} \sum_{g_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^\top / (N - K)
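The three estimates above translate almost line-for-line into NumPy; the tiny labelled dataset here is fabricated for illustration:

```python
import numpy as np

# Toy labelled data: two well-separated clusters, labels in g
X = np.array([[0.0, 0.0], [0.2, -0.1], [-0.1, 0.1],
              [2.0, 2.0], [2.1, 1.9], [1.9, 2.2]])
g = np.array([0, 0, 0, 1, 1, 1])
classes = np.unique(g)
N, K = len(g), len(classes)

pi_hat = np.array([(g == k).mean() for k in classes])         # N_k / N
mu_hat = np.array([X[g == k].mean(axis=0) for k in classes])  # per-class sample means
# Pooled covariance: within-class scatter summed over classes, divided by N - K
Sigma_hat = sum((X[g == k] - mu_hat[k]).T @ (X[g == k] - mu_hat[k])
                for k in classes) / (N - K)
```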
Regularized Discriminant Analysis
Compromise between LDA and QDA
Shrink the separate covariances of QDA towards the common covariance of LDA
Similar to ridge regression
\hat{\Sigma}_k(\alpha) = \alpha \hat{\Sigma}_k + (1 - \alpha) \hat{\Sigma}
Example: Vowel Data
Experiment recorded 528 instances of spoken words
Words fall into 11 classes ("vowels")
10 features for each instance
Regularized Discriminant Analysis on the Vowel Data
Misclassification rate (train and test) as a function of α
Optimum for the test data occurs close to QDA (α near 1)
Figure 4.7 (Hastie et al.)
Reduced-rank LDA
What if we want to further reduce the dimension to L, where L < K - 1?
Why?
Visualization
Regularization: some dimensions may not provide good separation between classes, just noise
Fisher's Linear Discriminant
Find the projection that maximizes the ratio of between-class variance to within-class variance
\frac{\text{between}}{\text{within}} = \frac{(a^\top (\mu_2 - \mu_1))^2}{a^\top (\Sigma_1 + \Sigma_2)\, a}
Figure 4.6 (Bishop)
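For two classes, the ratio above is maximized in closed form by a direction proportional to the within-class scatter inverse applied to the mean difference; a sketch with synthetic correlated Gaussian classes (all distribution parameters made up):

```python
import numpy as np

rng = np.random.default_rng(2)
# Two synthetic correlated Gaussian classes
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=200)
X2 = rng.multivariate_normal([2.0, 1.0], cov, size=200)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
# Within-class scatter S_W (sum of the two class scatter matrices)
Sw = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
# Fisher direction: a is proportional to S_W^{-1} (mu2 - mu1)
a = np.linalg.solve(Sw, mu2 - mu1)

def fisher_ratio(a):
    """Between-class variance over within-class variance along direction a."""
    between = (a @ (mu2 - mu1)) ** 2
    within = a @ Sw @ a
    return between / within
```

By construction, `a` attains a ratio at least as large as any other projection direction, e.g. the raw mean difference or a coordinate axis.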
Why Fisher Makes Sense
The following information is taken into account:
Spread of the class centroids: the direction joining the centroids separates the means
Shape of the data defined by the covariance: the minimum overlap can be found
Why Fisher Makes Sense: Graphically
Projected data maximizing between-class spread only vs. the discriminant direction
Figure 4.9 (Hastie et al.)
Vowel Data: 2-D Subspace
Linear discriminant analysis of the vowel training data, projected onto the first two canonical coordinates
Figure 4.4 (Hastie et al.)
Vowel Data: Reduced-rank LDA
Figure 4.10 (Hastie et al.)
Vowel Data: Reduced-rank LDA (2)
Classification in the reduced subspace (canonical coordinates 1 and 2)
Figure 4.11 (Hastie et al.)
Logistic Regression
Revisiting LDA for Binary Classes
LDA assumes the predictors are normally distributed
\log \frac{\Pr(G = k \mid X = x)}{\Pr(G = \ell \mid X = x)} = \log \frac{\pi_k}{\pi_\ell} - \tfrac{1}{2}(\mu_k + \mu_\ell)^\top \Sigma^{-1} (\mu_k - \mu_\ell) + x^\top \Sigma^{-1} (\mu_k - \mu_\ell) = \alpha_0 + \alpha^\top x
The log odds of class k vs. class \ell is a linear function of x
Why not estimate the coefficients directly?
Link Functions
How to combine regression and probability?
Use regression to model the posterior
Link function:
Maps from real values to [0, 1]
Need probabilities to sum to 1
Logistic Regression
Logistic function (or sigmoid): f(z) = \frac{1}{1 + \exp(-z)}
Apply the sigmoid to a linear function of the input features:
\Pr(G = 0 \mid X, \beta) = \frac{1}{1 + \exp(x^\top \beta)}
\Pr(G = 1 \mid X, \beta) = \frac{\exp(x^\top \beta)}{1 + \exp(x^\top \beta)}
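A minimal sketch of the two class-posterior formulas above; the coefficient vector and input are made-up numbers, with the leading 1 in x standing in for the intercept:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative coefficients [intercept, b1, b2] and an input with a leading 1 for the intercept
beta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.3, 0.7])

p1 = sigmoid(x @ beta)   # Pr(G = 1 | x) = exp(x'b) / (1 + exp(x'b))
p0 = 1.0 - p1            # Pr(G = 0 | x) = 1 / (1 + exp(x'b))
```

The two posteriors sum to 1 by construction, which is exactly what the link function buys us.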
Sigmoid Function
f(x) = \frac{1}{1 + \exp(-(w_0 + w_1 x))}
Fitting Logistic Regression Models
No longer straightforward (not simple least squares)
See the book for a discussion of the two-class case
Use optimization methods (Newton-Raphson)
In practice, use a software library
Optimization: Log Likelihood
Maximize the likelihood of your training data by assuming the class labels are conditionally independent:
L(\beta) = \prod_{i=1}^{n} \Pr(G = g_i \mid X = x_i), \quad \beta = \{\beta_0, \beta_1\}
Log likelihood:
\ell(\beta) = \sum_{i=1}^{n} \log \Pr(G = g_i \mid X = x_i) = \sum_{i=1}^{n} \log p_{g_i}(x_i; \beta)
Optimization: Logistic Regression
Log likelihood for logistic regression:
\ell(\beta) = \sum_{i=1}^{n} \left( y_i \beta^\top x_i - \log(1 + \exp(\beta^\top x_i)) \right)
Simple gradient ascent using the derivatives:
\frac{\partial \ell(\beta)}{\partial \beta} = \sum_{i=1}^{n} x_i (y_i - p(x_i; \beta))
The book illustrates Newton-Raphson, which uses 2nd-order information for better convergence
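The gradient expression above is easy to turn into a plain gradient-ascent loop; the synthetic data, learning rate, and iteration count below are all illustrative choices, not prescriptions:

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic 1-D data with an intercept column; labels drawn from a logistic model
X = np.column_stack([np.ones(100), rng.normal(size=100)])
true_beta = np.array([-0.5, 2.0])
y = (rng.uniform(size=100) < 1 / (1 + np.exp(-X @ true_beta))).astype(float)

def log_likelihood(beta):
    z = X @ beta
    return np.sum(y * z - np.log1p(np.exp(z)))

beta = np.zeros(2)
for _ in range(500):
    p = 1 / (1 + np.exp(-X @ beta))
    grad = X.T @ (y - p)   # d l(beta) / d beta = sum_i x_i (y_i - p(x_i; beta))
    beta += 0.01 * grad    # gradient *ascent* on the log-likelihood
```

After training, the log-likelihood is strictly better than at the zero initialization; Newton-Raphson would reach the same optimum in far fewer iterations.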
Logistic Regression Coefficients
How to interpret the coefficients? Similar to the interpretation for linear regression
Increasing the ith predictor x_i by 1 unit, keeping all other predictors fixed, increases:
The estimated log odds (class 1) by an additive factor \beta_i
The estimated odds (class 1) by a multiplicative factor \exp(\beta_i)
Example: South African Heart Disease
Predict myocardial infarction (heart attack)
Variables:
sbp: systolic blood pressure
tobacco: tobacco use
ldl: cholesterol measure
famhist: family history of myocardial infarction
obesity, alcohol, age
Example: South African Heart Disease
Table 4.2 (Hastie et al.)
Example: South African Heart Disease
Scatterplot matrix of the predictors (sbp, tobacco, ldl, famhist, obesity, alcohol, age)
Figure 4.12 (Hastie et al.)
Example: South African Heart Disease
L1-regularized coefficient paths \beta_j(\lambda) (age, famhist, ldl, tobacco, sbp, alcohol, obesity) plotted against \|\beta(\lambda)\|_1
Figure 4.13 (Hastie et al.)
Linear Separability & Logistic Regression
What happens when my data is completely separable?
Weights go to infinity
Infinite number of MLEs
Use some form of regularization to avoid this scenario
LDA vs Logistic Regression
LDA estimates the Gaussian parameters and priors (easy!)
Logistic regression estimates the coefficients directly based on maximum likelihood (harder!)
Both have linear decision boundaries, but they are different. Why?
LDA assumes a normal distribution within each class
Logistic regression is more flexible and robust in situations with outliers and non-normal class-conditional densities
Multiclass Logistic Regression
Extension to K classes: use K - 1 models
\log \frac{\Pr(G = j \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{0j} + \beta_j^\top x
Model the log odds of each class against a base class
Fit the coefficients jointly by maximum likelihood
Put them together to get the posteriors:
\Pr(G = i \mid x) = \frac{\exp(\beta_{0i} + \beta_i^\top x)}{1 + \sum_{j \neq K} \exp(\beta_{0j} + \beta_j^\top x)}
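Putting the K - 1 linear models together into posteriors can be sketched for K = 3 with a scalar input; the coefficient pairs below are made up for illustration:

```python
import numpy as np

# K = 3 classes: K - 1 = 2 models against the base class K; each entry is [beta_0j, beta_j]
betas = [np.array([0.2, 1.0]), np.array([-0.1, -1.0])]

def posteriors(x):
    """Pr(G = j | x) = exp(eta_j) / (1 + sum_j' exp(eta_j')); base class gets 1 in the numerator."""
    etas = [b[0] + b[1] * x for b in betas]
    denom = 1.0 + sum(np.exp(e) for e in etas)
    probs = [np.exp(e) / denom for e in etas]
    probs.append(1.0 / denom)  # base class K
    return probs

p = posteriors(0.5)
```

The base class carries an implicit coefficient vector of zeros, which is why its numerator is 1; the K posteriors sum to 1 by construction.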
Logistic Regression Properties
Advantages:
Parameters have useful interpretations: the effect of a 1-unit change in a feature is to increase the odds of a response multiplicatively by the factor \exp(\beta_i)
Quite robust, well developed
Disadvantages:
Parametric (but works for the entire exponential family of distributions)
Solution is not closed form (but still reasonably fast)
Logistic Regression: Additional Comments
Example of a generalized linear model with canonical link function = logit, corresponding to the Bernoulli distribution
For more information, see the short course by Heather Turner (http://statmath.wu.ac.at/courses/heather_turner/glmcourse_00.pdf)
Old technique, but still very widely used
Output layer for neural networks
Comparison on Vowel Recognition
Table 4.1 (Hastie et al.)
Generative vs Discriminative
Generative: separately model the class-conditional densities and priors
Examples: LDA, QDA
Discriminative: try to obtain the class boundaries directly, through a heuristic or by estimating posterior probabilities
Examples: decision trees, logistic regression
Generative vs Discriminative Analogy
Task: determine the language someone is speaking
Generative: learn each language and determine which language the speech belongs to
Discriminative: determine the linguistic differences without learning any language