Linear Discriminant Analysis

Based in part on slides from the textbook and slides of Susan Holmes

November 9, 2012
Nearest centroid rule

Suppose we break our data matrix down by the labels, yielding $(X_j)_{1 \le j \le k}$ with sizes $\mathrm{nrow}(X_j) = n_j$. A simple rule for classification is: assign a new observation with features $x$ to

$$f(x) = \operatorname{argmin}_{1 \le j \le k} d(x, X_j).$$

What do we mean by distance here?
Nearest centroid rule

If we can assign a central point or centroid $\mu_j$ to each $X_j$, then we can define the distance above as the distance to the centroid $\mu_j$. This yields the nearest centroid rule

$$f(x) = \operatorname{argmin}_{1 \le j \le k} d(x, \mu_j).$$
Nearest centroid rule

This rule is described completely by the functions

$$h_{ij}(x) = d(x, \mu_j) - d(x, \mu_i),$$

with $f(x)$ being any $i$ such that $h_{ij}(x) \ge 0$ for all $1 \le j \le k$.
Nearest centroid rule

If $d(x, y) = \|x - y\|_2$, then the natural centroid is

$$\mu_j = \frac{1}{n_j} \sum_{l=1}^{n_j} X_{j,l}.$$

This rule just classifies points to the nearest $\mu_j$. But if the covariance matrix of our data is not $I$, this rule ignores structure in the data...
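Below is a minimal numpy sketch of this rule; `X_groups` is a hypothetical list of per-label $(n_j \times p)$ arrays, and the function names are mine, not from the slides.

```python
import numpy as np

def nearest_centroid_fit(X_groups):
    """Compute the centroid mu_j of each group (row means of X_j)."""
    return np.array([X_j.mean(axis=0) for X_j in X_groups])

def nearest_centroid_predict(x, centroids):
    """Assign x to the label whose centroid is closest in Euclidean distance."""
    dists = np.linalg.norm(centroids - x, axis=1)
    return int(np.argmin(dists))
```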
Why we should use covariance

[Figure: scatter plot]
Choice of distance

Often, there is some background model for our data that is equivalent to a given procedure. For this nearest centroid rule, using the Euclidean distance effectively assumes that, within the set of points $X_j$, the rows are multivariate Gaussian with covariance matrix proportional to $I$. It also implicitly assumes that the $n_j$'s are roughly equal. For instance, if one $n_j$ were very small, then just because a data point is close to that $\mu_j$ doesn't necessarily mean we should conclude it has label $j$: there might be a huge number of points with label $i$ near that $\mu_j$.
Gaussian discriminant functions

Suppose each group with label $j$ has its own mean $\mu_j$ and covariance matrix $\Sigma_j$, as well as proportion $\pi_j$. The Gaussian discriminant functions are defined as

$$h_{ij}(x) = h_i(x) - h_j(x),$$
$$h_i(x) = \log \pi_i - \log|\Sigma_i|/2 - (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)/2.$$

The first term weights the prior probability; the last two terms are $\log \varphi_{\mu_i,\Sigma_i}(x) = \log L(\mu_i, \Sigma_i \mid x)$. Ignoring the first two terms, the $h_i$'s are essentially within-group Mahalanobis distances...
Gaussian discriminant functions

The classifier assigns $x$ label $i$ if $h_i(x) \ge h_j(x)$ for all $j$. Or,

$$f(x) = \operatorname{argmax}_{1 \le i \le k} h_i(x).$$

This is equivalent to a Bayesian rule. We'll see more Bayesian rules when we talk about naïve Bayes... When all $\Sigma_i$'s and $\pi_i$'s are identical, the classifier is just nearest centroid using Mahalanobis distance instead of Euclidean distance.
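A sketch of these discriminant functions in numpy; the containers `pis`, `mus`, `Sigmas` for the group parameters are hypothetical.

```python
import numpy as np

def gaussian_discriminant(x, pi_i, mu_i, Sigma_i):
    """h_i(x) = log pi_i - log|Sigma_i|/2 - (x - mu_i)^T Sigma_i^{-1} (x - mu_i)/2."""
    _, logdet = np.linalg.slogdet(Sigma_i)
    resid = x - mu_i
    maha = resid @ np.linalg.solve(Sigma_i, resid)  # squared Mahalanobis distance
    return np.log(pi_i) - 0.5 * logdet - 0.5 * maha

def classify(x, pis, mus, Sigmas):
    """f(x) = argmax_i h_i(x)."""
    scores = [gaussian_discriminant(x, p, m, S) for p, m, S in zip(pis, mus, Sigmas)]
    return int(np.argmax(scores))
```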
Estimating discriminant functions

In practice, we will have to estimate $\pi_j, \mu_j, \Sigma_j$. Obvious estimates:

$$\hat\pi_j = \frac{n_j}{\sum_{j=1}^{k} n_j}, \qquad \hat\mu_j = \frac{1}{n_j} \sum_{l=1}^{n_j} X_{j,l}, \qquad \hat\Sigma_j = \frac{1}{n_j - 1} (X_j - \mathbf{1}\hat\mu_j^T)^T (X_j - \mathbf{1}\hat\mu_j^T).$$
Estimating discriminant functions

If we assume that the covariance matrix is the same within groups, then we might also form the pooled estimate

$$\hat\Sigma_P = \frac{\sum_{j=1}^{k} (n_j - 1)\hat\Sigma_j}{\sum_{j=1}^{k} (n_j - 1)}.$$

If we use the pooled estimate $\hat\Sigma_j = \hat\Sigma_P$ and plug these into the Gaussian discriminants, the functions $h_{ij}(x)$ are linear (or affine) functions of $x$. This is called Linear Discriminant Analysis (LDA). Not to be confused with the other LDA (Latent Dirichlet Allocation)...
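A sketch of these plug-in estimates, again assuming a hypothetical `X_groups` list of per-label $(n_j \times p)$ arrays:

```python
import numpy as np

def lda_estimates(X_groups):
    """Plug-in estimates pi_hat, mu_hat, and the pooled covariance Sigma_P."""
    n_js = np.array([X_j.shape[0] for X_j in X_groups])
    pis = n_js / n_js.sum()                                   # pi_hat_j = n_j / sum_j n_j
    mus = np.array([X_j.mean(axis=0) for X_j in X_groups])
    Sigmas = [np.cov(X_j, rowvar=False) for X_j in X_groups]  # each divides by n_j - 1
    # Pooled estimate: average of within-group estimates weighted by n_j - 1.
    Sigma_P = sum((n - 1) * S for n, S in zip(n_js, Sigmas)) / (n_js.sum() - len(X_groups))
    return pis, mus, Sigma_P
```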
Linear Discriminant Analysis using (petal.width, petal.length)

[Figure: LDA decision boundaries on the iris data]
Linear Discriminant Analysis using (sepal.width, sepal.length)

[Figure: LDA decision boundaries on the iris data]
Quadratic Discriminant Analysis

If we don't use the pooled estimate, and instead plug the individual estimates $\hat\Sigma_j$ into the Gaussian discriminants, the functions $h_{ij}(x)$ are quadratic functions of $x$. This is called Quadratic Discriminant Analysis (QDA).
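For reference, scikit-learn implements both rules directly; a short sketch on the iris data (the in-sample accuracies printed at the end are purely illustrative).

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis().fit(X, y)     # pooled covariance: linear boundaries
qda = QuadraticDiscriminantAnalysis().fit(X, y)  # per-class covariance: quadratic boundaries
print(lda.score(X, y), qda.score(X, y))
```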
Quadratic Discriminant Analysis using (petal.width, petal.length)

[Figure: QDA decision boundaries on the iris data]
Quadratic Discriminant Analysis using (sepal.width, sepal.length)

[Figure: QDA decision boundaries on the iris data]
Motivation for Fisher's rule

[Figure: scatter plot]
Fisher's discriminant function

Fisher proposed to classify using a linear rule. He first decomposed the total sum of squares into between-group and within-group pieces,

$$(X - \mathbf{1}\bar\mu^T)^T (X - \mathbf{1}\bar\mu^T) = \hat{S}_B + \hat{S}_W.$$

Then, he proposed

$$\hat{v} = \operatorname{argmax}_{v : v^T \hat{S}_W v = 1} v^T \hat{S}_B v.$$

Having found $\hat{v}$, form $X_j \hat{v}$ and the centroid $\hat\eta_j = \mathrm{mean}(X_j \hat{v})$. In the two-class problem $k = 2$, this is the same as LDA.
Fisher's discriminant functions

The direction $\hat{v}_1$ is an eigenvector of some matrix. There are others, up to $k - 2$ more. Suppose we find all $k - 1$ vectors and form $X_j V^T$, each one an $n_j \times (k - 1)$ matrix with centroid $\hat\eta_j \in \mathbb{R}^{k-1}$. The matrix $V$ determines a map from $\mathbb{R}^p$ to $\mathbb{R}^{k-1}$, so given a new data point we can compute $V x \in \mathbb{R}^{k-1}$. This gives rise to a classifier

$$f(x) = \text{nearest centroid}(V x).$$

This is LDA (assuming $\pi_j = 1/k$)...
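Concretely, the $\hat{v}$'s solve the generalized eigenproblem $\hat{S}_B v = \lambda \hat{S}_W v$; here is a minimal numpy/scipy sketch, with `X_groups` the same hypothetical list of per-label arrays as before (it assumes $\hat{S}_W$ is positive definite).

```python
import numpy as np
from scipy.linalg import eigh

def fisher_directions(X_groups):
    """Top k-1 Fisher directions, returned as the columns of a p x (k-1) matrix."""
    mus = [X_j.mean(axis=0) for X_j in X_groups]
    grand = np.vstack(X_groups).mean(axis=0)
    # Within- and between-group sums of squares.
    S_W = sum((X_j - m).T @ (X_j - m) for X_j, m in zip(X_groups, mus))
    S_B = sum(len(X_j) * np.outer(m - grand, m - grand) for X_j, m in zip(X_groups, mus))
    # Generalized symmetric eigenproblem S_B v = lambda S_W v (eigenvalues ascending).
    _, vecs = eigh(S_B, S_W)
    k = len(X_groups)
    return vecs[:, ::-1][:, :k - 1]
```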
Discriminant models in general

A discriminant model is generally a model that estimates

$$P(Y = j \mid x), \quad 1 \le j \le k.$$

That is, given that the features I observe are $x$, the probability I think this label is $j$... LDA and QDA are actually generative models, since they specify $P(X = x \mid Y = j)$. There are lots of discriminant models...
Logistic regression

The logistic regression model is ubiquitous in binary classification (two-class) problems. Model:

$$P(Y = 1 \mid x) = \frac{e^{\alpha + x^T \beta}}{1 + e^{\alpha + x^T \beta}} = \pi(\alpha, \beta, x).$$
Logistic regression

Software that fits a logistic regression model produces an estimate of $\beta$ based on a data matrix $X_{n \times p}$ and binary labels $Y_{n \times 1} \in \{0, 1\}^n$. It fits the model by minimizing what we call the deviance

$$\mathrm{DEV}(\beta) = -2 \sum_{i=1}^{n} \bigl( Y_i \log \pi(\beta, X_i) + (1 - Y_i) \log(1 - \pi(\beta, X_i)) \bigr).$$

While not immediately obvious, this is a convex minimization problem, hence fairly easy to solve. Unlike trees, the convexity yields a globally optimal solution.
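A sketch of the deviance as a function of $\beta$, written with a numerically stable $\log(1 + e^\eta)$; the design matrix `X` is assumed to carry a leading column of ones so that the intercept $\alpha$ is absorbed into $\beta$.

```python
import numpy as np
from scipy.optimize import minimize

def deviance(beta, X, Y):
    """DEV(beta) = -2 * sum_i [ Y_i log pi_i + (1 - Y_i) log(1 - pi_i) ]."""
    eta = X @ beta
    # With pi = e^eta / (1 + e^eta): log pi = eta - log(1 + e^eta),
    # and log(1 - pi) = -log(1 + e^eta).
    log1pexp = np.logaddexp(0.0, eta)  # stable log(1 + e^eta)
    return -2.0 * np.sum(Y * (eta - log1pexp) - (1 - Y) * log1pexp)

# Convexity means a generic optimizer reaches the global minimum:
# beta_hat = minimize(deviance, np.zeros(X.shape[1]), args=(X, Y)).x
```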
Logistic regression, setosa vs. {virginica, versicolor}, using (sepal.width, sepal.length)

[Figure: decision boundary]
Logistic regression, virginica vs. {setosa, versicolor}, using (sepal.width, sepal.length)

[Figure: decision boundary]
Logistic regression, versicolor vs. {setosa, virginica}, using (sepal.width, sepal.length)

[Figure: decision boundary]
Logistic regression

Logistic regression produces an estimate of

$$P(Y = 1 \mid x) = \pi(x).$$

Typically, we classify as 1 if $\pi(x) > 0.5$. This yields a $2 \times 2$ confusion matrix:

            Predicted: 0   Predicted: 1
Actual: 0   TN             FP
Actual: 1   FN             TP

From the $2 \times 2$ confusion matrix, we can compute Sensitivity, Specificity, etc.
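The entries can be tallied directly; a sketch, where `y_true` and `pi_hat` are hypothetical arrays of 0/1 labels and fitted probabilities.

```python
import numpy as np

def confusion(y_true, pi_hat, t=0.5):
    """Return (TN, FP, FN, TP) for the rule: classify as 1 when pi_hat > t."""
    y_pred = (pi_hat > t).astype(int)
    TN = np.sum((y_true == 0) & (y_pred == 0))
    FP = np.sum((y_true == 0) & (y_pred == 1))
    FN = np.sum((y_true == 1) & (y_pred == 0))
    TP = np.sum((y_true == 1) & (y_pred == 1))
    return TN, FP, FN, TP

# Sensitivity = TP / (TP + FN), Specificity = TN / (TN + FP).
```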
Logistic regression

However, we could choose the threshold differently, perhaps related to estimates of the prior probabilities of 0's and 1's. Each threshold $0 \le t \le 1$ then yields a new $2 \times 2$ confusion matrix:

            Predicted: 0   Predicted: 1
Actual: 0   TN(t)          FP(t)
Actual: 1   FN(t)          TP(t)
ROC curve

Generally speaking, we prefer classifiers that are both highly sensitive and highly specific. These confusion matrices can be summarized using an ROC (Receiver Operating Characteristic) curve. This is a plot of the curve

$$\bigl(1 - \mathrm{Specificity}(t),\ \mathrm{Sensitivity}(t)\bigr), \quad 0 \le t \le 1.$$

Often, Specificity is referred to as TNR (True Negative Rate), and $1 - \mathrm{Specificity}(t)$ as FPR (False Positive Rate). Often, Sensitivity is referred to as TPR (True Positive Rate), and $1 - \mathrm{Sensitivity}(t)$ as FNR (False Negative Rate).
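A sketch of tracing the curve by sweeping the threshold, with the same hypothetical `y_true` / `pi_hat` arrays as above (both classes are assumed present, so the denominators are nonzero).

```python
import numpy as np

def roc_points(y_true, pi_hat, n_thresholds=101):
    """Return a list of (FPR(t), TPR(t)) points for thresholds t in [0, 1]."""
    pts = []
    for t in np.linspace(0.0, 1.0, n_thresholds):
        y_pred = (pi_hat > t).astype(int)
        tpr = np.sum((y_true == 1) & (y_pred == 1)) / np.sum(y_true == 1)
        fpr = np.sum((y_true == 0) & (y_pred == 1)) / np.sum(y_true == 0)
        pts.append((fpr, tpr))
    return pts
```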
AUC: Area under ROC curve

[Figure: ROC curve vs. random guessing, with the point at t = 0.5 marked; axes are False Positive Rate and True Positive Rate]
ROC curve

Points in the upper left of the ROC plot are good. Any point on the diagonal line represents a classifier that is guessing randomly. A tree classifier, as we've discussed, seems to correspond to only one point in the ROC plot. But one can estimate probabilities based on frequencies in terminal nodes.
ROC curve

A common numeric summary of the ROC curve is

$$\mathrm{AUC}(\text{ROC curve}) = \text{Area under the ROC curve}.$$

It can be interpreted as an estimate of the probability that the classifier will give a random positive instance a higher score than a random negative instance. Its maximum value is 1; for a random guesser, the AUC is 0.5.
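That pairwise interpretation gives a direct, if $O(n^2)$, way to compute AUC; a sketch with hypothetical `y_true` / `scores` arrays (ties count half, matching the trapezoidal area under the empirical ROC curve).

```python
import numpy as np

def auc_by_pairs(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```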
AUC: Area under ROC curve

[Figure: ROC curves for the logistic classifier and a random guesser; axes are False Positive Rate and True Positive Rate]
ROC curve: logistic, rpart, lda

[Figure: ROC curves comparing logistic regression, rpart, and lda]