Lecture 9: Classification, LDA

Reading: Chapter 4
STATS 202: Data mining and analysis
October 13, 2017

Review: Main strategy in Chapter 4

Find an estimate $\hat P(Y \mid X)$. Then, given an input $x_0$, we predict the response as in a Bayes classifier:
$$\hat y_0 = \operatorname{argmax}_y \, \hat P(Y = y \mid X = x_0).$$

Linear Discriminant Analysis (LDA)

Instead of estimating $P(Y \mid X)$ directly, we will estimate:

1. $\hat P(X \mid Y)$: Given the response, what is the distribution of the inputs?
2. $\hat P(Y)$: How probable is each category?

Then, we use Bayes' rule to obtain the estimate:
$$\hat P(Y = k \mid X = x) = \frac{\hat P(X = x \mid Y = k)\,\hat P(Y = k)}{\hat P(X = x)} = \frac{\hat P(X = x \mid Y = k)\,\hat P(Y = k)}{\sum_j \hat P(X = x \mid Y = j)\,\hat P(Y = j)}.$$

Linear Discriminant Analysis (LDA)

Instead of estimating $P(Y \mid X)$, we compute estimates:

1. $\hat P(X = x \mid Y = k) = \hat f_k(x)$, where each $\hat f_k(x)$ is a multivariate normal density.
2. $\hat P(Y = k) = \hat\pi_k$, estimated by the fraction of training samples of class $k$.

[Figure: two panels with axes $X_1$ and $X_2$ illustrating the multivariate normal class densities.]
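
To make the two ingredients concrete, here is a minimal NumPy/SciPy sketch of plugging Gaussian class densities and priors into Bayes' rule; all of the numbers (means, covariance, priors) are made up for illustration and are not from the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up class means, a shared covariance, and priors for K = 2 classes.
mus    = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigma  = np.eye(2)
priors = np.array([0.6, 0.4])

x = np.array([1.0, 1.2])
densities = np.array([multivariate_normal.pdf(x, mean=mu, cov=Sigma) for mu in mus])

# Bayes' rule: posterior is density times prior, normalized over the classes.
posterior = densities * priors / np.sum(densities * priors)
print(posterior, posterior.argmax())
```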

LDA has linear decision boundaries

Suppose that:

- We know $P(Y = k) = \pi_k$ exactly.
- $P(X = x \mid Y = k)$ is multivariate normal with density:
$$f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)}$$
  where $\mu_k$ is the mean of the inputs for category $k$, and $\Sigma$ is the covariance matrix (common to all categories).

Then, what is the Bayes classifier?

LDA has linear decision boundaries

By Bayes' rule, the probability of category $k$, given the input $x$, is:
$$P(Y = k \mid X = x) = \frac{\text{likelihood} \times \text{prior}}{\text{marginal}} = \frac{f_k(x)\,\pi_k}{P(X = x)}$$

The denominator does not depend on the response $k$, so we can write it as a constant:
$$P(Y = k \mid X = x) = C\, f_k(x)\,\pi_k$$

Plugging in the formula for $f_k(x)$:
$$P(Y = k \mid X = x) = \frac{C\,\pi_k}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)}$$

LDA has linear decision boundaries

$$P(Y = k \mid X = x) = \frac{C\,\pi_k}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)}$$

Now, let us absorb everything that does not depend on $k$ into a constant $C'$:
$$P(Y = k \mid X = x) = C' \pi_k\, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)}$$

and take the logarithm of both sides:
$$\log P(Y = k \mid X = x) = \log C' + \log \pi_k - \tfrac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k).$$

This constant is the same for every category $k$, so we want to find the maximum of the remaining terms over $k$.

LDA has linear decision boundaries

Goal: maximize the following over $k$:
$$\log \pi_k - \tfrac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)$$
$$= \log \pi_k - \tfrac{1}{2}\left[ x^T \Sigma^{-1} x + \mu_k^T \Sigma^{-1} \mu_k \right] + x^T \Sigma^{-1} \mu_k$$
$$= C + \log \pi_k - \tfrac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + x^T \Sigma^{-1} \mu_k$$

We define the objective:
$$\delta_k(x) = \log \pi_k - \tfrac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + x^T \Sigma^{-1} \mu_k$$

At an input $x$, we predict the response with the highest $\delta_k(x)$.
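
As a minimal sketch, the objective translates directly into NumPy; the function name and arguments below are illustrative, and it assumes $\pi_k$, $\mu_k$, and $\Sigma$ are given.

```python
import numpy as np

def lda_discriminant(x, mu_k, Sigma_inv, pi_k):
    """delta_k(x) = log(pi_k) - 0.5 * mu_k' Sigma^{-1} mu_k + x' Sigma^{-1} mu_k."""
    return np.log(pi_k) - 0.5 * mu_k @ Sigma_inv @ mu_k + x @ Sigma_inv @ mu_k

# Predict the class with the largest discriminant score (made-up parameters).
mus, priors = [np.array([0.0, 0.0]), np.array([2.0, 2.0])], [0.6, 0.4]
Sigma_inv = np.linalg.inv(np.eye(2))
x = np.array([1.0, 1.2])
scores = [lda_discriminant(x, mu, Sigma_inv, pi) for mu, pi in zip(mus, priors)]
print(int(np.argmax(scores)))
```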

LDA has linear decision boundaries

What is the decision boundary? It is the set of points at which two classes are equally probable:
$$\delta_k(x) = \delta_l(x)$$
$$\log \pi_k - \tfrac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + x^T \Sigma^{-1} \mu_k = \log \pi_l - \tfrac{1}{2} \mu_l^T \Sigma^{-1} \mu_l + x^T \Sigma^{-1} \mu_l$$

This is a linear equation in $x$.

[Figure: two panels with axes $X_1$ and $X_2$ illustrating the linear decision boundaries.]
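
Rearranging the equality (not on the slide, but a one-line consequence of it) makes the linearity explicit: collecting the terms that involve $x$ on one side gives
$$x^T \Sigma^{-1}(\mu_k - \mu_l) = \tfrac{1}{2}\left(\mu_k^T \Sigma^{-1}\mu_k - \mu_l^T \Sigma^{-1}\mu_l\right) + \log\frac{\pi_l}{\pi_k},$$
which has the form $a^T x = b$ with $a = \Sigma^{-1}(\mu_k - \mu_l)$ and $b$ a constant, i.e. a hyperplane (a line when $p = 2$).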

Estimating $\pi_k$

$$\hat\pi_k = \frac{\#\{i : y_i = k\}}{n}$$

In words, the fraction of training samples of class $k$.

Estimating the parameters of $f_k(x)$

Estimate the mean vector $\mu_k$ for each class:
$$\hat\mu_k = \frac{1}{\#\{i : y_i = k\}} \sum_{i : y_i = k} x_i$$

Estimate the common covariance matrix $\Sigma$:

- One predictor ($p = 1$):
$$\hat\sigma^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i : y_i = k} (x_i - \hat\mu_k)^2.$$
- Many predictors ($p > 1$): compute the vectors of deviations $(x_1 - \hat\mu_{y_1}), (x_2 - \hat\mu_{y_2}), \dots, (x_n - \hat\mu_{y_n})$ and use an unbiased estimate of their covariance matrix, $\hat\Sigma$.
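
A minimal NumPy sketch of these estimates; the function name and return convention are illustrative, not from the slides.

```python
import numpy as np

def estimate_lda_parameters(X, y):
    """Estimate class priors, class means, and the pooled covariance matrix.

    X is an (n, p) array of inputs, y a length-n array of class labels.
    """
    n, p = X.shape
    classes = np.unique(y)
    K = len(classes)
    priors = np.array([np.mean(y == k) for k in classes])         # pi_hat_k
    means = np.array([X[y == k].mean(axis=0) for k in classes])   # mu_hat_k
    # Pooled within-class deviations; 1/(n - K) gives an unbiased estimate.
    deviations = np.vstack([X[y == k] - means[j] for j, k in enumerate(classes)])
    Sigma_hat = deviations.T @ deviations / (n - K)
    return classes, priors, means, Sigma_hat
```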

LDA prediction

For an input $x$, predict the class with the largest:
$$\hat\delta_k(x) = \log \hat\pi_k - \tfrac{1}{2} \hat\mu_k^T \hat\Sigma^{-1} \hat\mu_k + x^T \hat\Sigma^{-1} \hat\mu_k$$

The decision boundaries are defined by:
$$\log \hat\pi_k - \tfrac{1}{2} \hat\mu_k^T \hat\Sigma^{-1} \hat\mu_k + x^T \hat\Sigma^{-1} \hat\mu_k = \log \hat\pi_l - \tfrac{1}{2} \hat\mu_l^T \hat\Sigma^{-1} \hat\mu_l + x^T \hat\Sigma^{-1} \hat\mu_l$$

These are the solid lines in the figure below.

[Figure: two panels with axes $X_1$ and $X_2$; the estimated LDA decision boundaries are drawn as solid lines.]
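
In practice the whole fit-and-predict procedure is available off the shelf, for example in scikit-learn's LinearDiscriminantAnalysis. A small sketch on simulated data (the toy data and the query point are made up for illustration):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy two-class data with a shared covariance, just to exercise the API.
rng = np.random.default_rng(0)
X = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), size=100),
               rng.multivariate_normal([2, 2], np.eye(2), size=100)])
y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis()        # pooled-covariance LDA, as in the slides
lda.fit(X, y)                             # estimates pi_hat_k, mu_hat_k, Sigma_hat
print(lda.predict([[1.0, 1.0]]))          # class with the largest discriminant score
print(lda.predict_proba([[1.0, 1.0]]))    # estimated posterior probabilities
```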

Quadratic discriminant analysis (QDA)

The assumption that the inputs of every class have the same covariance $\Sigma$ can be quite restrictive.

[Figure: two panels with axes $X_1$ and $X_2$. Boundaries for Bayes (dashed), LDA (dotted), and QDA (solid).]

Quadratic discriminant analysis (QDA)

In quadratic discriminant analysis we estimate a mean $\hat\mu_k$ and a covariance matrix $\hat\Sigma_k$ for each class separately.

Given an input, it is easy to derive an objective function:
$$\delta_k(x) = \log \pi_k - \tfrac{1}{2} \mu_k^T \Sigma_k^{-1} \mu_k + x^T \Sigma_k^{-1} \mu_k - \tfrac{1}{2} x^T \Sigma_k^{-1} x - \tfrac{1}{2} \log |\Sigma_k|$$

This objective is now quadratic in $x$, and so are the decision boundaries.
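
The per-class objective also translates directly into code; a minimal NumPy sketch, with an illustrative function name and arguments:

```python
import numpy as np

def qda_discriminant(x, mu_k, Sigma_k, pi_k):
    """delta_k(x) for QDA, with a class-specific covariance Sigma_k."""
    Sigma_inv = np.linalg.inv(Sigma_k)
    _, logdet = np.linalg.slogdet(Sigma_k)   # numerically stable log|Sigma_k|
    return (np.log(pi_k)
            - 0.5 * mu_k @ Sigma_inv @ mu_k
            + x @ Sigma_inv @ mu_k
            - 0.5 * x @ Sigma_inv @ x
            - 0.5 * logdet)
```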

Quadratic discriminant analysis (QDA)

[Figure: two panels with axes $X_1$ and $X_2$ comparing the Bayes boundary, LDA, and QDA.]

Evaluating a classification method

We have talked about the 0-1 loss:
$$\frac{1}{m} \sum_{i=1}^{m} \mathbf{1}(y_i \neq \hat y_i).$$

It is possible to make the wrong prediction for some classes more often than others. The 0-1 loss doesn't tell you anything about this.

A much more informative summary of the error is a confusion matrix.
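
For instance, scikit-learn's confusion_matrix tabulates true classes against predicted classes; the label vectors below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

y_true = ["no", "no", "yes", "yes", "no", "yes"]
y_pred = ["no", "no", "no",  "yes", "no", "no"]

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred, labels=["no", "yes"]))
# [[3 0]
#  [2 1]]
```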

Example. Predicting default

Used LDA to predict credit card default in a dataset of 10K people. Predicted yes if $P(\text{default} = \text{yes} \mid X) > 0.5$.

The error rate among people who do not default (the false positive rate) is very low. However, the error rate among people who do default (the false negative rate) is 76%.

False negatives may be a bigger source of concern! One possible solution: change the threshold.

Example. Predicting default

Changing the threshold to 0.2 makes it easier to classify to yes: predicted yes if $P(\text{default} = \text{yes} \mid X) > 0.2$.

Note that the rate of false positives became higher! That is the price to pay for fewer false negatives.
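
Thresholding the estimated posteriors is a one-line change; a tiny sketch with made-up posterior probabilities rather than the Default data:

```python
import numpy as np

# Estimated P(default = yes | X) for five hypothetical customers.
p_yes = np.array([0.03, 0.12, 0.25, 0.48, 0.81])

pred_at_05 = p_yes > 0.5   # default threshold: only the last customer is flagged
pred_at_02 = p_yes > 0.2   # lower threshold: more customers flagged as "yes"
print(pred_at_05, pred_at_02, sep="\n")
```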

Example. Predicting default

Let's visualize the dependence of the error on the threshold.

[Figure: error rate as a function of the threshold, showing the false negative rate (error for defaulting customers), the false positive rate (error for non-defaulting customers), and the 0-1 loss or total error rate.]

Example. The ROC curve

[Figure: ROC curve, plotting the true positive rate against the false positive rate.]

The ROC curve displays the performance of the method for any choice of threshold. The area under the curve (AUC) measures the quality of the classifier:

- 0.5 is the AUC for a random classifier.
- The closer the AUC is to 1, the better.
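
Both the curve and the AUC are available in scikit-learn; a small sketch with made-up labels and scores (not the Default data):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])                 # 1 = default
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])    # estimated P(default = yes | X)

fpr, tpr, thresholds = roc_curve(y_true, scores)  # one point per candidate threshold
print(roc_auc_score(y_true, scores))              # area under the ROC curve
```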

Next time

Comparison of logistic regression, LDA, QDA, and KNN classification. Start Chapter 5: Resampling.