Lecture 9: Classification, LDA

Reading: Chapter 4
STATS 202: Data mining and analysis
October 13, 2017

Review: Main strategy in Chapter 4

Find an estimate $\hat{P}(Y \mid X)$. Then, given an input $x_0$, we predict the response as in a Bayes classifier:

$$\hat{y}_0 = \arg\max_y \hat{P}(Y = y \mid X = x_0).$$

Linear Discriminant Analysis (LDA)

Instead of estimating $P(Y \mid X)$ directly, we will estimate:

1. $\hat{P}(X \mid Y)$: Given the response, what is the distribution of the inputs?
2. $\hat{P}(Y)$: How probable is each category?

Then, we use Bayes' rule to obtain the estimate:

$$\hat{P}(Y = k \mid X = x) = \frac{\hat{P}(X = x \mid Y = k)\,\hat{P}(Y = k)}{\sum_j \hat{P}(X = x \mid Y = j)\,\hat{P}(Y = j)}$$

Linear Discriminant Analysis (LDA)

Instead of estimating $P(Y \mid X)$, we compute the estimates:

1. $\hat{P}(X = x \mid Y = k) = \hat{f}_k(x)$, where each $\hat{f}_k(x)$ is a multivariate normal density.
2. $\hat{P}(Y = k) = \hat{\pi}_k$, estimated by the fraction of training samples of class $k$.

[Figure: contour plots of the fitted class densities $\hat{f}_k(x)$ over the predictors $X_1$ and $X_2$.]
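
To make the two ingredients concrete, here is a minimal Python sketch (not from the slides; the arrays `X` of shape (n, p) and integer labels `y` are illustrative). For simplicity it fits one covariance per class, whereas LDA proper pools a single covariance across classes, as derived below.

```python
import numpy as np
from scipy.stats import multivariate_normal

def posterior(x, X, y):
    """Estimate P(Y = k | X = x) by Bayes' rule with normal f_k."""
    classes = np.unique(y)
    pi = np.array([np.mean(y == k) for k in classes])        # pi_k
    f = np.array([
        multivariate_normal(X[y == k].mean(axis=0),
                            np.cov(X[y == k], rowvar=False)).pdf(x)
        for k in classes
    ])                                                        # f_k(x)
    return dict(zip(classes, pi * f / np.sum(pi * f)))        # Bayes' rule
```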

LDA has linear decision boundaries

Suppose that:

- We know $P(Y = k) = \pi_k$ exactly.
- $P(X = x \mid Y = k)$ is multivariate normal with density:
  $$f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1}(x - \mu_k)}$$
  where $\mu_k$ is the mean of the inputs for category $k$, and $\Sigma$ is the covariance matrix (common to all categories).

Then, what is the Bayes classifier?

LDA has linear decision boundaries

By Bayes' rule, the probability of category $k$, given the input $x$, is:

$$P(Y = k \mid X = x) = \frac{\text{likelihood} \cdot \text{prior}}{\text{marginal}} = \frac{f_k(x)\,\pi_k}{P(X = x)}$$

The denominator does not depend on the response $k$, so we can write it as a constant:

$$P(Y = k \mid X = x) = C\, f_k(x)\,\pi_k$$

Plugging in the formula for $f_k(x)$:

$$P(Y = k \mid X = x) = \frac{C\,\pi_k}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1}(x - \mu_k)}$$

LDA has linear decision boundaries

$$P(Y = k \mid X = x) = \frac{C\,\pi_k}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1}(x - \mu_k)}$$

Now, let us absorb everything that does not depend on $k$ into a constant $C'$:

$$P(Y = k \mid X = x) = C'\,\pi_k\, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1}(x - \mu_k)}$$

and take the logarithm of both sides:

$$\log P(Y = k \mid X = x) = \log C' + \log \pi_k - \frac{1}{2}(x - \mu_k)^T \Sigma^{-1}(x - \mu_k).$$

The constant $\log C'$ is the same for every category $k$, so we want to find the maximum of the remaining terms over $k$.

LDA has linear decision boundaries

Goal: maximize the following over $k$:

$$\log \pi_k - \frac{1}{2}(x - \mu_k)^T \Sigma^{-1}(x - \mu_k)$$
$$= \log \pi_k - \frac{1}{2}\left[ x^T \Sigma^{-1} x + \mu_k^T \Sigma^{-1} \mu_k \right] + x^T \Sigma^{-1} \mu_k$$
$$= C'' + \log \pi_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + x^T \Sigma^{-1} \mu_k$$

We define the objective:

$$\delta_k(x) = \log \pi_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + x^T \Sigma^{-1} \mu_k$$

At an input $x$, we predict the response with the highest $\delta_k(x)$.
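
A minimal numpy sketch of this decision rule (not from the slides), assuming known parameters: `pis` a length-K vector of priors, `mus` a K×p array of class means, and a shared p×p covariance `Sigma`:

```python
import numpy as np

def lda_discriminants(x, pis, mus, Sigma):
    """delta_k(x) = log pi_k - 0.5 mu_k' Sigma^-1 mu_k + x' Sigma^-1 mu_k."""
    Sigma_inv = np.linalg.inv(Sigma)
    return np.array([
        np.log(pi) - 0.5 * mu @ Sigma_inv @ mu + x @ Sigma_inv @ mu
        for pi, mu in zip(pis, mus)
    ])

# Predict the class with the highest discriminant:
# y_hat = np.argmax(lda_discriminants(x, pis, mus, Sigma))
```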

LDA has linear decision boundaries

What is the decision boundary? It is the set of points at which two classes are equally probable:

$$\delta_k(x) = \delta_l(x)$$
$$\log \pi_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + x^T \Sigma^{-1} \mu_k = \log \pi_l - \frac{1}{2}\mu_l^T \Sigma^{-1} \mu_l + x^T \Sigma^{-1} \mu_l$$

This is a linear equation in $x$.

[Figure: class density contours over $X_1$ and $X_2$ with the linear decision boundaries between classes.]

Estimating $\pi_k$

$$\hat{\pi}_k = \frac{\#\{i : y_i = k\}}{n}$$

In words, the fraction of training samples of class $k$.

Estimating the parameters of $f_k(x)$

Estimate the mean vector $\mu_k$ for each class:

$$\hat{\mu}_k = \frac{1}{\#\{i : y_i = k\}} \sum_{i : y_i = k} x_i$$

Estimate the common covariance matrix $\Sigma$:

- One predictor ($p = 1$):
  $$\hat{\sigma}^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i : y_i = k} (x_i - \hat{\mu}_k)^2.$$
- Many predictors ($p > 1$): Compute the vectors of deviations $(x_1 - \hat{\mu}_{y_1}), (x_2 - \hat{\mu}_{y_2}), \dots, (x_n - \hat{\mu}_{y_n})$ and use an unbiased estimate of their covariance matrix, $\hat{\Sigma}$.
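
Putting these estimates together in numpy (a sketch, not from the slides; `X` has shape (n, p) and `y` holds the class labels):

```python
import numpy as np

def fit_lda(X, y):
    """Return (pi_hat, mu_hat, Sigma_hat) with a pooled covariance."""
    classes = np.unique(y)
    n, p = X.shape
    K = len(classes)
    pi_hat = np.array([np.mean(y == k) for k in classes])
    mu_hat = np.array([X[y == k].mean(axis=0) for k in classes])
    # Deviations of each sample from its own class mean, pooled with
    # the unbiased 1/(n - K) scaling.
    dev = X - mu_hat[np.searchsorted(classes, y)]
    Sigma_hat = dev.T @ dev / (n - K)
    return pi_hat, mu_hat, Sigma_hat
```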

LDA prediction

For an input $x$, predict the class with the largest:

$$\hat{\delta}_k(x) = \log \hat{\pi}_k - \frac{1}{2}\hat{\mu}_k^T \hat{\Sigma}^{-1} \hat{\mu}_k + x^T \hat{\Sigma}^{-1} \hat{\mu}_k$$

The decision boundaries are defined by:

$$\log \hat{\pi}_k - \frac{1}{2}\hat{\mu}_k^T \hat{\Sigma}^{-1} \hat{\mu}_k + x^T \hat{\Sigma}^{-1} \hat{\mu}_k = \log \hat{\pi}_l - \frac{1}{2}\hat{\mu}_l^T \hat{\Sigma}^{-1} \hat{\mu}_l + x^T \hat{\Sigma}^{-1} \hat{\mu}_l$$

[Figure: fitted class densities over $X_1$ and $X_2$; the decision boundaries are the solid lines.]
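
In practice this is rarely coded by hand; scikit-learn's `LinearDiscriminantAnalysis` fits the same model. A self-contained sketch on toy data (all names here are illustrative):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy two-class data sharing one covariance, matching LDA's assumption.
rng = np.random.default_rng(0)
X0 = rng.multivariate_normal([0, 0], np.eye(2), size=100)
X1 = rng.multivariate_normal([2, 2], np.eye(2), size=100)
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis().fit(X, y)   # pooled covariance
y_hat = lda.predict(X)                         # argmax_k of delta_k(x)
probs = lda.predict_proba(X)                   # P(Y = k | X = x)
```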

Quadratic discriminant analysis (QDA)

The assumption that the inputs of every class have the same covariance $\Sigma$ can be quite restrictive.

[Figure: two-class example over $X_1$ and $X_2$; boundaries for Bayes (dashed), LDA (dotted), and QDA (solid).]

Quadratic discriminant analysis (QDA)

In quadratic discriminant analysis we estimate a mean $\hat{\mu}_k$ and a covariance matrix $\hat{\Sigma}_k$ for each class separately.

Given an input, it is easy to derive an objective function:

$$\delta_k(x) = \log \pi_k - \frac{1}{2}\mu_k^T \Sigma_k^{-1} \mu_k + x^T \Sigma_k^{-1} \mu_k - \frac{1}{2} x^T \Sigma_k^{-1} x - \frac{1}{2}\log |\Sigma_k|$$

This objective is now quadratic in $x$, and so are the decision boundaries.
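
The scikit-learn counterpart is `QuadraticDiscriminantAnalysis`. A sketch on toy data whose classes have different covariances (names illustrative):

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Two classes with different covariances, violating LDA's shared-Sigma
# assumption; QDA fits one Sigma_k per class.
rng = np.random.default_rng(1)
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=100)
X1 = rng.multivariate_normal([2, 2], [[1.0, -0.5], [-0.5, 1.0]], size=100)
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

qda = QuadraticDiscriminantAnalysis().fit(X, y)
y_hat = qda.predict(X)   # decision boundaries are quadratic in x
```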

Quadratic discriminant analysis (QDA)

[Figure: two scenarios over $X_1$ and $X_2$ comparing the Bayes boundary, LDA, and QDA.]

Evaluating a classification method

We have talked about the 0-1 loss:

$$\frac{1}{m} \sum_{i=1}^{m} 1(y_i \neq \hat{y}_i).$$

It is possible to make the wrong prediction for some classes more often than others, and the 0-1 loss doesn't tell you anything about this. A much more informative summary of the error is a confusion matrix.
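
A confusion matrix tabulates true classes against predicted classes; scikit-learn computes it directly. A tiny sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]
print(confusion_matrix(y_true, y_pred))
# [[2 1]
#  [1 2]]
```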

Example: Predicting default

Used LDA to predict credit card default in a dataset of 10K people. Predicted yes if $P(\text{default} = \text{yes} \mid X) > 0.5$.

- The error rate among people who do not default (the false positive rate) is very low.
- However, the error rate among people who do default (the false negative rate) is 76%.
- False negatives may be a bigger source of concern!
- One possible solution: change the threshold.

Example: Predicting default

Changing the threshold to 0.2 makes it easier to classify to yes: predicted yes if $P(\text{default} = \text{yes} \mid X) > 0.2$.

Note that the rate of false positives became higher! That is the price to pay for fewer false negatives.
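
Changing the threshold amounts to comparing the predicted probability of the positive class against a cutoff other than 0.5. A sketch, assuming a fitted classifier `lda` (e.g., from the earlier example) and inputs `X`, with classes no/yes coded as 0/1:

```python
# Column 1 of predict_proba holds P(default = yes | X).
p_yes = lda.predict_proba(X)[:, 1]

y_hat_50 = (p_yes > 0.5).astype(int)  # default threshold
y_hat_20 = (p_yes > 0.2).astype(int)  # flags "yes" more readily:
                                      # fewer false negatives,
                                      # more false positives
```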

Example: Predicting default

Let's visualize the dependence of the error on the threshold:

[Figure: error rate as a function of the threshold, showing the false negative rate (error for defaulting customers), the false positive rate (error for non-defaulting customers), and the 0-1 loss or total error rate.]

Example: The ROC curve

[Figure: ROC curve, plotting the true positive rate against the false positive rate.]

The ROC curve displays the performance of the method for any choice of threshold. The area under the curve (AUC) measures the quality of the classifier:

- 0.5 is the AUC for a random classifier.
- The closer the AUC is to 1, the better.
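
scikit-learn traces the ROC curve by sweeping the threshold over the predicted scores. A self-contained sketch with toy labels and probabilities:

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1]             # true 0/1 labels
p_yes = [0.1, 0.4, 0.35, 0.8]     # predicted P(Y = 1 | X)

fpr, tpr, thresholds = roc_curve(y_true, p_yes)  # one point per threshold
print(roc_auc_score(y_true, p_yes))              # 0.75; 0.5 would be random
```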

Next time

Comparison of logistic regression, LDA, QDA, and KNN classification. Start Chapter 5: Resampling.
