Lecture 4 Discriminant Analysis, k-nearest Neighbors

Fredrik Lindsten, Division of Systems and Control, Department of Information Technology, Uppsala University. Email: fredrik.lindsten@it.uu.se

Summary of Lecture 3 (I/VI)
The classification problem amounts to modeling the relationship between the input X and a qualitative output Y, i.e., the output belongs to one of K distinct classes.
A classifier is a prediction model $\hat{Y} = \hat{G}(X)$ that maps any input X to a predicted class $\hat{Y} \in \{1, \dots, K\}$.
The Bayes classifier assigns each input to the most likely class, according to the conditional probabilities $\Pr(Y = k \mid X)$ for $k \in \{1, \dots, K\}$.
The Bayes classifier is optimal w.r.t. misclassification error.

Summary of Lecture 3 (II/VI)
For binary (two-class) classification, $Y \in \{0, 1\}$, the logistic regression model is
$$q(X) \overset{\text{def}}{=} \Pr(Y = 1 \mid X) = \frac{e^{\beta^T X}}{1 + e^{\beta^T X}}.$$
The model parameters β are found by maximum likelihood, i.e., by (numerically) maximizing the log-likelihood function
$$\log \ell(\beta) = \sum_{i=1}^{n} \left( y_i \beta^T x_i - \log\left(1 + e^{\beta^T x_i}\right) \right).$$
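As a minimal sketch of the two formulas on this slide, the following evaluates the logistic model and its log-likelihood in plain Python. The toy data and the function names are assumptions made for illustration; the first input component is taken to be a constant 1 so that β includes an intercept.

```python
import math

def q(beta, x):
    """Logistic model: Pr(Y = 1 | X = x) = e^{beta^T x} / (1 + e^{beta^T x})."""
    z = sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))  # algebraically the same expression

def log_likelihood(beta, xs, ys):
    """log l(beta) = sum_i ( y_i beta^T x_i - log(1 + e^{beta^T x_i}) )."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = sum(b * xi for b, xi in zip(beta, x))
        total += y * z - math.log1p(math.exp(z))
    return total

# Hypothetical toy data: inputs (1, x) with labels y in {0, 1}.
xs = [(1.0, -2.0), (1.0, -1.0), (1.0, 1.0), (1.0, 2.0)]
ys = [0, 0, 1, 1]

print(log_likelihood((0.0, 0.0), xs, ys))  # equals 4 * log(1/2)
print(log_likelihood((0.0, 1.0), xs, ys))  # larger: this beta fits the labels better
```

A numerical optimizer would search over β for the largest value of `log_likelihood`; that step is not shown here.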

Summary of Lecture 3 (III/VI)
Approximating the Bayes classifier, we get the prediction model
$$\hat{G}(X) = \begin{cases} 1 & \text{if } q(X) > 0.5 \;\Leftrightarrow\; \beta^T X > 0, \\ 0 & \text{otherwise.} \end{cases}$$
This attempts to minimize the total misclassification error. More generally, we can use
$$\hat{G}(X) = \begin{cases} 1 & \text{if } q(X) > r, \\ 0 & \text{otherwise,} \end{cases}$$
where $0 \le r \le 1$ is a user-chosen threshold.

Summary of Lecture 3 (IV/VI)
Confusion matrix:
                        Predicted condition
                        Ŷ = 0    Ŷ = 1    Total
True condition  Y = 0   TN       FP       N
                Y = 1   FN       TP       P
                Total   N*       P*
For the classifier $\hat{G}(X) = I(q(X) > r)$, the numbers in the confusion matrix depend on the threshold r.
Decreasing r: TN and FN decrease, FP and TP increase.
Increasing r: TN and FN increase, FP and TP decrease.
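The dependence on the threshold can be checked on a small example. The sketch below counts TN, FP, FN and TP for the classifier I(q(x) > r); the predicted probabilities and labels are hypothetical values chosen for illustration, not data from the lecture.

```python
def confusion_counts(q_values, labels, r):
    """Return (TN, FP, FN, TP) for the classifier I(q(x) > r)."""
    tn = fp = fn = tp = 0
    for q, y in zip(q_values, labels):
        pred = 1 if q > r else 0
        if y == 0 and pred == 0:
            tn += 1
        elif y == 0 and pred == 1:
            fp += 1
        elif y == 1 and pred == 0:
            fn += 1
        else:
            tp += 1
    return tn, fp, fn, tp

# Hypothetical predicted probabilities q(x_i) and true labels y_i.
q_values = [0.05, 0.30, 0.45, 0.60, 0.55, 0.80, 0.95, 0.20]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

for r in (0.25, 0.5, 0.9):
    print(r, confusion_counts(q_values, labels, r))
```

On this toy data, raising r from 0.25 to 0.9 moves counts from FP and TP into TN and FN, as stated on the slide.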

Summary of Lecture 3 (V/VI)
Some commonly used derived performance measures are:
True positive rate: $\mathrm{TPR} = \frac{TP}{P} = \frac{TP}{FN + TP} \in [0, 1]$
False positive rate: $\mathrm{FPR} = \frac{FP}{N} = \frac{FP}{FP + TN} \in [0, 1]$
Precision: $\mathrm{Prec} = \frac{TP}{P^*} = \frac{TP}{FP + TP} \in [0, 1]$
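The three measures are simple ratios of confusion-matrix counts. A minimal sketch, with hypothetical counts chosen only to illustrate the arithmetic:

```python
def rates(tn, fp, fn, tp):
    """Return (TPR, FPR, precision) from confusion-matrix counts."""
    tpr = tp / (fn + tp)        # TP / P
    fpr = fp / (fp + tn)        # FP / N
    precision = tp / (fp + tp)  # TP / P*
    return tpr, fpr, precision

# Hypothetical counts, not taken from the lecture.
tpr, fpr, prec = rates(tn=80, fp=20, fn=10, tp=40)
print(tpr, fpr, prec)
```

The sketch does not guard against empty rows or columns (for example P* = 0), where the corresponding ratio is undefined.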

Summary of Lecture 3 (VI/VI)
Figure: ROC curve for logistic regression (horizontal axis: false positive rate; vertical axis: true positive rate).
ROC: plot of TPR vs. FPR as r ranges from 0 to 1. (It is possible to draw ROC curves based on other tuning parameters as well.)
Area Under Curve (AUC): condensed performance measure for the classifier, taking all possible thresholds into account.
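One way to sketch the ROC construction is to sweep the threshold r over the observed probabilities and record (FPR, TPR) at each step; the AUC is then approximated with the trapezoidal rule. The data below is hypothetical, and the sketch assumes all predicted probabilities lie strictly between 0 and 1 so that the curve runs from (0, 0) to (1, 1).

```python
def roc_points(q_values, labels):
    """Return (FPR, TPR) pairs for thresholds r swept from 1 down to 0."""
    p = sum(labels)
    n = len(labels) - p
    thresholds = sorted(set(q_values) | {0.0, 1.0}, reverse=True)
    points = []
    for r in thresholds:
        tp = sum(1 for q, y in zip(q_values, labels) if q > r and y == 1)
        fp = sum(1 for q, y in zip(q_values, labels) if q > r and y == 0)
        points.append((fp / n, tp / p))
    return points

def auc(points):
    """Area under the piecewise-linear curve through the points (trapezoidal rule)."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

# Hypothetical predicted probabilities and true labels.
q_values = [0.05, 0.30, 0.45, 0.60, 0.55, 0.80, 0.95, 0.20]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

points = roc_points(q_values, labels)
print(points[0], points[-1])
print(auc(points))
```

An AUC of 0.5 corresponds to the diagonal (random guessing) and 1 to a classifier that ranks every positive above every negative.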

Contents Lecture 4
1. Linear Discriminant Analysis
2. Quadratic Discriminant Analysis
3. A nonparametric classifier: k-nearest Neighbors

Bayes' theorem
Figure: (Possibly) Thomas Bayes, c. 1701–1761.
Bayes' theorem describes a fundamental relationship between conditional and marginal probabilities:
$$p(z_1 \mid z_2) = \frac{p(z_2 \mid z_1)\, p(z_1)}{p(z_2)} = \frac{p(z_2 \mid z_1)\, p(z_1)}{\int p(z_2 \mid z_1)\, p(z_1)\, dz_1}.$$

Bayes' theorem for classification
Recall the Bayes classifier, based on the conditional probabilities $\Pr(Y = k \mid X = x)$. Let
- $\pi_k \overset{\text{def}}{=} \Pr(Y = k)$ denote the prior (marginal) probability of class $k \in \{1, \dots, K\}$,
- $f_k(x) \overset{\text{def}}{=} p(x \mid Y = k)$ denote the probability density of X for an observation that comes from the kth class.
The Bayes classifier can then be expressed using Bayes' theorem:
$$\Pr(Y = k \mid X = x) = \frac{f_k(x)\, \pi_k}{\sum_{j=1}^{K} f_j(x)\, \pi_j}.$$
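The posterior formula can be sketched directly in code. Here the class densities f_k are taken to be univariate Gaussians with hypothetical means and priors; that choice is an assumption made for illustration, and classes are indexed from 0 rather than 1.

```python
import math

def gauss_pdf(x, mu, sigma):
    """Univariate Gaussian density, a hypothetical choice of f_k(x)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posterior(x, priors, densities):
    """Pr(Y = k | X = x) = f_k(x) pi_k / sum_j f_j(x) pi_j, for every class k."""
    joint = [f(x) * pi for f, pi in zip(densities, priors)]
    total = sum(joint)
    return [j / total for j in joint]

priors = [0.7, 0.3]  # pi_k, hypothetical
densities = [lambda x: gauss_pdf(x, -1.0, 1.0),
             lambda x: gauss_pdf(x, 1.0, 1.0)]

post = posterior(0.0, priors, densities)
print(post)                                      # both densities agree at x = 0
print(max(range(len(post)), key=post.__getitem__))  # Bayes classifier: most likely class
```

At x = 0 the two densities are equal, so the posterior reduces to the priors and the more common class is predicted.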

Multivariate Gaussian density
The p-dimensional Gaussian (normal) probability density function with mean vector µ and covariance matrix Σ is
$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right),$$
where µ is a p × 1 vector and Σ is a p × p positive definite matrix.
Let $X = (X_1, \dots, X_p)^T \sim \mathcal{N}(\mu, \Sigma)$. Then
- $\mu_j$ is the mean of $X_j$,
- $\Sigma_{jj}$ is the variance of $X_j$,
- $\Sigma_{ij}$ ($i \neq j$) is the covariance between $X_i$ and $X_j$.
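For p = 2 the density can be evaluated without a linear algebra library, since the determinant and inverse of a 2 × 2 matrix have closed forms. The following is a sketch under that restriction; the function name and the example covariance are assumptions for illustration.

```python
import math

def gauss2_pdf(x, mu, S):
    """Bivariate Gaussian density N(x | mu, S) with S = [[a, b], [b, c]]."""
    (a, b), (_, c) = S
    det = a * c - b * b  # |Sigma|, positive for a positive definite matrix
    d0, d1 = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu) via the 2x2 inverse.
    quad = (c * d0 * d0 - 2 * b * d0 * d1 + a * d1 * d1) / det
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))

identity = [[1.0, 0.0], [0.0, 1.0]]
correlated = [[2.0, 0.5], [0.5, 1.0]]  # hypothetical covariance

print(gauss2_pdf((0.0, 0.0), (0.0, 0.0), identity))    # peak value 1 / (2 pi)
print(gauss2_pdf((1.0, 1.0), (0.0, 0.0), correlated))
```

For general p one would typically use a Cholesky factorization of Σ rather than an explicit inverse.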

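Multivariate Gaussian density
Figure: surface plots of two bivariate Gaussian PDFs over $(x_1, x_2)$ (vertical axis: value of PDF), one with $\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ and one with $\Sigma = \begin{pmatrix} 1 & 0.4 \\ 0.4 & 0.5 \end{pmatrix}$.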

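LDA, summary
The LDA classifier assigns a test input X = x to the class k for which
$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$$
is largest, where the parameters are estimated from the training data as
$$\pi_k = \frac{n_k}{n}, \quad k = 1, \dots, K,$$
$$\mu_k = \frac{1}{n_k} \sum_{i : y_i = k} x_i, \quad k = 1, \dots, K,$$
$$\Sigma = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i : y_i = k} (x_i - \mu_k)(x_i - \mu_k)^T,$$
with $n_k$ the number of training data points in class k.
$\delta_k(x)$ is a linear function of x, and LDA is therefore a linear classifier (its decision boundaries are linear).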
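The estimates and the discriminant function above can be sketched in plain Python for two-dimensional inputs, where the inverse of Σ has a closed form. The function names and the toy data are assumptions for illustration; this is not the `lda` routine used in the lecture's R example.

```python
import math

def fit_lda(xs, ys):
    """Estimate pi_k, mu_k and the pooled 2x2 covariance Sigma from training data."""
    classes = sorted(set(ys))
    n, K = len(xs), len(classes)
    pi, mu = {}, {}
    s00 = s01 = s11 = 0.0
    for k in classes:
        pts = [x for x, y in zip(xs, ys) if y == k]
        pi[k] = len(pts) / n
        mu[k] = (sum(p[0] for p in pts) / len(pts),
                 sum(p[1] for p in pts) / len(pts))
        for p in pts:
            d0, d1 = p[0] - mu[k][0], p[1] - mu[k][1]
            s00 += d0 * d0
            s01 += d0 * d1
            s11 += d1 * d1
    scale = 1.0 / (n - K)
    return pi, mu, (s00 * scale, s01 * scale, s11 * scale)

def delta(x, pi_k, mu_k, sigma):
    """delta_k(x) = x^T Sigma^{-1} mu_k - 1/2 mu_k^T Sigma^{-1} mu_k + log pi_k."""
    a, b, c = sigma  # Sigma = [[a, b], [b, c]]
    det = a * c - b * b
    # w = Sigma^{-1} mu_k using the closed-form 2x2 inverse.
    w = ((c * mu_k[0] - b * mu_k[1]) / det, (-b * mu_k[0] + a * mu_k[1]) / det)
    return (x[0] * w[0] + x[1] * w[1]
            - 0.5 * (mu_k[0] * w[0] + mu_k[1] * w[1]) + math.log(pi_k))

def predict(x, pi, mu, sigma):
    """Assign x to the class with the largest discriminant value."""
    return max(pi, key=lambda k: delta(x, pi[k], mu[k], sigma))

# Hypothetical toy data: two classes in the plane.
xs = [(0, 0), (1, 0), (0, 1), (1, 1), (3, 3), (4, 3), (3, 4), (4, 4)]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

pi, mu, sigma = fit_lda(xs, ys)
print(pi, mu, sigma)
print(predict((0.5, 1.0), pi, mu, sigma), predict((3.0, 3.5), pi, mu, sigma))
```

With equal priors and this symmetric toy data, the boundary is the line x_1 + x_2 = 4, halfway between the two class means.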

ex) LDA decision boundary
Illustration of the LDA decision boundary: the level curves of two Gaussian PDFs with the same covariance matrix intersect along a straight line, $\delta_1(x) = \delta_2(x)$.
Figure: level curves of the two PDFs and the linear decision boundary (axes $x_1$ and $x_2$).
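A short worked step, using only the definition of δ_k(x) from the LDA summary and the symmetry of Σ, shows why the boundary is a straight line (a hyperplane in general). The symbols w and b are introduced here for illustration and do not appear in the slides.

```latex
\begin{align*}
\delta_1(x) = \delta_2(x)
  &\iff x^T \Sigma^{-1} \mu_1 - \tfrac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \log \pi_1
      = x^T \Sigma^{-1} \mu_2 - \tfrac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 + \log \pi_2 \\
  &\iff w^T x + b = 0,
\end{align*}
where
\begin{align*}
w &= \Sigma^{-1} (\mu_1 - \mu_2), \\
b &= -\tfrac{1}{2} \left( \mu_1^T \Sigma^{-1} \mu_1 - \mu_2^T \Sigma^{-1} \mu_2 \right)
     + \log \frac{\pi_1}{\pi_2}.
\end{align*}
```

The equation is linear in x because the only term that could be quadratic in x is the same for both classes when they share Σ; changing the priors shifts b, and hence the boundary, without changing its direction.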

ex) Simple spam filter
We will use LDA to construct a simple spam filter:
Output: Y ∈ {spam, ham}
Input: X = 57-dimensional vector of features extracted from the email:
- Frequencies of 48 predefined words (make, address, all, ...)
- Frequencies of 6 predefined characters (;, (, [, !, $, #)
- Average length of uninterrupted sequences of capital letters
- Length of longest uninterrupted sequence of capital letters
- Total number of capital letters in the email
The dataset consists of 4,601 emails classified as either spam or ham (split into 75 % training and 25 % testing).
Source: UCI Machine Learning Repository, Spambase Dataset (https://archive.ics.uci.edu/ml/datasets/spambase)

ex) Simple spam filter
> library(MASS)
> lda.fit = lda(y ~ ., data = dataset, subset = -test)
LDA confusion matrix (test data):
Table: Threshold r = 0.5
         Ŷ = 0   Ŷ = 1
Y = 0    667     23
Y = 1    102     358
Table: Threshold r = 0.9
         Ŷ = 0   Ŷ = 1
Y = 0    687     3
Y = 1    237     223

ex) Simple spam filter
A linear classifier uses a linear decision boundary to separate the two classes as well as possible. In $\mathbb{R}^p$ this is a (p − 1)-dimensional hyperplane.
How can we visualize the classifier? One way: project the (labeled) test inputs onto the normal of the decision boundary.
Figure: histograms of the projected test inputs for Y = 0 and Y = 1 (horizontal axis: projection on normal).
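The projection can be sketched as follows, assuming the decision boundary is written as w^T x + b = 0 (for two-class LDA, w is proportional to Σ^{-1}(µ_1 − µ_0)). The names w and b and the numbers below are illustrative assumptions. The function returns the scalar projection onto the unit normal, shifted so that the boundary itself sits at zero.

```python
import math

def project_on_normal(x, w, b):
    """Signed distance of x from the hyperplane w^T x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

# Hypothetical hyperplane in the plane: 3 x_1 + 4 x_2 - 5 = 0.
w, b = (3.0, 4.0), -5.0
print(project_on_normal((3.0, 4.0), w, b))  # on the positive side
print(project_on_normal((0.0, 0.0), w, b))  # on the negative side
```

Plotting a histogram of these values per class reduces the p-dimensional picture to one dimension, with the sign indicating the predicted class at threshold r = 0.5.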

ex) Simple spam filter
Feature      Count   Input (X)
make         0       X_1 = 0
address      0       X_2 = 0
all          1       X_3 = 0.48
...
char ;       1       X_49 = 0.07
char (       3       X_50 = 0.21
char [       0       X_51 = 0
char !       2       X_52 = 0.14
...
#Capitals    4       X_57 = 4
Pr(ham | X) = 96.4% (according to LDA model)

Gmail's spam filter
Official Gmail blog, July 9, 2015:
"... the spam filter now uses an artificial neural network to detect and block the especially sneaky spam, the kind that could actually pass for wanted mail."
Sri Harsha Somanchi, Product Manager

Quadratic discriminant analysis
Do we have to assume a common covariance matrix? No! Estimating a separate covariance matrix for each class leads to an alternative method, Quadratic Discriminant Analysis (QDA).
Whether we should choose LDA or QDA has to do with the bias-variance trade-off, i.e. the risk of over- or underfitting.
Compared to LDA, QDA
- has more parameters,
- is more flexible (lower bias),
- has a higher risk of overfitting (larger variance).
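To make "more parameters" concrete, one can count the free parameters of each model: K mean vectors of length p, symmetric covariance matrices with p(p + 1)/2 distinct entries each, and K − 1 free class priors. This accounting is an illustrative assumption about how to count, not something stated in the slides.

```python
def lda_param_count(p, K):
    """K means, one shared symmetric covariance, K - 1 free priors."""
    return K * p + p * (p + 1) // 2 + (K - 1)

def qda_param_count(p, K):
    """K means, K symmetric covariances (one per class), K - 1 free priors."""
    return K * p + K * p * (p + 1) // 2 + (K - 1)

# p = 57 inputs and K = 2 classes, as in the spam filter example.
print(lda_param_count(57, 2))
print(qda_param_count(57, 2))
```

The gap grows linearly in K and quadratically in p, which is why QDA needs more training data per class to keep the variance under control.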

ex) QDA decision boundary
Illustration of the QDA decision boundary: the level curves of two Gaussian PDFs with different covariance matrices intersect along a curved (quadratic) line.
Figure: level curves of the two PDFs and the curved decision boundary (axes $x_1$ and $x_2$).
Thus, QDA is not a linear classifier.
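The slides do not write out the QDA discriminant function. In its standard form, obtained by taking the logarithm of f_k(x) π_k with a class-specific Gaussian density and dropping the constant shared by all classes, it reads as follows (Σ_k denotes the covariance matrix of class k, a symbol assumed here):

```latex
\begin{align*}
\delta_k(x)
  &= -\tfrac{1}{2} \log |\Sigma_k|
     - \tfrac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)
     + \log \pi_k \\
  &= -\tfrac{1}{2} x^T \Sigma_k^{-1} x
     + x^T \Sigma_k^{-1} \mu_k
     - \tfrac{1}{2} \mu_k^T \Sigma_k^{-1} \mu_k
     - \tfrac{1}{2} \log |\Sigma_k|
     + \log \pi_k .
\end{align*}
```

The term $-\tfrac{1}{2} x^T \Sigma_k^{-1} x$ depends on the class, so it does not cancel when two discriminants are compared, and the boundary is quadratic in x. If all $\Sigma_k$ equal a common Σ, that term and $\log |\Sigma_k|$ are identical across classes and the expression reduces to the linear LDA discriminant.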

Parametric and nonparametric models
So far we have looked at a few parametric models: linear regression, logistic regression, LDA and QDA, all of which are parametrized by a fixed-dimensional parameter.
Non-parametric models instead allow the flexibility of the model to grow with the amount of available data. They
- can be very flexible (= low bias),
- can suffer from overfitting (high variance),
- can be compute and memory intensive.
As always, the bias-variance trade-off is key, also when working with non-parametric models.

k-nn
The k-nearest neighbors (k-nn) classifier is a simple non-parametric method. Given training data $T = \{(x_i, y_i)\}_{i=1}^{n}$, for a test input X = x:
1. Identify the k training inputs $x_i$ nearest to x, represented by the index set $N = \{i : x_i \text{ is in the k-neighborhood of } x\}$.
2. Classify x according to a majority vote within the neighborhood N, i.e. assign x to the class j for which $\sum_{i \in N} I(y_i = j)$ is largest (ties are often handled by a coin flip).
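The two steps can be sketched in a few lines of Python. Euclidean distance is assumed as the notion of "nearest" (the slide does not fix a metric), and the function name and toy data are illustrative.

```python
import math
import random
from collections import Counter

def knn_predict(x, train_x, train_y, k, rng=random):
    """Classify x by majority vote among its k nearest training inputs."""
    # Step 1: indices of the k training inputs nearest to x (Euclidean distance).
    order = sorted(range(len(train_x)), key=lambda i: math.dist(x, train_x[i]))
    neighborhood = order[:k]
    # Step 2: majority vote; a tie between classes is broken by a random choice.
    votes = Counter(train_y[i] for i in neighborhood)
    top = max(votes.values())
    winners = sorted(c for c, v in votes.items() if v == top)
    return winners[0] if len(winners) == 1 else rng.choice(winners)

# Hypothetical toy training data with two classes.
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [1, 1, 1, 2, 2, 2]

print(knn_predict((0.2, 0.3), train_x, train_y, k=3))
print(knn_predict((5.5, 5.5), train_x, train_y, k=3))
```

There is no training step: the whole training set is stored and searched at prediction time, which is the source of the compute and memory cost of the method.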

ex) k-nn on a toy model
We illustrate the k-nn classifier on a synthetic example where the Bayes classifier is known (colored regions).
Figure: the three classes Y=1, Y=2 and Y=3 in the $(X_1, X_2)$ plane, with the Bayes classifier shown as colored regions.

ex) k-nn on a toy model
The decision boundaries for the k-nn classifier are shown as black lines.
Figure: k-nn decision boundaries on the toy model (panel title: k=1; classes Y=1, Y=2, Y=3; axes $X_1$ and $X_2$).

ex) k-nn on a toy model
Figure: three panels showing the k-nn decision boundaries for different values of k (classes Y=1, Y=2, Y=3; axes $X_1$ and $X_2$).
The choice of the tuning parameter k controls the model flexibility:
- Small k ⇒ small bias, large variance
- Large k ⇒ large bias, small variance

ex) k-nn on a toy model
Figure: training error, test error and Bayes error as functions of 1/k (vertical axis: error).
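The two ends of the training-error curve can be checked on a small example. With k = 1 every training point is its own nearest neighbor, so the training error is zero (assuming distinct inputs); with k = n every point is assigned to the majority class. The one-dimensional data below is hypothetical, and the sketch uses odd k with two classes so that no vote ties occur.

```python
from collections import Counter

def knn_predict(x, train_x, train_y, k):
    """Majority vote among the k training inputs nearest to the scalar x."""
    order = sorted(range(len(train_x)), key=lambda i: abs(x - train_x[i]))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def training_error(train_x, train_y, k):
    """Fraction of training points misclassified by k-nn fitted on the same points."""
    wrong = sum(knn_predict(x, train_x, train_y, k) != y
                for x, y in zip(train_x, train_y))
    return wrong / len(train_x)

# Hypothetical 1-D training data with some overlap between the classes.
train_x = [0, 1, 2, 3, 4, 5, 6, 7, 8]
train_y = [0, 0, 0, 1, 0, 1, 1, 1, 1]

for k in (1, 3, 9):
    print(k, training_error(train_x, train_y, k))
```

A zero training error at k = 1 says nothing about test error, which is why k should be chosen using held-out data rather than the training error.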

A few concepts to summarize lecture 4
Bayes' theorem: A formula relating conditional and marginal probabilities of two events.
Linear discriminant analysis (LDA): A classifier based on the assumption that the distribution of the input X is multivariate Gaussian for each class, with different means but the same covariance. LDA is a linear classifier.
Quadratic discriminant analysis (QDA): Same as LDA, but where a different covariance matrix is used for each class. QDA is not a linear classifier.
Non-parametric model: A model where the flexibility is allowed to grow with the amount of available training data.
k-nn classifier: A simple non-parametric classifier based on classifying a test input X = x according to the class labels of the k training samples nearest to x.