Lecture 3: Classification, Logistic Regression
Fredrik Lindsten
Division of Systems and Control, Department of Information Technology, Uppsala University
Email: fredrik.lindsten@it.uu.se
Summary of lecture 2 (I/III)

Linear regression: regression with a linear model

$$Y = \underbrace{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p}_{f(X;\,\beta)} + \epsilon.$$

How to learn/train/estimate β? Use the maximum likelihood principle: assume $\epsilon \sim \mathcal{N}(0, \sigma_\epsilon^2)$ i.i.d. ⇒ least squares & the normal equations:

$$\hat{\beta} = \arg\min_{\beta} \|X\beta - y\|_2^2, \qquad \hat{\beta} = (X^T X)^{-1} X^T y,$$

where

$$x_i = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip} \end{pmatrix}, \qquad X = \begin{pmatrix} 1 & x_1^T \\ 1 & x_2^T \\ \vdots & \vdots \\ 1 & x_n^T \end{pmatrix}, \qquad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}.$$
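For concreteness (not part of the original slides), the normal equations are a one-liner in R; a minimal sketch with made-up data:

    x <- c(0.1, 0.9, 1.8, 3.1, 4.0)   # made-up input values
    y <- c(1.2, 2.8, 4.1, 7.0, 8.9)   # made-up output values

    X <- cbind(1, x)  # design matrix with an intercept column of ones

    # Normal equations: beta_hat = (X^T X)^{-1} X^T y; solve() avoids an explicit inverse
    beta_hat <- solve(t(X) %*% X, t(X) %*% y)
    beta_hat  # compare with coef(lm(y ~ x))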
Summary of lecture 2 (II/III)

We can make arbitrary nonlinear transformations of the inputs, for example polynomials:

$$Y = \beta_0 + \beta_1 \underbrace{V}_{=X_1} + \beta_2 \underbrace{V^2}_{=X_2} + \beta_3 \underbrace{V^3}_{=X_3} + \dots + \beta_p \underbrace{V^p}_{=X_p} + \epsilon$$

($V$ = original input variable, $X_i$ = transformed input variables, or features.)

Qualitative input variables are handled by creating dummy variables.
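The same least-squares machinery handles polynomial features: build the columns V, V^2, V^3 and solve as before. A sketch with made-up data (not from the slides):

    V <- c(0.1, 0.9, 1.8, 3.1, 4.0)   # original input (made-up)
    y <- c(1.2, 2.8, 4.1, 7.0, 8.9)   # output (made-up)

    # Transformed inputs X1 = V, X2 = V^2, X3 = V^3, plus an intercept column
    X <- cbind(1, V, V^2, V^3)
    beta_hat <- solve(t(X) %*% X, t(X) %*% y)

    # Equivalent formula syntax: lm(y ~ poly(V, 3, raw = TRUE))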
Summary of lecture 2 (III/III)

Overfitting may occur when the model is too flexible!

[Figure: fits to the same data for linear regression with no regularization, ridge regression with γ = 0.1, and ridge regression with γ = 1.]

Overfitting can be handled using regularization:

Ridge regression: $\hat{\beta} = \arg\min_{\beta} \|X\beta - y\|_2^2 + \gamma \|\beta\|_2^2$

LASSO: $\hat{\beta} = \arg\min_{\beta} \|X\beta - y\|_2^2 + \gamma \|\beta\|_1$
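Ridge regression as written above has a closed-form solution; a sketch (not from the slides; note that, exactly as in the formula above, this version penalizes all coefficients, intercept included):

    # Ridge solution: beta_hat = (X^T X + gamma * I)^{-1} X^T y
    ridge <- function(X, y, gamma) {
      solve(t(X) %*% X + gamma * diag(ncol(X)), t(X) %*% y)
    }

    X <- cbind(1, c(0.1, 0.9, 1.8, 3.1, 4.0))   # intercept + one made-up input
    y <- c(1.2, 2.8, 4.1, 7.0, 8.9)
    ridge(X, y, gamma = 0.1)  # larger gamma shrinks the coefficients toward zero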
Contents: Lecture 3

1. The classification problem
2. Bayes classifier: the optimal classifier w.r.t. minimizing the misclassification error
3. Logistic regression
4. Diagnostic tools for classification (via example)

Practical info: Don't forget to sign up for the mini project by Jan 25!
Qualitative outputs

Many machine learning applications have qualitative outputs Y:

- HiggsML: separate the H → ττ decay from noise (see lecture 1): Y ∈ {signal, background}
- Face verification: Y ∈ {match, no match}
- Identify the spoken vowel from an audio signal: Y ∈ {A, E, I, O, U, Y}
- Diagnosis system for leukemia: Y ∈ {ALL, AML, CLL, CML, no leukemia}
- ...
The classification problem

Classification is the task of modeling the relationship between an (arbitrary) input X and a qualitative output Y.

A classification model can be specified in terms of the conditional class probabilities Pr(Y = k | X) for k = 1, ..., K.

A prediction model for classification (= a classifier) is a function Ĝ that maps an input to a class: Ŷ = Ĝ(X), where Ŷ ∈ {1, ..., K}.
Decision boundaries

The prediction model Ŷ = Ĝ(X) is such that Ŷ ∈ {1, ..., K}. The input space can thus be segmented into K regions, separated by decision boundaries.

[Figure: a two-dimensional input space (X₁, X₂) segmented into three regions, labeled Y = 1, Y = 2, and Y = 3, by decision boundaries.]
Why not linear regression?

Can we use linear regression for classification problems?

ex) Classifying e-mails as spam. Code the output as

$$Y = \begin{cases} 0 & \text{if ham (= good e-mail)}, \\ 1 & \text{if spam}, \end{cases}$$

and learn a linear regression model. Classify as spam if Ŷ > 0.5.

Why is this not a good idea?

- Ŷ can be viewed as an estimate of Pr(Y = spam | X). However, there is no guarantee that Ŷ ∈ [0, 1], making it hard to interpret as a probability.
- Sensitive to unequally sized classes in the training data.
- Difficult to generalize to K > 2 classes.
Training and test errors for classifiers

What makes a good classifier? We compute Ĝ based on a set of training data T = {(x₁, y₁), ..., (xₙ, yₙ)}.

Def: The training error is $\frac{1}{n}\sum_{i=1}^{n} I(\hat{y}_i \neq y_i)$, where ŷᵢ = Ĝ(xᵢ) and I(·) is an indicator function.

Def: The test error is $\Pr(\hat{Y} \neq Y) = E[I(\hat{Y} \neq Y)]$, where Ŷ = Ĝ(X) and (X, Y) is an unobserved input/output pair.

More specifically, these are referred to as misclassification errors, since they are based on the misclassification loss, $L(Y, \hat{G}(X)) = I(\hat{G}(X) \neq Y)$.
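For concreteness (not from the original slides), a minimal R sketch of the training error with made-up 0/1 labels:

    y     <- c(0, 0, 1, 1, 0, 1)   # true labels (made-up)
    y_hat <- c(0, 1, 1, 0, 0, 1)   # predictions from some classifier

    # Training error: the average of the indicators I(y_hat_i != y_i)
    mean(y_hat != y)  # 2/6, i.e. about 33%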
Bayes classifier

The optimal classifier, in terms of minimizing the misclassification test error, is the one which assigns each prediction to the most likely class, given its input value. That is, the predicted output for input x is the class k for which Pr(Y = k | X = x) is largest. This is referred to as the Bayes classifier.

In practice we do not know the conditional probability of the class Y given the input X. Practical classification methods typically try to approximate the Bayes classifier as well as possible.
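If the conditional class probabilities were known, the Bayes classifier would be one line of R. A sketch with made-up probabilities (purely illustrative):

    # Pr(Y = k | X = x_i): one row per input, one column per class (made-up numbers)
    p <- rbind(c(0.7, 0.2, 0.1),
               c(0.1, 0.3, 0.6))

    # The Bayes classifier assigns each input to its most probable class
    apply(p, 1, which.max)  # returns 1 3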
Logistic function

The function f : ℝ → [0, 1] defined as

$$f(x) = \frac{e^x}{1 + e^x}$$

is known as the logistic function.

[Figure: plot of $f(x) = e^x/(1+e^x)$ for x ∈ [−6, 6]; an S-shaped curve rising from 0 to 1.]
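A sketch of the logistic function in R (base R's plogis computes the same thing):

    logistic <- function(x) exp(x) / (1 + exp(x))  # maps R into [0, 1]

    x <- seq(-6, 6, by = 0.1)
    plot(x, logistic(x), type = "l", ylab = "f(x)")  # S-shaped curve from 0 to 1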
ex) Logistic regression

Consider a toy problem where we want to build a classifier for whether a person has a certain disease (Y = 1) or not (Y = 0), based on two biological indicators X₁ and X₂. The training data consist of n = 1,000 labeled samples (see figure).

[Figure: scatter plot of the training data in the (X₁, X₂) plane, with the classes Y = 0 and Y = 1 marked.]
ex) Logistic regression

A logistic regression model

$$\Pr(Y = 1 \mid X) = \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2}}{1 + e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2}}$$

is learned using maximum likelihood:

> glm.fit = glm(y ~ ., data = training, family = binomial)
> glm.fit$coefficients
(Intercept)      x1      x2
   -17.5814  1.8091  0.2766

i.e., $\hat{\beta} = (-17.6, 1.81, 0.277)^T$.
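Once the model is fitted, predicted probabilities and class labels follow with predict(). A sketch assuming the training data frame from the slide, with a 0/1-coded column y and inputs x1, x2:

    glm.fit <- glm(y ~ ., data = training, family = binomial)

    # type = "response" applies the logistic function to the linear predictor,
    # returning the estimated Pr(Y = 1 | X) for each training input
    p_hat <- predict(glm.fit, type = "response")

    # Approximate Bayes classification: predict Y = 1 when the probability exceeds 0.5
    y_hat <- as.numeric(p_hat > 0.5)
    mean(y_hat != training$y)  # training error (3.3% in this example)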
ex) Logistic regression: training error

[Figure: training data in the (X₁, X₂) plane with the fitted decision boundary; classes Y = 0 and Y = 1.]

Training error: $\frac{1}{n}\sum_{i=1}^{n} I(\hat{G}(x_i) \neq y_i) = 3.3\%$
ex) Logistic regression: test error

To further test the classifier we evaluate it on previously unseen test data, $T_t = \{(x_i, y_i)\}_{i=1}^{n_t}$.

[Figure: test data in the (X₁, X₂) plane with the decision boundary; classes Y = 0 and Y = 1.]

(Estimated) test error: $\frac{1}{n_t}\sum_{i=1}^{n_t} I(\hat{G}(x_i) \neq y_i) = 5.0\%$

Note: the naive classifier Ĝ(x) ≡ 0 attains a test error of 5.1%.
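A sketch of the test-set evaluation, assuming a held-out data frame test with the same columns as training and the model glm.fit from above:

    # Predicted probabilities Pr(Y = 1 | X) on previously unseen inputs
    p_test <- predict(glm.fit, newdata = test, type = "response")
    y_test_hat <- as.numeric(p_test > 0.5)

    # Estimated test error: fraction of misclassified test points
    mean(y_test_hat != test$y)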
ex) Logistic regression: confusion matrix

Instead of just looking at the misclassification error we can compute the confusion matrix:

                        Predicted condition
                        Ŷ = 0              Ŷ = 1
True      Y = 0         941 (true neg.)      8 (false pos.)
condition Y = 1          42 (false neg.)     9 (true pos.)

Out of 51 patients affected by the disease, only 9 are correctly classified! The true positive rate (TPR) is just 9/51 ≈ 17.6%. (See ISL pp. 148-149 for definitions of TPR and other classification diagnostics.)
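The confusion matrix can be computed from the same predictions with one call to table(); the "0"/"1" indexing below assumes the 0/1 label coding used in this example:

    # Rows: true condition, columns: predicted condition
    conf <- table(True = test$y, Predicted = y_test_hat)
    conf

    # True positive rate: true positives / all actually positive cases
    conf["1", "1"] / sum(conf["1", ])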
Conservative predictions

The Bayes classifier minimizes the total misclassification error. But what if false negatives are worse than false positives?

Idea: Modify the prediction model:

$$\hat{G}(X) = \begin{cases} 1 & \text{if } q(X) > r, \\ 0 & \text{otherwise}, \end{cases}$$

where q(X) is the modeled conditional probability Pr(Y = 1 | X) and 0 ≤ r ≤ 1 is a user-chosen threshold.
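In R, changing the threshold is a one-line change to the prediction rule; a sketch continuing the example above (p_test and test as before):

    # Thresholded classifier: predict Y = 1 whenever q(x) = Pr(Y = 1 | X) exceeds r
    classify <- function(q, r) as.numeric(q > r)

    y_hat_02 <- classify(p_test, r = 0.2)  # more conservative than r = 0.5
    table(True = test$y, Predicted = y_hat_02)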
ex) Logistic regression, cont'd

[Figure: test data in the (X₁, X₂) plane with the shifted decision boundary for r = 0.2; classes Y = 0 and Y = 1.]

Confusion matrix (r = 0.2):

             Ŷ = 0    Ŷ = 1
    Y = 0      881       68
    Y = 1       10       41

If we set the threshold at r = 0.2:

- The true positive rate is increased to 41/51 ≈ 80.4%.
- However, the misclassification error is increased to 7.8%.
ex) Logistic regression: error rates

[Figure: misclassification error, 1 − true positive rate, and false positive rate plotted against the threshold r ∈ [0, 0.5].]

As we increase the threshold r from 0 to 0.5:

- The misclassification error decreases.
- The number of non-diseased persons incorrectly classified as diseased (the false positive rate) decreases.
- The number of diseased persons incorrectly classified as non-diseased (1 − the true positive rate) increases!
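The three curves can be traced out by sweeping r and recomputing the rates; a sketch continuing the example (p_test and test as above):

    r_grid <- seq(0, 0.5, by = 0.01)
    rates <- sapply(r_grid, function(r) {
      y_hat <- as.numeric(p_test > r)
      c(misclass      = mean(y_hat != test$y),            # misclassification error
        one_minus_TPR = mean(y_hat[test$y == 1] == 0),    # missed diseased patients
        FPR           = mean(y_hat[test$y == 0] == 1))    # falsely flagged healthy
    })
    matplot(r_grid, t(rates), type = "l", xlab = "Threshold (r)", ylab = "Error rate")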
ex) Logistic regression: ROC and AUC

[Figure: ROC curve for the logistic regression classifier; true positive rate on the vertical axis vs. false positive rate on the horizontal axis.]

ROC curve¹: plot of TPR vs. FPR as the threshold r is varied.

Area under curve (AUC): condensed performance measure for the classifier, taking all possible thresholds into account. AUC ∈ [0, 1], where AUC = 0.5 corresponds to a random guess. [ex) AUC = 0.96]

¹For receiver operating characteristics, which is a historical name.
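The ROC curve and AUC can be computed directly by sweeping the threshold over [0, 1]; a minimal base-R sketch (packages such as pROC offer the same functionality), again assuming p_test and test from the example:

    r_grid <- seq(0, 1, by = 0.001)
    TPR <- sapply(r_grid, function(r) mean(p_test[test$y == 1] > r))
    FPR <- sapply(r_grid, function(r) mean(p_test[test$y == 0] > r))

    plot(FPR, TPR, type = "l", xlab = "False positive rate", ylab = "True positive rate")
    abline(0, 1, lty = 2)  # the diagonal corresponds to random guessing (AUC = 0.5)

    # AUC by the trapezoidal rule; FPR decreases as r increases, hence the minus sign
    -sum(diff(FPR) * (head(TPR, -1) + tail(TPR, -1)) / 2)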
A few concepts to summarize lecture 3

Classification: The problem of modeling the relationship between an input X and a qualitative output Y.

Decision boundaries: Points in the input space where the classifier Ĝ(X) changes value.

Bayes classifier: The optimal classifier w.r.t. minimizing the misclassification error.

Logistic regression: Models the Bayes classifier using a linear regression model for the log-odds.

Confusion matrix: Table of predicted vs. true class labels (for binary classification: the number of true negatives, true positives, false negatives, and false positives).