
Lecture 3: Classification, Logistic Regression
Fredrik Lindsten
Division of Systems and Control, Department of Information Technology, Uppsala University
Email: fredrik.lindsten@it.uu.se

Summary of lecture 2 (I/III)

Linear regression: regression with a linear model,
$$Y = \underbrace{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p}_{f(X;\beta)} + \epsilon.$$

How to learn/train/estimate β? Use the maximum likelihood principle: assume ɛ ∼ N(0, σɛ²) iid, which leads to least squares and the normal equations:
$$\hat{\beta} = \arg\min_{\beta} \|X\beta - y\|_2^2, \qquad \hat{\beta} = (X^T X)^{-1} X^T y,$$
where
$$x_i = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip} \end{pmatrix}, \qquad X = \begin{pmatrix} 1 & x_1^T \\ 1 & x_2^T \\ \vdots & \vdots \\ 1 & x_n^T \end{pmatrix}, \qquad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}.$$
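As a minimal R sketch (illustration only, not from the lecture, with a made-up toy data set) of how the normal equations can be solved numerically:

# Minimal sketch: least squares via the normal equations.
set.seed(1)
n <- 100; p <- 2
X <- cbind(1, matrix(rnorm(n * p), n, p))   # design matrix with an intercept column
beta_true <- c(1, 2, -0.5)
y <- X %*% beta_true + rnorm(n, sd = 0.1)
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # (X^T X)^{-1} X^T y
beta_hat
# In practice, lm() uses a numerically more stable QR factorization.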

Summary of lecture 2 (II/III)

We can make arbitrary nonlinear transformations of the inputs, for example polynomials:
$$Y = \beta_0 + \beta_1 \underbrace{V}_{=X_1} + \beta_2 \underbrace{V^2}_{=X_2} + \beta_3 \underbrace{V^3}_{=X_3} + \dots + \beta_p \underbrace{V^p}_{=X_p} + \epsilon$$
(V = original input variable, X_i = transformed input variables or features).

Qualitative input variables are handled by creating dummy variables.
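A brief R sketch (illustrative only; the data frame and variable names are made up) showing how both ideas are typically expressed in a model formula:

# Hypothetical data: V is a numeric input, color is a qualitative input.
set.seed(2)
df <- data.frame(V = runif(50, 0, 3),
                 color = factor(sample(c("red", "green", "blue"), 50, replace = TRUE)))
df$Y <- 1 + 2 * df$V - 0.5 * df$V^2 + rnorm(50, sd = 0.2)
# poly(V, 3, raw = TRUE) expands V into the features V, V^2, V^3;
# the factor 'color' is automatically expanded into dummy variables by lm().
fit <- lm(Y ~ poly(V, 3, raw = TRUE) + color, data = df)
summary(fit)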

Summary of lecture 2 (III/III)

Overfitting may occur when the model is too flexible!

[Figure: fitted curves for linear regression with no regularization, ridge regression with γ = 0.1, and ridge regression with γ = 1.]

Overfitting can be handled using regularization:

Ridge regression: $$\hat{\beta} = \arg\min_{\beta} \|X\beta - y\|_2^2 + \gamma \|\beta\|_2^2$$

LASSO: $$\hat{\beta} = \arg\min_{\beta} \|X\beta - y\|_2^2 + \gamma \|\beta\|_1$$
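Ridge regression also has a closed-form solution analogous to the normal equations. A minimal sketch (illustration only, reusing the toy X and y from the snippet above; for brevity the intercept is penalized too, which one would normally avoid):

# Minimal sketch: ridge regression via its closed-form solution.
gamma <- 1
beta_ridge <- solve(t(X) %*% X + gamma * diag(ncol(X)), t(X) %*% y)
beta_ridge
# The LASSO has no closed form; packages such as glmnet solve it numerically.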

Contents Lecture 3

1. The classification problem
2. Bayes classifier: the optimal classifier w.r.t. minimizing misclassification error
3. Logistic regression
4. Diagnostic tools for classification (via example)

Practical info: Don't forget to sign up for the mini project by Jan 25!

Qualitative outputs

Many machine learning applications have qualitative outputs Y:

- HiggsML: separate the H → ττ decay from noise (see lecture 1): Y ∈ {signal, background}
- Face verification: Y ∈ {match, no match}
- Identify the spoken vowel from an audio signal: Y ∈ {A, E, I, O, U, Y}
- Diagnosis system for leukemia: Y ∈ {ALL, AML, CLL, CML, no leukemia}
- ...

The classification problem

Classification is the task of modeling the relationship between an (arbitrary) input X and a qualitative output Y.

A classification model can be specified in terms of the conditional class probabilities Pr(Y = k | X) for k = 1, ..., K.

A prediction model for classification (= a classifier) is a function Ĝ that maps an input to a class: Ŷ = Ĝ(X), where Ŷ ∈ {1, ..., K}.

Decision boundaries

The prediction model Ŷ = Ĝ(X) is such that Ŷ ∈ {1, ..., K}. The input space can thus be segmented into K regions, separated by decision boundaries.

[Figure: data from classes Y = 1, Y = 2, and Y = 3 in the (X1, X2) plane, illustrating the decision boundaries between the three classes.]

Why not linear regression?

Can we use linear regression for classification problems?

ex) Classifying e-mails as spam. Code the output as
$$Y = \begin{cases} 0 & \text{if ham (= good email)}, \\ 1 & \text{if spam}, \end{cases}$$
and learn a linear regression model. Classify as spam if Ŷ > 0.5.

Why is this not a good idea?
- Ŷ can be viewed as an estimate of Pr(Y = spam | X). However, there is no guarantee that Ŷ ∈ [0, 1], making it hard to interpret as a probability.
- Sensitive to unequally sized classes in the training data.
- Difficult to generalize to K > 2 classes.

Training and test errors for classifiers

What makes a good classifier? We compute Ĝ based on a set of training data T = {(x_1, y_1), ..., (x_n, y_n)}.

Def: The training error is $$\frac{1}{n}\sum_{i=1}^{n} I(\hat{y}_i \neq y_i),$$ where ŷ_i = Ĝ(x_i) and I(·) is an indicator function.

Def: The test error is $$\Pr(\hat{Y} \neq Y) = \mathrm{E}[I(\hat{Y} \neq Y)],$$ where Ŷ = Ĝ(X), and (X, Y) is an unobserved input/output pair.

More specifically, these are referred to as misclassification errors, since they are based on the misclassification loss L(Y, Ĝ(X)) = I(Ĝ(X) ≠ Y).
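As an aside (not from the slides), the training error is just the fraction of mismatched labels, which is a one-liner in R; y_hat and y are assumed to be vectors of predicted and true class labels:

# Minimal sketch: misclassification error of a set of predictions.
misclassification_error <- function(y_hat, y) {
  mean(y_hat != y)   # (1/n) * sum_i I(y_hat_i != y_i)
}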

Bayes classifier

The optimal classifier, in terms of minimizing the misclassification test error, is the one which assigns each prediction to the most likely class, given its input value. That is, the predicted output for input x is the class k for which Pr(Y = k | X = x) is largest. This is referred to as the Bayes classifier.

In practice we do not know the conditional probability of the class Y given the input X. Practical classification methods typically try to approximate the Bayes classifier as well as possible.
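To make the definition concrete, here is a small R sketch (an illustration under the artificial assumption that the conditional class probabilities are known exactly), which simply picks the most probable class for each input:

# Minimal sketch: the Bayes classifier returns argmax_k Pr(Y = k | X = x).
# 'post' is an n-by-K matrix with post[i, k] = Pr(Y = k | X = x_i),
# assumed known here; in practice these probabilities must be estimated.
bayes_classifier <- function(post) {
  apply(post, 1, which.max)
}
post <- rbind(c(0.7, 0.2, 0.1),
              c(0.1, 0.3, 0.6))
bayes_classifier(post)   # class 1 for the first input, class 3 for the second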

Logistic function

The function f : R → [0, 1] defined as
$$f(x) = \frac{e^x}{1 + e^x}$$
is known as the logistic function.

[Figure: plot of e^x / (1 + e^x) for x between −6 and 6; the curve increases monotonically from 0 towards 1.]
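A one-line R sketch (not from the slides) that reproduces this kind of plot:

# Plot the logistic function e^x / (1 + e^x) on [-6, 6].
curve(exp(x) / (1 + exp(x)), from = -6, to = 6, xlab = "x", ylab = "f(x)")
# Equivalently, plogis(x) evaluates the same function in base R.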

ex) Logistic regression

Consider a toyish problem where we want to build a classifier for whether a person has a certain disease (Y = 1) or not (Y = 0) based on two biological indicators X1 and X2. The training data consists of n = 1,000 labeled samples.

[Figure: scatter plot of the training data in the (X1, X2) plane, with classes Y = 0 and Y = 1 marked.]

ex) Logistic regression

A logistic regression model
$$\Pr(Y = 1 \mid X) = \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2}}{1 + e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2}}$$
is learned using maximum likelihood:

> glm.fit = glm(y ~ ., data = training, family = binomial)
> glm.fit$coefficients
(Intercept)          x1          x2
   -17.5814      1.8091      0.2766

i.e., β̂ = (−17.6, 1.81, 0.277)^T.
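As a brief follow-up sketch (not shown on the slides), the fitted model is turned into class predictions with predict(); the data frame name training follows the snippet above, and test is a hypothetical data frame with the same columns:

# Predicted probabilities Pr(Y = 1 | X), then classify with a 0.5 threshold.
p_hat <- predict(glm.fit, newdata = test, type = "response")
y_hat <- ifelse(p_hat > 0.5, 1, 0)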

ex) Logistic regression: training error

[Figure: training data in the (X1, X2) plane, classes Y = 0 and Y = 1.]

Training error: $$\frac{1}{n}\sum_{i=1}^{n} I(\hat{G}(x_i) \neq y_i) = 3.3\%$$

ex) Logistic regression: test error

To further test the classifier we evaluate it on previously unseen test data, T_t = {(x_i, y_i)}, i = 1, ..., n_t.

[Figure: test data in the (X1, X2) plane, classes Y = 0 and Y = 1.]

(Estimated) test error: $$\frac{1}{n_t}\sum_{i=1}^{n_t} I(\hat{G}(x_i) \neq y_i) = 5.0\%$$

The naive classifier Ĝ(x) ≡ 0 attains a test error of 5.1%.

ex) Logistic regression: confusion matrix

Instead of just looking at the misclassification error we can compute the confusion matrix.

                          Predicted condition
                          Ŷ = 0               Ŷ = 1
True        Y = 0         941 (true neg.)     8 (false pos.)
condition   Y = 1         42 (false neg.)     9 (true pos.)

Out of 51 patients affected by the disease, only 9 are correctly classified! The True Positive Rate (TPR) is just 9/51 ≈ 17.6%. (See ISL pp. 148-149 for definitions of TPR and other classification diagnostics.)
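In R, the confusion matrix and TPR are a few lines of bookkeeping; a hedged sketch assuming vectors y_true and y_hat of true and predicted labels coded 0/1 (the names are illustrative):

# Minimal sketch: confusion matrix and true positive rate for 0/1 labels.
conf <- table(truth = y_true, predicted = y_hat)
conf
tpr <- conf["1", "1"] / sum(conf["1", ])   # true positives / all actual positives
tpr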

Conservative predictions

The Bayes classifier minimizes the total misclassification error! What if false negatives are worse than false positives?

Idea: Modify the prediction model to
$$\hat{G}(X) = \begin{cases} 1 & \text{if } q(X) > r, \\ 0 & \text{otherwise}, \end{cases}$$
where q(X) is the modeled class probability Pr(Y = 1 | X) and 0 ≤ r ≤ 1 is a user-chosen threshold.
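Continuing the earlier prediction sketch (illustrative, not from the slides), a more conservative threshold only changes the final comparison:

# Classify with a user-chosen threshold r instead of 0.5 (here r = 0.2),
# trading more false positives for fewer false negatives.
r <- 0.2
y_hat_conservative <- ifelse(p_hat > r, 1, 0)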

ex) Logistic regression, cont'd

[Figure: test data in the (X1, X2) plane, classes Y = 0 and Y = 1.]

Confusion matrix (r = 0.2):

            Ŷ = 0    Ŷ = 1
Y = 0       881      68
Y = 1       10       41

If we set the threshold at r = 0.2:
- The true positive rate is increased to 41/51 ≈ 80.4%.
- However, the misclassification error is increased to 7.8%.

ex) Logistic regression: error rates

[Figure: misclassification error, 1 − true positive rate, and false positive rate plotted against the threshold r for r between 0 and 0.5.]

As we increase the threshold r from 0 to 0.5:
- The misclassification error decreases.
- The number of non-diseased persons incorrectly classified as diseased (false positive rate) decreases.
- The number of diseased persons incorrectly classified as non-diseased (1 − true positive rate) increases!

ex) Logistic regression: ROC and AUC

[Figure: ROC curve for the logistic regression classifier, true positive rate vs. false positive rate.]

ROC curve¹: plot of TPR vs. FPR.
Area Under Curve (AUC): condensed performance measure for the classifier, taking all possible thresholds into account. AUC ∈ [0, 1], where AUC = 0.5 corresponds to a random guess. [ex) AUC = 0.96]

¹For Receiver Operating Characteristics, which is a historical name.
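An illustrative R sketch (not from the slides) of how the ROC curve and AUC can be computed by sweeping the threshold manually; p_hat and y_true are assumed from the earlier snippets, and dedicated packages such as pROC do the same thing more conveniently:

# Sweep thresholds from 1 down to 0, record (FPR, TPR) pairs, and
# approximate the AUC with the trapezoidal rule.
thresholds <- seq(1, 0, by = -0.01)
tpr <- sapply(thresholds, function(r) mean(p_hat[y_true == 1] > r))
fpr <- sapply(thresholds, function(r) mean(p_hat[y_true == 0] > r))
plot(fpr, tpr, type = "l", xlab = "False positive rate", ylab = "True positive rate")
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc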

A few concepts to summarize lecture 3

Classification: The problem of modeling a relationship between an input X and a qualitative output Y.
Decision boundaries: Points in the input space where the classifier Ĝ(X) changes value.
Bayes classifier: The optimal classifier w.r.t. minimizing misclassification error.
Logistic regression: Models the Bayes classifier using a linear regression model for the log-odds.
Confusion matrix: Table with predicted vs. true class labels (for binary classification: the number of true negatives, true positives, false negatives, and false positives).