
Classification 1: Linear regression of indicators, linear discriminant analysis
Ryan Tibshirani
Data Mining: 36-462/36-662
April 2, 2013
Optional reading: ISL 4.1, 4.2, 4.4; ESL 4.1–4.3

Classification

Classification is a predictive task in which the response takes values across discrete categories (i.e., not continuous), and in the most fundamental case, two categories.

Examples:
- Predicting whether a patient will develop breast cancer or remain healthy, given genetic information
- Predicting whether or not a user will like a new product, based on user covariates and a history of his/her previous ratings
- Predicting the region of Italy in which a brand of olive oil was made, based on its chemical composition
- Predicting the next elected president, based on various social, political, and historical measurements

Similar to our usual setup, we observe pairs $(x_i, y_i)$, $i = 1, \ldots, n$, where $y_i$ gives the class of the $i$th observation, and $x_i \in \mathbb{R}^p$ are the measurements of $p$ predictor variables.

The class labels may actually be $y_i \in \{\text{healthy}, \text{sick}\}$ or $y_i \in \{\text{Sardinia}, \text{Sicily}, \ldots\}$, but we can always encode them as $y_i \in \{1, 2, \ldots, K\}$, where $K$ is the total number of classes.

Note that there is a big difference between classification and clustering; in the latter, there is no pre-defined notion of class membership (and sometimes, not even $K$), and we are not given labeled examples $(x_i, y_i)$, $i = 1, \ldots, n$, but only $x_i$, $i = 1, \ldots, n$.

Constructed from training data $(x_i, y_i)$, $i = 1, \ldots, n$, we denote our classification rule by $\hat{f}(x)$; given any $x \in \mathbb{R}^p$, this returns a class label $\hat{f}(x) \in \{1, \ldots, K\}$.

As before, we will see that there are two different ways of assessing the quality of $\hat{f}$: its predictive ability and its interpretative ability.

E.g., train on $(x_i, y_i)$, $i = 1, \ldots, n$, the data of elected presidents and related feature measurements $x_i \in \mathbb{R}^p$ for the past $n$ elections, and predict, given the current feature measurements $x_0 \in \mathbb{R}^p$, the winner of the current election.

In what situations would we care more about prediction error? And in what situations more about interpretation?

Binary classification and linear regression

Let's start off by supposing that $K = 2$, so that the response is $y_i \in \{1, 2\}$, for $i = 1, \ldots, n$.

You already know a tool that you could potentially use in this case for classification: linear regression. Simply treat the response as if it were continuous, and find the linear regression coefficients of the response vector $y \in \mathbb{R}^n$ onto the predictors, i.e.,
$$\hat{\beta}_0, \hat{\beta} = \operatorname*{argmin}_{\beta_0 \in \mathbb{R},\, \beta \in \mathbb{R}^p} \sum_{i=1}^n (y_i - \beta_0 - x_i^T \beta)^2$$

Then, given a new input $x_0 \in \mathbb{R}^p$, we predict the class to be
$$\hat{f}^{\mathrm{LS}}(x_0) = \begin{cases} 1 & \text{if } \hat{\beta}_0 + x_0^T \hat{\beta} \leq 1.5 \\ 2 & \text{if } \hat{\beta}_0 + x_0^T \hat{\beta} > 1.5 \end{cases}$$

(Note: since we included an intercept term in the regression, it doesn't matter whether we code the class labels as $\{1, 2\}$ or $\{0, 1\}$, etc.)

In many instances, this actually works reasonably well. Examples:

[Figure: three one-dimensional examples plotting $y$ against $x$ with the fitted line and threshold; the resulting classifier makes 0 mistakes, 3 mistakes, and 21 mistakes, respectively]

Overall, using linear regression in this way for binary classification is not a crazy idea. But how about if there are more than 2 classes?
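To make the recipe above concrete, here is a minimal R sketch of the least squares classifier on simulated two-class data; the simulated data and variable names are ours, for illustration only, and are not part of the lecture.

```r
# Simulated two-class data: class 1 centered at 0, class 2 centered at 2
set.seed(1)
n <- 100
x <- c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 2))
y <- rep(c(1, 2), each = n / 2)        # labels coded as 1 and 2

# Treat the labels as continuous and fit ordinary least squares
fit <- lm(y ~ x)

# Classify a new point by thresholding the fitted value at 1.5
x0 <- data.frame(x = 1.3)
yhat <- ifelse(predict(fit, newdata = x0) <= 1.5, 1, 2)
yhat
```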

Linear regression of indicators

This idea extends to the case of more than two classes. Given $K$ classes, define the indicator matrix $Y \in \mathbb{R}^{n \times K}$ to be the matrix whose columns indicate class membership; that is, its $j$th column satisfies $Y_{ij} = 1$ if $y_i = j$ (observation $i$ is in class $j$) and $Y_{ij} = 0$ otherwise.

E.g., with $n = 6$ observations and $K = 3$ classes, the matrix
$$Y = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix} \in \mathbb{R}^{6 \times 3}$$
corresponds to having the first two observations in class 1, the next two in class 2, and the final two in class 3.

To construct a prediction rule, we regress each column $Y_j \in \mathbb{R}^n$ (indicating the $j$th class versus all else) onto the predictors:
$$\hat{\beta}_{0,j}, \hat{\beta}_j = \operatorname*{argmin}_{\beta_{0,j} \in \mathbb{R},\, \beta_j \in \mathbb{R}^p} \sum_{i=1}^n (Y_{ij} - \beta_{0,j} - \beta_j^T x_i)^2$$

Now, given a new input $x_0 \in \mathbb{R}^p$, we compute $\hat{\beta}_{0,j} + x_0^T \hat{\beta}_j$, $j = 1, \ldots, K$, and predict the class $j$ that corresponds to the highest score. I.e., we let each of the $K$ linear models make its own prediction, and then we take the strongest. Formally,
$$\hat{f}^{\mathrm{LS}}(x_0) = \operatorname*{argmax}_{j=1,\ldots,K} \; \hat{\beta}_{0,j} + x_0^T \hat{\beta}_j$$
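As a rough illustration, the following R sketch carries out the indicator regression on simulated data with $K = 3$ classes; the data and names are ours, used only for demonstration.

```r
set.seed(1)
K <- 3; p <- 2

# Simulated 3-class data in R^2, one cluster per class
X <- rbind(matrix(rnorm(100, mean = 0), ncol = p),
           matrix(rnorm(100, mean = 2), ncol = p),
           matrix(rnorm(100, mean = 4), ncol = p))
y <- rep(1:K, each = 50)

# Indicator matrix Y: Y[i, j] = 1 if observation i is in class j
Y <- sapply(1:K, function(j) as.numeric(y == j))

# Regress each indicator column onto the predictors (with intercept)
fits <- lapply(1:K, function(j) lm(Y[, j] ~ X))

# Classify a new point by taking the class with the highest fitted score
x0 <- c(1, 1)
scores <- sapply(fits, function(f) sum(coef(f) * c(1, x0)))
which.max(scores)
```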

The decision boundary between any two classes $j, k$ is the set of values $x \in \mathbb{R}^p$ for which
$$\hat{\beta}_{0,j} + x^T \hat{\beta}_j = \hat{\beta}_{0,k} + x^T \hat{\beta}_k, \quad \text{i.e.,} \quad \hat{\beta}_{0,j} - \hat{\beta}_{0,k} + (\hat{\beta}_j - \hat{\beta}_k)^T x = 0$$

This defines a $(p-1)$-dimensional affine subspace in $\mathbb{R}^p$. To one side of it, we would always predict class $j$ over $k$; to the other, we would always predict class $k$ over $j$.

For $K$ classes total, there are $\binom{K}{2} = \frac{K(K-1)}{2}$ decision boundaries.

Ideal result

What we'd like to see when we use linear regression for a 3-way classification (from ESL page 105):

The plotted lines are the decision boundaries between classes 1 and 2, and between classes 2 and 3 (the decision boundary between classes 1 and 3 never matters).

Actual result

What actually happens when we use linear regression for this 3-way classification (from ESL page 105):

The decision boundaries between classes 1 and 2 and between classes 2 and 3 are the same, so we would never predict class 2. This problem is called masking (and it is not uncommon for moderate $K$ and small $p$).

Why did this happen?

Projecting onto the line joining the three class centroids gives some insight into why this happened (from ESL page 106):

Statistical decision theory

Let $C$ be a random variable giving the class label of an observation in our data set. A natural rule would be to classify according to
$$f(x) = \operatorname*{argmax}_{j=1,\ldots,K} \; \mathbb{P}(C = j \mid X = x)$$

This predicts the most likely class, given the feature measurements $X = x \in \mathbb{R}^p$. This is called the Bayes classifier, and it is the best that we can do (think of overlapping classes).

Note that we can use Bayes' rule to write
$$\mathbb{P}(C = j \mid X = x) = \frac{\mathbb{P}(X = x \mid C = j)\, \mathbb{P}(C = j)}{\mathbb{P}(X = x)}$$

Let $\pi_j = \mathbb{P}(C = j)$ be the prior probability of class $j$. Since the Bayes classifier compares the above quantity across $j = 1, \ldots, K$ for $X = x$, the denominator is always the same, hence
$$f(x) = \operatorname*{argmax}_{j=1,\ldots,K} \; \mathbb{P}(X = x \mid C = j)\, \pi_j$$
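As a toy numerical illustration of the Bayes rule above (ours, not from the slides), suppose the two class conditional densities are known one-dimensional normals and the priors are known; the Bayes classifier simply compares $\pi_j$ times the density at $x$:

```r
# Toy 1-D example with two classes and known class conditional densities:
# class 1 ~ N(0, 1) with prior 0.7, class 2 ~ N(2, 1) with prior 0.3
priors <- c(0.7, 0.3)
means  <- c(0, 2)

# Bayes classifier: pick the class maximizing P(X = x | C = j) * pi_j
bayes_classify <- function(x) {
  scores <- priors * dnorm(x, mean = means, sd = 1)
  which.max(scores)
}

bayes_classify(0.5)   # near class 1's mean, and class 1 has the larger prior -> 1
bayes_classify(1.8)   # near class 2's mean -> 2
```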

Linear discriminant analysis

Using the Bayes classifier is not realistic, as it requires knowing the class conditional densities $\mathbb{P}(X = x \mid C = j)$ and the prior probabilities $\pi_j$. But if we estimate these quantities, then we can follow the idea.

Linear discriminant analysis (LDA) does this by assuming that the data within each class are normally distributed:
$$h_j(x) = \mathbb{P}(X = x \mid C = j) = N(\mu_j, \Sigma) \text{ density}$$

We allow each class to have its own mean $\mu_j \in \mathbb{R}^p$, but we assume a common covariance matrix $\Sigma \in \mathbb{R}^{p \times p}$. Hence
$$h_j(x) = \frac{1}{(2\pi)^{p/2} \det(\Sigma)^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_j)^T \Sigma^{-1} (x - \mu_j) \right\}$$

So we want to find $j$ so that $\mathbb{P}(X = x \mid C = j)\, \pi_j = h_j(x)\, \pi_j$ is the largest.

Since $\log(\cdot)$ is a monotone function, we can consider maximizing $\log(h_j(x)\pi_j)$ over $j = 1, \ldots, K$. We can define the rule:
$$\begin{aligned}
f^{\mathrm{LDA}}(x) &= \operatorname*{argmax}_{j=1,\ldots,K} \; \log\big(h_j(x)\,\pi_j\big) \\
&= \operatorname*{argmax}_{j=1,\ldots,K} \; \Big[ -\tfrac{1}{2}\log\det(\Sigma) - \tfrac{1}{2}(x-\mu_j)^T\Sigma^{-1}(x-\mu_j) + \log\pi_j \Big] \\
&= \operatorname*{argmax}_{j=1,\ldots,K} \; \Big[ x^T\Sigma^{-1}\mu_j - \tfrac{1}{2}\mu_j^T\Sigma^{-1}\mu_j + \log\pi_j \Big] \\
&= \operatorname*{argmax}_{j=1,\ldots,K} \; \delta_j(x)
\end{aligned}$$
(in the second step we dropped the constant $-\tfrac{p}{2}\log(2\pi)$, and in the third we dropped $-\tfrac{1}{2}\log\det(\Sigma)$ and $-\tfrac{1}{2}x^T\Sigma^{-1}x$, none of which depend on $j$).

We call $\delta_j(x)$, $j = 1, \ldots, K$ the discriminant functions. Note
$$\delta_j(x) = x^T\Sigma^{-1}\mu_j - \tfrac{1}{2}\mu_j^T\Sigma^{-1}\mu_j + \log\pi_j$$
is just an affine function of $x$.

In practice, given an input $x \in \mathbb{R}^p$, can we just use the rule $f^{\mathrm{LDA}}$ on the previous slide? Not quite! What's missing: we don't know $\pi_j$, $\mu_j$, and $\Sigma$. Therefore we estimate them based on the training data $x_i \in \mathbb{R}^p$ and $y_i \in \{1, \ldots, K\}$, $i = 1, \ldots, n$, by:
- $\hat{\pi}_j = n_j / n$, the proportion of observations in class $j$
- $\hat{\mu}_j = \frac{1}{n_j} \sum_{y_i = j} x_i$, the centroid of class $j$
- $\hat{\Sigma} = \frac{1}{n-K} \sum_{j=1}^K \sum_{y_i = j} (x_i - \hat{\mu}_j)(x_i - \hat{\mu}_j)^T$, the pooled sample covariance matrix

(Here $n_j$ is the number of points in class $j$.)

This gives the estimated discriminant functions:
$$\hat{\delta}_j(x) = x^T\hat{\Sigma}^{-1}\hat{\mu}_j - \tfrac{1}{2}\hat{\mu}_j^T\hat{\Sigma}^{-1}\hat{\mu}_j + \log\hat{\pi}_j$$
and finally the linear discriminant analysis rule,
$$\hat{f}^{\mathrm{LDA}}(x) = \operatorname*{argmax}_{j=1,\ldots,K} \; \hat{\delta}_j(x)$$
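A minimal R sketch of these estimates, computed by hand on simulated data (the data and names below are ours for illustration, not from the lecture):

```r
set.seed(1)
K <- 3; p <- 2

# Simulated 3-class Gaussian data with a common covariance
X <- rbind(matrix(rnorm(100, mean = 0), ncol = p),
           matrix(rnorm(100, mean = 2), ncol = p),
           matrix(rnorm(100, mean = 4), ncol = p))
y <- rep(1:K, each = 50)
n <- nrow(X)

# Estimates: class proportions, class centroids, pooled covariance
pi_hat <- table(y) / n
mu_hat <- lapply(1:K, function(j) colMeans(X[y == j, , drop = FALSE]))
Sigma_hat <- Reduce(`+`, lapply(1:K, function(j) {
  Xc <- scale(X[y == j, , drop = FALSE], center = mu_hat[[j]], scale = FALSE)
  t(Xc) %*% Xc
})) / (n - K)

# Estimated discriminant functions and the LDA rule
delta_hat <- function(x) {
  sapply(1:K, function(j) {
    drop(t(x) %*% solve(Sigma_hat, mu_hat[[j]])) -
      0.5 * drop(t(mu_hat[[j]]) %*% solve(Sigma_hat, mu_hat[[j]])) +
      log(pi_hat[j])
  })
}
f_lda <- function(x) which.max(delta_hat(x))

f_lda(c(1, 1))   # predicted class for a new point
```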

LDA decision boundaries

The estimated discriminant functions
$$\hat{\delta}_j(x) = x^T\hat{\Sigma}^{-1}\hat{\mu}_j - \tfrac{1}{2}\hat{\mu}_j^T\hat{\Sigma}^{-1}\hat{\mu}_j + \log\hat{\pi}_j = a_j + b_j^T x$$
are just affine functions of $x$. The decision boundary between classes $j, k$ is the set of all $x \in \mathbb{R}^p$ such that $\hat{\delta}_j(x) = \hat{\delta}_k(x)$, i.e.,
$$a_j + b_j^T x = a_k + b_k^T x$$
This defines an affine subspace in $x$:
$$a_j - a_k + (b_j - b_k)^T x = 0$$

Example: LDA decision boundaries

Example of decision boundaries from LDA (from ESL page 109):

[Figure: two panels showing the decision boundaries of $f^{\mathrm{LDA}}(x)$ and $\hat{f}^{\mathrm{LDA}}(x)$]

Are the decision boundaries the same as the perpendicular bisectors (Voronoi boundaries) between the class centroids? (Why not?)

LDA computations, usages, extensions

The decision boundaries for LDA are useful for graphical purposes, but to classify a new point $x_0 \in \mathbb{R}^p$ we don't use them; we simply compute $\hat{\delta}_j(x_0)$ for each $j = 1, \ldots, K$.

LDA performs quite well on a wide variety of data sets, even when pitted against fancy alternative classification schemes. Though it assumes normality, its simplicity often works in its favor. (Why? Think of the bias-variance tradeoff.)

Still, there are some useful extensions of LDA. E.g.,
- Quadratic discriminant analysis: using the same normal model, we now allow each class $j$ to have its own covariance matrix $\Sigma_j$. This leads to quadratic decision boundaries (see the sketch below)
- Reduced-rank linear discriminant analysis: we essentially project the data to a lower dimensional subspace before performing LDA. We will study this next time
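In practice one would rarely code the discriminant functions by hand; the MASS package (used on the next slide) provides lda and qda. A minimal sketch on the built-in iris data, which is our choice of example data rather than the lecture's:

```r
library(MASS)

# LDA: a common covariance matrix across classes, linear boundaries
lda_fit <- lda(Species ~ ., data = iris)
lda_pred <- predict(lda_fit, newdata = iris)$class
mean(lda_pred == iris$Species)   # training accuracy

# QDA: each class gets its own covariance matrix,
# giving quadratic decision boundaries
qda_fit <- qda(Species ~ ., data = iris)
qda_pred <- predict(qda_fit, newdata = iris)$class
mean(qda_pred == iris$Species)
```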

Example: olive oil data

Example: $n = 572$ olive oils, each made in one of three regions of Italy. On each observation we have $p = 8$ features measuring the percentage composition of 8 different fatty acids. (Data from the olives data set in the R package classifly.)

From the lda function in the MASS package:

[Figure: scatterplot of the discriminant variables LD1 versus LD2, with the three regions appearing as separated clusters]

This looks nice (it seems that the observations are separated into classes), but what exactly is being shown? More next time...
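A rough sketch of how such a plot could be produced with lda from MASS; we assume the classifly data set is called olives, with a Region column, an Area column, and the 8 fatty acid measurements as the remaining columns, which may differ slightly from the package's actual layout:

```r
library(MASS)
library(classifly)   # assumed to provide the olives data

data(olives)
olives$Region <- factor(olives$Region)   # lda expects a factor grouping variable

# Fit LDA of region on the 8 fatty acid compositions
# (assumed columns: Region, Area, then the fatty acids)
fit <- lda(Region ~ . - Area, data = olives)

# Project the data onto the discriminant directions and plot LD1 vs LD2
proj <- predict(fit)$x
plot(proj[, "LD1"], proj[, "LD2"], col = as.numeric(olives$Region),
     xlab = "LD1", ylab = "LD2")
```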

Recap: linear regression of indicators, linear discriminant analysis

In this lecture, we introduced the task of classification, a prediction problem in which the outcome is categorical.

We can perform classification for any total number of classes $K$ by simply performing $K$ separate linear regressions on the appropriate indicator vectors of class membership. However, there can be problems with this when $K > 2$; a common one is masking, in which one class is never predicted at all.

Linear discriminant analysis also draws linear decision boundaries, but in a smarter way. Statistical decision theory tells us that we really only need to know the class conditional densities and the prior class probabilities in order to perform classification. Linear discriminant analysis assumes normality of the data within each class and a common covariance matrix; it then replaces all unknown quantities by their sample estimates.

Next time: more linear discriminant analysis; logistic regression

Logistic regression is a natural extension of the ideas behind linear regression and linear discriminant analysis.

[Figure: a fitted logistic curve, with $y$ between 0 and 1 plotted against $x$]
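As a brief preview of next time, a short R sketch of a logistic regression fit via glm (our example, on simulated binary data):

```r
# Simulated binary outcome whose log-odds are linear in x
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(1 + 2 * x))

# Logistic regression: model P(y = 1 | x) with a logit link
fit <- glm(y ~ x, family = binomial)
predict(fit, newdata = data.frame(x = 0), type = "response")
```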