Classification 2: Linear discriminant analysis (continued); logistic regression

Classification 2: Linear discriminant analysis (continued); logistic regression
Ryan Tibshirani
Data Mining: 36-462/36-662, April 4 2013
Optional reading: ISL 4.4, ESL 4.3; ISL 4.3, ESL 4.4

Reminder: linear discriminant analysis

Last time we defined the Bayes classifier in the population, for the class label $C \in \{1, \ldots, K\}$ and feature vector $X \in \mathbb{R}^p$, as
$$f(x) = \operatorname*{argmax}_{j=1,\ldots,K} \, P(C = j \mid X = x) = \operatorname*{argmax}_{j=1,\ldots,K} \, P(X = x \mid C = j) \, \pi_j$$
where $\pi_j = P(C = j)$ is the prior probability of class $j$.

Linear discriminant analysis approximates this rule by modeling the conditional class densities as multivariate normals:
$$h_j(x) = P(X = x \mid C = j) = N(\mu_j, \Sigma) \text{ density},$$
i.e., each class $j$ has its own mean $\mu_j \in \mathbb{R}^p$, but all classes share a common covariance matrix $\Sigma \in \mathbb{R}^{p \times p}$.

We then replace $\pi_j, \mu_j, \Sigma$ by their sample estimates, based on labeled observations $y_i \in \{1, \ldots, K\}$, $x_i \in \mathbb{R}^p$, $i = 1, \ldots, n$:
$$\hat\pi_j = n_j / n, \quad \hat\mu_j = \frac{1}{n_j} \sum_{y_i = j} x_i, \quad \hat\Sigma = \frac{1}{n - K} \sum_{j=1}^K \sum_{y_i = j} (x_i - \hat\mu_j)(x_i - \hat\mu_j)^T$$
The rule then reduces to
$$\hat f^{\mathrm{LDA}}(x) = \operatorname*{argmax}_{j=1,\ldots,K} \, \hat\delta_j(x)$$
where $\hat\delta_j(x)$ is the estimated discriminant function of class $j$,
$$\hat\delta_j(x) = x^T \hat\Sigma^{-1} \hat\mu_j - \tfrac{1}{2} \hat\mu_j^T \hat\Sigma^{-1} \hat\mu_j + \log \hat\pi_j$$
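To make the plug-in rule concrete, here is a minimal R sketch of these estimates and discriminant functions (a sketch only; the names lda.estimates, lda.classify, x, y are hypothetical and not from the slides):

    # x: n x p matrix of features; y: vector of class labels in 1,...,K (assumed inputs)
    lda.estimates <- function(x, y) {
      n <- nrow(x); p <- ncol(x); K <- max(y)
      pi.hat <- as.numeric(table(factor(y, levels = 1:K))) / n     # class proportions n_j / n
      mu.hat <- t(sapply(1:K, function(j) colMeans(x[y == j, , drop = FALSE])))
      Sigma.hat <- matrix(0, p, p)                                 # pooled within-class covariance
      for (j in 1:K) {
        xc <- sweep(x[y == j, , drop = FALSE], 2, mu.hat[j, ])     # center class j at its mean
        Sigma.hat <- Sigma.hat + t(xc) %*% xc
      }
      Sigma.hat <- Sigma.hat / (n - K)
      list(pi = pi.hat, mu = mu.hat, Sigma = Sigma.hat)
    }

    # Classify a single point x0 by the largest estimated discriminant function
    lda.classify <- function(est, x0) {
      Sinv <- solve(est$Sigma)
      delta <- sapply(1:nrow(est$mu), function(j)
        sum(x0 * (Sinv %*% est$mu[j, ])) -
          0.5 * sum(est$mu[j, ] * (Sinv %*% est$mu[j, ])) + log(est$pi[j]))
      which.max(delta)
    }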

LDA computations and sphering

Note that LDA equivalently minimizes, over $j = 1, \ldots, K$,
$$\tfrac{1}{2} (x - \hat\mu_j)^T \hat\Sigma^{-1} (x - \hat\mu_j) - \log \hat\pi_j$$
It helps to factorize $\hat\Sigma$ (i.e., compute its eigendecomposition):
$$\hat\Sigma = U D U^T$$
where $U \in \mathbb{R}^{p \times p}$ has orthonormal columns (and rows), and $D = \mathrm{diag}(d_1, \ldots, d_p)$ with $d_j \geq 0$ for each $j$. Then we have $\hat\Sigma^{-1} = U D^{-1} U^T$, and
$$(x - \hat\mu_j)^T \hat\Sigma^{-1} (x - \hat\mu_j) = \big\| \underbrace{D^{-1/2} U^T x}_{\tilde x} - \underbrace{D^{-1/2} U^T \hat\mu_j}_{\tilde\mu_j} \big\|_2^2$$
This is just the squared distance between $\tilde x$ and $\tilde\mu_j$.

Hence the LDA procedure can be described as:
1. Compute the sample estimates $\hat\pi_j, \hat\mu_j, \hat\Sigma$
2. Factor $\hat\Sigma$, as in $\hat\Sigma = U D U^T$
3. Transform the class centroids, $\tilde\mu_j = D^{-1/2} U^T \hat\mu_j$
4. Given any point $x \in \mathbb{R}^p$, transform to $\tilde x = D^{-1/2} U^T x \in \mathbb{R}^p$, and then classify according to the nearest centroid in the transformed space, adjusting for class proportions: this is the class $j$ for which $\tfrac{1}{2} \|\tilde x - \tilde\mu_j\|_2^2 - \log \hat\pi_j$ is smallest

What is this transformation doing? Think about applying it to the observations:
$$\tilde x_i = D^{-1/2} U^T x_i, \quad i = 1, \ldots, n$$
This is basically sphering the data points, because if we think of $x \in \mathbb{R}^p$ as a random variable with covariance matrix $\hat\Sigma$, then
$$\mathrm{Cov}(D^{-1/2} U^T x) = D^{-1/2} U^T \hat\Sigma U D^{-1/2} = I$$
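A minimal R sketch of steps 2-4 (sphering and nearest-centroid classification), assuming est is a list with components pi, mu (K x p), and Sigma as in the earlier sketch, and x is the n x p data matrix:

    eig <- eigen(est$Sigma, symmetric = TRUE)                # Sigma.hat = U D U^T (assumed positive definite)
    U <- eig$vectors
    D.inv.sqrt <- diag(1 / sqrt(eig$values))
    sphere <- function(v) as.numeric(D.inv.sqrt %*% t(U) %*% v)   # v -> D^{-1/2} U^T v

    x.tilde  <- t(apply(x, 1, sphere))                       # sphered data points
    mu.tilde <- t(apply(est$mu, 1, sphere))                  # sphered class centroids

    # Nearest centroid in the sphered space, adjusting for class proportions
    classify.sphered <- function(x0) {
      x0t <- sphere(x0)
      which.min(sapply(1:nrow(mu.tilde), function(j)
        0.5 * sum((x0t - mu.tilde[j, ])^2) - log(est$pi[j])))
    }

The pooled within-class covariance of the sphered points x.tilde is exactly the identity, which is the sense in which the data have been "sphered".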

Linear subspace spanned by sphered centroids

LDA compares the quantity $\tfrac{1}{2} \|\tilde x - \tilde\mu_j\|_2^2 - \log \hat\pi_j$ across the classes $j = 1, \ldots, K$. Consider the affine subspace $M \subseteq \mathbb{R}^p$ spanned by the transformed centroids $\tilde\mu_1, \ldots, \tilde\mu_K$, which has dimension $K - 1$.

For any $\tilde x \in \mathbb{R}^p$, we can decompose $\tilde x = P_M \tilde x + P_{M^\perp} \tilde x$, so
$$\|\tilde x - \tilde\mu_j\|_2^2 = \| \underbrace{P_M \tilde x - \tilde\mu_j}_{\in M} \|_2^2 + \| \underbrace{P_{M^\perp} \tilde x}_{\in M^\perp} \|_2^2$$
The second term doesn't depend on $j$.

What this is telling us: the LDA classification rule is unchanged if we project the points to be classified onto $M$, since the distances orthogonal to $M$ don't matter.

LDA procedure summarized

We can think of the LDA procedure as:
1. Compute the sample estimates $\hat\pi_j, \hat\mu_j, \hat\Sigma$
2. Make two transformations: first, sphere the data points, based on factoring $\hat\Sigma$; second, project down to the affine subspace spanned by the sphered centroids. This can all be summarized by a single linear transformation $A \in \mathbb{R}^{(K-1) \times p}$
3. Given any point $x \in \mathbb{R}^p$, transform to $\tilde x = A x \in \mathbb{R}^{K-1}$, and classify according to the class $j = 1, \ldots, K$ for which
$$\tfrac{1}{2} \|\tilde x - \tilde\mu_j\|_2^2 - \log \hat\pi_j$$
is smallest, where $\tilde\mu_j = A \hat\mu_j$
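A sketch of one way such an A could be assembled from the earlier quantities (the basis of the subspace is not unique, so this is just one valid choice; it reuses U, D.inv.sqrt, mu.tilde, and est from the sketches above):

    K <- nrow(mu.tilde)
    # Directions spanning the affine subspace of sphered centroids: mu.tilde_j - mu.tilde_K
    B <- t(mu.tilde[-K, , drop = FALSE]) - mu.tilde[K, ]     # p x (K-1)
    Q <- qr.Q(qr(B))                                         # orthonormal basis for those directions
    A <- t(Q) %*% D.inv.sqrt %*% t(U)                        # (K-1) x p: sphere, then project
    mu.proj <- t(A %*% t(est$mu))                            # K x (K-1) transformed centroids A mu.hat_j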

This way of describing LDA may sound more complicated, but actually, it's much simpler! After applying $A$, we've reduced the problem from $p$ to $K - 1$ dimensions, and then it's basically nearest centroid classification:
$$\hat f^{\mathrm{LDA}}(x) = \operatorname*{argmin}_{j=1,\ldots,K} \, \tfrac{1}{2} \|\tilde x - \tilde\mu_j\|_2^2 - \log \hat\pi_j$$
(The only distinction being that we adjust for class proportions.)

In R, the matrix $A^T$ is exactly what is returned by the scaling component of the lda function in the MASS package.
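For comparison, a hedged sketch using MASS (the data frame df, its factor column class, and the feature columns are hypothetical placeholders; with K classes, fit$scaling is a p x (K-1) matrix):

    library(MASS)
    fit <- lda(class ~ ., data = df)        # df: data frame with factor column 'class' and p features
    A.mass <- t(fit$scaling)                # (K-1) x p linear transformation
    z <- predict(fit)$x                     # n x (K-1) transformed points ("discriminant coordinates")
    plot(z[, 1], z[, 2], col = df$class,    # assumes K >= 3, so z has at least two columns
         xlab = "LD1", ylab = "LD2")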

Example: olive oil data

Recall our olive oil example: $n = 572$ olive oils, made in one of three regions of Italy. For each observation we have $p = 8$ features measuring the percentage composition of 8 different fatty acids.

[Figure: scatterplot of the transformed data points $A x_i \in \mathbb{R}^2$, $i = 1, \ldots, 572$, with axes LD1 and LD2.]

Note that here $M$ is a 2-dimensional subspace, since there are $K = 3$ classes, and the transformation has dimension $A \in \mathbb{R}^{2 \times 8}$.

Decision boundaries revisited

Working in the transformed space makes it easier to draw decision boundaries. Now the decision boundary between classes $j$ and $k$ is the set of all $z \in \mathbb{R}^{K-1}$ such that
$$\tfrac{1}{2} \|z - \tilde\mu_j\|_2^2 - \log \hat\pi_j = \tfrac{1}{2} \|z - \tilde\mu_k\|_2^2 - \log \hat\pi_k$$
After some calculation, this is simply
$$(\tilde\mu_j - \tilde\mu_k)^T z = \log \frac{\hat\pi_k}{\hat\pi_j} + \tfrac{1}{2} \big( \|\tilde\mu_j\|_2^2 - \|\tilde\mu_k\|_2^2 \big)$$
E.g., when $K = 3$, so that $z \in \mathbb{R}^2$, this is just the line $z_2 = a + b z_1$, where
$$a = \frac{\log \frac{\hat\pi_k}{\hat\pi_j} + \tfrac{1}{2} \big( \|\tilde\mu_j\|_2^2 - \|\tilde\mu_k\|_2^2 \big)}{\tilde\mu_{j2} - \tilde\mu_{k2}}, \quad b = \frac{\tilde\mu_{k1} - \tilde\mu_{j1}}{\tilde\mu_{j2} - \tilde\mu_{k2}}$$
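A small R sketch of computing the intercept a and slope b for the boundary between classes j and k (assuming mu.proj is the K x 2 matrix of transformed centroids and pi.hat the class proportions, as in the earlier sketches, with K = 3):

    boundary.line <- function(j, k, mu.proj, pi.hat) {
      cjk <- log(pi.hat[k] / pi.hat[j]) +
             0.5 * (sum(mu.proj[j, ]^2) - sum(mu.proj[k, ]^2))
      a <- cjk / (mu.proj[j, 2] - mu.proj[k, 2])
      b <- (mu.proj[k, 1] - mu.proj[j, 1]) / (mu.proj[j, 2] - mu.proj[k, 2])
      c(intercept = a, slope = b)                            # the line z2 = a + b * z1
    }
    # e.g. abline(boundary.line(1, 2, mu.proj, est$pi)) overlays the class 1 vs 2 boundary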

Example: olive oil data

Decision boundaries, using the formula that we derived:

[Figure: two panels of the transformed olive oil data (axes LD1 and LD2), points colored by Region 1, Region 2, Region 3, with the pairwise LDA decision boundaries overlaid.]

Reduced-rank linear discriminant analysis

The dimension reduction from $p$ to $K - 1$ was exact, in that we didn't change the LDA rule at all. Why might we want to reduce further, to a dimension $L < K - 1$, if $K$ is large?
- Visualization
- Regularization

Reduced-rank linear discriminant analysis is a nice way to project down to fewer than $K - 1$ dimensions. It chooses the lower-dimensional subspace so as to spread out the centroids as much as possible.

Does this sound like principal components analysis? It's not a coincidence! These dimensions are computed as the principal component directions of the matrix of transformed centroids.
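A rough R sketch of this idea, under an assumed weighting of the centroids by class proportions (one common convention, not stated on the slides): take the top L principal component directions of the centered, weighted sphered centroids, reusing mu.tilde, x.tilde, and est from the earlier sketches:

    L <- 2                                                   # target dimension, L < K - 1
    mu.bar <- colSums(est$pi * mu.tilde)                     # overall weighted centroid
    Mc <- sqrt(est$pi) * sweep(mu.tilde, 2, mu.bar)          # centered, weighted centroid matrix (K x p)
    V <- svd(Mc)$v[, 1:L, drop = FALSE]                      # top L principal component directions (p x L)
    z.rr  <- x.tilde %*% V                                   # L-dimensional coordinates of the data
    mu.rr <- mu.tilde %*% V                                  # L-dimensional coordinates of the centroids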

Example: vowel data

This experiment recorded $n = 528$ instances of spoken words. The words fall into $K = 11$ classes ("vowels"), and there are $p = 10$ features measured on each instance.

Reduced-rank linear discriminant analysis in 2 dimensions:

[Figure from ESL page 107]

As the rank increases, the centroids become less spread out:

[Figure from ESL page 115]

Reduced-rank LDA as regularization

If the number of classes $K$ is large, then projecting down to a dimension $L < K - 1$ can actually be helpful in terms of applying regularization. This is because some dimensions may not provide a lot of separation between the classes, but just noise. (You should be thinking of the bias-variance tradeoff!)

Reduced-rank LDA in $L$ dimensions delivers a matrix $A \in \mathbb{R}^{L \times p}$, and classification of a point $x \in \mathbb{R}^p$ proceeds as usual, by first transforming to $\tilde x = A x$, and then choosing the class $j = 1, \ldots, K$ that minimizes
$$\tfrac{1}{2} \|\tilde x - \tilde\mu_j\|_2^2 - \log \hat\pi_j$$
where $\tilde\mu_j = A \hat\mu_j$. Smaller $L$ means more regularization.

For the lda function in R, the matrix $A \in \mathbb{R}^{L \times p}$ above corresponds to the transpose of the first $L$ columns of the scaling matrix.
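With the lda function in MASS, the same effect can be obtained at prediction time through the dimen argument (a hedged sketch; train, test, and class are hypothetical names):

    fit  <- lda(class ~ ., data = train)
    pred <- predict(fit, newdata = test, dimen = 2)   # use only the first L = 2 discriminant directions
    mean(pred$class != test$class)                    # test misclassification rate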

Example: vowel data

For the vowel data set, there were actually $n = 462$ points held out as a test set. Here is the test error rate (and training error rate) of reduced-rank LDA as a function of $L$:

[Figure from ESL page 117]

Original form of LDA for two classes

Let's go back to the original form of LDA, and consider just two classes, for simplicity. Recall that LDA assumes that the predictor variables are normally distributed within each class:
$$P(X = x \mid C = j) = N(\mu_j, \Sigma) \text{ density}, \quad j = 1, 2$$
Bayes' rule then gives us
$$P(C = j \mid X = x) = \frac{P(X = x \mid C = j) \, P(C = j)}{P(X = x)}$$
Now plugging in the normal density,
$$\log \frac{P(C = 1 \mid X = x)}{P(C = 2 \mid X = x)} = \log \frac{\pi_1}{\pi_2} - \tfrac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) + \tfrac{1}{2} (x - \mu_2)^T \Sigma^{-1} (x - \mu_2)$$
$$= \log \frac{\pi_1}{\pi_2} - \tfrac{1}{2} (\mu_1 + \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2) + x^T \Sigma^{-1} (\mu_1 - \mu_2)$$
$$= \alpha_0 + \alpha^T x$$
That is, the log odds of class 1 versus class 2 is a linear function of $x$. Estimating $\hat\pi_j, \hat\mu_j, \hat\Sigma$ amounts to estimating $\hat\alpha_0, \hat\alpha$.

A question: why don't we estimate these coefficients directly?

Logistic regression

In logistic regression, we assume that
$$\log \frac{P(C = 1 \mid X = x)}{P(C = 2 \mid X = x)} = \beta_0 + \beta^T x$$
for some unknown $\beta_0 \in \mathbb{R}$, $\beta \in \mathbb{R}^p$, which we will estimate directly.

Note that $P(C = 2 \mid X = x) = 1 - P(C = 1 \mid X = x)$, so writing $p = P(C = 1 \mid X = x)$, our assumption is
$$\log \Big( \frac{p}{1 - p} \Big) = \beta_0 + \beta^T x$$
Therefore
$$P(C = 1 \mid X = x) = \frac{\exp(\beta_0 + \beta^T x)}{1 + \exp(\beta_0 + \beta^T x)}$$

Estimating logistic regression coefficients

Suppose that we are given a sample $(x_i, y_i)$, $i = 1, \ldots, n$, where $y_i$ denotes the class $\in \{1, 2\}$ of the $i$th observation. Assume that the class labels are conditionally independent given $x_1, \ldots, x_n$. Then
$$L(\beta_0, \beta) = \prod_{i=1}^n P(C = y_i \mid X = x_i)$$
is the likelihood of these $n$ observations, so the log likelihood is
$$\ell(\beta_0, \beta) = \sum_{i=1}^n \log P(C = y_i \mid X = x_i)$$
Now for convenience, we define the indicator
$$u_i = \begin{cases} 1 & \text{if } y_i = 1 \\ 0 & \text{if } y_i = 2 \end{cases}$$

The log likelihood can be written as
$$\ell(\beta_0, \beta) = \sum_{i=1}^n \log P(C = y_i \mid X = x_i)$$
$$= \sum_{i=1}^n \Big\{ u_i \log P(C = 1 \mid X = x_i) + (1 - u_i) \log P(C = 2 \mid X = x_i) \Big\}$$
$$= \sum_{i=1}^n \Big\{ u_i (\beta_0 + \beta^T x_i) - \log\big( 1 + \exp(\beta_0 + \beta^T x_i) \big) \Big\}$$
The coefficients are estimated by maximizing the likelihood,
$$(\hat\beta_0, \hat\beta) = \operatorname*{argmax}_{\beta_0 \in \mathbb{R}, \, \beta \in \mathbb{R}^p} \; \sum_{i=1}^n \Big\{ u_i (\beta_0 + \beta^T x_i) - \log\big( 1 + \exp(\beta_0 + \beta^T x_i) \big) \Big\}$$
Note that logistic regression is somewhat more general than LDA, in that we don't need to assume normality to estimate the linear coefficients.
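As a sanity check, this maximization can be carried out numerically and compared with R's built-in fit (a minimal sketch; x is an n x p matrix and u the 0/1 indicator defined above, both assumed to exist):

    # Negative log likelihood -l(beta0, beta), with theta = c(beta0, beta)
    negloglik <- function(theta) {
      eta <- theta[1] + x %*% theta[-1]
      -sum(u * eta - log(1 + exp(eta)))
    }
    opt <- optim(rep(0, ncol(x) + 1), negloglik, method = "BFGS")
    cbind(direct = opt$par,
          glm    = coef(glm(u ~ x, family = binomial)))      # the two fits should agree closely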

Recap: reduced-rank LDA and logistic regression

In this lecture we saw that the usual linear discriminant analysis for prediction in $p$ dimensions can be transformed to a much simpler rule in $K - 1$ dimensions, where $K$ is the number of classes.

This transformation was achieved by first sphering the data points, and then projecting onto the affine subspace spanned by the sphered class centroids. The final prediction rule is basically nearest centroid classification, except for the factor adjusting for different class proportions.

Further transformations, to dimensions $L < K - 1$, can also be performed; this is called reduced-rank linear discriminant analysis. Doing so can be helpful for visualization or regularization purposes.

Logistic regression models the log odds of being in class 1 versus class 2 as a linear function of the features. It is fit by maximum likelihood.

Next time: more logistic regression; nearest-neighbors and prototypes

Classification by K-means clustering: using an unsupervised tool for a supervised task

[Figure from ESL page 464]