Regularized Discriminant Analysis

Part I: Linear and Quadratic Discriminant Analysis (09.06.2006)

Discriminant Analysis
The purpose of discriminant analysis is to assign objects to one of several (K) groups based on a set of measurements X = (X_1, X_2, ..., X_p) obtained from each object. Each object is assumed to be a member of one (and only one) group, 1 ≤ k ≤ K, and an error is incurred if the object is assigned to the wrong group. The measurements of all objects of one class k are characterized by a probability density f_k(X). We want to find a rule that decides, for every object, to which class it belongs.

Example
A group of people consists of male and female persons (K = 2). From each person the weight and height are recorded (p = 2), but the gender is unknown in the data set. We want to classify the gender of each person from the weight and height. For this, discriminant analysis provides a classification rule (discriminant function) that chooses a group for each person; to construct this function, a training sample is used in which the gender is already known.

Class distribution
The distribution of the measurements X is seldom identical in each class, so we consider the conditional distribution for each class k. The most often applied classification rules are based on the multivariate normal distribution

    f_k(X) = f(X \mid k) = \frac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}} \exp\!\left( -\tfrac{1}{2} (X - \mu_k)^T \Sigma_k^{-1} (X - \mu_k) \right),

where µ_k and Σ_k are the population mean vector and covariance matrix of class k (1 ≤ k ≤ K). The densities f_k(X) are seldom known.

Prior and unconditional distribution
There might be some prior knowledge about the probability of observing a member of class k: the prior probability π_k, with π_1 + ... + π_K = 1. If the prior probabilities are equal for each k, the rule below reduces to maximum likelihood classification. To estimate the class-conditional densities f_k(X) and the prior probabilities π_k, a training sample of already correctly classified observations is used. The unconditional distribution of X is

    f(X) = \sum_{k=1}^{K} \pi_k f(X \mid k).

Posterior distribution
The probability that an object with measurement vector X^T = (X_1, ..., X_p) belongs to class k is given by the Bayes formula

    p(k \mid X) = \frac{f(X \mid k)\, \pi_k}{f(X)} \propto f(X \mid k)\, \pi_k,

where f(X | k) is the class distribution, π_k the prior probability and f(X) the unconditional distribution. An object is assigned to the class k̂ with the largest posterior probability p(k̂ | X); this is equivalent to minimizing the expected loss.
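As an illustration (not part of the original slides), the following minimal Python sketch evaluates this Bayes classification step for two Gaussian classes; the means, covariances and priors are made-up values chosen only to mimic the weight/height example.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative sketch: Bayes classification with Gaussian class densities.
# The means, covariances and priors below are made-up values, not estimates
# from any real data set.
means = {0: np.array([170.0, 65.0]),   # class 0, e.g. "female" (height, weight)
         1: np.array([180.0, 80.0])}   # class 1, e.g. "male"
covs  = {0: np.array([[40.0, 20.0], [20.0, 60.0]]),
         1: np.array([[50.0, 25.0], [25.0, 80.0]])}
priors = {0: 0.5, 1: 0.5}

def posterior(x):
    """Return p(k | x) for every class k via the Bayes formula."""
    unnorm = {k: priors[k] * multivariate_normal.pdf(x, means[k], covs[k])
              for k in priors}                    # f(x | k) * pi_k
    total = sum(unnorm.values())                  # unconditional f(x)
    return {k: v / total for k, v in unnorm.items()}

x_new = np.array([175.0, 70.0])
post = posterior(x_new)
print(post, "-> assign to class", max(post, key=post.get))
```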

Log posterior distribution
For easier calculation we take the logarithm of the posterior distribution,

    \log p(k \mid X) = \log f(X \mid k) + \log \pi_k + \text{const}.

With the multivariate normal distribution this becomes

    \log p(k \mid X) = -\tfrac{1}{2} (X - \mu_k)^T \Sigma_k^{-1} (X - \mu_k) - \tfrac{1}{2} \log|\Sigma_k| + \log \pi_k + \text{const},

where the constant term -\tfrac{p}{2} \log(2\pi) can be omitted because it is the same for each class k.

Quadratic discriminant analysis
Multiplying by -2 leads to the discriminant function

    d_k(X) = (X - \mu_k)^T \Sigma_k^{-1} (X - \mu_k) + \log|\Sigma_k| - 2 \log \pi_k,

whose first term is the Mahalanobis distance between X and µ_k, and to the classification rule

    d_{\hat{k}}(X) = \min_k d_k(X) \iff p(\hat{k} \mid X) = \max_k p(k \mid X).

Using this rule is called Quadratic Discriminant Analysis (QDA).

Linear and quadratic boundaries
A special case occurs when all K class covariance matrices are identical, Σ_k = Σ. The discriminant function simplifies to

    d_k(X) = (X - \mu_k)^T \Sigma^{-1} (X - \mu_k) - 2 \log \pi_k,

which, up to the term X^T \Sigma^{-1} X, equals

    d_k(X) = -2 \mu_k^T \Sigma^{-1} X + \mu_k^T \Sigma^{-1} \mu_k - 2 \log \pi_k.

This is called Linear Discriminant Analysis (LDA): the quadratic term X^T Σ^{-1} X is the same in every class k and can be left out, so the decision boundaries are now linear.

Estimation
In most applications of linear and quadratic discriminant analysis the parameters µ_k and Σ_k are estimated by their sample analogues,

    \hat{\mu}_k = \bar{X}_k = \frac{1}{N_k} \sum_{c(v)=k} X_v = (\bar{x}_1, ..., \bar{x}_p)^T,
    \hat{\Sigma}_k = \frac{S_k}{N_k} = \frac{1}{N_k} \sum_{c(v)=k} (X_v - \bar{X}_k)(X_v - \bar{X}_k)^T,

where c(v) denotes the class of the v-th observation.

Part II
These estimates are straightforward to compute and are the corresponding maximum likelihood estimates. Problem: they are only optimal for N → ∞. For small N the (p × p) covariance matrix estimates become highly variable, and not all of the parameters are even identifiable: Σ̂ is singular, so the inverse Σ̂^{-1} does not exist. The situation is called poorly posed when the number of parameters to be estimated is comparable to the number of observations, and ill posed when that number exceeds the sample size.

                                  QDA            LDA
    poorly posed                  N_k ≈ p        N ≈ p
    ill posed                     N_k < p        N < p
    parameters to be estimated    K(p² + p)/2    (p² + p)/2

QDA therefore generally requires larger sample sizes than LDA.
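As a supplement to the slides, here is a small Python sketch of the QDA and LDA discriminant functions evaluated from the sample estimates above; the two-class training data are synthetic and only serve as an illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up training data: two classes in p = 2 dimensions.
X0 = rng.normal([0, 0], 1.0, size=(50, 2))
X1 = rng.normal([2, 1], 1.5, size=(60, 2))
X_train, y_train = np.vstack([X0, X1]), np.array([0] * 50 + [1] * 60)

classes = np.unique(y_train)
N = len(y_train)
means  = {k: X_train[y_train == k].mean(axis=0) for k in classes}            # mu_k hat
covs   = {k: np.cov(X_train[y_train == k].T, bias=True) for k in classes}    # Sigma_k hat
priors = {k: (y_train == k).mean() for k in classes}                         # pi_k hat
pooled = sum((y_train == k).sum() * covs[k] for k in classes) / N            # pooled Sigma hat

def d_qda(x, k):
    """QDA score: Mahalanobis distance + log|Sigma_k| - 2 log pi_k (smaller is better)."""
    diff = x - means[k]
    return diff @ np.linalg.inv(covs[k]) @ diff + np.log(np.linalg.det(covs[k])) - 2 * np.log(priors[k])

def d_lda(x, k):
    """LDA score: same form but with the pooled covariance for every class."""
    diff = x - means[k]
    return diff @ np.linalg.inv(pooled) @ diff - 2 * np.log(priors[k])

x_new = np.array([1.0, 0.5])
print("QDA assigns class", min(classes, key=lambda k: d_qda(x_new, k)))
print("LDA assigns class", min(classes, key=lambda k: d_lda(x_new, k)))
```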

Regularization
In ill- or poorly posed situations the parameter estimates can be highly unstable, i.e. they have high variance. The aim of regularization is to improve the estimates by biasing them away from their sample-based values: the variance is reduced at the expense of a potentially increased bias. This bias-variance trade-off is regulated by two parameters, which control the strength of the biasing.

Regularization for Quadratic Discriminant Analysis
Strategy 1: If QDA is ill- or poorly posed, replace the individual class sample covariance matrices by their average, the pooled covariance matrix

    \hat{\Sigma} = \frac{\sum_k S_k}{\sum_k N_k}.

This regularizes by reducing the number of parameters to be estimated and can result in superior performance, especially in small-sample settings, but it simply leads to LDA; the choice between QDA and LDA alone is quite restrictive.

Regularization with parameter λ
Strategy 2: A less limited approach is

    \hat{\Sigma}_k(\lambda) = (1 - \lambda)\,\hat{\Sigma}_k + \lambda\,\hat{\Sigma}, \qquad 0 \le \lambda \le 1,

where λ controls the degree of shrinkage of the individual class covariance matrix estimates toward the pooled estimate: λ = 0 gives rise to QDA and λ = 1 to LDA. This is still fairly limited, because it cannot be used if LDA itself is ill- or poorly posed.

Eigenvalues
If N < p, then even LDA is poorly or ill posed: Σ̂ is singular and some of its eigenvalues are 0. Writing the spectral decomposition with e_i the i-th eigenvalue and v_i the i-th eigenvector of the covariance estimate, its inverse is

    \hat{\Sigma}^{-1} = \sum_{i=1}^{p} \frac{v_i v_i^T}{e_i},

which does not exist when some e_i = 0.

Further regularization
Strategy 3: If LDA is ill- or poorly posed,

    \hat{\Sigma}_k(\lambda, \gamma) = (1 - \gamma)\,\hat{\Sigma}_k(\lambda) + \frac{\gamma}{p}\, \mathrm{tr}\!\left[\hat{\Sigma}_k(\lambda)\right] I, \qquad 0 \le \gamma \le 1,

where tr A is the sum of the eigenvalues of A, so tr[Σ̂_k(λ)]/p is the average eigenvalue. The additional regularization parameter γ controls shrinkage toward a multiple of the identity matrix for a given value of λ: it decreases the larger eigenvalues and increases the smaller ones, shrinking them toward the average eigenvalue of Σ̂_k(λ).

The discriminant function for Regularized Discriminant Analysis (RDA) is

    d_k(X) = (X - \bar{X}_k)^T \hat{\Sigma}_k(\lambda, \gamma)^{-1} (X - \bar{X}_k) + \log|\hat{\Sigma}_k(\lambda, \gamma)| - 2 \log \pi_k.

Model selection
The values of λ and γ are not likely to be known in advance, so we have to estimate them. The aim is to find values of λ and γ that jointly minimize the future misclassification risk. Possible methods are bootstrapping and cross-validation. Idea of (leave-one-out) cross-validation: one particular observation X_v is removed, the classification rule is developed on the N − 1 remaining training observations, and this rule is then used to classify X_v and to record the loss incurred if it is classified into the wrong group. This is repeated for every observation, and the future misclassification risk is estimated by the average of the resulting misclassification losses. The whole procedure is carried out for a number of combinations of λ and γ.
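The following Python sketch (not from the slides) puts the two shrinkage steps and the leave-one-out search together; the data set, the (λ, γ) grid and the function names are made up for illustration, and the code follows the formulas only in the simplified form given above.

```python
import numpy as np
from itertools import product

def rda_covariances(X, y, lam, gamma):
    """Regularized class covariances Sigma_k(lambda, gamma): shrink toward the
    pooled estimate (lambda) and then toward a multiple of the identity (gamma)."""
    classes = np.unique(y)
    p = X.shape[1]
    covs = {k: np.cov(X[y == k].T, bias=True) for k in classes}
    pooled = sum((y == k).sum() * covs[k] for k in classes) / len(y)
    out = {}
    for k in classes:
        s_lam = (1 - lam) * covs[k] + lam * pooled                              # Sigma_k(lambda)
        out[k] = (1 - gamma) * s_lam + gamma * np.trace(s_lam) / p * np.eye(p)  # Sigma_k(lambda, gamma)
    return out

def rda_predict(x, X, y, lam, gamma):
    """Assign x to the class with the smallest RDA discriminant score."""
    covs = rda_covariances(X, y, lam, gamma)
    scores = {}
    for k in covs:
        diff = x - X[y == k].mean(axis=0)
        sign, logdet = np.linalg.slogdet(covs[k])
        scores[k] = diff @ np.linalg.inv(covs[k]) @ diff + logdet - 2 * np.log((y == k).mean())
    return min(scores, key=scores.get)

def loo_risk(X, y, lam, gamma):
    """Leave-one-out estimate of the misclassification risk for one (lambda, gamma) pair."""
    errors = 0
    for v in range(len(y)):
        keep = np.arange(len(y)) != v
        errors += rda_predict(X[v], X[keep], y[keep], lam, gamma) != y[v]
    return errors / len(y)

# Made-up data and a small grid of candidate (lambda, gamma) values.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(1, 2, (25, 5))])
y = np.array([0] * 20 + [1] * 25)
grid = list(product([0.0, 0.5, 1.0], [0.0, 0.5, 1.0]))
best = min(grid, key=lambda lg: loo_risk(X, y, *lg))
print("chosen (lambda, gamma):", best)
```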

Conclusion
The potential of RDA to improve the misclassification risk over that of QDA or LDA depends on the situation. If N_k ≫ p, no regularization is needed and QDA can be used; the model-selection procedure should then tend toward small values of λ and γ. If N ≫ p, LDA has been the method of choice in the past; regularization can substantially improve the misclassification risk when the Σ_k are not close to being equal. If N is even too small for LDA, only the additional shrinkage toward the identity (γ > 0) yields usable covariance estimates.

Part III (PAM): Class prediction with gene expression data
The aim is to assign people to one of K (1 ≤ k ≤ K) diagnostic categories based on their gene expression profile. Classification with DNA microarrays is challenging because there is a very large number of genes (p) from which to predict the classes and only a relatively small number of samples (N). Again a restricted form of discriminant analysis is used; it also identifies the genes that contribute most to the classification and thereby reduces p.

Example: small round blue cell tumors (SRBCT)
Small round blue cell tumors of childhood can be divided into four groups: Burkitt lymphoma (BL), Ewing sarcoma (EWS), neuroblastoma (NB) and rhabdomyosarcoma (RMS). DNA microarrays of 88 children with SRBCT were obtained. 63 of them were already correctly classified, and their data were used as the training sample to estimate the classification rule; the category of the other 25 children (of which 5 were not SRBCT) was then predicted by this rule. The aim is to classify the test samples correctly. The data consist of expression measurements on 2,308 genes.

Nearest centroid classification
The mean expression value (centroid) of each gene was calculated from the training sample for each of the four classes. For each test sample, the squared distance from its gene expression profile to each class centroid was then calculated, and the predicted class for a child was the one with the closest centroid. With

    \bar{x}_{ik} = \sum_{j \in C_k} \frac{x_{ij}}{n_k}  (mean value of gene i in class k),
    \bar{x}_i = \sum_{j=1}^{n} \frac{x_{ij}}{n}  (overall centroid of gene i),

the statistic

    d_{ik} = \frac{\bar{x}_{ik} - \bar{x}_i}{m_k (s_i + s_0)}, \qquad m_k = \sqrt{1/n_k + 1/n},

is a t-statistic for gene i comparing the mean of class k to the overall centroid; s_i is the pooled standard deviation of gene i and s_0 is a positive constant with the same value for every gene. It would be more attractive if fewer genes were needed, which motivates a modification, nearest shrunken centroid classification, in which genes that do not contribute to the class prediction are eliminated.

Shrunken centroids
The distance d_{ik} of gene i between the mean in class k and the overall mean is shrunk toward zero by soft thresholding,

    d'_{ik} = \mathrm{sign}(d_{ik}) \, (|d_{ik}| - \Delta)_+, \qquad (t)_+ = t if t > 0 and 0 otherwise,

where Δ is the shrinkage parameter, also called the threshold. Many of the genes are eliminated as Δ increases; Δ is again chosen by cross-validation (Δ = 4.34 in the example). The centroids are shrunk toward the overall centroid,

    \bar{x}'_{ik} = \bar{x}_i + m_k (s_i + s_0) \, d'_{ik}.

If d'_{ik} = 0 for every class k, then \bar{x}'_{i1} = ... = \bar{x}'_{iK} = \bar{x}_i, i.e. the centroid of gene i is the same for each class. If each group has the same mean for one gene, this gene does not help to predict a class and can be left out. In the example only 43 genes are needed for the class prediction, and the remaining 2,265 are not needed to distinguish between the groups.
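The soft-thresholding step can be written compactly in code. The Python sketch below (not part of the slides) computes the statistics d_ik, applies the threshold and returns the shrunken centroids; the expression matrix, class labels and the value of s_0 are made up, and m_k follows the formula as reconstructed above.

```python
import numpy as np

def shrunken_centroids(X, y, delta, s0=0.5):
    """Nearest-shrunken-centroid step: returns the shrunken class centroids and a
    mask of genes whose centroids still differ between classes.
    X: (n samples x p genes), y: class labels, delta: shrinkage threshold."""
    classes = np.unique(y)
    n, p = X.shape
    overall = X.mean(axis=0)                                   # overall centroid x_bar_i
    # pooled standard deviation s_i for each gene
    within = sum(((X[y == k] - X[y == k].mean(axis=0)) ** 2).sum(axis=0) for k in classes)
    s = np.sqrt(within / (n - len(classes)))
    centroids, kept = {}, np.zeros(p, dtype=bool)
    for k in classes:
        n_k = (y == k).sum()
        m_k = np.sqrt(1.0 / n_k + 1.0 / n)
        d = (X[y == k].mean(axis=0) - overall) / (m_k * (s + s0))   # t-like statistic d_ik
        d_shrunk = np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)  # soft thresholding
        centroids[k] = overall + m_k * (s + s0) * d_shrunk          # shrunken centroid
        kept |= d_shrunk != 0                                       # gene still informative?
    return centroids, kept

# Made-up example: 30 samples, 100 genes, 3 classes.
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 100))
y = rng.integers(0, 3, size=30)
centroids, kept = shrunken_centroids(X, y, delta=1.0)
print("genes kept:", kept.sum(), "of", X.shape[1])
```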

Discriminant function
The discriminant function uses the Mahalanobis metric to compute distances to the centroids,

    \delta_k^{\mathrm{LDA}}(x^*) = (x^* - \bar{x}_k)^T \, \Sigma^{-1} \, (x^* - \bar{x}_k) - 2 \log \pi_k,

where (x^* - \bar{x}_k)^T is 1 × p, Σ^{-1} is p × p, (x^* - \bar{x}_k) is p × 1, π_k is the class prior probability and Σ the pooled within-class covariance matrix. Since p ≫ n, Σ is huge and any sample estimate of it will be singular. To cope with this problem a heavily restricted form of LDA is used: Σ is assumed to be a diagonal matrix.

Classification rule
The discriminant function can then be rewritten as

    \delta_k(x^*) = \sum_{i=1}^{p} \frac{(x_i^* - \bar{x}'_{ik})^2}{(s_i + s_0)^2} - 2 \log \pi_k,

again standardized by s_i + s_0, for one person with p genes x^* = (x_1^*, x_2^*, ..., x_p^*). The person is assigned to group k̂ if

    \delta_{\hat{k}}(x^*) = \min_k \delta_k(x^*).

Literature
Friedman, J. H. (1989), "Regularized Discriminant Analysis," Journal of the American Statistical Association.
Tibshirani, R., Hastie, T., Narasimhan, B., Chu, G. (2002), "Diagnosis of multiple cancer types by shrunken centroids of gene expression," PNAS.
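For completeness, a minimal sketch (not from the slides) of this diagonal-covariance classification rule; it assumes shrunken centroids, pooled standard deviations s and class priors computed as in the previous sketch, and all names and the tiny usage data are hypothetical.

```python
import numpy as np

def pam_classify(x_new, centroids, s, s0, priors):
    """Diagonal-covariance discriminant score:
    delta_k(x) = sum_i (x_i - centroid_ik)^2 / (s_i + s0)^2 - 2 log pi_k.
    centroids: dict class -> shrunken centroid vector, s: pooled sd per gene."""
    scores = {k: (((x_new - centroids[k]) / (s + s0)) ** 2).sum() - 2 * np.log(priors[k])
              for k in centroids}
    return min(scores, key=scores.get)   # assign to the class with the smallest score

# Tiny made-up usage with random "centroids" and equal priors.
rng = np.random.default_rng(3)
s = np.ones(100)
centroids = {k: rng.normal(size=100) for k in range(3)}
priors = {k: 1 / 3 for k in range(3)}
print(pam_classify(rng.normal(size=100), centroids, s, s0=0.5, priors=priors))
```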