Is cross-validation valid for small-sample microarray classification?


Slide 1: Is cross-validation valid for small-sample microarray classification? Braga-Neto et al., Bioinformatics, 2004. Topics in Bioinformatics. Presented by Lei Xu, October 26.

Slide 2: Review, 1) Statistical framework
- $X \in \mathbb{R}^d$: a feature vector; $Y \in C = \{1, 2, \ldots, C\}$: a class variable.
- Regard $(X, Y)$ as random variables with joint probability distribution $F = P(x, k) = P(X = x, Y = k)$, which is usually unknown.
- $Y$ has prior distribution $p(k) = P(Y = k)$; $X$ has class-conditional distribution $P(x \mid k) = P(X = x \mid Y = k)$.

Slide 3: Review, 1) Statistical framework
- A classifier is a function $f: \mathbb{R}^d \to C$.
- Classification: $S_n \mapsto f_S$, i.e., construct $f$ based on a training set $S_n = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$, where the $(x^{(i)}, y^{(i)})$ are i.i.d. with law $F$.
- Error rate of a classifier $f$: $\varepsilon(f) = P(f(X) \neq Y)$; for the case $C = \{1, 2\}$, $\varepsilon(f) = E(|Y - f(X)|)$.
- $\varepsilon(f)$ depends on the joint distribution $F$.

Slide 4: Review, 1) Statistical framework
- Theorem: For any classifier $f$ and any loss function $L: C \times C \to [0, \infty)$, $\varepsilon_L(f) \geq \varepsilon_L(f_B)$, where $f_B$ is the Bayes classifier.
- For the special case where $L$ is the 0-1 loss, $\varepsilon_L(\cdot) = \varepsilon(\cdot)$.

Slide 5: Review, 2) Cross-validation
- Goal: estimate $\varepsilon(f)$.
- Algorithm (K-fold cross-validation):
  a) Divide $S_n$ into $K$ disjoint pieces, $S_n = S_{n1} \cup S_{n2} \cup \cdots \cup S_{nK}$ (for simplicity, we assume that $K$ divides $n$).

Slide 6: Review, 2) Cross-validation
- Algorithm (continued):
  b) For each $i = 1, \ldots, K$:
     - Build the classifier $f_i$ using the selected rule on $S_n \setminus S_{ni}$.
     - Classify $S_{ni}$ using $f_i$, yielding the number of errors
       $\hat{\varepsilon}_i = \sum_{(x^{(j)}, y^{(j)}) \in S_{ni}} |y^{(j)} - f_i(x^{(j)})|$,
       where $(x^{(j)}, y^{(j)})$ is the $j$-th sample in the $i$-th fold $S_{ni}$ and $|S_{ni}| = n/K$.

Slide 7: Review, 2) Cross-validation
- Algorithm (continued):
  c) Calculate the estimate of $\varepsilon(f)$:
     $\hat{\varepsilon}_{cvK} = \frac{1}{n} \sum_{i=1}^{K} \sum_{(x^{(j)}, y^{(j)}) \in S_{ni}} |y^{(j)} - f_i(x^{(j)})|$.
- K-fold cross-validation is unbiased as an estimator of $E[\varepsilon_{n - n/K}]$, where $\varepsilon_{n - n/K}$ is the error rate of a classifier trained on a sample of size $n - n/K$, such as $S_n \setminus S_{ni}$.
- Therefore, if $n/K$ is small, it is nearly an unbiased estimator of $E[\varepsilon_n]$.
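As a concrete illustration (not from the paper), here is a minimal Python/NumPy sketch of the K-fold estimator above. The name train_rule is a hypothetical stand-in for any classification rule that takes training data and returns a prediction function:

```python
import numpy as np

def cv_k_fold(X, y, train_rule, K):
    """K-fold cross-validation estimate of the error rate.

    Assumes K divides n; train_rule(X, y) is a hypothetical function
    returning a classifier f with f(X) -> predicted labels.
    """
    n = len(y)
    idx = np.random.permutation(n)             # random assignment to folds
    folds = np.array_split(idx, K)             # K disjoint pieces S_n1, ..., S_nK
    errors = 0
    for test_idx in folds:
        train_idx = np.setdiff1d(idx, test_idx)            # S_n \ S_ni
        f_i = train_rule(X[train_idx], y[train_idx])       # build f_i
        errors += np.sum(f_i(X[test_idx]) != y[test_idx])  # errors on S_ni
    return errors / n                          # eps_cvK
```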

Slide 8: Primer 1: Three Popular Classification Rules, 1) Linear Discriminant Analysis (LDA)
- Let $X \in \mathbb{R}^d$ be a feature vector and $Y \in \{1, 2, \ldots, C\}$ be a class variable.
- We collect data $\{X^{(1)}, X^{(2)}, \ldots, X^{(n)}\}$ and would like to classify these data into $C$ distinct classes.
- Design discriminant functions $f_c(X)$, $c = 1, 2, \ldots, C$, such that: if $f_c(X^{(i)}) > f_k(X^{(i)})$ for all $k \neq c$, then $Y^{(i)} = c$, for $i = 1, 2, \ldots, n$.

Slide 9: Primer 1: Three Popular Classification Rules, 1) Linear Discriminant Analysis
- Choose the discriminant functions to be linear, i.e., $f_c(X) = W_c^T X + W_{c,0}$, $c = 1, 2, \ldots, C$, where $W_c = (W_{c,1}, W_{c,2}, \ldots, W_{c,d})^T$ is the weight vector and $W_{c,0}$ is the bias.

Slide 10: Primer 1: Three Popular Classification Rules, 1) Linear Discriminant Analysis
- For clarity of presentation, consider the case of only two classes. Then there is only one linear discriminant function: $f(X) = W^T X + W_0$.
- Note: $f_1(X) = W_1^T X + W_{1,0} > W_2^T X + W_{2,0} = f_2(X)$ is equivalent to $f(X) = (W_1 - W_2)^T X + (W_{1,0} - W_{2,0}) > 0$.

Slide 11: Primer 1: Three Popular Classification Rules, 1) Linear Discriminant Analysis
- Classification can be performed by the following procedure: if $f(X^{(i)}) > 0$, then $Y^{(i)} = 1$; else $Y^{(i)} = 2$, for $i = 1, 2, \ldots, n$.
- The decision boundary is linear and is defined by the equation $f(X) = W^T X + W_0 = 0$.

Slide 12: Primer 1: Three Popular Classification Rules, 1) Linear Discriminant Analysis
- The decision boundary is a hyperplane; the orientation of the surface is determined by the normal vector $W$, and its location is determined by the bias $W_0$.

Slide 13: Primer 1: Three Popular Classification Rules, 1) Linear Discriminant Analysis
- Training step: estimate the weight vector $W$ and the bias $W_0$ from the training set $S_n = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$.
- One approach: minimize the perceptron criterion function, given by
  $J(W, W_0) = \sum_{x \in X(W, W_0)} |W^T x + W_0|$ if $X(W, W_0) \neq \emptyset$, and $J(W, W_0) = 0$ if $X(W, W_0) = \emptyset$,
  where $X(W, W_0)$ is the set of training samples misclassified by the choice of $W, W_0$.
- The problem is solved by an iterative algorithm that converges in a finite number of steps when the classes are linearly separable; a sketch follows below.
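A minimal sketch of such an iterative scheme, using the classic perceptron update (a standard algorithm, not taken from the slides; labels are assumed to be in {1, 2}):

```python
import numpy as np

def perceptron_train(X, y, max_iter=1000):
    """Iteratively reduce the perceptron criterion J(W, W0).

    Converges in a finite number of steps only when the two classes
    are linearly separable.
    """
    s = np.where(y == 1, 1.0, -1.0)        # map classes {1, 2} to signs {+1, -1}
    W = np.zeros(X.shape[1])
    W0 = 0.0
    for _ in range(max_iter):
        any_misclassified = False
        for x_i, s_i in zip(X, s):
            if s_i * (W @ x_i + W0) <= 0:  # x_i is in X(W, W0)
                W += s_i * x_i             # shift the hyperplane toward x_i
                W0 += s_i
                any_misclassified = True
        if not any_misclassified:          # X(W, W0) is empty: J(W, W0) = 0
            break
    return W, W0
```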

Slide 14: Primer 1: Three Popular Classification Rules, 1) Linear Discriminant Analysis
- Bayes classifier: $f_B(x) = \arg\max_k p(k \mid x)$, where $p(k \mid x) = \frac{p(x \mid k)\, p(k)}{p(x)}$.
- In the case where the class-conditional distribution $p(x \mid k)$ is $N(\mu_k, \Sigma_k)$, the Bayes classifier is
  $f_B(x) = \arg\min_k \left[ (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \ln|\Sigma_k| - 2 \ln p(k) \right]$.
- Note: $f_B(x) = f_{QDA}(x)$.

Slide 15: Primer 1: Three Popular Classification Rules, 1) Linear Discriminant Analysis
- If we suppose $\Sigma_k = \Sigma$ for all $k$ and $p(k)$ is uniform, then
  $f_B(x) = \arg\min_k \left( \mu_k^T \Sigma^{-1} \mu_k - 2 x^T \Sigma^{-1} \mu_k \right) = f_{LDA}(x)$.
- In this paper, $\Sigma_k = \mathrm{diag}(\sigma_{k1}^2, \ldots, \sigma_{kd}^2)$ and $p(k)$ is uniform, so
  $f_B(x) = \arg\min_k \sum_{j=1}^{d} \left[ \left( \frac{x_j - \mu_{kj}}{\sigma_{kj}} \right)^2 + 2 \ln \sigma_{kj} \right] = f_{DQDA}(x)$,
  where $d = 2$ or $5$ is the dimension of $x$.
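A minimal sketch of the resulting diagonal-covariance rule (my own illustration under the assumptions above, with per-class, per-feature variance estimates):

```python
import numpy as np

def dqda_fit(X, y, classes=(1, 2)):
    """Estimate per-class means mu_k and per-feature std devs sigma_kj."""
    params = {}
    for k in classes:
        Xk = X[y == k]
        params[k] = (Xk.mean(axis=0), Xk.std(axis=0, ddof=1))
    return params

def dqda_predict(params, x):
    """f_DQDA(x) = argmin_k sum_j [((x_j - mu_kj)/sigma_kj)^2 + 2 ln sigma_kj]."""
    scores = {k: np.sum(((x - mu) / sig) ** 2 + 2 * np.log(sig))
              for k, (mu, sig) in params.items()}
    return min(scores, key=scores.get)
```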

Slide 16: Primer 1: Three Popular Classification Rules, 2) k-Nearest Neighbor Rule (k-NNR)
- For a given unlabeled sample $X \in \mathbb{R}^d$, find the $k$ closest labeled samples in the training set and assign $X$ to the class that appears most frequently among them.
- Only requires: an integer $k$; a training set; a metric to measure "closeness".

Slide 17: Primer 1: Three Popular Classification Rules, 2) k-Nearest Neighbor Rule (k-NNR)
- An example (figure).
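A minimal sketch of the rule (Euclidean metric assumed; ties broken arbitrarily by np.argmax):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Assign x to the class appearing most frequently among its
    k nearest labeled training samples."""
    dist = np.linalg.norm(X_train - x, axis=1)   # "closeness" metric
    nearest = y_train[np.argsort(dist)[:k]]      # labels of the k closest samples
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]             # most frequent class
```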

Slide 18: Primer 1: Three Popular Classification Rules, 3) Decision Tree Rule (CART)
- An impurity-based split method.
- Dr. Geman has talked a lot about this method.

Slide 19: Primer 2: Error Estimation Methods
- Here we assume a two-class problem, i.e., $C = \{1, 2\}$.
- Given a training set $S_n = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$, where the $(x^{(i)}, y^{(i)})$ are i.i.d. with law $F = P(x, k) = P(X = x, Y = k)$ (which is usually unknown), a classifier is a mapping $f_S: \mathbb{R}^d \to \{1, 2\}$.

Slide 20: Primer 2: Error Estimation Methods
- The true error of a classifier based on a fixed training set: $\varepsilon_n = E_F(|Y - f_S(X)|)$.
- Since $F$ is unknown, we cannot compute $\varepsilon_n$ exactly.
- The unconditional error of the classification rule: $\varepsilon = E(\varepsilon_n) = E(E_F(|Y - f_S(X)|))$, where the outer expectation is with respect to $S_n$.
- In practice, we must use an error estimator (for example, a sample mean) to approximate the true error $\varepsilon_n$. In this paper, the bias of an error estimator always refers to expectation over $S_n$.

Slide 21: Primer 2: Error Estimation Methods, 1) Holdout Method
- For large samples: randomly choose a subset of size $n_t$ as the test set and use the remaining $n - n_t$ samples as the training set, so that the test set is independent of the training set. The error estimator is
  $\hat{\varepsilon}_{n_t} = \frac{1}{n_t} \sum_{(x^{(i)}, y^{(i)}) \in S_{n_t}} |y^{(i)} - f_{S_n \setminus S_{n_t}}(x^{(i)})|$,
  where $S_{n_t}$ is the test set.
- This is an unbiased estimator of $E[\varepsilon_{n - n_t}]$.
- Impractical with small samples.

Slide 22: Primer 2: Error Estimation Methods, 2) Resubstitution Method
- Estimate the error by computing the error directly on the whole training set:
  $\hat{\varepsilon}_{resub} = \frac{1}{n} \sum_{i=1}^{n} |y^{(i)} - f_{S_n}(x^{(i)})|$.
- Usually low-biased (optimistic) as an estimator of $\varepsilon_n$, and can be arbitrarily close to 0; this is caused by overfitting and by reusing the same data to measure error.
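Minimal sketches of both estimators, again assuming the hypothetical train_rule convention used in the cross-validation sketch above:

```python
import numpy as np

def resub_error(X, y, train_rule):
    """Resubstitution: train and test on the same n samples (optimistic)."""
    f = train_rule(X, y)
    return np.mean(f(X) != y)

def holdout_error(X, y, train_rule, n_t):
    """Holdout: train on n - n_t samples, test on the held-out n_t."""
    idx = np.random.permutation(len(y))
    test, train = idx[:n_t], idx[n_t:]
    f = train_rule(X[train], y[train])
    return np.mean(f(X[test]) != y[test])
```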

Slide 23: Primer 2: Error Estimation Methods, 3) Cross-validation
- K-fold cross-validation.
- Leave-one-out cross-validation, i.e., n-fold cross-validation, in which a single sample is left out each time.
- Stratified cross-validation: the classes are represented in each fold in the same proportion as in the original data; this improves the estimator (Witten and Frank, 2000).

Slide 24: Primer 2: Error Estimation Methods, 4) Bootstrap Method
- Bootstrap sampling: a bootstrap sample $S_n^*$ consists of $n$ samples selected from the original data set with replacement.
- Note that the probability that a given original sample appears in $S_n^*$ is $1 - (1 - 1/n)^n$, which goes to $1 - e^{-1} \approx 0.632$ as $n$ goes to infinity.
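A quick numeric check of this limit:

```python
# P(a fixed original sample appears in the bootstrap sample) = 1 - (1 - 1/n)^n
for n in (10, 100, 10000):
    print(n, 1 - (1 - 1 / n) ** n)   # tends to 1 - 1/e, about 0.632
```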

Slide 25: Primer 2: Error Estimation Methods, 4) Bootstrap Method
- Bootstrap method:
  a) For each $i = 1, \ldots, B$:
     - Choose a bootstrap sample $S_n^{*i} = \{(x^{(i_1)}, y^{(i_1)}), (x^{(i_2)}, y^{(i_2)}), \ldots, (x^{(i_n)}, y^{(i_n)})\}$, $i_j \in \{1, 2, \ldots, n\}$, from $S_n$, called the training set.
     - Get $(S_n^{*i})^C$: the set of samples that did not make it into $S_n^{*i}$, called the test set.
     - Build the classifier $f_i$ using $S_n^{*i}$.

Slide 26: Primer 2: Error Estimation Methods, 4) Bootstrap Method
- Calculate the estimated error rate on the test set:
  $\hat{\varepsilon}_{0i} = \frac{1}{|(S_n^{*i})^C|} \sum_{(x, y) \in (S_n^{*i})^C} |y - f_i(x)|$.
- Calculate the estimated error rate on the training set:
  $\hat{\varepsilon}_{resub,i} = \frac{1}{|S_n^{*i}|} \sum_{(x, y) \in S_n^{*i}} |y - f_i(x)|$.

Slide 27: Primer 2: Error Estimation Methods, 4) Bootstrap Method
- b) Calculate the bootstrap zero estimator,
  $\hat{\varepsilon}_0 = \frac{1}{B} \sum_{i=1}^{B} \hat{\varepsilon}_{0i}$,
  which tends to be a high-biased estimator of $\varepsilon$, and the resubstitution estimator,
  $\hat{\varepsilon}_{resub} = \frac{1}{B} \sum_{i=1}^{B} \hat{\varepsilon}_{resub,i}$.

Slide 28: Primer 2: Error Estimation Methods, 4) Bootstrap Method
- c) Get the .632 bootstrap estimator,
  $\hat{\varepsilon}_{b632} = 0.368\, \hat{\varepsilon}_{resub} + 0.632\, \hat{\varepsilon}_0$,
  which tries to correct the high bias of the bootstrap zero estimator via a weighted average of the zero and resubstitution estimators.
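Putting slides 25 to 28 together, a minimal sketch of the .632 bootstrap estimator (same hypothetical train_rule convention as before):

```python
import numpy as np

def bootstrap_632(X, y, train_rule, B=100):
    """Weighted average of resubstitution and zero-bootstrap error rates."""
    n = len(y)
    eps0, epsr = [], []
    for _ in range(B):
        boot = np.random.randint(0, n, size=n)   # sample n indices with replacement
        test = np.setdiff1d(np.arange(n), boot)  # samples left out of S_n^{*i}
        f = train_rule(X[boot], y[boot])
        if len(test) > 0:                        # skip the rare all-inclusive sample
            eps0.append(np.mean(f(X[test]) != y[test]))
        epsr.append(np.mean(f(X[boot]) != y[boot]))
    return 0.368 * np.mean(epsr) + 0.632 * np.mean(eps0)
```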

Slide 29: Primer 2: Error Estimation Methods, 4) Bootstrap Method
- Bias-corrected bootstrap estimator: try to correct the bias of resubstitution directly by adding to $\hat{\varepsilon}_{resub}$ a bootstrap estimate of its bias (Efron, 1983):
  $\hat{\varepsilon}_{bbc} = \hat{\varepsilon}_{resub} + \frac{1}{B} \sum_{b=1}^{B} \sum_{i=1}^{n} \left( \frac{1}{n} - p_i^b \right) |y^{(i)} - f_b(x^{(i)})|$,
  where $p_i^b = \frac{1}{n} \sum_{j=1}^{n} I[(x_b^{(j)}, y_b^{(j)}) = (x^{(i)}, y^{(i)})]$ is the actual proportion of times the data point $(x^{(i)}, y^{(i)})$ appears in the $b$-th bootstrap sample.
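A minimal sketch of this estimator under the same conventions (the proportions p_i^b come straight from np.bincount):

```python
import numpy as np

def bootstrap_bias_corrected(X, y, train_rule, B=100):
    """Resubstitution plus a bootstrap estimate of its bias (Efron, 1983)."""
    n = len(y)
    f_full = train_rule(X, y)
    eps_resub = np.mean(f_full(X) != y)
    correction = 0.0
    for _ in range(B):
        boot = np.random.randint(0, n, size=n)
        p = np.bincount(boot, minlength=n) / n   # p_i^b for every data point
        f_b = train_rule(X[boot], y[boot])
        loss = (f_b(X) != y).astype(float)       # |y_i - f_b(x_i)|
        correction += np.sum((1.0 / n - p) * loss)
    return eps_resub + correction / B
```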
