Random projection ensemble classification
Timothy I. Cannings
Statistics for Big Data Workshop, Brunel
Joint work with Richard Samworth
Introduction to classification

Observe data from two classes: pairs (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n) taking values in R^p × {0, 1}.

Task: predict the class, Y, of a new observation X.

Applications in medical diagnosis, fraud detection, marketing, ...

Lots of well-developed methods exist, e.g. linear discriminant analysis, k-nearest neighbours, support vector machines, classification trees, and more. Many of these are not suitable for Big Data and suffer from the curse of dimensionality; for example, LDA is not directly applicable if p > n. Previous proposals typically assume linear and/or sparse decision boundaries.
Random projections

Johnson–Lindenstrauss lemma: n points in R^p can be projected into dimension d = O(log n) in such a way that all pairwise distances are approximately preserved.

Random projections have been used successfully as a computational time-saver. Typically only one projection is used; this works well if p is very large.
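To make the idea concrete, here is a minimal numpy sketch (not from the talk) of a Gaussian random projection: with d far below p, a pairwise distance is approximately preserved, as the Johnson–Lindenstrauss lemma guarantees for d = O(log n).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, d = 50, 1000, 40  # d is of order log(n), far below p

X = rng.normal(size=(n, p))

# Gaussian random projection, scaled so that squared distances
# are preserved in expectation.
A = rng.normal(size=(d, p)) / np.sqrt(d)
Z = X @ A.T  # the n points, now in dimension d

# The distance between the first two points barely changes.
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Z[0] - Z[1])
ratio = proj / orig  # close to 1 with high probability
```

Note the projection here is dense Gaussian rather than Haar-orthonormal; for this distance-preservation property the two behave essentially the same.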
Motivating result

If the conditional distributions of X | Y = 0 and X | Y = 1 are different, then there exists a d-dimensional projection A_0 such that the distributions of A_0 X | Y = 0 and A_0 X | Y = 1 are different.

Main idea: average over many (carefully chosen) low-dimensional random projections of the data.
Most projections are useless!

[Figure: six scatter plots of two-dimensional projections of the same data (first component vs. second component); most random projections fail to separate the two classes.]
Statistical setting

Suppose that the pair (X, Y) has joint distribution P, with prior π_1 := P(Y = 1) = 1 − π_0, and write η(x) := P(Y = 1 | X = x).

The risk, R(C) := P{C(X) ≠ Y}, is minimised by the Bayes classifier

    C^Bayes(x) = 1 if η(x) ≥ 1/2; 0 otherwise.

We cannot use the Bayes classifier in practice, since η is unknown. Instead, we use the (i.i.d.) training pairs (X_1, Y_1), ..., (X_n, Y_n) ~ P.
The generic random projection classifier

Let C_{n,T} denote the d-dimensional base classifier, trained with the data T, consisting of n pairs in R^d × {0, 1}.

Let A_1, A_2, ..., A_{B_1} be random projections from R^p to R^d. For b_1 = 1, ..., B_1, classify A_{b_1} x using the base classifier

    Ĉ_n^{A_{b_1}}(x) = C_{n, T_{A_{b_1}}}(A_{b_1} x), with T_{A_{b_1}} := {(A_{b_1} X_1, Y_1), ..., (A_{b_1} X_n, Y_n)},

then average:

    ν̂_n^{B_1}(x) := (1/B_1) Σ_{b_1=1}^{B_1} 1{Ĉ_n^{A_{b_1}}(x) = 1}.

The generic random projection ensemble classifier is given by

    Ĉ_n^{RP}(x) := 1 if ν̂_n^{B_1}(x) ≥ α; 0 otherwise,

for some α ∈ (0, 1).
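A minimal Python sketch of this generic ensemble, assuming a pluggable `base_fit_predict` interface (a hypothetical interface, not from the talk); Haar-distributed projections are generated via QR decomposition of a Gaussian matrix.

```python
import numpy as np

def haar_projection(p, d, rng):
    """A d x p projection with orthonormal rows, drawn from Haar measure."""
    Q, _ = np.linalg.qr(rng.normal(size=(p, d)))
    return Q.T  # A A^T = I_d

def rp_ensemble_predict(X_tr, y_tr, X_te, base_fit_predict,
                        B1=20, d=2, alpha=0.5, seed=0):
    """Generic random projection ensemble: average the base classifier's
    0/1 votes over B1 projections, then threshold nu_hat at alpha."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(X_te))
    for _ in range(B1):
        A = haar_projection(X_tr.shape[1], d, rng)
        votes += base_fit_predict(X_tr @ A.T, y_tr, X_te @ A.T)
    nu_hat = votes / B1  # the vote fraction nu_hat_n^{B1}(x)
    return (nu_hat >= alpha).astype(int)

def nn1(Z_tr, y_tr, Z_te):
    """Toy 1-nearest-neighbour base classifier on the projected data."""
    d2 = ((Z_te[:, None, :] - Z_tr[None, :, :]) ** 2).sum(axis=2)
    return y_tr[np.argmin(d2, axis=1)]
```

Note this sketch uses i.i.d. Haar projections without the selection step; the "choosing good random projections" slide refines how the A_b are drawn.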
The infinite-projection random projection classifier

For now, fix the training data {(X_1, Y_1), ..., (X_n, Y_n)} = {(x_1, y_1), ..., (x_n, y_n)}.

Let µ̂_n(x) := E{ν̂_n^{B_1}(x)} = P{Ĉ_n^{A_1}(x) = 1}, where P and E refer to the randomness in the projections, and define the infinite-projection classifier

    Ĉ_n^{RP*}(x) := 1 if µ̂_n(x) ≥ α; 0 otherwise.
The generic random projection classifier: notation

Let G_{n,0} and G_{n,1} denote the distribution functions of µ̂_n(X) | {Y = 0} and µ̂_n(X) | {Y = 1}, respectively.

Assumption A.1: Suppose that G_{n,0} and G_{n,1} are twice differentiable at α.
The generic random projection classifier: theory 1

Recall R(Ĉ_n) = P{Ĉ_n(X) ≠ Y}, where P refers to the pair (X, Y) only.

Theorem. Assume A.1. Then, as B_1 → ∞,

    E{R(Ĉ_n^{RP})} − R(Ĉ_n^{RP*}) = γ_n(α)/B_1 + o(1/B_1),

where

    γ_n(α) := (⌈B_1 α⌉ − B_1 α) G_n'(α) + {α(1 − α)/2} G_n''(α),

and G_n := π_1 G_{n,1} − π_0 G_{n,0}.
The generic random projection classifier: theory 2

Theorem. For each B_1 ∈ N, we have

    E{R(Ĉ_n^{RP})} − R(C^Bayes) ≤ [E{R(Ĉ_n^{A_1})} − R(C^Bayes)] / min(α, 1 − α).
Choosing good random projections

1. Generate independent d-dimensional projections A_{b_1, b_2}, for b_1 = 1, ..., B_1 and b_2 = 1, ..., B_2, according to Haar measure.
2. Estimate the test error of the base classifier after each projection by R̂_n^{A_{b_1, b_2}}, using, for example, the resubstitution or leave-one-out estimator.
3. For each b_1, select the projection from the block of size B_2 that yields the smallest estimate of the test error.
4. Use the selected projections as A_1, ..., A_{B_1} in the generic classifier.
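The selection step above can be sketched as follows — a hypothetical helper, with a nearest-centroid rule standing in for the LDA base classifier and the resubstitution error as the test-error estimate:

```python
import numpy as np

def centroid_fit_predict(Z_tr, y_tr, Z_te):
    """Nearest class-centroid rule on projected data (an LDA-like
    stand-in with identity covariance)."""
    m0 = Z_tr[y_tr == 0].mean(axis=0)
    m1 = Z_tr[y_tr == 1].mean(axis=0)
    d0 = ((Z_te - m0) ** 2).sum(axis=1)
    d1 = ((Z_te - m1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

def select_projection(X, y, base_fit_predict, B2=50, d=2, seed=0):
    """Out of B2 Haar-distributed projections, keep the one with the
    smallest resubstitution error estimate (steps 1-3 on this slide)."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    best_A, best_err = None, np.inf
    for _ in range(B2):
        Q, _ = np.linalg.qr(rng.normal(size=(p, d)))
        A = Q.T  # orthonormal rows: A A^T = I_d
        # Resubstitution estimator: error on the projected training data.
        err = np.mean(base_fit_predict(X @ A.T, y, X @ A.T) != y)
        if err < best_err:
            best_A, best_err = A, err
    return best_A, best_err
```

Repeating this for b_1 = 1, ..., B_1 (with independent blocks) yields the selected projections A_1, ..., A_{B_1} used in the generic classifier.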
Further assumptions [training data still fixed]

Assumption A.2: There exists β ∈ (0, 1) such that

    P( R̂_n^A ≤ inf_{A'} R̂_n^{A'} + ε_n ) ≥ β,

where A is drawn from Haar measure and ε_n = ε_n^{(B_2)} := E{R(Ĉ_n^{A_1}) − R̂_n^{A_1}}.

Assumption A.3 (sufficient dimension reduction): Suppose there exists A* ∈ R^{d×p} such that Y is independent of X given A* X (Cook, 1998; Lee et al., 2013).
Test error bound

Theorem. Assume A.2 and A.3. Then, for each B_1 ∈ N,

    E{R(Ĉ_n^{RP})} − R(C^Bayes) ≤ [R(Ĉ_n^{A*}) − R(C^Bayes)] / min(α, 1 − α) + (2ε_n − ε_n^{A*}) / min(α, 1 − α) + (1 − β)^{B_2} / min(α, 1 − α),

where ε_n = E{R(Ĉ_n^{A_1}) − R̂_n^{A_1}} and ε_n^{A*} := R(Ĉ_n^{A*}) − R̂_n^{A*}.
Choosing α

Oracle choice:

    α ∈ argmin_{α' ∈ [0,1]} [ π_1 G_{n,1}(α') + π_0 {1 − G_{n,0}(α')} ].

Practical choice:

    α̂ ∈ argmin_{α' ∈ [0,1]} [ π̂_1 Ĝ_{n,1}(α') + π̂_0 {1 − Ĝ_{n,0}(α')} ].

Computing Ĝ_{n,1}(α') and Ĝ_{n,0}(α') requires negligible extra computational cost, since we have already classified the training data when calculating the test error estimates. This performs remarkably well!
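A sketch of the practical choice, assuming we already hold the ensemble vote fractions ν̂ (here `nu_hat`) for the training points; `choose_alpha` is a hypothetical helper for illustration, not code from the RPEnsemble package:

```python
import numpy as np

def choose_alpha(nu_hat, y, grid=None):
    """Minimise the empirical objective
        pi1_hat * G1_hat(a) + pi0_hat * {1 - G0_hat(a)}
    over a grid of thresholds a, where Gr_hat is the empirical
    distribution function of nu_hat among class-r training points."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 101)
    pi1 = np.mean(y == 1)
    errs = []
    for a in grid:
        G1 = np.mean(nu_hat[y == 1] < a)  # class-1 points classified as 0
        G0 = np.mean(nu_hat[y == 0] < a)  # class-0 points classified as 0
        errs.append(pi1 * G1 + (1.0 - pi1) * (1.0 - G0))
    return grid[int(np.argmin(errs))]
```

The classifier then predicts 1 whenever ν̂_n^{B_1}(x) ≥ α̂, so the objective is exactly the training-error estimate of the ensemble at threshold a.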
Choosing α

[Figure: simulation panels illustrating the choice of α.]
Choosing α

[Figure: further simulation panels illustrating the choice of α.]
Base classifier: linear discriminant analysis

Assuming the class priors are equal,

    Ĉ_n^{A,LDA}(x) := 1 if (Ax − (µ̂_1^A + µ̂_0^A)/2)^T Ω̂^A (µ̂_1^A − µ̂_0^A) ≥ 0; 0 otherwise.

Now consider the training data as random pairs from P. Under normality assumptions,

    E{R(Ĉ_n^{A,LDA})} − R(C^Bayes) = O(√(d/n)).

For the training error estimator R̂_n^A := (1/n) Σ_{i=1}^n 1{Ĉ_n^{A,LDA}(X_i) ≠ Y_i}, Vapnik–Chervonenkis theory (Devroye et al., 1996) gives

    E ε_n^A ≤ 8 √{(d log n + 3 log 2 + 1)/(2n)}

and

    E ε_n ≤ 8 √{(d log n + 3 log 2 + log B_2 + 1)/(2n)}.
Base classifier: k-nearest neighbours

Classify Ax according to a majority vote over its k nearest neighbours among the projected training data. Under regularity assumptions (Hall et al., 2008),

    E R(Ĉ_n^{A,knn}) − R(C_n^{A,Bayes}) = O(1/k + (k/n)^{4/d}).

For the leave-one-out estimator R̂_n^A := (1/n) Σ_{i=1}^n 1{Ĉ_{n,−i}^{A,knn}(X_i) ≠ Y_i}, we have that

    E ε_n^A ≤ {1/n + 24 k^{1/2} / ((2π)^{1/2} n)}^{1/2},

and

    E ε_n ≤ 3 {4(3^d + 1)}^{1/3} {k(1 + log B_2 + 3 log 2)/n}^{1/3}.
Simulation comparison

- RP-LDA_d: the random projection ensemble classifier with LDA base, B_1 = 500, B_2 = 50.
- LDA, QDA and knn: base classifiers in the original space.
- RF: random forests (ntree = 1000) (Breiman, 2001).
- SVM: support vector machines (Cortes and Vapnik, 1995).
- GP: Gaussian process classifiers (Williams and Barber, 1998).
- PenLDA: penalised linear discriminant analysis (Witten and Tibshirani, 2011).
- NSC: nearest shrunken centroids (Tibshirani et al., 2003).
- PenLog: ℓ_1-penalised logistic regression.
- SDR5: dimension reduction based on the SDR assumption.
- OTE: optimal tree ensembles.
Simulations 1

X | {Y = 1} ~ N_p(Rµ_1, RΣ_1 R^T) and X | {Y = 0} ~ N_p(Rµ_0, RΣ_0 R^T). R is a p × p rotation matrix, p = 100; µ_1, µ_0, Σ_1 and Σ_0 are such that (A.3) holds with d = 3. Bayes risk = 4.09.

n            50     100    200    500    1000
RP-LDA_5     8.2    6.2    5.6    5.2    5.1
RP-QDA_5     8.1    6.2    5.6    5.1    5.1
RP-knn_5     9.0    6.5    5.7    5.3    5.1
LDA          N/A    N/A    14.3   7.7    6.3
knn          12.8   10.0   8.8    7.7    7.3
RF           11.1   7.9    6.8    6.2    6.1
Radial SVM   24.0   8.8    6.4    5.6    5.5
Radial GP    14.1   7.5    5.8    5.2    5.1
PenLDA       11.1   8.0    6.7    5.8    5.8
NSC          12.6   9.1    7.3    5.9    5.8
OTE          18.3   15.5   12.4   10.1   9.2
Simulations 2

X | {Y = r} ~ (1/2) N_p(µ_r, I_{p×p}) + (1/2) N_p(−µ_r, I_{p×p}), p = 100, with µ_1 = (2, 2, 0, ..., 0)^T and µ_0 = (2, −2, 0, ..., 0)^T. (A.3) holds with d = 2. Bayes risk = 4.45.

n            50     100    200    500    1000
RP-QDA_5     39.3   31.5   22.3   12.3   8.8
RP-knn_5     43.7   35.2   25.3   14.3   10.2
QDA          N/A    N/A    N/A    35.2   27.4
knn          34.7   28.7   23.7   18.2   15.3
RF           49.7   49.4   48.3   46.2   43.3
Radial SVM   49.8   49.8   50.2   49.8   48.7
Linear SVM   50.0   50.0   49.6   50.2   50.0
Radial GP    48.2   46.0   42.8   35.4   26.6
PenLDA       50.0   50.0   49.8   50.1   50.1
NSC          49.7   49.8   49.7   50.1   49.6
SDR5-knn     N/A    N/A    32.2   26.3   21.8
OTE          48.5   44.4   34.7   16.9   9.6
Simulations 3 — no (A.3)

P_1 is the distribution of p independent Laplace components; P_0 = N_p(µ, I_{p×p}), with µ = p^{−1/2}(1, ..., 1, 0, ..., 0)^T, where µ has p/2 non-zero components. p = 100, Bayes risk = 1.01.

n            50     100    200    500    1000
RP-QDA_5     9.7    6.1    4.2    3.7    3.3
RP-knn_5     21.3   11.0   6.9    4.9    3.8
QDA          N/A    N/A    N/A    35.5   15.3
knn          49.9   49.8   49.8   49.9   49.7
RF           44.8   36.5   23.4   10.7   7.7
Radial SVM   39.3   8.9    4.7    3.8    3.4
Radial GP    48.9   47.5   45.5   41.1   36.2
PenLDA       46.0   45.0   44.5   43.3   41.7
NSC          47.5   46.6   46.0   44.3   42.3
PenLog       48.8   47.3   46.4   44.1   42.2
SDR5-knn     N/A    N/A    46.1   40.1   36.3
OTE          46.7   42.0   30.6   16.8   11.4
Real data 1: Ionosphere

The Ionosphere dataset from the UCI Machine Learning Repository consists of p = 32 high-frequency antenna measurements for 351 observations. Observations are classified as good (class 1) or bad (class 0), depending on whether or not there is evidence for free electrons in the ionosphere.

n            50     100    200
RP-QDA_5     8.1    6.2    5.2
RP-knn_5     13.1   7.4    5.4
QDA          N/A    N/A    14.1
knn          21.8   18.1   16.4
RF           10.5   7.5    6.5
Radial SVM   27.7   12.6   6.7
Linear SVM   19.4   17.1   15.5
Radial GP    22.3   17.8   14.5
PenLDA       21.2   19.8   19.8
NSC          22.6   19.1   17.5
SDR5-knn     30.6   17.5   10.1
OTE          14.4   9.8    7.3
Real data 2: Musk

The Musk dataset from the UCI Machine Learning Repository consists of 1016 musk (class 1) and 5581 non-musk (class 0) molecules. The task is to classify a molecule based on p = 166 shape measurements.

n            100    200    500
RP-QDA_5     12.1   9.9    8.6
RP-knn_5     11.8   9.7    8.0
knn          14.7   11.8   8.2
RF           13.2   10.7   7.6
Linear SVM   13.9   10.4   7.4
Radial GP    14.9   14.1   11.1
PenLDA       27.7   27.1   27.0
NSC          15.3   15.2   15.2
SDR5-knn     N/A    24.1   9.8
OTE          13.9   11.0   8.1
Extensions 1: sample splitting

We could split the training sample into T_{n,1} and T_{n,2}, where |T_{n,1}| = n^{(1)} and |T_{n,2}| = n^{(2)}, then use

    R̂_{n^{(1)}, n^{(2)}}^A := (1/n^{(2)}) Σ_{(X_i, Y_i) ∈ T_{n,2}} 1{Ĉ_{T_{n,1}}^A(X_i) ≠ Y_i}

to estimate the test errors. In this case,

    E ε_n^A = E( R(Ĉ_{n^{(1)}}^A) − R̂_{n^{(1)}, n^{(2)}}^A | T_{n,1} ) ≤ {(1 + log 2)/(2 n^{(2)})}^{1/2},
    E ε_n = E( R(Ĉ_{n^{(1)}}^{A_1}) − R̂_{n^{(1)}, n^{(2)}}^{A_1} | T_{n,1} ) ≤ {(1 + log 2 + log B_2)/(2 n^{(2)})}^{1/2}.

The bounds hold for any choice of base classifier, and are sharper than those earlier, but we have reduced the effective sample size.
Extensions 2: multi-class problems

What if there are K > 2 classes? One option: let

    ν̂_{n,r}^{B_1}(x) := (1/B_1) Σ_{b_1=1}^{B_1} 1{Ĉ_n^{A_{b_1}}(x) = r}, for r = 1, ..., K.

Given α_1, ..., α_K > 0 with Σ_{r=1}^K α_r = 1, we can then define

    Ĉ_n^{RP}(x) := sargmax_{r=1,...,K} {α_r ν̂_{n,r}^{B_1}(x)}.

The choice of α_1, ..., α_K is analogous to the choice of α in the case K = 2. Task: minimise the test error of the infinite-projection random projection ensemble classifier as before.
Extensions 3: ultrahigh-dimensional data

What about alternative types of projection? Consider the ultrahigh-dimensional setting, say p in the thousands. In this case, the space of projections is very large. We could use axis-aligned projections instead, i.e. subsample the features (work in progress with Ben Li at Harvard). There are then only (p choose d) ≤ p^d/d! choices for the projections, and if d is small, it may be feasible to carry out an exhaustive search. We lose the equivariance to orthogonal transformations.
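A sketch of the axis-aligned alternative: each "projection" simply selects d of the p coordinates, so the rows of A are standard basis vectors and, for small d, all (p choose d) subsets can be enumerated exhaustively.

```python
import numpy as np
from itertools import combinations
from math import comb

def axis_aligned_projections(p, d):
    """Yield every axis-aligned d x p projection: row i of A is the
    standard basis vector picking out the i-th chosen feature."""
    for cols in combinations(range(p), d):
        A = np.zeros((d, p))
        A[np.arange(d), list(cols)] = 1.0
        yield cols, A

# Exhaustive search is feasible when d is small: for p = 6, d = 2
# there are comb(6, 2) = 15 candidate projections.
projections = list(axis_aligned_projections(6, 2))
```

Applying such an A to x just extracts the chosen coordinates, which is why the equivariance to orthogonal transformations of R^p is lost.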
The R package RPEnsemble is available from CRAN. Thank you!
Bibliography I

Bickel, P. J. and Levina, E. (2004). Some theory for Fisher's linear discriminant function, naive Bayes, and some alternatives when there are more variables than observations. Bernoulli, 10, 989–1010.

Breiman, L. (2001). Random Forests. Machine Learning, 45, 5–32.

Cannings, T. I. and Samworth, R. J. (2015). Random projection ensemble classification. arXiv e-prints, 1504.04595.

Cook, R. D. (1998). Regression Graphics: Ideas for Studying Regressions through Graphics. Wiley, New York.

Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273–297.

Dasgupta, S. and Gupta, A. (2002). An elementary proof of the Johnson–Lindenstrauss Lemma. Random Struct. Alg., 22, 60–65.

Devroye, L., Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer, New York.
Bibliography II

Hall, P., Park, B. U. and Samworth, R. J. (2008). Choice of neighbour order in nearest-neighbour classification. Ann. Statist., 36, 2135–2152.

Lee, K.-Y., Li, B. and Chiaromonte, F. (2013). A general theory for nonlinear sufficient dimension reduction: formulation and estimation. Ann. Statist., 41, 221–249.

Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G. (2003). Class prediction by nearest shrunken centroids, with applications to DNA microarrays. Statist. Science, 18, 104–117.

Williams, C. K. I. and Barber, D. (1998). Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 1342–1351.

Witten, D. M. and Tibshirani, R. (2011). Penalized classification using Fisher's linear discriminant. J. Roy. Statist. Soc., Ser. B., 73, 753–772.