Random projection ensemble classification

Random projection ensemble classification
Timothy I. Cannings
Statistics for Big Data Workshop, Brunel
Joint work with Richard Samworth

Introduction to classification

Observe data from two classes: pairs $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ taking values in $\mathbb{R}^p \times \{0, 1\}$.
Task: predict the class, $Y$, of a new observation $X$. Applications in medical diagnosis, fraud detection, marketing, ...
There are many well-developed methods, e.g. linear discriminant analysis, $k$-nearest neighbours, support vector machines, classification trees and more. Many of these are not suitable for Big Data and suffer from the curse of dimensionality; for example, LDA is not directly applicable if $p > n$. Previous proposals assume linear and/or sparse decision boundaries.

Random projections

Johnson-Lindenstrauss Lemma: $n$ points in $\mathbb{R}^p$ can be projected into dimension $d = O(\log n)$ such that all pairwise distances are approximately preserved.
Random projections have been used successfully as a computational time saver. Typically only one projection is used; this works well if $p$ is very large.
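
A small illustration in R (a sketch, not from the talk; it uses a scaled Gaussian matrix rather than a Haar-distributed projection): pairwise distances survive a single random projection reasonably well.

  set.seed(1)
  n <- 100; p <- 1000; d <- 20
  X <- matrix(rnorm(n * p), n, p)            # n points in R^p
  A <- matrix(rnorm(d * p), d, p) / sqrt(d)  # scaled Gaussian projection from R^p to R^d
  d_orig <- as.numeric(dist(X))              # original pairwise distances
  d_proj <- as.numeric(dist(X %*% t(A)))     # distances after projection
  summary(d_proj / d_orig)                   # ratios concentrate around 1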

Motivating result

If the conditional distributions of $X \mid Y = 0$ and $X \mid Y = 1$ are different, then there exists a $d$-dimensional projection $A_0$ such that the distributions of $A_0 X \mid Y = 0$ and $A_0 X \mid Y = 1$ are different.
Main idea: average over many (carefully chosen) low-dimensional random projections of the data.

Most projections are useless!

[Figure: six scatter plots (two rows of three) of the projected data, each showing the second component against the first component.]

Statistical setting

Suppose that the pair $(X, Y)$ has joint distribution $P$, with prior $\pi_1 := P(Y = 1) = 1 - \pi_0$, and write $\eta(x) := P(Y = 1 \mid X = x)$. The risk $R(C) := P\{C(X) \neq Y\}$ is minimised by the Bayes classifier
$$C^{\mathrm{Bayes}}(x) = \begin{cases} 1 & \text{if } \eta(x) \geq 1/2, \\ 0 & \text{otherwise.} \end{cases}$$
We cannot use the Bayes classifier in practice, since $\eta$ is unknown. Instead, we use the (i.i.d.) training pairs $(X_1, Y_1), \ldots, (X_n, Y_n) \sim P$.
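
To make the rule concrete (a standard fact, not spelled out on the slide): if class $r$ has density $f_r$, then $\eta(x) = \pi_1 f_1(x)/\{\pi_1 f_1(x) + \pi_0 f_0(x)\}$, so
$$C^{\mathrm{Bayes}}(x) = 1 \iff \pi_1 f_1(x) \geq \pi_0 f_0(x);$$
with Gaussian class-conditional densities sharing a common covariance matrix this reduces to the linear discriminant rule used later as a base classifier.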

The generic random projection classifier

Let $C_{n,T}$ denote the $d$-dimensional base classifier, trained with the data $T$ consisting of $n$ pairs in $\mathbb{R}^d \times \{0, 1\}$. Let $A_1, A_2, \ldots, A_{B_1}$ be random projections from $\mathbb{R}^p$ to $\mathbb{R}^d$. For $b = 1, \ldots, B_1$, classify $A_b X$ using the base classifier $\hat{C}^{A_b}_n(x) = C_{n, T^{A_b}}(A_b x)$ with $T^{A_b} := \{(A_b X_1, Y_1), \ldots, (A_b X_n, Y_n)\}$, then average:
$$\hat{\nu}^{B_1}_n(x) := \frac{1}{B_1} \sum_{b=1}^{B_1} \mathbb{1}_{\{\hat{C}^{A_b}_n(x) = 1\}}.$$
The generic random projection ensemble classifier is given by
$$\hat{C}^{\mathrm{RP}}_n(x) := \begin{cases} 1 & \text{if } \hat{\nu}^{B_1}_n(x) \geq \alpha, \\ 0 & \text{otherwise,} \end{cases}$$
for some $\alpha \in (0, 1)$.
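
A minimal sketch of the generic ensemble in R (our illustration, not the RPEnsemble package itself): Haar projections are drawn via the QR decomposition of a Gaussian matrix, LDA from the MASS package plays the base classifier, and names such as haar_proj and rp_ensemble are ours.

  library(MASS)  # for lda()

  haar_proj <- function(p, d) {
    # d x p matrix with orthonormal rows; QR of a Gaussian matrix plus a
    # sign correction gives a draw from Haar measure
    qz <- qr(matrix(rnorm(p * d), p, d))
    t(qr.Q(qz) %*% diag(sign(diag(qr.R(qz))), d, d))
  }

  rp_ensemble <- function(X, y, newX, d = 5, B1 = 100, alpha = 0.5) {
    votes <- rowMeans(sapply(seq_len(B1), function(b) {
      A <- haar_proj(ncol(X), d)
      fit <- lda(X %*% t(A), grouping = y)                # base classifier on projected data
      as.integer(predict(fit, newX %*% t(A))$class == 1)  # vote of the b-th projection
    }))
    ifelse(votes >= alpha, 1L, 0L)                        # threshold the vote fraction at alpha
  }

For example, rp_ensemble(Xtrain, ytrain, Xtest, d = 5, B1 = 500, alpha = 0.5) with labels in {0, 1}.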

The infinite-projection random projection classifier

For now, fix the training data $\{(X_1, Y_1), \ldots, (X_n, Y_n)\} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$. Let
$$\hat{\mu}_n(x) := E\{\hat{\nu}^{B_1}_n(x)\} = P\{\hat{C}^{A_1}_n(x) = 1\},$$
where $P$ and $E$ refer to the randomness in the projections, and define the infinite-projection classifier
$$\hat{C}^{\mathrm{RP}*}_n(x) := \begin{cases} 1 & \text{if } \hat{\mu}_n(x) \geq \alpha, \\ 0 & \text{otherwise.} \end{cases}$$

The generic random projection classifier: notation

Let $G_{n,0}$ and $G_{n,1}$ denote the distribution functions of $\hat{\mu}_n(X) \mid \{Y = 0\}$ and $\hat{\mu}_n(X) \mid \{Y = 1\}$, respectively.
Assumption A.1: $G_{n,0}$ and $G_{n,1}$ are twice differentiable at $\alpha$.

The generic random projection classifier: theory 1

Recall $R(\hat{C}_n) = P\{\hat{C}_n(X) \neq Y\}$, where $P$ refers to the pair $(X, Y)$ only.
Theorem. Assume A.1. Then, as $B_1 \to \infty$,
$$E\{R(\hat{C}^{\mathrm{RP}}_n)\} - R(\hat{C}^{\mathrm{RP}*}_n) = \frac{\gamma_n(\alpha)}{B_1} + o\Big(\frac{1}{B_1}\Big),$$
where
$$\gamma_n(\alpha) := \Big(\lceil B_1 \alpha \rceil - B_1 \alpha - \frac{1}{2}\Big)\,\dot{G}_n(\alpha) + \frac{\alpha(1-\alpha)}{2}\,\ddot{G}_n(\alpha)$$
and $G_n := \pi_1 G_{n,1} - \pi_0 G_{n,0}$.

The generic random projection classifier: theory 2

Theorem. For each $B_1 \in \mathbb{N}$, we have
$$R(\hat{C}^{\mathrm{RP}*}_n) - R(C^{\mathrm{Bayes}}) \leq \frac{E\{R(\hat{C}^{A_1}_n)\} - R(C^{\mathrm{Bayes}})}{\min(\alpha, 1 - \alpha)}.$$

Choosing good random projections

1. Generate independent $d$-dimensional projections $A_{b_1, b_2}$, for $b_1 = 1, \ldots, B_1$ and $b_2 = 1, \ldots, B_2$, according to Haar measure.
2. Estimate the test error of the base classifier after each projection by $\hat{R}^{A_{b_1, b_2}}_n$, using, for example, the resubstitution or leave-one-out estimator.
3. For each $b_1$, select the projection from the block of size $B_2$ that yields the smallest estimate of the test error.
4. Use the selected projections as $A_1, \ldots, A_{B_1}$ in the generic classifier (see the sketch below).
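
A sketch of the selection step in R, reusing the hypothetical haar_proj and the LDA base classifier from the earlier sketch; the resubstitution estimate is simply the training error after projection.

  resub_error <- function(A, X, y) {
    # resubstitution (training-error) estimate of the risk after projecting by A
    XA <- X %*% t(A)
    fit <- lda(XA, grouping = y)
    mean(predict(fit, XA)$class != y)
  }

  choose_projections <- function(X, y, d = 5, B1 = 100, B2 = 50) {
    # for each of the B1 blocks, keep the projection (out of B2 Haar draws)
    # with the smallest estimated test error
    lapply(seq_len(B1), function(b1) {
      cand <- lapply(seq_len(B2), function(b2) haar_proj(ncol(X), d))
      errs <- sapply(cand, function(A) resub_error(A, X, y))
      cand[[which.min(errs)]]
    })
  }

The selected projections can then be used in place of fresh Haar draws in the voting step.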

Further assumptions [training data still fixed]

Assumption A.2: There exists $\beta \in (0, 1)$ such that
$$P\Big( \hat{R}^{A}_n \leq \inf_{A'} \hat{R}^{A'}_n + \epsilon_n \Big) \geq \beta,$$
where $A$ is drawn from Haar measure and $\epsilon_n = \epsilon^{(B_2)}_n := E\{R(\hat{C}^{A_1}_n) - \hat{R}^{A_1}_n\}$.
Assumption A.3 (sufficient dimension reduction): there exists a projection $A^*$ from $\mathbb{R}^p$ to $\mathbb{R}^d$ such that $Y$ is independent of $X$ given $A^* X$ (Cook, 1998; Lee et al., 2013).

Test error bound

Theorem. Assume A.2 and A.3. Then, for each $B_1 \in \mathbb{N}$,
$$E\{R(\hat{C}^{\mathrm{RP}}_n)\} - R(C^{\mathrm{Bayes}}) \leq \frac{R(\hat{C}^{A^*}_n) - R(C^{\mathrm{Bayes}})}{\min(\alpha, 1 - \alpha)} + \frac{2\epsilon_n - \epsilon^{A^*}_n}{\min(\alpha, 1 - \alpha)} + \frac{(1 - \beta)^{B_2}}{\min(\alpha, 1 - \alpha)},$$
where $\epsilon_n = E\{R(\hat{C}^{A_1}_n) - \hat{R}^{A_1}_n\}$ and $\epsilon^{A^*}_n := R(\hat{C}^{A^*}_n) - \hat{R}^{A^*}_n$.

Choosing α

Oracle choice:
$$\alpha^* \in \operatorname*{argmin}_{\alpha' \in [0,1]} \big[ \pi_1 G_{n,1}(\alpha') + \pi_0 \{1 - G_{n,0}(\alpha')\} \big].$$
Practical choice:
$$\hat{\alpha} \in \operatorname*{argmin}_{\alpha' \in [0,1]} \big[ \hat{\pi}_1 \hat{G}_{n,1}(\alpha') + \hat{\pi}_0 \{1 - \hat{G}_{n,0}(\alpha')\} \big].$$
Computing $\hat{G}_{n,1}(\alpha')$ and $\hat{G}_{n,0}(\alpha')$ requires negligible extra computational cost, since we have already classified the training data when calculating the test error estimates. This performs remarkably well!
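
A sketch of the practical choice in R, assuming train_votes holds the in-sample vote fractions for the training points and y their labels (both names are ours):

  choose_alpha <- function(train_votes, y, grid = seq(0.01, 0.99, by = 0.01)) {
    # empirical version of pi_1 G_{n,1}(alpha') + pi_0 {1 - G_{n,0}(alpha')}
    pi1 <- mean(y == 1)
    pi0 <- 1 - pi1
    obj <- sapply(grid, function(a) {
      pi1 * mean(train_votes[y == 1] < a) +    # class-1 points voted below the cut-off
      pi0 * mean(train_votes[y == 0] >= a)     # class-0 points voted above the cut-off
    })
    grid[which.min(obj)]
  }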

Choosing α

[Two figure slides illustrating the choice of α; only the numerical axis labels (values between 0.0 and 0.6) survive in the transcript.]

Base classifier: linear discriminant analysis

Assuming the class priors are equal,
$$\hat{C}^{A\text{-LDA}}_n(x) := \begin{cases} 1 & \text{if } \big(Ax - \frac{\hat{\mu}^A_1 + \hat{\mu}^A_0}{2}\big)^T \hat{\Omega}^A (\hat{\mu}^A_1 - \hat{\mu}^A_0) \geq 0, \\ 0 & \text{otherwise.} \end{cases} \quad (1)$$
Now consider the training data as random pairs from $P$. Under normality assumptions,
$$E\{R(\hat{C}^{A\text{-LDA}}_n)\} - R(C^{A\text{-Bayes}}) = O\Big(\sqrt{\frac{d}{n}}\Big).$$
For the training error estimator $\hat{R}^A_n := \frac{1}{n} \sum_{i=1}^n \mathbb{1}_{\{\hat{C}^{A\text{-LDA}}_n(X_i) \neq Y_i\}}$, Vapnik-Chervonenkis theory (Devroye et al., 1996) gives
$$E|\epsilon^A_n| \leq 8 \sqrt{\frac{d \log n + 3 \log 2 + 1}{2n}} \quad \text{and} \quad E|\epsilon_n| \leq 8 \sqrt{\frac{d \log n + 3 \log 2 + \log B_2 + 1}{2n}}.$$
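
The rule (1) written out in R for a single projection A (a sketch: equal priors, with Omega-hat taken here as the inverse of the pooled sample covariance of the projected training data):

  lda_rule <- function(A, X, y, newX) {
    XA <- X %*% t(A); newXA <- newX %*% t(A)
    mu1 <- colMeans(XA[y == 1, , drop = FALSE])
    mu0 <- colMeans(XA[y == 0, , drop = FALSE])
    # pooled within-class covariance of the projected data and its inverse
    S <- ((sum(y == 1) - 1) * cov(XA[y == 1, , drop = FALSE]) +
          (sum(y == 0) - 1) * cov(XA[y == 0, , drop = FALSE])) / (length(y) - 2)
    Omega <- solve(S)
    scores <- sweep(newXA, 2, (mu1 + mu0) / 2) %*% Omega %*% (mu1 - mu0)
    ifelse(drop(scores) >= 0, 1L, 0L)     # classify 1 when the discriminant score is >= 0
  }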

Base classifier: k-nearest neighbours

Classify $Ax$ according to a majority vote over its $k$ nearest neighbours among the projected training data. Under regularity assumptions (Hall et al., 2008),
$$E\{R(\hat{C}^{A\text{-knn}}_n)\} - R(C^{A\text{-Bayes}}) = O\big(1/k + (k/n)^{4/d}\big).$$
For the leave-one-out estimator $\hat{R}^A_n := \frac{1}{n} \sum_{i=1}^n \mathbb{1}_{\{\hat{C}^{A\text{-knn}}_{n,-i}(X_i) \neq Y_i\}}$, we have that
$$E|\epsilon^A_n| \leq \Big(\frac{1}{n} + \frac{24\,k^{1/2}}{(2\pi)^{1/2}\,n}\Big)^{1/2} \leq \Big(\frac{1}{n}\Big)^{1/2} + \frac{2\sqrt{6}\,k^{1/4}}{(2\pi)^{1/4}\,n^{1/2}}$$
and
$$E|\epsilon_n| \leq 3\{4(3^d + 1)\}^{1/3} \Big\{\frac{k(1 + \log B_2 + 3 \log 2)}{n}\Big\}^{1/3}.$$
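
A sketch of the leave-one-out estimate for this base classifier using the class package, whose knn.cv function returns leave-one-out predictions on the training set:

  library(class)  # for knn() and knn.cv()

  knn_loo_error <- function(A, X, y, k = 5) {
    # leave-one-out error estimate of the k-nearest-neighbour base classifier
    XA <- X %*% t(A)
    loo <- knn.cv(train = XA, cl = factor(y), k = k)
    mean(loo != factor(y))
  }

  # classifying new points with the same base classifier:
  # knn(train = X %*% t(A), test = newX %*% t(A), cl = factor(y), k = 5)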

Simulation comparison

- RP-LDA_d, RP-QDA_d, RP-knn_d: the random projection ensemble classifier with the LDA, QDA or knn base classifier, B_1 = 500, B_2 = 50
- LDA, QDA and knn: base classifiers applied in the original space
- RF: random forests (ntree = 1000) (Breiman, 2001)
- SVM: support vector machines (Cortes and Vapnik, 1995)
- GP: Gaussian process classifiers (Williams and Barber, 1998)
- PenLDA: penalised linear discriminant analysis (Witten and Tibshirani, 2011)
- NSC: nearest shrunken centroids (Tibshirani et al., 2003)
- PenLog: l1-penalised logistic regression
- SDR5: dimension reduction based on the SDR assumption
- OTE: optimal tree ensembles

Simulations 1

$X \mid \{Y = 1\} \sim N_p(R\mu_1, R\Sigma_1 R^T)$ and $X \mid \{Y = 0\} \sim N_p(R\mu_0, R\Sigma_0 R^T)$, where $R$ is a $p \times p$ rotation matrix, $p = 100$, and $\mu_1, \mu_0, \Sigma_1, \Sigma_0$ are such that (A.3) holds with $d = 3$. Bayes risk = 4.09.

Risks (%), for training sample size n:

n            50     100    200    500    1000
RP-LDA_5     8.2    6.2    5.6    5.2    5.1
RP-QDA_5     8.1    6.2    5.6    5.1    5.1
RP-knn_5     9.0    6.5    5.7    5.3    5.1
LDA          N/A    N/A    14.3   7.7    6.3
knn          12.8   10.0   8.8    7.7    7.3
RF           11.1   7.9    6.8    6.2    6.1
Radial SVM   24.0   8.8    6.4    5.6    5.5
Radial GP    14.1   7.5    5.8    5.2    5.1
PenLDA       11.1   8.0    6.7    5.8    5.8
NSC          12.6   9.1    7.3    5.9    5.8
OTE          18.3   15.5   12.4   10.1   9.2

Simulations 2

$X \mid \{Y = r\} \sim \frac{1}{2} N_p(\mu_r, I_{p \times p}) + \frac{1}{2} N_p(-\mu_r, I_{p \times p})$, $p = 100$, with $\mu_1 = (2, 2, 0, \ldots, 0)^T$ and $\mu_0 = (2, -2, 0, \ldots, 0)^T$. (A.3) holds with $d = 2$. Bayes risk = 4.45.

Risks (%), for training sample size n:

n            50     100    200    500    1000
RP-QDA_5     39.3   31.5   22.3   12.3   8.8
RP-knn_5     43.7   35.2   25.3   14.3   10.2
QDA          N/A    N/A    N/A    35.2   27.4
knn          34.7   28.7   23.7   18.2   15.3
RF           49.7   49.4   48.3   46.2   43.3
Radial SVM   49.8   49.8   50.2   49.8   48.7
Linear SVM   50.0   50.0   49.6   50.2   50.0
Radial GP    48.2   46.0   42.8   35.4   26.6
PenLDA       50.0   50.0   49.8   50.1   50.1
NSC          49.7   49.8   49.7   50.1   49.6
SDR5-knn     N/A    N/A    32.2   26.3   21.8
OTE          48.5   44.4   34.7   16.9   9.6

Simulations 3: (A.3) does not hold

$P_1$ is the distribution of $p$ independent Laplace components; $P_0 = N_p(\mu, I_{p \times p})$, with $\mu = \frac{1}{p}(1, \ldots, 1, 0, \ldots, 0)^T$, where $\mu$ has $p/2$ non-zero components. $p = 100$, Bayes risk = 1.01.

Risks (%), for training sample size n:

n            50     100    200    500    1000
RP-QDA_5     9.7    6.1    4.2    3.7    3.3
RP-knn_5     21.3   11.0   6.9    4.9    3.8
QDA          N/A    N/A    N/A    35.5   15.3
knn          49.9   49.8   49.8   49.9   49.7
RF           44.8   36.5   23.4   10.7   7.7
Radial SVM   39.3   8.9    4.7    3.8    3.4
Radial GP    48.9   47.5   45.5   41.1   36.2
PenLDA       46.0   45.0   44.5   43.3   41.7
NSC          47.5   46.6   46.0   44.3   42.3
PenLog       48.8   47.3   46.4   44.1   42.2
SDR5-knn     N/A    N/A    46.1   40.1   36.3
OTE          46.7   42.0   30.6   16.8   11.4

Real data 1: Ionosphere

The Ionosphere dataset from the UCI machine learning repository consists of $p = 32$ high-frequency antenna measurements for 351 observations. Observations are classified as good (class 1) or bad (class 0), depending on whether or not there is evidence for free electrons in the ionosphere.

Risks (%), for training sample size n:

n            50     100    200
RP-QDA_5     8.1    6.2    5.2
RP-knn_5     13.1   7.4    5.4
QDA          N/A    N/A    14.1
knn          21.8   18.1   16.4
RF           10.5   7.5    6.5
Radial SVM   27.7   12.6   6.7
Linear SVM   19.4   17.1   15.5
Radial GP    22.3   17.8   14.5
PenLDA       21.2   19.8   19.8
NSC          22.6   19.1   17.5
SDR5-knn     30.6   17.5   10.1
OTE          14.4   9.8    7.3

Real data 2: Musk

The Musk dataset from the UCI machine learning repository consists of 1016 musk (class 1) and 5581 non-musk (class 0) molecules. The task is to classify a molecule based on $p = 166$ shape measurements.

Risks (%), for training sample size n:

n            100    200    500
RP-QDA_5     12.1   9.9    8.6
RP-knn_5     11.8   9.7    8.0
knn          14.7   11.8   8.2
RF           13.2   10.7   7.6
Linear SVM   13.9   10.4   7.4
Radial GP    14.9   14.1   11.1
PenLDA       27.7   27.1   27.0
NSC          15.3   15.2   15.2
SDR5-knn     N/A    24.1   9.8
OTE          13.9   11.0   8.1

Extensions 1: sample splitting

We could split the training sample into $\mathcal{T}_{n,1}$ and $\mathcal{T}_{n,2}$, where $|\mathcal{T}_{n,1}| = n^{(1)}$ and $|\mathcal{T}_{n,2}| = n^{(2)}$. Then use
$$\hat{R}^A_{n^{(1)}, n^{(2)}} := \frac{1}{n^{(2)}} \sum_{(X_i, Y_i) \in \mathcal{T}_{n,2}} \mathbb{1}_{\{\hat{C}^A_{n^{(1)}, \mathcal{T}_{n,1}}(X_i) \neq Y_i\}}$$
to estimate the test errors. In this case,
$$E\big(|\epsilon^A_n| \,\big|\, \mathcal{T}_{n,1}\big) = E\big(\big|R(\hat{C}^A_{n^{(1)}}) - \hat{R}^A_{n^{(1)}, n^{(2)}}\big| \,\big|\, \mathcal{T}_{n,1}\big) \leq \Big(\frac{1 + \log 2}{2 n^{(2)}}\Big)^{1/2},$$
$$E\big(|\epsilon_n| \,\big|\, \mathcal{T}_{n,1}\big) = E\big(\big|R(\hat{C}^{A_1}_{n^{(1)}}) - \hat{R}^{A_1}_{n^{(1)}, n^{(2)}}\big| \,\big|\, \mathcal{T}_{n,1}\big) \leq \Big(\frac{1 + \log 2 + \log B_2}{2 n^{(2)}}\Big)^{1/2}.$$
The bounds hold for any choice of base classifier, and are sharper than those earlier, but we have reduced the effective sample size.
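
A sketch of the split-sample estimate in R, with idx1 (our name) indexing $\mathcal{T}_{n,1}$ and LDA again standing in for the base classifier:

  split_error <- function(A, X, y, idx1) {
    # train the base classifier on T_{n,1} (rows idx1), estimate its risk on T_{n,2}
    XA <- X %*% t(A)
    fit <- lda(XA[idx1, , drop = FALSE], grouping = y[idx1])
    pred <- predict(fit, XA[-idx1, , drop = FALSE])$class
    mean(pred != y[-idx1])
  }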

Extensions 2: multi-class problems

What if there are $K > 2$ classes? One option: let
$$\hat{\nu}^{B_1}_{n,r}(x) := \frac{1}{B_1} \sum_{b_1 = 1}^{B_1} \mathbb{1}_{\{\hat{C}^{A_{b_1}}_n(x) = r\}} \quad \text{for } r = 1, \ldots, K.$$
Given $\alpha_1, \ldots, \alpha_K > 0$ with $\sum_{r=1}^K \alpha_r = 1$, we can then define
$$\hat{C}^{\mathrm{RP}}_n(x) := \operatorname*{sargmax}_{r = 1, \ldots, K} \{\alpha_r \, \hat{\nu}^{B_1}_{n,r}(x)\}.$$
The choice of $\alpha_1, \ldots, \alpha_K$ is analogous to the choice of $\alpha$ in the case $K = 2$. Task: minimise the test error of the infinite-projection random projection ensemble classifier, as before. (A sketch of the vote is given below.)
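
A sketch of the multi-class vote in R, assuming votes is an n_test x K matrix of the vote fractions and alpha a vector of K positive weights (names are ours); which.max returns the smallest maximising index, matching sargmax.

  multiclass_rule <- function(votes, alpha) {
    weighted <- sweep(votes, 2, alpha, FUN = "*")   # alpha_r * nu_hat_{n,r}(x)
    apply(weighted, 1, which.max)                   # smallest maximiser per test point
  }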

Extensions 3: ultrahigh-dimensional data

What about alternative types of projection? Consider the ultrahigh-dimensional setting, say $p$ in the thousands. In this case the space of projections is very large. We could use axis-aligned projections instead, i.e. subsample the features (work in progress with Ben Li at Harvard). There are then only $\binom{p}{d} \leq p^d / d!$ choices for the projections, and if $d$ is small it may be feasible to carry out an exhaustive search. We lose equivariance to orthogonal transformations. (A sketch of such a projection is below.)
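
Axis-aligned projections amount to subsampling $d$ of the $p$ coordinates; a sketch of drawing one such projection in R:

  axis_proj <- function(p, d) {
    # an axis-aligned "projection": a d x p matrix that selects d coordinates at random
    A <- matrix(0, d, p)
    A[cbind(seq_len(d), sample(p, d))] <- 1
    A
  }

  # with d small, all choose(p, d) axis-aligned projections could even be enumerated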

The R package RPEnsemble is available from CRAN. Thank you!

Bibliography I

Bickel, P. J. and Levina, E. (2004). Some theory for Fisher's linear discriminant function, naive Bayes, and some alternatives when there are more variables than observations. Bernoulli, 10, 989-1010.

Breiman, L. (2001). Random Forests. Machine Learning, 45, 5-32.

Cannings, T. I. and Samworth, R. J. (2015). Random projection ensemble classification. arXiv e-prints, arXiv:1504.04595.

Cook, R. D. (1998). Regression Graphics: Ideas for Studying Regressions through Graphics. Wiley, New York.

Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273-297.

Dasgupta, S. and Gupta, A. (2002). An elementary proof of the Johnson-Lindenstrauss Lemma. Random Struct. Alg., 22, 60-65.

Devroye, L., Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer, New York.

Bibliography II

Hall, P., Park, B. U. and Samworth, R. J. (2008). Choice of neighbour order in nearest-neighbour classification. Ann. Statist., 36, 2135-2152.

Lee, K.-Y., Li, B. and Chiaromonte, F. (2013). A general theory for nonlinear sufficient dimension reduction: formulation and estimation. Ann. Statist., 41, 221-249.

Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G. (2003). Class prediction by nearest shrunken centroids, with applications to DNA microarrays. Statist. Science, 18, 104-117.

Williams, C. K. I. and Barber, D. (1998). Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 1342-1351.

Witten, D. M. and Tibshirani, R. (2011). Penalized classification using Fisher's linear discriminant. J. Roy. Statist. Soc., Ser. B, 73, 753-772.