AI Memo: Permutation Tests for Classification


Sayan Mukherjee
Whitehead/MIT Center for Genome Research and
Center for Biological and Computational Learning,
Massachusetts Institute of Technology, Cambridge, MA 02139, USA

Polina Golland
Computer Science and Artificial Intelligence Laboratory,
Massachusetts Institute of Technology, Cambridge, MA 02139, USA

Dmitry Panchenko
Department of Mathematics,
Massachusetts Institute of Technology, Cambridge, MA 02145, USA

Abstract

We introduce and explore an approach to estimating statistical significance of classification accuracy, which is particularly useful in scientific applications of machine learning where the high dimensionality of the data and the small number of training examples render most standard convergence bounds too loose to yield a meaningful guarantee of the generalization ability of the classifier. Instead, we estimate statistical significance of the observed classification accuracy, or the likelihood of observing such accuracy by chance due to spurious correlations of the high-dimensional data patterns with the class labels in the given training set. We adopt permutation testing, a non-parametric technique previously developed in classical statistics for hypothesis testing in the generative setting (i.e., comparing two probability distributions). We demonstrate the method on real examples from neuroimaging studies and DNA microarray analysis and suggest a theoretical analysis of the procedure that relates the asymptotic behavior of the test to the existing convergence bounds.

Keywords: Classification, Permutation testing, Statistical significance, Non-parametric tests, Rademacher processes.

Acknowledgments

The authors would like to thank Pablo Tamayo, Vladimir Koltchinskii, Jill Mesirov, Todd Golub and Bruce Fischl for useful discussions. Sayan Mukherjee was supported by a SLOAN/DOE grant. Polina Golland was supported in part by an NSF IIS grant and an Athinoula A. Martinos Center for Biomedical Imaging collaborative research grant. Dmitry Panchenko was partially supported by an AT&T Buchsbaum grant. The authors would like to acknowledge Dr. M. Spiridon and Dr. N. Kanwisher for providing the fMRI data, Dr. R. Buckner for providing the cortical thickness data, and Dr. D. Greve and Dr. B. Fischl for help with registration and feature extraction in the MRI experiments discussed in this paper. Dr. Kanwisher would like to acknowledge EY 3455 and MH 5950 grants. Dr. Buckner would like to acknowledge the assistance of the Washington University ADRC, the James S. McDonnell Foundation, the Alzheimer's Association, and NIA grants AG05682 and AG0399. Dr. B. Fischl would like to acknowledge NIH grant R0 RR6594-0A. The Human Brain Project/Neuroinformatics research is funded jointly by the NINDS, the NIMH and the NCI (R0-NS3958). Further support was provided by the NCRR (P4-RR4075 and R0-RR3609).

1. Introduction

Many scientific studies involve detection and characterization of predictive patterns in high dimensional measurements, which can often be reduced to training a binary classifier or a regression model. We will use two examples of such applications to illustrate the techniques in this paper: medical image studies and gene expression analysis. Image-based clinical studies of brain disorders attempt to detect neuroanatomical changes induced by diseases, as well as predict development of the disease. The goals of gene expression analysis include classification of the tissue morphology and prediction of the treatment outcome from DNA microarray data. In both fields, training a classifier to reliably label new examples into the healthy population or one of the disease sub-groups can help to improve screening and early diagnostics, as well as provide an insight into the nature of the disorder. Both imaging data and DNA microarray measurements are characterized by high dimensionality of the input space (thousands of features) and small datasets (tens of independent examples), typical of many biological applications.

For statistical learning to be useful in such scientific applications, it must provide an estimate of the significance of the detected differences, i.e., a guarantee of how well the results of learning describe the entire population. Machine learning theory offers two types of such guarantees, both based on estimating the expected error of the resulting classifier function. The first approach is to estimate the test error on a hold-out set or by applying a cross-validation procedure, such as a jackknife or bootstrap (Efron, 1982), which, in conjunction with a variance-based convergence bound, provides a confidence interval (i.e., the interval that with high probability contains the true value) for the expected error. Small sample sizes render this approach ineffective, as the variance of the error on a hold-out set is often too large to provide a meaningful estimate of how close we are to the true error. Applying variance-based bounds to the cross-validation error estimates produces misleading results, as the cross-validation iterations are not independent, causing us to underestimate the variance. An alternative, but equally fruitless, approach is to use bounds on the convergence of the empirical training error to the expected test error. For very high dimensional data, the training error is always zero and the bounds are extremely loose. Thus we are often forced to conclude that, although the classification accuracy looks promising, we need significantly more data (several orders of magnitude more than currently available) before the standard bounds provide a meaningful confidence interval for the expected error (Guyon et al., 1998). Since collecting data in these applications is often expensive in terms of time and resources, it is desirable to obtain a quantitative indicator of how robust the observed classification accuracy is, long before the asymptotic bounds apply, for training set sizes in the tens to hundreds of examples. This is particularly relevant since the empirical results often indicate that efficient learning is possible with far fewer examples than predicted by the convergence bounds.

In this paper, we demonstrate how a weaker guarantee, that of statistical significance, can still be provided for classification results on a small number of examples. We consider the question of the differences between the two classes, as measured by the classifier performance on a test set, in the framework of hypothesis testing traditionally used in statistics.

Intuitively, statistical significance is a measure of how likely we were to obtain the observed test accuracy by chance, only because the training algorithm identified some pattern in the high-dimensional data that happened to correlate with the class labels as an artifact of a small data set size.

Our goal is to reject the null hypothesis, namely that a given family of classifiers cannot learn to accurately predict the labels of a test point given a training set. The empirical estimate of the expected test error can serve as a test statistic that measures how different the two classes are with respect to the family of classifiers we use in training. We adopt permutation tests (Good, 1994, Kendall, 1945), a non-parametric technique developed in the statistics literature, to estimate the probability distribution of the statistic under the null hypothesis from the available data. We employ the permutation procedure to construct an empirical estimate of the error distribution and then use this estimate to assign a significance to the observed classification error.

In the next section, we provide the necessary background on hypothesis testing and permutation tests. In Section 3, we extend the permutation procedure to estimate statistical significance of classification results. Section 4 demonstrates the application of the test on two detailed examples, reports results for several studies from the fields of brain imaging and gene expression analysis, and offers practical guidelines for applying the procedure. In Section 5, we suggest a theoretical analysis of the procedure that leads to convergence bounds governed by quantities similar to those that control standard empirical error bounds, closing with a brief discussion of open questions.

2. Background: Hypothesis Testing and Permutations

In two-class comparison hypothesis testing, the differences between two data distributions are measured using a dataset statistic T : (R^n × {-1, 1})^ℓ → R, such that for a given dataset {(x_k, y_k)}_{k=1}^ℓ, where the x_k ∈ R^n are observations and the y_k ∈ {-1, 1} are the corresponding class labels, T(x_1, y_1, ..., x_ℓ, y_ℓ) is a measure of the similarity of the subsets {x_k | y_k = 1} and {x_k | y_k = -1}. The null hypothesis typically assumes that the two conditional probability distributions are identical, p(x|y = 1) = p(x|y = -1), or equivalently, that the data and the labels are independent, p(x, y) = p(x)p(y). The goal of the hypothesis test is to reject the null hypothesis at a certain level of significance α, which sets the maximal acceptable probability of a false positive (declaring that the classes are different when the null hypothesis is true). For any value of the statistic, the corresponding p-value is the highest level of significance at which the null hypothesis can still be rejected.

In classical statistics, the data are often assumed to be one-dimensional (n = 1). For example, in the two-sample t-test, the data in the two classes are assumed to be generated by one-dimensional Gaussian distributions of equal variance. The null hypothesis is that the distributions have the same mean. The distribution of the t-statistic, the difference between the sample means normalized by the standard error, under the null hypothesis is Student's distribution (Sachs, 1984, Student, 1908). If the integral of Student's distribution over the values higher than the observed t-statistic is smaller than the desired significance level α, we reject the null hypothesis in favor of the alternative hypothesis that the means of the two distributions are different.

In order to perform hypothesis testing, we need to know the probability distribution of the selected statistic under the null hypothesis.
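As a concrete reference point for the parametric approach, here is a minimal sketch of the two-sample t-test just described (our illustration, not part of the memo); it assumes equal-variance Gaussian classes and uses SciPy only for the Student-t tail probability:

```python
# Two-sample t-test under the equal-variance Gaussian assumption.
# Illustrative sketch; names are ours.
import numpy as np
from scipy import stats

def two_sample_t_test(a, b):
    """t-statistic and two-sided p-value for H0: equal means."""
    na, nb = len(a), len(b)
    # Pooled variance estimate under the equal-variance assumption.
    sp2 = ((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    t = (np.mean(a) - np.mean(b)) / np.sqrt(sp2 * (1.0 / na + 1.0 / nb))
    # Integral of Student's distribution over values beyond |t|.
    p = 2.0 * stats.t.sf(abs(t), df=na + nb - 2)
    return t, p
```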

In general, the distribution for a particular statistic cannot be computed without making strong assumptions on the generative model of the data. Non-parametric techniques, such as permutation tests, can be of great value if the distribution of the data is unknown. Permutation tests were first introduced as a non-parametric alternative to the one-dimensional t-test and have been used to replace Student's distribution when the normality of the data distribution could not be assured. Here, we describe the general formulation of the test as applied to any statistic T.

Suppose we have chosen an appropriate statistic T and the acceptable significance level α. Let Π_ℓ be the set of all permutations of the indices 1, ..., ℓ, where ℓ is the number of independent examples in the dataset. The permutation test procedure that consists of M iterations is defined as follows:

- Repeat M times (with index m = 1, ..., M): sample a permutation π^m from a uniform distribution over Π_ℓ, and compute the statistic value for this permutation of labels, t_m = T(x_1, y_{π^m_1}, ..., x_ℓ, y_{π^m_ℓ}).

- Construct an empirical cumulative distribution ˆP(T ≤ t) = (1/M) Σ_{m=1}^M Θ(t − t_m), where Θ is a step-function (Θ(x) = 1 if x ≥ 0; 0 otherwise).

- Compute the statistic value for the actual labels, t_0 = T(x_1, y_1, ..., x_ℓ, y_ℓ), and its corresponding p-value p_0 under the empirical distribution ˆP. If p_0 ≤ α, reject the null hypothesis.

The procedure computes an empirical estimate of the cumulative distribution of the statistic T under the null hypothesis and uses it for hypothesis testing. Since the null hypothesis assumes that the two classes are indistinguishable with respect to the selected statistic, all the training datasets generated through permutations are equally likely to be observed under the null hypothesis, yielding the estimates of the statistic for the empirical distribution. An equivalent result is obtained if we choose to permute the data, rather than the labels. Ideally, we would like to use the entire set of permutations Π_ℓ to construct the empirical distribution ˆP, but this might not be feasible for computational reasons. Instead, we resort to sampling from Π_ℓ. It is therefore important to select the number of sampling iterations M to be large enough to guarantee accurate estimation. One solution is to monitor the rate of change in the estimated distribution and stop when the changes are below an acceptable threshold. The precision of the test therefore depends on the number of iterations and the number of distinct labelings of the training set.

To better understand the difference between the parametric approach of the t-test and permutation testing, observe that statistical significance does not provide an absolute measure of how robust the observed differences are, but is rather contingent upon certain assumptions about the data distribution in each class p(x|y) being true. The t-test assumes that the distribution of data in each class is Gaussian, while the permutation test assumes that the data distribution is adequately represented by the sample data. Neither estimates how well the sample data describe the general population, which is one of the fundamental questions in statistical learning theory and is outside the scope of this paper.
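The procedure above translates directly into code. The following sketch (ours, with illustrative names) implements the M-iteration test for an arbitrary statistic T, under the convention that larger values of T indicate a greater difference between the classes:

```python
# Generic permutation test for a statistic T(X, y), following the
# procedure in the text. Illustrative sketch; the convention here is
# that larger T means the two classes look more different.
import numpy as np

def permutation_test(T, X, y, M=10000, alpha=0.05, seed=None):
    rng = np.random.default_rng(seed)
    t0 = T(X, y)                                   # statistic on the actual labels
    t_perm = np.array([T(X, rng.permutation(y)) for _ in range(M)])
    # p-value of t0 under the empirical null distribution:
    # the fraction of permutations that look at least as different.
    p0 = np.mean(t_perm >= t0)
    return t0, p0, bool(p0 <= alpha)               # True means: reject the null
```

When the statistic is an error estimate, so that small values indicate a difference between the classes, the comparison flips to t_perm <= t0; that is the form used for classification in the next section.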

3. Permutation Tests for Classification

Permutation tests can be used to assess statistical significance of the classifier and its performance, using an empirical estimate of the test error as a statistic that measures dissimilarity between two populations. Depending on the amount of the available data, the test error can be estimated on a large hold-out set or using cross-validation in every iteration of the permutation procedure. The null hypothesis assumes that the relationship between the data and the labels cannot be learned reliably by the family of classifiers used in the training step. The alternative hypothesis is that we can train a classifier with small expected error. We use permutations to estimate the empirical cumulative distribution of the classifier error under the null hypothesis. For any value of the estimated error e, the appropriate p-value is ˆP(e) (i.e., the probability of observing classification error lower than e under the null hypothesis). We can reject the null hypothesis and declare that the classifier learned the (probabilistic) relationship between the data and the labels with a risk of being wrong with probability of at most ˆP(e).

The permutation procedure is equivalent to sampling new training sets from a probability distribution that can be factored into the original marginals for the data and the labels, p(x, y) = p(x)p(y), because it leaves the data unchanged and maintains the relative frequencies of the labels through permutation, while destroying any relationship between the data and the labels. The test evaluates the likelihood of the test error estimate on the original dataset relative to the empirical distribution of the error on data sets sampled from p(x)p(y). To underscore the point made in the previous section, the test uses only the available data examples to evaluate the complexity of the classification problem, and is therefore valid only to the extent that the available dataset represents the true distribution p(x, y). Unlike standard convergence bounds, such as bounds based on VC-dimension, the empirical probability distribution of the classification error under the null hypothesis says nothing about how well the estimated error rate will generalize. Thus permutation tests provide a weaker guarantee than the convergence bounds, but they can still be useful in testing whether the observed classification results are likely to be obtained by chance, due to a spurious pattern correlated with the labels in the given small data set. Note that the estimated empirical distribution also depends on the classifier family and the training algorithm used to construct the classifier function. It essentially estimates the expressive power of the classifier family with respect to the training dataset. The variance of the empirical distribution ˆP constructed by the permutation test is a function of two quantities: the randomness due to the small sample size and the difficulty of the classification problem. The variance decreases with more samples and easier classification problems.

4. Application of the Test

We use permutation testing in our work to assess the significance of the observed classification accuracy before we conclude that the results obtained in the cross-validation procedure are robust, or decide that more data are needed before we can trust the detected pattern or trend in the biological data. In this section, we first demonstrate the procedure in detail on two different examples, a study of changes in the cortical thickness due to Alzheimer's disease using MRI scans for measurement, and a discrimination between two types of leukemia based on DNA microarray data.
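Before turning to the examples, here is the test of Section 3 in code, an illustrative sketch rather than the authors' implementation. The statistic is an estimate of the test error, here a single hold-out estimate per iteration for brevity; `train_fn(X, y)` is assumed to return a fitted classifier with a `predict` method (scikit-learn style). Following the text, only the training labels are permuted prior to training:

```python
# Permutation test with classifier test error as the statistic.
import numpy as np

def holdout_error(train_fn, X, y, n_train, rng, permute_train=False):
    idx = rng.permutation(len(y))
    train, test = idx[:n_train], idx[n_train:]
    # Under the null hypothesis, the training labels are permuted.
    y_train = rng.permutation(y[train]) if permute_train else y[train]
    clf = train_fn(X[train], y_train)
    return np.mean(clf.predict(X[test]) != y[test])

def classification_permutation_pvalue(train_fn, X, y, n_train, M=10000, seed=None):
    rng = np.random.default_rng(seed)
    e0 = holdout_error(train_fn, X, y, n_train, rng)         # error on true labels
    null_errors = np.array(
        [holdout_error(train_fn, X, y, n_train, rng, permute_train=True)
         for _ in range(M)])
    # p-value = empirical probability of observing error <= e0 under the null.
    return e0, np.mean(null_errors <= e0)
```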

Figure 1: Estimated test error (left) and statistical significance (middle) computed for different training set sizes N and test set sizes K, and empirical error distribution (right) constructed for N = 50 and different test set sizes K in the cortical thickness study. Filled circles on the right graph indicate the classifier performance on the true labels (K = 10: e = .30, p = .19; K = 20: e = .29, p = .08; K = 40: e = .29, p = .03).

These studies involve very different types of data, but both share a small sample size and high dimensionality of the original input space, rendering the convergence bounds extremely loose. We then report experimental results on more examples from real biological studies and offer practical guidelines on application of the test. In all experiments reported in this section, we used linear Support Vector Machines (Vapnik, 1998) to train a classifier, and jackknifing (i.e., sampling without replacement) for cross-validation. The number of cross-validation iterations was 1,000, and the number of permutation iterations was 10,000.

4.1 Detailed Examples

The first example compares the thickness of the cortex in 50 patients diagnosed with dementia of the Alzheimer type and 50 normal controls of matched age. The cortical sheet was automatically segmented from each MRI scan, followed by a registration step that brought the surfaces into correspondence by mapping them onto a unit sphere while minimizing distortions and then aligning the cortical folding patterns (Fischl et al., 1999, Fischl and Dale, 2000). The cortical thickness was densely sampled on a 1 mm grid at corresponding locations for all subjects, resulting in over 300,000 thickness measurements. The measurements in neighboring locations are highly correlated, as both the pattern of thickness and the pattern of its change are smooth over the surface of the cortex, leading us to believe that learning the differences between the two groups might be possible with a reasonable number of examples.

We start by studying the behavior of the estimated error and its statistical significance as a function of training set size and test set size, reported in Figure 1. Every point in the first two graphs is characterized by a corresponding training set size N and test set size K, drawn from the original dataset. In permutation testing, the labels of the training data are permuted prior to training. It is not surprising that increasing the number of training examples improves the robustness of classification, as exhibited by both the accuracy and the significance estimates.
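The error estimates in these figures come from jackknife cross-validation with a linear SVM. A sketch of such an estimator follows, using scikit-learn as our stand-in (the memo does not specify an implementation, and the class-stratified splits implied by the experiments, N/2 examples per class, are omitted here for brevity):

```python
# Jackknife (sampling without replacement) estimate of the test error
# of a linear SVM; illustrative sketch with scikit-learn.
import numpy as np
from sklearn.svm import LinearSVC

def jackknife_error(X, y, n_train, n_test, n_iter=1000, seed=None):
    rng = np.random.default_rng(seed)
    errors = np.empty(n_iter)
    for i in range(n_iter):
        idx = rng.permutation(len(y))
        train, test = idx[:n_train], idx[n_train:n_train + n_test]
        clf = LinearSVC().fit(X[train], y[train])
        errors[i] = np.mean(clf.predict(X[test]) != y[test])
    return errors.mean()
```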

Figure 2: Estimated test error and statistical significance for different training set sizes N for the cortical thickness study. Unlike the experiments in Figure 1, all of the examples unused in training were used to test the classifier. The p-values are shown on a logarithmic scale.

By examining the left graph, we conclude that at approximately N = 40, the accuracy of the classification saturates at 71% (e = .29). In testing for significance, we typically expect to work in this region of relatively slow change in the estimated test error. Increasing the number of independent examples on which we test the classifier in each iteration does not significantly affect the estimated classification error, but substantially improves the statistical significance of the same error value, as is to be expected: when we increase the test set size, a classifier trained on a random labeling of the training data is less likely to maintain the same level of testing accuracy. The right graph in Figure 1 illustrates this point for a particular training set size of N = 50 (well within the saturation range for the expected error estimates). It shows the empirical distribution ˆP(e) curves for the test set sizes K = 10, 20, 40. The filled circles represent classification performance on the true labels and the corresponding p-values. We note again that the three circles represent virtually the same accuracy, but substantially different p-values. For this training set size, if we set the significance threshold at α = .05, testing on K = 40 achieves statistical significance (p = .03, i.e., a 3% chance of observing better than 71% accuracy in cross-validation if the data and the labels in this problem are truly independent).

In the experiments described above, most iterations of cross-validation and permutation testing did not use all available examples (i.e., N + K was less than the total number of examples). These were constructed to illustrate the behavior of the test for the same test set size but varying training set sizes. In practice, one should use all available data for testing, as it will only improve the significance. Figure 2 shows the estimated classification error and the corresponding p-values that were estimated using all of the examples left out in the training step for testing the classifier. And while the error graph looks very similar to that in Figure 1, the behavior of the significance estimates is quite different. The p-values originally decrease as the training set size increases, but after a certain point, they start growing. Two conflicting factors control the p-value estimates as the number of training examples increases: the improved accuracy of the classification, which causes the point of interest to slide to the left and, as a result, down on the empirical cdf curve, and the decreasing number of test examples, which causes the empirical cdf curve to become more shallow. In our experience, the minimum of the p-value often lies in the region of slow change in error values.
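The sweep behind Figures 1 and 2 can be reproduced with the sketches above. A hypothetical driver loop, assuming X and y are already loaded:

```python
# Hypothetical sweep over training set sizes, testing on all remaining
# examples in each iteration; reuses the sketches defined above.
from sklearn.svm import LinearSVC

train_svm = lambda X, y: LinearSVC().fit(X, y)

# Stop early enough to leave at least 10 test examples per iteration.
for n_train in range(20, len(y) - 10, 10):
    n_test = len(y) - n_train            # use all left-out examples for testing
    err = jackknife_error(X, y, n_train, n_test)
    _, p = classification_permutation_pvalue(train_svm, X, y, n_train)
    print(f"N={n_train:3d}  error={err:.2f}  p-value={p:.4f}")
```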

Figure 3: Estimated test error and statistical significance for different training set sizes N for the leukemia morphology study. All of the examples unused in training were used to test the classifier. The p-values are shown on a logarithmic scale.

In this particular example, the minimum p-value, p = 0.02, is achieved at N = 50. It is smaller than the numbers reported earlier because in this experiment we use more test examples (K = 50) in each iteration of permutation testing.

Before proceeding to the summary reporting of the experimental results for other studies, we demonstrate the technique on one more detailed example. The objective of the underlying experiment was to accurately discriminate acute myeloid leukemia (AML) from acute lymphoblastic leukemia (ALL). The data set contains 48 samples of AML and 25 samples of ALL. Expression levels of 7,129 genes and expressed sequence tags (ESTs) were measured via an oligonucleotide microarray for each sample (Golub et al., 1999, Slonim et al., 2000). Figure 3 shows the results for this study. The small number of available examples forces us to work with a much smaller range of training and test set sizes, but, in contrast to the previous study, we achieve statistical significance with substantially fewer data points. The cross-validation error reduces rapidly as we increase the number of training examples, dropping below 5% at N = 26 training examples. The p-values also decrease very quickly as we increase the number of training examples, achieving a minimum of .001 at N = 28 training examples. Similarly to the previous example, the most statistically significant result lies in the range of relatively slow error change.

As both examples demonstrate, it is not necessarily best to use most of the data for training. Additional data might not improve the testing accuracy in a substantial way and could be much more useful in obtaining more accurate estimates of the error and the p-value. In fact, a commonly used leave-one-out procedure that only uses one example for testing in each cross-validation iteration procures extremely noisy estimates. In all the experiments reported in the next section, we use at least 10 examples for testing in each iteration.

4.2 Experimental Results

We report experimental results for nine different studies involving classification of biological data derived from imaging of two different types, as well as microarray measurements. Table 1 summarizes the number of features, which determines the input space dimensionality, the number of examples in each class for each study, and the results of statistical analysis. The studies are loosely sorted in ascending strength of the results (decreasing error and p-values).

Table 1: Summary of the experimental data and results. The first four columns describe the study: the name of the experiment, the number of features and the number of examples in each class (pos and neg). Columns 5-7 report the number of training examples from each class N/2, the jackknife error e and the p-value p for the smallest training set size that achieves significance at α = .05. Columns 8-10 report the results for the training set size that yields the smallest p-value. The last column contains the lowest error e_min observed in the experiment. The rows, in order, are: Lymphoma outcome, Brain cancer outcome, Breast cancer outcome, MRI, Medullo vs. glioma, AML vs. ALL, fMRI full, fMRI reduced, and Tumor vs. normal. (The numeric entries of the table did not survive the transcription.)

Imaging Data. In addition to the MRI study of cortical thickness described above ("MRI" in Table 1), we include the results of two fMRI (functional MRI) experiments that compare the patterns of brain activations in response to different visual stimuli in a single subject. We present the results of comparing activations in response to face images to those induced by house images, as these categories are believed to have special representation in the cortex. The feature extraction step was similar to that of the MRI study, treating the activation signal as the measurement sampled over the cortical surface. The first experiment ("fMRI full") used the entire cortical surface for feature extraction, while the second experiment ("fMRI reduced") considered only the visually active region of the cortex. The mask for the visually active voxels was obtained using a separate visual task. The goal of using the mask was to test if removing irrelevant voxels from consideration improves the classification performance.

Microarray Data. In addition to the leukemia morphology study ("AML vs. ALL"), we include the results of five other expression datasets where either the morphology or the treatment outcome was predicted (Mukherjee et al., 2003). Three studies involved predicting treatment outcome: survival of lymphoma patients, survival of patients with brain cancer, and metastasis of breast cancers. Three other studies involved predicting morphological properties of the tissue: medulloblastomas (medullo) vs. glioblastomas (glio) (glioblastomas are tumors of glial cells in the brain, while medulloblastomas are tumors of neural tissue), AML vs. ALL, and tumor tissue vs. normal.

For each experiment, we analyzed the behavior of the cross-validation error and the statistical significance similarly to the detailed examples presented earlier. Table 1 summarizes the results for three important events: the first time the p-value plot crosses the .05 threshold (thus achieving statistical significance at that level), the point of the lowest p-value, and the lowest classification error.

For the first two events, we report the number of training examples, the error and the p-value. The lowest error is typically achieved for the largest training set size and is shown here mainly for comparison with the other two error values reported. We observe that the error values corresponding to the lowest p-values are very close to the smallest errors reported, implying that the p-values bottom out in the region of relatively slow change in the error estimates.

The first three studies in the table did not produce statistically significant results. In the first two studies, the rest of the indicators are extremely weak (in (Mukherjee et al., 2003), a gene selection procedure led to greater accuracy and smaller p-values). In the third study, predicting whether a breast tumor will metastasize, the error stabilizes fairly early (training on 30 examples leads to 39% error, while the smallest error observed is 38%, obtained by training on 58 examples), but the p-values are too high. This leads us to believe that more data could help establish the significance of the result, similarly to the MRI study. Unfortunately, the error in these two studies is too high to be useful in a diagnostic application. The rest of the studies achieve relatively low errors and p-values. The last study in the table, predicting cancerous tissue from normal tissue, yields a highly significant result (p < 10^-6), with the error staying very stable from training on 90 examples to training on 170 examples. We also observe that the significance threshold of .05 is probably too high for these experiments, as the corresponding error values are significantly higher than the ones reported for the smallest p-value. A more stringent threshold of .01 would cause most experiments to produce more realistic estimates of the cross-validation error attainable on the given data set.

4.3 Summary and Heuristics

In this section, we show how permutation testing in conjunction with cross-validation can be used to analyze the quality of classification results on scientific data. Here, we provide a list of practical lessons learned from the empirical studies that we hope will be useful to readers applying this methodology.

1. Interpreting the p-value. Two factors affect statistical significance: the separation between the classes and the amount of data we have to support it. We can achieve low p-values in a situation where the two classes are very far apart and we have a few data points from each group, or when they are much closer, but we have substantially more data. The p-value by itself does not indicate which of these two situations is true. However, looking at both the p-value and the classifier accuracy gives us an indication of how easy it is to separate the classes.

2. Size of the hold-out set. In our experience, small test set sizes in cross-validation and permutation testing lead to noisy estimates of the classification accuracy and the significance. We typically limit the training set size to allow at least 10 test examples in each iteration of the resampling procedures.

3. Size of the training set. Since we are interested in robust estimates of the test error, one should utilize a sufficient number of training examples to be working in the region where the cross-validation error does not vary much.

Cross-checking this with the region of the lowest p-values provides another useful indication of the acceptable training set size. In our experiments with artificial data (not shown here), the number of training examples at which the p-values stop decreasing dramatically remains almost constant as we add more data. This could mean that acquiring more experimental data will not change the optimal training set size substantially, only lower the resulting p-values.

4. Performance on future samples. As we pointed out in earlier sections, the permutation test does not provide a guarantee on how close the observed classification error is to the true expected error. Thus we might achieve statistical significance, but still have an inaccurate estimate of the expected error. This brings us back to the main assumption made by most non-parametric procedures, namely that the available examples capture the relevant properties of the underlying probability distribution with adequate precision. The variance-based convergence bounds and the VC-style generalization bounds are still the only two ways known to us to obtain a distribution-free guarantee on the expected error of the classifier. The next section relates permutation testing to the generalization bounds and offers a theoretical justification for the procedure.

5. A Theoretical Motivation for the Permutation Procedure

The purpose of the permutation test is to show that it is very unlikely that a permuted dataset will achieve the same cross-validation error as the cross-validation error on the unpermuted dataset. In this paper, we restrict our proofs to leave-one-out cross-validation, with future plans to extend this work to the general case. Ideally, we would like to show that for a reasonably constrained family of classifiers (for example, a VC class) the following two facts hold: the leave-one-out error of the classifier is close to the error rate of one of the best classifiers in the class, and the leave-one-out error on the permuted dataset is close to the smaller of the prior probabilities of the two classes. In Section 5.1, we prove that the training error on the permuted data concentrates around the smaller prior probability. In Section 5.2, we combine previously known results to state that the leave-one-out error of the classifier is close to the error rate of one of the best classifiers in the class. In Section 5.3, we extend a result from (Mukherjee et al., 2002) about the leave-one-out procedure to relate the training error on permuted data to the leave-one-out error. We close with some comments relating the theoretical results to the empirical permutation procedure.

5.1 The Permutation Problem

We are given a class of concepts C and an unknown target concept c_0. Without loss of generality we will assume that P(c_0) ≤ 1/2 and that the empty concept ∅ ∈ C. For a permutation π of the training data, the smallest training error on the permuted set is

  e_ℓ(π) = min_{c∈C} P_ℓ(c △ c_0)    (1)
         = min_{c∈C} (1/ℓ) Σ_{i=1}^ℓ [ I(x_i ∈ c, x_{π_i} ∉ c_0) + I(x_i ∉ c, x_{π_i} ∈ c_0) ],

where x_i is the i-th sample and x_{π_i} is the i-th sample after permutation.
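The reduction of (1) to the maximization problem stated in the next paragraph rests on a one-line indicator identity; spelling it out (an expository addition, in our notation):

```latex
% For any concept c and target c_0, writing I(.) for the indicator:
\begin{align*}
I(x_i \in c,\, x_{\pi_i} \notin c_0) + I(x_i \notin c,\, x_{\pi_i} \in c_0)
 &= I(x_i \in c)\bigl(1 - I(x_{\pi_i} \in c_0)\bigr)
  + \bigl(1 - I(x_i \in c)\bigr) I(x_{\pi_i} \in c_0) \\
 &= I(x_{\pi_i} \in c_0) - I(x_i \in c)\bigl(2\, I(x_{\pi_i} \in c_0) - 1\bigr).
\end{align*}
% Averaging over i = 1, ..., l turns the first term into the empirical
% measure P_l(x in c_0); minimizing over c in C then yields the
% maximization problem below.
```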

For a fixed concept c ∈ C, the average error is

  E P_ℓ(c △ c_0) = P(c)(1 − P(c_0)) + (1 − P(c)) P(c_0),

and it is clear that, since P(c_0) ≤ 1/2, taking c = ∅ will minimize the average error, which in that case will be equal to P(c_0). Thus, our goal will be to show that under some complexity assumptions on the class C the smallest training error e_ℓ(π) is close to P(c_0). Minimizing (1) is equivalent to the following maximization problem,

  max_{c∈C} (1/ℓ) Σ_{i=1}^ℓ I(x_i ∈ c)(2 I(x_{π_i} ∈ c_0) − 1),

since

  e_ℓ(π) = P_ℓ(x ∈ c_0) − max_{c∈C} (1/ℓ) Σ_{i=1}^ℓ I(x_i ∈ c)(2 I(x_{π_i} ∈ c_0) − 1),

and P_ℓ(x ∈ c_0) is the empirical measure of the random concept. We would like to show that e_ℓ(π) is close to the random error P(x ∈ c_0) and give rates of convergence. We will do this by bounding the process

  G_ℓ(π) = sup_{c∈C} (1/ℓ) Σ_{i=1}^ℓ I(x_i ∈ c)(2 I(x_{π_i} ∈ c_0) − 1)

and using the fact that, by Chernoff's inequality, P_ℓ(x ∈ c_0) is close to P(x ∈ c_0):

  IP( |P_ℓ(x ∈ c_0) − P(x ∈ c_0)| ≥ √( 2 P(c_0)(1 − P(c_0)) t / ℓ ) ) ≤ e^{−t}.    (2)

Theorem 1. If the concept class C has VC dimension V, then with probability 1 − K e^{−t/K},

  G_ℓ(π) ≤ K min( √( V log ℓ / ℓ ), V log ℓ / ( ℓ (1 − 2P(c_0))² ) ) + K √( t / ℓ ).

Remark. The second term in the above bound comes from the application of Chernoff's inequality, similar to (2), and thus has a one-dimensional nature, in the sense that it does not depend on the complexity (VC dimension) of the class C. An interesting property of this result is that if P(c_0) < 1/2, then the first term, which depends on the VC dimension V, will be of order V log ℓ / ℓ, which, ignoring the one-dimensional terms, gives the zero-error type rate of convergence of e_ℓ(π) to P_ℓ(x ∈ c_0).

Combining this theorem and equation (2), we can state that with probability 1 − K e^{−t/K},

  | e_ℓ(π) − P(x ∈ c_0) | ≤ K min( √( V log ℓ / ℓ ), V log ℓ / ( ℓ (1 − 2P(c_0))² ) ) + K √( t / ℓ ).

In order to prove Theorem 1, we require several preliminary results. We first prove the following useful lemma.

Lemma 1. It is possible to construct on the same probability space two i.i.d. Bernoulli sequences ε = (ε_1, ..., ε_n) and ε' = (ε'_1, ..., ε'_n) such that ε' is independent of ε_1 + ... + ε_n and

  (1/n) Σ_{i=1}^n |ε_i − ε'_i| = | (1/n) Σ_{i=1}^n ε_i − (1/n) Σ_{i=1}^n ε'_i |.

Proof. For k = 0, ..., n, let us consider the following probability space E_k. Each element w of E_k consists of two coordinates, w = (ε, π). The first coordinate ε = (ε_1, ..., ε_n) has the marginal distribution of an i.i.d. Bernoulli sequence. The second coordinate π implements the following randomization. Given the first coordinate ε, consider the set I(ε) = {i : ε_i = 1} and denote its cardinality m = card I(ε). If m ≥ k, then π picks a subset I(π, ε) of I(ε) with cardinality k uniformly; if m < k, then π picks a subset I(π, ε) of the complement I^c(ε) with cardinality n − k, also uniformly. On this probability space E_k, we construct a sequence ε' = ε'(ε, π) in the following way. If k ≤ m = card I(ε), then we set ε'_i = 1 if i ∈ I(π, ε) and ε'_i = −1 otherwise. If k > m = card I(ε), then we set ε'_i = −1 if i ∈ I(π, ε) and ε'_i = 1 otherwise. Next, we consider the space E = ∪_{k ≤ n} E_k with probability measure

  P(A) = Σ_{k=0}^n B(n, p, k) P(A | E_k),  where B(n, p, k) = (n choose k) p^k (1 − p)^{n−k}.

On this probability space the sequences ε and ε' satisfy the conditions of the lemma. First of all, the number of indices i with ε'_i = 1 has the binomial distribution, since by construction P(card I(ε') = k) = P(E_k) = B(n, p, k). Also, by construction, the distribution of ε' is invariant under permutations of the coordinates. This clearly implies that ε' is i.i.d. Bernoulli. Also, ε' is obviously independent of ε_1 + ... + ε_n. Finally, by construction, (1/n) Σ |ε_i − ε'_i| = |(1/n) Σ ε_i − (1/n) Σ ε'_i|.

Definition 1. Let u > 0 and let C be a set of concepts. Every finite set of concepts c_1, ..., c_n with the property that for all c ∈ C there is a c_j such that

  ( (1/ℓ) Σ_{i=1}^ℓ (c_j(x_i) − c(x_i))² )^{1/2} ≤ u

is called a u-cover with respect to L_2(x_1^ℓ). The covering number N(C, u, {x_1, ..., x_ℓ}) is the smallest number n for which the above holds.

Definition 2. The uniform metric entropy is log N(C, u), where N(C, u) is the smallest integer for which, for all (x_1, ..., x_ℓ), N(C, u, {x_1, ..., x_ℓ}) ≤ N(C, u).

Theorem 2. The following holds with probability greater than 1 − 4 e^{−t/4}:

  G_ℓ(π) ≤ 2 √( 2 t P(c_0)(1 − P(c_0)) / ℓ ) + sup_{r ≥ 0} [ (K/√ℓ) ∫_0^{√µ_r} √( log N(C, u) ) du − µ_r (1 − 2P(c_0)) / 2 + √( µ_r (t + 2 log(r + 1)) / ℓ ) ],

where µ_r = 2^{−r} and log N(C, u) is the uniform metric entropy of the class C.

Proof. The process can be rewritten as

  G_ℓ(π) = sup_{c∈C} (1/ℓ) Σ_{i=1}^ℓ I(x_i ∈ c) ε_i,

where ε_i = 2 I(x_{π_i} ∈ c_0) − 1 = ±1 are Bernoulli random variables with P(ε_i = 1) = P(c_0). Due to the permutations, the random variables (ε_i) depend on (x_i) only through the cardinality of {x_i ∈ c_0}. By Lemma 1 we can construct a random Bernoulli sequence (ε'_i) that is independent of x_1, ..., x_ℓ and for which

  G_ℓ(π) ≤ sup_{c∈C} (1/ℓ) Σ_{i=1}^ℓ I(x_i ∈ c) ε'_i + (1/ℓ) Σ_{i=1}^ℓ |ε_i − ε'_i|.

We first control the second term,

  (1/ℓ) Σ |ε_i − ε'_i| = | (1/ℓ) Σ ε_i − (1/ℓ) Σ ε'_i | ≤ | (1/ℓ) Σ ε_i − (2P(c_0) − 1) | + | (1/ℓ) Σ ε'_i − (2P(c_0) − 1) |;

then, using Chernoff's inequality twice, we get with probability 1 − 2 e^{−t}

  (1/ℓ) Σ |ε_i − ε'_i| ≤ 2 √( 2 t P(c_0)(1 − P(c_0)) / ℓ ).

We block concepts in C into levels

  C_r = { c ∈ C : (1/ℓ) Σ_{i=1}^ℓ I(x_i ∈ c) ∈ (2^{−r−1}, 2^{−r}] }

and denote µ_r = 2^{−r}. We define the processes

  R(r) = sup_{c∈C_r} (1/ℓ) Σ_{i=1}^ℓ I(x_i ∈ c) ε'_i

and obtain

  sup_{c∈C} (1/ℓ) Σ_{i=1}^ℓ I(x_i ∈ c) ε'_i ≤ sup_r R(r).

By Talagrand's inequality on the cube (Talagrand, 1995), we have for each level r

  IP_{ε'}( R(r) ≥ E_{ε'} sup_{c∈C_r} (1/ℓ) Σ I(x_i ∈ c) ε'_i + √( µ_r t / ℓ ) ) ≤ e^{−t/4}.

Note that for this inequality to hold, the random variables (ε'_i) need only be independent; they do not need to be symmetric. This bound is conditioned on a given {x_i}, and therefore

  IP( R(r) ≥ E_{ε'} sup_{c∈C_r} (1/ℓ) Σ I(x_i ∈ c) ε'_i + √( µ_r t / ℓ ) ) ≤ e^{−t/4}.

If, for each r, we set t → t + 2 log(r + 1), we can write

  IP( ∃ r : R(r) ≥ E_{ε'} sup_{c∈C_r} (1/ℓ) Σ I(x_i ∈ c) ε'_i + √( µ_r (t + 2 log(r + 1)) / ℓ ) ) ≤ Σ_{r=0}^∞ (r + 1)^{−2} e^{−t/4} ≤ 2 e^{−t/4}.

Using standard symmetrization techniques, we add and subtract an independent sequence ε''_i such that E ε''_i = E ε'_i = 2P(c_0) − 1:

  E_{ε'} sup_{c∈C_r} (1/ℓ) Σ I(x_i ∈ c) ε'_i
    = E_{ε'} sup_{c∈C_r} [ (1/ℓ) Σ I(x_i ∈ c) ε'_i − (1/ℓ) Σ I(x_i ∈ c) E ε''_i + (1/ℓ) Σ I(x_i ∈ c)(2P(c_0) − 1) ]
    ≤ E_{ε',ε''} sup_{c∈C_r} (1/ℓ) Σ I(x_i ∈ c)(ε'_i − ε''_i) − (1 − 2P(c_0)) inf_{c∈C_r} (1/ℓ) Σ I(x_i ∈ c)
    ≤ 2 E_η sup_{c∈C_r} (1/ℓ) Σ I(x_i ∈ c) η_i − µ_r (1 − 2P(c_0)) / 2,

where η_i = (ε'_i − ε''_i)/2 takes values in {−1, 0, 1} with P(η_i = 1) = P(η_i = −1). One can easily check that the random variables η_i satisfy the inequality

  IP( Σ_{i=1}^ℓ η_i a_i > t ) ≤ exp( − t² / (2 Σ_{i=1}^ℓ a_i²) ),

which is the only prerequisite for the chaining method. Thus, one can write Dudley's entropy integral bound (van der Vaart and Wellner, 1996) as

  E_η sup_{c∈C_r} (1/ℓ) Σ I(x_i ∈ c) η_i ≤ (K/√ℓ) ∫_0^{√µ_r} √( log N(C, u) ) du.

We finally get

  IP( ∃ r : R(r) ≥ (K/√ℓ) ∫_0^{√µ_r} √( log N(C, u) ) du − µ_r (1 − 2P(c_0)) / 2 + √( µ_r (t + 2 log(r + 1)) / ℓ ) ) ≤ 2 e^{−t/4}.

This completes the proof of Theorem 2.

Proof of Theorem 1. For a class with VC dimension V, it is well known (van der Vaart and Wellner, 1996) that

  ∫_0^{√µ_r} √( log N(C, u) ) du ≤ K √( V µ_r log(2/µ_r) ).

Since without loss of generality we only need to consider µ_r > 1/ℓ, it remains to apply Theorem 2 and notice that

  sup_r [ K √( V µ_r log(2/µ_r) / ℓ ) − µ_r (1 − 2P(c_0)) / 2 ] ≤ K min( √( V log ℓ / ℓ ), V log ℓ / ( ℓ (1 − 2P(c_0))² ) ).

All other terms that do not depend on the VC dimension V can be combined to give K √( t / ℓ ).

5.2 Convergence of the Leave-One-Out Error to the Best in the Class

Now we turn our attention to the cross-validation error and show that the leave-one-out error of the classifier is close to the error rate of one of the best classifiers in the class. Given a training set generated by an unknown concept c_0 and a class of concepts C, the training step selects a concept ĉ from this class via empirical risk minimization,

  ĉ ∈ arg min_{c∈C} P_ℓ(c △ c_0),

where P_ℓ(c △ c_0) is the empirical symmetric difference observed on the points in the dataset S = {(x_i, y_i)}. This error is called the training error. The best classifiers in the class are ones for which

  c_opt ∈ arg min_{c∈C} P(c △ c_0).

It is well known that if C has VC dimension V, then with probability 1 − e^{−t} (Vapnik, 1998)

  P(ĉ △ c_0) − P(c_opt △ c_0) ≤ K √( V log ℓ / ℓ ),    (3)

so the expected error of the empirical minimizer approaches the error rate of the optimal classifiers in the class. We now relate the leave-one-out error to the expected error of the empirical minimizer. The classifier ĉ_{S^i} is the empirical minimizer for the dataset with the i-th point left out, and we write the leave-one-out error as

  (1/ℓ) Σ_{i=1}^ℓ I( x_i ∈ ĉ_{S^i} △ c_0 ).

Theorem 4.2 in (Kearns and Ron, 1999) states that for empirical risk minimization on a VC class, with probability 1 − δ,

  | (1/ℓ) Σ_{i=1}^ℓ I( x_i ∈ ĉ_{S^i} △ c_0 ) − P(ĉ △ c_0) | ≤ (K/δ) √( V log ℓ / ℓ ).

The above bound, coupled with the bound in equation (3), ensures that with high probability the leave-one-out error approaches the error rate of one of the best classifiers in the class.

5.3 Relating the Training Error to the Leave-One-Out Error

In the two previous subsections we demonstrated that the leave-one-out error approaches the error rate of one of the best classifiers in the class and that the training error on the permuted data concentrates around the smaller prior probability. However, in our statistical test we use the cross-validation error of the permuted data, rather than the training error of the permuted data, to build the empirical distribution function. So we have to relate the leave-one-out procedure to the training error for permuted data. In the case of empirical minimization on VC classes, the training error can be related to the leave-one-out error by the following bound (Kearns and Ron, 1999, Mukherjee et al., 2002) (see also Appendix A):

  E_S | (1/ℓ) Σ_{i=1}^ℓ I( x_i ∈ ĉ_{S^i} △ c_0 ) − P_ℓ(ĉ △ c_0) | ≤ Θ( √( V log ℓ / ℓ ) ),

where E_S is the expectation over datasets S, V is the VC dimension, and ĉ_{S^i} is the empirical minimizer for the dataset with the i-th point left out. This bound is of the same order as the deviation between the empirical and expected errors of the empirical minimizer:

  E_S | P_ℓ(ĉ △ c_0) − P(ĉ △ c_0) | ≤ Θ( √( V log ℓ / ℓ ) ).

This inequality implies that, on average, the leave-one-out error is not a significantly better estimate of the test error than the training error. In the case of the permutation procedure we show that a similar result holds.

Theorem 3. If the family of classifiers has VC dimension V, then

  E_{S,π} | (1/ℓ) Σ_{i=1}^ℓ I( x_{π_i} ∈ ĉ_{S^i,π} △ c_0 ) − P(x ∈ c_0) | ≤ K √( V log ℓ / ℓ ),

where E_{S,π} is the expectation over datasets S and permutations π, ĉ_{S^i,π} is the empirical minimizer of the permuted dataset with the i-th point left out, and x_{π_i} is the i-th point after the permutation. The proof of this theorem is in Appendix A. From Section 5.1 we have that

  E_{S,π} | P_ℓ(ĉ_π △ c_0) − P(x ∈ c_0) | ≤ K √( V log ℓ / ℓ ).

Therefore, one can conclude that for the permutation procedure, on average, the leave-one-out error is not a significantly better estimate of the random error than the training error.

5.4 Comments on the Theoretical Results

The theoretical results in this section were meant to give an analysis of, and motivation for, the permutation tests. The bounds derived are not meant to replace the empirical permutation procedure. Similarly to VC-style generalization bounds, they would require amounts of data far beyond the range of samples realistically attainable in real experiments to obtain practically useful deviations (of the order of a few percent), which is precisely the motivation for the empirical permutation procedure. Moreover, we state that the leave-one-out procedure is not a significantly better estimate of the error than the training error. This statement must be taken with a grain of salt, since it was derived via a worst-case analysis. In practice, the leave-one-out estimator is almost always a better estimate of the test error than the training error and, for empirical risk minimization, the leave-one-out error is never smaller than the training error (Mukherjee et al., 2002):

  (1/ℓ) Σ_{i=1}^ℓ I( x_i ∈ ĉ_{S^i} △ c_0 ) ≥ (1/ℓ) Σ_{i=1}^ℓ I( x_i ∈ ĉ_S △ c_0 ).

Also, it is important to note that the relation we state between the leave-one-out error and the training error is in expectation. Ideally, we would like to relate these quantities in probability with similar rates. It is an open question whether this holds.

6. Conclusion and Open Problems

This paper describes and explores an approach to estimating statistical significance of a classifier given a small sample size, based on permutation testing. The following is a list of open problems related to this methodology:

1. Size of the training/test set. We provide a heuristic to select the size of the training and the hold-out sets. A more rigorous formulation of this problem might suggest a more principled methodology for setting the training set size. This problem is clearly an example of the ubiquitous bias-variance tradeoff dilemma.

2. Leave-one-out error and training error. In the theoretical motivation, we relate the leave-one-out error to the training error in expectation. The theoretical motivation would be much stronger if this relation were made in probability. A careful experimental and theoretical analysis of the relation between the training error and the leave-n-out error for these types of permutation procedures would be of interest.

3. Feature selection. Both in neuroimaging studies and in DNA microarray analysis, finding the features which most accurately classify the data is very important. Permutation procedures similar to the one described in this paper have been used to address this problem (Golub et al., 1999, Slonim et al., 2000, Nichols and Holmes, 2001). The analysis of permutation procedures for selecting discriminative features seems to be more difficult than the analysis of the permutation procedure for classification. It would be very interesting to extend the type of analysis presented here to the feature selection problem.

4. Multi-class classification. Extending the methodology and the theoretical motivation to the multi-class problem has not been done.

To conclude, we hope other researchers in the community will find the technique useful in assessing statistical significance of observed results when the data are high dimensional and are not necessarily generated by a known distribution.

Appendix A. Proof of Theorem 3

In this appendix, we prove Theorem 3 from Section 5.3. We first define some terms. A concept will be designated c_0. A dataset S is made up of points {z_i}, where z = (x, y). When the i-th point is left out of S we have the set S^i. The empirical minimizer on S is ĉ_S and the empirical minimizer on S^i is ĉ_{S^i}. The empirical error on the set S is P_ℓ(ĉ_S △ c_0), and the empirical error on the set S^i is P_{ℓ-1}(ĉ_{S^i} △ c_0). If we perform empirical risk minimization on a VC class, then with high probability (1 − e^{−t/K})

  | P(ĉ_S △ c_0) − P_ℓ(ĉ_S △ c_0) | ≤ K √( V log ℓ / ℓ ).    (4)

Similarly, with high probability

  | P(ĉ_{S^i} △ c_0) − P_{ℓ-1}(ĉ_{S^i} △ c_0) | ≤ K √( V log(ℓ−1) / (ℓ−1) )   and   | P_ℓ(ĉ_S △ c_0) − P_{ℓ-1}(ĉ_{S^i} △ c_0) | ≤ K/ℓ.

We turn the probability statement into an expectation:

  E_S | P(ĉ_{S^i} △ c_0) − P_ℓ(ĉ_S △ c_0) | ≤ K √( V log ℓ / ℓ ).    (5)

One can check that the expectation over S of the leave-one-out error is equal to the expectation over S of the expected error of ĉ_{S^i} (Mukherjee et al., 2002, Bousquet and Elisseeff, 2002):

  E_S [ (1/ℓ) Σ_{i=1}^ℓ I( x_i ∈ ĉ_{S^i} △ c_0 ) ] = E_S [ P( ĉ_{S^i} △ c_0 ) ].

Combining the above equality with inequalities (4) and (5) gives us the following bounds:

  E_S | (1/ℓ) Σ_{i=1}^ℓ I( x_i ∈ ĉ_{S^i} △ c_0 ) − P_ℓ(ĉ_S △ c_0) | ≤ K √( V log ℓ / ℓ ),
  E_S | (1/ℓ) Σ_{i=1}^ℓ I( x_i ∈ ĉ_{S^i} △ c_0 ) − P(ĉ_S △ c_0) | ≤ K √( V log ℓ / ℓ ).

So for the non-permuted data, the leave-one-out error is not significantly closer to the test error, in expectation, than the training error. We want to show something similar for the case where the concept c_0 is random and the empirical minimizer is constructed on a dataset with labels randomly permuted. Let us denote by ĉ_{S^i,π} the empirical minimizer of the permuted dataset with the i-th point left out. Its leave-one-out error is

  (1/ℓ) Σ_{i=1}^ℓ I( x_{π_i} ∈ ĉ_{S^i,π} △ c_0 ).

If we can show that

  E_{S,π} [ (1/ℓ) Σ_{i=1}^ℓ I( x_{π_i} ∈ ĉ_{S^i,π} △ c_0 ) ] = E_{S,π} [ P( ĉ_{S^i,π} △ c_0 ) ],

then we can use the same argument we used for the non-random case. We start by breaking up the expectations:

  E_{S,π} [ (1/ℓ) Σ_{i=1}^ℓ I( x_{π_i} ∈ ĉ_{S^i,π} △ c_0 ) ] = E_π E_S E_{z_i} I( x_{π_i} ∈ ĉ_{S^i,π} △ c_0 ) = E_π E_S E_{x_i} E_{y_i} I( ĉ_{S^i,π}(x_{π_i}) ≠ y_i ).

The second equality holds because for a random concept p(x, y) = p(y)p(x). One can easily check that the following hold:

  E_{y_i} I( ĉ_{S^i,π}(x_{π_i}) ≠ y_i ) = min( P(y = 1), P(y = -1) ) = P(x ∈ c_0)

and

  E_π E_S E_{x_i} min( P(y = 1), P(y = -1) ) = P(x ∈ c_0).

Similarly, it holds that

  E_{S,π} P( ĉ_{S^i,π} △ c_0 ) = min( P(y = 1), P(y = -1) ) = P(x ∈ c_0).

References

O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499-526, 2002.

B. Efron. The Jackknife, the Bootstrap, and Other Resampling Plans. SIAM, Philadelphia, PA, 1982.

B. Fischl and A.M. Dale. Measuring the thickness of the human cerebral cortex from magnetic resonance images. PNAS, 97:11050-11055, 2000.

B. Fischl, M.I. Sereno, R.B.H. Tootell, and A.M. Dale. High-resolution intersubject averaging and a coordinate system for the cortical surface. Human Brain Mapping, 8:272-284, 1999.

T.R. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286:531-537, 1999.

P. Good. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer-Verlag, 1994.

I. Guyon, J. Makhoul, R. Schwartz, and V. Vapnik. What size test set gives good error estimates? IEEE Transactions on Pattern Analysis and Machine Intelligence, 20:52-64, 1998.

M. Kearns and D. Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11:1427-1453, 1999.

M.G. Kendall. The treatment of ties in ranking problems. Biometrika, 33:239-251, 1945.

S. Mukherjee, P. Niyogi, T. Poggio, and R. Rifkin. Statistical learning: Stability is necessary and sufficient for consistency of empirical risk minimization. AI Memo, Massachusetts Institute of Technology, 2002.

S. Mukherjee, P. Tamayo, S. Rogers, R. Rifkin, A. Engle, C. Campbell, T.R. Golub, and J.P. Mesirov. Estimating dataset size requirements for classifying DNA microarray data. Journal of Computational Biology, 10(2):119-142, 2003.

T.E. Nichols and A.P. Holmes. Nonparametric permutation tests for functional neuroimaging: A primer with examples. Human Brain Mapping, 15:1-25, 2001.

L. Sachs. Applied Statistics: A Handbook of Techniques. Springer-Verlag, 1984.

D. Slonim, P. Tamayo, J.P. Mesirov, T.R. Golub, and E. Lander. Class prediction and discovery using gene expression data. In Proceedings of the Fourth Annual Conference on Computational Molecular Biology (RECOMB), pages 263-272, 2000.

Student. The probable error of a mean. Biometrika, 6:1-25, 1908.

M. Talagrand. Concentration of measure and isoperimetric inequalities in product spaces. Publications Mathématiques de l'I.H.E.S., 81:73-205, 1995.

A. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer-Verlag, 1996.

V.N. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.


6.434J/16.391J Statistics for Engineers and Scientists May 4 MIT, Spring 2006 Handout #17. Solution 7 6.434J/16.391J Statistics for Engineers and Scientists May 4 MIT, Spring 2006 Handout #17 Soution 7 Probem 1: Generating Random Variabes Each part of this probem requires impementation in MATLAB. For the

More information

Separation of Variables and a Spherical Shell with Surface Charge

Separation of Variables and a Spherical Shell with Surface Charge Separation of Variabes and a Spherica She with Surface Charge In cass we worked out the eectrostatic potentia due to a spherica she of radius R with a surface charge density σθ = σ cos θ. This cacuation

More information

ASummaryofGaussianProcesses Coryn A.L. Bailer-Jones

ASummaryofGaussianProcesses Coryn A.L. Bailer-Jones ASummaryofGaussianProcesses Coryn A.L. Baier-Jones Cavendish Laboratory University of Cambridge caj@mrao.cam.ac.uk Introduction A genera prediction probem can be posed as foows. We consider that the variabe

More information

MATRIX CONDITIONING AND MINIMAX ESTIMATIO~ George Casella Biometrics Unit, Cornell University, Ithaca, N.Y. Abstract

MATRIX CONDITIONING AND MINIMAX ESTIMATIO~ George Casella Biometrics Unit, Cornell University, Ithaca, N.Y. Abstract MATRIX CONDITIONING AND MINIMAX ESTIMATIO~ George Casea Biometrics Unit, Corne University, Ithaca, N.Y. BU-732-Mf March 98 Abstract Most of the research concerning ridge regression methods has deat with

More information

Melodic contour estimation with B-spline models using a MDL criterion

Melodic contour estimation with B-spline models using a MDL criterion Meodic contour estimation with B-spine modes using a MDL criterion Damien Loive, Ney Barbot, Oivier Boeffard IRISA / University of Rennes 1 - ENSSAT 6 rue de Kerampont, B.P. 80518, F-305 Lannion Cedex

More information

Cryptanalysis of PKP: A New Approach

Cryptanalysis of PKP: A New Approach Cryptanaysis of PKP: A New Approach Éiane Jaumes and Antoine Joux DCSSI 18, rue du Dr. Zamenhoff F-92131 Issy-es-Mx Cedex France eiane.jaumes@wanadoo.fr Antoine.Joux@ens.fr Abstract. Quite recenty, in

More information

STA 216 Project: Spline Approach to Discrete Survival Analysis

STA 216 Project: Spline Approach to Discrete Survival Analysis : Spine Approach to Discrete Surviva Anaysis November 4, 005 1 Introduction Athough continuous surviva anaysis differs much from the discrete surviva anaysis, there is certain ink between the two modeing

More information

DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM

DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM MIKAEL NILSSON, MATTIAS DAHL AND INGVAR CLAESSON Bekinge Institute of Technoogy Department of Teecommunications and Signa Processing

More information

8 Digifl'.11 Cth:uits and devices

8 Digifl'.11 Cth:uits and devices 8 Digif'. Cth:uits and devices 8. Introduction In anaog eectronics, votage is a continuous variabe. This is usefu because most physica quantities we encounter are continuous: sound eves, ight intensity,

More information

A unified framework for Regularization Networks and Support Vector Machines. Theodoros Evgeniou, Massimiliano Pontil, Tomaso Poggio

A unified framework for Regularization Networks and Support Vector Machines. Theodoros Evgeniou, Massimiliano Pontil, Tomaso Poggio MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY and CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES A.I. Memo No. 1654 March23, 1999

More information

8 APPENDIX. E[m M] = (n S )(1 exp( exp(s min + c M))) (19) E[m M] n exp(s min + c M) (20) 8.1 EMPIRICAL EVALUATION OF SAMPLING

8 APPENDIX. E[m M] = (n S )(1 exp( exp(s min + c M))) (19) E[m M] n exp(s min + c M) (20) 8.1 EMPIRICAL EVALUATION OF SAMPLING 8 APPENDIX 8.1 EMPIRICAL EVALUATION OF SAMPLING We wish to evauate the empirica accuracy of our samping technique on concrete exampes. We do this in two ways. First, we can sort the eements by probabiity

More information

Testing for the Existence of Clusters

Testing for the Existence of Clusters Testing for the Existence of Custers Caudio Fuentes and George Casea University of Forida November 13, 2008 Abstract The detection and determination of custers has been of specia interest, among researchers

More information

The EM Algorithm applied to determining new limit points of Mahler measures

The EM Algorithm applied to determining new limit points of Mahler measures Contro and Cybernetics vo. 39 (2010) No. 4 The EM Agorithm appied to determining new imit points of Maher measures by Souad E Otmani, Georges Rhin and Jean-Marc Sac-Épée Université Pau Veraine-Metz, LMAM,

More information

Soft Clustering on Graphs

Soft Clustering on Graphs Soft Custering on Graphs Kai Yu 1, Shipeng Yu 2, Voker Tresp 1 1 Siemens AG, Corporate Technoogy 2 Institute for Computer Science, University of Munich kai.yu@siemens.com, voker.tresp@siemens.com spyu@dbs.informatik.uni-muenchen.de

More information

Related Topics Maxwell s equations, electrical eddy field, magnetic field of coils, coil, magnetic flux, induced voltage

Related Topics Maxwell s equations, electrical eddy field, magnetic field of coils, coil, magnetic flux, induced voltage Magnetic induction TEP Reated Topics Maxwe s equations, eectrica eddy fied, magnetic fied of cois, coi, magnetic fux, induced votage Principe A magnetic fied of variabe frequency and varying strength is

More information

Lecture Note 3: Stationary Iterative Methods

Lecture Note 3: Stationary Iterative Methods MATH 5330: Computationa Methods of Linear Agebra Lecture Note 3: Stationary Iterative Methods Xianyi Zeng Department of Mathematica Sciences, UTEP Stationary Iterative Methods The Gaussian eimination (or

More information

Mat 1501 lecture notes, penultimate installment

Mat 1501 lecture notes, penultimate installment Mat 1501 ecture notes, penutimate instament 1. bounded variation: functions of a singe variabe optiona) I beieve that we wi not actuay use the materia in this section the point is mainy to motivate the

More information

Data Mining Technology for Failure Prognostic of Avionics

Data Mining Technology for Failure Prognostic of Avionics IEEE Transactions on Aerospace and Eectronic Systems. Voume 38, #, pp.388-403, 00. Data Mining Technoogy for Faiure Prognostic of Avionics V.A. Skormin, Binghamton University, Binghamton, NY, 1390, USA

More information

Active Learning & Experimental Design

Active Learning & Experimental Design Active Learning & Experimenta Design Danie Ting Heaviy modified, of course, by Lye Ungar Origina Sides by Barbara Engehardt and Aex Shyr Lye Ungar, University of Pennsyvania Motivation u Data coection

More information

SVM: Terminology 1(6) SVM: Terminology 2(6)

SVM: Terminology 1(6) SVM: Terminology 2(6) Andrew Kusiak Inteigent Systems Laboratory 39 Seamans Center he University of Iowa Iowa City, IA 54-57 SVM he maxima margin cassifier is simiar to the perceptron: It aso assumes that the data points are

More information

Control Chart For Monitoring Nonparametric Profiles With Arbitrary Design

Control Chart For Monitoring Nonparametric Profiles With Arbitrary Design Contro Chart For Monitoring Nonparametric Profies With Arbitrary Design Peihua Qiu 1 and Changiang Zou 2 1 Schoo of Statistics, University of Minnesota, USA 2 LPMC and Department of Statistics, Nankai

More information

Two view learning: SVM-2K, Theory and Practice

Two view learning: SVM-2K, Theory and Practice Two view earning: SVM-2K, Theory and Practice Jason D.R. Farquhar jdrf99r@ecs.soton.ac.uk Hongying Meng hongying@cs.york.ac.uk David R. Hardoon drh@ecs.soton.ac.uk John Shawe-Tayor jst@ecs.soton.ac.uk

More information

Research of Data Fusion Method of Multi-Sensor Based on Correlation Coefficient of Confidence Distance

Research of Data Fusion Method of Multi-Sensor Based on Correlation Coefficient of Confidence Distance Send Orders for Reprints to reprints@benthamscience.ae 340 The Open Cybernetics & Systemics Journa, 015, 9, 340-344 Open Access Research of Data Fusion Method of Muti-Sensor Based on Correation Coefficient

More information

Distributed average consensus: Beyond the realm of linearity

Distributed average consensus: Beyond the realm of linearity Distributed average consensus: Beyond the ream of inearity Usman A. Khan, Soummya Kar, and José M. F. Moura Department of Eectrica and Computer Engineering Carnegie Meon University 5 Forbes Ave, Pittsburgh,

More information

A Statistical Framework for Real-time Event Detection in Power Systems

A Statistical Framework for Real-time Event Detection in Power Systems 1 A Statistica Framework for Rea-time Event Detection in Power Systems Noan Uhrich, Tim Christman, Phiip Swisher, and Xichen Jiang Abstract A quickest change detection (QCD) agorithm is appied to the probem

More information

Appendix A: MATLAB commands for neural networks

Appendix A: MATLAB commands for neural networks Appendix A: MATLAB commands for neura networks 132 Appendix A: MATLAB commands for neura networks p=importdata('pn.xs'); t=importdata('tn.xs'); [pn,meanp,stdp,tn,meant,stdt]=prestd(p,t); for m=1:10 net=newff(minmax(pn),[m,1],{'tansig','purein'},'trainm');

More information

Formulas for Angular-Momentum Barrier Factors Version II

Formulas for Angular-Momentum Barrier Factors Version II BNL PREPRINT BNL-QGS-06-101 brfactor1.tex Formuas for Anguar-Momentum Barrier Factors Version II S. U. Chung Physics Department, Brookhaven Nationa Laboratory, Upton, NY 11973 March 19, 2015 abstract A

More information

Two-sample inference for normal mean vectors based on monotone missing data

Two-sample inference for normal mean vectors based on monotone missing data Journa of Mutivariate Anaysis 97 (006 6 76 wwweseviercom/ocate/jmva Two-sampe inference for norma mean vectors based on monotone missing data Jianqi Yu a, K Krishnamoorthy a,, Maruthy K Pannaa b a Department

More information

Statistics for Applications. Chapter 7: Regression 1/43

Statistics for Applications. Chapter 7: Regression 1/43 Statistics for Appications Chapter 7: Regression 1/43 Heuristics of the inear regression (1) Consider a coud of i.i.d. random points (X i,y i ),i =1,...,n : 2/43 Heuristics of the inear regression (2)

More information

MATH 172: MOTIVATION FOR FOURIER SERIES: SEPARATION OF VARIABLES

MATH 172: MOTIVATION FOR FOURIER SERIES: SEPARATION OF VARIABLES MATH 172: MOTIVATION FOR FOURIER SERIES: SEPARATION OF VARIABLES Separation of variabes is a method to sove certain PDEs which have a warped product structure. First, on R n, a inear PDE of order m is

More information

Chemical Kinetics Part 2

Chemical Kinetics Part 2 Integrated Rate Laws Chemica Kinetics Part 2 The rate aw we have discussed thus far is the differentia rate aw. Let us consider the very simpe reaction: a A à products The differentia rate reates the rate

More information

Rate-Distortion Theory of Finite Point Processes

Rate-Distortion Theory of Finite Point Processes Rate-Distortion Theory of Finite Point Processes Günther Koiander, Dominic Schuhmacher, and Franz Hawatsch, Feow, IEEE Abstract We study the compression of data in the case where the usefu information

More information

Asymptotic Properties of a Generalized Cross Entropy Optimization Algorithm

Asymptotic Properties of a Generalized Cross Entropy Optimization Algorithm 1 Asymptotic Properties of a Generaized Cross Entropy Optimization Agorithm Zijun Wu, Michae Koonko, Institute for Appied Stochastics and Operations Research, Caustha Technica University Abstract The discrete

More information

General Certificate of Education Advanced Level Examination June 2010

General Certificate of Education Advanced Level Examination June 2010 Genera Certificate of Education Advanced Leve Examination June 2010 Human Bioogy HBI6T/P10/task Unit 6T A2 Investigative Skis Assignment Task Sheet The effect of temperature on the rate of photosynthesis

More information

Uniprocessor Feasibility of Sporadic Tasks with Constrained Deadlines is Strongly conp-complete

Uniprocessor Feasibility of Sporadic Tasks with Constrained Deadlines is Strongly conp-complete Uniprocessor Feasibiity of Sporadic Tasks with Constrained Deadines is Strongy conp-compete Pontus Ekberg and Wang Yi Uppsaa University, Sweden Emai: {pontus.ekberg yi}@it.uu.se Abstract Deciding the feasibiity

More information

Efficiently Generating Random Bits from Finite State Markov Chains

Efficiently Generating Random Bits from Finite State Markov Chains 1 Efficienty Generating Random Bits from Finite State Markov Chains Hongchao Zhou and Jehoshua Bruck, Feow, IEEE Abstract The probem of random number generation from an uncorreated random source (of unknown

More information

Approximated MLC shape matrix decomposition with interleaf collision constraint

Approximated MLC shape matrix decomposition with interleaf collision constraint Approximated MLC shape matrix decomposition with intereaf coision constraint Thomas Kainowski Antje Kiese Abstract Shape matrix decomposition is a subprobem in radiation therapy panning. A given fuence

More information

Inductive Bias: How to generalize on novel data. CS Inductive Bias 1

Inductive Bias: How to generalize on novel data. CS Inductive Bias 1 Inductive Bias: How to generaize on nove data CS 478 - Inductive Bias 1 Overfitting Noise vs. Exceptions CS 478 - Inductive Bias 2 Non-Linear Tasks Linear Regression wi not generaize we to the task beow

More information

Robust Sensitivity Analysis for Linear Programming with Ellipsoidal Perturbation

Robust Sensitivity Analysis for Linear Programming with Ellipsoidal Perturbation Robust Sensitivity Anaysis for Linear Programming with Eipsoida Perturbation Ruotian Gao and Wenxun Xing Department of Mathematica Sciences Tsinghua University, Beijing, China, 100084 September 27, 2017

More information

Improving the Accuracy of Boolean Tomography by Exploiting Path Congestion Degrees

Improving the Accuracy of Boolean Tomography by Exploiting Path Congestion Degrees Improving the Accuracy of Booean Tomography by Expoiting Path Congestion Degrees Zhiyong Zhang, Gaoei Fei, Fucai Yu, Guangmin Hu Schoo of Communication and Information Engineering, University of Eectronic

More information

Alberto Maydeu Olivares Instituto de Empresa Marketing Dept. C/Maria de Molina Madrid Spain

Alberto Maydeu Olivares Instituto de Empresa Marketing Dept. C/Maria de Molina Madrid Spain CORRECTIONS TO CLASSICAL PROCEDURES FOR ESTIMATING THURSTONE S CASE V MODEL FOR RANKING DATA Aberto Maydeu Oivares Instituto de Empresa Marketing Dept. C/Maria de Moina -5 28006 Madrid Spain Aberto.Maydeu@ie.edu

More information

XSAT of linear CNF formulas

XSAT of linear CNF formulas XSAT of inear CN formuas Bernd R. Schuh Dr. Bernd Schuh, D-50968 Kön, Germany; bernd.schuh@netcoogne.de eywords: compexity, XSAT, exact inear formua, -reguarity, -uniformity, NPcompeteness Abstract. Open

More information

BALANCING REGULAR MATRIX PENCILS

BALANCING REGULAR MATRIX PENCILS BALANCING REGULAR MATRIX PENCILS DAMIEN LEMONNIER AND PAUL VAN DOOREN Abstract. In this paper we present a new diagona baancing technique for reguar matrix pencis λb A, which aims at reducing the sensitivity

More information

4 Separation of Variables

4 Separation of Variables 4 Separation of Variabes In this chapter we describe a cassica technique for constructing forma soutions to inear boundary vaue probems. The soution of three cassica (paraboic, hyperboic and eiptic) PDE

More information

Stochastic Variational Inference with Gradient Linearization

Stochastic Variational Inference with Gradient Linearization Stochastic Variationa Inference with Gradient Linearization Suppementa Materia Tobias Pötz * Anne S Wannenwetsch Stefan Roth Department of Computer Science, TU Darmstadt Preface In this suppementa materia,

More information

On the Goal Value of a Boolean Function

On the Goal Value of a Boolean Function On the Goa Vaue of a Booean Function Eric Bach Dept. of CS University of Wisconsin 1210 W. Dayton St. Madison, WI 53706 Lisa Heerstein Dept of CSE NYU Schoo of Engineering 2 Metrotech Center, 10th Foor

More information

Statistical Inference, Econometric Analysis and Matrix Algebra

Statistical Inference, Econometric Analysis and Matrix Algebra Statistica Inference, Econometric Anaysis and Matrix Agebra Bernhard Schipp Water Krämer Editors Statistica Inference, Econometric Anaysis and Matrix Agebra Festschrift in Honour of Götz Trenker Physica-Verag

More information

First-Order Corrections to Gutzwiller s Trace Formula for Systems with Discrete Symmetries

First-Order Corrections to Gutzwiller s Trace Formula for Systems with Discrete Symmetries c 26 Noninear Phenomena in Compex Systems First-Order Corrections to Gutzwier s Trace Formua for Systems with Discrete Symmetries Hoger Cartarius, Jörg Main, and Günter Wunner Institut für Theoretische

More information

Approximated MLC shape matrix decomposition with interleaf collision constraint

Approximated MLC shape matrix decomposition with interleaf collision constraint Agorithmic Operations Research Vo.4 (29) 49 57 Approximated MLC shape matrix decomposition with intereaf coision constraint Antje Kiese and Thomas Kainowski Institut für Mathematik, Universität Rostock,

More information

Reichenbachian Common Cause Systems

Reichenbachian Common Cause Systems Reichenbachian Common Cause Systems G. Hofer-Szabó Department of Phiosophy Technica University of Budapest e-mai: gszabo@hps.ete.hu Mikós Rédei Department of History and Phiosophy of Science Eötvös University,

More information

arxiv: v1 [cs.lg] 31 Oct 2017

arxiv: v1 [cs.lg] 31 Oct 2017 ACCELERATED SPARSE SUBSPACE CLUSTERING Abofaz Hashemi and Haris Vikao Department of Eectrica and Computer Engineering, University of Texas at Austin, Austin, TX, USA arxiv:7.26v [cs.lg] 3 Oct 27 ABSTRACT

More information

Chemical Kinetics Part 2. Chapter 16

Chemical Kinetics Part 2. Chapter 16 Chemica Kinetics Part 2 Chapter 16 Integrated Rate Laws The rate aw we have discussed thus far is the differentia rate aw. Let us consider the very simpe reaction: a A à products The differentia rate reates

More information

Automobile Prices in Market Equilibrium. Berry, Pakes and Levinsohn

Automobile Prices in Market Equilibrium. Berry, Pakes and Levinsohn Automobie Prices in Market Equiibrium Berry, Pakes and Levinsohn Empirica Anaysis of demand and suppy in a differentiated products market: equiibrium in the U.S. automobie market. Oigopoistic Differentiated

More information

SUPPLEMENTARY MATERIAL TO INNOVATED SCALABLE EFFICIENT ESTIMATION IN ULTRA-LARGE GAUSSIAN GRAPHICAL MODELS

SUPPLEMENTARY MATERIAL TO INNOVATED SCALABLE EFFICIENT ESTIMATION IN ULTRA-LARGE GAUSSIAN GRAPHICAL MODELS ISEE 1 SUPPLEMENTARY MATERIAL TO INNOVATED SCALABLE EFFICIENT ESTIMATION IN ULTRA-LARGE GAUSSIAN GRAPHICAL MODELS By Yingying Fan and Jinchi Lv University of Southern Caifornia This Suppementary Materia

More information

Stochastic Complement Analysis of Multi-Server Threshold Queues. with Hysteresis. Abstract

Stochastic Complement Analysis of Multi-Server Threshold Queues. with Hysteresis. Abstract Stochastic Compement Anaysis of Muti-Server Threshod Queues with Hysteresis John C.S. Lui The Dept. of Computer Science & Engineering The Chinese University of Hong Kong Leana Goubchik Dept. of Computer

More information

C. Fourier Sine Series Overview

C. Fourier Sine Series Overview 12 PHILIP D. LOEWEN C. Fourier Sine Series Overview Let some constant > be given. The symboic form of the FSS Eigenvaue probem combines an ordinary differentia equation (ODE) on the interva (, ) with a

More information

NOISE-INDUCED STABILIZATION OF STOCHASTIC DIFFERENTIAL EQUATIONS

NOISE-INDUCED STABILIZATION OF STOCHASTIC DIFFERENTIAL EQUATIONS NOISE-INDUCED STABILIZATION OF STOCHASTIC DIFFERENTIAL EQUATIONS TONY ALLEN, EMILY GEBHARDT, AND ADAM KLUBALL 3 ADVISOR: DR. TIFFANY KOLBA 4 Abstract. The phenomenon of noise-induced stabiization occurs

More information

FORECASTING TELECOMMUNICATIONS DATA WITH AUTOREGRESSIVE INTEGRATED MOVING AVERAGE MODELS

FORECASTING TELECOMMUNICATIONS DATA WITH AUTOREGRESSIVE INTEGRATED MOVING AVERAGE MODELS FORECASTING TEECOMMUNICATIONS DATA WITH AUTOREGRESSIVE INTEGRATED MOVING AVERAGE MODES Niesh Subhash naawade a, Mrs. Meenakshi Pawar b a SVERI's Coege of Engineering, Pandharpur. nieshsubhash15@gmai.com

More information

The influence of temperature of photovoltaic modules on performance of solar power plant

The influence of temperature of photovoltaic modules on performance of solar power plant IOSR Journa of Engineering (IOSRJEN) ISSN (e): 2250-3021, ISSN (p): 2278-8719 Vo. 05, Issue 04 (Apri. 2015), V1 PP 09-15 www.iosrjen.org The infuence of temperature of photovotaic modues on performance

More information

Approach to Identifying Raindrop Vibration Signal Detected by Optical Fiber

Approach to Identifying Raindrop Vibration Signal Detected by Optical Fiber Sensors & Transducers, o. 6, Issue, December 3, pp. 85-9 Sensors & Transducers 3 by IFSA http://www.sensorsporta.com Approach to Identifying Raindrop ibration Signa Detected by Optica Fiber ongquan QU,

More information

Turbo Codes. Coding and Communication Laboratory. Dept. of Electrical Engineering, National Chung Hsing University

Turbo Codes. Coding and Communication Laboratory. Dept. of Electrical Engineering, National Chung Hsing University Turbo Codes Coding and Communication Laboratory Dept. of Eectrica Engineering, Nationa Chung Hsing University Turbo codes 1 Chapter 12: Turbo Codes 1. Introduction 2. Turbo code encoder 3. Design of intereaver

More information

Partial permutation decoding for MacDonald codes

Partial permutation decoding for MacDonald codes Partia permutation decoding for MacDonad codes J.D. Key Department of Mathematics and Appied Mathematics University of the Western Cape 7535 Bevie, South Africa P. Seneviratne Department of Mathematics

More information

International Journal of Mass Spectrometry

International Journal of Mass Spectrometry Internationa Journa of Mass Spectrometry 280 (2009) 179 183 Contents ists avaiabe at ScienceDirect Internationa Journa of Mass Spectrometry journa homepage: www.esevier.com/ocate/ijms Stark mixing by ion-rydberg

More information

IE 361 Exam 1. b) Give *&% confidence limits for the bias of this viscometer. (No need to simplify.)

IE 361 Exam 1. b) Give *&% confidence limits for the bias of this viscometer. (No need to simplify.) October 9, 00 IE 6 Exam Prof. Vardeman. The viscosity of paint is measured with a "viscometer" in units of "Krebs." First, a standard iquid of "known" viscosity *# Krebs is tested with a company viscometer

More information

CONJUGATE GRADIENT WITH SUBSPACE OPTIMIZATION

CONJUGATE GRADIENT WITH SUBSPACE OPTIMIZATION CONJUGATE GRADIENT WITH SUBSPACE OPTIMIZATION SAHAR KARIMI AND STEPHEN VAVASIS Abstract. In this paper we present a variant of the conjugate gradient (CG) agorithm in which we invoke a subspace minimization

More information

c 2016 Georgios Rovatsos

c 2016 Georgios Rovatsos c 2016 Georgios Rovatsos QUICKEST CHANGE DETECTION WITH APPLICATIONS TO LINE OUTAGE DETECTION BY GEORGIOS ROVATSOS THESIS Submitted in partia fufiment of the requirements for the degree of Master of Science

More information

arxiv: v1 [math.co] 17 Dec 2018

arxiv: v1 [math.co] 17 Dec 2018 On the Extrema Maximum Agreement Subtree Probem arxiv:1812.06951v1 [math.o] 17 Dec 2018 Aexey Markin Department of omputer Science, Iowa State University, USA amarkin@iastate.edu Abstract Given two phyogenetic

More information

Algorithms to solve massively under-defined systems of multivariate quadratic equations

Algorithms to solve massively under-defined systems of multivariate quadratic equations Agorithms to sove massivey under-defined systems of mutivariate quadratic equations Yasufumi Hashimoto Abstract It is we known that the probem to sove a set of randomy chosen mutivariate quadratic equations

More information

On the evaluation of saving-consumption plans

On the evaluation of saving-consumption plans On the evauation of saving-consumption pans Steven Vanduffe Jan Dhaene Marc Goovaerts Juy 13, 2004 Abstract Knowedge of the distribution function of the stochasticay compounded vaue of a series of future

More information

An explicit Jordan Decomposition of Companion matrices

An explicit Jordan Decomposition of Companion matrices An expicit Jordan Decomposition of Companion matrices Fermín S V Bazán Departamento de Matemática CFM UFSC 88040-900 Forianópois SC E-mai: fermin@mtmufscbr S Gratton CERFACS 42 Av Gaspard Coriois 31057

More information

Akaike Information Criterion for ANOVA Model with a Simple Order Restriction

Akaike Information Criterion for ANOVA Model with a Simple Order Restriction Akaike Information Criterion for ANOVA Mode with a Simpe Order Restriction Yu Inatsu * Department of Mathematics, Graduate Schoo of Science, Hiroshima University ABSTRACT In this paper, we consider Akaike

More information

Expectation-Maximization for Estimating Parameters for a Mixture of Poissons

Expectation-Maximization for Estimating Parameters for a Mixture of Poissons Expectation-Maximization for Estimating Parameters for a Mixture of Poissons Brandon Maone Department of Computer Science University of Hesini February 18, 2014 Abstract This document derives, in excrutiating

More information

arxiv: v1 [math.ca] 6 Mar 2017

arxiv: v1 [math.ca] 6 Mar 2017 Indefinite Integras of Spherica Besse Functions MIT-CTP/487 arxiv:703.0648v [math.ca] 6 Mar 07 Joyon K. Boomfied,, Stephen H. P. Face,, and Zander Moss, Center for Theoretica Physics, Laboratory for Nucear

More information

A CLUSTERING LAW FOR SOME DISCRETE ORDER STATISTICS

A CLUSTERING LAW FOR SOME DISCRETE ORDER STATISTICS J App Prob 40, 226 241 (2003) Printed in Israe Appied Probabiity Trust 2003 A CLUSTERING LAW FOR SOME DISCRETE ORDER STATISTICS SUNDER SETHURAMAN, Iowa State University Abstract Let X 1,X 2,,X n be a sequence

More information

Traffic data collection

Traffic data collection Chapter 32 Traffic data coection 32.1 Overview Unike many other discipines of the engineering, the situations that are interesting to a traffic engineer cannot be reproduced in a aboratory. Even if road

More information