A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training-Test Split


Michael Kearns
AT&T Bell Laboratories, Murray Hill, NJ 07974
mkearns@research.att.com

Abstract: We give an analysis of the generalization error of cross validation in terms of two natural measures of the difficulty of the problem under consideration: the approximation rate (the accuracy to which the target function can be ideally approximated as a function of the number of hypothesis parameters), and the estimation rate (the deviation between the training and generalization errors as a function of the number of hypothesis parameters). The approximation rate captures the complexity of the target function with respect to the hypothesis model, and the estimation rate captures the extent to which the hypothesis model suffers from overfitting. Using these two measures, we give a rigorous and general bound on the error of cross validation. The bound clearly shows the tradeoffs involved with making the fraction γ of data saved for testing too large or too small. By optimizing the bound with respect to γ, we then argue (through a combination of formal analysis, plotting, and controlled experimentation) that the following qualitative properties of cross validation behavior should be quite robust to significant changes in the underlying model selection problem: When the target function complexity is small compared to the sample size, the performance of cross validation is relatively insensitive to the choice of γ. The importance of choosing γ optimally increases, and the optimal value for γ decreases, as the target function becomes more complex relative to the sample size. There is nevertheless a single fixed value for γ that works nearly optimally for a wide range of target function complexity.

Category: Learning Theory. Prefer oral presentation.

1 INTRODUCTION

In this paper we analyze the performance of cross validation in the context of model selection and complexity regularization. We work in a setting in which we must choose the right number of parameters for a hypothesis function in response to a finite training sample, with the goal of minimizing the resulting generalization error. There is a large and interesting literature on cross validation methods, which often emphasizes asymptotic statistical properties, or the exact calculation of the generalization error for simple models. (The literature is too large to survey here; foundational papers include those of Stone [7, 8].) Our approach here is somewhat different, and is primarily inspired by two sources: the work of Barron and Cover [2], who introduced the idea of bounding the error of a model selection method (MDL in their case) in terms of a quantity known as the index of resolvability; and the work of Vapnik [9], who provides extremely powerful and general tools for uniformly bounding the deviations between training and generalization errors. We combine these methods to give a new and general analysis of cross validation performance. In the first and more formal part of the paper, we give a rigorous bound on the error of cross validation in terms of two parameters of the underlying model selection problem (which is defined by a target function, an input distribution, and a nested sequence of increasingly complex classes of hypothesis functions): the approximation rate and the estimation rate, mentioned in the abstract and defined formally shortly.
Taken together, these two problem parameters determine (our analogue of) the index of resolvability, and Vapnik's work yields estimation rates applicable to many natural problems. In the second part of the paper, we investigate the implications of our bound for choosing γ, the fraction of data withheld for testing in cross validation. The most interesting aspect of this analysis is the identification of several qualitative properties (identified in the abstract, and in greater detail in the paper) of the optimal γ that appear to be invariant over a wide class of model selection problems.

2 THE FORMALISM

In this paper we consider model selection as a two-part problem: choosing the appropriate number of parameters for the hypothesis function, and tuning these parameters. The training sample is used in both steps of this process. In many settings, the tuning of the parameters is determined by a fixed learning algorithm such as backpropagation, and then model selection reduces to the problem of choosing the architecture (which determines the number of weights, the connectivity pattern, and so on). For concreteness, here we adopt an idealized version of this division of labor. We assume a nested sequence of function classes H_1 ⊆ H_2 ⊆ ··· ⊆ H_d ⊆ ···, called the structure [9], where H_d is a class of boolean functions of d parameters, each function being a mapping from some input space X into {0, 1}. For simplicity, in this paper we assume that the Vapnik-Chervonenkis (VC) dimension [10, 9] of the class H_d is O(d). To remove this assumption, one simply replaces all occurrences of d in our bounds by the VC dimension of H_d. We assume that we have in our possession a learning algorithm L that on input any training sample S and any value d will output a hypothesis function h_d ∈ H_d that minimizes the training error over H_d, that is, ε_t(h_d) = min_{h ∈ H_d} {ε_t(h)}, where ε_t(h) is the fraction of the examples in S on which h disagrees with the given label. In many situations, training error minimization is known to be computationally intractable, leading researchers to investigate heuristics such as backpropagation. The extent to which the theory presented here applies to such heuristics will depend in part on the extent to which they approximate training error minimization for the problem under consideration; however, many of our results have rigorous generalizations for the case in which no assumptions are made on the heuristic L (with the quality of the bounds of course decaying with the quality of the hypotheses output by L).

Model selection is thus the problem of choosing the best value of d. More precisely, we assume an arbitrary target function f (which may or may not reside in one of the function classes in the structure H_1 ⊆ H_2 ⊆ ···), and an input distribution D; f and D together define the generalization error function ε_g(h) = Pr_{x ∈ D}[h(x) ≠ f(x)]. We are given a training sample S of f, consisting of m random examples drawn according to D and labeled by f (with the labels possibly corrupted by noise). In many model selection methods (such as Rissanen's Minimum Description Length Principle [5] and Vapnik's Guaranteed Risk Minimization [9]), for each value of d = 1, 2, 3, … we give the entire sample S and d to the learning algorithm L to obtain the function h_d minimizing the training error in H_d. Some function of d and the training errors ε_t(h_d) is then used to choose among the sequence h_1, h_2, h_3, …. Whatever the method, we interpret the goal to be that of minimizing the generalization error of the hypothesis selected. In this paper, we will make the rather mild but very useful assumption that the structure has the property that for any sample size m, there is a value d_max(m) such that ε_t(h_{d_max(m)}) = 0 for any labeled sample S of m examples; we will refer to the function d_max(m) as the fitting number of the structure (see footnote 1).
The fitting number formalizes the simple notion that with enough parameters, we can always fit the training data perfectly, a property held by most sufficiently powerful function classes (including multilayer neural networks). We typically expect the fitting number to be a linear function of m, or at worst a polynomial in m. The significance of the fitting number for us is that no reasonable model selection method should choose h_d for d > d_max(m), since doing so simply adds complexity without reducing the training error.

In this paper we concentrate on the simplest version of cross validation. Unlike the methods mentioned above, which use the entire sample for training the h_d, in cross validation we choose a parameter γ ∈ [0, 1], which determines the split between training and test data. Given the input sample S of m examples, let S' be the subsample consisting of the first (1 − γ)m examples in S, and S'' the subsample consisting of the last γm examples. In cross validation, rather than giving the entire sample S to L, we give only the smaller sample S', resulting in the sequence h_1, …, h_{d_max((1 − γ)m)} of increasingly complex hypotheses. Each hypothesis is now obtained by training on only (1 − γ)m examples, which implies that we will consider only d smaller than the corresponding fitting number d_max((1 − γ)m); let us introduce the shorthand d_max^γ for d_max((1 − γ)m). Cross validation chooses the h_d satisfying ε''_t(h_d) = min_{i ∈ {1, …, d_max^γ}} {ε''_t(h_i)}, where ε''_t(h_i) is the error of h_i on the subsample S''. Notice that we are not considering multifold cross validation, or other variants that make more efficient use of the sample, because our analyses will require the independence of the test set. However, we believe that many of the themes that emerge here apply to these more sophisticated variants as well.

We use ε_cv(m) to denote the generalization error ε_g(h_d) of the hypothesis h_d chosen by cross validation when given as input a sample S of m random examples of the target function; obviously, ε_cv(m) depends on the structure, f, D, and the noise rate. When bounding ε_cv(m), we will use the expression "with high probability" to mean with probability 1 − δ over the sample S, for some small fixed constant δ > 0. All of our results can also be stated with δ as a parameter at the cost of a log(1/δ) factor in the bounds, or in terms of the expected value of ε_cv(m).

Footnote 1: Weaker definitions for d_max(m) also suffice, but this is the simplest.
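To make the selection procedure concrete, here is a minimal Python sketch of the version of cross validation just described: each h_d is trained on the first (1 − γ)m examples only, and the held-out γm examples are used solely to choose among them. The names (holdout_model_selection, Learner) are ours, and the training-error-minimizing learner L is treated as a black box supplied by the caller; this is an illustrative sketch, not code from the paper.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, int]                 # (x, y) with y in {0, 1}
Hypothesis = Callable[[float], int]         # h: X -> {0, 1}
Learner = Callable[[Sequence[Example], int], Hypothesis]
# Learner(S_prime, d) is assumed to return some h_d in H_d minimizing training error on S_prime.

def holdout_model_selection(sample: Sequence[Example], gamma: float,
                            learner: Learner, d_max: int) -> Hypothesis:
    """Simple (non-multifold) cross validation: train on S', select on S''."""
    m = len(sample)
    n_train = int(round((1.0 - gamma) * m))
    s_prime, s_double_prime = sample[:n_train], sample[n_train:]   # S' and S''

    def test_error(h: Hypothesis) -> float:
        return sum(1 for x, y in s_double_prime if h(x) != y) / max(1, len(s_double_prime))

    # h_1, ..., h_{d_max((1-gamma)m)} are trained on S' only; the test subsample S''
    # is used solely to pick among them.
    candidates: List[Hypothesis] = [learner(s_prime, d) for d in range(1, d_max + 1)]
    return min(candidates, key=test_error)
```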

3 THE APPROXIMATION RATE

It is apparent that we should not expect to give nontrivial bounds on ε_cv(m) that take no account of some measure of the complexity of the unknown target function f; the correct measure of this complexity is less obvious. Following the example of Barron and Cover's analysis of MDL performance in the context of density estimation [2], we propose the approximation rate as a natural measure of the complexity of f and D in relationship to the chosen structure H_1 ⊆ H_2 ⊆ ···. Thus we define the approximation rate function ε_g(d) to be ε_g(d) = min_{h ∈ H_d} {ε_g(h)}. The function ε_g(d) tells us the best generalization error that can be achieved in the class H_d, and it is a nonincreasing function of d. If ε_g(d_0) = 0 for some sufficiently large d_0, this means that the target function f, at least with respect to the input distribution, is realizable in the class H_{d_0}, and thus d_0 is a coarse measure of how complex f is. More generally, even if ε_g(d) > 0 for all d, the rate of decay of ε_g(d) still gives a nice indication of how much representational power we gain with respect to f and D by increasing the complexity of our models. Still missing, of course, is some means of determining the extent to which this representational power can be realized by training on a finite sample of a given size, but this will be added shortly. First we give examples of the approximation rate that we will examine in some detail following the general bound on ε_cv(m).

The Intervals Problem. In this problem, the input space X is the real interval [0, 1], and the class H_d of the structure consists of all boolean step functions over [0, 1] of at most d steps; thus, each function partitions the interval [0, 1] into at most d disjoint segments (not necessarily of equal width), and assigns alternating positive and negative labels to these segments. Thus, the input space is one-dimensional, but the structure contains arbitrarily complex functions over [0, 1]. It is easily verified that our assumption that the VC dimension of H_d is O(d) holds here, and that the fitting number obeys d_max(m) ≤ m. Now suppose that the input density D is uniform, and suppose that the target function f is the function of s alternating segments of equal width 1/s, for some s (thus, f lies in the class H_s). We will refer to these settings as the intervals problem. Then the approximation rate ε_g(d) is ε_g(d) = (1/2)(1 − d/s) for 1 ≤ d < s and ε_g(d) = 0 for d ≥ s. Thus, as long as d < s, increasing the complexity d gives linear payoff in terms of decreasing the optimal generalization error. For d ≥ s, there is no payoff for increasing d, since we can already realize the target function. The reader can easily verify that if f lies in H_s, but does not have equal width intervals, ε_g(d) is still piecewise linear, but for d < s is concave up: the gain in approximation obtained by incrementally increasing d diminishes as d becomes larger. Although the intervals problem is rather simple and artificial, a precise analysis of cross validation behavior can be given for it, and we will argue that this behavior is representative of much broader and more realistic settings.

The Perceptron Problem. In this problem, the input space X is ℜ^N for some large natural number N. The class H_d consists of all perceptrons over the N inputs in which at most d weights are nonzero.
If the input density is spherically symmetric (for instance, the uniform density on the unit ball in ℜ^N), and the target function is a function in H_s with all s nonzero weights equal to 1, then it can be shown that the approximation rate function ε_g(d) is ε_g(d) = (1/π) cos^{-1}(√(d/s)) for d < s [6], and of course ε_g(d) = 0 for d ≥ s. This problem provides a nice contrast to the intervals problem, since here the behavior of the approximation rate for small d is concave down: as long as d < s, an incremental increase in d yields more approximative power for large d than it does for small d (except for very small values of d).

Power Law Decay. In addition to the specific examples just given, we would also like to study reasonably natural parametric forms of ε_g(d), to determine the sensitivity of our theory to a plausible range of behaviors for the approximation rate. This is important, since in practice we do not expect to have precise knowledge of ε_g(d), since it depends on the target function and input distribution. Following the work of Barron [1], who shows a c/d bound on ε_g(d) for the case of neural networks with one hidden layer under a squared error generalization measure (where c is a measure of target function complexity in terms of a Fourier transform integrability condition; see footnote 2), in the later part of the paper we investigate ε_g(d) of the form (c/d)^α + ε_min, where ε_min is a parameter representing the degree of unrealizability of f with respect to the structure, and c, α > 0 are parameters capturing the rate of decay to ε_min. Our analysis concludes that the qualitative phenomena we identify are invariant to wide ranges of choices for these parameters.

Note that for all three cases, there is a natural measure of target function complexity captured by ε_g(d): in the intervals problem it is the number of target intervals, in the perceptron problem it is the number of nonzero weights in the target, and in the more general power law case it is captured by the parameters of the power law. In the later part of the paper, we will study cross validation performance as a function of these complexity measures, and obtain remarkably similar predictions.

Footnote 2: Since the bounds we will give have straightforward generalizations to real-valued function learning under squared error (details in the full paper), examining behavior for ε_g(d) in this setting seems reasonable.
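The three approximation rates just described are all simple closed forms; the short sketch below records them so they can be plugged into the bound of the next sections. The function names are ours, and the perceptron form uses the √(d/s) argument reconstructed above.

```python
import math

def eps_g_intervals(d: int, s: int) -> float:
    """Intervals problem: eps_g(d) = (1/2)(1 - d/s) for d < s, and 0 for d >= s."""
    return 0.5 * (1.0 - d / s) if d < s else 0.0

def eps_g_perceptron(d: int, s: int) -> float:
    """Sparse perceptron target: eps_g(d) = (1/pi) arccos(sqrt(d/s)) for d < s, else 0."""
    return math.acos(math.sqrt(d / s)) / math.pi if d < s else 0.0

def eps_g_power_law(d: int, c: float, alpha: float = 1.0, eps_min: float = 0.0) -> float:
    """Power law decay: eps_g(d) = (c/d)^alpha + eps_min."""
    return (c / d) ** alpha + eps_min
```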

4 THE ESTIMATION RATE

For a fixed f, D and H_1 ⊆ H_2 ⊆ ···, we say that a function ρ(d, m) is an estimation rate bound if for all d and m, with high probability over the sample S we have |ε_t(h_d) − ε_g(h_d)| ≤ ρ(d, m), where as usual h_d is the result of training error minimization on S within H_d (see footnote 3). Thus ρ(d, m) simply bounds the deviation between the training error and the generalization error of h_d. Note that the best such bound may depend in a complicated way on all of the elements of the problem: f, D and the structure. Indeed, much of the recent work on the statistical physics theory of learning curves has documented the wide variety of behaviors that such deviations may assume [6, 3]. However, for many natural problems it is both convenient and accurate to rely on a universal estimation rate bound provided by the powerful theory of uniform convergence: namely, for any f, D and any structure, the function ρ(d, m) = √((d/m) log(m/d)) is an estimation rate bound [9] (see footnote 4). Depending upon the details of the problem, it is sometimes appropriate to omit the log(m/d) factor, and often appropriate to refine the √(d/m) behavior to a function that interpolates smoothly between d/m behavior for small ε_t to √(d/m) for large ε_t. Although such refinements are both interesting and important, many of the qualitative claims and predictions we will make are invariant to them as long as the deviation |ε_t(h_d) − ε_g(h_d)| is well-approximated by a power law (d/m)^α (α > 0); it will be more important to recognize and model the cases in which power law behavior is grossly violated. Note that this universal estimation rate bound holds only under the assumption that the training sample is noise-free, but straightforward generalizations exist. For instance, if the training data is corrupted by random label noise at rate η < 1/2, then ρ(d, m) = √((d/((1 − 2η)²m)) log(m/d)) is again a universal estimation rate bound. After giving a general bound on ε_cv(m) in which the approximation and estimation rate functions are parameters, we investigate the behavior of ε_cv(m) (and more specifically, of the parameter γ) for specific choices of these parameters.

Footnote 3: In the later and less formal part of the paper, we will often assume that specific forms of ρ(d, m) are not merely upper bounds on this deviation, but accurate approximations to it.

Footnote 4: The results of Vapnik actually show the stronger result that |ε_t(h) − ε_g(h)| ≤ √((d/m) log(m/d)) for all h ∈ H_d, not only for the training error minimizer h_d.

5 THE BOUND

Theorem 1 Let H_1 ⊆ H_2 ⊆ ··· be any structure, where the VC dimension of H_d is O(d). Let f and D be any target function and input distribution, let ε_g(d) be the approximation rate function for the structure with respect to f and D, and let ρ(d, m) be an estimation rate bound for the structure with respect to f and D. Then for any m, with high probability

    ε_cv(m) ≤ min_{1 ≤ d ≤ d_max^γ} { ε_g(d) + ρ(d, (1 − γ)m) } + O( √( log(d_max^γ) / (γm) ) )    (1)

where γ is the fraction of the training sample used for testing, and d_max^γ = d_max((1 − γ)m).

Using the universal estimation rate bound and the rather weak assumption that d_max(m) is polynomial in m, we obtain that with high probability

    ε_cv(m) ≤ min_{1 ≤ d ≤ d_max^γ} { ε_g(d) + O( √( (d/((1 − γ)m)) log(m/d) ) ) } + O( √( log((1 − γ)m) / (γm) ) ).    (2)

Straightforward generalizations of these bounds for the case where the data is corrupted by random label noise can be obtained, using the modified estimation rate bound mentioned in Section 4 (details in the full paper).
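As an illustration only, the sketch below evaluates the right-hand side of bound (1) with the universal estimation rate bound plugged in for ρ, for any approximation rate function such as the ones coded earlier; the hidden constants in the O(·) terms are simply taken to be 1 here, which is an assumption of ours and not part of the theorem.

```python
import math
from typing import Callable

def universal_rho(d: int, n: float) -> float:
    """Universal estimation rate bound sqrt((d/n) log(n/d)), with the log clamped at 0."""
    return math.sqrt((d / n) * max(math.log(n / d), 0.0))

def cv_error_bound(eps_g: Callable[[int], float], m: int, gamma: float, d_max: int) -> float:
    """Right-hand side of bound (1) with rho = universal_rho and all O(.) constants set to 1."""
    n_train = (1.0 - gamma) * m
    best = min(eps_g(d) + universal_rho(d, n_train) for d in range(1, d_max + 1))
    return best + math.sqrt(math.log(d_max) / (gamma * m))

# Example (values are illustrative, not from the paper): the intervals problem with
# s = 100 and m = 10_000, evaluated at a few training/test splits.
# for gamma in (0.1, 0.25, 0.5, 0.75):
#     print(gamma, cv_error_bound(lambda d: eps_g_intervals(d, 100), 10_000, gamma, 1_000))
```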
Proof Sketch: We have space only to highlight the main ideas of the proof. For each d from 1 to d_max^γ, fix a function f_d ∈ H_d satisfying ε_g(f_d) = ε_g(d); thus, f_d is the best possible approximation to the target f within the class H_d. By a standard Chernoff bound argument it can be shown that with high probability we have |ε_t(f_d) − ε_g(f_d)| ≤ √(log(d_max^γ)/m) for all 1 ≤ d ≤ d_max^γ. This means that within each H_d, the minimum training error ε_t(h_d) is at most ε_g(d) + √(log(d_max^γ)/m).

Since ρ(d, m) is an estimation rate bound, we have that with high probability, for all 1 ≤ d ≤ d_max^γ,

    ε_g(h_d) ≤ ε_t(h_d) + ρ(d, m)    (3)
             ≤ ε_g(d) + √(log(d_max^γ)/m) + ρ(d, m).    (4)

Thus we have bounded the generalization error of the d_max^γ hypotheses h_d, only one of which will be selected. If we knew the actual values of these generalization errors (equivalent to having an infinite test sample in cross validation), we could bound our error by the minimum over all 1 ≤ d ≤ d_max^γ of the expression (4) above. However, in cross validation we do not know the exact values of these generalization errors but must instead use the γm testing examples to estimate them. Again by standard Chernoff bound arguments, this introduces an additional √(log(d_max^γ)/(γm)) error term, resulting in our final bound. This concludes the proof sketch.

In the bounds given by (1) and (2), the min{·} expression is analogous to Barron and Cover's index of resolvability [2]; the final term in the bounds represents the error introduced by the testing phase of cross validation. These bounds exhibit tradeoff behavior with respect to the parameter γ: as we let γ approach 0, we are devoting more and more of the sample to training the h_d, and the estimation rate bound term ρ(d, (1 − γ)m) is decreasing. However, the test error term O(√(log(d_max^γ)/(γm))) is increasing, since we have less data to accurately estimate the ε_g(h_d). The reverse phenomenon occurs as we let γ approach 1.

While we believe Theorem 1 to be enlightening and potentially useful in its own right, we would now like to take its interpretation a step further. More precisely, suppose we assume that the bound is an approximation to the actual behavior of ε_cv(m). Then in principle we can optimize the bound to obtain the best value for γ. Of course, in addition to the assumptions involved (the main one being that ρ(d, m) is a good approximation to the training-generalization error deviations of the h_d), this analysis can only be carried out given information that we should not expect to have in practice (at least in exact form), in particular the approximation rate function ε_g(d), which depends on f and D. However, we argue in the coming sections that several interesting qualitative phenomena regarding the choice of γ are largely invariant to a wide range of natural behaviors for ε_g(d).

6 A CASE STUDY: THE INTERVALS PROBLEM

We begin by performing the suggested optimization of γ for the intervals problem. Recall that the approximation rate here is ε_g(d) = (1/2)(1 − d/s) for d < s and ε_g(d) = 0 for d ≥ s, where s is the complexity of the target function. Here we analyze the behavior obtained by assuming that the estimation rate ρ(d, (1 − γ)m) actually behaves as ρ(d, (1 − γ)m) = √(d/((1 − γ)m)) (so we are omitting the log factor from the universal bound; see footnote 5), and to simplify the formal analysis a bit (but without changing the qualitative behavior) we replace the term √(log((1 − γ)m)/(γm)) by the weaker √(log(m)/(γm)). Thus, if we define the function

    F(d, m, γ) = ε_g(d) + √(d/((1 − γ)m)) + √(log(m)/(γm))    (5)

then following (1), we are approximating ε_cv(m) by ε_cv(m) ≈ min_{1 ≤ d ≤ d_max^γ} {F(d, m, γ)} (see footnote 6, and Figure 1 for a plot of this approximation). The first step of the analysis is to fix a value for γ and differentiate F(d, m, γ) with respect to d to discover the minimizing value of d. This differentiation must have two regimes due to the discontinuity at d = s in ε_g(d). It is easily verified that the derivative is −(1/(2s)) + 1/(2√(d(1 − γ)m)) for d < s and 1/(2√(d(1 − γ)m)) for d ≥ s.
It can be shown that provided (1 − γ)m ≥ 4s, d = s is a global minimum of F(d, m, γ), and if this condition is violated then the value we obtain for ε_cv(m) is vacuously large anyway (meaning that this fixed choice of γ cannot end up being the optimal one, or, if m < 4s, that our analysis claims we simply do not have enough data for nontrivial generalization, regardless of how we split it between training and testing). Plugging in d = s yields

    ε_cv(m) ≤ F(s, m, γ) = √(s/((1 − γ)m)) + √(log(m)/(γm))

for this fixed choice of γ. Now by differentiating F(s, m, γ) with respect to γ, it can be shown that the optimal choice of γ under the assumptions is

    γ_opt = (log(m)/s)^{1/3} / (1 + (log(m)/s)^{1/3}).

Footnote 5: It can be argued that this power law estimation rate is actually a rather accurate approximation for the true behavior of the training-generalization deviations of the h_d for this problem.

Footnote 6: Although there are hidden constants in the O(·) notation of the bounds, it is the relative weights of the estimation and test error terms that is important, and choosing both constants equal to 1 is a reasonable choice (since both terms have the same Chernoff bound origins).
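The closed form above can be checked numerically. The sketch below grid-searches γ for F(s, m, γ) with d fixed at s and compares the result against γ_opt; the base of the logarithm is not pinned down by the text, and base 2 is assumed here because it reproduces the γ_opt values quoted below (0.524, 0.338 and 0.191 for m = 10,000 and s = 10, 100, 1000).

```python
import math

LOG = math.log2   # assumed base; with base 2 the values below match those quoted in the text

def F_at_s(s: int, m: int, gamma: float) -> float:
    """F(s, m, gamma) = sqrt(s / ((1 - gamma) m)) + sqrt(log(m) / (gamma m))."""
    return math.sqrt(s / ((1.0 - gamma) * m)) + math.sqrt(LOG(m) / (gamma * m))

def gamma_opt_closed_form(s: int, m: int) -> float:
    """gamma_opt = (log(m)/s)^(1/3) / (1 + (log(m)/s)^(1/3))."""
    r = (LOG(m) / s) ** (1.0 / 3.0)
    return r / (1.0 + r)

def gamma_opt_grid(s: int, m: int, steps: int = 20_000) -> float:
    """Brute-force minimizer of F(s, m, .) over a fine grid in (0, 1)."""
    grid = ((i + 0.5) / steps for i in range(steps))
    return min(grid, key=lambda g: F_at_s(s, m, g))

if __name__ == "__main__":
    m = 10_000
    for s in (10, 100, 1000):
        print(s, round(gamma_opt_closed_form(s, m), 3), round(gamma_opt_grid(s, m), 3))
    # Prints 0.524, 0.338, 0.191 for the closed form, with the grid search agreeing
    # to within the grid resolution.
```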

It is important to remember at this point that despite the fact that we have derived a precise expression for γ_opt, due to the assumptions and approximations we have made in the various constants, any quantitative interpretation of this expression is meaningless. However, we can reasonably expect that this expression captures the qualitative way in which the optimal γ changes as the amount of data m changes in relation to the target function complexity s. On this score the situation initially appears rather bleak, as the function (log(m)/s)^{1/3}/(1 + (log(m)/s)^{1/3}) is quite sensitive to the ratio log(m)/s. For example, for m = 10,000: if s = 10 we obtain γ_opt = 0.524, if s = 100 we obtain γ_opt = 0.338, and if s = 1000 we obtain γ_opt = 0.191. Thus γ_opt is becoming smaller as log(m)/s becomes small, and the analysis suggests vastly different choices for γ_opt depending on the target function complexity, which is something we do not expect to have the luxury of knowing in practice. However, it is both fortunate and interesting that γ_opt does not tell the entire story.

In Figure 2, we plot the function F(s, m, γ) (see footnote 7) as a function of γ for m = 10,000 and for several different values of s (note that for consistency with the later experimental plots, the x axis of the plot is actually the training fraction 1 − γ). Here we can observe four important qualitative phenomena: (A) When s is small compared to m, the predicted error is relatively insensitive to the choice of γ: as a function of γ, F(s, m, γ) has a wide, flat bowl, indicating a wide range of γ yielding essentially the same near-optimal error. (B) As s becomes larger in comparison to the fixed sample size m, the relative superiority of γ_opt over other values for γ becomes more pronounced. In particular, large values for γ become progressively worse as s increases. For example, the plots indicate that for s = 10 (again, m = 10,000), even though γ_opt = 0.524, the choice γ = 0.75 will result in error quite near that achieved using γ_opt. However, for s = 500, γ = 0.75 is predicted to yield greatly suboptimal error. (C) Because of the insensitivity to γ for s small compared to m, there is a fixed value of γ which seems to yield reasonably good performance for a wide range of values for s; this value is essentially the value of γ_opt for the case where s is large (but nontrivial generalization is still possible), since choosing the best value for γ is more important there than for the small s case. Note that for very large s, the bound predicts vacuously large error for all values of γ, so that the choice of γ again becomes irrelevant. (D) The value of γ_opt is decreasing as s increases. This is slightly difficult to confirm from the plot, but can be seen clearly from the precise expression for γ_opt.

Despite the fact that our analysis so far has been rather specialized (addressing the behavior for a fixed structure, target function and input distribution), it is our belief that (A), (B), (C) and (D) above are rather universal phenomena that hold for many other model selection problems. For instance, we shall shortly demonstrate that our theory again predicts that (A), (B), (C) and (D) hold for the perceptron problem, and for the case of power law decay of ε_g(d) described earlier. First we give an experimental demonstration that at least the predicted properties (A), (B) and (C) truly do hold for the intervals problem. In Figures 3, 4 and 5, we plot the results of experiments in which labeled random samples of size m = 5000 were generated for a target function of s equal width intervals, for s = 10, 100 and 500. The samples were corrupted by random label noise at rate η = 0.30.
For each value of γ and each value of d, (1 − γ)m of the sample was given to a program implementing training error minimization for the class H_d (see footnote 8); the remaining γm examples were used to select the best h_d according to cross validation. The plots show the true generalization error of the h_d selected by cross validation as a function of γ; this generalization error can be computed exactly for this problem. Each point in the plots represents an average over 10 trials. While there are obvious and significant quantitative differences between these experimental plots and Figure 2, the properties (A), (B) and (C) are rather clearly borne out by the data: (A) In Figure 3, where s is small compared to m, there is a wide range of acceptable γ; it appears that any choice of γ between 0.10 and 0.50 yields nearly optimal generalization error. (B) By the time s = 100 (Figure 4), the sensitivity to γ is considerably more pronounced. For example, the choice γ = 0.50 now results in clearly suboptimal performance, and it is more important to have γ close to 0.10. (C) Despite these complexities, there does indeed appear to be a single value of γ, approximately 0.10, that performs nearly optimally for the entire range of s examined. The property (D), namely that the optimal γ decreases as the target function complexity is increased relative to a fixed m, is certainly not refuted by the experimental results, but any such effect is simply too small to be verified. It would be interesting to verify this prediction experimentally, perhaps on a different problem where the predicted effect is more pronounced.

Footnote 7: In the plots, we now use the more accurate test penalty term √(log((1 − γ)m)/(γm)) since we are no longer concerned with simplifying the calculation.

Footnote 8: A nice feature of the intervals problem is the fact that training error minimization can be performed in almost linear time using a dynamic programming approach [4].
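The exact computation of the generalization error mentioned above is straightforward because both the target and every hypothesis in the structure are piecewise constant on [0, 1]. The sketch below shows one way to do it under the uniform input distribution; the piece representation and the convention that the target's first interval is labeled 0 are our own choices, not taken from the paper.

```python
from typing import List, Tuple

Piece = Tuple[float, int]     # (right endpoint, label); a list of pieces partitions [0, 1]

def target_pieces(s: int) -> List[Piece]:
    """Target of s equal-width alternating intervals (first interval labeled 0 here)."""
    return [((k + 1) / s, k % 2) for k in range(s)]

def exact_generalization_error(h: List[Piece], f: List[Piece]) -> float:
    """Pr_{x ~ Uniform[0,1]}[h(x) != f(x)] for two piecewise-constant functions."""
    err, left, i, j = 0.0, 0.0, 0, 0
    while i < len(h) and j < len(f):
        right = min(h[i][0], f[j][0])
        if h[i][1] != f[j][1]:
            err += right - left          # the two functions disagree on (left, right)
        left = right
        if h[i][0] <= right:             # advance whichever piece ends at `right`
            i += 1
        if f[j][0] <= right:
            j += 1
    return err

# Example: the single-piece hypothesis labeled 0 disagrees with an s = 10 target on the
# five intervals labeled 1, so its generalization error is exactly 0.5:
# exact_generalization_error([(1.0, 0)], target_pieces(10))  ->  0.5
```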

7 POWER LAW DECAY AND THE PERCEPTRON PROBLEM

For the cases where the approximation rate ε_g(d) obeys either power law decay or is that derived for the perceptron problem discussed in Section 3, the behavior of ε_cv(m) as a function of γ predicted by our theory is largely the same. For example, if ε_g(d) = c/d and we use the standard estimation rate ρ(d, (1 − γ)m) = √(d/((1 − γ)m)), then an analysis similar to that performed for the intervals problem reveals that for fixed γ the minimizing choice of d is d = (4c²(1 − γ)m)^{1/3}; plugging this value of d back into the bound on ε_cv(m) yields Figure 6, which, like Figure 2 for the intervals problem, shows the predicted behavior of ε_cv(m) as a function of γ for a fixed m and several different choices for the complexity parameter c. We again see that properties (A), (B), (C) and (D) hold strongly despite the change in ε_g(d) from the intervals problem (although quantitative aspects of the prediction, which already must be taken lightly for reasons previously stated, have obviously changed, such as the interesting values of the ratio of sample size to target function complexity). Through a combination of formal analysis and plotting, it is possible to demonstrate that the properties (A), (B), (C) and (D) are robust to wide variations in the parameters α and ε_min in the parametric form ε_g(d) = (c/d)^α + ε_min, as well as wide variations in the form of the estimation rate ρ(d, m). For example, if ε_g(d) = (c/d)² (faster than the approximation rate examined above) and ρ = d/m (faster than the estimation rate examined above), then for the interesting ratios of m to c (that is, where the predicted generalization error is bounded away from 0 and the trivial value of 1/2), a figure quite similar to Figure 6 is obtained. Similar predictions can be derived for the perceptron problem using the universal estimation rate bound or any similar power law form. In summary, our theory predicts that although significant quantitative differences in the behavior of cross validation may arise for different model selection problems, the properties (A), (B), (C) and (D) should be present in a wide range of problems. At the very least, the behavior of our bounds exhibits these properties for a wide range of problems. It would be interesting to try to identify natural problems for which one or more of these properties is strongly violated; a potential source for such problems may be those for which the underlying learning curve deviates from classical power law behavior [6, 3].

8 ACKNOWLEDGEMENTS

I give warm thanks to Yishay Mansour, Dana Ron and Yoav Freund for their contributions to this work and for many interesting discussions. The code used in the experiments was written by Andrew Ng.

References

[1] A. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39:930-945, 1993.
[2] A. R. Barron and T. M. Cover. Minimum complexity density estimation. IEEE Transactions on Information Theory, 37:1034-1054, 1991.
[3] D. Haussler, M. Kearns, H. S. Seung, and N. Tishby. Rigorous learning curve bounds from statistical mechanics. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory, pages 76-87, 1994.
[4] M. Kearns, Y. Mansour, A. Ng, and D. Ron. An experimental and theoretical comparison of model selection methods. In Proceedings of the Eighth Annual ACM Conference on Computational Learning Theory, 1995.
[5] J. Rissanen. Stochastic Complexity in Statistical Inquiry, volume 15 of Series in Computer Science. World Scientific, 1989.
[6] H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review A, 45:6056-6091, 1992.
[7] M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society B, 36:111-147, 1974.
[8] M. Stone. Asymptotics for and against cross-validation. Biometrika, 64(1):29-35, 1977.
[9] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York, 1982.
[10] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264-280, 1971.

Figure 1: Plot of the bound/approximation ε_cv(m) ≈ ε_g(d) + √(d/((1 − γ)m)) + √(log((1 − γ)m)/(γm)) as a function of the training fraction 1 − γ and d for m = 10,000, and ε_g(d) for the intervals problem with a target function of complexity s = 100. Several interesting features are evident for these values, including the predicted choice of d = s = 100, and the curvature as a function of γ, indicating a unique optimal choice for γ bounded away from 0 and 1. (Note that we have plotted the minimum of the bound and 0.5, since this generalization error can be trivially achieved by flipping a fair coin.)

Figure 2: Plot of the predicted generalization error of cross validation for the intervals model selection problem, as a function of the fraction 1 − γ of data used for training. (In the plot, the fraction of training data is 0 on the left (γ = 1) and 1 on the right (γ = 0).) The fixed sample size m = 10,000 was used, and the 6 plots show the error predicted by the theory for target function complexity values s = 10 (bottom plot), 50, 100, 250, 500, and 1000 (top plot). The movement and relative superiority of the optimal choice for γ identified by properties (A), (B), (C) and (D) can be seen clearly. (We have again plotted the minimum of the predicted error and 0.5.)

Figure 3: Experimental plot of cross validation generalization error as a function of training set size (1 − γ)m, for s = 10 and m = 5000, with 30% label noise added. As in Figure 2, the x axis indicates the amount of data used for training. For this small value of the ratio of s to m, property (A) predicted by the theory is confirmed: there is a wide range of γ values yielding essentially the same generalization error.

Figure 4: Experimental plot of cross validation generalization error as a function of training set size (1 − γ)m as in Figure 3, but now for the target function complexity s = 100. Property (B) predicted by the theory can be seen: compared to Figure 3, the relative superiority of the optimal γ over other values is increasing, and in particular larger values of γ (such as 0.5) are now clearly suboptimal choices.

Figure 5: Experimental plot of cross validation generalization error as a function of training set size (1 − γ)m, now for s = 500. Figures 3, 4 and 5 together confirm the predicted property (C): as in the theoretical plot of Figure 2, despite the differing shapes of the three plots there is a fixed value of γ (about γ = 0.10) yielding near-optimal performance for the entire range of s.

Figure 6: Plot of the predicted generalization error of cross validation for the case ε_g(d) = c/d, as a function of the fraction 1 − γ of data used for training. The fixed sample size m = 25,000 was used, and the 6 plots show the error predicted by the theory for target function complexity values c = 10 (bottom plot), 25, 50, 75, 100, and 150 (top plot). Properties (A), (B), (C) and (D) are again evident. (We have again plotted the minimum of the predicted error and 0.5.)