A variance decomposition and a Central Limit Theorem for empirical losses associated with resampling designs

Mathias Fuchs, Norbert Krautenbacher: A variance decomposition and a Central Limit Theorem for empirical losses associated with resampling designs. Technical Report Number 173, 2014. Department of Statistics, University of Munich. http://www.stat.uni-muenchen.de

A VARIANCE DECOMPOSITION AND A CENTRAL LIMIT THEOREM FOR EMPIRICAL LOSSES ASSOCIATED WITH RESAMPLING DESIGNS

MATHIAS FUCHS AND NORBERT KRAUTENBACHER

ABSTRACT. The mean prediction error of a classification or regression procedure can be estimated using resampling designs such as the cross-validation design. We decompose the variance of such an estimator associated with an arbitrary resampling procedure into a small linear combination of covariances between elementary estimators, each of which is a regular parameter as described in the theory of U-statistics. The enumerative combinatorics of the occurrence frequencies of these covariances govern the linear combination's coefficients and, therefore, the variance's large-scale behavior. We study the variance of incomplete U-statistics associated with kernels which are partly but not entirely symmetric. This leads to asymptotic statements for the prediction error's estimator, under general non-empirical conditions on the resampling design. In particular, we show that the resampling-based estimator of the average prediction error is asymptotically normally distributed under a general and easily verifiable condition. Likewise, we give a sufficient criterion for consistency. We thus develop a new approach to understanding small-variance designs as they have recently appeared in the literature. We exhibit the U-statistics which estimate these variances. We present a case from linear regression where the covariances between the elementary estimators can be computed analytically. We illustrate our theory by computing estimators of the studied quantities in an artificial data example.

1. INTRODUCTION

This paper is concerned with the variance of resampling designs in machine learning and statistics. A resampling design, a collection of splits of the data into learning and test sets, yields an estimator of the expectation of a loss function of a model fitting procedure; see Section 2 for details of the set-up. An example of a resampling design is the leave-p-out estimator of the average prediction error; a recent preprint [5] exploits the fact that this estimator is a U-statistic to derive its properties. Consequently, it is asymptotically normally distributed under a very weak condition, namely that of an existing and non-vanishing asymptotic variance. In this work, we generalize from the leave-p-out estimator to general resampling designs such as cross-validation. We set up a very general condition on the resampling design that leads to consistency and a narrower one that leads to asymptotic normality. In a similar framework, consistency of cross-validation was shown in Yang [13]; the principal difference between our work and Yang's is that in Theorems 4 and 5 we treat the case of a fixed learning set size; hence, we do not subsume cross-validation in these theorems. A resampling design is a non-empirical datum in the sense that it can and should be specified by the experimenter before seeing the data.

2010 Mathematics Subject Classification. 62G05, 62G09, 62G10, 62G20.
Key words and phrases. U-statistic, cross-validation, limit theorem, design, model selection.
Supported by the German Science Foundation (DFG-Einzelförderung BO3139/2-2 to Anne-Laure Boulesteix).

Moreover, a design is, by nature, algebraic: its definition involves no probability or analysis. On the other hand, a Central Limit Theorem is a probabilistic-analytical statement and thus of quite a different nature. Suppose we are given a sequence of resampling designs in the sense that for every sufficiently large sample size n a collection of learning sets is specified. Then, the Central Limit Theorem may either hold or not. By specifying a sufficient criterion in this work, we pose the question of whether there is any necessary criterion on the design. This question seems to be very challenging. Likewise, it seems to be difficult to determine sufficient or necessary conditions on the resampling design for the Strong Law of Large Numbers, the Berry-Esseen theorem, and the Law of the Iterated Logarithm to be valid. Therefore, the present work raises certain interesting and challenging problems for further research, at the boundary between the combinatorial world of designs and the probabilistic-analytical world of limit theorems. Partial answers are given by the theory of incomplete U-statistics; however, that theory has only been developed thoroughly in the case of symmetric kernels. Here, in contrast, a resampling design is an incomplete U-statistic that is naturally associated with a non-symmetric kernel but usually not with a symmetric one (note that only complete U-statistics are always associated with symmetric kernels).

Usually, in statistical design theory, the experimental units are allocated to blocks, and each experimental unit leads to a response [2, Section 3.1, Equation (2)]. Here, the picture is quite different: our independent observations correspond to what are called the treatments; thus, a response is measured for each treatment. It seems that this viewpoint appears less frequently in statistics. Here, we point out the usefulness of statistical design theory to resampling. Although design theory has been used in resampling and variance estimation theory, previous papers seem to have focused on giving surveys [11], whereas we examine model fitting algorithms in general. Design theory and U-statistics also seem to have been examined in the case where the blocks are the evaluation indices of symmetric kernels. Here, we look at a very different scenario: the blocks are the indices of the learning sets, and the kernel is non-symmetric since it involves a learning set together with a testing observation. Likewise, the literature describing resampling procedures for model fitting in the language of U-statistics seems to be surprisingly sparse.

Let us now outline the main results in more detail. The problem that cross-validation suffers from high variance is well studied; further, approaches aiming at alleviating this are classical and treated in a vast literature. Recently, in Zhang and Qian [14], cross-validation designs akin to Latin hypercube designs in experimental design theory were proposed, and it was shown that such designs, although of a computational cost similar to that of cross-validation, have clearly smaller variance and are therefore generally preferable. Zhang and Qian [14, after Formula 12] give a variance decomposition of the average prediction error estimator associated with several particular designs; we will give the corresponding formula for any design. The same reference also contains an extensive overview of recent literature. Moreover, Fuchs et al. [5] outline that the leave-p-out prediction error estimator can be seen as a U-statistic and exploit this fact to deduce the existence of an approximately exact hypothesis test of the equality of two prediction errors. Since Fuchs et al. [5] is a preprint, we give a synopsis of that paper in Section 2.4. Thus, we aim to exploit the fact that any resampling procedure is an incomplete U-statistic and to view the results of Zhang and Qian [14] in the light of the variance calculation framework of U-statistics.

There is a general theory of incomplete U-statistic designs such that the variance of the incomplete U-statistic is as small as possible and, therefore, as close as possible to that of the leave-p-out estimator [9, Chapter 4]. Let us recall the fact that any complete U-statistic associated with a possibly non-symmetric kernel is simultaneously a U-statistic associated with a symmetric kernel, namely the symmetrization of the original kernel. Thus the theory of complete U-statistics is entirely covered by that for symmetric kernels. However, the picture is completely different for incomplete U-statistics. The reason is that if one defines an incomplete U-statistic just as an average of a symmetric kernel taken over a collection of subsets, then one misses a good deal of interesting statistics. Here, we will investigate a more general definition that calls any average of non-symmetric kernels an incomplete U-statistic. In contrast to a definition containing only symmetric kernels, we will have to perform optimization for non-symmetric kernels. The kernel defining the U-statistic which is the leave-p-out error estimator is genuinely non-symmetric. The associated symmetrization is the leave-one-out error estimator on a sample whose size is just one plus the original learning sample size. We are now faced with the difficulty that this kernel is computationally very unfortunate. Therefore, we set out to generalize the theory of incomplete U-statistics to that of non-symmetric kernels. However, we will do so just for the case of a mildly non-symmetric kernel such as ours; in fact, only a few summands are necessary in order to obtain a symmetric one. Summing up this point, it seems that the existing theories are restricted to the case of symmetric kernels. In contrast, a proper resampling procedure would not rely on a symmetric kernel, because there is no reason why small-variance procedures could be achieved with a symmetric kernel. Moreover, it seems very intuitive that the symmetrized form of the kernel leads to a very high ratio of variance to computational cost. In generalizing the theory of incomplete U-statistics to that of non-symmetric kernels, we give a conceptual approach to finding designs similar to the ad-hoc designs of Zhang and Qian [14], which were defined without any mention of U-statistics.

The main results of our paper concern a decomposition of the variance of any cross-validation-like procedure into a linear combination of four series of core covariances, generalizing the covariances appearing in Bengio and Grandvalet [3, Corollary 2]. Each of these, denoted by $\tau_d^{(i)}$ for $i = 1,\dots,4$, is a regular parameter and can therefore be estimated optimally by another U-statistic. The variance estimation of U-statistics has already been considered in the literature [10, 12]. The coefficients of the linear combination are polynomials of degree at most two that only depend on the sample size and the learning set size. Thus, they are known in advance of seeing the data and easily calculable. The decomposition is a significant generalization of the classical decomposition of Hoeffding [7, Formula 5.18] for the variance of a U-statistic to the case of an incomplete U-statistic associated with a symmetric kernel. The difference between our variance formula and Lee's is that ours extends over four series of covariances instead of just one.

It turns out that the variance expression thus attained is extremely difficult (or perhaps impossible) to minimize over all designs of a given size uniformly over all underlying probability distributions P. Therefore, we will turn to the asymptotic case of large sample size. Our main results are:

proving the existence of unbiased variance estimators for the core covariances (Corollary 1), the variance structure of cross-validation (Theorem 1 and (16)) and its estimation (Theorem 3), the analytical computation of the core covariances in a toy regression model in Section 4, the Central Limit Theorem 5 and the associated asymptotic test in (29), and the numerical computation of the estimators in a related regression model in Section 6.

The paper is structured as follows. In Section 2, we specify the set-up, Section 3 explains the variance decompositions, Section 4 presents an analytical computation of the core covariances, in Section 5 we define the variance estimators and show the Central Limit Theorem, and Section 6 illustrates our theory by means of a data example in which we compute the estimators numerically.

2. THE SET-UP

2.1. The loss estimator. The framework of the loss estimator is slightly more general than that underlying the largest part of the statistical literature. In the general framework, there is a univariate response variable Y ranging over a set $\mathcal{Y}$, and a multivariate predictor variable X ranging over $\mathcal{X}$ (both $\mathcal{X}$ and $\mathcal{Y}$ are assumed to be equipped with fixed σ-algebras). The joint distribution of (X, Y) is described by a probability measure P on $\mathcal{X} \times \mathcal{Y}$ equipped with the product σ-algebra. The quality of the prediction of Y is measured by a loss function $(y, y') \mapsto l(y, y')$. Typically, binary classification uses the misclassification loss $1_{y \neq y'}$, but we can also use any other measurable loss. Other loss functions include, for instance, the usual regression mean-square loss $(y - y')^2$ or a survival analysis loss after extending the loss function's domain of definition to censored observations. We fix a learning sample size g and then consider a statistical model fitting procedure in the form of a function
(1) $s : (\mathcal{X} \times \mathcal{Y})^g \times \mathcal{X} \to \mathcal{Y}$, $(x_1, y_1, \dots, x_g, y_g, x_{g+1}) \mapsto s(x_1, y_1, \dots, x_g, y_g; x_{g+1})$
which maps the learning sample $(x_1, y_1, \dots, x_g, y_g)$ to the prediction rule applied to the test observation $x_{g+1}$. Equivalently, s can be seen as mapping the learning sample to a classification rule, which is a map from predictors in $\mathcal{X}$ to responses in $\mathcal{Y}$. (Sometimes, $s(x_1, y_1, \dots, x_g, y_g; x_{g+1})$ is denoted by $\hat{f}(x_{g+1} \mid x_1, y_1, \dots, x_g, y_g)$ to describe a learned estimator $\hat{f}$ for a true model $f : \mathcal{X} \to \mathcal{Y}$.) Throughout the paper, we will assume that s treats all learning arguments equally, so that it is invariant under permutation of the first g arguments, and we assume that s is measurable with respect to the product σ-algebra on $(\mathcal{X} \times \mathcal{Y})^g \times \mathcal{X}$. The joint expectation of the loss function with respect to the (g+1)-fold product measure is
(2) $E(l(s)) = \int l(s(x_1, y_1, \dots, x_g, y_g; x_{g+1}), y_{g+1}) \, \mathrm{d}P(x_1, y_1) \cdots \mathrm{d}P(x_{g+1}, y_{g+1})$
and is called the unconditional loss of the model fitting procedure, where the left-hand side uses a slightly sloppy but unambiguous notation. It is of practical interest to estimate it, together with the difference $E(l_1(s_1)) - E(l_2(s_2)) = E(l_1(s_1) - l_2(s_2))$ for two model fitting procedures $s_1$ and $s_2$ and two loss functions $l_1$ and $l_2$.

Remark 1. E(l(s)) generalizes the usual mean square error in the sense that the loss function is arbitrary instead of being the quadratic loss, the true model is arbitrary instead of being of the particular form $Y = f(X) + \varepsilon$, the predictors X are random, and the expectation is taken with respect to the learning data as well.

Even if the true model is of the form $Y = f(X) + \varepsilon$ and the loss is quadratic, one cannot immediately obtain a bias-variance decomposition as in Hastie et al. [6, Formula 2.47], because taking the expectation jointly over testing and learning data, instead of just over the testing data, leads to covariance between $\hat{f}(X_{g+1})$ and $f(X_{g+1})$. The derivation of the bias-variance decomposition usually relies on ignoring this covariance by viewing the $X_i$ as non-random.

2.2. Estimators for the loss. Let us define
(3) $\Gamma(i_1, \dots, i_g; i_{g+1}) := l_1(s_1(x_{i_1}, y_{i_1}, \dots, x_{i_g}, y_{i_g}; x_{i_{g+1}}), y_{i_{g+1}}) - l_2(s_2(x_{i_1}, y_{i_1}, \dots, x_{i_g}, y_{i_g}; x_{i_{g+1}}), y_{i_{g+1}})$,
a function on a set of g+1 different indices $i_k \in \{1, \dots, n\}$, for two model fitting procedures $s_1, s_2$ and two appropriate loss functions $l_1, l_2$. We allow each model fitting procedure to have its own loss function because then the case $l_2 := 0$ yields the loss of a single procedure, which is of obvious practical interest. We have $E(\Gamma) = E(l_1(s_1) - l_2(s_2))$, and we define $\Theta = E\Gamma$ as a slight generalization of (2). The expectations are taken with respect to the (g+1)-fold product space of $\mathcal{X} \times \mathcal{Y}$ and are assumed to exist.

A resampling procedure is a collection of disjoint learning and test sets. For every pair of a learning set and a test observation one obtains an elementary estimator of the error rate. Averaging these across all learning and test sets of the resampling procedure defines an unbiased estimator for Θ. Quite often, another convention is used where such an estimator is seen as an approximation of the prediction error at another learning set size such as the total sample size; then, unbiasedness is of course lost. It is now of interest to gain insight into the variance of such an estimator. All expectations and variances are taken with respect to the (g+1)-fold product measure of P. The definition of Γ was such that the number g+1 of arguments is minimal under the restriction that $\Theta = E\Gamma$ for all underlying probability distributions such that this expectation exists. This minimality would be lost if the definition of Γ involved a test set size larger than one. Let T be a collection of pairs (S, a), where $S \subset \{1, \dots, n\}$ is an (unordered) set of learning indices disjoint from the test index $a \in \{1, \dots, n\} \setminus S$. Then, each Γ(S; a) is an elementary estimator of Θ, and we define
$\hat\Theta(T) := \frac{1}{|T|} \sum_{(S,a) \in T} \Gamma(S; a)$.
In simple cases, it is possible to compute Θ analytically. For instance, we will do so in Section 4.

2.3. Complete and incomplete U-statistics. This section summarizes some definitions and ideas from [7]. Let n denote the sample size. A U-statistic is a statistic of the form $U = \binom{n}{k}^{-1} \sum h(z_{i_1}, \dots, z_{i_k})$ for a symmetric function h of k vector arguments, where the summation extends over all possible subsets $(i_1, \dots, i_k)$. Since the number of such subsets is $\binom{n}{k}$, the expectation of U is equal to that of h with respect to the k-fold product measure of P, so U is an unbiased estimator of E(h). A regular parameter is a functional of the form $P \mapsto \int h \, \mathrm{d}^k P$. The minimal k such that there exists a symmetric function h with $E(h) = \Theta$ for all probability distributions P is called the degree of the U-statistic. Any such minimal function is called a kernel of U. If a non-symmetric function with that property exists, then, by symmetrization, a symmetric function exists. An important property of U-statistics is that they are the unique minimum variance unbiased estimator of the expected value Θ. Furthermore, the convergence of U towards Θ is controlled by precise theorems: the Laws of Large Numbers, the Law of the Iterated Logarithm, the Berry-Esseen theorem, and the Central Limit Theorem.
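To make the definition of $\hat\Theta(T)$ concrete, here is a minimal Python sketch (not from the paper) that evaluates elementary estimators Γ(S; a) and averages them over a design. The names `gamma`, `theta_hat`, and the generic `fit_predict`/`loss` callables are our own illustrative choices; intercept-only least squares with squared-error loss stands in for an arbitrary model fitting procedure.

```python
import numpy as np

def gamma(S, a, X, y, fit_predict_1, loss_1, fit_predict_2=None, loss_2=None):
    """Elementary estimator Gamma(S; a): loss of procedure 1 minus loss of
    procedure 2 (leave procedure 2 as None to estimate a single loss, l_2 := 0)."""
    val = loss_1(fit_predict_1(X[S], y[S], X[a]), y[a])
    if fit_predict_2 is not None:
        val -= loss_2(fit_predict_2(X[S], y[S], X[a]), y[a])
    return val

def theta_hat(design, X, y, fit_predict, loss):
    """Resampling estimator: the average of Gamma(S; a) over all pairs in the design."""
    return np.mean([gamma(list(S), a, X, y, fit_predict, loss) for S, a in design])

# Example: 2-fold cross-validation as a test-complete design on n = 6, g = 3.
n = 6
S1, S2 = [0, 1, 2], [3, 4, 5]
design = [(S1, a) for a in S2] + [(S2, a) for a in S1]

rng = np.random.default_rng(0)
X = rng.normal(size=(n, 1))
y = 1.0 + rng.normal(size=n)

# Intercept-only least squares as the fitting procedure, squared-error loss.
fit_predict = lambda Xl, yl, xt: np.mean(yl)
loss = lambda pred, truth: (pred - truth) ** 2
print(theta_hat(design, X, y, fit_predict, loss))
```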

An incomplete U-statistic is often defined in the literature as one associated with a symmetric kernel, namely as a sum of the form $K^{-1} \sum_{S \in \mathcal{S}} h(z_{S_1}, \dots, z_{S_k})$, where h is a symmetric function and $\mathcal{S}$ is a collection of k-subsets S. We write $|\mathcal{S}| =: K$ because this generalizes the corresponding nomenclature of K-fold cross-validation. Since h is symmetric, it suffices to extend the summation over collections of increasing subsets, and an evaluation of h is already determined by its evaluation on increasing indices: each subset S can be written as $S = (S_i)$ such that $1 \le S_1 < \dots < S_k \le n$. Here, we will consider statistics of the more general form $|\mathcal{R}|^{-1} \sum_{R \in \mathcal{R}} h(z_{R_1}, \dots, z_{R_k})$, where h is not necessarily symmetric, and therefore $\mathcal{R}$ is a collection of arbitrary ordered, but not necessarily increasing, subsets $R = (R_1, \dots, R_k)$. Variance-minimizing designs have been set up for incomplete U-statistics with symmetric kernels but not yet for those with not necessarily symmetric kernels. We will do so in the special case h = Γ. One could consider variance-minimizing designs associated with the symmetrization $\Gamma_0$ (as defined below), but the variance can be reduced further in the general case.

2.4. A test for the comparison of two average prediction errors. Here, we give a short, self-contained overview of the results of Fuchs et al. [5]. One defines
$\Gamma_0(1, \dots, g+1) := (g+1)^{-1} \sum_\pi \Gamma(\pi(1), \dots, \pi(g); \pi(g+1))$,
where the sum is taken over all g+1 cyclic permutations π of $1, \dots, g+1$, namely all permutations of the form $(1, \dots, g+1) \mapsto (q, \dots, g+1, 1, \dots, q-1)$ with $q \in \{1, \dots, g+1\}$. Then $\Gamma_0$ is the leave-one-out version of Γ, and $\Gamma_0$ is a symmetric function of g+1 vector arguments. Therefore, $\Gamma_0$ defines a U-statistic, and sorting out the terms shows that this U-statistic is the leave-p-out estimator of the error [1], where p := n − g (this definition holds for the rest of the paper). Likewise, $\Gamma_0$ is obtained from Γ by symmetrizing over all (g+1)! permutations; the sum then simplifies to the cyclic permutations because all learning observations are treated equally. Let $T_*$ or, when the sample size is needed, $T_{*,n}$ denote the maximal design, consisting of all $\binom{n}{g}(n-g)$ possible pairs (S; a). Then, the U-statistic associated with the symmetric kernel $\Gamma_0$ is $\hat\Theta(T_*)$, the leave-p-out estimator. An important consequence of identifying the leave-p-out estimator as a U-statistic is that it has minimal variance among all unbiased estimators of the error rate. Also, all of the many properties of U-statistics, such as asymptotic normality and so on, automatically apply to the leave-p-out estimator $\hat\Theta(T_*)$. We implicitly assume

Assumption 1. The degree of Θ is exactly g+1. Similarly, the degree of $\Theta^2$ is 2g+2.

Remark 2. It seems to be very hard to prove the first part of the assumption analytically, or to give numerical evidence. However, it seems very intuitive to assume that the true error cannot be achieved by a smaller learning set size than g, across all distributions P. The second part of the assumption is violated, for instance, if $\sigma_1^2 = 0$ (defined in (4)), which corresponds to the case that the U-statistic is degenerate. It is unclear whether the second part of the assumption can be violated if the U-statistic is non-degenerate.

Furthermore, it turns out that the variance of a U-statistic, trivially given by $E(U^2) - \Theta^2$, is another regular parameter and can therefore be estimated by a U-statistic. However, under Assumption 1, this variance has twice the degree of the underlying U-statistic, and therefore there is no unbiased estimator of the variance of the leave-p-out error estimator unless $n \ge 2(g+1)$. Therefore, the learning set size must be less than half the total sample size.
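The cyclic symmetrization $\Gamma_0$ and its complete U-statistic, the leave-p-out estimator, can be spelled out in a few lines. The sketch below is illustrative only: it assumes a two-argument callable `gamma(S, a)` (for instance, a closure over the data and procedures as in the previous sketch), and enumerating all $\binom{n}{g+1}$ subsets is of course feasible only for tiny n.

```python
import numpy as np
from itertools import combinations

def gamma_0(indices, gamma):
    """Symmetrized kernel Gamma_0: average of Gamma over the g+1 cyclic rotations
    of the index tuple; the last rotated index plays the test role."""
    idx = list(indices)          # g+1 sample indices
    m = len(idx)                 # m = g + 1
    vals = []
    for q in range(m):
        rot = idx[q:] + idx[:q]  # (q, ..., g+1, 1, ..., q-1)
        vals.append(gamma(rot[:-1], rot[-1]))
    return np.mean(vals)

def leave_p_out(n, g, gamma):
    """Leave-p-out estimator (p = n - g): the complete U-statistic of Gamma_0,
    averaged over all (g+1)-subsets of {0, ..., n-1}."""
    return np.mean([gamma_0(s, gamma) for s in combinations(range(n), g + 1)])
```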

However, under this constraint, studentization is possible because of the consistency of the variance estimator, the Laws of Large Numbers, and Slutsky's theorem. This leads to the fact that the standardized statistic $(\widehat{U^2 - \Theta^2})^{-1/2}\, U$ is approximately standard normal under the null hypothesis of equal losses, implying that there is an approximately exact test for the comparison of the losses of two statistical procedures [5].

3. THE CORE COVARIANCES AND THEIR THEORETICAL PROPERTIES

In the following, we will generalize the variance decomposition of Bengio and Grandvalet [3, Formula (7)] to arbitrary designs. Thus, we will derive the general formula for the variance of a resampling procedure. In particular, we will take advantage of the fact that the large number of covariance terms occurring in the variance of a resampling procedure reduces to a few core covariance terms, which we will call $\tau_d^{(i)}$. Our goal is the variance decompositions in formulas (14) and (19). These are variance decompositions of incomplete U-statistics associated with only partially symmetric kernels. In the particular case where the kernel is symmetric (which does not happen for kernels of the form (3)), we recover part of the variance decomposition of incomplete U-statistics as in Lee [9, Chapter 4]. However, it is quite important to note that our variance decomposition (19) is somewhat analogous to, but does not reduce to, the variance decomposition of Lee [9, Chapter 4, Formula (2)]. In fact, our quantities $B_\gamma$ only refer to the learning sets and are therefore different from Lee's $B_\gamma$'s.

Definition 1. Let $S = \{1, \dots, g\}$, $a = g+1$, $S' = \{g+2, \dots, 2g+1\}$, $a' = 2g+2$. Then, the functional $\Theta^2$ is defined by
$\Theta^2(P) = \int \Gamma(S; a)\,\Gamma(S'; a')\, \mathrm{d}^{2g+2}P(Z_1, \dots, Z_{2g+2})$.
This is a regular parameter of degree at most 2g+2. In the case that Θ is degenerate (meaning that $\sigma_1 = 0$ for all P, where $\sigma_1$ is defined in (4)), $\Theta^2 = E(\Gamma_0(1, \dots, g+1)\,\Gamma_0(g+1, \dots, 2g+1))$ and therefore it is of smaller degree; it seems reasonable to assume that this is the only way $\Theta^2$ can have smaller degree.

3.1. The four series: definition. Let us now consider products of two evaluations of Γ where the index sets overlap in d indices, but there is either no overlap involving the test indices, or one test observation occurs in the learning observations of the other, or both test observations occur in the other's learning set, or both test observations coincide. These four cases are illustrated in Figure 1 and describe all possible configurations.

Definition 2.
$\tau_d^{(i)} := -\Theta^2 + \begin{cases} E(\Gamma(1,\dots,g;\,g+1)\,\Gamma(1,\dots,d,\,g+2,\dots,2g+1-d;\,2g+2-d)), & i = 1 \\ E(\Gamma(1,\dots,g;\,g+1)\,\Gamma(1,\dots,d-1,\,g+1,\dots,2g+1-d;\,2g+2-d)), & i = 2 \\ E(\Gamma(1,\dots,g;\,g+1)\,\Gamma(1,\dots,d-2,\,g+1,\dots,2g+2-d;\,d-1)), & i = 3 \\ E(\Gamma(1,\dots,g;\,g+1)\,\Gamma(1,\dots,d-1,\,g+2,\dots,2g+2-d;\,g+1)), & i = 4 \end{cases}$
for $d = 1, \dots, g+1$, and the exceptional cases $\tau_0^{(i)} = 0$ for all i, and $\tau_1^{(3)} = \tau_{g+1}^{(1)} = \tau_{g+1}^{(2)} = 0$.
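The case distinction of Definition 2 (and of Observation 1 and Figure 1 below) amounts to a simple classification of a pair of evaluation tuples by their total overlap size d and by how the two test indices relate to the other tuple. A hypothetical helper `overlap_pattern` (our own naming, not from the paper) illustrating this classification:

```python
def overlap_pattern(S, a, S2, a2):
    """Classify the overlap between two evaluation tuples (S, a) and (S2, a2):
    returns (d, i) with d = |(S ∪ {a}) ∩ (S2 ∪ {a2})| and i in {1, ..., 4}
    the case of Definition 2 / Figure 1."""
    S, S2 = set(S), set(S2)
    d = len((S | {a}) & (S2 | {a2}))
    if a == a2:
        i = 4                      # the two test observations coincide
    elif a in S2 and a2 in S:
        i = 3                      # each test point lies in the other learning set
    elif a in S2 or a2 in S:
        i = 2                      # exactly one test point lies in the other learning set
    else:
        i = 1                      # test points are disjoint from the other tuple
    return d, i

# Example with g = 4: a type-2 pattern with total overlap d = 3.
print(overlap_pattern({1, 2, 3, 4}, 5, {1, 2, 5, 6}, 7))   # -> (3, 2)
```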

[Figure 1: schematic of the four overlap cases (a)-(d).] Figure 1. Let S, a and S', a' be any pair of g-subsets S and S', with $a \notin S$ and $a' \notin S'$. Then Cov(Γ(S; a), Γ(S'; a')) depends only on which of the four cases describes the overlap pattern. Here: example for d = 5.

Remark 3. Therefore, the quantity σ² from Bengio and Grandvalet [3] appears in this classification as $\tau^{(4)}_{n - n/K + 1}$, where n is the total sample size and K is the number of blocks of cross-validation; their ω is our $\tau^{(1)}_{n - n/K}$ and their γ is our $\tau^{(3)}_{n - 2n/K + 2}$. The seemingly more complicated nomenclature, involving lower indices, allows for the treatment of any resampling procedure instead of only cross-validation.

Notational Convention 1. Throughout this work, we denote the total overlap size $|(S \cup \{a\}) \cap (S' \cup \{a'\})|$ between two evaluation tuples by the letter d, and the overlap $|S \cap S'|$ between two learning sets by the letter c. The interest in these quantities is that any occurring covariance between evaluations of Γ is equal to one of the $\tau_d^{(i)}$. Note that there is an astronomical number of possible pairs of evaluations of Γ, but there are only 4g+1 quantities $\tau_d^{(i)}$ unequal to zero.

Observation 1. Let S, a and S', a' be any pair of g-subsets S and S', with $a \notin S$ and $a' \notin S'$. Then $\mathrm{Cov}(\Gamma(S;a), \Gamma(S';a')) = \tau^{(i)}_{|(S\cup\{a\}) \cap (S'\cup\{a'\})|}$ for the $i = 1, \dots, 4$ that describes the overlap pattern. This is obvious from the fact that Γ is symmetric in the learning indices and the product measure $\mathrm{d}^n P$ is permutation invariant.

3.2. $\sigma_d^2$ as a linear combination of the core covariances. Let us define
(4) $\sigma_d^2 := E(\Gamma_0(1, \dots, g+1)\,\Gamma_0(g+2-d, \dots, 2g+2-d)) - \Theta^2$
for $d = 1, \dots, g+1$.

($\sigma_d^2$ is called $\zeta_d$ in Hoeffding [7].) Thus, $\sigma_d^2$ measures the covariance between two symmetrized kernels whose overlap has size d. By a short computation, these numbers can be seen to be conditional variances; hence they are non-negative, and it is justified to define them as squares. By plugging in the definition of $\Gamma_0$ and expanding the sum, we arrive at the following expression in terms of the four series:
(5) $\sigma_d^2 = \frac{1}{(g+1)^2}\left( (g+1-d)^2\, \tau_d^{(1)} + 2d(g+1-d)\, \tau_d^{(2)} + d(d-1)\, \tau_d^{(3)} + d\, \tau_d^{(4)} \right)$.
In particular, we see that the right-hand side must be non-negative. The asymptotic variance of the complete U-statistic, the leave-p-out estimator, is $(g+1)^2 \sigma_1^2 / n$ [7, 5.23] (recall that p = n − g). So, the limiting variance is
(6) $\lim n\,V(\hat\Theta(T_{*,n})) = g^2\, \tau_1^{(1)} + 2g\, \tau_1^{(2)} + \tau_1^{(4)}$,
where the limit is taken for g fixed.

3.3. Possible values for $\tau_d^{(i)}$. Furthermore, plugging (5) into the inequality
(7) $\sigma_d^2 / d \le \sigma_e^2 / e$ for $d \le e$
[7, 5.19] puts constraints on the $\tau_d^{(i)}$, on top of those from Bengio and Grandvalet [3, Section 6]. Some of the $\tau_d^{(i)}$ can be identified as variances. Let us define the conditional expectations
$\Gamma^{(i),d}(z_1, \dots, z_d) := \begin{cases} E(\Gamma(1,\dots,g;\,g+1) \mid Z_1 = z_1, \dots, Z_d = z_d) & \text{if } i = 1 \\ E(\Gamma(1,\dots,g;\,g+1) \mid Z_1 = z_1, \dots, Z_{d-1} = z_{d-1}, Z_{g+1} = z_d) & \text{if } i = 4. \end{cases}$
Under favorable regularity conditions on f and P, for instance those analogous to the ones stated in connection with Cramér [4, 21.1.5], these functions possess the more explicit form
$\Gamma^{(i),d}(z_1, \dots, z_d) = \begin{cases} \int G(z_1, \dots, z_d, Z_{d+1}, \dots, Z_g;\, Z_{g+1})\, \mathrm{d}P(Z_{d+1}) \cdots \mathrm{d}P(Z_g)\, \mathrm{d}P(Z_{g+1}), & i = 1 \\ \int G(z_1, \dots, z_{d-1}, Z_d, \dots, Z_g;\, z_d)\, \mathrm{d}P(Z_d) \cdots \mathrm{d}P(Z_g), & i = 4, \end{cases}$
where we use the function G as a slightly different notation for Γ, namely $G(Z_1, \dots, Z_g; Z_{g+1}) := \Gamma(1, \dots, g+1)$.

Let us denote the $b \times b$ matrix implementing the binomial transform by P or P(b); thus $P_{ij} = (-1)^{i+j}\binom{i}{j}$ for $1 \le j \le i \le b$ and $P_{ij} = 0$ for $j > i$. For any U-statistic U, let us denote the associated quantities $\sigma_d$ and $\delta_d$ from [7, 5.27] by $\sigma_d(U)$ and $\delta_d(U)$, respectively. We then have $\delta(U) = P\sigma(U)$ as vectors with index $d = 1, \dots, \deg(U)$. The following lemma puts strong constraints on the possible values of $\tau_d^{(1)}$ and $\tau_d^{(4)}$; these constraints do not appear in the treatment by Bengio and Grandvalet [3, Section 6].

Lemma 1. For any $d = 1, \dots, g$, the quantity $\tau_d^{(1)}$ is the quantity $\sigma_d^2(U^{(1)})$ of the U-statistic $U^{(1)}$ associated with the kernel $\Gamma^{(1),g}$ of degree g. Consequently,
(8) $\tau_d^{(1)} = V(\Gamma^{(1),d}) \ge 0$,
and
(9) $\tau_d^{(1)}/d \le \tau_e^{(1)}/e$ for any $1 \le d < e \le g$,
and
(10) $P\tau^{(1)} = P\sigma(U^{(1)}) = \delta(U^{(1)}) \ge 0$ for the binomial matrix $P = P(g)$.

For type (4), one has only
(11) $\tau_d^{(4)} = V(\Gamma^{(4),d}) \ge 0$.

Proof. $\Gamma^{(1),g}$ is a function of g arguments. Plugging in the random variables $Z_i$, we arrive at a random variable $\Gamma^{(1),g}(Z_1, \dots, Z_g)$. Using the fact that $\Gamma^{(1),g}$ is symmetric, one obtains $\sigma_d^2(U^{(1)}) = \mathrm{Cov}(\Gamma^{(1),g}(Z_1, \dots, Z_g),\, \Gamma^{(1),g}(Z_1, \dots, Z_d, Z_{g+1}, \dots, Z_{2g-d}))$. Writing this covariance as the difference of the expectation of the product and the product of the expectations, the first term is seen to have overlap pattern of type (d, (1)), and the second is $\Theta^2$; thus the first claim. The claim on type (4) follows analogously. □

In contrast, no such assertion seems to hold for the series (2) and (3), and there seems to be no positivity statement. Also, there seems to be no reason why $\tau_d^{(4)}$ could be identified with the quantities $\sigma_d$ of some U-statistic, nor why $P\tau^{(4)}$ should be non-negative. In contrast, Bengio and Grandvalet [3] also give constraints that are not covered by Lemma 1.

Lemma 2.
(1) $\tau_d^{(4)} \ge \tau_{d-1}^{(1)}$ for all d.
(2) $\tau_d^{(4)}/2 \le \tau_d^{(3)} \le 2\,\tau_d^{(4)}$ for all d.
(3) $\tau_d^{(i)} \le \tau_{g+1}^{(4)}$ for all d and i.

Proof. The first two statements follow from plugging in all possible values for n and K into Bengio and Grandvalet [3, Lemma 8]. The third statement is the Cauchy-Schwarz inequality. □

3.4. Variance decomposition of incomplete U-statistics. Let us turn our attention to the general incomplete U-statistic associated with a collection T of pairs (S, a) of a learning set S and a test observation $a \notin S$. We will briefly denote the overlap size and type of pattern by $\Psi((S,a),(S',a')) = (d,(i))$ when $|(S\cup\{a\}) \cap (S'\cup\{a'\})| = d$ and the type is (i), and will then write $\tau(\Psi((S,a),(S',a')))$ instead of indicating the type of the overlap pattern with lower and upper indices. The variance of the cross-validation-like procedure associated with the collection T is
(12) $V(\hat\Theta(T)) = |T|^{-2} \sum_{i,j} \mathrm{Cov}(\Gamma(S_i; a_i), \Gamma(S_j; a_j)) = |T|^{-2} \sum_{i,j} \tau(\Psi((S_i,a_i),(S_j,a_j)))$,
which is convenient because the last sum can be written as a linear combination with far fewer summands, since many summands take the same value.

3.5. Variance decomposition of test-complete designs.

Definition 3. (1) Consider the following linear combination of the $\tau_d^{(i)}$:
(13) $\xi_c := (n-2g+c)(n-2g+c-1)\,\tau_c^{(1)} + 2(g-c)(n-2g+c)\,\tau_{c+1}^{(2)} + (g-c)^2\,\tau_{c+2}^{(3)} + (n-2g+c)\,\tau_{c+1}^{(4)}$
for all $c = 0, \dots, g$, where we define $\tau_{g+2}^{(3)} := 0$.

(2) Furthermore, let us call a design T test-complete whenever the following holds: $(S,a) \in T \Rightarrow (S,b) \in T$ for any $b \notin S$. In words, a design is test-complete whenever it contains, together with a learning set S, the combinations of S with all possible test observations. Note that a test-complete design is uniquely specified by the learning sets it contains. Whenever a test-complete design T is specified by the collection of learning sets it contains, we will write $\mathcal{S}$ for the collection of learning sets, where each learning set S is counted only once even if it occurs in several pairs (S, a). Thus, $|T| = K(n-g)$ (of course, we suppose $\mathcal{S}$ to contain each learning set only once).

(3) Let T be a test-complete design. For any $c = 0, \dots, g$, let $f_c^l \in \mathbb{N}_0$ be the number of ordered pairs of learning sets (S, S'), both occurring in T, such that $|S \cap S'| = c$. Pairs (S, S) with the same learning set occurring twice are also allowed (here l is a mere symbol, not an index).

For instance, any cross-validation design is test-complete. The same holds for the complete design defining the leave-p-out estimator. For any test-complete design, the associated numbers $f_c^l$ are easily computable. For instance, they are given by the number of entries equal to c in $N^T N$, where N is the incidence matrix of the learning sets occurring in the design. Obviously, only test-complete designs seem to be relevant in practice because of the low computational cost of evaluating the loss function for a given model and given test observations.

Theorem 1. Let T be a test-complete design and let $\mathcal{S}$ be the associated collection of learning sets. Then, the variance of the error estimator satisfies
(14) $V(\hat\Theta(T)) = |T|^{-2} \sum_{c=0}^{g} f_c^l\, \xi_c$,
where $\xi_c$ was defined in (13).

Proof. This follows from expanding the variance as in (12) into the form $|T|^{-2}$ multiplied by the sum of all entries of the $|T| \times |T|$ covariance matrix between the non-rescaled summands of $\hat\Theta(T)$, and counting the terms. Each entry of the covariance matrix is described by two pairs (S, a), (S', a') and therefore defines a specific type (1), ..., (4) of the overlap pattern between (S, a) and (S', a') and a particular overlap size $d = |(S\cup\{a\}) \cap (S'\cup\{a'\})|$. Any two summands of the same type (i) and the same overlap size are equal, namely $\tau_d^{(i)}$. Now, counting and summing up all such terms with learning overlap size c, one obtains $\xi_c$. This implies the result. □

Minimization of the expression $\sum_c f_c^l\, \xi_c$ seems to be very hard in practice. However, we will outline below a few cases where this task is feasible.

Example 1 (Variance of cross-validation). Let us assume that n is divisible by K, and that therefore the learning sets have size $g = n - n/K$. We then arrive at the following. For K-fold cross-validation, $K \ge 2$, we count
(15) $f_c^l = \begin{cases} 0, & c \notin \{n - n/K,\; n - 2n/K\} \\ K, & c = n - n/K \\ K^2 - K, & c = n - 2n/K. \end{cases}$
The variance of cross-validation is given by the formula
(16) $V(\hat\Theta(T)) = (K^{-1} - n^{-1})\,\tau^{(1)}_{n-n/K} + (1 - K^{-1})\,\tau^{(3)}_{n-2n/K+2} + n^{-1}\,\tau^{(4)}_{n-n/K+1}$.
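The counts $f_c^l$ of (15), and hence the variance formula (14), are easy to obtain for any test-complete design from the incidence matrix N of its learning sets, as mentioned above. A small illustrative sketch (the function name `f_counts` is ours) that reproduces (15) for K-fold cross-validation:

```python
import numpy as np
from collections import Counter

def f_counts(learning_sets, n):
    """Counts f_c^l: for each learning-set overlap size c, the number of ordered
    pairs (S, S') with |S ∩ S'| = c, computed via the incidence matrix N."""
    K = len(learning_sets)
    N = np.zeros((n, K), dtype=int)
    for j, S in enumerate(learning_sets):
        N[list(S), j] = 1
    overlaps = N.T @ N                    # entry (i, j) is |S_i ∩ S_j|
    return Counter(overlaps.ravel().tolist())

# K-fold cross-validation learning sets for n divisible by K: g = n - n/K.
n, K = 12, 3
folds = [set(range(i * n // K, (i + 1) * n // K)) for i in range(K)]
learning_sets = [set(range(n)) - fold for fold in folds]

print(f_counts(learning_sets, n))
# Expected from (15): {n - n/K: K, n - 2n/K: K^2 - K}, i.e. {8: 3, 4: 6} here.
```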

In the case K = 2, we obtain the expression
(17) $V(\hat\Theta(T)) = \frac{1}{n}\left( (n/2 - 1)\,\tau^{(1)}_{n/2} + \frac{n}{2}\,\tau^{(3)}_{2} + \tau^{(4)}_{n/2+1} \right)$.
Since it is unclear whether and how fast the $\tau_d^{(i)}$ converge to zero, one cannot immediately deduce asymptotic statements from (17).

3.6. Non-asymptotic minimization of $\sum_c f_c^l\, \xi_c$.

Definition 4. Let T be a test-complete design. For $\gamma = 1, \dots, g$ and a subset $s \subset \{1, \dots, n\}$ such that $|s| = \gamma$, let n(s) be the number of learning sets S in the design (where each single learning set is counted only once) such that $s \subset S$. Let $B_\gamma^l := \sum_s n(s)^2$, where the sum is taken over all $\binom{n}{\gamma}$ subsets s. Analogously, let $B_0^l := K^2 = |T|^2 (n-g)^{-2} = \sum_{c=0}^g f_c^l$.

Lemma 3. The quantities $f_c^l$ are uniquely determined by the $B_\gamma^l$. In fact, $f_c^l = \sum_{\gamma=c}^{g} (-1)^{\gamma-c} \binom{\gamma}{c} B_\gamma^l$ for all $0 \le c \le g$.

Proof. For $1 \le c \le g$, the proof proceeds in complete analogy to the proof of Lee [9, Chapter 4, Equation (7)], even though our $f_c^l$ and $B_\gamma^l$ are quite different from Lee's $f_c$ and $B_\gamma$. For c = 0, one has
$f_0^l = \sum_{c=0}^g f_c^l - \sum_{c=1}^g f_c^l = B_0^l - \sum_{c=1}^g \sum_{\gamma=c}^g (-1)^{\gamma-c}\binom{\gamma}{c} B_\gamma^l = B_0^l + \sum_{\gamma=1}^g (-1)^\gamma B_\gamma^l = \sum_{\gamma=0}^g (-1)^\gamma B_\gamma^l$,
using that $\sum_{c=1}^{\gamma} (-1)^c \binom{\gamma}{c} = -1$ for all $\gamma \ge 1$. □

Let us write this result in the form $f^l = PB^l$ for the upper-triangular matrix P defined by $P_{c,\gamma} = (-1)^{\gamma-c}\binom{\gamma}{c}$ for all $0 \le c \le \gamma \le g$ and $P_{c,\gamma} = 0$ for $\gamma < c$, where $\binom{\gamma}{0} := 1$ for all $\gamma \ge 0$. (The map described by the matrix P is often called the binomial transform.) Using (14), we can now write $V(\hat\Theta(T)) = |T|^{-2} \langle f^l, \xi \rangle = |T|^{-2} \langle PB^l, \xi \rangle = |T|^{-2} \langle B^l, P^T \xi \rangle$. For this reason, we consider the binomial transformation $P^T \xi$ of the vector ξ separately:

Definition 5.
(18) $\alpha_\gamma := \sum_{c=0}^{\gamma} (-1)^{\gamma-c} \binom{\gamma}{c}\, \xi_c$ for all $0 \le \gamma \le g$.

Thus, we have shown that
(19) $V(\hat\Theta(T)) = |T|^{-2} \sum_{\gamma=0}^{g} B_\gamma^l\, \alpha_\gamma$,
and in order to minimize this, we have to maximize those $B_\gamma^l$ for which $\alpha_\gamma$ is negative, and minimize those for which it is positive. This stands in contrast to the classical case where all $B_\gamma$ have to be minimized. The usefulness of (19) lies in the fact that in the case that $\xi_c$ is a polynomial of small degree in c, all $\alpha_\gamma$ vanish when γ is greater than the polynomial's degree, because $\sum_{c=0}^{\gamma} (-1)^c \binom{\gamma}{c}\, c^d = 0$ for any $d < \gamma$. In Section 4, we will exhibit a case where $\xi_c$ is a polynomial of degree two, and in Section 6 we will give numerical evidence that the $\xi_c$ can more often be approximated well by a quadratic polynomial. Precisely, if ξ is of degree one, we have $\xi_c = b + Ac$ and then $\alpha_0 = b$, $\alpha_1 = A$, and $\alpha_\gamma = 0$ for $\gamma \ge 2$. If ξ is of degree two, we have $\xi_c = b + Ac + Cc^2$, and then it is easy to calculate that
(20) $\alpha_0 = b$, $\quad \alpha_1 = A + C$, $\quad \alpha_2 = 2C$, $\quad \alpha_\gamma = 0$ for all $\gamma \ge 3$.
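The binomial transform of Definition 5 and the vanishing pattern (20) can be checked numerically; the following sketch (our own, with an arbitrary quadratic $\xi_c$ as an assumed input) computes $\alpha_\gamma$ from a given vector $(\xi_0, \dots, \xi_g)$:

```python
import numpy as np
from math import comb

def alpha_from_xi(xi):
    """Binomial transform of Definition 5:
    alpha_gamma = sum_{c=0}^{gamma} (-1)^(gamma-c) * C(gamma, c) * xi_c."""
    g = len(xi) - 1
    return np.array([sum((-1) ** (gam - c) * comb(gam, c) * xi[c]
                         for c in range(gam + 1))
                     for gam in range(g + 1)])

# A quadratic xi_c = b + A c + C c^2 should give alpha = (b, A + C, 2C, 0, ...), as in (20).
g, b, A, C = 6, 1.5, -0.7, 0.3
xi = np.array([b + A * c + C * c ** 2 for c in range(g + 1)])
print(alpha_from_xi(xi))   # -> [1.5, -0.4, 0.6, 0, 0, 0, 0] up to rounding
```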

4. ANALYTICAL COMPUTATION OF THE CORE COVARIANCES IN A TOY INTERCEPT ESTIMATION MODEL

Let us consider the following simple example. The random variable X is univariate and distributed according to some unknown distribution $P_X$, and the joint distribution of (Y, X) is given by the simple model $Y = \beta_0 + \beta_1 X + \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, v)$ with $\beta_0$ and v unknown. This model is a close relative of the one used in univariate ordinary regression, where the slope coefficient $\beta_1$ is known and only the intercept is estimated. We show the following facts in the supplement 6.4: there is an explicit formula for the kernel Γ; the $\tau_c^{(i)}$ are quadratic polynomials in c, which we write down; consequently, $\xi_c$ is a quadratic polynomial in c as well, and $\alpha_\gamma$ is non-zero if and only if $\gamma = 0, 1, 2$. A first consequence is that, by (19), two designs have the same variance as soon as they have the same $B_0$, $B_1$ and $B_2$.

Example 2. The calculations of 6.4 can be used to compare the variance of cross-validation with that of the leave-p-out estimator in closed-form expressions. The full U-statistic associated with the kernel Γ, i.e., the leave-p-out estimator on a sample of size n, is equal to $\hat{v}(1 + g^{-1})$. Since $\hat{v} \sim v(n-1)^{-1}\chi^2_{n-1}$, the leave-p-out estimator is distributed as the $v(1+g^{-1})(n-1)^{-1}$-fold of a chi-square with n−1 degrees of freedom. Therefore, the variance of the leave-p-out estimator is $V\big(v(1+g^{-1})(n-1)^{-1}\chi^2_{n-1}\big) = v^2(1+g^{-1})^2(n-1)^{-2}\cdot 2(n-1) = 2v^2(1+g^{-1})^2(n-1)^{-1}$. This is consistent with the fact that, by (6) and (5), we have $\lim n\,V(\hat\Theta(T_*)) = 2(g^{-2} + 2g^{-1} + 1)v^2$, which could also be derived from the expression $V(\hat\Theta(T)) = |T|^{-2}\sum_\gamma B_\gamma \alpha_\gamma$. So, we have derived the rescaled limiting variance of the leave-p-out estimator in three ways. In contrast, for the design $T_{2\text{-CV}}$ describing two-fold cross-validation (g = n/2), we obtain by (17):
(21) $V(\hat\Theta(T_{2\text{-CV}})) = 2v^2\left[ n^{-1} + 14\,n^{-2} \right]$.
Thus, the ratio $V(\hat\Theta(T_{2\text{-CV}}))/V(\hat\Theta(T_{*,n}))$ is one for n = 2, tends to one for $n \to \infty$, and attains its maximum, 25/16, for n = 6. Note that here g and n both tend to infinity, in contrast to the rest of the paper:
(22) $\lim n\,V(\hat\Theta(T_*)) = \lim n\,V(\hat\Theta(T_{2\text{-CV}})) = 2v^2$.
Also, one can check that $V(\hat\Theta(T_*)) \le V(\hat\Theta(T_{2\text{-CV}}))$ for all n, as it should be. Let us go back to the usual scenario of fixed g, and let n tend to infinity. Thus, in this example, the limiting variance 2v² agrees in all three cases: where g is fixed, the case of cross-validation with g = n/2, and the leave-p-out case with g = n/2.

Note also that minimizing $\sum_\gamma B_\gamma \alpha_\gamma$ involves only three non-zero summands, whereas $\sum_c f_c^l\, \xi_c$ involves g+1 summands. Therefore, the minimization problem's dimensionality is drastically reduced when passing from the $\xi_c$ to the $\alpha_\gamma$. Let us now show how to apply our calculations to the variance minimization problem. Say we are given fixed values for n, g and K. The problem is to find a design that minimizes the expression $B_1\alpha_1 + B_2\alpha_2$, because the pre-factor as well as the summand corresponding to $\gamma = 0$ in (19) can be ignored, since they are determined by the pre-set quantity K.
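Before turning to the minimization problem, the closed-form comparison of Example 2 can be sanity-checked numerically. The sketch below (our own) simply evaluates the two variance formulas derived above, namely the leave-p-out variance $2v^2(1+g^{-1})^2(n-1)^{-1}$ with g = n/2 and formula (21) for two-fold cross-validation, and confirms that their ratio equals 1 at n = 2 and peaks at 25/16 for n = 6:

```python
def var_leave_p_out(n, v):
    """Leave-p-out variance in the toy intercept model with g = n/2:
    2 v^2 (1 + 1/g)^2 / (n - 1), as derived in Example 2."""
    g = n / 2
    return 2 * v**2 * (1 + 1 / g) ** 2 / (n - 1)

def var_two_fold_cv(n, v):
    """Two-fold cross-validation variance in the toy model, formula (21)."""
    return 2 * v**2 * (1 / n + 14 / n**2)

v = 1.0
ratios = {n: var_two_fold_cv(n, v) / var_leave_p_out(n, v) for n in range(2, 21, 2)}
print(ratios[2], ratios[6], max(ratios.values()))   # 1.0, 1.5625 (= 25/16), 1.5625
```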

Let us assume that each observation occurs in the same number of learning sets. This is analogous to the usual restriction to equireplicate designs as in Lee [9, Section 4.3.2], and we also call such designs equireplicate, even though we are only referring to the learning sets. In such designs, the condition $B_1 = K^2 g^2 / n$ is imposed. Thus, only $B_2$ remains as a degree of freedom in the optimization, eliminating any trade-off between competing components. Since $\alpha_2 > 0$, $B_2$ has to be minimized. Summing up the results of this section, we have shown the following:

Theorem 2. In the intercept estimation model of this section, all equireplicate designs for fixed n, g and K that have the same $B_2$ have the same variance. Any equireplicate design with minimal $B_2$ among all equireplicate designs achieves the minimal variance among all equireplicate designs with the same n, g, and K. Assuming that the configuration of n, g, and K allows for the existence of a Balanced Incomplete Block Design (see Definition 7), any Balanced Incomplete Block Design with these n, g, and K is a design with minimal variance among all equireplicate designs with these n, g, and K.

Proof. It only remains to show the last assertion. This is done in complete analogy to the proof of Lee [9, Chapter 4, Theorem 1]. □

For instance, for g = 2, $B_2$ is bound to be equal to K, and therefore all equireplicate designs have the same variance. Another simple example is the leave-one-out case g = n−1, K = n. Then, the minimality of the design's variance turns out to be the minimality of a symmetric Balanced Incomplete Block Design's variance. Since the $\alpha_\gamma$, unlike those in the classical context, can happen to be negative, one might ask whether there exists a configuration (n, g, K) such that an equireplicate design exists but a non-equireplicate design has smaller variance than the best equireplicate one. Such a non-equireplicate design would then maximize $B_1$ instead of minimizing $B_2$. Thus, it would be, in some sense, the opposite of an equireplicate design. It seems that whenever $\xi_c$ is a polynomial in c of small degree, arguments similar to those in this section can be used to determine equireplicate minimal-variance designs in a non-empirical way.

5. A GENERAL CENTRAL LIMIT THEOREM AND A HYPOTHESIS TEST

5.1. The core covariances as regular parameters and their estimation. Let us recall that a linear combination of regular parameters is a regular parameter [7, middle of page 295]. This allows us to split off the common regular parameter $\Theta^2$ from an integral that is specific to the overlap pattern, in the following sense: each quantity $\tau_d^{(i)}$ can be written as
$\tau_d^{(i)}(P) = \int \Gamma(S;a)\,\Gamma(S';a')\, \mathrm{d}^{2g+2-d}P(Z_1, \dots, Z_{2g+2-d}) - \Theta^2(P)$,
where the overlap pattern between (S, a) and (S', a') is of type (i) and $|(S\cup\{a\}) \cap (S'\cup\{a'\})| = d$. Note that this implies that $|S\cup\{a\}\cup S'\cup\{a'\}| = 2g+2-d$, so the number of integrals in the first summand is correctly specified. Therefore, $\tau_d^{(i)}$ is a regular parameter of degree at most 2g+2.

Lemma 4. If $\Theta^2$ has degree 2g+2, then $\tau_d^{(i)}$ is a regular parameter of degree exactly 2g+2.

Proof. Assume there were a function f of only 2g+1 arguments such that $\tau_d^{(i)}(P) = \int f \, \mathrm{d}^{2g+1}P$ for all P.

This covers the general case because if there were a function with even fewer arguments, then there would also be a function with 2g+1 arguments, defined by ignoring the additional ones. Then, by linearity, $\Gamma(S;a)\Gamma(S';a') - f$ would be a kernel of degree 2g+1 for $\Theta^2$, in contradiction to Assumption 1. □

Definition 6. We define statistics $\hat\tau_d^{(i)}$ as the U-statistics associated with these regular parameters.

As with all U-statistics, they satisfy several optimality properties.

Corollary 1. The estimators $\hat\tau_d^{(i)}$ are the minimal variance unbiased estimators for $\tau_d^{(i)}$. They are consistent and satisfy the Weak and Strong Law of Large Numbers.

Lemma 5. Since $\Theta^2$, $\xi_c$ and $\alpha_\gamma$ are regular parameters, there exist U-statistics $\widehat{\Theta^2}$, $\hat\xi_c$, and $\hat\alpha_\gamma$. Let us abbreviate the U-statistic for the regular parameter $V(\hat\Theta(T))$ by $[\widehat{V(\hat\Theta(T))}]$. Then
$[\widehat{V(\hat\Theta(T))}] = (\hat\Theta(T))^2 - \widehat{\Theta^2} = |T|^{-2}\sum_{c=0}^{g} f_c^l\, \hat\xi_c = |T|^{-2}\sum_{\gamma=0}^{g} B_\gamma\, \hat\alpha_\gamma$.
Likewise, the estimators $\hat\xi_c$ satisfy the empirical analog of (13). This is in analogy to
$V(\hat\Theta(T)) = E[(\hat\Theta(T))^2] - \Theta^2 = |T|^{-2}\sum_{c=0}^{g} f_c^l\, \xi_c = |T|^{-2}\sum_{\gamma=0}^{g} B_\gamma\, \alpha_\gamma$.

Proof. This fact is not obvious but can be checked by straightforward computation. □

5.2. Variance estimation of cross-validation. In the following, we refer to a situation where n observations are used to carry out cross-validation, and there exist n' so-called extra observations such that the total number of observations satisfies $n + n' \ge 2g+2$. Then, there exists a variance estimator for cross-validation, which may be contrasted with Bengio and Grandvalet [3]. Precisely, plugging (15) into (14), we obtain:

Theorem 3. The empirical counterpart of the right-hand side of (16) is a U-statistic of degree 2g+2 under Assumption 1. It defines the unique minimal variance unbiased estimator of the variance of cross-validation, if there are enough extra observations. Otherwise, no unbiased estimator exists. This U-statistic $[\widehat{V(\hat\Theta(T))}]$ is identical to the plug-in estimator
$(K^{-1} - n^{-1})\,\hat\tau^{(1)}_{n-n/K} + (1 - K^{-1})\,\hat\tau^{(3)}_{n-2n/K+2} + n^{-1}\,\hat\tau^{(4)}_{n-n/K+1}$.

5.3. The Weak Law of Large Numbers. There is the following general criterion on the resampling design for the Weak Law of Large Numbers to hold. Assume that for each sample size n, a design $T_n$ with learning sets $\mathcal{S}_n$ is given. We will only consider the case where the learning set size g is the same across all n, so $\lim g/n = 0$. Let us write $K_n := |\mathcal{S}_n|$.

Theorem 4. Assume that $U^{(1)}$ is non-degenerate in the sense that $\tau_1^{(1)} = \sigma_1^2(U^{(1)}) > 0$. Then, the following are equivalent:

(1)
(23) $\lim_n f^l_{0,n} \Big/ \Big(\sum_{c=0}^{g} f^l_{c,n}\Big) = 1$.
(2) $\hat\Theta(T_n)$ is weakly consistent in the sense that $\hat\Theta(T_n)$ converges in probability to Θ as $n \to \infty$.

Proof. (1) $\Rightarrow$ (2): By (13), the quantity $\xi_{c,n}$ is O(n) for c = 0 because $\tau_0^{(1)} = 0$, and is O(n²) for $c \ge 1$. Therefore, the summand $|T_n|^{-2} f^l_{0,n}\,\xi_{0,n} = (n-g)^{-2}\big(\sum_{c=0}^g f^l_{c,n}\big)^{-1} f^l_{0,n}\,\xi_{0,n}$ of (14) always vanishes in the limit, taking into account that $\sum_{c=0}^g f^l_{c,n} = (n-g)^{-2}|T_n|^2$ for a test-complete design. Similarly, the remaining summand of the decomposition (14) is $(n-g)^{-2}\big(\sum_{c=0}^g f^l_{c,n}\big)^{-1} \sum_{c=1}^g f^l_{c,n}\,\xi_{c,n}$. Since $(n-g)^{-2}\sum_{c=1}^g \xi_{c,n}$ is a bounded sequence, it follows from the condition that $\lim V(\hat\Theta(T_n)) = 0$, and thus the assertion.
(2) $\Rightarrow$ (1): Convergence in probability implies convergence of the variance to zero. Since $\lim \xi_{0,n}/(2gn\,\tau_1^{(2)} + n\,\tau_1^{(4)}) = 1$ and $\lim \xi_{c,n}/(n^2\tau_c^{(1)}) = 1$ for $c \ge 1$, we have
$\lim \frac{1}{n^2 K_n^2}\Big( f^l_{0,n}\,(2gn\,\tau_1^{(2)} + n\,\tau_1^{(4)}) + n^2 \sum_{c\ge 1} f^l_{c,n}\,\tau_c^{(1)} \Big) = \lim K_n^{-2} \sum_{c \ge 1} f^l_{c,n}\,\tau_c^{(1)} = 0$.
Since $\tau_c^{(1)} \ge \tau_1^{(1)} > 0$, this implies that $\lim K_n^{-2} f^l_{c,n} = 0$ for all $c \ge 1$. Hence,
$\lim \frac{f^l_{0,n}}{K_n^2} = 1 - \lim \sum_{c\ge 1}\frac{f^l_{c,n}}{K_n^2} = 1 - 0 = 1$. □

Example 3. The condition appearing in the first equivalence of Theorem 4 is satisfied, for instance, for the complete design sequence. In contrast, it is violated for a design sequence such that there is an observation that is contained in every learning set for every n. One can also construct design sequences such that the limit (23) takes values between zero and one.

5.4. A Central Limit Theorem. The following theorem generalizes Lee [9, Theorem 1, Chapter 4.3.1] to U-statistics with a non-symmetric kernel and thus to CV-like procedures.

Lemma 6. Let $\hat\Theta(T)$ be a CV-like procedure based on a fixed design $T \subset T_*$, and let $\hat\Theta(T_*)$ be the leave-p-out estimator. Then
(24) $V(\hat\Theta(T)) - V(\hat\Theta(T_*)) = V\big(\hat\Theta(T) - \hat\Theta(T_*)\big) \ge 0$.

Proof. Since $V(\hat\Theta(T) - \hat\Theta(T_*)) = V(\hat\Theta(T)) - 2\,\mathrm{Cov}(\hat\Theta(T), \hat\Theta(T_*)) + V(\hat\Theta(T_*))$, (24) holds if and only if $\mathrm{Cov}(\hat\Theta(T), \hat\Theta(T_*)) = V(\hat\Theta(T_*))$. Since $\mathrm{Cov}(\Gamma(S;a), \hat\Theta(T_*))$ is the same for every $(S;a) \in T_*$, we have
$V(\hat\Theta(T_*)) = \mathrm{Cov}(\hat\Theta(T_*), \hat\Theta(T_*)) = \mathrm{Cov}\Big(|T_*|^{-1}\sum_{(S;a)\in T_*}\Gamma(S;a),\; \hat\Theta(T_*)\Big) = |T_*|^{-1}\sum_{(S;a)\in T_*}\mathrm{Cov}(\Gamma(S;a), \hat\Theta(T_*)) = \mathrm{Cov}(\Gamma(S;a), \hat\Theta(T_*)) = |T|^{-1}\sum_{(S;a)\in T}\mathrm{Cov}(\Gamma(S;a), \hat\Theta(T_*)) = \mathrm{Cov}(\hat\Theta(T), \hat\Theta(T_*))$. □

The proof differs from that of [9] not only because we generalize his theorem, but also because there is a mistake in his proof: he assumes that the covariances of an incomplete U-statistic and a kernel evaluation are all equal, for each set of the design. This property, however, is not valid in general.

We are now going to investigate a situation where a design is pre-specified for each sample size, and will give a general sufficient criterion for a Central Limit Theorem. Let us abbreviate the complete, leave-p-out design for n by $T_{*,n}$, and its learning sets by $\mathcal{S}_{*,n}$, so that $K_{*,n} := |\mathcal{S}_{*,n}| = \binom{n}{g}$. Suppose, again, that for each $n \ge 2g+2$, a test-complete design $T_n$ for sample size n is given such that the trivial condition $K_n \to \infty$ is satisfied. For each $n \ge 2g+2$ and any $c = 0, \dots, g$, let $f^l_{c,n} \in \mathbb{N}_0$ be the number of ordered pairs of learning sets (S, S'), both occurring in $\mathcal{S}_n$, such that $|S \cap S'| = c$. Recall that $|T_n|^2 = (n-g)^2\sum_{c=0}^{g} f^l_{c,n} = (n-g)^2 K_n^2$. Using $f^l_{*,c,n} = \binom{n}{g}\binom{g}{c}\binom{n-g}{g-c}$, we see that the numbers $f^l_{*,c,n}$ satisfy the following asymptotic properties:
(25) $\lim_n n\, f^l_{*,1,n}\, K_{*,n}^{-2} = g^2$
for c = 1, and
(26) $\lim_n n\, f^l_{*,c,n}\, K_{*,n}^{-2} = 0$
for all $2 \le c \le g$.

Lemma 7. Assume $f^l_{c,n}$ satisfies equations (25) and (26) with $f^l_{c,n}$ in place of $f^l_{*,c,n}$ and $K_n$ in place of $K_{*,n}$. Then
(27) $\lim_n n\big( V(\hat\Theta(T_n)) - V(\hat\Theta(T_{*,n})) \big) = 0$.
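Before the proof of Lemma 7, the asymptotic statements (25) and (26) for the complete design can be checked numerically from the closed form $f^l_{*,c,n} = \binom{n}{g}\binom{g}{c}\binom{n-g}{g-c}$; a small sketch (the function name `f_star` is ours):

```python
from math import comb

def f_star(n, g, c):
    """Number of ordered pairs of g-subsets of {1, ..., n} with intersection size c:
    C(n, g) * C(g, c) * C(n - g, g - c)."""
    return comb(n, g) * comb(g, c) * comb(n - g, g - c)

g = 3
for n in [50, 500, 5000]:
    K2 = comb(n, g) ** 2
    print(n, [round(n * f_star(n, g, c) / K2, 4) for c in range(1, g + 1)])
# The first entry approaches g^2 = 9, cf. (25); the others tend to 0, cf. (26).
```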

Proof. Let us express the fact that $\xi_c$ depends on n by writing $\xi_{c,n}$. One considers the order of magnitude of the $\xi_{c,n}$ as given in the proof of Theorem 4. One substitutes (14) for $V(\hat\Theta(T_n))$ and for $V(\hat\Theta(T_{*,n}))$ and considers each summand of the left-hand side of (27) separately (the finite sum over c can be pulled outside the limit). For instance, the summand belonging to c = 0 is zero because, by assumption,
$\lim \frac{n\, f^l_{0,n}\,\xi_{0,n}}{|T_n|^2} = \lim\Big(\frac{n^2 f^l_{0,n}}{|T_n|^2}\cdot\frac{\xi_{0,n}}{n}\Big) = \lim\frac{n^2 f^l_{0,n}}{|T_n|^2}\cdot\lim\frac{\xi_{0,n}}{n} = \lim\frac{\xi_{0,n}}{n}$
and, similarly,
$\lim \frac{n\, f^l_{*,0,n}\,\xi_{0,n}}{|T_{*,n}|^2} = \lim\frac{\xi_{0,n}}{n}$.
For c = 1, we have
$\lim \frac{n\, f^l_{1,n}\,\xi_{1,n}}{|T_n|^2} = \lim\Big(\frac{n^3 f^l_{1,n}}{|T_n|^2}\cdot\frac{\xi_{1,n}}{n^2}\Big) = \lim\frac{n^3 f^l_{1,n}}{|T_n|^2}\cdot\lim\frac{\xi_{1,n}}{n^2} = g^2\lim\frac{\xi_{1,n}}{n^2}$
and
$\lim \frac{n\, f^l_{*,1,n}\,\xi_{1,n}}{|T_{*,n}|^2} = g^2\lim\frac{\xi_{1,n}}{n^2}$.
Analogously, one shows that every summand belonging to any other c vanishes. □

Theorem 5. The following are equivalent:
(1) Equation (27) holds.
(2) $(\hat\Theta(T_n) - \Theta)\,\big(V(\hat\Theta(T_{*,n}))\big)^{-1/2} \to \mathcal{N}(0,1)$ in distribution as g remains fixed and $n \to \infty$.

The condition of Lemma 7 implies the condition (23) of Theorem 4; so we pass to a more specific case. Likewise, the situation of Theorem 4 is that a relaxation of (27) holds, namely that the left-hand side of (27) lacks the factor n. It is easy to construct examples of design sequences $T_n$ such that (23) holds but (27) is violated.

Proof. (1) $\Rightarrow$ (2): We have
(28) $(\hat\Theta(T_n) - \Theta)\big(V(\hat\Theta(T_{*,n}))\big)^{-1/2} = (\hat\Theta(T_n) - \hat\Theta(T_{*,n}))\big(V(\hat\Theta(T_{*,n}))\big)^{-1/2} + (\hat\Theta(T_{*,n}) - \Theta)\big(V(\hat\Theta(T_{*,n}))\big)^{-1/2}$.
We will show that the first summand converges in probability to zero while the second converges in distribution to $\mathcal{N}(0,1)$. The claim then follows from the standard fact that convergence in distribution is invariant under perturbation by a term that converges to zero in probability. Using Lemma 6, the variance of the first summand of the right-hand side of (28) is
$V\big[ (\hat\Theta(T_n) - \hat\Theta(T_{*,n}))\big(V(\hat\Theta(T_{*,n}))\big)^{-1/2} \big] = V(\hat\Theta(T_n))/V(\hat\Theta(T_{*,n})) - 1$,
which converges to zero, as we have just shown. The second summand of the right-hand side of (28) is the standardized U-statistic, which satisfies the Central Limit Theorem [7, Theorem 7.1], multiplied by the square root of the ratio of the variances, which converges to one. This completes the proof.
(2) $\Rightarrow$ (1): Taking the variance of the left-hand side of (27), we see that