CHAPTER 4

Uniform laws of large numbers

The focus of this chapter is a class of results known as uniform laws of large numbers. As suggested by their name, these results represent a strengthening of the usual law of large numbers, which applies to a fixed sequence of random variables, to related laws that hold uniformly over collections of random variables. On one hand, such uniform laws are of theoretical interest in their own right, and represent an entry point to a rich area of probability and statistics known as empirical process theory. On the other hand, uniform laws also play a key role in more applied settings, including understanding the behavior of different types of statistical estimators. The classical versions of uniform laws are of an asymptotic nature, whereas more recent work in the area has emphasized non-asymptotic results. Consistent with the overall goals of this book, this chapter follows the non-asymptotic route, presenting results that apply to all sample sizes. In order to do so, we make use of the tail bounds and the notion of Rademacher complexity previously introduced in Chapter 2.

4.1 Motivation

We begin with some statistical motivations for deriving uniform laws of large numbers, first for the case of cumulative distribution functions (CDFs) and then for more general function classes.

4.1.1 Uniform convergence of CDFs

The law of any scalar random variable X can be fully specified by its cumulative distribution function, whose value at any point t ∈ R is given by F(t) := P[X ≤ t]. Now suppose that we are given a collection X_1^n = {X_1, X_2, ..., X_n} of n i.i.d. samples, each drawn according to the law specified by F. A natural estimate of F is the empirical CDF given by

    F̂_n(t) := (1/n) Σ_{i=1}^n I_{(−∞,t]}[X_i],        (4.1)

ROUGH DRAFT: DO NOT DISTRIBUTE! (M. Wainwright, February 4, 2015)

where I_{(−∞,t]}[x] is a {0,1}-valued indicator function for the event {x ≤ t}. Since the population CDF can be written as F(t) = E[I_{(−∞,t]}[X]], the empirical CDF is unbiased in a pointwise sense. Figure 4-1 provides some illustrations of empirical CDFs for the uniform distribution on the interval [0,1] for different sample sizes. Note that F̂_n is a random function, with the value F̂_n(t) corresponding to the fraction of samples that lie in the interval (−∞,t]. Consequently, for any fixed t ∈ R, the law of large numbers implies that F̂_n(t) converges to F(t) in probability.

[Figure 4-1. Plots of population and empirical CDF functions for the uniform distribution on [0,1]. (a) Empirical CDF based on n = 10 samples. (b) Empirical CDF based on n = 100 samples.]

In statistical settings, a typical use of the empirical CDF is to construct estimators of various quantities associated with the population CDF. Many such estimation problems can be formulated in terms of a functional γ that maps any CDF F to a real number γ(F), that is, F ↦ γ(F). Given a set of samples distributed according to F, the plug-in principle suggests replacing the unknown F with the empirical CDF F̂_n, thereby obtaining γ(F̂_n) as an estimate of γ(F). Let us illustrate this procedure via some examples.

Example 4.1 (Expectation functionals). Given some integrable function g, we may define the expectation functional γ_g via

    γ_g(F) := ∫ g(x) dF(x).        (4.2)

For instance, for the function g(x) = x, the functional γ_g maps F to E[X], where X is a random variable with CDF F. For any g, the plug-in estimate is given by γ_g(F̂_n) = (1/n) Σ_{i=1}^n g(X_i), corresponding to the sample mean of g(X). In the special case

g(x) = x, we recover the usual sample mean (1/n) Σ_{i=1}^n X_i as an estimate for the mean µ = E[X]. A similar interpretation applies to other choices of the underlying function g.

Example 4.2 (Quantile functionals). For any α ∈ [0,1], the quantile functional Q_α is given by

    Q_α(F) := inf{ t ∈ R | F(t) ≥ α }.        (4.3)

The median corresponds to the special case α = 0.5. The plug-in estimate is given by

    Q_α(F̂_n) := inf{ t ∈ R | (1/n) Σ_{i=1}^n I_{(−∞,t]}[X_i] ≥ α },        (4.4)

and corresponds to estimating the αth quantile of the distribution by the αth sample quantile. In the special case α = 0.5, this estimate corresponds to the sample median. Again, it is of interest to determine in what sense (if any) the random variable Q_α(F̂_n) approaches Q_α(F) as n becomes large. In this case, Q_α(F̂_n) is a fairly complicated, non-linear function of all the variables, so that this convergence does not follow immediately from a classical result such as the law of large numbers.

Example 4.3 (Goodness-of-fit functionals). It is frequently of interest to test the hypothesis of whether or not a given set of data has been drawn from a known distribution F_0. For instance, we might be interested in assessing departures from uniformity, in which case F_0 would be a uniform distribution on some interval, or departures from Gaussianity, in which case F_0 would specify a Gaussian with a fixed mean and variance. Such tests can be performed using functionals that measure the distance between F and the target CDF F_0, including the sup-norm distance ‖F − F_0‖_∞, or other distances such as the Cramér-von Mises criterion based on the functional

    γ(F) := ∫ [ F(x) − F_0(x) ]² dF_0(x).

For any plug-in estimator γ(F̂_n), an important question is to understand when it is consistent: when does γ(F̂_n) converge to γ(F) in probability (or almost surely)? This question can be addressed in a unified manner for many functionals by defining a notion of continuity.
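The plug-in recipe behind the expectation and quantile functionals above is easy to make concrete. The following is a minimal NumPy sketch (the function names are ours, not the text's): it computes the empirical CDF F̂_n, the plug-in expectation (1/n) Σ g(X_i), and the plug-in quantile Q_α(F̂_n) = inf{t | F̂_n(t) ≥ α}.

```python
import numpy as np

def empirical_cdf(samples):
    """Return t -> F_n(t), the fraction of samples lying in (-inf, t]."""
    x = np.sort(np.asarray(samples, dtype=float))
    # searchsorted with side="right" counts how many sorted samples are <= t
    return lambda t: np.searchsorted(x, t, side="right") / x.size

def plugin_expectation(samples, g):
    """Plug-in estimate of the expectation functional: (1/n) sum_i g(X_i)."""
    return float(np.mean(g(np.asarray(samples, dtype=float))))

def plugin_quantile(samples, alpha):
    """Plug-in estimate Q_alpha(F_n) = inf{t : F_n(t) >= alpha}."""
    x = np.sort(np.asarray(samples, dtype=float))
    # F_n jumps only at sample points, so the infimum is attained at an
    # order statistic: the ceil(alpha * n)-th smallest sample.
    k = int(np.ceil(alpha * x.size)) - 1
    return float(x[max(k, 0)])
```

Because F̂_n is a step function, the infimum defining the plug-in quantile is attained at an order statistic, which is why `plugin_quantile` can simply index into the sorted sample.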
Given a pair of CDFs F and G, let us measure the distance between them using the sup-norm

    ‖G − F‖_∞ := sup_{t∈R} | G(t) − F(t) |.        (4.5)

We can then define the continuity of a functional γ with respect to this norm: more precisely, we say that the functional γ is continuous at F in the sup-norm if for all ε > 0, there exists a δ > 0 such that ‖G − F‖_∞ ≤ δ implies that |γ(G) − γ(F)| ≤ ε.

As we explore in the exercises, this notion is useful because, for any continuous functional, it reduces the consistency question for the plug-in estimator γ(F̂_n) to the issue of whether or not ‖F̂_n − F‖_∞ converges to zero. A classical result, known as the Glivenko-Cantelli theorem, addresses the latter question:

Theorem 4.1 (Glivenko-Cantelli). For any distribution, the empirical CDF F̂_n is a strongly consistent estimator of the population CDF F in the uniform norm, meaning that

    ‖F̂_n − F‖_∞ → 0  almost surely.        (4.6)

We provide a proof of this claim as a corollary of a more general result to follow (see Theorem 4.2). For statistical applications, an important consequence of Theorem 4.1 is that the plug-in estimate γ(F̂_n) is almost surely consistent as an estimator of γ(F) for any functional γ that is continuous with respect to the sup-norm. See the exercises for further exploration of these issues.

4.1.2 Uniform laws for more general function classes

We now turn to a more general consideration of uniform laws of large numbers. Let F be a class of integrable real-valued functions with domain X, and let X_1^n = {X_1, ..., X_n} be a collection of i.i.d. samples from some distribution P over X. Consider the random variable

    ‖P_n − P‖_F := sup_{f∈F} | (1/n) Σ_{i=1}^n f(X_i) − E[f] |,        (4.7)

which measures the absolute deviation between the sample average (1/n) Σ_{i=1}^n f(X_i) and the population average E[f] = E[f(X)], uniformly over the class F. (See the bibliographic section for a discussion of possible measurability concerns with this random variable.)

Definition 4.1. We say that F is a Glivenko-Cantelli class for P if ‖P_n − P‖_F converges to zero in probability as n → ∞.

This notion can also be defined in a stronger sense, requiring almost-sure convergence of ‖P_n − P‖_F, in which case we say that F satisfies a strong Glivenko-Cantelli law. The classical result on the empirical CDF (Theorem 4.1) can be reformulated as a particular case of this notion:

Example 4.4 (Empirical CDFs and indicator functions). Consider the function class

    F = { I_{(−∞,t]}(·) | t ∈ R },        (4.8)

where I_{(−∞,t]} is the {0,1}-valued indicator function of the interval (−∞,t]. For each fixed t ∈ R, we have E[ I_{(−∞,t]}(X) ] = P[X ≤ t] = F(t), so that the classical Glivenko-Cantelli theorem corresponds to a strong uniform law for the class (4.8).

Not all classes of functions are Glivenko-Cantelli, as illustrated by the following example.

Example 4.5 (Failure of a uniform law). Let S be the class of all subsets S of [0,1] with a finite number of elements, and consider the function class F_S = { I_S(·) | S ∈ S } of {0,1}-valued indicator functions of such sets. Suppose that the samples X_i are drawn from some distribution over [0,1] that has no atoms (i.e., P({x}) = 0 for all x ∈ [0,1]); this class includes any distribution that has a density with respect to Lebesgue measure. For any such distribution, we are guaranteed that P[S] = 0 for all S ∈ S. On the other hand, for any positive integer n ∈ N, the set X_1^n = {X_1, ..., X_n} belongs to S, and moreover, by definition of the empirical distribution, we have P_n[X_1^n] = 1. Putting together the pieces, we conclude that

    sup_{S∈S} | P_n[S] − P[S] | = |1 − 0| = 1,        (4.9)

so that the function class F_S is not a Glivenko-Cantelli class for P.

We have seen that the classical Glivenko-Cantelli law, which guarantees convergence of a special case of the variable ‖P_n − P‖_F, is of interest in analyzing estimators based on plugging in the empirical CDF. It is natural to ask in what other statistical contexts these quantities arise. In fact, variables of the form ‖P_n − P‖_F are ubiquitous throughout statistics; in particular, they lie at the heart of methods based on empirical risk minimization.
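The failure mode in the example above can be seen numerically: for an atomless distribution, the sample set itself is a finite subset witnessing a deviation of one at every sample size. A small sketch (illustrative, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
for n in [10, 100, 1000]:
    samples = rng.uniform(0.0, 1.0, size=n)
    S = set(samples.tolist())      # a finite subset of [0,1], hence a member of S
    # empirical mass of S: every sample lies in S by construction
    P_n_S = float(np.mean([x in S for x in samples.tolist()]))
    P_S = 0.0                      # population mass: P has no atoms, so P[S] = 0
    assert abs(P_n_S - P_S) == 1.0 # the deviation stays at 1 no matter how large n is
```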
In order to describe this notion more concretely, let us consider an indexed family of probability distributions { P_θ | θ ∈ Ω }, and suppose that we are given n samples X_1^n = {X_1, ..., X_n}, each sample lying in some space X. Suppose that the samples are drawn i.i.d. according to a distribution P_{θ*}, for some fixed but unknown θ* ∈ Ω. Here the index θ* could lie within a finite-dimensional space, such as Ω = R^d in a vector estimation problem, or could lie within some function class Ω = G, in which case the problem is of the non-parametric variety. In either case, a standard decision-theoretic approach to estimating θ* is based on minimizing a loss function of the form L_θ(x), which measures the fit between a parameter θ ∈ Ω and the sample x ∈ X. Given the collection of n samples X_1^n, the

principle of empirical risk minimization is based on the objective function

    R̂_n(θ, θ*) := (1/n) Σ_{i=1}^n L_θ(X_i).

This quantity is known as the empirical risk, since it is defined by the samples X_1^n, and our notation reflects the fact that these samples depend in turn on the unknown distribution P_{θ*}. This empirical risk should be contrasted with the population risk

    R(θ, θ*) := E_{θ*}[ L_θ(X) ],

where the expectation E_{θ*} is taken over a sample X ~ P_{θ*}. In practice, one minimizes the empirical risk over some subset Ω_0 of the full space Ω, thereby obtaining some estimate θ̂. The statistical question is how to bound the excess risk, measured in terms of the population quantities, namely the difference

    δR(θ̂, θ*) := R(θ̂, θ*) − inf_{θ∈Ω_0} R(θ, θ*).

Let us consider some examples to illustrate.

Example 4.6 (Maximum likelihood). Consider a family of distributions { P_θ, θ ∈ Ω }, each with a strictly positive density p_θ (defined with respect to a common underlying measure). Now suppose that we are given n i.i.d. samples from the unknown distribution P_{θ*}, and we would like to estimate the unknown parameter θ*. In order to do so, we consider the cost function

    L_θ(x) := log( p_{θ*}(x) / p_θ(x) ).

(The term p_{θ*}(x), which we have included for later theoretical convenience, has no effect on the minimization over θ.) Indeed, the maximum likelihood estimate is obtained by minimizing the empirical risk:

    θ̂ ∈ arg min_{θ∈Ω_0} (1/n) Σ_{i=1}^n log( p_{θ*}(X_i) / p_θ(X_i) ) = arg min_{θ∈Ω_0} { −(1/n) Σ_{i=1}^n log p_θ(X_i) },

where the first objective is precisely the empirical risk R̂_n(θ, θ*). The population risk is given by R(θ, θ*) = E_{θ*}[ log( p_{θ*}(X) / p_θ(X) ) ], a quantity known as the Kullback-Leibler divergence between p_{θ*} and p_θ. In the special case that θ* ∈ Ω_0, the excess risk is simply the Kullback-Leibler divergence between the true density p_{θ*} and the fitted model p_{θ̂}.
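As a concrete instance of the maximum-likelihood example, one can take the Gaussian location family p_θ = N(θ, 1), for which −log p_θ(x) equals (x − θ)²/2 up to a θ-independent constant, and the Kullback-Leibler divergence is (θ − θ*)²/2. The sketch below is a toy illustration (a finite grid stands in for Ω_0; the names and numbers are ours, not the text's): it minimizes the empirical risk and evaluates the resulting excess risk.

```python
import numpy as np

def empirical_risk(theta, samples):
    """Empirical risk for the Gaussian location family p_theta = N(theta, 1):
    up to a theta-independent constant, (1/n) * sum_i -log p_theta(X_i)."""
    x = np.asarray(samples, dtype=float)
    return float(np.mean(0.5 * (x - theta) ** 2))

def erm_estimate(samples, grid):
    """Minimize the empirical risk over a finite grid (a crude stand-in for
    Omega_0; the exact minimizer here is the sample mean)."""
    risks = [empirical_risk(th, samples) for th in grid]
    return float(grid[int(np.argmin(risks))])

rng = np.random.default_rng(1)
theta_star = 0.7
samples = rng.normal(theta_star, 1.0, size=2000)
grid = np.linspace(-2.0, 2.0, 801)         # grid spacing 0.005
theta_hat = erm_estimate(samples, grid)
# KL(N(theta*,1) || N(theta,1)) = (theta - theta*)^2 / 2, so the excess risk is:
excess = 0.5 * (theta_hat - theta_star) ** 2
```

Because the empirical risk here is a quadratic in θ, the grid minimizer is simply the grid point closest to the sample mean, matching the classical fact that the MLE for a Gaussian location family is the sample mean.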

Example 4.7 (Binary classification). Suppose that we observe n samples of the form (X_i, Y_i) ∈ R^d × {−1,+1}, where the vector X_i corresponds to a set of d predictors or features, and the binary variable Y_i corresponds to a label. We can view such data as being generated by some distribution P_X over the features, and a conditional distribution P_{Y|X}. Since Y takes binary values, the conditional distribution is fully specified by the likelihood ratio ψ(x) = P[Y = +1 | X = x] / P[Y = −1 | X = x]. The goal of binary classification is to estimate a function f : R^d → {−1,+1} that minimizes the probability of misclassification P[f(X) ≠ Y], for an independently drawn pair (X, Y). Note that this probability of error corresponds to the population risk for the cost function

    L_f(x, y) := 1 if f(x) ≠ y, and 0 otherwise.        (4.10)

A function that minimizes this probability of error is known as a Bayes classifier f*; in the special case of equally probable classes (P[Y = +1] = P[Y = −1] = 1/2), a Bayes classifier is given by f*(x) = +1 if ψ(x) ≥ 1, and f*(x) = −1 otherwise. Since the likelihood ratio ψ (and hence f*) is unknown, a natural approach to approximating the Bayes rule is based on choosing f to minimize the empirical risk

    R̂_n(f, f*) := (1/n) Σ_{i=1}^n I[ f(X_i) ≠ Y_i ] = (1/n) Σ_{i=1}^n L_f(X_i, Y_i),

corresponding to the fraction of training samples that are misclassified. Typically, the minimization over f is restricted to some subset of all possible decision rules. See Chapter 14 for some further discussion of how to analyze such methods for binary classification.

Returning to the main thread, our goal is to develop methods for controlling the excess risk. For simplicity, let us assume that there exists some θ_0 ∈ Ω_0 such that R(θ_0, θ*) = inf_{θ∈Ω_0} R(θ, θ*). With this notation, the excess risk can be decomposed as

    δR(θ̂, θ*) = { R(θ̂, θ*) − R̂_n(θ̂, θ*) } + { R̂_n(θ̂, θ*) − R̂_n(θ_0, θ*) } + { R̂_n(θ_0, θ*) − R(θ_0, θ*) },
where the three terms on the right-hand side are denoted T_1, T_2 and T_3, respectively. Note that the middle term T_2 is non-positive, since θ̂ minimizes the empirical risk over Ω_0. The third term T_3 can be dealt with in a relatively straightforward manner, because θ_0 is an unknown but non-random quantity.¹ Indeed, recalling the definition of the

¹ If the infimum is not achieved, then we choose an element θ_0 for which this equality holds up to some arbitrarily small tolerance ε > 0, and the analysis to follow holds up to this tolerance.

empirical risk, we have

    T_3 = | (1/n) Σ_{i=1}^n L_{θ_0}(X_i) − E[ L_{θ_0}(X) ] |,

corresponding to the deviation of a sample mean from its expectation for the random variable L_{θ_0}(X). This quantity can be controlled using the techniques introduced in Chapter 2, for instance via the Hoeffding bound when the samples are independent and the loss function is bounded. The first term can be written in a similar way, namely as

    T_1 = | (1/n) Σ_{i=1}^n L_{θ̂}(X_i) − E[ L_{θ̂}(X) ] |.

This quantity is more challenging to control because the parameter θ̂, in sharp contrast to the deterministic quantity θ_0, is now random. In general, it depends on all the samples X_1^n, since it was obtained by minimizing the empirical risk. For this reason, controlling the first term requires a stronger result, such as a uniform law of large numbers over the loss class L(Ω_0) := { L_θ, θ ∈ Ω_0 }. With this notation, we have

    T_1 ≤ sup_{θ∈Ω_0} | (1/n) Σ_{i=1}^n L_θ(X_i) − E[ L_θ(X) ] | = ‖P_n − P‖_{L(Ω_0)}.

Since T_3 is also dominated by this same quantity, we conclude that the excess risk is at most 2‖P_n − P‖_{L(Ω_0)}. This derivation demonstrates that the central challenge in analyzing estimators based on empirical risk minimization is to establish a uniform law of large numbers for the loss class L(Ω_0). We explore various concrete examples of this procedure in the exercises.

4.2 A uniform law via Rademacher complexity

Having developed various motivations for studying uniform laws, let us now turn to the technical details of deriving such results. An important quantity that underlies the study of uniform laws is the Rademacher complexity of the function class F. For any fixed collection x_1^n := (x_1, ..., x_n) of points, consider the subset of R^n given by

    F(x_1^n) := { (f(x_1), ..., f(x_n)) | f ∈ F }.        (4.11)

The set F(x_1^n) corresponds to all those vectors in R^n that can be realized by applying a function f ∈ F to the collection (x_1, ..., x_n), and the empirical Rademacher complexity

is given by

    R(F(x_1^n)/n) := E_ε[ sup_{f∈F} | (1/n) Σ_{i=1}^n ε_i f(x_i) | ].        (4.12)

Note that this definition coincides with our earlier definition of the Rademacher complexity of a set, as introduced in Chapter 2. Given a collection X_1^n := (X_1, ..., X_n) of random samples, the empirical Rademacher complexity R(F(X_1^n)/n) is a random variable. Taking its expectation yields the Rademacher complexity of the function class F, namely the deterministic quantity

    R_n(F) := E_X[ R(F(X_1^n)/n) ] = E_{X,ε}[ sup_{f∈F} | (1/n) Σ_{i=1}^n ε_i f(X_i) | ].        (4.13)

The Rademacher complexity is the average of the maximum correlation between the vector (f(X_1), ..., f(X_n)) and the noise vector (ε_1, ..., ε_n), where the maximum is taken over all functions f ∈ F. The intuition is a natural one: a function class is extremely large, and in fact too large for statistical purposes, if we can always find a function that has a high correlation with a randomly drawn noise vector. Conversely, when the Rademacher complexity decays as a function of the sample size, then it is impossible to find a function that correlates very highly in expectation with a randomly drawn noise vector.

We now make precise the connection between Rademacher complexity and the Glivenko-Cantelli property, in particular by showing that any bounded function class F with R_n(F) = o(1) is also a Glivenko-Cantelli class. More precisely, we prove a non-asymptotic statement, in terms of a tail bound for the probability that the random variable ‖P_n − P‖_F deviates substantially above a multiple of the Rademacher complexity.

Theorem 4.2. Let F be a class of functions f : X → R that is uniformly bounded (i.e., ‖f‖_∞ ≤ b for all f ∈ F). Then for all n ≥ 1 and δ ≥ 0, we have

    ‖P_n − P‖_F ≤ 2 R_n(F) + δ        (4.14)

with P-probability at least 1 − exp(−nδ²/(8b²)). Consequently, as long as R_n(F) = o(1), we have ‖P_n − P‖_F → 0 almost surely.

In order for Theorem 4.2 to be useful, we need to obtain upper bounds on the

Rademacher complexity. There are a variety of methods for doing so, ranging from direct calculations to alternative complexity measures. In Section 4.3, we develop some techniques for upper bounding the Rademacher complexity for indicator functions of half-intervals, as required for the classical Glivenko-Cantelli theorem (see Example 4.4); we also discuss the notion of Vapnik-Chervonenkis dimension, which can be used to upper bound the Rademacher complexity for other function classes. In Chapter 5, we introduce more advanced techniques based on metric entropy and chaining for controlling the Rademacher complexity and related sub-Gaussian processes. In the meantime, let us turn to the proof of Theorem 4.2.

Proof. We first note that if R_n(F) = o(1), then the almost-sure convergence follows from the tail bound (4.14) and the Borel-Cantelli lemma. Accordingly, the remainder of the argument is devoted to proving the tail bound (4.14).

Concentration around the mean: We first claim that when F is uniformly bounded, the random variable ‖P_n − P‖_F is sharply concentrated around its mean. In order to simplify notation, it is convenient to define the re-centered functions f̄(x) := f(x) − E[f(X)], and to write ‖P_n − P‖_F = sup_{f∈F} | (1/n) Σ_{i=1}^n f̄(X_i) |. Thinking of the samples as fixed for the moment, consider the function G(x_1, ..., x_n) := sup_{f∈F} | (1/n) Σ_{i=1}^n f̄(x_i) |. We claim that G satisfies the Lipschitz property required to apply the bounded differences method (recall the bounded differences inequality from Chapter 2). Since the function G is invariant to permutations of its coordinates, it suffices to bound the difference when the first coordinate x_1 is perturbed. Accordingly, we define the vector y ∈ R^n with y_i = x_i for all i ≠ 1, and seek to bound the difference G(x) − G(y). For any re-centered function f̄ = f − E[f], we have

    | (1/n) Σ_{i=1}^n f̄(x_i) | − sup_{h∈F} | (1/n) Σ_{i=1}^n h̄(y_i) |
        ≤ | (1/n) Σ_{i=1}^n f̄(x_i) | − | (1/n) Σ_{i=1}^n f̄(y_i) |
        ≤ (1/n) | f̄(x_1) − f̄(y_1) | = (1/n) | f(x_1) − f(y_1) | ≤ 2b/n,        (4.15)

where the final inequality uses the bound | f(x_1) − f(y_1) | ≤ 2b, which follows from the uniform boundedness condition ‖f‖_∞ ≤ b.
Since the inequality (4.15) holds for any function f, we may take the supremum over f ∈ F on both sides; doing so yields the inequality G(x) − G(y) ≤ 2b/n. Since the same argument may be applied with the roles of x and y reversed, we conclude that |G(x) − G(y)| ≤ 2b/n.

Therefore, by the bounded differences method (see Chapter 2), we have

    ‖P_n − P‖_F ≤ E[ ‖P_n − P‖_F ] + t  with P-probability at least 1 − exp(−nt²/(8b²)),        (4.16)

valid for all t ≥ 0.

Upper bound on the mean: It remains to show that E[ ‖P_n − P‖_F ] is upper bounded by 2R_n(F), and we do so using a classical symmetrization argument. Letting (Y_1, ..., Y_n) be a second i.i.d. sequence, independent of (X_1, ..., X_n), we have

    E[ ‖P_n − P‖_F ] = E_X[ sup_{f∈F} | (1/n) Σ_{i=1}^n { f(X_i) − E_{Y_i}[f(Y_i)] } | ]
                     = E_X[ sup_{f∈F} | E_Y[ (1/n) Σ_{i=1}^n { f(X_i) − f(Y_i) } ] | ]
                     ≤ E_{X,Y}[ sup_{f∈F} | (1/n) Σ_{i=1}^n { f(X_i) − f(Y_i) } | ],

where the inequality follows from Jensen's inequality for convex functions. Now let (ε_1, ..., ε_n) be an i.i.d. sequence of Rademacher variables, independent of X and Y. For any function f ∈ F and each i = 1, 2, ..., n, the variable ε_i ( f(X_i) − f(Y_i) ) has the same distribution as f(X_i) − f(Y_i), whence

    E[ ‖P_n − P‖_F ] ≤ E_{X,Y,ε}[ sup_{f∈F} | (1/n) Σ_{i=1}^n ε_i ( f(X_i) − f(Y_i) ) | ]
                     ≤ 2 E_{X,ε}[ sup_{f∈F} | (1/n) Σ_{i=1}^n ε_i f(X_i) | ] = 2 R_n(F).        (4.17)

Combining the upper bound (4.17) with the tail bound (4.16) yields the claim.

4.2.1 Necessary conditions with Rademacher complexity

The proof of Theorem 4.2 illustrates an important technique known as symmetrization, which relates the random variable ‖P_n − P‖_F to its symmetrized version

    ‖R_n‖_F := sup_{f∈F} | (1/n) Σ_{i=1}^n ε_i f(X_i) |.        (4.18)

Note that the expectation of ‖R_n‖_F corresponds to the Rademacher complexity, which plays a central role in Theorem 4.2. It is natural to wonder whether much was lost in moving from the variable ‖P_n − P‖_F to its symmetrized version. The following sandwich result relates these quantities.

Proposition 4.1. For any convex non-decreasing function Φ : R → R, we have

    E_{X,ε}[ Φ( (1/2) ‖R_n‖_{F̄} ) ]  ≤(a)  E_X[ Φ( ‖P_n − P‖_F ) ]  ≤(b)  E_{X,ε}[ Φ( 2 ‖R_n‖_F ) ],        (4.19)

where F̄ := { f − E[f], f ∈ F } is the re-centered function class.

When applied with the convex non-decreasing function Φ(t) = t, Proposition 4.1 yields the inequalities

    (1/2) E_{X,ε}[ ‖R_n‖_{F̄} ] ≤ E_X[ ‖P_n − P‖_F ] ≤ 2 E_{X,ε}[ ‖R_n‖_F ],        (4.20)

with the only difference being the use of F in the upper bound and the re-centered class F̄ in the lower bound. Other choices of interest include Φ(t) = e^{λt} for some λ > 0, which can be used to control the moment generating function.

Proof. The proof of inequality (b) is essentially identical to the argument for Φ(t) = t provided in the proof of Theorem 4.2. Turning to the bound (a), we have

    E_{X,ε}[ Φ( (1/2) ‖R_n‖_{F̄} ) ] = E_{X,ε}[ Φ( (1/2) sup_{f∈F} | (1/n) Σ_{i=1}^n ε_i { f(X_i) − E_{Y_i}[f(Y_i)] } | ) ]
        ≤(i) E_{X,Y,ε}[ Φ( (1/2) sup_{f∈F} | (1/n) Σ_{i=1}^n ε_i { f(X_i) − f(Y_i) } | ) ]
        =(ii) E_{X,Y}[ Φ( (1/2) sup_{f∈F} | (1/n) Σ_{i=1}^n { f(X_i) − f(Y_i) } | ) ],

where inequality (i) follows from Jensen's inequality and the convexity of Φ, and equality (ii) follows since, for each i = 1, ..., n and f ∈ F, the variables ε_i { f(X_i) − f(Y_i) } and f(X_i) − f(Y_i) have the same distribution. Adding and subtracting E[f] and applying the triangle inequality, we see that the quantity T := sup_{f∈F} | (1/n) Σ_{i=1}^n { f(X_i) − f(Y_i) } | is upper bounded as

    T ≤ sup_{f∈F} | (1/n) Σ_{i=1}^n { f(X_i) − E[f] } | + sup_{f∈F} | (1/n) Σ_{i=1}^n { f(Y_i) − E[f] } |.

Since Φ is convex and non-decreasing, we obtain

    Φ(T/2) ≤ (1/2) Φ( sup_{f∈F} | (1/n) Σ_{i=1}^n { f(X_i) − E[f] } | ) + (1/2) Φ( sup_{f∈F} | (1/n) Σ_{i=1}^n { f(Y_i) − E[f] } | ).

Since X and Y are identically distributed, taking expectations yields the claim.

A consequence of Proposition 4.1 is that the random variable ‖P_n − P‖_F can be lower bounded by a multiple of the Rademacher complexity, minus some fluctuation terms. This fact can be used to prove the following:

Proposition 4.2. Let F be a class of functions f : X → R that is uniformly bounded (i.e., ‖f‖_∞ ≤ b for all f ∈ F). Then for all δ ≥ 0, we have

    ‖P_n − P‖_F ≥ (1/2) R_n(F) − sup_{f∈F} |E[f]| / (2√n) − δ        (4.21)

with P-probability at least 1 − e^{−nδ²/(8b²)}.

We leave the proof of this result for the reader (see Exercise 4.3). As a consequence, if the Rademacher complexity R_n(F) remains bounded away from zero, then ‖P_n − P‖_F cannot converge to zero in probability. We have thus shown that, for a uniformly bounded function class F, the decay of the Rademacher complexity provides a necessary and sufficient condition for the class to be Glivenko-Cantelli.

4.3 Upper bounds on the Rademacher complexity

Obtaining concrete results using Theorem 4.2 requires methods for upper bounding the Rademacher complexity. There are a variety of such methods, ranging from simple union bound methods (suitable for finite function classes) to more advanced techniques involving the notion of metric entropy and chaining arguments. We explore the latter techniques in Chapter 5 to follow. This section is devoted to more elementary techniques, including those required to prove the classical Glivenko-Cantelli result and, more generally, those that apply to function classes with polynomial discrimination.

4.3.1 Classes with polynomial discrimination

For a given collection of points x_1^n = (x_1, ..., x_n), the size of the set F(x_1^n) provides a sample-dependent measure of the complexity of F. In the simplest case, the set F(x_1^n) contains only a finite number of vectors for all sample sizes, so that its size can be measured via its cardinality.
For instance, if F consists of a family of decision rules taking binary values (as in Example 4.7), then F(x_1^n) can contain at most 2^n elements. Of interest to us are function classes for which this cardinality grows only as a polynomial function of n, as formalized in the following definition:

Definition 4.2 (Polynomial discrimination). A class F of functions with domain X has polynomial discrimination of order ν ≥ 1 if, for each positive integer n and each collection x_1^n = {x_1, ..., x_n} of n points in X, the set F(x_1^n) has cardinality upper bounded as

    card( F(x_1^n) ) ≤ (n+1)^ν.        (4.22)

The significance of this property is that it provides a straightforward approach to controlling the Rademacher complexity.

Lemma 4.1. Suppose that F has polynomial discrimination of order ν. Then for all n ≥ 10 and any collection of points x_1^n = (x_1, ..., x_n),

    E_ε[ sup_{f∈F} | (1/n) Σ_{i=1}^n ε_i f(x_i) | ] ≤ D(x_1^n) √( 3 ν log(n+1) / n ),

where the left-hand side is the empirical Rademacher complexity R(F(x_1^n)/n), and D(x_1^n) := sup_{f∈F} √( (1/n) Σ_{i=1}^n f²(x_i) ) is the ℓ₂-radius of the set F(x_1^n)/√n.

We leave the proof of this claim for the reader (see Exercise 4.7). Although Lemma 4.1 is stated as an upper bound on the empirical Rademacher complexity, it yields as a corollary an upper bound on the Rademacher complexity R_n(F) = E_X[ R(F(X_1^n)/n) ], one which involves the expected ℓ₂-width E_X[ D(X_1^n) ]. An especially simple case is when the function class is uniformly bounded, say ‖f‖_∞ ≤ b for all f ∈ F, so that D(x_1^n) ≤ b for all samples, and hence Lemma 4.1 implies that

    R_n(F) ≤ b √( 3 ν log(n+1) / n )  for all n ≥ 10.        (4.23)

Combined with Theorem 4.2, we conclude that any bounded function class with polynomial discrimination is Glivenko-Cantelli.

What types of function classes have polynomial discrimination? As discussed previously in Example 4.4, the classical Glivenko-Cantelli law is based on indicator functions of the left-sided intervals (−∞, t]. These functions are uniformly bounded with b = 1 and, as shown in the following proof, this function class has polynomial discrimination of order 1. Consequently, Theorem 4.2 combined with Lemma 4.1 yields a quantitative version of Theorem 4.1 as a corollary.

Corollary 4.1 (Classical Glivenko-Cantelli). Let F(t) = P[X ≤ t] be the CDF of a random variable X, and let F̂_n be the empirical CDF based on n i.i.d. samples X_i ~ P. Then

    P[ ‖F̂_n − F‖_∞ ≥ 2 √( 3 log(n+1) / n ) + δ ] ≤ e^{−nδ²/8}  for all δ ≥ 0,        (4.24)

and hence ‖F̂_n − F‖_∞ → 0 almost surely.

Proof. For a given sample x_1^n = (x_1, ..., x_n) ∈ R^n, consider the set F(x_1^n), where F is the set of all {0,1}-valued indicator functions I_{(−∞,t]} for t ∈ R. If we order the samples as x_(1) ≤ x_(2) ≤ ... ≤ x_(n), then they split the real line into at most n+1 intervals (including the two end intervals (−∞, x_(1)) and [x_(n), ∞)). For a given t, the indicator function I_{(−∞,t]} takes the value one for all samples x_(i) ≤ t, and the value zero for all other samples. Thus, we have shown that for any given sample x_1^n, we have card( F(x_1^n) ) ≤ n+1. Applying Lemma 4.1 with ν = 1 and D(x_1^n) ≤ 1, we obtain

    E_ε[ sup_{f∈F} | (1/n) Σ_{i=1}^n ε_i f(x_i) | ] ≤ √( 3 log(n+1) / n ),

and taking averages over the data X_i yields the upper bound R_n(F) ≤ √( 3 log(n+1) / n ). The claim (4.24) then follows from Theorem 4.2.

Although the exponential tail bound (4.24) is adequate for many purposes, it is far from the tightest possible. Using alternative methods, we provide a sharper result in Chapter 5 that removes the √log(n+1) factor. See the bibliographic section for references to the sharpest possible results, including control of the constants in the exponent and the pre-factor.

4.3.2 Vapnik-Chervonenkis dimension

Thus far, we have seen that it is relatively straightforward to establish uniform laws for function classes with polynomial discrimination. In certain cases, such as in our proof of the classical Glivenko-Cantelli law, we can verify by direct calculation that a given function class has polynomial discrimination. More broadly, it is of interest to develop techniques for certifying this property, and the theory of Vapnik-Chervonenkis (VC) dimension provides one such class of techniques.
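The ingredients of the corollary can be checked numerically. For the half-interval class, the supremum over t inside the symmetrized variable is exactly computable: the only realizable subsets are the n+1 "prefixes" of the sorted sample, so the empirical Rademacher complexity reduces to the expected maximum absolute prefix sum of random signs. The sketch below (NumPy assumed; we take √(3 log(n+1)/n) as the lemma's bound for D ≤ 1 and ν = 1, which is an assumption if the draft's constant differs) compares a Monte Carlo estimate against that bound and against the realized sup-deviation ‖F̂_n − F‖_∞ for uniform data.

```python
import numpy as np

def emp_rademacher_half_intervals(x, num_draws=2000, rng=None):
    """Monte Carlo estimate of E_eps sup_t |(1/n) sum_i eps_i * I(x_i <= t)|.
    For each sign vector the sup over t is exact: the realizable subsets are
    the n+1 prefixes in sorted order, so we take the max |prefix sum|."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = x.size
    order = np.argsort(x)
    vals = np.empty(num_draws)
    for m in range(num_draws):
        eps = rng.choice([-1.0, 1.0], size=n)
        prefix = np.concatenate(([0.0], np.cumsum(eps[order])))
        vals[m] = np.abs(prefix).max() / n
    return float(vals.mean())

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(size=n)
rad = emp_rademacher_half_intervals(x, rng=rng)
lemma_bound = float(np.sqrt(3 * np.log(n + 1) / n))  # assumed form of the bound
# Exact Kolmogorov-Smirnov statistic sup_t |F_n(t) - t| for uniform data:
i = np.arange(1, n + 1)
xs = np.sort(x)
ks = float(max((i / n - xs).max(), (xs - (i - 1) / n).max()))
```

On typical draws, the Monte Carlo value sits well below the stated bound, and the realized deviation ks is smaller still, consistent with the 2R_n(F) + δ form of the tail bound.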
Let us consider a function class F in which each function f is binary-valued, taking the values {0,1} for concreteness. In this case, the set F(x_1^n) from equation (4.11) can have at most 2^n elements.

Definition 4.3 (Shattering and VC dimension). Given a class F of binary-valued functions, we say that the set x_1^n = (x_1, ..., x_n) is shattered by F if card( F(x_1^n) ) = 2^n. The VC dimension ν(F) is the largest integer n for which there is some collection x_1^n = (x_1, ..., x_n) of n points that can be shattered by F.

When the quantity ν(F) is finite, the function class F is said to be a VC class. We will frequently consider function classes F that consist of indicator functions I_S[·], for sets S ranging over some class of sets S. In this case, we use S(x_1^n) and ν(S) as shorthand for the set F(x_1^n) and the VC dimension of F, respectively.

Let us illustrate the notions of shattering and VC dimension with some examples:

Example 4.8 (Intervals in R). Consider the class of all indicator functions for left-sided half-intervals on the real line, namely the class S_left := { (−∞, a] | a ∈ R }. Implicit in the proof of Corollary 4.1 is a calculation of the VC dimension of this class. We first note that for any single point x_1, both subsets ({x_1} and the empty set ∅) can be picked out by the class of left-sided intervals { (−∞, a] | a ∈ R }. But given two distinct points x_1 < x_2, it is impossible to find a left-sided interval that contains x_2 but not x_1. Therefore, we conclude that ν(S_left) = 1. In the proof of Corollary 4.1, we showed more specifically that for any collection x_1^n = {x_1, ..., x_n}, we have card( S_left(x_1^n) ) ≤ n+1.

Now consider the class of all two-sided intervals over the real line, namely the class S_two := { (b, a] | a, b ∈ R such that b < a }. The class S_two can shatter any two-point set. However, given three distinct points x_1 < x_2 < x_3, it cannot pick out the subset {x_1, x_3}, showing that ν(S_two) = 2. For future reference, let us also upper bound the shattering coefficient of S_two. Note that any collection of n distinct points x_1 < x_2 < ... < x_{n−1} < x_n divides up the real line into (n+1) intervals.
Thus, any set of the form $(b, a]$ can be specified by choosing one of these $n + 1$ intervals for $b$, and a second interval for $a$. Consequently, a crude upper bound on the shattering coefficient is $\operatorname{card}(\mathcal{S}_{\text{two}}(x_1^n)) \leq (n+1)^2$, showing that this class has polynomial discrimination with degree $\nu = 2$.

Thus far, we have seen two examples of function classes with finite VC dimension, both of which turned out to also have polynomial discrimination. Is there a general connection between the VC dimension and polynomial discriminability? Indeed, it turns out that any class with finite VC dimension has polynomial discrimination with degree at most the VC dimension; this fact is a deep result that was proved independently (in slightly different forms) by Vapnik and Chervonenkis, Sauer, and Shelah. We refer the reader to the bibliographic section for further discussion and references. In order to understand why this fact is surprising, note that for a given set class $\mathcal{S}$,

the definition of VC dimension implies that for all $n > \nu(\mathcal{S})$, we must have $\operatorname{card}(\mathcal{S}(x_1^n)) < 2^n$ for all collections $x_1^n$ of $n$ samples. However, at least in principle, there could exist some collection with $\operatorname{card}(\mathcal{S}(x_1^n)) = 2^{n-1}$, which is not significantly different from $2^n$. The following result shows that this is not the case: indeed, for any VC class, the cardinality of $\mathcal{S}(x_1^n)$ can grow at most polynomially in $n$.

Proposition 4.3 (Vapnik–Chervonenkis, Sauer and Shelah). Consider a set class $\mathcal{S}$ with $\nu(\mathcal{S}) < \infty$. Then for any collection of points $x_1^n = (x_1, \ldots, x_n)$, we have
$$\operatorname{card}(\mathcal{S}(x_1^n)) \overset{(i)}{\leq} \sum_{i=0}^{\nu(\mathcal{S})} \binom{n}{i} \overset{(ii)}{\leq} (n+1)^{\nu(\mathcal{S})}. \qquad (4.5)$$

We prove inequality (i) in the Appendix. Given inequality (i), inequality (ii) can be established by elementary combinatorial arguments, so we leave it as an exercise for the reader (part (a) of Exercise 4.11). Part (b) of the same exercise establishes a sharper upper bound.

Controlling the VC dimension

Since classes with finite VC dimension have polynomial discrimination, it is of interest to develop techniques for controlling the VC dimension.

Basic operations

The property of having finite VC dimension is preserved under a number of basic operations, as summarized in the following.

Proposition 4.4. Let $\mathcal{S}$ and $\mathcal{T}$ be set classes with finite VC dimensions $\nu(\mathcal{S})$ and $\nu(\mathcal{T})$, respectively. Then each of the following set classes also has finite VC dimension:

(a) The set class $\mathcal{S}^c := \{S^c \mid S \in \mathcal{S}\}$, where $S^c$ denotes the complement of $S$.

(b) The set class $\mathcal{S} \sqcup \mathcal{T} := \{S \cup T \mid S \in \mathcal{S}, \; T \in \mathcal{T}\}$.

(c) The set class $\mathcal{S} \sqcap \mathcal{T} := \{S \cap T \mid S \in \mathcal{S}, \; T \in \mathcal{T}\}$.

We leave the proof of this result as an exercise for the reader.

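The two inequalities in (4.5) are easy to sanity-check numerically. The following sketch (illustrative code, not part of the text) verifies that the binomial sum is bounded by $(n+1)^{\nu}$, and that for $n > \nu$ it is strictly smaller than $2^n$:

```python
from math import comb

def sauer_bound(n, vc):
    """Binomial sum from inequality (i): sum_{i=0}^{vc} C(n, i)."""
    return sum(comb(n, i) for i in range(vc + 1))

for n in range(1, 21):
    for vc in range(0, n + 1):
        # Inequality (ii): the polynomial upper bound (n + 1)^vc.
        assert sauer_bound(n, vc) <= (n + 1) ** vc
        # For n > vc, the count is strictly smaller than 2^n.
        if vc < n:
            assert sauer_bound(n, vc) < 2 ** n
```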
Vector space structure

Any class $\mathcal{G}$ of real-valued functions defines a class of sets by the operation of taking subgraphs. In particular, given a real-valued function $g : \mathcal{X} \to \mathbb{R}$, its subgraph at level zero is the subset $S_g := \{x \in \mathcal{X} \mid g(x) \leq 0\}$. In this way, we can associate to $\mathcal{G}$ the collection of subsets $\mathcal{S}(\mathcal{G}) := \{S_g \mid g \in \mathcal{G}\}$, which we refer to as the subgraph class of $\mathcal{G}$. Many interesting classes of sets are naturally defined in this way, among them half-spaces, ellipsoids and so on. In many cases, the underlying function class $\mathcal{G}$ is a vector space, and the following result allows us to upper bound the VC dimension of the associated set class $\mathcal{S}(\mathcal{G})$.

Proposition 4.5 (Finite-dimensional vector spaces). Let $\mathcal{G}$ be a vector space of functions $g : \mathbb{R}^d \to \mathbb{R}$ with $\dim(\mathcal{G}) < \infty$. Then the subgraph class $\mathcal{S}(\mathcal{G})$ has VC dimension at most $\dim(\mathcal{G})$.

Proof. By the definition of VC dimension, it suffices to show that no collection of $n = \dim(\mathcal{G}) + 1$ points in $\mathbb{R}^d$ can be shattered by $\mathcal{S}(\mathcal{G})$. Fix a collection $x_1^n = \{x_1, \ldots, x_n\}$ of $n$ points in $\mathbb{R}^d$, and consider the linear map $L : \mathcal{G} \to \mathbb{R}^n$ given by $L(g) = (g(x_1), \ldots, g(x_n))$. By construction, the range of the mapping $L$ is a linear subspace of $\mathbb{R}^n$ of dimension at most $\dim(\mathcal{G}) = n - 1 < n$. Therefore, there must exist a non-zero vector $\gamma \in \mathbb{R}^n$ such that $\langle \gamma, L(g) \rangle = 0$ for all $g \in \mathcal{G}$. We may assume without loss of generality that at least one $\gamma_i$ is positive, and then write
$$\sum_{\{i \mid \gamma_i \leq 0\}} (-\gamma_i) \, g(x_i) = \sum_{\{i \mid \gamma_i > 0\}} \gamma_i \, g(x_i) \quad \text{for all } g \in \mathcal{G}. \qquad (4.6)$$
Now suppose that there exists some $g \in \mathcal{G}$ such that the associated subgraph set $S_g = \{x \in \mathbb{R}^d \mid g(x) \leq 0\}$ includes only the subset $\{x_i \mid \gamma_i \leq 0\}$. For such a function $g$, the right-hand side of equation (4.6) would be strictly positive, while the left-hand side would be non-positive, which is a contradiction. We conclude that $\mathcal{G}$ fails to shatter the set $\{x_1, \ldots, x_n\}$, as claimed.

Let us illustrate the use of Proposition 4.5 with some examples:

Example 4.9 (Linear functions in $\mathbb{R}^d$).
For a pair $(a, b) \in \mathbb{R}^d \times \mathbb{R}$, define the function $f_{a,b}(x) := \langle a, x \rangle + b$, and consider the family $\mathcal{L}^d := \{f_{a,b} \mid (a, b) \in \mathbb{R}^d \times \mathbb{R}\}$ of all such linear functions. The associated subgraph class $\mathcal{S}(\mathcal{L}^d)$ corresponds to the collection of all half-spaces of the form $H_{a,b} := \{x \in \mathbb{R}^d \mid \langle a, x \rangle + b \leq 0\}$. Since the family $\mathcal{L}^d$ forms a vector space of dimension $d + 1$, we obtain as an immediate consequence of Proposition 4.5 that $\mathcal{S}(\mathcal{L}^d)$ has VC dimension at most $d + 1$.

For the special case $d = 1$, let us verify this statement by a more direct calculation. In this case, the class $\mathcal{S}(\mathcal{L}^1)$ corresponds to the collection of all left-sided or right-sided

intervals, that is, $\mathcal{S}(\mathcal{L}^1) = \{(-\infty, t] \mid t \in \mathbb{R}\} \cup \{[t, \infty) \mid t \in \mathbb{R}\}$. Given any two distinct points $x_1 < x_2$, the collection of all such intervals can pick out all possible subsets. However, given any three points $x_1 < x_2 < x_3$, there is no interval contained in $\mathcal{S}(\mathcal{L}^1)$ that contains $x_2$ while excluding both $x_1$ and $x_3$. This calculation shows that $\nu(\mathcal{S}(\mathcal{L}^1)) = 2$, which matches the upper bound obtained from Proposition 4.5. More generally, it can be shown that the VC dimension of $\mathcal{S}(\mathcal{L}^d)$ is $d + 1$, so that Proposition 4.5 yields a sharp result in all dimensions.

Example 4.10 (Spheres in $\mathbb{R}^d$). Consider the sphere $S_{a,b} := \{x \in \mathbb{R}^d \mid \|x - a\|_2 \leq b\}$, where $(a, b) \in \mathbb{R}^d \times \mathbb{R}_+$ specify its center and radius, respectively, and let $\mathcal{S}^d_{\text{sphere}}$ denote the collection of all such spheres. If we define the function
$$f_{a,b}(x) := \|x\|_2^2 - 2 \sum_{j=1}^d a_j x_j + \|a\|_2^2 - b^2,$$
then we have $S_{a,b} = \{x \in \mathbb{R}^d \mid f_{a,b}(x) \leq 0\}$, so that the sphere $S_{a,b}$ is a subgraph of the function $f_{a,b}$.

In order to leverage Proposition 4.5, we first define a feature map $\phi : \mathbb{R}^d \to \mathbb{R}^{d+1}$ via $\phi(x) := (x_1, \ldots, x_d, 1)$, and then consider functions of the form
$$g_c(x) := \langle c, \phi(x) \rangle + \|x\|_2^2, \quad \text{where } c \in \mathbb{R}^{d+1}.$$
The family of functions $\{g_c \mid c \in \mathbb{R}^{d+1}\}$ is contained in a vector space of dimension $d + 2$, and it contains the function class $\{f_{a,b} \mid (a, b) \in \mathbb{R}^d \times \mathbb{R}_+\}$. Consequently, by applying Proposition 4.5 to this larger vector space, we conclude that $\nu(\mathcal{S}^d_{\text{sphere}}) \leq d + 2$. This bound is adequate for many purposes, but is not sharp: a more careful analysis shows that the VC dimension of spheres in $\mathbb{R}^d$ is actually $d + 1$. See Exercise 4.9 for an in-depth exploration of the case $d = 2$.

Appendix A: Proof of Proposition 4.3

We prove inequality (i) in Proposition 4.3 by establishing the following more general inequality:

Lemma 4.. Let $A$ be a finite set and let $\mathcal{U}$ be a class of subsets of $A$. Then
$$\operatorname{card}(\mathcal{U}) \leq \operatorname{card}\big(\{B \subseteq A \mid B \text{ is shattered by } \mathcal{U}\}\big). \qquad (4.7)$$

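Before turning to the proof, inequality (4.7) can be sanity-checked by brute force on small examples. The sketch below (illustrative code, not part of the text) counts the subsets of $A$ shattered by a small class $\mathcal{U}$:

```python
from itertools import combinations

def subsets(s):
    """All subsets of s, as frozensets."""
    s = sorted(s)
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

def is_shattered(B, U):
    """B is shattered by U if every subset of B arises as U_set & B."""
    picked = {frozenset(u & B) for u in U}
    return all(b in picked for b in subsets(B))

A = frozenset({1, 2, 3})
U = [frozenset(), frozenset({1}), frozenset({2}), frozenset({1, 3})]
shattered = [B for B in subsets(A) if is_shattered(B, U)]
# Inequality (4.7): the class is no larger than its family of shattered sets.
assert len(U) <= len(shattered)
```

Here the four members of $\mathcal{U}$ shatter exactly the four subsets $\emptyset, \{1\}, \{2\}, \{3\}$, so (4.7) holds with equality.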
To see that this lemma implies inequality (i), note that if $B \subseteq A$ is shattered by $\mathcal{S}$, then we must have $\operatorname{card}(B) \leq \nu(\mathcal{S})$. Consequently, if we let $A = \{x_1, \ldots, x_n\}$ and set $\mathcal{U} = \mathcal{S} \cap A := \{S \cap A \mid S \in \mathcal{S}\}$, then Lemma 4. implies that
$$\operatorname{card}(\mathcal{S}(x_1^n)) = \operatorname{card}(\mathcal{S} \cap A) \leq \operatorname{card}\big(\{B \subseteq A \mid \operatorname{card}(B) \leq \nu(\mathcal{S})\}\big) \leq \sum_{i=0}^{\nu(\mathcal{S})} \binom{n}{i},$$
as claimed.

It remains to prove Lemma 4.. For a given $x \in A$, let us define an operator $T_x$ on sets $U \in \mathcal{U}$ via
$$T_x(U) = \begin{cases} U \setminus \{x\} & \text{if } x \in U \text{ and } U \setminus \{x\} \notin \mathcal{U}, \\ U & \text{otherwise.} \end{cases}$$
We let $T_x(\mathcal{U})$ be the new class of sets defined by applying $T_x$ to each member of $\mathcal{U}$, namely $T_x(\mathcal{U}) := \{T_x(U) \mid U \in \mathcal{U}\}$.

We first claim that $T_x$ is a one-to-one mapping between $\mathcal{U}$ and $T_x(\mathcal{U})$, and hence that $\operatorname{card}(T_x(\mathcal{U})) = \operatorname{card}(\mathcal{U})$. To establish this claim, for any pair of sets $U, U' \in \mathcal{U}$ such that $T_x(U) = T_x(U')$, we must prove that $U = U'$. We divide the argument into three separate cases:

Case 1: $x \notin U$ and $x \notin U'$. Given $x \notin U$, we have $T_x(U) = U$, and hence $T_x(U') = U$. Moreover, $x \notin U'$ implies that $T_x(U') = U'$. Combining the equalities yields $U = U'$.

Case 2: $x \notin U$ and $x \in U'$. In this case, we have $T_x(U) = U = T_x(U')$, so that $x \in U'$ but $x \notin T_x(U')$. But this condition implies that $T_x(U') = U' \setminus \{x\} \notin \mathcal{U}$, which contradicts the fact that $T_x(U') = U \in \mathcal{U}$. By symmetry, the case $x \in U$ and $x \notin U'$ is identical.

Case 3: $x \in U \cap U'$. If both of $U \setminus \{x\}$ and $U' \setminus \{x\}$ belong to $\mathcal{U}$, then $T_x(U) = U$ and $T_x(U') = U'$, from which $U = U'$ follows. If neither $U \setminus \{x\}$ nor $U' \setminus \{x\}$ belongs to $\mathcal{U}$, then we can conclude that $U \setminus \{x\} = U' \setminus \{x\}$, and hence $U = U'$. Finally, if $U \setminus \{x\} \notin \mathcal{U}$ but $U' \setminus \{x\} \in \mathcal{U}$, then $T_x(U) = U \setminus \{x\} \notin \mathcal{U}$ but $T_x(U') = U' \in \mathcal{U}$, which is a contradiction.

We next claim that if $T_x(\mathcal{U})$ shatters a set $B$, then so does $\mathcal{U}$. If $x \notin B$, then both $\mathcal{U}$ and $T_x(\mathcal{U})$ pick out the same set of subsets of $B$. Otherwise, suppose that $x \in B$. Since $T_x(\mathcal{U})$ shatters $B$, for any subset $B' \subseteq B \setminus \{x\}$, there is a subset $T \in T_x(\mathcal{U})$ such that $T \cap B = B' \cup \{x\}$. Since $T = T_x(U)$ for some subset $U \in \mathcal{U}$ and $x \in T$, we conclude that both $U$ and $U \setminus \{x\}$ must belong to $\mathcal{U}$, so that $\mathcal{U}$ picks out both $B' \cup \{x\}$ and $B'$, and hence $\mathcal{U}$ also shatters $B$.
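The two claims just established, that $T_x$ preserves cardinality and that any set shattered after applying $T_x$ was already shattered before, can be verified by brute force. The following sketch (illustrative code with hypothetical helper names, not part of the text) applies the operator to a randomly chosen class:

```python
import random
from itertools import combinations

def all_subsets(s):
    s = sorted(s)
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

def T(x, U):
    """Apply the operator T_x to every member of the class U."""
    members = [frozenset(u) for u in U]
    return [u - {x} if x in u and (u - {x}) not in members else u
            for u in members]

def shattered_family(A, U):
    """All subsets B of A shattered by U (brute force)."""
    return [B for B in all_subsets(A)
            if all(any(u & B == b for u in U) for b in all_subsets(B))]

random.seed(0)
A = {0, 1, 2, 3}
U = random.sample(all_subsets(A), 7)
for x in A:
    V = T(x, U)
    assert len(set(V)) == len(set(U))  # T_x is one-to-one on U
    # every set shattered by T_x(U) is also shattered by U
    assert set(shattered_family(A, V)) <= set(shattered_family(A, U))
```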
Using the fact that $T_x$ preserves cardinalities and does not increase shattering numbers, we can now conclude the proof of the lemma. Define the weight function

$\omega(\mathcal{U}) = \sum_{U \in \mathcal{U}} \operatorname{card}(U)$, and note that applying a transformation $T_x$ can only reduce this weight function: i.e., $\omega(T_x(\mathcal{U})) \leq \omega(\mathcal{U})$. Consequently, by applying the transformations $\{T_x\}$ to $\mathcal{U}$ repeatedly, we can obtain a new class of sets $\mathcal{U}'$ such that $\operatorname{card}(\mathcal{U}) = \operatorname{card}(\mathcal{U}')$ and the weight $\omega(\mathcal{U}')$ is minimal. Then for any $U \in \mathcal{U}'$ and any $x \in U$, we have $U \setminus \{x\} \in \mathcal{U}'$. (Otherwise, we would have $\omega(T_x(\mathcal{U}')) < \omega(\mathcal{U}')$, contradicting minimality.) Therefore, the set class $\mathcal{U}'$ shatters every one of its own members. Since $\mathcal{U}$ shatters at least as many subsets as $\mathcal{U}'$, and $\operatorname{card}(\mathcal{U}') = \operatorname{card}(\mathcal{U})$ by construction, the claim (4.7) follows.

4.4 Bibliographic details and background

First, a technical remark regarding measurability: in general, the quantity $\|P_n - P\|_{\mathcal{F}}$ need not be measurable, since the function class $\mathcal{F}$ may contain an uncountable number of elements. If the function class is separable, then we may simply take the supremum over a countable dense subset. In general, there are various ways of dealing with this issue, including the use of outer probability (cf. van der Vaart and Wellner [vdVW96]). Here we instead adopt the following convention. For any finite class of functions $\mathcal{G}$ contained within $\mathcal{F}$, the variable $\|P_n - P\|_{\mathcal{G}}$ is well-defined, so that it is sensible to define
$$\|P_n - P\|_{\mathcal{F}} := \sup\big\{\|P_n - P\|_{\mathcal{G}} \mid \mathcal{G} \subseteq \mathcal{F}, \; \mathcal{G} \text{ has finite cardinality}\big\}.$$
By using this definition, we can always think instead about suprema over finite sets.

Theorem 4. was originally proved by Glivenko [Gli33] for the continuous case, and by Cantelli [Can33] in the general setting. The non-asymptotic form of the Glivenko–Cantelli theorem given in Corollary 4. can be sharpened substantially. For instance, Dvoretzky, Kiefer and Wolfowitz [DKW56] prove that there is a constant $C$, independent of $F$ and $n$, such that
$$\mathbb{P}\big[\|\widehat{F}_n - F\|_\infty \geq \delta\big] \leq C e^{-2n\delta^2} \quad \text{for all } \delta \geq 0. \qquad (4.8)$$
Massart [Mas90] establishes the sharpest possible result, including control of the leading constant.
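As a numerical illustration of the bound (4.8), the following simulation (illustrative code, not from the text; the constant $C = 2$ is Massart's sharp value) compares the empirical frequency of a large sup-norm deviation of the empirical CDF with the right-hand side of (4.8):

```python
import math
import random

def sup_deviation(xs):
    """sup_t |F_n(t) - F(t)| for samples xs from Uniform[0,1]; the supremum
    is attained just before or after an order statistic."""
    n, xs = len(xs), sorted(xs)
    return max(max(abs((i + 1) / n - x), abs(i / n - x))
               for i, x in enumerate(xs))

rng = random.Random(1)
n, delta, trials = 200, 0.12, 2000
freq = sum(sup_deviation([rng.random() for _ in range(n)]) >= delta
           for _ in range(trials)) / trials
bound = 2 * math.exp(-2 * n * delta ** 2)  # right-hand side of (4.8), C = 2
print(f"empirical frequency {freq:.4f} vs DKW bound {bound:.4f}")
```

Since the bound with $C = 2$ is essentially tight for the uniform distribution, the two numbers are of comparable magnitude.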
The Rademacher complexity, and its relative the Gaussian complexity, have a lengthy history in the study of Banach spaces using probabilistic methods; for instance, see the books [MS86, Pis89, LT91]. In Chapter 5, we develop further connections between these two forms of complexity and the related notion of metric entropy. Rademacher and Gaussian complexities have also been studied in the specific context of uniform laws of large numbers and empirical risk minimization (e.g., [BM02, BBM05, Kol01, Kol06, KP00, vdVW96]). The proof of Proposition 4.3 is adapted from Ledoux and Talagrand [LT91], who credit the approach to Frankl [Fra83]. Exercise 5.4 is adapted from Problem .6.3 from

van der Vaart and Wellner [vdVW96]. The proof of Proposition 4.5 is adapted from Pollard [Pol84], who credits it to Steele [Ste78] and Dudley [Dud78].

4.5 Exercises

Exercise 4.1. Recall that the functional $\gamma$ is continuous in the sup-norm at $F$ if for all $\epsilon > 0$, there exists a $\delta > 0$ such that $\|G - F\|_\infty \leq \delta$ implies that $|\gamma(G) - \gamma(F)| \leq \epsilon$.

(a) Given $n$ i.i.d. samples with law specified by $F$, let $\widehat{F}_n$ be the empirical CDF. Show that if $\gamma$ is continuous in the sup-norm at $F$, then $\gamma(\widehat{F}_n) \overset{\text{prob.}}{\longrightarrow} \gamma(F)$.

(b) Which of the following functionals are continuous with respect to the sup-norm? Prove or disprove.

(i) The mean functional $F \mapsto \int x \, dF(x)$.

(ii) The Cramér–von Mises functional $F \mapsto \int [F(x) - F_0(x)]^2 \, dF_0(x)$.

(iii) The quantile functional $Q_p(F) = \inf\{t \in \mathbb{R} \mid F(t) \geq p\}$.

Exercise 4.2. Recall from Example 4.5 the class $\mathcal{S}$ of all subsets $S$ of $[0,1]$ for which $S$ has a finite number of elements. Prove that the Rademacher complexity satisfies the lower bound
$$\mathcal{R}_n(\mathcal{S}) = \mathbb{E}_{X,\varepsilon}\Big[\sup_{S \in \mathcal{S}} \Big|\frac{1}{n} \sum_{i=1}^n \varepsilon_i I_S[X_i]\Big|\Big] \geq \frac{1}{2}. \qquad (4.9)$$
Discuss the connection to Theorem 4..

Exercise 4.3. In this exercise, we work through the proof of Proposition 4..

(a) Recall the re-centered function class $\bar{\mathcal{F}} = \{f - \mathbb{E}[f] \mid f \in \mathcal{F}\}$. Show that
$$\Big| \mathbb{E}_{X,\varepsilon}\big[\mathcal{R}_n(\bar{\mathcal{F}})\big] - \mathbb{E}_{X,\varepsilon}\big[\mathcal{R}_n(\mathcal{F})\big] \Big| \leq \frac{\sup_{f \in \mathcal{F}} |\mathbb{E}[f]|}{\sqrt{n}}.$$

(b) Use concentration results to complete the proof of Proposition 4..

Exercise 4.4. Consider the function class
$$\mathcal{F} = \big\{x \mapsto \operatorname{sign}(\langle \theta, x \rangle) \mid \theta \in \mathbb{R}^d, \; \|\theta\|_2 = 1\big\},$$
corresponding to the $\{-1, +1\}$-valued classification rules defined by linear functions in $\mathbb{R}^d$. Supposing that $d \geq n$, let $x_1^n = \{x_1, \ldots, x_n\}$ be a collection of vectors in $\mathbb{R}^d$ that

are linearly independent. Show that the empirical Rademacher complexity satisfies
$$\mathcal{R}\big(\mathcal{F}(x_1^n)/n\big) = \mathbb{E}_\varepsilon\Big[\sup_{f \in \mathcal{F}} \Big|\frac{1}{n} \sum_{i=1}^n \varepsilon_i f(x_i)\Big|\Big] = 1.$$
Discuss the consequences for empirical risk minimization over the class $\mathcal{F}$.

Exercise 4.5. Prove the following properties of the Rademacher complexity.

(a) Show that $\mathcal{R}_n(\mathcal{F}) = \mathcal{R}_n(\operatorname{conv}(\mathcal{F}))$.

(b) Show that $\mathcal{R}_n(\mathcal{F} + \mathcal{G}) \leq \mathcal{R}_n(\mathcal{F}) + \mathcal{R}_n(\mathcal{G})$. Give an example to demonstrate that this bound cannot be improved in general.

(c) Given a fixed and uniformly bounded function $g$, show that
$$\mathcal{R}_n(\mathcal{F} + g) \leq \mathcal{R}_n(\mathcal{F}) + \frac{\|g\|_\infty}{\sqrt{n}}. \qquad (4.30)$$

Exercise 4.6. Let $\mathcal{S}$ and $\mathcal{T}$ be two classes of sets with finite VC dimensions. Which of the following set classes have finite VC dimension? Prove or disprove.

(a) The set class $\mathcal{S}^c := \{S^c \mid S \in \mathcal{S}\}$, where $S^c$ denotes the complement of the set $S$.

(b) The set class $\mathcal{S} \sqcup \mathcal{T} := \{S \cup T \mid S \in \mathcal{S}, \; T \in \mathcal{T}\}$.

(c) The set class $\mathcal{S} \sqcap \mathcal{T} := \{S \cap T \mid S \in \mathcal{S}, \; T \in \mathcal{T}\}$.

Exercise 4.7. Prove Lemma 4..

Exercise 4.8. Consider the class of left-sided half-intervals in $\mathbb{R}^d$:
$$\mathcal{S}^d_{\text{left}} := \big\{(-\infty, t_1] \times (-\infty, t_2] \times \cdots \times (-\infty, t_d] \mid (t_1, \ldots, t_d) \in \mathbb{R}^d\big\}.$$
Show that for any collection of $n$ points, we have $\operatorname{card}(\mathcal{S}^d_{\text{left}}(x_1^n)) \leq (n+1)^d$ and $\nu(\mathcal{S}^d_{\text{left}}) = d$.

Exercise 4.9. Consider the class of all spheres in $\mathbb{R}^2$, that is,
$$\mathcal{S}^2_{\text{sphere}} := \big\{S_{a,b} \mid (a, b) \in \mathbb{R}^2 \times \mathbb{R}_+\big\}, \qquad (4.31)$$
where $S_{a,b} := \{x \in \mathbb{R}^2 \mid \|x - a\|_2 \leq b\}$ is the sphere of radius $b \geq 0$ centered at $a = (a_1, a_2)$.

(a) Show that $\mathcal{S}^2_{\text{sphere}}$ can shatter any subset of three points that are not collinear.

(b) Show that for any subset of four points, the balls can discriminate at most 15 out of the 16 possible subsets, and conclude that $\nu(\mathcal{S}^2_{\text{sphere}}) = 3$.

Exercise 4.10. Show that the class $\mathcal{C}^d_{\text{cc}}$ of all closed and convex sets in $\mathbb{R}^d$ does not have finite VC dimension. (Hint: Consider a set of $n$ points on the boundary of the unit ball.)

Exercise 4.11. (a) Prove inequality (ii) in Proposition 4.3.

(b) Prove the sharper upper bound $\operatorname{card}(\mathcal{S}(x_1^n)) \leq \big(\frac{en}{\nu}\big)^\nu$. (Hint: You might find the result of Exercise .9 useful.)
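Several of these exercises can be explored numerically before attempting a proof. The following brute-force sketch (illustrative code, not part of the text; a successful shattering over a finite parameter pool only certifies a lower bound on the VC dimension, while a failure proves nothing by itself) checks shattering for the two-sided intervals of Example 4.8:

```python
def shatters(points, classifiers):
    """True if the finite pool `classifiers` realizes every {0,1}-labeling
    of `points`."""
    realized = {tuple(c(x) for x in points) for c in classifiers}
    return len(realized) == 2 ** len(points)

# Two-sided intervals (b, a]: a pool built from a grid of endpoints.
grid = [i / 10 for i in range(-1, 12)]
intervals = [lambda x, b=b, a=a: int(b < x <= a)
             for b in grid for a in grid if b < a]

assert shatters([0.3, 0.6], intervals)           # two points: nu >= 2
assert not shatters([0.2, 0.5, 0.8], intervals)  # {x1, x3} cannot be picked out
```

For three points the failure here is genuine (no interval whatsoever picks out the outer pair), matching the conclusion $\nu(\mathcal{S}_{\text{two}}) = 2$ from Example 4.8.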


X n D X lim n F n (x) = F (x) for all x C F. lim n F n(u) = F (u) for all u C F. (2) 14:17 11/16/2 TOPIC. Convergence in distribution and related notions. This section studies the notion of the so-called convergence in distribution of real random variables. This is the kind of convergence

More information

2. Function spaces and approximation

2. Function spaces and approximation 2.1 2. Function spaces and approximation 2.1. The space of test functions. Notation and prerequisites are collected in Appendix A. Let Ω be an open subset of R n. The space C0 (Ω), consisting of the C

More information

CHAPTER 7. Connectedness

CHAPTER 7. Connectedness CHAPTER 7 Connectedness 7.1. Connected topological spaces Definition 7.1. A topological space (X, T X ) is said to be connected if there is no continuous surjection f : X {0, 1} where the two point set

More information

Problem set 1, Real Analysis I, Spring, 2015.

Problem set 1, Real Analysis I, Spring, 2015. Problem set 1, Real Analysis I, Spring, 015. (1) Let f n : D R be a sequence of functions with domain D R n. Recall that f n f uniformly if and only if for all ɛ > 0, there is an N = N(ɛ) so that if n

More information

The Vapnik-Chervonenkis Dimension

The Vapnik-Chervonenkis Dimension The Vapnik-Chervonenkis Dimension Prof. Dan A. Simovici UMB 1 / 91 Outline 1 Growth Functions 2 Basic Definitions for Vapnik-Chervonenkis Dimension 3 The Sauer-Shelah Theorem 4 The Link between VCD and

More information

THE INVERSE FUNCTION THEOREM

THE INVERSE FUNCTION THEOREM THE INVERSE FUNCTION THEOREM W. PATRICK HOOPER The implicit function theorem is the following result: Theorem 1. Let f be a C 1 function from a neighborhood of a point a R n into R n. Suppose A = Df(a)

More information

Generalization Bounds in Machine Learning. Presented by: Afshin Rostamizadeh

Generalization Bounds in Machine Learning. Presented by: Afshin Rostamizadeh Generalization Bounds in Machine Learning Presented by: Afshin Rostamizadeh Outline Introduction to generalization bounds. Examples: VC-bounds Covering Number bounds Rademacher bounds Stability bounds

More information

Week 2: Sequences and Series

Week 2: Sequences and Series QF0: Quantitative Finance August 29, 207 Week 2: Sequences and Series Facilitator: Christopher Ting AY 207/208 Mathematicians have tried in vain to this day to discover some order in the sequence of prime

More information

Lecture 1 Measure concentration

Lecture 1 Measure concentration CSE 29: Learning Theory Fall 2006 Lecture Measure concentration Lecturer: Sanjoy Dasgupta Scribe: Nakul Verma, Aaron Arvey, and Paul Ruvolo. Concentration of measure: examples We start with some examples

More information

Analysis Finite and Infinite Sets The Real Numbers The Cantor Set

Analysis Finite and Infinite Sets The Real Numbers The Cantor Set Analysis Finite and Infinite Sets Definition. An initial segment is {n N n n 0 }. Definition. A finite set can be put into one-to-one correspondence with an initial segment. The empty set is also considered

More information

I. The space C(K) Let K be a compact metric space, with metric d K. Let B(K) be the space of real valued bounded functions on K with the sup-norm

I. The space C(K) Let K be a compact metric space, with metric d K. Let B(K) be the space of real valued bounded functions on K with the sup-norm I. The space C(K) Let K be a compact metric space, with metric d K. Let B(K) be the space of real valued bounded functions on K with the sup-norm Proposition : B(K) is complete. f = sup f(x) x K Proof.

More information

MATH 4200 HW: PROBLEM SET FOUR: METRIC SPACES

MATH 4200 HW: PROBLEM SET FOUR: METRIC SPACES MATH 4200 HW: PROBLEM SET FOUR: METRIC SPACES PETE L. CLARK 4. Metric Spaces (no more lulz) Directions: This week, please solve any seven problems. Next week, please solve seven more. Starred parts of

More information

Chapter 2 Metric Spaces

Chapter 2 Metric Spaces Chapter 2 Metric Spaces The purpose of this chapter is to present a summary of some basic properties of metric and topological spaces that play an important role in the main body of the book. 2.1 Metrics

More information

Real Analysis Problems

Real Analysis Problems Real Analysis Problems Cristian E. Gutiérrez September 14, 29 1 1 CONTINUITY 1 Continuity Problem 1.1 Let r n be the sequence of rational numbers and Prove that f(x) = 1. f is continuous on the irrationals.

More information

Lecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable

Lecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable Lecture Notes 1 Probability and Random Variables Probability Spaces Conditional Probability and Independence Random Variables Functions of a Random Variable Generation of a Random Variable Jointly Distributed

More information

Existence and Uniqueness

Existence and Uniqueness Chapter 3 Existence and Uniqueness An intellect which at a certain moment would know all forces that set nature in motion, and all positions of all items of which nature is composed, if this intellect

More information

Midterm 1. Every element of the set of functions is continuous

Midterm 1. Every element of the set of functions is continuous Econ 200 Mathematics for Economists Midterm Question.- Consider the set of functions F C(0, ) dened by { } F = f C(0, ) f(x) = ax b, a A R and b B R That is, F is a subset of the set of continuous functions

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 13: Entropy Calculations

Introduction to Empirical Processes and Semiparametric Inference Lecture 13: Entropy Calculations Introduction to Empirical Processes and Semiparametric Inference Lecture 13: Entropy Calculations Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations Research

More information

Lecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable

Lecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable Lecture Notes 1 Probability and Random Variables Probability Spaces Conditional Probability and Independence Random Variables Functions of a Random Variable Generation of a Random Variable Jointly Distributed

More information

Behaviour of Lipschitz functions on negligible sets. Non-differentiability in R. Outline

Behaviour of Lipschitz functions on negligible sets. Non-differentiability in R. Outline Behaviour of Lipschitz functions on negligible sets G. Alberti 1 M. Csörnyei 2 D. Preiss 3 1 Università di Pisa 2 University College London 3 University of Warwick Lars Ahlfors Centennial Celebration Helsinki,

More information

Sobolev spaces. May 18

Sobolev spaces. May 18 Sobolev spaces May 18 2015 1 Weak derivatives The purpose of these notes is to give a very basic introduction to Sobolev spaces. More extensive treatments can e.g. be found in the classical references

More information

Statistics 300B Winter 2018 Final Exam Due 24 Hours after receiving it

Statistics 300B Winter 2018 Final Exam Due 24 Hours after receiving it Statistics 300B Winter 08 Final Exam Due 4 Hours after receiving it Directions: This test is open book and open internet, but must be done without consulting other students. Any consultation of other students

More information

Statistical learning theory, Support vector machines, and Bioinformatics

Statistical learning theory, Support vector machines, and Bioinformatics 1 Statistical learning theory, Support vector machines, and Bioinformatics Jean-Philippe.Vert@mines.org Ecole des Mines de Paris Computational Biology group ENS Paris, november 25, 2003. 2 Overview 1.

More information

COMS 4771 Introduction to Machine Learning. Nakul Verma

COMS 4771 Introduction to Machine Learning. Nakul Verma COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW2 due now! Project proposal due on tomorrow Midterm next lecture! HW3 posted Last time Linear Regression Parametric vs Nonparametric

More information

Lecture 4 Lebesgue spaces and inequalities

Lecture 4 Lebesgue spaces and inequalities Lecture 4: Lebesgue spaces and inequalities 1 of 10 Course: Theory of Probability I Term: Fall 2013 Instructor: Gordan Zitkovic Lecture 4 Lebesgue spaces and inequalities Lebesgue spaces We have seen how

More information

Estimates for probabilities of independent events and infinite series

Estimates for probabilities of independent events and infinite series Estimates for probabilities of independent events and infinite series Jürgen Grahl and Shahar evo September 9, 06 arxiv:609.0894v [math.pr] 8 Sep 06 Abstract This paper deals with finite or infinite sequences

More information

n [ F (b j ) F (a j ) ], n j=1(a j, b j ] E (4.1)

n [ F (b j ) F (a j ) ], n j=1(a j, b j ] E (4.1) 1.4. CONSTRUCTION OF LEBESGUE-STIELTJES MEASURES In this section we shall put to use the Carathéodory-Hahn theory, in order to construct measures with certain desirable properties first on the real line

More information

The sample complexity of agnostic learning with deterministic labels

The sample complexity of agnostic learning with deterministic labels The sample complexity of agnostic learning with deterministic labels Shai Ben-David Cheriton School of Computer Science University of Waterloo Waterloo, ON, N2L 3G CANADA shai@uwaterloo.ca Ruth Urner College

More information

Tame definable topological dynamics

Tame definable topological dynamics Tame definable topological dynamics Artem Chernikov (Paris 7) Géométrie et Théorie des Modèles, 4 Oct 2013, ENS, Paris Joint work with Pierre Simon, continues previous work with Anand Pillay and Pierre

More information

Real Analysis Notes. Thomas Goller

Real Analysis Notes. Thomas Goller Real Analysis Notes Thomas Goller September 4, 2011 Contents 1 Abstract Measure Spaces 2 1.1 Basic Definitions........................... 2 1.2 Measurable Functions........................ 2 1.3 Integration..............................

More information

Continuity. Chapter 4

Continuity. Chapter 4 Chapter 4 Continuity Throughout this chapter D is a nonempty subset of the real numbers. We recall the definition of a function. Definition 4.1. A function from D into R, denoted f : D R, is a subset of

More information

Introductory Analysis I Fall 2014 Homework #9 Due: Wednesday, November 19

Introductory Analysis I Fall 2014 Homework #9 Due: Wednesday, November 19 Introductory Analysis I Fall 204 Homework #9 Due: Wednesday, November 9 Here is an easy one, to serve as warmup Assume M is a compact metric space and N is a metric space Assume that f n : M N for each

More information

Economics 204 Fall 2011 Problem Set 1 Suggested Solutions

Economics 204 Fall 2011 Problem Set 1 Suggested Solutions Economics 204 Fall 2011 Problem Set 1 Suggested Solutions 1. Suppose k is a positive integer. Use induction to prove the following two statements. (a) For all n N 0, the inequality (k 2 + n)! k 2n holds.

More information

THEOREMS, ETC., FOR MATH 515

THEOREMS, ETC., FOR MATH 515 THEOREMS, ETC., FOR MATH 515 Proposition 1 (=comment on page 17). If A is an algebra, then any finite union or finite intersection of sets in A is also in A. Proposition 2 (=Proposition 1.1). For every

More information

Lecture 3: Lower Bounds for Bandit Algorithms

Lecture 3: Lower Bounds for Bandit Algorithms CMSC 858G: Bandits, Experts and Games 09/19/16 Lecture 3: Lower Bounds for Bandit Algorithms Instructor: Alex Slivkins Scribed by: Soham De & Karthik A Sankararaman 1 Lower Bounds In this lecture (and

More information

Examples of Dual Spaces from Measure Theory

Examples of Dual Spaces from Measure Theory Chapter 9 Examples of Dual Spaces from Measure Theory We have seen that L (, A, µ) is a Banach space for any measure space (, A, µ). We will extend that concept in the following section to identify an

More information

NOTES ON THE REGULARITY OF QUASICONFORMAL HOMEOMORPHISMS

NOTES ON THE REGULARITY OF QUASICONFORMAL HOMEOMORPHISMS NOTES ON THE REGULARITY OF QUASICONFORMAL HOMEOMORPHISMS CLARK BUTLER. Introduction The purpose of these notes is to give a self-contained proof of the following theorem, Theorem.. Let f : S n S n be a

More information

Mathematics for Economists

Mathematics for Economists Mathematics for Economists Victor Filipe Sao Paulo School of Economics FGV Metric Spaces: Basic Definitions Victor Filipe (EESP/FGV) Mathematics for Economists Jan.-Feb. 2017 1 / 34 Definitions and Examples

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Vapnik Chervonenkis Theory Barnabás Póczos Empirical Risk and True Risk 2 Empirical Risk Shorthand: True risk of f (deterministic): Bayes risk: Let us use the empirical

More information

2 (Bonus). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure?

2 (Bonus). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure? MA 645-4A (Real Analysis), Dr. Chernov Homework assignment 1 (Due 9/5). Prove that every countable set A is measurable and µ(a) = 0. 2 (Bonus). Let A consist of points (x, y) such that either x or y is

More information

Spaces of continuous functions

Spaces of continuous functions Chapter 2 Spaces of continuous functions 2.8 Baire s Category Theorem Recall that a subset A of a metric space (X, d) is dense if for all x X there is a sequence from A converging to x. An equivalent definition

More information

Notes on uniform convergence

Notes on uniform convergence Notes on uniform convergence Erik Wahlén erik.wahlen@math.lu.se January 17, 2012 1 Numerical sequences We begin by recalling some properties of numerical sequences. By a numerical sequence we simply mean

More information

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 1 Entropy Since this course is about entropy maximization,

More information

Rademacher Averages and Phase Transitions in Glivenko Cantelli Classes

Rademacher Averages and Phase Transitions in Glivenko Cantelli Classes IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 48, NO. 1, JANUARY 2002 251 Rademacher Averages Phase Transitions in Glivenko Cantelli Classes Shahar Mendelson Abstract We introduce a new parameter which

More information

Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product

Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product Chapter 4 Hilbert Spaces 4.1 Inner Product Spaces Inner Product Space. A complex vector space E is called an inner product space (or a pre-hilbert space, or a unitary space) if there is a mapping (, )

More information

Brownian Motion. 1 Definition Brownian Motion Wiener measure... 3

Brownian Motion. 1 Definition Brownian Motion Wiener measure... 3 Brownian Motion Contents 1 Definition 2 1.1 Brownian Motion................................. 2 1.2 Wiener measure.................................. 3 2 Construction 4 2.1 Gaussian process.................................

More information

2. Metric Spaces. 2.1 Definitions etc.

2. Metric Spaces. 2.1 Definitions etc. 2. Metric Spaces 2.1 Definitions etc. The procedure in Section for regarding R as a topological space may be generalized to many other sets in which there is some kind of distance (formally, sets with

More information

Infinite-dimensional Vector Spaces and Sequences

Infinite-dimensional Vector Spaces and Sequences 2 Infinite-dimensional Vector Spaces and Sequences After the introduction to frames in finite-dimensional vector spaces in Chapter 1, the rest of the book will deal with expansions in infinitedimensional

More information