CHAPTER 4

Uniform laws of large numbers

The focus of this chapter is a class of results known as uniform laws of large numbers. As suggested by their name, these results represent a strengthening of the usual law of large numbers, which applies to a fixed sequence of random variables, to related laws that hold uniformly over collections of random variables. On one hand, such uniform laws are of theoretical interest in their own right, and represent an entry point to a rich area of probability and statistics known as empirical process theory. On the other hand, uniform laws also play a key role in more applied settings, including understanding the behavior of different types of statistical estimators. The classical versions of uniform laws are of an asymptotic nature, whereas more recent work in the area has emphasized non-asymptotic results. Consistent with the overall goals of this book, this chapter follows the non-asymptotic route, presenting results that apply to all sample sizes. In order to do so, we make use of the tail bounds and the notion of Rademacher complexity previously introduced in Chapter 2.

4.1 Motivation

We begin with some statistical motivations for deriving uniform laws of large numbers, first for the case of cumulative distribution functions (CDFs) and then for more general function classes.

4.1.1 Uniform convergence of CDFs

The law of any scalar random variable X can be fully specified by its cumulative distribution function, whose value at any point t ∈ R is given by F(t) := P[X ≤ t]. Now suppose that we are given a collection X_1^n = {X_1, X_2, ..., X_n} of n i.i.d. samples, each drawn according to the law specified by F. A natural estimate of F is the empirical CDF given by

    F̂_n(t) := (1/n) Σ_{i=1}^n I_{(−∞,t]}[X_i],        (4.1)

ROUGH DRAFT: DO NOT DISTRIBUTE! (M. Wainwright, February 4, 2015)

where I_{(−∞,t]}[x] is a {0,1}-valued indicator function for the event {x ≤ t}. Since the population CDF can be written as F(t) = E[I_{(−∞,t]}[X]], the empirical CDF is unbiased in a pointwise sense. Figure 4-1 provides some illustrations of empirical CDFs for the uniform distribution on the interval [0,1] for different sample sizes. Note that F̂_n is a random function, with the value F̂_n(t) corresponding to the fraction of samples that lie in the interval (−∞,t]. Consequently, for any fixed t ∈ R, the law of large numbers implies that F̂_n(t) converges to F(t) in probability.

[Figure 4-1. Plots of population and empirical CDF functions for the uniform distribution on [0,1]. (a) Empirical CDF based on n = 10 samples. (b) Empirical CDF based on n = 100 samples.]

In statistical settings, a typical use of the empirical CDF is to construct estimators of various quantities associated with the population CDF. Many such estimation problems can be formulated in terms of a functional γ that maps any CDF F to a real number γ(F), that is, F ↦ γ(F). Given a set of samples distributed according to F, the plug-in principle suggests replacing the unknown F with the empirical CDF F̂_n, thereby obtaining γ(F̂_n) as an estimate of γ(F). Let us illustrate this procedure via some examples.

Example 4.1 (Expectation functionals). Given some integrable function g, we may define the expectation functional γ_g via

    γ_g(F) := ∫ g(x) dF(x).        (4.2)

For instance, for the function g(x) = x, the functional γ_g maps F to E[X], where X is a random variable with CDF F. For any g, the plug-in estimate is given by γ_g(F̂_n) = (1/n) Σ_{i=1}^n g(X_i), corresponding to the sample mean of g(X). In the special case

g(x) = x, we recover the usual sample mean (1/n) Σ_{i=1}^n X_i as an estimate for the mean µ = E[X]. A similar interpretation applies to other choices of the underlying function g.

Example 4.2 (Quantile functionals). For any α ∈ [0,1], the quantile functional Q_α is given by

    Q_α(F) := inf{ t ∈ R | F(t) ≥ α }.        (4.3)

The median corresponds to the special case α = 0.5. The plug-in estimate is given by

    Q_α(F̂_n) := inf{ t ∈ R | (1/n) Σ_{i=1}^n I_{(−∞,t]}[X_i] ≥ α },        (4.4)

and corresponds to estimating the αth quantile of the distribution by the αth sample quantile. In the special case α = 0.5, this estimate corresponds to the sample median. Again, it is of interest to determine in what sense (if any) the random variable Q_α(F̂_n) approaches Q_α(F) as n becomes large. In this case, Q_α(F̂_n) is a fairly complicated, non-linear function of all the variables, so that this convergence does not follow immediately from a classical result such as the law of large numbers.

Example 4.3 (Goodness-of-fit functionals). It is frequently of interest to test the hypothesis of whether or not a given set of data has been drawn from a known distribution F_0. For instance, we might be interested in assessing departures from uniformity, in which case F_0 would be a uniform distribution on some interval, or departures from Gaussianity, in which case F_0 would specify a Gaussian with a fixed mean and variance. Such tests can be performed using functionals that measure the distance between F and the target CDF F_0, including the sup-norm distance ‖F − F_0‖_∞, or other distances such as the Cramér-von Mises criterion based on the functional

    γ(F) := ∫ [ F(x) − F_0(x) ]² dF_0(x).

For any plug-in estimator γ(F̂_n), an important question is to understand when it is consistent: when does γ(F̂_n) converge to γ(F) in probability (or almost surely)? This question can be addressed in a unified manner for many functionals by defining a notion of continuity.
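The plug-in recipe behind the expectation and quantile functionals above is easy to make concrete. The following is a minimal NumPy sketch (the function names are ours, not the text's): it computes the empirical CDF F̂_n, the plug-in expectation (1/n) Σ g(X_i), and the plug-in quantile Q_α(F̂_n) = inf{t | F̂_n(t) ≥ α}.

```python
import numpy as np

def empirical_cdf(samples):
    """Return t -> F_n(t), the fraction of samples lying in (-inf, t]."""
    x = np.sort(np.asarray(samples, dtype=float))
    # searchsorted with side="right" counts how many sorted samples are <= t
    return lambda t: np.searchsorted(x, t, side="right") / x.size

def plugin_expectation(samples, g):
    """Plug-in estimate of the expectation functional: (1/n) sum_i g(X_i)."""
    return float(np.mean(g(np.asarray(samples, dtype=float))))

def plugin_quantile(samples, alpha):
    """Plug-in estimate Q_alpha(F_n) = inf{t : F_n(t) >= alpha}."""
    x = np.sort(np.asarray(samples, dtype=float))
    # F_n jumps only at sample points, so the infimum is attained at an
    # order statistic: the ceil(alpha * n)-th smallest sample.
    k = int(np.ceil(alpha * x.size)) - 1
    return float(x[max(k, 0)])
```

Because F̂_n is a step function, the infimum defining the plug-in quantile is attained at an order statistic, which is why `plugin_quantile` can simply index into the sorted sample.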
Given a pair of CDFs F and G, let us measure the distance between them using the sup-norm

    ‖G − F‖_∞ := sup_{t∈R} | G(t) − F(t) |.        (4.5)

We can then define the continuity of a functional γ with respect to this norm: more precisely, we say that the functional γ is continuous at F in the sup-norm if for all ε > 0, there exists a δ > 0 such that ‖G − F‖_∞ ≤ δ implies that |γ(G) − γ(F)| ≤ ε.

As we explore in the exercises, this notion is useful because, for any continuous functional, it reduces the consistency question for the plug-in estimator γ(F̂_n) to the issue of whether or not ‖F̂_n − F‖_∞ converges to zero. A classical result, known as the Glivenko-Cantelli theorem, addresses the latter question:

Theorem 4.1 (Glivenko-Cantelli). For any distribution, the empirical CDF F̂_n is a strongly consistent estimator of the population CDF F in the uniform norm, meaning that

    ‖F̂_n − F‖_∞ → 0  almost surely.        (4.6)

We provide a proof of this claim as a corollary of a more general result to follow (see Theorem 4.2). For statistical applications, an important consequence of Theorem 4.1 is that the plug-in estimate γ(F̂_n) is almost surely consistent as an estimator of γ(F) for any functional γ that is continuous with respect to the sup-norm. See the exercises for further exploration of these issues.

4.1.2 Uniform laws for more general function classes

We now turn to a more general consideration of uniform laws of large numbers. Let F be a class of integrable real-valued functions with domain X, and let X_1^n = {X_1, ..., X_n} be a collection of i.i.d. samples from some distribution P over X. Consider the random variable

    ‖P_n − P‖_F := sup_{f∈F} | (1/n) Σ_{i=1}^n f(X_i) − E[f] |,        (4.7)

which measures the absolute deviation between the sample average (1/n) Σ_{i=1}^n f(X_i) and the population average E[f] = E[f(X)], uniformly over the class F. (See the bibliographic section for a discussion of possible measurability concerns with this random variable.)

Definition 4.1. We say that F is a Glivenko-Cantelli class for P if ‖P_n − P‖_F converges to zero in probability as n → ∞.

This notion can also be defined in a stronger sense, requiring almost-sure convergence of ‖P_n − P‖_F, in which case we say that F satisfies a strong Glivenko-Cantelli law. The classical result on the empirical CDF (Theorem 4.1) can be reformulated as a particular case of this notion:

Example 4.4 (Empirical CDFs and indicator functions). Consider the function class

    F = { I_{(−∞,t]}(·) | t ∈ R },        (4.8)

where I_{(−∞,t]} is the {0,1}-valued indicator function of the interval (−∞,t]. For each fixed t ∈ R, we have E[ I_{(−∞,t]}(X) ] = P[X ≤ t] = F(t), so that the classical Glivenko-Cantelli theorem corresponds to a strong uniform law for the class (4.8).

Not all classes of functions are Glivenko-Cantelli, as illustrated by the following example.

Example 4.5 (Failure of a uniform law). Let S be the class of all subsets S of [0,1] with a finite number of elements, and consider the function class F_S = { I_S(·) | S ∈ S } of {0,1}-valued indicator functions of such sets. Suppose that the samples X_i are drawn from some distribution over [0,1] that has no atoms (i.e., P({x}) = 0 for all x ∈ [0,1]); this class includes any distribution that has a density with respect to Lebesgue measure. For any such distribution, we are guaranteed that P[S] = 0 for all S ∈ S. On the other hand, for any positive integer n ∈ N, the set X_1^n = {X_1, ..., X_n} belongs to S, and moreover, by definition of the empirical distribution, we have P_n[X_1^n] = 1. Putting together the pieces, we conclude that

    sup_{S∈S} | P_n[S] − P[S] | = |1 − 0| = 1,        (4.9)

so that the function class F_S is not a Glivenko-Cantelli class for P.

We have seen that the classical Glivenko-Cantelli law, which guarantees convergence of a special case of the variable ‖P_n − P‖_F, is of interest in analyzing estimators based on plugging in the empirical CDF. It is natural to ask in what other statistical contexts these quantities arise. In fact, variables of the form ‖P_n − P‖_F are ubiquitous throughout statistics; in particular, they lie at the heart of methods based on empirical risk minimization.
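The failure mode in the example above can be seen numerically: for an atomless distribution, the sample set itself is a finite subset witnessing a deviation of one at every sample size. A small sketch (illustrative, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
for n in [10, 100, 1000]:
    samples = rng.uniform(0.0, 1.0, size=n)
    S = set(samples.tolist())      # a finite subset of [0,1], hence a member of S
    # empirical mass of S: every sample lies in S by construction
    P_n_S = float(np.mean([x in S for x in samples.tolist()]))
    P_S = 0.0                      # population mass: P has no atoms, so P[S] = 0
    assert abs(P_n_S - P_S) == 1.0 # the deviation stays at 1 no matter how large n is
```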
In order to describe this notion more concretely, let us consider an indexed family of probability distributions { P_θ | θ ∈ Ω }, and suppose that we are given n samples X_1^n = {X_1, ..., X_n}, each sample lying in some space X. Suppose that the samples are drawn i.i.d. according to a distribution P_{θ*}, for some fixed but unknown θ* ∈ Ω. Here the index θ* could lie within a finite-dimensional space, such as Ω = R^d in a vector estimation problem, or could lie within some function class Ω = G, in which case the problem is of the non-parametric variety. In either case, a standard decision-theoretic approach to estimating θ* is based on minimizing a loss function of the form L_θ(x), which measures the fit between a parameter θ ∈ Ω and the sample x ∈ X. Given the collection of n samples X_1^n, the

principle of empirical risk minimization is based on the objective function

    R̂_n(θ, θ*) := (1/n) Σ_{i=1}^n L_θ(X_i).

This quantity is known as the empirical risk, since it is defined by the samples X_1^n, and our notation reflects the fact that these samples depend in turn on the unknown distribution P_{θ*}. This empirical risk should be contrasted with the population risk

    R(θ, θ*) := E_{θ*}[ L_θ(X) ],

where the expectation E_{θ*} is taken over a sample X ~ P_{θ*}. In practice, one minimizes the empirical risk over some subset Ω_0 of the full space Ω, thereby obtaining some estimate θ̂. The statistical question is how to bound the excess risk, measured in terms of the population quantities, namely the difference

    δR(θ̂, θ*) := R(θ̂, θ*) − inf_{θ∈Ω_0} R(θ, θ*).

Let us consider some examples to illustrate.

Example 4.6 (Maximum likelihood). Consider a family of distributions { P_θ, θ ∈ Ω }, each with a strictly positive density p_θ (defined with respect to a common underlying measure). Now suppose that we are given n i.i.d. samples from the unknown distribution P_{θ*}, and we would like to estimate the unknown parameter θ*. In order to do so, we consider the cost function

    L_θ(x) := log( p_{θ*}(x) / p_θ(x) ).

(The term p_{θ*}(x), which we have included for later theoretical convenience, has no effect on the minimization over θ.) Indeed, the maximum likelihood estimate is obtained by minimizing the empirical risk:

    θ̂ ∈ arg min_{θ∈Ω_0} (1/n) Σ_{i=1}^n log( p_{θ*}(X_i) / p_θ(X_i) ) = arg min_{θ∈Ω_0} { −(1/n) Σ_{i=1}^n log p_θ(X_i) },

where the first objective is precisely the empirical risk R̂_n(θ, θ*). The population risk is given by R(θ, θ*) = E_{θ*}[ log( p_{θ*}(X) / p_θ(X) ) ], a quantity known as the Kullback-Leibler divergence between p_{θ*} and p_θ. In the special case that θ* ∈ Ω_0, the excess risk is simply the Kullback-Leibler divergence between the true density p_{θ*} and the fitted model p_{θ̂}.
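As a concrete instance of the maximum-likelihood example, one can take the Gaussian location family p_θ = N(θ, 1), for which −log p_θ(x) equals (x − θ)²/2 up to a θ-independent constant, and the Kullback-Leibler divergence is (θ − θ*)²/2. The sketch below is a toy illustration (a finite grid stands in for Ω_0; the names and numbers are ours, not the text's): it minimizes the empirical risk and evaluates the resulting excess risk.

```python
import numpy as np

def empirical_risk(theta, samples):
    """Empirical risk for the Gaussian location family p_theta = N(theta, 1):
    up to a theta-independent constant, (1/n) * sum_i -log p_theta(X_i)."""
    x = np.asarray(samples, dtype=float)
    return float(np.mean(0.5 * (x - theta) ** 2))

def erm_estimate(samples, grid):
    """Minimize the empirical risk over a finite grid (a crude stand-in for
    Omega_0; the exact minimizer here is the sample mean)."""
    risks = [empirical_risk(th, samples) for th in grid]
    return float(grid[int(np.argmin(risks))])

rng = np.random.default_rng(1)
theta_star = 0.7
samples = rng.normal(theta_star, 1.0, size=2000)
grid = np.linspace(-2.0, 2.0, 801)         # grid spacing 0.005
theta_hat = erm_estimate(samples, grid)
# KL(N(theta*,1) || N(theta,1)) = (theta - theta*)^2 / 2, so the excess risk is:
excess = 0.5 * (theta_hat - theta_star) ** 2
```

Because the empirical risk here is a quadratic in θ, the grid minimizer is simply the grid point closest to the sample mean, matching the classical fact that the MLE for a Gaussian location family is the sample mean.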

Example 4.7 (Binary classification). Suppose that we observe n samples of the form (X_i, Y_i) ∈ R^d × {−1,+1}, where the vector X_i corresponds to a set of d predictors or features, and the binary variable Y_i corresponds to a label. We can view such data as being generated by some distribution P_X over the features, and a conditional distribution P_{Y|X}. Since Y takes binary values, the conditional distribution is fully specified by the likelihood ratio ψ(x) = P[Y = +1 | X = x] / P[Y = −1 | X = x]. The goal of binary classification is to estimate a function f : R^d → {−1,+1} that minimizes the probability of misclassification P[f(X) ≠ Y], for an independently drawn pair (X, Y). Note that this probability of error corresponds to the population risk for the cost function

    L_f(x, y) := 1 if f(x) ≠ y, and 0 otherwise.        (4.10)

A function that minimizes this probability of error is known as a Bayes classifier f*; in the special case of equally probable classes (P[Y = +1] = P[Y = −1] = 1/2), a Bayes classifier is given by f*(x) = +1 if ψ(x) ≥ 1, and f*(x) = −1 otherwise. Since the likelihood ratio ψ (and hence f*) is unknown, a natural approach to approximating the Bayes rule is based on choosing f to minimize the empirical risk

    R̂_n(f, f*) := (1/n) Σ_{i=1}^n I[ f(X_i) ≠ Y_i ] = (1/n) Σ_{i=1}^n L_f(X_i, Y_i),

corresponding to the fraction of training samples that are misclassified. Typically, the minimization over f is restricted to some subset of all possible decision rules. See Chapter 14 for some further discussion of how to analyze such methods for binary classification.

Returning to the main thread, our goal is to develop methods for controlling the excess risk. For simplicity, let us assume that there exists some θ_0 ∈ Ω_0 such that R(θ_0, θ*) = inf_{θ∈Ω_0} R(θ, θ*). With this notation, the excess risk can be decomposed as

    δR(θ̂, θ*) = { R(θ̂, θ*) − R̂_n(θ̂, θ*) } + { R̂_n(θ̂, θ*) − R̂_n(θ_0, θ*) } + { R̂_n(θ_0, θ*) − R(θ_0, θ*) },
where the three terms on the right-hand side are denoted T_1, T_2 and T_3, respectively. Note that the middle term T_2 is non-positive, since θ̂ minimizes the empirical risk over Ω_0. The third term T_3 can be dealt with in a relatively straightforward manner, because θ_0 is an unknown but non-random quantity.¹ Indeed, recalling the definition of the

¹ If the infimum is not achieved, then we choose an element θ_0 for which this equality holds up to some arbitrarily small tolerance ε > 0, and the analysis to follow holds up to this tolerance.

empirical risk, we have

    T_3 = | (1/n) Σ_{i=1}^n L_{θ_0}(X_i) − E[ L_{θ_0}(X) ] |,

corresponding to the deviation of a sample mean from its expectation for the random variable L_{θ_0}(X). This quantity can be controlled using the techniques introduced in Chapter 2, for instance via the Hoeffding bound when the samples are independent and the loss function is bounded. The first term can be written in a similar way, namely as

    T_1 = | (1/n) Σ_{i=1}^n L_{θ̂}(X_i) − E[ L_{θ̂}(X) ] |.

This quantity is more challenging to control because the parameter θ̂, in sharp contrast to the deterministic quantity θ_0, is now random. In general, it depends on all the samples X_1^n, since it was obtained by minimizing the empirical risk. For this reason, controlling the first term requires a stronger result, such as a uniform law of large numbers over the loss class L(Ω_0) := { L_θ, θ ∈ Ω_0 }. With this notation, we have

    T_1 ≤ sup_{θ∈Ω_0} | (1/n) Σ_{i=1}^n L_θ(X_i) − E[ L_θ(X) ] | = ‖P_n − P‖_{L(Ω_0)}.

Since T_3 is also dominated by this same quantity, we conclude that the excess risk is at most 2‖P_n − P‖_{L(Ω_0)}. This derivation demonstrates that the central challenge in analyzing estimators based on empirical risk minimization is to establish a uniform law of large numbers for the loss class L(Ω_0). We explore various concrete examples of this procedure in the exercises.

4.2 A uniform law via Rademacher complexity

Having developed various motivations for studying uniform laws, let us now turn to the technical details of deriving such results. An important quantity that underlies the study of uniform laws is the Rademacher complexity of the function class F. For any fixed collection x_1^n := (x_1, ..., x_n) of points, consider the subset of R^n given by

    F(x_1^n) := { (f(x_1), ..., f(x_n)) | f ∈ F }.        (4.11)

The set F(x_1^n) corresponds to all those vectors in R^n that can be realized by applying a function f ∈ F to the collection (x_1, ..., x_n), and the empirical Rademacher complexity

is given by

    R(F(x_1^n)/n) := E_ε[ sup_{f∈F} | (1/n) Σ_{i=1}^n ε_i f(x_i) | ].        (4.12)

Note that this definition coincides with our earlier definition of the Rademacher complexity of a set, as introduced in Chapter 2. Given a collection X_1^n := (X_1, ..., X_n) of random samples, the empirical Rademacher complexity R(F(X_1^n)/n) is a random variable. Taking its expectation yields the Rademacher complexity of the function class F, namely the deterministic quantity

    R_n(F) := E_X[ R(F(X_1^n)/n) ] = E_{X,ε}[ sup_{f∈F} | (1/n) Σ_{i=1}^n ε_i f(X_i) | ].        (4.13)

The Rademacher complexity is the average of the maximum correlation between the vector (f(X_1), ..., f(X_n)) and the noise vector (ε_1, ..., ε_n), where the maximum is taken over all functions f ∈ F. The intuition is a natural one: a function class is extremely large, and in fact too large for statistical purposes, if we can always find a function that has a high correlation with a randomly drawn noise vector. Conversely, when the Rademacher complexity decays as a function of the sample size, then it is impossible to find a function that correlates very highly in expectation with a randomly drawn noise vector.

We now make precise the connection between Rademacher complexity and the Glivenko-Cantelli property, in particular by showing that any bounded function class F with R_n(F) = o(1) is also a Glivenko-Cantelli class. More precisely, we prove a non-asymptotic statement, in terms of a tail bound for the probability that the random variable ‖P_n − P‖_F deviates substantially above a multiple of the Rademacher complexity.

Theorem 4.2. Let F be a class of functions f : X → R that is uniformly bounded (i.e., ‖f‖_∞ ≤ b for all f ∈ F). Then for all n ≥ 1 and δ ≥ 0, we have

    ‖P_n − P‖_F ≤ 2 R_n(F) + δ        (4.14)

with P-probability at least 1 − exp(−nδ²/(8b²)). Consequently, as long as R_n(F) = o(1), we have ‖P_n − P‖_F → 0 almost surely.

In order for Theorem 4.2 to be useful, we need to obtain upper bounds on the

Rademacher complexity. There are a variety of methods for doing so, ranging from direct calculations to alternative complexity measures. In Section 4.3, we develop some techniques for upper bounding the Rademacher complexity for indicator functions of half-intervals, as required for the classical Glivenko-Cantelli theorem (see Example 4.4); we also discuss the notion of Vapnik-Chervonenkis dimension, which can be used to upper bound the Rademacher complexity for other function classes. In Chapter 5, we introduce more advanced techniques based on metric entropy and chaining for controlling the Rademacher complexity and related sub-Gaussian processes. In the meantime, let us turn to the proof of Theorem 4.2.

Proof. We first note that if R_n(F) = o(1), then the almost-sure convergence follows from the tail bound (4.14) and the Borel-Cantelli lemma. Accordingly, the remainder of the argument is devoted to proving the tail bound (4.14).

Concentration around the mean: We first claim that when F is uniformly bounded, the random variable ‖P_n − P‖_F is sharply concentrated around its mean. In order to simplify notation, it is convenient to define the re-centered functions f̄(x) := f(x) − E[f(X)], and to write ‖P_n − P‖_F = sup_{f∈F} | (1/n) Σ_{i=1}^n f̄(X_i) |. Thinking of the samples as fixed for the moment, consider the function G(x_1, ..., x_n) := sup_{f∈F} | (1/n) Σ_{i=1}^n f̄(x_i) |. We claim that G satisfies the Lipschitz property required to apply the bounded differences method (recall the bounded differences inequality from Chapter 2). Since the function G is invariant to permutations of its coordinates, it suffices to bound the difference when the first coordinate x_1 is perturbed. Accordingly, we define the vector y ∈ R^n with y_i = x_i for all i ≠ 1, and seek to bound the difference G(x) − G(y). For any re-centered function f̄ = f − E[f], we have

    | (1/n) Σ_{i=1}^n f̄(x_i) | − sup_{h∈F} | (1/n) Σ_{i=1}^n h̄(y_i) |
        ≤ | (1/n) Σ_{i=1}^n f̄(x_i) | − | (1/n) Σ_{i=1}^n f̄(y_i) |
        ≤ (1/n) | f̄(x_1) − f̄(y_1) | = (1/n) | f(x_1) − f(y_1) | ≤ 2b/n,        (4.15)

where the final inequality uses the bound | f(x_1) − f(y_1) | ≤ 2b, which follows from the uniform boundedness condition ‖f‖_∞ ≤ b.
Since the inequality (4.15) holds for any function f, we may take the supremum over f ∈ F on both sides; doing so yields the inequality G(x) − G(y) ≤ 2b/n. Since the same argument may be applied with the roles of x and y reversed, we conclude that |G(x) − G(y)| ≤ 2b/n.

Therefore, by the bounded differences method (see Chapter 2), we have

    ‖P_n − P‖_F ≤ E[ ‖P_n − P‖_F ] + t  with P-probability at least 1 − exp(−nt²/(8b²)),        (4.16)

valid for all t ≥ 0.

Upper bound on the mean: It remains to show that E[ ‖P_n − P‖_F ] is upper bounded by 2R_n(F), and we do so using a classical symmetrization argument. Letting (Y_1, ..., Y_n) be a second i.i.d. sequence, independent of (X_1, ..., X_n), we have

    E[ ‖P_n − P‖_F ] = E_X[ sup_{f∈F} | (1/n) Σ_{i=1}^n { f(X_i) − E_{Y_i}[f(Y_i)] } | ]
                     = E_X[ sup_{f∈F} | E_Y[ (1/n) Σ_{i=1}^n { f(X_i) − f(Y_i) } ] | ]
                     ≤ E_{X,Y}[ sup_{f∈F} | (1/n) Σ_{i=1}^n { f(X_i) − f(Y_i) } | ],

where the inequality follows from Jensen's inequality for convex functions. Now let (ε_1, ..., ε_n) be an i.i.d. sequence of Rademacher variables, independent of X and Y. For any function f ∈ F and each i = 1, 2, ..., n, the variable ε_i ( f(X_i) − f(Y_i) ) has the same distribution as f(X_i) − f(Y_i), whence

    E[ ‖P_n − P‖_F ] ≤ E_{X,Y,ε}[ sup_{f∈F} | (1/n) Σ_{i=1}^n ε_i ( f(X_i) − f(Y_i) ) | ]
                     ≤ 2 E_{X,ε}[ sup_{f∈F} | (1/n) Σ_{i=1}^n ε_i f(X_i) | ] = 2 R_n(F).        (4.17)

Combining the upper bound (4.17) with the tail bound (4.16) yields the claim.

4.2.1 Necessary conditions with Rademacher complexity

The proof of Theorem 4.2 illustrates an important technique known as symmetrization, which relates the random variable ‖P_n − P‖_F to its symmetrized version

    ‖R_n‖_F := sup_{f∈F} | (1/n) Σ_{i=1}^n ε_i f(X_i) |.        (4.18)

Note that the expectation of ‖R_n‖_F corresponds to the Rademacher complexity, which plays a central role in Theorem 4.2. It is natural to wonder whether much was lost in moving from the variable ‖P_n − P‖_F to its symmetrized version. The following sandwich result relates these quantities.

Proposition 4.1. For any convex non-decreasing function Φ : R → R, we have

    E_{X,ε}[ Φ( (1/2) ‖R_n‖_{F̄} ) ]  ≤(a)  E_X[ Φ( ‖P_n − P‖_F ) ]  ≤(b)  E_{X,ε}[ Φ( 2 ‖R_n‖_F ) ],        (4.19)

where F̄ := { f − E[f], f ∈ F } is the re-centered function class.

When applied with the convex non-decreasing function Φ(t) = t, Proposition 4.1 yields the inequalities

    (1/2) E_{X,ε}[ ‖R_n‖_{F̄} ] ≤ E_X[ ‖P_n − P‖_F ] ≤ 2 E_{X,ε}[ ‖R_n‖_F ],        (4.20)

with the only difference being the use of F in the upper bound and the re-centered class F̄ in the lower bound. Other choices of interest include Φ(t) = e^{λt} for some λ > 0, which can be used to control the moment generating function.

Proof. The proof of inequality (b) is essentially identical to the argument for Φ(t) = t provided in the proof of Theorem 4.2. Turning to the bound (a), we have

    E_{X,ε}[ Φ( (1/2) ‖R_n‖_{F̄} ) ] = E_{X,ε}[ Φ( (1/2) sup_{f∈F} | (1/n) Σ_{i=1}^n ε_i { f(X_i) − E_{Y_i}[f(Y_i)] } | ) ]
        ≤(i) E_{X,Y,ε}[ Φ( (1/2) sup_{f∈F} | (1/n) Σ_{i=1}^n ε_i { f(X_i) − f(Y_i) } | ) ]
        =(ii) E_{X,Y}[ Φ( (1/2) sup_{f∈F} | (1/n) Σ_{i=1}^n { f(X_i) − f(Y_i) } | ) ],

where inequality (i) follows from Jensen's inequality and the convexity of Φ, and equality (ii) follows since, for each i = 1, ..., n and f ∈ F, the variables ε_i { f(X_i) − f(Y_i) } and f(X_i) − f(Y_i) have the same distribution. Adding and subtracting E[f] and applying the triangle inequality, we see that the quantity T := sup_{f∈F} | (1/n) Σ_{i=1}^n { f(X_i) − f(Y_i) } | is upper bounded as

    T ≤ sup_{f∈F} | (1/n) Σ_{i=1}^n { f(X_i) − E[f] } | + sup_{f∈F} | (1/n) Σ_{i=1}^n { f(Y_i) − E[f] } |.

Since Φ is convex and non-decreasing, we obtain

    Φ(T/2) ≤ (1/2) Φ( sup_{f∈F} | (1/n) Σ_{i=1}^n { f(X_i) − E[f] } | ) + (1/2) Φ( sup_{f∈F} | (1/n) Σ_{i=1}^n { f(Y_i) − E[f] } | ).

Since X and Y are identically distributed, taking expectations yields the claim.

A consequence of Proposition 4.1 is that the random variable ‖P_n − P‖_F can be lower bounded by a multiple of the Rademacher complexity, minus some fluctuation terms. This fact can be used to prove the following:

Proposition 4.2. Let F be a class of functions f : X → R that is uniformly bounded (i.e., ‖f‖_∞ ≤ b for all f ∈ F). Then for all δ ≥ 0, we have

    ‖P_n − P‖_F ≥ (1/2) R_n(F) − sup_{f∈F} |E[f]| / (2√n) − δ        (4.21)

with P-probability at least 1 − e^{−nδ²/(8b²)}.

We leave the proof of this result for the reader (see Exercise 4.3). As a consequence, if the Rademacher complexity R_n(F) remains bounded away from zero, then ‖P_n − P‖_F cannot converge to zero in probability. We have thus shown that, for a uniformly bounded function class F, the decay of the Rademacher complexity provides a necessary and sufficient condition for the class to be Glivenko-Cantelli.

4.3 Upper bounds on the Rademacher complexity

Obtaining concrete results using Theorem 4.2 requires methods for upper bounding the Rademacher complexity. There are a variety of such methods, ranging from simple union bound methods (suitable for finite function classes) to more advanced techniques involving the notion of metric entropy and chaining arguments. We explore the latter techniques in Chapter 5 to follow. This section is devoted to more elementary techniques, including those required to prove the classical Glivenko-Cantelli result and, more generally, those that apply to function classes with polynomial discrimination.

4.3.1 Classes with polynomial discrimination

For a given collection of points x_1^n = (x_1, ..., x_n), the size of the set F(x_1^n) provides a sample-dependent measure of the complexity of F. In the simplest case, the set F(x_1^n) contains only a finite number of vectors for all sample sizes, so that its size can be measured via its cardinality.
For instance, if F consists of a family of decision rules taking binary values (as in Example 4.7), then F(x_1^n) can contain at most 2^n elements. Of interest to us are function classes for which this cardinality grows only as a polynomial function of n, as formalized in the following definition:

Definition 4.2 (Polynomial discrimination). A class F of functions with domain X has polynomial discrimination of order ν ≥ 1 if, for each positive integer n and each collection x_1^n = {x_1, ..., x_n} of n points in X, the set F(x_1^n) has cardinality upper bounded as

    card( F(x_1^n) ) ≤ (n+1)^ν.        (4.22)

The significance of this property is that it provides a straightforward approach to controlling the Rademacher complexity.

Lemma 4.1. Suppose that F has polynomial discrimination of order ν. Then for all n ≥ 10 and any collection of points x_1^n = (x_1, ..., x_n),

    E_ε[ sup_{f∈F} | (1/n) Σ_{i=1}^n ε_i f(x_i) | ] ≤ D(x_1^n) √( 3 ν log(n+1) / n ),

where the left-hand side is the empirical Rademacher complexity R(F(x_1^n)/n), and D(x_1^n) := sup_{f∈F} √( (1/n) Σ_{i=1}^n f²(x_i) ) is the ℓ₂-radius of the set F(x_1^n)/√n.

We leave the proof of this claim for the reader (see Exercise 4.7). Although Lemma 4.1 is stated as an upper bound on the empirical Rademacher complexity, it yields as a corollary an upper bound on the Rademacher complexity R_n(F) = E_X[ R(F(X_1^n)/n) ], one which involves the expected ℓ₂-width E_X[ D(X_1^n) ]. An especially simple case is when the function class is uniformly bounded, say ‖f‖_∞ ≤ b for all f ∈ F, so that D(x_1^n) ≤ b for all samples, and hence Lemma 4.1 implies that

    R_n(F) ≤ b √( 3 ν log(n+1) / n )  for all n ≥ 10.        (4.23)

Combined with Theorem 4.2, we conclude that any bounded function class with polynomial discrimination is Glivenko-Cantelli.

What types of function classes have polynomial discrimination? As discussed previously in Example 4.4, the classical Glivenko-Cantelli law is based on indicator functions of the left-sided intervals (−∞, t]. These functions are uniformly bounded with b = 1 and, as shown in the following proof, this function class has polynomial discrimination of order 1. Consequently, Theorem 4.2 combined with Lemma 4.1 yields a quantitative version of Theorem 4.1 as a corollary.

Corollary 4.1 (Classical Glivenko-Cantelli). Let F(t) = P[X ≤ t] be the CDF of a random variable X, and let F̂_n be the empirical CDF based on n i.i.d. samples X_i ~ P. Then

    P[ ‖F̂_n − F‖_∞ ≥ 2 √( 3 log(n+1) / n ) + δ ] ≤ e^{−nδ²/8}  for all δ ≥ 0,        (4.24)

and hence ‖F̂_n − F‖_∞ → 0 almost surely.

Proof. For a given sample x_1^n = (x_1, ..., x_n) ∈ R^n, consider the set F(x_1^n), where F is the set of all {0,1}-valued indicator functions I_{(−∞,t]} for t ∈ R. If we order the samples as x_(1) ≤ x_(2) ≤ ... ≤ x_(n), then they split the real line into at most n+1 intervals (including the two end intervals (−∞, x_(1)) and [x_(n), ∞)). For a given t, the indicator function I_{(−∞,t]} takes the value one for all samples x_(i) ≤ t, and the value zero for all other samples. Thus, we have shown that for any given sample x_1^n, we have card( F(x_1^n) ) ≤ n+1. Applying Lemma 4.1 with ν = 1 and D(x_1^n) ≤ 1, we obtain

    E_ε[ sup_{f∈F} | (1/n) Σ_{i=1}^n ε_i f(x_i) | ] ≤ √( 3 log(n+1) / n ),

and taking averages over the data X_i yields the upper bound R_n(F) ≤ √( 3 log(n+1) / n ). The claim (4.24) then follows from Theorem 4.2.

Although the exponential tail bound (4.24) is adequate for many purposes, it is far from the tightest possible. Using alternative methods, we provide a sharper result in Chapter 5 that removes the √log(n+1) factor. See the bibliographic section for references to the sharpest possible results, including control of the constants in the exponent and the pre-factor.

4.3.2 Vapnik-Chervonenkis dimension

Thus far, we have seen that it is relatively straightforward to establish uniform laws for function classes with polynomial discrimination. In certain cases, such as in our proof of the classical Glivenko-Cantelli law, we can verify by direct calculation that a given function class has polynomial discrimination. More broadly, it is of interest to develop techniques for certifying this property, and the theory of Vapnik-Chervonenkis (VC) dimension provides one such class of techniques.
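The ingredients of the corollary can be checked numerically. For the half-interval class, the supremum over t inside the symmetrized variable is exactly computable: the only realizable subsets are the n+1 "prefixes" of the sorted sample, so the empirical Rademacher complexity reduces to the expected maximum absolute prefix sum of random signs. The sketch below (NumPy assumed; we take √(3 log(n+1)/n) as the lemma's bound for D ≤ 1 and ν = 1, which is an assumption if the draft's constant differs) compares a Monte Carlo estimate against that bound and against the realized sup-deviation ‖F̂_n − F‖_∞ for uniform data.

```python
import numpy as np

def emp_rademacher_half_intervals(x, num_draws=2000, rng=None):
    """Monte Carlo estimate of E_eps sup_t |(1/n) sum_i eps_i * I(x_i <= t)|.
    For each sign vector the sup over t is exact: the realizable subsets are
    the n+1 prefixes in sorted order, so we take the max |prefix sum|."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = x.size
    order = np.argsort(x)
    vals = np.empty(num_draws)
    for m in range(num_draws):
        eps = rng.choice([-1.0, 1.0], size=n)
        prefix = np.concatenate(([0.0], np.cumsum(eps[order])))
        vals[m] = np.abs(prefix).max() / n
    return float(vals.mean())

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(size=n)
rad = emp_rademacher_half_intervals(x, rng=rng)
lemma_bound = float(np.sqrt(3 * np.log(n + 1) / n))  # assumed form of the bound
# Exact Kolmogorov-Smirnov statistic sup_t |F_n(t) - t| for uniform data:
i = np.arange(1, n + 1)
xs = np.sort(x)
ks = float(max((i / n - xs).max(), (xs - (i - 1) / n).max()))
```

On typical draws, the Monte Carlo value sits well below the stated bound, and the realized deviation ks is smaller still, consistent with the 2R_n(F) + δ form of the tail bound.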
Let us consider a function class F in which each function f is binary-valued, taking the values {0,1} for concreteness. In this case, the set F(x_1^n) from equation (4.11) can have at most 2^n elements.

Definition 4.3 (Shattering and VC dimension). Given a class F of binary-valued functions, we say that the set x_1^n = (x_1, ..., x_n) is shattered by F if card( F(x_1^n) ) = 2^n. The VC dimension ν(F) is the largest integer n for which there is some collection x_1^n = (x_1, ..., x_n) of n points that can be shattered by F.

When the quantity ν(F) is finite, the function class F is said to be a VC class. We will frequently consider function classes F that consist of indicator functions I_S[·], for sets S ranging over some class of sets S. In this case, we use S(x_1^n) and ν(S) as shorthand for the set F(x_1^n) and the VC dimension of F, respectively.

Let us illustrate the notions of shattering and VC dimension with some examples:

Example 4.8 (Intervals in R). Consider the class of all indicator functions for left-sided half-intervals on the real line, namely the class S_left := { (−∞, a] | a ∈ R }. Implicit in the proof of Corollary 4.1 is a calculation of the VC dimension of this class. We first note that for any single point x_1, both subsets ({x_1} and the empty set ∅) can be picked out by the class of left-sided intervals { (−∞, a] | a ∈ R }. But given two distinct points x_1 < x_2, it is impossible to find a left-sided interval that contains x_2 but not x_1. Therefore, we conclude that ν(S_left) = 1. In the proof of Corollary 4.1, we showed more specifically that for any collection x_1^n = {x_1, ..., x_n}, we have card( S_left(x_1^n) ) ≤ n+1.

Now consider the class of all two-sided intervals over the real line, namely the class S_two := { (b, a] | a, b ∈ R such that b < a }. The class S_two can shatter any two-point set. However, given three distinct points x_1 < x_2 < x_3, it cannot pick out the subset {x_1, x_3}, showing that ν(S_two) = 2. For future reference, let us also upper bound the shattering coefficient of S_two. Note that any collection of n distinct points x_1 < x_2 < ... < x_{n−1} < x_n divides up the real line into (n+1) intervals.
Thus, any set of the form $(b, a]$ can be specified by choosing one of these $n + 1$ intervals for $b$, and a second interval for $a$. Consequently, a crude upper bound on the shattering coefficient is $\operatorname{card}(\mathcal{S}_{\text{two}}(x_1^n)) \leq (n+1)^2$, showing that this class has polynomial discrimination with degree $\nu = 2$.

Thus far, we have seen two examples of function classes with finite VC dimension, both of which turned out to also have polynomial discrimination. Is there a general connection between the VC dimension and polynomial discriminability? Indeed, it turns out that any class with finite VC dimension has polynomial discrimination with degree at most the VC dimension; this fact is a deep result that was proved independently (in slightly different forms) by Vapnik and Chervonenkis, Sauer, and Shelah. We refer the reader to the bibliographic section for further discussion and references. In order to understand why this fact is surprising, note that for a given set class $\mathcal{S}$,

the definition of VC dimension implies that for all $n > \nu(\mathcal{S})$, we must have $\operatorname{card}(\mathcal{S}(x_1^n)) < 2^n$ for all collections $x_1^n$ of $n$ samples. However, at least in principle, there could exist some collection with $\operatorname{card}(\mathcal{S}(x_1^n)) = 2^{n-1}$, which is not significantly different from $2^n$. The following result shows that this is not the case: indeed, for any VC class, the cardinality of $\mathcal{S}(x_1^n)$ can grow at most polynomially in $n$.

Proposition 4.3 (Vapnik–Chervonenkis, Sauer and Shelah). Consider a set class $\mathcal{S}$ with $\nu(\mathcal{S}) < \infty$. Then for any collection of points $x_1^n = (x_1, \ldots, x_n)$, we have
$$\operatorname{card}(\mathcal{S}(x_1^n)) \overset{(i)}{\leq} \sum_{i=0}^{\nu(\mathcal{S})} \binom{n}{i} \overset{(ii)}{\leq} (n+1)^{\nu(\mathcal{S})}. \qquad (4.5)$$

We prove inequality (i) in the Appendix. Given inequality (i), inequality (ii) can be established by elementary combinatorial arguments, so we leave it as an exercise for the reader (part (a) of Exercise 4.11). Part (b) of the same exercise establishes a sharper upper bound.

Controlling the VC dimension

Since classes with finite VC dimension have polynomial discrimination, it is of interest to develop techniques for controlling the VC dimension.

Basic operations

The property of having finite VC dimension is preserved under a number of basic operations, as summarized in the following.

Proposition 4.4. Let $\mathcal{S}$ and $\mathcal{T}$ be set classes with finite VC dimensions $\nu(\mathcal{S})$ and $\nu(\mathcal{T})$, respectively. Then each of the following set classes also has finite VC dimension:

(a) The set class $\mathcal{S}^c := \{S^c \mid S \in \mathcal{S}\}$, where $S^c$ denotes the complement of $S$.

(b) The set class $\mathcal{S} \sqcup \mathcal{T} := \{S \cup T \mid S \in \mathcal{S}, \; T \in \mathcal{T}\}$.

(c) The set class $\mathcal{S} \sqcap \mathcal{T} := \{S \cap T \mid S \in \mathcal{S}, \; T \in \mathcal{T}\}$.

We leave the proof of this result as an exercise for the reader.

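The two inequalities in (4.5) are easy to sanity-check numerically. The following sketch (illustrative code, not part of the text) verifies that the binomial sum is bounded by $(n+1)^{\nu}$, and that for $n > \nu$ it is strictly smaller than $2^n$:

```python
from math import comb

def sauer_bound(n, vc):
    """Binomial sum from inequality (i): sum_{i=0}^{vc} C(n, i)."""
    return sum(comb(n, i) for i in range(vc + 1))

for n in range(1, 21):
    for vc in range(0, n + 1):
        # Inequality (ii): the polynomial upper bound (n + 1)^vc.
        assert sauer_bound(n, vc) <= (n + 1) ** vc
        # For n > vc, the count is strictly smaller than 2^n.
        if vc < n:
            assert sauer_bound(n, vc) < 2 ** n
```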
Vector space structure

Any class $\mathcal{G}$ of real-valued functions defines a class of sets by the operation of taking subgraphs. In particular, given a real-valued function $g : \mathcal{X} \to \mathbb{R}$, its subgraph at level zero is the subset $S_g := \{x \in \mathcal{X} \mid g(x) \leq 0\}$. In this way, we can associate to $\mathcal{G}$ the collection of subsets $\mathcal{S}(\mathcal{G}) := \{S_g \mid g \in \mathcal{G}\}$, which we refer to as the subgraph class of $\mathcal{G}$. Many interesting classes of sets are naturally defined in this way, among them half-spaces, ellipsoids and so on. In many cases, the underlying function class $\mathcal{G}$ is a vector space, and the following result allows us to upper bound the VC dimension of the associated set class $\mathcal{S}(\mathcal{G})$.

Proposition 4.5 (Finite-dimensional vector spaces). Let $\mathcal{G}$ be a vector space of functions $g : \mathbb{R}^d \to \mathbb{R}$ with $\dim(\mathcal{G}) < \infty$. Then the subgraph class $\mathcal{S}(\mathcal{G})$ has VC dimension at most $\dim(\mathcal{G})$.

Proof. By the definition of VC dimension, it suffices to show that no collection of $n = \dim(\mathcal{G}) + 1$ points in $\mathbb{R}^d$ can be shattered by $\mathcal{S}(\mathcal{G})$. Fix a collection $x_1^n = \{x_1, \ldots, x_n\}$ of $n$ points in $\mathbb{R}^d$, and consider the linear map $L : \mathcal{G} \to \mathbb{R}^n$ given by $L(g) = (g(x_1), \ldots, g(x_n))$. By construction, the range of the mapping $L$ is a linear subspace of $\mathbb{R}^n$ of dimension at most $\dim(\mathcal{G}) = n - 1 < n$. Therefore, there must exist a non-zero vector $\gamma \in \mathbb{R}^n$ such that $\langle \gamma, L(g) \rangle = 0$ for all $g \in \mathcal{G}$. We may assume without loss of generality that at least one $\gamma_i$ is positive, and then write
$$\sum_{\{i \mid \gamma_i \leq 0\}} (-\gamma_i) \, g(x_i) = \sum_{\{i \mid \gamma_i > 0\}} \gamma_i \, g(x_i) \quad \text{for all } g \in \mathcal{G}. \qquad (4.6)$$
Now suppose that there exists some $g \in \mathcal{G}$ such that the associated subgraph set $S_g = \{x \in \mathbb{R}^d \mid g(x) \leq 0\}$ includes only the subset $\{x_i \mid \gamma_i \leq 0\}$. For such a function $g$, the right-hand side of equation (4.6) would be strictly positive, while the left-hand side would be non-positive, which is a contradiction. We conclude that $\mathcal{G}$ fails to shatter the set $\{x_1, \ldots, x_n\}$, as claimed.

Let us illustrate the use of Proposition 4.5 with some examples:

Example 4.9 (Linear functions in $\mathbb{R}^d$).
For a pair $(a, b) \in \mathbb{R}^d \times \mathbb{R}$, define the function $f_{a,b}(x) := \langle a, x \rangle + b$, and consider the family $\mathcal{L}^d := \{f_{a,b} \mid (a, b) \in \mathbb{R}^d \times \mathbb{R}\}$ of all such linear functions. The associated subgraph class $\mathcal{S}(\mathcal{L}^d)$ corresponds to the collection of all half-spaces of the form $H_{a,b} := \{x \in \mathbb{R}^d \mid \langle a, x \rangle + b \leq 0\}$. Since the family $\mathcal{L}^d$ forms a vector space of dimension $d + 1$, we obtain as an immediate consequence of Proposition 4.5 that $\mathcal{S}(\mathcal{L}^d)$ has VC dimension at most $d + 1$.

For the special case $d = 1$, let us verify this statement by a more direct calculation. In this case, the class $\mathcal{S}(\mathcal{L}^1)$ corresponds to the collection of all left-sided or right-sided

intervals, that is, $\mathcal{S}(\mathcal{L}^1) = \{(-\infty, t] \mid t \in \mathbb{R}\} \cup \{[t, \infty) \mid t \in \mathbb{R}\}$. Given any two distinct points $x_1 < x_2$, the collection of all such intervals can pick out all possible subsets. However, given any three points $x_1 < x_2 < x_3$, there is no interval contained in $\mathcal{S}(\mathcal{L}^1)$ that contains $x_2$ while excluding both $x_1$ and $x_3$. This calculation shows that $\nu(\mathcal{S}(\mathcal{L}^1)) = 2$, which matches the upper bound obtained from Proposition 4.5. More generally, it can be shown that the VC dimension of $\mathcal{S}(\mathcal{L}^d)$ is $d + 1$, so that Proposition 4.5 yields a sharp result in all dimensions.

Example 4.10 (Spheres in $\mathbb{R}^d$). Consider the sphere $S_{a,b} := \{x \in \mathbb{R}^d \mid \|x - a\|_2 \leq b\}$, where $(a, b) \in \mathbb{R}^d \times \mathbb{R}_+$ specify its center and radius, respectively, and let $\mathcal{S}^d_{\text{sphere}}$ denote the collection of all such spheres. If we define the function
$$f_{a,b}(x) := \|x\|_2^2 - 2 \sum_{j=1}^d a_j x_j + \|a\|_2^2 - b^2,$$
then we have $S_{a,b} = \{x \in \mathbb{R}^d \mid f_{a,b}(x) \leq 0\}$, so that the sphere $S_{a,b}$ is a subgraph of the function $f_{a,b}$.

In order to leverage Proposition 4.5, we first define a feature map $\phi : \mathbb{R}^d \to \mathbb{R}^{d+1}$ via $\phi(x) := (x_1, \ldots, x_d, 1)$, and then consider functions of the form
$$g_c(x) := \langle c, \phi(x) \rangle + \|x\|_2^2, \quad \text{where } c \in \mathbb{R}^{d+1}.$$
The family of functions $\{g_c \mid c \in \mathbb{R}^{d+1}\}$ is contained in a vector space of dimension $d + 2$, and it contains the function class $\{f_{a,b} \mid (a, b) \in \mathbb{R}^d \times \mathbb{R}_+\}$. Consequently, by applying Proposition 4.5 to this larger vector space, we conclude that $\nu(\mathcal{S}^d_{\text{sphere}}) \leq d + 2$. This bound is adequate for many purposes, but is not sharp: a more careful analysis shows that the VC dimension of spheres in $\mathbb{R}^d$ is actually $d + 1$. See Exercise 4.9 for an in-depth exploration of the case $d = 2$.

Appendix A: Proof of Proposition 4.3

We prove inequality (i) in Proposition 4.3 by establishing the following more general inequality:

Lemma 4.. Let $A$ be a finite set and let $\mathcal{U}$ be a class of subsets of $A$. Then
$$\operatorname{card}(\mathcal{U}) \leq \operatorname{card}\big(\{B \subseteq A \mid B \text{ is shattered by } \mathcal{U}\}\big). \qquad (4.7)$$

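Before turning to the proof, inequality (4.7) can be sanity-checked by brute force on small examples. The sketch below (illustrative code, not part of the text) counts the subsets of $A$ shattered by a small class $\mathcal{U}$:

```python
from itertools import combinations

def subsets(s):
    """All subsets of s, as frozensets."""
    s = sorted(s)
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

def is_shattered(B, U):
    """B is shattered by U if every subset of B arises as U_set & B."""
    picked = {frozenset(u & B) for u in U}
    return all(b in picked for b in subsets(B))

A = frozenset({1, 2, 3})
U = [frozenset(), frozenset({1}), frozenset({2}), frozenset({1, 3})]
shattered = [B for B in subsets(A) if is_shattered(B, U)]
# Inequality (4.7): the class is no larger than its family of shattered sets.
assert len(U) <= len(shattered)
```

Here the four members of $\mathcal{U}$ shatter exactly the four subsets $\emptyset, \{1\}, \{2\}, \{3\}$, so (4.7) holds with equality.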
To see that this lemma implies inequality (i), note that if $B \subseteq A$ is shattered by $\mathcal{S}$, then we must have $\operatorname{card}(B) \leq \nu(\mathcal{S})$. Consequently, if we let $A = \{x_1, \ldots, x_n\}$ and set $\mathcal{U} = \mathcal{S} \cap A := \{S \cap A \mid S \in \mathcal{S}\}$, then Lemma 4. implies that
$$\operatorname{card}(\mathcal{S}(x_1^n)) = \operatorname{card}(\mathcal{S} \cap A) \leq \operatorname{card}\big(\{B \subseteq A \mid \operatorname{card}(B) \leq \nu(\mathcal{S})\}\big) \leq \sum_{i=0}^{\nu(\mathcal{S})} \binom{n}{i},$$
as claimed.

It remains to prove Lemma 4.. For a given $x \in A$, let us define an operator $T_x$ on sets $U \in \mathcal{U}$ via
$$T_x(U) = \begin{cases} U \setminus \{x\} & \text{if } x \in U \text{ and } U \setminus \{x\} \notin \mathcal{U}, \\ U & \text{otherwise.} \end{cases}$$
We let $T_x(\mathcal{U})$ be the new class of sets defined by applying $T_x$ to each member of $\mathcal{U}$, namely $T_x(\mathcal{U}) := \{T_x(U) \mid U \in \mathcal{U}\}$.

We first claim that $T_x$ is a one-to-one mapping between $\mathcal{U}$ and $T_x(\mathcal{U})$, and hence that $\operatorname{card}(T_x(\mathcal{U})) = \operatorname{card}(\mathcal{U})$. To establish this claim, for any pair of sets $U, U' \in \mathcal{U}$ such that $T_x(U) = T_x(U')$, we must prove that $U = U'$. We divide the argument into three separate cases:

Case 1: $x \notin U$ and $x \notin U'$. Given $x \notin U$, we have $T_x(U) = U$, and hence $T_x(U') = U$. Moreover, $x \notin U'$ implies that $T_x(U') = U'$. Combining the equalities yields $U = U'$.

Case 2: $x \notin U$ and $x \in U'$. In this case, we have $T_x(U) = U = T_x(U')$, so that $x \in U'$ but $x \notin T_x(U')$. But this condition implies that $T_x(U') = U' \setminus \{x\} \notin \mathcal{U}$, which contradicts the fact that $T_x(U') = U \in \mathcal{U}$. By symmetry, the case $x \in U$ and $x \notin U'$ is identical.

Case 3: $x \in U \cap U'$. If both of $U \setminus \{x\}$ and $U' \setminus \{x\}$ belong to $\mathcal{U}$, then $T_x(U) = U$ and $T_x(U') = U'$, from which $U = U'$ follows. If neither $U \setminus \{x\}$ nor $U' \setminus \{x\}$ belongs to $\mathcal{U}$, then we can conclude that $U \setminus \{x\} = U' \setminus \{x\}$, and hence $U = U'$. Finally, if $U \setminus \{x\} \notin \mathcal{U}$ but $U' \setminus \{x\} \in \mathcal{U}$, then $T_x(U) = U \setminus \{x\} \notin \mathcal{U}$ but $T_x(U') = U' \in \mathcal{U}$, which is a contradiction.

We next claim that if $T_x(\mathcal{U})$ shatters a set $B$, then so does $\mathcal{U}$. If $x \notin B$, then both $\mathcal{U}$ and $T_x(\mathcal{U})$ pick out the same set of subsets of $B$. Otherwise, suppose that $x \in B$. Since $T_x(\mathcal{U})$ shatters $B$, for any subset $B' \subseteq B \setminus \{x\}$, there is a subset $T \in T_x(\mathcal{U})$ such that $T \cap B = B' \cup \{x\}$. Since $T = T_x(U)$ for some subset $U \in \mathcal{U}$ and $x \in T$, we conclude that both $U$ and $U \setminus \{x\}$ must belong to $\mathcal{U}$, so that $\mathcal{U}$ picks out both $B' \cup \{x\}$ and $B'$, and hence $\mathcal{U}$ also shatters $B$.
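The two claims just established, that $T_x$ preserves cardinality and that any set shattered after applying $T_x$ was already shattered before, can be verified by brute force. The following sketch (illustrative code with hypothetical helper names, not part of the text) applies the operator to a randomly chosen class:

```python
import random
from itertools import combinations

def all_subsets(s):
    s = sorted(s)
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

def T(x, U):
    """Apply the operator T_x to every member of the class U."""
    members = [frozenset(u) for u in U]
    return [u - {x} if x in u and (u - {x}) not in members else u
            for u in members]

def shattered_family(A, U):
    """All subsets B of A shattered by U (brute force)."""
    return [B for B in all_subsets(A)
            if all(any(u & B == b for u in U) for b in all_subsets(B))]

random.seed(0)
A = {0, 1, 2, 3}
U = random.sample(all_subsets(A), 7)
for x in A:
    V = T(x, U)
    assert len(set(V)) == len(set(U))  # T_x is one-to-one on U
    # every set shattered by T_x(U) is also shattered by U
    assert set(shattered_family(A, V)) <= set(shattered_family(A, U))
```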
Using the fact that $T_x$ preserves cardinalities and does not increase shattering numbers, we can now conclude the proof of the lemma. Define the weight function

$\omega(\mathcal{U}) = \sum_{U \in \mathcal{U}} \operatorname{card}(U)$, and note that applying a transformation $T_x$ can only reduce this weight function: i.e., $\omega(T_x(\mathcal{U})) \leq \omega(\mathcal{U})$. Consequently, by applying the transformations $\{T_x\}$ to $\mathcal{U}$ repeatedly, we can obtain a new class of sets $\mathcal{U}'$ such that $\operatorname{card}(\mathcal{U}) = \operatorname{card}(\mathcal{U}')$ and the weight $\omega(\mathcal{U}')$ is minimal. Then for any $U \in \mathcal{U}'$ and any $x \in U$, we have $U \setminus \{x\} \in \mathcal{U}'$. (Otherwise, we would have $\omega(T_x(\mathcal{U}')) < \omega(\mathcal{U}')$, contradicting minimality.) Therefore, the set class $\mathcal{U}'$ shatters every one of its own members. Since $\mathcal{U}$ shatters at least as many subsets as $\mathcal{U}'$, and $\operatorname{card}(\mathcal{U}') = \operatorname{card}(\mathcal{U})$ by construction, the claim (4.7) follows.

4.4 Bibliographic details and background

First, a technical remark regarding measurability: in general, the quantity $\|P_n - P\|_{\mathcal{F}}$ need not be measurable, since the function class $\mathcal{F}$ may contain an uncountable number of elements. If the function class is separable, then we may simply take the supremum over a countable dense subset. In general, there are various ways of dealing with this issue, including the use of outer probability (cf. van der Vaart and Wellner [vdVW96]). Here we instead adopt the following convention. For any finite class of functions $\mathcal{G}$ contained within $\mathcal{F}$, the variable $\|P_n - P\|_{\mathcal{G}}$ is well-defined, so that it is sensible to define
$$\|P_n - P\|_{\mathcal{F}} := \sup\big\{\|P_n - P\|_{\mathcal{G}} \mid \mathcal{G} \subseteq \mathcal{F}, \; \mathcal{G} \text{ has finite cardinality}\big\}.$$
By using this definition, we can always think instead about suprema over finite sets.

Theorem 4. was originally proved by Glivenko [Gli33] for the continuous case, and by Cantelli [Can33] in the general setting. The non-asymptotic form of the Glivenko–Cantelli theorem given in Corollary 4. can be sharpened substantially. For instance, Dvoretzky, Kiefer and Wolfowitz [DKW56] prove that there is a constant $C$, independent of $F$ and $n$, such that
$$\mathbb{P}\big[\|\widehat{F}_n - F\|_\infty \geq \delta\big] \leq C e^{-2n\delta^2} \quad \text{for all } \delta \geq 0. \qquad (4.8)$$
Massart [Mas90] establishes the sharpest possible result, including control of the leading constant.
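As a numerical illustration of the bound (4.8), the following simulation (illustrative code, not from the text; the constant $C = 2$ is Massart's sharp value) compares the empirical frequency of a large sup-norm deviation of the empirical CDF with the right-hand side of (4.8):

```python
import math
import random

def sup_deviation(xs):
    """sup_t |F_n(t) - F(t)| for samples xs from Uniform[0,1]; the supremum
    is attained just before or after an order statistic."""
    n, xs = len(xs), sorted(xs)
    return max(max(abs((i + 1) / n - x), abs(i / n - x))
               for i, x in enumerate(xs))

rng = random.Random(1)
n, delta, trials = 200, 0.12, 2000
freq = sum(sup_deviation([rng.random() for _ in range(n)]) >= delta
           for _ in range(trials)) / trials
bound = 2 * math.exp(-2 * n * delta ** 2)  # right-hand side of (4.8), C = 2
print(f"empirical frequency {freq:.4f} vs DKW bound {bound:.4f}")
```

Since the bound with $C = 2$ is essentially tight for the uniform distribution, the two numbers are of comparable magnitude.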
The Rademacher complexity, and its relative the Gaussian complexity, have a lengthy history in the study of Banach spaces using probabilistic methods; for instance, see the books [MS86, Pis89, LT91]. In Chapter 5, we develop further connections between these two forms of complexity and the related notion of metric entropy. Rademacher and Gaussian complexities have also been studied in the specific context of uniform laws of large numbers and empirical risk minimization (e.g., [BM02, BBM05, Kol01, Kol06, KP00, vdVW96]). The proof of Proposition 4.3 is adapted from Ledoux and Talagrand [LT91], who credit the approach to Frankl [Fra83]. Exercise 5.4 is adapted from Problem .6.3 from

van der Vaart and Wellner [vdVW96]. The proof of Proposition 4.5 is adapted from Pollard [Pol84], who credits it to Steele [Ste78] and Dudley [Dud78].

4.5 Exercises

Exercise 4.1. Recall that the functional $\gamma$ is continuous in the sup-norm at $F$ if for all $\epsilon > 0$, there exists a $\delta > 0$ such that $\|G - F\|_\infty \leq \delta$ implies that $|\gamma(G) - \gamma(F)| \leq \epsilon$.

(a) Given $n$ i.i.d. samples with law specified by $F$, let $\widehat{F}_n$ be the empirical CDF. Show that if $\gamma$ is continuous in the sup-norm at $F$, then $\gamma(\widehat{F}_n) \overset{\text{prob.}}{\longrightarrow} \gamma(F)$.

(b) Which of the following functionals are continuous with respect to the sup-norm? Prove or disprove.

(i) The mean functional $F \mapsto \int x \, dF(x)$.

(ii) The Cramér–von Mises functional $F \mapsto \int [F(x) - F_0(x)]^2 \, dF_0(x)$.

(iii) The quantile functional $Q_p(F) = \inf\{t \in \mathbb{R} \mid F(t) \geq p\}$.

Exercise 4.2. Recall from Example 4.5 the class $\mathcal{S}$ of all subsets $S$ of $[0,1]$ for which $S$ has a finite number of elements. Prove that the Rademacher complexity satisfies the lower bound
$$\mathcal{R}_n(\mathcal{S}) = \mathbb{E}_{X,\varepsilon}\Big[\sup_{S \in \mathcal{S}} \Big|\frac{1}{n} \sum_{i=1}^n \varepsilon_i I_S[X_i]\Big|\Big] \geq \frac{1}{2}. \qquad (4.9)$$
Discuss the connection to Theorem 4..

Exercise 4.3. In this exercise, we work through the proof of Proposition 4..

(a) Recall the re-centered function class $\bar{\mathcal{F}} = \{f - \mathbb{E}[f] \mid f \in \mathcal{F}\}$. Show that
$$\Big| \mathbb{E}_{X,\varepsilon}\big[\mathcal{R}_n(\bar{\mathcal{F}})\big] - \mathbb{E}_{X,\varepsilon}\big[\mathcal{R}_n(\mathcal{F})\big] \Big| \leq \frac{\sup_{f \in \mathcal{F}} |\mathbb{E}[f]|}{\sqrt{n}}.$$

(b) Use concentration results to complete the proof of Proposition 4..

Exercise 4.4. Consider the function class
$$\mathcal{F} = \big\{x \mapsto \operatorname{sign}(\langle \theta, x \rangle) \mid \theta \in \mathbb{R}^d, \; \|\theta\|_2 = 1\big\},$$
corresponding to the $\{-1, +1\}$-valued classification rules defined by linear functions in $\mathbb{R}^d$. Supposing that $d \geq n$, let $x_1^n = \{x_1, \ldots, x_n\}$ be a collection of vectors in $\mathbb{R}^d$ that

are linearly independent. Show that the empirical Rademacher complexity satisfies
$$\mathcal{R}\big(\mathcal{F}(x_1^n)/n\big) = \mathbb{E}_\varepsilon\Big[\sup_{f \in \mathcal{F}} \Big|\frac{1}{n} \sum_{i=1}^n \varepsilon_i f(x_i)\Big|\Big] = 1.$$
Discuss the consequences for empirical risk minimization over the class $\mathcal{F}$.

Exercise 4.5. Prove the following properties of the Rademacher complexity.

(a) Show that $\mathcal{R}_n(\mathcal{F}) = \mathcal{R}_n(\operatorname{conv}(\mathcal{F}))$.

(b) Show that $\mathcal{R}_n(\mathcal{F} + \mathcal{G}) \leq \mathcal{R}_n(\mathcal{F}) + \mathcal{R}_n(\mathcal{G})$. Give an example to demonstrate that this bound cannot be improved in general.

(c) Given a fixed and uniformly bounded function $g$, show that
$$\mathcal{R}_n(\mathcal{F} + g) \leq \mathcal{R}_n(\mathcal{F}) + \frac{\|g\|_\infty}{\sqrt{n}}. \qquad (4.30)$$

Exercise 4.6. Let $\mathcal{S}$ and $\mathcal{T}$ be two classes of sets with finite VC dimensions. Which of the following set classes have finite VC dimension? Prove or disprove.

(a) The set class $\mathcal{S}^c := \{S^c \mid S \in \mathcal{S}\}$, where $S^c$ denotes the complement of the set $S$.

(b) The set class $\mathcal{S} \sqcup \mathcal{T} := \{S \cup T \mid S \in \mathcal{S}, \; T \in \mathcal{T}\}$.

(c) The set class $\mathcal{S} \sqcap \mathcal{T} := \{S \cap T \mid S \in \mathcal{S}, \; T \in \mathcal{T}\}$.

Exercise 4.7. Prove Lemma 4..

Exercise 4.8. Consider the class of left-sided half-intervals in $\mathbb{R}^d$:
$$\mathcal{S}^d_{\text{left}} := \big\{(-\infty, t_1] \times (-\infty, t_2] \times \cdots \times (-\infty, t_d] \mid (t_1, \ldots, t_d) \in \mathbb{R}^d\big\}.$$
Show that for any collection of $n$ points, we have $\operatorname{card}(\mathcal{S}^d_{\text{left}}(x_1^n)) \leq (n+1)^d$ and $\nu(\mathcal{S}^d_{\text{left}}) = d$.

Exercise 4.9. Consider the class of all spheres in $\mathbb{R}^2$, that is,
$$\mathcal{S}^2_{\text{sphere}} := \big\{S_{a,b} \mid (a, b) \in \mathbb{R}^2 \times \mathbb{R}_+\big\}, \qquad (4.31)$$
where $S_{a,b} := \{x \in \mathbb{R}^2 \mid \|x - a\|_2 \leq b\}$ is the sphere of radius $b \geq 0$ centered at $a = (a_1, a_2)$.

(a) Show that $\mathcal{S}^2_{\text{sphere}}$ can shatter any subset of three points that are not collinear.

(b) Show that for any subset of four points, the balls can discriminate at most 15 out of the 16 possible subsets, and conclude that $\nu(\mathcal{S}^2_{\text{sphere}}) = 3$.

Exercise 4.10. Show that the class $\mathcal{C}^d_{\text{cc}}$ of all closed and convex sets in $\mathbb{R}^d$ does not have finite VC dimension. (Hint: Consider a set of $n$ points on the boundary of the unit ball.)

Exercise 4.11. (a) Prove inequality (ii) in Proposition 4.3.

(b) Prove the sharper upper bound $\operatorname{card}(\mathcal{S}(x_1^n)) \leq \big(\frac{en}{\nu}\big)^\nu$. (Hint: You might find the result of Exercise .9 useful.)
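Several of these exercises can be explored numerically before attempting a proof. The following brute-force sketch (illustrative code, not part of the text; a successful shattering over a finite parameter pool only certifies a lower bound on the VC dimension, while a failure proves nothing by itself) checks shattering for the two-sided intervals of Example 4.8:

```python
def shatters(points, classifiers):
    """True if the finite pool `classifiers` realizes every {0,1}-labeling
    of `points`."""
    realized = {tuple(c(x) for x in points) for c in classifiers}
    return len(realized) == 2 ** len(points)

# Two-sided intervals (b, a]: a pool built from a grid of endpoints.
grid = [i / 10 for i in range(-1, 12)]
intervals = [lambda x, b=b, a=a: int(b < x <= a)
             for b in grid for a in grid if b < a]

assert shatters([0.3, 0.6], intervals)           # two points: nu >= 2
assert not shatters([0.2, 0.5, 0.8], intervals)  # {x1, x3} cannot be picked out
```

For three points the failure here is genuine (no interval whatsoever picks out the outer pair), matching the conclusion $\nu(\mathcal{S}_{\text{two}}) = 2$ from Example 4.8.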


X n D X lim n F n (x) = F (x) for all x C F. lim n F n(u) = F (u) for all u C F. (2) 14:17 11/16/2 TOPIC. Convergence in distribution and related notions. This section studies the notion of the so-called convergence in distribution of real random variables. This is the kind of convergence

More information

2. Function spaces and approximation

2. Function spaces and approximation 2.1 2. Function spaces and approximation 2.1. The space of test functions. Notation and prerequisites are collected in Appendix A. Let Ω be an open subset of R n. The space C0 (Ω), consisting of the C

More information

CHAPTER 7. Connectedness

CHAPTER 7. Connectedness CHAPTER 7 Connectedness 7.1. Connected topological spaces Definition 7.1. A topological space (X, T X ) is said to be connected if there is no continuous surjection f : X {0, 1} where the two point set

More information

Problem set 1, Real Analysis I, Spring, 2015.

Problem set 1, Real Analysis I, Spring, 2015. Problem set 1, Real Analysis I, Spring, 015. (1) Let f n : D R be a sequence of functions with domain D R n. Recall that f n f uniformly if and only if for all ɛ > 0, there is an N = N(ɛ) so that if n

More information

The Vapnik-Chervonenkis Dimension

The Vapnik-Chervonenkis Dimension The Vapnik-Chervonenkis Dimension Prof. Dan A. Simovici UMB 1 / 91 Outline 1 Growth Functions 2 Basic Definitions for Vapnik-Chervonenkis Dimension 3 The Sauer-Shelah Theorem 4 The Link between VCD and

More information

THE INVERSE FUNCTION THEOREM

THE INVERSE FUNCTION THEOREM THE INVERSE FUNCTION THEOREM W. PATRICK HOOPER The implicit function theorem is the following result: Theorem 1. Let f be a C 1 function from a neighborhood of a point a R n into R n. Suppose A = Df(a)

More information

Generalization Bounds in Machine Learning. Presented by: Afshin Rostamizadeh

Generalization Bounds in Machine Learning. Presented by: Afshin Rostamizadeh Generalization Bounds in Machine Learning Presented by: Afshin Rostamizadeh Outline Introduction to generalization bounds. Examples: VC-bounds Covering Number bounds Rademacher bounds Stability bounds

More information

Week 2: Sequences and Series

Week 2: Sequences and Series QF0: Quantitative Finance August 29, 207 Week 2: Sequences and Series Facilitator: Christopher Ting AY 207/208 Mathematicians have tried in vain to this day to discover some order in the sequence of prime

More information

Lecture 1 Measure concentration

Lecture 1 Measure concentration CSE 29: Learning Theory Fall 2006 Lecture Measure concentration Lecturer: Sanjoy Dasgupta Scribe: Nakul Verma, Aaron Arvey, and Paul Ruvolo. Concentration of measure: examples We start with some examples

More information

Analysis Finite and Infinite Sets The Real Numbers The Cantor Set

Analysis Finite and Infinite Sets The Real Numbers The Cantor Set Analysis Finite and Infinite Sets Definition. An initial segment is {n N n n 0 }. Definition. A finite set can be put into one-to-one correspondence with an initial segment. The empty set is also considered

More information

I. The space C(K) Let K be a compact metric space, with metric d K. Let B(K) be the space of real valued bounded functions on K with the sup-norm

I. The space C(K) Let K be a compact metric space, with metric d K. Let B(K) be the space of real valued bounded functions on K with the sup-norm I. The space C(K) Let K be a compact metric space, with metric d K. Let B(K) be the space of real valued bounded functions on K with the sup-norm Proposition : B(K) is complete. f = sup f(x) x K Proof.

More information

MATH 4200 HW: PROBLEM SET FOUR: METRIC SPACES

MATH 4200 HW: PROBLEM SET FOUR: METRIC SPACES MATH 4200 HW: PROBLEM SET FOUR: METRIC SPACES PETE L. CLARK 4. Metric Spaces (no more lulz) Directions: This week, please solve any seven problems. Next week, please solve seven more. Starred parts of

More information

Chapter 2 Metric Spaces

Chapter 2 Metric Spaces Chapter 2 Metric Spaces The purpose of this chapter is to present a summary of some basic properties of metric and topological spaces that play an important role in the main body of the book. 2.1 Metrics

More information

Real Analysis Problems

Real Analysis Problems Real Analysis Problems Cristian E. Gutiérrez September 14, 29 1 1 CONTINUITY 1 Continuity Problem 1.1 Let r n be the sequence of rational numbers and Prove that f(x) = 1. f is continuous on the irrationals.

More information

Lecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable

Lecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable Lecture Notes 1 Probability and Random Variables Probability Spaces Conditional Probability and Independence Random Variables Functions of a Random Variable Generation of a Random Variable Jointly Distributed

More information

Existence and Uniqueness

Existence and Uniqueness Chapter 3 Existence and Uniqueness An intellect which at a certain moment would know all forces that set nature in motion, and all positions of all items of which nature is composed, if this intellect

More information

Midterm 1. Every element of the set of functions is continuous

Midterm 1. Every element of the set of functions is continuous Econ 200 Mathematics for Economists Midterm Question.- Consider the set of functions F C(0, ) dened by { } F = f C(0, ) f(x) = ax b, a A R and b B R That is, F is a subset of the set of continuous functions

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 13: Entropy Calculations

Introduction to Empirical Processes and Semiparametric Inference Lecture 13: Entropy Calculations Introduction to Empirical Processes and Semiparametric Inference Lecture 13: Entropy Calculations Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations Research

More information

Lecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable

Lecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable Lecture Notes 1 Probability and Random Variables Probability Spaces Conditional Probability and Independence Random Variables Functions of a Random Variable Generation of a Random Variable Jointly Distributed

More information

Behaviour of Lipschitz functions on negligible sets. Non-differentiability in R. Outline

Behaviour of Lipschitz functions on negligible sets. Non-differentiability in R. Outline Behaviour of Lipschitz functions on negligible sets G. Alberti 1 M. Csörnyei 2 D. Preiss 3 1 Università di Pisa 2 University College London 3 University of Warwick Lars Ahlfors Centennial Celebration Helsinki,

More information

Sobolev spaces. May 18

Sobolev spaces. May 18 Sobolev spaces May 18 2015 1 Weak derivatives The purpose of these notes is to give a very basic introduction to Sobolev spaces. More extensive treatments can e.g. be found in the classical references

More information

Statistics 300B Winter 2018 Final Exam Due 24 Hours after receiving it

Statistics 300B Winter 2018 Final Exam Due 24 Hours after receiving it Statistics 300B Winter 08 Final Exam Due 4 Hours after receiving it Directions: This test is open book and open internet, but must be done without consulting other students. Any consultation of other students

More information

Statistical learning theory, Support vector machines, and Bioinformatics

Statistical learning theory, Support vector machines, and Bioinformatics 1 Statistical learning theory, Support vector machines, and Bioinformatics Jean-Philippe.Vert@mines.org Ecole des Mines de Paris Computational Biology group ENS Paris, november 25, 2003. 2 Overview 1.

More information

COMS 4771 Introduction to Machine Learning. Nakul Verma

COMS 4771 Introduction to Machine Learning. Nakul Verma COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW2 due now! Project proposal due on tomorrow Midterm next lecture! HW3 posted Last time Linear Regression Parametric vs Nonparametric

More information

Lecture 4 Lebesgue spaces and inequalities

Lecture 4 Lebesgue spaces and inequalities Lecture 4: Lebesgue spaces and inequalities 1 of 10 Course: Theory of Probability I Term: Fall 2013 Instructor: Gordan Zitkovic Lecture 4 Lebesgue spaces and inequalities Lebesgue spaces We have seen how

More information

Estimates for probabilities of independent events and infinite series

Estimates for probabilities of independent events and infinite series Estimates for probabilities of independent events and infinite series Jürgen Grahl and Shahar evo September 9, 06 arxiv:609.0894v [math.pr] 8 Sep 06 Abstract This paper deals with finite or infinite sequences

More information

n [ F (b j ) F (a j ) ], n j=1(a j, b j ] E (4.1)

n [ F (b j ) F (a j ) ], n j=1(a j, b j ] E (4.1) 1.4. CONSTRUCTION OF LEBESGUE-STIELTJES MEASURES In this section we shall put to use the Carathéodory-Hahn theory, in order to construct measures with certain desirable properties first on the real line

More information

The sample complexity of agnostic learning with deterministic labels

The sample complexity of agnostic learning with deterministic labels The sample complexity of agnostic learning with deterministic labels Shai Ben-David Cheriton School of Computer Science University of Waterloo Waterloo, ON, N2L 3G CANADA shai@uwaterloo.ca Ruth Urner College

More information

Tame definable topological dynamics

Tame definable topological dynamics Tame definable topological dynamics Artem Chernikov (Paris 7) Géométrie et Théorie des Modèles, 4 Oct 2013, ENS, Paris Joint work with Pierre Simon, continues previous work with Anand Pillay and Pierre

More information

Real Analysis Notes. Thomas Goller

Real Analysis Notes. Thomas Goller Real Analysis Notes Thomas Goller September 4, 2011 Contents 1 Abstract Measure Spaces 2 1.1 Basic Definitions........................... 2 1.2 Measurable Functions........................ 2 1.3 Integration..............................

More information

Continuity. Chapter 4

Continuity. Chapter 4 Chapter 4 Continuity Throughout this chapter D is a nonempty subset of the real numbers. We recall the definition of a function. Definition 4.1. A function from D into R, denoted f : D R, is a subset of

More information

Introductory Analysis I Fall 2014 Homework #9 Due: Wednesday, November 19

Introductory Analysis I Fall 2014 Homework #9 Due: Wednesday, November 19 Introductory Analysis I Fall 204 Homework #9 Due: Wednesday, November 9 Here is an easy one, to serve as warmup Assume M is a compact metric space and N is a metric space Assume that f n : M N for each

More information

Economics 204 Fall 2011 Problem Set 1 Suggested Solutions

Economics 204 Fall 2011 Problem Set 1 Suggested Solutions Economics 204 Fall 2011 Problem Set 1 Suggested Solutions 1. Suppose k is a positive integer. Use induction to prove the following two statements. (a) For all n N 0, the inequality (k 2 + n)! k 2n holds.

More information

THEOREMS, ETC., FOR MATH 515

THEOREMS, ETC., FOR MATH 515 THEOREMS, ETC., FOR MATH 515 Proposition 1 (=comment on page 17). If A is an algebra, then any finite union or finite intersection of sets in A is also in A. Proposition 2 (=Proposition 1.1). For every

More information

Lecture 3: Lower Bounds for Bandit Algorithms

Lecture 3: Lower Bounds for Bandit Algorithms CMSC 858G: Bandits, Experts and Games 09/19/16 Lecture 3: Lower Bounds for Bandit Algorithms Instructor: Alex Slivkins Scribed by: Soham De & Karthik A Sankararaman 1 Lower Bounds In this lecture (and

More information

Examples of Dual Spaces from Measure Theory

Examples of Dual Spaces from Measure Theory Chapter 9 Examples of Dual Spaces from Measure Theory We have seen that L (, A, µ) is a Banach space for any measure space (, A, µ). We will extend that concept in the following section to identify an

More information

NOTES ON THE REGULARITY OF QUASICONFORMAL HOMEOMORPHISMS

NOTES ON THE REGULARITY OF QUASICONFORMAL HOMEOMORPHISMS NOTES ON THE REGULARITY OF QUASICONFORMAL HOMEOMORPHISMS CLARK BUTLER. Introduction The purpose of these notes is to give a self-contained proof of the following theorem, Theorem.. Let f : S n S n be a

More information

Mathematics for Economists

Mathematics for Economists Mathematics for Economists Victor Filipe Sao Paulo School of Economics FGV Metric Spaces: Basic Definitions Victor Filipe (EESP/FGV) Mathematics for Economists Jan.-Feb. 2017 1 / 34 Definitions and Examples

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Vapnik Chervonenkis Theory Barnabás Póczos Empirical Risk and True Risk 2 Empirical Risk Shorthand: True risk of f (deterministic): Bayes risk: Let us use the empirical

More information

2 (Bonus). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure?

2 (Bonus). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure? MA 645-4A (Real Analysis), Dr. Chernov Homework assignment 1 (Due 9/5). Prove that every countable set A is measurable and µ(a) = 0. 2 (Bonus). Let A consist of points (x, y) such that either x or y is

More information

Spaces of continuous functions

Spaces of continuous functions Chapter 2 Spaces of continuous functions 2.8 Baire s Category Theorem Recall that a subset A of a metric space (X, d) is dense if for all x X there is a sequence from A converging to x. An equivalent definition

More information

Notes on uniform convergence

Notes on uniform convergence Notes on uniform convergence Erik Wahlén erik.wahlen@math.lu.se January 17, 2012 1 Numerical sequences We begin by recalling some properties of numerical sequences. By a numerical sequence we simply mean

More information

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 1 Entropy Since this course is about entropy maximization,

More information

Rademacher Averages and Phase Transitions in Glivenko Cantelli Classes

Rademacher Averages and Phase Transitions in Glivenko Cantelli Classes IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 48, NO. 1, JANUARY 2002 251 Rademacher Averages Phase Transitions in Glivenko Cantelli Classes Shahar Mendelson Abstract We introduce a new parameter which

More information

Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product

Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product Chapter 4 Hilbert Spaces 4.1 Inner Product Spaces Inner Product Space. A complex vector space E is called an inner product space (or a pre-hilbert space, or a unitary space) if there is a mapping (, )

More information

Brownian Motion. 1 Definition Brownian Motion Wiener measure... 3

Brownian Motion. 1 Definition Brownian Motion Wiener measure... 3 Brownian Motion Contents 1 Definition 2 1.1 Brownian Motion................................. 2 1.2 Wiener measure.................................. 3 2 Construction 4 2.1 Gaussian process.................................

More information

2. Metric Spaces. 2.1 Definitions etc.

2. Metric Spaces. 2.1 Definitions etc. 2. Metric Spaces 2.1 Definitions etc. The procedure in Section for regarding R as a topological space may be generalized to many other sets in which there is some kind of distance (formally, sets with

More information

Infinite-dimensional Vector Spaces and Sequences

Infinite-dimensional Vector Spaces and Sequences 2 Infinite-dimensional Vector Spaces and Sequences After the introduction to frames in finite-dimensional vector spaces in Chapter 1, the rest of the book will deal with expansions in infinitedimensional

More information