Introduction to Empirical Processes and Semiparametric Inference Lecture 13: Entropy Calculations

Similar documents
Introduction to Empirical Processes and Semiparametric Inference Lecture 12: Glivenko-Cantelli and Donsker Results

Introduction to Empirical Processes and Semiparametric Inference Lecture 08: Stochastic Convergence

Introduction to Empirical Processes and Semiparametric Inference Lecture 02: Overview Continued

Introduction to Empirical Processes and Semiparametric Inference Lecture 09: Stochastic Convergence, Continued

Lecture 2: Uniform Entropy

Introduction to Empirical Processes and Semiparametric Inference Lecture 22: Preliminaries for Semiparametric Inference

Exercises from other sources REAL NUMBERS 2,...,

Continuity of convex functions in normed spaces

General Notation. Exercises and Problems

Continuous Functions on Metric Spaces

g 2 (x) (1/3)M 1 = (1/3)(2/3)M.

THE STONE-WEIERSTRASS THEOREM AND ITS APPLICATIONS TO L 2 SPACES

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi

Metric Spaces and Topology

3 (Due ). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure?

Real Analysis Notes. Thomas Goller

Economics 204 Fall 2011 Problem Set 2 Suggested Solutions

2 (Bonus). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure?

1 Topology Definition of a topology Basis (Base) of a topology The subspace topology & the product topology on X Y 3

Math 5052 Measure Theory and Functional Analysis II Homework Assignment 7

THEOREMS, ETC., FOR MATH 515

Math General Topology Fall 2012 Homework 11 Solutions

CHAPTER V DUAL SPACES

Locally convex spaces, the hyperplane separation theorem, and the Krein-Milman theorem

Math 328 Course Notes

Summer Jump-Start Program for Analysis, 2012 Song-Ying Li

U e = E (U\E) e E e + U\E e. (1.6)

1. Let A R be a nonempty set that is bounded from above, and let a be the least upper bound of A. Show that there exists a sequence {a n } n N

Bounded uniformly continuous functions

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability...

Walker Ray Econ 204 Problem Set 3 Suggested Solutions August 6, 2015

CHAPTER I THE RIESZ REPRESENTATION THEOREM

(convex combination!). Use convexity of f and multiply by the common denominator to get. Interchanging the role of x and y, we obtain that f is ( 2M ε

h(x) lim H(x) = lim Since h is nondecreasing then h(x) 0 for all x, and if h is discontinuous at a point x then H(x) > 0. Denote

Introduction to Empirical Processes and Semiparametric Inference Lecture 25: Semiparametric Models

Review of Multi-Calculus (Study Guide for Spivak s CHAPTER ONE TO THREE)

Problem Set 5: Solutions Math 201A: Fall 2016

Some Background Math Notes on Limsups, Sets, and Convexity

Selçuk Demir WS 2017 Functional Analysis Homework Sheet

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9

Introduction to Functional Analysis

Entrance Exam, Real Analysis September 1, 2017 Solve exactly 6 out of the 8 problems

Lecture Notes on Metric Spaces

ON THE CHEBYSHEV POLYNOMIALS. Contents. 2. A Result on Linear Functionals on P n 4 Acknowledgments 7 References 7

Analysis Finite and Infinite Sets The Real Numbers The Cantor Set

(1) Consider the space S consisting of all continuous real-valued functions on the closed interval [0, 1]. For f, g S, define

Functional Analysis I

Empirical Processes: General Weak Convergence Theory

Extreme points of compact convex sets

Definition 6.1. A metric space (X, d) is complete if every Cauchy sequence tends to a limit in X.

REVIEW OF ESSENTIAL MATH 346 TOPICS

Metric Spaces. Exercises Fall 2017 Lecturer: Viveka Erlandsson. Written by M.van den Berg

THE INVERSE FUNCTION THEOREM

Immerse Metric Space Homework

Chapter 4: Asymptotic Properties of the MLE

MATH 51H Section 4. October 16, Recall what it means for a function between metric spaces to be continuous:

Continuity. Matt Rosenzweig

L p Spaces and Convexity

Existence and Uniqueness

Commutative Banach algebras 79

The Heine-Borel and Arzela-Ascoli Theorems

An introduction to Mathematical Theory of Control

Problem Set 2: Solutions Math 201A: Fall 2016

Metric Spaces Math 413 Honors Project

Introduction to Real Analysis Alternative Chapter 1

Basic Properties of Metric and Normed Spaces

+ 2x sin x. f(b i ) f(a i ) < ɛ. i=1. i=1

Integration on Measure Spaces

Honours Analysis III

From now on, we will represent a metric space with (X, d). Here are some examples: i=1 (x i y i ) p ) 1 p, p 1.

2. The Concept of Convergence: Ultrafilters and Nets

Metric Spaces Lecture 17

Sobolev spaces. May 18

Notes for Functional Analysis

Convex Analysis and Economic Theory Winter 2018

Lecture 3. Econ August 12

Problem set 1, Real Analysis I, Spring, 2015.

M17 MAT25-21 HOMEWORK 6

Metric Spaces Math 413 Honors Project

Some Background Material

Functional Analysis. Martin Brokate. 1 Normed Spaces 2. 2 Hilbert Spaces The Principle of Uniform Boundedness 32

NAME: Mathematics 205A, Fall 2008, Final Examination. Answer Key

Math 209B Homework 2

Suppose R is an ordered ring with positive elements P.

Measurable functions are approximately nice, even if look terrible.

Math 6455 Oct 10, Differential Geometry I Fall 2006, Georgia Tech. Integration on Manifolds, Volume, and Partitions of Unity

PROBLEMS. (b) (Polarization Identity) Show that in any inner product space

ANALYSIS WORKSHEET II: METRIC SPACES

MATH 202B - Problem Set 5

Convergence Concepts of Random Variables and Functions

L p Functions. Given a measure space (X, µ) and a real number p [1, ), recall that the L p -norm of a measurable function f : X R is defined by

converges as well if x < 1. 1 x n x n 1 1 = 2 a nx n

LINEAR CHAOS? Nathan S. Feldman

Riemann integral and volume are generalized to unbounded functions and sets. is an admissible set, and its volume is a Riemann integral, 1l E,

Analysis III. Exam 1

MATH MEASURE THEORY AND FOURIER ANALYSIS. Contents

Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product

u xx + u yy = 0. (5.1)

Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization

Maths 212: Homework Solutions

Transcription:

Introduction to Empirical Processes and Semiparametric Inference Lecture 13: Entropy Calculations Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations Research University of North Carolina-Chapel Hill 1

Uniform Entroy Let 1{C} denote the collection of all indicator functions of sets in the class C. The following theorem gives a bound on the L r covering numbers of 1{C}: THEOREM 1. There exists a universal constant K < such that for any VC-class of sets C, any r 1, and any 0 < ɛ < 1, N(ɛ, 1{C}, L r (Q)) KV (C)(4e) V (C) ( 1 ɛ ) r(v (C) 1). This is Theorem 2.6.4 of VW, and we omit the proof. 2

Since F = 1 serves as an envelope for 1{C}, we have as an immediate corollary that, for F = 1{C}, sup Q N(ɛ F 1,Q, F, L 1 (Q)) < and J(1, F, L 2 ) 1 0 log(1/ɛ)dɛ = 0 u 1/2 e u du 1, where the supremum is over all finite probability measures Q with F Q,2 > 0. 3

Thus the uniform entropy conditions required in the G-C and Donsker theorems of the previous chapter are satisfied for indicators of VC-classes of sets. Since the constant 1 serves as a universally applicable envelope function, these classes of indicator functions are therefore G-C and Donsker, provided the requisite measurability conditions hold. 4

For a function f : X R, the subset of X R given by {(x, t) : t < f(x)} is the subgraph of f. A collection F of measurable real functions on the sample space X is a VC-subgraph class or VC-class (for short), if the collection of all subgraphs of functions in F forms a VC-class of sets (as sets in X R). Let V (F) denote the VC-index of the set of subgraphs of F. 5

VC-classes of functions grow at a polynomial rate just like VC-classes of sets: THEOREM 2. There exists a universal constant K < such that, for any VC-class of measurable functions F with integrable envelope F, any r 1, any probability measure Q with F Q,r > 0, and any 0 < ɛ < 1, N(ɛ F Q,r, F, L r (Q)) KV (F)(4e) V (F) ( 2 ɛ ) r(v (F) 1). Thus VC-classes of functions easily satisfy the uniform entropy requirements of the G-C and Donsker theorems of the previous chapter. 6

A related kind of function class is the VC-hull class. A class of measurable functions G is a VC-hull class if there exists a VC-class F such that each f G is the pointwise limit of a sequence of functions {f m } in the symmetric convex hull of F (denoted sconvf). A function f is in sconvf if f = m i=1 α if i, where the α i s are real numbers satisfying m i=1 α i 1 and the f i s are in F. 7

The convex hull of a class of functions F, denoted convf, is similarly defined but with the requirement that the α i s are positive. We use convf to denote pointwise closure of convf and sconvf to denote the pointwise closure of sconvf. Thus the class of functions F is a VC-hull class if F = sconvg for some VC-class G. 8

THEOREM 3. Let Q be a probability measure on (X, A), and let F be a class of measurable functions with measurable envelope F, such that QF 2 < and, for 0 < ɛ < 1, N(ɛ F Q,2, F, L 2 (Q)) C ( ) 1 V, ɛ for constants C, V < (possibly dependent on Q). Then there exist a constant K depending only on V and C such that log N(ɛ F Q,2, convf, L 2 (Q)) K ( 1 ɛ ) 2V/(V +2). This is Theorem 2.6.9 of VW, and we omit the proof. 9

It is not hard to verify that sconvf is a subset of the convex hull of F { F} {0}, where F { f : f F} (see Exercise 9.6.1 below). Since the covering numbers of F { F} {0} are at most one plus twice the covering numbers of F, the conclusion of Theorem 3 also holds if convf is replaced with sconvf. 10

This leads to the following easy corollary for VC-hull classes: COROLLARY 1. For any VC-hull class F of measurable functions and all 0 < ɛ < 1, sup Q log N(ɛ F Q,2, F, L 2 (Q)) K ( ) 1 2 2/V, 0 < ɛ < 1, ɛ where the supremum is taken over all probability measures Q with F Q,2 > 0, V is the VC-index of the VC-subgraph class associated with F, and the constant K < depends only on V. 11

We now present several important examples and results about VC-classes of sets and both VC-subgraph and VC-hull classes of functions. LEMMA 1. Let F be a finite-dimensional vector space of measurable functions f : X R. Then F is VC-subgraph with V (F) dim(f) + 2. The next three lemmas consist of useful tools for building VC-classes from other VC-classes. 12

LEMMA 2. Let C and D be VC-classes of sets in a set X, with respective VC-indices V C and V D ; and let E be a VC-class of sets in W, with VC-index V E. Also let φ : X Y and ψ : Z X be fixed functions. Then (i) C c {C c : C C} is VC with V (C c ) = V (C); (ii) C D {C D : C C, D D} is VC with index V C + V D 1; (iii) C D {C D : C C, D D} is VC with index V C + V D 1; (iv) D E is VC in X W with VC index V D + V E 1; 13

(v) φ(c) is VC with index V C if φ is one-to-one; (vi) ψ 1 (C) is VC with index V C. 14

LEMMA 3. For any class C of sets in a set X, the class F C of indicator functions of sets in C is VC-subgraph if and only if C is a VC-class. Moreover, whenever at least one of C or F C is VC, the respective VC-indices are equal. 15

Proof. Let D be the collection of sets of the form for all C C. {(x, t) : t < 1{x C}} Suppose that D is VC, and let k = V (D). Then no set of the form {(x 1, 0),..., (x k, 0)} can be shattered by D, and hence V (C) V (D). 16

Now suppose that C is VC with VC-index k. Since for any t < 0, 1{x C} > t for all x and all C, we have that no collection {(x 1, t 1 ),..., (x k, t k )} can be shattered by D if any of the t j s are < 0. It is similarly true that no collection {(x 1, t 1 ),..., (x k, t k )} can be shattered by D if any of the t j s are 1, since 1{x C} > t is never true when t 1. 17

It can now be deduced that can only be shattered if can be shattered. {(x 1, t 1 ),..., (x k, t k )} {(x 1, 0),..., (x k, 0)} But this can only happen if {x 1,..., x k } can be shattered by C. Thus V (D) V (C). 18

LEMMA 4. Let F and G be VC-subgraph classes of functions on a set X, with respective VC indices V F and V G. Let g : X R, φ : R R, and ψ : Z X be fixed functions. Then (i) F G {f g : f F, g G} is VC-subgraph with index V F + V G 1; (ii) F G is VC with index V F + V G 1; (iii) {F > 0} {{f > 0} : f F} is a VC-class of sets with index V F ; (iv) F is VC-subgraph with index V F ; 19

(v) F + g {f + g : f F} is VC with index V F ; (vi) F g {fg : f F} is VC with index 2V F 1; (vii) F ψ {f(ψ) : f F} is VC with index V F ; (viii) φ F is VC with index V F for monotone φ. 20

The next two lemmas refer to properties of monotone processes and classes of monotone functions. LEMMA 5. Let {X(t), t T } be a monotone increasing stochastic process, where T R. Then X is VC-subgraph with index V (X) = 2. 21

Proof. Let X be the set of all monotone increasing functions g : T R; and for any s T and x X, define (s, x) f s (x) = x(s). Thus the proof is complete if we can show that the class of functions F {f s : s T } is VC-subgraph with VC index 2. Now let (x 1, t 1 ), (x 2, t 2 ) be any two points in X R. F shatters (x 1, t 1 ), (x 2, t 2 ) if the graph G of (f s (x 1 ), f s (x 2 )) in R 2 surrounds the point (t 1, t 2 ) as s ranges over T. 22

By surrounding a point (a, b) R 2, we mean that the graph must pass through all four of the sets {(u, v) : u a, v b}, {(u, v) : u > a, v b}, {(u, v) : u a, v > b} and {(u, v) : u > a, v > b}. By the assumed monotonicity of x 1 and x 2, the graph G forms a monotone curve in R 2, and it is thus impossible for it to surround any point in R 2. Thus (x 1, t 1 ), (x 2, t 2 ) cannot be shattered by F, and the desired result follows. 23

LEMMA 6. The set F of all monotone functions f : R [0, 1] satisfies sup Q log N(ɛ, F, L 2 (Q)) K ɛ, 0 < ɛ < 1, where the supremum is taken over all probability measures Q, and the constant K < is universal. 24

BUEI Classes Recall for a class of measurable functions F, with envelope F, the uniform entropy integral J(δ, F, L 2 ) δ 0 sup log N(ɛ F Q,2, F, L 2 (Q))dɛ, Q where the supremum is taken over all finitely discrete probability measures Q with F Q,2 > 0. Note the dependence on choice of envelope F. This is crucial since there are many random functions which can serve as an envelope. 25

For example, if F is an envelope, then so is F + 1 and 2F. One must allow that different envelopes may be needed in different settings. We say that the class F has bounded uniform entropy integral (BUEI) with envelope F or is BUEI with envelope F if J(1, F, L 2 ) < for that particular choice of envelope. 26

Theorem 2 tells us that a VC-class F is automatically BUEI with any envelope. We leave it as an exercise to show that if F and G are BUEI with respective envelopes F and G, then F G is BUEI with envelope F G. 27

LEMMA 7. Let F 1,..., F k be BUEI classes with respective envelopes F 1,..., F k, and let φ : R k R satisfy k φ f(x) φ g(x) 2 c 2 (f j (x) g j (x)) 2, (1) j=1 for every f, g F 1 F k and x for a constant 0 < c <. Then the class φ (F 1,..., F k ) is BUEI with envelope H φ(f 0 ) + c k ( f 0j + F j ), j=1 where f 0 (f 01,..., f 0k ) is any function in F 1 F k, and where φ (F 1,..., F k ) is as defined in Lemma 8.10. 28

Some useful consequences of Lemma 7 are given in the following lemma: LEMMA 8. Let F and G be BUEI with respective envelopes F and G, and let φ : R R be a Lipschitz continuous function with Lipschitz constant 0 < c <. Then (i) F G is BUEI with envelope F + G; (ii) F G is BUEI with envelope F + G; (iii) F + G is BUEI with envelope F + G; (iv) φ(f) is BUEI with envelope φ(f 0 ) + c( f 0 + F ), provided f 0 F. 29

Most of the BUEI preservation results we give in this section have parallel Donsker preservation properties. An important exception, and one which is perhaps the primary justification for the use of BUEI preservation techniques, applies to products of Donsker classes. As verified in the following theorem, the product of two BUEI classes is BUEI, whether or not the two classes involved are bounded: THEOREM 4. Let F and G be BUEI classes with respective envelopes F and G. 30

Then F G {fg : f F, g G} is BUEI with envelope F G. In order for BUEI results to be useful for obtaining Donsker results, it is necessary that sufficient measurability be established so that Theorem 8.19 can be used. Pointwise measurability (PM) is sufficient measurability for this purpose. 31

Since there are significant similarities between PM preservation and BUEI preservation results, one can construct useful joint PM and BUEI preservation results: LEMMA 9. Let the classes F 1,..., F k be both BUEI and PM with respective envelopes F 1,..., F k, and let φ : R k R satisfy (1) for every f, g F 1 F k and x for a constant 0 < c <. Then the class φ (F 1,..., F k ) is both BUEI and PM with envelope H φ(f 0 ) + c k ( f 0j + F j ), j=1 where f 0 is any function in F 1 F k. 32

LEMMA 10. Let the classes F and G be both BUEI and PM with respective envelopes F and G, and let φ : R R be a Lipschitz continuous function with Lipschitz constant 0 < c <. Then (i) F G is both BUEI and PM with envelope F G; (ii) F G is both BUEI and PM with envelope F + G; (iii) F G is both BUEI and PM with envelope F + G; (iv) F + G is both BUEI and PM with envelope F + G; (v) F G is both BUEI and PM with envelope F G; (vi) φ(f) is both BUEI and PM with envelope φ(f 0 ) + c( f 0 + F ), where f 0 F. 33

If a class of measurable functions F is both BUEI and PM with envelope F, then Theorem 8.19 implies that F is P -Donsker whenever P F 2 <. Note that we have somehow avoided discussing preservation for subsets of classes. This is because it is unclear whether a subset of a PM class F is itself a PM class. 34

The difficulty is that while F may have a countable dense subset G (dense in terms of pointwise convergence), it is unclear whether any arbitrary subset H F also has a suitable countable dense subset. An easy way around this problem is to use various preservation results to establish that F is P -Donsker, and then it follows directly that any H F is also P -Donsker by the definition of weak convergence. 35

Bracketing Entropy We now present several useful bracketing entropy results for certain function classes as well as a few preservation results. We first mention that bracketing numbers are in general larger than covering numbers, as verified in the following lemma: LEMMA 11. Let F be any class of real function on X and any norm on F. Then N(ɛ, F, ) N [] (ɛ, F, ) for all ɛ > 0. 36

Proof. Fix ɛ > 0, and let B be collection of ɛ-brackets that covers F. From each bracket B B, take a function g(b) B F to form a finite collection of functions G F of the same cardinality as B consisting of one function from each bracket in B. Now every f F lies in a bracket B B such that f g(b) ɛ by the definition of an ɛ-bracket. Thus G is an ɛ cover of F of the same cardinality as B. The desired conclusion now follows. 37

The first substantive bracketing entropy result we present considers classes of smooth functions on a bounded set X R d. For any vector k = (k 1,..., k d ) of nonnegative integers define the differential operator D k k ( x k 1 1,..., xk d d ), where k k 1 + + k d. As defined previously, let x be the largest integer j x, for any x R. 38

For any function f : X R and α > 0, define the norm f α max sup D k f(x) + max sup k: k α x k: k = α x,y D k f(x) D k f(y) x y α α, where the suprema are taken over x y in the interior of X. When k = 0, we set D k f = f. Now let C α M f α M. (X ) be the set of all continuous functions f : X R with Recall that for a set A in a metric space, diam A = sup x,y A d(x, y). 39

THEOREM 5. Let X R d be bounded and convex with nonempty interior. There exists a constant K < depending only on α, diam X, and d such that log N [] (ɛ, C α 1 (X ), L r (Q)) K ( ) 1 d/α, ɛ for every r 1, ɛ > 0, and any probability measure Q on R d. 40

We now consider several results for Lipschitz and Sobolev function classes. We first present the results for covering numbers based on the uniform norm and then present the relationship to bracketing entropy: THEOREM 6. For a compact, convex subset C R d, let F be the class of all convex functions f : C [0, 1] with f(x) f(y) L x y for every x, y. For some integer m 1, let G be the class of all functions g : [0, 1] [0, 1] with 1 0 [g(m) (x)] 2 dx 1, where superscript (m) denotes the m th derivative. 41

Then log N(ɛ, F, ) K(1 + L) d/2 ( 1 ɛ ) d/2, and log N(ɛ, G, ) M ( ) 1 1/m, ɛ where is the uniform norm and the constant K < depends only on d and C and the constant M depends only on m. 42

The following lemma shows how Theorem 6 applies to bracketing entropy: LEMMA 12. For any norm dominated by and any class of functions F, log N [] (2ɛ, F, ) log N(ɛ, F, ), for all ɛ > 0. Proof. Let f 1,..., f m be a uniform ɛ-cover of F. Since the 2ɛ-brackets [f i ɛ, f i + ɛ] now cover F, the result follows. 43

We now present a second Lipschitz continuity result which is in fact a generalization of Lemma 12. The result applies to function classes of the form F = {f t : t T }, where f s (x) f t (x) d(s, t)f (x) (2) for some metric d on T, some real function F on the sample space X, and for all x X. This special Lipschitz structure arises in a number of settings, including parametric Z- and M- estimation. 44

For example, consider the least absolute deviation regression setting of Section 2.2.6, under the assumption that the random covariate U and regression parameter θ are constrained to known compact subsets U, Θ R p. Recall that, in this setting, the outcome given U is modeled as Y = θ U + e, where the residual error e has median zero. 45

Estimation of the true parameter value θ 0 is accomplished by minimizing θ P n m θ, where m θ (X) e (θ θ 0 ) U e, X (Y, U) and e = Y θ 0 U. From (2.20) in Section 2.2.6, we know that the class F = {m θ : θ Θ} satisfies (2) with T = Θ, d(s, t) = s t and F (x) = u, where x = (y, u) is a realization of X. 46

The following theorem shows that the bracketing numbers for a general F satisfying (2) are bounded by the covering numbers for the associated index set T : THEOREM 7. Suppose the class of functions F = {f t : t T } satisfies (2) for every s, t T and some fixed function F. Then, for any norm, N [] (2ɛ F, F, ) N(ɛ, T, d). 47

Proof. Note that for any ɛ-net t 1,..., t k that covers T with respect to d, the brackets [f tj ɛf, f tj + ɛf ] cover F. Since these brackets are all of size 2ɛ F, the proof is complete. Note that when is any norm dominated by, Theorem 7 simplifies to Lemma 12 when T = F and d = (and thus automatically F = 1). 48

We move now from continuous functions to monotone functions: THEOREM 8. For each integer r 1, there exists a constant K < such that the class F of monotone functions f : R [0, 1] satisfies log N [] (ɛ, F, L r (Q)) K ɛ, for all ɛ > 0 and every probability measure Q. 49

Useful preservation results for bracketing entropy are, unfortunately, rare, but the following are two such results: LEMMA 13. Let F and G be classes of measurable function. Then for any probability measure Q and any 1 r, (i) N [] (2ɛ, F + G, L r (Q)) N [] (ɛ, F, L r (Q))N [] (ɛ, G, L r (Q)); (ii) Provided F and G are bounded by 1, N [] (2ɛ, F G, L r (Q)) N [] (ɛ, F, L r (Q))N [] (ɛ, G, L r (Q)). 50