EMPIRICAL PROCESS THEORY AND APPLICATIONS. Sara van de Geer. Handout WS 2006 ETH Zürich


Contents

Preface
1 Introduction
  1.1 Law of large numbers for real-valued random variables
  1.2 R^d-valued random variables
  1.3 Definition of Glivenko-Cantelli classes of sets
  1.4 Convergence of averages to their expectations
2 Exponential probability inequalities
  2.1 Chebyshev's inequality
  2.2 Bernstein's inequality
  2.3 Hoeffding's inequality
  2.4 Exercise
3 Symmetrization
  3.1 Symmetrization with means
  3.2 Symmetrization with probabilities
  3.3 Some facts about conditional expectations
4 Uniform laws of large numbers
  4.1 Classes of functions
  4.2 Classes of sets
  4.3 Vapnik-Chervonenkis classes
  4.4 VC graph classes of functions
  4.5 Exercises
5 M-estimators
  5.1 What is an M-estimator?
  5.2 Consistency
  5.3 Exercises
6 Uniform central limit theorems
  6.1 Real-valued random variables
  6.2 R^d-valued random variables
  6.3 Donsker's Theorem
  6.4 Donsker classes
  6.5 Chaining and the increments of empirical processes
7 Asymptotic normality of M-estimators
  7.1 Asymptotic linearity
  7.2 Conditions a, b and c for asymptotic normality
  7.3 Asymptotics for the median
  7.4 Conditions A, B and C for asymptotic normality
  7.5 Exercises
8 Rates of convergence for least squares estimators
  8.1 Gaussian errors
  8.2 Rates of convergence
  8.3 Examples
  8.4 Exercises
9 Penalized least squares
  9.1 Estimation and approximation error
  9.2 Finite models
  9.3 Nested finite models
  9.4 General penalties
  9.5 Application to a classical penalty
  9.6 Exercise

Preface

This preface motivates why, from a statistician's point of view, it is interesting to study empirical processes. We indicate that any estimator is some function of the empirical measure. In these lectures, we study convergence of the empirical measure as the sample size increases.

In the simplest case, a data set consists of observations on a single variable, say real-valued observations. Suppose there are $n$ such observations, denoted by $X_1, \ldots, X_n$. For example, $X_i$ could be the reaction time of individual $i$ to a given stimulus, or the number of car accidents on day $i$, etc. Suppose now that each observation follows the same probability law $P$. This means that the observations are relevant if one wants to predict the value of a new observation $X$ (say, the reaction time of a hypothetical new subject, or the number of car accidents on a future day, etc.). Thus, a common underlying distribution $P$ allows one to generalize the outcomes.

An estimator is any given function $T_n(X_1, \ldots, X_n)$ of the data. Let us review some common estimators.

The empirical distribution. The unknown $P$ can be estimated from the data in the following way. Suppose first that we are interested in the probability that an observation falls in $A$, where $A$ is a certain set chosen by the researcher. We denote this probability by $P(A)$. From the frequentist point of view, the probability of an event is nothing else than the limit of relative frequencies of occurrences of that event, as the number of occasions of possible occurrences $n$ grows without limit. So it is natural to estimate $P(A)$ with the frequency of $A$, i.e., with
$$P_n(A) = \frac{\text{number of times an observation } X_i \text{ falls in } A}{\text{total number of observations}} = \frac{\#\{X_i \in A\}}{n}.$$
We now define the empirical measure $P_n$ as the probability law that assigns to a set $A$ the probability $P_n(A)$. We regard $P_n$ as an estimator of the unknown $P$.

The empirical distribution function. The distribution function of $X$ is defined as
$$F(x) = P(X \le x),$$
and the empirical distribution function is
$$\hat F_n(x) = \frac{\#\{X_i \le x\}}{n}.$$

[Figure 1: the distribution function $F(x) = 1 - 1/x^2$, $x \ge 1$ (smooth curve), and the empirical distribution function $\hat F_n$ (step function) of a sample from $F$ with sample size $n = 200$.]
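As a quick numerical illustration (a Python sketch, not part of the handout; NumPy is assumed, and the seed and evaluation points are arbitrary choices), we can compute $P_n(A)$ and $\hat F_n$ for a sample from the Pareto distribution of Figure 1:

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed, for reproducibility
n = 200
# Sample from F(x) = 1 - 1/x^2, x >= 1, by inverse transform:
# X = (1 - U)^(-1/2) for U uniform on [0, 1).
X = (1.0 - rng.uniform(size=n)) ** (-0.5)

def P_n(A, X):
    """Empirical measure of a set A: fraction of observations falling in A."""
    return np.mean(A(X))

def F_hat(x, X):
    """Empirical distribution function at x: fraction of X_i <= x."""
    return np.mean(X <= x)

# P_n of A = (1, 2] versus the true probability P(A) = F(2) - F(1) = 3/4.
print(P_n(lambda x: (x > 1) & (x <= 2), X), 0.75)
print(F_hat(1.5, X), 1 - 1 / 1.5**2)    # empirical vs. true F at x = 1.5
```

Both printed pairs should agree up to sampling error of order $1/\sqrt n$.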

Means and averages. The theoretical mean $\mu := EX$ ($E$ stands for expectation) can be estimated by the sample average
$$\bar X_n := \frac{X_1 + \cdots + X_n}{n}.$$
More generally, let $g$ be a real-valued function on $\mathbb R$. Then
$$\frac{g(X_1) + \cdots + g(X_n)}{n}$$
is an estimator of $Eg(X)$.

Sample median. The median of $X$ is the value $m$ that satisfies $F(m) = 1/2$ (assuming there is a unique solution). Its empirical version is any value $\hat m_n$ such that $\hat F_n(\hat m_n)$ is equal, or as close as possible, to $1/2$. In the above example $F(x) = 1 - 1/x^2$, so the theoretical median is $m = \sqrt 2 = 1.414\ldots$. In the ordered sample, the 100th observation equals 1.466 and the 101st observation equals 1.492. A common choice for the sample median is to take the average of these two values. This gives $\hat m_n = 1.479$.

Properties of estimators. Let $T_n = T_n(X_1, \ldots, X_n)$ be an estimator of the real-valued parameter $\theta$. Then it is desirable that $T_n$ is in some sense close to $\theta$. A minimum requirement is that the estimator approaches $\theta$ as the sample size increases. This is called consistency. To be more precise, suppose the sample $X_1, \ldots, X_n$ are the first $n$ of an infinite sequence $X_1, X_2, \ldots$ of independent copies of $X$. Then $T_n$ is called strongly consistent if, with probability one, $T_n \to \theta$ as $n \to \infty$. Note that consistency of frequencies as estimators of probabilities, or of means as estimators of expectations, follows from the strong law of large numbers. In general, an estimator $T_n$ can be a complicated function of the data. In that case, it is helpful to know that the convergence of means to their expectations is uniform over a class. The latter is a major topic in empirical process theory.

Parametric models. The distribution $P$ may be partly known beforehand. The unknown parts of $P$ are called parameters of the model. For example, if the $X_i$ are yes/no answers to a certain question (the binary case), we know that $P$ allows only two possibilities, say 1 and 0 (yes = 1, no = 0). There is only one parameter, say the probability of a yes answer $\theta = P(X = 1)$. More generally, in a parametric model, it is assumed that $P$ is known up to a finite number of parameters $\theta = (\theta_1, \ldots, \theta_d)$. We then often write $P = P_\theta$. When there are infinitely many parameters (which is for example the case when $P$ is completely unknown), the model is called nonparametric.

Nonparametric models. An example of a nonparametric model is one where it is assumed that the density $f$ of the distribution function $F$ exists, but all one assumes about it is some kind of smoothness (e.g., the continuous first derivative of $f$ exists). In that case, one may propose e.g. to use the histogram as estimator of $f$. This is an example of a nonparametric estimator.

Histograms. Our aim is estimating the density $f(x)$ at a given point $x$. The density is defined as the derivative of the distribution function $F$ at $x$:
$$f(x) = \lim_{h \downarrow 0} \frac{F(x+h) - F(x)}{h} = \lim_{h \downarrow 0} \frac{P(x, x+h]}{h}.$$
Here, $(x, x+h]$ is the interval with left endpoint $x$ (not included) and right endpoint $x+h$ (included). Unfortunately, replacing $P$ by $P_n$ here does not work, since for $h$ small enough, $P_n(x, x+h]$ will be equal to zero. Therefore, instead of taking the limit as $h \downarrow 0$, we fix $h$ at a small positive value, called the bandwidth. The estimator of $f(x)$ thus becomes
$$\hat f_n(x) = \frac{P_n(x, x+h]}{h} = \frac{\#\{X_i \in (x, x+h]\}}{nh}.$$
A plot of this estimator at the points $x \in \{x_0, x_0 + h, x_0 + 2h, \ldots\}$ is called a histogram.
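The histogram estimator is easy to compute directly. The following sketch (again not from the handout; the seed and bin choices are arbitrary) evaluates $\hat f_n$ on the Pareto sample, whose true density is $f(x) = 2/x^3$:

```python
import numpy as np

rng = np.random.default_rng(1)                      # arbitrary seed
n, h = 200, 0.5                                     # sample size and bandwidth as in the example
X = (1.0 - rng.uniform(size=n)) ** (-0.5)           # Pareto sample, F(x) = 1 - 1/x^2

def f_hat(x, X, h):
    """Histogram estimator: fraction of X_i in (x, x + h], divided by h."""
    return np.sum((X > x) & (X <= x + h)) / (len(X) * h)

x0 = 1.0                                            # left endpoint of the first bin
for j in range(6):                                  # bins (x0, x0+h], (x0+h, x0+2h], ...
    x = x0 + j * h
    true_f = 2.0 / (x + h / 2) ** 3                 # f(x) = 2/x^3 at the bin midpoint
    print(f"bin ({x:.1f}, {x + h:.1f}]: f_hat = {f_hat(x, X, h):.3f}, "
          f"f(midpoint) = {true_f:.3f}")
```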

Example. [Figure 2: the histogram, with bandwidth $h = 0.5$, for the sample of size $n = 200$ from the Pareto distribution with parameter $\theta = 2$; the solid line is the density $f$ of this distribution.]

Conclusion. An estimator $T_n$ is some function of the data $X_1, \ldots, X_n$. If it is a symmetric function of the data (which we can in fact assume without loss of generality when the ordering in the data contains no information), we may write $T_n = T(P_n)$, where $P_n$ is the empirical distribution. Roughly speaking, the main purpose of theoretical statistics is studying the difference between $T(P_n)$ and $T(P)$. We are therefore interested in convergence of $P_n$ to $P$ in a broad enough sense. This is what empirical process theory is about.

1 Introduction

This chapter introduces the notation and part of the problem setting. Let $X_1, \ldots, X_n, \ldots$ be i.i.d. copies of a random variable $X$ with values in $\mathcal X$ and with distribution $P$. The distribution of the sequence $X_1, X_2, \ldots$ (+ perhaps some auxiliary variables) is denoted by $\mathbf P$.

Definition 1.1. Let $\{T_n, T\}$ be a collection of real-valued random variables. Then $T_n$ converges in probability to $T$ if for all $\epsilon > 0$,
$$\lim_{n \to \infty} \mathbf P(|T_n - T| > \epsilon) = 0.$$
Notation: $T_n \stackrel{\mathbf P}{\to} T$. Moreover, $T_n$ converges almost surely (a.s.) to $T$ if
$$\mathbf P\big(\lim_{n \to \infty} T_n = T\big) = 1.$$

Remark. Convergence almost surely implies convergence in probability.

1.1 Law of large numbers for real-valued random variables

Consider the case $\mathcal X = \mathbb R$. Suppose the mean $\mu := EX$ exists. Define the average
$$\bar X_n := \frac 1 n \sum_{i=1}^n X_i.$$
Then, by the law of large numbers, as $n \to \infty$,
$$\bar X_n \to \mu, \quad \text{a.s.}$$
Now, let
$$F(t) := P(X \le t), \quad t \in \mathbb R,$$
be the theoretical distribution function, and
$$F_n(t) := \frac 1 n \#\{X_i \le t,\ 1 \le i \le n\}, \quad t \in \mathbb R,$$
be the empirical distribution function. Then, by the law of large numbers, as $n \to \infty$,
$$F_n(t) \to F(t), \quad \text{a.s., for all } t.$$
We will prove in Chapter 4 the Glivenko-Cantelli Theorem, which says that
$$\sup_t |F_n(t) - F(t)| \to 0, \quad \text{a.s.}$$
This is a uniform law of large numbers.

Application: Kolmogorov's goodness-of-fit test. We want to test $H_0 : F = F_0$. Test statistic:
$$D_n := \sup_t |F_n(t) - F_0(t)|.$$
Reject $H_0$ for large values of $D_n$.
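The statistic $D_n$ is straightforward to compute, since the supremum is attained at the order statistics. A Monte Carlo sketch (our own illustration; NumPy assumed, arbitrary seed) also exhibits the Glivenko-Cantelli behaviour $D_n \to 0$ under $H_0$:

```python
import numpy as np

rng = np.random.default_rng(2)                      # arbitrary seed

def kolmogorov_statistic(X, F0):
    """D_n = sup_t |F_n(t) - F0(t)|; the sup is attained at the order statistics."""
    Xs = np.sort(X)
    n = len(Xs)
    F0s = F0(Xs)
    # F_n jumps from (i-1)/n to i/n at the i-th order statistic.
    upper = np.max(np.arange(1, n + 1) / n - F0s)
    lower = np.max(F0s - np.arange(0, n) / n)
    return max(upper, lower)

F0 = lambda x: 1.0 - 1.0 / x**2                     # the Pareto distribution function
for n in [100, 1000, 10000]:
    X = (1.0 - rng.uniform(size=n)) ** (-0.5)       # a sample actually drawn from F0
    print(n, kolmogorov_statistic(X, F0))           # shrinks at rate ~ 1/sqrt(n)
```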

1.2 $\mathbb R^d$-valued random variables

Questions:
(i) What is a natural extension of half-intervals in $\mathbb R$ to higher dimensions?
(ii) Does Glivenko-Cantelli hold for this extension?

1.3 Definition of Glivenko-Cantelli classes of sets

Let, for any measurable $A \subset \mathcal X$,
$$P_n(A) := \frac 1 n \#\{X_i \in A,\ 1 \le i \le n\}.$$
We call $P_n$ the empirical measure based on $X_1, \ldots, X_n$. Let $\mathcal D$ be a collection of subsets of $\mathcal X$.

Definition 1.3. The collection $\mathcal D$ is called a Glivenko-Cantelli (GC) class if
$$\sup_{D \in \mathcal D} |P_n(D) - P(D)| \to 0, \quad \text{a.s.}$$

Example. Let $\mathcal X = \mathbb R$. The class of half-intervals $\mathcal D = \{(-\infty, t] : t \in \mathbb R\}$ is GC. But when e.g. $P$ = the uniform distribution on $[0,1]$ (i.e., $F(t) = t$, $0 \le t \le 1$), the class $\mathcal B = \{\text{all Borel subsets of } [0,1]\}$ is not GC.

1.4 Convergence of averages to their expectations

Notation. For a function $g : \mathcal X \to \mathbb R$, we write
$$Pg := Eg(X), \quad P_n g := \frac 1 n \sum_{i=1}^n g(X_i).$$
Let $\mathcal G$ be a collection of real-valued functions on $\mathcal X$.

Definition 1.4. The class $\mathcal G$ is called a Glivenko-Cantelli (GC) class if
$$\sup_{g \in \mathcal G} |P_n g - Pg| \to 0, \quad \text{a.s.}$$
We will often use the notation
$$\|P_n - P\|_{\mathcal G} := \sup_{g \in \mathcal G} |P_n g - Pg|.$$
We will skip measurability issues, and most of the time do not mention explicitly the requirement of measurability of certain sets or functions. This means that everything has to be understood modulo measurability.

2 Exponential probability inequalities

A statistician is almost never sure about something, but often says that something holds with large probability. We study probability inequalities for deviations of means from their expectations. These are exponential inequalities; that is, the probability that the deviation is large is exponentially small. We will in fact see that the inequalities are similar to those obtained if we assume normality. Exponentially small probabilities are useful indeed when one wants to prove that with large probability a whole collection of events holds simultaneously: it then suffices to show that adding up the small probabilities that one such event does not hold still gives something small. We will use this argument in Chapter 4.

2.1 Chebyshev's inequality

Chebyshev's inequality. Consider a random variable $X \in \mathbb R$ with distribution $P$, and an increasing function $\phi : \mathbb R \to [0, \infty)$. Then for all $a$ with $\phi(a) > 0$, we have
$$\mathbf P(X \ge a) \le \frac{E\phi(X)}{\phi(a)}.$$

Proof.
$$E\phi(X) = \int \phi(x)\,dP(x) = \int_{x \ge a} \phi(x)\,dP(x) + \int_{x < a} \phi(x)\,dP(x) \ge \int_{x \ge a} \phi(a)\,dP(x) = \phi(a)\,\mathbf P(X \ge a). \qquad \square$$

Let $X$ be $N(0,1)$-distributed. By Exercise 2.4,
$$\mathbf P(X \ge a) \le \exp[-a^2/2], \quad a > 0.$$

Corollary 2.1. Let $X_1, \ldots, X_n$ be independent real-valued random variables, and suppose, for all $i$, that $X_i$ is $N(0, \sigma_i^2)$-distributed. Define $b^2 = \sum_{i=1}^n \sigma_i^2$. Then for all $a > 0$,
$$\mathbf P\Big(\sum_{i=1}^n X_i \ge a\Big) \le \exp\Big[-\frac{a^2}{2b^2}\Big].$$

2.2 Bernstein's inequality

Bernstein's inequality. Let $X_1, \ldots, X_n$ be independent real-valued random variables with expectation zero. Suppose that for all $i$,
$$E|X_i|^m \le \frac{m!}{2} K^{m-2} \sigma_i^2, \quad m = 2, 3, \ldots$$
Define
$$b^2 = \sum_{i=1}^n \sigma_i^2.$$
We have for any $a > 0$,
$$\mathbf P\Big(\sum_{i=1}^n X_i \ge a\Big) \le \exp\Big[-\frac{a^2}{2(aK + b^2)}\Big].$$

Proof. We have for $0 < \lambda < 1/K$,
$$E\exp[\lambda X_i] = 1 + \sum_{m=2}^\infty \frac{\lambda^m}{m!} EX_i^m \le 1 + \sum_{m=2}^\infty \frac{\lambda^2}{2} (\lambda K)^{m-2} \sigma_i^2 = 1 + \frac{\lambda^2 \sigma_i^2}{2(1 - \lambda K)} \le \exp\Big[\frac{\lambda^2 \sigma_i^2}{2(1 - \lambda K)}\Big].$$
It follows that
$$E\exp\Big[\lambda \sum_{i=1}^n X_i\Big] = \prod_{i=1}^n E\exp[\lambda X_i] \le \exp\Big[\frac{\lambda^2 b^2}{2(1 - \lambda K)}\Big].$$
Now, apply Chebyshev's inequality to $\sum_{i=1}^n X_i$, with $\phi(x) = \exp[\lambda x]$, $x \in \mathbb R$. We arrive at
$$\mathbf P\Big(\sum_{i=1}^n X_i \ge a\Big) \le \exp\Big[\frac{\lambda^2 b^2}{2(1 - \lambda K)} - \lambda a\Big].$$
Take
$$\lambda = \frac{a}{Ka + b^2}$$
to complete the proof. $\square$

2.3 Hoeffding's inequality

Hoeffding's inequality. Let $X_1, \ldots, X_n$ be independent real-valued random variables with expectation zero. Suppose that for all $i$, and for certain constants $c_i > 0$,
$$|X_i| \le c_i.$$
Then for all $a > 0$,
$$\mathbf P\Big(\sum_{i=1}^n X_i \ge a\Big) \le \exp\Big[-\frac{a^2}{2\sum_{i=1}^n c_i^2}\Big].$$

Proof. Let $\lambda > 0$. By the convexity of the exponential function $\exp[\lambda x]$, we know that for any $0 \le \alpha \le 1$,
$$\exp[\alpha \lambda x + (1-\alpha)\lambda y] \le \alpha \exp[\lambda x] + (1-\alpha)\exp[\lambda y].$$
Define now
$$\alpha_i = \frac{c_i - X_i}{2c_i}.$$
Then
$$X_i = \alpha_i(-c_i) + (1 - \alpha_i)c_i,$$
so
$$\exp[\lambda X_i] \le \alpha_i \exp[-\lambda c_i] + (1 - \alpha_i)\exp[\lambda c_i].$$
But then, since $E\alpha_i = 1/2$, we find
$$E\exp[\lambda X_i] \le \frac 1 2 \exp[-\lambda c_i] + \frac 1 2 \exp[\lambda c_i].$$

Now, for all $x$,
$$\exp[-x] + \exp[x] = 2\sum_{k=0}^\infty \frac{x^{2k}}{(2k)!},$$
whereas
$$2\exp[x^2/2] = 2\sum_{k=0}^\infty \frac{x^{2k}}{2^k k!}.$$
Since $(2k)! \ge 2^k k!$, we thus know that
$$\exp[-x] + \exp[x] \le 2\exp[x^2/2],$$
and hence
$$E\exp[\lambda X_i] \le \exp[\lambda^2 c_i^2/2].$$
Therefore,
$$E\exp\Big[\lambda \sum_{i=1}^n X_i\Big] \le \exp\Big[\lambda^2 \sum_{i=1}^n c_i^2/2\Big].$$
It follows now from Chebyshev's inequality that
$$\mathbf P\Big(\sum_{i=1}^n X_i \ge a\Big) \le \exp\Big[\lambda^2 \sum_{i=1}^n c_i^2/2 - \lambda a\Big].$$
Take $\lambda = a/\sum_{i=1}^n c_i^2$ to complete the proof. $\square$

2.4 Exercise

Exercise 1. Let $X$ be $N(0,1)$-distributed. Show that for $\lambda > 0$,
$$E\exp[\lambda X] = \exp[\lambda^2/2].$$
Conclude that for all $a > 0$,
$$\mathbf P(X \ge a) \le \exp[\lambda^2/2 - \lambda a].$$
Take $\lambda = a$ to find the inequality
$$\mathbf P(X \ge a) \le \exp[-a^2/2].$$
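Hoeffding's bound is easy to check by simulation. The following sketch (our own illustration, with arbitrary choices of $n$, $a$ and the distribution of the $X_i$) compares the empirical tail of a sum of bounded, mean-zero variables with the bound of Section 2.3:

```python
import numpy as np

rng = np.random.default_rng(3)                      # arbitrary seed
n, reps, a = 100, 20000, 15.0                       # illustrative choices

# X_i uniform on [-1, 1]: mean zero, |X_i| <= c_i with c_i = 1.
sums = rng.uniform(-1.0, 1.0, size=(reps, n)).sum(axis=1)

empirical = np.mean(sums >= a)
hoeffding = np.exp(-a**2 / (2.0 * n))               # exp[-a^2 / (2 sum c_i^2)]
print(f"P(sum >= {a}) ~ {empirical:.5f} <= {hoeffding:.5f}")
```

The empirical probability is typically far below the bound: Hoeffding's inequality is not sharp, but it is exponential, which is what the union-bound argument of Chapter 4 needs.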

3 Symmetrization

Symmetrization is a technique based on the following idea. Suppose you have some estimation method, and want to know how well it performs. Suppose you have a sample of size $n$, the so-called training set, and a second sample, say also of size $n$, the so-called test set. Then we may use the training set to calculate the estimator, and the test set to check its performance. For example, suppose we want to know how large the maximal deviation is between certain averages and expectations. We cannot calculate this maximal deviation directly, as the expectations are unknown. Instead, we can calculate the maximal deviation between the averages in the two samples. Symmetrization is closely related: it splits the sample of size $n$ randomly into two subsamples.

Let $X \in \mathcal X$ be a random variable with distribution $P$. We consider two independent sets of independent copies of $X$: $\mathbf X := (X_1, \ldots, X_n)$ and $\mathbf X' := (X_1', \ldots, X_n')$. Let $\mathcal G$ be a class of real-valued functions on $\mathcal X$. Consider the empirical measures
$$P_n := \frac 1 n \sum_{i=1}^n \delta_{X_i}, \quad P_n' := \frac 1 n \sum_{i=1}^n \delta_{X_i'}.$$
Here $\delta_x$ denotes a point mass at $x$. Define
$$\|P_n - P\|_{\mathcal G} := \sup_{g \in \mathcal G} |P_n g - Pg|,$$
and likewise
$$\|P_n' - P\|_{\mathcal G} := \sup_{g \in \mathcal G} |P_n' g - Pg|, \quad \|P_n - P_n'\|_{\mathcal G} := \sup_{g \in \mathcal G} |P_n g - P_n' g|.$$

3.1 Symmetrization with means

Lemma 3.1. We have
$$E\|P_n - P\|_{\mathcal G} \le E\|P_n - P_n'\|_{\mathcal G}.$$

Proof. For a function $f$ on $\mathcal X^{2n}$, let $E_{\mathbf X} f(\mathbf X, \mathbf X')$ denote the conditional expectation of $f(\mathbf X, \mathbf X')$ given $\mathbf X$. Then obviously,
$$E_{\mathbf X} P_n g = P_n g, \quad E_{\mathbf X} P_n' g = Pg.$$
So
$$P_n g - Pg = E_{\mathbf X}(P_n - P_n')g.$$
Hence
$$\|P_n - P\|_{\mathcal G} = \sup_{g \in \mathcal G} |E_{\mathbf X}(P_n - P_n')g|.$$
Now, use that for any function $f(Z, t)$ depending on a random variable $Z$ and a parameter $t$, we have
$$\sup_t |Ef(Z, t)| \le \sup_t E|f(Z, t)| \le E\sup_t |f(Z, t)|.$$
So
$$\sup_{g \in \mathcal G} |E_{\mathbf X}(P_n - P_n')g| \le E_{\mathbf X}\|P_n - P_n'\|_{\mathcal G}.$$
So we now showed that
$$\|P_n - P\|_{\mathcal G} \le E_{\mathbf X}\|P_n - P_n'\|_{\mathcal G}.$$
Finally, we use that the expectation of the conditional expectation is the unconditional expectation:
$$EE_{\mathbf X} f(\mathbf X, \mathbf X') = Ef(\mathbf X, \mathbf X').$$

So
$$E\|P_n - P\|_{\mathcal G} \le EE_{\mathbf X}\|P_n - P_n'\|_{\mathcal G} = E\|P_n - P_n'\|_{\mathcal G}. \qquad \square$$

Definition 3.2. A Rademacher sequence $\{\sigma_i\}_{i=1}^n$ is a sequence of independent random variables $\sigma_i$, with
$$\mathbf P(\sigma_i = 1) = \mathbf P(\sigma_i = -1) = \frac 1 2, \quad \forall\, i.$$
Let $\{\sigma_i\}_{i=1}^n$ be a Rademacher sequence, independent of the two samples $\mathbf X$ and $\mathbf X'$. We define the symmetrized empirical measure
$$P_n^\sigma g := \frac 1 n \sum_{i=1}^n \sigma_i g(X_i), \quad g \in \mathcal G.$$
Let
$$\|P_n^\sigma\|_{\mathcal G} = \sup_{g \in \mathcal G} |P_n^\sigma g|.$$

Lemma 3.3. We have
$$E\|P_n - P\|_{\mathcal G} \le 2E\|P_n^\sigma\|_{\mathcal G}.$$

Proof. Consider the symmetrized version of the second sample $\mathbf X'$:
$$P_n^{\prime,\sigma} g = \frac 1 n \sum_{i=1}^n \sigma_i g(X_i').$$
Then $\|P_n - P_n'\|_{\mathcal G}$ has the same distribution as $\|P_n^\sigma - P_n^{\prime,\sigma}\|_{\mathcal G}$. So
$$E\|P_n - P_n'\|_{\mathcal G} = E\|P_n^\sigma - P_n^{\prime,\sigma}\|_{\mathcal G} \le E\|P_n^\sigma\|_{\mathcal G} + E\|P_n^{\prime,\sigma}\|_{\mathcal G} = 2E\|P_n^\sigma\|_{\mathcal G}. \qquad \square$$

3.2 Symmetrization with probabilities

Lemma 3.2.1. Let $\delta > 0$. Suppose that for all $g \in \mathcal G$,
$$\mathbf P(|P_n g - Pg| > \delta/2) \le \frac 1 2.$$
Then
$$\mathbf P(\|P_n - P\|_{\mathcal G} > \delta) \le 2\mathbf P\Big(\|P_n - P_n'\|_{\mathcal G} > \frac\delta 2\Big).$$

Proof. Let $\mathbf P_{\mathbf X}$ denote the conditional probability given $\mathbf X$. If $\|P_n - P\|_{\mathcal G} > \delta$, we know that for some random function $g^* = g^*_{\mathbf X}$ depending on $\mathbf X$,
$$|P_n g^* - Pg^*| > \delta.$$
Because $\mathbf X'$ is independent of $\mathbf X$, we also know that
$$\mathbf P_{\mathbf X}\Big(|P_n' g^* - Pg^*| > \delta/2\Big) \le \frac 1 2.$$
Thus,
$$\mathbf P\Big(|P_n g^* - Pg^*| > \delta \text{ and } |P_n' g^* - Pg^*| \le \frac\delta 2\Big) = E\,\mathbf P_{\mathbf X}\Big(|P_n g^* - Pg^*| > \delta \text{ and } |P_n' g^* - Pg^*| \le \frac\delta 2\Big)$$

$$= E\Big[\mathbf P_{\mathbf X}\Big(|P_n' g^* - Pg^*| \le \frac\delta 2\Big)\,\mathbf 1\{|P_n g^* - Pg^*| > \delta\}\Big] \ge \frac 1 2 E\,\mathbf 1\{|P_n g^* - Pg^*| > \delta\} = \frac 1 2 \mathbf P(|P_n g^* - Pg^*| > \delta).$$
It follows that
$$\mathbf P(\|P_n - P\|_{\mathcal G} > \delta) \le \mathbf P(|P_n g^* - Pg^*| > \delta) \le 2\mathbf P\Big(|P_n g^* - Pg^*| > \delta \text{ and } |P_n' g^* - Pg^*| \le \frac\delta 2\Big) \le 2\mathbf P\Big(|P_n g^* - P_n' g^*| > \frac\delta 2\Big). \qquad \square$$

Corollary 3.2.2. Let $\delta > 0$. Suppose that for all $g \in \mathcal G$,
$$\mathbf P(|P_n g - Pg| > \delta/2) \le \frac 1 2.$$
Then
$$\mathbf P(\|P_n - P\|_{\mathcal G} > \delta) \le 4\mathbf P\Big(\|P_n^\sigma\|_{\mathcal G} > \frac\delta 4\Big).$$

3.3 Some facts about conditional expectations

Let $X$ and $Y$ be two random variables. We write the conditional expectation of $Y$ given $X$ as
$$E_X Y = E(Y \mid X).$$
Then
$$EY = EE_X Y.$$
Let $f$ be some function of $X$ and $g$ be some function of $(X, Y)$. We have
$$E_X f(X)g(X, Y) = f(X)E_X g(X, Y).$$
The conditional probability given $X$ is
$$\mathbf P_X\big((X, Y) \in B\big) = E_X \mathbf 1_B(X, Y).$$
Hence,
$$\mathbf P_X\big(X \in A,\ (X, Y) \in B\big) = \mathbf 1_A(X)\,\mathbf P_X\big((X, Y) \in B\big).$$
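To see Lemma 3.3 at work numerically, the following sketch (our own illustration; the class of half-interval indicators is approximated by a finite grid of thresholds, and the seed is arbitrary) compares the two sides in expectation by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(4)                      # arbitrary seed
n, reps = 50, 2000
t_grid = np.linspace(0.0, 1.0, 201)                 # finite grid standing in for all half-intervals

lhs, rhs = 0.0, 0.0
for _ in range(reps):
    X = rng.uniform(size=n)
    sigma = rng.choice([-1.0, 1.0], size=n)         # a Rademacher sequence
    ind = (X[:, None] <= t_grid)                    # g_t(X_i) = 1{X_i <= t}
    lhs += np.max(np.abs(ind.mean(axis=0) - t_grid))                  # ||P_n - P||_G, since P g_t = t
    rhs += 2.0 * np.max(np.abs((sigma[:, None] * ind).mean(axis=0)))  # 2 ||P_n^sigma||_G

print(lhs / reps, "<=", rhs / reps)                 # Lemma 3.3, in expectation
```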

4 Uniform laws of large numbers

In this chapter, we prove uniform laws of large numbers for the empirical mean of functions $g$ of the individual observations, when $g$ varies over a class $\mathcal G$ of functions. First, we study the case where $\mathcal G$ is finite. Symmetrization is used in order to be able to apply Hoeffding's inequality. Hoeffding's inequality gives exponentially small probabilities for the deviation of averages from their expectations. So, considering only a finite number of such averages, the difference between these averages and their expectations will be small for all averages simultaneously, with large probability. If $\mathcal G$ is not finite, we approximate it by a finite set. A $\delta$-approximation is called a $\delta$-covering, and the number of elements of a minimal $\delta$-covering is called the $\delta$-covering number. We introduce Vapnik-Chervonenkis (VC) classes. These are classes with small covering numbers.

Let $X \in \mathcal X$ be a random variable with distribution $P$. Consider a class $\mathcal G$ of real-valued functions on $\mathcal X$, and consider i.i.d. copies $\{X_1, X_2, \ldots\}$ of $X$. In this chapter, we address the problem of proving
$$\|P_n - P\|_{\mathcal G} \stackrel{\mathbf P}{\to} 0.$$
If this is the case, we call $\mathcal G$ a Glivenko-Cantelli (GC) class.

Remark. It can be shown that if $\|P_n - P\|_{\mathcal G} \stackrel{\mathbf P}{\to} 0$, then also $\|P_n - P\|_{\mathcal G} \to 0$ almost surely. This involves e.g. martingale arguments. We will not consider this issue.

4.1 Classes of functions

Notation. The sup-norm of a function $g$ is
$$\|g\|_\infty := \sup_{x \in \mathcal X} |g(x)|.$$

Elementary observation. Let $\{A_k\}_{k=1}^N$ be a finite collection of events. Then
$$\mathbf P\big(\cup_{k=1}^N A_k\big) \le \sum_{k=1}^N \mathbf P(A_k) \le N\max_{1 \le k \le N} \mathbf P(A_k).$$

Lemma 4.1. Let $\mathcal G$ be a finite class of functions, with cardinality $|\mathcal G| := N > 1$. Suppose that for some finite constant $K$, $\max_{g \in \mathcal G} \|g\|_\infty \le K$. Then for all
$$\delta \ge 2K\sqrt{\frac{\log N}{n}},$$
we have
$$\mathbf P(\|P_n^\sigma\|_{\mathcal G} > \delta) \le 2\exp\Big[-\frac{n\delta^2}{4K^2}\Big]$$
and
$$\mathbf P(\|P_n - P\|_{\mathcal G} > 4\delta) \le 8\exp\Big[-\frac{n\delta^2}{4K^2}\Big].$$

Proof. By Hoeffding's inequality, for each $g \in \mathcal G$,
$$\mathbf P(|P_n^\sigma g| > \delta) \le 2\exp\Big[-\frac{n\delta^2}{2K^2}\Big].$$
Use the elementary observation to conclude that
$$\mathbf P(\|P_n^\sigma\|_{\mathcal G} > \delta) \le 2N\exp\Big[-\frac{n\delta^2}{2K^2}\Big] = 2\exp\Big[\log N - \frac{n\delta^2}{2K^2}\Big] \le 2\exp\Big[-\frac{n\delta^2}{4K^2}\Big].$$

By Chebyshev's inequality, for each $g \in \mathcal G$,
$$\mathbf P(|P_n g - Pg| > \delta) \le \frac{\mathrm{var}(g(X))}{n\delta^2} \le \frac{K^2}{4K^2 \log N} \le \frac 1 2.$$
Hence, by symmetrization with probabilities,
$$\mathbf P(\|P_n - P\|_{\mathcal G} > 4\delta) \le 4\mathbf P(\|P_n^\sigma\|_{\mathcal G} > \delta) \le 8\exp\Big[-\frac{n\delta^2}{4K^2}\Big]. \qquad \square$$

Definition 4.2. The envelope $G$ of a collection of functions $\mathcal G$ is defined by
$$G(x) = \sup_{g \in \mathcal G} |g(x)|, \quad x \in \mathcal X.$$
In Exercise 2 of this chapter, the assumption $\sup_{g \in \mathcal G} \|g\|_\infty \le K$ used in Lemma 4.1 is weakened to $PG < \infty$.

Definition 4.3. Let $S$ be some subset of a metric space $(\Lambda, d)$. For $\delta > 0$, the $\delta$-covering number $N(\delta, S, d)$ of $S$ is the minimum number of balls with radius $\delta$ necessary to cover $S$, i.e., the smallest value of $N$ such that there exist $s_1, \ldots, s_N$ in $\Lambda$ with
$$\min_{j=1,\ldots,N} d(s, s_j) \le \delta, \quad \forall\, s \in S.$$
The set $\{s_1, \ldots, s_N\}$ is then called a $\delta$-covering of $S$. The logarithm $\log N(\cdot, S, d)$ of the covering number is called the entropy of $S$.

[Figure 3: a $\delta$-covering of a set by balls of radius $\delta$.]
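Covering numbers of finite point sets can be computed (up to the usual greedy slack) with a few lines of code. The following sketch is our own illustration, not part of the handout; greedily chosen centers form a maximal $\delta$-packing, hence also a $\delta$-covering, so their number lies between $N(\delta, S, d)$ and the packing number:

```python
import numpy as np

def greedy_covering(points, delta):
    """Greedy centers: a maximal delta-packing of the point set, hence also
    a delta-covering; len(centers) sandwiches N(delta, S, d) from above."""
    centers = []
    for p in points:
        if all(np.linalg.norm(p - c) > delta for c in centers):
            centers.append(p)
    return centers

rng = np.random.default_rng(5)                      # arbitrary seed
S = rng.uniform(size=(1000, 2))                     # 1000 points in the unit square
for delta in [0.5, 0.25, 0.125]:
    print(delta, len(greedy_covering(S, delta)))    # grows like (1/delta)^2 for a 2-d set
```

The growth rate in $1/\delta$ reflects the "dimension" of the set, which is exactly what the entropy conditions below quantify.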

Notation. Let
$$d_{1,n}(g, \tilde g) = P_n|g - \tilde g|.$$

Theorem 4.4. Suppose
$$\|g\|_\infty \le K, \quad \forall\, g \in \mathcal G.$$
Assume moreover that, for all $\delta > 0$,
$$\frac 1 n \log N(\delta, \mathcal G, d_{1,n}) \stackrel{\mathbf P}{\to} 0.$$
Then
$$\|P_n - P\|_{\mathcal G} \stackrel{\mathbf P}{\to} 0.$$

Proof. Let $\delta > 0$. Let $g_1, \ldots, g_N$, with $N = N(\delta, \mathcal G, d_{1,n})$, be a $\delta$-covering of $(\mathcal G, d_{1,n})$. When $P_n|g - g_j| \le \delta$, we have
$$|P_n^\sigma g| \le |P_n^\sigma g_j| + \delta.$$
So
$$\|P_n^\sigma\|_{\mathcal G} \le \max_{j=1,\ldots,N} |P_n^\sigma g_j| + \delta.$$
By Hoeffding's inequality and the elementary observation, for
$$\delta \ge 2K\sqrt{\frac{\log N}{n}},$$
we have
$$\mathbf P_{\mathbf X}\Big(\max_{j=1,\ldots,N} |P_n^\sigma g_j| > \delta\Big) \le 2\exp\Big[-\frac{n\delta^2}{4K^2}\Big].$$
Conclude that for
$$\delta \ge 2K\sqrt{\frac{\log N}{n}},$$
we have
$$\mathbf P_{\mathbf X}(\|P_n^\sigma\|_{\mathcal G} > 2\delta) \le 2\exp\Big[-\frac{n\delta^2}{4K^2}\Big].$$
But then
$$\mathbf P(\|P_n^\sigma\|_{\mathcal G} > 2\delta) \le 2\exp\Big[-\frac{n\delta^2}{4K^2}\Big] + \mathbf P\Big(2K\sqrt{\frac{\log N(\delta, \mathcal G, d_{1,n})}{n}} > \delta\Big).$$
We thus get, as $n \to \infty$,
$$\mathbf P(\|P_n^\sigma\|_{\mathcal G} > 2\delta) \to 0.$$
Now, use the symmetrization with probabilities to conclude
$$\mathbf P(\|P_n - P\|_{\mathcal G} > 8\delta) \to 0.$$
Since $\delta$ is arbitrary, this concludes the proof. $\square$

Again, the assumption $\sup_{g \in \mathcal G} \|g\|_\infty \le K$ used in Theorem 4.4 can be weakened to $PG < \infty$, $G$ being the envelope of $\mathcal G$. See Exercise 3 of this chapter.

4.2 Classes of sets

Let $\mathcal D$ be a collection of subsets of $\mathcal X$, and let $\{\xi_1, \ldots, \xi_n\}$ be $n$ points in $\mathcal X$.

Definition 4.2.1. We write
$$\Delta^{\mathcal D}(\xi_1, \ldots, \xi_n) = \mathrm{card}\big\{D \cap \{\xi_1, \ldots, \xi_n\} : D \in \mathcal D\big\}$$
= the number of subsets of $\{\xi_1, \ldots, \xi_n\}$ that $\mathcal D$ can distinguish. That is, count the number of sets in $\mathcal D$, where two sets $D_1$ and $D_2$ are considered as equal if
$$(D_1 \triangle D_2) \cap \{\xi_1, \ldots, \xi_n\} = \emptyset.$$
Here $D_1 \triangle D_2 = (D_1 \cap D_2^c) \cup (D_1^c \cap D_2)$ is the symmetric difference between $D_1$ and $D_2$.

Remark. For our purposes, we will not need to calculate $\Delta^{\mathcal D}(\xi_1, \ldots, \xi_n)$ exactly, but only a good enough upper bound.

Example. Let $\mathcal X = \mathbb R$ and
$$\mathcal D = \{(-\infty, t] : t \in \mathbb R\}.$$
Then for all $\{\xi_1, \ldots, \xi_n\} \subset \mathbb R$,
$$\Delta^{\mathcal D}(\xi_1, \ldots, \xi_n) \le n + 1.$$

Example. Let $\mathcal D$ be the collection of all finite subsets of $\mathcal X$. Then, if the points $\xi_1, \ldots, \xi_n$ are distinct,
$$\Delta^{\mathcal D}(\xi_1, \ldots, \xi_n) = 2^n.$$

Theorem 4.2.2 (Vapnik and Chervonenkis, 1971). We have
$$\frac 1 n \log \Delta^{\mathcal D}(X_1, \ldots, X_n) \stackrel{\mathbf P}{\to} 0$$
if and only if
$$\sup_{D \in \mathcal D} |P_n(D) - P(D)| \stackrel{\mathbf P}{\to} 0.$$

Proof of the if-part. This follows from applying Theorem 4.4 to $\mathcal G = \{\mathbf 1_D : D \in \mathcal D\}$. Note that a class of indicator functions is uniformly bounded by 1, i.e., we can take $K = 1$ in Theorem 4.4. Define now
$$d_{\infty,n}(g, \tilde g) = \max_{i=1,\ldots,n} |g(X_i) - \tilde g(X_i)|.$$
Then $d_{1,n} \le d_{\infty,n}$, so also
$$N(\cdot, \mathcal G, d_{1,n}) \le N(\cdot, \mathcal G, d_{\infty,n}).$$
But for $0 < \delta < 1$,
$$N(\delta, \{\mathbf 1_D : D \in \mathcal D\}, d_{\infty,n}) = \Delta^{\mathcal D}(X_1, \ldots, X_n).$$
So indeed, if $\frac 1 n \log \Delta^{\mathcal D}(X_1, \ldots, X_n) \stackrel{\mathbf P}{\to} 0$, then also
$$\frac 1 n \log N(\delta, \{\mathbf 1_D : D \in \mathcal D\}, d_{1,n}) \le \frac 1 n \log \Delta^{\mathcal D}(X_1, \ldots, X_n) \stackrel{\mathbf P}{\to} 0. \qquad \square$$

4.3 Vapnik-Chervonenkis classes

Definition 4.3.1. Let
$$m^{\mathcal D}(n) = \sup\{\Delta^{\mathcal D}(\xi_1, \ldots, \xi_n) : \xi_1, \ldots, \xi_n \in \mathcal X\}.$$
We say that $\mathcal D$ is a Vapnik-Chervonenkis (VC) class if for certain constants $c$ and $V$, and for all $n$,
$$m^{\mathcal D}(n) \le cn^V,$$
i.e., if $m^{\mathcal D}(n)$ does not grow faster than a polynomial in $n$.

Important conclusion: for sets, VC $\Rightarrow$ GC.

Examples.
(a) $\mathcal X = \mathbb R$, $\mathcal D = \{(-\infty, t] : t \in \mathbb R\}$. Since $m^{\mathcal D}(n) \le n + 1$, $\mathcal D$ is VC.
(b) $\mathcal X = \mathbb R^d$, $\mathcal D = \{(-\infty, t] : t \in \mathbb R^d\}$. Since $m^{\mathcal D}(n) \le (n+1)^d$, $\mathcal D$ is VC.
(c) $\mathcal X = \mathbb R^d$, $\mathcal D = \{\{x : \theta^T x > t\} : (\theta, t) \in \mathbb R^{d+1}\}$. Since $m^{\mathcal D}(n) \le 2^d n^d$, $\mathcal D$ is VC.

The VC property is closed under measure-theoretic operations:

Lemma 4.3.2. Let $\mathcal D$, $\mathcal D_1$ and $\mathcal D_2$ be VC. Then the following classes are also VC:
(i) $\mathcal D^c = \{D^c : D \in \mathcal D\}$,
(ii) $\mathcal D_1 \sqcap \mathcal D_2 = \{D_1 \cap D_2 : D_1 \in \mathcal D_1,\ D_2 \in \mathcal D_2\}$,
(iii) $\mathcal D_1 \sqcup \mathcal D_2 = \{D_1 \cup D_2 : D_1 \in \mathcal D_1,\ D_2 \in \mathcal D_2\}$.

Proof. Exercise.

Examples:
- the class of intersections of two halfspaces,
- all ellipsoids,
- all half-ellipsoids,
- in $\mathbb R$, the class $\big\{\{x : \theta_1 x + \cdots + \theta_r x^r \le t\} : (\theta, t) \in \mathbb R^{r+1}\big\}$.
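The trace counts of Definition 4.2.1 can be computed by brute force for small point sets. The sketch below (our own illustration; the point set, seed, and the two example classes are arbitrary choices) checks the polynomial bound for half-intervals against the exponential count of all subsets:

```python
import numpy as np

def trace_count(points, sets):
    """Delta^D(xi_1,...,xi_n): number of distinct subsets picked out by the sets."""
    traces = {tuple(bool(D(p)) for p in points) for D in sets}
    return len(traces)

rng = np.random.default_rng(6)                      # arbitrary seed
xi = rng.uniform(size=8)

# Half-intervals (-inf, t]: thresholds between consecutive order statistics suffice.
ts = np.concatenate(([xi.min() - 1], np.sort(xi)))
half_intervals = [lambda x, t=t: x <= t for t in ts]
print(trace_count(xi, half_intervals), "<=", len(xi) + 1)   # at most n + 1 traces

# Intervals (s, t]: O(n^2) traces, still polynomial in n, hence VC.
pts = np.sort(xi)
intervals = [lambda x, s=s, t=t: (x > s) & (x <= t)
             for s in np.concatenate(([pts[0] - 1], pts)) for t in pts]
print(trace_count(xi, intervals))
```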

There are classes that are GC, but not VC.

Example. Let $\mathcal X = [0,1]^2$, and let $\mathcal D$ be the collection of all convex subsets of $\mathcal X$. Then $\mathcal D$ is not VC, but when $P$ is uniform, $\mathcal D$ is GC.

Definition 4.3.3. The VC dimension of $\mathcal D$ is
$$V_{\mathcal D} = \inf\{n : m^{\mathcal D}(n) < 2^n\}.$$
The following lemma is nice to know, but to avoid digressions, we will not provide a proof.

Lemma 4.3.4. We have that $\mathcal D$ is VC if and only if $V_{\mathcal D} < \infty$. In fact, we have for $V = V_{\mathcal D}$,
$$m^{\mathcal D}(n) \le \sum_{k=0}^V \binom n k.$$

4.4 VC graph classes of functions

Definition 4.4.1. The subgraph of a function $g : \mathcal X \to \mathbb R$ is
$$\mathrm{subgraph}(g) = \{(x, t) \in \mathcal X \times \mathbb R : g(x) \ge t\}.$$
A collection of functions $\mathcal G$ is called a VC graph class if the subgraphs $\{\mathrm{subgraph}(g) : g \in \mathcal G\}$ form a VC class.

Example. $\mathcal G = \{\mathbf 1_D : D \in \mathcal D\}$ is a VC graph class if $\mathcal D$ is VC.

Examples ($\mathcal X = \mathbb R^d$):
(a) $\mathcal G = \{g(x) = \theta_0 + \theta_1 x_1 + \cdots + \theta_d x_d : \theta \in \mathbb R^{d+1}\}$,
(b) $\mathcal G = \{g(x) = |\theta_0 + \theta_1 x_1 + \cdots + \theta_d x_d| : \theta \in \mathbb R^{d+1}\}$,
(c) for $d = 1$, $\mathcal G = \Big\{g(x) = \begin{cases} a + bx & \text{if } x \le c \\ d + ex & \text{if } x > c \end{cases} : (a, b, c, d, e) \in \mathbb R^5\Big\}$,
(d) for $d = 1$, $\mathcal G = \{g(x) = e^{\theta x} : \theta \in \mathbb R\}$.

Definition (packing numbers). Let $S$ be some subset of a metric space $(\Lambda, d)$. For $\delta > 0$, the $\delta$-packing number $D(\delta, S, d)$ of $S$ is the largest value of $N$ such that there exist $s_1, \ldots, s_N$ in $S$ with
$$d(s_k, s_j) > \delta, \quad k \ne j.$$

Note. For all $\delta > 0$,
$$N(\delta, S, d) \le D(\delta, S, d).$$

Theorem 4.4.2. Let $Q$ be any probability measure on $\mathcal X$. Define
$$d_{1,Q}(g, \tilde g) = Q|g - \tilde g|.$$
For a VC graph class $\mathcal G$ with VC dimension $V$, we have for a constant $A$ depending only on $V$,
$$N(\delta\, QG, \mathcal G, d_{1,Q}) \le \max\big(A\delta^{-2V}, e^{\delta/4}\big), \quad \delta > 0.$$

Proof. Without loss of generality, assume $QG = 1$. Choose $S \in \mathcal X$ with distribution $dQ_S = G\,dQ$. Given $S = s$, choose $T$ uniformly in the interval $[-G(s), G(s)]$. Let $g_1, \ldots, g_N$ be a maximal set in $\mathcal G$, such that $Q|g_j - g_k| > \delta$ for $j \ne k$. Consider a pair $j \ne k$. Given $S = s$, the probability that $T$ falls in between the two graphs of $g_j$ and $g_k$ is
$$\frac{|g_j(s) - g_k(s)|}{2G(s)}.$$
So the unconditional probability that $T$ falls in between the two graphs of $g_j$ and $g_k$ is
$$\int \frac{|g_j(s) - g_k(s)|}{2G(s)}\,dQ_S(s) = \frac{Q|g_j - g_k|}{2} \ge \frac\delta 2.$$

Now, choose $n$ independent copies $\{(S_i, T_i)\}_{i=1}^n$ of $(S, T)$. The probability that none of these falls in between the graphs of $g_j$ and $g_k$ is then at most
$$(1 - \delta/2)^n.$$
The probability that for some $j \ne k$, none of these falls in between the graphs of $g_j$ and $g_k$ is then at most
$$N^2(1 - \delta/2)^n \le \exp\Big[2\log N - \frac{n\delta}2\Big] < 1,$$
when we choose $n$ the smallest integer such that
$$n \ge \frac{4\log N}\delta.$$
So for such a value of $n$, with positive probability, for any $j \ne k$, some of the $(S_i, T_i)$ fall in between the graphs of $g_j$ and $g_k$. Therefore, since the subgraphs form a VC class, we must have
$$N \le cn^V \le c\Big(\frac{4\log N}\delta + 1\Big)^V.$$
But then, for $N \ge \exp[\delta/4]$ (so that $4\log N/\delta \ge 1$),
$$N \le c\Big(\frac{8\log N}\delta\Big)^V.$$
Since $\log N \le 2VN^{1/(2V)}$, this gives
$$N \le c\Big(\frac{16V}\delta\Big)^V N^{1/2},$$
so
$$N \le c^2\Big(\frac{16V}\delta\Big)^{2V} = A\delta^{-2V}, \quad A = c^2(16V)^{2V}. \qquad \square$$

Corollary 4.4.3. Suppose $\mathcal G$ is VC graph and that $\int G\,dP < \infty$. Then, by Theorem 4.4.2 and Theorem 4.4, we have
$$\|P_n - P\|_{\mathcal G} \stackrel{\mathbf P}{\to} 0.$$

4.5 Exercises

Exercise 1. Let $\mathcal G$ be a finite class of functions, with cardinality $|\mathcal G| := N > 1$. Suppose that for some finite constant $K$, $\max_{g \in \mathcal G} \|g\|_\infty \le K$. Use Bernstein's inequality to show that for
$$\delta^2 \ge \frac{4\log N}n\big[\delta K + K^2\big],$$
one has
$$\mathbf P(\|P_n - P\|_{\mathcal G} > \delta) \le 2\exp\Big[-\frac{n\delta^2}{4(\delta K + K^2)}\Big].$$

Exercise 2. Let $\mathcal G$ be a finite class of functions, with cardinality $|\mathcal G| := N > 1$. Suppose that $\mathcal G$ has envelope $G$ satisfying $PG < \infty$. Let $0 < \delta < 1$, and take $K$ large enough, so that
$$P\,G\,\mathbf 1\{G > K\} \le \frac{\delta^2}2.$$
Show that for $\delta \ge 4K\sqrt{\frac{\log N}n}$,

$$\mathbf P(\|P_n - P\|_{\mathcal G} > 4\delta) \le 8\exp\Big[-\frac{n\delta^2}{16K^2}\Big] + \delta.$$
Hint: use
$$|P_n g - Pg| \le \big|P_n g\,\mathbf 1\{G \le K\} - Pg\,\mathbf 1\{G \le K\}\big| + P_n G\,\mathbf 1\{G > K\} + PG\,\mathbf 1\{G > K\}.$$

Exercise 3. Let $\mathcal G$ be a class of functions, with envelope $G$, satisfying $PG < \infty$ and, for all $\delta > 0$,
$$\frac 1 n \log N(\delta, \mathcal G, d_{1,n}) \stackrel{\mathbf P}{\to} 0.$$
Show that $\|P_n - P\|_{\mathcal G} \stackrel{\mathbf P}{\to} 0$.

Exercise 4. Are the following classes of sets (functions) VC? Why (not)?
1. The class of all rectangles in $\mathbb R^d$.
2. The class of all monotone functions on $\mathbb R$.
3. The class of functions on $[0,1]$ given by $\mathcal G = \{g(x) = ae^{bx} + ce^{dx} : (a, b, c, d) \in [0,1]^4\}$.
4. The class of all sections in $\mathbb R^2$ (a section is of the form $\{(x_1, x_2) : x_1 = a_1 + r\sin t,\ x_2 = a_2 + r\cos t,\ \theta_1 \le t \le \theta_2\}$, for some $(a_1, a_2) \in \mathbb R^2$, some $r > 0$, and some $0 \le \theta_1 \le \theta_2 \le 2\pi$).
5. The class of all star-shaped sets in $\mathbb R^2$ (a set $D$ is star-shaped if for some $a \in D$ and all $b \in D$, also all points on the line segment joining $a$ and $b$ are in $D$).

Exercise 5. Let $\mathcal G$ be the class of all functions $g$ on $[0,1]$ with derivative $\dot g$ satisfying $\|\dot g\|_\infty \le 1$. Check that $\mathcal G$ is not VC. Show that $\mathcal G$ is GC, by using partial integration and the Glivenko-Cantelli Theorem for the empirical distribution function.

5 M-estimators

5.1 What is an M-estimator?

Let $X_1, \ldots, X_n, \ldots$ be i.i.d. copies of a random variable $X$ with values in $\mathcal X$ and with distribution $P$. Let $\Theta$ be a parameter space (a subset of some metric space), and let, for $\theta \in \Theta$,
$$\gamma_\theta : \mathcal X \to \mathbb R$$
be some loss function. We assume $P|\gamma_\theta| < \infty$ for all $\theta \in \Theta$. We estimate the unknown parameter
$$\theta_0 := \arg\min_{\theta \in \Theta} P\gamma_\theta$$
by the M-estimator
$$\hat\theta_n := \arg\min_{\theta \in \Theta} P_n\gamma_\theta.$$
We assume that $\theta_0$ exists and is unique, and that $\hat\theta_n$ exists.

Examples.
(i) Location estimators: $\mathcal X = \mathbb R$, $\Theta = \mathbb R$, and
(i.a) $\gamma_\theta(x) = (x - \theta)^2$ (estimating the mean),
(i.b) $\gamma_\theta(x) = |x - \theta|$ (estimating the median).
(ii) Maximum likelihood: $\{p_\theta : \theta \in \Theta\}$ a family of densities w.r.t. a $\sigma$-finite dominating measure $\mu$, and $\gamma_\theta = -\log p_\theta$. If $dP/d\mu = p_{\theta_0}$, $\theta_0 \in \Theta$, then indeed $\theta_0$ is a minimizer of $P\gamma_\theta$, $\theta \in \Theta$.
(ii.a) Poisson distribution: $p_\theta(x) = e^{-\theta}\frac{\theta^x}{x!}$, $\theta > 0$, $x \in \{0, 1, 2, \ldots\}$.
(ii.b) Logistic distribution:
$$p_\theta(x) = \frac{e^{\theta - x}}{(1 + e^{\theta - x})^2}, \quad \theta \in \mathbb R,\ x \in \mathbb R.$$

5.2 Consistency

Define for $\theta \in \Theta$,
$$R(\theta) = P\gamma_\theta, \quad R_n(\theta) = P_n\gamma_\theta.$$
We first present an easy proposition, with a too stringent condition.

Proposition 5.2.1. Suppose that $\theta \mapsto R(\theta)$ is continuous. Assume moreover that
$$\sup_{\theta \in \Theta} |R_n(\theta) - R(\theta)| \stackrel{\mathbf P}{\to} 0,$$
i.e., that $\{\gamma_\theta : \theta \in \Theta\}$ is a GC class. Then
$$\hat\theta_n \stackrel{\mathbf P}{\to} \theta_0.$$

Proof. We have
$$0 \le R(\hat\theta_n) - R(\theta_0) = \big[R(\hat\theta_n) - R(\theta_0)\big] - \big[R_n(\hat\theta_n) - R_n(\theta_0)\big] + \big[R_n(\hat\theta_n) - R_n(\theta_0)\big]$$
$$\le \big[R(\hat\theta_n) - R(\theta_0)\big] - \big[R_n(\hat\theta_n) - R_n(\theta_0)\big] \stackrel{\mathbf P}{\to} 0.$$
So $R(\hat\theta_n) \stackrel{\mathbf P}{\to} R(\theta_0)$, and hence $\hat\theta_n \stackrel{\mathbf P}{\to} \theta_0$. $\square$

The assumption is hardly ever met, because it is close to requiring compactness of $\Theta$.
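The location examples (i.a) and (i.b) can be computed by direct numerical minimization of $R_n$. A sketch (our own; SciPy is assumed, and the sample and seed are arbitrary):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(7)                      # arbitrary seed
X = rng.standard_normal(100) + 3.0                  # location 3

def m_estimate(X, gamma):
    """Minimize the empirical risk R_n(theta) = P_n gamma_theta over theta."""
    return minimize_scalar(lambda th: np.mean(gamma(X - th))).x

print(m_estimate(X, lambda u: u**2), X.mean())      # (i.a): recovers the sample mean
print(m_estimate(X, np.abs), np.median(X))          # (i.b): recovers the sample median
```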

Lemma 5.2.2. Suppose that $(\Theta, d)$ is compact and that $\theta \mapsto \gamma_\theta$ is continuous. Moreover, assume that $PG < \infty$, where
$$G = \sup_{\theta \in \Theta} |\gamma_\theta|.$$
Then
$$\sup_{\theta \in \Theta} |R_n(\theta) - R(\theta)| \stackrel{\mathbf P}{\to} 0.$$

Proof. Let
$$w(\theta, \rho) = \sup_{\{\tilde\theta :\, d(\tilde\theta, \theta) < \rho\}} |\gamma_{\tilde\theta} - \gamma_\theta|.$$
By dominated convergence,
$$Pw(\theta, \rho) \to 0, \quad \rho \downarrow 0.$$
Let $\delta > 0$ be arbitrary. Take $\rho_\theta$ in such a way that
$$Pw(\theta, \rho_\theta) \le \delta.$$
Let $B_\theta = \{\tilde\theta : d(\tilde\theta, \theta) < \rho_\theta\}$, and let $B_{\theta_1}, \ldots, B_{\theta_N}$ be a finite cover of $\Theta$. Then, for $\mathcal G = \{\gamma_\theta : \theta \in \Theta\}$, the functions $\gamma_{\theta_1}, \ldots, \gamma_{\theta_N}$ eventually form a $2\delta$-covering for $d_{1,n}$ (since $P_n w(\theta_j, \rho_{\theta_j}) \stackrel{\mathbf P}{\to} Pw(\theta_j, \rho_{\theta_j}) \le \delta$), i.e.,
$$\mathbf P\big(N(2\delta, \mathcal G, d_{1,n}) > N\big) \to 0.$$
So the result follows from Theorem 4.4. $\square$

We give a lemma which replaces compactness by a convexity assumption.

Lemma 5.2.3. Suppose that $\Theta$ is a convex subset of $\mathbb R^r$, and that $\theta \mapsto \gamma_\theta$, $\theta \in \Theta$, is continuous and convex. Suppose $PG_\epsilon < \infty$ for some $\epsilon > 0$, where
$$G_\epsilon = \sup_{\|\theta - \theta_0\| \le \epsilon} |\gamma_\theta|.$$
Then $\hat\theta_n \stackrel{\mathbf P}{\to} \theta_0$.

Proof. Because $\Theta$ is finite-dimensional, the set $\{\|\theta - \theta_0\| \le \epsilon\}$ is compact. So, by Lemma 5.2.2,
$$\sup_{\|\theta - \theta_0\| \le \epsilon} |R_n(\theta) - R(\theta)| \stackrel{\mathbf P}{\to} 0.$$
Define
$$\alpha = \frac\epsilon{\epsilon + \|\hat\theta_n - \theta_0\|}$$
and
$$\tilde\theta_n = \alpha\hat\theta_n + (1 - \alpha)\theta_0.$$
Then $\|\tilde\theta_n - \theta_0\| \le \epsilon$. Moreover,
$$R_n(\tilde\theta_n) \le \alpha R_n(\hat\theta_n) + (1 - \alpha)R_n(\theta_0) \le R_n(\theta_0).$$
It follows from the arguments used in the proof of Proposition 5.2.1 that $\|\tilde\theta_n - \theta_0\| \stackrel{\mathbf P}{\to} 0$. But then also
$$\|\hat\theta_n - \theta_0\| = \frac{\epsilon\,\|\tilde\theta_n - \theta_0\|}{\epsilon - \|\tilde\theta_n - \theta_0\|} \stackrel{\mathbf P}{\to} 0. \qquad \square$$

5.3 Exercises

Exercise 1. Let $Y \in \{0, 1\}$ be a binary response variable and $Z \in \mathbb R$ a covariable. Assume the logistic regression model
$$\mathbf P_{\theta_0}(Y = 1 \mid Z = z) = \frac 1{1 + \exp[\alpha_0 + \beta_0 z]},$$
where $\theta_0 = (\alpha_0, \beta_0) \in \mathbb R^2$ is an unknown parameter. Let $\{(Y_i, Z_i)\}_{i=1}^n$ be i.i.d. copies of $(Y, Z)$. Show consistency of the MLE of $\theta_0$.

Exercise 2. Suppose $X_1, \ldots, X_n$ are i.i.d. real-valued random variables with density $f_0 = dP/d\mu$ on $[0,1]$. Here, $\mu$ is Lebesgue measure on $[0,1]$. Suppose it is given that $f_0 \in \mathcal F$, with $\mathcal F$ the set of all decreasing densities bounded from above by 2 and from below by $1/2$. Let $\hat f_n$ be the MLE. Can you show consistency of $\hat f_n$? For what metric?

6 Uniform central limit theorems

After having studied uniform laws of large numbers, a natural question is: can we also prove uniform central limit theorems? It turns out that precisely defining what a uniform central limit theorem is, is quite involved, and actually beyond our scope. In Sections 6.1-6.4 we will therefore only briefly indicate the results, and not present any proofs. These sections only reveal a glimpse of the topic of weak convergence on abstract spaces. The thing to remember from them is the concept of asymptotic continuity, because we will use that concept in our statistical applications. In Section 6.5 we will prove that the empirical process indexed by a VC graph class is asymptotically continuous. This result will be a corollary of another result of interest to theoretical statisticians: a result relating the increments of the empirical process to the entropy of $\mathcal G$.

6.1 Real-valued random variables

Let $\mathcal X = \mathbb R$.

Central limit theorem in $\mathbb R$. Suppose $\mu = EX$ and $\sigma^2 = \mathrm{var}(X)$ exist. Then
$$\mathbf P\Big(\frac{\sqrt n(\bar X_n - \mu)}\sigma \le z\Big) \to \Phi(z), \quad \text{for all } z,$$
where $\Phi$ is the standard normal distribution function.

Notation:
$$\frac{\sqrt n(\bar X_n - \mu)}\sigma \stackrel{\mathcal L}{\to} N(0, 1), \quad \text{or} \quad \sqrt n(\bar X_n - \mu) \stackrel{\mathcal L}{\to} N(0, \sigma^2).$$

6.2 $\mathbb R^d$-valued random variables

Let $X_1, X_2, \ldots$ be i.i.d. $\mathbb R^d$-valued random variables (copies of $X$), $X \in \mathcal X = \mathbb R^d$, with expectation $\mu = EX$ and covariance matrix $\Sigma = EXX^T - \mu\mu^T$.

Central limit theorem in $\mathbb R^d$. We have
$$\sqrt n(\bar X_n - \mu) \stackrel{\mathcal L}{\to} N(0, \Sigma),$$
i.e.,
$$\sqrt n\,a^T(\bar X_n - \mu) \stackrel{\mathcal L}{\to} N(0, a^T\Sigma a), \quad \text{for all } a \in \mathbb R^d.$$

6.3 Donsker's Theorem

Let $\mathcal X = \mathbb R$. Recall the definitions of the distribution function $F$ and the empirical distribution function $F_n$:
$$F(t) = P(X \le t), \quad F_n(t) = \frac 1 n \#\{X_i \le t,\ 1 \le i \le n\}, \quad t \in \mathbb R.$$
Define
$$W_n(t) := \sqrt n\big(F_n(t) - F(t)\big), \quad t \in \mathbb R.$$
By the central limit theorem in $\mathbb R$ (Section 6.1), for all $t$,
$$W_n(t) \stackrel{\mathcal L}{\to} N\big(0, F(t)(1 - F(t))\big).$$
Also, by the central limit theorem in $\mathbb R^2$ (Section 6.2), for all $s < t$,
$$\begin{pmatrix} W_n(s) \\ W_n(t) \end{pmatrix} \stackrel{\mathcal L}{\to} N(0, \Sigma(s, t)),$$
where
$$\Sigma(s, t) = \begin{pmatrix} F(s)(1 - F(s)) & F(s)(1 - F(t)) \\ F(s)(1 - F(t)) & F(t)(1 - F(t)) \end{pmatrix}.$$

We are now going to consider the stochastic process $W_n = \{W_n(t) : t \in \mathbb R\}$. The process $W_n$ is called the classical empirical process.

Definition 6.3.1. Let $K_0$ be the collection of bounded functions on $[0,1]$. The stochastic process $B \in K_0$ is called the standard Brownian bridge if
- $B(0) = B(1) = 0$,
- for all $r$ and all $t_1, \ldots, t_r \in (0, 1)$, the vector $(B(t_1), \ldots, B(t_r))$ is multivariate normal with mean zero,
- for all $s \le t$, $\mathrm{cov}(B(s), B(t)) = s(1 - t)$,
- the sample paths of $B$ are a.s. continuous.

We now consider the process $W_F$ defined as
$$W_F(t) = B(F(t)), \quad t \in \mathbb R.$$
Thus, $W_F = B \circ F$.

Donsker's theorem. Consider $W_n$ and $W_F$ as elements of the space $K$ of bounded functions on $\mathbb R$. We have
$$W_n \stackrel{\mathcal L}{\to} W_F,$$
that is, for all continuous and bounded functions $f$,
$$Ef(W_n) \to Ef(W_F).$$

Reflection. Suppose $F$ is continuous. Then, since $B$ is almost surely continuous, also $W_F = B \circ F$ is almost surely continuous. So $W_n$ must be approximately continuous as well, in some sense. Indeed, we have for any $t$ and any sequence $t_n$ converging to $t$,
$$W_n(t_n) - W_n(t) \stackrel{\mathbf P}{\to} 0.$$
This is called asymptotic continuity.
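Donsker's theorem can be glimpsed by simulation. The sketch below (our own illustration; uniform observations so that $F(t) = t$, a finite grid for the Brownian bridge, and an arbitrary seed) compares the distribution of $\sup_t |W_n(t)|$ with that of $\sup_t |B(t)|$:

```python
import numpy as np

rng = np.random.default_rng(8)                      # arbitrary seed
n, reps = 500, 2000

# sup_t |W_n(t)| for uniform samples (F(t) = t, so W_F = B).
sup_Wn = np.empty(reps)
for r in range(reps):
    U = np.sort(rng.uniform(size=n))
    i = np.arange(1, n + 1)
    # The sup over t is attained at the order statistics (as for D_n in Chapter 1).
    sup_Wn[r] = np.sqrt(n) * max(np.max(i / n - U), np.max(U - (i - 1) / n))

# sup_t |B(t)| for a Brownian bridge on a grid: B(t) = W(t) - t W(1).
grid = 1000
incr = rng.standard_normal((reps, grid)) / np.sqrt(grid)
W = np.cumsum(incr, axis=1)
B = W - np.linspace(1 / grid, 1.0, grid) * W[:, -1:]
print(np.quantile(sup_Wn, 0.95), np.quantile(np.abs(B).max(axis=1), 0.95))  # both ~ 1.36
```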

6.4 Donsker classes

Let $X_1, \ldots, X_n, \ldots$ be i.i.d. copies of a random variable $X$, with values in the space $\mathcal X$, and with distribution $P$. Consider a class $\mathcal G$ of functions $g : \mathcal X \to \mathbb R$. The theoretical mean of a function $g$ is $Pg := Eg(X)$, and the empirical average based on the $n$ observations $X_1, \ldots, X_n$ is
$$P_n g := \frac 1 n \sum_{i=1}^n g(X_i).$$
Here $P_n$ is the empirical distribution based on $X_1, \ldots, X_n$.

Definition 6.4.1. The empirical process indexed by $\mathcal G$ is
$$\nu_n(g) := \sqrt n(P_n g - Pg), \quad g \in \mathcal G.$$
Let us recall the central limit theorem for $g$ fixed. Denote the variance of $g(X)$ by
$$\sigma^2(g) := \mathrm{var}(g(X)) = Pg^2 - (Pg)^2.$$
If $\sigma^2(g) < \infty$, we have
$$\nu_n(g) \stackrel{\mathcal L}{\to} N(0, \sigma^2(g)).$$
The central limit theorem also holds for finitely many $g$ simultaneously. Let $g_k$ and $g_l$ be two functions, and denote the covariance between $g_k(X)$ and $g_l(X)$ by
$$\sigma(g_k, g_l) := \mathrm{cov}(g_k(X), g_l(X)) = Pg_kg_l - Pg_kPg_l.$$
Then, whenever $\sigma^2(g_k) < \infty$ for $k = 1, \ldots, r$,
$$\begin{pmatrix} \nu_n(g_1) \\ \vdots \\ \nu_n(g_r) \end{pmatrix} \stackrel{\mathcal L}{\to} N(0, \Sigma_{g_1,\ldots,g_r}), \tag{*}$$
where $\Sigma_{g_1,\ldots,g_r}$ is the variance-covariance matrix
$$\Sigma_{g_1,\ldots,g_r} = \begin{pmatrix} \sigma^2(g_1) & \cdots & \sigma(g_1, g_r) \\ \vdots & & \vdots \\ \sigma(g_1, g_r) & \cdots & \sigma^2(g_r) \end{pmatrix}.$$

Definition 6.4.2. Let $\nu$ be a Gaussian process indexed by $\mathcal G$. Assume that for each $r \in \mathbb N$ and for each finite collection $\{g_1, \ldots, g_r\} \subset \mathcal G$, the $r$-dimensional vector $(\nu(g_1), \ldots, \nu(g_r))$ has a $N(0, \Sigma_{g_1,\ldots,g_r})$-distribution, with $\Sigma_{g_1,\ldots,g_r}$ defined in (*). We then call $\nu$ the $P$-Brownian bridge indexed by $\mathcal G$.

Definition 6.4.3. Consider $\nu_n$ and $\nu$ as bounded functions on $\mathcal G$. We call $\mathcal G$ a $P$-Donsker class if $\nu_n \stackrel{\mathcal L}{\to} \nu$, that is, if for all continuous and bounded functions $f$, we have
$$Ef(\nu_n) \to Ef(\nu).$$

Definition 6.4.4. The process $\nu_n$ on $\mathcal G$ is called asymptotically continuous if for all $g_0 \in \mathcal G$, and all (possibly random) sequences $\{g_n\} \subset \mathcal G$ with $\sigma(g_n - g_0) \stackrel{\mathbf P}{\to} 0$, we have
$$\nu_n(g_n) - \nu_n(g_0) \stackrel{\mathbf P}{\to} 0.$$
We will use the notation
$$\|g\|_{2,P}^2 := Pg^2,$$
i.e., $\|\cdot\|_{2,P}$ is the $L_2(P)$-norm.

Remark. Note that $\sigma(g) \le \|g\|_{2,P}$.

Definition 6.4.5. The class $\mathcal G$ is called totally bounded for the metric
$$d_{2,P}(g, \tilde g) := \|g - \tilde g\|_{2,P}$$
induced by $\|\cdot\|_{2,P}$, if its entropy $\log N(\cdot, \mathcal G, d_{2,P})$ is finite.

Theorem 6.4.6. Suppose that $\mathcal G$ is totally bounded. Then $\mathcal G$ is a $P$-Donsker class if and only if $\nu_n$ (as process on $\mathcal G$) is asymptotically continuous.

6.5 Chaining and the increments of the empirical process

6.5.1 Chaining

We will consider the increments of the symmetrized empirical process in Section 6.5.2. There, we will work conditionally on $\mathbf X = (X_1, \ldots, X_n)$. We now describe the chaining technique in this context. Let
$$\|g\|_{2,n}^2 := P_n g^2.$$
Let $d_{2,n}(g, \tilde g) = \|g - \tilde g\|_{2,n}$, i.e., $d_{2,n}$ is the metric induced by $\|\cdot\|_{2,n}$. Suppose that $\|g\|_{2,n} \le R$ for all $g \in \mathcal G$. For notational convenience, we index the functions in $\mathcal G$ by a parameter $\theta \in \Theta$:
$$\mathcal G = \{g_\theta : \theta \in \Theta\}.$$

Let, for $s = 0, 1, 2, \ldots$, $\{g_j^s\}_{j=1}^{N_s}$ be a minimal $2^{-s}R$-covering set of $(\mathcal G, d_{2,n})$. So $N_s = N(2^{-s}R, \mathcal G, d_{2,n})$, and for each $\theta$ there exists a $g_\theta^s \in \{g_1^s, \ldots, g_{N_s}^s\}$ such that
$$\|g_\theta - g_\theta^s\|_{2,n} \le 2^{-s}R.$$
We use the parameter $\theta$ here to indicate which function in the covering set approximates a particular $g$. We may choose $g_\theta^0 \equiv 0$, since $\|g_\theta\|_{2,n} \le R$. Then for any $S$,
$$g_\theta = \sum_{s=1}^S (g_\theta^s - g_\theta^{s-1}) + (g_\theta - g_\theta^S).$$
One can think of this as telescoping from $g_\theta$ to $g_\theta^S$, i.e., we follow a path taking smaller and smaller steps. As $S \to \infty$, we have
$$\max_{1 \le i \le n} |g_\theta(X_i) - g_\theta^S(X_i)| \to 0.$$
The term $\sum_{s=1}^\infty (g_\theta^s - g_\theta^{s-1})$ can be handled by exploiting the fact that, as $\theta$ varies, each summand involves only finitely many functions.

6.5.2 Increments of the symmetrized process

We use the notation
$$a \vee b = \max\{a, b\}, \quad a \wedge b = \min\{a, b\}.$$

Lemma 6.5.2. On the set $\sup_{g \in \mathcal G} \|g\|_{2,n} \le R$, and for
$$\sqrt n\,\delta \ge 14\sum_{s=1}^\infty 2^{-s}R\sqrt{\log N(2^{-s}R, \mathcal G, d_{2,n})}\ \vee\ \sqrt{70\log 2}\,R,$$
we have
$$\mathbf P_{\mathbf X}\Big(\sup_{g \in \mathcal G} |P_n^\sigma g| \ge \delta\Big) \le 4\exp\Big[-\frac{n\delta^2}{70R^2}\Big].$$

Proof. Let $\{g_j^s\}_{j=1}^{N_s}$ be a minimal $2^{-s}R$-covering set of $\mathcal G$, $s = 0, 1, \ldots$. So $N_s = N(2^{-s}R, \mathcal G, d_{2,n})$. Now, use chaining: write
$$g_\theta = \sum_{s=1}^\infty (g_\theta^s - g_\theta^{s-1}).$$
Note that, by the triangle inequality,
$$\|g_\theta^s - g_\theta^{s-1}\|_{2,n} \le \|g_\theta^s - g_\theta\|_{2,n} + \|g_\theta - g_\theta^{s-1}\|_{2,n} \le 2^{-s}R + 2^{-s+1}R = 3 \cdot 2^{-s}R.$$
Let $\{\eta_s\}$ be positive numbers satisfying $\sum_{s=1}^\infty \eta_s \le 1$. Then, by Hoeffding's inequality (applied conditionally on $\mathbf X$) and a union bound over the at most $N_s^2$ pairs $(g_\theta^s, g_\theta^{s-1})$,
$$\mathbf P_{\mathbf X}\Big(\sup_{\theta \in \Theta} |P_n^\sigma g_\theta| \ge \delta\Big) \le \sum_{s=1}^\infty \mathbf P_{\mathbf X}\Big(\sup_{\theta \in \Theta} |P_n^\sigma(g_\theta^s - g_\theta^{s-1})| \ge \delta\eta_s\Big) \le \sum_{s=1}^\infty 2\exp\Big[2\log N_s - \frac{n\delta^2\eta_s^2}{18 \cdot 2^{-2s}R^2}\Big].$$
What is a good choice for $\eta_s$? We take
$$\eta_s = \frac{7 \cdot 2^{-s}R\sqrt{\log N_s}}{\sqrt n\,\delta} + \frac{2^{-s}\sqrt s}8.$$
Then indeed, by our condition on $\sqrt n\,\delta$,
$$\sum_{s=1}^\infty \eta_s = \sum_{s=1}^\infty \frac{7 \cdot 2^{-s}R\sqrt{\log N_s}}{\sqrt n\,\delta} + \sum_{s=1}^\infty \frac{2^{-s}\sqrt s}8 \le \frac 1 2 + \frac 1 2 = 1.$$
Here, we used the bound
$$\sum_{s=1}^\infty 2^{-s}\sqrt s \le 1 + \int_0^\infty 2^{-x}\sqrt x\,dx = 1 + \frac{\sqrt\pi/2}{(\log 2)^{3/2}} \le 4.$$

Observe that
$$\eta_s \ge \frac{7 \cdot 2^{-s}R\sqrt{\log N_s}}{\sqrt n\,\delta},$$
so that
$$2\log N_s \le \frac{2n\delta^2\eta_s^2}{49 \cdot 2^{-2s}R^2}.$$
Thus, since $\frac 1{18} - \frac 2{49} \ge \frac 1{70}$,
$$\sum_{s=1}^\infty 2\exp\Big[2\log N_s - \frac{n\delta^2\eta_s^2}{18 \cdot 2^{-2s}R^2}\Big] \le \sum_{s=1}^\infty 2\exp\Big[-\frac{n\delta^2\eta_s^2}{70 \cdot 2^{-2s}R^2}\Big].$$
Next, invoke that $\eta_s \ge 2^{-s}\sqrt s/8$; summing the resulting geometric series gives
$$\sum_{s=1}^\infty 2\exp\Big[-\frac{n\delta^2 s}{70R^2}\Big] = \frac{2\exp[-n\delta^2/(70R^2)]}{1 - \exp[-n\delta^2/(70R^2)]} \le 4\exp\Big[-\frac{n\delta^2}{70R^2}\Big],$$
where in the last inequality, we used the assumption that $n\delta^2 \ge 70R^2\log 2$. Thus, we have shown that
$$\mathbf P_{\mathbf X}\Big(\sup_{g \in \mathcal G} |P_n^\sigma g| \ge \delta\Big) \le 4\exp\Big[-\frac{n\delta^2}{70R^2}\Big]. \qquad \square$$

Remark. It is easy to see that
$$\sum_{s=1}^\infty 2^{-s}R\sqrt{\log N(2^{-s}R, \mathcal G, d_{2,n})} \le 2\int_0^R \sqrt{\log N(u, \mathcal G, d_{2,n})}\,du.$$

6.5.3 Asymptotic equicontinuity of the empirical process

Fix some $g_0 \in \mathcal G$ and let
$$\mathcal G(\delta) = \{g \in \mathcal G : \|g - g_0\|_{2,P} \le \delta\}.$$
Write $H(\cdot, \mathcal G, d_{2,n}) := \log N(\cdot, \mathcal G, d_{2,n})$ for the (random) entropy, and let $H(u)$, $u > 0$, be a given nonrandom function.

Lemma 6.5.3. Suppose that $\mathcal G$ has envelope $G$, with $PG^2 < \infty$, and that, for all $\delta > 0$,
$$\frac 1 n \log N(\delta, \mathcal G, d_{2,n}) \stackrel{\mathbf P}{\to} 0.$$
Then for each $\delta > 0$ fixed (i.e., not depending on $n$), and for
$$a \ge 4 \cdot 28\,A^{1/2}\Big(\int_0^{2\delta} H^{1/2}(u)\,du\ \vee\ 2\delta\Big),$$
we have
$$\limsup_{n \to \infty} \mathbf P\Big(\sup_{g \in \mathcal G(\delta)} |\nu_n(g) - \nu_n(g_0)| > a\Big) \le 16\exp\Big[-\frac{a^2}{4480\,\delta^2}\Big] + \limsup_{n \to \infty} \mathbf P\Big(\sup_{u > 0} \frac{H(u, \mathcal G, d_{2,n})}{H(u)} > A\Big).$$

Proof. The conditions $PG^2 < \infty$ and $\frac 1 n \log N(\delta, \mathcal G, d_{2,n}) \stackrel{\mathbf P}{\to} 0$ imply that
$$\sup_{g \in \mathcal G} \big|\,\|g - g_0\|_{2,n} - \|g - g_0\|_{2,P}\big| \stackrel{\mathbf P}{\to} 0.$$
So eventually, for each fixed $\delta > 0$, with large probability,
$$\sup_{g \in \mathcal G(\delta)} \|g - g_0\|_{2,n} \le 2\delta.$$
The result now follows from Lemma 6.5.2. $\square$

Our next step is proving asymptotic equicontinuity of the empirical process. This means that we shall take $a$ small in Lemma 6.5.3, which is possible if the entropy integral converges. Assume that
$$\int_0^1 H^{1/2}(u)\,du < \infty,$$
and define
$$J(\delta) = \int_0^\delta H^{1/2}(u)\,du.$$
Roughly speaking, the increment at $g_0$ of the empirical process $\nu_n(g)$ behaves like $J(\delta)$ for $\|g - g_0\| \le \delta$. So, since $J(\delta) \to 0$ as $\delta \downarrow 0$, the increments can be made arbitrarily small by taking $\delta$ small enough.

Theorem 6.5.3.2. Suppose that $\mathcal G$ has envelope $G$ with $PG^2 < \infty$. Suppose that
$$\lim_{A \to \infty} \limsup_{n \to \infty} \mathbf P\Big(\sup_{u > 0} \frac{H(u, \mathcal G, d_{2,n})}{H(u)} > A\Big) = 0.$$
Also assume
$$\int_0^1 H^{1/2}(u)\,du < \infty.$$
Then the empirical process $\nu_n$ is asymptotically continuous at $g_0$, i.e., for all $\eta > 0$, there exists a $\delta > 0$ such that
$$\limsup_{n \to \infty} \mathbf P\Big(\sup_{g \in \mathcal G(\delta)} |\nu_n(g) - \nu_n(g_0)| > \eta\Big) < \eta.$$

Proof. Take $A$ sufficiently large, such that
$$16\exp[-A] < \frac\eta 2 \quad \text{and} \quad \limsup_{n \to \infty} \mathbf P\Big(\sup_{u > 0} \frac{H(u, \mathcal G, d_{2,n})}{H(u)} > A\Big) < \frac\eta 2.$$
Next, take $\delta$ sufficiently small, such that
$$4 \cdot 28\,A^{1/2}J(2\delta) < \eta.$$
Then, by Lemma 6.5.3,
$$\limsup_{n \to \infty} \mathbf P\Big(\sup_{g \in \mathcal G(\delta)} |\nu_n(g) - \nu_n(g_0)| > \eta\Big) \le 16\exp\Big[-\frac{\eta^2}{4480\,\delta^2}\Big] + \frac\eta 2 \le 16\exp[-A] + \frac\eta 2 < \eta,$$

where we used $J(2\delta) \ge 2\delta$ (so that $\eta^2 \ge (4 \cdot 28)^2 A J^2(2\delta) \ge 4480\,A\delta^2$). $\square$

Remark. Because the conditions in Theorem 6.5.3.2 do not depend on $g_0$, its result holds for each $g_0$, i.e., we have in fact shown that $\nu_n$ is asymptotically continuous.

6.5.4 Application to VC graph classes

Theorem 6.5.4. Suppose that $\mathcal G$ is a VC graph class with envelope $G = \sup_{g \in \mathcal G} |g|$ satisfying $PG^2 < \infty$. Then $\{\nu_n(g) : g \in \mathcal G\}$ is asymptotically continuous, and so $\mathcal G$ is $P$-Donsker.

Proof. Apply Theorem 6.5.3.2 and Theorem 4.4.2. $\square$

Remark. In particular, suppose that the VC graph class $\mathcal G$ with square-integrable envelope $G$ is parametrized by $\theta$ in some parameter space $\Theta \subset \mathbb R^r$, i.e., $\mathcal G = \{g_\theta : \theta \in \Theta\}$. Let $z_n(\theta) = \nu_n(g_\theta)$. Question: do we have that, for a random sequence $\theta_n$ with $\theta_n \to \theta_0$ in probability, also
$$z_n(\theta_n) - z_n(\theta_0) \stackrel{\mathbf P}{\to} 0\,?$$
Indeed, if $\|g_\theta - g_{\theta_0}\|_{2,P} \to 0$ as $\theta$ converges to $\theta_0$, the answer is yes.

7 Asymptotic normality of M-estimators

Consider an M-estimator $\hat\theta_n$ of a finite-dimensional parameter $\theta_0$. We will give conditions for asymptotic normality of $\hat\theta_n$. It turns out that these conditions in fact imply asymptotic linearity. Our first set of conditions includes differentiability in $\theta$, at each $x$, of the loss function $\gamma_\theta(x)$. The proof of asymptotic normality is then the easiest. In the second set of conditions, only differentiability in quadratic mean of $\gamma_\theta$ is required. The results of the previous chapter (asymptotic continuity) supply us with an elegant way to handle remainder terms in the proofs. In this chapter, we assume that $\theta_0$ is an interior point of $\Theta \subset \mathbb R^r$. Moreover, we assume that we already showed that $\hat\theta_n$ is consistent.

7.1 Asymptotic linearity

Definition 7.1.1. The sequence of estimators $\hat\theta_n$ of $\theta_0$ is called asymptotically linear if we may write
$$\sqrt n(\hat\theta_n - \theta_0) = \sqrt n\,P_n l + o_{\mathbf P}(1),$$
where
$$l = \begin{pmatrix} l_1 \\ \vdots \\ l_r \end{pmatrix} : \mathcal X \to \mathbb R^r$$
satisfies $Pl = 0$ and $Pl_k^2 < \infty$, $k = 1, \ldots, r$. The function $l$ is then called the influence function. For the case $r = 1$, we call $\sigma^2 := Pl^2$ the asymptotic variance.

Definition 7.1.2. Let $\hat\theta_{n,1}$ and $\hat\theta_{n,2}$ be two asymptotically linear estimators of $\theta_0$, with asymptotic variance $\sigma_1^2$ and $\sigma_2^2$ respectively. Then
$$e_{1,2} := \frac{\sigma_2^2}{\sigma_1^2}$$
is called the asymptotic relative efficiency of $\hat\theta_{n,1}$ as compared to $\hat\theta_{n,2}$.

7.2 Conditions a, b and c for asymptotic normality

We start with 3 conditions a, b and c, which are easier to check but more stringent. We later relax them to conditions A, B and C.

Condition a. There exists an $\epsilon > 0$ such that $\theta \mapsto \gamma_\theta(x)$ is differentiable for all $\|\theta - \theta_0\| < \epsilon$ and all $x$, with derivative
$$\psi_\theta(x) := \frac\partial{\partial\theta}\gamma_\theta(x), \quad x \in \mathcal X.$$

Condition b. We have, as $\theta \to \theta_0$,
$$P(\psi_\theta - \psi_{\theta_0}) = V(\theta - \theta_0) + o(\|\theta - \theta_0\|),$$
where $V$ is a positive definite matrix.

Condition c. There exists an $\epsilon > 0$ such that the class $\{\psi_\theta : \|\theta - \theta_0\| < \epsilon\}$ is $P$-Donsker with envelope $\Psi$ satisfying $P\Psi^2 < \infty$. Moreover,
$$\lim_{\theta \to \theta_0} \|\psi_\theta - \psi_{\theta_0}\|_{2,P} = 0.$$

Lemma 7.2.1. Suppose conditions a, b and c. Then $\hat\theta_n$ is asymptotically linear with influence function
$$l = -V^{-1}\psi_{\theta_0},$$
so
$$\sqrt n(\hat\theta_n - \theta_0) \stackrel{\mathcal L}{\to} N(0, V^{-1}JV^{-1}),$$

where $J = P\psi_{\theta_0}\psi_{\theta_0}^T$.

Proof. Recall that $\theta_0$ is an interior point of $\Theta$, and minimizes $P\gamma_\theta$, so that $P\psi_{\theta_0} = 0$. Because $\hat\theta_n$ is consistent, it is eventually a solution of the score equations
$$P_n\psi_{\hat\theta_n} = 0.$$
Rewrite the score equations as
$$0 = P_n\psi_{\hat\theta_n} = P_n(\psi_{\hat\theta_n} - \psi_{\theta_0}) + P_n\psi_{\theta_0} = (P_n - P)(\psi_{\hat\theta_n} - \psi_{\theta_0}) + P\psi_{\hat\theta_n} + P_n\psi_{\theta_0}.$$
Now, use condition b and the asymptotic equicontinuity of $\{\psi_\theta : \|\theta - \theta_0\| \le \epsilon\}$ (see Chapter 6), to obtain
$$0 = o_{\mathbf P}(n^{-1/2}) + V(\hat\theta_n - \theta_0) + o(\|\hat\theta_n - \theta_0\|) + P_n\psi_{\theta_0}.$$
This yields
$$\hat\theta_n - \theta_0 = -V^{-1}P_n\psi_{\theta_0} + o_{\mathbf P}(n^{-1/2}). \qquad \square$$

Example: Huber estimator. Let $\mathcal X = \mathbb R$, $\Theta = \mathbb R$. The Huber estimator corresponds to the loss function $\gamma_\theta(x) = \gamma(x - \theta)$, with
$$\gamma(x) = x^2\,\mathbf 1\{|x| \le k\} + (2k|x| - k^2)\,\mathbf 1\{|x| > k\}, \quad x \in \mathbb R.$$
Here, $0 < k < \infty$ is some fixed constant, chosen by the statistician. We will now verify a, b and c.

(a)
$$\psi_\theta(x) = \begin{cases} +2k & \text{if } x - \theta \le -k, \\ -2(x - \theta) & \text{if } |x - \theta| \le k, \\ -2k & \text{if } x - \theta \ge k. \end{cases}$$

(b) We have
$$\frac d{d\theta}\int\psi_\theta\,dP = 2\big(F(k + \theta) - F(-k + \theta)\big),$$
where $F(t) = P(X \le t)$, $t \in \mathbb R$, is the distribution function. So
$$V = 2\big(F(k + \theta_0) - F(-k + \theta_0)\big).$$

(c) Clearly, $\{\psi_\theta : \theta \in \mathbb R\}$ is a VC graph class, with envelope $\Psi \equiv 2k$.

So the Huber estimator $\hat\theta_n$ has influence function
$$l(x) = \begin{cases} \dfrac{-k}{F(k + \theta_0) - F(-k + \theta_0)} & \text{if } x - \theta_0 \le -k, \\[1ex] \dfrac{x - \theta_0}{F(k + \theta_0) - F(-k + \theta_0)} & \text{if } |x - \theta_0| \le k, \\[1ex] \dfrac{k}{F(k + \theta_0) - F(-k + \theta_0)} & \text{if } x - \theta_0 \ge k. \end{cases}$$
The asymptotic variance is
$$\sigma^2 = \frac{k^2F(-k + \theta_0) + \int_{-k + \theta_0}^{k + \theta_0}(x - \theta_0)^2\,dF(x) + k^2\big(1 - F(k + \theta_0)\big)}{\big(F(k + \theta_0) - F(-k + \theta_0)\big)^2}.$$
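The Huber estimator itself is easy to compute numerically, since $\theta \mapsto P_n\gamma_\theta$ is convex. A sketch (our own; SciPy assumed, and the cutoff $k = 1.345$ and the sample are arbitrary illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(10)                     # arbitrary seed
k = 1.345                                           # arbitrary cutoff chosen by the statistician
X = rng.standard_normal(2000) + 5.0                 # location 5

def huber_loss(u, k):
    """gamma(u) = u^2 for |u| <= k, and 2k|u| - k^2 otherwise."""
    return np.where(np.abs(u) <= k, u**2, 2 * k * np.abs(u) - k**2)

theta_hat = minimize_scalar(lambda th: np.mean(huber_loss(X - th, k))).x
print(theta_hat)                                    # close to 5
```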

7.3 Asymptotics for the median

The median (see Example (i.b) in Chapter 5) can be regarded as the limiting case of a Huber estimator, with $k \downarrow 0$. However, the loss function $\gamma_\theta(x) = |x - \theta|$ is not differentiable, i.e., it does not satisfy condition a. For even sample sizes, we do nevertheless have the score equation
$$F_n(\hat\theta_n) - \frac 1 2 = 0.$$
Let us investigate this more closely.

Let $X \in \mathbb R$ have distribution $F$, and let $F_n$ be the empirical distribution function. The population median $\theta_0$ is a solution of the equation
$$F(\theta_0) - \frac 1 2 = 0.$$
We assume this solution exists, and also that $F$ has positive density $f$ in a neighborhood of $\theta_0$. Consider now, for simplicity, even sample sizes $n$, and let the sample median $\hat\theta_n$ be any solution of
$$F_n(\hat\theta_n) - \frac 1 2 = 0.$$
Then we get
$$0 = F_n(\hat\theta_n) - F(\theta_0) = \big[F_n(\hat\theta_n) - F(\hat\theta_n)\big] + \big[F(\hat\theta_n) - F(\theta_0)\big] = \frac 1{\sqrt n}W_n(\hat\theta_n) + \big[F(\hat\theta_n) - F(\theta_0)\big],$$
where $W_n = \sqrt n(F_n - F)$ is the empirical process. Since $F$ is continuous at $\theta_0$, and $\hat\theta_n \stackrel{\mathbf P}{\to} \theta_0$, we have, by the asymptotic continuity of the empirical process (Section 6.3), that
$$W_n(\hat\theta_n) = W_n(\theta_0) + o_{\mathbf P}(1).$$
We thus arrive at
$$0 = W_n(\theta_0) + \sqrt n\big[F(\hat\theta_n) - F(\theta_0)\big] + o_{\mathbf P}(1) = W_n(\theta_0) + \sqrt n\,[f(\theta_0) + o(1)]\,[\hat\theta_n - \theta_0] + o_{\mathbf P}(1).$$
In other words,
$$\sqrt n(\hat\theta_n - \theta_0) = -\frac{W_n(\theta_0)}{f(\theta_0)} + o_{\mathbf P}(1).$$
So the influence function is
$$l(x) = \begin{cases} +\dfrac 1{2f(\theta_0)} & \text{if } x > \theta_0, \\[1ex] -\dfrac 1{2f(\theta_0)} & \text{if } x \le \theta_0, \end{cases}$$
and the asymptotic variance is
$$\sigma^2 = \frac 1{4f(\theta_0)^2}.$$
We can now compare median and mean. It is easily seen that the asymptotic relative efficiency of the mean as compared to the median is
$$e_{1,2} = \frac 1{4\sigma_0^2 f(\theta_0)^2},$$
where $\sigma_0^2 = \mathrm{var}(X)$. So $e_{1,2} = \pi/2$ for the normal distribution, and $e_{1,2} = 1/2$ for the double exponential (Laplace) distribution. The density of the double exponential distribution is
$$f(x) = \frac 1{\sqrt{2\sigma_0^2}}\exp\Big[-\sqrt{\frac 2{\sigma_0^2}}\,|x - \theta_0|\Big], \quad x \in \mathbb R.$$
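The formula $\sigma^2 = 1/(4f(\theta_0)^2)$ is easily checked by simulation. For standard normal data, $f(\theta_0) = 1/\sqrt{2\pi}$, so the asymptotic variance of the median is $\pi/2$; a sketch (our own, arbitrary seed and sample sizes):

```python
import numpy as np

rng = np.random.default_rng(9)                      # arbitrary seed
n, reps = 1001, 5000

# Standard normal: theta_0 = 0, f(theta_0) = 1/sqrt(2 pi),
# so 1/(4 f(theta_0)^2) = pi/2.
medians = np.median(rng.standard_normal((reps, n)), axis=1)
print(n * medians.var(), np.pi / 2)                 # simulated vs. theoretical variance

# Asymptotic relative efficiency of the mean vs. the median, e_{1,2} = pi/2.
means = rng.standard_normal((reps, n)).mean(axis=1)
print((n * medians.var()) / (n * means.var()))      # roughly 1.57
```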

7.4 Conditions A, B and C for asymptotic normality

We are now going to relax the condition of differentiability of $\gamma_\theta$.

Condition A (differentiability in quadratic mean). There exists a function $\psi_0 : \mathcal X \to \mathbb R^r$, with $P\psi_{0,k}^2 < \infty$, $k = 1, \ldots, r$, such that
$$\lim_{\theta \to \theta_0} \frac{\big\|\gamma_\theta - \gamma_{\theta_0} - (\theta - \theta_0)^T\psi_0\big\|_{2,P}}{\|\theta - \theta_0\|} = 0.$$

Condition B. We have, as $\theta \to \theta_0$,
$$P\gamma_\theta - P\gamma_{\theta_0} = \frac 1 2(\theta - \theta_0)^TV(\theta - \theta_0) + o(\|\theta - \theta_0\|^2),$$
with $V$ a positive definite matrix.

Condition C. Define for $\theta \ne \theta_0$,
$$g_\theta = \frac{\gamma_\theta - \gamma_{\theta_0}}{\|\theta - \theta_0\|}.$$
Suppose that for some $\epsilon > 0$, the class $\{g_\theta : 0 < \|\theta - \theta_0\| < \epsilon\}$ is a $P$-Donsker class with envelope $G$ satisfying $PG^2 < \infty$.

Lemma 7.4.1. Suppose conditions A, B and C are met. Then $\hat\theta_n$ has influence function
$$l = -V^{-1}\psi_0,$$
and so
$$\sqrt n(\hat\theta_n - \theta_0) \stackrel{\mathcal L}{\to} N(0, V^{-1}JV^{-1}),$$
where $J = \int \psi_0\psi_0^T\,dP$.

Proof. Since $\{g_\theta : \|\theta - \theta_0\| \le \epsilon\}$ is a $P$-Donsker class, and $\hat\theta_n$ is consistent, we may write
$$0 \ge P_n(\gamma_{\hat\theta_n} - \gamma_{\theta_0}) = (P_n - P)(\gamma_{\hat\theta_n} - \gamma_{\theta_0}) + P(\gamma_{\hat\theta_n} - \gamma_{\theta_0})$$
$$= (P_n - P)g_{\hat\theta_n}\,\|\hat\theta_n - \theta_0\| + P(\gamma_{\hat\theta_n} - \gamma_{\theta_0})$$
$$= (P_n - P)(\hat\theta_n - \theta_0)^T\psi_0 + o_{\mathbf P}(n^{-1/2})\|\hat\theta_n - \theta_0\| + P(\gamma_{\hat\theta_n} - \gamma_{\theta_0})$$
$$= (P_n - P)(\hat\theta_n - \theta_0)^T\psi_0 + o_{\mathbf P}(n^{-1/2})\|\hat\theta_n - \theta_0\| + \frac 1 2(\hat\theta_n - \theta_0)^TV(\hat\theta_n - \theta_0) + o(\|\hat\theta_n - \theta_0\|^2).$$
This implies $\|\hat\theta_n - \theta_0\| = O_{\mathbf P}(n^{-1/2})$. But then
$$\Big\|V^{1/2}(\hat\theta_n - \theta_0) + V^{-1/2}(P_n - P)\psi_0\Big\|^2 \le o_{\mathbf P}(n^{-1}).$$
Therefore,
$$\hat\theta_n - \theta_0 = -V^{-1}(P_n - P)\psi_0 + o_{\mathbf P}(n^{-1/2}).$$
Because $P\psi_0 = 0$, the result follows, and the asymptotic covariance matrix is $V^{-1}JV^{-1}$. $\square$

7.5 Exercises

Exercise 1. Suppose $X$ has the logistic distribution with location parameter $\theta$ (see Example (ii.b) of Chapter 5). Show that the maximum likelihood estimator has asymptotic variance equal to 3, and the median has asymptotic variance equal to 4. Hence, the asymptotic relative efficiency of the maximum likelihood estimator as compared to the median is 4/3.

Exercise 2. Let $(X_i, Y_i)$, $i = 1, \ldots, n$, be i.i.d. copies of $(X, Y)$, where $X \in \mathbb R^d$ and $Y \in \mathbb R$. Suppose that the conditional distribution of $Y$ given $X = x$ has median
$$m(x) = \alpha_0 + \alpha_1 x_1 + \cdots + \alpha_d x_d,$$
with
$$\alpha = \begin{pmatrix} \alpha_0 \\ \alpha_1 \\ \vdots \\ \alpha_d \end{pmatrix} \in \mathbb R^{d+1}.$$
Assume moreover that, given $X = x$, the random variable $Y - m(x)$ has a density $f$ not depending on $x$, with $f$ positive in a neighborhood of zero. Suppose moreover that
$$\Sigma = E\begin{pmatrix} 1 \\ X \end{pmatrix}\begin{pmatrix} 1 \\ X \end{pmatrix}^T$$
exists. Let
$$\hat\alpha_n = \arg\min_{a \in \mathbb R^{d+1}} \frac 1 n \sum_{i=1}^n |Y_i - a_0 - a_1X_{i,1} - \cdots - a_dX_{i,d}|$$
be the least absolute deviations (LAD) estimator. Show that
$$\sqrt n(\hat\alpha_n - \alpha) \stackrel{\mathcal L}{\to} N\Big(0, \frac{\Sigma^{-1}}{4f^2(0)}\Big),$$
by verifying conditions A, B and C.

8 Rates of convergence for least squares estimators

Probability inequalities for the least squares estimator are obtained, under conditions on the entropy of the class of regression functions. In the examples, we study smooth regression functions, functions of bounded variation, concave functions, analytic functions, and image restoration. Results for the entropies of various classes of functions are taken from the literature on approximation theory.

Let $Y_1, \ldots, Y_n$ be real-valued observations, satisfying
$$Y_i = g_0(z_i) + W_i, \quad i = 1, \ldots, n,$$
with $z_1, \ldots, z_n$ fixed covariates in a space $\mathcal Z$, $W_1, \ldots, W_n$ independent errors with expectation zero, and with the unknown regression function $g_0$ in a given class $\mathcal G$ of regression functions. The least squares estimator is
$$\hat g_n := \arg\min_{g \in \mathcal G} \sum_{i=1}^n \big(Y_i - g(z_i)\big)^2.$$
Throughout, we assume that a minimizer $\hat g_n \in \mathcal G$ of the sum of squares exists, but it need not be unique.

The following notation will be used. The empirical measure of the covariates is
$$Q_n := \frac 1 n \sum_{i=1}^n \delta_{z_i}.$$
For $g$ a function on $\mathcal Z$, we denote its squared $L_2(Q_n)$-norm by
$$\|g\|_n^2 := \|g\|_{2,Q_n}^2 := \frac 1 n \sum_{i=1}^n g^2(z_i).$$
The empirical inner product between error and regression function is written as
$$(w, g)_n = \frac 1 n \sum_{i=1}^n W_ig(z_i).$$
Finally, we let
$$\mathcal G(R) := \{g \in \mathcal G : \|g - g_0\|_n \le R\}$$
denote a ball around $g_0$ with radius $R$, intersected with $\mathcal G$.

The main idea to arrive at rates of convergence for $\hat g_n$ is to invoke the basic inequality
$$\|\hat g_n - g_0\|_n^2 \le 2(w, \hat g_n - g_0)_n.$$
The modulus of continuity of the process $\{(w, g - g_0)_n : g \in \mathcal G(R)\}$ can be derived from the entropy of $\mathcal G(R)$, endowed with the metric
$$d_n(g, \tilde g) := \|g - \tilde g\|_n.$$

8.1 Gaussian errors

When the errors are Gaussian, it is not hard to extend the maximal inequality of Lemma 6.5.2. Therefore, and to simplify the exposition, we will assume in this chapter that
$$W_1, \ldots, W_n \text{ are i.i.d. } N(0, 1)\text{-distributed}.$$
Then, as in Lemma 6.5.2, for
$$\sqrt n\,\delta \ge 28\int_0^R \sqrt{\log N(u, \mathcal G(R), d_n)}\,du\ \vee\ \sqrt{70\log 2}\,R,$$
we have
$$\mathbf P\Big(\sup_{g \in \mathcal G(R)} (w, g - g_0)_n \ge \delta\Big) \le 4\exp\Big[-\frac{n\delta^2}{70R^2}\Big]. \tag{*}$$

8.2 Rates of convergence

Define
$$J(\delta, \mathcal G(\delta), d_n) = \int_0^\delta \sqrt{\log N(u, \mathcal G(\delta), d_n)}\,du\ \vee\ \delta.$$

Theorem 8.2. Take $\Psi(\delta) \ge J(\delta, \mathcal G(\delta), d_n)$ in such a way that $\Psi(\delta)/\delta^2$ is a non-increasing function of $\delta$. Then for a constant $c_1$, and for
$$\sqrt n\,\delta_n^2 \ge c_1\Psi(\delta_n),$$
we have for all $\delta \ge \delta_n$,
$$\mathbf P(\|\hat g_n - g_0\|_n > \delta) \le c\exp\Big[-\frac{n\delta^2}{c^2}\Big].$$

Proof. We have, by the basic inequality,
$$\mathbf P(\|\hat g_n - g_0\|_n > \delta) \le \sum_{s=0}^\infty \mathbf P\Big(\sup_{g \in \mathcal G(2^{s+1}\delta)} (w, g - g_0)_n \ge 2^{2s-1}\delta^2\Big) := \sum_{s=0}^\infty P_s.$$
Now, if $\sqrt n\,\delta_n^2 \ge c_1\Psi(\delta_n)$, then also, for all $2^{s+1}\delta > \delta_n$,
$$\sqrt n\,2^{2s+2}\delta^2 \ge c_1\Psi(2^{s+1}\delta).$$
So, for an appropriate choice of $c_1$, we may apply (*) to each $P_s$. This gives, for some constants $c_2$, $c_3$,
$$\sum_{s=0}^\infty P_s \le \sum_{s=0}^\infty c_2\exp\Big[-\frac{n2^{4s-2}\delta^4}{c_2^2\,2^{2s+2}\delta^2}\Big] \le c_3\exp\Big[-\frac{n\delta^2}{c_3^2}\Big].$$
Take $c = \max\{c_1, c_2, c_3\}$. $\square$

8.3 Examples

Example 8.3.1 (Linear regression). Let
$$\mathcal G = \{g(z) = \theta_1\psi_1(z) + \cdots + \theta_r\psi_r(z) : \theta \in \mathbb R^r\}.$$
One may verify
$$\log N(u, \mathcal G(\delta), d_n) \le r\log\Big(\frac{\delta + 4u}u\Big), \quad \text{for all } 0 < u < \delta,\ \delta > 0.$$
So
$$\int_0^\delta \sqrt{\log N(u, \mathcal G(\delta), d_n)}\,du \le r^{1/2}\int_0^\delta \sqrt{\log\Big(\frac{\delta + 4u}u\Big)}\,du = r^{1/2}\delta\int_0^1 \sqrt{\log\Big(\frac{1 + 4v}v\Big)}\,dv := A_0r^{1/2}\delta.$$
So Theorem 8.2 can be applied with
$$\delta_n \ge cA_0\sqrt{\frac rn}.$$
It yields that for some constant $c$ (not the same at each appearance) and for all $T \ge c$,
$$\mathbf P\Big(\|\hat g_n - g_0\|_n > T\sqrt{\frac rn}\Big) \le c\exp\Big[-\frac{T^2r}{c^2}\Big].$$
Note that we made extensive use here of the fact that it suffices to calculate the local entropy of $\mathcal G$.
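The $\sqrt{r/n}$ rate of Example 8.3.1 shows up clearly in simulation. A sketch (our own illustration; polynomial basis functions $\psi_k(z) = z^{k-1}$ and Gaussian errors as in Section 8.1, arbitrary seed):

```python
import numpy as np

rng = np.random.default_rng(11)                     # arbitrary seed
r = 5                                               # dimension of the linear model

def ls_error(n, reps=200):
    """Monte Carlo estimate of E ||g_hat - g_0||_n for linear least squares."""
    errs = []
    for _ in range(reps):
        z = rng.uniform(size=n)
        Psi = z[:, None] ** np.arange(r)            # design: psi_k(z) = z^(k-1)
        theta0 = np.ones(r)
        y = Psi @ theta0 + rng.standard_normal(n)   # Gaussian errors
        theta_hat = np.linalg.lstsq(Psi, y, rcond=None)[0]
        errs.append(np.sqrt(np.mean((Psi @ (theta_hat - theta0)) ** 2)))
    return np.mean(errs)

for n in [100, 400, 1600]:
    print(n, ls_error(n), np.sqrt(r / n))           # the error tracks sqrt(r/n)
```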

Example 8.3.2 (Smooth functions). Let
$$\mathcal G = \Big\{g : [0,1] \to \mathbb R,\ \int \big(g^{(m)}(z)\big)^2\,dz \le M^2\Big\}.$$
Let $\psi_k(z) = z^{k-1}$, $k = 1, \ldots, m$, $\psi(z) = (\psi_1(z), \ldots, \psi_m(z))^T$, and
$$\Sigma_n = \int \psi\psi^T\,dQ_n.$$
Denote the smallest eigenvalue of $\Sigma_n$ by $\lambda_n$, and assume that $\lambda_n \ge \lambda > 0$, for all $n \ge n_0$. One can show (Kolmogorov and Tihomirov, 1959) that
$$\log N(\delta, \mathcal G(\delta), d_n) \le A\delta^{-1/m}, \quad \text{for small } \delta > 0,$$
where the constant $A$ depends on $\lambda$. Hence, we find from Theorem 8.2 that for $T \ge c$,
$$\mathbf P\Big(\|\hat g_n - g_0\|_n > Tn^{-\frac m{2m+1}}\Big) \le c\exp\Big[-\frac{T^2n^{\frac 1{2m+1}}}{c^2}\Big].$$

Example 8.3.3 (Functions of bounded variation in $\mathbb R$). Let
$$\mathcal G = \Big\{g : \mathbb R \to \mathbb R,\ \int |g'(z)|\,dz \le M\Big\}.$$
Without loss of generality, we may assume that $z_1 \le \cdots \le z_n$. The derivative should be understood in the generalized sense:
$$\int |g'(z)|\,dz := \sum_{i=2}^n |g(z_i) - g(z_{i-1})|.$$
Define for $g \in \mathcal G$,
$$\alpha := \int g\,dQ_n.$$
Then it is easy to see that
$$\max_{i=1,\ldots,n} |g(z_i)| \le |\alpha| + M.$$
One can now show (Birman and Solomjak, 1967) that
$$\log N(\delta, \mathcal G(\delta), d_n) \le A\delta^{-1}, \quad \text{for small } \delta > 0,$$
and therefore, for all $T \ge c$,
$$\mathbf P\big(\|\hat g_n - g_0\|_n > Tn^{-1/3}\big) \le c\exp\Big[-\frac{T^2n^{1/3}}{c^2}\Big].$$

Example 8.3.4 (Functions of bounded variation in $\mathbb R^2$). Suppose that $z_i = (u_k, v_l)$, $i = (k, l)$, $k = 1, \ldots, n_1$, $l = 1, \ldots, n_2$, $n = n_1n_2$, with $u_1 \le \cdots \le u_{n_1}$, $v_1 \le \cdots \le v_{n_2}$. Consider the class
$$\mathcal G = \{g : \mathbb R^2 \to \mathbb R,\ I(g) \le M\},$$
where
$$I(g) := I_0(g) + I_1(g_1) + I_2(g_2),$$
$$I_0(g) := \sum_{k=2}^{n_1}\sum_{l=2}^{n_2} \big|g(u_k, v_l) - g(u_{k-1}, v_l) - g(u_k, v_{l-1}) + g(u_{k-1}, v_{l-1})\big|,$$
$$g_1(u) := \frac 1{n_2}\sum_{l=1}^{n_2} g(u, v_l),$$


More information

Brownian Motion. 1 Definition Brownian Motion Wiener measure... 3

Brownian Motion. 1 Definition Brownian Motion Wiener measure... 3 Brownian Motion Contents 1 Definition 2 1.1 Brownian Motion................................. 2 1.2 Wiener measure.................................. 3 2 Construction 4 2.1 Gaussian process.................................

More information

Stability of optimization problems with stochastic dominance constraints

Stability of optimization problems with stochastic dominance constraints Stability of optimization problems with stochastic dominance constraints D. Dentcheva and W. Römisch Stevens Institute of Technology, Hoboken Humboldt-University Berlin www.math.hu-berlin.de/~romisch SIAM

More information

Part II Probability and Measure

Part II Probability and Measure Part II Probability and Measure Theorems Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed by the lecturers, and I have modified them (often significantly)

More information

Lecture 2. We now introduce some fundamental tools in martingale theory, which are useful in controlling the fluctuation of martingales.

Lecture 2. We now introduce some fundamental tools in martingale theory, which are useful in controlling the fluctuation of martingales. Lecture 2 1 Martingales We now introduce some fundamental tools in martingale theory, which are useful in controlling the fluctuation of martingales. 1.1 Doob s inequality We have the following maximal

More information

The main results about probability measures are the following two facts:

The main results about probability measures are the following two facts: Chapter 2 Probability measures The main results about probability measures are the following two facts: Theorem 2.1 (extension). If P is a (continuous) probability measure on a field F 0 then it has a

More information

Theory and Applications of Stochastic Systems Lecture Exponential Martingale for Random Walk

Theory and Applications of Stochastic Systems Lecture Exponential Martingale for Random Walk Instructor: Victor F. Araman December 4, 2003 Theory and Applications of Stochastic Systems Lecture 0 B60.432.0 Exponential Martingale for Random Walk Let (S n : n 0) be a random walk with i.i.d. increments

More information

2. Variance and Higher Moments

2. Variance and Higher Moments 1 of 16 7/16/2009 5:45 AM Virtual Laboratories > 4. Expected Value > 1 2 3 4 5 6 2. Variance and Higher Moments Recall that by taking the expected value of various transformations of a random variable,

More information

Selected Exercises on Expectations and Some Probability Inequalities

Selected Exercises on Expectations and Some Probability Inequalities Selected Exercises on Expectations and Some Probability Inequalities # If E(X 2 ) = and E X a > 0, then P( X λa) ( λ) 2 a 2 for 0 < λ

More information

Some basic elements of Probability Theory

Some basic elements of Probability Theory Chapter I Some basic elements of Probability Theory 1 Terminology (and elementary observations Probability theory and the material covered in a basic Real Variables course have much in common. However

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 25: Semiparametric Models

Introduction to Empirical Processes and Semiparametric Inference Lecture 25: Semiparametric Models Introduction to Empirical Processes and Semiparametric Inference Lecture 25: Semiparametric Models Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations

More information

Supplementary Material for Nonparametric Operator-Regularized Covariance Function Estimation for Functional Data

Supplementary Material for Nonparametric Operator-Regularized Covariance Function Estimation for Functional Data Supplementary Material for Nonparametric Operator-Regularized Covariance Function Estimation for Functional Data Raymond K. W. Wong Department of Statistics, Texas A&M University Xiaoke Zhang Department

More information

Riemann integral and volume are generalized to unbounded functions and sets. is an admissible set, and its volume is a Riemann integral, 1l E,

Riemann integral and volume are generalized to unbounded functions and sets. is an admissible set, and its volume is a Riemann integral, 1l E, Tel Aviv University, 26 Analysis-III 9 9 Improper integral 9a Introduction....................... 9 9b Positive integrands................... 9c Special functions gamma and beta......... 4 9d Change of

More information

A LOCALIZATION PROPERTY AT THE BOUNDARY FOR MONGE-AMPERE EQUATION

A LOCALIZATION PROPERTY AT THE BOUNDARY FOR MONGE-AMPERE EQUATION A LOCALIZATION PROPERTY AT THE BOUNDARY FOR MONGE-AMPERE EQUATION O. SAVIN. Introduction In this paper we study the geometry of the sections for solutions to the Monge- Ampere equation det D 2 u = f, u

More information

Lecture Notes in Advanced Calculus 1 (80315) Raz Kupferman Institute of Mathematics The Hebrew University

Lecture Notes in Advanced Calculus 1 (80315) Raz Kupferman Institute of Mathematics The Hebrew University Lecture Notes in Advanced Calculus 1 (80315) Raz Kupferman Institute of Mathematics The Hebrew University February 7, 2007 2 Contents 1 Metric Spaces 1 1.1 Basic definitions...........................

More information

Homework # , Spring Due 14 May Convergence of the empirical CDF, uniform samples

Homework # , Spring Due 14 May Convergence of the empirical CDF, uniform samples Homework #3 36-754, Spring 27 Due 14 May 27 1 Convergence of the empirical CDF, uniform samples In this problem and the next, X i are IID samples on the real line, with cumulative distribution function

More information

THE LINDEBERG-FELLER CENTRAL LIMIT THEOREM VIA ZERO BIAS TRANSFORMATION

THE LINDEBERG-FELLER CENTRAL LIMIT THEOREM VIA ZERO BIAS TRANSFORMATION THE LINDEBERG-FELLER CENTRAL LIMIT THEOREM VIA ZERO BIAS TRANSFORMATION JAINUL VAGHASIA Contents. Introduction. Notations 3. Background in Probability Theory 3.. Expectation and Variance 3.. Convergence

More information

Closest Moment Estimation under General Conditions

Closest Moment Estimation under General Conditions Closest Moment Estimation under General Conditions Chirok Han Victoria University of Wellington New Zealand Robert de Jong Ohio State University U.S.A October, 2003 Abstract This paper considers Closest

More information

MAS223 Statistical Inference and Modelling Exercises

MAS223 Statistical Inference and Modelling Exercises MAS223 Statistical Inference and Modelling Exercises The exercises are grouped into sections, corresponding to chapters of the lecture notes Within each section exercises are divided into warm-up questions,

More information

Closest Moment Estimation under General Conditions

Closest Moment Estimation under General Conditions Closest Moment Estimation under General Conditions Chirok Han and Robert de Jong January 28, 2002 Abstract This paper considers Closest Moment (CM) estimation with a general distance function, and avoids

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 59 Classical case: n d. Asymptotic assumption: d is fixed and n. Basic tools: LLN and CLT. High-dimensional setting: n d, e.g. n/d

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

is a Borel subset of S Θ for each c R (Bertsekas and Shreve, 1978, Proposition 7.36) This always holds in practical applications.

is a Borel subset of S Θ for each c R (Bertsekas and Shreve, 1978, Proposition 7.36) This always holds in practical applications. Stat 811 Lecture Notes The Wald Consistency Theorem Charles J. Geyer April 9, 01 1 Analyticity Assumptions Let { f θ : θ Θ } be a family of subprobability densities 1 with respect to a measure µ on a measurable

More information

The Heine-Borel and Arzela-Ascoli Theorems

The Heine-Borel and Arzela-Ascoli Theorems The Heine-Borel and Arzela-Ascoli Theorems David Jekel October 29, 2016 This paper explains two important results about compactness, the Heine- Borel theorem and the Arzela-Ascoli theorem. We prove them

More information

L p Spaces and Convexity

L p Spaces and Convexity L p Spaces and Convexity These notes largely follow the treatments in Royden, Real Analysis, and Rudin, Real & Complex Analysis. 1. Convex functions Let I R be an interval. For I open, we say a function

More information

Existence and Uniqueness

Existence and Uniqueness Chapter 3 Existence and Uniqueness An intellect which at a certain moment would know all forces that set nature in motion, and all positions of all items of which nature is composed, if this intellect

More information

Statistical inference on Lévy processes

Statistical inference on Lévy processes Alberto Coca Cabrero University of Cambridge - CCA Supervisors: Dr. Richard Nickl and Professor L.C.G.Rogers Funded by Fundación Mutua Madrileña and EPSRC MASDOC/CCA student workshop 2013 26th March Outline

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Chapter 8 Maximum Likelihood Estimation 8. Consistency If X is a random variable (or vector) with density or mass function f θ (x) that depends on a parameter θ, then the function f θ (X) viewed as a function

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Chapter 7 Maximum Likelihood Estimation 7. Consistency If X is a random variable (or vector) with density or mass function f θ (x) that depends on a parameter θ, then the function f θ (X) viewed as a function

More information

Statistical Properties of Numerical Derivatives

Statistical Properties of Numerical Derivatives Statistical Properties of Numerical Derivatives Han Hong, Aprajit Mahajan, and Denis Nekipelov Stanford University and UC Berkeley November 2010 1 / 63 Motivation Introduction Many models have objective

More information

40.530: Statistics. Professor Chen Zehua. Singapore University of Design and Technology

40.530: Statistics. Professor Chen Zehua. Singapore University of Design and Technology Singapore University of Design and Technology Lecture 9: Hypothesis testing, uniformly most powerful tests. The Neyman-Pearson framework Let P be the family of distributions of concern. The Neyman-Pearson

More information

March 1, Florida State University. Concentration Inequalities: Martingale. Approach and Entropy Method. Lizhe Sun and Boning Yang.

March 1, Florida State University. Concentration Inequalities: Martingale. Approach and Entropy Method. Lizhe Sun and Boning Yang. Florida State University March 1, 2018 Framework 1. (Lizhe) Basic inequalities Chernoff bounding Review for STA 6448 2. (Lizhe) Discrete-time martingales inequalities via martingale approach 3. (Boning)

More information

1 Weak Convergence in R k

1 Weak Convergence in R k 1 Weak Convergence in R k Byeong U. Park 1 Let X and X n, n 1, be random vectors taking values in R k. These random vectors are allowed to be defined on different probability spaces. Below, for the simplicity

More information

3 Integration and Expectation

3 Integration and Expectation 3 Integration and Expectation 3.1 Construction of the Lebesgue Integral Let (, F, µ) be a measure space (not necessarily a probability space). Our objective will be to define the Lebesgue integral R fdµ

More information

or E ( U(X) ) e zx = e ux e ivx = e ux( cos(vx) + i sin(vx) ), B X := { u R : M X (u) < } (4)

or E ( U(X) ) e zx = e ux e ivx = e ux( cos(vx) + i sin(vx) ), B X := { u R : M X (u) < } (4) :23 /4/2000 TOPIC Characteristic functions This lecture begins our study of the characteristic function φ X (t) := Ee itx = E cos(tx)+ie sin(tx) (t R) of a real random variable X Characteristic functions

More information

Sequences and Series of Functions

Sequences and Series of Functions Chapter 13 Sequences and Series of Functions These notes are based on the notes A Teacher s Guide to Calculus by Dr. Louis Talman. The treatment of power series that we find in most of today s elementary

More information

Kernel Density Estimation

Kernel Density Estimation EECS 598: Statistical Learning Theory, Winter 2014 Topic 19 Kernel Density Estimation Lecturer: Clayton Scott Scribe: Yun Wei, Yanzhen Deng Disclaimer: These notes have not been subjected to the usual

More information

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi Real Analysis Math 3AH Rudin, Chapter # Dominique Abdi.. If r is rational (r 0) and x is irrational, prove that r + x and rx are irrational. Solution. Assume the contrary, that r+x and rx are rational.

More information

We will briefly look at the definition of a probability space, probability measures, conditional probability and independence of probability events.

We will briefly look at the definition of a probability space, probability measures, conditional probability and independence of probability events. 1 Probability 1.1 Probability spaces We will briefly look at the definition of a probability space, probability measures, conditional probability and independence of probability events. Definition 1.1.

More information

3 Continuous Random Variables

3 Continuous Random Variables Jinguo Lian Math437 Notes January 15, 016 3 Continuous Random Variables Remember that discrete random variables can take only a countable number of possible values. On the other hand, a continuous random

More information

DA Freedman Notes on the MLE Fall 2003

DA Freedman Notes on the MLE Fall 2003 DA Freedman Notes on the MLE Fall 2003 The object here is to provide a sketch of the theory of the MLE. Rigorous presentations can be found in the references cited below. Calculus. Let f be a smooth, scalar

More information

< k 2n. 2 1 (n 2). + (1 p) s) N (n < 1

< k 2n. 2 1 (n 2). + (1 p) s) N (n < 1 List of Problems jacques@ucsd.edu Those question with a star next to them are considered slightly more challenging. Problems 9, 11, and 19 from the book The probabilistic method, by Alon and Spencer. Question

More information

1 Complete Statistics

1 Complete Statistics Complete Statistics February 4, 2016 Debdeep Pati 1 Complete Statistics Suppose X P θ, θ Θ. Let (X (1),..., X (n) ) denote the order statistics. Definition 1. A statistic T = T (X) is complete if E θ g(t

More information

Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics

Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics Data from one or a series of random experiments are collected. Planning experiments and collecting data (not discussed here). Analysis:

More information

McGill University Math 354: Honors Analysis 3

McGill University Math 354: Honors Analysis 3 Practice problems McGill University Math 354: Honors Analysis 3 not for credit Problem 1. Determine whether the family of F = {f n } functions f n (x) = x n is uniformly equicontinuous. 1st Solution: The

More information

1 Review of Probability

1 Review of Probability 1 Review of Probability Random variables are denoted by X, Y, Z, etc. The cumulative distribution function (c.d.f.) of a random variable X is denoted by F (x) = P (X x), < x

More information

SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions

SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu

More information

Part IA Probability. Theorems. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015

Part IA Probability. Theorems. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015 Part IA Probability Theorems Based on lectures by R. Weber Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly) after lectures.

More information

Generalization theory

Generalization theory Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1

More information

Lecture 9: October 25, Lower bounds for minimax rates via multiple hypotheses

Lecture 9: October 25, Lower bounds for minimax rates via multiple hypotheses Information and Coding Theory Autumn 07 Lecturer: Madhur Tulsiani Lecture 9: October 5, 07 Lower bounds for minimax rates via multiple hypotheses In this lecture, we extend the ideas from the previous

More information

Universal examples. Chapter The Bernoulli process

Universal examples. Chapter The Bernoulli process Chapter 1 Universal examples 1.1 The Bernoulli process First description: Bernoulli random variables Y i for i = 1, 2, 3,... independent with P [Y i = 1] = p and P [Y i = ] = 1 p. Second description: Binomial

More information

THE INVERSE FUNCTION THEOREM

THE INVERSE FUNCTION THEOREM THE INVERSE FUNCTION THEOREM W. PATRICK HOOPER The implicit function theorem is the following result: Theorem 1. Let f be a C 1 function from a neighborhood of a point a R n into R n. Suppose A = Df(a)

More information

δ xj β n = 1 n Theorem 1.1. The sequence {P n } satisfies a large deviation principle on M(X) with the rate function I(β) given by

δ xj β n = 1 n Theorem 1.1. The sequence {P n } satisfies a large deviation principle on M(X) with the rate function I(β) given by . Sanov s Theorem Here we consider a sequence of i.i.d. random variables with values in some complete separable metric space X with a common distribution α. Then the sample distribution β n = n maps X

More information

Chapter 4: Asymptotic Properties of the MLE

Chapter 4: Asymptotic Properties of the MLE Chapter 4: Asymptotic Properties of the MLE Daniel O. Scharfstein 09/19/13 1 / 1 Maximum Likelihood Maximum likelihood is the most powerful tool for estimation. In this part of the course, we will consider

More information

the convolution of f and g) given by

the convolution of f and g) given by 09:53 /5/2000 TOPIC Characteristic functions, cont d This lecture develops an inversion formula for recovering the density of a smooth random variable X from its characteristic function, and uses that

More information

REAL AND COMPLEX ANALYSIS

REAL AND COMPLEX ANALYSIS REAL AND COMPLE ANALYSIS Third Edition Walter Rudin Professor of Mathematics University of Wisconsin, Madison Version 1.1 No rights reserved. Any part of this work can be reproduced or transmitted in any

More information

Introduction to Topology

Introduction to Topology Introduction to Topology Randall R. Holmes Auburn University Typeset by AMS-TEX Chapter 1. Metric Spaces 1. Definition and Examples. As the course progresses we will need to review some basic notions about

More information

(Part 1) High-dimensional statistics May / 41

(Part 1) High-dimensional statistics May / 41 Theory for the Lasso Recall the linear model Y i = p j=1 β j X (j) i + ɛ i, i = 1,..., n, or, in matrix notation, Y = Xβ + ɛ, To simplify, we assume that the design X is fixed, and that ɛ is N (0, σ 2

More information

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3 Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest

More information

STAT 7032 Probability Spring Wlodek Bryc

STAT 7032 Probability Spring Wlodek Bryc STAT 7032 Probability Spring 2018 Wlodek Bryc Created: Friday, Jan 2, 2014 Revised for Spring 2018 Printed: January 9, 2018 File: Grad-Prob-2018.TEX Department of Mathematical Sciences, University of Cincinnati,

More information

Spring 2014 Advanced Probability Overview. Lecture Notes Set 1: Course Overview, σ-fields, and Measures

Spring 2014 Advanced Probability Overview. Lecture Notes Set 1: Course Overview, σ-fields, and Measures 36-752 Spring 2014 Advanced Probability Overview Lecture Notes Set 1: Course Overview, σ-fields, and Measures Instructor: Jing Lei Associated reading: Sec 1.1-1.4 of Ash and Doléans-Dade; Sec 1.1 and A.1

More information

Brownian Motion. Chapter Stochastic Process

Brownian Motion. Chapter Stochastic Process Chapter 1 Brownian Motion 1.1 Stochastic Process A stochastic process can be thought of in one of many equivalent ways. We can begin with an underlying probability space (Ω, Σ,P and a real valued stochastic

More information

9 Brownian Motion: Construction

9 Brownian Motion: Construction 9 Brownian Motion: Construction 9.1 Definition and Heuristics The central limit theorem states that the standard Gaussian distribution arises as the weak limit of the rescaled partial sums S n / p n of

More information

Lecture 8: Information Theory and Statistics

Lecture 8: Information Theory and Statistics Lecture 8: Information Theory and Statistics Part II: Hypothesis Testing and I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 23, 2015 1 / 50 I-Hsiang

More information

n E(X t T n = lim X s Tn = X s

n E(X t T n = lim X s Tn = X s Stochastic Calculus Example sheet - Lent 15 Michael Tehranchi Problem 1. Let X be a local martingale. Prove that X is a uniformly integrable martingale if and only X is of class D. Solution 1. If If direction:

More information

Chapter 1. Measure Spaces. 1.1 Algebras and σ algebras of sets Notation and preliminaries

Chapter 1. Measure Spaces. 1.1 Algebras and σ algebras of sets Notation and preliminaries Chapter 1 Measure Spaces 1.1 Algebras and σ algebras of sets 1.1.1 Notation and preliminaries We shall denote by X a nonempty set, by P(X) the set of all parts (i.e., subsets) of X, and by the empty set.

More information

Proving the central limit theorem

Proving the central limit theorem SOR3012: Stochastic Processes Proving the central limit theorem Gareth Tribello March 3, 2019 1 Purpose In the lectures and exercises we have learnt about the law of large numbers and the central limit

More information

ECE 275A Homework 7 Solutions

ECE 275A Homework 7 Solutions ECE 275A Homework 7 Solutions Solutions 1. For the same specification as in Homework Problem 6.11 we want to determine an estimator for θ using the Method of Moments (MOM). In general, the MOM estimator

More information

Gaussian vectors and central limit theorem

Gaussian vectors and central limit theorem Gaussian vectors and central limit theorem Samy Tindel Purdue University Probability Theory 2 - MA 539 Samy T. Gaussian vectors & CLT Probability Theory 1 / 86 Outline 1 Real Gaussian random variables

More information

4 Uniform convergence

4 Uniform convergence 4 Uniform convergence In the last few sections we have seen several functions which have been defined via series or integrals. We now want to develop tools that will allow us to show that these functions

More information

7 Convergence in R d and in Metric Spaces

7 Convergence in R d and in Metric Spaces STA 711: Probability & Measure Theory Robert L. Wolpert 7 Convergence in R d and in Metric Spaces A sequence of elements a n of R d converges to a limit a if and only if, for each ǫ > 0, the sequence a

More information

Bayesian estimation of the discrepancy with misspecified parametric models

Bayesian estimation of the discrepancy with misspecified parametric models Bayesian estimation of the discrepancy with misspecified parametric models Pierpaolo De Blasi University of Torino & Collegio Carlo Alberto Bayesian Nonparametrics workshop ICERM, 17-21 September 2012

More information