Draft. Advanced Probability Theory (Fall 2017). J. P. Kim, Dept. of Statistics. Last modified November 28, 2017.


Preface & Disclaimer

This note is a summary of the lecture Advanced Probability Theory A held at Seoul National University in Fall 2017. The lecturer was Minwoo Chae, and the note was written by J. P. Kim, a Ph.D. student. The textbooks and references for this course are the following:

- Weak Convergence and Empirical Processes: With Applications to Statistics, Van der Vaart & Wellner, Springer, 1996.
- Asymptotic Statistics, Van der Vaart, Cambridge University Press, 1998.

I also referred to the following books while writing this note; the list will be updated continuously:

- Convergence of Probability Measures, Billingsley, John Wiley & Sons, 2013.
- Lecture notes for Topics in Mathematics I, taught by Gerald Trutnau, Spring 2015.

Finally, some examples and motivation are complemented from my own lecture notes for Probability Theory I (Spring 2016) and Theory of Statistics II (Fall 2016), most of which are available online. To report typos or mistakes, please contact: joonpyokim@snu.ac.kr

Chapter 1: Stochastic Convergence

1.1 Motivation

Recall some basic results in asymptotics.

Theorem 1.1.1 (SLLN). Let $X_1, X_2, \dots, X_n$ be i.i.d. random variables with $E|X_1| < \infty$. Then
$$\frac{1}{n}\sum_{i=1}^n X_i \to EX_1 \quad \text{a.s.}$$

Theorem 1.1.2 (CLT). Let $X_1, X_2, \dots, X_n$ be i.i.d. random variables with $EX_1^2 < \infty$. Then
$$\sqrt{n}\Big(\frac{1}{n}\sum_{i=1}^n X_i - \mu\Big) \xrightarrow{d} N(0, \sigma^2),$$
where $\mu = EX_1$ and $\sigma^2 = EX_1^2 - \mu^2$.

From now on we will use the following notation. Let $(\Omega, \mathcal{A}, P)$ (or $(\Omega_i, \mathcal{A}_i, P_i)$) be an underlying probability space (or a sequence of them); $(\mathbb{D}, d)$ a metric space; $\mathcal{D} = \mathcal{B}(\mathbb{D})$ the Borel $\sigma$-algebra of $\mathbb{D}$; $C_b(\mathbb{D})$ the set of all bounded continuous real functions on $\mathbb{D}$; and $X$ ($X_n$, resp.) a map from $\Omega$ ($\Omega_i$, resp.) to $\mathbb{D}$, not necessarily measurable.

Remark 1.1.3. Note that the LLN and CLT also hold for the $f(X_i)$'s, i.e.,
$$\frac{1}{n}\sum_{i=1}^n f(X_i) \to Ef(X_1) \quad \text{$P$-a.s.}$$
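As a brief aside (not part of the original notes), Theorems 1.1.1 and 1.1.2 can be checked by simulation; the distribution, sample sizes, and seed below are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 2.0  # mean and sd of Exponential(scale=2)

# SLLN: the sample mean of n i.i.d. draws approaches mu = 2
n = 100_000
sample_mean = rng.exponential(scale=mu, size=n).mean()

# CLT: sqrt(n)*(mean - mu) is approximately N(0, sigma^2), so across many
# replications its empirical standard deviation should be close to sigma
n_clt, reps = 10_000, 400
z = np.sqrt(n_clt) * (rng.exponential(mu, size=(reps, n_clt)).mean(axis=1) - mu)
print(sample_mean, z.std())
```

The printed values should be close to $\mu = 2$ and $\sigma = 2$ respectively.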

4 and n d fx i EfX i N0, σ2 f holds for σ 2 f = varfx, provided that E[fX 2 ] <. Our question in this course is: For a class F of real functions. do LLN and CLT hold uniformly in some sense? For example, does it hold P-a.s. or in probability? For finite f,, f k, n sup n f F fx i EfX f X i Ef X,, n 0 f k X i Ef k X converges weakly to MVN. How the convergence of infinite dimensional joint net can be defined? n f X i Ef X For this we see more general notion of weak convergence here. Definition..4. Let P n, P be Borel probability measures on D, D, d. Then i P n converges weakly to P, denoted as P n P, iff D w fdp n D f F fdp f C b D. ii If X n and X are D-valued random variables with laws P n and P respectively, then X n converges weakly to X, denoted as X n X, iff P n w w P. For weak convergence of., we may use the definition..4. For this,. should be embedded into a metric space. Example..5. Let Ω n, A n, P n = [0, ], B, λ, and F = { [0,t] : 0 t } D[0, ],. where B = B[0, ] is a Borel σ-algebra on [0, ] and λ denotes the Lebesgue measure. Then. can be viewed as a D[0, ]-valued random variables. A natural metric on D[0, ] is the uniform metric 3

defined as
$$d(f_1, f_2) = \sup_{t \in [0,1]} |f_1(t) - f_2(t)|, \qquad f_1, f_2 \in D[0,1].$$
However, under this metric $D[0,1]$ is not separable, which makes the space too large to work with. Furthermore, under this metric, (1.1) may not even be measurable.

Proposition 1.1.6. The map $X : [0,1] \to D[0,1]$ defined by $X(\omega) = 1_{[\omega, 1]}$ is NOT Borel measurable with respect to the uniform metric.

Proof. Let $B_s$ be the open ball of radius $1/2$ in $D[0,1]$ centered at $1_{[s,1]}$. Then $G = \bigcup_{s \in S} B_s$ is an open set in $D[0,1]$ for any $S \subset [0,1]$. However, note that $X(\omega) \in B_s$ if and only if $\omega = s$, and hence $X^{-1}(G) = S$ holds. If $X$ were Borel measurable, then every subset $S$ of $[0,1]$ would be Borel measurable, a contradiction. (Figure 1.1 illustrates the proof.)

To handle this issue, we may consider some alternative approaches:

1. Consider a weaker $\sigma$-algebra, such as the ball $\sigma$-algebra, i.e., the $\sigma$-algebra generated by all open balls. If the space is separable, the ball $\sigma$-algebra coincides with the Borel $\sigma$-algebra. Note that with a smaller $\sigma$-algebra, the measurability condition becomes weaker.

2. Consider a weaker metric. This is one typical approach for dealing with empirical processes, using Skorokhod's metric. Under the Skorokhod metric, $D[0,1]$ becomes separable, and it is well known that there exists a metric equivalent to the Skorokhod metric making $D[0,1]$ also complete (Billingsley).

3. Drop the measurability requirement, i.e., extend the notion of weak convergence to non-measurable maps. We shall focus on this approach in this course.

1.2 Outer Integral

From now on, let $(\Omega, \mathcal{A}, P)$ be an underlying probability space. Also, let $T : \Omega \to \bar{\mathbb{R}} = [-\infty, \infty]$ be an arbitrary map, not necessarily measurable, and $B \subset \Omega$ an arbitrary set, not necessarily

measurable.

Definition 1.2.1.
(i) The outer integral of $T$ w.r.t. $P$ is defined as
$$E^* T := \inf\{ EU : U \ge T,\ U : \Omega \to \bar{\mathbb{R}} \text{ is measurable and } EU \text{ exists} \},$$
where "$EU$ exists" means $EU^+ < \infty$ or $EU^- < \infty$ (note that the expectation can be defined in every case except $\infty - \infty$).
(ii) The outer probability of $B$ is
$$P^*(B) = \inf\{ P(A) : A \supset B,\ A \in \mathcal{A} \}.$$
(iii) The inner integral of $T$ w.r.t. $P$ is defined as $E_* T = -E^*(-T)$.
(iv) The inner probability of $B$ is $P_*(B) = 1 - P^*(\Omega \setminus B)$.

Remark 1.2.2. Note that the definitions in (iii) and (iv) are equivalent to the following (by arguments similar to (i) and (ii)):
$$E_* T = \sup\{ EU : U \le T,\ U : \Omega \to \bar{\mathbb{R}} \text{ is measurable and } EU \text{ exists} \}$$
and
$$P_*(B) = \sup\{ P(A) : A \subset B,\ A \in \mathcal{A} \}.$$

It is well known that a measurable map attaining the infimum in (i) always exists, provided its expectation exists.

Lemma 1.2.3. For any map $T : \Omega \to \bar{\mathbb{R}}$, there exists a measurable map $T^* : \Omega \to \bar{\mathbb{R}}$ with
(i) $T^* \ge T$;
(ii) $T^* \le U$ $P$-a.s. for any measurable $U : \Omega \to \bar{\mathbb{R}}$ with $U \ge T$ $P$-a.s.
Furthermore, such a $T^*$ is unique up to $P$-null sets, and $E^* T = E T^*$ provided that $ET^*$ exists.

Definition 1.2.4. The function $T^*$ is called the minimal measurable majorant of $T$. Similarly, the maximal measurable minorant $T_*$ can be defined via $T_* = -(-T)^*$.
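As a small illustration (added here, not in the original notes), outer and inner probabilities can be computed by brute force on a toy finite space; the space and measure below are arbitrary choices for the sketch:

```python
from itertools import combinations

# Toy space Omega = {0, 1, 2} with the sigma-algebra generated by the
# partition {{0}, {1, 2}}, so the singleton {1} is NOT measurable.
atoms = [frozenset({0}), frozenset({1, 2})]
prob = {frozenset({0}): 0.3, frozenset({1, 2}): 0.7}

def sigma_algebra(atoms):
    # all unions of atoms (including the empty union)
    sets = []
    for r in range(len(atoms) + 1):
        for combo in combinations(atoms, r):
            sets.append(frozenset().union(*combo) if combo else frozenset())
    return sets

def P(A):
    return sum(prob[a] for a in atoms if a <= A)

def outer(B):
    # P*(B) = inf{P(A) : A measurable, A contains B}
    return min(P(A) for A in sigma_algebra(atoms) if B <= A)

def inner(B):
    # P_*(B) = sup{P(A) : A measurable, A inside B}
    return max(P(A) for A in sigma_algebra(atoms) if A <= B)

B = frozenset({1})
print(outer(B), inner(B))  # smallest cover {1,2} gives 0.7; only {} fits inside
```

The gap $P^*(B) = 0.7 > 0 = P_*(B)$ shows how non-measurable sets behave.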

There are several similarities between the outer integral and the ordinary one. Many concepts and propositions of probability theory can be extended to outer-probability statements. However, there are also several statements that do not hold in the outer-measure version. One example is Fubini's theorem; only the following one-sided version survives.

Lemma 1.2.5 (Fubini's theorem for outer integrals). Let $T$ be a real-valued function on the product space $(\Omega_1 \times \Omega_2, \mathcal{A}_1 \otimes \mathcal{A}_2, P_1 \times P_2)$. Then
$$E_* T \le E_{1*} E_{2*} T \le E_1^* E_2^* T \le E^* T,$$
where $E_2^*$ is defined as
$$E_2^* T(\omega_1) = \inf\{ E_2 U : U(\omega_2) \ge T(\omega_1, \omega_2)\ \forall \omega_2,\ U : \Omega_2 \to \bar{\mathbb{R}} \text{ is measurable and } E_2 U \text{ exists} \}$$
for $\omega_1 \in \Omega_1$, and similarly for the other iterated integrals.

Now we extend the notion of weak convergence to non-measurable maps.

1.3 Weak Convergence

Definition 1.3.1.
(i) A Borel probability measure $L$ on $\mathbb{D}$ is tight if for every $\epsilon > 0$ there exists a compact set $K$ with $L(K) \ge 1 - \epsilon$.
(ii) A Borel measurable map $X : \Omega \to \mathbb{D}$ is tight if the law of $X$, $L_X := P \circ X^{-1}$, is tight.
(iii) $L$ (or $X$) is separable if there exists a separable measurable set with probability 1, i.e., a separable measurable set $A \subset \mathbb{D}$ s.t. $L(A) = 1$ (or $P(X \in A) = 1$).

Lemma 1.3.2.
(i) If $L$ (or $X$) is tight, then $L$ (or $X$) is separable.
(ii) The converse is true if $\mathbb{D}$ is complete. That is, given that $\mathbb{D}$ is complete, separability of $L$ (or $X$) implies tightness.

Now we are ready to define weak convergence of arbitrary maps $X_n$.

Definition 1.3.3 (Weak Convergence). Let $(\Omega_n, \mathcal{A}_n, P_n)$ be a sequence of probability spaces and $X_n : \Omega_n \to \mathbb{D}$ arbitrary maps (possibly non-measurable). Then $X_n$ is said to converge weakly to a Borel measure $L$, denoted $X_n \xrightarrow{w} L$, if
$$E^* f(X_n) \to \int f\,dL \qquad \forall f \in C_b(\mathbb{D}).$$

Furthermore, if there is a Borel measurable map $X$ with law $L$, i.e., $L_X = L$, then we write $X_n \xrightarrow{w} X$.

Analogues of the classical results on weak convergence of measurable maps continue to hold.

Theorem 1.3.4 (Portmanteau). The following are equivalent:
(i) $X_n \xrightarrow{w} L$;
(ii) $\liminf_n P_*(X_n \in G) \ge L(G)$ for every open set $G$;
(iii) $\limsup_n P^*(X_n \in F) \le L(F)$ for every closed set $F$;
(iv) $\liminf_n E_* f(X_n) \ge \int f\,dL$ for every $f$ that is l.s.c. and bounded below;
(v) $\limsup_n E^* f(X_n) \le \int f\,dL$ for every $f$ that is u.s.c. and bounded above;
(vi) $\lim P^*(X_n \in B) = \lim P_*(X_n \in B) = L(B)$ for every $L$-continuity set $B$ (i.e., $L(\partial B) = 0$);
(vii) $\liminf_n E_* f(X_n) \ge \int f\,dL$ for every $f$ that is bounded, Lipschitz continuous, and nonnegative.

Recall that a function $f$ is lower semicontinuous (l.s.c.) if
$$\liminf_{x \to x_0} f(x) \ge f(x_0),$$
and upper semicontinuity is defined analogously.

Our first important result is the continuous mapping theorem.

Theorem 1.3.5 (Continuous mapping theorem). Let $(\mathbb{D}, d)$ and $(\mathbb{E}, e)$ be metric spaces and $g : \mathbb{D} \to \mathbb{E}$ be continuous at every point of a set $\mathbb{D}_0 \subset \mathbb{D}$. If $X_n \xrightarrow{w} X$ and $X$ takes its values in $\mathbb{D}_0$, then $g(X_n) \xrightarrow{w} g(X)$.

Next to the continuous mapping theorem, Prokhorov's theorem (or Helly's selection principle, in a special case) is the most important theorem on weak convergence. To formulate the result, two new concepts are needed.

Definition 1.3.6.
(i) $\{X_n\}$ is asymptotically measurable if
$$E^* f(X_n) - E_* f(X_n) \to 0 \qquad \forall f \in C_b(\mathbb{D}).$$
(ii) $\{X_n\}$ is asymptotically tight if
$$\forall \epsilon > 0\ \exists \text{ compact } K \text{ s.t. } \liminf_n P_*(X_n \in K^\delta) \ge 1 - \epsilon \quad \forall \delta > 0,$$
where $K^\delta := \{ y \in \mathbb{D} : d(y, K) < \delta \}$ is the $\delta$-enlargement of $K$.
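As a quick aside (added here, not in the original notes), the one-sided inequalities in the Portmanteau theorem can be seen in the simplest example $X_n = 1/n$, which converges weakly to the point mass at $0$:

```python
import numpy as np

# X_n = 1/n converges weakly to L = delta_0. For the open set G = (0, 2),
# P(X_n in G) = 1 for every n while L(G) = 0, so the liminf inequality in
# (ii) can be strict. For the closed set F = {0}, P(X_n in F) = 0 while
# L(F) = 1, matching the limsup inequality in (iii).
ns = np.arange(1, 101)
xn = 1.0 / ns

p_open = np.mean((xn > 0) & (xn < 2))   # fraction of n with X_n in (0, 2)
p_closed = np.mean(xn == 0.0)           # fraction of n with X_n in {0}
print(p_open, p_closed)
```

This is exactly why the open/closed-set characterizations use inequalities rather than equalities.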

Remark 1.3.7. A collection of Borel measurable maps $X_n$ is uniformly tight if
$$\forall \epsilon > 0\ \exists \text{ compact } K \text{ s.t. } \inf_n P(X_n \in K) \ge 1 - \epsilon.$$
An equivalent definition is obtained if $\inf$ in the last display is replaced by $\liminf$. The $\delta$ in the definition of asymptotic tightness may seem a bit overdone (it enlarges the set $K$), but nothing is gained in simple cases:

Proposition 1.3.8. If $\mathbb{D}$ is separable and complete, then uniform tightness and asymptotic tightness coincide for measurable maps.

The following result can be useful to verify asymptotic measurability or tightness.

Lemma 1.3.9.
(i) If $X_n \xrightarrow{w} X$, then $X_n$ is asymptotically measurable.
(ii) If $X_n \xrightarrow{w} X$, then $X_n$ is asymptotically tight if and only if $X$ is tight.

Now we are ready to state Prokhorov's theorem.

Theorem 1.3.10 (Prokhorov).
(i) If $\{X_n\}$ is asymptotically tight and asymptotically measurable, then $\{X_n\}$ is relatively compact, i.e., every subsequence $\{X_{n'}\}$ has a further subsequence $\{X_{n''}\}$ converging weakly to a tight Borel law.
(ii) A relatively compact collection $\{X_n\}$ of Borel measurable maps is asymptotically tight if $\mathbb{D}$ is a Polish space (i.e., separable and complete).

Remark 1.3.11. By the previous theorem, for Borel measures on a Polish space, the concepts relatively compact, asymptotically tight, and uniformly tight are all equivalent.

Our final extension is a Slutsky-type result:

Lemma 1.3.12. Let $X_n \xrightarrow{w} X$ and $Y_n \xrightarrow{w} c$, where $c$ is a constant and $X$ has a separable Borel law. Then $(X_n, Y_n) \xrightarrow{w} (X, c)$.

Corollary 1.3.13. Let $X_n$ and $X$ take values in a separable Banach space (a topological vector space, so that addition and scalar multiplication are defined and are continuous operators), and let $Y_n$ and $c$ be scalars. By the continuous mapping theorem we then get
$$X_n + Y_n \xrightarrow{w} X + c$$

and
$$X_n Y_n \xrightarrow{w} cX.$$
Furthermore, if $c \ne 0$, we also obtain $X_n / Y_n \xrightarrow{w} X / c$.

1.4 Spaces of Bounded Functions

Definition 1.4.1. Let $T$ be an arbitrary set. Then the space $\ell^\infty(T)$ is defined as
$$\ell^\infty(T) = \{ \text{all functions } f : T \to \mathbb{R} \text{ s.t. } \|f\|_\infty < \infty \},$$
where $\|f\|_\infty = \sup_{t \in T} |f(t)|$. It is well known that $\ell^\infty(T)$ is a Banach space.

Definition 1.4.2 (Stochastic Process). A collection $\{X(t) : t \in T\}$ of random variables defined on the same probability space $(\Omega, \mathcal{A}, P)$ is called a stochastic process.

Note that if every sample path $t \mapsto X(t, \omega)$ belongs to $\ell^\infty(T)$, i.e., every sample path is bounded, then $X$ can be viewed as a random map from $\Omega$ to $\ell^\infty(T)$. For an arbitrary map $X : \Omega \to \ell^\infty(T)$, it is natural to call a finite-dimensional projection $(X(t_1), X(t_2), \dots, X(t_k))$, for $t_1, t_2, \dots, t_k \in T$, a marginal. Our interest is to characterize weak convergence of a sequence of random maps $X_n$ through asymptotic tightness and marginals. Before starting, we introduce the following two lemmas, which will be used.

Lemma 1.4.3. Let $X_n : \Omega_n \to \ell^\infty(T)$ be asymptotically tight. Then $X_n$ is asymptotically measurable if and only if $X_n(t)$ is asymptotically measurable for every $t \in T$. This implies that every stochastic process is asymptotically measurable, since each marginal is a random variable and hence measurable.

Lemma 1.4.4. Let $X, Y$ be tight Borel measurable maps into $\ell^\infty(T)$. Then $L_X = L_Y$ if and only if all corresponding marginals are equal in law. That is, for tight measurable maps, the laws of the marginals determine the joint law.

Now we are ready to introduce our first result.

Theorem 1.4.5. Let $X_n : \Omega_n \to \ell^\infty(T)$, $n = 1, 2, \dots$, be arbitrary maps. Then $X_n$ converges weakly to a tight limit if and only if
(1) $X_n$ is asymptotically tight;
(2) every marginal converges weakly to a limit.

Proof. ($\Rightarrow$) (1) is trivial from Lemma 1.3.9. Next, note that, for any fixed $t_1, t_2, \dots, t_k \in T$, the projection
$$g : \ell^\infty(T) \to \mathbb{R}^k, \qquad z \mapsto (z(t_1), z(t_2), \dots, z(t_k))$$
is a continuous function on $\ell^\infty(T)$. Thus the continuous mapping theorem implies (2).

($\Leftarrow$) Let $t \in T$ be arbitrarily chosen. Then condition (2) implies that $X_n(t)$ is asymptotically measurable by Lemma 1.3.9. Since $t \in T$ was arbitrary, $X_n$ is asymptotically measurable by Lemma 1.4.3. Then by Prokhorov's theorem, every subsequence $\{n'\} \subset \{n\}$ has a further subsequence $\{n''\} \subset \{n'\}$ along which $X_{n''}$ converges weakly. If all such limits are equal, then $X_n$ converges weakly; this follows from convergence of every marginal (condition (2)) and Lemma 1.4.4. In detail, for any subsequence $\{n'\} \subset \{n\}$, there exist a further subsequence $\{n''\} \subset \{n'\}$ and a limit $Y_{n''}$ such that $X_{n''} \xrightarrow{w} Y_{n''}$. Note that $Y_{n''}$ is tight by condition (1) and Lemma 1.3.9, and by Lemma 1.4.4 every $Y_{n''}$ has the same law, for any choice of subsequence $\{n'\}$. Let $X$ be a tight r.v. with this common law. Then every subsequence $\{n'\} \subset \{n\}$ has a further subsequence $\{n''\} \subset \{n'\}$ s.t. $X_{n''} \xrightarrow{w} X$, and therefore $X_n \xrightarrow{w} X$. $\square$

Theorem 1.4.5 says that weak convergence of a sequence of random maps is equivalent to asymptotic tightness plus marginal convergence. Marginal convergence can be established by any of the well-known methods for proving weak convergence on Euclidean space. Asymptotic tightness can be given a more concrete form, either through finite approximation or through an (essentially) Arzelà–Ascoli characterization. The second approach is related to asymptotic continuity of the sample paths.

Definition 1.4.6. A map $\rho : T \times T \to \mathbb{R}$ is called a semimetric (or pseudometric) if
(1) $\rho(x, y) \ge 0$, and $x = y$ implies $\rho(x, y) = 0$;
(2) $\rho(x, y) = \rho(y, x)$;
(3) $\rho(x, z) \le \rho(x, y) + \rho(y, z)$.
It need not satisfy $\rho(x, y) = 0 \Rightarrow x = y$.
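As a concrete instance of Definition 1.4.6 (added here, not in the original notes), $\rho(s, t) = |s^2 - t^2|$ on $\mathbb{R}$ is a semimetric but not a metric: it is of the form $|f(s) - f(t)|$ for $f(x) = x^2$, so symmetry and the triangle inequality hold, yet $\rho(1, -1) = 0$ with $1 \ne -1$. A brute-force check on a small grid:

```python
import itertools

def rho(s, t):
    # semimetric |s^2 - t^2| = |f(s) - f(t)| with f(x) = x^2
    return abs(s * s - t * t)

pts = [-2.0, -1.0, 0.0, 0.5, 1.0, 3.0]
triangle_ok = all(rho(x, z) <= rho(x, y) + rho(y, z) + 1e-12
                  for x, y, z in itertools.product(pts, repeat=3))
print(triangle_ok, rho(1.0, -1.0))  # triangle inequality holds, but rho(1,-1)=0
```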

Definition 1.4.7. Let $X_n : \Omega_n \to \ell^\infty(T)$ be a sequence of maps and $\rho$ a semimetric on $T$. Then $X_n$ is called asymptotically uniformly $\rho$-equicontinuous in probability if
$$\forall \epsilon, \eta > 0\ \exists \delta > 0 \text{ s.t. } \limsup_n P^*\Big( \sup_{\rho(s,t) < \delta} |X_n(s) - X_n(t)| > \epsilon \Big) < \eta.$$

Recall that a collection $\{f_n : T \to \mathbb{R}\}$ of functions is uniformly equicontinuous if
$$\forall \epsilon > 0\ \exists \delta > 0 \text{ s.t. } \sup_{\rho(s,t) < \delta} |f_n(s) - f_n(t)| < \epsilon \text{ uniformly in } n.$$
Definition 1.4.7 modifies this notion slightly so that it holds "in probability". Now we are ready to see some equivalent conditions for asymptotic tightness, which is one of the goals of this section.

Theorem 1.4.8. The following are equivalent:
(i) $X_n$ is asymptotically tight.
(ii) (1) $X_n(t)$ is asymptotically tight for every $t \in T$; and (2) there exists a semimetric $\rho$ on $T$ s.t. $(T, \rho)$ is totally bounded and $X_n$ is asymptotically uniformly $\rho$-equicontinuous in probability.
(iii) (1) holds, together with
(3) $\forall \epsilon, \eta > 0\ \exists$ a finite partition $\{T_1, \dots, T_k\}$ of $T$ s.t.
$$\limsup_n P^*\Big( \max_i \sup_{s,t \in T_i} |X_n(s) - X_n(t)| > \epsilon \Big) < \eta. \qquad (1.2)$$

Remark 1.4.9. (ii) is related to an Arzelà–Ascoli characterization of the space, while (iii) is related to finite approximation of the index set $T$. (iii) means that for any $\epsilon > 0$, $T$ can be partitioned into finitely many subsets $T_i$ such that, asymptotically, the variation of the sample paths $t \mapsto X(t)$ is less than $\epsilon$ on every $T_i$.

Proof. (i) $\Rightarrow$ (ii). First we show (1). Let $\pi_t : x \mapsto x(t)$ be the coordinate projection. Given $\epsilon > 0$, there exists a compact set $K$ s.t.
$$\liminf_n P_*(X_n \in K^\delta) > 1 - \epsilon \quad \forall \delta > 0.$$
From
$$a \in K^\delta \Rightarrow \exists b \in K \text{ s.t. } \|b - a\|_\infty < \delta \Rightarrow |\pi_t(b) - \pi_t(a)| \le \|b - a\|_\infty < \delta \Rightarrow \pi_t(a) \in (\pi_t K)^\delta,$$

we get
$$\liminf_n P_*\big( X_n(t) \in (\pi_t K)^\delta \big) \ge \liminf_n P_*(X_n \in K^\delta) > 1 - \epsilon \quad \forall \delta > 0.$$
Since $\pi_t$ is continuous, $\pi_t(K)$ is compact, and it is the desired compact set.

Now we show (2). Let $K_1 \subset K_2 \subset \cdots$ be a sequence of compact subsets of $\ell^\infty(T)$ satisfying
$$\liminf_n P_*\big( X_n \in K_m^\delta \big) \ge 1 - \tfrac{1}{m} \quad \forall \delta > 0 \qquad \text{(asymptotic tightness)}.$$
For each $m$, define $\rho_m$ by
$$\rho_m(s, t) = \sup_{z \in K_m} |z(s) - z(t)|.$$

Claim 1. $(T, \rho_m)$ is totally bounded.

Remark 1.4.10. Total boundedness means that for every $\epsilon > 0$, $T$ can be covered by finitely many radius-$\epsilon$ balls w.r.t. $\rho$. Equivalently: for every $\epsilon > 0$ there exists a finite subset of $T$ whose distance from any element of $T$ is less than $\epsilon$.

Proof of Claim 1. For given $\eta > 0$, choose $z_1, z_2, \dots, z_k \in \ell^\infty(T)$ s.t.
$$K_m \subset \bigcup_{j=1}^k B_\eta(z_j),$$
which is possible by compactness of $K_m$. Since each $z_j$ is a bounded function, the set
$$A := \{ (z_1(t), \dots, z_k(t)) : t \in T \} \subset \mathbb{R}^k$$
is bounded, hence totally bounded (a totally bounded set is bounded, and the converse also holds in Euclidean space), and so there exist $t_1, t_2, \dots, t_p \in T$ s.t.
$$A \subset \bigcup_{i=1}^p B_\eta\big( (z_1(t_i), \dots, z_k(t_i)) \big).$$
This gives that for any $t \in T$ there exists $t_i$ s.t.
$$(z_1(t), \dots, z_k(t)) \in B_\eta\big( (z_1(t_i), \dots, z_k(t_i)) \big),$$
and hence
$$\rho_m(t, t_i) = \sup_{z \in K_m} |z(t) - z(t_i)| \le \sup_{z \in K_m} \min_{1 \le j \le k} \Big( \underbrace{|z(t) - z_j(t)|}_{\le \|z - z_j\|_\infty} + |z_j(t) - z_j(t_i)| + \underbrace{|z_j(t_i) - z(t_i)|}_{\le \|z - z_j\|_\infty} \Big)$$

$$\le 2 \sup_{z \in K_m} \min_{1 \le j \le k} \|z - z_j\|_\infty + \max_{1 \le j \le k} |z_j(t) - z_j(t_i)| \le \underbrace{2\eta}_{\text{def. of } \{z_j\}} + \underbrace{\eta}_{\text{def. of } \{t_i\}} = 3\eta.$$
In summary: $\forall \eta > 0\ \exists \{t_1, \dots, t_p\}$ s.t. every $t \in T$ has some $t_i$ with $\rho_m(t, t_i) \le 3\eta$, which gives total boundedness of $(T, \rho_m)$. $\square$ (Claim 1)

Claim 2. $(T, \rho)$ is totally bounded, where
$$\rho(s, t) = \sum_{m=1}^\infty 2^{-m} \big( \rho_m(s, t) \wedge 1 \big)$$
(the truncation by $1$ ensures the series converges).

Proof of Claim 2. Note that $\rho_m$ increases in $m$, since the $K_m$ increase. For $\eta > 0$, take $m$ s.t. $2^{-m} < \eta$. Then by Claim 1, there exist $\{t_1, t_2, \dots, t_p\}$ s.t. $T \subset \bigcup_{i=1}^p B_\eta(t_i; \rho_m)$. Then for every $t \in T$ there is $t_i$ s.t. $\rho_m(t, t_i) < \eta$, and so
$$\rho(t, t_i) \le \sum_{k=1}^m 2^{-k} \underbrace{\big(\rho_k(t, t_i) \wedge 1\big)}_{\le \rho_m(t, t_i) < \eta} + \underbrace{\sum_{k=m+1}^\infty 2^{-k}}_{= 2^{-m} < \eta} \le \eta + \eta = 2\eta.$$
Hence $\forall \eta > 0\ \exists \{t_1, \dots, t_p\}$ s.t. every $t \in T$ has some $t_i$ with $\rho(t, t_i) \le 2\eta$, which gives that $(T, \rho)$ is totally bounded. $\square$ (Claim 2)

Claim 3. $X_n$ is asymptotically uniformly $\rho$-equicontinuous in probability.

Proof of Claim 3. Let $\epsilon > 0$. If $\|z - z_0\|_\infty < \epsilon$ for some $z_0 \in K_m$, then
$$|z(s) - z(t)| \le \underbrace{|z(s) - z_0(s)|}_{< \epsilon} + \underbrace{|z_0(s) - z_0(t)|}_{\le \rho_m(s,t)} + \underbrace{|z_0(t) - z(t)|}_{< \epsilon} \le 2\epsilon + \rho_m(s, t).$$

If $\rho(s, t) < 2^{-m}\epsilon$ (taking $\epsilon < 1$ without loss of generality), then $\rho_m(s, t) \wedge 1 \le 2^m \rho(s, t) < \epsilon \le 1$, which gives $\rho_m(s, t) < \epsilon$. Thus
$$z \in K_m^\epsilon \Rightarrow \exists z_0 \in K_m \text{ s.t. } \|z - z_0\|_\infty < \epsilon \Rightarrow |z(s) - z(t)| \le 2\epsilon + \rho_m(s, t) \le 3\epsilon$$
provided that $\rho(s, t) < 2^{-m}\epsilon$. Therefore,
$$K_m^\epsilon \subset \Big\{ z : \sup_{\rho(s,t) < 2^{-m}\epsilon} |z(s) - z(t)| \le 3\epsilon \Big\}.$$
Now, letting $\delta < 2^{-m}\epsilon$, we get
$$\liminf_n P_*\Big( \sup_{\rho(s,t) < \delta} |X_n(s) - X_n(t)| \le 3\epsilon \Big) \ge \liminf_n P_*\big( X_n \in K_m^\epsilon \big) \ge 1 - \tfrac{1}{m}.$$
In summary: $\forall m \in \mathbb{N}\ \forall \epsilon > 0\ \exists \delta > 0$ s.t.
$$\liminf_n P_*\Big( \sup_{\rho(s,t) < \delta} |X_n(s) - X_n(t)| \le 3\epsilon \Big) \ge 1 - \tfrac{1}{m},$$
which implies that $X_n$ is asymptotically uniformly $\rho$-equicontinuous in probability. $\square$ (Claim 3)

(ii) $\Rightarrow$ (iii). By the assumption, given $\epsilon, \eta > 0$, there exists $\delta > 0$ s.t.
$$\limsup_n P^*\Big( \sup_{\rho(s,t) < \delta} |X_n(s) - X_n(t)| > \epsilon \Big) < \eta.$$
Since $(T, \rho)$ is totally bounded, there exists a finite set $\{t_1, t_2, \dots, t_p\} \subset T$ s.t.
$$T \subset \bigcup_{i=1}^p B_{\delta/2}(t_i; \rho).$$
Now, letting $T_i = B_{\delta/2}(t_i; \rho)$ (disjointified so that the $T_i$ partition $T$), we get
$$s, t \in T_i \Rightarrow \rho(s, t) \le \rho(s, t_i) + \rho(t_i, t) < \delta,$$
and therefore
$$\sup_{s,t \in T_i} |z(s) - z(t)| \le \sup_{\rho(s,t) < \delta} |z(s) - z(t)|$$
for any $i = 1, 2, \dots, p$. This implies the conclusion:
$$\limsup_n P^*\Big( \max_i \sup_{s,t \in T_i} |X_n(s) - X_n(t)| > \epsilon \Big) \le \limsup_n P^*\Big( \sup_{\rho(s,t) < \delta} |X_n(s) - X_n(t)| > \epsilon \Big) < \eta.$$

(iii) $\Rightarrow$ (i). Suppose that, for given $\epsilon, \eta > 0$, (1.2) holds. Note that, for a fixed $t_i \in T_i$,
$$\sup_{s,t \in T_i} |X_n(s) - X_n(t)| \le \epsilon \Rightarrow \sup_{s \in T_i} |X_n(s)| \le \sup_{s \in T_i} |X_n(s) - X_n(t_i)| + |X_n(t_i)| \le |X_n(t_i)| + \epsilon,$$
and hence
$$\liminf_n P_*\Big( \|X_n\|_\infty \le \max_{i \le p} |X_n(t_i)| + \epsilon \Big) \ge \liminf_n P_*\Big( \max_{i \le p} \sup_{s,t \in T_i} |X_n(s) - X_n(t)| \le \epsilon \Big) \ge 1 - \eta.$$
This implies that $\|X_n\|_\infty$ is asymptotically tight. Why? First note that, by asymptotic tightness of the marginals, for each $i$ there exists $M_i > 0$ s.t.
$$\liminf_n P_*\big( |X_n(t_i)| < M_i + \epsilon \big) \ge 1 - \eta \quad \forall \epsilon > 0.$$
Letting $M = \max_i M_i$, we get
$$\limsup_n P^*\Big( \max_i |X_n(t_i)| \ge M + \epsilon \Big) \le \sum_{i=1}^p \limsup_n P^*\big( |X_n(t_i)| \ge M + \epsilon \big) \le p\eta \quad \forall \epsilon > 0,$$
i.e., $\max_i |X_n(t_i)|$ is asymptotically tight. Now let $K \subset \mathbb{R}$ be a compact set s.t.
$$\liminf_n P_*\Big( \max_i X_n(t_i) \in K^\epsilon \Big) \ge 1 - \eta \quad \forall \epsilon > 0.$$
Then
$$\liminf_n P_*\Big( \|X_n\|_\infty \le \max_i |X_n(t_i)| + \epsilon,\ \max_i X_n(t_i) \in K^\epsilon \Big) \ge 1 - 2\eta,$$
which implies
$$\liminf_n P_*\big( \|X_n\|_\infty \in \tilde{K}^{3\epsilon} \big) \ge 1 - 2\eta$$
for a suitable compact $\tilde{K} \subset \mathbb{R}$, i.e., $\|X_n\|_\infty$ is asymptotically tight.

Now, let $\zeta > 0$ and a sequence $\epsilon_m \downarrow 0$ be given. Choose $M > 0$ s.t.
$$\limsup_n P^*\big( \|X_n\|_\infty > M \big) \le \zeta.$$

For $\epsilon_m$ and $\eta = 2^{-m}\zeta$, let $T = \bigcup_{i=1}^{k_m} T_{m,i}$ be a partition satisfying
$$\limsup_n P^*\Big( \max_{i \le k_m} \sup_{s,t \in T_{m,i}} |X_n(s) - X_n(t)| > \epsilon_m \Big) < 2^{-m}\zeta.$$
Now, let $\{z_{m,1}, z_{m,2}, \dots, z_{m,p_m}\}$ be the set of all functions in $\ell^\infty(T)$ that are constant on each $T_{m,i}$, taking values in
$$\{ 0, \pm\epsilon_m, \pm 2\epsilon_m, \dots, \pm \lceil M/\epsilon_m \rceil \epsilon_m \}.$$
(Figure 1.2 illustrates the functions $z_{m,i}$ and how they approximate elements of $\ell^\infty(T)$.) By construction, for each $m$, letting
$$K_m = \bigcup_{i=1}^{p_m} B_{\epsilon_m}(z_{m,i}) \quad \text{and} \quad K = \bigcap_{m=1}^\infty K_m,$$
we have
$$\|X_n\|_\infty \le M \ \text{ and } \ \max_i \sup_{s,t \in T_{m,i}} |X_n(s) - X_n(t)| < \epsilon_m \quad \Longrightarrow \quad X_n \in K_m. \qquad (1.3)$$
Since $K$ is closed (and hence complete, as a closed subset of the complete space $\ell^\infty(T)$) and totally bounded, it is compact. Thus our claim is:

Claim. $\forall \delta > 0\ \exists m$ s.t. $K^\delta \supset \bigcap_{i=1}^m K_i$.

Proof of Claim. Assume not. Then there exists $\delta > 0$ s.t. for every $m$, $K^\delta \not\supset \bigcap_{i=1}^m K_i$. That is, there exists $z_m \in \bigcap_{i=1}^m K_i$ with $z_m \notin K^\delta$. Now we use an Arzelà–Ascoli-type argument. Note that $\{z_n\} \subset K_1 = \bigcup_{i=1}^{p_1} B_{\epsilon_1}(z_{1,i})$, i.e., infinitely many of the $z_n$ belong to finitely many balls, so at least one of the balls contains infinitely many of them. Consider a subsequence $\{z_n^{(1)}\}$ of $\{z_n\}$ contained in $B_{\epsilon_1}(z_{1,i_1})$ for some $i_1$. In the same way,

there exists a further subsequence $\{z_n^{(2)}\}$ contained in $B_{\epsilon_2}(z_{2,i_2})$ for some $i_2$, and so on:
$$\{z_n^{(1)}\} \supset \{z_n^{(2)}\} \supset \{z_n^{(3)}\} \supset \cdots.$$
Now define the diagonal sequence $z^{(l)} := z_l^{(l)}$. Then $\{z^{(l)}\}$ is a Cauchy sequence: for $l, l' \ge m$, both $z^{(l)}$ and $z^{(l')}$ lie in the same ball of radius $\epsilon_m$, so their distance is at most $2\epsilon_m$. Since $\ell^\infty(T)$ is complete, $z^{(l)}$ converges. Note that $z^{(l)} \in \bigcap_{i=1}^m K_i$ by construction for any $l \ge m$, and $\bigcap_{i=1}^m K_i$ is closed, so the limit $z$ of $z^{(l)}$ belongs to $\bigcap_{i=1}^m K_i$ for every $m$. This implies $z \in K$, which contradicts $z_m \notin K^\delta$: the $z_m$ stay at distance $\ge \delta$ from $K$, hence so does their limit point. $\square$ (Claim)

Now, by the Claim and (1.3),
$$\limsup_n P^*\big( X_n \notin K^\delta \big) \le \limsup_n P^*\Big( X_n \notin \bigcap_{i=1}^m K_i \Big)$$
$$\le \limsup_n P^*\Big( \|X_n\|_\infty > M \ \text{ or } \ \max_i \sup_{s,t \in T_{m',i}} |X_n(s) - X_n(t)| > \epsilon_{m'} \text{ for some } m' \le m \Big)$$
$$\le \limsup_n P^*\big( \|X_n\|_\infty > M \big) + \sum_{m'=1}^m \limsup_n P^*\Big( \max_i \sup_{s,t \in T_{m',i}} |X_n(s) - X_n(t)| > \epsilon_{m'} \Big)$$
$$\le \zeta + \sum_{m'=1}^m 2^{-m'}\zeta < 2\zeta. \qquad \square$$

If the asymptotic tightness condition is replaced with weak convergence, then we can obtain a stronger statement.

Proposition 1.4.11. If $X_n \xrightarrow{w} X$, where $X$ is tight, then the sample paths $t \mapsto X(t, \omega)$ are uniformly $\rho$-continuous a.s., where $\rho$ is the semimetric constructed in the proof of the (i) $\Rightarrow$ (ii) part of Theorem 1.4.8.

Proof. Continuing with the same notation, we get
$$P\big( X \in \overline{K_m^\epsilon} \big) \ge \limsup_n P^*\big( X_n \in \overline{K_m^\epsilon} \big) \quad \text{(Portmanteau)} \quad \ge \liminf_n P_*\big( X_n \in K_m^\epsilon \big) \ge 1 - \tfrac{1}{m}$$

for any $m$ and $\epsilon > 0$ (applying the Portmanteau theorem to the closed set $\overline{K_m^\epsilon}$). Letting $\epsilon \downarrow 0$, we get $P(X \in K_m) \ge 1 - 1/m$, which gives
$$P\Big( X \in \bigcup_{m=1}^\infty K_m \Big) = 1. \qquad (1.4)$$
Hence, for
$$\rho_m(s, t) = \sup_{z \in K_m} |z(s) - z(t)| \quad \text{and} \quad \rho(s, t) = \sum_{m=1}^\infty 2^{-m} \big( \rho_m(s, t) \wedge 1 \big),$$
we get
$$z \in K_m \Rightarrow |z(s) - z(t)| \le \rho_m(s, t) \qquad \forall s, t \in T.$$
Also, from $\rho_m(s, t) \wedge 1 \le 2^m \rho(s, t)$, we get
$$\rho(s, t) < \delta \Rightarrow \rho_m(s, t) < \epsilon$$
for any $\delta < 2^{-m}\epsilon$ (with $\epsilon < 1$). Therefore we get the conclusion: for $m = m(\omega)$ s.t. $X(\omega) \in K_m$, for every $\epsilon > 0$ there exists $\delta = \delta(m)$ s.t.
$$\sup_{s,t \in T,\ \rho(s,t) < \delta} |X(s, \omega) - X(t, \omega)| < \epsilon. \qquad \square$$

Proposition 1.4.12. If $X_n \xrightarrow{w} X$, $(T, \rho)$ is totally bounded, and the sample paths $t \mapsto X(t, \omega)$ are uniformly $\rho$-continuous $P$-a.s., then $X_n$ is asymptotically tight and asymptotically uniformly $\rho$-equicontinuous in probability.

Remark 1.4.13. Before the proof, note that the set of uniformly continuous functions on a totally bounded set is complete and separable in the uniform metric. A brief proof is the following. It is well known that $C(T)$ is complete, and that $C(T)$ is separable if and only if $T$ is compact. This gives that the set of continuous functions is complete and separable when $T$ is compact. Meanwhile, the following facts are also well known: a totally bounded, complete set is compact; and a uniformly continuous function can be extended to a continuous function on the completion. In other words, a uniformly continuous function on a totally bounded set corresponds to a continuous function on a compact set (the completion).

Proof. Note that $(T, \rho)$ is totally bounded and $t \mapsto X(t, \omega)$ is uniformly $\rho$-continuous, so the set of realizations of such $X$ is complete and separable. Since a random variable on a complete separable space is tight, $X$ is tight, which implies that $X_n$ is asymptotically tight (Lemma 1.3.9). Since $X$ is tight and uniformly $\rho$-continuous a.s.,
$$\forall \eta > 0\ \exists K : \text{compact set of uniformly $\rho$-continuous functions s.t. } P(X \in K) \ge 1 - \eta.$$
Note that, from the Portmanteau lemma (applied to the open set $K^\epsilon$),
$$\liminf_n P_*\big( X_n \in K^\epsilon \big) \ge P(X \in K^\epsilon) \ge 1 - \eta \quad \forall \epsilon > 0.$$
Since $K$ is totally bounded, for every $\epsilon > 0$ there exist $z_1, z_2, \dots, z_k \in K$ s.t. $K \subset \bigcup_{i=1}^k B_\epsilon(z_i)$, which implies
$$K^\epsilon \subset \bigcup_{i=1}^k B_{2\epsilon}(z_i).$$
Since each $z_i$ is uniformly continuous,
$$\exists \delta > 0 \text{ s.t. } \rho(s, t) < \delta \Rightarrow \max_{i \le k} |z_i(s) - z_i(t)| < \epsilon.$$
Then we get, for $z \in B_{2\epsilon}(z_i)$,
$$\rho(s, t) < \delta \Rightarrow |z(s) - z(t)| \le |z(s) - z_i(s)| + |z_i(s) - z_i(t)| + |z_i(t) - z(t)| \le 2\epsilon + \epsilon + 2\epsilon = 5\epsilon,$$
and therefore
$$\liminf_n P_*\Big( \sup_{\rho(s,t) < \delta} |X_n(s) - X_n(t)| \le 5\epsilon \Big) \ge \liminf_n P_*\Big( X_n \in \bigcup_{i=1}^k B_{2\epsilon}(z_i) \Big) \ge \liminf_n P_*\big( X_n \in K^\epsilon \big) \ge 1 - \eta. \qquad \square$$

The concluding remark of this chapter is that, in most cases, we are interested in the situation where the limit process is Gaussian, whose finite-dimensional convergence is obtained from the CLT. In this case, the semimetric $\rho$ in Proposition 1.4.12 can be taken to be an $L_p$-type norm.

Definition 1.4.14. A stochastic process is called Gaussian if each marginal has a multivariate normal distribution.

Remark 1.4.15. Note that if $X_n \xrightarrow{w} X$, where $X$ is a tight Gaussian process, then the metric
$$\rho(s, t) = \rho_p(s, t) = \big( E|X(s) - X(t)|^p \big)^{1/p}, \qquad p \ge 1,$$
makes $X_n$ asymptotically uniformly $\rho$-equicontinuous in probability.
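To close the chapter with a concrete case (added here, not in the original notes): for Brownian motion on $[0,1]$, the semimetric of Remark 1.4.15 with $p = 2$ is $\rho_2(s, t) = \sqrt{|t - s|}$, since $E|B_s - B_t|^2 = |t - s|$. A quick simulation check:

```python
import numpy as np

rng = np.random.default_rng(7)
reps = 200_000
s, t = 0.3, 0.7

# Brownian motion marginals at s < t: B_s ~ N(0, s), and the increment
# B_t - B_s ~ N(0, t - s) is independent of B_s
Bs = np.sqrt(s) * rng.normal(size=reps)
Bt = Bs + np.sqrt(t - s) * rng.normal(size=reps)

# rho_2(s, t) = (E|B_s - B_t|^2)^{1/2} should equal sqrt(|t - s|)
rho2 = np.sqrt(np.mean((Bs - Bt) ** 2))
print(rho2, np.sqrt(t - s))
```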

Chapter 2: Maximal Inequalities and Symmetrization

2.1 Introduction

Here we use the following notation. Let $(\mathcal{X}, \mathcal{B}, P)$ be a baseline probability space, and $(\mathcal{X}^\infty, \mathcal{B}^\infty, P^\infty)$ the product space. We consider the projection onto the $i$th coordinate, $X_i : \mathcal{X}^\infty \to \mathcal{X}$. Then $X_1, X_2, \dots$ become i.i.d. random variables with distribution $P$.

Definition 2.1.1. Denote
$$\mathbb{P}_n := \frac{1}{n} \sum_{i=1}^n \delta_{X_i} \quad \text{(empirical measure)}$$
and
$$\mathbb{G}_n := \sqrt{n}\,(\mathbb{P}_n - P) \quad \text{(empirical process)}.$$
Here $\delta_x$ denotes the Dirac measure at $x$.

Remark 2.1.2. Often, $\mathbb{G}_n$ denotes the stochastic process $f \mapsto \mathbb{G}_n f$, i.e., $(\mathbb{G}_n f)_{f \in \mathcal{F}}$, where $\mathcal{F}$ is a collection of measurable functions and $Qf$ denotes $Qf = \int f\,dQ$ for a measurable function $f$ and signed measure $Q$. Note that
$$\mathbb{G}_n f = \sqrt{n}\Big( \frac{1}{n}\sum_{i=1}^n f(X_i) - Pf \Big).$$

Definition 2.1.3. For a signed measure $Q$, define
$$\|Q\|_{\mathcal{F}} := \sup\{ |Qf| : f \in \mathcal{F} \}.$$

Our first step consists of very well-known results:

Proposition 2.1.4. For each $f \in \mathcal{F}$,
(i) $\mathbb{P}_n f \to Pf$ a.s. (SLLN)

(ii) $\mathbb{G}_n f \xrightarrow{d} N\big( 0, Pf^2 - (Pf)^2 \big)$. (CLT)

We are interested in uniform versions of the previous proposition. The uniform version of (i) becomes
$$\|\mathbb{P}_n - P\|_{\mathcal{F}}^* \to 0 \quad \text{a.s.}, \qquad (2.1)$$
where $*$ denotes the outer version.

Definition 2.1.5. A collection of integrable measurable functions $\mathcal{F}$ satisfying (2.1) is called a $P$-Glivenko–Cantelli class.

Next, the uniform version of (ii) can be obtained as follows. Assume that
$$\sup_{f \in \mathcal{F}} |f(x) - Pf| < \infty \qquad \forall x \in \mathcal{X}.$$
Then $f \mapsto \mathbb{G}_n f$ can be viewed as a map into $\ell^\infty(\mathcal{F})$. If $\mathbb{G}_n$ is asymptotically tight in $\ell^\infty(\mathcal{F})$, then $\mathbb{G}_n$ converges weakly to a tight Borel measurable map $\mathbb{G}$ in $\ell^\infty(\mathcal{F})$, by a CLT argument for the marginals and Theorem 1.4.5.

Definition 2.1.6. A class $\mathcal{F}$ of square-integrable measurable functions is called a $P$-Donsker class if $\mathbb{G}_n$ is asymptotically tight.

Remark 2.1.7. A finite collection $\mathcal{F}$ of integrable functions is trivially $P$-Glivenko–Cantelli. Furthermore, a finite collection $\mathcal{F}$ of square-integrable functions is $P$-Donsker, by the (iii) $\Rightarrow$ (i) part of Theorem 1.4.8.

Example 2.1.8. Let $X_1, X_2, \dots$ be i.i.d. random variables in $\mathbb{R}$, and
$$\mathcal{F} := \{ 1_{(-\infty, t]} : t \in \mathbb{R} \}.$$
Then
$$\|\mathbb{P}_n - P\|_{\mathcal{F}} = \sup_{t \in \mathbb{R}} |F_n(t) - F(t)| \to 0 \quad \text{a.s.}$$
for any probability measure $P$ on $\mathbb{R}$ (the classical Glivenko–Cantelli theorem). This shows that $\mathcal{F}$ is $P$-Glivenko–Cantelli for any $P$. To show that $\mathcal{F}$ is $P$-Donsker, we should show asymptotic tightness of $\mathbb{G}_n$, which is obtained by controlling the supremum over a finite partition. For this, we need some maximal inequalities and techniques for controlling variation, which will be covered in the rest of this chapter.

2.2 Tail and Concentration Bounds

The simplest case, Markov's inequality, is well known.
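Before turning to the tail bounds, here is a quick numerical illustration of Example 2.1.8 above (a sketch added to these notes, with an arbitrary seed and sample sizes): for $P = \mathrm{Uniform}(0,1)$, the supremum $\sup_t |F_n(t) - F(t)|$ is attained at the order statistics, where $F_n$ jumps.

```python
import numpy as np

rng = np.random.default_rng(2)

def kolmogorov_stat(x):
    # sup_t |F_n(t) - F(t)| for F = Uniform(0,1); check both the value of
    # F_n just after each jump and just before it
    x = np.sort(x)
    n = len(x)
    upper = np.arange(1, n + 1) / n - x      # F_n(x_(i)) - F(x_(i))
    lower = x - np.arange(0, n) / n          # F(x_(i)) - F_n(x_(i)-)
    return max(upper.max(), lower.max())

d_small = kolmogorov_stat(rng.uniform(size=100))
d_large = kolmogorov_stat(rng.uniform(size=100_000))
print(d_small, d_large)  # the statistic shrinks as n grows
```

The decay rate is of order $n^{-1/2}$, consistent with $\mathbb{G}_n$ having a nondegenerate weak limit.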

Figure 2.1: The supremum over an infinite set can be controlled as an aggregation of the supremum over a finite net and the variation within each small ball.

Theorem 2.2.1 (Markov's inequality). Let $X$ be a random variable with mean $\mu$. Then
$$P\big( |X - \mu| \ge t \big) \le \frac{E|X - \mu|^k}{t^k} \qquad \forall t, k > 0.$$
This gives a polynomial bound for the tail probability. However, such a result may not be so useful because of its roughness. Some results about exponential bounds are also well known; these are often called concentration inequalities.

Theorem 2.2.2 (Chernoff bound).
$$P(X - \mu \ge t) \le \frac{E e^{\lambda(X - \mu)}}{e^{\lambda t}} \qquad \forall \lambda > 0,\ t \in \mathbb{R}.$$

Proof. Clear from
$$1\{X - \mu \ge t\} = 1\big\{ e^{\lambda(X - \mu)} \ge e^{\lambda t} \big\} \le \frac{e^{\lambda(X - \mu)}}{e^{\lambda t}}. \qquad \square$$

Example 2.2.3 (Gaussian tail bound). Let $X \sim N(\mu, \sigma^2)$ be a Gaussian random variable. Then by the Chernoff inequality,
$$P(X - \mu \ge t) \le e^{-\lambda t}\, E e^{\lambda(X - \mu)} = \exp\Big( -\lambda t + \frac{\sigma^2}{2}\lambda^2 \Big)$$
for any $t > 0$, $\lambda > 0$. Hence we get
$$P(X - \mu \ge t) \le \inf_{\lambda > 0} \exp\Big( -\lambda t + \frac{\sigma^2}{2}\lambda^2 \Big) = e^{-t^2/2\sigma^2} \qquad \forall t > 0.$$

As shown, a Gaussian random variable has a squared-exponential tail bound. In general, the collection of distributions with such a bound is called sub-Gaussian.

Definition 2.2.4. A random variable $X$ with mean $EX = \mu$ is called sub-Gaussian if there exists $\sigma > 0$ s.t.
$$E e^{\lambda(X - \mu)} \le e^{\sigma^2 \lambda^2 / 2} \qquad \forall \lambda \in \mathbb{R}.$$
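As a numerical aside (added here, not in the original notes), the Chernoff-derived bound of Example 2.2.3 can be compared with the actual Gaussian tail; the bound holds but is not sharp in constants:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, t = 1.0, 2.0
x = rng.normal(0.0, sigma, size=1_000_000)

empirical = (x >= t).mean()                 # Monte Carlo estimate of P(X >= t)
bound = np.exp(-t**2 / (2 * sigma**2))      # Chernoff bound e^{-t^2 / 2 sigma^2}
print(empirical, bound)
```

For $t = 2$, $\sigma = 1$: the true tail is about $0.023$, while the bound is $e^{-2} \approx 0.135$.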

Remark 2.2.5. Note that the right-hand side in the definition is the mgf of $N(0, \sigma^2)$. Thus sub-Gaussianity means that the mgf is dominated by a Gaussian one, i.e., the tail decays at least at a Gaussian rate.

Remark 2.2.6. Obviously, if $X$ is sub-Gaussian, we get
$$P(X - \mu \ge t) \le \exp\Big( -\frac{t^2}{2\sigma^2} \Big) \qquad \forall t \ge 0.$$
Furthermore, if $X$ is sub-Gaussian with parameter $\sigma$, so is $-X$, and hence
$$P\big( |X - \mu| \ge t \big) = P(X - \mu \ge t) + P(-(X - \mu) \ge t) \le 2 \exp\Big( -\frac{t^2}{2\sigma^2} \Big).$$

Example 2.2.7. A random variable $\epsilon$ is called Rademacher if
$$P(\epsilon = 1) = P(\epsilon = -1) = \tfrac12.$$
In this case,
$$E e^{\lambda \epsilon} = \frac{e^\lambda + e^{-\lambda}}{2} = \sum_{k=0}^\infty \frac{\lambda^{2k}}{(2k)!} \le \sum_{k=0}^\infty \frac{(\lambda^2/2)^k}{k!} = \exp\Big( \frac{\lambda^2}{2} \Big),$$
and hence $\epsilon$ is sub-Gaussian with $\sigma = 1$. Actually, this result is not so surprising, because a distribution with bounded support has an extremely light tail, clearly lighter than that of a Gaussian. We can formulate this as follows:

Example 2.2.8. Let $X$ be a random variable with $EX = \mu$ and $P(a \le X \le b) = 1$. Then $X$ is sub-Gaussian with $\sigma = \frac{b-a}{2}$. To show this, define
$$\psi(\lambda) = \log E e^{\lambda X} \quad \text{(cgf)}.$$
Then $\psi(0) = 0$, $\psi'(0) = \mu$, and
$$\psi''(\lambda) = E_\lambda X^2 - (E_\lambda X)^2, \quad \text{where} \quad E_\lambda f(X) := \frac{E f(X) e^{\lambda X}}{E e^{\lambda X}}.$$
Note that $E_\lambda$ can be viewed as an expectation operator w.r.t. a weight proportional to $e^{\lambda X}$. Now note: if $a \le Y \le b$ a.s., then
$$\mathrm{var}(Y) = \min_y E(Y - y)^2 \le E\Big( Y - \frac{b + a}{2} \Big)^2 \le \Big( \frac{b - a}{2} \Big)^2$$
holds.

Since $\psi''(\lambda)$ can be viewed as a variance (under the tilted distribution), we get
$$\psi''(\lambda) \le \Big( \frac{b - a}{2} \Big)^2 \qquad \forall \lambda \in \mathbb{R},$$
and hence $\sup_\lambda \psi''(\lambda) \le \big( \frac{b-a}{2} \big)^2$. Thus, by Taylor expansion with some intermediate point $\xi$, we obtain
$$\psi(\lambda) = \psi(0) + \psi'(0)\lambda + \psi''(\xi)\frac{\lambda^2}{2} \le \lambda\mu + \frac{\lambda^2}{2}\Big( \frac{b - a}{2} \Big)^2,$$
which yields
$$E e^{\lambda(X - \mu)} = e^{-\lambda\mu + \psi(\lambda)} \le \exp\Big( \frac{\lambda^2}{2}\Big( \frac{b - a}{2} \Big)^2 \Big).$$

Our next result is that an independent sum of sub-Gaussian random variables is also sub-Gaussian.

Theorem 2.2.9 (Hoeffding's inequality). Let $X_i$ be independent random variables with $EX_i = \mu_i$, where each $X_i$ is sub-Gaussian with parameter $\sigma_i$. Then $\sum_{i=1}^n X_i$ is also sub-Gaussian, with parameter $\big( \sum_{i=1}^n \sigma_i^2 \big)^{1/2}$, i.e.,
$$P\Big( \sum_{i=1}^n (X_i - \mu_i) \ge t \Big) \le \exp\Big( -\frac{t^2}{2\sum_{i=1}^n \sigma_i^2} \Big) \qquad \forall t \ge 0.$$

Proof. It suffices to show that $X_1 + X_2$ is sub-Gaussian with $\sigma^2 = \sigma_1^2 + \sigma_2^2$. This is clear from
$$E e^{\lambda(X_1 + X_2 - \mu_1 - \mu_2)} = E e^{\lambda(X_1 - \mu_1)}\, E e^{\lambda(X_2 - \mu_2)} \le \exp\Big( \frac{\sigma_1^2}{2}\lambda^2 \Big) \exp\Big( \frac{\sigma_2^2}{2}\lambda^2 \Big) = \exp\Big( \frac{(\sigma_1^2 + \sigma_2^2)\lambda^2}{2} \Big). \qquad \square$$

The following corollary is immediate from Hoeffding's inequality, but it is a very useful result and will be widely used in this course.

Corollary 2.2.10. If the $X_i$ are bounded and independent, i.e., $P(a_i \le X_i \le b_i) = 1$, then
$$P\Big( \sum_{i=1}^n (X_i - \mu_i) \ge t \Big) \le \exp\Big( -\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2} \Big).$$
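As a numerical check of Hoeffding's inequality (added here, not in the original notes), take sums of Rademacher variables, each sub-Gaussian with $\sigma = 1$ by Example 2.2.7, so the bound reads $P(S_n \ge t) \le e^{-t^2/2n}$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps, t = 100, 100_000, 25.0

# S = sum of n Rademacher signs, replicated many times
eps = rng.integers(0, 2, size=(reps, n), dtype=np.int8) * 2 - 1
S = eps.sum(axis=1)

empirical = (S >= t).mean()          # Monte Carlo estimate of P(S >= t)
bound = np.exp(-t**2 / (2 * n))      # Hoeffding bound exp(-t^2 / 2n)
print(empirical, bound)
```

The empirical tail probability is comfortably below the bound, as it must be.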

Before we move on, let us record some equivalent conditions for sub-Gaussianity.

Theorem 2.2.11. For any $X$ with $EX = 0$, the following are equivalent:
(i) $\exists \sigma > 0$ s.t. $E e^{\lambda X} \le \exp\big( \frac{\lambda^2}{2}\sigma^2 \big)$ $\forall \lambda \in \mathbb{R}$ (i.e., $X$ is sub-Gaussian).
(ii) $\exists c \ge 1$ and a Gaussian r.v. $Z \sim N(0, \tau^2)$ s.t. $P(|X| \ge s) \le c\,P(|Z| \ge s)$ $\forall s \ge 0$.
(iii) $\exists \theta \ge 0$ s.t. $E X^{2k} \le \frac{(2k)!}{2^k k!}\theta^{2k}$, $k = 1, 2, \dots$.
(iv) $\exists \sigma > 0$ s.t. $E e^{\lambda X^2 / 2\sigma^2} \le \frac{1}{\sqrt{1 - \lambda}}$ $\forall \lambda \in [0, 1)$.

Now we consider another notion. Sub-Gaussianity is fairly restrictive, so it is natural to consider various relaxations of it. The class of sub-exponential random variables is defined by a slightly milder condition on the mgf, and correspondingly allows a slower tail decay rate.

Definition 2.2.12. A random variable $X$ is called sub-exponential if $\exists \nu, b > 0$ s.t.
$$E e^{\lambda(X - \mu)} \le e^{\nu^2 \lambda^2 / 2} \qquad \forall \lambda : |\lambda| \le \tfrac{1}{b}.$$
Obviously, sub-Gaussianity implies sub-exponentiality. The converse is not true; sub-Gaussianity is the stronger condition.

Example 2.2.13. Let $Z \sim N(0, 1)$ and $X = Z^2 - 1$. Then
$$E e^{\lambda X} = \frac{e^{-\lambda}}{\sqrt{1 - 2\lambda}} \quad \text{for } \lambda < \tfrac12,$$
and the mgf does not exist for $\lambda \ge 1/2$. With a simple calculation, we can verify that
$$\frac{e^{-\lambda}}{\sqrt{1 - 2\lambda}} \le e^{2\lambda^2} \qquad \forall \lambda : |\lambda| < \tfrac14.$$
Therefore, $X$ is sub-exponential, but not sub-Gaussian.

Theorem 2.2.14. For any $X$ with $EX = 0$, the following are equivalent:
(i) $\exists \nu, b > 0$ s.t. $E e^{\lambda X} \le e^{\lambda^2 \nu^2 / 2}$ $\forall \lambda : |\lambda| \le 1/b$ (i.e., $X$ is sub-exponential).
(ii) $\exists c_0 > 0$ s.t. $E e^{\lambda X} < \infty$ $\forall \lambda : |\lambda| \le c_0$.
(iii) $\exists c_1, c_2 > 0$ s.t. $P(|X| \ge t) \le c_1 e^{-c_2 t}$ $\forall t > 0$.
(iv) $\exists \sigma, M > 0$ s.t. $|E X^k| \le \frac{\sigma^2}{2} k! M^{k-2}$, $k = 2, 3, \dots$ (Bernstein condition).

The condition (iv) is called the Bernstein condition. It is known that:

Lemma 2.2.15. If $EX = 0$ and $X$ satisfies the Bernstein condition, then
$$P(|X| \ge t) \le 2 \exp\Big( -\frac{t^2}{2(\sigma^2 + Mt)} \Big) \qquad \forall t > 0.$$

Proof. Note that
$$E e^{\lambda X} = \sum_{k=0}^\infty \frac{\lambda^k E X^k}{k!} = 1 + \sum_{k=2}^\infty \frac{\lambda^k E X^k}{k!} \le 1 + \frac{\lambda^2 \sigma^2}{2} \sum_{k=2}^\infty (|\lambda| M)^{k-2} = 1 + \frac{\lambda^2 \sigma^2}{2} \cdot \frac{1}{1 - |\lambda| M}$$
holds, which implies
$$E e^{\lambda X} \le 1 + \frac{\lambda^2 \sigma^2}{2(1 - |\lambda| M)} \le \exp\Big( \frac{\lambda^2 \sigma^2}{2(1 - |\lambda| M)} \Big)$$
provided that $|\lambda| < \frac{1}{M}$. This gives
$$P(X \ge t) \le e^{-\lambda t} \exp\Big( \frac{\lambda^2 \sigma^2}{2(1 - \lambda M)} \Big) \qquad \forall \lambda : 0 \le \lambda < \tfrac{1}{M},$$
and hence, choosing $\lambda = \frac{t}{\sigma^2 + Mt} \in [0, \frac{1}{M})$,
$$P(X \ge t) \le \exp\Big( -\frac{t^2}{2(\sigma^2 + Mt)} \Big).$$
The same argument applied to $-X$ gives the conclusion
$$P(|X| \ge t) \le 2 \exp\Big( -\frac{t^2}{2(\sigma^2 + Mt)} \Big). \qquad \square$$

We can easily extend the result to independent sums of random variables.

Corollary 2.2.16 (Bernstein's inequality). Let $X_i$ be independent random variables satisfying the Bernstein condition
$$E X_i = 0 \quad \text{and} \quad |E X_i^k| \le \frac{\sigma_i^2}{2} k! M^{k-2}, \qquad k = 2, 3, \dots.$$
Then
$$P\big( |X_1 + \cdots + X_n| \ge t \big) \le 2 \exp\Big( -\frac{t^2}{2\big( \sum_{i=1}^n \sigma_i^2 + Mt \big)} \Big).$$

Proof. By the Chernoff inequality, we get
$$P\big( |X_1 + \cdots + X_n| \ge t \big) \le 2 e^{-\lambda t}\, E e^{\lambda \sum_i X_i} \le 2 \exp\Big( -\lambda t + \frac{\lambda^2 \sum_i \sigma_i^2}{2(1 - \lambda M)} \Big).$$
We get the conclusion by letting $\lambda = \frac{t}{Mt + \sum_i \sigma_i^2}$. $\square$

Example 2.2.17. Let $Z_k \overset{\text{i.i.d.}}{\sim} N(0, 1)$. Then
$$P\Big( \Big| \frac{1}{n} \sum_{k=1}^n (Z_k^2 - 1) \Big| \ge t \Big) \le 2 e^{-\lambda n t}\, \prod_{k=1}^n E e^{\lambda(Z_k^2 - 1)} = 2 e^{-\lambda n t} \Big( \frac{e^{-\lambda}}{\sqrt{1 - 2\lambda}} \Big)^n \le 2 e^{-\lambda n t + 2n\lambda^2}$$
for any $\lambda$ with $|\lambda| < \frac14$. Since
$$\min_{|\lambda| < 1/4} \big( -\lambda n t + 2n\lambda^2 \big) = -\frac{nt^2}{8} \qquad \text{(attained at } \lambda = t/4 \text{, for } t < 1\text{)},$$
we get
$$P\Big( \Big| \frac{1}{n} \sum_{k=1}^n (Z_k^2 - 1) \Big| \ge t \Big) \le 2 e^{-nt^2/8}.$$

Example 2.2.18 (Johnson–Lindenstrauss embedding). Let $u_i \in \mathbb{R}^d$, $i = 1, 2, \dots, m$, be extremely high-dimensional vectors (i.e., $d$ is very large). We want to find a map $F : \mathbb{R}^d \to \mathbb{R}^n$ with $n \ll d$ and
$$(1 - \delta) \|u_i - u_j\|_2^2 \le \|F(u_i) - F(u_j)\|_2^2 \le (1 + \delta) \|u_i - u_j\|_2^2$$
for some $\delta \in (0, 1)$: an embedding into a low-dimensional space that approximately preserves the distances.

Remark 2.2.19. Such an embedding can be useful when using, for example, clustering algorithms. There are various distance-based clustering methods such as K-means. If one handles extremely high-dimensional data, then obtaining the distances between all pairs of data points might require heavy computation. For this reason, one can first embed the data into a low-dimensional space, approximately preserving distances, and then treat the data as low-dimensional.

Example 2.2.18 (continued). Define $F : \mathbb{R}^d \to \mathbb{R}^n$ by
$$F(u) = \frac{Xu}{\sqrt{n}}, \quad \text{where } X = (x_{ij}) \in \mathbb{R}^{n \times d} \text{ with } x_{ij} \overset{\text{i.i.d.}}{\sim} N(0, 1).$$

Then

‖F(u)‖₂²/‖u‖₂² = ‖Xu‖₂²/(n‖u‖₂²) = (1/n) Σ_{i=1}^n ⟨X_i, u⟩²/‖u‖₂²,

where X_i is the i-th row vector of X. Note that for any fixed u,

Σ_{i=1}^n ⟨X_i, u⟩²/‖u‖₂² ~ χ²_n

holds, and hence we get

P( ‖F(u)‖₂²/‖u‖₂² ∉ [1 − δ, 1 + δ] ) ≤ 2 e^{−nδ²/8}

for any u ≠ 0 by the previous example. Thus, using F(u_i) − F(u_j) = F(u_i − u_j), we get

P( ‖F(u_i) − F(u_j)‖₂²/‖u_i − u_j‖₂² ∉ [1 − δ, 1 + δ] for some i ≠ j )
≤ Σ_{i<j} P( ‖F(u_i) − F(u_j)‖₂²/‖u_i − u_j‖₂² ∉ [1 − δ, 1 + δ] )
≤ 2 (m choose 2) e^{−nδ²/8} ≤ m² e^{−nδ²/8}.

Finally, for any ε ∈ (0,1) and m ≥ 2,

m² e^{−nδ²/8} ≤ ε if n > (16/δ²) log(m/√ε),

so for such n we can find the desired map F.

From now on, we focus on our original interest. Whether a given class F is a Glivenko–Cantelli (Donsker) class depends on the size of the class. A finite class of square-integrable functions is always Donsker by Theorem 1.4.8, while at the other extreme the class of all square-integrable uniformly bounded functions is almost never Donsker. A relatively simple way to measure the size of a class is to use entropy numbers, which are essentially the logarithm of the number of balls or brackets of size ε needed to cover F. Let (F, ‖·‖) be a subset of a normed space of functions f : X → R.

Definition (Covering number). The covering number N(ε, F, ‖·‖) is the minimum number of balls {g : ‖g − f‖ < ε} of radius ε needed to cover F. The centers f of the balls need not belong to F. The entropy is the logarithm of the covering number, log N(ε, F, ‖·‖).

Definition (Bracketing number). Given two functions l and u, the bracket [l, u] is the set of all functions f with l ≤ f ≤ u. An ε-bracket is a bracket [l, u] with ‖u − l‖ < ε. The bracketing number N_[](ε, F, ‖·‖) is the minimum number of ε-brackets needed to cover F. The functions u and l need not belong to F.

The bracketing entropy (entropy with bracketing) is the logarithm of the bracketing number, log N_[](ε, F, ‖·‖).

We only consider norms with the property |f| ≤ |g| ⟹ ‖f‖ ≤ ‖g‖. For example, the L_r(Q) norm

‖f‖_{Q,r} = ( ∫ |f|^r dQ )^{1/r}

satisfies the property.

Remark. Note that

N(ε, F, ‖·‖) ≤ N_[](2ε, F, ‖·‖)

is satisfied, because for any bracket [l, u] with ‖u − l‖ < 2ε,

f ∈ [l, u] ⟹ ‖f − (u + l)/2‖ ≤ ‖(u − l)/2‖ < ε,

i.e., every 2ε-bracket is contained in the ε-ball centered at (u + l)/2.

Definition. An envelope function of F is any function F s.t.

|f(x)| ≤ F(x) ∀x ∈ X, ∀f ∈ F.

2.3 Maximal Inequalities

In this section, we will obtain bounds on the expectation of a maximum, for example the maximal variation of a stochastic process within a small time interval. For this we introduce the notion of the Orlicz norm.

Definition 2.3.1. For ψ : [0,∞) → [0,∞) strictly increasing and convex with ψ(0) = 0, and a random variable X, the Orlicz norm ‖X‖_ψ is defined as

‖X‖_ψ = inf{ C > 0 : E ψ(|X|/C) ≤ 1 }.

Of course, we should ask whether the Orlicz norm is actually a norm.

Proposition. ‖·‖_ψ is a norm on the set of all random variables with ‖X‖_ψ < ∞, i.e.,

(i) ‖aX‖_ψ = |a| ‖X‖_ψ ∀a ∈ R;
(ii) ‖X‖_ψ = 0 ⟺ X = 0 a.s.;
(iii) ‖X + Y‖_ψ ≤ ‖X‖_ψ + ‖Y‖_ψ.

Proof. (i) Trivial.

(ii) The (⇐) part is trivial. Assume ‖X‖_ψ = 0; this means E ψ(|X|/C) ≤ 1 for all C > 0. Note that, as C ↓ 0,

ψ(|X|/C) → ψ(∞) = ∞ on {X ≠ 0}, ψ(|X|/C) → ψ(0) = 0 on {X = 0}

(ψ(∞) = ∞ because ψ is a convex, strictly increasing function). If P(X ≠ 0) > 0, then by the monotone convergence theorem

lim_{C↓0} E ψ(|X|/C) = ∞,

which is contradictory. Hence X = 0 a.s.

(iii) It suffices to show that

E ψ(|X|/C₁) ≤ 1 and E ψ(|Y|/C₂) ≤ 1 ⟹ E ψ( |X + Y|/(C₁ + C₂) ) ≤ 1.

Indeed, suppose E ψ(|X|/C₁) ≤ 1 and E ψ(|Y|/C₂) ≤ 1. Then under our claim ‖X + Y‖_ψ ≤ C₁ + C₂ holds, and taking infima w.r.t. C₁ and C₂ sequentially, we get the desired result. The claim comes from

ψ( |X + Y|/(C₁ + C₂) ) ≤ ψ( (|X| + |Y|)/(C₁ + C₂) )   (ψ is strictly increasing)
= ψ( (C₁/(C₁ + C₂)) · |X|/C₁ + (C₂/(C₁ + C₂)) · |Y|/C₂ )
≤ (C₁/(C₁ + C₂)) ψ(|X|/C₁) + (C₂/(C₁ + C₂)) ψ(|Y|/C₂)   (ψ is convex),

after taking expectations. ∎

There are two often-used Orlicz norms.

Example. Let ψ(x) = x^p, p ≥ 1. Then trivially ψ satisfies the conditions in Definition 2.3.1, and

‖X‖_ψ = inf{ C > 0 : E (|X|/C)^p ≤ 1 } = inf{ C > 0 : E|X|^p ≤ C^p } = ( E|X|^p )^{1/p} =: ‖X‖_p,

i.e., the Orlicz norm w.r.t. ψ(x) = x^p is the L_p-norm.
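The infimum defining the Orlicz norm can also be computed numerically. The sketch below uses a hypothetical three-point distribution (not from the notes) and finds ‖X‖_ψ by bisection, exploiting that C ↦ E ψ(|X|/C) is decreasing in C; for ψ(x) = x² it recovers the L₂ norm, as the example predicts.

```python
import math

values, probs = [1.0, 2.0, 5.0], [0.5, 0.3, 0.2]  # hypothetical discrete X

def e_psi(C, psi):
    # E psi(|X|/C) for the discrete distribution above
    return sum(p * psi(abs(v) / C) for v, p in zip(values, probs))

def orlicz_norm(psi, lo=1e-6, hi=1e6, iters=200):
    # bisection: E psi(|X|/C) is decreasing in C, so the norm is where it crosses 1
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if e_psi(mid, psi) <= 1.0:
            hi = mid
        else:
            lo = mid
    return hi

l2 = math.sqrt(sum(p * v * v for v, p in zip(values, probs)))
print(orlicz_norm(lambda x: x * x), l2)  # the two agree
```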

Example. Let ψ_p(x) := e^{x^p} − 1, p ≥ 1. Then trivially ψ_p satisfies the conditions in Definition 2.3.1, and ψ_p(x) ≥ x^p. Hence

‖X‖_p ≤ ‖X‖_{ψ_p}.

Remark. Note that, for ‖X‖_p or ‖X‖_{ψ_p} to be finite,

E ψ(|X|/C) < ∞ or E ψ_p(|X|/C) < ∞

should hold for some C > 0, respectively. The former requires a polynomial-order tail bound, while the latter requires an exponential-order (p = 1) or squared-exponential-order (p = 2) one. In general, the following holds.

Proposition (Tail bound). If ‖X‖_ψ < ∞, then

P(|X| > x) ≤ 1/ψ( x/‖X‖_ψ ).

Proof. Since ψ is continuous from convexity,

E ψ( |X|/‖X‖_ψ ) = E lim_{C ↓ ‖X‖_ψ} ψ(|X|/C) = lim_{C ↓ ‖X‖_ψ} E ψ(|X|/C) ≤ 1

holds by MCT (actually "=" holds). Now the Markov inequality gives

P(|X| > x) = P( ψ(|X|/‖X‖_ψ) ≥ ψ(x/‖X‖_ψ) ) ≤ E ψ(|X|/‖X‖_ψ) / ψ(x/‖X‖_ψ) ≤ 1/ψ(x/‖X‖_ψ). ∎

This proposition gives a necessary condition for ‖X‖_ψ < ∞. Then what is a sufficient condition? In other words, is there any condition on the tail bound which implies ‖X‖_ψ < ∞?

Proposition. If P(|X| > x) ≤ C/x^{p+δ} for p ≥ 1 and C, δ > 0, then ‖X‖_p < ∞.

Proof. E|X|^p = ∫₀^∞ P(|X|^p > x) dx ≤ 1 + ∫₁^∞ C x^{−(1+δ/p)} dx < ∞. ∎

Proposition. If P(|X| > x) ≤ K e^{−Cx^p} for p ≥ 1 and C, K > 0, then ‖X‖_{ψ_p} < ∞.

34 Proof. Note that E e D X p X p = E = E = = KD holds for sufficiently small D > 0. It gives that Eψ p X De Ds ds Is < X p De Ds ds Ps < X p De Ds ds Ke Cs De Ds dx 0 D /p e C Ds ds for sufficiently small D > 0 precisely, if D C K+, i.e., X ψ p < precisely, X ψp K+ C /p. Remark Proposition gives that, if tail probability is bounded with p +δ order polynomial, then p-norm X p becomes finite; proposition gives that if tail probability is bounded with squared exponential exponential, resp., i.e., random variable has sub-gaussian sub-exponential, resp. distrbution, then X ψ2 < X ψ <, resp. is satisfied. Our origin goal of this section is to obtain some bounds for maximum of random variables. Such maximal inequalities can be found from the basic properties of Orlicz norm. Before starting, note following naive bound or similarly, E max X i i m m E X i m max E X i, i m max X /p m /p i i m = E max X /p i p E X i p m max E X i p = m /p max X i p. p i m i m i m Thus if random variable has smaller tail probability E max X i p <, then more tight bound for maximum is obtained m /p. Following proposition gives generalized bound. Theorem Let ψ be convex, strictly increasing function with ψ0 = 0. Further, assume that ψ satisfies lim sup x,y ψxψy ψcxy < for some c >

35 Then for any random variables X, X 2,, X m, max X i i m Kψ m max X i ψ, ψ i m where K is a constant depending only on ψ. Remark Note that: m /p in the naive bound is corresponding to ψ m. If ψ increases fast, then ψ m becomes smaller, which gives smaller bound. It holds for any random variables X,, X m ; it does not require additional assumption such as independence. Proof. Firstly, we assume that and ψ. In this case, 2 Thus, for y and any C > 0, ψxψy ψcxy x, y x ψ ψcx x y. y ψy c Xi max ψ Xi ψ C Xi max I i m Cy i m ψy Cy Xi + ψ Cy }{{} c Xi ψ C max + ψ i m ψy c Xi m ψ C + ψy 2 ψ on X i Cy < Xi I Cy < holds. Taking expectation with C = c max X i ψ and y = ψ 2m, we get i m [ ] [ ] max i m X i E ψ = E max Cy ψ Xi i m C X i m ψ max X i ψ E + ψy 2 { }} { Xi m Eψ X i ψ + 2m

36 2 + 2 =, and therefore max X i i m Cy ψ holds from ψ 2m 2ψ m, which comes from and increasingness of ψ. = cψ 2m max X i ψ i m 2cψ m max X i ψ i m m = ψ0 + ψψ 2m 0 + ψ 2m ψ 2 2 Now we see general ψ. Define φx = σψτx. If τ > 0 is large enough K > 0 s.t. x, y 0 φxφy = σ 2 ψτxψτy Kσ 2 ψcτ 2 xy = Kσφcτxy 2.3, so if σ < is small enough, we get φxφy φcτxy and φ = σψτ. Also note that 2 Putting C = X φ gives στ { } { X X ψ = inf C > 0 : Eψ = inf C > 0 : } X C σ Eφ. τc X σ Eφ = σ τc Eφ σ X X Eφ X φ X φ while holds from φσx + σ 0 σφx + σφ0. Hence we get On the other hand, and putting C = τ X ψ we get X ψ X φ στ. { } τ X X φ = inf C > 0 : σeψ, C τ X σeψ C X = σeψ X ψ, , which implies X φ τ X ψ. 35

37 Therefore we have max X i i m ψ στ max X i i m φ K στ φ m max X i φ i m K στ 2 ψ m max τ X i ψ i m = K ψ m max X i ψ. i m In, it was used that from φ x = τ ψ σ x and we have ψ σψ x σψ ψ x = x, σ σ φ x = x τ ψ σ στ ψ x. Remark Using previous theorem, we can obtain the bound of maximum of stochastic process. A common technique to handle maximum term is to partition the underlying space into finite net and control variation on the small ball, for example, sup X t max X t i + i m t T sup dt,t i <δ X ti X t. Partitioning the space into δ-balls is deeply related to the covering number; it will affect the bound. As δ becomes small, variation on each δ-ball might be smaller, while controlling maximum of finite net becomes challengeable. Definition Let T, d be an arbitrary semi-metric space. Then the covering number Nɛ is the minimum number of balls of radius ɛ needed to cover T ; a collection of points is ɛ-separated if the distance between each pair of points is strictly larger than ɛ; the packing number Dɛ is the maximum number of ɛ-separated points in T. We can naturally guess that the packing number Dɛ would have similar value with the covering number Nɛ. Proposition Nɛ Dɛ N 2 ɛ. 36

38 Proof. First, for D = Dɛ,let t, t 2,, t D be maximal ɛ-separated points. Then since the set {t, t 2,, t D } is maximal, adding any other point in T makes the set not ɛ-separated. That is, It means that i.e., Nɛ Dɛ. t T t i s.t. dt, t i ɛ. T D B ɛ t j, j= Next, let D = Dɛ and N = Nɛ/2. Assume that D > N. Then t, t 2,, t D which are ɛ- separated points, and s, s 2,, s N which balls centered with cover T, i.e., T N j= B ɛ/2s j. Then because we assumed that D > N, there exist two points t i and t i t i, t i B ɛ/2 s j. However it is contradictory to the assumption that t i, t i D N. Now we are ready for our main result for maximal inequality. those belong to the same ball are ɛ-separated. Therefore Definition A stochastic process X t t T is separable if for any countable dense subset T 0 T and δ > 0, sup ds,t<ssδ s,t T X s X t = sup ds,t<δ s,t T 0 Lemma If 0 X n X, then X n ψ X ψ. Proof. First, it is obvious that Now, for any C < X ψ, by definition, X s X t a.s.. 0 X y = X ψ Y ψ. lim Eψ It implies that X n ψ for large n, i.e., Since C < X ψ was arbitrary, we get Xn C X = Eψ >. MCT C lim inf X n ψ C. lim inf X n ψ X ψ. 37

39 Meanwhile, X n X implies X n ψ X ψ, which gives lim X n ψ = X ψ. Theorem Maximal Inequality. Let ψ be convex, strictly increasing function satisfying ψ0 = 0 and 2.3. Also assume that stochastic process X t t T is separable and satisfies Then for any η, δ > 0, sup ds,t δ X s X t ψ C ds, t s, t T. 2.4 { η } X s X t K ψ Dɛdɛ + δψ D 2 η ψ holds, where K is a constant depending only on C and ψ. 0 Proof. Construct T 0 T T recursively to satisfy that T j is a maximal η 2 j -separated set containing T j. Then by the definition of packing number, cardt j Dη 2 j. Note that by maximality t j+ T j+ t j T j s.t. dt j, t j+ η 2 j. Link every t j+ T j+ to a unique t j T j s.t. dt j, t j+ < η 2 j make any mapping which satisfies dt j, t j+ < η 2 j ; how it can be possible is not our interest. Now call t k+, t k,, t 0 to a chain. Note that is countable and dense subset by construction in T. Since X t t T is separable, sup k= T k ds,t δ X s X t = ψ = ds,t δ s,t k= T k MCT k sup X s X t ψ lim sup X s X t ds,t δ s,t T k+ Now let s k+ s k s 0 and t k+ t k t 0 be chains. Then X sk+ X tk+ X sk+ X s0 X tk+ X t0 + X s0 X t0 }{{} ψ. 38

40 holds. Now we get k { = Xsj+ X sj X tj+ X tj } 2 j=0 where L j is the set of all links from T j+ to T j. Then we get and hence by theorem 2.3.0, sup 2 s k+,t k+ T k+ ψ cardl j Dη 2 j, k max X u X v, u,v L j j=0 k max X u X v u,v L j j=0 2K K 4K 4K ψ k ψ cardl j max X u X v ψ }{{} u,v L j }{{} j=0 ψ Dη 2 j C du,v Cη 2 j k ψ Dη 2 j η 2 j 2 4 j=0 η/2 0 η 0 ψ Dɛdɛ ψ Dɛdɛ. Now, to control X s0 X t0, conversely for each pair of end points s 0, t 0, choose unique pair Figure 2.2: k ψ Dη 2 j η 2 j 2 j=0 η/2 0 ψ Dɛdɛ. s k+, t k+ T k+ which is different from those in previous paragraph; there is some abuse of notation. Then X s0 X t0 + X sk+ X tk+ 39

41 again, and hence max X s0 X t0 s 0,t 0 T 0 max s ψ k+,t k+ T k+ + max X sk+ X ψ tk+ ψ 4K η 0 ψ Dɛdɛ + max X sk+ X tk+ ψ. Note that the number of possible pairs of s 0, t 0 and consequently s k+, t k+ is at most cardt 0 2 Dη 2, and thus by theorem again, max X sk+ X tk+ ψ K ψ D 2 η max X sk+ X tk+ ψ. Since X sk+ X tk+ ψ C ds k+, t k+, we get max X s X t 8K ds,t δ s,t T k+ ψ η 0 ψ Dɛdɛ + K ψd 2 η Cδ = 8K η 0 ψ Dɛdɛ + KδψD 2 η. Remark Why we decomposed X sk+ X tk+ as and X s0 X t0, and decomposed X s0 X t0 again? If we bound X sk+ X tk+ directly with similar argument, then we obtain the bound with term ψ D 2 η 2 j, which might not be so useful. How such maximal inequality can be used? Following is one example which gives the bound for sub-gaussian stochastic process. process. Before we start, we should define sub-gaussianity of stochastic Definition A stochastic process X t t T is sub-gaussian with respect to semi-metric d if P X s X t > x 2 exp x 2 2 d 2 x. s, t Example Any zero-mean Gaussian process is sub-gaussian with respect to L 2 -distance ds, t = σx s X t = EX s X t 2. Example Let ɛ, ɛ 2,, ɛ n be Rademacher r.v. and X a = a i ɛ i, a R n. 40

42 Then by Hoeffding inequalitym, P a i ɛ i x 2 exp x 2 2 a 2 It implies that X a a R n is sub-gaussian stochastic process with respect to Euclidean distance da, b = a b 2. To apply maximal inequality, we should verify the condition 2.4. Proposition For sub-gaussian stochastic process X t t T and ψ 2 x = e x2, Proof. It suffices to show that It comes from X s X t ψ2 6ds, t. Xs X t Eψ 2. 6ds, t Xs X t Xs X Eψ 2 t 2 = E exp 6ds, t 6d 2 s, t Xs X t 2 = P exp 0 6d 2 > x dx s, t = P X s X t > 6ds, t log + x dx 0 2 exp 6d 2 s, t log + x 2 d 2 dx s, t Now we get the desired result. = 0 0 = 2. 2e 3 log+x dx Corollary Let X t t T be separable sub-gaussian stochastic process. Then E sup ds,t δ X s X t δ Remark From now on, A B denotes that 0. log Dɛdɛ δ > 0. A c B for some universal constant c > 0. Proof. Apply theorem with ψ = ψ 2 and η = δ. Since the constant K in theorem depended 4

43 only on ψ and C, which are all given in this example, K becomes universal. Therefore we get E sup ds,t δ X s X t = sup sup sup ds,t δ ds,t δ ds,t δ δ ψ2 0 δ ψ2 0 X s X t X s X t 2 X s X t ψ2 Dɛdɛ + δψ 2 D2 δ Dɛdɛ + δψ 2 Dδ ψ2 x = log + x and hence we get ψ δ ψ 0 2 x2 2ψ2 x for x 0 2 Dɛdɛ ψ2 is increasing, while D is decreasing, and hence = δψ δ 0 δ 0 2 Dδ δ 0 ψ log + Dɛdɛ log Dɛdɛ 2 Dɛdɛ log + x 2 log x for sufficiently large x Remark Note that log Dɛ is an entropy. Thus, whether the value of bound integral be finite or not depends on how fast the entropy grows as δ goes to Symmetrization In empirical process, our final goal is to obtain Glivenko-Cantelli and Donsker s theorem. They can be obtained from measuring the space F via covering number or bracketing numbers. The former one requires symmetrization technique, while the other one requires Bernstein inequality as follows. Lemma Let X,, X m be arbitrary r.v. s with x 2 P X i > x 2e 2 b+ax x > 0 42

44 for a, b > 0. Then max i m X i a log + m + b log + m. ψ Remark The bound can be also represented as aψ m + bψ2 m. Proof. First note that holds for p q φ defined as X ψp X ψq log 2 q p ψ p xlog 2 p = φ ψ q xlog 2 q, i.e., φ = ψ p ψ q for ψp x = 2 xp is concave function with φ =, and hence by Jensen, which gives log 2 q p X ψq X ψp. Now holds. Now recall that Thus for we get Therefore we have = φ φ Eψ q log 2 q log 2 X q X ψq Eφ ψ q log 2 q log 2 X q X ψq = Eψ p log 2 p X q, X ψq P X i > x 2e 2 x 2 b+ax 2e x2 4b 2e x 4a 0 x b x > b K + /p P X > x Ke Cxp, p = X ψp proposition C max i m X i X i = X i I X i b + X i I X i > b, a a }{{}}{{} P > x 2e x2 4b and hence ψ2 b P > x 2e x 4a and hence ψ a. max ψ i m + max ψ i m ψ a a 43

45 max i m + max ψ2 i m ψ ψ2 m max ψ 2 + ψ m max ψ i m i m ψ2 m b + ψ ma. Now we see very useful technique, which is called symmetrization. Recall that in empirical process we consider following setting: i.i.d X, X 2,, X n P P n f = n G n f = n fx i fx i Pf. Symmetrization technique is formulated based on the fact that, for Rademacher random variables ɛ,, ɛ n, f P n Pf would have similar behavior with f P 0 nf := n ɛ i fx i. Theorem Symmetrization. Let φ be a convex non-decreasing function and F be a class of measurable functions. Then E φ P n P F E φ 2 P 0 n F. Proof. We prove only under the measurability condition. Recall that under measurability, we can use Fubini theorem. Let Y, Y 2,, Y n be independent copies of X, X 2,, X n. Then and hence by non-decreasingness of φ, P n P F = sup fx i EfX i f F n = sup fx f F n i E Y fy i E Y sup fx i fy i n, f F Eφ P n P F E X φ E Y sup f F n E X E Y φ n sup f F fx i fy i fx i fy i Jensen 44

46 holds. Now note that, by symmetricity, and hence fx i fy i d fy i fx i fx i fy i d e i fy i fx i for any e i {, } symmetrization!. Consequently, we have sup n f F d fx i fy i sup n for any e,, e n {, } n. Therefore we get f F e i fx i fy i Eφ P n P F E ɛ E X,Y ɛ φ sup ɛ i fx i fy i f F n { } 2 E ɛ E X,Y φ sup ɛ i fx i 2 f F n + sup 2 ɛ i fy i f F n { } 2 E 2 2 ɛ E X,Y φ sup ɛ f F n i fx i + E X,Y φ sup ɛ f F n i fy i = 2 E 2 ɛ2e X φ sup ɛ i fx i f F n 2 = Eφ sup ɛ i fx i n f F = Eφ2 P 0 n F. Example Consider φx = x m, m. Then by symmetrization. If P 0 n F is measurable, then holds. The term E P n P m F 2 m E P 0 n m F E P 0 n F = E P 0 n F = E X E ɛ X sup n E ɛ X sup f F n f F ɛ i fx i ɛ i fx i can be viewed as a supremum of stochastic process n a i ɛ i for constants a i s, and hence its bound can be obtained via, for instance, Hoeffding inequality. Note that such argument requires measurability! Thus considering the class of functions which makes the target process measurable is a natural 45

procedure.

Definition. A class F of measurable functions f : X → R on (X, A, P) is called a P-measurable class if

(X_1, …, X_n) ↦ ‖ Σ_{i=1}^n e_i f(X_i) ‖_F

is measurable on the completion of (X^n, A^n, P^n) for every n and every (e_1, …, e_n) ∈ {−1, 1}^n.
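To close the section, the symmetrization inequality (with φ(x) = x) can be checked by simulation for a class of half-line indicators. Everything below is an illustrative sketch, not part of the notes: P is Uniform(0,1), so Pf = c for f = 1_{(−∞,c]}, and the supremum is taken over a finite grid of c values.

```python
import random

random.seed(4)

n, reps = 50, 1000
grid = [k / 20.0 for k in range(1, 20)]  # F = {1(-inf, c] : c in grid}

emp, sym = 0.0, 0.0
for _ in range(reps):
    xs = [random.random() for _ in range(n)]            # X_i ~ P = Uniform(0, 1)
    signs = [random.choice((-1, 1)) for _ in range(n)]  # Rademacher epsilon_i
    # ||P_n - P||_F and the symmetrized ||P_n^0||_F for this sample
    emp += max(abs(sum(x <= c for x in xs) / n - c) for c in grid)
    sym += max(abs(sum(s * (x <= c) for s, x in zip(signs, xs))) / n for c in grid)
emp /= reps
sym /= reps
print(emp, 2 * sym)  # E||P_n - P||_F <= 2 E||P_n^0||_F
```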

Chapter 3

Applications to Empirical Processes

3.1 Glivenko–Cantelli Theorems

Now we are ready for our first goal in empirical process theory: a uniform LLN. First we use a bracketing argument; it does not require measurability.

Theorem 3.1.1 (Bracketing Glivenko–Cantelli). If N_[](ε, F, L₁(P)) < ∞ ∀ε > 0, then F is Glivenko–Cantelli, i.e.,

‖P_n − P‖_F → 0 a.s.

Proof. First note that an ε-bracket w.r.t. the L₁(P) norm is [l, u] with l ≤ f ≤ u and ‖u − l‖ = ∫(u − l) dP = P(u − l) < ε. For given ε > 0, choose finitely many ε-brackets [l_i, u_i], 1 ≤ i ≤ N, covering F. For each f ∈ F, ∃i s.t.

(P_n − P)f = P_n f − Pf ≤ P_n u_i − Pf = (P_n − P)u_i + P(u_i − f) < (P_n − P)u_i + ε.

If f is fixed, then i is also fixed, and hence by the SLLN, (P_n − P)u_i → 0 almost surely. Since there are only finitely many i, we have

max_{1≤i≤N} (P_n − P)u_i → 0 almost surely,

and therefore

sup_{f∈F} (P_n − P)f < max_{1≤i≤N} (P_n − P)u_i + ε.

Similarly we get

inf_{f∈F} (P_n − P)f > −ε + min_{1≤i≤N} (P_n − P)l_i,

49 and combining both we obtain Since ɛ > 0 was arbitrary, we get or lim sup P n Pf F ɛ almost surely. lim sup P n P F = 0 a.s., P a.s. 0. P n P F Example Let P be a probability measure on R and Then for given ɛ > 0, let Then F = {,c] : c R}. = t 0 < t < < t m = with Pt i, t i+ < ɛ i. [,ti ],,ti+ are ɛ-brackets covering F, and hence we get Glivenko-Cantelli theorem, sup F n t F t 0 almost surely. t Next argument for other type of Glivenko-Cantelli theorem uses symmetrization technique. mentioned in example 2.4.4, we need measurability condition in here. Theorem 3..3 Covering Glivenko-Cantelli. Let F be P-measurable and F be an envelope of F with P F <. Furthermore assume that where log Nɛ, F M, L P n = o P n M, ɛ > 0, F M = {f F M : f F}. As Then E P n P F = o i.e., it implies P n P F P 0. 48
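Before the proof, the conclusion of Example 3.1.2 — the classical Glivenko–Cantelli theorem — can be watched numerically. The Uniform(0,1) samples and the sample sizes below are arbitrary choices; for the uniform cdf the supremum sup_t |F_n(t) − F(t)| is attained at the order statistics.

```python
import random

random.seed(5)

def ks_stat(n):
    # sup_t |F_n(t) - F(t)| for Uniform(0,1) data; the sup is attained at order statistics
    xs = sorted(random.random() for _ in range(n))
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

dists = [ks_stat(n) for n in (100, 1000, 10000)]
print(dists)  # shrinks toward 0 as n grows, as the theorem asserts
```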

50 Proof. Denote gf F = sup f F gf. Then by symmetrization, E P n P F 2E X E ɛ n holds. Note that and hence we get = 2E X E ɛ n 2E X E ɛ n ɛ i fx i F measurability! ɛ i fx i IF X i M + ɛ i fx i IF X i > M n ɛ i fx i + 2E X E ɛ ɛ i fx i IF X i > M n FM }{{ F } = sup ɛ i fx i IF X i > M f F n fx i IF X i > M n n E P n P F 2E X E ɛ n F X i IF X i > M ɛ i fx i + 2E XF X i IF X i > M }{{} FM =2P F IF >M M 0 P F < Now, for given X,, X n and ɛ > 0, let G be an ɛ-covering of F M s.t. cardg = Nɛ, F M, L P n. Note that and hence It gives E ɛ n n ɛ i fx i n ɛ i fx i FM f F M g G s.t. P n g f < ɛ, ɛ i gx i + E ɛ n ɛ i fx i + ɛ G ɛ i fx i +ɛ n }{{} = X ψ X ψ2 X = E ɛ max f G ɛ i gx i fx i n }{{} n i gx i fx i =P n g f <ɛ.. F 49

51 max ɛ f G i fx i + ɛ n ψ2 X + log G max ɛ f G i fx i n ψ2 X + ɛ /2 log Nɛ, F M, L P n max fx i 2 + ɛ f G n = /2 log Nɛ, F M, L P n max fx i 2 +ɛ f G n n }{{} log Nɛ, F M, L P n M n + ɛ = o P + ɛ =P nf 2 /2 by the assumption log Nɛ, F M, L P n = o P n. In part, following argument is used: For constants a i s, we get from ɛ 2 i E exp Cn = and in consequence Eψ 2 C n n a i ɛ i = log 2 ψ2 /2 n 2 a i ɛ i = a 2 i /2 n a 2 E exp i ɛ 2 i C 2 = exp n 2 C 2 a i ɛ i exp n 2 C 2 a 2 i a 2 i 2 C n log 2 /2 a 2 i. Or we can use some general arguments using Hoeffding inequality; see following remark. Since ɛ > 0 was arbitrary, we get Note that and therefore by BCT, we get E ɛ n E X E ɛ n E ɛ n ɛ i fx i = o P. FM ɛ i fx i M; FM ɛ i fx i = o as n. FM 50

52 Remark In part, we used the argument only can be applied on Rademacher ɛ s. However, we can also find more general argument using Hoeffding inequality. Note that since each a i ɛ i s are sub-gaussian, by Hoeffding s inequality, we can find K and C s.t. Now proposition gives P n n a i ɛ i > x Ke Cx2. K + /2 a i ɛ i, C ψ2 where K = 2 and C = 2 a 2 i precisely. Remark To make red-colored part in the proof of previous theorem rigorous, one should construct G to satisfy f M for f G. It can be assumed without loss of generality; if not, one can truncate the function as f M M so that truncated one also covers F M and satisfies f M. Just one have to check that it is still ɛ-covering of F M ; let f F M and g G s.t. P n g f < ɛ. Then for g = g M M, P n g f = n holds. n = n i: M gx i M i: M gx i M gx i fx i + gx i fx i + gx i fx i = P n g f < ɛ 3.2 Donsker Theorems i: gx i >M i: gx i >M M fx i + gx i fx i + i: M>gX i i: M>gX i In here we consider two versions of Donsker s theorem. From now on, Q,2 denotes for a probability measure Q. f Q,2 = f 2 dq /2 fx i + M fx i gx i Theorem 3.2. Covering Donsker. Let F δ := {f g : f, g F, f g P,2 < δ} be P-measurable for any δ 0, ] and F be an envelope of F with P F 2 <. If 0 sup log Nɛ F Q,2, F, L 2 Qdɛ <, 3. Q when the supremum is taken over all finitely discrete probability measures, then F is P-Donsker. 5

53 Proof. It suffices to prove that G n is asymptotically tight, where G n = {G n f : f F} is regarded as a stochastic process with index set F. By theorem.4.8 note that each G n f converges weakly by classical CLT, which implies asymptotic tightness of each marginal it s enough to show that: i F is totally bounded in L 2 P norm; ii G n is asymptotically uniformly L 2 P-equicontinuous in probability. For this, we need following lemma: Lemma Let a n : [0, ] [0, be a sequence of non-decreasing functions. Then Proof of lemma. lim lim sup a n δ = 0 a n δ n = 0 δ n 0. δ 0 = Let {δ n } be nonincreasing sequence convergin to 0. ɛ > 0 δ 0 > 0 s.t. lim sup a n δ 0 < ɛ 2 and hence N s.t. n N a n δ 0 < ɛ. Since a n is nondecreasing, N s.t. n N δ n < δ 0. Thus = It s sufficient to show that: n N N = δ n < δ 0 = a n δ n a n δ 0 < ɛ. δ n 0 s t. lim sup a n δ n = lim lim sup a n δ. δ 0 Let C = lim δ 0 lim sup a n δ. Then for any δ > 0, we get lim sup a n δ C, because a n decreases as δ 0. It gives that for any δ > 0 and for any ɛ > 0, Thus, for every fixed m, i.e., N, N 2, N 3, s.t. a n δ > C ɛ i.o.. a n > C m m i.o., a N > C a N2 > C 2 2, N 2 > N 52

54 a N3 3 > C 3, N 3 > N 2 and so on. Take δ n as Then by definition, holds, which gives that,,, }{{} 2, 2,,, }{{ 2} 3, 3,,,. }{{ 3} N, N 2 N a Nk δ Nk > C k N 3 N 2 lim sup a n δ n C. However, since a n δ n a n δ for any fixed δ > 0 and large n enough, we have which gives Therefore we get lim sup a n δ n lim sup a n δ δ > 0, lim sup a n δ n C. lim sup a n δ n = C = lim lim sup a n δ n. δ 0 Now we show ii first. ii is equivalent to Note that by definition thus ii is again equivalent to x, η > 0 δ > 0 s.t. lim sup P sup f g P,2 <δ G n f G n g > x Lemma sup f g P,2 <δ G n f G n g = G n Fδ ; x, η > 0 δ > 0 s.t. lim sup P G n Fδ > x < η. Note that G n δ decreases as δ 0, which makes P G n Fδ > x also non-decreasing of δ. Thus it is equivalent to which is also same as lim δ 0 lim sup P G n Fδ > x < η x > 0, < η. lim P G n Fδn > x < η x > 0 δ n

55 by the lemma. Now we will show 3.2 instead of ii. For given x > 0 and δ n 0, P G n Fδn > x x E G n Fδn 2 x E n ɛ i fx i Fδn symmetrization holds. Note that E becomes E in blue-colored part from the measurability of F δn. Now, note that where f n = n P ɛ X n ɛ i fx i gx i < x 2 exp x 2 2 f g 2, n fx i 2 by Hoeffding s inequality cf. example 2.3.2, which implies that the stochastic process f n corollary E ɛ X n ɛ i fx i is sub-gaussian w.r.t. n. Then by maximal inequality ɛ i fx i E ɛ X sup Fδn δ 0 f F δn f g n<δ holds for any δ > 0 and g F δn. Using 0 F δn E ɛ X n n ɛ i fx i gx i + E ɛ X n log Dɛ, Fδn, n dɛ + E ɛ X n ɛ i fx i Fδn Now using Dɛ Nɛ/2, we can obtain that E ɛ X n ɛ i gx i ɛ i gx i ɛ i fx i Fδn = 0 θn 0 and letting δ very big MCT, we can obtain log Nɛ, Fδn, n dɛ 0 log Dɛ, Fδn, n dɛ. log Nɛ, Fδn, n dɛ θ n = sup f n f F δn Nɛ, F δn, n = for large ɛ θn/ F n 0 θn/ F n 0 θn/ F n 0 log Nɛ F n, F, n dɛ F n F δn F sup log Nɛ F Q,2, F, L 2 Qdɛ F n Q sup Q log N ɛ 2 F Q,2, F, L 2 Q f Q,2 < ɛ, g Q,2 < ɛ f g Q,2 < 2ɛ, dɛ F n 54

56 Note that F n = n = which implies N2ɛ, F, L 2 Q N 2 ɛ, F, L 2 Q θn/2 F n log Nɛ F Q,2, F, L 2 Qdɛ 2 F n 0 θn/ F n 0 sup Q sup log Nɛ F Q,2, F, L 2 Qdɛ F n. Q F X i 2 converges to a positive constant by SLLN and the assumption, E X F 2 n = P F 2 <. Hence we get: θn/ F n E X sup log Nɛ F Q,2, F, L 2 Qdɛ F n 0 Q θn = E X sup log Nɛ F Q,2, F, L 2 Q F n I > ɛ dɛ 0 Q F n θn = sup log Nɛ F Q,2, F, L 2 QE F n I > ɛ dɛ. F n 0 Q By uniform entropy condition and DCT, the last term converges to 0 as n if E F n I θn > ɛ 0 ɛ > 0. F n If θ n / F n converges to 0 in probability, then Cauchy-Schwarz gives E F n Iθ n > ɛ F n E F 2 n P θ n > ɛ F n }{{}}{{} < 0 which gives the desired result. Thus our claim is that θ n / F n converges to a positive constant; therefore our final claim is: and Claim. θ n = o P. By definition, 0, P 0. However note that F n θn 2 = sup f 2 n = sup P n f 2 sup P n Pf 2 + sup Pf 2 sup P n Pf 2 + sup Pf 2 f F δn f F δn f F δn f F δn f F f F δn sup Pf 2 δn 2 0 def of F δ f F δn hold. Furthermore, since 4F 2 is an integrable envelope of G = {f 2 : f F }, we get for f, g F P n f 2 g 2 = P n f g f + g P n f g 4F f g n 4F n Cauchy-Schwarz f 2F 55

57 and hence Nɛ 2F 2 n, G, L P n Nɛ F n, F, n sup Nɛ F Q,2, F, Q,2. f g n ɛ F n P n f 2 g 2 f g n 4F n ɛ F n 4F n = ɛ 2F 2 n Hence Nɛ 2F 2 n, G, L P n is bounded by a fixed number depending only on ɛ, i.e., It implies that Nɛ 2F 2 n, G, L P n = O P ɛ > 0. log Nɛ, G, L P n = o P n ɛ > 0, cf. see following remark which implies that G is Glivenko-Cantelli thm 3..3, i.e., Claim sup P n Pf = sup P n Pf 2 P f G f F Q 0. Remark Assume that Nɛ 2F 2 n, G, L P n = O P for any ɛ > 0. For each ω, M > 0 and N s.t. and hence n > N = 2F 2 nω M, NɛM, G, L P n Nɛ 2F 2 n, G, L P n for such M and n. log Nɛ 2F 2 n, G, L P n = o P n ɛ > 0 implies that log Nɛ, G, L P n = o P n ɛ > 0. Proof Cont d. Now we show i. Since G is Glivenko-Cantelli, there exists a finitely discrete measure P n with P n Pf 2 F Meanwhile, by the uniform entropy condition, we get i.e., 0 log Nɛ F Pn,2, F, L 2 P n dɛ = F Pn,2 0. Nɛ, F, L 2 P n < ɛ > 0. 0 log Nɛ, F, L 2 P n dɛ <, 56

58 For f, g F, P n f g 2 < ɛ 2 implies Pf g 2 = P P n f g 2 + P n f g 2 P P n 2f 2 + 2g 2 +P }{{} n f g 2 ɛ 2 + ɛ 2 = 2ɛ 2 4 P n Pf 2 F for large n enough so that P n Pf 2 F ɛ 2 /4. It implies that for large n, i.e., for large n. Therefore we obtain i.e., F is totally bounded w.r.t L 2 P-norm. f g Pn,2 ɛ = f g P,2 2ɛ ɛ Nɛ, F, L 2 P N, F, L 2 P n < 2 Nɛ, F, L 2 P < ɛ > 0, Next we consider bracketing Donsker s theorem. It uses Bernstein s inequality in the proof. From now on, let F be a set of measurable functions with envelope F satisfying P F 2 <. Lemma If F < and f < for any f F, then Proof. Note that f E G n F max log F + max f F n G n f = n f F f P,2 fx i Pf. Each fx i Pf/ n has mean zero and satisfies Bernstein condition log F. 3.3 [ E fx i Pf k fxi Pf 2 ] fx = E i Pf k 2 n n n [ 2 f ] k 2 2f 2 X E i + Pf 2 n n 2 k 2 n Pf 2 2 k f n 2 k k! 4Pf 2 k 2 2n k! f n 57

59 holds. Thus by Bernstein s inequality, P G n f > x 2 exp x 2 2 4Pf 2 n + f 2 exp x 2 2 x 4 max f + max x n f F f F n holds for any x > 0 for large n. Now maximal inequality lemma 2.4. gives the conclusion E G n F max G nf f F ψ f max log + F + 4 max Pf 2 log + F f F n f F f max log F + max Pf 2 log F. f F n f F Theorem Bracketing Donsker. If then F is P-Donsker. 0 log N [ ] ɛ, F, L 2 Pdɛ <, Remark We use chaining technique and previous lemma in the proof. However, as the condition f satisfying < is required to apply the lemma, we should truncate the terms with the order f log F f P,2 n so that two terms in the RHS of 3.3 have equal order. Proof. There exists an envelope F of F with P F 2 < Recall remark ; bracketing number is larger than covering number with same diameter. Finiteness of the integral gives that N [ ] ɛ, F, L 2 P = for large ɛ. Let [l, u] be the only bracket covering F with u l P,2 < M. Also we get 0 log Nɛ, F, L 2 Pdɛ <, i.e., Nɛ, F, L 2 P is finite for any ɛ > 0. It implies that F is totally bounded; so F is bounded. Thus P u + l 2 2P u 2 + l 2 and u P,2 u f P,2 + f P,2 <, 58

60 l P,2 f l P,2 + f P,2 < for f F implies that P u + l 2 <. Letting F = sup u, l u + l, we get an envelope F of F with P F 2 <. For q, construct a sequence of nested partitions s.t. F q,i is a 2 q -bracket in P,2 and F = N q F q,i 2 q log N q <. 3.4 q= Figure 3.: Nested Partition i F q,i. Of course we have to show that we can find such partition satisfying 3.4. Note that N q is equal to the sum of the number of partitions of each F q,i, i.e., It implies that N q N q N [ ] 2 q, F, L 2 P. Figure 3.2: Relationship between N q and N q. log Nq log N q + log N [ ] 2 q, F, L 2 P log N q + log N [ ] 2 q, F, L 2 P a + b a + b 59

61 log N q 2 + log N [ ] 2 q, F, L 2 P + log N [ ] 2 q, F, L 2 P log N + and therefore 2 q log N q q= q log N [ ] 2 p, F, L 2 P, p=2 q= 2 q log N + = log N + = log N + log N + < log N + q= p= p= q=p q log N [ ] 2 p, F, L 2 P p=2 q 2 q log N [ ] 2 p, F, L 2 P 2 q log N [ ] 2 p, F, L 2 P 2 p log N [ ] 2 p, F, L 2 P p= 0 log N [ ] ɛ, F, L 2 Pdɛ holds, which yields 3.4. Now, fix f q,i F q,i fix representatives of each partition, and for f F q,i, define π q f := f q,i q f := sup g h g,h F q,i Since each F q,i is 2 q -bracket, g h P,2 2 q, and hence projection to the space of representatives. variation on each partition P q f 2 2 q. Figure 3.3: F q,i and representative f q,i. 60


Continuous Functions on Metric Spaces Continuous Functions on Metric Spaces Math 201A, Fall 2016 1 Continuous functions Definition 1. Let (X, d X ) and (Y, d Y ) be metric spaces. A function f : X Y is continuous at a X if for every ɛ > 0

More information

THEOREMS, ETC., FOR MATH 515

THEOREMS, ETC., FOR MATH 515 THEOREMS, ETC., FOR MATH 515 Proposition 1 (=comment on page 17). If A is an algebra, then any finite union or finite intersection of sets in A is also in A. Proposition 2 (=Proposition 1.1). For every

More information

2 (Bonus). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure?

2 (Bonus). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure? MA 645-4A (Real Analysis), Dr. Chernov Homework assignment 1 (Due 9/5). Prove that every countable set A is measurable and µ(a) = 0. 2 (Bonus). Let A consist of points (x, y) such that either x or y is

More information

Lecture Notes in Advanced Calculus 1 (80315) Raz Kupferman Institute of Mathematics The Hebrew University

Lecture Notes in Advanced Calculus 1 (80315) Raz Kupferman Institute of Mathematics The Hebrew University Lecture Notes in Advanced Calculus 1 (80315) Raz Kupferman Institute of Mathematics The Hebrew University February 7, 2007 2 Contents 1 Metric Spaces 1 1.1 Basic definitions...........................

More information

PROBABILITY: LIMIT THEOREMS II, SPRING HOMEWORK PROBLEMS

PROBABILITY: LIMIT THEOREMS II, SPRING HOMEWORK PROBLEMS PROBABILITY: LIMIT THEOREMS II, SPRING 218. HOMEWORK PROBLEMS PROF. YURI BAKHTIN Instructions. You are allowed to work on solutions in groups, but you are required to write up solutions on your own. Please

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini April 27, 2018 1 / 80 Classical case: n d. Asymptotic assumption: d is fixed and n. Basic tools: LLN and CLT. High-dimensional setting: n d, e.g. n/d

More information

MATH 6605: SUMMARY LECTURE NOTES

MATH 6605: SUMMARY LECTURE NOTES MATH 6605: SUMMARY LECTURE NOTES These notes summarize the lectures on weak convergence of stochastic processes. If you see any typos, please let me know. 1. Construction of Stochastic rocesses A stochastic

More information

Exercise Solutions to Functional Analysis

Exercise Solutions to Functional Analysis Exercise Solutions to Functional Analysis Note: References refer to M. Schechter, Principles of Functional Analysis Exersize that. Let φ,..., φ n be an orthonormal set in a Hilbert space H. Show n f n

More information

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9 MAT 570 REAL ANALYSIS LECTURE NOTES PROFESSOR: JOHN QUIGG SEMESTER: FALL 204 Contents. Sets 2 2. Functions 5 3. Countability 7 4. Axiom of choice 8 5. Equivalence relations 9 6. Real numbers 9 7. Extended

More information

7 Complete metric spaces and function spaces

7 Complete metric spaces and function spaces 7 Complete metric spaces and function spaces 7.1 Completeness Let (X, d) be a metric space. Definition 7.1. A sequence (x n ) n N in X is a Cauchy sequence if for any ɛ > 0, there is N N such that n, m

More information

Metric Spaces. Exercises Fall 2017 Lecturer: Viveka Erlandsson. Written by M.van den Berg

Metric Spaces. Exercises Fall 2017 Lecturer: Viveka Erlandsson. Written by M.van den Berg Metric Spaces Exercises Fall 2017 Lecturer: Viveka Erlandsson Written by M.van den Berg School of Mathematics University of Bristol BS8 1TW Bristol, UK 1 Exercises. 1. Let X be a non-empty set, and suppose

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 13: Entropy Calculations

Introduction to Empirical Processes and Semiparametric Inference Lecture 13: Entropy Calculations Introduction to Empirical Processes and Semiparametric Inference Lecture 13: Entropy Calculations Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations Research

More information

4 Expectation & the Lebesgue Theorems

4 Expectation & the Lebesgue Theorems STA 205: Probability & Measure Theory Robert L. Wolpert 4 Expectation & the Lebesgue Theorems Let X and {X n : n N} be random variables on a probability space (Ω,F,P). If X n (ω) X(ω) for each ω Ω, does

More information

Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension. n=1

Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension. n=1 Chapter 2 Probability measures 1. Existence Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension to the generated σ-field Proof of Theorem 2.1. Let F 0 be

More information

PROBLEMS. (b) (Polarization Identity) Show that in any inner product space

PROBLEMS. (b) (Polarization Identity) Show that in any inner product space 1 Professor Carl Cowen Math 54600 Fall 09 PROBLEMS 1. (Geometry in Inner Product Spaces) (a) (Parallelogram Law) Show that in any inner product space x + y 2 + x y 2 = 2( x 2 + y 2 ). (b) (Polarization

More information

Weak convergence and Brownian Motion. (telegram style notes) P.J.C. Spreij

Weak convergence and Brownian Motion. (telegram style notes) P.J.C. Spreij Weak convergence and Brownian Motion (telegram style notes) P.J.C. Spreij this version: December 8, 2006 1 The space C[0, ) In this section we summarize some facts concerning the space C[0, ) of real

More information

4th Preparation Sheet - Solutions

4th Preparation Sheet - Solutions Prof. Dr. Rainer Dahlhaus Probability Theory Summer term 017 4th Preparation Sheet - Solutions Remark: Throughout the exercise sheet we use the two equivalent definitions of separability of a metric space

More information

Stat 8112 Lecture Notes Weak Convergence in Metric Spaces Charles J. Geyer January 23, Metric Spaces

Stat 8112 Lecture Notes Weak Convergence in Metric Spaces Charles J. Geyer January 23, Metric Spaces Stat 8112 Lecture Notes Weak Convergence in Metric Spaces Charles J. Geyer January 23, 2013 1 Metric Spaces Let X be an arbitrary set. A function d : X X R is called a metric if it satisfies the folloing

More information

Lecture 2: Uniform Entropy

Lecture 2: Uniform Entropy STAT 583: Advanced Theory of Statistical Inference Spring 218 Lecture 2: Uniform Entropy Lecturer: Fang Han April 16 Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal

More information

SUMMARY OF RESULTS ON PATH SPACES AND CONVERGENCE IN DISTRIBUTION FOR STOCHASTIC PROCESSES

SUMMARY OF RESULTS ON PATH SPACES AND CONVERGENCE IN DISTRIBUTION FOR STOCHASTIC PROCESSES SUMMARY OF RESULTS ON PATH SPACES AND CONVERGENCE IN DISTRIBUTION FOR STOCHASTIC PROCESSES RUTH J. WILLIAMS October 2, 2017 Department of Mathematics, University of California, San Diego, 9500 Gilman Drive,

More information

Probability and Measure

Probability and Measure Part II Year 2018 2017 2016 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2018 84 Paper 4, Section II 26J Let (X, A) be a measurable space. Let T : X X be a measurable map, and µ a probability

More information

Real Analysis Problems

Real Analysis Problems Real Analysis Problems Cristian E. Gutiérrez September 14, 29 1 1 CONTINUITY 1 Continuity Problem 1.1 Let r n be the sequence of rational numbers and Prove that f(x) = 1. f is continuous on the irrationals.

More information

Applied Analysis (APPM 5440): Final exam 1:30pm 4:00pm, Dec. 14, Closed books.

Applied Analysis (APPM 5440): Final exam 1:30pm 4:00pm, Dec. 14, Closed books. Applied Analysis APPM 44: Final exam 1:3pm 4:pm, Dec. 14, 29. Closed books. Problem 1: 2p Set I = [, 1]. Prove that there is a continuous function u on I such that 1 ux 1 x sin ut 2 dt = cosx, x I. Define

More information

STA 711: Probability & Measure Theory Robert L. Wolpert

STA 711: Probability & Measure Theory Robert L. Wolpert STA 711: Probability & Measure Theory Robert L. Wolpert 6 Independence 6.1 Independent Events A collection of events {A i } F in a probability space (Ω,F,P) is called independent if P[ i I A i ] = P[A

More information

X n D X lim n F n (x) = F (x) for all x C F. lim n F n(u) = F (u) for all u C F. (2)

X n D X lim n F n (x) = F (x) for all x C F. lim n F n(u) = F (u) for all u C F. (2) 14:17 11/16/2 TOPIC. Convergence in distribution and related notions. This section studies the notion of the so-called convergence in distribution of real random variables. This is the kind of convergence

More information

x log x, which is strictly convex, and use Jensen s Inequality:

x log x, which is strictly convex, and use Jensen s Inequality: 2. Information measures: mutual information 2.1 Divergence: main inequality Theorem 2.1 (Information Inequality). D(P Q) 0 ; D(P Q) = 0 iff P = Q Proof. Let ϕ(x) x log x, which is strictly convex, and

More information

Metric Spaces and Topology

Metric Spaces and Topology Chapter 2 Metric Spaces and Topology From an engineering perspective, the most important way to construct a topology on a set is to define the topology in terms of a metric on the set. This approach underlies

More information

The main results about probability measures are the following two facts:

The main results about probability measures are the following two facts: Chapter 2 Probability measures The main results about probability measures are the following two facts: Theorem 2.1 (extension). If P is a (continuous) probability measure on a field F 0 then it has a

More information

ELEMENTS OF PROBABILITY THEORY

ELEMENTS OF PROBABILITY THEORY ELEMENTS OF PROBABILITY THEORY Elements of Probability Theory A collection of subsets of a set Ω is called a σ algebra if it contains Ω and is closed under the operations of taking complements and countable

More information

Locally convex spaces, the hyperplane separation theorem, and the Krein-Milman theorem

Locally convex spaces, the hyperplane separation theorem, and the Krein-Milman theorem 56 Chapter 7 Locally convex spaces, the hyperplane separation theorem, and the Krein-Milman theorem Recall that C(X) is not a normed linear space when X is not compact. On the other hand we could use semi

More information

Convergence of Feller Processes

Convergence of Feller Processes Chapter 15 Convergence of Feller Processes This chapter looks at the convergence of sequences of Feller processes to a iting process. Section 15.1 lays some ground work concerning weak convergence of processes

More information

2) Let X be a compact space. Prove that the space C(X) of continuous real-valued functions is a complete metric space.

2) Let X be a compact space. Prove that the space C(X) of continuous real-valued functions is a complete metric space. University of Bergen General Functional Analysis Problems with solutions 6 ) Prove that is unique in any normed space. Solution of ) Let us suppose that there are 2 zeros and 2. Then = + 2 = 2 + = 2. 2)

More information

REAL AND COMPLEX ANALYSIS

REAL AND COMPLEX ANALYSIS REAL AND COMPLE ANALYSIS Third Edition Walter Rudin Professor of Mathematics University of Wisconsin, Madison Version 1.1 No rights reserved. Any part of this work can be reproduced or transmitted in any

More information

Fundamental Inequalities, Convergence and the Optional Stopping Theorem for Continuous-Time Martingales

Fundamental Inequalities, Convergence and the Optional Stopping Theorem for Continuous-Time Martingales Fundamental Inequalities, Convergence and the Optional Stopping Theorem for Continuous-Time Martingales Prakash Balachandran Department of Mathematics Duke University April 2, 2008 1 Review of Discrete-Time

More information

Brownian Motion and Conditional Probability

Brownian Motion and Conditional Probability Math 561: Theory of Probability (Spring 2018) Week 10 Brownian Motion and Conditional Probability 10.1 Standard Brownian Motion (SBM) Brownian motion is a stochastic process with both practical and theoretical

More information

Lecture 4 Lebesgue spaces and inequalities

Lecture 4 Lebesgue spaces and inequalities Lecture 4: Lebesgue spaces and inequalities 1 of 10 Course: Theory of Probability I Term: Fall 2013 Instructor: Gordan Zitkovic Lecture 4 Lebesgue spaces and inequalities Lebesgue spaces We have seen how

More information

A Concise Course on Stochastic Partial Differential Equations

A Concise Course on Stochastic Partial Differential Equations A Concise Course on Stochastic Partial Differential Equations Michael Röckner Reference: C. Prevot, M. Röckner: Springer LN in Math. 1905, Berlin (2007) And see the references therein for the original

More information

Integral Jensen inequality

Integral Jensen inequality Integral Jensen inequality Let us consider a convex set R d, and a convex function f : (, + ]. For any x,..., x n and λ,..., λ n with n λ i =, we have () f( n λ ix i ) n λ if(x i ). For a R d, let δ a

More information

Probability and Measure

Probability and Measure Probability and Measure Robert L. Wolpert Institute of Statistics and Decision Sciences Duke University, Durham, NC, USA Convergence of Random Variables 1. Convergence Concepts 1.1. Convergence of Real

More information

(convex combination!). Use convexity of f and multiply by the common denominator to get. Interchanging the role of x and y, we obtain that f is ( 2M ε

(convex combination!). Use convexity of f and multiply by the common denominator to get. Interchanging the role of x and y, we obtain that f is ( 2M ε 1. Continuity of convex functions in normed spaces In this chapter, we consider continuity properties of real-valued convex functions defined on open convex sets in normed spaces. Recall that every infinitedimensional

More information

Continuity of convex functions in normed spaces

Continuity of convex functions in normed spaces Continuity of convex functions in normed spaces In this chapter, we consider continuity properties of real-valued convex functions defined on open convex sets in normed spaces. Recall that every infinitedimensional

More information

PROBABILITY: LIMIT THEOREMS II, SPRING HOMEWORK PROBLEMS

PROBABILITY: LIMIT THEOREMS II, SPRING HOMEWORK PROBLEMS PROBABILITY: LIMIT THEOREMS II, SPRING 15. HOMEWORK PROBLEMS PROF. YURI BAKHTIN Instructions. You are allowed to work on solutions in groups, but you are required to write up solutions on your own. Please

More information

1/12/05: sec 3.1 and my article: How good is the Lebesgue measure?, Math. Intelligencer 11(2) (1989),

1/12/05: sec 3.1 and my article: How good is the Lebesgue measure?, Math. Intelligencer 11(2) (1989), Real Analysis 2, Math 651, Spring 2005 April 26, 2005 1 Real Analysis 2, Math 651, Spring 2005 Krzysztof Chris Ciesielski 1/12/05: sec 3.1 and my article: How good is the Lebesgue measure?, Math. Intelligencer

More information

Random Process Lecture 1. Fundamentals of Probability

Random Process Lecture 1. Fundamentals of Probability Random Process Lecture 1. Fundamentals of Probability Husheng Li Min Kao Department of Electrical Engineering and Computer Science University of Tennessee, Knoxville Spring, 2016 1/43 Outline 2/43 1 Syllabus

More information

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability...

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability... Functional Analysis Franck Sueur 2018-2019 Contents 1 Metric spaces 1 1.1 Definitions........................................ 1 1.2 Completeness...................................... 3 1.3 Compactness......................................

More information

1 Topology Definition of a topology Basis (Base) of a topology The subspace topology & the product topology on X Y 3

1 Topology Definition of a topology Basis (Base) of a topology The subspace topology & the product topology on X Y 3 Index Page 1 Topology 2 1.1 Definition of a topology 2 1.2 Basis (Base) of a topology 2 1.3 The subspace topology & the product topology on X Y 3 1.4 Basic topology concepts: limit points, closed sets,

More information

Honours Analysis III

Honours Analysis III Honours Analysis III Math 354 Prof. Dmitry Jacobson Notes Taken By: R. Gibson Fall 2010 1 Contents 1 Overview 3 1.1 p-adic Distance............................................ 4 2 Introduction 5 2.1 Normed

More information

Concentration inequalities and the entropy method

Concentration inequalities and the entropy method Concentration inequalities and the entropy method Gábor Lugosi ICREA and Pompeu Fabra University Barcelona what is concentration? We are interested in bounding random fluctuations of functions of many

More information

2. Dual space is essential for the concept of gradient which, in turn, leads to the variational analysis of Lagrange multipliers.

2. Dual space is essential for the concept of gradient which, in turn, leads to the variational analysis of Lagrange multipliers. Chapter 3 Duality in Banach Space Modern optimization theory largely centers around the interplay of a normed vector space and its corresponding dual. The notion of duality is important for the following

More information

MA651 Topology. Lecture 10. Metric Spaces.

MA651 Topology. Lecture 10. Metric Spaces. MA65 Topology. Lecture 0. Metric Spaces. This text is based on the following books: Topology by James Dugundgji Fundamental concepts of topology by Peter O Neil Linear Algebra and Analysis by Marc Zamansky

More information

1 Glivenko-Cantelli type theorems

1 Glivenko-Cantelli type theorems STA79 Lecture Spring Semester Glivenko-Cantelli type theorems Given i.i.d. observations X,..., X n with unknown distribution function F (t, consider the empirical (sample CDF ˆF n (t = I [Xi t]. n Then

More information

Compact operators on Banach spaces

Compact operators on Banach spaces Compact operators on Banach spaces Jordan Bell jordan.bell@gmail.com Department of Mathematics, University of Toronto November 12, 2017 1 Introduction In this note I prove several things about compact

More information

Wiener Measure and Brownian Motion

Wiener Measure and Brownian Motion Chapter 16 Wiener Measure and Brownian Motion Diffusion of particles is a product of their apparently random motion. The density u(t, x) of diffusing particles satisfies the diffusion equation (16.1) u

More information

Problem set 1, Real Analysis I, Spring, 2015.

Problem set 1, Real Analysis I, Spring, 2015. Problem set 1, Real Analysis I, Spring, 015. (1) Let f n : D R be a sequence of functions with domain D R n. Recall that f n f uniformly if and only if for all ɛ > 0, there is an N = N(ɛ) so that if n

More information

Part II Probability and Measure

Part II Probability and Measure Part II Probability and Measure Theorems Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed by the lecturers, and I have modified them (often significantly)

More information

Analysis Finite and Infinite Sets The Real Numbers The Cantor Set

Analysis Finite and Infinite Sets The Real Numbers The Cantor Set Analysis Finite and Infinite Sets Definition. An initial segment is {n N n n 0 }. Definition. A finite set can be put into one-to-one correspondence with an initial segment. The empty set is also considered

More information

The Arzelà-Ascoli Theorem

The Arzelà-Ascoli Theorem John Nachbar Washington University March 27, 2016 The Arzelà-Ascoli Theorem The Arzelà-Ascoli Theorem gives sufficient conditions for compactness in certain function spaces. Among other things, it helps

More information

converges as well if x < 1. 1 x n x n 1 1 = 2 a nx n

converges as well if x < 1. 1 x n x n 1 1 = 2 a nx n Solve the following 6 problems. 1. Prove that if series n=1 a nx n converges for all x such that x < 1, then the series n=1 a n xn 1 x converges as well if x < 1. n For x < 1, x n 0 as n, so there exists

More information

MATH MEASURE THEORY AND FOURIER ANALYSIS. Contents

MATH MEASURE THEORY AND FOURIER ANALYSIS. Contents MATH 3969 - MEASURE THEORY AND FOURIER ANALYSIS ANDREW TULLOCH Contents 1. Measure Theory 2 1.1. Properties of Measures 3 1.2. Constructing σ-algebras and measures 3 1.3. Properties of the Lebesgue measure

More information

Commutative Banach algebras 79

Commutative Banach algebras 79 8. Commutative Banach algebras In this chapter, we analyze commutative Banach algebras in greater detail. So we always assume that xy = yx for all x, y A here. Definition 8.1. Let A be a (commutative)

More information

Weak convergence. Amsterdam, 13 November Leiden University. Limit theorems. Shota Gugushvili. Generalities. Criteria

Weak convergence. Amsterdam, 13 November Leiden University. Limit theorems. Shota Gugushvili. Generalities. Criteria Weak Leiden University Amsterdam, 13 November 2013 Outline 1 2 3 4 5 6 7 Definition Definition Let µ, µ 1, µ 2,... be probability measures on (R, B). It is said that µ n converges weakly to µ, and we then

More information

MTH 404: Measure and Integration

MTH 404: Measure and Integration MTH 404: Measure and Integration Semester 2, 2012-2013 Dr. Prahlad Vaidyanathan Contents I. Introduction....................................... 3 1. Motivation................................... 3 2. The

More information

Ergodic Theorems. Samy Tindel. Purdue University. Probability Theory 2 - MA 539. Taken from Probability: Theory and examples by R.

Ergodic Theorems. Samy Tindel. Purdue University. Probability Theory 2 - MA 539. Taken from Probability: Theory and examples by R. Ergodic Theorems Samy Tindel Purdue University Probability Theory 2 - MA 539 Taken from Probability: Theory and examples by R. Durrett Samy T. Ergodic theorems Probability Theory 1 / 92 Outline 1 Definitions

More information

Stability of optimization problems with stochastic dominance constraints

Stability of optimization problems with stochastic dominance constraints Stability of optimization problems with stochastic dominance constraints D. Dentcheva and W. Römisch Stevens Institute of Technology, Hoboken Humboldt-University Berlin www.math.hu-berlin.de/~romisch SIAM

More information

STA205 Probability: Week 8 R. Wolpert

STA205 Probability: Week 8 R. Wolpert INFINITE COIN-TOSS AND THE LAWS OF LARGE NUMBERS The traditional interpretation of the probability of an event E is its asymptotic frequency: the limit as n of the fraction of n repeated, similar, and

More information

Exercises Measure Theoretic Probability

Exercises Measure Theoretic Probability Exercises Measure Theoretic Probability 2002-2003 Week 1 1. Prove the folloing statements. (a) The intersection of an arbitrary family of d-systems is again a d- system. (b) The intersection of an arbitrary

More information

Brownian Motion. 1 Definition Brownian Motion Wiener measure... 3

Brownian Motion. 1 Definition Brownian Motion Wiener measure... 3 Brownian Motion Contents 1 Definition 2 1.1 Brownian Motion................................. 2 1.2 Wiener measure.................................. 3 2 Construction 4 2.1 Gaussian process.................................

More information

MATH 722, COMPLEX ANALYSIS, SPRING 2009 PART 5

MATH 722, COMPLEX ANALYSIS, SPRING 2009 PART 5 MATH 722, COMPLEX ANALYSIS, SPRING 2009 PART 5.. The Arzela-Ascoli Theorem.. The Riemann mapping theorem Let X be a metric space, and let F be a family of continuous complex-valued functions on X. We have

More information

THE SKOROKHOD OBLIQUE REFLECTION PROBLEM IN A CONVEX POLYHEDRON

THE SKOROKHOD OBLIQUE REFLECTION PROBLEM IN A CONVEX POLYHEDRON GEORGIAN MATHEMATICAL JOURNAL: Vol. 3, No. 2, 1996, 153-176 THE SKOROKHOD OBLIQUE REFLECTION PROBLEM IN A CONVEX POLYHEDRON M. SHASHIASHVILI Abstract. The Skorokhod oblique reflection problem is studied

More information

From now on, we will represent a metric space with (X, d). Here are some examples: i=1 (x i y i ) p ) 1 p, p 1.

From now on, we will represent a metric space with (X, d). Here are some examples: i=1 (x i y i ) p ) 1 p, p 1. Chapter 1 Metric spaces 1.1 Metric and convergence We will begin with some basic concepts. Definition 1.1. (Metric space) Metric space is a set X, with a metric satisfying: 1. d(x, y) 0, d(x, y) = 0 x

More information

An exponential family of distributions is a parametric statistical model having densities with respect to some positive measure λ of the form.

An exponential family of distributions is a parametric statistical model having densities with respect to some positive measure λ of the form. Stat 8112 Lecture Notes Asymptotics of Exponential Families Charles J. Geyer January 23, 2013 1 Exponential Families An exponential family of distributions is a parametric statistical model having densities

More information

Midterm 1. Every element of the set of functions is continuous

Midterm 1. Every element of the set of functions is continuous Econ 200 Mathematics for Economists Midterm Question.- Consider the set of functions F C(0, ) dened by { } F = f C(0, ) f(x) = ax b, a A R and b B R That is, F is a subset of the set of continuous functions

More information

Problem 1: Compactness (12 points, 2 points each)

Problem 1: Compactness (12 points, 2 points each) Final exam Selected Solutions APPM 5440 Fall 2014 Applied Analysis Date: Tuesday, Dec. 15 2014, 10:30 AM to 1 PM You may assume all vector spaces are over the real field unless otherwise specified. Your

More information

An introduction to some aspects of functional analysis

An introduction to some aspects of functional analysis An introduction to some aspects of functional analysis Stephen Semmes Rice University Abstract These informal notes deal with some very basic objects in functional analysis, including norms and seminorms

More information

CHAPTER VIII HILBERT SPACES

CHAPTER VIII HILBERT SPACES CHAPTER VIII HILBERT SPACES DEFINITION Let X and Y be two complex vector spaces. A map T : X Y is called a conjugate-linear transformation if it is a reallinear transformation from X into Y, and if T (λx)

More information

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539 Brownian motion Samy Tindel Purdue University Probability Theory 2 - MA 539 Mostly taken from Brownian Motion and Stochastic Calculus by I. Karatzas and S. Shreve Samy T. Brownian motion Probability Theory

More information

δ xj β n = 1 n Theorem 1.1. The sequence {P n } satisfies a large deviation principle on M(X) with the rate function I(β) given by

δ xj β n = 1 n Theorem 1.1. The sequence {P n } satisfies a large deviation principle on M(X) with the rate function I(β) given by . Sanov s Theorem Here we consider a sequence of i.i.d. random variables with values in some complete separable metric space X with a common distribution α. Then the sample distribution β n = n maps X

More information

d(x n, x) d(x n, x nk ) + d(x nk, x) where we chose any fixed k > N

d(x n, x) d(x n, x nk ) + d(x nk, x) where we chose any fixed k > N Problem 1. Let f : A R R have the property that for every x A, there exists ɛ > 0 such that f(t) > ɛ if t (x ɛ, x + ɛ) A. If the set A is compact, prove there exists c > 0 such that f(x) > c for all x

More information

Stochastic integration. P.J.C. Spreij

Stochastic integration. P.J.C. Spreij Stochastic integration P.J.C. Spreij this version: April 22, 29 Contents 1 Stochastic processes 1 1.1 General theory............................... 1 1.2 Stopping times...............................

More information

The Lebesgue Integral

The Lebesgue Integral The Lebesgue Integral Brent Nelson In these notes we give an introduction to the Lebesgue integral, assuming only a knowledge of metric spaces and the iemann integral. For more details see [1, Chapters

More information

Aliprantis, Border: Infinite-dimensional Analysis A Hitchhiker s Guide

Aliprantis, Border: Infinite-dimensional Analysis A Hitchhiker s Guide aliprantis.tex May 10, 2011 Aliprantis, Border: Infinite-dimensional Analysis A Hitchhiker s Guide Notes from [AB2]. 1 Odds and Ends 2 Topology 2.1 Topological spaces Example. (2.2) A semimetric = triangle

More information

Real Analysis, 2nd Edition, G.B.Folland Elements of Functional Analysis

Real Analysis, 2nd Edition, G.B.Folland Elements of Functional Analysis Real Analysis, 2nd Edition, G.B.Folland Chapter 5 Elements of Functional Analysis Yung-Hsiang Huang 5.1 Normed Vector Spaces 1. Note for any x, y X and a, b K, x+y x + y and by ax b y x + b a x. 2. It

More information

Math212a1413 The Lebesgue integral.

Math212a1413 The Lebesgue integral. Math212a1413 The Lebesgue integral. October 28, 2014 Simple functions. In what follows, (X, F, m) is a space with a σ-field of sets, and m a measure on F. The purpose of today s lecture is to develop the

More information

Introduction and Preliminaries

Introduction and Preliminaries Chapter 1 Introduction and Preliminaries This chapter serves two purposes. The first purpose is to prepare the readers for the more systematic development in later chapters of methods of real analysis

More information

The Central Limit Theorem: More of the Story

The Central Limit Theorem: More of the Story The Central Limit Theorem: More of the Story Steven Janke November 2015 Steven Janke (Seminar) The Central Limit Theorem:More of the Story November 2015 1 / 33 Central Limit Theorem Theorem (Central Limit

More information

1. Stochastic Processes and filtrations

1. Stochastic Processes and filtrations 1. Stochastic Processes and 1. Stoch. pr., A stochastic process (X t ) t T is a collection of random variables on (Ω, F) with values in a measurable space (S, S), i.e., for all t, In our case X t : Ω S

More information

Useful Probability Theorems

Useful Probability Theorems Useful Probability Theorems Shiu-Tang Li Finished: March 23, 2013 Last updated: November 2, 2013 1 Convergence in distribution Theorem 1.1. TFAE: (i) µ n µ, µ n, µ are probability measures. (ii) F n (x)

More information

Notions such as convergent sequence and Cauchy sequence make sense for any metric space. Convergent Sequences are Cauchy

Notions such as convergent sequence and Cauchy sequence make sense for any metric space. Convergent Sequences are Cauchy Banach Spaces These notes provide an introduction to Banach spaces, which are complete normed vector spaces. For the purposes of these notes, all vector spaces are assumed to be over the real numbers.

More information

If g is also continuous and strictly increasing on J, we may apply the strictly increasing inverse function g 1 to this inequality to get

If g is also continuous and strictly increasing on J, we may apply the strictly increasing inverse function g 1 to this inequality to get 18:2 1/24/2 TOPIC. Inequalities; measures of spread. This lecture explores the implications of Jensen s inequality for g-means in general, and for harmonic, geometric, arithmetic, and related means in

More information

Math 341: Convex Geometry. Xi Chen

Math 341: Convex Geometry. Xi Chen Math 341: Convex Geometry Xi Chen 479 Central Academic Building, University of Alberta, Edmonton, Alberta T6G 2G1, CANADA E-mail address: xichen@math.ualberta.ca CHAPTER 1 Basics 1. Euclidean Geometry

More information