Chapter 7 Statistical Functionals and the Delta Method


1. Estimators as Functionals of $F_n$ or $P_n$
2. Continuity of Functionals of $F$ or $P$
3. Metrics for Distribution Functions $F$ and Probability Distributions $P$
4. Differentiability of Functionals of $F$ or $P$: Gateaux, Hadamard, and Fréchet Derivatives
5. Higher Order Derivatives


1. Estimators as Functionals of $F_n$ or $P_n$

Often the quantity we want to estimate can be viewed as a functional $T(F)$ or $T(P)$ of the underlying distribution function $F$ or probability measure $P$ generating the data. A simple nonparametric estimator is then the plug-in estimator $T(F_n)$ or $T(P_n)$, where $F_n$ and $P_n$ denote the empirical distribution function and empirical measure of the data.

Notation. Suppose that $X_1,\ldots,X_n$ are i.i.d. $P$ on $(\mathcal{X},\mathcal{A})$. We let
$$ P_n \equiv \frac{1}{n}\sum_{i=1}^n \delta_{X_i}, $$
the empirical measure of the sample, where $\delta_x$ is the measure with mass one at $x$ (so $\delta_x(A) = 1_A(x)$ for $A \in \mathcal{A}$). When $\mathcal{X} = R^k$, especially when $k = 1$, we will write
$$ F_n(x) = \frac{1}{n}\sum_{i=1}^n 1_{(-\infty,x]}(X_i) = P_n(-\infty,x], \qquad F(x) = P(-\infty,x]. $$

Here is a list of examples.

Example 1.1 The mean: $T(F) = \int x\, dF(x)$; $T(F_n) = \int x\, dF_n(x)$.

Example 1.2 The $r$-th moment: for $r$ an integer, $T(F) = \int x^r\, dF(x)$, and $T(F_n) = \int x^r\, dF_n(x)$.

Example 1.3 The variance:
$$ T(F) = \mathrm{Var}_F(X) = \int \Big(x - \int x\, dF(x)\Big)^2 dF(x) = \frac12 \int\!\!\int (x-y)^2\, dF(x)\, dF(y), $$
$$ T(F_n) = \mathrm{Var}_{F_n}(X) = \int \Big(x - \int x\, dF_n(x)\Big)^2 dF_n(x) = \frac12 \int\!\!\int (x-y)^2\, dF_n(x)\, dF_n(y). $$

Example 1.4 The median: $T(F) = F^{-1}(1/2)$; $T(F_n) = F_n^{-1}(1/2)$.

Example 1.5 The $\alpha$-trimmed mean: for $0 < \alpha < 1/2$,
$$ T(F) = \frac{1}{1-2\alpha}\int_\alpha^{1-\alpha} F^{-1}(u)\, du, \qquad T(F_n) = \frac{1}{1-2\alpha}\int_\alpha^{1-\alpha} F_n^{-1}(u)\, du. $$
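The plug-in recipe of Examples 1.1-1.5 is easy to compute directly from the data. The following Python sketch is my own illustration, not part of the notes; the function names and the choice $F = N(0,1)$ are hypothetical.

```python
# A minimal sketch of the plug-in principle: each functional T(F) in
# Examples 1.1-1.5 is estimated by applying T to the empirical df F_n, which
# amounts to averaging over, or taking order statistics of, the sample.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)          # X_1,...,X_n i.i.d. F (here F = N(0,1))

def plug_in_mean(x):                    # Example 1.1: integral of t dF_n(t)
    return np.mean(x)

def plug_in_moment(x, r):               # Example 1.2: integral of t^r dF_n(t)
    return np.mean(x**r)

def plug_in_variance(x):                # Example 1.3: Var_{F_n}(X); note the 1/n, not 1/(n-1)
    return np.mean((x - np.mean(x))**2)

def plug_in_median(x):                  # Example 1.4: F_n^{-1}(1/2)
    return np.quantile(x, 0.5)

def plug_in_trimmed_mean(x, alpha):     # Example 1.5, via a Riemann sum over the quantile grid
    xs = np.sort(x)
    n = len(xs)
    u = (np.arange(n) + 0.5) / n        # midpoints u in (0,1) at which F_n^{-1}(u) = xs
    keep = (u > alpha) & (u < 1 - alpha)
    return np.mean(xs[keep])

print(plug_in_mean(x), plug_in_variance(x), plug_in_median(x), plug_in_trimmed_mean(x, 0.1))
```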

Example 1.6 The Hodges-Lehmann functional: $T(F) = (1/2)\{F \star F\}^{-1}(1/2)$ where $\star$ denotes convolution. Then $T(F_n) = (1/2)\{F_n \star F_n\}^{-1}(1/2) = \mathrm{median}\{(X_i + X_j)/2\}$.

Example 1.7 The Mann-Whitney functional: for $X, Y$ independent with distribution functions $F$ and $G$ respectively, $T(F,G) = \int F\, dG = P_{F,G}(X \le Y)$. Then $T(F_m, G_n) = \int F_m\, dG_n$, based on two independent samples $X_1,\ldots,X_m$ i.i.d. $F$ with empirical df $F_m$ and $Y_1,\ldots,Y_n$ i.i.d. $G$ with empirical df $G_n$.

Example 1.8 Multivariate mean: for $P$ on $(R^k, \mathcal{B}^k)$: $T(P) = \int x\, dP(x)$ (with values in $R^k$), $T(P_n) = \int x\, dP_n(x) = n^{-1}\sum_{i=1}^n X_i$.

Example 1.9 Multivariate cross second moments: for $P$ on $(R^k, \mathcal{B}^k)$:
$$ T(P) = \int x x^T dP(x) = \int x^{\otimes 2} dP(x); \qquad T(P_n) = \int x x^T dP_n(x) = \int x^{\otimes 2} dP_n(x) = n^{-1}\sum_{i=1}^n X_i X_i^T. $$
Note that $T(P)$ and $T(P_n)$ take values in $R^{k\times k}$.

Example 1.10 Multivariate covariance matrix: for $P$ on $(R^k, \mathcal{B}^k)$:
$$ T(P) = \int \Big(x - \int y\, dP(y)\Big)\Big(x - \int y\, dP(y)\Big)^T dP(x) = \frac12\int\!\!\int (x-y)(x-y)^T dP(x)\, dP(y), $$
$$ T(P_n) = \int \Big(x - \int y\, dP_n(y)\Big)\Big(x - \int y\, dP_n(y)\Big)^T dP_n(x) = \frac12\int\!\!\int (x-y)(x-y)^T dP_n(x)\, dP_n(y) = n^{-1}\sum_{i=1}^n (X_i - \bar X_n)(X_i - \bar X_n)^T. $$

Example 1.11 $k$-means clustering functional: $T(P) = (T_1(P),\ldots,T_k(P))$ where the $T_i(P)$'s minimize
$$ \int \big(|x - t_1|^2 \wedge \cdots \wedge |x - t_k|^2\big)\, dP(x) = \sum_{i=1}^k \int_{C_i} |x - t_i|^2\, dP(x), $$
where $C_i = \{x \in R^m : t_i \text{ minimizes } |x - t|^2 \text{ over } \{t_1,\ldots,t_k\}\}$. Then $T(P_n) = (T_1(P_n),\ldots,T_k(P_n))$ where the $T_i(P_n)$'s minimize $\int (|x - t_1|^2 \wedge \cdots \wedge |x - t_k|^2)\, dP_n(x)$.

Example 1.12 The simplicial depth function: for $P$ on $R^k$ and $x \in R^k$, set $T(P) \equiv T(P)(x) = \Pr_P(x \in S(X_1,\ldots,X_{k+1}))$ where $X_1,\ldots,X_{k+1}$ are i.i.d. $P$ and $S(x_1,\ldots,x_{k+1})$ is the simplex in $R^k$ determined by $x_1,\ldots,x_{k+1}$; e.g. for $k = 2$, the simplex determined by $x_1, x_2, x_3$ is just a triangle. Then $T(P_n) = \Pr_{P_n}(x \in S(X_1,\ldots,X_{k+1}))$. Note that in this example $T(P)$ is a function from $R^k$ to $R$.
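As a small numerical aside (my own, not from the notes; the data-generating distributions and the pairing convention are my choices), the plug-in estimators of Examples 1.6 and 1.7 reduce to simple operations on pairs:

```python
# Hedged illustration of Examples 1.6-1.7: the Hodges-Lehmann plug-in is the
# median of pairwise averages (X_i + X_j)/2, and the Mann-Whitney plug-in
# T(F_m, G_n) = int F_m dG_n is the proportion of pairs with X_i <= Y_j.
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(200)            # X_1,...,X_m i.i.d. F
y = rng.standard_normal(300) + 0.5      # Y_1,...,Y_n i.i.d. G

def hodges_lehmann(x):
    # median over all pairs (i, j), here including i = j, of (X_i + X_j)/2;
    # this matches (1/2){F_n * F_n}^{-1}(1/2) with * denoting convolution
    pairs = (x[:, None] + x[None, :]) / 2.0
    return np.median(pairs)

def mann_whitney(x, y):
    # T(F_m, G_n) = (1/(m n)) #{(i, j): X_i <= Y_j} = int F_m dG_n
    return np.mean(x[:, None] <= y[None, :])

print("Hodges-Lehmann:", hodges_lehmann(x))
print("Mann-Whitney estimate of P(X <= Y):", mann_whitney(x, y))
```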

5 . ESTIMATES AS FUNCTIONALS OF F N OR P N 5 Example.3 (Z-functional derived from likelihood). A maximum likelihood estimator: for P on (X, A), suppose that P = {P θ : θ Θ R k } is a regular parametric model with vector scores function l θ ( ; θ). Then for general P, not necessarily in the model P, consider T defined by () l θ (x; T (P ))dp (x) = 0. Then l θ (x; T (P n ))dp n (x) =0 defines T (P n ). For estimation of location in one dimension with l(x; θ) =ψ(x θ) and ψ f /f, these become ψ(x T (F ))df (x) = 0 and ψ(x T (F n ))df n (x) =0. We expect that often the value T (P ) Θ satisfying () also satisfies T (P ) = argmin θ Θ K(P, P θ ). Here is a heuristic argument showing why this should be true: Note that for many cases we have ˆθ n = argmax θ n l n (θ) = argmax θ P n (log θ) argmax θ P (log θ) = argmax θ log p θ (x)dp (x). p Now ( ) pθ P (log p θ ) = P (log p)+p log p ( ) p = P (log p) P log p θ = P (log p) K(P, P θ ). Thus argmax θ log p θ (x)dp (x) = argmin θ K(P, P θ ) θ(p ). If we can interchange differentiation and integration it follows that θ K(P, P θ )= p(x) l θ (x; θ)dµ(x) = l θ (x; θ)dp (x), so the relation () is obtained by setting this gradient vector equal to 0. Example.4 A bootstrap functional: let T (F ) be a functional with estimator T (F n ), and consider estimating the distribution function of n(t (F n ) T (F )), H n (F ; ) =P F ( n(t (F n ) T (F )) ). A natural estimator is H n (F n, ).
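Example 1.14 is the basis of the nonparametric bootstrap. The following sketch is mine, not from the notes; it assumes $T$ is the median and standard normal data, and approximates $H_n(F_n,\cdot)$ by resampling from $F_n$.

```python
# Bootstrap estimate H_n(F_n, .) of the distribution of sqrt(n)(T(F_n) - T(F)):
# resample from F_n (i.e. sample with replacement from the data) and recompute
# sqrt(n)(T(F_n*) - T(F_n)) many times.
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.standard_normal(n)                    # X_1,...,X_n i.i.d. F
T = np.median                                  # the functional applied to a sample

B = 2000
boot = np.empty(B)
for b in range(B):
    xstar = rng.choice(x, size=n, replace=True)   # a draw of size n from F_n
    boot[b] = np.sqrt(n) * (T(xstar) - T(x))      # sqrt(n)(T(F_n*) - T(F_n))

# H_n(F_n, t) evaluated on a grid of t values
t_grid = np.linspace(-3, 3, 7)
H_hat = [(boot <= t).mean() for t in t_grid]
print(dict(zip(np.round(t_grid, 2), np.round(H_hat, 3))))
```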

2. Continuity of Functionals of $F$ or $P$

One of the basic properties of a functional $T$ is continuity (or lack thereof). The sense in which we will want our functionals $T$ to be continuous is in the sense of weak convergence.

Definition 2.1 A. $T: \mathcal{F} \to R$ is weakly continuous at $F_0$ if $F_n \Rightarrow F_0$ implies $T(F_n) \to T(F_0)$. $T: \mathcal{F} \to R$ is weakly lower-semicontinuous at $F_0$ if $F_n \Rightarrow F_0$ implies $\liminf_n T(F_n) \ge T(F_0)$.
B. $T: \mathcal{P} \to R$ is weakly continuous at $P_0 \in \mathcal{P}$ if $P_n \Rightarrow P_0$ implies $T(P_n) \to T(P_0)$.

Example 2.1 $T(F) = \int x\, dF(x)$ is discontinuous at every $F_0$: if $F_n = (1 - n^{-1})F_0 + n^{-1}\delta_{a_n}$, then $F_n \Rightarrow F_0$ since, for bounded continuous $\psi$,
$$ \int \psi\, dF_n = (1 - n^{-1})\int\psi\, dF_0 + n^{-1}\psi(a_n) \to \int \psi\, dF_0. $$
But $T(F_n) = (1 - n^{-1})T(F_0) + n^{-1}a_n \to \infty$ if we choose $a_n$ so that $n^{-1}a_n \to \infty$.

Example 2.2 $T(F) = \frac{1}{1-2\alpha}\int_\alpha^{1-\alpha} F^{-1}(u)\, du$ with $0 < \alpha < 1/2$ is continuous at every $F_0$: $F_n \Rightarrow F_0$ implies that $F_n^{-1}(t) \to F_0^{-1}(t)$ a.e. Lebesgue. Hence
$$ T(F_n) = \frac{1}{1-2\alpha}\int_\alpha^{1-\alpha} F_n^{-1}(u)\, du \to \frac{1}{1-2\alpha}\int_\alpha^{1-\alpha} F_0^{-1}(u)\, du = T(F_0) $$
by the dominated convergence theorem.

Example 2.3 $T(F) = F^{-1}(1/2)$ is continuous at every $F_0$ such that $F_0^{-1}$ is continuous at $1/2$.

Example 2.4 (A lower-semicontinuous functional $T$). Let $T(F) = \mathrm{Var}_F(X) = \int (x - E_F X)^2\, dF(x) = \frac12 E_F(X - X')^2$ where $X, X' \sim F$ are independent; recall Example 1.3. If $F_n \to_d F$, then $\liminf_n T(F_n) \ge T(F)$; this follows from Skorokhod and Fatou.

Here is the basic fact about empirical measures that makes weak continuity of a functional $T$ useful:

Theorem 2.1 (Varadarajan). If $X_1,\ldots,X_n$ are i.i.d. $P$ on a separable metric space $(S,d)$, then $\Pr(P_n \Rightarrow P) = 1$.

Proof. For each fixed bounded and continuous function $\psi$ we have
$$ P_n\psi \equiv \int \psi\, dP_n = \frac1n\sum_{i=1}^n \psi(X_i) \to_{a.s.} P\psi \equiv \int\psi\, dP $$
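A numerical companion to Example 2.1 (my own illustration, with $F_0 = N(0,1)$ and $a_n = n^2$ as assumed choices): the contaminating point mass has vanishing effect on bounded test functions, so $F_n \Rightarrow F_0$, while the mean $T(F_n)$ diverges.

```python
# Example 2.1 demo: F_n = (1 - 1/n) F_0 + (1/n) delta_{a_n} with a_n = n^2.
# E_{F_n} psi -> E_{F_0} psi for a bounded psi (here psi = arctan), but the
# mean of F_n blows up.
import numpy as np
from scipy import stats

grid = np.linspace(-10, 10, 20001)
dx = grid[1] - grid[0]
E_psi_F0 = np.sum(np.arctan(grid) * stats.norm.pdf(grid)) * dx   # Riemann sum for E_{F_0} arctan(X)

for n in [10, 100, 1000, 10000]:
    a_n = float(n**2)                                  # chosen so that a_n / n -> infinity
    T_Fn = (1 - 1/n) * 0.0 + (1/n) * a_n               # mean of F_n (mean of F_0 is 0)
    E_psi_Fn = (1 - 1/n) * E_psi_F0 + (1/n) * np.arctan(a_n)
    print(n, "T(F_n) =", T_Fn, " E_Fn psi =", round(E_psi_Fn, 4), " E_F0 psi =", round(E_psi_F0, 4))
```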

by the ordinary strong law of large numbers. The proof is completed by noting that the collection of bounded continuous functions on a separable metric space $(S,d)$ is itself separable. See Dudley (1989), Sections 11.2 and 11.4.

Combining Varadarajan's theorem with weak continuity of $T$ yields the following simple result.

Proposition 2.1 Suppose that:
(i) $(\mathcal{X},\mathcal{A}) = (S, \mathcal{B}_{Borel})$ where $(S,d)$ is a separable metric space and $\mathcal{B}_{Borel}$ denotes its usual Borel sigma-field.
(ii) $T: \mathcal{P} \to R$ is weakly continuous at $P_0 \in \mathcal{P}$.
(iii) $X_1,\ldots,X_n$ are i.i.d. $P_0$.
Then $T_n \equiv T(P_n) \to_{a.s.} T(P_0)$.

Proof. By Varadarajan's Theorem 2.1, $P_n \Rightarrow P_0$ a.s. Fix $\omega \in A$ with $\Pr(A) = 1$ so that $P_n^\omega \Rightarrow P_0$. Then by weak continuity of $T$, $T(P_n^\omega) \to T(P_0)$.

A difficulty in using this theorem is typically in trying to verify weak continuity of $T$. Weak continuity is a rather strong hypothesis, and many interesting functionals fail to have this type of continuity. The following approach is often useful.

Definition 2.2 Let $\mathcal{F} \subset L_1(P)$ be a collection of integrable functions. Say that $P_n \to P$ with respect to $\mathcal{F}$ if
$$ \|P_n - P\|_{\mathcal{F}} \equiv \sup_{f\in\mathcal{F}} |P_n(f) - P(f)| \to 0. $$
Furthermore, we say that $T: \mathcal{P} \to R$ is continuous with respect to $\mathcal{F}$ if $\|P_n - P\|_{\mathcal{F}} \to 0$ implies that $T(P_n) \to T(P)$.

Definition 2.3 If $\mathcal{F} \subset L_1(P)$ is a collection of integrable functions with $\|P_n - P\|_{\mathcal{F}} \to_{a.s.} 0$, then we say that $\mathcal{F}$ is a Glivenko-Cantelli class for $P$ and write $\mathcal{F} \in GC(P)$.

Theorem 2.2 Suppose that:
(i) $\mathcal{F} \in GC(P)$; i.e. $\|P_n - P\|_{\mathcal{F}} \to_{a.s.} 0$.
(ii) $T$ is continuous with respect to $\mathcal{F}$.
Then $T(P_n) \to_{a.s.} T(P)$.
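For the classical class $\mathcal{F} = \{1_{(-\infty,t]} : t \in R\}$, $\|P_n - P\|_{\mathcal{F}}$ is just the Kolmogorov distance $\sup_t |F_n(t) - F(t)|$, and the Glivenko-Cantelli property can be watched numerically. The sketch below is my own illustration of Definition 2.3, assuming $F = N(0,1)$.

```python
# Classical Glivenko-Cantelli: sup_t |F_n(t) - F(t)| -> 0 a.s.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def kolmogorov_distance(x, cdf):
    """sup_t |F_n(t) - F(t)|, attained at (or just before) the order statistics."""
    xs = np.sort(x)
    n = len(xs)
    F = cdf(xs)
    upper = np.abs(np.arange(1, n + 1) / n - F)   # |F_n(x_(i)) - F(x_(i))|
    lower = np.abs(F - np.arange(0, n) / n)       # |F(x_(i)) - F_n(x_(i)-)|
    return max(upper.max(), lower.max())

for n in [100, 1000, 10000, 100000]:
    x = rng.standard_normal(n)
    print(n, kolmogorov_distance(x, stats.norm.cdf))
```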

3. Metrics for Distribution Functions $F$ and Probability Distributions $P$

We have already encountered the total variation and Hellinger metrics in the course of studying Scheffé's lemma, Bayes estimators, and tests of hypotheses. As we will see, as useful as these metrics are, they are too strong: the empirical measure $P_n$ fails to converge to the true $P$ in either the total variation or Hellinger distance in general. In fact, convergence at rate $\sqrt n$ fails to hold in general even for the Prohorov and dual bounded Lipschitz metrics which we introduce below, as has been shown by Dudley (1969), Kersting (1978), and Bretagnolle and Huber-Carol (1977); also see the remarks in Huber (1981), page 39. Nonetheless, it will be helpful to have in mind some useful metrics for probability measures $P$ and df's $F$, and their properties.

Definition 3.1 The Kolmogorov or supremum metric between two distribution functions $F$ and $G$ is
$$ d_K(F,G) \equiv \|F - G\|_\infty \equiv \sup_{x\in R^k} |F(x) - G(x)|. $$

Definition 3.2 The Lévy metric between two distribution functions $F$ and $G$ is
$$ d_L(F,G) \equiv \inf\{\epsilon > 0 : G(x-\epsilon) - \epsilon \le F(x) \le G(x+\epsilon) + \epsilon \text{ for all } x \in R\}. $$

Definition 3.3 The Prohorov metric between two probability measures $P, Q$ on a metric space $(S,d)$ is
$$ d_{Pr}(P,Q) = \inf\{\epsilon > 0 : P(B) \le Q(B^\epsilon) + \epsilon \text{ for all Borel sets } B\} $$
where $B^\epsilon \equiv \{x : \inf_{y\in B} d(x,y) \le \epsilon\}$.

To define the next metric, for $P, Q$ on a metric space $(S,d)$ and any real-valued function $f$ on $S$, set $\|f\|_L \equiv \sup_{x\ne y} |f(x)-f(y)|/d(x,y)$, denote the usual supremum norm by $\|f\|_\infty \equiv \sup_x |f(x)|$, and finally set $\|f\|_{BL} \equiv \|f\|_L + \|f\|_\infty$.

Definition 3.4 The dual-bounded Lipschitz metric $d_{BL}$ is defined by
$$ d_{BL}(P,Q) \equiv \sup\Big\{\Big|\int f\, dP - \int f\, dQ\Big| : \|f\|_{BL} \le 1\Big\}. $$

Definition 3.5 The total variation metric $d_{TV}$ is defined by
$$ d_{TV}(P,Q) \equiv \sup\{|P(A) - Q(A)| : A \in \mathcal{A}\} = \frac12\int |p - q|\, d\mu $$
where $p \equiv dP/d\mu$, $q \equiv dQ/d\mu$ for some measure $\mu$ dominating both $P$ and $Q$ (e.g. $\mu = P + Q$).

Definition 3.6 The Hellinger metric $H$ is defined by
$$ H^2(P,Q) \equiv \frac12\int \{\sqrt{p} - \sqrt{q}\}^2 d\mu = 1 - \int\sqrt{pq}\, d\mu = 1 - \rho(P,Q) $$
where $\mu$ is any measure dominating both $P$ and $Q$. The quantity $\rho(P,Q) \equiv \int\sqrt{pq}\, d\mu$ is called the affinity between $P$ and $Q$.

The following basic theorem establishes relationships between these metrics:
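For discrete $P$ and $Q$ the integrals in Definitions 3.1, 3.5, and 3.6 reduce to finite sums, so the metrics can be computed exactly. The following sketch is my own (the two binomial distributions are an arbitrary choice); it also checks the inequality of part B of the theorem stated next.

```python
# Kolmogorov, total variation and Hellinger distances between
# P = Binomial(10, 0.5) and Q = Binomial(10, 0.6).
import numpy as np
from scipy import stats

k = np.arange(0, 11)
p = stats.binom.pmf(k, 10, 0.5)
q = stats.binom.pmf(k, 10, 0.6)

d_K = np.max(np.abs(np.cumsum(p) - np.cumsum(q)))      # sup_x |F(x) - G(x)|
d_TV = 0.5 * np.sum(np.abs(p - q))                      # (1/2) int |p - q| d mu
H2 = 0.5 * np.sum((np.sqrt(p) - np.sqrt(q))**2)         # Hellinger^2 = 1 - affinity
rho = np.sum(np.sqrt(p * q))                            # affinity

print("d_K =", d_K, " d_TV =", d_TV, " H^2 =", H2, " rho =", rho)
# sanity check of the relation H^2 <= d_TV <= H * sqrt(2 - H^2)
H = np.sqrt(H2)
print(H2 <= d_TV <= H * np.sqrt(2 - H2))
```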

Theorem 3.1
A. $d_{Pr}(P,Q)^2 \le d_{BL}(P,Q) \le 2\, d_{Pr}(P,Q)$.
B. $H^2(P,Q) \le d_{TV}(P,Q) \le H(P,Q)\{2 - H^2(P,Q)\}^{1/2}$.
C. $d_{Pr}(P,Q) \le d_{TV}(P,Q)$.
D. For distributions $P, Q$ on the real line, $d_L \le d_K \le d_{TV}$.

Proof. We proved B in Chapter 2. For A, see Dudley (1989), Section 11.3, problem 5, and Section 11.6, Corollary 11.6.5. Also see Huber (1981), Corollary 2.4.3, page 33. Another useful reference is Whitt (1974).

Theorem 3.2 (Strassen). The following are equivalent:
(a) $d_{Pr}(P,Q) \le \epsilon$.
(b) There exist $X \sim P$, $Y \sim Q$ defined on a common probability space $(\Omega, \mathcal{F}, \Pr)$ such that $\Pr(d(X,Y) > \epsilon) \le \epsilon$.

Proof. That (b) implies (a) is easy: for any Borel set $B$,
$$ [X \in B] = [X\in B,\, d(X,Y)\le\epsilon] \cup [X\in B,\, d(X,Y) > \epsilon] \subset [Y \in B^\epsilon] \cup [d(X,Y) > \epsilon], $$
so that $P(B) \le Q(B^\epsilon) + \epsilon$. For the proof of (a) implies (b), see Strassen (1965), Dudley (1968), or Schay (1974). A nice treatment of Strassen's theorem is given by Dudley (1989).

4. Differentiability of Functionals $T$ of $F$ or $P$

To be able to prove more than consistency, we will need stronger properties of the functional $T$, namely differentiability.

Definition 4.1 $T$ is Gateaux differentiable at $F$ if there exists a linear functional $\dot T(F;\cdot)$ such that for $F_t = (1-t)F + tG$,
$$ \lim_{t\downarrow 0}\frac{T(F_t) - T(F)}{t} = \dot T(F; G - F) = \int \psi(x)\, d(G - F)(x) = \int \psi_F(x)\, dG(x), $$
where $\psi_F(x) \equiv \psi(x) - \int\psi\, dF$ has mean zero under $F$. Or, $T: \mathcal{P} \to R$ is Gateaux differentiable at $P$ if there exists $\dot T(P;\cdot)$ bounded and linear such that for $P_t \equiv (1-t)P + tQ$,
$$ \lim_{t\downarrow 0}\frac{T(P_t) - T(P)}{t} = \dot T(P; Q - P) = \int \psi(x)\, d(Q - P)(x) = \int \psi_P(x)\, dQ(x). $$

Definition 4.2 $T$ has the influence function or influence curve $IC(x; T, F)$ at $F$ if, with $F_t \equiv (1-t)F + t\delta_x$,
$$ \lim_{t\downarrow 0}\frac{T(F_t) - T(F)}{t} = IC(x; T, F) = \psi_F(x). $$

Example 4.1 Probability of a set: suppose that $T(F) = F(A)$ for a fixed measurable set $A$. Then
$$ \frac{T(F_t) - T(F)}{t} = \int \Big\{1_A(x) - \int 1_A(y)\, dF(y)\Big\}\, dG(x) = \int \psi_F(x)\, dG(x) $$
where $\psi_F(x) = 1_A(x) - F(A)$.

Example 4.2 The mean: $T(F) = \int x\, dF(x)$. Then
$$ \frac{T(F_t) - T(F)}{t} = \int \{x - T(F)\}\, dG(x) = \int \psi_F(x)\, dG(x) $$
where $\psi_F(x) = x - T(F)$. Note that the influence function $\psi_F(x)$ for the probability functional is bounded, but the influence function $\psi_F(x)$ for the mean functional is unbounded.

Example 4.3 The variance: $T(F) = \mathrm{Var}_F(X) = \int (x - \mu(F))^2\, dF(x)$. Now
$$ \frac{d}{dt}T(F_t)\Big|_{t=0} = \frac{d}{dt}\int (x - \mu(F_t))^2\, dF_t(x)\Big|_{t=0}
= \int (x - \mu(F))^2\, d(G-F)(x) + 2\int (x - \mu(F))(-1)\,\dot\mu(F; G-F)\, dF(x) $$
$$ = \int (x - \mu(F))^2\, d(G-F)(x) = \int \{(x - \mu(F))^2 - \sigma_F^2\}\, dG(x). $$
Hence $IC(x; T, F) = \psi_F(x) = (x - \mu(F))^2 - \sigma_F^2$.
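The Gateaux derivative along the direction $\delta_x - F$ (Definition 4.2) can be approximated numerically by the difference quotient with a small $t$, replacing $F$ by $F_n$. The sketch below is my own illustration (the helper names are hypothetical); it recovers the influence functions of Examples 4.2 and 4.3.

```python
# Numerical influence curve: (T((1-t) F_n + t delta_x) - T(F_n)) / t for small t,
# implemented by reweighting the sample. For the mean this is x - mean; for the
# (plug-in) variance it is (x - mean)^2 - var.
import numpy as np

rng = np.random.default_rng(4)
data = rng.standard_normal(5000)

def weighted_mean(x, w):
    return np.sum(w * x)

def weighted_var(x, w):
    m = weighted_mean(x, w)
    return np.sum(w * (x - m)**2)

def empirical_influence(T, data, x, t=1e-4):
    n = len(data)
    xs = np.append(data, x)                      # support of (1-t) F_n + t delta_x
    w = np.append(np.full(n, (1 - t) / n), t)    # weights of the contaminated df
    w0 = np.append(np.full(n, 1.0 / n), 0.0)     # weights of F_n itself
    return (T(xs, w) - T(xs, w0)) / t

for x0 in [-2.0, 0.0, 3.0]:
    ic_mean = empirical_influence(weighted_mean, data, x0)
    ic_var = empirical_influence(weighted_var, data, x0)
    print(x0, round(ic_mean, 3), round(ic_var, 3),
          "(theory:", round(x0 - data.mean(), 3), round((x0 - data.mean())**2 - data.var(), 3), ")")
```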

Example 4.4 $T(F) = F^{-1}(1/2)$, and suppose that $F$ has density $f$ which is positive at $F^{-1}(1/2)$. Then, with $F_t \equiv (1-t)F + tG$,
$$ \frac{d}{dt}T(F_t)\Big|_{t=0} = \frac{d}{dt}F_t^{-1}(1/2)\Big|_{t=0}. $$
Note that $F_t(F_t^{-1}(1/2)) = 1/2$, and hence
$$ 0 = \frac{d}{dt}F_t(F_t^{-1}(1/2))\Big|_{t=0} = \frac{d}{dt}\{F(F_t^{-1}(1/2)) + t(G-F)(F_t^{-1}(1/2))\}\Big|_{t=0}
= f(F^{-1}(1/2))\,\dot T(F; G-F) + (G-F)(F^{-1}(1/2)), $$
so that
$$ \dot T(F; G-F) = -\frac{(G-F)(F^{-1}(1/2))}{f(F^{-1}(1/2))} = -\int \frac{1_{(-\infty,F^{-1}(1/2)]}(x) - 1/2}{f(F^{-1}(1/2))}\, dG(x). $$
Hence
$$ \psi_F(x) = IC(x; T, F) = -\frac{1}{f(F^{-1}(1/2))}\{1_{(-\infty,F^{-1}(1/2)]}(x) - 1/2\}. $$

Example 4.5 The $p$-th quantile, $T(F) = F^{-1}(p)$. By a calculation similar to that for the median,
$$ \dot T(F; G-F) = -\frac{(G-F)(F^{-1}(p))}{f(F^{-1}(p))} = -\int \frac{1_{(-\infty,F^{-1}(p)]}(x) - p}{f(F^{-1}(p))}\, dG(x) $$
and
$$ \psi_F(x) = IC(x; T, F) = -\frac{1}{f(F^{-1}(p))}\{1_{(-\infty,F^{-1}(p)]}(x) - p\}. $$

Now we need to consider other types of derivatives; in particular, the stronger notions of derivative which we will discuss below are those of Fréchet and Hadamard.

Definition 4.3 A functional $T: \mathcal{F} \to R$ is Fréchet differentiable at $F \in \mathcal{F}$ with respect to $d_*$ if there exists a continuous linear functional $\dot T(F;\cdot)$ from finite signed measures to $R$ such that
(1)   $\dfrac{|T(G) - T(F) - \dot T(F; G - F)|}{d_*(G,F)} \to 0 \quad\text{as } d_*(F,G) \to 0.$

Here are some properties of Fréchet differentiation:

Theorem 4.1 Suppose that $d_*$ is a metric for weak convergence (i.e. the Lévy metric for df's on the line; or the Prohorov or dual-bounded-Lipschitz metric for measures on a metric space $(S,d)$). Then:
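The influence function of Example 4.4 has variance $1/(4 f(F^{-1}(1/2))^2)$, which equals $\pi/2$ for the standard normal. A quick Monte Carlo check (my own sketch, assuming $F = N(0,1)$):

```python
# Check that n * Var(sample median) is close to 1 / (4 f(0)^2) = pi/2 for N(0,1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, reps = 400, 5000
meds = np.array([np.median(rng.standard_normal(n)) for _ in range(reps)])
emp_var = n * meds.var()
theory = 0.25 / stats.norm.pdf(0.0)**2           # = pi/2, about 1.5708
print("empirical n*Var(median):", round(emp_var, 3), " theory:", round(theory, 4))
```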

A. If $\dot T$ exists in the Fréchet sense, then it is unique, and $T$ is Gateaux differentiable with Gateaux derivative $\dot T$.
B. If $T$ is Fréchet differentiable at $F$, then $T$ is continuous at $F$.
C. $\dot T(F; G - F) = \int \psi\, d(G - F) = \int\big(\psi - \int\psi\, dF\big)\, dG$, where the function $\psi$ is bounded and continuous.

Proof. See Huber (1981), Proposition 5.1, page 37.

Fréchet differentiability leads to an easy proof of asymptotic normality if the metric $d_*$ is compatible with the empirical df or empirical measure.

Theorem 4.2 Suppose that $T$ is Fréchet differentiable at $F$ with respect to $d_*$ and that
(2)   $\sqrt{n}\, d_*(F_n, F) = O_p(1).$
Then
$$ \sqrt{n}(T(F_n) - T(F)) = \int \psi_F\, d\{\sqrt{n}(F_n - F)\} + o_p(1) = \frac{1}{\sqrt n}\sum_{i=1}^n \psi_F(X_i) + o_p(1) \to_d N(0, E\psi_F^2(X)). $$

Proof. By Fréchet differentiability of $T$ at $F$,
$$ \sqrt{n}(T(F_n) - T(F)) = \sqrt n\int \psi_F\, dF_n + \sqrt{n}\, o(d_*(F_n,F))
= \sqrt n\int \psi_F\, dF_n + \frac{o(d_*(F_n,F))}{d_*(F_n,F)}\,\sqrt n\, d_*(F_n,F)
= \sqrt n\int \psi_F\, dF_n + o(1)\, O_p(1) $$
by (2) and (1).

Note that if $d_*$ is the Lévy metric $d_L$ or the Kolmogorov metric $d_K$ on the line, then (2) is satisfied:
$$ \sqrt n\, d_L(F_n,F) \le \sqrt n\, d_K(F_n,F) = \sqrt n\,\|F_n - F\|_\infty = \|U_n(F)\|_\infty \to_d \|U(F)\|_\infty. $$
Unfortunately, if $d_* = d_{Pr}$ or $d_* = d_{BL}$, then $\sqrt n\, d_*(F_n,F)$ is not $O_p(1)$ in general; see Dudley (1969), Kersting (1978), and Bretagnolle and Huber-Carol (1977). Thus we are led to consideration of other metrics, such as the Kolmogorov metric and generalizations thereof, for problems concerning functionals $T(P)$ of probability distributions $P$.

While some functionals $T$ are Fréchet differentiable with respect to the supremum or Kolmogorov metric, we can make more functionals differentiable by considering a somewhat weaker notion of differentiability as follows:
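Condition (2) with $d_* = d_K$ is just the statement that the Kolmogorov-Smirnov statistic is $O_p(1)$. The simulation below is my own sketch; it compares the quantiles of $\sqrt n\,\|F_n - F\|_\infty$ with those of the limiting distribution of $\|U\|_\infty$, which is available in scipy as kstwobign.

```python
# sqrt(n) ||F_n - F|| is O_p(1): simulate it for uniform(0,1) data and compare
# with the classical Kolmogorov limiting distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, reps = 1000, 2000
stat = np.empty(reps)
for r in range(reps):
    u = np.sort(rng.uniform(size=n))            # F = uniform(0,1) without loss of generality
    grid = np.arange(1, n + 1) / n
    # D_n = max(D_n^+, D_n^-), evaluated at the order statistics
    stat[r] = np.sqrt(n) * max(np.max(grid - u), np.max(u - (grid - 1/n)))

for q in [0.5, 0.9, 0.99]:
    print(q, round(np.quantile(stat, q), 3), round(stats.kstwobign.ppf(q), 3))
```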

Definition 4.4 A functional $T: \mathcal{F} \to R$ is Hadamard differentiable at $F$ with respect to the Kolmogorov distance $d_K = \|\cdot\|_\infty$ (or compactly differentiable with respect to $d_K$) if there exists $\dot T(F;\cdot)$ continuous and linear satisfying
$$ \frac{|T(F_t) - T(F) - \dot T(F; F_t - F)|}{t} = o(1) $$
for all $\{F_t\}$ satisfying $\|t^{-1}(F_t - F) - \Delta\|_\infty \to 0$ for some function $\Delta$.

The motivation for this definition is simply that we can write
$$ \sqrt n(T(F_n) - T(F)) = \frac{T\big(F + n^{-1/2}\,\sqrt n(F_n - F)\big) - T(F)}{n^{-1/2}} $$
where $\sqrt n(F_n - F) \stackrel{d}{=} U_n(F) \Rightarrow U(F)$. Hence we can easily deduce the following theorem.

Theorem 4.3 Suppose that $T: \mathcal{F} \to R$ is Hadamard differentiable at $F$ with respect to $\|\cdot\|_\infty$. Then
$$ \sqrt n(T(F_n) - T(F)) \to_d N\Big(0,\, E\big[\dot T\big(F;\, 1_{(-\infty,\cdot]}(X) - F\big)^2\big]\Big). $$
Moreover,
$$ \sqrt n(T(F_n) - T(F)) - \dot T\big(F;\, \sqrt n(F_n - F)\big) = o_p(1). $$

Proof. This is easily proved using a Skorokhod construction of the empirical process, or by the extended continuous mapping theorem. Gill (1989) used the Skorokhod approach; Wellner (1989) pointed out the extended continuous mapping proof.

One way of treating all the kinds of differentiability we have discussed so far is as follows. Define
$$ \mathrm{Rem}(F + th) \equiv T(F_t) - T(F) - \dot T(F; F_t - F); $$
here $h = t^{-1}(F_t - F)$. Let $\mathcal{S}$ be a collection of subsets of the metric space $(\mathcal{F}, d_*)$. Then $T$ is $\mathcal{S}$-differentiable at $F$ with derivative $\dot T$ if for all $S \in \mathcal{S}$
$$ \frac{\mathrm{Rem}(F + th)}{t} \to 0 \quad\text{as } t \to 0, \text{ uniformly in } h \in S. $$
Now different choices of $\mathcal{S}$ yield different degrees of goodness of the linear approximation of $T$ by $\dot T$ at $F$. The three most common choices are just those we have discussed:
A. When $\mathcal{S} = \{\text{all singletons of } (\mathcal{F}, d_*)\}$, $T$ is called Gateaux or directionally differentiable.
B. When $\mathcal{S} = \{\text{all compact subsets of } (\mathcal{F}, d_*)\}$, $T$ is called Hadamard or compactly differentiable.
C. When $\mathcal{S} = \{\text{all bounded subsets of } (\mathcal{F}, d_*)\}$, $T$ is called Fréchet (or boundedly) differentiable.

Here is a simple example of a functional $T$ defined on pairs of probability distributions (or, in this case, distribution functions) which is compactly differentiable with respect to the familiar supremum (or uniform or Kolmogorov) norm, but which is not Fréchet differentiable with respect to this norm.

Example 4.6 For distribution functions $F, G$ on $R$, define $T$ by $T(F,G) = \int F\, dG = P(X \le Y)$ where $X \sim F$, $Y \sim G$ are independent. Let $\|F\|_\infty \equiv \sup_x |F(x)| = \|F\|_{\mathcal{F}}$ where $\mathcal{F} = \{1_{(-\infty,t]} : t \in R\}$.

Proposition 4.1
A. $T(F,G)$ is Hadamard differentiable with respect to $\|\cdot\|_\infty$ at every pair of df's $(F,G)$, with derivative $\dot T$ given by
(3)   $\dot T((F,G);\, \alpha, \beta) = \int \alpha\, dG - \int \beta\, dF.$
B. $T(F,G)$ is not Fréchet differentiable with respect to $\|\cdot\|_\infty$.

Proof. The following proof of A is basically Gill's (1989), Lemma 3. For $F_t \to F$ and $G_t \to G$, define $\alpha_t \equiv (F_t - F)/t$ and $\beta_t \equiv (G_t - G)/t$; for Hadamard differentiability we have $\alpha_t \to \alpha$ and $\beta_t \to \beta$ with respect to $\|\cdot\|_\infty$ for some (bounded) functions $\alpha$ and $\beta$. Now
$$ \frac{T(F_t,G_t) - T(F,G)}{t} - \dot T(\alpha_t,\beta_t) = t\int \alpha_t\, d\beta_t = \int \alpha\, d(G_t - G) + \int (\alpha_t - \alpha)\, d(G_t - G). $$
Since $\dot T$ is continuous, it suffices to show that the right side converges to 0. The second term on the right is bounded by
$$ \|\alpha_t - \alpha\|_\infty\Big\{\int dG_t + \int dG\Big\} \le 2\|\alpha_t - \alpha\|_\infty \to 0. $$
Fix $\epsilon > 0$. Since the limit function $\alpha$ in the first term is right-continuous with left limits, there is a step function $\bar\alpha$ with a finite number $m$ of jumps which satisfies $\|\alpha - \bar\alpha\|_\infty < \epsilon$. Thus the first term may be bounded as follows:
$$ \Big|\int \alpha\, d(G_t - G)\Big| \le \Big|\int (\alpha - \bar\alpha)\, d(G_t - G)\Big| + \Big|\int \bar\alpha\, d(G_t - G)\Big|
\le 2\|\alpha - \bar\alpha\|_\infty + \sum_{j=1}^m |\bar\alpha(x_j)|\,\big|(G_t - G)[x_{j-1}, x_j)\big|
\le 2\epsilon + 2m\|\bar\alpha\|_\infty\|G_t - G\|_\infty \to 2\epsilon. $$
Since $\epsilon$ is arbitrary, this completes the proof of A.

Here is the proof of B. If $T$ were Fréchet differentiable, it would have to be true that
(a)   $T(F_n, G_n) - T(F,G) - \dot T(F_n - F,\, G_n - G) = o\big(\|F_n - F\|_\infty \vee \|G_n - G\|_\infty\big)$
for every sequence of pairs of df's $\{(F_n, G_n)\}$ with $\|F_n - F\|_\infty \to 0$ and $\|G_n - G\|_\infty \to 0$. We now exhibit a sequence $\{(F_n, G_n)\}$ for which (a) fails. By straightforward algebra using (3),
(b)   $T(F_n, G_n) - T(F,G) - \dot T(F_n - F,\, G_n - G) = \int (F_n - F)\, d(G_n - G).$

Consider the df's $F_n$ and $G_n$ corresponding to the measures which put masses $1/n$ at $0, 1/n, \ldots, (n-1)/n$ and at $1/n, 2/n, \ldots, 1$ respectively:
$$ F_n = \frac1n\sum_{k=0}^{n-1}\delta_{k/n}, \qquad G_n = \frac1n\sum_{k=1}^{n}\delta_{k/n}. $$
Both of these sequences of df's converge uniformly to the uniform$(0,1)$ df $F(x) \equiv x \equiv G(x)$, and furthermore $\|F_n - F\|_\infty = \|G_n - G\|_\infty = 1/n$. Now
$$ (F_n - F)(x) = \sum_{k=1}^n \Big(\frac{k}{n} - x\Big)\, 1_{[(k-1)/n,\, k/n)}(x), \qquad (F_n - F)(1) = 0, $$
and
$$ (G_n - G)(x) = (F_n - F)(x) - \frac1n = \sum_{k=1}^n \Big(\frac{k-1}{n} - x\Big)\, 1_{[(k-1)/n,\, k/n)}(x) \quad\text{for } x\in[0,1), $$
with $(G_n - G)(1) = 0$. Thus, separating $G_n - G$ into its discrete and continuous parts,
$$ \int (F_n - F)\, d(G_n - G) = \frac1n\sum_{k=1}^n (F_n - F)(k/n) - \int_0^1 (F_n - F)(t)\, dt, $$
and computing the two terms shows that this is of exact order $1/(2n)$, so that
$$ \int (F_n - F)\, d(G_n - G) = O(1/n) \ne o\big(\|F_n - F\|_\infty \vee \|G_n - G\|_\infty\big) = o(1/n). $$
Hence (a) fails and $T$ is not Fréchet differentiable.

Remark 4.1 The previous example was suggested by R. M. Dudley. Dudley (1992), (1994) has studied other metrics, based on $p$-variation norms, for which this $T$ is almost Fréchet differentiable, and for which some functionals may be Fréchet differentiable even though Fréchet differentiability with respect to $\|\cdot\|_\infty$ may fail.

A particular refinement of Hadamard differentiability which is very useful is as follows: since the limiting $P$-Brownian bridge process $G_P$ of the empirical process $G_n \equiv \sqrt n(P_n - P)$ is in $C_u(\mathcal{F},\rho_P)$ with probability one for any $\mathcal{F} \in CLT(P)$, we say that $T$ is Hadamard differentiable tangentially to $C_u(\mathcal{F},\rho_P)$ at $P \in \mathcal{P}$ if there is a continuous linear map $\dot T: C_u(\mathcal{F},\rho_P) \to B$ so that
$$ \frac{T(P_t) - T(P)}{t} \to \dot T(\Delta_0) $$
holds for any path $\{P_t\}$ such that $\Delta_t \equiv (P_t - P)/t$ satisfies $\|\Delta_t - \Delta_0\|_{\mathcal{F}} \to 0$ with $\Delta_0 \in C_u(\mathcal{F},\rho_P)$. Then a nice version of the delta method for nonlinear functionals $T$ of $P_n$ is given by the following theorem:
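The order-$1/(2n)$ claim can also be checked numerically. The sketch below is my own (not from the notes): it computes $T(F_n,G_n) - T(F,G) - \dot T(F_n - F,\, G_n - G)$ directly for the sequence above, approximating the derivative term by Riemann sums, and shows that $n$ times the remainder does not tend to zero.

```python
# Remainder int (F_n - F) d(G_n - G) for the counterexample: it is O(1/n),
# roughly 1/(2n), and hence not o(1/n).
import numpy as np

def remainder(n):
    x = np.arange(0, n) / n            # support of F_n: 0, 1/n, ..., (n-1)/n
    y = np.arange(1, n + 1) / n        # support of G_n: 1/n, ..., 1
    # T(F_n, G_n) = (1/n^2) #{(i, j): x_i <= y_j};  T(F, G) = 1/2 for F = G = U(0,1)
    T_n = np.mean(x[:, None] <= y[None, :])
    # derivative term dT = int (F_n - F) dG - int (G_n - G) dF with dF = dG = dt,
    # approximated by a Riemann sum on a fine grid of [0, 1)
    t = np.linspace(0, 1, 200001)[:-1]
    Fn = np.searchsorted(x, t, side="right") / n
    Gn = np.searchsorted(y, t, side="right") / n
    dT = np.mean(Fn - t) - np.mean(Gn - t)
    return T_n - 0.5 - dT

for n in [10, 50, 100, 500]:
    r = remainder(n)
    print(n, round(r, 6), " n * remainder =", round(n * r, 4), " 1/(2n) =", round(1/(2*n), 6))
```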

Theorem 4.4 Suppose that:
(i) $T$ is Hadamard differentiable tangentially to $C_u(\mathcal{F},\rho_P)$ at $P \in \mathcal{P}$.
(ii) $\mathcal{F} \in CLT(P)$: $\sqrt n(P_n - P) \Rightarrow G_P$ (where $G_P$ takes values in $C_u(\mathcal{F},\rho_P)$ by definition of $\mathcal{F} \in CLT(P)$).
Then
(4)   $\sqrt n(T(P_n) - T(P)) \Rightarrow \dot T(G_P).$

Proof. Define $g_n: \ell^\infty(\mathcal{F}) \to B$ by $g_n(x) \equiv \sqrt n(T(P + n^{-1/2}x) - T(P))$. Then, by (i), for $\{\Delta_n\} \subset \ell^\infty(\mathcal{F})$ with $\|\Delta_n - \Delta_0\|_{\mathcal{F}} \to 0$ and $\Delta_0 \in C_u(\mathcal{F},\rho_P)$,
$$ g_n(\Delta_n) \to \dot T(\Delta_0) \equiv g(\Delta_0). $$
Thus by the extended continuous mapping theorem in the Hoffmann-Jorgensen weak convergence theory (see van der Vaart and Wellner (1996), Theorem 1.11.1, page 67), $g_n(G_n) \Rightarrow g(G_P) = \dot T(G_P)$, and hence (4) holds.

The immediate corollary for the classical Mann-Whitney form of the Wilcoxon statistic given in Example 4.6 is:

Corollary 1 If $X_1,\ldots,X_m$ are i.i.d. $F$ and independent of $Y_1,\ldots,Y_n$ which are i.i.d. $G$, $0 < P_{F,G}(X \le Y) < 1$, and $\lambda_N \equiv m/N \equiv m/(m+n) \to \lambda \in (0,1)$, then
$$ \sqrt{\frac{mn}{N}}\Big\{\int F_m\, dG_n - \int F\, dG\Big\} = \sqrt{\frac{mn}{N}}\{T(F_m,G_n) - T(F,G)\}
\Rightarrow \sqrt{1-\lambda}\int U(F)\, dG - \sqrt{\lambda}\int V(G)\, dF \sim N(0, \sigma^2_\lambda(F,G)) $$
where $U$ and $V$ are two independent Brownian bridge processes and
$$ \sigma^2_\lambda(F,G) = (1-\lambda)\,\mathrm{Var}(G(X)) + \lambda\,\mathrm{Var}(F(Y)). $$

This is, of course, well known, and can be proved in a variety of other ways (by treating $T(F_m,G_n)$ as a two-sample U-statistic, or a rank statistic, or by a direct analysis), but the proof via the differentiable-functional approach seems instructive and useful. (See e.g. Lehmann (1975), Statistical Methods Based on Ranks, Section 5, and especially Example 20, page 365.) Other interesting applications have been given by Grübel (1988), who studies the asymptotic theory of the length of the shorth; Pons and Turckheim (1989), who study bivariate hazard estimators and tests of independence based thereon; and Gill and Johansen (1990), who prove Hadamard differentiability of the product integral. Gill, van der Laan, and Wellner (1992) give applications to several problems connected with estimation of bivariate distributions. Arcones and Giné (1990) study the delta method in connection with M-estimation and the bootstrap. van der Vaart (1991b) shows that Hadamard differentiable functionals preserve asymptotic efficiency properties of estimators.
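A Monte Carlo check of the Corollary (my own sketch, taking $F = G = N(0,1)$ so that $\mathrm{Var}(G(X)) = \mathrm{Var}(F(Y)) = 1/12$ and $\sigma^2_\lambda = 1/12$ for every $\lambda$):

```python
# sqrt(mn/N) (T(F_m, G_n) - 1/2) should have variance close to 1/12 when F = G.
import numpy as np

rng = np.random.default_rng(7)
m, n, reps = 150, 250, 2000
N = m + n
vals = np.empty(reps)
for r in range(reps):
    x = rng.standard_normal(m)
    y = rng.standard_normal(n)
    T_mn = np.mean(x[:, None] <= y[None, :])    # int F_m dG_n
    vals[r] = np.sqrt(m * n / N) * (T_mn - 0.5)

print("empirical variance:", round(vals.var(), 4), " theory 1/12 =", round(1/12, 4))
```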

5. Higher Order Derivatives

The following example illustrates the phenomena which we want to consider here in the simplest possible setting:

Example 5.1 Suppose that $X_1, X_2, \ldots, X_n$ are i.i.d. Bernoulli$(p)$. Then $\sqrt n(\bar X_n - p) \to_d Z \sim N(0, p(1-p))$, and, if $g(p) = p(1-p)$,
$$ \sqrt n(g(\bar X_n) - g(p)) \to_d g'(p)Z = (1-2p)Z \sim N(0, p(1-p)(1-2p)^2) $$
by the delta method (or "g-prime theorem"). But if $p = 1/2$, since $g'(1/2) = 0$ this yields only $\sqrt n(g(\bar X_n) - 1/4) \to_d 0$. Thus we need to study the higher derivatives of $g$ at $1/2$. Since $g$ is, in fact, a quadratic, we have
$$ g(p) = g(1/2) + 0\cdot(p - 1/2) + \frac{1}{2!}(-2)(p - 1/2)^2 = 1/4 - (p - 1/2)^2. $$
Thus
$$ n(g(\bar X_n) - g(1/2)) = -n(\bar X_n - 1/2)^2 \to_d -Z^2 = -\tfrac14\chi^2_1, $$
where $Z \sim N(0,1/4)$. This is a very simple example of a more general limit theorem which we will develop below.

Now consider a functional $T: \mathcal{F} \to R$ as in Sections 1-4.

Definition 5.1 $T$ is $k$-th order Gateaux differentiable at $F$ if, with $F_t = F + t(G - F)$,
$$ d_kT(F; G - F) \equiv \frac{d^k}{dt^k}T(F_t)\Big|_{t=0} $$
exists. Note that $\dot T(F; G - F) = d_1T(F; G - F)$ if it exists. It is usually the case that
$$ d_kT(F; G - F) = \int\!\cdots\!\int \psi_k(x_1,\ldots,x_k)\, d(G-F)(x_1)\cdots d(G-F)(x_k) = \int\!\cdots\!\int \psi_{k,F}(x_1,\ldots,x_k)\, dG(x_1)\cdots dG(x_k); $$
here the function $\psi_{k,F}$ is determined from $\psi_k$ by a straightforward centering recipe:
$$ \psi_{1,F}(x_1) \equiv \psi_1(x_1) - \int\psi_1\, dF, $$
$$ \psi_{2,F}(x_1,x_2) \equiv \psi_2(x_1,x_2) - \int\psi_2(x_1,x_2)\, dF(x_1) - \int\psi_2(x_1,x_2)\, dF(x_2) + \int\!\!\int\psi_2(x_1,x_2)\, dF(x_1)\, dF(x_2), $$
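Example 5.1 is easy to verify by simulation. The sketch below is my own: at $p = 1/2$ the statistic $n(g(\bar X_n) - 1/4)$ behaves like $-(1/4)\chi^2_1$.

```python
# Second-order delta method demo: n (g(Xbar) - 1/4) -> -(1/4) chi^2_1 for
# Bernoulli(1/2) data and g(p) = p(1-p).
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n, reps = 2000, 20000
xbar = rng.binomial(n, 0.5, size=reps) / n
g = xbar * (1 - xbar)
limit_sample = n * (g - 0.25)                  # should look like -(1/4) chi^2_1

for q in [0.1, 0.5, 0.9]:
    emp = np.quantile(limit_sample, q)
    theory = -0.25 * stats.chi2.ppf(1 - q, df=1)   # q-quantile of -(1/4) chi^2_1
    print(q, round(emp, 4), round(theory, 4))
```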

and so forth; see Serfling (1980), Section 6.3.2, Lemma A, page 222. A consequence of this is that we can write
$$ n^{k/2}\, d_kT(F; F_n - F) = n^{k/2}\int\!\cdots\!\int \psi_k(x_1,\ldots,x_k)\, d(F_n - F)(x_1)\cdots d(F_n - F)(x_k)
= n^{k/2}\int\!\cdots\!\int \psi_{k,F}(x_1,\ldots,x_k)\, dF_n(x_1)\cdots dF_n(x_k)
= n^{-k/2}\sum_{i_1=1}^n\cdots\sum_{i_k=1}^n \psi_{k,F}(X_{i_1},\ldots,X_{i_k}). $$
This is exactly $n^{k/2}$ times a V-statistic of order $k$. Also note that, by Taylor's formula for a function of one real variable $t$, we have
$$ T(F_t) - T(F) = \sum_{j=1}^{k}\frac{t^j}{j!}\, d_jT(F; G - F) + \frac{t^{k+1}}{(k+1)!}\,\frac{d^{k+1}}{ds^{k+1}}T(F_s)\Big|_{s=t^*} $$
for some $t^* \in [0,t]$. To analyze the asymptotic behavior of $T$ in terms of $d_1T, d_2T, \ldots$, it is typically the first non-zero term $d_mT$ which dominates.

Serfling's condition $A_m$: suppose that
(i)
$$ \mathrm{Var}_F\big(\psi_{k,F}(X_1,\ldots,X_k)\big) \;\begin{cases} = 0 & \text{for } k < m, \\ > 0 & \text{for } k = m; \end{cases} $$
(ii)
$$ R_{mn} \equiv T(F_n) - T(F) - \frac{1}{m!}\, d_mT(F; F_n - F) \quad\text{satisfies}\quad n^{m/2} R_{mn} = o_p(1). $$
This condition will be invoked first with $m = 1$ and then with $m = 2$ in the following two theorems.

Theorem 5.1 (Serfling's Theorem A). Suppose that $X_1,\ldots,X_n$ are i.i.d. $F$, and suppose that $T$ satisfies $A_1$. Let $\mu(T,F) \equiv E_F\psi_{1,F}(X_1) = 0$ and $\sigma^2(T,F) \equiv \mathrm{Var}(\psi_{1,F}(X_1))$, and suppose that $\sigma^2(T,F) < \infty$. Then
$$ \sqrt n(T(F_n) - T(F)) \to_d N(0, \sigma^2(T,F)). $$

Proof. By (ii) of condition $A_1$,
$$ \sqrt n(T(F_n) - T(F)) = o_p(1) + \sqrt n\, d_1T(F; F_n - F) = o_p(1) + \frac{1}{\sqrt n}\sum_{i=1}^n \psi_{1,F}(X_i) \to_d N(0, \sigma^2(T,F)). $$
This is essentially the same as in our earlier proofs of asymptotic normality using differentiability, but here we are hypothesizing that the remainder term is asymptotically negligible (and thus the real effort in using the theorem will be to verify the second part of $A_1$).

Theorem 5.2 (Serfling's Theorem B). Suppose that $X_1,\ldots,X_n$ are i.i.d. $F$, and suppose that $T$ satisfies $A_2$ with $\psi_{2,F}(x,y) = \psi_{2,F}(y,x)$, $E_F\psi^2_{2,F}(X_1,X_2) < \infty$, $E_F|\psi_{2,F}(X_1,X_1)| < \infty$, and $E_F\psi_{2,F}(x, X_2) = 0$. Define $A: L_2(F) \to L_2(F)$ by
$$ Ag(x) = \int \psi_{2,F}(x,y)\, g(y)\, dF(y), \qquad g \in L_2(F), $$
and let $\{\lambda_k\}$ be the eigenvalues of $A$. Then
$$ n\{T(F_n) - T(F)\} \to_d \frac12\sum_{k=1}^\infty \lambda_k Z_k^2 $$
where $Z_1, Z_2, \ldots$ are i.i.d. $N(0,1)$.

Sketch of the proof. By condition $A_2$ we can write
$$ n(T(F_n) - T(F)) = n\Big\{T(F_n) - T(F) - \frac{1}{2!}\, d_2T(F; F_n - F)\Big\} + \frac{n}{2!}\, d_2T(F; F_n - F)
= o_p(1) + \frac{n}{2}\int\!\!\int \psi_{2,F}(x_1,x_2)\, dF_n(x_1)\, dF_n(x_2). \tag{1} $$
Now denote the orthonormal eigenfunctions of the (Hilbert-Schmidt) operator $A$ by $\{\phi_k\}$ and the corresponding eigenvalues by $\{\lambda_k\}$: thus $A\phi_k = \lambda_k\phi_k$. Then it is well known that
$$ \psi_{2,F}(x,y) = \sum_{k=1}^\infty \lambda_k\phi_k(x)\phi_k(y) \tag{2} $$
in the sense of $L_2(F\times F)$ convergence. Hence
$$ n\int\!\!\int \psi_{2,F}(x_1,x_2)\, dF_n(x_1)\, dF_n(x_2) = n\sum_{k=1}^\infty \lambda_k\Big\{\int\phi_k\, dF_n\Big\}^2 = \sum_{k=1}^\infty \lambda_k\Big\{\frac{1}{\sqrt n}\sum_{i=1}^n\phi_k(X_i)\Big\}^2 \to_d \sum_{k=1}^\infty \lambda_k Z_k^2 \tag{3} $$
where the $Z_k$'s are i.i.d. $N(0,1)$ since $E_F\phi_k(X_i) = 0$, $E_F\phi_k^2(X_i) = 1$, and $E_F\phi_j(X_i)\phi_k(X_i) = 0$ for $j \ne k$. Combining (1) and (3) completes the heuristic proof. The reason that this is heuristic is because of the infinite series appearing in (2) and (3). The complete proof entails consideration of finite sums and the corresponding approximation arguments; see Serfling (1980) for the U-statistic case. But note that the V-statistic argument on page 227 just involves throwing the diagonal terms back in, and is therefore really easier.

Remark 5.1 It seems to me that Serfling's $\mu(T,F) = 0$ as formulated above. It also seems to me that he has missed the factor of $1/2$ appearing in the limit distribution.
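The simplest degenerate case (my own illustration, not in the notes) is $T(F) = (\int x\, dF)^2$ at a mean-zero $F$ with variance $\sigma^2$: then $d_1T(F;\cdot) = 0$, $\psi_{2,F}(x,y) = 2xy$, and the operator $A$ has the single eigenvalue $\lambda_1 = 2\sigma^2$ (eigenfunction $x/\sigma$), so Theorem 5.2 gives $n\,T(F_n) = n\bar X_n^2 \to_d \tfrac12\lambda_1 Z^2 = \sigma^2\chi^2_1$, which also exhibits the factor $1/2$ mentioned in Remark 5.1.

```python
# Degenerate V-statistic demo: n * Xbar^2 -> sigma^2 chi^2_1 for mean-zero data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
sigma = 2.0
n, reps = 500, 10000
xbar = sigma * rng.standard_normal((reps, n)).mean(axis=1)   # sample means, sd(X) = sigma
limit_sample = n * xbar**2                                   # n (T(F_n) - T(F)) with T(F) = 0

for q in [0.5, 0.9, 0.99]:
    print(q, round(np.quantile(limit_sample, q), 3),
          round(sigma**2 * stats.chi2.ppf(q, df=1), 3))
```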

Remark 5.2 Gregory (1977) gives related but stronger results which do not require $E_F\psi_{2,F}(X_1,X_1) = \sum_{k=1}^\infty\lambda_k < \infty$ and which apply to some interesting statistics with $\lambda_k = 1/k$. Note that the infinite series $\sum_{k=1}^\infty k^{-1}(Z_k^2 - 1)$ defines a proper random variable since the summands have mean 0 and variances $2/k^2$; such limits actually arise as limit distributions of the popular Shapiro-Wilk tests for normality; see e.g. DeWet and Venter (1972), (1973).
