7 Influence Functions
The influence function is used to approximate the standard error of a plug-in estimator. The formal definition is as follows.

7.1 Definition. The Gâteaux derivative of $T$ at $F$ in the direction $G$ is defined by

$$L_F(G) = \lim_{\epsilon \to 0} \frac{T((1-\epsilon)F + \epsilon G) - T(F)}{\epsilon}. \qquad (37)$$

If $G = \delta_x$ is a point mass at $x$ then we write $L_F(x) \equiv L_F(\delta_x)$ and we call $L_F(x)$ the influence function. Thus,

$$L_F(x) = \lim_{\epsilon \to 0} \frac{T((1-\epsilon)F + \epsilon \delta_x) - T(F)}{\epsilon}. \qquad (38)$$

The empirical influence function is defined by $\widehat{L}(x) = L_{\widehat{F}_n}(x)$. Thus,

$$\widehat{L}(x) = \lim_{\epsilon \to 0} \frac{T((1-\epsilon)\widehat{F}_n + \epsilon \delta_x) - T(\widehat{F}_n)}{\epsilon}. \qquad (39)$$

Often we drop the subscript $F$ and write $L(x)$ instead of $L_F(x)$.

7.2 Theorem. Let $T(F) = \int a(x)\,dF(x)$ be a linear functional. Then:
1. $L_F(x) = a(x) - T(F)$ and $\widehat{L}(x) = a(x) - T(\widehat{F}_n)$.

2. For any $G$,
$$T(G) = T(F) + \int L_F(x)\,dG(x). \qquad (40)$$

3. $\int L_F(x)\,dF(x) = 0$.

4. Let $\tau^2 = \int L_F^2(x)\,dF(x)$. Then $\tau^2 = \int (a(x) - T(F))^2\,dF(x)$ and, if $\tau^2 < \infty$,
$$\sqrt{n}\,\bigl(T(\widehat{F}_n) - T(F)\bigr) \rightsquigarrow N(0, \tau^2). \qquad (41)$$

5. Let
$$\widehat{\tau}^2 = \frac{1}{n}\sum_{i=1}^n \widehat{L}^2(X_i) = \frac{1}{n}\sum_{i=1}^n \bigl(a(X_i) - T(\widehat{F}_n)\bigr)^2. \qquad (42)$$
Then $\widehat{\tau}^2 \xrightarrow{P} \tau^2$ and $\widehat{\mathrm{se}}/\mathrm{se} \xrightarrow{P} 1$, where $\widehat{\mathrm{se}} = \widehat{\tau}/\sqrt{n}$ and $\mathrm{se} = \sqrt{V(T(\widehat{F}_n))}$.

6. We have that
$$\frac{\sqrt{n}\,\bigl(T(\widehat{F}_n) - T(F)\bigr)}{\widehat{\tau}} \rightsquigarrow N(0, 1). \qquad (43)$$

Proof. The first three claims follow easily from the definition of the influence function. To prove the fourth
claim, write

$$T(\widehat{F}_n) = T(F) + \int L_F(x)\,d\widehat{F}_n(x) = T(F) + \frac{1}{n}\sum_{i=1}^n L_F(X_i).$$

From the central limit theorem and the fact that $\int L_F(x)\,dF(x) = 0$, it follows that $\sqrt{n}\,(T(\widehat{F}_n) - T(F)) \rightsquigarrow N(0, \tau^2)$ where $\tau^2 = \int L_F^2(x)\,dF(x)$. The fifth claim follows from the law of large numbers. The final statement follows from the fourth and fifth claims and Slutsky's theorem.

The theorem above tells us that the influence function $L_F(x)$ behaves like the score function in parametric estimation. To see this, recall that if $f(x; \theta)$ is a parametric model, $\mathcal{L}_n(\theta) = \prod_{i=1}^n f(X_i; \theta)$ is the likelihood function and the maximum likelihood estimator $\widehat{\theta}_n$ is the value of $\theta$ that maximizes $\mathcal{L}_n(\theta)$. The score function is $s_\theta(x) = \partial \log f(x; \theta)/\partial \theta$, which, under appropriate regularity conditions, satisfies $\int s_\theta(x) f(x; \theta)\,dx = 0$ and $V(\widehat{\theta}_n) \approx 1/\bigl(n \int s_\theta^2(x) f(x; \theta)\,dx\bigr)$. Similarly, for the influence function we have that $\int L_F(x)\,dF(x) = 0$ and $V(T(\widehat{F}_n)) \approx \int L_F^2(x)\,dF(x)/n$.
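Theorem 7.2 can be checked numerically. The sketch below is illustrative rather than part of the text: it assumes the linear functional with $a(x) = x^2$ and Uniform(0,1) data, for which $T(F) = \mathbb{E}(X^2) = 1/3$ and $\tau^2 = V(X^2) = 4/45$, and computes the plug-in estimate, the empirical influence values, and $\widehat{\mathrm{se}} = \widehat{\tau}/\sqrt{n}$.

```python
import math
import random

random.seed(0)

def plugin_and_se(xs, a):
    """Plug-in estimate and influence-based standard error for the
    linear functional T(F) = integral of a(x) dF(x) (Theorem 7.2)."""
    n = len(xs)
    t_hat = sum(a(x) for x in xs) / n            # T(F_n)
    infl = [a(x) - t_hat for x in xs]            # empirical influence a(X_i) - T(F_n)
    tau2_hat = sum(u * u for u in infl) / n      # estimate of tau^2 (claim 5)
    return t_hat, math.sqrt(tau2_hat / n)        # (T(F_n), se-hat)

xs = [random.random() for _ in range(100_000)]   # X ~ Uniform(0, 1)
t_hat, se_hat = plugin_and_se(xs, lambda x: x * x)
# For this choice, T(F) = 1/3 and tau^2 = 4/45, so se-hat should be
# close to sqrt((4/45)/n).
```

With $n = 100{,}000$ the estimate lands close to $1/3$ and $\widehat{\mathrm{se}}$ close to $\sqrt{(4/45)/n}$, as the theorem predicts.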
If the functional $T(F)$ is not linear, then (40) will not hold exactly, but it may hold approximately.

7.3 Theorem. If $T$ is Hadamard differentiable² with respect to $d(F, G) = \sup_x |F(x) - G(x)|$, then

$$\sqrt{n}\,\bigl(T(\widehat{F}_n) - T(F)\bigr) \rightsquigarrow N(0, \tau^2) \qquad (44)$$

where $\tau^2 = \int L_F(x)^2\,dF(x)$. Also,

$$\frac{T(\widehat{F}_n) - T(F)}{\widehat{\mathrm{se}}} \rightsquigarrow N(0, 1) \qquad (45)$$

where $\widehat{\mathrm{se}} = \widehat{\tau}/\sqrt{n}$ and

$$\widehat{\tau}^2 = \frac{1}{n}\sum_{i=1}^n \widehat{L}^2(X_i). \qquad (46)$$

We call the approximation $(T(\widehat{F}_n) - T(F))/\widehat{\mathrm{se}} \approx N(0, 1)$ the nonparametric delta method. From the normal approximation, a large-sample confidence interval is $T(\widehat{F}_n) \pm z_{\alpha/2}\,\widehat{\mathrm{se}}$. This is only a pointwise asymptotic confidence interval. In summary:

The Nonparametric Delta Method. A $1 - \alpha$, pointwise asymptotic confidence interval for $T(F)$ is

$$T(\widehat{F}_n) \pm z_{\alpha/2}\,\widehat{\mathrm{se}} \qquad (47)$$

² Hadamard differentiability is defined in the appendix.
where $\widehat{\mathrm{se}} = \widehat{\tau}/\sqrt{n}$ and $\widehat{\tau}^2 = \frac{1}{n}\sum_{i=1}^n \widehat{L}^2(X_i)$.

7.4 Example (The mean). Let $\theta = T(F) = \int x\,dF(x)$. The plug-in estimator is $\widehat{\theta} = \int x\,d\widehat{F}_n(x) = \overline{X}_n$. Also, $T((1-\epsilon)F + \epsilon\delta_x) = (1-\epsilon)\theta + \epsilon x$. Thus, $L(x) = x - \theta$, $\widehat{L}(x) = x - \overline{X}_n$, and $\widehat{\mathrm{se}}^2 = \widehat{\sigma}^2/n$ where $\widehat{\sigma}^2 = n^{-1}\sum_{i=1}^n (X_i - \overline{X}_n)^2$. A pointwise asymptotic nonparametric 95 percent confidence interval for $\theta$ is $\overline{X}_n \pm 2\,\widehat{\mathrm{se}}$.

Sometimes statistical functionals take the form $T(F) = a(T_1(F), \ldots, T_m(F))$ for some function $a(t_1, \ldots, t_m)$. By the chain rule, the influence function is

$$L(x) = \sum_{i=1}^m \frac{\partial a}{\partial t_i}\, L_i(x)$$

where

$$L_i(x) = \lim_{\epsilon \to 0} \frac{T_i((1-\epsilon)F + \epsilon\delta_x) - T_i(F)}{\epsilon}. \qquad (48)$$

7.5 Example (Correlation). Let $Z = (X, Y)$ and let $T(F) = \mathbb{E}\bigl[(X - \mu_X)(Y - \mu_Y)\bigr]/(\sigma_X \sigma_Y)$ denote the correlation, where
$F(x, y)$ is bivariate. Recall that $T(F) = a(T_1(F), T_2(F), T_3(F), T_4(F), T_5(F))$, where

$$T_1(F) = \int x\,dF(z), \quad T_2(F) = \int y\,dF(z), \quad T_3(F) = \int xy\,dF(z), \quad T_4(F) = \int x^2\,dF(z), \quad T_5(F) = \int y^2\,dF(z)$$

and

$$a(t_1, \ldots, t_5) = \frac{t_3 - t_1 t_2}{\sqrt{(t_4 - t_1^2)(t_5 - t_2^2)}}.$$

It follows from (48) that

$$L(x, y) = \tilde{x}\tilde{y} - \frac{1}{2}\,T(F)\,(\tilde{x}^2 + \tilde{y}^2)$$

where

$$\tilde{x} = \frac{x - \int x\,dF}{\sqrt{\int x^2\,dF - \left(\int x\,dF\right)^2}}, \qquad \tilde{y} = \frac{y - \int y\,dF}{\sqrt{\int y^2\,dF - \left(\int y\,dF\right)^2}}.$$

7.6 Example (Quantiles). Let $F$ be strictly increasing with positive density $f$. Let $T(F) = F^{-1}(p)$ be the $p$th quantile. The influence function is (see Exercise 10)

$$L(x) = \begin{cases} \dfrac{p-1}{f(\theta)}, & x \le \theta, \\[6pt] \dfrac{p}{f(\theta)}, & x > \theta, \end{cases}$$

where $\theta = T(F)$. The asymptotic variance of $T(\widehat{F}_n)$ is

$$\frac{\tau^2}{n} = \frac{1}{n}\int L^2(x)\,dF(x) = \frac{p(1-p)}{n f^2(\theta)}. \qquad (49)$$
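Formula (49) can be checked by simulation. The sketch below is an illustration, not part of the text: it assumes the median ($p = 1/2$) of a $N(0,1)$ distribution, $n = 101$, and 2000 replications, and compares the Monte Carlo variance of the sample median with $p(1-p)/(nf^2(\theta))$, using the true density (known here because we are simulating).

```python
import math
import random
import statistics

random.seed(1)

n, reps, p = 101, 2000, 0.5
f_theta = 1.0 / math.sqrt(2.0 * math.pi)      # N(0,1) density at its median theta = 0
asym_var = p * (1 - p) / (n * f_theta ** 2)   # formula (49): p(1-p) / (n f^2(theta))

# Monte Carlo variance of the sample median over many replications
medians = [statistics.median(random.gauss(0.0, 1.0) for _ in range(n))
           for _ in range(reps)]
mc_var = statistics.variance(medians)
# mc_var should be close to asym_var, which is roughly pi/(2n) here
```

The two variances agree to within Monte Carlo error, which is the content of (49) for this functional.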
To estimate this variance we need to estimate the density $f$. Later we shall see that the bootstrap provides a simpler estimate of the variance.

8 Empirical Probability Distributions

This section discusses a generalization of the DKW inequality. The reader may skip this section if desired.

Using the empirical cdf to estimate the true cdf is a special case of a more general idea. Let $X_1, \ldots, X_n \sim P$ be an iid sample from a probability measure $P$. Define the empirical probability distribution $\widehat{P}_n$ by

$$\widehat{P}_n(A) = \frac{\#\{i : X_i \in A\}}{n}. \qquad (50)$$

We would like to be able to say that $\widehat{P}_n$ is close to $P$ in some sense. For a fixed $A$ we know that $n\widehat{P}_n(A) \sim \mathrm{Binomial}(n, p)$ where $p = P(A)$. By Hoeffding's inequality, it follows that

$$P\left(|\widehat{P}_n(A) - P(A)| > \epsilon\right) \le 2 e^{-2n\epsilon^2}. \qquad (51)$$

We would like to extend this to a statement of the form

$$P\left(\sup_{A \in \mathcal{A}} |\widehat{P}_n(A) - P(A)| > \epsilon\right) \le \text{something small}$$
for some class of sets $\mathcal{A}$. This is exactly what the DKW inequality does by taking $\mathcal{A} = \{A = (-\infty, t] : t \in \mathbb{R}\}$. But DKW is only useful for one-dimensional random variables. We can get a more general inequality by using Vapnik–Chervonenkis (VC) theory.

Let $\mathcal{A}$ be a class of sets. Given a finite set $R = \{x_1, \ldots, x_n\}$, let

$$N_{\mathcal{A}}(R) = \#\bigl\{R \cap A : A \in \mathcal{A}\bigr\} \qquad (52)$$

be the number of subsets of $R$ picked out as $A$ varies over $\mathcal{A}$. We say that $R$ is shattered by $\mathcal{A}$ if $N_{\mathcal{A}}(R) = 2^n$. The shatter coefficient is defined by

$$s(\mathcal{A}, n) = \max_{R \in F_n} N_{\mathcal{A}}(R) \qquad (53)$$

where $F_n$ consists of all finite sets of size $n$.

8.1 Theorem (Vapnik and Chervonenkis, 1971). For any $P$, $n$ and $\epsilon > 0$,

$$P\left(\sup_{A \in \mathcal{A}} |\widehat{P}_n(A) - P(A)| > \epsilon\right) \le 8\, s(\mathcal{A}, n)\, e^{-n\epsilon^2/32}. \qquad (54)$$

Theorem 8.1 is only useful if the shatter coefficients do not grow too quickly with $n$. This is where VC dimension enters. If $s(\mathcal{A}, n) = 2^n$ for all $n$, set $\mathrm{VC}(\mathcal{A}) = \infty$. Otherwise, define $\mathrm{VC}(\mathcal{A})$ to be the largest $k$ for which $s(\mathcal{A}, k) = 2^k$. We call $\mathrm{VC}(\mathcal{A})$ the Vapnik–Chervonenkis dimension of $\mathcal{A}$. Thus, the VC-dimension
is the size of the largest finite set $F$ that is shattered by $\mathcal{A}$. The following theorem shows that if $\mathcal{A}$ has finite VC-dimension then the shatter coefficients grow as a polynomial in $n$.

8.2 Theorem. If $\mathcal{A}$ has finite VC-dimension $v$, then $s(\mathcal{A}, n) \le n^v + 1$. In this case,

$$P\left(\sup_{A \in \mathcal{A}} |\widehat{P}_n(A) - P(A)| > \epsilon\right) \le 8\,(n^v + 1)\,e^{-n\epsilon^2/32}. \qquad (55)$$

8.3 Example. Let $\mathcal{A} = \{(-\infty, x] : x \in \mathbb{R}\}$. Then $\mathcal{A}$ shatters every one-point set $\{x\}$ but it shatters no set of the form $\{x, y\}$. Therefore, $\mathrm{VC}(\mathcal{A}) = 1$. Since $P((-\infty, x]) = F(x)$ is the cdf and $\widehat{P}_n((-\infty, x]) = \widehat{F}_n(x)$ is the empirical cdf, we conclude that

$$P\left(\sup_x |\widehat{F}_n(x) - F(x)| > \epsilon\right) \le 8(n + 1)e^{-n\epsilon^2/32},$$

which is looser than the DKW bound. This shows that the bound (54) is not the tightest possible.

8.4 Example. Let $\mathcal{A}$ be the set of closed intervals on the real line. Then $\mathcal{A}$ shatters $S = \{x, y\}$ but it cannot shatter sets with three points. Consider $S = \{x, y, z\}$ where $x < y < z$. One cannot find an interval $A$ such that $A \cap S = \{x, z\}$. So $\mathrm{VC}(\mathcal{A}) = 2$.
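The interval example can be verified by brute force. The sketch below (the helper names are mine) uses the fact that a closed interval picks out exactly a contiguous run of the sorted points, enumerates all such subsets, and checks that two-point sets are shattered while three-point sets are not; it also checks $s(\mathcal{A}, 3) = 7 \le 3^2 + 1$, consistent with Theorem 8.2 for $v = 2$.

```python
def picked_out_by_intervals(points):
    """All subsets of a finite point set of the form points ∩ [a, b].
    A closed interval picks out exactly a contiguous run of the sorted points."""
    pts = sorted(points)
    subsets = {frozenset()}                       # an interval missing all points
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            subsets.add(frozenset(pts[i:j + 1]))  # points falling in [pts[i], pts[j]]
    return subsets

def is_shattered(points):
    """R is shattered when N_A(R) = 2^|R|."""
    return len(picked_out_by_intervals(points)) == 2 ** len(points)

s_3 = len(picked_out_by_intervals([1, 3, 7]))     # shatter count for a 3-point set
```

For three points the achievable subsets are the empty set, three singletons, two adjacent pairs, and the full set: seven in total, so the pair of outer points is the one subset an interval can never pick out.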
8.5 Example. Let $\mathcal{A}$ be all linear half-spaces on the plane. Any three-point set (not all on a line) can be shattered. No four-point set can be shattered. Consider, for example, four points forming a diamond. Let $T$ be the leftmost and rightmost points. This set cannot be picked out. Other configurations can also be seen to be unshatterable. So $\mathrm{VC}(\mathcal{A}) = 3$. In general, half-spaces in $\mathbb{R}^d$ have VC dimension $d + 1$.

8.6 Example. Let $\mathcal{A}$ be all rectangles on the plane with sides parallel to the axes. Any four-point set can be shattered. Let $S$ be a five-point set. There is one point that is not leftmost, rightmost, uppermost or lowermost. Let $T$ be all points in $S$ except this point. Then $T$ can't be picked out. So we have that $\mathrm{VC}(\mathcal{A}) = 4$.

9 Appendix

Here are some details about Theorem 7.3. Let $\mathcal{F}$ denote all distribution functions and let $\mathcal{D}$ denote the linear space generated by $\mathcal{F}$. Write $T((1-\epsilon)F + \epsilon G) = T(F + \epsilon D)$ where $D = G - F \in \mathcal{D}$. The Gâteaux derivative, which we now write as $L_F(D)$, is defined by

$$\lim_{\epsilon \to 0} \left[ \frac{T(F + \epsilon D) - T(F)}{\epsilon} - L_F(D) \right] = 0.$$
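This defining limit can be evaluated numerically by fixing a small $\epsilon$. The sketch below is an illustration not taken from the text: it assumes the (nonlinear) variance functional $T(G) = \int x^2\,dG - \bigl(\int x\,dG\bigr)^2$, whose influence function $L(x) = (x - \mu)^2 - \sigma^2$ is a standard fact, and checks that the difference quotient at $D = \delta_x - \widehat{F}_n$ recovers it.

```python
def variance_functional(dist):
    """T(G) = ∫x^2 dG - (∫x dG)^2 for a discrete G given as (weight, point) pairs."""
    m1 = sum(w * x for w, x in dist)
    m2 = sum(w * x * x for w, x in dist)
    return m2 - m1 * m1

def gateaux_influence(T, sample, x, eps=1e-6):
    """Difference quotient [T((1-eps) F_n + eps delta_x) - T(F_n)] / eps."""
    n = len(sample)
    f_n = [(1.0 / n, xi) for xi in sample]                       # empirical distribution
    mixed = [((1.0 - eps) / n, xi) for xi in sample] + [(eps, x)]
    return (T(mixed) - T(f_n)) / eps

sample = [0.0, 1.0, 2.0, 3.0]
mu = sum(sample) / len(sample)                                   # 1.5
sigma2 = sum((s - mu) ** 2 for s in sample) / len(sample)        # 1.25
x = 5.0
approx = gateaux_influence(variance_functional, sample, x)
exact = (x - mu) ** 2 - sigma2    # known influence function of the variance
```

The finite-$\epsilon$ quotient matches the closed form up to an $O(\epsilon)$ error, which is exactly the statement that the remainder in the definition above vanishes.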
Thus $T(F + \epsilon D) = T(F) + \epsilon L_F(D) + o(\epsilon)$, and the error term $o(\epsilon)$ goes to 0 as $\epsilon \to 0$. Hadamard differentiability requires that this error term be small uniformly over compact sets. Equip $\mathcal{D}$ with a metric $d$. $T$ is Hadamard differentiable at $F$ if there exists a linear functional $L_F$ on $\mathcal{D}$ such that for any $\epsilon_n \to 0$ and $\{D, D_1, D_2, \ldots\} \subset \mathcal{D}$ such that $d(D_n, D) \to 0$ and $F + \epsilon_n D_n \in \mathcal{F}$,

$$\lim_{n \to \infty} \left[ \frac{T(F + \epsilon_n D_n) - T(F)}{\epsilon_n} - L_F(D_n) \right] = 0.$$

10 Exercises

1. Fill in the details of the proof of Theorem 7.2.

2. Prove Theorem 7.3.

3. (Computer experiment.) Generate 100 observations from a N(0,1) distribution. Compute a 95 percent confidence band for the cdf $F$. Repeat this 1000 times and see how often the confidence band contains the true distribution function. Repeat using data from a Cauchy distribution.

4. Let $X_1, \ldots, X_n \sim F$ and let $\widehat{F}_n(x)$ be the empirical distribution function. For a fixed $x$, find the limiting distribution of $\widehat{F}_n(x)$.
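The computer experiment in Exercise 3 can be sketched as follows (a sketch, with the DKW band half-width $\epsilon_n = \sqrt{\log(2/\alpha)/(2n)}$; the helper names are mine). The band contains $F$ exactly when $\sup_x |\widehat{F}_n(x) - F(x)| \le \epsilon_n$, and for a continuous $F$ this supremum is attained at the order statistics.

```python
import math
import random

random.seed(2)

def dkw_band_covers(sample, cdf, alpha=0.05):
    """True if the DKW 1 - alpha band around F_n contains the true cdf everywhere.
    The sup of |F_n - F| is computed at the order statistics."""
    n = len(sample)
    eps = math.sqrt(math.log(2.0 / alpha) / (2.0 * n))   # DKW band half-width
    sup = 0.0
    for i, xval in enumerate(sorted(sample), start=1):
        fx = cdf(xval)
        sup = max(sup, abs(i / n - fx), abs(fx - (i - 1) / n))
    return sup <= eps

normal_cdf = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
reps = 1000
hits = sum(dkw_band_covers([random.gauss(0.0, 1.0) for _ in range(100)], normal_cdf)
           for _ in range(reps))
coverage = hits / reps   # should come out close to 0.95 or a bit above
```

For the Cauchy part of the exercise, replace `random.gauss(0.0, 1.0)` with a Cauchy draw and `normal_cdf` with the Cauchy cdf; since the DKW band is distribution-free, the coverage behaves the same way.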
More informationLecture 4: Completion of a Metric Space
15 Lecture 4: Completion of a Metric Space Closure vs. Completeness. Recall the statement of Lemma??(b): A subspace M of a metric space X is closed if and only if every convergent sequence {x n } X satisfying
More informationMcGill University Math 354: Honors Analysis 3
Practice problems McGill University Math 354: Honors Analysis 3 not for credit Problem 1. Determine whether the family of F = {f n } functions f n (x) = x n is uniformly equicontinuous. 1st Solution: The
More information1 Probability theory. 2 Random variables and probability theory.
Probability theory Here we summarize some of the probability theory we need. If this is totally unfamiliar to you, you should look at one of the sources given in the readings. In essence, for the major
More informationSpring 2012 Math 541B Exam 1
Spring 2012 Math 541B Exam 1 1. A sample of size n is drawn without replacement from an urn containing N balls, m of which are red and N m are black; the balls are otherwise indistinguishable. Let X denote
More informationHypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3
Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest
More informationSolutions Final Exam May. 14, 2014
Solutions Final Exam May. 14, 2014 1. Determine whether the following statements are true or false. Justify your answer (i.e., prove the claim, derive a contradiction or give a counter-example). (a) (10
More informationThe Delta Method and Applications
Chapter 5 The Delta Method and Applications 5.1 Local linear approximations Suppose that a particular random sequence converges in distribution to a particular constant. The idea of using a first-order
More informationconverges as well if x < 1. 1 x n x n 1 1 = 2 a nx n
Solve the following 6 problems. 1. Prove that if series n=1 a nx n converges for all x such that x < 1, then the series n=1 a n xn 1 x converges as well if x < 1. n For x < 1, x n 0 as n, so there exists
More informationThe Uniform Weak Law of Large Numbers and the Consistency of M-Estimators of Cross-Section and Time Series Models
The Uniform Weak Law of Large Numbers and the Consistency of M-Estimators of Cross-Section and Time Series Models Herman J. Bierens Pennsylvania State University September 16, 2005 1. The uniform weak
More informationStatistics 300B Winter 2018 Final Exam Due 24 Hours after receiving it
Statistics 300B Winter 08 Final Exam Due 4 Hours after receiving it Directions: This test is open book and open internet, but must be done without consulting other students. Any consultation of other students
More information