36. Multisample U-statistics and jointly distributed U-statistics Lehmann 6.1
In this topic, we generalize the idea of U-statistics in two different directions. First, we consider single U-statistics for situations in which there is more than one sample. Next, we consider the joint asymptotic distribution of two (single-sample) U-statistics.

We begin by generalizing the idea of U-statistics to the case in which we have more than one random sample. Suppose that X_{i1}, ..., X_{in_i} is an iid sample from F_i for all 1 ≤ i ≤ s. In other words, we have s random samples, each potentially from a different distribution, and n_i is the size of the ith sample. We may define a statistical functional

  θ = E φ(X_{11}, ..., X_{1a_1}; X_{21}, ..., X_{2a_2}; ...; X_{s1}, ..., X_{sa_s}).  (132)

Notice that the kernel φ in Equation (132) has a_1 + a_2 + ... + a_s arguments; furthermore, we assume that the first a_1 of them may be permuted without changing the value of φ, the next a_2 of them may be permuted without changing the value of φ, and so on. In other words, there are s distinct blocks of arguments of φ, and φ is symmetric in its arguments within each of these blocks.

Not surprisingly, the U-statistic corresponding to the functional (132) is

  U_N = (n_1 choose a_1)^{-1} ··· (n_s choose a_s)^{-1} Σ_{1≤i_1<···<i_{a_1}≤n_1} ··· Σ_{1≤r_1<···<r_{a_s}≤n_s} φ(X_{1i_1}, ..., X_{1i_{a_1}}; ...; X_{sr_1}, ..., X_{sr_{a_s}}).  (133)

Note that N denotes the total of all the sample sizes: N = n_1 + ··· + n_s.

As we did in the case of single-sample U-statistics, define for 0 ≤ j_1 ≤ a_1, ..., 0 ≤ j_s ≤ a_s

  φ_{j_1···j_s}(X_{11}, ..., X_{1j_1}; ...; X_{s1}, ..., X_{sj_s}) = E{φ(X_{11}, ..., X_{1a_1}; ...; X_{s1}, ..., X_{sa_s}) | X_{11}, ..., X_{1j_1}, ..., X_{s1}, ..., X_{sj_s}}  (134)

and

  σ²_{j_1···j_s} = Var φ_{j_1···j_s}(X_{11}, ..., X_{1j_1}; ...; X_{s1}, ..., X_{sj_s}).  (135)

By an argument similar to the one used in the proof of Theorem 35.1, but much more tedious notationally, we can show that

  σ²_{j_1···j_s} = Cov{φ(X_{11}, ..., X_{1a_1}; ...; X_{s1}, ..., X_{sa_s}), φ(X_{11}, ..., X_{1j_1}, X_{1,a_1+1}, ..., X_{1,a_1+(a_1−j_1)}; ...; X_{s1}, ..., X_{sj_s}, X_{s,a_s+1}, ..., X_{s,a_s+(a_s−j_s)})}.  (136)
Notice that some of the j_i may equal 0. This was not true in the single-sample case, since φ_0 would have merely been the constant θ, and so σ²_0 = 0.

In the special case when s = 2, Equations (134), (135), and (136) become

  φ_{ij}(X_1, ..., X_i; Y_1, ..., Y_j) = E{φ(X_1, ..., X_{a_1}; Y_1, ..., Y_{a_2}) | X_1, ..., X_i, Y_1, ..., Y_j},

  σ²_{ij} = Var φ_{ij}(X_1, ..., X_i; Y_1, ..., Y_j),

  σ²_{ij} = Cov{φ(X_1, ..., X_{a_1}; Y_1, ..., Y_{a_2}), φ(X_1, ..., X_i, X_{a_1+1}, ..., X_{a_1+(a_1−i)}; Y_1, ..., Y_j, Y_{a_2+1}, ..., Y_{a_2+(a_2−j)})},
respectively, for 0 ≤ i ≤ a_1 and 0 ≤ j ≤ a_2.

Although we will not derive it here as we did for the single-sample case, there is an analogous asymptotic normality result for multisample U-statistics, as follows.

Theorem 36.1 Suppose that for i = 1, ..., s, X_{i1}, ..., X_{in_i} is a random sample from the distribution F_i and that these s samples are independent of each other. Suppose further that there exist constants ρ_1, ..., ρ_s in the interval (0, 1) such that n_i/N → ρ_i for all i and that σ²_{a_1···a_s} < ∞. Then √N(U_N − θ) →L N(0, σ²), where

  σ² = (a_1²/ρ_1) σ²_{10···0} + ··· + (a_s²/ρ_s) σ²_{0···01}.

Although the notation required for the multisample U-statistic theory is nightmarish, life becomes considerably simpler in the case s = 2 and a_1 = a_2 = 1, in which case we obtain

  U_N = (1/(n_1 n_2)) Σ_{i=1}^{n_1} Σ_{j=1}^{n_2} φ(X_{1i}; X_{2j}).

Equivalently, we may assume that X_1, ..., X_m are an iid sample from F and Y_1, ..., Y_n are an iid sample from G, which gives

  U_N = (1/(mn)) Σ_{i=1}^{m} Σ_{j=1}^{n} φ(X_i; Y_j).  (137)

In the case of the U-statistic of Equation (137), Theorem 36.1 states that

  √N(U_N − θ) →L N(0, σ²_{10}/ρ + σ²_{01}/(1 − ρ)),

where ρ = lim m/N, σ²_{10} = Cov{φ(X_1; Y_1), φ(X_1; Y_2)}, and σ²_{01} = Cov{φ(X_1; Y_1), φ(X_2; Y_1)}.

Example 36.1 For X_1, ..., X_m iid from F and Y_1, ..., Y_n iid from G, consider the Wilcoxon rank-sum statistic W, defined to be the sum of the ranks of the Y_j among the combined sample. We showed in Equation (79) that

  W = (1/2) n(n + 1) + Σ_{i=1}^{m} Σ_{j=1}^{n} I{X_i < Y_j}.

Therefore, if we let φ(x; y) = I{x < y}, then the corresponding two-sample U-statistic U_N is related to W by

  W = (1/2) n(n + 1) + mn U_N.

Therefore, we may use Theorem 36.1 to obtain the asymptotic normality of U_N, and therefore of W, just as we did in Topic 26. However, unlike in Topic 26, we make no assumption here that F and G are merely shifted versions of one another. Thus, we may now obtain in principle the asymptotic distribution of the rank-sum statistic for any two distributions F and G that we wish, so long as they have finite second moments.
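As a quick numerical sanity check (an illustration added here, not part of the notes; the sample values are arbitrary), the identity W = n(n + 1)/2 + mn·U_N for the kernel φ(x; y) = I{x < y} can be verified directly:

```python
from itertools import product

def u_two_sample(xs, ys, phi):
    """Two-sample U-statistic U_N = (1/(mn)) * sum over all pairs of phi(x_i; y_j)."""
    m, n = len(xs), len(ys)
    return sum(phi(x, y) for x, y in product(xs, ys)) / (m * n)

xs = [1.2, 3.4, 0.5, 2.2]            # sample from F (values chosen arbitrarily)
ys = [2.9, 0.1, 4.0]                 # sample from G
phi = lambda x, y: 1.0 if x < y else 0.0
u = u_two_sample(xs, ys, phi)

# Wilcoxon rank-sum statistic: sum of the ranks of the ys in the combined sample
combined = sorted(xs + ys)           # no ties here, so ranks are unambiguous
W = sum(combined.index(y) + 1 for y in ys)

m, n = len(xs), len(ys)
assert W == n * (n + 1) / 2 + m * n * u   # the identity from Example 36.1
```

The assertion checks the algebraic identity, not any distributional claim; with these values W = 13 and U_N = 7/12.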
The other direction in which we will generalize the development of U-statistics is consideration of the joint distribution of two single-sample U-statistics. Suppose that there are two kernel functions, φ(x_1, ..., x_a) and ϕ(x_1, ..., x_b), and we define the two corresponding U-statistics

  U_n^{(1)} = (n choose a)^{-1} Σ_{1≤i_1<···<i_a≤n} φ(X_{i_1}, ..., X_{i_a})
and

  U_n^{(2)} = (n choose b)^{-1} Σ_{1≤j_1<···<j_b≤n} ϕ(X_{j_1}, ..., X_{j_b})

for a single random sample X_1, ..., X_n from F. Define θ_1 = E U_n^{(1)} and θ_2 = E U_n^{(2)}. Furthermore, define γ_{ij} to be the covariance between φ_i(X_1, ..., X_i) and ϕ_j(X_1, ..., X_j), where φ_i and ϕ_j are defined as in Equation (126). It may be proved that

  γ_{ij} = Cov{φ(X_1, ..., X_a), ϕ(X_1, ..., X_j, X_{a+1}, ..., X_{a+(b−j)})}.  (138)

Note in particular that γ_{ij} depends only on the value of min{i, j}. The following theorem, stated without proof, gives the joint asymptotic distribution of U_n^{(1)} and U_n^{(2)}.

Theorem 36.2 Suppose X_1, ..., X_n is a random sample from F and that φ and ϕ are two kernel functions satisfying Var φ(X_1, ..., X_a) < ∞ and Var ϕ(X_1, ..., X_b) < ∞. Define τ_1² = Var φ_1(X_1) and τ_2² = Var ϕ_1(X_1), and let γ_{ij} be defined as in Equation (138). Then

  √n {(U_n^{(1)}, U_n^{(2)})ᵀ − (θ_1, θ_2)ᵀ} →L N((0, 0)ᵀ, Σ),

where the covariance matrix Σ has diagonal entries a²τ_1² and b²τ_2² and off-diagonal entries abγ_{11}.

Problems

Problem 36.1 Suppose X_1, ..., X_m and Y_1, ..., Y_n are independent iid samples from the distributions Unif(0, θ) and Unif(µ, µ + θ), respectively. Assume m/n → ρ as m, n → ∞ and 0 < µ < θ.

(a) Find the asymptotic distribution of the U-statistic of Equation (137), where φ(x; y) = I{x < y}. In so doing, find a function g(x) such that E(U_N) = g(µ).

(b) Find the asymptotic distribution of g(Ȳ − X̄).

(c) Find the range of values of µ for which the Wilcoxon estimate of g(µ) is asymptotically more efficient than g(Ȳ − X̄). (The ARE in this case is the ratio of asymptotic variances. See the discussion of (4.3.12) and (4.3.13) on p. 240 for details.)

Problem 36.2 Solve each part of Problem 36.1, but this time under the assumptions that the independent random samples X_1, ..., X_m and Y_1, ..., Y_n satisfy P(X_1 ≤ t) = P(Y_1 − θ ≤ t) = t² for t ∈ [0, 1] and 0 < θ < 1. As in Problem 36.1, assume m/n → ρ ∈ (0, 1).

Problem 36.3 Suppose X_1, ..., X_m and Y_1, ..., Y_n are independent iid samples from the distributions N(0, 1) and N(µ, 1), respectively. Assume m/(m + n) → 1/2 as m, n → ∞. Let U_N be the U-statistic of Equation (137), where φ(x; y) = I{x < y}.
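For concreteness (an illustration added here, not from the notes), two single-sample U-statistics with kernels of different orders, say φ(x_1) = x_1 with a = 1 (giving the sample mean) and ϕ(x_1, x_2) = (x_1 − x_2)²/2 with b = 2 (giving the unbiased sample variance), can be computed by averaging the kernel over all index combinations:

```python
from itertools import combinations

def u_stat(sample, kernel, order):
    """Single-sample U-statistic: average of the kernel over all C(n, order) index subsets."""
    vals = [kernel(*[sample[i] for i in idx])
            for idx in combinations(range(len(sample)), order)]
    return sum(vals) / len(vals)

x = [2.0, 4.0, 4.0, 6.0]                              # arbitrary sample
u1 = u_stat(x, lambda a: a, 1)                        # kernel of order 1: sample mean
u2 = u_stat(x, lambda a, b: (a - b) ** 2 / 2, 2)      # kernel of order 2: sample variance

mean = sum(x) / len(x)
s2 = sum((v - mean) ** 2 for v in x) / (len(x) - 1)   # usual unbiased variance
assert abs(u1 - mean) < 1e-12 and abs(u2 - s2) < 1e-12
```

Both U-statistics agree with their textbook formulas; a joint CLT for the pair (u1, u2) is exactly what Theorem 36.2 provides.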
Suppose that θ(µ) and σ²(µ) are such that √N[U_N − θ(µ)] →L N[0, σ²(µ)]. Calculate θ(µ) and σ²(µ) for µ ∈ {.2, .5, 1, 1.5, 2}.

Hint: This problem requires a bit of numerical integration. There are a couple of ways you might do this. Mathematica will do it quite easily. There is a function called integrate in Splus and one called quad in MATLAB for integrating a function. If you cannot get any of these to work for you, let me know.
Problem 36.4 Suppose X_1, X_2, ... are iid with finite variance. Define

  S_n² = (1/(n − 1)) Σ_{i=1}^{n} (X_i − X̄)²,

and let G_n be Gini's mean difference, the U-statistic defined in a previous problem. Note that S_n² is also a U-statistic, corresponding to the kernel function φ(x_1, x_2) = (x_1 − x_2)²/2.

(a) If the X_i are distributed as Unif(0, θ), give the joint asymptotic distribution of G_n and S_n by first finding the joint asymptotic distribution of the U-statistics G_n and S_n². Note that the covariance matrix need not be positive definite; in this problem, the covariance matrix is singular.

(b) The singular asymptotic covariance matrix in this problem implies that as n → ∞, the joint distribution of G_n and S_n becomes concentrated on a line. Does this appear to be the case? For 1000 samples of size n from Uniform(0, 1), plot scatterplots of G_n against S_n. Take n ∈ {5, 25, 100}.
37. Bootstrapping Lehmann 6.5

In this topic, we introduce the bootstrap, even though the asymptotics content of what follows is not very high. Essentially, the only large-sample theory we rely on here is the weak law of large numbers. Nonetheless, bootstrapping is included here because of its natural relationship with the concepts of statistical functionals and plug-in estimators seen in recent topics, and also because it is an increasingly popular and often misunderstood method.

Consider a statistical functional λ_n(F) that depends on n. For instance, λ_n(F) may be some property, such as bias or variance, of an estimator θ̂_n of θ = θ(F) based on a random sample of size n from some distribution F. As an example, let θ(F) = F^{-1}(1/2) be the median of F, and take θ̂_n to be the mth order statistic from a random sample of size n = 2m − 1 from F. Consider the bias

  λ_n^B(F) = E_F θ̂_n − θ(F)

and the variance

  λ_n^V(F) = E_F θ̂_n² − (E_F θ̂_n)².

Theoretical properties of λ_n^B and λ_n^V are very difficult to obtain. Even asymptotics aren't very helpful, since √n(θ̂_n − θ) →L N{0, 1/(4f²(θ))} tells us only that the bias goes to zero, and the limiting variance may be very hard to estimate because it involves the unknown quantity f(θ), which is hard to estimate.

Consider the plug-in estimators λ_n^B(F̂_n) and λ_n^V(F̂_n). (Recall that F̂_n denotes the empirical distribution function, which puts a mass of 1/n on each of the n sample points.) In our median example,

  λ_n^B(F̂_n) = E_{F̂_n} θ̂_n* − θ̂_n  and  λ_n^V(F̂_n) = E_{F̂_n}(θ̂_n*)² − (E_{F̂_n} θ̂_n*)²,

where θ̂_n* is the sample median of a random sample X_1*, ..., X_n* from F̂_n.

To see how difficult it is to calculate λ_n^B(F̂_n) and λ_n^V(F̂_n), consider the simplest nontrivial case, n = 3. Conditional on the order statistics (X_(1), X_(2), X_(3)), there are 27 equally likely possibilities for the value of (X_1*, X_2*, X_3*), the sample of size 3 from F̂_n, namely

  (X_(1), X_(1), X_(1)), (X_(1), X_(1), X_(2)), ..., (X_(3), X_(3), X_(3)).
Of these 27 possibilities, exactly 7 have the value X_(1) occurring 2 or 3 times (6 in which X_(1) occurs exactly twice and 1 in which it occurs all three times). Therefore, we obtain

  P(θ̂_n* = X_(1)) = 7/27,  P(θ̂_n* = X_(2)) = 13/27,  P(θ̂_n* = X_(3)) = 7/27.

This implies that

  E_{F̂_n} θ̂_n* = (1/27)(7X_(1) + 13X_(2) + 7X_(3))

and

  E_{F̂_n}(θ̂_n*)² = (1/27)(7X²_(1) + 13X²_(2) + 7X²_(3)).

Therefore, since θ̂_n = X_(2), we obtain

  λ_n^B(F̂_n) = (1/27)(7X_(1) − 14X_(2) + 7X_(3))

and

  λ_n^V(F̂_n) = (14/729)(10X²_(1) + 13X²_(2) + 10X²_(3) − 13X_(1)X_(2) − 13X_(2)X_(3) − 7X_(1)X_(3)).
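The 27-term enumeration above is easy to check by brute force. The following sketch (added for illustration, with an arbitrary three-point sample) recomputes the bootstrap bias and variance of the median for n = 3 and compares them with closed-form expressions in the order statistics:

```python
from itertools import product
from statistics import median

x = sorted([1.0, 2.0, 5.0])               # arbitrary sample; x[1] is the observed median
resamples = list(product(x, repeat=3))    # the 27 equally likely bootstrap samples
meds = [median(s) for s in resamples]

E = sum(meds) / 27                        # E over the empirical distribution
E2 = sum(m * m for m in meds) / 27
bias = E - x[1]                           # lambda_n^B(F_hat)
var = E2 - E * E                          # lambda_n^V(F_hat)

a, b, c = x                               # a = X_(1), b = X_(2), c = X_(3)
assert abs(bias - (7*a - 14*b + 7*c) / 27) < 1e-9
assert abs(var - (14/729) * (10*a*a + 13*b*b + 10*c*c - 13*a*b - 13*b*c - 7*a*c)) < 1e-9
```

With this sample the brute-force values are bias = 14/27 and variance = 1694/729, matching the formulas exactly.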
To obtain the sampling distribution of these estimators, of course, we would have to consider the joint distribution of (X_(1), X_(2), X_(3)). Naturally, the calculations become even more difficult as n increases.

Alternatively, we could use resampling in order to approximate λ_n^B(F̂_n) and λ_n^V(F̂_n). This is the bootstrapping idea, and it works like this: For some large number B, simulate B random samples from F̂_n, namely

  X_{11}*, ..., X_{1n}*,
  ...,
  X_{B1}*, ..., X_{Bn}*,

and approximate a quantity like E_{F̂_n} θ̂_n* by the sample mean

  (1/B) Σ_{i=1}^{B} θ̂_{in}*,

where θ̂_{in}* is the sample median of the ith bootstrap sample X_{i1}*, ..., X_{in}*. Notice that the weak law of large numbers asserts that

  (1/B) Σ_{i=1}^{B} θ̂_{in}* →P E_{F̂_n} θ̂_n*.

To recap, then, we wish to estimate some parameter λ_n(F) for an unknown distribution F based on a random sample from F. We estimate λ_n(F) by λ_n(F̂_n), but it is not easy to evaluate λ_n(F̂_n), so we approximate λ_n(F̂_n) by resampling B times from F̂_n to obtain a bootstrap estimator λ_{B,n}*. Thus, there are two relevant issues:

1. How good is the approximation of λ_n(F̂_n) by λ_{B,n}*? (Note that λ_n(F̂_n) is NOT an unknown parameter; it is known but hard to evaluate.)

2. How precise is the estimation of λ_n(F) by λ_n(F̂_n)?

Question 1 is usually easy to answer using asymptotics; these asymptotics involve letting B → ∞ and usually rely on the weak law or the central limit theorem. For example, if we have an expectation functional λ_n(F) = E_F h(X_1, ..., X_n), then

  λ_{B,n}* = (1/B) Σ_{i=1}^{B} h(X_{i1}*, ..., X_{in}*) →P λ_n(F̂_n)

as B → ∞.

Question 2, on the other hand, is often tricky; asymptotic results involve letting n → ∞ and are handled case-by-case. We will not discuss these asymptotics here. On a related note, however, there is an argument in Lehmann's book about why a plug-in estimator may be better than an asymptotic estimator. That is, if it is possible to show that λ_n(F) → λ as n → ∞, then as an estimator of λ_n(F), λ_n(F̂_n) may be preferable to λ.
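The resampling recipe just described can be sketched as follows. (This is an added illustration in Python, with arbitrary sample values; the notes themselves use R. The function and variable names are hypothetical.)

```python
import random

def bootstrap_bias_var(x, stat, B=2000, seed=0):
    """Approximate lambda_n^B(F_hat) and lambda_n^V(F_hat) for the statistic `stat`
    by drawing B resamples with replacement from the empirical distribution of x."""
    rng = random.Random(seed)
    reps = [stat([rng.choice(x) for _ in x]) for _ in range(B)]
    mean = sum(reps) / B
    var = sum((r - mean) ** 2 for r in reps) / B
    return mean - stat(x), var          # (bias estimate, variance estimate)

def sample_median(s):
    s = sorted(s)
    return s[len(s) // 2]               # middle order statistic for odd n

x = [0.8, 1.3, 2.1, 3.5, 4.0]           # arbitrary sample of size n = 5
bias_hat, var_hat = bootstrap_bias_var(x, sample_median)
```

By the weak law of large numbers, increasing B makes these Monte Carlo averages converge to the exact plug-in quantities λ_n^B(F̂_n) and λ_n^V(F̂_n), which for small n could also be enumerated exactly as in the n = 3 example.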
We conclude this topic by considering the so-called parametric bootstrap. If we assume that the unknown distribution function F comes from a family of distribution functions indexed by a parameter µ, then λ_n(F) is really λ_n(F_µ). Then, instead of the plug-in estimator λ_n(F̂_n), we might consider the estimator λ_n(F_µ̂), where µ̂ is an estimator of µ. Everything proceeds as in the nonparametric version of bootstrapping. Since it may not be easy to evaluate λ_n(F_µ̂) explicitly, we first find µ̂ and then take B random samples of size n, X_{11}*, ..., X_{1n}* through X_{B1}*, ..., X_{Bn}*, from F_µ̂. These samples are used to approximate λ_n(F_µ̂).
Example 37.1 Suppose X_1, ..., X_n is a random sample from Poisson(µ). Take µ̂ = X̄. Suppose λ_n(F_µ) = Var_{F_µ} µ̂. In this case, we happen to know that λ_n(F_µ) = µ/n, but let's ignore this knowledge and apply a parametric bootstrap. For some large B, say 500, generate B samples from Poisson(µ̂) and use the sample variance of the resulting values of µ̂* as an approximation to λ_n(F_µ̂). In R, with µ = 1 and n = 20, we obtain

  > x <- rpois(20,1)    # Generate the sample from F
  > muhat <- mean(x)
  > muhat
  [1] 0.85
  > muhatstar <- rep(0,500)    # Allocate the vector for muhatstar
  > for (i in 1:500) muhatstar[i] <- mean(rpois(20,muhat))
  > var(muhatstar)

Note that the estimate is close to the known true value µ/n = 1/20 = 0.05. Obviously, this example is simplistic because we already know that λ_n(F_µ) = µ/n, which makes µ̂/n a more natural estimator. However, it is not always so simple to obtain a closed-form expression for λ_n(F).

Incidentally, we could also use a nonparametric bootstrap approach in this example:

  > muhatstar2 <- rep(0,500)
  > for (i in 1:500) muhatstar2[i] <- mean(sample(x,replace=T))
  > var(muhatstar2)

Of course, this is an approximation to λ_n(F̂_n) rather than λ_n(F_µ̂). Furthermore, we can obtain a result arbitrarily close to λ_n(F̂_n) by increasing the value of B:

  > muhatstar2 <- rep(0,100000)
  > for (i in 1:100000) muhatstar2[i] <- mean(sample(x,replace=T))
  > var(muhatstar2)

In fact, it is in principle possible to obtain an approximate variance for our estimates of λ_n(F̂_n) and λ_n(F_µ̂) and, using the central limit theorem, construct approximate confidence intervals for these quantities. This would allow us to specify the quantities to any desired level of accuracy.

Problems

Problem 37.1 (a) Devise a nonparametric bootstrap scheme for setting confidence intervals for β in the linear regression model Y_i = α + βx_i + ε_i. There is more than one possible answer.

(b) Using B = 1000, implement your scheme on the following dataset to obtain a 95% confidence interval. Compare your answer with the standard 95% confidence interval.
  Y    x

(In the dataset, Y is the number of manatee deaths due to collisions with powerboats in Florida and x is the number of powerboat registrations in thousands, for even years.)

Problem 37.2 Consider the following dataset that lists the latitude and mean August temperature for 7 US cities. The residuals are listed for use in part (b).
  City          Latitude    Temperature    Residual
  Miami
  Phoenix
  Memphis
  Baltimore
  Pittsburgh
  Boston
  Portland, OR

Minitab gives the following output for a simple linear regression:

  Predictor    Coef    SE Coef    T    P
  Constant
  latitude

  S =        R-Sq = 61.5%    R-Sq(adj) = 53.8%

Note that the square of the SE Coef for latitude gives an asymptotic estimate of the variance of the slope parameter.

In each case below, use the described method to simulate B = 500 bootstrap samples (x_{b1}*, y_{b1}*), ..., (x_{b7}*, y_{b7}*) for b = 1, ..., B. For each b, refit the model to obtain β̂_b*. Report the sample variance of β̂_1*, ..., β̂_B* and compare it with the asymptotic estimate from the output above.

(a) Parametric bootstrap. Take x_{bi}* = x_i for all b and i. Let y_{bi}* = β̂_0 + β̂_1 x_i + ε_i*, where ε_i* ~ N(0, σ̂²). Obtain β̂_0, β̂_1, and σ̂² from the output above.

(b) Nonparametric bootstrap I. Take x_{bi}* = x_i for all b and i. Let y_{bi}* = β̂_0 + β̂_1 x_i + r_{bi}*, where r_{b1}*, ..., r_{b7}* is an iid sample from the empirical distribution of the residuals from the original model (you may want to refit the original model to find these residuals).

(c) Nonparametric bootstrap II. Let (x_{b1}*, y_{b1}*), ..., (x_{b7}*, y_{b7}*) be an iid sample from the empirical distribution of (x_1, y_1), ..., (x_7, y_7).

Note: In Splus, you can obtain the slope coefficient of the linear regression of the vector y on the vector x using lm(y~x)$coef[2].
Problem 37.3 The same resampling idea that is exploited in the bootstrap can be used to approximate the value of difficult integrals by a technique sometimes called Monte Carlo integration. Suppose we wish to compute

  θ = ∫_{−1}^{1} e^{−x²} cos³(x) dx.

Using the NIntegrate function in Mathematica, the value of θ may be found numerically.

(a) Define g(t) = 2e^{−t²} cos³(t). Let U_1, ..., U_n be an iid uniform(0, 1) sample, and let

  θ̂_1 = (1/n) Σ_{i=1}^{n} g(U_i).

Prove that θ̂_1 →P θ.

(b) Define h(t) = 2 − 2t. Prove that if we take V_i = 1 − √(U_i) for each i, then V_i is a random variable with density h(t). Prove that with

  θ̂_2 = (1/n) Σ_{i=1}^{n} g(V_i)/h(V_i),

we have θ̂_2 →P θ.

(c) For n = 1000, simulate θ̂_1 and θ̂_2. Give estimates of the variance for each estimator by reporting σ̂²/n for each, where σ̂² is the sample variance of the g(U_i) or the g(V_i)/h(V_i), as the case may be.

(d) Plot, on the same set of axes, g(t), h(t), and the standard uniform density for t ∈ [0, 1]. From this plot, explain why the variance of θ̂_2 is smaller than the variance of θ̂_1. [Incidentally, the technique of drawing random variables from a density h whose shape is close to the function g of interest is a variance-reduction technique known as importance sampling.]

Note: This was sort of a silly example, since it was really easy to get an exact value for θ using numerical methods. However, with certain high-dimensional integrals, the curse of dimensionality makes exact numerical methods extremely time-consuming computationally; thus, Monte Carlo integration does have a practical use in such cases.
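The variance-reduction idea behind importance sampling can be previewed on a simpler integrand. (This sketch is an added illustration, not the problem's integral: it estimates ∫₀¹ (1 − t²) dt = 2/3 both by plain Monte Carlo and by sampling from the density h(t) = 2 − 2t, which is larger where the integrand is larger.)

```python
import math
import random

rng = random.Random(1)
n = 100_000
g = lambda t: 1 - t * t            # integrand on (0, 1); true integral = 2/3
h = lambda t: 2 - 2 * t            # proposal density on (0, 1)

# Plain Monte Carlo: theta_hat_1 = mean of g(U_i) with U_i ~ Unif(0, 1)
us = [rng.random() for _ in range(n)]
theta1 = sum(g(u) for u in us) / n

# Importance sampling: V = 1 - sqrt(1 - U) has density h (check: P(V <= t) = 2t - t^2);
# average the weighted values g(V)/h(V), which here equal (1 + V)/2 and so are bounded
vs = [1 - math.sqrt(1 - u) for u in us]
theta2 = sum(g(v) / h(v) for v in vs) / n

truth = 2 / 3
assert abs(theta1 - truth) < 0.01 and abs(theta2 - truth) < 0.01
```

Here the weighted values g(V)/h(V) lie in [1/2, 1], so their variance (1/72) is several times smaller than the variance of the raw g(U) values (4/45), mirroring the effect asked about in part (d).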
More informationLinear Methods for Prediction
Chapter 5 Linear Methods for Prediction 5.1 Introduction We now revisit the classification problem and focus on linear methods. Since our prediction Ĝ(x) will always take values in the discrete set G we
More informationBrief Review on Estimation Theory
Brief Review on Estimation Theory K. Abed-Meraim ENST PARIS, Signal and Image Processing Dept. abed@tsi.enst.fr This presentation is essentially based on the course BASTA by E. Moulines Brief review on
More informationBetter Bootstrap Confidence Intervals
by Bradley Efron University of Washington, Department of Statistics April 12, 2012 An example Suppose we wish to make inference on some parameter θ T (F ) (e.g. θ = E F X ), based on data We might suppose
More informationSpring 2012 Math 541B Exam 1
Spring 2012 Math 541B Exam 1 1. A sample of size n is drawn without replacement from an urn containing N balls, m of which are red and N m are black; the balls are otherwise indistinguishable. Let X denote
More informationStatistics & Data Sciences: First Year Prelim Exam May 2018
Statistics & Data Sciences: First Year Prelim Exam May 2018 Instructions: 1. Do not turn this page until instructed to do so. 2. Start each new question on a new sheet of paper. 3. This is a closed book
More informationSparse Nonparametric Density Estimation in High Dimensions Using the Rodeo
Outline in High Dimensions Using the Rodeo Han Liu 1,2 John Lafferty 2,3 Larry Wasserman 1,2 1 Statistics Department, 2 Machine Learning Department, 3 Computer Science Department, Carnegie Mellon University
More informationModel-free prediction intervals for regression and autoregression. Dimitris N. Politis University of California, San Diego
Model-free prediction intervals for regression and autoregression Dimitris N. Politis University of California, San Diego To explain or to predict? Models are indispensable for exploring/utilizing relationships
More informationExercises. (a) Prove that m(t) =
Exercises 1. Lack of memory. Verify that the exponential distribution has the lack of memory property, that is, if T is exponentially distributed with parameter λ > then so is T t given that T > t for
More informationStatistical Inference
Statistical Inference Liu Yang Florida State University October 27, 2016 Liu Yang, Libo Wang (Florida State University) Statistical Inference October 27, 2016 1 / 27 Outline The Bayesian Lasso Trevor Park
More informationBIO5312 Biostatistics Lecture 13: Maximum Likelihood Estimation
BIO5312 Biostatistics Lecture 13: Maximum Likelihood Estimation Yujin Chung November 29th, 2016 Fall 2016 Yujin Chung Lec13: MLE Fall 2016 1/24 Previous Parametric tests Mean comparisons (normality assumption)
More informationCramér-Type Moderate Deviation Theorems for Two-Sample Studentized (Self-normalized) U-Statistics. Wen-Xin Zhou
Cramér-Type Moderate Deviation Theorems for Two-Sample Studentized (Self-normalized) U-Statistics Wen-Xin Zhou Department of Mathematics and Statistics University of Melbourne Joint work with Prof. Qi-Man
More informationLawrence D. Brown* and Daniel McCarthy*
Comments on the paper, An adaptive resampling test for detecting the presence of significant predictors by I. W. McKeague and M. Qian Lawrence D. Brown* and Daniel McCarthy* ABSTRACT: This commentary deals
More informationThe assumptions are needed to give us... valid standard errors valid confidence intervals valid hypothesis tests and p-values
Statistical Consulting Topics The Bootstrap... The bootstrap is a computer-based method for assigning measures of accuracy to statistical estimates. (Efron and Tibshrani, 1998.) What do we do when our
More informationThe Bootstrap: Theory and Applications. Biing-Shen Kuo National Chengchi University
The Bootstrap: Theory and Applications Biing-Shen Kuo National Chengchi University Motivation: Poor Asymptotic Approximation Most of statistical inference relies on asymptotic theory. Motivation: Poor
More informationSTATS 200: Introduction to Statistical Inference. Lecture 29: Course review
STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout
More informationRegression #5: Confidence Intervals and Hypothesis Testing (Part 1)
Regression #5: Confidence Intervals and Hypothesis Testing (Part 1) Econ 671 Purdue University Justin L. Tobias (Purdue) Regression #5 1 / 24 Introduction What is a confidence interval? To fix ideas, suppose
More informationUnbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others.
Unbiased Estimation Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. To compare ˆθ and θ, two estimators of θ: Say ˆθ is better than θ if it
More informationEC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix)
1 EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) Taisuke Otsu London School of Economics Summer 2018 A.1. Summation operator (Wooldridge, App. A.1) 2 3 Summation operator For
More information5601 Notes: The Subsampling Bootstrap
5601 Notes: The Subsampling Bootstrap Charles J. Geyer April 13, 2006 1 Web Pages This handout accompanies the web pages http://www.stat.umn.edu/geyer/5601/examp/subboot.html http://www.stat.umn.edu/geyer/5601/examp/subtoot.html
More informationSimple linear regression
Simple linear regression Thomas Lumley BIOST 578C Linear model Linear regression is usually presented in terms of a model Y = α + βx + ɛ ɛ N(0, σ 2 ) because the theoretical analysis is pretty for this
More informationEcon 2120: Section 2
Econ 2120: Section 2 Part I - Linear Predictor Loose Ends Ashesh Rambachan Fall 2018 Outline Big Picture Matrix Version of the Linear Predictor and Least Squares Fit Linear Predictor Least Squares Omitted
More informationNonparametric hypothesis tests and permutation tests
Nonparametric hypothesis tests and permutation tests 1.7 & 2.3. Probability Generating Functions 3.8.3. Wilcoxon Signed Rank Test 3.8.2. Mann-Whitney Test Prof. Tesler Math 283 Fall 2018 Prof. Tesler Wilcoxon
More informationLARGE SAMPLE BEHAVIOR OF SOME WELL-KNOWN ROBUST ESTIMATORS UNDER LONG-RANGE DEPENDENCE
LARGE SAMPLE BEHAVIOR OF SOME WELL-KNOWN ROBUST ESTIMATORS UNDER LONG-RANGE DEPENDENCE C. LÉVY-LEDUC, H. BOISTARD, E. MOULINES, M. S. TAQQU, AND V. A. REISEN Abstract. The paper concerns robust location
More informationSimulating Uniform- and Triangular- Based Double Power Method Distributions
Journal of Statistical and Econometric Methods, vol.6, no.1, 2017, 1-44 ISSN: 1792-6602 (print), 1792-6939 (online) Scienpress Ltd, 2017 Simulating Uniform- and Triangular- Based Double Power Method Distributions
More informationStatistics 3657 : Moment Approximations
Statistics 3657 : Moment Approximations Preliminaries Suppose that we have a r.v. and that we wish to calculate the expectation of g) for some function g. Of course we could calculate it as Eg)) by the
More informationWhat to do today (Nov 22, 2018)?
What to do today (Nov 22, 2018)? Part 1. Introduction and Review (Chp 1-5) Part 2. Basic Statistical Inference (Chp 6-9) Part 3. Important Topics in Statistics (Chp 10-13) Part 4. Further Topics (Selected
More informationMonte-Carlo MMD-MA, Université Paris-Dauphine. Xiaolu Tan
Monte-Carlo MMD-MA, Université Paris-Dauphine Xiaolu Tan tan@ceremade.dauphine.fr Septembre 2015 Contents 1 Introduction 1 1.1 The principle.................................. 1 1.2 The error analysis
More informationMaster s Written Examination - Solution
Master s Written Examination - Solution Spring 204 Problem Stat 40 Suppose X and X 2 have the joint pdf f X,X 2 (x, x 2 ) = 2e (x +x 2 ), 0 < x < x 2
More information3 Joint Distributions 71
2.2.3 The Normal Distribution 54 2.2.4 The Beta Density 58 2.3 Functions of a Random Variable 58 2.4 Concluding Remarks 64 2.5 Problems 64 3 Joint Distributions 71 3.1 Introduction 71 3.2 Discrete Random
More informationEconometrics I KS. Module 2: Multivariate Linear Regression. Alexander Ahammer. This version: April 16, 2018
Econometrics I KS Module 2: Multivariate Linear Regression Alexander Ahammer Department of Economics Johannes Kepler University of Linz This version: April 16, 2018 Alexander Ahammer (JKU) Module 2: Multivariate
More informationWeek 9 The Central Limit Theorem and Estimation Concepts
Week 9 and Estimation Concepts Week 9 and Estimation Concepts Week 9 Objectives 1 The Law of Large Numbers and the concept of consistency of averages are introduced. The condition of existence of the population
More informationMathematical statistics
October 4 th, 2018 Lecture 12: Information Where are we? Week 1 Week 2 Week 4 Week 7 Week 10 Week 14 Probability reviews Chapter 6: Statistics and Sampling Distributions Chapter 7: Point Estimation Chapter
More informationHT Introduction. P(X i = x i ) = e λ λ x i
MODS STATISTICS Introduction. HT 2012 Simon Myers, Department of Statistics (and The Wellcome Trust Centre for Human Genetics) myers@stats.ox.ac.uk We will be concerned with the mathematical framework
More informationPeter Hoff Linear and multilinear models April 3, GLS for multivariate regression 5. 3 Covariance estimation for the GLM 8
Contents 1 Linear model 1 2 GLS for multivariate regression 5 3 Covariance estimation for the GLM 8 4 Testing the GLH 11 A reference for some of this material can be found somewhere. 1 Linear model Recall
More informationBootstrap Confidence Intervals
Bootstrap Confidence Intervals Patrick Breheny September 18 Patrick Breheny STA 621: Nonparametric Statistics 1/22 Introduction Bootstrap confidence intervals So far, we have discussed the idea behind
More informationUnbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others.
Unbiased Estimation Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. To compare ˆθ and θ, two estimators of θ: Say ˆθ is better than θ if it
More informationL11 Hypothesis testing
Hypothesis Simple Composite Hypothesis Simple Composite Monte-Carlo and Empirical Methods for Statistical Inference, FMS091/MASM11 L11 Hypothesis testing Hypothesis testing Statistical hypothesises Testing
More informationOutline of GLMs. Definitions
Outline of GLMs Definitions This is a short outline of GLM details, adapted from the book Nonparametric Regression and Generalized Linear Models, by Green and Silverman. The responses Y i have density
More informationQuantitative Empirical Methods Exam
Quantitative Empirical Methods Exam Yale Department of Political Science, August 2016 You have seven hours to complete the exam. This exam consists of three parts. Back up your assertions with mathematics
More informationSome Assorted Formulae. Some confidence intervals: σ n. x ± z α/2. x ± t n 1;α/2 n. ˆp(1 ˆp) ˆp ± z α/2 n. χ 2 n 1;1 α/2. n 1;α/2
STA 248 H1S MIDTERM TEST February 26, 2008 SURNAME: SOLUTIONS GIVEN NAME: STUDENT NUMBER: INSTRUCTIONS: Time: 1 hour and 50 minutes Aids allowed: calculator Tables of the standard normal, t and chi-square
More informationLinear models. Linear models are computationally convenient and remain widely used in. applied econometric research
Linear models Linear models are computationally convenient and remain widely used in applied econometric research Our main focus in these lectures will be on single equation linear models of the form y
More informationInference For High Dimensional M-estimates: Fixed Design Results
Inference For High Dimensional M-estimates: Fixed Design Results Lihua Lei, Peter Bickel and Noureddine El Karoui Department of Statistics, UC Berkeley Berkeley-Stanford Econometrics Jamboree, 2017 1/49
More informationInference Tutorial 2
Inference Tutorial 2 This sheet covers the basics of linear modelling in R, as well as bootstrapping, and the frequentist notion of a confidence interval. When working in R, always create a file containing
More informationPOLI 8501 Introduction to Maximum Likelihood Estimation
POLI 8501 Introduction to Maximum Likelihood Estimation Maximum Likelihood Intuition Consider a model that looks like this: Y i N(µ, σ 2 ) So: E(Y ) = µ V ar(y ) = σ 2 Suppose you have some data on Y,
More information