STA 732: Inference. Notes 4. General Classical Methods & Efficiency of Tests B&D 2, 4, 5

1 Going beyond ML with ML-type methods

1.1 When ML fails

Example. Is LSAT correlated with GPA? Suppose we want to answer this question based on data $(X_i, Y_i)$, $i = 1, \dots, n$, on LSAT and GPA scores of $n$ students. As in HW 1, we could model $(X_i, Y_i) \sim \mathrm{BVN}(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$ and formalize the test in terms of the correlation coefficient parameter $\rho$. However, there is no compelling reason to believe that the two scores are jointly normally distributed. A more general model and test formalization, with almost no assumptions on the shape of the bivariate pdf, is given by

  $(X_i, Y_i) \sim g$, $g \in \mathcal{G}_2 = \{\text{all densities on } \mathbb{R}^2 \text{ with finite second moments}\}$,
  test $H_0: \rho(g) = 0$ vs. $H_1: \rho(g) \neq 0$,

where $\rho(g) = \mathrm{Corr}(X, Y \mid g)$.¹ Unfortunately, the ML approach does not work here because $\hat g_{\mathrm{MLE}}$ does not exist! Indeed, for any observed data $(x_i, y_i)$, $i = 1, \dots, n$, we have

  $\sup_{g \in \mathcal{G}_2} \ell_{x,y}(g) = \infty.$   (1)

As a quick check of (1), define for every $k \ge 1$, $g_k := \frac{1}{n}\sum_{i=1}^n N_2(z_i, k^{-2} I_2) \in \mathcal{G}_2$, where $z_i = (x_i, y_i)^T$, and notice that $g_k(z_i) \ge \frac{1}{n}\cdot\frac{k^2}{2\pi}$ (keeping only the $i$-th mixture component), so that $\ell_{x,y}(g_k) \ge n \log\frac{k^2}{2\pi n} \to \infty$ as $k \to \infty$.

1.2 Methods based on M-estimators

Consider the situation where data $X_1, \dots, X_n$ are modeled as $X_i \sim g$, $g \in \mathcal{G}$, and we are interested in the parameter $\theta = \theta(g) \in \Theta \subset \mathbb{R}^d$. Here $\theta$ provides only a partial indexing of the model space $\mathcal{G}$; we could write $\mathcal{G} = \bigcup_{a \in \Theta}\{g \in \mathcal{G}: \theta(g) = a\}$. An ML-type approach to inference on $\theta$ could be developed by starting with some criterion function $m(X_i, \theta)$ with the property that large values of $m(X_i, \theta)$ indicate a closer agreement² between $\theta$ and $X_i$. This could be made more precise in the following way: if the

¹ More precisely, $\rho(g) = \sigma_{12}(g)/\{\sigma_1(g)\sigma_2(g)\}$ with $\sigma_{12}(g) = \int (x - \mu_1(g))(y - \mu_2(g))\,g(x,y)\,dx\,dy$, $\sigma_1^2(g) = \int (x - \mu_1(g))^2 g(x,y)\,dx\,dy$, $\sigma_2^2(g) = \int (y - \mu_2(g))^2 g(x,y)\,dx\,dy$, $\mu_1(g) = \int x\,g(x,y)\,dx\,dy$, $\mu_2(g) = \int y\,g(x,y)\,dx\,dy$.
² The text uses "discrepancy measure" to mean the negative of a criterion function.

true pdf is $g = g_0 \in \mathcal{G}$, and so the true $\theta = \theta_0 := \theta(g_0)$, then the map $\theta \mapsto E\{m(X_i, \theta) \mid g_0\}$ is maximized at $\theta_0$ and is potentially a decreasing function of $\|\theta - \theta_0\|$, at least in a small neighborhood of $\theta_0$. Testing $H_0: \theta \in \Theta_0$ vs. $H_1: \theta \notin \Theta_0$ could be carried out by the test that rejects $H_0$ for large values of the statistic

  $M_n = \max_{\theta \in \Theta} m_n(\theta) - \max_{\theta \in \Theta_0} m_n(\theta)$, where $m_n(\theta) = \sum_{i=1}^n m(X_i, \theta)$.

Under reasonable regularity conditions we might expect $m_n(\theta)$ to be approximately quadratic in a neighborhood of its maximizer $\hat\theta_n = \arg\max_{\theta \in \Theta} m_n(\theta)$, with curvature $\{-\ddot m_n(\hat\theta_n)\}$. Hence the above test will be approximately equivalent to: reject $H_0$ whenever

  $\{\theta: (\theta - \hat\theta_n)^T\{-\ddot m_n(\hat\theta_n)\}(\theta - \hat\theta_n) \le c^2\} \cap \Theta_0 = \emptyset$

for some $c$. The maximizer $\hat\theta_n$, which can be taken as an estimator of $\theta$, is commonly referred to as an M-estimator (M for maximization).

In the case of a parametric model where $\theta$ provides a complete indexing, i.e., we could rewrite the model as $X_i \sim g(x_i \mid \theta)$, $\theta \in \Theta$, an attractive criterion function is $m(X_i, \theta) = \log g(X_i \mid \theta)$, with $E\{m(X_i, \theta) \mid \theta_0\} = \mathrm{const} - K(g(\cdot \mid \theta_0), g(\cdot \mid \theta))$, taking us back to the ML approach. However, other criterion functions could be used in this situation, and will have to be used in other situations where $\theta$ provides only a partial indexing. Here are some examples.

Example (Least squares, linear and non-linear). Consider data $X_i = (Z_i, Y_i) \in \mathbb{R}^p \times \mathbb{R}$, $i = 1, \dots, n$, modeled as

  $Y_i = h(Z_i, \beta) + \sigma\epsilon_i$, $(Z_i, \epsilon_i) \sim g(z_i, \epsilon_i)$,

with model parameters $(\beta, \sigma, g) \in \mathbb{R}^p \times \mathbb{R}_+ \times \mathcal{G}$, where $\mathcal{G}$ is the space of all pdfs $g(z, \epsilon)$ on $\mathbb{R}^p \times \mathbb{R}$ such that $E[\epsilon \mid Z, g] = 0$, $\mathrm{Var}[\epsilon \mid Z, g] = 1$. Here $h$ is a known function, potentially non-linear, satisfying the identifiability condition: $h(z, \beta_1) = h(z, \beta_2)$ for all $z$ implies $\beta_1 = \beta_2$. The least squares criterion function is $m(X_1, \beta) = -(Y_1 - h(Z_1, \beta))^2$.
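As a small illustration of computing the least squares M-estimator with a quasi-Newton algorithm, here is a sketch on simulated data. The exponential mean function $h(z, \beta) = e^{\beta_1 + \beta_2 z}$, the design, and the noise level are arbitrary choices made for the demonstration, not part of the notes.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(8)
n = 1000
beta_true = np.array([0.5, -1.0])
Z = rng.uniform(0, 2, size=n)
h = lambda z, b: np.exp(b[0] + b[1] * z)        # assumed non-linear mean function
Y = h(Z, beta_true) + 0.1 * rng.normal(size=n)  # sigma = 0.1

# M-estimation: maximize sum_i m(X_i, beta) = -sum_i (Y_i - h(Z_i, beta))^2,
# i.e. minimize the residual sum of squares, via quasi-Newton (BFGS).
rss = lambda b: np.sum((Y - h(Z, b)) ** 2)
fit = minimize(rss, x0=np.zeros(2), method="BFGS")
print(fit.x)   # close to beta_true
```

BFGS here approximates the gradient numerically; for a real application one would typically supply the analytic Jacobian of $h$.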
If $(\beta_0, \sigma_0, g_0)$ characterizes the true distribution of the data, and $E_0$ denotes expectation under this distribution, then

  $E_0\{m(X_i, \beta)\} = -\sigma_0^2 - E_0\{h(Z_1, \beta_0) - h(Z_1, \beta)\}^2,$

which is uniquely maximized at $\beta = \beta_0$. The M-estimator $\hat\beta_n$ is often computed by using quasi-Newton optimization algorithms.

Example (Quantile regression). Consider data $X_i = (Z_i, Y_i) \in \mathbb{R}^p \times \mathbb{R}$, $i = 1, \dots, n$, and suppose we want to model the $\tau$-th conditional quantile of $Y_i$ given $Z_i$ by $Z_i^T\beta$. More formally,

  $Y_i = Z_i^T\beta + \epsilon_i$, $Z_i \overset{\mathrm{IND}}{\sim} g_z(z_i)$, $\epsilon_i \mid Z_i \sim g_\epsilon(\varepsilon_i \mid Z_i)$,

where $(\beta, g_z, g_\epsilon)$ are unknown but $g_\epsilon$ is constrained to satisfy $\int_{-\infty}^0 g_\epsilon(\varepsilon \mid z_i)\,d\varepsilon = \tau$, so that under the model $P(Y_i \le z_i^T\beta \mid Z_i = z_i) = \tau$ for all $z_i$. Koenker and Bassett introduced the criterion function

  $m(X_i, \beta) = -(Y_i - Z_i^T\beta)\{\tau - I(Y_i - Z_i^T\beta \le 0)\},$

and it can be proved that $E_0\{m(X_i, \beta)\}$ is uniquely maximized at $\beta = \beta_0$. The M-estimator $\hat\beta_n$ can be computed efficiently by linear programming.

1.3 Z-estimators

If the criterion function is differentiable in $\theta$, then the M-estimator can potentially be computed by solving the equation

  $\psi_n(\theta) := \sum_{i=1}^n \psi(X_i, \theta) = 0$   (2)

where $\psi(X_i, \theta) = \frac{\partial}{\partial\theta} m(X_i, \theta)$. This suggests another way to get to a good estimator of $\theta$, which can then be used to construct test procedures for hypotheses about $\theta$. Let $\dim(\theta) = d$. We will call a $d$-dimensional map $\theta \mapsto \psi(X_i, \theta)$ a score function if $E_0\,\psi(X_i, \theta)$ has a unique zero at $\theta = \theta_0$. A root (or zero) $\hat\theta_n$ of $\psi_n(\theta)$ will be called a Z-estimator. Bear in mind that an M-estimator need not be a Z-estimator (MLE for the model $X_i \sim \mathrm{Uniform}(0, \theta)$, $\theta > 0$) and a Z-estimator need not be an M-estimator (moment estimators, next).

Example (Method of Moments). Consider the model $X_1, \dots, X_n \sim g(x_i \mid \theta)$, $\theta \in \Theta \subset \mathbb{R}^d$. Suppose the $X_j$'s are univariate. Let $\mu_j(\theta) = E\{X_1^j \mid \theta\}$, $j = 1, \dots, d$. Suppose $\theta \mapsto (\mu_1(\theta), \mu_2(\theta), \dots, \mu_d(\theta))$ is one-to-one. Then one could estimate $\theta$ by solving the equations

  $\mu_j(\theta) = \frac{1}{n}\sum_{i=1}^n X_i^j$, $j = 1, 2, \dots, d.$

The solution $\hat\theta$ is called the method of moments (MoM) estimator. Similar methods could be devised for multivariate $X_j$'s.

1.4 Asymptotic Normality

Under regularity conditions, asymptotic normality properties can be established for both M- and Z-estimators, leading to asymptotic size and p-value calculations for tests based on these estimators. The classical conditions are precisely the Cramér conditions we saw for asymptotic normality of the MLE. In fact we can reproduce Theorem 7 from Handout 3 almost verbatim.
Here is a precise statement for the Z-estimator (and so it also applies to M-estimators that can be seen as Z-estimators).

Theorem 1. Suppose $X_1, X_2, \dots$ are modeled as $X_i \sim g$, $g \in \mathcal{G}$, and we are interested in $\theta = \theta(g) \in \Theta \subset \mathbb{R}^d$. Assume $g_0 \in \mathcal{G}$ is the true distribution and $\theta_0 = \theta(g_0)$ is the true value of $\theta$. Let $E_0$, $\mathrm{Var}_0$ denote expectation and variance under $g_0$. Suppose $\theta \mapsto \psi(x, \theta)$ is twice continuously differentiable in a neighborhood $U_0$ of $\theta_0$ for every $x$. Suppose $E_0\,\psi(X_i, \theta_0) = 0$, $\mathrm{Var}_0\,\psi(X_i, \theta_0)$ is finite, and the matrix $E_0\,\dot\psi(X_i, \theta_0)$ exists and is nonsingular. Assume that the second order partial derivatives of $\psi$ are dominated within $U_0$ by a fixed function $\bar\psi(x)$ with $E_0\,\bar\psi(X_1) < \infty$. Then,

1. The probability that the equation $\psi_n(\theta) = 0$ has at least one root tends to 1 as $n \to \infty$, and there exists a sequence of roots $\theta_n^*$ such that $\theta_n^* \overset{P}{\to} \theta_0$.

2. For any consistent estimator sequence $\hat\theta_n$ satisfying $\psi_n(\hat\theta_n) = 0$,

  $\sqrt{n}(\hat\theta_n - \theta_0) = -\{E_0\,\dot\psi(X_1, \theta_0)^T\}^{-1}\,\frac{1}{\sqrt{n}}\sum_{i=1}^n \psi(X_i, \theta_0) + R_n$

with $R_n \overset{P}{\to} 0$, and hence $\sqrt{n}(\hat\theta_n - \theta_0) \overset{L}{\to} N_d(0, \Sigma(g_0))$, where

  $\Sigma(g_0) = \{E_0\,\dot\psi(X_1, \theta_0)^T\}^{-1}\{\mathrm{Var}_0\,\psi(X_1, \theta_0)\}\{E_0\,\dot\psi(X_1, \theta_0)\}^{-1}.$

3. With probability tending to 1, $\frac{1}{n}\dot\psi_n(\hat\theta_n)$ is nonsingular and

  $\hat\Sigma_n := \Big\{\frac{1}{n}\dot\psi_n(\hat\theta_n)^T\Big\}^{-1}\Big\{\frac{1}{n}\sum_{i=1}^n \psi(X_i, \hat\theta_n)\psi(X_i, \hat\theta_n)^T\Big\}\Big\{\frac{1}{n}\dot\psi_n(\hat\theta_n)\Big\}^{-1} \overset{P}{\to} \Sigma(g_0),$

and hence $\hat\theta_n \sim \mathrm{AN}_d(\theta_0, \frac{1}{n}\hat\Sigma_n)$.

Proof. Repeat the proof of Theorem 7 of Handout 3 by replacing $\dot\ell_n(\theta)$ with $\psi_n(\theta)$.

Example (Error-in-variables regression). Let observations $X_i = (Z_i, Y_i) \in \mathbb{R} \times \mathbb{R}$, $i = 1, 2, \dots$, denote paired univariate measurements modeled as

  $Z_i = U_i + \phi_i$, $Y_i = \alpha + \beta U_i + \epsilon_i$, $(U_i, \phi_i, \epsilon_i) \sim g_U \times N(0, \sigma^2) \times N(0, \sigma^2).$

This model is referred to as error-in-variables regression or measurement-error regression, since the actual predictor $U_i$ is unobserved; we only get to record a contaminated version $Z_i = U_i + \phi_i$. The distribution $g_U$ of the latent predictor is unknown and of no interest. Primary interest concerns the regression coefficients³ $\theta = (\alpha, \beta, \sigma^2)$. The simple least squares estimate of $\theta$ provides poor inference; the estimator is not even consistent.
But a consistent, asymptotically normal Z-estimator is obtained by using (HW 2):

  $\psi(X_i, \theta) = \begin{pmatrix} Y_i - \alpha - \beta Z_i \\ Z_i(Y_i - \alpha - \beta Z_i) + \beta\sigma^2 \\ (Y_i - \alpha - \beta Z_i)^2 - \sigma^2(1 + \beta^2) \end{pmatrix}.$

³ Here we have assumed $\mathrm{Var}(\phi_i) = \mathrm{Var}(\epsilon_i) = \sigma^2$. In a more general setting we could let $\mathrm{Var}(\phi_i) = \rho\sigma^2$. When $\rho$ is known, we could get back to the equal variance setting by rescaling $U_i$ and $Z_i$. Unknown $\rho$ is difficult to deal with, and requires prior knowledge or additional variables.
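A quick numerical sketch of this Z-estimator on simulated data (the exponential distribution for the latent $U_i$ is an arbitrary choice of $g_U$ made here for illustration): we solve $\bar\psi_n(\theta) = 0$ with a generic root-finder and contrast the result with the inconsistent least squares slope.

```python
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(2)
n = 20000
alpha0, beta0, sig2 = 1.0, 2.0, 0.25        # true (alpha, beta, sigma^2)
U = rng.exponential(scale=1.0, size=n)      # latent predictor (arbitrary non-Gaussian g_U)
Z = U + np.sqrt(sig2) * rng.normal(size=n)  # contaminated predictor
Y = alpha0 + beta0 * U + np.sqrt(sig2) * rng.normal(size=n)

def psi_bar(theta):
    """Empirical average of psi(X_i, theta), theta = (alpha, beta, sigma^2)."""
    a, b, s2 = theta
    r = Y - a - b * Z
    return np.array([r.mean(),
                     (Z * r).mean() + b * s2,
                     (r ** 2).mean() - s2 * (1 + b ** 2)])

sol = root(psi_bar, x0=np.array([0.0, 1.6, 0.5]))   # warm start near the OLS fit
theta_hat = sol.x

# For contrast: plain least squares is inconsistent; the slope is attenuated
# towards beta * Var(U) / (Var(U) + sigma^2) = 2 / 1.25 = 1.6 here.
ols_slope = np.cov(Z, Y)[0, 1] / np.var(Z)
print(theta_hat, ols_slope)
```

The Z-estimator recovers $(\alpha, \beta, \sigma^2)$ while the least squares slope stays stuck near the attenuated value, matching the inconsistency noted above.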

2 Other approaches

The M- and Z-estimator approaches provide a broad platform to construct statistical procedures with sound frequentist properties. They cover a large number of widely used statistical methods. However, there is an even larger number of procedures that do not stem from a particular platform of method generation, but are constructed more or less on heuristics that apply to a particular context. A testing procedure built upon heuristics turns into a rigorous testing procedure the moment we are able to calculate its size and establish good power on the alternatives. These calculations can be challenging, but asymptotic theory and/or simulation techniques help out. Here are two specific examples.

2.1 Rank based test for location shift

Consider two samples of observations $W_1, \dots, W_m$ and $V_1, \dots, V_k$ modeled as

  $W_i \sim g(w_i - \theta)$, $V_j \sim g(v_j)$, all $W_i$'s and $V_j$'s independent,

where $\theta \in \mathbb{R}$ and $g \in \{\text{all densities on } \mathbb{R}\}$ are unknown. We are interested in testing $H_0: \theta = 0$ vs. $H_1: \theta \neq 0$. A popular test statistic is the Mann-Whitney statistic⁴

  $U = \frac{1}{mk}\sum_{i=1}^m\sum_{j=1}^k I(V_j \le W_i).$

Clearly, $\mu(\theta) := E(U \mid \theta) = P(V_1 \le W_1 \mid \theta)$, which equals $1/2$ if $\theta = 0$. If $\theta > 0$, then $U$ is expected to be larger (because then the $W_i$'s get shifted to the right by $\theta$ and should be relatively larger than the $V_j$'s), and similarly, for $\theta < 0$, $U$ is expected to be smaller. Hence a test statistic against $H_0$ is given by $|U - 1/2|$. It can be shown (from U-statistic theory, beyond the scope of this course) that with $n = m + k$ and $G(w)$ denoting the CDF of $g(w)$,

  $\sqrt{n}\,(U - \mu(\theta)) = \frac{\sqrt{n}}{m}\sum_{i=1}^m [G(W_i) - E\{G(W_i) \mid \theta\}] - \frac{\sqrt{n}}{k}\sum_{j=1}^k [G(V_j - \theta) - E\{G(V_j - \theta) \mid \theta\}] + R_n(\theta)$

with $R_n(\theta) \overset{P}{\to} 0$ under $\theta$ as $n \to \infty$. If $m/n \to \lambda \in (0, 1)$ as $n \to \infty$, then we can conclude from the above (by the standard CLT) that

  $\sqrt{n}\,(U - \mu(\theta)) \overset{L}{\to} N_1(0, \sigma^2(\theta))$   (3)

where $\sigma^2(\theta) = \frac{1}{\lambda}\mathrm{Var}\{G(W_1) \mid \theta\} + \frac{1}{1-\lambda}\mathrm{Var}\{G(V_1 - \theta) \mid \theta\}$.
⁴ A related statistic is the Wilcoxon rank sum statistic $R = \sum_{i=1}^m \mathrm{rank}(W_i) = mkU + m(m+1)/2$, where $\mathrm{rank}(W_i)$ gives the rank of $W_i$ in the pooled sample $\{W_1, \dots, W_m, V_1, \dots, V_k\}$ arranged from smallest to largest.
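The statistic $U$ and the rank-sum identity in the footnote can be checked directly on simulated data. The Gaussian choice of $g$ and the shift $\theta = 0.5$ below are illustrative assumptions, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(3)
m, k = 40, 60
theta = 0.5                                # location shift under the alternative
W = rng.normal(size=m) + theta             # W_i ~ g(. - theta), here g = N(0,1)
V = rng.normal(size=k)                     # V_j ~ g

U = (V[None, :] <= W[:, None]).mean()      # Mann-Whitney statistic

# Wilcoxon rank sum of the W's in the pooled sample (no ties, almost surely)
ranks = np.concatenate([W, V]).argsort().argsort() + 1
R = ranks[:m].sum()

# identity from the footnote: R = mkU + m(m+1)/2
print(U, R, m * k * U + m * (m + 1) / 2)
```

The identity holds exactly whenever the pooled sample has no ties, which happens with probability one for continuous $g$.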

These calculations simplify considerably when $\theta = 0$, which is what we need to calculate the size of the test based on $U$. Note that when $\theta = 0$, $W_i, V_j \sim g$ and hence $G(W_i)$ and $G(V_j)$ are all $\mathrm{Uniform}(0, 1)$ variables with variance $1/12$. Hence,

  $\sigma^2(0) = \frac{1}{12\lambda(1-\lambda)}.$

A simple consistent estimator of $\sigma^2(0)$ is $n^2/\{12mk\}$, and hence an approximately size-$\alpha$ Mann-Whitney test of $H_0: \theta = 0$ vs. $H_1: \theta \neq 0$ is given by

  reject $H_0$ if $\sqrt{12mk/n}\;|U - 1/2| \ge \Phi^{-1}(1 - \alpha/2).$

Of course, if $g$ were known to be Gaussian, then the ML test for the above hypotheses would be given by the standard t-test (with equal variances assumed). We will later see that even in this case, a size-$\alpha$ Mann-Whitney test has power comparable to a size-$\alpha$ t-test. Much more interestingly, if $g$ is non-Gaussian and possesses tails that are thicker than those of a Gaussian pdf, then the Mann-Whitney test can be much more powerful than the t-test!

2.2 Testing complete spatial randomness (CSR)

Point pattern data are routinely studied in ecology and related fields. Let $X_1, \dots, X_n$ denote observed spatial locations of $n$ events (such as occurrences of some shrub species) in $S \subset \mathbb{R}^2$. Can we detect whether the point pattern of these locations is aggregated (i.e., points are clustered together), regular (i.e., grid-like; the points repel each other), or completely spatially random? A heuristic approach to this inference problem goes like this. The K-statistic at distance $r$ is defined as

  $\hat K(r) = \frac{A}{n^2}\sum_{i \neq j} w_{ij}\, I(\|X_i - X_j\| \le r)$

where $A = \mathrm{area}(S)$ and $w_{ij} = 1/\mathrm{area}(S \cap B(X_i, r))$ is an edge-correction weight, with $B(c, r) := \{x \in \mathbb{R}^2: \|x - c\| \le r\}$ denoting the ball/disc in $\mathbb{R}^2$ with center $c$ and radius $r$. The hypothesis of CSR implies $K(r) \approx \pi r^2$. So we will take our test statistic as

  $\hat L(r) = \sqrt{\hat K(r)/\pi} - r.$

The exact distribution of $\hat L(r)$ is difficult to assess (it depends on the shape of $S$ in an intractable way). But it is easy to simulate CSR patterns of $n$ points in $S$, and we could use the simulations to create a Monte Carlo approximation to the distribution of $\hat L(r)$.
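A minimal sketch of this Monte Carlo recipe on the unit square. For simplicity the edge-correction weights $w_{ij}$ are dropped (a deliberate simplification: the observed and simulated patterns then share the same edge bias, so the comparison remains fair); the sample size, range $r$, and number of simulations are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n, r, nsim, alpha = 50, 0.1, 199, 0.05

def L_hat(pts, r):
    """L-hat(r) = sqrt(K-hat(r)/pi) - r for points in the unit square S = [0,1]^2.
    Simplification: A = 1 and the edge-correction weights w_ij are set to 1."""
    npts = len(pts)
    d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
    off_diag = ~np.eye(npts, dtype=bool)
    K = (d2[off_diag] <= r ** 2).sum() / npts ** 2
    return np.sqrt(K / np.pi) - r

obs = L_hat(rng.uniform(size=(n, 2)), r)    # pattern under test (here itself CSR)
sims = np.array([L_hat(rng.uniform(size=(n, 2)), r) for _ in range(nsim)])
lo, hi = np.quantile(sims, [alpha / 2, 1 - alpha / 2])
print(obs, (lo, hi), not (lo <= obs <= hi))   # last entry: reject CSR?
```

Since the "observed" pattern is itself CSR here, the test should reject only about $\alpha$ of the time across repeated runs.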
If $L_{\alpha/2,r}$ and $L_{1-\alpha/2,r}$ denote the (MC approximate) $\alpha/2$-th and $(1-\alpha/2)$-th quantiles of $\hat L(r)$ under CSR, then we could reject $H_0$: CSR at range $r$ if $\hat L(r) \notin [L_{\alpha/2,r}, L_{1-\alpha/2,r}]$. See [reference omitted] for more details and for an application to shrub patterns.

3 Relative Efficiency of Tests

Now that we have seen multiple ways of constructing test procedures, it is natural to ask if we could rank them based on their relative performances. While such comparisons are hard to make for a fixed sample size, asymptotic calculations often help.

To facilitate our discussion, let's get a clear formalization of the asymptotic context. We will consider an infinite sequence of statistical models $X^{(n)} \sim p_{n,\theta}$, $\theta \in \Theta$, with a common parameter. For data $X_1, X_2, \dots$ modeled as $X_i \sim g(x_i \mid \theta)$, $\theta \in \Theta$, such a sequence of models is given by

  $X^{(n)} = (X_1, \dots, X_n)$, $p_{n,\theta}(x^{(n)}) = g(x_1 \mid \theta)\cdots g(x_n \mid \theta).$   (4)

For testing $H_0: \theta \in \Theta_0$ vs. $H_1: \theta \in \Theta_0^c$, we will consider sequences of test procedures given by test statistics and critical values, $\{(T_n, c_n): n = 1, 2, \dots\}$, with $T_n$ based on data $X^{(n)}$, typically given by the same test generation scheme applied to all $n$, such as ML, or Z-estimation associated with a certain score function $\psi$, etc. Such a test sequence will be called asymptotically size $\alpha$ if

  $\lim_{n\to\infty} \sup_{\theta \in \Theta_0} P(T_n \ge c_n \mid \theta) = \alpha.$

Now consider two sequences of test procedures $(T_{n,1}, c_{n,1})$ and $(T_{n,2}, c_{n,2})$, $n = 1, 2, \dots$, and denote the corresponding power function sequences by

  $\pi_{n,1}(\theta) = P(T_{n,1} \ge c_{n,1} \mid \theta)$, $\pi_{n,2}(\theta) = P(T_{n,2} \ge c_{n,2} \mid \theta).$

If both test sequences are asymptotically size $\alpha$, then we would declare the first sequence asymptotically better than the second if

  $\lim_n \pi_{n,1}(\theta) \ge \lim_n \pi_{n,2}(\theta)$ for every $\theta \in \Theta_0^c$.

However, such comparisons are mostly useless because for almost all reasonable size-$\alpha$ test sequences, the limiting power equals 1 at any $\theta \in \Theta_0^c$. For example, for the sequence of models based on $X_i \sim N(\mu, 1)$, an asymptotically size-5% test of $H_0: \mu \le 0$ vs. $H_1: \mu > 0$ is given by: reject $H_0$ if $\sqrt{n}\,\bar X^{(n)} \ge 1.64$. At any $\mu > 0$, $\pi_n(\mu) = 1 - \Phi(1.64 - \sqrt{n}\mu) \to 1$ as $n \to \infty$.

3.1 The Pitman Relative Efficiency

The situation can be salvaged by comparing $\lim_n \pi_n(\theta_n)$ over a sequence of increasingly difficult alternatives $\theta_n \to \partial\Theta_0$ [the boundary of $\Theta_0$]. To make headway along this line, we make a number of simplifying assumptions. First, we assume $\Theta_0 = \{\theta_0\}$, a point null at $\theta_0$. We also focus only on test statistic sequences $(T_n: n = 1, 2, \dots)$
that are regular at $\theta_0$, i.e., there exist functions $\mu(\theta)$ and $\sigma(\theta)$ such that for any $h$,

  $\frac{\sqrt{n}\{T_n - \mu(\theta_0 + h/\sqrt{n})\}}{\sigma(\theta_0 + h/\sqrt{n})} \overset{L}{\to} N(0, 1)$ under $p_{n,\theta_0 + h/\sqrt{n}}$.   (5)

The above condition is to be understood as follows: if $Z_n$ denotes the normalized statistic on the left of (5), then $\lim_n P(Z_n \in A \mid \theta_0 + h/\sqrt{n}) = P(Z \in A)$ for every $A \subset \mathbb{R}$, where $Z \sim N(0,1)$. Bear in mind that regularity is a strong condition and need not hold even if

we could show $\sqrt{n}(T_n - \mu(\theta))/\sigma(\theta) \overset{L}{\to} N(0, 1)$ at every fixed $\theta \in \Theta$. We will later see tools to establish regularity.

As a sequence of increasingly difficult alternatives, consider $\theta_n = \theta_0 + h/\sqrt{n}$, $n \ge 1$, for some fixed $h \neq 0$, i.e., we consider the limiting power along a sequence of alternatives collapsing onto $\theta_0$ at the $\sqrt{n}$ rate along a ray $h$. If $\{(T_n, c_n), n \ge 1\}$ is asymptotically size $\alpha$ with $T_n$ regular, then we can write $c_n = \mu(\theta_0) + \sigma(\theta_0)(z_\alpha + r_n)/\sqrt{n}$, where $z_\alpha = \Phi^{-1}(1-\alpha)$ and $r_n \to 0$. Notice that

  $\pi_n(\theta_n) = P\Big(\frac{\sqrt{n}\{T_n - \mu(\theta_n)\}}{\sigma(\theta_n)} \ge \frac{\sqrt{n}\{\mu(\theta_0) + \sigma(\theta_0)(z_\alpha + r_n)/\sqrt{n} - \mu(\theta_n)\}}{\sigma(\theta_n)}\Big)$

and

  $\lim_n \frac{\sqrt{n}\{\mu(\theta_0) + \sigma(\theta_0)(z_\alpha + r_n)/\sqrt{n} - \mu(\theta_n)\}}{\sigma(\theta_n)} = z_\alpha - \frac{h^T\dot\mu(\theta_0)}{\sigma(\theta_0)},$

and hence, by Slutsky's theorem,

  $\lim_n \pi_n(\theta_n) = 1 - \Phi\Big(z_\alpha - \frac{h^T\dot\mu(\theta_0)}{\sigma(\theta_0)}\Big).$

The quantity $\dot\mu(\theta_0)/\sigma(\theta_0)$ is known as the slope of the test statistic sequence $T_n$ at $\theta_0$. Clearly we should prefer test statistics with higher slopes!

The above calculations may seem too dependent on the arbitrary choice of the collapsing alternatives. Pitman established that these calculations are indeed robust. To formalize this, I state below the definition of the Pitman relative efficiency and a theorem that connects this concept to the ratio of slopes of two test sequences.

Definition 1. Let $\{T_{n,1}: n \ge 1\}$ and $\{T_{n,2}: n \ge 1\}$ be two test statistic sequences. For any given $\alpha, \gamma \in (0, 1)$, let $n_1(\alpha, \gamma, \theta)$ denote the minimum $n$ needed so that a level-$\alpha$ test based on $T_{n,1}$ has power at least $\gamma$ at $\theta$. Let $n_2(\alpha, \gamma, \theta)$ denote the same for the other sequence. The Pitman relative efficiency of the first sequence against the second, at level $\alpha$ and power $\gamma$, is defined as

  $\lim_{\theta_\nu \to \theta_0} \frac{n_2(\alpha, \gamma, \theta_\nu)}{n_1(\alpha, \gamma, \theta_\nu)}$   (6)

provided the limit exists. In principle, the limiting sample size ratio may depend on $\alpha$, $\gamma$ and the particular collapsing sequence of alternatives $\theta_\nu$.
But often it does not, especially if $\theta_\nu = \theta_0 + \nu h$ where $h$ is a fixed unit vector and $\nu \downarrow 0$, i.e., the collapsing alternatives $\theta_\nu$ approach $\theta_0$ along a ray and from only one direction. Below is a precise statement and proof for the case $\dim(\theta) = 1$ with $\Theta = [0, \infty)$, $\theta_0 = 0$.

Theorem 2. Consider statistical models $(p_{n,\theta}: \theta \ge 0)$ such that $\|p_{n,\theta} - p_{n,0}\| \to 0$ as $\theta \to 0$ for every fixed $n$. Let $T_{n,1}$ and $T_{n,2}$ be two sequences of test statistics that satisfy the regularity condition (5) with $\dot\mu_i(0) > 0$ and $\sigma_i(0) > 0$, $i = 1, 2$. Then the Pitman relative efficiency of the first sequence against the second equals

  $\Big(\frac{\dot\mu_1(0)/\sigma_1(0)}{\dot\mu_2(0)/\sigma_2(0)}\Big)^2$

for every sequence $\theta_\nu \downarrow 0$, independently of $\alpha$ and $\gamma$.

Proof. A somewhat tedious but straightforward argument shows that each of $n_i(\alpha, \gamma, \theta_\nu)$, $i = 1, 2$, must diverge to $\infty$. Also, to ensure asymptotic size $\alpha$, the critical value may be taken as $\mu_i(0) + \sigma_i(0)(z_\alpha + r_{n,i})/\sqrt{n}$, where $r_{n,i} \to 0$. And hence,

  $\pi_{n_i(\alpha,\gamma,\theta_\nu),\,i}(\theta_\nu) = 1 - \Phi\Big(z_\alpha + r_n' - \frac{\dot\mu_i(0)}{\sigma_i(0)}\sqrt{n_i(\alpha,\gamma,\theta_\nu)}\;\theta_\nu\,(1 + r_n'') + r_n'''\Big)$

where all of $r_n', r_n'', r_n''' \to 0$. These power values converge to $\gamma$ only when the argument of $\Phi(\cdot)$ above converges to $z_\gamma$, and hence

  $\lim_\nu \frac{n_2(\alpha,\gamma,\theta_\nu)}{n_1(\alpha,\gamma,\theta_\nu)} = \lim_\nu \frac{n_2(\alpha,\gamma,\theta_\nu)\,\theta_\nu^2}{n_1(\alpha,\gamma,\theta_\nu)\,\theta_\nu^2} = \frac{(z_\alpha - z_\gamma)^2/\{\dot\mu_2(0)/\sigma_2(0)\}^2}{(z_\alpha - z_\gamma)^2/\{\dot\mu_1(0)/\sigma_1(0)\}^2} = \Big(\frac{\dot\mu_1(0)/\sigma_1(0)}{\dot\mu_2(0)/\sigma_2(0)}\Big)^2.$

3.2 Le Cam's Third Lemma & Local Asymptotic Normality

The regularity property (5) requires establishing asymptotic normality of $T_n$ under a sequence of probability distributions $q_n = p_{n,\theta_0 + h/\sqrt{n}}$ associated with a sequence of parameter values $\theta_n = \theta_0 + h/\sqrt{n}$. We have seen that it is often possible to establish asymptotic normality under the sequence $p_n = p_{n,\theta_0}$ associated with the fixed parameter value $\theta_0$. Le Cam developed a brilliant theory of contiguity and local asymptotic normality to show that asymptotic normality under $p_n$ can be enough to establish asymptotic normality under $q_n$. We will not see his theory in full glory, but here is the much celebrated "Le Cam's third lemma":

Theorem 3 (Le Cam's third lemma). For every $n = 1, 2, \dots$, let $p_n, q_n$ be two pdfs defined on a common space $\mathcal{X}_n$ and let $T_n: \mathcal{X}_n \to \mathbb{R}^k$ be a map such that

  $\begin{pmatrix} T_n(X^{(n)}) \\ \log\frac{q_n}{p_n}(X^{(n)}) \end{pmatrix} \overset{L}{\to} N_{k+1}\Big(\begin{pmatrix} \mu \\ -\frac{1}{2}\sigma^2 \end{pmatrix}, \begin{pmatrix} \Sigma & \tau \\ \tau^T & \sigma^2 \end{pmatrix}\Big)$ under $X^{(n)} \sim p_n$.

Then $T_n(X^{(n)}) \overset{L}{\to} N_k(\mu + \tau, \Sigma)$ under $X^{(n)} \sim q_n$.

The result might look a bit peculiar at first because of the requirement that $\log\frac{q_n}{p_n}(X^{(n)}) \overset{L}{\to} N(-\sigma^2/2, \sigma^2)$ for some $\sigma^2$. There is nothing unusual about this, because it holds for all regular models. In fact it holds under a broad characterization known as Local Asymptotic Normality (LAN).
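Theorem 3 can be checked by simulation in the simplest Gaussian location model, where the joint limit is in fact exact at every $n$: with $X_i$ iid $N(\theta, 1)$, $p_n$ at $\theta = 0$ and $q_n$ at $\theta = h/\sqrt{n}$, the statistic $T_n = \sqrt{n}\,\bar X$ satisfies $\log(q_n/p_n) = hT_n - h^2/2$, so $\sigma^2 = h^2$, $\tau = h$, and under $q_n$ the lemma predicts $T_n \sim N(h, 1)$. The sample sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
h, n, nrep = 1.5, 200, 10000

# X_i iid N(theta, 1); p_n at theta = 0, q_n at theta = h/sqrt(n).
X0 = rng.normal(0.0, 1.0, size=(nrep, n))             # replicates under p_n
T0 = np.sqrt(n) * X0.mean(axis=1)                     # T_n = sqrt(n) * Xbar
llr = h * T0 - h ** 2 / 2                             # log(q_n/p_n), exact here

Xq = rng.normal(h / np.sqrt(n), 1.0, size=(nrep, n))  # replicates under q_n
Tq = np.sqrt(n) * Xq.mean(axis=1)

tau = np.cov(T0, llr)[0, 1]                           # limiting covariance, ~ h
print(llr.mean(), llr.var(), tau, Tq.mean())
```

The four printed numbers approximate $-h^2/2$, $h^2$, $h$ and $h$ respectively: the first three verify the joint-normality hypothesis of the lemma under $p_n$, and the last verifies its conclusion, the mean shift by $\tau$ under $q_n$.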
Here is the definition:

Definition 2 (Local Asymptotic Normality). A sequence of statistical models $(p_{n,\theta}: \theta \in \Theta)$ is called locally asymptotically normal (LAN) at $\theta = \theta_0$ if there exist matrices $r_n$, $I_{\theta_0}$ and random vectors $\Delta_{n,\theta_0}$ such that $\Delta_{n,\theta_0} \overset{L}{\to} N(0, I_{\theta_0})$ under $p_{n,\theta_0}$, and for every converging sequence $h_n \to h$,

  $\log\frac{p_{n,\theta_0 + r_n^{-1}h_n}}{p_{n,\theta_0}}(X^{(n)}) = h^T\Delta_{n,\theta_0} - \frac{1}{2}h^T I_{\theta_0} h + R_n,$

with $R_n \to 0$ in probability under $p_{n,\theta_0}$.

Note that if the family is LAN at $\theta_0$, then

  $\log\frac{p_{n,\theta_0 + r_n^{-1}h_n}}{p_{n,\theta_0}}(X^{(n)}) \overset{L}{\to} N\big(-\tfrac{1}{2}\sigma^2, \sigma^2\big)$ under $p_{n,\theta_0}$, with $\sigma^2 = h^T I_{\theta_0} h$.

Therefore, for a model satisfying LAN, we can invoke Le Cam's third lemma for any statistic $T_n = T_n(X^{(n)})$ as soon as we can establish a bivariate CLT between $T_n$ and $h^T\Delta_{n,\theta_0}$ and calculate the limiting covariance vector $\tau$. Before we work out a specific example, here are some results on when LAN happens.

Example (IID models). Suppose the $p_{n,\theta}$'s are associated with an observation model $X_1, X_2, \dots \sim g(x_i \mid \theta)$, $\theta \in \Theta$, as in (4). If $\{g(\cdot \mid \theta): \theta \in \Theta\}$ satisfies the Cramér conditions at a $\theta_0$, then $(p_{n,\theta}: \theta \in \Theta)$ is LAN at $\theta_0$ with $I_{\theta_0} = I_F(\theta_0)$, the Fisher information at $\theta_0$, $r_n = \sqrt{n}\,I_d$ ($I_d$ is the $d \times d$ identity matrix), and

  $\Delta_{n,\theta_0} = \frac{1}{\sqrt{n}}\sum_{i=1}^n \frac{\partial}{\partial\theta}\log g(X_i \mid \theta_0) = \frac{1}{\sqrt{n}}\dot\ell_n(\theta_0).$

The same assertion holds under conditions weaker than the Cramér conditions. The family $\{g(\cdot \mid \theta): \theta \in \Theta\}$ is said to be differentiable in quadratic mean (DQM) at $\theta_0$ if there is a function $s_{\theta_0}(x)$ such that

  $\lim_{h \to 0} \frac{1}{\|h\|^2}\int\Big[\sqrt{g(x \mid \theta_0 + h)} - \sqrt{g(x \mid \theta_0)} - \frac{1}{2}h^T s_{\theta_0}(x)\sqrt{g(x \mid \theta_0)}\Big]^2 dx = 0.$

When $\theta \mapsto \log g(x \mid \theta)$ is differentiable at $\theta_0$ for almost every $x$, one typically has $s_{\theta_0}(x) = \nabla_\theta \log g(x \mid \theta_0)$, the score function. If $\{g(\cdot \mid \theta): \theta \in \Theta\}$ is DQM at $\theta_0$, then the associated sequence of models $(p_{n,\theta}: \theta \in \Theta)$ is LAN at $\theta_0$ with $I_{\theta_0} = E\{s_{\theta_0}(X)s_{\theta_0}(X)^T \mid \theta_0\}$, $r_n = \sqrt{n}\,I_d$ and $\Delta_{n,\theta_0} = \sum_{i=1}^n s_{\theta_0}(X_i)/\sqrt{n}$.

3.3 Slope calculation for Mann-Whitney test

Under the assumed model on $X^{(n)} = (W_1, \dots, W_m, V_1, \dots, V_k)$, we have

  $p_{n,\theta}(w_1, \dots, w_m, v_1, \dots, v_k) = \prod_{i=1}^m g(w_i - \theta)\prod_{j=1}^k g(v_j) = p^1_{m,\theta}(w^{(m)})\,p^2_k(v^{(k)}).$

Assume $g$ is differentiable everywhere with derivative $\dot g$. Then LAN holds at $\theta = 0$ for the sequence of models $p^1_{m,\theta}$ on $W^{(m)} = (W_1, \dots, W_m)$, with $\Delta_{m,0} = -\frac{1}{\sqrt{m}}\sum_{i=1}^m \dot g(W_i)/g(W_i)$ and $r_m = \sqrt{m}$; and the same is inherited by the original model sequence $p_{n,\theta}$ with $r_n = \sqrt{m}$ and

  $\Delta_{n,0} = -\frac{1}{\sqrt{m}}\sum_{i=1}^m \frac{\dot g(W_i)}{g(W_i)}.$
We have earlier seen that the centered and scaled Mann-Whitney statistic $T_n = \sqrt{n}\,(U - \frac{1}{2})$ admits the asymptotic expansion (3) with $\theta = 0$. And hence if $m/n \to \lambda \in (0, 1)$, then the

assumption of Le Cam's third lemma holds for $p_n = p_{n,0}$ and $q_n = p_{n,h/\sqrt{n}} = p_{n,h_n/r_n}$, where $h_n = h\sqrt{m/n} \to h\sqrt{\lambda}$, with $\mu = 0$, $\Sigma = 1/\{12\lambda(1-\lambda)\}$ and

  $\tau = -h\,E\Big\{\frac{\dot g(W_1)}{g(W_1)}G(W_1) \,\Big|\, \theta = 0\Big\} = -h\int G(w)\,\dot g(w)\,dw = h\int g(w)^2\,dw$

by the integration by parts identity. And so $\sqrt{n}(U - 1/2) \overset{L}{\to} N(\tau, 1/\{12\lambda(1-\lambda)\})$ under $q_n$. Consequently,

  $\sqrt{n}\{U - \mu(h/\sqrt{n})\} = \sqrt{n}(U - 1/2) - \sqrt{n}\{\mu(h/\sqrt{n}) - \mu(0)\} \overset{L}{\to} N\Big(\tau - h\dot\mu(0),\; \frac{1}{12\lambda(1-\lambda)}\Big),$

but

  $\dot\mu(0) = \frac{\partial}{\partial\theta}\mu(\theta)\Big|_{\theta=0} = \frac{\partial}{\partial\theta}\Big[\int\{1 - G(w - \theta)\}\,g(w)\,dw\Big]_{\theta=0} = \int g(w)^2\,dw$

by interchanging differentiation and integration. Hence $\tau - h\dot\mu(0) = 0$ and $\sqrt{n}(U - \mu(h/\sqrt{n})) \overset{L}{\to} N(0, \frac{1}{12\lambda(1-\lambda)})$ under $q_n$. So the regularity condition (5) holds with $\mu(\theta) = P(V_1 \le W_1 \mid \theta)$ and $\sigma^2(\theta) = 1/\{12\lambda(1-\lambda)\}$. So the slope is

  $\dot\mu(0)/\sigma(0) = \sqrt{12\lambda(1-\lambda)}\int g(w)^2\,dw.$

3.4 Relative Efficiency of Mann-Whitney test against T-test

It is rather straightforward to show that the slope of the two sample t-test⁵ is $\sqrt{\lambda(1-\lambda)}/\sigma_g$, where $\sigma_g^2 = \mathrm{Var}(W_1 \mid g) = \int w^2 g(w)\,dw - \{\int w\,g(w)\,dw\}^2$. So the Pitman relative efficiency of the Mann-Whitney test against the two sample t-test is

  $12\,\sigma_g^2\,\Big\{\int g(w)^2\,dw\Big\}^2.$

Table 1 lists this efficiency for various choices of $g$. It can be shown that this number is never smaller than $108/125 \approx 86\%$, but it can be made arbitrarily large by considering a thick-tailed $g$.

  g(w)                              Relative efficiency
  --------------------------------  --------------------------------------
  Logistic                          $\pi^2/9 \approx 110\%$
  Normal                            $3/\pi \approx 95\%$
  Uniform                           $100\%$
  t(3)                              $\approx 190\%$
  t(5)                              $\approx 124\%$
  $\frac{3}{4}(1 - w^2)\,I(|w|<1)$  $108/125 \approx 86\%$ (lower bound)

  Table 1: Relative efficiency of the Mann-Whitney test against the two sample t-test for various choices of $g$.

4 Parametric Models, Efficiency Bounds and Optimality of ML tests

It turns out that under LAN conditions (and hence under the Cramér conditions, which imply LAN), there is a non-trivial bound on $\lim_n \pi_n(\theta_0 + h/r_n)$, and there are sequences of test procedures that attain these bounds. Here is a precise statement.

Theorem 4. Let $\Theta \subset \mathbb{R}^d$ and suppose the sequence of parametric models $(p_{n,\theta}: \theta \in \Theta)$, $n = 1, 2, \dots$, is LAN at $\theta = \theta_0$ with associated $r_n I_d$, $I_{\theta_0}$ and $\Delta_{n,\theta_0}$, such that $r_n \to \infty$. Suppose

⁵ Holds for either version: equal or unequal variance. See HW 2.

$\eta: \Theta \to \mathbb{R}$ is differentiable at $\theta_0$ with non-zero gradient $\nabla\eta(\theta_0)$ and $\eta(\theta_0) = 0$. Then the power function sequence $\pi_n(\theta)$ of any sequence of level-$\alpha$ tests for testing $H_0: \eta(\theta) \le 0$ vs. $H_1: \eta(\theta) > 0$ satisfies, for every $h$ such that $\nabla\eta(\theta_0)^T h > 0$,

  $\limsup_n \pi_n\Big(\theta_0 + \frac{h}{r_n}\Big) \le 1 - \Phi\Big(z_\alpha - \frac{\nabla\eta(\theta_0)^T h}{\sqrt{\nabla\eta(\theta_0)^T I_{\theta_0}^{-1}\nabla\eta(\theta_0)}}\Big)$

and the bound is attained by the tests: reject $H_0$ if $T_n \ge z_\alpha$, with

  $T_n = \frac{\nabla\eta(\theta_0)^T I_{\theta_0}^{-1}\Delta_{n,\theta_0}}{\sqrt{\nabla\eta(\theta_0)^T I_{\theta_0}^{-1}\nabla\eta(\theta_0)}}.$

A sketch of a proof. It can be established that (perhaps on some subsequence) $\pi_n(\theta_0 + h/r_n)$ converges to some $\pi(h)$. The LAN property then leads to the fact that⁶ $\pi(h)$ is the power function of some size-$\alpha$ procedure for testing $H_0: \nabla\eta(\theta_0)^T h \le 0$ vs. $H_1: \nabla\eta(\theta_0)^T h > 0$ in the model $(N_d(h, I_{\theta_0}^{-1}): h \in \mathbb{R}^d)$. But there is a size-$\alpha$ UMP test for this $H_0$ in this model, whose power function is precisely the bound displayed in the statement of the theorem. That the test "reject $H_0$ if $T_n \ge z_\alpha$" attains the bound can be shown by establishing

  $T_n \overset{L}{\to} N\big(\nabla\eta(\theta_0)^T h/\{\nabla\eta(\theta_0)^T I_{\theta_0}^{-1}\nabla\eta(\theta_0)\}^{1/2}, 1\big)$ under $\theta_0 + h/r_n$

by applications of Le Cam's third lemma and Slutsky's theorem.

⁶ This requires a powerful representation theorem, which we omit.
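In the simplest special case, $X_i$ iid $N(\theta, 1)$ with $\theta_0 = 0$ and $\eta(\theta) = \theta$, we have $I_{\theta_0} = 1$, $r_n = \sqrt{n}$ and $\Delta_{n,0} = \sqrt{n}\,\bar X$, so the optimal test of Theorem 4 rejects when $\sqrt{n}\,\bar X \ge z_\alpha$ and the bound equals $1 - \Phi(z_\alpha - h)$. Both the bound and its attainment can be checked by simulation (the sample sizes below are arbitrary choices):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
h, alpha, n, nrep = 2.0, 0.05, 400, 10000
z_a = norm.ppf(1 - alpha)

# Draw replicates under the local alternative theta = h / sqrt(n) and
# apply the test "reject if T_n = sqrt(n) * Xbar >= z_alpha".
X = rng.normal(loc=h / np.sqrt(n), scale=1.0, size=(nrep, n))
power_mc = (np.sqrt(n) * X.mean(axis=1) >= z_a).mean()
bound = 1 - norm.cdf(z_a - h)    # Theorem 4's upper bound, attained here

print(power_mc, bound)
```

In this Gaussian model the attainment is exact at every finite $n$ (not just in the limit), since $\sqrt{n}\,\bar X \sim N(h, 1)$ under $\theta = h/\sqrt{n}$, so the Monte Carlo power should match the bound up to simulation error.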


More information

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Jae-Kwang Kim Department of Statistics, Iowa State University Outline 1 Introduction 2 Observed likelihood 3 Mean Score

More information

Relative efficiency. Patrick Breheny. October 9. Theoretical framework Application to the two-group problem

Relative efficiency. Patrick Breheny. October 9. Theoretical framework Application to the two-group problem Relative efficiency Patrick Breheny October 9 Patrick Breheny STA 621: Nonparametric Statistics 1/15 Relative efficiency Suppose test 1 requires n 1 observations to obtain a certain power β, and that test

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Chapter 8 Maximum Likelihood Estimation 8. Consistency If X is a random variable (or vector) with density or mass function f θ (x) that depends on a parameter θ, then the function f θ (X) viewed as a function

More information

Problem Selected Scores

Problem Selected Scores Statistics Ph.D. Qualifying Exam: Part II November 20, 2010 Student Name: 1. Answer 8 out of 12 problems. Mark the problems you selected in the following table. Problem 1 2 3 4 5 6 7 8 9 10 11 12 Selected

More information

Introduction to Estimation Methods for Time Series models Lecture 2

Introduction to Estimation Methods for Time Series models Lecture 2 Introduction to Estimation Methods for Time Series models Lecture 2 Fulvio Corsi SNS Pisa Fulvio Corsi Introduction to Estimation () Methods for Time Series models Lecture 2 SNS Pisa 1 / 21 Estimators:

More information

ECON 4160, Autumn term Lecture 1

ECON 4160, Autumn term Lecture 1 ECON 4160, Autumn term 2017. Lecture 1 a) Maximum Likelihood based inference. b) The bivariate normal model Ragnar Nymoen University of Oslo 24 August 2017 1 / 54 Principles of inference I Ordinary least

More information

δ -method and M-estimation

δ -method and M-estimation Econ 2110, fall 2016, Part IVb Asymptotic Theory: δ -method and M-estimation Maximilian Kasy Department of Economics, Harvard University 1 / 40 Example Suppose we estimate the average effect of class size

More information

Asymmetric least squares estimation and testing

Asymmetric least squares estimation and testing Asymmetric least squares estimation and testing Whitney Newey and James Powell Princeton University and University of Wisconsin-Madison January 27, 2012 Outline ALS estimators Large sample properties Asymptotic

More information

parameter space Θ, depending only on X, such that Note: it is not θ that is random, but the set C(X).

parameter space Θ, depending only on X, such that Note: it is not θ that is random, but the set C(X). 4. Interval estimation The goal for interval estimation is to specify the accurary of an estimate. A 1 α confidence set for a parameter θ is a set C(X) in the parameter space Θ, depending only on X, such

More information

Graduate Econometrics I: Maximum Likelihood I

Graduate Econometrics I: Maximum Likelihood I Graduate Econometrics I: Maximum Likelihood I Yves Dominicy Université libre de Bruxelles Solvay Brussels School of Economics and Management ECARES Yves Dominicy Graduate Econometrics I: Maximum Likelihood

More information

Asymptotic Statistics-VI. Changliang Zou

Asymptotic Statistics-VI. Changliang Zou Asymptotic Statistics-VI Changliang Zou Kolmogorov-Smirnov distance Example (Kolmogorov-Smirnov confidence intervals) We know given α (0, 1), there is a well-defined d = d α,n such that, for any continuous

More information

Non-parametric Inference and Resampling

Non-parametric Inference and Resampling Non-parametric Inference and Resampling Exercises by David Wozabal (Last update. Juni 010) 1 Basic Facts about Rank and Order Statistics 1.1 10 students were asked about the amount of time they spend surfing

More information

Introduction to Statistical Inference

Introduction to Statistical Inference Introduction to Statistical Inference Ping Yu Department of Economics University of Hong Kong Ping Yu (HKU) Statistics 1 / 30 1 Point Estimation 2 Hypothesis Testing Ping Yu (HKU) Statistics 2 / 30 The

More information

Probability Theory and Statistics. Peter Jochumzen

Probability Theory and Statistics. Peter Jochumzen Probability Theory and Statistics Peter Jochumzen April 18, 2016 Contents 1 Probability Theory And Statistics 3 1.1 Experiment, Outcome and Event................................ 3 1.2 Probability............................................

More information

ECO Class 6 Nonparametric Econometrics

ECO Class 6 Nonparametric Econometrics ECO 523 - Class 6 Nonparametric Econometrics Carolina Caetano Contents 1 Nonparametric instrumental variable regression 1 2 Nonparametric Estimation of Average Treatment Effects 3 2.1 Asymptotic results................................

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

Statement: With my signature I confirm that the solutions are the product of my own work. Name: Signature:.

Statement: With my signature I confirm that the solutions are the product of my own work. Name: Signature:. MATHEMATICAL STATISTICS Homework assignment Instructions Please turn in the homework with this cover page. You do not need to edit the solutions. Just make sure the handwriting is legible. You may discuss

More information

Optimization. The value x is called a maximizer of f and is written argmax X f. g(λx + (1 λ)y) < λg(x) + (1 λ)g(y) 0 < λ < 1; x, y X.

Optimization. The value x is called a maximizer of f and is written argmax X f. g(λx + (1 λ)y) < λg(x) + (1 λ)g(y) 0 < λ < 1; x, y X. Optimization Background: Problem: given a function f(x) defined on X, find x such that f(x ) f(x) for all x X. The value x is called a maximizer of f and is written argmax X f. In general, argmax X f may

More information

36. Multisample U-statistics and jointly distributed U-statistics Lehmann 6.1

36. Multisample U-statistics and jointly distributed U-statistics Lehmann 6.1 36. Multisample U-statistics jointly distributed U-statistics Lehmann 6.1 In this topic, we generalize the idea of U-statistics in two different directions. First, we consider single U-statistics for situations

More information

Distribution-Free Procedures (Devore Chapter Fifteen)

Distribution-Free Procedures (Devore Chapter Fifteen) Distribution-Free Procedures (Devore Chapter Fifteen) MATH-5-01: Probability and Statistics II Spring 018 Contents 1 Nonparametric Hypothesis Tests 1 1.1 The Wilcoxon Rank Sum Test........... 1 1. Normal

More information

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3 Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

STAT 512 sp 2018 Summary Sheet

STAT 512 sp 2018 Summary Sheet STAT 5 sp 08 Summary Sheet Karl B. Gregory Spring 08. Transformations of a random variable Let X be a rv with support X and let g be a function mapping X to Y with inverse mapping g (A = {x X : g(x A}

More information

Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed

Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed 18.466 Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed 1. MLEs in exponential families Let f(x,θ) for x X and θ Θ be a likelihood function, that is, for present purposes,

More information

Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2

Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2 Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, 2010 Jeffreys priors Lecturer: Michael I. Jordan Scribe: Timothy Hunter 1 Priors for the multivariate Gaussian Consider a multivariate

More information

ENEE 621 SPRING 2016 DETECTION AND ESTIMATION THEORY THE PARAMETER ESTIMATION PROBLEM

ENEE 621 SPRING 2016 DETECTION AND ESTIMATION THEORY THE PARAMETER ESTIMATION PROBLEM c 2007-2016 by Armand M. Makowski 1 ENEE 621 SPRING 2016 DETECTION AND ESTIMATION THEORY THE PARAMETER ESTIMATION PROBLEM 1 The basic setting Throughout, p, q and k are positive integers. The setup With

More information

Chapter 1: A Brief Review of Maximum Likelihood, GMM, and Numerical Tools. Joan Llull. Microeconometrics IDEA PhD Program

Chapter 1: A Brief Review of Maximum Likelihood, GMM, and Numerical Tools. Joan Llull. Microeconometrics IDEA PhD Program Chapter 1: A Brief Review of Maximum Likelihood, GMM, and Numerical Tools Joan Llull Microeconometrics IDEA PhD Program Maximum Likelihood Chapter 1. A Brief Review of Maximum Likelihood, GMM, and Numerical

More information

Review and continuation from last week Properties of MLEs

Review and continuation from last week Properties of MLEs Review and continuation from last week Properties of MLEs As we have mentioned, MLEs have a nice intuitive property, and as we have seen, they have a certain equivariance property. We will see later that

More information

Estimation theory. Parametric estimation. Properties of estimators. Minimum variance estimator. Cramer-Rao bound. Maximum likelihood estimators

Estimation theory. Parametric estimation. Properties of estimators. Minimum variance estimator. Cramer-Rao bound. Maximum likelihood estimators Estimation theory Parametric estimation Properties of estimators Minimum variance estimator Cramer-Rao bound Maximum likelihood estimators Confidence intervals Bayesian estimation 1 Random Variables Let

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Chapter 7 Maximum Likelihood Estimation 7. Consistency If X is a random variable (or vector) with density or mass function f θ (x) that depends on a parameter θ, then the function f θ (X) viewed as a function

More information

Chapter 4: Asymptotic Properties of the MLE (Part 2)

Chapter 4: Asymptotic Properties of the MLE (Part 2) Chapter 4: Asymptotic Properties of the MLE (Part 2) Daniel O. Scharfstein 09/24/13 1 / 1 Example Let {(R i, X i ) : i = 1,..., n} be an i.i.d. sample of n random vectors (R, X ). Here R is a response

More information

Chapter 12 - Lecture 2 Inferences about regression coefficient

Chapter 12 - Lecture 2 Inferences about regression coefficient Chapter 12 - Lecture 2 Inferences about regression coefficient April 19th, 2010 Facts about slope Test Statistic Confidence interval Hypothesis testing Test using ANOVA Table Facts about slope In previous

More information

Advanced Statistics II: Non Parametric Tests

Advanced Statistics II: Non Parametric Tests Advanced Statistics II: Non Parametric Tests Aurélien Garivier ParisTech February 27, 2011 Outline Fitting a distribution Rank Tests for the comparison of two samples Two unrelated samples: Mann-Whitney

More information

Asymptotic Statistics-III. Changliang Zou

Asymptotic Statistics-III. Changliang Zou Asymptotic Statistics-III Changliang Zou The multivariate central limit theorem Theorem (Multivariate CLT for iid case) Let X i be iid random p-vectors with mean µ and and covariance matrix Σ. Then n (

More information

Theoretical Statistics. Lecture 23.

Theoretical Statistics. Lecture 23. Theoretical Statistics. Lecture 23. Peter Bartlett 1. Recall: QMD and local asymptotic normality. [vdv7] 2. Convergence of experiments, maximum likelihood. 3. Relative efficiency of tests. [vdv14] 1 Local

More information

STA 732: Inference. Notes 2. Neyman-Pearsonian Classical Hypothesis Testing B&D 4

STA 732: Inference. Notes 2. Neyman-Pearsonian Classical Hypothesis Testing B&D 4 STA 73: Inference Notes. Neyman-Pearsonian Classical Hypothesis Testing B&D 4 1 Testing as a rule Fisher s quantification of extremeness of observed evidence clearly lacked rigorous mathematical interpretation.

More information

Cherry Blossom run (1) The credit union Cherry Blossom Run is a 10 mile race that takes place every year in D.C. In 2009 there were participants

Cherry Blossom run (1) The credit union Cherry Blossom Run is a 10 mile race that takes place every year in D.C. In 2009 there were participants 18.650 Statistics for Applications Chapter 5: Parametric hypothesis testing 1/37 Cherry Blossom run (1) The credit union Cherry Blossom Run is a 10 mile race that takes place every year in D.C. In 2009

More information

BTRY 4090: Spring 2009 Theory of Statistics

BTRY 4090: Spring 2009 Theory of Statistics BTRY 4090: Spring 2009 Theory of Statistics Guozhang Wang September 25, 2010 1 Review of Probability We begin with a real example of using probability to solve computationally intensive (or infeasible)

More information

Method of Conditional Moments Based on Incomplete Data

Method of Conditional Moments Based on Incomplete Data , ISSN 0974-570X (Online, ISSN 0974-5718 (Print, Vol. 20; Issue No. 3; Year 2013, Copyright 2013 by CESER Publications Method of Conditional Moments Based on Incomplete Data Yan Lu 1 and Naisheng Wang

More information

Economics 583: Econometric Theory I A Primer on Asymptotics

Economics 583: Econometric Theory I A Primer on Asymptotics Economics 583: Econometric Theory I A Primer on Asymptotics Eric Zivot January 14, 2013 The two main concepts in asymptotic theory that we will use are Consistency Asymptotic Normality Intuition consistency:

More information

The Delta Method and Applications

The Delta Method and Applications Chapter 5 The Delta Method and Applications 5.1 Local linear approximations Suppose that a particular random sequence converges in distribution to a particular constant. The idea of using a first-order

More information

P n. This is called the law of large numbers but it comes in two forms: Strong and Weak.

P n. This is called the law of large numbers but it comes in two forms: Strong and Weak. Large Sample Theory Large Sample Theory is a name given to the search for approximations to the behaviour of statistical procedures which are derived by computing limits as the sample size, n, tends to

More information

Lecture 26: Likelihood ratio tests

Lecture 26: Likelihood ratio tests Lecture 26: Likelihood ratio tests Likelihood ratio When both H 0 and H 1 are simple (i.e., Θ 0 = {θ 0 } and Θ 1 = {θ 1 }), Theorem 6.1 applies and a UMP test rejects H 0 when f θ1 (X) f θ0 (X) > c 0 for

More information

Some New Aspects of Dose-Response Models with Applications to Multistage Models Having Parameters on the Boundary

Some New Aspects of Dose-Response Models with Applications to Multistage Models Having Parameters on the Boundary Some New Aspects of Dose-Response Models with Applications to Multistage Models Having Parameters on the Boundary Bimal Sinha Department of Mathematics & Statistics University of Maryland, Baltimore County,

More information

STAT 135 Lab 8 Hypothesis Testing Review, Mann-Whitney Test by Normal Approximation, and Wilcoxon Signed Rank Test.

STAT 135 Lab 8 Hypothesis Testing Review, Mann-Whitney Test by Normal Approximation, and Wilcoxon Signed Rank Test. STAT 135 Lab 8 Hypothesis Testing Review, Mann-Whitney Test by Normal Approximation, and Wilcoxon Signed Rank Test. Rebecca Barter March 30, 2015 Mann-Whitney Test Mann-Whitney Test Recall that the Mann-Whitney

More information

Mathematics Qualifying Examination January 2015 STAT Mathematical Statistics

Mathematics Qualifying Examination January 2015 STAT Mathematical Statistics Mathematics Qualifying Examination January 2015 STAT 52800 - Mathematical Statistics NOTE: Answer all questions completely and justify your derivations and steps. A calculator and statistical tables (normal,

More information

STA 732: Inference. Notes 10. Parameter Estimation from a Decision Theoretic Angle. Other resources

STA 732: Inference. Notes 10. Parameter Estimation from a Decision Theoretic Angle. Other resources STA 732: Inference Notes 10. Parameter Estimation from a Decision Theoretic Angle Other resources 1 Statistical rules, loss and risk We saw that a major focus of classical statistics is comparing various

More information

Lecture 3 September 1

Lecture 3 September 1 STAT 383C: Statistical Modeling I Fall 2016 Lecture 3 September 1 Lecturer: Purnamrita Sarkar Scribe: Giorgio Paulon, Carlos Zanini Disclaimer: These scribe notes have been slightly proofread and may have

More information

Statistics 135 Fall 2008 Final Exam

Statistics 135 Fall 2008 Final Exam Name: SID: Statistics 135 Fall 2008 Final Exam Show your work. The number of points each question is worth is shown at the beginning of the question. There are 10 problems. 1. [2] The normal equations

More information

Test Code: STA/STB (Short Answer Type) 2013 Junior Research Fellowship for Research Course in Statistics

Test Code: STA/STB (Short Answer Type) 2013 Junior Research Fellowship for Research Course in Statistics Test Code: STA/STB (Short Answer Type) 2013 Junior Research Fellowship for Research Course in Statistics The candidates for the research course in Statistics will have to take two shortanswer type tests

More information

Stat 451 Lecture Notes Monte Carlo Integration

Stat 451 Lecture Notes Monte Carlo Integration Stat 451 Lecture Notes 06 12 Monte Carlo Integration Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapter 6 in Givens & Hoeting, Chapter 23 in Lange, and Chapters 3 4 in Robert & Casella 2 Updated:

More information

Nonparametric tests. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 704: Data Analysis I

Nonparametric tests. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 704: Data Analysis I 1 / 16 Nonparametric tests Timothy Hanson Department of Statistics, University of South Carolina Stat 704: Data Analysis I Nonparametric one and two-sample tests 2 / 16 If data do not come from a normal

More information

5.2 Fisher information and the Cramer-Rao bound

5.2 Fisher information and the Cramer-Rao bound Stat 200: Introduction to Statistical Inference Autumn 208/9 Lecture 5: Maximum likelihood theory Lecturer: Art B. Owen October 9 Disclaimer: These notes have not been subjected to the usual scrutiny reserved

More information

Likelihood-Based Methods

Likelihood-Based Methods Likelihood-Based Methods Handbook of Spatial Statistics, Chapter 4 Susheela Singh September 22, 2016 OVERVIEW INTRODUCTION MAXIMUM LIKELIHOOD ESTIMATION (ML) RESTRICTED MAXIMUM LIKELIHOOD ESTIMATION (REML)

More information

Master s Written Examination - Solution

Master s Written Examination - Solution Master s Written Examination - Solution Spring 204 Problem Stat 40 Suppose X and X 2 have the joint pdf f X,X 2 (x, x 2 ) = 2e (x +x 2 ), 0 < x < x 2

More information

Peter Hoff Linear and multilinear models April 3, GLS for multivariate regression 5. 3 Covariance estimation for the GLM 8

Peter Hoff Linear and multilinear models April 3, GLS for multivariate regression 5. 3 Covariance estimation for the GLM 8 Contents 1 Linear model 1 2 GLS for multivariate regression 5 3 Covariance estimation for the GLM 8 4 Testing the GLH 11 A reference for some of this material can be found somewhere. 1 Linear model Recall

More information

Problem 1 (20) Log-normal. f(x) Cauchy

Problem 1 (20) Log-normal. f(x) Cauchy ORF 245. Rigollet Date: 11/21/2008 Problem 1 (20) f(x) f(x) 0.0 0.1 0.2 0.3 0.4 0.0 0.2 0.4 0.6 0.8 4 2 0 2 4 Normal (with mean -1) 4 2 0 2 4 Negative-exponential x x f(x) f(x) 0.0 0.1 0.2 0.3 0.4 0.5

More information

STAT 135 Lab 3 Asymptotic MLE and the Method of Moments

STAT 135 Lab 3 Asymptotic MLE and the Method of Moments STAT 135 Lab 3 Asymptotic MLE and the Method of Moments Rebecca Barter February 9, 2015 Maximum likelihood estimation (a reminder) Maximum likelihood estimation Suppose that we have a sample, X 1, X 2,...,

More information

INTERVAL ESTIMATION AND HYPOTHESES TESTING

INTERVAL ESTIMATION AND HYPOTHESES TESTING INTERVAL ESTIMATION AND HYPOTHESES TESTING 1. IDEA An interval rather than a point estimate is often of interest. Confidence intervals are thus important in empirical work. To construct interval estimates,

More information

For iid Y i the stronger conclusion holds; for our heuristics ignore differences between these notions.

For iid Y i the stronger conclusion holds; for our heuristics ignore differences between these notions. Large Sample Theory Study approximate behaviour of ˆθ by studying the function U. Notice U is sum of independent random variables. Theorem: If Y 1, Y 2,... are iid with mean µ then Yi n µ Called law of

More information

Theoretical Statistics. Lecture 22.

Theoretical Statistics. Lecture 22. Theoretical Statistics. Lecture 22. Peter Bartlett 1. Recall: Asymptotic testing. 2. Quadratic mean differentiability. 3. Local asymptotic normality. [vdv7] 1 Recall: Asymptotic testing Consider the asymptotics

More information

Lecture 21: October 19

Lecture 21: October 19 36-705: Intermediate Statistics Fall 2017 Lecturer: Siva Balakrishnan Lecture 21: October 19 21.1 Likelihood Ratio Test (LRT) To test composite versus composite hypotheses the general method is to use

More information

Chapter 10. U-statistics Statistical Functionals and V-Statistics

Chapter 10. U-statistics Statistical Functionals and V-Statistics Chapter 10 U-statistics When one is willing to assume the existence of a simple random sample X 1,..., X n, U- statistics generalize common notions of unbiased estimation such as the sample mean and the

More information

Statistical Inference

Statistical Inference Statistical Inference Robert L. Wolpert Institute of Statistics and Decision Sciences Duke University, Durham, NC, USA Week 12. Testing and Kullback-Leibler Divergence 1. Likelihood Ratios Let 1, 2, 2,...

More information

Inference For High Dimensional M-estimates: Fixed Design Results

Inference For High Dimensional M-estimates: Fixed Design Results Inference For High Dimensional M-estimates: Fixed Design Results Lihua Lei, Peter Bickel and Noureddine El Karoui Department of Statistics, UC Berkeley Berkeley-Stanford Econometrics Jamboree, 2017 1/49

More information

Probability. Table of contents

Probability. Table of contents Probability Table of contents 1. Important definitions 2. Distributions 3. Discrete distributions 4. Continuous distributions 5. The Normal distribution 6. Multivariate random variables 7. Other continuous

More information

Local Asymptotic Normality

Local Asymptotic Normality Chapter 8 Local Asymptotic Normality 8.1 LAN and Gaussian shift families N::efficiency.LAN LAN.defn In Chapter 3, pointwise Taylor series expansion gave quadratic approximations to to criterion functions

More information

Generalized nonparametric tests for one-sample location problem based on sub-samples

Generalized nonparametric tests for one-sample location problem based on sub-samples ProbStat Forum, Volume 5, October 212, Pages 112 123 ISSN 974-3235 ProbStat Forum is an e-journal. For details please visit www.probstat.org.in Generalized nonparametric tests for one-sample location problem

More information

Statistical Properties of Numerical Derivatives

Statistical Properties of Numerical Derivatives Statistical Properties of Numerical Derivatives Han Hong, Aprajit Mahajan, and Denis Nekipelov Stanford University and UC Berkeley November 2010 1 / 63 Motivation Introduction Many models have objective

More information

Inference For High Dimensional M-estimates. Fixed Design Results

Inference For High Dimensional M-estimates. Fixed Design Results : Fixed Design Results Lihua Lei Advisors: Peter J. Bickel, Michael I. Jordan joint work with Peter J. Bickel and Noureddine El Karoui Dec. 8, 2016 1/57 Table of Contents 1 Background 2 Main Results and

More information

Stat 5101 Notes: Algorithms

Stat 5101 Notes: Algorithms Stat 5101 Notes: Algorithms Charles J. Geyer January 22, 2016 Contents 1 Calculating an Expectation or a Probability 3 1.1 From a PMF........................... 3 1.2 From a PDF...........................

More information

Limiting Distributions

Limiting Distributions Limiting Distributions We introduce the mode of convergence for a sequence of random variables, and discuss the convergence in probability and in distribution. The concept of convergence leads us to the

More information

Classical Inference for Gaussian Linear Models

Classical Inference for Gaussian Linear Models Classical Inference for Gaussian Linear Models Surya Tokdar Gaussian linear model Data (z i, Y i ), i = 1,, n, dim(z i ) = p Model Y i = z T i Parameters: β R p, σ 2 > 0 Inference needed on η = a T β IID

More information

STAT 730 Chapter 4: Estimation

STAT 730 Chapter 4: Estimation STAT 730 Chapter 4: Estimation Timothy Hanson Department of Statistics, University of South Carolina Stat 730: Multivariate Analysis 1 / 23 The likelihood We have iid data, at least initially. Each datum

More information

1 Hypothesis Testing and Model Selection

1 Hypothesis Testing and Model Selection A Short Course on Bayesian Inference (based on An Introduction to Bayesian Analysis: Theory and Methods by Ghosh, Delampady and Samanta) Module 6: From Chapter 6 of GDS 1 Hypothesis Testing and Model Selection

More information

Some Theories about Backfitting Algorithm for Varying Coefficient Partially Linear Model

Some Theories about Backfitting Algorithm for Varying Coefficient Partially Linear Model Some Theories about Backfitting Algorithm for Varying Coefficient Partially Linear Model 1. Introduction Varying-coefficient partially linear model (Zhang, Lee, and Song, 2002; Xia, Zhang, and Tong, 2004;

More information

Review. December 4 th, Review

Review. December 4 th, Review December 4 th, 2017 Att. Final exam: Course evaluation Friday, 12/14/2018, 10:30am 12:30pm Gore Hall 115 Overview Week 2 Week 4 Week 7 Week 10 Week 12 Chapter 6: Statistics and Sampling Distributions Chapter

More information