Statistical Inference with Monotone Incomplete Multivariate Normal Data

This talk is based on joint work with my wonderful co-authors:

- Wan-Ying Chang (US Census Bureau)
- Megan Romer (Penn State University)
- Tomoya Yamada (Sapporo Gakuin University)

Background

We have a population of patients

We draw a random sample of $N$ patients, and measure $m$ variables on each patient:

1. Visual acuity
2. LDL (low-density lipoprotein) cholesterol
3. Systolic blood pressure
4. Glucose intolerance
5. Insulin response to oral glucose
6. Actual weight ÷ Expected weight
...
m. White blood cell count

We obtain data:

\[
\begin{array}{c|ccccc}
\text{Patient} & 1 & 2 & 3 & \cdots & N \\ \hline
 & v_{1,1} & v_{2,1} & v_{3,1} & \cdots & v_{N,1} \\
 & v_{1,2} & v_{2,2} & v_{3,2} & \cdots & v_{N,2} \\
 & \vdots & \vdots & \vdots & & \vdots \\
 & v_{1,m} & v_{2,m} & v_{3,m} & \cdots & v_{N,m}
\end{array}
\]

Vector notation: $V_1, V_2, \ldots, V_N$

$V_1$: The measurements on patient 1, stacked into a column; etc.

Classical multivariate analysis

Statistical analysis of $N$ $m$-dimensional data vectors

Common assumption: The population has a multivariate normal distribution

$V$: The vector of measurements on a randomly chosen patient

Multivariate normal populations are characterized by:

- $\mu$: The population mean vector
- $\Sigma$: The population covariance matrix

For a given data set, $\mu$ and $\Sigma$ are unknown

We wish to perform inference about $\mu$ and $\Sigma$

Construct confidence regions for, and test hypotheses about, $\mu$ and $\Sigma$

- Anderson (2003). An Introduction to Multivariate Statistical Analysis
- Eaton (1984). Multivariate Statistics: A Vector-Space Approach
- Johnson and Wichern (2002). Applied Multivariate Statistical Analysis
- Muirhead (1982). Aspects of Multivariate Statistical Theory

Standard notation: $V \sim N_m(\mu, \Sigma)$

The probability density function of $V$: For $v \in \mathbb{R}^m$,

\[
f(v) = (2\pi)^{-m/2}\, |\Sigma|^{-1/2} \exp\left( -\tfrac12 (v-\mu)' \Sigma^{-1} (v-\mu) \right)
\]

$V_1, V_2, \ldots, V_N$: Measurements on $N$ randomly chosen patients

Estimate $\mu$ and $\Sigma$ using Fisher's maximum likelihood principle

Likelihood function: $L(\mu,\Sigma) = \prod_{j=1}^N f(v_j)$

Maximum likelihood estimator: The value of $(\mu, \Sigma)$ that maximizes $L$

$\hat\mu = \frac{1}{N} \sum_{j=1}^N V_j$: The sample mean and MLE of $\mu$

$\hat\Sigma = \frac{1}{N} \sum_{j=1}^N (V_j - \bar V)(V_j - \bar V)'$: The MLE of $\Sigma$

What are the probability distributions of $\hat\mu$ and $\hat\Sigma$?

- $\hat\mu \sim N_m(\mu, \frac{1}{N}\Sigma)$
- Law of Large Numbers: $\hat\mu \to \mu$, a.s., as $N \to \infty$
- $N\hat\Sigma$ has a Wishart distribution, a generalization of the $\chi^2$
- $\hat\mu$ and $\hat\Sigma$ also are mutually independent
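As a small illustration of the two estimators above, here is a minimal NumPy sketch. The function name `complete_data_mle` and the simulated data are assumptions made for the example, not part of the talk.

```python
import numpy as np

def complete_data_mle(V):
    """Return (mu_hat, sigma_hat) for an N x m data matrix V (rows = patients)."""
    V = np.asarray(V, dtype=float)
    N = V.shape[0]
    mu_hat = V.mean(axis=0)                 # sample mean vector
    centered = V - mu_hat                   # subtract the mean from every row
    sigma_hat = centered.T @ centered / N   # divisor N (the MLE), not N - 1
    return mu_hat, sigma_hat

# Hypothetical data: 50 patients, 3 variables
rng = np.random.default_rng(0)
V = rng.normal(size=(50, 3))
mu_hat, sigma_hat = complete_data_mle(V)
```

Note the divisor $N$: the MLE differs from the usual unbiased sample covariance matrix by the factor $(N-1)/N$.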

Monotone incomplete data

Some patients were not measured completely

The resulting data set, with $*$ denoting a missing observation:

\[
\begin{array}{ccccc}
v_{1,1} & v_{1,2} & v_{1,3} & \cdots & v_{1,m} \\
* & v_{2,2} & v_{2,3} & \cdots & v_{2,m} \\
* & v_{3,2} & \cdots & \cdots & v_{3,m} \\
\vdots & \vdots & & & \vdots \\
* & * & \cdots & * & v_{N,m}
\end{array}
\]

Monotone data: Each $*$ is followed by $*$'s only

We may need to renumber patients to display the data in monotone form
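The renumbering step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: each patient is a list with `None` marking a missing value, the variables are already ordered so that a patient's missing entries come first in the row (as in the display above), and the helper names are hypothetical.

```python
def missing_count(row):
    """Number of missing (None) entries in one patient's record."""
    return sum(value is None for value in row)

def is_monotone(rows):
    """True if, in every row, all missing entries precede all observed ones."""
    for row in rows:
        k = missing_count(row)
        # If the k missing entries are not exactly the first k, the row breaks the pattern
        if any(value is not None for value in row[:k]):
            return False
    return True

def to_staircase(rows):
    """Renumber patients so that rows with fewer missing entries come first."""
    if not is_monotone(rows):
        raise ValueError("missingness pattern is not monotone")
    return sorted(rows, key=missing_count)

rows = [[None, 2.0, 3.0], [1.0, 2.5, 3.5], [None, None, 4.0]]
staircase = to_staircase(rows)
```

After sorting, reading down any column, every missing entry is followed by missing entries only.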

Physical Fitness Data

A well-known data set from a SAS manual on missing data

Patients: Men taking a physical fitness course at NCSU

Three variables were measured:

- Oxygen intake rate (ml. per kg. body weight per minute)
- RunTime (time taken, in minutes, to run 1.5 miles)
- RunPulse (heart rate while running)

[Data table with columns Oxygen, RunTime, RunPulse; $*$ denotes a missing observation]

Monotone data have a staircase pattern; we will consider the two-step pattern

Partition $V$ into an incomplete part of dimension $p$ and a complete part of dimension $q$:

\[
\begin{pmatrix} X_1 \\ Y_1 \end{pmatrix},
\begin{pmatrix} X_2 \\ Y_2 \end{pmatrix}, \ldots,
\begin{pmatrix} X_n \\ Y_n \end{pmatrix},
\begin{pmatrix} * \\ Y_{n+1} \end{pmatrix},
\begin{pmatrix} * \\ Y_{n+2} \end{pmatrix}, \ldots,
\begin{pmatrix} * \\ Y_N \end{pmatrix}
\]

Assume that the individual vectors are independent and are drawn from $N_m(\mu,\Sigma)$, $m = p + q$

Goal: Maximum likelihood inference for $\mu$ and $\Sigma$, with analytical results as extensive and as explicit as in the classical setting

Where do monotone incomplete data arise?

- Panel survey data (Census Bureau, Bureau of Labor Statistics)
- Astrophysics
- Early detection of diseases
- Wildlife survey research
- Covert communications
- Mental health research
- Climate and atmospheric studies
- ...

We have $n$ observations on $\begin{pmatrix} X \\ Y \end{pmatrix}$ and $N - n$ additional observations on $Y$

Difficulty: The likelihood function is more complicated

\[
\begin{aligned}
L &= \prod_{i=1}^{n} f_{X,Y}(x_i,y_i) \cdot \prod_{i=n+1}^{N} f_Y(y_i) \\
  &= \prod_{i=1}^{n} f_Y(y_i)\, f_{X|Y}(x_i) \cdot \prod_{i=n+1}^{N} f_Y(y_i) \\
  &= \prod_{i=1}^{N} f_Y(y_i) \cdot \prod_{i=1}^{n} f_{X|Y}(x_i)
\end{aligned}
\]

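Partition $\mu$ and $\Sigma$ similarly:

\[
\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}
\]

Let

\[
\mu_{1\cdot 2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(Y - \mu_2), \qquad
\Sigma_{11\cdot 2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}
\]

$Y \sim N_q(\mu_2, \Sigma_{22})$, $X \mid Y \sim N_p(\mu_{1\cdot 2}, \Sigma_{11\cdot 2})$

$\hat\mu$ and $\hat\Sigma$: Wilks, Anderson, Morrison, Olkin, Jinadasa, Tracy, ...

Anderson and Olkin (1985): An elegant derivation of $\hat\Sigma$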

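Sample means:

\[
\bar X = \frac{1}{n}\sum_{j=1}^{n} X_j, \qquad
\bar Y_1 = \frac{1}{n}\sum_{j=1}^{n} Y_j, \qquad
\bar Y_2 = \frac{1}{N-n}\sum_{j=n+1}^{N} Y_j, \qquad
\bar Y = \frac{1}{N}\sum_{j=1}^{N} Y_j
\]

Sample covariance matrices:

\[
\begin{aligned}
A_{11} &= \sum_{j=1}^{n} (X_j - \bar X)(X_j - \bar X)', &
A_{12} &= \sum_{j=1}^{n} (X_j - \bar X)(Y_j - \bar Y_1)', \\
A_{22,n} &= \sum_{j=1}^{n} (Y_j - \bar Y_1)(Y_j - \bar Y_1)', &
A_{22,N} &= \sum_{j=1}^{N} (Y_j - \bar Y)(Y_j - \bar Y)'
\end{aligned}
\]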

The MLEs of $\mu$ and $\Sigma$

Notation: $\tau = n/N$, $\bar\tau = 1 - \tau$

\[
\hat\mu_1 = \bar X - \bar\tau A_{12} A_{22,n}^{-1} (\bar Y_1 - \bar Y_2), \qquad
\hat\mu_2 = \bar Y
\]

$\hat\mu_1$ is called the regression estimator of $\mu_1$

In sample surveys, extra observations on a subset of variables are used to improve estimation of a parameter

$\hat\Sigma$ is more complicated:

\[
\begin{aligned}
\hat\Sigma_{11} &= \frac{1}{n}\left(A_{11} - A_{12}A_{22,n}^{-1}A_{21}\right) + \frac{1}{N} A_{12}A_{22,n}^{-1} A_{22,N} A_{22,n}^{-1} A_{21} \\
\hat\Sigma_{12} &= \frac{1}{N} A_{12}A_{22,n}^{-1} A_{22,N} \\
\hat\Sigma_{22} &= \frac{1}{N} A_{22,N}
\end{aligned}
\]
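The closed-form MLEs translate directly into code. The following is a sketch only, assuming NumPy, a hypothetical function name `two_step_mle`, and a data layout in which the first $n$ rows of the $Y$ array belong to the patients with observed $X$. It uses the identity $\bar\tau(\bar Y_1 - \bar Y_2) = \bar Y_1 - \bar Y$, which avoids an empty average when $n = N$.

```python
import numpy as np

def two_step_mle(X, Y):
    """MLEs of mu and Sigma from a two-step monotone sample.

    X: n x p array (observed only for the first n patients).
    Y: N x q array (observed for all N patients); rows 0..n-1 pair with X.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n = X.shape[0]
    N = Y.shape[0]

    x_bar = X.mean(axis=0)
    y1_bar = Y[:n].mean(axis=0)       # mean of Y over the complete cases
    y_bar = Y.mean(axis=0)            # mean of Y over all N patients

    Xc = X - x_bar
    Y1c = Y[:n] - y1_bar
    Yc = Y - y_bar
    A11 = Xc.T @ Xc
    A12 = Xc.T @ Y1c
    A22n = Y1c.T @ Y1c
    A22N = Yc.T @ Yc

    G = A12 @ np.linalg.inv(A22n)     # p x q matrix A12 A22n^{-1}
    # Regression estimator; equals x_bar - tau_bar * G (y1_bar - y2_bar)
    mu1 = x_bar - G @ (y1_bar - y_bar)
    mu_hat = np.concatenate([mu1, y_bar])

    S11 = (A11 - G @ A12.T) / n + G @ A22N @ G.T / N
    S12 = G @ A22N / N
    S22 = A22N / N
    sigma_hat = np.block([[S11, S12], [S12.T, S22]])
    return mu_hat, sigma_hat
```

When $n = N$ the formulas collapse to the complete-data MLEs, which gives a convenient sanity check.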

Seventy year-old unsolved problems

- Explicit confidence levels for elliptical confidence regions for $\mu$
- In testing hypotheses on $\mu$ or $\Sigma$, are the LRT statistics unbiased?
- Calculate the higher moments of the components of $\hat\mu$
- Determine the asymptotic behavior of $\hat\mu$ as $n$ or $N \to \infty$
- The Stein phenomenon for $\hat\mu$?

The crucial obstacle: The exact distribution of $\hat\mu$

The exact distribution of $\hat\mu$

Chang and D.R. (J. Multivariate Analysis, 2009): For $n > p + q$,

\[
\hat\mu \stackrel{\mathcal L}{=} \mu + V_1 + \left(\frac{1}{n} - \frac{1}{N}\right)^{1/2} \left(\frac{Q_2}{Q_1}\right)^{1/2} \begin{pmatrix} V_2 \\ 0 \end{pmatrix},
\]

where $V_1$, $V_2$, $Q_1$, and $Q_2$ are independent; $V_1 \sim N_{p+q}(0,\Omega)$, $V_2 \sim N_p(0,\Sigma_{11\cdot 2})$, $Q_1 \sim \chi^2_{n-q}$, $Q_2 \sim \chi^2_q$;

\[
\Omega = \frac{1}{N}\Sigma + \left(\frac{1}{n} - \frac{1}{N}\right) \begin{pmatrix} \Sigma_{11\cdot 2} & 0 \\ 0 & 0 \end{pmatrix}.
\]

Consequences: $\hat\mu$ is an unbiased estimator of $\mu$. Also, $\hat\mu_1$ and $\hat\mu_2$ are independent iff $\Sigma_{12} = 0$.

Romer and D.R. (2009): Explicit formulas for the $V$'s and $Q$'s

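Computation of the higher moments of $\hat\mu$ now is straightforward

Due to the term $1/Q_1$, even moments exist only up to order $n - q$

The covariance matrix of $\hat\mu$:

\[
\mathrm{Cov}(\hat\mu) = \frac{1}{N}\Sigma + \frac{(n-2)\bar\tau}{n(n-q-2)} \begin{pmatrix} \Sigma_{11\cdot 2} & 0 \\ 0 & 0 \end{pmatrix}
\]

Asymptotics for $\hat\mu$: If $n, N \to \infty$ with $N/n \to \delta \ge 1$ then

\[
\sqrt{N}(\hat\mu - \mu) \stackrel{\mathcal L}{\to} N_{p+q}\left(0,\ \Sigma + (\delta - 1) \begin{pmatrix} \Sigma_{11\cdot 2} & 0 \\ 0 & 0 \end{pmatrix}\right)
\]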
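To make the covariance formula concrete, here is a hedged NumPy sketch that evaluates it for a given $\Sigma$. The function name `cov_mu_hat` and its argument order are assumptions for the example. The check $n > q + 2$ reflects the fact that $E(1/Q_1)$ is finite only when $Q_1 \sim \chi^2_{n-q}$ has more than two degrees of freedom.

```python
import numpy as np

def cov_mu_hat(sigma, p, n, N):
    """Covariance matrix of mu_hat for a (p+q) x (p+q) Sigma; needs n > q + 2."""
    sigma = np.asarray(sigma, dtype=float)
    q = sigma.shape[0] - p
    if n <= q + 2:
        raise ValueError("the covariance requires n > q + 2")
    S11, S12, S22 = sigma[:p, :p], sigma[:p, p:], sigma[p:, p:]
    S11_2 = S11 - S12 @ np.linalg.solve(S22, S12.T)    # Sigma_{11.2}
    tau_bar = 1.0 - n / N
    extra = np.zeros_like(sigma)
    # The correction term only affects the block belonging to the incomplete part X
    extra[:p, :p] = (n - 2) * tau_bar / (n * (n - q - 2)) * S11_2
    return sigma / N + extra

C = cov_mu_hat(np.eye(3), p=1, n=10, N=20)
```

With complete data ($n = N$) the correction vanishes and the result is $\Sigma/N$, as in the classical setting.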

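The analog of Hotelling's $T^2$-statistic

\[
T^2 = (\hat\mu - \mu)'\, \widehat{\mathrm{Cov}}(\hat\mu)^{-1} (\hat\mu - \mu)
\]

where

\[
\widehat{\mathrm{Cov}}(\hat\mu) = \frac{1}{N}\hat\Sigma + \frac{(n-2)\bar\tau}{n(n-q-2)} \begin{pmatrix} \hat\Sigma_{11\cdot 2} & 0 \\ 0 & 0 \end{pmatrix}
\]

An obvious ellipsoidal confidence region for $\mu$ is

\[
\left\{ \nu \in \mathbb{R}^{p+q} : (\hat\mu - \nu)'\, \widehat{\mathrm{Cov}}(\hat\mu)^{-1} (\hat\mu - \nu) \le c \right\}
\]

What is the corresponding confidence level?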

Theorem: For $t \ge 0$, $P(T^2 \le t)$ is bounded above by

\[
P\left( F_{q,N-q} \le (q^{-1} - N^{-1})\, t \right)
\]

and bounded below by

P ( N 2 Q 2 ( qq 3 ) Nq( τ 1/2 Q 1/2 3 + τ 1/2 Q 1/2 ) ) 2 4 t, nq 1 Q 5 Q 5

where Q 1 χ 2 n p q, Q 2 χ 2 p, Q 3 χ 2 q, Q 4 χ 2 q, Q 5 χ 2 2, and $Q_1, \ldots, Q_5$ are mutually independent.

Romer (2009) has now derived the exact distribution of $T^2$

Shrinkage estimation for $\mu$ when $\Sigma$ is unknown
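The upper bound can be traced to the $Y$-block alone. The following is a sketch of one way to see it, assuming only the definitions of $\hat\mu_2$, $\hat\Sigma_{22}$, and $\widehat{\mathrm{Cov}}(\hat\mu)$ from the earlier slides together with the classical Hotelling $T^2$ result for the $N$ complete observations on $Y$; it is not taken from the talk itself.

```latex
% Partition z = \hat\mu - \mu = (z_1, z_2) and C = \widehat{\mathrm{Cov}}(\hat\mu).
% The correction term only touches the (1,1) block, so
% C_{22} = \hat\Sigma_{22}/N = A_{22,N}/N^2 and z_2 = \bar Y - \mu_2.
\begin{align*}
T^2 = z' C^{-1} z
  &= (z_1 - C_{12}C_{22}^{-1}z_2)'\, C_{11\cdot 2}^{-1}\, (z_1 - C_{12}C_{22}^{-1}z_2)
     + z_2' C_{22}^{-1} z_2 \\
  &\ge z_2' C_{22}^{-1} z_2
   = N^2 (\bar Y - \mu_2)' A_{22,N}^{-1} (\bar Y - \mu_2)
   \stackrel{\mathcal L}{=} \frac{Nq}{N-q}\, F_{q,N-q}.
\end{align*}
% Hence P(T^2 \le t) \le P\big( F_{q,N-q} \le (q^{-1} - N^{-1})\, t \big).
```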

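A decomposition of $\hat\Sigma$

Notation: $A_{11\cdot 2,n} := A_{11} - A_{12}A_{22,n}^{-1}A_{21}$

\[
n\hat\Sigma = \tau \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22,n} \end{pmatrix}
+ \bar\tau \begin{pmatrix} A_{11\cdot 2,n} & 0 \\ 0 & 0 \end{pmatrix}
+ \tau \begin{pmatrix} A_{12}A_{22,n}^{-1} & 0 \\ 0 & I_q \end{pmatrix}
\begin{pmatrix} B & B \\ B & B \end{pmatrix}
\begin{pmatrix} A_{22,n}^{-1}A_{21} & 0 \\ 0 & I_q \end{pmatrix}
\]

where

\[
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22,n} \end{pmatrix} \sim W_{p+q}(n-1, \Sigma)
\quad \text{and} \quad B \sim W_q(N-n, \Sigma_{22})
\]

are independent. Also, $N\hat\Sigma_{22} \sim W_q(N-1, \Sigma_{22})$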

\[
A_{22,N} = \sum_{j=1}^{n} (Y_j - \bar Y_1 + \bar Y_1 - \bar Y)(Y_j - \bar Y_1 + \bar Y_1 - \bar Y)'
+ \sum_{j=n+1}^{N} (Y_j - \bar Y_2 + \bar Y_2 - \bar Y)(Y_j - \bar Y_2 + \bar Y_2 - \bar Y)'
\]

\[
A_{22,N} = A_{22,n} + B, \qquad
B = \sum_{j=n+1}^{N} (Y_j - \bar Y_2)(Y_j - \bar Y_2)' + \frac{n(N-n)}{N} (\bar Y_1 - \bar Y_2)(\bar Y_1 - \bar Y_2)'
\]

Verify that the terms in the decomposition of $\hat\Sigma$ are independent
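The identity $A_{22,N} = A_{22,n} + B$ is easy to confirm numerically. A minimal sketch, assuming NumPy, simulated data, and hypothetical helper names `scatter` and `decomposition_terms`:

```python
import numpy as np

def scatter(Z):
    """Sum of outer products of the rows of Z about their mean."""
    Zc = Z - Z.mean(axis=0)
    return Zc.T @ Zc

def decomposition_terms(Y, n):
    """Return (A22_N, A22_n, B) for an N x q array Y whose first n rows are the complete cases."""
    Y = np.asarray(Y, dtype=float)
    N = Y.shape[0]
    d = Y[:n].mean(axis=0) - Y[n:].mean(axis=0)          # Ybar_1 - Ybar_2
    # Within-group scatter of the last N - n rows plus the between-group term
    B = scatter(Y[n:]) + n * (N - n) / N * np.outer(d, d)
    return scatter(Y), scatter(Y[:n]), B

rng = np.random.default_rng(2)
Y = rng.normal(size=(12, 2))
A22_N, A22_n, B = decomposition_terms(Y, n=5)
identity_holds = np.allclose(A22_N, A22_n + B)
```

This checks only the algebraic identity; the independence of the terms is a distributional statement and is not tested here.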

The marginal distribution of $\hat\Sigma_{11}$ is non-trivial

If $\Sigma_{12} = 0$ then $A_{22,n}$, $B$, $A_{11\cdot 2,n}$, $A_{12}A_{22,n}^{-1}A_{21}$, $\bar X$, $\bar Y_1$, and $\bar Y_2$ are independent

Matrix $F$-distribution:

\[
F^{(q)}_{a,b} = W_2^{-1/2}\, W_1\, W_2^{-1/2}
\]

where $W_1 \sim W_q(a, \Sigma_{22})$ and $W_2 \sim W_q(b, \Sigma_{22})$

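Theorem: Suppose that $\Sigma_{12} = 0$. Then

\[
\Sigma_{11}^{-1/2}\, \hat\Sigma_{11}\, \Sigma_{11}^{-1/2} \stackrel{\mathcal L}{=} \frac{1}{n} W_1 + \frac{1}{N} W_2^{1/2} (I_p + F) W_2^{1/2}
\]

where $W_1$, $W_2$, and $F$ are independent, and

\[
W_1 \sim W_p(n-q-1, I_p), \qquad W_2 \sim W_p(q, I_p), \qquad F \sim F^{(p)}_{N-n,\, n-q+p-1},
\]

and

\[
N \Sigma_{11}^{-1/2}\, \hat\Sigma_{11}\, \Sigma_{11}^{-1/2} \stackrel{\mathcal L}{=} \frac{N}{n} \Sigma_{11}^{-1/2} A_{11\cdot 2,n} \Sigma_{11}^{-1/2}
+ \Sigma_{11}^{-1/2} A_{12}A_{22,n}^{-1} (A_{22,n} + B) A_{22,n}^{-1} A_{21} \Sigma_{11}^{-1/2}
\]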

Theorem: With no assumptions on $\Sigma_{12}$,

\[
\hat\Sigma_{12}\hat\Sigma_{22}^{-1} \stackrel{\mathcal L}{=} \Sigma_{12}\Sigma_{22}^{-1} + \Sigma_{11\cdot 2}^{1/2}\, W^{-1/2} K\, \Sigma_{22}^{-1/2}
\]

where $W$ and $K$ are independent, and

\[
W \sim W_p(n-q+p-1, I_p), \qquad K \sim N_{pq}(0, I_p \otimes I_q)
\]

In particular, $\hat\Sigma_{12}\hat\Sigma_{22}^{-1}$ is an unbiased estimator of $\Sigma_{12}\Sigma_{22}^{-1}$

The general distribution of $\hat\Sigma$ requires the hypergeometric functions of matrix argument

Saddlepoint approximations

The distribution of $|\hat\Sigma|$ is much simpler:

\[
|\hat\Sigma| = |\hat\Sigma_{11\cdot 2}| \cdot |\hat\Sigma_{22}|
\]

$|\hat\Sigma_{11\cdot 2}|$ and $|\hat\Sigma_{22}|$ are independent; each is a product of independent $\chi^2$ variables

Hao and Krishnamoorthy (2001):

\[
|\hat\Sigma| \stackrel{\mathcal L}{=} n^{-p} N^{-q}\, |\Sigma| \cdot \prod_{j=1}^{p} \chi^2_{n-q-j} \cdot \prod_{j=1}^{q} \chi^2_{N-j}
\]

It now is plausible that tests of hypothesis on $|\Sigma|$ are unbiased

Testing $\Sigma = \Sigma_0$

Data: Two-step, monotone incomplete sample

$\Sigma_0$: A given, positive definite matrix

Test $H_0: \Sigma = \Sigma_0$ vs. $H_a: \Sigma \ne \Sigma_0$ (WLOG, $\Sigma_0 = I_{p+q}$)

Hao and Krishnamoorthy (2001): The LRT statistic is

\[
\lambda_1 \propto |A_{22,N}|^{N/2} \exp\left(-\tfrac12 \mathrm{tr}\, A_{22,N}\right)
\times |A_{11\cdot 2,n}|^{n/2} \exp\left(-\tfrac12 \mathrm{tr}\, A_{11\cdot 2,n}\right)
\times \exp\left(-\tfrac12 \mathrm{tr}\, A_{12}A_{22,n}^{-1}A_{21}\right).
\]

Is the LRT unbiased? If $C$ is a critical region of size $\alpha$, is

\[
P(\lambda_1 \in C \mid H_a) \ge P(\lambda_1 \in C \mid H_0)?
\]

Pitman (1939): Even with 1-d data, $\lambda_1$ is not unbiased

Bartlett: $\lambda_1$ becomes unbiased if sample sizes are replaced by degrees of freedom

With two-step monotone data, perhaps a similarly modified statistic, $\lambda_2$, is unbiased?

Answer: Still unknown.

Theorem: If $|\Sigma_{11}| < 1$ then $\lambda_2$ is unbiased

With monotone incomplete data, further modification is needed

Theorem: The modified LRT,

\[
\lambda_3 \propto |A_{22,N}|^{(N-1)/2} \exp\left(-\tfrac12 \mathrm{tr}\, A_{22,N}\right)
\times |A_{11\cdot 2,n}|^{(n-q-1)/2} \exp\left(-\tfrac12 \mathrm{tr}\, A_{11\cdot 2,n}\right)
\times |A_{12}A_{22,n}^{-1}A_{21}|^{q/2} \exp\left(-\tfrac12 \mathrm{tr}\, A_{12}A_{22,n}^{-1}A_{21}\right),
\]

is unbiased. Also, $\lambda_1$ is not unbiased

For diagonal $\Sigma = \mathrm{diag}(\sigma_{jj})$, the power function of $\lambda_3$ increases monotonically as any $|\sigma_{jj} - 1|$ increases, $j = 1, \ldots, p+q$.

With monotone two-step data, test

\[
H_0: (\mu,\Sigma) = (\mu_0,\Sigma_0) \quad \text{vs.} \quad H_a: (\mu,\Sigma) \ne (\mu_0,\Sigma_0)
\]

where $\mu_0$ and $\Sigma_0$ are given. The LRT statistic is

\[
\lambda_4 = \lambda_1 \exp\left(-\tfrac12 (n \bar X' \bar X + N \bar Y' \bar Y)\right)
\]

Remarkably, $\lambda_4$ is unbiased

The sphericity test, $H_0: \Sigma \propto I_{p+q}$ vs. $H_a: \Sigma \not\propto I_{p+q}$

The unbiasedness of the LRT statistic is an open problem

The Stein phenomenon for $\hat\mu$

$\hat\mu$: The mean of a complete sample from $N_m(\mu, I_m)$

Quadratic loss function: $L(\hat\mu,\mu) = \|\hat\mu - \mu\|^2$

Risk function: $R(\hat\mu) = E\, L(\hat\mu,\mu)$

C. Stein: $\hat\mu$ is inadmissible for $m \ge 3$

James-Stein estimator for shrinking $\hat\mu$ to $\nu \in \mathbb{R}^m$:

\[
\hat\mu_c = \left(1 - \frac{c}{\|\hat\mu - \nu\|^2}\right)(\hat\mu - \nu) + \nu
\]

Baranchik's positive-part shrinkage estimator:

\[
\hat\mu_c^+ = \left(1 - \frac{c}{\|\hat\mu - \nu\|^2}\right)_+ (\hat\mu - \nu) + \nu
\]
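Both shrinkage rules are one-liners once the squared distance to the target is computed. A minimal plain-Python sketch, with hypothetical function names; the constant `c` is left as an input because its admissible range depends on the setting, and the code assumes $\hat\mu \ne \nu$ so that the division is defined.

```python
def shrink_factor(mu_hat, nu, c):
    """Return 1 - c / ||mu_hat - nu||^2 (assumes mu_hat differs from nu)."""
    sq_norm = sum((m - v) ** 2 for m, v in zip(mu_hat, nu))
    return 1.0 - c / sq_norm

def james_stein(mu_hat, nu, c):
    """Shrink mu_hat toward nu; the factor may become negative."""
    f = shrink_factor(mu_hat, nu, c)
    return [f * (m - v) + v for m, v in zip(mu_hat, nu)]

def positive_part(mu_hat, nu, c):
    """Baranchik's variant: the shrinkage factor is truncated at zero."""
    f = max(0.0, shrink_factor(mu_hat, nu, c))
    return [f * (m - v) + v for m, v in zip(mu_hat, nu)]

mu_hat = [3.0, 4.0, 0.0]
nu = [0.0, 0.0, 0.0]
moderate = james_stein(mu_hat, nu, c=5.0)      # factor 1 - 5/25 = 0.8
overshoot = james_stein(mu_hat, nu, c=50.0)    # factor -1: flips past the target
truncated = positive_part(mu_hat, nu, c=50.0)  # factor clipped to 0: returns nu
```

The last two calls show why the positive part matters: with a large `c` the plain estimator overshoots the target $\nu$, while the truncated version stops at it.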

We collect a monotone incomplete sample from $N_{p+q}(\mu,\Sigma)$

Does the Stein phenomenon hold for $\hat\mu$, the MLE of $\mu$?

The phenomenon seems almost universal: It holds for many loss functions, inference problems, and distributions

Various results available on shrinkage estimation of $\Sigma$ with incomplete data, but no such results available for $\mu$

The crucial impediment: The distribution of $\hat\mu$ was unknown

Theorem (Yamada and D.R.): For $p \ge 2$, $n \ge q + 3$, and $\Sigma = I_{p+q}$, both $\hat\mu$ and $\hat\mu_c$ are inadmissible:

\[
R(\hat\mu) > R(\hat\mu_c) > R(\hat\mu_c^+)
\]

for all $\nu \in \mathbb{R}^{p+q}$ and all $c \in (0, 2c^*)$, where

\[
c^* = \frac{p-2}{n} + \frac{q}{N}.
\]

Non-radial loss functions: Replace $\|\hat\mu - \nu\|^2$ by non-radial functions of $\hat\mu - \nu$

Shrinkage to a random vector $\nu$, calculated from the data

Kurtosis tests for multivariate normality

$m$-dimensional complete, random sample: $V_1, \ldots, V_N$

Extensive literature on testing for multivariate normality

Mardia's statistic for testing for kurtosis:

\[
b_{2,m} = \sum_{j=1}^{N} \left[ (V_j - \bar V)' S^{-1} (V_j - \bar V) \right]^2
\]

Invariance under nonsingular affine transformations of the data

Asymptotic distribution of $b_{2,m}$
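A short NumPy sketch of the complete-data statistic follows. Two conventions here are assumptions rather than something the slide states: the sum is divided by $N$ (the usual normalization of Mardia's kurtosis, under which the value is near $m(m+2)$ for large normal samples), and $S$ is taken to be the covariance matrix with divisor $N$. The function name is hypothetical.

```python
import numpy as np

def mardia_kurtosis(V):
    """Mardia's kurtosis for an N x m data matrix V, using the 1/N normalization."""
    V = np.asarray(V, dtype=float)
    N = V.shape[0]
    centered = V - V.mean(axis=0)
    S = centered.T @ centered / N                 # assumed: covariance with divisor N
    # Squared Mahalanobis distance of each observation from the sample mean
    d2 = np.einsum("ij,jk,ik->i", centered, np.linalg.inv(S), centered)
    return np.mean(d2 ** 2)

rng = np.random.default_rng(3)
V = rng.normal(size=(200, 3))
b = mardia_kurtosis(V)

# Affine invariance: a nonsingular linear map plus a shift leaves the statistic unchanged
M = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 3.0], [1.0, 0.0, 1.0]])
b_transformed = mardia_kurtosis(V @ M.T + np.array([1.0, -2.0, 5.0]))
```

To match the slide's unnormalized sum, multiply the returned value by $N$.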

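Monotone incomplete data, i.i.d., unknown population:

\[
\begin{pmatrix} X_1 \\ Y_1 \end{pmatrix},
\begin{pmatrix} X_2 \\ Y_2 \end{pmatrix}, \ldots,
\begin{pmatrix} X_n \\ Y_n \end{pmatrix},
\begin{pmatrix} * \\ Y_{n+1} \end{pmatrix},
\begin{pmatrix} * \\ Y_{n+2} \end{pmatrix}, \ldots,
\begin{pmatrix} * \\ Y_N \end{pmatrix}
\]

A generalization of Mardia's statistic:

\[
\hat\beta = \sum_{j=1}^{n} \left[ \left( \begin{pmatrix} X_j \\ Y_j \end{pmatrix} - \hat\mu \right)' \hat\Sigma^{-1} \left( \begin{pmatrix} X_j \\ Y_j \end{pmatrix} - \hat\mu \right) \right]^2
+ \sum_{j=n+1}^{N} \left[ (Y_j - \hat\mu_2)'\, \hat\Sigma_{22}^{-1} (Y_j - \hat\mu_2) \right]^2
\]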

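An alternative to $\hat\beta$

Impute each missing $X_j$ using linear regression:

\[
\hat X_j = \begin{cases}
X_j, & 1 \le j \le n \\
\hat\mu_1 + \hat\Sigma_{12}\hat\Sigma_{22}^{-1}(Y_j - \hat\mu_2), & n+1 \le j \le N
\end{cases}
\]

Construct

\[
\tilde\beta = \sum_{j=1}^{N} \left[ \left( \begin{pmatrix} \hat X_j \\ Y_j \end{pmatrix} - \hat\mu \right)' \hat\Sigma^{-1} \left( \begin{pmatrix} \hat X_j \\ Y_j \end{pmatrix} - \hat\mu \right) \right]^2
\]

A remarkable result: $\tilde\beta \equiv \hat\beta$

$\hat\beta$ is invariant under nonsingular affine transformations of the data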
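One way to see why the imputed statistic coincides with the original one is the standard block decomposition of a quadratic form. The following sketch uses the shorthand $u_j$, $w_j$, which is introduced here for illustration and does not appear in the talk.

```latex
% For j > n write u_j = \hat X_j - \hat\mu_1 and w_j = Y_j - \hat\mu_2.
% By construction of the imputed value, u_j = \hat\Sigma_{12}\hat\Sigma_{22}^{-1} w_j.
\begin{align*}
\begin{pmatrix} u_j \\ w_j \end{pmatrix}' \hat\Sigma^{-1} \begin{pmatrix} u_j \\ w_j \end{pmatrix}
  &= (u_j - \hat\Sigma_{12}\hat\Sigma_{22}^{-1} w_j)'\, \hat\Sigma_{11\cdot 2}^{-1}\,
     (u_j - \hat\Sigma_{12}\hat\Sigma_{22}^{-1} w_j)
     + w_j' \hat\Sigma_{22}^{-1} w_j \\
  &= 0 + (Y_j - \hat\mu_2)'\, \hat\Sigma_{22}^{-1} (Y_j - \hat\mu_2).
\end{align*}
% So every term with j > n in the imputed statistic equals the corresponding
% term of \hat\beta, while the terms with j \le n are identical by definition.
```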

Yamada, Romer, and D.R. (2010): Under certain regularity conditions,

\[
(\hat\beta - c_1)/c_2 \stackrel{\mathcal L}{\to} N(0,1)
\]

as $n, N \to \infty$

The constants $c_1, c_2$ depend on $n$, $N$ and the underlying population distribution

In the normal case, $c_1, c_2$ depend only on $n, N, p, q$

References

- Chang and D.R. (2009). Finite-sample inference with monotone incomplete multivariate normal data, I. J. Multivariate Analysis.
- Chang and D.R. (2009). Finite-sample inference with monotone incomplete multivariate normal data, II. J. Multivariate Analysis.
- D.R. and Yamada (2010). The Stein phenomenon for monotone incomplete multivariate normal data. J. Multivariate Analysis.
- Yamada (2009). The asymptotic expansion of the distribution of the canonical correlations with monotone incomplete multivariate normal data. Preprint, Sapporo Gakuin University.
- Romer (2009). The Statistical Analysis of Monotone Incomplete Multivariate Normal Data. Doctoral Dissertation, Penn State University.
- Romer and D.R. (2010). Maximum likelihood estimation of the mean of a multivariate normal population with monotone incomplete data. Preprint, Penn State University.
- Yamada, Romer, and D.R. (2010). Kurtosis tests for multivariate normality with monotone incomplete data. Preprint, Penn State University.

Astrostatistics research problems

K. R. Lang, Astrophysical Formulae, Vol. II: Space, Time, Matter and Cosmology, 3rd ed., Springer, 2006

- Numerous monotone incomplete data sets
- Is it true that astrophysicists often discard incomplete data?
- Incomplete longitudinal data (light curves, luminosity data)
- Incomplete time series
- Small-sample distributions of test statistics, e.g., Mardia's statistic, often are unexplored even with complete data
- How to relax the MCAR assumption to MAR?

A. Izenman, Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning, Springer, 2008

COMBO-17 Survey

Apply classical multivariate statistical procedures (principal components, MANOVA, ...) to the COMBO-17 survey

I hear that some variables in the survey are not of scientific interest, e.g., the absence of high-redshift (i.e. distant) high-absolute-magnitude (i.e. faint) galaxies, the dropoff in flux with redshift, the dropoff in image size with redshift, ...

- Carry out a statistical analysis of the variables which are not of scientific interest
- Discover amazing results that put you in the NY Times and
- Get an invitation to Stockholm
- Send me 10% of the prize money (I'm a reasonable guy)
