A General Framework for High-Dimensional Inference and Multiple Testing


1. A General Framework for High-Dimensional Inference and Multiple Testing
Yang Ning, Department of Statistical Science. Joint work with Han Liu.

2. Overview
Goal: control false scientific discoveries in high-dimensional data analysis.
Challenge: obtaining valid p-values for hypothesis testing in high-dimensional data.
Proposed method: post-regularization inference for high-dimensional data (decorrelated score, Wald, and likelihood ratio tests), whose p-values feed into multiple testing.

3. A Simulation Study
Model: $Y = X\beta + \epsilon$, where $Y$ collects $n$ samples, $X$ has $d$ covariates, and $\beta$ is $s$-sparse with $\beta_1 = 0$. Here $n = 100$, $d = 500$, and $s = 4$.

4. Empirical Null Distribution of P-values
$H_0: \beta_1 = 0$ vs $H_1: \beta_1 \neq 0$.
(1) Perform variable selection by the Lasso.
(2) Compute the least squares estimator after model selection.
[Histogram of the resulting null p-values.] Not uniformly distributed: the p-value is wrong!
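The distortion above is easy to reproduce. Below is a minimal sketch, not the authors' code: it simulates the slide's setting ($n = 100$, $d = 500$, $s = 4$), selects variables with a cross-validated Lasso, refits by OLS, and records the naive p-value for the null coefficient $\beta_1$. The use of `LassoCV` and `statsmodels` is an illustrative assumption.

```python
# A minimal sketch reproducing the distorted null p-values:
# Lasso selection followed by naive OLS inference on beta_1.
import numpy as np
from sklearn.linear_model import LassoCV
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, d, s = 100, 500, 4
pvals = []
for _ in range(200):
    X = rng.standard_normal((n, d))
    beta = np.zeros(d)
    beta[1:s + 1] = 1.0                 # signals; beta_1 (index 0) is null
    y = X @ beta + rng.standard_normal(n)
    sel = np.flatnonzero(LassoCV(cv=5).fit(X, y).coef_)
    sel = np.union1d(sel, [0])          # keep the tested coefficient in the refit
    ols = sm.OLS(y, sm.add_constant(X[:, sel])).fit()
    pvals.append(ols.pvalues[1 + int(np.searchsorted(sel, 0))])
# A histogram of `pvals` is far from Uniform(0,1): selection distorts inference.
```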

5. False Discovery Rate (FDR)
$H_{0j}: \beta_j = 0$ vs $H_{1j}: \beta_j \neq 0$, for $1 \le j \le d$.
$$\mathrm{FDR} = E\left[\frac{\#\,\text{false discoveries}}{\#\,\text{total discoveries}}\right]$$
Goal: control the FDR at no greater than a given value (e.g. 0.1). FDR control is well studied under independence, e.g. Benjamini and Hochberg (1995), and has wide applications in scientific studies.
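The Benjamini-Hochberg step-up procedure mentioned on the slide is simple to state in code; here is a minimal sketch (the helper name `benjamini_hochberg` is ours, not from the talk).

```python
# A minimal sketch of the Benjamini-Hochberg (1995) step-up procedure,
# which controls the FDR at level alpha for independent p-values.
import numpy as np

def benjamini_hochberg(pvals, alpha=0.1):
    """Return a boolean mask of rejected hypotheses."""
    p = np.asarray(pvals)
    d = p.size
    order = np.argsort(p)
    # find the largest k with p_(k) <= alpha * k / d
    below = p[order] <= alpha * np.arange(1, d + 1) / d
    k = below.nonzero()[0].max() + 1 if below.any() else 0
    reject = np.zeros(d, dtype=bool)
    reject[order[:k]] = True
    return reject
```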

6-7. Empirical FDR
Setting: $n = 500$, $d = 500$, and $s = 50$.
[Plot: empirical FDR against the target FDR.] The empirical FDR far exceeds the target: most discoveries are false!

8. Challenges
What do we learn from the numerical studies?
- P-values are distorted by variable selection, so we must account for variable selection/regularization.
- Multiple hypothesis tests are dependent, so we must carefully design the test statistics to reduce dependence.

9. Problem Setup
Data: $X_1, \dots, X_n$, i.i.d. copies of $X \sim f(x; \beta)$.
Partition $\beta = (\theta, \gamma)$: the parameter of interest $\theta \in \mathbb{R}$ and the nuisance parameter $\gamma \in \mathbb{R}^{d-1}$; the model is high-dimensional with $d \gg n$.
Goal: test the hypothesis $H_0: \theta = 0$ vs $H_1: \theta \neq 0$.

10. Existing Work
- Sample splitting: Meinshausen & Bühlmann; Shah & Samworth
- Bootstrap and resampling: Chatterjee & Lahiri; Zhou & Min; Zhang & Cheng; Dezeure et al.; McKeague & Qian
- Selective (conditional) inference: Lockhart et al.; Lee & Taylor; Fithian et al.; Tian & Taylor
- Debiased (desparsifying) estimators: Zhang & Zhang; van de Geer et al.; Javanmard & Montanari; Belloni et al.; Athey et al.; Cai & Guo; Zhu & Bradic
- FDR control: Fan et al.; Barber & Candès; G'Sell et al.; Liu; Ramdas et al.; Candès et al.

11-14. Likelihood-based Inference
Log-likelihood: $L(\beta) = n^{-1}\sum_{i=1}^n \log f(X_i; \beta)$.
Low dimension ($d \ll n$):
- Profile log-likelihood: $L(\theta, \hat\gamma_\theta)$, where $\hat\gamma_\theta = \arg\max_{\gamma \in \mathbb{R}^{d-1}} L(\theta, \gamma)$
- Profile score: $\nabla_\theta L(\theta, \hat\gamma_\theta)$
- MLE: $\hat\theta = \arg\max_{\theta \in \mathbb{R}} L(\theta, \hat\gamma_\theta)$
High dimension ($d \gg n$), with $\hat\gamma$ and $\hat w$ obtained by regularized estimation (next slide):
- Decorrelated log-likelihood: $L_D(\theta) = L(\theta, \hat\gamma - (\theta - \hat\theta)\hat w)$
- Decorrelated score: $S_D(\theta) = \nabla_\theta L(\theta, \hat\gamma) - \hat w^\top \nabla_\gamma L(\theta, \hat\gamma)$
- M-estimator: $\hat\theta_D = \arg\max_{\theta \in \mathbb{R}} L_D(\theta)$
The decorrelated construction works for both high and low dimensions.

15-18. A Framework for HD Inference
1. Calculate the initial estimator $\hat\beta = \arg\max_{\beta \in \mathbb{R}^d}\{L(\beta) - P_\lambda(\beta)\}$, where $P_\lambda$ is a generic penalty function with tuning parameter $\lambda \ge 0$.
2. Calculate $\hat w = \arg\min \|w\|_1$ s.t. $\|\nabla^2_{\theta\gamma} L(\hat\beta) - w^\top \nabla^2_{\gamma\gamma} L(\hat\beta)\|_\infty \le \mu$ ($\mu$ a tuning parameter). Here $\hat w$ is an estimate of Stein's least favorable direction (1956).
3. Form test statistics for $H_0: \theta = 0$ vs $H_1: \theta \neq 0$ by plugging $\hat w$ into $L_D(\theta) = L(\theta, \hat\gamma - (\theta - \hat\theta)\hat w)$ and $S_D(\theta) = \nabla_\theta L(\theta, \hat\gamma) - \hat w^\top \nabla_\gamma L(\theta, \hat\gamma)$:
- LR test: $2n[L_D(\hat\theta_D) - L_D(0)] \to_d \chi^2_1$
- Score test: $n^{1/2} S_D(0) \to_d$ Normal
- Wald test: $n^{1/2}\, \hat\theta_D \to_d$ Normal
(A worked sketch for the linear model follows.)
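To make the three steps concrete, here is a minimal sketch for the Gaussian linear model, testing $H_0: \theta = 0$ for the first coefficient. It is not the authors' implementation: the Dantzig-type program in step 2 is replaced by a nodewise Lasso of $Z$ on $U$ (a common surrogate), and the tuning parameters `lam` and `mu` are fixed constants chosen for illustration rather than of order $\sqrt{\log d/n}$.

```python
# A minimal sketch of the three-step decorrelated score test for the
# Gaussian linear model, with theta the coefficient of Z = X[:, 0].
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import Lasso

def decorrelated_score_pvalue(X, y, lam=0.1, mu=0.1):
    n, d = X.shape
    Z, U = X[:, 0], X[:, 1:]
    # Step 1: penalized initial estimator beta-hat = (theta-hat, gamma-hat)
    gamma = Lasso(alpha=lam).fit(X, y).coef_[1:]
    # Step 2: w-hat from a nodewise Lasso regression of Z on U
    w = Lasso(alpha=mu).fit(U, Z).coef_
    # Step 3: decorrelated score at theta = 0
    resid = y - U @ gamma                    # residual under H0: theta = 0
    zres = Z - U @ w                         # decorrelated covariate
    S = np.mean(resid * zres)                # S_D(0)
    sd = np.sqrt(np.mean((resid * zres) ** 2) / n)
    return 2 * norm.sf(abs(S / sd))          # two-sided p-value
```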

19. A Brief History on Nuisance Parameters
- Neyman's C($\alpha$) test (Neyman, 1959)
- Efficient score function (Bickel et al., 1993)
- Projected score function (Lindsay, 1995)
- Locally ancillary estimating equations (McLeish and Small, 1988)
- Doubly robust estimators (Robins et al., 1994; Scharfstein et al., 1999)

20. Example: Generalized Linear Model
Model: $f(y \mid X; \beta) = h(y)\exp\{y\,\beta^\top X - b(\beta^\top X)\}$, with $\beta = (\theta, \gamma)$ and $X = (Z, U)$.
1. Compute the initial estimator: $\hat\beta = \arg\max$ (penalized log-likelihood).
2. Compute $\hat w = \arg\min \|w\|_1$ s.t. $\|n^{-1}\sum_{i=1}^n b''(\hat\beta^\top X_i)(Z_i - w^\top U_i)U_i\|_\infty \le \mu$.
3. Compute the decorrelated score function $S_D(\theta) = n^{-1}\sum_{i=1}^n \{Y_i - b'(\theta Z_i + \hat\gamma^\top U_i)\}(Z_i - \hat w^\top U_i)$.
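The same recipe for logistic regression, where $b'(t) = 1/(1+e^{-t})$ and $b''(t) = b'(t)(1-b'(t))$. A minimal sketch under illustrative assumptions: step 2 uses a $b''$-weighted Lasso as a surrogate for the constrained program, and `logistic_decorrelated_score` is a hypothetical helper, not the authors' code.

```python
# Decorrelated score z-statistic for l1-penalized logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression, Lasso

def logistic_decorrelated_score(X, y, C=1.0, mu=0.1):
    n = X.shape[0]
    Z, U = X[:, 0], X[:, 1:]
    # Step 1: l1-penalized logistic regression as the initial estimator
    fit = LogisticRegression(penalty="l1", C=C, solver="liblinear",
                             fit_intercept=False).fit(X, y)
    beta = fit.coef_.ravel()
    gamma = beta[1:]
    # Step 2: b''-weighted nodewise Lasso surrogate for w-hat
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    w2 = p * (1 - p)                          # b''(beta-hat^T x_i)
    sw = np.sqrt(w2)
    w = Lasso(alpha=mu).fit(U * sw[:, None], Z * sw).coef_
    # Step 3: decorrelated score at theta = 0
    mean0 = 1.0 / (1.0 + np.exp(-(U @ gamma)))  # b'(0*Z_i + gamma-hat^T U_i)
    zres = Z - U @ w
    S = np.mean((y - mean0) * zres)
    sd = np.sqrt(np.mean(w2 * zres ** 2) / n)
    return S / sd                             # approximately N(0,1) under H0
```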

21-23. Main Theorem for GLM
Theorem [Uniform normal approximation]. For linear models with sub-Gaussian noise, assume the covariate is sub-Gaussian with a nonsingular covariance matrix. Then, under $H_0$,
$$\sup_{\|\beta\|_0 \le s,\ \|w\|_0 \le s'}\ \sup_{t \in \mathbb{R}}\ \left| P\left(\frac{\sqrt{n}\, S_D(0)}{\sigma} \le t\right) - \Phi(t) \right| \lesssim \frac{(s \vee s')\log d}{\sqrt{n}}.$$
The convergence is uniform over $\beta$ and $w$ in a sparse set; $\sigma$ is the standard deviation of $\epsilon_i(Z_i - w^\top U_i)$; the rate $(s \vee s')\log d/\sqrt{n}$ determines the quality of the normal approximation.
Remarks:
1. The tuning parameters satisfy $\lambda \asymp \mu \asymp \sqrt{\log d/n}$.
2. A noncentral normal approximation holds under the local alternative $H_{1n}: \theta = Cn^{-1/2}$.
3. The same results hold for most GLMs.
4. The estimator attains the information lower bound.

24. Local Asymptotic Power for GLM
Theorem [Uniform normal approximation under the alternative]. For linear models with sub-Gaussian noise, assume the covariate is sub-Gaussian with a nonsingular covariance matrix. Then, under the local alternative hypothesis,
$$\sup_{\beta \in \Theta_1(C),\ \|w\|_0 \le s'}\ \sup_{t \in \mathbb{R}}\ \left| P\left(\frac{\sqrt{n}\, S_D(0)}{\sigma} \le t\right) - \Phi(t + \sigma C) \right| \lesssim \frac{(s \vee s')\log d}{\sqrt{n}},$$
a noncentral normal distribution, where $\Theta_1(C) = \{(\theta, \gamma): \theta = Cn^{-1/2},\ \|\gamma\|_0 \le s\}$.

25. Data-Driven Tuning Parameters
Assume the tuning parameters $\lambda$ and $\mu$ are chosen from a grid $M = \{a_1, \dots, a_M\}\sqrt{\log d/n}$.
Theorem [Normal approximation for data-driven tuning parameters]. Under the same conditions, for any data-dependent tuning parameters $\hat\lambda, \hat\mu: \mathcal{X} \to M$, it holds that
$$\sup_{\|\beta\|_0 \le s,\ \|w\|_0 \le s'}\ \sup_{t \in \mathbb{R}}\ \left| P\left(\frac{\sqrt{n}\, S_{D,\hat\lambda,\hat\mu}(0)}{\sigma} \le t\right) - \Phi(t) \right| = o(1).$$

26. Other Examples
- Graphical models (GGM)
- Classification (LDA)
- Survival analysis (AH/PH models)

27. Gaussian Graphical Model
Model: $X = (X_1, \dots, X_d) \sim N(0, \Sigma)$, with parameter of interest $\theta = (\Sigma^{-1})_{jk}$ and $\beta = (\Sigma^{-1})_{\cdot k}$, the $k$-th column of the precision matrix.
1. Compute the initial estimator $\hat\beta = \arg\min_\beta\{\tfrac{1}{2}\beta^\top \hat\Sigma \beta - e_k^\top \beta + \lambda\|\beta\|_1\}$.
2. Compute $\hat w = \arg\min \|w\|_1$ s.t. $\|\hat\Sigma_{12} - w^\top \hat\Sigma_{22}\|_\infty \le \lambda'$.
3. Compute the decorrelated score function $S_D(\theta) = (1, -\hat w^\top)(\hat\Sigma\hat\beta - e_k)$, where $\hat\beta = (\theta, \hat\beta_{-1})$.
The construction can be extended to cluster-based graphical models (Bunea et al.).
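A numpy sketch of this construction, assuming the tested entry is $\Omega_{0k}$ for some $k \neq 0$ (coordinates reordered so the entry of interest sits in row 0). Both $\ell_1$ problems are solved by plain proximal gradient (ISTA), and step 2 uses the Lagrangian Lasso surrogate of the Dantzig-type constraint; these are illustrative choices, not the authors' code.

```python
import numpy as np

def l1_quadratic(Sig, b, lam, iters=500):
    """Minimize 0.5 x^T Sig x - b^T x + lam * ||x||_1 by ISTA."""
    step = 1.0 / np.linalg.eigvalsh(Sig)[-1]   # 1 / Lipschitz constant
    x = np.zeros(b.size)
    for _ in range(iters):
        z = x - step * (Sig @ x - b)           # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold
    return x

def ggm_decorrelated_score(X, k, lam=0.1, lam2=0.1):
    n, d = X.shape
    Sig = X.T @ X / n
    e_k = np.zeros(d)
    e_k[k] = 1.0
    beta = l1_quadratic(Sig, e_k, lam)               # estimates Omega e_k
    w = l1_quadratic(Sig[1:, 1:], Sig[1:, 0], lam2)  # nodewise surrogate for w-hat
    beta0 = beta.copy()
    beta0[0] = 0.0                                   # impose theta = 0 under H0
    v = np.concatenate(([1.0], -w))
    return v @ (Sig @ beta0 - e_k)  # S_D(0); standardize by sqrt(n)/sigma-hat to test
```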

28. Inference on Misspecified Models
Working model: $f(x; \theta, \gamma)$, which need not contain the true density $f_o$. The target $(\theta_o, \gamma_o)$ minimizes the KL divergence between $f_o$ and the working model. Then
$$n^{1/2}(\hat\theta_D - \theta_o) \to_d N(0, A^{-1}BA^{-1}),$$
the sandwich variance.
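For reference, a generic plug-in sketch of the sandwich variance $A^{-1}BA^{-1}$: given per-observation score contributions and an averaged Hessian from any fitted working model, it estimates the asymptotic variance of the estimator. This is the standard misspecification-robust formula, not code specific to $\hat\theta_D$.

```python
import numpy as np

def sandwich_variance(scores, hessian):
    """scores: (n, p) per-observation score vectors; hessian: (p, p) average Hessian A."""
    n = scores.shape[0]
    B = scores.T @ scores / n          # empirical second moment of the score
    A_inv = np.linalg.inv(hessian)
    return A_inv @ B @ A_inv / n       # estimated covariance of the estimator
```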

29-31. FDR Control
1. Calculate p-values $p_j$ for $H_{0j}: \beta_j = 0$ vs $H_{1j}: \beta_j \neq 0$, for $1 \le j \le d$.
2. Bound the FDR for a cutoff $u \in (0, 1)$:
$$\mathrm{FDR}(u) = \frac{\sum_{j \in \mathcal{H}_0} I(p_j \le u)}{\max\{\sum_{j=1}^d I(p_j \le u),\ 1\}},$$
where the set of null hypotheses $\mathcal{H}_0$ is unknown. Since $|\mathcal{H}_0| \le d$ (a bound that can be improved by a two-stage procedure), estimate
$$\widehat{\mathrm{FDR}}(u) = \frac{u\,d}{\max\{\sum_{j=1}^d I(p_j \le u),\ 1\}}.$$

32-33. Main Results on FDR Control
Given a level $\alpha$, calculate the cutoff
$$\hat u_\alpha = \sup\{0 < u < 1: \widehat{\mathrm{FDR}}(u) \le \alpha\}.$$
Theorem [FDR control for GLM]. Under the same conditions, if $s \to \infty$ and $d \lesssim n^q$ for some $q > 1$, then $\mathrm{FDR}(\hat u_\alpha) \le \alpha$, in probability, as $n \to \infty$.
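A minimal sketch computing $\hat u_\alpha$ from the p-values, evaluating $\widehat{\mathrm{FDR}}(u)$ at the observed p-values (which yields the same rejection set as scanning all $u \in (0,1)$, and coincides with Benjamini-Hochberg). The function name `fdr_cutoff` is hypothetical.

```python
import numpy as np

def fdr_cutoff(pvals, alpha=0.1):
    """Return u-hat = sup{u : FDR-hat(u) <= alpha}, evaluated at observed p-values."""
    p = np.sort(np.asarray(pvals))
    d = p.size
    fdr_hat = p * d / np.arange(1, d + 1)   # FDR-hat at u = p_(k) is p_(k)*d/k
    ok = np.nonzero(fdr_hat <= alpha)[0]
    return p[ok.max()] if ok.size else 0.0

# Reject H_{0j} whenever p_j <= fdr_cutoff(pvals, alpha).
```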

34. A Simulation Study (Revisited)
Model: $Y = X\beta + \epsilon$ with $n$ samples, $d$ covariates, $s$-sparse $\beta$, and $\beta_1 = 0$. Here $n = 100$, $d = 500$, and $s = 4$.

35. Empirical Null Distribution
$H_0: \beta_1 = 0$ vs $H_1: \beta_1 \neq 0$.
[Histograms of null p-values: MLE post model selection vs the proposed method; the proposed method's p-values are approximately uniform.]

36. Empirical FDR
[Plots of empirical FDR against the target FDR: BH + MLE post model selection vs the proposed method.]

37-38. Summary
- High-dimensional hypothesis testing is a common problem.
- We propose a unified framework to control testing errors in high-dimensional models.
- The resulting p-values can be combined to control the FDR in high-dimensional multiple hypothesis testing.
Reference: Ning & Liu, "A General Theory of Hypothesis Tests and Confidence Regions for Sparse High Dimensional Models," Annals of Statistics, 2017.

39. Thank You
