Sample Size Requirement For Some Low-Dimensional Estimation Problems


1 Sample Size Requirement For Some Low-Dimensional Estimation Problems Cun-Hui Zhang, Rutgers University September 10, 2013 SAMSI Thanks for the invitation!

2 Acknowledgements/References
- Sun, T. and Zhang, C.-H. (2012). Scaled sparse linear regression. Biometrika 99.
- Zhang, C.-H. and Zhang, S. S. (2011). Confidence intervals for low-dimensional parameters with high-dimensional data. Technical report, arXiv.
- Zhang, C.-H. (2011). Statistical inference for high-dimensional data. In Mathematisches Forschungsinstitut Oberwolfach: Very High Dimensional Semiparametric Models, Report No. 48/2011.
- Sun, T. and Zhang, C.-H. (2012). Comments on: Optimal rates of convergence for sparse covariance matrix estimation. Statistica Sinica 22.
- Ren, Z., Sun, T., Zhang, C.-H. and Zhou, H. H. (2013). Asymptotic normality and optimalities in estimation of large Gaussian graphical model. Preprint.

3 Outline 1 LD problems 2 LDPE 3 Variable selection 4 SemiLD inference 5 Extensions 6 Sample size requirement 7 Simulation results

4 LD estimation problems
Consider HD models. Is it feasible to make regular statistical inference at the n^{-1/2} rate?
- Regular estimation: stable limit distribution; working in a connected sample space; not super-efficient; not requiring variable selection consistency
- Statistical inference: confidence intervals, p-values, asymptotic normality, efficiency with information bound, etc.
- This is typically not possible for the estimation of HD parameters or of non-smooth LD parameters
- What is the sample size requirement?

5 Feasibility; Example
Linear model y = Xβ + ε, ε ~ N(0, σ²I_n), where β ∈ R^p with p ≫ n
Scaled Lasso: λ_univ = √((2 log p)/n), A > 1,
{β̂, σ̂} = arg min_{b,σ} { ‖y − Xb‖₂²/(2σn) + σ/2 + Aλ_univ‖b‖₁ }
Städler et al (10), Antoniadis (10), Sun-Z (10,12), Belloni et al (11)
Suppose s log p ≪ n^{1/2} and a side regularity condition on X, where s = ‖β‖₀ or s = Σ_j min{|β_j|/(σλ_univ), 1}. Then
σ̂/σ* − 1 = o_P(n^{-1/2}), where σ* = ‖y − Xβ‖₂/√n is the oracle estimator
Consequently, √n(σ̂/σ − 1) → N(0, 1/2)
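A minimal sketch of the scaled Lasso as alternating minimization, using scikit-learn's Lasso for the b-step; the mapping to sklearn's penalty (alpha = σ·A·λ_univ) follows from multiplying the objective by σ, and the initial noise level is an arbitrary choice, not part of the slides.

```python
import numpy as np
from sklearn.linear_model import Lasso

def scaled_lasso(X, y, A=1.1, n_iter=20, tol=1e-6):
    """Scaled Lasso via alternating minimization.
    Objective: ||y - Xb||_2^2/(2*sigma*n) + sigma/2 + A*lam_univ*||b||_1, A > 1.
    """
    n, p = X.shape
    lam_univ = np.sqrt(2.0 * np.log(p) / n)
    sigma = np.std(y)  # crude initial noise level
    for _ in range(n_iter):
        # For fixed sigma, the b-step is an ordinary Lasso with
        # penalty level sigma * A * lam_univ (sklearn's alpha).
        fit = Lasso(alpha=sigma * A * lam_univ, fit_intercept=False).fit(X, y)
        b = fit.coef_
        # For fixed b, the sigma-step has the closed form sigma = ||y - Xb||_2 / sqrt(n).
        sigma_new = np.linalg.norm(y - X @ b) / np.sqrt(n)
        if abs(sigma_new - sigma) < tol * sigma:
            sigma = sigma_new
            break
        sigma = sigma_new
    return b, sigma
```

The LSE-after-selection refit of slide 10 would then be an ordinary least squares fit on the support of b.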

6 Feasibility; Example
Estimation of σ is special since it is orthogonal to the estimation of β in terms of score functions
The scores for the estimation of β_j and β_k are not orthogonal in general
Need to use the efficient score → bias correction
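The orthogonality claim follows from a standard calculation for the Gaussian log-likelihood, spelled out below for completeness.

```latex
% Gaussian linear model: log-likelihood and scores, with eps = y - X beta
\ell(\beta,\sigma) = -n\log\sigma - \tfrac{1}{2\sigma^2}\|y - X\beta\|_2^2, \qquad
\dot\ell_\beta = \frac{X^\top\varepsilon}{\sigma^2}, \qquad
\dot\ell_\sigma = -\frac{n}{\sigma} + \frac{\|\varepsilon\|_2^2}{\sigma^3}.
% Cross-information vanishes because odd Gaussian moments are zero:
E\bigl[\dot\ell_\beta\,\dot\ell_\sigma\bigr]
  = \frac{1}{\sigma^5}\,E\bigl[X^\top\varepsilon\,(\|\varepsilon\|_2^2 - n\sigma^2)\bigr] = 0,
\qquad
E\bigl[\dot\ell_\beta\,\dot\ell_\beta^\top\bigr] = \frac{X^\top X}{\sigma^2}.
% So the (beta, sigma) information is block diagonal, but X'X/sigma^2 is not
% diagonal in general: the scores for beta_j and beta_k are correlated.
```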

7 Outline 1 LD problems 2 LDPE 3 Variable selection 4 SemiLD inference 5 Extensions 6 Sample size requirement 7 Simulation results

8 Low-dimensional projection estimator (LDPE)
Bias correction with a linear estimator: β̂_j = β̂_j^{(init)} + w_j^T(y − Xβ̂^{(init)})
Error decomposition: with η_j = (X^Tw_j − e_j)/‖w_j‖₂,
β̂_j − β_j = w_j^Tε + (e_j − X^Tw_j)^T(β̂^{(init)} − β) = w_j^Tε − ‖w_j‖₂ η_j^T(β̂^{(init)} − β)
The property X^Tw_j = e_j only needs to hold approximately
Since w_j^Tε ~ N(0, σ²‖w_j‖₂²), η_j^T(β̂^{(init)} − β) → 0 implies (β̂_j − β_j)/(σ‖w_j‖₂) → N(0, 1)

9 Low-dimensional projection estimator (LDPE)
Bias correction with a linear estimator: β̂_j = β̂_j^{(init)} + w_j^T(y − Xβ̂^{(init)})
Approximate confidence interval: P{ |β̂_j − β_j|/(‖w_j‖₂σ̂) ≤ 1.96 } → 0.95, provided that σ̂ ≈ σ and η_j^T(β̂^{(init)} − β) → 0
The key condition holds when ‖β̂^{(init)} − β‖₁ = O_P(sλ), e.g. the Lasso, and ‖η_j‖_∞ ≤ C√(2 log p) = o(1/(λs)), where η_j = (X^Tw_j − e_j)/‖w_j‖₂ and the default choice is C = 1
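A sketch of the bias correction for one coordinate, with w_j = z_j/(z_j^Tx_j) built from a Lasso of x_j on the remaining columns; the fixed penalty level is a simplification of the scaled-Lasso/quadratic-program choices described on the next slide.

```python
import numpy as np
from sklearn.linear_model import Lasso

def ldpe_coordinate(X, y, j, beta_init, sigma_hat, lam=None):
    """Bias-corrected estimate and 95% CI for beta_j (one LDPE coordinate).
    The score vector z_j is the residual of a Lasso of x_j on the other
    columns, so that w_j = z_j / (z_j' x_j) in the notation of the slides."""
    n, p = X.shape
    if lam is None:
        lam = np.sqrt(2.0 * np.log(p) / n)  # simple default penalty level
    others = np.delete(np.arange(p), j)
    fit = Lasso(alpha=lam, fit_intercept=False).fit(X[:, others], X[:, j])
    z = X[:, j] - X[:, others] @ fit.coef_           # relaxed projection residual
    denom = z @ X[:, j]
    beta_j = beta_init[j] + z @ (y - X @ beta_init) / denom   # bias correction
    se_j = sigma_hat * np.linalg.norm(z) / abs(denom)         # sd of w_j' eps
    return beta_j, (beta_j - 1.96 * se_j, beta_j + 1.96 * se_j)
```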

10 LDPE Theory
We use the scaled Lasso, or the LSE after scaled Lasso selection, for the initial estimator
We use the scaled Lasso or a quadratic program to pick w_j
The compatibility factor (van de Geer, 07; van de Geer-Bühlmann, 09):
κ = inf{ |S|^{1/2}‖Xu‖₂ / (n^{1/2}‖u_S‖₁) : ‖u_{S^c}‖₁ ≤ ξ‖u_S‖₁ }
where S = {j : |β_j| ≥ σλ_univ} and ξ > (A + 1)/(A − 1)

11 LDPE Theory
Theorem 1 (Deterministic designs). Suppose s log p ≪ n^{1/2}. If 1/κ = O(1) and the choice of w_j is feasible, then
(β̂_j − β_j)/(σ̂τ_j) → N(0, 1), σ̂/σ* = 1 + o_P(n^{-1/2}), (1)
where τ_j = ‖w_j‖₂ and σ* = ‖y − Xβ‖₂/√n
Theorem 2 (Random designs). Suppose s log p ≪ n^{1/2} and X has iid N(0, Σ) rows. If max_j Σ_jj + ‖Σ^{-1}‖_(S) = O(1) (spectral norm), then (1) holds and τ_j = (1 + o(1))√(n^{-1}(Σ^{-1})_{jj})

12 Remarks
Confidence intervals can be constructed without assuming min_{β_j ≠ 0} |β_j| ≥ Cλ_univ
This uniform signal strength, or β_min, condition, required for variable selection consistency, divides the parameter space into (p choose s)·3^s disconnected regions according to sgn(β_j) ∈ {−1, 0, 1}
Stability selection: Meinshausen-Bühlmann (10)
Recent developments: Bühlmann (12), Belloni et al (12), Javanmard-Montanari (13)

13 Multiplicity adjustments and thresholded LDPE
Low-dimensional projection estimator (LDPE, Z-Zhang, 11): β̂_1, ŝe_1, ..., β̂_p, ŝe_p, with ŝe_j = σ̂τ_j ≍ σn^{-1/2}
β̂_j^{(thr)}: threshold β̂_j at level t_j = ŝe_j L_ε, L_ε = √(2 log(p/ε))
Estimation of β: for certain Ω_n with P(Ω_n) ≥ 1 − ε/p,
E‖β̂^{(thr)} − β‖₂² I_{Ω_n} ≤ (1 + o(1)) Σ_j [ β_j² ∧ (στ_j L_ε)² + (ε/p)(στ_j)² ]
Selection with any signal strength: {j : |β_j| > 2t_j} ⊆ {j : β̂_j^{(thr)} ≠ 0} ⊆ {j : β_j ≠ 0}
Similar to thresholding N(β_j, ŝe_j²)
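The thresholding rule is a one-liner; a minimal sketch, assuming the LDPE estimates and standard errors are already in hand.

```python
import numpy as np

def threshold_ldpe(beta_hat, se_hat, p, eps=0.05):
    """Hard-threshold LDPE estimates at t_j = se_j * sqrt(2*log(p/eps)).
    Returns the thresholded vector and the selected support."""
    L_eps = np.sqrt(2.0 * np.log(p / eps))
    t = se_hat * L_eps
    beta_thr = np.where(np.abs(beta_hat) > t, beta_hat, 0.0)
    return beta_thr, np.flatnonzero(beta_thr)
```

Per the slide, with high probability every j with |β_j| > 2t_j is selected and no true zero survives.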

14 Outline 1 LD problems 2 LDPE 3 Variable selection 4 SemiLD inference 5 Extensions 6 Sample size requirement 7 Simulation results

15 Inference after variable selection
Optimistic approach:
Find an estimate of S = supp(β) ⊆ {1, ..., p}, say Ŝ
Estimate β_j by the LSE β̂_{j,Ŝ}, where β̂_M = (X_M^TX_M)^{-1}X_M^Ty
If P{Ŝ = S} → 1, then P{ |β̂_{j,Ŝ} − β_j| / (σ̂_Ŝ ((X_Ŝ^TX_Ŝ)^{-1})_{jj}^{1/2}) ≤ 1.96 } → 95%
Tibshirani (96), Fan-Li (01), Fan-Peng (04), Meinshausen-Bühlmann (06), Tropp (06), Zhao-Yu (06), Wainwright (06), Z (07,10), Zhang (09,11), Z-Zhang (2012)
Super-efficiency: outperforms the oracle LSE with known A, where A ⊇ supp(β) with |A| > ‖β‖₀

16 Inference after model selection
Conservative approach: Leeb-Pötscher (06), Berk et al (09,11), Laber-Murphy (11)
If P{ max_M |((X_M^TX_M)^{-1}X_M^Tε)_j| / (σ̂_M ((X_M^TX_M)^{-1})_{jj}^{1/2}) ≤ K_j } = 95%, then
P{ |β̂_{j,Ŝ} − β_{j,Ŝ}| / (σ̂_Ŝ ((X_Ŝ^TX_Ŝ)^{-1})_{jj}^{1/2}) ≤ K_j } ≥ 95%
where β_M = (X_M^TX_M)^{-1}X_M^TXβ
A conservative confidence interval for a random parameter
Bias: β_{j,M} ≠ β_j when M ⊉ S
Inefficiency: √(log p) ≲ K_j ≲ √p
A toy computation of K_j follows.
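A toy Monte Carlo for the constant K_j, taking σ = 1 known for simplicity; exhaustive enumeration over all submodels is only feasible for tiny p, and all sizes here are illustrative.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, p, j, reps = 50, 6, 0, 2000
X = rng.standard_normal((n, p))

# All submodels M containing coordinate j (feasible only for tiny p).
models = [M for r in range(1, p + 1) for M in combinations(range(p), r) if j in M]
pinvs = [(M, np.linalg.pinv(X[:, M])) for M in models]

stats = np.empty(reps)
for b in range(reps):
    eps = rng.standard_normal(n)        # sigma = 1, so no sigma-hat needed
    tmax = 0.0
    for M, P in pinvs:
        coef = P @ eps                  # ((X_M'X_M)^{-1} X_M' eps)
        jj = M.index(j)
        se = np.sqrt(P[jj] @ P[jj])     # sd of coef[jj] when sigma = 1
        tmax = max(tmax, abs(coef[jj]) / se)
    stats[b] = tmax
K_j = np.quantile(stats, 0.95)          # the conservative constant for coordinate j
print(f"K_j = {K_j:.2f}  vs  1.96 for a single fixed model")
```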

17 Low-dimensional case
Consider the estimation of µ based on iid X_i ~ N(µ, 1), i ≤ n
Hodges example: let µ̂ = X̄ I{|X̄| > λ} with √n λ → ∞ and λ → 0
For fixed µ: √n(µ̂ − µ) → N(0, 1) if µ ≠ 0, and → 0 if µ = 0
Optimistic approach: if µ ∈ (0, 2λ), ...
Hájek-Le Cam's LAM (local asymptotic minimax): √n(µ̂ − µ) does not converge in distribution if µ = h/√n ≠ 0;
max_µ nE_µ(µ̂ − µ)² ≥ nλ² → ∞, so X̄ is better than µ̂
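A quick simulation makes the nλ² blow-up visible; n, λ = n^{-1/4}, and the grid of µ values are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 10_000, 100_000
lam = n ** -0.25                    # lam -> 0 while sqrt(n)*lam = n^{1/4} -> infinity

for mu in (0.0, lam, 1.0):          # mu near lam is the bad local regime
    xbar = mu + rng.standard_normal(reps) / np.sqrt(n)   # distribution of the sample mean
    hodges = xbar * (np.abs(xbar) > lam)                 # threshold the mean at lam
    print(f"mu = {mu:.4f}:  n * MSE = {n * np.mean((hodges - mu) ** 2):8.2f}")
```

Near µ = 0 the normalized MSE is far below 1 (super-efficiency); near µ = λ it is of order nλ² = √n; at a fixed µ it returns to about 1.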

18 Outline 1 LD problems 2 LDPE 3 Variable selection 4 SemiLD inference 5 Extensions 6 Sample size requirement 7 Simulation results

19 Statistical inference with high-dimensional data: The sampling distribution of high-dimensional regularized estimators is not tractable in general. However, low-dimensional statistical inference, such as p-values and confidence intervals for real parameters, is possible in high-dimensional models with high-dimensional data.

20 Semiparametric inference: parametric component + NP component
Low-dimensional statistical inference with high-dimensional data, or SemiLD inference: LD component + HD component
General method of SemiLD inference: [HD estimation → SemiLD inference] is parallel to [NP estimation → semiparametric inference]
We borrow ideas from Engle et al (81), Chen (85,88), Rice (86), Heckman (86), Bickel et al (90), ...

21 Minimum Fisher information; Stein (1956)
Linear model y = Xβ + ε, ε ~ N(0, σ²I_n)
For a fixed a ∈ R^p, consider the estimation of θ = a^Tβ
Consider the one-dimensional submodel β = uθ with a^Tu = 1 and known σ: y/σ = (Xu/σ)θ + N(0, I_n)
Minimum Fisher information: subject to a^Tu = 1,
F = min_u E(x^Tu/σ)² = min_u u^TΣu/σ² = u_0^TΣu_0/σ²
For the estimation of β_j, the LDPE attains the minimum Fisher information bound
Efficiency of LDPE: Z (11), van de Geer-Bühlmann-Ritov (13)
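The minimizer u_0 comes from a Lagrange-multiplier argument; the short derivation below, added for completeness, also shows why the bound matches τ_j in Theorem 2.

```latex
% Minimize u' Sigma u subject to a'u = 1:
\min_{a^\top u = 1} u^\top \Sigma u
\;\Longrightarrow\; 2\Sigma u = \mu a
\;\Longrightarrow\; u_0 = \frac{\Sigma^{-1} a}{a^\top \Sigma^{-1} a},
\qquad
F = \frac{u_0^\top \Sigma u_0}{\sigma^2} = \frac{1}{\sigma^2\, a^\top \Sigma^{-1} a}.
% For a = e_j this gives F = 1/(sigma^2 (Sigma^{-1})_{jj}), so the efficient
% variance for beta_j is 1/(nF) = sigma^2 (Sigma^{-1})_{jj}/n, matching
% tau_j^2 = (1 + o(1)) (Sigma^{-1})_{jj}/n in Theorem 2.
```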

22 Outline 1 LD problems 2 LDPE 3 Variable selection 4 SemiLD inference 5 Extensions 6 Sample size requirement 7 Simulation results

23 General picture
Linear regression: X with iid N(0, Σ) rows
- Known Σ versus unknown Σ (Robins and Ritov, 97)
- Additional costs in computational complexity and theoretical assumptions if Σ is unknown
- n^{-1/4} convergence of the initial estimator ⟺ s log p ≪ n^{1/2} (Z-Zhang, 11)
- Weaker condition in semisupervised learning: s log p ≪ N^{1/2}
- Precision matrix and partial correlation (Sun-Z, 11; Ren et al, 13)
- Deterministic X: will cost a little more
- Extensions to GLM and problems with sample Hessian depending on unknowns: will cost even more
- Extensions to quantile regression and problems without a sample Hessian: will cost a lot more

24 Precision matrix and partial correlation
Data X with iid N(0, Σ) rows; precision matrix Θ = Σ^{-1}
Multivariate regression model: Cov(ε_A, X_{A^c}) = 0 and
X_A = X_{A^c}Σ_{A^c}^{-1}Σ_{A^c,A} + ε_A = X_{A^c}B_{A^c,A} + ε_A
The residual has the covariance structure Eε_A^Tε_A/n = Σ_A − Σ_{A,A^c}Σ_{A^c}^{-1}Σ_{A^c,A} = Θ_A^{-1}
For a given A, we are interested in smooth functions τ(Θ_A^{-1}) = τ(Eε_A^Tε_A/n); the oracle estimator is τ* = τ(ε_A^Tε_A/n)
For an oracle expert with the knowledge of B_{A^c,A}, ε_A is sufficient for Θ_A, so that τ* is as efficient as the MLE in a fixed-dimensional regular parametric model (exponential family)

25 Precision matrix and partial correlation
X has iid N(0, Σ) rows; Θ = Σ^{-1}
Multivariate regression: Cov(ε_A, X_{A^c}) = 0, X_A = X_{A^c}B_{A^c,A} + ε_A, Eε_A^Tε_A/n = Θ_A^{-1}
The oracle τ* = τ(ε_A^Tε_A/n) is efficient for the estimation of τ(Θ_A^{-1})
Let B̂_{A^c,A} be the scaled Lasso estimator of B_{A^c,A} and z_A = X_A − X_{A^c}B̂_{A^c,A}. We propose τ̂ = τ(z_A^Tz_A/n)
Theorem. Suppose ‖Σ_{A^c}^{-1}‖_(S) + max_j Σ_jj + ‖Σ_A^{-1}‖_(S) = O(1). Let A be fixed and s_A = max_{j∈A} #{k ∉ A : Θ_jk ≠ 0}. Then
τ̂ = τ* + O_P(λ²s_A) = τ* + o_P(n^{-1/2})
when τ is a Lipschitz function and s_A log p ≪ n^{1/2}. Consequently, √(nF_τ)(τ̂ − τ) → N(0, 1), where F_τ is the minimum Fisher information for the estimation of τ

26 Precision matrix and partial correlation
We pick A = {j, k} for the estimation of individual elements
Corollary 1. For the estimation of the partial correlation r_jk = −Θ_jk/√(Θ_jjΘ_kk),
√n(r̂_jk − r_jk)/(1 − r_jk²) → N(0, 1)
Corollary 2. For the estimation of individual elements Θ_jk of the precision matrix,
√n(Θ̂_jk − Θ_jk)/√(Θ_jjΘ_kk + Θ_jk²) → N(0, 1)
We may then threshold these estimates for inference about high-dimensional quantities
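For A = {j, k} the proposal reduces to two column-wise Lasso regressions plus a 2×2 matrix inversion. A sketch, in which a plain Lasso with a fixed penalty level stands in for the scaled Lasso of the slides:

```python
import numpy as np
from sklearn.linear_model import Lasso

def precision_pair(X, j, k, lam=None):
    """Estimate the 2x2 block Theta_A, A = {j, k}, from residuals of
    column-wise Lasso regressions of x_j, x_k on the remaining columns."""
    n, p = X.shape
    if lam is None:
        lam = np.sqrt(2.0 * np.log(p) / n)  # universal penalty level, up to the noise scale
    A = [j, k]
    rest = [m for m in range(p) if m not in A]
    Z = np.empty((n, 2))
    for i, col in enumerate(A):
        fit = Lasso(alpha=lam, fit_intercept=False).fit(X[:, rest], X[:, col])
        Z[:, i] = X[:, col] - X[:, rest] @ fit.coef_   # z_col = residual
    Theta_A = np.linalg.inv(Z.T @ Z / n)               # (z_A' z_A / n)^{-1}
    r_jk = -Theta_A[0, 1] / np.sqrt(Theta_A[0, 0] * Theta_A[1, 1])  # partial correlation
    return Theta_A, r_jk
```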

27 Outline 1 LD problems 2 LDPE 3 Variable selection 4 SemiLD inference 5 Extensions 6 Sample size requirement 7 Simulation results

28 Precision matrix estimation
What is the sample size required for estimating LD parameters at the parametric rate?
Let G = { Σ : ‖Θ‖_(S) + max_j Σ_jj ≤ M, max_j #{k : Θ_jk ≠ 0} ≤ s }
Theorem 1. Suppose s log p ≤ c₀n for a certain small c₀. There exists c₁ depending on c₀ and M only such that, for all j, k,
inf_{Θ̂_jk} sup_{Σ∈G} P{ |Θ̂_jk − Θ_jk| ≥ c₁ max((s/n) log p, n^{-1/2}) } ≥ c₁
For the low-dimensional projection estimator,
lim_{t→∞} max_{j,k} sup_{Σ∈G} P{ |Θ̂_jk − Θ_jk| ≥ t max((s/n) log p, n^{-1/2}) } = 0
The sample size requirement is n ≫ (s log p)² for the n^{-1/2} rate

29 Precision matrix estimation
Theorem 2. Suppose s log p ≤ c₀n. There exists c₁ depending on c₀ and M only such that
inf_{Θ̂} sup_{Σ∈G} P{ max_{j,k} |Θ̂_jk − Θ_jk| ≥ c₁ max((s/n) log p, √((log p)/n)) } ≥ c₁
For the low-dimensional projection estimator,
lim_{t→∞} sup_{Σ∈G} P{ max_{j,k} |Θ̂_jk − Θ_jk| ≥ t max((s/n) log p, √((log p)/n)) } = 0
The sample size requirement is n ≫ s² log p for Bonferroni adjustments, and this rate is attained
This implies ‖Θ̂^{(thr)} − Θ‖_(S) ≲ s√((log p)/n) when s√((log p)/n) ≤ c₀
This is similar to the estimation of Σ (Bickel-Levina, 08)

30 Regression, estimation of individual coefficients
The sample size requirement in the regression case is more complicated
Deterministic design (under a side condition on the design):
- n ≫ (s log p)² for asymptotic normality
- n ≫ s² log p for simultaneous C.I. via Bonferroni
Random design (under a side condition on Σ = EX^TX/n):
- Known Σ: n ≫ s log p for efficient estimation; n ≫ s log p for simultaneous confidence intervals
- Unknown Σ: n ≫ s(d ∨ s)(log p)² for efficient estimation; n ≫ s(d ∨ s) log p for simultaneous confidence intervals
- Semi-supervised learning with unknown Σ: n ≫ s log p and N ≫ s(d ∨ s)(log p)² for efficient estimation; N ≫ s(d ∨ s) log p for simultaneous confidence intervals
We note that X is ancillary for the estimation of β. Are these conditions necessary?

31 Outline 1 LD problems 2 LDPE 3 Variable selection 4 SemiLD inference 5 Extensions 6 Sample size requirement 7 Simulation results

32-35 Simulation settings
- n = 200, p = 3000, σ = 1, λ_univ = √((2/n) log p) = 0.283; β_j = 3λ_univ = 0.849 for j = 1500, 1800, ..., 3000, and β_j = 3λ_univ/j^α otherwise; β_j ≠ 0 for all j
- (s, s log p/√n) = (8.93, 5.05) and (29.24, 16.55) for α = 2 and α = 1 respectively, while the theory requires s log p/√n → 0, where s = Σ_j min(|β_j|/λ_univ, 1)
- Generate (X̃, X, ε) in each replication, where X̃ has iid N(0, Σ) rows with Σ = (ρ^{|j−k|})_{p×p} and X is the column-normalized version of X̃
- Four settings, labeled (A), (B), (C), and (D), with (α, ρ) = (2, 1/5), (1, 1/5), (2, 4/5), and (1, 4/5) respectively; case (D) is the most difficult
A data-generating sketch for one replication follows.
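A sketch of one replication under these settings; the AR(1) construction reproduces Σ = (ρ^{|j−k|}) exactly, while normalizing columns to length √n is the assumed reading of "column normalized".

```python
import numpy as np

def make_setting(n=200, p=3000, alpha=2, rho=0.2, seed=0):
    """One replication of settings (A)-(D): Sigma_{jk} = rho^{|j-k|},
    6 strong coefficients, polynomially decaying weak ones, sigma = 1."""
    rng = np.random.default_rng(seed)
    lam_univ = np.sqrt(2.0 * np.log(p) / n)            # = 0.283 here
    j = np.arange(1, p + 1)
    beta = 3.0 * lam_univ / j ** alpha                 # weak, never exactly 0
    beta[np.arange(1500, p + 1, 300) - 1] = 3.0 * lam_univ   # strong: j = 1500, ..., 3000
    # AR(1)-type rows give Cov = rho^{|j-k|} with unit marginal variance
    Z = rng.standard_normal((n, p))
    X_tilde = np.empty_like(Z)
    X_tilde[:, 0] = Z[:, 0]
    for k in range(1, p):
        X_tilde[:, k] = rho * X_tilde[:, k - 1] + np.sqrt(1 - rho ** 2) * Z[:, k]
    X = X_tilde * np.sqrt(n) / np.linalg.norm(X_tilde, axis=0)  # columns of length sqrt(n)
    y = X @ beta + rng.standard_normal(n)              # noise with sigma = 1
    return X, y, beta
```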

36-39 LDPE
Unknown {β, σ}; deterministic X, or unknown Σ for random X
- scaled Lasso = arg min_{b,σ} { ‖y − Xb‖₂²/(2nσ) + σ/2 + λ_univ‖b‖₁ }
- {β̂^{(init)}, σ̂} = LSE after scaled Lasso selection
- z_j = residual of Lasso(x_j, X_{−j}), a regularized/approximate projection of x_j to the span of x_k, k ≠ j
- β̂_j = β̂_j^{(init)} + z_j^T(y − Xβ̂^{(init)})/(z_j^Tx_j)
- ŝe_j = σ̂‖z_j‖₂/|z_j^Tx_j|
- Low-dimensional projection estimator (LDPE): β̂_1, ŝe_1, ..., β̂_p, ŝe_p
- Restricted LDPE: z_j ⊥ x_k for the m = 4 most correlated columns, i.e. the smallest |k − j| under Σ = (ρ^{|j−k|})
- Oracle: β̂_j^{(oracle)} = e_j^T(X_{K_j}^TX_{K_j})^{-1}X_{K_j}^T(y − X_{K_j^c}β_{K_j^c}), |K_j| = 3
An oracle-benchmark sketch follows.
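The oracle benchmark is easy to emulate once β outside K_j is known; taking K_j to be j plus its two nearest neighbors (|K_j| = 3) is an assumption consistent with this Σ, not stated on the slides.

```python
import numpy as np

def oracle_coordinate(X, y, beta, j, K):
    """Oracle benchmark: knowing beta outside K (with j in K, |K| = 3 here),
    subtract X_{K^c} beta_{K^c} and run the LSE on the |K| columns in K."""
    Kc = np.setdiff1d(np.arange(X.shape[1]), K)
    y_adj = y - X[:, Kc] @ beta[Kc]                  # remove the known part
    coef, *_ = np.linalg.lstsq(X[:, K], y_adj, rcond=None)
    return coef[list(K).index(j)]                    # the entry for coordinate j
```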

40 Bias correction: histograms of β̂_j − β_j for the maximal β_j [figure; panels: scaled Lasso, least squares estimation after scaled Lasso selection, LDPE, restricted LDPE]

41 Bias correction: summary statistics for β̂_j − β_j with maximal β_j [table; rows: bias, sd, median absolute error for settings (A)-(D); columns: Lasso, scaled Lasso, LSE after scaled Lasso, oracle, LDPE, R-LDPE; values lost in transcription]

42 Overall relative coverage frequency, target = 95% [table: mean coverage of the LDPE and R-LDPE, over all β_j and over the maximal β_j, in settings (A)-(D); values lost in transcription]

43 More bias correction: distribution of simulated relative coverage frequencies [figure; panels: LDPE, restricted LDPE]

44 Efficiency: median ratio of interval width, LDPE/restricted LDPE versus oracle [table over settings (A)-(D); values lost in transcription]

45 Efficiency: median ratio of interval width, LDPE/restricted LDPE versus oracle [figure; panels: LDPE, restricted LDPE]

46 Relative efficiency of the LDPE/restricted LDPE versus oracle [table: median efficiency (ratio of MSEs) of the LDPE and restricted LDPE versus the oracle estimator, settings (A)-(D); values lost in transcription]

47 Relative efficiency of the LDPE/restricted LDPE versus oracle [figure; panels: LDPE, restricted LDPE]

48 Thresholded LDPE: ‖β̂ − β‖₂ at the Bonferroni 1/p level [table; rows: mean, sd, median for settings (A)-(D); columns: Lasso, scaled Lasso, LSE after scaled Lasso, oracle, t-LDPE; values lost in transcription]

49 Variable selection with Bonferroni correction [table: first block and maximal β_j, comparing Lasso, scaled Lasso, oracle, and LDPE in settings (A)-(D); values lost in transcription]

50 Thanks!
