Confidence Intervals for Low-dimensional Parameters with High-dimensional Data

1 Confidence Intervals for Low-dimensional Parameters with High-dimensional Data
Cun-Hui Zhang and Stephanie S. Zhang
Rutgers University and Columbia University
September 14, 2012

2 Outline
Introduction
Methodology
Bias corrected linear estimators
Low-dimensional projections (LDP)
Implementation with the scaled Lasso
Confidence intervals
Theoretical Results
Simulation Study

3 Statistical Inference of Low-dimensional Parameters with High-dimensional Data
Limit of statistical inference based on selection consistency theory:
  All nonzero coefficients must be greater than a noise level
  No signal strength or uniform strength of individual variables.
This paper includes:
  Proposal of the low-dimensional projection estimator (LDPE)
  Confidence interval construction for coefficients
  Asymptotic normality of the proposed estimators
  Consistent estimator for their covariance matrices
  Numerical study

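4 Setting
Linear model: $y = X\beta + \varepsilon$, $\varepsilon \sim N(0, \sigma^2 I)$, where $y \in \mathbb{R}^n$, $X = (x_1, \ldots, x_p) \in \mathbb{R}^{n \times p}$, and $\beta = (\beta_1, \ldots, \beta_p)^T$.
Notation: for vectors $v = (v_1, \ldots, v_m)$ of any dimension, $\mathrm{supp}(v) = \{j : v_j \neq 0\}$, $\|v\|_0 = |\mathrm{supp}(v)|$ and $\|v\|_q = \{\sum_j |v_j|^q\}^{1/q}$.
For $A \subseteq \{1, \ldots, p\}$, $v_A = (v_j, j \in A)^T$ and $X_A = (x_k, k \in A)$. Define $-j = \{1, \ldots, p\} \setminus \{j\}$.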

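5 Least squares estimator (LSE)
Least squares estimator of an estimable regression coefficient:
$$\hat\beta_j^{(lse)} = y^T x_j^\perp / \|x_j^\perp\|_2^2,$$
where $x_j^\perp$ is the projection of $x_j$ to the orthogonal complement of the column space of $X_{-j} = (x_k, k \neq j)$.
For estimable $\beta_j$ and $\beta_k$,
$$\mathrm{Cov}(\hat\beta_j^{(lse)}, \hat\beta_k^{(lse)}) = \sigma^2 (x_j^\perp)^T x_k^\perp / (\|x_j^\perp\|_2^2 \|x_k^\perp\|_2^2).$$
In the case of $p > n$, $\mathrm{rank}(X_{-j}) = n$, and $\hat\beta_j^{(lse)}$ is not defined because $x_j^\perp = 0$.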

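6 Bias corrected linear estimators
Preserving LSE properties: explicit formula of the covariance structure
For any score vector $z_j$ not orthogonal to $x_j$, the corresponding linear estimator satisfies
$$\hat\beta_j^{(lin)} = \frac{z_j^T y}{z_j^T x_j} = \beta_j + \frac{z_j^T \varepsilon}{z_j^T x_j} + \sum_{k \neq j} \frac{z_j^T x_k \beta_k}{z_j^T x_j}.$$
Bias problem: $z_j^T x_k \neq 0$ for some $k \neq j$
Bias correction with an initial estimator $\hat\beta^{(init)}$:
$$\hat\beta_j = \hat\beta_j^{(lin)} - \sum_{k \neq j} \frac{z_j^T x_k \hat\beta_k^{(init)}}{z_j^T x_j} = \hat\beta_j^{(init)} + \frac{z_j^T \{y - X\hat\beta^{(init)}\}}{z_j^T x_j},$$
with a score vector $z_j$ depending on $X$ only.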

7 Bias corrected linear estimators
Estimation error: sum of noise and approximation error
$$\hat\beta_j - \beta_j = \frac{z_j^T \varepsilon}{z_j^T x_j} + \frac{1}{z_j^T x_j} \sum_{k \neq j} z_j^T x_k (\beta_k - \hat\beta_k^{(init)})$$
Specifications of the score vector $z_j$ and the initial estimator $\hat\beta^{(init)}$
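The two equivalent forms of the bias corrected estimator can be checked numerically. The following is a minimal sketch in plain Python; the helper names (`dot`, `column`, `bias_corrected`, `linear_minus_bias`) and the toy design are illustrative assumptions, not part of the talk.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def column(X, k):
    """Column k of a matrix stored as a list of rows."""
    return [row[k] for row in X]

def bias_corrected(X, y, z, j, beta_init):
    """beta_init[j] + z'(y - X beta_init) / z'x_j."""
    residual = [yi - dot(row, beta_init) for row, yi in zip(X, y)]
    return beta_init[j] + dot(z, residual) / dot(z, column(X, j))

def linear_minus_bias(X, y, z, j, beta_init):
    """beta_lin - sum_{k != j} z'x_k beta_init[k] / z'x_j."""
    zxj = dot(z, column(X, j))
    beta_lin = dot(z, y) / zxj
    bias = sum(dot(z, column(X, k)) * beta_init[k]
               for k in range(len(beta_init)) if k != j)
    return beta_lin - bias / zxj

# Toy, noiseless example (hypothetical numbers chosen for illustration).
X = [[1.0, 0.5, 0.0],
     [0.0, 1.0, 0.5],
     [1.0, 0.0, 1.0],
     [0.5, 1.0, 1.0]]
beta = [2.0, -1.0, 0.5]
y = [dot(row, beta) for row in X]
z = column(X, 0)                 # simplest score vector: z = x_0
beta_init = [1.5, -0.8, 0.0]     # a deliberately imperfect initial estimate

est_a = bias_corrected(X, y, z, 0, beta_init)
est_b = linear_minus_bias(X, y, z, 0, beta_init)
est_exact = bias_corrected(X, y, z, 0, beta)  # perfect initial estimate
```

With a perfect initial estimator and no noise, the residual vanishes and the correction returns the true coefficient; with an imperfect one, the remaining error is the approximation term in the decomposition.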

8 Low-dimensional projections (LDP)
Proper choice of $z_j$:
  Suitable conditions on $\{X, \beta\}$ and an initial estimator $\hat\beta^{(init)}$ are given
  Control on the noise and approximation error
  Not too small $x_j^\perp$: $z_j = x_j^\perp$
  Too small $x_j^\perp$: $z_j$ is a relaxed projection of $x_j$
Low-dimensional projection estimator (LDPE): $\hat\beta_j$
Relaxed projection:
  $x_j^\perp$: residual of the least squares fit of $x_j$ on $X_{-j}$
  $z_j$: residual of the Lasso, a relaxation of the least squares method
$$z_j = x_j - X_{-j}\hat\gamma_j, \quad \hat\gamma_j = \arg\min_b \Big\{ \frac{\|x_j - X_{-j} b\|_2^2}{2n} + \lambda_j \|b\|_1 \Big\}.$$
Common penalty level use: normalization to $\|x_k\|_2^2 = n$

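9 Sketch on LDPE theoretical results
Karush-Kuhn-Tucker conditions: $|x_k^T z_j / n| \le \lambda_j$ for $k \neq j$, so
$$\Big| \sum_{k \neq j} z_j^T x_k (\beta_k - \hat\beta_k^{(init)}) \Big| \le n\lambda_j \|\hat\beta^{(init)} - \beta\|_1.$$
Let $\eta_j = n\lambda_j / \|z_j\|_2$ and $\tau_j = \|z_j\|_2 / |z_j^T x_j|$.
Since $z_j^T \varepsilon \sim N(0, \sigma^2 \|z_j\|_2^2)$,
$$\eta_j \|\hat\beta^{(init)} - \beta\|_1 / \sigma = o(1) \;\Rightarrow\; \tau_j^{-1}(\hat\beta_j - \beta_j) \approx N(0, \sigma^2).$$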
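The step from the KKT conditions to approximate normality can be spelled out as follows. This is a sketch of the argument using only the error decomposition and the definitions of $\eta_j$ and $\tau_j$ given in the slides; the sign symbol $s_j = \mathrm{sign}(z_j^T x_j)$ is introduced here for convenience.

```latex
\begin{align*}
\tau_j^{-1}(\hat\beta_j - \beta_j)
  &= \frac{|z_j^T x_j|}{\|z_j\|_2}
     \left\{ \frac{z_j^T \varepsilon}{z_j^T x_j}
     + \frac{1}{z_j^T x_j} \sum_{k \neq j} z_j^T x_k
       \bigl(\beta_k - \hat\beta_k^{(init)}\bigr) \right\} \\
  &= s_j \frac{z_j^T \varepsilon}{\|z_j\|_2}
     + \frac{s_j}{\|z_j\|_2} \sum_{k \neq j} z_j^T x_k
       \bigl(\beta_k - \hat\beta_k^{(init)}\bigr), \\
\left| \tau_j^{-1}(\hat\beta_j - \beta_j)
       - s_j \frac{z_j^T \varepsilon}{\|z_j\|_2} \right|
  &\le \frac{\max_{k \neq j} |z_j^T x_k|}{\|z_j\|_2}
       \,\bigl\|\hat\beta^{(init)} - \beta\bigr\|_1
   \le \frac{n \lambda_j}{\|z_j\|_2}
       \,\bigl\|\hat\beta^{(init)} - \beta\bigr\|_1
   = \eta_j \bigl\|\hat\beta^{(init)} - \beta\bigr\|_1 .
\end{align*}
```

Because $z_j$ depends on $X$ only, $s_j z_j^T \varepsilon / \|z_j\|_2$ is exactly $N(0, \sigma^2)$, so the remainder is negligible on the scale of $\sigma$ whenever $\eta_j \|\hat\beta^{(init)} - \beta\|_1 / \sigma = o(1)$.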

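10 Scaled Lasso
Joint convex minimization method:
$$\{\hat\beta^{(init)}, \hat\sigma\} = \arg\min_{b, \sigma} \Big\{ \frac{\|y - Xb\|_2^2}{2\sigma n} + \frac{\sigma}{2} + \lambda_0 \|b\|_1 \Big\},$$
with a penalty level $\lambda_0 = A \sqrt{(2/n)\log(p/\epsilon)}$ with a certain $A > 1$ and $0 < \epsilon \le 1$ (Sun and Zhang, 2011)
Scaled Lasso estimate of the noise level at a penalty level $\lambda$:
$$\hat\sigma_j = \arg\min_\sigma \min_b \Big\{ \frac{\|x_j - X_{-j} b\|_2^2}{2\sigma n} + \frac{\sigma}{2} + \lambda \|b\|_1 \Big\}.$$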
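The joint minimization can be approached by alternating between the noise level and the coefficients: for fixed $b$ the minimizing $\sigma$ is $\|y - Xb\|_2/\sqrt{n}$, and for fixed $\sigma$ the problem in $b$ is a Lasso with penalty $\sigma\lambda_0$. The sketch below makes one strong simplifying assumption that is not in the talk: the design satisfies $X^T X / n = I$, so the inner Lasso step reduces to soft thresholding. The function names and the toy data are illustrative.

```python
import math

def soft_threshold(t, lam):
    """Soft-thresholding operator sign(t) * max(|t| - lam, 0)."""
    return math.copysign(max(abs(t) - lam, 0.0), t)

def scaled_lasso_orthonormal(X, y, lam0, iters=100, tol=1e-12):
    """Alternating minimization of ||y - Xb||^2/(2*sigma*n) + sigma/2 + lam0*||b||_1.

    ASSUMPTION (for illustration only): X'X/n is the identity, so the Lasso
    step with penalty sigma*lam0 is soft thresholding of c_k = x_k'y/n.
    """
    n, p = len(X), len(X[0])
    c = [sum(X[i][k] * y[i] for i in range(n)) / n for k in range(p)]
    sigma = math.sqrt(sum(v * v for v in y) / n)  # noise level at b = 0
    b = [0.0] * p
    for _ in range(iters):
        # Lasso step at the current penalty level sigma * lam0.
        b = [soft_threshold(ck, sigma * lam0) for ck in c]
        # Noise-level step: root mean squared residual.
        resid = [y[i] - sum(X[i][k] * b[k] for k in range(p)) for i in range(n)]
        new_sigma = math.sqrt(sum(r * r for r in resid) / n)
        converged = abs(new_sigma - sigma) < tol
        sigma = new_sigma
        if converged:
            break
    return b, sigma

# Toy design: three columns of a 4x4 Hadamard matrix, so X'X/n = I with n = 4.
X = [[1.0, 1.0, 1.0],
     [1.0, -1.0, 1.0],
     [1.0, 1.0, -1.0],
     [1.0, -1.0, -1.0]]
# y = 2 * (first column) + 0.5 * (fourth Hadamard column, orthogonal to X).
y = [2.5, 1.5, 1.5, 2.5]
b_hat, sigma_hat = scaled_lasso_orthonormal(X, y, lam0=0.5)
```

In this toy case the iteration has a closed-form fixed point, $\hat\sigma^2 = 0.25 + (\lambda_0 \hat\sigma)^2$, which gives $\hat\sigma = 1/\sqrt{3}$ for $\lambda_0 = 0.5$. For a general design the inner step would be a full Lasso solve.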

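11 Lasso path
Coefficient estimator, residual and factors along the Lasso path:
$$\hat\gamma_j(\lambda) = \arg\min_b \big\{ \|x_j - X_{-j} b\|_2^2 / (2n) + \lambda \|b\|_1 \big\},$$
$$z_j(\lambda) = x_j - X_{-j}\hat\gamma_j(\lambda),$$
$$\eta_j(\lambda) = \max_{k \neq j} |x_k^T z_j(\lambda)| / \|z_j(\lambda)\|_2 = n\lambda / \|z_j(\lambda)\|_2,$$
$$\tau_j(\lambda) = \|z_j(\lambda)\|_2 / |x_j^T z_j(\lambda)|$$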
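As an illustration of these path quantities, the sketch below computes the Lasso residual $z_j(\lambda)$ by cyclic coordinate descent on a small random design and then evaluates the bias factor $\eta_j(\lambda)$ and noise factor $\tau_j(\lambda)$ on a grid. Coordinate descent, the grid, the dimensions and all function names are choices made here for the example; the talk does not specify a solver.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def soft_threshold(t, lam):
    return math.copysign(max(abs(t) - lam, 0.0), t)

def lasso_residual(xj, others, lam, sweeps=500):
    """Coordinate descent for min_b ||xj - sum_k b_k x_k||^2/(2n) + lam*||b||_1.

    `others` is the list of columns x_k, k != j. Returns (b, z) where z is
    the residual xj - sum_k b_k x_k.
    """
    n = len(xj)
    b = [0.0] * len(others)
    z = list(xj)
    for _ in range(sweeps):
        for k, xk in enumerate(others):
            norm2 = dot(xk, xk) / n
            # Correlation of x_k with the residual, adding back its own term.
            rho = dot(xk, z) / n + norm2 * b[k]
            new = soft_threshold(rho, lam) / norm2
            delta = new - b[k]
            if delta != 0.0:
                z = [zi - delta * xi for zi, xi in zip(z, xk)]
                b[k] = new
    return b, z

def factors(xj, others, z):
    """Return (eta, tau, ||z||_2) for a given residual z."""
    norm_z = math.sqrt(dot(z, z))
    eta = max(abs(dot(xk, z)) for xk in others) / norm_z
    tau = norm_z / abs(dot(xj, z))
    return eta, tau, norm_z

# Small random design with columns normalized to ||x_k||^2 = n.
random.seed(0)
n, p = 30, 8
cols = []
for _ in range(p):
    c = [random.gauss(0.0, 1.0) for _ in range(n)]
    scale = math.sqrt(n / dot(c, c))
    cols.append([scale * v for v in c])
xj, others = cols[0], cols[1:]

path = []
for lam in (0.05, 0.1, 0.2, 0.4):
    _, z = lasso_residual(xj, others, lam)
    eta, tau, norm_z = factors(xj, others, z)
    path.append((lam, eta, tau, norm_z))
    print(f"lambda={lam:.2f}  eta={eta:.3f}  tau={tau:.4f}  ||z||={norm_z:.3f}")
```

Smaller $\lambda$ lowers the bias factor but shrinks $\|z_j(\lambda)\|_2$, which tends to raise the noise factor; this is the trade-off the penalty level choice has to balance.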

12 Penalty level choice
High accuracy of coverage probability of the confidence interval for $\beta_j$:
  Reduction in the ratio of bias to standard error (SE) of our estimator
  Increase in its SE
$$z_j = z_j(\lambda_j), \quad \lambda_j = \arg\min_\lambda \big\{ \eta_j(\lambda) : \tau_j(\lambda) \le (1 + \kappa_0)\,\tau_j(\hat\sigma_j \lambda^*) \big\},$$
$\kappa_0 > 0$: pre-determined constant

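13 Properties of procedure
Proposition
Both functions $\|z_j(\lambda)\|_2$ and $\eta_j(\lambda)$ are nondecreasing in $\lambda$, and the function $\tau_j(\lambda)$ is no greater than $1/\|z_j(\lambda)\|_2$ in the Lasso path. Moreover,
$$\eta_j(\lambda_j) \le \eta_j(\hat\sigma_j \lambda^*) = \sqrt{n}\,\lambda^*, \quad \tau_j(\lambda_j) \le (1 + \kappa_0)/(\hat\sigma_j n^{1/2})$$
Remark
Since $\eta_j(\lambda)$ is a nondecreasing function of $\lambda$, the procedure can be carried out by minimizing $\lambda$ under the constraint on $\tau_j(\lambda)$.
$\kappa_0 = 1/2$, $\lambda^* = \lambda_{univ} = \sqrt{(2/n)\log p}$
$z_j$ determined by the design matrix $X$ given $\kappa_0$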
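On a finite grid of penalty levels, the selection rule amounts to a constrained minimum: among the grid points whose noise factor stays within $(1 + \kappa_0)$ times a reference value, take the one with the smallest bias factor. The sketch below assumes the path has already been computed as `(lambda, eta, tau)` triples; the function name `select_penalty`, the reference value `tau_ref` and the numbers in the example are made up for illustration.

```python
def select_penalty(path, tau_ref, kappa0=0.5):
    """Pick the grid point with the smallest eta subject to
    tau <= (1 + kappa0) * tau_ref.

    path: list of (lam, eta, tau) tuples along the Lasso path.
    tau_ref: noise factor at the reference penalty level (an input here).
    """
    bound = (1.0 + kappa0) * tau_ref
    feasible = [pt for pt in path if pt[2] <= bound]
    if not feasible:
        raise ValueError("no penalty level satisfies the constraint on tau")
    return min(feasible, key=lambda pt: pt[1])

# Hypothetical path values, not taken from the talk.
path = [
    (0.05, 0.9, 0.40),
    (0.10, 1.6, 0.25),
    (0.20, 2.4, 0.20),
    (0.30, 3.0, 0.19),
]
chosen = select_penalty(path, tau_ref=0.20, kappa0=0.5)
print(chosen)
```

Here the smallest penalty level violates the constraint on the noise factor, so the rule settles on the next one, which has the smallest bias factor among the admissible points.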

14 Restricted LDPE
Weighted low-dimensional projection with different relaxation levels for different variables $x_k$ according to their correlation to $x_j$
The larger $|x_j^T x_k / n|$, the greater the contribution to bias due to initial estimation error
Smaller $|z_j^T x_k / n|$ for large $|x_j^T x_k / n|$ with a weighted relaxation:
$$z_j = x_j - X_{-j}\hat\gamma_j, \quad \hat\gamma_j = \arg\min_b \Big\{ \frac{\|x_j - X_{-j} b\|_2^2}{2n} + \lambda_j \sum_{k \neq j} w_k |b_k| \Big\},$$
with $w_k = 0$ for large $|x_j^T x_k / n|$ and $w_k = 1$ for other $k$

15 Confidence intervals
Conditions on $X$ and $\beta$: the approximation error is of smaller order than the standard deviation of the noise component
LDPE vector: $\hat\beta = (\hat\beta_1, \ldots, \hat\beta_p)^T$
Covariance of the noise component: $V = (V_{jk})_{p \times p}$,
$$V_{jk} = \frac{z_j^T z_k}{(z_j^T x_j)(z_k^T x_k)} = \sigma^{-2}\,\mathrm{Cov}\Big( \frac{z_j^T \varepsilon}{z_j^T x_j}, \frac{z_k^T \varepsilon}{z_k^T x_k} \Big).$$
Approximate $(1 - \alpha)100\%$ confidence interval of $a^T\beta$ ($a$: sparse vectors):
$$|a^T\hat\beta - a^T\beta| \le \hat\sigma\,\Phi^{-1}(1 - \alpha/2)\,(a^T V a)^{1/2},$$
$\hat\sigma$: scaled Lasso noise level estimator
$\Phi$: standard normal distribution function
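The interval for a sparse linear combination $a^T\beta$ only needs the entries of $V$ on the support of $a$. The following sketch assumes the LDPE values, the score vectors and the noise level estimate are already available; the function name, the dictionary-based inputs and the numbers are illustrative assumptions.

```python
import math
from statistics import NormalDist

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def ldpe_confidence_interval(a, beta_hat, sigma_hat, Z, Xcols, alpha=0.05):
    """Approximate (1 - alpha) interval for a'beta.

    a:        dict {index: weight}, the sparse vector a
    beta_hat: dict {index: LDPE value}
    Z:        dict {index: score vector z_j}
    Xcols:    dict {index: design column x_j}
    """
    quad = 0.0
    for j, aj in a.items():
        for k, ak in a.items():
            # V_jk = z_j'z_k / ((z_j'x_j)(z_k'x_k))
            v_jk = dot(Z[j], Z[k]) / (dot(Z[j], Xcols[j]) * dot(Z[k], Xcols[k]))
            quad += aj * v_jk * ak
    center = sum(w * beta_hat[j] for j, w in a.items())
    half = sigma_hat * NormalDist().inv_cdf(1.0 - alpha / 2.0) * math.sqrt(quad)
    return center - half, center + half

# Single coefficient, with made-up inputs: z_0 = x_0 = (1, 1, 1, 1).
Z = {0: [1.0, 1.0, 1.0, 1.0]}
Xcols = {0: [1.0, 1.0, 1.0, 1.0]}
lo, hi = ldpe_confidence_interval({0: 1.0}, {0: 1.2}, 1.0, Z, Xcols)
print(lo, hi)
```

For a single coefficient, $(a^T V a)^{1/2}$ reduces to $\|z_j\|_2 / |z_j^T x_j| = \tau_j$, so the half-width is $\hat\sigma\,\Phi^{-1}(1 - \alpha/2)\,\tau_j$; in the toy input $\tau_0 = 2/4 = 0.5$.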

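16 Conditions
Sparsity:
$$\sum_{j=1}^p \min\{|\beta_j| / (\sigma\lambda_{univ}), 1\} \le s, \quad \lambda_{univ} = \sqrt{(2/n)\log p}.$$
Ex: $\|\beta\|_0 \le s$, or $\ell_q$ sparse with $\|\beta\|_q^q / (\sigma\lambda_{univ})^q \le s$, $0 < q \le 1$.
Initial estimator:
$$P\big\{ \|\hat\beta^{(init)} - \beta\|_1 \ge C_1 s \sigma \sqrt{(2/n)\log(p/\epsilon)} \big\} \le \epsilon,$$
for a fixed $C_1$ and all $\alpha_0/p^2 \le \epsilon \le 1$, $\alpha_0 \in (0, 1)$. Existing oracle inequalities are used.
Noise level estimator:
$$P\big\{ |\hat\sigma/\sigma - 1| \ge C_2 s (2/n)\log(p/\epsilon) \big\} \le \epsilon,$$
for a fixed $C_2$ and all $\alpha_0/p^2 \le \epsilon \le 1$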
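The capped sparsity index in the first condition counts a coefficient fully once it exceeds the noise level $\sigma\lambda_{univ}$ and fractionally below it. A minimal sketch, with function names chosen here for illustration:

```python
import math

def universal_penalty(n, p):
    """lambda_univ = sqrt((2/n) * log p)."""
    return math.sqrt(2.0 * math.log(p) / n)

def capped_sparsity(beta, sigma, n):
    """sum_j min(|beta_j| / (sigma * lambda_univ), 1)."""
    lam = universal_penalty(n, len(beta))
    return sum(min(abs(b) / (sigma * lam), 1.0) for b in beta)

# Example with n = 200 and p = 3000: six coefficients at 3 * lambda_univ,
# one small coefficient at half the noise level, the rest zero.
n, p, sigma = 200, 3000, 1.0
lam = universal_penalty(n, p)
beta = [0.0] * p
for j in (1500, 1800, 2100, 2400, 2700, 3000):
    beta[j - 1] = 3.0 * lam
beta[0] = 0.5 * sigma * lam
s = capped_sparsity(beta, sigma, n)
print(lam, s)
```

The six large coefficients each contribute 1 and the small one contributes 0.5, so the index is 6.5 even though $\|\beta\|_0 = 7$.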

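17 Main theorem
$\hat\beta_j$ is the LDPE with a $z_j$ depending on $X$ only and an initial estimator $\hat\beta^{(init)}$. Let $\max(\epsilon'_n, \epsilon''_n) \to 0+$, $\tau_j = \|z_j\|_2 / |x_j^T z_j|$, and $\eta_j = \max_{k \neq j} |x_k^T z_j| / \|z_j\|_2$.
Suppose the initial estimator condition holds with $\eta_j C_1 s \sigma \sqrt{(2/n)\log(p/\epsilon)} \le \epsilon'_n$ for all $j$. Then,
$$\max_j P\big\{ |\tau_j^{-1}(\hat\beta_j - \beta_j) - z_j^T\varepsilon / \|z_j\|_2| > \epsilon'_n \big\} \le \epsilon.$$
If the $\hat\sigma$ condition holds with $C_2 s (2/n)\log(p/\epsilon) \le \epsilon''_n$, then for all $j \le p$ and $t \in \mathbb{R}$,
$$\Phi(t - \epsilon'_n - \epsilon''_n |t|) - 2\epsilon \le P\{\tau_j^{-1}(\hat\beta_j - \beta_j) \le \hat\sigma t\} \le \Phi(t + \epsilon'_n + \epsilon''_n |t|) + 2\epsilon.$$
Consequently, for the covariance matrix $V$ and all fixed $m$,
$$\lim_{n \to \infty} \inf_{\|a\|_0 \le m} P\big\{ |a^T\hat\beta - a^T\beta| \le \hat\sigma\,\Phi^{-1}(1 - \alpha/2)\,(a^T V a)^{1/2} \big\} = 1 - \alpha$$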

19 Asymptotic normality of LDPE and LDPE based confidence interval
Remark
In implementation with $\lambda_j = \lambda_{univ}$, $z_j$ is the residual of the Lasso estimator in the regression model for $x_j$ against $X_{-j}$ with a penalty level $\lambda_j$ to guarantee $\eta_j \le \sqrt{2\log p}$. Thus, the dimension constraint for the asymptotic normality and proper coverage probability in the Theorem is $s(\log p)/\sqrt{n} \to 0$.
$\tau_j \asymp 1/\|z_j\|_2 \asymp n^{-1/2}$ is expected.
Joint asymptotic normality of the LDPE for finitely many $\hat\beta_j$: $(z_j^T\varepsilon / \|z_j\|_2, j \le p)$ has a multivariate normal distribution with identical marginal distribution $N(0, \sigma^2)$
Approximate coverage probability of confidence intervals: under an additional condition on the noise level estimator

20 Oracle inequalities and Random designs
Condition for scaled Lasso estimators: Theorem 2 and Sun and Zhang, 2011
Random designs $X$: column normalized version of a Gaussian random matrix $\tilde X$
Linear regression of $x_j$ against $X_{-j}$: motive for the use of the above Lasso path
Theorem 3 and Remark 3

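21 Simulation study
$n = 200$, $p = 3000$, $\lambda = \lambda_{univ} = \sqrt{(2/n)\log p}$
$\beta_j = 3\lambda_{univ}$ for $j = 1500, 1800, 2100, 2400, 2700, 3000$, and $\beta_j = 3\lambda_{univ}/j^\alpha$ for all other $j$.
Simulated data $(\tilde X, X, y)$:
  $\tilde X = (\tilde x_{ij})_{n \times p}$ has iid $N(0, \Sigma)$ rows with $\Sigma = (\rho^{|j-k|})_{p \times p}$
  $X = (x_1, \ldots, x_p)$, $x_j = \tilde x_j \sqrt{n} / \|\tilde x_j\|_2$
  $y = X\beta + \varepsilon$, $\varepsilon \sim N(0, I)$
Settings:
  (A): $(\alpha, \rho) = (2, 1/5)$; (B): $(\alpha, \rho) = (1, 1/5)$
  (C): $(\alpha, \rho) = (2, 4/5)$; (D): $(\alpha, \rho) = (1, 4/5)$
Estimators: LDPE, restricted LDPE, Lasso, scaled Lasso
  $\hat\beta^{(init)}$, $\hat\sigma$: scaled Lasso
  $z_j$: Lasso path procedure, with $\kappa_0 = 1/2$ and $\lambda = \lambda_{univ}$, and $m = 20$ for restricted Lasso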
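The data-generating design can be sketched as follows. Rows with covariance $\Sigma = (\rho^{|j-k|})$ are drawn through an AR(1) recursion, which is one standard way to obtain that covariance and is a choice made here, not something the slides specify. The example call uses scaled-down dimensions and a scaled-down set of large-coefficient indices so that it runs quickly; the function name and seed are illustrative.

```python
import math
import random

def simulate(n, p, rho, alpha, large_idx, seed=0):
    """Generate (X, y, beta, lam) following the simulation design.

    large_idx holds the 1-based indices j with beta_j = 3 * lambda_univ;
    all other coefficients are 3 * lambda_univ / j**alpha.
    """
    rng = random.Random(seed)
    lam = math.sqrt(2.0 * math.log(p) / n)

    # Rows iid N(0, Sigma), Sigma_jk = rho^|j-k|, via a stationary AR(1).
    innov_sd = math.sqrt(1.0 - rho * rho)
    raw = []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0)]
        for _ in range(1, p):
            row.append(rho * row[-1] + innov_sd * rng.gauss(0.0, 1.0))
        raw.append(row)

    # Column normalization: x_j = x~_j * sqrt(n) / ||x~_j||_2.
    scale = [math.sqrt(n / sum(raw[i][j] ** 2 for i in range(n)))
             for j in range(p)]
    X = [[raw[i][j] * scale[j] for j in range(p)] for i in range(n)]

    beta = [3.0 * lam if j in large_idx else 3.0 * lam / j ** alpha
            for j in range(1, p + 1)]
    y = [sum(X[i][k] * beta[k] for k in range(p)) + rng.gauss(0.0, 1.0)
         for i in range(n)]
    return X, y, beta, lam

# Scaled-down analogue of setting (A): (alpha, rho) = (2, 1/5).
n, p = 20, 60
X, y, beta, lam = simulate(n, p, rho=0.2, alpha=2,
                           large_idx={30, 36, 42, 48, 54, 60})
```

Passing `n=200`, `p=3000` and `large_idx={1500, 1800, 2100, 2400, 2700, 3000}` gives the dimensions used in the study, at the cost of a much slower pure-Python run.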

22 Asymptotic normality of LDPE estimator for the largest β Figure: Histogram of errors of maximal β estimation in settings (A), (B), (C), and (D)

23 Behavior of LDPE estimator for the largest β Table: Summary statistics of errors of maximal β estimation in every setting

24 Coverage probability of LDPE based confidence interval Table: Mean coverage for LDPEs of all β Figure: Top: relative coverage frequencies versus index; Bottom: variables percentage for given relative coverage frequency values

25 Performance of restricted LDPE Figure: Top: plots of LDPE and restricted LDPE median of ˆβ versus index; Bottom: relative coverage frequencies of LDPE and restricted LDPE versus index in setting (D)

26 Point estimation performance of LDPE and thresholded LDPE Figure: Plots of median absolute error for scaled Lasso, LDPE and thresholded LDPE in settings (A) and (D)

27 $L_2$ loss of Lasso, scaled Lasso, thresholded LDPE
Table: Summary statistics of $L_2$ loss for Lasso, scaled Lasso, and thresholded LDPE

28 Thank you!!!
