A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models. Jingyi Jessica Li, Department of Statistics, University of California, Los Angeles. Joint work with Hanzhong Liu (Tsinghua) and Xin Xu (Yale)

Table of contents: 1. Introduction 2. Bootstrap Lasso+Partial Ridge 3. Theoretical Results 4. Simulation Results 5. Real Data Applications 6. Conclusions

Introduction

Sparse Linear Models

Y = Xβ^0 + ε, where
ε = (ε_1, ..., ε_n)^T is a vector of independent and identically distributed (i.i.d.) error random variables with mean 0 and variance σ^2
Y = (y_1, ..., y_n)^T ∈ R^n is an n-dimensional response vector
X = (x_1^T, ..., x_n^T)^T = (X_1, ..., X_p) ∈ R^{n×p} is a deterministic or random design matrix, with centered columns: (1/n) Σ_{i=1}^n x_ij = 0, j = 1, ..., p
β^0 ∈ R^p is the vector of coefficients

High dimensionality: n ≪ p. Sparsity: s = ||β^0||_0 ≪ p

Perspective 1: Sparse Point Estimation (Variable Selection)

Penalized least squares:

β̂ = argmin_{β ∈ R^p} { (1/(2n)) ||Y − Xβ||_2^2 + Σ_{j=1}^p p_λ(β_j) }

Lasso: p_λ(t) = λ|t| [Tibshirani, 1996]
Bridge: p_λ(t) = λ|t|^q for 0 < q ≤ 2 [Frank and Friedman, 1993]
SCAD: defined through its derivative p'_λ(t) = λ { I(t ≤ λ) + ((aλ − t)_+ / ((a − 1)λ)) I(t > λ) } for some a > 2, often a = 3.7 [Fan and Li, 2001]
MCP: p'_λ(t) = (aλ − t)_+ / a
and many others [Bühlmann and van de Geer, 2011, Fan and Lv, 2010]
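For concreteness, here is a minimal R sketch of the Lasso fit on toy sparse data (it assumes the glmnet package; the toy design, noise level, and all variable names are illustrative and not taken from the paper):

```r
## Minimal sketch: Lasso point estimation on simulated sparse data.
## Assumes the glmnet package; n, p, s and the noise level are illustrative.
library(glmnet)

set.seed(1)
n <- 200; p <- 500; s <- 10
X    <- matrix(rnorm(n * p), n, p)            # toy design with i.i.d. N(0, 1) entries
beta <- c(runif(s, 1/3, 1), rep(0, p - s))    # hard-sparse coefficient vector
Y    <- as.numeric(X %*% beta + rnorm(n, sd = 0.5))

cvfit      <- cv.glmnet(X, Y, alpha = 1)                     # alpha = 1 gives the Lasso penalty
beta_lasso <- as.numeric(coef(cvfit, s = "lambda.min"))[-1]  # drop the intercept
which(beta_lasso != 0)                                       # predictors selected by the Lasso
```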

Perspective 2: Statistical Inference

Question: How to construct confidence intervals and hypothesis tests for the individual β_j's?

Challenge: Inference is difficult for high-dimensional model parameters, because the limiting distribution of common estimators is complicated and hard to compute.

Review: The Lasso Estimator

β̂_Lasso = argmin_β { (1/(2n)) ||Y − Xβ||_2^2 + λ_1 ||β||_1 }

The limiting distribution of the Lasso is complicated [Knight and Fu, 2000], and the usual residual Bootstrap Lasso fails to estimate the limiting distribution and thus cannot be used to construct valid confidence intervals.

Various modifications of the Lasso have been proposed to form a valid inference procedure:
Bootstrap thresholded Lasso [Chatterjee and Lahiri, 2011]
Bootstrap Lasso+OLS [Liu and Yu, 2013]
De-sparsified (de-biased) Lasso methods [Zhang and Zhang, 2014, Van de Geer et al., 2014, Javanmard and Montanari, 2014]

Existing Inference Approaches

Sample splitting based methods [Wasserman and Roeder, 2009]

Bootstrap / resampling based methods:
Perturbation resampling based method for a fixed p [Minnier et al., 2009]
Modified residual Bootstrap Lasso method for a fixed p [Chatterjee and Lahiri, 2011]
Residual Bootstrap adaptive Lasso for p growing at a polynomial rate [Chatterjee and Lahiri, 2013]
Residual Bootstrap method based on a two-stage estimator, Lasso+OLS [Liu and Yu, 2013]

De-sparsified (de-biased) Lasso methods:
LDPE [Zhang and Zhang, 2014]
JM [Javanmard and Montanari, 2014]

Other methods: post-selection inference [Berk et al., 2013, Lee et al., 2016], the knockoff filter [Barber and Candès, 2015], and others [Dezeure et al., 2014]

De-sparsified Lasso Methods

De-sparsified Lasso methods aim to remove the bias of the Lasso estimates and produce an asymptotically Normal estimate for each individual parameter.

Advantages:
Do not rely on the beta-min condition: min_{j: β^0_j ≠ 0} |β^0_j| ≫ 1/√n
Theoretically proven benchmark for high-dimensional inference

Disadvantages:
High computational cost
Require good estimation of the precision matrix
Require s·log(p)/√n → 0 as n → ∞ to remove the asymptotic bias
Rely heavily on the sparse linear model assumption and may have poor performance for misspecified models

Bootstrap Lasso+OLS

A two-stage estimator, Lasso+OLS:
Use Lasso to select variables
Use Ordinary Least Squares (OLS) to estimate the coefficients of the selected variables

Advantages:
Canonical and simple statistical techniques
Coverage probabilities and interval lengths comparable to those of the de-sparsified Lasso methods

Disadvantages:
Requires hard sparsity (β^0 has at most s (s ≪ n) non-zero elements)
Poor coverage probabilities for small but non-zero coefficients ([0, 0] confidence intervals in extreme cases)
Requires the beta-min condition

Bootstrap Lasso+Partial Ridge

Contribution 1: Hard Sparsity → Cliff-weak-sparsity

Definition (Cliff-weak-sparsity). β^0 satisfies the cliff-weak-sparsity condition if its elements can be divided into two groups:
the first group has s (s ≪ n) large elements with absolute values much larger than 1/√n;
the second group contains (p − s) small elements with absolute values much smaller than 1/√n.

Without loss of generality, we assume β^0 = (β^0_1, ..., β^0_s, β^0_{s+1}, ..., β^0_p)^T with |β^0_j| ≫ 1/√n for j = 1, ..., s and |β^0_j| ≪ 1/√n for j = s + 1, ..., p. Let S = {1, ..., s}, and denote β^0_S = (β^0_1, ..., β^0_s).

Contribution 2: Lasso+OLS → Lasso+Partial Ridge Estimator

Motivation: to increase the variance of our estimates for small coefficients whose corresponding predictors are missed by the Lasso.

A two-stage estimator, Lasso+Partial Ridge (LPR):
Use Lasso to select variables
Use Partial Ridge to estimate the coefficients

Partial Ridge minimizes the empirical l_2 loss with no penalty on the selected predictors but an l_2 penalty on the unselected predictors, so as to reduce the bias of the coefficient estimates of the selected predictors while increasing the variance of the coefficient estimates of the unselected predictors.

Formally, let Ŝ = {j ∈ {1, 2, ..., p} : (β̂_Lasso)_j ≠ 0} be the set of predictors selected by the Lasso. Then the LPR estimator is defined as:

β̂_LPR = argmin_β { (1/(2n)) ||Y − Xβ||_2^2 + (λ_2/2) Σ_{j ∉ Ŝ} β_j^2 }
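A minimal R sketch of this two-stage estimator follows (glmnet is used for the Lasso stage; the lpr() helper, the closed-form partial-ridge solve, and the default λ_2 = 1/n are illustrative choices, not the HDCI implementation; the solve assumes the resulting system is non-singular):

```r
## Minimal sketch of the Lasso + Partial Ridge (LPR) estimator.
## Assumes glmnet and that the columns of X (and Y) are centered, so no intercept.
library(glmnet)

lpr <- function(X, Y, lambda2 = 1 / nrow(X)) {
  n <- nrow(X); p <- ncol(X)

  ## Stage 1: Lasso selection (lambda_1 chosen by 10-fold cross-validation).
  cvfit   <- cv.glmnet(X, Y, alpha = 1, intercept = FALSE, standardize = FALSE)
  b_lasso <- as.numeric(coef(cvfit, s = "lambda.min"))[-1]
  S_hat   <- which(b_lasso != 0)

  ## Stage 2: Partial Ridge -- no penalty on the selected coordinates,
  ## an l2 penalty (weight lambda2) on the unselected ones:
  ##   min_beta (1/(2n))||Y - X beta||^2 + (lambda2/2) sum_{j not in S_hat} beta_j^2.
  ## First-order condition: (X'X/n + lambda2 * D) beta = X'Y/n,
  ## where D is diagonal with D_jj = 0 for j in S_hat and 1 otherwise.
  D <- rep(1, p); D[S_hat] <- 0
  A <- crossprod(X) / n + lambda2 * diag(D)
  beta_lpr <- solve(A, crossprod(X, Y) / n)   # assumes A is invertible
  list(beta = as.numeric(beta_lpr), selected = S_hat)
}
```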

Approach 1: Residual Bootstrap Lasso+Partial Ridge (rblpr)

For a deterministic design matrix X in a linear regression model, the residual Bootstrap is a standard method for constructing confidence intervals.

How to define the residuals: from the Lasso, the Lasso+OLS, or the LPR fit? Simulation suggests that the residuals obtained from the Lasso+OLS estimates approximate the true distribution of the errors ε_i best.

Let β̂_Lasso+OLS denote the Lasso+OLS estimator,

β̂_Lasso+OLS = argmin_{β: β_{Ŝ^c} = 0} (1/(2n)) ||Y − Xβ||_2^2,

where β_{Ŝ^c} = {β_j : j ∉ Ŝ}. The residual vector is given by:

ε̂ = (ε̂_1, ..., ε̂_n)^T = Y − Xβ̂_Lasso+OLS

The rblpr Algorithm

Input: data (X, Y); confidence level 1 − α; number of Bootstrap replications B
Output: (1 − α) confidence interval [l_j, u_j] for β^0_j, j = 1, ..., p

Algorithm:
1. Compute the Lasso+OLS estimator β̂_Lasso+OLS given the data (X, Y)
2. Compute the residual vector ε̂ = (ε̂_1, ..., ε̂_n)^T = Y − Xβ̂_Lasso+OLS
3. Re-sample from the empirical distribution of the centered residuals {ε̂_i − ε̄, i = 1, ..., n}, where ε̄ = (1/n) Σ_{i=1}^n ε̂_i, to form ε* = (ε*_1, ..., ε*_n)^T
4. Generate the residual Bootstrap response Y_rboot = Xβ̂_Lasso+OLS + ε*
5. Compute the residual Bootstrap Lasso (rblasso) estimator β̂_rblasso as

β̂_rblasso = argmin_β { (1/(2n)) ||Y_rboot − Xβ||_2^2 + λ_1 ||β||_1 },

and define Ŝ_rblasso = {j ∈ {1, 2, ..., p} : (β̂_rblasso)_j ≠ 0}

The rblpr Algorithm (Cont'd)

6. Compute the residual Bootstrap LPR (rblpr) estimator β̂_rblpr based on (X, Y_rboot) as

β̂_rblpr = argmin_β { (1/(2n)) ||Y_rboot − Xβ||_2^2 + (λ_2/2) Σ_{j ∉ Ŝ_rblasso} β_j^2 }

7. Repeat steps 3-6 B times to obtain β̂_rblpr^(1), ..., β̂_rblpr^(B)
8. For each j = 1, ..., p, compute the α/2 and (1 − α/2) quantiles of {(β̂_rblpr^(b))_j, b = 1, ..., B} and denote them by a_j and b_j respectively
9. Output
l_j = (β̂_LPR)_j + (β̂_Lasso+OLS)_j − b_j
u_j = (β̂_LPR)_j + (β̂_Lasso+OLS)_j − a_j
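A condensed R sketch of this residual Bootstrap loop is shown below; it reuses the illustrative lpr() helper sketched earlier, re-selects λ_1 by cross-validation inside every replicate (in practice one would likely fix it for speed), and all names and defaults are placeholders rather than the paper's implementation:

```r
## Sketch of rblpr confidence intervals, reusing the lpr() helper sketched above.
## Assumes centered columns of X; B is kept modest for illustration only.
rblpr_ci <- function(X, Y, B = 500, alpha = 0.05, lambda2 = 1 / nrow(X)) {
  n <- nrow(X); p <- ncol(X)

  ## Steps 1-2: Lasso+OLS fit and its residuals.
  cvfit <- cv.glmnet(X, Y, alpha = 1, intercept = FALSE, standardize = FALSE)
  S_hat <- which(as.numeric(coef(cvfit, s = "lambda.min"))[-1] != 0)
  b_lasso_ols <- rep(0, p)
  b_lasso_ols[S_hat] <- coef(lm(Y ~ X[, S_hat] - 1))
  res <- as.numeric(Y - X %*% b_lasso_ols)
  res <- res - mean(res)                      # centered residuals (step 3)

  b_lpr <- lpr(X, Y, lambda2)$beta            # LPR estimate on the original data

  boot <- replicate(B, {
    eps_star <- sample(res, n, replace = TRUE)            # resampled residuals
    Y_star   <- as.numeric(X %*% b_lasso_ols) + eps_star  # step 4
    lpr(X, Y_star, lambda2)$beta                          # steps 5-6
  })                                           # p x B matrix of Bootstrap estimates

  q <- apply(boot, 1, quantile, probs = c(alpha / 2, 1 - alpha / 2))
  ## Step 9: intervals centered through the Lasso+OLS "Bootstrap truth".
  cbind(lower = b_lpr + b_lasso_ols - q[2, ],
        upper = b_lpr + b_lasso_ols - q[1, ])
}
```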

Approach 2: Paired Bootstrap Lasso+Partial Ridge (pblpr)

For a random design matrix X in a linear regression model, the paired Bootstrap is a standard method for constructing confidence intervals.

In the paired Bootstrap, one generates a Bootstrap sample {(x*_i, y*_i), i = 1, ..., n} from the empirical joint distribution of {(x_i, y_i), i = 1, ..., n} and then computes the estimator based on the Bootstrap sample.

The pblpr Algorithm

Input: data (X, Y); confidence level 1 − α; number of Bootstrap replications B
Output: (1 − α) confidence interval [l_j, u_j] for β^0_j, j = 1, ..., p

Algorithm:
1. Generate a Bootstrap sample (X_pboot, Y_pboot) = {(x*_i, y*_i), i = 1, ..., n} from the empirical distribution of {(x_i, y_i), i = 1, ..., n}
2. Compute the paired Bootstrap Lasso (pblasso) estimator β̂_pblasso as

β̂_pblasso = argmin_β { (1/(2n)) ||Y_pboot − X_pboot β||_2^2 + λ_1 ||β||_1 },

and define Ŝ_pblasso = {j ∈ {1, 2, ..., p} : (β̂_pblasso)_j ≠ 0}

The pblpr Algorithm (Cont'd)

3. Compute the paired Bootstrap LPR (pblpr) estimator as

β̂_pblpr = argmin_β { (1/(2n)) ||Y_pboot − X_pboot β||_2^2 + (λ_2/2) Σ_{j ∉ Ŝ_pblasso} β_j^2 }

4. Repeat steps 1-3 B times to obtain β̂_pblpr^(1), ..., β̂_pblpr^(B)
5. For each j = 1, ..., p, compute the α/2 and (1 − α/2) quantiles of {(β̂_pblpr^(b))_j, b = 1, ..., B} and output them as l_j and u_j
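A short sketch of the paired version, which differs from the residual version only in how the Bootstrap data are formed (again reusing the illustrative lpr() helper; percentile intervals as in step 5):

```r
## Sketch of pblpr confidence intervals: resample (x_i, y_i) pairs, refit LPR.
pblpr_ci <- function(X, Y, B = 500, alpha = 0.05, lambda2 = 1 / nrow(X)) {
  n <- nrow(X)
  boot <- replicate(B, {
    idx <- sample.int(n, n, replace = TRUE)              # paired Bootstrap sample (step 1)
    lpr(X[idx, , drop = FALSE], Y[idx], lambda2)$beta    # steps 2-3
  })
  ## Step 5: percentile intervals from the Bootstrap distribution, one row per beta_j.
  t(apply(boot, 1, quantile, probs = c(alpha / 2, 1 - alpha / 2)))
}
```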

Theoretical Results

Model Selection Consistency of the Lasso under Cliff-weak-sparsity

Theorem (Model selection consistency of the Lasso). Under the cliff-weak-sparsity and other reasonable conditions, we have

P( (β̂_Lasso)_S =_s β^0_S, (β̂_Lasso)_{S^c} = 0 ) = 1 − o(e^{−n^{c_2}}) → 1

as n → ∞, where 0 < c_2 < 1 and =_s denotes equality of signs.

Model Selection Consistency of rblasso

Let P* denote the conditional (Bootstrap) probability given the data (X, Y). The following theorem shows that the residual Bootstrap Lasso (rblasso) estimator also has sign consistency under the cliff-weak-sparsity and other appropriate conditions.

Theorem (Model selection consistency of rblasso). Under the cliff-weak-sparsity and other reasonable conditions, the residual Bootstrap Lasso estimator has sign consistency, i.e.,

P*( (β̂_rblasso)_S =_s β^0_S, (β̂_rblasso)_{S^c} = 0 ) = 1 − o_p(e^{−n^{c_2}}) → 1

as n → ∞, where 0 < c_2 < 1.

Convergence in Distribution

By the above two theorems, and under an orthogonality condition on the design matrix X, we can show that the residual Bootstrap LPR (rblpr) consistently estimates the distribution of β̂_LPR and hence can be used to construct asymptotically valid confidence intervals for β^0.

Theorem. Under reasonable conditions and the orthogonality condition on X, for any u ∈ R^p with ||u||_2 = 1 and max_{1 ≤ i ≤ n} |u^T x_i| = o(√n), we have

d(L*_n, L_n) → 0 in probability,

where L*_n is the conditional distribution of √n u^T (β̂_rblpr − β̂_Lasso+OLS) given ε, L_n is the distribution of √n u^T (β̂_LPR − β^0), and d denotes the Kolmogorov-Smirnov distance (the sup norm between the distribution functions).

Simulation Results

Simulation Setups: Generative Model 1

We consider two generative models for data simulation.

1. Linear regression model. The simulated data are drawn from the linear model: y_i = x_i^T β^0 + ε_i, ε_i ~ N(0, σ^2), i = 1, ..., n. We fix n = 200 and p = 500. We generate the design matrix X in three scenarios (using the R package mvtnorm).

Simulation Setups: Three Scenarios to Generate X

Scenario 1 (Normal): x_i i.i.d. N(0, Σ), i = 1, ..., n. We consider three types of Σ [Dezeure et al., 2014]:
Toeplitz: Σ_ij = ρ^{|i−j|} with ρ = 0.5, 0.9
Exponential decay: (Σ^{−1})_ij = ρ^{|i−j|} with ρ = 0.5, 0.9
Equal correlation: Σ_ij = ρ (i ≠ j) with ρ = 0.5, 0.9

Scenario 2 (t_2): x_i i.i.d. t_2(0, Σ), i = 1, ..., n, with the Toeplitz matrix Σ: Σ_ij = ρ^{|i−j|}, where ρ = 0.5, 0.9.

In Scenarios 1 and 2, we choose σ such that the Signal-to-Noise Ratio (SNR) = ||Xβ^0||_2^2 / (nσ^2) = 10.

Scenario 3 (fMRI data): a 200 × 500 design matrix X is generated by random sampling without replacement from the real 1750 × 2000 design matrix in the functional Magnetic Resonance Imaging (fMRI) data [Kay et al., 2008]. Every column of X is normalized to have zero mean and unit variance, and we choose σ such that SNR = 1, 5 or 10.

Simulation Setups: Two Cases to Generate β^0

Case 1 (hard sparsity): β^0 has 10 nonzero elements whose indices are randomly sampled without replacement from {1, 2, ..., p} and whose values are generated from U[1/3, 1], the uniform distribution on the interval [1/3, 1]. The remaining 490 elements are set to 0.

Case 2 (weak sparsity): the setup is similar to that of [Zhang and Zhang, 2014]. β^0 has 10 large elements whose indices are randomly sampled without replacement from {1, 2, ..., p} and whose values are generated from a normal distribution N(1, 0.001). The remaining 490 elements decay at the rate 1/(j + 3)^2, i.e., β^0_j = 1/(j + 3)^2.
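A sketch of Scenario 1 with a Toeplitz covariance together with the two coefficient cases (assuming the mvtnorm package mentioned above; the seed is arbitrary, N(1, 0.001) is read as mean 1 and variance 0.001, and the decay indexing in Case 2 is one plausible reading of the slide):

```r
## Sketch of the simulation design: Toeplitz-Normal X and the two beta cases.
## Assumes the mvtnorm package; all specific choices below are illustrative.
library(mvtnorm)

set.seed(1)
n <- 200; p <- 500; rho <- 0.5
Sigma <- rho^abs(outer(1:p, 1:p, "-"))           # Toeplitz: Sigma_ij = rho^|i-j|
X <- rmvnorm(n, mean = rep(0, p), sigma = Sigma)

s <- 10
idx <- sample.int(p, s)                          # indices of the 10 large coefficients

## Case 1 (hard sparsity): 10 coefficients from U[1/3, 1], the rest exactly 0.
beta_hard <- rep(0, p)
beta_hard[idx] <- runif(s, 1/3, 1)

## Case 2 (weak sparsity): 10 large coefficients from N(1, 0.001);
## the remaining elements decay as 1/(j + 3)^2.
beta_weak <- 1 / ((1:p) + 3)^2
beta_weak[idx] <- rnorm(s, mean = 1, sd = sqrt(0.001))

## Noise level chosen so that SNR = ||X beta||^2 / (n sigma^2) = 10.
sigma <- sqrt(sum((X %*% beta_hard)^2) / (n * 10))
Y <- as.numeric(X %*% beta_hard + rnorm(n, sd = sigma))
```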

Simulation Setups

X and β^0 are generated once and then kept fixed. We simulate Y = (y_1, ..., y_n)^T from the linear model by generating independent error terms for 1000 replications. We construct confidence intervals for each individual regression coefficient and compute their coverage probabilities and mean interval lengths.

Simulation Setups: Generative Model 2

2. Misspecified linear model. Let X and Y^f denote the design matrix and response (with n = 1750 and p = 2000) from the fMRI data set. We first compute the Lasso+OLS estimator β^f_Lasso+OLS (selecting the tuning parameter λ_1 by 5-fold cross-validation on Lasso+OLS):

β^f_Lasso = argmin_β { (1/(2n)) ||Y^f − Xβ||_2^2 + λ_1 ||β||_1 }
β^f_Lasso+OLS = argmin_{β: β_j = 0, j ∉ S} (1/(2n)) ||Y^f − Xβ||_2^2,

where S = {j : (β^f_Lasso)_j ≠ 0} is the set of relevant predictors.

Simulation Setups: Generative Model 2 (Cont'd)

Then we generate the simulated response Y = (y_1, ..., y_n)^T from the following model:

y_i = E(y_i | x_i) + ε_i, ε_i ~ N(0, σ^2),

where

E(y_i | x_i) = x_i^T β^f_Lasso+OLS + Σ_{j=1}^4 α_j x_ij^2 + Σ_{1 ≤ j < k ≤ 4} α_jk x_ij x_ik,

and α_j, j = 1, ..., 4, and α_jk, 1 ≤ j < k ≤ 4, are independently generated from the uniform distribution U(0, 0.1). The values of the α_j's and α_jk's are generated once and then kept fixed. We set σ such that SNR = Σ_{i=1}^n E(y_i | x_i)^2 / (nσ^2) = 1, 5 or 10.
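A sketch of this response-generating step; X_f and b_f stand for the fMRI design matrix and the Lasso+OLS coefficient vector described above and are placeholders here, as are the seed, the default SNR, and the reading that the quadratic and interaction terms use the first four predictors:

```r
## Sketch: generate responses from the misspecified model, given a design matrix
## X_f (n x p) and a Lasso+OLS coefficient vector b_f. Both are placeholders here.
set.seed(2)
alpha_main <- runif(4, 0, 0.1)                       # alpha_j,  j = 1, ..., 4
alpha_int  <- matrix(0, 4, 4)
alpha_int[upper.tri(alpha_int)] <- runif(6, 0, 0.1)  # alpha_jk, 1 <= j < k <= 4

misspecified_response <- function(X_f, b_f, snr = 5) {
  ## Conditional mean: linear part + quadratic terms in the first four predictors.
  mu <- as.numeric(X_f %*% b_f) + as.numeric(X_f[, 1:4]^2 %*% alpha_main)
  ## Pairwise interactions among the first four predictors.
  for (j in 1:3) for (k in (j + 1):4)
    mu <- mu + alpha_int[j, k] * X_f[, j] * X_f[, k]
  ## Noise level set so that SNR = sum_i E(y_i|x_i)^2 / (n sigma^2).
  sigma <- sqrt(sum(mu^2) / (nrow(X_f) * snr))
  mu + rnorm(nrow(X_f), sd = sigma)
}
```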

Selection of the Partial Ridge Tuning Parameter λ_2

Figure 1 [coverage probability and interval length curves for λ_2 = 0.1/n, 0.5/n, 1/n, 5/n, 10/n]: The effects of λ_2 on coverage probabilities and mean confidence interval lengths. The predictors are generated from a Normal distribution in Scenario 1 with a Toeplitz covariance matrix and ρ = 0.5. The coefficient vector β^0 is hard sparse.

pblpr (rblpr) vs. pblasso+ols (rblasso+ols)

Figure 2 [coverage probabilities and interval lengths of pblpr, rblpr, pbLassoOLS and rbLassoOLS; panels: hard/weak sparsity with ρ = 0.5, 0.9]: The design matrix is generated from a Normal distribution with a Toeplitz type covariance matrix.

pblpr vs. De-sparsified Lasso Methods

Figure 3 [coverage probabilities and interval lengths of pblpr, LDPE and JM; panels: hard/weak sparsity with ρ = 0.5, 0.9]: The design matrix is generated from a Normal distribution with a Toeplitz type covariance matrix.

pblpr vs. De-sparsified Lasso Methods

Figure 4 [coverage probabilities and interval lengths of pblpr, LDPE and JM; panels: hard/weak sparsity with ρ = 0.5, 0.9]: The design matrix is generated from a Normal distribution with an Equi.corr type covariance matrix.

pblpr vs. De-sparsified Lasso Methods

Figure 5 [coverage probabilities and interval lengths of pblpr, LDPE and JM; panels: hard/weak sparsity with ρ = 0.5, 0.9]: The design matrix is generated from a t_2 distribution with a Toeplitz type covariance matrix.

pblpr vs. De-sparsified Lasso Methods

Figure 6 [pblpr, LDPE and JM]: This plot is for hard sparsity and a Normal design matrix with a Toeplitz type covariance matrix and ρ = 0.5.

pblpr vs. De-sparsified Lasso Methods

Figure 7 [pblpr, LDPE and JM]: This plot is for hard sparsity and a Normal design matrix with a Toeplitz type covariance matrix and ρ = 0.9.

pblpr vs. De-sparsified Lasso Methods

Figure 8 [pblpr, LDPE and JM]: This plot is for weak sparsity and a Normal design matrix with a Toeplitz type covariance matrix and ρ = 0.5.

pblpr vs. De-sparsified Lasso Methods

Figure 9 [pblpr, LDPE and JM]: This plot is for weak sparsity and a Normal design matrix with a Toeplitz type covariance matrix and ρ = 0.9.

pblpr vs. De-sparsified Lasso Methods as SNR Changes

Figure 10 [coverage probabilities and interval lengths of pblpr, LDPE and JM; panels: SNR = 0.5, 1, 5, 10]: This plot is for hard sparsity and a Normal design matrix with a Toeplitz type covariance matrix and ρ = 0.5.

pblpr vs. De-sparsified Lasso Methods for the Misspecified Model

Figure 11 [coverage probabilities and interval lengths of pblpr, LDPE and JM; panels: SNR = 1, 5, 10]: The results are based on data simulated from the misspecified model.

Real Data Applications

fMRI Data

The 95% confidence intervals constructed by pblpr, LDPE and JM cover 95.8%, 97% and 99.6% of the 500 components of β^0, respectively.

Figure 12 [pblpr, LDPE and JM]: Comparison of interval lengths produced by pblpr, LDPE and JM. The plot is generated using the ninth voxel as the response.

Conclusions

Contributions

1. Our proposed Bootstrap LPR method relaxes the beta-min condition required by the Bootstrap Lasso+OLS method.

2. We conduct comprehensive simulation studies to evaluate the finite-sample performance of the Bootstrap LPR method for both sparse linear models and misspecified models. Our main findings include:
Compared with Bootstrap Lasso+OLS, Bootstrap LPR improves the coverage probabilities of 95% confidence intervals by about 50% on average for small but non-zero regression coefficients, at the price of a 15% heavier computational burden.
Compared with two de-sparsified Lasso methods, LDPE and JM, Bootstrap LPR has comparably good coverage probabilities for large and small regression coefficients, and in some cases outperforms LDPE and JM by producing confidence intervals that are on average more than 50% shorter. Moreover, Bootstrap LPR is more than 30% faster than LDPE and JM and is robust to model misspecification.

3. We extend model selection consistency of the Lasso from the hard sparsity case [Zhao and Yu, 2006, Wainwright, 2009], where the parameter β^0 is assumed to be exactly sparse (β^0 has s (s ≪ n) non-zero elements with absolute values larger than 1/√n), to the more general cliff-weak-sparsity case. Under the irrepresentable condition and other reasonable conditions, we show that the Lasso can correctly select all the large elements of β^0 while shrinking all the small elements to zero.

4. We develop an R package HDCI to implement the Bootstrap Lasso, the Bootstrap Lasso+OLS and our new Bootstrap LPR methods. This package makes these methods easily accessible to practitioners.

Paper: A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models, by Hanzhong Liu, Xin Xu, and Jingyi Jessica Li. https://arxiv.org/pdf/1706.02150.pdf Email: jli@stat.ucla.edu

References I

Barber, R. F. and Candès, E. J. (2015). Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43:2055–2085.
Berk, R., Brown, L., Buja, A., Zhang, K., and Zhao, L. (2013). Valid post-selection inference. The Annals of Statistics, 41:802–837.
Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer.
Chatterjee, A. and Lahiri, S. N. (2011). Bootstrapping lasso estimators. Journal of the American Statistical Association, 106:608–625.

References II

Chatterjee, A. and Lahiri, S. N. (2013). Rates of convergence of the adaptive lasso estimators to the oracle distribution and higher order refinements by the bootstrap. The Annals of Statistics, 41:1232–1259.
Dezeure, R., Bühlmann, P., Meier, L., and Meinshausen, N. (2014). High-dimensional inference: Confidence intervals, p-values and R-software hdi. Statistical Science, 30:533–558.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360.

References III

Fan, J. and Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20:101–148.
Frank, L. E. and Friedman, J. H. (1993). A statistical view of some chemometrics regression tools. Technometrics, 35(2):109–135.
Javanmard, A. and Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research, 15:2869–2909.
Kay, K. N., Naselaris, T., Prenger, R. J., and Gallant, J. L. (2008). Identifying natural images from human brain activity. Nature, 452:352–355.

References IV

Knight, K. and Fu, W. J. (2000). Asymptotics for lasso-type estimators. The Annals of Statistics, 28:1356–1378.
Lee, J. D., Sun, D. L., Sun, Y., and Taylor, J. E. (2016). Exact post-selection inference, with application to the lasso. The Annals of Statistics, 44(3):907–927.
Liu, H. and Yu, B. (2013). Asymptotic properties of lasso+mls and lasso+ridge in sparse high-dimensional linear regression. Electronic Journal of Statistics, 7:3124–3169.
Minnier, J., Tian, L., and Cai, T. (2009). A perturbation method for inference on regularized regression estimates. Journal of the American Statistical Association, 106:1371–1382.

References V

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B, 58:267–288.
Van de Geer, S., Bühlmann, P., Ritov, Y., Dezeure, R., et al. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202.
Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using l_1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55:2183–2202.
Wasserman, L. and Roeder, K. (2009). High dimensional variable selection. The Annals of Statistics, 37:2178–2201.

References VI

Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society Series B, 76(1):217–242.
Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541–2563.