high-dimensional inference robust to the lack of model sparsity


high-dimensional inference robust to the lack of model sparsity

Jelena Bradic (joint with a PhD student, Yinchu Zhu)
Assistant Professor, Department of Mathematics, University of California, San Diego
www.jelenabradic.net · jbradic@ucsd.edu

Outline: Introduction · Example: spurious results · CorrT Methodology · Theoretical Properties · Numerical Experiments (WUSL, Sep 2016)

Setup

Consider a high-dimensional linear regression setting,

$$Y = X\gamma + Z\beta + \varepsilon, \qquad (1)$$

where $Z \in \mathbb{R}^n$ and $X \in \mathbb{R}^{n \times p}$ are the design matrices, $p \gg n$, $\varepsilon \in \mathbb{R}^n$ is the error term, independent of the design, with $E(\varepsilon) = 0$ and $E(\varepsilon\varepsilon^\top) = \sigma_\varepsilon^2 I_n$, and $\gamma$ and $\beta$ are unknown model parameters. We focus on the problem of testing single entries of the model parameter, namely the hypothesis

$$H_0: \beta = \beta_0 \quad \text{versus} \quad H_1: \beta \neq \beta_0. \qquad (2)$$

Sparsity assumption: $\|\gamma\|_0 := s_\gamma \ll n$, and for inference procedures it is typically required that $s_\gamma \log p / \sqrt{n} \to 0$ as $n \to \infty$.
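To fix ideas, here is a minimal sketch of data generated from model (1). All concrete sizes and coefficient choices below are illustrative assumptions, not values from the talk; the dense choice of $\gamma$ is the regime the talk is concerned with.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 500          # illustrative: p >> n
beta = 0.5               # scalar parameter under test
X = rng.standard_normal((n, p))
Z = rng.standard_normal(n)

# sparse gamma: s_gamma nonzeros, in the s_gamma * log(p) / sqrt(n) -> 0 regime
gamma_sparse = np.zeros(p)
gamma_sparse[:3] = 1.0

# dense gamma: every coordinate active, ||gamma||_2 bounded
gamma_dense = np.full(p, p ** -0.5)

eps = rng.standard_normal(n)           # E(eps) = 0, sigma_eps^2 = 1
Y = X @ gamma_dense + Z * beta + eps   # model (1)
```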

Setup

The sparsity condition is also the backbone of various inference methods for the testing problem (2), including approaches based on consistent variable selection, such as Fan and Li (2001), Wasserman and Roeder (2009) and Chatterjee et al. (2013), and recent advances, such as Bühlmann (2013), Zhang and Zhang (2014) (ZZ), Van de Geer et al. (2014) (VBRD), Javanmard and Montanari (2014a) (JM), Belloni et al. (2014) (BCH) and Ning and Liu (2014) (NL). However, the sparsity condition might not be satisfied in practice. For contemporary large-scale datasets, the sparsity assumption can be hard to justify; see, e.g., Pritchard (2001), Ward (2009), Jin and Ke (2014) and Hall et al. (2014). Indeed, Kraft and Hunter (2009), Donoho and Jin (2015), Janson et al. (2015) and Dicker (2016) provide evidence suggesting that such an ideal assumption might not be valid. What happens if we apply sparsity-based methods when the underlying model parameter is not sparse? Can we obtain misleading and spurious results?

Example 1

Assume: $X = I_p$ (so $n = p$), the $\varepsilon_i$ are i.i.d. $N(0, 1)$, and, for some $a \in [-10, 10]$,

$$\beta = 0 \quad \text{and} \quad \gamma = a\,p^{-1/2}\,\mathbf{1}_p.$$

We consider the de-biasing approach as formulated in [?]. Let $\pi = (\beta, \gamma^\top)^\top \in \mathbb{R}^{p+1}$ and $W = (Z, X) \in \mathbb{R}^{n \times (p+1)}$. Given an initial estimator $\tilde\pi$, the de-biased estimator is then defined as

$$\hat\pi = \tilde\pi + I_{p+1} W^\top (Y - W\tilde\pi)/n.$$

The Wald test rejects the hypothesis whenever $|\hat\pi_1| > \Phi^{-1}(1 - \alpha/2)/\sqrt{n}$.

Theorem 1. In the above setup, we have

$$\lim_{n\to\infty} P\big(|\hat\pi_1| > \Phi^{-1}(1 - \alpha/2)/\sqrt{n}\big) = F(\alpha, a), \quad \text{where} \quad F(\alpha, a) = 2 - 2\Phi\Big(\Phi^{-1}(1 - \alpha/2)\big/\sqrt{1 + a^2}\Big).$$

Figure: Plot of the asymptotic Type I error of the Wald test. The horizontal axis denotes $a \in [-10, 10]$ and the vertical axis denotes $F(\alpha, a)$, with one curve for each nominal level $\alpha \in \{0.01, 0.02, 0.05, 0.1\}$; the curves rise far above the nominal level as $|a|$ grows.
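The curves in the figure can be reproduced directly from the closed form in Theorem 1; a minimal sketch (the grid of $a$ values is arbitrary):

```python
import numpy as np
from scipy.stats import norm

def F(alpha, a):
    """Asymptotic Type I error of the identity-debiased Wald test (Theorem 1)."""
    return 2 - 2 * norm.cdf(norm.ppf(1 - alpha / 2) / np.sqrt(1 + np.asarray(a) ** 2))

a = np.linspace(-10, 10, 201)
for alpha in (0.01, 0.02, 0.05, 0.1):
    # F(alpha, 0) recovers the nominal level; the maximum is far above it
    print(alpha, F(alpha, 0.0), F(alpha, a).max())
```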

Goals of our work

To develop sparsity-robust tests for the hypothesis (2). We say that a test is sparsity-robust if its Type I error is asymptotically bounded by the nominal level, regardless of whether or not $\gamma$ is sparse. We propose a new test, CorrT, whose Type I error is asymptotically equal to the nominal level without any assumption on the sparsity of $\gamma$. Our methodology is based on exploiting the implication of the null hypothesis: instead of directly estimating the parameter under testing, we test a moment condition that is equivalent to the null hypothesis. Moreover, whenever the sparsity condition holds, our method is shown to be optimal and matches existing sparsity-based methods in terms of Type II error. Therefore, compared to sparsity-based tests, the proposed method is more robust and does not lose efficiency.

Outline: Introduction · CorrT Methodology (Moment Condition · Adaptive Estimation · Test Statistic · Computational Aspects) · Theoretical Properties · Numerical Experiments

A transformation

We observe that the new variable $V := Y - Z\beta_0$ satisfies a linear model

$$V = X\gamma + e, \qquad e = Z(\beta - \beta_0) + \varepsilon.$$

We formally introduce a model to account for the dependence between $X$ and $Z$:

$$Z = X\theta + u, \qquad (3)$$

where $\theta \in \mathbb{R}^p$ is sparse and $u \in \mathbb{R}^n$ is independent of $X$ with mean zero and variance $E(uu^\top) = \sigma_u^2 I_n$. We notice that

$$E\big[(V - X\gamma)^\top (Z - X\theta)\big]/n = \sigma_u^2(\beta - \beta_0).$$

Hence, solving the inference problem (2) is equivalent to testing

$$H_0: E\big[(V - X\gamma)^\top (Z - X\theta)\big] = 0 \quad \text{vs} \quad H_1: E\big[(V - X\gamma)^\top (Z - X\theta)\big] \neq 0. \qquad (4)$$
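The moment identity is easy to verify by Monte Carlo; a sketch under assumed Gaussian designs (all concrete sizes and parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 50
beta, beta0 = 1.0, 0.0               # a false null, so beta - beta0 = 1
gamma = rng.standard_normal(p) / np.sqrt(p)   # dense gamma is fine here
theta = np.zeros(p)
theta[:2] = 0.7                      # sparse theta, as in model (3)
sigma_u = 1.5

vals = []
for _ in range(2000):
    X = rng.standard_normal((n, p))
    u = sigma_u * rng.standard_normal(n)
    Z = X @ theta + u                # model (3)
    eps = rng.standard_normal(n)
    Y = X @ gamma + Z * beta + eps   # model (1)
    V = Y - Z * beta0
    vals.append((V - X @ gamma) @ (Z - X @ theta) / n)

# Monte Carlo mean vs. sigma_u^2 * (beta - beta0) = 2.25
print(np.mean(vals), sigma_u ** 2 * (beta - beta0))
```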

Self-adaptive estimation

We define the following estimator:

$$\hat\gamma = \hat\gamma(\hat\sigma_\gamma)\,\mathbf{1}\{S_\gamma \neq \emptyset\} + \hat\gamma\big(2n^{-1}\|V\|_2\big)\,\mathbf{1}\{S_\gamma = \emptyset\}, \qquad (5)$$

where $\hat\sigma_\gamma = \arg\max\{\sigma : \sigma \in S_\gamma\}$ and the set $S_\gamma$ is defined as

$$S_\gamma = \Big\{\sigma \geq \rho_n \|V\|_2/\sqrt{n} : \; 1.5\sigma \geq n^{-1/2}\|V - X\hat\gamma(\sigma)\|_2 \geq 0.5\sigma\Big\}, \qquad (6)$$

and

$$\hat\gamma(\sigma) := \arg\min_{\gamma \in \mathbb{R}^p} \|\gamma\|_1 \quad \text{s.t.} \quad \big\|n^{-1}X^\top(V - X\gamma)\big\|_\infty \leq \eta_0\sigma, \quad \|V - X\gamma\|_\infty \leq \|V\|_2/\log^2 n, \quad n^{-1}V^\top(V - X\gamma) \leq \rho_n n^{-1}\|V\|_2^2, \qquad (7)$$

for $\eta_0 = 1.1\,\Phi^{-1}(1 - p^{-1}n^{-1}) \max_{1\leq j\leq p}\sqrt{n^{-1}\sum_{i=1}^n x_{i,j}^2}$ and $\rho_n = 0.01\,n^{-1/2}/\log n$.

When the estimation target fails to be sparse, the estimator is stable; when the estimation target is sparse, the estimator automatically achieves consistency. In either case, it does not require knowledge of the noise level.
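The selection rule (5)-(6) amounts to a search over $\sigma$; below is a minimal sketch of one practical discretization. The geometric grid and its endpoints are my choices, and `solve_gamma(sigma)` is a hypothetical placeholder for the program (7) (solved, e.g., by the linear program on the next slide).

```python
import numpy as np

def self_adaptive_gamma(V, X, solve_gamma, rho_n, grid_size=25):
    """Self-adaptive estimator (5): largest sigma in S_gamma of (6), with fallback."""
    n = len(V)
    scale = np.linalg.norm(V) / np.sqrt(n)
    best = None
    # search an increasing grid starting at the lower bound rho_n * ||V||_2 / sqrt(n)
    for sigma in np.geomspace(rho_n * scale, scale, grid_size):
        gamma_sig = solve_gamma(sigma)            # program (7) at this sigma
        resid = np.linalg.norm(V - X @ gamma_sig) / np.sqrt(n)
        if 0.5 * sigma <= resid <= 1.5 * sigma:   # sigma belongs to S_gamma
            best = gamma_sig                      # grid increases: keep largest sigma
    if best is not None:
        return best
    return solve_gamma(2 * np.linalg.norm(V) / n) # S_gamma empty: sigma = 2 ||V||_2 / n
```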

CorrT test statistic

We propose to consider the following correlation test (CorrT) statistic:

$$T_n(\beta_0) = \frac{n^{-1/2}\,(V - X\hat\gamma)^\top (Z - X\hat\theta)}{\hat\sigma_\varepsilon\,\hat\sigma_u}, \qquad (8)$$

where $\hat\sigma_\varepsilon = \|V - X\hat\gamma\|_2/\sqrt{n}$ and $\hat\sigma_u = \|Z - X\hat\theta\|_2/\sqrt{n}$. Hence, a test with nominal size $\alpha \in (0, 1)$ rejects the hypothesis (2) if and only if $|T_n(\beta_0)| > \Phi^{-1}(1 - \alpha/2)$.

Why does this work? We can show, without assuming sparsity of $\gamma$, that

$$n^{-1/2}(V - X\hat\gamma)^\top (Z - X\hat\theta) = n^{-1/2}(V - X\hat\gamma)^\top u + O_P\big(\sqrt{\log p}\,\|\hat\theta - \theta\|_1\big),$$

where, under the null hypothesis, the first term on the right-hand side has zero expectation and the second term vanishes fast enough.
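For illustration, here is a minimal sketch of the test given generic pilot estimators. It uses scikit-learn's LassoCV as a stand-in for the self-adaptive estimator (5) and for $\hat\theta$, which is a simplification of the procedure actually proposed, not the talk's method.

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LassoCV

def corrt_test(Y, X, Z, beta0, alpha=0.05):
    """CorrT statistic (8), with Lasso pilots standing in for (5)."""
    n = len(Y)
    V = Y - Z * beta0
    gamma_hat = LassoCV(cv=5).fit(X, V).coef_   # stand-in for estimator (5)
    theta_hat = LassoCV(cv=5).fit(X, Z).coef_   # estimate of theta in model (3)
    r_v = V - X @ gamma_hat
    r_z = Z - X @ theta_hat
    sigma_eps = np.linalg.norm(r_v) / np.sqrt(n)
    sigma_u = np.linalg.norm(r_z) / np.sqrt(n)
    T = (r_v @ r_z) / (np.sqrt(n) * sigma_eps * sigma_u)
    return T, abs(T) > norm.ppf(1 - alpha / 2)  # reject iff |T| exceeds the quantile
```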

Parametric Simplex Method

Let $\sigma > 0$, and denote $\hat\gamma(\sigma) = \hat b^+(\sigma) - \hat b^-(\sigma)$, where $\hat b(\sigma) = \big(\hat b^+(\sigma)^\top, \hat b^-(\sigma)^\top\big)^\top$ is defined as the solution to the following parametric right-hand-side linear program:

$$\hat b(\sigma) = \arg\max_{b \in \mathbb{R}^{2p},\, b \geq 0} M_1^\top b \quad \text{subject to} \quad M_2 b \leq M_3 + M_4 \sigma, \qquad (9)$$

where the matrices $M_1$-$M_4$ are taken to be

$$M_1 = -\mathbf{1}_{2p}, \qquad M_2 = \begin{pmatrix} n^{-1}X^\top X & -n^{-1}X^\top X \\ -n^{-1}X^\top X & n^{-1}X^\top X \\ X & -X \\ -X & X \\ -n^{-1}V^\top X & n^{-1}V^\top X \end{pmatrix} \in \mathbb{R}^{(2p+2n+1)\times 2p},$$

$$M_3 = \begin{pmatrix} n^{-1}X^\top V \\ -n^{-1}X^\top V \\ V + \mathbf{1}_n\,\|V\|_2/\log^2 n \\ -V + \mathbf{1}_n\,\|V\|_2/\log^2 n \\ -(1 - \rho_n)\,n^{-1}V^\top V \end{pmatrix} \in \mathbb{R}^{2p+2n+1}, \qquad M_4 = \begin{pmatrix} \eta_0 \mathbf{1}_p \\ \eta_0 \mathbf{1}_p \\ 0_n \\ 0_n \\ 0 \end{pmatrix} \in \mathbb{R}^{2p+2n+1}.$$
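For a fixed $\sigma$, program (9) is an ordinary linear program, so a generic LP solver can stand in for the parametric simplex method; a sketch using scipy.optimize.linprog, with the matrix blocks as reconstructed above (the parametric aspect, re-solving efficiently along a path of $\sigma$ values, is omitted):

```python
import numpy as np
from scipy.optimize import linprog

def gamma_of_sigma(V, X, sigma, eta0, rho_n):
    """Solve program (9) at a fixed sigma; returns gamma = b_plus - b_minus."""
    n, p = X.shape
    G = X.T @ X / n                    # n^{-1} X'X
    XtV = X.T @ V / n                  # n^{-1} X'V
    # constraint blocks: M2 b <= M3 + sigma * M4, b >= 0
    M2 = np.vstack([
        np.hstack([ G, -G]),           # |n^{-1} X'(V - X gamma)|_inf <= eta0 * sigma
        np.hstack([-G,  G]),
        np.hstack([ X, -X]),           # |V - X gamma|_inf <= ||V||_2 / log^2 n
        np.hstack([-X,  X]),
        np.hstack([-V @ X / n, V @ X / n])[None, :],  # n^{-1} V'(V - X gamma) <= rho_n n^{-1} ||V||_2^2
    ])
    slack = np.full(n, np.linalg.norm(V) / np.log(n) ** 2)
    M3 = np.concatenate([XtV, -XtV, V + slack, -V + slack,
                         [-(1 - rho_n) * (V @ V) / n]])
    M4 = np.concatenate([eta0 * np.ones(2 * p), np.zeros(2 * n + 1)])
    c = np.ones(2 * p)                 # minimize ||b+||_1 + ||b-||_1 = ||gamma||_1
    res = linprog(c, A_ub=M2, b_ub=M3 + sigma * M4, bounds=(0, None))
    return res.x[:p] - res.x[p:]       # res.x is None if (7) is infeasible at this sigma
```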

Outline: Introduction · CorrT Methodology · Theoretical Properties (Robustness to the lack of sparsity · Sparsity-adaptive property) · Numerical Experiments

Assumptions

Condition 1. Let $W = (Z, X)$ and $w_i = (z_i, x_i^\top)^\top$. The matrix $\Sigma_W = E[W^\top W]/n \in \mathbb{R}^{(p+1)\times(p+1)}$ satisfies $\kappa_1 \leq \sigma_{\min}(\Sigma_W) \leq \sigma_{\max}(\Sigma_W) \leq \kappa_2$. The vectors $\Sigma_W^{-1/2} w_i$ are centered, with sub-Gaussian norms bounded above by $\kappa_3$, and $E|\varepsilon_1|^{2+\delta} \leq \kappa_4$. Moreover, $\log p = o\big(n^{\delta/(2+\delta)}\big)$.

For the designs, it is standard to impose well-behaved covariance matrices and sub-Gaussian properties.

Condition 2. $\|\gamma\|_2 \leq \kappa_5$ and $s_\theta = o\big(\sqrt{n}/(\log n \log p)\big)$, where $s_\theta = \|\theta\|_0$.

The assumption on $s_\theta$ imposes sparsity on the first row of the precision matrix $\Sigma_W^{-1}$, and the rate for $s_\theta$ is stronger than the conditions in BCH and NL, which impose $o(\sqrt{n}/\log p)$, and in VBRD, which imposes $o(n/\log p)$.

Robustness to the lack of sparsity

Theorem 2. Let Conditions 1 and 2 hold. If $\log p = o\big(n^{\delta/(2+\delta)}\big)$, then for any $\alpha \in (0, 1)$, under $H_0$,

$$\lim_{n\to\infty} P\big(|T_n(\beta_0)| > \Phi^{-1}(1 - \alpha/2)\big) = \alpha.$$

Theorem 2 formally establishes that the new CorrT test is asymptotically exact for testing $\beta = \beta_0$. In particular, CorrT is robust to dense $\gamma$: even when $\gamma$ is dense, our procedure does not generate false positive results.

Estimation & inference tradeoff

Remark. Our result is theoretically intriguing, as it overcomes the limitations of inference based on the estimation principle. This principle relies on accurate estimation, which is challenging, if not impossible, for non-sparse models. To see the difficulty, consider the full model parameter $\pi := (\beta, \gamma^\top)^\top \in \mathbb{R}^{p+1}$. The minimax rate of estimation in $\ell_2$-loss for parameters in a $(p+1)$-dimensional $\ell_q$-ball with $q \in [0, 1]$ and radius $r_n$ is $r_n (n^{-1}\log p)^{1-q/2}$. Theorem 2 says that CorrT is valid even when $\pi$ cannot be consistently estimated. For example, suppose that $\log p \asymp n^c$ for a constant $c > 0$ and $\pi_j = 1/\sqrt{p}$ for $j = 1, \ldots, p+1$. Notice that, as $n \to \infty$, $\|\pi\|_q^q\,(n^{-1}\log p)^{1-q/2} \to \infty$ for any $q \in [0, 1]$, suggesting the failure of consistent estimation. Due to this difficulty, it appears quite unrealistic to expect valid inference on individual components of $\pi$ based on accurate estimates of $\pi$. In fact, the de-biasing technique does not perform well in this case, as discussed in Example 1.
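Filling in the arithmetic behind this remark (using the slide's own rate formula, with $r_n = \|\pi\|_q^q$ the radius of the smallest $\ell_q$-ball containing $\pi$):

$$\|\pi\|_q^q = (p+1)\,p^{-q/2} \asymp p^{1-q/2}, \qquad \|\pi\|_q^q\Big(\frac{\log p}{n}\Big)^{1-q/2} \asymp \Big(\frac{p\log p}{n}\Big)^{1-q/2} \longrightarrow \infty,$$

since $1 - q/2 \geq 1/2 > 0$ for $q \in [0, 1]$ and $p\log p/n \to \infty$ when $\log p \asymp n^c$. The minimax risk therefore diverges along this sequence, so no estimator of $\pi$ can be consistent.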

Sparsity-adaptive property

We say that a procedure for testing the hypothesis (2) is sparsity-adaptive if (i) the procedure does not require knowledge of $s_\gamma$, (ii) it provides valid inference under any $s_\gamma$, and (iii) it achieves efficiency when $\gamma$ is sparse. We now show the third property, efficiency under sparse $\gamma$. To formally discuss our results, we consider testing

$$H_0: \beta = \beta_0 \quad \text{versus} \quad H_{1,h}: \beta = \beta_0 + h/\sqrt{n}, \qquad (10)$$

where $h \in \mathbb{R}$ is a fixed constant.

Sparsity-adaptive property

Theorem 3. Let Conditions 1 and 2 hold. Suppose that $s_\gamma = o\big(n/\log(p \vee n)\big)$ and $\sigma_u/\sigma_\varepsilon \to \kappa_0$ for some constant $\kappa_0 > 0$. Then, under $H_{1,h}$ in (10),

$$P\big(|T_n(\beta_0)| > \Phi^{-1}(1 - \alpha/2)\big) \to \Psi(h, \kappa_0, \alpha),$$

where

$$\Psi(h, \kappa_0, \alpha) = 2 - \Phi\big(\Phi^{-1}(1 - \alpha/2) + h\kappa_0\big) - \Phi\big(\Phi^{-1}(1 - \alpha/2) - h\kappa_0\big).$$

Theorem 3 establishes the local power of CorrT. It turns out that this local power matches that of existing sparsity-based methods, such as VBRD, NL and BCH, which are shown to be efficient.
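The local power function $\Psi$ is easy to tabulate; a minimal sketch ($\kappa_0 = 1$ is an arbitrary illustrative value):

```python
import numpy as np
from scipy.stats import norm

def local_power(h, kappa0=1.0, alpha=0.05):
    """Psi(h, kappa0, alpha) from Theorem 3."""
    z = norm.ppf(1 - alpha / 2)
    return 2 - norm.cdf(z + h * kappa0) - norm.cdf(z - h * kappa0)

for h in (0.0, 1.0, 2.0, 4.0):
    print(h, round(local_power(h), 3))   # h = 0 recovers the size alpha
```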

Outline: Introduction · CorrT Methodology · Theoretical Properties · Numerical Experiments

Setting

LTD (light-tailed design): rows drawn from $N(0, \Sigma^{(\rho)})$, with the $(i, j)$ entry of $\Sigma^{(\rho)}$ being $\rho^{|i-j|}$.

HTD (heavy-tailed design): each row of $W$ is generated as $\Sigma_{(\rho)}^{1/2} U$, where $U \in \mathbb{R}^{p+1}$ contains i.i.d. random variables following Student's t-distribution with 3 degrees of freedom, normalized to have variance one (the third moment does not exist).

The error term $\varepsilon \in \mathbb{R}^n$ contains i.i.d. random variables from either $N(0, 1)$ (light-tailed error, LTE) or Student's t-distribution with 6 degrees of freedom, normalized to have variance one (heavy-tailed error, HTE).

We set

$$\pi_j = \begin{cases} 2/\sqrt{n} & 2 \leq j \leq 4, \\ 0 & j > \max\{s, 4\}, \\ U(0, 4)/\sqrt{n} & \text{otherwise.} \end{cases}$$

We test the hypothesis $H_0: \pi_3 = 2/\sqrt{n} + h$.
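A sketch of this simulation design; the normalization of the t variables and the AR(1) covariance follow the slide, while $n$, $p$, $\rho$ and $s$ below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, rho, s = 100, 500, 0.5, 10   # illustrative values

# AR(1) covariance, Sigma_ij = rho^|i-j|, for the (p+1)-dimensional rows of W = (Z, X)
idx = np.arange(p + 1)
Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
L = np.linalg.cholesky(Sigma)

def make_design(heavy_tailed):
    if heavy_tailed:
        # t(3) has variance 3; rescale to unit variance (its third moment does not exist)
        U = rng.standard_t(3, size=(n, p + 1)) / np.sqrt(3)
    else:
        U = rng.standard_normal((n, p + 1))
    return U @ L.T                  # each row distributed as Sigma^{1/2} U

W = make_design(heavy_tailed=True)  # HTD; heavy_tailed=False gives LTD
Z, X = W[:, 0], W[:, 1:]

# pi_j = 2/sqrt(n) for 2 <= j <= 4; 0 for j > max(s, 4); U(0,4)/sqrt(n) otherwise
pi = np.zeros(p + 1)
pi[1:4] = 2 / np.sqrt(n)                     # 1-based indices j = 2, 3, 4
other = np.r_[0, np.arange(4, max(s, 4))]    # j = 1 and 5 <= j <= max(s, 4)
pi[other] = rng.uniform(0, 4, size=other.size) / np.sqrt(n)
```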

Table: Size properties (h = 0)

             LTD + LTE, ρ = 0       LTD + LTE, ρ = 1/2     HTD + HTE, ρ = 0
          CorrT  Debias  Score   CorrT  Debias  Score   CorrT  Debias  Score
s = 1      0.03   0.05   0.04    0.05   0.04   0.05     0.06   0.04   0.02
s = 3      0.06   0.05   0.05    0.06   0.06   0.05     0.05   0.11   0.03
s = 5      0.09   0.09   0.09    0.07   0.11   0.10     0.07   0.04   0.04
s = 10     0.01   0.03   0.03    0.03   0.05   0.03     0.06   0.05   0.03
s = 20     0.08   0.12   0.11    0.03   0.06   0.06     0.03   0.12   0.04
s = 50     0.07   0.16   0.17    0.04   0.10   0.12     0.02   0.09   0.09
s = 100    0.05   0.29   0.28    0.01   0.15   0.14     0.05   0.20   0.21
s = n      0.04   0.35   0.33    0.04   0.27   0.27     0.04   0.38   0.38
s = p      0.07   0.54   0.52    0.04   0.39   0.40     0.05   0.57   0.53

             LTD + HTE, ρ = 0       LTD + HTE, ρ = 1/2     HTD + LTE, ρ = 0
          CorrT  Debias  Score   CorrT  Debias  Score   CorrT  Debias  Score
s = 1      0.03   0.05   0.04    0.04   0.04   0.02     0.06   0.05   0.05
s = 3      0.06   0.05   0.05    0.11   0.06   0.06     0.03   0.07   0.04
s = 5      0.09   0.09   0.09    0.05   0.06   0.05     0.06   0.11   0.07
s = 10     0.01   0.03   0.03    0.03   0.04   0.03     0.09   0.11   0.10
s = 20     0.08   0.12   0.11    0.06   0.11   0.10     0.05   0.13   0.06
s = 50     0.07   0.16   0.17    0.07   0.16   0.15     0.06   0.19   0.14
s = 100    0.05   0.29   0.28    0.05   0.33   0.26     0.05   0.24   0.22
s = n      0.04   0.35   0.33    0.05   0.43   0.41     0.05   0.40   0.31
s = p      0.07   0.54   0.52    0.06   0.51   0.50     0.06   0.53   0.51

Figure: Power curves under light-tailed errors. Each of the three panels plots power (vertical axis) against the local alternative $h$ (horizontal axis) for CorrT, Debias and Score.