Inference For High Dimensional M-estimates: Fixed Design Results


Inference For High Dimensional M-estimates: Fixed Design Results
Lihua Lei, Peter Bickel and Noureddine El Karoui
Department of Statistics, UC Berkeley
Berkeley-Stanford Econometrics Jamboree, 2017

Table of Contents
- Background
- Main Results
- Heuristics and Proof Techniques
- Numerical Results

Background

Setup
Consider a linear model: Y = Xβ* + ε, where
- y = (y_1, ..., y_n)^T ∈ R^n: response vector;
- X = (x_1^T, ..., x_n^T)^T ∈ R^{n×p}: design matrix;
- β* = (β*_1, ..., β*_p)^T ∈ R^p: coefficient vector;
- ε = (ε_1, ..., ε_n)^T ∈ R^n: unobserved random error with independent entries.

M-Estimator
Given a convex loss function ρ(·): R → [0, ∞),
    β̂ = argmin_{β ∈ R^p} (1/n) Σ_{i=1}^n ρ(y_i − x_i^T β).
When ρ is differentiable with ψ = ρ′, β̂ can be written as the solution of
    (1/n) Σ_{i=1}^n ψ(y_i − x_i^T β̂) = 0.

M-Estimator: Examples
- ρ(x) = x²/2 gives the least-squares estimator;
- ρ(x) = |x| gives the least-absolute-deviation estimator;
- ρ(x) = x²/2 for |x| ≤ k, and k(|x| − k/2) for |x| > k, gives the Huber estimator.
[Figure: ρ(x) and ψ(x) = ρ′(x) for the L2, L1, and Huber losses.]
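The following is a minimal numerical sketch (not the authors' code; the names huber_rho and m_estimate are ours) of computing an M-estimate by direct minimization of the empirical loss, using the Huber loss with the cutoff k = 1.345 that reappears in the experiments later in the deck:

    import numpy as np
    from scipy.optimize import minimize

    def huber_rho(x, k=1.345):
        # rho(x) = x^2/2 for |x| <= k, and k(|x| - k/2) for |x| > k
        return np.where(np.abs(x) <= k, x ** 2 / 2, k * (np.abs(x) - k / 2))

    def m_estimate(X, y, rho=huber_rho):
        # beta_hat = argmin_beta (1/n) sum_i rho(y_i - x_i^T beta)
        obj = lambda b: rho(y - X @ b).mean()
        b0 = np.linalg.lstsq(X, y, rcond=None)[0]  # least-squares warm start
        return minimize(obj, b0, method="BFGS").x

    rng = np.random.default_rng(0)
    n, p = 100, 30                       # the "original problem" sizes below
    X = rng.standard_normal((n, p))
    y = X @ np.zeros(p) + rng.standard_t(2, size=n)
    beta_hat = m_estimate(X, y)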

Goals (Informal)
Make inference on the coordinates of β* when
- X is treated as fixed;
- no assumption is imposed on β*; and
- the dimension p is comparable to the sample size n.
Why coordinates? Why fixed designs? Why assumption-free β*? Why p ≍ n?

Asymptotic Arguments: Motivation
- Consider β*_1 WLOG;
- ideally, we would construct a 95% confidence interval for β*_1 as
      [q_0.025(L(β̂_1)), q_0.975(L(β̂_1))],
  where q_α denotes the α-th quantile;
- unfortunately, L(β̂_1) is unknown;
- this motivates asymptotic arguments, i.e. finding a distribution F such that L(β̂_1) ≈ F.

Asymptotic Arguments: Textbook Version
The limiting behavior of β̂ when p is fixed, as n → ∞:
    L(β̂) ≈ N( β*, (X^T X)^{-1} · E ψ²(ε_1) / [E ψ′(ε_1)]² ).
As a consequence, we obtain an approximate 95% confidence interval for β*_1,
    [β̂_1 − 1.96 ŝd(β̂_1), β̂_1 + 1.96 ŝd(β̂_1)],
where ŝd(β̂_1) can be any consistent estimator of the standard deviation.
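As a sketch of this textbook recipe (and only of the recipe: the later slides argue precisely that it breaks down when p/n is not small), one can plug residual-based estimates of Eψ²(ε_1) and Eψ′(ε_1) into the sandwich formula; textbook_ci and the residual plug-in are our assumptions, not a prescription from the slides:

    import numpy as np

    def huber_psi(x, k=1.345):
        return np.clip(x, -k, k)            # psi = rho' for the Huber loss

    def textbook_ci(X, y, beta_hat, j=0, k=1.345, z=1.96):
        # Var(beta_hat) ~ (X^T X)^{-1} E psi^2(eps_1) / [E psi'(eps_1)]^2,
        # with both expectations estimated from the residuals -- an assumption
        # of this sketch, and exactly the step that becomes unreliable when
        # p/n is not small, since residuals then fail to mimic the errors.
        r = y - X @ beta_hat
        num = np.mean(huber_psi(r, k) ** 2)
        den = np.mean(np.abs(r) <= k)       # psi'(r) = 1{|r| <= k}
        sd = np.sqrt(np.linalg.inv(X.T @ X)[j, j] * num / den ** 2)
        return beta_hat[j] - z * sd, beta_hat[j] + z * sd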

Asymptotic Arguments: Hypothetical Problems
- Original problem (n = 100, p = 30): data (y, X) with X ∈ R^{n×p}, giving β̂_1;
- hypothetical problem (n_1 = 200, p_1 = 30): data (y_1, X_1) with X_1 ∈ R^{n_1×p_1}, giving β̂_1^(1);
- hypothetical problem (n_2 = 500, p_2 = 30): giving β̂_1^(2);
- hypothetical problem (n_3 = 2000, p_3 = 30): giving β̂_1^(3); and so on.
Asymptotic argument: use lim_{j→∞} L(β̂_1^(j)) to approximate L(β̂_1).

Asymptotic Arguments
- Huber [1973] raised the question of understanding the behavior of β̂ when both n and p tend to infinity;
- Huber [1973] showed the L2 consistency of β̂, i.e. ‖β̂ − β*‖₂² → 0, when p = o(n^{1/3});
- Portnoy [1984] proved the L2 consistency of β̂ when p = o(n / log n).

Asymptotic Arguments
Portnoy [1985] and Mammen [1989] showed that β̂ is jointly asymptotically normal when p ≪ n^{2/3}, in the sense that for any sequence of vectors a_n ∈ R^p,
    L( a_n^T(β̂ − β*) / √Var(a_n^T β̂) ) → N(0, 1).

p/n: A Measure of Difficulty
All of the above works require p/n → 0, i.e. n/p → ∞.
- n/p is the number of samples per parameter;
- classical rule of thumb: n/p ≥ 5–10;
- heuristically, a larger n/p gives an easier problem;
- hypothetical problems with n_j/p_j → ∞ are not appropriate, because they are increasingly easier than the original problem.

Moderate p/n Regime
Formally, we define the moderate p/n regime by p/n → κ > 0.
- Original problem (n = 100, p = 30): giving β̂_1;
- hypothetical problem (n_1 = 200, p_1 = 60): giving β̂_1^(1);
- hypothetical problem (n_2 = 500, p_2 = 150): giving β̂_1^(2);
- hypothetical problem (n_3 = 2000, p_3 = 600): giving β̂_1^(3); and so on.

Moderate p/n Regime: More Informative Asymptotics
A simulation to compare the fixed-p regime and the moderate p/n regime.
Original problem: n = 50, p = 50κ, Huber loss, i.i.d. ε_i's. Fix X and β*; for r = 1, 2, ..., draw a fresh error vector ε_r, form y_r = Xβ* + ε_r, and compute the M-estimates β̂_1^(1), β̂_1^(2), ..., β̂_1^(r).
⟹ L̂(β̂_1; X) = ecdf({β̂_1^(1), ..., β̂_1^(r)}).
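A sketch of this resampling scheme, continuing the code above (it reuses m_estimate; ecdf_beta1 and the N(0,1) default error generator are our placeholder choices):

    def ecdf_beta1(X, beta_star, r=1000, rng=None,
                   err=lambda rng, n: rng.standard_normal(n)):
        # Fix X and beta_star; redraw the error vector r times and record the
        # first coordinate of the M-estimate for each draw.
        rng = rng or np.random.default_rng(0)
        n = X.shape[0]
        draws = np.empty(r)
        for t in range(r):
            y = X @ beta_star + err(rng, n)
            draws[t] = m_estimate(X, y)[0]
        return draws                        # samples from L(beta_hat_1; X)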

Moderate p/n Regime: More Informative Asymptotics
The same scheme run on two larger problems gives the two candidate approximations:
- fix-p approximation: n = 1000, p = 50κ, giving L̂(β̂_1^F; X) = ecdf({β̂_1^(F,1), ..., β̂_1^(F,r)});
- moderate-p/n approximation: n = 1000, p = 1000κ, giving L̂(β̂_1^M; X) = ecdf({β̂_1^(M,1), ..., β̂_1^(M,r)}).

Moderate p/n Regime: More Informative Asymptotics
Measure the accuracy of the two approximations by the Kolmogorov–Smirnov statistics
    d_KS( L̂(β̂_1), L̂(β̂_1^F) )  and  d_KS( L̂(β̂_1), L̂(β̂_1^M) ).
[Figure: distance between the small-sample and large-sample distributions; Kolmogorov–Smirnov statistic (roughly 0.40–0.50) against κ ∈ (0.25, 0.75), one panel per error distribution (normal, t(2)), one curve per asymptotic regime (p fixed, p/n fixed).]
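Continuing the sketch, the two d_KS values can be proxied by two-sample Kolmogorov–Smirnov statistics between the simulated draws. The three designs below are illustrative stand-ins for the (50, 50κ), (1000, 50κ) and (1000, 1000κ) problems, and studentizing each sample before comparison is our choice to put the three scales on a common footing:

    from scipy.stats import ks_2samp

    kappa = 0.5
    X_small = rng.standard_normal((50, int(50 * kappa)))      # original problem
    X_fixp  = rng.standard_normal((1000, int(50 * kappa)))    # fix-p approx.
    X_modpn = rng.standard_normal((1000, int(1000 * kappa)))  # moderate approx.

    def studentize(d):
        return (d - d.mean()) / d.std(ddof=1)

    small = ecdf_beta1(X_small, np.zeros(X_small.shape[1]), r=200)
    fixp  = ecdf_beta1(X_fixp,  np.zeros(X_fixp.shape[1]),  r=200)
    modpn = ecdf_beta1(X_modpn, np.zeros(X_modpn.shape[1]), r=200)

    d_fixp  = ks_2samp(studentize(small), studentize(fixp)).statistic
    d_modpn = ks_2samp(studentize(small), studentize(modpn)).statistic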

Moderate p/n Regime: Negative Results
The moderate p/n regime in statistics:
- Huber [1973] showed that for the least-squares estimator there always exists a sequence of vectors a_n ∈ R^p such that
      L( a_n^T(β̂_LS − β*) / √Var(a_n^T β̂_LS) ) ↛ N(0, 1);
- Bickel and Freedman [1982] showed that the bootstrap fails in the least-squares case and that the usual rescaling does not help;
- El Karoui et al. [2011] showed that for general loss functions, ‖β̂ − β*‖₂² ↛ 0;
- El Karoui and Purdom [2015] showed that most widely used resampling schemes give poor inference on β*_1.

Moderate p/n Regime: Reasons for Failure
Qualitatively,
- influential observations always exist [Huber, 1973]: with H = X(X^T X)^{-1} X^T the hat matrix,
      max_{1≤i≤n} H_ii ≥ tr(H)/n = p/n ≫ 0;
- regression residuals fail to mimic the true errors: R_i ≜ y_i − x_i^T β̂ ≉ ε_i.
Technically, the Taylor-expansion/Bahadur-type representation fails!
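The leverage claim is easy to check numerically; a quick sketch, reusing X from the first code block:

    # diag(H) sums to p, so the average leverage is p/n and the largest
    # leverage is at least p/n: influential observations cannot vanish in
    # the moderate regime.
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    print(np.trace(H) / X.shape[0])         # = p/n
    print(H.diagonal().max())               # >= p/n, bounded away from 0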

Moderate p/n Regime: Positive Results (Random Designs)
- Bean et al. [2013] showed that when X has i.i.d. Gaussian entries, for any sequence of a_n ∈ R^p,
      L_{X,ε}( a_n^T(β̂ − β*) / √Var_{X,ε}(a_n^T β̂) ) → N(0, 1);
- El Karoui [2015] extended this to general random designs.
These results do not contradict Huber [1973], because the randomness comes from both X and ε; recall that El Karoui et al. [2011] showed that for general loss functions, ‖β̂ − β*‖₂ ↛ 0.

Moderate p/n Regime: Summary
- Provides a more accurate approximation of L(β̂_1);
- qualitatively different from the classical regimes where p/n → 0:
  - L2-consistency of β̂ no longer holds;
  - the residual R_i behaves differently from ε_i;
  - fixed-design results differ from random-design results;
- inference on the vector β̂ is hard, but inference on coordinates / low-dimensional linear contrasts of β̂ is still possible.

Goals (Formal)
Our goal (formal): under the linear model Y = Xβ* + ε, derive the asymptotic distribution of the coordinates β̂_j
- under the moderate p/n regime, i.e. p/n → κ ∈ (0, 1);
- with a fixed design matrix X;
- without assumptions on β*.

Main Results

Main Result (Informal)
Definition 1. Let P and Q be two distributions on R^p; then
    d_TV(P, Q) = sup_{A ⊆ R^p} |P(A) − Q(A)|.
Theorem. Under appropriate conditions on the design matrix X, the distribution of ε, and the loss function ρ, as p/n → κ ∈ (0, 1) while n → ∞,
    max_j d_TV( L( (β̂_j − E β̂_j) / √Var(β̂_j) ), N(0, 1) ) = o(1).
If, moreover, ρ is an even function and ε =_d −ε, then β̂ − β* =_d β* − β̂, so E β̂ = β* and the theorem reads
    max_j d_TV( L( (β̂_j − β*_j) / √Var(β̂_j) ), N(0, 1) ) = o(1).

Why Surprising?
- Classical approaches rely heavily on the L2 consistency of β̂, which only holds when p = o(n);
- the Bahadur-type representation for β̂,
      √n (β̂ − β*) = (1/√n) Σ_{i=1}^n Z_i + o_p(1)
  for some i.i.d. random variables Z_i, can be proved only when p = o(n^{2/3});
- question: what happens when p ∈ [O(n^{2/3}), O(n)]?

Our Contributions and Limitations
Instead, we develop a novel strategy built on
- the leave-one-out method [El Karoui et al., 2011]; and
- the second-order Poincaré inequality [Chatterjee, 2009].
We prove that
- β̂_1 is asymptotically normal for all p ∈ [O(1), O(n)] for fixed designs under regularity conditions;
- the conditions are satisfied by most design matrices.
Limitations:
- we impose strong conditions on ρ and L(ε);
- we do not know how to estimate Var_ε(β̂_1).

Examples: Realizations of i.i.d. Designs
We consider the case where X is a realization of a random design Z. The examples below are proved to satisfy the technical assumptions with high probability over Z.
- Example 1: Z has i.i.d. mean-zero sub-Gaussian entries with Var(Z_ij) = τ² > 0;
- Example 2: Z contains an intercept term, i.e. Z = (1, Z̃), and Z̃ ∈ R^{n×(p−1)} has independent sub-Gaussian entries with Z̃_ij − μ_j =_d μ_j − Z̃_ij and Var(Z̃_ij) > τ², for some arbitrary μ_j's.

A Counter-Example
Consider a one-way ANOVA situation: each observation i is associated with a label k_i ∈ {1, ..., p}, and X_ij = I(j = k_i). This is equivalent to Y_i = β*_{k_i} + ε_i. It is easy to see that
    β̂_j = argmin_{β_j ∈ R} Σ_{i: k_i = j} ρ(y_i − β_j).
This is a standard location problem.

A Counter-Example
Let n_j = |{i : k_i = j}|. In the least-squares case, i.e. ρ(x) = x²/2,
    β̂_j = β*_j + (1/n_j) Σ_{i: k_i = j} ε_i.
Assume a balanced design, i.e. n_j ≈ n/p. Then n_j stays bounded (n_j → 1/κ < ∞), so
- none of the β̂_j is asymptotically normal (unless the ε_i are normal);
- the same holds for general loss functions ρ.
Conclusion: some non-standard assumptions on X are required.

Heuristics and Proof Techniques
(Least-Squares Estimator: A Motivating Example · Second-Order Poincaré Inequality · Assumptions · Main Results)

Least-Squares Estimator
The L2 loss, ρ(x) = x²/2, gives the least-squares estimator
    β̂_LS = (X^T X)^{-1} X^T Y = β* + (X^T X)^{-1} X^T ε.
Let e_j denote the j-th canonical basis vector of R^p; then
    (β̂_LS)_j − β*_j = e_j^T (X^T X)^{-1} X^T ε ≜ α_j^T ε.

Least-Squares Estimator
The Lindeberg–Feller CLT implies that, in order for
    ( (β̂_LS)_j − β*_j ) / √Var((β̂_LS)_j) →_L N(0, 1),
it is sufficient, and almost necessary, that
    ‖α_j‖_∞ / ‖α_j‖₂ → 0.   (1)

Least-Squares Estimator
To see the necessity of the condition, recall the one-way ANOVA case. Let n_j = |{i : k_i = j}|; then X^T X = diag(n_j)_{j=1}^p. Recalling α_j^T = e_j^T (X^T X)^{-1} X^T, this gives
    α_{j,i} = 1/n_j if k_i = j, and 0 otherwise.
As a result, ‖α_j‖_∞ = 1/n_j and ‖α_j‖₂ = 1/√n_j, hence
    ‖α_j‖_∞ / ‖α_j‖₂ = 1/√n_j.
However, in the moderate p/n regime there exists j such that n_j ≤ n/p → 1/κ < ∞, and thus (β̂_LS)_j is not asymptotically normal.
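A sketch contrasting condition (1) for a Gaussian design and the one-way ANOVA design (sizes and labels here are illustrative; lindeberg_ratio is our name):

    def lindeberg_ratio(X, j=0):
        # alpha_j^T = e_j^T (X^T X)^{-1} X^T; condition (1) asks that
        # ||alpha_j||_inf / ||alpha_j||_2 -> 0.
        alpha = np.linalg.solve(X.T @ X, X.T)[j]
        return np.abs(alpha).max() / np.linalg.norm(alpha)

    n, p = 1000, 300
    print(lindeberg_ratio(rng.standard_normal((n, p))))  # small; -> 0 as n grows

    labels = np.arange(n) % p               # balanced one-way ANOVA, n_j ~ n/p
    X_anova = np.eye(p)[labels]             # X_ij = 1{j = k_i}
    print(lindeberg_ratio(X_anova))         # = 1/sqrt(n_j); does not vanish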

M-Estimator
The result for the LSE is derived from the analytical form of β̂_LS. By contrast, no analytical form is available for general ρ. With ψ = ρ′, β̂ is the solution of
    (1/n) Σ_{i=1}^n ψ(y_i − x_i^T β̂) = 0  ⟺  (1/n) Σ_{i=1}^n ψ(ε_i − x_i^T (β̂ − β*)) = 0.
We show that
- β̂_j is a smooth function of ε;
- ∂β̂_j/∂ε and ∂²β̂_j/∂ε ∂ε^T are computable.

Second-Order Poincaré Inequality
β̂_j is a smooth transform of a random vector ε with independent entries. A powerful CLT for this type of statistic is the second-order Poincaré inequality [Chatterjee, 2009].
Definition 2. For c_1, c_2 > 0, let L(c_1, c_2) be the class of probability measures on R that arise as laws of random variables u(W), where W ~ N(0, 1) and u ∈ C²(R) with |u′| ≤ c_1 and |u″| ≤ c_2. For example, u = Id gives N(0, 1) and u = Φ gives U([0, 1]).

Second-Order Poincaré Inequality
Proposition 1 (SOPI; Chatterjee [2009]). Let W = (W_1, ..., W_n) have independent entries with W_i ~ L(c_1, c_2). Take any g ∈ C²(R^n), let U = g(W), and set
    κ_1 = ( E‖∇g(W)‖₂⁴ )^{1/4};  κ_2 = ( E‖∇²g(W)‖_op⁴ )^{1/4};  κ_0 = ( Σ_{i=1}^n E(∂_i g(W))⁴ )^{1/2}.
If EU⁴ < ∞, then
    d_TV( L( (U − EU)/√Var(U) ), N(0, 1) ) ≲ (κ_0 + κ_1 κ_2) / Var(U).

Assumptions
- A1: ρ(0) = ψ(0) = 0, and for all x ∈ R, 0 < K_0 ≤ ψ′(x) ≤ K_1 and |ψ″(x)| ≤ K_2;
- A2: ε has independent entries with ε_i ~ L(c_1, c_2);
- A3: with λ_+ and λ_− the largest and smallest eigenvalues of X^T X / n, λ_+ = O(1) and λ_− = Ω(1);
- A4: similar to the condition for OLS,
      max_j ‖e_j^T (X^T X)^{-1} X^T‖_∞ / ‖e_j^T (X^T X)^{-1} X^T‖₂ = o(1);
- A5: similar to the condition that min_j Var(β̂_j) = Ω(1/n).

Main Results
Theorem 3. Under assumptions A1–A5, as p/n → κ for some κ ∈ (0, 1) while n → ∞,
    max_j d_TV( L( (β̂_j − E β̂_j)/√Var(β̂_j) ), N(0, 1) ) = o(1).

Numerical Results

Setup
Design matrix X:
- (i.i.d. design) X_ij i.i.d. ~ F;
- (partial Hadamard design) a matrix formed by a random set of p columns of an n × n Hadamard matrix.
Entry distribution F: F = N(0, 1) or F = t_2.
Error distribution L(ε): ε_i i.i.d. with ε_i ~ N(0, 1) or ε_i ~ t_2.

Setup
- Sample size n: {100, 200, 400, 800};
- κ = p/n: {0.5, 0.8};
- loss function ρ: Huber loss with k = 1.345,
      ρ(x) = x²/2 for |x| ≤ k, and k|x| − k²/2 for |x| > k;
- coefficients: β* = 0.
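A sketch of how the two designs can be generated (make_design is our name; scipy.linalg.hadamard requires n to be a power of 2, so the illustration uses n = 128 rather than the sample sizes above):

    from scipy.linalg import hadamard

    def make_design(n, p, kind="iid", df=None, rng=None):
        rng = rng or np.random.default_rng(0)
        if kind == "iid":
            # entries i.i.d. F, with F = N(0,1) by default or F = t_df
            return rng.standard_t(df, (n, p)) if df else rng.standard_normal((n, p))
        # partial Hadamard: a random set of p columns of an n x n Hadamard matrix
        cols = rng.choice(n, size=p, replace=False)
        return hadamard(n)[:, cols].astype(float)

    X_iid = make_design(128, 64, kind="iid", df=2)   # kappa = 0.5, F = t_2
    X_had = make_design(128, 64, kind="hadamard")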

Asymptotic Normality of a Single Coordinate
As before, fix X and β*; for r = 1, 2, ..., draw a fresh error vector ε_r, form y_r = Xβ* + ε_r, and compute the M-estimates β̂_1^(1), β̂_1^(2), ..., β̂_1^(r). Then
- ŝd ≜ se({β̂_1^(1), ..., β̂_1^(r)});
- we want to compare L(β̂_1/ŝd) with N(0, 1);
- we count the fraction of β̂_1^(j) ∈ [−1.96 ŝd, 1.96 ŝd] as a proxy;
- this should ideally be close to 0.95.

Asymptotic Normality of a Single Coordinate
[Figure: empirical coverage of β̂_1 for κ = 0.5 (left) and κ = 0.8 (right), by sample size n ∈ {100, 200, 400, 800}, design (i.i.d. vs. partial Hadamard), entry distribution (normal vs. t(2)), and error distribution (normal vs. t(2)); the coverage axis spans 0.90–1.00 around the nominal 0.95.]
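Continuing the earlier sketch, the coverage proxy just described amounts to the following (with β* = 0 and a symmetric loss, the draws are centered at the true value):

    # Coverage proxy: fraction of replicated beta_hat_1 inside +/- 1.96 sd-hat.
    draws = ecdf_beta1(X_had, np.zeros(X_had.shape[1]), r=1000)
    sd_hat = draws.std(ddof=1)
    coverage = np.mean(np.abs(draws) <= 1.96 * sd_hat)
    print(coverage)                         # ideally close to 0.95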

Conclusion
- We establish coordinate-wise asymptotic normality of M-estimates for certain fixed design matrices in the moderate p/n regime, under regularity conditions on X, L(ε) and ρ but no condition on β*;
- we prove the result using a novel approach, the second-order Poincaré inequality [Chatterjee, 2009];
- we show that the regularity conditions are satisfied by a broad class of designs.

Discussion
Inference = asymptotic normality + asymptotic bias + asymptotic variance:
- Var(β̂_1 | X) ≈ Var(β̂_1) when X is indeed a realization of a random design?
- resampling methods to give conservative variance estimates?
- more advanced bootstrap?
Relax the regularity conditions:
- generalize to non-strongly-convex and non-smooth loss functions?
- generalize to general error distributions?
Get rid of asymptotics:
- yes, exact finite-sample guarantees if n/p > 20;
- no assumptions on X or β*;
- only an exchangeability assumption on ε.

Thank You!

References
- Derek Bean, Peter J. Bickel, Noureddine El Karoui, and Bin Yu. Optimal M-estimation in high-dimensional regression. Proceedings of the National Academy of Sciences, 110(36):14563–14568, 2013.
- Peter J. Bickel and David A. Freedman. Bootstrapping regression models with many parameters. Festschrift for Erich L. Lehmann, pages 28–48, 1982.
- Sourav Chatterjee. Fluctuations of eigenvalues and second order Poincaré inequalities. Probability Theory and Related Fields, 143(1-2):1–40, 2009.
- Noureddine El Karoui. On the impact of predictor geometry on the performance of high-dimensional ridge-regularized generalized robust regression estimators. 2015.
- Noureddine El Karoui and Elizabeth Purdom. Can we trust the bootstrap in high-dimension? UC Berkeley Statistics Department Technical Report, 2015.
- Noureddine El Karoui, Derek Bean, Peter J. Bickel, Chinghway Lim, and Bin Yu. On robust regression with high-dimensional predictors. Proceedings of the National Academy of Sciences, 110(36):14557–14562, 2011.
- Peter J. Huber. Robust regression: asymptotics, conjectures and Monte Carlo. The Annals of Statistics, pages 799–821, 1973.
- Enno Mammen. Asymptotics with increasing dimension for robust regression with applications to the bootstrap. The Annals of Statistics, pages 382–400, 1989.
- Stephen Portnoy. Asymptotic behavior of M-estimators of p regression parameters when p²/n is large. I. Consistency. The Annals of Statistics, pages 1298–1309, 1984.
- Stephen Portnoy. Asymptotic behavior of M-estimators of p regression parameters when p²/n is large; II. Normal approximation. The Annals of Statistics, pages 1403–1417, 1985.