Inference for High-Dimensional M-Estimates: Fixed Design Results


Inference for High-Dimensional M-Estimates: Fixed Design Results
Lihua Lei
Advisors: Peter J. Bickel, Michael I. Jordan
Joint work with Peter J. Bickel and Noureddine El Karoui
Dec. 8, 2016

Table of Contents
1 Background
2 Main Results and Examples
3 Assumptions and Proof Sketch
4 Numerical Results


Setup
Observe {x_1, y_1}, {x_2, y_2}, ..., {x_n, y_n}:
- response vector Y = (y_1, ..., y_n)^T in R^n;
- design matrix X = (x_1^T, ..., x_n^T)^T in R^{n x p}.
Model: linear model Y = X beta* + eps, where eps = (eps_1, ..., eps_n)^T in R^n is a random vector.

M-Estimator
Given a convex loss function rho(.) : R -> [0, infinity), the M-estimator is
beta_hat = argmin_{beta in R^p} (1/n) sum_{i=1}^n rho(y_i - x_i^T beta).
When rho is differentiable with psi = rho', beta_hat can be written as the solution of the estimating equation
(1/n) sum_{i=1}^n x_i psi(y_i - x_i^T beta_hat) = 0.
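To make the definition concrete, here is a minimal sketch (not the speaker's code) of computing an M-estimator by directly minimizing the average loss; the Huber loss, the BFGS solver, and the helper names huber_rho, huber_psi and m_estimate are illustrative choices. The last lines check that the estimating equation above is numerically satisfied at the minimizer.

```python
import numpy as np
from scipy.optimize import minimize

def huber_rho(r, k=1.345):
    """Huber loss rho, applied elementwise; k is the usual tuning constant."""
    return np.where(np.abs(r) <= k, 0.5 * r**2, k * (np.abs(r) - 0.5 * k))

def huber_psi(r, k=1.345):
    """psi = rho', i.e. the clipped residual."""
    return np.clip(r, -k, k)

def m_estimate(X, Y, rho=huber_rho):
    """beta_hat = argmin_beta (1/n) sum_i rho(y_i - x_i^T beta)."""
    objective = lambda beta: rho(Y - X @ beta).mean()
    beta0 = np.linalg.lstsq(X, Y, rcond=None)[0]   # least-squares warm start
    return minimize(objective, beta0, method="BFGS").x

# Small example: the estimating equation (1/n) sum_i x_i psi(y_i - x_i^T beta_hat) = 0
# should hold, up to solver tolerance, at the minimizer.
rng = np.random.default_rng(0)
n, p = 200, 40
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)                         # true beta* = 0
beta_hat = m_estimate(X, Y)
score = X.T @ huber_psi(Y - X @ beta_hat) / n
print(np.max(np.abs(score)))                       # close to 0
```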

M-Estimator: Examples
- rho(x) = x^2/2 gives the least-squares estimator;
- rho(x) = |x| gives the least-absolute-deviation estimator;
- rho(x) = x^2/2 for |x| <= k, and rho(x) = k(|x| - k/2) for |x| > k, gives the Huber estimator.
[Figure: plots of rho(x) and psi(x) for the L2, L1, and Huber losses.]

Goals (Informal)
Goal (informal): make inference on the coordinates of beta_hat when
- the dimension p is comparable to the sample size n;
- X is treated as fixed;
- no assumptions are placed on beta*.
Consider beta*_1 WLOG. Given X and L(eps), L(beta_hat_1) is uniquely determined. Ideally, we would construct a 95% confidence interval for beta*_1 as
[ q_0.025(L(beta_hat_1)), q_0.975(L(beta_hat_1)) ],
where q_alpha denotes the alpha-th quantile. Unfortunately, L(beta_hat_1) is complicated.

Asymptotic Arguments
Exact finite-sample inference is hard. This motivates statisticians to resort to asymptotic arguments, i.e. to find a distribution F such that L(beta_hat_1) ~ F.
The limiting behavior of beta_hat when p is fixed and n -> infinity is
L(beta_hat) ~ N( beta*, (X^T X)^{-1} E psi^2(eps_1) / [E psi'(eps_1)]^2 ).
As a consequence, we obtain an approximate 95% confidence interval for beta*_1,
[ beta_hat_1 - 1.96 sd(beta_hat_1), beta_hat_1 + 1.96 sd(beta_hat_1) ],
where sd(beta_hat_1) could be any consistent estimator of the standard deviation.
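As a sketch of how this classical fixed-p interval could be computed, the snippet below plugs residual-based estimates of E psi^2(eps_1) and E psi'(eps_1) into the asymptotic covariance above; the helper fixed_p_ci is hypothetical, and the commented-out call assumes the m_estimate sketch from earlier.

```python
import numpy as np

def fixed_p_ci(X, Y, beta_hat, psi, psi_prime, j=0, level=1.96):
    """Classical fixed-p plug-in CI for beta*_j based on the asymptotic
    covariance (X^T X)^{-1} * E psi^2(eps_1) / [E psi'(eps_1)]^2."""
    R = Y - X @ beta_hat                          # residuals
    a = np.mean(psi(R) ** 2)                      # estimates E psi^2(eps_1)
    b = np.mean(psi_prime(R))                     # estimates E psi'(eps_1)
    cov = np.linalg.inv(X.T @ X) * a / b**2
    sd_j = np.sqrt(cov[j, j])
    return beta_hat[j] - level * sd_j, beta_hat[j] + level * sd_j

# Huber psi and psi' (k = 1.345), to be used with the m_estimate sketch above:
k = 1.345
psi = lambda r: np.clip(r, -k, k)
psi_prime = lambda r: (np.abs(r) <= k).astype(float)
# ci = fixed_p_ci(X, Y, m_estimate(X, Y), psi, psi_prime, j=0)
```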

Asymptotic Arguments
In other words, to approximate L(beta_hat_1), we consider a sequence of hypothetical problems, indexed by j, where the j-th problem has sample size n_j and dimension p_j = p. For the j-th problem, denote by beta_hat^(j) the corresponding M-estimator; the previous slide then uses lim_{j -> infinity} L(beta_hat_1^(j)) to approximate L(beta_hat_1). In general, p_j is not necessarily fixed and can grow to infinity.

Asymptotic Arguments
Huber (1973) raised the question of understanding the behavior of beta_hat when both n and p tend to infinity. Huber (1973) showed the L_2 consistency of beta_hat, i.e. ||beta_hat - beta*||_2^2 -> 0, under the regime p^3/n -> 0. Portnoy (1984) proved the L_2 consistency of beta_hat under the regime (p log p)/n -> 0.

Asymptotic Arguments
Portnoy (1985) showed that beta_hat is jointly asymptotically normal under the regime (p log n)^{3/2}/n -> 0, in the sense that for any sequence of vectors a_n in R^p,
L( a_n^T (beta_hat - beta*) / sqrt(Var(a_n^T beta_hat)) ) -> N(0, 1).

p/n: A Measure of Difficulty
All of the above works require p/n -> 0, i.e. n/p -> infinity. Here n/p is the number of samples per parameter; heuristically, a larger n/p gives an easier problem.

p/n: A Measure of Difficulty
Recall that the approximation can be seen as a sequence of hypothetical problems with sample size n_j and dimension p_j. If n_j/p_j -> infinity, the problems become increasingly easier as j grows. In other words, the hypothetical problems used for approximation are much easier than the original problem, and the approximation accuracy might be compromised.

Moderate p/n Regime
Instead, we can consider a sequence of hypothetical problems with p_j/n_j fixed at the same value as in the original problem, i.e. p_j/n_j = p/n. In this case, the difficulty of the problem is fixed.

Moderate p/n Regime
Formally, we define the moderate p/n regime by p_j/n_j -> kappa > 0. A typical value for kappa is p/n in the original problem.

Moderate p/n Regime: More Informative Asymptotics
Consider a set of small-sample problems with n = 50 and p = n*kappa for kappa in {0.1, ..., 0.9}. For each pair (n, p):
Step 1 Generate X in R^{n x p} with i.i.d. N(0, 1) entries;
Step 2 Fix beta* = 0 and sample Y = eps with eps_i i.i.d. N(0, 1) or eps_i i.i.d. t_2;
Step 3 Estimate beta*_1 by beta_hat_1 with the Huber loss;
Step 4 Repeat Step 2 - Step 3 100 times and estimate L(beta_hat_1).
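A minimal sketch of Steps 1-4, assuming a Gaussian design and the Huber loss as above; the helpers fit_beta1 and sample_beta1 are illustrative, and the 100 repetitions match the slide.

```python
import numpy as np
from scipy.optimize import minimize

def huber(r, k=1.345):
    return np.where(np.abs(r) <= k, 0.5 * r**2, k * (np.abs(r) - 0.5 * k))

def fit_beta1(X, eps):
    """Huber M-estimate of beta_1 when Y = eps (true beta* = 0)."""
    obj = lambda b: huber(eps - X @ b).mean()
    b0 = np.linalg.lstsq(X, eps, rcond=None)[0]
    return minimize(obj, b0, method="BFGS").x[0]

def sample_beta1(n, kappa, err="normal", reps=100, seed=0):
    """Steps 1-4: one fixed Gaussian design, `reps` error draws,
    Monte Carlo sample of beta_hat_1."""
    rng = np.random.default_rng(seed)
    p = int(n * kappa)
    X = rng.standard_normal((n, p))                  # Step 1: design, then held fixed
    draws = []
    for _ in range(reps):                            # Steps 2-4
        eps = rng.standard_normal(n) if err == "normal" else rng.standard_t(2, n)
        draws.append(fit_beta1(X, eps))
    return np.array(draws)

beta1_small = sample_beta1(n=50, kappa=0.5)          # the original small-sample problem
```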

Moderate p/n Regime: More Informative Asymptotics
Now consider two types of approximations:
- fixed-p approximation: N = 1000, P = p;
- moderate-p/n approximation: N = 1000, P = 1000*kappa.
Repeat Step 1 - Step 4 for the new pairs (N, P) and estimate L(beta_hat_1^F) (fixed p) and L(beta_hat_1^M) (moderate p/n). Measure the accuracy of the two approximations by the Kolmogorov-Smirnov statistics
d_KS( L(beta_hat_1), L(beta_hat_1^F) ) and d_KS( L(beta_hat_1), L(beta_hat_1^M) ).
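Continuing the sketch above, one might compare the two approximations as follows, reusing sample_beta1; standardizing each Monte Carlo sample before applying the two-sample Kolmogorov-Smirnov statistic is an assumption made here, since the slides do not say how the two distributions are aligned.

```python
from scipy.stats import ks_2samp

def standardize(v):
    return (v - v.mean()) / v.std()

# Reusing sample_beta1 from the previous sketch; the original problem has n = 50, p = 25.
n, kappa = 50, 0.5
beta1_small = sample_beta1(n=n, kappa=kappa, seed=0)
beta1_fixed = sample_beta1(n=1000, kappa=(n * kappa) / 1000, seed=1)  # P = p ("fixed p")
beta1_mod   = sample_beta1(n=1000, kappa=kappa, seed=2)               # P = 1000*kappa ("p/n fixed")

print("fixed-p approx.     :", ks_2samp(standardize(beta1_small), standardize(beta1_fixed)).statistic)
print("moderate-p/n approx.:", ks_2samp(standardize(beta1_small), standardize(beta1_mod)).statistic)
```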

Moderate p/n Regime: More Informative Asymptotics
[Figure: "Distance between the small sample and large sample distribution"; Kolmogorov-Smirnov statistics plotted against kappa, for normal and t(2) errors, comparing the two asymptotic regimes (p fixed vs. p/n fixed).]

Moderate p/n Regime: Negative Results
The moderate p/n regime has been widely studied in random matrix theory. In statistics:
- Huber (1973) showed that for the least-squares estimator there always exists a sequence of vectors a_n in R^p such that L( a_n^T (beta_hat_LS - beta*) / sqrt(Var(a_n^T beta_hat_LS)) ) does not converge to N(0, 1);
- Bickel and Freedman (1982) showed that the bootstrap fails in the least-squares case and that the usual rescaling does not help;
- El Karoui et al. (2011) showed that for general loss functions, ||beta_hat - beta*||_2^2 does not converge to 0.
Main reason: F_hat_n, the empirical distribution of the residuals R_i = y_i - x_i^T beta_hat, does not converge to L(eps_i).

Moderate p/n Regime: Positive Results
If X is assumed to be a random matrix satisfying regularity conditions:
- Bean et al. (2013) showed that when X has i.i.d. Gaussian entries, for any sequence a_n in R^p, L_{X,eps}( a_n^T (beta_hat - beta*) / sqrt(Var_{X,eps}(a_n^T beta_hat)) ) -> N(0, 1). This does not contradict Huber (1973), because here the randomness comes from both X and eps;
- El Karoui et al. (2011) showed that for general loss functions, ||beta_hat - beta*||_infinity -> 0;
- Under weaker assumptions on X, El Karoui (2015) showed
L_{X,eps}( ( beta_hat_1(tau) - beta*_1 - bias(beta_hat_1(tau)) ) / sqrt(Var_{X,eps}(beta_hat_1(tau))) ) -> N(0, 1),
where beta_hat_1(tau) is the first coordinate of the ridge-penalized M-estimator.

Moderate p/n Regime: Summary
- Provides a more accurate approximation of L(beta_hat_1);
- Qualitatively different from the classical regimes where p/n -> 0: the L_2 consistency of beta_hat no longer holds; the residuals R_i behave differently from the eps_i; fixed-design results differ from random-design results;
- Inference on the vector beta_hat is hard, but inference on coordinates / low-dimensional linear contrasts of beta_hat is still possible.

Goals (Formal)
Our goal (formal): under the linear model Y = X beta* + eps, derive the asymptotic distribution of the coordinates beta_hat_j
- under the moderate p/n regime, i.e. p/n -> kappa in (0, 1);
- with a fixed design matrix X;
- without assumptions on beta*.

Part 2: Main Results and Examples

Main Result (Informal)
Definition 1. Let P and Q be two distributions on R^p; their total-variation distance is d_TV(P, Q) = sup_{A subset of R^p} |P(A) - Q(A)|.
Theorem. Under appropriate conditions on the design matrix X, the distribution of eps and the loss function rho, as p/n -> kappa in (0, 1) while n -> infinity,
max_j d_TV( L( (beta_hat_j - E beta_hat_j) / sqrt(Var(beta_hat_j)) ), N(0, 1) ) = o(1).

Examples: Realizations of i.i.d. Designs
We consider the case where X is a realization of a random design Z. The examples below are proved to satisfy the technical assumptions with high probability over Z.
Example 1. Z has i.i.d. mean-zero sub-Gaussian entries with Var(Z_ij) = tau^2 > 0.
Example 2. Z contains an intercept term, i.e. Z = (1, Z~), and Z~ in R^{n x (p-1)} has independent sub-Gaussian entries with Z~_ij - mu_j =_d mu_j - Z~_ij and Var(Z~_ij) > tau^2, for arbitrary mu_j.

Examples: Realizations of Dependent Gaussian Designs
Example 3. Z is matrix-normal with vec(Z) ~ N(0, Lambda (x) Sigma), where lambda_max(Lambda), lambda_max(Sigma) = O(1) and lambda_min(Lambda), lambda_min(Sigma) = Omega(1).
Example 4. Z contains an intercept term, i.e. Z = (1, Z~), and vec(Z~) ~ N(0, Lambda (x) Sigma) with Lambda and Sigma satisfying the condition above, and max_i (Lambda^{1/2} 1)_i / min_i (Lambda^{1/2} 1)_i = O(1).

A Counterexample
Consider a one-way ANOVA situation: each observation i is associated with a label k_i in {1, ..., p}, and X_{i,j} = I(j = k_i). This is equivalent to Y_i = beta*_{k_i} + eps_i. It is easy to see that
beta_hat_j = argmin_{beta_j in R} sum_{i : k_i = j} rho(y_i - beta_j),
which is a standard location problem.

A Counterexample
Let n_j = |{i : k_i = j}|. In the least-squares case, i.e. rho(x) = x^2/2,
beta_hat_j = beta*_j + (1/n_j) sum_{i : k_i = j} eps_i.
Assume a balanced design, i.e. n_j ~ n/p. Then n_j stays bounded, so none of the beta_hat_j is asymptotically normal (unless the eps_i are normal); the same holds for general loss functions rho.
Conclusion: some non-standard assumptions on X are required.

Part 3: Assumptions and Proof Sketch
- Least-Squares Estimator: A Motivating Example
- Second-Order Poincaré Inequality
- Assumptions
- Main Results

Least-Squares Estimator
The L_2 loss rho(x) = x^2/2 gives the least-squares estimator
beta_hat_LS = (X^T X)^{-1} X^T Y = beta* + (X^T X)^{-1} X^T eps.
Let e_j denote the j-th canonical basis vector in R^p; then
beta_hat_j^LS - beta*_j = e_j^T (X^T X)^{-1} X^T eps.
Writing e_j^T (X^T X)^{-1} X^T as alpha_j^T,
beta_hat_j^LS - beta*_j = sum_{i=1}^n alpha_{j,i} eps_i.

Least-Squares Estimator
The Lindeberg-Feller CLT implies that in order for
( beta_hat_j^LS - beta*_j ) / sqrt(Var(beta_hat_j^LS)) -> N(0, 1) in distribution,
it is sufficient, and almost necessary, that
||alpha_j||_infinity / ||alpha_j||_2 -> 0.   (1)

Least-Squares Estimator
To see the necessity of the condition, recall the one-way ANOVA case. Let n_j = |{i : k_i = j}|; then X^T X = diag(n_j)_{j=1}^p. This gives
alpha_{j,i} = 1/n_j if k_i = j, and alpha_{j,i} = 0 if k_i != j.
As a result, ||alpha_j||_infinity = 1/n_j and ||alpha_j||_2 = 1/sqrt(n_j), and hence
||alpha_j||_infinity / ||alpha_j||_2 = 1/sqrt(n_j).
However, in the moderate p/n regime there exists j such that n_j <= 1/kappa, and thus beta_hat_j^LS is not asymptotically normal.
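A quick numerical illustration of condition (1), contrasting an i.i.d. Gaussian design with the balanced one-way ANOVA design at the same p/n; the function lindeberg_ratio and the specific sizes are illustrative choices.

```python
import numpy as np

def lindeberg_ratio(X, j=0):
    """||alpha_j||_inf / ||alpha_j||_2 with alpha_j^T = e_j^T (X^T X)^{-1} X^T."""
    alpha_j = X @ np.linalg.solve(X.T @ X, np.eye(X.shape[1])[:, j])
    return np.max(np.abs(alpha_j)) / np.linalg.norm(alpha_j)

rng = np.random.default_rng(0)
n, p = 400, 200                                  # kappa = 0.5

X_gauss = rng.standard_normal((n, p))            # i.i.d. Gaussian design
labels = np.repeat(np.arange(p), n // p)         # balanced one-way ANOVA design, n_j = 2
X_anova = np.eye(p)[labels]

print("Gaussian design:", lindeberg_ratio(X_gauss))  # much smaller; shrinks as n grows with p/n fixed
print("ANOVA design   :", lindeberg_ratio(X_anova))  # = 1/sqrt(n_j) ~ 0.707, does not vanish
```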

M-Estimator
The result for the LSE is derived from the analytical form of beta_hat_LS. In contrast, an analytical form is not available for general rho. Let psi = rho'; then beta_hat is the solution of
(1/n) sum_{i=1}^n x_i psi(y_i - x_i^T beta_hat) = 0.
WLOG assume beta* = 0; then
(1/n) sum_{i=1}^n x_i psi(eps_i - x_i^T beta_hat) = 0.

M-Estimator
Write R_i for eps_i - x_i^T beta_hat and define D, D~ and G as
D = diag(psi'(R_i)), D~ = diag(psi''(R_i)), G = I - X (X^T D X)^{-1} X^T D.
Lemma 2. Suppose psi in C^2(R); then
d beta_hat_j / d eps^T = e_j^T (X^T D X)^{-1} X^T D,   (2)
d^2 beta_hat_j / d eps d eps^T = G^T diag( e_j^T (X^T D X)^{-1} X^T D~ ) G.   (3)
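Equation (2) can be sanity-checked numerically by finite differences; the sketch below does so with a smooth pseudo-Huber loss (an illustrative choice, not the loss used in the talk, so that psi' exists everywhere), and equation (3) could be checked in the same way.

```python
import numpy as np
from scipy.optimize import minimize

# Pseudo-Huber loss and its derivatives (illustrative smooth choice).
k = 1.345
rho       = lambda r: k**2 * (np.sqrt(1 + (r / k) ** 2) - 1)
psi       = lambda r: r / np.sqrt(1 + (r / k) ** 2)
psi_prime = lambda r: (1 + (r / k) ** 2) ** (-1.5)

def beta_hat(X, eps):
    obj  = lambda b: rho(eps - X @ b).mean()
    grad = lambda b: -X.T @ psi(eps - X @ b) / len(eps)
    b0 = np.linalg.lstsq(X, eps, rcond=None)[0]
    return minimize(obj, b0, jac=grad, method="BFGS", options={"gtol": 1e-12}).x

rng = np.random.default_rng(1)
n, p, j = 80, 20, 0
X, eps = rng.standard_normal((n, p)), rng.standard_normal(n)
b = beta_hat(X, eps)

# Analytic gradient from (2): e_j^T (X^T D X)^{-1} X^T D with D = diag(psi'(R_i)).
D = np.diag(psi_prime(eps - X @ b))
grad_analytic = np.linalg.solve(X.T @ D @ X, X.T @ D)[j]

# Central finite differences of the map eps -> beta_hat_j(eps).
h = 1e-4
grad_fd = np.array([(beta_hat(X, eps + h * np.eye(n)[i])[j]
                     - beta_hat(X, eps - h * np.eye(n)[i])[j]) / (2 * h) for i in range(n)])
print(np.max(np.abs(grad_analytic - grad_fd)))   # should be close to 0
```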

Second-Order Poincaré Inequality
beta_hat_j is a smooth transform of a random vector eps with independent entries. A powerful CLT for this type of statistic is the second-order Poincaré inequality (Chatterjee, 2009).
Definition 3. For c_1, c_2 > 0, let L(c_1, c_2) be the class of probability measures on R that arise as laws of random variables u(W), where W ~ N(0, 1) and u in C^2(R) with |u'(x)| <= c_1 and |u''(x)| <= c_2. For example, u = Id gives N(0, 1) and u = Phi gives U([0, 1]).

Second-Order Poincaré Inequality
Proposition 1 (SOPI; Chatterjee, 2009). Let W = (W_1, ..., W_n) have independent entries with W_i ~ L(c_1, c_2). Take any g in C^2(R^n), let U = g(W), and define
kappa_0 = ( E sum_{i=1}^n |d_i g(W)|^4 )^{1/2};
kappa_1 = ( E ||grad g(W)||_2^4 )^{1/4};
kappa_2 = ( E ||grad^2 g(W)||_op^4 )^{1/4}.
If U has a finite fourth moment, then
d_TV( L( (U - EU) / sqrt(Var(U)) ), N(0, 1) ) <= C(c_1, c_2) (kappa_0 + kappa_1 kappa_2) / Var(U),
where C(c_1, c_2) is a constant depending only on c_1 and c_2.

Assumptions
Assume that
A1 rho(0) = psi(0) = 0 and, for any x in R, 0 < K_0 <= psi'(x) <= K_1 and |psi''(x)| <= K_2;
A2 eps has independent entries with eps_i ~ L(c_1, c_2);
A3 letting lambda_+ and lambda_- be the largest and smallest eigenvalues of X^T X / n, lambda_+ = O(1) and lambda_- = Omega(1).

Second-Order Poincaré Inequality on beta_hat_j
Applying the second-order Poincaré inequality to beta_hat_j (using the derivatives in Lemma 2) yields Lemma 4: let D = diag(psi'(eps_i - x_i^T beta_hat))_{i=1}^n and M_j = E ||e_j^T (X^T D X)^{-1} X^T D||_infinity^2. Then, under assumptions A1-A3, max_j d_TV( L( (beta_hat_j - E beta_hat_j) / sqrt(Var(beta_hat_j)) ), N(0, 1) ) is bounded by an explicit quantity involving max_j M_j, p/n and min_j Var(beta_hat_j). The main result is obtained once we prove
M_j = o(1/n) and Var(beta_hat_j) = Omega(1/n).

Further Assumptions
Define the following quantities:
- the leave-one-predictor-out estimate beta_hat_[j]: the M-estimator obtained after removing the j-th column of X (El Karoui, 2013);
- the leave-one-predictor-out residuals r_{i,[j]} = eps_i - x_{i,[j]}^T beta_hat_[j], where x_{i,[j]}^T is the i-th row of X after removing the j-th entry;
- h_{j,0} = (psi(r_{1,[j]}), ..., psi(r_{n,[j]}))^T;
- Q_j = Cov(h_{j,0}), the covariance matrix of the psi(r_{i,[j]}).

Further Assumptions
Besides assumptions A1-A3, we assume that
A4 min_j X_j^T Q_j X_j / tr(Q_j) = Omega(1).
Note that Q_j does not involve X_j. Assumption A4 guarantees Var(beta_hat_j) = Omega(1/n).

Further Assumptions
If X_j is a realization of a random vector Z_j with i.i.d. entries, then
E Z_j^T Q_j Z_j = tr(E Z_j Z_j^T Q_j) = E Z_{1,j}^2 tr(Q_j).
If Z_j^T Q_j Z_j concentrates around its mean, then
Z_j^T Q_j Z_j / tr(Q_j) ~ E Z_{1,j}^2 > 0.
For example, when Z_j has i.i.d. sub-Gaussian entries, the Hanson-Wright inequality implies the concentration:
P( |Z_j^T Q_j Z_j - E Z_j^T Q_j Z_j| >= t ) <= 2 exp( -c min{ t^2 / ||Q_j||_F^2, t / ||Q_j||_op } ).
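A small simulation illustrating this concentration; the matrix Q below is a generic positive semi-definite stand-in for Q_j (which in the talk is Cov(h_{j,0})), and Rademacher entries are used as a simple sub-Gaussian example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# A generic positive semi-definite Q standing in for Q_j = Cov(h_{j,0}).
A = rng.standard_normal((n, n))
Q = A @ A.T / n

# Z_j with i.i.d. sub-Gaussian (here Rademacher) entries, so E Z_{1,j}^2 = 1.
ratios = []
for _ in range(200):
    Z = rng.choice([-1.0, 1.0], size=n)
    ratios.append(Z @ Q @ Z / np.trace(Q))

print(np.mean(ratios), np.std(ratios))   # concentrates near E Z_{1,j}^2 = 1
```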

Further Assumptions
To describe the last assumption, we define the following quantities:
- D_[j] = diag(psi'(r_{i,[j]})): the leave-one-predictor-out version of D;
- G_[j] = I - X_[j] (X_[j]^T D_[j] X_[j])^{-1} X_[j]^T D_[j];
- h_{j,1,i}^T = e_i^T G_[j]: the i-th row of G_[j];
- C = max{ max_j |h_{j,0}^T X_j| / ||h_{j,0}||_2 , max_{i,j} |h_{j,1,i}^T X_j| / ||h_{j,1,i}||_2 }.

Further Assumptions
The last assumption is
A5 E C^8 = O(polyLog(n)).
It turns out that when rho(x) = x^2/2,
C ~ max_j ||e_j^T (X^T X)^{-1} X^T||_infinity / ||e_j^T (X^T X)^{-1} X^T||_2.
Recall that for least squares, the beta_hat_j are all asymptotically normal iff the right-hand side tends to 0. This indicates that assumption A5 is not just an artifact of the proof.

Further Assumptions
Let alpha_{j,0} = h_{j,0} / ||h_{j,0}||_2 and alpha_{j,1,i} = h_{j,1,i} / ||h_{j,1,i}||_2. Again, if X_j is a realization of a random vector Z_j with i.i.d. sigma^2-sub-Gaussian entries, then alpha_{j,0}^T Z_j and alpha_{j,1,i}^T Z_j are all sigma^2-sub-Gaussian. Then C is the maximum of np + p sub-Gaussian random variables, and hence E C^8 = O(polyLog(n)).

Review of All Assumptions
A1 rho(0) = psi(0) = 0 and, for any x in R, 0 < K_0 <= psi'(x) <= K_1 and |psi''(x)| <= K_2;
A2 eps has independent entries with eps_i ~ L(c_1, c_2);
A3 letting lambda_+ and lambda_- be the largest and smallest eigenvalues of X^T X / n, lambda_+ = O(1) and lambda_- = Omega(1);
A4 min_j X_j^T Q_j X_j / tr(Q_j) = Omega(1);
A5 E C^8 = O(polyLog(n)).

Main Results
Theorem 5. Under assumptions A1-A5, as p/n -> kappa for some kappa in (0, 1) while n -> infinity,
max_j d_TV( L( (beta_hat_j - E beta_hat_j) / sqrt(Var(beta_hat_j)) ), N(0, 1) ) = o(1).

A Corollary
If we further assume that
A6 rho is an even function and eps_i =_d -eps_i,
then one can show that beta_hat is unbiased. As a consequence:
Theorem 6. Under assumptions A1-A6, as p/n -> kappa for some kappa in (0, 1) while n -> infinity,
max_j d_TV( L( (beta_hat_j - beta*_j) / sqrt(Var(beta_hat_j)) ), N(0, 1) ) = o(1).

Part 4: Numerical Results

Setup
Design matrix X:
- (i.i.d. design) X_ij i.i.d. ~ F;
- (partial Hadamard design) a matrix formed by a random set of p columns of an n x n Hadamard matrix.
Entry distribution F: F = N(0, 1) or F = t_2.
Error distribution L(eps): the eps_i are i.i.d. with eps_i ~ N(0, 1) or eps_i ~ t_2.

Setup
Sample size n: {100, 200, 400, 800}; kappa = p/n: {0.5, 0.8}.
Loss function rho: Huber loss with k = 1.345,
rho(x) = x^2/2 if |x| <= k, and rho(x) = k|x| - k^2/2 if |x| > k.

Asymptotic Normality of a Single Coordinate
For each set of parameters, we run 50 simulations, each consisting of the following steps:
(Step 1) Generate one design matrix X;
(Step 2) Generate 300 error vectors eps;
(Step 3) Regress each Y = eps on the design matrix X, yielding 300 random samples of beta_hat_1, denoted by beta_hat_1^(1), ..., beta_hat_1^(300);
(Step 4) Estimate the standard deviation of beta_hat_1 by the sample standard error sd_hat;
(Step 5) Construct a confidence interval I^(k) = [beta_hat_1^(k) - 1.96 sd_hat, beta_hat_1^(k) + 1.96 sd_hat] for each k = 1, ..., 300;
(Step 6) Calculate the empirical 95% coverage as the proportion of confidence intervals that cover the true beta*_1 = 0.
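A compact sketch of Steps 1-6 for a single parameter setting, assuming a Gaussian i.i.d. design and the Huber fit from the earlier sketches; the helper coverage is illustrative and uses fewer repetitions than the talk, for speed.

```python
import numpy as np
from scipy.optimize import minimize

k = 1.345
rho = lambda r: np.where(np.abs(r) <= k, 0.5 * r**2, k * (np.abs(r) - 0.5 * k))

def fit(X, Y):
    obj = lambda b: rho(Y - X @ b).mean()
    b0 = np.linalg.lstsq(X, Y, rcond=None)[0]
    return minimize(obj, b0, method="BFGS").x

def coverage(n=100, kappa=0.5, err="normal", reps=300, seed=0):
    """Steps 1-6 for one design: empirical coverage of the 95% interval for beta*_1 = 0."""
    rng = np.random.default_rng(seed)
    p = int(n * kappa)
    X = rng.standard_normal((n, p))                      # Step 1: one fixed design
    beta1 = np.empty(reps)
    for r in range(reps):                                # Steps 2-3
        eps = rng.standard_normal(n) if err == "normal" else rng.standard_t(2, n)
        beta1[r] = fit(X, eps)[0]
    sd = beta1.std(ddof=1)                               # Step 4
    covered = np.abs(beta1) <= 1.96 * sd                 # Steps 5-6: intervals covering 0
    return covered.mean()

print(coverage(n=100, kappa=0.5, err="normal", reps=100))
```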

Asymptotic Normality of a Single Coordinate
[Figure: empirical 95% coverage of beta*_1 for kappa = 0.5 and kappa = 0.8; panels for i.i.d. and Hadamard designs, normal and t(2) entry/error distributions, sample sizes 100-800.]

Conclusion
- We establish the coordinate-wise asymptotic normality of the M-estimator for certain fixed design matrices in the moderate p/n regime, under regularity conditions on X, L(eps) and rho but with no conditions on beta*;
- We prove the result using a novel approach, the second-order Poincaré inequality (Chatterjee, 2009);
- We show that the regularity conditions are satisfied by a broad class of designs.

Future Work
Future work for this project:
- Estimate Var(beta_hat_j);
- Relax the assumptions on L(eps);
- Relax the strong convexity of rho;
- Extend the results to GLMs.
Future work for my dissertation:
- Distributional properties in high dimensions;
- Resampling methods in high dimensions.

Thank You!

References I
Bean, D., Bickel, P. J., El Karoui, N., & Yu, B. (2013). Optimal M-estimation in high-dimensional regression. Proceedings of the National Academy of Sciences, 110(36), 14563-14568.
Bickel, P. J., & Freedman, D. A. (1982). Bootstrapping regression models with many parameters. Festschrift for Erich L. Lehmann, 28-48.
Chatterjee, S. (2009). Fluctuations of eigenvalues and second order Poincaré inequalities. Probability Theory and Related Fields, 143(1-2), 1-40.
El Karoui, N. (2013). Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators: rigorous results. arXiv preprint arXiv:1311.2445.
El Karoui, N. (2015). On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators.

References II
El Karoui, N., Bean, D., Bickel, P. J., Lim, C., & Yu, B. (2011). On robust regression with high-dimensional predictors. Proceedings of the National Academy of Sciences, 110(36), 14557-14562.
Huber, P. J. (1973). Robust regression: asymptotics, conjectures and Monte Carlo. The Annals of Statistics, 799-821.
Portnoy, S. (1984). Asymptotic behavior of M-estimators of p regression parameters when p^2/n is large. I. Consistency. The Annals of Statistics, 1298-1309.
Portnoy, S. (1985). Asymptotic behavior of M-estimators of p regression parameters when p^2/n is large; II. Normal approximation. The Annals of Statistics, 1403-1417.