Department of Statistics, UC Berkeley
Stanford-Berkeley Joint Colloquium, 2015
Table of Contents
1 Background
2 Main Results
3 OLS: A Motivating Example
Setup

Consider a linear regression model:
\[
Y_i = X_i^T \beta_0 + \epsilon_i, \quad i = 1, 2, \ldots, n.
\]
Here $Y_i \in \mathbb{R}$, $X_i \in \mathbb{R}^p$, $\beta_0 \in \mathbb{R}^p$ and $\epsilon_i \in \mathbb{R}$.

OLS estimator ($p < n$):
\[
\hat\beta_{OLS} = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n (y_i - x_i^T \beta)^2;
\]
M estimator ($p < n$):
\[
\hat\beta = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n \rho(y_i - x_i^T \beta).
\]
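To make the two definitions concrete, here is a minimal Python sketch (ours, not from the talk) that fits both estimators on simulated data; the Huber loss stands in for a generic $\rho$, and every name in it is illustrative.

```python
# Minimal sketch: OLS vs. an M estimator (Huber loss as an example rho).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.standard_normal((n, p))
beta0 = rng.standard_normal(p)
eps = rng.standard_t(df=3, size=n)        # heavy-ish tails favor the M estimator
y = X @ beta0 + eps

# OLS: closed form (X^T X)^{-1} X^T y
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

def huber(r, delta=1.345):
    """rho(r): quadratic near 0, linear in the tails."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

def m_loss(beta):
    return huber(y - X @ beta).mean()

beta_m = minimize(m_loss, x0=beta_ols, method="BFGS").x
print(np.linalg.norm(beta_ols - beta0), np.linalg.norm(beta_m - beta0))
```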
Background

- The limiting behavior of $\hat\beta$ when $p$ is fixed:
\[
\mathcal{L}(\hat\beta) \approx N\left(\beta_0,\; (X^T X)^{-1} \frac{E(\psi^2(\epsilon))}{[E\psi'(\epsilon)]^2}\right);
\]
- Huber (1973) raised the question of understanding the behavior of $\hat\beta$ when $p \to \infty$;
- Huber (1973) proved that $\hat\beta_{OLS}$ is jointly asymptotically normal iff
\[
\kappa = \max_i \big(X (X^T X)^{-1} X^T\big)_{i,i} \to 0,
\]
which requires $p/n \to 0$.
Background

- Portnoy (1984, 1985, 1986, 1987) proved the joint asymptotic normality of $\hat\beta$ in the case $(p \log n)^{3/2} / n \to 0$;
- Mammen (1989) provided an expansion for $\hat\beta$ (which leads to joint asymptotic normality) by assuming $\kappa\, n^{1/3} (\log n)^{2/3} \to 0$.
- All of these works are based on fixed designs but require $p/n \to 0$.
Background

- El Karoui et al. (2011, 2013) and Bean et al. (2013) established the joint asymptotic normality of $\hat\beta$ in the regime $p/n \to \kappa \in (0, 1)$ by assuming a random design $X$ with i.i.d. Gaussian entries;
- Zhang and Zhang (2014) and van de Geer et al. (2014) proved the partial asymptotic normality of the LASSO estimator by assuming a fixed design and imposing a sparsity condition on $\beta_0$: $\|\beta_0\|_0 \log p / \sqrt{n} \to 0$.
Main Research Question and Our Contributions

Suppose $p/n \to \kappa \in (0, 1)$ and the design matrix $X$ is fixed. Can we make inference on a coordinate (or a low-dimensional linear contrast) of $\hat\beta$?

YES! In this work, we
- prove the coordinatewise asymptotic normality of $\hat\beta$ in the regime $p/n \to \kappa \in (0, 1)$ for fixed designs;
- show that the conditions on the fixed design matrix are satisfied by a broad class of random designs.
2 Main Results
Ridge-Regularized M Estimator

Consider the ridge-regularized M estimator
\[
\hat\beta = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n \rho(y_i - x_i^T \beta) + \frac{\tau}{2} \|\beta\|^2.
\]
Assume that $\rho \in C^2$ is a convex function with $\psi = \rho'$ and $\beta_0 = 0$; then the first-order condition implies that
\[
\sum_{i=1}^n x_i \psi(\epsilon_i - x_i^T \hat\beta) = n \tau \hat\beta.
\]
In most cases there is no closed-form solution, and $\hat\beta$ is an implicit function of $\epsilon$.
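Since $\hat\beta$ is only defined implicitly, one standard way to compute it is Newton's method applied to the first-order condition above. The sketch below is a hedged illustration, not the authors' code: the pseudo-Huber loss serves as an example $\rho$, and `ridge_m_estimator`, `psi`, and `dpsi` are hypothetical names.

```python
# Sketch: Newton's method on the first-order condition
#   (1/n) sum_i x_i psi(y_i - x_i^T beta) = tau * beta.
import numpy as np

def psi(r):          # psi = rho' for the pseudo-Huber loss rho(r) = sqrt(1+r^2) - 1
    return r / np.sqrt(1.0 + r**2)

def dpsi(r):         # psi'(r) = (1 + r^2)^{-3/2}
    return (1.0 + r**2) ** (-1.5)

def ridge_m_estimator(X, y, tau=0.5, n_iter=50, tol=1e-10):
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        r = y - X @ beta
        grad = tau * beta - X.T @ psi(r) / n               # gradient of the objective
        hess = tau * np.eye(p) + (X.T * dpsi(r)) @ X / n   # tau*I + X^T diag(psi'(r)) X / n
        step = np.linalg.solve(hess, grad)
        beta -= step
        if np.linalg.norm(step) < tol:
            break
    return beta

# Usage: with beta_0 = 0 the response is pure noise, y = eps.
rng = np.random.default_rng(0)
n, p = 400, 100
beta_hat = ridge_m_estimator(rng.standard_normal((n, p)), rng.standard_normal(n))
print(np.linalg.norm(beta_hat))
```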
Final Conclusions

Theorem 1. Under Assumptions A1-A4, $\hat\beta$ is coordinatewise asymptotically normal in the sense that
\[
\max_j d_{TV}\left( \mathcal{L}\left( \frac{\hat\beta_j - E_\epsilon \hat\beta_j}{\sqrt{\operatorname{Var}_\epsilon(\hat\beta_j)}} \right),\; N(0, 1) \right) = O\left( \frac{\mathrm{PolyLog}(n)}{\sqrt{n}} \right).
\]
Assumptions: Loss Function

Assumption A1: Let $\psi = \rho'$. For any $x$,
- $0 < D_0 \le \psi'(x) \le D_1 (|x| \vee 1)^{m_1}$;
- $|\psi''(x)| \le D_2 (|x| \vee 1)^{m_2}$;
- $|\psi'''(x)| \le D_3 (|x| \vee 1)^{m_3}$;
- $\max\{D_0^{-1}, D_1, D_2, D_3\} = O(\mathrm{PolyLog}(n))$; $\max\{m_1, m_2, m_3\} = O(1)$.
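As a concrete illustration (our example, not from the slides), a strongly convex smoothing of the pseudo-Huber loss satisfies A1 with constants that do not depend on $n$ at all:

```latex
% Illustrative example (not from the slides): a loss satisfying A1.
\[
\rho(x) = \frac{x^2}{2} + \sqrt{1 + x^2} - 1,
\qquad
\psi(x) = \rho'(x) = x + \frac{x}{\sqrt{1 + x^2}}.
\]
% All derivatives of psi are globally bounded:
\[
\psi'(x) = 1 + (1 + x^2)^{-3/2} \in [1, 2],
\qquad
\psi''(x) = -3x (1 + x^2)^{-5/2}, \quad |\psi''(x)| \le 1,
\]
\[
\psi'''(x) = -3 (1 + x^2)^{-5/2} + 15 x^2 (1 + x^2)^{-7/2}, \quad |\psi'''(x)| \le 5.
\]
% Hence A1 holds with D_0 = 1, D_1 = 2, D_2 = 1, D_3 = 5 and
% m_1 = m_2 = m_3 = 0: all O(1), and a fortiori O(PolyLog(n)).
```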
Assumptions: Error Distribution

Assumption A2: The errors $\epsilon$ are transformations of i.i.d. Gaussian random variables, i.e. $\epsilon_i = u_i(\nu_i)$, where $\nu_i \overset{i.i.d.}{\sim} N(0, 1)$;
- $\|u_i'\|_\infty \le c_1$, $\|u_i''\|_\infty \le c_2$ (e.g. $u_i(x) = \sigma x$ yields Gaussian errors);
- $\max\{c_1, c_2\} = O(\mathrm{PolyLog}(n))$.
Assumptions: Design Matrix

Assumption A3: For the design matrix $X$,
- $\max_{i,j} |X_{ij}| = O(\mathrm{PolyLog}(n))$;
- $\lambda_{\max}\!\left( \frac{X^T X}{n} \right) = O(\mathrm{PolyLog}(n))$;
- $\left\| \frac{1}{n} \sum_{i=1}^n x_i \right\| = O(\mathrm{PolyLog}(n))$, where $x_i$ is the $i$-th row.
Assumptions: Linear Concentration Property

Assumption A4: Let $x_i$ be the $i$-th row of $X$ and $X_j$ be the $j$-th column of $X$. There exist two sequences of unit vectors $\{\alpha_{k,i} \in \mathbb{R}^p : k = 1, \ldots, N_n^{(1)};\; i = 1, \ldots, n\}$ and $\{\gamma_{r,j} \in \mathbb{R}^n : r = 1, \ldots, N_n^{(2)};\; j = 1, \ldots, p\}$ (with explicit forms, omitted here for concision) such that
- $\max\{N_n^{(1)}, N_n^{(2)}\} = O(n^2)$;
- $\alpha_{k,i}$ relies only on $\epsilon$ and $x_{i'}$ for $i' \neq i$; $\gamma_{r,j}$ relies only on $\epsilon$ and $X_{j'}$ for $j' \neq j$;
- $E_\epsilon \max_{k,i} |\alpha_{k,i}^T x_i| = O(\mathrm{PolyLog}(n))$;
- $E_\epsilon \max_{r,j} |\gamma_{r,j}^T X_j| = O(\mathrm{PolyLog}(n))$.
Illustration of Assumption A4

Consider i.i.d. standard Gaussian designs, i.e. $X_{ij} \overset{i.i.d.}{\sim} N(0, 1)$ with $X \perp \epsilon$.
- For given $k$ and $i$, $\alpha_{k,i} \perp x_i$, so $\alpha_{k,i}^T x_i \sim N(0, 1)$.
- Then $E_{\epsilon, X} \max_{k,i} |\alpha_{k,i}^T x_i|$ is the expected maximum of $n N_n^{(1)}$ standard Gaussian random variables, and hence
\[
E_{\epsilon, X} \max_{k,i} |\alpha_{k,i}^T x_i| \lesssim \sqrt{\log\big(n N_n^{(1)}\big)} = O(\mathrm{PolyLog}(n)).
\]
- By Markov's inequality,
\[
E_\epsilon \max_{k,i} |\alpha_{k,i}^T x_i| = O_p\Big( E_{\epsilon, X} \max_{k,i} |\alpha_{k,i}^T x_i| \Big) = O_p(\mathrm{PolyLog}(n)).
\]
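A quick numerical sanity check (ours, with arbitrary loop sizes) of the $\sqrt{\log}$ growth invoked above:

```python
# The maximum of N standard Gaussians concentrates near sqrt(2 log N),
# which is PolyLog(n) when N = O(n^2).
import numpy as np

rng = np.random.default_rng(1)
for n in [100, 500, 2000]:
    N = n**2                       # mirrors max{N_n^(1), N_n^(2)} = O(n^2)
    sample_max = np.abs(rng.standard_normal(N)).max()
    print(n, round(sample_max, 3), round(np.sqrt(2 * np.log(N)), 3))
```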
3 OLS: A Motivating Example
Lindeberg-Feller Condition

Assume that $p/n \to \kappa \in (0, 1)$, $p < n$ and $\beta_0 = 0$. Denote
\[
\hat\beta^{OLS}_p = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n (y_i - x_i^T \beta)^2 = (X^T X)^{-1} X^T \epsilon;
\]
then each coordinate of $\hat\beta^{OLS}_p$ is a linear contrast of $\epsilon$.

Proposition 1 (Lindeberg-Feller Condition). Suppose $\epsilon_n = (\epsilon_1, \ldots, \epsilon_n)^T$ has i.i.d. zero-mean components with variance $\sigma^2$. If $\|c_n\|_\infty / \|c_n\|_2 \to 0$, where $c_n = (c_{n,1}, \ldots, c_{n,k_n})^T$, then
\[
\frac{c_n^T \epsilon_n}{\|c_n\|_2} \overset{d}{\to} N(0, \sigma^2).
\]
Lindeberg-Feller Condition Note that ˆβ OLS p,j p = e T j p (X T X ) 1 X T ɛ c T p,j p ɛ. For given matrix X R n p, define H(X ) then for any j p {1,..., p}, and this leads to Theorem 2. ˆβ OLS p is c.a.s.n. if H(X p ) 0. ej T (X T X ) 1 X T max j=1,...,p ej T (X T X ) 1 X T, 2 c T p,j p c T p,j p 2 H(X p )
Random Designs

We prove that $H(X_p) \to 0$ for a broad class of random designs.

Theorem 3. Let $X \in \mathbb{R}^{n \times p}$ be a random matrix with independent zero-mean entries such that $\sup_{i,j} E|X_{ij}|^{8 + \delta} \le M$ for some constants $M$ and $\delta > 0$. Assume that $X$ has full column rank almost surely and $\operatorname{Var}(X_{ij}) > \tau^2$ for some $\tau > 0$. Then
\[
H(X) = O_p(n^{-1/4}),
\]
provided $\limsup p/n < 1$.
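A hedged simulation of Theorem 3's conclusion, assuming a standard Gaussian design and $\kappa = 1/2$ (our choices); `H` is the same functional as in the previous sketch.

```python
# H(X) for Gaussian designs with p = n/2 shrinks as n grows;
# Theorem 3 guarantees the rate O_p(n^{-1/4}).
import numpy as np

def H(X):
    C = np.linalg.solve(X.T @ X, X.T)
    return (np.abs(C).max(axis=1) / np.linalg.norm(C, axis=1)).max()

rng = np.random.default_rng(3)
for n in [200, 800, 3200]:
    p = n // 2                          # p/n -> kappa = 1/2 in (0, 1)
    X = rng.standard_normal((n, p))
    print(n, round(H(X), 4), round(n ** -0.25, 4))
```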
Future Work

- Extend to heavy-tailed errors, e.g. $\epsilon_i \sim$ Cauchy;
- Explore more general random designs that satisfy A3 and A4;
- Calculate the bias $E_\epsilon \hat\beta_j$ and the variance $\operatorname{Var}_\epsilon(\hat\beta_j)$;
- Prove the result for the unregularized M estimator, i.e. $\tau = 0$;
- Extend to low-dimensional linear contrasts of $\hat\beta$, i.e. $\alpha^T \hat\beta$ with $\|\alpha\|_0 = o(n)$.
Thank You!