Fixed Design Results

Lihua Lei
Advisors: Peter J. Bickel, Michael I. Jordan
Joint work with Peter J. Bickel and Noureddine El Karoui

Dec. 8, 2016
Table of Contents

1 Background
2 Main Results and Examples
3 Assumptions and Proof Sketch
4 Numerical Results

Background
Setup

Observe $\{x_1, y_1\}, \{x_2, y_2\}, \ldots, \{x_n, y_n\}$:
- response vector $Y = (y_1, \ldots, y_n)^T \in \mathbb{R}^n$;
- design matrix $X = (x_1^T, \ldots, x_n^T)^T \in \mathbb{R}^{n \times p}$.

Model (linear model): $Y = X\beta + \epsilon$, with $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T \in \mathbb{R}^n$ a random vector.
M-Estimator

Given a convex loss function $\rho(\cdot): \mathbb{R} \to [0, \infty)$,
$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n \rho(y_i - x_i^T \beta).$$

When $\rho$ is differentiable with $\psi = \rho'$, $\hat\beta$ can be characterized as the solution of the estimating equation
$$\frac{1}{n} \sum_{i=1}^n \psi(y_i - x_i^T \hat\beta)\, x_i = 0.$$
M-Estimator: Examples

- $\rho(x) = x^2/2$ gives the Least-Squares estimator;
- $\rho(x) = |x|$ gives the Least-Absolute-Deviation estimator;
- $\rho(x) = \begin{cases} x^2/2 & |x| \le k \\ k(|x| - k/2) & |x| > k \end{cases}$ gives the Huber estimator.

[Figure: $\rho(x)$ (top row) and $\psi(x)$ (bottom row) for the L2, L1, and Huber losses.]
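As a concrete companion to these definitions, here is a minimal numerical sketch (not from the talk) that encodes the Huber loss and its derivative and computes the M-estimator by direct convex minimization; the function names and the scipy-based solver are our own choices.

```python
import numpy as np
from scipy.optimize import minimize

k = 1.345  # Huber threshold; the value used later in the talk

def rho(x):
    """Huber loss: quadratic near 0, linear in the tails."""
    return np.where(np.abs(x) <= k, x**2 / 2, k * (np.abs(x) - k / 2))

def psi(x):
    """psi = rho': identity near 0, clipped at +/- k."""
    return np.clip(x, -k, k)

def m_estimate(X, y):
    """Minimize (1/n) sum rho(y_i - x_i^T beta); the estimating equation
    (1/n) sum psi(y_i - x_i^T beta) x_i = 0 is its first-order condition."""
    n, p = X.shape
    obj = lambda b: np.mean(rho(y - X @ b))
    grad = lambda b: -X.T @ psi(y - X @ b) / n
    return minimize(obj, np.zeros(p), jac=grad).x

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = rng.standard_t(df=2, size=100)   # beta = 0, heavy-tailed errors
beta_hat = m_estimate(X, y)
```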
Goals (Informal)

Goal (informal): make inference on the coordinates of $\hat\beta$ when
- the dimension p is comparable to the sample size n;
- X is treated as fixed;
- no assumptions are placed on $\beta$.

Consider $\beta_1$ WLOG:
- given X and $\mathcal{L}(\epsilon)$, $\mathcal{L}(\hat\beta_1)$ is uniquely determined;
- ideally, we would construct a 95% confidence interval for $\beta_1$ as
$$\left[ q_{0.025}\big(\mathcal{L}(\hat\beta_1)\big),\; q_{0.975}\big(\mathcal{L}(\hat\beta_1)\big) \right],$$
where $q_\alpha$ denotes the $\alpha$-th quantile;
- unfortunately, $\mathcal{L}(\hat\beta_1)$ is complicated.
Asymptotic Arguments

Exact finite-sample inference is hard. This motivates statisticians to resort to asymptotic arguments, i.e. to find a distribution F such that $\mathcal{L}(\hat\beta_1) \approx F$.

The limiting behavior of $\hat\beta$ when p is fixed and $n \to \infty$:
$$\mathcal{L}(\hat\beta) \approx N\left(\beta,\; (X^TX)^{-1}\, \frac{E\psi^2(\epsilon_1)}{[E\psi'(\epsilon_1)]^2}\right).$$

As a consequence, we obtain an approximate 95% confidence interval for $\beta_1$:
$$\left[\hat\beta_1 - 1.96\,\mathrm{sd}(\hat\beta_1),\; \hat\beta_1 + 1.96\,\mathrm{sd}(\hat\beta_1)\right],$$
where $\mathrm{sd}(\hat\beta_1)$ can be any consistent estimator of the standard deviation.
Asymptotic Arguments

In other words, to approximate $\mathcal{L}(\hat\beta_1)$, we consider a sequence of hypothetical problems, indexed by j, where the j-th problem has sample size $n_j$ and dimension $p_j = p$.

For the j-th problem, denote by $\hat\beta^{(j)}$ the corresponding M-estimator; the previous slide then uses $\lim_{j\to\infty} \mathcal{L}(\hat\beta_1^{(j)})$ to approximate $\mathcal{L}(\hat\beta_1)$.

In general, $p_j$ is not necessarily fixed and can grow to infinity.
Asymptotic Arguments

- Huber (1973) raised the question of understanding the behavior of $\hat\beta$ when both n and p tend to infinity;
- Huber (1973) showed the $L_2$ consistency of $\hat\beta$, i.e. $\|\hat\beta - \beta\|_2^2 \to 0$, under the regime $p^3/n \to 0$;
- Portnoy (1984) proved the $L_2$ consistency of $\hat\beta$ under the regime $p \log p / n \to 0$.
Asymptotic Arguments

Portnoy (1985) showed that $\hat\beta$ is jointly asymptotically normal under the regime $(p \log n)^{3/2}/n \to 0$, in the sense that for any sequence of vectors $a_n \in \mathbb{R}^p$,
$$\mathcal{L}\left(\frac{a_n^T(\hat\beta - \beta)}{\sqrt{\operatorname{Var}(a_n^T \hat\beta)}}\right) \to N(0, 1).$$
p/n: A Measure of Difficulty

All of the above works require $p/n \to 0$, i.e. $n/p \to \infty$. Here $n/p$ is the number of samples per parameter; heuristically, a larger $n/p$ gives an easier problem.
p/n: A Measure of Difficulty

Recall that the approximation can be seen as a sequence of hypothetical problems with sample size $n_j$ and dimension $p_j$. If $n_j/p_j \to \infty$, the problems become increasingly easier as j grows. In other words, the hypothetical problems used for the approximation are much easier than the original problem, and the approximation accuracy might be compromised.
Moderate p/n Regime

Instead, we can consider a sequence of hypothetical problems with $p_j/n_j$ fixed at the value of the original problem, i.e. $p_j/n_j \equiv p/n$. In this case, the difficulty of the problem is held fixed.
Moderate p/n Regime

Formally, we define the moderate p/n regime as $p_j/n_j \to \kappa > 0$. A typical value for $\kappa$ is p/n in the original problem.
Moderate p/n Regime: More Informative Asymptotics

Consider a set of small-sample problems with n = 50 and $p = n\kappa$ for $\kappa \in \{0.1, \ldots, 0.9\}$. For each pair (n, p):

Step 1 Generate $X \in \mathbb{R}^{n \times p}$ with i.i.d. N(0, 1) entries;
Step 2 Fix $\beta = 0$ and sample $Y = \epsilon$ with $\epsilon_i$ i.i.d. N(0, 1) or i.i.d. $t_2$;
Step 3 Estimate $\beta_1$ by $\hat\beta_1$ with the Huber loss;
Step 4 Repeat Steps 2-3 100 times and estimate $\mathcal{L}(\hat\beta_1)$.
Moderate p/n Regime: More Informative Asymptotics

Now consider two types of approximations:
- Fixed-p approx.: N = 1000, P = p;
- Moderate-p/n approx.: N = 1000, P = 1000κ.

Repeat Steps 1-4 for the new pairs (N, P) and estimate $\mathcal{L}(\hat\beta_1^F)$ (fixed p) and $\mathcal{L}(\hat\beta_1^M)$ (moderate p/n).

Measure the accuracy of the two approximations by the Kolmogorov-Smirnov statistics
$$d_{KS}\big(\mathcal{L}(\hat\beta_1), \mathcal{L}(\hat\beta_1^F)\big) \quad\text{and}\quad d_{KS}\big(\mathcal{L}(\hat\beta_1), \mathcal{L}(\hat\beta_1^M)\big).$$
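A minimal sketch of this experiment (our own code, not the talk's) follows. One detail the slides leave implicit: the small-sample and large-sample estimators live on different scales, so here we compare the laws of $\sqrt{n}\,\hat\beta_1$.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import ks_2samp

K = 1.345
rho = lambda x: np.where(np.abs(x) <= K, x**2 / 2, K * (np.abs(x) - K / 2))
psi = lambda x: np.clip(x, -K, K)

def beta1_samples(n, p, n_rep, seed):
    """Steps 1-4: fix one Gaussian design, resample errors, and return
    sqrt(n) * beta_hat_1 so that different n's are on a common scale."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))                       # Step 1
    obj = lambda b, e: np.mean(rho(e - X @ b))
    grad = lambda b, e: -X.T @ psi(e - X @ b) / n
    out = np.empty(n_rep)
    for r in range(n_rep):
        eps = rng.standard_normal(n)                      # Step 2: Y = eps
        out[r] = minimize(obj, np.zeros(p), args=(eps,), jac=grad).x[0]
    return np.sqrt(n) * out

kappa, n, N = 0.5, 50, 1000
small    = beta1_samples(n, int(n * kappa), 100, seed=0)
fixed_p  = beta1_samples(N, int(n * kappa), 100, seed=1)  # P = p
moderate = beta1_samples(N, int(N * kappa), 100, seed=2)  # P = N * kappa
print("KS, fixed-p approx.:     ", ks_2samp(small, fixed_p).statistic)
print("KS, moderate-p/n approx.:", ks_2samp(small, moderate).statistic)
```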
Moderate p/n Regime: More Informative Asymptotics

[Figure: "Distance between the small sample and large sample distribution." Kolmogorov-Smirnov statistic between the small-sample (n = 50) and large-sample (N = 1000) distributions of $\hat\beta_1$ as a function of $\kappa$, for normal and t(2) errors, under the two asymptotic regimes (p fixed vs. p/n fixed).]
Moderate p/n Regime: Negative Results

The moderate p/n regime has been widely studied in random matrix theory. In statistics:
- Huber (1973) showed that for least-squares estimators there always exists a sequence of vectors $a_n \in \mathbb{R}^p$ such that
$$\mathcal{L}\left(\frac{a_n^T(\hat\beta_{LS} - \beta)}{\sqrt{\operatorname{Var}(a_n^T \hat\beta_{LS})}}\right) \not\to N(0, 1);$$
- Bickel and Freedman (1982) showed that the bootstrap fails in the least-squares case and the usual rescaling does not help;
- El Karoui et al. (2011) showed that for general loss functions, $\|\hat\beta - \beta\|_2^2 \not\to 0$.

Main reason: $\hat F_n$, the empirical distribution of the residuals $R_i \triangleq y_i - x_i^T\hat\beta$, does not converge to $\mathcal{L}(\epsilon_i)$.
Moderate p/n Regime: Positive Results

If X is assumed to be a random matrix satisfying regularity conditions:
- Bean et al. (2013) showed that when X has i.i.d. Gaussian entries, for any sequence of $a_n \in \mathbb{R}^p$,
$$\mathcal{L}_{X,\epsilon}\left(\frac{a_n^T(\hat\beta - \beta)}{\sqrt{\operatorname{Var}_{X,\epsilon}(a_n^T \hat\beta)}}\right) \to N(0, 1);$$
- the above result does not contradict Huber (1973), in that the randomness comes from both X and $\epsilon$;
- under weaker assumptions on X, El Karoui (2015) showed
$$\mathcal{L}_{X,\epsilon}\left(\frac{\hat\beta_1(\tau) - \beta_1 - \mathrm{bias}(\hat\beta_1(\tau))}{\sqrt{\operatorname{Var}_{X,\epsilon}(\hat\beta_1(\tau))}}\right) \to N(0, 1),$$
where $\hat\beta(\tau)$ is the ridge-penalized M-estimator.
Moderate p/n Regime: Summary

- Provides a more accurate approximation of $\mathcal{L}(\hat\beta_1)$;
- Qualitatively different from the classical regimes where $p/n \to 0$:
  - $L_2$-consistency of $\hat\beta$ no longer holds;
  - the residuals $R_i$ behave differently from $\epsilon_i$;
  - fixed-design results differ from random-design results.
- Inference on the vector $\hat\beta$ is hard, but inference on coordinates / low-dimensional linear contrasts of $\hat\beta$ is still possible.
Goals (Formal)

Our goal (formal): under the linear model $Y = X\beta + \epsilon$, derive the asymptotic distribution of the coordinates $\hat\beta_j$:
- under the moderate p/n regime, i.e. $p/n \to \kappa \in (0, 1)$;
- with a fixed design matrix X;
- without assumptions on $\beta$.
Main Results and Examples
Main Result (Informal)

Definition 1. Let P and Q be two distributions on $\mathbb{R}^p$; then
$$d_{TV}(P, Q) = \sup_{A \subset \mathbb{R}^p} |P(A) - Q(A)|.$$

Theorem. Under appropriate conditions on the design matrix X, the distribution of $\epsilon$, and the loss function $\rho$, as $p/n \to \kappa \in (0, 1)$ while $n \to \infty$,
$$\max_j\; d_{TV}\left(\mathcal{L}\left(\frac{\hat\beta_j - E\hat\beta_j}{\sqrt{\operatorname{Var}(\hat\beta_j)}}\right),\; N(0, 1)\right) = o(1).$$
Examples: Realizations of i.i.d. Designs

We consider the case where X is a realization of a random design Z. The examples below are proved to satisfy the technical assumptions with high probability over Z.

Example 1. Z has i.i.d. mean-zero sub-Gaussian entries with $\operatorname{Var}(Z_{ij}) = \tau^2 > 0$.

Example 2. Z contains an intercept term, i.e. $Z = (1, \tilde Z)$, and $\tilde Z \in \mathbb{R}^{n \times (p-1)}$ has independent sub-Gaussian entries with $\tilde Z_{ij} - \mu_j \overset{d}{=} \mu_j - \tilde Z_{ij}$ and $\operatorname{Var}(\tilde Z_{ij}) > \tau^2$, for arbitrary $\mu_j$.
Examples: Realizations of Dependent Gaussian Designs

Example 3. Z is matrix-normal with $\operatorname{vec}(Z) \sim N(0, \Lambda \otimes \Sigma)$ and
$$\lambda_{\max}(\Lambda), \lambda_{\max}(\Sigma) = O(1), \qquad \lambda_{\min}(\Lambda), \lambda_{\min}(\Sigma) = \Omega(1).$$

Example 4. Z contains an intercept term, i.e. $Z = (1, \tilde Z)$, with $\operatorname{vec}(\tilde Z) \sim N(0, \Lambda \otimes \Sigma)$ where $\Lambda$ and $\Sigma$ satisfy the above condition and
$$\frac{\max_i (\Lambda^{1/2}\mathbf{1})_i}{\min_i (\Lambda^{1/2}\mathbf{1})_i} = O(1).$$
A Counter-Example

Consider a one-way ANOVA situation: each observation i is associated with a label $k_i \in \{1, \ldots, p\}$, and $X_{i,j} = I(j = k_i)$. This is equivalent to $Y_i = \beta_{k_i} + \epsilon_i$.

It is easy to see that
$$\hat\beta_j = \arg\min_{\beta_j \in \mathbb{R}} \sum_{i: k_i = j} \rho(Y_i - \beta_j).$$
This is a standard location problem.
A Counter-Example

Let $n_j = |\{i : k_i = j\}|$. In the least-squares case, i.e. $\rho(x) = x^2/2$,
$$\hat\beta_j = \beta_j + \frac{1}{n_j} \sum_{i: k_i = j} \epsilon_i.$$

Assume a balanced design, i.e. $n_j \approx n/p$. Then in the moderate p/n regime $n_j \approx 1/\kappa$ stays bounded, so
- none of the $\hat\beta_j$ is asymptotically normal (unless the $\epsilon_i$ are normal);
- the same holds for general loss functions $\rho$.

Conclusion: some non-standard assumptions on X are required.
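A quick numerical illustration of the counter-example (our own sketch): with a balanced design and $n_j = 2$ observations per group, the least-squares $\hat\beta_j$ is the mean of just two errors and stays visibly non-normal for heavy-tailed $\epsilon_i$.

```python
import numpy as np
from scipy.stats import kstest

# Balanced one-way ANOVA with n_j = 2: the LS estimator of each group mean
# is the average of only two errors, so it never becomes normal.
rng = np.random.default_rng(0)
n_j, n_rep = 2, 5000
beta_hat = rng.standard_t(df=2, size=(n_rep, n_j)).mean(axis=1)
z = (beta_hat - beta_hat.mean()) / beta_hat.std()
print(kstest(z, "norm"))  # tiny p-value: normality clearly rejected
```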
Assumptions and Proof Sketch

- Least-Squares Estimator: A Motivating Example
- Second-Order Poincaré Inequality
- Assumptions
- Main Results
Least-Squares Estimator

The $L_2$ loss, $\rho(x) = x^2/2$, gives the least-squares estimator
$$\hat\beta_{LS} = (X^TX)^{-1}X^TY = \beta + (X^TX)^{-1}X^T\epsilon.$$

Let $e_j$ denote the j-th canonical basis vector in $\mathbb{R}^p$; then
$$\hat\beta_j^{LS} - \beta_j = e_j^T(X^TX)^{-1}X^T\epsilon.$$
Writing $\alpha_j^T = e_j^T(X^TX)^{-1}X^T$, we get
$$\hat\beta_j^{LS} - \beta_j = \sum_{i=1}^n \alpha_{j,i}\,\epsilon_i.$$
Least-Squares Estimator

The Lindeberg-Feller CLT implies that in order for
$$\frac{\hat\beta_j^{LS} - \beta_j}{\sqrt{\operatorname{Var}(\hat\beta_j^{LS})}} \overset{\mathcal{L}}{\to} N(0, 1),$$
it is sufficient and almost necessary that
$$\frac{\|\alpha_j\|_\infty}{\|\alpha_j\|_2} \to 0. \qquad (1)$$
Least-Squares Estimator

To see the necessity of the condition, recall the one-way ANOVA case. Let $n_j = |\{i : k_i = j\}|$; then $X^TX = \operatorname{diag}(n_j)_{j=1}^p$. This gives
$$\alpha_{j,i} = \begin{cases} 1/n_j & k_i = j \\ 0 & k_i \ne j. \end{cases}$$

As a result, $\|\alpha_j\|_\infty = 1/n_j$ and $\|\alpha_j\|_2 = 1/\sqrt{n_j}$, and hence
$$\frac{\|\alpha_j\|_\infty}{\|\alpha_j\|_2} = \frac{1}{\sqrt{n_j}}.$$
However, in the moderate p/n regime there exists j such that $n_j \le n/p \approx 1/\kappa$, and thus $\hat\beta_j^{LS}$ is not asymptotically normal.
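Condition (1) is easy to probe numerically; the sketch below (ours) contrasts a Gaussian design, where the ratio is small, with the balanced ANOVA design, where it is stuck at $1/\sqrt{n_j}$.

```python
import numpy as np

def linf_over_l2(X, j):
    """Ratio in condition (1) for alpha_j^T = e_j^T (X^T X)^{-1} X^T."""
    alpha_j = np.linalg.solve(X.T @ X, X.T)[j]
    return np.abs(alpha_j).max() / np.linalg.norm(alpha_j)

rng = np.random.default_rng(0)
n, p = 1000, 500
X_gauss = rng.standard_normal((n, p))
X_anova = np.kron(np.eye(p), np.ones((n // p, 1)))  # balanced ANOVA, n_j = 2

print(linf_over_l2(X_gauss, 0))   # small: normality plausible
print(linf_over_l2(X_anova, 0))   # = 1/sqrt(2): normality fails
```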
M-Estimator

The result for the LSE is derived from the analytical form of $\hat\beta_{LS}$. In contrast, no analytical form is available for general $\rho$.

Let $\psi = \rho'$; then $\hat\beta$ is the solution of
$$\frac{1}{n}\sum_{i=1}^n \psi(y_i - x_i^T\hat\beta)\, x_i = 0.$$
WLOG, assume $\beta = 0$; then
$$\frac{1}{n}\sum_{i=1}^n \psi(\epsilon_i - x_i^T\hat\beta)\, x_i = 0.$$
M-Estimator

Write $R_i$ for $\epsilon_i - x_i^T\hat\beta$ and define $D$, $\tilde D$ and $G$ as
$$D = \operatorname{diag}(\psi'(R_i)), \qquad \tilde D = \operatorname{diag}(\psi''(R_i)), \qquad G = I - X(X^TDX)^{-1}X^TD.$$

Lemma 2. Suppose $\psi \in C^2(\mathbb{R})$. Then
$$\frac{\partial \hat\beta_j}{\partial \epsilon^T} = e_j^T(X^TDX)^{-1}X^TD, \qquad (2)$$
$$\frac{\partial^2 \hat\beta_j}{\partial \epsilon\, \partial \epsilon^T} = G^T \operatorname{diag}\big(e_j^T(X^TDX)^{-1}X^T\tilde D\big)\, G. \qquad (3)$$
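Identity (2) is easy to check numerically. The sketch below (our code) uses a smooth pseudo-Huber loss as a stand-in for $\rho$ and compares the analytic derivative with a finite-difference quotient; the two numbers should agree to several digits.

```python
import numpy as np
from scipy.optimize import minimize

# Smooth (pseudo-Huber) loss so that psi = rho' is C^1
k = 1.345
rho  = lambda x: k**2 * (np.sqrt(1 + (x / k)**2) - 1)
psi  = lambda x: x / np.sqrt(1 + (x / k)**2)
dpsi = lambda x: (1 + (x / k)**2) ** (-1.5)   # psi'

def fit(X, eps):
    obj  = lambda b: np.mean(rho(eps - X @ b))
    grad = lambda b: -X.T @ psi(eps - X @ b) / len(eps)
    return minimize(obj, np.zeros(X.shape[1]), jac=grad,
                    method="BFGS", options={"gtol": 1e-12}).x

rng = np.random.default_rng(0)
n, p, j = 60, 20, 0
X, eps = rng.standard_normal((n, p)), rng.standard_normal(n)
beta_hat = fit(X, eps)

# Formula (2): the gradient of beta_hat_j with respect to epsilon
D = np.diag(dpsi(eps - X @ beta_hat))
analytic = np.linalg.solve(X.T @ D @ X, X.T @ D)[j]

# Directional finite-difference check
v, h = rng.standard_normal(n), 1e-5
numeric = (fit(X, eps + h * v)[j] - fit(X, eps - h * v)[j]) / (2 * h)
print(analytic @ v, numeric)   # should nearly coincide
```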
Second-Order Poincaré Inequality

$\hat\beta_j$ is a smooth transform of a random vector $\epsilon$ with independent entries. A powerful CLT for this type of statistic is the second-order Poincaré inequality (Chatterjee, 2009).

Definition 3. For $c_1, c_2 > 0$, let $\mathcal{L}(c_1, c_2)$ be the class of probability measures on $\mathbb{R}$ that arise as laws of random variables $u(W)$, where $W \sim N(0, 1)$ and $u \in C^2(\mathbb{R})$ with $|u'(x)| \le c_1$ and $|u''(x)| \le c_2$.

For example, $u = \mathrm{Id}$ gives N(0, 1) and $u = \Phi$ gives U([0, 1]).
Second-Order Poincaré Inequality

Proposition 1 (SOPI; Chatterjee, 2009). Let $W = (W_1, \ldots, W_n)$ have independent entries with $W_i \sim \mathcal{L}(c_1, c_2)$. Take any $g \in C^2(\mathbb{R}^n)$, let $U = g(W)$, and set
$$\kappa_0 = \Big(\sum_{i=1}^n E|\partial_i g(W)|^4\Big)^{1/4}, \qquad \kappa_1 = \big(E\|\nabla g(W)\|_2^4\big)^{1/4}, \qquad \kappa_2 = \big(E\|\nabla^2 g(W)\|_{op}^4\big)^{1/4}.$$
If U has a finite fourth moment, then
$$d_{TV}\left(\mathcal{L}\left(\frac{U - EU}{\sqrt{\operatorname{Var}(U)}}\right),\; N(0, 1)\right) \le C(c_1, c_2)\, \frac{\kappa_0 + \kappa_1\kappa_2}{\operatorname{Var}(U)},$$
for a constant $C(c_1, c_2)$ depending only on $c_1$ and $c_2$.
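As a sanity check (our own worked example, not in the talk), applying the SOPI to a weighted sum recovers exactly the Lindeberg-type condition (1) from the least-squares discussion:

```latex
% Take g(W) = a^T W with \|a\|_2 = 1 and W_i i.i.d. from L(c_1, c_2),
% assuming Var(W_1) > 0. Then \partial_i g \equiv a_i and \nabla^2 g = 0, so
\[
\kappa_0 = \Big(\sum_{i=1}^n a_i^4\Big)^{1/4} = \|a\|_4, \qquad
\kappa_1 = \|a\|_2 = 1, \qquad \kappa_2 = 0, \qquad
\operatorname{Var}(U) = \operatorname{Var}(W_1) \asymp 1,
\]
% and the SOPI bound reduces to
\[
d_{TV}\Big(\mathcal{L}\big(\tfrac{U - EU}{\sqrt{\operatorname{Var}(U)}}\big),\, N(0,1)\Big)
\;\lesssim\; \|a\|_4 \;\le\; \sqrt{\|a\|_\infty},
\]
% which tends to 0 if and only if \|a\|_\infty / \|a\|_2 \to 0, i.e. condition (1).
```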
Assumptions

Assume that
A1 $\rho(0) = \psi(0) = 0$, and for any $x \in \mathbb{R}$, $0 < K_0 \le \psi'(x) \le K_1$ and $|\psi''(x)| \le K_2$;
A2 $\epsilon$ has independent entries with $\epsilon_i \sim \mathcal{L}(c_1, c_2)$;
A3 letting $\lambda_+$ and $\lambda_-$ be the largest and smallest eigenvalues of $X^TX/n$, $\lambda_+ = O(1)$ and $\lambda_- = \Omega(1)$.
Second-Order Poincaré Inequality on $\hat\beta_j$

Applying the second-order Poincaré inequality to $\hat\beta_j$, we obtain:

Lemma 4. Let $D = \operatorname{diag}(\psi'(\epsilon_i - x_i^T\hat\beta))_{i=1}^n$ and
$$M_j = E\big\|e_j^T(X^TDX)^{-1}X^TD^{1/2}\big\|_\infty^2.$$
Then under assumptions A1-A3,
$$\max_j\; d_{TV}\left(\mathcal{L}\left(\frac{\hat\beta_j - E\hat\beta_j}{\sqrt{\operatorname{Var}(\hat\beta_j)}}\right),\; N(0, 1)\right) = O\left(\frac{\big(\max_j nM_j\big)^{1/8}\sqrt{p/n}}{n \min_j \operatorname{Var}(\hat\beta_j)}\right).$$

The main result is obtained if we prove
$$M_j = o\Big(\frac{1}{n}\Big), \qquad \operatorname{Var}(\hat\beta_j) = \Omega\Big(\frac{1}{n}\Big).$$
Further Assumptions

Define the following quantities:
- leave-one-predictor-out estimate $\hat\beta_{[j]}$: the M-estimator obtained by removing the j-th column of X (El Karoui, 2013);
- leave-one-predictor-out residuals $r_{i,[j]} = \epsilon_i - x_{i,[j]}^T \hat\beta_{[j]}$, where $x_{i,[j]}^T$ is the i-th row of X with the j-th entry removed;
- $h_{j,0} = (\psi(r_{1,[j]}), \ldots, \psi(r_{n,[j]}))^T$;
- $Q_j = \operatorname{Cov}(h_{j,0})$, the covariance matrix of the $\psi(r_{i,[j]})$.
Further Assumptions

Besides assumptions A1-A3, we assume
$$\text{A4} \qquad \min_j \frac{X_j^T Q_j X_j}{\operatorname{tr}(Q_j)} = \Omega(1).$$

Note that $Q_j$ does not involve $X_j$. Assumption A4 guarantees $\operatorname{Var}(\hat\beta_j) = \Omega(1/n)$.
Further Assumptions

If $X_j$ is a realization of a random vector $Z_j$ with i.i.d. entries, then
$$E Z_j^T Q_j Z_j = \operatorname{tr}(E Z_j Z_j^T Q_j) = E Z_{1,j}^2 \operatorname{tr}(Q_j).$$
If $Z_j^T Q_j Z_j$ concentrates around its mean, then
$$\frac{Z_j^T Q_j Z_j}{\operatorname{tr}(Q_j)} \approx E Z_{1,j}^2 > 0.$$

For example, when $Z_j$ has i.i.d. sub-Gaussian entries, the Hanson-Wright inequality implies the concentration:
$$P\big(|Z_j^T Q_j Z_j - E Z_j^T Q_j Z_j| \ge t\big) \le 2\exp\left\{-c\min\left\{\frac{t^2}{\|Q_j\|_F^2}, \frac{t}{\|Q_j\|_{op}}\right\}\right\}.$$
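A quick numerical look at this concentration (our own sketch, with a generic PSD matrix standing in for $Q_j$):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n))
Q = A @ A.T / n                       # generic PSD stand-in for Q_j
Z = rng.standard_normal((2000, n))    # 2000 draws of Z_j, i.i.d. N(0,1) entries

# Z^T Q Z / tr(Q) should concentrate around E Z_1^2 = 1
ratios = np.einsum("ki,ij,kj->k", Z, Q, Z) / np.trace(Q)
print(ratios.mean(), ratios.std())    # mean close to 1, small spread
```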
Further Assumptions

To state the last assumption, define the following quantities:
- $D_{[j]} = \operatorname{diag}(\psi'(r_{i,[j]}))$: the leave-one-predictor-out version of D;
- $G_{[j]} = I - X_{[j]}(X_{[j]}^T D_{[j]} X_{[j]})^{-1} X_{[j]}^T D_{[j]}$;
- $h_{j,1,i}^T = e_i^T G_{[j]}$: the i-th row of $G_{[j]}$;
$$C = \max\left\{\max_j \frac{|h_{j,0}^T X_j|}{\|h_{j,0}\|_2},\; \max_{i,j} \frac{|h_{j,1,i}^T X_j|}{\|h_{j,1,i}\|_2}\right\}.$$
Further Assumptions

The last assumption:
$$\text{A5} \qquad \big(E C^8\big)^{1/8} = O(\operatorname{polyLog}(n)).$$

It turns out that when $\rho(x) = x^2/2$,
$$C \asymp \max_j \frac{\|e_j^T(X^TX)^{-1}X^T\|_\infty}{\|e_j^T(X^TX)^{-1}X^T\|_2}.$$
Recall that for least squares, the $\hat\beta_j$ are all asymptotically normal iff the right-hand side tends to 0. This indicates that assumption A5 is not just an artifact of the proof.
Further Assumptions

Let $\alpha_{j,0} = h_{j,0}/\|h_{j,0}\|_2$ and $\alpha_{j,1,i} = h_{j,1,i}/\|h_{j,1,i}\|_2$. Again, if $X_j$ is a realization of a random vector $Z_j$ with i.i.d. $\sigma^2$-sub-Gaussian entries, then $\alpha_{j,0}^T Z_j$ and $\alpha_{j,1,i}^T Z_j$ are all $\sigma^2$-sub-Gaussian.

Then C is the maximum of $np + p$ sub-Gaussian random variables, and hence $(EC^8)^{1/8} = O(\operatorname{polyLog}(n))$.
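The last step is a standard sub-Gaussian maximal inequality; spelled out (our addition):

```latex
% If V_1, \dots, V_N are \sigma^2-sub-Gaussian (not necessarily independent),
% a union bound plus integration of the tail gives, for any fixed q \ge 1,
\[
\big(E \max_{1 \le i \le N} |V_i|^q\big)^{1/q} \;\le\; C_q\, \sigma \sqrt{\log N}.
\]
% With N = np + p = O(n^2) variables and q = 8, this yields
% (E C^8)^{1/8} = O(\sqrt{\log n}) = O(\mathrm{polyLog}(n)).
```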
Review of All Assumptions

A1 $\rho(0) = \psi(0) = 0$, and for any $x \in \mathbb{R}$, $0 < K_0 \le \psi'(x) \le K_1$ and $|\psi''(x)| \le K_2$;
A2 $\epsilon$ has independent entries with $\epsilon_i \sim \mathcal{L}(c_1, c_2)$;
A3 the largest and smallest eigenvalues $\lambda_+$, $\lambda_-$ of $X^TX/n$ satisfy $\lambda_+ = O(1)$, $\lambda_- = \Omega(1)$;
A4 $\min_j X_j^T Q_j X_j / \operatorname{tr}(Q_j) = \Omega(1)$;
A5 $(EC^8)^{1/8} = O(\operatorname{polyLog}(n))$.
Main Results

Theorem 5. Under assumptions A1-A5, as $p/n \to \kappa$ for some $\kappa \in (0, 1)$ while $n \to \infty$,
$$\max_j\; d_{TV}\left(\mathcal{L}\left(\frac{\hat\beta_j - E\hat\beta_j}{\sqrt{\operatorname{Var}(\hat\beta_j)}}\right),\; N(0, 1)\right) = o(1).$$
A Corollary

If we further assume that
A6 $\rho$ is an even function and $\epsilon_i \overset{d}{=} -\epsilon_i$,
then one can show that $\hat\beta$ is unbiased. As a consequence:

Theorem 6. Under assumptions A1-A6, as $p/n \to \kappa$ for some $\kappa \in (0, 1)$ while $n \to \infty$,
$$\max_j\; d_{TV}\left(\mathcal{L}\left(\frac{\hat\beta_j - \beta_j}{\sqrt{\operatorname{Var}(\hat\beta_j)}}\right),\; N(0, 1)\right) = o(1).$$
Numerical Results
Setup

Design matrix X:
- (i.i.d. design) $X_{ij}$ i.i.d. from F;
- (partial Hadamard design) a matrix formed by a random set of p columns of an n × n Hadamard matrix.

Entry distribution F: N(0, 1) or $t_2$.

Error distribution $\mathcal{L}(\epsilon)$: $\epsilon_i$ i.i.d. with $\epsilon_i \sim N(0, 1)$ or $\epsilon_i \sim t_2$.
Setup

Sample size n: {100, 200, 400, 800}; $\kappa = p/n$: {0.5, 0.8}.

Loss function $\rho$: Huber loss with k = 1.345,
$$\rho(x) = \begin{cases} \frac{1}{2}x^2 & |x| \le k \\ k|x| - \frac{k^2}{2} & |x| > k. \end{cases}$$
Asymptotic Normality of a Single Coordinate

For each set of parameters, we run 50 simulations, each consisting of the following steps (a code sketch follows the list):

Step 1 Generate one design matrix X;
Step 2 Generate 300 error vectors $\epsilon$;
Step 3 Regress each $Y = \epsilon$ on the design matrix X, yielding 300 random samples of $\hat\beta_1$, denoted $\hat\beta_1^{(1)}, \ldots, \hat\beta_1^{(300)}$;
Step 4 Estimate the standard deviation of $\hat\beta_1$ by the sample standard error $\hat{\mathrm{sd}}$;
Step 5 Construct a confidence interval $I^{(k)} = [\hat\beta_1^{(k)} - 1.96\,\hat{\mathrm{sd}},\; \hat\beta_1^{(k)} + 1.96\,\hat{\mathrm{sd}}]$ for each $k = 1, \ldots, 300$;
Step 6 Calculate the empirical 95% coverage as the proportion of confidence intervals that cover the true $\beta_1 = 0$.
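A minimal sketch of one such simulation (our code; one design with i.i.d. Gaussian entries):

```python
import numpy as np
from scipy.optimize import minimize

K = 1.345
rho = lambda x: np.where(np.abs(x) <= K, x**2 / 2, K * (np.abs(x) - K / 2))
psi = lambda x: np.clip(x, -K, K)

def huber_fit(X, y):
    grad = lambda b: -X.T @ psi(y - X @ b) / len(y)
    return minimize(lambda b: np.mean(rho(y - X @ b)),
                    np.zeros(X.shape[1]), jac=grad).x

rng = np.random.default_rng(0)
n, p, n_rep = 100, 50, 300
X = rng.standard_normal((n, p))                          # Step 1
b1 = np.array([huber_fit(X, rng.standard_normal(n))[0]   # Steps 2-3
               for _ in range(n_rep)])
sd_hat = b1.std(ddof=1)                                  # Step 4
coverage = np.mean(np.abs(b1) <= 1.96 * sd_hat)          # Steps 5-6 (beta_1 = 0)
print("empirical 95% coverage:", coverage)
```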
Asymptotic Normality of a Single Coordinate

[Figure: Empirical 95% coverage of $\beta_1$ for $\kappa = 0.5$ (left panel) and $\kappa = 0.8$ (right panel), for i.i.d. and partial Hadamard designs, normal and t(2) errors, and normal and t(2) entry distributions, across sample sizes 100-800.]
Conclusion

- We establish coordinate-wise asymptotic normality of the M-estimator for certain fixed design matrices in the moderate p/n regime, under regularity conditions on X, $\mathcal{L}(\epsilon)$ and $\rho$ but no conditions on $\beta$;
- We prove the result using a novel approach, the second-order Poincaré inequality (Chatterjee, 2009);
- We show that the regularity conditions are satisfied by a broad class of designs.
Future Work

For this project:
- Estimate $\operatorname{Var}(\hat\beta_j)$
- Relax the assumptions on $\mathcal{L}(\epsilon)$
- Relax the strong convexity of $\rho$
- Extend the results to GLMs

For my dissertation:
- Distributional properties in high dimensions
- Resampling methods in high dimensions
Thank You!
References

Bean, D., Bickel, P. J., El Karoui, N., & Yu, B. (2013). Optimal M-estimation in high-dimensional regression. Proceedings of the National Academy of Sciences, 110(36), 14563-14568.

Bickel, P. J., & Freedman, D. A. (1982). Bootstrapping regression models with many parameters. Festschrift for Erich L. Lehmann, 28-48.

Chatterjee, S. (2009). Fluctuations of eigenvalues and second order Poincaré inequalities. Probability Theory and Related Fields, 143(1-2), 1-40.

El Karoui, N. (2013). Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators: rigorous results. arXiv preprint arXiv:1311.2445.

El Karoui, N. (2015). On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators.

El Karoui, N., Bean, D., Bickel, P. J., Lim, C., & Yu, B. (2011). On robust regression with high-dimensional predictors. Proceedings of the National Academy of Sciences, 110(36), 14557-14562.

Huber, P. J. (1973). Robust regression: asymptotics, conjectures and Monte Carlo. The Annals of Statistics, 799-821.

Portnoy, S. (1984). Asymptotic behavior of M-estimators of p regression parameters when p²/n is large. I. Consistency. The Annals of Statistics, 1298-1309.

Portnoy, S. (1985). Asymptotic behavior of M-estimators of p regression parameters when p²/n is large; II. Normal approximation. The Annals of Statistics, 1403-1417.