Inference for High-Dimensional M-Estimates: Fixed Design Results
Lihua Lei, Peter Bickel and Noureddine El Karoui
Department of Statistics, UC Berkeley
Berkeley-Stanford Econometrics Jamboree, 2017
Table of Contents
- Background
- Main Results
- Heuristics and Proof Techniques
- Numerical Results
Setup
Consider a linear model: Y = Xβ* + ɛ, where
- y = (y_1, ..., y_n)^T ∈ R^n: response vector;
- X = (x_1, ..., x_n)^T ∈ R^{n×p}: design matrix;
- β* = (β*_1, ..., β*_p)^T ∈ R^p: coefficient vector;
- ɛ = (ɛ_1, ..., ɛ_n)^T ∈ R^n: unobserved random error with independent entries.
M-Estimator
M-Estimator: given a convex loss function ρ(·): R → [0, ∞),
  β̂ = argmin_{β ∈ R^p} (1/n) Σ_{i=1}^n ρ(y_i − x_i^T β).
When ρ is differentiable with ψ = ρ′, β̂ can be written as the solution of
  (1/n) Σ_{i=1}^n ψ(y_i − x_i^T β̂) = 0.
M-Estimator: Examples
- ρ(x) = x²/2 gives the Least-Squares estimator;
- ρ(x) = |x| gives the Least-Absolute-Deviation estimator;
- ρ(x) = x²/2 for |x| ≤ k, and k(|x| − k/2) for |x| > k, gives the Huber estimator.
[Figure: plots of ρ(x) and ψ(x) for the L2, L1 and Huber losses.]
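The three losses and their ψ-functions can be written down directly; this is a minimal sketch in Python (numpy and the default tuning constant k = 1.345 are illustrative choices, not part of the slide above):

```python
import numpy as np

def rho_l2(x):
    # Least-squares loss: x^2 / 2
    return x**2 / 2

def rho_l1(x):
    # Least-absolute-deviation loss: |x|
    return np.abs(x)

def rho_huber(x, k=1.345):
    # Huber loss: quadratic near zero, linear in the tails
    return np.where(np.abs(x) <= k, x**2 / 2, k * (np.abs(x) - k / 2))

def psi_huber(x, k=1.345):
    # Derivative of the Huber loss: the clipped identity
    return np.clip(x, -k, k)
```
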
Goals (Informal)
Goal (informal): make inference on the coordinates of β* when
- X is treated as fixed;
- no assumption is imposed on β*; and
- the dimension p is comparable to the sample size n.
Why coordinates? Why fixed designs? Why assumption-free β*? Why p ≍ n?
Asymptotic Arguments: Motivation
- Consider β*_1 WLOG;
- Ideally, we would construct a 95% confidence interval for β*_1 as
  [q_{0.025}(L(β̂_1)), q_{0.975}(L(β̂_1))],
  where q_α denotes the α-th quantile;
- Unfortunately, L(β̂_1) is unknown.
- This motivates asymptotic arguments: find a distribution F such that L(β̂_1) ≈ F.
Asymptotic Arguments: Textbook Version
The limiting behavior of β̂ when p is fixed, as n → ∞:
  L(β̂) ≈ N( β*, (X^T X)^{-1} · E[ψ²(ɛ_1)] / (E[ψ′(ɛ_1)])² ).
As a consequence, we obtain an approximate 95% confidence interval for β*_1,
  [ β̂_1 − 1.96 sd(β̂_1), β̂_1 + 1.96 sd(β̂_1) ],
where sd(β̂_1) can be any consistent estimator of the standard deviation.
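As an illustration of the textbook recipe, the sketch below fits a Huber M-estimate with a generic scipy solver and forms the plug-in 95% interval from the fixed-p variance formula (X^T X)^{-1} E[ψ²(ɛ)]/(E[ψ′(ɛ)])². The dimensions, seed, and use of scipy.optimize.minimize are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p, k = 200, 5, 1.345                  # fixed-p setting: p << n
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
y = X @ beta_star + rng.standard_t(df=3, size=n)

def huber(x):
    return np.where(np.abs(x) <= k, x**2 / 2, k * (np.abs(x) - k / 2))

def psi(x):                              # psi = rho'
    return np.clip(x, -k, k)

obj = lambda b: huber(y - X @ b).mean()
grad = lambda b: -X.T @ psi(y - X @ b) / n
beta_hat = minimize(obj, np.zeros(p), jac=grad, method="BFGS").x

r = y - X @ beta_hat                     # residuals
# plug-in of E[psi^2] / (E[psi'])^2; psi'(x) = 1{|x| <= k} for the Huber loss
v = np.mean(psi(r) ** 2) / np.mean(np.abs(r) <= k) ** 2
sd1 = np.sqrt(v * np.linalg.inv(X.T @ X)[0, 0])
ci = (beta_hat[0] - 1.96 * sd1, beta_hat[0] + 1.96 * sd1)
```
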
Asymptotic Arguments: Hypothetical Problems
Original problem (n = 100, p = 30): y, X ∈ R^{n×p} → β̂_1.
Hypothetical problem (n_1 = 200, p_1 = 30): y_1, X_1 ∈ R^{n_1×p_1} → β̂_1^(1).
Hypothetical problem (n_2 = 500, p_2 = 30): y_2, X_2 ∈ R^{n_2×p_2} → β̂_1^(2).
Hypothetical problem (n_3 = 2000, p_3 = 30): y_3, X_3 ∈ R^{n_3×p_3} → β̂_1^(3).
Asymptotic argument: use lim_{j→∞} L(β̂_1^(j)) to approximate L(β̂_1).
Asymptotic Arguments
- Huber [1973] raised the question of understanding the behavior of β̂ when both n and p tend to infinity;
- Huber [1973] showed the L2 consistency of β̂, ‖β̂ − β*‖²₂ → 0, when p = o(n^{1/3});
- Portnoy [1984] proved the L2 consistency of β̂ when p = o(n / log n).
Asymptotic Arguments
Portnoy [1985] and Mammen [1989] showed that β̂ is jointly asymptotically normal when p ≪ n^{2/3}, in the sense that for any sequence of vectors a_n ∈ R^p,
  L( a_n^T(β̂ − β*) / √Var(a_n^T β̂) ) → N(0, 1).
p/n: A Measure of Difficulty
- All of the above works require p/n → 0, i.e. n/p → ∞.
- n/p is the number of samples per parameter;
- Classical rule of thumb: n/p ≥ 5-10;
- Heuristically, a larger n/p gives an easier problem;
- Hypothetical problems with n_j/p_j → ∞ are not appropriate because they are increasingly easier than the original problem.
Moderate p/n Regime
Formally, we define the moderate p/n regime by p/n → κ > 0.
Original problem (n = 100, p = 30): y, X ∈ R^{n×p} → β̂_1.
Hypothetical problem (n_1 = 200, p_1 = 60): y_1, X_1 → β̂_1^(1).
Hypothetical problem (n_2 = 500, p_2 = 150): y_2, X_2 → β̂_1^(2).
Hypothetical problem (n_3 = 2000, p_3 = 600): y_3, X_3 → β̂_1^(3).
Moderate p/n Regime: More Informative Asymptotics
A simulation to compare the fixed-p regime and the moderate p/n regime.
Original problem: n = 50, p = 50κ, Huber loss, i.i.d. ɛ_i's.
Draw r independent error vectors ɛ^(1), ..., ɛ^(r), form y^(j) = Xβ* + ɛ^(j), and compute the M-estimates β̂_1^(1), ..., β̂_1^(r).
⟹ L̂(β̂_1; X) = ecdf({β̂_1^(1), ..., β̂_1^(r)}).
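The Monte Carlo construction above can be sketched as follows; the Gaussian design, seed, and BFGS solver are illustrative choices rather than the authors' code:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, kappa, k, r = 50, 0.5, 1.345, 200
p = int(n * kappa)
X = rng.standard_normal((n, p))      # design held fixed across replications
beta_star = np.zeros(p)

def huber(x):
    return np.where(np.abs(x) <= k, x**2 / 2, k * (np.abs(x) - k / 2))

def psi(x):                          # derivative of the Huber loss
    return np.clip(x, -k, k)

def beta1_hat(y):
    obj = lambda b: huber(y - X @ b).mean()
    grad = lambda b: -X.T @ psi(y - X @ b) / n
    return minimize(obj, np.zeros(p), jac=grad, method="BFGS").x[0]

# r replications: same X, fresh errors each time
draws = np.array([beta1_hat(X @ beta_star + rng.standard_normal(n))
                  for _ in range(r)])
ecdf = lambda t: np.mean(draws <= t)  # empirical law of beta_hat_1 given X
```
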
Moderate p/n Regime: More Informative Asymptotics
A simulation to compare the fixed-p regime and the moderate p/n regime.
Fixed-p approximation: n = 1000, p = 50κ.
Repeat the same construction on this larger problem to get M-estimates β̂_1^(F,1), ..., β̂_1^(F,r).
⟹ L̂(β̂_1^F; X) = ecdf({β̂_1^(F,1), ..., β̂_1^(F,r)}).
Moderate p/n Regime: More Informative Asymptotics
A simulation to compare the fixed-p regime and the moderate p/n regime.
Moderate-p/n approximation: n = 1000, p = 1000κ.
Repeat the same construction to get M-estimates β̂_1^(M,1), ..., β̂_1^(M,r).
⟹ L̂(β̂_1^M; X) = ecdf({β̂_1^(M,1), ..., β̂_1^(M,r)}).
Moderate p/n Regime: More Informative Asymptotics
Measure the accuracy of the two approximations by the Kolmogorov-Smirnov statistics
  d_KS( L̂(β̂_1), L̂(β̂_1^F) ) and d_KS( L̂(β̂_1), L̂(β̂_1^M) ).
[Figure: distance between the small-sample and large-sample distributions for normal and t(2) errors, with κ ranging over 0.25-0.75, comparing the p-fixed and p/n-fixed asymptotic regimes.]
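The Kolmogorov-Smirnov distance between two such empirical laws can be computed with scipy's two-sample KS statistic; the normal stand-ins below merely substitute for the Monte Carlo draws of β̂_1 under the original and approximating problems:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(2)
# stand-ins for the draws of beta_hat_1 under the original problem
# and under one of the two approximating regimes
draws_small = rng.normal(0.0, 1.0, size=500)
draws_large = rng.normal(0.0, 1.2, size=500)

# two-sample Kolmogorov-Smirnov statistic: sup_t |ecdf_1(t) - ecdf_2(t)|
d_ks = ks_2samp(draws_small, draws_large).statistic
```
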
Moderate p/n Regime: Negative Results
The moderate p/n regime in statistics:
- Huber [1973] showed that for least-squares estimators there always exists a sequence of vectors a_n ∈ R^p such that
  L( a_n^T(β̂_LS − β*) / √Var(a_n^T β̂_LS) ) ↛ N(0, 1).
- Bickel and Freedman [1982] showed that the bootstrap fails in the least-squares case and the usual rescaling does not help;
- El Karoui et al. [2011] showed that for general loss functions, ‖β̂ − β*‖²₂ ↛ 0;
- El Karoui and Purdom [2015] showed that most widely used resampling schemes give poor inference on β*_1.
Moderate p/n Regime: Reasons for Failure
Qualitatively,
- Influential observations always exist [Huber, 1973]: letting H = X(X^T X)^{-1}X^T be the hat matrix,
  max_{1≤i≤n} H_ii ≥ tr(H)/n = p/n ≫ 0.
- Regression residuals fail to mimic the true errors: R_i ≜ y_i − x_i^T β̂ ≉ ɛ_i.
Technically, the Taylor expansion / Bahadur-type representation fails!
Moderate p/n Regime: Positive Results (Random Designs)
- Bean et al. [2013] showed that when X has i.i.d. Gaussian entries, for any sequence of a_n ∈ R^p,
  L_{X,ɛ}( a_n^T(β̂ − β*) / √Var_{X,ɛ}(a_n^T β̂) ) → N(0, 1);
- El Karoui [2015] extended this to general random designs.
- The above results do not contradict Huber [1973], in that the randomness comes from both X and ɛ;
- Recall that El Karoui et al. [2011] showed that for general loss functions, ‖β̂ − β*‖₂ ↛ 0.
Moderate p/n Regime: Summary
- Provides a more accurate approximation of L(β̂_1);
- Qualitatively different from the classical regimes where p/n → 0:
  - L2-consistency of β̂ no longer holds;
  - the residuals R_i behave differently from the ɛ_i;
  - fixed-design results differ from random-design results.
- Inference on the vector β̂ is hard, but inference on coordinates / low-dimensional linear contrasts of β̂ is still possible.
Goals (Formal)
Our goal (formal): under the linear model Y = Xβ* + ɛ, derive the asymptotic distribution of the coordinates β̂_j:
- under the moderate p/n regime, i.e. p/n → κ ∈ (0, 1);
- with a fixed design matrix X;
- without assumptions on β*.
Main Result (Informal)
Definition 1. Let P and Q be two distributions on R^p; the total variation distance is
  d_TV(P, Q) = sup_{A ⊂ R^p} |P(A) − Q(A)|.
Theorem. Under appropriate conditions on the design matrix X, the distribution of ɛ and the loss function ρ, as p/n → κ ∈ (0, 1) while n → ∞,
  max_j d_TV( L( (β̂_j − E β̂_j)/√Var(β̂_j) ), N(0, 1) ) = o(1).
Main Result (Informal)
If ρ is an even function and ɛ =d −ɛ, then β̂ − β* =d β* − β̂, and hence E β̂ = β*.
Theorem. Under appropriate conditions on the design matrix X, the distribution of ɛ and the loss function ρ, as p/n → κ ∈ (0, 1) while n → ∞,
  max_j d_TV( L( (β̂_j − β*_j)/√Var(β̂_j) ), N(0, 1) ) = o(1).
Why Surprising?
Classical approaches heavily rely on
- the L2 consistency of β̂, which only holds when p = o(n); or
- a Bahadur-type representation for β̂,
  β̂ − β* = (1/n) Σ_{i=1}^n Z_i + o_p(1/√n),
  for some i.i.d. random variables Z_i, which can be proved only when p = o(n^{2/3}).
Question: what happens when p ∈ [O(n^{2/3}), O(n)]?
Our Contributions and Limitations
Instead, we develop a novel strategy built on
- the leave-one-out method [El Karoui et al., 2011]; and
- the Second-Order Poincaré Inequality [Chatterjee, 2009].
We prove that
- β̂_1 is asymptotically normal for all p ∈ [O(1), O(n)] for fixed designs under regularity conditions;
- the conditions are satisfied by most design matrices.
Limitations:
- we impose strong conditions on ρ and L(ɛ);
- we do not know how to estimate Var_ɛ(β̂_1).
Examples: Realizations of i.i.d. Designs
We consider the case where X is a realization of a random design Z. The examples below are proved to satisfy the technical assumptions with high probability over Z.
Example 1: Z has i.i.d. mean-zero sub-Gaussian entries with Var(Z_ij) = τ² > 0.
Example 2: Z contains an intercept term, i.e. Z = (1, Z̃), and Z̃ ∈ R^{n×(p−1)} has independent sub-Gaussian entries with Z̃_ij − μ_j =d μ_j − Z̃_ij and Var(Z̃_ij) > τ², for some arbitrary μ_j's.
A Counter-Example
Consider a one-way ANOVA situation: each observation i is associated with a label k_i ∈ {1, ..., p}, and X_ij = I(j = k_i). This is equivalent to Y_i = β*_{k_i} + ɛ_i.
It is easy to see that
  β̂_j = argmin_{β_j ∈ R} Σ_{i: k_i = j} ρ(y_i − β_j).
This is a standard location problem.
A Counter-Example
Let n_j = #{i : k_i = j}. In the least-squares case, i.e. ρ(x) = x²/2,
  β̂_j = β*_j + (1/n_j) Σ_{i: k_i = j} ɛ_i.
Assume a balanced design, i.e. n_j ≈ n/p → 1/κ. Then
- n_j stays bounded, so none of the β̂_j is asymptotically normal (unless the ɛ_i are normal);
- the same holds for general loss functions ρ.
Conclusion: some non-standard assumptions on X are required.
Table of Contents
- Background
- Main Results
- Heuristics and Proof Techniques
  - Least-Squares Estimator: A Motivating Example
  - Second-Order Poincaré Inequality
  - Assumptions
  - Main Results
- Numerical Results
Least-Squares Estimator
The L2 loss, ρ(x) = x²/2, gives the least-squares estimator
  β̂_LS = (X^T X)^{-1} X^T Y = β* + (X^T X)^{-1} X^T ɛ.
Let e_j denote the j-th canonical basis vector of R^p; then
  β̂_LS,j − β*_j = e_j^T (X^T X)^{-1} X^T ɛ =: α_j^T ɛ.
Least-Squares Estimator
The Lindeberg-Feller CLT implies that in order for
  (β̂_LS,j − β*_j) / √Var(β̂_LS,j) →L N(0, 1),
it is sufficient and almost necessary that
  ‖α_j‖_∞ / ‖α_j‖₂ → 0.   (1)
Least-Squares Estimator
To see the necessity of the condition, recall the one-way ANOVA case. Let n_j = #{i : k_i = j}; then X^T X = diag(n_j)_{j=1}^p. Recall that α_j^T = e_j^T(X^T X)^{-1}X^T. This gives
  α_j,i = 1/n_j if k_i = j, and 0 if k_i ≠ j.
As a result, ‖α_j‖_∞ = 1/n_j and ‖α_j‖₂ = 1/√n_j, and hence
  ‖α_j‖_∞ / ‖α_j‖₂ = 1/√n_j.
However, in the moderate p/n regime there exists j such that n_j ≤ 1/κ, and thus β̂_LS,j is not asymptotically normal.
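The contrast between a design satisfying condition (1) and the ANOVA counter-example can be checked numerically; the dimensions and the Gaussian comparison design below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 500, 250

def lindeberg_ratio(X, j=0):
    # alpha_j^T = e_j^T (X^T X)^{-1} X^T; condition (1) asks this ratio -> 0
    alpha = np.linalg.solve(X.T @ X, X.T)[j]
    return np.abs(alpha).max() / np.linalg.norm(alpha)

# i.i.d. Gaussian design: no single entry of alpha_j dominates
X_gauss = rng.standard_normal((n, p))

# one-way ANOVA design with n_j = 2 observations per group:
# alpha_j has two entries equal to 1/2, so the ratio is 1/sqrt(2)
labels = np.repeat(np.arange(p), n // p)
X_anova = np.zeros((n, p))
X_anova[np.arange(n), labels] = 1.0
```
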
M-Estimator
The result for the LSE is derived from the analytical form of β̂_LS. By contrast, no analytical form is available for general ρ.
Letting ψ = ρ′, β̂ is the solution of
  (1/n) Σ_{i=1}^n ψ(y_i − x_i^T β̂) = 0  ⟺  (1/n) Σ_{i=1}^n ψ(ɛ_i − x_i^T(β̂ − β*)) = 0.
We show that
- β̂_j is a smooth function of ɛ;
- ∂β̂_j/∂ɛ and ∂²β̂_j/∂ɛ∂ɛ^T are computable.
Second-Order Poincaré Inequality
β̂_j is a smooth transform of a random vector, ɛ, with independent entries. A powerful CLT for this type of statistics is the Second-Order Poincaré Inequality [Chatterjee, 2009].
Definition 2. For c_1, c_2 > 0, let L(c_1, c_2) be the class of probability measures on R that arise as laws of random variables u(W), where W ~ N(0, 1) and u ∈ C²(R) with |u′(x)| ≤ c_1 and |u″(x)| ≤ c_2.
For example, u = Id gives N(0, 1) and u = Φ gives U([0, 1]).
Second-Order Poincaré Inequality
Proposition 1 (SOPI; Chatterjee [2009]). Let W = (W_1, ..., W_n) have independent entries with W_i ~ L(c_1, c_2). Take any g ∈ C²(R^n), let U = g(W), and set
  κ_0 = ( Σ_{i=1}^n E(∂_i g(W))⁴ )^{1/2},  κ_1 = ( E‖∇g(W)‖₂⁴ )^{1/4},  κ_2 = ( E‖∇²g(W)‖_op⁴ )^{1/4}.
If EU⁴ < ∞, then
  d_TV( L((U − EU)/√Var(U)), N(0, 1) ) ≲ (κ_0 + κ_1 κ_2) / Var(U).
Assumptions
A1: ρ(0) = ψ(0) = 0, and for any x ∈ R, 0 < K_0 ≤ ψ′(x) ≤ K_1 and |ψ″(x)| ≤ K_2;
A2: ɛ has independent entries with ɛ_i ~ L(c_1, c_2);
A3: Letting λ_+ and λ_− be the largest and smallest eigenvalues of X^T X/n, λ_+ = O(1) and λ_− = Ω(1);
A4: Similar to condition (1) for OLS:
  max_j ‖e_j^T(X^T X)^{-1}X^T‖_∞ / ‖e_j^T(X^T X)^{-1}X^T‖₂ = o(1);
A5: Similar to the condition that
  min_j Var(β̂_j) = Ω(1/n).
Main Results
Theorem 3. Under assumptions A1-A5, as p/n → κ for some κ ∈ (0, 1) while n → ∞,
  max_j d_TV( L( (β̂_j − E β̂_j)/√Var(β̂_j) ), N(0, 1) ) = o(1).
Setup
Design matrix X:
- (i.i.d. design) X_ij i.i.d. ~ F;
- (partial Hadamard design) a matrix formed by a random set of p columns of an n × n Hadamard matrix.
Entry distribution F: F = N(0, 1) or F = t_2.
Error distribution L(ɛ): the ɛ_i are i.i.d. with ɛ_i ~ N(0, 1) or ɛ_i ~ t_2.
Setup
Sample size n: {100, 200, 400, 800};
κ = p/n: {0.5, 0.8};
Loss function ρ: Huber loss with k = 1.345,
  ρ(x) = x²/2 for |x| ≤ k, and k|x| − k²/2 for |x| > k;
Coefficients: β* = 0.
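A partial Hadamard design of the kind described above can be generated with scipy; note that scipy.linalg.hadamard requires n to be a power of two, so n = 128 here is an illustrative stand-in for the sample sizes in the table:

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(3)
n, kappa = 128, 0.5
p = int(n * kappa)

H = hadamard(n)                      # n x n, entries +-1, orthogonal columns
cols = rng.choice(n, size=p, replace=False)
X = H[:, cols] / np.sqrt(n)          # partial Hadamard design: X^T X = I_p
```
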
Asymptotic Normality of a Single Coordinate
Draw r independent error vectors ɛ^(1), ..., ɛ^(r) and form y^(j) = Xβ* + ɛ^(j); compute the M-estimates β̂_1^(1), ..., β̂_1^(r) and
  ŝd = se({β̂_1^(1), ..., β̂_1^(r)}).
- We want to compare L(β̂_1/ŝd) with N(0, 1);
- as a proxy, count the fraction of β̂_1^(j) ∈ [−1.96 ŝd, 1.96 ŝd];
- ideally this fraction should be close to 0.95.
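The coverage proxy described above can be sketched as follows (β* = 0 and an i.i.d. Gaussian design, matching the simulation setup; the seed and the BFGS solver are illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
n, kappa, k, r = 100, 0.5, 1.345, 200
p = int(n * kappa)
X = rng.standard_normal((n, p))      # fixed design

def huber(x):
    return np.where(np.abs(x) <= k, x**2 / 2, k * (np.abs(x) - k / 2))

def psi(x):                          # derivative of the Huber loss
    return np.clip(x, -k, k)

def beta1_hat(y):
    obj = lambda b: huber(y - X @ b).mean()
    grad = lambda b: -X.T @ psi(y - X @ b) / n
    return minimize(obj, np.zeros(p), jac=grad, method="BFGS").x[0]

# beta* = 0, so y^(j) = eps^(j); r replications over fresh errors
draws = np.array([beta1_hat(rng.standard_normal(n)) for _ in range(r)])
sd_hat = draws.std(ddof=1)
coverage = np.mean(np.abs(draws) <= 1.96 * sd_hat)   # proxy; ideally ~0.95
```
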
Asymptotic Normality of a Single Coordinate
[Figure: empirical coverage of β*_1 across sample sizes n ∈ {100, 200, 400, 800}, for κ = 0.5 and κ = 0.8, for normal and t(2) errors and entry distributions, and for i.i.d. and Hadamard designs; coverage stays near the nominal 0.95 level.]
Conclusion
- We establish the coordinate-wise asymptotic normality of M-estimators for certain fixed design matrices in the moderate p/n regime, under regularity conditions on X, L(ɛ) and ρ but no conditions on β*;
- We prove the result using a novel approach, the Second-Order Poincaré Inequality [Chatterjee, 2009];
- We show that the regularity conditions are satisfied by a broad class of designs.
Discussion
Inference = asym. normality + asym. bias + asym. variance:
- Var(β̂_1 | X) ≈ Var(β̂_1) when X is indeed a realization of a random design?
- Resampling methods to give conservative variance estimates?
- More advanced bootstrap?
Relax the regularity conditions:
- Generalize to non-strongly-convex and non-smooth loss functions?
- Generalize to general error distributions?
Get rid of asymptotics:
- Yes, exact finite-sample guarantees if n/p > 20;
- No assumption on X or β*;
- Only an exchangeability assumption on ɛ.
Thank You!
References
- Derek Bean, Peter J. Bickel, Noureddine El Karoui, and Bin Yu. Optimal M-estimation in high-dimensional regression. Proceedings of the National Academy of Sciences, 110(36):14563-14568, 2013.
- Peter J. Bickel and David A. Freedman. Bootstrapping regression models with many parameters. Festschrift for Erich L. Lehmann, pages 28-48, 1982.
- Sourav Chatterjee. Fluctuations of eigenvalues and second order Poincaré inequalities. Probability Theory and Related Fields, 143(1-2):1-40, 2009.
- Noureddine El Karoui. On the impact of predictor geometry on the performance of high-dimensional ridge-regularized generalized robust regression estimators. 2015.
- Noureddine El Karoui and Elizabeth Purdom. Can we trust the bootstrap in high-dimension? UC Berkeley Statistics Department Technical Report, 2015.
- Noureddine El Karoui, Derek Bean, Peter J. Bickel, Chinghway Lim, and Bin Yu. On robust regression with high-dimensional predictors. Proceedings of the National Academy of Sciences, 110(36):14557-14562, 2011.
- Peter J. Huber. Robust regression: asymptotics, conjectures and Monte Carlo. The Annals of Statistics, pages 799-821, 1973.
- Enno Mammen. Asymptotics with increasing dimension for robust regression with applications to the bootstrap. The Annals of Statistics, pages 382-400, 1989.
- Stephen Portnoy. Asymptotic behavior of M-estimators of p regression parameters when p²/n is large. I. Consistency. The Annals of Statistics, pages 1298-1309, 1984.
- Stephen Portnoy. Asymptotic behavior of M-estimators of p regression parameters when p²/n is large. II. Normal approximation. The Annals of Statistics, pages 1403-1417, 1985.