Inference For High Dimensional M-estimates: Fixed Design Results


Inference For High Dimensional M-estimates: Fixed Design Results
Lihua Lei, Peter Bickel and Noureddine El Karoui
Department of Statistics, UC Berkeley
Berkeley-Stanford Econometrics Jamboree

Table of Contents: Background; Main Results; Heuristics and Proof Techniques; Numerical Results

Setup
Consider a linear model: y = Xβ* + ε, where
y = (y_1, ..., y_n)^T ∈ R^n is the response vector;
X = (x_1^T, ..., x_n^T)^T ∈ R^{n×p} is the design matrix;
β* = (β*_1, ..., β*_p)^T ∈ R^p is the coefficient vector;
ε = (ε_1, ..., ε_n)^T ∈ R^n is the unobserved random error, with independent entries.

M-Estimator
Given a convex loss function ρ(·): R → [0, ∞),
    β̂ = argmin_{β ∈ R^p} (1/n) Σ_{i=1}^n ρ(y_i − x_i^T β).
When ρ is differentiable with ψ = ρ′, β̂ can be characterized as the solution of
    (1/n) Σ_{i=1}^n ψ(y_i − x_i^T β̂) = 0.
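To make the definition concrete, here is a minimal numerical sketch (not code from the talk; the names `m_estimate` and `huber` and the simulated data are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def huber(x, k=1.345):
    # Huber loss: quadratic near 0, linear in the tails
    return np.where(np.abs(x) <= k, x**2 / 2, k * (np.abs(x) - k / 2))

def m_estimate(X, y, rho=huber):
    """M-estimator: minimize (1/n) * sum_i rho(y_i - x_i^T beta)."""
    n, p = X.shape
    objective = lambda beta: np.mean(rho(y - X @ beta))
    return minimize(objective, np.zeros(p), method="BFGS").x

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.standard_normal((n, p))
beta_star = np.arange(p, dtype=float)
y = X @ beta_star + rng.standard_normal(n)
beta_hat = m_estimate(X, y)   # close to beta_star in this low-dimensional case
```

With ρ(x) = x²/2 the same routine reduces to least squares.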

M-Estimator: Examples
ρ(x) = x²/2 gives the Least-Squares estimator;
ρ(x) = |x| gives the Least-Absolute-Deviation estimator;
ρ(x) = x²/2 for |x| ≤ k, and k(|x| − k/2) for |x| > k, gives the Huber estimator.
[Figure: ρ(x) and ψ(x) for the L2, L1 and Huber losses.]

Goals (Informal)
Goal (informal): make inference on the coordinates of β* when
X is treated as fixed;
no assumption is imposed on β*; and
the dimension p is comparable to the sample size n.
Why coordinates? Why fixed designs? Why assumption-free β*? Why p ≍ n?

Asymptotic Arguments: Motivation
Consider β*_1 WLOG.
Ideally, we would construct a 95% confidence interval for β*_1 as
    [q_{0.025}(L(β̂_1)), q_{0.975}(L(β̂_1))],
where q_α denotes the α-th quantile.
Unfortunately, L(β̂_1) is unknown. This motivates the asymptotic argument: find a distribution F such that L(β̂_1) ≈ F.

Asymptotic Arguments: Textbook Version
When p is fixed and n → ∞, the limiting behavior of β̂ is
    L(β̂) ≈ N( β*, (X^T X)^{-1} · E[ψ²(ε_1)] / (E[ψ′(ε_1)])² ).
As a consequence, we obtain an approximate 95% confidence interval for β*_1,
    [β̂_1 − 1.96 sd(β̂_1), β̂_1 + 1.96 sd(β̂_1)],
where sd(β̂_1) can be any consistent estimator of the standard deviation.
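The textbook variance formula can be turned into a plug-in confidence interval; a hedged sketch, with the expectations E[ψ²(ε)] and E[ψ′(ε)] estimated from residuals (the helper `classical_ci` is a hypothetical name; ψ(x) = x recovers the OLS case):

```python
import numpy as np

def classical_ci(X, y, beta_hat, psi, dpsi, z=1.96):
    """Fixed-p plug-in CI: Var(beta_hat) ~ (X^T X)^{-1} E[psi^2(eps)] / (E[psi'(eps)])^2,
    with both expectations estimated from the residuals."""
    r = y - X @ beta_hat
    scale = np.mean(psi(r) ** 2) / np.mean(dpsi(r)) ** 2
    sd = np.sqrt(scale * np.diag(np.linalg.inv(X.T @ X)))
    return beta_hat - z * sd, beta_hat + z * sd

rng = np.random.default_rng(0)
n, p = 500, 5
X = rng.standard_normal((n, p))
beta_star = np.ones(p)
y = X @ beta_star + rng.standard_normal(n)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
# psi(x) = x, psi'(x) = 1 recovers the usual OLS standard errors
lo, hi = classical_ci(X, y, beta_ols, lambda x: x, lambda x: np.ones_like(x))
```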

Asymptotic Arguments: Hypothetical Problems
Embed the original problem (n = 100, p = 30), with data (y, X), X ∈ R^{n×p}, and estimate β̂_1, into a sequence of hypothetical problems of growing size but fixed dimension:
hypothetical problem 1 (n_1 = 200, p_1 = 30): data (y_1, X_1), estimate β̂_1^(1);
hypothetical problem 2 (n_2 = 500, p_2 = 30): data (y_2, X_2), estimate β̂_1^(2);
hypothetical problem 3 (n_3 = 2000, p_3 = 30): data (y_3, X_3), estimate β̂_1^(3); ...
Asymptotic argument: use lim_{j→∞} L(β̂_1^(j)) to approximate L(β̂_1).

Asymptotic Arguments
Huber [1973] raised the question of understanding the behavior of β̂ when both n and p tend to infinity.
Huber [1973] showed the L_2 consistency of β̂, i.e. ‖β̂ − β*‖²_2 → 0, when p = o(n^{1/3});
Portnoy [1984] proved the L_2 consistency of β̂ when p = o(n / log n).

Asymptotic Arguments
Portnoy [1985] and Mammen [1989] showed that β̂ is jointly asymptotically normal when p ≪ n^{2/3}, in the sense that for any sequence of vectors a_n ∈ R^p,
    L( a_n^T (β̂ − β*) / √Var(a_n^T β̂) ) → N(0, 1).

p/n: A Measure of Difficulty
All of the above works require p/n → 0, i.e. n/p → ∞.
n/p is the number of samples per parameter;
the classical rule of thumb is n/p ≥ 5-10;
heuristically, a larger n/p gives an easier problem;
hypothetical problems with n_j/p_j → ∞ are therefore not appropriate: they are increasingly easier than the original problem.

Moderate p/n Regime
Formally, we define the moderate p/n regime by p/n → κ > 0. The original problem (n = 100, p = 30) is now embedded into hypothetical problems with the same aspect ratio:
hypothetical problem 1 (n_1 = 200, p_1 = 60): estimate β̂_1^(1);
hypothetical problem 2 (n_2 = 500, p_2 = 150): estimate β̂_1^(2);
hypothetical problem 3 (n_3 = 2000, p_3 = 600): estimate β̂_1^(3); ...

Moderate p/n Regime: More Informative Asymptotics
A simulation to compare the fixed-p regime and the moderate-p/n regime.
Original problem: n = 50, p = 50κ, Huber loss, i.i.d. ε_i's. Fix X and β*; draw independent error vectors ε_1, ε_2, ..., ε_r; form the responses y_j = Xβ* + ε_j and the M-estimates β̂_1^(1), β̂_1^(2), ..., β̂_1^(r). Then set
    L̂(β̂_1; X) = ecdf({β̂_1^(1), ..., β̂_1^(r)}).
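The replication scheme above can be sketched as follows (illustrative sizes and replication count; as in the talk, X and β* stay fixed and only the errors are redrawn):

```python
import numpy as np
from scipy.optimize import minimize

k = 1.345
huber = lambda x: np.where(np.abs(x) <= k, x**2 / 2, k * (np.abs(x) - k / 2))
psi = lambda x: np.clip(x, -k, k)          # psi = rho' for the Huber loss

def m_estimate(X, y):
    n, p = X.shape
    obj = lambda b: np.mean(huber(y - X @ b))
    grad = lambda b: -X.T @ psi(y - X @ b) / n
    return minimize(obj, np.zeros(p), jac=grad, method="BFGS").x

rng = np.random.default_rng(0)
n, kappa, r = 50, 0.5, 100                 # illustrative replication count
p = int(n * kappa)
X = rng.standard_normal((n, p))            # X (and beta* = 0) fixed across draws
b1 = np.array([m_estimate(X, rng.standard_normal(n))[0] for _ in range(r)])
ecdf = lambda t: np.mean(b1 <= t)          # estimate of L(beta_hat_1; X)
```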

Moderate p/n Regime: More Informative Asymptotics
The same scheme applied to two larger hypothetical problems:
Fixed-p approximation: n = 1000, p = 50κ. M-estimates β̂_1^(F,1), ..., β̂_1^(F,r) give L̂(β̂_1^F; X) = ecdf({β̂_1^(F,1), ..., β̂_1^(F,r)}).
Moderate-p/n approximation: n = 1000, p = 1000κ. M-estimates β̂_1^(M,1), ..., β̂_1^(M,r) give L̂(β̂_1^M; X) = ecdf({β̂_1^(M,1), ..., β̂_1^(M,r)}).

Moderate p/n Regime: More Informative Asymptotics
Measure the accuracy of the two approximations by the Kolmogorov-Smirnov statistics
    d_KS(L̂(β̂_1), L̂(β̂_1^F))  and  d_KS(L̂(β̂_1), L̂(β̂_1^M)).
[Figure: distance between the small-sample and large-sample distributions for normal and t(2) errors, plotted against κ, for the p-fixed and p/n-fixed asymptotic regimes.]
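The Kolmogorov-Smirnov distance between two empirical laws is available off the shelf; a toy sketch with stand-in samples in place of the simulated ecdfs:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
sample_orig = rng.standard_normal(500)          # stand-in for the ecdf of beta_hat_1
sample_approx = 1.2 * rng.standard_normal(500)  # stand-in for an approximating ecdf
d_ks = ks_2samp(sample_orig, sample_approx).statistic   # sup_t |F_1(t) - F_2(t)|
```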

Moderate p/n Regime: Negative Results
The moderate p/n regime in statistics:
Huber [1973] showed that for least-squares estimators there always exists a sequence of vectors a_n ∈ R^p such that
    L( a_n^T (β̂_LS − β*) / √Var(a_n^T β̂_LS) ) ↛ N(0, 1);
Bickel and Freedman [1982] showed that the bootstrap fails in the least-squares case, and that the usual rescaling does not help;
El Karoui et al. [2011] showed that for general loss functions, ‖β̂ − β*‖_2 ↛ 0;
El Karoui and Purdom [2015] showed that most widely used resampling schemes give poor inference on β*_1.

Moderate p/n Regime: Reason of Failure
Qualitatively:
an influential observation always exists [Huber, 1973]: letting H = X(X^T X)^{-1} X^T be the hat matrix,
    max_{1≤i≤n} H_{i,i} ≥ tr(H)/n = p/n ≫ 0;
the regression residuals fail to mimic the true errors: R_i ≡ y_i − x_i^T β̂ ≉ ε_i.
Technically: the Taylor expansion / Bahadur-type representation fails!
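Both failure modes are easy to check numerically for OLS; a small sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 30
X = rng.standard_normal((n, p))
H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix H = X (X^T X)^{-1} X^T
lev = np.diag(H)
max_lev = lev.max()                      # >= tr(H)/n = p/n: an influential point

eps = rng.standard_normal(n)
resid = eps - H @ eps                    # OLS residuals when beta* = 0: R = (I - H) eps
shrink = resid.var() / eps.var()         # roughly (n - p)/n, so residuals look too small
```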

Moderate p/n Regime: Positive Results (Random Designs)
Bean et al. [2013] showed that when X has i.i.d. Gaussian entries, for any sequence of a_n ∈ R^p,
    L_{X,ε}( a_n^T (β̂ − β*) / √Var_{X,ε}(a_n^T β̂) ) → N(0, 1);
El Karoui [2015] extended this to general random designs.
This does not contradict Huber [1973], in that the randomness now comes from both X and ε; still, El Karoui et al. [2011] showed that for general loss functions, ‖β̂ − β*‖_2 ↛ 0.

Moderate p/n Regime: Summary
It provides a more accurate approximation of L(β̂_1).
It is qualitatively different from the classical regimes where p/n → 0:
the L_2-consistency of β̂ no longer holds;
the residual R_i behaves differently from ε_i;
fixed-design results differ from random-design results.
Inference on the vector β̂ is hard, but inference on a coordinate / low-dimensional linear contrasts of β̂ is still possible.

Goals (Formal)
Our goal (formal): under the linear model y = Xβ* + ε, derive the asymptotic distribution of the coordinates β̂_j,
under the moderate p/n regime, i.e. p/n → κ ∈ (0, 1);
with a fixed design matrix X;
without assumptions on β*.


Main Result (Informal)
Definition 1. Let P and Q be two distributions on R^p; then
    d_TV(P, Q) = sup_{A ⊂ R^p} |P(A) − Q(A)|.
Theorem. Under appropriate conditions on the design matrix X, the distribution of ε and the loss function ρ, as p/n → κ ∈ (0, 1) while n → ∞,
    max_j d_TV( L( (β̂_j − E β̂_j) / √Var(β̂_j) ), N(0, 1) ) = o(1).

Main Result (Informal)
If ρ is an even function and ε =_d −ε, then β̂ − β* =_d β* − β̂, and hence E β̂ = β*.
Theorem. Under appropriate conditions on the design matrix X, the distribution of ε and the loss function ρ, as p/n → κ ∈ (0, 1) while n → ∞,
    max_j d_TV( L( (β̂_j − β*_j) / √Var(β̂_j) ), N(0, 1) ) = o(1).

Why Surprising?
Classical approaches rely heavily on
the L_2 consistency of β̂, which only holds when p = o(n);
a Bahadur-type representation for β̂,
    √n (β̂ − β*) = (1/√n) Σ_{i=1}^n Z_i + o_p(1),
for some i.i.d. random variables Z_i, which can be proved only when p = o(n^{2/3}).
Question: what happens when p lies between O(n^{2/3}) and O(n)?

Our Contributions and Limitations
Instead, we develop a novel strategy built on
the leave-one-out method [El Karoui et al., 2011]; and
the second-order Poincaré inequality [Chatterjee, 2009].
We prove that β̂_1 is asymptotically normal for all p between O(1) and O(n) for fixed designs under regularity conditions, and that these conditions are satisfied by most design matrices.
Limitations:
we impose strong conditions on ρ and L(ε);
we do not know how to estimate Var_ε(β̂_1).

Examples: Realization of i.i.d. Designs
We consider the case where X is a realization of a random design Z. The examples below are proved to satisfy the technical assumptions with high probability over Z.
Example 1: Z has i.i.d. mean-zero sub-Gaussian entries with Var(Z_{ij}) = τ² > 0.
Example 2: Z contains an intercept term, i.e. Z = (1, Z̃), and Z̃ ∈ R^{n×(p−1)} has independent sub-Gaussian entries with Z̃_{ij} − μ_j =_d μ_j − Z̃_{ij} and Var(Z̃_{ij}) > τ², for arbitrary μ_j's.

A Counter-Example
Consider a one-way ANOVA situation: each observation i is associated with a label k_i ∈ {1, ..., p}, and X_{i,j} = I(j = k_i). This is equivalent to y_i = β*_{k_i} + ε_i.
It is easy to see that β̂_j = argmin_{β_j ∈ R} Σ_{i: k_i = j} ρ(y_i − β_j): a standard location problem.
Let n_j = |{i : k_i = j}|. In the least-squares case, i.e. ρ(x) = x²/2,
    β̂_j = β*_j + (1/n_j) Σ_{i: k_i = j} ε_i.
Assume a balanced design, i.e. n_j ≈ n/p. Then n_j ↛ ∞, and none of the β̂_j is asymptotically normal (unless the ε_i are normal); the same holds for general loss functions ρ.
Conclusion: some non-standard assumptions on X are required.
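A quick numerical illustration of the counter-example, assuming skewed (centered exponential) errors: the group average over n_j = 2 observations stays visibly non-normal, while averaging many observations does not:

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)
reps = 5000
# beta_hat_j - beta*_j is the average of only n_j errors; with centered
# exponential errors and n_j = 2 fixed, this average stays non-normal.
small = rng.exponential(size=(reps, 2)).mean(axis=1) - 1.0
d_small = kstest((small - small.mean()) / small.std(), "norm").statistic

# averaging n_j = 200 errors instead: the CLT makes the law nearly normal
large = rng.exponential(size=(reps, 200)).mean(axis=1) - 1.0
d_large = kstest((large - large.mean()) / large.std(), "norm").statistic
```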

Table of Contents: Background; Main Results; Heuristics and Proof Techniques (Least-Squares Estimator: A Motivating Example; Second-Order Poincaré Inequality; Assumptions; Main Results); Numerical Results

Least-Squares Estimator
The L_2 loss ρ(x) = x²/2 gives the least-squares estimator
    β̂_LS = (X^T X)^{-1} X^T y = β* + (X^T X)^{-1} X^T ε.
Let e_j denote the j-th canonical basis vector of R^p; then
    β̂_{LS,j} − β*_j = e_j^T (X^T X)^{-1} X^T ε ≡ α_j^T ε.

Least-Squares Estimator
The Lindeberg-Feller CLT implies that, in order for
    (β̂_{LS,j} − β*_j) / √Var(β̂_{LS,j}) →_L N(0, 1),
it is sufficient, and almost necessary, that
    ‖α_j‖_∞ / ‖α_j‖_2 → 0.  (1)

Least-Squares Estimator
To see the necessity of the condition, recall the one-way ANOVA case. Let n_j = |{i : k_i = j}|; then X^T X = diag(n_j)_{j=1}^p. Recalling that α_j^T = e_j^T (X^T X)^{-1} X^T, this gives
    α_{j,i} = 1/n_j if k_i = j, and 0 if k_i ≠ j.
As a result, ‖α_j‖_∞ = 1/n_j and ‖α_j‖_2 = 1/√n_j, hence
    ‖α_j‖_∞ / ‖α_j‖_2 = 1/√n_j.
However, in the moderate p/n regime there exists j with n_j ≈ 1/κ bounded, and thus β̂_{LS,j} is not asymptotically normal.
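The ratio ‖α_j‖_∞/‖α_j‖_2 in condition (1) can be computed directly; a sketch contrasting a Gaussian design with the balanced one-way ANOVA design (illustrative sizes, `lindeberg_ratio` a hypothetical helper):

```python
import numpy as np

def lindeberg_ratio(X, j):
    """||alpha_j||_inf / ||alpha_j||_2 for alpha_j^T = e_j^T (X^T X)^{-1} X^T."""
    alpha = np.linalg.solve(X.T @ X, X.T)[j]
    return np.abs(alpha).max() / np.linalg.norm(alpha)

rng = np.random.default_rng(0)
n, p = 300, 100
X_gauss = rng.standard_normal((n, p))
# balanced one-way ANOVA design: n_j = 3 observations per group
X_anova = np.kron(np.eye(p), np.ones((n // p, 1)))

r_gauss = lindeberg_ratio(X_gauss, 0)   # small: each coordinate averages many errors
r_anova = lindeberg_ratio(X_anova, 0)   # exactly 1/sqrt(n_j) = 1/sqrt(3)
```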

M-Estimator
The result for the LSE was derived from the analytical form of β̂_LS. By contrast, no analytical form is available for general ρ.
With ψ = ρ′, β̂ is the solution of
    (1/n) Σ_{i=1}^n ψ(y_i − x_i^T β̂) = 0, i.e. (1/n) Σ_{i=1}^n ψ(ε_i − x_i^T (β̂ − β*)) = 0.
We show that
β̂_j is a smooth function of ε;
∂β̂_j/∂ε and ∂²β̂_j/∂ε∂ε^T are computable.

Second-Order Poincaré Inequality
β̂_j is a smooth transform of a random vector ε with independent entries. A powerful CLT for this type of statistic is the second-order Poincaré inequality [Chatterjee, 2009].
Definition 2. For c_1, c_2 > 0, let L(c_1, c_2) be the class of probability measures on R that arise as laws of random variables u(W), where W ∼ N(0, 1) and u ∈ C²(R) with |u′(x)| ≤ c_1 and |u″(x)| ≤ c_2. For example, u = Id gives N(0, 1) and u = Φ gives U([0, 1]).

Second-Order Poincaré Inequality
Proposition 1 (SOPI; Chatterjee [2009]). Let W = (W_1, ..., W_n) have independent entries with laws in L(c_1, c_2). Take any g ∈ C²(R^n), let U = g(W), and set
    κ_1 = (E‖∇g(W)‖_2^4)^{1/4};  κ_2 = (E‖∇²g(W)‖_op^4)^{1/4};  κ_0 = ( Σ_{i=1}^n E(∂_i g(W))^4 )^{1/2}.
If EU^4 < ∞, then
    d_TV( L( (U − EU)/√Var(U) ), N(0, 1) ) ≲ (κ_0 + κ_1 κ_2) / Var(U).

Assumptions
A1: ρ(0) = ψ(0) = 0, and for any x ∈ R, 0 < K_0 ≤ ψ′(x) ≤ K_1 and |ψ″(x)| ≤ K_2;
A2: ε has independent entries with ε_i ∼ L(c_1, c_2);
A3: the largest and smallest eigenvalues λ_+ and λ_− of X^T X / n satisfy λ_+ = O(1) and λ_− = Ω(1);
A4: similar to the condition for OLS,
    max_j ‖e_j^T (X^T X)^{-1} X^T‖_∞ / ‖e_j^T (X^T X)^{-1} X^T‖_2 = o(1);
A5: similar to the condition that
    min_j Var(β̂_j) = Ω(1/n).

Main Results
Theorem 3. Under assumptions A1-A5, as p/n → κ for some κ ∈ (0, 1) while n → ∞,
    max_j d_TV( L( (β̂_j − E β̂_j)/√Var(β̂_j) ), N(0, 1) ) = o(1).


Setup
Design matrix X:
(i.i.d. design): X_{ij} i.i.d. ∼ F;
(partial Hadamard design): a matrix formed by a random set of p columns of an n × n Hadamard matrix.
Entry distribution F: F = N(0, 1); F = t_2.
Error distribution L(ε): the ε_i are i.i.d., with ε_i ∼ N(0, 1) or ε_i ∼ t_2.

Setup
Sample size n: {100, 200, 400, 800};
κ = p/n: {0.5, 0.8};
Loss function ρ: Huber loss with k = 1.345,
    ρ(x) = x²/2 for |x| ≤ k, and k|x| − k²/2 for |x| > k;
Coefficients: β* = 0.

Asymptotic Normality of a Single Coordinate
Fix X and β*; draw independent error vectors ε_1, ε_2, ..., ε_r; form the responses y_j = Xβ* + ε_j and the M-estimates β̂_1^(1), β̂_1^(2), ..., β̂_1^(r). Then:
let ŝd = se({β̂_1^(1), ..., β̂_1^(r)});
we want to compare L(β̂_1/ŝd) with N(0, 1);
as a proxy, count the fraction of β̂_1^(j) ∈ [−1.96 ŝd, 1.96 ŝd];
this fraction should ideally be close to 0.95.
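The coverage proxy can be sketched as follows (illustrative sizes and replication count, β* = 0 as in the talk's setup; `m_estimate` is a hypothetical helper):

```python
import numpy as np
from scipy.optimize import minimize

k = 1.345
huber = lambda x: np.where(np.abs(x) <= k, x**2 / 2, k * (np.abs(x) - k / 2))
psi = lambda x: np.clip(x, -k, k)

def m_estimate(X, y):
    n, p = X.shape
    obj = lambda b: np.mean(huber(y - X @ b))
    grad = lambda b: -X.T @ psi(y - X @ b) / n
    return minimize(obj, np.zeros(p), jac=grad, method="BFGS").x

rng = np.random.default_rng(0)
n, kappa, r = 100, 0.5, 200          # illustrative sizes; the talk uses n up to 800
p = int(n * kappa)
X = rng.standard_normal((n, p))      # i.i.d. N(0, 1) design, beta* = 0
b1 = np.array([m_estimate(X, rng.standard_normal(n))[0] for _ in range(r)])
sd_hat = b1.std(ddof=1)
coverage = np.mean(np.abs(b1) <= 1.96 * sd_hat)   # proxy; ideally close to 0.95
```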

Asymptotic Normality of a Single Coordinate
[Figure: empirical coverage of β*_1 for κ = 0.5 and κ = 0.8, across sample sizes, for i.i.d. and partial Hadamard designs with normal and t(2) entry and error distributions.]

Conclusion
We establish the coordinate-wise asymptotic normality of M-estimators for certain fixed design matrices in the moderate p/n regime, under regularity conditions on X, L(ε) and ρ but with no condition on β*;
We prove the result using the novel approach of the second-order Poincaré inequality [Chatterjee, 2009];
We show that the regularity conditions are satisfied by a broad class of designs.

Discussion
Inference = asymptotic normality + asymptotic bias + asymptotic variance:
Var(β̂_1 | X) vs. Var(β̂_1) when X is indeed a realization of a random design?
Resampling methods to give conservative variance estimates?
More advanced bootstrap?
Relax the regularity conditions:
generalize to non-strongly-convex and non-smooth loss functions?
generalize to general error distributions?
Get rid of asymptotics:
yes, exact finite-sample guarantees if n/p > 20;
no assumption on X or β*;
only an exchangeability assumption on ε.

Thank You!

References
Derek Bean, Peter J. Bickel, Noureddine El Karoui, and Bin Yu. Optimal M-estimation in high-dimensional regression. Proceedings of the National Academy of Sciences, 110(36), 2013.
Peter J. Bickel and David A. Freedman. Bootstrapping regression models with many parameters. In A Festschrift for Erich L. Lehmann, pages 28-48, 1982.
Sourav Chatterjee. Fluctuations of eigenvalues and second order Poincaré inequalities. Probability Theory and Related Fields, 143(1-2):1-40, 2009.
Noureddine El Karoui. On the impact of predictor geometry on the performance of high-dimensional ridge-regularized generalized robust regression estimators. 2015.
Noureddine El Karoui and Elizabeth Purdom. Can we trust the bootstrap in high-dimension? UC Berkeley Statistics Department Technical Report, 2015.
Noureddine El Karoui, Derek Bean, Peter J. Bickel, Chinghway Lim, and Bin Yu. On robust regression with high-dimensional predictors. Proceedings of the National Academy of Sciences, 110(36), 2013.
Peter J. Huber. Robust regression: asymptotics, conjectures and Monte Carlo. The Annals of Statistics, 1(5):799-821, 1973.
Enno Mammen. Asymptotics with increasing dimension for robust regression with applications to the bootstrap. The Annals of Statistics, 1989.
Stephen Portnoy. Asymptotic behavior of M-estimators of p regression parameters when p²/n is large. I. Consistency. The Annals of Statistics, 1984.
Stephen Portnoy. Asymptotic behavior of M-estimators of p regression parameters when p²/n is large. II. Normal approximation. The Annals of Statistics, 1985.


More information

Confidence Intervals for Low-dimensional Parameters with High-dimensional Data

Confidence Intervals for Low-dimensional Parameters with High-dimensional Data Confidence Intervals for Low-dimensional Parameters with High-dimensional Data Cun-Hui Zhang and Stephanie S. Zhang Rutgers University and Columbia University September 14, 2012 Outline Introduction Methodology

More information

Regression Analysis for Data Containing Outliers and High Leverage Points

Regression Analysis for Data Containing Outliers and High Leverage Points Alabama Journal of Mathematics 39 (2015) ISSN 2373-0404 Regression Analysis for Data Containing Outliers and High Leverage Points Asim Kumer Dey Department of Mathematics Lamar University Md. Amir Hossain

More information

F9 F10: Autocorrelation

F9 F10: Autocorrelation F9 F10: Autocorrelation Feng Li Department of Statistics, Stockholm University Introduction In the classic regression model we assume cov(u i, u j x i, x k ) = E(u i, u j ) = 0 What if we break the assumption?

More information

Econometrics. Week 8. Fall Institute of Economic Studies Faculty of Social Sciences Charles University in Prague

Econometrics. Week 8. Fall Institute of Economic Studies Faculty of Social Sciences Charles University in Prague Econometrics Week 8 Institute of Economic Studies Faculty of Social Sciences Charles University in Prague Fall 2012 1 / 25 Recommended Reading For the today Instrumental Variables Estimation and Two Stage

More information

Advanced Statistics II: Non Parametric Tests

Advanced Statistics II: Non Parametric Tests Advanced Statistics II: Non Parametric Tests Aurélien Garivier ParisTech February 27, 2011 Outline Fitting a distribution Rank Tests for the comparison of two samples Two unrelated samples: Mann-Whitney

More information

Econometrics I KS. Module 2: Multivariate Linear Regression. Alexander Ahammer. This version: April 16, 2018

Econometrics I KS. Module 2: Multivariate Linear Regression. Alexander Ahammer. This version: April 16, 2018 Econometrics I KS Module 2: Multivariate Linear Regression Alexander Ahammer Department of Economics Johannes Kepler University of Linz This version: April 16, 2018 Alexander Ahammer (JKU) Module 2: Multivariate

More information

(Part 1) High-dimensional statistics May / 41

(Part 1) High-dimensional statistics May / 41 Theory for the Lasso Recall the linear model Y i = p j=1 β j X (j) i + ɛ i, i = 1,..., n, or, in matrix notation, Y = Xβ + ɛ, To simplify, we assume that the design X is fixed, and that ɛ is N (0, σ 2

More information

Robust estimation, efficiency, and Lasso debiasing

Robust estimation, efficiency, and Lasso debiasing Robust estimation, efficiency, and Lasso debiasing Po-Ling Loh University of Wisconsin - Madison Departments of ECE & Statistics WHOA-PSI workshop Washington University in St. Louis Aug 12, 2017 Po-Ling

More information

A Resampling Method on Pivotal Estimating Functions

A Resampling Method on Pivotal Estimating Functions A Resampling Method on Pivotal Estimating Functions Kun Nie Biostat 277,Winter 2004 March 17, 2004 Outline Introduction A General Resampling Method Examples - Quantile Regression -Rank Regression -Simulation

More information

AFT Models and Empirical Likelihood

AFT Models and Empirical Likelihood AFT Models and Empirical Likelihood Mai Zhou Department of Statistics, University of Kentucky Collaborators: Gang Li (UCLA); A. Bathke; M. Kim (Kentucky) Accelerated Failure Time (AFT) models: Y = log(t

More information

M-Estimation under High-Dimensional Asymptotics

M-Estimation under High-Dimensional Asymptotics M-Estimation under High-Dimensional Asymptotics 2014-05-01 Classical M-estimation Big Data M-estimation An out-of-the-park grand-slam home run Annals of Mathematical Statistics 1964 Richard Olshen Classical

More information

The Bootstrap: Theory and Applications. Biing-Shen Kuo National Chengchi University

The Bootstrap: Theory and Applications. Biing-Shen Kuo National Chengchi University The Bootstrap: Theory and Applications Biing-Shen Kuo National Chengchi University Motivation: Poor Asymptotic Approximation Most of statistical inference relies on asymptotic theory. Motivation: Poor

More information

Supplemental Material for KERNEL-BASED INFERENCE IN TIME-VARYING COEFFICIENT COINTEGRATING REGRESSION. September 2017

Supplemental Material for KERNEL-BASED INFERENCE IN TIME-VARYING COEFFICIENT COINTEGRATING REGRESSION. September 2017 Supplemental Material for KERNEL-BASED INFERENCE IN TIME-VARYING COEFFICIENT COINTEGRATING REGRESSION By Degui Li, Peter C. B. Phillips, and Jiti Gao September 017 COWLES FOUNDATION DISCUSSION PAPER NO.

More information

Estimation of large dimensional sparse covariance matrices

Estimation of large dimensional sparse covariance matrices Estimation of large dimensional sparse covariance matrices Department of Statistics UC, Berkeley May 5, 2009 Sample covariance matrix and its eigenvalues Data: n p matrix X n (independent identically distributed)

More information

High-dimensional regression

High-dimensional regression High-dimensional regression Advanced Methods for Data Analysis 36-402/36-608) Spring 2014 1 Back to linear regression 1.1 Shortcomings Suppose that we are given outcome measurements y 1,... y n R, and

More information

Nonconcave Penalized Likelihood with A Diverging Number of Parameters

Nonconcave Penalized Likelihood with A Diverging Number of Parameters Nonconcave Penalized Likelihood with A Diverging Number of Parameters Jianqing Fan and Heng Peng Presenter: Jiale Xu March 12, 2010 Jianqing Fan and Heng Peng Presenter: JialeNonconcave Xu () Penalized

More information

The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA

The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA Presented by Dongjun Chung March 12, 2010 Introduction Definition Oracle Properties Computations Relationship: Nonnegative Garrote Extensions:

More information

large number of i.i.d. observations from P. For concreteness, suppose

large number of i.i.d. observations from P. For concreteness, suppose 1 Subsampling Suppose X i, i = 1,..., n is an i.i.d. sequence of random variables with distribution P. Let θ(p ) be some real-valued parameter of interest, and let ˆθ n = ˆθ n (X 1,..., X n ) be some estimate

More information

MA Advanced Econometrics: Applying Least Squares to Time Series

MA Advanced Econometrics: Applying Least Squares to Time Series MA Advanced Econometrics: Applying Least Squares to Time Series Karl Whelan School of Economics, UCD February 15, 2011 Karl Whelan (UCD) Time Series February 15, 2011 1 / 24 Part I Time Series: Standard

More information

Introduction The framework Bias and variance Approximate computation of leverage Empirical evaluation Discussion of sampling approach in big data

Introduction The framework Bias and variance Approximate computation of leverage Empirical evaluation Discussion of sampling approach in big data Discussion of sampling approach in big data Big data discussion group at MSCS of UIC Outline 1 Introduction 2 The framework 3 Bias and variance 4 Approximate computation of leverage 5 Empirical evaluation

More information

Bootstrapping Heteroskedasticity Consistent Covariance Matrix Estimator

Bootstrapping Heteroskedasticity Consistent Covariance Matrix Estimator Bootstrapping Heteroskedasticity Consistent Covariance Matrix Estimator by Emmanuel Flachaire Eurequa, University Paris I Panthéon-Sorbonne December 2001 Abstract Recent results of Cribari-Neto and Zarkos

More information

Least squares under convex constraint

Least squares under convex constraint Stanford University Questions Let Z be an n-dimensional standard Gaussian random vector. Let µ be a point in R n and let Y = Z + µ. We are interested in estimating µ from the data vector Y, under the assumption

More information

Data Mining Stat 588

Data Mining Stat 588 Data Mining Stat 588 Lecture 02: Linear Methods for Regression Department of Statistics & Biostatistics Rutgers University September 13 2011 Regression Problem Quantitative generic output variable Y. Generic

More information

UNIVERSITÄT POTSDAM Institut für Mathematik

UNIVERSITÄT POTSDAM Institut für Mathematik UNIVERSITÄT POTSDAM Institut für Mathematik Testing the Acceleration Function in Life Time Models Hannelore Liero Matthias Liero Mathematische Statistik und Wahrscheinlichkeitstheorie Universität Potsdam

More information

Lecture 14 October 13

Lecture 14 October 13 STAT 383C: Statistical Modeling I Fall 2015 Lecture 14 October 13 Lecturer: Purnamrita Sarkar Scribe: Some one Disclaimer: These scribe notes have been slightly proofread and may have typos etc. Note:

More information

Regression Diagnostics for Survey Data

Regression Diagnostics for Survey Data Regression Diagnostics for Survey Data Richard Valliant Joint Program in Survey Methodology, University of Maryland and University of Michigan USA Jianzhu Li (Westat), Dan Liao (JPSM) 1 Introduction Topics

More information

Multivariate Regression Analysis

Multivariate Regression Analysis Matrices and vectors The model from the sample is: Y = Xβ +u with n individuals, l response variable, k regressors Y is a n 1 vector or a n l matrix with the notation Y T = (y 1,y 2,...,y n ) 1 x 11 x

More information

STAT 540: Data Analysis and Regression

STAT 540: Data Analysis and Regression STAT 540: Data Analysis and Regression Wen Zhou http://www.stat.colostate.edu/~riczw/ Email: riczw@stat.colostate.edu Department of Statistics Colorado State University Fall 205 W. Zhou (Colorado State

More information

Statistics 910, #5 1. Regression Methods

Statistics 910, #5 1. Regression Methods Statistics 910, #5 1 Overview Regression Methods 1. Idea: effects of dependence 2. Examples of estimation (in R) 3. Review of regression 4. Comparisons and relative efficiencies Idea Decomposition Well-known

More information

Linear models and their mathematical foundations: Simple linear regression

Linear models and their mathematical foundations: Simple linear regression Linear models and their mathematical foundations: Simple linear regression Steffen Unkel Department of Medical Statistics University Medical Center Göttingen, Germany Winter term 2018/19 1/21 Introduction

More information

A Primer on Asymptotics

A Primer on Asymptotics A Primer on Asymptotics Eric Zivot Department of Economics University of Washington September 30, 2003 Revised: October 7, 2009 Introduction The two main concepts in asymptotic theory covered in these

More information

Robustní monitorování stability v modelu CAPM

Robustní monitorování stability v modelu CAPM Robustní monitorování stability v modelu CAPM Ondřej Chochola, Marie Hušková, Zuzana Prášková (MFF UK) Josef Steinebach (University of Cologne) ROBUST 2012, Němčičky, 10.-14.9. 2012 Contents Introduction

More information

Understanding Regressions with Observations Collected at High Frequency over Long Span

Understanding Regressions with Observations Collected at High Frequency over Long Span Understanding Regressions with Observations Collected at High Frequency over Long Span Yoosoon Chang Department of Economics, Indiana University Joon Y. Park Department of Economics, Indiana University

More information

Robust Principal Component Analysis

Robust Principal Component Analysis ELE 538B: Mathematics of High-Dimensional Data Robust Principal Component Analysis Yuxin Chen Princeton University, Fall 2018 Disentangling sparse and low-rank matrices Suppose we are given a matrix M

More information

Problem 3. Give an example of a sequence of continuous functions on a compact domain converging pointwise but not uniformly to a continuous function

Problem 3. Give an example of a sequence of continuous functions on a compact domain converging pointwise but not uniformly to a continuous function Problem 3. Give an example of a sequence of continuous functions on a compact domain converging pointwise but not uniformly to a continuous function Solution. If we does not need the pointwise limit of

More information

Robust high-dimensional linear regression: A statistical perspective

Robust high-dimensional linear regression: A statistical perspective Robust high-dimensional linear regression: A statistical perspective Po-Ling Loh University of Wisconsin - Madison Departments of ECE & Statistics STOC workshop on robustness and nonconvexity Montreal,

More information

M-estimation in high-dimensional linear model

M-estimation in high-dimensional linear model Wang and Zhu Journal of Inequalities and Applications 208 208:225 https://doi.org/0.86/s3660-08-89-3 R E S E A R C H Open Access M-estimation in high-dimensional linear model Kai Wang and Yanling Zhu *

More information

Indian Statistical Institute

Indian Statistical Institute Indian Statistical Institute Introductory Computer programming Robust Regression methods with high breakdown point Author: Roll No: MD1701 February 24, 2018 Contents 1 Introduction 2 2 Criteria for evaluating

More information

Panel Data Models. James L. Powell Department of Economics University of California, Berkeley

Panel Data Models. James L. Powell Department of Economics University of California, Berkeley Panel Data Models James L. Powell Department of Economics University of California, Berkeley Overview Like Zellner s seemingly unrelated regression models, the dependent and explanatory variables for panel

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2

MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 1 Ridge Regression Ridge regression and the Lasso are two forms of regularized

More information

Can we do statistical inference in a non-asymptotic way? 1

Can we do statistical inference in a non-asymptotic way? 1 Can we do statistical inference in a non-asymptotic way? 1 Guang Cheng 2 Statistics@Purdue www.science.purdue.edu/bigdata/ ONR Review Meeting@Duke Oct 11, 2017 1 Acknowledge NSF, ONR and Simons Foundation.

More information

Lecture 3: Central Limit Theorem

Lecture 3: Central Limit Theorem Lecture 3: Central Limit Theorem Scribe: Jacy Bird (Division of Engineering and Applied Sciences, Harvard) February 8, 003 The goal of today s lecture is to investigate the asymptotic behavior of P N (

More information

Sliced Inverse Regression

Sliced Inverse Regression Sliced Inverse Regression Ge Zhao gzz13@psu.edu Department of Statistics The Pennsylvania State University Outline Background of Sliced Inverse Regression (SIR) Dimension Reduction Definition of SIR Inversed

More information

Quantile Processes for Semi and Nonparametric Regression

Quantile Processes for Semi and Nonparametric Regression Quantile Processes for Semi and Nonparametric Regression Shih-Kang Chao Department of Statistics Purdue University IMS-APRM 2016 A joint work with Stanislav Volgushev and Guang Cheng Quantile Response

More information

Why Do Statisticians Treat Predictors as Fixed? A Conspiracy Theory

Why Do Statisticians Treat Predictors as Fixed? A Conspiracy Theory Why Do Statisticians Treat Predictors as Fixed? A Conspiracy Theory Andreas Buja joint with the PoSI Group: Richard Berk, Lawrence Brown, Linda Zhao, Kai Zhang Ed George, Mikhail Traskin, Emil Pitkin,

More information

Quantile Regression for Panel Data Models with Fixed Effects and Small T : Identification and Estimation

Quantile Regression for Panel Data Models with Fixed Effects and Small T : Identification and Estimation Quantile Regression for Panel Data Models with Fixed Effects and Small T : Identification and Estimation Maria Ponomareva University of Western Ontario May 8, 2011 Abstract This paper proposes a moments-based

More information

Lecture 13: Subsampling vs Bootstrap. Dimitris N. Politis, Joseph P. Romano, Michael Wolf

Lecture 13: Subsampling vs Bootstrap. Dimitris N. Politis, Joseph P. Romano, Michael Wolf Lecture 13: 2011 Bootstrap ) R n x n, θ P)) = τ n ˆθn θ P) Example: ˆθn = X n, τ n = n, θ = EX = µ P) ˆθ = min X n, τ n = n, θ P) = sup{x : F x) 0} ) Define: J n P), the distribution of τ n ˆθ n θ P) under

More information

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept,

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, Linear Regression In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, y = Xβ + ɛ, where y t = (y 1,..., y n ) is the column vector of target values,

More information

A Modern Look at Classical Multivariate Techniques

A Modern Look at Classical Multivariate Techniques A Modern Look at Classical Multivariate Techniques Yoonkyung Lee Department of Statistics The Ohio State University March 16-20, 2015 The 13th School of Probability and Statistics CIMAT, Guanajuato, Mexico

More information

ECO Class 6 Nonparametric Econometrics

ECO Class 6 Nonparametric Econometrics ECO 523 - Class 6 Nonparametric Econometrics Carolina Caetano Contents 1 Nonparametric instrumental variable regression 1 2 Nonparametric Estimation of Average Treatment Effects 3 2.1 Asymptotic results................................

More information

Matrix Factorizations

Matrix Factorizations 1 Stat 540, Matrix Factorizations Matrix Factorizations LU Factorization Definition... Given a square k k matrix S, the LU factorization (or decomposition) represents S as the product of two triangular

More information

Single Index Quantile Regression for Heteroscedastic Data

Single Index Quantile Regression for Heteroscedastic Data Single Index Quantile Regression for Heteroscedastic Data E. Christou M. G. Akritas Department of Statistics The Pennsylvania State University JSM, 2015 E. Christou, M. G. Akritas (PSU) SIQR JSM, 2015

More information

Prediction Intervals For Lasso and Relaxed Lasso Using D Variables

Prediction Intervals For Lasso and Relaxed Lasso Using D Variables Southern Illinois University Carbondale OpenSIUC Research Papers Graduate School 2017 Prediction Intervals For Lasso and Relaxed Lasso Using D Variables Craig J. Bartelsmeyer Southern Illinois University

More information

Inference for Identifiable Parameters in Partially Identified Econometric Models

Inference for Identifiable Parameters in Partially Identified Econometric Models Inference for Identifiable Parameters in Partially Identified Econometric Models Joseph P. Romano Department of Statistics Stanford University romano@stat.stanford.edu Azeem M. Shaikh Department of Economics

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Mixed Effects Estimation, Residuals Diagnostics Week 11, Lecture 1

MA 575 Linear Models: Cedric E. Ginestet, Boston University Mixed Effects Estimation, Residuals Diagnostics Week 11, Lecture 1 MA 575 Linear Models: Cedric E Ginestet, Boston University Mixed Effects Estimation, Residuals Diagnostics Week 11, Lecture 1 1 Within-group Correlation Let us recall the simple two-level hierarchical

More information

Least Squares Estimation-Finite-Sample Properties

Least Squares Estimation-Finite-Sample Properties Least Squares Estimation-Finite-Sample Properties Ping Yu School of Economics and Finance The University of Hong Kong Ping Yu (HKU) Finite-Sample 1 / 29 Terminology and Assumptions 1 Terminology and Assumptions

More information

Program Evaluation with High-Dimensional Data

Program Evaluation with High-Dimensional Data Program Evaluation with High-Dimensional Data Alexandre Belloni Duke Victor Chernozhukov MIT Iván Fernández-Val BU Christian Hansen Booth ESWC 215 August 17, 215 Introduction Goal is to perform inference

More information

Nonparametric Inference via Bootstrapping the Debiased Estimator

Nonparametric Inference via Bootstrapping the Debiased Estimator Nonparametric Inference via Bootstrapping the Debiased Estimator Yen-Chi Chen Department of Statistics, University of Washington ICSA-Canada Chapter Symposium 2017 1 / 21 Problem Setup Let X 1,, X n be

More information

On Modifications to Linking Variance Estimators in the Fay-Herriot Model that Induce Robustness

On Modifications to Linking Variance Estimators in the Fay-Herriot Model that Induce Robustness Statistics and Applications {ISSN 2452-7395 (online)} Volume 16 No. 1, 2018 (New Series), pp 289-303 On Modifications to Linking Variance Estimators in the Fay-Herriot Model that Induce Robustness Snigdhansu

More information

1 Motivation for Instrumental Variable (IV) Regression

1 Motivation for Instrumental Variable (IV) Regression ECON 370: IV & 2SLS 1 Instrumental Variables Estimation and Two Stage Least Squares Econometric Methods, ECON 370 Let s get back to the thiking in terms of cross sectional (or pooled cross sectional) data

More information

Confidence Intervals. Confidence interval for sample mean. Confidence interval for sample mean. Confidence interval for sample mean

Confidence Intervals. Confidence interval for sample mean. Confidence interval for sample mean. Confidence interval for sample mean Confidence Intervals Confidence interval for sample mean The CLT tells us: as the sample size n increases, the sample mean is approximately Normal with mean and standard deviation Thus, we have a standard

More information

Fluctuations from the Semicircle Law Lecture 4

Fluctuations from the Semicircle Law Lecture 4 Fluctuations from the Semicircle Law Lecture 4 Ioana Dumitriu University of Washington Women and Math, IAS 2014 May 23, 2014 Ioana Dumitriu (UW) Fluctuations from the Semicircle Law Lecture 4 May 23, 2014

More information

Quantile Regression for Extraordinarily Large Data

Quantile Regression for Extraordinarily Large Data Quantile Regression for Extraordinarily Large Data Shih-Kang Chao Department of Statistics Purdue University November, 2016 A joint work with Stanislav Volgushev and Guang Cheng Quantile regression Two-step

More information

Inference on distributions and quantiles using a finite-sample Dirichlet process

Inference on distributions and quantiles using a finite-sample Dirichlet process Dirichlet IDEAL Theory/methods Simulations Inference on distributions and quantiles using a finite-sample Dirichlet process David M. Kaplan University of Missouri Matt Goldman UC San Diego Midwest Econometrics

More information

Large sample distribution for fully functional periodicity tests

Large sample distribution for fully functional periodicity tests Large sample distribution for fully functional periodicity tests Siegfried Hörmann Institute for Statistics Graz University of Technology Based on joint work with Piotr Kokoszka (Colorado State) and Gilles

More information

Model Mis-specification

Model Mis-specification Model Mis-specification Carlo Favero Favero () Model Mis-specification 1 / 28 Model Mis-specification Each specification can be interpreted of the result of a reduction process, what happens if the reduction

More information

High-dimensional covariance estimation based on Gaussian graphical models

High-dimensional covariance estimation based on Gaussian graphical models High-dimensional covariance estimation based on Gaussian graphical models Shuheng Zhou Department of Statistics, The University of Michigan, Ann Arbor IMA workshop on High Dimensional Phenomena Sept. 26,

More information

JEREMY TAYLOR S CONTRIBUTIONS TO TRANSFORMATION MODEL

JEREMY TAYLOR S CONTRIBUTIONS TO TRANSFORMATION MODEL 1 / 25 JEREMY TAYLOR S CONTRIBUTIONS TO TRANSFORMATION MODELS DEPT. OF STATISTICS, UNIV. WISCONSIN, MADISON BIOMEDICAL STATISTICAL MODELING. CELEBRATION OF JEREMY TAYLOR S OF 60TH BIRTHDAY. UNIVERSITY

More information

EXTENDED GLRT DETECTORS OF CORRELATION AND SPHERICITY: THE UNDERSAMPLED REGIME. Xavier Mestre 1, Pascal Vallet 2

EXTENDED GLRT DETECTORS OF CORRELATION AND SPHERICITY: THE UNDERSAMPLED REGIME. Xavier Mestre 1, Pascal Vallet 2 EXTENDED GLRT DETECTORS OF CORRELATION AND SPHERICITY: THE UNDERSAMPLED REGIME Xavier Mestre, Pascal Vallet 2 Centre Tecnològic de Telecomunicacions de Catalunya, Castelldefels, Barcelona (Spain) 2 Institut

More information

the error term could vary over the observations, in ways that are related

the error term could vary over the observations, in ways that are related Heteroskedasticity We now consider the implications of relaxing the assumption that the conditional variance Var(u i x i ) = σ 2 is common to all observations i = 1,..., n In many applications, we may

More information

Linear Methods for Prediction

Linear Methods for Prediction Chapter 5 Linear Methods for Prediction 5.1 Introduction We now revisit the classification problem and focus on linear methods. Since our prediction Ĝ(x) will always take values in the discrete set G we

More information

Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D.

Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D. Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D. Ruppert A. EMPIRICAL ESTIMATE OF THE KERNEL MIXTURE Here we

More information

A Comparison of Robust Estimators Based on Two Types of Trimming

A Comparison of Robust Estimators Based on Two Types of Trimming Submitted to the Bernoulli A Comparison of Robust Estimators Based on Two Types of Trimming SUBHRA SANKAR DHAR 1, and PROBAL CHAUDHURI 1, 1 Theoretical Statistics and Mathematics Unit, Indian Statistical

More information

Hierarchical Modeling for Univariate Spatial Data

Hierarchical Modeling for Univariate Spatial Data Hierarchical Modeling for Univariate Spatial Data Geography 890, Hierarchical Bayesian Models for Environmental Spatial Data Analysis February 15, 2011 1 Spatial Domain 2 Geography 890 Spatial Domain This

More information

Reconstruction from Anisotropic Random Measurements

Reconstruction from Anisotropic Random Measurements Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013

More information

36. Multisample U-statistics and jointly distributed U-statistics Lehmann 6.1

36. Multisample U-statistics and jointly distributed U-statistics Lehmann 6.1 36. Multisample U-statistics jointly distributed U-statistics Lehmann 6.1 In this topic, we generalize the idea of U-statistics in two different directions. First, we consider single U-statistics for situations

More information

δ -method and M-estimation

δ -method and M-estimation Econ 2110, fall 2016, Part IVb Asymptotic Theory: δ -method and M-estimation Maximilian Kasy Department of Economics, Harvard University 1 / 40 Example Suppose we estimate the average effect of class size

More information

Contents. 1 Review of Residuals. 2 Detecting Outliers. 3 Influential Observations. 4 Multicollinearity and its Effects

Contents. 1 Review of Residuals. 2 Detecting Outliers. 3 Influential Observations. 4 Multicollinearity and its Effects Contents 1 Review of Residuals 2 Detecting Outliers 3 Influential Observations 4 Multicollinearity and its Effects W. Zhou (Colorado State University) STAT 540 July 6th, 2015 1 / 32 Model Diagnostics:

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 24 Paper 153 A Note on Empirical Likelihood Inference of Residual Life Regression Ying Qing Chen Yichuan

More information

Single Index Quantile Regression for Heteroscedastic Data

Single Index Quantile Regression for Heteroscedastic Data Single Index Quantile Regression for Heteroscedastic Data E. Christou M. G. Akritas Department of Statistics The Pennsylvania State University SMAC, November 6, 2015 E. Christou, M. G. Akritas (PSU) SIQR

More information

Stat 5100 Handout #26: Variations on OLS Linear Regression (Ch. 11, 13)

Stat 5100 Handout #26: Variations on OLS Linear Regression (Ch. 11, 13) Stat 5100 Handout #26: Variations on OLS Linear Regression (Ch. 11, 13) 1. Weighted Least Squares (textbook 11.1) Recall regression model Y = β 0 + β 1 X 1 +... + β p 1 X p 1 + ε in matrix form: (Ch. 5,

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Econometrics Summary Algebraic and Statistical Preliminaries

Econometrics Summary Algebraic and Statistical Preliminaries Econometrics Summary Algebraic and Statistical Preliminaries Elasticity: The point elasticity of Y with respect to L is given by α = ( Y/ L)/(Y/L). The arc elasticity is given by ( Y/ L)/(Y/L), when L

More information

CALCULATION METHOD FOR NONLINEAR DYNAMIC LEAST-ABSOLUTE DEVIATIONS ESTIMATOR

CALCULATION METHOD FOR NONLINEAR DYNAMIC LEAST-ABSOLUTE DEVIATIONS ESTIMATOR J. Japan Statist. Soc. Vol. 3 No. 200 39 5 CALCULAION MEHOD FOR NONLINEAR DYNAMIC LEAS-ABSOLUE DEVIAIONS ESIMAOR Kohtaro Hitomi * and Masato Kagihara ** In a nonlinear dynamic model, the consistency and

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 59 Classical case: n d. Asymptotic assumption: d is fixed and n. Basic tools: LLN and CLT. High-dimensional setting: n d, e.g. n/d

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html 1 / 42 Passenger car mileage Consider the carmpg dataset taken from

More information

Probability and Statistics Notes

Probability and Statistics Notes Probability and Statistics Notes Chapter Seven Jesse Crawford Department of Mathematics Tarleton State University Spring 2011 (Tarleton State University) Chapter Seven Notes Spring 2011 1 / 42 Outline

More information

Spatial inference. Spatial inference. Accounting for spatial correlation. Multivariate normal distributions

Spatial inference. Spatial inference. Accounting for spatial correlation. Multivariate normal distributions Spatial inference I will start with a simple model, using species diversity data Strong spatial dependence, Î = 0.79 what is the mean diversity? How precise is our estimate? Sampling discussion: The 64

More information

STAT 4385 Topic 06: Model Diagnostics

STAT 4385 Topic 06: Model Diagnostics STAT 4385 Topic 06: Xiaogang Su, Ph.D. Department of Mathematical Science University of Texas at El Paso xsu@utep.edu Spring, 2016 1/ 40 Outline Several Types of Residuals Raw, Standardized, Studentized

More information

Lawrence D. Brown* and Daniel McCarthy*

Lawrence D. Brown* and Daniel McCarthy* Comments on the paper, An adaptive resampling test for detecting the presence of significant predictors by I. W. McKeague and M. Qian Lawrence D. Brown* and Daniel McCarthy* ABSTRACT: This commentary deals

More information