Inference For High Dimensional M-estimates: Fixed Design Results

Inference For High Dimensional M-estimates: Fixed Design Results

Lihua Lei
Advisors: Peter J. Bickel, Michael I. Jordan
Joint work with Peter J. Bickel and Noureddine El Karoui
Dec. 8

Table of Contents

1 Background
2 Main Results and Examples
3 Assumptions and Proof Sketch
4 Numerical Results

Setup

Observe $\{x_1, y_1\}, \{x_2, y_2\}, \ldots, \{x_n, y_n\}$:
- response vector $Y = (y_1, \ldots, y_n)^T \in \mathbb{R}^n$;
- design matrix $X = (x_1^T, \ldots, x_n^T)^T \in \mathbb{R}^{n \times p}$.

Model: the linear model $Y = X\beta^* + \epsilon$, with $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T \in \mathbb{R}^n$ a random vector.

M-Estimator

Given a convex loss function $\rho(\cdot): \mathbb{R} \to [0, \infty)$,
$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n \rho(y_i - x_i^T \beta).$$
When $\rho$ is differentiable with $\psi = \rho'$, $\hat\beta$ can be written as the solution of the estimating equation
$$\frac{1}{n} \sum_{i=1}^n \psi(y_i - x_i^T \hat\beta)\, x_i = 0.$$
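To make the definition concrete, here is a minimal Python sketch (not from the talk) that computes an M-estimator by directly minimizing the empirical risk; `rho` is any smooth convex loss, such as the Huber loss defined on the next slide.

```python
# A minimal sketch of computing an M-estimator by minimizing the empirical
# risk (1/n) * sum_i rho(y_i - x_i^T beta); assumes rho is convex and smooth.
import numpy as np
from scipy.optimize import minimize

def m_estimate(X, Y, rho):
    """Return argmin over beta of mean(rho(Y - X @ beta))."""
    obj = lambda beta: np.mean(rho(Y - X @ beta))
    beta0 = np.linalg.lstsq(X, Y, rcond=None)[0]  # least-squares warm start
    # tol is set tightly so the first-order condition holds to high accuracy
    return minimize(obj, beta0, method="BFGS", tol=1e-10).x
```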

M-Estimator: Examples

- $\rho(x) = x^2/2$ gives the Least-Squares estimator;
- $\rho(x) = |x|$ gives the Least-Absolute-Deviation estimator;
- $\rho(x) = \begin{cases} x^2/2 & |x| \le k \\ k(|x| - k/2) & |x| > k \end{cases}$ gives the Huber estimator.

[Figure: $\rho(x)$ and $\psi(x)$ for the L2, L1 and Huber losses.]
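For reference, a sketch of these three losses in Python, together with the score $\psi = \rho'$ of the Huber loss (the tuning constant $k = 1.345$ is the value used later in the numerical section):

```python
# The three example losses; psi_huber is the derivative of rho_huber.
import numpy as np

def rho_l2(x):
    return x**2 / 2

def rho_l1(x):
    return np.abs(x)

def rho_huber(x, k=1.345):
    return np.where(np.abs(x) <= k, x**2 / 2, k * (np.abs(x) - k / 2))

def psi_huber(x, k=1.345):
    return np.clip(x, -k, k)              # equals x on [-k, k], +/- k outside

def psi_prime_huber(x, k=1.345):
    return (np.abs(x) <= k).astype(float)  # second derivative of rho_huber a.e.
```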

Goals (Informal)

Goal (informal): make inference on the coordinates of $\beta^*$ when
- the dimension $p$ is comparable to the sample size $n$;
- $X$ is treated as fixed;
- no assumptions are imposed on $\beta^*$.

Consider $\beta_1^*$ WLOG. Given $X$ and $\mathcal{L}(\epsilon)$, $\mathcal{L}(\hat\beta_1)$ is uniquely determined. Ideally, we would construct a 95% confidence interval for $\beta_1^*$ as
$$\left[ q_{0.025}\big(\mathcal{L}(\hat\beta_1)\big),\; q_{0.975}\big(\mathcal{L}(\hat\beta_1)\big) \right],$$
where $q_\alpha$ denotes the $\alpha$-th quantile. Unfortunately, $\mathcal{L}(\hat\beta_1)$ is complicated.

Asymptotic Arguments

Exact finite-sample inference is hard. This motivates statisticians to resort to asymptotic arguments, i.e. to find a distribution $F$ such that $\mathcal{L}(\hat\beta_1) \approx F$.

The limiting behavior of $\hat\beta$ when $p$ is fixed and $n \to \infty$ is
$$\mathcal{L}(\hat\beta) \approx N\left( \beta^*,\; (X^T X)^{-1} \frac{E\psi^2(\epsilon_1)}{[E\psi'(\epsilon_1)]^2} \right).$$
As a consequence, we obtain an approximate 95% confidence interval for $\beta_1^*$:
$$\left[ \hat\beta_1 - 1.96\,\widehat{\mathrm{sd}}(\hat\beta_1),\; \hat\beta_1 + 1.96\,\widehat{\mathrm{sd}}(\hat\beta_1) \right],$$
where $\widehat{\mathrm{sd}}(\hat\beta_1)$ could be any consistent estimator of the standard deviation.
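As an illustration, here is a sketch (under the fixed-p asymptotics above, not the authors' code) of the plug-in version of this interval, estimating $E\psi^2(\epsilon_1)$ and $E\psi'(\epsilon_1)$ from the residuals:

```python
# Classical fixed-p 95% interval for beta*_j with a plug-in variance estimate.
import numpy as np
from scipy.stats import norm

def classical_ci(X, Y, beta_hat, psi, psi_prime, j=0, level=0.95):
    R = Y - X @ beta_hat                       # residuals
    a = np.mean(psi(R) ** 2)                   # estimates E psi^2(eps_1)
    b = np.mean(psi_prime(R))                  # estimates E psi'(eps_1)
    cov = np.linalg.inv(X.T @ X) * a / b**2    # plug-in covariance of beta_hat
    z = norm.ppf(0.5 + level / 2)              # 1.96 for level = 0.95
    half = z * np.sqrt(cov[j, j])
    return beta_hat[j] - half, beta_hat[j] + half
```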

Asymptotic Arguments

In other words, to approximate $\mathcal{L}(\hat\beta_1)$ we consider a sequence of hypothetical problems, indexed by $j$, where the $j$-th problem has sample size $n_j$ and dimension $p_j = p$.

For the $j$-th problem, denote by $\hat\beta^{(j)}$ the corresponding M-estimator; the previous slide then uses $\lim_{j\to\infty} \mathcal{L}(\hat\beta_1^{(j)})$ to approximate $\mathcal{L}(\hat\beta_1)$.

In general, $p_j$ is not necessarily fixed and can grow to infinity.

Asymptotic Arguments

- Huber (1973) raised the question of understanding the behavior of $\hat\beta$ when both $n$ and $p$ tend to infinity;
- Huber (1973) showed the $L_2$-consistency of $\hat\beta$, i.e. $\|\hat\beta - \beta^*\|_2 \to 0$, under the regime $p^3/n \to 0$;
- Portnoy (1984) proved the $L_2$-consistency of $\hat\beta$ under the regime $p\log p/n \to 0$.

Asymptotic Arguments

Portnoy (1985) showed that $\hat\beta$ is jointly asymptotically normal under the regime $(p\log n)^{3/2}/n \to 0$, in the sense that for any sequence of vectors $a_n \in \mathbb{R}^p$,
$$\mathcal{L}\left( \frac{a_n^T (\hat\beta - \beta^*)}{\sqrt{\mathrm{Var}(a_n^T \hat\beta)}} \right) \to N(0, 1).$$

p/n: A Measure of Difficulty

All of the above works require $p/n \to 0$, or equivalently $n/p \to \infty$. Note that $n/p$ is the number of samples per parameter; heuristically, a larger $n/p$ gives an easier problem.

p/n: A Measure of Difficulty

Recall that the approximation can be seen as a sequence of hypothetical problems with sample sizes $n_j$ and dimensions $p_j$. If $n_j/p_j \to \infty$, the problems become increasingly easier as $j$ grows. In other words, the hypothetical problems used for the approximation are much easier than the original problem, and the approximation accuracy might then be compromised.

Moderate p/n Regime

Instead, we can consider a sequence of hypothetical problems with $p_j/n_j$ fixed at the value of the original problem, i.e. $p_j/n_j \equiv p/n$. In this case, the difficulty of the problem is fixed.

Moderate p/n Regime

Formally, we define the moderate p/n regime by $p_j/n_j \to \kappa > 0$. A typical value for $\kappa$ is $p/n$ in the original problem.

Moderate p/n Regime: More Informative Asymptotics

Consider a set of small-sample problems where $n = 50$ and $p = n\kappa$ for $\kappa \in \{0.1, \ldots, 0.9\}$. For each pair $(n, p)$ (see the sketch below):
Step 1: generate $X \in \mathbb{R}^{n \times p}$ with i.i.d. $N(0,1)$ entries;
Step 2: fix $\beta^* = 0$ and sample $Y = \epsilon$ with $\epsilon_i$ i.i.d. $N(0,1)$ or i.i.d. $t_2$;
Step 3: estimate $\beta_1^*$ by $\hat\beta_1$ with the Huber loss;
Step 4: repeat Steps 2-3 100 times and estimate $\mathcal{L}(\hat\beta_1)$.
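A compact re-implementation of Steps 1-4 (a sketch; it reuses `m_estimate` and `rho_huber` from the earlier sketches, which are assumptions of this illustration, not the authors' code):

```python
# Steps 1-4: one fixed design, repeated error draws, samples of beta_hat_1.
import numpy as np

rng = np.random.default_rng(0)

def sample_beta1(n, p, err="normal", reps=100):
    X = rng.standard_normal((n, p))        # Step 1: i.i.d. N(0,1) design
    draws = []
    for _ in range(reps):                  # Steps 2-4
        eps = rng.standard_normal(n) if err == "normal" else rng.standard_t(2, n)
        draws.append(m_estimate(X, eps, rho_huber)[0])  # Y = eps since beta* = 0
    return np.asarray(draws)               # empirical draws of beta_hat_1
```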

Moderate p/n Regime: More Informative Asymptotics

Now consider two types of approximations:
- Fixed-p approximation: $N = 1000$, $P = p$;
- Moderate-p/n approximation: $N = 1000$, $P = 1000\kappa$.

Repeat Steps 1-4 for the new pairs $(N, P)$ and estimate $\mathcal{L}(\hat\beta_1^F)$ (fixed p) and $\mathcal{L}(\hat\beta_1^M)$ (moderate p/n). Measure the accuracy of the two approximations by the Kolmogorov-Smirnov statistics
$$d_{KS}\big(\mathcal{L}(\hat\beta_1), \mathcal{L}(\hat\beta_1^F)\big) \quad \text{and} \quad d_{KS}\big(\mathcal{L}(\hat\beta_1), \mathcal{L}(\hat\beta_1^M)\big).$$
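And a sketch of the comparison itself, using scipy's two-sample Kolmogorov-Smirnov statistic as a stand-in for $d_{KS}$; the standardization step is an assumption of this sketch, since the raw scales of $\hat\beta_1$ differ between $n = 50$ and $N = 1000$ (it reuses `sample_beta1` from the previous sketch):

```python
# Compare fixed-p vs. moderate-p/n approximations of L(beta_hat_1) via KS.
from scipy.stats import ks_2samp

def standardized(s):
    return (s - s.mean()) / s.std()

n, kappa = 50, 0.5
target  = standardized(sample_beta1(n, int(n * kappa)))        # small-sample law
fixed_p = standardized(sample_beta1(1000, int(n * kappa)))     # P = p
mod_pn  = standardized(sample_beta1(1000, int(1000 * kappa)))  # P = 1000 * kappa
print("fixed-p KS:     ", ks_2samp(target, fixed_p).statistic)
print("moderate-p/n KS:", ks_2samp(target, mod_pn).statistic)
```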

Moderate p/n Regime: More Informative Asymptotics

[Figure: Kolmogorov-Smirnov distance between the small-sample and large-sample distributions of $\hat\beta_1$, as a function of $\kappa$, for normal and $t_2$ errors, comparing the p-fixed and p/n-fixed asymptotic regimes.]

Moderate p/n Regime: Negative Results

The moderate p/n regime has been widely studied in random matrix theory. In statistics:
- Huber (1973) showed that for least-squares estimators there always exists a sequence of vectors $a_n \in \mathbb{R}^p$ such that
$$\mathcal{L}\left( \frac{a_n^T (\hat\beta_{LS} - \beta^*)}{\sqrt{\mathrm{Var}(a_n^T \hat\beta_{LS})}} \right) \not\to N(0, 1);$$
- Bickel and Freedman (1982) showed that the bootstrap fails in the least-squares case and that the usual rescaling does not help;
- El Karoui et al. (2011) showed that for general loss functions, $\|\hat\beta - \beta^*\|_2 \not\to 0$.

Main reason: $\hat F_n$, the empirical distribution of the residuals $R_i \triangleq y_i - x_i^T \hat\beta$, does not converge to $\mathcal{L}(\epsilon_i)$.

Moderate p/n Regime: Positive Results

If $X$ is assumed to be a random matrix satisfying regularity conditions:
- Bean et al. (2013) showed that when $X$ has i.i.d. Gaussian entries, for any sequence of $a_n \in \mathbb{R}^p$,
$$\mathcal{L}_{X,\epsilon}\left( \frac{a_n^T (\hat\beta - \beta^*)}{\sqrt{\mathrm{Var}_{X,\epsilon}(a_n^T \hat\beta)}} \right) \to N(0, 1);$$
this does not contradict Huber (1973), in that the randomness comes from both $X$ and $\epsilon$;
- El Karoui et al. (2011) characterized the limiting behavior of $\|\hat\beta - \beta^*\|_2$ for general loss functions;
- under weaker assumptions on $X$, El Karoui (2015) showed
$$\mathcal{L}_{X,\epsilon}\left( \frac{\hat\beta_1(\tau) - \beta_1^* - \mathrm{bias}(\hat\beta_1(\tau))}{\sqrt{\mathrm{Var}_{X,\epsilon}(\hat\beta_1(\tau))}} \right) \to N(0, 1),$$
where $\hat\beta_1(\tau)$ is the ridge-penalized M-estimator.

Moderate p/n Regime: Summary

- Provides a more accurate approximation of $\mathcal{L}(\hat\beta_1)$;
- Qualitatively different from the classical regimes where $p/n \to 0$:
  - the $L_2$-consistency of $\hat\beta$ no longer holds;
  - the residuals $R_i$ behave differently from the $\epsilon_i$;
  - fixed-design results are different from random-design results;
- Inference on the vector $\hat\beta$ is hard, but inference on coordinates / low-dimensional linear contrasts of $\hat\beta$ is still possible.

Goals (Formal)

Our goal (formal): under the linear model $Y = X\beta^* + \epsilon$, derive the asymptotic distribution of the coordinates $\hat\beta_j$:
- under the moderate p/n regime, i.e. $p/n \to \kappa \in (0, 1)$;
- with a fixed design matrix $X$;
- without assumptions on $\beta^*$.

Table of Contents

1 Background
2 Main Results and Examples
3 Assumptions and Proof Sketch
4 Numerical Results

Main Result (Informal)

Definition 1. Let $P$ and $Q$ be two distributions on $\mathbb{R}^p$. The total-variation distance between them is
$$d_{TV}(P, Q) = \sup_{A \subset \mathbb{R}^p} |P(A) - Q(A)|.$$

Theorem. Under appropriate conditions on the design matrix $X$, the distribution of $\epsilon$ and the loss function $\rho$, as $p/n \to \kappa \in (0, 1)$ while $n \to \infty$,
$$\max_j d_{TV}\left( \mathcal{L}\left( \frac{\hat\beta_j - E\hat\beta_j}{\sqrt{\mathrm{Var}(\hat\beta_j)}} \right),\; N(0, 1) \right) = o(1).$$

Examples: Realizations of i.i.d. Designs

We consider the case where $X$ is a realization of a random design $Z$. The examples below are proved to satisfy the technical assumptions with high probability over $Z$.

Example 1: $Z$ has i.i.d. mean-zero sub-Gaussian entries with $\mathrm{Var}(Z_{ij}) = \tau^2 > 0$;
Example 2: $Z$ contains an intercept term, i.e. $Z = (\mathbf{1}, \tilde Z)$, and $\tilde Z \in \mathbb{R}^{n \times (p-1)}$ has independent sub-Gaussian entries, symmetric about some arbitrary $\mu_j$ in the sense that $\tilde Z_{ij} - \mu_j \stackrel{d}{=} \mu_j - \tilde Z_{ij}$, with $\mathrm{Var}(\tilde Z_{ij}) > \tau^2$.

Examples: Realizations of Dependent Gaussian Designs

Example 3: $Z$ is matrix-normal with $\mathrm{vec}(Z) \sim N(0, \Lambda \otimes \Sigma)$ and
$$\lambda_{\max}(\Lambda), \lambda_{\max}(\Sigma) = O(1), \qquad \lambda_{\min}(\Lambda), \lambda_{\min}(\Sigma) = \Omega(1);$$
Example 4: $Z$ contains an intercept term, i.e. $Z = (\mathbf{1}, \tilde Z)$, and $\mathrm{vec}(\tilde Z) \sim N(0, \Lambda \otimes \Sigma)$ with $\Lambda$ and $\Sigma$ satisfying the above condition and
$$\frac{\max_i (\Lambda^{1/2} \mathbf{1})_i}{\min_i (\Lambda^{1/2} \mathbf{1})_i} = O(1).$$

A Counter-Example

Consider a one-way ANOVA situation: each observation $i$ is associated with a label $k_i \in \{1, \ldots, p\}$, and $X_{i,j} = I(j = k_i)$. This is equivalent to $Y_i = \beta^*_{k_i} + \epsilon_i$.

It is easy to see that
$$\hat\beta_j = \arg\min_{\beta_j \in \mathbb{R}} \sum_{i: k_i = j} \rho(Y_i - \beta_j).$$
This is a standard location problem.

A Counter-Example

Let $n_j = |\{i : k_i = j\}|$. In the least-squares case, i.e. $\rho(x) = x^2/2$,
$$\hat\beta_j = \beta_j^* + \frac{1}{n_j} \sum_{i: k_i = j} \epsilon_i.$$
Assume a balanced design, i.e. $n_j \approx n/p$. Then $n_j \not\to \infty$ and
- none of the $\hat\beta_j$ is asymptotically normal (unless the $\epsilon_i$ are normal);
- the same holds for general loss functions $\rho$.

Conclusion: some non-standard assumptions on $X$ are required.

Table of Contents

1 Background
2 Main Results and Examples
3 Assumptions and Proof Sketch
   - Least-Squares Estimator: A Motivating Example
   - Second-Order Poincaré Inequality
   - Assumptions
   - Main Results
4 Numerical Results

Least-Squares Estimator

The $L_2$ loss $\rho(x) = x^2/2$ gives the least-squares estimator
$$\hat\beta_{LS} = (X^T X)^{-1} X^T Y = \beta^* + (X^T X)^{-1} X^T \epsilon.$$
Let $e_j$ denote the $j$-th canonical basis vector in $\mathbb{R}^p$; then
$$\hat\beta_{LS,j} - \beta_j^* = e_j^T (X^T X)^{-1} X^T \epsilon.$$
Writing $e_j^T (X^T X)^{-1} X^T$ as $\alpha_j^T$, we get
$$\hat\beta_{LS,j} - \beta_j^* = \sum_{i=1}^n \alpha_{j,i}\, \epsilon_i.$$

Least-Squares Estimator

The Lindeberg-Feller CLT implies that in order for
$$\mathcal{L}\left( \frac{\hat\beta_{LS,j} - \beta_j^*}{\sqrt{\mathrm{Var}(\hat\beta_{LS,j})}} \right) \to N(0, 1),$$
it is sufficient, and almost necessary, that
$$\frac{\|\alpha_j\|_\infty}{\|\alpha_j\|_2} \to 0. \tag{1}$$
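Condition (1) is easy to inspect numerically; a sketch:

```python
# The Lindeberg-type ratio ||alpha_j||_inf / ||alpha_j||_2 from condition (1).
import numpy as np

def lindeberg_ratio(X, j):
    A = np.linalg.solve(X.T @ X, X.T)   # rows are alpha_j^T = e_j^T (X^T X)^{-1} X^T
    alpha_j = A[j]
    return np.abs(alpha_j).max() / np.linalg.norm(alpha_j)

# Example: for an i.i.d. Gaussian design the ratio is small even when p/n = 0.5.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 500))
print(lindeberg_ratio(X, 0))
```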

Least-Squares Estimator

To see the necessity of the condition, recall the one-way ANOVA case. With $n_j = |\{i : k_i = j\}|$ we have $X^T X = \mathrm{diag}(n_j)_{j=1}^p$, which gives
$$\alpha_{j,i} = \begin{cases} 1/n_j & k_i = j \\ 0 & k_i \neq j. \end{cases}$$
As a result, $\|\alpha_j\|_\infty = 1/n_j$ and $\|\alpha_j\|_2 = 1/\sqrt{n_j}$, and hence
$$\frac{\|\alpha_j\|_\infty}{\|\alpha_j\|_2} = \frac{1}{\sqrt{n_j}}.$$
However, in the moderate p/n regime there exists $j$ such that $n_j \approx 1/\kappa$ stays bounded, and thus $\hat\beta_{LS,j}$ is not asymptotically normal.

M-Estimator

The result for the LSE is derived from the analytical form of $\hat\beta_{LS}$. In contrast, an analytical form is not available for general $\rho$. With $\psi = \rho'$, $\hat\beta$ is the solution of
$$\frac{1}{n} \sum_{i=1}^n \psi(y_i - x_i^T \hat\beta)\, x_i = 0.$$
WLOG, assume $\beta^* = 0$; then
$$\frac{1}{n} \sum_{i=1}^n \psi(\epsilon_i - x_i^T \hat\beta)\, x_i = 0.$$

M-Estimator

Write $R_i$ for $\epsilon_i - x_i^T \hat\beta$ and define $D$, $\tilde D$ and $G$ as
$$D = \mathrm{diag}(\psi'(R_i)), \qquad \tilde D = \mathrm{diag}(\psi''(R_i)), \qquad G = I - X (X^T D X)^{-1} X^T D.$$

Lemma 2. Suppose $\psi \in C^2(\mathbb{R})$. Then
$$\frac{\partial \hat\beta_j}{\partial \epsilon^T} = e_j^T (X^T D X)^{-1} X^T D, \tag{2}$$
$$\frac{\partial^2 \hat\beta_j}{\partial \epsilon\, \partial \epsilon^T} = G^T \mathrm{diag}\left( e_j^T (X^T D X)^{-1} X^T \tilde D \right) G. \tag{3}$$
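Formula (2) can be checked numerically by finite differences. Below is such a sanity check, a sketch reusing `m_estimate`, `rho_huber` and `psi_prime_huber` from the earlier sketches (it requires the solver to be tightly converged, since the finite-difference step is small):

```python
# Numerical check of (2): perturb eps_i and compare d(beta_hat_j)/d(eps_i)
# with the i-th entry of e_j^T (X^T D X)^{-1} X^T D.
import numpy as np

def gradient_formula(X, eps, j=0):
    beta_hat = m_estimate(X, eps, rho_huber)
    D = np.diag(psi_prime_huber(eps - X @ beta_hat))  # D = diag(psi'(R_i))
    return np.linalg.solve(X.T @ D @ X, X.T @ D)[j]

def gradient_fd(X, eps, j=0, h=1e-6):
    base = m_estimate(X, eps, rho_huber)[j]
    grad = np.zeros(len(eps))
    for i in range(len(eps)):
        pert = eps.copy()
        pert[i] += h
        grad[i] = (m_estimate(X, pert, rho_huber)[j] - base) / h
    return grad

rng = np.random.default_rng(0)
X, eps = rng.standard_normal((40, 8)), rng.standard_normal(40)
print(np.max(np.abs(gradient_formula(X, eps) - gradient_fd(X, eps))))
```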

Second-Order Poincaré Inequality

$\hat\beta_j$ is a smooth transform of a random vector $\epsilon$ with independent entries. A powerful CLT for this type of statistic is the second-order Poincaré inequality (Chatterjee, 2009).

Definition 3. For $c_1, c_2 > 0$, let $\mathcal{L}(c_1, c_2)$ be the class of probability measures on $\mathbb{R}$ that arise as laws of random variables of the form $u(W)$, where $W \sim N(0, 1)$ and $u \in C^2(\mathbb{R})$ with $|u'(x)| \le c_1$ and $|u''(x)| \le c_2$. For example, $u = \mathrm{Id}$ gives $N(0, 1)$ and $u = \Phi$ gives $U([0, 1])$.

Second-Order Poincaré Inequality

Proposition 1 (SOPI; Chatterjee, 2009). Let $W = (W_1, \ldots, W_n)$ have independent entries with laws in $\mathcal{L}(c_1, c_2)$. Take any $g \in C^2(\mathbb{R}^n)$, let $U = g(W)$, and set
$$\kappa_0 = \left( E \sum_{i=1}^n |\partial_i g(W)|^4 \right)^{1/4}, \quad \kappa_1 = \left( E \|\nabla g(W)\|_2^4 \right)^{1/4}, \quad \kappa_2 = \left( E \|\nabla^2 g(W)\|_{op}^4 \right)^{1/4}.$$
If $U$ has a finite fourth moment, then
$$d_{TV}\left( \mathcal{L}\left( \frac{U - EU}{\sqrt{\mathrm{Var}(U)}} \right),\; N(0, 1) \right) \lesssim \frac{\kappa_0 + \kappa_1 \kappa_2}{\mathrm{Var}(U)}.$$

Assumptions

A1: $\rho(0) = \psi(0) = 0$ and for any $x \in \mathbb{R}$, $0 < K_0 \le \psi'(x) \le K_1$ and $|\psi''(x)| \le K_2$;
A2: $\epsilon$ has independent entries with $\epsilon_i \sim \mathcal{L}(c_1, c_2)$;
A3: the largest and smallest eigenvalues $\lambda_+$ and $\lambda_-$ of $X^T X / n$ satisfy $\lambda_+ = O(1)$ and $\lambda_- = \Omega(1)$.

Second-Order Poincaré Inequality on $\hat\beta_j$

Applying the second-order Poincaré inequality to $\hat\beta_j$, we obtain:

Lemma 4. Let $D = \mathrm{diag}(\psi'(\epsilon_i - x_i^T \hat\beta))_{i=1}^n$ and
$$M_j = E\, \big\| e_j^T (X^T D X)^{-1} X^T D^{1/2} \big\|_\infty^2.$$
Then under assumptions A1-A3,
$$\max_j d_{TV}\left( \mathcal{L}\left( \frac{\hat\beta_j - E\hat\beta_j}{\sqrt{\mathrm{Var}(\hat\beta_j)}} \right),\; N(0, 1) \right) = O\left( \frac{\max_j (n M_j)^{1/8} \cdot \frac{p}{n}}{n \min_j \mathrm{Var}(\hat\beta_j)} \right).$$
The main result is obtained if we prove
$$M_j = o\left( \frac{1}{n} \right), \qquad \mathrm{Var}(\hat\beta_j) = \Omega\left( \frac{1}{n} \right).$$

Further Assumptions

Define the following quantities:
- the leave-one-predictor-out estimate $\hat\beta_{[j]}$: the M-estimator obtained by removing the $j$-th column of $X$ (El Karoui, 2013);
- the leave-one-predictor-out residuals $r_{i,[j]} = \epsilon_i - x_{i,[j]}^T \hat\beta_{[j]}$, where $x_{i,[j]}^T$ is the $i$-th row of $X$ after removing the $j$-th entry;
- $h_{j,0} = (\psi(r_{1,[j]}), \ldots, \psi(r_{n,[j]}))^T$;
- $Q_j = \mathrm{Cov}(h_{j,0})$, the covariance matrix of the $\psi(r_{i,[j]})$.

Further Assumptions

Besides assumptions A1-A3, we assume that
A4: $\min_j \dfrac{X_j^T Q_j X_j}{\mathrm{tr}(Q_j)} = \Omega(1)$.

Note that $Q_j$ does not involve $X_j$. Assumption A4 guarantees $\mathrm{Var}(\hat\beta_j) = \Omega(1/n)$.

Further Assumptions

If $X_j$ is a realization of a random vector $Z_j$ with i.i.d. entries, then
$$E Z_j^T Q_j Z_j = \mathrm{tr}(E Z_j Z_j^T Q_j) = E Z_{1,j}^2 \, \mathrm{tr}(Q_j).$$
If $Z_j^T Q_j Z_j$ concentrates around its mean, then
$$\frac{Z_j^T Q_j Z_j}{\mathrm{tr}(Q_j)} \approx E Z_{1,j}^2 > 0.$$
For example, when $Z_j$ has i.i.d. sub-Gaussian entries, the Hanson-Wright inequality implies the concentration:
$$P\left( |Z_j^T Q_j Z_j - E Z_j^T Q_j Z_j| \ge t \right) \le 2 \exp\left\{ -c \min\left\{ \frac{t^2}{\|Q_j\|_F^2}, \frac{t}{\|Q_j\|_{op}} \right\} \right\}.$$

Further Assumptions

To describe the last assumption, we define the following quantities:
- $D_{[j]} = \mathrm{diag}(\psi'(r_{i,[j]}))$: the leave-one-predictor-out version of $D$;
- $G_{[j]} = I - X_{[j]} (X_{[j]}^T D_{[j]} X_{[j]})^{-1} X_{[j]}^T D_{[j]}$;
- $h_{j,1,i}^T = e_i^T G_{[j]}$: the $i$-th row of $G_{[j]}$;
- $$C = \max\left\{ \max_j \frac{|h_{j,0}^T X_j|}{\|h_{j,0}\|_2},\; \max_{i,j} \frac{|h_{j,1,i}^T X_j|}{\|h_{j,1,i}\|_2} \right\}.$$

Further Assumptions

The last assumption:
A5: $(E C^8)^{1/8} = O(\mathrm{polyLog}(n))$.

It turns out that when $\rho(x) = x^2/2$,
$$C \asymp \max_j \frac{\|e_j^T (X^T X)^{-1} X^T\|_\infty}{\|e_j^T (X^T X)^{-1} X^T\|_2}.$$
Recall that for least squares, the $\hat\beta_j$ are all asymptotically normal iff the right-hand side tends to 0. This indicates that assumption A5 is not just an artifact of the proof.

Further Assumptions

Let $\alpha_{j,0} = h_{j,0} / \|h_{j,0}\|_2$ and $\alpha_{j,1,i} = h_{j,1,i} / \|h_{j,1,i}\|_2$. Again, if $X_j$ is a realization of a random vector $Z_j$ with i.i.d. $\sigma^2$-sub-Gaussian entries, then the $\alpha_{j,0}^T Z_j$ and $\alpha_{j,1,i}^T Z_j$ are all $\sigma^2$-sub-Gaussian. Then $C$ is the maximum of $np + p$ sub-Gaussian random variables, and hence $(E C^8)^{1/8} = O(\mathrm{polyLog}(n))$.

Review of All Assumptions

A1: $\rho(0) = \psi(0) = 0$ and for any $x \in \mathbb{R}$, $0 < K_0 \le \psi'(x) \le K_1$ and $|\psi''(x)| \le K_2$;
A2: $\epsilon$ has independent entries with $\epsilon_i \sim \mathcal{L}(c_1, c_2)$;
A3: the largest and smallest eigenvalues $\lambda_+$ and $\lambda_-$ of $X^T X / n$ satisfy $\lambda_+ = O(1)$ and $\lambda_- = \Omega(1)$;
A4: $\min_j \dfrac{X_j^T Q_j X_j}{\mathrm{tr}(Q_j)} = \Omega(1)$;
A5: $(E C^8)^{1/8} = O(\mathrm{polyLog}(n))$.

Main Results

Theorem 5. Under assumptions A1-A5, as $p/n \to \kappa$ for some $\kappa \in (0, 1)$ while $n \to \infty$,
$$\max_j d_{TV}\left( \mathcal{L}\left( \frac{\hat\beta_j - E\hat\beta_j}{\sqrt{\mathrm{Var}(\hat\beta_j)}} \right),\; N(0, 1) \right) = o(1).$$

A Corollary

If we further assume that
A6: $\rho$ is an even function and $\epsilon_i \stackrel{d}{=} -\epsilon_i$,
then one can show that $\hat\beta$ is unbiased. As a consequence:

Theorem 6. Under assumptions A1-A6, as $p/n \to \kappa$ for some $\kappa \in (0, 1)$ while $n \to \infty$,
$$\max_j d_{TV}\left( \mathcal{L}\left( \frac{\hat\beta_j - \beta_j^*}{\sqrt{\mathrm{Var}(\hat\beta_j)}} \right),\; N(0, 1) \right) = o(1).$$

Table of Contents

1 Background
2 Main Results and Examples
3 Assumptions and Proof Sketch
4 Numerical Results

Setup

Design matrix X:
- (i.i.d. design) $X_{ij}$ i.i.d. $\sim F$;
- (partial Hadamard design) a matrix formed by a random set of $p$ columns of an $n \times n$ Hadamard matrix.

Entry distribution F: $F = N(0, 1)$ or $F = t_2$.
Error distribution $\mathcal{L}(\epsilon)$: the $\epsilon_i$ are i.i.d. with $\epsilon_i \sim N(0, 1)$ or $\epsilon_i \sim t_2$.

Setup

Sample size n: $\{100, 200, 400, 800\}$; $\kappa = p/n$: $\{0.5, 0.8\}$.
Loss function $\rho$: the Huber loss with $k = 1.345$,
$$\rho(x) = \begin{cases} \frac{1}{2} x^2 & |x| \le k \\ k|x| - \frac{k^2}{2} & |x| > k. \end{cases}$$

Asymptotic Normality of a Single Coordinate

For each set of parameters, we run 50 simulations, each consisting of the following steps (see the sketch below):
Step 1: generate one design matrix $X$;
Step 2: generate 300 error vectors $\epsilon$;
Step 3: regress each $Y = \epsilon$ on the design matrix $X$, which yields 300 random samples of $\hat\beta_1$, denoted by $\hat\beta_1^{(1)}, \ldots, \hat\beta_1^{(300)}$;
Step 4: estimate the standard deviation of $\hat\beta_1$ by the sample standard error $\widehat{\mathrm{sd}}$;
Step 5: construct a confidence interval $I^{(k)} = [\hat\beta_1^{(k)} - 1.96\,\widehat{\mathrm{sd}},\; \hat\beta_1^{(k)} + 1.96\,\widehat{\mathrm{sd}}]$ for each $k = 1, \ldots, 300$;
Step 6: calculate the empirical 95% coverage as the proportion of confidence intervals that cover the true $\beta_1^* = 0$.
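A sketch of one such simulation (Steps 1-6) for the i.i.d. Gaussian design, again reusing `m_estimate` and `rho_huber` from the earlier sketches:

```python
# Steps 1-6: empirical 95% coverage of the +/- 1.96 * sd-hat interval.
import numpy as np

rng = np.random.default_rng(1)
n, kappa, reps = 100, 0.5, 300
X = rng.standard_normal((n, int(n * kappa)))               # Step 1
b1 = np.array([m_estimate(X, rng.standard_normal(n), rho_huber)[0]
               for _ in range(reps)])                      # Steps 2-3: Y = eps
sd_hat = b1.std(ddof=1)                                    # Step 4
coverage = np.mean(np.abs(b1) <= 1.96 * sd_hat)            # Steps 5-6: beta*_1 = 0
print("empirical 95% coverage:", coverage)
```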

Asymptotic Normality of a Single Coordinate

[Figure: empirical coverage of $\beta_1^*$ for $\kappa = 0.5$ and $\kappa = 0.8$ as a function of sample size, for the i.i.d. and Hadamard designs, with normal and $t_2$ entry and error distributions.]

Conclusion

- We establish the coordinate-wise asymptotic normality of M-estimators for certain fixed design matrices under the moderate p/n regime, under regularity conditions on $X$, $\mathcal{L}(\epsilon)$ and $\rho$ but no condition on $\beta^*$;
- We prove the result via a novel application of the second-order Poincaré inequality (Chatterjee, 2009);
- We show that the regularity conditions are satisfied by a broad class of designs.

Future Works

Future work for this project:
- estimate $\mathrm{Var}(\hat\beta_j)$;
- relax the assumptions on $\mathcal{L}(\epsilon)$;
- relax the strong convexity of $\rho$;
- extend the results to GLMs.

Future work for my dissertation:
- distributional properties in high dimensions;
- resampling methods in high dimensions.

Thank You!

References

Bean, D., Bickel, P. J., El Karoui, N., & Yu, B. (2013). Optimal M-estimation in high-dimensional regression. Proceedings of the National Academy of Sciences, 110(36).
Bickel, P. J., & Freedman, D. A. (1982). Bootstrapping regression models with many parameters. Festschrift for Erich L. Lehmann.
Chatterjee, S. (2009). Fluctuations of eigenvalues and second order Poincaré inequalities. Probability Theory and Related Fields, 143(1-2).
El Karoui, N. (2013). Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators: rigorous results. arXiv preprint.
El Karoui, N. (2015). On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators.
El Karoui, N., Bean, D., Bickel, P. J., Lim, C., & Yu, B. (2011). On robust regression with high-dimensional predictors. Proceedings of the National Academy of Sciences, 110(36).
Huber, P. J. (1973). Robust regression: asymptotics, conjectures and Monte Carlo. The Annals of Statistics.
Portnoy, S. (1984). Asymptotic behavior of M-estimators of p regression parameters when p²/n is large. I. Consistency. The Annals of Statistics.
Portnoy, S. (1985). Asymptotic behavior of M-estimators of p regression parameters when p²/n is large; II. Normal approximation. The Annals of Statistics.
