Robust estimation, efficiency, and Lasso debiasing

1 Robust estimation, efficiency, and Lasso debiasing
Po-Ling Loh, University of Wisconsin-Madison, Departments of ECE & Statistics
WHOA-PSI workshop, Washington University in St. Louis, Aug 12, 2017

2-3 High-dimensional linear regression
Linear model: $y_i = x_i^T \beta^* + \epsilon_i$, $i = 1, \dots, n$, with $y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$, and $\beta^* \in \mathbb{R}^p$.
When $p \gg n$, assume sparsity: $\|\beta^*\|_0 \le k$.

4-5 Robust M-estimators
Generalization of OLS suitable for heavy-tailed/contaminated errors:
$$\widehat{\beta} \in \arg\min_{\beta} \Big\{ \frac{1}{n} \sum_{i=1}^n \ell(x_i^T \beta - y_i) \Big\}$$
[Figures: loss functions (least squares, absolute value, Huber, Tukey) plotted against the residual; a dataset of millions of calls per year fit by least squares, Huber, and Tukey.]
Extensive theory (consistency, asymptotic normality) for $p$ fixed, $n \to \infty$.
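
To make the losses above concrete, here is a minimal NumPy sketch of the Huber and Tukey (bisquare) losses and their derivatives (the $\psi$-functions used later in the talk); the tuning constants `delta` and `c` are illustrative defaults, not values taken from the slides.

```python
import numpy as np

def huber_loss(r, delta=1.345):
    """Huber loss: quadratic for |r| <= delta, linear beyond (bounded derivative)."""
    r = np.asarray(r, dtype=float)
    quad = 0.5 * r**2
    lin = delta * (np.abs(r) - 0.5 * delta)
    return np.where(np.abs(r) <= delta, quad, lin)

def huber_psi(r, delta=1.345):
    """Derivative of the Huber loss; bounded by delta."""
    return np.clip(np.asarray(r, dtype=float), -delta, delta)

def tukey_loss(r, c=4.685):
    """Tukey bisquare loss: redescending, constant for |r| > c (nonconvex)."""
    r = np.asarray(r, dtype=float)
    u = np.clip(r / c, -1.0, 1.0)
    return (c**2 / 6.0) * (1.0 - (1.0 - u**2)**3)

def tukey_psi(r, c=4.685):
    """Derivative of the Tukey loss; redescends to 0 for |r| > c."""
    r = np.asarray(r, dtype=float)
    out = r * (1.0 - (r / c)**2)**2
    return np.where(np.abs(r) <= c, out, 0.0)
```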

6-7 High-dimensional M-estimators
Natural idea: For $p > n$, use a regularized version:
$$\widehat{\beta} \in \arg\min_{\beta} \Big\{ \frac{1}{n} \sum_{i=1}^n \ell(x_i^T \beta - y_i) + \lambda \|\beta\|_1 \Big\}$$
Complications: Optimization for nonconvex $\ell$? Statistical theory? Are certain losses provably better than others?

8-10 Some statistical theory
When $\|\ell'\|_\infty \le C$, global optima of the high-dimensional M-estimator satisfy
$$\|\widehat{\beta} - \beta^*\|_2 \le C \sqrt{\frac{k \log p}{n}},$$
regardless of the distribution of $\epsilon_i$.
Compare to Lasso theory: requires sub-Gaussian $\epsilon_i$'s.
If $\ell(u)$ is locally convex/smooth for $|u| \le r$, any local optima within radius $cr$ of $\beta^*$ satisfy $\|\widetilde{\beta} - \beta^*\|_2 \le C \sqrt{\frac{k \log p}{n}}$.

11-12 Some optimization theory
[Figure: local region of radius $r$ around $\beta^*$; stationary points lie within statistical error $\asymp \sqrt{k \log p / n}$ of $\beta^*$.]
Local optima may be obtained via a two-step algorithm:
1. Run composite gradient descent on a convex, robust loss + $\ell_1$-penalty until convergence; output $\widehat{\beta}_H$.
2. Run composite gradient descent on the nonconvex, robust loss + $\mu$-amenable penalty, with input $\beta^0 = \widehat{\beta}_H$.
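
Step 1 of the two-step algorithm can be sketched as composite (proximal) gradient descent on the $\ell_1$-penalized robust loss. The routine below is a minimal illustration: `psi` is the derivative of the loss (e.g., `huber_psi` from the earlier sketch), the step size is a conservative default, and the $\ell_1$-ball side constraint $\|\beta\|_1 \le R$ is omitted for brevity.

```python
import numpy as np

def l1_prox(lam):
    """Proximal map of step * lam * ||.||_1, i.e. coordinatewise soft-thresholding."""
    def prox(v, step):
        return np.sign(v) * np.maximum(np.abs(v) - step * lam, 0.0)
    return prox

def composite_gradient_descent(X, y, psi, prox, step=None, beta0=None, n_iter=500):
    """Composite (proximal) gradient descent on
        (1/n) * sum_i loss(x_i^T beta - y_i) + penalty(beta),
    where psi is the derivative of the loss and prox(v, step) is the proximal
    map of step * penalty.  Returns the final iterate."""
    n, p = X.shape
    beta = np.zeros(p) if beta0 is None else np.asarray(beta0, dtype=float).copy()
    if step is None:
        # conservative step size: inverse of the largest eigenvalue of X^T X / n
        step = n / np.linalg.norm(X, 2) ** 2
    for _ in range(n_iter):
        grad = X.T @ psi(X @ beta - y) / n   # gradient of the smooth robust loss
        beta = prox(beta - step * grad, step)
    return beta

# Example (step 1 of the two-step algorithm), with huber_psi from the earlier sketch:
# beta_H = composite_gradient_descent(X, y, huber_psi, l1_prox(lam))
```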

13-15 Motivating calculation
Lasso analysis (e.g., van de Geer 07, Bickel et al. 08):
$$\widehat{\beta} \in \arg\min_{\beta} \Big\{ \underbrace{\frac{1}{n} \|y - X\beta\|_2^2}_{\mathcal{L}_n(\beta)} + \lambda \|\beta\|_1 \Big\}$$
Rearranging the basic inequality $\mathcal{L}_n(\widehat{\beta}) + \lambda\|\widehat{\beta}\|_1 \le \mathcal{L}_n(\beta^*) + \lambda\|\beta^*\|_1$ and assuming $\lambda \ge 2 \big\|\tfrac{X^T \epsilon}{n}\big\|_\infty$, obtain
$$\|\widehat{\beta} - \beta^*\|_2 \le c \lambda \sqrt{k}.$$
Sub-Gaussian assumptions on the $x_i$'s and $\epsilon_i$'s provide $O\big(\sqrt{\tfrac{k \log p}{n}}\big)$ bounds, minimax optimal.

16-18 Motivating calculation
Key observation: For a general loss function, if $\lambda \ge 2 \big\|\tfrac{X^T \ell'(\epsilon)}{n}\big\|_\infty$, obtain
$$\|\widehat{\beta} - \beta^*\|_2 \le c \lambda \sqrt{k}.$$
$\ell'(\epsilon)$ is sub-Gaussian whenever $\ell'$ is bounded
$\Longrightarrow$ can achieve estimation error $\|\widehat{\beta} - \beta^*\|_2 \le c \sqrt{\tfrac{k \log p}{n}}$ without assuming $\epsilon_i$ is sub-Gaussian.
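
A quick simulation (illustrative sizes and seed, not from the talk) shows the point: with heavy-tailed Cauchy errors, $\|X^T\epsilon/n\|_\infty$ can be enormous, while $\|X^T\ell'(\epsilon)/n\|_\infty$ stays on the order of $\sqrt{\log p / n}$ because $\ell'(\epsilon)$ is bounded.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 1000
X = rng.standard_normal((n, p))
eps = rng.standard_cauchy(n)                 # heavy-tailed errors

delta = 1.345
psi_eps = np.clip(eps, -delta, delta)        # bounded Huber derivative applied to errors

raw = np.max(np.abs(X.T @ eps)) / n          # drives the choice of lambda for the Lasso
robust = np.max(np.abs(X.T @ psi_eps)) / n   # drives lambda for a bounded-derivative loss

print(f"||X^T eps||_inf / n      = {raw:.3f}")
print(f"||X^T psi(eps)||_inf / n = {robust:.3f}")
print(f"sqrt(log p / n)          = {np.sqrt(np.log(p) / n):.3f}")
```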

19-22 Technical challenges
Lasso analysis also requires verifying a restricted eigenvalue (RE) condition on the design matrix, which is more complicated for general $\ell$.
Addressed by local curvature of robust losses around the origin. [Figure: loss functions (least squares, absolute value, Huber, Tukey) plotted against the residual.]
When $\ell$ is nonconvex, local optima $\widetilde{\beta}$ may exist that are not global optima.
Addressed by theoretical analysis of $\|\widetilde{\beta} - \beta^*\|_2$ and derivation of suitable optimization algorithms.

23-24 Related work: Nonconvex regularized M-estimators
Composite objective function:
$$\widehat{\beta} \in \arg\min_{\|\beta\|_1 \le R} \Big\{ \mathcal{L}_n(\beta) + \sum_{j=1}^p \rho_\lambda(\beta_j) \Big\}$$
Assumptions:
$\mathcal{L}_n$ satisfies restricted strong convexity with curvature $\alpha$ (Negahban et al. 12);
$\rho_\lambda$ has bounded subgradient at 0, and $\rho_\lambda(t) + \mu t^2$ is convex;
$\alpha > \mu$.
A concrete example of such a penalty appears below.
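
As a concrete instance of these assumptions, the sketch below defines the MCP penalty and numerically checks $\mu$-amenability in the sense stated above, i.e., that $\rho_\lambda(t) + \mu t^2$ is convex (for MCP this holds with $\mu = 1/(2\gamma)$); the parameter values are illustrative.

```python
import numpy as np

def mcp(t, lam, gamma=3.0):
    """MCP penalty: lam*|t| - t^2/(2*gamma) for |t| <= gamma*lam, else gamma*lam^2/2."""
    a = np.abs(np.asarray(t, dtype=float))
    return np.where(a <= gamma * lam, lam * a - a**2 / (2 * gamma), 0.5 * gamma * lam**2)

# Numerical check: rho_lam(t) + mu * t^2 should have nonnegative second differences
# when mu = 1/(2*gamma), matching the convexity condition stated on the slide.
lam, gamma = 0.5, 3.0
mu = 1.0 / (2.0 * gamma)
t = np.linspace(-5, 5, 2001)
g = mcp(t, lam, gamma) + mu * t**2
second_diff = np.diff(g, 2)
print("convexified MCP has nonnegative curvature:", np.all(second_diff >= -1e-10))
```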

25-26 Stationary points (L. & Wainwright 15)
[Figure: stationary points $\widetilde{\beta}$ lie within radius $\asymp \sqrt{k \log p / n}$ of $\beta^*$.]
Stationary points are statistically indistinguishable from global optima:
$$\langle \nabla \mathcal{L}_n(\widetilde{\beta}) + \nabla \rho_\lambda(\widetilde{\beta}), \, \beta - \widetilde{\beta} \rangle \ge 0, \quad \text{for all feasible } \beta.$$
Under suitable distributional assumptions, for $\lambda \asymp \sqrt{\tfrac{\log p}{n}}$ and $R \asymp \tfrac{1}{\lambda}$,
$$\|\widetilde{\beta} - \beta^*\|_2 \le c \sqrt{\frac{k \log p}{n}} \asymp \text{statistical error}.$$

27-28 Mathematical statement
Theorem (L. & Wainwright 15). Suppose $R$ is chosen s.t. $\beta^*$ is feasible, and $\lambda$ satisfies
$$\max\Big\{ \|\nabla \mathcal{L}_n(\beta^*)\|_\infty, \, \alpha \sqrt{\tfrac{\log p}{n}} \Big\} \lesssim \lambda \lesssim \frac{\alpha}{R}.$$
For $n \ge \frac{C \tau^2 R^2 \log p}{\alpha^2}$, any stationary point $\widetilde{\beta}$ satisfies
$$\|\widetilde{\beta} - \beta^*\|_2 \lesssim \frac{\lambda \sqrt{k}}{\alpha - \mu}, \quad \text{where } k = \|\beta^*\|_0.$$
New ingredient for the robust setting: $\ell$ is convex only in a local region $\Longrightarrow$ need for local consistency results.

29-30 Local RSC condition
Local RSC condition: For $\Delta := \beta_1 - \beta_2$,
$$\langle \nabla \mathcal{L}_n(\beta_1) - \nabla \mathcal{L}_n(\beta_2), \, \Delta \rangle \ge \alpha \|\Delta\|_2^2 - \tau \frac{\log p}{n} \|\Delta\|_1^2, \quad \text{whenever } \|\beta_j - \beta^*\|_2 \le r.$$
The loss function has directions of both positive and negative curvature; negative directions are forbidden by the regularizer.
Only requires restricted curvature within a constant-radius region around $\beta^*$.

31 Consistency of local stationary points
[Figure: stationary points within radius $r$ of $\beta^*$ lie within statistical error $\asymp \sqrt{k \log p / n}$.]
Theorem (L. 17). Suppose $\mathcal{L}_n$ satisfies $\alpha$-local RSC and $\rho_\lambda$ is $\mu$-amenable, with $\alpha > \mu$. Suppose $\|\ell'\|_\infty \le C$ and $\lambda \asymp \sqrt{\tfrac{\log p}{n}}$. For $n \gtrsim k \log p$, any stationary point $\widetilde{\beta}$ s.t. $\|\widetilde{\beta} - \beta^*\|_2 \le r$ satisfies
$$\|\widetilde{\beta} - \beta^*\|_2 \lesssim \frac{\lambda \sqrt{k}}{\alpha - \mu}.$$

32-33 Optimization theory
Question: How to obtain sufficiently close local solutions?
Goal: For the regularized M-estimator
$$\widehat{\beta} \in \arg\min_{\|\beta\|_1 \le R} \Big\{ \frac{1}{n} \sum_{i=1}^n \ell(x_i^T \beta - y_i) + \rho_\lambda(\beta) \Big\},$$
where $\ell$ satisfies $\alpha$-local RSC, find a stationary point such that $\|\widetilde{\beta} - \beta^*\|_2 \le r$.

34 Wisdom from Huber
"Descending $\psi$-functions are tricky, especially when the starting values for the iterations are non-robust. ... It is therefore preferable to start with a monotone $\psi$, iterate to death, and then append a few (1 or 2) iterations with the nonmonotone $\psi$." (Huber, 1981)

35-36 Two-step algorithm (L. 17)
Two-step M-estimator: Finds local stationary points of a nonconvex, robust loss + $\mu$-amenable penalty:
$$\widehat{\beta} \in \arg\min_{\|\beta\|_1 \le R} \Big\{ \frac{1}{n} \sum_{i=1}^n \ell(x_i^T \beta - y_i) + \rho_\lambda(\beta) \Big\}$$
Algorithm:
1. Run composite gradient descent on a convex, robust loss + $\ell_1$-penalty until convergence; output $\widehat{\beta}_H$.
2. Run composite gradient descent on the nonconvex, robust loss + $\mu$-amenable penalty, with input $\beta^0 = \widehat{\beta}_H$.
Note: We want to optimize the original nonconvex objective, since it leads to more efficient (lower-variance) estimators. A code sketch of the two-step procedure follows.
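
A minimal sketch of the two-step procedure, assuming the `composite_gradient_descent`, `l1_prox`, `huber_psi`, and `tukey_psi` helpers sketched earlier are in scope; `mcp_prox` is a proximal map for the MCP penalty (valid when the step size is below $\gamma$), and the tuning constants are illustrative.

```python
import numpy as np

def mcp_prox(lam, gamma=3.0):
    """Proximal map of step * MCP_{lam, gamma} (a mu-amenable penalty); needs step < gamma."""
    def prox(v, step):
        t = step * lam
        shrunk = np.sign(v) * np.maximum(np.abs(v) - t, 0.0) / (1.0 - step / gamma)
        return np.where(np.abs(v) <= gamma * lam, shrunk, v)
    return prox

def two_step_estimator(X, y, lam, gamma=3.0, n_iter=500):
    """Step 1: composite gradient descent on the convex Huber loss + l1 penalty.
    Step 2: composite gradient descent on the nonconvex Tukey loss + MCP penalty,
    warm-started at the step-1 output beta_H (cf. Huber's advice above)."""
    beta_H = composite_gradient_descent(X, y, huber_psi, l1_prox(lam), n_iter=n_iter)
    return composite_gradient_descent(X, y, tukey_psi, mcp_prox(lam, gamma),
                                      beta0=beta_H, n_iter=n_iter)
```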

37-42 Scale calibration
Closer look: The loss function $\ell$ is in some sense calibrated to the scale of $\epsilon_i$:
If the Huber parameter is too large, the estimation error bound based on $\|\ell'\|_\infty$ becomes suboptimal.
If the Huber parameter is too small, RSC is no longer satisfied w.h.p.
For the Lasso, the optimal $\lambda$ is known to depend on $\sigma_\epsilon$, but the loss function does not require calibration.
Better objective (low-dimensional version proposed by Huber):
$$(\widehat{\beta}, \widehat{\sigma}) \in \arg\min_{\beta, \sigma} \Big\{ \underbrace{\frac{1}{n} \sum_{i=1}^n \ell\Big(\frac{y_i - x_i^T \beta}{\sigma}\Big) \sigma + a\sigma}_{\mathcal{L}_n(\beta, \sigma)} + \lambda \|\beta\|_1 \Big\}$$
However, joint location/scale estimation is notoriously difficult even in low dimensions.
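
For concreteness, here is a sketch of the joint location/scale objective displayed above, reusing `huber_loss` from the earlier sketch; `best_sigma` grid-minimizes over $\sigma$ for a fixed $\beta$, the basic building block of an alternating scheme. The constant `a` and the grid are illustrative choices.

```python
import numpy as np

def joint_objective(beta, sigma, X, y, lam, a=0.5, delta=1.345):
    """Huber-style joint location/scale objective from the display above:
       (1/n) sum_i sigma * l((y_i - x_i^T beta)/sigma) + a*sigma + lam*||beta||_1,
    with l the Huber loss (huber_loss from the earlier sketch)."""
    r = (y - X @ beta) / sigma
    return sigma * np.mean(huber_loss(r, delta)) + a * sigma + lam * np.sum(np.abs(beta))

def best_sigma(beta, X, y, lam, sigma_grid, a=0.5, delta=1.345):
    """One block of an alternating scheme: grid-minimize over sigma for fixed beta."""
    vals = [joint_objective(beta, s, X, y, lam, a, delta) for s in sigma_grid]
    return sigma_grid[int(np.argmin(vals))]
```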

43-44 Scale calibration
Another idea: MM-estimator
$$\widehat{\beta} \in \arg\min_{\beta} \Big\{ \frac{1}{n} \sum_{i=1}^n \ell\Big(\frac{y_i - x_i^T \beta}{\widehat{\sigma}_0}\Big) + \lambda \|\beta\|_1 \Big\},$$
using a robust estimate of scale $\widehat{\sigma}_0$ based on a preliminary estimate $\widehat{\beta}_0$. How to obtain $(\widehat{\beta}_0, \widehat{\sigma}_0)$?
S-estimators/LMS: $\widehat{\beta}_0 \in \arg\min_{\beta} \{\widehat{\sigma}(r(\beta))\}$, where $\widehat{\sigma}(r) = |r|_{(n - \lfloor n\delta \rfloor)}$.
LTS: $\widehat{\beta}_0 \in \arg\min_{\beta} \Big\{ \frac{1}{n - \lfloor n\alpha \rfloor} \sum_{i=1}^{n - \lfloor n\alpha \rfloor} (y_i - x_i^T \beta)^2_{(i)} + \lambda \|\beta\|_1 \Big\}$ (see the sketch below).
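
The penalized LTS objective above can be evaluated as follows (a small sketch; the trimming fraction `alpha` is an illustrative input).

```python
import numpy as np

def penalized_lts_objective(beta, X, y, lam, alpha=0.25):
    """Penalized least-trimmed-squares objective: average of the n - floor(n*alpha)
    smallest squared residuals, plus lam * ||beta||_1."""
    n = len(y)
    h = n - int(np.floor(n * alpha))      # number of residuals kept
    sq = np.sort((y - X @ beta) ** 2)     # order statistics (y_i - x_i^T beta)^2_(i)
    return sq[:h].mean() + lam * np.sum(np.abs(beta))
```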

45-46 Our approach
Lepski's method was originally proposed for adaptive bandwidth selection in nonparametric regression.
It can be used to select $\sigma$ in the location/scale problem:
$$\widehat{\beta}_\sigma \in \arg\min_{\beta} \Big\{ \frac{1}{n} \sum_{i=1}^n \ell_\sigma(y_i - x_i^T \beta) + \lambda_\sigma \|\beta\|_1 \Big\},$$
where $\ell_\sigma$ is the Huber loss parametrized by $\sigma$.

47-52 Lepski's method
Preceding theory implies
$$\|\widehat{\beta}_\sigma - \beta^*\|_2 \le C \sigma \sqrt{\frac{k \log p}{n}}$$
w.h.p., assuming $\sigma \ge \sqrt{\operatorname{Var}(\epsilon_i)} =: \sigma^*$.
Basic idea of Lepski's method:
Compute $\widehat{\beta}_\sigma$ on a gridding $\{\sigma_1, \dots, \sigma_M\}$ of the interval $[\sigma_{\min}, \sigma_{\max}]$.
For each $\sigma_j$, check whether $\|\widehat{\beta}_{\sigma_j} - \widehat{\beta}_{\sigma_l}\|_2 \le 2C\sigma_l \sqrt{\tfrac{k \log p}{n}}$ for all $l > j$, and let $\widehat{\sigma}$ be the argmin (smallest such $\sigma_j$) in this set.
[Figure: grid of $\sigma$ values in $[\sigma_{\min}, \sigma_{\max}]$, with selected value $\widehat{\sigma}$.]
A code sketch of the selection rule follows.
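
A minimal sketch of the Lepski selection rule just described: `fit(sigma)` is assumed to return $\widehat{\beta}_\sigma$ (for instance, the penalized Huber fit sketched earlier, with loss and penalty calibrated to $\sigma$), and `C` is the constant from the bound, taken here as an input.

```python
import numpy as np

def lepski_select(fit, sigma_grid, C, k, n, p):
    """Lepski's method: fit(sigma) returns beta_hat_sigma.  sigma_grid must be
    sorted increasingly.  Returns the selected sigma and its estimate."""
    rad = C * np.sqrt(k * np.log(p) / n)
    betas = [fit(s) for s in sigma_grid]
    for j, sigma_j in enumerate(sigma_grid):
        # accept sigma_j if it agrees with every larger sigma_l within 2*C*sigma_l*rad
        ok = all(np.linalg.norm(betas[j] - betas[l]) <= 2.0 * sigma_grid[l] * rad
                 for l in range(j + 1, len(sigma_grid)))
        if ok:
            return sigma_j, betas[j]
    return sigma_grid[-1], betas[-1]      # fall back to the largest sigma
```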

53 Statistical guarantee
Theorem (L. 17). With high probability, the output of Lepski's method satisfies
$$\|\widehat{\beta}_{\widehat{\sigma}} - \beta^*\|_2 \le C \sigma^* \sqrt{\frac{k \log p}{n}}.$$
The method does not require prior knowledge of the scale $\sigma^*$.

54-56 Efficiency
Although $\widehat{\beta}_{\widehat{\sigma}}$ is guaranteed to be $\ell_2$-consistent, the estimator may have relatively high variance.
One-step estimation was proposed for obtaining better efficiency (Bickel 75):
$$b_\psi = \widehat{\beta} + \frac{(X^T X / n)^{-1}}{\widehat{A}(\psi)} \cdot \frac{1}{n} \sum_{i=1}^n \psi(y_i - x_i^T \widehat{\beta})\, x_i,$$
where $\widehat{A}(\psi)$ is an estimate of $E[\psi'(\epsilon_i)]$.
Low-dimensional result:
$$\sqrt{n}\,(b_\psi - \beta^*) \xrightarrow{d} N\Big(0, \frac{E[\psi^2(\epsilon_i)]}{E[\psi'(\epsilon_i)]^2}\, \Theta\Big),$$
so the asymptotic variance for $\psi = -\tfrac{f'}{f}$ matches the variance of the MLE.
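
A sketch of the classical low-dimensional one-step construction above; `psi` and `dpsi` are the chosen influence function and its (a.e.) derivative, supplied by the user.

```python
import numpy as np

def one_step_low_dim(beta_init, X, y, psi, dpsi):
    """Classical one-step estimator: b_psi = beta_init + (X^T X / n)^{-1} / A_hat *
    (1/n) sum_i psi(y_i - x_i^T beta_init) x_i, with A_hat = (1/n) sum_i dpsi(resid_i)."""
    n = X.shape[0]
    resid = y - X @ beta_init
    A_hat = np.mean(dpsi(resid))          # estimate of E[psi'(eps_i)]
    score = X.T @ psi(resid) / n          # (1/n) sum_i psi(resid_i) x_i
    Sigma_hat = X.T @ X / n
    return beta_init + np.linalg.solve(Sigma_hat, score) / A_hat
```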

57-58 Efficiency
In high dimensions, define
$$b_\psi = \widehat{\beta}_{\widehat{\sigma}} + \frac{\widehat{\Theta}}{\widehat{A}(\psi)} \cdot \frac{1}{n} \sum_{i=1}^n \psi(y_i - x_i^T \widehat{\beta}_{\widehat{\sigma}})\, x_i,$$
where $\widehat{A}(\psi) = \frac{1}{n} \sum_{i=1}^n \psi'(y_i - x_i^T \widehat{\beta}_{\widehat{\sigma}})$ and $\widehat{\Theta}$ is a high-dimensional estimate of $\Theta$ (e.g., a graphical Lasso estimator).
This resembles the Lasso debiasing procedure (Zhang & Zhang 14, van de Geer et al. 14, Javanmard & Montanari 14).
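
The high-dimensional analogue simply swaps in an estimate $\widehat{\Theta}$ of the precision matrix (the slide mentions, e.g., the graphical Lasso); the sketch below takes `Theta_hat` as an input rather than committing to a particular estimator.

```python
import numpy as np

def one_step_high_dim(beta_init, X, y, psi, dpsi, Theta_hat):
    """Debiased one-step estimator: b_psi = beta_init + Theta_hat / A_hat *
    (1/n) sum_i psi(y_i - x_i^T beta_init) x_i, where Theta_hat estimates the
    precision matrix Theta."""
    n = X.shape[0]
    resid = y - X @ beta_init
    A_hat = np.mean(dpsi(resid))          # A_hat = (1/n) sum_i psi'(resid_i)
    score = X.T @ psi(resid) / n
    return beta_init + (Theta_hat @ score) / A_hat
```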

59-62 Efficiency
Theorem (L. 17). Let $J \subseteq \{1, \dots, p\}$ denote a subset of coordinates of constant dimension. Then
$$\sqrt{n}\,(b_\psi - \beta^*)_J \xrightarrow{d} N\Big(0, \frac{E[\psi^2(\epsilon_i)]}{E[\psi'(\epsilon_i)]^2}\, \Theta_{JJ}\Big).$$
Implies semiparametric efficiency of the one-step estimator when $\psi = -\tfrac{f'}{f}$.
Can derive asymptotic confidence intervals/regions for subsets of coefficients (see the sketch below).
Important: Allows statistical inference for high-dimensional regression in cases where the $x_i$'s and $\epsilon_i$'s are heavy-tailed.
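
Given the asymptotic normality above, approximate coordinatewise confidence intervals follow by plugging in sample analogues of $E[\psi^2(\epsilon_i)]$, $E[\psi'(\epsilon_i)]$, and $\Theta_{JJ}$; the sketch below assumes `b_psi`, the residuals from $\widehat{\beta}_{\widehat{\sigma}}$, and `Theta_hat` have already been computed as in the earlier sketch.

```python
import numpy as np
from scipy import stats

def one_step_confidence_intervals(b_psi, resid, psi, dpsi, Theta_hat, J, n, level=0.95):
    """Coordinatewise CIs for beta*_j, j in J, based on
    sqrt(n) (b_psi - beta*)_J -> N(0, E[psi^2]/E[psi']^2 * Theta_JJ)."""
    z = stats.norm.ppf(0.5 + level / 2.0)
    var_ratio = np.mean(psi(resid) ** 2) / np.mean(dpsi(resid)) ** 2   # plug-in variance ratio
    ses = np.sqrt(var_ratio * np.diag(Theta_hat)[list(J)] / n)
    centers = b_psi[list(J)]
    return np.column_stack([centers - z * ses, centers + z * ses])
```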

63-64 Summary
New theory for robust high-dimensional M-estimators implies $O\big(\sqrt{\tfrac{k \log p}{n}}\big)$ error rates when $\|\ell'\|_\infty \le C$, based on local RSC.
Lepski's method proposed to avoid joint scale parameter estimation.
Derived properties of a one-step estimator for semiparametric efficiency and high-dimensional inference.
Loh (2017). Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. Annals of Statistics.
Thank you!
