Robust high-dimensional linear regression: A statistical perspective

Robust high-dimensional linear regression: A statistical perspective. Po-Ling Loh, University of Wisconsin-Madison, Departments of ECE & Statistics. STOC workshop on robustness and nonconvexity, Montreal, Canada, June 23, 2017.

Introduction: Robust regression

Robust statistics was introduced in the 1960s (Huber, Tukey, Hampel, et al.). Goals: (1) develop estimators $T(\cdot)$ that are reliable under deviations from model assumptions; (2) quantify performance with respect to such deviations.

Local stability is captured by the influence function
$$\mathrm{IF}(x; T, F) = \lim_{t \to 0} \frac{T((1-t)F + t\delta_x) - T(F)}{t}.$$

Global stability is captured by the breakdown point
$$\epsilon^*(T; X_1, \dots, X_n) = \min\left\{ \frac{m}{n} : \sup_{X^m} \|T(X^m) - T(X)\| = \infty \right\}.$$
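
The breakdown-point contrast between the sample mean and the sample median is easy to see numerically. The following small Python sketch is purely illustrative (it is not part of the talk): it corrupts a growing fraction of a sample with gross outliers and tracks both estimators.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=100)    # clean sample from N(0, 1)

for frac in [0.0, 0.1, 0.4, 0.49]:
    x_corrupt = x.copy()
    m = int(frac * len(x))
    x_corrupt[:m] = 1e6                          # replace a fraction m/n by gross outliers
    print(f"contaminated fraction {frac:.2f}: "
          f"mean = {x_corrupt.mean():.2e}, median = {np.median(x_corrupt):.2f}")

# A single outlier drags the mean arbitrarily far (breakdown point 1/n), while the
# median stays bounded until roughly half the sample is corrupted (breakdown point ~1/2).
```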

High-dimensional linear models

Linear model: $y_i = x_i^T \beta^* + \epsilon_i$ for $i = 1, \dots, n$; in matrix form, $y = X\beta^* + \epsilon$ with $y, \epsilon \in \mathbb{R}^{n}$, $X \in \mathbb{R}^{n \times p}$, and $\beta^* \in \mathbb{R}^{p}$.

When $p \gg n$, assume sparsity: $\|\beta^*\|_0 \le k$.

Robust M-estimators

Generalization of OLS appropriate for robust statistics:
$$\hat{\beta} \in \arg\min_{\beta} \left\{ \frac{1}{n} \sum_{i=1}^{n} \ell(x_i^T \beta - y_i) \right\}$$

Extensive theory exists for $p$ fixed and $n \to \infty$.

[Figures: least squares, absolute value, Huber, and Tukey losses plotted against the residual; least squares, Huber, and Tukey fits to a millions-of-calls vs. year data set.]
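
As a concrete instance of the M-estimation objective above, the following sketch defines the Huber loss and minimizes the resulting objective with a generic optimizer, comparing against OLS under heavy-tailed errors. The tuning constant 1.345, the simulated data, and the use of scipy's BFGS routine are illustrative choices of mine, not prescriptions from the talk.

```python
import numpy as np
from scipy.optimize import minimize

def huber_loss(u, c=1.345):
    """Huber loss: quadratic for |u| <= c, linear beyond (so l' is bounded by c)."""
    return np.where(np.abs(u) <= c, 0.5 * u**2, c * np.abs(u) - 0.5 * c**2)

rng = np.random.default_rng(1)
n, p = 200, 5
X = rng.normal(size=(n, p))
beta_star = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
y = X @ beta_star + rng.standard_t(df=1, size=n)   # heavy-tailed (Cauchy) errors

# M-estimation objective (1/n) sum_i l(x_i^T beta - y_i), minimized numerically.
beta_hat = minimize(lambda b: np.mean(huber_loss(X @ b - y)),
                    x0=np.zeros(p), method="BFGS").x
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print("Huber M-estimate error:", np.linalg.norm(beta_hat - beta_star))
print("OLS error:             ", np.linalg.norm(beta_ols - beta_star))
```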

Classes of loss functions

A bounded $\ell'$ limits the influence of outliers:
$$\mathrm{IF}((x, y); T, F) = \lim_{t \to 0^+} \frac{T((1-t)F + t\delta_{(x,y)}) - T(F)}{t} \;\propto\; \ell'(x^T \beta - y)\, x,$$
where $F = F_\beta$ and $T$ is the minimizer defining the M-estimator.

Redescending M-estimators have a finite rejection point: $\ell'(u) = 0$ for $|u| \ge c$.

[Figure: least squares, absolute value, Huber, and Tukey losses as functions of the residual.]

But bad for optimization!

High-dimensional M-estimators

Natural idea: for $p > n$, use a regularized version:
$$\hat{\beta} \in \arg\min_{\beta} \left\{ \frac{1}{n} \sum_{i=1}^{n} \ell(x_i^T \beta - y_i) + \lambda \|\beta\|_1 \right\}$$

Complications: How do we optimize when $\ell$ is nonconvex? What statistical theory applies? Are certain losses provably better than others?

Overview of results

When $\|\ell'\|_\infty \le C$, global optima of the high-dimensional M-estimator satisfy
$$\|\hat{\beta} - \beta^*\|_2 \lesssim \sqrt{\frac{k \log p}{n}},$$
regardless of the distribution of the $\epsilon_i$'s. Compare to Lasso theory, which requires sub-Gaussian $\epsilon_i$'s.

If $\ell(u)$ is locally convex/smooth for $|u| \le r$, any local optima within radius $cr$ of $\beta^*$ satisfy
$$\|\tilde{\beta} - \beta^*\|_2 \lesssim \sqrt{\frac{k \log p}{n}}.$$
(*In order to verify the RE condition w.h.p., we also need $\mathrm{Var}(\epsilon_i) < c r^2$.)

Local optima may be obtained via a two-step algorithm.

Theoretical insight

Lasso analysis (e.g., van de Geer '07, Bickel et al. '08):
$$\hat{\beta} \in \arg\min_{\beta} \underbrace{\left\{ \frac{1}{n} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 \right\}}_{L_n(\beta)}$$

Rearranging the basic inequality $L_n(\hat{\beta}) \le L_n(\beta^*)$ and assuming $\lambda \ge \frac{2\|X^T \epsilon\|_\infty}{n}$, we obtain
$$\|\hat{\beta} - \beta^*\|_2 \le c\lambda\sqrt{k}.$$

Sub-Gaussian assumptions on the $x_i$'s and $\epsilon_i$'s then provide $O\!\left(\sqrt{\frac{k \log p}{n}}\right)$ bounds, which are minimax optimal.
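
The scaling in this bound can be checked directly on simulated data. The sketch below (illustrative only, not from the talk) uses the oracle value of $\epsilon$ to set $\lambda = 2\|X^T\epsilon\|_\infty/n$ and fits the Lasso with scikit-learn; since sklearn's Lasso minimizes $(1/(2n))\|y - X\beta\|_2^2 + \alpha\|\beta\|_1$, the slide's $\lambda$ corresponds to $\alpha = \lambda/2$.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p, k = 200, 500, 5
X = rng.normal(size=(n, p))
beta_star = np.zeros(p)
beta_star[:k] = 1.0
eps = rng.normal(scale=0.5, size=n)
y = X @ beta_star + eps

# Oracle choice from the slide: lambda >= 2 ||X^T eps||_inf / n.
lam = 2 * np.max(np.abs(X.T @ eps)) / n

# sklearn's Lasso objective is (1/(2n))||y - Xb||^2 + alpha ||b||_1, so alpha = lam / 2
# reproduces the (1/n)-scaled objective on the slide.
fit = Lasso(alpha=lam / 2, fit_intercept=False, max_iter=50_000).fit(X, y)

err = np.linalg.norm(fit.coef_ - beta_star)
print(f"lambda = {lam:.3f}")
print(f"||beta_hat - beta*||_2 = {err:.3f}   (compare to lambda * sqrt(k) = {lam * np.sqrt(k):.3f})")
```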

Theoretical insight

Key observation: for a general loss function, if $\lambda \ge \frac{2\|X^T \ell'(\epsilon)\|_\infty}{n}$, we obtain
$$\|\hat{\beta} - \beta^*\|_2 \le c\lambda\sqrt{k}.$$

Since $\ell'(\epsilon)$ is sub-Gaussian whenever $\ell'$ is bounded, we can achieve estimation error
$$\|\hat{\beta} - \beta^*\|_2 \le c\sqrt{\frac{k \log p}{n}},$$
without assuming $\epsilon_i$ is sub-Gaussian.
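
A quick numerical illustration of why bounding $\ell'$ helps (this demo is mine, not from the talk): with Cauchy errors the raw quantity $\|X^T\epsilon\|_\infty/n$ fluctuates wildly, while the Huber-score version $\|X^T\ell'(\epsilon)\|_\infty/n$ stays on the order of $\sqrt{\log p / n}$, so the required $\lambda$ remains small.

```python
import numpy as np

def huber_psi(u, c=1.345):
    """Derivative of the Huber loss: bounded by c in absolute value."""
    return np.clip(u, -c, c)

rng = np.random.default_rng(3)
n, p, reps = 200, 500, 50
raw, robust = [], []
for _ in range(reps):
    X = rng.normal(size=(n, p))
    eps = rng.standard_cauchy(size=n)                       # errors with no finite variance
    raw.append(np.max(np.abs(X.T @ eps)) / n)               # ||X^T eps||_inf / n
    robust.append(np.max(np.abs(X.T @ huber_psi(eps))) / n) # ||X^T l'(eps)||_inf / n

print("median ||X^T eps||_inf / n:     ", np.median(raw))
print("median ||X^T l'(eps)||_inf / n: ", np.median(robust))
print("benchmark sqrt(log p / n):      ", np.sqrt(np.log(p) / n))
```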

Technical challenges

The Lasso analysis also requires verifying a restricted eigenvalue (RE) condition on the design matrix, which is more complicated for general $\ell$.

When $\ell$ is nonconvex, local optima $\tilde{\beta}$ may exist that are not global optima.

We want error bounds on $\|\tilde{\beta} - \beta^*\|_2$ as well, or algorithms that find $\hat{\beta}$ efficiently.

Related work: Nonconvex regularized M-estimators

Composite objective function:
$$\hat{\beta} \in \arg\min_{\|\beta\|_1 \le R} \left\{ L_n(\beta) + \sum_{j=1}^{p} \rho_\lambda(\beta_j) \right\}$$

Assumptions: $L_n$ satisfies restricted strong convexity with curvature $\alpha$ (Negahban et al. '12); $\rho_\lambda$ has a bounded subgradient at 0 and $\rho_\lambda(t) + \mu t^2$ is convex; and $\alpha > \mu$.
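
One standard penalty satisfying these conditions is SCAD (Fan & Li 2001), which also appears in the simulations later in the talk. The sketch below is an illustrative implementation with the conventional parameter a = 3.7, together with a numerical check that $\rho_\lambda(t) + \mu t^2$ is convex for a suitable $\mu$; the particular value of $\mu$ used here is my own choice.

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty rho_lam(t) of Fan & Li (2001), applied elementwise."""
    t = np.abs(t)
    quad = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))
    flat = lam**2 * (a + 1) / 2
    return np.where(t <= lam, lam * t, np.where(t <= a * lam, quad, flat))

def scad_derivative(t, lam, a=3.7):
    """rho'_lam(t): equals lam * sign(t) near 0 and vanishes for |t| > a * lam."""
    s = np.sign(t)
    t = np.abs(t)
    mid = (a * lam - t) / (a - 1)
    return s * np.where(t <= lam, lam, np.where(t <= a * lam, mid, 0.0))

# mu-amenability check: rho_lam(t) + mu * t^2 should be convex once mu is large enough
# (for SCAD, a weak-convexity constant of order 1/(a - 1) suffices).
lam, a = 1.0, 3.7
mu = 1.0 / (2 * (a - 1))
t = np.linspace(-6, 6, 2001)
g = scad_penalty(t, lam, a) + mu * t**2
print("min second difference of rho + mu*t^2 (should be >= 0 up to rounding):",
      np.diff(g, 2).min())
```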

Stationary points (L. & Wainwright '15)

[Figure: every stationary point lies within a ball of radius of order $\sqrt{k \log p / n}$ around the global optimum $\hat{\beta}$.]

Stationary points are statistically indistinguishable from global optima:
$$\langle \nabla L_n(\tilde{\beta}) + \nabla \rho_\lambda(\tilde{\beta}),\ \beta - \tilde{\beta} \rangle \ge 0, \qquad \text{for all feasible } \beta.$$

Under suitable distributional assumptions, for $\lambda \asymp \sqrt{\frac{\log p}{n}}$ and $R \asymp \frac{1}{\lambda}$,
$$\|\tilde{\beta} - \beta^*\|_2 \le c \underbrace{\sqrt{\frac{k \log p}{n}}}_{\text{statistical error}}.$$

Mathematical statement

Theorem (L. & Wainwright '15). Suppose $R$ is chosen so that $\beta^*$ is feasible, and $\lambda$ satisfies
$$\max\left\{ \|\nabla L_n(\beta^*)\|_\infty,\ \alpha \sqrt{\frac{\log p}{n}} \right\} \;\lesssim\; \lambda \;\lesssim\; \frac{\alpha}{R}.$$
For $n \ge \frac{C \tau^2 R^2 \log p}{\alpha^2}$, any stationary point $\tilde{\beta}$ satisfies
$$\|\tilde{\beta} - \beta^*\|_2 \lesssim \frac{\lambda \sqrt{k}}{\alpha - \mu},$$
where $k = \|\beta^*\|_0$.

New ingredient for the robust setting: $\ell$ is convex only in a local region $\Rightarrow$ need for local consistency results.

Local statistical consistency

[Figures: least squares, absolute value, Huber, and Tukey losses as functions of the residual; least squares, Huber, and Tukey fits to a millions-of-calls vs. year data set.]

Challenge in robust statistics: population-level nonconvexity of the loss $\Rightarrow$ need for a local optimization theory.

Local RSC condition

Local RSC condition: for $\Delta := \beta_1 - \beta_2$,
$$\langle \nabla L_n(\beta_1) - \nabla L_n(\beta_2),\ \Delta \rangle \;\ge\; \alpha \|\Delta\|_2^2 - \tau \frac{\log p}{n} \|\Delta\|_1^2, \qquad \forall\, \|\beta_j - \beta^*\|_2 \le r.$$

The loss function has directions of both positive and negative curvature; the negative directions are forbidden by the regularizer. The condition only requires restricted curvature within a constant-radius region around $\beta^*$.
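
The sketch below probes this phenomenon empirically for the Tukey loss (an illustration of mine, on arbitrary simulated data): the directional second difference of $L_n$ is comfortably positive near $\beta^*$ and degrades, flattening out or even turning negative, as we move far away, which is precisely the regime the local RSC condition excludes.

```python
import numpy as np

def tukey_loss(u, c=4.685):
    """Tukey bisquare loss: bounded, hence nonconvex at the population level."""
    inside = (c**2 / 6) * (1 - (1 - (u / c) ** 2) ** 3)
    return np.where(np.abs(u) <= c, inside, c**2 / 6)

rng = np.random.default_rng(4)
n, p = 500, 10
X = rng.normal(size=(n, p))
beta_star = rng.normal(size=p)
y = X @ beta_star + rng.normal(size=n)

def L_n(beta):
    return np.mean(tukey_loss(X @ beta - y))

direction = rng.normal(size=p)
direction /= np.linalg.norm(direction)
h = 1e-3
for radius in [0.0, 0.5, 2.0, 10.0, 50.0]:
    beta = beta_star + radius * direction
    curve = (L_n(beta + h * direction) - 2 * L_n(beta) + L_n(beta - h * direction)) / h**2
    print(f"distance {radius:5.1f} from beta*: directional curvature = {curve:+.4f}")
```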

Consistency of local stationary points

[Figure: stationary points within radius $r$ of $\beta^*$ lie in a ball of radius of order $\sqrt{k \log p / n}$ around $\beta^*$.]

Theorem (L. '17). Suppose $L_n$ satisfies the $\alpha$-local RSC condition and $\rho_\lambda$ is $\mu$-amenable, with $\alpha > \mu$. Suppose $\|\ell'\|_\infty \le C$ and $\lambda \asymp \sqrt{\frac{\log p}{n}}$. For $n \gtrsim k \log p$ (with a prefactor depending on $\tau$, $C$, and $\alpha - \mu$), any stationary point $\tilde{\beta}$ such that $\|\tilde{\beta} - \beta^*\|_2 \le r$ satisfies
$$\|\tilde{\beta} - \beta^*\|_2 \lesssim \frac{\lambda \sqrt{k}}{\alpha - \mu}.$$

Optimization theory

Question: How do we obtain sufficiently close local solutions?

Goal: for the regularized M-estimator
$$\hat{\beta} \in \arg\min_{\|\beta\|_1 \le R} \left\{ \frac{1}{n} \sum_{i=1}^{n} \ell(x_i^T \beta - y_i) + \rho_\lambda(\beta) \right\},$$
where $\ell$ satisfies the $\alpha$-local RSC condition, find a stationary point $\tilde{\beta}$ such that $\|\tilde{\beta} - \beta^*\|_2 \le r$.

Wisdom from Huber

"Descending $\psi$-functions are tricky, especially when the starting values for the iterations are non-robust. ... It is therefore preferable to start with a monotone $\psi$, iterate to death, and then append a few (1 or 2) iterations with the nonmonotone $\psi$." (Huber, 1981)

Two-step algorithm (L. '17)

Use composite gradient descent (Nesterov '07): an iterative method to solve
$$\hat{\beta} \in \arg\min_{\beta \in \Omega} \{ L_n(\beta) + \rho_\lambda(\beta) \},$$
with $L_n$ differentiable and $\rho_\lambda$ convex and subdifferentiable.

[Figure: at the iterate $\beta^t$, the loss $L_n$ is majorized by the surrogate $L_n(\beta^t) + \langle \nabla L_n(\beta^t), \beta - \beta^t \rangle + \frac{L}{2}\|\beta - \beta^t\|_2^2$, whose regularized minimizer gives $\beta^{t+1}$.]

Updates:
$$\beta^{t+1} \in \arg\min_{\beta \in \Omega} \left\{ L_n(\beta^t) + \langle \nabla L_n(\beta^t),\ \beta - \beta^t \rangle + \frac{L}{2} \|\beta - \beta^t\|_2^2 + \rho_\lambda(\beta) \right\}$$
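
For an $\ell_1$ penalty the inner minimization has the closed-form soft-thresholding solution, so each update is cheap. The sketch below is a minimal composite gradient loop under illustrative choices of mine: the Huber loss, $\rho_\lambda = \lambda\|\cdot\|_1$, a step size $1/L$ from the spectral norm of $X$, and no side constraint $\|\beta\|_1 \le R$.

```python
import numpy as np

def soft_threshold(v, t):
    """Closed-form minimizer of (1/2)||b - v||^2 + t ||b||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def huber_psi(u, c=1.345):
    return np.clip(u, -c, c)

def composite_gradient_descent(X, y, lam, step, n_iter=1000, beta0=None):
    """Composite gradient descent on (1/n) sum_i huber(x_i^T b - y_i) + lam ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p) if beta0 is None else beta0.copy()
    for _ in range(n_iter):
        grad = X.T @ huber_psi(X @ beta - y) / n                # gradient of the smooth part
        beta = soft_threshold(beta - step * grad, step * lam)   # prox step on the penalty
    return beta

rng = np.random.default_rng(5)
n, p, k = 200, 400, 5
X = rng.normal(size=(n, p))
beta_star = np.zeros(p)
beta_star[:k] = 2.0
y = X @ beta_star + rng.standard_cauchy(size=n)

lam = 2 * np.sqrt(np.log(p) / n)
step = n / np.linalg.norm(X, ord=2) ** 2        # 1/L with L = ||X||_op^2 / n
beta_hat = composite_gradient_descent(X, y, lam, step)
print("||beta_hat - beta*||_2 =", np.linalg.norm(beta_hat - beta_star))
```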

Two-step algorithm (L. '17)

Two-step M-estimator: finds local stationary points of a nonconvex, robust loss + $\mu$-amenable penalty:
$$\hat{\beta} \in \arg\min_{\|\beta\|_1 \le R} \left\{ \frac{1}{n} \sum_{i=1}^{n} \ell(x_i^T \beta - y_i) + \rho_\lambda(\beta) \right\}$$

Algorithm:
1. Run composite gradient descent on a convex, robust loss + $\ell_1$-penalty until convergence; output $\hat{\beta}^H$.
2. Run composite gradient descent on the nonconvex, robust loss + $\mu$-amenable penalty, initialized at $\beta^0 = \hat{\beta}^H$.

Important: we want to optimize the original nonconvex objective, since it leads to more efficient (lower-variance) estimators.
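
A compact, self-contained sketch of the two steps under illustrative choices of mine: step 1 uses the convex Huber loss, and step 2 switches to the redescending Tukey loss, warm-started at the Huber solution. For brevity both steps keep an $\ell_1$ penalty (rather than a $\mu$-amenable penalty such as SCAD) so the prox step stays in closed form.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def huber_psi(u, c=1.345):
    return np.clip(u, -c, c)

def tukey_psi(u, c=4.685):
    """Redescending score function: identically zero outside [-c, c]."""
    return np.where(np.abs(u) <= c, u * (1 - (u / c) ** 2) ** 2, 0.0)

def prox_gradient(X, y, psi, lam, step, beta0, n_iter=1000):
    """Composite gradient descent for (1/n) sum_i l(x_i^T b - y_i) + lam ||b||_1, with l' = psi."""
    beta = beta0.copy()
    n = X.shape[0]
    for _ in range(n_iter):
        grad = X.T @ psi(X @ beta - y) / n
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

rng = np.random.default_rng(6)
n, p, k = 200, 400, 5
X = rng.normal(size=(n, p))
beta_star = np.zeros(p)
beta_star[:k] = 2.0
y = X @ beta_star + rng.standard_cauchy(size=n)

lam = 2 * np.sqrt(np.log(p) / n)
step = n / np.linalg.norm(X, ord=2) ** 2

# Step 1: convex robust loss (Huber) + l1 penalty, started from zero.
beta_huber = prox_gradient(X, y, huber_psi, lam, step, np.zeros(p))
# Step 2: nonconvex robust loss (Tukey) + penalty, warm-started at the Huber output.
beta_tukey = prox_gradient(X, y, tukey_psi, lam, step, beta_huber)

print("step 1 (Huber) error:", np.linalg.norm(beta_huber - beta_star))
print("step 2 (Tukey) error:", np.linalg.norm(beta_tukey - beta_star))
```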

Simulation

[Figure: $\ell_2$-error $\|\hat{\beta} - \beta^*\|_2$ and empirical variance of the first component, plotted against $n / (k \log p)$ for the Huber and Cauchy losses with $p \in \{128, 256, 512\}$.] $\ell_2$-error and empirical variance of M-estimators when the errors follow a Cauchy distribution (SCAD regularizer).

Can prove geometric convergence of the two-step algorithm to desirable local optima (L. '17).

Summary

Loss functions with desirable robustness properties in low-dimensional regression are also good for high dimensions: bounded influence ($\|\ell'\|_\infty \le C$) $\Rightarrow$ $O\!\left(\sqrt{\frac{k \log p}{n}}\right)$ consistency.

Two-step optimization procedure: first step for consistency, second step for efficiency.

Loh (2017). Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. Annals of Statistics.

Trailer

Problem: the loss function $\ell$ is in some sense calibrated to the scale of $\epsilon_i$.

Better objective (joint location/scale estimator):
$$(\hat{\beta}, \hat{\sigma}) \in \arg\min_{\beta, \sigma} \Bigg\{ \underbrace{\frac{1}{n} \sum_{i=1}^{n} \ell\!\left(\frac{y_i - x_i^T \beta}{\sigma}\right)\sigma + a\sigma}_{L_n(\beta, \sigma)} + \lambda \|\beta\|_1 \Bigg\}$$

However, joint location/scale estimation is notoriously difficult even in low dimensions.

Trailer

Another idea: MM-estimator
$$\hat{\beta} \in \arg\min_{\beta} \left\{ \frac{1}{n} \sum_{i=1}^{n} \ell\!\left(\frac{y_i - x_i^T \beta}{\hat{\sigma}_0}\right) + \lambda \|\beta\|_1 \right\},$$
using a robust estimate of scale $\hat{\sigma}_0$ based on a preliminary estimate $\hat{\beta}_0$.

How to obtain $(\hat{\beta}_0, \hat{\sigma}_0)$?

S-estimators/LMS: $\hat{\beta}_0 \in \arg\min_\beta \{\hat{\sigma}(r(\beta))\}$, where $\hat{\sigma}(r) = |r|_{(n - \lfloor n\delta \rfloor)}$ is an order statistic of the absolute residuals.

LTS: $\hat{\beta}_0 \in \arg\min_\beta \left\{ \frac{1}{n} \sum_{i=1}^{n - \lfloor n\alpha \rfloor} (y_i - x_i^T \beta)^2_{(i)} + \lambda \|\beta\|_1 \right\}$, summing the smallest ordered squared residuals.
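
The following sketch illustrates the MM idea in an unpenalized, low-dimensional setting (a toy example of mine): a preliminary fit supplies residuals, a normalized MAD of those residuals gives $\hat{\sigma}_0$, and a redescending M-step is then run on the rescaled residuals. Here the preliminary fit is a Huber fit at unit scale, standing in for the S-estimator or LTS initializations above.

```python
import numpy as np
from scipy.optimize import minimize

def huber_loss(u, c=1.345):
    return np.where(np.abs(u) <= c, 0.5 * u**2, c * np.abs(u) - 0.5 * c**2)

def tukey_loss(u, c=4.685):
    inside = (c**2 / 6) * (1 - (1 - (u / c) ** 2) ** 3)
    return np.where(np.abs(u) <= c, inside, c**2 / 6)

rng = np.random.default_rng(7)
n, p = 300, 5
X = rng.normal(size=(n, p))
beta_star = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ beta_star + 2.0 * rng.standard_cauchy(size=n)   # noise on an unknown scale

# Preliminary estimate beta_0 (Huber fit at unit scale, as a stand-in for S/LTS).
beta_0 = minimize(lambda b: np.mean(huber_loss(y - X @ b)), np.zeros(p), method="BFGS").x

# Robust scale from the preliminary residuals: normalized median absolute deviation.
resid = y - X @ beta_0
sigma_0 = np.median(np.abs(resid - np.median(resid))) / 0.6745

# MM step: minimize the redescending Tukey loss of the rescaled residuals, warm-started at beta_0.
beta_mm = minimize(lambda b: np.mean(tukey_loss((y - X @ b) / sigma_0)),
                   beta_0, method="BFGS").x

print("sigma_0:", round(sigma_0, 3))
print("preliminary (Huber) error:", np.linalg.norm(beta_0 - beta_star))
print("MM-step (Tukey) error:   ", np.linalg.norm(beta_mm - beta_star))
```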

Trailer

Maybe an entirely different approach is necessary...

Loh (2017). Scale estimation for high-dimensional robust regression. Coming soon?

Thank you!
