Robust high-dimensional linear regression: A statistical perspective

Po-Ling Loh
University of Wisconsin-Madison, Departments of ECE & Statistics

STOC workshop on robustness and nonconvexity
Montreal, Canada, June 23, 2017
Introduction: Robust regression

Robust statistics introduced in the 1960s (Huber, Tukey, Hampel, et al.)

Goals:
1. Develop estimators $T(\cdot)$ that are reliable under deviations from model assumptions
2. Quantify performance with respect to deviations

Local stability captured by the influence function:
$$IF(x; T, F) = \lim_{t \to 0} \frac{T((1-t)F + t\delta_x) - T(F)}{t}$$

Global stability captured by the breakdown point:
$$\epsilon^*(T; X_1, \dots, X_n) = \min\left\{\frac{m}{n} : \sup_{\tilde{X}_m} \|T(\tilde{X}_m) - T(X)\| = \infty\right\},$$
where $\tilde{X}_m$ ranges over samples with $m$ of the $n$ points replaced arbitrarily.
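To make these definitions concrete, here is a minimal Python sketch (my own illustration, not from the talk) comparing the empirical sensitivity curve, a finite-sample analogue of the influence function, for the mean and the median:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)  # clean sample

def sensitivity_curve(T, X, x):
    """Finite-sample analogue of IF(x; T, F): (n+1) * (T(X with x appended) - T(X))."""
    n = len(X)
    return (n + 1) * (T(np.append(X, x)) - T(X))

for x in [0.0, 5.0, 100.0]:
    print(f"x = {x:6.1f}:  mean SC = {sensitivity_curve(np.mean, X, x):8.2f},"
          f"  median SC = {sensitivity_curve(np.median, X, x):6.2f}")
# The mean's sensitivity grows linearly in x (unbounded influence function);
# the median's stays bounded (and its breakdown point is 1/2, vs. 0 for the mean).
```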
High-dimensional linear models

Linear model:
$$y_i = x_i^T \beta^* + \epsilon_i, \quad i = 1, \dots, n,$$
with response $y \in \mathbb{R}^n$, design $X \in \mathbb{R}^{n \times p}$, and parameter $\beta^* \in \mathbb{R}^p$

When $p \gg n$, assume sparsity: $\|\beta^*\|_0 \le k$
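As a running example for the sketches below (my own setup, not data from the talk), one can simulate this model with a $k$-sparse $\beta^*$ and heavy-tailed errors:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 200, 512, 5            # p >> n regime
X = rng.normal(size=(n, p))      # design matrix
beta_star = np.zeros(p)
beta_star[:k] = 1.0              # k-sparse signal
eps = rng.standard_cauchy(n)     # Cauchy errors: heavy-tailed, not sub-Gaussian
y = X @ beta_star + eps
```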
Robust M-estimators

Generalization of OLS appropriate for robust statistics:
$$\hat{\beta} \in \arg\min_\beta \left\{ \frac{1}{n} \sum_{i=1}^n \ell(x_i^T \beta - y_i) \right\}$$

Extensive theory for $p$ fixed, $n \to \infty$

[Figure: least squares, absolute value, Huber, and Tukey losses as a function of the residual, and least-squares, Huber, and Tukey fits to yearly telephone-call data (millions of calls, 1950-1970)]
Classes of loss functions

Bounded $\ell'$ limits the influence of outliers:
$$IF((x, y); T, F) = \lim_{t \to 0^+} \frac{T((1-t)F + t\delta_{(x,y)}) - T(F)}{t} \propto \ell'(x^T \beta^* - y)\, x,$$
where $F = F_{\beta^*}$ and $T$ is the minimizer of the M-estimation objective

Redescending M-estimators have a finite rejection point:
$$\ell'(u) = 0, \quad \text{for } |u| \ge c$$

[Figure: least squares, absolute value, Huber, and Tukey losses vs. residual]

But bad for optimization!!
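For reference, a sketch of the Huber and Tukey bisquare losses and their derivatives ($\psi$-functions); the tuning constants $c$ below are common defaults, chosen here for illustration rather than taken from the talk:

```python
import numpy as np

def huber_loss(u, c=1.345):
    """Convex; l' is bounded (|l'| <= c) but never redescends to 0."""
    return np.where(np.abs(u) <= c, 0.5 * u**2, c * np.abs(u) - 0.5 * c**2)

def huber_psi(u, c=1.345):
    return np.clip(u, -c, c)

def tukey_loss(u, c=4.685):
    """Nonconvex and redescending: constant for |u| >= c."""
    return np.where(np.abs(u) <= c,
                    (c**2 / 6) * (1 - (1 - (u / c)**2)**3),
                    c**2 / 6)

def tukey_psi(u, c=4.685):
    """Finite rejection point: psi(u) = 0 for |u| >= c."""
    return np.where(np.abs(u) <= c, u * (1 - (u / c)**2)**2, 0.0)
```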
High-dimensional M-estimators

Natural idea: For $p > n$, use a regularized version:
$$\hat{\beta} \in \arg\min_\beta \left\{ \frac{1}{n} \sum_{i=1}^n \ell(x_i^T \beta - y_i) + \lambda \|\beta\|_1 \right\}$$

Complications:
- Optimization for nonconvex $\ell$?
- Statistical theory? Are certain losses provably better than others?
Overview of results

When $\|\ell'\|_\infty \le C$, global optima of the high-dimensional M-estimator satisfy
$$\|\hat{\beta} - \beta^*\|_2 \le C \sqrt{\frac{k \log p}{n}},$$
regardless of the distribution of $\epsilon_i$

Compare to Lasso theory: requires sub-Gaussian $\epsilon_i$'s

If $\ell(u)$ is locally convex/smooth for $|u| \le r$, any local optima within radius $cr$ of $\beta^*$ satisfy
$$\|\tilde{\beta} - \beta^*\|_2 \le C \sqrt{\frac{k \log p}{n}}$$

(* To verify the RE condition w.h.p., need $\mathrm{Var}(\epsilon_i) \le cr^2$ as well)

Local optima may be obtained via a two-step algorithm
Theoretical insight

Lasso analysis (e.g., van de Geer '07, Bickel et al. '08):
$$\hat{\beta} \in \arg\min_\beta \Big\{ \underbrace{\frac{1}{n}\|y - X\beta\|_2^2}_{L_n(\beta)} + \lambda \|\beta\|_1 \Big\}$$

Rearranging the basic inequality $L_n(\hat{\beta}) + \lambda\|\hat{\beta}\|_1 \le L_n(\beta^*) + \lambda\|\beta^*\|_1$ and assuming $\lambda \ge 2\left\|\frac{X^T \epsilon}{n}\right\|_\infty$, obtain
$$\|\hat{\beta} - \beta^*\|_2 \le c\lambda\sqrt{k}$$

Sub-Gaussian assumptions on the $x_i$'s and $\epsilon_i$'s provide $O\left(\sqrt{\frac{k \log p}{n}}\right)$ bounds, minimax optimal
Theoretical insight

Key observation: For a general loss function, if $\lambda \ge 2\left\|\frac{X^T \ell'(\epsilon)}{n}\right\|_\infty$, obtain
$$\|\hat{\beta} - \beta^*\|_2 \le c\lambda\sqrt{k}$$

$\ell'(\epsilon)$ is sub-Gaussian whenever $\ell'$ is bounded

$\implies$ can achieve estimation error
$$\|\hat{\beta} - \beta^*\|_2 \le c\sqrt{\frac{k \log p}{n}},$$
without assuming $\epsilon_i$ is sub-Gaussian
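A quick numerical check of this observation (a sketch, with the Huber $\psi$ standing in for a bounded $\ell'$): $\|X^T \ell'(\epsilon)/n\|_\infty$ stays on the $\sqrt{\log p / n}$ scale even for Cauchy errors, while the unbounded least-squares score $\|X^T \epsilon/n\|_\infty$ does not.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 512
X = rng.normal(size=(n, p))
eps = rng.standard_cauchy(n)                     # heavy-tailed errors

psi = lambda u, c=1.345: np.clip(u, -c, c)       # bounded l' (Huber)

raw = np.abs(X.T @ eps / n).max()                # least-squares score: can blow up
robust = np.abs(X.T @ psi(eps) / n).max()        # bounded-l' score
print(f"||X^T eps / n||_inf     = {raw:.3f}")
print(f"||X^T l'(eps) / n||_inf = {robust:.3f}")
print(f"sqrt(log p / n)         = {np.sqrt(np.log(p) / n):.3f}")
# Choosing lambda >= 2 * robust then puts lambda on the sqrt(log p / n) scale.
```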
Technical challenges

- Lasso analysis also requires verifying a restricted eigenvalue (RE) condition on the design matrix, which is more complicated for general $\ell$
- When $\ell$ is nonconvex, local optima $\tilde{\beta}$ may exist that are not global optima
- Want error bounds on $\|\tilde{\beta} - \beta^*\|_2$ as well, or algorithms that find $\hat{\beta}$ efficiently
Related work: Nonconvex regularized M-estimators

Composite objective function:
$$\hat{\beta} \in \arg\min_{\|\beta\|_1 \le R} \left\{ L_n(\beta) + \sum_{j=1}^p \rho_\lambda(\beta_j) \right\}$$

Assumptions:
- $L_n$ satisfies restricted strong convexity with curvature $\alpha$ (Negahban et al. '12)
- $\rho_\lambda$ has bounded subgradient at 0, and $\rho_\lambda(t) + \mu t^2$ is convex
- $\alpha > \mu$
Stationary points (L. & Wainwright '15)

[Figure: all stationary points $\tilde{\beta}$ lie within radius $O\big(\sqrt{k \log p / n}\big)$ of $\hat{\beta}$]

Stationary points are statistically indistinguishable from global optima:
$$\langle \nabla L_n(\tilde{\beta}) + \nabla \rho_\lambda(\tilde{\beta}),\, \beta - \tilde{\beta} \rangle \ge 0, \quad \forall \beta \text{ feasible}$$

Under suitable distributional assumptions, for $\lambda \asymp \sqrt{\frac{\log p}{n}}$ and $R \asymp \frac{1}{\lambda}$,
$$\|\tilde{\beta} - \beta^*\|_2 \le c\sqrt{\frac{k \log p}{n}} \asymp \text{statistical error}$$
Mathematical statement

Theorem (L. & Wainwright '15)
Suppose $R$ is chosen s.t. $\beta^*$ is feasible, and $\lambda$ satisfies
$$\max\left\{ \|\nabla L_n(\beta^*)\|_\infty,\; \alpha\sqrt{\frac{\log p}{n}} \right\} \le \lambda \le \frac{\alpha}{R}.$$
For $n \ge \frac{C\tau^2 R^2 \log p}{\alpha^2}$, any stationary point $\tilde{\beta}$ satisfies
$$\|\tilde{\beta} - \beta^*\|_2 \lesssim \frac{\lambda\sqrt{k}}{\alpha - \mu},$$
where $k = \|\beta^*\|_0$.

New ingredient for the robust setting: $\ell$ is convex only in a local region $\implies$ need for local consistency results
Local statistical consistency

[Figure: least squares, absolute value, Huber, and Tukey losses, and least-squares, Huber, and Tukey fits to the telephone-call data]

Challenge in robust statistics: population-level nonconvexity of the loss $\implies$ need for local optimization theory
Local RSC condition

Local RSC condition: For $\Delta := \beta_1 - \beta_2$,
$$\langle \nabla L_n(\beta_1) - \nabla L_n(\beta_2),\, \Delta \rangle \ge \alpha \|\Delta\|_2^2 - \tau \frac{\log p}{n} \|\Delta\|_1^2, \quad \forall \|\beta_j - \beta^*\|_2 \le r$$

[Figure: loss surface with directions of both positive and negative curvature; the negative directions are forbidden by the regularizer]

Only requires restricted curvature within a constant-radius region around $\beta^*$
Consistency of local stationary points

[Figure: local stationary points within radius $r$ of $\beta^*$ have error $O\big(\sqrt{k \log p / n}\big)$]

Theorem (L. '17)
Suppose $L_n$ satisfies the $\alpha$-local RSC condition and $\rho_\lambda$ is $\mu$-amenable, with $\alpha > \mu$. Suppose $\|\ell'\|_\infty \le C$ and $\lambda \asymp \sqrt{\frac{\log p}{n}}$. For $n \gtrsim \frac{\tau}{\alpha - \mu}\, k \log p$, any stationary point $\tilde{\beta}$ s.t. $\|\tilde{\beta} - \beta^*\|_2 \le r$ satisfies
$$\|\tilde{\beta} - \beta^*\|_2 \lesssim \frac{\lambda\sqrt{k}}{\alpha - \mu}.$$
Optimization theory

Question: How to obtain sufficiently close local solutions?

Goal: For the regularized M-estimator
$$\hat{\beta} \in \arg\min_{\|\beta\|_1 \le R} \left\{ \frac{1}{n} \sum_{i=1}^n \ell(x_i^T \beta - y_i) + \rho_\lambda(\beta) \right\},$$
where $\ell$ satisfies the $\alpha$-local RSC condition, find a stationary point $\tilde{\beta}$ such that $\|\tilde{\beta} - \beta^*\|_2 \le r$
Wisdom from Huber

"Descending $\psi$-functions are tricky, especially when the starting values for the iterations are non-robust. ... It is therefore preferable to start with a monotone $\psi$, iterate to death, and then append a few (1 or 2) iterations with the nonmonotone $\psi$." (Huber 1981, pp. 191-192)
Two-step algorithm (L. '17)

Use composite gradient descent (Nesterov '07): an iterative method to solve
$$\hat{\beta} \in \arg\min_{\beta \in \Omega} \{ L_n(\beta) + \rho_\lambda(\beta) \},$$
with $L_n$ differentiable and $\rho_\lambda$ convex & subdifferentiable

[Figure: at each iterate $\beta^t$, the surrogate $L_n(\beta^t) + \langle \nabla L_n(\beta^t), \beta - \beta^t \rangle + \frac{L}{2}\|\beta - \beta^t\|_2^2$ majorizes $L_n$; its penalized minimizer is $\beta^{t+1}$]

Updates:
$$\beta^{t+1} \in \arg\min_{\beta \in \Omega} \left\{ L_n(\beta^t) + \langle \nabla L_n(\beta^t), \beta - \beta^t \rangle + \frac{L}{2}\|\beta - \beta^t\|_2^2 + \rho_\lambda(\beta) \right\}$$
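For the $\ell_1$ penalty and $\Omega = \mathbb{R}^p$, this update has a closed form via soft-thresholding. A minimal sketch, which for simplicity omits the side constraint $\|\beta\|_1 \le R$ that the full algorithm also enforces:

```python
import numpy as np

def soft_threshold(z, t):
    """Prox operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def composite_gradient_descent(grad_Ln, beta0, lam, L, n_iter=500):
    """beta^{t+1} = soft_threshold(beta^t - grad L_n(beta^t) / L, lam / L)."""
    beta = beta0.copy()
    for _ in range(n_iter):
        beta = soft_threshold(beta - grad_Ln(beta) / L, lam / L)
    return beta
```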
Two-step algorithm (L. '17)

Two-step M-estimator: finds local stationary points of nonconvex, robust loss + $\mu$-amenable penalty
$$\hat{\beta} \in \arg\min_{\|\beta\|_1 \le R} \left\{ \frac{1}{n} \sum_{i=1}^n \ell(x_i^T \beta - y_i) + \rho_\lambda(\beta) \right\}$$

Algorithm:
1. Run composite gradient descent on convex, robust loss + $\ell_1$-penalty until convergence; output $\hat{\beta}_H$
2. Run composite gradient descent on nonconvex, robust loss + $\mu$-amenable penalty, with input $\beta^0 = \hat{\beta}_H$

Important: We want to optimize the original nonconvex objective, since it leads to more efficient (lower-variance) estimators
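A sketch of the two-step procedure assembled from the pieces above (it reuses `X`, `y`, `n`, `p` from the simulated running example, `huber_psi` and `tukey_psi` from the loss sketch, and `composite_gradient_descent` from the previous sketch). For brevity it keeps the $\ell_1$ penalty in both steps, whereas the talk's second step uses a $\mu$-amenable penalty such as SCAD:

```python
import numpy as np

def make_grad(X, y, psi):
    """grad L_n(beta) = X^T psi(X beta - y) / n for the loss l(x_i^T beta - y_i)."""
    n = X.shape[0]
    return lambda beta: X.T @ psi(X @ beta - y) / n

lam = 2 * np.sqrt(np.log(p) / n)              # sqrt(log p / n) scale
L_smooth = np.linalg.norm(X, 2)**2 / n        # crude smoothness constant for L_n

# Step 1: convex robust loss (Huber psi), zero initialization, run to convergence.
beta_H = composite_gradient_descent(make_grad(X, y, huber_psi),
                                    np.zeros(p), lam, L_smooth)

# Step 2: nonconvex robust loss (Tukey psi), warm-started at beta_H.
beta_hat = composite_gradient_descent(make_grad(X, y, tukey_psi),
                                      beta_H, lam, L_smooth)
```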
Simulation 1

[Figure: $\ell_2$-error $\|\hat{\beta} - \beta^*\|_2$ and empirical variance of the first component for Huber and Cauchy losses, $p \in \{128, 256, 512\}$, plotted against $n/(k \log p)$; errors follow a Cauchy distribution (SCAD regularizer)]

Can prove geometric convergence of the two-step algorithm to desirable local optima (L. '17)
Summary

Loss functions with desirable robustness properties in low-dimensional regression are also good for high dimensions: bounded influence ($\|\ell'\|_\infty \le C$) $\implies$ $O\left(\sqrt{\frac{k \log p}{n}}\right)$ consistency

Two-step optimization procedure: first step for consistency, second step for efficiency

Loh (2017). Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. Annals of Statistics.
Trailer

Problem: The loss function $\ell$ is in some sense calibrated to the scale of $\epsilon_i$

Better objective (joint location/scale estimator):
$$(\hat{\beta}, \hat{\sigma}) \in \arg\min_{\beta, \sigma} \Big\{ \underbrace{\frac{1}{n} \sum_{i=1}^n \ell\left(\frac{y_i - x_i^T \beta}{\sigma}\right)\sigma + a\sigma}_{L_n(\beta, \sigma)} + \lambda \|\beta\|_1 \Big\}$$

However, location/scale estimation is notoriously difficult even in low dimensions
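For concreteness, a sketch of evaluating this joint objective (my own illustration; `huber_loss` is from the loss sketch above, and the constant $a$, which should be calibrated for consistency at the normal model, is set arbitrarily here):

```python
def joint_objective(beta, sigma, X, y, loss, a=0.5, lam=0.1):
    """L_n(beta, sigma) + lam * ||beta||_1, with L_n as on the slide."""
    r = (y - X @ beta) / sigma
    return sigma * loss(r).mean() + a * sigma + lam * np.abs(beta).sum()

# e.g., joint_objective(beta_star, 1.0, X, y, huber_loss)
```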
Trailer

Another idea: MM-estimator
$$\hat{\beta} \in \arg\min_\beta \left\{ \frac{1}{n} \sum_{i=1}^n \ell\left(\frac{y_i - x_i^T \beta}{\hat{\sigma}_0}\right) + \lambda \|\beta\|_1 \right\},$$
using a robust estimate of scale $\hat{\sigma}_0$ based on a preliminary estimate $\hat{\beta}_0$

How to obtain $(\hat{\beta}_0, \hat{\sigma}_0)$?

S-estimators/LMS:
$$\hat{\beta}_0 \in \arg\min_\beta \{ \hat{\sigma}(r(\beta)) \}, \quad \text{where } \hat{\sigma}(r) = |r|_{(n - \lfloor n\delta \rfloor)}$$

LTS:
$$\hat{\beta}_0 \in \arg\min_\beta \left\{ \frac{1}{n} \sum_{i=1}^{n - \lfloor n\alpha \rfloor} (y_i - x_i^T \beta)^2_{(i)} + \lambda \|\beta\|_1 \right\},$$
where $(y_i - x_i^T \beta)^2_{(i)}$ denotes the $i$-th smallest squared residual.
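A sketch of the two preliminary-estimate ingredients (my own illustration, with illustrative trimming fractions): the order-statistic scale $\hat{\sigma}(r)$ used by S-estimators/LMS, and the trimmed LTS objective.

```python
import numpy as np

def s_scale(r, delta=0.5):
    """sigma_hat(r) = |r|_{(n - floor(n*delta))}, an order statistic of |residuals|."""
    n = len(r)
    return np.sort(np.abs(r))[n - int(np.floor(n * delta)) - 1]  # 0-indexed

def lts_objective(beta, X, y, alpha=0.25, lam=0.1):
    """Average of the n - floor(n*alpha) smallest squared residuals + l1 penalty."""
    n = len(y)
    r2 = np.sort((y - X @ beta)**2)
    h = n - int(np.floor(n * alpha))
    return r2[:h].sum() / n + lam * np.abs(beta).sum()
```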
Trailer

Maybe an entirely different approach is necessary...

Loh (2017). Scale estimation for high-dimensional robust regression. Coming soon?
Thank you!