Robust high-dimensional linear regression: A statistical perspective

Robust high-dimensional linear regression: A statistical perspective

Po-Ling Loh, University of Wisconsin-Madison, Departments of ECE & Statistics
STOC workshop on robustness and nonconvexity, Montreal, Canada
June 23, 2017

Introduction: Robust regression

Robust statistics was introduced in the 1960s (Huber, Tukey, Hampel, et al.).

Goals:
1. Develop estimators $T(\cdot)$ that are reliable under deviations from model assumptions
2. Quantify performance with respect to such deviations

Local stability is captured by the influence function
$$\mathrm{IF}(x; T, F) = \lim_{t \to 0} \frac{T((1-t)F + t\delta_x) - T(F)}{t}.$$

Global stability is captured by the breakdown point
$$\epsilon^*(T; X_1, \dots, X_n) = \min\Big\{ \frac{m}{n} : \sup_{X^m} \|T(X^m) - T(X)\| = \infty \Big\},$$
where $X^m$ ranges over samples obtained by replacing $m$ of the $n$ points arbitrarily.

High-dimensional linear models

Linear model ($y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$, $\beta^* \in \mathbb{R}^p$):
$$y_i = x_i^T \beta^* + \epsilon_i, \qquad i = 1, \dots, n.$$

When $p \gg n$, assume sparsity: $\|\beta^*\|_0 \le k$.
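
To make this setup concrete, here is a minimal simulation sketch (my own illustration, not from the talk): it draws a $k$-sparse $\beta^*$, Gaussian covariates, and heavy-tailed Cauchy errors of the kind the later slides target. The function name and all constants are hypothetical choices.

```python
import numpy as np

def make_sparse_regression(n=200, p=512, k=10, rng=None):
    """Simulate y = X beta* + eps with a k-sparse beta* and Cauchy errors."""
    rng = np.random.default_rng(rng)
    X = rng.standard_normal((n, p))                       # sub-Gaussian design
    beta_star = np.zeros(p)
    support = rng.choice(p, size=k, replace=False)
    beta_star[support] = rng.choice([-1.0, 1.0], size=k)  # k nonzero coefficients
    eps = rng.standard_cauchy(n)                          # heavy-tailed errors (infinite variance)
    y = X @ beta_star + eps
    return X, y, beta_star

# A p >> n instance with sparse truth and non-sub-Gaussian noise.
X, y, beta_star = make_sparse_regression(n=200, p=512, k=10, rng=0)
```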

Robust M-estimators

Generalization of OLS appropriate for robust statistics:
$$\hat{\beta} \in \arg\min_{\beta} \Big\{ \frac{1}{n} \sum_{i=1}^n \ell(x_i^T \beta - y_i) \Big\}.$$

Extensive theory exists for $p$ fixed, $n \to \infty$.

[Figures: loss functions (least squares, absolute value, Huber, Tukey) plotted against the residual; least-squares, Huber, and Tukey fits to phone-call data (millions of calls by year, 1950–1970).]

Classes of loss functions

A bounded $\ell'$ limits the influence of outliers:
$$\mathrm{IF}((x, y); T, F) = \lim_{t \to 0^+} \frac{T((1-t)F + t\delta_{(x,y)}) - T(F)}{t} \;\propto\; \ell'(x^T \beta - y)\, x,$$
where $F = F_\beta$ and $T$ is the functional defined by the M-estimator.

Redescending M-estimators have a finite rejection point: $\ell'(u) = 0$ for $|u| \ge c$.

[Figure: loss functions (least squares, absolute value, Huber, Tukey) vs. residual.]

But bad for optimization!
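
To make the distinction concrete, here is a small sketch (my own illustration, not the talk's code) of the Huber and Tukey bisquare losses and their derivatives $\psi = \ell'$: Huber's $\psi$ is bounded but monotone, while Tukey's $\psi$ vanishes for $|u| \ge c$ (finite rejection point). The tuning constants are the usual defaults and are assumptions here.

```python
import numpy as np

def huber_loss(u, delta=1.345):
    """Huber loss: quadratic near zero, linear in the tails (convex)."""
    return np.where(np.abs(u) <= delta, 0.5 * u**2, delta * (np.abs(u) - 0.5 * delta))

def huber_psi(u, delta=1.345):
    """Huber psi = l': bounded by delta, but monotone (never returns to zero)."""
    return np.clip(u, -delta, delta)

def tukey_loss(u, c=4.685):
    """Tukey bisquare loss: bounded and nonconvex."""
    return np.where(np.abs(u) <= c, (c**2 / 6) * (1 - (1 - (u / c) ** 2) ** 3), c**2 / 6)

def tukey_psi(u, c=4.685):
    """Redescending psi: l'(u) = 0 for |u| >= c (finite rejection point)."""
    return np.where(np.abs(u) <= c, u * (1 - (u / c) ** 2) ** 2, 0.0)

# For large residuals, Huber's psi saturates at +/-delta while Tukey's psi is exactly 0.
u = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(huber_psi(u))   # approximately [-1.345, -1.0, 0.0, 1.0, 1.345]
print(tukey_psi(u))   # approximately [0.0, -0.91, 0.0, 0.91, 0.0]
```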

High-dimensional M-estimators

Natural idea: for $p > n$, use a regularized version:
$$\hat{\beta} \in \arg\min_{\beta} \Big\{ \frac{1}{n} \sum_{i=1}^n \ell(x_i^T \beta - y_i) + \lambda \|\beta\|_1 \Big\}.$$

Complications:
- Optimization for nonconvex $\ell$?
- Statistical theory? Are certain losses provably better than others?

Overview of results

When $\|\ell'\|_\infty \le C$, global optima of the high-dimensional M-estimator satisfy
$$\|\hat{\beta} - \beta^*\|_2 \le C \sqrt{\frac{k \log p}{n}},$$
regardless of the distribution of $\epsilon_i$.

Compare to Lasso theory, which requires sub-Gaussian $\epsilon_i$'s.

If $\ell(u)$ is locally convex/smooth for $|u| \le r$, then any local optimum within radius $cr$ of $\beta^*$ satisfies
$$\|\tilde{\beta} - \beta^*\|_2 \le C \sqrt{\frac{k \log p}{n}}.$$
(*) To verify the RE condition w.h.p., we also need $\mathrm{Var}(\epsilon_i) \le c r^2$.

Local optima may be obtained via a two-step algorithm.

Theoretical insight

Lasso analysis (e.g., van de Geer '07, Bickel et al. '08):
$$\hat{\beta} \in \arg\min_{\beta} \Big\{ \underbrace{\frac{1}{n} \|y - X\beta\|_2^2}_{L_n(\beta)} + \lambda \|\beta\|_1 \Big\}.$$

Rearranging the basic inequality $L_n(\hat{\beta}) + \lambda\|\hat{\beta}\|_1 \le L_n(\beta^*) + \lambda\|\beta^*\|_1$ and assuming $\lambda \ge 2 \big\| \frac{X^T \epsilon}{n} \big\|_\infty$, we obtain
$$\|\hat{\beta} - \beta^*\|_2 \le c \lambda \sqrt{k}.$$

Sub-Gaussian assumptions on the $x_i$'s and $\epsilon_i$'s then yield $O\big(\sqrt{\frac{k \log p}{n}}\big)$ bounds, which are minimax optimal.
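
For completeness, here is a sketch of the standard argument behind this bound (my reconstruction, not taken verbatim from the slides). It uses the slightly stronger choice $\lambda \ge 2\|\nabla L_n(\beta^*)\|_\infty$ so that the cone constants work out; $\Delta = \hat{\beta} - \beta^*$ and $S = \mathrm{supp}(\beta^*)$ with $|S| = k$.

```latex
% Basic-inequality sketch for the Lasso, L_n(\beta) = \tfrac{1}{n}\|y - X\beta\|_2^2.
\begin{align*}
L_n(\hat\beta) + \lambda\|\hat\beta\|_1 \le L_n(\beta^*) + \lambda\|\beta^*\|_1
\;\Longrightarrow\;
\tfrac{1}{n}\|X\Delta\|_2^2
 &\le |\langle \nabla L_n(\beta^*), \Delta\rangle|
      + \lambda\bigl(\|\Delta_S\|_1 - \|\Delta_{S^c}\|_1\bigr) \\
 &\le \tfrac{\lambda}{2}\|\Delta\|_1 + \lambda\|\Delta_S\|_1 - \lambda\|\Delta_{S^c}\|_1
  = \tfrac{3\lambda}{2}\|\Delta_S\|_1 - \tfrac{\lambda}{2}\|\Delta_{S^c}\|_1 .
\end{align*}
% Hence \Delta lies in the cone \|\Delta_{S^c}\|_1 \le 3\|\Delta_S\|_1; an RE condition
% \tfrac{1}{n}\|X\Delta\|_2^2 \ge \alpha\|\Delta\|_2^2 over this cone then gives
%   \alpha\|\Delta\|_2^2 \le \tfrac{3\lambda}{2}\|\Delta_S\|_1 \le \tfrac{3\lambda}{2}\sqrt{k}\,\|\Delta\|_2,
% i.e. \|\hat\beta - \beta^*\|_2 \le \tfrac{3\lambda\sqrt{k}}{2\alpha} = c\lambda\sqrt{k}.
```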

Theoretical insight

Key observation: for a general loss function, if $\lambda \ge 2 \big\| \frac{X^T \ell'(\epsilon)}{n} \big\|_\infty$, we obtain
$$\|\hat{\beta} - \beta^*\|_2 \le c \lambda \sqrt{k}.$$

$\ell'(\epsilon)$ is sub-Gaussian whenever $\ell'$ is bounded
$\Rightarrow$ we can achieve estimation error
$$\|\hat{\beta} - \beta^*\|_2 \le c \sqrt{\frac{k \log p}{n}}$$
without assuming that $\epsilon_i$ is sub-Gaussian.
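
A quick numerical illustration of this point (my own sketch, not from the talk): even when $\epsilon_i$ is Cauchy, the transformed noise $\ell'(\epsilon_i)$ for the Huber loss is bounded, so $\|X^T \ell'(\epsilon)/n\|_\infty$ stays small, whereas $\|X^T \epsilon/n\|_\infty$ is dominated by a few extreme errors. Constants and names here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, delta = 500, 1000, 1.345

X = rng.standard_normal((n, p))
eps = rng.standard_cauchy(n)                  # heavy-tailed errors

psi_eps = np.clip(eps, -delta, delta)         # Huber l'(eps): bounded, hence sub-Gaussian

print(np.max(np.abs(X.T @ psi_eps)) / n)      # small: on the order of sqrt(log p / n)
print(np.max(np.abs(X.T @ eps)) / n)          # typically much larger, driven by outliers
```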

Technical challenges

The Lasso analysis also requires verifying a restricted eigenvalue (RE) condition on the design matrix; this is more complicated for a general loss $\ell$.

When $\ell$ is nonconvex, local optima $\tilde{\beta}$ may exist that are not global optima.

We want error bounds on $\|\tilde{\beta} - \beta^*\|_2$ as well, or algorithms that find $\hat{\beta}$ efficiently.

Related work: Nonconvex regularized M-estimators

Composite objective function:
$$\hat{\beta} \in \arg\min_{\|\beta\|_1 \le R} \Big\{ L_n(\beta) + \sum_{j=1}^p \rho_\lambda(\beta_j) \Big\}.$$

Assumptions:
- $L_n$ satisfies restricted strong convexity with curvature $\alpha$ (Negahban et al. '12)
- $\rho_\lambda$ has bounded subgradient at 0, and $\rho_\lambda(t) + \mu t^2$ is convex
- $\alpha > \mu$

Stationary points (L. & Wainwright '15)

[Figure: all stationary points $\tilde{\beta}$ lie within a ball of radius $O\big(\sqrt{k \log p / n}\big)$ around $\beta^*$.]

Stationary points are statistically indistinguishable from global optima:
$$\big\langle \nabla L_n(\tilde{\beta}) + \nabla \rho_\lambda(\tilde{\beta}),\ \beta - \tilde{\beta} \big\rangle \ge 0, \qquad \text{for all feasible } \beta.$$

Under suitable distributional assumptions, for $\lambda \asymp \sqrt{\frac{\log p}{n}}$ and $R \lesssim \frac{1}{\lambda}$,
$$\|\tilde{\beta} - \beta^*\|_2 \le c \sqrt{\frac{k \log p}{n}} \quad (\text{statistical error}).$$

Mathematical statement

Theorem (L. & Wainwright '15). Suppose $R$ is chosen so that $\beta^*$ is feasible, and $\lambda$ satisfies
$$\max\Big\{ \|\nabla L_n(\beta^*)\|_\infty,\ \alpha \sqrt{\frac{\log p}{n}} \Big\} \;\lesssim\; \lambda \;\lesssim\; \frac{\alpha}{R}.$$
Then for $n \ge \frac{C \tau^2 R^2 \log p}{\alpha^2}$, any stationary point $\tilde{\beta}$ satisfies
$$\|\tilde{\beta} - \beta^*\|_2 \lesssim \frac{\lambda \sqrt{k}}{\alpha - \mu},$$
where $k = \|\beta^*\|_0$.

New ingredient for the robust setting: $\ell$ is convex only in a local region $\Rightarrow$ need for local consistency results.

Local statistical consistency

[Figures: loss functions (least squares, absolute value, Huber, Tukey) vs. residual; least-squares, Huber, and Tukey fits to phone-call data (millions of calls by year).]

Challenge in robust statistics: population-level nonconvexity of the loss $\Rightarrow$ need for a local optimization theory.

Local RSC condition

Local RSC condition: for $\Delta := \beta_1 - \beta_2$,
$$L_n(\beta_1) - L_n(\beta_2) - \langle \nabla L_n(\beta_2), \Delta \rangle \;\ge\; \frac{\alpha}{2}\|\Delta\|_2^2 - \tau \frac{\log p}{n} \|\Delta\|_1^2, \qquad \text{whenever } \|\beta_j - \beta^*\|_2 \le r.$$

[Figure: surface of a loss function with directions of both positive and negative curvature.]

The loss function has directions of both positive and negative curvature; the negative directions are forbidden by the regularizer. The condition only requires restricted curvature within a constant-radius region around $\beta^*$.

Consistency of local stationary points

[Figure: stationary points within radius $r$ of $\beta^*$ lie within $O\big(\sqrt{k \log p / n}\big)$ of $\beta^*$.]

Theorem (L. '17). Suppose $L_n$ satisfies the $\alpha$-local RSC condition and $\rho_\lambda$ is $\mu$-amenable, with $\alpha > \mu$. Suppose $\|\ell'\|_\infty \le C$ and $\lambda \asymp \sqrt{\frac{\log p}{n}}$. Then for $n \gtrsim \frac{\tau}{\alpha - \mu}\, k \log p$, any stationary point $\tilde{\beta}$ with $\|\tilde{\beta} - \beta^*\|_2 \le r$ satisfies
$$\|\tilde{\beta} - \beta^*\|_2 \lesssim \frac{\lambda \sqrt{k}}{\alpha - \mu}.$$

Optimization theory

Question: How do we obtain sufficiently close local solutions?

Goal: For the regularized M-estimator
$$\hat{\beta} \in \arg\min_{\|\beta\|_1 \le R} \Big\{ \frac{1}{n} \sum_{i=1}^n \ell(x_i^T \beta - y_i) + \rho_\lambda(\beta) \Big\},$$
where $\ell$ satisfies the $\alpha$-local RSC condition, find a stationary point $\tilde{\beta}$ such that $\|\tilde{\beta} - \beta^*\|_2 \le r$.

Wisdom from Huber

"Descending ψ-functions are tricky, especially when the starting values for the iterations are non-robust. ... It is therefore preferable to start with a monotone ψ, iterate to death, and then append a few (1 or 2) iterations with the nonmonotone ψ." (Huber 1981, pp. 191–192)

Two-step algorithm (L. '17)

Use composite gradient descent (Nesterov '07): an iterative method to solve
$$\hat{\beta} \in \arg\min_{\beta \in \Omega} \{ L_n(\beta) + \rho_\lambda(\beta) \},$$
where $L_n$ is differentiable and $\rho_\lambda$ is convex and subdifferentiable.

[Figure: the quadratic surrogate $L_n(\beta^t) + \langle \nabla L_n(\beta^t), \beta - \beta^t \rangle + \frac{L}{2}\|\beta - \beta^t\|_2^2$ majorizes $L_n$ at $\beta^t$ and is minimized at $\beta^{t+1}$.]

Updates:
$$\beta^{t+1} \in \arg\min_{\beta \in \Omega} \Big\{ L_n(\beta^t) + \langle \nabla L_n(\beta^t), \beta - \beta^t \rangle + \frac{L}{2}\|\beta - \beta^t\|_2^2 + \rho_\lambda(\beta) \Big\}.$$
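
For the $\ell_1$ penalty, this update has a closed form: a gradient step on $L_n$ followed by soft-thresholding at level $\lambda / L$. A minimal sketch (assumed names; the side constraint $\beta \in \Omega$ is omitted for simplicity):

```python
import numpy as np

def composite_gradient_step(beta, grad, step, lam):
    """One composite gradient update for rho_lambda = lam * ||.||_1:
    take a gradient step on L_n (step = 1/L), then soft-threshold at step * lam."""
    z = beta - step * grad
    return np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
```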

Two-step algorithm (L. '17)

Two-step M-estimator: finds local stationary points of a nonconvex, robust loss + $\mu$-amenable penalty:
$$\hat{\beta} \in \arg\min_{\|\beta\|_1 \le R} \Big\{ \frac{1}{n} \sum_{i=1}^n \ell(x_i^T \beta - y_i) + \rho_\lambda(\beta) \Big\}.$$

Algorithm (a sketch follows below):
1. Run composite gradient descent on a convex, robust loss + $\ell_1$-penalty until convergence; output $\hat{\beta}^H$.
2. Run composite gradient descent on the nonconvex, robust loss + $\mu$-amenable penalty, with input $\beta^0 = \hat{\beta}^H$.

Important: We want to optimize the original nonconvex objective, since it leads to more efficient (lower-variance) estimators.
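
Below is a self-contained sketch of the two-step procedure under simplifying assumptions: step 1 runs composite gradient descent on the convex Huber loss with an $\ell_1$ penalty, and step 2 warm-starts the same routine on the nonconvex Tukey loss from $\hat{\beta}^H$. The talk pairs the nonconvex loss with a $\mu$-amenable penalty such as SCAD; the $\ell_1$ penalty, the tuning constants, and the omission of the $\|\beta\|_1 \le R$ constraint here are all simplifications of my own.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def huber_psi(r, delta=1.345):
    return np.clip(r, -delta, delta)                      # monotone psi (convex loss)

def tukey_psi(r, c=4.685):
    return np.where(np.abs(r) <= c, r * (1 - (r / c) ** 2) ** 2, 0.0)  # redescending psi

def composite_gd(X, y, psi, lam, beta0, iters=500):
    """Composite gradient descent on (1/n) sum_i l(x_i^T beta - y_i) + lam * ||beta||_1,
    where psi = l'.  The loss gradient is X^T psi(X beta - y) / n."""
    n, _ = X.shape
    step = n / np.linalg.norm(X, 2) ** 2                  # ~1/L, since |psi'| <= 1 for these losses
    beta = beta0.copy()
    for _ in range(iters):
        grad = X.T @ psi(X @ beta - y) / n
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

def two_step_estimator(X, y, lam):
    """Step 1: convex Huber loss + l1, started at zero.  Step 2: nonconvex Tukey loss,
    warm-started at the Huber solution (cf. Huber's advice above)."""
    beta_huber = composite_gd(X, y, huber_psi, lam, np.zeros(X.shape[1]))
    return composite_gd(X, y, tukey_psi, lam, beta_huber)

# Hypothetical usage with the simulated (X, y) from the earlier sketch:
# lam = 2.0 * np.sqrt(np.log(X.shape[1]) / X.shape[0])
# beta_hat = two_step_estimator(X, y, lam)
```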

Simulation 1

[Figures: $\ell_2$-error $\|\hat{\beta} - \beta^*\|_2$ and empirical variance of the first component, plotted against $n/(k \log p)$, for Huber and Cauchy losses with $p \in \{128, 256, 512\}$.]

$\ell_2$-error and empirical variance of M-estimators when the errors follow a Cauchy distribution (SCAD regularizer).

Can prove geometric convergence of the two-step algorithm to desirable local optima (L. '17).

Summary

Loss functions with desirable robustness properties in low-dimensional regression are also good in high dimensions:
bounded influence ($\|\ell'\|_\infty \le C$) $\Rightarrow$ $O\big(\sqrt{\frac{k \log p}{n}}\big)$-consistency.

Two-step optimization procedure: the first step gives consistency, the second step gives efficiency.

Loh (2017). Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. Annals of Statistics.

Trailer

Problem: The loss function $\ell$ is in some sense calibrated to the scale of $\epsilon_i$.

Better objective (joint location/scale estimator):
$$(\hat{\beta}, \hat{\sigma}) \in \arg\min_{\beta, \sigma} \Big\{ \underbrace{\frac{1}{n} \sum_{i=1}^n \ell\Big(\frac{y_i - x_i^T \beta}{\sigma}\Big) \sigma + a \sigma}_{L_n(\beta, \sigma)} + \lambda \|\beta\|_1 \Big\}.$$

However, joint location/scale estimation is notoriously difficult even in low dimensions.
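
For concreteness, a small sketch (hypothetical names) that just evaluates this joint objective for a given loss $\ell$; it is only meant to spell out the formula above.

```python
import numpy as np

def joint_location_scale_objective(beta, sigma, X, y, loss, a, lam):
    """(1/n) * sum_i sigma * l((y_i - x_i^T beta) / sigma) + a * sigma + lam * ||beta||_1."""
    r = (y - X @ beta) / sigma
    return sigma * np.mean(loss(r)) + a * sigma + lam * np.linalg.norm(beta, 1)
```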

Trailer

Another idea: MM-estimator
$$\hat{\beta} \in \arg\min_{\beta} \Big\{ \frac{1}{n} \sum_{i=1}^n \ell\Big(\frac{y_i - x_i^T \beta}{\hat{\sigma}_0}\Big) + \lambda \|\beta\|_1 \Big\},$$
using a robust estimate of scale $\hat{\sigma}_0$ based on a preliminary estimate $\hat{\beta}_0$.

How to obtain $(\hat{\beta}_0, \hat{\sigma}_0)$?

S-estimators/LMS:
$$\hat{\beta}_0 \in \arg\min_{\beta} \{ \hat{\sigma}(r(\beta)) \}, \qquad \text{where } \hat{\sigma}(r) = |r|_{(n - \lceil n\delta \rceil)}.$$

LTS:
$$\hat{\beta}_0 \in \arg\min_{\beta} \Big\{ \frac{1}{n} \sum_{i=1}^{n - \lceil n\alpha \rceil} (y_i - x_i^T \beta)^2_{(i)} + \lambda \|\beta\|_1 \Big\}.$$
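
As an illustration of the trimmed criterion, a small sketch (hypothetical names) evaluating the penalized LTS objective: it sorts the squared residuals and averages only the $n - \lceil n\alpha \rceil$ smallest before adding the $\ell_1$ penalty.

```python
import numpy as np

def penalized_lts_objective(beta, X, y, lam, alpha=0.25):
    """(1/n) * sum of the (n - ceil(n*alpha)) smallest squared residuals + lam * ||beta||_1."""
    n = len(y)
    keep = n - int(np.ceil(alpha * n))
    smallest_sq_resid = np.sort((y - X @ beta) ** 2)[:keep]   # order statistics of squared residuals
    return smallest_sq_resid.sum() / n + lam * np.linalg.norm(beta, 1)
```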

Trailer

Maybe an entirely different approach is necessary...

Loh (2017). Scale estimation for high-dimensional robust regression. Coming soon?

Thank you!