Estimators based on non-convex programs: Statistical and computational guarantees


Martin Wainwright, UC Berkeley (Statistics and EECS)
Based on joint work with Po-Ling Loh (UC Berkeley)
May 2013

Introduction

Prediction/regression problems arise throughout statistics:
- vector of predictors/covariates x \in R^p
- response variable y \in R

In the big-data setting, both the sample size n and the ambient dimension p are large, and many problems have p >> n. Regularization is essential:

\widehat{\theta} \in \arg\min_{\theta} \Big\{ \underbrace{\frac{1}{n} \sum_{i=1}^n \mathcal{L}(\theta; x_i, y_i)}_{\mathcal{L}_n(\theta; x_1^n, y_1^n)} + \underbrace{\mathcal{R}_\lambda(\theta)}_{\text{regularizer}} \Big\}.

For non-convex problems: Mind the gap!
- any global optimum is statistically good...
- ...but efficient algorithms only find local optima.

Question: How can we close this undesirable gap between statistics and computation?

Vignette A: Regression with non-convex penalties

Set-up: observe (y_i, x_i) pairs for i = 1, 2, ..., n, where y_i ~ Q(. | <\theta^*, x_i>) and \theta^* \in R^p has low-dimensional structure.

[Figure: y \in R^n generated via Q(.) from X \in R^{n x p} and a sparse \theta^* supported on S, with S^c denoting the remaining coordinates.]

Estimator: R_\lambda-regularized likelihood

\widehat{\theta} \in \arg\min_{\theta} \Big\{ -\frac{1}{n} \sum_{i=1}^n \log Q(y_i \mid x_i, \theta) + \mathcal{R}_\lambda(\theta) \Big\}.

Example: logistic regression for binary responses y_i \in {0, 1}:

\widehat{\theta} \in \arg\min_{\theta} \Big\{ \frac{1}{n} \sum_{i=1}^n \big[ \log(1 + e^{\langle x_i, \theta \rangle}) - y_i \langle x_i, \theta \rangle \big] + \mathcal{R}_\lambda(\theta) \Big\}.

Many non-convex penalties are possible:
- capped l1-penalty
- SCAD penalty (Fan & Li, 2001)
- MCP penalty (Zhang, 2006)
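For concreteness, here is a minimal Python/NumPy sketch that evaluates the penalized logistic objective with the SCAD penalty of Fan & Li (2001); the function names, data dimensions, and default parameter values are illustrative choices, not taken from the talk.

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty (Fan & Li, 2001), applied elementwise to t."""
    t = np.abs(t)
    middle = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))
    return np.where(t <= lam, lam * t,
                    np.where(t <= a * lam, middle, lam**2 * (a + 1) / 2))

def penalized_logistic_objective(theta, X, y, lam, a=3.7):
    """(1/n) sum_i [log(1 + exp(<x_i, theta>)) - y_i <x_i, theta>] + sum_j R_lam(theta_j)."""
    z = X @ theta
    loss = np.mean(np.logaddexp(0.0, z) - y * z)  # numerically stable log(1 + e^z)
    return loss + scad_penalty(theta, lam, a).sum()

# Tiny usage example with simulated data.
rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
y = rng.integers(0, 2, size=n).astype(float)
print(penalized_logistic_objective(np.zeros(p), X, y, lam=0.1))
```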

Convex and non-convex regularizers

[Figure: comparison of univariate regularizers R_\lambda(t) over t \in [-1, 1]: MCP, L1, and capped L1.]
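A short sketch of how the three curves in the figure could be generated; the parameter values (lam = 0.5, MCP parameter b = 1.5, capped-l1 cap gamma = 0.5) and the capped-l1 parameterization are illustrative guesses, not read off the slide.

```python
import numpy as np

def l1_penalty(t, lam):
    """Rescaled l1 penalty: R_lam(t) = lam * |t|."""
    return lam * np.abs(t)

def mcp_penalty(t, lam, b=1.5):
    """MCP: lam*|t| - t^2/(2b) for |t| <= b*lam, constant b*lam^2/2 beyond."""
    t = np.abs(t)
    return np.where(t <= b * lam, lam * t - t**2 / (2 * b), b * lam**2 / 2)

def capped_l1_penalty(t, lam, gamma=0.5):
    """Capped l1 (one common parameterization): lam * min(|t|, gamma)."""
    return lam * np.minimum(np.abs(t), gamma)

# Curves over t in [-1, 1], as in the slide's comparison plot.
t_grid = np.linspace(-1.0, 1.0, 201)
curves = {
    "L1": l1_penalty(t_grid, lam=0.5),
    "MCP": mcp_penalty(t_grid, lam=0.5),
    "Capped L1": capped_l1_penalty(t_grid, lam=0.5),
}
```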

Statistical error versus optimization error

An algorithm generates a sequence of iterates {\theta^t}_{t=0}^{\infty} to solve

\widehat{\theta} \in \arg\min_{\theta} \Big\{ \frac{1}{n} \sum_{i=1}^n \mathcal{L}(\theta; x_i, y_i) + \mathcal{R}_\lambda(\theta) \Big\}.

The global minimizer of the population risk is

\theta^* := \arg\min_{\theta} \underbrace{E_{X,Y}\big[ \mathcal{L}(\theta; X, Y) \big]}_{\bar{\mathcal{L}}(\theta)}.

Goal of the statistician: provide bounds on the statistical error \|\theta^t - \theta^*\| or \|\widehat{\theta} - \theta^*\|.

Goal of the optimization theorist: provide bounds on the optimization error \|\theta^t - \widehat{\theta}\|.

Logistic regression with non-convex regularizer

[Figure: log error plot for logistic regression with SCAD (a = 3.7): log ||\beta^t - \beta^*||_2 versus iteration count, showing the optimization-error and statistical-error curves.]

What phenomena need to be explained?

Empirical observation #1: from a statistical perspective, all local optima are essentially as good as a global optimum.

Some past work:
- for the least-squares loss, certain local optima are good (Zhang & Zhang, 2012)
- if initialized at a Lasso solution with l_infinity-guarantees, a local algorithm has good behavior (Fan et al., 2012)

Empirical observation #2: first-order methods converge as fast as possible, up to statistical precision.

Vignette B: Error-in-variables regression

Begin with high-dimensional sparse regression:

y = X \theta^* + \epsilon, where y \in R^n, X \in R^{n x p}, and \theta^* is sparse with support S.

We observe y \in R^n together with a corrupted matrix Z = F(X, W) \in R^{n x p}, where W \in R^{n x p} is a stochastic perturbation.

Example: missing data. Z = X \odot W, where W is a multiplicative {0,1} perturbation (e.g., Bernoulli entries, so that each entry of X is observed independently with probability 1 - \alpha and missing otherwise).

Example: additive perturbations. Z = X + W, where W is an additive perturbation (e.g., W_{ij} ~ N(0, \sigma^2)).
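To fix ideas, a small simulation sketch of the two corruption mechanisms; all dimensions, sparsity levels, corruption rates, and noise scales below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 200, 512, 10

# Sparse linear regression: y = X theta* + eps, with theta* supported on its first s coordinates.
X = rng.standard_normal((n, p))
theta_star = np.zeros(p)
theta_star[:s] = 1.0 / np.sqrt(s)
y = X @ theta_star + 0.1 * rng.standard_normal(n)

# Missing data: each entry of X observed independently with probability 1 - alpha,
# with unobserved entries recorded as 0 in Z.
alpha = 0.2
mask = rng.binomial(1, 1 - alpha, size=(n, p))
Z_missing = X * mask

# Additive perturbation: Z = X + W with W_ij ~ N(0, sigma_w^2).
sigma_w = 0.2
Z_additive = X + sigma_w * rng.standard_normal((n, p))
```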

A second look at regularized least-squares

Equivalent formulation:

\widehat{\theta} \in \arg\min_{\theta \in R^p} \Big\{ \frac{1}{2} \theta^T \Big( \frac{X^T X}{n} \Big) \theta - \Big\langle \frac{X^T y}{n}, \theta \Big\rangle + \mathcal{R}_\lambda(\theta) \Big\}.

Population view: the sample quantities are unbiased estimators,

cov(x_1) = E\Big[ \frac{X^T X}{n} \Big], \qquad cov(x_1, y_1) = E\Big[ \frac{X^T y}{n} \Big].

A general family of corrected estimators:

\widehat{\theta} \in \arg\min_{\theta \in R^p} \Big\{ \frac{1}{2} \theta^T \widehat{\Gamma} \theta - \langle \widehat{\gamma}, \theta \rangle + \lambda_n \|\theta\|_1 \Big\},

where (\widehat{\Gamma}, \widehat{\gamma}) are unbiased estimators of cov(x_1) and cov(x_1, y_1).

Example: estimator for missing data

We observe a corrupted version Z \in R^{n x p}:

Z_{ij} = X_{ij} with probability 1 - \alpha, and Z_{ij} is missing with probability \alpha.

Natural unbiased estimates: set missing entries to 0, define \widehat{Z} := Z / (1 - \alpha), and take

\widehat{\Gamma} = \frac{\widehat{Z}^T \widehat{Z}}{n} - \alpha \, \mathrm{diag}\Big( \frac{\widehat{Z}^T \widehat{Z}}{n} \Big), \qquad \widehat{\gamma} = \frac{\widehat{Z}^T y}{n}.

Then solve the (doubly non-convex) optimization problem (Loh & W., 2012):

\widehat{\theta} \in \arg\min_{\theta \in \Omega} \Big\{ \frac{1}{2} \theta^T \widehat{\Gamma} \theta - \langle \widehat{\gamma}, \theta \rangle + \mathcal{R}_\lambda(\theta) \Big\}.
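The unbiased surrogates (\widehat{\Gamma}, \widehat{\gamma}) are simple to form; here is a minimal Python/NumPy sketch for the missing-data case, together with the resulting (possibly non-convex) quadratic loss. Function names are my own.

```python
import numpy as np

def corrected_moments_missing(Z, y, alpha):
    """Unbiased surrogates (Gamma_hat, gamma_hat) for cov(x_1) and cov(x_1, y_1)
    when entries are missing independently with probability alpha; missing entries
    of Z are assumed to be recorded as 0."""
    n = Z.shape[0]
    Z_hat = Z / (1.0 - alpha)                          # rescale observed entries
    Gamma_hat = Z_hat.T @ Z_hat / n
    Gamma_hat -= alpha * np.diag(np.diag(Gamma_hat))   # correct the diagonal bias
    gamma_hat = Z_hat.T @ y / n
    return Gamma_hat, gamma_hat

def corrected_quadratic_loss(theta, Gamma_hat, gamma_hat):
    """L_n(theta) = 0.5 * theta^T Gamma_hat theta - <gamma_hat, theta>.
    When p > n, Gamma_hat typically has negative eigenvalues, so this loss is non-convex."""
    return 0.5 * theta @ (Gamma_hat @ theta) - gamma_hat @ theta
```

Adding the (possibly non-convex) regularizer R_\lambda then gives the doubly non-convex program above.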

Non-convex quadratic and non-convex regularizer

[Figure: log error plot for corrected linear regression with MCP (b = 1.5): log ||\beta^t - \beta^*||_2 versus iteration count, showing the optimization-error and statistical-error curves.]

Remainder of talk

1. Why are all local optima statistically good?
   - restricted strong convexity
   - a general theorem
   - various examples

2. Why do first-order gradient methods converge quickly?
   - composite gradient methods
   - statistical versus optimization error
   - fast convergence for non-convex problems

Geometry of a non-convex quadratic loss

[Figure: surface plot of a non-convex quadratic loss over [-1, 1]^2.]

The loss function has directions of both positive and negative curvature. The directions of negative curvature must be forbidden by the regularizer.

Restricted strong convexity

Here defined with respect to the l1-norm.

Definition. The loss function L_n satisfies RSC with parameters (\alpha_j, \tau_j), j = 1, 2, if the curvature measure satisfies

\langle \nabla \mathcal{L}_n(\theta^* + \Delta) - \nabla \mathcal{L}_n(\theta^*), \Delta \rangle \;\ge\;
\begin{cases} \alpha_1 \|\Delta\|_2^2 - \tau_1 \frac{\log p}{n} \|\Delta\|_1^2 & \text{if } \|\Delta\|_2 \le 1, \\ \alpha_2 \|\Delta\|_2 - \tau_2 \sqrt{\frac{\log p}{n}} \|\Delta\|_1 & \text{if } \|\Delta\|_2 > 1. \end{cases}

This condition:
- holds with \tau_1 = \tau_2 = 0 for any function that is locally strongly convex around \theta^*
- holds for a variety of loss functions (convex and non-convex):
  - ordinary least squares (Raskutti, W. & Yu, 2010)
  - likelihoods for generalized linear models (Negahban et al., 2012)
  - certain non-convex quadratic functions (Loh & W., 2012)
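For the corrected quadratic loss, the curvature term has the closed form <grad L_n(theta* + Delta) - grad L_n(theta*), Delta> = Delta^T Gamma_hat Delta, so the condition can be probed numerically. The sketch below checks the ||Delta||_2 <= 1 branch along random sparse unit directions; the values of alpha_1 and tau_1 (and all dimensions) are made-up illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, s, alpha = 200, 512, 10, 0.2

# Corrected Gamma_hat for a missing-data design (as in the earlier sketch).
X = rng.standard_normal((n, p))
Z_hat = (X * rng.binomial(1, 1 - alpha, size=(n, p))) / (1 - alpha)
Gamma_hat = Z_hat.T @ Z_hat / n
Gamma_hat -= alpha * np.diag(np.diag(Gamma_hat))

def rsc_lower_bound(delta, alpha1, tau1, n, p):
    """alpha1 ||delta||_2^2 - tau1 (log p / n) ||delta||_1^2 (the ||delta||_2 <= 1 branch)."""
    l2 = np.linalg.norm(delta)
    l1 = np.linalg.norm(delta, 1)
    return alpha1 * l2**2 - tau1 * (np.log(p) / n) * l1**2

# Probe random s-sparse unit directions; alpha1, tau1 below are illustrative guesses.
alpha1, tau1 = 0.5, 1.0
for _ in range(5):
    delta = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)
    delta[support] = rng.standard_normal(s)
    delta /= np.linalg.norm(delta)
    curvature = delta @ (Gamma_hat @ delta)  # exact curvature term for the quadratic loss
    print(curvature >= rsc_lower_bound(delta, alpha1, tau1, n, p))
```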

Well-behaved regularizers

Properties are defined at the univariate level, R_\lambda : R -> [0, \infty]:
- R_\lambda(0) = 0, and R_\lambda is symmetric around zero (R_\lambda(t) = R_\lambda(-t))
- non-decreasing and subadditive: R_\lambda(s + t) \le R_\lambda(s) + R_\lambda(t)
- the function t -> R_\lambda(t) / t is non-increasing for t > 0
- differentiable for all t != 0, subdifferentiable at t = 0, with subgradients bounded in absolute value by \lambda L
- for some \mu > 0, the function \bar{R}_\lambda(t) := R_\lambda(t) + \mu t^2 is convex (a numerical check for MCP appears below)

This class includes (among others) the rescaled l1 penalty R_\lambda(t) = \lambda |t|, and the MCP and SCAD penalties (Fan & Li, 2001; Zhang, 2006); it does not include the capped l1-penalty.
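The last property (weak convexity) can be verified directly for MCP: with parameter b, the penalty has curvature -1/b on (0, b*lam), so any mu >= 1/(2b) suffices. A small numerical sanity check via discrete second differences; the parameter values are illustrative.

```python
import numpy as np

def mcp_penalty(t, lam, b):
    """MCP: lam*|t| - t^2/(2b) for |t| <= b*lam, constant b*lam^2/2 beyond."""
    t = np.abs(t)
    return np.where(t <= b * lam, lam * t - t**2 / (2 * b), b * lam**2 / 2)

def numerically_convex(f, grid, tol=1e-10):
    """True if f has nonnegative discrete second differences on a uniform grid."""
    v = f(grid)
    return bool(np.all(v[:-2] - 2 * v[1:-1] + v[2:] >= -tol))

lam, b = 0.5, 1.5
mu = 1.0 / (2 * b)   # smallest mu that convexifies MCP
grid = np.linspace(-3.0, 3.0, 2001)
print(numerically_convex(lambda t: mcp_penalty(t, lam, b) + mu * t**2, grid))  # True
print(numerically_convex(lambda t: mcp_penalty(t, lam, b), grid))              # False: MCP alone is non-convex
```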

Main statistical guarantee

Regularized M-estimator:

\widehat{\theta} \in \arg\min_{\|\theta\|_1 \le M} \big\{ \mathcal{L}_n(\theta) + \mathcal{R}_\lambda(\theta) \big\},

where the loss function satisfies (\alpha, \tau)-RSC and the regularizer is regular with parameters (\mu, L).

A local optimum \widetilde{\theta} is defined by the first-order condition

\langle \nabla \mathcal{L}_n(\widetilde{\theta}) + \nabla \mathcal{R}_\lambda(\widetilde{\theta}), \, \theta - \widetilde{\theta} \rangle \ge 0 \quad \text{for all feasible } \theta.

Theorem (Loh & W., 2013). Suppose M is chosen such that \theta^* is feasible, and \lambda satisfies the bounds

\max\Big\{ \|\nabla \mathcal{L}_n(\theta^*)\|_\infty, \; \alpha_2 \sqrt{\frac{\log p}{n}} \Big\} \;\le\; \lambda \;\le\; \frac{\alpha_2}{6 L M}.

Then any local optimum \widetilde{\theta} satisfies the bound

\|\widetilde{\theta} - \theta^*\|_2 \;\le\; \frac{6 \lambda \sqrt{s}}{4 (\alpha - \mu)}, \quad \text{where } s = \|\theta^*\|_0.

Geometry of local/global optima

[Figure: all local and global optima \widetilde{\theta}, \widehat{\theta} lie within a ball of radius \epsilon_stat around the target \theta^*.]

Consequence: all local and global optima are within distance \epsilon_stat of the target \theta^*.

With \lambda = c \sqrt{\log p / n}, the statistical error scales as

\epsilon_{stat} \lesssim \sqrt{\frac{s \log p}{n}},

which is minimax optimal.

Empirical results (unrescaled)

[Figure: mean-squared error versus raw sample size n, for p in {128, 256, 512, 1024}.]

Empirical results (rescaled)

[Figure: mean-squared error versus rescaled sample size, for p in {128, 256, 512, 1024}.]

Comparisons between different penalties

[Figure: comparing penalties for corrected linear regression: l2-norm error versus n/(k log p), for p in {64, 128, 256}.]

First-order algorithms and fast convergence

Thus far we have shown that all local optima are statistically good. How can a local optimum be obtained quickly?

Composite gradient descent for regularized objectives (Nesterov, 2007):

\min_{\theta \in \Omega} \big\{ f(\theta) + g(\theta) \big\},

where f is differentiable and g is convex and sub-differentiable. Simple updates:

\theta^{t+1} = \arg\min_{\theta \in \Omega} \Big\{ \frac{1}{2 \alpha_t} \big\| \theta - \big( \theta^t - \alpha_t \nabla f(\theta^t) \big) \big\|_2^2 + g(\theta) \Big\}.

This is not directly applicable with f = L_n and g = R_\lambda, since R_\lambda can be non-convex.
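For intuition, a generic composite gradient (proximal gradient) step in Python/NumPy, instantiated with a smooth least-squares loss and g = lam*||.||_1, whose proximal map is soft-thresholding; Omega = R^p here for simplicity, and all names and parameter values are illustrative.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Proximal operator of kappa * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def composite_gradient_step(theta, grad_f, g_prox, step):
    """One composite gradient update:
    argmin_theta (1/(2*step)) ||theta - (theta_t - step * grad f(theta_t))||_2^2 + g(theta),
    i.e. a gradient step on f followed by the proximal map of step * g."""
    return g_prox(theta - step * grad_f(theta), step)

# Example: f = least-squares loss (smooth), g = lam * ||.||_1.
rng = np.random.default_rng(0)
n, p, lam, step = 100, 50, 0.1, 0.01
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
grad_f = lambda th: X.T @ (X @ th - y) / n
g_prox = lambda v, s: soft_threshold(v, s * lam)

theta = np.zeros(p)
for _ in range(500):
    theta = composite_gradient_step(theta, grad_f, g_prox, step)
```

With a non-convex R_\lambda this update is not directly usable as written, which motivates the splitting on the next slide.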

Composite gradient on a convenient splitting

Define modified loss functions and regularizers:

\underbrace{\bar{\mathcal{L}}_n(\theta) := \mathcal{L}_n(\theta) - \mu \|\theta\|_2^2}_{\text{non-convex}}, \qquad \underbrace{\bar{\mathcal{R}}_\lambda(\theta) := \mathcal{R}_\lambda(\theta) + \mu \|\theta\|_2^2}_{\text{convex}}.

Apply composite gradient descent to the objective

\min_{\theta \in \Omega} \big\{ \bar{\mathcal{L}}_n(\theta) + \bar{\mathcal{R}}_\lambda(\theta) \big\}.

This converges to a local optimum \widetilde{\theta} (Nesterov, 2007):

\langle \nabla \mathcal{L}_n(\widetilde{\theta}) + \nabla \mathcal{R}_\lambda(\widetilde{\theta}), \, \theta - \widetilde{\theta} \rangle \ge 0 \quad \text{for all feasible } \theta.

We will show that convergence is geometrically fast with a constant stepsize.
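A hedged sketch of one composite gradient step on the split objective, for the corrected quadratic loss and the MCP penalty with parameter b (so mu = 1/(2b) makes R_bar convex and separable). For clarity, the coordinate-wise proximal map of R_bar is computed by a crude grid search rather than the closed-form update one would use in practice; all names and parameter choices are mine.

```python
import numpy as np

def mcp_penalty(t, lam, b):
    t = np.abs(t)
    return np.where(t <= b * lam, lam * t - t**2 / (2 * b), b * lam**2 / 2)

def prox_R_bar_1d(v, step, lam, b, mu, grid=np.linspace(-5.0, 5.0, 5001)):
    """Coordinate-wise prox of step * R_bar, with R_bar(t) = R_lam(t) + mu*t^2,
    computed by grid search (illustrative only): argmin_t (t - v)^2/(2*step) + R_bar(t)."""
    obj = (grid - v) ** 2 / (2 * step) + mcp_penalty(grid, lam, b) + mu * grid**2
    return grid[np.argmin(obj)]

def composite_step_split(theta, Gamma_hat, gamma_hat, lam, b, step):
    """One composite gradient step on L_bar + R_bar, where
    L_bar(theta) = 0.5 theta^T Gamma_hat theta - <gamma_hat, theta> - mu ||theta||_2^2."""
    mu = 1.0 / (2 * b)
    grad_L_bar = Gamma_hat @ theta - gamma_hat - 2 * mu * theta
    v = theta - step * grad_L_bar
    return np.array([prox_R_bar_1d(vi, step, lam, b, mu) for vi in v])

# One illustrative step from theta = 0 with a small random (symmetric) surrogate.
rng = np.random.default_rng(0)
p = 5
A = rng.standard_normal((p, p))
Gamma_hat = np.eye(p) + 0.05 * (A + A.T)
gamma_hat = rng.standard_normal(p)
print(composite_step_split(np.zeros(p), Gamma_hat, gamma_hat, lam=0.1, b=1.5, step=0.5))
```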

Theoretical guarantees on computational error

Set-up:
- implement Nesterov's composite method with constant stepsize on the (\bar{L}_n, \bar{R}_\lambda) split
- a fixed global optimum \widehat{\theta} defines the statistical error \epsilon_{stat}^2 = \|\widehat{\theta} - \theta^*\|_2^2
- the population minimizer \theta^* is s-sparse
- the loss function satisfies (\alpha, \tau)-RSC and smoothness conditions, and the regularizer is (\mu, L)-good

Theorem (Loh & W., 2013). If n \gtrsim s \log p, there is a contraction factor \kappa \in (0, 1) such that for any \delta \ge \epsilon_{stat},

\|\theta^t - \widehat{\theta}\|_2^2 \;\le\; \frac{2}{\alpha - \mu} \Big( \delta^2 + 128 \, \tau \, \frac{s \log p}{n} \, \epsilon_{stat}^2 \Big)

for all t \ge T(\delta) iterations, where T(\delta) \asymp \frac{\log(1/\delta)}{\log(1/\kappa)}.
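The log-error plots surrounding this result can be reproduced by tracking both error sequences along the iterates. A rough sketch follows, with hypothetical inputs (a list of iterates, the target theta*, and a numerically converged optimum theta_hat) and a crude slope-based estimate of the contraction factor kappa from the geometric phase.

```python
import numpy as np

def error_trajectories(iterates, theta_hat, theta_star):
    """Log optimization error log||theta^t - theta_hat||_2 and
    log statistical error log||theta^t - theta_star||_2, as in the slides' plots."""
    opt = np.array([np.linalg.norm(th - theta_hat) for th in iterates])
    stat = np.array([np.linalg.norm(th - theta_star) for th in iterates])
    return np.log(opt), np.log(stat)

def estimate_contraction_factor(log_opt_err, burn_in=10, floor=-8.0):
    """Crude estimate of kappa from the slope of the geometric phase of the log
    optimization error, ignoring iterations after it flattens near the tolerance."""
    phase = log_opt_err[burn_in:]
    phase = phase[phase > floor]
    slope = np.polyfit(np.arange(len(phase)), phase, 1)[0]
    return float(np.exp(slope))   # log error drops by roughly log(kappa) per iteration

# Synthetic sanity check: geometric decay with kappa = 0.9 down to a statistical floor.
t = np.arange(200)
fake_log_opt = np.maximum(np.log(0.9) * t, -6.0)
print(estimate_contraction_factor(fake_log_opt, burn_in=0, floor=-5.9))  # ~0.9
```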

Geometry of result

[Figure: iterates \theta^t converge to \widehat{\theta}, which lies within statistical precision \epsilon of \theta^*.]

The optimization error \Delta^t := \theta^t - \widehat{\theta} decreases geometrically up to the statistical tolerance:

\|\theta^{t+1} - \widehat{\theta}\|_2^2 \;\le\; \kappa^t \|\theta^0 - \widehat{\theta}\|_2^2 + o\big( \underbrace{\|\widehat{\theta} - \theta^*\|_2^2}_{\text{stat. error } \epsilon_{stat}^2} \big) \quad \text{for all } t = 0, 1, 2, \ldots

Non-convex linear regression with SCAD

[Figure: log error plot for corrected linear regression with SCAD (a = 2.5): log ||\beta^t - \beta^*||_2 versus iteration count, showing the optimization-error and statistical-error curves.]

[Figure: log error plot for corrected linear regression with SCAD (a = 3.7): same quantities.]

Summary

M-estimators based on non-convex programs arise frequently. Under suitable regularity conditions, we showed that:
- all local optima are well-behaved from the statistical point of view
- simple first-order methods converge as fast as possible

Many open questions remain:
- similar guarantees for more general problems?
- geometry of non-convex problems in statistics?

Papers and pre-prints:
- Loh & W. (2013). Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Pre-print, arXiv:1305.2436.
- Loh & W. (2012). High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. Annals of Statistics, 40:1637-1664.
- Negahban et al. (2012). A unified framework for high-dimensional analysis of M-estimators. Statistical Science, 27(4):538-557.