Estimators based on non-convex programs: Statistical and computational guarantees
Martin Wainwright, UC Berkeley, Statistics and EECS
Based on joint work with: Po-Ling Loh (UC Berkeley)
May 2013
Introduction
Prediction/regression problems arise throughout statistics:
- vector of predictors/covariates $x \in \mathbb{R}^p$
- response variable $y \in \mathbb{R}$
In the big-data setting, both the sample size $n$ and the ambient dimension $p$ are large; many problems have $p \gg n$.
Regularization is essential:
$$\widehat{\theta} \in \arg\min_{\theta} \Big\{ \underbrace{\frac{1}{n}\sum_{i=1}^{n} \mathcal{L}(\theta; x_i, y_i)}_{\mathcal{L}_n(\theta;\, x_1^n, y_1^n)} + \underbrace{\mathcal{R}_{\lambda}(\theta)}_{\text{Regularizer}} \Big\}.$$
For non-convex problems: Mind the gap!
- any global optimum is statistically good...
- but efficient algorithms only find local optima
Question: How to close this undesirable gap between statistics and computation?
Vignette A: Regression with non-convex penalties
Set-up: Observe $(y_i, x_i)$ pairs for $i = 1, 2, \ldots, n$, where $y_i \sim Q(\cdot \mid \langle \theta^*, x_i \rangle)$, and $\theta^* \in \mathbb{R}^p$ has low-dimensional structure.
[Figure: response $y \in \mathbb{R}^n$ drawn from $Q(\cdot)$ applied to $X\theta^*$, with design matrix $X \in \mathbb{R}^{n \times p}$ and $\theta^*$ supported on a small subset $S$, so $\theta^*_{S^c} = 0$.]
Estimator: $\mathcal{R}_\lambda$-regularized likelihood
$$\widehat{\theta} \in \arg\min_{\theta} \Big\{ -\frac{1}{n} \sum_{i=1}^{n} \log Q(y_i \mid x_i, \theta) + \mathcal{R}_{\lambda}(\theta) \Big\}.$$
Vignette A: Regression with non-convex penalties
Example: Logistic regression for binary responses $y_i \in \{0, 1\}$:
$$\widehat{\theta} \in \arg\min_{\theta} \Big\{ \frac{1}{n} \sum_{i=1}^{n} \big[ \log(1 + e^{\langle x_i, \theta \rangle}) - y_i \langle x_i, \theta \rangle \big] + \mathcal{R}_{\lambda}(\theta) \Big\}.$$
Many non-convex penalties are possible:
- capped $\ell_1$-penalty
- SCAD penalty (Fan & Li, 2001)
- MCP penalty (Zhang, 2006)
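As a minimal illustration (a sketch with our own function names, not code from the talk), the logistic objective above can be evaluated as:

```python
import numpy as np

def logistic_loss(theta, X, y):
    """Average logistic negative log-likelihood:
    (1/n) * sum_i [ log(1 + exp(<x_i, theta>)) - y_i * <x_i, theta> ]."""
    u = X @ theta
    # np.logaddexp(0, u) computes log(1 + exp(u)) stably
    return np.mean(np.logaddexp(0.0, u) - y * u)
```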
Convex and non-convex regularizers
[Figure: univariate regularizers $\mathcal{R}_\lambda(t)$ plotted over $t \in [-1, 1]$, comparing the MCP, $\ell_1$, and capped $\ell_1$ penalties.]
Statistical error versus optimization error
Algorithm generating a sequence of iterates $\{\theta^t\}_{t=0}^{\infty}$ to solve
$$\widehat{\theta} \in \arg\min_{\theta} \Big\{ \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(\theta; x_i, y_i) + \mathcal{R}_{\lambda}(\theta) \Big\}.$$
Global minimizer of the population risk:
$$\theta^* := \arg\min_{\theta} \underbrace{\mathbb{E}_{X,Y}\big[ \mathcal{L}(\theta; X, Y) \big]}_{\bar{\mathcal{L}}(\theta)}.$$
Goal of the statistician: provide bounds on the statistical error $\|\theta^t - \theta^*\|$ or $\|\widehat{\theta} - \theta^*\|$.
Goal of the optimization theorist: provide bounds on the optimization error $\|\theta^t - \widehat{\theta}\|$.
Logistic regression with non-convex regularizer
[Figure: log error plot for logistic regression with SCAD, $a = 3.7$: $\log(\|\beta^t - \beta^*\|_2)$ versus iteration count, showing both optimization-error and statistical-error curves.]
What phenomena need to be explained?
Empirical observation #1: From a statistical perspective, all local optima are essentially as good as a global optimum.
Some past work:
- for the least-squares loss, certain local optima are good (Zhang & Zhang, 2012)
- if initialized at a Lasso solution with $\ell_\infty$-guarantees, a local algorithm has good behavior (Fan et al., 2012)
Empirical observation #2: First-order methods converge as fast as possible, up to statistical precision.
Vignette B: Errors-in-variables regression
Begin with high-dimensional sparse regression: $y = X\theta^* + \varepsilon$, where $\theta^* \in \mathbb{R}^p$ is supported on a small subset $S$.
Observe $y \in \mathbb{R}^n$ and a corrupted matrix $Z \in \mathbb{R}^{n \times p}$, where
$$Z = F(X, W),$$
and $W \in \mathbb{R}^{n \times p}$ is a stochastic perturbation.
Example: Missing data
$$Z = X \odot W,$$
where $W \in \mathbb{R}^{n \times p}$ is a multiplicative perturbation (e.g., $W_{ij} \sim \mathrm{Ber}(\alpha)$), applied to the linear model $y = X\theta^* + \varepsilon$.
Example: Additive perturbations
Additive noise in the covariates:
$$Z = X + W,$$
where $W \in \mathbb{R}^{n \times p}$ is an additive perturbation (e.g., $W_{ij} \sim N(0, \sigma^2)$), applied to the linear model $y = X\theta^* + \varepsilon$.
Corrected estimators
Equivalent formulation of regularized least-squares:
$$\widehat{\theta} \in \arg\min_{\theta \in \mathbb{R}^p} \Big\{ \frac{1}{2} \theta^T \Big( \frac{X^T X}{n} \Big) \theta - \Big\langle \theta, \frac{X^T y}{n} \Big\rangle + \mathcal{R}_{\lambda}(\theta) \Big\}.$$
Population view: unbiased estimators
$$\mathrm{cov}(x_1) = \mathbb{E}\Big[ \frac{X^T X}{n} \Big], \quad \text{and} \quad \mathrm{cov}(x_1, y_1) = \mathbb{E}\Big[ \frac{X^T y}{n} \Big].$$
A general family of estimators:
$$\widehat{\theta} \in \arg\min_{\theta \in \mathbb{R}^p} \Big\{ \frac{1}{2} \theta^T \widehat{\Gamma} \theta - \langle \theta, \widehat{\gamma} \rangle + \lambda_n \|\theta\|_1 \Big\},$$
where $(\widehat{\Gamma}, \widehat{\gamma})$ are unbiased estimators of $\mathrm{cov}(x_1)$ and $\mathrm{cov}(x_1, y_1)$.
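For instance, under the additive-perturbation model above ($Z = X + W$ with $W_{ij} \sim N(0, \sigma^2)$ independent of $X$), a natural unbiased pair is $\widehat{\Gamma} = Z^T Z/n - \sigma^2 I$ and $\widehat{\gamma} = Z^T y/n$, since $\mathbb{E}[Z^T Z/n] = \mathrm{cov}(x_1) + \sigma^2 I$. A minimal sketch (our own naming; assumes $\sigma^2$ is known):

```python
import numpy as np

def additive_noise_moments(Z, y, sigma2):
    """Unbiased surrogates for cov(x_1) and cov(x_1, y_1) when
    Z = X + W with E[W^T W / n] = sigma2 * I, W independent of (X, y)."""
    n, p = Z.shape
    Gamma = Z.T @ Z / n - sigma2 * np.eye(p)  # subtract the noise covariance
    gamma = Z.T @ y / n
    return Gamma, gamma
```

Note that for $p > n$ the matrix $\widehat{\Gamma}$ necessarily has negative eigenvalues ($Z^T Z$ has rank at most $n$), which is precisely the source of the non-convexity.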
Example: Estimator for missing data
Observe the corrupted version $Z \in \mathbb{R}^{n \times p}$:
$$Z_{ij} = \begin{cases} X_{ij} & \text{with probability } 1 - \alpha, \\ \star \text{ (missing)} & \text{with probability } \alpha. \end{cases}$$
Natural unbiased estimates: set missing entries to $0$, rescale $\widehat{Z} := Z/(1 - \alpha)$, and take
$$\widehat{\Gamma} = \frac{\widehat{Z}^T \widehat{Z}}{n} - \alpha \, \mathrm{diag}\Big( \frac{\widehat{Z}^T \widehat{Z}}{n} \Big), \quad \text{and} \quad \widehat{\gamma} = \frac{\widehat{Z}^T y}{n},$$
then solve the (doubly non-convex) optimization problem (Loh & W., 2012):
$$\widehat{\theta} \in \arg\min_{\theta \in \Omega} \Big\{ \frac{1}{2} \theta^T \widehat{\Gamma} \theta - \langle \widehat{\gamma}, \theta \rangle + \mathcal{R}_{\lambda}(\theta) \Big\}.$$
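A minimal sketch of these corrected moments (our own naming; missing entries are encoded as NaN, with $\alpha$ assumed known):

```python
import numpy as np

def missing_data_moments(Z, y, alpha):
    """Unbiased surrogates (Gamma_hat, gamma_hat) when entries are missing
    independently with probability alpha; missing entries are NaN."""
    n = len(y)
    Zhat = np.nan_to_num(Z) / (1.0 - alpha)  # zero-fill, then rescale
    M = Zhat.T @ Zhat / n
    Gamma = M - alpha * np.diag(np.diag(M))  # deflate the inflated diagonal
    gamma = Zhat.T @ y / n
    return Gamma, gamma
```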
Non-convex quadratic and non-convex regularizer
[Figure: log error plot for corrected linear regression with MCP, $b = 1.5$: $\log(\|\beta^t - \beta^*\|_2)$ versus iteration count, showing both optimization-error and statistical-error curves.]
Remainder of talk
1. Why are all local optima statistically good?
   - Restricted strong convexity
   - A general theorem
   - Various examples
2. Why do first-order gradient methods converge quickly?
   - Composite gradient methods
   - Statistical versus optimization error
   - Fast convergence for non-convex problems
Geometry of a non-convex quadratic loss
[Figure: surface plot of a non-convex quadratic function over $[-1, 1]^2$, a saddle.]
The loss function has directions of both positive and negative curvature. The negative directions must be forbidden by the regularizer.
Restricted strong convexity
Here defined with respect to the $\ell_1$-norm:
Definition. The loss function $\mathcal{L}_n$ satisfies RSC with parameters $(\alpha_j, \tau_j)$, $j = 1, 2$, if
$$\underbrace{\big\langle \nabla \mathcal{L}_n(\theta^* + \Delta) - \nabla \mathcal{L}_n(\theta^*), \, \Delta \big\rangle}_{\text{Measure of curvature}} \;\geq\; \begin{cases} \alpha_1 \|\Delta\|_2^2 - \tau_1 \frac{\log p}{n} \|\Delta\|_1^2 & \text{if } \|\Delta\|_2 \leq 1, \\[4pt] \alpha_2 \|\Delta\|_2 - \tau_2 \sqrt{\frac{\log p}{n}} \|\Delta\|_1 & \text{if } \|\Delta\|_2 > 1. \end{cases}$$
- holds with $\tau_1 = \tau_2 = 0$ for any function that is locally strongly convex around $\theta^*$
- holds for a variety of loss functions (convex and non-convex):
  - ordinary least-squares (Raskutti, W. & Yu, 2010)
  - likelihoods for generalized linear models (Negahban et al., 2012)
  - certain non-convex quadratic functions (Loh & W., 2012)
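As a concrete instance (a standard computation, not spelled out on the slide): for the least-squares loss, the curvature term is exactly a restricted-eigenvalue quantity,

```latex
% For L_n(theta) = ||y - X theta||_2^2 / (2n), the gradient is
% X^T (X theta - y) / n, so the left-hand side of the RSC condition is
\bigl\langle \nabla \mathcal{L}_n(\theta^* + \Delta)
   - \nabla \mathcal{L}_n(\theta^*),\, \Delta \bigr\rangle
   \;=\; \frac{\|X\Delta\|_2^2}{n},
```

and RSC reduces to a restricted-eigenvalue-type lower bound on $\|X\Delta\|_2^2/n$, which holds with high probability for suitably random designs (Raskutti, W. & Yu, 2010).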
Well-behaved regularizers
Properties defined at the univariate level, $\mathcal{R}_\lambda : \mathbb{R} \to [0, \infty]$:
- satisfies $\mathcal{R}_\lambda(0) = 0$, and is symmetric around zero ($\mathcal{R}_\lambda(t) = \mathcal{R}_\lambda(-t)$)
- non-decreasing and subadditive: $\mathcal{R}_\lambda(s + t) \leq \mathcal{R}_\lambda(s) + \mathcal{R}_\lambda(t)$
- the function $t \mapsto \frac{\mathcal{R}_\lambda(t)}{t}$ is non-increasing for $t > 0$
- differentiable for all $t \neq 0$, sub-differentiable at $t = 0$ with subgradients bounded in absolute value by $\lambda L$
- for some $\mu > 0$, the function $\bar{\mathcal{R}}_\lambda(t) = \mathcal{R}_\lambda(t) + \mu t^2$ is convex
Includes (among others):
- rescaled $\ell_1$ penalty: $\mathcal{R}_\lambda(t) = \lambda |t|$
- MCP and SCAD penalties (Fan et al., 2001; Zhang, 2006)
- does not include the capped $\ell_1$-penalty
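For reference, minimal sketches of the two named non-convex penalties (standard formulas from the literature; the default parameter values $a = 3.7$ and $b = 1.5$ match the plots in this deck):

```python
import numpy as np

def scad(t, lam, a=3.7):
    """SCAD penalty (Fan & Li, 2001): slope lam near zero, a quadratic
    transition on (lam, a*lam], constant (a+1)*lam^2/2 beyond."""
    x = np.abs(t)
    return np.where(
        x <= lam, lam * x,
        np.where(x <= a * lam,
                 (2 * a * lam * x - x**2 - lam**2) / (2 * (a - 1)),
                 (a + 1) * lam**2 / 2))

def mcp(t, lam, b=1.5):
    """MCP penalty (Zhang): lam*|t| - t^2/(2b) for |t| <= b*lam,
    constant b*lam^2/2 beyond."""
    x = np.abs(t)
    return np.where(x <= b * lam, lam * x - x**2 / (2 * b), b * lam**2 / 2)
```

Both satisfy the $\mu$-convexity property above: $\mathcal{R}_\lambda(t) + \mu t^2$ is convex once $\mu \geq \frac{1}{2(a-1)}$ for SCAD, or $\mu \geq \frac{1}{2b}$ for MCP.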
Main statistical guarantee
Regularized M-estimator:
$$\widehat{\theta} \in \arg\min_{\|\theta\|_1 \leq M} \big\{ \mathcal{L}_n(\theta) + \mathcal{R}_\lambda(\theta) \big\},$$
where the loss function satisfies $(\alpha, \tau)$-RSC, and the regularizer is regular with parameters $(\mu, L)$.
A local optimum $\widetilde{\theta}$ is defined by the condition
$$\big\langle \nabla \mathcal{L}_n(\widetilde{\theta}) + \nabla \mathcal{R}_\lambda(\widetilde{\theta}), \, \theta - \widetilde{\theta} \big\rangle \geq 0 \quad \text{for all feasible } \theta.$$
Theorem (Loh & W., 2013). Suppose $M$ is chosen such that $\theta^*$ is feasible, and $\lambda$ satisfies the bounds
$$\max\Big\{ \|\nabla \mathcal{L}_n(\theta^*)\|_\infty, \; \alpha_2 \sqrt{\tfrac{\log p}{n}} \Big\} \;\leq\; \lambda \;\leq\; \frac{\alpha_2}{6 L M}.$$
Then any local optimum $\widetilde{\theta}$ satisfies the bound
$$\|\widetilde{\theta} - \theta^*\|_2 \leq \frac{6 \lambda \sqrt{s}}{4(\alpha - \mu)},$$
where $s = \|\theta^*\|_0$.
Geometry of local/global optima
[Figure: all local optima $\widetilde{\theta}$ and the global optimum $\widehat{\theta}$ lie within a ball of radius $\epsilon_{\text{stat}}$ around the target $\theta^*$.]
Consequence: all {local, global} optima are within distance $\epsilon_{\text{stat}}$ of the target $\theta^*$.
With $\lambda = c \sqrt{\frac{\log p}{n}}$, the statistical error scales as
$$\epsilon_{\text{stat}} \precsim \sqrt{\frac{s \log p}{n}},$$
which is minimax optimal.
Empirical results (unrescaled)
[Figure: mean-squared error versus raw sample size $n$, for $p \in \{128, 256, 512, 1024\}$.]
Empirical results (rescaled)
[Figure: mean-squared error versus rescaled sample size $n/(s \log p)$, for $p \in \{128, 256, 512, 1024\}$.]
Comparisons between different penalties
[Figure: comparing penalties for corrected linear regression: $\ell_2$-norm error versus $n/(k \log p)$, for $p \in \{64, 128, 256\}$.]
First-order algorithms and fast convergence
Thus far, we have shown that all local optima are statistically good. How to obtain a local optimum quickly?
Composite gradient descent for regularized objectives (Nesterov, 2007):
$$\min_{\theta \in \Omega} \big\{ f(\theta) + g(\theta) \big\},$$
where $f$ is differentiable, and $g$ is convex and sub-differentiable.
Simple updates:
$$\theta^{t+1} = \arg\min_{\theta \in \Omega} \Big\{ \frac{1}{2} \big\| \theta - \big( \theta^t - \alpha^t \nabla f(\theta^t) \big) \big\|_2^2 + \alpha^t g(\theta) \Big\}.$$
Not directly applicable with $f = \mathcal{L}_n$ and $g = \mathcal{R}_\lambda$ (since $\mathcal{R}_\lambda$ can be non-convex).
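In the simplest convex case, $g(\theta) = \lambda \|\theta\|_1$ with $\Omega = \mathbb{R}^p$, the update above reduces to a gradient step followed by soft-thresholding; a minimal sketch with our own naming:

```python
import numpy as np

def soft_threshold(z, tau):
    """Prox of tau * ||.||_1: shrink each coordinate toward zero by tau."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def composite_step(theta, grad_f, step, lam):
    """One composite gradient update for g(theta) = lam * ||theta||_1."""
    return soft_threshold(theta - step * grad_f(theta), step * lam)
```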
Composite gradient on a convenient splitting
Define modified loss functions and regularizers:
$$\underbrace{\widetilde{\mathcal{L}}_n(\theta) := \mathcal{L}_n(\theta) - \mu \|\theta\|_2^2}_{\text{non-convex}}, \quad \text{and} \quad \underbrace{\widetilde{\mathcal{R}}_\lambda(\theta) := \mathcal{R}_\lambda(\theta) + \mu \|\theta\|_2^2}_{\text{convex}}.$$
Apply composite gradient descent to the objective
$$\min_{\theta \in \Omega} \big\{ \widetilde{\mathcal{L}}_n(\theta) + \widetilde{\mathcal{R}}_\lambda(\theta) \big\}.$$
- converges to a local optimum $\widetilde{\theta}$ (Nesterov, 2007):
$$\big\langle \nabla \widetilde{\mathcal{L}}_n(\widetilde{\theta}) + \nabla \widetilde{\mathcal{R}}_\lambda(\widetilde{\theta}), \, \theta - \widetilde{\theta} \big\rangle \geq 0 \quad \text{for all feasible } \theta.$$
- we will show that convergence is geometrically fast with a constant stepsize
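Putting the pieces together, here is a minimal sketch (ours, under stated assumptions) of this splitting for the corrected least-squares loss $\mathcal{L}_n(\theta) = \frac{1}{2}\theta^T \widehat{\Gamma}\theta - \langle \widehat{\gamma}, \theta \rangle$ with the MCP penalty and $\mu = 1/(2b)$. A short calculation shows the prox of $\widetilde{\mathcal{R}}_\lambda$ is then soft-thresholding for small inputs and multiplicative shrinkage for large ones; the side constraint $\theta \in \Omega$ from the theory is omitted for brevity:

```python
import numpy as np

def mcp_split_prox(z, eta, lam, b):
    """Prox of eta * R_tilde, where R_tilde(t) = MCP(t) + t^2/(2b) is the
    convex part of the splitting with mu = 1/(2b)."""
    soft = np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0)
    return np.where(np.abs(z) <= lam * (b + eta), soft, z * b / (b + eta))

def composite_gradient(Gamma, gamma, lam, b=1.5, eta=0.1, iters=500):
    """Composite gradient descent on L_tilde + R_tilde for
    L_n(theta) = 0.5 * theta' Gamma theta - <gamma, theta>.
    The stepsize eta should be below the inverse smoothness of L_tilde."""
    theta = np.zeros(len(gamma))
    for _ in range(iters):
        # gradient of L_tilde = L_n - (1/(2b)) * ||theta||_2^2
        grad = Gamma @ theta - gamma - theta / b
        theta = mcp_split_prox(theta - eta * grad, eta, lam, b)
    return theta
```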
Theoretical guarantees on computational error
- implement Nesterov's composite method with constant stepsize on the $(\widetilde{\mathcal{L}}_n, \widetilde{\mathcal{R}}_\lambda)$ splitting
- a fixed global optimum $\widehat{\theta}$ defines the statistical error $\epsilon_{\text{stat}}^2 = \|\widehat{\theta} - \theta^*\|^2$
- the population minimizer $\theta^*$ is $s$-sparse
- the loss function satisfies $(\alpha, \tau)$-RSC and smoothness conditions, and the regularizer is $(\mu, L)$-good
Theorem (Loh & W., 2013). If $n \succsim s \log p$, there is a contraction factor $\kappa \in (0, 1)$ such that for any $\delta \geq \epsilon_{\text{stat}}$, we have
$$\|\theta^t - \widehat{\theta}\|_2^2 \leq \frac{2}{\alpha - \mu} \Big( \delta^2 + 128 \, \tau \, \frac{s \log p}{n} \, \epsilon_{\text{stat}}^2 \Big)$$
for all $t \geq T(\delta)$ iterations, where $T(\delta) \approx \frac{\log(1/\delta)}{\log(1/\kappa)}$.
Geometry of result
[Figure: iterates $\theta^0, \theta^1, \ldots, \theta^t$ converging toward $\widehat{\theta}$, entering a ball of radius $\epsilon$ around $\theta^*$.]
The optimization error $\Delta^t := \theta^t - \widehat{\theta}$ decreases geometrically up to the statistical tolerance:
$$\|\theta^{t+1} - \widehat{\theta}\|_2 \leq \kappa^t \|\theta^0 - \widehat{\theta}\|_2 + \underbrace{o\big( \|\widehat{\theta} - \theta^*\|_2 \big)}_{\text{Stat. error } \epsilon_{\text{stat}}^2} \quad \text{for all } t = 0, 1, 2, \ldots$$
Non-convex linear regression with SCAD
[Figure: log error plot for corrected linear regression with SCAD, $a = 2.5$: $\log(\|\beta^t - \beta^*\|_2)$ versus iteration count, with optimization-error and statistical-error curves.]
Non-convex linear regression with SCAD
[Figure: log error plot for corrected linear regression with SCAD, $a = 3.7$: $\log(\|\beta^t - \beta^*\|_2)$ versus iteration count, with optimization-error and statistical-error curves.]
Summary
M-estimators based on non-convex programs arise frequently. Under suitable regularity conditions, we showed that:
- all local optima are well-behaved from the statistical point of view
- simple first-order methods converge as fast as possible
Many open questions:
- similar guarantees for more general problems?
- geometry of non-convex problems in statistics?
Papers and pre-prints:
- Loh & W. (2013). Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Pre-print, arXiv:1305.2436.
- Loh & W. (2012). High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. Annals of Statistics, 40:1637-1664.
- Negahban et al. (2012). A unified framework for high-dimensional analysis of M-estimators. Statistical Science, 27(4):538-557.