Estimators based on non-convex programs: Statistical and computational guarantees
1 Estimators based on non-convex programs: Statistical and computational guarantees
Martin Wainwright, UC Berkeley, Statistics and EECS
Based on joint work with: Po-Ling Loh (UC Berkeley)
Martin Wainwright (UC Berkeley), May
2 Introduction
Prediction/regression problems arise throughout statistics:
- vector of predictors/covariates x ∈ R^p
- response variable y ∈ R
In the big-data setting, both the sample size n and the ambient dimension p are large; many problems have p ≫ n.
Regularization is essential:

    θ̂ ∈ argmin_θ { (1/n) Σ_{i=1}^n L(θ; x_i, y_i) + R_λ(θ) },

where the first term is the empirical loss L_n(θ; x_1^n, y_1^n) and R_λ is the regularizer.
For non-convex problems: mind the gap! Any global optimum is statistically good, but efficient algorithms only find local optima.
Question: How to close this undesirable gap between statistics and computation?
6 Vignette A: Regression with non-convex penalties
Set-up: observe (y_i, x_i) pairs for i = 1, 2, ..., n, where y_i ~ Q(· | ⟨θ*, x_i⟩) and θ* ∈ R^p has low-dimensional structure (support S, complement S^c).
Estimator: R_λ-regularized maximum likelihood,

    θ̂ ∈ argmin_θ { −(1/n) Σ_{i=1}^n log Q(y_i | x_i, θ) + R_λ(θ) }.

Example: logistic regression for binary responses y_i ∈ {0, 1}:

    θ̂ ∈ argmin_θ { (1/n) Σ_{i=1}^n [ log(1 + e^{⟨x_i, θ⟩}) − y_i ⟨x_i, θ⟩ ] + R_λ(θ) }.

Many non-convex penalties are possible: the capped ℓ₁ penalty, the SCAD penalty (Fan & Li, 2001), and the MCP penalty (Zhang, 2006).
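As a concrete sketch of this objective (my own illustration, not from the slides; the toy data X, y below are arbitrary), the logistic loss and its gradient can be written in a few lines of NumPy:

```python
import numpy as np

def logistic_loss(theta, X, y):
    """(1/n) * sum_i [ log(1 + exp(<x_i, theta>)) - y_i * <x_i, theta> ]."""
    u = X @ theta
    return np.mean(np.logaddexp(0.0, u) - y * u)

def logistic_grad(theta, X, y):
    """Gradient of the loss above: (1/n) * X^T (sigmoid(X theta) - y)."""
    u = X @ theta
    return X.T @ (1.0 / (1.0 + np.exp(-u)) - y) / len(y)

# Tiny illustrative data set
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 0.0, 1.0])
```

A penalty R_λ (e.g., SCAD or MCP) would then be added to this smooth loss before optimization.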
9 Convex and non-convex regularizers
[Figure: plot of regularizers R_λ(t) versus t, comparing the ℓ₁ penalty, the capped ℓ₁ penalty, and MCP.]
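For reference, the two non-convex penalties in the figure have simple closed forms. A minimal NumPy sketch (the parameter names lam, a, b and their defaults are mine; a = 3.7 and b = 3 are conventional choices):

```python
import numpy as np

def mcp(t, lam, b=3.0):
    """MCP: lam*|t| - t^2/(2b) for |t| <= b*lam, constant b*lam^2/2 beyond."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= b * lam, lam * t - t**2 / (2 * b), 0.5 * b * lam**2)

def scad(t, lam, a=3.7):
    """SCAD: lam*|t| near zero, a quadratic transition, then constant."""
    t = np.abs(np.asarray(t, dtype=float))
    mid = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))
    return np.where(t <= lam, lam * t,
                    np.where(t <= a * lam, mid, 0.5 * lam**2 * (a + 1)))
```

Both agree with λ|t| near the origin and flatten to a constant for large |t|, which is what removes the bias that the ℓ₁ penalty imposes on large coefficients.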
10 Statistical error versus optimization error
An algorithm generates a sequence of iterates {θ^t}_{t≥0} to solve

    θ̂ ∈ argmin_θ { (1/n) Σ_{i=1}^n L(θ; x_i, y_i) + R_λ(θ) }.

The global minimizer of the population risk is

    θ* := argmin_θ E_{X,Y}[ L(θ; X, Y) ],

where L̄(θ) := E_{X,Y}[L(θ; X, Y)] denotes the population loss.
Goal of the statistician: bound the statistical error ‖θ^t − θ*‖ or ‖θ̂ − θ*‖.
Goal of the optimization theorist: bound the optimization error ‖θ^t − θ̂‖.
13 Logistic regression with non-convex regularizer
[Figure: log error plot for logistic regression with SCAD (a = 3.7): log ‖β^t − β*‖₂ versus iteration count, showing both optimization error and statistical error.]
14 What phenomena need to be explained?
Empirical observation #1: from a statistical perspective, all local optima are essentially as good as a global optimum.
Some past work:
- for the least-squares loss, certain local optima are good (Zhang & Zhang, 2012)
- if initialized at a Lasso solution with ℓ∞ guarantees, a local algorithm has good behavior (Fan et al., 2012)
Empirical observation #2: first-order methods converge as fast as possible, up to statistical precision.
17 Vignette B: Error-in-variables regression
Begin with high-dimensional sparse regression: y = Xθ* + ε, with y ∈ R^n, X ∈ R^{n×p}, and θ* supported on S (complement S^c).
Now observe y ∈ R^n and a corrupted matrix Z ∈ R^{n×p}, where Z = F(X; W) and W ∈ R^{n×p} is a stochastic perturbation.
19 Example: Missing data
Z = X ⊙ W, where W ∈ R^{n×p} is a multiplicative perturbation (e.g., W_ij ~ Ber(α)), applied to the linear model y = Xθ* + ε.
20 Example: Additive perturbations
Additive noise in the covariates: Z = X + W, where W ∈ R^{n×p} is an additive perturbation (e.g., W_ij ~ N(0, σ²)), again with y = Xθ* + ε.
21 A second look at regularized least-squares
Equivalent formulation:

    θ̂ ∈ argmin_{θ ∈ R^p} { (1/2) θ^T (X^T X / n) θ − ⟨X^T y / n, θ⟩ + R_λ(θ) }.

Population view: the sample quantities are unbiased estimators,

    cov(x_1) = E[X^T X / n],  and  cov(x_1, y_1) = E[X^T y / n].

A general family of estimators:

    θ̂ ∈ argmin_{θ ∈ R^p} { (1/2) θ^T Γ̂ θ − ⟨γ̂, θ⟩ + λ_n ‖θ‖₁ },

where (Γ̂, γ̂) are unbiased estimators of cov(x_1) and cov(x_1, y_1).
24 Example: Estimator for missing data
Observe the corrupted version Z ∈ R^{n×p}:

    Z_ij = X_ij with probability 1 − α, and missing with probability α.

Natural unbiased estimates: set missing entries to 0, define Ẑ := Z / (1 − α), and take

    Γ̂ = Ẑ^T Ẑ / n − α diag(Ẑ^T Ẑ / n),  and  γ̂ = Ẑ^T y / n.

Then solve the (doubly non-convex) optimization problem (Loh & W., 2012):

    θ̂ ∈ argmin_{θ ∈ Ω} { (1/2) θ^T Γ̂ θ − ⟨γ̂, θ⟩ + R_λ(θ) }.
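A small simulation sketch of these surrogates (my own illustration, not from the slides; the dimensions and noise level are arbitrary), assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, alpha = 200, 10, 0.2          # alpha = probability an entry is missing

X = rng.standard_normal((n, p))
theta_star = np.zeros(p)
theta_star[:3] = 1.0
y = X @ theta_star + 0.1 * rng.standard_normal(n)

# Observed matrix Z: each entry of X independently zeroed out w.p. alpha.
observed = rng.random((n, p)) >= alpha
Z = X * observed

# Unbiased surrogates from the slides: Zhat = Z / (1 - alpha),
# Gamma = Zhat^T Zhat / n - alpha * diag(Zhat^T Zhat / n), gamma = Zhat^T y / n.
Zhat = Z / (1 - alpha)
M = Zhat.T @ Zhat / n
Gamma = M - alpha * np.diag(np.diag(M))
gamma = Zhat.T @ y / n
```

Unlike the usual X^T X / n, the matrix Γ̂ can fail to be positive semidefinite when p > n, which is exactly the source of the non-convexity in the quadratic program above.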
27 Non-convex quadratic and non-convex regularizer
[Figure: log error plot for corrected linear regression with MCP: log ‖β^t − β*‖₂ versus iteration count, showing both optimization error and statistical error.]
28 Remainder of talk
1. Why are all local optima statistically good?
   - Restricted strong convexity
   - A general theorem
   - Various examples
2. Why do first-order gradient methods converge quickly?
   - Composite gradient methods
   - Statistical versus optimization error
   - Fast convergence for non-convex problems
29 Geometry of a non-convex quadratic loss
The loss function has directions of both positive and negative curvature. The negative-curvature directions must be forbidden by the regularizer.
31 Restricted strong convexity
Here defined with respect to the ℓ₁-norm:
Definition. The loss function L_n satisfies RSC with parameters (α_j, τ_j), j = 1, 2, if for all Δ,

    ⟨∇L_n(θ* + Δ) − ∇L_n(θ*), Δ⟩  ≥  α₁ ‖Δ‖₂² − τ₁ (log p / n) ‖Δ‖₁²   if ‖Δ‖₂ ≤ 1,
                                      α₂ ‖Δ‖₂ − τ₂ √(log p / n) ‖Δ‖₁   if ‖Δ‖₂ > 1,

where the left-hand side is a measure of curvature.
This condition:
- holds with τ₁ = τ₂ = 0 for any function that is locally strongly convex around θ*
- holds for a variety of loss functions (convex and non-convex):
  - ordinary least-squares (Raskutti, W. & Yu, 2010)
  - likelihoods for generalized linear models (Negahban et al., 2012)
  - certain non-convex quadratic functions (Loh & W., 2012)
34 Well-behaved regularizers
Properties are defined at the univariate level, R_λ : R → [0, ∞]:
- R_λ(0) = 0, and R_λ is symmetric around zero (R_λ(t) = R_λ(−t))
- non-decreasing on [0, ∞) and subadditive: R_λ(s + t) ≤ R_λ(s) + R_λ(t)
- the function t ↦ R_λ(t)/t is non-increasing for t > 0
- differentiable for all t ≠ 0, and subdifferentiable at t = 0 with subgradients bounded in absolute value by λL
- for some μ > 0, the function R̄_λ(t) := R_λ(t) + μt² is convex
Includes (among others):
- the rescaled ℓ₁ penalty: R_λ(t) = λ|t|
- the MCP and SCAD penalties (Fan & Li, 2001; Zhang, 2006)
- but not the capped ℓ₁ penalty
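These properties can be spot-checked numerically. A small sketch for SCAD (my own check, not from the slides), verifying subadditivity and that R_λ(t)/t is non-increasing on a grid:

```python
import numpy as np

def scad(t, lam=1.0, a=3.7):
    """SCAD penalty: lam*|t| near zero, a quadratic transition, then constant."""
    t = np.abs(np.asarray(t, dtype=float))
    mid = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))
    return np.where(t <= lam, lam * t,
                    np.where(t <= a * lam, mid, 0.5 * lam**2 * (a + 1)))

grid = np.linspace(0.01, 5.0, 200)
S, T = np.meshgrid(grid, grid)

# Subadditivity: R(s + t) <= R(s) + R(t) at all grid points.
subadditive = bool(np.all(scad(S + T) <= scad(S) + scad(T) + 1e-12))

# t -> R(t)/t non-increasing for t > 0.
ratio = scad(grid) / grid
nonincreasing = bool(np.all(np.diff(ratio) <= 1e-12))
```

Both checks pass because SCAD is concave on [0, ∞) with R_λ(0) = 0; the capped ℓ₁ penalty instead fails the differentiability and weak-convexity conditions at its cap.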
36 Main statistical guarantee
Regularized M-estimator:

    θ̂ ∈ argmin_{‖θ‖₁ ≤ M} { L_n(θ) + R_λ(θ) },

where the loss function satisfies (α, τ)-RSC, and the regularizer is regular with parameters (μ, L).
A local optimum θ̃ is defined by the first-order condition

    ⟨∇L_n(θ̃) + ∇R_λ(θ̃), θ − θ̃⟩ ≥ 0 for all feasible θ.

Theorem (Loh & W., 2013). Suppose M is chosen such that θ* is feasible, and λ satisfies the bounds

    max{ ‖∇L_n(θ*)‖_∞, α₂ √(log p / n) } ≤ λ ≤ α₂ / (6LM).

Then any local optimum θ̃ satisfies the bound

    ‖θ̃ − θ*‖₂ ≤ 6λ√s / (4(α − μ)),  where s = ‖θ*‖₀.
39 Geometry of local/global optima
[Figure: all local optima θ̃ lie within a ball of radius ε_stat around the target θ*.]
Consequence: all { local, global } optima are within distance ε_stat of the target θ*.
With λ = c √(log p / n), the statistical error scales as

    ε_stat ≍ √(s log p / n),

which is minimax optimal.
41 Empirical results (unrescaled)
[Figure: mean squared error versus raw sample size, for p = 128, 256, 512, 1024.]
42 Empirical results (rescaled)
[Figure: mean squared error versus rescaled sample size, for p = 128, 256, 512, 1024.]
43 Comparisons between different penalties
[Figure: comparing penalties for corrected linear regression: ℓ₂-norm error versus n/(k log p), for p = 64, 128, 256.]
44 First-order algorithms and fast convergence
Thus far, we have shown that all local optima are statistically good. How do we obtain a local optimum quickly?
Composite gradient descent for regularized objectives (Nesterov, 2007):

    min_{θ ∈ Ω} { f(θ) + g(θ) },

where f is differentiable and g is convex and subdifferentiable.
Simple updates:

    θ^{t+1} = argmin_{θ ∈ Ω} { (1/(2α_t)) ‖θ − (θ^t − α_t ∇f(θ^t))‖₂² + g(θ) },

where α_t is the stepsize.
This is not directly applicable with f = L_n and g = R_λ, since R_λ can be non-convex.
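A minimal sketch of these updates (my own illustration, not the slides' code), using the convex pair f = least squares and g = λ‖·‖₁, whose proximal step is entrywise soft-thresholding:

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1: entrywise shrinkage toward zero."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def composite_gradient(grad_f, prox_g, theta0, step, iters=500):
    """theta^{t+1} = prox_{step * g}( theta^t - step * grad_f(theta^t) )."""
    theta = theta0.copy()
    for _ in range(iters):
        theta = prox_g(theta - step * grad_f(theta), step)
    return theta

# Lasso instance: f(theta) = ||X theta - y||_2^2 / (2n), g = lam * ||theta||_1.
rng = np.random.default_rng(1)
n, p, lam = 100, 20, 0.1
X = rng.standard_normal((n, p))
theta_star = np.zeros(p)
theta_star[:2] = 2.0
y = X @ theta_star + 0.05 * rng.standard_normal(n)

grad_f = lambda th: X.T @ (X @ th - y) / n
step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()   # 1/L for the smooth part
theta_hat = composite_gradient(grad_f,
                               lambda v, s: soft_threshold(v, s * lam),
                               np.zeros(p), step)
```

With a non-convex R_λ, the same template applies only after the convex splitting described on the following slides.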
48 Composite gradient on a convenient splitting
Define a modified loss function and regularizer:

    L̄_n(θ) := L_n(θ) − μ‖θ‖₂²  (non-convex),  and  R̄_λ(θ) := R_λ(θ) + μ‖θ‖₂²  (convex).

Apply composite gradient descent to the objective

    min_{θ ∈ Ω} { L̄_n(θ) + R̄_λ(θ) }.

This converges to a local optimum θ̃ (Nesterov, 2007):

    ⟨∇L̄_n(θ̃) + ∇R̄_λ(θ̃), θ − θ̃⟩ ≥ 0 for all feasible θ.

We will show that convergence is geometrically fast with a constant stepsize.
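To see why the splitting works, here is a small numerical check (my own sketch, not from the slides): MCP with parameter b has weakest curvature −1/b, so R̄_λ(t) = R_λ(t) + μt² becomes convex once μ ≥ 1/(2b). Discrete second differences verify this:

```python
import numpy as np

def mcp(t, lam=1.0, b=3.0):
    """MCP penalty: lam*|t| - t^2/(2b) for |t| <= b*lam, else b*lam^2/2."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= b * lam, lam * t - t**2 / (2 * b), 0.5 * b * lam**2)

lam, b = 1.0, 3.0
mu = 1.0 / (2 * b)                 # smallest mu that convexifies this MCP

t = np.linspace(-5.0, 5.0, 2001)
R_bar = mcp(t, lam, b) + mu * t**2

# Second differences of a convex function are nonnegative; MCP alone fails this.
convexified = bool(np.all(np.diff(R_bar, 2) >= -1e-9))
still_nonconvex = bool(np.any(np.diff(mcp(t, lam, b), 2) < -1e-9))
```

The quadratic added to R_λ is exactly the one subtracted from L_n, so the overall objective is unchanged; only the split between "smooth part" and "prox part" moves.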
52 Theoretical guarantees on computational error
- implement Nesterov's composite method with constant stepsize on the (L̄_n, R̄_λ) splitting
- a fixed global optimum θ̂ defines the statistical error ε²_stat = ‖θ̂ − θ*‖₂²
- the population minimizer θ* is s-sparse
- the loss function satisfies (α, τ)-RSC and smoothness conditions, and the regularizer is (μ, L)-good
Theorem (Loh & W., 2013). If n ≳ s log p, there is a contraction factor κ ∈ (0, 1) such that for any δ ≥ ε_stat,

    ‖θ^t − θ̂‖₂² ≲ (1 / (α − μ)) ( δ² + τ (s log p / n) ε²_stat )

for all t ≥ T(δ) iterations, where T(δ) ≍ log(1/δ) / log(1/κ).
55 Geometry of result
[Figure: iterates θ^0, θ^1, ..., θ^t contracting toward θ̂, up to the tolerance ε around θ*.]
The optimization error Δ^t := θ^t − θ̂ decreases geometrically up to the statistical tolerance:

    ‖θ^{t+1} − θ̂‖₂² ≲ κ^t ‖θ^0 − θ̂‖₂² + o(‖θ̂ − θ*‖₂²) for all t = 0, 1, 2, ...,

where the last term is of the order of the statistical error ε²_stat.
56 Non-convex linear regression with SCAD
[Figures: log error plots for corrected linear regression with SCAD, for two settings of the parameter a: log ‖β^t − β*‖₂ versus iteration count, showing both optimization error and statistical error.]
58 Summary
M-estimators based on non-convex programs arise frequently. Under suitable regularity conditions, we showed that:
- all local optima are well-behaved from the statistical point of view
- simple first-order methods converge as fast as possible
Many open questions remain:
- similar guarantees for more general problems?
- geometry of non-convex problems in statistics?
Papers and pre-prints:
- Loh & W. (2013). Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. arXiv pre-print.
- Loh & W. (2012). High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. Annals of Statistics, 40.
- Negahban et al. (2012). A unified framework for high-dimensional analysis of M-estimators. Statistical Science, 27(4).
More informationMachine Learning. Support Vector Machines. Fabio Vandin November 20, 2017
Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training
More informationStatistics 300B Winter 2018 Final Exam Due 24 Hours after receiving it
Statistics 300B Winter 08 Final Exam Due 4 Hours after receiving it Directions: This test is open book and open internet, but must be done without consulting other students. Any consultation of other students
More informationarxiv: v1 [math.st] 10 Sep 2015
Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees Department of Statistics Yudong Chen Martin J. Wainwright, Department of Electrical Engineering and
More informationSOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING. Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu
SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu LITIS - EA 48 - INSA/Universite de Rouen Avenue de l Université - 768 Saint-Etienne du Rouvray
More informationOptimization methods
Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationOptimization methods
Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to
More informationarxiv: v5 [cs.lg] 23 Dec 2017 Abstract
Nonconvex Sparse Learning via Stochastic Optimization with Progressive Variance Reduction Xingguo Li, Raman Arora, Han Liu, Jarvis Haupt, and Tuo Zhao arxiv:1605.02711v5 [cs.lg] 23 Dec 2017 Abstract We
More informationConvex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013
Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra Optimal gradient methods 2 / 27 Optimal gradient methods We saw following efficiency estimates for
More informationsparse and low-rank tensor recovery Cubic-Sketching
Sparse and Low-Ran Tensor Recovery via Cubic-Setching Guang Cheng Department of Statistics Purdue University www.science.purdue.edu/bigdata CCAM@Purdue Math Oct. 27, 2017 Joint wor with Botao Hao and Anru
More informationAdaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade
Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 Announcements: HW3 posted Dual coordinate ascent (some review of SGD and random
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 11, 2016 Paper presentations and final project proposal Send me the names of your group member (2 or 3 students) before October 15 (this Friday)
More informationGLOBAL SOLUTIONS TO FOLDED CONCAVE PENALIZED NONCONVEX LEARNING. BY HONGCHENG LIU 1,TAO YAO 1 AND RUNZE LI 2 Pennsylvania State University
The Annals of Statistics 2016, Vol. 44, No. 2, 629 659 DOI: 10.1214/15-AOS1380 Institute of Mathematical Statistics, 2016 GLOBAL SOLUTIONS TO FOLDED CONCAVE PENALIZED NONCONVEX LEARNING BY HONGCHENG LIU
More informationNesterov s Acceleration
Nesterov s Acceleration Nesterov Accelerated Gradient min X f(x)+ (X) f -smooth. Set s 1 = 1 and = 1. Set y 0. Iterate by increasing t: g t 2 @f(y t ) s t+1 = 1+p 1+4s 2 t 2 y t = x t + s t 1 s t+1 (x
More informationSVRG++ with Non-uniform Sampling
SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract
More informationConvex relaxation for Combinatorial Penalties
Convex relaxation for Combinatorial Penalties Guillaume Obozinski Equipe Imagine Laboratoire d Informatique Gaspard Monge Ecole des Ponts - ParisTech Joint work with Francis Bach Fête Parisienne in Computation,
More informationThe MNet Estimator. Patrick Breheny. Department of Biostatistics Department of Statistics University of Kentucky. August 2, 2010
Department of Biostatistics Department of Statistics University of Kentucky August 2, 2010 Joint work with Jian Huang, Shuangge Ma, and Cun-Hui Zhang Penalized regression methods Penalized methods have
More informationSparse regression. Optimization-Based Data Analysis. Carlos Fernandez-Granda
Sparse regression Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 3/28/2016 Regression Least-squares regression Example: Global warming Logistic
More informationComposite Loss Functions and Multivariate Regression; Sparse PCA
Composite Loss Functions and Multivariate Regression; Sparse PCA G. Obozinski, B. Taskar, and M. I. Jordan (2009). Joint covariate selection and joint subspace selection for multiple classification problems.
More informationRestricted Strong Convexity Implies Weak Submodularity
Restricted Strong Convexity Implies Weak Submodularity Ethan R. Elenberg Rajiv Khanna Alexandros G. Dimakis Department of Electrical and Computer Engineering The University of Texas at Austin {elenberg,rajivak}@utexas.edu
More informationComposite nonlinear models at scale
Composite nonlinear models at scale Dmitriy Drusvyatskiy Mathematics, University of Washington Joint work with D. Davis (Cornell), M. Fazel (UW), A.S. Lewis (Cornell) C. Paquette (Lehigh), and S. Roy (UW)
More informationNearest Neighbor based Coordinate Descent
Nearest Neighbor based Coordinate Descent Pradeep Ravikumar, UT Austin Joint with Inderjit Dhillon, Ambuj Tewari Simons Workshop on Information Theory, Learning and Big Data, 2015 Modern Big Data Across
More informationRandomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity
Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Zheng Qu University of Hong Kong CAM, 23-26 Aug 2016 Hong Kong based on joint work with Peter Richtarik and Dominique Cisba(University
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationOWL to the rescue of LASSO
OWL to the rescue of LASSO IISc IBM day 2018 Joint Work R. Sankaran and Francis Bach AISTATS 17 Chiranjib Bhattacharyya Professor, Department of Computer Science and Automation Indian Institute of Science,
More informationStatistical Sparse Online Regression: A Diffusion Approximation Perspective
Statistical Sparse Online Regression: A Diffusion Approximation Perspective Jianqing Fan Wenyan Gong Chris Junchi Li Qiang Sun Princeton University Princeton University Princeton University University
More informationEE364b Convex Optimization II May 30 June 2, Final exam
EE364b Convex Optimization II May 30 June 2, 2014 Prof. S. Boyd Final exam By now, you know how it works, so we won t repeat it here. (If not, see the instructions for the EE364a final exam.) Since you
More informationADMM and Fast Gradient Methods for Distributed Optimization
ADMM and Fast Gradient Methods for Distributed Optimization João Xavier Instituto Sistemas e Robótica (ISR), Instituto Superior Técnico (IST) European Control Conference, ECC 13 July 16, 013 Joint work
More informationConvex Optimization Algorithms for Machine Learning in 10 Slides
Convex Optimization Algorithms for Machine Learning in 10 Slides Presenter: Jul. 15. 2015 Outline 1 Quadratic Problem Linear System 2 Smooth Problem Newton-CG 3 Composite Problem Proximal-Newton-CD 4 Non-smooth,
More informationInformation theoretic perspectives on learning algorithms
Information theoretic perspectives on learning algorithms Varun Jog University of Wisconsin - Madison Departments of ECE and Mathematics Shannon Channel Hangout! May 8, 2018 Jointly with Adrian Tovar-Lopez
More informationLecture 1: September 25
0-725: Optimization Fall 202 Lecture : September 25 Lecturer: Geoff Gordon/Ryan Tibshirani Scribes: Subhodeep Moitra Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have
More informationProbabilistic Graphical Models & Applications
Probabilistic Graphical Models & Applications Learning of Graphical Models Bjoern Andres and Bernt Schiele Max Planck Institute for Informatics The slides of today s lecture are authored by and shown with
More informationAn iterative hard thresholding estimator for low rank matrix recovery
An iterative hard thresholding estimator for low rank matrix recovery Alexandra Carpentier - based on a joint work with Arlene K.Y. Kim Statistical Laboratory, Department of Pure Mathematics and Mathematical
More informationLecture 7 Introduction to Statistical Decision Theory
Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7
More informationarxiv: v3 [math.oc] 29 Jun 2016
MOCCA: mirrored convex/concave optimization for nonconvex composite functions Rina Foygel Barber and Emil Y. Sidky arxiv:50.0884v3 [math.oc] 9 Jun 06 04.3.6 Abstract Many optimization problems arising
More informationBoosting Methods: Why They Can Be Useful for High-Dimensional Data
New URL: http://www.r-project.org/conferences/dsc-2003/ Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003) March 20 22, Vienna, Austria ISSN 1609-395X Kurt Hornik,
More informationComparison of Modern Stochastic Optimization Algorithms
Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,
More informationStochastic gradient descent and robustness to ill-conditioning
Stochastic gradient descent and robustness to ill-conditioning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,
More information