Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression

Anonymous Author(s)
Affiliation
Address
email

Abstract

We consider the optimization of a quadratic objective function whose gradients are only accessible through a stochastic oracle. We present the first algorithm that achieves jointly the optimal prediction error rates for least-squares regression, both in terms of forgetting of initial conditions and in terms of dependence on the noise and dimension of the problem, and prove dimension-free and tighter rates for a regularized version of this algorithm.

1 Introduction

Many supervised machine learning problems are naturally cast as the minimization of a smooth function defined on a Euclidean space. This includes least-squares regression, logistic regression (see, e.g., Hastie et al., 2009) or generalized linear models (McCullagh and Nelder, 1989). While small problems with few or low-dimensional input features may be solved precisely by many potential optimization algorithms (e.g., Newton's method), large-scale problems with many high-dimensional features are typically solved with simple gradient-based iterative techniques whose per-iteration cost is small.

In this paper, we consider a quadratic objective function $f$ whose gradients are only accessible through a stochastic oracle that returns the gradient at any given point plus a zero-mean finite-variance random error. In this stochastic approximation framework (Robbins and Monro, 1951), it is known that two quantities dictate the behavior of various algorithms, namely the covariance matrix $V$ of the noise in the gradients, and the deviation $\theta_0 - \theta_*$ between the initial point of the algorithm $\theta_0$ and any global minimizer $\theta_*$ of $f$. This leads to a bias/variance decomposition (Bach and Moulines, 2013; Hsu et al., 2014) of the performance of most algorithms as the sum of two terms: (a) the bias term characterizes how fast initial conditions are forgotten and thus is increasing in a well-chosen norm of $\theta_0 - \theta_*$; while (b) the variance term characterizes the effect of the noise in the gradients, independently of the starting point, and with a term that is increasing in the covariance of the noise.

For quadratic functions with (a) a noise covariance matrix $V$ which is proportional (with constant $\sigma^2$) to the Hessian of $f$ (a situation which corresponds to least-squares regression) and (b) an initial point characterized by the norm $\|\theta_0 - \theta_*\|^2$, the optimal bias and variance terms are known separately. On the one hand, the optimal bias term after $n$ iterations is proportional to $L\|\theta_0 - \theta_*\|^2 / n^2$, where $L$ is the largest eigenvalue of the Hessian of $f$. This rate is achieved by accelerated gradient descent (Nesterov, 1983, 2004), and is known to be optimal if the number of iterations $n$ is less than the dimension $d$ of the underlying predictors, but the algorithm is not robust to random or deterministic noise in the gradients (d'Aspremont, 2008; Schmidt et al., 2011; Devolder et al., 2014). On the other hand, the optimal variance term is proportional to $\sigma^2 d / n$ (Tsybakov, 2003); it is known to be achieved by averaged gradient descent (Bach and Moulines, 2013), for which the bias term only achieves $L\|\theta_0 - \theta_*\|^2 / n$ instead of $L\|\theta_0 - \theta_*\|^2 / n^2$.
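For reference, the two benchmarks above can be combined into the prediction-error target that this paper aims to attain jointly (an informal summary of the rates just discussed, with constants omitted):

$$\mathbb{E} f(\theta_n) - f(\theta_*) \;\lesssim\; \underbrace{\frac{L\,\|\theta_0 - \theta_*\|^2}{n^2}}_{\text{bias (accelerated rate)}} \;+\; \underbrace{\frac{\sigma^2 d}{n}}_{\text{variance (statistical rate)}}.$$

Theorem 1 below attains both terms simultaneously, with explicit constants.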

Our first contribution in this paper is to analyze in Section 3 averaged accelerated gradient descent, showing that it attains optimal rates for both the variance and the bias terms. It shows that averaging is beneficial for accelerated techniques and provides a provable robustness to noise. While optimal when measuring performance in terms of the dimension $d$ and the initial distance to optimum $\|\theta_0 - \theta_*\|^2$, these rates are not adapted to many situations where either $d$ is larger than the number of iterations $n$ (i.e., the number of observations for regular stochastic gradient descent) or $L\|\theta_0 - \theta_*\|^2$ is much larger than $n^2$. Our second contribution is to provide in Section 4 an analysis of a new algorithm (based on some additional regularization) that can adapt our bounds to finer assumptions on $\theta_0 - \theta_*$ and the Hessian of the problem, leading in particular to dimension-free quantities that can thus be extended to the Hilbert space setting (in particular for non-parametric estimation).

2 Least-Squares Regression

Statistical Assumptions. We consider the following general setting: $H$ is a $d$-dimensional Euclidean space with $d \ge 1$; the observations $(x_n, y_n) \in H \times \mathbb{R}$, $n \ge 1$, are independent and identically distributed (i.i.d.), and such that $\mathbb{E}\|x_n\|^2$ and $\mathbb{E} y_n^2$ are finite. We consider the least-squares regression problem, which is the minimization of the quadratic function $f(\theta) = \frac{1}{2}\,\mathbb{E}\big(\langle x_n, \theta\rangle - y_n\big)^2$.

Covariance matrix: We denote by $\Sigma = \mathbb{E}(x_n \otimes x_n) \in \mathbb{R}^{d \times d}$ the population covariance matrix, which is the Hessian of $f$ at all points. Without loss of generality, we can assume $\Sigma$ invertible. This implies that all eigenvalues of $\Sigma$ are strictly positive (but they may be arbitrarily small). We assume there exists $R > 0$ such that $\mathbb{E}\big[\|x_n\|^2\, x_n \otimes x_n\big] \preccurlyeq R^2 \Sigma$, where $A \preccurlyeq B$ means that $B - A$ is positive semi-definite. This assumption is satisfied, for example, for least-squares regression with almost surely bounded data.

Eigenvalue decay: Most convergence bounds depend on the dimension $d$ of $H$. However, it is possible to derive dimension-free, and often tighter, convergence rates by considering bounds depending on the value $\operatorname{tr} \Sigma^b$ for $b \in [0, 1]$. Given $b$, if we consider the eigenvalues of $\Sigma$ ordered in decreasing order, which we denote by $s_i$, then $\operatorname{tr} \Sigma^b = \sum_i s_i^b$, which captures how fast the eigenvalues decay. For $b$ going to $0$, $\operatorname{tr} \Sigma^b$ tends to $d$ and we are back in the classical low-dimensional case. When $b = 1$, we simply get $\operatorname{tr} \Sigma = \mathbb{E}\|x_n\|^2$, which will correspond to the weakest assumption in our context.

Optimal predictor: The regression function $f(\theta) = \frac{1}{2}\,\mathbb{E}\big(\langle x_n, \theta\rangle - y_n\big)^2$ always admits a global minimizer $\theta_* = \Sigma^{-1}\mathbb{E}(y_n x_n)$. When initializing algorithms at $\theta_0 = 0$ or regularizing by the squared norm, rates of convergence generally depend on $\|\theta_*\|$, a quantity which could be arbitrarily large. However, there exists a systematic upper bound $\|\Sigma^{1/2}\theta_*\|^2 \le \mathbb{E} y_n^2$. This leads naturally to the consideration of convergence bounds depending on $\|\Sigma^{r/2}\theta_*\|$ for $r \le 1$.

Noise: We denote by $\varepsilon_n = y_n - \langle\theta_*, x_n\rangle$ the residual, for which we have $\mathbb{E}[\varepsilon_n x_n] = 0$. Although we do not have $\mathbb{E}[\varepsilon_n \mid x_n] = 0$ in general, unless the model is well-specified, we assume the noise to be a structured process such that there exists $\sigma > 0$ with $\mathbb{E}[\varepsilon_n^2\, x_n \otimes x_n] \preccurlyeq \sigma^2 \Sigma$. This assumption is satisfied, for example, for data almost surely bounded or when the model is well-specified.

Averaged Gradient Methods and Acceleration. We focus in this paper on stochastic gradient methods with acceleration for a quadratic function regularized by $\frac{\lambda}{2}\|\theta - \theta_0\|^2$. The regularization will be useful when deriving tighter convergence rates in Section 4, and it has the additional benefit of making the problem $\lambda$-strongly convex. Accelerated stochastic gradient descent is defined by an iterative system with two parameters $(\theta_n, \nu_n)$ starting from $\theta_0 = \nu_0 \in H$ and satisfying, for $n \ge 1$,

$$\theta_n = \nu_{n-1} - \gamma\, f'_n(\nu_{n-1}) - \gamma\lambda(\nu_{n-1} - \theta_0), \qquad \nu_n = \theta_n + \delta_n\,(\theta_n - \theta_{n-1}), \qquad (1)$$

with $(\gamma, \delta_n) \in \mathbb{R}^2$ and $f'_n$ an unbiased estimate of the gradient $f'$. The momentum coefficient $\delta_n \in \mathbb{R}$ is chosen to accelerate the convergence rate (Nesterov, 1983; Beck and Teboulle, 2009) and has its roots in the heavy-ball algorithm of Polyak (1964). We especially concentrate here, following Polyak and Juditsky (1992), on the average of the sequence, $\bar\theta_n = \frac{1}{n+1}\sum_{i=0}^{n} \theta_i$.
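To make recursion (1) and the averaging step concrete, here is a minimal NumPy sketch; the oracle interface grad_oracle(theta, k) and the use of a constant momentum coefficient delta are illustrative assumptions (the paper also uses iteration-dependent $\delta_n$), not a prescription from the text.

    import numpy as np

    def av_acc_sgd(grad_oracle, theta0, gamma, delta, lam, n_iters):
        """Sketch of recursion (1) with Polyak-Ruppert averaging.

        grad_oracle(theta, k) is assumed to return an unbiased estimate of
        f'(theta) at iteration k; gamma is the step-size, delta the momentum
        coefficient, and lam the regularization parameter lambda.
        """
        theta0 = np.asarray(theta0, dtype=float)
        theta_prev = theta0.copy()                   # theta_{k-1}
        theta = theta0.copy()                        # theta_k
        nu = theta0.copy()                           # nu_k
        theta_bar = theta0.copy()                    # running average bar{theta}_k
        for k in range(1, n_iters + 1):
            grad = grad_oracle(nu, k) + lam * (nu - theta0)   # regularized stochastic gradient at nu_{k-1}
            theta_prev, theta = theta, nu - gamma * grad      # theta_k
            nu = theta + delta * (theta - theta_prev)         # nu_k = theta_k + delta (theta_k - theta_{k-1})
            theta_bar += (theta - theta_bar) / (k + 1)        # bar{theta}_k = (1/(k+1)) sum_{i<=k} theta_i
        return theta_bar

Setting lam = 0 and delta = 1 gives the averaged accelerated scheme analyzed in Section 3; Section 4 instead takes $\lambda = (\gamma(n+1)^2)^{-1}$ and an iteration-dependent $\delta_n$.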

Stochastic Oracles on the Gradient. Let $(\mathcal{F}_n)_{n \ge 0}$ be the increasing family of $\sigma$-fields generated by all variables $(x_i, y_i)$ for $i \le n$. The oracle we consider is the sum of the true gradient $f'(\theta)$ and an independent zero-mean noise that does not depend on $\theta$ (this is different from the oracle usually considered in stochastic approximation; see Bach and Moulines (2013) and Dieuleveut and Bach (2015)). Consequently it is of the form $f'_n(\theta) = f'(\theta) - \xi_n$, where the noise process $\xi_n$ is $\mathcal{F}_n$-measurable with $\mathbb{E}[\xi_n \mid \mathcal{F}_{n-1}] = 0$ and $\mathbb{E}[\|\xi_n\|^2]$ finite. Furthermore, we also assume that there exists $\tau \in \mathbb{R}$ such that $\mathbb{E}[\xi_n \otimes \xi_n] \preccurlyeq \tau^2 \Sigma$, that is, the noise has a particular structure adapted to least-squares regression.

3 Accelerated Stochastic Averaged Gradient Descent

We study the convergence of averaged accelerated stochastic gradient descent defined by Eq. (1) for $\lambda = 0$ and $\delta = 1$. It can be rewritten for the quadratic function $f$ as a second-order iterative system with constant coefficients: $\theta_n = [I - \gamma\Sigma](2\theta_{n-1} - \theta_{n-2}) + \gamma\, y_n x_n$.

Theorem 1 For any constant step-size $\gamma$ such that $\gamma\Sigma \preccurlyeq I$,

$$\mathbb{E} f(\bar\theta_n) - f(\theta_*) \le \frac{36\,\|\theta_0 - \theta_*\|^2}{\gamma\,(n+1)^2} + \frac{8\,\tau^2 d}{n+1}. \qquad (2)$$

We can make the following observations:

- The first bound $\frac{1}{\gamma n^2}\|\theta_0 - \theta_*\|^2$ in Eq. (2) corresponds to the usual accelerated rate. It has been shown by Nesterov (2004) to be the optimal rate of convergence for optimizing a quadratic function with a first-order method that can only access sequences of gradients, when $n \le d$. We recover, by averaging an algorithm dedicated to strongly-convex functions, the traditional convergence rate for non-strongly-convex functions.

- The second bound in Eq. (2) matches the optimal statistical performance $\tau^2 d / n$ over all estimators in $H$ (Tsybakov, 2008), even without computational limits, in the sense that no estimator that uses the same information can improve upon this rate. Accordingly, this algorithm achieves joint bias/variance optimality (when measured in terms of $\tau^2$ and $\|\theta_0 - \theta_*\|^2$).

- We have the same rate of convergence for the bias when compared to the regular Nesterov acceleration without averaging studied by Flammarion and Bach (2015), which corresponds to choosing $\delta_n = 1 - 2/n$ for all $n$. However, if the problem is $\mu$-strongly convex, the latter was shown to also converge at the linear rate $O\big((1 - \sqrt{\gamma\mu})^n\big)$ and is thus adaptive to hidden strong convexity (since the algorithm does not need to know $\mu$ to run), hence it ends up converging faster than the rate $1/n^2$. This is confirmed in our experiments in Section 5.

Overall, the bias term is improved whereas the variance term is not degraded, and acceleration is thus robust to noise in the gradients. Thereby, while second-order iterative methods for optimizing quadratic functions in the singular case, such as conjugate gradient (Polyak, 1987, Section 6.1), are notoriously highly sensitive to noise, we are able to propose a version which is robust to stochastic noise.

4 Tighter Convergence Rates

We have seen in Theorem 1 above that the averaged accelerated gradient algorithm matches the lower bounds $\tau^2 d / n$ and $L\|\theta_0 - \theta_*\|^2 / n^2$ for the prediction error. However, the algorithm performs better in almost all cases except the worst-case scenarios corresponding to the lower bounds. For example, the algorithm may still predict well when the dimension $d$ is much bigger than $n$. Similarly, the norm of the optimal predictor $\|\theta_*\|^2$ may be huge and the prediction still good, as gradient algorithms happen to be adaptive to the difficulty of the problem: indeed, if the problem is simpler, the convergence rate of the gradient algorithm will be improved. In this section, we provide such a theoretical guarantee. We study the convergence of averaged accelerated stochastic gradient descent defined by Eq. (1) for $\lambda = (\gamma(n+1)^2)^{-1}$ and $\delta_n \in [1 - 2/(n+2), 1]$. We have the following theorem:

Theorem 2 For any constant step-size $\gamma$ such that $\gamma(\Sigma + \lambda I) \preccurlyeq I$,

$$\mathbb{E} f(\bar\theta_n) - f(\theta_*) \le \min_{r \in [0,1],\, b \in [0,1]} \left[ \frac{74\,\|\Sigma^{r/2}(\theta_0 - \theta_*)\|^2}{\gamma^{1-r}\,(n+1)^{2(1-r)}} + \frac{8\,\tau^2\, \gamma^{b}\operatorname{tr}(\Sigma^{b})}{(n+1)^{1-b}} \right].$$

We can make the following observations:

- The algorithm is independent of $r$ and $b$, thus all the bounds for different values of $(r, b)$ are valid. This is a strong property of the algorithm, which is indeed adaptive to the regularity and the effective dimension of the problem (once $\gamma$ is chosen).

- In situations in which either $d$ is larger than $n$ or $L\|\theta_0 - \theta_*\|^2$ is larger than $n^2$, the algorithm can still enjoy good convergence properties, by adapting to the best values of $b$ and $r$.

- For $b = 0$ we recover the variance term of Theorem 1, but for $b > 0$ and fast decays of the eigenvalues of $\Sigma$, the bound may be much smaller; note that we lose in the dependency on $n$, but typically, for large $d$, this can be advantageous.

- With $r$ and $b$ well chosen, we recover the optimal rate for non-parametric regression (Caponnetto and De Vito, 2007).

5 Experiments

We now illustrate our theoretical results on synthetic examples. For $d = 25$ we consider normally distributed inputs $x_n$ with random covariance matrix $\Sigma$ which has eigenvalues $1/i^3$, for $i = 1, \dots, d$, and random optimum $\theta_*$ and starting point $\theta_0$ such that $\|\theta_0 - \theta_*\| = 1$. The outputs $y_n$ are generated from a linear function with homoscedastic noise with unit signal-to-noise ratio ($\sigma^2 = 1$); we take $R^2 = \operatorname{tr} \Sigma$, the average radius of the data, a step-size $\gamma = 1/R^2$, and $\lambda = 0$. The additive noise oracle is used. We show results averaged over 10 replications.

We compare the performance of averaged SGD (AvSGD), the usual Nesterov acceleration for convex functions (AccSGD), and our novel averaged accelerated SGD (AvAccSGD, which is not the averaging of AccSGD because the momentum term is proportional to $1 - 3/n$ for AccSGD instead of being equal to $1$ for AvAccSGD), on two different problems: one deterministic ($\|\theta_0 - \theta_*\| = 1$, $\sigma^2 = 0$), which will illustrate how the bias term behaves, and one purely stochastic ($\|\theta_0 - \theta_*\| = 0$, $\sigma^2 = 1$), which will illustrate how the variance term behaves.

[Figure 1: Synthetic problem ($d = 25$) and $\gamma = 1/R^2$. Left: Bias. Right: Variance. Both panels plot $\log_{10}[f(\theta) - f(\theta_*)]$ against $\log_{10}(n)$ for AvAccSGD, AccSGD and AvSGD.]

For the bias (left plot of Figure 1), AvSGD converges at speed $O(1/n)$, while AvAccSGD and AccSGD both converge at speed $O(1/n^2)$. However, as mentioned in the observations following Theorem 1, AccSGD takes advantage of the hidden strong convexity of the quadratic function and starts converging linearly at the end. For the variance (right plot of Figure 1), AccSGD does not converge to the optimum and keeps oscillating, whereas AvSGD and AvAccSGD both converge to the optimum at a speed $O(1/n)$. However, AvSGD remains slightly faster in the beginning. Note that for small $n$, or when the bias $L\|\theta_0 - \theta_*\|^2 / n^2$ is much bigger than the variance $\sigma^2 d / n$, the bias may have a stronger effect, although asymptotically the variance always dominates. It is thus essential to have an algorithm which is optimal in both regimes; this is achieved by AvAccSGD.
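The following sketch shows one way to reproduce the variance-dominated synthetic setup ($d = 25$, eigenvalues $1/i^3$, $\gamma = 1/R^2$, additive structured noise with $\sigma^2 = 1$), reusing the av_acc_sgd routine sketched after Eq. (1); the random eigenbasis, the seed, and the choice $\xi_k = \sigma\,\varepsilon_k x_k$ for the structured noise are illustrative assumptions rather than the exact protocol of the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n_iters, sigma = 25, 100_000, 1.0
    eigvals = 1.0 / np.arange(1, d + 1) ** 3
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))        # random orthonormal eigenbasis
    Sigma = Q @ np.diag(eigvals) @ Q.T                       # covariance with eigenvalues 1/i^3
    Sigma_half = Q @ np.diag(np.sqrt(eigvals)) @ Q.T         # Sigma^{1/2}, to sample x ~ N(0, Sigma)

    theta_star = rng.standard_normal(d)                      # random optimum
    theta0 = theta_star.copy()                               # variance-only run: ||theta_0 - theta_*|| = 0
    gamma = 1.0 / np.trace(Sigma)                            # step-size gamma = 1/R^2 with R^2 = tr(Sigma)

    def grad_oracle(theta, k):
        # Additive-noise oracle: true gradient Sigma (theta - theta_*) minus a
        # structured noise xi_k = sigma * eps_k * x_k, so that E[xi xi^T] = sigma^2 Sigma.
        x = Sigma_half @ rng.standard_normal(d)
        xi = sigma * rng.standard_normal() * x
        return Sigma @ (theta - theta_star) - xi

    theta_bar = av_acc_sgd(grad_oracle, theta0, gamma, delta=1.0, lam=0.0, n_iters=n_iters)
    excess_risk = 0.5 * (theta_bar - theta_star) @ Sigma @ (theta_bar - theta_star)
    print(f"f(theta_bar) - f(theta_*) after {n_iters} iterations: {excess_risk:.2e}")

Taking instead $\theta_0$ at distance $1$ from $\theta_*$ and sigma = 0 gives the deterministic (bias-only) run of the left panel of Figure 1.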

References

Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems.

Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183-202.

Caponnetto, A. and De Vito, E. (2007). Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331-368.

d'Aspremont, A. (2008). Smooth optimization with approximate gradient. SIAM J. Optim., 19(3):1171-1183.

Devolder, O., Glineur, F., and Nesterov, Y. (2014). First-order methods of smooth convex optimization with inexact oracle. Math. Program., 146(1-2, Ser. A):37-75.

Dieuleveut, A. and Bach, F. (2015). Non-parametric stochastic approximation with large step sizes. Annals of Statistics. To appear.

Flammarion, N. and Bach, F. (2015). From averaging to acceleration, there is only a step-size. In Proceedings of the International Conference on Learning Theory (COLT).

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer Series in Statistics. Springer, second edition.

Hsu, D., Kakade, S. M., and Zhang, T. (2014). Random design analysis of ridge regression. Foundations of Computational Mathematics, 14(3):569-600.

McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. Monographs on Statistics and Applied Probability. Chapman & Hall, London, second edition.

Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27(2):372-376.

Nesterov, Y. (2004). Introductory Lectures on Convex Optimization, volume 87 of Applied Optimization. Kluwer Academic Publishers, Boston, MA.

Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1-17.

Polyak, B. T. (1987). Introduction to Optimization. Translations Series in Mathematics and Engineering. Optimization Software, Inc., Publications Division, New York.

Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838-855.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400-407.

Schmidt, M., Le Roux, N., and Bach, F. (2011). Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems.

Tsybakov, A. B. (2003). Optimal rates of aggregation. In Proceedings of the Annual Conference on Computational Learning Theory.

Tsybakov, A. B. (2008). Introduction to Nonparametric Estimation. Springer.