Statistical Optimality of Stochastic Gradient Descent through Multiple Passes


Francis Bach, INRIA - École Normale Supérieure, Paris, France
Joint work with Loucas Pillaud-Vivien and Alessandro Rudi
Newton Institute, Cambridge - June 2018

Two-minute summary
- Stochastic gradient descent (SGD) for large-scale machine learning: processes observations one by one.
- Theory: single-pass SGD is optimal, but only for easy problems.
- Practice: multiple-pass SGD always works better.
- Provable for hard problems, with a quantification of the required number of passes and optimal statistical performance.
- Source and capacity conditions from kernel methods.

Least-squares regression in finite dimension
- Data: $n$ observations $(x_i, y_i) \in \mathcal{X} \times \mathbb{R}$, $i = 1, \dots, n$, i.i.d.
- Prediction as linear functions $\langle \theta, \Phi(x)\rangle$ of features $\Phi(x) \in \mathcal{H} = \mathbb{R}^d$.
- Optimal prediction: $\theta_* \in \mathcal{H}$ minimizing $F(\theta) = \mathbb{E}\,(y - \langle \theta, \Phi(x)\rangle)^2$.
- Assumption: $\|\Phi(x)\| \leq R$ almost surely.
- Assumption: $|y| \leq M$ and $|y - \langle \theta_*, \Phi(x)\rangle| \leq \sigma$ almost surely.
- Statistical performance of an estimator $\hat\theta$ measured by the excess risk $\mathbb{E} F(\hat\theta) - F(\theta_*)$.
- Finite dimension: optimal rate $\sigma^2 \dim(\mathcal{H}) / n = \sigma^2 d / n$, attained by empirical risk minimization (ERM) and by SGD (see the sketch below).
- What if $n \ll \dim(\mathcal{H})$? Needs assumptions on $\Sigma = \mathbb{E}\big[\Phi(x) \otimes \Phi(x)\big]$ and $\theta_*$.
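
To make the $\sigma^2 d / n$ rate concrete, here is a minimal synthetic check of the excess risk of ERM; the isotropic Gaussian features, the choice of $\theta_*$ and all constants are illustrative assumptions, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 10, 0.5
theta_star = rng.standard_normal(d) / np.sqrt(d)

def excess_risk_erm(n):
    """Excess risk F(theta_hat) - F(theta_*) of ordinary least squares
    on n samples with isotropic Gaussian features (so Sigma = I)."""
    X = rng.standard_normal((n, d))
    y = X @ theta_star + sigma * rng.standard_normal(n)
    theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    # For Sigma = I, F(theta) - F(theta_*) = ||theta - theta_*||^2
    return np.sum((theta_hat - theta_star) ** 2)

for n in [100, 1000, 10000]:
    err = np.mean([excess_risk_erm(n) for _ in range(50)])
    print(f"n={n:6d}  excess risk ~ {err:.2e}   sigma^2 d / n = {sigma**2 * d / n:.2e}")
```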

Spectrum of the covariance matrix $\Sigma = \mathbb{E}\big[\Phi(x) \otimes \Phi(x)\big]$
- Eigenvalues $\lambda_m(\Sigma)$ (in decreasing order).
- Example: news dataset. [Figure: $\log \lambda_m$ versus $\log m$, showing a roughly linear decay on the log-log scale.]
- Assumption: $\mathrm{tr}(\Sigma^{1/\alpha}) = \sum_{m \geq 1} \lambda_m(\Sigma)^{1/\alpha}$ is small (compared to $n$); see the diagnostic sketch below.
- Equivalent to $\lambda_m(\Sigma) = O(m^{-\alpha})$.
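
A small diagnostic one can run to check this capacity condition empirically: estimate the decay exponent from a log-log fit of the empirical eigenvalues and evaluate $\mathrm{tr}(\Sigma^{1/\alpha})$ over a grid. The helper name `capacity_diagnostics` and the synthetic features with a polynomial spectrum are assumptions for the demo.

```python
import numpy as np

def capacity_diagnostics(X, alpha_grid=np.linspace(1.0, 5.0, 41)):
    """Eigenvalue decay of the empirical covariance and tr(Sigma^{1/alpha}).

    X: (n, d) feature matrix.  Returns the log-log slope of the eigenvalues
    (roughly -alpha if lambda_m ~ m^{-alpha}) and tr(Sigma^{1/alpha}) on a grid.
    """
    Sigma = X.T @ X / X.shape[0]
    lam = np.sort(np.linalg.eigvalsh(Sigma))[::-1]
    lam = lam[lam > 1e-12]                             # drop numerically zero eigenvalues
    m = np.arange(1, lam.size + 1)
    slope = np.polyfit(np.log(m), np.log(lam), 1)[0]   # ~ -alpha
    traces = {a: np.sum(lam ** (1.0 / a)) for a in alpha_grid}
    return slope, traces

# Illustrative use on synthetic features whose covariance has eigenvalues m^{-alpha}.
rng = np.random.default_rng(0)
d, n, alpha_true = 200, 5000, 2.0
scales = np.arange(1, d + 1) ** (-alpha_true / 2)
X = rng.standard_normal((n, d)) * scales
slope, traces = capacity_diagnostics(X)
print("estimated decay exponent:", -slope)
```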

Difficulty of the learning problem
- Measured through the norm of $\theta_*$.
- Assumption: $\|\Sigma^{1/2 - r}\theta_*\|$ is small (compared to $n$); see the toy model sketched below.
- $r = 1/2$: the usual assumption that $\|\theta_*\|$ is small.
- Larger $r$: simpler problems; smaller $r$: harder problems ($r = 0$ is always true).
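
A diagonal toy model illustrating how the source condition selects admissible values of $r$; the spectrum exponent $\alpha = 2$ and the coefficient decay $\delta = 0.25$ are illustrative assumptions.

```python
import numpy as np

# Diagonal toy model: Sigma = diag(lam_m) with lam_m = m^{-alpha}, and
# theta_* with coefficients m^{-delta}.  Then
#   ||Sigma^{1/2-r} theta_*||^2 = sum_m lam_m^{1-2r} theta_m^2,
# which is finite only for r below a threshold set by alpha and delta:
# this is how r quantifies the difficulty of the problem.
alpha, delta = 2.0, 0.25

def source_norm2(r, M):
    m = np.arange(1, M + 1)
    return np.sum(m ** (-alpha * (1 - 2 * r)) * m ** (-2 * delta))

for r in [0.0, 0.25, 0.375, 0.5]:
    a, b = source_norm2(r, 10**5), source_norm2(r, 10**6)
    print(f"r = {r:5.3f}   truncated norm^2: {a:10.3f} (M=1e5)   {b:10.3f} (M=1e6)")
# If the two truncations agree, the norm is essentially finite; if the value
# keeps growing with M, that value of r is too large for this theta_*.
```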

Optimal statistical performance
- Easy problems, $r \geq \frac{\alpha - 1}{2\alpha}$: the optimal rate is $O(n^{-2r\alpha/(2r\alpha+1)})$, achieved by:
  - regularized ERM (Caponnetto and De Vito, 2007),
  - early-stopped gradient descent (Yao et al., 2007),
  - single-pass averaged SGD (Dieuleveut and Bach, 2016).
- Hard problems, $r \leq \frac{\alpha - 1}{2\alpha}$: lower bound $O(n^{-2r\alpha/(2r\alpha+1)})$; known upper bound $O(n^{-2r})$.

Least-mean-squares (LMS) algorithm
- Least-squares: $F(\theta) = \tfrac{1}{2}\,\mathbb{E}\big[(y - \langle \Phi(x), \theta\rangle)^2\big]$ with $\theta \in \mathbb{R}^d$.
- SGD = least-mean-squares algorithm (see, e.g., Macchi, 1995).
- Iteration: $\theta_i = \theta_{i-1} - \gamma\,\big(\langle \Phi(x_i), \theta_{i-1}\rangle - y_i\big)\,\Phi(x_i)$ (see the sketch below).
- New analysis for averaging and constant step size $\gamma = 1/(4R^2)$ (Bach and Moulines, 2013).
- Assume $\|\Phi(x)\| \leq R$ and $|y - \langle \Phi(x), \theta_*\rangle| \leq \sigma$ almost surely; no assumption on the lowest eigenvalues of $\Sigma$.
- Main result: $\mathbb{E} F(\bar\theta_n) - F(\theta_*) \leq \dfrac{4\sigma^2 d}{n} + \dfrac{4 R^2 \|\theta_0 - \theta_*\|^2}{n}$.
- Matches the statistical lower bound (Tsybakov, 2003).
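
A minimal sketch of the averaged constant-step-size LMS recursion with $\gamma = 1/(4R^2)$, on synthetic data (an assumption of the demo), so that the observed excess risk can be compared with the bound above.

```python
import numpy as np

def averaged_sgd_ls(X, y, gamma):
    """Constant-step-size LMS with iterate averaging; returns theta_bar_n."""
    n, d = X.shape
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for i in range(n):
        xi, yi = X[i], y[i]
        theta -= gamma * (xi @ theta - yi) * xi      # theta_i = theta_{i-1} - gamma (<x_i, theta> - y_i) x_i
        theta_bar += (theta - theta_bar) / (i + 1)   # running average of the iterates
    return theta_bar

rng = np.random.default_rng(0)
n, d, sigma = 20000, 20, 0.5
theta_star = rng.standard_normal(d) / np.sqrt(d)
X = rng.standard_normal((n, d)) / np.sqrt(d)         # so that ||Phi(x)||^2 is about R^2 = 1
y = X @ theta_star + sigma * rng.standard_normal(n)

R2 = np.mean(np.sum(X**2, axis=1))                   # estimate of R^2
theta_bar = averaged_sgd_ls(X, y, gamma=1.0 / (4 * R2))
Sigma = X.T @ X / n
excess = (theta_bar - theta_star) @ Sigma @ (theta_bar - theta_star)
bound = 4 * sigma**2 * d / n + 4 * R2 * np.sum(theta_star**2) / n   # theta_0 = 0
print("excess risk:", excess, "   bound:", bound)
```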

Markov chain interpretation of constant step sizes
- LMS recursion: $\theta_i = \theta_{i-1} - \gamma\,\big(\langle \Phi(x_i), \theta_{i-1}\rangle - y_i\big)\,\Phi(x_i)$.
- The sequence $(\theta_i)_i$ is a homogeneous Markov chain: it converges to a stationary distribution $\pi_\gamma$ with expectation $\bar\theta_\gamma := \int \theta\, \pi_\gamma(\mathrm{d}\theta)$.
- For least-squares, $\bar\theta_\gamma = \theta_*$.
- $\theta_n$ does not converge to $\theta_*$ but oscillates around it (see the experiment below).
- Ergodic theorem: the averaged iterates converge to $\bar\theta_\gamma = \theta_*$ at rate $O(1/n)$.
- See Dieuleveut, Durmus, and Bach (2017) for more details.
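
A small experiment, on an assumed synthetic model, contrasting the last iterate (which stays at a $\gamma$-dependent level around $\theta_*$) with the averaged iterate (which keeps improving).

```python
import numpy as np

# Compare the last SGD iterate (which oscillates in a ball around theta_*)
# with the averaged iterate (which converges), on a small synthetic problem.
rng = np.random.default_rng(1)
n, d, sigma, gamma = 50000, 5, 0.5, 0.1
theta_star = rng.standard_normal(d)
theta = np.zeros(d)
theta_bar = np.zeros(d)
last_err, avg_err = [], []
for i in range(1, n + 1):
    x = rng.standard_normal(d) / np.sqrt(d)
    y = x @ theta_star + sigma * rng.standard_normal()
    theta -= gamma * (x @ theta - y) * x
    theta_bar += (theta - theta_bar) / i
    if i % 10000 == 0:
        last_err.append(np.sum((theta - theta_star) ** 2))
        avg_err.append(np.sum((theta_bar - theta_star) ** 2))
print("last iterate errors :", np.round(last_err, 4))   # stays at a gamma-dependent level
print("averaged errors     :", np.round(avg_err, 5))    # keeps decreasing, roughly like 1/n
```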

Simulations - synthetic examples
- Gaussian distributions, $d = 20$.
- [Figure: $\log_{10}[F(\theta) - F(\theta_*)]$ versus $\log_{10}(n)$ for constant step sizes $1/(2R^2)$, $1/(8R^2)$, $1/(32R^2)$, compared with the $n^{-1}$ rate.]

Simulations - benchmarks
- alpha dataset ($d = 500$) and news dataset.
- [Figure: $\log_{10}[F(\theta) - F(\theta_*)]$ versus $\log_{10}(n)$ on both datasets, comparing constant-step-size averaged SGD (step size $1/R^2$), the $n^{-1/2}$ rate, and SAG.]

Optimal bounds for least-squares?
- Least-squares: cannot beat $\sigma^2 d / n$ (Tsybakov, 2003). Really?
- What if $d \gg n$? Needs assumptions on $\Sigma = \mathbb{E}\big[\Phi(x) \otimes \Phi(x)\big]$ and $\theta_*$.

Finer assumptions (Dieuleveut and Bach, 2016)
- Covariance eigenvalues. Pessimistic assumption: all eigenvalues $\lambda_m$ bounded by a constant. Actual decay: $\lambda_m = o(m^{-\alpha})$ with $\mathrm{tr}\,\Sigma^{1/\alpha} = \sum_m \lambda_m^{1/\alpha}$ small.
- New result: replace $\dfrac{\sigma^2 d}{n}$ by $\dfrac{\sigma^2\, (\gamma n)^{1/\alpha}\, \mathrm{tr}\,\Sigma^{1/\alpha}}{n}$.
- [Figure: eigenvalue decay $\log \lambda_m$ versus $\log m$, and the quantity $(\gamma n)^{1/\alpha}\, \mathrm{tr}\,\Sigma^{1/\alpha} / n$ as a function of $\alpha$ on the alpha dataset.]
- Optimal predictor. Pessimistic assumption: $\|\theta_0 - \theta_*\|^2$ finite/small. Finer assumption: $\|\Sigma^{1/2 - r}(\theta_0 - \theta_*)\|^2$ small, for $r \in [0, 1]$. Always satisfied for $r = 0$ and $\theta_0 = 0$, since $\|\Sigma^{1/2}\theta_*\|^2 \leq \mathbb{E}\, y^2$.
- New result: replace $\dfrac{\|\theta_0 - \theta_*\|^2}{\gamma n}$ by $\dfrac{\|\Sigma^{1/2 - r}(\theta_0 - \theta_*)\|^2}{\gamma^{2r} n^{2r}}$.

Optimal bounds for least-squares?
- Least-squares: cannot beat $\sigma^2 d / n$ (Tsybakov, 2003). Really? What if $d \gg n$?
- Refined assumptions with adaptivity (Dieuleveut and Bach, 2016), beyond strong convexity or lack thereof:
  $\mathbb{E} F(\bar\theta_n) - F(\theta_*) \leq \inf_{\alpha \geq 1,\ r \in [0,1]} \ \dfrac{4\sigma^2\, \mathrm{tr}\,\Sigma^{1/\alpha}\, (\gamma n)^{1/\alpha}}{n} + \dfrac{4\, \|\Sigma^{1/2 - r}\theta_*\|^2}{\gamma^{2r} n^{2r}}$
  (the two terms are evaluated numerically in the sketch below).
- Previous results correspond to $\alpha = +\infty$ and $r = 1/2$.
- The optimal step size $\gamma$ potentially decays with $n$, but depends on the usually unknown quantities $\alpha$ and $r$: no adaptivity (yet).
- Extension to non-parametric estimation (using kernels) with optimal rates when $r \geq (\alpha - 1)/(2\alpha)$, still with $O(n^2)$ running time.
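
To see the bias/variance trade-off in $\gamma$, one can evaluate the two terms of this bound on a diagonal toy spectrum; all numerical choices below (spectrum, coefficients, constants, truncation) are illustrative assumptions.

```python
import numpy as np

# Evaluate the two terms of the bound
#   4 sigma^2 tr(Sigma^{1/alpha}) (gamma n)^{1/alpha} / n
#     + 4 ||Sigma^{1/2-r} theta_*||^2 / (gamma^{2r} n^{2r})
# on a diagonal toy spectrum, as a function of gamma.
alpha, r, sigma, n = 2.0, 0.25, 0.5, 10**5
m = np.arange(1, 10**5 + 1)
lam = m ** (-2.2)            # decays a bit faster than m^{-alpha}, so tr(Sigma^{1/alpha}) is finite
theta_star = m ** (-1.0)     # illustrative coefficients

tr_cap = np.sum(lam ** (1.0 / alpha))                  # tr(Sigma^{1/alpha}), truncated
src = np.sum(lam ** (1 - 2 * r) * theta_star ** 2)     # ||Sigma^{1/2-r} theta_*||^2, truncated

for gamma in [1e-3, 1e-2, 1e-1, 1.0]:
    variance = 4 * sigma**2 * tr_cap * (gamma * n) ** (1.0 / alpha) / n
    bias = 4 * src / (gamma ** (2 * r) * n ** (2 * r))
    print(f"gamma={gamma:7.3f}  variance={variance:.3e}  bias={bias:.3e}  total={variance + bias:.3e}")
```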

From least-squares to non-parametric estimation
- Extension to Hilbert spaces: $\langle \Phi(x), \theta\rangle$ with $\theta \in \mathcal{H}$, and $\theta_i = \theta_{i-1} - \gamma\,\big(\langle \Phi(x_i), \theta_{i-1}\rangle - y_i\big)\,\Phi(x_i)$.
- If $\theta_0 = 0$, then $\theta_i$ is a linear combination of $\Phi(x_1), \dots, \Phi(x_i)$: $\theta_i = \sum_{k=1}^{i} a_k \Phi(x_k)$ with $a_i = -\gamma \sum_{k=1}^{i-1} a_k \langle \Phi(x_k), \Phi(x_i)\rangle + \gamma\, y_i$.
- Kernel trick: $k(x, x') = \langle \Phi(x), \Phi(x')\rangle$.
- Reproducing kernel Hilbert spaces and non-parametric estimation; see, e.g., Schölkopf and Smola (2001); Shawe-Taylor and Cristianini (2004); Dieuleveut and Bach (2016).
- Still $O(n^2)$ overall running time (see the sketch below).
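
A sketch of this single-pass kernel recursion written directly on the coefficients $a_i$; the Gaussian kernel, the step size and the synthetic data are illustrative choices, and any positive-definite kernel could be plugged in.

```python
import numpy as np

def kernel_sgd_single_pass(X, y, kernel, gamma):
    """Single-pass kernel LMS: theta_i = sum_{k<=i} a_k Phi(x_k), with
    a_i = gamma * (y_i - sum_{k<i} a_k k(x_k, x_i))."""
    n = X.shape[0]
    a = np.zeros(n)
    for i in range(n):
        pred = sum(a[k] * kernel(X[k], X[i]) for k in range(i))   # <theta_{i-1}, Phi(x_i)>
        a[i] = gamma * (y[i] - pred)
    return a    # prediction at x: sum_k a[k] * kernel(X[k], x)

# Illustrative use with a Gaussian kernel.
rng = np.random.default_rng(0)
n = 300
X = rng.uniform(0, 1, size=(n, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(n)
gauss = lambda u, v: np.exp(-np.sum((u - v) ** 2) / 0.1)
a = kernel_sgd_single_pass(X, y, gauss, gamma=0.25)
x_test = np.array([0.3])
print("prediction at 0.3:", sum(a[k] * gauss(X[k], x_test) for k in range(n)))
```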

Example: Sobolev spaces in one dimension
- $\mathcal{X} = [0, 1]$, functions represented through their Fourier series.
- Weighted Fourier basis $\Phi(x)_m = \lambda_m^{1/2} \cos(2\pi m x)$ (plus sines); kernel $k(x, x') = \sum_m \lambda_m \cos\big[2\pi m (x - x')\big]$ (see the sketch below).
- $\lambda_m \asymp m^{-\alpha}$ corresponds to a Sobolev penalty on $f_\theta(x) = \langle \theta, \Phi(x)\rangle$:
  $\|f_\theta\|^2 = \|\theta\|^2 = \sum_m |\mathrm{Fourier}(f_\theta)_m|^2\, \lambda_m^{-1} \asymp \int_0^1 |f_\theta^{(\alpha/2)}(x)|^2\, \mathrm{d}x$.
- The adapted norm $\|\Sigma^{1/2 - r}\theta\|^2$ depends on the regularity of $f_\theta$:
  $\|\Sigma^{1/2 - r}\theta\|^2 = \sum_m |\mathrm{Fourier}(f_\theta)_m|^2\, \lambda_m^{-2r} \asymp \int_0^1 |f_\theta^{(r\alpha)}(x)|^2\, \mathrm{d}x$.
- Optimal rate: $O(n^{-2r\alpha/(2r\alpha+1)})$.
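
A truncated-Fourier-series sketch of this kernel; the truncation level M is an assumption of the demo.

```python
import numpy as np

def sobolev_kernel(x, xp, alpha, M=2000):
    """Truncated version of k(x, x') = sum_m lam_m cos(2 pi m (x - x'))
    with lam_m = m^{-alpha} (Sobolev-type kernel on [0, 1])."""
    m = np.arange(1, M + 1)
    return np.sum(m ** (-alpha) * np.cos(2 * np.pi * m * (x - xp)))

# The associated feature map stacks lam_m^{1/2} cos(2 pi m x) and
# lam_m^{1/2} sin(2 pi m x); the kernel above is exactly its inner product.
print(sobolev_kernel(0.2, 0.5, alpha=2.0))
```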

New assumption needed
- Assumption: $\|\Sigma^{\mu/2 - 1/2}\,\Phi(x)\|$ is almost surely bounded (small); already used by Steinwart et al. (2009).
- True for $\mu = 1$; usually $\mu \geq 1/\alpha$ (with equality for Sobolev spaces).
- Relationship between the $L_\infty$ norm and the RKHS norm: $\|g\|_{L_\infty} = O\big(\|g\|_{\mathcal{H}}^{\mu}\, \|g\|_{L_2}^{1 - \mu}\big)$.
- NB: implies bounded leverage scores (Rudi et al., 2015).

Multiple-pass SGD (sampling with replacement)
- Algorithm from $n$ i.i.d. observations $(x_i, y_i)$, $i = 1, \dots, n$:
  $\theta_u = \theta_{u-1} + \gamma\,\big(y_{i(u)} - \langle \theta_{u-1}, \Phi(x_{i(u)})\rangle\big)\,\Phi(x_{i(u)})$,
  with $\bar\theta_t$ the averaged iterate after $t \geq n$ iterations (see the sketch below).
- Theorem (Pillaud-Vivien, Rudi, and Bach, 2018). Assume $r \leq \frac{\alpha - 1}{2\alpha}$ (hard problems).
  - If $\mu \leq 2r$, then after $t = \Theta(n^{\alpha/(2r\alpha+1)})$ iterations, $\mathbb{E} F(\bar\theta_t) - F(\theta_*) = O(n^{-2r\alpha/(2r\alpha+1)})$: optimal.
  - Otherwise, after $t = \Theta(n^{1/\mu})$ iterations (up to logarithmic factors), $\mathbb{E} F(\bar\theta_t) - F(\theta_*) \leq O(n^{-2r/\mu})$: improved.
- Proof technique following Rosasco and Villa (2015).
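
A sketch of multiple-pass SGD with replacement in the kernel parametrization, with $t$ following the theorem's $n^{\alpha/(2r\alpha+1)}$ scaling. Here the kernel, the target, and the values of $\alpha$ and $r$ are assumptions of the demo (in practice $r$ is a property of the unknown target, not a tunable parameter).

```python
import numpy as np

def multipass_kernel_sgd(K, y, gamma, t, rng):
    """Multiple-pass SGD with replacement, written on kernel coefficients:
    theta = sum_k a_k Phi(x_k); each step picks i(u) uniformly in {1,...,n}
    and does a[i] += gamma * (y_i - (K a)_i).  Returns the averaged coefficients."""
    n = K.shape[0]
    a = np.zeros(n)
    a_bar = np.zeros(n)
    for u in range(1, t + 1):
        i = rng.integers(n)                 # sampling with replacement
        a[i] += gamma * (y[i] - K[i] @ a)   # y_{i(u)} - <theta_{u-1}, Phi(x_{i(u)})>
        a_bar += (a - a_bar) / u
    return a_bar

# Illustrative run with the Sobolev-type kernel from the earlier sketch and a
# non-smooth ("hard") target; t grows faster than n, i.e. several passes.
rng = np.random.default_rng(0)
n, alpha, r = 500, 4.0, 0.25
X = rng.uniform(0, 1, size=n)
D = X[:, None] - X[None, :]
K = np.zeros((n, n))
for m in range(1, 1000):
    K += m ** (-alpha) * np.cos(2 * np.pi * m * D)
y = np.sign(np.sin(2 * np.pi * X)) + 0.1 * rng.standard_normal(n)
gamma = 1.0 / (4 * np.max(np.diag(K)))
t = int(n ** (alpha / (2 * r * alpha + 1)))
a_bar = multipass_kernel_sgd(K, y, gamma, t, rng)
print("iterations t =", t, "  passes over the data =", t / n)
```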

Proof sketch
- Algorithm from $n$ i.i.d. observations: $\theta_u = \theta_{u-1} + \gamma\,\big(y_{i(u)} - \langle \theta_{u-1}, \Phi(x_{i(u)})\rangle\big)\,\Phi(x_{i(u)})$, with $\bar\theta_t$ the averaged iterate after $t \geq n$ iterations.
- Following Rosasco and Villa (2015), consider the batch gradient recursion
  $\eta_u = \eta_{u-1} + \dfrac{\gamma}{n} \sum_{i=1}^{n} \big(y_i - \langle \eta_{u-1}, \Phi(x_i)\rangle\big)\,\Phi(x_i)$,
  with $\bar\eta_t$ the averaged iterate after $t \geq n$ iterations (a coefficient-space version is sketched below).
- As long as $t = O(n^{1/\mu})$:
  - Property 1: $\mathbb{E} F(\bar\theta_t) - \mathbb{E} F(\bar\eta_t) = O\big(t^{1/\alpha} / n\big)$.
  - Property 2: $\mathbb{E} F(\bar\eta_t) - F(\theta_*) = O\big(t^{1/\alpha} / n\big) + O\big(t^{-2r}\big)$.
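
The batch recursion $\eta_u$ used for comparison in the proof, written in the same coefficient parametrization as the multiple-pass sketch above (a minimal sketch, not the paper's code).

```python
import numpy as np

def batch_gradient_kernel(K, y, gamma, t):
    """Batch gradient recursion from the proof sketch, in kernel coefficients:
    eta_u = eta_{u-1} + (gamma/n) * sum_i (y_i - <eta_{u-1}, Phi(x_i)>) Phi(x_i),
    i.e. a <- a + (gamma/n) * (y - K a).  Returns the averaged iterate's coefficients."""
    n = K.shape[0]
    a = np.zeros(n)
    a_bar = np.zeros(n)
    for u in range(1, t + 1):
        a += (gamma / n) * (y - K @ a)
        a_bar += (a - a_bar) / u
    return a_bar
```

Comparing its averaged iterate with the one returned by `multipass_kernel_sgd` on the same Gram matrix is exactly the coupling used in Properties 1 and 2.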


Statistical optimality
- If $\mu \leq 2r$: after $t = \Theta(n^{\alpha/(2r\alpha+1)})$ iterations, $\mathbb{E} F(\bar\theta_t) - F(\theta_*) = O(n^{-2r\alpha/(2r\alpha+1)})$ (optimal).
- Otherwise: after $t = \Theta(n^{1/\mu})$ iterations (up to logarithmic factors), $\mathbb{E} F(\bar\theta_t) - F(\theta_*) \leq O(n^{-2r/\mu})$ (improved).

Simulations
- Synthetic examples: one-dimensional kernel regression in Sobolev spaces, with arbitrarily chosen values of $r$ and $\alpha$.
- Check the optimal number of iterations over the data.
- Comparing three sampling schemes: with replacement, without replacement (cycling with random reshuffling), and cycling.

Simulations (sampling with replacement, sampling without replacement, and cycling)
- [Figures: for each of the three sampling schemes, the observed optimal number of iterations $\log_{10} t_*$ versus $\log_{10} n$, in four settings ($\alpha = 3/2$, $r = 1/3$), ($\alpha = 4$, $r = 1/4$), ($\alpha = 5/2$, $r = 1/5$), ($\alpha = 3$, $r = 1/6$), compared with a reference line of the predicted slope (e.g., slope 1 in the first setting and 1.25 in the third).]

Simulations - benchmarks
- [Figure: MNIST dataset with a linear kernel; the optimal number of passes as a function of the number of samples.]

Benefits of multiple passes - conclusion
- The number of passes grows with the sample size for hard problems.
- First provable improvement of multiple-pass SGD over single-pass SGD [NB: Hardt et al. (2016); Lin and Rosasco (2017) consider small step sizes].
- Current work and extensions:
  - study of cycling and sampling without replacement (Shamir, 2016; Gürbüzbalaban et al., 2015);
  - mini-batches;
  - beyond least-squares;
  - efficient optimal algorithms for the situation $\mu > 2r$;
  - combining the analysis with exponential convergence of testing errors (Pillaud-Vivien, Rudi, and Bach, 2017).

References

F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems (NIPS), 2013.
A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331-368, 2007.
A. Dieuleveut and F. Bach. Nonparametric stochastic approximation with large step-sizes. The Annals of Statistics, 44(4):1363-1399, 2016.
A. Dieuleveut, A. Durmus, and F. Bach. Bridging the gap between constant step size stochastic gradient descent and Markov chains. Technical report, arXiv, 2017.
M. Gürbüzbalaban, A. Ozdaglar, and P. Parrilo. Why random reshuffling beats stochastic gradient descent. Technical report, arXiv, 2015.
M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, 2016.
J. Lin and L. Rosasco. Optimal rates for multi-pass stochastic gradient methods. Journal of Machine Learning Research, 18(97):1-47, 2017.
O. Macchi. Adaptive Processing: The Least Mean Squares Approach with Applications in Transmission. Wiley, West Sussex, 1995.
L. Pillaud-Vivien, A. Rudi, and F. Bach. Stochastic gradient methods with exponential convergence of testing errors. Technical report, arXiv, 2017.
L. Pillaud-Vivien, A. Rudi, and F. Bach. Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. Technical report, arXiv, 2018.
L. Rosasco and S. Villa. Learning with incremental iterative regularization. In Advances in Neural Information Processing Systems, 2015.
A. Rudi, R. Camoriano, and L. Rosasco. Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems, 2015.
B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2001.
O. Shamir. Without-replacement sampling for stochastic gradient methods. In Advances in Neural Information Processing Systems 29, pages 46-54, 2016.
J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
I. Steinwart, D. R. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In Proc. COLT, 2009.
A. B. Tsybakov. Optimal rates of aggregation. In Proc. COLT, 2003.
Y. Yao, L. Rosasco, and A. Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289-315, 2007.
