Early Stopping for Computational Learning

1 Early Stopping for Computational Learning
Lorenzo Rosasco
Università di Genova, Massachusetts Institute of Technology, Istituto Italiano di Tecnologia
CBMM, Sestri Levante, September 2014
Joint work with A. Tacchetti, S. Villa (IIT); also B.C. Vu + S. Villa (IIT) and J. Lin, D.X. Zhou (CityU)

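2 Image classification (not processing)
$X_n = \begin{pmatrix} x_1^1 & \cdots & x_1^p \\ \vdots & & \vdots \\ x_n^1 & \cdots & x_n^p \end{pmatrix}, \qquad Y_n = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}$
Example label: $+1$
$n$ = millions, $p$ = hundreds of thousands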

3 Learning Theory and Optimization: One Remark
Given $(x_1, y_1), \dots, (x_n, y_n)$, consider
$\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n V(y_i, w^\top x_i) + \lambda R(w)$
This is not the learning problem!
Given $P(x, y)$, consider
$\min_{w \in \mathbb{R}^p} \mathbb{E}\, V(y, w^\top x) + \lambda R(w)$
This is not the learning problem!
Learning is stochastic optimization! Yet there is a divide between statistical and optimization analysis.

4 Plan
Bias + Variance + Computations
- Name Game: Problems, Estimators and Algorithms
- Iterative Algorithms and Early Stopping
- Learning with Stochastic/Incremental Gradient
- Beyond Least Squares

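5 The Problem
Assumption 0: $S = (x_i, y_i)_{i=1}^n \sim P(x, y)$, $x_i \in X = \mathbb{R}^p$, $p \le \infty$, $\|x_i\| \le 1$, $|y_i| \le 1$ almost surely.
$\mathcal{E}(w) = \mathbb{E}\,|y - w^\top x|^2$
$w^\dagger = \arg\min_{w \in X_0} \|w\|^2, \qquad X_0 = \arg\min_{X} \mathcal{E}(w)$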

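6 Error Measures
Risk: $P(\mathcal{E}(\hat w) - \mathcal{E}(w^\dagger) \ge \epsilon)$
Parameters: $P(\|\hat w - w^\dagger\|^2 \ge \epsilon)$
Let $C = \mathbb{E}[x x^\top]$ and $C v_j = \sigma_j v_j$, $j = 1, \dots$
Assumption 1:
- Source condition: $\sum_j \frac{|v_j^\top w^\dagger|^2}{\sigma_j^{2s}} < \infty$, $s \in (0, \infty)$
- Effective dimension: $\sigma_j \sim j^{-b}$, $b \in [1, \infty]$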

7 An Estimator
All we have is $S = (x_i, y_i)_{i=1}^n \sim P(x, y)$.
Ill-posedness
Bias + Variance
$\hat w_\lambda = \arg\min_{w \in X} \frac{1}{n} \sum_{i=1}^n (y_i - w^\top x_i)^2 + \lambda\, w^\top w$

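8 Convergence
Theorem [ ]: Choose $\lambda_n$ such that $\lambda_n \to 0$ and $1/(\lambda_n n) \to 0$ for $n \to \infty$. Set $\hat w = \hat w_{\lambda_n}$. Then under Assumption 0, it holds
$P\big(\lim_{n \to \infty} \mathcal{E}(\hat w) = \mathcal{E}(w^\dagger)\big) = 1$
Analogous results hold for parameter estimation.
Not a practical result.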

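9 A Posteriori Parameter Choice
What about $\lambda$? Cross validation.
Split the data into a training set $S$ and a validation set $S' = (x'_k, y'_k)_{k=1}^m$.
$\hat\lambda = \arg\min_{\lambda} \hat{\mathcal{E}}'(\lambda), \qquad \hat{\mathcal{E}}'(\lambda) = \frac{1}{m} \sum_{k=1}^m (y'_k - \hat w_\lambda^\top x'_k)^2$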
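The hold-out choice of the regularization parameter can be sketched in a few lines. This is only an illustration for scalar inputs ($p = 1$), where the regularized least squares solution has the closed form $\sum_i x_i y_i / (\sum_i x_i^2 + \lambda n)$. The data model in `sample`, the grid of candidate values, and all function names are assumptions made for the example, not part of the slides.

```python
import random

def ridge_1d(xs, ys, lam):
    """Regularized least squares for scalar inputs (p = 1):
    w = sum(x*y) / (sum(x*x) + lam*n)."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam * n)

def validation_error(w, xs_val, ys_val):
    """Mean squared error of w on the held-out pairs."""
    return sum((y - w * x) ** 2 for x, y in zip(xs_val, ys_val)) / len(xs_val)

def choose_lambda(train, val, grid):
    """Return the grid value with the smallest validation error, plus all errors."""
    errors = {lam: validation_error(ridge_1d(train[0], train[1], lam), val[0], val[1])
              for lam in grid}
    return min(errors, key=errors.get), errors

def sample(rng, n, w_true=0.7, noise=0.1):
    """Hypothetical data model: y = w_true * x + Gaussian noise, |x| <= 1."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [w_true * x + rng.gauss(0.0, noise) for x in xs]
    return xs, ys

rng = random.Random(0)
train, val = sample(rng, 50), sample(rng, 25)
grid = [10.0 ** e for e in range(-4, 1)]   # candidate values 1e-4, ..., 1
lam_hat, errors = choose_lambda(train, val, grid)
print(lam_hat, errors[lam_hat])
```

Every candidate value requires its own fit, which is where the factor counting the number of candidate values in the later complexity estimates comes from.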

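10 Adaptive rates
Theorem [Caponnetto, De Vito, R. + Smale, Zhou]: Set $\hat w = \hat w_{\hat\lambda}$. Then under Assumptions 0 and 1, it holds, with probability at least $1 - e^{-\tau}$, $\tau \in [0, \infty)$,
$\mathcal{E}(\hat w) - \mathcal{E}(w^\dagger) \le c\, n^{-\frac{2rb}{2rb+1}}, \qquad r = s + 1/2$
The above result is optimal in a minmax sense: it matches the infimum over estimators $w$ of the supremum of $\mathbb{E}[\mathcal{E}(w) - \mathcal{E}(w^\dagger)]$ over the class of problems with parameters $b, s$.
The above result describes what is done in practice.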
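As a worked instance of the exponent, assuming the bound has the form $c\, n^{-2rb/(2rb+1)}$ with $r = s + 1/2$ (the values of $s$ and $b$ below are chosen only for illustration):

```latex
\[
  s = \tfrac12,\ b = 1 \;\Rightarrow\; r = 1,\quad \frac{2rb}{2rb+1} = \frac{2}{3},
  \qquad
  s = \tfrac12,\ b = 2 \;\Rightarrow\; r = 1,\quad \frac{2rb}{2rb+1} = \frac{4}{5}.
\]
```

For fixed $s$ the exponent increases with $b$ and tends to $1$ as $b \to \infty$, so a faster eigenvalue decay gives a faster rate.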

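11 Proof Approach
Separate analysis (spectral calculus) and probability [Caponnetto, De Vito, R. + Smale, Zhou]
Population: minimizing $\mathbb{E}(y - w^\top x)^2$ leads to $C w = g$, with $C = \mathbb{E}[\frac{1}{n} X_n^\top X_n]$, $g = \mathbb{E}[x y]$
Empirical: minimizing $\frac{1}{n} \|X_n w - Y_n\|^2$ leads to $\hat C w = \hat g$, with $\hat C = \frac{1}{n} X_n^\top X_n$, $\hat g = \frac{1}{n} \sum_{i=1}^n x_i y_i$
Concentration inequalities: $\|C - \hat C\|$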

12 Algorithms?
Learning Theory ~ Stochastic Optimization, so far
IBC, Nemirovski-Yudin Oracle Model
What's missing? Computations!
- What's the cost of computing an estimator?
- How do approximate computations affect our error estimates?

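13 Computations
$\hat w_\lambda = \arg\min_{w \in X} \frac{1}{n} \sum_{i=1}^n (y_i - w^\top x_i)^2 + \lambda\, w^\top w$
Parametric: $\hat w_\lambda = (X_n^\top X_n + \lambda n I)^{-1} X_n^\top Y_n$
Nonparametric: $x^\top w = x^\top X_n^\top c$, $\quad \hat c_\lambda = (X_n X_n^\top + \lambda n I)^{-1} Y_n$
Complexity (with $\sharp\lambda$ the number of values of $\lambda$ tried, $m$ the number of validation points):
- Parametric: $O(n p^2\, \sharp\lambda) + O(m p\, \sharp\lambda)$
- Nonparametric: $O(n^3\, \sharp\lambda) + O(m n\, \sharp\lambda)$
Can we do better than this?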
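The two closed forms on this slide, the parametric one $(X_n^\top X_n + \lambda n I)^{-1} X_n^\top Y_n$ (a $p \times p$ system) and the nonparametric one $X_n^\top (X_n X_n^\top + \lambda n I)^{-1} Y_n$ (an $n \times n$ system), give the same vector. The sketch below checks this numerically on a tiny made-up data set; the helper `solve` and the data are assumptions for the example, and the plain Gaussian elimination is meant for illustration rather than for large problems.

```python
def solve(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][c] * z[c] for c in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z

def ridge_primal(X, Y, lam):
    """w = (X^T X + lam*n*I)^{-1} X^T Y, a p x p linear system."""
    n, p = len(X), len(X[0])
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) + (lam * n if a == b else 0.0)
          for b in range(p)] for a in range(p)]
    g = [sum(X[i][a] * Y[i] for i in range(n)) for a in range(p)]
    return solve(A, g)

def ridge_dual(X, Y, lam):
    """c = (X X^T + lam*n*I)^{-1} Y, an n x n linear system; then w = X^T c."""
    n, p = len(X), len(X[0])
    K = [[sum(X[i][a] * X[j][a] for a in range(p)) + (lam * n if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    c = solve(K, Y)
    return [sum(X[i][a] * c[i] for i in range(n)) for a in range(p)]

# Made-up data: n = 4 points in dimension p = 2.
X = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.3, -0.2]]
Y = [1.0, 0.2, -0.5, 0.6]
w_primal = ridge_primal(X, Y, lam=0.1)
w_dual = ridge_dual(X, Y, lam=0.1)
print(w_primal, w_dual)
```

Which form is cheaper depends on whether $p$ or $n$ is smaller, which is the distinction between the parametric and nonparametric costs listed above.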

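14 Gradient descent
$\hat w_0 = 0$; for $k = 1 : t-1$
$\hat w_k = \hat w_{k-1} - \frac{\gamma}{n} X_n^\top (X_n \hat w_{k-1} - Y_n)$
Also gradient descent [Landweber 50]:
$\hat w_t = \frac{\gamma}{n} \sum_{j=0}^{t-1} \big(I - \frac{\gamma}{n} X_n^\top X_n\big)^j X_n^\top Y_n$
Interlude: Neumann series
$c^{-1} = \sum_{j=0}^{\infty} (1 - c)^j, \qquad (X_n^\top X_n)^{-1} = \sum_{j=0}^{\infty} (I - X_n^\top X_n)^j$
Truncating the series:
$c^{-1} \approx \sum_{j=0}^{t-1} (1 - c)^j, \qquad (X_n^\top X_n)^{-1} \approx \sum_{j=0}^{t-1} (I - X_n^\top X_n)^j$
Compare with Tikhonov: $(X_n^\top X_n)^{-1} \approx (X_n^\top X_n + \lambda n I)^{-1}$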
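The equivalence between the gradient descent recursion and the truncated (Landweber) sum can be checked numerically. The sketch below assumes the step size is written as $\gamma / n$ (the symbol $\gamma$ is a naming assumption) and uses a small made-up data set; `neumann` illustrates the scalar series $\sum_{j<t} (1-c)^j \to 1/c$.

```python
def matvec(A, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def gram(X, Y):
    """Return A = X^T X (p x p) and b = X^T Y (length p)."""
    n, p = len(X), len(X[0])
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)] for r in range(p)]
    b = [sum(X[i][r] * Y[i] for i in range(n)) for r in range(p)]
    return A, b

def gradient_descent(X, Y, gamma, t):
    """t steps of w <- w - (gamma/n) X^T (X w - Y), starting from w = 0."""
    A, b = gram(X, Y)
    step = gamma / len(X)
    w = [0.0] * len(b)
    for _ in range(t):
        Aw = matvec(A, w)
        w = [wi - step * (awi - bi) for wi, awi, bi in zip(w, Aw, b)]
    return w

def landweber_sum(X, Y, gamma, t):
    """(gamma/n) * sum_{j<t} (I - (gamma/n) X^T X)^j X^T Y."""
    A, b = gram(X, Y)
    step = gamma / len(X)
    term, total = b[:], [0.0] * len(b)
    for _ in range(t):
        total = [s + u for s, u in zip(total, term)]
        At = matvec(A, term)
        term = [u - step * au for u, au in zip(term, At)]   # apply (I - step * A)
    return [step * s for s in total]

def neumann(c, t):
    """Truncated scalar Neumann series sum_{j<t} (1 - c)^j, for 0 < c < 2."""
    return sum((1.0 - c) ** j for j in range(t))

# Made-up data: n = 3 points in dimension p = 2, with ||x_i|| <= 1.
X = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
Y = [1.0, 0.5, -0.2]
w_gd = gradient_descent(X, Y, gamma=1.0, t=20)
w_sum = landweber_sum(X, Y, gamma=1.0, t=20)
print(w_gd, w_sum, neumann(0.5, 60))
```

The number of iterations $t$ plays the role that $1/\lambda$ plays in the penalized estimator: few terms give a heavily regularized inverse, many terms approach the unregularized one.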

15 Early Stopping
Theorem [Caponnetto, Yao, R. 07]: Choose $t_n$ such that $t_n \to \infty$ and $t_n / n \to 0$ for $n \to \infty$. Set $\hat w = \hat w_{t_n}$. Then under Assumption 0, it holds
$P\big(\lim_{n \to \infty} \mathcal{E}(\hat w) = \mathcal{E}(w^\dagger)\big) = 1$
Theorem [Bauer, Pereverzev, R. 07; Caponnetto, Yao 10]: Set $\hat w = \hat w_{\hat t}$. Then under Assumptions 0 and 1, it holds, with probability at least $1 - e^{-\tau}$, $\tau \in [0, \infty)$,
$\mathcal{E}(\hat w) - \mathcal{E}(w^\dagger) \le c\, n^{-\frac{2rb}{2rb+1}}$
Related results: GD, aka L2Boosting [Bühlmann, Yu 02; Yao, R., Caponnetto 05; Bauer, Pereverzev, R. 07; Caponnetto, Yao 10; Raskutti et al. 13]

16 One Observation
[Figure: empirical error and test (validation) error plotted against the number of iterations $t$]
Bias + Variance + Computations: bias, variance and computations are all controlled by $t$
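The observation can be reproduced with a short experiment: run gradient descent, record the training and validation errors at every iteration, and stop at the iteration with the smallest validation error. The sketch below is an illustration under assumed settings (the data model in `make_data`, the step size $\gamma = 1$, the sample sizes and the seed are all choices made for the example).

```python
import random

def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def mse(w, X, Y):
    return sum((y - predict(w, x)) ** 2 for x, y in zip(X, Y)) / len(X)

def gd_path(X, Y, gamma, t_max):
    """Return the iterates w_0, ..., w_{t_max} of gradient descent from w_0 = 0."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    path = [w]
    for _ in range(t_max):
        resid = [predict(w, x) - y for x, y in zip(X, Y)]
        grad = [sum(r * x[a] for r, x in zip(resid, X)) / n for a in range(p)]
        w = [wi - gamma * g for wi, g in zip(w, grad)]
        path.append(w)
    return path

def make_data(rng, n, w_true, noise):
    """Hypothetical model: y = w_true^T x + Gaussian noise, with ||x|| <= 1."""
    p = len(w_true)
    X = [[rng.uniform(-1, 1) / p ** 0.5 for _ in range(p)] for _ in range(n)]
    Y = [predict(w_true, x) + rng.gauss(0, noise) for x in X]
    return X, Y

rng = random.Random(1)
w_true = [1.0, -0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0]
X, Y = make_data(rng, 12, w_true, noise=0.3)       # small training set
Xv, Yv = make_data(rng, 200, w_true, noise=0.3)    # validation set
path = gd_path(X, Y, gamma=1.0, t_max=300)
train_err = [mse(w, X, Y) for w in path]
val_err = [mse(w, Xv, Yv) for w in path]
t_hat = min(range(len(path)), key=val_err.__getitem__)   # early stopping time
print(t_hat, train_err[t_hat], val_err[t_hat])
```

With $\|x_i\| \le 1$ and $\gamma \le 1$ the training error cannot increase along the path, while the validation error need not be monotone, so the stopping time acts as the regularization parameter.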

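17 Better Complexity
With $\sharp t$ the number of iterations:
- Parametric: $O(n p^2\, \sharp\lambda) + O(m p\, \sharp\lambda)$ becomes $O(n p\, \sharp t)$
- Nonparametric: $O(n^3\, \sharp\lambda) + O(m n\, \sharp\lambda)$ becomes $O(n^2\, \sharp t)$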

18 Few Openish Questions
- What's the best we can do?
- What about other (iterative) schemes?
  - Accelerated GD, aka nu-method [Bauer, Pereverzev, R. 07; Caponnetto, Yao 10]
  - Conjugate Gradient (CG), aka Partial Least Squares [Blanchard, Kramer 09]
  - Nesterov method?
  - Others?
- Stochastic vs Incremental?
- Other loss functions?
- Other regularization?

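19 Yet Another Iteration: Stochastic Gradient
GD: $\hat w_0 = 0$; for $k = 1 : t-1$
$\hat w_k = \hat w_{k-1} - \frac{\gamma}{n} X_n^\top (X_n \hat w_{k-1} - Y_n)$
SGD, aka Robbins-Monro: $\hat w_0 = 0$; for $k = 1 : n$
$\hat w_k = \hat w_{k-1} - \gamma\, x_k (x_k^\top \hat w_{k-1} - y_k)$
Lower iteration cost: $O(p)$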
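A single pass of the stochastic gradient iteration touches one example per step, so each update costs $O(p)$. The sketch below assumes a constant step size (called `gamma` here, a naming assumption) and noiseless labels generated from a made-up `w_true`, so that the distance to `w_true` can be checked directly.

```python
import random

def sgd_one_pass(X, Y, gamma):
    """One pass of w <- w - gamma * x_k (x_k^T w - y_k); each step costs O(p)."""
    w = [0.0] * len(X[0])
    for x, y in zip(X, Y):
        r = sum(wi * xi for wi, xi in zip(w, x)) - y        # residual, O(p)
        w = [wi - gamma * r * xi for wi, xi in zip(w, x)]   # update, O(p)
    return w

def dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

rng = random.Random(0)
w_true = [0.8, -0.3, 0.5]
p = len(w_true)
# Inputs satisfy ||x|| <= 1; labels are noiseless (an assumption for the example).
X = [[rng.uniform(-1, 1) / p ** 0.5 for _ in range(p)] for _ in range(2000)]
Y = [sum(a * b for a, b in zip(w_true, x)) for x in X]
w_hat = sgd_one_pass(X, Y, gamma=1.0)
print(dist(w_hat, w_true))
```

With noisy labels a constant step no longer drives the error to zero, which is one motivation for the varying step sizes, penalization and averaging discussed next.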

20 SGD Flavors
Varying step-size, penalized SGD: $\hat w_0 = 0$; for $k = 1 : n$
$\hat w_k = \hat w_{k-1} - \gamma_k \big( x_k (x_k^\top \hat w_{k-1} - y_k) + \lambda\, \hat w_{k-1} \big)$
Averaging: $\hat w_{\mathrm{Ave}} = \frac{1}{n} \sum_{k=0}^{n-1} \hat w_k$
Varying step-size penalized SGD: only results in expectation.
- 2 parameters to cross validate [Smale, Yao 05; Tarres, Yao 07]
- Step-size cross validation [Ying, Pontil 05; Zhang 04]; constant step-size for finite dimensions [Bach, Moulines 13]
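The three ingredients on this slide (a decaying step size, a penalty term and averaging of the iterates) can be combined in one short routine. The sketch below is an assumption-laden illustration: the schedule $\gamma_k = \gamma_0 / \sqrt{k}$, the placement of the penalty on $\hat w_{k-1}$ and the function name are choices made for the example, not taken from the cited papers.

```python
def penalized_sgd(X, Y, gamma0, lam):
    """One pass of
        w_k = w_{k-1} - gamma_k * (x_k (x_k^T w_{k-1} - y_k) + lam * w_{k-1}),
    with gamma_k = gamma0 / sqrt(k) (an assumed schedule).
    Returns the last iterate and the average of w_0, ..., w_{n-1}."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    w_sum = [0.0] * p
    for k, (x, y) in enumerate(zip(X, Y), start=1):
        w_sum = [s + wi for s, wi in zip(w_sum, w)]   # accumulates w_{k-1}
        gamma_k = gamma0 / k ** 0.5
        r = sum(wi * xi for wi, xi in zip(w, x)) - y
        w = [wi - gamma_k * (r * xi + lam * wi) for wi, xi in zip(w, x)]
    return w, [s / n for s in w_sum]

# Tiny hand-checkable case: two identical scalar examples (x, y) = (1, 1).
w_last, w_ave = penalized_sgd([[1.0], [1.0]], [1.0, 1.0], gamma0=0.5, lam=0.0)
print(w_last, w_ave)
```

In the hand-checkable case the iterates are $\hat w_0 = 0$, $\hat w_1 = 0.5$ and $\hat w_2 = 0.5 + 0.25/\sqrt{2}$, so the average of $\hat w_0, \hat w_1$ is $0.25$.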

21 Does Anybody Use These Methods?
- small: the generalization error keeps decreasing after the first pass on the data; overfitting eventually occurs
- big: more iterations are needed

22 SGD Flavors (cont.)
I have all the $n$ examples, I can process them more than once! Should I?
Multiple Epochs SGD, aka IGD:
$\hat w_0 = 0$
for $j = 1 : t-1$
  $\hat v_0 = \hat w_{j-1}$
  for $k = 1 : n$
    $\hat v_k = \hat v_{k-1} - \frac{\gamma}{n} x_k (x_k^\top \hat v_{k-1} - y_k)$
  end
  $\hat w_j = \hat v_n$
end
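The multiple-epoch (incremental gradient) scheme sweeps the same $n$ examples repeatedly, with each inner step scaled by $1/n$, so the number of epochs becomes the quantity to stop early. The sketch below assumes a step size $\gamma / n$ (the symbol $\gamma$ is a naming assumption), a fixed sweep order, and noiseless made-up data so the distance to `w_true` can be tracked.

```python
import random

def incremental_gradient(X, Y, gamma, epochs):
    """Each epoch sweeps the n points in order with step gamma / n:
        v_k = v_{k-1} - (gamma/n) x_k (x_k^T v_{k-1} - y_k).
    Returns the iterate at the end of every epoch."""
    n, p = len(X), len(X[0])
    w, history = [0.0] * p, []
    for _ in range(epochs):
        v = w[:]                                   # v_0 = w_{j-1}
        for x, y in zip(X, Y):
            r = sum(vi * xi for vi, xi in zip(v, x)) - y
            v = [vi - (gamma / n) * r * xi for vi, xi in zip(v, x)]
        w = v                                      # w_j = v_n
        history.append(w)
    return history

def dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

rng = random.Random(0)
w_true = [0.8, -0.3, 0.5]
p = len(w_true)
X = [[rng.uniform(-1, 1) / p ** 0.5 for _ in range(p)] for _ in range(20)]
Y = [sum(a * b for a, b in zip(w_true, x)) for x in X]   # noiseless (assumption)
history = incremental_gradient(X, Y, gamma=1.0, epochs=50)
dists = [dist(w, w_true) for w in history]
print(dists[0], dists[-1])
```

One epoch costs $O(np)$, the same order as one full gradient descent step.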

23 Early Stopping
Theorem [R., Tacchetti, Villa 14]: Choose $t_n$ such that $t_n \to \infty$ and $t_n / \sqrt{n} \to 0$ for $n \to \infty$. Set $\hat w = \hat w_{t_n}$. Then under Assumption 0, it holds
$P\big(\lim_{n \to \infty} \mathcal{E}(\hat w) = \mathcal{E}(w^\dagger)\big) = 1$
Theorem [R., Tacchetti, Villa 14]: Set $\hat w = \hat w_{\hat t}$. Then under Assumptions 0 and 1, it holds, with probability at least $1 - e^{-\tau}$, $\tau \in [0, \infty)$,
$\|\hat w - w^\dagger\|^2 \le c\, n^{-\frac{s}{s+1}}$
Still missing: results with the whole Assumption 1

24 Few more questions
I have all the $n$ examples, I can process them more than once! Should I?
Yes! (or cross validate the step size [Dieuleveut, Bach 14])
Results suggest the same statistical/numerical complexity as GD:
- Parametric: $O(n p\, \sharp t) + O(m p\, \sharp t)$
- Nonparametric: $O(n^2\, \sharp t) + O(m n\, \sharp t)$

25 Few Openish Questions
- Optimal numerical complexity in a statistical minmax class?
- Linear Nonparametrics?
- What about other (iterative) schemes?
  - Accelerated GD, aka nu-method [Bauer, Pereverzev, R. 07; Caponnetto, Yao 10]
  - Conjugate Gradient (CG), aka Partial Least Squares [Blanchard, Kramer 09]
  - Nesterov method?
  - Others?
- Stochastic vs Incremental?
- Other loss functions?
- Other regularization?

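26 What about other loss functions?
$w^\dagger = \arg\min_{w \in X_0} \|w\|^2, \qquad X_0 = \arg\min_{X} \mathcal{E}(w), \qquad \mathcal{E}(w) = \mathbb{E}\, V(y, w^\top x)$
Nemitski loss: $V : \mathbb{R} \times \mathbb{R} \to [0, \infty)$ such that
- $V(y, \cdot)$ is convex for all $y \in \mathbb{R}$
- for $p \in [1, \infty)$, $V(y, \tau) \le a(y) + b\,|\tau|^p$ for all $\tau, y \in \mathbb{R}$, where $b \in [0, \infty)$ and $a : \mathbb{R} \to [0, \infty)$ with $\mathbb{E}\, a(y) < \infty$
Early Stopping with Subgradient [Lin, R., Zhou 14]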
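For a convex but non-smooth loss the gradient step can be replaced by a subgradient step, with the iteration count again chosen on held-out data. The sketch below uses the absolute loss $V(y, u) = |y - u|$, which is convex and satisfies the growth bound with $p = 1$, $a(y) = |y|$, $b = 1$. The step schedule $\gamma_0 / \sqrt{k}$, the data model and all names are assumptions for the example; this is not a reproduction of the algorithm analyzed in [Lin, R., Zhou 14].

```python
import random

def sign(u):
    """Sign of u, with sign(0) = 0 (a valid subgradient of |.| at 0)."""
    return (u > 0) - (u < 0)

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def mean_abs_error(w, X, Y):
    return sum(abs(y - dot(w, x)) for x, y in zip(X, Y)) / len(X)

def subgradient_path(X, Y, gamma0, t_max):
    """Iterates of w <- w - gamma_k * (1/n) sum_i sign(w^T x_i - y_i) x_i,
    with gamma_k = gamma0 / sqrt(k), starting from w = 0."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    path = [w]
    for k in range(1, t_max + 1):
        s = [sign(dot(w, x) - y) for x, y in zip(X, Y)]
        g = [sum(si * x[a] for si, x in zip(s, X)) / n for a in range(p)]
        w = [wi - (gamma0 / k ** 0.5) * gi for wi, gi in zip(w, g)]
        path.append(w)
    return path

def make_data(rng, n, w_true, noise):
    """Hypothetical model: y = w_true^T x + Gaussian noise, with ||x|| <= 1."""
    p = len(w_true)
    X = [[rng.uniform(-1, 1) / p ** 0.5 for _ in range(p)] for _ in range(n)]
    Y = [dot(w_true, x) + rng.gauss(0, noise) for x in X]
    return X, Y

rng = random.Random(2)
w_true = [1.0, -0.5, 0.25, 0.0]
X, Y = make_data(rng, 15, w_true, noise=0.2)
Xv, Yv = make_data(rng, 100, w_true, noise=0.2)
path = subgradient_path(X, Y, gamma0=1.0, t_max=200)
val_err = [mean_abs_error(w, Xv, Yv) for w in path]
t_hat = min(range(len(path)), key=val_err.__getitem__)   # early stopping time
print(t_hat, val_err[t_hat])
```

Unlike gradient descent on the square loss, the subgradient iterates need not decrease the training error at every step, so the stopping time is chosen over the whole recorded path.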

27 What about Other Regularizers?
$\min_{w \in X_0} R(w), \qquad X_0 = \arg\min_{X} \mathcal{E}(w), \qquad \mathcal{E}(w) = \mathbb{E}\,|y - w^\top x|^2$
Convex regularization: $R : X \to \mathbb{R}$, a proper, l.s.c. functional. [R., Villa, Vu 14]

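28 A New Paradigm for Learning Algorithm Design?
$\min_{w \in X_0} R(w), \qquad X_0 = \arg\min_{X} \mathcal{E}(w), \qquad \mathcal{E}(w) = \mathbb{E}\, V(y, w^\top x)$
Iterates $w_t$ and $\hat w_t$: Bias/Optimization, Variance/Stability
Compare with:
$\min_{w \in \mathbb{R}^p} \mathbb{E}\, V(y, w^\top x) + \lambda R(w)$
$\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n V(y_i, w^\top x_i) + \lambda R(w)$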

29 Lots of questions
- Avoid cross validation / data splitting?
- Early stopping for convex loss, beyond subgradient?
- Early stopping for convex regularization
- Problems: from learning to more general stochastic optimization or inverse problems (stochastic or not), robust optimization
- Approaches: coordinate descent, distributed approaches
