Optimization geometry and implicit regularization


1 Optimization geometry and implicit regularization. Suriya Gunasekar. Joint work with N. Srebro (TTIC), J. Lee (USC), D. Soudry (Technion), M.S. Nacson (Technion), B. Woodworth (TTIC), S. Bhojanapalli (TTIC), B. Neyshabur (TTIC → IAS).

2 Optimization in ML. $h_W : \mathcal{X} \to \mathcal{Y}$, parameterized by $W \in \mathbb{R}^D$. Training data $\{(x_n, y_n) : n = 1, \dots, N\}$; $\hat{W} = \operatorname*{argmin}_W \sum_{n} \ell(h_W(x_n), y_n)$. Over-parameterization: $D \gg N$, so there are many global minima, all with $\ell(h_{\hat{W}}(x_n), y_n) = 0$. What we really care about is $\mathbb{E}_{x,y}\,\ell(h_{\hat{W}}(x), y)$, and different global optima have different $\mathbb{E}_{x,y}\,\ell(h_{\hat{W}}(x), y)$.
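
To make the last point concrete, here is a small self-contained illustration (my own toy construction, not from the talk; all names and sizes are arbitrary): an over-parameterized linear regression in which two different zero-training-error solutions have very different test error.

```python
# Over-parameterized least squares (D >> N): every w with Xw = y has zero
# training error, but different interpolants generalize very differently.
import numpy as np

rng = np.random.default_rng(0)
D, N, N_test = 100, 20, 1000
w_star = rng.normal(size=D) / np.sqrt(D)            # ground-truth linear model
X, X_test = rng.normal(size=(N, D)), rng.normal(size=(N_test, D))
y, y_test = X @ w_star, X_test @ w_star

# One interpolating solution: the minimum-L2-norm solution X^+ y.
w_min_norm = np.linalg.pinv(X) @ y

# Another interpolating solution: add any vector from the null space of X.
null_basis = np.linalg.svd(X)[2][N:].T               # (D, D-N) basis of null(X)
w_other = w_min_norm + null_basis @ rng.normal(size=D - N)

for name, w in [("min L2-norm interpolant", w_min_norm),
                ("another interpolant", w_other)]:
    train_mse = np.mean((X @ w - y) ** 2)
    test_mse = np.mean((X_test @ w - y_test) ** 2)
    print(f"{name:>24}: train MSE {train_mse:.1e}, test MSE {test_mse:.3f}")
```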


6 Learning overparameterized models. $h_W : \mathcal{X} \to \mathcal{Y}$, parameterized by $W \in \mathbb{R}^D$; $\hat{W}_\lambda = \operatorname*{argmin}_W \sum_n \ell(h_W(x_n), y_n) + \lambda R(W)$. Small $R(\hat{W}_\lambda)$ ensures that the population loss $\mathbb{E}_{x,y}\,\ell(h_{\hat{W}_\lambda}(x), y)$ is close to the training loss $\frac{1}{N}\sum_n \ell(h_{\hat{W}_\lambda}(x_n), y_n)$. Explicit regularization for high dimensional estimation.

7 Learning overparameterized models. $\hat{W}_\lambda = \operatorname*{argmin}_W \sum_n \ell(h_W(x_n), y_n) + \lambda R(W)$: explicit regularization for high dimensional estimation. But what happens if we don't have $R(W)$?

8 Matrix Estimation from Linear Measurements. $\min_{W \in \mathbb{R}^{d \times d}} L(W) := \sum_{n=1}^N (\langle X_n, W\rangle - y_n)^2 = \|\mathcal{X}(W) - y\|_2^2$, e.g. matrix completion, linear neural networks, ... When $N \ll d^2$ the optimization is underdetermined with many trivial global minima (e.g. impute 0 or 42 or ... for matrix completion).

9 Factorized parameterization: $\min_{U,V \in \mathbb{R}^{d \times d}} L(U, V) = L(UV^\top) = \|\mathcal{X}(UV^\top) - y\|_2^2$. No explicit regularization and no rank constraint, so the same trivial global minima exist.

10 Gradient descent on $L(U, V)$: $U_{k+1} = U_k - \eta\,\nabla_U L(U_k, V_k)$, $V_{k+1} = V_k - \eta\,\nabla_V L(U_k, V_k)$.

11 Experiment: $d = 50$, $N = 300$, $X_n$ i.i.d. Gaussian, rank-2 ground truth $W^*$; $y = \mathcal{X}(W^*) + \mathcal{N}(0, 10^{-3})$, $y_{\mathrm{test}} = \mathcal{X}_{\mathrm{test}}(W^*) + \mathcal{N}(0, 10^{-3})$. Compare $\min_{W \in \mathbb{R}^{d\times d}} L(W) = \|\mathcal{X}(W) - y\|_2^2$ with $\min_{U,V \in \mathbb{R}^{d\times d}} L(U, V) = \|\mathcal{X}(UV^\top) - y\|_2^2$. Gradient descent on $L(U, V)$ gets to good global minima. Gunasekar, Woodworth, Bhojanapalli, Neyshabur, Srebro, NIPS 2017

12 Same experiment: gradient descent on $L(U, V)$ gets to good global minima, and generalizes better with smaller step size. Gunasekar, Woodworth, Bhojanapalli, Neyshabur, Srebro, NIPS 2017
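
A rough numpy sketch of this kind of experiment, at a smaller scale than the slides' d = 50, N = 300 setting and without the label noise, so it runs in seconds. The step sizes, iteration counts, and initialization scale are my own illustrative choices rather than the paper's; on typical runs, gradient descent on W fits the measurements but stays far from the low-rank ground truth, while gradient descent on the factorization with small initialization ends up much closer to it (and with smaller nuclear norm).

```python
# Matrix sensing: compare gradient descent on W with gradient descent on U V^T.
import numpy as np

rng = np.random.default_rng(1)
d, N, r = 20, 200, 2
W_star = rng.normal(size=(d, r)) @ rng.normal(size=(r, d))   # rank-2 ground truth
X = rng.normal(size=(N, d, d))                               # measurement matrices
y = np.einsum("nij,ij->n", X, W_star)                        # y_n = <X_n, W*>

def grad_W(W):
    """Gradient of (1/N) * sum_n (<X_n, W> - y_n)^2 with respect to W."""
    resid = np.einsum("nij,ij->n", X, W) - y
    return (2.0 / N) * np.einsum("n,nij->ij", resid, X)

# (a) Gradient descent directly on W, initialized at 0.
W = np.zeros((d, d))
for _ in range(2000):
    W -= 0.05 * grad_W(W)

# (b) Gradient descent on W = U V^T, with a small random initialization.
U, V = 1e-3 * rng.normal(size=(d, d)), 1e-3 * rng.normal(size=(d, d))
for _ in range(10000):
    G = grad_W(U @ V.T)
    U, V = U - 1e-3 * G @ V, V - 1e-3 * G.T @ U

for name, Wh in [("GD on W", W), ("GD on U,V", U @ V.T)]:
    rel_err = np.linalg.norm(Wh - W_star) / np.linalg.norm(W_star)
    print(f"{name:>10}: nuclear norm {np.linalg.norm(Wh, 'nuc'):6.1f}, "
          f"relative error vs W* {rel_err:.3f}")
print(f"{'W*':>10}: nuclear norm {np.linalg.norm(W_star, 'nuc'):6.1f}")
```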

13 Same experiment. Question: which global minimum does gradient descent reach? Why does it generalize well? Gunasekar, Woodworth, Bhojanapalli, Neyshabur, Srebro, NIPS 2017

14 Implicit Regularization. Different optimization algorithms → different global minimum $\hat{W}$ → different generalization $\mathbb{E}_{x,y}\,\ell(h_{\hat{W}}(x), y)$.

15 Overparameterization in neural networks. Image datasets: CIFAR ~60K images, ImageNet ~14M images, ~1M annotations. Architectures for vision tasks: AlexNet (2012): 8 layers, 60M parameters; VGG-16 (2014): 16 layers, 138M parameters; ResNet (2015): 152 layers, ... NNs trained using local search have good generalization even without explicit regularization or early stopping (Neyshabur et al. 2014, Zhang et al. 2016, Hoffer et al. 2017).

16 Bias of optimization algorithms. Effect of optimization geometry (Neyshabur et al. 2015): [figure: 0/1 training error and 0/1 test error curves, Path-SGD vs. SGD]. Effect of size of minibatch (Keskar et al. 2017, Dinh et al. 2017). Effect of adaptive algorithms (Wilson et al. 2017). Learning to learn (Andrychowicz et al. 2016, Finn et al. 2017).

19 Implicit Regularization. Different optimization algorithms → different global minimum $\hat{W}$ → different generalization $\mathbb{E}_{x,y}\,\ell(h_{\hat{W}}(x), y)$. Can we characterize which specific global minimum different optimization algorithms converge to?

20 Implicit Regularization Can we characterize which specific global minimum different optimization algorithms converge to? How does this depend on optimization geometry, initialization, step size, momentum, stochasticity?

21 Implicit Regularization. Can we characterize which specific global minimum different optimization algorithms converge to? How does this depend on optimization geometry, initialization, step size, momentum, stochasticity? Understanding the implicit bias could enable: optimization algorithms with faster convergence AND better generalization; new regularization techniques; efficient training of smaller networks.

22 Gradient descent: linear regression. $\min_w L(w) = \sum_{n=1}^N (\langle x_n, w\rangle - y_n)^2$, with the linear model $\hat{y}(x) = \langle w, x\rangle$.

23 Gradient descent initialized at $w(0)$: $w(t+1) = w(t) - \eta\,\nabla_w L(w(t))$, with $\nabla_w L(w(t)) = \sum_{n=1}^N (\langle x_n, w(t)\rangle - y_n)\, x_n$.

24 The updates lie on a low dimensional affine manifold: $\Delta w(t) \in \operatorname{span}(\{x_n\})$.

25 If $w(0) = 0$, then $w(t) \to \operatorname*{argmin}_{Xw=y} \|w\|_2$.

26 More generally, $w(t) \to \operatorname*{argmin}_{Xw=y} \|w - w(0)\|_2$.

27 This is independent of the step size $\eta$, of momentum, and of instance-wise stochastic gradient descent.
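
A minimal numerical check of this claim (problem sizes and names are arbitrary): gradient descent on an underdetermined least-squares problem, started at zero, lands on the minimum-L2-norm interpolant, which we can compare against the pseudoinverse solution.

```python
# Gradient descent from w(0) = 0 on an underdetermined least-squares problem
# converges to the minimum-L2-norm solution argmin_{Xw=y} ||w||_2 = X^+ y.
import numpy as np

rng = np.random.default_rng(0)
N, D = 30, 100                            # N << D: many solutions of Xw = y
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D)

w = np.zeros(D)
eta = 1.0 / np.linalg.norm(X, 2) ** 2     # conservative fixed step size
for _ in range(5000):
    w -= eta * X.T @ (X @ w - y)          # gradient of 0.5 * ||Xw - y||^2

w_min_norm = np.linalg.pinv(X) @ y
print("training residual       :", np.linalg.norm(X @ w - y))
print("distance to min-norm sol:", np.linalg.norm(w - w_min_norm))  # ~ machine precision
```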

28 The same argument holds for linear models $\hat{y}(x) = \langle w, x\rangle$ and any loss function $\ell(\hat{y}(x), y)$ with a unique finite root at $\hat{y} = y$.

29 Can we get such results for other problems and other optimization algorithms? First: the same problem, with different optimization algorithms.

30 Mirror descent w.r.t. a potential $\psi$. Gradient descent: $w(t+1) = \operatorname*{argmin}_w\ \langle w, \eta\nabla_w L(w(t))\rangle + \tfrac{1}{2}\|w - w(t)\|_2^2$.

31 Mirror descent w.r.t. a potential $\psi$. Gradient descent: $w(t+1) = \operatorname*{argmin}_w\ \langle w, \eta\nabla_w L(w(t))\rangle + \tfrac{1}{2}\|w - w(t)\|_2^2$. Mirror descent w.r.t. a strongly convex potential $\psi$: $w(t+1) = \operatorname*{argmin}_w\ \langle w, \eta\nabla_w L(w(t))\rangle + D_\psi(w, w(t))$, where $D_\psi(w, w(t)) = \psi(w) - \psi(w(t)) - \langle \nabla\psi(w(t)), w - w(t)\rangle$. E.g. $\psi(w) = \sum_i w[i]\log w[i]$ gives $D_\psi(w, w') = \mathrm{KL}(w, w')$.
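
For concreteness, a short derivation (my own addition, using only the definitions on this slide and ignoring any positivity or simplex constraints) of why the entropy potential turns the mirror descent update into a multiplicative, exponentiated-gradient update:

```latex
% Set the gradient of the mirror-descent objective to zero:
0 = \eta \nabla L(w(t)) + \nabla\psi(w(t+1)) - \nabla\psi(w(t))
\quad\Longrightarrow\quad
\nabla\psi(w(t+1)) = \nabla\psi(w(t)) - \eta \nabla L(w(t)).

% With \psi(w) = \sum_i w[i]\log w[i] we have \nabla\psi(w) = \log w + 1, so
\log w(t+1) = \log w(t) - \eta \nabla L(w(t))
\quad\Longleftrightarrow\quad
w(t+1) = w(t) \odot e^{-\eta \nabla L(w(t))}.
```

In particular, the dual iterate $\nabla\psi(w(t))$ moves only along $-\nabla L(w(t))$, which is exactly the invariant exploited on the next slides.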

32 Mirror descent w.r.t. a potential $\psi$: $w(t+1) = \operatorname*{argmin}_w\ \langle w, \eta\nabla_w L(w(t))\rangle + D_\psi(w, w(t))$. Gunasekar, Lee, Soudry, Srebro, arXiv 2018

33 The dual updates lie on a low dimensional affine manifold: $\Delta \nabla\psi(w(t)) \in \operatorname{span}(\{x_n\})$. If $\nabla\psi(w(0)) = 0$, then $w(t) \to \operatorname*{argmin}_{Xw=y} \psi(w)$. Gunasekar, Lee, Soudry, Srebro, arXiv 2018

34 Again this is independent of the step size, of stochasticity, and of dual momentum, and it also works with affine constraints on $w$. Exponentiated gradient descent → implicit entropic regularization with $\psi(w) = \sum_i w[i]\log(w[i])$. Gunasekar, Lee, Soudry, Srebro, arXiv 2018
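
A small numerical sketch of this (my own construction, not the authors' code): on one underdetermined least-squares problem, plain gradient descent and exponentiated gradient descent both drive the training loss to zero, but they reach different global minima: the first (approximately) minimizes $\|w\|_2$, the second (approximately) minimizes $\psi(w) = \sum_i w[i]\log w[i]$ over positive interpolants.

```python
# Same data, two algorithms, two different zero-error solutions.
import numpy as np

rng = np.random.default_rng(0)
N, D = 20, 80
X = rng.normal(size=(N, D))
y = X @ rng.uniform(0.5, 1.5, size=D)      # ensures a positive interpolant exists

def grad(w):                                # gradient of 0.5 * ||Xw - y||^2
    return X.T @ (X @ w - y)

# Gradient descent from w(0) = 0.
w_gd = np.zeros(D)
for _ in range(20000):
    w_gd -= 1e-3 * grad(w_gd)

# Exponentiated gradient descent (mirror descent with the entropy potential),
# initialized so that grad(psi)(w(0)) = log(w(0)) + 1 = 0.
w_eg = np.full(D, np.exp(-1.0))
for _ in range(20000):
    w_eg *= np.exp(-1e-3 * grad(w_eg))

for name, w in [("GD ", w_gd), ("EGD", w_eg)]:
    psi = np.sum(w * np.log(w)) if np.all(w > 0) else float("nan")
    print(f"{name}: residual {np.linalg.norm(X @ w - y):.1e}, "
          f"||w||_2 {np.linalg.norm(w):.3f}, min w {w.min():+.3f}, psi(w) {psi:.3f}")
```

On a typical run the GD solution has the smaller $\ell_2$ norm but contains negative coordinates (so $\psi$ is undefined for it), while the EGD solution stays positive, with $\psi(w)$ approximately minimized among the positive interpolants.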

35 Steepest descent w.r.t. a norm $\|\cdot\|$. Gradient descent: $\Delta w(t) \propto \operatorname*{argmin}_{v : \|v\|_2 \le 1} \langle v, \nabla_w L(w(t))\rangle$. Steepest descent w.r.t. a general norm $\|\cdot\|$: $\Delta w(t) \propto \operatorname*{argmin}_{v : \|v\| \le 1} \langle v, \nabla_w L(w(t))\rangle$; e.g. coordinate descent corresponds to $\|\cdot\| = \|\cdot\|_1$. Gunasekar, Lee, Soudry, Srebro, arXiv 2018

36 Steepest descent w.r.t. a norm $\|\cdot\|$: does $w(\infty) \stackrel{?}{=} \operatorname*{argmin}_{Xw=y} \|w - w(0)\|$? Gunasekar, Lee, Soudry, Srebro, arXiv 2018

37 Steepest descent w.r.t. a norm $\|\cdot\|$: no. Even for $w(0) \to 0$, in general $w(\infty) \notin \operatorname*{argmin}_{Xw=y} \|w\|$. Gunasekar, Lee, Soudry, Srebro, arXiv 2018
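
A toy illustration of this negative result (my own random instance; exact numbers will vary): steepest descent w.r.t. $\|\cdot\|_1$ with exact line search is greedy coordinate descent. It reaches zero training error, but on typical instances the $\ell_1$ norm of its limit differs from that of the minimum-$\ell_1$ interpolant computed by a linear program, consistent with there being no clean minimum-norm characterization here.

```python
# Greedy coordinate descent (steepest descent w.r.t. the L1 norm, with exact
# line search) vs. the minimum-L1-norm interpolant found by linear programming.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
N, D = 10, 40
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D)

# Coordinate descent on ||Xw - y||^2: at each step, exactly minimize along the
# coordinate whose partial derivative has the largest magnitude.
w = np.zeros(D)
for _ in range(5000):
    g = 2 * X.T @ (X @ w - y)
    j = np.argmax(np.abs(g))
    w[j] -= g[j] / (2 * X[:, j] @ X[:, j])

# Minimum-L1-norm interpolant via an LP: write w = p - q with p, q >= 0.
res = linprog(np.ones(2 * D), A_eq=np.hstack([X, -X]), b_eq=y, bounds=(0, None))
w_l1 = res.x[:D] - res.x[D:]

print(f"coordinate descent: residual {np.linalg.norm(X @ w - y):.1e}, "
      f"||w||_1 = {np.abs(w).sum():.3f}")
print(f"min-L1 interpolant: residual {np.linalg.norm(X @ w_l1 - y):.1e}, "
      f"||w||_1 = {np.abs(w_l1).sum():.3f}")
```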

38 Can we get such results for other problems and other optimization algorithms? How about gradient descent on other problems or different parameterizations?

39 Matrix Estimation from Linear Measurements (recap): $\min_W L(W) = \|\mathcal{X}(W) - y\|_2^2$, e.g. matrix completion, linear neural networks. When $N \ll d^2$ the optimization is underdetermined with many trivial global minima (e.g. impute 0 or 42 or ... for matrix completion). Gradient descent on the factorized objective $L(U, V) = \|\mathcal{X}(UV^\top) - y\|_2^2$: $U_{k+1} = U_k - \eta\,\nabla_U L(U_k, V_k)$, $V_{k+1} = V_k - \eta\,\nabla_V L(U_k, V_k)$.

40 (Recap: $d = 50$, $N = 300$, $X_n$ i.i.d. Gaussian, rank-2 $W^*$, $y = \mathcal{X}(W^*) + \mathcal{N}(0, 10^{-3})$, $y_{\mathrm{test}} = \mathcal{X}_{\mathrm{test}}(W^*) + \mathcal{N}(0, 10^{-3})$.) Question: which global minimum does gradient descent reach? Why does it generalize well? Gunasekar, Woodworth, Bhojanapalli, Neyshabur, Srebro, NIPS 2017

41 Gradient descent on the factorized objective converges to a minimum nuclear norm solution. Gunasekar, Woodworth, Bhojanapalli, Neyshabur, Srebro, NIPS 2017

42 Conjecture (informal): gradient descent on $L(U, V)$ converges to the minimum nuclear norm solution, $W(t) = U(t)V(t)^\top \to W_{NN} = \operatorname*{argmin}_{\mathcal{X}(W)=y} \|W\|_*$, when the initialization is close to 0 and the step size is very small (gradient flow: $\dot{U}_t = \frac{dU_t}{dt} = -\nabla_U L(U)$). Gunasekar, Woodworth, Bhojanapalli, Neyshabur, Srebro, NIPS 2017

43 Commutative measurements, $X_i X_j = X_j X_i$ for all $i, j \in [N]$: then $W(t) = e^{\mathcal{X}^*(s_t)}\, W(0)\, e^{\mathcal{X}^*(s_t)}$ for some $s_t \in \mathbb{R}^N$, and $W(0) \to 0$ is necessary to remain in the (non-linear) manifold. Gunasekar, Woodworth, Bhojanapalli, Neyshabur, Srebro, NIPS 2017

44 Theorem: let $U_\infty(\alpha)$ be the solution of gradient flow initialized at $U_0 = \alpha I$. If the measurements $X_n$ commute, i.e. $X_i X_j = X_j X_i$, and if $\hat{W}_\infty = \lim_{\alpha \to 0} U_\infty(\alpha) U_\infty(\alpha)^\top$ exists and satisfies $L(\hat{W}_\infty) = 0$, then $\hat{W}_\infty = W_{NN} = \operatorname*{argmin}_{\mathcal{X}(W)=y} \|W\|_*$. The conjecture was proved for RIP $\mathcal{X}$ by Li et al. (2018). Gunasekar, Woodworth, Bhojanapalli, Neyshabur, Srebro, NIPS 2017
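
A brief sketch (my own reconstruction from the statement above, written for the symmetric case $W = UU^\top$ and ignoring constant factors) of where the exponential form comes from:

```latex
% Gradient flow on the factorization, with residuals r(t) = X(W(t)) - y:
\dot U(t) = -\nabla_U L\big(U(t)U(t)^\top\big) \;\propto\; -\,\mathcal{X}^*\big(r(t)\big)\, U(t),
\qquad
\dot W(t) = \dot U U^\top + U \dot U^\top
        \;\propto\; -\Big(\mathcal{X}^*\big(r(t)\big)\, W(t) + W(t)\, \mathcal{X}^*\big(r(t)\big)\Big).

% If the X_n commute, the matrices X*(r(t)) at all times commute with each
% other, and this linear-in-W ODE integrates to
W(t) = e^{\,\mathcal{X}^*(s_t)}\; W(0)\; e^{\,\mathcal{X}^*(s_t)},
\qquad s_t \propto -\int_0^t r(\tau)\, d\tau \;\in\; \mathbb{R}^N .
```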

45 Proof ideas. First, characterize the manifold in which the iterates $w(t)$ lie: $w(0) + \operatorname{span}(\{x_n\})$ for gradient descent; $\nabla\psi(w(t)) \in \nabla\psi(w(0)) + \operatorname{span}(\{x_n\})$ for mirror descent; $W(t) = e^{\mathcal{X}^*(s_t)}\, W(0)\, e^{\mathcal{X}^*(s_t)}$ for matrix factorization (gradient flow with initialization $\alpha I$, $\alpha \to 0$) with commutative $\mathcal{X}$.

46 Then show that the global minima on this manifold satisfy the KKT conditions of the corresponding regularized problem: $\min_{Xw=y} \|w - w(0)\|_2$, $\min_{Xw=y} D_\psi(w, w(0))$, and $\min_{\mathcal{X}(W)=y} \|W\|_*$, respectively.
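
Spelling out the KKT step for the plainest case (gradient descent with the squared loss), since the slide only names it:

```latex
% Target problem:  min_w  (1/2) ||w - w(0)||_2^2   subject to  Xw = y.
% Its KKT conditions: primal feasibility  Xw = y,  and stationarity
%   w - w(0) = X^T \nu   for some dual variable \nu \in R^N.
% Gradient descent supplies both:
w_\infty - w(0) \;=\; -\eta \sum_{t} \nabla L(w(t)) \;\in\; \operatorname{span}\{x_n\} \;=\; \operatorname{range}(X^\top),
\qquad X w_\infty = y,
% hence  w_\infty = argmin_{Xw = y} ||w - w(0)||_2.
```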

47 Losses with a unique finite root. Robust characterization for general mirror descent with potential $\psi$: $w(t) \to \operatorname*{argmin}_{Xw=y} D_\psi(w, w(0))$. No useful characterization for generic steepest descent w.r.t. a norm $\|\cdot\|$, even when $\|\cdot\|^2$ is strongly convex, and even for $w(0) \to 0$. Fragile characterization for matrix factorization: $W(t) \to \operatorname*{argmin}_{\mathcal{X}(W)=y} \|W\|_*$ ONLY for $W(0) \to 0$ and $\eta \to 0$, proven only for RIP measurements, and initialization close to 0 is particularly bad! What happens with other losses? Very different for logistic regression: no finite minima.


50 Implicit bias when global minimum is unattainable! Logistic regression on separable data

51 Gradient descent: logistic regression. $\min_w L(w) = \sum_{n=1}^N \log(1 + \exp(-y_n\langle x_n, w\rangle))$. Gradient descent initialized at $w(0)$: $w(t+1) = w(t) - \eta\,\nabla_w L(w(t))$, with $\nabla_w L(w(t)) = \sum_{n=1}^N r_n(t)\, x_n$ for scalars $r_n(t)$, so the updates again lie in $\operatorname{span}(\{x_n\})$. Soudry, Hoffer, Srebro, ICLR 2018

52 But on separable data there is no finite minimizer: $\|w(t)\| \to \infty$! Soudry, Hoffer, Srebro, ICLR 2018

53 The direction still converges: $\frac{w(t)}{\|w(t)\|_2} \to \operatorname*{argmax}_{w : \|w\|_2 \le 1}\ \min_n y_n\langle w, x_n\rangle$, the $\ell_2$ max-margin (hard-margin SVM) direction, independent of the step size $\eta$ and the initialization $w(0)$. Soudry, Hoffer, Srebro, ICLR 2018
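
A toy check of this statement (my own example; the dataset is chosen so that, by symmetry, the $\ell_2$ max-margin direction is exactly $e_1 = (1, 0)$): the normalized gradient descent iterate drifts, slowly, toward that direction while $\|w(t)\|$ diverges, regardless of the deliberately bad initialization.

```python
# Gradient descent on the logistic loss for a separable 2-D dataset whose
# max-margin direction is (1, 0): the direction w(t)/||w(t)|| converges there.
import numpy as np

X = np.array([[1.0, 0.0], [1.5, 1.0], [-1.0, 0.0], [-1.5, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
# Support vectors are (+-1, 0), so argmax_{||w||_2<=1} min_n y_n <w, x_n> = (1, 0).

def grad(w):
    """Gradient of sum_n log(1 + exp(-y_n <x_n, w>))."""
    margins = y * (X @ w)
    return -X.T @ (y / (1.0 + np.exp(margins)))

w = np.array([0.5, 2.0])                    # deliberately bad initialization
eta = 0.1
for t in range(1, 10**5 + 1):
    w -= eta * grad(w)
    if t in (10**2, 10**3, 10**4, 10**5):
        direction = w / np.linalg.norm(w)
        margin = np.min(y * (X @ direction))
        print(f"t={t:>6}  ||w||={np.linalg.norm(w):7.2f}  "
              f"direction={np.round(direction, 3)}  margin={margin:.3f}")
```

The printed margin creeps toward the optimal value 1 only logarithmically in $t$, matching the slow rate discussed on slide 55.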

54 Gradient descent: logistic regression. The result holds for linear classifiers $\hat{y}(x) = \langle w, x\rangle$ and any strictly monotone loss $\ell(\hat{y}(x), y)$ with an exponential tail. Soudry, Hoffer, Srebro, ICLR 2018

55 How fast is the margin maximized? With a fixed step size $\eta$, the margin converges only at rate $O(1/\log t)$: extremely slow! Compare with the roughly $1/t$ convergence of the loss $L(w(t))$. Can we use a lighter or heavier tail to get faster convergence? No: the exponential tail yields the optimal rate. For $\ell(u) = \exp(-u^\nu)$ with $\nu > 1$ or $0 < \nu < 1$ the margin still converges, but no faster, and for polynomial tails $\ell(u) \propto u^{-\nu}$ it does not converge to the max margin at all. Soudry, Hoffer, Srebro, 2018. Any other way to get faster convergence? Yes: a step size $\eta \propto 1/L(w(t))$ yields much faster margin convergence, comparable to the best hard-margin SVM algorithms. Shpigel Nacson, Lee, Gunasekar, Srebro, Soudry, arXiv 2018


58 Implicit bias on strictly monotone losses with exponential tail Can we get a more robust characterization compared to regression-type losses?

59 Steepest descent w.r.t. a norm $\|\cdot\|$: for strictly monotone losses with an exponential tail, $\frac{w(t)}{\|w(t)\|} \to \operatorname*{argmax}_{\|w\| \le 1}\ \min_n y_n\langle w, x_n\rangle$, the max-margin direction w.r.t. $\|\cdot\|$. Gunasekar, Lee, Soudry, Srebro, arXiv 2018

60 Independent of the initialization, for any small enough step size $\eta$. Gunasekar, Lee, Soudry, Srebro, arXiv 2018

61 Compare with the squared loss, where generic steepest descent admitted no such clean characterization. Gunasekar, Lee, Soudry, Srebro, arXiv 2018

62 Matrix Estimation from Linear Measurements, now with a strictly monotone (logistic-type) loss, e.g. matrix completion, linear neural networks. When $N \ll d^2$ the problem is underdetermined. Gradient descent on the factorized objective: $U_{k+1} = U_k - \eta\,\nabla_U L(U_k, V_k)$, $V_{k+1} = V_k - \eta\,\nabla_V L(U_k, V_k)$. Gunasekar, Lee, Soudry, Srebro, arXiv 2018

63 Theorem: let $W(t) = U(t)U(t)^\top$. For any full rank $W(0)$ and any step size such that $\{L(W(t))\}_t$ is strictly decreasing, if the limits of $\frac{W(t)}{\|W(t)\|}$ and $\frac{\nabla L(W(t))}{\|\nabla L(W(t))\|}$ exist, then $\frac{W(t)}{\|W(t)\|} \to \operatorname*{argmax}_{\|W\|_* \le 1}\ \min_n y_n\langle X_n, W\rangle$. Gunasekar, Lee, Soudry, Srebro, arXiv 2018

64 Strictly monotone losses. Gradient descent: $\frac{w(t)}{\|w(t)\|_2} \to \operatorname*{argmax}_{\|w\|_2 \le 1}\ \min_n y_n\langle x_n, w\rangle$, independent of initialization, for any step size that yields a descent algorithm. Steepest descent w.r.t. a norm $\|\cdot\|$: $\frac{w(t)}{\|w(t)\|} \to \operatorname*{argmax}_{\|w\| \le 1}\ \min_n y_n\langle x_n, w\rangle$, independent of initialization, for any step size that yields a descent algorithm. Matrix factorization: $\frac{W(t)}{\|W(t)\|} \to \operatorname*{argmax}_{\|W\|_* \le 1}\ \min_n y_n\langle X_n, W\rangle$, independent of initialization, for any step size that yields a descent algorithm.

65 Simplicity from Asymptotics. Squared loss: $w(\infty)$ depends on the initialization $w(0)$ and the step size $\eta$; we may need to take $\eta \to 0$ and $w(0) \to 0$ to get a characterization in terms of the gradient manifold. Exponential loss: the limit direction $\frac{w(t)}{\|w(t)\|}$ does NOT depend on the initialization $w(0)$ or the step size $\eta$. What happens at the beginning doesn't affect the asymptotic behavior as $\|w(t)\| \to \infty$: the limit direction is dominated only by the updates, and hence by the gradient manifold.

66 Different optimization algorithms → different implicit bias → different generalization. The role of optimization in ML extends beyond reaching some global minimum: implicit regularization plays a crucial role in the generalization of over-parameterized models, and understanding which specific global minimum an algorithm reaches is important!
