Optimization geometry and implicit regularization


1 Optimization geometry and implicit regularization. Suriya Gunasekar. Joint work with N. Srebro (TTIC), J. Lee (USC), D. Soudry (Technion), M.S. Nacson (Technion), B. Woodworth (TTIC), S. Bhojanapalli (TTIC), B. Neyshabur (TTIC → IAS).

2 Optimization in ML. $h_W : \mathcal{X} \to \mathcal{Y}$, parameterized by $W \in \mathbb{R}^D$. Training data $\{(x_n, y_n) : n = 1, \dots, N\}$; $\hat{W} = \operatorname*{argmin}_W \sum_{n} \ell(h_W(x_n), y_n)$. Over-parameterization: $D \gg N$, so there are many global minima, all with $\ell(h_{\hat{W}}(x_n), y_n) = 0$. What we really care about is $\mathbb{E}_{x,y}\,\ell(h_{\hat{W}}(x), y)$, and different global optima have different $\mathbb{E}_{x,y}\,\ell(h_{\hat{W}}(x), y)$.
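
To make the last point concrete, here is a small self-contained illustration (my own toy construction, not from the talk; all names and sizes are arbitrary): an over-parameterized linear regression in which two different zero-training-error solutions have very different test error.

```python
# Over-parameterized least squares (D >> N): every w with Xw = y has zero
# training error, but different interpolants generalize very differently.
import numpy as np

rng = np.random.default_rng(0)
D, N, N_test = 100, 20, 1000
w_star = rng.normal(size=D) / np.sqrt(D)            # ground-truth linear model
X, X_test = rng.normal(size=(N, D)), rng.normal(size=(N_test, D))
y, y_test = X @ w_star, X_test @ w_star

# One interpolating solution: the minimum-L2-norm solution X^+ y.
w_min_norm = np.linalg.pinv(X) @ y

# Another interpolating solution: add any vector from the null space of X.
null_basis = np.linalg.svd(X)[2][N:].T               # (D, D-N) basis of null(X)
w_other = w_min_norm + null_basis @ rng.normal(size=D - N)

for name, w in [("min L2-norm interpolant", w_min_norm),
                ("another interpolant", w_other)]:
    train_mse = np.mean((X @ w - y) ** 2)
    test_mse = np.mean((X_test @ w - y_test) ** 2)
    print(f"{name:>24}: train MSE {train_mse:.1e}, test MSE {test_mse:.3f}")
```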


6 Learning overparameterized models. $h_W : \mathcal{X} \to \mathcal{Y}$, parameterized by $W \in \mathbb{R}^D$; $\hat{W}_\lambda = \operatorname*{argmin}_W \sum_n \ell(h_W(x_n), y_n) + \lambda R(W)$. Small $R(\hat{W}_\lambda)$ ensures that the population loss $\mathbb{E}_{x,y}\,\ell(h_{\hat{W}_\lambda}(x), y)$ is close to the training loss $\frac{1}{N}\sum_n \ell(h_{\hat{W}_\lambda}(x_n), y_n)$. Explicit regularization for high dimensional estimation.

7 Learning overparameterized models. $\hat{W}_\lambda = \operatorname*{argmin}_W \sum_n \ell(h_W(x_n), y_n) + \lambda R(W)$: explicit regularization for high dimensional estimation. But what happens if we don't have $R(W)$?

8 Matrix Estimation from Linear Measurements. $\min_{W \in \mathbb{R}^{d \times d}} L(W) := \sum_{n=1}^N (\langle X_n, W\rangle - y_n)^2 = \|\mathcal{X}(W) - y\|_2^2$, e.g. matrix completion, linear neural networks, ... When $N \ll d^2$ the optimization is underdetermined with many trivial global minima (e.g. impute 0 or 42 or ... for matrix completion).

9 Factorized parameterization: $\min_{U,V \in \mathbb{R}^{d \times d}} L(U, V) = L(UV^\top) = \|\mathcal{X}(UV^\top) - y\|_2^2$. No explicit regularization and no rank constraint, so the same trivial global minima exist.

10 Gradient descent on $L(U, V)$: $U_{k+1} = U_k - \eta\,\nabla_U L(U_k, V_k)$, $V_{k+1} = V_k - \eta\,\nabla_V L(U_k, V_k)$.

11 Experiment: $d = 50$, $N = 300$, $X_n$ i.i.d. Gaussian, rank-2 ground truth $W^*$; $y = \mathcal{X}(W^*) + \mathcal{N}(0, 10^{-3})$, $y_{\mathrm{test}} = \mathcal{X}_{\mathrm{test}}(W^*) + \mathcal{N}(0, 10^{-3})$. Compare $\min_{W \in \mathbb{R}^{d\times d}} L(W) = \|\mathcal{X}(W) - y\|_2^2$ with $\min_{U,V \in \mathbb{R}^{d\times d}} L(U, V) = \|\mathcal{X}(UV^\top) - y\|_2^2$. Gradient descent on $L(U, V)$ gets to good global minima. Gunasekar, Woodworth, Bhojanapalli, Neyshabur, Srebro, NIPS 2017

12 Same experiment: gradient descent on $L(U, V)$ gets to good global minima, and generalizes better with smaller step size. Gunasekar, Woodworth, Bhojanapalli, Neyshabur, Srebro, NIPS 2017
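
A rough numpy sketch of this kind of experiment, at a smaller scale than the slides' d = 50, N = 300 setting and without the label noise, so it runs in seconds. The step sizes, iteration counts, and initialization scale are my own illustrative choices rather than the paper's; on typical runs, gradient descent on W fits the measurements but stays far from the low-rank ground truth, while gradient descent on the factorization with small initialization ends up much closer to it (and with smaller nuclear norm).

```python
# Matrix sensing: compare gradient descent on W with gradient descent on U V^T.
import numpy as np

rng = np.random.default_rng(1)
d, N, r = 20, 200, 2
W_star = rng.normal(size=(d, r)) @ rng.normal(size=(r, d))   # rank-2 ground truth
X = rng.normal(size=(N, d, d))                               # measurement matrices
y = np.einsum("nij,ij->n", X, W_star)                        # y_n = <X_n, W*>

def grad_W(W):
    """Gradient of (1/N) * sum_n (<X_n, W> - y_n)^2 with respect to W."""
    resid = np.einsum("nij,ij->n", X, W) - y
    return (2.0 / N) * np.einsum("n,nij->ij", resid, X)

# (a) Gradient descent directly on W, initialized at 0.
W = np.zeros((d, d))
for _ in range(2000):
    W -= 0.05 * grad_W(W)

# (b) Gradient descent on W = U V^T, with a small random initialization.
U, V = 1e-3 * rng.normal(size=(d, d)), 1e-3 * rng.normal(size=(d, d))
for _ in range(10000):
    G = grad_W(U @ V.T)
    U, V = U - 1e-3 * G @ V, V - 1e-3 * G.T @ U

for name, Wh in [("GD on W", W), ("GD on U,V", U @ V.T)]:
    rel_err = np.linalg.norm(Wh - W_star) / np.linalg.norm(W_star)
    print(f"{name:>10}: nuclear norm {np.linalg.norm(Wh, 'nuc'):6.1f}, "
          f"relative error vs W* {rel_err:.3f}")
print(f"{'W*':>10}: nuclear norm {np.linalg.norm(W_star, 'nuc'):6.1f}")
```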

13 Same experiment. Question: which global minimum does gradient descent reach? Why does it generalize well? Gunasekar, Woodworth, Bhojanapalli, Neyshabur, Srebro, NIPS 2017

14 Implicit Regularization. Different optimization algorithms → different global minimum $\hat{W}$ → different generalization $\mathbb{E}_{x,y}\,\ell(h_{\hat{W}}(x), y)$.

15 Overparameterization in neural networks. Image datasets: CIFAR ~60K images, ImageNet ~14M images, ~1M annotations. Architectures for vision tasks: AlexNet (2012): 8 layers, 60M parameters; VGG-16 (2014): 16 layers, 138M parameters; ResNet (2015): 152 layers, ... NNs trained using local search have good generalization even without explicit regularization or early stopping (Neyshabur et al. 2014, Zhang et al. 2016, Hoffer et al. 2017).

16 Bias of optimization algorithms. Effect of optimization geometry (Neyshabur et al. 2015): [figure: 0/1 training error and 0/1 test error curves, Path-SGD vs. SGD]. Effect of size of minibatch (Keskar et al. 2017, Dinh et al. 2017). Effect of adaptive algorithms (Wilson et al. 2017). Learning to learn (Andrychowicz et al. 2016, Finn et al. 2017).

19 Implicit Regularization. Different optimization algorithms → different global minimum $\hat{W}$ → different generalization $\mathbb{E}_{x,y}\,\ell(h_{\hat{W}}(x), y)$. Can we characterize which specific global minimum different optimization algorithms converge to?

20 Implicit Regularization Can we characterize which specific global minimum different optimization algorithms converge to? How does this depend on optimization geometry, initialization, step size, momentum, stochasticity?

21 Implicit Regularization. Can we characterize which specific global minimum different optimization algorithms converge to? How does this depend on optimization geometry, initialization, step size, momentum, stochasticity? Understanding the implicit bias could enable: optimization algorithms with faster convergence AND better generalization; new regularization techniques; efficient training of smaller networks.

22 Gradient descent: linear regression. $\min_w L(w) = \sum_{n=1}^N (\langle x_n, w\rangle - y_n)^2$, with the linear model $\hat{y}(x) = \langle w, x\rangle$.

23 Gradient descent initialized at $w(0)$: $w(t+1) = w(t) - \eta\,\nabla_w L(w(t))$, with $\nabla_w L(w(t)) = \sum_{n=1}^N (\langle x_n, w(t)\rangle - y_n)\, x_n$.

24 The updates lie on a low dimensional affine manifold: $\Delta w(t) \in \operatorname{span}(\{x_n\})$.

25 If $w(0) = 0$, then $w(t) \to \operatorname*{argmin}_{Xw=y} \|w\|_2$.

26 More generally, $w(t) \to \operatorname*{argmin}_{Xw=y} \|w - w(0)\|_2$.

27 This is independent of the step size $\eta$, of momentum, and of instance-wise stochastic gradient descent.
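
A minimal numerical check of this claim (problem sizes and names are arbitrary): gradient descent on an underdetermined least-squares problem, started at zero, lands on the minimum-L2-norm interpolant, which we can compare against the pseudoinverse solution.

```python
# Gradient descent from w(0) = 0 on an underdetermined least-squares problem
# converges to the minimum-L2-norm solution argmin_{Xw=y} ||w||_2 = X^+ y.
import numpy as np

rng = np.random.default_rng(0)
N, D = 30, 100                            # N << D: many solutions of Xw = y
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D)

w = np.zeros(D)
eta = 1.0 / np.linalg.norm(X, 2) ** 2     # conservative fixed step size
for _ in range(5000):
    w -= eta * X.T @ (X @ w - y)          # gradient of 0.5 * ||Xw - y||^2

w_min_norm = np.linalg.pinv(X) @ y
print("training residual       :", np.linalg.norm(X @ w - y))
print("distance to min-norm sol:", np.linalg.norm(w - w_min_norm))  # ~ machine precision
```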

28 The same argument holds for linear models $\hat{y}(x) = \langle w, x\rangle$ and any loss function $\ell(\hat{y}(x), y)$ with a unique finite root at $\hat{y} = y$.

29 Can we get such results for other problems and other optimization algorithms? First: the same problem, with different optimization algorithms.

30 Mirror descent w.r.t. a potential $\psi$. Gradient descent: $w(t+1) = \operatorname*{argmin}_w\ \langle w, \eta\nabla_w L(w(t))\rangle + \tfrac{1}{2}\|w - w(t)\|_2^2$.

31 Mirror descent w.r.t. a potential $\psi$. Gradient descent: $w(t+1) = \operatorname*{argmin}_w\ \langle w, \eta\nabla_w L(w(t))\rangle + \tfrac{1}{2}\|w - w(t)\|_2^2$. Mirror descent w.r.t. a strongly convex potential $\psi$: $w(t+1) = \operatorname*{argmin}_w\ \langle w, \eta\nabla_w L(w(t))\rangle + D_\psi(w, w(t))$, where $D_\psi(w, w(t)) = \psi(w) - \psi(w(t)) - \langle \nabla\psi(w(t)), w - w(t)\rangle$. E.g. $\psi(w) = \sum_i w[i]\log w[i]$ gives $D_\psi(w, w') = \mathrm{KL}(w, w')$.
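
For concreteness, a short derivation (my own addition, using only the definitions on this slide and ignoring any positivity or simplex constraints) of why the entropy potential turns the mirror descent update into a multiplicative, exponentiated-gradient update:

```latex
% Set the gradient of the mirror-descent objective to zero:
0 = \eta \nabla L(w(t)) + \nabla\psi(w(t+1)) - \nabla\psi(w(t))
\quad\Longrightarrow\quad
\nabla\psi(w(t+1)) = \nabla\psi(w(t)) - \eta \nabla L(w(t)).

% With \psi(w) = \sum_i w[i]\log w[i] we have \nabla\psi(w) = \log w + 1, so
\log w(t+1) = \log w(t) - \eta \nabla L(w(t))
\quad\Longleftrightarrow\quad
w(t+1) = w(t) \odot e^{-\eta \nabla L(w(t))}.
```

In particular, the dual iterate $\nabla\psi(w(t))$ moves only along $-\nabla L(w(t))$, which is exactly the invariant exploited on the next slides.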

32 Mirror descent w.r.t. a potential $\psi$: $w(t+1) = \operatorname*{argmin}_w\ \langle w, \eta\nabla_w L(w(t))\rangle + D_\psi(w, w(t))$. Gunasekar, Lee, Soudry, Srebro, arXiv 2018

33 The dual updates lie on a low dimensional affine manifold: $\Delta \nabla\psi(w(t)) \in \operatorname{span}(\{x_n\})$. If $\nabla\psi(w(0)) = 0$, then $w(t) \to \operatorname*{argmin}_{Xw=y} \psi(w)$. Gunasekar, Lee, Soudry, Srebro, arXiv 2018

34 Again this is independent of the step size, of stochasticity, and of dual momentum, and it also works with affine constraints on $w$. Exponentiated gradient descent → implicit entropic regularization with $\psi(w) = \sum_i w[i]\log(w[i])$. Gunasekar, Lee, Soudry, Srebro, arXiv 2018
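
A small numerical sketch of this (my own construction, not the authors' code): on one underdetermined least-squares problem, plain gradient descent and exponentiated gradient descent both drive the training loss to zero, but they reach different global minima: the first (approximately) minimizes $\|w\|_2$, the second (approximately) minimizes $\psi(w) = \sum_i w[i]\log w[i]$ over positive interpolants.

```python
# Same data, two algorithms, two different zero-error solutions.
import numpy as np

rng = np.random.default_rng(0)
N, D = 20, 80
X = rng.normal(size=(N, D))
y = X @ rng.uniform(0.5, 1.5, size=D)      # ensures a positive interpolant exists

def grad(w):                                # gradient of 0.5 * ||Xw - y||^2
    return X.T @ (X @ w - y)

# Gradient descent from w(0) = 0.
w_gd = np.zeros(D)
for _ in range(20000):
    w_gd -= 1e-3 * grad(w_gd)

# Exponentiated gradient descent (mirror descent with the entropy potential),
# initialized so that grad(psi)(w(0)) = log(w(0)) + 1 = 0.
w_eg = np.full(D, np.exp(-1.0))
for _ in range(20000):
    w_eg *= np.exp(-1e-3 * grad(w_eg))

for name, w in [("GD ", w_gd), ("EGD", w_eg)]:
    psi = np.sum(w * np.log(w)) if np.all(w > 0) else float("nan")
    print(f"{name}: residual {np.linalg.norm(X @ w - y):.1e}, "
          f"||w||_2 {np.linalg.norm(w):.3f}, min w {w.min():+.3f}, psi(w) {psi:.3f}")
```

On a typical run the GD solution has the smaller $\ell_2$ norm but contains negative coordinates (so $\psi$ is undefined for it), while the EGD solution stays positive, with $\psi(w)$ approximately minimized among the positive interpolants.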

35 Steepest descent w.r.t. a norm $\|\cdot\|$. Gradient descent: $\Delta w(t) \propto \operatorname*{argmin}_{v : \|v\|_2 \le 1} \langle v, \nabla_w L(w(t))\rangle$. Steepest descent w.r.t. a general norm $\|\cdot\|$: $\Delta w(t) \propto \operatorname*{argmin}_{v : \|v\| \le 1} \langle v, \nabla_w L(w(t))\rangle$; e.g. coordinate descent corresponds to $\|\cdot\| = \|\cdot\|_1$. Gunasekar, Lee, Soudry, Srebro, arXiv 2018

36 Steepest descent w.r.t. a norm $\|\cdot\|$: does $w(\infty) \stackrel{?}{=} \operatorname*{argmin}_{Xw=y} \|w - w(0)\|$? Gunasekar, Lee, Soudry, Srebro, arXiv 2018

37 Steepest descent w.r.t. a norm $\|\cdot\|$: no. Even for $w(0) \to 0$, in general $w(\infty) \notin \operatorname*{argmin}_{Xw=y} \|w\|$. Gunasekar, Lee, Soudry, Srebro, arXiv 2018
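
A toy illustration of this negative result (my own random instance; exact numbers will vary): steepest descent w.r.t. $\|\cdot\|_1$ with exact line search is greedy coordinate descent. It reaches zero training error, but on typical instances the $\ell_1$ norm of its limit differs from that of the minimum-$\ell_1$ interpolant computed by a linear program, consistent with there being no clean minimum-norm characterization here.

```python
# Greedy coordinate descent (steepest descent w.r.t. the L1 norm, with exact
# line search) vs. the minimum-L1-norm interpolant found by linear programming.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
N, D = 10, 40
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D)

# Coordinate descent on ||Xw - y||^2: at each step, exactly minimize along the
# coordinate whose partial derivative has the largest magnitude.
w = np.zeros(D)
for _ in range(5000):
    g = 2 * X.T @ (X @ w - y)
    j = np.argmax(np.abs(g))
    w[j] -= g[j] / (2 * X[:, j] @ X[:, j])

# Minimum-L1-norm interpolant via an LP: write w = p - q with p, q >= 0.
res = linprog(np.ones(2 * D), A_eq=np.hstack([X, -X]), b_eq=y, bounds=(0, None))
w_l1 = res.x[:D] - res.x[D:]

print(f"coordinate descent: residual {np.linalg.norm(X @ w - y):.1e}, "
      f"||w||_1 = {np.abs(w).sum():.3f}")
print(f"min-L1 interpolant: residual {np.linalg.norm(X @ w_l1 - y):.1e}, "
      f"||w||_1 = {np.abs(w_l1).sum():.3f}")
```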

38 Can we get such results for other problems and other optimization algorithms? How about gradient descent on other problems or different parameterizations?

39 Matrix Estimation from Linear Measurements (recap): $\min_W L(W) = \|\mathcal{X}(W) - y\|_2^2$, e.g. matrix completion, linear neural networks. When $N \ll d^2$ the optimization is underdetermined with many trivial global minima (e.g. impute 0 or 42 or ... for matrix completion). Gradient descent on the factorized objective $L(U, V) = \|\mathcal{X}(UV^\top) - y\|_2^2$: $U_{k+1} = U_k - \eta\,\nabla_U L(U_k, V_k)$, $V_{k+1} = V_k - \eta\,\nabla_V L(U_k, V_k)$.

40 (Recap: $d = 50$, $N = 300$, $X_n$ i.i.d. Gaussian, rank-2 $W^*$, $y = \mathcal{X}(W^*) + \mathcal{N}(0, 10^{-3})$, $y_{\mathrm{test}} = \mathcal{X}_{\mathrm{test}}(W^*) + \mathcal{N}(0, 10^{-3})$.) Question: which global minimum does gradient descent reach? Why does it generalize well? Gunasekar, Woodworth, Bhojanapalli, Neyshabur, Srebro, NIPS 2017

41 Gradient descent on the factorized objective converges to a minimum nuclear norm solution. Gunasekar, Woodworth, Bhojanapalli, Neyshabur, Srebro, NIPS 2017

42 Conjecture (informal): gradient descent on $L(U, V)$ converges to the minimum nuclear norm solution, $W(t) = U(t)V(t)^\top \to W_{NN} = \operatorname*{argmin}_{\mathcal{X}(W)=y} \|W\|_*$, when the initialization is close to 0 and the step size is very small (gradient flow: $\dot{U}_t = \frac{dU_t}{dt} = -\nabla_U L(U)$). Gunasekar, Woodworth, Bhojanapalli, Neyshabur, Srebro, NIPS 2017

43 Commutative measurements, $X_i X_j = X_j X_i$ for all $i, j \in [N]$: then $W(t) = e^{\mathcal{X}^*(s_t)}\, W(0)\, e^{\mathcal{X}^*(s_t)}$ for some $s_t \in \mathbb{R}^N$, and $W(0) \to 0$ is necessary to remain in the (non-linear) manifold. Gunasekar, Woodworth, Bhojanapalli, Neyshabur, Srebro, NIPS 2017

44 Theorem: let $U_\infty(\alpha)$ be the solution of gradient flow initialized at $U_0 = \alpha I$. If the measurements $X_n$ commute, i.e. $X_i X_j = X_j X_i$, and if $\hat{W}_\infty = \lim_{\alpha \to 0} U_\infty(\alpha) U_\infty(\alpha)^\top$ exists and satisfies $L(\hat{W}_\infty) = 0$, then $\hat{W}_\infty = W_{NN} = \operatorname*{argmin}_{\mathcal{X}(W)=y} \|W\|_*$. The conjecture was proved for RIP $\mathcal{X}$ by Li et al. (2018). Gunasekar, Woodworth, Bhojanapalli, Neyshabur, Srebro, NIPS 2017
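
A brief sketch (my own reconstruction from the statement above, written for the symmetric case $W = UU^\top$ and ignoring constant factors) of where the exponential form comes from:

```latex
% Gradient flow on the factorization, with residuals r(t) = X(W(t)) - y:
\dot U(t) = -\nabla_U L\big(U(t)U(t)^\top\big) \;\propto\; -\,\mathcal{X}^*\big(r(t)\big)\, U(t),
\qquad
\dot W(t) = \dot U U^\top + U \dot U^\top
        \;\propto\; -\Big(\mathcal{X}^*\big(r(t)\big)\, W(t) + W(t)\, \mathcal{X}^*\big(r(t)\big)\Big).

% If the X_n commute, the matrices X*(r(t)) at all times commute with each
% other, and this linear-in-W ODE integrates to
W(t) = e^{\,\mathcal{X}^*(s_t)}\; W(0)\; e^{\,\mathcal{X}^*(s_t)},
\qquad s_t \propto -\int_0^t r(\tau)\, d\tau \;\in\; \mathbb{R}^N .
```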

45 Proof ideas. First, characterize the manifold in which the iterates $w(t)$ lie: $w(0) + \operatorname{span}(\{x_n\})$ for gradient descent; $\nabla\psi(w(t)) \in \nabla\psi(w(0)) + \operatorname{span}(\{x_n\})$ for mirror descent; $W(t) = e^{\mathcal{X}^*(s_t)}\, W(0)\, e^{\mathcal{X}^*(s_t)}$ for matrix factorization (gradient flow with initialization $\alpha I$, $\alpha \to 0$) with commutative $\mathcal{X}$.

46 Then show that the global minima on this manifold satisfy the KKT conditions of the corresponding regularized problem: $\min_{Xw=y} \|w - w(0)\|_2$, $\min_{Xw=y} D_\psi(w, w(0))$, and $\min_{\mathcal{X}(W)=y} \|W\|_*$, respectively.
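
Spelling out the KKT step for the plainest case (gradient descent with the squared loss), since the slide only names it:

```latex
% Target problem:  min_w  (1/2) ||w - w(0)||_2^2   subject to  Xw = y.
% Its KKT conditions: primal feasibility  Xw = y,  and stationarity
%   w - w(0) = X^T \nu   for some dual variable \nu \in R^N.
% Gradient descent supplies both:
w_\infty - w(0) \;=\; -\eta \sum_{t} \nabla L(w(t)) \;\in\; \operatorname{span}\{x_n\} \;=\; \operatorname{range}(X^\top),
\qquad X w_\infty = y,
% hence  w_\infty = argmin_{Xw = y} ||w - w(0)||_2.
```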

47 Losses with a unique finite root. Robust characterization for general mirror descent with potential $\psi$: $w(t) \to \operatorname*{argmin}_{Xw=y} D_\psi(w, w(0))$. No useful characterization for generic steepest descent w.r.t. a norm $\|\cdot\|$, even when $\|\cdot\|^2$ is strongly convex, and even for $w(0) \to 0$. Fragile characterization for matrix factorization: $W(t) \to \operatorname*{argmin}_{\mathcal{X}(W)=y} \|W\|_*$ ONLY for $W(0) \to 0$ and $\eta \to 0$, proven only for RIP measurements, and initialization close to 0 is particularly bad! What happens with other losses? Very different for logistic regression: no finite minima.


50 Implicit bias when global minimum is unattainable! Logistic regression on separable data

51 Gradient descent: logistic regression. $\min_w L(w) = \sum_{n=1}^N \log(1 + \exp(-y_n\langle x_n, w\rangle))$. Gradient descent initialized at $w(0)$: $w(t+1) = w(t) - \eta\,\nabla_w L(w(t))$, with $\nabla_w L(w(t)) = \sum_{n=1}^N r_n(t)\, x_n$ for scalars $r_n(t)$, so the updates again lie in $\operatorname{span}(\{x_n\})$. Soudry, Hoffer, Srebro, ICLR 2018

52 But on separable data there is no finite minimizer: $\|w(t)\| \to \infty$! Soudry, Hoffer, Srebro, ICLR 2018

53 The direction still converges: $\frac{w(t)}{\|w(t)\|_2} \to \operatorname*{argmax}_{w : \|w\|_2 \le 1}\ \min_n y_n\langle w, x_n\rangle$, the $\ell_2$ max-margin (hard-margin SVM) direction, independent of the step size $\eta$ and the initialization $w(0)$. Soudry, Hoffer, Srebro, ICLR 2018
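
A toy check of this statement (my own example; the dataset is chosen so that, by symmetry, the $\ell_2$ max-margin direction is exactly $e_1 = (1, 0)$): the normalized gradient descent iterate drifts, slowly, toward that direction while $\|w(t)\|$ diverges, regardless of the deliberately bad initialization.

```python
# Gradient descent on the logistic loss for a separable 2-D dataset whose
# max-margin direction is (1, 0): the direction w(t)/||w(t)|| converges there.
import numpy as np

X = np.array([[1.0, 0.0], [1.5, 1.0], [-1.0, 0.0], [-1.5, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
# Support vectors are (+-1, 0), so argmax_{||w||_2<=1} min_n y_n <w, x_n> = (1, 0).

def grad(w):
    """Gradient of sum_n log(1 + exp(-y_n <x_n, w>))."""
    margins = y * (X @ w)
    return -X.T @ (y / (1.0 + np.exp(margins)))

w = np.array([0.5, 2.0])                    # deliberately bad initialization
eta = 0.1
for t in range(1, 10**5 + 1):
    w -= eta * grad(w)
    if t in (10**2, 10**3, 10**4, 10**5):
        direction = w / np.linalg.norm(w)
        margin = np.min(y * (X @ direction))
        print(f"t={t:>6}  ||w||={np.linalg.norm(w):7.2f}  "
              f"direction={np.round(direction, 3)}  margin={margin:.3f}")
```

The printed margin creeps toward the optimal value 1 only logarithmically in $t$, matching the slow rate discussed on slide 55.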

54 Gradient descent: logistic regression. The result holds for linear classifiers $\hat{y}(x) = \langle w, x\rangle$ and any strictly monotone loss $\ell(\hat{y}(x), y)$ with an exponential tail. Soudry, Hoffer, Srebro, ICLR 2018

55 How fast is the margin maximized? With a fixed step size $\eta$, the margin converges only at rate $O(1/\log t)$: extremely slow! Compare with the roughly $1/t$ convergence of the loss $L(w(t))$. Can we use a lighter or heavier tail to get faster convergence? No: the exponential tail yields the optimal rate. For $\ell(u) = \exp(-u^\nu)$ with $\nu > 1$ or $0 < \nu < 1$ the margin still converges, but no faster, and for polynomial tails $\ell(u) \propto u^{-\nu}$ it does not converge to the max margin at all. Soudry, Hoffer, Srebro, 2018. Any other way to get faster convergence? Yes: a step size $\eta \propto 1/L(w(t))$ yields much faster margin convergence, comparable to the best hard-margin SVM algorithms. Shpigel Nacson, Lee, Gunasekar, Srebro, Soudry, arXiv 2018


58 Implicit bias on strictly monotone losses with exponential tail Can we get a more robust characterization compared to regression-type losses?

59 Steepest descent w.r.t. a norm $\|\cdot\|$: for strictly monotone losses with an exponential tail, $\frac{w(t)}{\|w(t)\|} \to \operatorname*{argmax}_{\|w\| \le 1}\ \min_n y_n\langle w, x_n\rangle$, the max-margin direction w.r.t. $\|\cdot\|$. Gunasekar, Lee, Soudry, Srebro, arXiv 2018

60 Independent of the initialization, for any small enough step size $\eta$. Gunasekar, Lee, Soudry, Srebro, arXiv 2018

61 Compare with the squared loss, where generic steepest descent admitted no such clean characterization. Gunasekar, Lee, Soudry, Srebro, arXiv 2018

62 Matrix Estimation from Linear Measurements, now with a strictly monotone (logistic-type) loss, e.g. matrix completion, linear neural networks. When $N \ll d^2$ the problem is underdetermined. Gradient descent on the factorized objective: $U_{k+1} = U_k - \eta\,\nabla_U L(U_k, V_k)$, $V_{k+1} = V_k - \eta\,\nabla_V L(U_k, V_k)$. Gunasekar, Lee, Soudry, Srebro, arXiv 2018

63 Theorem: let $W(t) = U(t)U(t)^\top$. For any full rank $W(0)$ and any step size such that $\{L(W(t))\}_t$ is strictly decreasing, if the limits of $\frac{W(t)}{\|W(t)\|}$ and $\frac{\nabla L(W(t))}{\|\nabla L(W(t))\|}$ exist, then $\frac{W(t)}{\|W(t)\|} \to \operatorname*{argmax}_{\|W\|_* \le 1}\ \min_n y_n\langle X_n, W\rangle$. Gunasekar, Lee, Soudry, Srebro, arXiv 2018

64 Strictly monotone losses. Gradient descent: $\frac{w(t)}{\|w(t)\|_2} \to \operatorname*{argmax}_{\|w\|_2 \le 1}\ \min_n y_n\langle x_n, w\rangle$, independent of initialization, for any step size that yields a descent algorithm. Steepest descent w.r.t. a norm $\|\cdot\|$: $\frac{w(t)}{\|w(t)\|} \to \operatorname*{argmax}_{\|w\| \le 1}\ \min_n y_n\langle x_n, w\rangle$, independent of initialization, for any step size that yields a descent algorithm. Matrix factorization: $\frac{W(t)}{\|W(t)\|} \to \operatorname*{argmax}_{\|W\|_* \le 1}\ \min_n y_n\langle X_n, W\rangle$, independent of initialization, for any step size that yields a descent algorithm.

65 Simplicity from Asymptotics. Squared loss: $w(\infty)$ depends on the initialization $w(0)$ and the step size $\eta$; we may need to take $\eta \to 0$ and $w(0) \to 0$ to get a characterization in terms of the gradient manifold. Exponential loss: the limit direction $\frac{w(t)}{\|w(t)\|}$ does NOT depend on the initialization $w(0)$ or the step size $\eta$. What happens at the beginning doesn't affect the asymptotic behavior as $\|w(t)\| \to \infty$: the limit direction is dominated only by the updates, and hence by the gradient manifold.

66 Different optimization algorithms → different implicit bias → different generalization. The role of optimization in ML extends beyond reaching some global minimum: implicit regularization plays a crucial role in the generalization of over-parameterized models, and understanding which specific global minimum an algorithm reaches is important!
