Optimization geometry and implicit regularization
1 Optimization geometry and implicit regularization. Suriya Gunasekar. Joint work with N. Srebro (TTIC), J. Lee (USC), D. Soudry (Technion), M. S. Nacson (Technion), B. Woodworth (TTIC), S. Bhojanapalli (TTIC), and B. Neyshabur (TTIC → IAS).
2-5 Optimization in ML. A model $h_w : x \mapsto \hat{y}$ is parameterized by $w \in \mathbb{R}^D$. Given training data $\{(x_n, y_n) : n = 1, \dots, N\}$, we solve
$$\hat{w} = \arg\min_w \sum_{n=1}^N \ell(h_w(x_n), y_n).$$
Over-parameterization: $D \gg N$, so there are many global minima, all with $\ell(h_{\hat{w}}(x_n), y_n) = 0$. What we really care about is the population risk $\mathbb{E}_{x,y}\,\ell(h_{\hat{w}}(x), y)$, and different global optima have different $\mathbb{E}_{x,y}\,\ell(h_{\hat{w}}(x), y)$.
6-7 Learning overparameterized models. With explicit regularization,
$$\hat{w}_\lambda = \arg\min_w \sum_{n=1}^N \ell(h_w(x_n), y_n) + \lambda R(w),$$
a small $R(\hat{w}_\lambda)$ ensures the population risk $\mathbb{E}_{x,y}\,\ell(h_{\hat{w}_\lambda}(x), y)$ is close to the training loss $\frac{1}{N}\sum_n \ell(h_{\hat{w}_\lambda}(x_n), y_n)$: this is explicit regularization for high dimensional estimation. But what happens if we don't have $R(w)$?
8-10 Matrix estimation from linear measurements:
$$\min_{W \in \mathbb{R}^{d \times d}} L(W) := \sum_{n=1}^N (\langle X_n, W \rangle - y_n)^2 = \|\mathcal{X}(W) - y\|_2^2,$$
e.g. matrix completion, linear neural networks. When $N \ll d^2$, the optimization is underdetermined with many trivial global minima (e.g. impute 0, or 42, or anything else for matrix completion).
Factorized formulation: $\min_{U,V \in \mathbb{R}^{d \times d}} L(U,V) = L(UV^\top) = \|\mathcal{X}(UV^\top) - y\|_2^2$, with no explicit regularization and no rank constraint, so the same trivial global minima exist.
Gradient descent on $L(U,V)$: $U_{k+1} = U_k - \eta\,\nabla_U L(U_k, V_k)$, $V_{k+1} = V_k - \eta\,\nabla_V L(U_k, V_k)$.
11-13 Simulation: $d = 50$, $N = 300$, $X_n$ i.i.d. Gaussian, $W^\star$ a rank-2 ground truth, $y = \mathcal{X}(W^\star) + \mathcal{N}(0, 10^{-3})$, $y_{\text{test}} = \mathcal{X}_{\text{test}}(W^\star) + \mathcal{N}(0, 10^{-3})$. Gradient descent on $L(U,V)$ gets to good global minima, and generalizes better with smaller step size. Question: which global minima does gradient descent reach, and why do they generalize well? [Gunasekar, Woodworth, Bhojanapalli, Neyshabur, Srebro, NIPS 2017]
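A minimal NumPy sketch of this kind of experiment (not the authors' code; the dimensions are scaled down and the step size is an ad hoc choice so it runs in seconds):

import numpy as np

rng = np.random.default_rng(0)
d, N, r = 20, 120, 2                              # scaled-down version of d=50, N=300, rank 2

Wstar = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))
Wstar /= np.linalg.norm(Wstar)                    # rank-2 ground truth, unit Frobenius norm
X = rng.standard_normal((N, d, d))                # X_n i.i.d. Gaussian measurements
y = np.einsum('nij,ij->n', X, Wstar) + 1e-3 * rng.standard_normal(N)

def loss_and_grad(W):
    res = np.einsum('nij,ij->n', X, W) - y        # <X_n, W> - y_n
    return (res ** 2).sum(), np.einsum('n,nij->ij', 2 * res, X)

U = 1e-3 * rng.standard_normal((d, d))            # initialization close to 0
V = 1e-3 * rng.standard_normal((d, d))
eta = 5e-4
for _ in range(50000):                            # GD on L(U, V) = ||X(UV^T) - y||^2
    _, G = loss_and_grad(U @ V.T)                 # G = dL/dW at W = UV^T
    U, V = U - eta * G @ V, V - eta * G.T @ U     # chain rule: dL/dU = G V, dL/dV = G^T U

W = U @ V.T
print('train loss:', loss_and_grad(W)[0])
print('relative recovery error:', np.linalg.norm(W - Wstar) / np.linalg.norm(Wstar))

Even though $N = 120$ is far fewer than the $d^2 = 400$ unknowns, runs like this typically recover the low-rank ground truth, which is the phenomenon the following slides try to explain.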
14 Implicit regularization: different optimization algorithms reach different global minima $\hat{W}$, and hence different generalization $\mathbb{E}_{x,y}\,\ell(h_{\hat{W}}(x), y)$.
15 Overparameterization in neural networks. Image datasets: CIFAR ~60K images; ImageNet ~14M images, ~1M annotations. Architectures for vision tasks: AlexNet (2012), 8 layers, 60M parameters; VGG-16 (2014), 16 layers, 138M parameters; ResNet (2015), 152 layers. NNs trained using local search have good generalization even without explicit regularization or early stopping (Neyshabur et al. 2014, Zhang et al. 2016, Hoffer et al. 2017).
16-18 Bias of optimization algorithms. Effect of optimization geometry (Neyshabur et al. 2015) [figure: 0/1 training and test error of Path-SGD vs. SGD]. Effect of minibatch size (Keskar et al. 2017, Dinh et al. 2017). Effect of adaptive algorithms (Wilson et al. 2017). Learning to learn (Andrychowicz et al. 2016, Finn et al. 2017).
19-21 Implicit regularization. Different optimization algorithms reach different global minima $\hat{W}$, with different generalization. Can we characterize which specific global minimum different optimization algorithms converge to? How does this depend on optimization geometry, initialization, step size, momentum, stochasticity? Understanding the implicit bias could enable: optimization algorithms with faster convergence AND better generalization; new regularization techniques; efficient training of smaller networks.
22-27 Gradient descent: linear regression.
$$\min_w L(w) = \sum_{n=1}^N (\langle x_n, w \rangle - y_n)^2$$
Gradient descent initialized at $w(0)$: $w(t+1) = w(t) - \eta\,\nabla_w L(w(t))$, with $\nabla_w L(w(t)) = \sum_{n=1}^N (\langle x_n, w(t)\rangle - y_n)\,x_n$.
The updates lie on a low dimensional affine manifold: $\Delta w(t) \in \operatorname{span}(\{x_n\})$. If $w(0) = 0$, then $w(t) \to \arg\min_{Xw=y} \|w\|_2$; more generally, $w(t) \to \arg\min_{Xw=y} \|w - w(0)\|_2$.
This is independent of the step size $\eta$, of momentum, and of instancewise stochastic gradient descent.
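This limit is easy to check numerically. A small sketch (the setup below is an assumed toy instance, not from the talk): gradient descent from zero lands on the same point as the pseudoinverse solution.

import numpy as np

rng = np.random.default_rng(0)
N, D = 20, 100                            # overparameterized: D >> N
X = rng.standard_normal((N, D))
y = rng.standard_normal(N)

w = np.zeros(D)                           # w(0) = 0
eta = 1e-3
for _ in range(5000):
    w -= eta * X.T @ (X @ w - y)          # gradient of (1/2)||Xw - y||^2; constants fold into eta

w_minnorm = np.linalg.pinv(X) @ y         # argmin ||w||_2 subject to Xw = y
print(np.linalg.norm(X @ w - y))          # ~0: w is a global minimum
print(np.linalg.norm(w - w_minnorm))      # ~0: and it is the minimum-norm one

Changing eta or adding momentum changes which iterates are visited, but not the limit, matching the claim above.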
28 The same argument holds for any linear model $\hat{y}(x) = \langle w, x \rangle$ and any loss function $\ell(\hat{y}(x), y)$ with a unique finite root at $\hat{y} = y$.
29 Can we get such results for other problems and other optimization algorithms? First: the same problem, with different optimization algorithms.
30-34 Mirror descent w.r.t. a potential $\psi$.
Gradient descent can be written as $w(t+1) = \arg\min_w \langle w, \eta\nabla_w L(w(t))\rangle + \frac{1}{2}\|w - w(t)\|_2^2$.
Mirror descent w.r.t. a strongly convex potential $\psi$ replaces the quadratic with a Bregman divergence: $w(t+1) = \arg\min_w \langle w, \eta\nabla_w L(w(t))\rangle + D_\psi(w, w(t))$, where $D_\psi(w, w(t)) = \psi(w) - \psi(w(t)) - \langle \nabla\psi(w(t)), w - w(t)\rangle$. E.g. $\psi(w) = \sum_i w[i]\log w[i]$ gives $D_\psi(w, w') = KL(w, w')$.
The dual updates lie on a low dimensional affine manifold: $\Delta\nabla\psi(w(t)) \in \operatorname{span}(\{x_n\})$. If $\nabla\psi(w(0)) = 0$, then $w(t) \to \arg\min_{Xw=y} \psi(w)$.
Again independent of step size, of stochasticity, and of dual momentum; it also works with affine constraints on $w$. Exponentiated gradient descent thus gives implicit entropic regularization, $\psi(w) = \sum_i w[i]\log w[i]$. [Gunasekar, Lee, Soudry, Srebro, arXiv 2018]
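A small sketch of the entropic case (assumed toy setup, not the paper's code): exponentiated gradient descent on an underdetermined least-squares problem, checking that the dual iterates never leave $\nabla\psi(w(0)) + \operatorname{span}(\{x_n\})$.

import numpy as np

rng = np.random.default_rng(1)
N, D = 5, 40
X = rng.standard_normal((N, D))
y = X @ rng.uniform(0.1, 1.0, D)           # consistent system with a positive solution

w0 = np.full(D, 1.0 / np.e)                # grad psi(w) = 1 + log w vanishes at w = 1/e
w = w0.copy()
eta = 1e-2
for _ in range(50000):
    g = X.T @ (X @ w - y)                  # gradient of (1/2)||Xw - y||^2
    w = w * np.exp(-eta * g)               # mirror step for psi(w) = sum_i w_i log w_i

print(np.linalg.norm(X @ w - y))           # ~0: w interpolates

# dual manifold check: log w - log w0 should lie in span{x_n}, the row space of X
z = np.log(w) - np.log(w0)
P = X.T @ np.linalg.pinv(X.T)              # orthogonal projector onto the row space of X
print(np.linalg.norm(z - P @ z))           # ~0 at every iteration, by construction

By the argument on this slide, the limit is then the minimum-$\psi$ (entropic) interpolant rather than the minimum-$\ell_2$ one.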
35-37 Steepest descent w.r.t. a norm $\|\cdot\|$.
Gradient descent: $\Delta w(t) \propto \arg\min_{v:\|v\|_2 \le 1} \langle v, \nabla_w L(w(t))\rangle$.
Steepest descent w.r.t. a general norm $\|\cdot\|$: $\Delta w(t) \propto \arg\min_{v:\|v\| \le 1} \langle v, \nabla_w L(w(t))\rangle$; e.g. coordinate descent is steepest descent w.r.t. $\|\cdot\| = \|\cdot\|_1$.
Does $w(t) \to \arg\min_{Xw=y} \|w - w(0)\|$? No: even for $w(0) = 0$, in general $w(\infty) \notin \arg\min_{Xw=y} \|w - w(0)\|$. [Gunasekar, Lee, Soudry, Srebro, arXiv 2018]
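A sketch of what this "no" looks like numerically (assumed toy setup): greedy coordinate descent, i.e. steepest descent w.r.t. $\ell_1$, on an underdetermined least-squares problem, compared with the true minimum-$\ell_1$ interpolant computed by a linear program.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
N, D = 10, 30
X = rng.standard_normal((N, D))
y = X @ rng.standard_normal(D)

w = np.zeros(D)
for _ in range(50000):
    g = X.T @ (X @ w - y)                     # full gradient
    i = np.argmax(np.abs(g))                  # steepest coordinate
    w[i] -= g[i] / (X[:, i] @ X[:, i])        # exact line search along coordinate i

# minimum-l1 interpolant via an LP: w = p - q with p, q >= 0, minimize sum(p + q)
res = linprog(np.ones(2 * D), A_eq=np.hstack([X, -X]), b_eq=y,
              bounds=[(0, None)] * (2 * D))
w_l1 = res.x[:D] - res.x[D:]

print(np.abs(w).sum(), np.abs(w_l1).sum())    # coordinate descent typically ends with larger l1 norm

In runs like this the coordinate descent limit interpolates the data but typically has a strictly larger $\ell_1$ norm than the LP solution, and which interpolant it reaches depends on details such as the line search rule: exactly the failure of a clean characterization.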
38 Can we get such results for other problems and other optimization algorithms? How about gradient descent on other problems or different parameterizations?
39-40 Matrix estimation from linear measurements (recap): when $N \ll d^2$ the problem is underdetermined with many trivial global minima, yet gradient descent on the factorized objective $L(U,V) = \|\mathcal{X}(UV^\top) - y\|_2^2$ empirically finds minima that generalize. Question: which global minima does gradient descent reach, and why do they generalize well? [Gunasekar, Woodworth, Bhojanapalli, Neyshabur, Srebro, NIPS 2017]
41 Empirically, gradient descent on $L(U,V)$ converges to a minimum nuclear norm solution. [Gunasekar, Woodworth, Bhojanapalli, Neyshabur, Srebro, NIPS 2017]
42 Conjecture (informal): gradient descent on $L(U,V)$ converges to the minimum nuclear norm solution,
$$W(t) = U(t)V(t)^\top \to \hat{W}_{NN} = \arg\min_{\mathcal{X}(W)=y} \|W\|_\star,$$
when the initialization is close to 0 and the step size is very small (i.e. gradient flow, $\dot{U}_t = \frac{dU_t}{dt} = -\nabla_U L(U_t)$). [Gunasekar, Woodworth, Bhojanapalli, Neyshabur, Srebro, NIPS 2017]
43-44 Commutative measurements: $X_i X_j = X_j X_i$ for all $i, j \in [N]$. Then $W(t) = e^{\mathcal{X}^*(s_t)}\,W(0)\,e^{\mathcal{X}^*(s_t)}$ for some $s_t \in \mathbb{R}^N$; initializing at (a scaling of) the identity is needed to remain on this (non-linear) manifold.
Theorem: Let $U_\infty(\alpha)$ be the solution of gradient flow initialized at $U_0 = \alpha I$. If the measurements $X_n$ commute, i.e. $X_i X_j = X_j X_i$, and if $\bar{W}_\infty = \lim_{\alpha \to 0} U_\infty(\alpha)U_\infty(\alpha)^\top$ exists and satisfies $L(\bar{W}_\infty) = 0$, then $\bar{W}_\infty = \hat{W}_{NN} = \arg\min_{\mathcal{X}(W)=y} \|W\|_\star$.
The conjecture was proved for RIP $\mathcal{X}$ by Li et al. (2018). [Gunasekar, Woodworth, Bhojanapalli, Neyshabur, Srebro, NIPS 2017]
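A numerical sketch of the conjecture (assumed, simplified setup: symmetric factorization $W = UU^\top$, initialization $\alpha I$, plain gradient descent standing in for gradient flow), comparing the nuclear norm of the limit against the minimum-Frobenius-norm interpolant:

import numpy as np

rng = np.random.default_rng(3)
d, N = 15, 100                                 # N << d^2 = 225 unknowns
B = rng.standard_normal((d, 2))
Wstar = B @ B.T                                # PSD, rank-2 ground truth
Wstar /= np.linalg.norm(Wstar)
A = rng.standard_normal((N, d, d))
Xs = (A + np.transpose(A, (0, 2, 1))) / 2      # symmetric measurement matrices
y = np.einsum('nij,ij->n', Xs, Wstar)

def grad_W(W):
    res = np.einsum('nij,ij->n', Xs, W) - y
    return np.einsum('n,nij->ij', 2 * res, Xs), (res ** 2).sum()

U = 1e-4 * np.eye(d)                           # W(0) = alpha^2 I with small alpha
eta = 5e-4
for _ in range(100000):
    G, _ = grad_W(U @ U.T)
    U = U - eta * 2 * G @ U                    # dL/dU = 2 (dL/dW) U, since G is symmetric

W = U @ U.T
nuc = lambda M: np.linalg.svd(M, compute_uv=False).sum()
W_fro = np.linalg.lstsq(Xs.reshape(N, -1), y, rcond=None)[0].reshape(d, d)
print('train loss:', grad_W(W)[1])
print('nuclear norm, GD:', nuc(W), ' min-Frobenius interpolant:', nuc(W_fro))
print('recovery error:', np.linalg.norm(W - Wstar))

In runs like this the gradient descent solution has a markedly smaller nuclear norm than the minimum-Frobenius interpolant and typically recovers $W^\star$, consistent with the conjecture; with a larger $\alpha$ or a large step size the effect degrades.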
45-46 Proof ideas. First, characterize the manifold on which the iterates $w(t)$ lie:
→ $w(0) + \operatorname{span}(\{x_n\})$ for gradient descent;
→ $\nabla\psi(w(t)) \in \nabla\psi(w(0)) + \operatorname{span}(\{x_n\})$ for mirror descent;
→ $W(t) = e^{\mathcal{X}^*(s_t)}\,W(0)\,e^{\mathcal{X}^*(s_t)}$ for matrix factorization with $W(0) = \alpha I$, $\alpha \to 0$, and commuting $X_n$.
Then show that every global minimum on the manifold satisfies the KKT conditions of the corresponding regularized problem (see the worked example below):
→ $\min_{Xw=y} \|w - w(0)\|_2$;
→ $\min_{Xw=y} D_\psi(w, w(0))$;
→ $\min_{\mathcal{X}(W)=y} \|W\|_\star$.
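For the gradient descent case the KKT step is one line; a worked version (standard convex optimality conditions, not copied from the talk):
$$\min_{Xw=y} \tfrac{1}{2}\|w - w(0)\|_2^2 \quad\Longleftrightarrow\quad \exists\,\nu \in \mathbb{R}^N:\; w - w(0) = X^\top\nu \;\text{ and }\; Xw = y.$$
Stationarity says $w - w(0) \in \operatorname{span}(\{x_n\})$, which is exactly the gradient descent manifold; so any interpolating point of that manifold satisfies KKT and, by convexity, is the (unique) minimizer.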
47-49 Losses with a unique finite root: summary.
Robust characterization for general mirror descent with potential $\psi$: $w(t) \to \arg\min_{Xw=y} D_\psi(w, w(0))$.
No useful characterization for generic steepest descent w.r.t. a norm $\|\cdot\|$:
→ even when $\|\cdot\|^2$ is strongly convex;
→ even for $w(0) = 0$.
Fragile characterization for matrix factorization, $W(t) \to \arg\min_{W \succeq 0,\, \mathcal{X}(W)=y} \|W\|_\star$:
→ ONLY for $W(0) \to 0$ and step size $\to 0$;
→ proven only for RIP measurements;
→ and initialization close to 0 is particularly bad for optimization!
What happens with other losses? Very different for logistic regression: there are no finite minima.
50 Implicit bias when the global minimum is unattainable: logistic regression on separable data.
51-53 Gradient descent: logistic regression.
$$\min_w L(w) = \sum_{n=1}^N \log(1 + \exp(-y_n\langle x_n, w\rangle))$$
Gradient descent initialized at $w(0)$: $w(t+1) = w(t) - \eta\,\nabla_w L(w(t))$, with $\nabla_w L(w(t)) = -\sum_{n=1}^N r_n(t)\,x_n$ for suitable weights $r_n(t)$. But on separable data $\|w(t)\| \to \infty$!
The direction still converges, to the maximum-margin (hard-margin SVM) predictor:
$$\frac{w(t)}{\|w(t)\|_2} \to \arg\max_{w:\|w\|_2 \le 1} \min_n y_n\langle w, x_n\rangle,$$
independent of the step size $\eta$ and the initialization $w(0)$. [Soudry, Hoffer, Srebro, ICLR 2018]
54 This holds for any linear classifier $\hat{y}(x) = \langle w, x\rangle$ and any strictly monotone loss $\ell(\hat{y}(x), y)$ with an exponential tail. [Soudry, Hoffer, Srebro, ICLR 2018]
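A small sketch of the phenomenon (assumed toy data, constructed to be separable): the norm diverges while the normalized margin slowly climbs.

import numpy as np

rng = np.random.default_rng(4)
N, D = 40, 5
u = rng.standard_normal(D)
u /= np.linalg.norm(u)
Xd = rng.standard_normal((N, D))
yl = np.sign(Xd @ u)
Xd += 0.5 * yl[:, None] * u                    # shift classes apart: separable with margin >= 0.5

w = rng.standard_normal(D)                     # arbitrary initialization
eta = 0.1
for t in range(1, 10**5 + 1):
    m = yl * (Xd @ w)                          # per-example margins
    p = np.exp(-np.logaddexp(0.0, m))          # 1/(1 + e^m), computed stably
    w += eta * Xd.T @ (yl * p)                 # gradient step on the logistic loss
    if t in (10**2, 10**3, 10**4, 10**5):
        wn = w / np.linalg.norm(w)
        print(t, np.linalg.norm(w), (yl * (Xd @ wn)).min())

The printed norm keeps growing while the normalized margin increases toward its maximum; rerunning with a different eta or w(0) changes the trajectory but not the limiting direction.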
55-57 How fast is the margin maximized?
With a fixed step size $\eta$, the margin converges at rate $O(1/\log t)$: extremely slow! Compare with the $O(1/t)$ convergence of the loss $L(w(t))$.
Can we use a lighter or heavier tail to get faster convergence? No: the exponential tail yields the optimal rate. For losses with tails $\ell(u) = \exp(-u^\nu)$, whether $\nu > 1$ or $\nu < 1$, the margin converges strictly more slowly (still only polylogarithmically in $t$), and for polynomially tailed losses the direction does not converge to the max-margin solution at all. [Soudry, Hoffer, Srebro, 2018]
Any other way to get faster convergence? Yes: the aggressive step size $\eta_t \propto 1/L(w(t))$ yields margin convergence at rate $\tilde{O}(1/\sqrt{t})$, comparable to the best hard-margin SVM algorithms. [Shpigel Nacson, Lee, Gunasekar, Srebro, Soudry, arXiv 2018]
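A sketch comparing the two step-size rules on the same separable data (assumed toy setup; the constants are arbitrary):

import numpy as np

def final_margin(Xd, yl, steps, aggressive):
    w = np.zeros(Xd.shape[1])
    for _ in range(steps):
        m = yl * (Xd @ w)
        p = np.exp(-np.logaddexp(0.0, m))      # 1/(1 + e^m)
        g = Xd.T @ (yl * p)                    # minus the gradient of the logistic loss
        L = np.logaddexp(0.0, -m).sum()        # current loss value
        w += (1.0 / L if aggressive else 0.1) * g
    wn = w / np.linalg.norm(w)
    return (yl * (Xd @ wn)).min()              # attained normalized margin

rng = np.random.default_rng(5)
u = rng.standard_normal(3); u /= np.linalg.norm(u)
Xd = rng.standard_normal((30, 3))
yl = np.sign(Xd @ u)
Xd += 0.5 * yl[:, None] * u                    # separable data, as before

for steps in (10**3, 10**4, 10**5):
    print(steps, final_margin(Xd, yl, steps, False), final_margin(Xd, yl, steps, True))

In runs like this the $\eta_t \propto 1/L(w(t))$ rule reaches a given margin in far fewer iterations than the fixed step size, in line with the $\tilde{O}(1/\sqrt{t})$ vs. $O(1/\log t)$ rates above.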
58 Implicit bias for strictly monotone losses with exponential tails: can we get a more robust characterization than for regression-type losses?
59-61 Steepest descent w.r.t. a norm $\|\cdot\|$: the iterates now converge in direction to the corresponding maximum-margin solution (see the summary on slide 64),
→ independent of initialization,
→ for small enough step size $\eta$.
Compare with the squared loss, where no such characterization held. [Gunasekar, Lee, Soudry, Srebro, arXiv 2018]
62-63 Matrix estimation from linear measurements, again with gradient descent on the factorization (e.g. matrix completion, linear neural networks; $N \ll d^2$, so the problem is underdetermined).
Theorem: Let $W(t) = U(t)U(t)^\top$. For any full rank $W(0)$ and any step size such that $\{L(W(t))\}_t$ is strictly decreasing, if the limits of $\frac{W(t)}{\|W(t)\|}$ and $\frac{\nabla L(W(t))}{\|\nabla L(W(t))\|}$ exist, then
$$\frac{W(t)}{\|W(t)\|} \to \arg\max_{\|W\|_\star \le 1} \min_n y_n\langle X_n, W\rangle.$$
[Gunasekar, Lee, Soudry, Srebro, arXiv 2018]
64 Strictly monotone losses: summary.
Gradient descent: $\frac{w(t)}{\|w(t)\|_2} \to \arg\max_{\|w\|_2 \le 1} \min_n y_n\langle x_n, w\rangle$
→ independent of initialization;
→ any step size leading to a descent algorithm.
Steepest descent w.r.t. a norm $\|\cdot\|$: $\frac{w(t)}{\|w(t)\|} \to \arg\max_{\|w\| \le 1} \min_n y_n\langle x_n, w\rangle$
→ independent of initialization;
→ any step size leading to a descent algorithm.
Matrix factorization: $\frac{W(t)}{\|W(t)\|_\star} \to \arg\max_{\|W\|_\star \le 1} \min_n y_n\langle X_n, W\rangle$
→ independent of initialization;
→ any step size leading to a descent algorithm.
65 Simplicity from asymptotics.
Squared loss: $w(\infty)$ depends on the initialization $w(0)$ and the step size $\eta$; we may need to take $\eta \to 0$ and $w(0) \to 0$ to get a characterization in terms of the gradient manifold.
Exponential loss: the limit direction $\frac{w(t)}{\|w(t)\|}$ does NOT depend on the initialization $w(0)$ or the step size $\eta$. What happens at the beginning doesn't affect the asymptotic behavior as $\|w(t)\| \to \infty$: the limit direction is dominated by the accumulated updates, and hence by the gradient manifold.
66 Different optimization algorithms, different implicit bias, different generalization. The role of optimization in ML extends beyond reaching some global minimum: implicit regularization plays a crucial role in the generalization of over-parameterized models, so understanding the specific global minimum reached by an algorithm is important!