Modern Optimization Techniques
2. Unconstrained Optimization / 2.2. Stochastic Gradient Descent

Lars Schmidt-Thieme
Information Systems and Machine Learning Lab (ISMLL)
Institute of Computer Science, University of Hildesheim, Germany

Original slides by Lucas Rego Drumond (ISMLL)

Syllabus

Tue. 18.10. (0)  0. Overview

1. Theory
Tue. 25.10. (1)  1. Convex Sets and Functions

2. Unconstrained Optimization
Tue. 1.11.  (2)  2.1 Gradient Descent
Tue. 8.11.  (3)  2.2 Stochastic Gradient Descent
Tue. 15.11. (4)  2.3 Newton's Method
Tue. 22.11. (5)  2.3b Newton's Method (Part 2)
Tue. 29.11. (6)  2.4 Subgradient Methods
Tue. 6.12.  (7)  2.5 Coordinate Descent

3. Equality Constrained Optimization
Tue. 13.12. (8)  3.1 Duality
Tue. 20.12. (9)  3.2 Methods

Christmas Break

4. Inequality Constrained Optimization
Tue. 10.1.  (10) 4.1 Interior Point Methods
Tue. 17.1.  (11) 4.2 Cutting Plane Method

5. Distributed Optimization
Tue. 24.1.  (11) 5.1 Alternating Direction Method of Multipliers
Tue. 31.1.  (12) Questions and Answers

Outline

1. Stochastic Gradient Descent (SGD)
2. More on Line Search
3. Example: SGD for Linear Regression
4. Stochastic Gradient Descent in Practice

1. Stochastic Gradient Descent (SGD)

Unconstrained Convex Optimization

$$\arg\min_{x \in \operatorname{dom} f} f(x)$$

- $\operatorname{dom} f \subseteq \mathbb{R}^N$ is open (unconstrained optimization)
- $f$ is convex

Stochastic Gradient

Gradient Descent makes use of the gradient $\nabla f(x)$. Stochastic Gradient Descent makes use of a stochastic gradient only:
$$g(x) \sim p(g \in \mathbb{R}^N \mid x), \qquad E_p(g(x)) = \nabla f(x)$$
i.e., for each point $x \in \mathbb{R}^N$, $g(x)$ is a random variable over $\mathbb{R}^N$ with distribution $p$ (conditional on $x$) that on average yields the gradient (at each point).

Stochastic Gradient / Example: Big Sums

$f$ is a "big sum":
$$f(x) = \frac{1}{C} \sum_{c=1}^{C} f_c(x), \qquad f_c \text{ convex}, \; c = 1, \dots, C$$
$g$ is the gradient of a random summand:
$$p(g \mid x) := \operatorname{Unif}(\{\nabla f_c(x) \mid c = 1, \dots, C\})$$

Stochastic Gradient / Example: Least Squares

$$\min_{x \in \mathbb{R}^N} f(x) := \|Ax - b\|_2^2$$

- will find a solution of $Ax = b$ if there is any (then $\|Ax - b\|^2 = 0$)
- otherwise will find the $x$ where the difference $Ax - b$ between the left and right side is as small as possible (in the squared L2 norm)
- is a big sum:
$$f(x) := \|Ax - b\|_2^2 = \sum_{m=1}^{M} ((Ax)_m - b_m)^2 = \frac{1}{M} \sum_{m=1}^{M} f_m(x), \qquad f_m(x) := M ((Ax)_m - b_m)^2$$
- stochastic gradient $g$: gradient of a random component $m$
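As a quick sanity check (not from the original slides): a minimal Python sketch of this stochastic gradient, with a numerical check that it averages to the exact gradient. The function name stochastic_gradient and the random data are our own illustration.

import numpy as np

def stochastic_gradient(A, b, x, rng):
    # gradient of a random summand: with f_m(x) = M ((Ax)_m - b_m)^2,
    # grad f_m(x) = 2 M ((Ax)_m - b_m) a_m, where a_m is row m of A
    M = A.shape[0]
    m = rng.integers(M)  # m ~ Unif({1, ..., M})
    return 2 * M * (A[m] @ x - b[m]) * A[m]

rng = np.random.default_rng(0)
A, b = rng.normal(size=(50, 5)), rng.normal(size=50)
x = rng.normal(size=5)

exact = 2 * A.T @ (A @ x - b)  # grad f(x) = 2 A^T (Ax - b)
approx = np.mean([stochastic_gradient(A, b, x, rng) for _ in range(100_000)], axis=0)
print(np.linalg.norm(exact - approx) / np.linalg.norm(exact))  # small; E(g(x)) = grad f(x)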

Stochastic Gradient / Example: Supervised Learning (1/2)

$$\min_{\theta \in \mathbb{R}^P} f(\theta) := \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, \hat{y}(x_n, \theta)) + \lambda \|\theta\|_2^2$$

where
- $(x_n, y_n) \in \mathbb{R}^M \times \mathbb{R}^T$ are $N$ training samples,
- $\hat{y}$ is a parametrized model, e.g., logistic regression $\hat{y}(x; \theta) := (1 + e^{-\theta^T x})^{-1}$, with $P := M$, $T := 1$,
- $\ell$ is a loss, e.g., the negative binomial log-likelihood $\ell(y, \hat{y}) := -y \log \hat{y} - (1 - y) \log(1 - \hat{y})$,
- $\lambda \in \mathbb{R}_0^+$ is the regularization weight.

It will find the parametrization with the best trade-off between low loss and low model complexity.

Stochastic Gradient / Example: Supervised Learning (2/2)

$$\min_{\theta \in \mathbb{R}^P} f(\theta) := \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, \hat{y}(x_n, \theta)) + \lambda \|\theta\|_2^2$$

where $(x_n, y_n) \in \mathbb{R}^M \times \mathbb{R}^T$ are $N$ training samples, ...

is a big sum:
$$f(\theta) := \frac{1}{N} \sum_{n=1}^{N} f_n(\theta), \qquad f_n(\theta) := \ell(y_n, \hat{y}(x_n, \theta)) + \lambda \|\theta\|_2^2$$

stochastic gradient $g$: gradient for a random sample $n$
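A hedged Python sketch of such a per-sample stochastic gradient for the logistic regression instance above; the closed form $(\hat{y} - y)x + 2\lambda\theta$ follows from the slide's loss, but the function itself is our illustration, not part of the slides.

import numpy as np

def sample_gradient(theta, x_n, y_n, lam):
    # f_n(theta) = -y log(yhat) - (1 - y) log(1 - yhat) + lam ||theta||^2
    # with yhat = sigmoid(theta^T x); the gradient is (yhat - y) x + 2 lam theta
    y_hat = 1.0 / (1.0 + np.exp(-theta @ x_n))
    return (y_hat - y_n) * x_n + 2 * lam * theta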

Stochastic Gradient Descent

The very same as Gradient Descent, but in each step use the stochastic gradient $g(x)$ instead of the exact gradient $\nabla f(x)$:

1 min-sgd(f, p, x^(0), µ, K):
2   for k := 1, ..., K:
3     draw g^(k-1) ~ p(g | x^(k-1))
4     Δx^(k-1) := -g^(k-1)
5     µ^(k-1) := µ(f, x^(k-1), Δx^(k-1))
6     x^(k) := x^(k-1) + µ^(k-1) Δx^(k-1)
7   return x^(K)

where
- f: objective function
- p: (distribution of the) stochastic gradient of f
- x^(0): starting value
- µ: step length controller
- K: number of iterations
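A direct Python transcription of min-sgd, under the simplifying assumption that the step length controller µ is a plain schedule step(k) (on the slide it may also inspect f and the direction):

import numpy as np

def min_sgd(draw_g, x0, step, K, rng):
    x = np.asarray(x0, dtype=float)
    for k in range(1, K + 1):
        g = draw_g(x, rng)    # draw g^(k-1) ~ p(g | x^(k-1))
        dx = -g               # descent direction Δx^(k-1)
        x = x + step(k) * dx  # x^(k) := x^(k-1) + µ^(k-1) Δx^(k-1)
    return x

For the least squares example above, min_sgd(lambda x, r: stochastic_gradient(A, b, x, r), np.zeros(5), lambda k: 1e-3, 5000, rng) runs K = 5000 stochastic steps with a fixed step size.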

Stochastic Gradient Descent / For Big Sums

1 min-sgd((f_c)_{c=1,...,C}, (∇f_c)_{c=1,...,C}, x^(0), µ, K):
2   for k := 1, ..., K:
3     draw c^(k-1) ~ Unif(1, ..., C)
4     g^(k-1) := ∇f_{c^(k-1)}(x^(k-1))
5     Δx^(k-1) := -g^(k-1)
6     µ^(k-1) := µ(f, x^(k-1), Δx^(k-1))
7     x^(k) := x^(k-1) + µ^(k-1) Δx^(k-1)
8   return x^(K)

where
- (f_c)_{c=1,...,C}: objective function summands, f := (1/C) Σ_{c=1}^{C} f_c
- (∇f_c)_{c=1,...,C}: gradients of the objective function summands
- x^(0): starting value
- µ: step length controller
- K: number of iterations

SGD / For Big Sums / Epochs

1  min-sgd((f_c)_{c=1,...,C}, (∇f_c)_{c=1,...,C}, x^(0), µ, K):
2    C := (1, 2, ..., C)
3    x^(0,C) := x^(0)
4    for k := 1, ..., K:
5      randomly shuffle C
6      x^(k,0) := x^(k-1,C)
7      for i = 1, ..., C:
8        g^(k,i-1) := ∇f_{c_i}(x^(k,i-1))
9        Δx^(k,i-1) := -g^(k,i-1)
10       µ^(k,i-1) := µ(f, x^(k,i-1), Δx^(k,i-1))
11       x^(k,i) := x^(k,i-1) + µ^(k,i-1) Δx^(k,i-1)
12   return x^(K,C)

where
- (f_c)_{c=1,...,C}: objective function summands, f := (1/C) Σ_{c=1}^{C} f_c
- ...
- K: number of epochs
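The epoch variant changes only how summands are drawn; a sketch continuing the previous one, where grads is assumed to be a list of the per-summand gradient functions ∇f_c:

def min_sgd_epochs(grads, x0, step, K, rng):
    x = np.asarray(x0, dtype=float)
    for k in range(1, K + 1):                  # K epochs
        for c in rng.permutation(len(grads)):  # randomly shuffled pass over all summands
            x = x - step(k) * grads[c](x)      # one inner update per summand
    return x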

Convergence of SGD

Theorem (Convergence of SGD). If
(i) $f$ is strongly convex ($\nabla^2 f(x) \succeq mI$, $m \in \mathbb{R}^+$),
(ii) the expected squared norm of its stochastic gradient $g$ is uniformly bounded ($\exists G \in \mathbb{R}_0^+ \; \forall x : E(\|g(x)\|^2) \le G^2$), and
(iii) the step length $\mu^{(k)} := \frac{1}{m(k+1)}$ is used,
then
$$E_p(\|x^{(k)} - x^*\|^2) \le \frac{1}{k+1} \max\Big\{\|x^{(0)} - x^*\|^2, \frac{G^2}{m^2}\Big\}$$

Convergence of SGD / Proof

Strong convexity (i) gives, for any $x$:
$$f(x^*) \ge f(x) + \nabla f(x)^T (x^* - x) + \frac{m}{2}\|x^* - x\|^2$$
$$f(x) \ge f(x^*) + \nabla f(x^*)^T (x - x^*) + \frac{m}{2}\|x - x^*\|^2 = f(x^*) + \frac{m}{2}\|x - x^*\|^2$$
(the last equality uses $\nabla f(x^*) = 0$). Summing both yields
$$0 \ge \nabla f(x)^T (x^* - x) + m \|x^* - x\|^2, \quad \text{i.e.,} \quad \nabla f(x)^T (x - x^*) \ge m \|x - x^*\|^2 \tag{1}$$

Then, using $E(g^{(k-1)} \mid x^{(k-1)}) = \nabla f(x^{(k-1)})$:
$$\begin{aligned}
E(\|x^{(k)} - x^*\|^2) &= E(\|x^{(k-1)} - \mu^{(k-1)} g^{(k-1)} - x^*\|^2)\\
&= E(\|x^{(k-1)} - x^*\|^2) - 2\mu^{(k-1)} E\big((g^{(k-1)})^T (x^{(k-1)} - x^*)\big) + (\mu^{(k-1)})^2 E(\|g^{(k-1)}\|^2)\\
&= E(\|x^{(k-1)} - x^*\|^2) - 2\mu^{(k-1)} E\big(\nabla f(x^{(k-1)})^T (x^{(k-1)} - x^*)\big) + (\mu^{(k-1)})^2 E(\|g^{(k-1)}\|^2)\\
&\overset{(ii),(1)}{\le} E(\|x^{(k-1)} - x^*\|^2) - 2\mu^{(k-1)} m\, E(\|x^{(k-1)} - x^*\|^2) + (\mu^{(k-1)})^2 G^2\\
&= (1 - 2\mu^{(k-1)} m)\, E(\|x^{(k-1)} - x^*\|^2) + (\mu^{(k-1)})^2 G^2
\end{aligned} \tag{2}$$

Convergence of SGD / Proof (2/2)

Induction over $k$, with $L := \max\{\|x^{(0)} - x^*\|^2, \frac{G^2}{m^2}\}$:

$k = 0$:
$$\|x^{(0)} - x^*\|^2 \le \frac{1}{1} L$$

$k > 0$:
$$\begin{aligned}
E(\|x^{(k)} - x^*\|^2) &\overset{(2)}{\le} (1 - 2\mu^{(k-1)} m)\, E(\|x^{(k-1)} - x^*\|^2) + (\mu^{(k-1)})^2 G^2\\
&\overset{(iii)}{=} \Big(1 - \frac{2}{k}\Big) E(\|x^{(k-1)} - x^*\|^2) + \frac{G^2}{m^2 k^2}\\
&\overset{\text{ind. hyp.}}{\le} \Big(1 - \frac{2}{k}\Big) \frac{1}{k} L + \frac{G^2}{m^2 k^2}\\
&\overset{\text{def. } L}{\le} \Big(1 - \frac{2}{k}\Big) \frac{1}{k} L + \frac{1}{k^2} L = \frac{k-1}{k^2} L \le \frac{1}{k+1} L
\end{aligned}$$
(the last step uses $(k-1)(k+1) = k^2 - 1 \le k^2$).

2. More on Line Search

Choosing the Step Size for SGD

The step size µ is a crucial parameter to be tuned. Given the low cost of the SGD update, using line search for the step size is a bad choice: each evaluation of $f$ in a line search requires a full pass over all summands, which dwarfs the cost of the update itself. Possible alternatives:
- Fixed step size
- Armijo principle
- Bold driver
- AdaGrad

Example: Body Fat Prediction

We want to estimate the percentage of body fat based on various attributes:
- Age (years)
- Weight (lbs)
- Height (inches)
- Neck circumference (cm)
- Chest circumference (cm)
- Abdomen 2 circumference (cm)
- Hip circumference (cm)
- Thigh circumference (cm)
- Knee circumference (cm)
- ...

http://lib.stat.cmu.edu/datasets/bodyfat

Example: Body Fat Prediction

The data is represented as:
$$A = \begin{pmatrix} 1 & a_{1,1} & a_{1,2} & \dots & a_{1,M}\\ 1 & a_{2,1} & a_{2,2} & \dots & a_{2,M}\\ \vdots & & & & \vdots\\ 1 & a_{N,1} & a_{N,2} & \dots & a_{N,M} \end{pmatrix}, \qquad y = \begin{pmatrix} y_1\\ y_2\\ \vdots\\ y_N \end{pmatrix}$$
with $N = 252$, $M = 14$.

We can model the percentage of body fat $y$ as a linear combination of the body measurements with parameters $x$:
$$\hat{y}_n = x^T a_n = x_0 \cdot 1 + x_1 a_{n,1} + x_2 a_{n,2} + \dots + x_M a_{n,M}$$
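A small Python sketch of assembling A and y in this layout; the file name and column order below are assumptions for illustration, not something the slides specify:

import numpy as np

raw = np.loadtxt("bodyfat.dat")             # hypothetical local copy of the dataset
y = raw[:, 0]                               # assumed: target (body fat %) in column 0
A = np.hstack([np.ones((raw.shape[0], 1)),  # intercept column of ones
               raw[:, 1:]])                 # assumed: the 14 measurements
print(A.shape)                              # expected: (252, 15)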

SGD - Fixed Step Size on the Body Fat Dataset

[Figure: MSE vs. iterations (0-500) for SGD with fixed step sizes 0.0001, 0.001, 0.01, and 0.1]

Bold Driver Heuristic

The bold driver heuristic makes the assumption that smaller step sizes are needed when closer to the optimum. It adjusts the step size based on the value of $f(x^{(k)}) - f(x^{(k-1)})$:
- If the value of $f(x)$ grows, the step size must decrease.
- If the value of $f(x)$ decreases, the step size can be larger, for faster convergence.

Bold Driver Heuristic - Update Rule

We need to define an increase factor $\mu^+ > 1$, e.g., $\mu^+ := 1.05$, and a decay factor $\mu^- \in (0, 1)$, e.g., $\mu^- := 0.5$.

Step size update rule (adapt the step size only once after each epoch, not for every inner iteration):
- Cycle through the whole data and update the parameters.
- Evaluate the objective function $f(x^{(k)})$.
- If $f(x^{(k)}) < f(x^{(k-1)})$, then $\mu \leftarrow \mu^+ \mu$;
- else, i.e., if $f(x^{(k)}) \ge f(x^{(k-1)})$, then $\mu \leftarrow \mu^- \mu$.

Different from the bold driver heuristic for batch gradient descent, there is no way to evaluate $f(x + \mu \Delta x)$ for several candidate $\mu$: the step size $\mu$ is adapted once, after the step has been done.

Bold Driver

1 stepsize-bd(µ, f_new, f_old, µ+, µ-):
2   if f_new < f_old:
3     µ := µ+ · µ
4   else:
5     µ := µ- · µ
6   return µ

where
- µ: step size of the last update
- f_new, f_old = f(x^(k)), f(x^(k-1)): function values after and before the last update, respectively
- µ+, µ-: step size increase and decay factors
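In Python the whole controller is a single conditional; a sketch with the slide's example factors as defaults:

def stepsize_bd(mu, f_new, f_old, mu_inc=1.05, mu_dec=0.5):
    # grow the step size after an improving epoch, shrink it otherwise
    return mu_inc * mu if f_new < f_old else mu_dec * mu

It is meant to be called once per epoch, after $f(x^{(k)})$ has been evaluated on the full data.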

Considerations

- Works well for a range of problems.
- The initial µ just needs to be large enough.
- µ+ and µ- have to be adjusted to the problem at hand.
- May lead to faster convergence.

AdaGrad

AdaGrad adjusts the step size individually for each variable to be optimized:
- It uses information about the past gradients.
- Leads to faster convergence.
- Less sensitive to the choice of the step size.

AdaGrad - Update Rule

We have
$$f(x) = \sum_{m=1}^{M} f_m(x)$$

Update rule (the step size is updated at every inner iteration):
- Pick a random instance $m \sim \operatorname{Uniform}(1, \dots, M)$.
- Compute the gradient $\nabla_x f_m(x)$.
- Update the gradient history $h := h + \nabla_x f_m(x) \odot \nabla_x f_m(x)$.
- The step size for variable $x_n$ is $\mu_n := \frac{\mu_0}{\sqrt{h_n}}$.
- Update $x^{\text{next}} := x - \mu \odot \nabla_x f_m(x)$, i.e., $x_n^{\text{next}} := x_n - \frac{\mu_0}{\sqrt{h_n}} (\nabla_x f_m(x))_n$.

Here $\odot$ denotes the elementwise product.

AdaGrad

1 stepsize-adagrad(g, h, µ0):
2   h := h + g ⊙ g
3   µ_n := µ0 / √(h_n)  for n = 1, ..., N
4   return (µ, h)

Returns a vector of step sizes, one for each variable, where
- g = ∇f_m(x): current (stochastic) gradient
- h: past gradient size history
- µ0: initial step size
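A Python sketch of stepsize-adagrad; the small eps guard against division by zero is our addition (the pseudocode implicitly assumes h_n > 0):

import numpy as np

def stepsize_adagrad(g, h, mu0, eps=1e-8):
    h = h + g * g                # h := h + g ⊙ g
    mu = mu0 / np.sqrt(h + eps)  # µ_n := µ0 / sqrt(h_n), one step size per variable
    return mu, h

# one SGD update with AdaGrad step sizes:
#   mu, h = stepsize_adagrad(g, h, mu0)
#   x = x - mu * g               # elementwise product µ ⊙ g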

AdaGrad Step Size

[Figure: MSE vs. iterations (0-500) for AdaGrad with initial step sizes 0.001, 0.01, 0.1, and 1]

AdaGrad vs. Fixed Step Size

[Figure: MSE vs. iterations (0-500) comparing AdaGrad with a fixed step size]

3. Example: SGD for Linear Regression

Practical Example: Household Spending

Suppose we have the following data about different households:
- Number of workers in the household (a_1)
- Household composition (a_2)
- Region (a_3)
- Gross normal weekly household income (a_4)
- Weekly household spending (y)

We want to create a model of the weekly household spending.

Practical Example: Household Spending

If we have data about $m$ households, we can represent it as:
$$A_{m,n} = \begin{pmatrix} 1 & a_{1,2} & \dots & a_{1,n}\\ 1 & a_{2,2} & \dots & a_{2,n}\\ \vdots & & & \vdots\\ 1 & a_{m,2} & \dots & a_{m,n} \end{pmatrix}, \qquad y = \begin{pmatrix} y_1\\ y_2\\ \vdots\\ y_m \end{pmatrix}$$

We can model the household spending as a linear combination of the household features with parameters $x$:
$$\hat{y}_i = x^T a_i = x_0 \cdot 1 + x_1 a_{i,1} + x_2 a_{i,2} + x_3 a_{i,3} + x_4 a_{i,4}$$

Least Squares Problem Revisited

The least squares problem
$$\text{minimize} \quad \|Ax - y\|_2^2$$
can be rewritten as
$$\text{minimize} \quad \sum_{i=1}^{m} (x^T a_i - y_i)^2$$
with
$$A_{m,n} = \begin{pmatrix} 1 & a_{1,1} & a_{1,2} & a_{1,3} & a_{1,4}\\ 1 & a_{2,1} & a_{2,2} & a_{2,3} & a_{2,4}\\ \vdots & & & & \vdots\\ 1 & a_{m,1} & a_{m,2} & a_{m,3} & a_{m,4} \end{pmatrix}, \qquad y = \begin{pmatrix} y_1\\ y_2\\ \vdots\\ y_m \end{pmatrix}$$

The Gradient Descent Update Rule

For the problem
$$\text{minimize} \quad \sum_{i=1}^{m} (x^T a_i - y_i)^2$$
the gradient $\nabla f(x)$ of the objective function is:
$$\nabla_x f(x) = 2 \sum_{i=1}^{m} (x^T a_i - y_i) a_i$$
The Gradient Descent update rule is then:
$$x \leftarrow x - \mu \left( 2 \sum_{i=1}^{m} (x^T a_i - y_i) a_i \right)$$
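Vectorized, the full-batch update is a one-liner, since $2\sum_{i=1}^{m} (x^T a_i - y_i) a_i = 2 A^T (Ax - y)$; a Python sketch:

def gd_step(x, A, y, mu):
    # one full gradient descent step; touches all m rows of the data
    return x - mu * 2 * A.T @ (A @ x - y)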

The Gradient Descent Update Rule

We need to see all the data before updating $x$:
$$x \leftarrow x - \mu \left( 2 \sum_{i=1}^{m} (x^T a_i - y_i) a_i \right)$$
Can we make any progress before iterating over all the data?

Decomposing the Objective Function

The objective function
$$f(x) = \sum_{i=1}^{m} (x^T a_i - y_i)^2$$
can be expressed as a sum of the objectives on each data point $(a_i, y_i)$:
$$f_i(x) = (x^T a_i - y_i)^2$$
so that
$$f(x) = \sum_{i=1}^{m} f_i(x)$$

A Simpler Update Rule

Now that we have
$$f(x) = \sum_{i=1}^{m} f_i(x)$$
we can define the following update rule:
- Pick a random instance $i \sim \operatorname{Uniform}(1, \dots, m)$.
- Update $x \leftarrow x - \mu\, \nabla_x f_i(x)$.

Stochastic Gradient Descent (SGD)

1: procedure StochasticGradientDescent    input: f, µ
2:   Get initial point x
3:   repeat
4:     for i ← 1, ..., m do
5:       x ← x − µ ∇f_i(x)
6:     end for
7:   until convergence
8:   return x, f(x)
9: end procedure

SGD and Least Squares

We have
$$f(x) = \sum_{i=1}^{m} f_i(x), \qquad f_i(x) = (x^T a_i - y_i)^2, \qquad \nabla_x f_i(x) = 2 (x^T a_i - y_i) a_i$$
The update rule is
$$x \leftarrow x - \mu \left( 2 (x^T a_i - y_i) a_i \right)$$
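The matching single-row Python step (a sketch; the epoch loop with rng is our framing, mirroring the shuffled pass from Section 1):

def sgd_step(x, A, y, i, mu):
    # stochastic step on row i only: grad f_i(x) = 2 (x^T a_i - y_i) a_i
    return x - mu * 2 * (A[i] @ x - y[i]) * A[i]

# one epoch over shuffled rows:
#   for i in rng.permutation(A.shape[0]):
#       x = sgd_step(x, A, y, i, mu)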

SGD vs. GD

1: procedure SGD    input: f, µ
2:   Get initial point x
3:   repeat
4:     for i ← 1, ..., m do
5:       x ← x − µ ∇f_i(x)
6:     end for
7:   until convergence
8:   return x, f(x)
9: end procedure

1: procedure GradientDescent    input: f
2:   Get initial point x
3:   repeat
4:     Get step size µ
5:     x := x − µ ∇f(x)
6:   until convergence
7:   return x, f(x)
8: end procedure

SGD vs. GD - Least Squares

1: procedure SGD    input: f, µ
2:   Get initial point x
3:   repeat
4:     for i ← 1, ..., m do
5:       x ← x − µ (2 (x^T a_i − y_i) a_i)
6:     end for
7:   until convergence
8:   return x, f(x)
9: end procedure

1: procedure GD    input: f
2:   Get initial point x
3:   repeat
4:     Get step size µ
5:     x ← x − µ (2 Σ_{i=1}^{m} (x^T a_i − y_i) a_i)
6:   until convergence
7:   return x, f(x)
8: end procedure

4. Stochastic Gradient Descent in Practice

GD Step Size

[Figure: MSE vs. iterations (0-500) for GD with step sizes 0.001, 0.01, 0.1, 0.5, and 1]

SGD vs. GD - Body Fat Dataset

[Figure: MSE vs. iterations (0-1000) comparing SGD and GD on the Body Fat dataset]

Year Prediction Data Set

Least squares problem: prediction of the release year of a song from audio features.
- 90 features
- Experiments done on a subset of 1000 instances of the data

GD Step Size - Year Prediction

[Figure: MSE vs. iterations (0-1000) for GD with step sizes 1e-9, 1e-8, 5e-8, and 1e-7]

SGD Step Size - Year Prediction

[Figure: MSE vs. iterations (0-1000) for SGD with step sizes 1e-10, 1e-9, and 1e-8]

AdaGrad Step Size - Year Prediction

[Figure: MSE vs. iterations (0-500) for AdaGrad with initial step sizes 0.001, 0.01, 0.1, 0.5, and 1]

AdaGrad vs. SGD vs. GD - Year Prediction

[Figure: MSE vs. iterations (0-1000) comparing AdaGrad, GD, and SGD on the Year Prediction data]

Further Readings

- SGD is not covered in Boyd and Vandenberghe [2004].
- Leon Bottou, Frank E. Curtis, Jorge Nocedal (2016): Stochastic Gradient Methods for Large-Scale Machine Learning, ICML 2016 Tutorial, http://users.iems.northwestern.edu/~nocedal/icml
- Francis Bach (2013): Stochastic gradient methods for machine learning, Microsoft Machine Learning Summit 2013, http://research.microsoft.com/en-us/um/cambridge/events/mls2013/downloads/stochastic_gradient.pdf
- For the convergence proof: Ji Liu (2014), Notes on Stochastic Gradient Descent, http://www.cs.rochester.edu/~jliu/csc-576-2014fall.html

References

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.