Gradient Descent. Mohammad Emtiyaz Khan EPFL Sep 22, 2015


Learning/estimation/fitting

Given a cost function L(β), we wish to find the β that minimizes the cost:

    min_β L(β),  subject to β ∈ R^(D+1).

This is learning posed as an optimization problem. We will use an algorithm to solve the problem.

Grid search

Grid search is one of the simplest algorithms: we compute the cost over a grid (of, say, M points) and pick the minimum. It is extremely simple and works for any kind of loss when we have very few parameters and the loss is easy to compute. For a large number of parameters, however, grid search requires one nested for-loop per parameter, so its computational complexity grows exponentially in the number of parameters. Choosing a good range of values is another problem. Are there any other issues?
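As a concrete sketch of the idea (in Python rather than the course's Matlab, with made-up data), grid search over a single parameter β_0 for the MSE cost looks like this:

```python
import numpy as np

# Toy 1-parameter model: predict every y_n by a single constant beta0.
y = np.array([2.0, 1.0, 1.5])

# Grid of M = 201 candidate values for beta0, spaced 0.1 apart.
grid = np.linspace(-10, 10, 201)

# MSE cost (normalized by 2N) evaluated at each candidate.
costs = np.array([np.sum((y - b) ** 2) / (2 * len(y)) for b in grid])

# Pick the grid point with the smallest cost.
beta0_star = grid[np.argmin(costs)]
print(beta0_star)  # close to mean(y) = 1.5
```

With D parameters and M grid points per parameter, this loop becomes M^D cost evaluations, which is the exponential blow-up mentioned above.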

Follow the gradient

A gradient (at a point) is the slope of the tangent (at that point). It points in the direction of the largest increase of the function.

[Figure: surface plots of the MSE and MAE costs for a 2-parameter model over β_0 and β_1. I used y^T = [2, 1, 1.5] and x^T = [−1, 1, 1].]

Batch gradient descent

To minimize the function, take a step in the direction opposite to the gradient:

    β^(k+1) = β^(k) − α ∂L(β^(k))/∂β,

where α > 0 is the step-size (or learning rate). Gradient descent for the 1-parameter model to minimize the MSE:

    β_0^(k+1) = (1 − α) β_0^(k) + α ȳ,  where ȳ = (1/N) Σ_n y_n.

When is this sequence guaranteed to converge?
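To see the 1-parameter update in action, here is a small sketch (Python, with made-up data). Since the update is a contraction toward ȳ with factor |1 − α|, the iterates converge to the mean of y whenever 0 < α < 2:

```python
import numpy as np

y = np.array([2.0, 1.0, 1.5])   # toy data
y_bar = y.mean()

alpha = 0.5    # step-size; the iteration contracts whenever 0 < alpha < 2
beta0 = 0.0    # starting point

for k in range(50):
    # The 1-parameter MSE update: (1 - alpha) * beta0 + alpha * mean(y)
    beta0 = (1 - alpha) * beta0 + alpha * y_bar

print(beta0)  # converges to mean(y) = 1.5
```

Trying alpha = 2.1 instead makes |1 − α| > 1, and the iterates diverge, which answers the convergence question for this special case.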

Gradients for MSE

Define the output vector and the data matrix,

    y = [y_1, y_2, ..., y_N]^T,

    X = [ 1  x_11  x_12  ...  x_1D
          1  x_21  x_22  ...  x_2D
          ...
          1  x_N1  x_N2  ...  x_ND ].          (1)

We define the error vector e:

    e = y − Xβ,                                (2)

and the MSE as follows:

    L(β) = (1/2N) Σ_{n=1}^N (y_n − x_n^T β)^2  (3)
         = (1/2N) e^T e.                       (4)

Then the gradient is given by

    ∂L/∂β = −(1/N) X^T e.                      (5)

What is the computational complexity?
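The formula ∂L/∂β = −(1/N) X^T e can be checked numerically. This sketch (Python, with random made-up data) compares it against a central finite-difference approximation of the MSE:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 20, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, D))])  # first column of ones
y = rng.normal(size=N)
beta = rng.normal(size=D + 1)

def mse(b):
    e = y - X @ b
    return e @ e / (2 * N)

# Analytic gradient: -(1/N) X^T e
e = y - X @ beta
grad = -X.T @ e / N

# Central finite differences, one coordinate at a time
eps = 1e-6
I = np.eye(D + 1)
fd = np.array([(mse(beta + eps * I[j]) - mse(beta - eps * I[j])) / (2 * eps)
               for j in range(D + 1)])

print(np.max(np.abs(grad - fd)))  # tiny, so the formula matches
```

Forming e costs O(ND) and X^T e another O(ND), which answers the complexity question above.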

Implementation issues

Stopping criteria: When the gradient g is (close to) zero, we are at (or close to) an optimum. If the second-order derivative is positive, it is a minimum. See the section on optimality conditions.

Step-size selection: If α is too big, the method might diverge. If it is too small, convergence is slow. Convergence to a local minimum is guaranteed only when α < α_min, where α_min is a fixed constant that depends on the problem.

Line-search methods: We can set the step-size automatically using a line-search method. More details on backtracking methods can be found in Chapter 1 of Bertsekas's book on nonlinear programming.

Feature normalization: Gradient descent is very sensitive to ill-conditioning, so always normalize your features. With unnormalized features, step-size selection is difficult, since different directions might move at different speeds.
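A minimal sketch of feature normalization (Python, made-up data): standardizing each feature to zero mean and unit variance dramatically improves the conditioning of X^T X, which is what makes a single step-size work in all directions:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
# Made-up features on very different scales: ill-conditioned for gradient descent.
x1 = rng.normal(0.0, 1.0, size=N)
x2 = rng.normal(0.0, 1000.0, size=N)
X = np.column_stack([x1, x2])

# Standardize each feature: subtract the mean, divide by the standard deviation.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

cond_before = np.linalg.cond(X.T @ X)
cond_after = np.linalg.cond(X_norm.T @ X_norm)
print(cond_before, cond_after)  # conditioning improves by orders of magnitude
```

Remember to apply the same shift and scale, computed on the training data, to any test data.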

Additional notes

Pseudo-code for grid search

Pseudo-code for grid search with MSE for the 1-parameter model:

    beta0 = -10:.1:10;
    for i = 1:length(beta0)
      err(i) = computeCost(y, X, beta0(i));
    end
    [val, idx] = min(err);
    beta0_star = beta0(idx);

A function for computing the MSE. We normalize the MSE by 2N here; this keeps the MSE comparable for different N (without affecting the minimizer).

    function cost = computeCost(y, X, beta0)
      N = length(y);
      e = y - beta0;
      cost = e' * e / (2*N);
    end

Extend this code to a 2-parameter model. What is the computational complexity in terms of M (number of grid points), N (number of data examples) and D (number of dimensions) for grid search to minimize the MSE for linear regression? Read more about grid search and other methods for hyperparameter setting: https://en.wikipedia.org/wiki/Hyperparameter_optimization#Grid_search.

Local and global minimum

[Figure omitted; taken from Bertsekas, Nonlinear Programming.]

A vector β* is a local minimum of L if it is no worse than its neighbors; i.e., there exists an ε > 0 such that

    L(β*) ≤ L(β),  for all β with ‖β − β*‖ < ε.

A vector β* is a global minimum of L if it is no worse than all other vectors:

    L(β*) ≤ L(β),  for all β ∈ R^(D+1).

A local or global minimum is said to be strict if the corresponding inequality is strict for β ≠ β*. When working with a cost function, it is essential to make sure that a global minimum exists, i.e. that the cost function is bounded below. To gain more understanding of local and global minima, read the first 3 pages of Chapter 1 of Bertsekas's book on nonlinear programming.
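As a 1-D illustration (a toy example of mine, not from the notes): f(β) = β⁴ − 3β² + β has a strict global minimum near β ≈ −1.30 and a strict local, non-global minimum near β ≈ 1.13. Evaluating on a fine grid distinguishes them:

```python
import numpy as np

def f(b):
    # Nonconvex example: two "valleys" of different depths.
    return b**4 - 3 * b**2 + b

grid = np.linspace(-2, 2, 4001)
vals = f(grid)

global_min = grid[np.argmin(vals)]       # minimizer over the whole grid

# Restrict to beta > 0 to isolate the shallower valley: a local minimum only.
right = grid[grid > 0]
local_min = right[np.argmin(f(right))]

print(global_min, local_min)             # approx -1.30 and 1.13
print(f(global_min) < f(local_min))      # the local minimum is not global
```

Gradient descent started at β = 2 would slide into the shallow valley and stop there, which is exactly why the local/global distinction matters.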

Computational complexity

The computational cost is expressed using big-O notation. Here is a definition taken from Wikipedia: let f and g be two functions defined on some subset of the real numbers. We write f(x) = O(g(x)) as x → ∞ if and only if there exist a positive real number c and a real number x_0 such that

    |f(x)| ≤ c |g(x)|,  for all x > x_0.

Please read and learn more from this page on Wikipedia: http://en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations#Matrix_algebra.

What is the computational complexity of the dot product a^T b of two vectors a, b, both of length N? What is the computational complexity of matrix multiplication?

Pseudo-code for gradient descent

Matlab implementation:

    beta = zeros(D, 1);
    for k = 1:maxIters
      g = computeGradient(y, X, beta);
      beta = beta - alpha * g;
      if g'*g < 1e-5; break; end;
    end

Compute gradient:

    function g = computeGradient(y, X, beta)
      % Fill this in
    end

What is the computational complexity (in terms of N and D) of gradient descent with MSE for linear regression?
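Putting the pieces together, the gradient-descent pseudo-code above can be sketched in Python (the body of computeGradient is the −(1/N) X^T e formula from the MSE section; the data here are made up, with no noise so the minimizer is known):

```python
import numpy as np

def compute_gradient(y, X, beta):
    """Gradient of the MSE (normalized by 2N): -(1/N) X^T (y - X beta)."""
    e = y - X @ beta
    return -X.T @ e / len(y)

rng = np.random.default_rng(2)
N, D = 50, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, D))])
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true        # noiseless toy data, so the minimizer is beta_true

beta = np.zeros(D + 1)
alpha = 0.1
for k in range(2000):
    g = compute_gradient(y, X, beta)
    beta = beta - alpha * g
    if g @ g < 1e-10:    # stopping criterion: gradient close to zero
        break

print(beta)  # close to beta_true
```

Each iteration is dominated by the two matrix-vector products in the gradient, so the cost is O(ND) per iteration, independent of any grid.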

Optimality conditions

The first-order necessary condition says that at an optimum the gradient is equal to zero:

    ∂L(β*)/∂β = 0.    (6)

The second-order sufficient condition ensures that the optimum is a minimum (not a maximum or saddle point) using the Hessian matrix:

    H(β*) := ∂²L(β*)/∂β ∂β^T is positive definite.    (7)

The Hessian is also related to the convexity of a function: a twice-differentiable function is convex if and only if its Hessian is positive semi-definite.

Stochastic gradient descent

When N is large, choose a random pair (x_n, y_n) in the training set and take a step:

    β^(k+1) = β^(k) − α^(k) ∂L_n(β^(k))/∂β.

For convergence, α^(k) must go to 0 appropriately. One such condition, called the Robbins–Monro condition, suggests taking α^(k) such that:

    Σ_{k=1}^∞ α^(k) = ∞,  Σ_{k=1}^∞ (α^(k))² < ∞.    (8)

One way to obtain such a sequence is α^(k) = 1/(1 + k)^r where r ∈ (0.5, 1). Read Section 8.5 of Kevin Murphy's book. There are many variants of this method too. See more details and references on the Wikipedia page https://en.wikipedia.org/wiki/Stochastic_gradient_descent.
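A minimal SGD sketch (Python, made-up data) using the decaying step-size α^(k) = 1/(1 + k)^r with r = 0.6, which satisfies the Robbins–Monro condition:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200
# 1-parameter toy model: estimate the mean of y by SGD on the per-example MSE.
y = rng.normal(1.5, 1.0, size=N)

beta0 = 0.0
r = 0.6   # Robbins-Monro: sum of alpha_k diverges, sum of alpha_k^2 converges
for k in range(1, 20001):
    alpha_k = 1.0 / (1 + k) ** r
    n = rng.integers(N)            # pick one random example
    g = -(y[n] - beta0)            # stochastic gradient of (y_n - beta0)^2 / 2
    beta0 = beta0 - alpha_k * g

print(beta0)  # close to mean(y)
```

With a constant step-size the iterates would keep bouncing around the minimum; the decaying schedule shrinks that noise while still moving far enough in total to reach it.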

To do

1. Revise computational complexity (also see the Wikipedia link on page 6 of the lecture notes).
2. Derive the computational complexity of grid search and gradient descent.
3. Derive gradients for the MSE and MAE cost functions.
4. Derive the convergence of gradient descent for the 1-parameter model.
5. Implement gradient descent and gain experience in setting the step-size.
6. Learn to assess the convergence of gradient descent.
7. Revise linear algebra to understand positive-definite matrices.
8. Implement stochastic gradient descent and gain experience in setting the step-size.