Gradient Descent. Stephan Robert



Model
Other inputs: # of bathrooms, lot size, year built, location, lake view.
[Scatter plot: Price (CHF) vs. Size (m²)]

Model - How it works
Linear regression with one variable.
[Scatter plot: Price (CHF) $y$ vs. Size (m²) $x$]

Model - How it works
Linear regression with one variable:
$y_i = \theta_0 + \theta_1 x_i + \epsilon_i$, $\quad f(x) = \theta_0 + \theta_1 x$
[Scatter plot: Price (CHF) $y$ vs. Size (m²) $x$, with the fitted line $f(x)$]

Model
Idea: choose $\theta_0$, $\theta_1$ so that $f(x) = \theta_0 + \theta_1 x$ is close to $y$ for our training examples.
[Scatter plot: Price (CHF) $y$ vs. Size (m²) $x$]

Model
Cost function: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (f(x_i) - y_i)^2$
Aim: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1) = \min_{\theta_0, \theta_1} \frac{1}{2m} \sum_{i=1}^{m} (f(x_i) - y_i)^2$
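
As a concrete illustration of this cost, here is a minimal NumPy sketch (the function name and the array-based interface are my own choices, not from the slides):

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """Mean squared error J(theta0, theta1) = 1/(2m) * sum_i (f(x_i) - y_i)^2."""
    m = len(x)
    predictions = theta0 + theta1 * x      # f(x_i) for every sample
    return np.sum((predictions - y) ** 2) / (2 * m)
```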

Other functions
Quadratic: $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2$
Higher-order polynomial: $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \dots$

A very simple example
With $\theta_0 = 0$, $\theta_1 = 1$ and the three training points $(1, 2)$, $(2, 4)$, $(3, 6)$:
$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (f(x_i) - y_i)^2 = \frac{1}{2 \cdot 3} \left( (1-2)^2 + (2-4)^2 + (3-6)^2 \right) = \frac{1}{6}(1 + 4 + 9) = \frac{14}{6} \approx 2.33$
[Plot of the three points $(x_i, y_i)$ and the line $f(x) = x$]
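
A quick, self-contained check of the arithmetic on this slide (the variable names are mine):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
theta0, theta1 = 0.0, 1.0
J = np.sum((theta0 + theta1 * x - y) ** 2) / (2 * len(x))
print(J)   # 2.333... = 14/6
```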

A very simple example
$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (f(x_i) - y_i)^2$
[Plot of $J$ as a function of $\theta_1$]

A very simple example
$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (f(x_i) - y_i)^2$
[Plot of $J$ as a function of $\theta_1$, with the minimum marked]

A very simple example
$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (f(x_i) - y_i)^2$
(Figure from Andrew Ng, Stanford)

3D-plots

Cost function
[Surface/contour plot of $J(\theta_0, \theta_1)$ over $(\theta_0, \theta_1)$]

Gradient descent (intuition)
Have some function $J(\theta_0, \theta_1)$.
Goal: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$
Solution: we start with some $(\theta_0, \theta_1)$ and we keep changing these parameters to reduce $J(\theta_0, \theta_1)$ until we hopefully end up at a minimum.

Gradient descent (intuition)
Have some function $J(\theta_0, \theta_1)$.
Goal: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$
Proposition: $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$
[Plot of $J(\theta_1)$ with a point on the part of the curve where the slope is negative]

Gradient descent - Formally
$J(\theta) : \mathbb{R}^{n+1} \to \mathbb{R}$ is convex and differentiable, with $\theta = (\theta_0, \theta_1, \dots, \theta_n)^t \in \mathbb{R}^{n+1}$.
We want to solve $\min_{\theta \in \mathbb{R}^{n+1}} J(\theta)$, i.e. find $\theta^*$ such that $J(\theta^*) = \min_{\theta \in \mathbb{R}^{n+1}} J(\theta)$.

Gradient descent - Algorithm
Choose an initial $\theta[0] \in \mathbb{R}^{n+1}$.
Repeat: $\theta[k] = \theta[k-1] - \alpha[k-1]\, \nabla J(\theta[k-1])$, for $k = 1, 2, 3, \dots$
Stop at some point.
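
A minimal sketch of this loop in Python, assuming the gradient is supplied as a function; the names (grad_J, alpha), the fixed iteration budget, and the step-norm stopping rule are illustrative choices, not from the slides:

```python
import numpy as np

def gradient_descent(grad_J, theta_init, alpha=0.01, max_iter=1000, tol=1e-6):
    """Iterate theta[k] = theta[k-1] - alpha * grad J(theta[k-1]) until the step is tiny."""
    theta = np.asarray(theta_init, dtype=float)
    for _ in range(max_iter):
        step = alpha * grad_J(theta)
        theta = theta - step
        if np.linalg.norm(step) < tol:   # "stop at some point"
            break
    return theta
```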

Example with 2 parameters
Choose an initial $\theta[0] = (\theta_0[0], \theta_1[0])$.
Repeat: $\theta_j[k] = \theta_j[k-1] - \alpha[k-1]\, \frac{\partial}{\partial \theta_j} J(\theta_0[k-1], \theta_1[k-1])$, for $j = 0, 1$.
Stop at some point.

Example with 2 parameters - Implementation detail
Repeat (simultaneous update), for $k = 1, 2, 3, \dots$:
$\mathrm{temp}_0 = \theta_0[k-1] - \alpha[k-1]\, \frac{\partial}{\partial \theta_0} J(\theta_0[k-1], \theta_1[k-1])$
$\mathrm{temp}_1 = \theta_1[k-1] - \alpha[k-1]\, \frac{\partial}{\partial \theta_1} J(\theta_0[k-1], \theta_1[k-1])$
$\theta_0[k] = \mathrm{temp}_0$
$\theta_1[k] = \mathrm{temp}_1$
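
The same detail as a small Python helper: both partial derivatives are evaluated at the old values before either parameter is overwritten (the gradient functions are passed in as arguments; their names are mine):

```python
def simultaneous_update(theta0, theta1, dJ_dtheta0, dJ_dtheta1, alpha):
    """One gradient-descent step with a simultaneous update of (theta0, theta1)."""
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)   # uses the OLD theta0, theta1
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)   # uses the OLD theta0, theta1
    return temp0, temp1   # assign back to (theta0, theta1) only after both are computed
```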

Learning rate
Search over one parameter only (the other one, $\theta_0$, is supposed to be set already):
$\theta_1[k] = \theta_1[k-1] - \alpha[k-1]\, \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1[k-1])$
With a constant learning rate $\alpha[1] = \dots = \alpha[k] = \alpha$:
$\theta_1[k] = \theta_1[k-1] - \alpha\, \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1[k-1])$

Learning rate
$\theta_1[k] = \theta_1[k-1] - \alpha\, \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1[k-1])$
[Two plots of $J(\theta_1)$: one with $\alpha$ too small, one with $\alpha$ too large]

Gradient Descent - Learning rate (intuitive)
Practical trick (debug): $J$ should decrease after each iteration; fix a stop threshold (0.1?) and observe the cost function $J(\theta)$ versus the number of iterations as it approaches $J(\theta^*) = \min_{\theta \in \mathbb{R}^{n+1}} J(\theta)$.
[Plot of $J(\theta)$ vs. number of iterations]

Gradient descent
Important: we assume there are no local minima for now!
(Illustration: Wikimedia)


Gradient Descent - Backtracking line search (Boyd)
Convergence analysis might help. Take an arbitrary parameter $\beta$ and fix it: $0.1 \le \beta \le 0.8$.
At each iteration, start with $\alpha = 1$ and, while $J(\theta - \alpha \nabla J(\theta)) > J(\theta) - \frac{\alpha}{2} \|\nabla J(\theta)\|^2$, update $\alpha = \beta \alpha$.
Simple, but works quite well in practice (check!).
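
A sketch of this backtracking rule in Python, following the condition on the slide; the function names J and grad_J and the default beta are assumptions:

```python
import numpy as np

def backtracking_step(J, grad_J, theta, beta=0.8):
    """Shrink alpha until J(theta - alpha*grad) <= J(theta) - (alpha/2)*||grad||^2, then step."""
    g = grad_J(theta)
    alpha = 1.0
    while J(theta - alpha * g) > J(theta) - 0.5 * alpha * np.dot(g, g):
        alpha *= beta                     # alpha = beta * alpha
    return theta - alpha * g
```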

Gradient Descent - Backtracking line search (Boyd) - Interpretation
[Figure: as a function of $\alpha$, the curve $J(\theta - \alpha \nabla J(\theta))$ compared with the lines $J(\theta) - \alpha \|\nabla J(\theta)\|^2$ and $J(\theta) - \frac{\alpha}{2} \|\nabla J(\theta)\|^2$; backtracking accepts $\alpha$ once the curve lies below the latter line.]

Gradient Descent - Backtracking line search (Boyd)
Example (13 steps) on a quadratic objective in $(\theta_1, \theta_2)$ with $\beta = 0.8$ (Geoff Gordon & Ryan Tibshirani). Convergence analysis by GG&RT.

Gradient Descent - Exact line search (Boyd)
At each iteration, do the best along the gradient direction: $\alpha = \arg\min_{s \ge 0} J(\theta - s \nabla J(\theta))$
Often not much more efficient than backtracking; not really worth it.
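
For completeness, exact line search can be sketched with SciPy's bounded scalar minimizer (the search interval [0, s_max] and the function names are assumptions; as the slide notes, backtracking is usually preferred):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def exact_line_search_step(J, grad_J, theta, s_max=10.0):
    """Pick the step size s minimizing J(theta - s * grad J(theta)) over [0, s_max]."""
    g = grad_J(theta)
    res = minimize_scalar(lambda s: J(theta - s * g), bounds=(0.0, s_max), method="bounded")
    return theta - res.x * g
```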

Gradient descent - Linear regression
$f(x) = \theta_0 + \theta_1 x$
$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (f(x_i) - y_i)^2 = \frac{1}{2m} \sum_{i=1}^{m} (\theta_0 + \theta_1 x_i - y_i)^2$
Repeat: $\theta_j[k] = \theta_j[k-1] - \alpha[k-1]\, \frac{\partial}{\partial \theta_j} J(\theta_0[k-1], \theta_1[k-1])$, for $j = 0, 1$.
Stop at some point.

Gradient descent - Linear regression
$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (\theta_0 + \theta_1 x_i - y_i)^2$
$\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_0} \left( \frac{1}{2m} \sum_{i=1}^{m} (\theta_0 + \theta_1 x_i - y_i)^2 \right) = \frac{1}{m} \sum_{i=1}^{m} (\theta_0 + \theta_1 x_i - y_i)$

Gradient descent - Linear regression
$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (\theta_0 + \theta_1 x_i - y_i)^2$
$\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_1} \left( \frac{1}{2m} \sum_{i=1}^{m} (\theta_0 + \theta_1 x_i - y_i)^2 \right) = \frac{1}{m} \sum_{i=1}^{m} (\theta_0 + \theta_1 x_i - y_i)\, x_i$

Gradient descent - Linear regression
Choose an initial $\theta[0] = (\theta_0[0], \theta_1[0])$ and a constant learning rate $\alpha[1] = \dots = \alpha[k] = \alpha$.
Repeat:
$\theta_0[k] = \theta_0[k-1] - \alpha\, \frac{\partial}{\partial \theta_0} J(\theta_0[k-1], \theta_1[k-1])$
$\theta_1[k] = \theta_1[k-1] - \alpha\, \frac{\partial}{\partial \theta_1} J(\theta_0[k-1], \theta_1[k-1])$
Stop at some point.

Gradient descent - Linear regression
Repeat:
$\theta_0[k] = \theta_0[k-1] - \frac{\alpha}{m} \sum_{i=1}^{m} (\theta_0[k-1] + \theta_1[k-1]\, x_i - y_i)$
$\theta_1[k] = \theta_1[k-1] - \frac{\alpha}{m} \sum_{i=1}^{m} (\theta_0[k-1] + \theta_1[k-1]\, x_i - y_i)\, x_i$
Stop at some point.
WARNING: update $\theta_0$ and $\theta_1$ simultaneously.
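
Putting the pieces together, a small NumPy sketch of these updates for one feature (the learning rate, iteration count, and function name are illustrative assumptions):

```python
import numpy as np

def linreg_gradient_descent(x, y, alpha=0.1, n_iter=2000):
    """Gradient descent for f(x) = theta0 + theta1*x with simultaneous updates."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_iter):
        residual = theta0 + theta1 * x - y                 # f(x_i) - y_i for every sample
        grad0 = residual.sum() / m                         # dJ/dtheta0
        grad1 = (residual * x).sum() / m                   # dJ/dtheta1
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1  # simultaneous update
    return theta0, theta1

# On the earlier toy data (y = 2x) this converges to roughly theta0 = 0, theta1 = 2
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(linreg_gradient_descent(x, y))
```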

Generalization
Higher-order polynomial: $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \dots$
Multi-feature: $f(x_1, x_2, \dots) = f(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \dots$
WARNING - Notation: in $f(x)$, $(x)_1 = x_1$ is a scalar (the first sample); in $f(x_1, x_2, \dots)$, $x_1$ is a variable (the first feature) and $(x_1)_i$ is a scalar (its value for sample $i$).

Gradient Descent, n=1 - Linear Regression
Choose an initial $\theta[0] = (\theta_0[0], \theta_1[0])$ and a constant learning rate $\alpha[1] = \dots = \alpha[k] = \alpha$.
Repeat:
$\theta_0[k] = \theta_0[k-1] - \frac{\alpha}{m} \sum_{i=1}^{m} (\theta_0[k-1] + \theta_1[k-1]\, x_i - y_i)$
$\theta_1[k] = \theta_1[k-1] - \frac{\alpha}{m} \sum_{i=1}^{m} (\theta_0[k-1] + \theta_1[k-1]\, x_i - y_i)\, x_i$
Stop at some point.

Gradient Descent, n>1 - Linear Regression
Choose an initial $\theta[0] = (\theta_0[0], \theta_1[0], \theta_2[0], \dots, \theta_n[0])$.
Repeat:
$\theta_0[k] = \theta_0[k-1] - \frac{\alpha}{m} \sum_{i=1}^{m} (f(x_i) - y_i)$
$\theta_1[k] = \theta_1[k-1] - \frac{\alpha}{m} \sum_{i=1}^{m} (f(x_i) - y_i)\,(x_1)_i$
$\theta_2[k] = \theta_2[k-1] - \frac{\alpha}{m} \sum_{i=1}^{m} (f(x_i) - y_i)\,(x_2)_i$
...
Stop at some point.

Gradient Descent - Practical trick
Scale all features to a similar scale, e.g. size of the house: 300 m², number of floors: 3.
Mean normalization: $x_i = \frac{\mathrm{Size}_i - \mathrm{MeanSize}}{\mathrm{MaxSize} - \mathrm{MinSize}}$
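
A column-wise NumPy version of this normalization (the function and variable names are mine):

```python
import numpy as np

def mean_normalize(features):
    """Rescale each feature column to (value - mean) / (max - min)."""
    features = np.asarray(features, dtype=float)
    return (features - features.mean(axis=0)) / (features.max(axis=0) - features.min(axis=0))
```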

Vector Notation
$x = (x_0, x_1, \dots, x_n)^t \in \mathbb{R}^{n+1}$, $\quad \theta = (\theta_0, \theta_1, \dots, \theta_n)^t \in \mathbb{R}^{n+1}$
$f(x_0, x_1, x_2, \dots, x_n) = f(x) = \theta^t x = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$
Often: $(x_0)_i = 1$ for every sample $i$.

Vector Notation
$(x)_i = ((x_0)_i, (x_1)_i, \dots, (x_n)_i)^t \in \mathbb{R}^{n+1}$, $\quad \theta = (\theta_0, \theta_1, \dots, \theta_n)^t \in \mathbb{R}^{n+1}$
$f((x_0)_i, (x_1)_i, (x_2)_i, \dots, (x_n)_i) = f((x)_i) = \theta^t (x)_i = \theta_0 + \theta_1 (x_1)_i + \theta_2 (x_2)_i + \dots + \theta_n (x_n)_i$, for $i \in \{1, \dots, m\}$.

Vector Notation
$X = \begin{pmatrix} (x)_1^t \\ (x)_2^t \\ \vdots \\ (x)_m^t \end{pmatrix} = \begin{pmatrix} (x_0)_1 & (x_1)_1 & \dots & (x_n)_1 \\ (x_0)_2 & (x_1)_2 & \dots & (x_n)_2 \\ \vdots & & & \vdots \\ (x_0)_m & (x_1)_m & \dots & (x_n)_m \end{pmatrix} \in \mathbb{R}^{m \times (n+1)}$

Vector Notation
$y = ((y)_1, (y)_2, \dots, (y)_m)^t = (y_1, y_2, \dots, y_m)^t \in \mathbb{R}^{m}$, $\quad \theta = (\theta_0, \theta_1, \dots, \theta_n)^t \in \mathbb{R}^{n+1}$
Cost function: $J(\theta) = \frac{1}{2m} (X\theta - y)^t (X\theta - y)$
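
With this notation the cost and its gradient vectorize; a minimal NumPy sketch (the gradient $\frac{1}{m} X^t (X\theta - y)$ collects the per-coordinate derivatives seen earlier; the function names are mine):

```python
import numpy as np

def cost_vec(theta, X, y):
    """J(theta) = 1/(2m) * (X theta - y)^t (X theta - y)."""
    m = len(y)
    r = X @ theta - y
    return (r @ r) / (2 * m)

def grad_vec(theta, X, y):
    """Gradient of J: (1/m) * X^t (X theta - y)."""
    return X.T @ (X @ theta - y) / len(y)
```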

Example
$m = 4$ = number of samples

  x_0 | x_1: Size | x_2: #bedrooms | x_3: #floors | x_4: Age | y: Price (mios)
  1   | 125       | 3              | 2            | 2        | 0.8
  1   | 22        | 5              | 2            | 15       | 1.2
  1   | 4         | 7              | 3            | 5        | 4
  1   | 25        | 4              | 2            | 1        | 2

$X = \begin{pmatrix} 1 & 125 & 3 & 2 & 2 \\ 1 & 22 & 5 & 2 & 15 \\ 1 & 4 & 7 & 3 & 5 \\ 1 & 25 & 4 & 2 & 1 \end{pmatrix}$, $\quad y = \begin{pmatrix} 0.8 \\ 1.2 \\ 4 \\ 2 \end{pmatrix} \in \mathbb{R}^{m}$

Example
$X = \begin{pmatrix} 1 & 125 & 3 & 2 & 2 \\ 1 & 22 & 5 & 2 & 15 \\ 1 & 4 & 7 & 3 & 5 \\ 1 & 25 & 4 & 2 & 1 \end{pmatrix} \in \mathbb{R}^{m \times (n+1)}$, $\quad X^t = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 125 & 22 & 4 & 25 \\ 3 & 5 & 7 & 4 \\ 2 & 2 & 3 & 2 \\ 2 & 15 & 5 & 1 \end{pmatrix} \in \mathbb{R}^{(n+1) \times m}$

Normal equation
Method to solve for $\theta$ analytically (least squares): setting the derivatives to zero gives
$\theta = (X^t X)^{-1} X^t y$
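
A minimal NumPy sketch of this formula; the tiny data set is hypothetical and chosen so that $X^t X$ is invertible (more samples than parameters), and the linear system is solved directly rather than forming the inverse explicitly:

```python
import numpy as np

# Hypothetical data: m = 5 samples, one feature plus the x0 = 1 column
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0],
              [1.0, 5.0]])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

# theta = (X^t X)^{-1} X^t y, computed by solving the normal equations
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)   # approximately [0.04, 1.0]
```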

Comparison
Gradient Descent:
- Need to choose a learning parameter $\alpha$
- Many iterations
- Works even for large $n$ (on the order of $10^6$)
Normal Equation:
- No need to choose a learning parameter
- No iterations
- Need to compute $(X^t X)^{-1}$
- Slow when $n$ is large