ECS171: Machine Learning


Lecture 4: Optimization (LFD 3.3, SGD). Cho-Jui Hsieh, UC Davis, Jan 22, 2018.

Gradient descent

Optimization
Goal: find the minimizer of a function: $\min_w f(w)$. For now we assume $f$ is twice differentiable. Machine learning algorithm: find the hypothesis that minimizes $E_{in}$.

Convex vs Nonconvex
Convex function: $\nabla f(w^*) = 0 \Leftrightarrow w^*$ is a global minimum. A function is convex if $\nabla^2 f(w)$ is positive semidefinite. Examples: linear regression, logistic regression, ...
Non-convex function: $\nabla f(w^*) = 0 \Rightarrow w^*$ is a global minimum, a local minimum, or a saddle point; most algorithms only converge to a point with gradient $= 0$. Example: neural networks, ...

Gradient Descent
Gradient descent: repeatedly do
$w^{t+1} \leftarrow w^t - \alpha \nabla f(w^t)$,
where $\alpha > 0$ is the step size. This generates a sequence $w^1, w^2, \ldots$ that converges to a minimum solution ($\nabla f(w) = 0$). If the step size is too large, the iterates diverge; if it is too small, convergence is slow.
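A minimal sketch of this update loop in Python; the quadratic test function and the step size here are illustrative assumptions, not from the slides.

```python
import numpy as np

def gradient_descent(grad_f, w0, alpha=0.1, num_iters=100):
    """Repeat w <- w - alpha * grad_f(w) for a fixed number of iterations."""
    w = w0.copy()
    for _ in range(num_iters):
        w = w - alpha * grad_f(w)
    return w

# Toy example: f(w) = ||w||^2 / 2, so grad f(w) = w and the minimizer is w = 0.
w_star = gradient_descent(grad_f=lambda w: w, w0=np.array([3.0, -2.0]))
print(w_star)  # close to [0, 0]
```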

Why gradient descent?
Reason I: the gradient is the steepest direction to decrease the objective function locally.
Reason II: the successive approximation view. At each iteration, form an approximation of $f(\cdot)$ around $w^t$:
$f(w^t + d) \approx g(d) := f(w^t) + \nabla f(w^t)^T d + \frac{1}{2\alpha}\|d\|^2$
and update the solution by $w^{t+1} \leftarrow w^t + d^*$ with $d^* = \arg\min_d g(d)$. Setting $\nabla g(d^*) = 0$ gives $\nabla f(w^t) + \frac{1}{\alpha} d^* = 0$, so $d^* = -\alpha \nabla f(w^t)$. This $d^*$ will decrease $f(\cdot)$ if $\alpha$ (the step size) is sufficiently small.

Illustration of gradient descent
1. Form a quadratic approximation: $f(w^t + d) \approx g(d) = f(w^t) + \nabla f(w^t)^T d + \frac{1}{2\alpha}\|d\|^2$.
2. Minimize $g(d)$: $\nabla g(d^*) = 0 \Rightarrow \nabla f(w^t) + \frac{1}{\alpha} d^* = 0 \Rightarrow d^* = -\alpha \nabla f(w^t)$.
3. Update: $w^{t+1} = w^t + d^* = w^t - \alpha \nabla f(w^t)$.
4. Form another quadratic approximation: $f(w^{t+1} + d) \approx g(d) = f(w^{t+1}) + \nabla f(w^{t+1})^T d + \frac{1}{2\alpha}\|d\|^2$, giving $d^* = -\alpha \nabla f(w^{t+1})$.
5. Update: $w^{t+2} = w^{t+1} + d^* = w^{t+1} - \alpha \nabla f(w^{t+1})$.

When will it diverge? It can diverge ($f(w^t) < f(w^{t+1})$) if $g$ is not an upper bound of $f$.

When will it converge? It always converges ($f(w^t) > f(w^{t+1})$) when $g$ is an upper bound of $f$.

Convergence
Let $L$ be the Lipschitz constant of the gradient ($\nabla^2 f(x) \preceq LI$ for all $x$). Theorem: gradient descent converges if $\alpha < \frac{1}{L}$. In practice we do not know $L$, so the step size needs to be tuned when running gradient descent.
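Since $L$ is usually unknown, one practical heuristic (backtracking, not from the slides) is to shrink the step whenever it fails to decrease $f$. A minimal sketch, assuming the caller supplies f and grad_f:

```python
def gd_backtracking(f, grad_f, w, alpha=1.0, beta=0.5, num_iters=100):
    """Gradient descent that shrinks the step until it actually decreases f."""
    for _ in range(num_iters):
        g = grad_f(w)
        step = alpha
        while f(w - step * g) > f(w):  # step too large: the model g is not an upper bound of f
            step *= beta               # shrink the step size
        w = w - step * g
    return w
```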

Applying to logistic regression
Gradient descent for logistic regression:
  Initialize the weights $w_0$
  For $t = 1, 2, \ldots$:
    Compute the gradient $\nabla E_{in} = -\frac{1}{N} \sum_{n=1}^N \frac{y_n x_n}{1 + e^{y_n w^T x_n}}$
    Update the weights: $w \leftarrow w - \eta \nabla E_{in}$
  Return the final weights $w$
When to stop? After a fixed number of iterations, or when $\|\nabla E_{in}\| < \epsilon$.
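A runnable sketch of this loop; the data, step size, and tolerance are illustrative assumptions.

```python
import numpy as np

def logistic_gd(X, y, eta=0.1, num_iters=1000, eps=1e-6):
    """Gradient descent for E_in(w) = mean_n log(1 + exp(-y_n * w^T x_n))."""
    N, d = X.shape
    w = np.zeros(d)                     # initialize the weights
    for _ in range(num_iters):
        margins = y * (X @ w)           # y_n * w^T x_n
        grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / N
        if np.linalg.norm(grad) < eps:  # stop when the gradient is small
            break
        w -= eta * grad                 # update the weights
    return w

# Illustrative data with labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]))
w = logistic_gd(X, y)
```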

Stochastic Gradient Descent

Large-scale problems
Machine learning usually minimizes the in-sample loss (training loss):
$\min_w \{\frac{1}{N}\sum_{n=1}^N l(w^T x_n, y_n)\} := E_{in}(w)$ (linear model)
$\min_w \{\frac{1}{N}\sum_{n=1}^N l(h_w(x_n), y_n)\} := E_{in}(w)$ (general hypothesis)
where $l$ is the loss function (e.g., $l(a, b) = (a - b)^2$). Gradient descent: $w \leftarrow w - \eta \nabla E_{in}(w)$, where computing $\nabla E_{in}(w)$ is the main computation. In general, $E_{in}(w) = \frac{1}{N}\sum_{n=1}^N f_n(w)$, where each $f_n(w)$ depends only on $(x_n, y_n)$.

Stochastic gradient
Gradient: $\nabla E_{in}(w) = \frac{1}{N}\sum_{n=1}^N \nabla f_n(w)$. Each gradient computation needs to go through all training samples, which is slow with millions of samples. A faster way to compute an approximate gradient is stochastic sampling: sample a small subset $B \subseteq \{1, \ldots, N\}$ and estimate the gradient by $\nabla E_{in}(w) \approx \frac{1}{|B|}\sum_{n \in B} \nabla f_n(w)$, where $|B|$ is the batch size.

Stochastic gradient descent
Stochastic Gradient Descent (SGD)
  Input: training data $\{x_n, y_n\}_{n=1}^N$
  Initialize $w$ (zero or random)
  For $t = 1, 2, \ldots$:
    Sample a small batch $B \subseteq \{1, \ldots, N\}$
    Update the parameters: $w \leftarrow w - \eta_t \frac{1}{|B|}\sum_{n \in B} \nabla f_n(w)$
Extreme case: $|B| = 1$, i.e., sample one training example at a time.

Logistic regression by SGD
Logistic regression: $\min_w \frac{1}{N}\sum_{n=1}^N \underbrace{\log(1 + e^{-y_n w^T x_n})}_{f_n(w)}$
SGD for logistic regression:
  Input: training data $\{x_n, y_n\}_{n=1}^N$
  Initialize $w$ (zero or random)
  For $t = 1, 2, \ldots$:
    Sample a batch $B \subseteq \{1, \ldots, N\}$
    Update the parameters: $w \leftarrow w - \eta_t \frac{1}{|B|}\sum_{n \in B} \nabla f_n(w)$, where $\nabla f_n(w) = -\frac{y_n x_n}{1 + e^{y_n w^T x_n}}$
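A runnable sketch of this loop, using a decaying step size as discussed later in the lecture; the batch size and step-size schedule are illustrative assumptions.

```python
import numpy as np

def logistic_sgd(X, y, batch_size=10, num_iters=5000,
                 step_size=lambda t: 0.5 / np.sqrt(t)):
    """SGD for the logistic loss f_n(w) = log(1 + exp(-y_n * w^T x_n))."""
    N, d = X.shape
    rng = np.random.default_rng(0)
    w = np.zeros(d)                                        # initialize w (zero)
    for t in range(1, num_iters + 1):
        B = rng.choice(N, size=batch_size, replace=False)  # sample a small batch
        margins = y[B] * (X[B] @ w)                        # y_n * w^T x_n, n in B
        grad = -(X[B].T @ (y[B] / (1.0 + np.exp(margins)))) / batch_size
        w -= step_size(t) * grad                           # w <- w - eta_t * batch gradient
    return w
```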

Why does SGD work?
The stochastic gradient is an unbiased estimator of the full gradient:
$E[\frac{1}{|B|}\sum_{n \in B} \nabla f_n(w)] = \frac{1}{N}\sum_{n=1}^N \nabla f_n(w) = \nabla E_{in}(w)$
so each iteration updates by the true gradient plus zero-mean noise.
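A quick empirical check of this unbiasedness on a toy objective (the objective and sizes are illustrative assumptions):

```python
import numpy as np

# Toy setup: f_n(w) = (w - c_n)^2 / 2, so grad f_n(w) = w - c_n.
rng = np.random.default_rng(0)
c = rng.normal(size=1000)
w = 0.3
full_grad = np.mean(w - c)                        # (1/N) sum_n grad f_n(w)
batch_grads = [np.mean(w - c[rng.choice(1000, size=10, replace=False)])
               for _ in range(20000)]             # many mini-batch estimates
print(full_grad, np.mean(batch_grads))            # the two averages nearly coincide
```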

Stochastic gradient descent
In gradient descent, $\eta$ (the step size) is a fixed constant. Can we use a fixed step size for SGD? No: SGD with a fixed step size cannot converge to global/local minimizers. If $w^*$ is the minimizer, then $\nabla f(w^*) = \frac{1}{N}\sum_{n=1}^N \nabla f_n(w^*) = 0$, but $\frac{1}{|B|}\sum_{n \in B} \nabla f_n(w^*) \neq 0$ if $B$ is a proper subset. (Even if we reach the minimizer, SGD will move away from it.)

Stochastic gradient descent: step size
To make SGD converge, the step size should decrease to 0: $\eta_t \to 0$, usually at a polynomial rate, $\eta_t \approx t^{-a}$ for some constant $a$.
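A sketch of such a schedule; the constants eta0 and a are illustrative assumptions and typically need tuning.

```python
def polynomial_step_size(t, eta0=1.0, a=0.5):
    """Decaying step size eta_t = eta0 / t^a; eta_t -> 0 as t grows."""
    return eta0 / t**a

# e.g., pass step_size=lambda t: polynomial_step_size(t, eta0=0.5, a=0.6)
# to the SGD sketch above.
```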

Stochastic gradient descent vs gradient descent
Stochastic gradient descent:
  pros: cheaper computation per iteration; faster convergence in the beginning
  cons: less stable, slower final convergence; hard to tune the step size
(Figure from https://medium.com/@imadphd/gradient-descent-algorithm-and-its-variants-10f652806a3)

Revisit the Perceptron Learning Algorithm
Given classification data $\{x_n, y_n\}_{n=1}^N$, learn a linear model by minimizing $\min_w \frac{1}{N}\sum_{n=1}^N l(w^T x_n, y_n)$ with the loss $l(w^T x_n, y_n) = \max(0, -y_n w^T x_n)$. What is the gradient?

Revisit the Perceptron Learning Algorithm
$l(w^T x_n, y_n) = \max(0, -y_n w^T x_n)$. Consider two cases:
Case I: $y_n w^T x_n > 0$ (prediction correct): $l(w^T x_n, y_n) = 0$, so $\nabla_w l(w^T x_n, y_n) = 0$.
Case II: $y_n w^T x_n < 0$ (prediction wrong): $l(w^T x_n, y_n) = -y_n w^T x_n$, so $\nabla_w l(w^T x_n, y_n) = -y_n x_n$.
SGD update rule: sample an index $n$ and set
$w^{t+1} \leftarrow \begin{cases} w^t & \text{if } y_n w^T x_n \geq 0 \text{ (prediction correct)} \\ w^t + \eta_t y_n x_n & \text{if } y_n w^T x_n < 0 \text{ (prediction wrong)} \end{cases}$
This is equivalent to the Perceptron Learning Algorithm when $\eta_t = 1$.
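A minimal sketch of this update rule in Python, assuming labels in {-1, +1}; the epoch count and shuffling are illustrative choices.

```python
import numpy as np

def perceptron(X, y, eta=1.0, num_epochs=10):
    """Perceptron learning as SGD with batch size 1 on max(0, -y * w^T x)."""
    N, d = X.shape
    rng = np.random.default_rng(0)
    w = np.zeros(d)
    for _ in range(num_epochs):
        for n in rng.permutation(N):       # visit examples one at a time
            if y[n] * (X[n] @ w) < 0:      # prediction wrong
                w += eta * y[n] * X[n]     # w <- w + eta * y_n * x_n
            # prediction correct: w is unchanged
    return w
```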

Conclusions
Gradient descent. Stochastic gradient descent. Next class: LFD 2. Questions?