Convex Optimization, Lecture 15: Gradient Descent in Machine Learning


Convex Optimization, Lecture 15: Gradient Descent in Machine Learning
Instructor: Yuanzhang Xiao
University of Hawaii at Manoa, Fall 2017

Today's Lecture
1. Motivation
2. Subgradient Method
3. Stochastic Subgradient Method


Why Gradient Descent in Machine Learning?
- in theory: Newton methods are much faster than gradient descent
- in practice, especially in machine learning with big data: gradient descent (and its variations) is what gets used
- so why, and how, do we use gradient descent in machine learning tasks?

Example: Least Squares
least squares: minimize ‖Ax − b‖₂²
assuming AᵀA is invertible, we have the analytical solution: x* = (AᵀA)⁻¹(Aᵀb)
this solve is the building block of Newton methods (i.e., the inverse of ∇²f(x))
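As a minimal sketch of the analytical solution (a small synthetic A and b, not the slides' big-data instance; all names are illustrative):

```python
import numpy as np

# small synthetic least-squares instance (sizes are illustrative only)
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
b = rng.standard_normal(50)

# analytical solution x* = (A^T A)^{-1} A^T b, computed with a linear
# solve of the normal equations rather than an explicit matrix inverse
x_star = np.linalg.solve(A.T @ A, A.T @ b)

# at the optimum, the residual Ax* - b is orthogonal to the columns of A
print(np.allclose(A.T @ (A @ x_star - b), 0.0))
```

Using `solve` on the normal equations avoids ever materializing (AᵀA)⁻¹, but the n × n matrix AᵀA itself is still formed, which is exactly what becomes infeasible at the sizes on the next slide.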

Example: Least Squares
with big data, we can have A ∈ R^(m×n) with m ≈ 10⁶ and n ≈ 10⁵
the size of (AᵀA)⁻¹ is then 10⁵ × 10⁵, which takes about 74 GB of memory
sparsity of A does not help, because (AᵀA)⁻¹ can be dense

Example: Least Squares
in gradient descent, we do
x^(k+1) = x^(k) − t^(k) ∇f(x^(k)) = x^(k) − t^(k) (AᵀAx^(k) − Aᵀb)
memory needed:
- A ∈ R^(m×n): 740 MB if the density of A is 10⁻³
- Aᵀb ∈ R^n: 0.7 MB
- Ax^(k) ∈ R^m: 7 MB
- Aᵀ(Ax^(k)) ∈ R^n: 0.7 MB
memory needed in gradient descent: about 740 MB, as compared to 74 GB for the analytical solution or Newton's method
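The update above can be sketched in code; this is a small dense stand-in for the sparse big-data setting (sizes, step size, and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 20))
b = rng.standard_normal(200)

# fixed step size below 1 / lambda_max(A^T A), which guarantees descent
# for f(x) = ||Ax - b||_2^2 (whose gradient is 2 A^T (Ax - b))
t = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)

x = np.zeros(20)
for _ in range(5000):
    # only matrix-vector products: A^T A is never formed or inverted
    x = x - t * (2.0 * A.T @ (A @ x - b))

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(x, x_star, atol=1e-6))
```

The key point matches the slide: each iteration touches A only through the products Ax^(k) and Aᵀ(Ax^(k)), so the memory footprint is that of A and a few vectors.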

Example: Least Squares
least squares: minimize ‖Ax − b‖₂²
with m = 500,000, n = 400,000, and density of A equal to 10⁻³


Motivation: Objective is Not Differentiable
least squares with ℓ₁ regularization: minimize ‖Ax − b‖₂² + ‖x‖₁
- do not want to introduce additional variables: with n = 10⁵, reformulating the ℓ₁ term would add another 10⁵ variables
- do not want to use Newton's method
- can we use variations of gradient descent?

Subgradient
g is a subgradient of f at x if
f(y) ≥ f(x) + gᵀ(y − x) for all y
the affine function f(x) + gᵀ(y − x) is a global underestimator of f
(figure: g₁ is a subgradient at x₁; g₂ and g₃ are subgradients at x₂)
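For intuition, the defining inequality can be checked numerically on f(x) = |x|, whose subgradients at 0 are exactly the g with |g| ≤ 1 (a toy illustration, not from the slides):

```python
import numpy as np

f = abs  # f(x) = |x| is convex but not differentiable at 0

def is_subgradient(g, x, ys):
    # g is a subgradient of f at x iff f(y) >= f(x) + g*(y - x) for all y
    return all(f(y) >= f(x) + g * (y - x) for y in ys)

ys = np.linspace(-5.0, 5.0, 1001)
print(is_subgradient(0.5, 0.0, ys))   # True: any g in [-1, 1] works at x = 0
print(is_subgradient(1.5, 0.0, ys))   # False: g outside [-1, 1] fails
```

This also illustrates why subgradients are generally non-unique at kinks: at x = 0 there is a whole interval of valid slopes, while at any x ≠ 0 the only subgradient is the ordinary derivative sign(x).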

Descent Directions
a subgradient g may not be a descent direction for a nondifferentiable f
example: f(x) = |x₁| + 2|x₂|

Subgradient Method
subgradient method: x^(k+1) = x^(k) − α_k g^(k)
where g^(k) is any subgradient of f at x^(k)
not a descent method; need to keep track of the best point so far:
f_best^(k) = min_{i=1,...,k} f(x^(i))

Subgradient Method: Step Size Rules
not a descent method, so there is no backtracking line search; step sizes are fixed ahead of run time:
- constant step size: α_k = α
- square summable but not summable: Σ_{k=1}^∞ α_k² < ∞ and Σ_{k=1}^∞ α_k = ∞; example: α_k = 1/k
- diminishing, not summable: lim_{k→∞} α_k = 0 and Σ_{k=1}^∞ α_k = ∞; example: α_k = 1/√k
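A sketch of the method with the diminishing rule α_k = 1/√k and best-point tracking, on the toy nonsmooth objective f(x) = ‖x − c‖₁ (the objective, dimensions, and iteration count are illustrative, not from the slides):

```python
import numpy as np

c = np.array([1.0, -2.0, 3.0])

def f(x):
    # nonsmooth objective ||x - c||_1, minimized at x = c with f(c) = 0
    return np.sum(np.abs(x - c))

x = np.zeros(3)
f_best = f(x)

for k in range(1, 2001):
    g = np.sign(x - c)                 # a subgradient of f at x
    x = x - (1.0 / np.sqrt(k)) * g     # diminishing step size 1/sqrt(k)
    # not a descent method: keep track of the best value seen so far
    f_best = min(f_best, f(x))

print(f_best < 0.1)
```

Note that the iterates oscillate around the minimizer with amplitude on the order of α_k, which is exactly why the diverging step-size sum (to reach the optimum from anywhere) and the vanishing step size (to shrink the oscillation) are both needed.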

Example: Piecewise Linear Minimization
piecewise linear minimization with m = 100, n = 20:
minimize max_{i=1,...,m} (a_iᵀx + b_i)
(figure: convergence with a constant step size)

Example: Piecewise Linear Minimization
piecewise linear minimization with m = 100, n = 20:
minimize max_{i=1,...,m} (a_iᵀx + b_i)
(figure: convergence with a diminishing step size)
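For the piecewise linear objective, a subgradient at x is a_{i*} for any index i* attaining the max. A sketch with random problem data (the data, step sizes, and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 100, 20
a = rng.standard_normal((m, n))
b = rng.standard_normal(m)

def f(x):
    return np.max(a @ x + b)

x = np.zeros(n)
f_best = f(x)

for k in range(1, 3001):
    i_star = np.argmax(a @ x + b)      # index of an active (maximal) piece
    g = a[i_star]                      # a subgradient of the max at x
    x = x - (0.1 / np.sqrt(k)) * g     # diminishing step size
    f_best = min(f_best, f(x))

# sanity check: the active piece satisfies the subgradient inequality at x
y = rng.standard_normal(n)
g = a[np.argmax(a @ x + b)]
print(f(y) >= f(x) + g @ (y - x))
```

The final check holds by convexity: the active affine piece at x minorizes the max everywhere, which is precisely the subgradient inequality.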


Stochastic Subgradient Method
least squares reformulation: minimize Σ_{i=1}^m (a_iᵀx − b_i)²
more generally, consider a problem of the form: minimize Σ_{i=1}^m f_i(x)
stochastic subgradient method: x^(k+1) = x^(k) − α_k g_i^(k)
where g_i^(k) is any subgradient of a randomly chosen f_i at x^(k)
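A sketch of the stochastic variant on the least-squares sum, touching one row a_i per update (problem sizes, the consistent right-hand side, and the step-size scaling are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 500, 10
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
b = A @ x_true                  # consistent system, so the optimum is x_true

L = (np.linalg.norm(A, axis=1) ** 2).max()   # largest row norm squared
x = np.zeros(n)

for k in range(1, 50001):
    i = rng.integers(m)                      # pick one term at random
    # gradient of the single term f_i(x) = (a_i^T x - b_i)^2
    g = 2.0 * (A[i] @ x - b[i]) * A[i]
    # diminishing step, scaled by 1/(2L) so no single update can overshoot
    x = x - (1.0 / (2.0 * L * np.sqrt(k))) * g

print(np.linalg.norm(A @ x - b) < 0.1 * np.linalg.norm(b))
```

Each update costs one inner product and one vector addition over a single row, which is what makes the method attractive both for memory and for data that arrive one point at a time.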

Advantages of Stochastic Subgradient Method
- even less memory: in the least-squares example, we only need to store one row a_i ∈ R^100,000 at a time (about 0.7 MB)
- sometimes data arrive sequentially: we can update x after each data point a_i arrives

Example: Piecewise Linear Minimization
piecewise linear minimization with m = 100, n = 20:
minimize max_{i=1,...,m} (a_iᵀx + b_i)
(figure: convergence of the stochastic subgradient method)

Example: Piecewise Linear Minimization
piecewise linear minimization with m = 100, n = 20:
minimize max_{i=1,...,m} (a_iᵀx + b_i)
(figure: empirical distribution of f_best^(k) − f*)