Convergence rate of SGD

Convergence rate of SGD

Theorem (see Nemirovski et al. 2009 from the readings): Let f be a strongly convex stochastic function, and assume the gradient of f is Lipschitz continuous and bounded. Then, for step sizes $\eta_t \propto 1/t$, the expected loss decreases as $O(1/t)$:
$$\mathbb{E}\big[f(w^{(t)})\big] - f(w^*) = O(1/t)$$

Convergence rates for gradient descent/ascent versus SGD

Number of iterations to get to accuracy $\ell(w^*) - \ell(w) \le \epsilon$:
Gradient descent: if the function is strongly convex, $O(\ln(1/\epsilon))$ iterations.
Stochastic gradient descent: if the function is strongly convex, $O(1/\epsilon)$ iterations.

This seems exponentially worse, but the comparison is much more subtle. Consider the total running time, e.g., for logistic regression: each gradient descent iteration touches all n examples, while each SGD iteration touches only one, so SGD can win when we have a lot of data. And when analyzing the true error rather than the optimization error, the situation is even more subtle: the expected running time is about the same (see readings).
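To make the $1/t$ step-size schedule concrete, here is a minimal SGD sketch for L2-regularized logistic loss with $\eta_t = 1/(\lambda t)$; the function name sgd_logistic, the data arrays X and y, and the default regularization strength are assumptions made up for illustration, not part of the lecture. With the $\lambda$-strong convexity supplied by the regularizer, this schedule matches the $O(1/t)$ rate quoted above.

```python
import numpy as np

def sgd_logistic(X, y, lam=0.1, n_iters=10000, seed=0):
    """SGD for L2-regularized logistic loss with step sizes eta_t = 1/(lam * t).

    X: (n, d) feature matrix, y: (n,) labels in {-1, +1}.
    Returns the final iterate (a sketch; averaging the iterates is also common).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, n_iters + 1):
        i = rng.integers(n)                                      # sample one example
        margin = y[i] * X[i].dot(w)
        grad = -y[i] * X[i] / (1.0 + np.exp(margin)) + lam * w   # stochastic gradient
        eta_t = 1.0 / (lam * t)                                  # O(1/t) step size for strong convexity
        w -= eta_t * grad
    return w
```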

Motivating AdaGrad (Duchi, Hazan, Singer 2011)

Assuming $w \in \mathbb{R}^d$, standard stochastic (sub)gradient descent updates are of the form:
$$w_i^{(t+1)} = w_i^{(t)} - \eta\, g_{t,i}$$

Should all features share the same learning rate? We often have high-dimensional feature spaces in which many features are irrelevant, while rare features are often very informative. AdaGrad provides a feature-specific adaptive learning rate by incorporating knowledge of the geometry of past observations.

Why adapt to geometry?

[Toy examples from the Duchi et al. ISMP 2012 slides: a table contrasting a "hard" data distribution with a "nice" one, in which some features are frequent but irrelevant and others are infrequent but predictive.]

Not All Features are Created Equal

Examples:
Text data: "The most unsung birthday in American business and technological history this year may be the 50th anniversary of the Xerox 914 photocopier." (The Atlantic, July/August 2010)
High-dimensional image features.
(Images from Duchi et al. ISMP 2012 slides.)

Projected Gradient

Brief aside: consider an arbitrary feature space with $w \in \mathcal{W}$. If $\mathcal{W}$ is convex, we can use projected gradient for (sub)gradient descent:
$$w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \big\| w - \big(w^{(t)} - \eta\, g_t\big) \big\|_2^2$$
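As a concrete instance of the projected step, here is a small sketch where the constraint set $\mathcal{W}$ is taken to be an L2 ball; the choice of constraint set, the radius, and the function names are assumptions for illustration.

```python
import numpy as np

def project_l2_ball(w, radius=1.0):
    """Euclidean projection of w onto the L2 ball of the given radius."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def projected_sgd_step(w, g, eta, radius=1.0):
    """One projected (sub)gradient step: move along -g, then project back onto W."""
    return project_l2_ball(w - eta * g, radius)
```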

Regret Minimization

How do we assess the performance of an online algorithm? The algorithm iteratively predicts $w^{(t)}$ and incurs loss $f_t(w^{(t)})$. The regret is the total incurred loss of the algorithm relative to the best fixed choice of $w$ that could have been made retrospectively:
$$R(T) = \sum_{t=1}^{T} f_t(w^{(t)}) - \inf_{w \in \mathcal{W}} \sum_{t=1}^{T} f_t(w)$$

Regret Bounds for Standard SGD

Standard projected gradient stochastic updates:
$$w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \big\| w - \big(w^{(t)} - \eta\, g_t\big) \big\|_2^2$$

Standard regret bound:
$$\sum_{t=1}^{T} f_t(w^{(t)}) - f_t(w^*) \le \frac{1}{2\eta}\big\| w^{(1)} - w^* \big\|_2^2 + \frac{\eta}{2} \sum_{t=1}^{T} \| g_t \|_2^2$$
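To make the definition of regret concrete, the toy sketch below runs online gradient descent on a stream of quadratic losses $f_t(w) = \|w - z_t\|^2$, for which the best fixed comparator in hindsight is just the mean of the $z_t$; the losses, step size, and data are invented for illustration.

```python
import numpy as np

def online_gd_regret(zs, eta=0.1):
    """Regret of online gradient descent on losses f_t(w) = ||w - z_t||^2.

    The best fixed comparator in hindsight is the mean of the z_t.
    """
    d = zs.shape[1]
    w = np.zeros(d)
    alg_loss = 0.0
    for z in zs:
        alg_loss += np.sum((w - z) ** 2)   # incur loss f_t(w^{(t)})
        w -= eta * 2 * (w - z)             # gradient step on f_t
    w_star = zs.mean(axis=0)
    best_loss = np.sum((zs - w_star) ** 2)
    return alg_loss - best_loss

# Example: regret grows sublinearly in T on random data.
rng = np.random.default_rng(0)
print(online_gd_regret(rng.normal(size=(1000, 5))))
```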

Projected Gradient using Mahalanobis

Standard projected gradient stochastic updates:
$$w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \big\| w - \big(w^{(t)} - \eta\, g_t\big) \big\|_2^2$$

What if, instead of an L2 metric for the projection, we considered the Mahalanobis norm induced by a positive definite matrix $A$?
$$w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \big\| w - \big(w^{(t)} - \eta\, A^{-1} g_t\big) \big\|_A^2$$

Mahalanobis Regret Bounds

What $A$ should we choose? The regret bound becomes:
$$\sum_{t=1}^{T} f_t(w^{(t)}) - f_t(w^*) \le \frac{1}{2\eta}\big\| w^{(1)} - w^* \big\|_A^2 + \frac{\eta}{2} \sum_{t=1}^{T} \langle g_t, A^{-1} g_t \rangle$$

What if we minimize this upper bound on the regret with respect to $A$ in hindsight?
$$\min_{A} \sum_{t=1}^{T} \langle g_t, A^{-1} g_t \rangle$$
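For intuition, here is a sketch of one unconstrained step preconditioned by the full matrix $A_t = (\sum_\tau g_\tau g_\tau^\top)^{1/2}$, computed via an eigendecomposition; the function name, the eps regularizer, and the note about per-step cost are assumptions added for illustration. The expense of this full-matrix version is exactly what motivates the diagonal approximation on the next slide.

```python
import numpy as np

def full_matrix_adagrad_step(w, g, past_grads, eta=0.1, eps=1e-8):
    """One unconstrained step preconditioned by A_t = (sum_tau g_tau g_tau^T)^{1/2}.

    past_grads: list of all gradients so far (including g). This full-matrix
    variant costs roughly O(d^3) per step, which is why the diagonal
    approximation is used in practice.
    """
    Gsum = sum(np.outer(gi, gi) for gi in past_grads)
    vals, vecs = np.linalg.eigh(Gsum)                              # Gsum is symmetric PSD
    A = vecs @ np.diag(np.sqrt(np.maximum(vals, 0.0))) @ vecs.T    # matrix square root
    A_inv = np.linalg.pinv(A + eps * np.eye(len(w)))               # regularized inverse
    return w - eta * A_inv @ g
```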

Mahalanobis Regret Minimization

Objective:
$$\min_{A} \sum_{t=1}^{T} \langle g_t, A^{-1} g_t \rangle \quad \text{subject to } A \succeq 0,\ \mathrm{tr}(A) \le C$$

Solution:
$$A = c \left( \sum_{t=1}^{T} g_t g_t^\top \right)^{1/2}$$

For the proof, see the lemma in Appendix E of Duchi et al. 2011; it uses the trace trick and a Lagrangian. $A$ defines the norm of the metric space we should be operating in.

AdaGrad Algorithm

$$w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \big\| w - \big(w^{(t)} - \eta\, A^{-1} g_t\big) \big\|_A^2$$

At time t, estimate the optimal (sub)gradient modification $A$ by
$$A_t = \left( \sum_{\tau=1}^{t} g_\tau g_\tau^\top \right)^{1/2}$$

For large d, $A_t$ is computationally intensive to compute, so we use $\mathrm{diag}(A_t)$ instead. Then the algorithm is a simple modification of the normal updates:
$$w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \big\| w - \big(w^{(t)} - \eta\, \mathrm{diag}(A_t)^{-1} g_t\big) \big\|_{\mathrm{diag}(A_t)}^2$$
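Here is a minimal sketch of the resulting diagonal-AdaGrad update in the unconstrained case ($\mathcal{W} = \mathbb{R}^d$), which is the per-coordinate form derived on the next slide; grad_fn, the default η, and the small eps added for numerical stability are assumptions for illustration (the eps is an implementation convenience, not part of the analysis above).

```python
import numpy as np

def adagrad(grad_fn, w0, eta=0.1, n_iters=1000, eps=1e-8):
    """Diagonal AdaGrad: per-coordinate steps eta / sqrt(sum of squared past gradients).

    grad_fn(w, t) should return a (sub)gradient g_t at the current iterate w.
    """
    w = w0.astype(float).copy()
    G = np.zeros_like(w)                      # running per-coordinate sum of squared gradients
    for t in range(n_iters):
        g = grad_fn(w, t)
        G += g ** 2                           # diag(A_t)^2 accumulates g_{tau,i}^2
        w -= eta * g / (np.sqrt(G) + eps)     # per-coordinate adaptive step
    return w
```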

AdaGrad in Euclidean Space

For $\mathcal{W} = \mathbb{R}^d$, the update for each feature dimension is
$$w_i^{(t+1)} = w_i^{(t)} - \eta_{t,i}\, g_{t,i}, \quad \text{where} \quad \eta_{t,i} = \frac{\eta}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}}$$

That is,
$$w_i^{(t+1)} = w_i^{(t)} - \frac{\eta}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}}\, g_{t,i}$$

Each feature dimension has its own learning rate! It adapts with t and takes the geometry of the past observations into account. The primary role of η is determining the rate the first time a feature is encountered.

AdaGrad Theoretical Guarantees

AdaGrad regret bound:
$$\sum_{t=1}^{T} f_t(w^{(t)}) - f_t(w^*) \le 2 R_\infty \sum_{j=1}^{d} \big\| g_{1:T,j} \big\|_2, \quad \text{where} \quad R_\infty := \max_t \big\| w^{(t)} - w^* \big\|_\infty$$

So, what does this mean in practice? There are many cool examples, and this really is used in practice. Let's just examine one.
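The per-feature effect is easy to see numerically. In the toy sketch below (the two-feature gradient stream is invented for illustration), a coordinate whose gradient is frequently nonzero accumulates a large sum of squares and therefore gets a small effective step size, while a rarely active coordinate keeps a large one.

```python
import numpy as np

eta = 1.0
G = np.zeros(2)                 # accumulated squared gradients for 2 features
rng = np.random.default_rng(0)
for t in range(1000):
    g = np.array([
        rng.choice([-1.0, 1.0]),                                   # feature 0: fires every round
        rng.choice([-1.0, 1.0]) if rng.random() < 0.01 else 0.0,   # feature 1: fires ~1% of rounds
    ])
    G += g ** 2

# Effective step sizes: small for the frequent feature, large for the rare one.
print(eta / np.sqrt(G + 1e-12))
```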

AdaGrad Theoretical Example

Expect AdaGrad to out-perform standard SGD when the gradient vectors are sparse. SVM hinge loss example (a small subgradient sketch for this loss appears after the figure note below):
$$f_t(w) = \big[\, 1 - y_t \langle x_t, w \rangle \,\big]_+ \quad \text{where } x_t \in \{-1, 0, 1\}^d$$
If $x_{t,j} \neq 0$ with probability proportional to $1/j^{\alpha}$ for some $\alpha > 1$, then
$$\mathbb{E}\!\left[ f\!\left( \tfrac{1}{T} \textstyle\sum_{t} w^{(t)} \right) \right] - f(w^*) = O\!\left( \frac{\|w^*\|_\infty \max\{\log d,\ d^{1-\alpha/2}\}}{\sqrt{T}} \right)$$
Previously best known method:
$$\mathbb{E}\!\left[ f\!\left( \tfrac{1}{T} \textstyle\sum_{t} w^{(t)} \right) \right] - f(w^*) = O\!\left( \frac{\|w^*\|_\infty \sqrt{d}}{\sqrt{T}} \right)$$

Neural Network Learning

Wildly non-convex problem:
$$f(x; \theta) = \log\!\Big( 1 + \exp\big( \langle [\,p(\langle x_1, \theta_1 \rangle), \ldots, p(\langle x_k, \theta_k \rangle)\,], \theta_0 \rangle \big) \Big), \quad \text{where } p(\alpha) = \frac{1}{1 + \exp(-\alpha)}$$
Very non-convex, but we use SGD methods anyway.

[Figure from Dean et al. 2012, via the Duchi et al. ISMP 2012 slides: average frame accuracy (%) on the test set versus training time in hours, comparing GPU SGD, Downpour SGD, Downpour SGD with AdaGrad, and Sandblaster L-BFGS. The model is distributed with d = 1.7 x 10^9 parameters; SGD and AdaGrad use 80 machines (1,000 cores), while L-BFGS uses 800 machines (10,000 cores).]
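Returning to the sparse hinge-loss example above, here is a minimal sketch of a subgradient of $f_t(w) = [1 - y_t \langle x_t, w \rangle]_+$; the function name is an assumption for illustration. When $x_t$ is sparse, the subgradient is sparse too, which is the regime where AdaGrad's bound improves on the dimension-dependent one.

```python
import numpy as np

def hinge_subgradient(w, x, y):
    """A subgradient of f(w) = max(0, 1 - y * <x, w>).

    If x is mostly zeros, the returned subgradient is too, so only the
    coordinates that actually fire accumulate squared gradient in AdaGrad.
    """
    return -y * x if y * x.dot(w) < 1.0 else np.zeros_like(w)
```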

What you should know about Logistic Regression (LR) and Click Prediction

Click prediction problem: estimating the probability of a click, which can be modeled as logistic regression.
Logistic regression model: a linear model; gradient ascent to optimize the conditional likelihood; overfitting and regularization; regularized optimization; convergence rates and stopping criteria.
Stochastic gradient ascent for large/streaming data; convergence rates of SGD.
AdaGrad motivation, derivation, and algorithm.