ALG-ML SEMINAR
LiSSA: Linear Time Second-Order Stochastic Algorithm
February 23, 2016
Lecturers: Naman Agarwal and Brian Bullins
Scribe: Kiran Vodrahalli

Contents

1 Introduction
  1.1 History of Optimization
  1.2 The Problem
2 Estimator of the Inverse
3 The Algorithm
  3.1 Running Time
4 An Alternative Interpretation
5 Experimental Results
6 Comments

1 Introduction

This is recent work with Elad Hazan. The most basic problem in machine learning is the following: you have a distribution $\mathcal{D}$ over pairs $\{(x, y)\}$ and a hypothesis class $\mathcal{H}$, and you would like to find the optimal $h \in \mathcal{H}$ via $\operatorname{argmin}_{h \in \mathcal{H}} \mathbb{E}_{\mathcal{D}}[\ell(y, h(x))]$ for some loss $\ell$.

Hazan: Sometimes you get very lucky, and this kind of combination of things may be a breakthrough in optimization.

So you can use ERM, which is basically finding the hypothesis in $\mathcal{H}$ that minimizes the error on the sample set. However, ERM tends to overfit the data, so people typically add regularization to avoid overfitting, and you sometimes restrict $\mathcal{H}$ to linear functions, etc. We can reinterpret the problem as follows:

$$\min_{\theta \in \mathbb{R}^d} f(\theta) = \min_{\theta \in \mathbb{R}^d} \left\{ \frac{1}{m} \sum_{k=1}^{m} f_k(\theta) + R(\theta) \right\}.$$

In SVMs and logistic regression, $f_k$ is convex, and $R(\theta)$ could be $L_2$ or $L_1$. We will assume an $L_2$ regularizer and convex $f_k$ for the rest of the talk. For convenience, redefine $f_k(\theta) := f_k(\theta) + \lambda \|\theta\|_2^2$ (basically, we need the regularizer to be strongly convex).
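
To make the setup concrete, here is a minimal sketch in Python/NumPy of the regularized finite-sum objective and its gradient, assuming a logistic loss for $f_k$ with labels $y_k \in \{-1, +1\}$ (the loss choice and the function names are illustrative assumptions, not specified in the talk):

```python
import numpy as np

def regularized_objective(theta, X, y, lam):
    """Regularized ERM objective: (1/m) sum_k loss_k(theta) + lam * ||theta||_2^2.
    Illustrative choice: loss_k is the logistic loss with labels y_k in {-1, +1}."""
    margins = y * (X @ theta)
    return np.log1p(np.exp(-margins)).mean() + lam * theta @ theta

def gradient(theta, X, y, lam):
    """Full gradient of the objective above."""
    m = X.shape[0]
    margins = y * (X @ theta)
    coeffs = -y / (1.0 + np.exp(margins))   # d(loss)/d(margin) for each sample
    return (X.T @ coeffs) / m + 2.0 * lam * theta
```

Each $f_k$ here is convex, and the $\lambda \|\theta\|_2^2$ term makes the objective strongly convex, matching the assumptions above.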

1.1 History of Optimization

1. First-order methods: This is basically gradient descent, which is preferred in practice because gradients are cheap to compute: $O(md)$ per step, where $m$ is the number of samples and $d$ the dimension. You can also make it stochastic by subsampling the gradient, which is much faster (and used in practice), but theoretically it has worse guarantees. This was the starting point of first-order methods, but there are ways of obtaining faster convergence while still maintaining linear-time properties. Nesterov introduced acceleration (using two or three gradients to get faster convergence). There is also adaptive regularization (AdaGrad: Duchi, Hazan, Singer). There is also the idea of variance reduction: mini-batches, and SVRG (GD + SGD, mixing stochastic and full-gradient steps). There is also dual coordinate descent, but it is really the same line of work as SVRG.

2. Second-order methods: Here we want to use $\nabla^2$, the Hessian. Newton's method is the canonical example; however, Hessian computations are expensive. It has fast convergence (quadratic convergence near the optimum), but the per-step time is $O(md^2 + d^3)$, which is too expensive in practice; the $d^3$ is the Hessian inversion step. More advanced second-order methods involve subsampling the Hessian to get an unbiased estimator at any point; you can extend this to mini-batches of samples to get a better estimate (the NewSamp paper from NIPS 2015). There are some issues with it, though: the solution is not satisfying, since the estimator is no longer unbiased ($\mathbb{E}[\tilde{X}^{-1}] \neq \nabla^{-2} f$), and you still need to invert the matrix you found with your new batch. They try to fix this with a low-rank projection, but finding even one eigenvector is still $O(d^2)$ time, and you lose information.

1.2 The Problem

So the question is: can we do Newton's method in linear time while obtaining similar guarantees? We present LiSSA, a subsampled Newton's method which is linear time for usual ML applications and enjoys linear convergence to the optimum. Gradient descent also gives linear convergence for strongly convex functions, but this result is still better (we will see why). There is a very simple way to perform fast multiplication of the inverse with the gradient; we propose a novel unbiased estimator for this. Note that our notation for the Hessian inverse is $\nabla^{-2}$. First recall the Taylor series expansion for a PSD matrix $A$ with $\|A\| \leq 1$:

$$A^{-1} = \sum_{i=0}^{\infty} (I - A)^i.$$

2 Estimator of the Inverse

Pick a probability distribution $\{p_i\}$ over the non-negative integers and select some $\hat{i} \sim p_i$. Then select $X_1, \dots, X_{\hat{i}}$, independent samples of the Hessian, and define

$$\widetilde{\nabla^{-2}} f(x) = \frac{1}{p_{\hat{i}}} \prod_{j=1}^{\hat{i}} (I - X_j).$$

Basically, we pick how many terms of the power series to use (where to truncate; this is $\hat{i}$). Then you sample $\hat{i}$ Hessians, but the distribution is over the terms of the series. Then we clearly see

$$\mathbb{E}\left[\widetilde{\nabla^{-2}} f(x)\right] = \sum_{i=0}^{\infty} p_i \cdot \frac{1}{p_i} \prod_{j=1}^{i} \mathbb{E}[I - X_j] = \sum_{i=0}^{\infty} (I - \nabla^2 f)^i,$$

which is exactly the series for $\nabla^{-2} f$. Now, you can pick many distributions over the $i$. It turns out that the variance-minimizing choice over samples is the uniform distribution $p_i = 1/S$. This was our first attempt: it picks out one term of the series at random.
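
Here is a minimal sketch of this single-term estimator (Python/NumPy; `sample_hessian` is a placeholder oracle returning one unbiased $d \times d$ Hessian sample, and with a finite support for `p` the estimator is unbiased for the series truncated at `len(p) - 1` terms):

```python
import numpy as np

def single_term_inverse_estimator(sample_hessian, p, d, rng):
    """Sample i_hat ~ p, draw i_hat independent Hessian samples X_1..X_{i_hat},
    and return (1 / p[i_hat]) * prod_j (I - X_j): in expectation this equals
    sum_i (I - Hessian)^i over the support of p, approximating the inverse."""
    i_hat = rng.choice(len(p), p=p)          # which term of the power series to pick
    estimate = np.eye(d)
    for _ in range(i_hat):
        estimate = estimate @ (np.eye(d) - sample_hessian())
    return estimate / p[i_hat]               # reweight to keep the estimator unbiased
```

With the uniform choice $p_i = 1/S$ this matches the variance-minimizing distribution mentioned above; note that producing one randomly chosen term already costs $\hat{i}$ Hessian samples.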

But what if you want to use all the terms up to some index? If you want to pick a single term up to $i$, you need $i$ samples; if you want all terms up to $i$, you need $i^2$ samples. So we can instead represent the Taylor series recursively as

$$A^{-1} = I + (I - A)\bigl(I + (I - A)(\cdots)\bigr).$$

You could potentially use a Chebyshev polynomial to write it in a shorter form, and then potentially accelerate things, but the Chebyshev form will not be noise-stable. A caveat is that you cannot take a full estimate of the gradient; taking the average of many gradient-descent-style estimates does work, and this seems to be important: you are basically averaging at every step.

Then set $X_0 = I$ and define

$$X_i = I + (I - \nabla^2 f)\, X_{i-1}.$$

By this recursive definition, $\mathbb{E}[X_i] \to \nabla^{-2} f$ as $i$ grows, and for the same cost we get to use more terms of the series in our estimator. We will use this estimator for the rest of the talk.

3 The Algorithm

Now let's just run Newton's method with the estimator we constructed (we will later explain why this is linear time). That is,

$$x_{t+1} = x_t - \widetilde{\nabla^{-2}} f(x_t)\, \nabla f(x_t).$$

The meat of the algorithm is in a concentration step. We step over time $t = 1, \dots, T$. At each step we take $S_1$ samples of our estimator (concentration: making many samples of the same estimator), and for each of these samples we sample $S_2$ copies of the Hessian (constructing the estimator). A critical aspect is that we use several unbiased estimators of the product of the inverse with the gradient, and then average them. To see that this is a linear-time operation, note that in our recursion the matrix-inverse-times-vector product is already built in. For typical machine learning applications, each sampled Hessian is a rank-1 matrix, which allows the product to be calculated in linear time. We want to take $S_2$ large enough that the approximation error $\epsilon$ is small enough; we only need $S_2$ samples to take a degree-$S_2$ approximation of our Taylor series representation of the inverse.

3.1 Running Time

Note that each sampled Hessian is of the form $\alpha I + \nabla^2 f_i$, where the $\alpha I$ is an artifact of the regularizer (the condition number needs to not be too large). Commonly $\nabla^2 f_i = c\, x_i x_i^T$, where $x_i$ is the $i$-th training sample. Recall that we restricted our hypothesis class to regularized losses that are basically rank-1: ridge regression, logistic regression, etc. The Hessian is a rank-1 matrix (plus the regularization term) in all of these cases. Therefore each update step of the form $(I - \nabla^2 f_i)\, v$, applied to the vector carrying $\nabla f$ through the recursion, can be done in linear time via the rank-1 structure. It is actually $O(\text{sparsity})$, since we are just doing vector-vector products! So even if you are not rank-1 but you are sparse, you are set. $O(d)$ is the worst case. (Note that if you remove the rank-1 condition, you are at $O(d^2)$ worst case, since that is the time it takes to multiply a vector by a matrix, which is the form the estimator is cast in. This is still an improvement upon most other methods out there, which have a factor of at least $O(d^3)$ due to the requirement of matrix inversion.)
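
To make the linear-time step concrete, here is a minimal sketch of one LiSSA-style update under the rank-1 structure just described (Python/NumPy; the per-sample curvature coefficients `curvature[k]`, i.e. the $c$ in $c\, x_k x_k^T + \alpha I$, and the argument names are illustrative assumptions, not the authors' reference implementation):

```python
import numpy as np

def lissa_step(theta, grad, X, curvature, alpha, S1, S2, rng):
    """One LiSSA-style Newton step: theta minus an estimate of Hessian^{-1} grad.

    Assumes each sampled Hessian has the form curvature[k] * x_k x_k^T + alpha * I,
    so every Hessian-vector product costs O(d) (or O(sparsity of x_k)).
    S1: number of independent estimators to average (the concentration step).
    S2: depth of the recursion, i.e. where the Taylor series is truncated.
    """
    m, d = X.shape
    acc = np.zeros(d)
    for _ in range(S1):
        v = grad.copy()                          # v_0 = grad
        for _ in range(S2):
            k = rng.integers(m)                  # one sampled Hessian per recursion level
            Hv = curvature[k] * (X[k] @ v) * X[k] + alpha * v
            v = grad + v - Hv                    # v_j = grad + (I - H_k) v_{j-1}
        acc += v
    return theta - acc / S1                      # average the S1 independent estimators
```

This is just the recursion $X_i = I + (I - \nabla^2 f)\, X_{i-1}$ applied directly to the gradient vector, so no $d \times d$ matrix is ever formed.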

We make some extra assumptions:

1. $\|\nabla^2 f_i\| \leq 1$;
2. $\lambda_{\min}(\nabla^2 f_i) \geq \frac{1}{\kappa_{\max}}$, where $\kappa$ is the condition number. (Maybe this issue can be averaged out.)

Now we come to the main theorem.

Theorem 3.1. Set $S_1 = \tilde{O}(\kappa_{\max}^2)$ and $S_2 = \tilde{O}(\kappa)$ (the truncation point of the Taylor series). Then, after a sufficient number of gradient descent steps, we have the following per-step guarantee for LiSSA, with probability $1 - \delta$:

$$\|x_{t+1} - x^*\|_2 \leq \frac{1}{2}\, \|x_t - x^*\|_2.$$

We have done some work which can reduce the $\kappa_{\max}^2$. Note that this bound holds irrespective of whether or not the Hessian is rank-1; we are making use of second-order information.

Corollary 3.2. For $t > T_1$, LiSSA returns a point $x_t$ such that, with probability at least $1 - \delta$,

$$f(x_t) \leq \min_{x^*} f(x^*) + \epsilon$$

in total time $\log(1/\epsilon)\left(O(md) + \tilde{O}(\kappa_{\max}^2 \kappa d)\right)$.

The number of gradient calls is independent of the condition number; it is $\log(1/\epsilon)$. Note that this is an impossible guarantee to get for first-order methods! The runtime for LiSSA is $\tilde{O}\left((m + \kappa \kappa_{\max}^2)\, d \log(1/\epsilon)\right)$, which is much better than gradient descent (roughly $m\kappa$), accelerated gradient descent (roughly $m\sqrt{\kappa}$), and SVRG ($O(m + \kappa)$), which behaves erratically near the optimum. Ours is not merely an in-expectation result: whatever you can get in expectation, you can get with high probability.

Remark 3.3. The number of iterations here is independent of the condition number; it only comes into the sampling, which is already very pessimistic!

4 An Alternative Interpretation

We can think of Newton's method as minimizing a local quadratic approximation. If we apply gradient descent (with unit step size) to the quadratic function $\frac{1}{2} x^T (\nabla^2 f)\, x + \nabla f^T x + c$, we actually get

$$x_{t+1} = (I - \nabla^2 f)\, x_t - \nabla f,$$

which is the exact same recursive formulation as our estimator. You could then replace gradient descent with stochastic gradient descent and the Hessian with an estimator of the Hessian, so you can see that our estimator is a special case of performing a specific type of (full) gradient descent on this quadratic. But you cannot get the same bounds with stochastic gradient descent via this analysis, since you need to do the averaging for the analysis; in the analysis, we just use spectral-norm concentration bounds. If you use SGD, you will not get a logarithmic convergence guarantee in the sample-size bound. Lower bounds hold for the scenario where you use only stochastic information about the gradient, which would not allow this to work.
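
As a quick numerical check of this interpretation (an illustrative sketch, not from the paper; $H$ is a random symmetric matrix with hand-picked eigenvalues so that $\|H\| < 1$): unit-step gradient descent on the quadratic converges to the Newton step $-H^{-1} g$, while the estimator recursion from Sections 2 and 3 converges to $+H^{-1} g$, the same object up to the sign convention.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))     # random orthonormal basis
H = Q @ np.diag(np.linspace(0.2, 0.9, d)) @ Q.T      # PSD "Hessian" with ||H|| < 1
g = rng.standard_normal(d)                           # stands in for the gradient

x = np.zeros(d)   # unit-step gradient descent on (1/2) x^T H x + g^T x
v = np.zeros(d)   # the recursion v_j = g + (I - H) v_{j-1}
for _ in range(200):
    x = x - (H @ x + g)          # i.e. x <- (I - H) x - g
    v = g + v - H @ v

print(np.allclose(x, -np.linalg.solve(H, g)))   # True: the fixed point is -H^{-1} g
print(np.allclose(v, np.linalg.solve(H, g)))    # True: the recursion recovers H^{-1} g
```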

5 Experimental Results

Not surprisingly, Newton's method has its nice quadratic behavior (it converges quadratically in terms of iterations), while LiSSA converges linearly in iterations. In time, however, LiSSA is much better, since its per-iteration complexity is much smaller. We tried out an SVM on MNIST for our testing. NewSamp (NIPS 2015; their theorems are weaker, they still invert matrices (on a low-rank projection, which loses information), and they only have expectation bounds instead of high-probability bounds) is worse in both cases and is much slower (you had to project down to a low number of dimensions even to get the slower algorithm!). In practice, we take $S_1$ to be very small (recall that this is the number of estimators we draw for our spectral concentration bounds); typically $1 \leq S_1 \leq 10$.

6 Comments

The importance of this result depends to some extent on how much you care about using second-order methods. The work demonstrates that LiSSA has better guarantees and empirical performance than other second-order methods; however, second-order methods take the same amount of time as stochastic gradient descent in practice when you only care about getting within 1/10 or 1/100 of your optimal value. The real gains show when you are aiming for accuracy on the order of $10^{-4}$ or better; then LiSSA converges much more quickly in time, not just in iterations. In many machine learning applications, though, you may not care about such accurate convergence, since in practice you are training on a regularized objective in the first place, which is only a proxy for the true objective.

The other important aspect of this result is that it makes a fully second-order method feasible for applications which were previously too costly. There are problems with using Newton's method on non-convex functions, since Newton's method tends toward saddle points, so the Hessian information needs to be handled carefully before such second-order methods can be used for something like training neural networks. It is unclear whether there would be a benefit, as stochastic gradient descent seems to work well enough and is very quick. However, it is possible that advances here will lead to methods for escaping saddle points. Currently, Rong Ge has authored the only theoretical work which tackles the problem of escaping saddle points. There may be more connections to be made here.