Tutorial: PART 2. Optimization for Machine Learning. Elad Hazan Princeton University. + help from Sanjeev Arora & Yoram Singer


Agenda 1. Learning as mathematical optimization: stochastic optimization, ERM, online regret minimization; online gradient descent 2. Regularization: AdaGrad and optimal regularization 3. Gradient Descent++: Frank-Wolfe, acceleration, variance reduction, second-order methods, non-convex optimization

Accelerating gradient descent? 1. Adaptive regularization (AdaGrad): works for non-smooth & non-convex 2. Variance reduction: uses special ERM structure, very effective for smooth & convex 3. Acceleration/momentum: smooth convex only, general-purpose optimization since the 1980s

Condition number of convex functions, defined as γ = β/α, where (simplified) 0 ≼ αI ≼ ∇²f(x) ≼ βI; α = strong convexity, β = smoothness. Non-convex smooth functions (simplified): −βI ≼ ∇²f(x) ≼ βI. Why do we care? Well-conditioned functions exhibit much faster optimization! (but equivalent via reductions)

Examples

Smooth gradient descent. The descent lemma, for β-smooth functions (algorithm: x_{t+1} = x_t − η∇_t):
f(x_{t+1}) − f(x_t) ≤ ∇_t^⊤(x_{t+1} − x_t) + β‖x_{t+1} − x_t‖² = (−η + βη²)‖∇_t‖² = −(1/4β)‖∇_t‖²
Thus, for M-bounded functions (|f(x_t)| ≤ M):
−2M ≤ f(x_{T+1}) − f(x_1) = Σ_t [f(x_{t+1}) − f(x_t)] ≤ −(1/4β) Σ_t ‖∇_t‖²
Thus, there exists a t for which ‖∇_t‖² ≤ 8Mβ/T.
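
To make the update concrete, here is a minimal sketch of gradient descent with the step size η = 1/(2β) used in the lemma, run on a toy quadratic; the objective, β, and T are illustrative choices, not part of the tutorial.

```python
import numpy as np

# Toy beta-smooth objective f(x) = 0.5 x^T A x; beta = largest eigenvalue of A.
A = np.diag([1.0, 4.0])
beta = 4.0

def grad(x):
    return A @ x

x = np.array([3.0, -2.0])
eta = 1.0 / (2 * beta)                     # step size from the descent lemma above
T = 200
min_grad_sq = np.inf
for t in range(T):
    g = grad(x)
    min_grad_sq = min(min_grad_sq, g @ g)  # the lemma bounds min_t ||grad_t||^2
    x = x - eta * g

print(min_grad_sq)                         # decays like O(M * beta / T)
```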

Smooth gradient descent. Conclusions: for x_{t+1} = x_t − η∇_t and T = Ω(1/ε), we find a t with ‖∇_t‖² ≤ ε. 1. Holds even for non-convex functions. 2. For convex functions this implies f(x_t) − f(x*) ≤ O(ε) (faster for smooth!)

Non-convex stochastic gradient descent. The descent lemma, β-smooth functions (algorithm: x_{t+1} = x_t − η∇̂_t, with E[∇̂_t] = ∇_t):
E[f(x_{t+1}) − f(x_t)] ≤ E[∇_t^⊤(x_{t+1} − x_t) + β‖x_{t+1} − x_t‖²]
= E[−η ∇_t^⊤∇̂_t + βη²‖∇̂_t‖²]
= −η‖∇_t‖² + η²β E‖∇̂_t‖²
= −η‖∇_t‖² + η²β(‖∇_t‖² + var(∇̂_t))
Thus, for M-bounded functions (|f(x_t)| ≤ M): T = O(Mβ/ε²) iterations suffice to find a t with ‖∇_t‖² ≤ ε.
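
A matching sketch of the stochastic update x_{t+1} = x_t − η∇̂_t on a toy finite-sum least-squares problem, where the single-example gradient is the source of the variance term above; the data, step size, and T are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 500, 10
A = rng.standard_normal((m, d))
b = rng.standard_normal(m)

def full_grad(x):                      # gradient of f(x) = (1/2m) ||Ax - b||^2
    return A.T @ (A @ x - b) / m

def stoch_grad(x, i):                  # unbiased single-example gradient (higher variance)
    return A[i] * (A[i] @ x - b[i])

x = np.zeros(d)
eta, T = 0.005, 5000
min_grad_sq = np.inf
for _ in range(T):
    i = rng.integers(m)
    x = x - eta * stoch_grad(x, i)
    g = full_grad(x)                   # only for monitoring min_t ||grad_t||^2
    min_grad_sq = min(min_grad_sq, g @ g)

print(min_grad_sq)
```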

Controlling the variance: interpolating GD and SGD. Model: both full and stochastic gradients available. The estimator combines both into a lower-variance random variable:
x_{t+1} = x_t − η (∇̂f(x_t) − ∇̂f(x_0) + ∇f(x_0))
Every so often, compute a full gradient and restart at a new snapshot x_0.
Theorem [Schmidt, Le Roux, Bach 12; Johnson and Zhang 13; Mahdavi, Zhang, Jin 13]: variance reduction for well-conditioned functions, 0 ≼ αI ≼ ∇²f(x) ≼ βI, γ = β/α, produces an ε-approximate solution in time O((m + γ) d log(1/ε)). (γ should be interpreted as 1/ε.)
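
A minimal SVRG-style sketch of this interpolation on a ridge-regularized least-squares ERM; the data, step size, and epoch length are illustrative choices rather than the tuned constants of the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, lam = 200, 10, 0.1
A = rng.standard_normal((m, d))
b = rng.standard_normal(m)

def full_grad(x):                     # gradient of (1/2m)||Ax-b||^2 + (lam/2)||x||^2
    return A.T @ (A @ x - b) / m + lam * x

def stoch_grad(x, i):                 # gradient of the i-th example's loss
    return A[i] * (A[i] @ x - b[i]) + lam * x

x = np.zeros(d)
eta, epochs = 0.01, 20
for _ in range(epochs):
    x0 = x.copy()
    g0 = full_grad(x0)                # occasional full gradient at the snapshot x0
    for _ in range(m):
        i = rng.integers(m)
        # low-variance update: stochastic gradient corrected by the snapshot
        x = x - eta * (stoch_grad(x, i) - stoch_grad(x0, i) + g0)

print(np.linalg.norm(full_grad(x)))   # should be small after a few epochs
```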

Acceleration/momentum [Nesterov 83]: optimal gradient complexity (smooth, convex); modest practical improvements; non-convex momentum methods. With variance reduction, the fastest possible running time of first-order methods: O((m + √(γm)) d log(1/ε)). [Woodworth, Srebro 15]: tight lower bound with a gradient oracle.
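
For concreteness, a compact sketch of Nesterov-style momentum (written in the FISTA form of the extrapolation) on a smooth convex quadratic; the objective and iteration count are illustrative, and this is plain acceleration, not the accelerated variance-reduced methods cited above.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 30
Q = rng.standard_normal((d, d))
Q = Q @ Q.T / d + 0.1 * np.eye(d)                 # well-conditioned PSD quadratic
c = rng.standard_normal(d)
beta = np.linalg.eigvalsh(Q).max()                # smoothness constant

def grad(x):
    return Q @ x + c

x_prev = np.zeros(d)
y = x_prev.copy()
t = 1.0
for _ in range(300):
    x = y - grad(y) / beta                        # gradient step at the extrapolated point
    t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2
    y = x + ((t - 1) / t_next) * (x - x_prev)     # momentum extrapolation
    x_prev, t = x, t_next

print(np.linalg.norm(grad(x_prev)))
```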

Experiments with convex losses: can we improve upon Gradient Descent++?

Next few slides: Move from first order (GD++) to second order

Higher-order optimization. Gradient descent: direction of steepest descent. Second-order methods: use local curvature.

Newton's method (+ trust region): x_{t+1} = x_t − η [∇²f(x)]⁻¹ ∇f(x). For a non-convex function this step can move in the wrong direction (e.g., toward a saddle point or maximum). Solution: solve a quadratic approximation in a local area (trust region).

Newton's method (+ trust region): x_{t+1} = x_t − η [∇²f(x)]⁻¹ ∇f(x). d³ time per iteration! Infeasible for ML!! ...till recently.
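
A bare sketch of the Newton update, where the per-iteration cost is dominated by forming and solving the d × d system, the O(d³) bottleneck the slide refers to; the toy quadratic objective is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
Q = rng.standard_normal((d, d))
Q = Q @ Q.T + np.eye(d)                        # strongly convex quadratic
c = rng.standard_normal(d)

def grad(x):
    return Q @ x + c

def hess(x):
    return Q                                   # constant Hessian for this toy objective

x = np.zeros(d)
for _ in range(3):
    # the O(d^3) step: form the Hessian and solve a d x d linear system
    x = x - np.linalg.solve(hess(x), grad(x))

print(np.linalg.norm(grad(x)))                 # a quadratic is solved exactly in one step
```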

Speed up the Newton direction computation?? Spielman-Teng 04: diagonally dominant systems of equations solved in linear time! (2015 Gödel prize; used by Daitch-Spielman for faster flow algorithms.) Erdogdu-Montanari 15: low-rank approximation & inversion by Sherman-Morrison; allows stochastic information. Still prohibitive: rank × d² time.

Stochastic Newton? x_{t+1} = x_t − η [∇²f(x)]⁻¹ ∇f(x). ERM, rank-1 loss: arg min_x E_{i∼[m]} [ℓ(x^⊤ a_i, b_i) + (1/2)‖x‖²]. Unbiased estimator of the Hessian: ∇̃² = a_i a_i^⊤ ℓ''(x^⊤ a_i, b_i) + I, with i ∼ U[1, ..., m]. Clearly E[∇̃²] = ∇²f, but E[(∇̃²)⁻¹] ≠ (∇²f)⁻¹.
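
A quick numerical check of the last point: the sampled rank-one-plus-identity Hessian is unbiased, but averaging its inverses does not give the inverse Hessian. The Gaussian data and the identity regularizer are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 2000, 5
A = rng.standard_normal((m, d))
H = A.T @ A / m + np.eye(d)                    # true Hessian of a ridge-type ERM

samples = [np.outer(a, a) + np.eye(d) for a in A]
mean_H = np.mean(samples, axis=0)
mean_H_inv = np.mean([np.linalg.inv(S) for S in samples], axis=0)

print(np.linalg.norm(mean_H - H))                        # ~ 0: the estimator is unbiased
print(np.linalg.norm(mean_H_inv - np.linalg.inv(H)))     # clearly nonzero: inversion breaks unbiasedness
```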

Circumvent Hessian creation and inversion! 3 steps:
(1) Represent the Hessian inverse as an infinite series: ∇⁻² = Σ_{i=0}^{∞} (I − ∇²)^i, for any distribution over the naturals i ∈ ℕ.
(2) Sample from the infinite series (Hessian-gradient product), ONCE:
∇⁻²f ∇f = Σ_i (I − ∇²f)^i ∇f = E_i [(I − ∇²f)^i ∇f · 1/Pr[i]]
(3) Estimate the Hessian-power by sampling i.i.d. data examples:
= E_{i, k∼[i]} [ Π_{k=1}^{i} (I − ∇²f_k) ∇f · 1/Pr[i] ]
Single data example per factor; vector-vector products only.
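
A rough sketch of steps (1)-(3) in the recursive form v_j = ∇f + (I − ∇²f_i) v_{j−1}, whose expectation is the truncated Neumann series applied to ∇f. The ridge least-squares objective, the rescaling that enforces ∇²f_i ≼ I, the depth, and the number of averaged chains are illustrative choices, not the cited algorithm's parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, lam = 500, 20, 1.0
A = rng.standard_normal((m, d))
b = rng.standard_normal(m)

# Rescale the objective so every per-example Hessian a_i a_i^T + lam*I satisfies H_i <= I;
# the Newton direction [H]^{-1} g is unchanged by this scaling.
scale = 1.0 / (np.max(np.sum(A * A, axis=1)) + lam)

def full_grad(x):
    return scale * (A.T @ (A @ x - b) / m + lam * x)

def hess_vec(i, v):                       # per-example Hessian-vector product, scaled
    return scale * (A[i] * (A[i] @ v) + lam * v)

def newton_dir_estimate(x, depth=200, chains=20):
    g = full_grad(x)
    est = np.zeros(d)
    for _ in range(chains):               # average a few chains to reduce variance
        v = g.copy()                      # v_0 = grad
        for _ in range(depth):
            i = rng.integers(m)
            v = g + v - hess_vec(i, v)    # v_j = grad + (I - H_i) v_{j-1}
        est += v / chains
    return est                            # E[v_depth] -> [Hessian]^{-1} grad

x = rng.standard_normal(d)
est = newton_dir_estimate(x)
H = scale * (A.T @ A / m + lam * np.eye(d))
# rough agreement with the exact Newton direction, up to sampling and truncation error
print(np.linalg.norm(est - np.linalg.solve(H, full_grad(x))))
```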

Linear-time Second-order Stochastic Algorithm (LiSSA). Use the estimator ∇̃⁻²f defined previously; compute a full (large-batch) gradient ∇f; move in the direction ∇̃⁻²f ∇f. Theoretical running time to produce an ε-approximate solution for γ-well-conditioned functions (convex) [Agarwal, Bullins, Hazan 15]: O(dm log(1/ε) + √(γd) d log(1/ε)). 1. Faster than first-order methods! 2. Indications this is tight [Arjevani, Shamir 16].

What about constraints?? Next few slides: projection-free (Frank-Wolfe) methods

Recommendation systems. [Slide shows a sparse users × movies matrix; entry (i, j) = rating of user i for movie j, with most entries missing.]

Recommendation systems: complete the missing entries. [Slide shows the fully completed ratings matrix.]

Recommendation systems: get new data. [Slide shows the matrix with newly observed ratings.]

Recommendation systems: re-complete the missing entries. Assume the true matrix has low rank; convex relaxation: bounded trace norm.

Bounded trace norm matrices. Trace norm of a matrix = sum of its singular values. K = { X : X is a matrix with trace norm at most D }. Computational bottleneck: projections onto K require an eigendecomposition, O(n³) operations. But: linear optimization over K is easier, computing a top eigenvector; O(sparsity) time.
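
To illustrate the last point: linear optimization over K = {X : trace norm of X ≤ D} needs only the top singular pair of the gradient, which power iteration computes with matrix-vector products (roughly sparsity time per step). The matrix, D, and the iteration count are illustrative.

```python
import numpy as np

def top_singular_pair(G, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(G.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = G.T @ (G @ v)                 # power iteration on G^T G: only mat-vec products
        v /= np.linalg.norm(v)
    u = G @ v
    sigma = np.linalg.norm(u)
    return u / sigma, sigma, v

def linear_opt_trace_ball(G, D):
    # arg min over {||X||_tr <= D} of <G, X> is -D * u1 v1^T (top singular pair of G)
    u, _, v = top_singular_pair(G)
    return -D * np.outer(u, v)

G = np.random.default_rng(1).standard_normal((30, 40))
X = linear_opt_trace_ball(G, D=5.0)
print(np.sum(G * X))                      # ~ -D * (largest singular value of G)
```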

Projections → linear optimization:
1. Matrix completion (K = bounded trace norm matrices): eigendecomposition → largest singular vector computation
2. Online routing (K = flow polytope): conic optimization over the flow polytope → shortest path computation
3. Rotations (K = rotation matrices): conic optimization over the rotations set → Wahba's algorithm, linear time
4. Matroids (K = matroid polytope): convex optimization via the ellipsoid method → greedy algorithm, nearly linear time!

Conditional Gradient algorithm [Frank, Wolfe 56]. Convex optimization problem: min_{x∈K} f(x), where f is smooth and convex and linear optimization over K is easy.
v_t = arg min_{x∈K} ∇f(x_t)^⊤ x
x_{t+1} = x_t + η_t (v_t − x_t)
1. At iteration t: x_t is a convex combination of at most t vertices (sparsity).
2. No learning rate tuning: η_t ≈ 2/t (independent of diameter, gradients, etc.).
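
A minimal sketch of the algorithm on the probability simplex, where the linear-optimization step just picks the vertex with the smallest gradient coordinate; the quadratic objective and horizon are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
Q = rng.standard_normal((d, d))
Q = Q @ Q.T / d                        # smooth convex quadratic
c = rng.standard_normal(d)

def f(x):
    return 0.5 * x @ Q @ x + c @ x

def grad(x):
    return Q @ x + c

x = np.ones(d) / d                     # start at the center of the simplex
for t in range(1, 501):
    g = grad(x)
    v = np.zeros(d)
    v[np.argmin(g)] = 1.0              # linear optimization over K = simplex: best vertex
    eta = 2.0 / (t + 2)                # the learning-rate-free schedule eta_t ~ 1/t
    x = x + eta * (v - x)              # convex combination keeps x feasible

print(f(x))                            # approaches the constrained minimum at rate O(1/t)
```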

FW theorem. x_{t+1} = x_t + η_t(v_t − x_t), v_t = arg min_{x∈K} ∇f(x_t)^⊤ x.
Theorem: f(x_t) − f(x*) = O(1/t).
Proof, main observation:
f(x_{t+1}) − f(x*) = f(x_t + η_t(v_t − x_t)) − f(x*)
≤ f(x_t) − f(x*) + η_t (v_t − x_t)^⊤ ∇_t + (β/2) η_t² ‖v_t − x_t‖²   [β-smoothness of f]
≤ f(x_t) − f(x*) + η_t (x* − x_t)^⊤ ∇_t + (β/2) η_t² ‖v_t − x_t‖²   [optimality of v_t]
≤ f(x_t) − f(x*) + η_t (f(x*) − f(x_t)) + (β/2) η_t² ‖v_t − x_t‖²   [convexity of f]
≤ (1 − η_t)(f(x_t) − f(x*)) + (β/2) η_t² D².
Thus, for h_t = f(x_t) − f(x*): h_{t+1} ≤ (1 − η_t) h_t + O(η_t²), which gives h_t = O(1/t).

Online Conditional Gradient. Set x_1 ∈ K arbitrarily. For t = 1, 2, ...: 1. Play x_t, obtain f_t. 2. Compute x_{t+1} as follows:
v_t = arg min_{x∈K} (Σ_{i=1}^{t} ∇f_i(x_i) + t x_t)^⊤ x
x_{t+1} = (1 − η_t) x_t + η_t v_t
Theorem [Hazan, Kale 12]: Regret = O(T^{3/4}).
Theorem [Garber, Hazan 13]: for polytopes, strongly-convex and smooth losses: 1. Offline: convergence after t steps is e^{−O(t)}. 2. Online: Regret = O(√T).

Next few slides: survey state-of-the-art

Non-convex optimization in ML. [Diagram: examples a drawn from a distribution are fed to a machine that outputs a label.] arg min_{x ∈ R^d} (1/m) Σ_{i=1}^{m} ℓ_i(x, a_i, b_i) + R(x)

Solution concepts for non-convex optimization? Global minimization is NP-hard, even for degree-4 polynomials. Even local minimization up to 4th-order optimality conditions is NP-hard. [Slide shows a non-convex surface plot.] Algorithmic stability is sufficient [Hardt, Recht, Singer 16]. Optimization approaches: finding vanishing gradients / local minima efficiently; graduated optimization / homotopy method; quasi-convexity; structure of local optima (probabilistic assumptions that allow alternating minimization, ...).

Gradient/Hessian-based methods. Goal: find a point x such that 1. ‖∇f(x)‖ ≤ ε (approximate first-order optimality), 2. ∇²f(x) ≽ −εI (approximate second-order optimality).
1. (we've proved) The GD algorithm x_{t+1} = x_t − η∇_t finds point (1) in O(1/ε) (expensive) iterations.
2. (we've proved) The SGD algorithm x_{t+1} = x_t − η∇̂_t finds point (1) in O(1/ε²) (cheap) iterations.
3. The SGD algorithm with noise finds (1 & 2) in O(1/ε⁴) (cheap) iterations [Ge, Huang, Jin, Yuan 15].
4. Recent second-order methods find (1 & 2) in O(1/ε^{7/4}) (expensive) iterations [Carmon, Duchi, Hinder, Sidford 16] [Agarwal, Allen-Zhu, Bullins, Hazan, Ma 16].

Recap. 1. Online learning and stochastic optimization: regret minimization, online gradient descent. 2. Regularization: AdaGrad and optimal regularization. 3. Advanced optimization: Frank-Wolfe, acceleration, variance reduction, second-order methods, non-convex optimization.

Bibliography & more information, see: http://www.cs.princeton.edu/~ehazan/tutorial/simonstutorial.htm Thank you!