Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison


optimization: minimize f(x) (the cost) subject to x ∈ Ω (the constraints). might be too much to cover in 3 hours

optimization (for big data?): minimize E_ξ[f(x, ξ)] (cost) subject to x ∈ conv(A) (constraints), where the distribution over ξ is well-behaved and A is simple (low-cardinality, low-dimension, low-complexity). Closely related cousin: minimize E_ξ[f(x, ξ)] + P(x), where P is a simple convex function.

Support Vector Machines. Examples: cancer vs. other illness, fraud vs. normal purchase, up-going vs. down-going muons. minimize_w (1/n) Σ_i max(1 − y_i wᵀx_i, 0) + λ‖w‖²: a sample average over observed data and labels, plus a regularizer to select low-complexity models. [Figure: labeled + and − points separated by a hyperplane.]

LASSO / Compressed Sensing / Sparse Modeling: reduce the number of measurements required for signal acquisition; search for a sparse set of markers for classification. minimize_x ‖Ax − b‖² + λ‖x‖₁

Matrix Completion. M_ij is known for black cells, unknown for white cells; rows index features, columns index examples; entries are specified on a set E. How do you fill in the missing data? Write M (k × n, kn entries) ≈ L Rᵀ with L (k × r) and R (n × r), only r(k + n) parameters. minimize_X Σ_{(u,v)∈E} (X_uv − M_uv)² + μ‖X‖_*

Graph Cuts: Image Segmentation, Entity Resolution, Topic Modeling (Bhusnurmath and Taylor, 2008). minimize Σ_{(u,v)∈E} w_uv |x_u − x_v| subject to x ∈ [0, 1]^V, x_s = 1, x_t = 0

optimization: minimize f(x) (cost) subject to x ∈ Ω (constraints). might be too much to cover in 3 hours, but: optimization is ubiquitous, optimization is modular, optimization is declarative

x[k+1] ← x[k] + α_k p[k]. Today: gradient descent; conjugate/accelerated; stochastic/subgradient; projected/proximal; mix and match

tomorrow: duality. min_x f(x) = max_y g(y); min_x max_y L(x, y) = max_y min_x L(x, y). Find problems that always lower bound the optimal value: this puts the problem in NP ∩ coNP; information from one problem informs the other; sometimes it is easier to solve one than the other; it is the basis of many proof techniques in data science (and tons of other areas too!)

what we'll be skipping... 2nd order/Newton/BFGS; interior point methods/ellipsoid methods; active set methods, manifold identification; branch and bound, integrating combinatorial thinking; derivative-free optimization; the soup of heuristics (simulated annealing, genetic algorithms, ...); modeling

optimality conditions: minimize f(x) over x ∈ R^n. Search for x where ∇f(x) = 0. Turns a geometric problem into an algebraic problem: solve for the point where the gradient vanishes. Necessary for optimality (sufficient for convex, smooth f).

gradient descent: x[k+1] = x[k] − α∇f(x[k]). Assume there exists an x* ∈ D where ∇f(x*) = 0. Suppose the map ψ(x) = x − α∇f(x) is contractive on D: ‖ψ(x) − ψ(y)‖ ≤ γ‖x − y‖ for some γ < 1. Run gradient descent starting at x[0] ∈ D: ‖x[k+1] − x*‖ = ‖x[k] − α∇f(x[k]) − x*‖ = ‖ψ(x[k]) − ψ(x*)‖ ≤ γ‖x[k] − x*‖, using ∇f(x*) = 0 and contractivity, so ‖x[k] − x*‖ ≤ γ^k ‖x[0] − x*‖: a linear rate.
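
A minimal sketch of this iteration in Python (not from the slides): the quadratic test function, the stepsize 1/L, and the function names are my own choices for illustration.

import numpy as np

def gradient_descent(grad_f, x0, alpha, iters=1000):
    """Run x[k+1] = x[k] - alpha * grad_f(x[k]) for a fixed number of steps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - alpha * grad_f(x)
    return x

# Illustration on a strongly convex quadratic f(x) = 0.5 x'Ax - b'x,
# where x - alpha*grad_f(x) is contractive for small enough alpha.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad_f = lambda x: A @ x - b
L = np.linalg.eigvalsh(A).max()          # Lipschitz constant of the gradient
x_star = np.linalg.solve(A, b)           # true minimizer, for comparison
x_hat = gradient_descent(grad_f, np.zeros(2), alpha=1.0 / L)
print(np.linalg.norm(x_hat - x_star))    # should be ~0 (linear convergence)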

If f is twice differentiable, contractivity means f is convex on D: ‖ψ(x + tv) − ψ(x)‖ ≤ γt‖v‖ for all t > 0, and lim_{t→0+} (ψ(x + tv) − ψ(x))/t = (I − α∇²f(x))v, so ‖(I − α∇²f(x))v‖ ≤ γ‖v‖ for all v. Hence (1 − γ)/α · I ≼ ∇²f(x) ≼ (1 + γ)/α · I: convex with Lipschitz gradients.

convexity: f(tx + (1 − t)y) ≤ t f(x) + (1 − t) f(y). l-strong convexity: f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (l/2)‖y − x‖². Lipschitz gradients: f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (L/2)‖y − x‖² (follows from Taylor's theorem). κ = L/l is the condition number of the Hessian. With step size α = 2/(L + l), ‖x[k] − x*‖ ≤ ((κ − 1)/(κ + 1))^k ‖x[0] − x*‖ and f(x[k]) − f(x*) ≤ (L/2)((κ − 1)/(κ + 1))^{2k} ‖x[0] − x*‖².

Note on convergence rate. With step size α = 2/(L + l), ‖x[k] − x*‖ ≤ ((κ − 1)/(κ + 1))^k ‖x[0] − x*‖. If you don't know the exact stepsize, can you still achieve this rate? Exact line search: at each iteration, find the α that minimizes f(x + αd). Backtracking line search: reduce α by a constant multiple until the function value sufficiently decreases. Both achieve a linear rate of convergence. More sophisticated line searches are often used in practice, but none improve over this rate in the worst case.
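
A hedged sketch of the backtracking variant described above; the shrink factor beta and the Armijo constant c are typical defaults I picked, not values given in the slides.

import numpy as np

def gd_backtracking(f, grad_f, x0, alpha0=1.0, beta=0.5, c=1e-4, iters=500):
    """Gradient descent where each stepsize is found by backtracking:
    shrink alpha by beta until a sufficient-decrease (Armijo) test holds."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad_f(x)
        alpha = alpha0
        # Armijo condition: f(x - alpha*g) <= f(x) - c * alpha * ||g||^2
        while f(x - alpha * g) > f(x) - c * alpha * (g @ g):
            alpha *= beta
        x = x - alpha * g
    return x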

acceleration/multistep. The gradient method is akin to an ODE: x[k+1] = x[k] − α∇f(x[k]) ↔ ẋ(t) = −∇f(x(t)). To prevent oscillation, add a second-order term: ẍ(t) + aẋ(t) = −∇f(x(t)) ↔ x[k+1] = x[k] − α∇f(x[k]) + β(x[k] − x[k−1]). Equivalently, x[k+1] = x[k] + α p[k] with p[k] = −∇f(x[k]) + β p[k−1]: the heavy ball method (constant α, β). When f is quadratic, this is Chebyshev's iterative method.
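
A possible implementation of the heavy ball recursion (my sketch); grad_f is assumed to be supplied by the user, and the comment anticipates the classical quadratic tuning discussed on the next slides.

import numpy as np

def heavy_ball(grad_f, x0, alpha, beta, iters=1000):
    """x[k+1] = x[k] - alpha*grad_f(x[k]) + beta*(x[k] - x[k-1])."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()
    for _ in range(iters):
        x_next = x - alpha * grad_f(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x

# For a quadratic with Hessian eigenvalues in [l, L] (kappa = L/l), the classical
# tuning is alpha = 4/(sqrt(L)+sqrt(l))**2, beta = ((sqrt(kappa)-1)/(sqrt(kappa)+1))**2.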

analysis. Heavy ball method (constant α, β): x[k+1] = x[k] + α p[k], p[k] = −∇f(x[k]) + β p[k−1]. Analyze by defining a composite error vector w[k] := (x[k] − x*, x[k−1] − x*). Then w[k+1] = B w[k] + o(‖w[k]‖), where B := [ (1 + β)I − α∇²f(x*)   −βI ;  I   0 ].

analysis (cont.) w[k+1] = B w[k] + o(‖w[k]‖). B has the same eigenvalues as the block-diagonal matrix diag(T_1, ..., T_n), T_i = [ 1 + β − αλ_i   −β ;  1   0 ], where λ_i are the eigenvalues of ∇²f(x*). Choose α, β to explicitly minimize the largest eigenvalue modulus of B, giving α = 4/(√L + √l)² and β = ((√κ − 1)/(√κ + 1))². This leads to linear convergence for ‖x[k] − x*‖ with rate approximately (√κ − 1)/(√κ + 1).

about those rates... Best steepest descent: linear rate approximately 1 − 2/κ. Heavy ball: linear rate approximately 1 − 2/√κ. Big difference! To yield ‖x[k] − x*‖ < ε‖x[0] − x*‖, gradient descent needs about (κ/2) log(1/ε) iterations and heavy ball about (√κ/2) log(1/ε). A factor of κ^{1/2} difference: e.g. if κ = 100, you need 10 times fewer steps.

conjugate gradients: x[k+1] = x[k] + α_k p[k], p[k] = −∇f(x[k]) + β_k p[k−1]. Choose α_k by line search (to reduce f). Choose β_k such that p[k] is approximately conjugate to p[1], ..., p[k−1] (really only makes sense for quadratics, but whatever...). Does not achieve a better rate than heavy ball, but gets around having to know the parameters. Convergence proofs are very sketchy (except when f is quadratic) and need an elaborate line search to guarantee local convergence.
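
For the quadratic case, where conjugacy is well defined, a standard linear CG routine looks roughly like this (my own sketch, not code from the talk):

import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Linear CG for minimizing 0.5*x'Ax - b'x with A symmetric positive definite.
    alpha_k is the exact line-search step; beta_k keeps p[k] A-conjugate to the
    previous directions."""
    n = len(b)
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float)
    r = b - A @ x                  # residual = negative gradient
    p = r.copy()
    max_iter = max_iter or n
    for _ in range(max_iter):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
    return x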

optimal method. Nesterov's optimal method (1983, 2004): x[k+1] = x[k] + α_k p[k], p[k] = −∇f(x[k] + β_k(x[k] − x[k−1])) + β_k p[k−1]: heavy ball with an extragradient step. One standard parameter sequence: t_0 = 1, t_{k+1} = (1 + √(1 + 4t_k²))/2, β_k = (t_k − 1)/t_{k+1}, as in FISTA (Beck and Teboulle, 2009). Recent fixes use line search to find the parameters and still achieve the optimal rate (modulo log factors). Analysis is based on estimate sequences, using simple quadratic approximations to f.
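
A sketch of the smooth case (P = 0) with the FISTA-style momentum sequence; grad_f and the Lipschitz constant L are assumed inputs (my naming, not the slides').

import numpy as np

def nesterov(grad_f, x0, L, iters=500):
    """Accelerated gradient for smooth convex f with L-Lipschitz gradient,
    using the momentum sequence t[k+1] = (1 + sqrt(1 + 4 t[k]^2)) / 2."""
    x = np.asarray(x0, dtype=float)
    y = x.copy()
    t = 1.0
    for _ in range(iters):
        x_next = y - (1.0 / L) * grad_f(y)                  # gradient step at the extrapolated point
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)    # momentum / extragradient step
        x, t = x_next, t_next
    return x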

why optimal? you can't beat the heavy-ball convergence rate using only gradients and function evaluations. Lower-bound construction (Nesterov): a chain-structured quadratic whose Hessian is tridiagonal, so each gradient evaluation can reveal at most one new coordinate. Start at x[0] = e_1; after k steps, x[j] = 0 for j > k + 1, and the norm of the optimal solution on the unseen coordinates decays like ((√κ − 1)/(√κ + 1))^k.

not strongly convex (l = 0). Gradient descent: f(x[k]) − f* ≤ L‖x[0] − x*‖²/(2k). Optimal method: f(x[k]) − f* ≤ 2L‖x[0] − x*‖²/(k + 1)². Big difference! To yield f(x[k]) − f* < ε, gradient descent needs on the order of L‖x[0] − x*‖²/ε iterations, while the optimal method needs on the order of √(L/ε)‖x[0] − x*‖. A factor of ε^{1/2} difference: e.g. if ε = 0.0001, you need 100 times fewer steps.

nonconvexity. Can still efficiently find a point where ‖∇f(x)‖ ≤ ε in time O(1/ε²), even when f is not convex. n.b. nonconvexity really `lets you model anything': for f(x) = Σ_{i,j} Q_ij x_i² x_j² we have ∇f(0) = 0 and ∇²f(0) = 0 for all Q, yet checking whether 0 is a local minimum is NP-hard.

stochastic gradient: minimize E_ξ[f(x, ξ)]. Stochastic Gradient Descent: for each k, sample ξ_k and compute x[k+1] = x[k] − α_k ∇f(x[k], ξ_k). Robbins and Monro (1951); adaptive filtering (1960s-1990s); back propagation in neural networks (1980s); online learning, stochastic approximation (2000s).
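
A minimal SGD loop, assuming the problem is supplied as a per-sample gradient oracle grad_fi(x, i) over n samples (my naming and interface, not the slides'):

import numpy as np

def sgd(grad_fi, n, x0, alpha, epochs=10, rng=None):
    """Stochastic gradient descent: at each step sample an index i (a draw of xi)
    and move along the negative sampled gradient, x <- x - alpha * grad_fi(x, i)."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    for _ in range(epochs):
        for i in rng.integers(0, n, size=n):
            x = x - alpha * grad_fi(x, i)
    return x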

Support Vector Machines via SGD. cancer vs. other illness, fraud vs. normal purchase, up-going vs. down-going muons. minimize_w (1/n) Σ_i max(1 − y_i wᵀx_i, 0) + λ‖w‖², i.e. the sampled loss is f_i(w) = max(1 − y_i wᵀx_i, 0) + λ‖w‖². Step 1: pick i and compute the sign of the assignment ŷ = sign(wᵀx_i). Step 2: if ŷ ≠ y_i, update w ← (1 − αλ)w + α y_i x_i.
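
A sketch of these two steps in code; note that it uses the hinge-loss subgradient (update whenever the margin is below 1) rather than the sign test, a closely related variant, and assumes labels in {+1, -1}.

import numpy as np

def svm_sgd(X, y, lam, alpha, epochs=10, rng=None):
    """SGD on the regularized hinge loss (1/n) sum_i max(1 - y_i w'x_i, 0) + lam*||w||^2.
    Always shrink w (gradient of the regularizer); if example i violates the
    margin, also add alpha * y_i * x_i (hinge-loss subgradient)."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.integers(0, n, size=n):
            w *= (1.0 - 2.0 * alpha * lam)        # regularizer step
            if y[i] * (X[i] @ w) < 1.0:           # margin violated
                w += alpha * y[i] * X[i]
    return w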

matrix completion. M ≈ L Rᵀ with M (k × n), L (k × r), R (n × r), entries specified on a set E: minimize_X Σ_{(u,v)∈E} (X_uv − M_uv)² + μ‖X‖_*. Idea: approximate X ≈ LRᵀ and solve minimize_{(L,R)} Σ_{(u,v)∈E} (L_u R_vᵀ − M_uv)² + μ‖L‖_F² + μ‖R‖_F².

SGD code for matrix completion: minimize_{(L,R)} Σ_{(u,v)∈E} (L_u R_vᵀ − M_uv)² + μ‖L‖_F² + μ‖R‖_F². Step 1: pick (u, v) ∈ E and compute the residual e = L_u R_vᵀ − M_uv. Step 2: take a mixture of the current model and the corrected model: L_u ← (1 − αμ)L_u − α e R_v, R_v ← (1 − αμ)R_v − α e L_u.
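
One way this might look in code (my sketch): entries is assumed to be a list of observed (u, v, M_uv) triples, and the initialization scale is arbitrary.

import numpy as np

def mc_sgd(entries, k, n, r, mu, alpha, epochs=20, rng=None):
    """SGD for min over (L, R) of sum over observed (u, v) of (L_u R_v' - M_uv)^2
    plus mu * (||L||_F^2 + ||R||_F^2)."""
    rng = rng or np.random.default_rng(0)
    L = 0.1 * rng.standard_normal((k, r))
    R = 0.1 * rng.standard_normal((n, r))
    for _ in range(epochs):
        for idx in rng.permutation(len(entries)):
            u, v, m = entries[idx]
            e = L[u] @ R[v] - m                               # Step 1: residual
            Lu = (1 - alpha * mu) * L[u] - alpha * e * R[v]   # Step 2: mix current
            R[v] = (1 - alpha * mu) * R[v] - alpha * e * L[u] #         and corrected model
            L[u] = Lu
    return L, R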

Mixture of hundreds of models, including nuclear norm (a.k.a. SVD) approaches.

SGD and BIG data: minimize E_ξ[f(x, ξ)]; for each k, sample ξ_k and compute x[k+1] = x[k] − α_k ∇f(x[k], ξ_k). Ideal for big data analysis: small, predictable memory footprint; robustness against noise in the data; rapid learning rates; one algorithm! Why should this work?

Example: computing the mean. minimize_x Σ_{k=1}^n (x − ξ_k)² with stepsize α_k = 1/(2k). SGD gives x[k+1] = x[k] − (1/k)(x[k] − ξ_k), so x[2] = ξ_1, x[3] = (ξ_1 + ξ_2)/2, x[4] = (ξ_1 + ξ_2 + ξ_3)/3, and so on. In general, if we minimize Σ_{k=1}^n (x − ξ_k)², SGD returns x[n+1] = (1/n) Σ_{k=1}^n ξ_k: the sample mean.
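
A quick numerical check of this claim on synthetic data (my own sketch; the starting value is irrelevant since the first step overwrites it):

import numpy as np

# SGD on f(x) = sum_k (x - xi_k)^2 with stepsize 1/(2k) reproduces the running mean:
# x[k+1] = x[k] - (1/(2k)) * 2*(x[k] - xi_k) = ((k-1)*x[k] + xi_k) / k.
rng = np.random.default_rng(0)
xi = rng.standard_normal(1000)
x = 0.0
for k, s in enumerate(xi, start=1):
    x = x - (1.0 / (2.0 * k)) * 2.0 * (x - s)
print(x, xi.mean())   # the two numbers agree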

Stepsize α = 1/2, data ξ = (cos θ, sin θ) on the unit circle, f(x, ξ) = ‖x − ξ‖². [Figure: SGD iterates when the directions are chosen in order vs. when a direction is chosen uniformly with replacement.]

convergence of SGD: minimize f(x). Assume f is strongly convex with parameter l and has Lipschitz gradients with parameter L. Assume at each iteration we sample G(x), an unbiased estimate of ∇f(x), with noise independent of the past iterates. Assume ‖G(x)‖ ≤ M almost surely. Iteration: x[k+1] = x[k] − α_k G(x[k]).

‖x[k+1] − x*‖² = ‖x[k] − α_k G(x[k]) − x*‖² = ‖x[k] − x*‖² − 2α_k (x[k] − x*)ᵀG(x[k]) + α_k²‖G(x[k])‖². Define a_k = (1/2)E‖x[k] − x*‖². Then a_{k+1} ≤ a_k − α_k E[(x[k] − x*)ᵀG(x[k])] + α_k² M²/2. By iterating expectation: E[(x[k] − x*)ᵀG(x[k])] = E[ E[(x[k] − x*)ᵀG(x[k]) | x[k]] ] = E[(x[k] − x*)ᵀ∇f(x[k])]. By strong convexity: (x[k] − x*)ᵀ∇f(x[k]) ≥ f(x[k]) − f(x*) + (l/2)‖x[k] − x*‖² ≥ l‖x[k] − x*‖². Putting it together: a_{k+1} ≤ (1 − 2α_k l) a_k + α_k² M²/2.

a_{k+1} ≤ (1 − 2α_k l) a_k + α_k² M²/2. With a constant stepsize α, E‖x[k] − x*‖² ≤ (1 − 2αl)^k ‖x[0] − x*‖² + αM²/(2l), so the error never exceeds max(‖x[0] − x*‖², αM²/(2l)): large steps give fast contraction to a big noise ball, small steps give slow contraction to a small one. Achieves the 1/k rate if run in epochs of diminishing stepsize.

Setting: minimize f(x) = (1/N) Σ_{i=1}^N f_i(x) over x ∈ R^d.
Algorithm | Time per iteration | Error after T iterations
Newton | O(d²N + d³) | quadratic convergence (error like C_I ρ^(2^T))
Gradient | O(dN) | linear convergence (error like C_G γ^T)
SGD | O(d) (or constant) | sublinear (error like C_S / T)
Per pass over the N data items, SGD takes N cheap steps while the batch methods take a single expensive one.

extensions: non-smooth, non-strongly convex (1/√k rate); non-convex (converges asymptotically); stochastic coordinate descent (special decomposition of f); parallelization.

projected gradient: minimize f(x) subject to x ∈ Ω, with f convex and smooth and Ω a convex set. Suppose it is easy to compute the projection Π_Ω(z) = arg min_{x∈Ω} ‖x − z‖ (unique solution since Ω is convex). Projected gradient method: x[k+1] = Π_Ω(x[k] − α_k ∇f(x[k])).
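
A sketch with the projection passed in as a function; the box projection shown is just one easy example of Π_Ω (both names are mine).

import numpy as np

def projected_gradient(grad_f, project, x0, alpha, iters=1000):
    """x[k+1] = Pi_Omega(x[k] - alpha * grad_f(x[k]))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = project(x - alpha * grad_f(x))
    return x

# Example projection: the box Omega = [lo, hi]^n, whose projection is a clip.
project_box = lambda z, lo=0.0, hi=1.0: np.clip(z, lo, hi)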

x[k+1] = Π_Ω(x[k] − α_k ∇f(x[k])). Key Lemma: ‖Π_Ω(x) − Π_Ω(y)‖ ≤ ‖x − y‖ (the projection is non-expansive). Assume the minimizer x* of f lies in Ω and f is strongly convex, so x* = Π_Ω(x* − α∇f(x*)). Then ‖x[k+1] − x*‖ = ‖Π_Ω(ψ(x[k])) − Π_Ω(ψ(x*))‖ ≤ ‖ψ(x[k]) − ψ(x*)‖ ≤ γ‖x[k] − x*‖, by non-expansiveness and contractivity, so ‖x[k] − x*‖ ≤ γ^k ‖x[0] − x*‖: linear rate.

Composite problems: minimize f(x) + P(x), with f smooth and P convex (possibly non-smooth). Approximate around x[k]: f(x) + P(x) ≈ f(x[k]) + ∇f(x[k])ᵀ(x − x[k]) + (1/(2α))‖x − x[k]‖² + P(x). Define prox_{αP}(z) = arg min_x (1/(2α))‖x − z‖² + P(x). Solving the approximation yields x[k+1] = prox_{αP}(x[k] − α∇f(x[k])).

proximal mapping: prox_{αP}(z) = arg min_x (1/(2α))‖x − z‖² + P(x). If P is the indicator of a convex set Ω, then prox_{αP}(z) = Π_Ω(z). If P(x) = λ‖x‖₁, then prox_{αP} is soft-thresholding: [prox_{αP}(z)]_i = z_i − αλ if z_i > αλ; 0 if |z_i| ≤ αλ; z_i + αλ if z_i < −αλ.
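
Putting the pieces together, a hedged sketch of proximal gradient (ISTA) for the lasso, using the soft-thresholding prox above; the stepsize 1/L with L the squared spectral norm of A is a standard safe choice, not one prescribed by the slides.

import numpy as np

def soft_threshold(z, t):
    """prox of t*||.||_1: shrink each coordinate toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(A, b, lam, iters=500):
    """Proximal gradient for the lasso: min 0.5*||Ax - b||^2 + lam*||x||_1.
    Each step: gradient step on the smooth part, then prox of the l1 term."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the smooth gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)
        x = soft_threshold(x - grad / L, lam / L)
    return x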

As before, approximate f(x) + P(x) around x[k] by f(x[k]) + ∇f(x[k])ᵀ(x − x[k]) + (1/(2α))‖x − x[k]‖² + P(x), define prox_{αP}(z) = arg min_x (1/(2α))‖x − z‖² + P(x), and solve the approximation to get x[k+1] = prox_{αP}(x[k] − α∇f(x[k])). Key Lemma: ‖prox_{αP}(x) − prox_{αP}(y)‖ ≤ ‖x − y‖, which immediately implies the earlier analysis works for proximal gradient. Projected gradient is a special case, and the method inherits its rates of convergence from f (i.e., from the P = 0 case).

More variants. Mirror descent: use a general distance, x[k+1] = arg min_x f(x[k]) + ⟨∇f(x[k]), x − x[k]⟩ + (1/α_k) D(x, x[k]), where D is a Bregman divergence. ADMM: combine multiple prox operators for complicated constraints.
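
As an illustration (my sketch, not from the talk), taking D to be the entropy/KL Bregman divergence gives the exponentiated-gradient update, which keeps the iterates on the probability simplex.

import numpy as np

def mirror_descent_simplex(grad_f, x0, alpha, iters=500):
    """Mirror descent with the entropy distance D(x, y) = sum_i x_i log(x_i / y_i).
    Start from a strictly positive x0 (e.g. the uniform distribution)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x * np.exp(-alpha * grad_f(x))   # multiplicative (exponentiated-gradient) update
        x = x / x.sum()                      # renormalize onto the simplex
    return x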

x[k+1] ← x[k] + α_k p[k]: gradient descent; conjugate/accelerated; stochastic/subgradient; projected/proximal; mix and match. Everything here combines, and you get the expected rates out.