Optimization methods


Optimization methods
Optimization-Based Data Analysis
http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16
Carlos Fernandez-Granda 2/8/2016

Introduction
Aim: overview of optimization methods that
tend to scale well with the problem dimension,
are widely used in machine learning and signal processing,
and are (reasonably) well understood theoretically.

Differentiable functions Gradient descent Convergence analysis of gradient descent Accelerated gradient descent Projected gradient descent Nondifferentiable functions Subgradient method Proximal gradient method Coordinate descent

Gradient
[Contour plot: the gradient points in the direction of maximum variation.]

Gradient descent (aka steepest descent)
Method to solve the optimization problem
minimize $f(x)$,
where $f$ is differentiable.
Gradient-descent iteration:
$x^{(0)}$ = arbitrary initialization
$x^{(k+1)} = x^{(k)} - \alpha_k \nabla f(x^{(k)})$
where $\alpha_k$ is the step size
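
The iteration above can be written as a short NumPy sketch. The least-squares objective, the random data, and the constant step size $1/L$ are illustrative assumptions, not part of the slides.

```python
import numpy as np

def gradient_descent(grad, x0, step, iters=100):
    """Iterate x_{k+1} = x_k - step * grad(x_k) a fixed number of times."""
    x = x0.copy()
    for _ in range(iters):
        x = x - step * grad(x)
    return x

# Illustrative example: f(x) = 0.5 * ||Ax - y||_2^2, with gradient A^T (Ax - y)
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
y = rng.standard_normal(50)
L = np.linalg.norm(A, 2) ** 2  # Lipschitz constant of the gradient (sigma_max(A)^2)
x_hat = gradient_descent(lambda x: A.T @ (A @ x - y), np.zeros(20), step=1.0 / L)
```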

Gradient descent (1D)

Gradient descent (2D)
[Contour plot of gradient-descent iterates.]

Small step size
[Contour plot of gradient-descent iterates with a small step size.]

Large step size
[Contour plot of gradient-descent iterates with a large step size.]

Line search
Exact:
$\alpha_k := \arg\min_{\beta \ge 0} f\big(x^{(k)} - \beta \nabla f(x^{(k)})\big)$
Backtracking (Armijo rule): given $\alpha_0 > 0$ and $\beta \in (0, 1)$, set $\alpha_k := \alpha_0 \beta^i$ for the smallest $i$ such that
$f\big(x^{(k+1)}\big) \le f\big(x^{(k)}\big) - \frac{1}{2} \alpha_k \big\|\nabla f(x^{(k)})\big\|_2^2$
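
A minimal sketch of the backtracking rule just described; the default values of $\alpha_0$ and $\beta$ below are arbitrary choices for the example.

```python
import numpy as np

def backtracking_step(f, grad, x, alpha0=1.0, beta=0.5):
    """Armijo backtracking: shrink alpha until
    f(x - alpha*g) <= f(x) - 0.5 * alpha * ||g||^2."""
    g = grad(x)
    fx = f(x)
    alpha = alpha0
    while f(x - alpha * g) > fx - 0.5 * alpha * np.dot(g, g):
        alpha *= beta
    return alpha
```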

Backtracking line search
[Contour plot of gradient-descent iterates with backtracking line search.]

Differentiable functions Gradient descent Convergence analysis of gradient descent Accelerated gradient descent Projected gradient descent Nondifferentiable functions Subgradient method Proximal gradient method Coordinate descent

Lipschitz continuity
A function $f : \mathbb{R}^n \to \mathbb{R}^m$ is Lipschitz continuous with Lipschitz constant $L$ if for any $x, y \in \mathbb{R}^n$
$\|f(y) - f(x)\|_2 \le L \|y - x\|_2$
Example: $f(x) := Ax$ is Lipschitz continuous with $L = \sigma_{\max}(A)$

Quadratic upper bound
If the gradient of $f : \mathbb{R}^n \to \mathbb{R}$ is Lipschitz continuous with constant $L$,
$\|\nabla f(y) - \nabla f(x)\|_2 \le L \|y - x\|_2,$
then for any $x, y \in \mathbb{R}^n$
$f(y) \le f(x) + \nabla f(x)^T (y - x) + \frac{L}{2} \|y - x\|_2^2$

Consequence of quadratic bound
Since $x^{(k+1)} = x^{(k)} - \alpha_k \nabla f(x^{(k)})$,
$f\big(x^{(k+1)}\big) \le f\big(x^{(k)}\big) - \alpha_k \left(1 - \frac{\alpha_k L}{2}\right) \big\|\nabla f(x^{(k)})\big\|_2^2$
If $\alpha_k \le \frac{1}{L}$ the value of the function always decreases:
$f\big(x^{(k+1)}\big) \le f\big(x^{(k)}\big) - \frac{\alpha_k}{2} \big\|\nabla f(x^{(k)})\big\|_2^2$

Gradient descent with constant step size
Conditions: $f$ is convex, $\nabla f$ is $L$-Lipschitz continuous, and there exists a solution $x^*$ such that $f(x^*)$ is finite.
If $\alpha_k = \alpha \le \frac{1}{L}$,
$f\big(x^{(k)}\big) - f(x^*) \le \frac{\big\|x^{(0)} - x^*\big\|_2^2}{2 \alpha k}$
We need $O\left(\frac{1}{\epsilon}\right)$ iterations to get an $\epsilon$-optimal solution

Proof
Recall that if $\alpha \le \frac{1}{L}$,
$f\big(x^{(i)}\big) \le f\big(x^{(i-1)}\big) - \frac{\alpha}{2} \big\|\nabla f(x^{(i-1)})\big\|_2^2$
By the first-order characterization of convexity,
$f\big(x^{(i-1)}\big) - f(x^*) \le \nabla f(x^{(i-1)})^T \big(x^{(i-1)} - x^*\big)$
This implies
$f\big(x^{(i)}\big) - f(x^*) \le \nabla f(x^{(i-1)})^T \big(x^{(i-1)} - x^*\big) - \frac{\alpha}{2} \big\|\nabla f(x^{(i-1)})\big\|_2^2 = \frac{1}{2\alpha} \left( \big\|x^{(i-1)} - x^*\big\|_2^2 - \big\|x^{(i-1)} - x^* - \alpha \nabla f(x^{(i-1)})\big\|_2^2 \right) = \frac{1}{2\alpha} \left( \big\|x^{(i-1)} - x^*\big\|_2^2 - \big\|x^{(i)} - x^*\big\|_2^2 \right)$

Proof
Because the value of $f$ never increases,
$f\big(x^{(k)}\big) - f(x^*) \le \frac{1}{k} \sum_{i=1}^{k} \left( f\big(x^{(i)}\big) - f(x^*) \right) = \frac{1}{2 \alpha k} \left( \big\|x^{(0)} - x^*\big\|_2^2 - \big\|x^{(k)} - x^*\big\|_2^2 \right) \le \frac{\big\|x^{(0)} - x^*\big\|_2^2}{2 \alpha k}$

Backtracking line search
If the gradient of $f : \mathbb{R}^n \to \mathbb{R}$ is Lipschitz continuous with constant $L$, the step size in the backtracking line search satisfies
$\alpha_k \ge \alpha_{\min} := \min\left\{ \alpha_0, \frac{\beta}{L} \right\}$

Proof
Line search ends when
$f\big(x^{(k+1)}\big) \le f\big(x^{(k)}\big) - \frac{\alpha_k}{2} \big\|\nabla f(x^{(k)})\big\|_2^2,$
but we know that this holds as soon as $\alpha_k \le \frac{1}{L}$, which happens as soon as $\beta / L \le \alpha_0 \beta^i \le 1/L$

Gradient descent with backtracking
Under the same conditions as before, gradient descent with backtracking line search achieves
$f\big(x^{(k)}\big) - f(x^*) \le \frac{\big\|x^{(0)} - x^*\big\|_2^2}{2 \alpha_{\min} k}$
i.e. $O\left(\frac{1}{\epsilon}\right)$ iterations to get an $\epsilon$-optimal solution

Strong convexity
A function $f : \mathbb{R}^n \to \mathbb{R}$ is strongly convex with constant $S$ if for any $x, y \in \mathbb{R}^n$
$f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{S}{2} \|y - x\|_2^2.$
Example: $f(x) := \|Ax - y\|_2^2$ where $A \in \mathbb{R}^{m \times n}$ is strongly convex with $S = 2\sigma_{\min}(A)^2$ if $m > n$

Gradient descent for strongly convex functions
If $f$ is $S$-strongly convex and $\nabla f$ is $L$-Lipschitz continuous,
$f\big(x^{(k)}\big) - f(x^*) \le \frac{c^k L}{2} \big\|x^{(0)} - x^*\big\|_2^2, \qquad c := \frac{L/S - 1}{L/S + 1}$
We need $O\left(\log \frac{1}{\epsilon}\right)$ iterations to get an $\epsilon$-optimal solution

Differentiable functions Gradient descent Convergence analysis of gradient descent Accelerated gradient descent Projected gradient descent Nondifferentiable functions Subgradient method Proximal gradient method Coordinate descent

Lower bounds for convergence rate
There exist convex functions with $L$-Lipschitz-continuous gradients such that for any algorithm that selects $x^{(k)}$ from
$x^{(0)} + \text{span}\left\{ \nabla f(x^{(0)}), \nabla f(x^{(1)}), \ldots, \nabla f(x^{(k-1)}) \right\}$
we have
$f\big(x^{(k)}\big) - f(x^*) \ge \frac{3 L \big\|x^{(0)} - x^*\big\|_2^2}{32 (k+1)^2}$

Nesterov's accelerated gradient method
Achieves the lower bound, i.e. $O\left(\frac{1}{\sqrt{\epsilon}}\right)$ convergence.
Uses a momentum variable:
$y^{(k+1)} = x^{(k)} - \alpha_k \nabla f(x^{(k)})$
$x^{(k+1)} = \beta_k y^{(k+1)} + \gamma_k y^{(k)}$
Despite the guarantees, why this works is not completely understood
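
For concreteness, here is one common way to implement acceleration. The slide leaves the auxiliary sequences $\beta_k$, $\gamma_k$ unspecified, so this sketch uses a popular equivalent form that extrapolates with a $k/(k+3)$ coefficient (the same coefficient that appears in the FISTA slide later); treat it as an illustrative variant rather than the exact update above.

```python
import numpy as np

def accelerated_gradient(grad, x0, step, iters=100):
    """Nesterov-style acceleration: gradient step at an extrapolated point,
    followed by a momentum (extrapolation) step with coefficient k/(k+3)."""
    x_prev = x0.copy()
    z = x0.copy()
    for k in range(iters):
        x = z - step * grad(z)                  # gradient step from extrapolated point
        z = x + (k / (k + 3.0)) * (x - x_prev)  # momentum step
        x_prev = x
    return x_prev
```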

Differentiable functions Gradient descent Convergence analysis of gradient descent Accelerated gradient descent Projected gradient descent Nondifferentiable functions Subgradient method Proximal gradient method Coordinate descent

Projected gradient descent
Optimization problem:
minimize $f(x)$ subject to $x \in S$,
where $f$ is differentiable and $S$ is convex.
Projected-gradient-descent iteration:
$x^{(0)}$ = arbitrary initialization
$x^{(k+1)} = P_S\big( x^{(k)} - \alpha_k \nabla f(x^{(k)}) \big)$
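
A minimal sketch of the iteration, assuming a user-supplied projection operator; the nonnegative-orthant projection at the end is just an illustrative choice of $S$, not one used in the slides.

```python
import numpy as np

def projected_gradient_descent(grad, project, x0, step, iters=100):
    """Iterate x_{k+1} = P_S(x_k - step * grad(x_k))."""
    x = x0.copy()
    for _ in range(iters):
        x = project(x - step * grad(x))
    return x

# Example projection onto S = {x : x >= 0} (nonnegative orthant)
project_nonneg = lambda v: np.maximum(v, 0.0)
```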

Projected gradient descent
[Contour plots illustrating projected-gradient iterates.]

Differentiable functions Gradient descent Convergence analysis of gradient descent Accelerated gradient descent Projected gradient descent Nondifferentiable functions Subgradient method Proximal gradient method Coordinate descent

Subgradient method
Optimization problem:
minimize $f(x)$
where $f$ is convex but nondifferentiable.
Subgradient-method iteration:
$x^{(0)}$ = arbitrary initialization
$x^{(k+1)} = x^{(k)} - \alpha_k q^{(k)}$
where $q^{(k)}$ is a subgradient of $f$ at $x^{(k)}$

Least-squares regression with $\ell_1$-norm regularization
minimize $\frac{1}{2} \|Ax - y\|_2^2 + \lambda \|x\|_1$
A sum of subgradients is a subgradient of the sum:
$q^{(k)} = A^T \big( A x^{(k)} - y \big) + \lambda \, \text{sign}\big(x^{(k)}\big)$
Subgradient-method iteration:
$x^{(0)}$ = arbitrary initialization
$x^{(k+1)} = x^{(k)} - \alpha_k \left( A^T \big( A x^{(k)} - y \big) + \lambda \, \text{sign}\big(x^{(k)}\big) \right)$
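
A sketch of this iteration in NumPy. The diminishing step size $\alpha_0/\sqrt{k+1}$ and the tracking of the best iterate are choices motivated by the convergence discussion on the next slide, not prescriptions from this one.

```python
import numpy as np

def lasso_subgradient(A, y, lam, alpha0, iters=1000):
    """Subgradient method for 0.5*||Ax - y||^2 + lam*||x||_1 with step alpha0/sqrt(k+1)."""
    x = np.zeros(A.shape[1])
    best_x, best_val = x.copy(), np.inf
    for k in range(iters):
        q = A.T @ (A @ x - y) + lam * np.sign(x)      # subgradient at x
        x = x - alpha0 / np.sqrt(k + 1) * q
        val = 0.5 * np.sum((A @ x - y) ** 2) + lam * np.sum(np.abs(x))
        if val < best_val:                            # not a descent method: keep the best
            best_x, best_val = x.copy(), val
    return best_x
```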

Convergence of subgradient method
It is not a descent method.
The convergence rate can be shown to be $O\left(1/\epsilon^2\right)$.
Diminishing step sizes are necessary for convergence.
Experiment: minimize $\frac{1}{2}\|Ax - y\|_2^2 + \lambda \|x\|_1$ with $A \in \mathbb{R}^{2000 \times 1000}$, $y = A x_0 + z$, where $x_0$ is 100-sparse and $z$ is iid Gaussian

Convergence of subgradient method
[Plot of $\big(f(x^{(k)}) - f(x^*)\big)/f(x^*)$ versus $k$ (up to 100 iterations) for step sizes $\alpha_0$, $\alpha_0/\sqrt{k}$, and $\alpha_0/k$.]

Convergence of subgradient method
[Plot of $\big(f(x^{(k)}) - f(x^*)\big)/f(x^*)$ versus $k$ (up to 5000 iterations) for the same step-size choices.]

Differentiable functions Gradient descent Convergence analysis of gradient descent Accelerated gradient descent Projected gradient descent Nondifferentiable functions Subgradient method Proximal gradient method Coordinate descent

Composite functions
An interesting class of functions for data analysis:
$f(x) + g(x)$
with $f$ convex and differentiable, $g$ convex but not differentiable.
Example: least-squares regression ($f$) + $\ell_1$-norm regularization ($g$):
$\frac{1}{2}\|Ax - y\|_2^2 + \lambda \|x\|_1$

Interpretation of gradient descent
The iterate is the solution of a local first-order approximation:
$x^{(k+1)} := x^{(k)} - \alpha_k \nabla f(x^{(k)})$
$= \arg\min_x f\big(x^{(k)}\big) + \nabla f(x^{(k)})^T \big(x - x^{(k)}\big) + \frac{1}{2 \alpha_k} \big\|x - x^{(k)}\big\|_2^2$
$= \arg\min_x \big\|x - \big(x^{(k)} - \alpha_k \nabla f(x^{(k)})\big)\big\|_2^2$

Proximal gradient method
Idea: minimize the local first-order approximation plus $g$:
$x^{(k+1)} = \arg\min_x f\big(x^{(k)}\big) + \nabla f(x^{(k)})^T \big(x - x^{(k)}\big) + \frac{1}{2 \alpha_k} \big\|x - x^{(k)}\big\|_2^2 + g(x)$
$= \arg\min_x \frac{1}{2} \big\|x - \big(x^{(k)} - \alpha_k \nabla f(x^{(k)})\big)\big\|_2^2 + \alpha_k g(x)$
$= \text{prox}_{\alpha_k g}\big( x^{(k)} - \alpha_k \nabla f(x^{(k)}) \big)$
Proximal operator:
$\text{prox}_g(y) := \arg\min_x g(x) + \frac{1}{2} \|y - x\|_2^2$

Proximal gradient method
Method to solve the optimization problem
minimize $f(x) + g(x)$,
where $f$ is differentiable and $\text{prox}_g$ is tractable.
Proximal-gradient iteration:
$x^{(0)}$ = arbitrary initialization
$x^{(k+1)} = \text{prox}_{\alpha_k g}\big( x^{(k)} - \alpha_k \nabla f(x^{(k)}) \big)$
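
A generic sketch of the iteration; the convention that the prox callable takes the point and the scale $\alpha$ (so that it evaluates $\text{prox}_{\alpha g}$) is a choice made for this example.

```python
import numpy as np

def proximal_gradient(grad_f, prox_g, x0, step, iters=200):
    """Iterate x_{k+1} = prox_{step*g}(x_k - step * grad_f(x_k)).
    prox_g(v, alpha) should return prox_{alpha*g}(v)."""
    x = x0.copy()
    for _ in range(iters):
        x = prox_g(x - step * grad_f(x), step)
    return x
```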

Interpretation as a fixed-point method
A vector $\hat{x}$ is a solution to
minimize $f(x) + g(x)$
if and only if it is a fixed point of the proximal-gradient iteration for any $\alpha > 0$:
$\hat{x} = \text{prox}_{\alpha g}\big( \hat{x} - \alpha \nabla f(\hat{x}) \big)$

Projected gradient descent as a proximal method
The proximal operator of the indicator function
$I_S(x) := \begin{cases} 0 & \text{if } x \in S, \\ \infty & \text{if } x \notin S, \end{cases}$
of a convex set $S \subseteq \mathbb{R}^n$ is the projection onto $S$.
Proximal-gradient iteration:
$x^{(k+1)} = \text{prox}_{\alpha_k I_S}\big( x^{(k)} - \alpha_k \nabla f(x^{(k)}) \big) = P_S\big( x^{(k)} - \alpha_k \nabla f(x^{(k)}) \big)$

Proximal operator of the $\ell_1$ norm
The proximal operator of the $\ell_1$ norm is the soft-thresholding operator
$\text{prox}_{\beta \|\cdot\|_1}(y) = S_\beta(y)$
where $\beta > 0$ and
$S_\beta(y)_i := \begin{cases} y_i - \text{sign}(y_i)\, \beta & \text{if } |y_i| \ge \beta, \\ 0 & \text{otherwise} \end{cases}$
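
The soft-thresholding operator in NumPy; the vectorized form below is equivalent to the casewise definition above.

```python
import numpy as np

def soft_threshold(y, beta):
    """Entrywise soft-thresholding S_beta(y), the prox of beta * ||.||_1."""
    return np.sign(y) * np.maximum(np.abs(y) - beta, 0.0)
```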

Iterative Shrinkage-Thresholding Algorithm (ISTA)
The proximal gradient method for the problem
minimize $\frac{1}{2}\|Ax - y\|_2^2 + \lambda \|x\|_1$
is called ISTA.
ISTA iteration:
$x^{(0)}$ = arbitrary initialization
$x^{(k+1)} = S_{\alpha_k \lambda}\big( x^{(k)} - \alpha_k A^T \big( A x^{(k)} - y \big) \big)$
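
A compact ISTA sketch with the soft-thresholding step inlined; the constant step $\alpha = 1/L$ with $L = \sigma_{\max}(A)^2$ is one standard choice and is an assumption here.

```python
import numpy as np

def ista(A, y, lam, iters=500):
    """ISTA for 0.5*||Ax - y||^2 + lam*||x||_1 with constant step 1/L."""
    alpha = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / sigma_max(A)^2
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        v = x - alpha * (A.T @ (A @ x - y))   # gradient step
        x = np.sign(v) * np.maximum(np.abs(v) - alpha * lam, 0.0)  # soft-threshold
    return x
```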

Fast Iterative Shrinkage-Thresholding Algorithm (FISTA)
ISTA can be accelerated using Nesterov's accelerated gradient method.
FISTA iteration:
$x^{(0)}$ = arbitrary initialization, $z^{(0)} = x^{(0)}$
$x^{(k+1)} = S_{\alpha_k \lambda}\big( z^{(k)} - \alpha_k A^T \big( A z^{(k)} - y \big) \big)$
$z^{(k+1)} = x^{(k+1)} + \frac{k}{k+3} \big( x^{(k+1)} - x^{(k)} \big)$
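
The same sketch with the momentum step added, mirroring the FISTA iteration above; again the constant step $1/L$ is an assumed choice.

```python
import numpy as np

def fista(A, y, lam, iters=500):
    """FISTA for 0.5*||Ax - y||^2 + lam*||x||_1 with momentum coefficient k/(k+3)."""
    alpha = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    z = x.copy()
    for k in range(iters):
        v = z - alpha * (A.T @ (A @ z - y))
        x_new = np.sign(v) * np.maximum(np.abs(v) - alpha * lam, 0.0)
        z = x_new + (k / (k + 3.0)) * (x_new - x)   # momentum step
        x = x_new
    return x
```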

Convergence of proximal gradient method
Without acceleration: it is a descent method; the convergence rate can be shown to be $O(1/\epsilon)$ with constant step or backtracking line search.
With acceleration: it is not a descent method; the convergence rate can be shown to be $O\left(1/\sqrt{\epsilon}\right)$ with constant step or backtracking line search.
Experiment: minimize $\frac{1}{2}\|Ax - y\|_2^2 + \lambda \|x\|_1$ with $A \in \mathbb{R}^{2000 \times 1000}$, $y = A x_0 + z$, $x_0$ 100-sparse and $z$ iid Gaussian

Convergence of proximal gradient method
[Plot of $\big(f(x^{(k)}) - f(x^*)\big)/f(x^*)$ versus $k$ for the subgradient method ($\alpha_0/\sqrt{k}$), ISTA, and FISTA.]

Differentiable functions Gradient descent Convergence analysis of gradient descent Accelerated gradient descent Projected gradient descent Nondifferentiable functions Subgradient method Proximal gradient method Coordinate descent

Coordinate descent
Idea: solve the $n$-dimensional problem
minimize $h(x_1, x_2, \ldots, x_n)$
by solving a sequence of 1D problems.
Coordinate-descent iteration:
$x^{(0)}$ = arbitrary initialization
$x^{(k+1)}_i = \arg\min_\alpha h\big( x^{(k)}_1, \ldots, \alpha, \ldots, x^{(k)}_n \big)$ for some $1 \le i \le n$

Coordinate descent
Convergence is guaranteed for functions of the form
$f(x) + \sum_{i=1}^{n} g_i(x_i)$
where $f$ is convex and differentiable and $g_1, \ldots, g_n$ are convex

Least-squares regression with $\ell_1$-norm regularization
$h(x) := \frac{1}{2}\|Ax - y\|_2^2 + \lambda \|x\|_1$
The solution to the subproblem $\min_{x_i} h(x_1, \ldots, x_i, \ldots, x_n)$ is
$\hat{x}_i = \frac{S_\lambda(\gamma_i)}{\|A_i\|_2^2}$
where $A_i$ is the $i$-th column of $A$ and
$\gamma_i := \sum_{l=1}^{m} A_{li} \Big( y_l - \sum_{j \ne i} A_{lj} x_j \Big)$
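
A cyclic coordinate-descent sketch built around this closed-form update. Keeping a running residual $r = y - Ax$ so that $\gamma_i = A_i^T r + \|A_i\|_2^2 x_i$ is an implementation detail chosen here for efficiency; the cyclic sweep order and the assumption that no column of $A$ is identically zero are also choices of this sketch.

```python
import numpy as np

def lasso_coordinate_descent(A, y, lam, sweeps=50):
    """Cyclic coordinate descent for h(x) = 0.5*||Ax - y||^2 + lam*||x||_1,
    using the closed-form update x_i = S_lam(gamma_i) / ||A_i||^2."""
    m, n = A.shape
    x = np.zeros(n)
    col_sq = np.sum(A ** 2, axis=0)       # ||A_i||_2^2 (assumed nonzero)
    r = y - A @ x                         # residual, kept up to date
    for _ in range(sweeps):
        for i in range(n):
            gamma = A[:, i] @ r + col_sq[i] * x[i]
            x_new = np.sign(gamma) * max(abs(gamma) - lam, 0.0) / col_sq[i]
            r += A[:, i] * (x[i] - x_new) # update residual after changing x_i
            x[i] = x_new
    return x
```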

Computational experiments
From Statistical Learning with Sparsity: The Lasso and Generalizations by Hastie, Tibshirani and Wainwright