Optimization methods
Optimization-Based Data Analysis
http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16
Carlos Fernandez-Granda
2/8/2016
Introduction
Aim: overview of optimization methods that
- tend to scale well with the problem dimension
- are widely used in machine learning and signal processing
- are (reasonably) well understood theoretically
Outline
Differentiable functions: gradient descent; convergence analysis of gradient descent; accelerated gradient descent; projected gradient descent
Nondifferentiable functions: subgradient method; proximal gradient method; coordinate descent
Gradient
[figure: contour plot of a function; the gradient points in the direction of maximum variation]
Gradient descent (aka steepest descent)
Method to solve the optimization problem
    minimize f(x),
where f is differentiable.
Gradient-descent iteration:
    x^(0) = arbitrary initialization
    x^(k+1) = x^(k) − α_k ∇f(x^(k))
where α_k is the step size.
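The iteration above fits in a few lines of code. A minimal one-dimensional sketch (the quadratic objective, step size, and iteration count are illustrative choices, not taken from the slides):

```python
def gradient_descent(grad, x0, step, iters):
    """Run the iteration x_{k+1} = x_k - step * grad(x_k)."""
    x = x0
    for _ in range(iters):
        x = x - step * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3).
x_hat = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0, step=0.1, iters=100)
```

With a constant step size 0.1 < 1/L (here L = 2) the iterates contract toward the minimizer x* = 3.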
Gradient descent (1D) [figure: iterates on a one-dimensional function]
Gradient descent (2D) [figure: contour plot of the iterates]
Small step size [figure: contour plot; many small steps, slow progress]
Large step size [figure: contour plot; the iterates overshoot]
Line search
Exact:
    α_k := arg min_{β ≥ 0} f( x^(k) − β ∇f(x^(k)) )
Backtracking (Armijo rule):
Given α_0 ≥ 0 and β ∈ (0, 1), set α_k := α_0 β^i for the smallest i such that
    f( x^(k+1) ) ≤ f( x^(k) ) − (1/2) α_k ‖∇f(x^(k))‖₂²
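The Armijo rule can be sketched directly from the condition above. A one-dimensional version (the test function and constants are illustrative assumptions):

```python
def backtracking_step(f, g, x, alpha0=1.0, beta=0.5):
    """Armijo rule in 1D: shrink the step by beta until the
    sufficient-decrease condition
        f(x - alpha*g) <= f(x) - (alpha/2) * g^2
    holds, where g = f'(x)."""
    alpha = alpha0
    while f(x - alpha * g) > f(x) - 0.5 * alpha * g * g:
        alpha *= beta
    return alpha

# f(x) = x^2 has gradient 2x and Lipschitz constant L = 2.
# At x = 1 (so g = 2) the accepted step is 0.5 >= min(alpha0, beta/L).
step = backtracking_step(lambda t: t * t, 2.0, 1.0)
```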
Backtracking line search [figure: contour plot of the iterates]
Convergence analysis of gradient descent
Lipschitz continuity
A function f : R^n → R^m is Lipschitz continuous with Lipschitz constant L if for any x, y ∈ R^n
    ‖f(y) − f(x)‖₂ ≤ L ‖y − x‖₂
Example: f(x) := Ax is Lipschitz continuous with L = σ_max(A)
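The linear example can be checked numerically: the operator 2-norm of A equals σ_max(A), and it bounds the stretch of any difference (the random data below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
L = np.linalg.norm(A, 2)  # sigma_max(A): largest singular value of A

# Check the Lipschitz bound ||Ax - Ay||_2 <= L ||x - y||_2 on random pairs
for _ in range(100):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    assert np.linalg.norm(A @ (x - y)) <= L * np.linalg.norm(x - y) + 1e-9
```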
Quadratic upper bound
If the gradient of f : R^n → R is Lipschitz continuous with constant L,
    ‖∇f(y) − ∇f(x)‖₂ ≤ L ‖y − x‖₂,
then for any x, y ∈ R^n
    f(y) ≤ f(x) + ∇f(x)^T (y − x) + (L/2) ‖y − x‖₂²
Consequence of quadratic bound
Since x^(k+1) = x^(k) − α_k ∇f(x^(k)),
    f( x^(k+1) ) ≤ f( x^(k) ) − α_k (1 − (α_k L)/2) ‖∇f(x^(k))‖₂²
If α_k ≤ 1/L the value of the function always decreases:
    f( x^(k+1) ) ≤ f( x^(k) ) − (α_k/2) ‖∇f(x^(k))‖₂²
Gradient descent with constant step size
Conditions: f is convex; ∇f is L-Lipschitz continuous; there exists a solution x* such that f(x*) is finite.
If α_k = α ≤ 1/L,
    f( x^(k) ) − f(x*) ≤ ‖x^(0) − x*‖₂² / (2 α k)
We need O(1/ɛ) iterations to get an ɛ-optimal solution.
Proof
Recall that if α ≤ 1/L,
    f( x^(i) ) ≤ f( x^(i−1) ) − (α/2) ‖∇f(x^(i−1))‖₂²
By the first-order characterization of convexity,
    f( x^(i−1) ) − f(x*) ≤ ∇f(x^(i−1))^T ( x^(i−1) − x* )
This implies
    f( x^(i) ) − f(x*) ≤ ∇f(x^(i−1))^T ( x^(i−1) − x* ) − (α/2) ‖∇f(x^(i−1))‖₂²
    = (1/(2α)) ( ‖x^(i−1) − x*‖₂² − ‖x^(i−1) − x* − α ∇f(x^(i−1))‖₂² )
    = (1/(2α)) ( ‖x^(i−1) − x*‖₂² − ‖x^(i) − x*‖₂² )
Proof
Because the value of f never increases,
    f( x^(k) ) − f(x*) ≤ (1/k) Σ_{i=1}^k ( f( x^(i) ) − f(x*) )
    = (1/(2αk)) ( ‖x^(0) − x*‖₂² − ‖x^(k) − x*‖₂² )   (telescoping sum)
    ≤ ‖x^(0) − x*‖₂² / (2αk)
Backtracking line search
If the gradient of f : R^n → R is Lipschitz continuous with constant L, the step size in the backtracking line search satisfies
    α_k ≥ α_min := min{ α_0, β/L }
Proof
The line search ends when
    f( x^(k+1) ) ≤ f( x^(k) ) − (α_k/2) ‖∇f(x^(k))‖₂²,
and we know this holds whenever α_k ≤ 1/L, so the search accepts the first i for which α_0 β^i ≤ 1/L, if it has not stopped earlier. Either i = 0 and α_k = α_0, or α_0 β^(i−1) > 1/L, which gives α_k = α_0 β^i > β/L. In both cases α_k ≥ min{ α_0, β/L }.
Gradient descent with backtracking
Under the same conditions as before, gradient descent with backtracking line search achieves
    f( x^(k) ) − f(x*) ≤ ‖x^(0) − x*‖₂² / (2 α_min k)
We need O(1/ɛ) iterations to get an ɛ-optimal solution.
Strong convexity
A function f : R^n → R is S-strongly convex if for any x, y ∈ R^n
    f(y) ≥ f(x) + ∇f(x)^T (y − x) + (S/2) ‖y − x‖₂²
Example: f(x) := (1/2) ‖Ax − y‖₂² with A ∈ R^{m×n} is strongly convex with S = σ_min(A)² if m ≥ n and A has full column rank.
Gradient descent for strongly convex functions
If f is S-strongly convex and ∇f is L-Lipschitz continuous,
    f( x^(k) ) − f(x*) ≤ c^k (L/2) ‖x^(0) − x*‖₂²,   c := (L/S − 1)/(L/S + 1)
We need O(log(1/ɛ)) iterations to get an ɛ-optimal solution.
Accelerated gradient descent
Lower bounds for convergence rate
There exist convex functions with L-Lipschitz-continuous gradients such that for any algorithm that selects x^(k) from
    x^(0) + span{ ∇f(x^(0)), ∇f(x^(1)), ..., ∇f(x^(k−1)) }
we have
    f( x^(k) ) − f(x*) ≥ 3 L ‖x^(0) − x*‖₂² / ( 32 (k + 1)² )
Nesterov's accelerated gradient method
Achieves the lower bound, i.e. O(1/√ɛ) convergence.
Uses a momentum variable:
    y^(k+1) = x^(k) − α_k ∇f(x^(k))
    x^(k+1) = β_k y^(k+1) + γ_k y^(k)
Despite the guarantees, why this works is not completely understood.
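The slides leave β_k and γ_k unspecified; a common instantiation uses the momentum weight k/(k + 3), as in FISTA below. A 1D sketch under that assumption (objective and constants are illustrative):

```python
def accelerated_gd(grad, x0, step, iters):
    """Accelerated gradient sketch. The momentum weight k/(k+3) is one
    standard choice (an assumption; beta_k, gamma_k are left open above)."""
    x = y = x0
    for k in range(iters):
        x_next = y - step * grad(y)                   # gradient step
        y = x_next + (k / (k + 3.0)) * (x_next - x)   # momentum extrapolation
        x = x_next
    return x

# Minimize f(x) = (x - 3)^2 (gradient 2(x - 3), L = 2) with step 0.1 < 1/L.
x_hat = accelerated_gd(lambda t: 2.0 * (t - 3.0), x0=0.0, step=0.1, iters=500)
```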
Projected gradient descent
Projected gradient descent
Optimization problem
    minimize f(x) subject to x ∈ S,
where f is differentiable and S is convex.
Projected-gradient-descent iteration:
    x^(0) = arbitrary initialization
    x^(k+1) = P_S( x^(k) − α_k ∇f(x^(k)) )
where P_S denotes projection onto S.
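A one-dimensional sketch with S = [0, 1], where the projection P_S is just clamping (the objective and constants are illustrative assumptions):

```python
def projected_gd(grad, project, x0, step, iters):
    """Projected gradient descent: gradient step, then project back onto S."""
    x = x0
    for _ in range(iters):
        x = project(x - step * grad(x))
    return x

clamp01 = lambda x: min(max(x, 0.0), 1.0)  # projection onto S = [0, 1]

# Minimize f(x) = (x - 3)^2 over [0, 1]; the constrained minimizer is x* = 1.
x_hat = projected_gd(lambda x: 2.0 * (x - 3.0), clamp01, x0=0.0, step=0.1, iters=50)
```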
Projected gradient descent [figures: contour plots of projected-gradient iterates on a convex set]
Nondifferentiable functions
Subgradient method
Optimization problem
    minimize f(x)
where f is convex but nondifferentiable.
Subgradient-method iteration:
    x^(0) = arbitrary initialization
    x^(k+1) = x^(k) − α_k q^(k)
where q^(k) is a subgradient of f at x^(k).
Least-squares regression with ℓ₁-norm regularization
    minimize (1/2) ‖Ax − y‖₂² + λ ‖x‖₁
A sum of subgradients is a subgradient of the sum:
    q^(k) = A^T ( A x^(k) − y ) + λ sign( x^(k) )
Subgradient-method iteration:
    x^(0) = arbitrary initialization
    x^(k+1) = x^(k) − α_k ( A^T ( A x^(k) − y ) + λ sign( x^(k) ) )
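This iteration translates directly into NumPy. Since the subgradient method is not a descent method, the sketch below tracks the best iterate seen so far; the diminishing step rule α_0/√k and all problem data are illustrative assumptions:

```python
import numpy as np

def subgradient_lasso(A, y, lam, alpha0, iters):
    """Subgradient method for (1/2)||Ax - y||^2 + lam*||x||_1 with
    diminishing steps alpha0 / sqrt(k); returns the best iterate found."""
    cost = lambda v: 0.5 * np.sum((A @ v - y) ** 2) + lam * np.sum(np.abs(v))
    x = np.zeros(A.shape[1])
    best, best_cost = x.copy(), cost(x)
    for k in range(1, iters + 1):
        q = A.T @ (A @ x - y) + lam * np.sign(x)  # a subgradient at x
        x = x - (alpha0 / np.sqrt(k)) * q
        c = cost(x)
        if c < best_cost:
            best, best_cost = x.copy(), c
    return best
```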
Convergence of subgradient method
It is not a descent method.
The convergence rate can be shown to be O(1/ɛ²).
Diminishing step sizes are necessary for convergence.
Experiment: minimize (1/2) ‖Ax − y‖₂² + λ ‖x‖₁ with A ∈ R^{2000×1000}, y = A x_0 + z, where x_0 is 100-sparse and z is iid Gaussian.
Convergence of subgradient method [figures: f(x^(k)) − f(x*) vs. k for constant (α_0) and diminishing (α_0/√k, α_0/k) step sizes]
Proximal gradient method
Composite functions
An interesting class of functions for data analysis:
    f(x) + g(x)
with f convex and differentiable, g convex but not differentiable.
Example: least-squares regression (f) + ℓ₁-norm regularization (g):
    (1/2) ‖Ax − y‖₂² + λ ‖x‖₁
Interpretation of gradient descent
The gradient-descent update is the solution of a local first-order approximation:
    x^(k+1) := x^(k) − α_k ∇f(x^(k))
    = arg min_x f( x^(k) ) + ∇f(x^(k))^T ( x − x^(k) ) + (1/(2 α_k)) ‖x − x^(k)‖₂²
    = arg min_x ‖x − ( x^(k) − α_k ∇f(x^(k)) )‖₂²
Proximal gradient method
Idea: minimize the local first-order approximation plus g:
    x^(k+1) = arg min_x f( x^(k) ) + ∇f(x^(k))^T ( x − x^(k) ) + (1/(2 α_k)) ‖x − x^(k)‖₂² + g(x)
    = arg min_x (1/(2 α_k)) ‖x − ( x^(k) − α_k ∇f(x^(k)) )‖₂² + g(x)
    = prox_{α_k g}( x^(k) − α_k ∇f(x^(k)) )
Proximal operator:
    prox_g(y) := arg min_x g(x) + (1/2) ‖y − x‖₂²
Proximal gradient method
Method to solve the optimization problem
    minimize f(x) + g(x),
where f is differentiable and prox_g is tractable.
Proximal-gradient iteration:
    x^(0) = arbitrary initialization
    x^(k+1) = prox_{α_k g}( x^(k) − α_k ∇f(x^(k)) )
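A compact sketch of the iteration. The 1D example f(x) = (1/2)(x − 2)², g(x) = 0.5|x| is illustrative; its minimizer can be computed by hand (set x − 2 + 0.5 sign(x) = 0, giving x* = 1.5):

```python
def proximal_gradient(grad_f, prox_g, x0, step, iters):
    """Proximal gradient: x_{k+1} = prox_{step*g}(x_k - step * grad_f(x_k))."""
    x = x0
    for _ in range(iters):
        x = prox_g(x - step * grad_f(x), step)
    return x

# Example: f(x) = (1/2)(x - 2)^2, g(x) = 0.5*|x|.
# The prox of t*g is soft-thresholding at level 0.5*t.
def prox_l1(v, t, lam=0.5):
    thr = lam * t
    return v - thr if v > thr else (v + thr if v < -thr else 0.0)

x_hat = proximal_gradient(lambda x: x - 2.0, prox_l1, x0=0.0, step=1.0, iters=20)
# x_hat == 1.5, the minimizer of (1/2)(x - 2)^2 + 0.5|x|
```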
Interpretation as a fixed-point method
A vector x̂ is a solution to
    minimize f(x) + g(x)
if and only if it is a fixed point of the proximal-gradient iteration for any α > 0:
    x̂ = prox_{α g}( x̂ − α ∇f(x̂) )
Projected gradient descent as a proximal method
The proximal operator of the indicator function
    I_S(x) := 0 if x ∈ S, ∞ if x ∉ S
of a convex set S ⊆ R^n is the projection onto S.
Proximal-gradient iteration:
    x^(k+1) = prox_{α_k I_S}( x^(k) − α_k ∇f(x^(k)) ) = P_S( x^(k) − α_k ∇f(x^(k)) )
Proximal operator of ℓ₁ norm
The proximal operator of the ℓ₁ norm is the soft-thresholding operator
    prox_{β ‖·‖₁}(y) = S_β(y),
where β > 0 and
    S_β(y)_i := y_i − sign(y_i) β if |y_i| ≥ β, 0 otherwise.
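Soft-thresholding applies the definition above elementwise; a minimal pure-Python sketch:

```python
def soft_threshold(y, beta):
    """Elementwise soft-thresholding S_beta: shrink each entry toward zero
    by beta, setting entries with |y_i| <= beta to exactly zero."""
    return [yi - beta if yi > beta else (yi + beta if yi < -beta else 0.0)
            for yi in y]
```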
Iterative Shrinkage-Thresholding Algorithm (ISTA)
The proximal gradient method for the problem
    minimize (1/2) ‖Ax − y‖₂² + λ ‖x‖₁
is called ISTA.
ISTA iteration:
    x^(0) = arbitrary initialization
    x^(k+1) = S_{α_k λ}( x^(k) − α_k A^T ( A x^(k) − y ) )
Fast Iterative Shrinkage-Thresholding Algorithm (FISTA)
ISTA can be accelerated using Nesterov's accelerated gradient method.
FISTA iteration:
    x^(0) = arbitrary initialization
    z^(0) = x^(0)
    x^(k+1) = S_{α_k λ}( z^(k) − α_k A^T ( A z^(k) − y ) )
    z^(k+1) = x^(k+1) + (k/(k + 3)) ( x^(k+1) − x^(k) )
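Both iterations side by side in NumPy. The constant step 1/σ_max(A)² (i.e. 1/L for this f) and the problem data in the usage note are illustrative assumptions:

```python
import numpy as np

def _soft(v, t):
    """Elementwise soft-thresholding at level t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, y, lam, step, iters):
    """ISTA: proximal gradient for (1/2)||Ax - y||^2 + lam*||x||_1."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = _soft(x - step * A.T @ (A @ x - y), step * lam)
    return x

def fista(A, y, lam, step, iters):
    """FISTA: ISTA with Nesterov momentum z = x_next + k/(k+3)*(x_next - x)."""
    x = np.zeros(A.shape[1])
    z = x.copy()
    for k in range(iters):
        x_next = _soft(z - step * A.T @ (A @ z - y), step * lam)
        z = x_next + (k / (k + 3.0)) * (x_next - x)
        x = x_next
    return x
```

A typical call would use step = 1.0 / np.linalg.norm(A, 2) ** 2 so that the descent guarantee for ISTA applies.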
Convergence of proximal gradient method
Without acceleration:
- Descent method
- Convergence rate can be shown to be O(1/ɛ) with constant step or backtracking line search
With acceleration:
- Not a descent method
- Convergence rate can be shown to be O(1/√ɛ) with constant step or backtracking line search
Experiment: minimize (1/2) ‖Ax − y‖₂² + λ ‖x‖₁ with A ∈ R^{2000×1000}, y = A x_0 + z, x_0 100-sparse and z iid Gaussian.
Convergence of proximal gradient method [figure: f(x^(k)) − f(x*) vs. k for the subgradient method (α_0/√k), ISTA, and FISTA]
Coordinate descent
Coordinate descent
Idea: solve the n-dimensional problem
    minimize h( x_1, x_2, ..., x_n )
by solving a sequence of 1D problems.
Coordinate-descent iteration:
    x^(0) = arbitrary initialization
    x_i^(k+1) = arg min_α h( x_1^(k), ..., α, ..., x_n^(k) ) for some 1 ≤ i ≤ n
Coordinate descent
Convergence is guaranteed for functions of the form
    f(x) + Σ_{i=1}^n g_i(x_i)
where f is convex and differentiable and g_1, ..., g_n are convex.
Least-squares regression with ℓ₁-norm regularization
    h(x) := (1/2) ‖Ax − y‖₂² + λ ‖x‖₁
The solution to the subproblem min_{x_i} h( x_1, ..., x_i, ..., x_n ) is
    x̂_i = S_λ(γ_i) / ‖A_i‖₂²,
where A_i is the ith column of A and
    γ_i := Σ_{l=1}^m A_{li} ( y_l − Σ_{j≠i} A_{lj} x_j )
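The closed-form coordinate update translates directly into code. A cyclic-sweep sketch (the sweep order and stopping rule are illustrative assumptions; the slides do not fix a coordinate-selection scheme):

```python
import numpy as np

def coordinate_descent_lasso(A, y, lam, sweeps):
    """Cyclic coordinate descent for (1/2)||Ax - y||^2 + lam*||x||_1.
    Each coordinate is set to its exact 1D minimizer via soft-thresholding."""
    m, n = A.shape
    x = np.zeros(n)
    col_sq = np.sum(A ** 2, axis=0)            # ||A_i||^2 for each column
    for _ in range(sweeps):
        for i in range(n):
            r = y - A @ x + A[:, i] * x[i]     # residual excluding coordinate i
            gamma = A[:, i] @ r                # gamma_i from the slide
            x[i] = np.sign(gamma) * max(abs(gamma) - lam, 0.0) / col_sq[i]
    return x
```

For a single-column A the update reproduces the hand-computed answer: with A = [1, 1]^T, y = (2, 2), λ = 1 the minimizer of (x − 2)² + |x| is 1.5.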
Computational experiments
From Statistical Learning with Sparsity: The Lasso and Generalizations, by Hastie, Tibshirani and Wainwright.