Optimization methods
Optimization-Based Data Analysis
http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16
Carlos Fernandez-Granda
2/8/2016
Introduction
Aim: overview of optimization methods that
- tend to scale well with the problem dimension
- are widely used in machine learning and signal processing
- are (reasonably) well understood theoretically
Outline
Differentiable functions: gradient descent; convergence analysis of gradient descent; accelerated gradient descent; projected gradient descent
Nondifferentiable functions: subgradient method; proximal gradient method; coordinate descent
Gradient
[figure: contour plot of a function; the gradient points in the direction of maximum variation]
Gradient descent (aka steepest descent)
Method to solve the optimization problem
    minimize f(x),
where f is differentiable.
Gradient-descent iteration:
    x^(0) = arbitrary initialization
    x^(k+1) = x^(k) − α_k ∇f(x^(k))
where α_k is the step size.
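The iteration above fits in a few lines of code. A minimal one-dimensional sketch (the quadratic objective, step size, and iteration count are illustrative choices, not taken from the slides):

```python
def gradient_descent(grad, x0, step, iters):
    """Run the iteration x_{k+1} = x_k - step * grad(x_k)."""
    x = x0
    for _ in range(iters):
        x = x - step * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3).
x_hat = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0, step=0.1, iters=100)
```

With a constant step size 0.1 < 1/L (here L = 2) the iterates contract toward the minimizer x* = 3.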
Gradient descent (1D) [figure: iterates on a one-dimensional function]
Gradient descent (2D) [figure: contour plot of the iterates]
Small step size [figure: contour plot; many small steps, slow progress]
Large step size [figure: contour plot; the iterates overshoot]
Line search
Exact:
    α_k := arg min_{β ≥ 0} f( x^(k) − β ∇f(x^(k)) )
Backtracking (Armijo rule):
Given α_0 ≥ 0 and β ∈ (0, 1), set α_k := α_0 β^i for the smallest i such that
    f( x^(k+1) ) ≤ f( x^(k) ) − (1/2) α_k ‖∇f(x^(k))‖₂²
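The Armijo rule can be sketched directly from the condition above. A one-dimensional version (the test function and constants are illustrative assumptions):

```python
def backtracking_step(f, g, x, alpha0=1.0, beta=0.5):
    """Armijo rule in 1D: shrink the step by beta until the
    sufficient-decrease condition
        f(x - alpha*g) <= f(x) - (alpha/2) * g^2
    holds, where g = f'(x)."""
    alpha = alpha0
    while f(x - alpha * g) > f(x) - 0.5 * alpha * g * g:
        alpha *= beta
    return alpha

# f(x) = x^2 has gradient 2x and Lipschitz constant L = 2.
# At x = 1 (so g = 2) the accepted step is 0.5 >= min(alpha0, beta/L).
step = backtracking_step(lambda t: t * t, 2.0, 1.0)
```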
Backtracking line search [figure: contour plot of the iterates]
Convergence analysis of gradient descent
Lipschitz continuity
A function f : R^n → R^m is Lipschitz continuous with Lipschitz constant L if for any x, y ∈ R^n
    ‖f(y) − f(x)‖₂ ≤ L ‖y − x‖₂
Example: f(x) := Ax is Lipschitz continuous with L = σ_max(A)
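The linear example can be checked numerically: the operator 2-norm of A equals σ_max(A), and it bounds the stretch of any difference (the random data below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
L = np.linalg.norm(A, 2)  # sigma_max(A): largest singular value of A

# Check the Lipschitz bound ||Ax - Ay||_2 <= L ||x - y||_2 on random pairs
for _ in range(100):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    assert np.linalg.norm(A @ (x - y)) <= L * np.linalg.norm(x - y) + 1e-9
```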
Quadratic upper bound
If the gradient of f : R^n → R is Lipschitz continuous with constant L,
    ‖∇f(y) − ∇f(x)‖₂ ≤ L ‖y − x‖₂,
then for any x, y ∈ R^n
    f(y) ≤ f(x) + ∇f(x)^T (y − x) + (L/2) ‖y − x‖₂²
Consequence of quadratic bound
Since x^(k+1) = x^(k) − α_k ∇f(x^(k)),
    f( x^(k+1) ) ≤ f( x^(k) ) − α_k (1 − (α_k L)/2) ‖∇f(x^(k))‖₂²
If α_k ≤ 1/L the value of the function always decreases:
    f( x^(k+1) ) ≤ f( x^(k) ) − (α_k/2) ‖∇f(x^(k))‖₂²
Gradient descent with constant step size
Conditions: f is convex; ∇f is L-Lipschitz continuous; there exists a solution x* such that f(x*) is finite.
If α_k = α ≤ 1/L,
    f( x^(k) ) − f(x*) ≤ ‖x^(0) − x*‖₂² / (2 α k)
We need O(1/ɛ) iterations to get an ɛ-optimal solution.
Proof
Recall that if α ≤ 1/L,
    f( x^(i) ) ≤ f( x^(i−1) ) − (α/2) ‖∇f(x^(i−1))‖₂²
By the first-order characterization of convexity,
    f( x^(i−1) ) − f(x*) ≤ ∇f(x^(i−1))^T ( x^(i−1) − x* )
This implies
    f( x^(i) ) − f(x*) ≤ ∇f(x^(i−1))^T ( x^(i−1) − x* ) − (α/2) ‖∇f(x^(i−1))‖₂²
    = (1/(2α)) ( ‖x^(i−1) − x*‖₂² − ‖x^(i−1) − x* − α ∇f(x^(i−1))‖₂² )
    = (1/(2α)) ( ‖x^(i−1) − x*‖₂² − ‖x^(i) − x*‖₂² )
Proof
Because the value of f never increases,
    f( x^(k) ) − f(x*) ≤ (1/k) Σ_{i=1}^k ( f( x^(i) ) − f(x*) )
    = (1/(2αk)) ( ‖x^(0) − x*‖₂² − ‖x^(k) − x*‖₂² )   (telescoping sum)
    ≤ ‖x^(0) − x*‖₂² / (2αk)
Backtracking line search
If the gradient of f : R^n → R is Lipschitz continuous with constant L, the step size in the backtracking line search satisfies
    α_k ≥ α_min := min{ α_0, β/L }
Proof
The line search ends when
    f( x^(k+1) ) ≤ f( x^(k) ) − (α_k/2) ‖∇f(x^(k))‖₂²,
and we know this holds whenever α_k ≤ 1/L, so the search accepts the first i for which α_0 β^i ≤ 1/L, if it has not stopped earlier. Either i = 0 and α_k = α_0, or α_0 β^(i−1) > 1/L, which gives α_k = α_0 β^i > β/L. In both cases α_k ≥ min{ α_0, β/L }.
Gradient descent with backtracking
Under the same conditions as before, gradient descent with backtracking line search achieves
    f( x^(k) ) − f(x*) ≤ ‖x^(0) − x*‖₂² / (2 α_min k)
We need O(1/ɛ) iterations to get an ɛ-optimal solution.
Strong convexity
A function f : R^n → R is S-strongly convex if for any x, y ∈ R^n
    f(y) ≥ f(x) + ∇f(x)^T (y − x) + (S/2) ‖y − x‖₂²
Example: f(x) := (1/2) ‖Ax − y‖₂² with A ∈ R^{m×n} is strongly convex with S = σ_min(A)² if m ≥ n and A has full column rank.
Gradient descent for strongly convex functions
If f is S-strongly convex and ∇f is L-Lipschitz continuous,
    f( x^(k) ) − f(x*) ≤ c^k (L/2) ‖x^(0) − x*‖₂²,   c := (L/S − 1)/(L/S + 1)
We need O(log(1/ɛ)) iterations to get an ɛ-optimal solution.
Accelerated gradient descent
Lower bounds for convergence rate
There exist convex functions with L-Lipschitz-continuous gradients such that for any algorithm that selects x^(k) from
    x^(0) + span{ ∇f(x^(0)), ∇f(x^(1)), ..., ∇f(x^(k−1)) }
we have
    f( x^(k) ) − f(x*) ≥ 3 L ‖x^(0) − x*‖₂² / ( 32 (k + 1)² )
Nesterov's accelerated gradient method
Achieves the lower bound, i.e. O(1/√ɛ) convergence.
Uses a momentum variable:
    y^(k+1) = x^(k) − α_k ∇f(x^(k))
    x^(k+1) = β_k y^(k+1) + γ_k y^(k)
Despite the guarantees, why this works is not completely understood.
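The slides leave β_k and γ_k unspecified; a common instantiation uses the momentum weight k/(k + 3), as in FISTA below. A 1D sketch under that assumption (objective and constants are illustrative):

```python
def accelerated_gd(grad, x0, step, iters):
    """Accelerated gradient sketch. The momentum weight k/(k+3) is one
    standard choice (an assumption; beta_k, gamma_k are left open above)."""
    x = y = x0
    for k in range(iters):
        x_next = y - step * grad(y)                   # gradient step
        y = x_next + (k / (k + 3.0)) * (x_next - x)   # momentum extrapolation
        x = x_next
    return x

# Minimize f(x) = (x - 3)^2 (gradient 2(x - 3), L = 2) with step 0.1 < 1/L.
x_hat = accelerated_gd(lambda t: 2.0 * (t - 3.0), x0=0.0, step=0.1, iters=500)
```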
Projected gradient descent
Projected gradient descent
Optimization problem
    minimize f(x) subject to x ∈ S,
where f is differentiable and S is convex.
Projected-gradient-descent iteration:
    x^(0) = arbitrary initialization
    x^(k+1) = P_S( x^(k) − α_k ∇f(x^(k)) )
where P_S denotes projection onto S.
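A one-dimensional sketch with S = [0, 1], where the projection P_S is just clamping (the objective and constants are illustrative assumptions):

```python
def projected_gd(grad, project, x0, step, iters):
    """Projected gradient descent: gradient step, then project back onto S."""
    x = x0
    for _ in range(iters):
        x = project(x - step * grad(x))
    return x

clamp01 = lambda x: min(max(x, 0.0), 1.0)  # projection onto S = [0, 1]

# Minimize f(x) = (x - 3)^2 over [0, 1]; the constrained minimizer is x* = 1.
x_hat = projected_gd(lambda x: 2.0 * (x - 3.0), clamp01, x0=0.0, step=0.1, iters=50)
```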
Projected gradient descent [figures: contour plots of projected-gradient iterates on a convex set]
Nondifferentiable functions
Subgradient method
Optimization problem
    minimize f(x)
where f is convex but nondifferentiable.
Subgradient-method iteration:
    x^(0) = arbitrary initialization
    x^(k+1) = x^(k) − α_k q^(k)
where q^(k) is a subgradient of f at x^(k).
Least-squares regression with ℓ₁-norm regularization
    minimize (1/2) ‖Ax − y‖₂² + λ ‖x‖₁
A sum of subgradients is a subgradient of the sum:
    q^(k) = A^T ( A x^(k) − y ) + λ sign( x^(k) )
Subgradient-method iteration:
    x^(0) = arbitrary initialization
    x^(k+1) = x^(k) − α_k ( A^T ( A x^(k) − y ) + λ sign( x^(k) ) )
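This iteration translates directly into NumPy. Since the subgradient method is not a descent method, the sketch below tracks the best iterate seen so far; the diminishing step rule α_0/√k and all problem data are illustrative assumptions:

```python
import numpy as np

def subgradient_lasso(A, y, lam, alpha0, iters):
    """Subgradient method for (1/2)||Ax - y||^2 + lam*||x||_1 with
    diminishing steps alpha0 / sqrt(k); returns the best iterate found."""
    cost = lambda v: 0.5 * np.sum((A @ v - y) ** 2) + lam * np.sum(np.abs(v))
    x = np.zeros(A.shape[1])
    best, best_cost = x.copy(), cost(x)
    for k in range(1, iters + 1):
        q = A.T @ (A @ x - y) + lam * np.sign(x)  # a subgradient at x
        x = x - (alpha0 / np.sqrt(k)) * q
        c = cost(x)
        if c < best_cost:
            best, best_cost = x.copy(), c
    return best
```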
Convergence of subgradient method
It is not a descent method.
The convergence rate can be shown to be O(1/ɛ²).
Diminishing step sizes are necessary for convergence.
Experiment: minimize (1/2) ‖Ax − y‖₂² + λ ‖x‖₁ with A ∈ R^{2000×1000}, y = A x_0 + z, where x_0 is 100-sparse and z is iid Gaussian.
Convergence of subgradient method [figures: f(x^(k)) − f(x*) vs. k for constant (α_0) and diminishing (α_0/√k, α_0/k) step sizes]
Proximal gradient method
Composite functions
An interesting class of functions for data analysis:
    f(x) + g(x)
with f convex and differentiable, g convex but not differentiable.
Example: least-squares regression (f) + ℓ₁-norm regularization (g):
    (1/2) ‖Ax − y‖₂² + λ ‖x‖₁
Interpretation of gradient descent
The gradient-descent update is the solution of a local first-order approximation:
    x^(k+1) := x^(k) − α_k ∇f(x^(k))
    = arg min_x f( x^(k) ) + ∇f(x^(k))^T ( x − x^(k) ) + (1/(2 α_k)) ‖x − x^(k)‖₂²
    = arg min_x ‖x − ( x^(k) − α_k ∇f(x^(k)) )‖₂²
Proximal gradient method
Idea: minimize the local first-order approximation plus g:
    x^(k+1) = arg min_x f( x^(k) ) + ∇f(x^(k))^T ( x − x^(k) ) + (1/(2 α_k)) ‖x − x^(k)‖₂² + g(x)
    = arg min_x (1/(2 α_k)) ‖x − ( x^(k) − α_k ∇f(x^(k)) )‖₂² + g(x)
    = prox_{α_k g}( x^(k) − α_k ∇f(x^(k)) )
Proximal operator:
    prox_g(y) := arg min_x g(x) + (1/2) ‖y − x‖₂²
Proximal gradient method
Method to solve the optimization problem
    minimize f(x) + g(x),
where f is differentiable and prox_g is tractable.
Proximal-gradient iteration:
    x^(0) = arbitrary initialization
    x^(k+1) = prox_{α_k g}( x^(k) − α_k ∇f(x^(k)) )
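A compact sketch of the iteration. The 1D example f(x) = (1/2)(x − 2)², g(x) = 0.5|x| is illustrative; its minimizer can be computed by hand (set x − 2 + 0.5 sign(x) = 0, giving x* = 1.5):

```python
def proximal_gradient(grad_f, prox_g, x0, step, iters):
    """Proximal gradient: x_{k+1} = prox_{step*g}(x_k - step * grad_f(x_k))."""
    x = x0
    for _ in range(iters):
        x = prox_g(x - step * grad_f(x), step)
    return x

# Example: f(x) = (1/2)(x - 2)^2, g(x) = 0.5*|x|.
# The prox of t*g is soft-thresholding at level 0.5*t.
def prox_l1(v, t, lam=0.5):
    thr = lam * t
    return v - thr if v > thr else (v + thr if v < -thr else 0.0)

x_hat = proximal_gradient(lambda x: x - 2.0, prox_l1, x0=0.0, step=1.0, iters=20)
# x_hat == 1.5, the minimizer of (1/2)(x - 2)^2 + 0.5|x|
```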
Interpretation as a fixed-point method
A vector x̂ is a solution to
    minimize f(x) + g(x)
if and only if it is a fixed point of the proximal-gradient iteration for any α > 0:
    x̂ = prox_{α g}( x̂ − α ∇f(x̂) )
Projected gradient descent as a proximal method
The proximal operator of the indicator function
    I_S(x) := 0 if x ∈ S, ∞ if x ∉ S
of a convex set S ⊆ R^n is the projection onto S.
Proximal-gradient iteration:
    x^(k+1) = prox_{α_k I_S}( x^(k) − α_k ∇f(x^(k)) ) = P_S( x^(k) − α_k ∇f(x^(k)) )
Proximal operator of ℓ₁ norm
The proximal operator of the ℓ₁ norm is the soft-thresholding operator
    prox_{β ‖·‖₁}(y) = S_β(y),
where β > 0 and
    S_β(y)_i := y_i − sign(y_i) β if |y_i| ≥ β, 0 otherwise.
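Soft-thresholding applies the definition above elementwise; a minimal pure-Python sketch:

```python
def soft_threshold(y, beta):
    """Elementwise soft-thresholding S_beta: shrink each entry toward zero
    by beta, setting entries with |y_i| <= beta to exactly zero."""
    return [yi - beta if yi > beta else (yi + beta if yi < -beta else 0.0)
            for yi in y]
```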
Iterative Shrinkage-Thresholding Algorithm (ISTA)
The proximal gradient method for the problem
    minimize (1/2) ‖Ax − y‖₂² + λ ‖x‖₁
is called ISTA.
ISTA iteration:
    x^(0) = arbitrary initialization
    x^(k+1) = S_{α_k λ}( x^(k) − α_k A^T ( A x^(k) − y ) )
Fast Iterative Shrinkage-Thresholding Algorithm (FISTA)
ISTA can be accelerated using Nesterov's accelerated gradient method.
FISTA iteration:
    x^(0) = arbitrary initialization
    z^(0) = x^(0)
    x^(k+1) = S_{α_k λ}( z^(k) − α_k A^T ( A z^(k) − y ) )
    z^(k+1) = x^(k+1) + (k/(k + 3)) ( x^(k+1) − x^(k) )
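Both iterations side by side in NumPy. The constant step 1/σ_max(A)² (i.e. 1/L for this f) and the problem data in the usage note are illustrative assumptions:

```python
import numpy as np

def _soft(v, t):
    """Elementwise soft-thresholding at level t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, y, lam, step, iters):
    """ISTA: proximal gradient for (1/2)||Ax - y||^2 + lam*||x||_1."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = _soft(x - step * A.T @ (A @ x - y), step * lam)
    return x

def fista(A, y, lam, step, iters):
    """FISTA: ISTA with Nesterov momentum z = x_next + k/(k+3)*(x_next - x)."""
    x = np.zeros(A.shape[1])
    z = x.copy()
    for k in range(iters):
        x_next = _soft(z - step * A.T @ (A @ z - y), step * lam)
        z = x_next + (k / (k + 3.0)) * (x_next - x)
        x = x_next
    return x
```

A typical call would use step = 1.0 / np.linalg.norm(A, 2) ** 2 so that the descent guarantee for ISTA applies.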
Convergence of proximal gradient method
Without acceleration:
- Descent method
- Convergence rate can be shown to be O(1/ɛ) with constant step or backtracking line search
With acceleration:
- Not a descent method
- Convergence rate can be shown to be O(1/√ɛ) with constant step or backtracking line search
Experiment: minimize (1/2) ‖Ax − y‖₂² + λ ‖x‖₁ with A ∈ R^{2000×1000}, y = A x_0 + z, x_0 100-sparse and z iid Gaussian.
Convergence of proximal gradient method [figure: f(x^(k)) − f(x*) vs. k for the subgradient method (α_0/√k), ISTA, and FISTA]
Coordinate descent
Coordinate descent
Idea: solve the n-dimensional problem
    minimize h( x_1, x_2, ..., x_n )
by solving a sequence of 1D problems.
Coordinate-descent iteration:
    x^(0) = arbitrary initialization
    x_i^(k+1) = arg min_α h( x_1^(k), ..., α, ..., x_n^(k) ) for some 1 ≤ i ≤ n
Coordinate descent
Convergence is guaranteed for functions of the form
    f(x) + Σ_{i=1}^n g_i(x_i)
where f is convex and differentiable and g_1, ..., g_n are convex.
Least-squares regression with ℓ₁-norm regularization
    h(x) := (1/2) ‖Ax − y‖₂² + λ ‖x‖₁
The solution to the subproblem min_{x_i} h( x_1, ..., x_i, ..., x_n ) is
    x̂_i = S_λ(γ_i) / ‖A_i‖₂²,
where A_i is the ith column of A and
    γ_i := Σ_{l=1}^m A_{li} ( y_l − Σ_{j≠i} A_{lj} x_j )
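The closed-form coordinate update translates directly into code. A cyclic-sweep sketch (the sweep order and stopping rule are illustrative assumptions; the slides do not fix a coordinate-selection scheme):

```python
import numpy as np

def coordinate_descent_lasso(A, y, lam, sweeps):
    """Cyclic coordinate descent for (1/2)||Ax - y||^2 + lam*||x||_1.
    Each coordinate is set to its exact 1D minimizer via soft-thresholding."""
    m, n = A.shape
    x = np.zeros(n)
    col_sq = np.sum(A ** 2, axis=0)            # ||A_i||^2 for each column
    for _ in range(sweeps):
        for i in range(n):
            r = y - A @ x + A[:, i] * x[i]     # residual excluding coordinate i
            gamma = A[:, i] @ r                # gamma_i from the slide
            x[i] = np.sign(gamma) * max(abs(gamma) - lam, 0.0) / col_sq[i]
    return x
```

For a single-column A the update reproduces the hand-computed answer: with A = [1, 1]^T, y = (2, 2), λ = 1 the minimizer of (x − 2)² + |x| is 1.5.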
Computational experiments
From Statistical Learning with Sparsity: The Lasso and Generalizations, by Hastie, Tibshirani and Wainwright.