Optimization methods

Size: px

Start display at page:

Download "Optimization methods"

Henry Hunt
5 years ago
Views:

Optimization methods Optimization-Based Data Analysis http://www.

1 Optimization methods Optimization-Based Data Analysis Carlos Fernandez-Granda /8/016

2 Introduction Aim: Overview of optimization methods that Tend to scale well with the problem dimension Are widely used in machine learning and signal processing Are (reasonably) well understood theoretically

3 Differentiable functions Gradient descent Convergence analysis of gradient descent Accelerated gradient descent Projected gradient descent Nondifferentiable functions Subgradient method Proximal gradient method Coordinate descent

4 Differentiable functions Gradient descent Convergence analysis of gradient descent Accelerated gradient descent Projected gradient descent Nondifferentiable functions Subgradient method Proximal gradient method Coordinate descent

5 Gradient Direction of maximum variation

6 Gradient descent (aka steepest descent) Method to solve the optimization problem where f is differentiable minimize f (x), Gradient-descent iteration: x (0) = arbitrary initialization x (k+1) = x (k) α k f where α k is the step size (x (k))

7 Gradient descent (1D)

8 Gradient descent (D)

9 Small step size

10 Large step size

11 Line search Exact ( α k := arg min f x (k) β f (x (k))) β 0 Backtracking (Armijo rule) Given α 0 0 and β (0, 1), set α k := α 0 β i for the smallest i such that ( f x (k+1)) ( f x (k)) 1 α k f (x (k))

12 Backtracking line search

13 Differentiable functions Gradient descent Convergence analysis of gradient descent Accelerated gradient descent Projected gradient descent Nondifferentiable functions Subgradient method Proximal gradient method Coordinate descent

14 Lipschitz continuity A function f : R n R m is Lipschitz continuous with Lipschitz constant L if for any x, y R n f (y) f (x) L y x Example: f (x) := Ax is Lipschitz continuous with L = σ max (A)

15 Quadratic upper bound If the gradient of f : R n R is Lipschitz continuous with constant L then for any x, y R n f (y) f (x) L y x f (y) f (x) + f (x) T (y x) + L y x

16 Consequence of quadratic bound Since x (k+1) = x (k) α k f ( x (k)) f ( x (k+1)) f (x (k)) ( α k 1 α ) ( kl f x (k)) If α k 1 L the value of the function always decreases! f ( x (k+1)) ( f x (k)) α k f (x (k))

17 Gradient descent with constant step size Conditions: f is convex f is L-Lipschitz continuous There exists a solution x such that f (x ) is finite If α k = α 1 L f ( x (k)) x (k) f (x x (0) ) α k We need O ( 1 ɛ ) iterations to get an ɛ-optimal solution

18 Proof Recall that if α 1 L ( f x (i)) ( f x (i 1)) α f (x (i 1)) By the first-order characterization of convexity f ( x (i 1)) f (x ) f (x (i 1)) T (x (i 1) x ) f This implies f (x (i 1)) ( x (i 1) x x (i 1) x α f (x (i 1)) ) ( x (i 1) x ) x (i) x ( x (i)) f (x ) f (x (i 1)) T ( x (i 1) x ) α = 1 α = 1 α

19 Proof Because the value of f never increases, f ( x (k)) f (x ) 1 k k f i=1 ( x (i)) f (x ) = 1 ( x (0) x ) x α k (k) x x (0) x α k

20 Backtracking line search If the gradient of f : R n R is Lipschitz continuous with constant L the step size in the backtracking line search satisfies { α k α min := min α 0, β } L

21 Proof Line search ends when f ( x (k+1)) f ( x (k)) α k f but we know that if α k 1 L ( f x (k+1)) ( f x (k)) α k f (x (k)) (x (k)) This happens as soon as β/l α 0 β i 1/L

22 Gradient descent with backtracking Under the same conditions as before gradient descent with backtracking line search achieves f ( x (k)) x (0) f (x x ) α min k O ( 1 ɛ ) iterations to get an ɛ-optimal solution

23 Strong convexity A function f : R n is strongly convex if for any x, y R n f (y) f (x) + f (x) T (y x) + S y x. Example: f (x) := Ax y where A R m n is strongly convex with S = σ min (A) if m > n

24 Gradient descent for strongly convex functions If f is S-strongly convex and f is L-Lipschitz continuous f ( x (k)) f (x ) ck L x (k) x (0) c := L S 1 L S + 1 We need O ( log 1 ɛ ) iterations to get an ɛ-optimal solution

25 Differentiable functions Gradient descent Convergence analysis of gradient descent Accelerated gradient descent Projected gradient descent Nondifferentiable functions Subgradient method Proximal gradient method Coordinate descent

26 Lower bounds for convergence rate There exist convex functions with L-Lipschitz-continuous gradients such that for any algorithm that selects x (k) from { ( x (0) + span f x (0)) (, f x (1)),..., f (x (k 1))} we have f ( x (k)) f (x ) 3L x (0) x 3 (k + 1)

27 Nesterov s accelerated gradient method ( ) Achieves lower bound, i.e. O 1 convergence ɛ Uses momentum variable y (k+1) = x (k) α k f (x (k)) x (k+1) = β k y (k+1) + γ k y (k) Despite guarantees, why this works is not completely understood

28 Differentiable functions Gradient descent Convergence analysis of gradient descent Accelerated gradient descent Projected gradient descent Nondifferentiable functions Subgradient method Proximal gradient method Coordinate descent

29 Projected gradient descent Optimization problem minimize f (x) subject to x S, where f is differentiable and S is convex Projected-gradient-descent iteration: x (0) = arbitrary initialization x (k+1) = P S (x (k) α k f (x (k)))

30 Projected gradient descent

31 Projected gradient descent

32 Differentiable functions Gradient descent Convergence analysis of gradient descent Accelerated gradient descent Projected gradient descent Nondifferentiable functions Subgradient method Proximal gradient method Coordinate descent

33 Differentiable functions Gradient descent Convergence analysis of gradient descent Accelerated gradient descent Projected gradient descent Nondifferentiable functions Subgradient method Proximal gradient method Coordinate descent

34 Subgradient method Optimization problem minimize f (x) where f is convex but nondifferentiable Subgradient-method iteration: x (0) = arbitrary initialization x (k+1) = x (k) α k q (k) where q (k) is a subgradient of f at x (k)

35 Least-squares regression with l 1 -norm regularization minimize 1 Ax y + λ x 1 Sum of subgradients is a subgradient of the sum ( ) q (k) = A T Ax (k) y + λ sign (x (k)) Subgradient-method iteration: x (0) = arbitrary initialization x (k+1) = x (k) α k (A T ( Ax (k) y ) + λ sign (x (k)))

36 Convergence of subgradient method It is not a descent method Convergence rate can be shown to be O ( 1/ɛ ) Diminishing step sizes are necessary for convergence Experiment: minimize 1 Ax y + λ x 1 A R , y = Ax 0 + z where x 0 is 100-sparse and z is iid Gaussian

37 Convergence of subgradient method 10 4 α α 0 / k α 0 /k 10 f(x (k) ) f(x ) f(x ) k

38 Convergence of subgradient method α 0 α 0 / n α 0 /n f(x (k) ) f(x ) f(x ) k

39 Differentiable functions Gradient descent Convergence analysis of gradient descent Accelerated gradient descent Projected gradient descent Nondifferentiable functions Subgradient method Proximal gradient method Coordinate descent

40 Composite functions Interesting class of functions for data analysis f (x) + g (x) f convex and differentiable, g convex but not differentiable Example: Least-squares regression (f ) + l 1 -norm regularization (g) 1 Ax y + λ x 1

41 Interpretation of gradient descent Solution of local first-order approximation = arg min f x (x (k)) x (k+1) := x (k) α k f ( x = arg min x (k) α k f (x (k))) x ( x (k)) + f (x (k)) T ( x x (k)) + 1 x x (k) α k

42 Proximal gradient method Idea: Minimize local first-order approximation + g ( x (k+1) = arg min f x (k)) + f (x (k)) T ( x x (k)) + 1 x x (k) x α k + g (x) 1 ( = arg min x x (k) α k f (x (k))) x + α k g (x) ( = prox αk g x (k) α k f (x (k))) Proximal operator: prox g (y) := arg min x g (x) + 1 y x

43 Proximal gradient method Method to solve the optimization problem minimize f (x) + g (x), where f is differentiable and prox g is tractable Proximal-gradient iteration: x (0) = arbitrary initialization ( x (k+1) = prox αk g x (k) α k f (x (k)))

44 Interpretation as a fixed-point method A vector ˆx is a solution to minimize f (x) + g (x), if and only if it is a fixed point of the proximal-gradient iteration for any α > 0 ˆx = prox αk g (ˆx α k f (ˆx))

45 Projected gradient descent as a proximal method The proximal operator of the indicator function { 0 if x S, I S (x) := if x / S. of a convex set S R n is projection onto S Proximal-gradient iteration: x (k+1) = prox αk I S (x (k) α k f = P S (x (k) α k f (x (k))) (x (k)))

46 Proximal operator of l 1 norm The proximal operator of the l 1 norm is the soft-thresholding operator where β > 0 and S β (y) i := prox β 1 (y) = S β (y) { y i sign (y i ) β if y i β 0 otherwise

47 Iterative Shrinkage-Thresholding Algorithm (ISTA) The proximal gradient method for the problem is called ISTA minimize 1 Ax y + λ x 1 ISTA iteration: x (0) = arbitrary initialization ( ( )) x (k+1) = S αk λ x (k) α k A T Ax (k) y

48 Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) ISTA can be accelerated using Nesterov s accelerated gradient method FISTA iteration: x (0) = arbitrary initialization z (0) = x (0) ( ( )) x (k+1) = S αk λ z (k) α k A T Az (k) y z (k+1) = x (k+1) + k (x (k+1) x (k)) k + 3

49 Convergence of proximal gradient method Without acceleration: Descent method Convergence rate can be shown to be O (1/ɛ) with constant step or backtracking line search With acceleration: Not a descent method ( ) Convergence rate can be shown to be O 1 ɛ backtracking line search Experiment: minimize 1 Ax y + λ x 1 with constant step or A R , y = Ax 0 + z, x sparse and z iid Gaussian

50 Convergence of proximal gradient method Subg. method (α 0 / k) ISTA FISTA f(x (k) ) f(x ) f(x ) k

51 Differentiable functions Gradient descent Convergence analysis of gradient descent Accelerated gradient descent Projected gradient descent Nondifferentiable functions Subgradient method Proximal gradient method Coordinate descent

52 Coordinate descent Idea: Solve the n-dimensional problem minimize h (x 1, x,..., x n ) by solving a sequence of 1D problems Coordinate-descent iteration: x (0) = arbitrary initialization x (k+1) i = arg min h α ( x (k) 1,..., α,..., x n (k) ) for some 1 i n

53 Coordinate descent Convergence is guaranteed for functions of the form n f (x) + g i (x i ) where f is convex and differentiable and g 1,..., g n are convex i=1

54 Least-squares regression with l 1 -norm regularization h (x) := 1 Ax y + λ x 1 The solution to the subproblem min xi h (x 1,..., x i,..., x n ) is ˆx i = S λ (γ i ) A i where A i is the ith column of A and m γ i := A li y l j i l=1 A lj x j

55 Computational experiments From Statistical Learning with Sparsity The Lasso and Generalizations by Hastie, Tibshirani and Wainwright

Optimization methods

Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,