10-725/36-725: Convex Optimization (Fall 2015)
Lecture 5: September 15
Lecturer: Ryan Tibshirani          Scribes: Di Jin, Mengdi Wang, Bin Deng

Note: LaTeX template courtesy of the UC Berkeley EECS department.

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor. Some of the content is borrowed from Boyd & Vandenberghe's book.

5.1 Canonical convex problem review

Several canonical classes of convex problems can be summarized as follows:

Linear program (LP):
$$\min_x \; c^T x \quad \text{subject to} \quad Gx \le h, \; Ax = b$$

Quadratic program (QP): like an LP, but with a quadratic objective function.

Semidefinite program (SDP): like an LP, but with matrix variables.

Conic program: the most general form of convex problem.

Second-order cone program (SOCP):
$$\min_x \; c^T x \quad \text{subject to} \quad \|D_i x + d_i\|_2 \le e_i^T x + f_i, \; i = 1, \dots, p, \quad Ax = b$$

The relations among these problem classes are summarized as:
$$\text{LP} \subseteq \text{QP} \subseteq \text{SOCP} \subseteq \text{SDP} \subseteq \text{Conic program}$$
In other words, LP is a special case of QP; QP is a special case of SOCP; SOCP is a special case of SDP; and all of the above are special cases of conic programs.

5.2 Gradient descent

5.2.1 Algorithm

Consider the unconstrained, smooth convex optimization problem
$$\min_x \; f(x)$$
where $f$ is convex and differentiable with $\mathrm{dom}(f) = \mathbb{R}^n$. Gradient descent gradually updates $x$ in order to find a minimizer $x^\star$ of the objective $f$. The algorithm is as follows (a minimal code sketch follows the list):

- Choose an initial point $x^{(0)} \in \mathbb{R}^n$.
- Repeatedly update
$$x^{(k)} = x^{(k-1)} - t_k \nabla f(x^{(k-1)}), \quad k = 1, 2, 3, \dots$$
  where $t_k$ is the step size at step $k$ and $\nabla f(x^{(k-1)})$ is the gradient of $f$ evaluated at $x^{(k-1)}$.
- Stop at some point, e.g., when the decrease in $f(x)$ falls below some threshold.
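As a concrete illustration (not part of the original notes), here is a minimal Python sketch of the algorithm above; the test function, step size, tolerance, and iteration cap are arbitrary choices for demonstration.

```python
import numpy as np

def gradient_descent(grad_f, x0, t=0.1, tol=1e-6, max_iter=1000):
    """Minimize smooth convex f via x^(k) = x^(k-1) - t * grad f(x^(k-1))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:  # stop once the gradient is nearly zero
            break
        x = x - t * g
    return x

# Example: f(x) = (10 x1^2 + x2^2) / 2, whose gradient is (10 x1, x2)
grad_f = lambda x: np.array([10.0 * x[0], x[1]])
print(gradient_descent(grad_f, x0=[1.0, 1.0], t=0.1))  # approaches (0, 0)
```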

If $f$ is convex, then gradient descent converges to the global minimum of $f$ no matter where $x^{(0)}$ is initialized. If $f$ is strictly convex, the minimizer is also unique, so different initializations of $x^{(0)}$ reach the same solution even though their update paths differ. In contrast, if $f$ is non-convex, gradient descent is not guaranteed to find a global minimum, and different initializations of $x^{(0)}$ can give different results.

5.2.2 Gradient descent interpretation

Gradient descent can be interpreted as repeatedly minimizing a quadratic approximation to the objective function. For convex and twice differentiable $f$, the Taylor expansion of $f(y)$ around $x$ is
$$f(y) = f(x) + \nabla f(x)^T (y - x) + \frac{1}{2} (y - x)^T \nabla^2 f(z) (y - x)$$
for some $z$ between $x$ and $y$. Approximate $f(y)$ by replacing $\nabla^2 f(z)$ with $\frac{1}{t} I$:
$$f(y) \approx f(x) + \nabla f(x)^T (y - x) + \frac{1}{2t} \|y - x\|_2^2$$
To minimize this approximation, take the derivative with respect to $y$ and set it to zero:
$$\nabla f(x) + \frac{1}{t} (y - x) = 0 \quad \Longrightarrow \quad y = x - t \nabla f(x)$$
Letting $x^+ = y$ denote the next iterate, we recover the gradient descent update:
$$x^+ = x - t \nabla f(x)$$
As shown in Figure 5.1, at each point $x$ a quadratic function is used to approximate $f$; the quadratic is minimized, and its minimizer becomes the next iterate $x^+$.

5.2.3 Fixed step size

The choice of step size $t$ affects whether $f(x^{(k)})$ converges to the minimum, as well as how fast it converges. As shown in Figure 5.2, consider the function $f(x) = (10 x_1^2 + x_2^2)/2$. If $t$ is too big (left panel, after 8 steps), $f(x^{(k)})$ may fail to decrease and the iterates can diverge. If $t$ is too small (middle panel, after 100 steps), the updates move very slowly and it takes many steps to approach the minimum. With a well-chosen step size (right panel), it takes only about 40 steps to reach convergence. It is therefore important to choose the step size $t$ appropriately (a small experiment reproducing this behavior follows Figure 5.2 below).

Figure 5.1: Quadratic approximation in gradient descent

Figure 5.2: Different step sizes
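The behavior in Figure 5.2 is easy to reproduce. Below is a small experiment (a sketch; the exact step sizes used to generate the figure are not given in the notes, so the values here are illustrative) running gradient descent on $f(x) = (10 x_1^2 + x_2^2)/2$ with three fixed step sizes.

```python
import numpy as np

f = lambda x: (10 * x[0]**2 + x[1]**2) / 2
grad_f = lambda x: np.array([10.0 * x[0], x[1]])

def run(t, n_steps):
    x = np.array([1.0, 1.0])
    for _ in range(n_steps):
        x = x - t * grad_f(x)
    return f(x)

print(run(t=0.21, n_steps=8))    # too big: 1 - 10*t = -1.1, so x1 oscillates and diverges
print(run(t=0.01, n_steps=100))  # too small: x2 shrinks by only a factor 0.99 per step
print(run(t=0.10, n_steps=40))   # about right (t = 1/L with L = 10): near the minimum
```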

5.3 Backtracking line search

5.3.1 Algorithm

Backtracking line search allows us to adaptively choose an appropriate step size at each step of the gradient update. With fixed parameters $0 < \beta < 1$ and $0 < \alpha \le 1/2$, the method is summarized as follows (a code sketch appears after Figure 5.3):

1. At each iteration, start with step size $t = 1$.
2. While $f(x - t \nabla f(x)) > f(x) - \alpha t \|\nabla f(x)\|_2^2$, shrink $t = \beta t$.
3. Perform the gradient update: $x^+ = x - t \nabla f(x)$.
4. Go to 1 and repeat until convergence.

Figure 5.3: Backtracking line search
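A minimal sketch of this procedure in Python (not from the notes; the parameter values $\alpha = 0.5$, $\beta = 0.8$ and the test function are illustrative choices):

```python
import numpy as np

def backtracking_step(f, grad_f, x, alpha=0.5, beta=0.8):
    """Shrink t until f(x - t*g) <= f(x) - alpha * t * ||g||^2."""
    t = 1.0
    g = grad_f(x)
    fx, gnorm2 = f(x), g @ g
    while f(x - t * g) > fx - alpha * t * gnorm2:
        t *= beta
    return t

f = lambda x: (10 * x[0]**2 + x[1]**2) / 2
grad_f = lambda x: np.array([10.0 * x[0], x[1]])

x = np.array([1.0, 1.0])
for _ in range(50):                       # outer loop: gradient updates
    t = backtracking_step(f, grad_f, x)   # inner loop: step size search
    x = x - t * grad_f(x)
print(x, f(x))
```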

5.3.2 Backtracking interpretation

For convex and differentiable $f$, take the Taylor expansion of $f$ at $x$ and evaluate it at $x - t \nabla f(x)$:
$$f(x - t \nabla f(x)) = f(x) - t \nabla f(x)^T \nabla f(x) + \frac{t^2}{2} \nabla f(x)^T \nabla^2 f(z) \nabla f(x) \ge f(x) - t \|\nabla f(x)\|_2^2$$
where $\nabla^2 f(z)$ is a positive semidefinite matrix by the convexity of $f$; this is why a convex function always lies above its tangent line. When $t$ is very small, the Taylor expansion gives
$$f(x - t \nabla f(x)) \approx f(x) - t \|\nabla f(x)\|_2^2 < f(x) - \alpha t \|\nabla f(x)\|_2^2 \quad (0 < \alpha \le 1/2)$$
so by repeatedly shrinking the step size $t = \beta t$, we eventually find a $t$ with
$$f(x - t \nabla f(x)) \le f(x) - \alpha t \|\nabla f(x)\|_2^2 \quad (0 < \alpha \le 1/2)$$
This is the stopping condition for the step size search. The fixed parameter $\alpha$ can be interpreted as the fraction of the decrease in $f$ predicted by linear extrapolation that we are willing to accept. Backtracking line search is illustrated in Figure 5.3.

As for the selection of the parameters $\alpha$ and $\beta$: if $\alpha$ is too small, we accept steps that decrease $f$ only slightly, which can make progress very slow. For simplicity, in practice $\alpha$ is often chosen to be 0.5. The parameter $\beta$ determines how fast and how accurately an acceptable step size is found: a small $\beta$ costs fewer rounds of search, but the search is crude and can shrink the step size too much, while a $\beta$ close to 1 gives a fine-grained search at the cost of more rounds. It is worth noting that backtracking line search adds an inner loop (the step size search) inside the outer loop of gradient updates; compared to gradient descent with a well-chosen fixed step size, it does not save iterations.

5.4 Exact line search

Exact line search chooses the step size that achieves the greatest decrease along the direction of the negative gradient. The algorithm can be summarized as:

1. Initialize $x^{(0)}$.
2. Compute the gradient $\nabla f(x^{(k)})$, for $k = 0, 1, 2, \dots$
3. At each step, find $t = \operatorname{argmin}_{s \ge 0} f(x^{(k)} - s \nabla f(x^{(k)}))$.
4. Update $x^{(k+1)} = x^{(k)} - t \nabla f(x^{(k)})$.
5. Go to 2 and repeat until convergence.

In practice, it is often not possible to solve the minimization in step 3 exactly. Moreover, approximations to exact line search are usually not much more efficient than backtracking, so the extra effort is usually not worthwhile.
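For a general $f$ the inner minimization in step 3 must itself be approximated numerically, but for a quadratic $f(x) = \frac{1}{2} x^T A x$ it has a closed form: $\operatorname{argmin}_s f(x - sg) = (g^T g)/(g^T A g)$ with $g = \nabla f(x) = Ax$. A sketch (the test problem is an illustrative choice, not from the notes):

```python
import numpy as np

A = np.diag([10.0, 1.0])       # f(x) = (10 x1^2 + x2^2) / 2 = x^T A x / 2
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x

x = np.array([1.0, 1.0])
for _ in range(20):
    g = grad_f(x)
    if g @ g < 1e-18:
        break
    s = (g @ g) / (g @ A @ g)  # exact minimizer of f along the ray x - s*g
    x = x - s * g
print(x, f(x))
```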

5.5 Convergence analysis

Assume that $f$ is convex and differentiable with $\mathrm{dom}(f) = \mathbb{R}^n$, and additionally that for any $x$ and $y$
$$\|\nabla f(x) - \nabla f(y)\|_2 \le L \|x - y\|_2,$$
i.e., $\nabla f$ is Lipschitz continuous with constant $L > 0$. Then we have the following theorem.

Theorem 5.1 Gradient descent with fixed step size $t \le 1/L$ satisfies
$$f(x^{(k)}) - f(x^\star) \le \frac{\|x^{(0)} - x^\star\|_2^2}{2tk} \quad (5.1)$$

Proof: Since $\nabla f$ is Lipschitz continuous with constant $L$,
$$f(y) \le f(x) + \nabla f(x)^T (y - x) + \frac{L}{2} \|y - x\|_2^2 \quad \text{for all } x, y \quad (5.2)$$
(Note: the proof of the inequality (5.2) is part of Homework 1.) Take $y$ to be the gradient descent update:
$$y = x^+ = x - t \nabla f(x) \quad (5.3)$$
$$\nabla f(x) = \frac{x - x^+}{t} \quad (5.4)$$
Plugging (5.3) into (5.2), we have
$$f(x^+) \le f(x) + \nabla f(x)^T (-t \nabla f(x)) + \frac{L}{2} t^2 \|\nabla f(x)\|_2^2 = f(x) - \left(1 - \frac{Lt}{2}\right) t \|\nabla f(x)\|_2^2 \quad (5.5)$$
Since $0 < t \le 1/L$, we have $1 - Lt/2 \ge 1/2$, so (5.5) gives
$$f(x^+) \le f(x) - \frac{t}{2} \|\nabla f(x)\|_2^2 = f(x) - \frac{t}{2} \cdot \frac{\|x - x^+\|_2^2}{t^2} = f(x) - \frac{1}{2t} \|x - x^+\|_2^2 \quad (5.6)$$
Using the convexity of $f$, where $x^\star$ is a minimizer of $f$,
$$f(x^\star) \ge f(x) + \nabla f(x)^T (x^\star - x) \quad \Longrightarrow \quad f(x) \le f(x^\star) + \nabla f(x)^T (x - x^\star) \quad (5.7)$$
Combining (5.6) and (5.7), and substituting $\nabla f(x) = (x - x^+)/t$ from (5.4),
$$
\begin{aligned}
f(x^+) &\le f(x^\star) + \nabla f(x)^T (x - x^\star) - \frac{1}{2t} \|x - x^+\|_2^2 \\
&= f(x^\star) + \frac{1}{t} (x - x^+)^T (x - x^\star) - \frac{1}{2t} \|x - x^+\|_2^2 \\
&= f(x^\star) + \frac{1}{2t} \left( 2 (x - x^+)^T (x - x^\star) - \|x - x^+\|_2^2 + \|x - x^\star\|_2^2 - \|x - x^\star\|_2^2 \right) \\
&= f(x^\star) + \frac{1}{2t} \left( \|x - x^\star\|_2^2 - \left( \|x - x^\star\|_2^2 - 2 (x - x^+)^T (x - x^\star) + \|x - x^+\|_2^2 \right) \right) \\
&= f(x^\star) + \frac{1}{2t} \left( \|x - x^\star\|_2^2 - \|x^+ - x^\star\|_2^2 \right)
\end{aligned}
\quad (5.8)
$$

Moving $f(x^\star)$ to the left-hand side of (5.8), we obtain
$$f(x^+) - f(x^\star) \le \frac{1}{2t} \left( \|x - x^\star\|_2^2 - \|x^+ - x^\star\|_2^2 \right) \quad (5.9)$$
Summing over iterations, the right-hand side telescopes:
$$\sum_{i=1}^k \left( f(x^{(i)}) - f(x^\star) \right) \le \frac{1}{2t} \left( \|x^{(0)} - x^\star\|_2^2 - \|x^{(k)} - x^\star\|_2^2 \right) \le \frac{1}{2t} \|x^{(0)} - x^\star\|_2^2 \quad (5.10)$$
Since $f(x^{(k)})$ is nonincreasing in $k$, we have
$$f(x^{(k)}) - f(x^\star) \le \frac{1}{k} \sum_{i=1}^k \left( f(x^{(i)}) - f(x^\star) \right) \le \frac{\|x^{(0)} - x^\star\|_2^2}{2tk} \quad (5.11)$$

Note: We say gradient descent has convergence rate $O(1/k)$. In other words, to get $f(x^{(k)}) - f(x^\star) \le \varepsilon$, we need $O(1/\varepsilon)$ iterations.
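The guarantee (5.11) can be checked numerically. Below is a sketch on the running example $f(x) = (10 x_1^2 + x_2^2)/2$, for which $L = 10$, $x^\star = 0$, and $f(x^\star) = 0$ (the check itself is an illustration, not part of the notes):

```python
import numpy as np

f = lambda x: (10 * x[0]**2 + x[1]**2) / 2
grad_f = lambda x: np.array([10.0 * x[0], x[1]])

L = 10.0
t = 1.0 / L                           # fixed step size t = 1/L
x0 = np.array([1.0, 1.0])
x = x0.copy()
for k in range(1, 101):
    x = x - t * grad_f(x)
    bound = (x0 @ x0) / (2 * t * k)   # ||x^(0) - x*||^2 / (2 t k), from (5.11)
    assert f(x) <= bound + 1e-12      # the O(1/k) bound holds at every k
print("gap at k=100:", f(x))
```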

5.5.1 Convergence analysis for backtracking line search

With the same assumptions ($f$ is convex and differentiable with $\mathrm{dom}(f) = \mathbb{R}^n$, and $\nabla f$ is Lipschitz continuous with constant $L > 0$), we have the following theorem, where $f^\star = f(x^\star)$ denotes the optimal value.

Theorem 5.2 Gradient descent with backtracking line search satisfies
$$f(x^{(k)}) - f^\star \le \frac{\|x^{(0)} - x^\star\|_2^2}{2 t_{\min} k} \quad (5.12)$$
where $t_{\min} = \min\{1, \beta/L\}$.

Proof: Suppose we run the line search with $\alpha = 1/2$ and start each search at $t = 1$. The selected step size then satisfies $t_k \ge t_{\min} = \min\{1, \beta/L\}$. From (5.9),
$$f(x^{(i)}) \le f^\star + \frac{1}{2 t_i} \left( \|x^{(i-1)} - x^\star\|_2^2 - \|x^{(i)} - x^\star\|_2^2 \right) \le f^\star + \frac{1}{2 t_{\min}} \left( \|x^{(i-1)} - x^\star\|_2^2 - \|x^{(i)} - x^\star\|_2^2 \right) \quad (5.13)$$
Summing these upper bounds over iterations, as before, gives
$$f(x^{(k)}) - f^\star \le \frac{1}{k} \sum_{i=1}^k \left( f(x^{(i)}) - f^\star \right) \le \frac{1}{2 k t_{\min}} \|x^{(0)} - x^\star\|_2^2 \quad (5.14)$$

Conclusion: Backtracking line search has the same $O(1/k)$ bound as a constant step size. Note: if $\beta$ is not too small, then we do not lose much compared to a fixed step size; $\beta = 1$ would recover exactly the fixed-step-size rate.

5.5.2 Convergence analysis under strong convexity

Reminder: strong convexity of $f$ means that $f(x) - \frac{m}{2} \|x\|_2^2$ is convex for some $m > 0$. If $f$ is twice differentiable, this implies
$$\nabla^2 f(x) \succeq mI \quad \text{for any } x \quad (5.15)$$
Strong convexity gives a sharper lower bound than plain convexity:
$$f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{m}{2} \|y - x\|_2^2 \quad \text{for all } x, y \quad (5.16)$$
Under the Lipschitz assumption as above, together with strong convexity, we have the following theorem.

Theorem 5.3 Gradient descent with fixed step size $t \le 2/(m + L)$, or with backtracking line search, satisfies
$$f(x^{(k)}) - f^\star \le c^k \frac{L}{2} \|x^{(0)} - x^\star\|_2^2 \quad (5.17)$$
where $0 < c < 1$.

Proof: If $x^+ = x - t \nabla f(x)$ and $0 < t \le 2/(m + L)$, then
$$
\begin{aligned}
\|x^+ - x^\star\|_2^2 &= \|x - t \nabla f(x) - x^\star\|_2^2 \\
&= \|x - x^\star\|_2^2 - 2t \nabla f(x)^T (x - x^\star) + t^2 \|\nabla f(x)\|_2^2 \\
&\le \left(1 - t \frac{2mL}{m + L}\right) \|x - x^\star\|_2^2 + t \left(t - \frac{2}{m + L}\right) \|\nabla f(x)\|_2^2 \\
&\le \left(1 - t \frac{2mL}{m + L}\right) \|x - x^\star\|_2^2
\end{aligned}
\quad (5.18)
$$
where the first inequality uses the standard bound $\nabla f(x)^T (x - x^\star) \ge \frac{mL}{m + L} \|x - x^\star\|_2^2 + \frac{1}{m + L} \|\nabla f(x)\|_2^2$ for $m$-strongly convex, $L$-smooth $f$. Then we have
$$\|x^{(k)} - x^\star\|_2^2 \le c^k \|x^{(0)} - x^\star\|_2^2, \quad c = 1 - t \frac{2mL}{m + L} \quad (5.19)$$
For $t = 2/(m + L)$, we get $c = \left(\frac{\gamma - 1}{\gamma + 1}\right)^2$ with $\gamma = L/m$. The bound on the function value then follows from the Lipschitz bound (5.2) applied at $x^\star$ (where $\nabla f(x^\star) = 0$):
$$f(x^{(k)}) - f^\star \le \frac{L}{2} \|x^{(k)} - x^\star\|_2^2 \le c^k \frac{L}{2} \|x^{(0)} - x^\star\|_2^2 \quad (5.20)$$

This theorem shows that under strong convexity the convergence rate is $O(c^k)$. In other words, to get $f(x^{(k)}) - f^\star \le \varepsilon$, we need only $O(\log(1/\varepsilon))$ iterations.
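Again a numerical sanity check (illustrative, not from the notes): for the quadratic $f(x) = (10 x_1^2 + x_2^2)/2$ we have $m = 1$ and $L = 10$, so with $t = 2/(m + L)$ the bound (5.20) should hold with $c = ((\gamma - 1)/(\gamma + 1))^2$ and $\gamma = L/m = 10$.

```python
import numpy as np

f = lambda x: (10 * x[0]**2 + x[1]**2) / 2
grad_f = lambda x: np.array([10.0 * x[0], x[1]])

m, L = 1.0, 10.0
t = 2 / (m + L)
gamma = L / m
c = ((gamma - 1) / (gamma + 1))**2       # contraction factor from (5.19)

x0 = np.array([1.0, 1.0])
x = x0.copy()
for k in range(1, 51):
    x = x - t * grad_f(x)
    bound = c**k * (L / 2) * (x0 @ x0)   # right-hand side of (5.20)
    assert f(x) <= bound + 1e-15
print("f gap at k=50:", f(x))            # decays geometrically, like c^k
```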

5.5.3 Example of conditions

Let's look at these conditions for a simple problem, $f(\beta) = \frac{1}{2} \|y - X\beta\|_2^2$.

Lipschitz continuity of $\nabla f$: this means $\nabla^2 f(x) \preceq LI$. Since $\nabla^2 f(\beta) = X^T X$, we can take $L = \sigma_{\max}(X)^2$.

Strong convexity of $f$: this means $\nabla^2 f(x) \succeq mI$. Since $\nabla^2 f(\beta) = X^T X$, we have $m = \sigma_{\min}(X)^2$.

If $X$ is wide (i.e., $X$ is $n \times p$ with $p > n$), then $\sigma_{\min}(X) = 0$, and $f$ can't be strongly convex. Even if $\sigma_{\min}(X) > 0$, the condition number $L/m = \sigma_{\max}(X)^2 / \sigma_{\min}(X)^2$ can be very large.

5.5.4 Practicalities

Stopping rule: stop when $\|\nabla f(x)\|_2$ is small. Recall that $\nabla f(x^\star) = 0$ at the solution $x^\star$. If $f$ is strongly convex with parameter $m$, then
$$\|\nabla f(x)\|_2 \le \sqrt{2m\varepsilon} \quad \Longrightarrow \quad f(x) - f^\star \le \varepsilon$$

Pros and cons of gradient descent:

- Pro: simple idea, and each iteration is cheap.
- Pro: very fast for well-conditioned, strongly convex problems.
- Con: often slow, because interesting problems aren't strongly convex or well-conditioned.
- Con: can't handle nondifferentiable functions.