More First-Order Optimization Algorithms

Yinyu Ye
Department of Management Science and Engineering
Stanford University, Stanford, CA 94305, U.S.A.
http://www.stanford.edu/~yyye
Chapters 3, 8, 3

The SDM for Unconstrained Convex Lipschitz Optimization

Here we consider $f(x)$ being convex, differentiable everywhere, and satisfying the (first-order) $\beta$-Lipschitz condition. Given the knowledge of $\beta$, we again adopt the fixed step-size rule:

$$x^{k+1} = x^k - \frac{1}{\beta}\nabla f(x^k). \qquad (1)$$

Theorem 1. For convex Lipschitz optimization, the Steepest Descent Method generates a sequence of solutions such that

$$f(x^{k+1}) - f(x^*) \le \frac{\beta}{k+2}\,\|x^0 - x^*\|^2$$

and

$$\min_{l=0,\ldots,k} \|\nabla f(x^l)\|^2 \le \frac{4\beta^2}{(k+1)(k+2)}\,\|x^0 - x^*\|^2,$$

where $x^*$ is a minimizer of the problem.

Proof: For simplicity, we let $\delta^k = f(x^k) - f(x^*)\ (\ge 0)$, $g^k = \nabla f(x^k)$, and $\Delta^k = x^k - x^*$ in the rest of the proof. As we have proved for general Lipschitz optimization,

$$\delta^{k+1} - \delta^k = f(x^{k+1}) - f(x^k) \le -\frac{1}{2\beta}\|g^k\|^2,$$

that is,

$$\delta^k - \delta^{k+1} \ge \frac{1}{2\beta}\|g^k\|^2. \qquad (2)$$

Furthermore, from convexity,

$$-\delta^k = f(x^*) - f(x^k) \ge (g^k)^T(x^* - x^k) = -(g^k)^T\Delta^k,$$

that is,

$$\delta^k \le (g^k)^T\Delta^k. \qquad (3)$$

Thus, from (2) and (3),

$$\begin{aligned}
\delta^{k+1} = \delta^{k+1} - \delta^k + \delta^k
&\le -\frac{1}{2\beta}\|g^k\|^2 + (g^k)^T\Delta^k\\
&= -\frac{\beta}{2}\|x^{k+1} - x^k\|^2 - \beta(x^{k+1} - x^k)^T\Delta^k \quad \text{(using (1))}\\
&= -\frac{\beta}{2}\left(\|x^{k+1} - x^k\|^2 + 2(x^{k+1} - x^k)^T\Delta^k\right)\\
&= -\frac{\beta}{2}\left(\|\Delta^{k+1} - \Delta^k\|^2 + 2(\Delta^{k+1} - \Delta^k)^T\Delta^k\right)\\
&= \frac{\beta}{2}\left(\|\Delta^k\|^2 - \|\Delta^{k+1}\|^2\right). \qquad (4)
\end{aligned}$$

Summing (4) from $1$ to $k+1$, we have

$$\sum_{l=1}^{k+1}\delta^l \le \frac{\beta}{2}\left(\|\Delta^0\|^2 - \|\Delta^{k+1}\|^2\right) \le \frac{\beta}{2}\|\Delta^0\|^2.$$

From the proof of the Corollary of the last lecture, we have $\delta^0 \le \frac{\beta}{2}\|\Delta^0\|^2$. Thus,

$$\sum_{l=0}^{k+1}\delta^l \le \beta\|\Delta^0\|^2, \qquad (5)$$

and

$$\begin{aligned}
\sum_{l=0}^{k+1}\delta^l = \sum_{l=0}^{k+1}(l+1-l)\,\delta^l
&= \sum_{l=0}^{k+1}(l+1)\,\delta^l - \sum_{l=0}^{k+1}l\,\delta^l
= \sum_{l=1}^{k+2}l\,\delta^{l-1} - \sum_{l=1}^{k+1}l\,\delta^l\\
&= (k+2)\,\delta^{k+1} + \sum_{l=1}^{k+1}l\,\delta^{l-1} - \sum_{l=1}^{k+1}l\,\delta^l
= (k+2)\,\delta^{k+1} + \sum_{l=1}^{k+1}l\,(\delta^{l-1} - \delta^l)\\
&\ge (k+2)\,\delta^{k+1} + \sum_{l=1}^{k+1}\frac{l}{2\beta}\|g^{l-1}\|^2,
\end{aligned}$$

where the inequality comes from (2). Let $\bar g = \min_{l=0,\ldots,k}\|g^l\|$. Then we finally have

$$(k+2)\,\delta^{k+1} + \frac{(k+1)(k+2)}{2}\cdot\frac{1}{2\beta}\,\bar g^2 \le \beta\|\Delta^0\|^2, \qquad (6)$$

and dropping either term on the left of (6) yields the two bounds of the theorem, which completes the proof.
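To make the fixed-step rule (1) and the $O(1/k)$ guarantee of Theorem 1 concrete, here is a minimal Python/NumPy sketch (not from the lecture; the name `sdm` and the least-squares test problem are illustrative choices, with $\beta = \lambda_{\max}(A^TA)$):

```python
import numpy as np

def sdm(grad, x0, beta, iters):
    """Steepest descent with the fixed step size 1/beta of rule (1)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - grad(x) / beta       # x^{k+1} = x^k - (1/beta) grad f(x^k)
    return x

# f(x) = 0.5*||Ax - b||^2 is beta-Lipschitz with beta = lambda_max(A^T A).
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
beta = np.linalg.norm(A.T @ A, 2)
x = sdm(lambda v: A.T @ (A @ v - b), np.zeros(5), beta, 500)
```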

The Accelerated Steepest Descent Method (ASDM)

There is an accelerated steepest descent method (Nesterov 1983) that works as follows:

$$\lambda^0 = 0,\quad \lambda^{k+1} = \frac{1 + \sqrt{1 + 4(\lambda^k)^2}}{2},\quad \alpha_k = \frac{1 - \lambda^k}{\lambda^{k+1}}, \qquad (7)$$

$$x^{k+1} = \bar x^k - \frac{1}{\beta}\nabla f(\bar x^k),\quad \bar x^{k+1} = (1 - \alpha_k)\,x^{k+1} + \alpha_k\,x^k. \qquad (8)$$

Note that $(\lambda^k)^2 = \lambda^{k+1}(\lambda^{k+1} - 1)$, $\lambda^k > k/2$ and $\alpha_k \le 0$. One can prove:

Theorem 2. $f(x^{k+1}) - f(x^*) \le \frac{2\beta}{k^2}\,\|x^0 - x^*\|^2$ for all $k \ge 1$.
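A direct transcription of (7)-(8) into Python/NumPy (the name `asdm` and the gradient-oracle interface are illustrative). Since $\alpha_k \le 0$, the second update in (8) extrapolates $\bar x^{k+1}$ beyond $x^{k+1}$:

```python
import numpy as np

def asdm(grad, x0, beta, iters):
    """Accelerated steepest descent, scheme (7)-(8), with x_bar^0 = x^0."""
    x = np.asarray(x0, dtype=float)
    x_bar, lam = x.copy(), 0.0
    for _ in range(iters):
        lam_next = (1 + np.sqrt(1 + 4 * lam**2)) / 2      # (7)
        alpha = (1 - lam) / lam_next                      # alpha_k <= 0
        x_next = x_bar - grad(x_bar) / beta               # (8): gradient step
        x_bar = (1 - alpha) * x_next + alpha * x          # (8): extrapolation
        x, lam = x_next, lam_next
    return x
```

Swapping `asdm` for `sdm` in the earlier example improves the guaranteed rate from $O(1/k)$ to $O(1/k^2)$ at essentially no extra cost per iteration.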

Convergence Analysis of ASDM

Again for simplicity, we let $\Delta^k = \lambda^k\bar x^k - (\lambda^k - 1)x^k - x^*$, $g^k = \nabla f(\bar x^k)$ and $\delta^k = f(x^k) - f(x^*)\ (\ge 0)$ in the following.

Applying the Lemma for $x = x^{k+1}$ and $y = \bar x^k$, convexity of $f$ at $x^k$, and (8), we have

$$\begin{aligned}
\delta^{k+1} - \delta^k = f(x^{k+1}) - f(\bar x^k) + f(\bar x^k) - f(x^k)
&\le -\frac{\beta}{2}\|x^{k+1} - \bar x^k\|^2 + f(\bar x^k) - f(x^k)\\
&\le -\frac{\beta}{2}\|x^{k+1} - \bar x^k\|^2 + (g^k)^T(\bar x^k - x^k)\\
&= -\frac{\beta}{2}\|x^{k+1} - \bar x^k\|^2 - \beta(x^{k+1} - \bar x^k)^T(\bar x^k - x^k). \qquad (9)
\end{aligned}$$

Applying the Lemma for $x = x^{k+1}$ and $y = \bar x^k$, convexity of $f$ at $x^*$, and (8), we have

$$\begin{aligned}
\delta^{k+1} = f(x^{k+1}) - f(\bar x^k) + f(\bar x^k) - f(x^*)
&\le -\frac{\beta}{2}\|x^{k+1} - \bar x^k\|^2 + (g^k)^T(\bar x^k - x^*)\\
&= -\frac{\beta}{2}\|x^{k+1} - \bar x^k\|^2 - \beta(x^{k+1} - \bar x^k)^T(\bar x^k - x^*). \qquad (10)
\end{aligned}$$

Multiplying (9) by $\lambda^k(\lambda^k - 1)$ and (10) by $\lambda^k$, respectively, and summing the two, we have

$$\begin{aligned}
(\lambda^k)^2\delta^{k+1} - (\lambda^{k-1})^2\delta^k
&\le -\frac{\beta}{2}(\lambda^k)^2\|x^{k+1} - \bar x^k\|^2 - \lambda^k\beta(x^{k+1} - \bar x^k)^T\Delta^k\\
&= -\frac{\beta}{2}\left((\lambda^k)^2\|x^{k+1} - \bar x^k\|^2 + 2\lambda^k(x^{k+1} - \bar x^k)^T\Delta^k\right)\\
&= -\frac{\beta}{2}\left(\|\lambda^k x^{k+1} - (\lambda^k - 1)x^k - x^*\|^2 - \|\Delta^k\|^2\right)\\
&= \frac{\beta}{2}\left(\|\Delta^k\|^2 - \|\lambda^k x^{k+1} - (\lambda^k - 1)x^k - x^*\|^2\right),
\end{aligned}$$

where we used $(\lambda^{k-1})^2 = \lambda^k(\lambda^k - 1)$. Using (7) and (8) we can derive

$$\lambda^k x^{k+1} - (\lambda^k - 1)x^k = \lambda^{k+1}\bar x^{k+1} - (\lambda^{k+1} - 1)x^{k+1}.$$

Thus,

$$(\lambda^k)^2\delta^{k+1} - (\lambda^{k-1})^2\delta^k \le \frac{\beta}{2}\left(\|\Delta^k\|^2 - \|\Delta^{k+1}\|^2\right). \qquad (11)$$

Summing (11) from $1$ to $k$, we have

$$\delta^{k+1} \le \frac{\beta}{2(\lambda^k)^2}\|\Delta^1\|^2 \le \frac{2\beta}{k^2}\|x^0 - x^*\|^2,$$

since $\lambda^k \ge k/2$ and $\Delta^1 = \bar x^1 - x^* = x^0 - x^*$.

First-Order Algorithms for Simply Constrained Optimization

Consider the conic nonlinear optimization problem:

$$\min\ f(x)\quad \text{s.t.}\ x \in K.$$

Nonnegative Linear Regression: given data $A \in R^{m\times n}$ and $b \in R^m$,

$$\min\ f(x) = \frac{1}{2}\|Ax - b\|^2\quad \text{s.t.}\ x \ge 0,$$

where $\nabla f(x) = A^T(Ax - b)$.

Semidefinite Linear Regression: given data $A_i \in S^n$ for $i = 1,\ldots,m$ and $b \in R^m$,

$$\min\ f(X) = \frac{1}{2}\|\mathcal{A}X - b\|^2\quad \text{s.t.}\ X \succeq 0,$$

where $\nabla f(X) = \mathcal{A}^T(\mathcal{A}X - b)$, with $\mathcal{A}X = (A_1\bullet X,\ \ldots,\ A_m\bullet X)^T$ and $\mathcal{A}^Ty = \sum_{i=1}^m y_iA_i$.
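Both gradients are one-liners; the following Python/NumPy sketch (illustrative names, not from the lecture) computes them, storing the semidefinite data as a list of symmetric matrices:

```python
import numpy as np

def grad_nonneg_reg(x, A, b):
    """grad f(x) = A^T (Ax - b) for nonnegative linear regression."""
    return A.T @ (A @ x - b)

def grad_sdp_reg(X, A_list, b):
    """grad f(X) = sum_i (A_i . X - b_i) A_i for semidefinite linear regression."""
    residual = np.array([np.tensordot(Ai, X) for Ai in A_list]) - b  # A X - b
    return sum(r * Ai for r, Ai in zip(residual, A_list))            # A^T(residual)
```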

I: The Logarithmic Barrier Method

One may add a barrier regularization:

$$\min\ \phi(x) = f(x) + \mu B(x),$$

where $B(x)$ (or $B(X)$) is a barrier function keeping $x$ (or $X$) in the interior of the cone $K$, and $\mu$ is a small positive barrier parameter.

$$K = R^n_+:\quad B(x) = -\sum_j \log(x_j),\qquad \nabla B(x) = -\left(\frac{1}{x_1},\ \ldots,\ \frac{1}{x_n}\right)^T \in R^n.$$

$$K = S^n_+:\quad B(X) = -\log(\det(X)),\qquad \nabla B(X) = -X^{-1} \in S^n.$$

Optimality Conditions for Optimization with Logarithmic Barrier

Optimization in the nonnegative cone ($x > 0$):

$$\nabla\phi(x) = \nabla f(x) - \mu X^{-1}e = 0,\quad X = \mathrm{diag}(x),$$

and the scaled gradient vector

$$X\nabla\phi(x) = X\nabla f(x) - \mu e$$

converging to zero implies $\nabla f(x) > 0$ for any positive $\mu$. In practice, we set $\mu = \epsilon$, the tolerance.

Optimization in the semidefinite cone ($X \succ 0$):

$$\nabla\phi(X) = \nabla f(X) - \mu X^{-1} = 0,$$

and the scaled gradient matrix

$$X^{1/2}\nabla\phi(X)X^{1/2} = X^{1/2}\nabla f(X)X^{1/2} - \mu I$$

converging to zero implies $\nabla f(X) \succ 0$ for any positive $\mu$. (Suggested Project I)
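As a sketch of the barrier approach on the nonnegative-cone regression example, one can run SDM directly on $\phi(x) = f(x) + \mu B(x)$, halving the step whenever the iterate would leave the interior. The function name and the step-halving safeguard are assumptions for illustration ($\nabla\phi$ is not globally Lipschitz, so the fixed step $1/\beta$ is only a starting guess):

```python
import numpy as np

def barrier_sdm(A, b, mu, beta, iters):
    """SDM on phi(x) = 0.5||Ax-b||^2 - mu*sum(log x), kept in the interior x > 0."""
    x = np.ones(A.shape[1])
    for _ in range(iters):
        g = A.T @ (A @ x - b) - mu / x   # grad phi(x) = grad f(x) - mu * X^{-1} e
        t = 1.0 / beta
        while np.any(x - t * g <= 0):    # halve the step to stay interior
            t *= 0.5
        x -= t * g
    return x
```

Driving $\mu$ down to the tolerance $\epsilon$ then recovers an approximate minimizer over $x \ge 0$.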

II: SDM Followed by Feasible-Region Projection

$$\hat x^{k+1} = x^k - \frac{1}{\beta}\nabla f(x^k),\qquad x^{k+1} = \mathrm{Proj}_K(\hat x^{k+1}).$$

For example: if $K = \{x : x \ge 0\}$, then

$$x^{k+1} = \mathrm{Proj}_K(\hat x^{k+1}) = \max\{0,\ \hat x^{k+1}\};$$

if $K = \{X : X \succeq 0\}$, then factorize $\hat X^{k+1} = V^T\Lambda V$ and let

$$X^{k+1} = \mathrm{Proj}_K(\hat X^{k+1}) = V^T\max\{0,\ \Lambda\}V.$$

The drawback is that the eigenvalue factorization may be costly in each iteration (use the $L^TDL$ factorization in Suggested Project I?).
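Both projections are short in Python/NumPy (an illustrative sketch; note that `numpy.linalg.eigh` returns the convention $\hat X = V\Lambda V^T$, the transpose of the slide's $V^T\Lambda V$):

```python
import numpy as np

def proj_nonneg(x_hat):
    """Projection onto K = {x : x >= 0}: componentwise max with zero."""
    return np.maximum(0.0, x_hat)

def proj_psd(X_hat):
    """Projection onto K = {X : X PSD}: zero out the negative eigenvalues."""
    lam, V = np.linalg.eigh((X_hat + X_hat.T) / 2)  # symmetrize, then factorize
    return (V * np.maximum(0.0, lam)) @ V.T

# One type-II step, e.g. for the nonnegative cone:
# x_next = proj_nonneg(x - grad_f(x) / beta)
```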

Value-Iteration for MDP as Gradient-Ascent and Feasible-Region-Projection

Let $y \in R^m$ represent the cost-to-go values of the $m$ states, the $i$th entry for the $i$th state, of a given policy. The MDP problem entails choosing the optimal value vector $y^*$ such that it satisfies

$$y_i^* = \min_{j\in A_i}\{c_j + \gamma p_j^Ty^*\},\ \forall i,$$

which can be cast as an optimal solution of the linear program

$$\max_y\ b^Ty\quad \text{s.t.}\ y_i \le c_j + \gamma p_j^Ty,\ \forall j \in A_i,\ \forall i,$$

where $b$ is any fixed positive vector.

The Value-Iteration (VI) Method is, starting from any $y^0$,

$$y_i^{k+1} = \min_{j\in A_i}\{c_j + \gamma p_j^Ty^k\},\ \forall i.$$

If the initial $y^0$ is strictly feasible for state $i$, that is, $y_i^0 < c_j + \gamma p_j^Ty^0,\ \forall j \in A_i$, then $y_i^k$ would be increasing in the VI iterations for all $i$ and $k$. On the other hand, if any of the inequalities is violated, then we have to decrease $y_i$ at least to $\min_{j\in A_i}\{c_j + \gamma p_j^Ty^0\}$.
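A minimal Python sketch of the VI update (the data layout — a list `C[i]` of action costs and a matching list `P[i]` of transition probability rows for each state `i` — is a hypothetical container choice, not the lecture's):

```python
import numpy as np

def value_iteration(C, P, gamma, iters):
    """VI: y_i <- min_{j in A_i} { c_j + gamma * p_j^T y }, repeated `iters` times."""
    m = len(C)
    y = np.zeros(m)
    for _ in range(iters):
        y = np.array([min(c + gamma * p @ y for c, p in zip(C[i], P[i]))
                      for i in range(m)])
    return y
```

By Theorem 3 below, each sweep is a $\gamma$-contraction in the $\infty$-norm, so $y^k$ converges to $y^*$ geometrically.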

Convergence of Value-Iteration for MDP

Theorem 3. Let the VI algorithm mapping be $A(v)_i = \min_{j\in A_i}\{c_j + \gamma p_j^Tv\},\ \forall i$. Then, for any two value vectors $u \in R^m$ and $v \in R^m$ and every state $i$,

$$|A(u)_i - A(v)_i| \le \gamma\|u - v\|_\infty,$$

which implies

$$\|A(u) - A(v)\|_\infty \le \gamma\|u - v\|_\infty.$$

Proof: Let $j_u$ and $j_v$ be the two arg-min actions for value vectors $u$ and $v$, respectively. Assume that $A(u)_i - A(v)_i \ge 0$, where the other case can be proved similarly. Then

$$0 \le A(u)_i - A(v)_i = (c_{j_u} + \gamma p_{j_u}^Tu) - (c_{j_v} + \gamma p_{j_v}^Tv) \le (c_{j_v} + \gamma p_{j_v}^Tu) - (c_{j_v} + \gamma p_{j_v}^Tv) = \gamma p_{j_v}^T(u - v) \le \gamma\|u - v\|_\infty,$$

where the first inequality is from the fact that $j_u$ is the arg-min action for value vector $u$, and the last inequality follows from the fact that the elements in $p_{j_v}$ are nonnegative and sum up to $1$.

There are many research issues in Suggested Project II.

III: SDM with Affine Scaling

Let an iterate solution $x^k > 0$ (for simplicity, let each entry of $x^k$ be bounded above by $1$). Then we can scale it to $e$, the vector of all ones, by $x' = (X^k)^{-1}x$, where $X^k$ is the diagonal matrix of the vector $x^k$. Consider the objective after scaling, $f'(x') = f(X^kx')$, where $\nabla f'(e) = X^k\nabla f(x^k)$. The new SDM iterate would be

$$x' = e - \alpha_kX^k\nabla f(x^k),$$

and after scaling back,

$$x^{k+1} = x^k - \alpha_k(X^k)^2\nabla f(x^k).$$

If the function $f$ is $\beta$-Lipschitz, then so is $f'$ with constant $\beta\|x^k\|_\infty^2$, and we can fix $\alpha_k = \frac{1}{\beta\|x^k\|_\infty^2}$ (or smaller to keep $x^{k+1} > 0$) so that

$$f(x^{k+1}) - f(x^k) \le -\frac{1}{2\beta\|x^k\|_\infty^2}\,\|X^k\nabla f(x^k)\|^2,$$

so that the scaled gradient/complementarity vector $X^k\nabla f(x^k)/\|x^k\|_\infty$ converges to zero.
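One affine-scaling step as a Python/NumPy sketch; the names are illustrative, and the step-halving loop is an assumed way of implementing the "or smaller to keep $x^{k+1} > 0$" safeguard:

```python
import numpy as np

def affine_scaling_step(x, grad, beta):
    """x+ = x - alpha * (X^k)^2 grad f(x), with X^k = diag(x) applied componentwise."""
    g = grad(x)
    alpha = 1.0 / (beta * np.max(x) ** 2)      # 1 / (beta * ||x||_inf^2)
    while np.any(x - alpha * x**2 * g <= 0):   # shrink to preserve positivity
        alpha *= 0.5
    return x - alpha * x**2 * g
```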

Affine Scaling for the SDP Cone

Let an iterate solution $X^k \succ 0$. Then one can scale it to $I$, the identity matrix, by $X' = (X^k)^{-1/2}X(X^k)^{-1/2}$. Now consider the objective function after scaling, $f'(X') = f((X^k)^{1/2}X'(X^k)^{1/2})$, where $\nabla f'(I) = (X^k)^{1/2}\nabla f(X^k)(X^k)^{1/2}$. The new SDM iterate would be

$$X' = I - \alpha_k(X^k)^{1/2}\nabla f(X^k)(X^k)^{1/2},$$

and after scaling back,

$$X^{k+1} = X^k - \alpha_kX^k\nabla f(X^k)X^k.$$

If the function $f$ is $\beta$-Lipschitz, then so is $f'$ with constant $\beta\|X^k\|^2$, and we can fix $\alpha_k = \frac{1}{\beta\|X^k\|^2}$ (or smaller to keep $X^{k+1} \succ 0$) so that

$$f(X^{k+1}) - f(X^k) \le -\frac{1}{2\beta\|X^k\|^2}\,\|(X^k)^{1/2}\nabla f(X^k)(X^k)^{1/2}\|^2,$$

so that the scaled gradient/complementarity matrix converges to zero.
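The matrix analogue, again as a hedged sketch (reading $\|X^k\|$ as the spectral norm is an assumption, and the eigenvalue check is an assumed implementation of the positive-definiteness safeguard):

```python
import numpy as np

def sdp_affine_scaling_step(X, grad, beta):
    """X+ = X - alpha * X grad f(X) X for an iterate X > 0 (positive definite)."""
    G = grad(X)
    alpha = 1.0 / (beta * np.linalg.norm(X, 2) ** 2)
    X_next = X - alpha * X @ G @ X
    while np.min(np.linalg.eigvalsh((X_next + X_next.T) / 2)) <= 0:
        alpha *= 0.5                            # shrink until X+ stays in the cone
        X_next = X - alpha * X @ G @ X
    return X_next
```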

IV: Gradient-Projection Method for Conic Optimization

Consider the nonnegative cone. At any iterate solution $x^k \ge 0$, we project the gradient vector onto the feasible-direction space at $x^k$:

$$g_j^k = \begin{cases}\nabla f(x^k)_j & \text{if } x_j^k > 0 \text{ or } \nabla f(x^k)_j < 0,\\ 0 & \text{otherwise.}\end{cases}$$

Then we apply the SDM with the projected gradient vector $g^k$; that is, take the largest step size $\alpha$ such that

$$x^{k+1} = x^k - \alpha g^k \ge 0,\qquad 0 \le \alpha \le \frac{1}{\beta}.$$

Does it converge? What is the relation with the type II method? What is the convergence speed? (Consider it a bonus question to Problem 6 of HW3.)
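A sketch of the projected direction and the largest admissible step (illustrative names; the componentwise mask follows the case definition of $g^k$ above):

```python
import numpy as np

def gp_step(x, grad_f, beta):
    """One gradient-projection step on K = {x : x >= 0}."""
    g = grad_f.copy()
    g[(x <= 0) & (grad_f >= 0)] = 0.0                # feasible-direction projection
    pos = g > 0                                      # these components decrease
    alpha = 1.0 / beta
    if pos.any():
        alpha = min(alpha, np.min(x[pos] / g[pos]))  # largest step keeping x >= 0
    return x - alpha * g
```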

V: Reduced Gradient Method — the Simplex Algorithm for LP

LP: $\min\ c^Tx$ s.t. $Ax = b,\ x \ge 0$, where $A \in R^{m\times n}$ has full row rank $m$.

Theorem 4 (The Fundamental Theorem of LP in Algebraic Form). Given (LP) and (LD) where $A$ has full row rank $m$:
i) if there is a feasible solution, there is a basic feasible solution (Carathéodory's theorem);
ii) if there is an optimal solution, there is an optimal basic solution.

High-Level Idea:
1. Initialization: Start at a BFS, or corner point, of the feasible polyhedron.
2. Test for Optimality: Compute the reduced gradient vector at the corner. If no descent and feasible direction can be found, stop and claim optimality at the current corner point; otherwise, select a new corner point and go to Step 2.

LP theorems depicted in two-variable space

[Figure 1: The LP Simplex Method — a two-variable polyhedron with constraint normals $a_1,\ldots,a_5$, the objective vector $c$, and an objective contour.]

When a Basic Feasible Solution is Optimal

Suppose the basis of a basic feasible solution is $A_B$ and the rest is $A_N$. One can transform the equality constraint to $A_B^{-1}Ax = A_B^{-1}b$, so that

$$x_B = A_B^{-1}b - A_B^{-1}A_Nx_N.$$

That is, we express $x_B$ in terms of $x_N$, the non-basic variables that are active for the constraints $x \ge 0$. Then the objective function equivalently becomes

$$c^Tx = c_B^Tx_B + c_N^Tx_N = c_B^TA_B^{-1}b - c_B^TA_B^{-1}A_Nx_N + c_N^Tx_N = c_B^TA_B^{-1}b + (c_N^T - c_B^TA_B^{-1}A_N)x_N.$$

The vector $r^T = c^T - c_B^TA_B^{-1}A$ is called the Reduced Gradient/Cost Vector, where $r_B = 0$ always.

Theorem 5. If the Reduced Gradient Vector satisfies $r^T = c^T - c_B^TA_B^{-1}A \ge 0$, then the BFS is optimal.

Proof: Let $y^T = c_B^TA_B^{-1}$ (called the Shadow Price Vector). Then $y$ is a dual feasible solution ($r = c - A^Ty \ge 0$) and $c^Tx = c_B^Tx_B = c_B^TA_B^{-1}b = y^Tb$; that is, the duality gap is zero.

The Simplex Algorithm Procedures

0. Initialize: Start at a BFS with basic index set $B$, and let $N$ denote the complementary index set.

1. Test for Optimality: Compute the Reduced Gradient Vector $r$ at the current BFS and let

$$r_e = \min_{j\in N}\{r_j\}.$$

If $r_e \ge 0$, stop; the current BFS is optimal.

2. Determine the Replacement: Increase $x_e$ while keeping all other non-basic variables at the zero value (inactive) and maintaining the equality constraints:

$$x_B = A_B^{-1}b - A_B^{-1}A_{.e}\,x_e\ (\ge 0).$$

If $x_e$ can be increased to $\infty$, stop; the problem is unbounded below. Otherwise, let the basic variable $x_o$ be the one first becoming $0$.

3. Update Basis: Update $B$ with $x_o$ being replaced by $x_e$, and return to Step 1.

(A numerical sketch of these steps follows the toy example below.)

A Toy Example

$$\begin{array}{rl}
\text{minimize} & -x_1 - 2x_2\\
\text{subject to} & x_1 + x_3 = 1\\
& x_2 + x_4 = 1\\
& x_1 + x_2 + x_5 = 1.5,
\end{array}$$

so that

$$A = \begin{pmatrix}1 & 0 & 1 & 0 & 0\\ 0 & 1 & 0 & 1 & 0\\ 1 & 1 & 0 & 0 & 1\end{pmatrix},\qquad
b = \begin{pmatrix}1\\ 1\\ 1.5\end{pmatrix},\qquad
c^T = (-1\ \ {-2}\ \ 0\ \ 0\ \ 0).$$

Consider the initial BFS with basic variables $B = \{3, 4, 5\}$ and $N = \{1, 2\}$.

Iteration 1:
1. $A_B = I$, $A_B^{-1} = I$, $y^T = (0\ \ 0\ \ 0)$ and $r_N = (-1\ \ {-2})$ — it's NOT optimal. Let $e = 2$.

2. Increase $x_2$ while

$$x_B = A_B^{-1}b - A_B^{-1}A_{.2}\,x_2 = \begin{pmatrix}1\\ 1\\ 1.5\end{pmatrix} - \begin{pmatrix}0\\ 1\\ 1\end{pmatrix}x_2.$$

We see $x_4$ becomes $0$ first.

3. The new basic variables are $B = \{3, 2, 5\}$ and $N = \{1, 4\}$.

Iteration 2:
1. $A_B = \begin{pmatrix}1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 1 & 1\end{pmatrix}$, $A_B^{-1} = \begin{pmatrix}1 & 0 & 0\\ 0 & 1 & 0\\ 0 & -1 & 1\end{pmatrix}$, $y^T = (0\ \ {-2}\ \ 0)$ and $r_N = (-1\ \ 2)$ — it's NOT optimal. Let $e = 1$.

2. Increase $x_1$ while

$$x_B = A_B^{-1}b - A_B^{-1}A_{.1}\,x_1 = \begin{pmatrix}1\\ 1\\ 0.5\end{pmatrix} - \begin{pmatrix}1\\ 0\\ 1\end{pmatrix}x_1.$$

We see $x_5$ becomes $0$ first.

3. The new basic variables are $B = \{3, 2, 1\}$ and $N = \{4, 5\}$.

Iteration 3:
1. $A_B = \begin{pmatrix}1 & 0 & 1\\ 0 & 1 & 0\\ 0 & 1 & 1\end{pmatrix}$, $A_B^{-1} = \begin{pmatrix}1 & 1 & -1\\ 0 & 1 & 0\\ 0 & -1 & 1\end{pmatrix}$, $y^T = (0\ \ {-1}\ \ {-1})$ and $r_N = (1\ \ 1)$ — it's Optimal.

Is the Simplex Method always convergent to a minimizer? Which condition of the Global Convergence Theorem failed?
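As a sanity check on the example, here is a minimal revised-simplex sketch of Steps 0-3 (illustrative code, 0-indexed, with no anti-cycling rule):

```python
import numpy as np

def simplex(A, b, c, basis):
    """Revised simplex from a known feasible basis; follows Steps 0-3 above."""
    B, n = list(basis), A.shape[1]
    while True:
        AB_inv = np.linalg.inv(A[:, B])
        xB = AB_inv @ b
        y = AB_inv.T @ c[B]                     # shadow prices y^T = c_B^T A_B^{-1}
        N = [j for j in range(n) if j not in B]
        r = c[N] - A[:, N].T @ y                # reduced costs on the non-basics
        if np.all(r >= -1e-12):                 # Step 1: optimality test
            x = np.zeros(n); x[B] = xB
            return x
        e = N[int(np.argmin(r))]                # entering variable
        d = AB_inv @ A[:, e]                    # Step 2: change in x_B as x_e grows
        if np.all(d <= 1e-12):
            raise ValueError("problem is unbounded below")
        ratios = np.full(len(d), np.inf)
        ratios[d > 1e-12] = xB[d > 1e-12] / d[d > 1e-12]
        B[int(np.argmin(ratios))] = e           # Step 3: replace the leaving variable

# Toy example above; the basis {3, 4, 5} is [2, 3, 4] with 0-indexing:
A = np.array([[1., 0, 1, 0, 0], [0, 1, 0, 1, 0], [1, 1, 0, 0, 1]])
b = np.array([1., 1, 1.5]); c = np.array([-1., -2, 0, 0, 0])
print(simplex(A, b, c, [2, 3, 4]))              # -> [0.5 1.  0.5 0.  0. ]
```

Without an anti-cycling (e.g. Bland's) rule, ties in the ratio test can make such an implementation stall on degenerate problems, which is exactly the point of the closing questions above.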