More First-Order Optimization Algorithms

Yinyu Ye
Department of Management Science and Engineering
Stanford University, Stanford, CA 94305, U.S.A.
http://www.stanford.edu/~yyye

Chapters 3, 8, 3
The SDM for Unconstrained Convex Lipschitz Optimization

Here we consider $f(x)$ being convex and differentiable everywhere and satisfying the (first-order) $\beta$-Lipschitz condition. Given the knowledge of $\beta$, we again adopt the fixed step-size rule:
$$x^{k+1} = x^k - \frac{1}{\beta}\nabla f(x^k). \quad (1)$$

Theorem 1. For convex Lipschitz optimization the Steepest Descent Method generates a sequence of solutions such that
$$f(x^{k+1}) - f(x^*) \le \frac{\beta}{k+2}\,\|x^0 - x^*\|^2 \quad\text{and}\quad \min_{l=0,\dots,k}\|\nabla f(x^l)\|^2 \le \frac{4\beta^2}{(k+1)(k+2)}\,\|x^0 - x^*\|^2,$$
where $x^*$ is a minimizer of the problem.

Proof: For simplicity, we let $\delta^k = f(x^k) - f(x^*)\ (\ge 0)$, $g^k = \nabla f(x^k)$, and $\Delta^k = x^k - x^*$ in the rest of the proof. As we have proved for general Lipschitz optimization,
$$\delta^{k+1} - \delta^k = f(x^{k+1}) - f(x^k) \le -\frac{1}{2\beta}\|g^k\|^2, \quad\text{that is,}\quad \delta^k - \delta^{k+1} \ge \frac{1}{2\beta}\|g^k\|^2. \quad (2)$$
Furthermore, from the convexity,
$$-\delta^k = f(x^*) - f(x^k) \ge (g^k)^T(x^* - x^k) = -(g^k)^T\Delta^k, \quad\text{that is,}\quad \delta^k \le (g^k)^T\Delta^k. \quad (3)$$

Thus, from (2) and (3),
$$\begin{aligned}
\delta^{k+1} &= \delta^{k+1} - \delta^k + \delta^k \\
&\le -\frac{1}{2\beta}\|g^k\|^2 + (g^k)^T\Delta^k \\
&= -\frac{\beta}{2}\|x^{k+1} - x^k\|^2 - \beta(x^{k+1} - x^k)^T\Delta^k \quad (\text{using } (1)) \\
&= -\frac{\beta}{2}\left(\|x^{k+1} - x^k\|^2 + 2(x^{k+1} - x^k)^T\Delta^k\right) \\
&= -\frac{\beta}{2}\left(\|\Delta^{k+1} - \Delta^k\|^2 + 2(\Delta^{k+1} - \Delta^k)^T\Delta^k\right) \\
&= \frac{\beta}{2}\left(\|\Delta^k\|^2 - \|\Delta^{k+1}\|^2\right). \quad (4)
\end{aligned}$$

Summing up (4) from $1$ to $k+1$, we have
$$\sum_{l=1}^{k+1}\delta^l \le \frac{\beta}{2}\left(\|\Delta^0\|^2 - \|\Delta^{k+1}\|^2\right) \le \frac{\beta}{2}\|\Delta^0\|^2.$$
From the proof of the Corollary of the last lecture, we have $\delta^0 \le \frac{\beta}{2}\|\Delta^0\|^2$. Thus,
$$\sum_{l=0}^{k+1}\delta^l \le \beta\|\Delta^0\|^2, \quad (5)$$
and
$$\begin{aligned}
\sum_{l=0}^{k+1}\delta^l &= \sum_{l=0}^{k+1}(l+1-l)\,\delta^l = \sum_{l=0}^{k+1}(l+1)\,\delta^l - \sum_{l=0}^{k+1} l\,\delta^l \\
&= \sum_{l=1}^{k+2} l\,\delta^{l-1} - \sum_{l=1}^{k+1} l\,\delta^l \\
&= (k+2)\,\delta^{k+1} + \sum_{l=1}^{k+1} l\,\delta^{l-1} - \sum_{l=1}^{k+1} l\,\delta^l \\
&= (k+2)\,\delta^{k+1} + \sum_{l=1}^{k+1} l\,(\delta^{l-1} - \delta^l) \\
&\ge (k+2)\,\delta^{k+1} + \sum_{l=1}^{k+1} \frac{l}{2\beta}\|g^{l-1}\|^2,
\end{aligned}$$
where the first inequality comes from (2). Let $\|g\| = \min_{l=0,\dots,k}\|g^l\|$. Then we finally have
$$(k+2)\,\delta^{k+1} + \frac{(k+1)(k+2)}{2}\cdot\frac{1}{2\beta}\|g\|^2 \le \beta\|\Delta^0\|^2, \quad (6)$$
which completes the proof.
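As a sanity check, the fixed-step rule (1) and the first bound of Theorem 1 can be exercised numerically. A minimal sketch, assuming a least-squares objective with randomly generated data (the data and names here are illustrative, not from the lecture):

```python
import numpy as np

# Fixed-step SDM on f(x) = 0.5*||Ax - b||^2, whose gradient is beta-Lipschitz
# with beta = lambda_max(A^T A).  Data below are illustrative assumptions.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)
beta = np.linalg.eigvalsh(A.T @ A).max()

x_star = np.linalg.lstsq(A, b, rcond=None)[0]   # a minimizer
x0 = np.zeros(5)
x = x0.copy()
K = 200
for _ in range(K):
    x = x - grad(x) / beta                      # rule (1)

# Theorem 1 with k+1 = K: f(x^K) - f(x*) <= beta/(K+1) * ||x0 - x*||^2
gap = f(x) - f(x_star)
bound = beta / (K + 1) * np.linalg.norm(x0 - x_star) ** 2
print(gap <= bound)  # True
```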
The Accelerated Steepest Descent Method (ASDM)

There is an accelerated steepest descent method (Nesterov '83) that works as follows:
$$\lambda^0 = 0, \quad \lambda^{k+1} = \frac{1 + \sqrt{1 + 4(\lambda^k)^2}}{2}, \quad \alpha^k = \frac{1 - \lambda^k}{\lambda^{k+1}}, \quad (7)$$
$$\hat{x}^{k+1} = x^k - \frac{1}{\beta}\nabla f(x^k), \quad x^{k+1} = (1 - \alpha^k)\,\hat{x}^{k+1} + \alpha^k\,\hat{x}^k. \quad (8)$$

Note that $(\lambda^k)^2 = \lambda^{k+1}(\lambda^{k+1} - 1)$, $\lambda^k > k/2$, and $\alpha^k \le 0$. One can prove:

Theorem 2.
$$f(\hat{x}^{k+1}) - f(x^*) \le \frac{2\beta}{k^2}\,\|x^0 - x^*\|^2, \quad \forall k \ge 1.$$
Convergence Analysis of ASDM

Again for simplification, we let $\Delta^k = \lambda^k x^k - (\lambda^k - 1)\hat{x}^k - x^*$, $g^k = \nabla f(x^k)$, and $\delta^k = f(\hat{x}^k) - f(x^*)\ (\ge 0)$ in the following.

Applying Lemma 1 for $x = \hat{x}^{k+1}$ and $y = x^k$, convexity of $f$, and (8), we have
$$\begin{aligned}
\delta^{k+1} - \delta^k &= f(\hat{x}^{k+1}) - f(x^k) + f(x^k) - f(\hat{x}^k) \\
&\le -\frac{\beta}{2}\|\hat{x}^{k+1} - x^k\|^2 + f(x^k) - f(\hat{x}^k) \\
&\le -\frac{\beta}{2}\|\hat{x}^{k+1} - x^k\|^2 + (g^k)^T(x^k - \hat{x}^k) \\
&= -\frac{\beta}{2}\|\hat{x}^{k+1} - x^k\|^2 - \beta(\hat{x}^{k+1} - x^k)^T(x^k - \hat{x}^k). \quad (9)
\end{aligned}$$

Applying Lemma 1 for $x = \hat{x}^{k+1}$ and $y = x^k$ together with convexity of $f$ at $x^*$, and (8), we have
$$\begin{aligned}
\delta^{k+1} &= f(\hat{x}^{k+1}) - f(x^k) + f(x^k) - f(x^*) \\
&\le -\frac{\beta}{2}\|\hat{x}^{k+1} - x^k\|^2 + f(x^k) - f(x^*) \\
&\le -\frac{\beta}{2}\|\hat{x}^{k+1} - x^k\|^2 + (g^k)^T(x^k - x^*) \\
&= -\frac{\beta}{2}\|\hat{x}^{k+1} - x^k\|^2 - \beta(\hat{x}^{k+1} - x^k)^T(x^k - x^*). \quad (10)
\end{aligned}$$
Multiplying (9) by $\lambda^k(\lambda^k - 1)$ and (10) by $\lambda^k$, respectively, and summing the two (using $(\lambda^{k-1})^2 = \lambda^k(\lambda^k - 1)$ from (7)), we have
$$\begin{aligned}
(\lambda^k)^2\delta^{k+1} - (\lambda^{k-1})^2\delta^k &\le -\frac{\beta}{2}(\lambda^k)^2\|\hat{x}^{k+1} - x^k\|^2 - \lambda^k\beta(\hat{x}^{k+1} - x^k)^T\Delta^k \\
&= -\frac{\beta}{2}\left((\lambda^k)^2\|\hat{x}^{k+1} - x^k\|^2 + 2\lambda^k(\hat{x}^{k+1} - x^k)^T\Delta^k\right) \\
&= -\frac{\beta}{2}\left(\|\lambda^k\hat{x}^{k+1} - (\lambda^k - 1)\hat{x}^k - x^*\|^2 - \|\Delta^k\|^2\right) \\
&= \frac{\beta}{2}\left(\|\Delta^k\|^2 - \|\lambda^k\hat{x}^{k+1} - (\lambda^k - 1)\hat{x}^k - x^*\|^2\right).
\end{aligned}$$

Using (7) and (8) we can derive
$$\lambda^k\hat{x}^{k+1} - (\lambda^k - 1)\hat{x}^k = \lambda^{k+1}x^{k+1} - (\lambda^{k+1} - 1)\hat{x}^{k+1},$$
so the last squared norm above is exactly $\|\Delta^{k+1}\|^2$. Thus,
$$(\lambda^k)^2\delta^{k+1} - (\lambda^{k-1})^2\delta^k \le \frac{\beta}{2}\left(\|\Delta^k\|^2 - \|\Delta^{k+1}\|^2\right). \quad (11)$$

Summing up (11) from $1$ to $k$, we have
$$\delta^{k+1} \le \frac{\beta}{2(\lambda^k)^2}\|\Delta^1\|^2 \le \frac{2\beta}{k^2}\|x^0 - x^*\|^2,$$
since $\lambda^k > k/2$, $\lambda^0 = 0$, and $\Delta^1 = x^0 - x^*$.
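A minimal sketch of the scheme (7)–(8) on the same kind of least-squares objective (all data and names are illustrative assumptions), checking the Theorem 2 bound:

```python
import numpy as np

# ASDM per (7)-(8): the gradient step from x^k produces xhat^{k+1}; the next
# x^{k+1} is an extrapolated combination of xhat^{k+1} and xhat^k.
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 8))
b = rng.standard_normal(30)
f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)
beta = np.linalg.eigvalsh(A.T @ A).max()
x_star = np.linalg.lstsq(A, b, rcond=None)[0]

K = 100
lam = 0.0
x = np.zeros(8)         # x^k, where the gradient is evaluated
xh = x.copy()           # xhat^k
for _ in range(K):
    lam_next = (1 + np.sqrt(1 + 4 * lam ** 2)) / 2   # (7)
    alpha = (1 - lam) / lam_next                     # alpha^k <= 0
    xh_next = x - grad(x) / beta                     # (8), gradient step
    x = (1 - alpha) * xh_next + alpha * xh           # (8), combination
    xh, lam = xh_next, lam_next

# Theorem 2 with k+1 = K (and x0 = 0, so ||x0 - x*|| = ||x*||):
gap = f(xh) - f(x_star)
bound = 2 * beta / (K - 1) ** 2 * np.linalg.norm(x_star) ** 2
print(gap <= bound)  # True
```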
First-Order Algorithms for Simply Constrained Optimization

Consider the conic nonlinear optimization problem:
$$\min\ f(x) \quad \text{s.t.}\quad x \in K.$$

Nonnegative Linear Regression: given data $A \in R^{m\times n}$ and $b \in R^m$,
$$\min\ f(x) = \frac{1}{2}\|Ax - b\|^2 \quad \text{s.t.}\quad x \ge 0;$$
where $\nabla f(x) = A^T(Ax - b)$.

Semidefinite Linear Regression: given data $A_i \in S^n$ for $i = 1,\dots,m$ and $b \in R^m$,
$$\min\ f(X) = \frac{1}{2}\|\mathcal{A}X - b\|^2 \quad \text{s.t.}\quad X \succeq 0;$$
where $\nabla f(X) = \mathcal{A}^T(\mathcal{A}X - b)$, with
$$\mathcal{A}X = \begin{pmatrix} A_1 \bullet X \\ \vdots \\ A_m \bullet X \end{pmatrix} \quad\text{and}\quad \mathcal{A}^T y = \sum_{i=1}^m y_i A_i.$$
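The two linear maps above can be sketched directly, including the adjoint identity $\langle \mathcal{A}X, y\rangle = \langle X, \mathcal{A}^T y\rangle$ that makes $\mathcal{A}^T$ the right operator in the gradient. The random symmetric data here are an illustrative assumption:

```python
import numpy as np

# calA maps a symmetric X to the vector of inner products A_i . X;
# calAT maps y to sum_i y_i A_i (its adjoint).
rng = np.random.default_rng(5)
n, m = 4, 3
As = [M + M.T for M in rng.standard_normal((m, n, n))]     # symmetric A_i

calA  = lambda X: np.array([np.sum(Ai * X) for Ai in As])  # (A X)_i = A_i . X
calAT = lambda y: sum(yi * Ai for yi, Ai in zip(y, As))    # A^T y

X = rng.standard_normal((n, n)); X = X + X.T
y = rng.standard_normal(m)
lhs = calA(X) @ y            # <A X, y>
rhs = np.sum(X * calAT(y))   # <X, A^T y>
print(np.isclose(lhs, rhs))  # True
```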
I: The Logarithmic Barrier Method

One may add a barrier regularization:
$$\min\ \phi(x) = f(x) + \mu B(x),$$
where $B(x)$ (or $B(X)$) is a barrier function keeping $x$ (or $X$) in the interior of cone $K$, and $\mu$ is a small positive barrier parameter.

$K = R^n_+$:
$$B(x) = -\sum_j \log(x_j), \quad \nabla B(x) = -\begin{pmatrix} 1/x_1 \\ \vdots \\ 1/x_n \end{pmatrix} \in R^n.$$

$K = S^n_+$:
$$B(X) = -\log(\det(X)), \quad \nabla B(X) = -X^{-1} \in S^n.$$
Optimality Conditions for Optimization with Logarithmic Barrier

Optimization in the Nonnegative Cone ($x > 0$):
$$\nabla\phi(x) = \nabla f(x) - \mu X^{-1}e = 0, \quad X = \mathrm{diag}(x),$$
and the scaled gradient vector:
$$X\nabla\phi(x) = X\nabla f(x) - \mu e;$$
its convergence to zero implies $\nabla f(x) > 0$ for any positive $\mu$. In practice, we set $\mu = \epsilon$, the tolerance.

Optimization in the Semidefinite Cone ($X \succ 0$):
$$\nabla\phi(X) = \nabla f(X) - \mu X^{-1} = 0,$$
and the scaled gradient matrix:
$$X^{1/2}\nabla\phi(X)X^{1/2} = X^{1/2}\nabla f(X)X^{1/2} - \mu I;$$
its convergence to zero implies $\nabla f(X) \succ 0$ for any positive $\mu$. (Suggested Project I)
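For a toy objective whose barrier problem has a closed-form minimizer (the choice $f(x) = c^Tx + \tfrac{1}{2}\|x\|^2$ is an assumption made purely for illustration), the scaled-gradient condition $X\nabla f(x) = \mu e$ can be checked exactly:

```python
import numpy as np

# Barrier problem: min c^T x + 0.5*||x||^2 - mu * sum_j log(x_j).
# Stationarity per coordinate: c_j + x_j - mu/x_j = 0, i.e.
# x_j^2 + c_j x_j - mu = 0; take the positive root.
mu = 1e-3
c = np.array([1.0, -0.5, 2.0])

x = (-c + np.sqrt(c ** 2 + 4 * mu)) / 2
assert np.all(x > 0)                 # the barrier minimizer stays interior

grad_f = c + x                       # gradient of the toy f
scaled = x * grad_f                  # X * grad f(x)
print(np.allclose(scaled, mu))       # X grad f = mu e exactly -> True
```

Note that $x_j \nabla f(x)_j = \mu > 0$ with $x_j > 0$ also forces $\nabla f(x) > 0$ componentwise, matching the claim above.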
II: SDM Followed by Feasible-Region Projection

$$\hat{x}^{k+1} = x^k - \frac{1}{\beta}\nabla f(x^k), \quad x^{k+1} = \mathrm{Proj}_K(\hat{x}^{k+1}).$$

For example: if $K = \{x : x \ge 0\}$, then
$$x^{k+1} = \mathrm{Proj}_K(\hat{x}^{k+1}) = \max\{0, \hat{x}^{k+1}\};$$
if $K = \{X : X \succeq 0\}$, then factorize $\hat{X}^{k+1} = V^T\Lambda V$ and let
$$X^{k+1} = \mathrm{Proj}_K(\hat{X}^{k+1}) = V^T\max\{0, \Lambda\}V.$$

The drawback is that the eigenvalue factorization may be costly in each iteration (using the $L^TDL$ factorization in Suggested Project I?).
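A minimal sketch of both projections (random data, an illustrative assumption): for $R^n_+$ the projection is a componentwise clip; for $S^n_+$ it clips negative eigenvalues after a factorization.

```python
import numpy as np

# Projected SDM for nonnegative least squares: min 0.5||Ax-b||^2 s.t. x >= 0.
rng = np.random.default_rng(2)
A = rng.standard_normal((15, 6))
b = rng.standard_normal(15)
grad = lambda x: A.T @ (A @ x - b)
beta = np.linalg.eigvalsh(A.T @ A).max()

x = np.ones(6)
for _ in range(5000):
    x = np.maximum(0.0, x - grad(x) / beta)   # gradient step, then Proj_K

g = grad(x)
comp = np.abs(x * g).max()        # complementarity x_j * grad_j ~ 0 at a KKT point
print(np.all(x >= 0), comp < 1e-6)

# PSD projection: factorize and clip negative eigenvalues.
M = rng.standard_normal((4, 4)); M = (M + M.T) / 2
w, V = np.linalg.eigh(M)                         # M = V diag(w) V^T
M_proj = V @ np.diag(np.maximum(w, 0.0)) @ V.T   # Proj onto the PSD cone
assert np.linalg.eigvalsh(M_proj).min() >= -1e-10
```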
Value-Iteration for MDP as Gradient-Ascent and Feasible-Region Projection

Let $y \in R^m$ represent the cost-to-go values of the $m$ states, the $i$th entry for the $i$th state, of a given policy. The MDP problem entails choosing the optimal value vector $y^*$ such that it satisfies:
$$y^*_i = \min_{j\in A_i}\{c_j + \gamma p_j^T y^*\}, \ \forall i,$$
which can be cast as an optimal solution of the linear program
$$\max_y\ b^T y \quad \text{s.t.}\quad y_i \le c_j + \gamma p_j^T y, \ \forall j \in A_i, \ \forall i,$$
where $b$ is any fixed positive vector.

The Value-Iteration (VI) Method is, starting from any $y^0$,
$$y^{k+1}_i = \min_{j\in A_i}\{c_j + \gamma p_j^T y^k\}, \ \forall i.$$

If the initial $y^0$ is strictly feasible for state $i$, that is, $y^0_i < c_j + \gamma p_j^T y^0,\ \forall j \in A_i$, then $y^k_i$ would be increasing in the VI iterations for all $i$ and $k$. On the other hand, if any of the inequalities is violated, then we have to decrease $y_i$ at least to $\min_{j\in A_i}\{c_j + \gamma p_j^T y^0\}$.
Convergence of Value-Iteration for MDP

Theorem 3. Let the VI algorithm mapping be $A(v)_i = \min_{j\in A_i}\{c_j + \gamma p_j^T v\},\ \forall i$. Then, for any two value vectors $u \in R^m$ and $v \in R^m$ and every state $i$:
$$|A(u)_i - A(v)_i| \le \gamma\|u - v\|_\infty, \quad\text{which implies}\quad \|A(u) - A(v)\|_\infty \le \gamma\|u - v\|_\infty.$$

Proof: Let $j_u$ and $j_v$ be the two $\arg\min$ actions for value vectors $u$ and $v$, respectively. Assume that $A(u)_i - A(v)_i \ge 0$, where the other case can be proved similarly. Then
$$\begin{aligned}
0 \le A(u)_i - A(v)_i &= (c_{j_u} + \gamma p_{j_u}^T u) - (c_{j_v} + \gamma p_{j_v}^T v) \\
&\le (c_{j_v} + \gamma p_{j_v}^T u) - (c_{j_v} + \gamma p_{j_v}^T v) \\
&= \gamma p_{j_v}^T(u - v) \le \gamma\|u - v\|_\infty,
\end{aligned}$$
where the first inequality is from the fact that $j_u$ is the $\arg\min$ action for value vector $u$, and the last inequality follows from the fact that the elements of $p_{j_v}$ are nonnegative and sum up to $1$.

There are many research issues in Suggested Project II.
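The contraction of Theorem 3, and the fixed-point convergence it implies, can be exercised on a small randomly generated MDP (all problem data here are illustrative assumptions):

```python
import numpy as np

# Random MDP: m states, a few actions per state; costs[i, j] and transition
# rows P[i, j, :] play the roles of c_j and p_j for actions j in A_i.
rng = np.random.default_rng(3)
m, n_actions, gamma = 4, 3, 0.9
costs = rng.uniform(0, 1, size=(m, n_actions))
P = rng.uniform(0, 1, size=(m, n_actions, m))
P /= P.sum(axis=2, keepdims=True)          # each P[i, j, :] is a probability vector

def vi_map(v):
    # A(v)_i = min_j { c_j + gamma * p_j^T v }
    return (costs + gamma * P @ v).min(axis=1)

# gamma-contraction in the infinity norm (Theorem 3):
u = rng.standard_normal(m)
v = rng.standard_normal(m)
lhs = np.abs(vi_map(u) - vi_map(v)).max()
rhs = gamma * np.abs(u - v).max()
print(lhs <= rhs + 1e-12)                  # True

# Iterating the contraction converges to the fixed point y* = A(y*):
y = np.zeros(m)
for _ in range(500):
    y = vi_map(y)
assert np.allclose(y, vi_map(y), atol=1e-8)
```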
III: SDM with Affine Scaling

Let an iterate solution $x^k > 0$ (for simplicity let each entry of $x^k$ be bounded above by $1$). Then, we can scale it to $e$, the vector of all ones, by $x' = (X^k)^{-1}x$, where $X^k$ is the diagonal matrix of vector $x^k$. Now consider the objective after scaling, $f'(x') = f(X^k x')$, where $\nabla f'(e) = X^k\nabla f(x^k)$. The new SDM iterate would be
$$x' = e - \alpha^k X^k\nabla f(x^k),$$
and after scaling back,
$$x^{k+1} = x^k - \alpha^k (X^k)^2\nabla f(x^k).$$

If function $f$ is $\beta$-Lipschitz, then so is $f'$ with constant $\beta\|x^k\|_\infty^2$, and we can fix $\alpha^k = \frac{1}{\beta\|x^k\|_\infty^2}$ (or smaller to keep $x^{k+1} > 0$) so that
$$f(x^{k+1}) - f(x^k) \le -\frac{1}{2\beta\|x^k\|_\infty^2}\|X^k\nabla f(x^k)\|^2,$$
so that the scaled gradient/complementarity vector $X^k\nabla f(x^k)/\|x^k\|_\infty$ converges to zero.
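A minimal sketch of the scaled iteration on a well-conditioned synthetic problem with an interior optimum (all problem data are illustrative assumptions, chosen so the scaled gradient vanishes quickly):

```python
import numpy as np

# Affine-scaled SDM: x^{k+1} = x^k - alpha_k (X^k)^2 grad f(x^k), with
# alpha_k = 1/(beta * ||x^k||_inf^2), halved if needed to stay positive.
rng = np.random.default_rng(4)
U, _ = np.linalg.qr(rng.standard_normal((12, 5)))
V, _ = np.linalg.qr(rng.standard_normal((5, 5)))
A = U @ np.diag([3.0, 2.5, 2.0, 1.5, 1.0]) @ V.T   # controlled conditioning
x_true = np.array([0.8, 0.3, 0.5, 0.6, 0.2])        # interior minimizer
b = A @ x_true
grad = lambda x: A.T @ (A @ x - b)
beta = np.linalg.eigvalsh(A.T @ A).max()

x = 0.5 * np.ones(5)
for _ in range(5000):
    alpha = 1.0 / (beta * np.max(np.abs(x)) ** 2)
    step = alpha * x ** 2 * grad(x)                 # (X^k)^2 grad f(x^k)
    while np.any(x - step <= 0):                    # safeguard: keep x^{k+1} > 0
        step *= 0.5
    x = x - step

scaled = np.abs(x * grad(x)).max()   # scaled gradient/complementarity vector
print(scaled < 1e-8, np.allclose(x, x_true, atol=1e-6))
```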
Affine Scaling for the SDP Cone

Let an iterate solution $X^k \succ 0$. Then, one can scale it to $I$, the identity matrix, by $X' = (X^k)^{-1/2}X(X^k)^{-1/2}$. Now consider the objective after scaling, $f'(X') = f((X^k)^{1/2}X'(X^k)^{1/2})$, where $\nabla f'(I) = (X^k)^{1/2}\nabla f(X^k)(X^k)^{1/2}$. The new SDM iterate would be
$$X' = I - \alpha^k (X^k)^{1/2}\nabla f(X^k)(X^k)^{1/2},$$
and after scaling back,
$$X^{k+1} = X^k - \alpha^k X^k\nabla f(X^k)X^k.$$

If function $f$ is $\beta$-Lipschitz, then so is $f'$ with constant $\beta\|X^k\|^2$, and we can fix $\alpha^k = \frac{1}{\beta\|X^k\|^2}$ (or smaller to keep $X^{k+1} \succ 0$) so that
$$f(X^{k+1}) - f(X^k) \le -\frac{1}{2\beta\|X^k\|^2}\|(X^k)^{1/2}\nabla f(X^k)(X^k)^{1/2}\|^2,$$
so that the scaled gradient/complementarity matrix converges to zero.
IV: Gradient-Projection Method for Conic Optimization

Consider the nonnegative cone. At any iterate solution $x^k \ge 0$, we project the gradient vector onto the feasible direction space at $x^k$:
$$g^k_j = \begin{cases} \nabla f(x^k)_j & \text{if } x^k_j > 0 \text{ or } \nabla f(x^k)_j < 0, \\ 0 & \text{otherwise.} \end{cases}$$
Then we apply the SDM with the projected gradient vector $g^k$; that is, take the largest step size $\alpha$ such that
$$x^{k+1} = x^k - \alpha g^k \ge 0, \quad 0 \le \alpha \le \frac{1}{\beta}.$$

Does it converge? What is the relation with the Type II method? What is the convergence speed? (Consider it a bonus question to Problem 6 of HW3.)
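The projected direction can be sketched in a few lines; the toy vectors below are illustrative assumptions:

```python
import numpy as np

# Type-IV direction for K = R^n_+: keep grad component j if x_j > 0 (the
# coordinate is free to move) or grad_j < 0 (the step -grad_j moves the
# active coordinate inward); zero it out otherwise.
x     = np.array([0.0,  0.0, 2.0,  1.0])
gradf = np.array([0.7, -1.2, 0.4, -0.3])

g = np.where((x > 0) | (gradf < 0), gradf, 0.0)
assert g[0] == 0.0                  # blocked component (x_0 = 0, grad_0 > 0)

alpha = 0.1                         # any sufficiently small step size
assert np.all(x - alpha * g >= 0)   # the step stays in the cone
assert gradf @ g >= 0               # so -g is still a descent direction
print(g)
```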
V: Reduced Gradient Method — the Simplex Algorithm for LP

LP: $\min\ c^T x$ s.t. $Ax = b,\ x \ge 0$, where $A \in R^{m\times n}$ has full row rank $m$.

Theorem 4 (The Fundamental Theorem of LP in Algebraic Form). Given (LP) and (LD) where $A$ has full row rank $m$:
i) if there is a feasible solution, there is a basic feasible solution (Carathéodory's theorem);
ii) if there is an optimal solution, there is an optimal basic solution.

High-Level Idea:
1. Initialization: Start at a BFS, or corner point, of the feasible polyhedron.
2. Test for Optimality: Compute the reduced gradient vector at the corner. If no descent and feasible direction can be found, stop and claim optimality at the current corner point; otherwise, select a new corner point and repeat Step 2.
Figure 1: The LP Simplex Method (LP theorems depicted in a two-variable space: constraints $a_1,\dots,a_5$, the objective vector $c$, and an objective contour).
When a Basic Feasible Solution is Optimal

Suppose the basis of a basic feasible solution is $A_B$ and the rest is $A_N$. One can transform the equality constraint to $A_B^{-1}Ax = A_B^{-1}b$, so that
$$x_B = A_B^{-1}b - A_B^{-1}A_N x_N.$$
That is, we express $x_B$ in terms of $x_N$, the non-basic variables that are active for the constraints $x \ge 0$. Then the objective function equivalently becomes
$$c^T x = c_B^T x_B + c_N^T x_N = c_B^T A_B^{-1}b - c_B^T A_B^{-1}A_N x_N + c_N^T x_N = c_B^T A_B^{-1}b + (c_N^T - c_B^T A_B^{-1}A_N)x_N.$$
The vector $r^T = c^T - c_B^T A_B^{-1}A$ is called the Reduced Gradient/Cost Vector, where $r_B = 0$ always.

Theorem 5. If the Reduced Gradient Vector $r^T = c^T - c_B^T A_B^{-1}A \ge 0$, then the BFS is optimal.

Proof: Let $y^T = c_B^T A_B^{-1}$ (called the Shadow Price Vector); then $y$ is a dual feasible solution ($r = c - A^T y \ge 0$) and $c^T x = c_B^T x_B = c_B^T A_B^{-1}b = y^T b$, that is, the duality gap is zero.
The Simplex Algorithm Procedures

0. Initialize: Start at a BFS with basic index set $B$, and let $N$ denote the complementary index set.
1. Test for Optimality: Compute the Reduced Gradient Vector $r$ at the current BFS and let
$$r_e = \min_{j\in N}\{r_j\}.$$
If $r_e \ge 0$, stop; the current BFS is optimal.
2. Determine the Replacement: Increase $x_e$ while keeping all other non-basic variables at the zero value (inactive) and maintaining the equality constraints:
$$x_B = A_B^{-1}b - A_B^{-1}A_{.e}\,x_e\ (\ge 0).$$
If $x_e$ can be increased to $\infty$, stop; the problem is unbounded below. Otherwise, let the basic variable $x_o$ be the one first becoming $0$.
3. Update the Basis: Update $B$ with $x_o$ being replaced by $x_e$, and return to Step 1.
A Toy Example

$$\begin{array}{rl}
\text{minimize} & -x_1 - 2x_2 \\
\text{subject to} & x_1 + x_3 = 1 \\
& x_2 + x_4 = 1 \\
& x_1 + x_2 + x_5 = 1.5,
\end{array}$$
$$A = \begin{pmatrix} 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 & 1 \end{pmatrix}, \quad b = \begin{pmatrix} 1 \\ 1 \\ 1.5 \end{pmatrix}, \quad c^T = (-1\ \ {-2}\ \ 0\ \ 0\ \ 0).$$

Consider the initial BFS with basic variables $B = \{3, 4, 5\}$ and $N = \{1, 2\}$.

Iteration 1:
1. $A_B = I$, $A_B^{-1} = I$, $y^T = (0\ \ 0\ \ 0)$, and $r_N = (-1\ \ {-2})$: it's NOT optimal. Let $e = 2$.
2. Increase $x_2$ while
$$x_B = A_B^{-1}b - A_B^{-1}A_{.2}\,x_2 = \begin{pmatrix}1\\1\\1.5\end{pmatrix} - \begin{pmatrix}0\\1\\1\end{pmatrix}x_2.$$
We see $x_4$ becomes $0$ first.
3. The new basic variables are $B = \{3, 2, 5\}$ and $N = \{1, 4\}$.

Iteration 2:
1. $$A_B = \begin{pmatrix}1&0&0\\0&1&0\\0&1&1\end{pmatrix}, \quad A_B^{-1} = \begin{pmatrix}1&0&0\\0&1&0\\0&-1&1\end{pmatrix},$$
$y^T = (0\ \ {-2}\ \ 0)$ and $r_N = (-1\ \ 2)$: it's NOT optimal. Let $e = 1$.
2. Increase $x_1$ while
$$x_B = A_B^{-1}b - A_B^{-1}A_{.1}\,x_1 = \begin{pmatrix}1\\1\\0.5\end{pmatrix} - \begin{pmatrix}1\\0\\1\end{pmatrix}x_1.$$
We see $x_5$ becomes $0$ first.
3. The new basic variables are $B = \{3, 2, 1\}$ and $N = \{4, 5\}$.

Iteration 3:
1. $$A_B = \begin{pmatrix}1&0&1\\0&1&0\\0&1&1\end{pmatrix}, \quad A_B^{-1} = \begin{pmatrix}1&1&-1\\0&1&0\\0&-1&1\end{pmatrix},$$
$y^T = (0\ \ {-1}\ \ {-1})$ and $r_N = (1\ \ 1)$: it's Optimal.

Is the Simplex Method always convergent to a minimizer? Which condition of the Global Convergence Theorem failed?
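The three iterations above can be reproduced with a minimal revised-simplex sketch (a textbook version with no anti-cycling safeguard; the function interface is an assumption):

```python
import numpy as np

def simplex(A, b, c, basis):
    """Steps 1-3 of the procedure, starting from the given basic index set."""
    m, n = A.shape
    B = list(basis)                          # 0-indexed basic variables
    while True:
        A_B_inv = np.linalg.inv(A[:, B])
        y = A_B_inv.T @ c[B]                 # shadow prices y^T = c_B^T A_B^{-1}
        r = c - A.T @ y                      # reduced costs (r_B = 0)
        N = [j for j in range(n) if j not in B]
        e = min(N, key=lambda j: r[j])       # entering variable
        if r[e] >= -1e-12:                   # Step 1: optimal
            x = np.zeros(n)
            x[B] = A_B_inv @ b
            return x, B
        d = A_B_inv @ A[:, e]                # how x_B shrinks as x_e grows
        xB = A_B_inv @ b
        ratios = [xB[i] / d[i] if d[i] > 1e-12 else np.inf for i in range(m)]
        o = int(np.argmin(ratios))           # Step 2: ratio test
        if ratios[o] == np.inf:
            raise ValueError("problem is unbounded below")
        B[o] = e                             # Step 3: pivot

# The toy example (0-indexed variables, so B = {3,4,5} becomes [2,3,4]):
A = np.array([[1.0, 0, 1, 0, 0], [0, 1, 0, 1, 0], [1, 1, 0, 0, 1]])
b = np.array([1.0, 1.0, 1.5])
c = np.array([-1.0, -2.0, 0, 0, 0])
x, B = simplex(A, b, c, [2, 3, 4])
print(x, c @ x)   # x = (0.5, 1, 0.5, 0, 0), optimal value -2.5
```

The pivots visit exactly the bases $\{3,4,5\} \to \{3,2,5\} \to \{3,2,1\}$ traced in the slides.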