Lecture 3. Optimization Problems and Iterative Algorithms


Lecture 3: Optimization Problems and Iterative Algorithms. January 13, 2016. This material was jointly developed with Angelia Nedić at UIUC for IE 598ns.

Outline
- Special functions: linear, quadratic, convex
- Criteria for convexity of a function
- Operations preserving convexity
- Unconstrained optimization: first-order necessary optimality conditions
- Constrained optimization: first-order necessary optimality conditions, KKT conditions
- Iterative algorithms
Stochastic Optimization 1

Convex Function

f is convex when dom(f) is a convex set and
  f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y) for all x, y ∈ dom(f) and α ∈ [0, 1].
f is strictly convex if the inequality is strict for all x, y ∈ dom(f) with x ≠ y and α ∈ (0, 1).
Note that dom(f) is defined as dom(f) = {x : f(x) < +∞}.
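The defining inequality can be spot-checked numerically. The sketch below (not part of the slides; grids and tolerance are arbitrary choices) tests it for f(x) = x², which is convex, and for f(x) = −x², which is not:

```python
# Numerical spot-check of the convexity inequality
#   f(a*x + (1-a)*y) <= a*f(x) + (1-a)*f(y)
# on a grid of points and mixing weights. Illustrative only.

def is_midpoint_convex(f, xs, alphas):
    """Check the defining inequality on all pairs from xs and all alphas."""
    for x in xs:
        for y in xs:
            for a in alphas:
                lhs = f(a * x + (1 - a) * y)
                rhs = a * f(x) + (1 - a) * f(y)
                if lhs > rhs + 1e-12:   # small tolerance for rounding
                    return False
    return True

xs = [i / 10 for i in range(-20, 21)]       # grid over [-2, 2]
alphas = [i / 10 for i in range(11)]        # mixing weights in [0, 1]
print(is_midpoint_convex(lambda x: x * x, xs, alphas))    # x^2 passes
print(is_midpoint_convex(lambda x: -x * x, xs, alphas))   # -x^2 fails
```

A passing grid check does not prove convexity; it only fails to refute it, which is why the analytic criteria on the next slides matter.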

[Figure: the graph of a convex f between points x and y lies below the chord joining (x, f(x)) and (y, f(y)).]
f is concave when −f is convex; f is strictly concave when −f is strictly convex.

Examples of Convex/Concave Functions

Examples on R
Convex:
- Affine: ax + b over R, for any a, b ∈ R
- Exponential: e^{ax} over R, for any a ∈ R
- Powers: x^p over (0, +∞), for p ≥ 1 or p ≤ 0
- Powers of absolute value: |x|^p over R, for p ≥ 1
- Negative entropy: x ln x over (0, +∞)
Concave:
- Affine: ax + b over R, for any a, b ∈ R
- Powers: x^p over (0, +∞), for 0 ≤ p ≤ 1
- Logarithm: ln x over (0, +∞)

Examples on Rⁿ
- Affine functions are both convex and concave
- Norms ‖x‖, ‖x‖₁, ‖x‖∞ are convex

Second-Order Conditions for Convexity

Let f be twice differentiable and let dom(f) be the domain of f. [In general, when differentiability is considered, dom(f) is required to be open.]
The Hessian ∇²f(x) is a symmetric n × n matrix whose entries are the second-order partial derivatives of f at x:
  [∇²f(x)]_{ij} = ∂²f(x)/(∂x_i ∂x_j) for i, j = 1, ..., n

Second-order condition: f is convex if and only if dom(f) is a convex set and ∇²f(x) ⪰ 0 for all x ∈ dom(f).
[Recall that M ∈ R^{n×n} is positive semidefinite, M ⪰ 0, if xᵀMx ≥ 0 for all x ∈ Rⁿ.]
f is strictly convex if dom(f) is a convex set and ∇²f(x) ≻ 0 for all x ∈ dom(f).
[Recall that M ∈ R^{n×n} is positive definite, M ≻ 0, if xᵀMx > 0 for all x ≠ 0.]

Examples

Quadratic function: f(x) = (1/2)xᵀQx + qᵀx + r with a symmetric n × n matrix Q:
  ∇f(x) = Qx + q,  ∇²f(x) = Q
Convex for Q ⪰ 0.

Least-squares objective: f(x) = ‖Ax − b‖² with an m × n matrix A:
  ∇f(x) = 2Aᵀ(Ax − b),  ∇²f(x) = 2AᵀA
Convex for any A.

Quadratic-over-linear: f(x, y) = x²/y, convex for y > 0:
  ∇²f(x, y) = (2/y³) [y, −x][y, −x]ᵀ ⪰ 0
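The least-squares claim is easy to verify numerically: xᵀ(2AᵀA)x = 2‖Ax‖² ≥ 0 for any A. A small sketch using plain lists (the 3 × 2 matrix A below is an arbitrary example, not from the slides):

```python
# Check that the least-squares Hessian H = 2 A^T A is positive semidefinite
# by evaluating the quadratic form x^T H x on a grid of directions.

def matvec(M, x):
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

def quad_form(M, x):
    """Compute x^T M x."""
    return sum(x[i] * v for i, v in enumerate(matvec(M, x)))

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]          # arbitrary 3x2 matrix
# Hessian H = 2 A^T A (a 2x2 matrix)
H = [[2 * sum(A[k][i] * A[k][j] for k in range(3)) for j in range(2)]
     for i in range(2)]

ok = all(quad_form(H, [x1 / 4, x2 / 4]) >= -1e-12
         for x1 in range(-8, 9) for x2 in range(-8, 9))
print(ok)   # True: x^T H x >= 0 on every sampled direction
```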

First-Order Condition for Convexity

Let f be differentiable and let dom(f) be its domain. Then the gradient
  ∇f(x) = (∂f(x)/∂x₁, ∂f(x)/∂x₂, ..., ∂f(x)/∂xₙ)ᵀ
exists at each x ∈ dom(f).

First-order condition: f is convex if and only if dom(f) is convex and
  f(x) + ∇f(x)ᵀ(z − x) ≤ f(z) for all x, z ∈ dom(f)
Note: the first-order approximation is a global underestimate of f.

This is a very important property, used in convex optimization for algorithm design and performance analysis.
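The global-underestimate property can be illustrated numerically. The sketch below (my own example, assuming f(x) = eˣ with f′(x) = eˣ) checks f(x) + f′(x)(z − x) ≤ f(z) over a grid:

```python
import math

# Check the first-order condition as a global underestimate: for convex f,
# the tangent line at any x stays below the graph everywhere.

def tangent_underestimates(f, fprime, pts):
    for x in pts:
        for z in pts:
            if f(x) + fprime(x) * (z - x) > f(z) + 1e-9:
                return False
    return True

pts = [i / 5 for i in range(-10, 11)]            # grid over [-2, 2]
print(tangent_underestimates(math.exp, math.exp, pts))   # True: exp is convex
```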

Operations Preserving Convexity

Let f and g be convex functions over Rⁿ.
- Positive scaling: λf is convex for λ > 0, where (λf)(x) = λf(x) for all x
- Sum: f + g is convex, where (f + g)(x) = f(x) + g(x) for all x
- Composition with an affine function: for g affine [i.e., g(x) = Ax + b], the composition f ∘ g is convex, where (f ∘ g)(x) = f(Ax + b) for all x
- Pointwise maximum: for convex functions f₁, ..., f_m, the pointwise-max function h(x) = max{f₁(x), ..., f_m(x)} is convex
  - Polyhedral function: f(x) = max_{i=1,...,m} (a_iᵀx + b_i) is convex
- Pointwise supremum: let Y ⊆ R^m and f : Rⁿ × R^m → R, with f(x, y) convex in x for each y ∈ Y. Then the supremum function over the set Y, h(x) = sup_{y∈Y} f(x, y), is convex
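The pointwise-maximum rule can be sanity-checked on a concrete pair. In this sketch (my own choice of functions, not from the slides) h(x) = max{x², |x − 1|} is built from two convex pieces and the convexity inequality is tested on a grid:

```python
# Pointwise maximum preserves convexity: h = max(f1, f2) of two convex
# functions again satisfies f(a*x + (1-a)*y) <= a*f(x) + (1-a)*f(y).

def convexity_holds(f, xs, alphas, tol=1e-12):
    return all(f(a * x + (1 - a) * y) <= a * f(x) + (1 - a) * f(y) + tol
               for x in xs for y in xs for a in alphas)

f1 = lambda x: x * x          # convex
f2 = lambda x: abs(x - 1)     # convex (a norm-like function, not smooth)
h = lambda x: max(f1(x), f2(x))

xs = [i / 10 for i in range(-20, 21)]
alphas = [i / 10 for i in range(11)]
print(convexity_holds(h, xs, alphas))   # True
```

Note that h is not differentiable where the two pieces cross, yet it is still convex; this is exactly why pointwise maxima matter for polyhedral functions.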

Optimization Terminology

Let C ⊆ Rⁿ and f : C → R. Consider the optimization problem
  minimize f(x) subject to x ∈ C
Example: C = {x ∈ Rⁿ : g(x) ≤ 0, x ∈ X}
Terminology:
- The set C is referred to as the feasible set
- The problem is feasible when C is nonempty
- The problem is unconstrained when C = Rⁿ, and constrained otherwise
- A vector x* is an optimal solution, or a global minimum, when x* is feasible and the value f(x*) is not exceeded at any x ∈ C, i.e.,
  x* ∈ C and f(x*) ≤ f(x) for all x ∈ C

Local Minimum

minimize f(x) subject to x ∈ C
A vector x̂ is a local minimum for the problem if x̂ ∈ C and there is a ball B(x̂, r) such that
  f(x̂) ≤ f(x) for all x ∈ C with ‖x − x̂‖ ≤ r
- Every global minimum is also a local minimum
- When the set C is convex and the function f is convex, a local minimum is also global

First-Order Necessary Optimality Condition: Unconstrained Problem

Let f be a differentiable function with dom(f) = Rⁿ and let C = Rⁿ. If x̂ is a local minimum of f over Rⁿ, then
  ∇f(x̂) = 0
The gradient relation can equivalently be written as
  (y − x̂)ᵀ∇f(x̂) ≥ 0 for all y ∈ Rⁿ
This is a variational inequality VI(K, F) with the set K and the mapping F given by
  K = Rⁿ,  F(x) = ∇f(x)
Solving a minimization problem can thus be reduced to solving a corresponding variational inequality.

First-Order Necessary Optimality Condition: Constrained Problem

Let f be a differentiable function with dom(f) = Rⁿ and let C ⊆ Rⁿ be a closed convex set. If x̂ is a local minimum of f over C, then
  (y − x̂)ᵀ∇f(x̂) ≥ 0 for all y ∈ C   (1)
Again, this is a variational inequality VI(K, F) with K = C and F(x) = ∇f(x).
Recall that when f is convex, a local minimum is also global.
When f is convex, relation (1) is also sufficient for x̂ to be a global minimum: if x̂ satisfies relation (1), then x̂ is a (global) minimum.
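Condition (1) is easy to see on a one-dimensional instance. In this sketch (my own example: minimize f(x) = x² over C = [1, 2], where the minimizer is x̂ = 1), the inequality is checked over a grid of feasible points:

```python
# Variational-inequality optimality check for minimize x^2 over C = [1, 2]:
# (y - xhat) * f'(xhat) >= 0 must hold for every y in C.

def satisfies_vi(xhat, grad, C_pts, tol=1e-12):
    return all((y - xhat) * grad(xhat) >= -tol for y in C_pts)

grad = lambda x: 2 * x                          # gradient of f(x) = x^2
C_pts = [1 + i / 100 for i in range(101)]       # grid over C = [1, 2]

print(satisfies_vi(1.0, grad, C_pts))   # True: xhat = 1 is the minimizer
print(satisfies_vi(1.5, grad, C_pts))   # False: interior non-stationary point
```

Note that ∇f(x̂) = 2 ≠ 0 at the solution: on the boundary of C the gradient need not vanish, only point "outward" relative to the feasible set.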

Equality- and Inequality-Constrained Problem

Consider the following problem:
  minimize f(x)
  subject to h₁(x) = 0, ..., h_p(x) = 0
             g₁(x) ≤ 0, ..., g_m(x) ≤ 0
where f, h_i, and g_j are continuously differentiable over Rⁿ.

Def. For a feasible vector x, the active set of (inequality) constraints is the set
  A(x) = {j : g_j(x) = 0}
If j ∉ A(x), we say that the j-th constraint is inactive at x.

Def. We say that a vector x is regular if the gradients ∇h₁(x), ..., ∇h_p(x), and ∇g_j(x) for j ∈ A(x), are linearly independent.
NOTE: x is regular when there are no equality constraints and all the inequality constraints are inactive [p = 0 and A(x) = ∅].

Lagrangian Function

With the problem
  minimize f(x)
  subject to h₁(x) = 0, ..., h_p(x) = 0
             g₁(x) ≤ 0, ..., g_m(x) ≤ 0    (2)
we associate the Lagrangian function L(x, λ, µ) defined by
  L(x, λ, µ) = f(x) + Σ_{i=1}^p λ_i h_i(x) + Σ_{j=1}^m µ_j g_j(x)
where λ_i ∈ R for all i, and µ_j ∈ R₊ for all j.

First-Order Karush-Kuhn-Tucker (KKT) Necessary Conditions

Th. Let x̂ be a local minimum of the equality/inequality-constrained problem (2), and assume that x̂ is regular. Then there exist unique multipliers λ̂ and µ̂ such that
  ∇ₓL(x̂, λ̂, µ̂) = 0   [L is the Lagrangian function]
  µ̂_j ≥ 0 for all j
  µ̂_j = 0 for all j ∉ A(x̂)
The last condition is referred to as the complementarity condition. We can write these conditions on µ̂ compactly as
  0 ≤ µ̂ ⊥ −g(x̂) ≥ 0
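As a worked instance (my own example, not from the slides): minimize (x − 2)² subject to g(x) = x − 1 ≤ 0. The solution is x̂ = 1 with multiplier µ̂ = 2, and all three KKT conditions can be checked directly:

```python
# KKT check for: minimize (x - 2)^2 subject to g(x) = x - 1 <= 0.
# Candidate: xhat = 1 (the constraint is active), muhat = 2.
#   stationarity:     2(xhat - 2) + muhat = 0
#   dual feasibility: muhat >= 0
#   complementarity:  muhat * g(xhat) = 0

xhat, muhat = 1.0, 2.0
g = lambda x: x - 1
grad_L = lambda x, mu: 2 * (x - 2) + mu    # d/dx of the Lagrangian

stationarity = abs(grad_L(xhat, muhat)) < 1e-12
dual_feas = muhat >= 0
complementarity = abs(muhat * g(xhat)) < 1e-12

print(stationarity and dual_feas and complementarity)   # True
```

The unconstrained minimizer x = 2 is infeasible, so the constraint is active at the solution and the multiplier is strictly positive, exactly as complementarity predicts.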

In fact, the complementarity-based formulation can be used to write the first-order optimality conditions more compactly. Consider the following constrained optimization problem:
  minimize f(x)
  subject to c₁(x) ≥ 0, ..., c_m(x) ≥ 0,  x ≥ 0
Then, if x̂ is regular, there exist multipliers λ̂ such that
  0 ≤ x̂ ⊥ ∇ₓf(x̂) − ∇ₓc(x̂)ᵀλ̂ ≥ 0   (3)
  0 ≤ λ̂ ⊥ c(x̂) ≥ 0   (4)
More succinctly, this is a nonlinear complementarity problem, denoted CP(R₊^{n+m}, F): a problem that requires a z satisfying
  0 ≤ z ⊥ F(z) ≥ 0, where z = (x, λ) and F(z) = (∇ₓf(x) − ∇ₓc(x)ᵀλ, c(x)).

Second-Order KKT Necessary Conditions

Th. Let x̂ be a local minimum of the equality/inequality-constrained problem (2). Assume that x̂ is regular and that f, h_i, g_j are twice continuously differentiable. Then there exist unique multipliers λ̂ and µ̂ such that
  ∇ₓL(x̂, λ̂, µ̂) = 0
  µ̂_j ≥ 0 for all j
  µ̂_j = 0 for all j ∉ A(x̂)
and, for any vector y such that ∇h_i(x̂)ᵀy = 0 for all i and ∇g_j(x̂)ᵀy = 0 for all j ∈ A(x̂), the following relation holds:
  yᵀ∇²ₓₓL(x̂, λ̂, µ̂)y ≥ 0

Solution Procedures: Iterative Algorithms

For solving problems, we will consider iterative algorithms:
- Given an initial iterate x₀
- Generate a new iterate x_{k+1} = G_k(x_k), where G_k is a mapping that depends on the optimization problem
Objectives:
- Provide conditions on the mappings G_k that yield a sequence {x_k} converging to a solution of the problem of interest
- Study how fast the sequence {x_k} converges:
  - Global convergence rate (when far from optimal points)
  - Local convergence rate (when near an optimal point)

Gradient Descent Method

Consider a continuously differentiable function f; we want to minimize f(x) over x ∈ Rⁿ.
Gradient descent method:
  x_{k+1} = x_k − α_k ∇f(x_k)
The scalar α_k > 0 is a stepsize, chosen as a constant α_k = α, by a line search, or by another stepsize rule, so that f(x_{k+1}) < f(x_k).
Convergence rate: looking at the tail of the error sequence e(x_k) = dist(x_k, X*), where dist(x, A) = inf{‖x − a‖ : a ∈ A}, local convergence is at best linear:
  lim sup_{k→∞} e(x_{k+1})/e(x_k) ≤ q for some q ∈ (0, 1)
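A minimal constant-stepsize implementation (my own sketch; the objective f(x, y) = 0.1x² + y², stepsize, and iteration count are illustrative choices):

```python
# Gradient descent with constant stepsize on f(x, y) = 0.1*x^2 + y^2,
# whose minimizer is (0, 0). The error shrinks geometrically (linear rate).

def grad_descent(grad, x0, alpha, iters):
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x

grad = lambda x: [0.2 * x[0], 2 * x[1]]      # gradient of 0.1 x^2 + y^2
x = grad_descent(grad, [1.0, 1.0], alpha=0.5, iters=200)
err = (x[0] ** 2 + x[1] ** 2) ** 0.5
print(err < 1e-8)    # True: the iterates approach the minimizer (0, 0)
```

With α = 0.5 the slow coordinate contracts by the factor 0.9 per step, a concrete instance of the linear rate: e(x_{k+1})/e(x_k) ≈ 0.9.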

Global convergence is also at best linear.

Newton's Method

Consider a twice continuously differentiable function f with Hessian ∇²f(x) ≻ 0 for all x. We want to solve the following problem:
  minimize {f(x) : x ∈ Rⁿ}
Newton's method:
  x_{k+1} = x_k − α_k [∇²f(x_k)]⁻¹ ∇f(x_k)
Local convergence rate (near x*): ‖∇f(x)‖ converges to zero quadratically:
  ‖∇f(x_k)‖ ≤ C q^{2^k} for all large enough k, where C > 0 and q ∈ (0, 1)
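A one-dimensional sketch (my own example, assuming unit stepsize α_k = 1) on f(x) = eˣ + e⁻ˣ, whose unique minimizer is x* = 0 and whose Hessian f″(x) = eˣ + e⁻ˣ is positive everywhere:

```python
import math

# Newton's method with unit stepsize on f(x) = exp(x) + exp(-x).
# Near x* = 0 the error collapses extremely fast (locally quadratic or better).

def newton(x, iters):
    for _ in range(iters):
        g = math.exp(x) - math.exp(-x)       # f'(x)
        h = math.exp(x) + math.exp(-x)       # f''(x) > 0 everywhere
        x = x - g / h
    return x

x = newton(1.0, 4)
print(abs(x) < 1e-12)    # True: a handful of steps suffice from x0 = 1
```

Compare with the 200 gradient-descent iterations above: the price of each Newton step is forming and inverting the Hessian, which is trivial in 1-D but the dominant cost in high dimensions.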

Penalty Methods

For solving inequality-constrained problems:
  minimize f(x) subject to g_j(x) ≤ 0, j = 1, ..., m
Penalty approach: remove the constraints but penalize their violation:
  P_c: minimize F(x, c) = f(x) + c P(g₁(x), ..., g_m(x)) over x ∈ Rⁿ
where c > 0 is a penalty parameter and P is some penalty function.
Penalty methods operate in two stages, for c and x respectively. Choose an initial value c₀; then:
  (1) Having c_k, solve the problem P_{c_k} to obtain its optimal x*(c_k)
  (2) Using x*(c_k), update c_k to obtain c_{k+1} and go to step (1)
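A two-stage sketch on a 1-D instance of my own choosing: minimize x subject to 1 − x ≤ 0 (i.e., x ≥ 1), with the quadratic penalty F(x, c) = x + c·max(0, 1 − x)². For this instance the inner minimizer is available in closed form, x(c) = 1 − 1/(2c), so step (1) is solved exactly:

```python
# Quadratic-penalty method for: minimize x subject to 1 - x <= 0.
# Penalized objective: F(x, c) = x + c * max(0, 1 - x)^2; its minimizer
# for this 1-D instance is x(c) = 1 - 1/(2c), used here in closed form.

def penalty_method(c0, growth, stages):
    c, xs = c0, []
    for _ in range(stages):
        xs.append(1 - 1 / (2 * c))   # step (1): exact minimizer of F(., c)
        c *= growth                  # step (2): increase the penalty parameter
    return xs

xs = penalty_method(c0=1.0, growth=10.0, stages=6)
print(abs(xs[-1] - 1.0) < 1e-4)   # True: x(c) -> 1, the constrained solution
```

Note that every iterate satisfies x(c) < 1, i.e., is infeasible: quadratic penalty methods approach the solution from outside the feasible set, in contrast to the barrier methods below.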

Q-Rates of Convergence

Let {x_k} be a sequence in Rⁿ that converges to x*. Convergence is said to be:
1. Q-linear if there is r ∈ (0, 1) such that ‖x_{k+1} − x*‖/‖x_k − x*‖ ≤ r for all k > K. Example: {1 + 0.5^k} converges Q-linearly to 1.
2. Q-quadratic if there is M > 0 such that ‖x_{k+1} − x*‖/‖x_k − x*‖² ≤ M for all k > K. Example: {1 + 0.5^{2^k}} converges Q-quadratically to 1.
3. Q-superlinear if lim_{k→∞} ‖x_{k+1} − x*‖/‖x_k − x*‖ = 0. Example: {1 + k^{−k}} converges Q-superlinearly to 1.
4. Q-quadratic ⇒ Q-superlinear ⇒ Q-linear.
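The three example sequences can be checked against their definitions directly (a small sketch; grid lengths are arbitrary):

```python
# Error sequences e_k = |x_k - 1| for the three examples above, and the
# ratios appearing in the Q-linear / Q-quadratic / Q-superlinear definitions.

lin  = [0.5 ** k for k in range(1, 12)]            # errors of 1 + 0.5^k
quad = [0.5 ** (2 ** k) for k in range(1, 6)]      # errors of 1 + 0.5^(2^k)
supl = [k ** (-k) for k in range(1, 10)]           # errors of 1 + k^(-k)

lin_r  = [lin[k + 1] / lin[k] for k in range(10)]
quad_r = [quad[k + 1] / quad[k] ** 2 for k in range(4)]
supl_r = [supl[k + 1] / supl[k] for k in range(8)]

print(all(abs(r - 0.5) < 1e-12 for r in lin_r))        # Q-linear: ratio 0.5
print(all(r <= 1.0 for r in quad_r))                   # Q-quadratic: M = 1
print(all(b < a for a, b in zip(supl_r, supl_r[1:])))  # Q-superlinear: -> 0
```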

Example 1

f(x, y) = x² + y²
1. Steepest descent from (1, 1)ᵀ
2. Newton's method from (1, 1)ᵀ
3. Newton's method from (−1, −1)ᵀ

Uday V. Shanbhag, Lecture 3
[Figure 1: Well-conditioned function: steepest descent, Newton, Newton. Three contour plots over [−2, 2] × [−2, 2] showing the iterate trajectories of each run.]

Example 2

f(x, y) = 0.1x² + y²
1. Steepest descent from (1, 1)ᵀ
2. Newton's method from (1, 1)ᵀ
3. Newton's method from (−1, −1)ᵀ

[Figure 2: Ill-conditioned function: steepest descent, Newton, Newton. Three contour plots over [−2, 2] × [−2, 2] showing the iterate trajectories of each run.]

Interior-Point Methods

Solve the inequality-constrained (and, more generally, constrained) problem:
  minimize f(x) subject to g_j(x) ≤ 0, j = 1, ..., m
The IPM solves a sequence of problems parametrized by t > 0:
  minimize f(x) − (1/t) Σ_{j=1}^m ln(−g_j(x)) over x ∈ Rⁿ
This can be viewed as a penalty method with
- penalty parameter c = 1/t
- penalty function P(u₁, ..., u_m) = −Σ_{j=1}^m ln(−u_j)
This function is known as the logarithmic barrier, or log-barrier, function.
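On the same 1-D instance used for the penalty sketch above (my own example: minimize x subject to 1 − x ≤ 0), the centering problem minimize x − (1/t) ln(x − 1) has the closed-form solution x(t) = 1 + 1/t, so each barrier subproblem is solved exactly here:

```python
# Log-barrier method for: minimize x subject to 1 - x <= 0 (i.e., x >= 1).
# Centering problem: minimize x - (1/t) * ln(x - 1); its minimizer for this
# 1-D instance is x(t) = 1 + 1/t, used here in closed form.

def barrier_method(t0, growth, stages):
    t, path = t0, []
    for _ in range(stages):
        path.append(1 + 1 / t)   # exact minimizer of the centering problem
        t *= growth              # tighten the barrier
    return path

path = barrier_method(t0=1.0, growth=10.0, stages=6)
print(all(x > 1 for x in path))        # iterates stay strictly feasible
print(abs(path[-1] - 1.0) < 1e-4)      # x(t) -> 1, the constrained solution
```

In contrast to the exterior quadratic penalty, every barrier iterate is strictly feasible (x(t) > 1), which is the defining feature of interior-point methods.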

References for this lecture:
- (B) Bertsekas, D.P., Nonlinear Programming, Chapters 1 and 3 (descent and Newton's methods, KKT conditions)
- (FP) Facchinei and Pang, Finite-Dimensional Variational Inequalities and Complementarity Problems, Vol. I (part on complementarity problems), Chapter 1 (normal cone, dual cone, tangent cone)
- (BNO) Bertsekas, Nedić, Ozdaglar, Convex Analysis and Optimization, Chapter 1 (convex functions)