Agenda: Interior Point Methods

1. Barrier functions
2. Analytic center
3. Central path
4. Barrier method
5. Primal-dual path-following algorithms
6. Nesterov-Todd scaling
7. Complexity analysis
Interior point methods

Primal (P):  minimize   c^T x
             subject to Gx + s = h,  Ax = b,  s ⪰_K 0

Dual (D):    maximize   −h^T z − b^T y
             subject to A^T y + G^T z + c = 0,  z ⪰_{K*} 0

Wlog (why?), work with [G; A] of full column rank.

Interior point methods (IPMs): maintain primal and dual strict feasibility while working toward complementary slackness:
- (x_k, s_k) primal feasible with s_k ≻ 0
- (y_k, z_k) dual feasible with z_k ≻ 0
- ⟨z_k, s_k⟩ → 0
Main idea

minimize   t c^T x + ϕ(s)
subject to [G I; A 0] [x; s] = [h; b]

ϕ is a barrier function defined on int(K) with the following properties:
- strictly convex
- analytic
- self-concordant
- blows up as s approaches ∂K

For each t, there is a unique minimizer (x*(t), s*(t)) [requires a tiny bit of thought]. Limiting points as t → ∞ are primal-optimal solutions. The smooth curve (x*(t), s*(t)) is usually called the central path.
Canonical cones and canonical barriers

1. K = R^n_+:
   ϕ(x) = −Σ_{i=1}^n log x_i
   ∇ϕ(x) = −(1/x_1, …, 1/x_n)
   ∇²ϕ(x) = diag(1/x_1², …, 1/x_n²)

2. K = L = {(x_1, …, x_n) : ‖(x_1, …, x_{n−1})‖_2 ≤ x_n}:
   ϕ(x) = −(1/2) log(x^T J x),   J = [−I 0; 0 1]
   ∇ϕ(x) = −Jx / (x^T J x)
   ∇²ϕ(x) = −J / (x^T J x) + 2 (Jx)(Jx)^T / (x^T J x)²
   Note: [∇²ϕ(x)]^{−1} = 2xx^T − (x^T J x) J
3. K = S^n_+:
   ϕ(X) = −log det X
   ∇ϕ(X) = −X^{−1}
   ∇²ϕ(X)[H] = X^{−1} H X^{−1}
Self-concordance

The barriers (1), (2), (3) are strictly convex and self-concordant.

Implication of self-concordance: Newton's method is extremely effective at minimizing smooth, convex, self-concordant objectives.
Barrier function for composite cones

Product K = K_1 × ⋯ × K_m,  x = (x_1, …, x_m),  x_i ∈ K_i
Barrier: ϕ(x) = Σ_i ϕ_i(x_i)
Each ϕ_i self-concordant ⟹ ϕ self-concordant
Properties of barrier functions: generalized logarithm

(i) ϕ(tx) = ϕ(x) − θ(ϕ) log t, for t > 0
    θ(ϕ) = n for R^n_+,  θ(ϕ) = 1 for L,  θ(ϕ) = n for S^n_+

Further properties following from (i):
(ii) ⟨∇ϕ(x), x⟩ = −θ(ϕ)
(iii) [∇²ϕ(x)] x = −∇ϕ(x)
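These identities are easy to sanity-check numerically; a minimal sketch for the orthant barrier, assuming NumPy (the helper names `phi`, `grad`, `hess` are mine, not from the slides):

```python
import numpy as np

# Barrier for the nonnegative orthant: phi(x) = -sum(log x_i), theta = n.
def phi(x):  return -np.sum(np.log(x))
def grad(x): return -1.0 / x
def hess(x): return np.diag(1.0 / x**2)

x = np.array([0.5, 2.0, 3.0])
n = len(x)
t = 1.7

# (i)  phi(t x) = phi(x) - theta * log t
print(np.isclose(phi(t * x), phi(x) - n * np.log(t)))
# (ii) <grad phi(x), x> = -theta
print(np.isclose(grad(x) @ x, -n))
# (iii) [hess phi(x)] x = -grad phi(x)
print(np.allclose(hess(x) @ x, -grad(x)))
```

All three checks print `True`; the same script works for the Lorentz and semidefinite barriers with the formulas above.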
Barriers are self-dual

K: cone product where each component is an LP, SOCP, or SDP cone. For every x ∈ int(K), −∇ϕ(x) ∈ int(K). The mapping

    int(K) → int(K),  x ↦ −∇ϕ(x)

is self-inverse and homogeneous of degree −1:

    −∇ϕ(−∇ϕ(x)) = x
    ∇ϕ(tx) = t^{−1} ∇ϕ(x),  x ∈ int(K), t > 0
Analytic center

minimize   ϕ(s)
subject to [G I; A 0] [x; s] = [h; b]

- Convex program
- Solution strictly feasible
- Unique solution (x*, s*)
Computing the analytic center: Newton's method + line search

(P')  minimize   f(x)  (convex)
      subject to Ax = b

Pure Newton's method: sequence {x_k}, k = 0, 1, 2, …
Input: x_0 feasible
Repeat

    x_{k+1} = argmin_{Ax=b} [ f(x_k) + ⟨∇f(x_k), x − x_k⟩ + (1/2)(x − x_k)^T [∇²f(x_k)] (x − x_k) ]

until convergence
With v = x_{k+1} − x_k, this boils down to

minimize   ⟨∇f(x_k), v⟩ + (1/2) v^T ∇²f(x_k) v
subject to Av = 0

Optimality conditions

    ∇f(x_k) + ∇²f(x_k) v + A^T λ = 0
    Av = 0

or in matrix form

    [∇²f(x_k)  A^T] [v]   [−∇f(x_k)]
    [A          0 ] [λ] = [    0    ]
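The KKT system above can be assembled and solved directly; a sketch in NumPy (the helper `newton_step` and the toy instance are mine, not from the slides):

```python
import numpy as np

def newton_step(grad, hess, A):
    """Solve the KKT system [H A^T; A 0][v; lam] = [-grad; 0]
    for the equality-constrained Newton direction v (so that A v = 0)."""
    n, p = hess.shape[0], A.shape[0]
    K = np.block([[hess, A.T], [A, np.zeros((p, p))]])
    rhs = np.concatenate([-grad, np.zeros(p)])
    sol = np.linalg.solve(K, rhs)
    return sol[:n], sol[n:]

# Toy instance: f(x) = -sum(log x) at x = (1, 2, 3), one equality constraint.
x = np.array([1.0, 2.0, 3.0])
A = np.ones((1, 3))
g, H = -1.0 / x, np.diag(1.0 / x**2)
v, lam = newton_step(g, H, A)
print(np.allclose(A @ v, 0.0))   # the direction stays in the null space of A
```

A production solver would exploit the block structure (e.g., by elimination, as in the Newton-equations slide later) rather than factoring the full KKT matrix.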
Problem: Newton iterates can leave the feasible set (e.g., the domain of the barrier)
Solution: line search, x_{k+1} = x_k + t v

Exact line search: t̂ = argmin_t f(x_k + t v)
Backtracking line search: pick 0 < α, β < 1;
    while f(x + t v) > f(x) + α ⟨∇f(x), t v⟩ do t = βt
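The backtracking loop above can be sketched in a few lines; a common implementation trick (an assumption here, not from the slides) is to let f return +inf outside its domain so that infeasible trial points also trigger backtracking:

```python
import numpy as np

def backtrack(f, grad_fx, x, v, alpha=0.25, beta=0.5):
    """Backtracking line search (0 < alpha, beta < 1): shrink t until the
    Armijo condition f(x + t*v) <= f(x) + alpha*t*<grad f(x), v> holds.
    f returns +inf outside its domain, which also forces backtracking."""
    t, fx, slope = 1.0, f(x), grad_fx @ v
    while f(x + t * v) > fx + alpha * t * slope:
        t *= beta
    return t

# Example: barrier f(x) = -sum(log x); the full step x + v leaves x > 0.
f = lambda x: -np.sum(np.log(x)) if np.all(x > 0) else np.inf
x = np.array([1.0, 0.1])
v = np.array([2.0, -0.12])   # descent direction that overshoots the boundary
t = backtrack(f, -1.0 / x, x, v)
print(0 < t < 1 and np.all(x + t * v > 0))   # damped step stays strictly feasible
```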
Complexity analysis

f: convex and self-concordant.
Repeat until convergence:
(1) Compute Newton direction v
(2) Compute t̂ from line search
(3) Update x = x + t̂ v

Theorem. Assume ɛ < 1/2. Then f(x_k) − f* ≤ ɛ if

    k ≥ γ (f(x_0) − f*) + log₂ log₂ ɛ^{−1}

where γ is a constant depending only on the line-search parameters. For practical purposes, log₂ log₂ ɛ^{−1} is a constant, e.g., 5.
This lecture: K = K* (symmetry)

(P)  minimize   c^T x
     subject to Gx + s = h,  Ax = b,  s ⪰ 0

(D)  maximize   −h^T z − b^T y          (same cone)
     subject to A^T y + G^T z + c = 0,  z ⪰ 0
Central path

minimize   ⟨c, x⟩ + t^{−1} ϕ(s)
subject to Gx + s = h,  Ax = b

Optimality conditions
- (x, s) feasible and s ≻ 0
- c + A^T y + G^T z = 0, i.e., (y, z) dual feasible with z ≻ 0
- t^{−1} ∇ϕ(s) = −z

Important consequence: compare with the optimality conditions at t = ∞
- (x, s) primal feasible
- (y, z) dual feasible
- (s, z) ⪰ 0
- complementary slackness: ⟨s, z⟩ = 0

Central path
- (x, s) primal feasible
- (y, z) dual feasible
- (s, z) ≻ 0
- relaxed complementary slackness: ⟨s, z⟩ = θ(K)/t
Dual central path

(D)        maximize   −h^T z − b^T y
           subject to A^T y + G^T z + c = 0,  z ⪰ 0

(Dual CP)  minimize   h^T z + b^T y + t^{−1} ϕ(z)
           subject to A^T y + G^T z + c = 0

Theorem. The primal and dual central paths are linked via

    z*(t) = −t^{−1} ∇ϕ(s*(t))
    s*(t) = −t^{−1} ∇ϕ(z*(t))

There is only one central path (s(t), z(t)), and ⟨s(t), z(t)⟩ = θ(K)/t.
Proof. (x, s) on the central path:

    Gx + s = h,  Ax = b,  s ≻ 0
    ∃(y, z):  A^T y + G^T z + c = 0,  t^{−1} ∇ϕ(s) = −z

Dual central path:

    minimize   h^T z + b^T y + t^{−1} ϕ(z)
    subject to A^T y + G^T z + c = 0

Lagrangian:  h^T z + b^T y + t^{−1} ϕ(z) − x^T (A^T y + G^T z + c)

Optimality conditions for (y, z) on the dual central path:

    A^T y + G^T z + c = 0,  z ≻ 0
    ∃x:  b − Ax = 0,  s := h − Gx = −t^{−1} ∇ϕ(z)

Unique central path since (by self-duality of the barrier)

    z = −t^{−1} ∇ϕ(s)  ⟺  s = −t^{−1} ∇ϕ(z)
Finally, ⟨s, z⟩ = ⟨s, −t^{−1} ∇ϕ(s)⟩ = θ(K)/t.
Characterization of the central path

(CP₁) s*(t) strictly feasible
(CP₂) z*(t) strictly feasible
(CP₃) augmented complementary slackness: z*(t) = −t^{−1} ∇ϕ(s*(t))

In the case of SDP:

    t Z*(t) = [S*(t)]^{−1}  ⟹  Z*(t) S*(t) = t^{−1} I  ⟹  trace(Z*(t) S*(t)) = t^{−1} n

(CP₁)-(CP₂)-(CP₃) fully characterize the central path.
Duality gap along the central path

    c^T x + b^T y + h^T z = −y^T Ax − z^T Gx + b^T y + h^T z = (h − Gx)^T z = s^T z = θ(K)/t

Proposition. The duality gap along the central path is t^{−1} θ(K). In particular,

    c^T x − p* ≤ θ(K)/t,    d* + b^T y + h^T z ≤ θ(K)/t

Therefore, as t → ∞,
    (x(t), s(t)) → optimal primal solution
    (y(t), z(t)) → optimal dual solution
Path-following algorithm

- Start with t = t_0 and (x*(t_0), s*(t_0))
- Increase t: t_1 > t_0, and compute (x(t_1), s(t_1)) using Newton's method with (x*(t_0), s*(t_0)) as the initial guess
- Few Newton iterations are needed because we may already be inside the region of quadratic convergence
Barrier method

minimize   c^T x
subject to Gx + s = h,  Ax = b,  s ⪰ 0

Given strictly feasible (x, s), t = t_0, µ > 1, and tol > 0, repeat:
1. Centering step:  minimize t c^T x + ϕ(s) subject to Gx + s = h, Ax = b
2. Update (x, s) = (x*(t), s*(t))
3. Quit if θ(K)/t < tol
4. Increase t = µt
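The loop above can be sketched end to end for a small LP; a minimal NumPy sketch assuming inequality constraints only (no Ax = b), with the log barrier eliminated via s = h − Gx. The feasibility-only damping inside `centering` is a simplification of the backtracking search described earlier, and all function names are mine:

```python
import numpy as np

def centering(x, t, c, G, h, iters=50):
    """Newton's method for the centering step: min t*c^T x + phi(h - G x),
    with phi(s) = -sum(log s) (inequality constraints only)."""
    for _ in range(iters):
        s = h - G @ x
        g = t * c + G.T @ (1.0 / s)          # gradient of the objective
        H = G.T @ np.diag(1.0 / s**2) @ G    # Hessian
        v = np.linalg.solve(H, -g)           # Newton direction
        step = 1.0                           # damp only to stay feasible
        while np.any(h - G @ (x + step * v) <= 0):
            step *= 0.5
        x = x + step * v
        if np.sqrt(v @ H @ v) < 1e-9:        # Newton decrement small: centered
            return x
    return x

def barrier(c, G, h, x0, t0=1.0, mu=10.0, tol=1e-6):
    """Barrier method: center, then increase t by mu until theta/t < tol."""
    x, t, theta = x0, t0, G.shape[0]
    while theta / t >= tol:
        x = centering(x, t, c, G, h)
        t *= mu
    return x

# Toy LP: minimize x1 + x2 subject to x >= 0 (optimal value 0).
c = np.array([1.0, 1.0])
G = -np.eye(2)
h = np.zeros(2)
x = barrier(c, G, h, x0=np.array([1.0, 1.0]))
print(c @ x < 1e-5)   # gap bound theta/t certifies near-optimality
```

The stopping rule in step 3 is exactly the duality-gap bound θ(K)/t from the previous slides.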
Primal-dual path-following methods

- Closely related to barrier methods
- Follow the central path to find approximate solutions
- Steps are computed by linearizing the central-path equations

Central path:

    Gx + s = h
    Ax = b
    A^T y + G^T z + c = 0
    z = −t^{−1} ∇ϕ(s),  (s, z) ≻ 0

e.g. SDP:  G_t(S, Z) := Z − t^{−1} S^{−1} = 0
Main idea: from (t, s, z), update to (t₊, s₊, z₊)

(i) Equivalent system Ḡ_t(s, z) = 0
(ii) Choose t₊ > t and linearize the equation:

    Ḡ_{t₊}(s + Δs, z + Δz) ≈ Ḡ_{t₊}(s, z) + (∂Ḡ_{t₊}/∂s) Δs + (∂Ḡ_{t₊}/∂z) Δz = 0
Suppose the current guess is feasible.

(iii) Solve the system

    G Δx + Δs = 0
    A Δx = 0
    A^T Δy + G^T Δz = 0
    (∂Ḡ_{t₊}/∂s) Δs + (∂Ḡ_{t₊}/∂z) Δz = −Ḡ_{t₊}(s, z)

and update

    s₊ = s + α Δs,   z₊ = z + β Δz
Symmetrization: how do we construct the system Ḡ_t(s, z) = 0?

SDP:  Z = t^{−1} S^{−1}  ⟺  ZS = t^{−1} I  ⟺  SZ = t^{−1} I

Popular approach: make the system symmetric in S and Z:

    (1/2)(SZ + ZS) = t^{−1} I

Fact [requires some thought]: for (S, Z) ≻ 0,

    t^{−1} S^{−1} = Z  ⟺  (1/2)(SZ + ZS) = t^{−1} I

Leads to the Alizadeh-Haeberly-Overton search direction and the SZ + ZS primal-dual path-following method.
Other symmetrizations

LP: with (s, z) ≻ 0 and s ∘ z = (s_i z_i)_{i=1,…,m},

    z = −t^{−1} ∇ϕ(s)  ⟺  s ∘ z = t^{−1} 𝟙
SOCP: {x = (x̄, x_n) ∈ R^n : ‖x̄‖ ≤ x_n}

    ϕ(x) = −(1/2) log D_x,   D_x = x_n² − ‖x̄‖²
    ∇ϕ(x) = (1/D_x) [x̄; −x_n]

Then

    z = −t^{−1} ∇ϕ(s)  ⟺  { t z̄ = −s̄/D_s,  t z_n = s_n/D_s }
                       ⟺  { z̄ s_n + z_n s̄ = 0,  t z_n = s_n/D_s }

where the second equivalence follows from 1/D_s = t z_n / s_n. Since

    t ⟨z̄, s̄⟩ + t z_n s_n = (s_n² − ‖s̄‖²)/D_s = 1,

we have

    z = −t^{−1} ∇ϕ(s)  ⟺  { z̄ s_n + z_n s̄ = 0,  ⟨s̄, z̄⟩ + s_n z_n = 1/t }
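The SOCP equivalence above is easy to verify numerically; a quick check assuming NumPy (the helper `grad_phi` is mine):

```python
import numpy as np

def grad_phi(x):
    """Gradient of phi(x) = -1/2 log(x^T J x) on the Lorentz cone, J = diag(-I, 1)."""
    J = np.diag(np.r_[-np.ones(len(x) - 1), 1.0])
    return -(J @ x) / (x @ J @ x)

t = 2.0
s = np.array([0.3, -0.1, 1.0])   # strictly feasible: ||s_bar|| < s_n
z = -grad_phi(s) / t             # point paired with s on the central path

# The two symmetrized conditions derived above:
sb, sn, zb, zn = s[:-1], s[-1], z[:-1], z[-1]
print(np.allclose(zb * sn + zn * sb, 0.0))      # z_bar s_n + z_n s_bar = 0
print(np.isclose(sb @ zb + sn * zn, 1.0 / t))   # <s, z> = 1/t
```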
Scaling

Idea for SDP: for Q ≻ 0,

    Z = t^{−1} S^{−1}  ⟺  SZ = t^{−1} I  ⟺  Q S Z Q^{−1} = t^{−1} I
                       ⟺  ZS = t^{−1} I  ⟺  Q^{−1} Z S Q = t^{−1} I
                       ⟺  (1/2)[Q S Z Q^{−1} + Q^{−1} Z S Q] = t^{−1} I

- Complete freedom in choosing Q
- Q can vary from one iteration to the next
Change of coordinates

    (1/2)[(QSQ)(Q^{−1}ZQ^{−1}) + (Q^{−1}ZQ^{−1})(QSQ)] = t^{−1} I

Change of coordinates:

    S̃ = Q S Q ≻ 0,   Z̃ = Q^{−1} Z Q^{−1} ≻ 0

- Preserves the positive definite cone
- Preserves the central path
- Convergence analysis is simplified considerably when, at each iteration, Q is chosen such that S̃ and Z̃ commute, where (S, Z) are the iterates to be updated
Nesterov-Todd scaling: choose Q so that S̃ = Z̃

General scaling: for (S, Z) ≻ 0 with G_t(S, Z) = 0, W is a scaling matrix if
- multiplications with W and W^{−T} preserve the cone
- multiplications preserve the central path:

      (S, Z) on CP  ⟺  (W(S), W^{−T}(Z)) on CP

Example K = S^n_+:  W(S) = Q S Q^T,  W^T(S) = Q^T S Q,  W^{−T}(S) = Q^{−T} S Q^{−1}
W is a scaling matrix: it preserves the cone and the central path.
Positive scaling Q ≻ 0:  W(S) = Q S Q,  W^{−T}(Z) = Q^{−1} Z Q^{−1}
Nesterov-Todd scaling

- Used in SeDuMi and SDPT3
- W associated with Ŝ, Ẑ such that

      W^{−T}(Ẑ) = W(Ŝ) = λ    (implies ⟨Ŝ, Ẑ⟩ = ‖λ‖²)

- W^T W = [∇²ϕ(w)]^{−1}, where w is the unique point obeying ∇²ϕ(w) Ŝ = Ẑ
NT scaling for S^n_+

Q ≻ 0:  W(S) = Q S Q,  W^{−T}(Z) = Q^{−1} Z Q^{−1},  W^T W(S) = Q² S Q²

Since [∇²ϕ(P)] S = P^{−1} S P^{−1}, take Q = P^{1/2} where P^{−1} Ŝ P^{−1} = Ẑ:

    P = Ŝ^{1/2} (Ŝ^{1/2} Ẑ Ŝ^{1/2})^{−1/2} Ŝ^{1/2}

Can be computed by Cholesky or SVD computations.
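The closed form for P can be checked directly against its defining relation P^{−1} Ŝ P^{−1} = Ẑ; a sketch assuming NumPy, with the matrix square root done by eigendecomposition (the helper names are mine):

```python
import numpy as np

def msqrt(M):
    """Square root of a symmetric positive definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.sqrt(w)) @ V.T

def nt_scaling_point(S, Z):
    """NT scaling point P = S^{1/2} (S^{1/2} Z S^{1/2})^{-1/2} S^{1/2},
    the unique P > 0 with P^{-1} S P^{-1} = Z."""
    S2 = msqrt(S)
    return S2 @ np.linalg.inv(msqrt(S2 @ Z @ S2)) @ S2

rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3)); S = B @ B.T + np.eye(3)
C = rng.standard_normal((3, 3)); Z = C @ C.T + np.eye(3)
P = nt_scaling_point(S, Z)
Pinv = np.linalg.inv(P)
print(np.allclose(Pinv @ S @ Pinv, Z))   # defining relation of the scaling point
```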
NT scaling for R^n_+

Positive diagonal scaling W = diag(w_i):

    w_i ŝ_i = ẑ_i / w_i  ⟹  w_i = √(ẑ_i / ŝ_i)

    λ = W ŝ = W^{−T} ẑ = { √(ŝ_i ẑ_i) }_i
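For the orthant the scaling reduces to elementwise arithmetic; a two-line check assuming NumPy:

```python
import numpy as np

s = np.array([1.0, 4.0, 0.25])
z = np.array([9.0, 1.0, 4.0])

w = np.sqrt(z / s)               # W = diag(w) solves w_i * s_i = z_i / w_i
lam = w * s                      # scaled point lambda
print(np.allclose(lam, z / w))               # W s = W^{-T} z = lambda
print(np.allclose(lam, np.sqrt(s * z)))      # lambda_i = sqrt(s_i z_i)
```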
NT scaling for the Lorentz cone: Ben-Tal and Nemirovski, Chapter 6.8

    W = β (2vv^T − J)

    v = (w + e_n) / √(2(w_n + 1))

    w = (1/(2γ)) [ Jŝ/√(ŝ^T J ŝ) + ẑ/√(ẑ^T J ẑ) ]

    β = [ ŝ^T J ŝ / ẑ^T J ẑ ]^{1/4}

    γ = [ 1/2 + ŝ^T ẑ / (2 √(ŝ^T J ŝ) √(ẑ^T J ẑ)) ]^{1/2}
Basic primal-dual update

Current iterates (x̂, ŝ) and (ŷ, ẑ) with ŝ ≻ 0, ẑ ≻ 0.

1. Set t such that ŝ^T ẑ = θ(K)/t, and compute the scaling W for ŝ, ẑ
2. Choose t₊ = µt (µ > 1)
3. Solve the KKT system obtained by linearizing the CP equations

       Gx + s = h
       Ax = b
       A^T y + G^T z + c = 0
       Ḡ_{t₊}(W s, W^{−T} z) = 0

   around (ŝ, ẑ)
4. Update

       (x₊, s₊) = (x̂, ŝ) + α_p (Δx, Δs)
       (y₊, z₊) = (ŷ, ẑ) + α_d (Δy, Δz)

   such that positivity is preserved: s₊, z₊ ≻ 0
Linearized CP equations

Residual (r = 0 if strictly feasible):

    r := [ G x̂ + ŝ − h ;  A x̂ − b ;  A^T ŷ + G^T ẑ + c ]

Linear system:

    [ G Δx + Δs ;  A Δx ;  A^T Δy + G^T Δz ] = −r

and

    Ḡ_{t₊}(λ, λ) + DḠ_{t₊}[ W Δs + W^{−T} Δz ] = 0

where the scaling obeys W ŝ = W^{−T} ẑ = λ.
The linearized equation

LP:    Ḡ_t = s ∘ z − t^{−1} e,   s ∘ z = (s_i z_i)_i,   e = 𝟙
SDP:   Ḡ_t = (1/2)[SZ + ZS] − t^{−1} I = S ∘ Z − t^{−1} e,   S ∘ Z = (1/2)[SZ + ZS],   e = I
SOCP:  s ∘ z = [ s_n z̄ + z_n s̄ ;  ⟨s, z⟩ ],   e = (0, …, 0, 1)

Ḡ_{t₊}(λ, λ) = λ ∘ λ − t₊^{−1} e, and DḠ_{t₊}[ W Δs + W^{−T} Δz ] = λ ∘ [ W Δs + W^{−T} Δz ], so the linearized equation reads

    λ ∘ [ W Δs + W^{−T} Δz ] = t₊^{−1} e − λ ∘ λ
Path-following algorithm

Choose starting points x̂, ŷ, ŝ ≻ 0, ẑ ≻ 0.

1. Compute residuals and evaluate the stopping criteria:

       r = [ G x̂ + ŝ − h ;  A x̂ − b ;  A^T ŷ + G^T ẑ + c ]

   Terminate if r and ŝ^T ẑ are sufficiently small.

2. Compute the scaling matrix W:

       λ = W ŝ = W^{−T} ẑ,    1/t := ŝ^T ẑ / θ(K)
3. Compute affine scaling directions: solve

       [ G Δx_a + Δs_a ;  A Δx_a ;  A^T Δy_a + G^T Δz_a ] = −r
       λ ∘ [ W Δs_a + W^{−T} Δz_a ] = −λ ∘ λ

4. Select the barrier parameter:

       σ = [ (ŝ + α_p Δs_a)^T (ẑ + α_d Δz_a) / (ŝ^T ẑ) ]^δ,    t₊ = t/σ

   where

       α_p = sup {α ∈ [0, 1] : ŝ + α Δs_a ⪰ 0}
       α_d = sup {α ∈ [0, 1] : ẑ + α Δz_a ⪰ 0}

   δ: algorithm parameter (typical value is δ = 3)
5. Compute the search direction:

       [ G Δx + Δs ;  A Δx ;  A^T Δy + G^T Δz ] = −r
       λ ∘ [ W Δs + W^{−T} Δz ] = t₊^{−1} e − λ ∘ λ

6. Update the iterates:

       (x̂, ŝ) = (x̂, ŝ) + min{1, 0.99 α_p} (Δx, Δs)
       (ŷ, ẑ) = (ŷ, ẑ) + min{1, 0.99 α_d} (Δy, Δz)

   where

       α_p = sup {α ≥ 0 : ŝ + α Δs ⪰ 0}
       α_d = sup {α ≥ 0 : ẑ + α Δz ⪰ 0}
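For the nonnegative orthant, the step sizes α_p, α_d in step 6 have a closed form: only coordinates moving toward the boundary constrain the step. A minimal sketch assuming NumPy (the helper `max_step` is mine):

```python
import numpy as np

def max_step(s, ds):
    """Largest alpha with s + alpha*ds >= 0 on the nonnegative orthant:
    only coordinates with ds_i < 0 constrain the step."""
    neg = ds < 0
    if not np.any(neg):
        return np.inf
    return np.min(-s[neg] / ds[neg])

s = np.array([1.0, 2.0, 0.5])
ds = np.array([-2.0, 1.0, -0.1])
alpha = max_step(s, ds)
print(alpha)   # 0.5: the first coordinate hits zero first
# the update then uses min(1, 0.99*alpha) to stay strictly inside the cone
print(np.all(s + 0.99 * alpha * ds > 0))
```

For the Lorentz and semidefinite cones the analogous quantities come from a quadratic equation and a generalized eigenvalue problem, respectively.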
Interpretation

- Step 3: the affine scaling directions solve the linearized CP equations
- Step 4: heuristic for updating t → t₊ based on an estimate of the quality of the affine scaling direction; σ is small if the step in the affine scaling direction gives a large reduction in ŝ^T ẑ
- Step 5: the system has the same coefficient matrix as in step 3; with a direct method we can solve the two systems for the cost of one (i.e., reuse the matrix factorization)
Mehrotra correction

Step 5: solve the same system, but with the RHS of the linearized equation replaced by

    t₊^{−1} e − λ ∘ λ − [W Δs_a] ∘ [W^{−T} Δz_a]

The extra term is an approximation of the second-order term in

    W(ŝ + Δs) ∘ W^{−T}(ẑ + Δz) = t₊^{−1} e

Typically saves a few iterations.
Newton equations

Eliminating Δs reduces the system to

    [ 0    A^T   G^T    ] [Δx]
    [ A    0     0      ] [Δy] = RHS
    [ G    0    −W^T W  ] [Δz]

Eliminating Δz:

    [ G^T W^{−1} W^{−T} G   A^T ] [Δx]
    [ A                      0  ] [Δy] = RHS

Because W^T W = [∇²ϕ(w)]^{−1} (NT scaling),

    G^T W^{−1} W^{−T} G = G^T ∇²ϕ(w) G

is the Hessian of the barrier ϕ(h − Gx) at the scaling point w.
Complexity analysis: SDP

Short-step path-following methods based on commutative scalings (e.g. NT).

Neighborhood of the central path:

    (t̂, Ŝ, Ẑ) ∈ N_{0.1}  ⟺  ‖ t̂ Ŝ^{1/2} Ẑ Ŝ^{1/2} − I ‖_2 ≤ 0.1,   Ŝ, Ẑ strictly feasible

1. Choose a new value of t:

       t₊ = (1 − χ/√n)^{−1} t̂,    χ: parameter

2. Solve the linearized CP equations with a commutative scaling
Key result

Theorem. If χ ≤ 0.1, then Ŝ₊, Ẑ₊ are strictly feasible, ⟨Ŝ₊, Ẑ₊⟩ = n/t₊, and (t̂₊, Ŝ₊, Ẑ₊) ∈ N_{0.1}.

- Same proximity to the central path
- Value of the centrality parameter larger by a factor 1 + O(1)/√n
- Once we reach N_{0.1}, we trace the primal-dual central path by staying in N_{0.1} and increasing the parameter by an absolute constant factor every O(√n) steps
In general,

    t₊ = (1 − 0.1/√θ(K))^{−1} t

Once we have managed to get close to the central path, every O(√θ(K)) steps of the scheme improves the quality of the approximations by an absolute constant factor. In particular, it takes no more than

    O(1) √θ(K) log ( 1 + θ(K)/(t_0 ɛ) )

steps to generate a strictly feasible ɛ-solution.
References

1. A. Ben-Tal and A. Nemirovski, Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications, MPS-SIAM Series on Optimization
2. S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press
3. L. Vandenberghe, EE236C (Spring 2011), UCLA