Infeasible Primal-Dual (Path-Following) Interior-Point Methods for Semidefinite Programming*

Notes for CAAM 564, Spring 2012
Dept. of CAAM, Rice University

Outline
(1) Introduction
(2) Formulation & a complexity theorem
(3) Computation and numerical results
(4) Comments

* Yin Zhang: On extending primal-dual interior-point algorithms from LP to SDP, SIOPT, published in 1998 (Tech. Report, 1995).
Primal SDP:
    min  C • X
    s.t. A_i • X = b_i,  i = 1:m
         X ⪰ 0
where C • X = tr(C^T X), the A_i and C are symmetric, and X ⪰ 0 means X is symmetric positive semidefinite.

Diagonal C, A_i  ⟹  LP ⊂ SDP

SDP constraints are nonlinear; e.g. for a 2×2 matrix,
    X ⪰ 0  ⟺  { x11 ≥ 0,  x11 x22 ≥ x12² }

Dual SDP:
    max  b^T y
    s.t. Σ_{i=1}^m y_i A_i + Z = C
         Z ⪰ 0
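The nonlinear 2×2 PSD condition above is easy to sandbox. A minimal NumPy sketch (an added illustration, not part of the notes; a check x22 ≥ 0 is included to cover the degenerate x11 = 0 case), compared against a direct eigenvalue test:

```python
import numpy as np

def psd_2x2(x11, x12, x22):
    # The nonlinear 2x2 PSD conditions from the slide: x11 >= 0 and
    # x11*x22 >= x12^2 (plus x22 >= 0 for the degenerate x11 = 0 case).
    return x11 >= 0 and x22 >= 0 and x11 * x22 >= x12 ** 2

def psd_by_eigs(x11, x12, x22):
    # Reference check: all eigenvalues of the symmetric matrix nonnegative.
    X = np.array([[x11, x12], [x12, x22]], dtype=float)
    return bool(np.linalg.eigvalsh(X).min() >= -1e-12)
```

Both tests agree, e.g. [[2,1],[1,1]] is PSD while [[1,2],[2,1]] is indefinite.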
Strong Duality: C • X − b^T y = X • Z = 0 at optimality if both primal and dual are strictly feasible (Slater constraint qualification).

Note: For X, Z ⪰ 0,  X • Z = 0  ⟺  XZ = 0.

Proof: (⟸) is trivial. (⟹):
    X • Z = tr(XZ) = 0
    ⟹ tr(X Z^{1/2} Z^{1/2}) = 0 ⟹ tr(Z^{1/2} X Z^{1/2}) = 0
    ⟹ Σ_i λ_i(Z^{1/2} X Z^{1/2}) = 0
    ⟹ λ_i(Z^{1/2} X Z^{1/2}) = 0, ∀i
    ⟹ Z^{1/2} X Z^{1/2} = 0   (symmetry)
    ⟹ (Z^{1/2} X^{1/2})(Z^{1/2} X^{1/2})^T = 0
    ⟹ Z^{1/2} X^{1/2} = 0  ⟹  ZX = 0.
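The equivalence X • Z = 0 ⟺ XZ = 0 for PSD matrices can be checked numerically. The construction below (an added illustration) builds PSD X and Z with orthogonal ranges, so both the scalar gap tr(XZ) and the matrix product XZ vanish:

```python
import numpy as np

rng = np.random.default_rng(1)
# Build PSD X and Z with complementary (orthogonal) ranges, so tr(XZ) = 0:
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # random orthonormal basis
X = Q[:, :2] @ np.diag([3.0, 1.0]) @ Q[:, :2].T    # PSD, range = span(q1, q2)
Z = Q[:, 2:] @ np.diag([2.0, 5.0]) @ Q[:, 2:].T    # PSD, range = span(q3, q4)

print(abs(np.trace(X @ Z)))    # the duality gap X . Z, near machine zero
print(np.abs(X @ Z).max())     # the matrix product XZ itself, near machine zero
```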
SDP Applications:
- Eigenvalue optimization
- Control theory: "Linear Matrix Inequalities in System and Control Theory", book by Boyd et al., SIAM, 1994
- Combinatorial optimization (example: Max-Cut relaxation)
- Non-convex optimization (example: QCQP relaxations)
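To make the Max-Cut example concrete: for a graph with weight matrix W and Laplacian L, the standard relaxation is max (L/4) • X s.t. diag(X) = 1, X ⪰ 0, which fits the primal form of these notes with C = −L/4, A_i = e_i e_i^T, b_i = 1. A small data-construction sketch (the function name is illustrative, not from the notes):

```python
import numpy as np

def maxcut_sdp_data(W):
    # Max-Cut relaxation: max (L/4) . X  s.t. diag(X) = 1, X psd,
    # written in primal form  min C . X,  A_i . X = b_i,  X psd.
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W                    # graph Laplacian
    C = -L / 4.0                                      # minimize the negative cut value
    As = [np.outer(np.eye(n)[i], np.eye(n)[i]) for i in range(n)]  # e_i e_i^T
    b = np.ones(n)
    return C, As, b
```

For the triangle graph (W = ones − I), this gives m = n = 3 diagonal constraints and tr(C) = −(sum of degrees)/4 = −1.5.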
Definitions:
Interior point: (X, y, Z) with X, Z ≻ 0
Infeasible alg.: allows infeasible iterates
Primal-dual alg.: perturbed and damped Newton (or composite Newton) method applied to a form of the optimality-condition system.

Standard optimality conditions: find (X, y, Z), X, Z ⪰ 0, such that
    A_i • X − b_i = 0, i = 1:m        (m equations)
    Σ_i y_i A_i + Z − C = 0           (n(n+1)/2 equations)
    XZ = 0                            (n² equations)

This optimality system is not square; complementarity needs symmetrization.

Weighted symmetrization (YZ 95):
    H_P(M) = (P M P^{−1} + P^{−T} M^T P^T)/2

    XZ = τI  ⟺  H_P(XZ) = τI
    complementarity:  H_P(XZ) = 0
    the central path: H_P(XZ) = µI
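The weighted symmetrization is easy to sandbox (an added sketch; `H` is an illustrative name): for any invertible P, H_P(M) is symmetric, H_I reduces to the plain symmetrization (M + M^T)/2, and H_P(τI) = τI, which is the equivalence used above.

```python
import numpy as np

def H(P, M):
    # Weighted symmetrization H_P(M) = (P M P^{-1} + P^{-T} M^T P^T)/2.
    Pinv = np.linalg.inv(P)
    return (P @ M @ Pinv + Pinv.T @ M.T @ P.T) / 2

rng = np.random.default_rng(0)
P = rng.standard_normal((3, 3)) + 3 * np.eye(3)     # some invertible weight
M = rng.standard_normal((3, 3))                     # a generic nonsymmetric M

S = H(P, M)
assert np.allclose(S, S.T)                          # H_P(M) is symmetric
assert np.allclose(H(np.eye(3), M), (M + M.T) / 2)  # P = I: plain symmetrization
assert np.allclose(H(P, 2 * np.eye(3)), 2 * np.eye(3))  # H_P(tau*I) = tau*I
```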
H_P unifies a family of primal-dual formulations. The 3 most promising members:

AHO (Alizadeh/Haeberly/Overton 94): P = I
    XZ + ZX = 0
KSH (Kojima et al. 94, Monteiro 95): P_k = [Z^k]^{1/2}
    Z^k X Z + Z X Z^k = 0
NT (Nesterov/Todd 95): W = X^{1/2}(X^{1/2} Z X^{1/2})^{−1/2} X^{1/2},  P_k = [W^k]^{−1/2}

A sequence of equivalent systems for varying P_k:

    Formulation    AHO   KSH   NT
    Real Newton?   Yes   No    No

AHO: difficult to analyze its complexity.
KSH: complexity results were obtained for feasible or short-step methods.
In favor of: infeasible long-step methods, which are implementable and most efficient in practice.
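The NT scaling matrix W defined above satisfies W Z W = X (it maps Z to X), which can be verified numerically. A sketch using eigendecompositions for the matrix square roots (`sqrtm_psd` and `nt_scaling` are illustrative names, not from the notes):

```python
import numpy as np

def sqrtm_psd(S):
    # Symmetric PSD square root via eigendecomposition.
    w, V = np.linalg.eigh(S)
    return V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T

def nt_scaling(X, Z):
    # W = X^{1/2} (X^{1/2} Z X^{1/2})^{-1/2} X^{1/2}
    Xh = sqrtm_psd(X)
    S = Xh @ Z @ Xh
    w, V = np.linalg.eigh(S)
    S_invh = V @ np.diag(1.0 / np.sqrt(w)) @ V.T      # S^{-1/2}
    return Xh @ S_invh @ Xh

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4)); X = B @ B.T + np.eye(4)   # X > 0
B = rng.standard_normal((4, 4)); Z = B @ B.T + np.eye(4)   # Z > 0
W = nt_scaling(X, Z)
print(np.abs(W @ Z @ W - X).max())   # ~0: the defining property W Z W = X
```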
Central path: {(X, Z) ≻ 0 : XZ = µI, µ > 0}

On the central path: λ_i(XZ) = µ, ∀i.

[Figure: the central path in the (λ1(XZ), λ2(XZ)) plane, with a narrow and a wide neighborhood around it.]

Short-step methods use a narrow neighborhood.
Long-step methods use a wide neighborhood.
A narrow neighborhood: for α ∈ (0, 1),
    N_α = {(X, Z) ≻ 0 : ‖XZ − µI‖_F ≤ αµ}
where µ = tr(XZ)/n is the normalized duality gap.

A wide neighborhood: for γ ∈ (0, 1),
    N_γ = {(X, Z) ≻ 0 : λ_min(XZ) ≥ γµ},  µ = tr(XZ)/n.

A neighborhood with feasibility priority:
    N̂_γ = {(X, Z) ∈ N_γ : r/r_0 ≤ µ/µ_0}
where r measures infeasibility.
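Membership in the two neighborhoods can be tested directly. Since XZ is similar to Z^{1/2} X Z^{1/2}, its eigenvalues are real and nonnegative for X, Z ≻ 0. A sketch (the function name and default parameters are illustrative):

```python
import numpy as np

def neighborhoods(X, Z, alpha=0.5, gamma=0.1):
    # Membership tests for the narrow (N_alpha) and wide (N_gamma)
    # central-path neighborhoods defined above.
    n = X.shape[0]
    mu = np.trace(X @ Z) / n                               # normalized duality gap
    in_narrow = np.linalg.norm(X @ Z - mu * np.eye(n)) <= alpha * mu
    # XZ has the same real, nonnegative eigenvalues as Z^{1/2} X Z^{1/2}:
    lam = np.sort(np.linalg.eigvals(X @ Z).real)
    in_wide = lam[0] >= gamma * mu
    return in_narrow, in_wide

# On the central path XZ = mu*I, so both tests pass, e.g. X = 2I, Z = I:
narrow, wide = neighborhoods(2 * np.eye(3), np.eye(3))
print(bool(narrow), bool(wide))   # True True
```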
Column-stacking operator vec : R^{n×n} → R^{n²}:
    vec [1 3; 2 4] = (1 2 3 4)^T

SDP: find v = (X, y, Z), X, Z ⪰ 0, such that
             [ A^T y + vec Z − vec C ]
    F_k(v) = [ A(vec X) − b          ] = 0,
             [ vec(Z^k X Z + Z X Z^k)]
where A^T = [vec A_1 … vec A_m].

Infeasible long-step algorithm (YZ 95):
    Choose v^0 = (X^0, y^0, Z^0), X^0, Z^0 ≻ 0.
    For k = 0, 1, 2, …, until µ_k ≤ 2^{−L} do
        (1) solve  F_k′(v^k) Δv = −F_k(v^k) + σ µ_k p^k
        (2) choose α,  v^{k+1} = v^k + α Δv ∈ N̂_γ
    End
where σ ∈ (0, 1) and the centering term p^k is derived from
    Z^k X Z + Z X Z^k − 2σµ_k Z^k = 0.
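In NumPy terms (an added note), column stacking is Fortran-order flattening, and the constraint matrix A with A^T = [vec A_1 … vec A_m] can be assembled the same way:

```python
import numpy as np

def vec(M):
    # Column-stacking operator vec : R^{n x n} -> R^{n^2}.
    return M.flatten(order="F")          # 'F' = column-major (Fortran) order

v = vec(np.array([[1, 3], [2, 4]]))
print(v)                                 # [1 2 3 4]

# The constraint matrix with rows vec(A_i)^T, i.e. A^T = [vec A_1 ... vec A_m]:
As = [np.eye(2), np.ones((2, 2))]
A = np.stack([vec(Ai) for Ai in As])     # shape: m x n^2
```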
The key to convergence analysis is to estimate the 2nd-order term.

Key Lemma (YZ 95): Under the conditions that
(1) a solution (X*, y*, Z*) exists,
(2) X^0 = Z^0 = ρI for ρ ≥ max{ λ1(X*), λ1(Z*), tr(X* + Z*)/n },   (*)
(3) (X, Z) ∈ N̂_γ,
we have
    ‖Z^{1/2} ΔX ΔZ Z^{−1/2}‖_F ≤ 19n · (λ1(XZ) tr(XZ)/2) / (λn(XZ) γ).

Polynomial Complexity (YZ 95):
Theorem: Under conditions (1)-(3), the algorithm terminates (µ_k ≤ 2^{−L}) in at most O(n^{2.5} L) iterations.
Global convergence holds without (*) in (2).
Computation: how difficult is it to solve SDP?

Distinctions from LP:
(1) Special code for special problems, e.g. LP
(2) Cost of the linear system varies with the method
(3) Parallelism more important than sparsity

Linear system F_k′(v^k) Δv = −F_k(v^k) + σµ_k p^k:

    [ 0  A^T  I ] [ vec ΔX ]   [ vec R_d ]
    [ A   0   0 ] [   Δy   ] = [   r_p   ]
    [ E   0   F ] [ vec ΔZ ]   [ vec R_c ]

E and F are no longer diagonal.

Solution procedure:
(1) M = A(E^{−1}F)A^T
(2) M Δy = r_p + A[E^{−1}(F vec R_d − vec R_c)]
(3) ΔZ = R_d − Σ_{i=1}^m Δy_i A_i
(4) ΔX = σµ Z^{−1} − X − H(X ΔZ Z^{−1}),  with H(M) = (M + M^T)/2

Most computation is in steps (1)-(2). M is m×m, with m ≤ n(n+1)/2.
Consider forming M = A(E^{−1}F)A^T.
Notation: Kronecker product U ⊗ V = [u_ij V].

KSH: E = Z ⊗ Z,  F = (ZX ⊗ I + I ⊗ ZX)/2
    M ≻ 0,  M_ij = tr(Z^{−1} A_i X A_j)

AHO: E = Z ⊗ I + I ⊗ Z,  F = X ⊗ I + I ⊗ X
    M asymmetric (LU)
    Z P_i + P_i Z = A_i  (a Lyapunov equation for P_i)
    M_ij = tr(P_i (X A_j + A_j X))

NT: W = X^{1/2}(X^{1/2} Z X^{1/2})^{−1/2} X^{1/2}
    E = I ⊗ I,  F = W ⊗ W
    M ≻ 0,  M_ij = tr(A_i W A_j W)
    2 Cholesky + SVD for W (TTT 96)

work(AHO) > work(NT) > work(KSH)
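For the KSH member, the Kronecker formula and the elementwise trace formula agree, as a quick NumPy check on random data confirms (an added illustration; "(x)" in the comments denotes the Kronecker product):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
vec = lambda M: M.flatten(order="F")     # column-stacking vec
As = []
for _ in range(m):
    B = rng.standard_normal((n, n)); As.append((B + B.T) / 2)   # symmetric A_i
B = rng.standard_normal((n, n)); X = B @ B.T + np.eye(n)        # X > 0
B = rng.standard_normal((n, n)); Z = B @ B.T + np.eye(n)        # Z > 0

# Elementwise Schur matrix: M_ij = tr(Z^{-1} A_i X A_j)
Zi = np.linalg.inv(Z)
M1 = np.array([[np.trace(Zi @ Ai @ X @ Aj) for Aj in As] for Ai in As])

# Same matrix via M = A (E^{-1} F) A^T with E = Z (x) Z, F = (ZX (x) I + I (x) ZX)/2
A = np.stack([vec(Ai) for Ai in As])
E = np.kron(Z, Z)
F = (np.kron(Z @ X, np.eye(n)) + np.kron(np.eye(n), Z @ X)) / 2
M2 = A @ np.linalg.solve(E, F @ A.T)

print(np.abs(M1 - M2).max())   # ~0 (up to roundoff)
```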
Leading cost* per iteration for dense problems
(Form: O(m(m+n)n²); Solve: O(m³))

    m =    O(1)    O(n)    O(n²)
    Form   O(n³)*  O(n⁴)*  O(n⁶)
    Solve  O(1)    O(n³)   O(n⁶)*

(* marks the leading cost.) Forming M is the most expensive step.

Bad news: O(n⁴) per iteration most likely.
Good news: rich parallelism
- Coarse grain: compute the M_ij independently
- Fine grain: rich in matrix multiplications
Expect (almost) linear speed-up.
Preliminary implementation:

Predictor-Corrector Algorithm:
    Choose v^0 = (X^0, y^0, Z^0), X^0, Z^0 ≻ 0.
    For k = 0, 1, 2, …, until µ_k ≤ 2^{−L} do
        (1) F_k′(v^k) Δv_p = −F_k(v^k). Choose σ_k.
        (2) F_k′(v^k) Δv_c = −F_k(v^k + Δv_p) + σ_k µ_k p^k
        (3) Let Δv = Δv_p + Δv_c
        (4) choose α,  v^{k+1} = v^k + α Δv (interior)
    End
(an extension of Mehrotra's framework to SDP)

LiteSDP, a short MATLAB code for general problems, dense or sparse:
- only the KSH formulation implemented
- backtracking for α using Cholesky
- initial point X^0 = Z^0 = nI
- tested on randomly generated problems
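LiteSDP itself is MATLAB; below is a much-simplified Python/NumPy sketch of the underlying KSH/HKM iteration (predictor-free, fixed σ, Cholesky backtracking for α, following steps (1)-(4) of the solution procedure above). The function name `hkm_sdp` and all parameter defaults are illustrative, not from the notes.

```python
import numpy as np

def hkm_sdp(C, As, b, tol=1e-7, sigma=0.3, max_iter=100):
    # Infeasible primal-dual interior-point sketch with the KSH/HKM direction.
    n, m = C.shape[0], len(As)
    X, Z, y = np.eye(n), np.eye(n), np.zeros(m)
    for _ in range(max_iter):
        mu = np.trace(X @ Z) / n                              # duality-gap measure
        Rd = C - Z - sum(yi * Ai for yi, Ai in zip(y, As))    # dual residual
        rp = b - np.array([np.trace(Ai @ X) for Ai in As])    # primal residual
        if mu < tol and np.abs(rp).max() < tol and np.abs(Rd).max() < tol:
            break
        Zi = np.linalg.inv(Z)
        # Step (1): Schur complement M_ij = tr(Z^{-1} A_i X A_j)
        M = np.array([[np.trace(Zi @ Ai @ X @ Aj) for Aj in As] for Ai in As])
        # Step (2): right-hand side from eliminating dX and dZ
        g = np.array([np.trace(Ai @ X) + np.trace(Ai @ X @ Rd @ Zi)
                      - sigma * mu * np.trace(Ai @ Zi) for Ai in As])
        dy = np.linalg.solve(M, rp + g)
        # Step (3): back-substitute for dZ
        dZ = Rd - sum(dyi * Ai for dyi, Ai in zip(dy, As))
        # Step (4): dX = sigma*mu*Z^{-1} - X - X dZ Z^{-1}, then symmetrize
        dX = sigma * mu * Zi - X - X @ dZ @ Zi
        dX = (dX + dX.T) / 2
        # Backtracking for alpha via Cholesky (positive-definiteness test)
        alpha = 1.0
        while alpha > 1e-12:
            try:
                np.linalg.cholesky(X + alpha * dX)
                np.linalg.cholesky(Z + alpha * dZ)
                break
            except np.linalg.LinAlgError:
                alpha *= 0.8
        X = X + 0.98 * alpha * dX
        Z = Z + 0.98 * alpha * dZ
        y = y + 0.98 * alpha * dy
    return X, y, Z
```

As a sanity check: min C • X s.t. tr(X) = 1, X ⪰ 0 (m = 1, A_1 = I, b = 1) has optimal value λ_min(C), so the returned objective should match the smallest eigenvalue of C.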
General problem: m = 400, n = 200

Iter. No   P_inf      D_inf      Gap        Cpu
--------------------------------------------
iter  0:   9.62e-01   1.15e+00   1.70e+03   0.77
iter  1:   1.52e-15   9.60e-01   1.33e+00   1.45
iter  2:   1.63e-15   2.12e-16   3.11e+00   1.81
iter  3:   1.31e-15   2.12e-16   2.18e+00   1.51
...
iter  9:   1.20e-15   2.09e-16   1.34e-03   1.52
iter 10:   1.27e-15   2.10e-16   3.83e-04   1.73
iter 11:   1.36e-15   2.11e-16   4.56e-05   1.51
iter 12:   1.26e-15   2.10e-16   9.23e-06   1.92
iter 13:   1.28e-15   2.00e-16   1.09e-06   1.82
iter 14:   1.29e-15   2.02e-16   1.10e-07   1.54
iter 15:   1.17e-15   2.02e-16   1.11e-08   1.78
iter 16:   1.19e-15   2.06e-16   1.22e-09   ---
-------------------------------------------
[m n]=[400 200]  obj=-739.355  cputime=26.09

Most of the time (on a MacBook Air) was spent forming M.
Max-Cut: n = 500 (same general code)

Iter. No   P_inf      D_inf      Gap        Cpu
------------------------------------------
iter  0:   2.04e+01   5.08e+00   4.47e+04   0.27
iter  1:   0.00e+00   2.51e-16   8.23e-01   1.91
iter  2:   0.00e+00   2.17e-16   5.46e-01   1.98
iter  3:   0.00e+00   1.22e-16   2.37e-01   2.16
iter  4:   0.00e+00   1.03e-16   1.47e-01   1.80
iter  5:   0.00e+00   1.05e-16   5.90e-02   2.01
iter  6:   0.00e+00   1.06e-16   1.25e-02   1.82
iter  7:   0.00e+00   9.74e-17   7.59e-03   1.87
iter  8:   0.00e+00   1.07e-16   9.10e-04   1.89
iter  9:   0.00e+00   1.09e-16   1.07e-04   1.79
iter 10:   0.00e+00   9.68e-17   1.08e-05   2.00
iter 11:   0.00e+00   1.04e-16   1.09e-06   1.80
iter 12:   0.00e+00   9.62e-17   1.10e-07   1.76
iter 13:   0.00e+00   1.05e-16   1.11e-08   1.91
iter 14:   0.00e+00   1.06e-16   1.12e-09   ---
---------------------------------------------
[m n]=[500 500]  obj=-3633.6  cputime=25.2
Which is most efficient: AHO, KSH, or NT?

Comparative study I: AHO (94-95)
- AHO: fewest iterations, more robust, fast local convergence
- Comparison on iteration counts only, no CPU times (AHO requires more work per iteration)

Comparative study II: TTT (Mar. 96)
No. of iterations / CPU per iteration on Max-Cut, n = 50:

                 AHO       KSH       NT(Schur)  NT(LS)
    iter/CPU     7.4/5.1   7.6/3.8   8.4/4.0    8.4/4.4
    total CPU    37.74     28.88     33.6       36.96

Schur: LU on M = A E^{−1} F A^T
LS: QR on Â^T, where M = ÂÂ^T
NT(LS) achieves higher accuracy (a major advantage of NT over AHO & KSH).
How to write M = A E^{−1} F A^T = ÂÂ^T?

NT: 2 Cholesky + SVD give W^{1/2} (TTT 3/96)
    Â = A(W^{1/2} ⊗ W^{1/2}),   e_i^T Â = vec(W^{1/2} A_i W^{1/2})^T

AHO: M = A E^{−1} F A^T is asymmetric; no such factorization.

KSH: Cholesky factors X = L_x L_x^T, Z = L_z L_z^T are already available if using backtracking:
    Â = A(L_x ⊗ L_z^{−T}),   e_i^T Â = vec(L_z^{−1} A_i L_x)^T
Least cost per iteration among the three.
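The KSH factorization can be checked numerically: with rows e_i^T Â = vec(L_z^{−1} A_i L_x)^T, the product ÂÂ^T reproduces M_ij = tr(Z^{−1} A_i X A_j). A quick NumPy check on random data (an added illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
vec = lambda M: M.flatten(order="F")     # column-stacking vec
As = []
for _ in range(m):
    B = rng.standard_normal((n, n)); As.append((B + B.T) / 2)   # symmetric A_i
B = rng.standard_normal((n, n)); X = B @ B.T + np.eye(n)        # X > 0
B = rng.standard_normal((n, n)); Z = B @ B.T + np.eye(n)        # Z > 0

Lx = np.linalg.cholesky(X)               # X = Lx Lx^T
Lz = np.linalg.cholesky(Z)               # Z = Lz Lz^T
Lzi = np.linalg.inv(Lz)

# Row i of A-hat is vec(Lz^{-1} A_i Lx)^T; then M = A-hat A-hat^T.
Ahat = np.stack([vec(Lzi @ Ai @ Lx) for Ai in As])
M_ls = Ahat @ Ahat.T

# Compare with the elementwise formula M_ij = tr(Z^{-1} A_i X A_j):
Zi = np.linalg.inv(Z)
M = np.array([[np.trace(Zi @ Ai @ X @ Aj) for Aj in As] for Ai in As])
print(np.abs(M - M_ls).max())   # ~0 (up to roundoff)
```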