A Proximal Alternating Direction Method for Semi-Definite Rank Minimization (Supplementary Material)


Ganzhao Yuan and Bernard Ghanem
King Abdullah University of Science and Technology (KAUST), Saudi Arabia

The supplementary material is organized as follows. Section 1 presents the details of our proofs. Section 2 presents the convergence analysis of the proposed MPEC-based proximal ADM algorithm. Section 3 presents the classical ADM for solving the subproblem. Finally, Section 4 presents additional experimental results.

1 Proofs

Lemma 1. [Semidefinite Embedding Lemma (Fazel, Hindi, and Boyd 2003)] Let R ∈ R^{m×n} be a given matrix. Then rank(R) ≤ r if and only if there exist matrices S = S^T ∈ R^{m×m} and T = T^T ∈ R^{n×n} such that

  [S, R; R^T, T] ⪰ 0,  rank(S) + rank(T) ≤ 2r.  (1)

Remark. We remark that the semidefinite optimization problem discussed in this paper is very general. To illustrate this point, we consider the following optimization problem:

  min_R g(R) + λ·rank(R).  (2)

By the Semidefinite Embedding Lemma, we have the following equivalent semidefinite optimization problem:

  min_{R,S,T} g(R) + (λ/2)·rank([S, R; R^T, T]),  s.t. [S, R; R^T, T] ⪰ 0.

Since [S, R; R^T, T] is positive semidefinite, by the Schur complement condition¹ it holds that S ⪰ 0 and T ⪰ 0; then the block-diagonal matrix diag(S, T) is also positive semidefinite. Using the MPEC reformulation, we have the following equivalent optimization problem:

  min_{R,S,T,V} g(R) + (λ/2)·tr(I − V),
  s.t. 0 ⪯ V ⪯ I,  ⟨V, [S, R; R^T, T]⟩ = 0,  [S, R; R^T, T] ⪰ 0.

Clearly, the equivalent optimization problem in Eq (2) can be solved by our proposed PADM algorithm.

Here, we prove the variational formulation of the NP-hard rank function.

Lemma 2. For any given positive semidefinite matrix X ∈ R^{n×n}, it holds that

  rank(X) = min_V tr(I − V),  s.t. ⟨V, X⟩ = 0,  0 ⪯ V ⪯ I,  (3)

and the unique optimal solution of the minimization problem in Eq (3) is given by V* = U diag(1 − sign(σ)) U^T, where X = U diag(σ) U^T is the eigenvalue decomposition of X and sign is the standard signum function².

Proof. First of all, we let X and V be arbitrary feasible solutions with eigenvalue decompositions X = U diag(σ) U^T and V = S diag(δ) S^T, respectively. It always holds that

  ⟨X, V⟩ = ‖diag(√σ) U^T S diag(√δ)‖_F² ≥ 0,

and the equality ⟨X, V⟩ = 0 is achieved only when V and X are simultaneously unitarily diagonalizable, i.e. both V and X share the same spectral decomposition basis. Therefore, the feasible set for V in Eq (3) must be contained in the set {V | V = U diag(v) U^T, 0 ≤ v ≤ 1, v ∈ R^n}. Using the facts that tr(V) = ⟨v, 1⟩, tr(X) = ⟨σ, 1⟩ and ⟨V, X⟩ = ⟨U diag(v) U^T, U diag(σ) U^T⟩ = ⟨v, σ⟩, Eq (3) reduces to the vector case:

  rank(X) = min_v ⟨1, 1 − v⟩,  s.t. ⟨v, σ⟩ = 0,  0 ≤ v ≤ 1.  (4)

Since v ≥ 0 and σ ≥ 0, Eq (4) is equivalent to Eq (5):

  rank(X) = min_v ⟨1, 1 − v⟩,  s.t. v ⊙ σ = 0,  0 ≤ v ≤ 1,  (5)

where ⊙ denotes the Hadamard product (also known as the entrywise product). Note that for all i ∈ [n]: when σ_i = 0, v_i = 1 will be achieved by the minimization; when σ_i > 0, v_i = 0 will be enforced by the constraint. Since the objective function in Eq (5) is linear, the minimum is always attained at the boundary of the feasible set. Thus, v* = 1 − sign(σ). Finally, we have V* = U diag(1 − sign(σ)) U^T. We thus complete the proof of this lemma.

¹The Schur complement condition states that X = [A, B; B^T, C] ⪰ 0 implies A ⪰ 0 and the (generalized) Schur complement S = C − B^T A^† B ⪰ 0.
²sign(x) = −1 if x < 0; 0 if x = 0; 1 if x > 0.
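For concreteness, the following small NumPy sketch (ours, not part of the original supplement) builds the minimizer V* = U diag(1 − sign(σ)) U^T of Eq (3) for a PSD matrix X and checks that tr(I − V*) recovers rank(X); the function name rank_certificate and the numerical tolerance tol are illustrative assumptions.

```python
import numpy as np

def rank_certificate(X, tol=1e-10):
    """Closed-form minimizer of Eq (3): V* = U diag(1 - sign(sigma)) U^T.

    Minimal sketch for a PSD matrix X; eigenvalues below `tol` are
    treated as zero (a numerical stand-in for the exact sign function).
    """
    sigma, U = np.linalg.eigh(X)              # X = U diag(sigma) U^T
    v = 1.0 - (sigma > tol).astype(float)     # 1 - sign(sigma), since sigma >= 0
    return (U * v) @ U.T                      # U diag(v) U^T

# usage sketch: a random rank-2 PSD matrix
A = np.random.randn(6, 2)
X = A @ A.T
V = rank_certificate(X)
print(np.trace(np.eye(6) - V))                # ~ 2 = rank(X)
print(abs(np.sum(V * X)))                     # ~ 0, i.e. <V, X> = 0
```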

The following lemma shows how to compute the generalized Singular Value Thresholding (SVT) operator involved in the V-subproblem of our PADM algorithm.

Lemma 3. Assume that W has the SVD W = U diag(σ) U^T. The optimal solution of the following optimization problem

  arg min_V (1/2)‖V − W‖_F² + I_Θ(V)  (6)

can be computed as U diag(min(1, max(0, σ))) U^T. Here I_Θ is the indicator function of the convex set Θ ≜ {V | 0 ⪯ V ⪯ I}, with I_Θ(V) = 0 if V ∈ Θ and +∞ otherwise.

Proof. The proof of this lemma is very natural. For completeness, we present it here. For notational convenience, we use σ and z to denote the singular values of W and V, respectively. We derive the following inequalities:

  (1/2)‖V − W‖_F² + I_Θ(V) = (1/2)(‖z‖² + ‖σ‖² − 2⟨V, W⟩) + I_Θ(z)
   ≥ (1/2)(‖z‖² + ‖σ‖² − 2⟨z, σ⟩) + I_Θ(z)
   = (1/2)‖z − σ‖² + I_Θ(z).

By von Neumann's trace inequality, the solution set of Eq (6) must be contained in the set {V | V = U diag(v*) U^T}, where v* is given by

  v* = arg min_z (1/2)‖z − σ‖² + I_Θ(z).  (7)

Since the optimization problem in Eq (7) is decomposable, a simple computation yields the closed-form solution v* = min(1, max(0, σ)). Therefore, V* = U diag(min(1, max(0, σ))) U^T. We thus complete the proof of this lemma.
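As an illustration, the NumPy sketch below (ours, not from the paper) evaluates the operator of Lemma 3 by clipping the eigenvalues of a symmetric W to [0, 1]; the function name proj_box_spectral is an assumption for this example.

```python
import numpy as np

def proj_box_spectral(W):
    """Solve argmin_V 0.5*||V - W||_F^2 + I_Theta(V) from Eq (6).

    Minimal sketch: W is assumed symmetric; its eigenvalues are clipped
    to [0, 1] and the matrix is rebuilt as U diag(min(1, max(0, sigma))) U^T.
    """
    sigma, U = np.linalg.eigh(W)              # W = U diag(sigma) U^T
    v = np.clip(sigma, 0.0, 1.0)              # min(1, max(0, sigma))
    return (U * v) @ U.T

# usage sketch
W = np.random.randn(5, 5)
W = 0.5 * (W + W.T)
V = proj_box_spectral(W)
print(np.linalg.eigvalsh(V))                  # all eigenvalues lie in [0, 1]
```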
2 Convergence Analysis

The global convergence of ADM for convex problems was established by He and Yuan (He and Yuan 2012) under an elegant variational inequality framework. However, since our MPEC optimization problem is non-convex, the convergence analysis of ADM needs additional conditions. In non-convex optimization, convergence to a stationary point (local minimum) is the best convergence property that we can hope for. Under a boundedness condition, we show that the sequence generated by the proximal ADM converges to a KKT point. For ease of discussion, we define

  u ≜ {X, V},  s ≜ {X, V, π},  (8)
  Ω ≜ {X | 0 ⪯ X ⪯ κI},  Θ ≜ {V | 0 ⪯ V ⪯ I}.  (9)

First of all, we present the first-order KKT conditions of the MPEC reformulation optimization problem. Based on the augmented Lagrangian function L of the MPEC reformulation, we naturally derive the following KKT conditions for {X*, V*, π*}:

  0 ∈ ∂I_Ω(X*) + A^T ∇g(A(X*) − b) + π* V*,
  0 ∈ ∂I_Θ(V*) − λI + π* X*,
  0 = ⟨V*, X*⟩,

whose existence can be guaranteed by Robinson's constraint qualification (Rockafellar and Wets 1998).

First, we prove the subgradient lower bound for the iterate gap in the following lemma.

Lemma 4. Assume that π^k is bounded for all k. Then there exists a constant ϖ > 0 such that the following inequality holds:

  dist(0, ∂L(s^{k+1})) ≤ ϖ ‖s^{k+1} − s^k‖.  (10)

Proof. By the optimality conditions of the X-subproblem and V-subproblem, we have:

  0 ∈ D(X^{k+1} − X^k) + A^T ∇g(A(X^{k+1}) − b) + π^k V^k + α⟨V^k, X^{k+1}⟩ V^k + ∂I_Ω(X^{k+1}),  (11)
  0 ∈ E(V^{k+1} − V^k) − λI + π^k X^{k+1} + α⟨V^{k+1}, X^{k+1}⟩ X^{k+1} + ∂I_Θ(V^{k+1}),  (12)

where D and E denote the proximal weighting operators used in the X- and V-subproblems of Algorithm 1. By the definition of L we have that

  ∂_X L(s^{k+1}) = A^T ∇g(A(X^{k+1}) − b) + π^{k+1} V^{k+1} + α⟨V^{k+1}, X^{k+1}⟩ V^{k+1} + ∂I_Ω(X^{k+1})
   ∋ −(π^k + α⟨V^k, X^{k+1}⟩)V^k + (π^{k+1} + α⟨V^{k+1}, X^{k+1}⟩)V^{k+1} − D(X^{k+1} − X^k)
   = −(π^k + α⟨V^{k+1}, X^{k+1}⟩ + α⟨V^k − V^{k+1}, X^{k+1}⟩)V^k + (π^{k+1} + α⟨V^{k+1}, X^{k+1}⟩)V^{k+1} − D(X^{k+1} − X^k)
   = −(π^{k+1} + α⟨V^k − V^{k+1}, X^{k+1}⟩)V^k + (2π^{k+1} − π^k)V^{k+1} − D(X^{k+1} − X^k)
   = −α⟨V^k − V^{k+1}, X^{k+1}⟩ V^k + (π^{k+1} − π^k)V^{k+1} + π^{k+1}(V^{k+1} − V^k) − D(X^{k+1} − X^k).

The first step uses the definition of ∂_X L(s^{k+1}); the second step uses Eq (11); the third step uses V^k = V^{k+1} + (V^k − V^{k+1}); the fourth step uses the multiplier update rule for π, i.e. π^{k+1} = π^k + α⟨V^{k+1}, X^{k+1}⟩; the last step rearranges terms. Assume that π^{k+1} is bounded by a constant ρ, i.e. |π^{k+1}| ≤ ρ. We then have:

  dist(0, ∂_X L(s^{k+1})) ≤ α|⟨V^k − V^{k+1}, X^{k+1}⟩| ‖V^k‖_F + |π^{k+1} − π^k| ‖V^{k+1}‖_F + |π^{k+1}| ‖V^{k+1} − V^k‖_F + ‖D(X^{k+1} − X^k)‖_F
   ≤ 2nκα ‖V^k − V^{k+1}‖_F + √n |π^{k+1} − π^k| + ρ ‖V^k − V^{k+1}‖_F + ‖D‖ ‖X^{k+1} − X^k‖_F.  (13)

Similarly, we have

  ∂_V L(s^{k+1}) = ∂I_Θ(V^{k+1}) − λI + π^{k+1} X^{k+1} + α⟨V^{k+1}, X^{k+1}⟩ X^{k+1} ∋ (π^{k+1} − π^k)X^{k+1} − E(V^{k+1} − V^k),
  ∇_π L(s^{k+1}) = ⟨V^{k+1}, X^{k+1}⟩ = (1/α)(π^{k+1} − π^k).

Then we derive the following inequalities:

  dist(0, ∂_V L(s^{k+1})) ≤ √n κ |π^k − π^{k+1}| + ‖E‖ ‖V^{k+1} − V^k‖_F,  (14)
  |∇_π L(s^{k+1})| ≤ (1/α) |π^{k+1} − π^k|.  (15)

Combining Eqs (13)-(15), we conclude that there exists ϖ > 0 such that the inequality dist(0, ∂L(s^{k+1})) ≤ ϖ ‖s^{k+1} − s^k‖ holds. Thus, we complete the proof of this lemma.

The following lemma is useful in our convergence analysis.

Lemma 5. Assume that π^k is bounded for all k. Then we have the following inequality:

  Σ_{k=0}^{∞} ‖s^k − s^{k+1}‖² < +∞.  (16)

In particular, the sequence {s^k − s^{k+1}} is asymptotically regular, namely s^k − s^{k+1} → 0 as k → ∞. Moreover, any cluster point of {s^k} is a stationary point of L.

Proof. Due to the initialization and the update rule of π, we conclude that π^k is nonnegative and monotonically non-decreasing. Moreover, as k → ∞, we have ⟨X^{k+1}, V^{k+1}⟩ → 0. This can be proved by contradiction: suppose ⟨X^{k+1}, V^{k+1}⟩ does not tend to 0; then π^k → +∞ as k → ∞, which contradicts our assumption that π^k is bounded. Therefore, we conclude that as k → +∞ it holds that

  ⟨X^{k+1}, V^{k+1}⟩ → 0,  Σ_i |π^{i+1} − π^i| < +∞,  Σ_i |π^{i+1} − π^i|² < +∞.  (17)

On the other hand, we naturally derive the following inequalities:

  L(X^{k+1}, V^{k+1}; π^{k+1}) = L(X^{k+1}, V^{k+1}; π^k) + (π^{k+1} − π^k)⟨V^{k+1}, X^{k+1}⟩
   = L(X^{k+1}, V^{k+1}; π^k) + (1/α)|π^{k+1} − π^k|²
   ≤ L(X^{k+1}, V^k; π^k) − (μ/2)‖V^{k+1} − V^k‖² + (1/α)|π^{k+1} − π^k|²
   ≤ L(X^k, V^k; π^k) − (μ/2)‖X^{k+1} − X^k‖² − (μ/2)‖V^{k+1} − V^k‖² + (1/α)|π^{k+1} − π^k|².  (18)

The first step uses the definition of L; the second step uses the update rule of the Lagrange multiplier π; the third and fourth steps use the μ-strong convexity of the proximal subproblems with respect to V and X, respectively. We define C ≜ L(X^1, V^1; π^1) − L(X^{k+1}, V^{k+1}; π^{k+1}) + (1/α) Σ_{i=1}^{k} |π^{i+1} − π^i|². Clearly, by the boundedness of X^k, V^k and π^k, both L(X^{k+1}, V^{k+1}; π^{k+1}) and C are bounded. Summing Eq (18) over i = 1, 2, ..., k, we have:

  (μ/2) Σ_{i=1}^{k} ‖u^{i+1} − u^i‖² ≤ C.  (19)

Therefore, combining Eq (17) and Eq (19), we have Σ_{k=1}^{+∞} ‖s^{k+1} − s^k‖² < +∞; in particular, s^{k+1} − s^k → 0. By Eq (10), we have that

  dist(0, ∂L(s^{k+1})) ≤ ϖ ‖s^{k+1} − s^k‖ → 0,  (20)

which implies that any cluster point of {s^k} is a stationary point of L. We complete the proof of this lemma.

Remarks. Lemma 5 states that any cluster point is a KKT point. Strictly speaking, this result does not imply convergence of the algorithm, because the boundedness of Σ_{k=0}^{∞} ‖s^k − s^{k+1}‖² does not imply that the sequence {s^k} is convergent³. In what follows, we aim to prove a stronger result in Theorem 1. Our analysis is mainly based on a recent non-convex analysis tool, the Kurdyka-Łojasiewicz (KL) inequality (Attouch et al. 2010; Bolte, Sabach, and Teboulle 2014). One key condition of our proof requires that the Lagrangian function L(s) satisfies the KL property in its effective domain. It is known that semi-algebraic functions satisfy the Kurdyka-Łojasiewicz property⁴. It is not hard to verify that the Lagrangian function L(s) is a semi-algebraic function. This is not surprising, since semi-algebraic functions are ubiquitous in applications. Interested readers can refer to (Xu and Yin 2013) for more details. We now present the following proposition, established in (Attouch et al. 2010).

Proposition 1. For a given semi-algebraic function L(s) and any s ∈ dom L, there exist θ ∈ [0, 1), η ∈ (0, +∞], a neighborhood S of s, and a concave and continuous function φ(t) = c·t^{1−θ}, t ∈ [0, η), such that for all s̄ ∈ S satisfying L(s̄) ∈ (L(s), L(s) + η), the following inequality holds:

  dist(0, ∂L(s̄)) · φ′(L(s̄) − L(s)) ≥ 1,

where dist(0, ∂L(s̄)) ≜ min{‖u‖ : u ∈ ∂L(s̄)}.

³One typical counterexample is s^k = Σ_{i=1}^{k} 1/i.
Clearly, Σ_{k=1}^{∞} ‖s^k − s^{k+1}‖² = Σ_{k=1}^{∞} (1/(k+1))² ≤ Σ_{k=1}^{∞} (1/k)² = π²/6 is bounded; however, {s^k} is divergent since s^k ≈ ln(k) + C_e, where C_e is the well-known Euler constant.
⁴Note that semi-algebraic functions include (i) real polynomial functions, (ii) finite sums and products of semi-algebraic functions, and (iii) indicator functions of semi-algebraic sets. Using these rules repeatedly, the graph of L(s), i.e. {(s, t) | t = L(s)}, can be shown to be a semi-algebraic set. Therefore, L(s) is a semi-algebraic function.
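As a quick numerical illustration of footnote 3 (ours, not part of the original supplement), the snippet below confirms that the increments of the harmonic partial sums are square-summable while the sequence itself grows like ln k:

```python
import numpy as np

# Footnote 3, numerically: increments 1/k are square-summable (sum ~ pi^2/6),
# yet the partial sums s^k = sum_{i<=k} 1/i diverge like ln(k) + C_e.
K = 10**6
increments = 1.0 / np.arange(1, K + 1)
s = np.cumsum(increments)
print(np.sum(increments ** 2))                 # ~ 1.6449 ~ pi^2 / 6
print(s[-1] - (np.log(K) + 0.5772156649))      # ~ 0, i.e. s^K ~ ln(K) + C_e
```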

The following theorem establishes the convergence properties of the proposed algorithm under a boundedness condition.

Theorem 1. Assume that π^k is bounded for all k. Then we have the following inequality:

  Σ_{k=0}^{+∞} ‖s^k − s^{k+1}‖ < +∞.  (21)

Moreover, as k → ∞, Algorithm 1 converges to a first-order KKT point of the MPEC reformulation optimization problem.

Proof. For simplicity, we define R^k ≜ φ(L(s^k) − L(s*)) − φ(L(s^{k+1}) − L(s*)), where s* is a cluster point of {s^k}. We naturally derive the following inequalities:

  (μ/2)‖u^{k+1} − u^k‖² − (1/α)|π^{k+1} − π^k|²
  ≤ L(s^k) − L(s^{k+1})
  = (L(s^k) − L(s*)) − (L(s^{k+1}) − L(s*))
  ≤ R^k / φ′(L(s^k) − L(s*))
  ≤ R^k · dist(0, ∂L(s^k))
  ≤ R^k · ϖ (‖X^k − X^{k−1}‖ + ‖V^k − V^{k−1}‖ + |π^k − π^{k−1}|)
  ≤ R^k · ϖ (√2 ‖u^k − u^{k−1}‖ + |π^k − π^{k−1}|)
  ≤ (1/2)(R^k ϖ)² + (1/2)(√2 ‖u^k − u^{k−1}‖ + |π^k − π^{k−1}|)².

The first step uses Eq (18); the third step uses the concavity of φ, i.e. φ(a) − φ(b) ≥ φ′(a)(a − b) for all a, b ∈ R; the fourth step uses the KL property dist(0, ∂L(s^k)) · φ′(L(s^k) − L(s*)) ≥ 1 from Proposition 1; the fifth step uses Eq (10); the sixth step uses the inequality ‖x‖ + ‖y‖ ≤ √2 ‖[x; y]‖, where [x; y] denotes row-wise concatenation as in Matlab; the last step uses the fact that 2ab ≤ a² + b² for all a, b ∈ R.

As k → ∞, the terms involving |π^{k+1} − π^k| and |π^k − π^{k−1}| vanish since they are summable by Eq (17), and the inequality above reduces to

  (μ/2)‖u^{k+1} − u^k‖² ≤ (1/2)(R^k ϖ)² + ‖u^k − u^{k−1}‖².

Taking the square root of both sides and using the inequality √(a + b) ≤ √a + √b for all a, b ∈ R₊, we have

  ‖u^{k+1} − u^k‖ ≤ c₁ ϖ R^k + c₂ ‖u^k − u^{k−1}‖,

where c₁, c₂ > 0 are constants and c₂ < 1 for a sufficiently large proximal parameter μ. Rearranging and summing the inequality above over i = 1, 2, ..., k, we have:

  Σ_{i=1}^{k} ‖u^{i+1} − u^i‖ ≤ Z + c₃ (‖u^1 − u^0‖ + ‖u^{k+1} − u^k‖),

where Z = c₄ Σ_{i=1}^{k} R^i = c₄ (φ(L(s^1) − L(s*)) − φ(L(s^{k+1}) − L(s*))) is a bounded real number and c₃, c₄ > 0 are constants. Therefore, letting k → ∞, we obtain:

  Σ_{i=1}^{∞} ‖u^{i+1} − u^i‖ < +∞.  (22)

Combining Eq (22) with the summability of |π^{i+1} − π^i| in Eq (17) yields Eq (21), which implies that {s^k} is a Cauchy sequence and therefore convergent. Moreover, by Eq (10) in Lemma 4 we have dist(0, ∂L(s^{k+1})) → 0. In other words, in the limit we have

  0 ∈ ∂I_Ω(X^{k+1}) + A^T ∇g(A(X^{k+1}) − b) + π^{k+1} V^{k+1},
  0 ∈ ∂I_Θ(V^{k+1}) − λI + π^{k+1} X^{k+1},
  0 = ⟨V^{k+1}, X^{k+1}⟩,

which implies that {X^{k+1}, V^{k+1}, π^{k+1}} converges to a first-order KKT point.

Algorithm 2: Classical Alternating Direction Method for Solving the Convex X-Subproblem in Eq (25).
(S.0) Initialize X^0 = Y^0 = 0 ∈ R^{n×n}, y^0 = 0 ∈ R^m, z^0 = 0 ∈ R^m, Z^0 = 0 ∈ R^{n×n}. Set t = 0.
(S.1) Solve the following X-subproblem:
  X^{t+1} = arg min_X J(X, y^t, Y^t; Z^t, z^t).  (23)
(S.2) Solve the following (y, Y)-subproblem:
  (y^{t+1}, Y^{t+1}) = arg min_{y, 0 ⪯ Y ⪯ κI} J(X^{t+1}, y, Y; Z^t, z^t).  (24)
(S.3) Update the Lagrange multipliers via the following formulae:
  Z^{t+1} = Z^t + β(X^{t+1} − Y^{t+1}),
  z^{t+1} = z^t + γ(A(X^{t+1}) − b − y^{t+1}).
(S.4) Set t := t + 1 and go to Step (S.1).

3 Solving the Convex Subproblem

The efficiency of the proximal ADM in Algorithm 1 relies on whether the convex subproblem can be solved efficiently. In this section, we aim to solve the following semidefinite optimization subproblem involved in the proposed PADM algorithm:

  min_{0 ⪯ X ⪯ κI} g(A(X) − b) + (α/2)‖B(X)‖_F² + (μ/2)‖X‖_F² + ⟨X, C⟩.  (25)

Our solution is naturally based on the classical ADM (He and Yuan 2012; Lin, Liu, and Su 2011). For completeness, we include the algorithm details here.

First, we introduce two auxiliary variables y ∈ R^m and Y ∈ R^{n×n} to reformulate Eq (25) as:

  min_{y, X, Y} g(y) + (α/2)‖B(X)‖_F² + (μ/2)‖X‖_F² + ⟨X, C⟩,
  s.t. A(X) − b = y,  X = Y,  0 ⪯ Y ⪯ κI.  (26)

Let J_{β,γ}: R^m × R^{n×n} × R^{n×n} × R^m × R^{n×n} → R be the augmented Lagrangian function of Eq (26):

  J(y, X, Y; z, Z) = g(y) + (α/2)‖B(X)‖_F² + (μ/2)‖X‖_F² + ⟨X, C⟩ + ⟨z, A(X) − b − y⟩ + (γ/2)‖A(X) − b − y‖² + ⟨Z, X − Y⟩ + (β/2)‖X − Y‖_F²,  s.t. 0 ⪯ Y ⪯ κI,

where z and Z are the Lagrange multipliers associated with the constraints A(X) − b − y = 0 and X − Y = 0, respectively, and γ, β > 0 are the penalty parameters. The detailed iteration steps of the classical ADM for Eq (26) are described in Algorithm 2. Next, we focus our attention on the solutions of subproblems (23) and (24) arising in Algorithm 2.

(i) X-subproblem. The first-order optimality condition for the variable X is:

  αB*(B(X^{t+1})) + γA*(A(X^{t+1})) + (μ + β)X^{t+1} = E,

where E = βY^t − Z^t + γA*(b + y^t) − A*(z^t) − C, and A*, B* denote the adjoint operators of A and B. Solving this linear system gives:

  X^{t+1} = (αB*B + γA*A + (μ + β)I)^{−1} E.  (27)

When the dimension of the solution is high, solving this linear system may dominate the computation time. However, one can use an iterative conjugate gradient method to alleviate this computational burden. We remark that it is also possible to utilize linearized ADM to address this issue (He and Yuan 2012; Lin, Liu, and Su 2011).

(ii) (y, Y)-subproblem. The variable y in Eq (24) is updated by solving the following problem:

  y^{t+1} = arg min_{y ∈ R^m} g(y) + (γ/2)‖q − y‖²,  with q = A(X^{t+1}) − b + z^t/γ.

This reduces to the Moreau proximal operator prox_g(·), which can be evaluated efficiently by our assumption in the introduction section. The variable Y in Eq (24) is updated by solving the following problem:

  Y^{t+1} = arg min_{0 ⪯ Y ⪯ κI} (β/2)‖Y − S‖_F²,  with S = X^{t+1} + Z^t/β.

Assume that S has the spectral decomposition S = V diag(s) V^T. A simple computation yields the closed-form solution Y^{t+1} = V diag(max(0, min(κ1, s))) V^T.

The exposition above shows that the computation required in each iteration of Algorithm 2 is modest. Classical ADM has excellent convergence behavior, both in theory and in practice, for convex problems. The convergence of Algorithm 2 follows from the Fejér monotonicity of the iterate sequence {y^t, X^t, Y^t, z^t, Z^t}, which holds due to convexity. For the proof of convergence of Algorithm 2, interested readers can refer to (He and Yuan 2012) for more details.
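To make the sub-steps of Algorithm 2 concrete, here is a minimal NumPy sketch of one inner-ADM iteration (ours, not from the paper). It assumes X is handled in vectorized form, the linear maps A and B are given as dense matrices acting on vec(X), and g(y) = 0.5‖y‖² so that the y-update has a simple closed form; the function name adm_step and these simplifications are illustrative assumptions only.

```python
import numpy as np

def adm_step(x, y, Y, z, Z, A, B, b, C, alpha, mu, beta, gamma, kappa, n):
    """One sketch iteration of Algorithm 2 under the assumptions above."""
    # (S.1) X-subproblem, Eq (27): (alpha B^T B + gamma A^T A + (mu+beta) I) x = e
    e = beta * Y.ravel() - Z.ravel() + gamma * A.T @ (b + y) - A.T @ z - C.ravel()
    H = alpha * B.T @ B + gamma * A.T @ A + (mu + beta) * np.eye(n * n)
    x = np.linalg.solve(H, e)
    X = x.reshape(n, n)

    # (S.2a) y-update: argmin_y g(y) + (gamma/2)||q - y||^2, q = A(X) - b + z/gamma
    q = A @ x - b + z / gamma
    y = gamma * q / (1.0 + gamma)             # closed form for g(y) = 0.5*||y||^2

    # (S.2b) Y-update: project S = X + Z/beta spectrally onto {0 <= Y <= kappa*I}
    S = X + Z / beta
    S = 0.5 * (S + S.T)                       # symmetrize for numerical safety
    s_eig, V = np.linalg.eigh(S)
    Y = (V * np.clip(s_eig, 0.0, kappa)) @ V.T

    # (S.3) multiplier updates
    Z = Z + beta * (X - Y)
    z = z + gamma * (A @ x - b - y)
    return x, y, Y, z, Z

# usage sketch on tiny random data
n, m = 4, 3
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n * n))
B = rng.standard_normal((1, n * n))
b = rng.standard_normal(m)
C = np.zeros((n, n))
x, y = np.zeros(n * n), np.zeros(m)
Y, Z, z = np.zeros((n, n)), np.zeros((n, n)), np.zeros(m)
for _ in range(50):
    x, y, Y, z, Z = adm_step(x, y, Y, z, Z, A, B, b, C,
                             alpha=1.0, mu=1.0, beta=1.0, gamma=1.0, kappa=1.0, n=n)
```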
4 Additional Experimental Results

In this section, we present some additional experimental results to demonstrate the superiority of our proposed proximal ADM algorithm. Due to page limitations, we were not able to include these results in the submission. We extend our method to the sensor network localization problem in the presence of Laplace noise and uniform noise. We show our results for Laplace noise in Figures 1-6 and for uniform noise in Figures 7-12; these experimental results strengthen the conclusions drawn in our submission.

References

[Attouch et al. 2010] Attouch, H.; Bolte, J.; Redont, P.; and Soubeyran, A. 2010. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality. Mathematics of Operations Research (MOR) 35(2).
[Bolte, Sabach, and Teboulle 2014] Bolte, J.; Sabach, S.; and Teboulle, M. 2014. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming 146(1-2).
[Fazel, Hindi, and Boyd 2003] Fazel, M.; Hindi, H.; and Boyd, S. P. 2003. Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices. In Proceedings of the American Control Conference, volume 3.
[He and Yuan 2012] He, B., and Yuan, X. 2012. On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method. SIAM Journal on Numerical Analysis (SINUM) 50(2).
[Lin, Liu, and Su 2011] Lin, Z.; Liu, R.; and Su, Z. 2011. Linearized alternating direction method with adaptive penalty for low-rank representation. In NIPS.
[Rockafellar and Wets 1998] Rockafellar, R. T., and Wets, R. J.-B. 1998. Variational Analysis. Springer, Berlin, Heidelberg, New York.
[Xu and Yin 2013] Xu, Y., and Yin, W. 2013. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences (SIIMS) 6(3).

[Figures 1-6: plots omitted; captions below.]
Figure 1: Asymptotic behavior on the minimum-rank sensor network localization problem in the presence of Laplace noise. We plot the two monitored quantities (blue and red) against the number of iterations, as well as how the sensors have been located at different stages of the process.
Figure 2: Performance comparison on 2d data in the presence of Laplace noise (panels: FM, TAM, LPAM, LDHM, PDA, PADM).
Figure 3: Performance comparison on 3d data in the presence of Laplace noise (panels: FM, TAM, LPAM, LDHM, PDA, PADM).
Figure 4: Performance comparison with a varying number of sensors in the presence of Laplace noise (2d, 3d, and 7d data).
Figure 5: Performance comparison with varying s in the presence of Laplace noise (2d, 3d, and 7d data).
Figure 6: Performance comparison with varying r in the presence of Laplace noise (2d, 3d, and 7d data).

[Figures 7-12: plots omitted; captions below.]
Figure 7: Asymptotic behavior on the minimum-rank sensor network localization problem in the presence of uniform noise. We plot the two monitored quantities (blue and red) against the number of iterations, as well as how the sensors have been located at different stages of the process.
Figure 8: Performance comparison on 2d data in the presence of uniform noise (panels: FM, TAM, LPAM, LDHM, PDA, PADM).
Figure 9: Performance comparison on 3d data in the presence of uniform noise (panels: FM, TAM, LPAM, LDHM, PDA, PADM).
Figure 10: Performance comparison with a varying number of sensors in the presence of uniform noise (2d, 3d, and 7d data).
Figure 11: Performance comparison with varying s in the presence of uniform noise (2d, 3d, and 7d data).
Figure 12: Performance comparison with varying r in the presence of uniform noise (2d, 3d, and 7d data).
