A Feasible Method for Optimization with Orthogonality Constraints
1 A Feasible Method for Optimization with Orthogonality Constraints. Zaiwen Wen (Shanghai Jiaotong University, Math Dept) and Wotao Yin (Rice University, CAAM Dept). May 2011.
2 Orthogonality Constraints. Unknowns: q orthogonal matrices X(1), ..., X(q) in R^{n x p}. Optimization problem: min_X F(X), subject to X(i)^T X(i) = I, i = 1, ..., q. Each X(i) is constrained to the Stiefel manifold.
3 The Vector Case. If p = 1, X(i) reduces to x(i) in R^n, and X(i)^T X(i) = I becomes ||x(i)||_2 = 1, i.e., x(i) in S^{n-1}. Example: n = 2, q = 2; the feasible set is homeomorphic to a torus. (Figure by Dave Burke.)
4 Why are orthogonality/normalization constraints interesting? Wide applications; non-convex in general, possibly causing local minima; expensive to stay feasible in the matrix case. Applications arise with 1. Direction fields: ||x_ij||_2 = 1. 2. Homogeneous objectives and normalization: ||x||_2 = 1, ||X||_F = 1. 3. Eigenvalue problems: X^T X = I. 4. Certain SDPs (Burer et al.): Z = X^T X positive semidefinite, Z_ii = ||x_i||^2 = 1. 5. Certain combinatorial problems: a square X is a permutation matrix iff X^T X = I and X >= 0.
5 The Method
6 Stay Feasible vs Infeasible. Infeasible algorithms: penalty and augmented Lagrangian methods; SDP relaxations / lifting. Feasible algorithms: iterative projection; descent along geodesics (or retractions). Advantages: work when cost functions are meaningless outside the feasible set; in case of early termination, a feasible solution is returned; fewer parameters (e.g., no penalty parameter). Disadvantage: maintaining feasibility can be expensive.
7 Basic Idea. Consider min F(X), subject to X^T X = I. At iteration k, natural updates are
X_{k+1} <- Orthogonalize(X_k + tau Proj(-grad F_k)),
X_{k+1} <- Orthogonalize(X_k + tau H_k X_k).
Instead, take X_{k+1} <- the solution Y(tau) of Y = X_k + (tau/2) H_k (X_k + Y).
Need H_k such that 1. Y(tau)^T Y(tau) = X_k^T X_k; 2. Y(tau) is a descent curve; 3. solving for Y(tau) is fast.
8-14 Example: S^2 = {x in R^3 : ||x||_2 = 1}. (Slides 8-14 build up a single picture; the figure frames are omitted here.) At a point x on S^2 with gradient grad f(x), project the negative gradient onto the tangent plane:
Proj(-grad f) = -(grad f - x (grad f^T x)) = (x grad f^T - grad f x^T) x = Hx, where H = x grad f^T - grad f x^T.
The solution y of y = x + (1/2) H (x + y) is feasible (||y||_2 = 1), and the solution curve {y(tau)} of y = x + (tau/2) H (x + y) is a geodesic of S^2.
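The sphere computation above can be checked numerically. The following NumPy sketch (variable names are mine, not from the slides) forms H = x grad f^T - grad f x^T at a random unit vector, solves the implicit update, and confirms the two claims: the new point stays on S^2, and the curve starts off along the negative projected gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(3)
x /= np.linalg.norm(x)                      # a point on S^2
g = rng.standard_normal(3)                  # stand-in for grad f(x)
H = np.outer(x, g) - np.outer(g, x)         # H = x g^T - g x^T (skew-symmetric)

tau = 0.3
I = np.eye(3)
# y solves y = x + (tau/2) H (x + y)  <=>  (I - tau/2 H) y = (I + tau/2 H) x
y = np.linalg.solve(I - tau / 2 * H, (I + tau / 2 * H) @ x)

print(np.linalg.norm(y))                    # norm is 1 up to rounding
# tangent at tau = 0 is Hx = -(g - x (g^T x)), the negative projected gradient
print(np.allclose(H @ x, -(g - x * (g @ x))))   # True
```

The first identity holds for any tau, which is the point of the construction: feasibility costs one small linear solve, not a projection.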
15 Relation to Crank-Nicolson. Discretize the 1D PDE u_t = F(u, x, t, u_x, u_xx):
(forward Euler)  (u_j^{n+1} - u_j^n) / Delta t = F_j^n(...);
(backward Euler) (u_j^{n+1} - u_j^n) / Delta t = F_j^{n+1}(...);
(Crank-Nicolson) (u_j^{n+1} - u_j^n) / Delta t = (1/2) (F_j^{n+1}(...) + F_j^n(...)).
Crank-Nicolson is the average of forward Euler and backward Euler.
16 The Matrix Update. Given X_k and G_k = grad F(X_k), let H = X_k G_k^T - G_k X_k^T and solve Y = X_k + (tau/2) H (X_k + Y) for Y(tau). Update
X_{k+1} <- Y(tau) = (I - (tau/2) H)^{-1} (I + (tau/2) H) X_k,
where tau is a step size. Properties of Y(tau): 1. Y(tau)^T Y(tau) = X_k^T X_k; 2. Y(tau) is a descent curve; 3. Y(tau) has fast implementations.
17 Skew-Symmetric H Preserves X^T X. Theorem (Cayley's transform). If H^T = -H, then 1. (I - H) is nonsingular; 2. Y = (I - H)^{-1} (I + H) X satisfies Y^T Y = X^T X. Proof. 1. x^T (I - H) x = x^T x for any vector x, so (I - H) x = 0 forces x = 0. 2. Since H^T = -H,
Y^T Y = X^T (I + H)^T (I - H)^{-T} (I - H)^{-1} (I + H) X
      = X^T (I - H) (I + H)^{-1} (I - H)^{-1} (I + H) X
      = X^T (I - H) (I - H)^{-1} (I + H)^{-1} (I + H) X   (the factors commute)
      = X^T X.
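A quick numerical illustration of the theorem (a sketch under the stated assumptions, not the authors' code): for an arbitrary skew-symmetric H and an arbitrary X that need not be orthonormal, the Cayley transform leaves X^T X unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 2
M = rng.standard_normal((n, n))
H = M - M.T                                  # arbitrary skew-symmetric: H^T = -H
X = rng.standard_normal((n, p))              # X need not satisfy X^T X = I
I = np.eye(n)

Y = np.linalg.solve(I - H, (I + H) @ X)      # Y = (I - H)^{-1} (I + H) X
print(np.allclose(Y.T @ Y, X.T @ X))         # True: X^T X is preserved
```

Nonsingularity of (I - H) is automatic here: the eigenvalues of a skew-symmetric H are purely imaginary, so those of (I - H) have real part 1.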
18 Generalization. For M positive semidefinite and the complex-valued constraint X^* M X = C, the update Y = (I - HM)^{-1} (I + HM) X preserves Y^* M Y = X^* M X, provided that H is a skew-Hermitian matrix.
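The generalization can be checked the same way. This sketch (my construction, not from the slides) draws a positive definite M and a complex skew-Hermitian H, and verifies that Y = (I - HM)^{-1} (I + HM) X preserves X^* M X.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 5, 2
B = rng.standard_normal((n, n))
M = B @ B.T + n * np.eye(n)                           # M positive definite
K = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
H = K - K.conj().T                                    # skew-Hermitian: H^* = -H
X = rng.standard_normal((n, p)) + 1j * rng.standard_normal((n, p))
I = np.eye(n)

Y = np.linalg.solve(I - H @ M, (I + H @ M) @ X)       # Y = (I - HM)^{-1} (I + HM) X
print(np.allclose(Y.conj().T @ M @ Y, X.conj().T @ M @ X))  # True
```

(I - HM) is nonsingular because HM is similar to the skew-Hermitian matrix M^{1/2} H M^{1/2}, so its eigenvalues are purely imaginary.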
19 We now know that X^T X is preserved by X_{k+1} <- Y(tau) = (I - (tau/2) H)^{-1} (I + (tau/2) H) X_k. This is Cayley's transform, which has been used in matrix computations such as inverse eigenvalue problems [Friedland-Nocedal-Overton 87]. It is also the Crank-Nicolson scheme for heat PDEs, etc. It appears that it has not been systematically studied for minimizing a general differentiable F(X) subject to X^T X = I.
20 Properties of Y(tau): 1. Y(tau)^T Y(tau) = X_k^T X_k; 2. Y(tau) is a descent curve; 3. Y(tau) has fast implementations.
21 Y(tau) is a descent curve since Y'(0) is the negative projected gradient:
Y'(0) = H X_k = -Proj_c(grad F_k; X_k).
Proj_c is the projection onto the tangent space of {X : X^T X = I} under the canonical metric <Y, Z>_c = tr(Y^T (I - (1/2) X X^T) Z), Y, Z in T_X. For projection under the Euclidean metric <Y, Z>_e = tr(Y^T Z), use
H = X_k G_k^T (I - (1/2) X_k X_k^T) - (I - (1/2) X_k X_k^T) G_k X_k^T.
Generally, any tangent direction D in T_X (from CG, Newton, etc.) with
H = (I - (1/2) X X^T) D X^T - X D^T (I - (1/2) X X^T)
leads to Y'(0) = D.
22 Properties of Y(tau): 1. Y(tau)^T Y(tau) = X_k^T X_k; 2. Y(tau) is a descent curve; 3. Y(tau) has fast implementations.
23 Comparing with Descent along Geodesics. Moving along geodesics is a sensible choice for optimization on manifolds (cf. Smith 93; Mahony 94; Edelman-Arias-Smith 98), but geodesics of {X : X^T X = I} are difficult to compute. Y(tau) = (I - (tau/2) H)^{-1} (I + (tau/2) H) X is not a geodesic. It is a retraction (an "approximate geodesic") of the Stiefel manifold; see the book by Absil, Mahony, and Sepulchre.
24 Computing Cost. Y(tau) = (I - (tau/2) H)^{-1} (I + (tau/2) H) X.
p = 1: Y(tau) is closed-form, a linear combination of x and grad F;
small p: apply the SMW formula and solve a smaller (2p x 2p) system;
large p: solve an n x n system, or use low-rank approximations.
Cheaper than: geodesic steps (approximating matrix exponentials or solving 2nd-order ODEs); gradient projection (matrix orthogonalization via an n x p SVD, e.g., Manton 02); gradient + orthogonalization (QR decomposition).
25 In short, we propose a curvilinear descent path that maintains feasibility and is easy to compute. The path Y(tau) is neither straight nor a geodesic, and tau is not linear in the arc length of Y(tau).
26 Convergence Properties. Theorem. Consider min F(X), s.t. X^T X = I. X is a stationary point if and only if X^T X = I and HX = 0. If HX != 0, then Y(tau) is a descent curve. We also proved: X_k globally converges to a stationary point; near a stationary point, tau is bounded below; HX = 0 if and only if H = 0. As X_k approaches a stationary point, (I - (tau/2) H) -> I, so (I - (tau/2) H) is well-behaved.
27 Full Algorithm and Numerical Results
28 Algorithm:
1. Input: feasible X_0
2. k <- 0
3. While stopping conditions are not met do
4.   H <- X_k (grad F_k)^T - (grad F_k) X_k^T
5.   Compute an initial tau by Barzilai-Borwein
6.   tau <- non-monotone line search
7.   X_{k+1} <- Y(tau) = (I - (tau/2) H)^{-1} (I + (tau/2) H) X_k
8.   k <- k + 1
Code was written in MATLAB.
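A minimal Python sketch of the loop (fixed step size only; the actual MATLAB code adds Barzilai-Borwein steps and the non-monotone line search, and `feasible_descent` is my name, not the authors'), applied to the eigenvalue-type problem min -tr(X^T A X):

```python
import numpy as np

def feasible_descent(grad, X0, tau=0.1, iters=1500):
    """Fixed-step sketch of steps 4 and 7 of the algorithm."""
    X = X0.copy()
    I = np.eye(X.shape[0])
    for _ in range(iters):
        G = grad(X)
        H = X @ G.T - G @ X.T                            # step 4
        X = np.linalg.solve(I - tau / 2 * H, (I + tau / 2 * H) @ X)  # step 7
    return X

# Demo: min -tr(X^T A X) s.t. X^T X = I  (top-p eigenspace of A)
rng = np.random.default_rng(5)
n, p = 30, 4
S = rng.standard_normal((n, n))
A = (S + S.T) / 2
A /= np.linalg.norm(A, 2)                   # normalize so a fixed step is safe
X0, _ = np.linalg.qr(rng.standard_normal((n, p)))
X = feasible_descent(lambda X: -2 * A @ X, X0)

print(np.allclose(X.T @ X, np.eye(p)))      # True: feasible at every iterate
obj0, obj = np.trace(X0.T @ A @ X0), np.trace(X.T @ A @ X)
print(float(obj0), float(obj))              # objective should improve
```

Every iterate is feasible regardless of tau; only progress, not feasibility, depends on the step-size rules that this sketch omits.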
29 Three types of problems: 1. Guaranteed global minimum; 2. Numerical global minimum, but no proof yet (or were we just lucky?); 3. Local (non-global) minima: we restart the algorithm from random points and pick the best solution.
30 Max-Cut SDP. SDP: max_X tr(CX), s.t. X_ii = 1, i = 1, ..., n, X positive semidefinite. NLP: write X = V^T V, where V = [v_1, ..., v_n] in R^{p x n}:
max_{V in R^{p x n}} sum_{i,j} c_ij v_i^T v_j, s.t. ||v_i|| = 1, i = 1, ..., n.
If p >= rank(X*), the NLP is equivalent to the SDP.
Theorem (a slight extension of Burer-Monteiro 05). There exists p(C) <= n such that, if p >= p(C), any local minimizer of the NLP is globally optimal. A simple upper bound: p(C) <= n + 1 - inf{rank(C + D) : D is diagonal}.
31 Max-Cut SDP: comparison with SDPLR for max-cut (C code by S. Burer and R. Monteiro) on torus and G-set instances. [Table of Name, n, obj, CPU for SDPLR and obj, CPU, nfe, feasibility for our code; the numeric entries were lost in transcription. The feasibility column for our code was on the order of 1e-14 to 1e-15.]
32 Low-Rank Nearest Correlation Matrix Estimation. Given a weight matrix W and a symmetric matrix C:
min_X ||W o (X - C)||_F^2, X_ii = 1, i = 1, ..., n, rank(X) <= r,
where o denotes the entrywise product. Low-rank factorization X = V^T V, V in R^{r x n}:
min_{V in R^{r x n}} (1/2) ||W o (V^T V - C)||_F^2, s.t. ||V_i||_2 = 1, i = 1, ..., n.
Test examples: Ex1: n = 500, G_ij = exp(-0.05 |i - j|), W = I. Ex3: n = 500, same G, random W. Ex5: n = 943, G is based on the MovieLens data sets; W is given by T. Fushiki.
33 Low-rank nearest correlation problem: comparison of Major (Pietersz-Groenen), PenCorr (Y. Gao-D. Sun), and our code on Ex1, Ex3, Ex5 for several ranks r. [Table of r, obj, CPU, feasibility per solver; the numeric entries were lost in transcription. The feasibility column for our code was on the order of 1e-14 to 1e-15.]
34 Linear Eigenvalue Problem. Let lambda_1 >= ... >= lambda_n be the eigenvalues of A:
sum_{i=1}^p lambda_i = max_{X in R^{n x p}} tr(X^T A X), s.t. X^T X = I.
Compared with MATLAB's eigs (which calls the Fortran library ARPACK and incurs overhead) on: dense random matrices; the UF Sparse Matrix Collection: 39 large sparse matrices with n >= 4000.
35 Dense matrices: sum of the 6 largest eigenvalues for increasing n. [Table of obj, feasibility, CPU, nfe, and err for eigs and our code; the numeric entries were lost in transcription. Feasibility was on the order of 1e-15.]
36 Dense matrices: n = 5000 fixed, p varying. [Table of obj, nax/nfe, feasibility, CPU, and err for eigs and our code; the numeric entries were lost in transcription.]
37 Sparse matrices: sum of the 2 largest eigenvalues on UF collection matrices (crystm, ct20stif, msc, pwtk, s3rmt3m, bcsstk, bcsstm, Kuu, Muu, finan512, nd3k, nasa, nasasrb, fv, t2dal, aft, cfd, ...). [Table of n, obj, CPU, nax/nfe, and err; the numeric entries were lost in transcription.]
38 Quadratic Assignment Problem (QAP). Minimizing a quadratic over permutation matrices:
min_{X in R^{n x n}} tr(A^T X B X^T), s.t. X^T X = I, X >= 0.
Tested 134 problems in QAPLIB: rewrite the objective as tr(A^T (X o X) B (X o X)^T), where o is the entrywise product; apply an augmented Lagrangian to X >= 0 with parameter mu; run our code 40 times from random X_0 for each of mu = 0.1, 1, 10; pick the best of the 120 rounded solutions.
39 QAPLIB: exact recovery. gap = (obj - best known upper bound) / (best known upper bound) x 100%; CPU: average CPU time. Instances: chr12b, esc16a-esc16j, esc32b-esc32g, esc64a. [Table of n, obj, mu, min/med/max gap %, CPU, feasibility; the numeric entries were lost in transcription.]
40 QAPLIB: exact recovery (continued). Instances: had12-had20, lipa20a-lipa90b, nug12-nug16b, tai12a. [Table of n, obj, mu, min/med/max gap %, CPU, feasibility; the numeric entries were lost in transcription.]
41 QAPLIB: n >= 80. Instances: esc128, lipa80a-lipa90b, sko100a-sko100f and other sko instances, tai80-tai256c, tho, wil100. Exact recovery. [Table of n, obj, mu, min/med/max gap %, CPU, feasibility; the numeric entries were lost in transcription.]
42 Polynomial Optimization. Consider minimizing four test polynomials subject to ||x||_2 = 1: a quartic sum_{1<=i<j<k<l<=50} c_ijkl x_i x_j x_k x_l; a cubic over 49 variables with terms x_i x_j x_k, x_i^2 x_j, x_i^2 x_k, x_j x_k^2; a sextic sum_{1<=i<=20} x_i^6 + sum_{1<=i<=19} x_i^3 x_{i+1}^3; and a mixed sextic over 20 variables with terms x_i^2 x_j^2 x_k^2, x_i^3 x_j^2 x_k, x_i^2 x_j^3 x_k, x_i x_j^3 x_k^2. [The exact coefficients and the comparison table were garbled in transcription; the SDP relaxation required CPU times on the order of hours (e.g., 10:30:00), while our code returned solutions with feasibility on the order of 1e-15.] Problems are from J. Nie, "Regularization Methods for Sum of Squares Relaxations...". Nie's code solves an SDP relaxation, which has approximation guarantees.
43 Stability Number of a Graph. Given a graph G = (V, E), Motzkin and Straus showed
alpha(G)^{-1} = min sum_{i=1}^n x_i^4 + 2 sum_{(i,j) in E} x_i^2 x_j^2, s.t. ||x||_2 = 1.
Test: random graphs G whose edges e_ij, for pairs {i, j} not contained in M, are generated with probability 1/2, where M is a subset chosen randomly from V with |M| = n/2 (so M is an independent set). [Comparison table of SDP (Nie) CPU and alpha(G) vs our code's CPU and (min, mean, max) values over repetitions; the numeric entries were lost in transcription. The feasibility of our code was on the order of 1e-15.]
44 1-Bit CS. Recover x from min ||x||_1, s.t. Ax >= 0, ||x||_2 = 1. Our approach (joint work with J. Laska, Z. Wen, and R. Baraniuk):
1. Outer: form the augmented Lagrangian subproblem and update mu, B, and b:
min mu ||x||_1 + (1/2) ||Bx - b||_2^2, s.t. ||x||_2 = 1.
2. Inner: solve the above subproblem by iterating
x_{k+1} <- argmin_{||x||_2 = 1} mu tau ||x||_1 + (1/2) ||x - (x_k + tau g_k)||_2^2.
45 Shrinkage on Sphere. Shrinkage (soft-thresholding): shrink(y, mu) = argmin_x mu ||x||_1 + (1/2) ||x - y||_2^2. Shrinkage on the sphere: add the constraint ||x||_2 = 1:
x_opt = shrink(y, mu) / ||shrink(y, mu)||_2  if shrink(y, mu) != 0;  otherwise sign(y_i) e_i,
where y_i is the largest (in magnitude) entry of y. Matrix case:
min_X mu ||X||_* + (1/2) ||X - Y||_F^2, s.t. ||X||_F = 1,
reduces to shrinkage on the sphere over the singular values of Y.
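The closed-form rule can be sanity-checked against a brute-force grid search on S^2. This sketch (function names are mine) implements `shrink` and `shrink_on_sphere` as defined above and confirms that no grid point on the sphere beats the closed-form minimizer.

```python
import numpy as np

def shrink(y, mu):
    """Soft-thresholding: argmin_x mu*||x||_1 + 0.5*||x - y||_2^2."""
    return np.sign(y) * np.maximum(np.abs(y) - mu, 0.0)

def shrink_on_sphere(y, mu):
    """Closed-form minimizer of mu*||x||_1 + 0.5*||x - y||_2^2 over ||x||_2 = 1."""
    s = shrink(y, mu)
    if np.any(s != 0.0):
        return s / np.linalg.norm(s)
    i = np.argmax(np.abs(y))                 # largest-magnitude entry of y
    e = np.zeros_like(y)
    e[i] = 1.0 if y[i] >= 0 else -1.0        # sign(y_i) e_i
    return e

mu = 0.1
y = np.array([0.9, -0.2, 0.05])
x = shrink_on_sphere(y, mu)
obj = lambda v: mu * np.abs(v).sum() + 0.5 * ((v - y) ** 2).sum()

# brute-force comparison over a spherical grid: no grid point should beat x
th, ph = np.meshgrid(np.linspace(0, np.pi, 181), np.linspace(0, 2 * np.pi, 361))
V = np.stack([np.sin(th) * np.cos(ph), np.sin(th) * np.sin(ph), np.cos(th)], axis=-1)
vals = mu * np.abs(V).sum(-1) + 0.5 * ((V - y) ** 2).sum(-1)
print(abs(np.linalg.norm(x) - 1.0) < 1e-12, obj(x) <= vals.min() + 1e-12)
```

The grid only ever overestimates the true minimum, so the closed-form value lower-bounding every grid value is the expected outcome if the formula is correct.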
. ARTICLES. doi: A Note on Semidefinite Programming Relaxations For Polynomial Optimization Over a Single Sphere Jiang Hu,, Bo jiang, Xin Liu 3 & Zaiwen Wen Beijing International Center for Mathematical
More informationThe Q Method for Symmetric Cone Programmin
The Q Method for Symmetric Cone Programming The Q Method for Symmetric Cone Programmin Farid Alizadeh and Yu Xia alizadeh@rutcor.rutgers.edu, xiay@optlab.mcma Large Scale Nonlinear and Semidefinite Progra
More informationOptimisation on Manifolds
Optimisation on Manifolds K. Hüper MPI Tübingen & Univ. Würzburg K. Hüper (MPI Tübingen & Univ. Würzburg) Applications in Computer Vision Grenoble 18/9/08 1 / 29 Contents 2 Examples Essential matrix estimation
More informationECE580 Exam 1 October 4, Please do not write on the back of the exam pages. Extra paper is available from the instructor.
ECE580 Exam 1 October 4, 2012 1 Name: Solution Score: /100 You must show ALL of your work for full credit. This exam is closed-book. Calculators may NOT be used. Please leave fractions as fractions, etc.
More informationSDPNAL+: A Majorized Semismooth Newton-CG Augmented Lagrangian Method for Semidefinite Programming with Nonnegative Constraints
SDPNAL+: A Majorized Semismooth Newton-CG Augmented Lagrangian Method for Semidefinite Programming with Nonnegative Constraints Liuqin Yang, Defeng Sun, Kim-Chuan Toh May 28, 2014; Revised on Mar 10, 2015
More informationSDPNAL+: A Majorized Semismooth Newton-CG Augmented Lagrangian Method for Semidefinite Programming with Nonnegative Constraints
arxiv:1406.0942v1 [math.oc] 4 Jun 2014 SDPNAL+: A Majorized Semismooth Newton-CG Augmented Lagrangian Method for Semidefinite Programming with Nonnegative Constraints Liuqin Yang, Defeng Sun, Kim-Chuan
More informationAn Active Set Strategy for Solving Optimization Problems with up to 200,000,000 Nonlinear Constraints
An Active Set Strategy for Solving Optimization Problems with up to 200,000,000 Nonlinear Constraints Klaus Schittkowski Department of Computer Science, University of Bayreuth 95440 Bayreuth, Germany e-mail:
More informationCHAPTER 11. A Revision. 1. The Computers and Numbers therein
CHAPTER A Revision. The Computers and Numbers therein Traditional computer science begins with a finite alphabet. By stringing elements of the alphabet one after another, one obtains strings. A set of
More informationProximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725
Proximal Gradient Descent and Acceleration Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: subgradient method Consider the problem min f(x) with f convex, and dom(f) = R n. Subgradient method:
More informationLast Time. Social Network Graphs Betweenness. Graph Laplacian. Girvan-Newman Algorithm. Spectral Bisection
Eigenvalue Problems Last Time Social Network Graphs Betweenness Girvan-Newman Algorithm Graph Laplacian Spectral Bisection λ 2, w 2 Today Small deviation into eigenvalue problems Formulation Standard eigenvalue
More informationComputation. For QDA we need to calculate: Lets first consider the case that
Computation For QDA we need to calculate: δ (x) = 1 2 log( Σ ) 1 2 (x µ ) Σ 1 (x µ ) + log(π ) Lets first consider the case that Σ = I,. This is the case where each distribution is spherical, around the
More informationSTAT 309: MATHEMATICAL COMPUTATIONS I FALL 2013 PROBLEM SET 2
STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2013 PROBLEM SET 2 1. You are not allowed to use the svd for this problem, i.e. no arguments should depend on the svd of A or A. Let W be a subspace of C n. The
More informationRobust PCA. CS5240 Theoretical Foundations in Multimedia. Leow Wee Kheng
Robust PCA CS5240 Theoretical Foundations in Multimedia Leow Wee Kheng Department of Computer Science School of Computing National University of Singapore Leow Wee Kheng (NUS) Robust PCA 1 / 52 Previously...
More informationCopositive Programming and Combinatorial Optimization
Copositive Programming and Combinatorial Optimization Franz Rendl http://www.math.uni-klu.ac.at Alpen-Adria-Universität Klagenfurt Austria joint work with M. Bomze (Wien) and F. Jarre (Düsseldorf) and
More informationConvex Optimization. Newton s method. ENSAE: Optimisation 1/44
Convex Optimization Newton s method ENSAE: Optimisation 1/44 Unconstrained minimization minimize f(x) f convex, twice continuously differentiable (hence dom f open) we assume optimal value p = inf x f(x)
More informationLecture 8: February 9
0-725/36-725: Convex Optimiation Spring 205 Lecturer: Ryan Tibshirani Lecture 8: February 9 Scribes: Kartikeya Bhardwaj, Sangwon Hyun, Irina Caan 8 Proximal Gradient Descent In the previous lecture, we
More informationNewton s Method and Linear Approximations
Newton s Method and Linear Approximations Curves are tricky. Lines aren t. Newton s Method and Linear Approximations Newton s Method for finding roots Goal: Where is f (x) = 0? f (x) = x 7 + 3x 3 + 7x
More informationPart 5: Penalty and augmented Lagrangian methods for equality constrained optimization. Nick Gould (RAL)
Part 5: Penalty and augmented Lagrangian methods for equality constrained optimization Nick Gould (RAL) x IR n f(x) subject to c(x) = Part C course on continuoue optimization CONSTRAINED MINIMIZATION x
More informationPreface to Second Edition... vii. Preface to First Edition...
Contents Preface to Second Edition..................................... vii Preface to First Edition....................................... ix Part I Linear Algebra 1 Basic Vector/Matrix Structure and
More informationMinimizing the Difference of L 1 and L 2 Norms with Applications
1/36 Minimizing the Difference of L 1 and L 2 Norms with Department of Mathematical Sciences University of Texas Dallas May 31, 2017 Partially supported by NSF DMS 1522786 2/36 Outline 1 A nonconvex approach:
More informationONP-MF: An Orthogonal Nonnegative Matrix Factorization Algorithm with Application to Clustering
ONP-MF: An Orthogonal Nonnegative Matrix Factorization Algorithm with Application to Clustering Filippo Pompili 1, Nicolas Gillis 2, P.-A. Absil 2,andFrançois Glineur 2,3 1- University of Perugia, Department
More informationOptimization. Yuh-Jye Lee. March 21, Data Science and Machine Intelligence Lab National Chiao Tung University 1 / 29
Optimization Yuh-Jye Lee Data Science and Machine Intelligence Lab National Chiao Tung University March 21, 2017 1 / 29 You Have Learned (Unconstrained) Optimization in Your High School Let f (x) = ax
More informationA direct formulation for sparse PCA using semidefinite programming
A direct formulation for sparse PCA using semidefinite programming A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley A. d Aspremont, INFORMS, Denver,
More informationNon-negative Matrix Factorization via accelerated Projected Gradient Descent
Non-negative Matrix Factorization via accelerated Projected Gradient Descent Andersen Ang Mathématique et recherche opérationnelle UMONS, Belgium Email: manshun.ang@umons.ac.be Homepage: angms.science
More informationNonlinear Optimization for Optimal Control
Nonlinear Optimization for Optimal Control Pieter Abbeel UC Berkeley EECS Many slides and figures adapted from Stephen Boyd [optional] Boyd and Vandenberghe, Convex Optimization, Chapters 9 11 [optional]
More informationNumerical Methods in Matrix Computations
Ake Bjorck Numerical Methods in Matrix Computations Springer Contents 1 Direct Methods for Linear Systems 1 1.1 Elements of Matrix Theory 1 1.1.1 Matrix Algebra 2 1.1.2 Vector Spaces 6 1.1.3 Submatrices
More information