Eduardo Corona
Convex and Nonsmooth Optimization: Assignment Set #5
Spring 2009
Professor: Michael Overton
April 23, 2009

1. Let $p \geq q$, and let $A$ be a full-rank $p \times q$ matrix. Also, let $A = U \Sigma V^\top$ be the SVD factorization of $A$. Finally, we define a $(p+q)\times(p+q)$ symmetric matrix by blocks:
\[
B = \begin{pmatrix} 0 & A^\top \\ A & 0 \end{pmatrix}
  = \begin{pmatrix} V & 0 \\ 0 & U \end{pmatrix}
    \begin{pmatrix} 0 & \Sigma^\top \\ \Sigma & 0 \end{pmatrix}
    \begin{pmatrix} V^\top & 0 \\ 0 & U^\top \end{pmatrix}.
\]
Now, we know that the matrix $\Sigma$ is essentially a $q \times q$ diagonal matrix $\hat\Sigma$ with $p-q$ zero rows attached to it on the bottom, and $\Sigma^\top$ is this same matrix with $p-q$ zero columns attached to its right. Hence, we can rewrite the middle factor as a three-by-three block matrix, and diagonalize it by "rotating" part of the space by 90 degrees:
\[
\begin{pmatrix} 0 & \hat\Sigma & 0 \\ \hat\Sigma & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}
= Q \begin{pmatrix} \hat\Sigma & 0 & 0 \\ 0 & -\hat\Sigma & 0 \\ 0 & 0 & 0 \end{pmatrix} Q^\top,
\qquad
Q = \frac{1}{\sqrt{2}}
\begin{pmatrix} I_q & I_q & 0 \\ I_q & -I_q & 0 \\ 0 & 0 & \sqrt{2}\, I_{p-q} \end{pmatrix}.
\]
It is then clear that $B$ has $2q$ nonzero eigenvalues, and these are $\pm\sigma_i$, with $\sigma_i$ the singular values of $A$ (plus $p-q$ zero eigenvalues). In the case where the rank of $A$ is $r < q$, we can still do this, although now $B$ would have $2r$ nonzero eigenvalues (since some of the $\sigma$s would be $0$). Finally, we observe that in the case $p = q$, $B$ has no zero eigenvalues (it is invertible).
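As a quick numerical sanity check of this eigenvalue structure, the following Matlab sketch (my own illustrative code, not part of the assignment) compares the eigenvalues of $B$ against $\pm\sigma_i$:

    % Sanity check: eig([0 A'; A 0]) should be {+/- sigma_i(A)} plus p-q zeros.
    p = 5; q = 3;
    A = randn(p,q);                          % full rank with probability 1
    B = [zeros(q,q) A'; A zeros(p,p)];       % the (p+q)x(p+q) symmetric matrix
    ev = sort(eig(B));
    sv = svd(A);
    predicted = sort([sv; -sv; zeros(p-q,1)]);
    disp(norm(ev - predicted))               % should be of the order of 1e-15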
2. In the Candès and Recht paper, it is shown that the low-rank matrix completion problem can be cast as a nuclear norm minimization problem, which in turn is an SDP. Here, for $X, M \in \mathbb{R}^{q\times q}$:
\[
\min \; \operatorname{trace}(X) \quad \text{subject to} \quad X_{ij} = M_{ij} \;\; \forall (i,j) \in \Omega, \quad X \succeq 0.
\]
For a general matrix $M \in \mathbb{R}^{p\times q}$, we introduce the auxiliary matrices $W_1 \in \mathbb{R}^{p\times p}$, $W_2 \in \mathbb{R}^{q\times q}$ to obtain an alternative formulation of this as an SDP:
\[
\min \; \tfrac{1}{2}\bigl(\operatorname{trace}(W_1) + \operatorname{trace}(W_2)\bigr)
\quad \text{subject to} \quad X_{ij} = M_{ij} \;\; \forall (i,j) \in \Omega, \quad
\begin{pmatrix} W_1 & X \\ X^\top & W_2 \end{pmatrix} \succeq 0.
\]
Now, given any $X$ satisfying $X_{ij} = M_{ij}$, we set $W_1 = \tau I_p$, $W_2 = \tau I_q$. Then, the semidefinite constraint looks like:
\[
\begin{pmatrix} \tau I_p & X \\ X^\top & \tau I_q \end{pmatrix}
= \tau I + \begin{pmatrix} 0 & X \\ X^\top & 0 \end{pmatrix} \succeq 0.
\]
And from problem 1, making $X = A^\top$, this constraint is equivalent to $\tau I + B \succeq 0$. However, we know that the nonzero eigenvalues of $B$ are $\pm\sigma_i$, where $\sigma_i$ are the singular values of $X$. Thus, this constraint is equivalent to asking $\tau \geq \sigma_i \;\forall i \iff \tau \geq \max_i \{\sigma_i\} = \sigma_1$. Hence, the best upper bound we can come up with for $\tfrac{1}{2}(\operatorname{trace}(W_1)+\operatorname{trace}(W_2))$ this way is $\tfrac{p+q}{2}\,\sigma_1$. If $X = I$, then clearly the pair $W_1 = W_2 = I$ is the optimal solution.

3. Let $y = (\operatorname{vec}(W_1), \operatorname{vec}(W_2), \operatorname{vec}(X))$, with $b_i = -\tfrac{1}{2}$ if $y_i$ is a diagonal entry of $W_1$ or $W_2$, zero otherwise. Also, let $E_{ij}$ denote a matrix with a $1$ in the $(i,j)$-th entry, and zeros everywhere else. This SDP in standard dual form then looks like:
\[
\max \; b^\top y \quad \text{subject to} \quad
\sum_{i,j} w^1_{ij}\begin{pmatrix} -E_{ij} & 0 \\ 0 & 0 \end{pmatrix}
+ \sum_{i,j} w^2_{ij}\begin{pmatrix} 0 & 0 \\ 0 & -E_{ij} \end{pmatrix}
+ \sum_{(i,j)\notin\Omega} x_{ij}\begin{pmatrix} 0 & -E_{ij} \\ -E_{ji} & 0 \end{pmatrix}
+ Z = \begin{pmatrix} 0 & \widetilde M \\ \widetilde M^\top & 0 \end{pmatrix},
\qquad Z \succeq 0,
\]
where $\widetilde M$ is the matrix with the entries of $M$ in $\Omega$ ($0$ otherwise); the slack matrix is then exactly $Z = \begin{pmatrix} W_1 & X \\ X^\top & W_2 \end{pmatrix}$ from problem 2. And of course, we can rewrite this completely in terms of the $y$s, renaming $A_i$ the corresponding matrix in front of each variable. Hence, the number of dual variables is $m = p^2 + q^2 + \operatorname{card}(\Omega^c)$.

Alternatively, we can also write this problem in the primal standard form:
\[
\min \; \tfrac{1}{2}\operatorname{trace}(Z)
\quad \text{subject to} \quad
\Bigl\langle \begin{pmatrix} 0 & \tfrac{1}{2}E_{ij} \\ \tfrac{1}{2}E_{ji} & 0 \end{pmatrix}, Z \Bigr\rangle = M_{ij} \;\; \forall (i,j) \in \Omega,
\qquad Z \succeq 0.
\]
The Schur complement matrix is an $m \times m$ matrix, of the form:
\[
S = \mathcal{A}\,(Z^{-1} \otimes P)\,\mathcal{A}^\top, \qquad S_{kl} = \operatorname{tr}(P A_k Z^{-1} A_l),
\]
where $A_k$ is one of the block matrices used in the dual problem formulation. Hence, to construct it we have to find $m^2$ of these entries, which in turn involves computing the products $P A_k$ and $Z^{-1} A_l$. The $A$ matrices are often sparse (in fact, in our example they only have $1$ or $2$ nonzero entries); however, $P$ and $Z^{-1}$ are probably full matrices in $\mathbb{R}^{(p+q)\times(p+q)}$. In the SDP paper provided (Alizadeh–Haeberly–Overton), the authors argue that the most expensive operation in the XZ method is the construction of this matrix, and provide complexity bounds of $O(mn^3 + m^2n^2)$ work. For our problem, $n = p + q$; written in the dual standard form, $m = p^2 + q^2 + \operatorname{card}(\Omega^c)$, whereas in the standard primal form, $m = \operatorname{card}(\Omega)$. Hence, it is computationally cheaper to use the primal form.

How do we get this bound? A way to go about inverting $Z$ is to compute its Cholesky factors, which takes $O(n^3)$ work; applying $L^{-1}$ then takes $n^2$ work per right-hand side (or $n^3$ work if the inverse is formed naively). In any case, this is done once, so it doesn't enter our calculations. Now, we have $m$ products of the form $P A_k$ and $Z^{-1} A_l$, and the computational cost of these depends on the sparsity structure of the $A_k$s. In general, one doesn't know the sparsity structure of these matrices, and so if the number of nonzero entries is roughly comparable with $n$, one gets a bound of $2mn^3$ work. However, in our case the $A$ matrices have $1$ or $2$ nonzero entries regardless of $p$ or $q$, and we only need to multiply one or two rows. Hence, if done smartly, one can reduce this to $4m(p+q)$ work. Finally, for each of the $m^2$ entries, we need to compute the trace of the matrix $(P A_k)(Z^{-1} A_l)$, which takes $m^2 n^2$ work in general (since we only need to obtain the diagonal entries of this product). However, in our case these matrices have only two nonzero rows, so this is reduced to $4m^2(p+q)$ work.

Overall, constructing the Schur complement matrix takes $O((p+q)^3 + m^2(p+q))$ work, and computing the Cholesky factor of the Schur complement matrix takes $O(m^3)$ work. For the dual problem, $m = O((p+q)^2)$ (it has at most $p^2 + q^2 + pq$ variables), and so constructing the Schur complement matrix is more expensive. However, here we are using too many variables. Using the primal form of this problem, $m = O(pq)$, and so both operations take comparably the same amount of work.
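To make the sparsity argument concrete, here is a minimal Matlab sketch of forming one Schur complement entry $S_{kl} = \operatorname{tr}(P A_k Z^{-1} A_l)$; the dimensions and the particular $A_k$, $A_l$ are illustrative assumptions, not code from the assignment:

    n = 50;                                    % n = p + q
    P = randn(n); P = P*P' + eye(n);           % stand-ins for the dense iterates
    Z = randn(n); Z = Z*Z' + eye(n);
    Ak = sparse([1 2],[2 1],[0.5 0.5],n,n);    % constraint matrix, 2 nonzeros
    Al = sparse([3 4],[4 3],[0.5 0.5],n,n);
    L = chol(Z,'lower');                       % factor Z once: O(n^3)
    ZiAl = L'\(L\full(Al));                    % Z^{-1}*A_l via triangular solves
    PAk = P*Ak;                                % cheap: A_k has only 2 nonzeros
    Skl = sum(sum(PAk .* ZiAl.'));             % tr(P*A_k*Z^{-1}*A_l), without
                                               % forming the full n x n product

Exploiting the one or two nonzero entries of each $A_k$, rather than the generic dense products above, is what brings the cost per entry down to $O(p+q)$.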
4. If this is a dual problem in its standard form, the corresponding primal problem would be:
\[
\min \; \Bigl\langle \begin{pmatrix} 0 & \widetilde M \\ \widetilde M^\top & 0 \end{pmatrix}, P \Bigr\rangle
\quad \text{subject to} \quad \langle A_i, P \rangle = b_i, \quad P \succeq 0.
\]
The constraints corresponding to the entries of $W_1$ and $W_2$ tell us that:
\[
p_{ii} = \tfrac{1}{2}, \qquad p_{ij} = 0 \;\; \forall i \neq j, \quad i,j \in \{1,\dots,p\} \text{ or } i,j \in \{p+1,\dots,p+q\}.
\]
That is, the variable in our primal problem looks like:
\[
P = \begin{pmatrix} \tfrac{1}{2} I_p & Q \\ Q^\top & \tfrac{1}{2} I_q \end{pmatrix}.
\]
Finally, taking into account the constraints corresponding to $X$, we conclude that $q_{ij} = 0$ for all $(i,j) \notin \Omega$.

Matlab Programming Assignment: CVX and the Matrix Completion Problem

5. I wrote a Matlab function [X,error] = rankSDP(M,num) to solve the corresponding SDP using CVX. Using this package, I wrote out the optimization problem, declaring the variables $W_1$, $W_2$ to be symmetric and using the SDP notation. The program grabs num entries from the matrix M (which are drawn randomly from a complete list), and returns the completed matrix and the error (which is taken to be the norm of the difference between X and M). For each of the three matrices in Xdata.mat, I ran 50 tests, in which I randomly added one or two entries at a time to the constraints and ran the optimization program. For each case, I observed the proportion of matrices completed vs. the number of entries provided (where a matrix is considered completed if the norm of the error falls under a certain tolerance), and I also plotted a histogram of the minimum number of entries needed to complete the matrix. The core CVX model is sketched below, followed by the results for X1, X2 and X3.
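This is a sketch of what the inside of rankSDP looks like, assuming Omega holds the linear indices of the revealed entries (variable names here are illustrative):

    [p,q] = size(M);
    cvx_begin sdp quiet
        variable X(p,q)
        variable W1(p,p) symmetric
        variable W2(q,q) symmetric
        minimize( 0.5*(trace(W1) + trace(W2)) )
        subject to
            [W1 X; X' W2] >= 0;        % the semidefinite constraint of problem 2
            X(Omega) == M(Omega);      % match the revealed entries
    cvx_end
    err = norm(X - M);                 % error measure described above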
[Figure: Matrices Completed for Experiments with X1 — proportion of matrices completed vs. number of entries provided.]

[Figure: Histogram — Entries Needed to Complete X1 (out of 25).]

[Figure: Matrices Completed for Experiments with X2 — proportion of matrices completed vs. number of entries provided.]

[Figure: Histogram — Number of Entries Needed to Complete X2 (out of 50).]
[Figure: Matrices Completed for Experiments with X3 — proportion of matrices completed vs. number of entries provided.]

[Figure: Histogram — Number of Entries Needed to Complete X3 (out of 80).]

We observe that, as we provide more and more entries, there is a minimum number of entries after which the algorithm starts completing the matrix in some of the experiments. Eventually, all matrices are completed. The mean number of entries needed in each case is 16, 41 and 68, which corresponds to a proportion of 64%, 82% and 85% of the entries. This seems contradictory; however, these matrices are distinct, and have a very small number of entries. To really observe the behaviour of this algorithm, we need to run a more systematic set of experiments, like those proposed in problem 6.

6. Now, I wrote a Matlab function [num,e,t] = rankexp(q,r,m,tol) to run this experiment; it generates m random experiments of order q and rank r (in this case, I only use it for rank 3). These matrices are randomly generated by obtaining a random matrix $V \in \mathbb{R}^{q\times 3}$ and computing $VV^\top$. With overwhelming probability, the result is a matrix of rank $r$ (and in any case, its rank is at most $r$); a minimal sketch of this generation step is given below the table.

Experiment: I ran experiments on the CIMS number-crunching servers (Solaris, 16GB RAM) for $q = 20, 30, 40, 50, 100$, increasing the number of entries provided by 2% of the total number of entries each time. For these experiments, the mean minimum number of entries required to complete the matrix was:

    q      % of total   # entries
    20     72%          288
    30     62%          559
    40     51%          816
    50     44%          1100
    100    28%          2800
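As promised, a minimal sketch of the test-matrix generation step inside rankexp (illustrative code):

    q = 20; r = 3;
    V = randn(q,r);
    M = V*V';          % symmetric, rank(M) <= r, and = r with probability 1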
[Figure: Number of Entries Needed vs. Matrix Size — mean number of entries needed, plotted against q (order of X).]

The mean running times were:

    q      Total (s)
    20     0.375
    30     0.628
    40     0.93
    50     1.258
    100    5.64

[Figure: Average Time vs. Size of Completed Matrix — average time per SDP solve, plotted against q (order of X).]

I tried to run more experiments (for example, for q > 100, or trying to increase the number of entries by a smaller percentage of $q^2$), but the runs took too long. However, based on the observations I have, I think a good upper bound for a q such that the mean completion time is bigger than 5 is q = 100. In comparing running times, we observe the trade-off between the increased efficiency of the matrix completion algorithm (which has a theoretical bound of $n^{6/5}\log(n)$ entries, meaning the theoretical bound on the proportion of entries needed should decay like $n^{-4/5}\log(n)$) and the increase in size and complexity of the problem as q grows. Also, we note that although the overall running time increases, the average time (per SDP problem solved) decreases with matrix size. It is most likely that for larger matrix sizes this behaviour is reversed.

7. The ratio test algorithm to find $s$ is very simple. Since the current iterate satisfies $x > 0$, we have:
\[
x + s\,\Delta x \geq 0 \iff 1 + s\,\frac{(\Delta x)_i}{x_i} \geq 0 \;\; \forall i.
\]
Now, as was stated in the problem, for a particular component $x_i$, if $(\Delta x)_i \geq 0$, then this entry will remain positive $\forall s \geq 0$. Otherwise, $s \leq -x_i/(\Delta x)_i$ ensures positivity. Hence, we can take $t$ to be:
\[
t = \min \Bigl\{ -\frac{x_i}{(\Delta x)_i} : (\Delta x)_i < 0 \Bigr\}
\]
if there are any negative entries of $\Delta x$, and $t = +\infty$ otherwise.

8. For SDPs, we can do a similar test. First, we obtain the Cholesky factorization of $X$: $X = LL^\top$. Factoring out the triangular factors, we obtain the following:
\[
X + s\,\Delta X \succeq 0 \iff I + s\,L^{-1}\Delta X (L^{-1})^\top \succeq 0.
\]
Now, we can just perform the exact same test on the eigenvalues of $L^{-1}\Delta X (L^{-1})^\top$. That is, we can take $t$ to be:
\[
t = \min \Bigl\{ -\frac{1}{\lambda} : \lambda \in \lambda\bigl(L^{-1}\Delta X (L^{-1})^\top\bigr),\; \lambda < 0 \Bigr\},
\]
where $\lambda(\cdot)$ denotes the set of eigenvalues, and $t = +\infty$ if $L^{-1}\Delta X (L^{-1})^\top \succeq 0$.
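For concreteness, here are minimal Matlab sketches of both ratio tests (my own illustrative functions; they assume $x > 0$ and $X$ symmetric positive definite):

    function t = ratio_lp(x, dx)
    % Largest t such that x + t*dx remains componentwise nonnegative.
        blocking = dx < 0;
        if any(blocking)
            t = min(-x(blocking) ./ dx(blocking));
        else
            t = Inf;                   % no component ever becomes negative
        end
    end

    function t = ratio_sdp(X, dX)
    % Largest t such that X + t*dX remains positive semidefinite.
        L = chol(X, 'lower');          % X = L*L'
        W = L \ dX / L';               % L^{-1} * dX * L^{-T}
        lmin = min(eig((W + W')/2));   % symmetrize against roundoff
        if lmin < 0
            t = -1/lmin;
        else
            t = Inf;                   % the step is never blocked
        end
    end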