Low-rank Solutions of Linear Matrix Equations via Procrustes Flow


Stephen Tu, Mahdi Soltanolkotabi, Ross Boczar, Benjamin Recht

July 11, 2015

Department of Electrical Engineering and Computer Science, UC Berkeley, Berkeley, CA
Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, CA
Department of Statistics, UC Berkeley, Berkeley, CA

Abstract

In this paper we study the problem of recovering a low-rank positive semidefinite matrix from linear measurements. Our algorithm, which we call Procrustes Flow, starts from an initial estimate obtained by a thresholding scheme, followed by gradient descent on a non-convex objective. We show that as long as the measurements obey a standard restricted isometry property, our algorithm converges to the unknown matrix at a geometric rate. In the case of Gaussian measurements, such convergence occurs for an n × n matrix of rank r when the number of measurements is Ω(nr).

1 Introduction

In numerous applications from signal processing, machine learning, control, and wireless communications, it is often of interest to find a positive semidefinite matrix of minimal rank obeying a set of linear equations. In particular, one wishes to solve problems of the form

  min_{X ∈ R^{n×n}} rank(X)   subject to   A(X) = b and X ⪰ 0,   (1.1)

where A : R^{n×n} → R^m is a known affine transformation that maps matrices to vectors. More specifically, the k-th entry of A(X) is ⟨A_k, X⟩ := Tr(A_k X), where each A_k ∈ R^{n×n} is symmetric. Since the early seventies, a very popular heuristic for solving such problems has been to replace X with a low-rank factorization X = UU^T and to solve matrix quadratic equations of the form

  find U ∈ R^{n×r}   subject to   A(UU^T) = b,   (1.2)

via a local search heuristic [Ruh74]. Many researchers have demonstrated that such heuristics work well in practice for a variety of problems [RS05, Fun06, LRS+10, RR13]. However, these procedures lack the strong guarantees associated with convex programming heuristics for solving (1.1). In this paper we show that a local search heuristic provably solves (1.2) under standard restricted isometry assumptions on the linear map A. For standard ensembles of equality constraints, we demonstrate that X can be estimated to precision ε by such local search as long as we have Ω(nr) equations.¹

¹ Here and throughout we use f(x) = Ω(g(x)) if there is a positive constant C such that f(x) ≥ C g(x) for all x sufficiently large.
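To fix notation for what follows, the snippet below is a small NumPy sketch of the measurement model: the operator A maps a matrix to the m inner products ⟨A_k, X⟩ = Tr(A_k X), and the factored problem (1.2) asks for a tall matrix U reproducing b = A(UU^T). The function name apply_A and the unnormalized symmetric Gaussian A_k are illustrative assumptions of this sketch only; the specific measurement ensemble used for the theory is described in Section 3.

    import numpy as np

    def apply_A(A_list, X):
        # The affine map A: R^{n x n} -> R^m; the k-th entry is <A_k, X> = Tr(A_k X)
        # for symmetric A_k.
        return np.array([np.trace(Ak @ X) for Ak in A_list])

    # Toy instance of the factored problem (1.2): b = A(U U^T) with U of size n x r.
    rng = np.random.default_rng(0)
    n, r, m = 20, 2, 300
    U_true = rng.standard_normal((n, r))
    A_list = [(G + G.T) / 2 for G in rng.standard_normal((m, n, n))]
    b = apply_A(A_list, U_true @ U_true.T)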

This Ω(nr) requirement is merely a constant factor more than the number of parameters needed to specify an n × n matrix of rank r. Specialized to a random Gaussian model, our work improves upon recent and independent work by Zheng and Lafferty [ZL15].

2 Algorithm: Procrustes Flow

In this paper we study a local search heuristic for solving matrix quadratic equations of the form (1.2) which consists of two components: (1) a careful initialization obtained by a projected gradient scheme on n × n matrices, and (2) a series of successive refinements of this initial solution via a gradient descent scheme. This algorithm is a natural extension of the Wirtinger Flow algorithm developed in [CLS14b] for solving vector quadratic equations. Following [CLS14b], we shall refer to the combination of these two steps as the Procrustes Flow (PF) algorithm, detailed in Algorithm 1.

2.1 Initialization via low-rank projected gradients

In the initial phase of our algorithm we start from Z_0 = 0_{n×n} and apply, for T_init iterations, successive updates of the form

  Z_{τ+1} = P_r(Z_τ − α_{τ+1} Σ_{k=1}^m (⟨A_k, Z_τ⟩ − b_k) A_k).   (2.1)

Here, P_r denotes the projection onto rank-r positive semidefinite (PSD) matrices of size n × n, which can be computed efficiently via Lanczos methods. We set our initialization to an n × r matrix U_0 obeying Z_{T_init} = U_0 U_0^T. Updates of the form (2.1) have a long history in the compressed sensing/matrix sensing literature (see e.g. [TG07, GK09, NT09, NV09, BD09, MJD09, CCS10]). Furthermore, using the first step of the update (2.1) for the purposes of initialization has also been proposed in previous work (see e.g. [AM07, KMO10, JNS12]).

2.2 Successive refinement via gradient descent

As mentioned earlier, we are interested in finding a matrix U ∈ R^{n×r} obeying matrix quadratic equations of the form b = A(UU^T). We wish to refine our initial estimate by solving the following non-convex optimization problem

  min_{U ∈ R^{n×r}} f(U) := (1/4) ‖A(UU^T) − b‖²_{ℓ2} = (1/4) Σ_{k=1}^m (⟨A_k, UU^T⟩ − b_k)²,   (2.2)

which minimizes the misfit in our quadratic equations via the square loss. To solve (2.2), starting from our initial estimate U_0 ∈ R^{n×r} we apply the successive updates

  U_{τ+1} := U_τ − µ_{τ+1} (σ_r²(U_0)/σ_1⁴(U_0)) ∇f(U_τ) = U_τ − µ_{τ+1} (σ_r²(U_0)/σ_1⁴(U_0)) Σ_{k=1}^m (⟨A_k, U_τ U_τ^T⟩ − b_k) A_k U_τ.   (2.3)

Here and throughout, for a matrix X, σ_ℓ(X) denotes the ℓ-th largest singular value of X. We note that the update (2.3) is essentially gradient descent with a carefully chosen step size.
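The two phases just described can be summarized in a short NumPy sketch (cf. Algorithm 1 below). This is a minimal illustration under stated assumptions, not the authors' reference implementation: the names proj_rank_r_psd and procrustes_flow are ours, the projection P_r is computed with a dense eigendecomposition rather than the Lanczos method mentioned above, and the iteration counts are placeholders.

    import numpy as np

    def proj_rank_r_psd(Z, r):
        # P_r: project a symmetric matrix onto rank-r PSD matrices by keeping
        # the top-r eigenpairs and clipping negative eigenvalues at zero.
        w, V = np.linalg.eigh((Z + Z.T) / 2)
        idx = np.argsort(w)[::-1][:r]
        w_top = np.clip(w[idx], 0.0, None)
        return (V[:, idx] * w_top) @ V[:, idx].T

    def procrustes_flow(A_list, b, r, T_init=10, T_gd=500, mu=4.0 / 75.0):
        # Sketch of the two-phase iteration: updates (2.1) and (2.3).
        m = len(A_list)
        n = A_list[0].shape[0]
        alpha = 1.0 / m  # step size alpha_tau = 1/m, as used in Theorem 3.3

        # Initialization phase: projected gradient on n x n matrices, update (2.1).
        Z = np.zeros((n, n))
        for _ in range(T_init):
            resid = np.array([np.sum(Ak * Z) for Ak in A_list]) - b   # <A_k, Z> - b_k
            grad = sum(rk * Ak for rk, Ak in zip(resid, A_list))
            Z = proj_rank_r_psd(Z - alpha * grad, r)

        # Factor Z_{T_init} = U_0 U_0^T from its top-r eigendecomposition (U_0 = Q Sigma^{1/2}).
        w, V = np.linalg.eigh(Z)
        idx = np.argsort(w)[::-1][:r]
        U = V[:, idx] * np.sqrt(np.clip(w[idx], 0.0, None))

        # Fixed step-size scaling sigma_r^2(U_0) / sigma_1^4(U_0) from update (2.3).
        s = np.linalg.svd(U, compute_uv=False)
        scale = s[-1] ** 2 / s[0] ** 4

        # Gradient descent phase: update (2.3).
        for _ in range(T_gd):
            resid = np.array([np.sum(Ak * (U @ U.T)) for Ak in A_list]) - b
            grad = sum(rk * (Ak @ U) for rk, Ak in zip(resid, A_list))
            U = U - mu * scale * grad
        return U

A data-dependent stopping rule for the initialization phase, based only on observable quantities, is given in Lemma 3.4 below.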

Algorithm 1 Procrustes Flow (PF)
Require: {A_k}_{k=1}^m, {b_k}_{k=1}^m, {α_τ}_{τ≥1}, {µ_τ}_{τ≥1}, T_init ∈ N
  // Initialization phase
  Z_0 := 0_{n×n}
  repeat
    Z_{τ+1} ← P_r(Z_τ − α_{τ+1} Σ_{k=1}^m (⟨A_k, Z_τ⟩ − b_k) A_k)   // projection onto rank-r PSD matrices
  until τ = T_init
  QΣQ^T := Z_{T_init}   // SVD of Z_{T_init}, with Q ∈ R^{n×r}, Σ ∈ R^{r×r}
  U_0 := QΣ^{1/2}
  // Gradient descent phase
  repeat
    U_{τ+1} ← U_τ − µ_{τ+1} (σ_r²(U_0)/σ_1⁴(U_0)) Σ_{k=1}^m (⟨A_k, U_τ U_τ^T⟩ − b_k) A_k U_τ
  until convergence

3 Main Results

For our theoretical results we shall focus on affine maps A which are isotropic and obey the matrix Restricted Isometry Property (RIP).

Definition 3.1 (Isotropic mapping). We say the random map A is an isotropic mapping if

  E[‖A(X)‖²_{ℓ2}] = ‖X‖_F²

holds for any fixed X ∈ R^{n×n} which is independent of A.

Definition 3.2 (Restricted Isometry Property (RIP) [CT05, RFP10]). The map A satisfies r-RIP with constant δ_r if

  (1 − δ_r) ‖X‖_F² ≤ ‖A(X)‖²_{ℓ2} ≤ (1 + δ_r) ‖X‖_F²

holds for all matrices X ∈ R^{n×n} of rank at most r.

We are interested in finding a matrix U ∈ R^{n×r} obeying quadratic equations of the form

  A(UU^T) = b,   (3.1)

where we assume b = A(XX^T) for a planted solution X ∈ R^{n×r}. We wish to understand when the Procrustes Flow algorithm recovers this planted solution X. We note that this is only possible up to a certain rotational factor: if U obeys (3.1), then so does any matrix UR with R ∈ R^{r×r} an orthonormal matrix satisfying R^T R = I_r. This naturally leads to defining the distance between two matrices U, X ∈ R^{n×r} as

  dist(U, X) := min_{R ∈ R^{r×r}: R^T R = I_r} ‖U − XR‖_F.   (3.2)

We note that this distance is the solution to the classic orthogonal Procrustes problem (hence the name of the algorithm). It is known that the optimal rotation matrix R minimizing ‖U − XR‖_F is equal to R = AB^T, where AΣB^T is the singular value decomposition (SVD) of X^T U.
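Since dist(·, ·) is used throughout the analysis, a short sketch of its computation may be helpful. It uses exactly the closed-form Procrustes solution just mentioned (R = AB^T from the SVD of X^T U); the function name is ours.

    import numpy as np

    def procrustes_dist(U, X):
        # dist(U, X) = min_R ||U - X R||_F over r x r matrices with R^T R = I,
        # via the closed-form Procrustes solution: R = A B^T where X^T U = A Sigma B^T.
        A, _, Bt = np.linalg.svd(X.T @ U)
        R = A @ Bt
        return np.linalg.norm(U - X @ R, ord="fro"), R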

We now have all of the elements in place to state our main result.

Theorem 3.3. Let X ∈ R^{n×r} be an arbitrary matrix with singular values σ_1(X) ≥ σ_2(X) ≥ ⋯ ≥ σ_r(X) > 0 and condition number κ = σ_1(X)/σ_r(X). Also, let b = A(XX^T) ∈ R^m be m quadratic samples. Furthermore, assume the mapping A is isotropic and obeys rank-4r RIP with RIP constant δ_{4r} ≤ 1/10. Then, using T_init ≥ log(√r κ²) + 2 iterations of the initialization phase of Procrustes Flow with α_τ = 1/m yields a solution U_0 obeying

  dist(U_0, X) ≤ (1/4) σ_r(X).   (3.3)

Furthermore, take a constant step size µ_τ = µ for all τ = 1, 2, … and assume µ ≤ 4/75. Then, starting from any initial solution obeying (3.3), we have

  dist(U_τ, X) ≤ (1/4) (1 − (24/125)(µ/κ⁴))^{τ/2} σ_r(X).   (3.4)

The above theorem shows that the Procrustes Flow algorithm achieves a good initialization under the isotropy and RIP assumptions on the mapping A. Also, starting from any sufficiently accurate initialization, the algorithm exhibits geometric convergence to the unknown matrix X. We note that in the above result we have not attempted to optimize the constants. Furthermore, there is a natural tradeoff between the upper bound on the RIP constant, the radius in which PF is contractive (3.3), and its rate of convergence (3.4). In particular, as will become clear in the proofs, one can increase the radius in which PF is contractive (increase the constant 1/4 in (3.3)) and the rate of convergence (increase the constant 24/125 in (3.4)) by assuming a smaller upper bound on the RIP constant.

The most common measurement ensemble which satisfies the isotropy and RIP assumptions is the spiked Gaussian ensemble, where each symmetric matrix A_k has N(0, 1/m) entries on the diagonal and N(0, 1/(2m)) entries elsewhere. For this ensemble to achieve a RIP constant of δ_r, we require at least m = Ω(nr/δ_r²) measurements. Applying Theorem 3.3 to this measurement ensemble, we conclude that the Procrustes Flow algorithm yields a solution with relative error ε (that is, dist(X̂, X)/‖X‖ ≤ ε) in O(log(√r/ε)) iterations using only Ω(nr) measurements.

We would like to note that if more measurements are available, it is not necessary to use multiple projected gradient updates in the initialization phase. In particular, for the Gaussian model, if m = Ω(nr²κ⁴), then (3.3) will hold after the first iteration (T_init = 1).

How to verify the initialization is complete. Theorem 3.3 requires that T_init = Ω(log(√r κ²)), but κ is a property of X and is hence unknown. However, as long as the RIP constant δ_{2r} satisfies δ_{2r} ≤ 1/10, we can use each iterate of the initialization to test whether or not we have entered the radius of convergence. The following lemma establishes a sufficient condition we can check using only information from Z_τ. The proof is deferred to Appendix B.

Lemma 3.4. Assume the RIP constant of A satisfies δ_{2r} ≤ 1/10. Let Z_τ denote the τ-th step of the initialization in Procrustes Flow, and let U_0 ∈ R^{n×r} be such that Z_τ = U_0 U_0^T. Define e_τ := ‖A(Z_τ) − b‖_{ℓ2} = ‖A(Z_τ − XX^T)‖_{ℓ2}. Then, if the following condition holds,

  e_τ ≤ (3/20) σ_r(Z_τ),

we have that dist(U_0, X) ≤ (1/4) σ_r(X).
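Returning to the measurement model, the spiked Gaussian ensemble described after Theorem 3.3 can be sampled as in the sketch below. The function name is ours; the entry variances match the ones stated above (N(0, 1/m) on the diagonal, N(0, 1/(2m)) off the diagonal), which is the normalization that makes E[‖A(X)‖²_{ℓ2}] = ‖X‖_F² for symmetric X.

    import numpy as np

    def spiked_gaussian_ensemble(n, m, rng=None):
        # Draw m symmetric A_k with N(0, 1/m) diagonal entries and
        # N(0, 1/(2m)) off-diagonal entries.
        rng = np.random.default_rng() if rng is None else rng
        A_list = []
        for _ in range(m):
            upper = np.triu(rng.normal(scale=np.sqrt(1.0 / (2 * m)), size=(n, n)), k=1)
            A = upper + upper.T
            A[np.diag_indices(n)] = rng.normal(scale=np.sqrt(1.0 / m), size=n)
            A_list.append(A)
        return A_list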

5 we have that dist(u 0, X) 1 4 σ r(x) One might consider using solely the projected gradient updates (ie set T init = ) as in previous approaches [TG07, GK09, NT09, NV09, BD09, MJD09, CCS] We note that the projected gradient updates in the initialization phase require computing the first r eigenvectors of a PSD matrix whereas the gradient updates do not require any eigenvector computations Such eigenvector computations may be prohibitive compared to the gradient updates, especially when n is large and for ensembles where matrix-vector multiplication is fast We would like to emphasize, however, that for small n and dense matrices using projected gradient updates may be more efficient due to faster convergence rate Our scheme is a natural interpolation: one could only do projected gradient steps, or one could do one projected gradient step Here we argue that very few projected gradients provide sufficient initialization such that gradient descent converges geometrically 4 Related work There is a vast literature dedicated to low-rank matrix recovery/sensing and semi-definite programming We shall only focus on the papers most related to our framework Recht, azel, and Parrilo were the first to study low-rank solutions of linear matrix equations under RIP assumptions [RP] They showed that if the rank-r RIP constant of A is less than a fixed numerical constant, then the matrix with minimum trace satisfying the equality constraints coincided with the minimum rank solution In particular, for the Gaussian ensemble the required number of measurements is Ω(nr) [CP11] Subsequently, a series of papers [CR09, Gro11, Rec11, CLS14a] showed that trace minimization and related convex optimization approaches also work for other measurement ensembles such as those arising in matrix completion and related problems In this paper we have established a similar result to [RP] (albeit only for PSD matrices) We require the same order of measurements Ω(nr) but use a more computationally friendly local search algorithm Also related to this work are projection gradient schemes with hard thresholding [TG07, GK09, NT09, NV09, BD09, MJD09, CCS] Such algorithms enjoy similar guarantees to that of [RP] and this work Indeed, we utilize such results in the initialization phase of our algorithm However, such algorithms require a rank-r SVD in each iteration which may be expensive for large problem sizes We would like to emphasize, however, that for small problem sizes and dense matrices (such as Gaussian ensembles) such algorithms may be faster than gradient descent approaches such as ours More recently, there have been a few results using non-convex optimization schemes for matrix recovery problems In particular, theoretical guarantees for matrix completion have been established using manifold optimization 2 [KMO] and alternating minimization [Kes12] (albeit with the caveat of requiring a fresh set of samples in each iteration) Later on, Jain etal [JNS12] analyzed the performance of alternating minimization under similar modeling assumptions to [RP] and this paper However, the requirements on the RIP constant in [JNS12] are more stringent compared to [RP] and ours In particular, the authors require δ 4r c/r whereas we only require δ 4r c We should note, however, that the authors study general low-rank matrices rather than PSD ones 2 See [JBAS] for related schemes using manifold optimization 5

Our algorithm and analysis are inspired by the recent paper [CLS14b] by Candès, Li and Soltanolkotabi; see also [Sol14, CLM15] for some stability results. In [CLS14b] the authors introduced a local regularity condition to analyze the convergence of a gradient descent-like scheme for phase retrieval. We use a similar regularity condition but generalize it to ranks higher than one. Recently, and independently of our work, Zheng and Lafferty [ZL15] provided an analysis of gradient descent using (2.2) via the same regularity condition. Zheng and Lafferty focus on the Gaussian ensemble, and establish a sample complexity of m = Ω(nr²κ⁴ log n).⁴ In comparison, we only require Ω(nr) measurements, both removing the dependence on κ in the sample complexity and improving the asymptotic rate. We would like to emphasize that the improvement in our result is not just due to the more sophisticated initialization scheme. In particular, Zheng and Lafferty show geometric convergence (albeit at a slower rate) starting from any initial solution obeying dist(U_0, X) ≤ c σ_r(X) as long as the number of measurements obeys m = Ω(nrκ⁴ log n). In contrast, we establish geometric convergence starting from the same neighborhood of U_0 with only Ω(nr) measurements. Moreover, the theory of restricted isometries in our work considerably simplifies the analysis. Finally, we would also like to mention [SOR14] for guarantees using stochastic gradient algorithms. The results of [SOR14] are applicable to a variety of models; focusing on the Gaussian ensemble, the authors require Ω((nr log n)/ε) samples to reach a relative error of ε. In contrast, our sample complexity is independent of the desired relative error ε. However, their algorithm only requires a random initialization.

⁴ We note that the paper [ZL15] defines κ in terms of the larger matrix XX^T, whereas our convention is to define κ for the smaller matrix X.

5 Proofs

Before we dive into the details of the proofs, we would like to mention that we will prove our results using the update

  U_{τ+1} = U_τ − (µ/(κ²‖X‖²)) ∇f(U_τ),   (5.1)

in lieu of the PF update

  U_{τ+1} = U_τ − µ (σ_r²(U_0)/σ_1⁴(U_0)) ∇f(U_τ).   (5.2)

As we prove in Section 5.3, our initial solution obeys ‖U_0 U_0^T − XX^T‖_F ≤ σ_r²(X)/4. Hence, applying Weyl's inequality we can conclude that

  σ_r²(U_0)/σ_1⁴(U_0) ≤ (σ_r²(X) + σ_r²(X)/4) / (σ_1²(X) − σ_1²(X)/4)² ≤ 3 σ_r²(X)/σ_1⁴(X),   (5.3)

and similarly,

  σ_r²(U_0)/σ_1⁴(U_0) ≥ (σ_r²(X) − σ_r²(X)/4) / (σ_1²(X) + σ_1²(X)/4)² = (12/25) σ_r²(X)/σ_1⁴(X).   (5.4)

7 and similarly, σr(u 2 0 ) σ1 4(U 0) 12 σr(x) 2 (54) 25 (X) Thus, any result proven for the update (51) will automatically carry over to the P update with a simple rescaling of the upper bound on the step size via (53) urthermore, we can upper bound the convergence rate of gradient descent using the P update in terms of properties of X instead of U 0 via (54) 51 Preliminaries We start with a well known characterization of RIP Lemma 51 [Can08] Let A satisfy 4r-RIP with constant δ 4r Then, for all fixed matrices X, Y of rank at most 2r which are independent of A, we have A(X), A(Y ) X, Y δ 4r X Y Next, we state a recent result which characterizes the convergence rate of projected gradient descent onto general non-convex sets specialized to our problem See [MJD09] for related results using singular value hard thresholding Lemma 52 [ORS15] Let X R n r be an arbitrary matrix Also, let b = A(XX T ) R m be m quadratic samples Consider the iterative updates ( ) Z τ+1 P r Z τ 1 m ( A k, Z τ b k )Z τ, m where P r projects its input matrix onto the rank-r PSD cone Then, Z τ XX T ρ(a) τ Z0 XX T, holds Here, ρ(a) is defined as ρ(a) := 2 sup k=1 X =1,rank(X) 2r, Y =1,rank(Y ) 2r σ 4 1 A(X), A(Y ) X, Y We shall make repeated use of the following lemma which upper bounds UU T XX T by some factor of dist(u, X) Lemma 53 or any U R n r obeying dist(u, X) 1 4 X, we have UU T XX T 9 4 X U XR Proof UU T XX T = U(U XR) T + (U XR)(XR) T ( U + X ) U XR 9 4 X U XR 7

Finally, we also need the following lemma, which upper bounds dist(U, X) by some factor of ‖UU^T − XX^T‖_F. We defer the proof of this result to Appendix A.

Lemma 5.4. For any U, X ∈ R^{n×r}, we have

  dist(U, X)² ≤ (1/(2(√2 − 1) σ_r²(X))) ‖UU^T − XX^T‖_F².

5.2 Proof of convergence of gradient descent updates (Equation (3.4))

We first outline the general proof strategy; please see [CLS14b, Sections 2.3 and 7.9] for related arguments. The first step is to show that gradient descent on the expected value of the function, which we call F(U) := E[f(U)], exhibits geometric convergence in a small neighborhood around X. The standard approach in optimization to show this is to prove that the function exhibits strong convexity. However, it is not possible for F(U) to be strongly convex in any neighborhood around X.⁵ Thus, we rely on the approach used by [CLS14b], which establishes a sufficient condition that only relies on first-order information along certain trajectories. After showing the sufficient condition holds in expectation, we use standard RIP results to show that this condition also holds for the function f(U) with high probability.

⁵ Such a proof is only possible in the special case when r = 1.

To begin our analysis, we start with the following formulas for the gradients of f(U) and F(U):

  ∇f(U) = Σ_{k=1}^m ⟨A_k, UU^T − XX^T⟩ A_k U,   ∇F(U) = (UU^T − XX^T) U.

Throughout the proof, R is the solution to the orthogonal Procrustes problem. That is,

  R = argmin_{R̃ ∈ R^{r×r}: R̃^T R̃ = I_r} ‖U − X R̃‖_F,

with the dependence on U omitted for the sake of exposition. The following definition defines a notion of strong convexity along certain trajectories of the function.

Definition 5.5 (Regularity condition, [CLS14b]). Let X ∈ R^{n×r} be a global optimum of a function f. Define the set B(δ) as

  B(δ) := {U ∈ R^{n×r} : dist(U, X) ≤ δ}.

The function f satisfies a regularity condition, denoted by RC(α, β, δ), if for all matrices U ∈ B(δ) the following inequality holds:

  ⟨∇f(U), U − XR⟩ ≥ (1/α) ‖U − XR‖_F² + (1/β) ‖∇f(U)‖_F².

If a function satisfies RC(α, β, δ), then as long as gradient descent starts from a point U_0 ∈ B(δ), it will have a geometric rate of convergence to the optimum X. This is formalized by the following lemma.

Lemma 5.6 ([CLS14b]). If f satisfies RC(α, β, δ) and U_0 ∈ B(δ), then the gradient descent update

  U_{τ+1} = U_τ − µ ∇f(U_τ),

with step size 0 < µ ≤ 2/β, obeys U_τ ∈ B(δ) and

  dist²(U_τ, X) ≤ (1 − 2µ/α)^τ dist²(U_0, X)

for all τ ≥ 0.

The proof is complete by showing that the regularity condition holds. To this end, we first show in Lemma 5.7 below that the expected function F(U) satisfies a slightly stronger variant of the regularity condition from Definition 5.5. We then show in Lemma 5.8 that the gradient of f is always close to the gradient of F.

Lemma 5.7. Let F(U) = (1/4) ‖UU^T − XX^T‖_F² denote the average function. For all U obeying ‖U − XR‖_F ≤ (1/4) σ_r(X), we have

  ⟨∇F(U), U − XR⟩ ≥ (1/20)(‖UU^T − XX^T‖_F² + ‖(U − XR)U^T‖_F²) + (σ_r²(X)/4) ‖U − XR‖_F² + (4/(25 κ²‖X‖²)) ‖∇F(U)‖_F².   (5.5)

Lemma 5.8. Let A be a linear map obeying rank-4r RIP with constant δ_{4r}. For any V ∈ R^{n×r} and any U ∈ R^{n×r} obeying dist(U, X) ≤ (1/4)‖X‖ and independent of A, we have

  |⟨∇F(U) − ∇f(U), V⟩| ≤ δ_{4r} ‖UU^T − XX^T‖_F ‖VU^T‖_F.

This immediately implies that for any U ∈ R^{n×r} obeying dist(U, X) ≤ (1/4)‖X‖ and independent of A, we have

  ‖∇f(U) − ∇F(U)‖_F ≤ δ_{4r} ‖UU^T − XX^T‖_F ‖U‖.

We shall prove these two lemmas in Sections 5.2.1 and 5.2.2. However, we first explain how the regularity condition follows from these two lemmas. To begin, note that

  ⟨∇F(U), U − XR⟩ = ⟨∇f(U), U − XR⟩ + ⟨∇F(U) − ∇f(U), U − XR⟩
    (a)≤ ⟨∇f(U), U − XR⟩ + (1/10) ‖UU^T − XX^T‖_F ‖(U − XR)U^T‖_F
    (b)≤ ⟨∇f(U), U − XR⟩ + (1/20)(‖UU^T − XX^T‖_F² + ‖(U − XR)U^T‖_F²),   (5.6)

where (a) follows from Lemma 5.8 together with the fact that δ_{4r} ≤ 1/10, as assumed in the statement of Theorem 3.3, and (b) follows from 2ab ≤ a² + b².

Combining (5.6) with Lemma 5.7, for any U obeying ‖U − XR‖_F ≤ (1/4) σ_r(X) we have

  ⟨∇f(U), U − XR⟩ ≥ (σ_r²(X)/4) ‖U − XR‖_F² + (4/(25 κ²‖X‖²)) ‖∇F(U)‖_F².   (5.7)

Furthermore, for any U obeying dist(U, X) ≤ (1/4)‖X‖, we have

  ‖∇F(U)‖_F² (a)≥ (1/2) ‖∇f(U)‖_F² − ‖∇F(U) − ∇f(U)‖_F²
             (b)≥ (1/2) ‖∇f(U)‖_F² − δ_{4r}² ‖UU^T − XX^T‖_F² ‖U‖²
             (c)≥ (1/2) ‖∇f(U)‖_F² − (25/16) δ_{4r}² ‖X‖² ‖UU^T − XX^T‖_F²
             (d)≥ (1/2) ‖∇f(U)‖_F² − 8 δ_{4r}² ‖X‖⁴ ‖U − XR‖_F².   (5.8)

Here, (a) holds by the triangle inequality and the inequality (a − b)² ≥ a²/2 − b², (b) holds by Lemma 5.8, (c) follows from the fact that for dist(U, X) ≤ (1/4)‖X‖ we have ‖U‖ ≤ (5/4)‖X‖, and (d) follows from Lemma 5.3. Equation (5.8) immediately implies

  (4/(25 κ²‖X‖²)) ‖∇F(U)‖_F² ≥ (2/(25 κ²‖X‖²)) ‖∇f(U)‖_F² − (32/25) δ_{4r}² (‖X‖²/κ²) ‖U − XR‖_F².   (5.9)

Plugging the latter into (5.7), and assuming δ_{4r} ≤ 1/10, we conclude that for any U obeying ‖U − XR‖_F ≤ (1/4) σ_r(X) we have

  ⟨∇f(U), U − XR⟩ ≥ (1/4 − (32/25) δ_{4r}²) σ_r²(X) ‖U − XR‖_F² + (2/(25 κ²‖X‖²)) ‖∇f(U)‖_F²
                  ≥ (σ_r²(X)/5) ‖U − XR‖_F² + (2/(25 κ²‖X‖²)) ‖∇f(U)‖_F².

Since dist(U, X) ≤ (1/4) σ_r(X) means exactly that ‖U − XR‖_F ≤ (1/4) σ_r(X), the last inequality shows that f(U) obeys RC(5/σ_r²(X), 25 κ²‖X‖²/2, (1/4) σ_r(X)). The convergence result in Equation (3.4) now follows from Lemma 5.6. All that remains is to prove Lemmas 5.7 and 5.8.

5.2.1 Proof of the regularity condition for the average function (Lemma 5.7)

We first state some properties of the Procrustes problem and its optimal solution. Let U, X ∈ R^{n×r} and define H := U − XR, where R is the orthogonal matrix which minimizes ‖U − XR‖_F. Let AΣB^T be the SVD of X^T U; we know that the optimal R is equal to R = AB^T. Thus, U^T XR = BΣB^T = (XR)^T U, which shows that U^T XR is a symmetric PSD matrix. Furthermore, since

  H^T XR = U^T XR − R^T X^T XR = (XR)^T U − R^T X^T XR = (XR)^T (U − XR) = (XR)^T H,

we can conclude that H^T XR is symmetric.

To avoid carrying R in our equations we perform the change of variable X ← XR. That is, without loss of generality we assume R = I, U^T X ⪰ 0, and H^T X = X^T H.

Note that for any U obeying dist(U, X) ≤ (1/4)‖X‖ we have

  ‖(UU^T − XX^T)U‖_F ≤ ‖UU^T − XX^T‖_F ‖U‖ ≤ (5/4) ‖X‖ ‖UU^T − XX^T‖_F.

Using the latter along with the simplifications discussed above, to prove the lemma (i.e. (5.5)) it suffices to prove

  ⟨(UU^T − XX^T)U, U − X⟩ ≥ (1/20)(‖UU^T − XX^T‖_F² + ‖(U − X)U^T‖_F²) + (σ_r²(X)/4) ‖U − X‖_F² + (1/(4κ²)) ‖UU^T − XX^T‖_F².   (5.10)

Equation (5.10) can equivalently be written in the form

  0 ≤ Tr( (H^T H)² + 3 H^T H H^T X + (H^T X)² + H^T H X^T X
        − (1/20)(1 + 5/κ²) [ (H^T H)² + 4 H^T H H^T X + 2 (H^T X)² + 2 H^T H X^T X ]
        − (1/20) [ (H^T H)² + 2 H^T H H^T X + H^T H X^T X ]
        − (σ_r²(X)/4) H^T H ).

Rearranging terms, we arrive at

  0 ≤ Tr( c₁ (H^T H)² + c₂ H^T H H^T X + c₃ (H^T X)² + c₄ H^T H X^T X − (σ_r²(X)/4) H^T H )
    = Tr( ( (c₂/(2√c₃)) H^T H + √c₃ H^T X )² + (c₁ − c₂²/(4c₃)) (H^T H)² + c₄ H^T H X^T X − (σ_r²(X)/4) H^T H ).   (5.11)

Here the constants c₁, c₂, c₃, c₄ are defined as

  c₁ = 9/10 − 1/(4κ²),  c₂ = 27/10 − 1/κ²,  c₃ = 9/10 − 1/(2κ²),  c₄ = 17/20 − 1/(2κ²).

Since κ ≥ 1, it is easy to verify that (a) c_i > 0 for i = 1, …, 4, and (b) c₁ ≤ c₂²/(4c₃). To prove (5.11) it thus suffices to prove

  ‖H‖² ≤ ((c₄ − 1/4) / (c₂²/(4c₃) − c₁)) σ_r²(X).

Using the fact that ‖H‖ ≤ (1/4) σ_r(X), we see that the proof is complete by verifying that (c₄ − 1/4)/(c₂²/(4c₃) − c₁) ≥ 1/16 for all κ ≥ 1.

5.2.2 Proof of gradient concentration (Lemma 5.8)

Define ∆ := UU^T − XX^T. Then,

  ⟨∇f(U) − ∇F(U), V⟩ = Σ_{k=1}^m ( ⟨A_k, ∆⟩ ⟨A_k, VU^T⟩ − E[⟨A_k, ∆⟩ ⟨A_k, VU^T⟩] )
    (a)= Σ_{k=1}^m ⟨A_k, ∆⟩ ⟨A_k, VU^T⟩ − ⟨∆, VU^T⟩
    (b)≤ δ_{4r} ‖UU^T − XX^T‖_F ‖VU^T‖_F,

where (a) follows from our assumption that A is isotropic and (b) follows from Lemma 5.1, since rank(∆) ≤ 2r and rank(VU^T) ≤ r. This proves the first part of the lemma. To prove the second part, by the variational form of the Frobenius norm we have

  ‖∇f(U) − ∇F(U)‖_F = sup_{V ∈ R^{n×r}, ‖V‖_F ≤ 1} ⟨∇f(U) − ∇F(U), V⟩ ≤ δ_{4r} ‖UU^T − XX^T‖_F sup_{V ∈ R^{n×r}, ‖V‖_F ≤ 1} ‖VU^T‖_F.

The result now follows from ‖VU^T‖_F ≤ ‖V‖_F ‖U‖ ≤ ‖U‖.

5.3 Proof of initialization (Equation (3.3))

Using Lemma 5.1, we can conclude that ρ(A) from Lemma 5.2 is bounded by ρ(A) ≤ 2δ_{4r} ≤ 1/5. Setting Z_0 = 0_{n×n} and applying Lemma 5.2 to our initialization iterates, we have that

  ‖Z_τ − XX^T‖_F ≤ (1/5)^τ ‖XX^T‖_F ≤ (1/5)^τ ‖X‖ ‖X‖_F.

From Lemma 5.4, we have that

  dist(U_0, X) ≤ √2 ‖Z_τ − XX^T‖_F / σ_r(X) ≤ √2 (1/5)^τ κ ‖X‖_F.

Hence, if we want the right-hand side to be upper bounded by (1/4) σ_r(X), we require

  √2 (1/5)^τ κ ‖X‖_F ≤ (1/4) σ_r(X),   i.e.,   (1/5)^τ ≤ σ_r(X) / (4√2 κ ‖X‖_F).

Since ‖X‖_F ≤ √r ‖X‖, it is enough to require that

  τ ≥ log(√r κ²) + 2.   (5.12)

Similarly, it is easy to check that if τ satisfies (5.12), then

  ‖Z_τ − XX^T‖ ≤ ‖Z_τ − XX^T‖_F ≤ σ_r²(X)/4

also holds.

Acknowledgements

BR is generously supported by ONR awards, NSF awards, an AFOSR award, and a Sloan Research Fellowship. This research is supported in part by an NSF CISE Expeditions Award, an LBNL Award, a DARPA XData Award, and gifts from Amazon Web Services, Google, SAP, The Thomas and Stacey Siebel Foundation, Adatao, Adobe, Apple Inc., Blue Goji, Bosch, C3Energy, Cisco, Cray, Cloudera, EMC2, Ericsson, Facebook, Guavus, HP, Huawei, Informatica, Intel, Microsoft, NetApp, Pivotal, Samsung, Schlumberger, Splunk, Virdata and VMware.

References

[AM07] D. Achlioptas and F. McSherry. Fast computation of low-rank matrix approximations. Journal of the ACM, 54(2), 2007.
[BD09] T. Blumensath and M. E. Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3), 2009.
[Can08] E. J. Candès. The restricted isometry property and its implications for compressed sensing. Comptes Rendus de l'Académie des Sciences, 2008.
[CCS10] J. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4), 2010.
[CLM15] T. T. Cai, X. Li, and Z. Ma. Optimal rates of convergence for noisy sparse phase retrieval via thresholded Wirtinger flow. CoRR, 2015.
[CLS14a] E. J. Candès, X. Li, and M. Soltanolkotabi. Phase retrieval from coded diffraction patterns. Applied and Computational Harmonic Analysis, 39(2), 2014.
[CLS14b] E. J. Candès, X. Li, and M. Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. CoRR, 2014.
[CP11] E. J. Candès and Y. Plan. Tight oracle bounds for low-rank matrix recovery from a minimal number of random measurements. IEEE Transactions on Information Theory, 57(4), 2011.
[CR09] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6), 2009.
[CT05] E. J. Candès and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12), 2005.
[Fun06] S. Funk. Netflix update: Try this at home, December 2006.
[GK09] R. Garg and R. Khandekar. Gradient descent with sparsification: an iterative algorithm for sparse recovery with restricted isometry property. In ICML, 2009.
[Gro11] D. Gross. Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57(3), 2011.
[JBAS10] M. Journée, F. Bach, P.-A. Absil, and R. Sepulchre. Low-rank optimization on the cone of positive semidefinite matrices. SIAM Journal on Optimization, 20(5), 2010.
[JNS12] P. Jain, P. Netrapalli, and S. Sanghavi. Low-rank matrix completion using alternating minimization. CoRR, 2012.

[Kes12] R. H. Keshavan. Efficient algorithms for collaborative filtering. PhD thesis, Stanford University, 2012.
[KMO10] R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6), 2010.
[LRS+10] J. Lee, B. Recht, N. Srebro, J. A. Tropp, and R. Salakhutdinov. Practical large-scale optimization for max-norm regularization. In NIPS, 2010.
[MJD09] R. Meka, P. Jain, and I. S. Dhillon. Guaranteed rank minimization via singular value projection. CoRR, 2009.
[NT09] D. Needell and J. A. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3), 2009.
[NV09] D. Needell and R. Vershynin. Uniform uncertainty principle and signal recovery via regularized orthogonal matching pursuit. Foundations of Computational Mathematics, 9(3), 2009.
[ORS15] S. Oymak, B. Recht, and M. Soltanolkotabi. Sharp time-data tradeoffs for linear inverse problems. In preparation, 2015.
[Rec11] B. Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12, 2011.
[RFP10] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3), 2010.
[RR13] B. Recht and C. Ré. Parallel stochastic gradient algorithms for large-scale matrix completion. Mathematical Programming Computation, 2013.
[RS05] J. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In ICML, 2005.
[Ruh74] A. Ruhe. Numerical computation of principal components when several observations are missing. Technical report, University of Umeå, Institute of Mathematics and Statistics, 1974.
[Sol14] M. Soltanolkotabi. Algorithms and Theory for Clustering and Nonconvex Quadratic Programming. PhD thesis, Stanford University, 2014.
[SOR14] C. De Sa, K. Olukotun, and C. Ré. Global convergence of stochastic gradient descent for some nonconvex matrix problems. CoRR, 2014.
[TG07] J. A. Tropp and A. C. Gilbert. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53(12), 2007.
[ZL15] Q. Zheng and J. Lafferty. A convergent gradient descent algorithm for rank minimization and semidefinite programming from random linear measurements. CoRR, 2015.

A Proof of Lemma 5.4

Define H = U − XR. Similar to the discussion at the beginning of Section 5.2.1, without loss of generality we can assume that (a) R = I, (b) U^T X ⪰ 0, and (c) H^T X = X^T H. With these simplifications, establishing the lemma is equivalent to showing that

  Tr( (H^T H)² + 4 H^T H H^T X + 2 (H^T X)² + 2 X^T X H^T H − η H^T H ) ≥ 0   (A.1)

holds with η = 2(√2 − 1) σ_r²(X).

We note that

  Tr( (H^T H + √2 H^T X)² + (4 − 2√2) H^T H H^T X + 2 X^T X H^T H − η H^T H )
    = Tr( (H^T H)² + 4 H^T H H^T X + 2 (H^T X)² + 2 X^T X H^T H − η H^T H ).

Hence, a sufficient condition for (A.1) to hold is

  (4 − 2√2) H^T X + 2 X^T X − η I_r ⪰ 0.   (A.2)

Recalling that H^T X = U^T X − X^T X, and that U^T X ⪰ 0, we have

  (4 − 2√2) H^T X + 2 X^T X − η I_r = (4 − 2√2) U^T X + (2 − (4 − 2√2)) X^T X − η I_r = (4 − 2√2) U^T X + 2(√2 − 1) X^T X − η I_r.

Since U^T X ⪰ 0, to show (A.2) it suffices to show

  2(√2 − 1) X^T X − η I_r ⪰ 0,   i.e.,   X^T X ⪰ (η/(2(√2 − 1))) I_r = σ_r²(X) I_r.

The right-hand side trivially holds, concluding the proof.

B Proof of Lemma 3.4

From RIP and the assumption that δ_{2r} ≤ 1/10, we have

  ‖Z_τ − XX^T‖_F ≤ (1/√(1 − δ_{2r})) ‖A(Z_τ − XX^T)‖_{ℓ2} ≤ √(10/9) e_τ.

By Weyl's inequalities, this means that

  σ_r²(X) ≥ σ_r(Z_τ) − √(10/9) e_τ.   (B.1)

Lemma 5.4 ensures that

  dist(U_0, X) ≤ ‖Z_τ − XX^T‖_F / (√(2(√2 − 1)) σ_r(X)).

We can upper bound the right-hand side by the following chain of inequalities:

  ‖Z_τ − XX^T‖_F / (√(2(√2 − 1)) σ_r(X))
    (a)≤ √(10/9) e_τ / (√(2(√2 − 1)) σ_r(X))
    (b)≤ σ_r(Z_τ) / (6 √(2(√2 − 1)) σ_r(X))
    (c)≤ (6/5) σ_r²(X) / (6 √(2(√2 − 1)) σ_r(X)) = σ_r(X) / (5 √(2(√2 − 1))) ≤ (1/4) σ_r(X),

where (a) follows from RIP, (b) follows since e_τ ≤ (3/20) σ_r(Z_τ) implies that √(10/9) e_τ ≤ (1/6) σ_r(Z_τ), and (c) follows because (B.1) combined with (b) gives σ_r(Z_τ) ≤ σ_r²(X) + (1/6) σ_r(Z_τ), i.e. σ_r(Z_τ) ≤ (6/5) σ_r²(X).


More information

Symmetry, Saddle Points, and Global Optimization Landscape of Nonconvex Matrix Factorization

Symmetry, Saddle Points, and Global Optimization Landscape of Nonconvex Matrix Factorization Symmetry, Saddle Points, and Global Optimization Landscape of Nonconvex Matrix Factorization Xingguo Li Jarvis Haupt Dept of Electrical and Computer Eng University of Minnesota Email: lixx1661, jdhaupt@umdedu

More information

Robust Principal Component Pursuit via Alternating Minimization Scheme on Matrix Manifolds

Robust Principal Component Pursuit via Alternating Minimization Scheme on Matrix Manifolds Robust Principal Component Pursuit via Alternating Minimization Scheme on Matrix Manifolds Tao Wu Institute for Mathematics and Scientific Computing Karl-Franzens-University of Graz joint work with Prof.

More information

Error Correction via Linear Programming

Error Correction via Linear Programming Error Correction via Linear Programming Emmanuel Candes and Terence Tao Applied and Computational Mathematics, Caltech, Pasadena, CA 91125 Department of Mathematics, University of California, Los Angeles,

More information

arxiv: v2 [stat.ml] 27 Sep 2016

arxiv: v2 [stat.ml] 27 Sep 2016 Non-square matrix sensing without spurious local minima via the Burer-Monteiro approach arxiv:1609.030v [stat.ml] 7 Sep 016 Dohyung Park, Anastasios Kyrillidis, Constantine Caramanis, and Sujay Sanghavi

More information

Conditions for Robust Principal Component Analysis

Conditions for Robust Principal Component Analysis Rose-Hulman Undergraduate Mathematics Journal Volume 12 Issue 2 Article 9 Conditions for Robust Principal Component Analysis Michael Hornstein Stanford University, mdhornstein@gmail.com Follow this and

More information

Strengthened Sobolev inequalities for a random subspace of functions

Strengthened Sobolev inequalities for a random subspace of functions Strengthened Sobolev inequalities for a random subspace of functions Rachel Ward University of Texas at Austin April 2013 2 Discrete Sobolev inequalities Proposition (Sobolev inequality for discrete images)

More information

Incremental Reshaped Wirtinger Flow and Its Connection to Kaczmarz Method

Incremental Reshaped Wirtinger Flow and Its Connection to Kaczmarz Method Incremental Reshaped Wirtinger Flow and Its Connection to Kaczmarz Method Huishuai Zhang Department of EECS Syracuse University Syracuse, NY 3244 hzhan23@syr.edu Yingbin Liang Department of EECS Syracuse

More information

On the l 1 -Norm Invariant Convex k-sparse Decomposition of Signals

On the l 1 -Norm Invariant Convex k-sparse Decomposition of Signals On the l 1 -Norm Invariant Convex -Sparse Decomposition of Signals arxiv:1305.6021v2 [cs.it] 11 Nov 2013 Guangwu Xu and Zhiqiang Xu Abstract Inspired by an interesting idea of Cai and Zhang, we formulate

More information

CSC 576: Variants of Sparse Learning

CSC 576: Variants of Sparse Learning CSC 576: Variants of Sparse Learning Ji Liu Department of Computer Science, University of Rochester October 27, 205 Introduction Our previous note basically suggests using l norm to enforce sparsity in

More information

Universal low-rank matrix recovery from Pauli measurements

Universal low-rank matrix recovery from Pauli measurements Universal low-rank matrix recovery from Pauli measurements Yi-Kai Liu Applied and Computational Mathematics Division National Institute of Standards and Technology Gaithersburg, MD, USA yi-kai.liu@nist.gov

More information

arxiv: v1 [math.na] 26 Nov 2009

arxiv: v1 [math.na] 26 Nov 2009 Non-convexly constrained linear inverse problems arxiv:0911.5098v1 [math.na] 26 Nov 2009 Thomas Blumensath Applied Mathematics, School of Mathematics, University of Southampton, University Road, Southampton,

More information

Lecture Notes 10: Matrix Factorization

Lecture Notes 10: Matrix Factorization Optimization-based data analysis Fall 207 Lecture Notes 0: Matrix Factorization Low-rank models. Rank- model Consider the problem of modeling a quantity y[i, j] that depends on two indices i and j. To

More information

Recovery of Simultaneously Structured Models using Convex Optimization

Recovery of Simultaneously Structured Models using Convex Optimization Recovery of Simultaneously Structured Models using Convex Optimization Maryam Fazel University of Washington Joint work with: Amin Jalali (UW), Samet Oymak and Babak Hassibi (Caltech) Yonina Eldar (Technion)

More information

ORTHOGONAL matching pursuit (OMP) is the canonical

ORTHOGONAL matching pursuit (OMP) is the canonical IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 9, SEPTEMBER 2010 4395 Analysis of Orthogonal Matching Pursuit Using the Restricted Isometry Property Mark A. Davenport, Member, IEEE, and Michael

More information

Random projections. 1 Introduction. 2 Dimensionality reduction. Lecture notes 5 February 29, 2016

Random projections. 1 Introduction. 2 Dimensionality reduction. Lecture notes 5 February 29, 2016 Lecture notes 5 February 9, 016 1 Introduction Random projections Random projections are a useful tool in the analysis and processing of high-dimensional data. We will analyze two applications that use

More information