arxiv: v2 [stat.ml] 22 Nov 2016

Size: px

Start display at page:

Download "arxiv: v2 [stat.ml] 22 Nov 2016"

Owen Barton
6 years ago
Views:

1 Convergence Analysis for Rectangular Matrix Completion Using Burer-Monteiro actorization and Gradient Descent arxiv: v [stat.ml] Nov 016 Qinqing Zheng John Lafferty University of Chicago November 3, 016 Abstract We address the rectangular matrix completion problem by lifting the unknown matrix to a positive semidefinite matrix in higher dimension, and optimizing a nonconvex objective over the semidefinite factor using a simple gradient descent scheme. With Oµr κ n maxµ, log n random observations of a n 1 n µ-incoherent matrix of rank r and condition number κ, where n = maxn 1, n, the algorithm linearly converges to the global optimum with high probability. 1 Introduction A growing body of recent research is shedding new light on the role of nonconvex optimization for tackling large scale problems in machine learning, signal processing, and convex programming. This work is developing techniques that help to explain the surprising effectiveness of relatively simple first-order algorithms for certain nonconvex optimizations. When applied to problems that can be formulated as semidefinite programs, these techniques can often be viewed as part of a framework proposed by Burer and Monteiro [4]. The Burer- Monteiro technique is based on factoring the semidefinite variable, and applying classical optimization techniques to the resulting nonconvex objective over the factor. While worst-case complexity considerations imply that such an approach cannot succeed in general, a series of recent papers [11, 40, 35, 13, 1] has shown the strategy to be remarkably effective for a number of problems of practical interest, with analytical convergence guarantees and strong empirical performance. In this paper, we enlarge the collection of problems to which the Burer-Monteiro technique can be successfully applied, by analyzing the convergence properties of gradient descent applied to the problem of rectangular matrix completion from incomplete measurements. The standard matrix completion problem asks for the recovery of a low rank matrix X R n 1 n given only a small fraction of observed entries. Let Ω be the set of m indices of the observed entries. ixing a target rank r minn 1, n, the natural, but nonconvex objective is min X R n 1 n rankx subject to X ij = X ij, i, j Ω. 1 1

2 In order for this problem to be well-posed, it is important to understand when X is identifiable and, in particular, the unique minimizer of 1. Moreover, because the problem is in general NPhard, it is essential to identify tractable families of instances, together with efficient algorithms having global convergence guarantees. In the current work, we apply the factorization technique by lifting the matrix X to a positive semidefinite matrix Y R n 1+n n 1 +n in higher dimension. Lifting is an established method that recasts vector or matrix estimation problems in terms of positive semidefinite matrices with special structure. It has been applied to sparse eigenvector approximation [14] and phase retrieval [10], where the lifted matrix is of rank one. As explained in detail below, we can construct Y to be of the same rank as X, thus obtaining a factorization Y = Z Z for some Z R n 1+n r, and transforming the original matrix completion problem into the problem of recovering the semidefinite factor Z. We formulate this as minimizing a nonconvex objective fz, to which we apply a gradient descent scheme, using a particular spectral initialization. Our analysis of this algorithm establishes a lower bound on the number of matrix measurements that are sufficient to guarantee identifiability of the true matrix and geometric convergence of the gradient descent algorithm, with explicit bounds on the rate. In the following section we give a full description of our approach. Our theoretical results are presented in Section 3, with detailed proofs contained in the appendix. Our analysis subsumes the case where X is positive semidefinite. In Section 4 we briefly review related work. The experimental results are presented in Section 5, and we conclude with a brief discussion of future work in Section 6. Semidefinite Lifting, actorization, and Gradient Descent or any n 1 + n r matrix Z, we will use Z i to denote its ith row, and Z U and Z V to denote the top n 1 and bottom n rows. The operator, robenius andl norm of matrices are denoted by, and, respectively. We define Z, = max Zi i as the largest l norm of its rows, and similarly Z, = max { Z,, Z },. Let P Ω : R n 1 n R n 1 n be the operator where P Ω X ij = { X ij if i, j Ω, 0 otherwise. In this paper, we focus on completing an incoherent or non-spiky matrix X. With U Σ V denoting the rank-r SVD of X, we assume X is µ-incoherent, as defined below. Definition 1. The matrix X is µ-incoherent with respect to the canonical basis if its singular vectors satisfy µr µr U,, V n,, 3 1 n where µ is a constant. 1 1 Note that µ 1, since r = U = i [n 1] U i µr.

3 Our main interest is the uniform model where m entries of X are observed uniformly at random, though we shall analyze a Bernoulli sampling model, where each entry of X is observed with probability p = m/n 1 n. One can transfer the results back to the uniform model, as the probability of failure under the uniform model is at most twice that under the Bernoulli model; see [8, 9]. Using the rank-r SVD of X, we can lift X to [ ] [ ] U Y = Σ U X U X V Σ V = Z Z, where Z = V Σ 1. 4 The symmetric decomposition of Y is not unique; our goal is to find a matrix in the set { S = Z R n 1 +n r Z } = Z R for some R with RR = R R = I, 5 since for any Z S we have X = Z U Z V. Let Ω denote the corresponding observed entries of Y, and consider minimization of the squared error 1 min Z p i,j Ω ZZij Yij 1 = min P Ω ZZ Y Z p. 6 Note that Y is not the unique minimizer of 6, nor is it the only possible positive semidefinite lifting of X. or example, let P be an r r nonsingular matrix, and form the matrices [ ] [ ] Z U = Σ 1 P Y U = Σ 1 P Σ 1 U X. 7 V Σ 1 P 1 X V Σ 1 P Σ 1 V Since Ω does not contain any entry in the top-left or bottom-right block, Y is also a minimizer of 6. Thus, the solution set of the lifted problem is much larger than the set S of actual interest. or the sake of simple analysis, we shall focus on exact recovery of Y only, and thus impose an additional regularizer to align the column spaces of Z U and Z V, as in [35]. The regularized loss is fz = 1 PΩ ZZ Y p + λ Z DZ [ ] 4, where D = In I n While this apparently introduces an extra tuning parameter, our analysis establishes linear convergence of the projected gradient descent algorithm when λ = 1, and thus one may treat λ as a fixed number. It is discussed in [13] that one needs to ensure the iterates stay incoherent. Let C be the set of incoherent matrices { µr C = Z : Z, } Z 0 9 n 1 n where we assume µ is known and Z 0 will be determined. 3

4 Our algorithm is simply gradient descent on fz, with projection onto C. Let M = p 1 P Ω UV X. Then the gradient of f is given by [ ] 0 M fz = M Z + λdzz DZ The projection P C to the feasible set C has closed form solution, given by row-wise clipping: Z i if Z i µr n 1 n Z 0, P C Z i = Z i µr n 1 n Z 0 otherwise. Z i 11 Note that X 0 p 1 P Ω X is an unbiased estimator of X under the Bernoulli model. To initialize, we thus construct Z 0 from the top rank-r factors of X 0. This leads to the following algorithm. Algorithm 1: Projected gradient descent for matrix completion input: Ω, { X ij : i, j Ω }, m, n 1, n, r, λ, η initialization p = m/n 1 n U 0 Σ 0 V 0 = rank-r SVD of p 1 P Ω X Z 0 = [U 0 Σ 0 1 ; V 0 Σ 0 1 ] Z 1 = P C Z 0 k 1 repeat M k = p 1 P Ω ZU k Zk V X [ ] 0 M k fz k = Z k + λdz k Z k DZ k. M k 0 Z k+1 = P C Z k k k + 1 until convergence; η Z 0 fzk output: Ẑ = Zk, X = Z k U Zk V. Remarks. i The step size η is normalized by Z 0. Our analysis will establish linear convergence when taking step sizes of the form η/σ1, where η is a sufficiently small constant. We replace σ1 by Z 0 in the actual algorithm since it is unknown in practice. ii The feasible set 9 depends on Z 0 as well. Under the above spectral initialization, our analysis shows that when p Oµκ r µr log n/n 1 n, the term n 1 n Z 0 is an upper bound of Z, with high probability see Corollary 1 below. This means S is a subset of C. Note that this does not change the global optimality of Z and its equivalent elements, since fz = 0. In practice, we find 4

5 that the iterates of our algorithm remain incoherent, so that one may drop the projection step. iii The column space regularizer 8 is needed in our analysis. We also found that when λ = 0, our algorithm typically converges to another PSD lifted matrix of X, with minor difference from Y in the top-left and bottom-right blocks. In the following section we state and sketch a proof of our main convergence result for this algorithm. 3 Main Result: Convergence Analysis Theorem 1. Suppose that X is of rank r, with condition number κ = σ 1/σ r, and µ-incoherent as defined in Definition 1. Suppose further that we observe m entries of X chosen uniformly at random. Let Y = Z Z be the lifted matrix as in 4 and write n = maxn 1, n. Then there exist universal constants c 0, c 1, c, c 3 such that if m c 0 µr κ maxµ, log nn, 1 then with probability at least 1 c 1 n c the iterates of Algorithm 1 converge to Z geometrically, when using regularization parameter λ = 1/, correctly specified input rank r, and constant step size η/σ 1 with η c 3 /µ r κ. We shall analyze the Bernoulli sampling model, as justified in Section. Let us define the distance to Z in terms of the solution set S. Definition. Define the distance between Z and Z as dz, Z = min Z Z = min Z Z R. Z S RR =R R=I The next theorem establishes the global convergence of Algorithm 1, assuming that the input rank is correctly specified. The proof sketch is given in the next subsection. Theorem. There exist universal constants c 0, c 1, c such that if p c 0µr κ log n, with probability at least 1 c 1 n c n 1 n, the initialization Z 1 C satisfies dz 1, Z 1 4 σ r. 13 Moreover, there exist universal constants c 3, c 4, c 5, c 6 such that if p c 3 maxµr κ, µr log n, n 1 n when using constant step size η/σ1 with η c 4 µ r κ and initial value Z1 C obeying 13, the kth step of Algorithm 1 with λ = 1/ satisfies dz k, Z η k/ σ r κ with probability at least 1 c 5 n c 6. 5

6 Remarks. i After each update, the distance of our iterates to Z is reduced by at least a factor of 1 O1/µ r κ. ii Hence, the output Ẑ satisfies dẑ, Z ε after at most log 1 1/1 99 iterations. 3.1 Proof Sketch 56 η κ log σ r/4ε Our proof idea is of the same nature as the analysis in [11, 40]. We show two appealing properties when sufficient entries are observed. irst, our spectral initialization produces a starting point within the Oσ r neighborhood of the solution set. then with prob- Lemma 1. There exist universal constants c, c 1, c, such that if p cµr κ log n n 1 n ability at least 1 c 1 n c, dz 1, Z dz 0, Z 1 σ 4 r. To demonstrate this, we exploit the concentration around the mean of p 1 P Ω X. See Appendix B for the proof. Using this lemma, we can immediately show that Z and all other elements of S are contained in the feasible set 9. Corollary 1. With probability at least 1 c 1 n c, Z µr, n 1 n Z 0. The second crucial property is that fz is well-behaved within the O σ r neighborhood, so that the iterates move closer to the optima in every iteration. The key step is to set up a local regularity condition [11] similar to Nesterov s conditions [8]. Definition 3. Let Z = arg min Z S Z Z denote the matrix closest to Z in the solution set. We say that f satisfies the regularity condition RCε, α, β if there exist constants α, β such that for any Z C satisfying dz, Z ε, we have fz, Z Z 1 α σ r Z Z + 1 fz βσ1. Using this condition, one can show the iterates converge linearly to the optima if we start close enough to Z. Lemma. Consider the update Z k+1 = P C Z k µ fz k. If f satisfies RCε, α, β, σ1 dz k, Z ε and 0 < µ minα/, /β, then dz k+1, Z 1 µ ακ dzk, Z. 6

7 The following lemma illustrates the local regularity of fz. Nesterov s criterion is established upon strong convexity and strong smoothness of the objective. Here we show analogous curvature and smoothness conditions holds for fz locally within the O σr neighborhood with high probability. Interestingly, we found that to show the local curvature condition holds, it suffices to set λ = 1. The proof can be found in Appendix C, for which we have generalized some technical lemmas of [13]. Lemma 3. Let the regularization constant be set to λ = 1. There exists universal constant c, c 1, c, such that if p c maxµ r κ, µr log n, then f satisfies RC 1 4 σ n 1 n r, 51/99, 13196µ r κ, with probability at least 1 c 1 n c. 4 Related Work Matrix completion is one instance of the general low rank linear inverse problem find X of minimum rank such that AX = b, 14 where A is an affine transformation and b = AX is the measurement of the ground truth X. Considerable progress has been made towards algorithms for recovering X including both convex and nonconvex approaches. One of the most popular methods is nuclear norm minimization, a convenient convex relaxation of rank minimization. It was first proposed in [15, 9], and analyzed under a certain restricted isometry property RIP. Subsequent work clarified the conditions for reconstruction, and studied recovery guarantees for both exact and approximately low rank matrices, with or without noise [8, 9, 7, 1]. One significant advantage for this approach is its near-optimal sample complexity. Under the same incoherence assumption as ours, Chen [1] establishes the currently best-known lower bound of Oµrn log n samples. Using a closely related notion of incoherence, Negahban and Wainwright [7] show that if X is α-nonspiky with X X α n1 n, then Oα rn log n samples are sufficient for exact recovery. However, convexity and low sample complexity aside, in practice the power of nuclear norm relaxation is limited due to high computational cost. The popular algorithms for nuclear norm minimization are proximal methods that perform iterative singular value thresholding [5, 34]. However, such algorithms don t scale to large instances because the per-iteration SVD is expensive. Another popular convex surrogate for the rank function is the max-norm [31, 17], given by X = min X=UV U, V,. or certain types of problems, the max-norm offers better generalization error bounds than the nuclear norm [30]. But practically solving large scale problems that incorporate the max-norm is also non-trivial. In 010, Lee et al. [5] rephrased the max-norm constrained problem as an SDP, and applied Burer-Monteiro factorization. Although this ends up with an l, constraint similar to ours 9, we emphasize that the constraint plays a different role in our setting. While [31, 5] use it to promote low rank solutions, our purpose is to enforce incoherent solutions; and experimental results suggest that one can drop it. Moreover, the convergence of projected gradient descent for this problem was not previously understood. 7

8 In a parallel line of work, the problem of developing techniques that exactly solve nonconvex formulations has attracted significant recent research attention. In chronological order, Keshavan et al. [3] proposed a manifold gradient method for matrix completion. They factorize X = U Σ V, where U R n1 r, U U = n 1 I n1 and V R n r, V V = n I n. Similar to our definition of S, the equivalence classes of U and V are Grassmann manifolds of r dimensional subspaces. The authors then minimize the nonconvex objective U, V = min S R r r PΩ USV X over the manifolds. In each iteration, U and V are updated along their manifold gradients, followed by the update of the optimal scaling matrix S. This algorithm can exactly exactly reconstruct the matrix, though the convergence rate is unknown. However, its per-iteration update also has high computational complexity, see Section 5 for details. There are other manifold optimization methods for matrix completion including [, 6, 36]. In the same year, Jain et al. [1] suggested minimizing the squared residual AX b under a rank constraint rankx r. While this constraint is nonconvex, projection onto the feasible set can be computed using low rank SVD. Under certain RIP assumption on A, Jain et al. establish the global convergence of projected gradient descent for 14. This algorithm is named Singular Value Projection SVP. Yet in the setting of completion, only experimental support for the effectiveness of SVP is provided. More importantly, SVP also suffers from expensive per-iteration SVD for large scale problems. Keshavan [4], Jain et al. [] further analysed the alternating minimization procedure for 14. AltMin factorizes X = UV where U R n1 r and V R n r, and alternately solves AUV b over U and V, while fixing the other factor. The authors obtain sample complexity bounds with rκ 8, r 7 κ 6 dependency, respectively. In 014, Hardt [19] improved the bounds to r κ. Notably, all these works assume the use of resampling independent sequences of samples Ω k, k = 1,,.... In other words, in every iteration we can sample the true matrix under a certain Bernoulli model independently. However, in practice Ω is usually given and fixed. To get around the dependence on the sample sets, they partition Ω into a predefined number of subsets of equal size. However, sample sets obtained by partitioning are not independent, and partitioning, if used in practice, does not make the most efficient use of the data. Thus, Hardt and Wootters [0] considered a new resampling scheme. They assume a known generative model of {Ω k }, where each Ω k is obtained under a Bernoulli model with probability p k, p = k p k and Ω = k Ω k. While not practical, under this assumption the authors obtain a sample complexity that is logarithmic in κ. Another theoretical disadvantage of the resampling scheme is that the sample complexity depends on the desired accuracy ε, as established by [4,, 19, 0]. As the accuracy goes to zero, the sample complexity increases. In contrast, our algorithm doesn t require resampling, and the sample complexity is independent of ε. In 014, Candès et al. [11] proposed Wirtinger flow for phase retrieval. Wirtinger flow is a fast first-order algorithm that minimizes a fourth order nonconvex objective, geometrically converging to the global optimum. While previous work [10, 6, 7] lifts the phase retrieval problem into an SDP where the solution is rank one, this work bridges SDP and first-order algorithms via the Burer-Monteiro technique. It has inspired further research on related topics; last year, the authors of [40, 35, 1, 13] considered factorizations for 14, assuming X is semidefinite, and proved global optimality of first-order algorithms under appropriate initializations. Tu et al. 8

9 [35] have extended this algorithm to handle rectangular matrix via asymmetric factorization, and have shown exact recovery of X, assuming A satisfies a certain RIP. They use lifting implicitly, factorizing X = Z U ZV and applying gradient updates on both factors Z U and Z V simultaneously, with the nonconvex objective function gz U, Z V = 1 P Ω Z U ZV X p + λ Z U Z U ZV Z V Their proof strategy also shows convergence of Z in the lifted space. or the specific case of matrix completion, Chen and Wainwright [13] obtained guarantees when X is semidefinite. Our work generalizes the results obtained in [35, 13], extending the recent literature on first-order algorithms for factorized models. After completing this work we learned of independent research of Sun and Luo [33], who also analysed a gradient algorithm for rectangular matrix completion. Their formulation is similar to ours, with additional robenius norm constraints on the factors. The authors established a sample complexity of Or 7 κ 6 observations; in comparison our bound scales as Or κ. The authors also analyzed block coordinate descent type alternating minimization, which cyclically updates the rows of U and then the rows of V, showing exact recovery of this algorithm without resampling. Recent independent work of Yi et al. [38] analyzes a gradient scheme for Robust PCA. Under the setting of partial observation without corruption, this is the standard matrix completion problem. In other related work, [39, 37] also study nonconvex optimization methods for matrix completion, using algorithms that still require low rank SVD in each iteration. 5 Experiments We conduct experiments on synthetic datasets to support our analytical results. As the column space regularizer and incoherence constraint of our gradient method GD are merely for analytical purpose, we drop them in all the experiments; simply optimize the l loss 1 PΩ ZZ Y. We compare GD with SVP, OptSpace, nuclear norm minimization nuclear and trust region methods on Riemannian manifolds trustregion. or nuclear, we rescale the standard objective to be 1 min X λ P ΩX X + X, 16 where λ = 0 will enforce the minimizer fitting the observed values exactly. We use ADMM to solve 16. It is based on the algorithm for the matrix approach in [34], and can neatly handle the case λ = 0. We emphasize there is no computational difference between cases whether λ is zero or not. All methods are implemented in MATLAB. We use the toolbox Manopt for trustregion [3] and the implementation of OptSpace from the authors. or AltMin, we use the same sample sets in every iteration. The experiments were run on a Linux machine with a 3.4GHz Intel Core i7 processor and 8 GB memory. Computational Complexity Table 1 summarizes the per-iteration complexity of all the methods for completing a n n matrix. Since M k is a sparse matrix with m nonzero entries, and we have 9

10 dropped the regularizer and constraint, our method GD only needs mr + m + n r operations to compute the gradient, and 4nr operations to update the iterate. The computation of nuclear is dominated by singular value thresholding and updating the objective value, which require the On 3 cost full SVD. Similarly, SVP needs On r operations to compute the rank-r SVD for low rank projection. or OptSpace, Omr + n r + nr operations are needed to compute the manifold gradient and line search. The most expensive part is to determine the optimal scaling matrix S R r r, which boils down to solving a r by r dense linear system. In total Omr 3 + n r + nr 4 + r 6 operations are used to construct and solve this system. or AltMin, in every iteration we have to solve n 1 +n linear systems of size r r. See [3] for the exact formulation. The time cost for each iteration is Omr. One can see that GD reduces the computation than the others. Though the dominating terms for SVP and GD are in the same order, in practice the partial SVD are more expensive than the gradient update, especially on large instances. Method Complexity GD mr + m + n r + 4nr SVP On r OptSpace Omr 3 + n r + nr 4 + r 6 nuclear On 3 AltMin Omr Table 1: Per-iteration complexities. Runtime Comparison We randomly generated a true matrix X of size and rank 3. It is constructed from the rank-3 SVD of a random matrix with i.i.d normal entries. We sampled m = entries of X uniformly at random, where m is roughly equal to nr log n with n = 4000 and r = 3. or simplicity, we feed SVP, OptSpace and GD with the true rank. or all these methods, we use the randomized algorithm of Halko et al. [18] to compute the low rank SVD, which is approximately 15 times faster than MATLAB built-in SVD on instances of such size. We report relative error measured in the robenius norm, defined as X X / X. or nuclear, we set λ = 0 to enforce exact fitting. The convergence speed of ADMM mildly depends on the choice of penalty parameter. We tested 5 values 0.1, 0., 0.5, 1, 1.5 and selected 0., which leads to fastest convergence. Similarly, for SVP, we would like to choose the largest step size for which the algorithm is converging. We evaluated 15, 0, 30, 35, 40 and selected 30. The step size is chosen for GD in the same way. ive values 0, 50, 70, 75, 80 are tested for η and we picked 70. or OptSpace, we compared fixed step sizes with line search, and found the algorithm converged fastest under line search. igure 1a shows the results. GD is slightly slower than trustregion and faster than competing approaches. To further illustrate how runtime scales as the dimension increases, we run larger instances of size and , where the true rank is 40. The parameters are selected in the same manner, and we terminate the computation once the relative error is below 1e 9. We report the results of AltMin GD, SVP and trustregion in igure a; nuclear, OptSpace do not 10

11 10 0 GD SVP OptSpace nuclear altmin trust region relative error runtime seconds a b igure 1: a Runtime comparison where X is and of rank entries are observed. b Magnified plots to compare other methods except nuclear. scale well to such sizes so that we didn t include them. The runtime of AltMin scales the slowest, while the runtimes of GD and trustregion increase slower than SVP. Sample Complexity We evaluate the number of observations required by GD for exact recovery. or simplicity, we consider square but asymmetric X. We conducted experiments in 4 cases, where the randomly generated X is of size or , and of rank 10 or 0. In each case, we compute the solutions of GD given m random observations, and a solution with relative error below 1e 6 is considered to be successful. We run 0 trials and compute the empirical probability of successful recovery. The results are shown in igure b. or all four cases, the phase transitions occur around m 3.5nr. This suggests that the actual sample complexity of GD may scale linearly with both the dimension n and the rank r. 6 Conclusion We propose a lifting procedure together with Burer-Monteiro factorization and a first-order algorithm to carry out rectangular matrix completion. While optimizing a nonconvex objective, we establish linear convergence of our method to the global optimum with Oµr κ n maxµ, log n random observations. We conjecture that Onr observations are sufficient for exact recovery, and that the column space regularizer can be dropped. We provide empirical evidence showing this simple algorithm is fast and scalable, suggesting that lifting techniques may be promising for much more general classes of problems. 11

12 runtime seconds AltMin Trust Region GD SVP 4k x k rank 3 10k x 5k rank 40 a 0k x 5k rank 40 empirical probability of successful recovery n = 500, r = 10 n = 500, r = 0 n = 1000, r = 10 n = 1000, r = m/nr b igure : a Runtime growth of AltMin, trustregresion, GD and SVP. b Sample complexity of gradient scheme. Acknowledgements The authors thank Rina oygel Barber, Yudong Chen and Ruoyu Sun for helpful comments. Research supported in part by ONR grant N and NS grant DMS A Technical Lemmas Another way of writing the objective function is fz = 1 p m l=1 Al, ZZ λ b l + Z DZ 4, where l is an index of Ω, A l is a matrix with 1 at the corresponding observed entry and 0 elsewhere. Let H = Z Z, the gradient can be written as fz = 1 p = 1 p m l=1 m l=1 {}}{ Al, ZZ b l Al + A l Z + λ DZ Z DZ A l, HZ + ZH + HH A l + A l Z + H + λγ. Γ 1

13 We will use the following facts throughout the proof: Z, = Z, µr n 1 n σ 1, 17 µr H, 3 σ n 1 n 1, 18 A l + A l B, C = A l, BC + CB, 19 Z Z is positive semidefinite, H Z is symmetric. 0 Inequality 17 is a direct result of Definition 1. To see 18, note that H, Z, + Z, µr n 1 n σ 1 + µr n 1 n σ1, and σ 1 σ σ 1 by the discussion of initialization in Appendix B. or 0, it holds that arg min Z Z R = AB, RR =R R=I where AΛB is the SVD of Z Z. Clearly, Z Z is positive semidefinite, and H Z = Z Z Z Z = BΛB Z Z is symmetric. Next, we list several technical lemmas that are utilized later. We will use c to denote a numerical constant, whose value may vary from line to line. [ ] [ 1 ] ZU UΣ R Lemma 4. or any Z of the form Z = =, where U, V, R are unitary matrices and Z V V Σ 1 R Σ 0 is a diagonal matrix, we have ZZ Z Z UΣV U Σ V. Proof. Recall that [ ] [ ] Z Z = U U = Σ 1 V Σ 1 Z V where X = U Σ V. We have ZZ Z Z = UΣU U Σ U + V ΣV V Σ V + UΣV U Σ V, 1 and UΣU U Σ U + V ΣV V Σ V. = Σ + Σ Σ, U U Σ U U + V V Σ V V 13

14 We can obtain the lower bound Σ, U U Σ U U + V V Σ V V r = σ i U U Σ U U + V V Σ V V = i=1 r i=1 r σ i σ i r k=1 i=1 k=1 σ k U U ik + V V ik r σk U U ik V V ik = Σ, U U Σ V V. ii 3 Combining and 3, we obtain UΣU U Σ U + = V ΣV V Σ V Σ + Σ Σ, U U Σ V V UΣV U + Σ V = UΣV U Σ V. Plugging 4 back into 1, we obtain the lemma. UΣV, U Σ V Recall that n = maxn 1, n. We will exploit the following two known concentration results. Lemma 5 Chen [1], Lemma. or any fixed matrix X R n 1 n, there exist universal constants c, c 1, c such that with probability at least 1 c 1 n c, p 1 P Ω X X log n c X log n p + X p,. Lemma 6 Candès and Recht [8], Theorem 4.1. Define subspace T = { M R n 1 n : M = U X + Y V for some X and Y 4 }. 5 Let P T be the Euclidean projection onto T. There is a numerical constant c such that for any δ 0, 1], if p c µr log n, then with probability 1 3n 3, we have δ n 1 n p 1 P T P Ω P T pp T δ. Lemma 7 upper bounds the spectral norm of the adjacency matrix of a random Erdős-Rényi graph. It is a variant of Lemma 7.1 of Keshavan et al. [3], which uses known results of eige and Ofek [16]. 14

15 Lemma 7 Chen and Wainwright [13], Lemma 9. Suppose that Ω [d] [d] is the set of edges of a random Erdős-Rényi graph with n nodes, where any pair of nodes is connected with probability p. There exists two numerical constants c 1, c such that, for any δ 0, 1], if p c 1 log d δ d, then with probability at least 1 1 d 4, uniformly for all x, y R n it holds that p 1 d x i y j 1 + δ x 1 y 1 + c p x y. 6 i,j Ω We refer readers to [3] for a complete proof, in particular noticing that one can choose p large enough so that the constant factor in the first term in 6 is only 1 + δ. Lemma 8, 9 and 10 are direct generalizations of Lemma 4 and 5 of [13]. Lemma 8. There exists a constant c such that, for any δ 0, 1], if p c δ max then with probability at least 1 1 n 1+n 4, uniformly for all H such that H, 3 we have p 1 PΩ HH 1 + δ H 4 + δσ r H. logn1 +n n 1 +n, µ r κ n 1 n, µr n 1 n σ 1, Proof. It holds that p 1 PΩ HH = p 1 H i, H j i,j Ω p 1 i,j Ω H i H j. 7 Since Ω is a reduced sampling of Y R n 1+n n 1 +n under a Bernoulli model, Lemma 7 is applicable here. Assume p c 1 logn 1 +n δ n 1 +n, we then have with probability at least 1 1n 1 + n 4, for all H such that H, 3 µr p 1 PΩ HH p 1 n 1 n σ 1, i,j Ω a 1 + δ Hi Hj i [n 1 +n ] Hi n1 + n + c p n1 + n i [n 1 +n ] 1 + δ H 4 + c H p H, b H 1 + δ H + 81c µ r σ1 n 1 + n, pn 1 n where a follows from Lemma 7 and b follows from H, 3 µr n 1 n σ1. 15 Hi 4 8

16 Let us further assume p 16c µ r κ γ δ n 1 n, where γ = n/n 1 n is a fixed constant, then we can bound p 1 PΩ HH H 1 + δ H + r δσ. 9 The final threshold we obtain is thus p c max logn1 +n δ n 1 +n, µ r κ n 1 n for some constant c. Lemma 9. There exists a constant c, if p c log n, then with probability at least 1 n 4 n 1 n uniformly for all matrices A, B such that AB is of size n 1 + n n 1 + n, p 1 P Ω AB { } n min A B,, B A, 1 n 4, Proof. Let Ω Yi = {j : i, j Ω} denote the set of entries sampled in the ith row of AB. Note that because of the structure of Ω, at most n entries are sampled at the frist n 1 rows, and at most n 1 entries are sampled at the rest n rows. Using a binomial tail bound, if p c log n for sufficiently large c, the event max i [n1 ] Ω Yi n pn holds with probability at least 1 n 4. Similarly for the rest n rows. Hence, if p c log n n 1 n for some constant c, with probability at least 1 n 4 1 n 4, we have max i [n1 +n ] Ω Yi pn. Conditioning on this event, we then have for all A, B of proper size, p 1 PΩ AB n = 1 +n p 1 i=1 n 1 +n p 1 i=1 n 1 +n p 1 i=1 j Ω Yi A i, B j Ai n A B,. j Ω Yi Similarly we can prove with probability at least 1 n 4 1 n 4, Bj A i max Ω Y i B, i [n 1 +n ] p 1 PΩ AB n B A,. The following lemma establishes restricted strong convexity and smoothness of the observation operator for matrices in T. Lemma 10. Let T be the subspace defined in 5. There exists a universal constant c such that, if p c µr log n, with probability at least 1 3n 3, uniformly for all A T, we have δ n 1 n p1 δ A P ΩA p1 + δ A

17 Consequently, uniformly for all A, B T, p 1 P Ω A, P Ω B A, B δ A B. 31 Proof. By Lemma 6, with probability at least 1 3n 3, for any X R n 1 n it holds that p1 δ X P T P Ω P T X p1 + δ X. 3 Let A be a matrix in T. Rewriting P Ω A = P ΩP T A, P Ω P T A = A, P T P Ω P T A, and using the Cauchy-Schwarz inequality and 3 we can bound In addition, we have P Ω A p1 + δ A. 33 P Ω A = A, P T P Ω P T A = A, P T P Ω P T A pp T A + pp T A A P T P Ω P T pp T A + p A a p1 δ A, 34 where a follows from Lemma 6. Combining 33 and 34 proves 30. To show 31, let A = A and B = B. Both A + B and A B are in T. We have A B P Ω A, P Ω B = 1 { 1 {}}{{}}{ } P Ω A + B 4 P Ω A B b 1 { } 1 + δp A + B 4 1 δp A B = 1 { } δp A 4 + B + 4p A, B 35 = pδ + p A, B, where b follows from 30. Thus, we have p 1 P Ω A, P Ω B = p 1 A B P Ω A, P Ω B δ A B + A, B. 36 Similarly, we can show p 1 P Ω A, P Ω B δ A B + A, B. 37 Last, we want to show the projection onto feasible set C is a contraction. 17

18 Lemma 11. Let y R r be a vector such that y θ, for any x R r. Then P θx y x y. Proof. If x θ, then P θx = x. Otherwise P θx = θ x, where x = y = y x x + P x y, we have x. Write x It suffices to show θ x y = θ x y x x + P x y = θ y x + P x y. 38 θ y x x y x. 39 If y x 0, then 39 holds because x > θ. If y x > 0, 39 still holds since x > θ y y x. B Initialization B.1 Proof of Lemma 1 Let δ denote the upper bound of p 1 P Ω X X as in Lemma 5, and let σ 1... σ n denote the singular values of p 1 P Ω X. By Weyl s theorem, we have σ i σ i δ, i [n]. 40 Note this implies σ r+1 δ, as σr+1 = 0. By definition, Z 0 = [U; V ]Σ 1, where UΣV is the rank-r SVD of p 1 P Ω X. According to Lemma 4, one has Z 0 Z 0 Z Z UΣV X a r UΣV X r UΣV p 1 P Ω X + p 1 P Ω X X b r δ + δ = 4 rδ, 41 where a holds because rankuσv X r, b holds since UΣV p 1 P Ω X = σ r+1 δ. Let H = Z 0 Z 0. We want to bound dz 0, Z = H. According to 0, H Z 0 is 18

19 symmetric and Z 0 Z 0 is positive semidefinite. Hence we can write Z 0 Z 0 Z Z = HZ 0 + Z 0 H + HH = tr H H + H Z 0 + H HZ 0 Z 0 + 4H HH Z 0 = tr H H + H Z H HH Z 0 + H HZ 0 Z 0 tr 4 H HH Z 0 + H HZ 0 Z 0 =4 tr H HZ 0 Z 0 + HZ 0T, 4 where in the second line we used that H Z 0 is symmetric. Besides, as Z 0 Z 0 is positive semidefinite, 4 trh HZ 0 Z 0 is nonnegative. Therefore, Z 0 Z 0 Z Z HZ 0 4 1σr H. 43 Combining 41 and 43, it follows that Z 0 Z 0 Z Z dz 0, Z 4 1σr 8r δ. 1σr 44 Therefore, it suffices to show dz 0, Z 8r δ 1σr a = c r log n X σr p + b c r σ 1 µr log n σr pn 1 n σ r, log n p µr log n pn 1 n X, where in a we replaced δ using Lemma 5, and b holds since by our incoherence assumption 3 we have X = U Σ V σ1 max U i V j σ 1 U i,j, V, σ1 µr, 46 n 1 n X, = U Σ V, σ 1 U V, c σ µr n 1 n. 47

20 Note that for c we used AB, A, B. Hence, to obtain dz 0, Z 1 16 σ r, it suffices to have { } cµr 3/ κ log n p max, cµr κ log n n 1 n n 1 n Since P C is just row-wise clipping, by Lemma 11 we have B. Proof of Corollary 1 dz 1, Z P C Z 0 Z Z 0 Z. = cµr κ log n n 1 n. 48 By the incoherence assumption, we have Z, σ 1 σ1. rom the above discussion, we can see that µr n 1 n σ 1, see 17. It suffices to show 8r δ 1 1σr 16 σ r δ 1 16 σ r. By Wely s theorem, we have σ 1 σ σ r. As a result, σ 1 σ 1. C Regularity Condition Analogous to the restricted strong convexity RSC and restricted strong smoothness RSS, we show that with high probability our objective function f satisfies the local curvature and local smoothness conditions defined below. Local Curvature Condition There exists constant c 1, c such that for any Z C satisfying dz, Z 1 4 σ r, Local Smoothness Condition fz, H c 1 H + c H DZ. There exist constants c 3, c 4 such that for any Z C satisfying dz, Z 1 4 σ r, fz c 3 H + c 4 H DZ. 0

21 C.1 Proof of the Local Curvature Condition fz, H = 1 m A l, HZ + ZH + HH A l + A l Z + H, H + λ trh Γ p l=1 i = 1 m A l, HZ + ZH + HH A l, HZ + ZH + HH + λ trh Γ p = 1 p { l=1 a {}} b {{}} { m m A l, HZ + ZH + A l, HH + l=1 + λ trh Γ ii 1 { a + b 3 p l=1 a m l=1 } 3 A l, HZ + ZH A l, HH {}}{{}}{ m A l, HZ + ZH m } A l, HH + λ trh Γ l=1 b 1 = 1 { a 3 } p 8 b + λ trh Γ iii 1 a p 5 4 b + λ trh Γ = 1 PΩ p 1 HZ + ZH 5 p 1 P Ω HH + λ trh Γ. 49 where we used equation 19 for i, the Cauchy-Schwarz inequality for ii, inequality a b a b for iii. inally, in the last line we used m l=1 A l, M = P Ω M. We first lower bound 1 PΩ p 1 HZ + ZH p 1 PΩ H U Z V + Z U HV, which expands to l=1 b. By the symmetry of Ω, it is equal to p 1 PΩ H U Z V + p 1 PΩ Z U HV + p 1 P Ω H U Z V, P Ω Z U HV. 50 As both H U Z V and Z U H V belong to T, we use Lemma 10 to lower bound above three terms, 1

22 respectively. This gives us 1 PΩ p 1 HZ + ZH HU 1 δ Z V + Z U HV + H U Z V,Z U HV δ H U Z V Z U HV HU 1 δ Z V + HU Z U HV + H U Z V,Z U HV δ Z V + Z U HV iv 1 δσr HU + H V + HU Z V,Z U HV = 1 δσ r H + H UZ V,Z U H V. where we used H U Z V Until now, we obtain σ r H U and Z U H V σ r H V for iv. 51 fz, H 1 δσ r H + H UZ V,Z U H V + λ trh Γ 5 p 1 P Ω HH. 5 Next, we lower bound H U Z V,Z U H V + λ trh Γ together. Rewriting [ H U Z V,Z U HV 0 Z = H, U HV Z V HU 0 ZZ ZZ = HH + ZH + HZ, ] Z = H, 1 ZH DZH DZ, 53

23 and plugging in Γ = DZZ DZ, we then have H U Z V,Z U H V + λ trh Γ = H, 1 ZH DZH DZ + λ H, DZZ ZZ DZ + λ H, DZZ DZ + λ H, DZZ DH a = H, 1 ZH DZH DZ + λ H, DZZ ZZ DZ + λ Z DH b = λ Z DH + H, 1 ZH DZH DZ + λdhh + ZH + HZ DZ + H c = λ Z DH + 1 H Z + λ H DH + 3λ trh DHH DZ + λ 1 trh DZH DZ = λ Z DH + 1 λ Z DH H Z + λ 1 + λ Z DH + 3H DH 7 λ H DH trh DZH DZ 7 λ H 4 + λ 1 trh DZH DZ Equality a holds because Z DZ = 0. We plug in 53 in b. or c, we use Z DZ = 0 and that H Z is symmetric. inally, we take λ = 1 and use Lemma 8 to upper bound p 1 P Ω HH : fz, H 1 δσr H + 1 Z DH H δ H 4 5 δσ r H = 1 δσr δ H 5 δσ r H + 1 Z DH or simplicity, we take δ = 1. We also have 16 H 1 16 σ r. This leads to fz, H 7 51 σ r H + 1 Z DH Note that this lower bound holds with high probability uniformly for all Z C such that dz, Z 1 4 σ r, since Lemma 8 and 10 hold uniformly. When the ground truth X is positive semidefinite, we don t need to do lifitng nor impose the regularizer. Using Lemma 10, we can lower bound 1 PΩ p 1 HZ + ZH 1 δσr H directly. Taking proper constants, we can obtain the standard restricted strong convexity condition: fz, H σ r H. 54 3

24 C. Proof of the Local Smoothness Condition To upper bound fz = max W =1 fz, W, it suffices to show that for any n r W of unit robenius norm, fz, W is upper bounded. We first write fz, W = 1 m A l, HZ + ZH + A l, HH A l + A l Z + H, W + λ trw Γ p i = 1 p l=1 m l=1 A l, HZ + ZH + A l, HH A l, WZ + ZW + A l, W H + HW + λ trw Γ = 1 { P Ω HZ + ZH, P Ω WZ + ZW + P Ω HH, P Ω WZ + ZW p } + P Ω HZ + ZH, P Ω W H + HW + P Ω HH, P Ω W H + HW + λ trw Γ, 57 where we used 19 for i. Since a + b + c + d + e 5a + b + c + d + e, we have fz, W 5 p { P Ω HZ + ZH, P Ω WZ + ZW + P Ω HH, P Ω WZ + ZW + P Ω HZ + ZH, P Ω W H + HW + P Ω HH, P Ω W H + HW } + 5λ trw Γ ii 5 PΩ HZ + ZH + PΩ HH p PΩ WZ + ZW + PΩ W H + HW =1 {}}{ + 5λ Γ W 58 1 iii 5 {}}{ 3 {}}{ P Ω HZ + PΩ ZH { p + }}{ PΩ HH 4 1 {}}{ {}}{ P Ω WZ + PΩ ZW { p + }}{ PΩ W H { + }}{ PΩ HW + 5λ Γ, 4

25 where we used the Cauchy-Schwarz inequality for ii, and a+b a +b for iii. We then use Lemma 9 to upper bound 1,, 4, 5, 6, 7, and Lemma 8 for 3. Also since W = 1, one has fz, W 5 8n H Z δ, H 4 + δσ r H 8n Z + 8n, H, + 5λ Γ = 40n 8n Z δ, H + δσ r H Z +, H, + 5λ Γ 400µrσ1 8µrσ δ H + r δσ H + 5λ Γ, 59 where in the last line we plugged in Z, 18. Next, we bound µr µr n σ 1 and H, 3 n σ 1, i.e. 17 and Γ DZZ = ZZ DZ + DZZ DZ DZZ ZZ DZ + DZZ DZ a ZZ ZZ Z + Z Z DZ b = HH + ZH + HZ Z + Z Z DH HH 6 + ZH HZ + Z + Z Z DH c 6 H + Z H Z + Z Z DH d = 6 H + 4σ 1 H Z + 4σ1 Z DH Inequality a holds because AB A B and D = 1. To get b, for the first term in the 3rd line we expand ZZ ZZ, for the second term we expand Z = Z + H and use Z DZ = 0. or c, we use AB A B A B. Last, d holds because Z = σ 1. inally, we combine 59 and 60. As before, take λ = 1, δ = 1, and 16 H 1 16 σ r, we. 60 5

26 obtain fz { 400µrσ1 8µrσ δ H + δσ r + 30λ H + 4σ 1 Z } H + 0λ σ1 Z DH { a 400µrσ1 8µrσ δ H r δσ + 8 σ 1λ H + 1} σ H b + 0λ σ1 Z DH { } µ r σ 1 H + 5σ 1 Z DH 399µ r σ 1 H + 5σ 1 Z DH, 61 where for a we used Z H + Z 1 4 σ r + σ σ 1, for b we used µ, r 1. As before, this condition holds uniformly for all Z such that dz, Z 1 4 σ r and satisfying the incoherence condition. or the case X is positive semidefinite, as we don t need to impose the regularizer, standard restricted strong smoothness condition follows: fz σ 1 H. C.3 Proof of Lemma 3 Rearranging the terms in the smoothness condition 61, we can further bound 1 Z DH 4 fz 0µ r κσ 1 fz 13196µ r κσ σ r H σ r H. 6 Combining equation 56 and 6, it follows that fz, H σ r H + 1 fz µ r κσ1. 63 inally, by upper bounding the probability that Lemma 8, 9, or 10 fails, and the sample probability p these lemmas require, we conclude that once µr log n p c max, µ r κ, 64 n 1 n n 1 n regularity condition 63 holds with probability at least 1 c 1 n c, where c, c 1, c are constants. 6

27 D Linear Convergence D.1 Proof of Lemma Let H k = Z k Z k. Our iterate is Z k+1 = P C Z k η fz k. Since P C is just row-wise clipping, by Lemma 11 we have P C It follows that Z k+1 Z k Zk η σ1 Z k η fz k Z k σ1 fz k Z k = H k + η fz k σ1 η fz k, H k σ1 a H k + η fz k σ η 1 1 σ1 α σ r = b 1 η H k ακ 1 η H k ακ, + ηη /β σ 1 Zk η fz k Z k. 65 σ1 H k + 1 fz k βσ1 fz k where we use the definition of RCε, α, β for a and 0 < η min {α/, /β} for b. Therefore, dz k+1, Z = min Z k+1 Z 1 η Z S ακ dzk, Z. 67 References [1] Srinadh Bhojanapalli, Anastasios Kyrillidis, and Sujay Sanghavi. Dropping convexity for faster semi-definite optimization. arxiv: , 015. [] Nicolas Boumal and Pierre-antoine Absil. Rtrmc: A riemannian trust-region method for low-rank matrix completion. In Advances in neural information processing systems, pages , 011. [3] Nicolas Boumal, Bamdev Mishra, P-A Absil, and Rodolphe Sepulchre. Manopt, a matlab toolbox for optimization on manifolds. Journal of Machine Learning Research, 15Apr: ,

28 [4] Samuel Burer and Renato D.C. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95:39 357, 003. [5] Jian-eng Cai, Emmanuel J Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 04: , 010. [6] Emmanuel Candès, Thomas Strohmer, and Vladislav Voroninski. Phaselift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Mathematics, 668: , 013. [7] Emmanuel J Candès and Xiaodong Li. Solving quadratic equations via phaselift when there are about as many equations as unknowns. oundations of Computational Mathematics, 14 5: , 014. [8] Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. oundations of Computational mathematics, 96:717 77, 009. [9] Emmanuel J Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix completion. Information Theory, IEEE Transactions on, 565: , 010. [10] Emmanuel J Candès, Yonina C Eldar, Thomas Strohmer, and Vladislav Voroninski. Phase retrieval via matrix completion. SIAM Review, 57:5 51, 015. [11] Emmanuel J Candès, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. Information Theory, IEEE Transactions on, 614: , 015. [1] Yudong Chen. Incoherence-optimal matrix completion. Information Theory, IEEE Transactions on, 615:909 93, 015. [13] Yudong Chen and Martin J. Wainwright. ast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. arxiv: , 015. [14] Alexandre d Aspremont, Laurent El Ghaoui, Michael I Jordan, and Gert RG Lanckriet. A direct formulation for sparse pca using semidefinite programming. SIAM review, 493: , 007. [15] Maryam azel. Matrix rank minimization with applications. Technical report, Elec. Eng. Dept., Stanford University, 00. PhD thesis. [16] Uriel eige and Eran Ofek. Spectral techniques applied to sparse random graphs. Random Structures & Algorithms, 7:51 75, 005. [17] Rina oygel and Nathan Srebro. Concentration-based guarantees for low-rank matrix reconstruction. arxiv: ,

29 [18] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. inding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53:17 88, 011. [19] Moritz Hardt. Understanding alternating minimization for matrix completion. In OCS 014. IEEE, 014. [0] Moritz Hardt and Mary Wootters. ast matrix completion without the condition number. In COLT 014, pages , 014. [1] Prateek Jain, Raghu Meka, and Inderjit S Dhillon. Guaranteed rank minimization via singular value projection. In Advances in Neural Information Processing Systems, pages , 010. [] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pages ACM, 013. [3] Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. Information Theory, IEEE Transactions on, 566: , 010. [4] Raghunandan Hulikal Keshavan. Efficient algorithms for collaborative filtering. PhD thesis, Stanford University, 01. [5] Jason D Lee, Ben Recht, Nathan Srebro, Joel Tropp, and Ruslan R Salakhutdinov. Practical large-scale optimization for max-norm regularization. In Advances in Neural Information Processing Systems, pages , 010. [6] Bamdev Mishra, Gilles Meyer, rancis Bach, and Rodolphe Sepulchre. Low-rank optimization with trace norm penalty. SIAM Journal on Optimization, 34:14 149, 013. [7] Sahand Negahban and Martin J. Wainwright. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. The Journal of Machine Learning Research, 13 1: , 01. [8] Yurii Nesterov. Introductory lectures on convex optimization, volume 87. Springer Science & Business Media, 004. [9] Benjamin Recht, Maryam azel, and Pablo A Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM review, 53: , 010. [30] Nathan Srebro and Adi Shraibman. Rank, trace-norm and max-norm. In Learning Theory, pages Springer, 005. [31] Nathan Srebro, Jason Rennie, and Tommi S Jaakkola. Maximum-margin matrix factorization. In Advances in neural information processing systems, pages ,

30 [3] Ruoyu Sun. Matrix Completion via Nonconvex actorization: Algorithms and Theory. PhD thesis, University of Minnesota Twin Cities, 015. [33] Ruoyu Sun and Zhi-Quan Luo. Guaranteed matrix completion via nonconvex factorization. In oundations of Computer Science, IEEE 56th Annual Symposium on, pages 70 89, 015. [34] Ryota Tomioka, Kohei Hayashi, and Hisashi Kashima. Estimation of low-rank tensors via convex optimization. arxiv: , 010. [35] Stephen Tu, Ross Boczar, Max Simchowitz, Mahdi Soltanolkotabi, and Benjamin Recht. Low-rank solutions of linear matrix equations via procrustes flow. In International Conference on Machine Learning, 016. [36] Bart Vandereycken. Low-rank matrix completion by riemannian optimization. SIAM Journal on Optimization, 3: , 013. [37] Ke Wei, Jian-eng Cai, Tony. Chan, and Shingyu Leung. Guarantees of riemannian optimization for low rank matrix completion. arxiv: , 016. [38] Xinyang Yi, Dohyung Park, Yudong Chen, and Constantine Caramanis. ast algorithms for robust pca via gradient descent. In Advances in neural information processing systems, 016. [39] Tuo Zhao, Zhaoran Wang, and Han Liu. A nonconvex optimization framework for low rank matrix estimation. In Advances in Neural Information Processing Systems, pages , 015. [40] Qinqing Zheng and John Lafferty. A convergent gradient descent algorithm for rank minimization and semidefinite programming from random linear measurements. In Advances in Neural Information Processing Systems, pages ,

Symmetric Factorization for Nonconvex Optimization

Symmetric Factorization for Nonconvex Optimization Qinqing Zheng February 24, 2017 1 Overview A growing body of recent research is shedding new light on the role of nonconvex optimization for tackling