Nonconvex Low Rank Matrix Factorization via Inexact First Order Oracle


Tuo Zhao, Zhaoran Wang, Han Liu

Tuo Zhao is affiliated with the Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA, and the Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA; tour@cs.jhu.edu. Zhaoran Wang is affiliated with the Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA; zhaoran@princeton.edu. Han Liu is affiliated with the Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA; hanliu@princeton.edu.

Abstract

We study the low rank matrix factorization problem via nonconvex optimization. Compared with the convex relaxation approach, nonconvex optimization exhibits superior empirical performance for large scale low rank matrix estimation. However, the understanding of its theoretical guarantees is limited. To bridge this gap, we exploit the notion of inexact first order oracle, which naturally appears in low rank matrix factorization problems such as matrix sensing and completion. In particular, our analysis shows that a broad class of nonconvex optimization algorithms, including alternating minimization and gradient-type methods, can be treated as solving two sequences of convex optimization problems using the inexact first order oracle. Thus we can show that these algorithms converge geometrically to the global optima and recover the true low rank matrices under suitable conditions. Numerical results are provided to support our theory.

1 Introduction

Let M* ∈ R^{m×n} be a rank k matrix with k much smaller than m and n. Our goal is to estimate M* based on partial observations of its entries. For example, matrix completion is based on a subsample of the entries of M*, while matrix sensing is based on linear measurements ⟨A_i, M*⟩, where i ∈ {1, ..., d} with d much smaller than mn and A_i is a sensing matrix. In the past decade, significant progress has been made on the recovery of low rank matrices [Candès and Recht, 2009, Candès and Tao, 2010, Candes and Plan, 2010, Recht et al., 2010, Lee and Bresler, 2010, Keshavan et al., 2010a,b, Jain et al., 2010, Cai et al., 2010, Recht, 2011, Gross, 2011, Chandrasekaran et al., 2011, Hsu et al., 2011, Rohde and Tsybakov, 2011, Koltchinskii et al., 2011, Negahban and Wainwright, 2011, Chen et al., 2011, Xiang et al., 2012, Negahban and Wainwright, 2012, Agarwal et al., 2012, Recht and Ré, 2013, Chen, 2013, Chen et al., 2013a,b, Jain et al., 2013, Jain and Netrapalli, 2014, Hardt, 2014, Hardt et al., 2014, Hardt and Wootters, 2014, Sun and Luo, 2014, Hastie et al., 2014, Cai and Zhang, 2015, Yan et al., 2015, Zhu et al., 2015, Wang et al., 2015]. Among these works, most

are based upon convex relaxation with a nuclear norm constraint or regularization. Nevertheless, solving these convex optimization problems can be computationally prohibitive in high dimensional regimes with large m and n [Hsieh and Olsen, 2014].

A computationally more efficient alternative is nonconvex optimization. In particular, we reparametrize the m × n matrix variable M in the optimization problem as UV^T with U ∈ R^{m×k} and V ∈ R^{n×k}, and optimize over U and V. Such a reparametrization automatically enforces the low rank structure and leads to low computational cost per iteration. For this reason, the nonconvex approach is widely used in large scale applications such as recommendation systems and collaborative filtering [Koren, 2009, Koren et al., 2009].

Despite the superior empirical performance of the nonconvex approach, the understanding of its theoretical guarantees is rather limited in comparison with the convex relaxation approach. Classical nonconvex optimization theory can only show sublinear convergence to local optima, yet many empirical results have corroborated its exceptional computational performance and convergence to global optima. Only recently has there been theoretical analysis of the block coordinate descent-type nonconvex optimization algorithm known as alternating minimization [Jain et al., 2013, Hardt, 2014, Hardt et al., 2014, Hardt and Wootters, 2014]. In particular, the existing results show that, provided a proper initialization, the alternating minimization algorithm attains a linear rate of convergence to a global optimum (U*, V*) with U* ∈ R^{m×k} and V* ∈ R^{n×k}, which satisfy M* = U*V*^T. Meanwhile, Keshavan et al. [2010a,b] establish the convergence of gradient-type methods, and Sun and Luo [2014] further establish the convergence of a broad class of nonconvex optimization algorithms including both gradient-type and block coordinate descent-type methods. However, Keshavan et al. [2010a,b] and Sun and Luo [2014] only establish asymptotic convergence over an infinite number of iterations, rather than an explicit rate of convergence. Besides these works, Lee and Bresler [2010], Jain et al. [2010], Jain and Netrapalli [2014] consider projected gradient-type methods, which optimize over the matrix variable M ∈ R^{m×n} rather than U ∈ R^{m×k} and V ∈ R^{n×k}. These methods involve calculating the top k singular vectors of an m × n matrix at each iteration. For k much smaller than m and n, they incur much higher computational cost per iteration than the aforementioned methods that optimize over U and V. All of these works, except Sun and Luo [2014], focus on specific algorithms, while Sun and Luo [2014] do not establish an explicit rate of convergence for the optimization.

In this paper, we propose a new theory for analyzing a broad class of nonconvex optimization algorithms for low rank matrix estimation. The core of our theory is the notion of inexact first order oracle. Based on the inexact first order oracle, we establish sufficient conditions under which the iterate sequences converge geometrically to the global optima. For both matrix sensing and completion, a direct consequence of our theory is that a broad family of nonconvex optimization algorithms, including gradient descent, block coordinate gradient descent, and block coordinate minimization, attains linear rates of convergence to the true low rank matrices U* and V*. In particular, our proposed theory covers alternating minimization as a special case and recovers the results of Jain et al. [2013], Hardt [2014], Hardt et al.
[2014], Hardt and Wootters [2014] under suitable conditions. Meanwhile, our approach covers gradient-type methods, which are also widely used in practice [Takács et al., 2007, Paterek, 2007, Koren et al., 2009, Gemulla et al., 2011, Recht and Ré, 2013, Zhuang et al., 2013]. To the best of our knowledge, our analysis is the first to establish exact recovery guarantees and geometric rates of convergence for a broad family of nonconvex matrix sensing and completion algorithms.

To achieve maximum generality, our unified analysis differs significantly from previous works. In detail, Jain et al. [2013], Hardt [2014], Hardt et al. [2014], Hardt and Wootters [2014] view alternating minimization as an approximate power method. However, their point of view relies on the closed form solution of each iteration of alternating minimization, which makes it difficult to generalize to other algorithms, e.g., gradient-type methods. Meanwhile, Sun and Luo [2014] take a geometric point of view. In detail, they show that the global optimum of the optimization problem is the unique stationary point within its neighborhood, and thus a broad class of algorithms succeed. However, such a geometric analysis of the objective function does not characterize the convergence rate of specific algorithms towards the stationary point. Unlike existing results, we analyze nonconvex optimization algorithms as approximate convex counterparts. For example, our analysis views alternating minimization on a nonconvex objective function as approximate block coordinate minimization on some convex objective function. We use a key quantity, the inexact first order oracle, to characterize such a perturbation effect, which results from the local nonconvexity at intermediate solutions. This eventually allows us to establish explicit rates of convergence in a manner analogous to existing convex optimization analysis.

Our proposed inexact first order oracle is closely related to a series of previous works on inexact or approximate gradient descent algorithms: Güler [1992], Luo and Tseng [1993], Nedić and Bertsekas [2001], d'Aspremont [2008], Baes [2009], Friedlander and Schmidt [2012], Devolder et al. [2014]. Different from these existing results, which focus on convex minimization, we show that the inexact first order oracle can also sharply capture the evolution of generic optimization algorithms even in the presence of nonconvexity. More recently, Candes et al. [2014], Balakrishnan et al. [2014], Arora et al. [2015] respectively analyze the Wirtinger Flow algorithm for phase retrieval, the expectation maximization (EM) algorithm for latent variable models, and the gradient descent algorithm for sparse coding based on ideas similar to ours. Though their analyses exploit similar nonconvex structures, they work on completely different problems, and the delivered technical results are also fundamentally different. A conference version of this paper was presented at the Annual Conference on Neural Information Processing Systems 2015 [Zhao et al., 2015]. While our conference version was under review, similar work was released on arxiv.org by Zheng and Lafferty [2015], Bhojanapalli et al. [2015], Tu et al. [2015], Chen and Wainwright [2015]. These works focus on symmetric positive semidefinite low rank matrix factorization problems. In contrast, our proposed methodology and theory do not require symmetry or positive semidefiniteness, and therefore can be applied to rectangular low rank matrix factorization problems.

The rest of this paper is organized as follows. In Section 2, we review the matrix sensing problem and introduce a general class of nonconvex optimization algorithms. In Section 3, we present the convergence analysis of the algorithms. In Section 4, we lay out the proof. In Section 5, we extend the proposed methodology and theory to matrix completion problems. In Section 6, we provide numerical experiments and draw the conclusion.

Notation: For v = (v_1, ..., v_d)^T ∈ R^d, we define the vector ℓ_q norm as ‖v‖_q^q = Σ_j |v_j|^q.
We define e_i as the indicator vector whose i-th entry is one and all other entries are zero. For a matrix A ∈ R^{m×n}, we use A_{*j} = (A_{1j}, ..., A_{mj})^T to denote the j-th column of A, and A_{i*} = (A_{i1}, ..., A_{in})^T to denote the i-th row of A. Let σ_max(A) and σ_min(A) be the largest and smallest nonzero singular values of A. We define the following matrix norms: the Frobenius norm ‖A‖_F^2 = Σ_j ‖A_{*j}‖_2^2, and the spectral norm ‖A‖_2 = σ_max(A). Moreover, we define the nuclear norm ‖A‖_* as the sum of all singular values of A, and A^† as the Moore-Penrose pseudoinverse of A. Given another matrix B ∈ R^{m×n}, we define the inner product ⟨A, B⟩ = Σ_{i,j} A_{ij} B_{ij}. For a bivariate function f(u, v), we define ∇_u f(u, v) to be the gradient with respect to u. Moreover, we use the common notations Ω(·), O(·), and o(·) to characterize the asymptotics of two real sequences.
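For readers who want to check these definitions numerically, here is a minimal NumPy sketch (our own illustration; the variable names are ours) of the norms and operations introduced above.

```python
import numpy as np

A = np.random.randn(5, 4)
B = np.random.randn(5, 4)

fro = np.linalg.norm(A, "fro")                        # ||A||_F
spec = np.linalg.norm(A, 2)                           # ||A||_2 = sigma_max(A)
nuc = np.linalg.norm(A, "nuc")                        # ||A||_* = sum of singular values
sigma_min = np.linalg.svd(A, compute_uv=False).min()  # smallest singular value
A_pinv = np.linalg.pinv(A)                            # Moore-Penrose pseudoinverse
inner = np.sum(A * B)                                 # <A, B> = sum_ij A_ij * B_ij
```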

2 Matrix Sensing

We start with the matrix sensing problem. Let M* ∈ R^{m×n} be the unknown low rank matrix of interest. We have d sensing matrices A_i ∈ R^{m×n} with i ∈ {1, ..., d}. Our goal is to estimate M* based on b_i = ⟨A_i, M*⟩ in the high dimensional regime with d much smaller than mn. Under such a regime, a common assumption is rank(M*) = k ≪ min{d, m, n}. Existing approaches generally recover M* by solving the following convex optimization problem

min_{M ∈ R^{m×n}} ‖M‖_*   subject to   b = A(M),   (1)

where b = [b_1, ..., b_d]^T ∈ R^d, and A(·) : R^{m×n} → R^d is a linear operator defined as

A(M) = [⟨A_1, M⟩, ..., ⟨A_d, M⟩]^T ∈ R^d.   (2)

Existing convex optimization algorithms for solving (1) are computationally inefficient, since they incur high per-iteration computational cost and only attain sublinear rates of convergence to the global optimum [Jain et al., 2013, Hsieh and Olsen, 2014]. Therefore, in large scale settings we usually consider the following nonconvex optimization problem instead:

min_{U ∈ R^{m×k}, V ∈ R^{n×k}} F(U, V),   where   F(U, V) = (1/2) ‖b − A(UV^T)‖_2^2.   (3)

The reparametrization M = UV^T, though making the problem in (3) nonconvex, significantly improves the computational efficiency. The existing literature [Koren, 2009, Koren et al., 2009, Takács et al., 2007, Paterek, 2007, Gemulla et al., 2011, Recht and Ré, 2013, Zhuang et al., 2013] has established convincing evidence that (3) can be effectively solved by a broad variety of gradient-based nonconvex optimization algorithms, including gradient descent, alternating exact minimization (i.e., alternating least squares or block coordinate minimization), and alternating gradient descent (i.e., block coordinate gradient descent), as illustrated in Algorithm 1. It is worth noting that the QR decomposition and the rank k singular value decomposition in Algorithm 1 can be accomplished efficiently: the QR decomposition can be accomplished in O(k^2 max{m, n}) operations, while the rank k singular value decomposition can be accomplished in O(kmn) operations. In fact, the QR decomposition is not necessary for particular update schemes; e.g., Jain et al. [2013] prove that the alternating exact minimization update schemes with or without the QR decomposition are equivalent.
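To make the objective in (3) concrete, the following NumPy sketch (our own illustration, not code from the paper) evaluates the sensing operator in (2), the objective F(U, V), and its partial gradients, which are the basic building blocks of the update schemes summarized in Algorithm 1 below.

```python
import numpy as np

def sensing_op(A_list, M):
    """A(M) = [<A_1, M>, ..., <A_d, M>]^T as in (2)."""
    return np.array([np.sum(A * M) for A in A_list])

def objective(A_list, b, U, V):
    """F(U, V) = 0.5 * ||b - A(U V^T)||_2^2 as in (3)."""
    r = b - sensing_op(A_list, U @ V.T)
    return 0.5 * np.sum(r ** 2)

def grad_U(A_list, b, U, V):
    """Partial gradient of F with respect to U."""
    r = b - sensing_op(A_list, U @ V.T)
    return -sum(ri * A @ V for ri, A in zip(r, A_list))

def grad_V(A_list, b, U, V):
    """Partial gradient of F with respect to V."""
    r = b - sensing_op(A_list, U @ V.T)
    return -sum(ri * A.T @ U for ri, A in zip(r, A_list))
```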

3 Convergence Analysis

We analyze the convergence of the algorithms illustrated in Section 2. Before presenting the main results, we first introduce a unified analytical framework based on a key quantity, the inexact first order oracle. Such a unified framework equips our theory with maximum generality. Without loss of generality, we assume m ≤ n throughout the rest of this paper.

Algorithm 1: A family of nonconvex optimization algorithms for matrix sensing. Here (Ū, D, V̄) ← KSVD(M) is the rank k singular value decomposition of M: D is a diagonal matrix containing the top k singular values of M in decreasing order, and Ū and V̄ contain the corresponding top k left and right singular vectors of M. (V̄, R_V) ← QR(V) is the QR decomposition, where V̄ is the corresponding orthonormal matrix and R_V is the corresponding upper triangular matrix.

Input: {b_i}_{i=1}^d, {A_i}_{i=1}^d
Parameter: step size η, total number of iterations T
(Ū^(0), D^(0), V̄^(0)) ← KSVD(Σ_{i=1}^d b_i A_i), V^(0) ← V̄^(0) D^(0), U^(0) ← Ū^(0) D^(0)
For t = 0, ..., T − 1
  Updating V:
    Alternating Exact Minimization: V^(t+0.5) ← argmin_V F(U^(t), V); (V^(t+1), R_V^(t+0.5)) ← QR(V^(t+0.5))
    Alternating Gradient Descent: V^(t+0.5) ← V^(t) − η ∇_V F(U^(t), V^(t)); (V^(t+1), R_V^(t+0.5)) ← QR(V^(t+0.5)), U^(t) ← U^(t) R_V^(t+0.5)
    Gradient Descent: V^(t+0.5) ← V^(t) − η ∇_V F(U^(t), V^(t)); (V^(t+1), R_V^(t+0.5)) ← QR(V^(t+0.5)), U^(t+1) ← U^(t) R_V^(t+0.5)
  Updating U:
    Alternating Exact Minimization: U^(t+0.5) ← argmin_U F(U, V^(t+1)); (U^(t+1), R_U^(t+0.5)) ← QR(U^(t+0.5))
    Alternating Gradient Descent: U^(t+0.5) ← U^(t) − η ∇_U F(U^(t), V^(t+1)); (U^(t+1), R_U^(t+0.5)) ← QR(U^(t+0.5)), V^(t+1) ← V^(t+1) R_U^(t+0.5)
    Gradient Descent: U^(t+0.5) ← U^(t) − η ∇_U F(U^(t), V^(t)); (U^(t+1), R_U^(t+0.5)) ← QR(U^(t+0.5)), V^(t+1) ← V^(t) R_U^(t+0.5)
End for
Output: M^(T) ← U^(T−0.5) (V^(T))^T (for gradient descent we use U^(T) (V^(T))^T)

3.1 Main Idea

We first provide an intuitive explanation for the success of nonconvex optimization algorithms, which forms the basis of our later analysis of the main results in Section 4. Recall that (3) can be written as a special instance of the following optimization problem,

min_{U ∈ R^{m×k}, V ∈ R^{n×k}} f(U, V).   (4)

A key observation is that, given a fixed U, f(U, ·) is strongly convex and smooth in V under suitable conditions, and the same also holds for U given a fixed V.
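For matrix sensing this key observation is easy to see: since ⟨A_i, UV^T⟩ = ⟨A_i^T U, V⟩ is linear in V for a fixed U, the subproblem argmin_V F(U, V) is an ordinary least squares problem in the entries of V. The sketch below (our own illustration, with illustrative problem sizes) solves it directly; this is exactly the alternating exact minimization step of Algorithm 1 before the QR renormalization.

```python
import numpy as np

def exact_v_update(A_list, b, U):
    """Solve argmin_V 0.5 * ||b - A(U V^T)||_2^2 in closed form.

    Uses <A_i, U V^T> = vec(A_i^T U)^T vec(V), so the subproblem is an
    ordinary least squares problem in the n*k entries of V.
    """
    n = A_list[0].shape[1]
    k = U.shape[1]
    # Design matrix: one row per measurement, n*k columns.
    X = np.stack([(A.T @ U).ravel() for A in A_list])
    v, *_ = np.linalg.lstsq(X, b, rcond=None)
    return v.reshape(n, k)

# Example usage with random data (illustrative sizes only).
m, n, k, d = 30, 40, 5, 600
rng = np.random.default_rng(0)
A_list = [rng.standard_normal((m, n)) for _ in range(d)]
M_star = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
b = np.array([np.sum(A * M_star) for A in A_list])
U = np.linalg.qr(rng.standard_normal((m, k)))[0]   # an orthonormal U
V_new = exact_v_update(A_list, b, U)
```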

For the convenience of discussion, we summarize this observation in the following technical condition, which will be verified later for matrix sensing and completion under suitable conditions.

Condition 1 (Strong Biconvexity and Bismoothness). There exist universal constants µ+ > 0 and µ− > 0 such that

µ− ‖U − U'‖_F^2 ≤ f(U, V) − f(U', V) − ⟨U − U', ∇_U f(U', V)⟩ ≤ µ+ ‖U − U'‖_F^2 for all U, U',
µ− ‖V − V'‖_F^2 ≤ f(U, V) − f(U, V') − ⟨V − V', ∇_V f(U, V')⟩ ≤ µ+ ‖V − V'‖_F^2 for all V, V'.

Ideal First Order Oracle. To ease presentation, we assume that U* and V* are the unique global minimizers of the generic optimization problem in (4). Assuming that U* is given, we can obtain V* by

V* = argmin_{V ∈ R^{n×k}} f(U*, V).   (5)

Condition 1 implies that the objective function in (5) is strongly convex and smooth. Hence, we can choose any gradient-based algorithm to obtain V*. For example, we can directly solve for V* from the first order optimality condition

∇_V f(U*, V) = 0,   (6)

or iteratively solve for V* using gradient descent, i.e.,

V^(t) = V^(t−1) − η ∇_V f(U*, V^(t−1)),   (7)

where η is a step size. Taking gradient descent as an example, we can invoke classical convex optimization results [Nesterov, 2004] to prove that

‖V^(t) − V*‖_F ≤ κ ‖V^(t−1) − V*‖_F for all t = 0, 1, 2, ...,

where κ ∈ (0, 1) only depends on µ+ and µ− in Condition 1. For notational simplicity, we call ∇_V f(U*, V^(t−1)) the ideal first order oracle, since U* is not known in practice.

Inexact First Order Oracle. Though the ideal first order oracle is not accessible in practice, it provides insights for analyzing nonconvex optimization algorithms. Taking gradient descent as an example, at the t-th iteration we take a gradient descent step over V based on ∇_V f(U, V^(t−1)). We can treat ∇_V f(U, V^(t−1)) as an approximation of ∇_V f(U*, V^(t−1)), where the approximation error comes from approximating U* by U. The relationship between ∇_V f(U, V^(t−1)) and ∇_V f(U*, V^(t−1)) is then similar to that between the gradient and the approximate gradient in the existing literature on convex optimization. For simplicity, we call ∇_V f(U, V^(t−1)) the inexact first order oracle. To characterize the difference between ∇_V f(U, V^(t−1)) and ∇_V f(U*, V^(t−1)), we define the approximation error of the inexact first order oracle as

E(V', V, U) = ‖∇_V f(U, V) − ∇_V f(U*, V)‖_F,   (8)

where V is the current decision variable at which the gradient is evaluated and V' denotes the new iterate obtained with the inexact oracle. In the above example, V = V^(t−1). Later we will illustrate that E(V', V, U) is critical to our analysis.
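To make (8) concrete for the matrix sensing objective in (3), the following sketch (our own illustration; U_star denotes a factor of the unknown M*, which is of course unavailable in practice and is used here only to illustrate the definition) evaluates the approximation error of the inexact first order oracle.

```python
import numpy as np

def grad_V(A_list, b, U, V):
    """Partial gradient of F(U, V) = 0.5 * ||b - A(U V^T)||_2^2 with respect to V."""
    r = b - np.array([np.sum(A * (U @ V.T)) for A in A_list])
    return -sum(ri * A.T @ U for ri, A in zip(r, A_list))

def oracle_error(A_list, b, U, U_star, V):
    """Approximation error E(., V, U) = ||grad_V f(U, V) - grad_V f(U*, V)||_F, cf. (8)."""
    return np.linalg.norm(grad_V(A_list, b, U, V) - grad_V(A_list, b, U_star, V), "fro")
```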

In the above example of alternating gradient descent, we will prove later that, for V^(t) = V^(t−1) − η ∇_V f(U, V^(t−1)), we have

‖V^(t) − V*‖_F ≤ κ ‖V^(t−1) − V*‖_F + (2/µ+) E(V^(t), V^(t−1), U).   (9)

In other words, E(V^(t), V^(t−1), U) captures the perturbation effect of employing the inexact first order oracle ∇_V f(U, V^(t−1)) instead of the ideal first order oracle ∇_V f(U*, V^(t−1)). For V^(t) = argmin_V f(U, V), we will prove that

‖V^(t) − V*‖_F ≤ (1/µ−) E(V^(t), V^(t), U).   (10)

According to the update schemes shown in Algorithms 1 and 2, for alternating exact minimization we set U = U^(t) in (10), while for gradient descent or alternating gradient descent we set U = U^(t−1) or U = U^(t) in (9), respectively. By symmetry, similar results also hold for ‖U^(t) − U*‖_F.

To establish the geometric rate of convergence towards the global minima U* and V*, it remains to establish upper bounds for the approximation error of the inexact first order oracle. Taking gradient descent as an example, we will prove that, given an appropriate initial solution, we have

(2/µ+) E(V^(t), V^(t−1), U^(t−1)) ≤ α ‖U^(t−1) − U*‖_F   (11)

for some α ∈ (0, 1 − κ). Combined with (9) (where we take U = U^(t−1)), (11) further implies

‖V^(t) − V*‖_F ≤ κ ‖V^(t−1) − V*‖_F + α ‖U^(t−1) − U*‖_F.   (12)

Correspondingly, similar results hold for ‖U^(t) − U*‖_F, i.e.,

‖U^(t) − U*‖_F ≤ κ ‖U^(t−1) − U*‖_F + α ‖V^(t−1) − V*‖_F.   (13)

Combining (12) and (13), we then establish the contraction

max{‖V^(t) − V*‖_F, ‖U^(t) − U*‖_F} ≤ (α + κ) max{‖V^(t−1) − V*‖_F, ‖U^(t−1) − U*‖_F},

which further implies geometric convergence, since α ∈ (0, 1 − κ). We can establish similar results for alternating exact minimization and alternating gradient descent, respectively. Based upon such a unified analysis, we now present the main results.

3.2 Main Results

Before presenting the main results, we first introduce an assumption known as the restricted isometry property (RIP). Recall that k is the rank of the target low rank matrix M*.

Assumption 1 (Restricted Isometry Property). The linear operator A(·) : R^{m×n} → R^d defined in (2) satisfies k-RIP with parameter δ_k ∈ (0, 1), i.e., for all Δ ∈ R^{m×n} such that rank(Δ) ≤ k, it holds that

(1 − δ_k) ‖Δ‖_F^2 ≤ ‖A(Δ)‖_2^2 ≤ (1 + δ_k) ‖Δ‖_F^2.
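Assumption 1 cannot be certified efficiently for a given operator, but for random designs it can be illustrated empirically. The sketch below (our own illustration) draws Gaussian sensing matrices, scales them by 1/sqrt(d) so that the isometry holds in expectation, and measures how far ‖A(Δ)‖_2^2 / ‖Δ‖_F^2 deviates from one on a few random rank-k test matrices. This is a Monte Carlo sanity check on random low rank directions, not a proof of RIP.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k, d = 30, 40, 3, 2000

# Gaussian sensing matrices, scaled so that E||A(Delta)||_2^2 = ||Delta||_F^2.
A_list = [rng.standard_normal((m, n)) / np.sqrt(d) for _ in range(d)]

def ratio(Delta):
    y = np.array([np.sum(A * Delta) for A in A_list])
    return np.sum(y ** 2) / np.sum(Delta ** 2)

# Probe the isometry constant on a few random rank-k directions.
ratios = []
for _ in range(20):
    Delta = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
    ratios.append(ratio(Delta))
print("empirical deviation from isometry on probed directions:",
      max(abs(r - 1.0) for r in ratios))
```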

Several random matrix ensembles satisfy k-RIP for a sufficiently large d with high probability. For example, suppose that each entry of A_i is independently drawn from a sub-Gaussian distribution; then A(·) satisfies k-RIP with parameter δ_k with high probability for d = Ω(δ_k^{-2} k n log n). The following theorem establishes the geometric rate of convergence of the nonconvex optimization algorithms summarized in Algorithm 1.

Theorem 1. Assume there exists a sufficiently small constant C_1 such that A(·) satisfies k-RIP with δ_k ≤ C_1/k, and the largest and smallest nonzero singular values of M* are constants, which do not scale with (d, m, n, k). For any pre-specified precision ɛ, there exist an η and universal constants C_2 and C_3 such that for all T ≥ C_2 log(C_3/ɛ), we have ‖M^(T) − M*‖_F ≤ ɛ.

The proof of Theorem 1 is provided in Sections 4.2, 4.3, and 4.4. Theorem 1 implies that all three nonconvex optimization algorithms converge geometrically to the global optimum. Moreover, assuming that each entry of A_i is independently drawn from a sub-Gaussian distribution with mean zero and variance proxy one, our result further suggests that, to achieve exact low rank matrix recovery, our algorithms require the number of measurements d to satisfy

d = Ω(k^3 n log n),   (14)

since we assume that δ_k ≤ C_1/k. This sample complexity matches the state-of-the-art result for nonconvex optimization methods, established by Jain et al. [2013]. In comparison with their result, which only covers the alternating exact minimization algorithm, our result holds for a broader variety of nonconvex optimization algorithms. Note that the sample complexity in (14) depends on a polynomial of σ_max(M*)/σ_min(M*), which is treated as a constant in this paper. If we allow σ_max(M*)/σ_min(M*) to increase, we can plug the nonconvex optimization algorithms into the multi-stage framework proposed by Jain et al. [2013]. Following similar lines to the proof of Theorem 1, we can then derive a new sample complexity which is independent of σ_max(M*)/σ_min(M*). See Jain et al. [2013] for more details.

4 Proof of Main Results

We sketch the proof of Theorem 1. The proofs of all related lemmas are provided in the appendix. For notational simplicity, let σ_1 = σ_max(M*) and σ_k = σ_min(M*). Recall that the nonconvex optimization algorithms are symmetric with respect to the updates of U and V. Hence, the following lemmas for the update of V also hold for the update of U; we omit some statements for conciseness. Theorem 2 can be proved in a similar manner, and its proof is provided in Appendix F. Before presenting the proof, we first introduce the following lemma, which verifies Condition 1.

Lemma 1. Suppose that A(·) satisfies k-RIP with parameter δ_k. Given an arbitrary orthonormal matrix Ū ∈ R^{m×k}, for any V, V' ∈ R^{n×k} we have

(1 + δ_k) ‖V − V'‖_F^2 ≥ F(Ū, V) − F(Ū, V') − ⟨∇_V F(Ū, V'), V − V'⟩ ≥ (1 − δ_k) ‖V − V'‖_F^2.

The proof of Lemma 1 is provided in Appendix A.1. Lemma 1 implies that F(Ū, ·) is strongly convex and smooth in V for a fixed orthonormal matrix Ū, as specified in Condition 1. Equipped with Lemma 1, we now lay out the proof for each update scheme in Algorithm 1.
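Section 4.1 below explains that the factorization M* = U*V*^T is only identifiable up to rotation, and errors are therefore measured through the rotation-invariant subspace distance min_{O^T O = I_k} ‖U − Ū O‖_F of (15). As a preview, here is a minimal sketch (our own illustration) of how that distance can be evaluated via the orthogonal Procrustes problem.

```python
import numpy as np

def subspace_dist(U, U_bar):
    """min over orthogonal O of ||U - U_bar O||_F (orthogonal Procrustes).

    The minimizer is O = P Q^T, where U_bar^T U = P S Q^T is an SVD.
    """
    P, _, Qt = np.linalg.svd(U_bar.T @ U)
    O = P @ Qt
    return np.linalg.norm(U - U_bar @ O, "fro")

# The distance is invariant to rotating U_bar: same value for U_bar @ R, R orthogonal.
rng = np.random.default_rng(0)
U = np.linalg.qr(rng.standard_normal((8, 3)))[0]
U_bar = np.linalg.qr(rng.standard_normal((8, 3)))[0]
R = np.linalg.qr(rng.standard_normal((3, 3)))[0]
assert np.isclose(subspace_dist(U, U_bar), subspace_dist(U, U_bar @ R))
```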

4.1 Rotation Issue

Given a factorization M* = U*V*^T, we can equivalently represent it as M* = U*_new (V*_new)^T, where U*_new = U* O_new and V*_new = V* O_new for an arbitrary unitary matrix O_new ∈ R^{k×k}. This implies that directly calculating ‖U − U*‖_F is not desirable, and the algorithm may converge to an arbitrary factorization of M*. To address this issue, existing analyses usually choose subspace distances to evaluate the difference between the subspaces spanned by the columns of U and U*, because these subspaces are invariant to rotations [Jain et al., 2013]. For example, letting Ū*_⊥ ∈ R^{m×(m−k)} denote the orthonormal complement of Ū*, we can choose the subspace distance ‖(Ū*_⊥)^T U‖_F. For any O_new ∈ R^{k×k} such that O_new^T O_new = I_k, we have ‖(Ū*_⊥)^T U O_new‖_F = ‖(Ū*_⊥)^T U‖_F, so this distance is invariant to rotation. In this paper, we consider a different subspace distance, defined as

min_{O : O^T O = I_k} ‖U − Ū* O‖_F.   (15)

We can verify that (15) is also invariant to rotation. The next lemma shows that (15) is equivalent to ‖(Ū*_⊥)^T U‖_F.

Lemma 2. Given two orthonormal matrices U ∈ R^{m×k} and Ū ∈ R^{m×k}, we have

‖Ū_⊥^T U‖_F ≤ min_{O : O^T O = I} ‖U − Ū O‖_F ≤ √2 ‖Ū_⊥^T U‖_F.

The proof of Lemma 2 is provided in Stewart et al. [1990] and is therefore omitted. Equipped with Lemma 2, our convergence analysis guarantees that there always exists a factorization of M* satisfying the desired computational properties at each iteration (see Lemma 5 and Corollaries 1 and 2). Similarly, the above argument can also be generalized to the gradient descent and alternating gradient descent algorithms.

4.2 Proof of Theorem 1 (Alternating Exact Minimization)

Proof. Throughout the proof for alternating exact minimization, we define a constant ξ ∈ (1, 2) to simplify the notation. Moreover, we assume that at the t-th iteration there exists a matrix factorization of M*,

M* = Ū^(t) (V̄^(t))^T,

where Ū^(t) ∈ R^{m×k} is an orthonormal matrix. We define the approximation error of the inexact first order oracle as

E(V^(t+0.5), V^(t+0.5), U^(t)) = ‖∇_V F(U^(t), V^(t+0.5)) − ∇_V F(Ū^(t), V^(t+0.5))‖_F.

The following lemma establishes an upper bound for the approximation error of the inexact first order oracle under suitable conditions.

10 Lemma 3. Suppose that δ k and (t) satisfy Then we have δ k (1 δ k) σ k 1ξk(1 + δ k )σ 1 and (t) (t) F (1 δ k)σ k 4ξ(1 + δ k )σ 1. (16) E(V (t+0.5), V (t+0.5), (t) ) (1 δ k)σ k (t) (t) F. ξ The proof of Lemma 3 is provided in Appendix A.. Lemma 3 shows that the approximation error of the inexact first order oracle for updating V diminishes with the estimation error of (t), when (t) is sufficiently close to (t). The following lemma quantifies the progress of an exact minimization step using the inexact first order oracle. Lemma 4. We have V (t+0.5) V (t) F 1 1 δ k E(V (t+0.5), V (t+0.5), (t) ). The proof of Lemma 4 is provided in Appendix A.3. Lemma 4 illustrates that the estimation error of V (t+0.5) diminishes with the approximation error of the inexact first order oracle. The following lemma characterizes the effect of the renormalization step using QR decomposition, i.e., the relationship between V (t+0.5) and V (t+1) in terms of the estimation error. Lemma 5. Suppose that V (t+0.5) satisfies V (t+0.5) V (t) F σ k 4. (17) Then there exists a factorization of M = (t+1) V (t+1) such that V (t+0.5) R n k is an orthonormal matrix, and satisfies V (t+1) V (t+1) F σ k V (t+0.5) V (t) F. The proof of Lemma 5 is provided in Appendix A.4. The next lemma quantifies the accuracy of the initialization (0). Lemma 6. Suppose that δ k satisfies δ k (1 δ k ) σk 4 19ξ k(1 + δ k ) σ1 4. (18) Then there exists a factorization of M = (0) V (0) such that (0) R m k is an orthonormal matrix, and satisfies (0) F (1 δ k)σ k 4ξ(1 + δ k )σ 1. 10

11 The proof of Lemma 6 is provided in Appendix A.5. Lemma 6 implies that the initial solution (0) attains a sufficiently small estimation error. Combining Lemmas 3, 4, and 5, we obtain the following corollary for a complete iteration of updating V. Corollary 1. Suppose that δ k and (t) satisfy We then have δ k (1 δ k ) σ 4 k 19ξ k(1 + δ k ) σ 4 1 and (t) (t) F (1 δ k)σ k 4ξ(1 + δ k )σ 1. (19) Moreover, we also have V (t+1) V (t+1) F (1 δ k)σ k 4ξ(1 + δ k )σ 1. V (t+1) V (t+1) F 1 ξ (t) (t) F and V (t+0.5) V (t) F σ k ξ (t) (t) F. The proof of Corollary 1 is provided in Appendix A.6. Since the alternating exact minimization algorithm updates and V in a symmetric manner, we can establish similar results for a complete iteration of updating in the next corollary. Corollary. Suppose that δ k and V (t+1) satisfy δ k (1 δ k ) σ 4 k 19ξ k(1 + δ k ) σ 4 1 and V (t+1) V (t+1) F (1 δ k)σ k 4ξ(1 + δ k )σ 1. (0) Then there exists a factorization of M = (t+1) V (t+1) such (t+1) is an orthonormal matrix, and satisfies Moreover, we also have (t+1) (t+1) F (1 δ k)σ k 4ξ(1 + δ k )σ 1. (t+1) (t+1) F 1 ξ V (t+1) V (t+1) F and (t+0.5) (t+1) F σ k ξ V (t+1) V (t+1) F. The proof of Corollary directly follows Appendix A.6, and is therefore omitted.. We then proceed with the proof of Theorem 1 for alternating exact minimization. Lemma 6 ensures that (19) of Corollary 1 holds for (0). Then Corollary 1 ensures that (0) of Corollary holds for V (1). By induction, Corollaries 1 and can be applied recursively for all T iterations. Thus we obtain V (T ) V (T ) F 1 ξ (T 1) (T 1) F 1 ξ V (T 1) V (T 1) F 1 ξ T 1 (0) (0) F (1 δ k)σ k 4ξ T (1 + δ k )σ 1, (1) 11

12 where the last inequality comes from Lemma 6. Therefore, for a pre-specified accuracy ɛ, we need at most ( ) 1 (1 T = log δk )σ k log 1 ξ () ɛ(1 + δ k )σ 1 iterations such that Moreover, Corollary implies V (T ) V (T ) F (T 0.5) (T ) F σ k ξ V (T ) V (T ) F (1 δ k)σ k 4ξ T (1 + δ k )σ 1 ɛ. (3) (1 δ k )σ k 8ξ T +1 (1 + δ k )σ 1, where the last inequality comes from (1). Therefore, we need at most ( 1 (1 T = log δk )σk ) log 1 ξ 4ξɛ(1 + δ k ) (4) iterations such that (T 0.5) F Then combining (3) and (5), we obtain (1 δ k )σ k 8ξ T +1 (1 + δ k )σ 1 ɛ σ 1. (5) M (T ) M = (T 0.5) V (T ) (T ) V (T ) F = (T 0.5) V (T ) (T ) V (T ) + (T ) V (T ) (T ) V (T ) F V (T ) (T 0.5) (T ) F + (T ) V (T ) V (T ) F ɛ, (6) where the last inequality comes from V (T ) = 1 (since V (T ) is orthonormal) and = M = σ 1 (since (T ) V (T ) = M and V (T ) is orthonormal). Thus combining () and (4) with (6), we complete the proof. 4.3 Proof of Theorem 1 (Alternating Gradient Descent) Proof. Throughout the proof for alternating gradient descent, we define a sufficiently large constant ξ. Moreover, we assume that at the t-th iteration, there exists a matrix factorization of M M = (t) V (t), where (t) R m k is an orthonormal matrix. We define the approximation error of the inexact first order oracle as E(V (t+0.5), V (t), (t) ) = V F( (t), V (t) ) V F( (t), V (t) ) F. The first lemma is parallel to Lemma 3 for alternating exact minimization. 1

13 Lemma 7. Suppose that δ k, (t), and V (t) satisfy Then we have δ k (1 δ k)σ k 4ξkσ 1, (t) (t) F σ k 4ξσ 1 E(V (t+0.5), V (t), (t) ) (1 + δ k)σ k (t) (t) F. ξ, and V (t) V (t) F σ 1 k. (7) The proof of Lemma 7 is provided in Appendix B.1. Lemma 7 illustrates that the approximation error of the inexact first order oracle diminishes with the estimation error of (t), when (t) and V (t) are sufficiently close to (t) and V (t). Lemma 8. Suppose that the step size parameter η satisfies Then we have η = V (t+0.5) V F δ k V (t) V F δ k. (8) 1 + δ k E(V (t+0.5), V (t), (t) ). The proof of Lemma 8 is in Appendix B.. Lemma 8 characterizes the progress of a gradient descent step with a pre-specified fixed step size. A more practical option is adaptively selecting η using the backtracking line search procedure, and similar results can be guaranteed. See Nesterov [004] for details. The following lemma characterizes the effect of the renormalization step using QR decomposition. Lemma 9. Suppose that V (t+0.5) satisfies V (t+0.5) V (t) F σ k 4. (9) Then there exists a factorization of M = (t+1) V (t+1) such that V (t+1) R n k is an orthonormal matrix, and V (t+1) V (t+1) F σ k V (t+0.5) V (t) F, (t) (t+1) F 3σ 1 σ k V (t+0.5) V (t) F + σ 1 (t) (t) F, The proof of Lemma 9 is provided in Appendix B.3. The next lemma quantifies the accuracy of the initial solutions. Lemma 10. Suppose that δ k satisfies Then we have (0) (0) F σ k 4ξσ 1 σ 6 k δ k 19ξ kσ1 6. (30) and V (0) V (0) F σ k σ 1 k. ξσ 1 13

14 The proof of Lemma 10 is in Appendix B.4. Lemma 10 indicates that the initial solutions (0) and V (0) attain sufficiently small estimation errors. Combining Lemmas 7, 8, 5,, we obtain the following corollary for a complete iteration of updating V. Corollary 3. Suppose that δ k, (t), and V (t) satisfy δ k σ 6 k 19ξ kσ 6 1, (t) (t) F σ k 4ξσ1, and V (t) V (t) F σ k. (31) ξσ 1 We then have Moreover, we have V (t+1) V (t+1) F σ k 4ξσ 1 and (t) (t+1) F σ k ξσ 1. V (t+0.5) V (t) F δ k V (t) V (t) F + σ k ξ (t) (t) F, (3) V (t+1) V (t+1) F δ k V (t) V (t) F + 4 σ k ξ (t) (t) F, (33) (t) (t+1) F 3σ ( ) 1 δk 6 V (t) V (t) F + σ k ξ + 1 σ 1 (t) (t) F. (34) The proof of Corollary 3 is provided in Appendix B.5. Since the alternating gradient descent algorithm updates and V in a symmetric manner, we can establish similar results for a complete iteration of updating in the next corollary. Corollary 4. Suppose that δ k, V (t+1), and (t) satisfy δ k σ 6 k 19ξ kσ 6 1, V (t+1) V (t+1) F σ k 4ξσ1, and (t) (t+1) F σ k. (35) ξσ 1 We then have Moreover, we have (t+1) (t+1) F σ k 4ξσ 1 and V (t+1) V (t+1) F σ k ξσ 1. (t+0.5) (t+1) F δ k (t) (t+1) F + σ k ξ V (t+1) V (t+1) F, (36) (t+1) (t+1) F δ k (t) (t+1) F + 4 σ k ξ V (t+1) V (t+1) F, (37) V (t+1) V (t+1) F 3σ ( ) 1 δk 6 (t) (t+1) F + σ k ξ + 1 σ 1 V (t+1) V (t+1) F. (38) 14

15 The proof of Corollary 4 directly follows Appendix B.5, and is therefore omitted.. Now we proceed with the proof of Theorem 1 for alternating gradient descent. Recall that Lemma 10 ensures that (31) of Corollary 3 holds for (0) and V (0). Then Corollary 3 ensures that (35) of Corollary 4 holds for (0) and V (1). By induction, Corollaries 1 and can be applied recursively for all T iterations. For notational simplicity, we write (3)-(38) as V (t+0.5) V (t) F α 1 V (t) V (t) F + γ 1 σ 1 (t) (t) F, (39) σ 1 V (t+1) V (t+1) F α V (t) V (t) F + γ σ 1 (t) (t) F, (40) (t+0.5) (t+1) F α 3 (t) (t+1) F + γ 3 σ 1 V (t+1) V (t+1) F, (41) σ 1 (t+1) (t+1) F α 4 (t) (t+1) F + γ 4 σ 1 V (t+1) V (t+1) F, (4) (t) (t+1) F α 5 V (t) V (t) F + γ 5 σ 1 (t) (t) F, (43) V (t+1) V (t+1) F α 6 (t) (t+1) F + γ 6 σ 1 V (t+1) V (t+1) F. (44) Note that we have γ 5, γ 6 (1, ), but α 1,...,α 6, γ 1,..., and γ 4 can be sufficiently small as long as ξ is sufficiently large. We then have (t+1) (t+) F (i) α 5 V (t+1) V (t+1) F + γ 5 σ 1 (t+1) (t+1) F (ii) α 5 α 6 (t) (t+1) F + α 5 γ 6 σ 1 V (t+1) V (t+1) F + γ 5 σ 1 (t+1) (t+1) F (iii) (α 5 α 6 + γ 5 α 4 ) (t) (t+1) F + (γ 5 γ 4 σ 1 + α 5 γ 6 )σ 1 V (t+1) V (t+1) F (iv) (α 5 α 6 + γ 5 α 4 ) (t) (t+1) F + (γ 5 γ 4 σ 1 + α 5 γ 6 )α V (t) V (t) F + (γ 5 γ 4 σ 1 + α 5 γ 6 )γ σ 1 (t) (t) F, (45) where (i) comes from (43), (ii) comes from (44), (iii) comes from (4), and (iv) comes from (40). Similarly, we can obtain V (t+1) V (t+1) F α 6 (t) (t+1) F + γ 6 α V (t) V (t) F + γ 6 γ σ 1 (t) (t) F, (46) σ 1 (t+1) (t+1) F α 4 (t) (t+1) F + γ 4 α V (t) V (t) F + γ 4 γ σ 1 (t) (t) F (47) (t+0.5) (t+1) F α 3 (t) (t+1) F + γ 3 α V (t) V (t) F + γ 3 γ σ 1 (t) (t) F. (48) For simplicity, we define φ V (t+1) = V (t+1) V (t+1) F, φ V (t+0.5) = V (t+0.5) V (t) F, φ (t+1) = σ V 1 V (t+1) V (t+1) F, φ (t+1) = (t+1) (t+) F, φ (t+0.5) = (t+0.5) (t+1) F, φ (t+1) = σ 1 (t+1) (t+1) F. 15

16 Then combining (39), (40) with (45) (48), we obtain max { φ V (t+1), φ V (t+0.5), φ V (t+1), φ (t+1), φ (t+0.5), φ (t+1)} β max { φv (t), φ (t), φ (t)}, (49) where β is a contraction coefficient defined as β = max{α 5 α 6 + γ 5 α 4, α 6, α 4, α 3 } + max{α 1, α, (γ 5 γ 4 σ 1 + α 5 γ 6 ), γ 6 α, γ 4 α, γ 3 α } + max{γ 1, γ, (γ 5 γ 4 σ 1 + α 5 γ 6 )γ, γ 6 γ, γ 4 γ, γ 3 γ }. Then we can choose ξ as a sufficiently large constant such that β < 1. By recursively applying (49) for t = 0,..., T, we obtain max { { } φ V (T ), φ V (T 0.5), φ (T V ), φ (T ), φ (T 0.5), φ (T )} β max φv (T 1), φ (T 1), φ (T 1) β max { φ V (T ), φ (T ), φ (T )}... β T max { φ V (0), φ (0), φ (0)}. By Corollary 3, we obtain (0) (1) F 3σ ( ) 1 δk 6 V (0) V (0) F + σ k ξ + 1 σ 1 (0) (0) F (i) 3σ 1 σk 3 σ ( ) k 6 σ σ k 1ξσ1 3 + ξσ 1 ξ + 1 k 4ξσ 1 (ii) = σ4 k 8ξ σ σ k ξ σ 1 + σ k 4ξσ 1 (iii) σ k ξσ 1, (50) where (i) and (ii) come from Lemma 10, and (iii) comes from the definition of ξ and σ 1 σ k. Combining (50) with Lemma 10, we have Then we need at most iterations such that T = V (T ) V F β T max { σ k ξσ 1, { } { σ φv (0), φ (0), φ (0) max k, ξσ 1 ( { σ log max k, ξσ 1 σ k 4ξσ 1 } σ k ξσ 1 σ k 4ξσ 1 }., σ k ξ, σk } 1 ) log 1 (β 1 ) ξσ 1 ɛ ɛ and (T ) F β T max { σ k ξσ 1, σk } 4ξσ1 ɛ σ 1. We then follow similar lines to (6) in 4., and show M (T ) M F ɛ, which completes the proof. 4.4 Proof of Theorem 1 (Gradient Descent) Proof. The convergence analysis of the gradient descent algorithm is similar to that of the alternating gradient descent. The only difference is that for updating, the gradient descent algorithm employs V = V (t) instead of V = V (t+1) to calculate the gradient at = (t). Then everything else directly follows 4.3, and is therefore omitted.. 16

5 Extensions to Matrix Completion

We now extend our methodology and theory to matrix completion problems. Let M* ∈ R^{m×n} be the unknown low rank matrix of interest. We observe a subset of the entries of M*, namely W ⊆ {1, ..., m} × {1, ..., n}. We assume that W is drawn uniformly at random, i.e., M*_{ij} is observed independently with probability ρ ∈ (0, 1]. To exactly recover M*, a common assumption is the incoherence of M*, which will be specified later. A popular approach for recovering M* is to solve the following convex optimization problem

min_{M ∈ R^{m×n}} ‖M‖_*   subject to   P_W(M*) = P_W(M),   (51)

where P_W(·) : R^{m×n} → R^{m×n} is an operator defined as

[P_W(M)]_{ij} = M_{ij} if (i, j) ∈ W, and 0 otherwise.

Similar to matrix sensing, existing algorithms for solving (51) are computationally inefficient. Hence, in practice we usually consider the following nonconvex optimization problem

min_{U ∈ R^{m×k}, V ∈ R^{n×k}} F_W(U, V),   where   F_W(U, V) = (1/2) ‖P_W(M*) − P_W(UV^T)‖_F^2.   (52)

Similar to matrix sensing, (52) can also be efficiently solved by the gradient-based algorithms illustrated in Algorithm 2. For the convenience of the later convergence analysis, we partition the observation set W into 2T + 1 subsets W_0, ..., W_{2T} by Algorithm 4. In practice, however, we do not need the partition scheme, i.e., we simply set W_0 = ... = W_{2T} = W. Before we present the convergence analysis, we first introduce an assumption known as the incoherence property.

Assumption 2 (Incoherence Property). The target rank k matrix M* is incoherent with parameter µ, i.e., given the rank k singular value decomposition M* = Ū* Σ* (V̄*)^T, we have

max_i ‖Ū*_{i*}‖_2 ≤ µ √(k/m)   and   max_j ‖V̄*_{j*}‖_2 ≤ µ √(k/n).

Roughly speaking, the incoherence assumption guarantees that each entry of M* contains a similar amount of information, which makes it feasible to complete M* when its entries are missing uniformly at random. The following theorem establishes the iteration complexity and the estimation error under the Frobenius norm.

Theorem 2. Suppose that there exists a universal constant C_4 such that ρ satisfies

ρ ≥ C_4 µ^2 k^3 log n log(1/ɛ) / m,   (53)

where ɛ is the pre-specified precision. Then there exist an η and universal constants C_5 and C_6 such that for any T ≥ C_5 log(C_6/ɛ), we have ‖M^(T) − M*‖_F ≤ ɛ with high probability.
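To make the completion objective concrete, the sketch below (our own illustration) represents W as a boolean mask and evaluates P_W, the objective F_W in (52), and the partial gradient with respect to V used by the update schemes in Algorithm 2 below.

```python
import numpy as np

def P_W(mask, M):
    """[P_W(M)]_ij = M_ij if (i, j) is observed, 0 otherwise."""
    return np.where(mask, M, 0.0)

def F_W(mask, M_obs, U, V):
    """F_W(U, V) = 0.5 * ||P_W(M*) - P_W(U V^T)||_F^2, cf. (52).

    M_obs is P_W(M*), i.e. the observed entries with zeros elsewhere.
    """
    R = P_W(mask, M_obs - U @ V.T)
    return 0.5 * np.sum(R ** 2)

def grad_V_completion(mask, M_obs, U, V):
    """Partial gradient of F_W with respect to V."""
    R = P_W(mask, U @ V.T - M_obs)   # residual on the observed entries
    return R.T @ U
```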

Algorithm 2: A family of nonconvex optimization algorithms for matrix completion. The incoherence factorization algorithm IF(·) is illustrated in Algorithm 3, and the partition algorithm Partition(·), which is proposed by Hardt and Wootters [2014], is provided in Algorithm 4 of Appendix C for the sake of completeness. The initialization procedures INIT_U(·) and INIT_V(·) are provided in Algorithm 5 and Algorithm 6 of Appendix D for the sake of completeness. Here F_W(·, ·) is defined in (52).

Input: P_W(M*)
Parameter: step size η, total number of iterations T
({W_t}_{t=0}^{2T}, ρ) ← Partition(W), P_{W_0}(M̃) ← P_{W_0}(M*), and M̃_{ij} ← 0 for all (i, j) ∉ W_0
(U^(0), V^(0)) ← INIT_U(M̃), (V^(0), U^(0)) ← INIT_V(M̃)
For t = 0, ..., T − 1
  Updating V:
    Alternating Exact Minimization: V^(t+0.5) ← argmin_V F_{W_{2t+1}}(U^(t), V); (V^(t+1), R_V^(t+0.5)) ← IF(V^(t+0.5))
    Alternating Gradient Descent: V^(t+0.5) ← V^(t) − η ∇_V F_{W_{2t+1}}(U^(t), V^(t)); (V^(t+1), R_V^(t+0.5)) ← IF(V^(t+0.5)), U^(t) ← U^(t) R_V^(t+0.5)
    Gradient Descent: V^(t+0.5) ← V^(t) − η ∇_V F_{W_{2t+1}}(U^(t), V^(t)); (V^(t+1), R_V^(t+0.5)) ← IF(V^(t+0.5)), U^(t+1) ← U^(t) R_V^(t+0.5)
  Updating U:
    Alternating Exact Minimization: U^(t+0.5) ← argmin_U F_{W_{2t+2}}(U, V^(t+1)); (U^(t+1), R_U^(t+0.5)) ← IF(U^(t+0.5))
    Alternating Gradient Descent: U^(t+0.5) ← U^(t) − η ∇_U F_{W_{2t+2}}(U^(t), V^(t+1)); (U^(t+1), R_U^(t+0.5)) ← IF(U^(t+0.5)), V^(t+1) ← V^(t+1) R_U^(t+0.5)
    Gradient Descent: U^(t+0.5) ← U^(t) − η ∇_U F_{W_{2t+2}}(U^(t), V^(t)); (U^(t+1), R_U^(t+0.5)) ← IF(U^(t+0.5)), V^(t+1) ← V^(t) R_U^(t+0.5)
End for
Output: M^(T) ← U^(T−0.5) (V^(T))^T (for gradient descent we use U^(T) (V^(T))^T)

Algorithm 3: The incoherence factorization algorithm for matrix completion. It guarantees that the solutions satisfy the incoherence condition throughout all iterations.

Input: W_in; r ← number of rows of W_in
Parameter: incoherence parameter µ
(W̄_in, R_W^in) ← QR(W_in)
W̃_out ← argmin_W ‖W − W̄_in‖_F subject to max_j ‖W_{j*}‖_2 ≤ µ √(k/r)
(W_out, R_W^tmp) ← QR(W̃_out)
R_W^out ← W_out^T W_in
Output: W_out, R_W^out
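A possible NumPy rendering of the incoherence factorization in Algorithm 3 is sketched below. It is our own reading of the pseudocode above; in particular, the row-wise clipping step is our interpretation of the constrained projection (the projection decouples over rows, so each row is simply rescaled onto the norm ball), so treat this as illustrative rather than as the authors' exact procedure.

```python
import numpy as np

def incoherence_factorization(W_in, mu, k):
    """Sketch of the incoherence factorization IF(.) from Algorithm 3.

    QR-orthonormalize W_in, project onto the set of matrices whose rows have
    l2 norm at most mu * sqrt(k / r), re-orthonormalize, and return the new
    factor together with R_out = W_out^T W_in.
    """
    r = W_in.shape[0]
    W_bar, _ = np.linalg.qr(W_in)
    cap = mu * np.sqrt(k / r)
    row_norms = np.linalg.norm(W_bar, axis=1, keepdims=True)
    scale = np.minimum(1.0, cap / np.maximum(row_norms, 1e-12))
    W_clipped = W_bar * scale   # argmin_W ||W - W_bar||_F s.t. row norms <= cap
    W_out, _ = np.linalg.qr(W_clipped)
    R_out = W_out.T @ W_in
    return W_out, R_out
```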

The proof of Theorem 2 is provided in Sections F.1, F.2, and F.3. Theorem 2 implies that all three nonconvex optimization algorithms converge to the global optimum at a geometric rate. Furthermore, our results indicate that completing the true low rank matrix M* up to ɛ-accuracy requires the entry observation probability ρ to satisfy

ρ = Ω(µ^2 k^3 log n log(1/ɛ) / m).   (54)

This result matches the result established by Hardt [2014], which is the state-of-the-art result for alternating minimization; moreover, our analysis covers three nonconvex optimization algorithms. In fact, the sample complexity in (54) depends on a polynomial of σ_max(M*)/σ_min(M*), which is a constant since in this paper we assume that σ_max(M*) and σ_min(M*) are constants. If we allow σ_max(M*)/σ_min(M*) to increase, we can replace the QR decomposition in Algorithm 3 with the smooth QR decomposition proposed by Hardt and Wootters [2014] and achieve a log(σ_max(M*)/σ_min(M*)) dependency on the condition number with a more involved proof. See Hardt and Wootters [2014] for more details. In this paper, however, our primary focus is the dependency on k, n, and m, rather than optimizing the dependency on the condition number.

6 Numerical Experiments

We present numerical experiments to support our theoretical analysis. We first consider a matrix sensing problem with m = 30, n = 40, and k = 5. We vary d from 300 to 900. Each entry of the A_i's is independently sampled from N(0, 1). We then generate M* = ŨṼ^T, where Ũ ∈ R^{m×k} and Ṽ ∈ R^{n×k} are two matrices with all entries independently sampled from N(0, 1/k). We then generate d measurements by b_i = ⟨A_i, M*⟩ for i = 1, ..., d. Figure 1 illustrates the empirical performance of the alternating exact minimization and alternating gradient descent algorithms for a single realization. The step size for the alternating gradient descent algorithm is determined by the backtracking line search procedure. We see that both algorithms attain a linear rate of convergence for d = 600 and d = 900. Both algorithms fail for d = 300, because d = 300 is below the minimum sample complexity required for exact matrix recovery.

We then consider a matrix completion problem with m = 1000, n = 50, and k = 5. We vary ρ from 0.025 to 0.1. We then generate M* = ŨṼ^T, where Ũ ∈ R^{m×k} and Ṽ ∈ R^{n×k} are two matrices with all entries independently sampled from N(0, 1/k). The observation set is generated uniformly at random with probability ρ. Figure 2 illustrates the empirical performance of the alternating exact minimization and alternating gradient descent algorithms for a single realization. The step size for the alternating gradient descent algorithm is determined by the backtracking line search procedure. We see that both algorithms attain a linear rate of convergence for ρ = 0.05 and ρ = 0.1. Both algorithms fail for ρ = 0.025, because the entry observation probability is below the minimum sample complexity required for exact matrix recovery.

7 Conclusion

In this paper, we propose a generic analysis for characterizing the convergence properties of nonconvex optimization algorithms. By exploiting the inexact first order oracle, we prove that a broad class

20 10 10 Estimation Error d=300 d=600 d= Numer of Iterations (a) Alternating Exact Minimization Algorithm Estaimation Error d=300 d=600 d= Number of Iterations (b) Alternating Gradient Descent Algorithm Figure 1: Two illustrative examples for matrix sensing. The vertical axis corresponds to estimation error M (t) M F. The horizontal axis corresponds to numbers of iterations. Both the alternating exact minimization and alternating gradient descent algorithms attain linear rate of convergence for d = 600 and d = 900. But both algorithms fail for d = 300, because the sample size is not large enough to guarantee proper initial solutions. of nonconvex optimization algorithms converge geometrically to the global optimum and exactly recover the true low rank matrices under suitable conditions. A Lemmas for Theorem 1 (Alternating Exact Minimization) A.1 Proof of Lemma 1 Proof. For notational convenience, we omit the index t in (t) and V (t), and denote them by and V respectively. Then we define two nk nk matrices S (t) 11 S (t) S (t) 1k =. d.... with S(t) pq = A i (t) p (t) q A i, S (t) k1 S (t) i=1 kk G (t) = G (t) 11 G (t) 1k..... G (t) k1 G (t) kk with G(t) pq = d i=1 A i p q A i for 1 p, q k. Note that S (t) and G (t) are essentially the partial Hessian matrices V F( (t), V ) and V F(, V ) for a vectorized V, i.e., vec(v ) R nk. Before we proceed with the main proof, we first introduce the following lemma. 0

21 10 10 Estimation Error ; = 0:05 7; = 0:05 7; = 0: Number of Iterations (a) Alternating Exact Minimization Algorithm Estimation Error ; = 0:05 7; = 0:05 7; = 0: Number of Iterations (b) Alternating Gradient Descent Algorithm Figure : Two illustrative examples for matrix completion. The vertical axis corresponds to estimation error M (t) M F. The horizontal axis corresponds to numbers of iterations. Both the alternating exact minimization and alternating gradient descent algorithms attain linear rate of convergence for ρ = 0.05 and ρ = 0.1. But both algorithms fail for ρ = 0.05, because the entry observation probability is not large enough to guarantee proper initial solutions. Lemma 11. Suppose that A( ) satisfies k-rip with parameter δ k. We then have 1 + δ k σ max (S (t) ) σ min (S (t) ) 1 δ k. The proof of Lemma 11 is provided in Appendix A.7. Note that Lemma 11 is also applicable G (t), since G (t) shares the same structure with S (t). We then proceed with the proof of Lemma 1. Given a fixed, F(, V ) is a quadratic function of V. Therefore we have F(, V ) = F(, V ) + V F(, V ), V V + vec(v ) vec(v ), V F (, V ) ( vec(v ) vec(v ) ), which further implies implies F(, V ) F(, V ) V (, V ), V V σ max ( V F (, V )) V V F F(, V ) F(, V ) V (, V ), V V σ min ( V F (, V )) V V F. Then we can verify that V F (, V ) also shares the same structure with S(t). Thus applying Lemma 11 to the above two inequalities, we complete the proof. 1

22 A. Proof of Lemma 3 Proof. For notational convenience, we omit the index t in (t) and V (t), and denote them by and V respectively. We define two nk nk matrices J (t) 11 J (t) J (t) 1k =. d.... (t) with J pq = A i (t) p q A i, J (t) k1 J (t) i=1 kk K (t) = K (t) 11 K (t) 1k..... K (t) k1 K (t) kk with K(t) pq = (t) p qi n for 1 p, q k. Before we proceed with the main proof, we first introduce the following lemmas. Lemma 1. Suppose that A( ) satisfies k-rip with parameter δ k. We then have S (t) K (t) J (t) 3δ k k (t) F. The proof of Lemma 1 is provided in Appendix A.8. Note that Lemma 1 is also applicable to G (t) K (t) J (t), since G (t) and S (t) share the same structure. Lemma 13. Given F R k k, we define a nk nk matrix F 11 I n F 1k I n F = F k1 I n F kk I n For any V R n k, let v = vec(v ) R nk, then we have Fv = F V F. Proof. By linear algebra, we have which completes the proof. [F V ] ij = F i V j = k F il V jl = l=1 k F il I l V l, l=1 as We then proceed with the proof of Lemma 3. Since b i = tr(v A i ), then we rewrite F(, V ) F(, V ) = 1 d i=1 ( tr(v A i ) b i ) = 1 d ( k Vj A i j i=1 j=1 k j=1 V j A i j). For notational simplicity, we define v = vec(v ). Since V (t+0.5) minimizes F( (t), V ), we have vec ( F( (t), V (t+0.5) ) ) = S (t) v (t+0.5) J (t) v = 0.

23 Solving the above system of equations, we obtain Meanwhile, we have vec( V F(, V (t+0.5) )) = G (t) v (t+0.5) G (t) v v (t+0.5) = (S (t) ) 1 J (t) v. (55) = G (t) (S (t) ) 1 J (t) v G (t) v = G (t)( (S (t) ) 1 J (t) I nk ) v, (56) where the second equality come from (55). By triangle inequality, (56) further implies ((S (t) ) 1 J (t) I nk )v (K (t) I nk )v + (S (t) ) 1 (J (t) S (t) K (t) )v ( (t) I k )V F + (S (t) ) 1 (J (t) S (t) K (t) )v (t) I k F V + (S (t) ) 1 (J (t) S (t) K (t) )v, (57) where the second inequality comes from Lemma 13. Plugging (57) into (56), we have vec( V F(, V (t+0.5) )) G (t) ((S (t) ) 1 J (t) I nk )v (i) (1 + δ k )(σ 1 (t) I k + (S (t) ) 1 S (t) K (t) J (t) σ 1 k) (ii) ( (1 + δ k )σ 1 ( (t) ) ( (t) ) F + 3δ ) kk (t) F 1 δ k (iii) ( (1 + δ k )σ 1 (t) F + 3δ ) kk (t) (iv) F (1 δ k)σ k (t) F, 1 δ k ξ where (i) comes from Lemma 11 and V = M = σ 1 and V F = v σ 1 k, (ii) comes from Lemmas 11 and 1, (iii) from Cauchy-Schwartz inequality, and (iv) comes from (16). Since we have V F( (t), V (t+0.5) ) = 0, we further btain which completes the proof. A.3 Proof of Lemma 4 E(V (t+0.5), V (t+0.5), (t) ) (1 δ k)σ k (t) F, ξ Proof. For notational convenience, we omit the index t in (t) and V (t), and denote them by and V respectively. By the strong convexity of F(, ), we have F(, V ) 1 δ k V (t+0.5) V F F(, V (t+0.5) ) + V F(, V (t+0.5) ), V V (t+0.5). (58) By the strong convexity of F(, ) again, we have F(, V (t+0.5) ) F(, V ) + V F(, V ), V (t+0.5) V (t+0.5) + 1 δ k V (t+0.5) V F F(, V ) + 1 δ k V (t+0.5) V F, (59) 3

24 where the last inequality comes from the optimality condition of V = argmin V F(, V ), i.e. V F(, V ), V (t+0.5) V 0. Meanwhile, since V (t+0.5) minimizes F( (t), ), we have the optimality condition which further implies V F(, V (t+0.5) ), V V (t+0.5) Combining (58) and (59) with (60), we obtain which completes the proof. A.4 Proof of Lemma 5 V F( (t), V (t+0.5) ), V V (t+0.5) 0, V F(, V (t+0.5) ) V F( (t), V (t+0.5) ), V V (t+0.5). (60) V (t+0.5) V 1 1 δ k E(V (t+0.5), V (t+0.5), (t) ), Proof. Before we proceed with the proof, we first introduce the following lemma. Lemma 14. Suppose that A R n k is a rank k matrix. Let E R n k satisfy E A < 1. Then given a QR decomposition (A + E) = QR, there exists a factorization of A = Q O such that Q R n k is an orthonormal matrix, and satisfies A Q Q E F F 1 E A. The proof of Lemma 14 is provided in Stewart et al. [1990], therefore omitted. We then proceed with the proof of Lemma 5. We consider A = V (t) and E = V (t+0.5) V (t) in Lemma 14 respectively. We can verify that V (t+0.5) V (t) V (t) V (t+0.5) V (t) F σ k 1 4. Then there exists a V (t) = V (t+1) O such that V (t+1) is an orthonormal matrix, and satisfies V (t+0.5) V (t+1) F V (t) V (t+0.5) V (t) F σ k V (t+0.5) V (t) F. 4


Characterization of Gradient Dominance and Regularity Conditions for Neural Networks Characterization of Gradient Dominance and Regularity Conditions for Neural Networks Yi Zhou Ohio State University Yingbin Liang Ohio State University Abstract zhou.1172@osu.edu liang.889@osu.edu The past

More information

Lecture notes: Applied linear algebra Part 1. Version 2

Lecture notes: Applied linear algebra Part 1. Version 2 Lecture notes: Applied linear algebra Part 1. Version 2 Michael Karow Berlin University of Technology karow@math.tu-berlin.de October 2, 2008 1 Notation, basic notions and facts 1.1 Subspaces, range and

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

Tighten after Relax: Minimax-Optimal Sparse PCA in Polynomial Time

Tighten after Relax: Minimax-Optimal Sparse PCA in Polynomial Time Tighten after Relax: Minimax-Optimal Sparse PCA in Polynomial Time Zhaoran Wang Huanran Lu Han Liu Department of Operations Research and Financial Engineering Princeton University Princeton, NJ 08540 {zhaoran,huanranl,hanliu}@princeton.edu

More information

Conditions for Robust Principal Component Analysis

Conditions for Robust Principal Component Analysis Rose-Hulman Undergraduate Mathematics Journal Volume 12 Issue 2 Article 9 Conditions for Robust Principal Component Analysis Michael Hornstein Stanford University, mdhornstein@gmail.com Follow this and

More information

arxiv: v3 [math.oc] 19 Oct 2017

arxiv: v3 [math.oc] 19 Oct 2017 Gradient descent with nonconvex constraints: local concavity determines convergence Rina Foygel Barber and Wooseok Ha arxiv:703.07755v3 [math.oc] 9 Oct 207 0.7.7 Abstract Many problems in high-dimensional

More information

Linear Algebra Massoud Malek

Linear Algebra Massoud Malek CSUEB Linear Algebra Massoud Malek Inner Product and Normed Space In all that follows, the n n identity matrix is denoted by I n, the n n zero matrix by Z n, and the zero vector by θ n An inner product

More information

Chapter 3 Transformations

Chapter 3 Transformations Chapter 3 Transformations An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1 Linear Transformations A function is called a linear transformation if 1. for every and 2. for every If we fix the bases

More information

Low-Rank Matrix Recovery

Low-Rank Matrix Recovery ELE 538B: Mathematics of High-Dimensional Data Low-Rank Matrix Recovery Yuxin Chen Princeton University, Fall 2018 Outline Motivation Problem setup Nuclear norm minimization RIP and low-rank matrix recovery

More information

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28 Sparsity Models Tong Zhang Rutgers University T. Zhang (Rutgers) Sparsity Models 1 / 28 Topics Standard sparse regression model algorithms: convex relaxation and greedy algorithm sparse recovery analysis:

More information

An iterative hard thresholding estimator for low rank matrix recovery

An iterative hard thresholding estimator for low rank matrix recovery An iterative hard thresholding estimator for low rank matrix recovery Alexandra Carpentier - based on a joint work with Arlene K.Y. Kim Statistical Laboratory, Department of Pure Mathematics and Mathematical

More information

Dropping Convexity for Faster Semi-definite Optimization

Dropping Convexity for Faster Semi-definite Optimization JMLR: Workshop and Conference Proceedings vol 49: 53, 06 Dropping Convexity for Faster Semi-definite Optimization Srinadh Bhojanapalli Toyota Technological Institute at Chicago Anastasios Kyrillidis Sujay

More information

The convex algebraic geometry of rank minimization

The convex algebraic geometry of rank minimization The convex algebraic geometry of rank minimization Pablo A. Parrilo Laboratory for Information and Decision Systems Massachusetts Institute of Technology International Symposium on Mathematical Programming

More information

Lecture 2: Linear Algebra Review

Lecture 2: Linear Algebra Review EE 227A: Convex Optimization and Applications January 19 Lecture 2: Linear Algebra Review Lecturer: Mert Pilanci Reading assignment: Appendix C of BV. Sections 2-6 of the web textbook 1 2.1 Vectors 2.1.1

More information

Structured matrix factorizations. Example: Eigenfaces

Structured matrix factorizations. Example: Eigenfaces Structured matrix factorizations Example: Eigenfaces An extremely large variety of interesting and important problems in machine learning can be formulated as: Given a matrix, find a matrix and a matrix

More information

SPARSE signal representations have gained popularity in recent

SPARSE signal representations have gained popularity in recent 6958 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011 Blind Compressed Sensing Sivan Gleichman and Yonina C. Eldar, Senior Member, IEEE Abstract The fundamental principle underlying

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Instructor: Moritz Hardt Email: hardt+ee227c@berkeley.edu Graduate Instructor: Max Simchowitz Email: msimchow+ee227c@berkeley.edu

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 11, 2016 Paper presentations and final project proposal Send me the names of your group member (2 or 3 students) before October 15 (this Friday)

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Optimization for Compressed Sensing

Optimization for Compressed Sensing Optimization for Compressed Sensing Robert J. Vanderbei 2014 March 21 Dept. of Industrial & Systems Engineering University of Florida http://www.princeton.edu/ rvdb Lasso Regression The problem is to solve

More information

Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725

Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725 Proximal Gradient Descent and Acceleration Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: subgradient method Consider the problem min f(x) with f convex, and dom(f) = R n. Subgradient method:

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

Guaranteed Rank Minimization via Singular Value Projection: Supplementary Material

Guaranteed Rank Minimization via Singular Value Projection: Supplementary Material Guaranteed Rank Minimization via Singular Value Projection: Supplementary Material Prateek Jain Microsoft Research Bangalore Bangalore, India prajain@microsoft.com Raghu Meka UT Austin Dept. of Computer

More information

arxiv: v1 [math.st] 10 Sep 2015

arxiv: v1 [math.st] 10 Sep 2015 Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees Department of Statistics Yudong Chen Martin J. Wainwright, Department of Electrical Engineering and

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57

More information

On the Local Convergence of an Iterative Approach for Inverse Singular Value Problems

On the Local Convergence of an Iterative Approach for Inverse Singular Value Problems On the Local Convergence of an Iterative Approach for Inverse Singular Value Problems Zheng-jian Bai Benedetta Morini Shu-fang Xu Abstract The purpose of this paper is to provide the convergence theory

More information

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Bo Liu Department of Computer Science, Rutgers Univeristy Xiao-Tong Yuan BDAT Lab, Nanjing University of Information Science and Technology

More information

Lecture 8: February 9

Lecture 8: February 9 0-725/36-725: Convex Optimiation Spring 205 Lecturer: Ryan Tibshirani Lecture 8: February 9 Scribes: Kartikeya Bhardwaj, Sangwon Hyun, Irina Caan 8 Proximal Gradient Descent In the previous lecture, we

More information

Collaborative Filtering Matrix Completion Alternating Least Squares

Collaborative Filtering Matrix Completion Alternating Least Squares Case Study 4: Collaborative Filtering Collaborative Filtering Matrix Completion Alternating Least Squares Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade May 19, 2016

More information

The deterministic Lasso

The deterministic Lasso The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality

More information

arxiv: v5 [math.na] 16 Nov 2017

arxiv: v5 [math.na] 16 Nov 2017 RANDOM PERTURBATION OF LOW RANK MATRICES: IMPROVING CLASSICAL BOUNDS arxiv:3.657v5 [math.na] 6 Nov 07 SEAN O ROURKE, VAN VU, AND KE WANG Abstract. Matrix perturbation inequalities, such as Weyl s theorem

More information

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013.

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013. The University of Texas at Austin Department of Electrical and Computer Engineering EE381V: Large Scale Learning Spring 2013 Assignment Two Caramanis/Sanghavi Due: Tuesday, Feb. 19, 2013. Computational

More information

Sketchy Decisions: Convex Low-Rank Matrix Optimization with Optimal Storage

Sketchy Decisions: Convex Low-Rank Matrix Optimization with Optimal Storage Sketchy Decisions: Convex Low-Rank Matrix Optimization with Optimal Storage Madeleine Udell Operations Research and Information Engineering Cornell University Based on joint work with Alp Yurtsever (EPFL),

More information

Reconstruction from Anisotropic Random Measurements

Reconstruction from Anisotropic Random Measurements Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013

More information

Constructing Explicit RIP Matrices and the Square-Root Bottleneck

Constructing Explicit RIP Matrices and the Square-Root Bottleneck Constructing Explicit RIP Matrices and the Square-Root Bottleneck Ryan Cinoman July 18, 2018 Ryan Cinoman Constructing Explicit RIP Matrices July 18, 2018 1 / 36 Outline 1 Introduction 2 Restricted Isometry

More information

approximation algorithms I

approximation algorithms I SUM-OF-SQUARES method and approximation algorithms I David Steurer Cornell Cargese Workshop, 201 meta-task encoded as low-degree polynomial in R x example: f(x) = i,j n w ij x i x j 2 given: functions

More information

Non-convex Robust PCA: Provable Bounds

Non-convex Robust PCA: Provable Bounds Non-convex Robust PCA: Provable Bounds Anima Anandkumar U.C. Irvine Joint work with Praneeth Netrapalli, U.N. Niranjan, Prateek Jain and Sujay Sanghavi. Learning with Big Data High Dimensional Regime Missing

More information

arxiv: v7 [stat.ml] 9 Feb 2017

arxiv: v7 [stat.ml] 9 Feb 2017 Submitted to the Annals of Applied Statistics arxiv: arxiv:141.7477 PATHWISE COORDINATE OPTIMIZATION FOR SPARSE LEARNING: ALGORITHM AND THEORY By Tuo Zhao, Han Liu and Tong Zhang Georgia Tech, Princeton

More information

Linear Algebra and Eigenproblems

Linear Algebra and Eigenproblems Appendix A A Linear Algebra and Eigenproblems A working knowledge of linear algebra is key to understanding many of the issues raised in this work. In particular, many of the discussions of the details

More information

A Randomized Algorithm for the Approximation of Matrices

A Randomized Algorithm for the Approximation of Matrices A Randomized Algorithm for the Approximation of Matrices Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert Technical Report YALEU/DCS/TR-36 June 29, 2006 Abstract Given an m n matrix A and a positive

More information

Rank Determination for Low-Rank Data Completion

Rank Determination for Low-Rank Data Completion Journal of Machine Learning Research 18 017) 1-9 Submitted 7/17; Revised 8/17; Published 9/17 Rank Determination for Low-Rank Data Completion Morteza Ashraphijuo Columbia University New York, NY 1007,

More information

Lecture Notes 10: Matrix Factorization

Lecture Notes 10: Matrix Factorization Optimization-based data analysis Fall 207 Lecture Notes 0: Matrix Factorization Low-rank models. Rank- model Consider the problem of modeling a quantity y[i, j] that depends on two indices i and j. To

More information

The Projected Power Method: An Efficient Algorithm for Joint Alignment from Pairwise Differences

The Projected Power Method: An Efficient Algorithm for Joint Alignment from Pairwise Differences The Projected Power Method: An Efficient Algorithm for Joint Alignment from Pairwise Differences Yuxin Chen Emmanuel Candès Department of Statistics, Stanford University, Sep. 2016 Nonconvex optimization

More information

Analysis of Robust PCA via Local Incoherence

Analysis of Robust PCA via Local Incoherence Analysis of Robust PCA via Local Incoherence Huishuai Zhang Department of EECS Syracuse University Syracuse, NY 3244 hzhan23@syr.edu Yi Zhou Department of EECS Syracuse University Syracuse, NY 3244 yzhou35@syr.edu

More information

arxiv: v1 [math.oc] 11 Jun 2009

arxiv: v1 [math.oc] 11 Jun 2009 RANK-SPARSITY INCOHERENCE FOR MATRIX DECOMPOSITION VENKAT CHANDRASEKARAN, SUJAY SANGHAVI, PABLO A. PARRILO, S. WILLSKY AND ALAN arxiv:0906.2220v1 [math.oc] 11 Jun 2009 Abstract. Suppose we are given a

More information

Linear Algebra. Session 12

Linear Algebra. Session 12 Linear Algebra. Session 12 Dr. Marco A Roque Sol 08/01/2017 Example 12.1 Find the constant function that is the least squares fit to the following data x 0 1 2 3 f(x) 1 0 1 2 Solution c = 1 c = 0 f (x)

More information

We describe the generalization of Hazan s algorithm for symmetric programming

We describe the generalization of Hazan s algorithm for symmetric programming ON HAZAN S ALGORITHM FOR SYMMETRIC PROGRAMMING PROBLEMS L. FAYBUSOVICH Abstract. problems We describe the generalization of Hazan s algorithm for symmetric programming Key words. Symmetric programming,

More information

r=1 r=1 argmin Q Jt (20) After computing the descent direction d Jt 2 dt H t d + P (x + d) d i = 0, i / J

r=1 r=1 argmin Q Jt (20) After computing the descent direction d Jt 2 dt H t d + P (x + d) d i = 0, i / J 7 Appendix 7. Proof of Theorem Proof. There are two main difficulties in proving the convergence of our algorithm, and none of them is addressed in previous works. First, the Hessian matrix H is a block-structured

More information

Sparsity Regularization

Sparsity Regularization Sparsity Regularization Bangti Jin Course Inverse Problems & Imaging 1 / 41 Outline 1 Motivation: sparsity? 2 Mathematical preliminaries 3 l 1 solvers 2 / 41 problem setup finite-dimensional formulation

More information

Matrix completion: Fundamental limits and efficient algorithms. Sewoong Oh Stanford University

Matrix completion: Fundamental limits and efficient algorithms. Sewoong Oh Stanford University Matrix completion: Fundamental limits and efficient algorithms Sewoong Oh Stanford University 1 / 35 Low-rank matrix completion Low-rank Data Matrix Sparse Sampled Matrix Complete the matrix from small

More information

Lecture Notes 9: Constrained Optimization

Lecture Notes 9: Constrained Optimization Optimization-based data analysis Fall 017 Lecture Notes 9: Constrained Optimization 1 Compressed sensing 1.1 Underdetermined linear inverse problems Linear inverse problems model measurements of the form

More information

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors

More information

Course Notes: Week 1

Course Notes: Week 1 Course Notes: Week 1 Math 270C: Applied Numerical Linear Algebra 1 Lecture 1: Introduction (3/28/11) We will focus on iterative methods for solving linear systems of equations (and some discussion of eigenvalues

More information

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization Numerical Linear Algebra Primer Ryan Tibshirani Convex Optimization 10-725 Consider Last time: proximal Newton method min x g(x) + h(x) where g, h convex, g twice differentiable, and h simple. Proximal

More information

arxiv: v3 [stat.me] 8 Jun 2018

arxiv: v3 [stat.me] 8 Jun 2018 Between hard and soft thresholding: optimal iterative thresholding algorithms Haoyang Liu and Rina Foygel Barber arxiv:804.0884v3 [stat.me] 8 Jun 08 June, 08 Abstract Iterative thresholding algorithms

More information

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2017 LECTURE 5

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2017 LECTURE 5 STAT 39: MATHEMATICAL COMPUTATIONS I FALL 17 LECTURE 5 1 existence of svd Theorem 1 (Existence of SVD) Every matrix has a singular value decomposition (condensed version) Proof Let A C m n and for simplicity

More information

Limitations in Approximating RIP

Limitations in Approximating RIP Alok Puranik Mentor: Adrian Vladu Fifth Annual PRIMES Conference, 2015 Outline 1 Background The Problem Motivation Construction Certification 2 Planted model Planting eigenvalues Analysis Distinguishing

More information

1 Cricket chirps: an example

1 Cricket chirps: an example Notes for 2016-09-26 1 Cricket chirps: an example Did you know that you can estimate the temperature by listening to the rate of chirps? The data set in Table 1 1. represents measurements of the number

More information

5.6. PSEUDOINVERSES 101. A H w.

5.6. PSEUDOINVERSES 101. A H w. 5.6. PSEUDOINVERSES 0 Corollary 5.6.4. If A is a matrix such that A H A is invertible, then the least-squares solution to Av = w is v = A H A ) A H w. The matrix A H A ) A H is the left inverse of A and

More information

Self-Calibration and Biconvex Compressive Sensing

Self-Calibration and Biconvex Compressive Sensing Self-Calibration and Biconvex Compressive Sensing Shuyang Ling Department of Mathematics, UC Davis July 12, 2017 Shuyang Ling (UC Davis) SIAM Annual Meeting, 2017, Pittsburgh July 12, 2017 1 / 22 Acknowledgements

More information

Deep Linear Networks with Arbitrary Loss: All Local Minima Are Global

Deep Linear Networks with Arbitrary Loss: All Local Minima Are Global homas Laurent * 1 James H. von Brecht * 2 Abstract We consider deep linear networks with arbitrary convex differentiable loss. We provide a short and elementary proof of the fact that all local minima

More information

6 Compressed Sensing and Sparse Recovery

6 Compressed Sensing and Sparse Recovery 6 Compressed Sensing and Sparse Recovery Most of us have noticed how saving an image in JPEG dramatically reduces the space it occupies in our hard drives as oppose to file types that save the pixel value

More information

Matrix estimation by Universal Singular Value Thresholding

Matrix estimation by Universal Singular Value Thresholding Matrix estimation by Universal Singular Value Thresholding Courant Institute, NYU Let us begin with an example: Suppose that we have an undirected random graph G on n vertices. Model: There is a real symmetric

More information

A New Estimate of Restricted Isometry Constants for Sparse Solutions

A New Estimate of Restricted Isometry Constants for Sparse Solutions A New Estimate of Restricted Isometry Constants for Sparse Solutions Ming-Jun Lai and Louis Y. Liu January 12, 211 Abstract We show that as long as the restricted isometry constant δ 2k < 1/2, there exist

More information

A fast randomized algorithm for overdetermined linear least-squares regression

A fast randomized algorithm for overdetermined linear least-squares regression A fast randomized algorithm for overdetermined linear least-squares regression Vladimir Rokhlin and Mark Tygert Technical Report YALEU/DCS/TR-1403 April 28, 2008 Abstract We introduce a randomized algorithm

More information

Greedy Signal Recovery and Uniform Uncertainty Principles

Greedy Signal Recovery and Uniform Uncertainty Principles Greedy Signal Recovery and Uniform Uncertainty Principles SPIE - IE 2008 Deanna Needell Joint work with Roman Vershynin UC Davis, January 2008 Greedy Signal Recovery and Uniform Uncertainty Principles

More information

Linear Algebra. Paul Yiu. Department of Mathematics Florida Atlantic University. Fall A: Inner products

Linear Algebra. Paul Yiu. Department of Mathematics Florida Atlantic University. Fall A: Inner products Linear Algebra Paul Yiu Department of Mathematics Florida Atlantic University Fall 2011 6A: Inner products In this chapter, the field F = R or C. We regard F equipped with a conjugation χ : F F. If F =

More information

Lecture 9: Numerical Linear Algebra Primer (February 11st)

Lecture 9: Numerical Linear Algebra Primer (February 11st) 10-725/36-725: Convex Optimization Spring 2015 Lecture 9: Numerical Linear Algebra Primer (February 11st) Lecturer: Ryan Tibshirani Scribes: Avinash Siravuru, Guofan Wu, Maosheng Liu Note: LaTeX template

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

CHAPTER 11. A Revision. 1. The Computers and Numbers therein

CHAPTER 11. A Revision. 1. The Computers and Numbers therein CHAPTER A Revision. The Computers and Numbers therein Traditional computer science begins with a finite alphabet. By stringing elements of the alphabet one after another, one obtains strings. A set of

More information

arxiv: v2 [cs.it] 19 Sep 2016

arxiv: v2 [cs.it] 19 Sep 2016 Fast Algorithms for Robust PCA via Gradient Descent Xinyang Yi Dohyung Park Yudong Chen Constantine Caramanis The University of Texas at Austin Cornell University {yixy,dhpark,constantine}@utexas.edu yudong.chen@cornell.edu

More information

Compressibility of Infinite Sequences and its Interplay with Compressed Sensing Recovery

Compressibility of Infinite Sequences and its Interplay with Compressed Sensing Recovery Compressibility of Infinite Sequences and its Interplay with Compressed Sensing Recovery Jorge F. Silva and Eduardo Pavez Department of Electrical Engineering Information and Decision Systems Group Universidad

More information

Grothendieck s Inequality

Grothendieck s Inequality Grothendieck s Inequality Leqi Zhu 1 Introduction Let A = (A ij ) R m n be an m n matrix. Then A defines a linear operator between normed spaces (R m, p ) and (R n, q ), for 1 p, q. The (p q)-norm of A

More information