Nonconvex Low Rank Matrix Factorization via Inexact First Order Oracle


Tuo Zhao, Zhaoran Wang, Han Liu

(Tuo Zhao is affiliated with the Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA, and the Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA; tour@cs.jhu.edu. Zhaoran Wang is affiliated with the Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA; zhaoran@princeton.edu. Han Liu is affiliated with the Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA; hanliu@princeton.edu.)

Abstract

We study the low rank matrix factorization problem via nonconvex optimization. Compared with the convex relaxation approach, nonconvex optimization exhibits superior empirical performance for large scale low rank matrix estimation. However, the understanding of its theoretical guarantees is limited. To bridge this gap, we exploit the notion of inexact first order oracle, which naturally appears in low rank matrix factorization problems such as matrix sensing and completion. In particular, our analysis shows that a broad class of nonconvex optimization algorithms, including alternating minimization and gradient-type methods, can be treated as solving two sequences of convex optimization problems using the inexact first order oracle. Thus we can show that these algorithms converge geometrically to the global optima and recover the true low rank matrices under suitable conditions. Numerical results are provided to support our theory.

1 Introduction

Let $M^* \in \mathbb{R}^{m\times n}$ be a rank-$k$ matrix with $k$ much smaller than $m$ and $n$. Our goal is to estimate $M^*$ based on partial observations of its entries. For example, matrix completion is based on a subsample of $M^*$'s entries, while matrix sensing is based on linear measurements $\langle A_i, M^*\rangle$, where $i \in \{1,\dots,d\}$ with $d$ much smaller than $mn$ and $A_i$ is the sensing matrix.

In the past decade, significant progress has been made on the recovery of low rank matrices [Candès and Recht, 2009, Candès and Tao, 2010, Candes and Plan, 2010, Recht et al., 2010, Lee and Bresler, 2010, Keshavan et al., 2010a,b, Jain et al., 2010, Cai et al., 2010, Recht, 2011, Gross, 2011, Chandrasekaran et al., 2011, Hsu et al., 2011, Rohde and Tsybakov, 2011, Koltchinskii et al., 2011, Negahban and Wainwright, 2011, Chen et al., 2011, Xiang et al., 2012, Negahban and Wainwright, 2012, Agarwal et al., 2012, Recht and Ré, 2013, Chen, 2013, Chen et al., 2013a,b, Jain et al., 2013, Jain and Netrapalli, 2014, Hardt, 2014, Hardt et al., 2014, Hardt and Wootters, 2014, Sun and Luo, 2014, Hastie et al., 2014, Cai and Zhang, 2015, Yan et al., 2015, Zhu et al., 2015, Wang et al., 2015]. Among these works, most are based upon convex relaxation with a nuclear norm constraint or regularization. Nevertheless, solving these convex optimization problems can be computationally prohibitive in high dimensional regimes with large $m$ and $n$ [Hsieh and Olsen, 2014].

A computationally more efficient alternative is nonconvex optimization. In particular, we reparameterize the $m\times n$ matrix variable $M$ in the optimization problem as $UV^\top$ with $U \in \mathbb{R}^{m\times k}$ and $V \in \mathbb{R}^{n\times k}$, and optimize over $U$ and $V$. Such a reparametrization automatically enforces the low rank structure and leads to low computational cost per iteration. For this reason, the nonconvex approach is widely used in large scale applications such as recommendation systems and collaborative filtering [Koren, 2009, Koren et al., 2009]. Despite the superior empirical performance of the nonconvex approach, the understanding of its theoretical guarantees is rather limited in comparison with the convex relaxation approach. The classical nonconvex optimization theory can only show sublinear convergence to local optima, yet many empirical results have corroborated its exceptional computational performance and convergence to global optima. Only recently has there been theoretical analysis of the block coordinate descent-type nonconvex optimization algorithm, which is known as alternating minimization [Jain et al., 2013, Hardt, 2014, Hardt et al., 2014, Hardt and Wootters, 2014]. In particular, the existing results show that, provided a proper initialization, the alternating minimization algorithm attains a linear rate of convergence to a global optimum $U^* \in \mathbb{R}^{m\times k}$ and $V^* \in \mathbb{R}^{n\times k}$, which satisfy $M^* = U^*V^{*\top}$. Meanwhile, Keshavan et al. [2010a,b] establish the convergence of gradient-type methods, and Sun and Luo [2014] further establish the convergence of a broad class of nonconvex optimization algorithms including both gradient-type and block coordinate descent-type methods. However, Keshavan et al. [2010a,b] and Sun and Luo [2014] only establish asymptotic convergence over an infinite number of iterations, rather than an explicit rate of convergence. Besides these works, Lee and Bresler [2010], Jain et al. [2010], and Jain and Netrapalli [2014] consider projected gradient-type methods, which optimize over the matrix variable $M \in \mathbb{R}^{m\times n}$ rather than $U \in \mathbb{R}^{m\times k}$ and $V \in \mathbb{R}^{n\times k}$. These methods involve calculating the top $k$ singular vectors of an $m\times n$ matrix at each iteration. For $k$ much smaller than $m$ and $n$, they incur much higher computational cost per iteration than the aforementioned methods that optimize over $U$ and $V$. All these works, except Sun and Luo [2014], focus on specific algorithms, while Sun and Luo [2014] do not establish an explicit optimization rate of convergence.

In this paper, we propose a new theory for analyzing a broad class of nonconvex optimization algorithms for low rank matrix estimation. The core of our theory is the notion of inexact first order oracle. Based on the inexact first order oracle, we establish sufficient conditions under which the iteration sequences converge geometrically to the global optima.

For both matrix sensing and completion, a direct consequence of our theory is that a broad family of nonconvex optimization algorithms, including gradient descent, block coordinate gradient descent, and block coordinate minimization, attains linear rates of convergence to the true low rank matrices $U^*$ and $V^*$. In particular, our proposed theory covers alternating minimization as a special case and recovers the results of Jain et al. [2013], Hardt [2014], Hardt et al. [2014], and Hardt and Wootters [2014] under suitable conditions. Meanwhile, our approach covers gradient-type methods, which are also widely used in practice [Takács et al., 2007, Paterek, 2007, Koren et al., 2009, Gemulla et al., 2011, Recht and Ré, 2013, Zhuang et al., 2013]. To the best of our knowledge, our analysis is the first that establishes exact recovery guarantees and geometric rates of convergence for a broad family of nonconvex matrix sensing and completion algorithms.

To achieve maximum generality, our unified analysis significantly differs from previous works. In detail, Jain et al. [2013], Hardt [2014], Hardt et al. [2014], and Hardt and Wootters [2014] view alternating minimization as an approximate power method. However, their point of view relies on the closed form solution of each iteration of alternating minimization, which makes it difficult to generalize to other algorithms, e.g., gradient-type methods. Meanwhile, Sun and Luo [2014] take a geometric point of view. In detail, they show that the global optimum of the optimization problem is the unique stationary point within its neighborhood, and thus a broad class of algorithms succeed. However, such geometric analysis of the objective function does not characterize the convergence rate of specific algorithms towards the stationary point. Unlike existing results, we analyze nonconvex optimization algorithms as approximate convex counterparts. For example, our analysis views alternating minimization on a nonconvex objective function as an approximate block coordinate minimization on some convex objective function. We use a key quantity, the inexact first order oracle, to characterize such a perturbation effect, which results from the local nonconvexity at intermediate solutions. This eventually allows us to establish explicit rates of convergence in a way analogous to existing convex optimization analysis.

Our proposed inexact first order oracle is closely related to a series of previous works on inexact or approximate gradient descent algorithms: Güler [1992], Luo and Tseng [1993], Nedić and Bertsekas [2001], d'Aspremont [2008], Baes [2009], Friedlander and Schmidt [2012], Devolder et al. [2014]. Different from these existing results, which focus on convex minimization, we show that the inexact first order oracle can also sharply capture the evolution of generic optimization algorithms even in the presence of nonconvexity. More recently, Candes et al. [2014], Balakrishnan et al. [2014], and Arora et al. [2015] respectively analyze the Wirtinger Flow algorithm for phase retrieval, the expectation maximization (EM) algorithm for latent variable models, and the gradient descent algorithm for sparse coding based on a similar idea to ours. Though their analyses exploit similar nonconvex structures, they work on completely different problems, and the delivered technical results are also fundamentally different.

A conference version of this paper was presented at the Annual Conference on Neural Information Processing Systems 2015 [Zhao et al., 2015]. While our conference version was under review, similar work was released on arxiv.org by Zheng and Lafferty [2015], Bhojanapalli et al. [2015], Tu et al. [2015], and Chen and Wainwright [2015]. These works focus on symmetric positive semidefinite low rank matrix factorization problems. In contrast, our proposed methodologies and theory do not require symmetry or positive semidefiniteness, and therefore can be applied to rectangular low rank matrix factorization problems.

The rest of this paper is organized as follows. In §2, we review the matrix sensing problem and then introduce a general class of nonconvex optimization algorithms. In §3, we present the convergence analysis of the algorithms. In §4, we lay out the proof. In §5, we extend the proposed methodology and theory to the matrix completion problem. In §6, we provide numerical experiments and draw the conclusion.

Notation. For $v = (v_1,\dots,v_d)^\top \in \mathbb{R}^d$, we define the vector $\ell_q$ norm as $\|v\|_q^q = \sum_j v_j^q$. We define $e_i$ as the indicator vector whose $i$-th entry is one and all other entries are zero. For a matrix $A \in \mathbb{R}^{m\times n}$, we use $A_{*j} = (A_{1j},\dots,A_{mj})^\top$ to denote the $j$-th column of $A$ and $A_{i*} = (A_{i1},\dots,A_{in})^\top$ to denote the $i$-th row of $A$. Let $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$ be the largest and smallest nonzero singular values of $A$. We define the following matrix norms: $\|A\|_F^2 = \sum_j \|A_{*j}\|_2^2$ and $\|A\|_2 = \sigma_{\max}(A)$. Moreover, we define $\|A\|_*$ to be the sum of all singular values of $A$, and $A^\dagger$ to be the Moore-Penrose pseudoinverse of $A$. Given another matrix $B \in \mathbb{R}^{m\times n}$, we define the inner product as $\langle A, B\rangle = \sum_{i,j} A_{ij}B_{ij}$. For a bivariate function $f(u,v)$, we define $\nabla_u f(u,v)$ to be the gradient with respect to $u$. Moreover, we use the common notations $\Omega(\cdot)$, $O(\cdot)$, and $o(\cdot)$ to characterize the asymptotics of two real sequences.

2 Matrix Sensing

We start with the matrix sensing problem. Let $M^* \in \mathbb{R}^{m\times n}$ be the unknown low rank matrix of interest. We have $d$ sensing matrices $A_i \in \mathbb{R}^{m\times n}$ with $i \in \{1,\dots,d\}$. Our goal is to estimate $M^*$ based on $b_i = \langle A_i, M^*\rangle$ in the high dimensional regime with $d$ much smaller than $mn$. Under such a regime, a common assumption is $\mathrm{rank}(M^*) = k \ll \min\{d, m, n\}$. Existing approaches generally recover $M^*$ by solving the following convex optimization problem:

$\min_{M \in \mathbb{R}^{m\times n}} \|M\|_*$ subject to $b = \mathcal{A}(M)$,   (1)

where $b = [b_1,\dots,b_d]^\top \in \mathbb{R}^d$, and $\mathcal{A}(M): \mathbb{R}^{m\times n} \to \mathbb{R}^d$ is an operator defined as

$\mathcal{A}(M) = [\langle A_1, M\rangle, \dots, \langle A_d, M\rangle]^\top \in \mathbb{R}^d.$   (2)

Existing convex optimization algorithms for solving (1) are computationally inefficient, since they incur high per-iteration computational cost and only attain sublinear rates of convergence to the global optimum [Jain et al., 2013, Hsieh and Olsen, 2014]. Therefore, in large scale settings we usually consider the following nonconvex optimization problem instead:

$\min_{U \in \mathbb{R}^{m\times k},\, V \in \mathbb{R}^{n\times k}} F(U,V)$, where $F(U,V) = \tfrac{1}{2}\|b - \mathcal{A}(UV^\top)\|_2^2.$   (3)

The reparametrization $M = UV^\top$, though making the problem in (3) nonconvex, significantly improves the computational efficiency.
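To make the sensing operator (2) and the nonconvex objective (3) concrete, the following is a minimal NumPy sketch (not the authors' code). The Gaussian sensing matrices and all problem sizes are illustrative assumptions on our part.

```python
# Hedged sketch of A(.) in (2) and F(U, V) in (3); sizes and the Gaussian
# ensemble are illustrative choices, not prescribed by the paper.
import numpy as np

rng = np.random.default_rng(0)
m, n, k, d = 30, 40, 5, 600

A = rng.standard_normal((d, m, n))                                   # sensing matrices A_i
M_star = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))   # rank-k target M*

def sensing_operator(M):
    """A(M) = [<A_1, M>, ..., <A_d, M>] as in (2)."""
    return np.einsum('dij,ij->d', A, M)

b = sensing_operator(M_star)                                         # b_i = <A_i, M*>

def objective(U, V):
    """F(U, V) = 0.5 * ||b - A(U V^T)||_2^2 as in (3)."""
    r = b - sensing_operator(U @ V.T)
    return 0.5 * r @ r
```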

Existing literature [Koren, 2009, Koren et al., 2009, Takács et al., 2007, Paterek, 2007, Gemulla et al., 2011, Recht and Ré, 2013, Zhuang et al., 2013] has established convincing evidence that (3) can be effectively solved by a broad variety of gradient-based nonconvex optimization algorithms, including gradient descent, alternating exact minimization (i.e., alternating least squares or block coordinate minimization), as well as alternating gradient descent (i.e., block coordinate gradient descent), as illustrated in Algorithm 1.

Algorithm 1: A family of nonconvex optimization algorithms for matrix sensing. Here $(U, D, V) \leftarrow \mathrm{KSVD}(M)$ is the rank-$k$ singular value decomposition of $M$: $D$ is a diagonal matrix containing the top $k$ singular values of $M$ in decreasing order, and $U$ and $V$ contain the corresponding top $k$ left and right singular vectors of $M$. $(V, R_V) \leftarrow \mathrm{QR}(V)$ is the QR decomposition, where $V$ is the corresponding orthonormal matrix and $R_V$ is the corresponding upper triangular matrix.

  Input: $\{b_i\}_{i=1}^d$, $\{A_i\}_{i=1}^d$
  Parameter: step size $\eta$, total number of iterations $T$
  $(U^{(0)}, D^{(0)}, V^{(0)}) \leftarrow \mathrm{KSVD}\big(\sum_{i=1}^d b_i A_i\big)$, $V^{(0)} \leftarrow V^{(0)} D^{(0)}$, $U^{(0)} \leftarrow U^{(0)} D^{(0)}$
  For $t = 0, \dots, T-1$:
    Updating $V$:
      Alternating exact minimization: $V^{(t+0.5)} \leftarrow \arg\min_V F(U^{(t)}, V)$; $(V^{(t+1)}, R_V^{(t+0.5)}) \leftarrow \mathrm{QR}(V^{(t+0.5)})$
      Alternating gradient descent: $V^{(t+0.5)} \leftarrow V^{(t)} - \eta \nabla_V F(U^{(t)}, V^{(t)})$; $(V^{(t+1)}, R_V^{(t+0.5)}) \leftarrow \mathrm{QR}(V^{(t+0.5)})$, $U^{(t)} \leftarrow U^{(t)} (R_V^{(t+0.5)})^\top$
      Gradient descent: $V^{(t+0.5)} \leftarrow V^{(t)} - \eta \nabla_V F(U^{(t)}, V^{(t)})$; $(V^{(t+1)}, R_V^{(t+0.5)}) \leftarrow \mathrm{QR}(V^{(t+0.5)})$, $U^{(t+1)} \leftarrow U^{(t)} (R_V^{(t+0.5)})^\top$
    Updating $U$:
      Alternating exact minimization: $U^{(t+0.5)} \leftarrow \arg\min_U F(U, V^{(t+1)})$; $(U^{(t+1)}, R_U^{(t+0.5)}) \leftarrow \mathrm{QR}(U^{(t+0.5)})$
      Alternating gradient descent: $U^{(t+0.5)} \leftarrow U^{(t)} - \eta \nabla_U F(U^{(t)}, V^{(t+1)})$; $(U^{(t+1)}, R_U^{(t+0.5)}) \leftarrow \mathrm{QR}(U^{(t+0.5)})$, $V^{(t+1)} \leftarrow V^{(t+1)} (R_U^{(t+0.5)})^\top$
      Gradient descent: $U^{(t+0.5)} \leftarrow U^{(t)} - \eta \nabla_U F(U^{(t)}, V^{(t)})$; $(U^{(t+1)}, R_U^{(t+0.5)}) \leftarrow \mathrm{QR}(U^{(t+0.5)})$, $V^{(t+1)} \leftarrow V^{(t)} (R_U^{(t+0.5)})^\top$
  End for
  Output: $M^{(T)} \leftarrow U^{(T-0.5)} V^{(T)\top}$ (for gradient descent we use $U^{(T)} V^{(T)\top}$)

It is worth noting that the QR decomposition and rank-$k$ singular value decomposition in Algorithm 1 can be accomplished efficiently. In particular, the QR decomposition can be accomplished in $O(k^2\max\{m,n\})$ operations, while the rank-$k$ singular value decomposition can be accomplished in $O(mnk)$ operations. In fact, the QR decomposition is not necessary for particular update schemes; e.g., Jain et al. [2013] prove that the alternating exact minimization update schemes with and without the QR decomposition are equivalent.
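The following is a hedged sketch of the alternating gradient descent branch of Algorithm 1, reusing sensing_operator, objective, A, b, d, and k from the previous sketch. The step size, iteration count, the 1/d scaling in the initialization, and keeping the singular values on the V factor are our own illustrative choices.

```python
# Hedged sketch of the alternating gradient descent variant of Algorithm 1.
import numpy as np

def residual_matrix(U, V):
    """sum_i r_i A_i with r = A(U V^T) - b; contracting it with U or V gives the gradients."""
    r = sensing_operator(U @ V.T) - b
    return np.einsum('d,dij->ij', r, A)

def alternating_gradient_descent(T=300, eta=None):
    # Rank-k SVD initialization (cf. Algorithm 1); the 1/d scaling and keeping U
    # orthonormal while V carries the singular values are our simplifications.
    L, s, Rt = np.linalg.svd(np.einsum('d,dij->ij', b, A) / d)
    U, V = L[:, :k], Rt[:k, :].T * s[:k]
    if eta is None:
        eta = 1.0 / (2 * d)            # rough 1/(2*smoothness) guess for this ensemble
    for _ in range(T):
        V = V - eta * residual_matrix(U, V).T @ U   # gradient step in V
        V, R = np.linalg.qr(V)                      # renormalize V
        U = U @ R.T                                 # absorb R so U V^T is unchanged
        U = U - eta * residual_matrix(U, V) @ V     # gradient step in U
        U, R = np.linalg.qr(U)
        V = V @ R.T
    return U @ V.T

# For the sizes above, objective at the output should be driven close to zero.
M_hat = alternating_gradient_descent()
```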

3 Convergence Analysis

We analyze the convergence of the algorithms illustrated in §2. Before we present the main results, we first introduce a unified analytical framework based on a key quantity named the inexact first order oracle. Such a unified framework equips our theory with the maximum generality. Without loss of generality, we assume $m \le n$ throughout the rest of this paper.

3.1 Main Idea

We first provide an intuitive explanation for the success of nonconvex optimization algorithms, which forms the basis of our later analysis of the main results in §4. Recall that (3) can be written as a special instance of the following optimization problem:

$\min_{U \in \mathbb{R}^{m\times k},\, V \in \mathbb{R}^{n\times k}} f(U, V).$   (4)

A key observation is that, given a fixed $U$, $f(U, \cdot)$ is strongly convex and smooth in $V$ under suitable conditions, and the same also holds for $U$ given a fixed $V$. For the convenience of discussion, we summarize this observation in the following technical condition, which will later be verified for matrix sensing and completion under suitable conditions.

Condition 1 (Strong Biconvexity and Bismoothness). There exist universal constants $\mu_+ > 0$ and $\mu_- > 0$ such that

$\frac{\mu_-}{2}\|U - U'\|_F^2 \le f(U, V) - f(U', V) - \langle U - U', \nabla_U f(U', V)\rangle \le \frac{\mu_+}{2}\|U - U'\|_F^2$ for all $U, U'$,
$\frac{\mu_-}{2}\|V - V'\|_F^2 \le f(U, V) - f(U, V') - \langle V - V', \nabla_V f(U, V')\rangle \le \frac{\mu_+}{2}\|V - V'\|_F^2$ for all $V, V'$.

3.1.1 Ideal First Order Oracle

To ease presentation, we assume that $U^*$ and $V^*$ are the unique global minimizers of the generic optimization problem in (4). Assuming that $U^*$ is given, we can obtain $V^*$ by

$V^* = \arg\min_{V \in \mathbb{R}^{n\times k}} f(U^*, V).$   (5)

Condition 1 implies that the objective function in (5) is strongly convex and smooth. Hence, we can choose any gradient-based algorithm to obtain $V^*$. For example, we can directly solve for $V^*$ in

$\nabla_V f(U^*, V) = 0,$   (6)

or iteratively solve for $V^*$ using gradient descent, i.e.,

$V^{(t)} = V^{(t-1)} - \eta \nabla_V f(U^*, V^{(t-1)}),$   (7)

where $\eta$ is a step size.
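For the matrix sensing objective (3), the subproblem (5)-(6) with the $U$ argument held fixed is an ordinary least-squares problem, since $\langle A_i, UV^\top\rangle$ is linear in $V$. Below is a small sketch of that exact update; it reuses A, b, and k from the earlier sketch, and the use of np.linalg.lstsq is our own choice of solver.

```python
# Hedged sketch of the exact minimization over V (cf. (5)-(6)) for the sensing
# objective, with a fixed factor U plugged in place of the unknown U*.
import numpy as np

def exact_V_update(U):
    """Return argmin_V 0.5 * ||b - A(U V^T)||_2^2 by solving a least-squares problem."""
    # <A_i, U V^T> = <A_i^T U, V> = vec(A_i^T U) . vec(V), so stack these as rows.
    X = np.stack([(Ai.T @ U).ravel() for Ai in A])   # shape (d, n*k)
    v, *_ = np.linalg.lstsq(X, b, rcond=None)
    return v.reshape(-1, k)                          # V with shape (n, k)
```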

Taking gradient descent as an example, we can invoke classical convex optimization results [Nesterov, 2004] to prove that $\|V^{(t)} - V^*\|_F \le \kappa\|V^{(t-1)} - V^*\|_F$ for all $t = 0, 1, 2, \dots$, where $\kappa \in (0,1)$ depends only on $\mu_+$ and $\mu_-$ in Condition 1. For notational simplicity, we call $\nabla_V f(U^*, V^{(t-1)})$ the ideal first order oracle, since we do not know $U^*$ in practice.

3.1.2 Inexact First Order Oracle

Though the ideal first order oracle is not accessible in practice, it provides us with insights for analyzing nonconvex optimization algorithms. Taking gradient descent as an example, at the $t$-th iteration we take a gradient descent step over $V$ based on $\nabla_V f(U, V^{(t-1)})$. We can treat $\nabla_V f(U, V^{(t-1)})$ as an approximation of $\nabla_V f(U^*, V^{(t-1)})$, where the approximation error comes from approximating $U^*$ by $U$. Then the relationship between $\nabla_V f(U, V^{(t-1)})$ and $\nabla_V f(U^*, V^{(t-1)})$ is similar to that between the gradient and the approximate gradient in the existing literature on convex optimization. For simplicity, we call $\nabla_V f(U, V^{(t-1)})$ the inexact first order oracle. To characterize the difference between $\nabla_V f(U, V^{(t-1)})$ and $\nabla_V f(U^*, V^{(t-1)})$, we define the approximation error of the inexact first order oracle as

$E(V, V', U) = \|\nabla_V f(U, V') - \nabla_V f(U^*, V')\|_F,$   (8)

where $V'$ is the current decision variable at which the gradient is evaluated. In the above example, it holds with $V' = V^{(t-1)}$. Later we will illustrate that $E(V, V', U)$ is critical to our analysis. In the above example of alternating gradient descent, we will prove that, for $V^{(t)} = V^{(t-1)} - \eta\nabla_V f(U, V^{(t-1)})$, we have

$\|V^{(t)} - V^*\|_F \le \kappa\|V^{(t-1)} - V^*\|_F + \frac{2}{\mu_+} E(V^{(t)}, V^{(t-1)}, U).$   (9)

In other words, $E(V^{(t)}, V^{(t-1)}, U)$ captures the perturbation effect of employing the inexact first order oracle $\nabla_V f(U, V^{(t-1)})$ instead of the ideal first order oracle $\nabla_V f(U^*, V^{(t-1)})$. For $V^{(t)} = \arg\min_V f(U, V)$, we will prove that

$\|V^{(t)} - V^*\|_F \le \frac{1}{\mu_-} E(V^{(t)}, V^{(t)}, U).$   (10)

According to the update schemes shown in Algorithms 1 and 2, for alternating exact minimization we set $U = U^{(t)}$ in (10), while for gradient descent or alternating gradient descent we set $U = U^{(t-1)}$ or $U = U^{(t)}$ in (9), respectively. Due to symmetry, similar results also hold for $\|U^{(t)} - U^*\|_F$.

To establish the geometric rate of convergence towards the global minima $U^*$ and $V^*$, it remains to establish upper bounds for the approximation error of the inexact first order oracle. Taking gradient descent as an example, we will prove that, given an appropriate initial solution, we have

$\frac{2}{\mu_+} E(V^{(t)}, V^{(t-1)}, U^{(t-1)}) \le \alpha\,\|U^{(t-1)} - U^*\|_F$   (11)

for some $\alpha \in (0, 1-\kappa)$.
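For the sensing objective, the approximation error (8) can be computed directly once an orthonormal factorization of $M^*$ is fixed. The sketch below takes $U^*$ to be the top-$k$ left singular vectors of $M^*$ (one such factorization) and drops the first argument of (8), since for this quadratic objective the gradient depends only on the evaluation point; it reuses sensing_operator, A, b, M_star, and k from the earlier sketches.

```python
# Hedged sketch of the inexact-oracle error (8) for matrix sensing.
import numpy as np

def grad_V(U, V):
    """grad_V F(U, V) = (sum_i r_i A_i)^T U with r = A(U V^T) - b."""
    r = sensing_operator(U @ V.T) - b
    return np.einsum('d,dij->ij', r, A).T @ U

def oracle_error(U, V_eval):
    """|| grad_V F(U, V') - grad_V F(U*, V') ||_F, cf. (8), with U* from an SVD of M*."""
    U_star = np.linalg.svd(M_star)[0][:, :k]
    return np.linalg.norm(grad_V(U, V_eval) - grad_V(U_star, V_eval))
```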

Combining (11) with (9) (where we take $U = U^{(t-1)}$), we further obtain

$\|V^{(t)} - V^*\|_F \le \kappa\|V^{(t-1)} - V^*\|_F + \alpha\|U^{(t-1)} - U^*\|_F.$   (12)

Correspondingly, similar results hold for $\|U^{(t)} - U^*\|_F$, i.e.,

$\|U^{(t)} - U^*\|_F \le \kappa\|U^{(t-1)} - U^*\|_F + \alpha\|V^{(t-1)} - V^*\|_F.$   (13)

Combining (12) and (13), we then establish the contraction

$\max\{\|V^{(t)} - V^*\|_F,\ \|U^{(t)} - U^*\|_F\} \le (\alpha + \kappa)\max\{\|V^{(t-1)} - V^*\|_F,\ \|U^{(t-1)} - U^*\|_F\},$

which further implies geometric convergence, since $\alpha \in (0, 1-\kappa)$. We can establish similar results for alternating exact minimization and alternating gradient descent, respectively. Based upon such a unified analysis, we now present the main results.

3.2 Main Results

Before presenting the main results, we first introduce an assumption known as the restricted isometry property (RIP). Recall that $k$ is the rank of the target low rank matrix $M^*$.

Assumption 1 (Restricted Isometry Property). The linear operator $\mathcal{A}(\cdot): \mathbb{R}^{m\times n} \to \mathbb{R}^d$ defined in (2) satisfies $2k$-RIP with parameter $\delta_{2k} \in (0,1)$, i.e., for all $\Delta \in \mathbb{R}^{m\times n}$ such that $\mathrm{rank}(\Delta) \le 2k$, it holds that

$(1 - \delta_{2k})\|\Delta\|_F^2 \le \|\mathcal{A}(\Delta)\|_2^2 \le (1 + \delta_{2k})\|\Delta\|_F^2.$

Several random matrix ensembles satisfy $2k$-RIP for a sufficiently large $d$ with high probability. For example, suppose that each entry of $A_i$ is independently drawn from a sub-Gaussian distribution; then $\mathcal{A}(\cdot)$ satisfies $2k$-RIP with parameter $\delta_{2k}$ with high probability for $d = \Omega(\delta_{2k}^{-2} k n\log n)$. The following theorem establishes the geometric rate of convergence of the nonconvex optimization algorithms summarized in Algorithm 1.

Theorem 1. Assume there exists a sufficiently small constant $C_1$ such that $\mathcal{A}(\cdot)$ satisfies $2k$-RIP with $\delta_{2k} \le C_1/k$, and that the largest and smallest nonzero singular values of $M^*$ are constants that do not scale with $(d, m, n, k)$. For any pre-specified precision $\epsilon$, there exist a step size $\eta$ and universal constants $C_2$ and $C_3$ such that, for all $T \ge C_2\log(C_3/\epsilon)$, we have $\|M^{(T)} - M^*\|_F \le \epsilon$.

The proof of Theorem 1 is provided in §4.2, §4.3, and §4.4. Theorem 1 implies that all three nonconvex optimization algorithms converge geometrically to the global optimum. Moreover, assuming that each entry of $A_i$ is independently drawn from a sub-Gaussian distribution with mean zero and variance proxy one, our result further suggests that, to achieve exact low rank matrix recovery, our algorithms require the number of measurements $d$ to satisfy

$d = \Omega(k^3 n\log n),$   (14)

since we assume that $\delta_{2k} \le C_1/k$.
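As a quick empirical look at Assumption 1 (not a proof of a RIP constant), the sketch below draws sensing matrices with i.i.d. N(0, 1/d) entries, an explicit normalization assumption on our part, and checks that the ratio in the RIP inequality concentrates near 1 for randomly drawn rank-2k matrices; the uniform statement over all low rank matrices is, of course, stronger.

```python
# Hedged sanity check of the RIP-type concentration in Assumption 1.
import numpy as np

rng = np.random.default_rng(1)
m, n, k, d = 30, 40, 5, 600
A_norm = rng.standard_normal((d, m, n)) / np.sqrt(d)   # i.i.d. N(0, 1/d) entries

ratios = []
for _ in range(200):
    Delta = rng.standard_normal((m, 2 * k)) @ rng.standard_normal((2 * k, n))  # rank <= 2k
    y = np.einsum('dij,ij->d', A_norm, Delta)
    ratios.append((y @ y) / np.linalg.norm(Delta) ** 2)
print(min(ratios), max(ratios))   # both should be close to 1 for each fixed Delta
```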

This sample complexity matches the state-of-the-art result for nonconvex optimization methods established by Jain et al. [2013]. In comparison with their result, which only covers the alternating exact minimization algorithm, our result holds for a broader variety of nonconvex optimization algorithms. Note that the sample complexity in (14) depends on a polynomial of $\sigma_{\max}(M^*)/\sigma_{\min}(M^*)$, which is treated as a constant in our paper. If we allow $\sigma_{\max}(M^*)/\sigma_{\min}(M^*)$ to increase, we can plug the nonconvex optimization algorithms into the multi-stage framework proposed by Jain et al. [2013]. Following similar lines to the proof of Theorem 1, we can then derive a new sample complexity that is independent of $\sigma_{\max}(M^*)/\sigma_{\min}(M^*)$. See more details in Jain et al. [2013].

4 Proof of the Main Results

We sketch the proof of Theorem 1. The proofs of all related lemmas are provided in the appendix. For notational simplicity, let $\sigma_1 = \sigma_{\max}(M^*)$ and $\sigma_k = \sigma_{\min}(M^*)$. Recall that the nonconvex optimization algorithms are symmetric about the updates of $U$ and $V$. Hence, the following lemmas for the update of $V$ also hold for updating $U$; we omit some statements for conciseness. Theorem 2 can be proved in a similar manner, and its proof is provided in Appendix F.

Before presenting the proof, we first introduce the following lemma, which verifies Condition 1.

Lemma 1. Suppose that $\mathcal{A}(\cdot)$ satisfies $2k$-RIP with parameter $\delta_{2k}$. Given an arbitrary orthonormal matrix $U \in \mathbb{R}^{m\times k}$, for any $V, V' \in \mathbb{R}^{n\times k}$ we have

$\frac{1+\delta_{2k}}{2}\|V - V'\|_F^2 \ \ge\ F(U, V) - F(U, V') - \langle \nabla_V F(U, V'), V - V'\rangle \ \ge\ \frac{1-\delta_{2k}}{2}\|V - V'\|_F^2.$

The proof of Lemma 1 is provided in Appendix A.1. Lemma 1 implies that $F(U, \cdot)$ is strongly convex and smooth in $V$ given a fixed orthonormal matrix $U$, as specified in Condition 1. Equipped with Lemma 1, we now lay out the proof for each update scheme in Algorithm 1.

4.1 Rotation Issue

Given a factorization $M^* = \bar{U}\bar{V}^\top$, we can equivalently represent it as $M^* = \bar{U}_{\rm new}\bar{V}_{\rm new}^\top$, where $\bar{U}_{\rm new} = \bar{U}O_{\rm new}$ and $\bar{V}_{\rm new} = \bar{V}O_{\rm new}$ for an arbitrary unitary matrix $O_{\rm new} \in \mathbb{R}^{k\times k}$. This implies that directly calculating $\|U - \bar{U}\|_F$ is not desirable, and the algorithm may converge to an arbitrary factorization of $M^*$. To address this issue, existing analyses usually choose subspace distances to evaluate the difference between the subspaces spanned by the columns of $U$ and $\bar{U}$, because these subspaces are invariant to rotations [Jain et al., 2013]. For example, letting $\bar{U}_\perp \in \mathbb{R}^{m\times(m-k)}$ denote the orthonormal complement of $\bar{U}$, we can choose the subspace distance as $\|\bar{U}_\perp^\top U\|_F$.

For any $O_{\rm new} \in \mathbb{R}^{k\times k}$ such that $O_{\rm new}^\top O_{\rm new} = I_k$, we have $\|\bar{U}_\perp^\top U O_{\rm new}\|_F = \|\bar{U}_\perp^\top U\|_F$, so this distance is indeed invariant to rotations. In this paper, we consider a different subspace distance defined as

$\min_{O^\top O = I_k}\|U - \bar{U}O\|_F.$   (15)

We can verify that (15) is also invariant to rotation. The next lemma shows that (15) is equivalent to $\|\bar{U}_\perp^\top U\|_F$.

Lemma 2. Given two orthonormal matrices $U \in \mathbb{R}^{m\times k}$ and $\bar{U} \in \mathbb{R}^{m\times k}$, we have

$\|\bar{U}_\perp^\top U\|_F \ \le\ \min_{O^\top O = I_k}\|U - \bar{U}O\|_F \ \le\ \sqrt{2}\,\|\bar{U}_\perp^\top U\|_F.$

The proof of Lemma 2 is provided in Stewart et al. [1990] and is therefore omitted. Equipped with Lemma 2, our convergence analysis guarantees that there always exists a factorization of $M^*$ satisfying the desired computational properties at each iteration (see Lemma 5 and Corollaries 1 and 2). The above argument can similarly be generalized to the gradient descent and alternating gradient descent algorithms.

4.2 Proof of Theorem 1 (Alternating Exact Minimization)

Proof. Throughout the proof for alternating exact minimization, we define a constant $\xi \in (1,\infty)$ to simplify the notation. Moreover, we assume that at the $t$-th iteration there exists a matrix factorization of $M^*$,

$M^* = \bar{U}^{(t)}\bar{V}^{(t)\top},$

where $\bar{U}^{(t)} \in \mathbb{R}^{m\times k}$ is an orthonormal matrix. We define the approximation error of the inexact first order oracle as

$E(V^{(t+0.5)}, V^{(t+0.5)}, U^{(t)}) = \|\nabla_V F(U^{(t)}, V^{(t+0.5)}) - \nabla_V F(\bar{U}^{(t)}, V^{(t+0.5)})\|_F.$

The following lemma establishes an upper bound for the approximation error of the inexact first order oracle under suitable conditions.

Lemma 3. Suppose that $\delta_{2k}$ and $U^{(t)}$ satisfy

$\delta_{2k} \le \frac{(1-\delta_{2k})\sigma_k}{12\xi(1+\delta_{2k})\sigma_1}$ and $\|U^{(t)} - \bar{U}^{(t)}\|_F \le \frac{(1-\delta_{2k})\sigma_k}{4\xi(1+\delta_{2k})\sigma_1}.$   (16)

Then we have

$E(V^{(t+0.5)}, V^{(t+0.5)}, U^{(t)}) \le \frac{(1-\delta_{2k})\sigma_k}{2\xi}\|U^{(t)} - \bar{U}^{(t)}\|_F.$
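The subspace distance (15) used throughout this section can be evaluated in closed form: the minimizing rotation is the orthogonal Procrustes solution obtained from an SVD of $\bar{U}^\top U$. The following short sketch, written for any two orthonormal factors, illustrates this computation; it is an implementation convenience on our part and not part of the paper's algorithms.

```python
# Hedged sketch of the rotation-invariant subspace distance (15) via Procrustes.
import numpy as np

def subspace_distance(U, U_bar):
    """min over orthogonal O of || U - U_bar O ||_F, cf. (15) and Lemma 2."""
    P, _, Qt = np.linalg.svd(U_bar.T @ U)   # Procrustes: O = P Q^T maximizes tr(O^T U_bar^T U)
    O = P @ Qt
    return np.linalg.norm(U - U_bar @ O)
```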

The proof of Lemma 3 is provided in Appendix A.2. Lemma 3 shows that the approximation error of the inexact first order oracle for updating $V$ diminishes with the estimation error of $U^{(t)}$, provided $U^{(t)}$ is sufficiently close to $\bar{U}^{(t)}$. The following lemma quantifies the progress of an exact minimization step using the inexact first order oracle.

Lemma 4. We have

$\|V^{(t+0.5)} - \bar{V}^{(t)}\|_F \le \frac{1}{1-\delta_{2k}} E(V^{(t+0.5)}, V^{(t+0.5)}, U^{(t)}).$

The proof of Lemma 4 is provided in Appendix A.3. Lemma 4 illustrates that the estimation error of $V^{(t+0.5)}$ diminishes with the approximation error of the inexact first order oracle. The following lemma characterizes the effect of the renormalization step using the QR decomposition, i.e., the relationship between $V^{(t+0.5)}$ and $V^{(t+1)}$ in terms of the estimation error.

Lemma 5. Suppose that $V^{(t+0.5)}$ satisfies

$\|V^{(t+0.5)} - \bar{V}^{(t)}\|_F \le \frac{\sigma_k}{4}.$   (17)

Then there exists a factorization $M^* = \bar{U}^{(t+1)}\bar{V}^{(t+1)\top}$ such that $\bar{V}^{(t+1)} \in \mathbb{R}^{n\times k}$ is an orthonormal matrix and

$\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F \le \frac{2}{\sigma_k}\|V^{(t+0.5)} - \bar{V}^{(t)}\|_F.$

The proof of Lemma 5 is provided in Appendix A.4. The next lemma quantifies the accuracy of the initialization $U^{(0)}$.

Lemma 6. Suppose that $\delta_{2k}$ satisfies

$\delta_{2k} \le \frac{(1-\delta_{2k})^2\sigma_k^4}{192\,\xi^2(1+\delta_{2k})^2\sigma_1^4}.$   (18)

Then there exists a factorization $M^* = \bar{U}^{(0)}\bar{V}^{(0)\top}$ such that $\bar{U}^{(0)} \in \mathbb{R}^{m\times k}$ is an orthonormal matrix and

$\|U^{(0)} - \bar{U}^{(0)}\|_F \le \frac{(1-\delta_{2k})\sigma_k}{4\xi(1+\delta_{2k})\sigma_1}.$

The proof of Lemma 6 is provided in Appendix A.5. Lemma 6 implies that the initial solution $U^{(0)}$ attains a sufficiently small estimation error. Combining Lemmas 3, 4, and 5, we obtain the following corollary for a complete iteration of updating $V$.

Corollary 1. Suppose that $\delta_{2k}$ and $U^{(t)}$ satisfy

$\delta_{2k} \le \frac{(1-\delta_{2k})^2\sigma_k^4}{192\,\xi^2(1+\delta_{2k})^2\sigma_1^4}$ and $\|U^{(t)} - \bar{U}^{(t)}\|_F \le \frac{(1-\delta_{2k})\sigma_k}{4\xi(1+\delta_{2k})\sigma_1}.$   (19)

We then have

$\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F \le \frac{(1-\delta_{2k})\sigma_k}{4\xi(1+\delta_{2k})\sigma_1}.$

Moreover, we also have

$\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F \le \frac{1}{\xi}\|U^{(t)} - \bar{U}^{(t)}\|_F$ and $\|V^{(t+0.5)} - \bar{V}^{(t)}\|_F \le \frac{\sigma_k}{2\xi}\|U^{(t)} - \bar{U}^{(t)}\|_F.$

The proof of Corollary 1 is provided in Appendix A.6. Since the alternating exact minimization algorithm updates $U$ and $V$ in a symmetric manner, we can establish similar results for a complete iteration of updating $U$ in the next corollary.

Corollary 2. Suppose that $\delta_{2k}$ and $V^{(t+1)}$ satisfy

$\delta_{2k} \le \frac{(1-\delta_{2k})^2\sigma_k^4}{192\,\xi^2(1+\delta_{2k})^2\sigma_1^4}$ and $\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F \le \frac{(1-\delta_{2k})\sigma_k}{4\xi(1+\delta_{2k})\sigma_1}.$   (20)

Then there exists a factorization $M^* = \bar{U}^{(t+1)}\bar{V}^{(t+1)\top}$ such that $\bar{U}^{(t+1)}$ is an orthonormal matrix and

$\|U^{(t+1)} - \bar{U}^{(t+1)}\|_F \le \frac{(1-\delta_{2k})\sigma_k}{4\xi(1+\delta_{2k})\sigma_1}.$

Moreover, we also have

$\|U^{(t+1)} - \bar{U}^{(t+1)}\|_F \le \frac{1}{\xi}\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F$ and $\|U^{(t+0.5)} - \bar{U}^{(t+1)}\|_F \le \frac{\sigma_k}{2\xi}\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F.$

The proof of Corollary 2 directly follows Appendix A.6 and is therefore omitted. We now proceed with the proof of Theorem 1 for alternating exact minimization. Lemma 6 ensures that (19) of Corollary 1 holds for $U^{(0)}$. Then Corollary 1 ensures that (20) of Corollary 2 holds for $V^{(1)}$. By induction, Corollaries 1 and 2 can be applied recursively for all $T$ iterations. Thus we obtain

$\|V^{(T)} - \bar{V}^{(T)}\|_F \le \frac{1}{\xi}\|U^{(T-1)} - \bar{U}^{(T-1)}\|_F \le \frac{1}{\xi^2}\|V^{(T-1)} - \bar{V}^{(T-1)}\|_F \le \cdots \le \frac{1}{\xi^{2T-1}}\|U^{(0)} - \bar{U}^{(0)}\|_F \le \frac{(1-\delta_{2k})\sigma_k}{4\xi^{2T}(1+\delta_{2k})\sigma_1},$   (21)

where the last inequality comes from Lemma 6.

Therefore, for a pre-specified accuracy $\epsilon$, we need at most

$T = \frac{1}{2\log\xi}\log\Big(\frac{(1-\delta_{2k})\sigma_k}{2\epsilon(1+\delta_{2k})}\Big)$   (22)

iterations such that

$\|V^{(T)} - \bar{V}^{(T)}\|_F \le \frac{(1-\delta_{2k})\sigma_k}{4\xi^{2T}(1+\delta_{2k})\sigma_1} \le \frac{\epsilon}{2\sigma_1}.$   (23)

Moreover, Corollary 2 implies

$\|U^{(T-0.5)} - \bar{U}^{(T)}\|_F \le \frac{\sigma_k}{2\xi}\|V^{(T)} - \bar{V}^{(T)}\|_F \le \frac{(1-\delta_{2k})\sigma_k^2}{8\xi^{2T+1}(1+\delta_{2k})\sigma_1},$

where the last inequality comes from (21). Therefore, we need at most

$T = \frac{1}{2\log\xi}\log\Big(\frac{(1-\delta_{2k})\sigma_k^2}{4\xi\epsilon(1+\delta_{2k})}\Big)$   (24)

iterations such that

$\|U^{(T-0.5)} - \bar{U}^{(T)}\|_F \le \frac{(1-\delta_{2k})\sigma_k^2}{8\xi^{2T+1}(1+\delta_{2k})\sigma_1} \le \frac{\epsilon}{2}.$   (25)

Then, combining (23) and (25), we obtain

$\|M^{(T)} - M^*\|_F = \|U^{(T-0.5)}V^{(T)\top} - \bar{U}^{(T)}\bar{V}^{(T)\top}\|_F = \|U^{(T-0.5)}V^{(T)\top} - \bar{U}^{(T)}V^{(T)\top} + \bar{U}^{(T)}V^{(T)\top} - \bar{U}^{(T)}\bar{V}^{(T)\top}\|_F \le \|V^{(T)}\|_2\|U^{(T-0.5)} - \bar{U}^{(T)}\|_F + \|\bar{U}^{(T)}\|_2\|V^{(T)} - \bar{V}^{(T)}\|_F \le \epsilon,$   (26)

where the last inequality comes from $\|V^{(T)}\|_2 = 1$ (since $V^{(T)}$ is orthonormal) and $\|\bar{U}^{(T)}\|_2 = \|M^*\|_2 = \sigma_1$ (since $\bar{U}^{(T)}\bar{V}^{(T)\top} = M^*$ and $\bar{V}^{(T)}$ is orthonormal). Thus, combining (22) and (24) with (26), we complete the proof.

4.3 Proof of Theorem 1 (Alternating Gradient Descent)

Proof. Throughout the proof for alternating gradient descent, we define a sufficiently large constant $\xi$. Moreover, we assume that at the $t$-th iteration there exists a matrix factorization of $M^*$,

$M^* = \bar{U}^{(t)}\bar{V}^{(t)\top},$

where $\bar{U}^{(t)} \in \mathbb{R}^{m\times k}$ is an orthonormal matrix. We define the approximation error of the inexact first order oracle as

$E(V^{(t+0.5)}, V^{(t)}, U^{(t)}) = \|\nabla_V F(U^{(t)}, V^{(t)}) - \nabla_V F(\bar{U}^{(t)}, V^{(t)})\|_F.$

The first lemma is parallel to Lemma 3 for alternating exact minimization.

Lemma 7. Suppose that $\delta_{2k}$, $U^{(t)}$, and $V^{(t)}$ satisfy

$\delta_{2k} \le \frac{(1-\delta_{2k})\sigma_k^2}{4\xi\sigma_1^2}, \quad \|U^{(t)} - \bar{U}^{(t)}\|_F \le \frac{\sigma_k^2}{4\xi\sigma_1^2}, \quad\text{and}\quad \|V^{(t)} - \bar{V}^{(t)}\|_F \le \frac{\sigma_1}{2}.$   (27)

Then we have

$E(V^{(t+0.5)}, V^{(t)}, U^{(t)}) \le \frac{(1+\delta_{2k})\sigma_k}{2\xi}\|U^{(t)} - \bar{U}^{(t)}\|_F.$

The proof of Lemma 7 is provided in Appendix B.1. Lemma 7 illustrates that the approximation error of the inexact first order oracle diminishes with the estimation error of $U^{(t)}$, when $U^{(t)}$ and $V^{(t)}$ are sufficiently close to $\bar{U}^{(t)}$ and $\bar{V}^{(t)}$.

Lemma 8. Suppose that the step size parameter $\eta$ satisfies

$\eta = \frac{1}{1+\delta_{2k}}.$   (28)

Then we have

$\|V^{(t+0.5)} - \bar{V}^{(t)}\|_F \le \frac{2\delta_{2k}}{1+\delta_{2k}}\|V^{(t)} - \bar{V}^{(t)}\|_F + \frac{2}{1+\delta_{2k}}E(V^{(t+0.5)}, V^{(t)}, U^{(t)}).$

The proof of Lemma 8 is provided in Appendix B.2. Lemma 8 characterizes the progress of a gradient descent step with a pre-specified fixed step size. A more practical option is to select $\eta$ adaptively using a backtracking line search procedure, for which similar results can be guaranteed; see Nesterov [2004] for details. The following lemma characterizes the effect of the renormalization step using the QR decomposition.

Lemma 9. Suppose that $V^{(t+0.5)}$ satisfies

$\|V^{(t+0.5)} - \bar{V}^{(t)}\|_F \le \frac{\sigma_k}{4}.$   (29)

Then there exists a factorization $M^* = \bar{U}^{(t+1)}\bar{V}^{(t+1)\top}$ such that $\bar{V}^{(t+1)} \in \mathbb{R}^{n\times k}$ is an orthonormal matrix and

$\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F \le \frac{2}{\sigma_k}\|V^{(t+0.5)} - \bar{V}^{(t)}\|_F, \qquad \|\bar{U}^{(t)} - \bar{U}^{(t+1)}\|_F \le \frac{3\sigma_1}{\sigma_k}\|V^{(t+0.5)} - \bar{V}^{(t)}\|_F + \sigma_1\|U^{(t)} - \bar{U}^{(t)}\|_F.$

The proof of Lemma 9 is provided in Appendix B.3. The next lemma quantifies the accuracy of the initial solutions.

Lemma 10. Suppose that $\delta_{2k}$ satisfies

$\delta_{2k} \le \frac{\sigma_k^6}{192\,\xi^2\sigma_1^6}.$   (30)

Then we have

$\|U^{(0)} - \bar{U}^{(0)}\|_F \le \frac{\sigma_k^2}{4\xi\sigma_1^2}$ and $\|V^{(0)} - \bar{V}^{(0)}\|_F \le \frac{\sigma_k^2}{2\xi\sigma_1}.$

The proof of Lemma 10 is provided in Appendix B.4. Lemma 10 indicates that the initial solutions $U^{(0)}$ and $V^{(0)}$ attain sufficiently small estimation errors. Combining Lemmas 7, 8, and 9, we obtain the following corollary for a complete iteration of updating $V$.

Corollary 3. Suppose that $\delta_{2k}$, $U^{(t)}$, and $V^{(t)}$ satisfy

$\delta_{2k} \le \frac{\sigma_k^6}{192\,\xi^2\sigma_1^6}, \quad \|U^{(t)} - \bar{U}^{(t)}\|_F \le \frac{\sigma_k^2}{4\xi\sigma_1^2}, \quad\text{and}\quad \|V^{(t)} - \bar{V}^{(t)}\|_F \le \frac{\sigma_k^2}{2\xi\sigma_1}.$   (31)

We then have

$\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F \le \frac{\sigma_k^2}{4\xi\sigma_1^2}$ and $\|\bar{U}^{(t)} - \bar{U}^{(t+1)}\|_F \le \frac{\sigma_k^2}{2\xi\sigma_1}.$

Moreover, we have

$\|V^{(t+0.5)} - \bar{V}^{(t)}\|_F \le 2\delta_{2k}\|V^{(t)} - \bar{V}^{(t)}\|_F + \frac{\sigma_k}{\xi}\|U^{(t)} - \bar{U}^{(t)}\|_F,$   (32)
$\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F \le \frac{4\delta_{2k}}{\sigma_k}\|V^{(t)} - \bar{V}^{(t)}\|_F + \frac{2}{\xi}\|U^{(t)} - \bar{U}^{(t)}\|_F,$   (33)
$\|\bar{U}^{(t)} - \bar{U}^{(t+1)}\|_F \le \frac{6\sigma_1\delta_{2k}}{\sigma_k}\|V^{(t)} - \bar{V}^{(t)}\|_F + \Big(\frac{6}{\xi}+1\Big)\sigma_1\|U^{(t)} - \bar{U}^{(t)}\|_F.$   (34)

The proof of Corollary 3 is provided in Appendix B.5. Since the alternating gradient descent algorithm updates $U$ and $V$ in a symmetric manner, we can establish similar results for a complete iteration of updating $U$ in the next corollary.

Corollary 4. Suppose that $\delta_{2k}$, $V^{(t+1)}$, and $U^{(t)}$ satisfy

$\delta_{2k} \le \frac{\sigma_k^6}{192\,\xi^2\sigma_1^6}, \quad \|V^{(t+1)} - \bar{V}^{(t+1)}\|_F \le \frac{\sigma_k^2}{4\xi\sigma_1^2}, \quad\text{and}\quad \|U^{(t)} - \bar{U}^{(t+1)}\|_F \le \frac{\sigma_k^2}{2\xi\sigma_1}.$   (35)

We then have

$\|U^{(t+1)} - \bar{U}^{(t+1)}\|_F \le \frac{\sigma_k^2}{4\xi\sigma_1^2}$ and $\|\bar{V}^{(t+1)} - \bar{V}^{(t+2)}\|_F \le \frac{\sigma_k^2}{2\xi\sigma_1}.$

Moreover, we have

$\|U^{(t+0.5)} - \bar{U}^{(t+1)}\|_F \le 2\delta_{2k}\|U^{(t)} - \bar{U}^{(t+1)}\|_F + \frac{\sigma_k}{\xi}\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F,$   (36)
$\|U^{(t+1)} - \bar{U}^{(t+1)}\|_F \le \frac{4\delta_{2k}}{\sigma_k}\|U^{(t)} - \bar{U}^{(t+1)}\|_F + \frac{2}{\xi}\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F,$   (37)
$\|\bar{V}^{(t+1)} - \bar{V}^{(t+2)}\|_F \le \frac{6\sigma_1\delta_{2k}}{\sigma_k}\|U^{(t)} - \bar{U}^{(t+1)}\|_F + \Big(\frac{6}{\xi}+1\Big)\sigma_1\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F.$   (38)

The proof of Corollary 4 directly follows Appendix B.5 and is therefore omitted. We now proceed with the proof of Theorem 1 for alternating gradient descent. Recall that Lemma 10 ensures that (31) of Corollary 3 holds for $U^{(0)}$ and $V^{(0)}$. Then Corollary 3 ensures that (35) of Corollary 4 holds for $U^{(0)}$ and $V^{(1)}$. By induction, Corollaries 3 and 4 can be applied recursively for all $T$ iterations. For notational simplicity, we write (32)-(38) as

$\|V^{(t+0.5)} - \bar{V}^{(t)}\|_F \le \alpha_1\|V^{(t)} - \bar{V}^{(t)}\|_F + \gamma_1\sigma_1\|U^{(t)} - \bar{U}^{(t)}\|_F,$   (39)
$\sigma_1\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F \le \alpha_2\|V^{(t)} - \bar{V}^{(t)}\|_F + \gamma_2\sigma_1\|U^{(t)} - \bar{U}^{(t)}\|_F,$   (40)
$\|U^{(t+0.5)} - \bar{U}^{(t+1)}\|_F \le \alpha_3\|U^{(t)} - \bar{U}^{(t+1)}\|_F + \gamma_3\sigma_1\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F,$   (41)
$\sigma_1\|U^{(t+1)} - \bar{U}^{(t+1)}\|_F \le \alpha_4\|U^{(t)} - \bar{U}^{(t+1)}\|_F + \gamma_4\sigma_1\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F,$   (42)
$\|\bar{U}^{(t)} - \bar{U}^{(t+1)}\|_F \le \alpha_5\|V^{(t)} - \bar{V}^{(t)}\|_F + \gamma_5\sigma_1\|U^{(t)} - \bar{U}^{(t)}\|_F,$   (43)
$\|\bar{V}^{(t+1)} - \bar{V}^{(t+2)}\|_F \le \alpha_6\|\bar{U}^{(t)} - \bar{U}^{(t+1)}\|_F + \gamma_6\sigma_1\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F.$   (44)

Note that we have $\gamma_5, \gamma_6 \in (1,2)$, but $\alpha_1,\dots,\alpha_6$ and $\gamma_1,\dots,\gamma_4$ can be made sufficiently small as long as $\xi$ is sufficiently large. We then have

$\|\bar{U}^{(t+1)} - \bar{U}^{(t+2)}\|_F \overset{(i)}{\le} \alpha_5\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F + \gamma_5\sigma_1\|U^{(t+1)} - \bar{U}^{(t+1)}\|_F$
$\overset{(ii)}{\le} \alpha_5\alpha_6\|\bar{U}^{(t)} - \bar{U}^{(t+1)}\|_F + \alpha_5\gamma_6\sigma_1\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F + \gamma_5\sigma_1\|U^{(t+1)} - \bar{U}^{(t+1)}\|_F$
$\overset{(iii)}{\le} (\alpha_5\alpha_6 + \gamma_5\alpha_4)\|\bar{U}^{(t)} - \bar{U}^{(t+1)}\|_F + (\gamma_5\gamma_4\sigma_1 + \alpha_5\gamma_6)\sigma_1\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F$
$\overset{(iv)}{\le} (\alpha_5\alpha_6 + \gamma_5\alpha_4)\|\bar{U}^{(t)} - \bar{U}^{(t+1)}\|_F + (\gamma_5\gamma_4\sigma_1 + \alpha_5\gamma_6)\alpha_2\|V^{(t)} - \bar{V}^{(t)}\|_F + (\gamma_5\gamma_4\sigma_1 + \alpha_5\gamma_6)\gamma_2\sigma_1\|U^{(t)} - \bar{U}^{(t)}\|_F,$   (45)

where (i) comes from (43), (ii) comes from (44), (iii) comes from (42), and (iv) comes from (40). Similarly, we can obtain

$\|\bar{V}^{(t+1)} - \bar{V}^{(t+2)}\|_F \le \alpha_6\|\bar{U}^{(t)} - \bar{U}^{(t+1)}\|_F + \gamma_6\alpha_2\|V^{(t)} - \bar{V}^{(t)}\|_F + \gamma_6\gamma_2\sigma_1\|U^{(t)} - \bar{U}^{(t)}\|_F,$   (46)
$\sigma_1\|U^{(t+1)} - \bar{U}^{(t+1)}\|_F \le \alpha_4\|\bar{U}^{(t)} - \bar{U}^{(t+1)}\|_F + \gamma_4\alpha_2\|V^{(t)} - \bar{V}^{(t)}\|_F + \gamma_4\gamma_2\sigma_1\|U^{(t)} - \bar{U}^{(t)}\|_F,$   (47)
$\|U^{(t+0.5)} - \bar{U}^{(t+1)}\|_F \le \alpha_3\|\bar{U}^{(t)} - \bar{U}^{(t+1)}\|_F + \gamma_3\alpha_2\|V^{(t)} - \bar{V}^{(t)}\|_F + \gamma_3\gamma_2\sigma_1\|U^{(t)} - \bar{U}^{(t)}\|_F.$   (48)

For simplicity, we define

$\phi_{\bar{V}^{(t+1)}} = \|\bar{V}^{(t+1)} - \bar{V}^{(t+2)}\|_F, \quad \phi_{V^{(t+0.5)}} = \|V^{(t+0.5)} - \bar{V}^{(t)}\|_F, \quad \phi_{V^{(t+1)}} = \sigma_1\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F,$
$\phi_{\bar{U}^{(t+1)}} = \|\bar{U}^{(t+1)} - \bar{U}^{(t+2)}\|_F, \quad \phi_{U^{(t+0.5)}} = \|U^{(t+0.5)} - \bar{U}^{(t+1)}\|_F, \quad \phi_{U^{(t+1)}} = \sigma_1\|U^{(t+1)} - \bar{U}^{(t+1)}\|_F.$

Then, combining (39) and (40) with (45)-(48), we obtain

$\max\big\{\phi_{\bar{V}^{(t+1)}}, \phi_{V^{(t+0.5)}}, \phi_{V^{(t+1)}}, \phi_{\bar{U}^{(t+1)}}, \phi_{U^{(t+0.5)}}, \phi_{U^{(t+1)}}\big\} \le \beta\max\big\{\phi_{V^{(t)}}, \phi_{U^{(t)}}, \phi_{\bar{U}^{(t)}}\big\},$   (49)

where $\beta$ is a contraction coefficient defined as

$\beta = \max\{\alpha_5\alpha_6 + \gamma_5\alpha_4,\ \alpha_6,\ \alpha_4,\ \alpha_3\} + \max\{\alpha_1,\ \alpha_2,\ (\gamma_5\gamma_4\sigma_1 + \alpha_5\gamma_6)\alpha_2,\ \gamma_6\alpha_2,\ \gamma_4\alpha_2,\ \gamma_3\alpha_2\} + \max\{\gamma_1,\ \gamma_2,\ (\gamma_5\gamma_4\sigma_1 + \alpha_5\gamma_6)\gamma_2,\ \gamma_6\gamma_2,\ \gamma_4\gamma_2,\ \gamma_3\gamma_2\}.$

We can then choose $\xi$ as a sufficiently large constant such that $\beta < 1$. By recursively applying (49) for $t = 0, \dots, T$, we obtain

$\max\big\{\phi_{\bar{V}^{(T)}}, \phi_{V^{(T-0.5)}}, \phi_{V^{(T)}}, \phi_{\bar{U}^{(T)}}, \phi_{U^{(T-0.5)}}, \phi_{U^{(T)}}\big\} \le \beta\max\big\{\phi_{V^{(T-1)}}, \phi_{U^{(T-1)}}, \phi_{\bar{U}^{(T-1)}}\big\} \le \beta^2\max\big\{\phi_{V^{(T-2)}}, \phi_{U^{(T-2)}}, \phi_{\bar{U}^{(T-2)}}\big\} \le \cdots \le \beta^T\max\big\{\phi_{V^{(0)}}, \phi_{U^{(0)}}, \phi_{\bar{U}^{(0)}}\big\}.$

By Corollary 3, we obtain

$\|\bar{U}^{(0)} - \bar{U}^{(1)}\|_F \le \frac{6\sigma_1\delta_{2k}}{\sigma_k}\|V^{(0)} - \bar{V}^{(0)}\|_F + \Big(\frac{6}{\xi}+1\Big)\sigma_1\|U^{(0)} - \bar{U}^{(0)}\|_F \overset{(i)}{\le} \frac{6\sigma_1\delta_{2k}}{\sigma_k}\cdot\frac{\sigma_k^2}{2\xi\sigma_1} + \Big(\frac{6}{\xi}+1\Big)\sigma_1\cdot\frac{\sigma_k^2}{4\xi\sigma_1^2} \overset{(ii)}{\le} \frac{\sigma_k^2}{2\xi\sigma_1},$   (50)

where (i) comes from Lemma 10 and (ii) comes from the definition of $\xi$, the bound $\delta_{2k} \le \sigma_k^6/(192\xi^2\sigma_1^6)$, and $\sigma_1 \ge \sigma_k$. Combining (50) with Lemma 10, we have

$\max\big\{\phi_{V^{(0)}}, \phi_{U^{(0)}}, \phi_{\bar{U}^{(0)}}\big\} \le \max\Big\{\frac{\sigma_k^2}{2\xi\sigma_1},\ \frac{\sigma_k^2}{4\xi\sigma_1^2}\Big\}.$

Then we need at most

$T = \log\Big(\max\Big\{\frac{\sigma_k^2}{2\xi\sigma_1},\ \frac{\sigma_k^2}{4\xi\sigma_1^2}\Big\}\Big/\epsilon\Big)\Big/\log(\beta^{-1})$

iterations such that

$\|V^{(T)} - \bar{V}^{(T)}\|_F \le \beta^T\max\Big\{\frac{\sigma_k^2}{2\xi\sigma_1},\ \frac{\sigma_k^2}{4\xi\sigma_1^2}\Big\} \le \frac{\epsilon}{2\sigma_1}$ and $\|U^{(T-0.5)} - \bar{U}^{(T)}\|_F \le \beta^T\max\Big\{\frac{\sigma_k^2}{2\xi\sigma_1},\ \frac{\sigma_k^2}{4\xi\sigma_1^2}\Big\} \le \frac{\epsilon}{2}.$

We then follow similar lines to (26) in §4.2 and show that $\|M^{(T)} - M^*\|_F \le \epsilon$, which completes the proof.

4.4 Proof of Theorem 1 (Gradient Descent)

Proof. The convergence analysis of the gradient descent algorithm is similar to that of alternating gradient descent. The only difference is that, for updating $U$, the gradient descent algorithm employs $V = V^{(t)}$ instead of $V = V^{(t+1)}$ to calculate the gradient at $U = U^{(t)}$. Everything else directly follows §4.3 and is therefore omitted.

5 Extensions to Matrix Completion

We now extend our methodology and theory to matrix completion problems. Let $M^* \in \mathbb{R}^{m\times n}$ be the unknown low rank matrix of interest. We observe a subset of the entries of $M^*$, indexed by $\mathcal{W} \subseteq \{1,\dots,m\}\times\{1,\dots,n\}$. We assume that $\mathcal{W}$ is drawn uniformly at random, i.e., $M^*_{ij}$ is observed independently with probability $\rho \in (0,1]$. To exactly recover $M^*$, a common assumption is the incoherence of $M^*$, which will be specified later. A popular approach for recovering $M^*$ is to solve the following convex optimization problem:

$\min_{M \in \mathbb{R}^{m\times n}} \|M\|_*$ subject to $\mathcal{P}_{\mathcal{W}}(M^*) = \mathcal{P}_{\mathcal{W}}(M),$   (51)

where $\mathcal{P}_{\mathcal{W}}(M): \mathbb{R}^{m\times n} \to \mathbb{R}^{m\times n}$ is an operator defined as

$[\mathcal{P}_{\mathcal{W}}(M)]_{ij} = M_{ij}$ if $(i,j) \in \mathcal{W}$, and $[\mathcal{P}_{\mathcal{W}}(M)]_{ij} = 0$ otherwise.

Similar to matrix sensing, existing algorithms for solving (51) are computationally inefficient. Hence, in practice we usually consider the following nonconvex optimization problem:

$\min_{U\in\mathbb{R}^{m\times k},\, V\in\mathbb{R}^{n\times k}} F_{\mathcal{W}}(U,V)$, where $F_{\mathcal{W}}(U,V) = \tfrac{1}{2}\|\mathcal{P}_{\mathcal{W}}(M^*) - \mathcal{P}_{\mathcal{W}}(UV^\top)\|_F^2.$   (52)

Similar to matrix sensing, (52) can also be efficiently solved by the gradient-based algorithms illustrated in Algorithm 2. For the convenience of the later convergence analysis, we partition the observation set $\mathcal{W}$ into $2T+1$ subsets $\mathcal{W}_0,\dots,\mathcal{W}_{2T}$ by Algorithm 4. In practice, however, we do not need the partition scheme, i.e., we simply set $\mathcal{W}_0 = \cdots = \mathcal{W}_{2T} = \mathcal{W}$. Before we present the convergence analysis, we first introduce an assumption known as the incoherence property.

Assumption 2 (Incoherence Property). The target rank-$k$ matrix $M^*$ is incoherent with parameter $\mu$, i.e., given the rank-$k$ singular value decomposition $M^* = \bar{U}\Sigma\bar{V}^\top$, we have

$\max_i \|\bar{U}_{i*}\|_2 \le \mu\sqrt{\frac{k}{m}}$ and $\max_j \|\bar{V}_{j*}\|_2 \le \mu\sqrt{\frac{k}{n}}.$

Roughly speaking, the incoherence assumption guarantees that each entry of $M^*$ contains a similar amount of information, which makes it feasible to complete $M^*$ when its entries are missing uniformly at random. The following theorem establishes the iteration complexity and the estimation error under the Frobenius norm.

Theorem 2. Suppose that there exists a universal constant $C_4$ such that $\rho$ satisfies

$\rho \ge \frac{C_4\,\mu^2 k^3\log n\log(1/\epsilon)}{m},$   (53)

where $\epsilon$ is the pre-specified precision. Then there exist a step size $\eta$ and universal constants $C_5$ and $C_6$ such that, for any $T \ge C_5\log(C_6/\epsilon)$, we have $\|M^{(T)} - M^*\|_F \le \epsilon$ with high probability.
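The masked objective (52) and its partial gradients are straightforward to implement. The sketch below is a hedged illustration: the mask, the sampling probability, the problem sizes, and the explicit gradient formulas are our own choices for exposition and are not taken verbatim from the paper's algorithms.

```python
# Hedged sketch of F_W(U, V) in (52) and its gradients for matrix completion.
import numpy as np

rng = np.random.default_rng(2)
m, n, k, rho = 1000, 50, 5, 0.2
M_star = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
W = rng.random((m, n)) < rho                   # each entry observed with probability rho

def F_W(U, V):
    """0.5 * || P_W(M*) - P_W(U V^T) ||_F^2 as in (52)."""
    R = W * (M_star - U @ V.T)
    return 0.5 * np.sum(R ** 2)

def grads_W(U, V):
    """Return (grad_U F_W, grad_V F_W)."""
    R = W * (U @ V.T - M_star)                 # residual restricted to observed entries
    return R @ V, R.T @ U
```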

Algorithm 2: A family of nonconvex optimization algorithms for matrix completion. The incoherence factorization algorithm $\mathrm{IF}(\cdot)$ is illustrated in Algorithm 3, and the partition algorithm $\mathrm{Partition}(\cdot)$, which is proposed by Hardt and Wootters [2014], is provided in Algorithm 4 of Appendix C for the sake of completeness. The initialization procedures $\mathrm{INT}_U(\cdot)$ and $\mathrm{INT}_V(\cdot)$ are provided in Algorithm 5 and Algorithm 6 of Appendix D for the sake of completeness. Here $F_{\mathcal{W}}(\cdot)$ is defined in (52).

  Input: $\mathcal{P}_{\mathcal{W}}(M^*)$
  Parameter: step size $\eta$, total number of iterations $T$
  $(\{\mathcal{W}_t\}_{t=0}^{2T}, \rho) \leftarrow \mathrm{Partition}(\mathcal{W})$, $\mathcal{P}_{\mathcal{W}_0}(\widetilde{M}) \leftarrow \mathcal{P}_{\mathcal{W}_0}(M^*)$, and $\widetilde{M}_{ij} \leftarrow 0$ for all $(i,j) \notin \mathcal{W}_0$
  $(U^{(0)}, V^{(0)}) \leftarrow \mathrm{INT}_U(\widetilde{M})$, $(V^{(0)}, U^{(0)}) \leftarrow \mathrm{INT}_V(\widetilde{M})$
  For $t = 0, \dots, T-1$:
    Updating $V$:
      Alternating exact minimization: $V^{(t+0.5)} \leftarrow \arg\min_V F_{\mathcal{W}_{2t+1}}(U^{(t)}, V)$; $(V^{(t+1)}, R_V^{(t+0.5)}) \leftarrow \mathrm{IF}(V^{(t+0.5)})$
      Alternating gradient descent: $V^{(t+0.5)} \leftarrow V^{(t)} - \eta\nabla_V F_{\mathcal{W}_{2t+1}}(U^{(t)}, V^{(t)})$; $(V^{(t+1)}, R_V^{(t+0.5)}) \leftarrow \mathrm{IF}(V^{(t+0.5)})$, $U^{(t)} \leftarrow U^{(t)}(R_V^{(t+0.5)})^\top$
      Gradient descent: $V^{(t+0.5)} \leftarrow V^{(t)} - \eta\nabla_V F_{\mathcal{W}_{2t+1}}(U^{(t)}, V^{(t)})$; $(V^{(t+1)}, R_V^{(t+0.5)}) \leftarrow \mathrm{IF}(V^{(t+0.5)})$, $U^{(t+1)} \leftarrow U^{(t)}(R_V^{(t+0.5)})^\top$
    Updating $U$:
      Alternating exact minimization: $U^{(t+0.5)} \leftarrow \arg\min_U F_{\mathcal{W}_{2t+2}}(U, V^{(t+1)})$; $(U^{(t+1)}, R_U^{(t+0.5)}) \leftarrow \mathrm{IF}(U^{(t+0.5)})$
      Alternating gradient descent: $U^{(t+0.5)} \leftarrow U^{(t)} - \eta\nabla_U F_{\mathcal{W}_{2t+2}}(U^{(t)}, V^{(t+1)})$; $(U^{(t+1)}, R_U^{(t+0.5)}) \leftarrow \mathrm{IF}(U^{(t+0.5)})$, $V^{(t+1)} \leftarrow V^{(t+1)}(R_U^{(t+0.5)})^\top$
      Gradient descent: $U^{(t+0.5)} \leftarrow U^{(t)} - \eta\nabla_U F_{\mathcal{W}_{2t+2}}(U^{(t)}, V^{(t)})$; $(U^{(t+1)}, R_U^{(t+0.5)}) \leftarrow \mathrm{IF}(U^{(t+0.5)})$, $V^{(t+1)} \leftarrow V^{(t)}(R_U^{(t+0.5)})^\top$
  End for
  Output: $M^{(T)} \leftarrow U^{(T-0.5)}V^{(T)\top}$ (for gradient descent we use $U^{(T)}V^{(T)\top}$)

The proof of Theorem 2 is provided in §F.1, §F.2, and §F.3. Theorem 2 implies that all three nonconvex optimization algorithms converge to the global optimum at a geometric rate. Furthermore, our results indicate that completing the true low rank matrix $M^*$ up to $\epsilon$-accuracy requires the entry observation probability $\rho$ to satisfy

$\rho = \Omega\big(\mu^2 k^3\log n\log(1/\epsilon)/m\big).$   (54)

This result matches the result established by Hardt [2014], which is the state-of-the-art result for alternating minimization. Moreover, our analysis covers three nonconvex optimization algorithms. In fact, the sample complexity in (54) depends on a polynomial of $\sigma_{\max}(M^*)/\sigma_{\min}(M^*)$, which is a constant since in this paper we assume that $\sigma_{\max}(M^*)$ and $\sigma_{\min}(M^*)$ are constants.
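Algorithm 2 renormalizes the factors through the incoherence factorization step $\mathrm{IF}(\cdot)$, detailed in Algorithm 3 below. The following is a hedged sketch of one natural implementation of that step: QR-factor the input, clip the row norms of the orthonormal factor to the incoherence level (row-wise clipping solves the constrained Frobenius projection exactly), and re-orthonormalize. The clipping threshold $\mu\sqrt{k/r}$ and the returned $R^{\rm out} = W^{{\rm out}\top}W^{\rm in}$ reflect our reading of Algorithm 3, and the final QR only approximately preserves the row-norm bound.

```python
# Hedged sketch of an incoherence factorization step in the spirit of Algorithm 3.
import numpy as np

def incoherence_factorization(W_in, mu):
    r, k = W_in.shape
    Q, _ = np.linalg.qr(W_in)
    tau = mu * np.sqrt(k / r)                            # assumed incoherence level
    row_norms = np.linalg.norm(Q, axis=1, keepdims=True)
    # Row-wise clipping: argmin ||W - Q||_F subject to max_j ||W_{j.}||_2 <= tau.
    W_bar = Q * np.minimum(1.0, tau / np.maximum(row_norms, 1e-12))
    W_out, _ = np.linalg.qr(W_bar)                       # re-orthonormalize
    return W_out, W_out.T @ W_in
```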

If we allow $\sigma_{\max}(M^*)/\sigma_{\min}(M^*)$ to increase, we can replace the QR decomposition in Algorithm 3 with the smooth QR decomposition proposed by Hardt and Wootters [2014] and achieve a $\log\big(\sigma_{\max}(M^*)/\sigma_{\min}(M^*)\big)$ dependency on the condition number with a more involved proof; see Hardt and Wootters [2014] for more details. In this paper, however, our primary focus is on the dependency on $k$, $n$, and $m$, rather than optimizing the dependency on the condition number.

Algorithm 3: The incoherence factorization algorithm for matrix completion. It guarantees that the solutions satisfy the incoherence condition throughout all iterations.

  Input: $W^{\rm in} \in \mathbb{R}^{r\times k}$, where $r$ is the number of rows of $W^{\rm in}$
  Parameter: incoherence parameter $\mu$
  $(U_W^{\rm in}, R_W^{\rm in}) \leftarrow \mathrm{QR}(W^{\rm in})$
  $\bar{W} \leftarrow \arg\min_W \|W - U_W^{\rm in}\|_F$ subject to $\max_j \|W_{j*}\|_2 \le \mu\sqrt{k/r}$
  $(W^{\rm out}, R_W^{\rm tmp}) \leftarrow \mathrm{QR}(\bar{W})$
  $R_W^{\rm out} \leftarrow W^{{\rm out}\top}W^{\rm in}$
  Output: $W^{\rm out}$, $R_W^{\rm out}$

6 Numerical Experiments

We present numerical experiments to support our theoretical analysis. We first consider a matrix sensing problem with $m = 30$, $n = 40$, and $k = 5$. We vary $d$ from 300 to 900. Each entry of the $A_i$'s is independently sampled from $N(0,1)$. We then generate $M^* = \widetilde{U}\widetilde{V}^\top$, where $\widetilde{U} \in \mathbb{R}^{m\times k}$ and $\widetilde{V} \in \mathbb{R}^{n\times k}$ are two matrices with all their entries independently sampled from $N(0, 1/k)$. We then generate $d$ measurements by $b_i = \langle A_i, M^*\rangle$ for $i = 1,\dots,d$. Figure 1 illustrates the empirical performance of the alternating exact minimization and alternating gradient descent algorithms for a single realization. The step size for the alternating gradient descent algorithm is determined by the backtracking line search procedure. We see that both algorithms attain a linear rate of convergence for $d = 600$ and $d = 900$. Both algorithms fail for $d = 300$, because $d = 300$ is below the minimum requirement of sample complexity for exact matrix recovery.

We then consider a matrix completion problem with $m = 1000$, $n = 50$, and $k = 5$. We vary $\rho$ from 0.025 to 0.1. We again generate $M^* = \widetilde{U}\widetilde{V}^\top$, where $\widetilde{U} \in \mathbb{R}^{m\times k}$ and $\widetilde{V} \in \mathbb{R}^{n\times k}$ are two matrices with all their entries independently sampled from $N(0, 1/k)$. The observation set is generated uniformly at random with probability $\rho$. Figure 2 illustrates the empirical performance of the alternating exact minimization and alternating gradient descent algorithms for a single realization. The step size for the alternating gradient descent algorithm is determined by the backtracking line search procedure. We see that both algorithms attain a linear rate of convergence for $\rho = 0.05$ and $\rho = 0.1$. Both algorithms fail for $\rho = 0.025$, because the entry observation probability is below the minimum requirement of sample complexity for exact matrix recovery.

Figure 1: Two illustrative examples for matrix sensing; panel (a) shows the alternating exact minimization algorithm and panel (b) the alternating gradient descent algorithm. The vertical axis corresponds to the estimation error $\|M^{(t)} - M^*\|_F$ and the horizontal axis to the number of iterations, for $d \in \{300, 600, 900\}$. Both algorithms attain a linear rate of convergence for $d = 600$ and $d = 900$, but both fail for $d = 300$, because the sample size is not large enough to guarantee proper initial solutions.

7 Conclusion

In this paper, we propose a generic analysis for characterizing the convergence properties of nonconvex optimization algorithms. By exploiting the inexact first order oracle, we prove that a broad class of nonconvex optimization algorithms converge geometrically to the global optimum and exactly recover the true low rank matrices under suitable conditions.

A Lemmas for Theorem 1 (Alternating Exact Minimization)

A.1 Proof of Lemma 1

Proof. For notational convenience, we omit the index $t$ in $U^{(t)}$ and $\bar{U}^{(t)}$ and denote them by $U$ and $\bar{U}$, respectively.

Figure 2: Two illustrative examples for matrix completion; panel (a) shows the alternating exact minimization algorithm and panel (b) the alternating gradient descent algorithm. The vertical axis corresponds to the estimation error $\|M^{(t)} - M^*\|_F$ and the horizontal axis to the number of iterations, for $\rho \in \{0.025, 0.05, 0.1\}$. Both algorithms attain a linear rate of convergence for $\rho = 0.05$ and $\rho = 0.1$, but both fail for $\rho = 0.025$, because the entry observation probability is not large enough to guarantee proper initial solutions.

We then define two $nk\times nk$ block matrices

$S^{(t)} = \begin{bmatrix} S^{(t)}_{11} & \cdots & S^{(t)}_{1k} \\ \vdots & \ddots & \vdots \\ S^{(t)}_{k1} & \cdots & S^{(t)}_{kk} \end{bmatrix}$ with $S^{(t)}_{pq} = \sum_{i=1}^d A_i^\top U^{(t)}_{*p} U^{(t)\top}_{*q} A_i,$

$G^{(t)} = \begin{bmatrix} G^{(t)}_{11} & \cdots & G^{(t)}_{1k} \\ \vdots & \ddots & \vdots \\ G^{(t)}_{k1} & \cdots & G^{(t)}_{kk} \end{bmatrix}$ with $G^{(t)}_{pq} = \sum_{i=1}^d A_i^\top \bar{U}_{*p}\bar{U}_{*q}^\top A_i,$

for $1 \le p, q \le k$. Note that $S^{(t)}$ and $G^{(t)}$ are essentially the partial Hessian matrices $\nabla^2_V F(U^{(t)}, V)$ and $\nabla^2_V F(\bar{U}, V)$ for a vectorized $V$, i.e., $\mathrm{vec}(V) \in \mathbb{R}^{nk}$. Before we proceed with the main proof, we first introduce the following lemma.

Lemma 11. Suppose that $\mathcal{A}(\cdot)$ satisfies $2k$-RIP with parameter $\delta_{2k}$. We then have

$1 + \delta_{2k} \ \ge\ \sigma_{\max}(S^{(t)}) \ \ge\ \sigma_{\min}(S^{(t)}) \ \ge\ 1 - \delta_{2k}.$

The proof of Lemma 11 is provided in Appendix A.7. Note that Lemma 11 is also applicable to $G^{(t)}$, since $G^{(t)}$ shares the same structure with $S^{(t)}$. We then proceed with the proof of Lemma 1. Given a fixed $U$, $F(U, V)$ is a quadratic function of $V$.

Therefore, we have

$F(U, V) = F(U, V') + \langle \nabla_V F(U, V'), V - V'\rangle + \tfrac{1}{2}\big\langle \mathrm{vec}(V) - \mathrm{vec}(V'),\ \nabla^2_V F(U, V')\big(\mathrm{vec}(V) - \mathrm{vec}(V')\big)\big\rangle,$

which further implies

$F(U, V) - F(U, V') - \langle \nabla_V F(U, V'), V - V'\rangle \le \frac{\sigma_{\max}\big(\nabla^2_V F(U, V')\big)}{2}\|V - V'\|_F^2,$
$F(U, V) - F(U, V') - \langle \nabla_V F(U, V'), V - V'\rangle \ge \frac{\sigma_{\min}\big(\nabla^2_V F(U, V')\big)}{2}\|V - V'\|_F^2.$

We can verify that $\nabla^2_V F(U, V')$ also shares the same structure with $S^{(t)}$. Thus, applying Lemma 11 to the above two inequalities, we complete the proof.

A.2 Proof of Lemma 3

Proof. For notational convenience, we omit the index $t$ in $\bar{U}^{(t)}$ and $\bar{V}^{(t)}$ and denote them by $\bar{U}$ and $\bar{V}$, respectively. We define two $nk\times nk$ block matrices

$J^{(t)} = \begin{bmatrix} J^{(t)}_{11} & \cdots & J^{(t)}_{1k} \\ \vdots & \ddots & \vdots \\ J^{(t)}_{k1} & \cdots & J^{(t)}_{kk} \end{bmatrix}$ with $J^{(t)}_{pq} = \sum_{i=1}^d A_i^\top U^{(t)}_{*p}\bar{U}_{*q}^\top A_i,$

$K^{(t)} = \begin{bmatrix} K^{(t)}_{11} & \cdots & K^{(t)}_{1k} \\ \vdots & \ddots & \vdots \\ K^{(t)}_{k1} & \cdots & K^{(t)}_{kk} \end{bmatrix}$ with $K^{(t)}_{pq} = U^{(t)\top}_{*p}\bar{U}_{*q}\, I_n,$

for $1 \le p, q \le k$. Before we proceed with the main proof, we first introduce the following lemmas.

Lemma 12. Suppose that $\mathcal{A}(\cdot)$ satisfies $2k$-RIP with parameter $\delta_{2k}$. We then have

$\|S^{(t)}K^{(t)} - J^{(t)}\|_2 \le 3\delta_{2k}\|U^{(t)} - \bar{U}\|_F.$

The proof of Lemma 12 is provided in Appendix A.8. Note that Lemma 12 is also applicable to $G^{(t)}K^{(t)} - J^{(t)}$, since $G^{(t)}$ and $S^{(t)}$ share the same structure.

Lemma 13. Given $F \in \mathbb{R}^{k\times k}$, we define the $nk\times nk$ matrix

$\mathbb{F} = \begin{bmatrix} F_{11}I_n & \cdots & F_{1k}I_n \\ \vdots & \ddots & \vdots \\ F_{k1}I_n & \cdots & F_{kk}I_n \end{bmatrix}.$

For any $V \in \mathbb{R}^{n\times k}$, let $v = \mathrm{vec}(V) \in \mathbb{R}^{nk}$; then we have $\|\mathbb{F}v\|_2 = \|VF^\top\|_F$.

Proof. By linear algebra, the $i$-th block of $\mathbb{F}v$ is $\sum_{l=1}^k F_{il}I_n V_{*l} = \sum_{l=1}^k F_{il}V_{*l} = (VF^\top)_{*i}$, which completes the proof.

We then proceed with the proof of Lemma 3. Since $b_i = \mathrm{tr}(\bar{V}\bar{U}^\top A_i)$, we can rewrite $F(U, V)$ as

$F(U,V) = \frac{1}{2}\sum_{i=1}^d\big(\mathrm{tr}(VU^\top A_i) - b_i\big)^2 = \frac{1}{2}\sum_{i=1}^d\Big(\sum_{j=1}^k V_{*j}^\top A_i^\top U_{*j} - \sum_{j=1}^k \bar{V}_{*j}^\top A_i^\top\bar{U}_{*j}\Big)^2.$

For notational simplicity, we define $\bar{v} = \mathrm{vec}(\bar{V})$. Since $V^{(t+0.5)}$ minimizes $F(U^{(t)}, V)$, we have

$\mathrm{vec}\big(\nabla_V F(U^{(t)}, V^{(t+0.5)})\big) = S^{(t)}v^{(t+0.5)} - J^{(t)}\bar{v} = 0.$

Solving the above system of equations, we obtain

$v^{(t+0.5)} = (S^{(t)})^{-1}J^{(t)}\bar{v}.$   (55)

Meanwhile, we have

$\mathrm{vec}\big(\nabla_V F(\bar{U}, V^{(t+0.5)})\big) = G^{(t)}v^{(t+0.5)} - G^{(t)}\bar{v} = G^{(t)}(S^{(t)})^{-1}J^{(t)}\bar{v} - G^{(t)}\bar{v} = G^{(t)}\big((S^{(t)})^{-1}J^{(t)} - I_{nk}\big)\bar{v},$   (56)

where the second equality comes from (55). By the triangle inequality, (56) further implies

$\big\|\big((S^{(t)})^{-1}J^{(t)} - I_{nk}\big)\bar{v}\big\|_2 \le \big\|(K^{(t)} - I_{nk})\bar{v}\big\|_2 + \big\|(S^{(t)})^{-1}(J^{(t)} - S^{(t)}K^{(t)})\bar{v}\big\|_2 \le \big\|\bar{V}(U^{(t)\top}\bar{U} - I_k)^\top\big\|_F + \big\|(S^{(t)})^{-1}(J^{(t)} - S^{(t)}K^{(t)})\bar{v}\big\|_2 \le \|U^{(t)\top}\bar{U} - I_k\|_F\|\bar{V}\|_2 + \big\|(S^{(t)})^{-1}\big\|_2\big\|J^{(t)} - S^{(t)}K^{(t)}\big\|_2\|\bar{v}\|_2,$   (57)

where the second inequality comes from Lemma 13. Plugging (57) into (56), we have

$\big\|\mathrm{vec}\big(\nabla_V F(\bar{U}, V^{(t+0.5)})\big)\big\|_2 \le \|G^{(t)}\|_2\big\|\big((S^{(t)})^{-1}J^{(t)} - I_{nk}\big)\bar{v}\big\|_2 \overset{(i)}{\le} (1+\delta_{2k})\Big(\sigma_1\|U^{(t)\top}\bar{U} - I_k\|_F + \frac{\|S^{(t)}K^{(t)} - J^{(t)}\|_2\,\sigma_1}{1-\delta_{2k}}\Big) \overset{(ii)}{\le} (1+\delta_{2k})\sigma_1\Big(\|(U^{(t)} - \bar{U})^\top\bar{U}\|_F + \frac{3\delta_{2k}}{1-\delta_{2k}}\|U^{(t)} - \bar{U}\|_F\Big) \overset{(iii)}{\le} (1+\delta_{2k})\sigma_1\Big(1 + \frac{3\delta_{2k}}{1-\delta_{2k}}\Big)\|U^{(t)} - \bar{U}\|_F \overset{(iv)}{\le} \frac{(1-\delta_{2k})\sigma_k}{2\xi}\|U^{(t)} - \bar{U}\|_F,$

where (i) comes from Lemma 11 together with $\|\bar{V}\|_2 = \|M^*\|_2 = \sigma_1$ and $\|\bar{v}\|_2 = \|\bar{V}\|_F \le \sqrt{k}\,\sigma_1$, (ii) comes from Lemmas 11 and 12, (iii) comes from the Cauchy-Schwarz inequality, and (iv) comes from (16). Since $\nabla_V F(U^{(t)}, V^{(t+0.5)}) = 0$, we further obtain

$E(V^{(t+0.5)}, V^{(t+0.5)}, U^{(t)}) \le \frac{(1-\delta_{2k})\sigma_k}{2\xi}\|U^{(t)} - \bar{U}\|_F,$

which completes the proof.

A.3 Proof of Lemma 4

Proof. For notational convenience, we omit the index $t$ in $\bar{U}^{(t)}$ and $\bar{V}^{(t)}$ and denote them by $\bar{U}$ and $\bar{V}$, respectively. By the strong convexity of $F(\bar{U}, \cdot)$, we have

$F(\bar{U}, \bar{V}) - \frac{1-\delta_{2k}}{2}\|V^{(t+0.5)} - \bar{V}\|_F^2 \ \ge\ F(\bar{U}, V^{(t+0.5)}) + \langle \nabla_V F(\bar{U}, V^{(t+0.5)}),\ \bar{V} - V^{(t+0.5)}\rangle.$   (58)

By the strong convexity of $F(\bar{U}, \cdot)$ again, we have

$F(\bar{U}, V^{(t+0.5)}) \ \ge\ F(\bar{U}, \bar{V}) + \langle \nabla_V F(\bar{U}, \bar{V}),\ V^{(t+0.5)} - \bar{V}\rangle + \frac{1-\delta_{2k}}{2}\|V^{(t+0.5)} - \bar{V}\|_F^2 \ \ge\ F(\bar{U}, \bar{V}) + \frac{1-\delta_{2k}}{2}\|V^{(t+0.5)} - \bar{V}\|_F^2,$   (59)

where the last inequality comes from the optimality condition of $\bar{V} = \arg\min_V F(\bar{U}, V)$, i.e., $\langle \nabla_V F(\bar{U}, \bar{V}),\ V^{(t+0.5)} - \bar{V}\rangle \ge 0$. Meanwhile, since $V^{(t+0.5)}$ minimizes $F(U^{(t)}, \cdot)$, we have the optimality condition $\langle \nabla_V F(U^{(t)}, V^{(t+0.5)}),\ \bar{V} - V^{(t+0.5)}\rangle \ge 0$, which further implies

$\langle \nabla_V F(\bar{U}, V^{(t+0.5)}),\ \bar{V} - V^{(t+0.5)}\rangle \ \ge\ \langle \nabla_V F(\bar{U}, V^{(t+0.5)}) - \nabla_V F(U^{(t)}, V^{(t+0.5)}),\ \bar{V} - V^{(t+0.5)}\rangle.$   (60)

Combining (58) and (59) with (60), we obtain

$\|V^{(t+0.5)} - \bar{V}\|_F \le \frac{1}{1-\delta_{2k}}E(V^{(t+0.5)}, V^{(t+0.5)}, U^{(t)}),$

which completes the proof.

A.4 Proof of Lemma 5

Proof. Before we proceed with the proof, we first introduce the following lemma.

Lemma 14. Suppose that $A \in \mathbb{R}^{n\times k}$ is a rank-$k$ matrix. Let $E \in \mathbb{R}^{n\times k}$ satisfy $\|E\|_2\|A^\dagger\|_2 < 1$. Then, given a QR decomposition $(A + E) = QR$, there exists a factorization $A = \bar{Q}\bar{O}$ such that $\bar{Q} \in \mathbb{R}^{n\times k}$ is an orthonormal matrix and

$\|Q - \bar{Q}\|_F \le \frac{\|A^\dagger\|_2\|E\|_F}{1 - \|A^\dagger\|_2\|E\|_2}.$

The proof of Lemma 14 is provided in Stewart et al. [1990] and is therefore omitted. We then proceed with the proof of Lemma 5, applying Lemma 14 with $A = \bar{V}^{(t)}$ and $E = V^{(t+0.5)} - \bar{V}^{(t)}$. We can verify that

$\|V^{(t+0.5)} - \bar{V}^{(t)}\|_2\,\|(\bar{V}^{(t)})^\dagger\|_2 \le \frac{\|V^{(t+0.5)} - \bar{V}^{(t)}\|_F}{\sigma_k} \le \frac{1}{4}.$

Then there exists a factorization $\bar{V}^{(t)} = \bar{V}^{(t+1)}\bar{O}$ such that $\bar{V}^{(t+1)}$ is an orthonormal matrix and

$\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F \le \frac{\|(\bar{V}^{(t)})^\dagger\|_2\,\|V^{(t+0.5)} - \bar{V}^{(t)}\|_F}{1 - \|V^{(t+0.5)} - \bar{V}^{(t)}\|_2\|(\bar{V}^{(t)})^\dagger\|_2} \le \frac{2}{\sigma_k}\|V^{(t+0.5)} - \bar{V}^{(t)}\|_F.$

A.5 Proof of Lemma 6

Proof. Before we proceed with the main proof, we first introduce the following lemma.

Lemma 15. Let $b = \mathcal{A}(M^*)$, where $M^*$ is a rank-$k$ matrix and $\mathcal{A}$ is a linear measurement operator that satisfies $2k$-RIP with constant $\delta_{2k} < 1/3$. Let $X^{(t+1)}$ be the $(t+1)$-th iterate of SVP; then we have

$\|\mathcal{A}(X^{(t+1)}) - b\|_2^2 \le \|\mathcal{A}(M^*) - b\|_2^2 + 2\delta_{2k}\|\mathcal{A}(X^{(t)}) - b\|_2^2.$

The proof of Lemma 15 is provided in Jain et al. [2010] and is therefore omitted. We now explain the implication of Lemma 15. Jain et al. [2010] show that $X^{(t+1)}$ is obtained by taking a projected gradient iteration from $X^{(t)}$ with step size $\frac{1}{1+\delta_{2k}}$. Taking $X^{(t)} = 0$, we have

$X^{(t+1)} = \frac{U^{(0)}\Sigma^{(0)}V^{(0)\top}}{1+\delta_{2k}}.$

Then Lemma 15 implies

$\Big\|\mathcal{A}\Big(\frac{U^{(0)}\Sigma^{(0)}V^{(0)\top}}{1+\delta_{2k}} - M^*\Big)\Big\|_2^2 \le \frac{4\delta_{2k}}{1+\delta_{2k}}\|\mathcal{A}(M^*)\|_2^2.$   (61)

Since $\mathcal{A}(\cdot)$ satisfies $2k$-RIP, (61) further implies

$\Big\|\frac{U^{(0)}\Sigma^{(0)}V^{(0)\top}}{1+\delta_{2k}} - M^*\Big\|_F^2 \le 4\delta_{2k}(1 + 3\delta_{2k})\|M^*\|_F^2.$   (62)

We then project each column of $M^*$ onto the subspace spanned by $\{U^{(0)}_{*i}\}_{i=1}^k$ and obtain

$\|U^{(0)}U^{(0)\top}M^* - M^*\|_F^2 \le 6\delta_{2k}\|M^*\|_F^2.$


More information

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 11 Luca Trevisan February 29, 2016

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 11 Luca Trevisan February 29, 2016 U.C. Berkeley CS294: Spectral Methods and Expanders Handout Luca Trevisan February 29, 206 Lecture : ARV In which we introduce semi-definite programming and a semi-definite programming relaxation of sparsest

More information

Dropping Convexity for Faster Semi-definite Optimization

Dropping Convexity for Faster Semi-definite Optimization JMLR: Workshop and Conference Proceedings vol 49: 53, 06 Dropping Convexity for Faster Semi-definite Optimization Srinadh Bhojanapalli Toyota Technological Institute at Chicago Anastasios Kyrillidis Sujay

More information

15 Singular Value Decomposition

15 Singular Value Decomposition 15 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information

Lecture 4: Purifications and fidelity

Lecture 4: Purifications and fidelity CS 766/QIC 820 Theory of Quantum Information (Fall 2011) Lecture 4: Purifications and fidelity Throughout this lecture we will be discussing pairs of registers of the form (X, Y), and the relationships

More information

Recent Developments in Compressed Sensing

Recent Developments in Compressed Sensing Recent Developments in Compressed Sensing M. Vidyasagar Distinguished Professor, IIT Hyderabad m.vidyasagar@iith.ac.in, www.iith.ac.in/ m vidyasagar/ ISL Seminar, Stanford University, 19 April 2018 Outline

More information

Characterization of Gradient Dominance and Regularity Conditions for Neural Networks

Characterization of Gradient Dominance and Regularity Conditions for Neural Networks Characterization of Gradient Dominance and Regularity Conditions for Neural Networks Yi Zhou Ohio State University Yingbin Liang Ohio State University Abstract zhou.1172@osu.edu liang.889@osu.edu The past

More information

Lecture notes: Applied linear algebra Part 1. Version 2

Lecture notes: Applied linear algebra Part 1. Version 2 Lecture notes: Applied linear algebra Part 1. Version 2 Michael Karow Berlin University of Technology karow@math.tu-berlin.de October 2, 2008 1 Notation, basic notions and facts 1.1 Subspaces, range and

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Instructor: Moritz Hardt Email: hardt+ee227c@berkeley.edu Graduate Instructor: Max Simchowitz Email: msimchow+ee227c@berkeley.edu

More information

New Coherence and RIP Analysis for Weak. Orthogonal Matching Pursuit

New Coherence and RIP Analysis for Weak. Orthogonal Matching Pursuit New Coherence and RIP Analysis for Wea 1 Orthogonal Matching Pursuit Mingrui Yang, Member, IEEE, and Fran de Hoog arxiv:1405.3354v1 [cs.it] 14 May 2014 Abstract In this paper we define a new coherence

More information

Sketchy Decisions: Convex Low-Rank Matrix Optimization with Optimal Storage

Sketchy Decisions: Convex Low-Rank Matrix Optimization with Optimal Storage Sketchy Decisions: Convex Low-Rank Matrix Optimization with Optimal Storage Madeleine Udell Operations Research and Information Engineering Cornell University Based on joint work with Alp Yurtsever (EPFL),

More information

Supplement to A Generalized Least Squares Matrix Decomposition. 1 GPMF & Smoothness: Ω-norm Penalty & Functional Data

Supplement to A Generalized Least Squares Matrix Decomposition. 1 GPMF & Smoothness: Ω-norm Penalty & Functional Data Supplement to A Generalized Least Squares Matrix Decomposition Genevera I. Allen 1, Logan Grosenic 2, & Jonathan Taylor 3 1 Department of Statistics and Electrical and Computer Engineering, Rice University

More information

approximation algorithms I

approximation algorithms I SUM-OF-SQUARES method and approximation algorithms I David Steurer Cornell Cargese Workshop, 201 meta-task encoded as low-degree polynomial in R x example: f(x) = i,j n w ij x i x j 2 given: functions

More information

Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012

Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012 Instructions Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012 The exam consists of four problems, each having multiple parts. You should attempt to solve all four problems. 1.

More information

Linear Algebra Massoud Malek

Linear Algebra Massoud Malek CSUEB Linear Algebra Massoud Malek Inner Product and Normed Space In all that follows, the n n identity matrix is denoted by I n, the n n zero matrix by Z n, and the zero vector by θ n An inner product

More information

Lecture 2: Linear Algebra Review

Lecture 2: Linear Algebra Review EE 227A: Convex Optimization and Applications January 19 Lecture 2: Linear Algebra Review Lecturer: Mert Pilanci Reading assignment: Appendix C of BV. Sections 2-6 of the web textbook 1 2.1 Vectors 2.1.1

More information

Collaborative Filtering Matrix Completion Alternating Least Squares

Collaborative Filtering Matrix Completion Alternating Least Squares Case Study 4: Collaborative Filtering Collaborative Filtering Matrix Completion Alternating Least Squares Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade May 19, 2016

More information

Sparsity Regularization

Sparsity Regularization Sparsity Regularization Bangti Jin Course Inverse Problems & Imaging 1 / 41 Outline 1 Motivation: sparsity? 2 Mathematical preliminaries 3 l 1 solvers 2 / 41 problem setup finite-dimensional formulation

More information

Structured matrix factorizations. Example: Eigenfaces

Structured matrix factorizations. Example: Eigenfaces Structured matrix factorizations Example: Eigenfaces An extremely large variety of interesting and important problems in machine learning can be formulated as: Given a matrix, find a matrix and a matrix

More information

Limitations in Approximating RIP

Limitations in Approximating RIP Alok Puranik Mentor: Adrian Vladu Fifth Annual PRIMES Conference, 2015 Outline 1 Background The Problem Motivation Construction Certification 2 Planted model Planting eigenvalues Analysis Distinguishing

More information

Low-Rank Matrix Recovery

Low-Rank Matrix Recovery ELE 538B: Mathematics of High-Dimensional Data Low-Rank Matrix Recovery Yuxin Chen Princeton University, Fall 2018 Outline Motivation Problem setup Nuclear norm minimization RIP and low-rank matrix recovery

More information

Tighten after Relax: Minimax-Optimal Sparse PCA in Polynomial Time

Tighten after Relax: Minimax-Optimal Sparse PCA in Polynomial Time Tighten after Relax: Minimax-Optimal Sparse PCA in Polynomial Time Zhaoran Wang Huanran Lu Han Liu Department of Operations Research and Financial Engineering Princeton University Princeton, NJ 08540 {zhaoran,huanranl,hanliu}@princeton.edu

More information

arxiv: v1 [math.st] 10 Sep 2015

arxiv: v1 [math.st] 10 Sep 2015 Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees Department of Statistics Yudong Chen Martin J. Wainwright, Department of Electrical Engineering and

More information

5.6. PSEUDOINVERSES 101. A H w.

5.6. PSEUDOINVERSES 101. A H w. 5.6. PSEUDOINVERSES 0 Corollary 5.6.4. If A is a matrix such that A H A is invertible, then the least-squares solution to Av = w is v = A H A ) A H w. The matrix A H A ) A H is the left inverse of A and

More information

Linear Algebra and Eigenproblems

Linear Algebra and Eigenproblems Appendix A A Linear Algebra and Eigenproblems A working knowledge of linear algebra is key to understanding many of the issues raised in this work. In particular, many of the discussions of the details

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57

More information

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28 Sparsity Models Tong Zhang Rutgers University T. Zhang (Rutgers) Sparsity Models 1 / 28 Topics Standard sparse regression model algorithms: convex relaxation and greedy algorithm sparse recovery analysis:

More information

Lecture Notes 10: Matrix Factorization

Lecture Notes 10: Matrix Factorization Optimization-based data analysis Fall 207 Lecture Notes 0: Matrix Factorization Low-rank models. Rank- model Consider the problem of modeling a quantity y[i, j] that depends on two indices i and j. To

More information

We describe the generalization of Hazan s algorithm for symmetric programming

We describe the generalization of Hazan s algorithm for symmetric programming ON HAZAN S ALGORITHM FOR SYMMETRIC PROGRAMMING PROBLEMS L. FAYBUSOVICH Abstract. problems We describe the generalization of Hazan s algorithm for symmetric programming Key words. Symmetric programming,

More information

MATH 205C: STATIONARY PHASE LEMMA

MATH 205C: STATIONARY PHASE LEMMA MATH 205C: STATIONARY PHASE LEMMA For ω, consider an integral of the form I(ω) = e iωf(x) u(x) dx, where u Cc (R n ) complex valued, with support in a compact set K, and f C (R n ) real valued. Thus, I(ω)

More information

The Projected Power Method: An Efficient Algorithm for Joint Alignment from Pairwise Differences

The Projected Power Method: An Efficient Algorithm for Joint Alignment from Pairwise Differences The Projected Power Method: An Efficient Algorithm for Joint Alignment from Pairwise Differences Yuxin Chen Emmanuel Candès Department of Statistics, Stanford University, Sep. 2016 Nonconvex optimization

More information

Lecture 8: February 9

Lecture 8: February 9 0-725/36-725: Convex Optimiation Spring 205 Lecturer: Ryan Tibshirani Lecture 8: February 9 Scribes: Kartikeya Bhardwaj, Sangwon Hyun, Irina Caan 8 Proximal Gradient Descent In the previous lecture, we

More information

Exercise Solutions to Functional Analysis

Exercise Solutions to Functional Analysis Exercise Solutions to Functional Analysis Note: References refer to M. Schechter, Principles of Functional Analysis Exersize that. Let φ,..., φ n be an orthonormal set in a Hilbert space H. Show n f n

More information

A Piggybacking Design Framework for Read-and Download-efficient Distributed Storage Codes

A Piggybacking Design Framework for Read-and Download-efficient Distributed Storage Codes A Piggybacing Design Framewor for Read-and Download-efficient Distributed Storage Codes K V Rashmi, Nihar B Shah, Kannan Ramchandran, Fellow, IEEE Department of Electrical Engineering and Computer Sciences

More information

The deterministic Lasso

The deterministic Lasso The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality

More information

SECTION: CONTINUOUS OPTIMISATION LECTURE 4: QUASI-NEWTON METHODS

SECTION: CONTINUOUS OPTIMISATION LECTURE 4: QUASI-NEWTON METHODS SECTION: CONTINUOUS OPTIMISATION LECTURE 4: QUASI-NEWTON METHODS HONOUR SCHOOL OF MATHEMATICS, OXFORD UNIVERSITY HILARY TERM 2005, DR RAPHAEL HAUSER 1. The Quasi-Newton Idea. In this lecture we will discuss

More information

j=1 [We will show that the triangle inequality holds for each p-norm in Chapter 3 Section 6.] The 1-norm is A F = tr(a H A).

j=1 [We will show that the triangle inequality holds for each p-norm in Chapter 3 Section 6.] The 1-norm is A F = tr(a H A). Math 344 Lecture #19 3.5 Normed Linear Spaces Definition 3.5.1. A seminorm on a vector space V over F is a map : V R that for all x, y V and for all α F satisfies (i) x 0 (positivity), (ii) αx = α x (scale

More information

October 25, 2013 INNER PRODUCT SPACES

October 25, 2013 INNER PRODUCT SPACES October 25, 2013 INNER PRODUCT SPACES RODICA D. COSTIN Contents 1. Inner product 2 1.1. Inner product 2 1.2. Inner product spaces 4 2. Orthogonal bases 5 2.1. Existence of an orthogonal basis 7 2.2. Orthogonal

More information

CS 229r: Algorithms for Big Data Fall Lecture 19 Nov 5

CS 229r: Algorithms for Big Data Fall Lecture 19 Nov 5 CS 229r: Algorithms for Big Data Fall 215 Prof. Jelani Nelson Lecture 19 Nov 5 Scribe: Abdul Wasay 1 Overview In the last lecture, we started discussing the problem of compressed sensing where we are given

More information

Elementary linear algebra

Elementary linear algebra Chapter 1 Elementary linear algebra 1.1 Vector spaces Vector spaces owe their importance to the fact that so many models arising in the solutions of specific problems turn out to be vector spaces. The

More information

arxiv: v5 [math.na] 16 Nov 2017

arxiv: v5 [math.na] 16 Nov 2017 RANDOM PERTURBATION OF LOW RANK MATRICES: IMPROVING CLASSICAL BOUNDS arxiv:3.657v5 [math.na] 6 Nov 07 SEAN O ROURKE, VAN VU, AND KE WANG Abstract. Matrix perturbation inequalities, such as Weyl s theorem

More information

Constructing Explicit RIP Matrices and the Square-Root Bottleneck

Constructing Explicit RIP Matrices and the Square-Root Bottleneck Constructing Explicit RIP Matrices and the Square-Root Bottleneck Ryan Cinoman July 18, 2018 Ryan Cinoman Constructing Explicit RIP Matrices July 18, 2018 1 / 36 Outline 1 Introduction 2 Restricted Isometry

More information

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Bo Liu Department of Computer Science, Rutgers Univeristy Xiao-Tong Yuan BDAT Lab, Nanjing University of Information Science and Technology

More information

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2017 LECTURE 5

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2017 LECTURE 5 STAT 39: MATHEMATICAL COMPUTATIONS I FALL 17 LECTURE 5 1 existence of svd Theorem 1 (Existence of SVD) Every matrix has a singular value decomposition (condensed version) Proof Let A C m n and for simplicity

More information

Non-convex Robust PCA: Provable Bounds

Non-convex Robust PCA: Provable Bounds Non-convex Robust PCA: Provable Bounds Anima Anandkumar U.C. Irvine Joint work with Praneeth Netrapalli, U.N. Niranjan, Prateek Jain and Sujay Sanghavi. Learning with Big Data High Dimensional Regime Missing

More information

SPARSE signal representations have gained popularity in recent

SPARSE signal representations have gained popularity in recent 6958 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011 Blind Compressed Sensing Sivan Gleichman and Yonina C. Eldar, Senior Member, IEEE Abstract The fundamental principle underlying

More information

1 Cricket chirps: an example

1 Cricket chirps: an example Notes for 2016-09-26 1 Cricket chirps: an example Did you know that you can estimate the temperature by listening to the rate of chirps? The data set in Table 1 1. represents measurements of the number

More information

On Sparsity, Redundancy and Quality of Frame Representations

On Sparsity, Redundancy and Quality of Frame Representations On Sparsity, Redundancy and Quality of Frame Representations Mehmet Açaaya Division of Engineering and Applied Sciences Harvard University Cambridge, MA Email: acaaya@fasharvardedu Vahid Taroh Division

More information

Guaranteed Rank Minimization via Singular Value Projection: Supplementary Material

Guaranteed Rank Minimization via Singular Value Projection: Supplementary Material Guaranteed Rank Minimization via Singular Value Projection: Supplementary Material Prateek Jain Microsoft Research Bangalore Bangalore, India prajain@microsoft.com Raghu Meka UT Austin Dept. of Computer

More information

7. Symmetric Matrices and Quadratic Forms

7. Symmetric Matrices and Quadratic Forms Linear Algebra 7. Symmetric Matrices and Quadratic Forms CSIE NCU 1 7. Symmetric Matrices and Quadratic Forms 7.1 Diagonalization of symmetric matrices 2 7.2 Quadratic forms.. 9 7.4 The singular value

More information

An iterative hard thresholding estimator for low rank matrix recovery

An iterative hard thresholding estimator for low rank matrix recovery An iterative hard thresholding estimator for low rank matrix recovery Alexandra Carpentier - based on a joint work with Arlene K.Y. Kim Statistical Laboratory, Department of Pure Mathematics and Mathematical

More information

Rank minimization via the γ 2 norm

Rank minimization via the γ 2 norm Rank minimization via the γ 2 norm Troy Lee Columbia University Adi Shraibman Weizmann Institute Rank Minimization Problem Consider the following problem min X rank(x) A i, X b i for i = 1,..., k Arises

More information

NORMS ON SPACE OF MATRICES

NORMS ON SPACE OF MATRICES NORMS ON SPACE OF MATRICES. Operator Norms on Space of linear maps Let A be an n n real matrix and x 0 be a vector in R n. We would like to use the Picard iteration method to solve for the following system

More information

Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725

Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725 Proximal Gradient Descent and Acceleration Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: subgradient method Consider the problem min f(x) with f convex, and dom(f) = R n. Subgradient method:

More information

The Proximal Gradient Method

The Proximal Gradient Method Chapter 10 The Proximal Gradient Method Underlying Space: In this chapter, with the exception of Section 10.9, E is a Euclidean space, meaning a finite dimensional space endowed with an inner product,

More information

Lecture Notes 9: Constrained Optimization

Lecture Notes 9: Constrained Optimization Optimization-based data analysis Fall 017 Lecture Notes 9: Constrained Optimization 1 Compressed sensing 1.1 Underdetermined linear inverse problems Linear inverse problems model measurements of the form

More information

6 Compressed Sensing and Sparse Recovery

6 Compressed Sensing and Sparse Recovery 6 Compressed Sensing and Sparse Recovery Most of us have noticed how saving an image in JPEG dramatically reduces the space it occupies in our hard drives as oppose to file types that save the pixel value

More information

Linear Algebra. Paul Yiu. Department of Mathematics Florida Atlantic University. Fall A: Inner products

Linear Algebra. Paul Yiu. Department of Mathematics Florida Atlantic University. Fall A: Inner products Linear Algebra Paul Yiu Department of Mathematics Florida Atlantic University Fall 2011 6A: Inner products In this chapter, the field F = R or C. We regard F equipped with a conjugation χ : F F. If F =

More information

Multiple integrals: Sufficient conditions for a local minimum, Jacobi and Weierstrass-type conditions

Multiple integrals: Sufficient conditions for a local minimum, Jacobi and Weierstrass-type conditions Multiple integrals: Sufficient conditions for a local minimum, Jacobi and Weierstrass-type conditions March 6, 2013 Contents 1 Wea second variation 2 1.1 Formulas for variation........................

More information

Greedy Signal Recovery and Uniform Uncertainty Principles

Greedy Signal Recovery and Uniform Uncertainty Principles Greedy Signal Recovery and Uniform Uncertainty Principles SPIE - IE 2008 Deanna Needell Joint work with Roman Vershynin UC Davis, January 2008 Greedy Signal Recovery and Uniform Uncertainty Principles

More information

sublinear time low-rank approximation of positive semidefinite matrices Cameron Musco (MIT) and David P. Woodru (CMU)

sublinear time low-rank approximation of positive semidefinite matrices Cameron Musco (MIT) and David P. Woodru (CMU) sublinear time low-rank approximation of positive semidefinite matrices Cameron Musco (MIT) and David P. Woodru (CMU) 0 overview Our Contributions: 1 overview Our Contributions: A near optimal low-rank

More information

Grothendieck s Inequality

Grothendieck s Inequality Grothendieck s Inequality Leqi Zhu 1 Introduction Let A = (A ij ) R m n be an m n matrix. Then A defines a linear operator between normed spaces (R m, p ) and (R n, q ), for 1 p, q. The (p q)-norm of A

More information

CHAPTER 11. A Revision. 1. The Computers and Numbers therein

CHAPTER 11. A Revision. 1. The Computers and Numbers therein CHAPTER A Revision. The Computers and Numbers therein Traditional computer science begins with a finite alphabet. By stringing elements of the alphabet one after another, one obtains strings. A set of

More information

Matrix completion: Fundamental limits and efficient algorithms. Sewoong Oh Stanford University

Matrix completion: Fundamental limits and efficient algorithms. Sewoong Oh Stanford University Matrix completion: Fundamental limits and efficient algorithms Sewoong Oh Stanford University 1 / 35 Low-rank matrix completion Low-rank Data Matrix Sparse Sampled Matrix Complete the matrix from small

More information

Course Summary Math 211

Course Summary Math 211 Course Summary Math 211 table of contents I. Functions of several variables. II. R n. III. Derivatives. IV. Taylor s Theorem. V. Differential Geometry. VI. Applications. 1. Best affine approximations.

More information

Lecture 9: Numerical Linear Algebra Primer (February 11st)

Lecture 9: Numerical Linear Algebra Primer (February 11st) 10-725/36-725: Convex Optimization Spring 2015 Lecture 9: Numerical Linear Algebra Primer (February 11st) Lecturer: Ryan Tibshirani Scribes: Avinash Siravuru, Guofan Wu, Maosheng Liu Note: LaTeX template

More information

The convex algebraic geometry of rank minimization

The convex algebraic geometry of rank minimization The convex algebraic geometry of rank minimization Pablo A. Parrilo Laboratory for Information and Decision Systems Massachusetts Institute of Technology International Symposium on Mathematical Programming

More information

Reconstruction from Anisotropic Random Measurements

Reconstruction from Anisotropic Random Measurements Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013

More information

DS-GA 1002 Lecture notes 10 November 23, Linear models

DS-GA 1002 Lecture notes 10 November 23, Linear models DS-GA 2 Lecture notes November 23, 2 Linear functions Linear models A linear model encodes the assumption that two quantities are linearly related. Mathematically, this is characterized using linear functions.

More information

Chapter 1. Preliminaries. The purpose of this chapter is to provide some basic background information. Linear Space. Hilbert Space.

Chapter 1. Preliminaries. The purpose of this chapter is to provide some basic background information. Linear Space. Hilbert Space. Chapter 1 Preliminaries The purpose of this chapter is to provide some basic background information. Linear Space Hilbert Space Basic Principles 1 2 Preliminaries Linear Space The notion of linear space

More information

Rui ZHANG Song LI. Department of Mathematics, Zhejiang University, Hangzhou , P. R. China

Rui ZHANG Song LI. Department of Mathematics, Zhejiang University, Hangzhou , P. R. China Acta Mathematica Sinica, English Series May, 015, Vol. 31, No. 5, pp. 755 766 Published online: April 15, 015 DOI: 10.1007/s10114-015-434-4 Http://www.ActaMath.com Acta Mathematica Sinica, English Series

More information