Nonconvex Low Rank Matrix Factorization via Inexact First Order Oracle


Tuo Zhao, Zhaoran Wang, Han Liu

Tuo Zhao is affiliated with the Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA, and the Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA; tour@cs.jhu.edu. Zhaoran Wang is affiliated with the Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA; zhaoran@princeton.edu. Han Liu is affiliated with the Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA; hanliu@princeton.edu.

Abstract

We study the low rank matrix factorization problem via nonconvex optimization. Compared with the convex relaxation approach, nonconvex optimization exhibits superior empirical performance for large scale low rank matrix estimation. However, the understanding of its theoretical guarantees is limited. To bridge this gap, we exploit the notion of inexact first order oracle, which naturally appears in low rank matrix factorization problems such as matrix sensing and completion. In particular, our analysis shows that a broad class of nonconvex optimization algorithms, including alternating minimization and gradient-type methods, can be treated as solving two sequences of convex optimization problems using the inexact first order oracle. Thus we can show that these algorithms converge geometrically to the global optima and recover the true low rank matrices under suitable conditions. Numerical results are provided to support our theory.

1 Introduction

Let M* ∈ R^{m×n} be a rank k matrix with k much smaller than m and n. Our goal is to estimate M* based on partial observations of its entries. For example, matrix completion is based on a subsample of the entries of M*, while matrix sensing is based on linear measurements ⟨A_i, M*⟩, where i ∈ {1, ..., d} with d much smaller than mn and A_i is a sensing matrix. In the past decade, significant progress has been made on the recovery of low rank matrices [Candès and Recht, 2009, Candès and Tao, 2010, Candes and Plan, 2010, Recht et al., 2010, Lee and Bresler, 2010, Keshavan et al., 2010a,b, Jain et al., 2010, Cai et al., 2010, Recht, 2011, Gross, 2011, Chandrasekaran et al., 2011, Hsu et al., 2011, Rohde and Tsybakov, 2011, Koltchinskii et al., 2011, Negahban and Wainwright, 2011, Chen et al., 2011, Xiang et al., 2012, Negahban and Wainwright, 2012, Agarwal et al., 2012, Recht and Ré, 2013, Chen, 2013, Chen et al., 2013a,b, Jain et al., 2013, Jain and Netrapalli, 2014, Hardt, 2014, Hardt et al., 2014, Hardt and Wootters, 2014, Sun and Luo, 2014, Hastie et al., 2014, Cai and Zhang, 2015, Yan et al., 2015, Zhu et al., 2015, Wang et al., 2015]. Among these works, most

are based upon convex relaxation with a nuclear norm constraint or regularization. Nevertheless, solving these convex optimization problems can be computationally prohibitive in high dimensional regimes with large m and n [Hsieh and Olsen, 2014].

A computationally more efficient alternative is nonconvex optimization. In particular, we reparametrize the m × n matrix variable M in the optimization problem as UV^T with U ∈ R^{m×k} and V ∈ R^{n×k}, and optimize over U and V. Such a reparametrization automatically enforces the low rank structure and leads to low computational cost per iteration. For this reason, the nonconvex approach is widely used in large scale applications such as recommendation systems and collaborative filtering [Koren, 2009, Koren et al., 2009].

Despite the superior empirical performance of the nonconvex approach, the understanding of its theoretical guarantees is rather limited in comparison with the convex relaxation approach. Classical nonconvex optimization theory can only show sublinear convergence to local optima, yet many empirical results have corroborated its exceptional computational performance and convergence to global optima. Only recently has there been theoretical analysis of the block coordinate descent-type nonconvex optimization algorithm known as alternating minimization [Jain et al., 2013, Hardt, 2014, Hardt et al., 2014, Hardt and Wootters, 2014]. In particular, the existing results show that, provided a proper initialization, the alternating minimization algorithm attains a linear rate of convergence to a global optimum (U*, V*) with U* ∈ R^{m×k} and V* ∈ R^{n×k}, which satisfy M* = U*V*^T. Meanwhile, Keshavan et al. [2010a,b] establish the convergence of gradient-type methods, and Sun and Luo [2014] further establish the convergence of a broad class of nonconvex optimization algorithms including both gradient-type and block coordinate descent-type methods. However, Keshavan et al. [2010a,b] and Sun and Luo [2014] only establish asymptotic convergence over an infinite number of iterations, rather than an explicit rate of convergence. Besides these works, Lee and Bresler [2010], Jain et al. [2010], Jain and Netrapalli [2014] consider projected gradient-type methods, which optimize over the matrix variable M ∈ R^{m×n} rather than U ∈ R^{m×k} and V ∈ R^{n×k}. These methods involve calculating the top k singular vectors of an m × n matrix at each iteration. For k much smaller than m and n, they incur much higher computational cost per iteration than the aforementioned methods that optimize over U and V. All of these works, except Sun and Luo [2014], focus on specific algorithms, while Sun and Luo [2014] do not establish an explicit rate of convergence for the optimization.

In this paper, we propose a new theory for analyzing a broad class of nonconvex optimization algorithms for low rank matrix estimation. The core of our theory is the notion of inexact first order oracle. Based on the inexact first order oracle, we establish sufficient conditions under which the iterate sequences converge geometrically to the global optima. For both matrix sensing and completion, a direct consequence of our theory is that a broad family of nonconvex optimization algorithms, including gradient descent, block coordinate gradient descent, and block coordinate minimization, attains linear rates of convergence to the true low rank matrices U* and V*. In particular, our proposed theory covers alternating minimization as a special case and recovers the results of Jain et al. [2013], Hardt [2014], Hardt et al.
[2014], Hardt and Wootters [2014] under suitable conditions. Meanwhile, our approach covers gradient-type methods, which are also widely used in practice [Takács et al., 2007, Paterek, 2007, Koren et al., 2009, Gemulla et al., 2011, Recht and Ré, 2013, Zhuang et al., 2013]. To the best of our knowledge, our analysis is the first to establish exact recovery guarantees and geometric rates of convergence for a broad family of nonconvex matrix sensing and completion algorithms.

To achieve maximum generality, our unified analysis differs significantly from previous works. In detail, Jain et al. [2013], Hardt [2014], Hardt et al. [2014], Hardt and Wootters [2014] view alternating minimization as an approximate power method. However, their point of view relies on the closed form solution of each iteration of alternating minimization, which makes it difficult to generalize to other algorithms, e.g., gradient-type methods. Meanwhile, Sun and Luo [2014] take a geometric point of view. In detail, they show that the global optimum of the optimization problem is the unique stationary point within its neighborhood, and thus a broad class of algorithms succeed. However, such a geometric analysis of the objective function does not characterize the convergence rate of specific algorithms towards the stationary point. Unlike existing results, we analyze nonconvex optimization algorithms as approximate convex counterparts. For example, our analysis views alternating minimization on a nonconvex objective function as approximate block coordinate minimization on some convex objective function. We use a key quantity, the inexact first order oracle, to characterize such a perturbation effect, which results from the local nonconvexity at intermediate solutions. This eventually allows us to establish explicit rates of convergence in a manner analogous to existing convex optimization analysis.

Our proposed inexact first order oracle is closely related to a series of previous works on inexact or approximate gradient descent algorithms: Güler [1992], Luo and Tseng [1993], Nedić and Bertsekas [2001], d'Aspremont [2008], Baes [2009], Friedlander and Schmidt [2012], Devolder et al. [2014]. Different from these existing results, which focus on convex minimization, we show that the inexact first order oracle can also sharply capture the evolution of generic optimization algorithms even in the presence of nonconvexity. More recently, Candes et al. [2014], Balakrishnan et al. [2014], Arora et al. [2015] respectively analyze the Wirtinger Flow algorithm for phase retrieval, the expectation maximization (EM) algorithm for latent variable models, and the gradient descent algorithm for sparse coding based on ideas similar to ours. Though their analyses exploit similar nonconvex structures, they work on completely different problems, and the delivered technical results are also fundamentally different. A conference version of this paper was presented at the Annual Conference on Neural Information Processing Systems 2015 [Zhao et al., 2015]. While our conference version was under review, similar work was released on arxiv.org by Zheng and Lafferty [2015], Bhojanapalli et al. [2015], Tu et al. [2015], Chen and Wainwright [2015]. These works focus on symmetric positive semidefinite low rank matrix factorization problems. In contrast, our proposed methodology and theory do not require symmetry or positive semidefiniteness, and therefore can be applied to rectangular low rank matrix factorization problems.

The rest of this paper is organized as follows. In Section 2, we review the matrix sensing problem and introduce a general class of nonconvex optimization algorithms. In Section 3, we present the convergence analysis of the algorithms. In Section 4, we lay out the proof. In Section 5, we extend the proposed methodology and theory to matrix completion problems. In Section 6, we provide numerical experiments and draw the conclusion.

Notation: For v = (v_1, ..., v_d)^T ∈ R^d, we define the vector ℓ_q norm as ‖v‖_q^q = Σ_j |v_j|^q.
We define e_i as the indicator vector whose i-th entry is one and all other entries are zero. For a matrix A ∈ R^{m×n}, we use A_{*j} = (A_{1j}, ..., A_{mj})^T to denote the j-th column of A, and A_{i*} = (A_{i1}, ..., A_{in})^T to denote the i-th row of A. Let σ_max(A) and σ_min(A) be the largest and smallest nonzero singular values of A. We define the following matrix norms: the Frobenius norm ‖A‖_F^2 = Σ_j ‖A_{*j}‖_2^2, and the spectral norm ‖A‖_2 = σ_max(A). Moreover, we define the nuclear norm ‖A‖_* as the sum of all singular values of A, and A^† as the Moore-Penrose pseudoinverse of A. Given another matrix B ∈ R^{m×n}, we define the inner product ⟨A, B⟩ = Σ_{i,j} A_{ij} B_{ij}. For a bivariate function f(u, v), we define ∇_u f(u, v) to be the gradient with respect to u. Moreover, we use the common notations Ω(·), O(·), and o(·) to characterize the asymptotics of two real sequences.
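For readers who want to check these definitions numerically, here is a minimal NumPy sketch (our own illustration; the variable names are ours) of the norms and operations introduced above.

```python
import numpy as np

A = np.random.randn(5, 4)
B = np.random.randn(5, 4)

fro = np.linalg.norm(A, "fro")                        # ||A||_F
spec = np.linalg.norm(A, 2)                           # ||A||_2 = sigma_max(A)
nuc = np.linalg.norm(A, "nuc")                        # ||A||_* = sum of singular values
sigma_min = np.linalg.svd(A, compute_uv=False).min()  # smallest singular value
A_pinv = np.linalg.pinv(A)                            # Moore-Penrose pseudoinverse
inner = np.sum(A * B)                                 # <A, B> = sum_ij A_ij * B_ij
```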

2 Matrix Sensing

We start with the matrix sensing problem. Let M* ∈ R^{m×n} be the unknown low rank matrix of interest. We have d sensing matrices A_i ∈ R^{m×n} with i ∈ {1, ..., d}. Our goal is to estimate M* based on b_i = ⟨A_i, M*⟩ in the high dimensional regime with d much smaller than mn. Under such a regime, a common assumption is rank(M*) = k ≪ min{d, m, n}. Existing approaches generally recover M* by solving the following convex optimization problem

min_{M ∈ R^{m×n}} ‖M‖_*   subject to   b = A(M),   (1)

where b = [b_1, ..., b_d]^T ∈ R^d, and A(·) : R^{m×n} → R^d is a linear operator defined as

A(M) = [⟨A_1, M⟩, ..., ⟨A_d, M⟩]^T ∈ R^d.   (2)

Existing convex optimization algorithms for solving (1) are computationally inefficient, since they incur high per-iteration computational cost and only attain sublinear rates of convergence to the global optimum [Jain et al., 2013, Hsieh and Olsen, 2014]. Therefore, in large scale settings we usually consider the following nonconvex optimization problem instead:

min_{U ∈ R^{m×k}, V ∈ R^{n×k}} F(U, V),   where   F(U, V) = (1/2) ‖b − A(UV^T)‖_2^2.   (3)

The reparametrization M = UV^T, though making the problem in (3) nonconvex, significantly improves the computational efficiency. The existing literature [Koren, 2009, Koren et al., 2009, Takács et al., 2007, Paterek, 2007, Gemulla et al., 2011, Recht and Ré, 2013, Zhuang et al., 2013] has established convincing evidence that (3) can be effectively solved by a broad variety of gradient-based nonconvex optimization algorithms, including gradient descent, alternating exact minimization (i.e., alternating least squares or block coordinate minimization), and alternating gradient descent (i.e., block coordinate gradient descent), as illustrated in Algorithm 1. It is worth noting that the QR decomposition and the rank k singular value decomposition in Algorithm 1 can be accomplished efficiently: the QR decomposition can be accomplished in O(k^2 max{m, n}) operations, while the rank k singular value decomposition can be accomplished in O(kmn) operations. In fact, the QR decomposition is not necessary for particular update schemes; e.g., Jain et al. [2013] prove that the alternating exact minimization update schemes with or without the QR decomposition are equivalent.
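To make the objective in (3) concrete, the following NumPy sketch (our own illustration, not code from the paper) evaluates the sensing operator in (2), the objective F(U, V), and its partial gradients, which are the basic building blocks of the update schemes summarized in Algorithm 1 below.

```python
import numpy as np

def sensing_op(A_list, M):
    """A(M) = [<A_1, M>, ..., <A_d, M>]^T as in (2)."""
    return np.array([np.sum(A * M) for A in A_list])

def objective(A_list, b, U, V):
    """F(U, V) = 0.5 * ||b - A(U V^T)||_2^2 as in (3)."""
    r = b - sensing_op(A_list, U @ V.T)
    return 0.5 * np.sum(r ** 2)

def grad_U(A_list, b, U, V):
    """Partial gradient of F with respect to U."""
    r = b - sensing_op(A_list, U @ V.T)
    return -sum(ri * A @ V for ri, A in zip(r, A_list))

def grad_V(A_list, b, U, V):
    """Partial gradient of F with respect to V."""
    r = b - sensing_op(A_list, U @ V.T)
    return -sum(ri * A.T @ U for ri, A in zip(r, A_list))
```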

3 Convergence Analysis

We analyze the convergence of the algorithms illustrated in Section 2. Before presenting the main results, we first introduce a unified analytical framework based on a key quantity, the inexact first order oracle. Such a unified framework equips our theory with maximum generality. Without loss of generality, we assume m ≤ n throughout the rest of this paper.

Algorithm 1: A family of nonconvex optimization algorithms for matrix sensing. Here (Ū, D, V̄) ← KSVD(M) is the rank k singular value decomposition of M: D is a diagonal matrix containing the top k singular values of M in decreasing order, and Ū and V̄ contain the corresponding top k left and right singular vectors of M. (V̄, R_V) ← QR(V) is the QR decomposition, where V̄ is the corresponding orthonormal matrix and R_V is the corresponding upper triangular matrix.

Input: {b_i}_{i=1}^d, {A_i}_{i=1}^d
Parameter: step size η, total number of iterations T
(Ū^(0), D^(0), V̄^(0)) ← KSVD(Σ_{i=1}^d b_i A_i), V^(0) ← V̄^(0) D^(0), U^(0) ← Ū^(0) D^(0)
For t = 0, ..., T − 1
  Updating V:
    Alternating Exact Minimization: V^(t+0.5) ← argmin_V F(U^(t), V); (V^(t+1), R_V^(t+0.5)) ← QR(V^(t+0.5))
    Alternating Gradient Descent: V^(t+0.5) ← V^(t) − η ∇_V F(U^(t), V^(t)); (V^(t+1), R_V^(t+0.5)) ← QR(V^(t+0.5)), U^(t) ← U^(t) R_V^(t+0.5)
    Gradient Descent: V^(t+0.5) ← V^(t) − η ∇_V F(U^(t), V^(t)); (V^(t+1), R_V^(t+0.5)) ← QR(V^(t+0.5)), U^(t+1) ← U^(t) R_V^(t+0.5)
  Updating U:
    Alternating Exact Minimization: U^(t+0.5) ← argmin_U F(U, V^(t+1)); (U^(t+1), R_U^(t+0.5)) ← QR(U^(t+0.5))
    Alternating Gradient Descent: U^(t+0.5) ← U^(t) − η ∇_U F(U^(t), V^(t+1)); (U^(t+1), R_U^(t+0.5)) ← QR(U^(t+0.5)), V^(t+1) ← V^(t+1) R_U^(t+0.5)
    Gradient Descent: U^(t+0.5) ← U^(t) − η ∇_U F(U^(t), V^(t)); (U^(t+1), R_U^(t+0.5)) ← QR(U^(t+0.5)), V^(t+1) ← V^(t) R_U^(t+0.5)
End for
Output: M^(T) ← U^(T−0.5) (V^(T))^T (for gradient descent we use U^(T) (V^(T))^T)

3.1 Main Idea

We first provide an intuitive explanation for the success of nonconvex optimization algorithms, which forms the basis of our later analysis of the main results in Section 4. Recall that (3) can be written as a special instance of the following optimization problem,

min_{U ∈ R^{m×k}, V ∈ R^{n×k}} f(U, V).   (4)

A key observation is that, given a fixed U, f(U, ·) is strongly convex and smooth in V under suitable conditions, and the same also holds for U given a fixed V.
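For matrix sensing this key observation is easy to see: since ⟨A_i, UV^T⟩ = ⟨A_i^T U, V⟩ is linear in V for a fixed U, the subproblem argmin_V F(U, V) is an ordinary least squares problem in the entries of V. The sketch below (our own illustration, with illustrative problem sizes) solves it directly; this is exactly the alternating exact minimization step of Algorithm 1 before the QR renormalization.

```python
import numpy as np

def exact_v_update(A_list, b, U):
    """Solve argmin_V 0.5 * ||b - A(U V^T)||_2^2 in closed form.

    Uses <A_i, U V^T> = vec(A_i^T U)^T vec(V), so the subproblem is an
    ordinary least squares problem in the n*k entries of V.
    """
    n = A_list[0].shape[1]
    k = U.shape[1]
    # Design matrix: one row per measurement, n*k columns.
    X = np.stack([(A.T @ U).ravel() for A in A_list])
    v, *_ = np.linalg.lstsq(X, b, rcond=None)
    return v.reshape(n, k)

# Example usage with random data (illustrative sizes only).
m, n, k, d = 30, 40, 5, 600
rng = np.random.default_rng(0)
A_list = [rng.standard_normal((m, n)) for _ in range(d)]
M_star = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
b = np.array([np.sum(A * M_star) for A in A_list])
U = np.linalg.qr(rng.standard_normal((m, k)))[0]   # an orthonormal U
V_new = exact_v_update(A_list, b, U)
```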

For the convenience of discussion, we summarize this observation in the following technical condition, which will be verified later for matrix sensing and completion under suitable conditions.

Condition 1 (Strong Biconvexity and Bismoothness). There exist universal constants µ+ > 0 and µ− > 0 such that

µ− ‖U − U'‖_F^2 ≤ f(U, V) − f(U', V) − ⟨U − U', ∇_U f(U', V)⟩ ≤ µ+ ‖U − U'‖_F^2 for all U, U',
µ− ‖V − V'‖_F^2 ≤ f(U, V) − f(U, V') − ⟨V − V', ∇_V f(U, V')⟩ ≤ µ+ ‖V − V'‖_F^2 for all V, V'.

Ideal First Order Oracle. To ease presentation, we assume that U* and V* are the unique global minimizers of the generic optimization problem in (4). Assuming that U* is given, we can obtain V* by

V* = argmin_{V ∈ R^{n×k}} f(U*, V).   (5)

Condition 1 implies that the objective function in (5) is strongly convex and smooth. Hence, we can choose any gradient-based algorithm to obtain V*. For example, we can directly solve for V* from the first order optimality condition

∇_V f(U*, V) = 0,   (6)

or iteratively solve for V* using gradient descent, i.e.,

V^(t) = V^(t−1) − η ∇_V f(U*, V^(t−1)),   (7)

where η is a step size. Taking gradient descent as an example, we can invoke classical convex optimization results [Nesterov, 2004] to prove that

‖V^(t) − V*‖_F ≤ κ ‖V^(t−1) − V*‖_F for all t = 0, 1, 2, ...,

where κ ∈ (0, 1) only depends on µ+ and µ− in Condition 1. For notational simplicity, we call ∇_V f(U*, V^(t−1)) the ideal first order oracle, since U* is not known in practice.

Inexact First Order Oracle. Though the ideal first order oracle is not accessible in practice, it provides insights for analyzing nonconvex optimization algorithms. Taking gradient descent as an example, at the t-th iteration we take a gradient descent step over V based on ∇_V f(U, V^(t−1)). We can treat ∇_V f(U, V^(t−1)) as an approximation of ∇_V f(U*, V^(t−1)), where the approximation error comes from approximating U* by U. The relationship between ∇_V f(U, V^(t−1)) and ∇_V f(U*, V^(t−1)) is then similar to that between the gradient and the approximate gradient in the existing literature on convex optimization. For simplicity, we call ∇_V f(U, V^(t−1)) the inexact first order oracle. To characterize the difference between ∇_V f(U, V^(t−1)) and ∇_V f(U*, V^(t−1)), we define the approximation error of the inexact first order oracle as

E(V', V, U) = ‖∇_V f(U, V) − ∇_V f(U*, V)‖_F,   (8)

where V is the current decision variable at which the gradient is evaluated and V' denotes the new iterate obtained with the inexact oracle. In the above example, V = V^(t−1). Later we will illustrate that E(V', V, U) is critical to our analysis.
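To make (8) concrete for the matrix sensing objective in (3), the following sketch (our own illustration; U_star denotes a factor of the unknown M*, which is of course unavailable in practice and is used here only to illustrate the definition) evaluates the approximation error of the inexact first order oracle.

```python
import numpy as np

def grad_V(A_list, b, U, V):
    """Partial gradient of F(U, V) = 0.5 * ||b - A(U V^T)||_2^2 with respect to V."""
    r = b - np.array([np.sum(A * (U @ V.T)) for A in A_list])
    return -sum(ri * A.T @ U for ri, A in zip(r, A_list))

def oracle_error(A_list, b, U, U_star, V):
    """Approximation error E(., V, U) = ||grad_V f(U, V) - grad_V f(U*, V)||_F, cf. (8)."""
    return np.linalg.norm(grad_V(A_list, b, U, V) - grad_V(A_list, b, U_star, V), "fro")
```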

In the above example of alternating gradient descent, we will prove later that, for V^(t) = V^(t−1) − η ∇_V f(U, V^(t−1)), we have

‖V^(t) − V*‖_F ≤ κ ‖V^(t−1) − V*‖_F + (2/µ+) E(V^(t), V^(t−1), U).   (9)

In other words, E(V^(t), V^(t−1), U) captures the perturbation effect of employing the inexact first order oracle ∇_V f(U, V^(t−1)) instead of the ideal first order oracle ∇_V f(U*, V^(t−1)). For V^(t) = argmin_V f(U, V), we will prove that

‖V^(t) − V*‖_F ≤ (1/µ−) E(V^(t), V^(t), U).   (10)

According to the update schemes shown in Algorithms 1 and 2, for alternating exact minimization we set U = U^(t) in (10), while for gradient descent or alternating gradient descent we set U = U^(t−1) or U = U^(t) in (9), respectively. By symmetry, similar results also hold for ‖U^(t) − U*‖_F.

To establish the geometric rate of convergence towards the global minima U* and V*, it remains to establish upper bounds for the approximation error of the inexact first order oracle. Taking gradient descent as an example, we will prove that, given an appropriate initial solution, we have

(2/µ+) E(V^(t), V^(t−1), U^(t−1)) ≤ α ‖U^(t−1) − U*‖_F   (11)

for some α ∈ (0, 1 − κ). Combined with (9) (where we take U = U^(t−1)), (11) further implies

‖V^(t) − V*‖_F ≤ κ ‖V^(t−1) − V*‖_F + α ‖U^(t−1) − U*‖_F.   (12)

Correspondingly, similar results hold for ‖U^(t) − U*‖_F, i.e.,

‖U^(t) − U*‖_F ≤ κ ‖U^(t−1) − U*‖_F + α ‖V^(t−1) − V*‖_F.   (13)

Combining (12) and (13), we then establish the contraction

max{‖V^(t) − V*‖_F, ‖U^(t) − U*‖_F} ≤ (α + κ) max{‖V^(t−1) − V*‖_F, ‖U^(t−1) − U*‖_F},

which further implies geometric convergence, since α ∈ (0, 1 − κ). We can establish similar results for alternating exact minimization and alternating gradient descent, respectively. Based upon such a unified analysis, we now present the main results.

3.2 Main Results

Before presenting the main results, we first introduce an assumption known as the restricted isometry property (RIP). Recall that k is the rank of the target low rank matrix M*.

Assumption 1 (Restricted Isometry Property). The linear operator A(·) : R^{m×n} → R^d defined in (2) satisfies k-RIP with parameter δ_k ∈ (0, 1), i.e., for all Δ ∈ R^{m×n} such that rank(Δ) ≤ k, it holds that

(1 − δ_k) ‖Δ‖_F^2 ≤ ‖A(Δ)‖_2^2 ≤ (1 + δ_k) ‖Δ‖_F^2.
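Assumption 1 cannot be certified efficiently for a given operator, but for random designs it can be illustrated empirically. The sketch below (our own illustration) draws Gaussian sensing matrices, scales them by 1/sqrt(d) so that the isometry holds in expectation, and measures how far ‖A(Δ)‖_2^2 / ‖Δ‖_F^2 deviates from one on a few random rank-k test matrices. This is a Monte Carlo sanity check on random low rank directions, not a proof of RIP.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k, d = 30, 40, 3, 2000

# Gaussian sensing matrices, scaled so that E||A(Delta)||_2^2 = ||Delta||_F^2.
A_list = [rng.standard_normal((m, n)) / np.sqrt(d) for _ in range(d)]

def ratio(Delta):
    y = np.array([np.sum(A * Delta) for A in A_list])
    return np.sum(y ** 2) / np.sum(Delta ** 2)

# Probe the isometry constant on a few random rank-k directions.
ratios = []
for _ in range(20):
    Delta = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
    ratios.append(ratio(Delta))
print("empirical deviation from isometry on probed directions:",
      max(abs(r - 1.0) for r in ratios))
```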

Several random matrix ensembles satisfy k-RIP for a sufficiently large d with high probability. For example, suppose that each entry of A_i is independently drawn from a sub-Gaussian distribution; then A(·) satisfies k-RIP with parameter δ_k with high probability for d = Ω(δ_k^{-2} k n log n). The following theorem establishes the geometric rate of convergence of the nonconvex optimization algorithms summarized in Algorithm 1.

Theorem 1. Assume there exists a sufficiently small constant C_1 such that A(·) satisfies k-RIP with δ_k ≤ C_1/k, and the largest and smallest nonzero singular values of M* are constants, which do not scale with (d, m, n, k). For any pre-specified precision ɛ, there exist an η and universal constants C_2 and C_3 such that for all T ≥ C_2 log(C_3/ɛ), we have ‖M^(T) − M*‖_F ≤ ɛ.

The proof of Theorem 1 is provided in Sections 4.2, 4.3, and 4.4. Theorem 1 implies that all three nonconvex optimization algorithms converge geometrically to the global optimum. Moreover, assuming that each entry of A_i is independently drawn from a sub-Gaussian distribution with mean zero and variance proxy one, our result further suggests that, to achieve exact low rank matrix recovery, our algorithms require the number of measurements d to satisfy

d = Ω(k^3 n log n),   (14)

since we assume that δ_k ≤ C_1/k. This sample complexity matches the state-of-the-art result for nonconvex optimization methods, established by Jain et al. [2013]. In comparison with their result, which only covers the alternating exact minimization algorithm, our result holds for a broader variety of nonconvex optimization algorithms. Note that the sample complexity in (14) depends on a polynomial of σ_max(M*)/σ_min(M*), which is treated as a constant in this paper. If we allow σ_max(M*)/σ_min(M*) to increase, we can plug the nonconvex optimization algorithms into the multi-stage framework proposed by Jain et al. [2013]. Following similar lines to the proof of Theorem 1, we can then derive a new sample complexity which is independent of σ_max(M*)/σ_min(M*). See Jain et al. [2013] for more details.

4 Proof of Main Results

We sketch the proof of Theorem 1. The proofs of all related lemmas are provided in the appendix. For notational simplicity, let σ_1 = σ_max(M*) and σ_k = σ_min(M*). Recall that the nonconvex optimization algorithms are symmetric with respect to the updates of U and V. Hence, the following lemmas for the update of V also hold for the update of U; we omit some statements for conciseness. Theorem 2 can be proved in a similar manner, and its proof is provided in Appendix F. Before presenting the proof, we first introduce the following lemma, which verifies Condition 1.

Lemma 1. Suppose that A(·) satisfies k-RIP with parameter δ_k. Given an arbitrary orthonormal matrix Ū ∈ R^{m×k}, for any V, V' ∈ R^{n×k} we have

(1 + δ_k) ‖V − V'‖_F^2 ≥ F(Ū, V) − F(Ū, V') − ⟨∇_V F(Ū, V'), V − V'⟩ ≥ (1 − δ_k) ‖V − V'‖_F^2.

The proof of Lemma 1 is provided in Appendix A.1. Lemma 1 implies that F(Ū, ·) is strongly convex and smooth in V for a fixed orthonormal matrix Ū, as specified in Condition 1. Equipped with Lemma 1, we now lay out the proof for each update scheme in Algorithm 1.
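Section 4.1 below explains that the factorization M* = U*V*^T is only identifiable up to rotation, and errors are therefore measured through the rotation-invariant subspace distance min_{O^T O = I_k} ‖U − Ū O‖_F of (15). As a preview, here is a minimal sketch (our own illustration) of how that distance can be evaluated via the orthogonal Procrustes problem.

```python
import numpy as np

def subspace_dist(U, U_bar):
    """min over orthogonal O of ||U - U_bar O||_F (orthogonal Procrustes).

    The minimizer is O = P Q^T, where U_bar^T U = P S Q^T is an SVD.
    """
    P, _, Qt = np.linalg.svd(U_bar.T @ U)
    O = P @ Qt
    return np.linalg.norm(U - U_bar @ O, "fro")

# The distance is invariant to rotating U_bar: same value for U_bar @ R, R orthogonal.
rng = np.random.default_rng(0)
U = np.linalg.qr(rng.standard_normal((8, 3)))[0]
U_bar = np.linalg.qr(rng.standard_normal((8, 3)))[0]
R = np.linalg.qr(rng.standard_normal((3, 3)))[0]
assert np.isclose(subspace_dist(U, U_bar), subspace_dist(U, U_bar @ R))
```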

4.1 Rotation Issue

Given a factorization M* = U*V*^T, we can equivalently represent it as M* = U*_new (V*_new)^T, where U*_new = U* O_new and V*_new = V* O_new for an arbitrary unitary matrix O_new ∈ R^{k×k}. This implies that directly calculating ‖U − U*‖_F is not desirable, and the algorithm may converge to an arbitrary factorization of M*. To address this issue, existing analyses usually choose subspace distances to evaluate the difference between the subspaces spanned by the columns of U and U*, because these subspaces are invariant to rotations [Jain et al., 2013]. For example, letting Ū*_⊥ ∈ R^{m×(m−k)} denote the orthonormal complement of Ū*, we can choose the subspace distance ‖(Ū*_⊥)^T U‖_F. For any O_new ∈ R^{k×k} such that O_new^T O_new = I_k, we have ‖(Ū*_⊥)^T U O_new‖_F = ‖(Ū*_⊥)^T U‖_F, so this distance is invariant to rotation. In this paper, we consider a different subspace distance, defined as

min_{O : O^T O = I_k} ‖U − Ū* O‖_F.   (15)

We can verify that (15) is also invariant to rotation. The next lemma shows that (15) is equivalent to ‖(Ū*_⊥)^T U‖_F.

Lemma 2. Given two orthonormal matrices U ∈ R^{m×k} and Ū ∈ R^{m×k}, we have

‖Ū_⊥^T U‖_F ≤ min_{O : O^T O = I} ‖U − Ū O‖_F ≤ √2 ‖Ū_⊥^T U‖_F.

The proof of Lemma 2 is provided in Stewart et al. [1990] and is therefore omitted. Equipped with Lemma 2, our convergence analysis guarantees that there always exists a factorization of M* satisfying the desired computational properties at each iteration (see Lemma 5 and Corollaries 1 and 2). Similarly, the above argument can also be generalized to the gradient descent and alternating gradient descent algorithms.

4.2 Proof of Theorem 1 (Alternating Exact Minimization)

Proof. Throughout the proof for alternating exact minimization, we define a constant ξ ∈ (1, 2) to simplify the notation. Moreover, we assume that at the t-th iteration there exists a matrix factorization of M*,

M* = Ū^(t) (V̄^(t))^T,

where Ū^(t) ∈ R^{m×k} is an orthonormal matrix. We define the approximation error of the inexact first order oracle as

E(V^(t+0.5), V^(t+0.5), U^(t)) = ‖∇_V F(U^(t), V^(t+0.5)) − ∇_V F(Ū^(t), V^(t+0.5))‖_F.

The following lemma establishes an upper bound for the approximation error of the inexact first order oracle under suitable conditions.

10 Lemma 3. Suppose that δ k and (t) satisfy Then we have δ k (1 δ k) σ k 1ξk(1 + δ k )σ 1 and (t) (t) F (1 δ k)σ k 4ξ(1 + δ k )σ 1. (16) E(V (t+0.5), V (t+0.5), (t) ) (1 δ k)σ k (t) (t) F. ξ The proof of Lemma 3 is provided in Appendix A.. Lemma 3 shows that the approximation error of the inexact first order oracle for updating V diminishes with the estimation error of (t), when (t) is sufficiently close to (t). The following lemma quantifies the progress of an exact minimization step using the inexact first order oracle. Lemma 4. We have V (t+0.5) V (t) F 1 1 δ k E(V (t+0.5), V (t+0.5), (t) ). The proof of Lemma 4 is provided in Appendix A.3. Lemma 4 illustrates that the estimation error of V (t+0.5) diminishes with the approximation error of the inexact first order oracle. The following lemma characterizes the effect of the renormalization step using QR decomposition, i.e., the relationship between V (t+0.5) and V (t+1) in terms of the estimation error. Lemma 5. Suppose that V (t+0.5) satisfies V (t+0.5) V (t) F σ k 4. (17) Then there exists a factorization of M = (t+1) V (t+1) such that V (t+0.5) R n k is an orthonormal matrix, and satisfies V (t+1) V (t+1) F σ k V (t+0.5) V (t) F. The proof of Lemma 5 is provided in Appendix A.4. The next lemma quantifies the accuracy of the initialization (0). Lemma 6. Suppose that δ k satisfies δ k (1 δ k ) σk 4 19ξ k(1 + δ k ) σ1 4. (18) Then there exists a factorization of M = (0) V (0) such that (0) R m k is an orthonormal matrix, and satisfies (0) F (1 δ k)σ k 4ξ(1 + δ k )σ 1. 10

11 The proof of Lemma 6 is provided in Appendix A.5. Lemma 6 implies that the initial solution (0) attains a sufficiently small estimation error. Combining Lemmas 3, 4, and 5, we obtain the following corollary for a complete iteration of updating V. Corollary 1. Suppose that δ k and (t) satisfy We then have δ k (1 δ k ) σ 4 k 19ξ k(1 + δ k ) σ 4 1 and (t) (t) F (1 δ k)σ k 4ξ(1 + δ k )σ 1. (19) Moreover, we also have V (t+1) V (t+1) F (1 δ k)σ k 4ξ(1 + δ k )σ 1. V (t+1) V (t+1) F 1 ξ (t) (t) F and V (t+0.5) V (t) F σ k ξ (t) (t) F. The proof of Corollary 1 is provided in Appendix A.6. Since the alternating exact minimization algorithm updates and V in a symmetric manner, we can establish similar results for a complete iteration of updating in the next corollary. Corollary. Suppose that δ k and V (t+1) satisfy δ k (1 δ k ) σ 4 k 19ξ k(1 + δ k ) σ 4 1 and V (t+1) V (t+1) F (1 δ k)σ k 4ξ(1 + δ k )σ 1. (0) Then there exists a factorization of M = (t+1) V (t+1) such (t+1) is an orthonormal matrix, and satisfies Moreover, we also have (t+1) (t+1) F (1 δ k)σ k 4ξ(1 + δ k )σ 1. (t+1) (t+1) F 1 ξ V (t+1) V (t+1) F and (t+0.5) (t+1) F σ k ξ V (t+1) V (t+1) F. The proof of Corollary directly follows Appendix A.6, and is therefore omitted.. We then proceed with the proof of Theorem 1 for alternating exact minimization. Lemma 6 ensures that (19) of Corollary 1 holds for (0). Then Corollary 1 ensures that (0) of Corollary holds for V (1). By induction, Corollaries 1 and can be applied recursively for all T iterations. Thus we obtain V (T ) V (T ) F 1 ξ (T 1) (T 1) F 1 ξ V (T 1) V (T 1) F 1 ξ T 1 (0) (0) F (1 δ k)σ k 4ξ T (1 + δ k )σ 1, (1) 11

12 where the last inequality comes from Lemma 6. Therefore, for a pre-specified accuracy ɛ, we need at most ( ) 1 (1 T = log δk )σ k log 1 ξ () ɛ(1 + δ k )σ 1 iterations such that Moreover, Corollary implies V (T ) V (T ) F (T 0.5) (T ) F σ k ξ V (T ) V (T ) F (1 δ k)σ k 4ξ T (1 + δ k )σ 1 ɛ. (3) (1 δ k )σ k 8ξ T +1 (1 + δ k )σ 1, where the last inequality comes from (1). Therefore, we need at most ( 1 (1 T = log δk )σk ) log 1 ξ 4ξɛ(1 + δ k ) (4) iterations such that (T 0.5) F Then combining (3) and (5), we obtain (1 δ k )σ k 8ξ T +1 (1 + δ k )σ 1 ɛ σ 1. (5) M (T ) M = (T 0.5) V (T ) (T ) V (T ) F = (T 0.5) V (T ) (T ) V (T ) + (T ) V (T ) (T ) V (T ) F V (T ) (T 0.5) (T ) F + (T ) V (T ) V (T ) F ɛ, (6) where the last inequality comes from V (T ) = 1 (since V (T ) is orthonormal) and = M = σ 1 (since (T ) V (T ) = M and V (T ) is orthonormal). Thus combining () and (4) with (6), we complete the proof. 4.3 Proof of Theorem 1 (Alternating Gradient Descent) Proof. Throughout the proof for alternating gradient descent, we define a sufficiently large constant ξ. Moreover, we assume that at the t-th iteration, there exists a matrix factorization of M M = (t) V (t), where (t) R m k is an orthonormal matrix. We define the approximation error of the inexact first order oracle as E(V (t+0.5), V (t), (t) ) = V F( (t), V (t) ) V F( (t), V (t) ) F. The first lemma is parallel to Lemma 3 for alternating exact minimization. 1

13 Lemma 7. Suppose that δ k, (t), and V (t) satisfy Then we have δ k (1 δ k)σ k 4ξkσ 1, (t) (t) F σ k 4ξσ 1 E(V (t+0.5), V (t), (t) ) (1 + δ k)σ k (t) (t) F. ξ, and V (t) V (t) F σ 1 k. (7) The proof of Lemma 7 is provided in Appendix B.1. Lemma 7 illustrates that the approximation error of the inexact first order oracle diminishes with the estimation error of (t), when (t) and V (t) are sufficiently close to (t) and V (t). Lemma 8. Suppose that the step size parameter η satisfies Then we have η = V (t+0.5) V F δ k V (t) V F δ k. (8) 1 + δ k E(V (t+0.5), V (t), (t) ). The proof of Lemma 8 is in Appendix B.. Lemma 8 characterizes the progress of a gradient descent step with a pre-specified fixed step size. A more practical option is adaptively selecting η using the backtracking line search procedure, and similar results can be guaranteed. See Nesterov [004] for details. The following lemma characterizes the effect of the renormalization step using QR decomposition. Lemma 9. Suppose that V (t+0.5) satisfies V (t+0.5) V (t) F σ k 4. (9) Then there exists a factorization of M = (t+1) V (t+1) such that V (t+1) R n k is an orthonormal matrix, and V (t+1) V (t+1) F σ k V (t+0.5) V (t) F, (t) (t+1) F 3σ 1 σ k V (t+0.5) V (t) F + σ 1 (t) (t) F, The proof of Lemma 9 is provided in Appendix B.3. The next lemma quantifies the accuracy of the initial solutions. Lemma 10. Suppose that δ k satisfies Then we have (0) (0) F σ k 4ξσ 1 σ 6 k δ k 19ξ kσ1 6. (30) and V (0) V (0) F σ k σ 1 k. ξσ 1 13

14 The proof of Lemma 10 is in Appendix B.4. Lemma 10 indicates that the initial solutions (0) and V (0) attain sufficiently small estimation errors. Combining Lemmas 7, 8, 5,, we obtain the following corollary for a complete iteration of updating V. Corollary 3. Suppose that δ k, (t), and V (t) satisfy δ k σ 6 k 19ξ kσ 6 1, (t) (t) F σ k 4ξσ1, and V (t) V (t) F σ k. (31) ξσ 1 We then have Moreover, we have V (t+1) V (t+1) F σ k 4ξσ 1 and (t) (t+1) F σ k ξσ 1. V (t+0.5) V (t) F δ k V (t) V (t) F + σ k ξ (t) (t) F, (3) V (t+1) V (t+1) F δ k V (t) V (t) F + 4 σ k ξ (t) (t) F, (33) (t) (t+1) F 3σ ( ) 1 δk 6 V (t) V (t) F + σ k ξ + 1 σ 1 (t) (t) F. (34) The proof of Corollary 3 is provided in Appendix B.5. Since the alternating gradient descent algorithm updates and V in a symmetric manner, we can establish similar results for a complete iteration of updating in the next corollary. Corollary 4. Suppose that δ k, V (t+1), and (t) satisfy δ k σ 6 k 19ξ kσ 6 1, V (t+1) V (t+1) F σ k 4ξσ1, and (t) (t+1) F σ k. (35) ξσ 1 We then have Moreover, we have (t+1) (t+1) F σ k 4ξσ 1 and V (t+1) V (t+1) F σ k ξσ 1. (t+0.5) (t+1) F δ k (t) (t+1) F + σ k ξ V (t+1) V (t+1) F, (36) (t+1) (t+1) F δ k (t) (t+1) F + 4 σ k ξ V (t+1) V (t+1) F, (37) V (t+1) V (t+1) F 3σ ( ) 1 δk 6 (t) (t+1) F + σ k ξ + 1 σ 1 V (t+1) V (t+1) F. (38) 14

15 The proof of Corollary 4 directly follows Appendix B.5, and is therefore omitted.. Now we proceed with the proof of Theorem 1 for alternating gradient descent. Recall that Lemma 10 ensures that (31) of Corollary 3 holds for (0) and V (0). Then Corollary 3 ensures that (35) of Corollary 4 holds for (0) and V (1). By induction, Corollaries 1 and can be applied recursively for all T iterations. For notational simplicity, we write (3)-(38) as V (t+0.5) V (t) F α 1 V (t) V (t) F + γ 1 σ 1 (t) (t) F, (39) σ 1 V (t+1) V (t+1) F α V (t) V (t) F + γ σ 1 (t) (t) F, (40) (t+0.5) (t+1) F α 3 (t) (t+1) F + γ 3 σ 1 V (t+1) V (t+1) F, (41) σ 1 (t+1) (t+1) F α 4 (t) (t+1) F + γ 4 σ 1 V (t+1) V (t+1) F, (4) (t) (t+1) F α 5 V (t) V (t) F + γ 5 σ 1 (t) (t) F, (43) V (t+1) V (t+1) F α 6 (t) (t+1) F + γ 6 σ 1 V (t+1) V (t+1) F. (44) Note that we have γ 5, γ 6 (1, ), but α 1,...,α 6, γ 1,..., and γ 4 can be sufficiently small as long as ξ is sufficiently large. We then have (t+1) (t+) F (i) α 5 V (t+1) V (t+1) F + γ 5 σ 1 (t+1) (t+1) F (ii) α 5 α 6 (t) (t+1) F + α 5 γ 6 σ 1 V (t+1) V (t+1) F + γ 5 σ 1 (t+1) (t+1) F (iii) (α 5 α 6 + γ 5 α 4 ) (t) (t+1) F + (γ 5 γ 4 σ 1 + α 5 γ 6 )σ 1 V (t+1) V (t+1) F (iv) (α 5 α 6 + γ 5 α 4 ) (t) (t+1) F + (γ 5 γ 4 σ 1 + α 5 γ 6 )α V (t) V (t) F + (γ 5 γ 4 σ 1 + α 5 γ 6 )γ σ 1 (t) (t) F, (45) where (i) comes from (43), (ii) comes from (44), (iii) comes from (4), and (iv) comes from (40). Similarly, we can obtain V (t+1) V (t+1) F α 6 (t) (t+1) F + γ 6 α V (t) V (t) F + γ 6 γ σ 1 (t) (t) F, (46) σ 1 (t+1) (t+1) F α 4 (t) (t+1) F + γ 4 α V (t) V (t) F + γ 4 γ σ 1 (t) (t) F (47) (t+0.5) (t+1) F α 3 (t) (t+1) F + γ 3 α V (t) V (t) F + γ 3 γ σ 1 (t) (t) F. (48) For simplicity, we define φ V (t+1) = V (t+1) V (t+1) F, φ V (t+0.5) = V (t+0.5) V (t) F, φ (t+1) = σ V 1 V (t+1) V (t+1) F, φ (t+1) = (t+1) (t+) F, φ (t+0.5) = (t+0.5) (t+1) F, φ (t+1) = σ 1 (t+1) (t+1) F. 15

16 Then combining (39), (40) with (45) (48), we obtain max { φ V (t+1), φ V (t+0.5), φ V (t+1), φ (t+1), φ (t+0.5), φ (t+1)} β max { φv (t), φ (t), φ (t)}, (49) where β is a contraction coefficient defined as β = max{α 5 α 6 + γ 5 α 4, α 6, α 4, α 3 } + max{α 1, α, (γ 5 γ 4 σ 1 + α 5 γ 6 ), γ 6 α, γ 4 α, γ 3 α } + max{γ 1, γ, (γ 5 γ 4 σ 1 + α 5 γ 6 )γ, γ 6 γ, γ 4 γ, γ 3 γ }. Then we can choose ξ as a sufficiently large constant such that β < 1. By recursively applying (49) for t = 0,..., T, we obtain max { { } φ V (T ), φ V (T 0.5), φ (T V ), φ (T ), φ (T 0.5), φ (T )} β max φv (T 1), φ (T 1), φ (T 1) β max { φ V (T ), φ (T ), φ (T )}... β T max { φ V (0), φ (0), φ (0)}. By Corollary 3, we obtain (0) (1) F 3σ ( ) 1 δk 6 V (0) V (0) F + σ k ξ + 1 σ 1 (0) (0) F (i) 3σ 1 σk 3 σ ( ) k 6 σ σ k 1ξσ1 3 + ξσ 1 ξ + 1 k 4ξσ 1 (ii) = σ4 k 8ξ σ σ k ξ σ 1 + σ k 4ξσ 1 (iii) σ k ξσ 1, (50) where (i) and (ii) come from Lemma 10, and (iii) comes from the definition of ξ and σ 1 σ k. Combining (50) with Lemma 10, we have Then we need at most iterations such that T = V (T ) V F β T max { σ k ξσ 1, { } { σ φv (0), φ (0), φ (0) max k, ξσ 1 ( { σ log max k, ξσ 1 σ k 4ξσ 1 } σ k ξσ 1 σ k 4ξσ 1 }., σ k ξ, σk } 1 ) log 1 (β 1 ) ξσ 1 ɛ ɛ and (T ) F β T max { σ k ξσ 1, σk } 4ξσ1 ɛ σ 1. We then follow similar lines to (6) in 4., and show M (T ) M F ɛ, which completes the proof. 4.4 Proof of Theorem 1 (Gradient Descent) Proof. The convergence analysis of the gradient descent algorithm is similar to that of the alternating gradient descent. The only difference is that for updating, the gradient descent algorithm employs V = V (t) instead of V = V (t+1) to calculate the gradient at = (t). Then everything else directly follows 4.3, and is therefore omitted.. 16

5 Extensions to Matrix Completion

We now extend our methodology and theory to matrix completion problems. Let M* ∈ R^{m×n} be the unknown low rank matrix of interest. We observe a subset of the entries of M*, namely W ⊆ {1, ..., m} × {1, ..., n}. We assume that W is drawn uniformly at random, i.e., M*_{ij} is observed independently with probability ρ ∈ (0, 1]. To exactly recover M*, a common assumption is the incoherence of M*, which will be specified later. A popular approach for recovering M* is to solve the following convex optimization problem

min_{M ∈ R^{m×n}} ‖M‖_*   subject to   P_W(M*) = P_W(M),   (51)

where P_W(·) : R^{m×n} → R^{m×n} is an operator defined as

[P_W(M)]_{ij} = M_{ij} if (i, j) ∈ W, and 0 otherwise.

Similar to matrix sensing, existing algorithms for solving (51) are computationally inefficient. Hence, in practice we usually consider the following nonconvex optimization problem

min_{U ∈ R^{m×k}, V ∈ R^{n×k}} F_W(U, V),   where   F_W(U, V) = (1/2) ‖P_W(M*) − P_W(UV^T)‖_F^2.   (52)

Similar to matrix sensing, (52) can also be efficiently solved by the gradient-based algorithms illustrated in Algorithm 2. For the convenience of the later convergence analysis, we partition the observation set W into 2T + 1 subsets W_0, ..., W_{2T} by Algorithm 4. In practice, however, we do not need the partition scheme, i.e., we simply set W_0 = ... = W_{2T} = W. Before we present the convergence analysis, we first introduce an assumption known as the incoherence property.

Assumption 2 (Incoherence Property). The target rank k matrix M* is incoherent with parameter µ, i.e., given the rank k singular value decomposition M* = Ū* Σ* (V̄*)^T, we have

max_i ‖Ū*_{i*}‖_2 ≤ µ √(k/m)   and   max_j ‖V̄*_{j*}‖_2 ≤ µ √(k/n).

Roughly speaking, the incoherence assumption guarantees that each entry of M* contains a similar amount of information, which makes it feasible to complete M* when its entries are missing uniformly at random. The following theorem establishes the iteration complexity and the estimation error under the Frobenius norm.

Theorem 2. Suppose that there exists a universal constant C_4 such that ρ satisfies

ρ ≥ C_4 µ^2 k^3 log n log(1/ɛ) / m,   (53)

where ɛ is the pre-specified precision. Then there exist an η and universal constants C_5 and C_6 such that for any T ≥ C_5 log(C_6/ɛ), we have ‖M^(T) − M*‖_F ≤ ɛ with high probability.
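To make the completion objective concrete, the sketch below (our own illustration) represents W as a boolean mask and evaluates P_W, the objective F_W in (52), and the partial gradient with respect to V used by the update schemes in Algorithm 2 below.

```python
import numpy as np

def P_W(mask, M):
    """[P_W(M)]_ij = M_ij if (i, j) is observed, 0 otherwise."""
    return np.where(mask, M, 0.0)

def F_W(mask, M_obs, U, V):
    """F_W(U, V) = 0.5 * ||P_W(M*) - P_W(U V^T)||_F^2, cf. (52).

    M_obs is P_W(M*), i.e. the observed entries with zeros elsewhere.
    """
    R = P_W(mask, M_obs - U @ V.T)
    return 0.5 * np.sum(R ** 2)

def grad_V_completion(mask, M_obs, U, V):
    """Partial gradient of F_W with respect to V."""
    R = P_W(mask, U @ V.T - M_obs)   # residual on the observed entries
    return R.T @ U
```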

Algorithm 2: A family of nonconvex optimization algorithms for matrix completion. The incoherence factorization algorithm IF(·) is illustrated in Algorithm 3, and the partition algorithm Partition(·), which is proposed by Hardt and Wootters [2014], is provided in Algorithm 4 of Appendix C for the sake of completeness. The initialization procedures INIT_U(·) and INIT_V(·) are provided in Algorithm 5 and Algorithm 6 of Appendix D for the sake of completeness. Here F_W(·, ·) is defined in (52).

Input: P_W(M*)
Parameter: step size η, total number of iterations T
({W_t}_{t=0}^{2T}, ρ) ← Partition(W), P_{W_0}(M̃) ← P_{W_0}(M*), and M̃_{ij} ← 0 for all (i, j) ∉ W_0
(U^(0), V^(0)) ← INIT_U(M̃), (V^(0), U^(0)) ← INIT_V(M̃)
For t = 0, ..., T − 1
  Updating V:
    Alternating Exact Minimization: V^(t+0.5) ← argmin_V F_{W_{2t+1}}(U^(t), V); (V^(t+1), R_V^(t+0.5)) ← IF(V^(t+0.5))
    Alternating Gradient Descent: V^(t+0.5) ← V^(t) − η ∇_V F_{W_{2t+1}}(U^(t), V^(t)); (V^(t+1), R_V^(t+0.5)) ← IF(V^(t+0.5)), U^(t) ← U^(t) R_V^(t+0.5)
    Gradient Descent: V^(t+0.5) ← V^(t) − η ∇_V F_{W_{2t+1}}(U^(t), V^(t)); (V^(t+1), R_V^(t+0.5)) ← IF(V^(t+0.5)), U^(t+1) ← U^(t) R_V^(t+0.5)
  Updating U:
    Alternating Exact Minimization: U^(t+0.5) ← argmin_U F_{W_{2t+2}}(U, V^(t+1)); (U^(t+1), R_U^(t+0.5)) ← IF(U^(t+0.5))
    Alternating Gradient Descent: U^(t+0.5) ← U^(t) − η ∇_U F_{W_{2t+2}}(U^(t), V^(t+1)); (U^(t+1), R_U^(t+0.5)) ← IF(U^(t+0.5)), V^(t+1) ← V^(t+1) R_U^(t+0.5)
    Gradient Descent: U^(t+0.5) ← U^(t) − η ∇_U F_{W_{2t+2}}(U^(t), V^(t)); (U^(t+1), R_U^(t+0.5)) ← IF(U^(t+0.5)), V^(t+1) ← V^(t) R_U^(t+0.5)
End for
Output: M^(T) ← U^(T−0.5) (V^(T))^T (for gradient descent we use U^(T) (V^(T))^T)

Algorithm 3: The incoherence factorization algorithm for matrix completion. It guarantees that the solutions satisfy the incoherence condition throughout all iterations.

Input: W_in; r ← number of rows of W_in
Parameter: incoherence parameter µ
(W̄_in, R_W^in) ← QR(W_in)
W̃_out ← argmin_W ‖W − W̄_in‖_F subject to max_j ‖W_{j*}‖_2 ≤ µ √(k/r)
(W_out, R_W^tmp) ← QR(W̃_out)
R_W^out ← W_out^T W_in
Output: W_out, R_W^out
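A possible NumPy rendering of the incoherence factorization in Algorithm 3 is sketched below. It is our own reading of the pseudocode above; in particular, the row-wise clipping step is our interpretation of the constrained projection (the projection decouples over rows, so each row is simply rescaled onto the norm ball), so treat this as illustrative rather than as the authors' exact procedure.

```python
import numpy as np

def incoherence_factorization(W_in, mu, k):
    """Sketch of the incoherence factorization IF(.) from Algorithm 3.

    QR-orthonormalize W_in, project onto the set of matrices whose rows have
    l2 norm at most mu * sqrt(k / r), re-orthonormalize, and return the new
    factor together with R_out = W_out^T W_in.
    """
    r = W_in.shape[0]
    W_bar, _ = np.linalg.qr(W_in)
    cap = mu * np.sqrt(k / r)
    row_norms = np.linalg.norm(W_bar, axis=1, keepdims=True)
    scale = np.minimum(1.0, cap / np.maximum(row_norms, 1e-12))
    W_clipped = W_bar * scale   # argmin_W ||W - W_bar||_F s.t. row norms <= cap
    W_out, _ = np.linalg.qr(W_clipped)
    R_out = W_out.T @ W_in
    return W_out, R_out
```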

The proof of Theorem 2 is provided in Sections F.1, F.2, and F.3. Theorem 2 implies that all three nonconvex optimization algorithms converge to the global optimum at a geometric rate. Furthermore, our results indicate that completing the true low rank matrix M* up to ɛ-accuracy requires the entry observation probability ρ to satisfy

ρ = Ω(µ^2 k^3 log n log(1/ɛ) / m).   (54)

This result matches the result established by Hardt [2014], which is the state-of-the-art result for alternating minimization; moreover, our analysis covers three nonconvex optimization algorithms. In fact, the sample complexity in (54) depends on a polynomial of σ_max(M*)/σ_min(M*), which is a constant since in this paper we assume that σ_max(M*) and σ_min(M*) are constants. If we allow σ_max(M*)/σ_min(M*) to increase, we can replace the QR decomposition in Algorithm 3 with the smooth QR decomposition proposed by Hardt and Wootters [2014] and achieve a log(σ_max(M*)/σ_min(M*)) dependency on the condition number with a more involved proof. See Hardt and Wootters [2014] for more details. In this paper, however, our primary focus is the dependency on k, n, and m, rather than optimizing the dependency on the condition number.

6 Numerical Experiments

We present numerical experiments to support our theoretical analysis. We first consider a matrix sensing problem with m = 30, n = 40, and k = 5. We vary d from 300 to 900. Each entry of the A_i's is independently sampled from N(0, 1). We then generate M* = ŨṼ^T, where Ũ ∈ R^{m×k} and Ṽ ∈ R^{n×k} are two matrices with all entries independently sampled from N(0, 1/k). We then generate d measurements by b_i = ⟨A_i, M*⟩ for i = 1, ..., d. Figure 1 illustrates the empirical performance of the alternating exact minimization and alternating gradient descent algorithms for a single realization. The step size for the alternating gradient descent algorithm is determined by the backtracking line search procedure. We see that both algorithms attain a linear rate of convergence for d = 600 and d = 900. Both algorithms fail for d = 300, because d = 300 is below the minimum sample complexity required for exact matrix recovery.

We then consider a matrix completion problem with m = 1000, n = 50, and k = 5. We vary ρ from 0.025 to 0.1. We then generate M* = ŨṼ^T, where Ũ ∈ R^{m×k} and Ṽ ∈ R^{n×k} are two matrices with all entries independently sampled from N(0, 1/k). The observation set is generated uniformly at random with probability ρ. Figure 2 illustrates the empirical performance of the alternating exact minimization and alternating gradient descent algorithms for a single realization. The step size for the alternating gradient descent algorithm is determined by the backtracking line search procedure. We see that both algorithms attain a linear rate of convergence for ρ = 0.05 and ρ = 0.1. Both algorithms fail for ρ = 0.025, because the entry observation probability is below the minimum sample complexity required for exact matrix recovery.

7 Conclusion

In this paper, we propose a generic analysis for characterizing the convergence properties of nonconvex optimization algorithms. By exploiting the inexact first order oracle, we prove that a broad class

20 10 10 Estimation Error d=300 d=600 d= Numer of Iterations (a) Alternating Exact Minimization Algorithm Estaimation Error d=300 d=600 d= Number of Iterations (b) Alternating Gradient Descent Algorithm Figure 1: Two illustrative examples for matrix sensing. The vertical axis corresponds to estimation error M (t) M F. The horizontal axis corresponds to numbers of iterations. Both the alternating exact minimization and alternating gradient descent algorithms attain linear rate of convergence for d = 600 and d = 900. But both algorithms fail for d = 300, because the sample size is not large enough to guarantee proper initial solutions. of nonconvex optimization algorithms converge geometrically to the global optimum and exactly recover the true low rank matrices under suitable conditions. A Lemmas for Theorem 1 (Alternating Exact Minimization) A.1 Proof of Lemma 1 Proof. For notational convenience, we omit the index t in (t) and V (t), and denote them by and V respectively. Then we define two nk nk matrices S (t) 11 S (t) S (t) 1k =. d.... with S(t) pq = A i (t) p (t) q A i, S (t) k1 S (t) i=1 kk G (t) = G (t) 11 G (t) 1k..... G (t) k1 G (t) kk with G(t) pq = d i=1 A i p q A i for 1 p, q k. Note that S (t) and G (t) are essentially the partial Hessian matrices V F( (t), V ) and V F(, V ) for a vectorized V, i.e., vec(v ) R nk. Before we proceed with the main proof, we first introduce the following lemma. 0

21 10 10 Estimation Error ; = 0:05 7; = 0:05 7; = 0: Number of Iterations (a) Alternating Exact Minimization Algorithm Estimation Error ; = 0:05 7; = 0:05 7; = 0: Number of Iterations (b) Alternating Gradient Descent Algorithm Figure : Two illustrative examples for matrix completion. The vertical axis corresponds to estimation error M (t) M F. The horizontal axis corresponds to numbers of iterations. Both the alternating exact minimization and alternating gradient descent algorithms attain linear rate of convergence for ρ = 0.05 and ρ = 0.1. But both algorithms fail for ρ = 0.05, because the entry observation probability is not large enough to guarantee proper initial solutions. Lemma 11. Suppose that A( ) satisfies k-rip with parameter δ k. We then have 1 + δ k σ max (S (t) ) σ min (S (t) ) 1 δ k. The proof of Lemma 11 is provided in Appendix A.7. Note that Lemma 11 is also applicable G (t), since G (t) shares the same structure with S (t). We then proceed with the proof of Lemma 1. Given a fixed, F(, V ) is a quadratic function of V. Therefore we have F(, V ) = F(, V ) + V F(, V ), V V + vec(v ) vec(v ), V F (, V ) ( vec(v ) vec(v ) ), which further implies implies F(, V ) F(, V ) V (, V ), V V σ max ( V F (, V )) V V F F(, V ) F(, V ) V (, V ), V V σ min ( V F (, V )) V V F. Then we can verify that V F (, V ) also shares the same structure with S(t). Thus applying Lemma 11 to the above two inequalities, we complete the proof. 1

22 A. Proof of Lemma 3 Proof. For notational convenience, we omit the index t in (t) and V (t), and denote them by and V respectively. We define two nk nk matrices J (t) 11 J (t) J (t) 1k =. d.... (t) with J pq = A i (t) p q A i, J (t) k1 J (t) i=1 kk K (t) = K (t) 11 K (t) 1k..... K (t) k1 K (t) kk with K(t) pq = (t) p qi n for 1 p, q k. Before we proceed with the main proof, we first introduce the following lemmas. Lemma 1. Suppose that A( ) satisfies k-rip with parameter δ k. We then have S (t) K (t) J (t) 3δ k k (t) F. The proof of Lemma 1 is provided in Appendix A.8. Note that Lemma 1 is also applicable to G (t) K (t) J (t), since G (t) and S (t) share the same structure. Lemma 13. Given F R k k, we define a nk nk matrix F 11 I n F 1k I n F = F k1 I n F kk I n For any V R n k, let v = vec(v ) R nk, then we have Fv = F V F. Proof. By linear algebra, we have which completes the proof. [F V ] ij = F i V j = k F il V jl = l=1 k F il I l V l, l=1 as We then proceed with the proof of Lemma 3. Since b i = tr(v A i ), then we rewrite F(, V ) F(, V ) = 1 d i=1 ( tr(v A i ) b i ) = 1 d ( k Vj A i j i=1 j=1 k j=1 V j A i j). For notational simplicity, we define v = vec(v ). Since V (t+0.5) minimizes F( (t), V ), we have vec ( F( (t), V (t+0.5) ) ) = S (t) v (t+0.5) J (t) v = 0.

23 Solving the above system of equations, we obtain Meanwhile, we have vec( V F(, V (t+0.5) )) = G (t) v (t+0.5) G (t) v v (t+0.5) = (S (t) ) 1 J (t) v. (55) = G (t) (S (t) ) 1 J (t) v G (t) v = G (t)( (S (t) ) 1 J (t) I nk ) v, (56) where the second equality come from (55). By triangle inequality, (56) further implies ((S (t) ) 1 J (t) I nk )v (K (t) I nk )v + (S (t) ) 1 (J (t) S (t) K (t) )v ( (t) I k )V F + (S (t) ) 1 (J (t) S (t) K (t) )v (t) I k F V + (S (t) ) 1 (J (t) S (t) K (t) )v, (57) where the second inequality comes from Lemma 13. Plugging (57) into (56), we have vec( V F(, V (t+0.5) )) G (t) ((S (t) ) 1 J (t) I nk )v (i) (1 + δ k )(σ 1 (t) I k + (S (t) ) 1 S (t) K (t) J (t) σ 1 k) (ii) ( (1 + δ k )σ 1 ( (t) ) ( (t) ) F + 3δ ) kk (t) F 1 δ k (iii) ( (1 + δ k )σ 1 (t) F + 3δ ) kk (t) (iv) F (1 δ k)σ k (t) F, 1 δ k ξ where (i) comes from Lemma 11 and V = M = σ 1 and V F = v σ 1 k, (ii) comes from Lemmas 11 and 1, (iii) from Cauchy-Schwartz inequality, and (iv) comes from (16). Since we have V F( (t), V (t+0.5) ) = 0, we further btain which completes the proof. A.3 Proof of Lemma 4 E(V (t+0.5), V (t+0.5), (t) ) (1 δ k)σ k (t) F, ξ Proof. For notational convenience, we omit the index t in (t) and V (t), and denote them by and V respectively. By the strong convexity of F(, ), we have F(, V ) 1 δ k V (t+0.5) V F F(, V (t+0.5) ) + V F(, V (t+0.5) ), V V (t+0.5). (58) By the strong convexity of F(, ) again, we have F(, V (t+0.5) ) F(, V ) + V F(, V ), V (t+0.5) V (t+0.5) + 1 δ k V (t+0.5) V F F(, V ) + 1 δ k V (t+0.5) V F, (59) 3

24 where the last inequality comes from the optimality condition of V = argmin V F(, V ), i.e. V F(, V ), V (t+0.5) V 0. Meanwhile, since V (t+0.5) minimizes F( (t), ), we have the optimality condition which further implies V F(, V (t+0.5) ), V V (t+0.5) Combining (58) and (59) with (60), we obtain which completes the proof. A.4 Proof of Lemma 5 V F( (t), V (t+0.5) ), V V (t+0.5) 0, V F(, V (t+0.5) ) V F( (t), V (t+0.5) ), V V (t+0.5). (60) V (t+0.5) V 1 1 δ k E(V (t+0.5), V (t+0.5), (t) ), Proof. Before we proceed with the proof, we first introduce the following lemma. Lemma 14. Suppose that A R n k is a rank k matrix. Let E R n k satisfy E A < 1. Then given a QR decomposition (A + E) = QR, there exists a factorization of A = Q O such that Q R n k is an orthonormal matrix, and satisfies A Q Q E F F 1 E A. The proof of Lemma 14 is provided in Stewart et al. [1990], therefore omitted. We then proceed with the proof of Lemma 5. We consider A = V (t) and E = V (t+0.5) V (t) in Lemma 14 respectively. We can verify that V (t+0.5) V (t) V (t) V (t+0.5) V (t) F σ k 1 4. Then there exists a V (t) = V (t+1) O such that V (t+1) is an orthonormal matrix, and satisfies V (t+0.5) V (t+1) F V (t) V (t+0.5) V (t) F σ k V (t+0.5) V (t) F. 4


Characterization of Gradient Dominance and Regularity Conditions for Neural Networks Characterization of Gradient Dominance and Regularity Conditions for Neural Networks Yi Zhou Ohio State University Yingbin Liang Ohio State University Abstract zhou.1172@osu.edu liang.889@osu.edu The past

More information

Lecture notes: Applied linear algebra Part 1. Version 2

Lecture notes: Applied linear algebra Part 1. Version 2 Lecture notes: Applied linear algebra Part 1. Version 2 Michael Karow Berlin University of Technology karow@math.tu-berlin.de October 2, 2008 1 Notation, basic notions and facts 1.1 Subspaces, range and

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

Tighten after Relax: Minimax-Optimal Sparse PCA in Polynomial Time

Tighten after Relax: Minimax-Optimal Sparse PCA in Polynomial Time Tighten after Relax: Minimax-Optimal Sparse PCA in Polynomial Time Zhaoran Wang Huanran Lu Han Liu Department of Operations Research and Financial Engineering Princeton University Princeton, NJ 08540 {zhaoran,huanranl,hanliu}@princeton.edu

More information

Conditions for Robust Principal Component Analysis

Conditions for Robust Principal Component Analysis Rose-Hulman Undergraduate Mathematics Journal Volume 12 Issue 2 Article 9 Conditions for Robust Principal Component Analysis Michael Hornstein Stanford University, mdhornstein@gmail.com Follow this and

More information

arxiv: v3 [math.oc] 19 Oct 2017

arxiv: v3 [math.oc] 19 Oct 2017 Gradient descent with nonconvex constraints: local concavity determines convergence Rina Foygel Barber and Wooseok Ha arxiv:703.07755v3 [math.oc] 9 Oct 207 0.7.7 Abstract Many problems in high-dimensional

More information

Linear Algebra Massoud Malek

Linear Algebra Massoud Malek CSUEB Linear Algebra Massoud Malek Inner Product and Normed Space In all that follows, the n n identity matrix is denoted by I n, the n n zero matrix by Z n, and the zero vector by θ n An inner product

More information

Chapter 3 Transformations

Chapter 3 Transformations Chapter 3 Transformations An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1 Linear Transformations A function is called a linear transformation if 1. for every and 2. for every If we fix the bases

More information

Low-Rank Matrix Recovery

Low-Rank Matrix Recovery ELE 538B: Mathematics of High-Dimensional Data Low-Rank Matrix Recovery Yuxin Chen Princeton University, Fall 2018 Outline Motivation Problem setup Nuclear norm minimization RIP and low-rank matrix recovery

More information

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28 Sparsity Models Tong Zhang Rutgers University T. Zhang (Rutgers) Sparsity Models 1 / 28 Topics Standard sparse regression model algorithms: convex relaxation and greedy algorithm sparse recovery analysis:

More information

An iterative hard thresholding estimator for low rank matrix recovery

An iterative hard thresholding estimator for low rank matrix recovery An iterative hard thresholding estimator for low rank matrix recovery Alexandra Carpentier - based on a joint work with Arlene K.Y. Kim Statistical Laboratory, Department of Pure Mathematics and Mathematical

More information

Dropping Convexity for Faster Semi-definite Optimization

Dropping Convexity for Faster Semi-definite Optimization JMLR: Workshop and Conference Proceedings vol 49: 53, 06 Dropping Convexity for Faster Semi-definite Optimization Srinadh Bhojanapalli Toyota Technological Institute at Chicago Anastasios Kyrillidis Sujay

More information

The convex algebraic geometry of rank minimization

The convex algebraic geometry of rank minimization The convex algebraic geometry of rank minimization Pablo A. Parrilo Laboratory for Information and Decision Systems Massachusetts Institute of Technology International Symposium on Mathematical Programming

More information

Lecture 2: Linear Algebra Review

Lecture 2: Linear Algebra Review EE 227A: Convex Optimization and Applications January 19 Lecture 2: Linear Algebra Review Lecturer: Mert Pilanci Reading assignment: Appendix C of BV. Sections 2-6 of the web textbook 1 2.1 Vectors 2.1.1

More information

Structured matrix factorizations. Example: Eigenfaces

Structured matrix factorizations. Example: Eigenfaces Structured matrix factorizations Example: Eigenfaces An extremely large variety of interesting and important problems in machine learning can be formulated as: Given a matrix, find a matrix and a matrix

More information

SPARSE signal representations have gained popularity in recent

SPARSE signal representations have gained popularity in recent 6958 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011 Blind Compressed Sensing Sivan Gleichman and Yonina C. Eldar, Senior Member, IEEE Abstract The fundamental principle underlying

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Instructor: Moritz Hardt Email: hardt+ee227c@berkeley.edu Graduate Instructor: Max Simchowitz Email: msimchow+ee227c@berkeley.edu

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 11, 2016 Paper presentations and final project proposal Send me the names of your group member (2 or 3 students) before October 15 (this Friday)

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Optimization for Compressed Sensing

Optimization for Compressed Sensing Optimization for Compressed Sensing Robert J. Vanderbei 2014 March 21 Dept. of Industrial & Systems Engineering University of Florida http://www.princeton.edu/ rvdb Lasso Regression The problem is to solve

More information

Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725

Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725 Proximal Gradient Descent and Acceleration Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: subgradient method Consider the problem min f(x) with f convex, and dom(f) = R n. Subgradient method:

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

Guaranteed Rank Minimization via Singular Value Projection: Supplementary Material

Guaranteed Rank Minimization via Singular Value Projection: Supplementary Material Guaranteed Rank Minimization via Singular Value Projection: Supplementary Material Prateek Jain Microsoft Research Bangalore Bangalore, India prajain@microsoft.com Raghu Meka UT Austin Dept. of Computer

More information

arxiv: v1 [math.st] 10 Sep 2015

arxiv: v1 [math.st] 10 Sep 2015 Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees Department of Statistics Yudong Chen Martin J. Wainwright, Department of Electrical Engineering and

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57

More information

On the Local Convergence of an Iterative Approach for Inverse Singular Value Problems

On the Local Convergence of an Iterative Approach for Inverse Singular Value Problems On the Local Convergence of an Iterative Approach for Inverse Singular Value Problems Zheng-jian Bai Benedetta Morini Shu-fang Xu Abstract The purpose of this paper is to provide the convergence theory

More information

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Bo Liu Department of Computer Science, Rutgers Univeristy Xiao-Tong Yuan BDAT Lab, Nanjing University of Information Science and Technology

More information

Lecture 8: February 9

Lecture 8: February 9 0-725/36-725: Convex Optimiation Spring 205 Lecturer: Ryan Tibshirani Lecture 8: February 9 Scribes: Kartikeya Bhardwaj, Sangwon Hyun, Irina Caan 8 Proximal Gradient Descent In the previous lecture, we

More information

Collaborative Filtering Matrix Completion Alternating Least Squares

Collaborative Filtering Matrix Completion Alternating Least Squares Case Study 4: Collaborative Filtering Collaborative Filtering Matrix Completion Alternating Least Squares Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade May 19, 2016

More information

The deterministic Lasso

The deterministic Lasso The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality

More information

arxiv: v5 [math.na] 16 Nov 2017

arxiv: v5 [math.na] 16 Nov 2017 RANDOM PERTURBATION OF LOW RANK MATRICES: IMPROVING CLASSICAL BOUNDS arxiv:3.657v5 [math.na] 6 Nov 07 SEAN O ROURKE, VAN VU, AND KE WANG Abstract. Matrix perturbation inequalities, such as Weyl s theorem

More information

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013.

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013. The University of Texas at Austin Department of Electrical and Computer Engineering EE381V: Large Scale Learning Spring 2013 Assignment Two Caramanis/Sanghavi Due: Tuesday, Feb. 19, 2013. Computational

More information

Sketchy Decisions: Convex Low-Rank Matrix Optimization with Optimal Storage

Sketchy Decisions: Convex Low-Rank Matrix Optimization with Optimal Storage Sketchy Decisions: Convex Low-Rank Matrix Optimization with Optimal Storage Madeleine Udell Operations Research and Information Engineering Cornell University Based on joint work with Alp Yurtsever (EPFL),

More information

Reconstruction from Anisotropic Random Measurements

Reconstruction from Anisotropic Random Measurements Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013

More information

Constructing Explicit RIP Matrices and the Square-Root Bottleneck

Constructing Explicit RIP Matrices and the Square-Root Bottleneck Constructing Explicit RIP Matrices and the Square-Root Bottleneck Ryan Cinoman July 18, 2018 Ryan Cinoman Constructing Explicit RIP Matrices July 18, 2018 1 / 36 Outline 1 Introduction 2 Restricted Isometry

More information

approximation algorithms I

approximation algorithms I SUM-OF-SQUARES method and approximation algorithms I David Steurer Cornell Cargese Workshop, 201 meta-task encoded as low-degree polynomial in R x example: f(x) = i,j n w ij x i x j 2 given: functions

More information

Non-convex Robust PCA: Provable Bounds

Non-convex Robust PCA: Provable Bounds Non-convex Robust PCA: Provable Bounds Anima Anandkumar U.C. Irvine Joint work with Praneeth Netrapalli, U.N. Niranjan, Prateek Jain and Sujay Sanghavi. Learning with Big Data High Dimensional Regime Missing

More information

arxiv: v7 [stat.ml] 9 Feb 2017

arxiv: v7 [stat.ml] 9 Feb 2017 Submitted to the Annals of Applied Statistics arxiv: arxiv:141.7477 PATHWISE COORDINATE OPTIMIZATION FOR SPARSE LEARNING: ALGORITHM AND THEORY By Tuo Zhao, Han Liu and Tong Zhang Georgia Tech, Princeton

More information

Linear Algebra and Eigenproblems

Linear Algebra and Eigenproblems Appendix A A Linear Algebra and Eigenproblems A working knowledge of linear algebra is key to understanding many of the issues raised in this work. In particular, many of the discussions of the details

More information

A Randomized Algorithm for the Approximation of Matrices

A Randomized Algorithm for the Approximation of Matrices A Randomized Algorithm for the Approximation of Matrices Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert Technical Report YALEU/DCS/TR-36 June 29, 2006 Abstract Given an m n matrix A and a positive

More information

Rank Determination for Low-Rank Data Completion

Rank Determination for Low-Rank Data Completion Journal of Machine Learning Research 18 017) 1-9 Submitted 7/17; Revised 8/17; Published 9/17 Rank Determination for Low-Rank Data Completion Morteza Ashraphijuo Columbia University New York, NY 1007,

More information

Lecture Notes 10: Matrix Factorization

Lecture Notes 10: Matrix Factorization Optimization-based data analysis Fall 207 Lecture Notes 0: Matrix Factorization Low-rank models. Rank- model Consider the problem of modeling a quantity y[i, j] that depends on two indices i and j. To

More information

The Projected Power Method: An Efficient Algorithm for Joint Alignment from Pairwise Differences

The Projected Power Method: An Efficient Algorithm for Joint Alignment from Pairwise Differences The Projected Power Method: An Efficient Algorithm for Joint Alignment from Pairwise Differences Yuxin Chen Emmanuel Candès Department of Statistics, Stanford University, Sep. 2016 Nonconvex optimization

More information

Analysis of Robust PCA via Local Incoherence

Analysis of Robust PCA via Local Incoherence Analysis of Robust PCA via Local Incoherence Huishuai Zhang Department of EECS Syracuse University Syracuse, NY 3244 hzhan23@syr.edu Yi Zhou Department of EECS Syracuse University Syracuse, NY 3244 yzhou35@syr.edu

More information

arxiv: v1 [math.oc] 11 Jun 2009

arxiv: v1 [math.oc] 11 Jun 2009 RANK-SPARSITY INCOHERENCE FOR MATRIX DECOMPOSITION VENKAT CHANDRASEKARAN, SUJAY SANGHAVI, PABLO A. PARRILO, S. WILLSKY AND ALAN arxiv:0906.2220v1 [math.oc] 11 Jun 2009 Abstract. Suppose we are given a

More information

Linear Algebra. Session 12

Linear Algebra. Session 12 Linear Algebra. Session 12 Dr. Marco A Roque Sol 08/01/2017 Example 12.1 Find the constant function that is the least squares fit to the following data x 0 1 2 3 f(x) 1 0 1 2 Solution c = 1 c = 0 f (x)

More information

We describe the generalization of Hazan s algorithm for symmetric programming

We describe the generalization of Hazan s algorithm for symmetric programming ON HAZAN S ALGORITHM FOR SYMMETRIC PROGRAMMING PROBLEMS L. FAYBUSOVICH Abstract. problems We describe the generalization of Hazan s algorithm for symmetric programming Key words. Symmetric programming,

More information

r=1 r=1 argmin Q Jt (20) After computing the descent direction d Jt 2 dt H t d + P (x + d) d i = 0, i / J

r=1 r=1 argmin Q Jt (20) After computing the descent direction d Jt 2 dt H t d + P (x + d) d i = 0, i / J 7 Appendix 7. Proof of Theorem Proof. There are two main difficulties in proving the convergence of our algorithm, and none of them is addressed in previous works. First, the Hessian matrix H is a block-structured

More information

Sparsity Regularization

Sparsity Regularization Sparsity Regularization Bangti Jin Course Inverse Problems & Imaging 1 / 41 Outline 1 Motivation: sparsity? 2 Mathematical preliminaries 3 l 1 solvers 2 / 41 problem setup finite-dimensional formulation

More information

Matrix completion: Fundamental limits and efficient algorithms. Sewoong Oh Stanford University

Matrix completion: Fundamental limits and efficient algorithms. Sewoong Oh Stanford University Matrix completion: Fundamental limits and efficient algorithms Sewoong Oh Stanford University 1 / 35 Low-rank matrix completion Low-rank Data Matrix Sparse Sampled Matrix Complete the matrix from small

More information

Lecture Notes 9: Constrained Optimization

Lecture Notes 9: Constrained Optimization Optimization-based data analysis Fall 017 Lecture Notes 9: Constrained Optimization 1 Compressed sensing 1.1 Underdetermined linear inverse problems Linear inverse problems model measurements of the form

More information

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors

More information

Course Notes: Week 1

Course Notes: Week 1 Course Notes: Week 1 Math 270C: Applied Numerical Linear Algebra 1 Lecture 1: Introduction (3/28/11) We will focus on iterative methods for solving linear systems of equations (and some discussion of eigenvalues

More information

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization Numerical Linear Algebra Primer Ryan Tibshirani Convex Optimization 10-725 Consider Last time: proximal Newton method min x g(x) + h(x) where g, h convex, g twice differentiable, and h simple. Proximal

More information

arxiv: v3 [stat.me] 8 Jun 2018

arxiv: v3 [stat.me] 8 Jun 2018 Between hard and soft thresholding: optimal iterative thresholding algorithms Haoyang Liu and Rina Foygel Barber arxiv:804.0884v3 [stat.me] 8 Jun 08 June, 08 Abstract Iterative thresholding algorithms

More information

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2017 LECTURE 5

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2017 LECTURE 5 STAT 39: MATHEMATICAL COMPUTATIONS I FALL 17 LECTURE 5 1 existence of svd Theorem 1 (Existence of SVD) Every matrix has a singular value decomposition (condensed version) Proof Let A C m n and for simplicity

More information

Limitations in Approximating RIP

Limitations in Approximating RIP Alok Puranik Mentor: Adrian Vladu Fifth Annual PRIMES Conference, 2015 Outline 1 Background The Problem Motivation Construction Certification 2 Planted model Planting eigenvalues Analysis Distinguishing

More information

1 Cricket chirps: an example

1 Cricket chirps: an example Notes for 2016-09-26 1 Cricket chirps: an example Did you know that you can estimate the temperature by listening to the rate of chirps? The data set in Table 1 1. represents measurements of the number

More information

5.6. PSEUDOINVERSES 101. A H w.

5.6. PSEUDOINVERSES 101. A H w. 5.6. PSEUDOINVERSES 0 Corollary 5.6.4. If A is a matrix such that A H A is invertible, then the least-squares solution to Av = w is v = A H A ) A H w. The matrix A H A ) A H is the left inverse of A and

More information

Self-Calibration and Biconvex Compressive Sensing

Self-Calibration and Biconvex Compressive Sensing Self-Calibration and Biconvex Compressive Sensing Shuyang Ling Department of Mathematics, UC Davis July 12, 2017 Shuyang Ling (UC Davis) SIAM Annual Meeting, 2017, Pittsburgh July 12, 2017 1 / 22 Acknowledgements

More information

Deep Linear Networks with Arbitrary Loss: All Local Minima Are Global

Deep Linear Networks with Arbitrary Loss: All Local Minima Are Global homas Laurent * 1 James H. von Brecht * 2 Abstract We consider deep linear networks with arbitrary convex differentiable loss. We provide a short and elementary proof of the fact that all local minima

More information

6 Compressed Sensing and Sparse Recovery

6 Compressed Sensing and Sparse Recovery 6 Compressed Sensing and Sparse Recovery Most of us have noticed how saving an image in JPEG dramatically reduces the space it occupies in our hard drives as oppose to file types that save the pixel value

More information

Matrix estimation by Universal Singular Value Thresholding

Matrix estimation by Universal Singular Value Thresholding Matrix estimation by Universal Singular Value Thresholding Courant Institute, NYU Let us begin with an example: Suppose that we have an undirected random graph G on n vertices. Model: There is a real symmetric

More information

A New Estimate of Restricted Isometry Constants for Sparse Solutions

A New Estimate of Restricted Isometry Constants for Sparse Solutions A New Estimate of Restricted Isometry Constants for Sparse Solutions Ming-Jun Lai and Louis Y. Liu January 12, 211 Abstract We show that as long as the restricted isometry constant δ 2k < 1/2, there exist

More information

A fast randomized algorithm for overdetermined linear least-squares regression

A fast randomized algorithm for overdetermined linear least-squares regression A fast randomized algorithm for overdetermined linear least-squares regression Vladimir Rokhlin and Mark Tygert Technical Report YALEU/DCS/TR-1403 April 28, 2008 Abstract We introduce a randomized algorithm

More information

Greedy Signal Recovery and Uniform Uncertainty Principles

Greedy Signal Recovery and Uniform Uncertainty Principles Greedy Signal Recovery and Uniform Uncertainty Principles SPIE - IE 2008 Deanna Needell Joint work with Roman Vershynin UC Davis, January 2008 Greedy Signal Recovery and Uniform Uncertainty Principles

More information

Linear Algebra. Paul Yiu. Department of Mathematics Florida Atlantic University. Fall A: Inner products

Linear Algebra. Paul Yiu. Department of Mathematics Florida Atlantic University. Fall A: Inner products Linear Algebra Paul Yiu Department of Mathematics Florida Atlantic University Fall 2011 6A: Inner products In this chapter, the field F = R or C. We regard F equipped with a conjugation χ : F F. If F =

More information

Lecture 9: Numerical Linear Algebra Primer (February 11st)

Lecture 9: Numerical Linear Algebra Primer (February 11st) 10-725/36-725: Convex Optimization Spring 2015 Lecture 9: Numerical Linear Algebra Primer (February 11st) Lecturer: Ryan Tibshirani Scribes: Avinash Siravuru, Guofan Wu, Maosheng Liu Note: LaTeX template

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

CHAPTER 11. A Revision. 1. The Computers and Numbers therein

CHAPTER 11. A Revision. 1. The Computers and Numbers therein CHAPTER A Revision. The Computers and Numbers therein Traditional computer science begins with a finite alphabet. By stringing elements of the alphabet one after another, one obtains strings. A set of

More information

arxiv: v2 [cs.it] 19 Sep 2016

arxiv: v2 [cs.it] 19 Sep 2016 Fast Algorithms for Robust PCA via Gradient Descent Xinyang Yi Dohyung Park Yudong Chen Constantine Caramanis The University of Texas at Austin Cornell University {yixy,dhpark,constantine}@utexas.edu yudong.chen@cornell.edu

More information

Compressibility of Infinite Sequences and its Interplay with Compressed Sensing Recovery

Compressibility of Infinite Sequences and its Interplay with Compressed Sensing Recovery Compressibility of Infinite Sequences and its Interplay with Compressed Sensing Recovery Jorge F. Silva and Eduardo Pavez Department of Electrical Engineering Information and Decision Systems Group Universidad

More information

Grothendieck s Inequality

Grothendieck s Inequality Grothendieck s Inequality Leqi Zhu 1 Introduction Let A = (A ij ) R m n be an m n matrix. Then A defines a linear operator between normed spaces (R m, p ) and (R n, q ), for 1 p, q. The (p q)-norm of A

More information