Nonconvex Low Rank Matrix Factorization via Inexact First Order Oracle


Tuo Zhao, Zhaoran Wang, Han Liu

(Tuo Zhao is affiliated with the Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA, and the Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA; tour@cs.jhu.edu. Zhaoran Wang is affiliated with the Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA; zhaoran@princeton.edu. Han Liu is affiliated with the Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA; hanliu@princeton.edu.)

Abstract

We study the low rank matrix factorization problem via nonconvex optimization. Compared with the convex relaxation approach, nonconvex optimization exhibits superior empirical performance for large scale low rank matrix estimation. However, the understanding of its theoretical guarantees is limited. To bridge this gap, we exploit the notion of inexact first order oracle, which naturally appears in low rank matrix factorization problems such as matrix sensing and completion. In particular, our analysis shows that a broad class of nonconvex optimization algorithms, including alternating minimization and gradient-type methods, can be treated as solving two sequences of convex optimization problems using the inexact first order oracle. Thus we can show that these algorithms converge geometrically to the global optima and recover the true low rank matrices under suitable conditions. Numerical results are provided to support our theory.

1 Introduction

Let $M^* \in \mathbb{R}^{m\times n}$ be a rank-$k$ matrix with $k$ much smaller than $m$ and $n$. Our goal is to estimate $M^*$ based on partial observations of its entries. For example, matrix completion is based on a subsample of $M^*$'s entries, while matrix sensing is based on linear measurements $\langle A_i, M^*\rangle$, where $i \in \{1,\dots,d\}$ with $d$ much smaller than $mn$ and $A_i$ is the sensing matrix.

In the past decade, significant progress has been made on the recovery of low rank matrices [Candès and Recht, 2009, Candès and Tao, 2010, Candes and Plan, 2010, Recht et al., 2010, Lee and Bresler, 2010, Keshavan et al., 2010a,b, Jain et al., 2010, Cai et al., 2010, Recht, 2011, Gross, 2011, Chandrasekaran et al., 2011, Hsu et al., 2011, Rohde and Tsybakov, 2011, Koltchinskii et al., 2011, Negahban and Wainwright, 2011, Chen et al., 2011, Xiang et al., 2012, Negahban and Wainwright, 2012, Agarwal et al., 2012, Recht and Ré, 2013, Chen, 2013, Chen et al., 2013a,b, Jain et al., 2013, Jain and Netrapalli, 2014, Hardt, 2014, Hardt et al., 2014, Hardt and Wootters, 2014, Sun and Luo, 2014, Hastie et al., 2014, Cai and Zhang, 2015, Yan et al., 2015, Zhu et al., 2015, Wang et al., 2015]. Among these works, most are based upon convex relaxation with a nuclear norm constraint or regularization. Nevertheless, solving these convex optimization problems can be computationally prohibitive in high dimensional regimes with large $m$ and $n$ [Hsieh and Olsen, 2014].

A computationally more efficient alternative is nonconvex optimization. In particular, we reparameterize the $m\times n$ matrix variable $M$ in the optimization problem as $UV^\top$ with $U \in \mathbb{R}^{m\times k}$ and $V \in \mathbb{R}^{n\times k}$, and optimize over $U$ and $V$. Such a reparametrization automatically enforces the low rank structure and leads to low computational cost per iteration. For this reason, the nonconvex approach is widely used in large scale applications such as recommendation systems and collaborative filtering [Koren, 2009, Koren et al., 2009]. Despite the superior empirical performance of the nonconvex approach, the understanding of its theoretical guarantees is rather limited in comparison with the convex relaxation approach. The classical nonconvex optimization theory can only show sublinear convergence to local optima, yet many empirical results have corroborated its exceptional computational performance and convergence to global optima. Only recently has there been theoretical analysis of the block coordinate descent-type nonconvex optimization algorithm, which is known as alternating minimization [Jain et al., 2013, Hardt, 2014, Hardt et al., 2014, Hardt and Wootters, 2014]. In particular, the existing results show that, provided a proper initialization, the alternating minimization algorithm attains a linear rate of convergence to a global optimum $U^* \in \mathbb{R}^{m\times k}$ and $V^* \in \mathbb{R}^{n\times k}$, which satisfy $M^* = U^*V^{*\top}$. Meanwhile, Keshavan et al. [2010a,b] establish the convergence of gradient-type methods, and Sun and Luo [2014] further establish the convergence of a broad class of nonconvex optimization algorithms including both gradient-type and block coordinate descent-type methods. However, Keshavan et al. [2010a,b] and Sun and Luo [2014] only establish asymptotic convergence over an infinite number of iterations, rather than an explicit rate of convergence. Besides these works, Lee and Bresler [2010], Jain et al. [2010], and Jain and Netrapalli [2014] consider projected gradient-type methods, which optimize over the matrix variable $M \in \mathbb{R}^{m\times n}$ rather than $U \in \mathbb{R}^{m\times k}$ and $V \in \mathbb{R}^{n\times k}$. These methods involve calculating the top $k$ singular vectors of an $m\times n$ matrix at each iteration. For $k$ much smaller than $m$ and $n$, they incur much higher computational cost per iteration than the aforementioned methods that optimize over $U$ and $V$. All these works, except Sun and Luo [2014], focus on specific algorithms, while Sun and Luo [2014] do not establish an explicit optimization rate of convergence.

In this paper, we propose a new theory for analyzing a broad class of nonconvex optimization algorithms for low rank matrix estimation. The core of our theory is the notion of inexact first order oracle. Based on the inexact first order oracle, we establish sufficient conditions under which the iteration sequences converge geometrically to the global optima.

For both matrix sensing and completion, a direct consequence of our theory is that a broad family of nonconvex optimization algorithms, including gradient descent, block coordinate gradient descent, and block coordinate minimization, attains linear rates of convergence to the true low rank matrices $U^*$ and $V^*$. In particular, our proposed theory covers alternating minimization as a special case and recovers the results of Jain et al. [2013], Hardt [2014], Hardt et al. [2014], and Hardt and Wootters [2014] under suitable conditions. Meanwhile, our approach covers gradient-type methods, which are also widely used in practice [Takács et al., 2007, Paterek, 2007, Koren et al., 2009, Gemulla et al., 2011, Recht and Ré, 2013, Zhuang et al., 2013]. To the best of our knowledge, our analysis is the first that establishes exact recovery guarantees and geometric rates of convergence for a broad family of nonconvex matrix sensing and completion algorithms.

To achieve maximum generality, our unified analysis significantly differs from previous works. In detail, Jain et al. [2013], Hardt [2014], Hardt et al. [2014], and Hardt and Wootters [2014] view alternating minimization as an approximate power method. However, their point of view relies on the closed form solution of each iteration of alternating minimization, which makes it difficult to generalize to other algorithms, e.g., gradient-type methods. Meanwhile, Sun and Luo [2014] take a geometric point of view. In detail, they show that the global optimum of the optimization problem is the unique stationary point within its neighborhood, and thus a broad class of algorithms succeed. However, such geometric analysis of the objective function does not characterize the convergence rate of specific algorithms towards the stationary point. Unlike existing results, we analyze nonconvex optimization algorithms as approximate convex counterparts. For example, our analysis views alternating minimization on a nonconvex objective function as an approximate block coordinate minimization on some convex objective function. We use a key quantity, the inexact first order oracle, to characterize such a perturbation effect, which results from the local nonconvexity at intermediate solutions. This eventually allows us to establish explicit rates of convergence in a way analogous to existing convex optimization analysis.

Our proposed inexact first order oracle is closely related to a series of previous works on inexact or approximate gradient descent algorithms: Güler [1992], Luo and Tseng [1993], Nedić and Bertsekas [2001], d'Aspremont [2008], Baes [2009], Friedlander and Schmidt [2012], Devolder et al. [2014]. Different from these existing results, which focus on convex minimization, we show that the inexact first order oracle can also sharply capture the evolution of generic optimization algorithms even in the presence of nonconvexity. More recently, Candes et al. [2014], Balakrishnan et al. [2014], and Arora et al. [2015] respectively analyze the Wirtinger Flow algorithm for phase retrieval, the expectation maximization (EM) algorithm for latent variable models, and the gradient descent algorithm for sparse coding based on a similar idea to ours. Though their analyses exploit similar nonconvex structures, they work on completely different problems, and the delivered technical results are also fundamentally different.

A conference version of this paper was presented at the Annual Conference on Neural Information Processing Systems 2015 [Zhao et al., 2015]. While our conference version was under review, similar work was released on arxiv.org by Zheng and Lafferty [2015], Bhojanapalli et al. [2015], Tu et al. [2015], and Chen and Wainwright [2015]. These works focus on symmetric positive semidefinite low rank matrix factorization problems. In contrast, our proposed methodologies and theory do not require symmetry or positive semidefiniteness, and therefore can be applied to rectangular low rank matrix factorization problems.

The rest of this paper is organized as follows. In §2, we review the matrix sensing problem and then introduce a general class of nonconvex optimization algorithms. In §3, we present the convergence analysis of the algorithms. In §4, we lay out the proof. In §5, we extend the proposed methodology and theory to the matrix completion problem. In §6, we provide numerical experiments and draw the conclusion.

Notation. For $v = (v_1,\dots,v_d)^\top \in \mathbb{R}^d$, we define the vector $\ell_q$ norm as $\|v\|_q^q = \sum_j v_j^q$. We define $e_i$ as the indicator vector whose $i$-th entry is one and all other entries are zero. For a matrix $A \in \mathbb{R}^{m\times n}$, we use $A_{*j} = (A_{1j},\dots,A_{mj})^\top$ to denote the $j$-th column of $A$ and $A_{i*} = (A_{i1},\dots,A_{in})^\top$ to denote the $i$-th row of $A$. Let $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$ be the largest and smallest nonzero singular values of $A$. We define the following matrix norms: $\|A\|_F^2 = \sum_j \|A_{*j}\|_2^2$ and $\|A\|_2 = \sigma_{\max}(A)$. Moreover, we define $\|A\|_*$ to be the sum of all singular values of $A$, and $A^\dagger$ to be the Moore-Penrose pseudoinverse of $A$. Given another matrix $B \in \mathbb{R}^{m\times n}$, we define the inner product as $\langle A, B\rangle = \sum_{i,j} A_{ij}B_{ij}$. For a bivariate function $f(u,v)$, we define $\nabla_u f(u,v)$ to be the gradient with respect to $u$. Moreover, we use the common notations $\Omega(\cdot)$, $O(\cdot)$, and $o(\cdot)$ to characterize the asymptotics of two real sequences.

2 Matrix Sensing

We start with the matrix sensing problem. Let $M^* \in \mathbb{R}^{m\times n}$ be the unknown low rank matrix of interest. We have $d$ sensing matrices $A_i \in \mathbb{R}^{m\times n}$ with $i \in \{1,\dots,d\}$. Our goal is to estimate $M^*$ based on $b_i = \langle A_i, M^*\rangle$ in the high dimensional regime with $d$ much smaller than $mn$. Under such a regime, a common assumption is $\mathrm{rank}(M^*) = k \ll \min\{d, m, n\}$. Existing approaches generally recover $M^*$ by solving the following convex optimization problem:

$\min_{M \in \mathbb{R}^{m\times n}} \|M\|_*$ subject to $b = \mathcal{A}(M)$,   (1)

where $b = [b_1,\dots,b_d]^\top \in \mathbb{R}^d$, and $\mathcal{A}(M): \mathbb{R}^{m\times n} \to \mathbb{R}^d$ is an operator defined as

$\mathcal{A}(M) = [\langle A_1, M\rangle, \dots, \langle A_d, M\rangle]^\top \in \mathbb{R}^d.$   (2)

Existing convex optimization algorithms for solving (1) are computationally inefficient, since they incur high per-iteration computational cost and only attain sublinear rates of convergence to the global optimum [Jain et al., 2013, Hsieh and Olsen, 2014]. Therefore, in large scale settings we usually consider the following nonconvex optimization problem instead:

$\min_{U \in \mathbb{R}^{m\times k},\, V \in \mathbb{R}^{n\times k}} F(U,V)$, where $F(U,V) = \tfrac{1}{2}\|b - \mathcal{A}(UV^\top)\|_2^2.$   (3)

The reparametrization $M = UV^\top$, though making the problem in (3) nonconvex, significantly improves the computational efficiency.
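To make the sensing operator (2) and the nonconvex objective (3) concrete, the following is a minimal NumPy sketch (not the authors' code). The Gaussian sensing matrices and all problem sizes are illustrative assumptions on our part.

```python
# Hedged sketch of A(.) in (2) and F(U, V) in (3); sizes and the Gaussian
# ensemble are illustrative choices, not prescribed by the paper.
import numpy as np

rng = np.random.default_rng(0)
m, n, k, d = 30, 40, 5, 600

A = rng.standard_normal((d, m, n))                                   # sensing matrices A_i
M_star = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))   # rank-k target M*

def sensing_operator(M):
    """A(M) = [<A_1, M>, ..., <A_d, M>] as in (2)."""
    return np.einsum('dij,ij->d', A, M)

b = sensing_operator(M_star)                                         # b_i = <A_i, M*>

def objective(U, V):
    """F(U, V) = 0.5 * ||b - A(U V^T)||_2^2 as in (3)."""
    r = b - sensing_operator(U @ V.T)
    return 0.5 * r @ r
```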

Existing literature [Koren, 2009, Koren et al., 2009, Takács et al., 2007, Paterek, 2007, Gemulla et al., 2011, Recht and Ré, 2013, Zhuang et al., 2013] has established convincing evidence that (3) can be effectively solved by a broad variety of gradient-based nonconvex optimization algorithms, including gradient descent, alternating exact minimization (i.e., alternating least squares or block coordinate minimization), as well as alternating gradient descent (i.e., block coordinate gradient descent), as illustrated in Algorithm 1.

Algorithm 1: A family of nonconvex optimization algorithms for matrix sensing. Here $(U, D, V) \leftarrow \mathrm{KSVD}(M)$ is the rank-$k$ singular value decomposition of $M$: $D$ is a diagonal matrix containing the top $k$ singular values of $M$ in decreasing order, and $U$ and $V$ contain the corresponding top $k$ left and right singular vectors of $M$. $(V, R_V) \leftarrow \mathrm{QR}(V)$ is the QR decomposition, where $V$ is the corresponding orthonormal matrix and $R_V$ is the corresponding upper triangular matrix.

  Input: $\{b_i\}_{i=1}^d$, $\{A_i\}_{i=1}^d$
  Parameter: step size $\eta$, total number of iterations $T$
  $(U^{(0)}, D^{(0)}, V^{(0)}) \leftarrow \mathrm{KSVD}\big(\sum_{i=1}^d b_i A_i\big)$, $V^{(0)} \leftarrow V^{(0)} D^{(0)}$, $U^{(0)} \leftarrow U^{(0)} D^{(0)}$
  For $t = 0, \dots, T-1$:
    Updating $V$:
      Alternating exact minimization: $V^{(t+0.5)} \leftarrow \arg\min_V F(U^{(t)}, V)$; $(V^{(t+1)}, R_V^{(t+0.5)}) \leftarrow \mathrm{QR}(V^{(t+0.5)})$
      Alternating gradient descent: $V^{(t+0.5)} \leftarrow V^{(t)} - \eta \nabla_V F(U^{(t)}, V^{(t)})$; $(V^{(t+1)}, R_V^{(t+0.5)}) \leftarrow \mathrm{QR}(V^{(t+0.5)})$, $U^{(t)} \leftarrow U^{(t)} (R_V^{(t+0.5)})^\top$
      Gradient descent: $V^{(t+0.5)} \leftarrow V^{(t)} - \eta \nabla_V F(U^{(t)}, V^{(t)})$; $(V^{(t+1)}, R_V^{(t+0.5)}) \leftarrow \mathrm{QR}(V^{(t+0.5)})$, $U^{(t+1)} \leftarrow U^{(t)} (R_V^{(t+0.5)})^\top$
    Updating $U$:
      Alternating exact minimization: $U^{(t+0.5)} \leftarrow \arg\min_U F(U, V^{(t+1)})$; $(U^{(t+1)}, R_U^{(t+0.5)}) \leftarrow \mathrm{QR}(U^{(t+0.5)})$
      Alternating gradient descent: $U^{(t+0.5)} \leftarrow U^{(t)} - \eta \nabla_U F(U^{(t)}, V^{(t+1)})$; $(U^{(t+1)}, R_U^{(t+0.5)}) \leftarrow \mathrm{QR}(U^{(t+0.5)})$, $V^{(t+1)} \leftarrow V^{(t+1)} (R_U^{(t+0.5)})^\top$
      Gradient descent: $U^{(t+0.5)} \leftarrow U^{(t)} - \eta \nabla_U F(U^{(t)}, V^{(t)})$; $(U^{(t+1)}, R_U^{(t+0.5)}) \leftarrow \mathrm{QR}(U^{(t+0.5)})$, $V^{(t+1)} \leftarrow V^{(t)} (R_U^{(t+0.5)})^\top$
  End for
  Output: $M^{(T)} \leftarrow U^{(T-0.5)} V^{(T)\top}$ (for gradient descent we use $U^{(T)} V^{(T)\top}$)

It is worth noting that the QR decomposition and rank-$k$ singular value decomposition in Algorithm 1 can be accomplished efficiently. In particular, the QR decomposition can be accomplished in $O(k^2\max\{m,n\})$ operations, while the rank-$k$ singular value decomposition can be accomplished in $O(mnk)$ operations. In fact, the QR decomposition is not necessary for particular update schemes; e.g., Jain et al. [2013] prove that the alternating exact minimization update schemes with and without the QR decomposition are equivalent.
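The following is a hedged sketch of the alternating gradient descent branch of Algorithm 1, reusing sensing_operator, objective, A, b, d, and k from the previous sketch. The step size, iteration count, the 1/d scaling in the initialization, and keeping the singular values on the V factor are our own illustrative choices.

```python
# Hedged sketch of the alternating gradient descent variant of Algorithm 1.
import numpy as np

def residual_matrix(U, V):
    """sum_i r_i A_i with r = A(U V^T) - b; contracting it with U or V gives the gradients."""
    r = sensing_operator(U @ V.T) - b
    return np.einsum('d,dij->ij', r, A)

def alternating_gradient_descent(T=300, eta=None):
    # Rank-k SVD initialization (cf. Algorithm 1); the 1/d scaling and keeping U
    # orthonormal while V carries the singular values are our simplifications.
    L, s, Rt = np.linalg.svd(np.einsum('d,dij->ij', b, A) / d)
    U, V = L[:, :k], Rt[:k, :].T * s[:k]
    if eta is None:
        eta = 1.0 / (2 * d)            # rough 1/(2*smoothness) guess for this ensemble
    for _ in range(T):
        V = V - eta * residual_matrix(U, V).T @ U   # gradient step in V
        V, R = np.linalg.qr(V)                      # renormalize V
        U = U @ R.T                                 # absorb R so U V^T is unchanged
        U = U - eta * residual_matrix(U, V) @ V     # gradient step in U
        U, R = np.linalg.qr(U)
        V = V @ R.T
    return U @ V.T

# For the sizes above, objective at the output should be driven close to zero.
M_hat = alternating_gradient_descent()
```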

3 Convergence Analysis

We analyze the convergence of the algorithms illustrated in §2. Before we present the main results, we first introduce a unified analytical framework based on a key quantity named the inexact first order oracle. Such a unified framework equips our theory with the maximum generality. Without loss of generality, we assume $m \le n$ throughout the rest of this paper.

3.1 Main Idea

We first provide an intuitive explanation for the success of nonconvex optimization algorithms, which forms the basis of our later analysis of the main results in §4. Recall that (3) can be written as a special instance of the following optimization problem:

$\min_{U \in \mathbb{R}^{m\times k},\, V \in \mathbb{R}^{n\times k}} f(U, V).$   (4)

A key observation is that, given a fixed $U$, $f(U, \cdot)$ is strongly convex and smooth in $V$ under suitable conditions, and the same also holds for $U$ given a fixed $V$. For the convenience of discussion, we summarize this observation in the following technical condition, which will later be verified for matrix sensing and completion under suitable conditions.

Condition 1 (Strong Biconvexity and Bismoothness). There exist universal constants $\mu_+ > 0$ and $\mu_- > 0$ such that

$\frac{\mu_-}{2}\|U - U'\|_F^2 \le f(U, V) - f(U', V) - \langle U - U', \nabla_U f(U', V)\rangle \le \frac{\mu_+}{2}\|U - U'\|_F^2$ for all $U, U'$,
$\frac{\mu_-}{2}\|V - V'\|_F^2 \le f(U, V) - f(U, V') - \langle V - V', \nabla_V f(U, V')\rangle \le \frac{\mu_+}{2}\|V - V'\|_F^2$ for all $V, V'$.

3.1.1 Ideal First Order Oracle

To ease presentation, we assume that $U^*$ and $V^*$ are the unique global minimizers of the generic optimization problem in (4). Assuming that $U^*$ is given, we can obtain $V^*$ by

$V^* = \arg\min_{V \in \mathbb{R}^{n\times k}} f(U^*, V).$   (5)

Condition 1 implies that the objective function in (5) is strongly convex and smooth. Hence, we can choose any gradient-based algorithm to obtain $V^*$. For example, we can directly solve for $V^*$ in

$\nabla_V f(U^*, V) = 0,$   (6)

or iteratively solve for $V^*$ using gradient descent, i.e.,

$V^{(t)} = V^{(t-1)} - \eta \nabla_V f(U^*, V^{(t-1)}),$   (7)

where $\eta$ is a step size.
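For the matrix sensing objective (3), the subproblem (5)-(6) with the $U$ argument held fixed is an ordinary least-squares problem, since $\langle A_i, UV^\top\rangle$ is linear in $V$. Below is a small sketch of that exact update; it reuses A, b, and k from the earlier sketch, and the use of np.linalg.lstsq is our own choice of solver.

```python
# Hedged sketch of the exact minimization over V (cf. (5)-(6)) for the sensing
# objective, with a fixed factor U plugged in place of the unknown U*.
import numpy as np

def exact_V_update(U):
    """Return argmin_V 0.5 * ||b - A(U V^T)||_2^2 by solving a least-squares problem."""
    # <A_i, U V^T> = <A_i^T U, V> = vec(A_i^T U) . vec(V), so stack these as rows.
    X = np.stack([(Ai.T @ U).ravel() for Ai in A])   # shape (d, n*k)
    v, *_ = np.linalg.lstsq(X, b, rcond=None)
    return v.reshape(-1, k)                          # V with shape (n, k)
```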

Taking gradient descent as an example, we can invoke classical convex optimization results [Nesterov, 2004] to prove that $\|V^{(t)} - V^*\|_F \le \kappa\|V^{(t-1)} - V^*\|_F$ for all $t = 0, 1, 2, \dots$, where $\kappa \in (0,1)$ depends only on $\mu_+$ and $\mu_-$ in Condition 1. For notational simplicity, we call $\nabla_V f(U^*, V^{(t-1)})$ the ideal first order oracle, since we do not know $U^*$ in practice.

3.1.2 Inexact First Order Oracle

Though the ideal first order oracle is not accessible in practice, it provides us with insights for analyzing nonconvex optimization algorithms. Taking gradient descent as an example, at the $t$-th iteration we take a gradient descent step over $V$ based on $\nabla_V f(U, V^{(t-1)})$. We can treat $\nabla_V f(U, V^{(t-1)})$ as an approximation of $\nabla_V f(U^*, V^{(t-1)})$, where the approximation error comes from approximating $U^*$ by $U$. Then the relationship between $\nabla_V f(U, V^{(t-1)})$ and $\nabla_V f(U^*, V^{(t-1)})$ is similar to that between the gradient and the approximate gradient in the existing literature on convex optimization. For simplicity, we call $\nabla_V f(U, V^{(t-1)})$ the inexact first order oracle. To characterize the difference between $\nabla_V f(U, V^{(t-1)})$ and $\nabla_V f(U^*, V^{(t-1)})$, we define the approximation error of the inexact first order oracle as

$E(V, V', U) = \|\nabla_V f(U, V') - \nabla_V f(U^*, V')\|_F,$   (8)

where $V'$ is the current decision variable at which the gradient is evaluated. In the above example, it holds with $V' = V^{(t-1)}$. Later we will illustrate that $E(V, V', U)$ is critical to our analysis. In the above example of alternating gradient descent, we will prove that, for $V^{(t)} = V^{(t-1)} - \eta\nabla_V f(U, V^{(t-1)})$, we have

$\|V^{(t)} - V^*\|_F \le \kappa\|V^{(t-1)} - V^*\|_F + \frac{2}{\mu_+} E(V^{(t)}, V^{(t-1)}, U).$   (9)

In other words, $E(V^{(t)}, V^{(t-1)}, U)$ captures the perturbation effect of employing the inexact first order oracle $\nabla_V f(U, V^{(t-1)})$ instead of the ideal first order oracle $\nabla_V f(U^*, V^{(t-1)})$. For $V^{(t)} = \arg\min_V f(U, V)$, we will prove that

$\|V^{(t)} - V^*\|_F \le \frac{1}{\mu_-} E(V^{(t)}, V^{(t)}, U).$   (10)

According to the update schemes shown in Algorithms 1 and 2, for alternating exact minimization we set $U = U^{(t)}$ in (10), while for gradient descent or alternating gradient descent we set $U = U^{(t-1)}$ or $U = U^{(t)}$ in (9), respectively. Due to symmetry, similar results also hold for $\|U^{(t)} - U^*\|_F$.

To establish the geometric rate of convergence towards the global minima $U^*$ and $V^*$, it remains to establish upper bounds for the approximation error of the inexact first order oracle. Taking gradient descent as an example, we will prove that, given an appropriate initial solution, we have

$\frac{2}{\mu_+} E(V^{(t)}, V^{(t-1)}, U^{(t-1)}) \le \alpha\,\|U^{(t-1)} - U^*\|_F$   (11)

for some $\alpha \in (0, 1-\kappa)$.
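For the sensing objective, the approximation error (8) can be computed directly once an orthonormal factorization of $M^*$ is fixed. The sketch below takes $U^*$ to be the top-$k$ left singular vectors of $M^*$ (one such factorization) and drops the first argument of (8), since for this quadratic objective the gradient depends only on the evaluation point; it reuses sensing_operator, A, b, M_star, and k from the earlier sketches.

```python
# Hedged sketch of the inexact-oracle error (8) for matrix sensing.
import numpy as np

def grad_V(U, V):
    """grad_V F(U, V) = (sum_i r_i A_i)^T U with r = A(U V^T) - b."""
    r = sensing_operator(U @ V.T) - b
    return np.einsum('d,dij->ij', r, A).T @ U

def oracle_error(U, V_eval):
    """|| grad_V F(U, V') - grad_V F(U*, V') ||_F, cf. (8), with U* from an SVD of M*."""
    U_star = np.linalg.svd(M_star)[0][:, :k]
    return np.linalg.norm(grad_V(U, V_eval) - grad_V(U_star, V_eval))
```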

Combining (11) with (9) (where we take $U = U^{(t-1)}$), we further obtain

$\|V^{(t)} - V^*\|_F \le \kappa\|V^{(t-1)} - V^*\|_F + \alpha\|U^{(t-1)} - U^*\|_F.$   (12)

Correspondingly, similar results hold for $\|U^{(t)} - U^*\|_F$, i.e.,

$\|U^{(t)} - U^*\|_F \le \kappa\|U^{(t-1)} - U^*\|_F + \alpha\|V^{(t-1)} - V^*\|_F.$   (13)

Combining (12) and (13), we then establish the contraction

$\max\{\|V^{(t)} - V^*\|_F,\ \|U^{(t)} - U^*\|_F\} \le (\alpha + \kappa)\max\{\|V^{(t-1)} - V^*\|_F,\ \|U^{(t-1)} - U^*\|_F\},$

which further implies geometric convergence, since $\alpha \in (0, 1-\kappa)$. We can establish similar results for alternating exact minimization and alternating gradient descent, respectively. Based upon such a unified analysis, we now present the main results.

3.2 Main Results

Before presenting the main results, we first introduce an assumption known as the restricted isometry property (RIP). Recall that $k$ is the rank of the target low rank matrix $M^*$.

Assumption 1 (Restricted Isometry Property). The linear operator $\mathcal{A}(\cdot): \mathbb{R}^{m\times n} \to \mathbb{R}^d$ defined in (2) satisfies $2k$-RIP with parameter $\delta_{2k} \in (0,1)$, i.e., for all $\Delta \in \mathbb{R}^{m\times n}$ such that $\mathrm{rank}(\Delta) \le 2k$, it holds that

$(1 - \delta_{2k})\|\Delta\|_F^2 \le \|\mathcal{A}(\Delta)\|_2^2 \le (1 + \delta_{2k})\|\Delta\|_F^2.$

Several random matrix ensembles satisfy $2k$-RIP for a sufficiently large $d$ with high probability. For example, suppose that each entry of $A_i$ is independently drawn from a sub-Gaussian distribution; then $\mathcal{A}(\cdot)$ satisfies $2k$-RIP with parameter $\delta_{2k}$ with high probability for $d = \Omega(\delta_{2k}^{-2} k n\log n)$. The following theorem establishes the geometric rate of convergence of the nonconvex optimization algorithms summarized in Algorithm 1.

Theorem 1. Assume there exists a sufficiently small constant $C_1$ such that $\mathcal{A}(\cdot)$ satisfies $2k$-RIP with $\delta_{2k} \le C_1/k$, and that the largest and smallest nonzero singular values of $M^*$ are constants that do not scale with $(d, m, n, k)$. For any pre-specified precision $\epsilon$, there exist a step size $\eta$ and universal constants $C_2$ and $C_3$ such that, for all $T \ge C_2\log(C_3/\epsilon)$, we have $\|M^{(T)} - M^*\|_F \le \epsilon$.

The proof of Theorem 1 is provided in §4.2, §4.3, and §4.4. Theorem 1 implies that all three nonconvex optimization algorithms converge geometrically to the global optimum. Moreover, assuming that each entry of $A_i$ is independently drawn from a sub-Gaussian distribution with mean zero and variance proxy one, our result further suggests that, to achieve exact low rank matrix recovery, our algorithms require the number of measurements $d$ to satisfy

$d = \Omega(k^3 n\log n),$   (14)

since we assume that $\delta_{2k} \le C_1/k$.
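As a quick empirical look at Assumption 1 (not a proof of a RIP constant), the sketch below draws sensing matrices with i.i.d. N(0, 1/d) entries, an explicit normalization assumption on our part, and checks that the ratio in the RIP inequality concentrates near 1 for randomly drawn rank-2k matrices; the uniform statement over all low rank matrices is, of course, stronger.

```python
# Hedged sanity check of the RIP-type concentration in Assumption 1.
import numpy as np

rng = np.random.default_rng(1)
m, n, k, d = 30, 40, 5, 600
A_norm = rng.standard_normal((d, m, n)) / np.sqrt(d)   # i.i.d. N(0, 1/d) entries

ratios = []
for _ in range(200):
    Delta = rng.standard_normal((m, 2 * k)) @ rng.standard_normal((2 * k, n))  # rank <= 2k
    y = np.einsum('dij,ij->d', A_norm, Delta)
    ratios.append((y @ y) / np.linalg.norm(Delta) ** 2)
print(min(ratios), max(ratios))   # both should be close to 1 for each fixed Delta
```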

This sample complexity matches the state-of-the-art result for nonconvex optimization methods established by Jain et al. [2013]. In comparison with their result, which only covers the alternating exact minimization algorithm, our result holds for a broader variety of nonconvex optimization algorithms. Note that the sample complexity in (14) depends on a polynomial of $\sigma_{\max}(M^*)/\sigma_{\min}(M^*)$, which is treated as a constant in our paper. If we allow $\sigma_{\max}(M^*)/\sigma_{\min}(M^*)$ to increase, we can plug the nonconvex optimization algorithms into the multi-stage framework proposed by Jain et al. [2013]. Following similar lines to the proof of Theorem 1, we can then derive a new sample complexity that is independent of $\sigma_{\max}(M^*)/\sigma_{\min}(M^*)$. See more details in Jain et al. [2013].

4 Proof of the Main Results

We sketch the proof of Theorem 1. The proofs of all related lemmas are provided in the appendix. For notational simplicity, let $\sigma_1 = \sigma_{\max}(M^*)$ and $\sigma_k = \sigma_{\min}(M^*)$. Recall that the nonconvex optimization algorithms are symmetric about the updates of $U$ and $V$. Hence, the following lemmas for the update of $V$ also hold for updating $U$; we omit some statements for conciseness. Theorem 2 can be proved in a similar manner, and its proof is provided in Appendix F.

Before presenting the proof, we first introduce the following lemma, which verifies Condition 1.

Lemma 1. Suppose that $\mathcal{A}(\cdot)$ satisfies $2k$-RIP with parameter $\delta_{2k}$. Given an arbitrary orthonormal matrix $U \in \mathbb{R}^{m\times k}$, for any $V, V' \in \mathbb{R}^{n\times k}$ we have

$\frac{1+\delta_{2k}}{2}\|V - V'\|_F^2 \ \ge\ F(U, V) - F(U, V') - \langle \nabla_V F(U, V'), V - V'\rangle \ \ge\ \frac{1-\delta_{2k}}{2}\|V - V'\|_F^2.$

The proof of Lemma 1 is provided in Appendix A.1. Lemma 1 implies that $F(U, \cdot)$ is strongly convex and smooth in $V$ given a fixed orthonormal matrix $U$, as specified in Condition 1. Equipped with Lemma 1, we now lay out the proof for each update scheme in Algorithm 1.

4.1 Rotation Issue

Given a factorization $M^* = \bar{U}\bar{V}^\top$, we can equivalently represent it as $M^* = \bar{U}_{\rm new}\bar{V}_{\rm new}^\top$, where $\bar{U}_{\rm new} = \bar{U}O_{\rm new}$ and $\bar{V}_{\rm new} = \bar{V}O_{\rm new}$ for an arbitrary unitary matrix $O_{\rm new} \in \mathbb{R}^{k\times k}$. This implies that directly calculating $\|U - \bar{U}\|_F$ is not desirable, and the algorithm may converge to an arbitrary factorization of $M^*$. To address this issue, existing analyses usually choose subspace distances to evaluate the difference between the subspaces spanned by the columns of $U$ and $\bar{U}$, because these subspaces are invariant to rotations [Jain et al., 2013]. For example, letting $\bar{U}_\perp \in \mathbb{R}^{m\times(m-k)}$ denote the orthonormal complement of $\bar{U}$, we can choose the subspace distance as $\|\bar{U}_\perp^\top U\|_F$.

For any $O_{\rm new} \in \mathbb{R}^{k\times k}$ such that $O_{\rm new}^\top O_{\rm new} = I_k$, we have $\|\bar{U}_\perp^\top U O_{\rm new}\|_F = \|\bar{U}_\perp^\top U\|_F$, so this distance is indeed invariant to rotations. In this paper, we consider a different subspace distance defined as

$\min_{O^\top O = I_k}\|U - \bar{U}O\|_F.$   (15)

We can verify that (15) is also invariant to rotation. The next lemma shows that (15) is equivalent to $\|\bar{U}_\perp^\top U\|_F$.

Lemma 2. Given two orthonormal matrices $U \in \mathbb{R}^{m\times k}$ and $\bar{U} \in \mathbb{R}^{m\times k}$, we have

$\|\bar{U}_\perp^\top U\|_F \ \le\ \min_{O^\top O = I_k}\|U - \bar{U}O\|_F \ \le\ \sqrt{2}\,\|\bar{U}_\perp^\top U\|_F.$

The proof of Lemma 2 is provided in Stewart et al. [1990] and is therefore omitted. Equipped with Lemma 2, our convergence analysis guarantees that there always exists a factorization of $M^*$ satisfying the desired computational properties at each iteration (see Lemma 5 and Corollaries 1 and 2). The above argument can similarly be generalized to the gradient descent and alternating gradient descent algorithms.

4.2 Proof of Theorem 1 (Alternating Exact Minimization)

Proof. Throughout the proof for alternating exact minimization, we define a constant $\xi \in (1,\infty)$ to simplify the notation. Moreover, we assume that at the $t$-th iteration there exists a matrix factorization of $M^*$,

$M^* = \bar{U}^{(t)}\bar{V}^{(t)\top},$

where $\bar{U}^{(t)} \in \mathbb{R}^{m\times k}$ is an orthonormal matrix. We define the approximation error of the inexact first order oracle as

$E(V^{(t+0.5)}, V^{(t+0.5)}, U^{(t)}) = \|\nabla_V F(U^{(t)}, V^{(t+0.5)}) - \nabla_V F(\bar{U}^{(t)}, V^{(t+0.5)})\|_F.$

The following lemma establishes an upper bound for the approximation error of the inexact first order oracle under suitable conditions.

Lemma 3. Suppose that $\delta_{2k}$ and $U^{(t)}$ satisfy

$\delta_{2k} \le \frac{(1-\delta_{2k})\sigma_k}{12\xi(1+\delta_{2k})\sigma_1}$ and $\|U^{(t)} - \bar{U}^{(t)}\|_F \le \frac{(1-\delta_{2k})\sigma_k}{4\xi(1+\delta_{2k})\sigma_1}.$   (16)

Then we have

$E(V^{(t+0.5)}, V^{(t+0.5)}, U^{(t)}) \le \frac{(1-\delta_{2k})\sigma_k}{2\xi}\|U^{(t)} - \bar{U}^{(t)}\|_F.$
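The subspace distance (15) used throughout this section can be evaluated in closed form: the minimizing rotation is the orthogonal Procrustes solution obtained from an SVD of $\bar{U}^\top U$. The following short sketch, written for any two orthonormal factors, illustrates this computation; it is an implementation convenience on our part and not part of the paper's algorithms.

```python
# Hedged sketch of the rotation-invariant subspace distance (15) via Procrustes.
import numpy as np

def subspace_distance(U, U_bar):
    """min over orthogonal O of || U - U_bar O ||_F, cf. (15) and Lemma 2."""
    P, _, Qt = np.linalg.svd(U_bar.T @ U)   # Procrustes: O = P Q^T maximizes tr(O^T U_bar^T U)
    O = P @ Qt
    return np.linalg.norm(U - U_bar @ O)
```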

The proof of Lemma 3 is provided in Appendix A.2. Lemma 3 shows that the approximation error of the inexact first order oracle for updating $V$ diminishes with the estimation error of $U^{(t)}$, provided $U^{(t)}$ is sufficiently close to $\bar{U}^{(t)}$. The following lemma quantifies the progress of an exact minimization step using the inexact first order oracle.

Lemma 4. We have

$\|V^{(t+0.5)} - \bar{V}^{(t)}\|_F \le \frac{1}{1-\delta_{2k}} E(V^{(t+0.5)}, V^{(t+0.5)}, U^{(t)}).$

The proof of Lemma 4 is provided in Appendix A.3. Lemma 4 illustrates that the estimation error of $V^{(t+0.5)}$ diminishes with the approximation error of the inexact first order oracle. The following lemma characterizes the effect of the renormalization step using the QR decomposition, i.e., the relationship between $V^{(t+0.5)}$ and $V^{(t+1)}$ in terms of the estimation error.

Lemma 5. Suppose that $V^{(t+0.5)}$ satisfies

$\|V^{(t+0.5)} - \bar{V}^{(t)}\|_F \le \frac{\sigma_k}{4}.$   (17)

Then there exists a factorization $M^* = \bar{U}^{(t+1)}\bar{V}^{(t+1)\top}$ such that $\bar{V}^{(t+1)} \in \mathbb{R}^{n\times k}$ is an orthonormal matrix and

$\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F \le \frac{2}{\sigma_k}\|V^{(t+0.5)} - \bar{V}^{(t)}\|_F.$

The proof of Lemma 5 is provided in Appendix A.4. The next lemma quantifies the accuracy of the initialization $U^{(0)}$.

Lemma 6. Suppose that $\delta_{2k}$ satisfies

$\delta_{2k} \le \frac{(1-\delta_{2k})^2\sigma_k^4}{192\,\xi^2(1+\delta_{2k})^2\sigma_1^4}.$   (18)

Then there exists a factorization $M^* = \bar{U}^{(0)}\bar{V}^{(0)\top}$ such that $\bar{U}^{(0)} \in \mathbb{R}^{m\times k}$ is an orthonormal matrix and

$\|U^{(0)} - \bar{U}^{(0)}\|_F \le \frac{(1-\delta_{2k})\sigma_k}{4\xi(1+\delta_{2k})\sigma_1}.$

The proof of Lemma 6 is provided in Appendix A.5. Lemma 6 implies that the initial solution $U^{(0)}$ attains a sufficiently small estimation error. Combining Lemmas 3, 4, and 5, we obtain the following corollary for a complete iteration of updating $V$.

Corollary 1. Suppose that $\delta_{2k}$ and $U^{(t)}$ satisfy

$\delta_{2k} \le \frac{(1-\delta_{2k})^2\sigma_k^4}{192\,\xi^2(1+\delta_{2k})^2\sigma_1^4}$ and $\|U^{(t)} - \bar{U}^{(t)}\|_F \le \frac{(1-\delta_{2k})\sigma_k}{4\xi(1+\delta_{2k})\sigma_1}.$   (19)

We then have

$\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F \le \frac{(1-\delta_{2k})\sigma_k}{4\xi(1+\delta_{2k})\sigma_1}.$

Moreover, we also have

$\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F \le \frac{1}{\xi}\|U^{(t)} - \bar{U}^{(t)}\|_F$ and $\|V^{(t+0.5)} - \bar{V}^{(t)}\|_F \le \frac{\sigma_k}{2\xi}\|U^{(t)} - \bar{U}^{(t)}\|_F.$

The proof of Corollary 1 is provided in Appendix A.6. Since the alternating exact minimization algorithm updates $U$ and $V$ in a symmetric manner, we can establish similar results for a complete iteration of updating $U$ in the next corollary.

Corollary 2. Suppose that $\delta_{2k}$ and $V^{(t+1)}$ satisfy

$\delta_{2k} \le \frac{(1-\delta_{2k})^2\sigma_k^4}{192\,\xi^2(1+\delta_{2k})^2\sigma_1^4}$ and $\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F \le \frac{(1-\delta_{2k})\sigma_k}{4\xi(1+\delta_{2k})\sigma_1}.$   (20)

Then there exists a factorization $M^* = \bar{U}^{(t+1)}\bar{V}^{(t+1)\top}$ such that $\bar{U}^{(t+1)}$ is an orthonormal matrix and

$\|U^{(t+1)} - \bar{U}^{(t+1)}\|_F \le \frac{(1-\delta_{2k})\sigma_k}{4\xi(1+\delta_{2k})\sigma_1}.$

Moreover, we also have

$\|U^{(t+1)} - \bar{U}^{(t+1)}\|_F \le \frac{1}{\xi}\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F$ and $\|U^{(t+0.5)} - \bar{U}^{(t+1)}\|_F \le \frac{\sigma_k}{2\xi}\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F.$

The proof of Corollary 2 directly follows Appendix A.6 and is therefore omitted. We now proceed with the proof of Theorem 1 for alternating exact minimization. Lemma 6 ensures that (19) of Corollary 1 holds for $U^{(0)}$. Then Corollary 1 ensures that (20) of Corollary 2 holds for $V^{(1)}$. By induction, Corollaries 1 and 2 can be applied recursively for all $T$ iterations. Thus we obtain

$\|V^{(T)} - \bar{V}^{(T)}\|_F \le \frac{1}{\xi}\|U^{(T-1)} - \bar{U}^{(T-1)}\|_F \le \frac{1}{\xi^2}\|V^{(T-1)} - \bar{V}^{(T-1)}\|_F \le \cdots \le \frac{1}{\xi^{2T-1}}\|U^{(0)} - \bar{U}^{(0)}\|_F \le \frac{(1-\delta_{2k})\sigma_k}{4\xi^{2T}(1+\delta_{2k})\sigma_1},$   (21)

where the last inequality comes from Lemma 6.

Therefore, for a pre-specified accuracy $\epsilon$, we need at most

$T = \frac{1}{2\log\xi}\log\Big(\frac{(1-\delta_{2k})\sigma_k}{2\epsilon(1+\delta_{2k})}\Big)$   (22)

iterations such that

$\|V^{(T)} - \bar{V}^{(T)}\|_F \le \frac{(1-\delta_{2k})\sigma_k}{4\xi^{2T}(1+\delta_{2k})\sigma_1} \le \frac{\epsilon}{2\sigma_1}.$   (23)

Moreover, Corollary 2 implies

$\|U^{(T-0.5)} - \bar{U}^{(T)}\|_F \le \frac{\sigma_k}{2\xi}\|V^{(T)} - \bar{V}^{(T)}\|_F \le \frac{(1-\delta_{2k})\sigma_k^2}{8\xi^{2T+1}(1+\delta_{2k})\sigma_1},$

where the last inequality comes from (21). Therefore, we need at most

$T = \frac{1}{2\log\xi}\log\Big(\frac{(1-\delta_{2k})\sigma_k^2}{4\xi\epsilon(1+\delta_{2k})}\Big)$   (24)

iterations such that

$\|U^{(T-0.5)} - \bar{U}^{(T)}\|_F \le \frac{(1-\delta_{2k})\sigma_k^2}{8\xi^{2T+1}(1+\delta_{2k})\sigma_1} \le \frac{\epsilon}{2}.$   (25)

Then, combining (23) and (25), we obtain

$\|M^{(T)} - M^*\|_F = \|U^{(T-0.5)}V^{(T)\top} - \bar{U}^{(T)}\bar{V}^{(T)\top}\|_F = \|U^{(T-0.5)}V^{(T)\top} - \bar{U}^{(T)}V^{(T)\top} + \bar{U}^{(T)}V^{(T)\top} - \bar{U}^{(T)}\bar{V}^{(T)\top}\|_F \le \|V^{(T)}\|_2\|U^{(T-0.5)} - \bar{U}^{(T)}\|_F + \|\bar{U}^{(T)}\|_2\|V^{(T)} - \bar{V}^{(T)}\|_F \le \epsilon,$   (26)

where the last inequality comes from $\|V^{(T)}\|_2 = 1$ (since $V^{(T)}$ is orthonormal) and $\|\bar{U}^{(T)}\|_2 = \|M^*\|_2 = \sigma_1$ (since $\bar{U}^{(T)}\bar{V}^{(T)\top} = M^*$ and $\bar{V}^{(T)}$ is orthonormal). Thus, combining (22) and (24) with (26), we complete the proof.

4.3 Proof of Theorem 1 (Alternating Gradient Descent)

Proof. Throughout the proof for alternating gradient descent, we define a sufficiently large constant $\xi$. Moreover, we assume that at the $t$-th iteration there exists a matrix factorization of $M^*$,

$M^* = \bar{U}^{(t)}\bar{V}^{(t)\top},$

where $\bar{U}^{(t)} \in \mathbb{R}^{m\times k}$ is an orthonormal matrix. We define the approximation error of the inexact first order oracle as

$E(V^{(t+0.5)}, V^{(t)}, U^{(t)}) = \|\nabla_V F(U^{(t)}, V^{(t)}) - \nabla_V F(\bar{U}^{(t)}, V^{(t)})\|_F.$

The first lemma is parallel to Lemma 3 for alternating exact minimization.

Lemma 7. Suppose that $\delta_{2k}$, $U^{(t)}$, and $V^{(t)}$ satisfy

$\delta_{2k} \le \frac{(1-\delta_{2k})\sigma_k^2}{4\xi\sigma_1^2}, \quad \|U^{(t)} - \bar{U}^{(t)}\|_F \le \frac{\sigma_k^2}{4\xi\sigma_1^2}, \quad\text{and}\quad \|V^{(t)} - \bar{V}^{(t)}\|_F \le \frac{\sigma_1}{2}.$   (27)

Then we have

$E(V^{(t+0.5)}, V^{(t)}, U^{(t)}) \le \frac{(1+\delta_{2k})\sigma_k}{2\xi}\|U^{(t)} - \bar{U}^{(t)}\|_F.$

The proof of Lemma 7 is provided in Appendix B.1. Lemma 7 illustrates that the approximation error of the inexact first order oracle diminishes with the estimation error of $U^{(t)}$, when $U^{(t)}$ and $V^{(t)}$ are sufficiently close to $\bar{U}^{(t)}$ and $\bar{V}^{(t)}$.

Lemma 8. Suppose that the step size parameter $\eta$ satisfies

$\eta = \frac{1}{1+\delta_{2k}}.$   (28)

Then we have

$\|V^{(t+0.5)} - \bar{V}^{(t)}\|_F \le \frac{2\delta_{2k}}{1+\delta_{2k}}\|V^{(t)} - \bar{V}^{(t)}\|_F + \frac{2}{1+\delta_{2k}}E(V^{(t+0.5)}, V^{(t)}, U^{(t)}).$

The proof of Lemma 8 is provided in Appendix B.2. Lemma 8 characterizes the progress of a gradient descent step with a pre-specified fixed step size. A more practical option is to select $\eta$ adaptively using a backtracking line search procedure, for which similar results can be guaranteed; see Nesterov [2004] for details. The following lemma characterizes the effect of the renormalization step using the QR decomposition.

Lemma 9. Suppose that $V^{(t+0.5)}$ satisfies

$\|V^{(t+0.5)} - \bar{V}^{(t)}\|_F \le \frac{\sigma_k}{4}.$   (29)

Then there exists a factorization $M^* = \bar{U}^{(t+1)}\bar{V}^{(t+1)\top}$ such that $\bar{V}^{(t+1)} \in \mathbb{R}^{n\times k}$ is an orthonormal matrix and

$\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F \le \frac{2}{\sigma_k}\|V^{(t+0.5)} - \bar{V}^{(t)}\|_F, \qquad \|\bar{U}^{(t)} - \bar{U}^{(t+1)}\|_F \le \frac{3\sigma_1}{\sigma_k}\|V^{(t+0.5)} - \bar{V}^{(t)}\|_F + \sigma_1\|U^{(t)} - \bar{U}^{(t)}\|_F.$

The proof of Lemma 9 is provided in Appendix B.3. The next lemma quantifies the accuracy of the initial solutions.

Lemma 10. Suppose that $\delta_{2k}$ satisfies

$\delta_{2k} \le \frac{\sigma_k^6}{192\,\xi^2\sigma_1^6}.$   (30)

Then we have

$\|U^{(0)} - \bar{U}^{(0)}\|_F \le \frac{\sigma_k^2}{4\xi\sigma_1^2}$ and $\|V^{(0)} - \bar{V}^{(0)}\|_F \le \frac{\sigma_k^2}{2\xi\sigma_1}.$

The proof of Lemma 10 is provided in Appendix B.4. Lemma 10 indicates that the initial solutions $U^{(0)}$ and $V^{(0)}$ attain sufficiently small estimation errors. Combining Lemmas 7, 8, and 9, we obtain the following corollary for a complete iteration of updating $V$.

Corollary 3. Suppose that $\delta_{2k}$, $U^{(t)}$, and $V^{(t)}$ satisfy

$\delta_{2k} \le \frac{\sigma_k^6}{192\,\xi^2\sigma_1^6}, \quad \|U^{(t)} - \bar{U}^{(t)}\|_F \le \frac{\sigma_k^2}{4\xi\sigma_1^2}, \quad\text{and}\quad \|V^{(t)} - \bar{V}^{(t)}\|_F \le \frac{\sigma_k^2}{2\xi\sigma_1}.$   (31)

We then have

$\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F \le \frac{\sigma_k^2}{4\xi\sigma_1^2}$ and $\|\bar{U}^{(t)} - \bar{U}^{(t+1)}\|_F \le \frac{\sigma_k^2}{2\xi\sigma_1}.$

Moreover, we have

$\|V^{(t+0.5)} - \bar{V}^{(t)}\|_F \le 2\delta_{2k}\|V^{(t)} - \bar{V}^{(t)}\|_F + \frac{\sigma_k}{\xi}\|U^{(t)} - \bar{U}^{(t)}\|_F,$   (32)
$\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F \le \frac{4\delta_{2k}}{\sigma_k}\|V^{(t)} - \bar{V}^{(t)}\|_F + \frac{2}{\xi}\|U^{(t)} - \bar{U}^{(t)}\|_F,$   (33)
$\|\bar{U}^{(t)} - \bar{U}^{(t+1)}\|_F \le \frac{6\sigma_1\delta_{2k}}{\sigma_k}\|V^{(t)} - \bar{V}^{(t)}\|_F + \Big(\frac{6}{\xi}+1\Big)\sigma_1\|U^{(t)} - \bar{U}^{(t)}\|_F.$   (34)

The proof of Corollary 3 is provided in Appendix B.5. Since the alternating gradient descent algorithm updates $U$ and $V$ in a symmetric manner, we can establish similar results for a complete iteration of updating $U$ in the next corollary.

Corollary 4. Suppose that $\delta_{2k}$, $V^{(t+1)}$, and $U^{(t)}$ satisfy

$\delta_{2k} \le \frac{\sigma_k^6}{192\,\xi^2\sigma_1^6}, \quad \|V^{(t+1)} - \bar{V}^{(t+1)}\|_F \le \frac{\sigma_k^2}{4\xi\sigma_1^2}, \quad\text{and}\quad \|U^{(t)} - \bar{U}^{(t+1)}\|_F \le \frac{\sigma_k^2}{2\xi\sigma_1}.$   (35)

We then have

$\|U^{(t+1)} - \bar{U}^{(t+1)}\|_F \le \frac{\sigma_k^2}{4\xi\sigma_1^2}$ and $\|\bar{V}^{(t+1)} - \bar{V}^{(t+2)}\|_F \le \frac{\sigma_k^2}{2\xi\sigma_1}.$

Moreover, we have

$\|U^{(t+0.5)} - \bar{U}^{(t+1)}\|_F \le 2\delta_{2k}\|U^{(t)} - \bar{U}^{(t+1)}\|_F + \frac{\sigma_k}{\xi}\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F,$   (36)
$\|U^{(t+1)} - \bar{U}^{(t+1)}\|_F \le \frac{4\delta_{2k}}{\sigma_k}\|U^{(t)} - \bar{U}^{(t+1)}\|_F + \frac{2}{\xi}\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F,$   (37)
$\|\bar{V}^{(t+1)} - \bar{V}^{(t+2)}\|_F \le \frac{6\sigma_1\delta_{2k}}{\sigma_k}\|U^{(t)} - \bar{U}^{(t+1)}\|_F + \Big(\frac{6}{\xi}+1\Big)\sigma_1\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F.$   (38)

The proof of Corollary 4 directly follows Appendix B.5 and is therefore omitted. We now proceed with the proof of Theorem 1 for alternating gradient descent. Recall that Lemma 10 ensures that (31) of Corollary 3 holds for $U^{(0)}$ and $V^{(0)}$. Then Corollary 3 ensures that (35) of Corollary 4 holds for $U^{(0)}$ and $V^{(1)}$. By induction, Corollaries 3 and 4 can be applied recursively for all $T$ iterations. For notational simplicity, we write (32)-(38) as

$\|V^{(t+0.5)} - \bar{V}^{(t)}\|_F \le \alpha_1\|V^{(t)} - \bar{V}^{(t)}\|_F + \gamma_1\sigma_1\|U^{(t)} - \bar{U}^{(t)}\|_F,$   (39)
$\sigma_1\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F \le \alpha_2\|V^{(t)} - \bar{V}^{(t)}\|_F + \gamma_2\sigma_1\|U^{(t)} - \bar{U}^{(t)}\|_F,$   (40)
$\|U^{(t+0.5)} - \bar{U}^{(t+1)}\|_F \le \alpha_3\|U^{(t)} - \bar{U}^{(t+1)}\|_F + \gamma_3\sigma_1\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F,$   (41)
$\sigma_1\|U^{(t+1)} - \bar{U}^{(t+1)}\|_F \le \alpha_4\|U^{(t)} - \bar{U}^{(t+1)}\|_F + \gamma_4\sigma_1\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F,$   (42)
$\|\bar{U}^{(t)} - \bar{U}^{(t+1)}\|_F \le \alpha_5\|V^{(t)} - \bar{V}^{(t)}\|_F + \gamma_5\sigma_1\|U^{(t)} - \bar{U}^{(t)}\|_F,$   (43)
$\|\bar{V}^{(t+1)} - \bar{V}^{(t+2)}\|_F \le \alpha_6\|\bar{U}^{(t)} - \bar{U}^{(t+1)}\|_F + \gamma_6\sigma_1\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F.$   (44)

Note that we have $\gamma_5, \gamma_6 \in (1,2)$, but $\alpha_1,\dots,\alpha_6$ and $\gamma_1,\dots,\gamma_4$ can be made sufficiently small as long as $\xi$ is sufficiently large. We then have

$\|\bar{U}^{(t+1)} - \bar{U}^{(t+2)}\|_F \overset{(i)}{\le} \alpha_5\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F + \gamma_5\sigma_1\|U^{(t+1)} - \bar{U}^{(t+1)}\|_F$
$\overset{(ii)}{\le} \alpha_5\alpha_6\|\bar{U}^{(t)} - \bar{U}^{(t+1)}\|_F + \alpha_5\gamma_6\sigma_1\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F + \gamma_5\sigma_1\|U^{(t+1)} - \bar{U}^{(t+1)}\|_F$
$\overset{(iii)}{\le} (\alpha_5\alpha_6 + \gamma_5\alpha_4)\|\bar{U}^{(t)} - \bar{U}^{(t+1)}\|_F + (\gamma_5\gamma_4\sigma_1 + \alpha_5\gamma_6)\sigma_1\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F$
$\overset{(iv)}{\le} (\alpha_5\alpha_6 + \gamma_5\alpha_4)\|\bar{U}^{(t)} - \bar{U}^{(t+1)}\|_F + (\gamma_5\gamma_4\sigma_1 + \alpha_5\gamma_6)\alpha_2\|V^{(t)} - \bar{V}^{(t)}\|_F + (\gamma_5\gamma_4\sigma_1 + \alpha_5\gamma_6)\gamma_2\sigma_1\|U^{(t)} - \bar{U}^{(t)}\|_F,$   (45)

where (i) comes from (43), (ii) comes from (44), (iii) comes from (42), and (iv) comes from (40). Similarly, we can obtain

$\|\bar{V}^{(t+1)} - \bar{V}^{(t+2)}\|_F \le \alpha_6\|\bar{U}^{(t)} - \bar{U}^{(t+1)}\|_F + \gamma_6\alpha_2\|V^{(t)} - \bar{V}^{(t)}\|_F + \gamma_6\gamma_2\sigma_1\|U^{(t)} - \bar{U}^{(t)}\|_F,$   (46)
$\sigma_1\|U^{(t+1)} - \bar{U}^{(t+1)}\|_F \le \alpha_4\|\bar{U}^{(t)} - \bar{U}^{(t+1)}\|_F + \gamma_4\alpha_2\|V^{(t)} - \bar{V}^{(t)}\|_F + \gamma_4\gamma_2\sigma_1\|U^{(t)} - \bar{U}^{(t)}\|_F,$   (47)
$\|U^{(t+0.5)} - \bar{U}^{(t+1)}\|_F \le \alpha_3\|\bar{U}^{(t)} - \bar{U}^{(t+1)}\|_F + \gamma_3\alpha_2\|V^{(t)} - \bar{V}^{(t)}\|_F + \gamma_3\gamma_2\sigma_1\|U^{(t)} - \bar{U}^{(t)}\|_F.$   (48)

For simplicity, we define

$\phi_{\bar{V}^{(t+1)}} = \|\bar{V}^{(t+1)} - \bar{V}^{(t+2)}\|_F, \quad \phi_{V^{(t+0.5)}} = \|V^{(t+0.5)} - \bar{V}^{(t)}\|_F, \quad \phi_{V^{(t+1)}} = \sigma_1\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F,$
$\phi_{\bar{U}^{(t+1)}} = \|\bar{U}^{(t+1)} - \bar{U}^{(t+2)}\|_F, \quad \phi_{U^{(t+0.5)}} = \|U^{(t+0.5)} - \bar{U}^{(t+1)}\|_F, \quad \phi_{U^{(t+1)}} = \sigma_1\|U^{(t+1)} - \bar{U}^{(t+1)}\|_F.$

Then, combining (39) and (40) with (45)-(48), we obtain

$\max\big\{\phi_{\bar{V}^{(t+1)}}, \phi_{V^{(t+0.5)}}, \phi_{V^{(t+1)}}, \phi_{\bar{U}^{(t+1)}}, \phi_{U^{(t+0.5)}}, \phi_{U^{(t+1)}}\big\} \le \beta\max\big\{\phi_{V^{(t)}}, \phi_{U^{(t)}}, \phi_{\bar{U}^{(t)}}\big\},$   (49)

where $\beta$ is a contraction coefficient defined as

$\beta = \max\{\alpha_5\alpha_6 + \gamma_5\alpha_4,\ \alpha_6,\ \alpha_4,\ \alpha_3\} + \max\{\alpha_1,\ \alpha_2,\ (\gamma_5\gamma_4\sigma_1 + \alpha_5\gamma_6)\alpha_2,\ \gamma_6\alpha_2,\ \gamma_4\alpha_2,\ \gamma_3\alpha_2\} + \max\{\gamma_1,\ \gamma_2,\ (\gamma_5\gamma_4\sigma_1 + \alpha_5\gamma_6)\gamma_2,\ \gamma_6\gamma_2,\ \gamma_4\gamma_2,\ \gamma_3\gamma_2\}.$

We can then choose $\xi$ as a sufficiently large constant such that $\beta < 1$. By recursively applying (49) for $t = 0, \dots, T$, we obtain

$\max\big\{\phi_{\bar{V}^{(T)}}, \phi_{V^{(T-0.5)}}, \phi_{V^{(T)}}, \phi_{\bar{U}^{(T)}}, \phi_{U^{(T-0.5)}}, \phi_{U^{(T)}}\big\} \le \beta\max\big\{\phi_{V^{(T-1)}}, \phi_{U^{(T-1)}}, \phi_{\bar{U}^{(T-1)}}\big\} \le \beta^2\max\big\{\phi_{V^{(T-2)}}, \phi_{U^{(T-2)}}, \phi_{\bar{U}^{(T-2)}}\big\} \le \cdots \le \beta^T\max\big\{\phi_{V^{(0)}}, \phi_{U^{(0)}}, \phi_{\bar{U}^{(0)}}\big\}.$

By Corollary 3, we obtain

$\|\bar{U}^{(0)} - \bar{U}^{(1)}\|_F \le \frac{6\sigma_1\delta_{2k}}{\sigma_k}\|V^{(0)} - \bar{V}^{(0)}\|_F + \Big(\frac{6}{\xi}+1\Big)\sigma_1\|U^{(0)} - \bar{U}^{(0)}\|_F \overset{(i)}{\le} \frac{6\sigma_1\delta_{2k}}{\sigma_k}\cdot\frac{\sigma_k^2}{2\xi\sigma_1} + \Big(\frac{6}{\xi}+1\Big)\sigma_1\cdot\frac{\sigma_k^2}{4\xi\sigma_1^2} \overset{(ii)}{\le} \frac{\sigma_k^2}{2\xi\sigma_1},$   (50)

where (i) comes from Lemma 10 and (ii) comes from the definition of $\xi$, the bound $\delta_{2k} \le \sigma_k^6/(192\xi^2\sigma_1^6)$, and $\sigma_1 \ge \sigma_k$. Combining (50) with Lemma 10, we have

$\max\big\{\phi_{V^{(0)}}, \phi_{U^{(0)}}, \phi_{\bar{U}^{(0)}}\big\} \le \max\Big\{\frac{\sigma_k^2}{2\xi\sigma_1},\ \frac{\sigma_k^2}{4\xi\sigma_1^2}\Big\}.$

Then we need at most

$T = \log\Big(\max\Big\{\frac{\sigma_k^2}{2\xi\sigma_1},\ \frac{\sigma_k^2}{4\xi\sigma_1^2}\Big\}\Big/\epsilon\Big)\Big/\log(\beta^{-1})$

iterations such that

$\|V^{(T)} - \bar{V}^{(T)}\|_F \le \beta^T\max\Big\{\frac{\sigma_k^2}{2\xi\sigma_1},\ \frac{\sigma_k^2}{4\xi\sigma_1^2}\Big\} \le \frac{\epsilon}{2\sigma_1}$ and $\|U^{(T-0.5)} - \bar{U}^{(T)}\|_F \le \beta^T\max\Big\{\frac{\sigma_k^2}{2\xi\sigma_1},\ \frac{\sigma_k^2}{4\xi\sigma_1^2}\Big\} \le \frac{\epsilon}{2}.$

We then follow similar lines to (26) in §4.2 and show that $\|M^{(T)} - M^*\|_F \le \epsilon$, which completes the proof.

4.4 Proof of Theorem 1 (Gradient Descent)

Proof. The convergence analysis of the gradient descent algorithm is similar to that of alternating gradient descent. The only difference is that, for updating $U$, the gradient descent algorithm employs $V = V^{(t)}$ instead of $V = V^{(t+1)}$ to calculate the gradient at $U = U^{(t)}$. Everything else directly follows §4.3 and is therefore omitted.

5 Extensions to Matrix Completion

We now extend our methodology and theory to matrix completion problems. Let $M^* \in \mathbb{R}^{m\times n}$ be the unknown low rank matrix of interest. We observe a subset of the entries of $M^*$, indexed by $\mathcal{W} \subseteq \{1,\dots,m\}\times\{1,\dots,n\}$. We assume that $\mathcal{W}$ is drawn uniformly at random, i.e., $M^*_{ij}$ is observed independently with probability $\rho \in (0,1]$. To exactly recover $M^*$, a common assumption is the incoherence of $M^*$, which will be specified later. A popular approach for recovering $M^*$ is to solve the following convex optimization problem:

$\min_{M \in \mathbb{R}^{m\times n}} \|M\|_*$ subject to $\mathcal{P}_{\mathcal{W}}(M^*) = \mathcal{P}_{\mathcal{W}}(M),$   (51)

where $\mathcal{P}_{\mathcal{W}}(M): \mathbb{R}^{m\times n} \to \mathbb{R}^{m\times n}$ is an operator defined as

$[\mathcal{P}_{\mathcal{W}}(M)]_{ij} = M_{ij}$ if $(i,j) \in \mathcal{W}$, and $[\mathcal{P}_{\mathcal{W}}(M)]_{ij} = 0$ otherwise.

Similar to matrix sensing, existing algorithms for solving (51) are computationally inefficient. Hence, in practice we usually consider the following nonconvex optimization problem:

$\min_{U\in\mathbb{R}^{m\times k},\, V\in\mathbb{R}^{n\times k}} F_{\mathcal{W}}(U,V)$, where $F_{\mathcal{W}}(U,V) = \tfrac{1}{2}\|\mathcal{P}_{\mathcal{W}}(M^*) - \mathcal{P}_{\mathcal{W}}(UV^\top)\|_F^2.$   (52)

Similar to matrix sensing, (52) can also be efficiently solved by the gradient-based algorithms illustrated in Algorithm 2. For the convenience of the later convergence analysis, we partition the observation set $\mathcal{W}$ into $2T+1$ subsets $\mathcal{W}_0,\dots,\mathcal{W}_{2T}$ by Algorithm 4. In practice, however, we do not need the partition scheme, i.e., we simply set $\mathcal{W}_0 = \cdots = \mathcal{W}_{2T} = \mathcal{W}$. Before we present the convergence analysis, we first introduce an assumption known as the incoherence property.

Assumption 2 (Incoherence Property). The target rank-$k$ matrix $M^*$ is incoherent with parameter $\mu$, i.e., given the rank-$k$ singular value decomposition $M^* = \bar{U}\Sigma\bar{V}^\top$, we have

$\max_i \|\bar{U}_{i*}\|_2 \le \mu\sqrt{\frac{k}{m}}$ and $\max_j \|\bar{V}_{j*}\|_2 \le \mu\sqrt{\frac{k}{n}}.$

Roughly speaking, the incoherence assumption guarantees that each entry of $M^*$ contains a similar amount of information, which makes it feasible to complete $M^*$ when its entries are missing uniformly at random. The following theorem establishes the iteration complexity and the estimation error under the Frobenius norm.

Theorem 2. Suppose that there exists a universal constant $C_4$ such that $\rho$ satisfies

$\rho \ge \frac{C_4\,\mu^2 k^3\log n\log(1/\epsilon)}{m},$   (53)

where $\epsilon$ is the pre-specified precision. Then there exist a step size $\eta$ and universal constants $C_5$ and $C_6$ such that, for any $T \ge C_5\log(C_6/\epsilon)$, we have $\|M^{(T)} - M^*\|_F \le \epsilon$ with high probability.
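The masked objective (52) and its partial gradients are straightforward to implement. The sketch below is a hedged illustration: the mask, the sampling probability, the problem sizes, and the explicit gradient formulas are our own choices for exposition and are not taken verbatim from the paper's algorithms.

```python
# Hedged sketch of F_W(U, V) in (52) and its gradients for matrix completion.
import numpy as np

rng = np.random.default_rng(2)
m, n, k, rho = 1000, 50, 5, 0.2
M_star = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
W = rng.random((m, n)) < rho                   # each entry observed with probability rho

def F_W(U, V):
    """0.5 * || P_W(M*) - P_W(U V^T) ||_F^2 as in (52)."""
    R = W * (M_star - U @ V.T)
    return 0.5 * np.sum(R ** 2)

def grads_W(U, V):
    """Return (grad_U F_W, grad_V F_W)."""
    R = W * (U @ V.T - M_star)                 # residual restricted to observed entries
    return R @ V, R.T @ U
```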

Algorithm 2: A family of nonconvex optimization algorithms for matrix completion. The incoherence factorization algorithm $\mathrm{IF}(\cdot)$ is illustrated in Algorithm 3, and the partition algorithm $\mathrm{Partition}(\cdot)$, which is proposed by Hardt and Wootters [2014], is provided in Algorithm 4 of Appendix C for the sake of completeness. The initialization procedures $\mathrm{INT}_U(\cdot)$ and $\mathrm{INT}_V(\cdot)$ are provided in Algorithm 5 and Algorithm 6 of Appendix D for the sake of completeness. Here $F_{\mathcal{W}}(\cdot)$ is defined in (52).

  Input: $\mathcal{P}_{\mathcal{W}}(M^*)$
  Parameter: step size $\eta$, total number of iterations $T$
  $(\{\mathcal{W}_t\}_{t=0}^{2T}, \rho) \leftarrow \mathrm{Partition}(\mathcal{W})$, $\mathcal{P}_{\mathcal{W}_0}(\widetilde{M}) \leftarrow \mathcal{P}_{\mathcal{W}_0}(M^*)$, and $\widetilde{M}_{ij} \leftarrow 0$ for all $(i,j) \notin \mathcal{W}_0$
  $(U^{(0)}, V^{(0)}) \leftarrow \mathrm{INT}_U(\widetilde{M})$, $(V^{(0)}, U^{(0)}) \leftarrow \mathrm{INT}_V(\widetilde{M})$
  For $t = 0, \dots, T-1$:
    Updating $V$:
      Alternating exact minimization: $V^{(t+0.5)} \leftarrow \arg\min_V F_{\mathcal{W}_{2t+1}}(U^{(t)}, V)$; $(V^{(t+1)}, R_V^{(t+0.5)}) \leftarrow \mathrm{IF}(V^{(t+0.5)})$
      Alternating gradient descent: $V^{(t+0.5)} \leftarrow V^{(t)} - \eta\nabla_V F_{\mathcal{W}_{2t+1}}(U^{(t)}, V^{(t)})$; $(V^{(t+1)}, R_V^{(t+0.5)}) \leftarrow \mathrm{IF}(V^{(t+0.5)})$, $U^{(t)} \leftarrow U^{(t)}(R_V^{(t+0.5)})^\top$
      Gradient descent: $V^{(t+0.5)} \leftarrow V^{(t)} - \eta\nabla_V F_{\mathcal{W}_{2t+1}}(U^{(t)}, V^{(t)})$; $(V^{(t+1)}, R_V^{(t+0.5)}) \leftarrow \mathrm{IF}(V^{(t+0.5)})$, $U^{(t+1)} \leftarrow U^{(t)}(R_V^{(t+0.5)})^\top$
    Updating $U$:
      Alternating exact minimization: $U^{(t+0.5)} \leftarrow \arg\min_U F_{\mathcal{W}_{2t+2}}(U, V^{(t+1)})$; $(U^{(t+1)}, R_U^{(t+0.5)}) \leftarrow \mathrm{IF}(U^{(t+0.5)})$
      Alternating gradient descent: $U^{(t+0.5)} \leftarrow U^{(t)} - \eta\nabla_U F_{\mathcal{W}_{2t+2}}(U^{(t)}, V^{(t+1)})$; $(U^{(t+1)}, R_U^{(t+0.5)}) \leftarrow \mathrm{IF}(U^{(t+0.5)})$, $V^{(t+1)} \leftarrow V^{(t+1)}(R_U^{(t+0.5)})^\top$
      Gradient descent: $U^{(t+0.5)} \leftarrow U^{(t)} - \eta\nabla_U F_{\mathcal{W}_{2t+2}}(U^{(t)}, V^{(t)})$; $(U^{(t+1)}, R_U^{(t+0.5)}) \leftarrow \mathrm{IF}(U^{(t+0.5)})$, $V^{(t+1)} \leftarrow V^{(t)}(R_U^{(t+0.5)})^\top$
  End for
  Output: $M^{(T)} \leftarrow U^{(T-0.5)}V^{(T)\top}$ (for gradient descent we use $U^{(T)}V^{(T)\top}$)

The proof of Theorem 2 is provided in §F.1, §F.2, and §F.3. Theorem 2 implies that all three nonconvex optimization algorithms converge to the global optimum at a geometric rate. Furthermore, our results indicate that completing the true low rank matrix $M^*$ up to $\epsilon$-accuracy requires the entry observation probability $\rho$ to satisfy

$\rho = \Omega\big(\mu^2 k^3\log n\log(1/\epsilon)/m\big).$   (54)

This result matches the result established by Hardt [2014], which is the state-of-the-art result for alternating minimization. Moreover, our analysis covers three nonconvex optimization algorithms. In fact, the sample complexity in (54) depends on a polynomial of $\sigma_{\max}(M^*)/\sigma_{\min}(M^*)$, which is a constant since in this paper we assume that $\sigma_{\max}(M^*)$ and $\sigma_{\min}(M^*)$ are constants.
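Algorithm 2 renormalizes the factors through the incoherence factorization step $\mathrm{IF}(\cdot)$, detailed in Algorithm 3 below. The following is a hedged sketch of one natural implementation of that step: QR-factor the input, clip the row norms of the orthonormal factor to the incoherence level (row-wise clipping solves the constrained Frobenius projection exactly), and re-orthonormalize. The clipping threshold $\mu\sqrt{k/r}$ and the returned $R^{\rm out} = W^{{\rm out}\top}W^{\rm in}$ reflect our reading of Algorithm 3, and the final QR only approximately preserves the row-norm bound.

```python
# Hedged sketch of an incoherence factorization step in the spirit of Algorithm 3.
import numpy as np

def incoherence_factorization(W_in, mu):
    r, k = W_in.shape
    Q, _ = np.linalg.qr(W_in)
    tau = mu * np.sqrt(k / r)                            # assumed incoherence level
    row_norms = np.linalg.norm(Q, axis=1, keepdims=True)
    # Row-wise clipping: argmin ||W - Q||_F subject to max_j ||W_{j.}||_2 <= tau.
    W_bar = Q * np.minimum(1.0, tau / np.maximum(row_norms, 1e-12))
    W_out, _ = np.linalg.qr(W_bar)                       # re-orthonormalize
    return W_out, W_out.T @ W_in
```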

If we allow $\sigma_{\max}(M^*)/\sigma_{\min}(M^*)$ to increase, we can replace the QR decomposition in Algorithm 3 with the smooth QR decomposition proposed by Hardt and Wootters [2014] and achieve a $\log\big(\sigma_{\max}(M^*)/\sigma_{\min}(M^*)\big)$ dependency on the condition number with a more involved proof; see Hardt and Wootters [2014] for more details. In this paper, however, our primary focus is on the dependency on $k$, $n$, and $m$, rather than optimizing the dependency on the condition number.

Algorithm 3: The incoherence factorization algorithm for matrix completion. It guarantees that the solutions satisfy the incoherence condition throughout all iterations.

  Input: $W^{\rm in} \in \mathbb{R}^{r\times k}$, where $r$ is the number of rows of $W^{\rm in}$
  Parameter: incoherence parameter $\mu$
  $(U_W^{\rm in}, R_W^{\rm in}) \leftarrow \mathrm{QR}(W^{\rm in})$
  $\bar{W} \leftarrow \arg\min_W \|W - U_W^{\rm in}\|_F$ subject to $\max_j \|W_{j*}\|_2 \le \mu\sqrt{k/r}$
  $(W^{\rm out}, R_W^{\rm tmp}) \leftarrow \mathrm{QR}(\bar{W})$
  $R_W^{\rm out} \leftarrow W^{{\rm out}\top}W^{\rm in}$
  Output: $W^{\rm out}$, $R_W^{\rm out}$

6 Numerical Experiments

We present numerical experiments to support our theoretical analysis. We first consider a matrix sensing problem with $m = 30$, $n = 40$, and $k = 5$. We vary $d$ from 300 to 900. Each entry of the $A_i$'s is independently sampled from $N(0,1)$. We then generate $M^* = \widetilde{U}\widetilde{V}^\top$, where $\widetilde{U} \in \mathbb{R}^{m\times k}$ and $\widetilde{V} \in \mathbb{R}^{n\times k}$ are two matrices with all their entries independently sampled from $N(0, 1/k)$. We then generate $d$ measurements by $b_i = \langle A_i, M^*\rangle$ for $i = 1,\dots,d$. Figure 1 illustrates the empirical performance of the alternating exact minimization and alternating gradient descent algorithms for a single realization. The step size for the alternating gradient descent algorithm is determined by the backtracking line search procedure. We see that both algorithms attain a linear rate of convergence for $d = 600$ and $d = 900$. Both algorithms fail for $d = 300$, because $d = 300$ is below the minimum requirement of sample complexity for exact matrix recovery.

We then consider a matrix completion problem with $m = 1000$, $n = 50$, and $k = 5$. We vary $\rho$ from 0.025 to 0.1. We again generate $M^* = \widetilde{U}\widetilde{V}^\top$, where $\widetilde{U} \in \mathbb{R}^{m\times k}$ and $\widetilde{V} \in \mathbb{R}^{n\times k}$ are two matrices with all their entries independently sampled from $N(0, 1/k)$. The observation set is generated uniformly at random with probability $\rho$. Figure 2 illustrates the empirical performance of the alternating exact minimization and alternating gradient descent algorithms for a single realization. The step size for the alternating gradient descent algorithm is determined by the backtracking line search procedure. We see that both algorithms attain a linear rate of convergence for $\rho = 0.05$ and $\rho = 0.1$. Both algorithms fail for $\rho = 0.025$, because the entry observation probability is below the minimum requirement of sample complexity for exact matrix recovery.

Figure 1: Two illustrative examples for matrix sensing; panel (a) shows the alternating exact minimization algorithm and panel (b) the alternating gradient descent algorithm. The vertical axis corresponds to the estimation error $\|M^{(t)} - M^*\|_F$ and the horizontal axis to the number of iterations, for $d \in \{300, 600, 900\}$. Both algorithms attain a linear rate of convergence for $d = 600$ and $d = 900$, but both fail for $d = 300$, because the sample size is not large enough to guarantee proper initial solutions.

7 Conclusion

In this paper, we propose a generic analysis for characterizing the convergence properties of nonconvex optimization algorithms. By exploiting the inexact first order oracle, we prove that a broad class of nonconvex optimization algorithms converge geometrically to the global optimum and exactly recover the true low rank matrices under suitable conditions.

A Lemmas for Theorem 1 (Alternating Exact Minimization)

A.1 Proof of Lemma 1

Proof. For notational convenience, we omit the index $t$ in $U^{(t)}$ and $\bar{U}^{(t)}$ and denote them by $U$ and $\bar{U}$, respectively.

Figure 2: Two illustrative examples for matrix completion; panel (a) shows the alternating exact minimization algorithm and panel (b) the alternating gradient descent algorithm. The vertical axis corresponds to the estimation error $\|M^{(t)} - M^*\|_F$ and the horizontal axis to the number of iterations, for $\rho \in \{0.025, 0.05, 0.1\}$. Both algorithms attain a linear rate of convergence for $\rho = 0.05$ and $\rho = 0.1$, but both fail for $\rho = 0.025$, because the entry observation probability is not large enough to guarantee proper initial solutions.

We then define two $nk\times nk$ block matrices

$S^{(t)} = \begin{bmatrix} S^{(t)}_{11} & \cdots & S^{(t)}_{1k} \\ \vdots & \ddots & \vdots \\ S^{(t)}_{k1} & \cdots & S^{(t)}_{kk} \end{bmatrix}$ with $S^{(t)}_{pq} = \sum_{i=1}^d A_i^\top U^{(t)}_{*p} U^{(t)\top}_{*q} A_i,$

$G^{(t)} = \begin{bmatrix} G^{(t)}_{11} & \cdots & G^{(t)}_{1k} \\ \vdots & \ddots & \vdots \\ G^{(t)}_{k1} & \cdots & G^{(t)}_{kk} \end{bmatrix}$ with $G^{(t)}_{pq} = \sum_{i=1}^d A_i^\top \bar{U}_{*p}\bar{U}_{*q}^\top A_i,$

for $1 \le p, q \le k$. Note that $S^{(t)}$ and $G^{(t)}$ are essentially the partial Hessian matrices $\nabla^2_V F(U^{(t)}, V)$ and $\nabla^2_V F(\bar{U}, V)$ for a vectorized $V$, i.e., $\mathrm{vec}(V) \in \mathbb{R}^{nk}$. Before we proceed with the main proof, we first introduce the following lemma.

Lemma 11. Suppose that $\mathcal{A}(\cdot)$ satisfies $2k$-RIP with parameter $\delta_{2k}$. We then have

$1 + \delta_{2k} \ \ge\ \sigma_{\max}(S^{(t)}) \ \ge\ \sigma_{\min}(S^{(t)}) \ \ge\ 1 - \delta_{2k}.$

The proof of Lemma 11 is provided in Appendix A.7. Note that Lemma 11 is also applicable to $G^{(t)}$, since $G^{(t)}$ shares the same structure with $S^{(t)}$. We then proceed with the proof of Lemma 1. Given a fixed $U$, $F(U, V)$ is a quadratic function of $V$.

Therefore, we have

$F(U, V) = F(U, V') + \langle \nabla_V F(U, V'), V - V'\rangle + \tfrac{1}{2}\big\langle \mathrm{vec}(V) - \mathrm{vec}(V'),\ \nabla^2_V F(U, V')\big(\mathrm{vec}(V) - \mathrm{vec}(V')\big)\big\rangle,$

which further implies

$F(U, V) - F(U, V') - \langle \nabla_V F(U, V'), V - V'\rangle \le \frac{\sigma_{\max}\big(\nabla^2_V F(U, V')\big)}{2}\|V - V'\|_F^2,$
$F(U, V) - F(U, V') - \langle \nabla_V F(U, V'), V - V'\rangle \ge \frac{\sigma_{\min}\big(\nabla^2_V F(U, V')\big)}{2}\|V - V'\|_F^2.$

We can verify that $\nabla^2_V F(U, V')$ also shares the same structure with $S^{(t)}$. Thus, applying Lemma 11 to the above two inequalities, we complete the proof.

A.2 Proof of Lemma 3

Proof. For notational convenience, we omit the index $t$ in $\bar{U}^{(t)}$ and $\bar{V}^{(t)}$ and denote them by $\bar{U}$ and $\bar{V}$, respectively. We define two $nk\times nk$ block matrices

$J^{(t)} = \begin{bmatrix} J^{(t)}_{11} & \cdots & J^{(t)}_{1k} \\ \vdots & \ddots & \vdots \\ J^{(t)}_{k1} & \cdots & J^{(t)}_{kk} \end{bmatrix}$ with $J^{(t)}_{pq} = \sum_{i=1}^d A_i^\top U^{(t)}_{*p}\bar{U}_{*q}^\top A_i,$

$K^{(t)} = \begin{bmatrix} K^{(t)}_{11} & \cdots & K^{(t)}_{1k} \\ \vdots & \ddots & \vdots \\ K^{(t)}_{k1} & \cdots & K^{(t)}_{kk} \end{bmatrix}$ with $K^{(t)}_{pq} = U^{(t)\top}_{*p}\bar{U}_{*q}\, I_n,$

for $1 \le p, q \le k$. Before we proceed with the main proof, we first introduce the following lemmas.

Lemma 12. Suppose that $\mathcal{A}(\cdot)$ satisfies $2k$-RIP with parameter $\delta_{2k}$. We then have

$\|S^{(t)}K^{(t)} - J^{(t)}\|_2 \le 3\delta_{2k}\|U^{(t)} - \bar{U}\|_F.$

The proof of Lemma 12 is provided in Appendix A.8. Note that Lemma 12 is also applicable to $G^{(t)}K^{(t)} - J^{(t)}$, since $G^{(t)}$ and $S^{(t)}$ share the same structure.

Lemma 13. Given $F \in \mathbb{R}^{k\times k}$, we define the $nk\times nk$ matrix

$\mathbb{F} = \begin{bmatrix} F_{11}I_n & \cdots & F_{1k}I_n \\ \vdots & \ddots & \vdots \\ F_{k1}I_n & \cdots & F_{kk}I_n \end{bmatrix}.$

For any $V \in \mathbb{R}^{n\times k}$, let $v = \mathrm{vec}(V) \in \mathbb{R}^{nk}$; then we have $\|\mathbb{F}v\|_2 = \|VF^\top\|_F$.

Proof. By linear algebra, the $i$-th block of $\mathbb{F}v$ is $\sum_{l=1}^k F_{il}I_n V_{*l} = \sum_{l=1}^k F_{il}V_{*l} = (VF^\top)_{*i}$, which completes the proof.

We then proceed with the proof of Lemma 3. Since $b_i = \mathrm{tr}(\bar{V}\bar{U}^\top A_i)$, we can rewrite $F(U, V)$ as

$F(U,V) = \frac{1}{2}\sum_{i=1}^d\big(\mathrm{tr}(VU^\top A_i) - b_i\big)^2 = \frac{1}{2}\sum_{i=1}^d\Big(\sum_{j=1}^k V_{*j}^\top A_i^\top U_{*j} - \sum_{j=1}^k \bar{V}_{*j}^\top A_i^\top\bar{U}_{*j}\Big)^2.$

For notational simplicity, we define $\bar{v} = \mathrm{vec}(\bar{V})$. Since $V^{(t+0.5)}$ minimizes $F(U^{(t)}, V)$, we have

$\mathrm{vec}\big(\nabla_V F(U^{(t)}, V^{(t+0.5)})\big) = S^{(t)}v^{(t+0.5)} - J^{(t)}\bar{v} = 0.$

Solving the above system of equations, we obtain

$v^{(t+0.5)} = (S^{(t)})^{-1}J^{(t)}\bar{v}.$   (55)

Meanwhile, we have

$\mathrm{vec}\big(\nabla_V F(\bar{U}, V^{(t+0.5)})\big) = G^{(t)}v^{(t+0.5)} - G^{(t)}\bar{v} = G^{(t)}(S^{(t)})^{-1}J^{(t)}\bar{v} - G^{(t)}\bar{v} = G^{(t)}\big((S^{(t)})^{-1}J^{(t)} - I_{nk}\big)\bar{v},$   (56)

where the second equality comes from (55). By the triangle inequality, (56) further implies

$\big\|\big((S^{(t)})^{-1}J^{(t)} - I_{nk}\big)\bar{v}\big\|_2 \le \big\|(K^{(t)} - I_{nk})\bar{v}\big\|_2 + \big\|(S^{(t)})^{-1}(J^{(t)} - S^{(t)}K^{(t)})\bar{v}\big\|_2 \le \big\|\bar{V}(U^{(t)\top}\bar{U} - I_k)^\top\big\|_F + \big\|(S^{(t)})^{-1}(J^{(t)} - S^{(t)}K^{(t)})\bar{v}\big\|_2 \le \|U^{(t)\top}\bar{U} - I_k\|_F\|\bar{V}\|_2 + \big\|(S^{(t)})^{-1}\big\|_2\big\|J^{(t)} - S^{(t)}K^{(t)}\big\|_2\|\bar{v}\|_2,$   (57)

where the second inequality comes from Lemma 13. Plugging (57) into (56), we have

$\big\|\mathrm{vec}\big(\nabla_V F(\bar{U}, V^{(t+0.5)})\big)\big\|_2 \le \|G^{(t)}\|_2\big\|\big((S^{(t)})^{-1}J^{(t)} - I_{nk}\big)\bar{v}\big\|_2 \overset{(i)}{\le} (1+\delta_{2k})\Big(\sigma_1\|U^{(t)\top}\bar{U} - I_k\|_F + \frac{\|S^{(t)}K^{(t)} - J^{(t)}\|_2\,\sigma_1}{1-\delta_{2k}}\Big) \overset{(ii)}{\le} (1+\delta_{2k})\sigma_1\Big(\|(U^{(t)} - \bar{U})^\top\bar{U}\|_F + \frac{3\delta_{2k}}{1-\delta_{2k}}\|U^{(t)} - \bar{U}\|_F\Big) \overset{(iii)}{\le} (1+\delta_{2k})\sigma_1\Big(1 + \frac{3\delta_{2k}}{1-\delta_{2k}}\Big)\|U^{(t)} - \bar{U}\|_F \overset{(iv)}{\le} \frac{(1-\delta_{2k})\sigma_k}{2\xi}\|U^{(t)} - \bar{U}\|_F,$

where (i) comes from Lemma 11 together with $\|\bar{V}\|_2 = \|M^*\|_2 = \sigma_1$ and $\|\bar{v}\|_2 = \|\bar{V}\|_F \le \sqrt{k}\,\sigma_1$, (ii) comes from Lemmas 11 and 12, (iii) comes from the Cauchy-Schwarz inequality, and (iv) comes from (16). Since $\nabla_V F(U^{(t)}, V^{(t+0.5)}) = 0$, we further obtain

$E(V^{(t+0.5)}, V^{(t+0.5)}, U^{(t)}) \le \frac{(1-\delta_{2k})\sigma_k}{2\xi}\|U^{(t)} - \bar{U}\|_F,$

which completes the proof.

A.3 Proof of Lemma 4

Proof. For notational convenience, we omit the index $t$ in $\bar{U}^{(t)}$ and $\bar{V}^{(t)}$ and denote them by $\bar{U}$ and $\bar{V}$, respectively. By the strong convexity of $F(\bar{U}, \cdot)$, we have

$F(\bar{U}, \bar{V}) - \frac{1-\delta_{2k}}{2}\|V^{(t+0.5)} - \bar{V}\|_F^2 \ \ge\ F(\bar{U}, V^{(t+0.5)}) + \langle \nabla_V F(\bar{U}, V^{(t+0.5)}),\ \bar{V} - V^{(t+0.5)}\rangle.$   (58)

By the strong convexity of $F(\bar{U}, \cdot)$ again, we have

$F(\bar{U}, V^{(t+0.5)}) \ \ge\ F(\bar{U}, \bar{V}) + \langle \nabla_V F(\bar{U}, \bar{V}),\ V^{(t+0.5)} - \bar{V}\rangle + \frac{1-\delta_{2k}}{2}\|V^{(t+0.5)} - \bar{V}\|_F^2 \ \ge\ F(\bar{U}, \bar{V}) + \frac{1-\delta_{2k}}{2}\|V^{(t+0.5)} - \bar{V}\|_F^2,$   (59)

where the last inequality comes from the optimality condition of $\bar{V} = \arg\min_V F(\bar{U}, V)$, i.e., $\langle \nabla_V F(\bar{U}, \bar{V}),\ V^{(t+0.5)} - \bar{V}\rangle \ge 0$. Meanwhile, since $V^{(t+0.5)}$ minimizes $F(U^{(t)}, \cdot)$, we have the optimality condition $\langle \nabla_V F(U^{(t)}, V^{(t+0.5)}),\ \bar{V} - V^{(t+0.5)}\rangle \ge 0$, which further implies

$\langle \nabla_V F(\bar{U}, V^{(t+0.5)}),\ \bar{V} - V^{(t+0.5)}\rangle \ \ge\ \langle \nabla_V F(\bar{U}, V^{(t+0.5)}) - \nabla_V F(U^{(t)}, V^{(t+0.5)}),\ \bar{V} - V^{(t+0.5)}\rangle.$   (60)

Combining (58) and (59) with (60), we obtain

$\|V^{(t+0.5)} - \bar{V}\|_F \le \frac{1}{1-\delta_{2k}}E(V^{(t+0.5)}, V^{(t+0.5)}, U^{(t)}),$

which completes the proof.

A.4 Proof of Lemma 5

Proof. Before we proceed with the proof, we first introduce the following lemma.

Lemma 14. Suppose that $A \in \mathbb{R}^{n\times k}$ is a rank-$k$ matrix. Let $E \in \mathbb{R}^{n\times k}$ satisfy $\|E\|_2\|A^\dagger\|_2 < 1$. Then, given a QR decomposition $(A + E) = QR$, there exists a factorization $A = \bar{Q}\bar{O}$ such that $\bar{Q} \in \mathbb{R}^{n\times k}$ is an orthonormal matrix and

$\|Q - \bar{Q}\|_F \le \frac{\|A^\dagger\|_2\|E\|_F}{1 - \|A^\dagger\|_2\|E\|_2}.$

The proof of Lemma 14 is provided in Stewart et al. [1990] and is therefore omitted. We then proceed with the proof of Lemma 5, applying Lemma 14 with $A = \bar{V}^{(t)}$ and $E = V^{(t+0.5)} - \bar{V}^{(t)}$. We can verify that

$\|V^{(t+0.5)} - \bar{V}^{(t)}\|_2\,\|(\bar{V}^{(t)})^\dagger\|_2 \le \frac{\|V^{(t+0.5)} - \bar{V}^{(t)}\|_F}{\sigma_k} \le \frac{1}{4}.$

Then there exists a factorization $\bar{V}^{(t)} = \bar{V}^{(t+1)}\bar{O}$ such that $\bar{V}^{(t+1)}$ is an orthonormal matrix and

$\|V^{(t+1)} - \bar{V}^{(t+1)}\|_F \le \frac{\|(\bar{V}^{(t)})^\dagger\|_2\,\|V^{(t+0.5)} - \bar{V}^{(t)}\|_F}{1 - \|V^{(t+0.5)} - \bar{V}^{(t)}\|_2\|(\bar{V}^{(t)})^\dagger\|_2} \le \frac{2}{\sigma_k}\|V^{(t+0.5)} - \bar{V}^{(t)}\|_F.$

A.5 Proof of Lemma 6

Proof. Before we proceed with the main proof, we first introduce the following lemma.

Lemma 15. Let $b = \mathcal{A}(M^*)$, where $M^*$ is a rank-$k$ matrix and $\mathcal{A}$ is a linear measurement operator that satisfies $2k$-RIP with constant $\delta_{2k} < 1/3$. Let $X^{(t+1)}$ be the $(t+1)$-th iterate of SVP; then we have

$\|\mathcal{A}(X^{(t+1)}) - b\|_2^2 \le \|\mathcal{A}(M^*) - b\|_2^2 + 2\delta_{2k}\|\mathcal{A}(X^{(t)}) - b\|_2^2.$

The proof of Lemma 15 is provided in Jain et al. [2010] and is therefore omitted. We now explain the implication of Lemma 15. Jain et al. [2010] show that $X^{(t+1)}$ is obtained by taking a projected gradient iteration from $X^{(t)}$ with step size $\frac{1}{1+\delta_{2k}}$. Taking $X^{(t)} = 0$, we have

$X^{(t+1)} = \frac{U^{(0)}\Sigma^{(0)}V^{(0)\top}}{1+\delta_{2k}}.$

Then Lemma 15 implies

$\Big\|\mathcal{A}\Big(\frac{U^{(0)}\Sigma^{(0)}V^{(0)\top}}{1+\delta_{2k}} - M^*\Big)\Big\|_2^2 \le \frac{4\delta_{2k}}{1+\delta_{2k}}\|\mathcal{A}(M^*)\|_2^2.$   (61)

Since $\mathcal{A}(\cdot)$ satisfies $2k$-RIP, (61) further implies

$\Big\|\frac{U^{(0)}\Sigma^{(0)}V^{(0)\top}}{1+\delta_{2k}} - M^*\Big\|_F^2 \le 4\delta_{2k}(1 + 3\delta_{2k})\|M^*\|_F^2.$   (62)

We then project each column of $M^*$ onto the subspace spanned by $\{U^{(0)}_{*i}\}_{i=1}^k$ and obtain

$\|U^{(0)}U^{(0)\top}M^* - M^*\|_F^2 \le 6\delta_{2k}\|M^*\|_F^2.$


More information

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 11 Luca Trevisan February 29, 2016

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 11 Luca Trevisan February 29, 2016 U.C. Berkeley CS294: Spectral Methods and Expanders Handout Luca Trevisan February 29, 206 Lecture : ARV In which we introduce semi-definite programming and a semi-definite programming relaxation of sparsest

More information

Dropping Convexity for Faster Semi-definite Optimization

Dropping Convexity for Faster Semi-definite Optimization JMLR: Workshop and Conference Proceedings vol 49: 53, 06 Dropping Convexity for Faster Semi-definite Optimization Srinadh Bhojanapalli Toyota Technological Institute at Chicago Anastasios Kyrillidis Sujay

More information

15 Singular Value Decomposition

15 Singular Value Decomposition 15 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information

Lecture 4: Purifications and fidelity

Lecture 4: Purifications and fidelity CS 766/QIC 820 Theory of Quantum Information (Fall 2011) Lecture 4: Purifications and fidelity Throughout this lecture we will be discussing pairs of registers of the form (X, Y), and the relationships

More information

Recent Developments in Compressed Sensing

Recent Developments in Compressed Sensing Recent Developments in Compressed Sensing M. Vidyasagar Distinguished Professor, IIT Hyderabad m.vidyasagar@iith.ac.in, www.iith.ac.in/ m vidyasagar/ ISL Seminar, Stanford University, 19 April 2018 Outline

More information

Characterization of Gradient Dominance and Regularity Conditions for Neural Networks

Characterization of Gradient Dominance and Regularity Conditions for Neural Networks Characterization of Gradient Dominance and Regularity Conditions for Neural Networks Yi Zhou Ohio State University Yingbin Liang Ohio State University Abstract zhou.1172@osu.edu liang.889@osu.edu The past

More information

Lecture notes: Applied linear algebra Part 1. Version 2

Lecture notes: Applied linear algebra Part 1. Version 2 Lecture notes: Applied linear algebra Part 1. Version 2 Michael Karow Berlin University of Technology karow@math.tu-berlin.de October 2, 2008 1 Notation, basic notions and facts 1.1 Subspaces, range and

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Instructor: Moritz Hardt Email: hardt+ee227c@berkeley.edu Graduate Instructor: Max Simchowitz Email: msimchow+ee227c@berkeley.edu

More information

New Coherence and RIP Analysis for Weak. Orthogonal Matching Pursuit

New Coherence and RIP Analysis for Weak. Orthogonal Matching Pursuit New Coherence and RIP Analysis for Wea 1 Orthogonal Matching Pursuit Mingrui Yang, Member, IEEE, and Fran de Hoog arxiv:1405.3354v1 [cs.it] 14 May 2014 Abstract In this paper we define a new coherence

More information

Sketchy Decisions: Convex Low-Rank Matrix Optimization with Optimal Storage

Sketchy Decisions: Convex Low-Rank Matrix Optimization with Optimal Storage Sketchy Decisions: Convex Low-Rank Matrix Optimization with Optimal Storage Madeleine Udell Operations Research and Information Engineering Cornell University Based on joint work with Alp Yurtsever (EPFL),

More information

Supplement to A Generalized Least Squares Matrix Decomposition. 1 GPMF & Smoothness: Ω-norm Penalty & Functional Data

Supplement to A Generalized Least Squares Matrix Decomposition. 1 GPMF & Smoothness: Ω-norm Penalty & Functional Data Supplement to A Generalized Least Squares Matrix Decomposition Genevera I. Allen 1, Logan Grosenic 2, & Jonathan Taylor 3 1 Department of Statistics and Electrical and Computer Engineering, Rice University

More information

approximation algorithms I

approximation algorithms I SUM-OF-SQUARES method and approximation algorithms I David Steurer Cornell Cargese Workshop, 201 meta-task encoded as low-degree polynomial in R x example: f(x) = i,j n w ij x i x j 2 given: functions

More information

Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012

Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012 Instructions Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012 The exam consists of four problems, each having multiple parts. You should attempt to solve all four problems. 1.

More information

Linear Algebra Massoud Malek

Linear Algebra Massoud Malek CSUEB Linear Algebra Massoud Malek Inner Product and Normed Space In all that follows, the n n identity matrix is denoted by I n, the n n zero matrix by Z n, and the zero vector by θ n An inner product

More information

Lecture 2: Linear Algebra Review

Lecture 2: Linear Algebra Review EE 227A: Convex Optimization and Applications January 19 Lecture 2: Linear Algebra Review Lecturer: Mert Pilanci Reading assignment: Appendix C of BV. Sections 2-6 of the web textbook 1 2.1 Vectors 2.1.1

More information

Collaborative Filtering Matrix Completion Alternating Least Squares

Collaborative Filtering Matrix Completion Alternating Least Squares Case Study 4: Collaborative Filtering Collaborative Filtering Matrix Completion Alternating Least Squares Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade May 19, 2016

More information

Sparsity Regularization

Sparsity Regularization Sparsity Regularization Bangti Jin Course Inverse Problems & Imaging 1 / 41 Outline 1 Motivation: sparsity? 2 Mathematical preliminaries 3 l 1 solvers 2 / 41 problem setup finite-dimensional formulation

More information

Structured matrix factorizations. Example: Eigenfaces

Structured matrix factorizations. Example: Eigenfaces Structured matrix factorizations Example: Eigenfaces An extremely large variety of interesting and important problems in machine learning can be formulated as: Given a matrix, find a matrix and a matrix

More information

Limitations in Approximating RIP

Limitations in Approximating RIP Alok Puranik Mentor: Adrian Vladu Fifth Annual PRIMES Conference, 2015 Outline 1 Background The Problem Motivation Construction Certification 2 Planted model Planting eigenvalues Analysis Distinguishing

More information

Low-Rank Matrix Recovery

Low-Rank Matrix Recovery ELE 538B: Mathematics of High-Dimensional Data Low-Rank Matrix Recovery Yuxin Chen Princeton University, Fall 2018 Outline Motivation Problem setup Nuclear norm minimization RIP and low-rank matrix recovery

More information

Tighten after Relax: Minimax-Optimal Sparse PCA in Polynomial Time

Tighten after Relax: Minimax-Optimal Sparse PCA in Polynomial Time Tighten after Relax: Minimax-Optimal Sparse PCA in Polynomial Time Zhaoran Wang Huanran Lu Han Liu Department of Operations Research and Financial Engineering Princeton University Princeton, NJ 08540 {zhaoran,huanranl,hanliu}@princeton.edu

More information

arxiv: v1 [math.st] 10 Sep 2015

arxiv: v1 [math.st] 10 Sep 2015 Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees Department of Statistics Yudong Chen Martin J. Wainwright, Department of Electrical Engineering and

More information

5.6. PSEUDOINVERSES 101. A H w.

5.6. PSEUDOINVERSES 101. A H w. 5.6. PSEUDOINVERSES 0 Corollary 5.6.4. If A is a matrix such that A H A is invertible, then the least-squares solution to Av = w is v = A H A ) A H w. The matrix A H A ) A H is the left inverse of A and

More information

Linear Algebra and Eigenproblems

Linear Algebra and Eigenproblems Appendix A A Linear Algebra and Eigenproblems A working knowledge of linear algebra is key to understanding many of the issues raised in this work. In particular, many of the discussions of the details

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57

More information

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28 Sparsity Models Tong Zhang Rutgers University T. Zhang (Rutgers) Sparsity Models 1 / 28 Topics Standard sparse regression model algorithms: convex relaxation and greedy algorithm sparse recovery analysis:

More information

Lecture Notes 10: Matrix Factorization

Lecture Notes 10: Matrix Factorization Optimization-based data analysis Fall 207 Lecture Notes 0: Matrix Factorization Low-rank models. Rank- model Consider the problem of modeling a quantity y[i, j] that depends on two indices i and j. To

More information

We describe the generalization of Hazan s algorithm for symmetric programming

We describe the generalization of Hazan s algorithm for symmetric programming ON HAZAN S ALGORITHM FOR SYMMETRIC PROGRAMMING PROBLEMS L. FAYBUSOVICH Abstract. problems We describe the generalization of Hazan s algorithm for symmetric programming Key words. Symmetric programming,

More information

MATH 205C: STATIONARY PHASE LEMMA

MATH 205C: STATIONARY PHASE LEMMA MATH 205C: STATIONARY PHASE LEMMA For ω, consider an integral of the form I(ω) = e iωf(x) u(x) dx, where u Cc (R n ) complex valued, with support in a compact set K, and f C (R n ) real valued. Thus, I(ω)

More information

The Projected Power Method: An Efficient Algorithm for Joint Alignment from Pairwise Differences

The Projected Power Method: An Efficient Algorithm for Joint Alignment from Pairwise Differences The Projected Power Method: An Efficient Algorithm for Joint Alignment from Pairwise Differences Yuxin Chen Emmanuel Candès Department of Statistics, Stanford University, Sep. 2016 Nonconvex optimization

More information

Lecture 8: February 9

Lecture 8: February 9 0-725/36-725: Convex Optimiation Spring 205 Lecturer: Ryan Tibshirani Lecture 8: February 9 Scribes: Kartikeya Bhardwaj, Sangwon Hyun, Irina Caan 8 Proximal Gradient Descent In the previous lecture, we

More information

Exercise Solutions to Functional Analysis

Exercise Solutions to Functional Analysis Exercise Solutions to Functional Analysis Note: References refer to M. Schechter, Principles of Functional Analysis Exersize that. Let φ,..., φ n be an orthonormal set in a Hilbert space H. Show n f n

More information

A Piggybacking Design Framework for Read-and Download-efficient Distributed Storage Codes

A Piggybacking Design Framework for Read-and Download-efficient Distributed Storage Codes A Piggybacing Design Framewor for Read-and Download-efficient Distributed Storage Codes K V Rashmi, Nihar B Shah, Kannan Ramchandran, Fellow, IEEE Department of Electrical Engineering and Computer Sciences

More information

The deterministic Lasso

The deterministic Lasso The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality

More information

SECTION: CONTINUOUS OPTIMISATION LECTURE 4: QUASI-NEWTON METHODS

SECTION: CONTINUOUS OPTIMISATION LECTURE 4: QUASI-NEWTON METHODS SECTION: CONTINUOUS OPTIMISATION LECTURE 4: QUASI-NEWTON METHODS HONOUR SCHOOL OF MATHEMATICS, OXFORD UNIVERSITY HILARY TERM 2005, DR RAPHAEL HAUSER 1. The Quasi-Newton Idea. In this lecture we will discuss

More information

j=1 [We will show that the triangle inequality holds for each p-norm in Chapter 3 Section 6.] The 1-norm is A F = tr(a H A).

j=1 [We will show that the triangle inequality holds for each p-norm in Chapter 3 Section 6.] The 1-norm is A F = tr(a H A). Math 344 Lecture #19 3.5 Normed Linear Spaces Definition 3.5.1. A seminorm on a vector space V over F is a map : V R that for all x, y V and for all α F satisfies (i) x 0 (positivity), (ii) αx = α x (scale

More information

October 25, 2013 INNER PRODUCT SPACES

October 25, 2013 INNER PRODUCT SPACES October 25, 2013 INNER PRODUCT SPACES RODICA D. COSTIN Contents 1. Inner product 2 1.1. Inner product 2 1.2. Inner product spaces 4 2. Orthogonal bases 5 2.1. Existence of an orthogonal basis 7 2.2. Orthogonal

More information

CS 229r: Algorithms for Big Data Fall Lecture 19 Nov 5

CS 229r: Algorithms for Big Data Fall Lecture 19 Nov 5 CS 229r: Algorithms for Big Data Fall 215 Prof. Jelani Nelson Lecture 19 Nov 5 Scribe: Abdul Wasay 1 Overview In the last lecture, we started discussing the problem of compressed sensing where we are given

More information

Elementary linear algebra

Elementary linear algebra Chapter 1 Elementary linear algebra 1.1 Vector spaces Vector spaces owe their importance to the fact that so many models arising in the solutions of specific problems turn out to be vector spaces. The

More information

arxiv: v5 [math.na] 16 Nov 2017

arxiv: v5 [math.na] 16 Nov 2017 RANDOM PERTURBATION OF LOW RANK MATRICES: IMPROVING CLASSICAL BOUNDS arxiv:3.657v5 [math.na] 6 Nov 07 SEAN O ROURKE, VAN VU, AND KE WANG Abstract. Matrix perturbation inequalities, such as Weyl s theorem

More information

Constructing Explicit RIP Matrices and the Square-Root Bottleneck

Constructing Explicit RIP Matrices and the Square-Root Bottleneck Constructing Explicit RIP Matrices and the Square-Root Bottleneck Ryan Cinoman July 18, 2018 Ryan Cinoman Constructing Explicit RIP Matrices July 18, 2018 1 / 36 Outline 1 Introduction 2 Restricted Isometry

More information

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Bo Liu Department of Computer Science, Rutgers Univeristy Xiao-Tong Yuan BDAT Lab, Nanjing University of Information Science and Technology

More information

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2017 LECTURE 5

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2017 LECTURE 5 STAT 39: MATHEMATICAL COMPUTATIONS I FALL 17 LECTURE 5 1 existence of svd Theorem 1 (Existence of SVD) Every matrix has a singular value decomposition (condensed version) Proof Let A C m n and for simplicity

More information

Non-convex Robust PCA: Provable Bounds

Non-convex Robust PCA: Provable Bounds Non-convex Robust PCA: Provable Bounds Anima Anandkumar U.C. Irvine Joint work with Praneeth Netrapalli, U.N. Niranjan, Prateek Jain and Sujay Sanghavi. Learning with Big Data High Dimensional Regime Missing

More information

SPARSE signal representations have gained popularity in recent

SPARSE signal representations have gained popularity in recent 6958 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011 Blind Compressed Sensing Sivan Gleichman and Yonina C. Eldar, Senior Member, IEEE Abstract The fundamental principle underlying

More information

1 Cricket chirps: an example

1 Cricket chirps: an example Notes for 2016-09-26 1 Cricket chirps: an example Did you know that you can estimate the temperature by listening to the rate of chirps? The data set in Table 1 1. represents measurements of the number

More information

On Sparsity, Redundancy and Quality of Frame Representations

On Sparsity, Redundancy and Quality of Frame Representations On Sparsity, Redundancy and Quality of Frame Representations Mehmet Açaaya Division of Engineering and Applied Sciences Harvard University Cambridge, MA Email: acaaya@fasharvardedu Vahid Taroh Division

More information

Guaranteed Rank Minimization via Singular Value Projection: Supplementary Material

Guaranteed Rank Minimization via Singular Value Projection: Supplementary Material Guaranteed Rank Minimization via Singular Value Projection: Supplementary Material Prateek Jain Microsoft Research Bangalore Bangalore, India prajain@microsoft.com Raghu Meka UT Austin Dept. of Computer

More information

7. Symmetric Matrices and Quadratic Forms

7. Symmetric Matrices and Quadratic Forms Linear Algebra 7. Symmetric Matrices and Quadratic Forms CSIE NCU 1 7. Symmetric Matrices and Quadratic Forms 7.1 Diagonalization of symmetric matrices 2 7.2 Quadratic forms.. 9 7.4 The singular value

More information

An iterative hard thresholding estimator for low rank matrix recovery

An iterative hard thresholding estimator for low rank matrix recovery An iterative hard thresholding estimator for low rank matrix recovery Alexandra Carpentier - based on a joint work with Arlene K.Y. Kim Statistical Laboratory, Department of Pure Mathematics and Mathematical

More information

Rank minimization via the γ 2 norm

Rank minimization via the γ 2 norm Rank minimization via the γ 2 norm Troy Lee Columbia University Adi Shraibman Weizmann Institute Rank Minimization Problem Consider the following problem min X rank(x) A i, X b i for i = 1,..., k Arises

More information

NORMS ON SPACE OF MATRICES

NORMS ON SPACE OF MATRICES NORMS ON SPACE OF MATRICES. Operator Norms on Space of linear maps Let A be an n n real matrix and x 0 be a vector in R n. We would like to use the Picard iteration method to solve for the following system

More information

Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725

Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725 Proximal Gradient Descent and Acceleration Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: subgradient method Consider the problem min f(x) with f convex, and dom(f) = R n. Subgradient method:

More information

The Proximal Gradient Method

The Proximal Gradient Method Chapter 10 The Proximal Gradient Method Underlying Space: In this chapter, with the exception of Section 10.9, E is a Euclidean space, meaning a finite dimensional space endowed with an inner product,

More information

Lecture Notes 9: Constrained Optimization

Lecture Notes 9: Constrained Optimization Optimization-based data analysis Fall 017 Lecture Notes 9: Constrained Optimization 1 Compressed sensing 1.1 Underdetermined linear inverse problems Linear inverse problems model measurements of the form

More information

6 Compressed Sensing and Sparse Recovery

6 Compressed Sensing and Sparse Recovery 6 Compressed Sensing and Sparse Recovery Most of us have noticed how saving an image in JPEG dramatically reduces the space it occupies in our hard drives as oppose to file types that save the pixel value

More information

Linear Algebra. Paul Yiu. Department of Mathematics Florida Atlantic University. Fall A: Inner products

Linear Algebra. Paul Yiu. Department of Mathematics Florida Atlantic University. Fall A: Inner products Linear Algebra Paul Yiu Department of Mathematics Florida Atlantic University Fall 2011 6A: Inner products In this chapter, the field F = R or C. We regard F equipped with a conjugation χ : F F. If F =

More information

Multiple integrals: Sufficient conditions for a local minimum, Jacobi and Weierstrass-type conditions

Multiple integrals: Sufficient conditions for a local minimum, Jacobi and Weierstrass-type conditions Multiple integrals: Sufficient conditions for a local minimum, Jacobi and Weierstrass-type conditions March 6, 2013 Contents 1 Wea second variation 2 1.1 Formulas for variation........................

More information

Greedy Signal Recovery and Uniform Uncertainty Principles

Greedy Signal Recovery and Uniform Uncertainty Principles Greedy Signal Recovery and Uniform Uncertainty Principles SPIE - IE 2008 Deanna Needell Joint work with Roman Vershynin UC Davis, January 2008 Greedy Signal Recovery and Uniform Uncertainty Principles

More information

sublinear time low-rank approximation of positive semidefinite matrices Cameron Musco (MIT) and David P. Woodru (CMU)

sublinear time low-rank approximation of positive semidefinite matrices Cameron Musco (MIT) and David P. Woodru (CMU) sublinear time low-rank approximation of positive semidefinite matrices Cameron Musco (MIT) and David P. Woodru (CMU) 0 overview Our Contributions: 1 overview Our Contributions: A near optimal low-rank

More information

Grothendieck s Inequality

Grothendieck s Inequality Grothendieck s Inequality Leqi Zhu 1 Introduction Let A = (A ij ) R m n be an m n matrix. Then A defines a linear operator between normed spaces (R m, p ) and (R n, q ), for 1 p, q. The (p q)-norm of A

More information

CHAPTER 11. A Revision. 1. The Computers and Numbers therein

CHAPTER 11. A Revision. 1. The Computers and Numbers therein CHAPTER A Revision. The Computers and Numbers therein Traditional computer science begins with a finite alphabet. By stringing elements of the alphabet one after another, one obtains strings. A set of

More information

Matrix completion: Fundamental limits and efficient algorithms. Sewoong Oh Stanford University

Matrix completion: Fundamental limits and efficient algorithms. Sewoong Oh Stanford University Matrix completion: Fundamental limits and efficient algorithms Sewoong Oh Stanford University 1 / 35 Low-rank matrix completion Low-rank Data Matrix Sparse Sampled Matrix Complete the matrix from small

More information

Course Summary Math 211

Course Summary Math 211 Course Summary Math 211 table of contents I. Functions of several variables. II. R n. III. Derivatives. IV. Taylor s Theorem. V. Differential Geometry. VI. Applications. 1. Best affine approximations.

More information

Lecture 9: Numerical Linear Algebra Primer (February 11st)

Lecture 9: Numerical Linear Algebra Primer (February 11st) 10-725/36-725: Convex Optimization Spring 2015 Lecture 9: Numerical Linear Algebra Primer (February 11st) Lecturer: Ryan Tibshirani Scribes: Avinash Siravuru, Guofan Wu, Maosheng Liu Note: LaTeX template

More information

The convex algebraic geometry of rank minimization

The convex algebraic geometry of rank minimization The convex algebraic geometry of rank minimization Pablo A. Parrilo Laboratory for Information and Decision Systems Massachusetts Institute of Technology International Symposium on Mathematical Programming

More information

Reconstruction from Anisotropic Random Measurements

Reconstruction from Anisotropic Random Measurements Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013

More information

DS-GA 1002 Lecture notes 10 November 23, Linear models

DS-GA 1002 Lecture notes 10 November 23, Linear models DS-GA 2 Lecture notes November 23, 2 Linear functions Linear models A linear model encodes the assumption that two quantities are linearly related. Mathematically, this is characterized using linear functions.

More information

Chapter 1. Preliminaries. The purpose of this chapter is to provide some basic background information. Linear Space. Hilbert Space.

Chapter 1. Preliminaries. The purpose of this chapter is to provide some basic background information. Linear Space. Hilbert Space. Chapter 1 Preliminaries The purpose of this chapter is to provide some basic background information. Linear Space Hilbert Space Basic Principles 1 2 Preliminaries Linear Space The notion of linear space

More information

Rui ZHANG Song LI. Department of Mathematics, Zhejiang University, Hangzhou , P. R. China

Rui ZHANG Song LI. Department of Mathematics, Zhejiang University, Hangzhou , P. R. China Acta Mathematica Sinica, English Series May, 015, Vol. 31, No. 5, pp. 755 766 Published online: April 15, 015 DOI: 10.1007/s10114-015-434-4 Http://www.ActaMath.com Acta Mathematica Sinica, English Series

More information