Low rank matrix completion by alternating steepest descent methods


Jared Tanner and Ke Wei
Mathematical Institute, University of Oxford, Oxford, UK (tanner@maths.ox.ac.uk, wei@maths.ox.ac.uk)

Abstract. Matrix completion involves recovering a matrix from a subset of its entries by utilizing interdependency between the entries, typically through low rank structure. Despite the combinatorial nature of matrix completion, there are many computationally efficient algorithms which are effective for a broad class of matrices. In this paper, we introduce an alternating steepest descent algorithm (ASD) and a scaled variant, ScaledASD, for the fixed-rank matrix completion problem. Empirical evaluation of ASD and ScaledASD on both image inpainting and random problems shows they are competitive with other state-of-the-art matrix completion algorithms in terms of recoverable rank and overall computational time. In particular, their low per iteration computational complexity makes ASD and ScaledASD efficient for large size problems, especially when computing the solutions to moderate accuracy. A preliminary convergence analysis is also presented.

Key words. Matrix completion, alternating minimization, gradient descent, exact line-search

AMS subject classifications. 15A29, 41A29, 65F10, 65J20, 68Q25, 90C26

1. Introduction. The problem of recovering a low rank matrix from partial entries - also known as matrix completion - arises in a wide variety of practical contexts, such as model reduction [23], pattern recognition [10], and machine learning [1, 2]. From the pioneering work on low rank approximation by Fazel [12] and matrix completion by Candès and Recht [6], this problem has received intensive investigation from both theoretical and algorithmic aspects; see [3, 4, 5, 7, 8, 9, 13, 15, 16, 18, 21, 22, 24, 25, 26, 27, 29, 30, 31, 32] and their references for a partial review. These matrix completion and low rank approximation techniques rely on the dependencies between entries imposed by the low rank structure. Explicitly seeking the lowest rank matrix consistent with the known entries is expressed as

(1.1)   \min_{Z \in \mathbb{R}^{m \times n}} \operatorname{rank}(Z), \quad \text{subject to } P_\Omega(Z) = P_\Omega(Z_0),

where Z_0 ∈ R^{m×n} is the underlying matrix to be reconstructed, Ω is a subset of indices for the known entries, and P_Ω is the associated sampling operator which acquires only the entries indexed by Ω.

Problem (1.1) is non-convex and generally NP-hard [17] due to the rank objective. One of the most widely studied approaches is to replace the rank objective with a convex relaxation such as the Schatten 1-norm, also known as the nuclear norm. It has been proven that, provided the singular vectors are weakly correlated with the canonical basis, the solution of (1.1) can be obtained by solving the nuclear norm minimization problem

(1.2)   \min_{Z \in \mathbb{R}^{m \times n}} \|Z\|_*, \quad \text{subject to } P_\Omega(Z) = P_\Omega(Z_0),

where the nuclear norm of Z, denoted ‖Z‖_*, is the sum of its singular values.
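Since the nuclear norm in (1.2) is simply the sum of the singular values, it is straightforward to evaluate once an SVD is available. The following minimal NumPy sketch is an illustration added here, not part of the original paper:

```python
import numpy as np

def nuclear_norm(Z):
    # ||Z||_* is the sum of the singular values of Z, the convex surrogate
    # for the rank objective of (1.1) used in (1.2).
    return np.linalg.svd(Z, compute_uv=False).sum()
```

For a rank r matrix only r singular values are nonzero, which is what makes the nuclear norm a natural convex surrogate for the rank.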

More precisely, if Z_0 satisfies the incoherence conditions and |Ω| = p of its entries are observed uniformly at random, then Z_0 can be exactly recovered by (1.2) with high probability provided p is of the order O(r(m + n - r) log^α(max(m, n))) [6, 7, 15, 27]. In practice, (1.2) can be solved either by semidefinite programming or by iterative soft thresholding algorithms [5, 13, 20].

As an alternative to nuclear norm minimization, there have been many algorithms designed to target (1.1) directly; most of them are adaptations of algorithms for compressed sensing, such as the hard thresholding algorithms [3, 18, 22, 29]. The iterative thresholding algorithms typically update a current rank r estimate along a line-search direction which departs from the manifold of rank r matrices, and then project back to the rank r manifold by computing the nearest matrix in Frobenius norm. The most direct implementation of the projection involves computing a partial singular value decomposition (SVD) in each iteration. For r ≪ m ≤ n, computing the SVD has computational complexity O(n^3), which is the dominant per iteration cost and limits their applicability for large n. (The related low rank approximation question [28] has an O(n^4) per iteration cost dominated by computing the measurement operator and its adjoint; in this context, computing the SVD is of secondary cost.) A subclass of methods which further exploit the manifold structure in their line-search updates achieve superior efficiency, for r ≪ n, of order O(r^3). Examples include [31], which is a nonlinear conjugate gradient method, and [26], which uses a scaled gradient in the subspace update, as well as other geometries and metrics in [24, 25].

To circumvent the high computational cost of an SVD, other algorithms explicitly remain on the manifold of rank r matrices by using the factorization Z = XY, where X ∈ R^{m×r} and Y ∈ R^{r×n}. Based on this simple factorization model, rather than solving (1.1), algorithms are designed to solve the non-convex problem

(1.3)   \min_{X, Y} \frac{1}{2} \| P_\Omega(Z_0) - P_\Omega(XY) \|_F^2.

Algorithms for the solution of (1.3) usually follow an alternating minimization scheme, with PowerFactorization [16] and LMaFit [32] two representatives. We present alternating steepest descent (ASD), and a scaled variant ScaledASD, which incorporate an exact line-search to update the solutions of the model (1.3). In so doing ASD has a lower per iteration complexity than PowerFactorization, allowing a more efficient exploration of the manifold in the early iterations where the current estimate is inaccurate.
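To make the role of the sampling operator in (1.3) concrete, the objective can be evaluated using only the sampled entries, without forming the full m × n product XY. The sketch below is illustrative only (not the authors' implementation); it assumes Ω is stored in coordinate form as index arrays rows, cols with the corresponding observed values vals.

```python
import numpy as np

def masked_values(X, Y, rows, cols):
    # Entries of the product X @ Y restricted to Omega: (XY)[rows[k], cols[k]] for each k.
    return np.einsum("ij,ji->i", X[rows, :], Y[:, cols])

def objective(X, Y, rows, cols, vals):
    # f(X, Y) = 0.5 * || P_Omega(Z0) - P_Omega(XY) ||_F^2 over the |Omega| sampled entries.
    res = vals - masked_values(X, Y, rows, cols)
    return 0.5 * np.dot(res, res)
```

Evaluating the sampled entries of XY in this way costs on the order of 2|Ω|r flops, the unit in which the per iteration costs of the algorithms below are counted.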

The manuscript is outlined as follows. In Sec. 2 we briefly review the alternating minimization algorithms PowerFactorization and LMaFit which motivate ASD. In Sec. 3 we propose ASD for (1.3), which replaces the least squares subproblem solutions in PowerFactorization with an exact line-search so as to reduce the per iteration computational cost. In Sec. 4 we present ScaledASD, a version of ASD which is accelerated by scaling the search directions adaptively to improve the asymptotic convergence rate. Numerical experiments presented in Sec. 5 contrast the aforementioned algorithms and show that ASD is highly efficient, particularly for large problems when solved to moderate accuracy. Section 6 concludes the paper with some future research directions.

2. Alternating minimization. Alternating minimization is widely used for optimization problems due to its simplicity, low memory footprint and flexibility [11]. Without loss of generality, let us consider an objective function f(X, Y) with two variable components X ∈ 𝒳 and Y ∈ 𝒴, where 𝒳 and 𝒴 are the admissible sets. The alternating minimization method, also known as the Gauss-Seidel or 2-block coordinate descent method, minimizes f(X, Y) by successive searches over 𝒳 and 𝒴. For problem (1.3), the objective function is

(2.1)   f(X, Y) = \frac{1}{2} \| P_\Omega(Z_0) - P_\Omega(XY) \|_F^2,

with 𝒳 := R^{m×r} and 𝒴 := R^{r×n}. PowerFactorization (PF), Alg. 1, is a matrix completion algorithm which applies the alternating minimization method to (1.3).

Algorithm 1 PowerFactorization (PF, [16])
Input: P_Ω(Z_0), X_0 ∈ R^{m×r}, Y_0 ∈ R^{r×n}
Repeat
  1. Fix Y_i and solve X_{i+1} = argmin_X ‖P_Ω(Z_0) − P_Ω(X Y_i)‖_F^2
  2. Fix X_{i+1} and solve Y_{i+1} = argmin_Y ‖P_Ω(Z_0) − P_Ω(X_{i+1} Y)‖_F^2
  3. i = i + 1
Until the termination criterion is reached
Output: X_i, Y_i

Note that each subproblem in Alg. 1 is a standard least squares problem. In [16], a rank increment technique is also proposed to improve the performance of the algorithm. As for a typical alternating minimization method, limit points of Alg. 1 are necessarily stationary points; see Thm. 3.4.

The per iteration computational cost of PF is determined by the solution of the least squares subproblems in Alg. 1. To decrease this per iteration cost, [32] proposes the model

(2.2)   \min_{X, Y, Z} \frac{1}{2} \| Z - XY \|_F^2, \quad \text{subject to } P_\Omega(Z) = P_\Omega(Z_0),

where the projection onto Ω is moved from the objective in (2.1) to a linear constraint. The alternating minimization approach applied to (2.2) gives the low-rank matrix fitting algorithm (LMaFit), Alg. 2.

Algorithm 2 Low-rank Matrix Fitting (LMaFit, [32])
Input: P_Ω(Z_0), X_0 ∈ R^{m×r}, Y_0 ∈ R^{r×n}
Repeat
  1. X_{i+1} = Z_i Y_i^† = argmin_X ‖X Y_i − Z_i‖_F^2
  2. Y_{i+1} = X_{i+1}^† Z_i = argmin_Y ‖X_{i+1} Y − Z_i‖_F^2
  3. Z_{i+1} = X_{i+1} Y_{i+1} + P_Ω(Z_0 − X_{i+1} Y_{i+1})
  4. i = i + 1
Until the termination criterion is reached
Output: X_i, Y_i

LMaFit obtains new approximate solutions of X and Y by solving least squares problems while Z is held fixed, as is done in PF. However, the subproblems in LMaFit have explicit solutions, which reduces the high per iteration computational cost of PF. Moreover, [32] proposes an over-relaxation scheme to further accelerate Alg. 2. The convergence of limit points to stationary points is similarly established for LMaFit; see Thm. 3.5 in [32].
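For concreteness, the fixed-Y subproblem in Alg. 1 decouples across the rows of X: the i-th row of X depends only on the entries of Z_0 observed in row i, giving one small least squares problem in r unknowns per row. The sketch below illustrates this under the same assumed coordinate-list layout for Ω as above; it is not the implementation of [16].

```python
import numpy as np

def pf_update_X(Y, rows, cols, vals, m):
    # Fixed-Y subproblem of Alg. 1: row i of X minimizes ||P_Omega_i(Z0) - x_i Y||_2,
    # a least squares problem built from the columns of Y observed in row i.
    r = Y.shape[0]
    X = np.zeros((m, r))
    for i in range(m):
        idx = np.flatnonzero(rows == i)     # entries of Z0 observed in row i
        if idx.size == 0:
            continue                        # no data for this row; leave it at zero
        Yi = Y[:, cols[idx]]                # r x |Omega_i|
        X[i, :] = np.linalg.lstsq(Yi.T, vals[idx], rcond=None)[0]
    return X
```

The fixed-X subproblem for Y is analogous, one column at a time. Solving these subproblems to high accuracy in every iteration is precisely the per iteration cost that the method of the next section avoids.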

3. Alternating steepest descent. Solving the least squares subproblems in PF to high accuracy is both computationally expensive and potentially of limited value when the factor that is held fixed is inconsistent with the known entries P_Ω(Z_0). To improve the computational efficiency we replace the least squares subproblem solves in PF with a single step of simple line-search along the gradient descent directions.

The alternating steepest descent method (ASD) applies steepest gradient descent to f(X, Y) in (2.1) alternately with respect to X and Y. If f(X, Y) is written as f_Y(X) when Y is held constant and f_X(Y) when X is held constant, the gradients are

(3.1)   \nabla f_Y(X) = -\left(P_\Omega(Z_0) - P_\Omega(XY)\right) Y^T,
(3.2)   \nabla f_X(Y) = -X^T \left(P_\Omega(Z_0) - P_\Omega(XY)\right).

Let t_x, t_y be the steepest descent stepsizes for the descent directions −∇f_Y(X) and −∇f_X(Y); then

(3.3)   t_x = \frac{\|\nabla f_Y(X)\|_F^2}{\|P_\Omega(\nabla f_Y(X)\, Y)\|_F^2},
(3.4)   t_y = \frac{\|\nabla f_X(Y)\|_F^2}{\|P_\Omega(X\, \nabla f_X(Y))\|_F^2},

which together form the alternating steepest descent algorithm, Alg. 3.

Algorithm 3 Alternating Steepest Descent (ASD)
Input: P_Ω(Z_0), X_0 ∈ R^{m×r}, Y_0 ∈ R^{r×n}
Repeat
  1. ∇f_{Y_i}(X_i) = −(P_Ω(Z_0) − P_Ω(X_i Y_i)) Y_i^T,  t_{x_i} = ‖∇f_{Y_i}(X_i)‖_F^2 / ‖P_Ω(∇f_{Y_i}(X_i) Y_i)‖_F^2
  2. X_{i+1} = X_i − t_{x_i} ∇f_{Y_i}(X_i)
  3. ∇f_{X_{i+1}}(Y_i) = −X_{i+1}^T (P_Ω(Z_0) − P_Ω(X_{i+1} Y_i)),  t_{y_i} = ‖∇f_{X_{i+1}}(Y_i)‖_F^2 / ‖P_Ω(X_{i+1} ∇f_{X_{i+1}}(Y_i))‖_F^2
  4. Y_{i+1} = Y_i − t_{y_i} ∇f_{X_{i+1}}(Y_i)
  5. i = i + 1
Until the termination criterion is reached
Output: X_i, Y_i

Remark 3.1. The denominators in t_{x_i} and t_{y_i} are zero if and only if the corresponding gradients are zero. This follows from the contradiction that otherwise f(X, Y) could be decreased without a lower bound, which violates the nonnegativity of the objective. If the denominator of t_{x_i} or t_{y_i} is zero in an iteration, it indicates that X_i or Y_i respectively is at a local minimum and should be held fixed for that iteration by setting the offending stepsize to zero while minimizing over the other variable. If both denominators are zero then a stationary point has been reached.
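A minimal sketch of one sweep of Alg. 3 follows, again using the assumed coordinate-list layout for Ω; it is an illustration rather than the authors' Matlab/C implementation. The residual is stored as a vector over Ω and updated from the masked products already formed for the stepsizes, in the spirit of the efficient update described in Sec. 3.1 below; the residual matrix R is formed densely here only for clarity.

```python
import numpy as np

def asd_sweep(X, Y, rows, cols, vals, res):
    # One sweep of Alg. 3. `res` holds P_Omega(Z0) - P_Omega(XY) as a vector over Omega,
    # initialized as vals - masked_values(X, Y, rows, cols).
    m, n = X.shape[0], Y.shape[1]

    # Gradient in X for fixed Y, eq. (3.1): grad_X = -R Y^T with R supported on Omega.
    R = np.zeros((m, n)); R[rows, cols] = res
    gX = -R @ Y.T
    gXY = np.einsum("ij,ji->i", gX[rows, :], Y[:, cols])   # P_Omega(grad_X Y) on Omega
    tx = np.sum(gX * gX) / np.sum(gXY * gXY)               # exact stepsize, eq. (3.3)
    X = X - tx * gX
    res = res + tx * gXY                                   # residual update (Sec. 3.1)

    # Gradient in Y for the updated X, eq. (3.2): grad_Y = -X^T R.
    R = np.zeros((m, n)); R[rows, cols] = res
    gY = -X.T @ R
    XgY = np.einsum("ij,ji->i", X[rows, :], gY[:, cols])   # P_Omega(X grad_Y) on Omega
    ty = np.sum(gY * gY) / np.sum(XgY * XgY)               # exact stepsize, eq. (3.4)
    Y = Y - ty * gY
    res = res + ty * XgY                                   # residual after the Y update

    return X, Y, res
```

In an implementation that exploits the sparsity of R (presumably what the authors' C subroutines provide), the two gradients and the two masked products each cost about 2|Ω|r flops, giving the 8|Ω|r leading order cost derived in Sec. 3.1.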

3.1. Per iteration computational cost: efficient residual updates. Algorithm 3 consists of two main parts: the computation of the gradient and of the steepest descent stepsize. Forming the gradient involves a matrix product between the residual and an m×r or r×n matrix. The computations of both the residual and the matrix product require to leading order 2|Ω|r floating point operations (flops), where |Ω| denotes the number of sampled entries. Computing the stepsize also requires 2|Ω|r flops for the denominator. Naively implemented this would give a per iteration complexity for Alg. 3 of 12|Ω|r flops. However, the residual only needs to be computed once at the beginning of an iteration and can then be updated efficiently. Taking the first two steps as an example, the residual after X_i is updated to X_{i+1} is

P_\Omega(Z_0) - P_\Omega(X_{i+1} Y_i) = P_\Omega(Z_0) - P_\Omega\big((X_i - t_{x_i}\nabla f_{Y_i}(X_i)) Y_i\big) = P_\Omega(Z_0) - P_\Omega(X_i Y_i) + t_{x_i} P_\Omega(\nabla f_{Y_i}(X_i)\, Y_i).

Fortunately, P_Ω(∇f_{Y_i}(X_i) Y_i) is formed when computing t_{x_i}, so the new residual can be updated from the old one P_Ω(Z_0) − P_Ω(X_i Y_i) without computing a matrix-matrix product. Therefore the leading order cost per iteration of Alg. 3 is 8|Ω|r flops.

3.2. ASD convergence analysis. The convergence analysis of block coordinate methods for general unconstrained optimization problems has been studied in [14] based on stationary points. However, because of the simple structure of problem (1.3), for completeness we offer a direct proof that limit points of ASD are necessarily stationary points of (1.3); that is, limit points (X*, Y*) satisfy

(3.5)   \nabla f_{X^*}(Y^*) = 0, \quad \nabla f_{Y^*}(X^*) = 0.

Theorem 3.1. Any limit point of the sequence (X_i, Y_i) generated by Alg. 3 is a stationary point.

Proof. Suppose (X_{i_k}, Y_{i_k}) is a subsequence of (X_i, Y_i) such that

\lim_{i_k \to \infty} X_{i_k} = X^*, \quad \lim_{i_k \to \infty} Y_{i_k} = Y^*,

for some (X*, Y*). In order to prove that (X*, Y*) is a stationary point, by continuity of the gradient it is sufficient to prove

(3.6)   \lim_{i_k \to \infty} \nabla f_{X_{i_k}}(Y_{i_k}) = 0, \quad \lim_{i_k \to \infty} \nabla f_{Y_{i_k}}(X_{i_k}) = 0.

Since (X_{i_k}, Y_{i_k}) are bounded for all i_k, (3.6) follows immediately from the subsequent Lems. 3.2 and 3.3.

Lemma 3.2. If (X_{i_k}, Y_{i_k}) is a bounded sequence, then

(3.7)   \lim_{i_k \to \infty} \nabla f_{X_{i_k}}(Y_{i_k-1}) = 0, \quad \lim_{i_k \to \infty} \nabla f_{Y_{i_k}}(X_{i_k}) = 0.

Proof. By the quadratic form of f(X, Y) with respect to X and Y, the decrease of f(X, Y) can be computed exactly in each iteration:

(3.8)   f(X_i, Y_{i-1}) = f(X_{i-1}, Y_{i-1}) - \frac{1}{2}\,\frac{\|\nabla f_{Y_{i-1}}(X_{i-1})\|_F^4}{\|P_\Omega(\nabla f_{Y_{i-1}}(X_{i-1})\, Y_{i-1})\|_F^2},
(3.9)   f(X_i, Y_i) = f(X_i, Y_{i-1}) - \frac{1}{2}\,\frac{\|\nabla f_{X_i}(Y_{i-1})\|_F^4}{\|P_\Omega(X_i\, \nabla f_{X_i}(Y_{i-1}))\|_F^2}.

Summing (3.8) and (3.9) from i = 0 to i = l gives

f(X_l, Y_l) = f(X_0, Y_0) - \frac{1}{2}\sum_{i=0}^{l-1}\left\{ \frac{\|\nabla f_{Y_i}(X_i)\|_F^4}{\|P_\Omega(\nabla f_{Y_i}(X_i)\, Y_i)\|_F^2} + \frac{\|\nabla f_{X_{i+1}}(Y_i)\|_F^4}{\|P_\Omega(X_{i+1}\, \nabla f_{X_{i+1}}(Y_i))\|_F^2} \right\}.

The nonnegativity of f(X_l, Y_l) for all l implies

\sum_{i=0}^{\infty}\left\{ \frac{\|\nabla f_{Y_i}(X_i)\|_F^4}{\|P_\Omega(\nabla f_{Y_i}(X_i)\, Y_i)\|_F^2} + \frac{\|\nabla f_{X_{i+1}}(Y_i)\|_F^4}{\|P_\Omega(X_{i+1}\, \nabla f_{X_{i+1}}(Y_i))\|_F^2} \right\} < \infty.

Consequently,

(3.10)  \lim_{i \to \infty} \frac{\|\nabla f_{Y_i}(X_i)\|_F^4}{\|P_\Omega(\nabla f_{Y_i}(X_i)\, Y_i)\|_F^2} = 0,
(3.11)  \lim_{i \to \infty} \frac{\|\nabla f_{X_i}(Y_{i-1})\|_F^4}{\|P_\Omega(X_i\, \nabla f_{X_i}(Y_{i-1}))\|_F^2} = 0.

In particular, (3.10) and (3.11) also hold when i is replaced by the subsequence i_k. Since Y_{i_k} is bounded, one has

\|P_\Omega(\nabla f_{Y_{i_k}}(X_{i_k})\, Y_{i_k})\|_F \le \|\nabla f_{Y_{i_k}}(X_{i_k})\, Y_{i_k}\|_F \le \|\nabla f_{Y_{i_k}}(X_{i_k})\|_F\, \|Y_{i_k}\| \le C\, \|\nabla f_{Y_{i_k}}(X_{i_k})\|_F

for some C > 0. Thus

(3.12)  \lim_{i_k \to \infty} \frac{\|\nabla f_{Y_{i_k}}(X_{i_k})\|_F^4}{C^2\, \|\nabla f_{Y_{i_k}}(X_{i_k})\|_F^2} \le \lim_{i_k \to \infty} \frac{\|\nabla f_{Y_{i_k}}(X_{i_k})\|_F^4}{\|P_\Omega(\nabla f_{Y_{i_k}}(X_{i_k})\, Y_{i_k})\|_F^2} = 0,

which gives lim_{i_k→∞} ∇f_{Y_{i_k}}(X_{i_k}) = 0. Similarly, lim_{i_k→∞} ∇f_{X_{i_k}}(Y_{i_k−1}) = 0 since X_{i_k} is bounded.

Lemma 3.2 justifies the second limit in (3.6). The first limit in (3.6) follows directly from Lem. 3.3.

Lemma 3.3. If (X_{i_k}, Y_{i_k}) are bounded, then

(3.13)  \lim_{i_k \to \infty} \nabla f_{X_{i_k}}(Y_{i_k}) = 0.

Proof. The difference between ∇f_{X_{i_k}}(Y_{i_k}) and ∇f_{X_{i_k}}(Y_{i_k−1}) is

\nabla f_{X_{i_k}}(Y_{i_k}) - \nabla f_{X_{i_k}}(Y_{i_k-1})
  = -X_{i_k}^T\big(P_\Omega(Z_0) - P_\Omega(X_{i_k} Y_{i_k})\big) + X_{i_k}^T\big(P_\Omega(Z_0) - P_\Omega(X_{i_k} Y_{i_k-1})\big)
  = X_{i_k}^T P_\Omega\big(X_{i_k}(Y_{i_k} - Y_{i_k-1})\big)
  = -t_{y_{i_k-1}}\, X_{i_k}^T P_\Omega\big(X_{i_k} \nabla f_{X_{i_k}}(Y_{i_k-1})\big)
  = -\frac{\|\nabla f_{X_{i_k}}(Y_{i_k-1})\|_F^2}{\|P_\Omega(X_{i_k} \nabla f_{X_{i_k}}(Y_{i_k-1}))\|_F^2}\, X_{i_k}^T P_\Omega\big(X_{i_k} \nabla f_{X_{i_k}}(Y_{i_k-1})\big),

where the third equation follows from Y_{i_k} = Y_{i_k−1} − t_{y_{i_k−1}} ∇f_{X_{i_k}}(Y_{i_k−1}). Hence,

(3.14)  \|\nabla f_{X_{i_k}}(Y_{i_k}) - \nabla f_{X_{i_k}}(Y_{i_k-1})\|_F \le \frac{\|\nabla f_{X_{i_k}}(Y_{i_k-1})\|_F^2}{\|P_\Omega(X_{i_k} \nabla f_{X_{i_k}}(Y_{i_k-1}))\|_F}\, \|X_{i_k}\| \le C_0\, \frac{\|\nabla f_{X_{i_k}}(Y_{i_k-1})\|_F^2}{\|P_\Omega(X_{i_k} \nabla f_{X_{i_k}}(Y_{i_k-1}))\|_F} \to 0,

where the second inequality follows from the boundedness of X_{i_k} and the limit follows from (3.11). Combining (3.14) with (3.7) gives lim_{i_k→∞} ∇f_{X_{i_k}}(Y_{i_k}) = 0.

3.3. PowerFactorization convergence analysis. This section presents a proof that limit points of PowerFactorization are necessarily stationary points. The proof follows that of Thm. 3.1.

Theorem 3.4. Any limit point of the sequence (X_i, Y_i) generated by Alg. 1 is a stationary point.

Proof. Let (X_i, Y_i) be a sequence from Alg. 1, with a subsequence (X_{i_k}, Y_{i_k}) converging to (X*, Y*). Because Y_i minimizes f_{X_i}(Y) over all Y ∈ R^{r×n}, ∇f_{X_i}(Y_i) = 0 for all i. By the continuity of the gradients,

\nabla f_{X^*}(Y^*) = \lim_{i_k \to \infty} \nabla f_{X_{i_k}}(Y_{i_k}) = 0.

To prove lim_{i_k→∞} ∇f_{Y_{i_k}}(X_{i_k}) = 0, let (X̃_i, Y_{i−1}) be the solution obtained by one steepest descent step as in Alg. 3 from (X_{i−1}, Y_{i−1}), and (X_i, Ỹ_i) the solution obtained by one steepest descent step from (X_i, Y_{i−1}). Then

f(X_i, Y_{i-1}) \le f(\tilde{X}_i, Y_{i-1}) = f(X_{i-1}, Y_{i-1}) - \frac{1}{2}\,\frac{\|\nabla f_{Y_{i-1}}(X_{i-1})\|_F^4}{\|P_\Omega(\nabla f_{Y_{i-1}}(X_{i-1})\, Y_{i-1})\|_F^2},
f(X_i, Y_i) \le f(X_i, \tilde{Y}_i) = f(X_i, Y_{i-1}) - \frac{1}{2}\,\frac{\|\nabla f_{X_i}(Y_{i-1})\|_F^4}{\|P_\Omega(X_i\, \nabla f_{X_i}(Y_{i-1}))\|_F^2},

where the inequalities follow from the fact that X_i and Y_i are, by definition, the minimizers, and the equalities follow from the same argument as in (3.8) and (3.9). Thus

f(X_l, Y_l) \le f(X_0, Y_0) - \frac{1}{2}\sum_{i=0}^{l-1}\left\{ \frac{\|\nabla f_{Y_i}(X_i)\|_F^4}{\|P_\Omega(\nabla f_{Y_i}(X_i)\, Y_i)\|_F^2} + \frac{\|\nabla f_{X_{i+1}}(Y_i)\|_F^4}{\|P_\Omega(X_{i+1}\, \nabla f_{X_{i+1}}(Y_i))\|_F^2} \right\}.

Repeating the argument for Lem. 3.2 yields

\nabla f_{Y^*}(X^*) = \lim_{i_k \to \infty} \nabla f_{Y_{i_k}}(X_{i_k}) = 0.

4. Scaled alternating steepest descent. In this section, we present an accelerated version of Alg. 3. To motivate its construction, let us consider the case when all the entries of Z_0 are observed. Under this assumption, problem (1.3) simplifies to

(4.1)   \min_{X, Y} \frac{1}{2} \| Z_0 - XY \|_F^2.

The Newton directions for problem (4.1) with respect to X and Y are

(4.2)   (Z_0 - XY)\, Y^T (Y Y^T)^{-1},
(4.3)   (X^T X)^{-1} X^T (Z_0 - XY),

which are the gradient descent directions scaled by (Y Y^T)^{-1} and (X^T X)^{-1} respectively. However, when only partial entries of Z_0 are known, the Newton directions do not possess explicit formulas similar to (4.2) and (4.3), and solving each subproblem by Newton's method is the same as solving a least squares problem since f(X, Y) is an exact quadratic function of X or Y. Nevertheless, it is still possible to scale the gradient descent directions by (Y Y^T)^{-1} and (X^T X)^{-1}. The scaled alternating steepest descent method (ScaledASD) uses the scaled gradient descent directions with exact line-search, Alg. 4.

Algorithm 4 Scaled Alternating Steepest Descent (ScaledASD)
Input: P_Ω(Z_0), X_0 ∈ R^{m×r}, Y_0 ∈ R^{r×n}
Repeat
  1. ∇f_{Y_i}(X_i) = −(P_Ω(Z_0) − P_Ω(X_i Y_i)) Y_i^T
  2. d_{x_i} = −∇f_{Y_i}(X_i)(Y_i Y_i^T)^{−1},  t_{x_i} = −⟨∇f_{Y_i}(X_i), d_{x_i}⟩ / ‖P_Ω(d_{x_i} Y_i)‖_F^2
  3. X_{i+1} = X_i + t_{x_i} d_{x_i}
  4. ∇f_{X_{i+1}}(Y_i) = −X_{i+1}^T (P_Ω(Z_0) − P_Ω(X_{i+1} Y_i))
  5. d_{y_i} = −(X_{i+1}^T X_{i+1})^{−1} ∇f_{X_{i+1}}(Y_i),  t_{y_i} = −⟨∇f_{X_{i+1}}(Y_i), d_{y_i}⟩ / ‖P_Ω(X_{i+1} d_{y_i})‖_F^2
  6. Y_{i+1} = Y_i + t_{y_i} d_{y_i}
  7. i = i + 1
Until the termination criterion is reached
Output: X_i, Y_i

Remark 4.1.
1. The leading order per iteration computational cost of Alg. 4 remains 8|Ω|r, and the residual can be updated as in ASD; see Sec. 3.1.
2. Although motivated from a different perspective, ScaledASD can also be viewed as a sequential version of the Riemannian gradient method suggested in [24], though with a new metric.
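The only change relative to Alg. 3 is that the gradients are multiplied by the r × r matrices (Y Y^T)^{-1} and (X^T X)^{-1} before the exact line-search. A minimal sketch of the scaled directions, under the same assumptions as the earlier sketches (illustration only):

```python
import numpy as np

def scaled_asd_directions(X, Y, R):
    # R holds the masked residual P_Omega(Z0) - P_Omega(XY), with zeros off Omega.
    gX = -R @ Y.T                            # gradient w.r.t. X, eq. (3.1)
    dX = -np.linalg.solve(Y @ Y.T, gX.T).T   # dX = -gX (Y Y^T)^{-1}, step 2 of Alg. 4
    gY = -X.T @ R                            # gradient w.r.t. Y, eq. (3.2)
    dY = -np.linalg.solve(X.T @ X, gY)       # dY = -(X^T X)^{-1} gY, step 5 of Alg. 4
    return gX, dX, gY, dY
```

The stepsizes are then t_x = −⟨gX, dX⟩ / ‖P_Ω(dX Y)‖_F² and t_y = −⟨gY, dY⟩ / ‖P_Ω(X dY)‖_F², exactly as in Alg. 4; since Y Y^T and X^T X are only r × r, the additional linear solves are of lower order than the 8|Ω|r masked products.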

5. Numerical experiments. In this section, we present empirical results for our gradient based algorithms and for the other algorithms. ASD and ScaledASD are implemented in Matlab with two subroutines written in C to take advantage of the sparse structure. The other tested algorithms include PowerFactorization (PF) [16], LMaFit [32], LRGeomCG [31], and ScGrassMC [26], which were all downloaded from the authors' websites and, where applicable, use the aforementioned C subroutines. We compare these algorithms on both standard test images in Sec. 5.1 and random low rank matrices in Sec. 5.2.

5.1. Image inpainting. In this section, we demonstrate the performance of the proposed algorithms on image inpainting problems. Image inpainting concerns filling in the unknown pixels of an image from an incomplete set of observed entries. All the aforementioned algorithms are tested on three standard 512 × 512 grayscale test images (Boat, Barbara, and Lena), each projected to the nearest rank 50 image in Frobenius norm. Two sampling schemes are considered: 1) random sampling, where 35% of the pixels of the low rank image were sampled uniformly at random, and 2) a cross mask, where 6.9% of the pixels of the image were masked in a non-random cross pattern. The same tolerance was set for all algorithms. All algorithms were provided with the true rank of the image except LMaFit, which for random sampling used the warm starting strategies proposed by its authors in [32]; the hand tuned parameters chosen to give the best performance on the images tested were their increasing strategy with the initial rank set to 25 and the maximum rank set to 60. The tests were performed on the same computer as described in Sec. 5.2.2.

Figure 1. Image reconstruction for Boat from random sampling. The size of the image is 512 × 512, the desired rank is 50, and 35% of the entries of the low rank image are sampled uniformly at random. (Panels: original image, rank 50 approximation, 35% random sampling.)
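The low rank test images and sampling masks described above are simple to reproduce; the sketch below (illustrative only, with the grayscale image supplied as a NumPy array) projects to the nearest rank 50 matrix in Frobenius norm via a truncated SVD and draws each pixel independently with probability 0.35, matching the 35% sampling rate in expectation.

```python
import numpy as np

def make_inpainting_problem(img, rank=50, sample_frac=0.35, seed=0):
    # Nearest rank-`rank` approximation in Frobenius norm via the truncated SVD,
    # followed by random sampling of the pixels (the observed set Omega).
    U, s, Vt = np.linalg.svd(np.asarray(img, dtype=float), full_matrices=False)
    low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
    rng = np.random.default_rng(seed)
    mask = rng.random(low_rank.shape) < sample_frac
    rows, cols = np.nonzero(mask)
    return low_rank, rows, cols, low_rank[rows, cols]
```

The cross mask case is analogous, with the random Boolean mask replaced by the complement of the cross-shaped pattern of masked pixels.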

Figure 2. Image reconstruction for Boat with the cross mask. The size of the image is 512 × 512, the desired rank is 50, and 6.9% of the entries of the low rank image are masked in a non-random way. (Panels: original image, rank 50 approximation, cross masked.)

The reconstructed images for the Boat test image and each tested algorithm are displayed in Figs. 1 and 2 for the random sampling and cross mask respectively. The relative residuals plotted against iteration number and computational time are shown for each test image and tested algorithm in Figs. 3 and 4 respectively. Fig. 4 (a,c,e) shows that ASD and ScaledASD have a rapid initial decrease in the residual; for moderate accuracy they require the least time for random sampling of the tested natural images, while for high accuracy the algorithm with the fastest asymptotic rate is preferable. It should be noted that the completed images are visually indistinguishable at relative residuals reached well before the fast asymptotic rate begins. Alternatively, for the cross mask, Fig. 4 (b,d,f) shows that PowerFactorization, together with two of the other tested algorithms, is preferable throughout, though the fast asymptotic rate noted above suggests that algorithm would be superior below a sufficiently small relative tolerance. Only one of the tested algorithms is preferred for both random sampling and the cross mask, suggesting its usage for moderate accuracy, with the algorithm having the fastest asymptotic rate preferred for higher accuracy.

Figure 3. Relative residual as a function of the number of iterations for image recovery testing on Boat (top), Barbara (middle), and Lena (bottom). Left panels: convergence curves for random sampling; right panels: convergence curves for the cross mask.

5.2. Random matrix completion problems. In this section we conduct tests on randomly drawn rank r matrices generated using the model Z_0 = XY, where X ∈ R^{m×r} and Y ∈ R^{r×n} have their entries drawn i.i.d. from the normal distribution N(0, 1). A random subset Ω of p entries of Z_0 is sampled uniformly at random. For conciseness, the tests presented in this section consider square matrices, as is typical in the literature.

Figure 4. Relative residual as a function of time for image recovery testing on Boat (top), Barbara (middle), and Lena (bottom). Left panels: convergence curves for random sampling; right panels: convergence curves for the cross mask.

5.2.1. Largest recoverable rank. Of central importance in matrix completion is the question of the largest recoverable rank given (p, m, n), or equivalently, given (r, m, n), how many of the entries are needed in order to reliably recover a rank r matrix.

To quantify the optimality of such statements we use the phase transition framework, which for matrix completion is defined through the undersampling and oversampling ratios

(5.1)   \delta = \frac{p}{mn} \quad \text{and} \quad \rho = \frac{d_r}{p},

where p is the number of sampled entries and d_r = r(m + n − r) is the number of degrees of freedom in an m × n rank r matrix. The region of greatest importance is severe undersampling, δ ≪ 1, and we restrict our tests to the two values δ ∈ {0.05, 0.1}, which also allows testing of large matrices. (The tests were conducted on the IRIDIS HPC facility provided by the e-Infrastructure South Centre for Innovation. The IRIDIS facility runs Linux nodes composed of two 6-core 2.4GHz Intel Westmere processors and approximately 22GB of usable memory per node. The data presented required 4.5 months of CPU time.)

Due to the high computational cost of accurately computing an algorithm's phase transition we limit the testing here to three of the most efficient algorithms: ASD, LRGeomCG, and LMaFit. Both ASD and LRGeomCG were provided with the true rank of the matrix, while the decreasing strategy with the initial rank set to 1.25r was used in LMaFit as suggested in [32]. In these tests an algorithm is considered to have successfully recovered the sampled matrix Z_0 if it returned a matrix Z_out within a prescribed relative accuracy of Z_0 in the Frobenius norm, that is if rel.err := ‖Z_out − Z_0‖_F / ‖Z_0‖_F is below the prescribed threshold. For each triple (m, n, p), we started from a sufficiently small rank r so that the algorithm could successfully recover each of the sampled matrices in 100 random tests. The rank was then increased until it was large enough that the algorithm failed to recover any of the 100 tests to within the prescribed relative accuracy. We refer to the largest rank for which the algorithm succeeded in each of the 100 tests as r_min, and the smallest rank for which the algorithm failed all 100 tests as r_max. The exact values of r_min, r_max and the associated ρ_min, ρ_max are listed in Tab. 1.

From Tab. 1 we can see that both ASD and LRGeomCG are able to recover a rank r matrix from only C·r(m + n − r) measurements of its entries with C only slightly larger than 1. The ability to recover a low rank matrix from nearly the minimum number of measurements was first reported for NIHT by the authors in [29]. In contrast, LMaFit has a substantially lower phase transition. Table 1 shows that these observations are consistent across different, moderately large size matrices.

Table 1
Phase transition table for ASD, LRGeomCG, and LMaFit. For each (m, n, p) with p = δmn, the algorithm was observed to recover each of the 100 randomly drawn m × n matrices of rank r if r ≤ r_min, and failed to recover each of the randomly drawn matrices if r ≥ r_max. The oversampling ratios ρ_min and ρ_max are computed from r_min and r_max by (5.1).

m = n | δ    | ASD: r_min r_max ρ_min ρ_max | LRGeomCG: r_min r_max ρ_min ρ_max | LMaFit: r_min r_max ρ_min ρ_max
1000  | 0.05 | 18  23  0.71 0.91 | 18  24  0.71 0.95 | 12  22  0.48 0.87
1000  | 0.10 | 43  48  0.84 0.94 | 44  48  0.86 0.94 | 30  44  0.59 0.86
2000  | 0.05 | 44  47  0.87 0.93 | 44  47  0.87 0.93 | 25  41  0.50 0.81
2000  | 0.10 | 92  95  0.90 0.93 | 92  96  0.90 0.94 | 54  84  0.53 0.82
4000  | 0.05 | 91  94  0.90 0.93 | 92  95  0.91 0.94 | 65  84  0.64 0.83
4000  | 0.10 | 186 190 0.91 0.93 | 188 191 0.92 0.93 | 141 172 0.69 0.84
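The ratios in (5.1) are simple to compute from a problem's dimensions; a small sketch:

```python
def phase_transition_ratios(m, n, r, p):
    # Undersampling ratio delta = p / (m n) and oversampling ratio rho = d_r / p,
    # where d_r = r (m + n - r) is the number of degrees of freedom of an m x n rank r matrix.
    d_r = r * (m + n - r)
    return p / (m * n), d_r / p
```

For example, with m = n = 1000, δ = 0.05 (so p = 50000) and r = 18, the function returns ρ ≈ 0.71, the first ρ_min entry reported in Tab. 1.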

5.2.2. Recovery time. In this section we evaluate the recovery time of the aforementioned algorithms when applied to the previously described randomly generated rank r matrices. The relative error, number of iterations, and computational time reported in this section are averages over 10 random tests. All algorithms were supplied with the true rank, except LMaFit which used the decreasing strategy with the initial rank 1.25r as advocated in [32]. The stopping tolerance on ‖P_Ω(Z_out − Z_0)‖_F / ‖P_Ω(Z_0)‖_F was set to the same value for each algorithm, and all other parameters were set to their default values. The simulation was conducted on a Linux machine with Intel Xeon E5-2643 CPUs @ 3.30 GHz and 64GB memory, and executed from Matlab R2013a.

The performance of ASD, ScaledASD, and LMaFit is compared in Tab. 2 for medium size matrices with m = n = 1000, undersampling ratio δ ∈ {0.1, 0.3, 0.5}, and the rank of the matrix taken so that p/d_r ≈ 2.0 and 1.27. Table 2 shows that LMaFit takes many fewer iterations than ASD and ScaledASD; however, both ASD and ScaledASD can require much less computational time than LMaFit due to their lower per iteration computational complexity. ScaledASD is observed to take fewer iterations than ASD and is typically faster, with the advantage of ScaledASD over ASD increasing as δ = p/mn increases for p/d_r approximately fixed. This is because the more entries of a matrix are known, the more likely it is that ScaledASD will behave like a second order method; in particular, if all the entries have been observed, ScaledASD is an alternating Newton method for (4.1) while ASD is an alternating gradient descent method.

Next we compare the algorithms, excluding the slowest to converge, on larger problem sizes and explore how their performance changes with additive noise. Tests with additive noise have the sampled entries P_Ω(Z_0) corrupted by the vector

(5.2)   e = \epsilon\, \|P_\Omega(Z_0)\|_2\, \frac{w}{\|w\|_2},

where the entries of w are i.i.d. standard Gaussian random variables and ε is referred to as the noise level. The tests conducted here consider ε to be either zero or 0.1. Tests are conducted for m = n ∈ {4000, 8000, 16000}, r ∈ {40, 80}, and p/d_r ∈ {3, 5}. The results are presented in Tabs. 3 and 4 for noise levels ε = 0 and 0.1 respectively.

Table 3 shows that for ε = 0 two of the algorithms consistently require less time than the others, with the algorithm requiring the least time differing between p/d_r = 5 and p/d_r = 3. Table 4 shows that for ε = 0.1 the relative performance varies over the twelve (p, m, n, r) tuples tested: the least time is achieved by one algorithm in four instances, by another in one instance, by a third in three instances, and by a fourth in four instances. Despite the efficiency of one of the algorithms for random sampling of the natural images in Fig. 4 (a,c,e), Tabs. 3 and 4 show that it consistently requires substantially greater computational time, for random rank r matrices, than the other algorithms tested.

Tables 3 and 4 show the efficiency of ASD and ScaledASD when the random problems are solved to moderate accuracy. However, it is worth noting that these randomly generated matrices are well-conditioned with high probability; as ASD and ScaledASD are both first order methods, it can be expected that they will suffer from slow asymptotic convergence for severely ill-conditioned matrices, in which case a higher order method such as those of [4, 26] may be preferable; see [4, 26] for a discussion of this phenomenon.
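The random test problems, with or without the additive noise of (5.2), can be generated without ever forming the full matrix Z_0. The sketch below uses the same assumed coordinate-list convention as the earlier sketches and is illustrative only.

```python
import numpy as np

def random_completion_problem(m, n, r, p, eps=0.0, seed=0):
    # Z0 = X Y with i.i.d. N(0,1) factors; p entries are sampled uniformly at random and,
    # for eps > 0, corrupted by e = eps * ||P_Omega(Z0)||_2 * w / ||w||_2 as in (5.2).
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((m, r))
    Y = rng.standard_normal((r, n))
    idx = rng.choice(m * n, size=p, replace=False)
    rows, cols = np.unravel_index(idx, (m, n))
    vals = np.einsum("ij,ji->i", X[rows, :], Y[:, cols])   # P_Omega(Z0) without forming Z0
    if eps > 0:
        w = rng.standard_normal(p)
        vals = vals + eps * np.linalg.norm(vals) * w / np.linalg.norm(w)
    return rows, cols, vals
```

For the largest tests considered here (m = n = 16000) drawing the index set this way is memory intensive but feasible; a streaming or per-column sampling scheme could be substituted.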

Table 2
Average computational time (seconds) and average number of iterations of ASD, ScaledASD, and LMaFit over ten random rank r matrices per (p, m, n, r) tuple for m = 1000, n = 1000, δ ∈ {0.1, 0.3, 0.5}. Each column group reports rel.err, iter, time.

                      p/d_r ≈ 2.05          | p/d_r ≈ 1.28
p/mn = 0.1 (r = 25 and r = 40)
ASD                   3.5e-5  103   1.5     | 9.0e-5  582   14.1
ScaledASD             3.5e-5   97   1.5     | 9.1e-5  533   13.3
LMaFit                2.6e-5   29  11.0     | 8.1e-5  141   85.4
p/mn = 0.3 (r = 75 and r = 125)
ASD                   2.6e-5   63   8.8     | 7.0e-5  373   88.7
ScaledASD             2.6e-5   48   6.8     | 7.0e-5  252   61.6
LMaFit                2.1e-5   20  14.0     | 6.7e-5   93  110
p/mn = 0.5 (r = 130 and r = 220)
ASD                   2.1e-5   64  26.3     | 5.8e-5  361  255
ScaledASD             2.3e-5   33  13.7     | 6.0e-5  155  111
LMaFit                1.6e-5   16  16.5     | 5.7e-5   65  124

Table 3
Average computational time (seconds) and average number of iterations of ASD, ScaledASD, and the other tested algorithms over ten random rank r matrices per (p, m, n, r) tuple for m = n ∈ {4000, 8000, 16000}, r ∈ {40, 80} and p/d_r ∈ {3, 5}. The noise level ε is equal to 0. Each column group reports rel.err, iter, time; within each m = n block, each row corresponds to one of the tested algorithms.

              r = 40, p/d_r = 3 | r = 40, p/d_r = 5 | r = 80, p/d_r = 3 | r = 80, p/d_r = 5
m = n = 4000
              2.1e-5  42  11.0  | 1.4e-5  23   9.6  | 2.0e-5  35   36.1 | 1.1e-5  20   33.7
              2.0e-5  41  10.6  | 1.4e-5  22   9.2  | 1.9e-5  33   34.1 | 1.2e-5  18   30.4
              2.1e-5  68  12.0  | 1.3e-5  35  10.4  | 2.0e-5  55   37.8 | 1.2e-5  31   35.9
              1.3e-5  23   9.4  | 1.0e-5  15   9.8  | 1.0e-5  21   33.6 | 6.9e-6  14   36.0
              1.7e-5  24  76.1  | 1.3e-5  16  116   | 1.3e-5  23  967   | 1.1e-5  14  1609
m = n = 8000
              2.1e-5  43  24.6  | 1.4e-5  23  21.1  | 1.9e-5  37   80.8 | 1.4e-5  20   70.9
              2.1e-5  43  24.4  | 1.4e-5  23  20.8  | 1.9e-5  36   78.6 | 1.4e-5  20   68.6
              1.9e-5  73  29.2  | 1.5e-5  38  24.6  | 1.9e-5  62   94.0 | 1.3e-5  35   87.0
              1.7e-5  23  20.4  | 8.1e-6  16  22.2  | 1.3e-5  22   74.1 | 1.2e-5  15   77.4
              1.4e-5  29  179   | 8.9e-6  16  265   | 1.7e-5  23  2330  | 9.5e-6  15  3445
m = n = 16000
              2.1e-5  44  61.4  | 1.4e-5  23  50.4  | 1.9e-5  38  201   | 1.1e-5  21  169
              2.0e-5  43  56.4  | 1.4e-5  23  47.8  | 2.0e-5  37  186   | 1.4e-5  20  159
              2.2e-5  81  77.5  | 1.3e-5  41  62.6  | 1.7e-5  69  242   | 1.4e-5  37  210
              1.6e-5  24  47.5  | 1.0e-5  16  49.9  | 1.4e-5  22  167   | 9.7e-6  15  178
              1.5e-5  25  495   | 1.0e-5  16  704   | 1.4e-5  23  6063  | 1.0e-5  16  8681

Table 4
Average computational time (seconds) and average number of iterations of ASD, ScaledASD, and the other tested algorithms over ten random rank r matrices per (p, m, n, r) tuple for m = n ∈ {4000, 8000, 16000}, r ∈ {40, 80} and p/d_r ∈ {3, 5}. The noise level ε is equal to 0.1. Each column group reports rel.err, iter, time; within each m = n block, each row corresponds to one of the tested algorithms.

              r = 40, p/d_r = 3 | r = 40, p/d_r = 5 | r = 80, p/d_r = 3 | r = 80, p/d_r = 5
m = n = 4000
              7.1e-2  21   5.6  | 5.0e-2  19   8.2  | 7.0e-2  21  21.6  | 4.9e-2  19   32.0
              7.1e-2  21   5.6  | 5.0e-2  19   8.3  | 7.0e-2  21  21.8  | 4.9e-2  19   32.2
              7.1e-2  30   5.8  | 5.0e-2  20   6.3  | 7.0e-2  26  18.8  | 4.9e-2  16   19.9
              7.1e-2  14   5.8  | 5.0e-2  11   7.2  | 7.0e-2  13  20.9  | 4.9e-2  11   28.4
              7.1e-2  15  69.8  | 5.0e-2  10  113   | 7.0e-2  14  984   | 4.9e-2   9  1524
m = n = 8000
              7.1e-2  21  12.3  | 5.0e-2  19  17.9  | 7.0e-2  21  46.2  | 5.0e-2  19   67.1
              7.1e-2  21  12.3  | 5.0e-2  19  18.0  | 7.0e-2  21  46.5  | 5.0e-2  19   67.5
              7.1e-2  34  14.8  | 5.0e-2  22  15.7  | 7.0e-2  31  49.7  | 5.0e-2  20   52.7
              7.1e-2  14  12.7  | 5.0e-2  11  15.5  | 7.0e-2  13  44.7  | 5.0e-2  11   60.8
              7.1e-2  16  171   | 5.0e-2  10  264   | 7.0e-2  14  2337  | 5.0e-2  10  3623
m = n = 16000
              7.1e-2  21  29.8  | 5.0e-2  20  41.9  | 7.1e-2  21  108   | 5.0e-2  19  157
              7.1e-2  21  28.2  | 5.0e-2  20  42.3  | 7.1e-2  21  106   | 5.0e-2  19  152
              7.3e-2  29  31.1  | 5.0e-2  26  42.4  | 7.1e-2  32  119   | 5.0e-2  22  135
              7.1e-2  15  29.4  | 5.0e-2  11  35.7  | 7.1e-2  14  106   | 5.0e-2  11  133
              7.1e-2  16  651   | 5.0e-2  11  757   | 7.1e-2  15  6285  | 5.0e-2  10  9227

6. Conclusion and future directions. In this paper, we have proposed an alternating steepest descent method, ASD, and a scaled variant, ScaledASD, for matrix completion. Despite their simplicity, empirical evaluation of ASD and ScaledASD on both image inpainting and random problems shows them to be competitive with other state-of-the-art matrix completion algorithms in terms of recoverable rank and overall computational time. In particular, their low per iteration computational complexity makes ASD and ScaledASD efficient for large size problems, especially when computing the solutions to moderate accuracy. Although ASD and ScaledASD are implemented for fixed rank problems in this paper, the common rank increasing or decreasing heuristics can be incorporated into them as well. For ScaledASD, it would be interesting to try different scaling techniques, especially for more diverse applications. In a different direction, a recent paper [19] proves that PowerFactorization with a specific initial point is guaranteed to find the objective matrix with high probability if the number of measurements is sufficiently large; a similar analysis for ASD appears to be a constructive approach.

REFERENCES

[1] Y. Amit, M. Fink, N. Srebro, and S. Ullman. Uncovering shared structures in multiclass classification. In Proceedings of the 24th International Conference on Machine Learning, ACM, pages 17-24, 2007.
[2] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. Advances in Neural Information Processing Systems, 19:41-48, 2007.
[3] J. Blanchard, J. Tanner, and K. Wei. CGIHT: Conjugate gradient iterative hard thresholding for compressed sensing and matrix completion. Numerical Analysis Group preprint 14/08, University of Oxford, 2014.

Alternating Steepest Descent for Matrix Completion 17 pressed sensing and matrix completion. Numerical analysis group preprint 14/08, University of Oxford, 2014. [4] N. Boumal and P.-A. Absil. RTRMC: A Riemannian trust-region method for low-rank matrix completion. In J. Shawe-Taylor, R.S. Zemel, P. Bartlett,.C.N. Pereira, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 24 (NIPS), pages 406 414. 2011. [5] J.. Cai, E. J. Candès, and Z. W. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956 1982, 2010. [6] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. oundations of Computational Mathematics, 9(6):717 772, 2009. [7] E. J. Candès and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053 1080, 2009. [8] W. Dai, O. Milenkovic, and E. Kerman. Subspace evolution and transfer (SET) for low-rank matrix completion. IEEE Transactions on Signal Processing, 59(7):3120 3132, 2011. [9] Y. C. Eldar, D. Needell, and Y. Plan. Unicity conditions for low-rank matrix recovery. Applied and Computational Harmonic Analysis, 33(2):309 314, 2012. [10] L. Eldén. Matrix methods in data mining and pattern recognization. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2007. [11] R. Escalante and M. Raydan. Alternating Projection Methods. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2011. [12] M. azel. Matrix rank minimization with applications. Ph. D. dissertation, Stanford University, 2002. [13] D. Goldfarb and S. Ma. Convergence of fixed-point continuation algorithms for matrix rank minimization. oundations of Computational Mathematics, 11(2):183 210, 2011. [14] L. Grippo and M. Sciandrone. Gloabally convergence block-coordinate techniques for unconstrained optimization. Optimization Methods & Software, 10(4):587 637, 1999. [15] D. Gross. Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. on Information Theory, 57(3):1548 1566, 2011. [16] J. P. Haldar and D. Hernando. Rank-constrained solutions to linear matrix equations using Poweractorization. IEEE Signal Processing Letters, 16(7):584 587, 2009. [17] N. J. A. Harvey, D. R. Karger, and S. Yekhanin. The complexity of matrix completion. SODA 2006, Proceedings of the seventeenth annual ACM-SIAM symposium on discrete algorithms, pages 1103 1111, 2006. [18] P. Jain, R. Meka, and I. Dhillon. Guaranteed rank minimization via singular value projection. Proceedings of the Neural Information Processing Systems Conference (NIPS), pages 937 945, 2010. [19] P. Jain, P. Netrapalli, and S. Sanghavi. Low-rank matrix completion using alternating minimization. arxiv:1212.0467, 2012. [20] Toh. K.-C. and S. Yun. An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Pacific J. Optimization, 6:615 640, 2010. [21] A. Kyrillidis and V. Cevher. Matrix ALPS: Accelerated low rank and sparse matrix reconstruction. Technical report, 2012. [22] A. Kyrillidis and V. Cevher. Matrix recipes for hard thresholding methods. Journal of Mathematical Imaging and Vision, 48(2):235 265, 2014. [23] Z. Liu and L. Vandenberghe. Interior-point method for nuclear norm approximation with application to system identification. SIAM Journal on Matrix Analysis and Applications, 31:1235 1256, 2009. [24] B. Mishra, K. A. Apuroop, and R. Sepulchre. A Riemannian geometry for low-rank matrix completion. 
[25] B. Mishra, G. Meyer, S. Bonnabel, and R. Sepulchre. Fixed-rank matrix factorizations and Riemannian low-rank optimization. Computational Statistics, 2013. doi:10.1007/s00180-013-0464-z.
[26] T. Ngo and Y. Saad. Scaled gradients on Grassmann manifolds for matrix completion. In Advances in Neural Information Processing Systems (NIPS), 2012.
[27] B. Recht. A simpler approach to matrix completion. The Journal of Machine Learning Research, 12:3413-3430, 2011.
[28] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471-501, 2010.
[29] J. Tanner and K. Wei. Normalized iterative hard thresholding for matrix completion. SIAM Journal on Scientific Computing, 35(5):S104-S125, 2013.

[30] K. Toh and S. Yun. An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Pacific Journal of Optimization, 6:615-640, 2010.
[31] B. Vandereycken. Low rank matrix completion by Riemannian optimization. SIAM Journal on Optimization, 23(2):1214-1236, 2013.
[32] Z. Wen, W. Yin, and Y. Zhang. Solving a low-rank factorization model for matrix completion by a non-linear successive over-relaxation algorithm. Mathematical Programming Computation, 4:333-361, 2012.