arxiv: v2 [cs.na] 10 Feb 2018

Size: px

Start display at page:

Download "arxiv: v2 [cs.na] 10 Feb 2018"

Clyde Goodman
6 years ago
Views:

1 Efficient Robust Matrix Factorization with Nonconvex Penalties Quanming Yao, James T. Kwok Department of Computer Science and Engineering Hong Kong University of Science and Technology Hong Kong { qyaoaa, jamesk }@cse.ust.hk arxiv: v2 [cs.na] 10 Feb 2018 Abstract Robust matrix factorization (RMF), which uses the l 1-loss, often outperforms standard matrix factorization using the l 2- loss, particularly when outliers are present. The state-of-theart RMF solver is the RMF-MM algorithm, which, however, cannot utilize data sparsity. Moreover, sometimes even the (convex) l 1-loss is not robust enough. In this paper, we propose the use of nonconvex loss to enhance robustness. To address the resultant difficult optimization problem, we use majorization-minimization (MM) optimization and propose a new MM surrogate. To improve scalability, we exploit data sparsity and optimize the surrogate via its dual with the accelerated proximal gradient algorithm. The resultant algorithm has low time and space complexities and is guaranteed to converge to a critical point. Extensive experiments demonstrate its superiority over the state-of-theart in terms of both accuracy and scalability. Introduction Matrix factorization (MF) is a fundamental machine learning tool, and serves as an important component in many applications such as computer vision (Basri, Jacobs, and Kemelmacher 2007; Gu et al. 2014; Ji et al. 2010), social networks (Yang and Leskovec 2013) and recommender systems (Koren 2008). The square loss has been commonly used in MF (Mnih and Salakhutdinov 2008; Koren 2008; Candès and Recht 2009). However, this implicitly assumes the Gaussian noise, and is sensitive to outliers. Eriksson & Van Den Hengel (2010) proposed robust matrix factorization (RMF), which uses the l 1 -loss instead, and obtains much better empirical performance. Unfortunately, the resultant nonconvex nonsmooth optimization problem is much more difficult. Most RMF solvers are not scalable (Eriksson and Van Den Hengel 2010; Zheng et al. 2012; Cabral et al. 2013; Meng et al. 2013; Kim et al. 2015). Recently, Cambier & Absil (2016) proposed the robust matrix completion (RMC) algorithm, which smooths the l 1 -loss and then performs gradient descent on the Riemannian manifold. empirically, it can be used on large sparse matrices. However, as smoothing is used, RMC only approximates the RMF solution. Copyright c 2018, Association for the Advancement of Artificial Intelligence ( All rights reserved. The current state-of-the-art RMF solver is the RMF-MM (Lin, Xu, and Zha 2017), which is based on the technique of majorization minimization (MM) (Lange, Hunter, and Yang 2000; Hunter and Lange 2004; Mairal 2013). In each iteration, instead of optimizing the original RMF objective, a convex nonsmooth surrogate is constructed and optimized by the linearized alternating direction method with parallel splitting and adaptive penalty (LADMPSAP) algorithm (Lin, Liu, and Li 2015), a variant of the alternating direction method of multipliers (ADMM) (Boyd et al. 2011). Empirically, RMF-MM has fast convergence. Besides, it is the only RMF solver with convergence guarantees. However, it does not utilize data sparsity, and so cannot be used on large sparse matrices. Though the l 1 -loss used in RMF is more robust than the l 2 -loss, it may still not be robust enough for outliers. A similar observation is also made recently on the l 1 -regularizer in sparse learning and low-rank matrix learning (Gong et al. 2013; Zuo et al. 2013; Gu et al. 2014; Lu et al. 2016; Yao and Kwok 2016). To alleivate this problem, various nonconvex regularizers have been introduced. Examples includue the Geman penalty (Geman and Yang 1995), logsum penalty (Candès, Wakin, and Boyd 2008) and Laplace penalty (Trzasko and Manduca 2009). Empirically, they achieve much better performance on tasks such as feature selection (Gong et al. 2013; Zuo et al. 2013) and image denoising (Gu et al. 2014; Lu et al. 2016). In this paper, we propose to improve the robustness of RMF by using these nonconvex functions (instead of l 1 or l 2 ) as the loss function. Note that the above papers used these nonconvex functions as regularizers, not as the loss. The resultant optimization problem is difficult, and existing RMF solvers cannot be used. As for RMF-MM, we rely on the more flexible MM optimization technique, and a new MM surrogate is proposed. To improve scalabiltiy, we transform the surrogate to its dual and then solve it with the accelerated proximal gradient (APG) algorithm (Beck 2009; Nesterov 2013). Data sparsity can also be exploited in the design of the APG algorithm. As for its convergence analysis, proof techniques in RMF-MM cannot be used as the loss is no longer convex. Instead, we develop new proof techniques based on the Clarke subdifferential, and show that convergence to a critical point can be guaranteed. Extensive experiments demonstrate the superiority of the

2 proposed algorithm over the state-of-the-art in terms of both accuracy and scalability. Notation For scalar x, sign (x) = 1 if x > 0, 0 if x = 0, and 1 otherwise. For a vector x, Diag(x) constructs a diagonal matrix X with X ii = x i. For a matrix X, X F = ( i,j X2 ij )1/2 is its Frobenius norm, X 1 = i,j X ij is its l 1 -norm, and nnz(x) is the number of nonzero elements in X. for a square matrix X, tr(x) = i X ii is its trace. for two matrices X, Y, denotes element-wise product. For a smooth function f, f is its gradient. for a convex f, G f(x) = {U : f(y ) f(x) + tr(u (Y X))} is a subgradient. for a continuous f, f is the Clarke subdifferential (Clarke 1990). Related Work Majorization Minimization Majorization minimization (MM), also called optimization transfer, is a general technique to make difficult optimization problems easier (Lange, Hunter, and Yang 2000; Hunter and Lange 2004; Mairal 2013). Many algorithms can be seen as special cases of the MM algorithm. Examples include the EM algorithm, proximal algorithm, and difference of convex programming. Consider a function h(x), which is hard to optimize. In using MM, let the iterate at the kth iteration be X k. The next iterate is generated as X k+1 = X k + arg min X f k (X), (1) where f k is a surrogate that is being optimized instead of h. As suggested in (Lange, Hunter, and Yang 2000), a good surrogate should have the following properties: (a) h(x k + X) f k (X) for any X k ; (b) 0=arg min X f k (X) h(x k + X) and h(x k ) = f k (0); (c) f k is convex. Robust Matrix Factorization (RMF) In matrix factorization (MF), a matrix M R m n is approximated by UV, where U R m r, V R n r and r min(m, n). Moreover, some entries of M may be missing, as in applications such as structure from motion (SfM) (Basri, Jacobs, and Kemelmacher 2007) and recommender systems (Koren 2008). The MF problem can be formulated as: min U,V 1 2 W (M UV ) 2 F + λ 2 ( U 2 F + U 2 F ), (2) where W {0, 1} m n contain indices to missing entries in M (with W ij = 1 if M ij is observed, and 0 otherwise), and λ 0 is a regularization parameter. The l 2 -loss in (2) is sensitive to outliers. De La Torre & Black (2003) replaced it by the l 1 -loss, leading to robust matrix factorization (RMF): min U,V W (M UV ) 1 + λ 2 ( U 2 F + V 2 F ). (3) Many RMF solvers have been developed. However, as (3) is neither convex nor smooth, these solvers lack scalability and/or robustness and/or convergence guarantees. Very recently, Lin et al. (2017) proposed the RMF-MM algorithm, which solves (3) using MM. Let the iterate at the kth iteration be (U k, V k ). RMF-MM finds increments (Ū, V ) to be added to the target (U, V ) solution: U = U k + Ū, V = V k + V. (4) Given (U k, V k ), the objective in (3) can be rewritten as H k (Ū, V ) W (M (U k + Ū)(V k + V ) ) 1 + λ 2 U k + Ū 2 F + λ 2 V k + V 2 F. (5) The following Proposition constructs a surrogate F k that satisfies properties (a) and (b) in Section. Moreover, unlike H k, F k is jointly convex on (Ū, V ). Proposition 0.1. (Lin, Xu, and Zha 2017) Let nnz(w (i,:) ) (resp. nnz(w (:,j) )) be the number of nonzero elements in the ith row (resp. jth column) of W, Λ r = Diag( nnz(w (1,:) ),..., nnz(w (m,:) )), and Λ c = Diag( nnz(w (:,1) ),..., nnz(w (:,n) )). Then, H k (Ū, V ) F k (Ū, V ), where F k (Ū, V ) W (M U k (V k ) Ū(V k ) U k V ) 1 + λ 2 U k + Ū 2 F Λ rū 2 F + λ 2 V k + V 2 F Λ c V 2 F. (6) Equality holds iff (Ū, V ) = (0, 0). However, because of the coupling of Ū, V k (resp. U k, V ) in Ū(V k ) (resp. U k V ) in (6), F k is still difficult to optimize. To address this problem, RMF-MM uses the linearized alternating direction method with parallel splitting and adaptive penalty (LADMPSAP) (Lin, Liu, and Li 2015), a multi-block variant of the alternating direction method of multipliers (ADMM) (Boyd et al. 2011). RMF-MM converges to a critical point of (3). Empirically, it converges faster than the other RMF solvers. However, due to nonconvexity of (3), its convergence rate is not known. Moreover, its space and time complexities are O(mn), and O(mnrIK), respectively, where I is the number of LADMPSAP iterations. They both grow linearly with the matrix size, and may be computationally expensive on large data sets. Besides, the l 1 -loss can still be sensitive to outliers. Proposed Algorithm In this paper, we replace the l 1 -loss by a general nonconvex loss. Section introduces the nonconvex RMF formulation. The resultant optimization problem is more difficult to solve, and existing RMF solvers cannot be used as they rely crucially on the l 1 -norm. Instead, we use the more flexible MM technique. To handle nonconvexity, a new surrogate, based on relaxation back to l 1 and variable decoupling, is proposed in Section. Instead of using LADMPSAP

3 as in RMF-MM, we propose in Section a more timeand space-efficient procedure by running the accelerated proximal gradient (APG) algorithm on the surrogate s dual. In particular, the underlying gradient computations can efficiently utilize data sparsity. The complete algorithm is presented in Section, and convergence is analyzed in Section. Nonconvex Loss With a nonconvex loss, problem (3) is changed to: m n Ḣ(U, V ) W ij φ ( M ij [UV ] ij ) min U,V i=1 j=1 + λ 2 ( U 2 F + V 2 F ), (7) where φ is nonconvex. We assume the following on φ: Assumption 1. φ(α) is concave, smooth and strictly increasing on α 0. Obviously, the (convex) l 1 function also satisfies Assumption 1, and thus (7) includes (3). As shown in Table 1, Assumption 1 is satisfied by popular nonconvex functions including the Geman penalty (Geman and Yang 1995), Laplace penalty (Trzasko and Manduca 2009), the log-sum-penalty (LSP) (Candès, Wakin, and Boyd 2008), and slightly modified variants of the minimax concave penalty (MCP) (Zhang 2010) and smooth-capped-absolutedeviation (SCAD) penalty (Fan and Li 2001). Note that unlike previous papers (Gong et al. 2013; Zuo et al. 2013; Yao and Kwok 2016), we use these nonconvex functions as the loss, not as the regularizer. Table 1: Example nonconvex regularizers (θ > 2 for SCAD and θ > 0 for others is a constant). Here, δ > 0 is a small constant to ensure that the φ s for MCP and SCAD are strictly increasing. Geman penalty Laplace penalty log-sum-penalty (LSP) minimax concave penalty (MCP) smoothly clipped absolute deviation (SCAD) penalty φ( α ) α θ+ α ( α θ ) 1 exp ( ) log 1 + α θ { (1 + δ)α α2 2θ α θ 1 2 θ2 + δα α > θ (1 + δ)α α 1 α 2 +2θα 1 2(θ 1) + δα 1 < α θ (1+θ) 2 + δα α > θ Note from (7) that when the ith row of W is zero, the ith row of U obtained is also zero because of the U 2 F regularizer. Similarly, when the ith column of W is zero, the corresponding column in V is also zero. To avoid this trivial solution, we make the following Assumption, as in matrix completion (Candès and Recht 2009) and RMF-MM. Assumption 2. W has no zero row or column. Constructing the Surrogate The following Proposition first obtains a convex upper bound of φ using the Taylor expansion. Proposition 0.2. φ( α ) φ ( β ) α +(φ( β ) φ ( β ) β ) for any given β R. Equality holds iff α = ±β. Figure 1 shows the upper bounds for the nonconvex penalties in Table 1. Interestingly, the upper bound is simply a scaled l 1 -loss (with scaling factor φ ( β )) and offset by φ( β ) φ ( β ) β. As one can expect, recovery of the l 1 - loss will make optimization easier. At iteration k, Ḣ in (7) can be written as: (Ū, V ) m n i=1 j=1 W ijφ( M ij [(U k + Ū)(V k + V ) ] ij ) + λ 2 U k + Ū 2 F + λ 2 V k + V 2 F, where (Ū, V ) is as defined in (4). Applying Proposition 0.2 on the first term of Ḣk, we obtain the following. Corollary 0.3. Let A k ij = φ ( [U k (V k ) ] ij ), then Ḣ k (Ū, V ) b k + λ 2 U k + Ū 2 F + λ 2 V k + V 2 F + Ẇ k (M U k (V k ) Ū(V k ) U k V Ū V ) 1, where b k = m n i=1 j=1 W ij(φ ( ) [U k (V k ) ] ij [U k (V k ) ] ij ) and Ẇ k = A k W. A k ij The product Ū V inside the l 1 term still couples V Ū and together. As Ḣk is similar as H k in (5), one may want to reuse Proposition 0.1. However, note that the elements in A k are positive due to Assumption 1 and thus Ẇ k is a nonnegative matrix, while W in (3) is a binary matrix. Using the Cauchy-Schwarz inequality, the following Lemma decouples Ū and V. This then allows the design of a convex surrogate F k that upper-bounds Ḣk. Lemma 0.4. Let sum(ẇ (i,:) k ) = n j=1 Ẇ ij k (resp. sum(ẇ (:,j) k ) = m i=1 Ẇ ij k ) be the sum of elements on row i (resp. column j) of Ẇ k. Then, Ẇ k (Ū V ) Λk rū 2 F Λk c V 2 F, where Λ k r = Diag( sum(ẇ (1,:) k ),..., Λ k c = Diag( sum(ẇ (:,1) k ),..., holds iff (Ū, V ) = (0, 0). Proposition 0.5. sum(ẇ k (m,:) sum(ẇ k (:,n) Ḣ k (Ū, V ) F k (Ū, V ), where )) and )). Equality F k (Ū, V ) Ẇ k (M U k (V k ) Ū(V k ) U k V ) 1 + λ 2 U k + Ū 2 F Λk rū 2 F + λ 2 V k + V 2 F Λk c V 2 F + b k. (8) Equality holds iff (Ū, V ) = (0, 0). In the special case where the l 1 -loss is used, Ẇ k = W, b k = 0 for all k s, and Λ k r = Λ r, Λ k c = Λ c. The surrogate in (8) then reduces to that in (6), and Proposition 0.5 becomes Proposition 0.1.

4 (a) Geman. (b) Laplace. (c) LSP. (d) MCP (modified). (e) SCAD (modified). Figure 1: Illustration of the upper bound in Proposition 0.2 for the various nonconvex penalities in Table 1 β = 1, θ = 2.5 for SCAD and θ = 0.5 for the others; and δ = 0.05 for MCP and SCAD. It can be easily seen that F k qualifies as a good surrogate in Section : (a) Ḣ(Ū + U k, V + V k ) F k (Ū, V ); (b) (0, 0) = arg min Ū, V F k (Ū, V ) Ḣk (Ū, V ) and F k (0, 0) = Ḣ(0, 0); and (c) F k is jointly convex in Ū, V. Optimizing the Surrogate Optimizing using LADMPSAP LADMPSAP, which has been used in RMF-MM, can also be used to optimize (8). 1 However, LADMPSAP cannot utilize possible sparsity of W. Moreover, LADMPSAP converges at a rate of O(1/T ) (Lin, Liu, and Li 2015), which is slow. Optimizing using APG We first transform the surrogate in (8) to the dual, which then has only nnz(w ) variables (instead of O(mn)). Let Ω {(i 1, j 1 ),..., (i nnz(w ), j nnz(w ) )} be the set containing indices to the nonzero elements in W. Let H( ) be the linear operator which maps a nnz(w )- dimensional vector x to a sparse matrix X R m n with nonzero positions indicated by Ω, i.e., X itj t = x t where (i t, j t ) is the tth element in Ω. Let H 1 ( ) be the inverse operator of H, which extracts an nnz(w )-dimensional vector x from X with x t = X itj t. Proposition 0.6. Let ẇ k = H 1 (Ẇ k ). The dual problem of F k is min Ū, V min D k (x) x W k 1 2 tr((h(x)v k λu k ) A k r(h(x)v k λu k )) tr(h(x) M) tr((h(x) U k λv k ) A k c (H(x) U k λv k )), (9) 1 Because of the lack of space, the detailed update equations will not be included here. where W k {x R nnz(w ) : x i [ẇ k ] 1 i }, A k r = (λi + (Λ k r) 2 ) 1, and A k c = (λi + (Λ k c ) 2 ) 1. From the obtained x, the primal (Ū, V ) solution can be recovered as Ū = A k r(h(x)v k λu k ) and V = A k c (H(x) U k λv k ). Problem (9) can be solved by the APG algorithm, which has a convergence rate of O(1/T 2 ) (Beck 2009; Nesterov 2013) and is faster than LADMPSAP. In each iteration, one needs to compute the gradient D k of the objective: D k (x) = H 1 (A k r(h(x)v k λu k )(V k ) ) + H 1 (U k [(U k ) H(x) λ(v k ) ]A k c ) H 1 (M). (10) The following Proposition shows that the proximal step has a closed-form solution. As x is nnz(w )-dimensional, computing the proximal step takes only O(nnz(W )) time. Proposition 0.7. Let x = arg min 1 x W k 2 x z 2 F for a given z. Then, x i = sign (z i) min( z i, (ẇi k) 1 ). Exploiting Sparsity A direct implementation of APG takes O(mnr) time and O(mn) space. In the following, we show how these can be reduced by exploiting sparsity. Objective (9) involves A k r, A k c and W k, which are all related to Ẇ k. From its definition in Corollary 0.3, Ẇ k is sparse (as W is sparse). Thus, by exploting sparsity, constructing A k r, A k c and W k only take O(nnz(W )) time and space. In each APG iteration, one has to compute the objective, gradient, and proximal step. First, consider computing the gradient in (10). The first term can be rewritten as ĝ k = H 1 (Q k (V k ) ), where Q k = A k r(h(x)v k λu k ). As A k r is diagonal and H(x) is sparse, Q k can be computed as A k r(h(x)v k ) λ(a k ru k ) in O(nnz(W )r + mr) time, where r is the number of columns in U k and V k. Let the tth element in Ω be (i t, j t ). By the definition of H 1 ( ), we have ĝt k = r q=1 Qk i V k tq j tq, and this takes O(nnz(W )r + mr)

5 time. Similarly, computing the second term in (10) takes O(nnz(W )r + nr) time. Hence, computing the gradient D k ( ) using Algorithm 1 takes a total of O(nnz(W )r+(m+ n)r) time and O(nnz(W ) + (m + n)r) space. Algorithm 1 Computing D k (x) by exploiting sparsity. 1: set X itj t = x t for all (i t, j t ) Ω; // i.e., X = H(x) 2: Q k = A k r(xv k ) λ(a k ru k ); 3: construct the nnz(w )-dimensional vector ĝ k with ĝt k = r q=1 Qk i V k tq j ; tq 4: P k = A k c (X U k ) λ(a k c V k ); 5: construct the nnz(w )-dimensional vector ğ k with ğt k = r q=1 U i k P k tq j ; // i.e., tq ğk = H 1 (U k (P k ) ) 6: return ĝ k + ğ k H 1 (M). Next, as x R nnz(w ), the proximal step in Proposition 0.7 takes O(nnz(W )) time and space. Similar to Algorithm 1, the objective can be obtained in O(nnz(W )r + (m + n)r) time and O(nnz(W ) + (m + n)r) space. Thus, by exploiting sparsity, the APG algorithm takes O(nnz(W ) + (m + n)r) space, which is much smaller than the O(mn) space required by LADMPSAP. The iteration time complexity of APG is O(nnz(W )r + (m + n)r), which is again much smaller than the O(mnr) time each LADMPSAP iteration takes. Complete Algorithm The complete procedure, which will be called Robust Matrix Factorization with Nonconvex Loss (RMFNL) algorithm is shown in Algorithm 2. The surrogate is optimized via its dual in step 4. The primal solution is recovered in step 5, and (U k, V k ) are updated in step 6. Algorithm 2 Robust matrix factorization using nonconvex loss (RMFNL) algorithm. 1: initialize U 1 R m r and V 1 R m r ; 2: for k = 1, 2,..., K do 3: compute Ẇ k in Corollary 0.3 (only on the observed positions), and Λ k r, Λ k c in Lemma 0.4; 4: compute x k = arg min x W k D k (x) in Proposition 0.6 using APG (use Algorithm 1 to compute D k (x)); 5: compute Ū k and V k in Proposition 0.6; 6: U k+1 = U k + Ū k, V k+1 = V k + V k ; 7: end for 8: return U K+1 and V K+1. Convergence Analysis In this Section, we prove convergence of the proposed RMFNL algorithm. As φ in (7) is nonconvex, Proposition 1 in (Lin, Xu, and Zha 2017) cannot be extended to our surrogate F k. The RMF-MM iterations guarantee a sufficient decrease on the objective (Lin, Xu, and Zha 2017) in the following sense: Definition 0.1 ((Nesterov 1998)). The sequence {X k } has a sufficient decrease on f if there exists a constant γ > 0 such that f(x k ) f(x k+1 ) γ X k X k+1 2 F, k. The following Proposition shows that RMFNL also achieves a sufficient decrease on its objective. Moreover, the {(U k, V k )} sequence generated is bounded, which has at least one limit point. The Proposition is similar to Theorem 1 in (Lin, Xu, and Zha 2017). However, we use a different surrogate and Λ k r, Λ k c are not fixed but change with k. Hence, their proof cannot be directly extended. Proposition 0.8. For Algorithm 2, (i) {(U k, V k )} is bounded. (ii) {(U k, V k )} has a sufficient decrease on Ḣ, i.e., Ḣ(U k, V k ) Ḣ(U k+1, V k+1 ) γ U k+1 U k 2 F + γ V k+1 V k 2 F, where γ > 0 is a constant; (iii) lim k (U k+1 U k ) = 0 and lim k (V k+1 V k ) = 0; In the following, we show that limit points from Algorithm 2 are also critical points of (3). To prove convergence, the subgradient is used in (Lin, Xu, and Zha 2017). Here, as φ is nonconvex, we will use the Clarke subdifferential (Clarke 1990), which generalizes subgradients to nonconvex functions. The following Proposition connects the subgradient of surrogate F k to the Clarke subdifferential of Ḣ. Proposition 0.9. (i) F k (0, 0) = Ḣ k (0, 0); (ii) If 0 Ḣ k (0, 0), then (U k, V k ) is a critical point of (7). Theorem The limit points of the sequence generated by Algorithm 2 are also critical points of (7). RMF-MM is the only existing RMF solver that guarantees convergence to critical points. While our (7) is more difficult than their (3) (due to the nonconvex loss), the proposed RMFNL still shares the same guarantee as RMF-MM. Moreover, as (7) is neither convex nor smooth, convergence to critical points is the best one can obtain (Nesterov 1998). Experiments In this section, we compare the proposed RMFNL with state-of-the-art MF algorithms on both synthetic and realworld data sets. Experiments are performed on a PC with Intel i7 CPU and 32GB RAM. All the codes are in Matlab, with sparse matrix operations implemented in C++. Synthetic Data We first perform experiments with synthetic data. The synthetic data is generated as X = UV, where U R m 5, V R m 5, and m = {250, 500, 1000}. Elements of U and V are sampled i.i.d. from the normal distribution N (0, 1). This is then corrupted to produce M = X + N + S, where N is the noise matrix from N (0, 0.1), and S is a sparse matrix modeling outliers with 5% nonzero elements randomly sampled from {±5}. We randomly draw 10 log(m)/m% of the elements from M as observations, with half of them for training and the other half for validation. The remaining unobserved elements are for testing. note that the larger the m, the sparser is the observed matrix. We compare RMF-MM (Lin, Xu, and Zha 2017), which uses the l 1 -loss, with the proposed RMFNL (Algorithm 2),

6 Table 2: Performance on the synthetic data set ( nnz is the percentage of nonzero elements). RMSE is scaled by 10 1 and CPU time is in seconds. The best and comparable results according to the pairwise t-test with 95% confidence are highlightened. m = 250 (nnz: 11.04%) m = 500 (nnz: 6.21%) m = 1000 (nnz: 3.45%) loss algorithm RMSE CPU time RMSE CPU time RMSE CPU time l 1 RMF-MM 1.77± ± ± ± ± ±130.9 LADMPSAP 1.09± ± ± ± ± ±138.8 LSP RMFNL APG(dense) 1.10± ± ± ± ± ±91.9 APG 1.09± ± ± ± ± ±3.2 Table 3: Performance on the recommender system data sets. The RMSE is scaled by 10 1, CPU time is in seconds for MovieLens and minutes for netflix and yahoo. The best and comparable results according to the pairwise t-test with 95% confidence are highlightened. MovieLens-100K MovieLens-10M netflix yahoo loss algorithm RMSE CPU time RMSE CPU time RMSE CPU time RMSE CPU time l 2 AltGrad 10.40± ± ± ±8.9 RP 11.53± ± ± ± ± ± ± ±3.1 ScaledASD 10.73± ± ± ± ± ± ± ±9.4 l 1 RMC 9.14± ± ± ±3.9 RMF-MM 9.09± ±116.0 LSP RMFNL 8.91± ± ± ± ± ± ± ±6.8 using the nonconvex log-sum-penalty (LSP) (Candès, Wakin, and Boyd 2008) as loss. As observed in (Gong et al. 2013; Yao, Kwok, and Zhong 2015; Lu et al. 2016; Yao and Kwok 2016), other nonconvex functions in Table 1 usually have similar behavior. We compare three different optimizers for RMFNL: (a) RMFNL(LADMPSAP) in Section ; (b) RMFNL(APG-dense) in Section, which uses APG but does not utilize data sparsity; and (c) RMFNL(APG) in Algorithm 2, which utilizes data sparsity as discussed in Section. We initialize the iterates (U 1, V 1 ) by Gaussian random matrices. The matrix rank r is set to the ground-truth (i.e., 5). optimization is stopped when the relative change in objective values is smaller than For performance evaluation, we follow (Lin, Xu, and Zha 2017) and use (i) the root mean square error RMSE = W (X Ū V T ) 2 F /nnz( W ), where the testing elements are indicated by the binary matrix W ; and (ii) CPU time. To reduce statistical variability, results are averaged over five repetitions. Table 2 shows the results. As can be seen, all RMFNL variants have lower RMSE than RMF-MM, as LSP is more robust than l 1. As for time, RMFNL(APG-dense) is faster than RMFNL(LADMPSAP), as APG has a faster convergence rate than LADMPSAP. By further utilizing data sparsity, RMFNL(APG) is the fastest. Figure 2 shows that the speedups of RMFNL(APG) over RMFNL(LADMPSAP) and RMFNL(APG-dense) increase with matrix size (m). Recall that the larger the m, the sparser is the matrix. Hence, this agrees with Table 2, which shows that the time complexity of RMFNL(APG) only depends on the number of nonzero elements, but not on the matrix size. Figure 3 shows the convergence. When measured w.r.t. the number of iterations, all RMFNL variants and RMF- MM have the same behavior. However, when measured w.r.t. time, RMFNL(APG) is the fastest. Recommender Systems In this section, we perform experiments on some popular recommendation system data sets (Table 4). Following (Yao and Kwok 2016), we use 50% of the observed ratings for training, 25% for validation, and the rest for testing. We randomly corrupt 5% of the training ratings by adding ±5 with equal probability. Table 4: Recommendation data sets used in the experiment. Here, nnz is the percentage of nonzero elements. #users #movies #ratings nnz MovieLens-100K 943 1, , % MovieLens-10M 69,878 10,677 10,000, % netflix 480,189 17, ,480, % yahoo 249, ,111 62,551, % The proposed RMFNL, using LSP as loss, is compared with the following state-of-the-art algorithms: 1. Matrix factorization algorithms using the square loss, including alternating gradient descent (AltGrad) (Mnih and Salakhutdinov 2008), Riemannian preconditioning (RP) (Mishra and Sepulchre 2016), and scaled alternating steepest descent (ScaledASD) (Tanner and Wei 2016); 2. RMF algorithms, including RMF-MM and robust matrix completion (RMC) (Cambier and Absil 2016). All hyperparameters are tuned by using the validation set. To reduce statistical variability, results are averaged over five repetitions. For performance evaluation, we use (i) the testing RMSE; and (ii) CPU time. RMF-MM cannot converge in 10 4 seconds on MovieLens- 10M, and is thus not reported (as well as on the larger netflix and yahoo data sets). AltGrad is not compared on netflix and yahoo, as it is inferior to ScaledASD on MovieLens. RMC runs out of memory on netflix and yahoo. Table 3 shows the performance. As can be seen, RMFNL

7 (a) 250. (b) 500. (c) Figure 3: Convergence on the synthetic data set. Top: RMSE vs number of iterations (note that RMFNL(APG-dense) and RMFNL(APG) overlap with each other); Bottom: RMSE vs CPU time. Table 5: Performance on the data subsets extracted from the Dinosaur sequence. CPU time is in seconds. The best and comparable results according to the pairwise t-test with 95% confidence are highlightened. D1 D2 D3 loss algorithm l 1-loss CPU time l 1-loss CPU time l 1-loss CPU time l 1 RMF-MM(heuristic) 0.374± ± ± ± ± ±3.4 RMF-MM 0.442± ± ± ± ± ±2.1 LSP RMFNL 0.323± ± ± ± ± ±1.0 Affine Rigid Structure-from-Motion (SfM) Figure 2: Speedup of RMFNL(APG) over the other two RMFNL variants (LADMPSAP and APG-dense). consistently achieves the lowest RMSE, while the l 2 -based methods (AltGrad, RP and ScaledASD) have the highest RMSE. Thus, the nonconvex LSP loss is the most robust while the l 2 -loss is the most susceptible to outliers. RMC, which approximates the l 1 -loss, has testing RMSE slightly higher than RMF-MM. Figures 4(a) and 4(b) show convergence of the testing RMSE on MovieLens. As can be seen, RMF-MM is the slowest on MovieLens-100K, as it cannot utilize data sparsity. AltGrad, RP and ScaledASD are fast, but have very high RMSEs. The speeds of RMFNL and RMC are comparable. Figures 4(c) and 4(d) show convergence on netflix and yahoo, and the observations are similar. In this section, we perform experiments on affine rigid structure-from-motion (SfM) (Koenderink and Van Doorn 1991). This involves reconstructing the 3D scene from sparse feature points tracked in m images of a moving camera. Each feature point is projected to every image plane, and is thus represented by a 2m-dimensional vector. With n feature points, this leads to a 2m n matrix. Often, this matrix has missing data (e.g., when feature points are not visible throughout the whole motion sequence) and outliers (which arise from feature mismatch). We use the Oxford Dinosaur sequence, which has 36 images and 4, 983 feature points. As in (Lin, Xu, and Zha 2017), we extract three data subsets using feature points observed in at least 5, 6 and 7 images (Table 6). The fully observed data matrix can be recovered by rank-4 matrix factorization (Eriksson and Van Den Hengel 2010; Meng et al. 2013; Zheng et al. 2012; Cabral et al. 2013; Lin, Xu, and Zha 2017). We compare RMFNL with RMF-MM, and a RMF- MM variant (denoted RMF-MM(heuristic)) described in Section 4.2 of (Lin, Xu, and Zha 2017). In this variant, the diagonal entries of Λ r and Λ c are initialized with small values and then gradually increased until they reach the upper bounds in (6). It is claimed that this enables RMF-MM to converge faster (Lin, Xu, and Zha 2017). However, our

8 (a) MovieLens-100K. (b) MovieLens-10M. (c) netflix. (d) yahoo. Figure 4: Testing RMSE vs CPU time (seconds) on recommendation data sets. RMF-MM is too slow to run on larger data sets. Table 6: Data sets extracted from the Dinosaur sequence. Here, min occurrence is the minimum number of images that all feature points in that data subset have to appear in. data subset min occurrence matrix size sparsity D % D % D % experimental results show that this heuristic leads to more accurate, but not faster, results. Moreover, the key pitfall is that Proposition 0.1 and the convergence guarantee for RMF-MM no longer holds. For performance evaluation, as there is no ground-truth, we follow (Lin, Xu, and Zha 2017) and use (i) the l 1 - loss on reconstructing the corrupted matrix W (UV X) 1 /nnz( W ), where U and V are output from the algorithm, M is the data matrix with the observed positions indicated by the binary matrix W ; and (ii) CPU time. Results are shown in Table 5. As can be seen, RMF- MM(heuristic) obtain a lower l 1 -loss than RMF-MM, but is still outperformed by RMFNL. In terms of time, RMFNL is the fastest as it can utilize data sparsity, though the speedup is not as significant as in previous sections. This is because the extracted Dinosaur data subsets are not very sparse, as can be seen from Table 6. Conclusion In this paper, we addressed two problems in robust matrix factorization (RMF). First, to enhance robustness, we proposed the use of nonconvex loss over the commonly used (convex) l 1 and l 2 -loss. Second, we improved scalabililty by exploiting data sparsity (which RMF-MM cannot) and using the accelerated proximal gradient algorithm (which is faster than the commonly used ADMM). Theoretical analysis shows that the proposed RMFNL algorithm can generate a decreasing sequence and converge to a critical point. Extensive experiments on both synthetic and realworld data sets demonstrate that RMFNL is more accurate and more scalable than the state-of-the-art. References Basri, R.; Jacobs, D.; and Kemelmacher, I Photometric stereo with general, unknown lighting. International Journal of Computer Vision 72(3): Beck, A.and Teboulle, M A fast iterative shrinkagethresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1): Boyd, S.; Parikh, N.; Chu, E.; Peleato, B.; and Eckstein, J Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3(1): Cabral, R.; De la Torre, F.; Costeira, J.; and Bernardino, A Unifying nuclear norm and bilinear factorization approaches for low-rank matrix decomposition. In International Conference on Computer Vision, Cambier, L., and Absil, P Robust low-rank matrix completion by Riemannian optimization. SIAM Journal on Scientific Computing 38(5):S440 S460. Candès, E., and Recht, B Exact matrix completion via convex optimization. Foundations of Computational Mathematics 9(6): Candès, E.; Wakin, M.; and Boyd, S Enhancing sparsity by reweighted l 1 minimization. Journal of Fourier Analysis and Applications 14(5-6): Clarke, F Optimization and nonsmooth analysis. SIAM. Eriksson, A., and Van Den Hengel, A Efficient computation of robust low-rank matrix approximations in the presence of missing data using the l 1 -norm. In Computer Vision and Pattern Recognition, Fan, J., and Li, R Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association 96(456): Geman, D., and Yang, C Nonlinear image recovery with half-quadratic regularization. IEEE Transactions on Image Processing 4(7): Gong, P.; Zhang, C.; Lu, Z.; Huang, J.; and Ye, J A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems. In International Conference on Machine Learning, Gu, S.; Zhang, L.; Zuo, W.; and Feng, X Weighted nuclear norm minimization with application to image

9 denoising. In Computer Vision and Pattern Recognition, Hunter, D., and Lange, K A tutorial on MM algorithms. The American Statistician 58(1): Ji, H.; Liu, C.; Shen, Z.; and Xu, Y Robust video denoising using low rank matrix completion. In Computer Vision and Pattern Recognition, Kim, E.; Lee, M.; Choi, C.; Kwak, N.; and Oh, S Efficient l 1 -norm-based low-rank matrix approximations for large-scale problems using alternating rectified gradient method. IEEE Transactions on Neural Networks and Learning Systems 26(2): Koenderink, J., and Van Doorn, A Affine structure from motion. Journal of the Optical Society of America 8(2): Koren, Y Factorization meets the neighborhood: a multifaceted collaborative filtering model. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Lange, K.; Hunter, R.; and Yang, I Optimization transfer using surrogate objective functions. Journal of Computational and Graphical Statistics 9(1):1 20. Lin, Z.; Liu, R.; and Li, H Linearized alternating direction method with parallel splitting and adaptive penalty for separable convex programs in machine learning. Machine Learning 2(99): Lin, Z.; Xu, C.; and Zha, H Robust matrix factorization by majorization minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence (99). Lu, C.; Tang, J.; Yan, S.; and Lin, Z Nonconvex nonsmooth low rank minimization via iteratively reweighted nuclear norm. IEEE Transactions on Image Processing 25(2): Mairal, J Optimization with first-order surrogate functions. In International Conference on Machine Learning, Meng, D.; Xu, Z.; Zhang, L.; and Zhao, J A cyclic weighted median method for l 1 low-rank matrix factorization with missing entries. In AAAI Conference on Artificial Intelligence, Mishra, B., and Sepulchre, R Riemannian preconditioning. SIAM Journal on Optimization 26(1): Mnih, A., and Salakhutdinov, R Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, Nesterov, Y Introductory lectures on convex optimization: A basic course. Springer. Nesterov, Y Gradient methods for minimizing composite functions. Mathematical Programming 140(1): Tanner, J., and Wei, K Low rank matrix completion by alternating steepest descent methods. Applied and Computational Harmonic Analysis 40(2): Trzasko, J., and Manduca, A Highly undersampled magnetic resonance image reconstruction via homotopicminimization. IEEE Transactions on Medical Imaging 28(1): Yang, J., and Leskovec, J Overlapping community detection at scale: a nonnegative matrix factorization approach. In International Conference on Web Search and Data Mining, Yao, Q., and Kwok, J Efficient learning with a family of nonconvex regularizers by redistributing nonconvexity. In International Conference on Machine Learning, Yao, Q.; Kwok, J.; and Zhong, W Fast lowrank matrix learning with nonconvex regularization. In International Conference on Data Mining, Zhang, C Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics 38(2): Zheng, Y.; Liu, G.; Sugimoto, S.; Yan, S.; and Okutomi, M Practical low-rank matrix approximation under robust l 1 -norm. In Computer Vision and Pattern Recognition, Zuo, W.; Meng, D.; Zhang, L.; Feng, X.; and Zhang, D A generalized iterated shrinkage algorithm for nonconvex sparse coding. In International Conference on Computer Vision,

Relaxed Majorization-Minimization for Non-smooth and Non-convex Optimization

Relaxed Majorization-Minimization for Non-smooth and Non-convex Optimization Chen Xu 1, Zhouchen Lin 1,2,, Zhenyu Zhao 3, Hongbin Zha 1 1 Key Laboratory of Machine Perception (MOE), School of EECS, Peking