arxiv: v2 [cs.na] 10 Feb 2018
|
|
- Clyde Goodman
- 6 years ago
- Views:
Transcription
1 Efficient Robust Matrix Factorization with Nonconvex Penalties Quanming Yao, James T. Kwok Department of Computer Science and Engineering Hong Kong University of Science and Technology Hong Kong { qyaoaa, jamesk }@cse.ust.hk arxiv: v2 [cs.na] 10 Feb 2018 Abstract Robust matrix factorization (RMF), which uses the l 1-loss, often outperforms standard matrix factorization using the l 2- loss, particularly when outliers are present. The state-of-theart RMF solver is the RMF-MM algorithm, which, however, cannot utilize data sparsity. Moreover, sometimes even the (convex) l 1-loss is not robust enough. In this paper, we propose the use of nonconvex loss to enhance robustness. To address the resultant difficult optimization problem, we use majorization-minimization (MM) optimization and propose a new MM surrogate. To improve scalability, we exploit data sparsity and optimize the surrogate via its dual with the accelerated proximal gradient algorithm. The resultant algorithm has low time and space complexities and is guaranteed to converge to a critical point. Extensive experiments demonstrate its superiority over the state-of-theart in terms of both accuracy and scalability. Introduction Matrix factorization (MF) is a fundamental machine learning tool, and serves as an important component in many applications such as computer vision (Basri, Jacobs, and Kemelmacher 2007; Gu et al. 2014; Ji et al. 2010), social networks (Yang and Leskovec 2013) and recommender systems (Koren 2008). The square loss has been commonly used in MF (Mnih and Salakhutdinov 2008; Koren 2008; Candès and Recht 2009). However, this implicitly assumes the Gaussian noise, and is sensitive to outliers. Eriksson & Van Den Hengel (2010) proposed robust matrix factorization (RMF), which uses the l 1 -loss instead, and obtains much better empirical performance. Unfortunately, the resultant nonconvex nonsmooth optimization problem is much more difficult. Most RMF solvers are not scalable (Eriksson and Van Den Hengel 2010; Zheng et al. 2012; Cabral et al. 2013; Meng et al. 2013; Kim et al. 2015). Recently, Cambier & Absil (2016) proposed the robust matrix completion (RMC) algorithm, which smooths the l 1 -loss and then performs gradient descent on the Riemannian manifold. empirically, it can be used on large sparse matrices. However, as smoothing is used, RMC only approximates the RMF solution. Copyright c 2018, Association for the Advancement of Artificial Intelligence ( All rights reserved. The current state-of-the-art RMF solver is the RMF-MM (Lin, Xu, and Zha 2017), which is based on the technique of majorization minimization (MM) (Lange, Hunter, and Yang 2000; Hunter and Lange 2004; Mairal 2013). In each iteration, instead of optimizing the original RMF objective, a convex nonsmooth surrogate is constructed and optimized by the linearized alternating direction method with parallel splitting and adaptive penalty (LADMPSAP) algorithm (Lin, Liu, and Li 2015), a variant of the alternating direction method of multipliers (ADMM) (Boyd et al. 2011). Empirically, RMF-MM has fast convergence. Besides, it is the only RMF solver with convergence guarantees. However, it does not utilize data sparsity, and so cannot be used on large sparse matrices. Though the l 1 -loss used in RMF is more robust than the l 2 -loss, it may still not be robust enough for outliers. A similar observation is also made recently on the l 1 -regularizer in sparse learning and low-rank matrix learning (Gong et al. 2013; Zuo et al. 2013; Gu et al. 2014; Lu et al. 2016; Yao and Kwok 2016). To alleivate this problem, various nonconvex regularizers have been introduced. Examples includue the Geman penalty (Geman and Yang 1995), logsum penalty (Candès, Wakin, and Boyd 2008) and Laplace penalty (Trzasko and Manduca 2009). Empirically, they achieve much better performance on tasks such as feature selection (Gong et al. 2013; Zuo et al. 2013) and image denoising (Gu et al. 2014; Lu et al. 2016). In this paper, we propose to improve the robustness of RMF by using these nonconvex functions (instead of l 1 or l 2 ) as the loss function. Note that the above papers used these nonconvex functions as regularizers, not as the loss. The resultant optimization problem is difficult, and existing RMF solvers cannot be used. As for RMF-MM, we rely on the more flexible MM optimization technique, and a new MM surrogate is proposed. To improve scalabiltiy, we transform the surrogate to its dual and then solve it with the accelerated proximal gradient (APG) algorithm (Beck 2009; Nesterov 2013). Data sparsity can also be exploited in the design of the APG algorithm. As for its convergence analysis, proof techniques in RMF-MM cannot be used as the loss is no longer convex. Instead, we develop new proof techniques based on the Clarke subdifferential, and show that convergence to a critical point can be guaranteed. Extensive experiments demonstrate the superiority of the
2 proposed algorithm over the state-of-the-art in terms of both accuracy and scalability. Notation For scalar x, sign (x) = 1 if x > 0, 0 if x = 0, and 1 otherwise. For a vector x, Diag(x) constructs a diagonal matrix X with X ii = x i. For a matrix X, X F = ( i,j X2 ij )1/2 is its Frobenius norm, X 1 = i,j X ij is its l 1 -norm, and nnz(x) is the number of nonzero elements in X. for a square matrix X, tr(x) = i X ii is its trace. for two matrices X, Y, denotes element-wise product. For a smooth function f, f is its gradient. for a convex f, G f(x) = {U : f(y ) f(x) + tr(u (Y X))} is a subgradient. for a continuous f, f is the Clarke subdifferential (Clarke 1990). Related Work Majorization Minimization Majorization minimization (MM), also called optimization transfer, is a general technique to make difficult optimization problems easier (Lange, Hunter, and Yang 2000; Hunter and Lange 2004; Mairal 2013). Many algorithms can be seen as special cases of the MM algorithm. Examples include the EM algorithm, proximal algorithm, and difference of convex programming. Consider a function h(x), which is hard to optimize. In using MM, let the iterate at the kth iteration be X k. The next iterate is generated as X k+1 = X k + arg min X f k (X), (1) where f k is a surrogate that is being optimized instead of h. As suggested in (Lange, Hunter, and Yang 2000), a good surrogate should have the following properties: (a) h(x k + X) f k (X) for any X k ; (b) 0=arg min X f k (X) h(x k + X) and h(x k ) = f k (0); (c) f k is convex. Robust Matrix Factorization (RMF) In matrix factorization (MF), a matrix M R m n is approximated by UV, where U R m r, V R n r and r min(m, n). Moreover, some entries of M may be missing, as in applications such as structure from motion (SfM) (Basri, Jacobs, and Kemelmacher 2007) and recommender systems (Koren 2008). The MF problem can be formulated as: min U,V 1 2 W (M UV ) 2 F + λ 2 ( U 2 F + U 2 F ), (2) where W {0, 1} m n contain indices to missing entries in M (with W ij = 1 if M ij is observed, and 0 otherwise), and λ 0 is a regularization parameter. The l 2 -loss in (2) is sensitive to outliers. De La Torre & Black (2003) replaced it by the l 1 -loss, leading to robust matrix factorization (RMF): min U,V W (M UV ) 1 + λ 2 ( U 2 F + V 2 F ). (3) Many RMF solvers have been developed. However, as (3) is neither convex nor smooth, these solvers lack scalability and/or robustness and/or convergence guarantees. Very recently, Lin et al. (2017) proposed the RMF-MM algorithm, which solves (3) using MM. Let the iterate at the kth iteration be (U k, V k ). RMF-MM finds increments (Ū, V ) to be added to the target (U, V ) solution: U = U k + Ū, V = V k + V. (4) Given (U k, V k ), the objective in (3) can be rewritten as H k (Ū, V ) W (M (U k + Ū)(V k + V ) ) 1 + λ 2 U k + Ū 2 F + λ 2 V k + V 2 F. (5) The following Proposition constructs a surrogate F k that satisfies properties (a) and (b) in Section. Moreover, unlike H k, F k is jointly convex on (Ū, V ). Proposition 0.1. (Lin, Xu, and Zha 2017) Let nnz(w (i,:) ) (resp. nnz(w (:,j) )) be the number of nonzero elements in the ith row (resp. jth column) of W, Λ r = Diag( nnz(w (1,:) ),..., nnz(w (m,:) )), and Λ c = Diag( nnz(w (:,1) ),..., nnz(w (:,n) )). Then, H k (Ū, V ) F k (Ū, V ), where F k (Ū, V ) W (M U k (V k ) Ū(V k ) U k V ) 1 + λ 2 U k + Ū 2 F Λ rū 2 F + λ 2 V k + V 2 F Λ c V 2 F. (6) Equality holds iff (Ū, V ) = (0, 0). However, because of the coupling of Ū, V k (resp. U k, V ) in Ū(V k ) (resp. U k V ) in (6), F k is still difficult to optimize. To address this problem, RMF-MM uses the linearized alternating direction method with parallel splitting and adaptive penalty (LADMPSAP) (Lin, Liu, and Li 2015), a multi-block variant of the alternating direction method of multipliers (ADMM) (Boyd et al. 2011). RMF-MM converges to a critical point of (3). Empirically, it converges faster than the other RMF solvers. However, due to nonconvexity of (3), its convergence rate is not known. Moreover, its space and time complexities are O(mn), and O(mnrIK), respectively, where I is the number of LADMPSAP iterations. They both grow linearly with the matrix size, and may be computationally expensive on large data sets. Besides, the l 1 -loss can still be sensitive to outliers. Proposed Algorithm In this paper, we replace the l 1 -loss by a general nonconvex loss. Section introduces the nonconvex RMF formulation. The resultant optimization problem is more difficult to solve, and existing RMF solvers cannot be used as they rely crucially on the l 1 -norm. Instead, we use the more flexible MM technique. To handle nonconvexity, a new surrogate, based on relaxation back to l 1 and variable decoupling, is proposed in Section. Instead of using LADMPSAP
3 as in RMF-MM, we propose in Section a more timeand space-efficient procedure by running the accelerated proximal gradient (APG) algorithm on the surrogate s dual. In particular, the underlying gradient computations can efficiently utilize data sparsity. The complete algorithm is presented in Section, and convergence is analyzed in Section. Nonconvex Loss With a nonconvex loss, problem (3) is changed to: m n Ḣ(U, V ) W ij φ ( M ij [UV ] ij ) min U,V i=1 j=1 + λ 2 ( U 2 F + V 2 F ), (7) where φ is nonconvex. We assume the following on φ: Assumption 1. φ(α) is concave, smooth and strictly increasing on α 0. Obviously, the (convex) l 1 function also satisfies Assumption 1, and thus (7) includes (3). As shown in Table 1, Assumption 1 is satisfied by popular nonconvex functions including the Geman penalty (Geman and Yang 1995), Laplace penalty (Trzasko and Manduca 2009), the log-sum-penalty (LSP) (Candès, Wakin, and Boyd 2008), and slightly modified variants of the minimax concave penalty (MCP) (Zhang 2010) and smooth-capped-absolutedeviation (SCAD) penalty (Fan and Li 2001). Note that unlike previous papers (Gong et al. 2013; Zuo et al. 2013; Yao and Kwok 2016), we use these nonconvex functions as the loss, not as the regularizer. Table 1: Example nonconvex regularizers (θ > 2 for SCAD and θ > 0 for others is a constant). Here, δ > 0 is a small constant to ensure that the φ s for MCP and SCAD are strictly increasing. Geman penalty Laplace penalty log-sum-penalty (LSP) minimax concave penalty (MCP) smoothly clipped absolute deviation (SCAD) penalty φ( α ) α θ+ α ( α θ ) 1 exp ( ) log 1 + α θ { (1 + δ)α α2 2θ α θ 1 2 θ2 + δα α > θ (1 + δ)α α 1 α 2 +2θα 1 2(θ 1) + δα 1 < α θ (1+θ) 2 + δα α > θ Note from (7) that when the ith row of W is zero, the ith row of U obtained is also zero because of the U 2 F regularizer. Similarly, when the ith column of W is zero, the corresponding column in V is also zero. To avoid this trivial solution, we make the following Assumption, as in matrix completion (Candès and Recht 2009) and RMF-MM. Assumption 2. W has no zero row or column. Constructing the Surrogate The following Proposition first obtains a convex upper bound of φ using the Taylor expansion. Proposition 0.2. φ( α ) φ ( β ) α +(φ( β ) φ ( β ) β ) for any given β R. Equality holds iff α = ±β. Figure 1 shows the upper bounds for the nonconvex penalties in Table 1. Interestingly, the upper bound is simply a scaled l 1 -loss (with scaling factor φ ( β )) and offset by φ( β ) φ ( β ) β. As one can expect, recovery of the l 1 - loss will make optimization easier. At iteration k, Ḣ in (7) can be written as: (Ū, V ) m n i=1 j=1 W ijφ( M ij [(U k + Ū)(V k + V ) ] ij ) + λ 2 U k + Ū 2 F + λ 2 V k + V 2 F, where (Ū, V ) is as defined in (4). Applying Proposition 0.2 on the first term of Ḣk, we obtain the following. Corollary 0.3. Let A k ij = φ ( [U k (V k ) ] ij ), then Ḣ k (Ū, V ) b k + λ 2 U k + Ū 2 F + λ 2 V k + V 2 F + Ẇ k (M U k (V k ) Ū(V k ) U k V Ū V ) 1, where b k = m n i=1 j=1 W ij(φ ( ) [U k (V k ) ] ij [U k (V k ) ] ij ) and Ẇ k = A k W. A k ij The product Ū V inside the l 1 term still couples V Ū and together. As Ḣk is similar as H k in (5), one may want to reuse Proposition 0.1. However, note that the elements in A k are positive due to Assumption 1 and thus Ẇ k is a nonnegative matrix, while W in (3) is a binary matrix. Using the Cauchy-Schwarz inequality, the following Lemma decouples Ū and V. This then allows the design of a convex surrogate F k that upper-bounds Ḣk. Lemma 0.4. Let sum(ẇ (i,:) k ) = n j=1 Ẇ ij k (resp. sum(ẇ (:,j) k ) = m i=1 Ẇ ij k ) be the sum of elements on row i (resp. column j) of Ẇ k. Then, Ẇ k (Ū V ) Λk rū 2 F Λk c V 2 F, where Λ k r = Diag( sum(ẇ (1,:) k ),..., Λ k c = Diag( sum(ẇ (:,1) k ),..., holds iff (Ū, V ) = (0, 0). Proposition 0.5. sum(ẇ k (m,:) sum(ẇ k (:,n) Ḣ k (Ū, V ) F k (Ū, V ), where )) and )). Equality F k (Ū, V ) Ẇ k (M U k (V k ) Ū(V k ) U k V ) 1 + λ 2 U k + Ū 2 F Λk rū 2 F + λ 2 V k + V 2 F Λk c V 2 F + b k. (8) Equality holds iff (Ū, V ) = (0, 0). In the special case where the l 1 -loss is used, Ẇ k = W, b k = 0 for all k s, and Λ k r = Λ r, Λ k c = Λ c. The surrogate in (8) then reduces to that in (6), and Proposition 0.5 becomes Proposition 0.1.
4 (a) Geman. (b) Laplace. (c) LSP. (d) MCP (modified). (e) SCAD (modified). Figure 1: Illustration of the upper bound in Proposition 0.2 for the various nonconvex penalities in Table 1 β = 1, θ = 2.5 for SCAD and θ = 0.5 for the others; and δ = 0.05 for MCP and SCAD. It can be easily seen that F k qualifies as a good surrogate in Section : (a) Ḣ(Ū + U k, V + V k ) F k (Ū, V ); (b) (0, 0) = arg min Ū, V F k (Ū, V ) Ḣk (Ū, V ) and F k (0, 0) = Ḣ(0, 0); and (c) F k is jointly convex in Ū, V. Optimizing the Surrogate Optimizing using LADMPSAP LADMPSAP, which has been used in RMF-MM, can also be used to optimize (8). 1 However, LADMPSAP cannot utilize possible sparsity of W. Moreover, LADMPSAP converges at a rate of O(1/T ) (Lin, Liu, and Li 2015), which is slow. Optimizing using APG We first transform the surrogate in (8) to the dual, which then has only nnz(w ) variables (instead of O(mn)). Let Ω {(i 1, j 1 ),..., (i nnz(w ), j nnz(w ) )} be the set containing indices to the nonzero elements in W. Let H( ) be the linear operator which maps a nnz(w )- dimensional vector x to a sparse matrix X R m n with nonzero positions indicated by Ω, i.e., X itj t = x t where (i t, j t ) is the tth element in Ω. Let H 1 ( ) be the inverse operator of H, which extracts an nnz(w )-dimensional vector x from X with x t = X itj t. Proposition 0.6. Let ẇ k = H 1 (Ẇ k ). The dual problem of F k is min Ū, V min D k (x) x W k 1 2 tr((h(x)v k λu k ) A k r(h(x)v k λu k )) tr(h(x) M) tr((h(x) U k λv k ) A k c (H(x) U k λv k )), (9) 1 Because of the lack of space, the detailed update equations will not be included here. where W k {x R nnz(w ) : x i [ẇ k ] 1 i }, A k r = (λi + (Λ k r) 2 ) 1, and A k c = (λi + (Λ k c ) 2 ) 1. From the obtained x, the primal (Ū, V ) solution can be recovered as Ū = A k r(h(x)v k λu k ) and V = A k c (H(x) U k λv k ). Problem (9) can be solved by the APG algorithm, which has a convergence rate of O(1/T 2 ) (Beck 2009; Nesterov 2013) and is faster than LADMPSAP. In each iteration, one needs to compute the gradient D k of the objective: D k (x) = H 1 (A k r(h(x)v k λu k )(V k ) ) + H 1 (U k [(U k ) H(x) λ(v k ) ]A k c ) H 1 (M). (10) The following Proposition shows that the proximal step has a closed-form solution. As x is nnz(w )-dimensional, computing the proximal step takes only O(nnz(W )) time. Proposition 0.7. Let x = arg min 1 x W k 2 x z 2 F for a given z. Then, x i = sign (z i) min( z i, (ẇi k) 1 ). Exploiting Sparsity A direct implementation of APG takes O(mnr) time and O(mn) space. In the following, we show how these can be reduced by exploiting sparsity. Objective (9) involves A k r, A k c and W k, which are all related to Ẇ k. From its definition in Corollary 0.3, Ẇ k is sparse (as W is sparse). Thus, by exploting sparsity, constructing A k r, A k c and W k only take O(nnz(W )) time and space. In each APG iteration, one has to compute the objective, gradient, and proximal step. First, consider computing the gradient in (10). The first term can be rewritten as ĝ k = H 1 (Q k (V k ) ), where Q k = A k r(h(x)v k λu k ). As A k r is diagonal and H(x) is sparse, Q k can be computed as A k r(h(x)v k ) λ(a k ru k ) in O(nnz(W )r + mr) time, where r is the number of columns in U k and V k. Let the tth element in Ω be (i t, j t ). By the definition of H 1 ( ), we have ĝt k = r q=1 Qk i V k tq j tq, and this takes O(nnz(W )r + mr)
5 time. Similarly, computing the second term in (10) takes O(nnz(W )r + nr) time. Hence, computing the gradient D k ( ) using Algorithm 1 takes a total of O(nnz(W )r+(m+ n)r) time and O(nnz(W ) + (m + n)r) space. Algorithm 1 Computing D k (x) by exploiting sparsity. 1: set X itj t = x t for all (i t, j t ) Ω; // i.e., X = H(x) 2: Q k = A k r(xv k ) λ(a k ru k ); 3: construct the nnz(w )-dimensional vector ĝ k with ĝt k = r q=1 Qk i V k tq j ; tq 4: P k = A k c (X U k ) λ(a k c V k ); 5: construct the nnz(w )-dimensional vector ğ k with ğt k = r q=1 U i k P k tq j ; // i.e., tq ğk = H 1 (U k (P k ) ) 6: return ĝ k + ğ k H 1 (M). Next, as x R nnz(w ), the proximal step in Proposition 0.7 takes O(nnz(W )) time and space. Similar to Algorithm 1, the objective can be obtained in O(nnz(W )r + (m + n)r) time and O(nnz(W ) + (m + n)r) space. Thus, by exploiting sparsity, the APG algorithm takes O(nnz(W ) + (m + n)r) space, which is much smaller than the O(mn) space required by LADMPSAP. The iteration time complexity of APG is O(nnz(W )r + (m + n)r), which is again much smaller than the O(mnr) time each LADMPSAP iteration takes. Complete Algorithm The complete procedure, which will be called Robust Matrix Factorization with Nonconvex Loss (RMFNL) algorithm is shown in Algorithm 2. The surrogate is optimized via its dual in step 4. The primal solution is recovered in step 5, and (U k, V k ) are updated in step 6. Algorithm 2 Robust matrix factorization using nonconvex loss (RMFNL) algorithm. 1: initialize U 1 R m r and V 1 R m r ; 2: for k = 1, 2,..., K do 3: compute Ẇ k in Corollary 0.3 (only on the observed positions), and Λ k r, Λ k c in Lemma 0.4; 4: compute x k = arg min x W k D k (x) in Proposition 0.6 using APG (use Algorithm 1 to compute D k (x)); 5: compute Ū k and V k in Proposition 0.6; 6: U k+1 = U k + Ū k, V k+1 = V k + V k ; 7: end for 8: return U K+1 and V K+1. Convergence Analysis In this Section, we prove convergence of the proposed RMFNL algorithm. As φ in (7) is nonconvex, Proposition 1 in (Lin, Xu, and Zha 2017) cannot be extended to our surrogate F k. The RMF-MM iterations guarantee a sufficient decrease on the objective (Lin, Xu, and Zha 2017) in the following sense: Definition 0.1 ((Nesterov 1998)). The sequence {X k } has a sufficient decrease on f if there exists a constant γ > 0 such that f(x k ) f(x k+1 ) γ X k X k+1 2 F, k. The following Proposition shows that RMFNL also achieves a sufficient decrease on its objective. Moreover, the {(U k, V k )} sequence generated is bounded, which has at least one limit point. The Proposition is similar to Theorem 1 in (Lin, Xu, and Zha 2017). However, we use a different surrogate and Λ k r, Λ k c are not fixed but change with k. Hence, their proof cannot be directly extended. Proposition 0.8. For Algorithm 2, (i) {(U k, V k )} is bounded. (ii) {(U k, V k )} has a sufficient decrease on Ḣ, i.e., Ḣ(U k, V k ) Ḣ(U k+1, V k+1 ) γ U k+1 U k 2 F + γ V k+1 V k 2 F, where γ > 0 is a constant; (iii) lim k (U k+1 U k ) = 0 and lim k (V k+1 V k ) = 0; In the following, we show that limit points from Algorithm 2 are also critical points of (3). To prove convergence, the subgradient is used in (Lin, Xu, and Zha 2017). Here, as φ is nonconvex, we will use the Clarke subdifferential (Clarke 1990), which generalizes subgradients to nonconvex functions. The following Proposition connects the subgradient of surrogate F k to the Clarke subdifferential of Ḣ. Proposition 0.9. (i) F k (0, 0) = Ḣ k (0, 0); (ii) If 0 Ḣ k (0, 0), then (U k, V k ) is a critical point of (7). Theorem The limit points of the sequence generated by Algorithm 2 are also critical points of (7). RMF-MM is the only existing RMF solver that guarantees convergence to critical points. While our (7) is more difficult than their (3) (due to the nonconvex loss), the proposed RMFNL still shares the same guarantee as RMF-MM. Moreover, as (7) is neither convex nor smooth, convergence to critical points is the best one can obtain (Nesterov 1998). Experiments In this section, we compare the proposed RMFNL with state-of-the-art MF algorithms on both synthetic and realworld data sets. Experiments are performed on a PC with Intel i7 CPU and 32GB RAM. All the codes are in Matlab, with sparse matrix operations implemented in C++. Synthetic Data We first perform experiments with synthetic data. The synthetic data is generated as X = UV, where U R m 5, V R m 5, and m = {250, 500, 1000}. Elements of U and V are sampled i.i.d. from the normal distribution N (0, 1). This is then corrupted to produce M = X + N + S, where N is the noise matrix from N (0, 0.1), and S is a sparse matrix modeling outliers with 5% nonzero elements randomly sampled from {±5}. We randomly draw 10 log(m)/m% of the elements from M as observations, with half of them for training and the other half for validation. The remaining unobserved elements are for testing. note that the larger the m, the sparser is the observed matrix. We compare RMF-MM (Lin, Xu, and Zha 2017), which uses the l 1 -loss, with the proposed RMFNL (Algorithm 2),
6 Table 2: Performance on the synthetic data set ( nnz is the percentage of nonzero elements). RMSE is scaled by 10 1 and CPU time is in seconds. The best and comparable results according to the pairwise t-test with 95% confidence are highlightened. m = 250 (nnz: 11.04%) m = 500 (nnz: 6.21%) m = 1000 (nnz: 3.45%) loss algorithm RMSE CPU time RMSE CPU time RMSE CPU time l 1 RMF-MM 1.77± ± ± ± ± ±130.9 LADMPSAP 1.09± ± ± ± ± ±138.8 LSP RMFNL APG(dense) 1.10± ± ± ± ± ±91.9 APG 1.09± ± ± ± ± ±3.2 Table 3: Performance on the recommender system data sets. The RMSE is scaled by 10 1, CPU time is in seconds for MovieLens and minutes for netflix and yahoo. The best and comparable results according to the pairwise t-test with 95% confidence are highlightened. MovieLens-100K MovieLens-10M netflix yahoo loss algorithm RMSE CPU time RMSE CPU time RMSE CPU time RMSE CPU time l 2 AltGrad 10.40± ± ± ±8.9 RP 11.53± ± ± ± ± ± ± ±3.1 ScaledASD 10.73± ± ± ± ± ± ± ±9.4 l 1 RMC 9.14± ± ± ±3.9 RMF-MM 9.09± ±116.0 LSP RMFNL 8.91± ± ± ± ± ± ± ±6.8 using the nonconvex log-sum-penalty (LSP) (Candès, Wakin, and Boyd 2008) as loss. As observed in (Gong et al. 2013; Yao, Kwok, and Zhong 2015; Lu et al. 2016; Yao and Kwok 2016), other nonconvex functions in Table 1 usually have similar behavior. We compare three different optimizers for RMFNL: (a) RMFNL(LADMPSAP) in Section ; (b) RMFNL(APG-dense) in Section, which uses APG but does not utilize data sparsity; and (c) RMFNL(APG) in Algorithm 2, which utilizes data sparsity as discussed in Section. We initialize the iterates (U 1, V 1 ) by Gaussian random matrices. The matrix rank r is set to the ground-truth (i.e., 5). optimization is stopped when the relative change in objective values is smaller than For performance evaluation, we follow (Lin, Xu, and Zha 2017) and use (i) the root mean square error RMSE = W (X Ū V T ) 2 F /nnz( W ), where the testing elements are indicated by the binary matrix W ; and (ii) CPU time. To reduce statistical variability, results are averaged over five repetitions. Table 2 shows the results. As can be seen, all RMFNL variants have lower RMSE than RMF-MM, as LSP is more robust than l 1. As for time, RMFNL(APG-dense) is faster than RMFNL(LADMPSAP), as APG has a faster convergence rate than LADMPSAP. By further utilizing data sparsity, RMFNL(APG) is the fastest. Figure 2 shows that the speedups of RMFNL(APG) over RMFNL(LADMPSAP) and RMFNL(APG-dense) increase with matrix size (m). Recall that the larger the m, the sparser is the matrix. Hence, this agrees with Table 2, which shows that the time complexity of RMFNL(APG) only depends on the number of nonzero elements, but not on the matrix size. Figure 3 shows the convergence. When measured w.r.t. the number of iterations, all RMFNL variants and RMF- MM have the same behavior. However, when measured w.r.t. time, RMFNL(APG) is the fastest. Recommender Systems In this section, we perform experiments on some popular recommendation system data sets (Table 4). Following (Yao and Kwok 2016), we use 50% of the observed ratings for training, 25% for validation, and the rest for testing. We randomly corrupt 5% of the training ratings by adding ±5 with equal probability. Table 4: Recommendation data sets used in the experiment. Here, nnz is the percentage of nonzero elements. #users #movies #ratings nnz MovieLens-100K 943 1, , % MovieLens-10M 69,878 10,677 10,000, % netflix 480,189 17, ,480, % yahoo 249, ,111 62,551, % The proposed RMFNL, using LSP as loss, is compared with the following state-of-the-art algorithms: 1. Matrix factorization algorithms using the square loss, including alternating gradient descent (AltGrad) (Mnih and Salakhutdinov 2008), Riemannian preconditioning (RP) (Mishra and Sepulchre 2016), and scaled alternating steepest descent (ScaledASD) (Tanner and Wei 2016); 2. RMF algorithms, including RMF-MM and robust matrix completion (RMC) (Cambier and Absil 2016). All hyperparameters are tuned by using the validation set. To reduce statistical variability, results are averaged over five repetitions. For performance evaluation, we use (i) the testing RMSE; and (ii) CPU time. RMF-MM cannot converge in 10 4 seconds on MovieLens- 10M, and is thus not reported (as well as on the larger netflix and yahoo data sets). AltGrad is not compared on netflix and yahoo, as it is inferior to ScaledASD on MovieLens. RMC runs out of memory on netflix and yahoo. Table 3 shows the performance. As can be seen, RMFNL
7 (a) 250. (b) 500. (c) Figure 3: Convergence on the synthetic data set. Top: RMSE vs number of iterations (note that RMFNL(APG-dense) and RMFNL(APG) overlap with each other); Bottom: RMSE vs CPU time. Table 5: Performance on the data subsets extracted from the Dinosaur sequence. CPU time is in seconds. The best and comparable results according to the pairwise t-test with 95% confidence are highlightened. D1 D2 D3 loss algorithm l 1-loss CPU time l 1-loss CPU time l 1-loss CPU time l 1 RMF-MM(heuristic) 0.374± ± ± ± ± ±3.4 RMF-MM 0.442± ± ± ± ± ±2.1 LSP RMFNL 0.323± ± ± ± ± ±1.0 Affine Rigid Structure-from-Motion (SfM) Figure 2: Speedup of RMFNL(APG) over the other two RMFNL variants (LADMPSAP and APG-dense). consistently achieves the lowest RMSE, while the l 2 -based methods (AltGrad, RP and ScaledASD) have the highest RMSE. Thus, the nonconvex LSP loss is the most robust while the l 2 -loss is the most susceptible to outliers. RMC, which approximates the l 1 -loss, has testing RMSE slightly higher than RMF-MM. Figures 4(a) and 4(b) show convergence of the testing RMSE on MovieLens. As can be seen, RMF-MM is the slowest on MovieLens-100K, as it cannot utilize data sparsity. AltGrad, RP and ScaledASD are fast, but have very high RMSEs. The speeds of RMFNL and RMC are comparable. Figures 4(c) and 4(d) show convergence on netflix and yahoo, and the observations are similar. In this section, we perform experiments on affine rigid structure-from-motion (SfM) (Koenderink and Van Doorn 1991). This involves reconstructing the 3D scene from sparse feature points tracked in m images of a moving camera. Each feature point is projected to every image plane, and is thus represented by a 2m-dimensional vector. With n feature points, this leads to a 2m n matrix. Often, this matrix has missing data (e.g., when feature points are not visible throughout the whole motion sequence) and outliers (which arise from feature mismatch). We use the Oxford Dinosaur sequence, which has 36 images and 4, 983 feature points. As in (Lin, Xu, and Zha 2017), we extract three data subsets using feature points observed in at least 5, 6 and 7 images (Table 6). The fully observed data matrix can be recovered by rank-4 matrix factorization (Eriksson and Van Den Hengel 2010; Meng et al. 2013; Zheng et al. 2012; Cabral et al. 2013; Lin, Xu, and Zha 2017). We compare RMFNL with RMF-MM, and a RMF- MM variant (denoted RMF-MM(heuristic)) described in Section 4.2 of (Lin, Xu, and Zha 2017). In this variant, the diagonal entries of Λ r and Λ c are initialized with small values and then gradually increased until they reach the upper bounds in (6). It is claimed that this enables RMF-MM to converge faster (Lin, Xu, and Zha 2017). However, our
8 (a) MovieLens-100K. (b) MovieLens-10M. (c) netflix. (d) yahoo. Figure 4: Testing RMSE vs CPU time (seconds) on recommendation data sets. RMF-MM is too slow to run on larger data sets. Table 6: Data sets extracted from the Dinosaur sequence. Here, min occurrence is the minimum number of images that all feature points in that data subset have to appear in. data subset min occurrence matrix size sparsity D % D % D % experimental results show that this heuristic leads to more accurate, but not faster, results. Moreover, the key pitfall is that Proposition 0.1 and the convergence guarantee for RMF-MM no longer holds. For performance evaluation, as there is no ground-truth, we follow (Lin, Xu, and Zha 2017) and use (i) the l 1 - loss on reconstructing the corrupted matrix W (UV X) 1 /nnz( W ), where U and V are output from the algorithm, M is the data matrix with the observed positions indicated by the binary matrix W ; and (ii) CPU time. Results are shown in Table 5. As can be seen, RMF- MM(heuristic) obtain a lower l 1 -loss than RMF-MM, but is still outperformed by RMFNL. In terms of time, RMFNL is the fastest as it can utilize data sparsity, though the speedup is not as significant as in previous sections. This is because the extracted Dinosaur data subsets are not very sparse, as can be seen from Table 6. Conclusion In this paper, we addressed two problems in robust matrix factorization (RMF). First, to enhance robustness, we proposed the use of nonconvex loss over the commonly used (convex) l 1 and l 2 -loss. Second, we improved scalabililty by exploiting data sparsity (which RMF-MM cannot) and using the accelerated proximal gradient algorithm (which is faster than the commonly used ADMM). Theoretical analysis shows that the proposed RMFNL algorithm can generate a decreasing sequence and converge to a critical point. Extensive experiments on both synthetic and realworld data sets demonstrate that RMFNL is more accurate and more scalable than the state-of-the-art. References Basri, R.; Jacobs, D.; and Kemelmacher, I Photometric stereo with general, unknown lighting. International Journal of Computer Vision 72(3): Beck, A.and Teboulle, M A fast iterative shrinkagethresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1): Boyd, S.; Parikh, N.; Chu, E.; Peleato, B.; and Eckstein, J Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3(1): Cabral, R.; De la Torre, F.; Costeira, J.; and Bernardino, A Unifying nuclear norm and bilinear factorization approaches for low-rank matrix decomposition. In International Conference on Computer Vision, Cambier, L., and Absil, P Robust low-rank matrix completion by Riemannian optimization. SIAM Journal on Scientific Computing 38(5):S440 S460. Candès, E., and Recht, B Exact matrix completion via convex optimization. Foundations of Computational Mathematics 9(6): Candès, E.; Wakin, M.; and Boyd, S Enhancing sparsity by reweighted l 1 minimization. Journal of Fourier Analysis and Applications 14(5-6): Clarke, F Optimization and nonsmooth analysis. SIAM. Eriksson, A., and Van Den Hengel, A Efficient computation of robust low-rank matrix approximations in the presence of missing data using the l 1 -norm. In Computer Vision and Pattern Recognition, Fan, J., and Li, R Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association 96(456): Geman, D., and Yang, C Nonlinear image recovery with half-quadratic regularization. IEEE Transactions on Image Processing 4(7): Gong, P.; Zhang, C.; Lu, Z.; Huang, J.; and Ye, J A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems. In International Conference on Machine Learning, Gu, S.; Zhang, L.; Zuo, W.; and Feng, X Weighted nuclear norm minimization with application to image
9 denoising. In Computer Vision and Pattern Recognition, Hunter, D., and Lange, K A tutorial on MM algorithms. The American Statistician 58(1): Ji, H.; Liu, C.; Shen, Z.; and Xu, Y Robust video denoising using low rank matrix completion. In Computer Vision and Pattern Recognition, Kim, E.; Lee, M.; Choi, C.; Kwak, N.; and Oh, S Efficient l 1 -norm-based low-rank matrix approximations for large-scale problems using alternating rectified gradient method. IEEE Transactions on Neural Networks and Learning Systems 26(2): Koenderink, J., and Van Doorn, A Affine structure from motion. Journal of the Optical Society of America 8(2): Koren, Y Factorization meets the neighborhood: a multifaceted collaborative filtering model. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Lange, K.; Hunter, R.; and Yang, I Optimization transfer using surrogate objective functions. Journal of Computational and Graphical Statistics 9(1):1 20. Lin, Z.; Liu, R.; and Li, H Linearized alternating direction method with parallel splitting and adaptive penalty for separable convex programs in machine learning. Machine Learning 2(99): Lin, Z.; Xu, C.; and Zha, H Robust matrix factorization by majorization minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence (99). Lu, C.; Tang, J.; Yan, S.; and Lin, Z Nonconvex nonsmooth low rank minimization via iteratively reweighted nuclear norm. IEEE Transactions on Image Processing 25(2): Mairal, J Optimization with first-order surrogate functions. In International Conference on Machine Learning, Meng, D.; Xu, Z.; Zhang, L.; and Zhao, J A cyclic weighted median method for l 1 low-rank matrix factorization with missing entries. In AAAI Conference on Artificial Intelligence, Mishra, B., and Sepulchre, R Riemannian preconditioning. SIAM Journal on Optimization 26(1): Mnih, A., and Salakhutdinov, R Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, Nesterov, Y Introductory lectures on convex optimization: A basic course. Springer. Nesterov, Y Gradient methods for minimizing composite functions. Mathematical Programming 140(1): Tanner, J., and Wei, K Low rank matrix completion by alternating steepest descent methods. Applied and Computational Harmonic Analysis 40(2): Trzasko, J., and Manduca, A Highly undersampled magnetic resonance image reconstruction via homotopicminimization. IEEE Transactions on Medical Imaging 28(1): Yang, J., and Leskovec, J Overlapping community detection at scale: a nonnegative matrix factorization approach. In International Conference on Web Search and Data Mining, Yao, Q., and Kwok, J Efficient learning with a family of nonconvex regularizers by redistributing nonconvexity. In International Conference on Machine Learning, Yao, Q.; Kwok, J.; and Zhong, W Fast lowrank matrix learning with nonconvex regularization. In International Conference on Data Mining, Zhang, C Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics 38(2): Zheng, Y.; Liu, G.; Sugimoto, S.; Yan, S.; and Okutomi, M Practical low-rank matrix approximation under robust l 1 -norm. In Computer Vision and Pattern Recognition, Zuo, W.; Meng, D.; Zhang, L.; Feng, X.; and Zhang, D A generalized iterated shrinkage algorithm for nonconvex sparse coding. In International Conference on Computer Vision,
Relaxed Majorization-Minimization for Non-smooth and Non-convex Optimization
Relaxed Majorization-Minimization for Non-smooth and Non-convex Optimization Chen Xu 1, Zhouchen Lin 1,2,, Zhenyu Zhao 3, Hongbin Zha 1 1 Key Laboratory of Machine Perception (MOE), School of EECS, Peking
More informationA Brief Overview of Practical Optimization Algorithms in the Context of Relaxation
A Brief Overview of Practical Optimization Algorithms in the Context of Relaxation Zhouchen Lin Peking University April 22, 2018 Too Many Opt. Problems! Too Many Opt. Algorithms! Zero-th order algorithms:
More informationLOW-rank matrix learning is a central issue in many
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, SUBMITTED 1 Large-Scale Low-Rank Matrix Learning with Nonconvex Regularizers Quanming Yao, Member IEEE, James T. Kwok, Fellow IEEE, Taifeng
More informationRelaxed Majorization-Minimization for Non-smooth and Non-convex Optimization
Relaxed Majorization-Minimization for Non-smooth and Non-convex Optimization MM has been successfully applied to a wide range of problems. Mairal (2013 has given a comprehensive review on MM. Conceptually,
More informationFast Nonnegative Matrix Factorization with Rank-one ADMM
Fast Nonnegative Matrix Factorization with Rank-one Dongjin Song, David A. Meyer, Martin Renqiang Min, Department of ECE, UCSD, La Jolla, CA, 9093-0409 dosong@ucsd.edu Department of Mathematics, UCSD,
More informationSparsity Regularization
Sparsity Regularization Bangti Jin Course Inverse Problems & Imaging 1 / 41 Outline 1 Motivation: sparsity? 2 Mathematical preliminaries 3 l 1 solvers 2 / 41 problem setup finite-dimensional formulation
More informationEfficient Sparse Low-Rank Tensor Completion Using the Frank-Wolfe Algorithm
Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17) Efficient Sparse Low-Rank Tensor Completion Using the Frank-Wolfe Algorithm Xiawei Guo, Quanming Yao, James T. Kwok
More informationCSC 576: Variants of Sparse Learning
CSC 576: Variants of Sparse Learning Ji Liu Department of Computer Science, University of Rochester October 27, 205 Introduction Our previous note basically suggests using l norm to enforce sparsity in
More informationOptimization methods
Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,
More informationProbabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms
Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms François Caron Department of Statistics, Oxford STATLEARN 2014, Paris April 7, 2014 Joint work with Adrien Todeschini,
More informationGROUP-SPARSE SUBSPACE CLUSTERING WITH MISSING DATA
GROUP-SPARSE SUBSPACE CLUSTERING WITH MISSING DATA D Pimentel-Alarcón 1, L Balzano 2, R Marcia 3, R Nowak 1, R Willett 1 1 University of Wisconsin - Madison, 2 University of Michigan - Ann Arbor, 3 University
More informationAn efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss
An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss arxiv:1811.04545v1 [stat.co] 12 Nov 2018 Cheng Wang School of Mathematical Sciences, Shanghai Jiao
More informationLow-rank Matrix Completion with Noisy Observations: a Quantitative Comparison
Low-rank Matrix Completion with Noisy Observations: a Quantitative Comparison Raghunandan H. Keshavan, Andrea Montanari and Sewoong Oh Electrical Engineering and Statistics Department Stanford University,
More informationSCMF: Sparse Covariance Matrix Factorization for Collaborative Filtering
SCMF: Sparse Covariance Matrix Factorization for Collaborative Filtering Jianping Shi Naiyan Wang Yang Xia Dit-Yan Yeung Irwin King Jiaya Jia Department of Computer Science and Engineering, The Chinese
More informationBig Data Analytics: Optimization and Randomization
Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring
More informationAn Alternating Proximal Splitting Method with Global Convergence for Nonconvex Structured Sparsity Optimization
An Alternating Proximal Splitting Method with Global Convergence for Nonconvex Structured Sparsity Optimization Shubao Zhang and Hui Qian and Xiaojin Gong College of Computer Science and Technology Department
More informationSupplementary Material of A Novel Sparsity Measure for Tensor Recovery
Supplementary Material of A Novel Sparsity Measure for Tensor Recovery Qian Zhao 1,2 Deyu Meng 1,2 Xu Kong 3 Qi Xie 1,2 Wenfei Cao 1,2 Yao Wang 1,2 Zongben Xu 1,2 1 School of Mathematics and Statistics,
More informationA Unified Approach to Proximal Algorithms using Bregman Distance
A Unified Approach to Proximal Algorithms using Bregman Distance Yi Zhou a,, Yingbin Liang a, Lixin Shen b a Department of Electrical Engineering and Computer Science, Syracuse University b Department
More informationLecture 9: September 28
0-725/36-725: Convex Optimization Fall 206 Lecturer: Ryan Tibshirani Lecture 9: September 28 Scribes: Yiming Wu, Ye Yuan, Zhihao Li Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These
More informationRobust Principal Component Analysis
ELE 538B: Mathematics of High-Dimensional Data Robust Principal Component Analysis Yuxin Chen Princeton University, Fall 2018 Disentangling sparse and low-rank matrices Suppose we are given a matrix M
More informationDATA MINING AND MACHINE LEARNING
DATA MINING AND MACHINE LEARNING Lecture 5: Regularization and loss functions Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Loss functions Loss functions for regression problems
More informationProbabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms
Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms Adrien Todeschini Inria Bordeaux JdS 2014, Rennes Aug. 2014 Joint work with François Caron (Univ. Oxford), Marie
More informationFantope Regularization in Metric Learning
Fantope Regularization in Metric Learning CVPR 2014 Marc T. Law (LIP6, UPMC), Nicolas Thome (LIP6 - UPMC Sorbonne Universités), Matthieu Cord (LIP6 - UPMC Sorbonne Universités), Paris, France Introduction
More informationEfficient Computation of Robust Low-Rank Matrix Approximations in the Presence of Missing Data using the L 1 Norm
Efficient Computation of Robust Low-Rank Matrix Approximations in the Presence of Missing Data using the L 1 Norm Anders Eriksson, Anton van den Hengel School of Computer Science University of Adelaide,
More informationECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis
ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis Lecture 7: Matrix completion Yuejie Chi The Ohio State University Page 1 Reference Guaranteed Minimum-Rank Solutions of Linear
More informationSparse Gaussian conditional random fields
Sparse Gaussian conditional random fields Matt Wytock, J. ico Kolter School of Computer Science Carnegie Mellon University Pittsburgh, PA 53 {mwytock, zkolter}@cs.cmu.edu Abstract We propose sparse Gaussian
More informationIterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming
Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming Zhaosong Lu October 5, 2012 (Revised: June 3, 2013; September 17, 2013) Abstract In this paper we study
More informationProximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725
Proximal Gradient Descent and Acceleration Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: subgradient method Consider the problem min f(x) with f convex, and dom(f) = R n. Subgradient method:
More informationA Bregman alternating direction method of multipliers for sparse probabilistic Boolean network problem
A Bregman alternating direction method of multipliers for sparse probabilistic Boolean network problem Kangkang Deng, Zheng Peng Abstract: The main task of genetic regulatory networks is to construct a
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Low-rank matrix recovery via convex relaxations Yuejie Chi Department of Electrical and Computer Engineering Spring
More informationAn accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems
An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems Kim-Chuan Toh Sangwoon Yun March 27, 2009; Revised, Nov 11, 2009 Abstract The affine rank minimization
More informationRobust Principal Component Analysis Based on Low-Rank and Block-Sparse Matrix Decomposition
Robust Principal Component Analysis Based on Low-Rank and Block-Sparse Matrix Decomposition Gongguo Tang and Arye Nehorai Department of Electrical and Systems Engineering Washington University in St Louis
More informationAccelerated Gradient Method for Multi-Task Sparse Learning Problem
Accelerated Gradient Method for Multi-Task Sparse Learning Problem Xi Chen eike Pan James T. Kwok Jaime G. Carbonell School of Computer Science, Carnegie Mellon University Pittsburgh, U.S.A {xichen, jgc}@cs.cmu.edu
More informationNonlinear Support Vector Machines through Iterative Majorization and I-Splines
Nonlinear Support Vector Machines through Iterative Majorization and I-Splines P.J.F. Groenen G. Nalbantov J.C. Bioch July 9, 26 Econometric Institute Report EI 26-25 Abstract To minimize the primal support
More informationarxiv: v1 [cs.cv] 1 Jun 2014
l 1 -regularized Outlier Isolation and Regression arxiv:1406.0156v1 [cs.cv] 1 Jun 2014 Han Sheng Department of Electrical and Electronic Engineering, The University of Hong Kong, HKU Hong Kong, China sheng4151@gmail.com
More informationRanking from Crowdsourced Pairwise Comparisons via Matrix Manifold Optimization
Ranking from Crowdsourced Pairwise Comparisons via Matrix Manifold Optimization Jialin Dong ShanghaiTech University 1 Outline Introduction FourVignettes: System Model and Problem Formulation Problem Analysis
More informationLecture Notes 10: Matrix Factorization
Optimization-based data analysis Fall 207 Lecture Notes 0: Matrix Factorization Low-rank models. Rank- model Consider the problem of modeling a quantity y[i, j] that depends on two indices i and j. To
More informationThe picasso Package for Nonconvex Regularized M-estimation in High Dimensions in R
The picasso Package for Nonconvex Regularized M-estimation in High Dimensions in R Xingguo Li Tuo Zhao Tong Zhang Han Liu Abstract We describe an R package named picasso, which implements a unified framework
More informationL 2,1 Norm and its Applications
L 2, Norm and its Applications Yale Chang Introduction According to the structure of the constraints, the sparsity can be obtained from three types of regularizers for different purposes.. Flat Sparsity.
More informationAdaptive one-bit matrix completion
Adaptive one-bit matrix completion Joseph Salmon Télécom Paristech, Institut Mines-Télécom Joint work with Jean Lafond (Télécom Paristech) Olga Klopp (Crest / MODAL X, Université Paris Ouest) Éric Moulines
More informationSparse Covariance Selection using Semidefinite Programming
Sparse Covariance Selection using Semidefinite Programming A. d Aspremont ORFE, Princeton University Joint work with O. Banerjee, L. El Ghaoui & G. Natsoulis, U.C. Berkeley & Iconix Pharmaceuticals Support
More information2 Regularized Image Reconstruction for Compressive Imaging and Beyond
EE 367 / CS 448I Computational Imaging and Display Notes: Compressive Imaging and Regularized Image Reconstruction (lecture ) Gordon Wetzstein gordon.wetzstein@stanford.edu This document serves as a supplement
More informationLecture 8: February 9
0-725/36-725: Convex Optimiation Spring 205 Lecturer: Ryan Tibshirani Lecture 8: February 9 Scribes: Kartikeya Bhardwaj, Sangwon Hyun, Irina Caan 8 Proximal Gradient Descent In the previous lecture, we
More informationLinearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation
Linearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation Zhouchen Lin Visual Computing Group Microsoft Research Asia Risheng Liu Zhixun Su School of Mathematical Sciences
More informationElastic-Net Regularization of Singular Values for Robust Subspace Learning
Elastic-Net Regularization of Singular Values for Robust Subspace Learning Eunwoo Kim Minsik Lee Songhwai Oh Department of ECE, ASRI, Seoul National University, Seoul, Korea Division of EE, Hanyang University,
More informationLarge-scale Stochastic Optimization
Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation
More informationarxiv: v1 [math.oc] 1 Jul 2016
Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the
More informationOptimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison
Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big
More informationStructured matrix factorizations. Example: Eigenfaces
Structured matrix factorizations Example: Eigenfaces An extremely large variety of interesting and important problems in machine learning can be formulated as: Given a matrix, find a matrix and a matrix
More informationA Novel Approach to Quantized Matrix Completion Using Huber Loss Measure
1 A Novel Approach to Quantized Matrix Completion Using Huber Loss Measure Ashkan Esmaeili and Farokh Marvasti ariv:1810.1260v1 [stat.ml] 29 Oct 2018 Abstract In this paper, we introduce a novel and robust
More informationAlgorithms for Collaborative Filtering
Algorithms for Collaborative Filtering or How to Get Half Way to Winning $1million from Netflix Todd Lipcon Advisor: Prof. Philip Klein The Real-World Problem E-commerce sites would like to make personalized
More informationRobust PCA. CS5240 Theoretical Foundations in Multimedia. Leow Wee Kheng
Robust PCA CS5240 Theoretical Foundations in Multimedia Leow Wee Kheng Department of Computer Science School of Computing National University of Singapore Leow Wee Kheng (NUS) Robust PCA 1 / 52 Previously...
More informationA Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization
A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization Panos Parpas Department of Computing Imperial College London www.doc.ic.ac.uk/ pp500 p.parpas@imperial.ac.uk jointly with D.V.
More informationA tutorial on sparse modeling. Outline:
A tutorial on sparse modeling. Outline: 1. Why? 2. What? 3. How. 4. no really, why? Sparse modeling is a component in many state of the art signal processing and machine learning tasks. image processing
More informationSparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28
Sparsity Models Tong Zhang Rutgers University T. Zhang (Rutgers) Sparsity Models 1 / 28 Topics Standard sparse regression model algorithms: convex relaxation and greedy algorithm sparse recovery analysis:
More informationarxiv: v1 [math.oc] 23 May 2017
A DERANDOMIZED ALGORITHM FOR RP-ADMM WITH SYMMETRIC GAUSS-SEIDEL METHOD JINCHAO XU, KAILAI XU, AND YINYU YE arxiv:1705.08389v1 [math.oc] 23 May 2017 Abstract. For multi-block alternating direction method
More informationAccelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems)
Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems) Donghwan Kim and Jeffrey A. Fessler EECS Department, University of Michigan
More informationProperties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation
Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Adam J. Rothman School of Statistics University of Minnesota October 8, 2014, joint work with Liliana
More informationOn Optimal Frame Conditioners
On Optimal Frame Conditioners Chae A. Clark Department of Mathematics University of Maryland, College Park Email: cclark18@math.umd.edu Kasso A. Okoudjou Department of Mathematics University of Maryland,
More informationSOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING. Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu
SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu LITIS - EA 48 - INSA/Universite de Rouen Avenue de l Université - 768 Saint-Etienne du Rouvray
More informationApproximation. Inderjit S. Dhillon Dept of Computer Science UT Austin. SAMSI Massive Datasets Opening Workshop Raleigh, North Carolina.
Using Quadratic Approximation Inderjit S. Dhillon Dept of Computer Science UT Austin SAMSI Massive Datasets Opening Workshop Raleigh, North Carolina Sept 12, 2012 Joint work with C. Hsieh, M. Sustik and
More informationCoordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /
Coordinate descent Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Adding to the toolbox, with stats and ML in mind We ve seen several general and useful minimization tools First-order methods
More informationBlock Coordinate Descent for Regularized Multi-convex Optimization
Block Coordinate Descent for Regularized Multi-convex Optimization Yangyang Xu and Wotao Yin CAAM Department, Rice University February 15, 2013 Multi-convex optimization Model definition Applications Outline
More informationContraction Methods for Convex Optimization and monotone variational inequalities No.12
XII - 1 Contraction Methods for Convex Optimization and monotone variational inequalities No.12 Linearized alternating direction methods of multipliers for separable convex programming Bingsheng He Department
More informationOn the Convex Geometry of Weighted Nuclear Norm Minimization
On the Convex Geometry of Weighted Nuclear Norm Minimization Seyedroohollah Hosseini roohollah.hosseini@outlook.com Abstract- Low-rank matrix approximation, which aims to construct a low-rank matrix from
More informationContraction Methods for Convex Optimization and Monotone Variational Inequalities No.18
XVIII - 1 Contraction Methods for Convex Optimization and Monotone Variational Inequalities No18 Linearized alternating direction method with Gaussian back substitution for separable convex optimization
More informationLecture Notes 9: Constrained Optimization
Optimization-based data analysis Fall 017 Lecture Notes 9: Constrained Optimization 1 Compressed sensing 1.1 Underdetermined linear inverse problems Linear inverse problems model measurements of the form
More informationContraction Methods for Convex Optimization and Monotone Variational Inequalities No.16
XVI - 1 Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16 A slightly changed ADMM for convex optimization with three separable operators Bingsheng He Department of
More informationRandomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity
Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Zheng Qu University of Hong Kong CAM, 23-26 Aug 2016 Hong Kong based on joint work with Peter Richtarik and Dominique Cisba(University
More informationLearning with stochastic proximal gradient
Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and
More informationSupplementary Materials: Bilinear Factor Matrix Norm Minimization for Robust PCA: Algorithms and Applications
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 16 Supplementary Materials: Bilinear Factor Matrix Norm Minimization for Robust PCA: Algorithms and Applications Fanhua Shang, Member, IEEE,
More informationNonconvex Sparse Logistic Regression with Weakly Convex Regularization
Nonconvex Sparse Logistic Regression with Weakly Convex Regularization Xinyue Shen, Student Member, IEEE, and Yuantao Gu, Senior Member, IEEE Abstract In this work we propose to fit a sparse logistic regression
More informationDecoupled Collaborative Ranking
Decoupled Collaborative Ranking Jun Hu, Ping Li April 24, 2017 Jun Hu, Ping Li WWW2017 April 24, 2017 1 / 36 Recommender Systems Recommendation system is an information filtering technique, which provides
More informationMaster 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique
Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some
More informationSolving Corrupted Quadratic Equations, Provably
Solving Corrupted Quadratic Equations, Provably Yuejie Chi London Workshop on Sparse Signal Processing September 206 Acknowledgement Joint work with Yuanxin Li (OSU), Huishuai Zhuang (Syracuse) and Yingbin
More informationLasso: Algorithms and Extensions
ELE 538B: Sparsity, Structure and Inference Lasso: Algorithms and Extensions Yuxin Chen Princeton University, Spring 2017 Outline Proximal operators Proximal gradient methods for lasso and its extensions
More informationUses of duality. Geoff Gordon & Ryan Tibshirani Optimization /
Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear
More informationEUSIPCO
EUSIPCO 013 1569746769 SUBSET PURSUIT FOR ANALYSIS DICTIONARY LEARNING Ye Zhang 1,, Haolong Wang 1, Tenglong Yu 1, Wenwu Wang 1 Department of Electronic and Information Engineering, Nanchang University,
More informationSparse Solutions of an Undetermined Linear System
1 Sparse Solutions of an Undetermined Linear System Maddullah Almerdasy New York University Tandon School of Engineering arxiv:1702.07096v1 [math.oc] 23 Feb 2017 Abstract This work proposes a research
More informationCS598 Machine Learning in Computational Biology (Lecture 5: Matrix - part 2) Professor Jian Peng Teaching Assistant: Rongda Zhu
CS598 Machine Learning in Computational Biology (Lecture 5: Matrix - part 2) Professor Jian Peng Teaching Assistant: Rongda Zhu Feature engineering is hard 1. Extract informative features from domain knowledge
More informationAn Optimization-based Approach to Decentralized Assignability
2016 American Control Conference (ACC) Boston Marriott Copley Place July 6-8, 2016 Boston, MA, USA An Optimization-based Approach to Decentralized Assignability Alborz Alavian and Michael Rotkowitz Abstract
More informationFast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee
Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Shen-Yi Zhao and
More informationEE 367 / CS 448I Computational Imaging and Display Notes: Image Deconvolution (lecture 6)
EE 367 / CS 448I Computational Imaging and Display Notes: Image Deconvolution (lecture 6) Gordon Wetzstein gordon.wetzstein@stanford.edu This document serves as a supplement to the material discussed in
More informationAccelerated primal-dual methods for linearly constrained convex problems
Accelerated primal-dual methods for linearly constrained convex problems Yangyang Xu SIAM Conference on Optimization May 24, 2017 1 / 23 Accelerated proximal gradient For convex composite problem: minimize
More informationOn the interior of the simplex, we have the Hessian of d(x), Hd(x) is diagonal with ith. µd(w) + w T c. minimize. subject to w T 1 = 1,
Math 30 Winter 05 Solution to Homework 3. Recognizing the convexity of g(x) := x log x, from Jensen s inequality we get d(x) n x + + x n n log x + + x n n where the equality is attained only at x = (/n,...,
More informationSparse and low-rank decomposition for big data systems via smoothed Riemannian optimization
Sparse and low-rank decomposition for big data systems via smoothed Riemannian optimization Yuanming Shi ShanghaiTech University, Shanghai, China shiym@shanghaitech.edu.cn Bamdev Mishra Amazon Development
More informationDeep Learning: Approximation of Functions by Composition
Deep Learning: Approximation of Functions by Composition Zuowei Shen Department of Mathematics National University of Singapore Outline 1 A brief introduction of approximation theory 2 Deep learning: approximation
More informationA Randomized Approach for Crowdsourcing in the Presence of Multiple Views
A Randomized Approach for Crowdsourcing in the Presence of Multiple Views Presenter: Yao Zhou joint work with: Jingrui He - 1 - Roadmap Motivation Proposed framework: M2VW Experimental results Conclusion
More informationAn iterative hard thresholding estimator for low rank matrix recovery
An iterative hard thresholding estimator for low rank matrix recovery Alexandra Carpentier - based on a joint work with Arlene K.Y. Kim Statistical Laboratory, Department of Pure Mathematics and Mathematical
More informationLow-Rank Tensor Completion by Truncated Nuclear Norm Regularization
Low-Rank Tensor Completion by Truncated Nuclear Norm Regularization Shengke Xue, Wenyuan Qiu, Fan Liu, and Xinyu Jin College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou,
More informationImproved Bounded Matrix Completion for Large-Scale Recommender Systems
Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-7) Improved Bounded Matrix Completion for Large-Scale Recommender Systems Huang Fang, Zhen Zhang, Yiqun
More informationLarge-Scale Matrix Factorization with Distributed Stochastic Gradient Descent
Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent KDD 2011 Rainer Gemulla, Peter J. Haas, Erik Nijkamp and Yannis Sismanis Presenter: Jiawen Yao Dept. CSE, UT Arlington 1 1
More informationA Local Non-Negative Pursuit Method for Intrinsic Manifold Structure Preservation
Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence A Local Non-Negative Pursuit Method for Intrinsic Manifold Structure Preservation Dongdong Chen and Jian Cheng Lv and Zhang Yi
More informationSpectral k-support Norm Regularization
Spectral k-support Norm Regularization Andrew McDonald Department of Computer Science, UCL (Joint work with Massimiliano Pontil and Dimitris Stamos) 25 March, 2015 1 / 19 Problem: Matrix Completion Goal:
More informationIEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER 2011 7255 On the Performance of Sparse Recovery Via `p-minimization (0 p 1) Meng Wang, Student Member, IEEE, Weiyu Xu, and Ao Tang, Senior
More informationDiscriminative Feature Grouping
Discriative Feature Grouping Lei Han and Yu Zhang, Department of Computer Science, Hong Kong Baptist University, Hong Kong The Institute of Research and Continuing Education, Hong Kong Baptist University
More informationA direct formulation for sparse PCA using semidefinite programming
A direct formulation for sparse PCA using semidefinite programming A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley Available online at www.princeton.edu/~aspremon
More informationARock: an algorithmic framework for asynchronous parallel coordinate updates
ARock: an algorithmic framework for asynchronous parallel coordinate updates Zhimin Peng, Yangyang Xu, Ming Yan, Wotao Yin ( UCLA Math, U.Waterloo DCO) UCLA CAM Report 15-37 ShanghaiTech SSDS 15 June 25,
More informationMachine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression
Machine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression Due: Monday, February 13, 2017, at 10pm (Submit via Gradescope) Instructions: Your answers to the questions below,
More information