
Bilinear Factor Matrix Norm Minimization for Robust PCA: Algorithms and Applications

Fanhua Shang, Member, IEEE, James Cheng, Yuanyuan Liu, Zhi-Quan Luo, Fellow, IEEE, and Zhouchen Lin, Senior Member, IEEE

Abstract — The heavy-tailed distributions of corrupted outliers and singular values of all channels in low-level vision have proven to be effective priors for many applications such as background modeling, photometric stereo and image alignment, and they can be well modeled by a hyper-Laplacian. However, the use of such distributions generally leads to challenging non-convex, non-smooth and non-Lipschitz problems, and makes existing algorithms very slow for large-scale applications. Together with the analytic solutions to ℓp-norm minimization with two specific values of p, i.e., p = 1/2 and p = 2/3, we propose two novel bilinear factor matrix norm minimization models for robust principal component analysis. We first define the double nuclear norm and Frobenius/nuclear hybrid norm penalties, and then prove that they are in essence the Schatten-1/2 and 2/3 quasi-norms, respectively, which lead to much more tractable and scalable Lipschitz optimization problems. Our experimental analysis shows that both of our methods yield more accurate solutions than original Schatten quasi-norm minimization, even when the number of observations is very limited. Finally, we apply our penalties to various low-level vision problems, e.g., text removal, moving object detection, image alignment and inpainting, and show that our methods usually outperform the state-of-the-art methods.

Index Terms — Robust principal component analysis, rank minimization, Schatten-p quasi-norm, ℓp-norm, double nuclear norm penalty, Frobenius/nuclear hybrid norm penalty, alternating direction method of multipliers (ADMM).

1 INTRODUCTION

THE sparse and low-rank priors have been widely used in many real-world applications in computer vision and pattern recognition, such as image restoration [1], face recognition [2], subspace clustering [3], [4], [5] and robust principal component analysis [6] (RPCA, also called low-rank and sparse matrix decomposition in [7], [8] or robust matrix completion in [9]). Sparsity plays an important role in various low-level vision tasks. For instance, it has been observed that the gradient of natural scene images can be better modeled with a heavy-tailed distribution such as the hyper-Laplacian, p(x) ∝ e^(−k|x|^α), typically with 0.5 ≤ α ≤ 0.8, which corresponds to non-convex ℓp-norms [10], [11], as exhibited by the sparse noise/outliers in low-level vision problems [12] shown in Fig. 1. To induce sparsity, a principled way is to use the convex ℓ1-norm [6], [13], [14], [15], [16], [17], which is the closest convex relaxation of the sparser ℓp-norm, with compressed sensing being a prominent example. However, it has been shown in [18] that the ℓ1-norm over-penalizes large entries of vectors and results in a biased solution.

F. Shang, J. Cheng and Y. Liu are with the Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong. E-mail: {fhshang, jcheng, yyliu}@cse.cuhk.edu.hk. Z.-Q. Luo is with the Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen, China, and the Department of Electrical
and Computer Engineering, University of Minnesota, Minneapolis, MN 55455, USA. E-mail: luozq@cuhk.edu.cn and luozq@umn.edu. Z. Lin is with the Key Laboratory of Machine Perception (MOE), School of EECS, Peking University, Beijing, P.R. China, and the Cooperative Medianet Innovation Center, Shanghai Jiao Tong University, Shanghai, P.R. China. E-mail: zlin@pku.edu.cn. Manuscript received May 9, 2016.

Fig. 1. Sparsity priors. Left: a typical image of a video sequence. Middle: the sparse component recovered by [6]. Right: the empirical distribution of the sparse component (blue solid line), along with hyper-Laplacian fits with α = 1/2 (green dash-dot line) and α = 2/3 (red dotted line).

Compared with the ℓ1-norm, many non-convex surrogates of the ℓ0-norm listed in [19] give a closer approximation, e.g., SCAD [18] and MCP [20]. Although the use of hyper-Laplacian distributions makes the problems non-convex, fortunately an analytic solution can be derived for two specific values of p, 1/2 and 2/3, by finding the roots of a cubic and a quartic polynomial, respectively [10], [21], [22]. The resulting algorithm can be several orders of magnitude faster than existing algorithms [10].

As an extension from vectors to matrices, the low-rank structure is the sparsity of the singular values of a matrix. Rank minimization is a crucial regularizer for inducing a low-rank solution. To solve such a problem, the rank function is usually relaxed by its convex envelope [5], the nuclear norm (i.e., the sum of the singular values), also known as the trace norm or Schatten-1 norm [6]. Owing to the intimate relationship between the ℓ1-norm and the nuclear norm, the latter also over-penalizes large singular values; that is, it may make the solution deviate from the original solution just as the ℓ1-norm does [19]. Compared with the nuclear norm, the Schatten-q norm for 0 < q < 1 is equivalent to the ℓq-norm on singular values and makes a closer approximation to the rank function.

TABLE 1. The norms of sparse and low-rank matrices. Let σ = (σ_1, ..., σ_r) ∈ R^r be the non-zero singular values of L ∈ R^(m×n).

p, q      | Sparsity                                        | Low-rankness
0         | ‖S‖_ℓ0 (ℓ0-norm, number of non-zeros)           | ‖L‖_S0 = ‖σ‖_ℓ0 (rank)
(0, 1)    | ‖S‖_ℓp = (Σ_ij |S_ij|^p)^(1/p) (ℓp-norm)        | ‖L‖_Sq = (Σ_i σ_i^q)^(1/q) (Schatten-q norm)
1         | ‖S‖_ℓ1 = Σ_ij |S_ij| (ℓ1-norm)                  | ‖L‖_* = Σ_i σ_i (nuclear norm)
2         | ‖S‖_F = (Σ_ij S_ij^2)^(1/2) (Frobenius norm)    | ‖L‖_F = (Σ_i σ_i^2)^(1/2) (Frobenius norm)

Fig. 2. The heavy-tailed empirical distributions (bottom) of the singular values of the three channels (red, green, blue) of three example images (top).

Nie et al. [1] presented an efficient augmented Lagrange multiplier (ALM) method to solve the joint ℓp-norm and Schatten-q norm (LpSq) minimization. Lai et al. and Lu et al. proposed iteratively reweighted least squares methods for solving Schatten quasi-norm minimization problems. However, all these algorithms have to be solved iteratively and involve a singular value decomposition (SVD) in each iteration, which dominates the computational cost, O(min(m, n) mn) [4], [5].

It has been shown that the singular values of nonlocal matrices in natural images usually exhibit a heavy-tailed distribution p(σ) ∝ e^(−k σ^α), and similar phenomena have been observed in natural scenes [7], [8], as shown in Fig. 2. Similar to the case of heavy-tailed distributions of sparse outliers, analytic solutions can be derived for the two specific cases α = 1/2 and 2/3. However, such algorithms have high per-iteration complexity O(min(m, n) mn). Thus, we naturally want to design equivalent, tractable and scalable forms of the two cases of the Schatten-q quasi-norm, q = 1/2 and 2/3, which fit the heavy-tailed distribution of singular values more closely than the nuclear norm, analogous to the superiority of ℓp quasi-norm (hyper-Laplacian) priors over the ℓ1-norm (Laplacian) prior.

We summarize the main contributions of this work as follows.

1) By taking into account the heavy-tailed distributions of both sparse noise/outliers and singular values of matrices, we propose two novel tractable bilinear factor matrix norm minimization models for RPCA, which fit the empirical distributions of corrupted data very well.

2) Different from the definitions in our previous work [4], we define the double nuclear norm and Frobenius/nuclear hybrid norm penalties as tractable low-rank regularizers. We then prove that they are in essence the Schatten-1/2 and 2/3 quasi-norms, respectively. The solution of the resulting minimization problems only requires SVDs on two much smaller factor matrices, as compared with the much larger ones required by existing algorithms. Therefore, our algorithms reduce the per-iteration complexity from O(min(m, n) mn) to O(mnd), where d ≪ min(m, n) in general. In particular, our penalties are Lipschitz, and more tractable and scalable than original Schatten quasi-norm minimization, which is non-Lipschitz and generally NP-hard [9].

3) Moreover, we present the convergence property of the proposed algorithms for minimizing our RPCA models and provide their proofs. We also extend our algorithms to solve matrix completion problems, e.g., image inpainting.
4) We empirically study both of our bilinear factor matrix norm minimization methods and show that they outperform original Schatten norm minimization, even with only a few observations. Finally, we apply the defined low-rank regularizers to various low-level vision problems, e.g., text removal, moving object detection, and image alignment and inpainting, and obtain superior results over existing methods.

2 RELATED WORK

In this section, we mainly discuss some recent advances in RPCA, and briefly review existing work on RPCA and its applications in computer vision (readers may see [6] for a review). RPCA [4], [40] aims to recover a low-rank matrix L ∈ R^(m×n) (m ≥ n) and a sparse matrix S ∈ R^(m×n) from corrupted observations D = L + S ∈ R^(m×n) as follows:

  min_{L,S} λ rank(L) + ‖S‖_ℓ0,  s.t.  L + S = D,   (1)

where ‖·‖_ℓ0 denotes the ℓ0-norm¹ and λ > 0 is a regularization parameter. Unfortunately, solving (1) is NP-hard. Thus, we usually use convex or non-convex surrogates to replace both terms in (1), and formulate this problem into the following more general form:

  min_{L,S} λ‖L‖_Sq^q + ‖S‖_ℓp^p,  s.t.  P_Ω(L + S) = P_Ω(D),   (2)

where in general p, q ∈ [0, 2], ‖S‖_ℓp and ‖L‖_Sq are depicted in Table 1 and can be seen as the loss term and the regularization term, respectively, and P_Ω is the orthogonal projection onto the linear subspace of matrices supported on Ω := {(i, j) : D_ij is observed}: (P_Ω(D))_ij = D_ij if (i, j) ∈ Ω and (P_Ω(D))_ij = 0 otherwise. If Ω is a small subset of the entries of the matrix, (2) is also known as the robust matrix completion problem as in [9], and it is impossible to exactly recover S [41]. As analyzed in [4], we can easily see that the optimal solution satisfies S_{Ω^c} = 0, where Ω^c is the complement of Ω, i.e., the index set of unobserved entries. When p = 2 and q = 1, (2) becomes a nuclear norm regularized least squares problem as in [4] (e.g., image inpainting in Sec. 6.3.4).

¹ Strictly speaking, the ℓ0-norm is not actually a norm; it is defined as the number of non-zero elements. When p ≥ 1, ‖S‖_ℓp strictly defines a norm which satisfies the three norm conditions, while it defines a quasi-norm when 0 < p < 1. Due to the relationship between ‖S‖_ℓp and ‖L‖_Sq, the latter has the same cases as the former.

2.1 Convex Nuclear Norm Minimization

In [6], [40], [44], both of the non-convex terms in (1) are replaced by their convex envelopes, i.e., the nuclear norm (q = 1) and the ℓ1-norm (p = 1), respectively:

  min_{L,S} λ‖L‖_* + ‖S‖_ℓ1,  s.t.  L + S = D.   (3)
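As a concrete point of reference for the convex baseline (3), the following is a minimal sketch (written for this transcription, not the authors' implementation) of an inexact-ALM/ADMM loop built from the two proximal operators discussed in the next subsection; the default λ = sqrt(max(m, n)) follows the convention used later in this paper, while the μ initialization and ρ = 1.1 schedule are common choices and are assumptions here.

```python
import numpy as np

def soft_threshold(X, tau):
    # entry-wise soft-thresholding: proximal operator of the l1-norm
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(X, tau):
    # singular value thresholding: proximal operator of the nuclear norm
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def rpca_ialm(D, lam=None, rho=1.1, max_iter=500, tol=1e-7):
    """Inexact ALM sketch for (3): min lam*||L||_* + ||S||_1  s.t.  L + S = D."""
    m, n = D.shape
    lam = lam if lam is not None else np.sqrt(max(m, n))
    mu = 1.25 / np.linalg.norm(D, 2)          # assumed initialization
    Y = np.zeros_like(D)
    S = np.zeros_like(D)
    for _ in range(max_iter):
        L = svt(D - S + Y / mu, lam / mu)     # nuclear-norm proximal step
        S = soft_threshold(D - L + Y / mu, 1.0 / mu)  # l1-norm proximal step
        R = D - L - S
        Y = Y + mu * R                        # dual (multiplier) update
        mu = min(mu * rho, 1e7)
        if np.linalg.norm(R) <= tol * np.linalg.norm(D):
            break
    return L, S
```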

TABLE 2. Comparison of various RPCA models and their properties. Note that U ∈ R^(m×d) and V ∈ R^(n×d) are the factor matrices of L, i.e., L = UV^T.

Model           | Objective function                                        | Constraints                          | Parameters  | Convex? | Per-iteration complexity
RPCA [6], [44]  | λ‖L‖_* + ‖S‖_ℓ1                                           | L + S = D                            | λ           | Yes     | O(mn^2)
PSVT [45]       | λ‖L‖_d + ‖S‖_ℓ1                                           | L + S = D                            | λ, d        | No      | O(mn^2)
WNNM [46]       | λ‖L‖_{w,*} + ‖S‖_ℓ1                                       | L + S = D                            | λ           | No      | O(mn^2)
LMaFit [47]     | ‖D − L‖_ℓ1                                                | UV^T = L                             | d           | No      | O(mnd)
MTF [48]        | λ‖W‖_* + ‖S‖_ℓ1                                           | UWV^T + S = D, U^T U = V^T V = I_d   | λ, d        | No      | O(mnd)
RegL1 [16]      | λ‖V‖_* + ‖P_Ω(D − UV^T)‖_ℓ1                               | U^T U = I_d                          | λ, d        | No      | O(mnd)
Unifying [49]   | (λ/2)(‖U‖_F^2 + ‖V‖_F^2) + ‖P_Ω(D − L)‖_ℓ1                | UV^T = L                             | λ, d        | No      | O(mnd)
factEN [15]     | (λ_1/2)(‖U‖_F^2 + ‖V‖_F^2) + (λ_2/2)‖L‖_F^2 + ‖P_Ω(D − L)‖_ℓ1 | UV^T = L                         | λ_1, λ_2, d | No      | O(mnd)
LpSq [1]        | λ‖L‖_Sq^q + ‖P_Ω(D − L)‖_ℓp^p (0 < p, q ≤ 1)              | L + S = D                            | λ           | No      | O(mn^2)
S+L_1/2 (ours)  | λ(‖U‖_* + ‖V‖_*) + ‖P_Ω(S)‖_ℓ1/2^(1/2)                    | L + S = D, UV^T = L                  | λ, d        | No      | O(mnd)
S+L_2/3 (ours)  | λ(‖U‖_F^2 + ‖V‖_*) + ‖P_Ω(S)‖_ℓ2/3^(2/3)                  | L + S = D, UV^T = L                  | λ, d        | No      | O(mnd)

Wright et al. [40] and Candès et al. [6] proved that, under some mild conditions, the convex relaxation formulation (3) can exactly recover the low-rank and sparse matrices (L, S) with high probability. The formulation (3) has been widely used in many computer vision applications, such as object detection and background subtraction [17], image alignment [50], low-rank texture analysis [9], image and video restoration [51], and subspace clustering [7]. This is mainly because the optimal solutions of the sub-problems involving both terms in (3) can be obtained by two well-known proximal operators: the singular value thresholding (SVT) operator and the soft-thresholding operator [52]. The ℓ1-norm penalty in (3) can also be replaced by the ℓ1,2-norm as in outlier pursuit [8], [53], [54], [55] and subspace learning [5], [56], [57].

To efficiently solve the popular convex problem (3), various first-order optimization algorithms have been proposed, especially the alternating direction method of multipliers (ADMM) [58], also called the inexact ALM in [44]. However, they all involve computing the SVD of a large matrix of size m × n in each iteration and thus suffer from high computational cost, which severely limits their applicability to large-scale problems [59]; the same holds for existing Schatten-q quasi-norm (0 < q < 1) minimization algorithms such as LpSq [1]. While there have been many efforts towards fast SVD computation such as partial SVD [60], the performance of those methods is still unsatisfactory for many real applications [59], [61].

2.2 Non-Convex Formulations

To address this issue, Shen et al. [47] efficiently solved the RPCA problem by factorizing the low-rank component into two smaller factor matrices, i.e., L = UV^T as in [6], where U ∈ R^(m×d), V ∈ R^(n×d), and usually d ≪ min(m, n), as well as the matrix tri-factorization (MTF) [48] and factorized data [6] cases. In [16], [4], [64], [65], [66], a column-orthonormal constraint is imposed on the first factor matrix U. According to the following matrix property, the original convex problem can be reformulated as a smaller matrix nuclear norm minimization problem.

Property 1. For any matrix L ∈ R^(m×n) with L = UV^T, if U has orthonormal columns, i.e., U^T U = I_d, then ‖L‖_* = ‖V‖_*.

Cabral et al. [49] and Kim et al. [15] replaced the nuclear norm regularizer in (3) with the equivalent non-convex formulation stated in Lemma 1, and proposed scalable bilinear spectral regularized models, similar to collaborative filtering applications in [6], [67]. In [15], an elastic-net regularized matrix factorization model was proposed for subspace learning and low-level vision problems.
Besides the popular nuclear norm, some variants of the nuclear norm have been proposed to yield better performance. Gu et al. [7] proposed a weighted nuclear norm, i.e., ‖L‖_{w,*} = Σ_i w_i σ_i(L), where w = [w_1, ..., w_n]^T, and assigned different weights w_i to different singular values so that the shrinkage operator becomes more effective. Hu et al. [8] first used the truncated nuclear norm, i.e., ‖L‖_d = Σ_{i=d+1}^{n} σ_i(L), to address image recovery problems. Subsequently, Oh et al. [45] proposed an efficient partial singular value thresholding (PSVT) algorithm to solve many RPCA problems in low-level vision. The formulations mentioned above are summarized in Table 2.

3 BILINEAR FACTOR MATRIX NORM MINIMIZATION

In this section, we first define the double nuclear norm and Frobenius/nuclear hybrid norm penalties, and then prove the equivalence relationships between them and the Schatten quasi-norms. Incorporating hyper-Laplacian priors on both sparse noise/outliers and singular values, we propose two novel bilinear factor matrix norm regularized models for RPCA. Although the two models are still non-convex and even non-smooth, they are more tractable and scalable optimization problems, and each of their factor matrix terms is convex. In contrast, the original Schatten quasi-norm minimization problem is very difficult to solve because it is generally non-convex, non-smooth, and non-Lipschitz, as is the ℓq quasi-norm [9]. As in some collaborative filtering applications [6], [67], the nuclear norm has the following alternative non-convex formulations.

Lemma 1. For any matrix X ∈ R^(m×n) of rank at most r ≤ d, the following equalities hold:

  ‖X‖_* = min_{U∈R^(m×d), V∈R^(n×d): X=UV^T} (1/2)(‖U‖_F^2 + ‖V‖_F^2) = min_{U,V: X=UV^T} ‖U‖_F ‖V‖_F.   (4)
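As a quick numerical illustration of Lemma 1 (a sanity check added in this transcription, not part of the original paper), the balanced factorization U = L_X Σ^(1/2), V = R_X Σ^(1/2) built from the SVD of X attains both minima in (4), so evaluating the bilinear spectral penalty at it recovers the nuclear norm:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, d = 60, 50, 5, 8                  # rank r <= d
X = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

nuc = np.linalg.norm(X, "nuc")             # ||X||_*

# balanced factorization from the thin SVD of X (rank padded up to d)
U0, s, Vt = np.linalg.svd(X, full_matrices=False)
U = U0[:, :d] * np.sqrt(s[:d])             # U = L_X * Sigma^{1/2}
V = Vt[:d, :].T * np.sqrt(s[:d])           # V = R_X * Sigma^{1/2}

assert np.allclose(U @ V.T, X)
bilinear = 0.5 * (np.linalg.norm(U, "fro") ** 2 + np.linalg.norm(V, "fro") ** 2)
product = np.linalg.norm(U, "fro") * np.linalg.norm(V, "fro")
print(nuc, bilinear, product)              # all three agree up to numerical error
```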

The bilinear spectral penalty in the first equality of (4) has been widely used in low-rank matrix completion and recovery problems, such as RPCA [15], [49], online RPCA [68], matrix completion [69], and image inpainting [5].

3.1 Double Nuclear Norm Penalty

Inspired by the equivalence relation between the nuclear norm and the bilinear spectral penalty, our double nuclear norm (D-N) penalty is defined as follows.

Definition 1. For any matrix X ∈ R^(m×n) of rank at most r ≤ d, we decompose it into two factor matrices U ∈ R^(m×d) and V ∈ R^(n×d) such that X = UV^T. Then the double nuclear norm penalty of X is defined as

  ‖X‖_D-N = min_{U,V: X=UV^T} (1/4)(‖U‖_* + ‖V‖_*)^2.   (5)

Different from the definition in [4], [70], i.e., min_{U,V: X=UV^T} ‖U‖_* ‖V‖_*, which cannot be used directly to solve practical problems, Definition 1 can be directly used in practical low-rank matrix completion and recovery problems, e.g., RPCA and image recovery. Analogous to the well-known Schatten-q quasi-norm [1], the double nuclear norm penalty is also a quasi-norm, and their relationship is stated in the following theorem.

Theorem 1. The double nuclear norm penalty ‖·‖_D-N is a quasi-norm, and it is also the Schatten-1/2 quasi-norm, i.e.,

  ‖X‖_D-N = ‖X‖_S1/2.   (6)

To prove Theorem 1, we first give the following lemma, which extends the well-known trace inequality of John von Neumann [71], [72].

Lemma 2. Let X ∈ R^(n×n) be a symmetric positive semi-definite (PSD) matrix with full SVD X = U_X Σ_X U_X^T, where Σ_X = diag(λ_1, ..., λ_n). Suppose Y is a diagonal matrix, i.e., Y = diag(τ_1, ..., τ_n). If λ_1 ≥ ... ≥ λ_n ≥ 0 and 0 ≤ τ_1 ≤ ... ≤ τ_n, then Tr(X^T Y) ≥ Σ_{i=1}^{n} λ_i τ_i.

Lemma 2 can be seen as a special case of the well-known von Neumann trace inequality, and its proof is provided in the Supplementary Materials.

Lemma 3. For any matrix X = UV^T ∈ R^(m×n) with U ∈ R^(m×d) and V ∈ R^(n×d), the following inequality holds:

  (1/2)(‖U‖_* + ‖V‖_*) ≥ ‖X‖_S1/2^(1/2).

Proof: Let U = L_U Σ_U R_U^T and V = L_V Σ_V R_V^T be the thin SVDs of U and V, where L_U ∈ R^(m×d), L_V ∈ R^(n×d), and R_U, Σ_U, R_V, Σ_V ∈ R^(d×d). Let X = L_X Σ_X R_X^T, where the columns of L_X ∈ R^(m×d) and R_X ∈ R^(n×d) are the left and right singular vectors associated with the top d singular values of X of rank at most r (r ≤ d), and Σ_X = diag([σ_1, ..., σ_r, 0, ..., 0]) ∈ R^(d×d). Let W_1 = L_X Σ_U L_X^T, W_2 = L_X Σ_U^(1/2) Σ_X^(1/2) L_X^T and W_3 = L_X Σ_X L_X^T. We first construct two PSD matrices M_1 and S_1, where M_1 = [[W_1, W_2], [W_2^T, W_3]] is built from the factor [L_X Σ_U^(1/2); L_X Σ_X^(1/2)] and S_1 is built analogously from I_m and L_U Σ_U^(-1) L_U^T. Because the trace of the product of two PSD matrices is always non-negative (see the proof of Lemma 6 in [7]), we obtain

  Tr(W_1) − 2 Tr(L_U Σ_U^(-1) L_U^T W_2^T) + Tr(L_U Σ_U^(-1) L_U^T W_3) ≥ 0.   (7)

Recalling X = UV^T and L_X Σ_X R_X^T = L_U Σ_U R_U^T R_V Σ_V L_V^T, we have L_U^T L_X Σ_X = Σ_U R_U^T R_V Σ_V L_V^T R_X. Due to the orthonormality of the columns of L_U, R_U, L_V, R_V, L_X and R_X, and using the well-known von Neumann trace inequality [71], [72], we obtain

  Tr(L_U Σ_U^(-1) L_U^T W_3) ≤ Tr(Σ_V) = ‖V‖_*.   (8)
Using Lemma 2, we also have

  Tr(L_U Σ_U^(-1) L_U^T W_2^T) = Tr(Σ_U^(-1) O_1 Σ_U^(1/2) Σ_X^(1/2) O_1^T) ≥ Tr(Σ_X^(1/2)) = ‖X‖_S1/2^(1/2),   (9)

where O_1 = L_U^T L_X ∈ R^(d×d), and it is easy to verify that O_1 Σ_U^(1/2) Σ_X^(1/2) O_1^T is a symmetric PSD matrix. Combining (7), (8), (9) and Tr(W_1) = ‖U‖_*, we have ‖U‖_* + ‖V‖_* ≥ 2‖X‖_S1/2^(1/2).

The detailed proof of Theorem 1 is provided in the Supplementary Materials. According to Theorem 1, it is easy to verify that the double nuclear norm penalty possesses the following property [4].

Property 2. Given a matrix X ∈ R^(m×n) with rank at most d, the following equalities hold:

  ‖X‖_D-N = min_{U∈R^(m×d), V∈R^(n×d): X=UV^T} (1/4)(‖U‖_* + ‖V‖_*)^2 = min_{U,V: X=UV^T} ‖U‖_* ‖V‖_*.
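In the same spirit as the check after Lemma 1, the following small illustration (added in this transcription; it assumes, as Theorem 1 and Property 2 assert, that the balanced factorization U = L_X Σ^(1/2), V = R_X Σ^(1/2) is a minimizer of (5)) compares the D-N penalty with the Schatten-1/2 quasi-norm computed directly from the singular values:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r, d = 80, 60, 6, 10
X = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

U0, s, Vt = np.linalg.svd(X, full_matrices=False)
U = U0[:, :d] * np.sqrt(s[:d])                  # U = L_X Sigma^{1/2}
V = Vt[:d, :].T * np.sqrt(s[:d])                # V = R_X Sigma^{1/2}

schatten_half = np.sum(np.sqrt(s)) ** 2         # ||X||_{S_{1/2}} = (sum_i sigma_i^{1/2})^2
d_n_penalty = 0.25 * (np.linalg.norm(U, "nuc") + np.linalg.norm(V, "nuc")) ** 2
product_form = np.linalg.norm(U, "nuc") * np.linalg.norm(V, "nuc")   # Property 2

print(schatten_half, d_n_penalty, product_form)  # the three values coincide (up to rounding)
```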

3.2 Frobenius/Nuclear Norm Penalty

Inspired by the definitions of the bilinear spectral and double nuclear norm penalties above, we define a Frobenius/nuclear hybrid norm (F-N) penalty as follows.

Definition 2. For any matrix X ∈ R^(m×n) of rank at most r ≤ d, we decompose it into two factor matrices U ∈ R^(m×d) and V ∈ R^(n×d) such that X = UV^T. Then the Frobenius/nuclear hybrid norm penalty of X is defined as

  ‖X‖_F-N = min_{U,V: X=UV^T} [ (‖U‖_F^2 + 2‖V‖_*) / 3 ]^(3/2).   (10)

Different from the definition in [4], i.e., min_{U,V: X=UV^T} ‖U‖_F ‖V‖_*, Definition 2 can also be directly used in practical problems. Analogous to the double nuclear norm penalty, the Frobenius/nuclear hybrid norm penalty is also a quasi-norm, as stated in the following theorem.

Theorem 2. The Frobenius/nuclear hybrid norm penalty ‖·‖_F-N is a quasi-norm, and it is also the Schatten-2/3 quasi-norm, i.e.,

  ‖X‖_F-N = ‖X‖_S2/3.   (11)

To prove Theorem 2, we first give the following lemma.

Lemma 4. For any matrix X = UV^T ∈ R^(m×n) with U ∈ R^(m×d) and V ∈ R^(n×d), the following inequality holds:

  (1/3)(‖U‖_F^2 + 2‖V‖_*) ≥ ‖X‖_S2/3^(2/3).

Proof: We use the same notation as in the proof of Lemma 3; e.g., X = L_X Σ_X R_X^T, and U = L_U Σ_U R_U^T and V = L_V Σ_V R_V^T denote the thin SVDs of U and V, respectively. Let W_1 = R_X Σ_V R_X^T, W_2 = R_X Σ_V^(1/2) Σ_X^(2/3) R_X^T and W_3 = R_X Σ_X^(4/3) R_X^T. Analogously to Lemma 3, we construct PSD matrices M_2 and S_2 from these blocks and from L_V Σ_V^(-1) L_V^T, and from the non-negativity of the trace of the product of two PSD matrices we obtain

  Tr(W_1) − 2 Tr(L_V Σ_V^(-1) L_V^T W_2^T) + Tr(L_V Σ_V^(-1) L_V^T W_3) ≥ 0.   (12)

Since L_X Σ_X R_X^T = L_U Σ_U R_U^T R_V Σ_V L_V^T, we have L_V^T R_X Σ_X = Σ_V R_V^T R_U Σ_U L_U^T L_X. Due to the orthonormality of the columns of L_U, R_U, L_V, R_V, L_X and R_X, and using von Neumann's trace inequality [71], [72] together with the inequality ab ≤ (a^2 + b^2)/2, we have

  Tr(L_V Σ_V^(-1) L_V^T W_3) ≤ Tr(Σ_U Σ_X^(1/3)) ≤ (1/2)‖U‖_F^2 + (1/2)‖X‖_S2/3^(2/3).   (13)

By Lemma 2, we also have

  Tr(L_V Σ_V^(-1) L_V^T W_2^T) = Tr(Σ_V^(-1) O_2 Σ_V^(1/2) Σ_X^(2/3) O_2^T) ≥ Tr(Σ_X^(2/3)) = ‖X‖_S2/3^(2/3),   (14)

where O_2 = L_V^T R_X ∈ R^(d×d), and it is easy to verify that O_2 Σ_V^(1/2) Σ_X^(2/3) O_2^T is a symmetric PSD matrix. Combining (12), (13), (14) and Tr(W_1) = ‖V‖_*, we have ‖U‖_F^2 + 2‖V‖_* ≥ 3‖X‖_S2/3^(2/3).

According to Theorem 2 (see the Supplementary Materials for its detailed proof), it is easy to verify that the Frobenius/nuclear hybrid norm penalty possesses the following property [4].

Property 3. For any matrix X ∈ R^(m×n) with rank(X) = r ≤ d, the following equalities hold:

  ‖X‖_F-N = min_{U∈R^(m×d), V∈R^(n×d): X=UV^T} [ (‖U‖_F^2 + 2‖V‖_*) / 3 ]^(3/2) = min_{U,V: X=UV^T} ‖U‖_F ‖V‖_*.

Similar to the relationship between the Frobenius norm and the nuclear norm, i.e., ‖X‖_F ≤ ‖X‖_* ≤ sqrt(rank(X)) ‖X‖_F [6], the following bounds hold between the double nuclear norm and Frobenius/nuclear hybrid norm penalties and the nuclear norm, as stated in the following property.

Property 4. For any matrix X ∈ R^(m×n), the following inequalities hold:

  ‖X‖_* ≤ ‖X‖_F-N ≤ ‖X‖_D-N ≤ rank(X) ‖X‖_*.

The proof of Property 4 is similar to that in [4]. Moreover, both the double nuclear norm and Frobenius/nuclear hybrid norm penalties naturally satisfy many properties of quasi-norms, e.g., unitary invariance. Property 4 in turn implies that any matrix with a low F-N or D-N penalty also has a low nuclear norm.

3.3 Problem Formulations

Without loss of generality, we can assume that the unknown entries of D are set to zero (i.e., P_{Ω^c}(D) = 0), and S_{Ω^c} may take any values (i.e., entries in (−∞, +∞))
such that P_{Ω^c}(L + S) = P_{Ω^c}(D). Thus, the constraint with the projection operator P_Ω in (2) is considered instead of simply L + S = D as in [4]. Together with hyper-Laplacian priors on the sparse components, we use ‖L‖_D-N^(1/2) and ‖L‖_F-N^(2/3), defined above, to replace ‖L‖_Sq^q in (2), and present the following double nuclear norm and Frobenius/nuclear hybrid norm penalized RPCA models:

  min_{U,V,L,S} λ(‖U‖_* + ‖V‖_*) + ‖P_Ω(S)‖_ℓ1/2^(1/2),  s.t.  UV^T = L, L + S = D,   (15)

  min_{U,V,L,S} λ(‖U‖_F^2 + ‖V‖_*) + ‖P_Ω(S)‖_ℓ2/3^(2/3),  s.t.  UV^T = L, L + S = D.   (16)

From the two proposed models (15) and (16), one can easily see that the norm of each bilinear factor matrix is convex, and that they are much more tractable and scalable optimization problems than the original Schatten quasi-norm minimization problem in (2). Considering the optimal solution S_{Ω^c} = 0, S_{Ω^c} must be set to 0 in the expected output S.
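To fix notation, the following snippet (an illustrative sketch written for this transcription, not the authors' code; the choice λ = sqrt(max(m, n)) follows the default suggested later in Sec. 6.1) evaluates the objective of the S+L_1/2 model (15) for given factors, sparse component and observation mask:

```python
import numpy as np

def sl_half_objective(U, V, S, mask, lam):
    """Objective of (15): lam*(||U||_* + ||V||_*) + ||P_Omega(S)||_{1/2}^{1/2},
    assuming the constraints L = U V^T and L + S = D are handled elsewhere."""
    low_rank_term = lam * (np.linalg.norm(U, "nuc") + np.linalg.norm(V, "nuc"))
    sparse_term = np.sum(np.sqrt(np.abs(S[mask])))   # sum over observed entries of |S_ij|^{1/2}
    return low_rank_term + sparse_term

# toy usage
rng = np.random.default_rng(2)
m, n, d = 40, 30, 5
U, V = rng.standard_normal((m, d)), rng.standard_normal((n, d))
S = np.zeros((m, n))
idx = rng.random((m, n)) < 0.1
S[idx] = rng.uniform(-5.0, 5.0, size=idx.sum())
mask = rng.random((m, n)) < 0.8                      # observed entries Omega
lam = np.sqrt(max(m, n))
print(sl_half_objective(U, V, S, mask, lam))
```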

4 OPTIMIZATION ALGORITHMS

To efficiently solve our two challenging problems (15) and (16), we introduce the auxiliary variables Û and V̂ (or only V̂) to split the interdependent terms so that they can be solved independently. Thus, we can reformulate problems (15) and (16) into the following equivalent forms:

  min_{U,V,L,S,Û,V̂} λ(‖Û‖_* + ‖V̂‖_*) + ‖P_Ω(S)‖_ℓ1/2^(1/2),  s.t.  Û = U, V̂ = V, UV^T = L, L + S = D,   (17)

  min_{U,V,L,S,V̂} λ(‖U‖_F^2 + ‖V̂‖_*) + ‖P_Ω(S)‖_ℓ2/3^(2/3),  s.t.  V̂ = V, UV^T = L, L + S = D.   (18)

4.1 Solving (17) via ADMM

Inspired by recent progress on alternating direction methods [44], [58], we propose an efficient algorithm based on the alternating direction method of multipliers (ADMM) [58], also known as the inexact ALM [44], to solve the more complex problem (17), whose augmented Lagrangian function is given by

  L_μ(U, V, L, S, Û, V̂, {Y_i}) = λ(‖Û‖_* + ‖V̂‖_*) + ‖P_Ω(S)‖_ℓ1/2^(1/2) + ⟨Y_1, Û − U⟩ + ⟨Y_2, V̂ − V⟩ + ⟨Y_3, UV^T − L⟩ + ⟨Y_4, L + S − D⟩ + (μ/2)(‖Û − U‖_F^2 + ‖V̂ − V‖_F^2 + ‖UV^T − L‖_F^2 + ‖L + S − D‖_F^2),

where μ > 0 is the penalty parameter, ⟨·,·⟩ denotes the inner product, and Y_1 ∈ R^(m×d), Y_2 ∈ R^(n×d) and Y_3, Y_4 ∈ R^(m×n) are Lagrange multipliers.

4.1.1 Updating U_{k+1} and V_{k+1}

To update U_{k+1} and V_{k+1}, we consider the following optimization problems:

  min_U ‖Û_k − U + μ_k^(-1) Y_1^k‖_F^2 + ‖U V_k^T − L_k + μ_k^(-1) Y_3^k‖_F^2,   (19)
  min_V ‖V̂_k − V + μ_k^(-1) Y_2^k‖_F^2 + ‖U_{k+1} V^T − L_k + μ_k^(-1) Y_3^k‖_F^2.   (20)

Both (19) and (20) are least squares problems, and their optimal solutions are given by

  U_{k+1} = (Û_k + μ_k^(-1) Y_1^k + M_k V_k)(I_d + V_k^T V_k)^(-1),   (21)
  V_{k+1} = (V̂_k + μ_k^(-1) Y_2^k + M_k^T U_{k+1})(I_d + U_{k+1}^T U_{k+1})^(-1),   (22)

where M_k = L_k − μ_k^(-1) Y_3^k, and I_d denotes the identity matrix of size d × d.

4.1.2 Updating Û_{k+1} and V̂_{k+1}

To solve for Û_{k+1} and V̂_{k+1}, we fix the other variables and solve the following optimization problems:

  min_Û λ‖Û‖_* + (μ_k/2)‖Û − U_{k+1} + Y_1^k/μ_k‖_F^2,   (23)
  min_V̂ λ‖V̂‖_* + (μ_k/2)‖V̂ − V_{k+1} + Y_2^k/μ_k‖_F^2.   (24)

Both (23) and (24) are nuclear norm regularized least squares problems, and their closed-form solutions are given by the so-called SVT operator.

Algorithm 1: ADMM for solving the S+L_1/2 problem (17)
Input: P_Ω(D) ∈ R^(m×n), the given rank d, and λ.
Initialize: μ_0, ρ > 1, k = 0, and ϵ.
1: while not converged do
2:   while not converged do
3:     Update U_{k+1} and V_{k+1} by (21) and (22).
4:     Compute Û_{k+1} and V̂_{k+1} via the SVT operator.
5:     Update L_{k+1} and S_{k+1} by (26) and (29).
6:   end while // Inner loop
7:   Update the multipliers by
       Y_1^{k+1} = Y_1^k + μ_k(Û_{k+1} − U_{k+1}),  Y_2^{k+1} = Y_2^k + μ_k(V̂_{k+1} − V_{k+1}),
       Y_3^{k+1} = Y_3^k + μ_k(U_{k+1} V_{k+1}^T − L_{k+1}),  Y_4^{k+1} = Y_4^k + μ_k(L_{k+1} + S_{k+1} − D).
8:   Update μ_{k+1} by μ_{k+1} = ρ μ_k.
9:   k ← k + 1.
10: end while // Outer loop
Output: U_{k+1} and V_{k+1}.

4.1.3 Updating L_{k+1}

To update L_{k+1}, we obtain the following optimization problem:

  min_L ‖U_{k+1} V_{k+1}^T − L + μ_k^(-1) Y_3^k‖_F^2 + ‖L + S_k − D + μ_k^(-1) Y_4^k‖_F^2.   (25)

Since (25) is a least squares problem, its closed-form solution is given by

  L_{k+1} = (1/2)(U_{k+1} V_{k+1}^T + μ_k^(-1) Y_3^k − S_k + D − μ_k^(-1) Y_4^k).   (26)

4.1.4 Updating S_{k+1}

Keeping all other variables fixed, S_{k+1} can be updated by solving the following problem:

  min_S ‖P_Ω(S)‖_ℓ1/2^(1/2) + (μ_k/2)‖S + L_{k+1} − D + μ_k^(-1) Y_4^k‖_F^2.   (27)

Generally, ℓp-norm (0 < p < 1) minimization leads to a non-convex, non-smooth, and non-Lipschitz optimization problem [9]. Fortunately, we can efficiently solve (27) by introducing the following half-thresholding operator [1].

Proposition 1. For any matrix A ∈ R^(m×n), let X* ∈ R^(m×n) be an ℓ1/2 quasi-norm solution of the following minimization problem:

  min_X ‖X − A‖_F^2 + γ‖X‖_ℓ1/2^(1/2).   (28)

Then the solution is given by X* = H_γ(A), where the half-thresholding operator H_γ is defined entry-wise as

  H_γ(a_ij) = (2/3) a_ij (1 + cos((2π/3) − (2/3) φ_γ(a_ij))),  if |a_ij| > (54^(1/3)/4) γ^(2/3);  0, otherwise,

where φ_γ(a_ij) = arccos((γ/8) (|a_ij|/3)^(−3/2)).
Before giving the proof of Proposition 1, we first present the following lemma [1].

Lemma 5. Let y ∈ R^(l_1) be a given vector and τ > 0. Suppose that x* ∈ R^(l_2) is a solution of the following problem:

  min_x ‖Bx − y‖_2^2 + τ‖x‖_ℓ1/2^(1/2).

Then for any real parameter μ > 0, x* can be expressed as x* = H_{τμ}(φ_μ(x*)), where φ_μ(x) = x + μ B^T(y − Bx).
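A direct element-wise implementation of the half-thresholding operator of Proposition 1 is sketched below (added in this transcription; the formula follows the expression given above, and for entries above the threshold the arccos argument is automatically within [−1, 1], so no extra guards are needed):

```python
import numpy as np

def half_threshold(A, gamma):
    """Half-thresholding operator H_gamma of Proposition 1, applied entry-wise:
    solves min_X ||X - A||_F^2 + gamma * ||X||_{1/2}^{1/2}."""
    A = np.asarray(A, dtype=float)
    thresh = (54.0 ** (1.0 / 3.0) / 4.0) * gamma ** (2.0 / 3.0)
    X = np.zeros_like(A)
    idx = np.abs(A) > thresh
    a = A[idx]
    phi = np.arccos((gamma / 8.0) * (np.abs(a) / 3.0) ** (-1.5))
    X[idx] = (2.0 / 3.0) * a * (1.0 + np.cos(2.0 * np.pi / 3.0 - 2.0 * phi / 3.0))
    return X
```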

Proof: Problem (28) can be reformulated in the following equivalent form:

  min_{vec(X)} ‖vec(X) − vec(A)‖_2^2 + γ‖vec(X)‖_ℓ1/2^(1/2).

Let B = I_{mn} and μ = 1. Using Lemma 5, the closed-form solution of (28) is given by vec(X*) = H_γ(vec(A)).

Using Proposition 1, the closed-form solution of (27) is

  S_{k+1} = P_Ω(H_{2/μ_k}(D − L_{k+1} − μ_k^(-1) Y_4^k)) + P_Ω^⊥(D − L_{k+1} − μ_k^(-1) Y_4^k),   (29)

where P_Ω^⊥ is the complementary operator of P_Ω. Alternatively, Zuo et al. proposed a generalized shrinkage-thresholding operator to iteratively solve ℓp-norm minimization with arbitrary p values, i.e., 0 ≤ p < 1, with higher efficiency.

Based on the description above, we develop an efficient ADMM algorithm to solve the double nuclear norm penalized problem (17), as outlined in Algorithm 1. To further accelerate convergence, the penalty parameter μ, as well as ρ, is adaptively updated by the strategy in [44]. The varying μ, together with shrinkage-thresholding operators such as the SVT operator, sometimes plays the role of adaptively selecting the rank of matrices or the number of non-zero elements. We found that updating {U, V}, {Û, V̂}, L and S just once in the inner loop is sufficient to generate a satisfyingly accurate solution of (17); this inexact ALM variant is used for computational efficiency. In addition, we initialize all the elements of the Lagrange multipliers Y_1, Y_2 and Y_3 to 0, while all elements of Y_4 are initialized in the same way as in [44].

4.2 Solving (18) via ADMM

Similar to Algorithm 1, we also propose an efficient ADMM algorithm to solve (18) (i.e., Algorithm 2), and provide the details in the Supplementary Materials. Since the update schemes for U, V, V̂ and L are very similar to those of Algorithm 1, we discuss only the major differences below. Keeping all other variables fixed, S_{k+1} can be updated by solving the following problem:

  min_S ‖P_Ω(S)‖_ℓ2/3^(2/3) + (μ_k/2)‖S + L_{k+1} − D + μ_k^(-1) Y_3^k‖_F^2.   (30)

Inspired by [10], [74], we introduce the following two-thirds-thresholding operator to efficiently solve (30).

Proposition 2. For any matrix C ∈ R^(m×n), let X* ∈ R^(m×n) be an ℓ2/3 quasi-norm solution of the following minimization problem:

  min_X ‖X − C‖_F^2 + γ‖X‖_ℓ2/3^(2/3).   (31)

Then the solution is given by X* = T_γ(C), where the two-thirds-thresholding operator T_γ is defined entry-wise as

  T_γ(c_ij) = sgn(c_ij) (ψ_γ(c_ij) + sqrt(2|c_ij|/ψ_γ(c_ij) − ψ_γ^2(c_ij)))^3 / 8,  if |c_ij| > (2/3)(3γ^3)^(1/4);  0, otherwise,

where ψ_γ(c_ij) = (2/sqrt(3)) γ^(1/4) (cosh((1/3) arccosh(27 c_ij^2 / (16 γ^(3/2)))))^(1/2), and sgn(·) is the sign function.

Before giving the proof of Proposition 2, we first present the following lemma [74].

Lemma 6. Let y ∈ R be a given real number, and τ > 0. Suppose that x* ∈ R is a solution of the following problem:

  min_x x^2 − 2xy + τ|x|^(2/3).

Then x* has the following closed-form thresholding formula:

  x* = sgn(y)(ψ_τ(y) + sqrt(2|y|/ψ_τ(y) − ψ_τ^2(y)))^3 / 8,  if |y| > (2/3)(3τ^3)^(1/4);  0, otherwise.

Proof: It is clear that the operator in Lemma 6 can be extended to vectors and matrices by applying it element-wise. Using Lemma 6, the closed-form thresholding formula for (31) is given by X* = T_γ(C).

By Proposition 2, the closed-form solution of (30) is

  S_{k+1} = P_Ω(T_{2/μ_k}(D − L_{k+1} − μ_k^(-1) Y_3^k)) + P_Ω^⊥(D − L_{k+1} − μ_k^(-1) Y_3^k).   (32)
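The two-thirds-thresholding operator can be sketched as follows (written for this transcription; the constants follow the standard closed-form ℓ_{2/3} thresholding formula of [10], [74] as reconstructed above, so treat them as an assumption rather than a verbatim copy of the paper's equation):

```python
import numpy as np

def two_thirds_threshold(C, gamma):
    """Two-thirds-thresholding operator T_gamma of Proposition 2, entry-wise:
    solves min_X ||X - C||_F^2 + gamma * ||X||_{2/3}^{2/3}."""
    C = np.asarray(C, dtype=float)
    thresh = (2.0 / 3.0) * (3.0 * gamma ** 3) ** 0.25
    X = np.zeros_like(C)
    idx = np.abs(C) > thresh
    c = np.abs(C[idx])
    phi = np.arccosh(27.0 * c ** 2 / (16.0 * gamma ** 1.5))
    psi = (2.0 / np.sqrt(3.0)) * gamma ** 0.25 * np.sqrt(np.cosh(phi / 3.0))
    root = np.sqrt(2.0 * c / psi - psi ** 2)
    X[idx] = np.sign(C[idx]) * ((psi + root) / 2.0) ** 3
    return X
```

As a quick consistency check, with γ = 3 and c = 8.5 this operator returns x ≈ 8, which satisfies the stationarity condition c = x + (γ/3) x^(−1/3) of the scalar problem in Lemma 6.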
5 ALGORITHM ANALYSIS

We mainly analyze the convergence property of Algorithm 1; the convergence of Algorithm 2 can be guaranteed in a similar way. We also analyze the per-iteration complexity of both algorithms.

5.1 Convergence Analysis

Before analyzing the convergence of Algorithm 1, we first introduce the definition of the critical points (or stationary points) of a non-convex function, as given in [75].

Definition 3. Given a function f : R^n → (−∞, +∞], we denote the domain of f by dom(f), i.e., dom(f) := {x ∈ R^n : f(x) < +∞}. A function f is said to be proper if dom(f) ≠ ∅, and lower semi-continuous at the point x_0 if lim inf_{x→x_0} f(x) ≥ f(x_0). If f is lower semi-continuous at every point of its domain, then it is called a lower semi-continuous function.

Definition 4. Let f : R^n → (−∞, +∞] be a proper and lower semi-continuous function. x is a critical point of f if 0 ∈ ∂f(x), where ∂f(x) is the limiting sub-differential of f at x, i.e., ∂f(x) = {u ∈ R^n : there exist x_k → x with f(x_k) → f(x) and u_k ∈ ∂̂f(x_k) with u_k → u as k → ∞}, and ∂̂f(x) is the Fréchet sub-differential of f at x (see [75] for more details).

As stated in [45], [76], the general convergence of ADMM for non-convex problems has not been settled, especially for multi-block cases [76], [77]. For such challenging problems as (15) and (16), although it is difficult to guarantee convergence to a local minimum, our empirical convergence tests show that our algorithms have strong convergence behavior (see the Supplementary Materials for details). Besides the empirical behavior, we also provide the following convergence property for Algorithm 1.

Theorem 3. Let {(U_k, V_k, Û_k, V̂_k, L_k, S_k, {Y_i^k})} be the sequence generated by Algorithm 1. Suppose that the sequence {Y^k} is bounded, and that μ_k is non-decreasing with Σ_{k=0}^{∞} μ_{k+1}/μ_k^(4/3) < ∞. Then

(I) {(U_k, V_k)}, {(Û_k, V̂_k)}, {L_k} and {S_k} are all Cauchy sequences;

(II) any accumulation point of the sequence {(U_k, V_k, Û_k, V̂_k, L_k, S_k)} satisfies the Karush-Kuhn-Tucker (KKT) conditions for problem (17).

The proof of Theorem 3 can be found in the Supplementary Materials. Theorem 3 shows that, under mild conditions, any accumulation point (or limit point) of the sequence generated by Algorithm 1 is a critical point of the Lagrangian function L_μ, i.e., a point (U*, V*, Û*, V̂*, L*, S*, {Y_i*}) which satisfies the first-order optimality conditions (i.e., the KKT conditions) of (17): 0 ∈ λ∂‖Û*‖_* + Y_1*, 0 ∈ λ∂‖V̂*‖_* + Y_2*, 0 ∈ ∂Φ(S*) + P_Ω(Y_4*), P_{Ω^c}(Y_4*) = 0, L* = U*(V*)^T, Û* = U*, V̂* = V*, and L* + S* = D, where Φ(S) := ‖P_Ω(S)‖_ℓ1/2^(1/2). Similarly, the convergence of Algorithm 2 can also be guaranteed. Theorem 3 is established for the proposed ADMM algorithm with only a single iteration in its inner loop; when the inner-loop iterations of Algorithm 1 run until convergence, a simpler proof may be possible. We leave further theoretical analysis of convergence as future work. Theorem 3 also shows that our ADMM algorithms have much weaker convergence conditions than those in [15], [45]: e.g., only the sequence of a single Lagrange multiplier is required to be bounded for our algorithms, while the ADMM algorithms in [15], [45] require the sequences of all Lagrange multipliers to be bounded.

5.2 Convergence Behavior

According to the KKT conditions of (17) mentioned above, we take the following condition as the stopping criterion for our algorithms (see the Supplementary Materials for details):

  max{ϵ_1/‖D‖_F, ϵ_2} < ϵ,

where ϵ_1 = max{‖U_k V_k^T − L_k‖_F, ‖L_k + S_k − D‖_F, e_Y}, in which e_Y is the residual of the multiplier optimality condition involving Y_1^k, Û_k and the pseudo-inverse V_k^† of V_k; ϵ_2 = max{‖Û_k − U_k‖_F/‖U_k‖_F, ‖V̂_k − V_k‖_F/‖V_k‖_F}; and ϵ is the stopping tolerance. In this paper, we set the stopping tolerance to ϵ = 10^(−5) for synthetic data and ϵ = 10^(−4) for real-world problems. As shown in the Supplementary Materials, the stopping tolerance and the relative squared error (RSE) of our methods decrease quickly, and they converge within only a small number of iterations (usually within 50 iterations).

5.3 Computational Complexity

The per-iteration cost of existing Schatten quasi-norm minimization methods such as LpSq [1] is dominated by the computation of the thin SVD of an m × n matrix (with m ≥ n), which is O(mn^2). In contrast, the time complexity of computing the SVDs for (23) and (24) is O(md^2 + nd^2). The dominant cost of each iteration of Algorithm 1 corresponds to the matrix multiplications in the updates of U, V and L, which take O(mnd + d^3). Given that d ≪ min(m, n), the overall complexity of Algorithm 1, as well as of Algorithm 2, is thus O(mnd), which is the same as the complexity of LMaFit [47], RegL1 [16], ROSL [64], Unifying [49], and factEN [15].
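The stopping test of Sec. 5.2 can be sketched as follows (an illustration added in this transcription; it implements only the two feasibility residuals and the two relative-change terms, omitting the multiplier-based residual e_Y, so it is a simplification of the criterion described above):

```python
import numpy as np

def stopping_criterion(U, V, U_hat, V_hat, L, S, D, eps=1e-4):
    """Simplified version of the stopping test: max(eps1/||D||_F, eps2) < eps."""
    eps1 = max(np.linalg.norm(U @ V.T - L, "fro"),
               np.linalg.norm(L + S - D, "fro"))
    eps2 = max(np.linalg.norm(U_hat - U, "fro") / np.linalg.norm(U, "fro"),
               np.linalg.norm(V_hat - V, "fro") / np.linalg.norm(V, "fro"))
    return max(eps1 / np.linalg.norm(D, "fro"), eps2) < eps
```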
6 EXPERIMENTAL RESULTS

In this section, we evaluate both the effectiveness and efficiency of our methods (i.e., S+L_1/2 and S+L_2/3) on extensive synthetic and real-world problems. We also compare our methods with several state-of-the-art methods, such as LMaFit [47], RegL1 [16], Unifying [49], factEN [15], RPCA [44], PSVT [45], WNNM [46], and LpSq [1].

6.1 Rank Estimation

As suggested in [6], [8], the regularization parameter λ of our two methods is generally set to sqrt(max(m, n)). Analogous to other matrix factorization methods [15], [16], [47], [49], our two methods also have another important parameter, the rank d. To estimate it, we design a simple rank estimation procedure. Since the observed data may be corrupted by noise/outliers and/or missing entries, our rank estimation procedure combines two key techniques. First, we efficiently compute the k largest singular values of the input matrix (usually k = 100), and then use the basic spectral gap technique for determining the number of clusters [78]. Moreover, the rank estimator for incomplete matrices looks for the index at which the ratio between two consecutive singular values is minimized, as suggested in [79]. We conducted experiments on corrupted matrices to test the performance of our rank estimation procedure, as shown in Fig. 3. The input matrices are corrupted by both sparse outliers and Gaussian noise as described below, where the fraction of sparse outliers varies from 0 to 25%, and the noise factor of the Gaussian noise is varied from 0 to 0.5. It can be seen that our rank estimation procedure is robust to both noise and outliers.

Fig. 3. Histograms of the percentage of estimated ranks for Gaussian noise and outlier corrupted matrices of two sizes (left and right), whose true ranks are 10 and 20, respectively.

6.2 Synthetic Data

We generated the low-rank matrix L ∈ R^(m×n) of rank r as the product PQ^T, where P ∈ R^(m×r) and Q ∈ R^(n×r) are independent matrices whose elements are independent and identically distributed (i.i.d.) random variables sampled from standard Gaussian distributions. The sparse matrix S ∈ R^(m×n) is generated by the following procedure: its support is chosen uniformly at random, and the non-zero entries are i.i.d. random variables sampled uniformly in the interval [−5, 5]. The input matrix is D = L + S + N, where the Gaussian noise is N = nf × randn and nf ≥ 0 is the noise factor.
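The synthetic data described above and the consecutive-singular-value-ratio heuristic of Sec. 6.1 can be sketched together as follows (an illustration written for this transcription; k = 100 and the uniform outlier range [−5, 5] follow the text, while the use of a dense SVD rather than a partial one is a simplification):

```python
import numpy as np

def make_corrupted_matrix(m, n, r, outlier_ratio, nf, rng):
    # D = L + S + N with L = P Q^T low-rank, S sparse outliers in [-5, 5], N Gaussian noise
    L = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
    S = np.zeros((m, n))
    idx = rng.random((m, n)) < outlier_ratio
    S[idx] = rng.uniform(-5.0, 5.0, size=idx.sum())
    N = nf * rng.standard_normal((m, n))
    return L + S + N, L, S

def estimate_rank(D, k=100):
    # look for the index minimizing the ratio of consecutive singular values
    # (the spectral-gap idea of Sec. 6.1); a dense SVD is used here for simplicity
    s = np.linalg.svd(D, compute_uv=False)[: min(k, min(D.shape))]
    ratios = s[1:] / s[:-1]
    return int(np.argmin(ratios)) + 1

rng = np.random.default_rng(3)
D, L, S = make_corrupted_matrix(500, 500, 10, 0.1, 0.1, rng)
print(estimate_rank(D))   # close to the true rank 10 for moderate corruption
```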

Fig. 4. Phase transition plots for (a) RPCA [44], (b) PSVT [45], (c) WNNM [46], (d) Unifying [49], (e) LpSq [1], (f) S+L_1/2 and (g) S+L_2/3 on corrupted matrices of two sizes (top and bottom). The x-axis denotes the matrix rank and the y-axis the corruption ratio; the color magnitude indicates the success ratio [0, 1], and a larger red area means better performance (best viewed in color).

TABLE 3. Comparison of average RSE, F-M and time (seconds) of RPCA [44], PSVT [45], WNNM [46], Unifying [49], LpSq [1], S+L_1/2 and S+L_2/3 on corrupted matrices of different sizes. We highlight the best results in bold and the second best in italics for each of the three performance metrics. Note that WNNM and LpSq could not yield experimental results on the largest problem within 24 hours.

Fig. 5. Comparison of RSE results of WNNM, Unifying, S+L_1/2 and S+L_2/3 on corrupted matrices of two sizes (left and right) with different noise factors.

For quantitative evaluation, we measured the performance of low-rank component recovery by the RSE, and evaluated the accuracy of outlier detection by the F-measure (abbreviated to F-M) as in [17]. The higher the F-M (or the lower the RSE), the better the quality of the recovered results.

6.2.1 Model Comparison

We first compared our methods with RPCA (nuclear norm & ℓ1-norm), PSVT (truncated nuclear norm & ℓ1-norm), WNNM (weighted nuclear norm & ℓ1-norm), LpSq (Schatten-q norm & ℓp-norm), and Unifying (bilinear spectral penalty & ℓ1-norm), where p and q in LpSq are chosen from the range {0.1, 0.2, ..., 1}. A phase transition plot uses color magnitude to depict how likely a certain kind of low-rank matrix can be recovered by these methods for a range of matrix ranks and corruption ratios. If the recovered matrix L has an RSE smaller than 10^(−2), the estimation of both L and S is regarded as successful. Fig. 4 shows the phase transition results of RPCA, PSVT, WNNM, LpSq, Unifying and both of our methods on outlier-corrupted matrices of two sizes, where the corruption ratio varies from 0 to 0.5 with increment 0.05 and the true rank r from 5 to 50 with increment 5. Note that the rank parameter of PSVT, Unifying, and both of our methods is set to d = 1.5r, as suggested in [4], [47]. The results show that both of our methods perform significantly better than the other methods, which justifies the effectiveness of the proposed RPCA models (15) and (16).

To verify the robustness of our methods, the observed matrices are corrupted by both Gaussian noise and outliers, where the noise factor and outlier ratio are set to nf = 0.5 and 20%. The average results (including RSE, F-M, and running time) over 10 independent runs on corrupted matrices of different sizes are reported in Table 3. Note that the rank parameter d of PSVT, Unifying, and both of our methods is computed by our rank estimation procedure. It is clear that both of our methods significantly outperform all the other methods in terms of both RSE and F-M in all settings. The non-convex methods, including PSVT, WNNM, LpSq, Unifying, and both of our methods, consistently perform better than the convex method, RPCA, in terms of both RSE and F-M.
Impressively, both of our methods are much faster than the other methods, and at least 10 times faster than RPCA, PSVT, WNNM, and LpSq when the size of the matrices exceeds 5,000 × 5,000. This shows that our methods are more scalable and have an even greater advantage over existing methods for handling large matrices.

We also report the RSE results of both of our methods on corrupted matrices of two sizes (the smaller one and 1,000 × 1,000) with outlier ratio 15%, as shown in Fig. 5, where the true ranks of those matrices are 10 and 20, and the noise factor ranges from 0.1 to 0.5. In addition, we provide the best results of two baseline methods, WNNM and Unifying. The parameter d of Unifying and of both of our methods is computed via our rank estimation procedure. From the results, we observe that both of our methods are much more robust against Gaussian noise than WNNM and Unifying, and have a much greater advantage over them when the noise level is relatively large.

Fig. 8. Text removal results of different methods: detected text mask (top) and recovered background images (bottom). (a) Input image (top) and original image (bottom); (b) RPCA (F-M: 0.954, RSE: 0.1051); (c) PSVT (F-M: 0.9560, RSE: 0.1007); (d) WNNM (F-M: 0.956, RSE: 0.094); (e) Unifying (F-M: 0.9584, RSE: 0.0976); (f) LpSq (F-M: 0.9665, RSE: 0.1097); (g) S+L_1/2 (F-M: 0.9905, RSE: 0.096); (h) S+L_2/3 (F-M: 0.987).

Fig. 10. Quantitative comparison of different methods in terms of F-measure (left) and running time (right, in seconds and on a logarithmic scale) on five subsequences of surveillance videos (Bootstrap, Hall, Lobby, Mall, WaterSurface).

Fig. 6. Comparison of different methods (LMaFit, RegL1, Unifying, factEN, LpSq, S+L_1/2 and S+L_2/3) on corrupted matrices under varying outlier and missing ratios, in terms of F-measure and RSE: (a) F-measure vs. outlier ratio; (b) RSE vs. outlier ratio; (c) RSE vs. high missing ratio; (d) RSE vs. low missing ratio.

Fig. 7. Comparison of RSE (left) and F-measure (right) of different methods on corrupted matrices vs. running time (seconds).

6.2.2 Comparisons with Other Methods

Figs. 6(a) and 6(b) show the average F-measure and RSE results of different matrix-factorization-based methods on 1,000 × 1,000 matrices with different outlier ratios, where the noise factor is set to 0. For fair comparison, the rank parameter of all these methods is set to d = 1.5r as in [4], [47]. In all cases, RegL1 [16], Unifying [49], factEN [15], and both of our methods perform significantly better than LMaFit [47], which has no regularizer. This empirically verifies the importance of low-rank regularizers, including our bilinear factor matrix norm penalties. Moreover, we report the average RSE results of these matrix-factorization-based methods with outlier ratio 5% and various missing ratios in Figs. 6(c) and 6(d), in which we also present the results of LpSq. One can see that when only a very limited number of observations is available (e.g., 80% missing ratio), both of our methods yield much more accurate solutions than the other methods, including LpSq; when more observations are available, both LpSq and our methods significantly outperform the other methods in terms of RSE.

Finally, we report the performance of all the methods mentioned above on corrupted matrices as running time elapses, as shown in Fig. 7. It is clear that both of our methods obtain significantly more accurate solutions than the other methods within a much shorter running time. Unlike all the other methods, the performance of LMaFit becomes even worse over time, which may be caused by its intrinsic model lacking a regularizer. This again empirically verifies the importance of low-rank regularizers, including our bilinear factor matrix norm penalties, for recovering low-rank matrices.

6.3 Real-world Applications

In this subsection, we apply our methods to solve various
low-level vision problems, e.g., text removal, moving object detection, and image alignment and inpainting.

6.3.1 Text Removal

We first apply our methods to detect text and remove it from the image used in [70].

Fig. 9. Background and foreground separation results of different algorithms on the Bootstrap data set. One frame with missing data (top) and its manual segmentation (bottom) are shown in (a); the results of (b) RegL1 [16], (c) Unifying [49], (d) factEN [15], (e) S+L_1/2 and (f) S+L_2/3 are presented, with the recovered background in the top panel and the segmentation in the bottom panel.

The ground-truth image has rank equal to 10, as shown in Fig. 8(a). The input data are generated by setting 5% of randomly selected pixels as missing entries. For fairness, we set the rank parameter of PSVT [45], Unifying [49] and our methods to 20, and the stopping tolerance to ϵ = 10^(−4) for all these algorithms. The text detection and removal results are shown in Fig. 8, where the text detection accuracy (F-M) and the RSE of the recovered low-rank component are also reported. The results show that both of our methods significantly outperform the other methods, not only visually but also quantitatively. The running times of our two methods and of the Schatten quasi-norm minimization method LpSq [1] are 16 sec, 11 sec and 7,765 sec, respectively, which shows that both of our methods are more than 50 times faster than LpSq.

6.3.2 Moving Object Detection

We test our methods on real surveillance videos for moving object detection and background subtraction, posed as an RPCA plus matrix completion problem. Background modeling is a crucial task for motion segmentation in surveillance videos. A video sequence satisfies the low-rank and sparse structures, because the background of all frames is controlled by a few factors and hence exhibits a low-rank property, while the foreground is detected by identifying spatially localized sparse residuals [6], [17], [40]. We test our methods for object detection and background subtraction on five surveillance videos: the Bootstrap, Hall, Lobby, Mall and WaterSurface databases. The input data are generated by setting 10% of the randomly selected pixels of each frame as missing entries, as shown in Fig. 9(a).

Figs. 9(b)-9(f) show the foreground and background separation results on the Bootstrap data set. We can see that the background can be effectively extracted by RegL1, Unifying, factEN and both of our methods, where their rank parameter is computed via our rank estimation procedure. The decomposition results of both of our methods are visually significantly better than those of RegL1 and factEN, in terms of both the background components and the foreground segmentation. In addition, we also provide the running time and F-measure of the different algorithms on all five data sets, as shown in Fig. 10, from which we observe that both of our methods consistently outperform the other methods in terms of F-measure. Moreover, Unifying and our methods are much faster than RegL1. Although factEN is slightly faster than Unifying and our methods, it usually yields results of poorer quality.

6.3.3 Image Alignment

We also study the performance of our methods on robust image alignment: given n images {I_1, ..., I_n} of an object of interest, the image alignment task aims to align them using a fixed geometric transformation model, such as affine [50].
For this problem, we search for a set of transformations τ = {τ_1, ..., τ_n} and write D∘τ = [vec(I_1∘τ_1), ..., vec(I_n∘τ_n)] ∈ R^(m×n). In order to robustly align the set of linearly correlated images despite sparse outliers, we consider the following double nuclear norm regularized model:

  min_{U,V,S,τ} λ(‖U‖_* + ‖V‖_*) + ‖S‖_ℓ1/2^(1/2),  s.t.  D∘τ = UV^T + S.   (34)

Alternatively, our Frobenius/nuclear hybrid norm penalty can also be used to address the image alignment problem above. We first test both of our methods on the Windows data set used in [50], which contains 16 images of a building taken from various viewpoints by a perspective camera, with occlusions due to tree branches, and report the aligned results of RASL [50], PSVT [45] and our methods in Fig. 11. It is clear that, compared with RASL and PSVT, both of our methods not only robustly align the images and correctly detect and remove the occlusions, but also achieve much better performance in terms of the low-rank components, as shown in Figs. 11(d) and 11(h), which give close-up views of the red boxes in Figs. 11(c) and 11(g), respectively.

6.3.4 Image Inpainting

Finally, we applied the defined D-N and F-N penalties to image inpainting. As shown by Hu et al. [8], images of natural scenes can be viewed as approximately low-rank matrices. Naturally, we consider the following D-N penalty regularized least squares problem:

  min_{U,V,L} λ(‖U‖_* + ‖V‖_*) + (1/2)‖P_Ω(L − D)‖_F^2,  s.t.  L = UV^T.   (35)
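As a rough illustration of how (35) can be attacked in the same bilinear spirit as Algorithm 1, the following sketch alternates proximal-gradient (SVT) steps on the two factors (a PALM-style scheme). It is a simplified stand-in written for this transcription; the paper's actual ADMM algorithms for (35) and its F-N counterpart are given in the Supplementary Materials, so the update order, step sizes and initialization here are assumptions.

```python
import numpy as np

def svt(X, tau):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def dn_inpaint(D, mask, d, lam, n_iter=200, rng=None):
    """PALM-style sketch for min_{U,V} lam*(||U||_* + ||V||_*)
    + 0.5*||P_Omega(U V^T - D)||_F^2, i.e., (35) with L = U V^T eliminated."""
    rng = rng or np.random.default_rng(0)
    m, n = D.shape
    U = rng.standard_normal((m, d)) * 0.1
    V = rng.standard_normal((n, d)) * 0.1
    for _ in range(n_iter):
        R = np.where(mask, U @ V.T - D, 0.0)           # P_Omega(U V^T - D)
        eta_u = 1.0 / (np.linalg.norm(V, 2) ** 2 + 1e-12)
        U = svt(U - eta_u * (R @ V), lam * eta_u)      # proximal-gradient step on U
        R = np.where(mask, U @ V.T - D, 0.0)
        eta_v = 1.0 / (np.linalg.norm(U, 2) ** 2 + 1e-12)
        V = svt(V - eta_v * (R.T @ U), lam * eta_v)    # proximal-gradient step on V
    return U @ V.T
```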

Fig. 11. Alignment results on the Windows data set used in [50]. First row: results generated by RASL [50] (left) and PSVT [45] (right); second row: results generated by our S+L_1/2 (left) and S+L_2/3 (right) methods. (a) and (e): aligned images; (b) and (f): sparse occlusions; (c) and (g): low-rank components; (d) and (h): close-ups (see the red boxes on the left).

The F-N penalty regularized model and the corresponding ADMM algorithms for solving both models are provided in the Supplementary Materials. We compared our methods with one nuclear norm solver [4], one weighted nuclear norm method [7], [46], two truncated nuclear norm methods [8], [45], and one Schatten-q quasi-norm method [19].² Since both of the ADMM algorithms in [8], [45] have very similar performance, as shown in [45], we only report the results of [8]. For fair comparison, we set the same values of the parameters d, ρ and μ_0 for both our methods and [8], [45], e.g., d = 9 as in [8]. Fig. 12 shows the 8 test images and some quantitative results, including average PSNR and running time, of all these methods with 85% random missing pixels. We also show the inpainting results of the different methods for a random mask of 80% missing pixels in Fig. 13 (see the Supplementary Materials for more results with different missing ratios and rank parameters). The results show that both of our methods consistently produce much better PSNR results than the other methods in all settings. As analyzed in [4], [70], our D-N and F-N penalties not only lead to two scalable optimization problems, but also require significantly fewer observations than traditional nuclear norm solvers, e.g., [4]. Moreover, both of our methods are much faster than the other methods; in particular, they are more than 5 times faster than the methods of [19], [8], [46].

² As suggested in [19], we chose the ℓq-norm penalty, where q was chosen from the range {0.1, 0.2, ..., 1}.

Fig. 12. The natural images used in image inpainting [8] (top), and quantitative comparison of inpainting results (bottom).

7 CONCLUSIONS AND DISCUSSIONS

In this paper we defined the double nuclear norm and Frobenius/nuclear hybrid norm penalties, which are in essence the Schatten-1/2 and 2/3 quasi-norms, respectively. To take advantage of the hyper-Laplacian priors of sparse noise/outliers and of the singular values of low-rank components, we proposed two novel tractable bilinear factor matrix norm penalized methods for low-level vision problems. Our experimental results show that both of our methods yield more accurate solutions than original Schatten quasi-norm minimization when the number of observations is very limited, while the solutions obtained by the three methods are almost identical when a sufficient number of observations is available.

An interesting direction for future work is the theoretical analysis of the properties of our bilinear factor matrix norm penalties compared with the nuclear norm and the Schatten quasi-norm.
sufficient for both our models to reliably recover low-rank matrices, although in our experiments we found that our methods perform much better than existing Schatten quasi-norm methods with a limited number of observations. In addition, we are interested in exploring ways to regularize our models with auxiliary information, such as the graph Laplacian [27], [80], [81] and the hyper-Laplacian matrix [82], or the elastic-net [15]. We can also apply our bilinear factor matrix norm penalties to various structured sparse and low-rank problems [5], [17], [8], e.g., corrupted columns [9] and Hankel matrices for image restoration [25].

ACKNOWLEDGMENTS

This work is supported by the Hong Kong GRF. The project is funded by the Research Committee of CUHK. Zhouchen Lin is supported by the National Basic Research Program of China (973 Program), the National Natural Science Foundation (NSF) of China (grant no. 61100), and Qualcomm.

Fig. 13. Image inpainting results of different methods on Image 7 with 80% random missing pixels (best viewed in colors). (a) Ground truth; (b) input image; (c) APGL [43]; (d) TNNR [38]; (e) WNNM [46]; (f) IRNN [19]; (g) D-N; (h) F-N.

REFERENCES

[1] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, Nov. 2006.
[2] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210–227, Feb. 2009.
[3] F. De la Torre and M. J. Black, "A framework for robust subspace learning," Int. J. Comput. Vis., vol. 54, no. 1, 2003.
[4] E. Elhamifar and R. Vidal, "Sparse subspace clustering: Algorithm, theory, and applications," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 11, Nov. 2013.
[5] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, "Robust recovery of subspace structures by low-rank representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, Jan. 2013.
[6] E. J. Candès, X. Li, Y. Ma, and J. Wright, "Robust principal component analysis?" J. ACM, vol. 58, no. 3, pp. 1–37, 2011.
[7] M. Tao and X. Yuan, "Recovering low-rank and sparse components of matrices from incomplete and noisy observations," SIAM J. Optim., vol. 21, no. 1, pp. 57–81, 2011.
[8] T. Zhou and D. Tao, "GoDec: Randomized low-rank & sparse matrix decomposition in noisy case," in Proc. 28th Int. Conf. Mach. Learn., 2011, pp. 33–40.
[9] Y. Chen, H. Xu, C. Caramanis, and S. Sanghavi, "Robust matrix completion with corrupted columns," in Proc. 28th Int. Conf. Mach. Learn., 2011.
[10] D. Krishnan and R. Fergus, "Fast image deconvolution using hyper-Laplacian priors," in Proc. Adv. Neural Inf. Process. Syst., 2009.
[11] S. Mallat, "A theory for multiresolution signal decomposition: The wavelet representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 11, no. 7, Jul. 1989.
[12] L. Luo, J. Yang, J. Qian, and Y. Tai, "Nuclear-L1 norm joint regression for face reconstruction and recognition with mixed noise," Pattern Recogn., vol. 48, 2015.
[13] A. Eriksson and A. van den Hengel, "Efficient computation of robust low-rank matrix approximations in the presence of missing data using the L1 norm," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010.
[14] Q. Ke and T. Kanade, "Robust L1 norm factorization in the presence of outliers and missing data by alternative convex programming," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2005.
[15] E. Kim, M. Lee, and S. Oh, "Elastic-net regularization of singular values for robust subspace learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015.
[16] Y. Zheng, G. Liu, S. Sugimoto, S. Yan, and M. Okutomi, "Practical low-rank matrix approximation under robust L1-norm," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012.
[17] X. Zhou, C. Yang, and W. Yu, "Moving object detection by detecting contiguous outliers in the low-rank representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 3, Mar. 2013.
[18] J. Fan and R. Li, "Variable selection via nonconcave penalized likelihood and its oracle properties," J. Am. Statist. Assoc., vol. 96, no. 456, 2001.
[19] C. Lu, J. Tang, S. Yan, and Z. Lin, "Generalized nonconvex nonsmooth low-rank minimization," in Proc. IEEE Conf. Comput. Vis.
Pattern Recognit., 2014.
[20] C. Zhang, "Nearly unbiased variable selection under minimax concave penalty," Ann. Statist., vol. 38, 2010.
[21] Z. Xu, X. Chang, F. Xu, and H. Zhang, "L1/2 regularization: A thresholding representation theory and a fast solver," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 7, Jul. 2012.
[22] W. Zuo, D. Meng, L. Zhang, X. Feng, and D. Zhang, "A generalized iterated shrinkage algorithm for non-convex sparse coding," in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 217–224.
[23] J. Cai, E. J. Candès, and Z. Shen, "A singular value thresholding algorithm for matrix completion," SIAM J. Optim., vol. 20, no. 4, 2010.
[24] V. Chandrasekaran, S. Sanghavi, P. Parrilo, and A. Willsky, "Rank-sparsity incoherence for matrix decomposition," SIAM J. Optim., vol. 21, 2011.
[25] K. Jin and J. Ye, "Annihilating filter-based low-rank Hankel matrix approach for image inpainting," IEEE Trans. Image Process., vol. 24, no. 11, Nov. 2015.
[26] B. Recht, M. Fazel, and P. Parrilo, "Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization," SIAM Rev., vol. 52, 2010.
[27] N. Shahid, V. Kalofolias, X. Bresson, M. Bronstein, and P. Vandergheynst, "Robust principal component analysis on graphs," in Proc. IEEE Int. Conf. Comput. Vis., 2015.
[28] H. Zhang, Z. Lin, C. Zhang, and E. Chang, "Exact recoverability of robust PCA via outlier pursuit with tight recovery bounds," in Proc. 29th AAAI Conf. Artif. Intell., 2015.
[29] Z. Zhang, A. Ganesh, X. Liang, and Y. Ma, "TILT: Transform invariant low-rank textures," Int. J. Comput. Vis., vol. 99, no. 1, pp. 1–24, 2012.
[30] Z. Gu, M. Shao, L. Li, and Y. Fu, "Discriminative metric: Schatten norm vs. vector norm," in Proc. 21st Int. Conf. Pattern Recog., 2012.
[31] F. Nie, H. Wang, X. Cai, H. Huang, and C. Ding, "Robust matrix completion via joint Schatten p-norm and Lp-norm minimization," in Proc. IEEE Int. Conf. Data Mining, 2012.
[32] M. Lai, Y. Xu, and W. Yin, "Improved iteratively reweighted least squares for unconstrained smoothed ℓp minimization," SIAM J. Numer. Anal., vol. 51, 2013.
[33] C. Lu, Z. Lin, and S. Yan, "Smoothed low rank and sparse matrix recovery by iteratively reweighted least squares minimization," IEEE Trans. Image Process., vol. 24, no. 2, Feb. 2015.
[34] F. Shang, Y. Liu, and J. Cheng, "Scalable algorithms for tractable Schatten quasi-norm minimization," in Proc. 30th AAAI Conf. Artif. Intell., 2016.
[35] G. H. Golub and C. F. Van Loan, Matrix Computations, 4th ed. Johns Hopkins University Press, 2013.

[36] S. Wang, L. Zhang, and Y. Liang, "Nonlocal spectral prior model for low-level vision," in Proc. Asian Conf. Comput. Vis., 2012.
[37] S. Gu, L. Zhang, W. Zuo, and X. Feng, "Weighted nuclear norm minimization with application to image denoising," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014.
[38] Y. Hu, D. Zhang, J. Ye, X. Li, and X. He, "Fast and accurate matrix completion via truncated nuclear norm regularization," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 9, Sep. 2013.
[39] W. Bian, X. Chen, and Y. Ye, "Complexity analysis of interior point algorithms for non-Lipschitz and nonconvex minimization," Math. Program., vol. 149, no. 1, pp. 301–327, 2015.
[40] J. Wright, A. Ganesh, S. Rao, Y. Peng, and Y. Ma, "Robust principal component analysis: Exact recovery of corrupted low-rank matrices by convex optimization," in Proc. Adv. Neural Inf. Process. Syst., 2009.
[41] J. Wright, A. Ganesh, K. Min, and Y. Ma, "Compressive principal component pursuit," Inform. Infer., vol. 2, no. 1, pp. 32–68, 2013.
[42] F. Shang, Y. Liu, H. Tong, J. Cheng, and H. Cheng, "Robust bilinear factorization with missing and grossly corrupted observations," Inform. Sciences, vol. 307, pp. 53–72, 2015.
[43] K.-C. Toh and S. Yun, "An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems," Pac. J. Optim., vol. 6, 2010.
[44] Z. Lin, M. Chen, and L. Wu, "The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices," Univ. Illinois, Urbana-Champaign, Tech. Rep., Mar. 2009.
[45] T. Oh, Y. Tai, J. Bazin, H. Kim, and I. Kweon, "Partial sum minimization of singular values in robust PCA: Algorithm and applications," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 4, Apr. 2016.
[46] S. Gu, Q. Xie, D. Meng, W. Zuo, X. Feng, and L. Zhang, "Weighted nuclear norm minimization and its applications to low level vision," Int. J. Comput. Vis., vol. 121, pp. 183–208, 2017.
[47] Y. Shen, Z. Wen, and Y. Zhang, "Augmented Lagrangian alternating direction method for matrix separation based on low-rank factorization," Optim. Method Softw., vol. 29, 2014.
[48] Y. Liu, L. Jiao, and F. Shang, "A fast tri-factorization method for low-rank matrix recovery and completion," Pattern Recogn., vol. 46, no. 1, pp. 163–173, 2013.
[49] R. Cabral, F. De la Torre, J. Costeira, and A. Bernardino, "Unifying nuclear norm and bilinear factorization approaches for low-rank matrix decomposition," in Proc. IEEE Int. Conf. Comput. Vis., 2013.
[50] Y. Peng, A. Ganesh, J. Wright, W. Xu, and Y. Ma, "RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2233–2246, Nov. 2012.
[51] H. Ji, C. Liu, Z. Shen, and Y. Xu, "Robust video denoising using low rank matrix completion," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010.
[52] D. L. Donoho, "De-noising by soft-thresholding," IEEE Trans. Inform. Theory, vol. 41, no. 3, pp. 613–627, May 1995.
[53] H. Xu, C. Caramanis, and S. Sanghavi, "Robust PCA via outlier pursuit," in Proc. Adv. Neural Inf. Process. Syst., 2010.
[54] Y. Yang, Z. Ma, A. G. Hauptmann, and N. Sebe, "Feature selection for multimedia analysis by sharing information among multiple tasks," IEEE Trans. Multimedia, vol. 15, Apr. 2013.
[55] Z. Ma, Y. Yang, N. Sebe, and A. G. Hauptmann, "Knowledge adaptation with partially shared features for event detection using few exemplars," IEEE Trans. Pattern Anal. Mach.
Intell., vol. 36, no. 9, Sep. 2014.
[56] M. Shao, D. Kit, and Y. Fu, "Generalized transfer subspace learning through low-rank constraint," Int. J. Comput. Vis., vol. 109, no. 1, pp. 74–93, 2014.
[57] Z. Ding, M. Shao, and S. Yan, "Missing modality transfer learning via latent low-rank constraint," IEEE Trans. Image Process., vol. 24, no. 11, Nov. 2015.
[58] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1–122, 2011.
[59] T. Oh, Y. Matsushita, Y. Tai, and I. Kweon, "Fast randomized singular value thresholding for nuclear norm minimization," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015.
[60] R. Larsen, "PROPACK — software for large and sparse SVD calculations," 2005.
[61] J. Cai and S. Osher, "Fast singular value thresholding without singular value decomposition," Methods Anal. Appl., vol. 20, no. 4, 2013.
[62] F. Jiang, M. Oskarsson, and K. Åström, "On the minimal problems of low-rank matrix factorization," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015.
[63] S. Xiao, W. Li, D. Xu, and D. Tao, "FaLRR: A fast low rank representation solver," in Proc. IEEE Int. Conf. Comput. Vis., 2015.
[64] X. Shu, F. Porikli, and N. Ahuja, "Robust orthonormal subspace learning: Efficient recovery of corrupted low-rank matrices," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014.
[65] F. Shang, Y. Liu, J. Cheng, and H. Cheng, "Robust principal component analysis with missing data," in Proc. 23rd ACM Int. Conf. Inf. Knowl. Manag., 2014.
[66] G. Liu and S. Yan, "Active subspace: Toward scalable low-rank learning," Neural Comput., vol. 24, no. 12, 2012.
[67] N. Srebro, J. Rennie, and T. Jaakkola, "Maximum-margin matrix factorization," in Proc. Adv. Neural Inf. Process. Syst., 2004.
[68] J. Feng, H. Xu, and S. Yan, "Online robust PCA via stochastic optimization," in Proc. Adv. Neural Inf. Process. Syst., 2013.
[69] R. Sun and Z.-Q. Luo, "Guaranteed matrix completion via non-convex factorization," IEEE Trans. Inform. Theory, vol. 62, no. 11, Nov. 2016.
[70] F. Shang, Y. Liu, and J. Cheng, "Tractable and scalable Schatten quasi-norm approximations for rank minimization," in Proc. 19th Int. Conf. Artif. Intell. Statist., 2016.
[71] L. Mirsky, "A trace inequality of John von Neumann," Monatsh. Math., vol. 79, pp. 303–306, 1975.
[72] J. von Neumann, "Some matrix-inequalities and metrization of matric-space," Tomsk Univ. Rev., vol. 1, pp. 286–300, 1937.
[73] R. Mazumder, T. Hastie, and R. Tibshirani, "Spectral regularization algorithms for learning large incomplete matrices," J. Mach. Learn. Res., vol. 11, pp. 2287–2322, 2010.
[74] W. Cao, J. Sun, and Z. Xu, "Fast image deconvolution using closed-form thresholding formulas of l_q (q = 1/2, 2/3) regularization," J. Vis. Commun. Image R., vol. 24, no. 1, pp. 31–41, 2013.
[75] J. Bolte, S. Sabach, and M. Teboulle, "Proximal alternating linearized minimization for nonconvex and nonsmooth problems," Math. Program., vol. 146, no. 1, 2014.
[76] M. Hong, Z.-Q. Luo, and M. Razaviyayn, "Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems," SIAM J. Optim., vol. 26, no. 1, pp. 337–364, 2016.
[77] C. Chen, B. He, Y. Ye, and X. Yuan, "The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent," Math. Program., vol. 155, no. 1-2, pp. 57–79, 2016.
[78] U. von Luxburg, "A tutorial on spectral clustering," Stat. Comput., vol. 17, no. 4, 2007.
[79] R. Keshavan, A. Montanari, and S. Oh, "Low-rank matrix completion with noisy observations: A quantitative comparison," in Proc. 47th Conf. Commun. Control Comput., 2009.
[80] Y. Yang, D. Xu, F. Nie, S. Yan, and Y. Zhuang, "Image clustering using local discriminant
models and global integration," IEEE Trans. Image Process., vol. 19, no. 10, Oct. 2010.
[81] Y. Yang, F. Nie, D. Xu, J. Luo, Y. Zhuang, and Y. Pan, "A multimedia retrieval framework based on semi-supervised ranking and relevance feedback," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 4, pp. 723–742, Apr. 2012.
[82] M. Yin, J. Gao, and Z. Lin, "Laplacian regularized low-rank representation and its applications," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 3, Mar. 2016.

Fanhua Shang (M'14) received the Ph.D. degree in Circuits and Systems from Xidian University, Xi'an, China, in 2012. He is currently a Research Associate with the Department of Computer Science and Engineering, The Chinese University of Hong Kong. From 2013 to 2015, he was a Post-Doctoral Research Fellow with the Department of Computer Science and Engineering, The Chinese University of Hong Kong. From 2012 to 2013, he was a Post-Doctoral Research Associate with the Department of Electrical and Computer Engineering, Duke University, Durham, NC, USA. His current research interests include machine learning, data mining, pattern recognition, and computer vision.

Zhi-Quan (Tom) Luo (F'07) received the B.Sc. degree in Applied Mathematics in 1984 from Peking University, Beijing, China. Subsequently, he was selected by a joint committee of the American Mathematical Society and the Society for Industrial and Applied Mathematics to pursue Ph.D. study in the USA. He received the Ph.D. degree in Operations Research in 1989 from the Operations Research Center and the Department of Electrical Engineering and Computer Science, MIT, Cambridge, MA, USA. From 1989 to 2003, Dr. Luo held a faculty position with the Department of Electrical and Computer Engineering, McMaster University, Hamilton, Canada, where he eventually became the department head and held a Canada Research Chair in Information Processing. Since April of 2003, he has been with the Department of Electrical and Computer Engineering at the University of Minnesota, Twin Cities, as a full professor. His research interests lie in the union of optimization algorithms, signal processing and digital communication. Currently, he is with The Chinese University of Hong Kong, Shenzhen, where he is a Professor and serves as the Vice President (Academic). Dr. Luo is a fellow of SIAM. He is a recipient of the 2004, 2009 and 2011 IEEE Signal Processing Society Best Paper Awards, the 2011 EURASIP Best Paper Award and the 2011 ICC Best Paper Award. He was awarded the 2010 Farkas Prize from the INFORMS Optimization Society. Dr. Luo chaired the IEEE Signal Processing Society Technical Committee on Signal Processing for Communications (SPCOM). He has held editorial positions for several international journals, including Journal of Optimization Theory and Applications, SIAM Journal on Optimization, Mathematics of Computation and IEEE TRANSACTIONS ON SIGNAL PROCESSING. In 2014 he was elected to the Royal Society of Canada.

James Cheng is an Assistant Professor with the Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong. His current research interests include distributed computing systems, large-scale network analysis, temporal networks, and big data.

Zhouchen Lin (M'00–SM'08) received the Ph.D. degree in applied mathematics from Peking University in 2000. He is currently a professor with the Key Laboratory of Machine Perception, School of Electronics Engineering and Computer Science, Peking University. His research interests include computer vision, image processing, machine learning, pattern recognition, and numerical optimization. He is an area chair of CVPR 2014/2016, ICCV 2015, and NIPS 2015, and a senior program committee member of AAAI 2016/2017/2018 and IJCAI 2016. He is an associate editor
of the IEEE Transactions on Pattern Analysis and Machine Intelligence and the International Journal of Computer Vision. He is an IAPR Fellow.

Yuanyuan Liu received the Ph.D. degree in Pattern Recognition and Intelligent Systems from Xidian University, Xi'an, China, in 2013. She is currently a Post-Doctoral Research Fellow with the Department of Computer Science and Engineering, The Chinese University of Hong Kong. Prior to that, she was a Post-Doctoral Research Fellow with the Department of Systems Engineering and Engineering Management, CUHK. Her current research interests include image processing, pattern recognition, and machine learning.

Supplementary Materials: Bilinear Factor Matrix Norm Minimization for Robust PCA: Algorithms and Applications

Fanhua Shang, Member, IEEE, James Cheng, Yuanyuan Liu, Zhi-Quan Luo, Fellow, IEEE, and Zhouchen Lin, Senior Member, IEEE

In this supplementary material, we give the detailed proofs of some theorems, lemmas and properties. We also provide the stopping criterion of our algorithms and the details of Algorithm 2. In addition, we present two new ADMM algorithms for the image recovery application and their pseudo-codes, as well as some additional experimental results on both synthetic and real-world datasets.

NOTATIONS

$\mathbb{R}^{l}$ denotes the l-dimensional Euclidean space, and the set of all m×n matrices with real entries is denoted by $\mathbb{R}^{m\times n}$ (without loss of generality, we assume m ≥ n in this paper). $\mathrm{Tr}(X^{T}Y) = \sum_{ij} X_{ij}Y_{ij}$, where $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix. We assume the singular values of $X \in \mathbb{R}^{m\times n}$ are ordered as $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > \sigma_{r+1} = \cdots = \sigma_n = 0$, where $r = \mathrm{rank}(X)$. The SVD of X is then denoted by $X = U\Sigma V^{T}$, where $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$. $I_n$ denotes the identity matrix of size n×n.

Definition 5. For any vector $x \in \mathbb{R}^{l}$, its $l_p$-norm for $0 < p < \infty$ is defined as
$\|x\|_{l_p} = \big(\sum_{i=1}^{l} |x_i|^{p}\big)^{1/p},$
where $x_i$ is the i-th element of x. When p = 1, the $l_1$-norm of x is $\|x\|_{l_1} = \sum_i |x_i|$, which is convex, while the $l_p$-norm of x is a quasi-norm when 0 < p < 1, which is non-convex and violates the triangle inequality. In addition, the $l_2$-norm of x is $\|x\|_{l_2} = \sqrt{\sum_i x_i^2}$. The above definition can be naturally extended from vectors to matrices in the following form:
$\|S\|_{l_p} = \big(\sum_{i,j} |S_{ij}|^{p}\big)^{1/p}.$

Definition 6. The Schatten-p norm ($0 < p < \infty$) of a matrix $X \in \mathbb{R}^{m\times n}$ is defined as follows:
$\|X\|_{S_p} = \big(\sum_{i=1}^{n} \sigma_i^{p}\big)^{1/p},$
where $\sigma_i$ denotes the i-th largest singular value of X. In the following, we list some special cases of the Schatten-p norm ($0 < p < \infty$). When 0 < p < 1, the Schatten-p norm is a quasi-norm, and it is non-convex and violates the triangle inequality.

F. Shang, J. Cheng and Y. Liu are with the Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, NT, Hong Kong ({fhshang, jcheng, yyliu}@cse.cuhk.edu.hk). Z.-Q. Luo is with the Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen, China, and the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455, USA (luozq@cuhk.edu.cn and luozq@umn.edu). Z. Lin is with the Key Laboratory of Machine Perception (MOE), School of EECS, Peking University, Beijing, P.R. China, and the Cooperative Medianet Innovation Center, Shanghai Jiao Tong University, Shanghai 200240, P.R. China (zlin@pku.edu.cn).
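A short NumPy sketch of Definitions 5 and 6 is given below; the helper names are our own, and the sanity checks at the end only use the standard NumPy norms.

```python
import numpy as np

def lp_quasi_norm(x, p):
    """Entrywise l_p (quasi-)norm of a vector or matrix (Definition 5)."""
    return (np.abs(x) ** p).sum() ** (1.0 / p)

def schatten_p_norm(X, p):
    """Schatten-p (quasi-)norm: l_p norm of the singular values (Definition 6)."""
    s = np.linalg.svd(X, compute_uv=False)
    return (s ** p).sum() ** (1.0 / p)

# for p = 1 the Schatten norm is the nuclear norm, for p = 2 the Frobenius norm
X = np.random.default_rng(0).standard_normal((5, 4))
assert np.isclose(schatten_p_norm(X, 1), np.linalg.norm(X, 'nuc'))
assert np.isclose(schatten_p_norm(X, 2), np.linalg.norm(X, 'fro'))
```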

When p = 1, the Schatten-1 norm (also known as the nuclear norm or trace norm) of X is defined as
$\|X\|_{*} = \sum_{i=1}^{n} \sigma_i.$
When p = 2, the Schatten-2 norm is more commonly called the Frobenius norm, defined as
$\|X\|_{F} = \sqrt{\sum_{i,j} X_{ij}^{2}} = \sqrt{\sum_{i=1}^{n} \sigma_i^{2}}$
(note that the Frobenius norm is the entrywise $l_2$-norm of a matrix).

APPENDIX A: PROOF OF LEMMA 3

To prove Lemma 3, we first define the doubly stochastic matrix and give the following lemma.

Definition 7. A square matrix is doubly stochastic if its elements are non-negative real numbers, and the sum of the elements of each row or column is equal to 1.

Lemma 7. Let $P \in \mathbb{R}^{n\times n}$ be a doubly stochastic matrix, and suppose that
$0 \le x_1 \le x_2 \le \cdots \le x_n,\quad y_1 \ge y_2 \ge \cdots \ge y_n \ge 0.$   (36)
Then
$\sum_{i,j=1}^{n} p_{ij}\, x_i y_j \;\ge\; \sum_{k=1}^{n} x_k y_k.$

The proof of Lemma 7 is essentially similar to that of the lemma in [1]; thus we only give a proof sketch.

Proof: Using (36), there exist non-negative numbers $\alpha_i$ and $\beta_j$ for all $1 \le i, j \le n$ such that
$x_k = \sum_{1\le i\le k} \alpha_i,\quad y_k = \sum_{k\le j\le n} \beta_j,\quad k = 1, \ldots, n.$
Let $\delta_{ij}$ denote the Kronecker delta (i.e., $\delta_{ij}=1$ if $i=j$, and $\delta_{ij}=0$ otherwise). Then
$\sum_{k=1}^{n} x_k y_k - \sum_{i,j=1}^{n} p_{ij} x_i y_j = \sum_{i,j=1}^{n} (\delta_{ij} - p_{ij}) x_i y_j = \sum_{1\le r,s\le n} \alpha_r \beta_s \sum_{r\le i\le n,\, 1\le j\le s} (\delta_{ij} - p_{ij}).$
If $r \le s$, by the lemma in [1] and the fact that the columns of P sum to one, we have
$\sum_{r\le i\le n,\, 1\le j\le s} (\delta_{ij} - p_{ij}) = (s-r+1) - \Big(s - \sum_{1\le i<r,\, 1\le j\le s} p_{ij}\Big) = \sum_{1\le i<r,\, 1\le j\le s} p_{ij} - (r-1) \le 0.$
The same result can be obtained in a similar way for $r \ge s$ (using the row sums). Summing over r and s completes the proof.

Proof of Lemma 3:
Proof: Using the properties of the trace, we know that
$\mathrm{Tr}(X^{T}Y) = \sum_{i,j=1}^{n} (U_{ij})^{2}\, \lambda_i \tau_j.$
Note that U is a unitary matrix, i.e., $U^{T}U = UU^{T} = I_n$, which implies that $((U_{ij})^{2})$ is a doubly stochastic matrix. By Lemma 7, we have $\mathrm{Tr}(X^{T}Y) \ge \sum_{i=1}^{n} \lambda_i \tau_i$.
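As a quick numerical illustration of the kind of trace bound involved here, the following sketch (ours, and only an informal check rather than part of the proof) verifies the classical von Neumann trace inequality [72], $\mathrm{Tr}(X^{T}Y) \le \sum_i \sigma_i(X)\sigma_i(Y)$, on random matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
Y = rng.standard_normal((6, 4))

lhs = np.trace(X.T @ Y)
# von Neumann: Tr(X^T Y) <= sum_i sigma_i(X) * sigma_i(Y)
rhs = np.sum(np.linalg.svd(X, compute_uv=False) * np.linalg.svd(Y, compute_uv=False))
print(lhs, "<=", rhs)
assert lhs <= rhs + 1e-10
```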

APPENDIX B: PROOFS OF THEOREMS 1 AND 2

Proof of Theorem 1:
Proof: Using Lemma 3, for any factor matrices $U \in \mathbb{R}^{m\times d}$ and $V \in \mathbb{R}^{n\times d}$ with the constraint $X = UV^{T}$, we have
$\tfrac{1}{2}(\|U\|_{*} + \|V\|_{*}) \ge \|X\|^{1/2}_{S_{1/2}}.$
On the other hand, let $U^{*} = L_X\Sigma_X^{1/2}$ and $V^{*} = R_X\Sigma_X^{1/2}$, where $X = L_X\Sigma_X R_X^{T}$ is the SVD of X as in Lemma 3. Then $X = U^{*}(V^{*})^{T}$ and $\|X\|_{S_{1/2}} = (\|U^{*}\|_{*} + \|V^{*}\|_{*})^{2}/4$. Therefore, we have
$\min_{U,V:\, X=UV^{T}} \Big(\tfrac{\|U\|_{*} + \|V\|_{*}}{2}\Big)^{2} = \|X\|_{S_{1/2}}.$
This completes the proof.

Proof of Theorem 2:
Proof: Using Lemma 4, for any $U \in \mathbb{R}^{m\times d}$ and $V \in \mathbb{R}^{n\times d}$ with the constraint $X = UV^{T}$, we have
$\tfrac{1}{3}(\|U\|^{2}_{F} + 2\|V\|_{*}) \ge \|X\|^{2/3}_{S_{2/3}}.$
On the other hand, let $U^{*} = L_X\Sigma_X^{1/3}$ and $V^{*} = R_X\Sigma_X^{2/3}$, where $X = L_X\Sigma_X R_X^{T}$ is the SVD of X as in Lemma 3. Then $X = U^{*}(V^{*})^{T}$ and $\|X\|_{S_{2/3}} = [(\|U^{*}\|^{2}_{F} + 2\|V^{*}\|_{*})/3]^{3/2}$. Thus, we have
$\min_{U,V:\, X=UV^{T}} \Big[\tfrac{\|U\|^{2}_{F} + 2\|V\|_{*}}{3}\Big]^{3/2} = \|X\|_{S_{2/3}}.$
This completes the proof.

APPENDIX C: PROOF OF PROPERTY 4

Proof: The proof of Property 4 involves some properties of the $l_p$-norm, which we recall as follows. For any vector $x \in \mathbb{R}^{n}$ and $0 < p_2 \le p_1 \le 1$, the following inequalities hold:
$\|x\|_{l_1} \le \|x\|_{l_{p_1}},\qquad \|x\|_{l_{p_1}} \le \|x\|_{l_{p_2}} \le n^{1/p_2 - 1/p_1}\|x\|_{l_{p_1}}.$
Let X be an m×n matrix of rank r, and denote its compact SVD by $X = U_{m\times r}\Sigma_{r\times r}V_{n\times r}^{T}$. By Theorems 1 and 2 and the properties of the $l_p$-norm mentioned above, we have
$\|X\|_{*} = \|\mathrm{diag}(\Sigma_{r\times r})\|_{l_1} \le \|\mathrm{diag}(\Sigma_{r\times r})\|_{l_{1/2}} = \|X\|_{\text{D-N}} \le \mathrm{rank}(X)\,\|X\|_{*},$
$\|X\|_{*} = \|\mathrm{diag}(\Sigma_{r\times r})\|_{l_1} \le \|\mathrm{diag}(\Sigma_{r\times r})\|_{l_{2/3}} = \|X\|_{\text{F-N}}.$
In addition,
$\|X\|_{\text{F-N}} = \|\mathrm{diag}(\Sigma_{r\times r})\|_{l_{2/3}} \le \|\mathrm{diag}(\Sigma_{r\times r})\|_{l_{1/2}} = \|X\|_{\text{D-N}}.$
Therefore, we have
$\|X\|_{*} \le \|X\|_{\text{F-N}} \le \|X\|_{\text{D-N}} \le \mathrm{rank}(X)\,\|X\|_{*}.$
This completes the proof.

APPENDIX D: SOLVING (18) VIA ADMM

Similar to Algorithm 1, we also propose an efficient algorithm based on the alternating direction method of multipliers (ADMM) to solve (18), whose augmented Lagrangian function is given by
$\mathcal{L}_{\mu}(U, V, L, S, \hat{V}, Y_1, Y_2, Y_3) = \lambda\big(\tfrac{1}{2}\|U\|^{2}_{F} + \|\hat{V}\|_{*}\big) + \|P_{\Omega}(S)\|^{2/3}_{l_{2/3}} + \langle Y_1, \hat{V} - V\rangle + \langle Y_2, UV^{T} - L\rangle + \langle Y_3, L + S - D\rangle + \tfrac{\mu}{2}\big(\|UV^{T} - L\|^{2}_{F} + \|\hat{V} - V\|^{2}_{F} + \|L + S - D\|^{2}_{F}\big),$
where $Y_1 \in \mathbb{R}^{n\times d}$, $Y_2 \in \mathbb{R}^{m\times n}$ and $Y_3 \in \mathbb{R}^{m\times n}$ are the matrices of Lagrange multipliers.
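Under the balanced factorizations used in the two proofs above, the identities can be checked numerically. The sketch below (our own, assuming the scalings $U^{*}=L\Sigma^{1/2}, V^{*}=R\Sigma^{1/2}$ and $U^{*}=L\Sigma^{1/3}, V^{*}=R\Sigma^{2/3}$ reconstructed above) verifies that the constructed factors attain the Schatten-1/2 and 2/3 quasi-norms.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 6))   # rank-3 matrix
L, s, Rt = np.linalg.svd(X, full_matrices=False)

# Schatten-1/2: ((||U*||_* + ||V*||_*)/2)^2 with U* = L Sigma^{1/2}, V* = R Sigma^{1/2}
U1, V1 = L * s**0.5, Rt.T * s**0.5
dn = ((np.linalg.norm(U1, 'nuc') + np.linalg.norm(V1, 'nuc')) / 2) ** 2
assert np.isclose(dn, (s**0.5).sum() ** 2)          # equals ||X||_{S_{1/2}}

# Schatten-2/3: ((||U*||_F^2 + 2||V*||_*)/3)^{3/2} with U* = L Sigma^{1/3}, V* = R Sigma^{2/3}
U2, V2 = L * s**(1/3), Rt.T * s**(2/3)
fn = ((np.linalg.norm(U2, 'fro')**2 + 2*np.linalg.norm(V2, 'nuc')) / 3) ** 1.5
assert np.isclose(fn, (s**(2/3)).sum() ** 1.5)       # equals ||X||_{S_{2/3}}
```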

Algorithm 2. ADMM for solving the S+L_{2/3} problem (18)
Input: $D \in \mathbb{R}^{m\times n}$, the given rank d, and λ.
Initialize: $\mu_0$, $\rho > 1$, $k = 0$, and $\epsilon$.
1: while not converged do
2:   while not converged do
3:     Update $U_{k+1}$ and $V_{k+1}$ by (39) and (40), respectively.
4:     Update $\hat{V}_{k+1}$ by $\hat{V}_{k+1} = \mathcal{D}_{\lambda/\mu_k}(V_{k+1} + Y_1^{k}/\mu_k)$.
5:     Update $L_{k+1}$ and $S_{k+1}$ by (43) and the corresponding S-update in the main paper, respectively.
6:   end while // Inner loop
7:   Update the multipliers: $Y_1^{k+1} = Y_1^{k} + \mu_k(V_{k+1} - \hat{V}_{k+1})$, $Y_2^{k+1} = Y_2^{k} + \mu_k(U_{k+1}V_{k+1}^{T} - L_{k+1})$, and $Y_3^{k+1} = Y_3^{k} + \mu_k(L_{k+1} + S_{k+1} - D)$.
8:   Update $\mu_{k+1}$ by $\mu_{k+1} = \rho\mu_k$.
9:   $k \leftarrow k + 1$.
10: end while // Outer loop
Output: $U_{k+1}$ and $V_{k+1}$.

Update of $U_{k+1}$ and $V_{k+1}$: For updating $U_{k+1}$ and $V_{k+1}$, we consider the following optimization problems:
$\min_{U}\ \tfrac{\lambda}{2}\|U\|^{2}_{F} + \tfrac{\mu_k}{2}\|UV_k^{T} - L_k + \mu_k^{-1}Y_2^{k}\|^{2}_{F},$   (37)
$\min_{V}\ \tfrac{\mu_k}{2}\|V - \hat{V}_k + \mu_k^{-1}Y_1^{k}\|^{2}_{F} + \tfrac{\mu_k}{2}\|U_{k+1}V^{T} - L_k + \mu_k^{-1}Y_2^{k}\|^{2}_{F},$   (38)
and their optimal solutions are given by
$U_{k+1} = \mu_k P_k V_k\,(\lambda I_d + \mu_k V_k^{T}V_k)^{-1},$   (39)
$V_{k+1} = \big[\hat{V}_k - \mu_k^{-1}Y_1^{k} + P_k^{T}U_{k+1}\big](I_d + U_{k+1}^{T}U_{k+1})^{-1},$   (40)
where $P_k = L_k - \mu_k^{-1}Y_2^{k}$.

Update of $\hat{V}_{k+1}$: To update $\hat{V}_{k+1}$, we fix the other variables and solve the following optimization problem:
$\min_{\hat{V}}\ \lambda\|\hat{V}\|_{*} + \tfrac{\mu_k}{2}\|\hat{V} - V_{k+1} - \mu_k^{-1}Y_1^{k}\|^{2}_{F}.$   (41)
Similar to the corresponding subproblems in the main paper, the closed-form solution of (41) can also be obtained by the SVT operator [2], defined as follows.

Definition 8. Let Y be a matrix of size m×n (m ≥ n), and let $U_Y\Sigma_Y V_Y^{T}$ be its SVD. Then the singular value thresholding (SVT) operator $\mathcal{D}_{\tau}$ is defined as [2]:
$\mathcal{D}_{\tau}(Y) = U_Y\, \mathcal{S}_{\tau}(\Sigma_Y)\, V_Y^{T},$
where $\mathcal{S}_{\tau}(x) = \max(|x| - \tau, 0)\,\mathrm{sgn}(x)$ is the soft shrinkage operator [3], [4], [5].
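A minimal NumPy sketch of the SVT operator $\mathcal{D}_{\tau}$ and the soft shrinkage operator $\mathcal{S}_{\tau}$ from Definition 8 is given below; the function names are our own.

```python
import numpy as np

def soft_shrink(x, tau):
    """Soft shrinkage operator S_tau(x) = max(|x| - tau, 0) * sign(x)."""
    return np.maximum(np.abs(x) - tau, 0.0) * np.sign(x)

def svt(Y, tau):
    """Singular value thresholding D_tau(Y) = U * S_tau(Sigma) * V^T."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * soft_shrink(s, tau)) @ Vt

# quick sanity check: thresholding with tau = 0 returns Y itself
Y = np.random.default_rng(1).standard_normal((6, 4))
assert np.allclose(svt(Y, 0.0), Y)
```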

Update of $L_{k+1}$: For updating $L_{k+1}$, we consider the following optimization problem:
$\min_{L}\ \|U_{k+1}V_{k+1}^{T} - L + \mu_k^{-1}Y_2^{k}\|^{2}_{F} + \|L + S_k - D + \mu_k^{-1}Y_3^{k}\|^{2}_{F}.$   (42)
Since (42) is a least squares problem, its closed-form solution is given by
$L_{k+1} = \tfrac{1}{2}\big(U_{k+1}V_{k+1}^{T} + \mu_k^{-1}Y_2^{k} - S_k + D - \mu_k^{-1}Y_3^{k}\big).$   (43)
Together with the update scheme of $S_{k+1}$ stated in the main paper, we develop an efficient ADMM algorithm to solve the Frobenius/nuclear hybrid norm penalized RPCA problem (18), as outlined in Algorithm 2.

APPENDIX E: PROOF OF THEOREM 3

In this part, we first prove the boundedness of the multipliers and some variables of Algorithm 1, and then we analyze the convergence of Algorithm 1. To prove the boundedness, we first give the following lemma.

Lemma 8 ([6]). Let $\mathcal{H}$ be a real Hilbert space endowed with an inner product $\langle\cdot,\cdot\rangle$ and a corresponding norm $\|\cdot\|$, and let $y \in \partial\|x\|$, where $\partial f(x)$ denotes the subgradient of f(x). Then $\|y\|^{*} = 1$ if $x \ne 0$, and $\|y\|^{*} \le 1$ if $x = 0$, where $\|\cdot\|^{*}$ is the dual norm of $\|\cdot\|$. For instance, the dual norm of the nuclear norm is the spectral norm $\|\cdot\|_2$, i.e., the largest singular value.

Lemma 9 (Boundedness). Let $Y_1^{k+1} = Y_1^{k} + \mu_k(\hat{U}_{k+1} - U_{k+1})$, $Y_2^{k+1} = Y_2^{k} + \mu_k(\hat{V}_{k+1} - V_{k+1})$, $Y_3^{k+1} = Y_3^{k} + \mu_k(U_{k+1}V_{k+1}^{T} - L_{k+1})$ and $Y_4^{k+1} = Y_4^{k} + \mu_k(L_{k+1} + S_{k+1} - D)$. Suppose that $\mu_k$ is non-decreasing and $\sum_{k=0}^{\infty} \mu_{k+1}/\mu_k^{4/3} < \infty$. Then the sequences $\{U_k, V_k\}$, $\{\hat{U}_k, \hat{V}_k\}$, $\{Y_1^{k}, Y_2^{k}, Y_3^{k}, Y_4^{k}/\sqrt{\mu_{k-1}}\}$, $\{L_k\}$ and $\{S_k\}$ produced by Algorithm 1 are all bounded.

Proof: Let $\mathbf{U}_k := (U_k, V_k)$, $\hat{\mathbf{V}}_k := (\hat{U}_k, \hat{V}_k)$ and $\mathbf{Y}_k := (Y_1^{k}, Y_2^{k}, Y_3^{k}, Y_4^{k})$. By the first-order optimality conditions of the augmented Lagrangian function of (17) with respect to $\hat{U}$ and $\hat{V}$ (i.e., the $\hat{U}$- and $\hat{V}$-subproblems), we have $0 \in \partial_{\hat{U}}\mathcal{L}_{\mu}(\mathbf{U}_{k+1}, L_{k+1}, S_{k+1}, \mathbf{Y}_k)$ and $0 \in \partial_{\hat{V}}\mathcal{L}_{\mu}(\mathbf{U}_{k+1}, L_{k+1}, S_{k+1}, \mathbf{Y}_k)$, i.e.,
$Y_1^{k+1} \in \lambda\partial\|\hat{U}_{k+1}\|_{*}$ and $Y_2^{k+1} \in \lambda\partial\|\hat{V}_{k+1}\|_{*}$,
where $Y_1^{k+1} = \mu_k(\hat{U}_{k+1} - U_{k+1} + Y_1^{k}/\mu_k)$ and $Y_2^{k+1} = \mu_k(\hat{V}_{k+1} - V_{k+1} + Y_2^{k}/\mu_k)$. By Lemma 8, we obtain $\|Y_1^{k+1}\|_2 \le \lambda$ and $\|Y_2^{k+1}\|_2 \le \lambda$, which implies that the sequences $\{Y_1^{k}\}$ and $\{Y_2^{k}\}$ are bounded.

Next we prove the boundedness of $\{Y_4^{k}/\sqrt{\mu_{k-1}}\}$. It is easy to show that the sequence $\{P_{\Omega}^{c}(Y_4^{k+1})\}$ vanishes. Let $A = D - L_{k+1} - \mu_k^{-1}Y_4^{k}$ and $\Phi(S) := \|P_{\Omega}(S)\|^{1/2}_{l_{1/2}}$. To prove the boundedness of $\{Y_4^{k}/\sqrt{\mu_{k-1}}\}$, we consider the following two cases.

Case 1: $|a_{ij}|$ is larger than the thresholding value in Proposition 1, $(i,j) \in \Omega$.
If $|a_{ij}|$ exceeds the thresholding value, then using Proposition 1 we have $[S_{k+1}]_{ij} > 0$. The first-order optimality condition of the S-subproblem implies that $[\nabla\Phi(S_{k+1})]_{ij} + [Y_4^{k+1}]_{ij} = 0$, where $[\nabla\Phi(S_{k+1})]_{ij}$ denotes the gradient of the penalty $\Phi(S)$ at $[S_{k+1}]_{ij}$. Since $\partial\Phi/\partial s_{ij} = \mathrm{sign}(s_{ij})/(2\sqrt{|s_{ij}|})$, we have $|[Y_4^{k+1}]_{ij}| = 1/(2\sqrt{[S_{k+1}]_{ij}})$. Using Proposition 1 (the half-thresholding representation of $[S_{k+1}]_{ij}$ in terms of $a_{ij}$ and $\gamma_k = 2/\mu_k$), one can verify that $|[Y_4^{k+1}]_{ij}|/\sqrt{\mu_k}$ is bounded.

Case 2: $|a_{ij}|$ is no larger than the thresholding value in Proposition 1, $(i,j) \in \Omega$.

If $|a_{ij}|$ is no larger than the thresholding value, then using Proposition 1 we have $[S_{k+1}]_{ij} = 0$. Since
$[Y_4^{k+1}/\mu_k]_{ij} = [L_{k+1} + \mu_k^{-1}Y_4^{k} - D + S_{k+1}]_{ij} = [L_{k+1} + \mu_k^{-1}Y_4^{k} - D]_{ij} = -a_{ij},$
it follows that $|[Y_4^{k+1}]_{ij}| = \mu_k|a_{ij}|$ is bounded by a constant multiple of $\mu_k^{1/3}$, and hence
$\|Y_4^{k+1}\|_F/\sqrt{\mu_k} = \|P_{\Omega}(Y_4^{k+1})\|_F/\sqrt{\mu_k} \le C\sqrt{|\Omega|}$
for a constant C independent of k. Therefore, $\{Y_4^{k}/\sqrt{\mu_{k-1}}\}$ is bounded.

By the iterative scheme of Algorithm 1, we have
$\mathcal{L}_{\mu_k}(\mathbf{U}_{k+1}, L_{k+1}, S_{k+1}, \mathbf{Y}_k) \le \mathcal{L}_{\mu_k}(\mathbf{U}_{k+1}, L_k, S_k, \mathbf{Y}_k) \le \mathcal{L}_{\mu_k}(\mathbf{U}_k, L_k, S_k, \mathbf{Y}_k) = \mathcal{L}_{\mu_{k-1}}(\mathbf{U}_k, L_k, S_k, \mathbf{Y}_{k-1}) + \alpha_k\sum_{j=1}^{3}\|Y_j^{k} - Y_j^{k-1}\|^{2}_{F} + \beta_k\|Y_4^{k} - Y_4^{k-1}\|^{2}_{F},$
where $\alpha_k = (\mu_{k-1}+\mu_k)/(2\mu_{k-1}^{2})$ and $\beta_k = (\mu_{k-1}+\mu_k)/(2\mu_{k-1}^{4/3})$, and the above equality holds due to the definition of the augmented Lagrangian function $\mathcal{L}_{\mu}(U, V, L, S, \hat{U}, \hat{V}, \{Y_i\})$. Since $\mu_k$ is non-decreasing and $\sum_{k=1}^{\infty}\mu_k/\mu_{k-1}^{4/3} < \infty$, both $\sum_{k=1}^{\infty}\alpha_k$ and $\sum_{k=1}^{\infty}\beta_k$ are finite.

Since $\{\|Y_4^{k}\|_F/\sqrt{\mu_{k-1}}\}$ is bounded, and $\mu_k = \rho\mu_{k-1}$ with $\rho > 1$, $\{\|Y_4^{k-1}\|_F/\sqrt{\mu_{k-1}}\}$ is also bounded, which implies that the weighted differences $\beta_k\|Y_4^{k} - Y_4^{k-1}\|^{2}_{F}$ are summable. Then $\{\mathcal{L}_{\mu_k}(\mathbf{U}_{k+1}, L_{k+1}, S_{k+1}, \mathbf{Y}_k)\}$ is upper-bounded due to the boundedness of the sequences of all the Lagrange multipliers. Consequently,
$\lambda(\|\hat{U}_k\|_{*} + \|\hat{V}_k\|_{*}) + \|P_{\Omega}(S_k)\|^{1/2}_{l_{1/2}} = \mathcal{L}_{\mu_{k-1}}(\mathbf{U}_k, L_k, S_k, \mathbf{Y}_{k-1}) - \sum_{i=1}^{4}\frac{\|Y_i^{k}\|^{2}_{F} - \|Y_i^{k-1}\|^{2}_{F}}{2\mu_{k-1}}$
is upper-bounded (the equality again follows from the definition of $\mathcal{L}_{\mu}$); thus $\{S_k\}$, $\{\hat{U}_k\}$ and $\{\hat{V}_k\}$ are all bounded. Similarly, by $U_k = \hat{U}_k - [Y_1^{k} - Y_1^{k-1}]/\mu_{k-1}$, $V_k = \hat{V}_k - [Y_2^{k} - Y_2^{k-1}]/\mu_{k-1}$, $L_k = U_kV_k^{T} - [Y_3^{k} - Y_3^{k-1}]/\mu_{k-1}$ and the boundedness of $\{\hat{U}_k\}$, $\{\hat{V}_k\}$, $\{Y_i^{k}\}_{i=1,2,3}$ and $\{Y_4^{k}/\sqrt{\mu_{k-1}}\}$, the sequences $\{U_k\}$, $\{V_k\}$ and $\{L_k\}$ are also bounded. This means that each bounded sequence must have a convergent subsequence, due to the Bolzano–Weierstrass theorem.

Proof of Theorem 3:
Proof: (I) $\hat{U}_{k+1} - U_{k+1} = [Y_1^{k+1} - Y_1^{k}]/\mu_k$, $\hat{V}_{k+1} - V_{k+1} = [Y_2^{k+1} - Y_2^{k}]/\mu_k$, $U_{k+1}V_{k+1}^{T} - L_{k+1} = [Y_3^{k+1} - Y_3^{k}]/\mu_k$ and $L_{k+1} + S_{k+1} - D = [Y_4^{k+1} - Y_4^{k}]/\mu_k$. Due to the boundedness of $\{Y_1^{k}\}$, $\{Y_2^{k}\}$, $\{Y_3^{k}\}$ and $\{Y_4^{k}/\sqrt{\mu_{k-1}}\}$, the non-decreasing property of $\{\mu_k\}$, and $\sum_{k=0}^{\infty}\mu_{k+1}/\mu_k^{4/3} < \infty$, we have
$\sum_{k=0}^{\infty}\|\hat{U}_{k+1} - U_{k+1}\|_F < \infty,\quad \sum_{k=0}^{\infty}\|\hat{V}_{k+1} - V_{k+1}\|_F < \infty,\quad \sum_{k=0}^{\infty}\|L_{k+1} - U_{k+1}V_{k+1}^{T}\|_F < \infty,\quad \sum_{k=0}^{\infty}\|L_{k+1} + S_{k+1} - D\|_F < \infty,$
which implies that
$\lim_{k\to\infty}\|\hat{U}_{k+1} - U_{k+1}\|_F = 0$, $\lim_{k\to\infty}\|\hat{V}_{k+1} - V_{k+1}\|_F = 0$, $\lim_{k\to\infty}\|L_{k+1} - U_{k+1}V_{k+1}^{T}\|_F = 0$, and $\lim_{k\to\infty}\|L_{k+1} + S_{k+1} - D\|_F = 0$.
Hence, $\{U_k, V_k, \hat{U}_k, \hat{V}_k, L_k, S_k\}$ approaches a feasible solution. In the following, we will prove that the sequences $\{U_k\}$ and $\{V_k\}$ are Cauchy sequences.

Using $Y_1^{k} = Y_1^{k-1} + \mu_{k-1}(\hat{U}_k - U_k)$, $Y_2^{k} = Y_2^{k-1} + \mu_{k-1}(\hat{V}_k - V_k)$ and $Y_3^{k} = Y_3^{k-1} + \mu_{k-1}(U_kV_k^{T} - L_k)$, the first-order optimality conditions of (19) and (20) with respect to U and V can be written as follows:
$(U_{k+1}V_k^{T} - L_k + Y_3^{k}/\mu_k)V_k + U_{k+1} - \hat{U}_k - Y_1^{k}/\mu_k = 0,$   (44)
$(V_{k+1}U_{k+1}^{T} - L_k^{T} + (Y_3^{k})^{T}/\mu_k)U_{k+1} + V_{k+1} - \hat{V}_k - Y_2^{k}/\mu_k = 0.$   (45)
By (44), and using the multiplier updates above, we obtain
$U_{k+1} - U_k = \Big[\frac{Y_1^{k} - Y_1^{k-1}}{\mu_{k-1}} + \frac{Y_1^{k}}{\mu_k} + \Big(\frac{Y_3^{k-1} - Y_3^{k}}{\mu_{k-1}} - \frac{Y_3^{k}}{\mu_k}\Big)V_k\Big](V_k^{T}V_k + I_d)^{-1},$
and by (45), analogously,
$V_{k+1} - V_k = \Big[\frac{Y_2^{k} - Y_2^{k-1}}{\mu_{k-1}} + \frac{Y_2^{k}}{\mu_k} + V_k(U_k - U_{k+1})^{T}U_{k+1} - \Big(\frac{Y_3^{k} - Y_3^{k-1}}{\mu_{k-1}} + \frac{Y_3^{k}}{\mu_k}\Big)^{T}U_{k+1}\Big](I_d + U_{k+1}^{T}U_{k+1})^{-1}.$
Recalling that $\sum_{k=0}^{\infty}\mu_{k+1}/\mu_k^{4/3} < \infty$ and that all multiplier sequences are bounded (Lemma 9), we therefore have
$\sum_{k=0}^{\infty}\|U_{k+1} - U_k\|_F \le \sum_{k=0}^{\infty}\frac{\vartheta_1}{\mu_k} < \infty,$
$\sum_{k=0}^{\infty}\|V_{k+1} - V_k\|_F \le \sum_{k=0}^{\infty}\Big(\vartheta_2\|U_{k+1} - U_k\|_F + \frac{\vartheta_3}{\mu_k}\Big) < \infty,$
where the constants $\vartheta_1$, $\vartheta_2$ and $\vartheta_3$ are defined as
$\vartheta_1 = \max_k\big\{\big[\rho(\|Y_1^{k} - Y_1^{k-1}\|_F + \|Y_1^{k}\|_F) + \rho(\|Y_3^{k-1} - Y_3^{k}\|_F + \|Y_3^{k}\|_F)\|V_k\|_F\big]\,\|(V_k^{T}V_k + I_d)^{-1}\|_F\big\},$
$\vartheta_2 = \max_k\big\{\|V_k\|_F\,\|U_{k+1}\|_F\,\|(U_{k+1}^{T}U_{k+1} + I_d)^{-1}\|_F\big\},$
$\vartheta_3 = \max_k\big\{\big[\rho(\|Y_2^{k} - Y_2^{k-1}\|_F + \|Y_2^{k}\|_F) + \rho(\|Y_3^{k-1} - Y_3^{k}\|_F + \|Y_3^{k}\|_F)\|U_{k+1}\|_F\big]\,\|(U_{k+1}^{T}U_{k+1} + I_d)^{-1}\|_F\big\},$
all of which are finite by the boundedness results above.

Consequently, both $\{U_k\}$ and $\{V_k\}$ are convergent sequences. Moreover, it is not difficult to verify that $\{U_k\}$ and $\{V_k\}$ are both Cauchy sequences. Similarly, $\{\hat{U}_k\}$, $\{\hat{V}_k\}$, $\{S_k\}$ and $\{L_k\}$ are also Cauchy sequences. Practically, the stopping criterion of Algorithm 1 is satisfied within a finite number of iterations.

(II) Let $(U^{\star}, V^{\star}, \hat{U}^{\star}, \hat{V}^{\star}, L^{\star}, S^{\star})$ be a critical point of (17), and $\Phi(S) = \|P_{\Omega}(S)\|^{1/2}_{l_{1/2}}$. Applying Fermat's rule, as in [7], to the S-subproblem, we then obtain
$0 \in \lambda\partial\|\hat{U}^{\star}\|_{*} + Y_1^{\star}$ and $0 \in \lambda\partial\|\hat{V}^{\star}\|_{*} + Y_2^{\star}$,
$0 \in \partial\Phi(S^{\star}) + P_{\Omega}(Y_4^{\star})$ and $P_{\Omega}^{c}(Y_4^{\star}) = 0$,
$L^{\star} = U^{\star}(V^{\star})^{T}$, $\hat{U}^{\star} = U^{\star}$, $\hat{V}^{\star} = V^{\star}$, $L^{\star} + S^{\star} = D$.
Applying Fermat's rule to the S-subproblem at iteration k, we have
$0 \in \partial\Phi(S_{k+1}) + P_{\Omega}(Y_4^{k+1}),\quad P_{\Omega}^{c}(Y_4^{k+1}) = 0.$   (46)
In addition, the first-order optimality conditions for the $\hat{U}$- and $\hat{V}$-subproblems are given by
$0 \in \lambda\partial\|\hat{U}_{k+1}\|_{*} + Y_1^{k+1},\quad 0 \in \lambda\partial\|\hat{V}_{k+1}\|_{*} + Y_2^{k+1}.$   (47)
Since $\{U_k\}$, $\{V_k\}$, $\{\hat{U}_k\}$, $\{\hat{V}_k\}$, $\{L_k\}$ and $\{S_k\}$ are Cauchy sequences, then $\lim_k\|U_{k+1} - U_k\| = 0$, $\lim_k\|V_{k+1} - V_k\| = 0$, $\lim_k\|\hat{U}_{k+1} - \hat{U}_k\| = 0$, $\lim_k\|\hat{V}_{k+1} - \hat{V}_k\| = 0$, $\lim_k\|L_{k+1} - L_k\| = 0$, and $\lim_k\|S_{k+1} - S_k\| = 0$. Let $U^{\star}$, $V^{\star}$, $\hat{U}^{\star}$, $\hat{V}^{\star}$, $S^{\star}$ and $L^{\star}$ be their limit points, respectively. Together with the results in (I), we have $U^{\star} = \hat{U}^{\star}$, $V^{\star} = \hat{V}^{\star}$, $L^{\star} = U^{\star}(V^{\star})^{T}$ and $L^{\star} + S^{\star} = D$. Using (46) and (47), the following holds:
$0 \in \lambda\partial\|\hat{U}^{\star}\|_{*} + Y_1^{\star}$ and $0 \in \lambda\partial\|\hat{V}^{\star}\|_{*} + Y_2^{\star}$,
$0 \in \partial\Phi(S^{\star}) + P_{\Omega}(Y_4^{\star})$ and $P_{\Omega}^{c}(Y_4^{\star}) = 0$,
$L^{\star} = U^{\star}(V^{\star})^{T}$, $\hat{U}^{\star} = U^{\star}$, $\hat{V}^{\star} = V^{\star}$, $L^{\star} + S^{\star} = D$.
Therefore, any accumulation point $\{U^{\star}, V^{\star}, \hat{U}^{\star}, \hat{V}^{\star}, L^{\star}, S^{\star}\}$ of the sequence $\{U_k, V_k, \hat{U}_k, \hat{V}_k, L_k, S_k\}$ generated by Algorithm 1 satisfies the KKT conditions of problem (17). That is, the sequence asymptotically satisfies the KKT conditions of (17). In particular, whenever the sequence $\{U_k, V_k, L_k, S_k\}$ converges, it converges to a critical point of (15). This completes the proof.

APPENDIX F: STOPPING CRITERION

For problem (15), the KKT conditions are
$0 \in \lambda\partial\|U^{\star}\|_{*} + Y^{\star}V^{\star}$ and $0 \in \lambda\partial\|V^{\star}\|_{*} + (Y^{\star})^{T}U^{\star}$,
$0 \in \partial\Phi(S^{\star}) + P_{\Omega}(\hat{Y}^{\star})$ and $P_{\Omega}^{c}(\hat{Y}^{\star}) = 0$,   (48)
$L^{\star} = U^{\star}(V^{\star})^{T}$, $P_{\Omega}(L^{\star}) + P_{\Omega}(S^{\star}) = P_{\Omega}(D)$.
Using (48), we have
$Y^{\star} \in -\lambda\,\partial\|U^{\star}\|_{*}\,(V^{\star})^{\dagger}$ and $Y^{\star} \in -[(U^{\star})^{T}]^{\dagger}\,\lambda\,(\partial\|V^{\star}\|_{*})^{T},$   (49)
where $V^{\dagger}$ is the pseudo-inverse of V. The two conditions hold simultaneously only if $\partial\|U^{\star}\|_{*}\,(V^{\star})^{\dagger} = [(U^{\star})^{T}]^{\dagger}\,(\partial\|V^{\star}\|_{*})^{T}$. Recalling the equivalence relationship between (15) and (17), the KKT conditions for (17) are given by
$0 \in \lambda\partial\|\hat{U}^{\star}\|_{*} + Y_1^{\star}$ and $0 \in \lambda\partial\|\hat{V}^{\star}\|_{*} + Y_2^{\star}$,
$0 \in \partial\Phi(S^{\star}) + P_{\Omega}(\hat{Y}^{\star})$ and $P_{\Omega}^{c}(\hat{Y}^{\star}) = 0$,
$L^{\star} = U^{\star}(V^{\star})^{T}$, $\hat{U}^{\star} = U^{\star}$, $\hat{V}^{\star} = V^{\star}$, $L^{\star} + S^{\star} = D$.
Hence, we use the following conditions as the stopping criteria for Algorithm 1:
$\max\{\epsilon_1/\|D\|_F,\ \epsilon_2\} < \epsilon,$   (50)
where $\epsilon_1$ is the maximum of the feasibility residuals $\|U_{k+1}V_{k+1}^{T} - L_{k+1}\|_F$, $\|L_{k+1} + S_{k+1} - D\|_F$ and the residual associated with (49), and $\epsilon_2 = \max\{\|\hat{U}_k - U_k\|_F/\|U_k\|_F,\ \|\hat{V}_k - V_k\|_F/\|V_k\|_F\}$.
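A small NumPy helper in the spirit of the relative stopping test (50) (feasibility residuals scaled by $\|D\|_F$ together with the relative gaps of the auxiliary variables) might look as follows; the variable names are our own, and the residual of (49) is omitted in this simplified sketch.

```python
import numpy as np

def stopping_criterion(U, V, U_hat, V_hat, L, S, D, tol=1e-5):
    """Simplified relative KKT-style stopping test in the spirit of (50)."""
    eps1 = max(np.linalg.norm(U @ V.T - L, 'fro'),
               np.linalg.norm(L + S - D, 'fro'))
    eps2 = max(np.linalg.norm(U_hat - U, 'fro') / max(np.linalg.norm(U, 'fro'), 1e-12),
               np.linalg.norm(V_hat - V, 'fro') / max(np.linalg.norm(V, 'fro'), 1e-12))
    return max(eps1 / np.linalg.norm(D, 'fro'), eps2) < tol
```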

APPENDIX G: ALGORITHMS FOR IMAGE RECOVERY

In this part, we propose two efficient ADMM algorithms, as outlined in Algorithms 3 and 4, to solve the following D-N and F-N penalty regularized least squares problems for matrix completion:
$\min_{U,V,L}\ \lambda(\|U\|_{*} + \|V\|_{*}) + \tfrac{1}{2}\|P_{\Omega}(L) - P_{\Omega}(D)\|^{2}_{F},\quad \text{s.t.}\ L = UV^{T},$   (51)
$\min_{U,V,L}\ \lambda\big(\tfrac{1}{2}\|U\|^{2}_{F} + \|V\|_{*}\big) + \tfrac{1}{2}\|P_{\Omega}(L) - P_{\Omega}(D)\|^{2}_{F},\quad \text{s.t.}\ L = UV^{T}.$   (52)

Algorithm 3. Solving the image recovery problem (51) via ADMM
Input: $P_{\Omega}(D)$, the given rank d, and λ.
Initialize: $\mu_0 = 10^{-4}$, $\mu_{\max} = 10^{20}$, $\rho > 1$, $k = 0$, and $\epsilon$.
1: while not converged do
2:   $U_{k+1} = \big[(L_k + Y_3^{k}/\mu_k)V_k + \hat{U}_k - Y_1^{k}/\mu_k\big](V_k^{T}V_k + I)^{-1}$
3:   $V_{k+1} = \big[(L_k + Y_3^{k}/\mu_k)^{T}U_{k+1} + \hat{V}_k - Y_2^{k}/\mu_k\big](U_{k+1}^{T}U_{k+1} + I)^{-1}$
4:   $\hat{U}_{k+1} = \mathcal{D}_{\lambda/\mu_k}(U_{k+1} + Y_1^{k}/\mu_k)$, and $\hat{V}_{k+1} = \mathcal{D}_{\lambda/\mu_k}(V_{k+1} + Y_2^{k}/\mu_k)$
5:   $L_{k+1} = P_{\Omega}\Big(\frac{D + \mu_k U_{k+1}V_{k+1}^{T} - Y_3^{k}}{1 + \mu_k}\Big) + P_{\Omega}^{c}\big(U_{k+1}V_{k+1}^{T} - Y_3^{k}/\mu_k\big)$
6:   $Y_1^{k+1} = Y_1^{k} + \mu_k(U_{k+1} - \hat{U}_{k+1})$, $Y_2^{k+1} = Y_2^{k} + \mu_k(V_{k+1} - \hat{V}_{k+1})$, and $Y_3^{k+1} = Y_3^{k} + \mu_k(L_{k+1} - U_{k+1}V_{k+1}^{T})$
7:   $\mu_{k+1} = \min(\rho\mu_k, \mu_{\max})$
8:   $k \leftarrow k + 1$
9: end while
Output: $U_{k+1}$ and $V_{k+1}$.

Similar to (17) and (18), we also introduce the matrices $\hat{U}$ and $\hat{V}$ as auxiliary variables for (51) and obtain the following equivalent formulation:
$\min_{U,V,\hat{U},\hat{V},L}\ \lambda(\|\hat{U}\|_{*} + \|\hat{V}\|_{*}) + \tfrac{1}{2}\|P_{\Omega}(L) - P_{\Omega}(D)\|^{2}_{F},\quad \text{s.t.}\ L = UV^{T},\ U = \hat{U},\ V = \hat{V}.$   (53)
The augmented Lagrangian function of (53) is
$\mathcal{L}_{\mu} = \lambda(\|\hat{U}\|_{*} + \|\hat{V}\|_{*}) + \tfrac{1}{2}\|P_{\Omega}(L) - P_{\Omega}(D)\|^{2}_{F} + \langle Y_1, U - \hat{U}\rangle + \langle Y_2, V - \hat{V}\rangle + \langle Y_3, L - UV^{T}\rangle + \tfrac{\mu}{2}\big(\|U - \hat{U}\|^{2}_{F} + \|V - \hat{V}\|^{2}_{F} + \|L - UV^{T}\|^{2}_{F}\big),$
where $Y_i$ (i = 1, 2, 3) are the matrices of Lagrange multipliers.

Updating $U_{k+1}$ and $V_{k+1}$: By fixing the other variables at their latest values, and removing (or adding) terms that do not depend on U or V, the optimization problems with respect to U and V are formulated as follows:
$\min_{U}\ \|U - \hat{U}_k + Y_1^{k}/\mu_k\|^{2}_{F} + \|L_k - UV_k^{T} + Y_3^{k}/\mu_k\|^{2}_{F},$   (54)
$\min_{V}\ \|V - \hat{V}_k + Y_2^{k}/\mu_k\|^{2}_{F} + \|L_k - U_{k+1}V^{T} + Y_3^{k}/\mu_k\|^{2}_{F}.$   (55)
Since both (54) and (55) are smooth convex optimization problems, their closed-form solutions are given by
$U_{k+1} = \big[(L_k + Y_3^{k}/\mu_k)V_k + \hat{U}_k - Y_1^{k}/\mu_k\big](V_k^{T}V_k + I)^{-1},$   (56)
$V_{k+1} = \big[(L_k + Y_3^{k}/\mu_k)^{T}U_{k+1} + \hat{V}_k - Y_2^{k}/\mu_k\big](U_{k+1}^{T}U_{k+1} + I)^{-1}.$   (57)
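To make the closed-form updates (56) and (57) concrete, here is a small NumPy sketch; the function and variable names are our own, and d is the rank parameter.

```python
import numpy as np

def update_U(L, V, U_hat, Y1, Y3, mu):
    """Closed-form U-update (56): least-squares solution of the U-subproblem."""
    d = V.shape[1]
    rhs = (L + Y3 / mu) @ V + U_hat - Y1 / mu
    return rhs @ np.linalg.inv(V.T @ V + np.eye(d))

def update_V(L, U_new, V_hat, Y2, Y3, mu):
    """Closed-form V-update (57)."""
    d = U_new.shape[1]
    rhs = (L + Y3 / mu).T @ U_new + V_hat - Y2 / mu
    return rhs @ np.linalg.inv(U_new.T @ U_new + np.eye(d))
```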

Algorithm 4. Solving the image recovery problem (52) via ADMM
Input: $P_{\Omega}(D)$, the given rank d, and λ.
Initialize: $\mu_0 = 10^{-4}$, $\mu_{\max} = 10^{20}$, $\rho > 1$, $k = 0$, and $\epsilon$.
1: while not converged do
2:   $U_{k+1} = (\mu_k L_k + Y_2^{k})V_k(\mu_k V_k^{T}V_k + \lambda I)^{-1}$
3:   $V_{k+1} = \big[(L_k + Y_2^{k}/\mu_k)^{T}U_{k+1} + \hat{V}_k - Y_1^{k}/\mu_k\big](U_{k+1}^{T}U_{k+1} + I)^{-1}$
4:   $\hat{V}_{k+1} = \mathcal{D}_{\lambda/\mu_k}(V_{k+1} + Y_1^{k}/\mu_k)$
5:   $L_{k+1} = P_{\Omega}\Big(\frac{D + \mu_k U_{k+1}V_{k+1}^{T} - Y_2^{k}}{1 + \mu_k}\Big) + P_{\Omega}^{c}\big(U_{k+1}V_{k+1}^{T} - Y_2^{k}/\mu_k\big)$
6:   $Y_1^{k+1} = Y_1^{k} + \mu_k(V_{k+1} - \hat{V}_{k+1})$ and $Y_2^{k+1} = Y_2^{k} + \mu_k(L_{k+1} - U_{k+1}V_{k+1}^{T})$
7:   $\mu_{k+1} = \min(\rho\mu_k, \mu_{\max})$
8:   $k \leftarrow k + 1$
9: end while
Output: $U_{k+1}$ and $V_{k+1}$.

Updating $\hat{U}_{k+1}$ and $\hat{V}_{k+1}$: By keeping all other variables fixed, $\hat{U}_{k+1}$ is updated by solving the following problem:
$\min_{\hat{U}}\ \lambda\|\hat{U}\|_{*} + \tfrac{\mu_k}{2}\|U_{k+1} - \hat{U} + Y_1^{k}/\mu_k\|^{2}_{F}.$   (58)
To solve (58), the SVT operator [2] is applied:
$\hat{U}_{k+1} = \mathcal{D}_{\lambda/\mu_k}(U_{k+1} + Y_1^{k}/\mu_k).$   (59)
Similarly, $\hat{V}_{k+1}$ is given by
$\hat{V}_{k+1} = \mathcal{D}_{\lambda/\mu_k}(V_{k+1} + Y_2^{k}/\mu_k).$   (60)

Updating $L_{k+1}$: By fixing all other variables, the optimal $L_{k+1}$ is the solution to the following problem:
$\min_{L}\ \tfrac{1}{2}\|P_{\Omega}(L) - P_{\Omega}(D)\|^{2}_{F} + \tfrac{\mu_k}{2}\|L - U_{k+1}V_{k+1}^{T} + Y_3^{k}/\mu_k\|^{2}_{F}.$   (61)
Since (61) is a smooth convex optimization problem, it is easy to show that its optimal solution is
$L_{k+1} = P_{\Omega}\Big(\frac{D + \mu_k U_{k+1}V_{k+1}^{T} - Y_3^{k}}{1 + \mu_k}\Big) + P_{\Omega}^{c}\big(U_{k+1}V_{k+1}^{T} - Y_3^{k}/\mu_k\big),$   (62)
where $P_{\Omega}^{c}$ is the complementary operator of $P_{\Omega}$, i.e., $P_{\Omega}^{c}(D)_{ij} = 0$ if $(i,j) \in \Omega$, and $P_{\Omega}^{c}(D)_{ij} = D_{ij}$ otherwise. Based on the description above, we develop an efficient ADMM algorithm for solving the double nuclear norm minimization problem (51), as outlined in Algorithm 3. Similarly, we also present an efficient ADMM algorithm to solve (52), as outlined in Algorithm 4.

APPENDIX H: MORE EXPERIMENTAL RESULTS

In this paper, we compared both our methods with the state-of-the-art methods, such as LMaFit [8], RegL1 [9], Unifying [10], factEN [11], RPCA [6], PSVT [12], WNNM [13], and LpSq [14]. The Matlab code of the proposed methods can be downloaded online.
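Putting the pieces of Algorithm 3 together, a compact NumPy sketch of the ADMM loop for the D-N inpainting problem (51) could look as follows. This is our own illustrative implementation under the update rules reconstructed above, not the authors' released code, and the parameter values and initialization are assumptions.

```python
import numpy as np

def svt(Y, tau):
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def dn_inpainting_admm(D, mask, d, lam=1.0, mu=1e-4, rho=1.1,
                       mu_max=1e20, max_iter=200, tol=1e-5):
    """ADMM sketch for problem (51): min lam(||U_hat||_* + ||V_hat||_*)
    + 0.5||P_Omega(L - D)||_F^2, s.t. L = U V^T, U = U_hat, V = V_hat."""
    m, n = D.shape
    rng = np.random.default_rng(0)
    U, V = rng.standard_normal((m, d)), rng.standard_normal((n, d))
    U_hat, V_hat, L = U.copy(), V.copy(), mask * D
    Y1, Y2, Y3 = np.zeros((m, d)), np.zeros((n, d)), np.zeros((m, n))
    I = np.eye(d)
    for _ in range(max_iter):
        U = ((L + Y3 / mu) @ V + U_hat - Y1 / mu) @ np.linalg.inv(V.T @ V + I)    # (56)
        V = ((L + Y3 / mu).T @ U + V_hat - Y2 / mu) @ np.linalg.inv(U.T @ U + I)  # (57)
        U_hat = svt(U + Y1 / mu, lam / mu)                                        # (59)
        V_hat = svt(V + Y2 / mu, lam / mu)                                        # (60)
        UV = U @ V.T
        L = mask * ((D + mu * UV - Y3) / (1 + mu)) + (1 - mask) * (UV - Y3 / mu)  # (62)
        Y1 += mu * (U - U_hat)
        Y2 += mu * (V - V_hat)
        Y3 += mu * (L - UV)
        if np.linalg.norm(L - UV, 'fro') / np.linalg.norm(D, 'fro') < tol:
            break
        mu = min(rho * mu, mu_max)
    return U, V, L
```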

Fig. 1. Convergence behavior of our S+L_{1/2} and S+L_{2/3} methods for the cases of matrix rank 5, 10 and 20. (a) Stop criterion: S+L_{1/2} (left) and S+L_{2/3} (right); (b) RSE: S+L_{1/2} (left) and S+L_{2/3} (right).

Fig. 2. Comparison of RSE results on corrupted matrices with different rank parameters (a) and regularization parameters (b).

Convergence Behavior. Fig. 1 illustrates the evolution of the relative squared error (RSE), i.e., $\|L - \hat{L}\|_F/\|L\|_F$, and of the stop criterion over the iterations on corrupted matrices of size 1,000×1,000 with outlier ratio 5%. From the results, it is clear that both the stopping tolerance and RSE values of our two methods decrease quickly, and that they converge within only a small number of iterations, usually within 50 iterations.

Robustness. Like the other non-convex methods such as PSVT and Unifying, the most important parameter of our methods is the rank parameter d. To verify the robustness of our methods with respect to d, we report the RSE results of PSVT, Unifying and our methods on corrupted matrices with outlier ratio 10% in Fig. 2(a), in which we also present the results of the baseline method, LpSq [14]. It is clear that both our methods are much more robust than PSVT and Unifying, and consistently yield much better solutions than the other methods in all settings. To verify the robustness of both our methods with respect to another important parameter, the regularization parameter λ, we also report the RSE results of our methods on corrupted matrices with outlier ratio 10% in Fig. 2(b). Note that the rank parameter of both our methods is computed by our rank estimation procedure. From the results, one can see that both our methods demonstrate very robust performance over a wide range of the regularization parameter, e.g., from $10^{-4}$ to $10^{0}$.

Text Removal. We report the text removal results of our methods with rank parameters varying from 10 to 40, as shown in Fig. 3, where the rank of the original image is 10. We also present the results of the baseline method, LpSq [14]. The results show that our methods significantly outperform the other methods in terms of RSE and F-measure, and that they are much more robust than Unifying with respect to the rank parameter.
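The evaluation metric used above is the relative squared error; a one-line NumPy helper (our own) is:

```python
import numpy as np

def rse(L_est, L_true):
    """Relative squared error RSE = ||L_est - L_true||_F / ||L_true||_F."""
    return np.linalg.norm(L_est - L_true, 'fro') / np.linalg.norm(L_true, 'fro')
```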

TABLE 1
Information of the surveillance video sequences used in our experiments.
Bootstrap: [120, 160], 1000 frames, crowded scene.
Hall: [144, 176], 1000 frames, crowded scene.
Lobby: [128, 160], 1000 frames, dynamic foreground.
Mall: [256, 320], 600 frames, crowded scene.
WaterSurface: [128, 160], 500 frames, dynamic background.

Fig. 3. The text removal results, including RSE (left) and F-measure (right), of LpSq [14], Unifying [10] and our methods with different rank parameters.

Moving Object Detection. We present the detailed descriptions of five surveillance video sequences, Bootstrap, Hall, Lobby, Mall and WaterSurface, in Table 1. Moreover, Fig. 4 illustrates the foreground and background separation results on the Hall, Mall, Lobby and WaterSurface data sets.

Image Inpainting. In this part, we first report the average PSNR results of the two proposed methods (i.e., D-N and F-N) with different ratios of random missing pixels, from 95% to 80%, as shown in Fig. 5. Since the methods in [15], [16] are very slow, we only present the average inpainting results of APGL [17] and WNNM [13], both of which use the fast SVD strategy and need to compute only a partial SVD instead of the full one; thus, APGL and WNNM are usually much faster than the methods in [15], [16]. Considering that only a small fraction of pixels are randomly selected, we conducted 50 independent runs and report the average PSNR and standard deviation (std). The results show that both our methods consistently outperform APGL [17] and WNNM [13] in all the settings. This experiment shows that both our methods have an even greater advantage over existing methods when the number of observed pixels is very limited, e.g., 5% observed pixels. As suggested in [16], we set d = 9 for our two methods and TNNR [16]. To evaluate the robustness of our two methods with respect to their rank parameter, we report the average PSNR and standard deviation of the two proposed methods with the rank parameter d varying from 7 to 15, as shown in Fig. 6. Moreover, we also present the average inpainting results of TNNR [16] over 50 independent runs. It is clear that the two proposed methods are much more robust than TNNR with respect to the rank parameter.

REFERENCES

[1] L. Mirsky, "A trace inequality of John von Neumann," Monatsh. Math., vol. 79, pp. 303–306, 1975.
[2] J. Cai, E. J. Candès, and Z. Shen, "A singular value thresholding algorithm for matrix completion," SIAM J. Optim., vol. 20, no. 4, 2010.
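PSNR is the inpainting quality metric reported above; a small NumPy helper (ours, assuming 8-bit images scaled to [0, 255]) is:

```python
import numpy as np

def psnr(x, x_ref, peak=255.0):
    """Peak signal-to-noise ratio (dB) between a recovered image and the ground truth."""
    mse = np.mean((np.asarray(x, float) - np.asarray(x_ref, float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```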

[3] I. Daubechies, M. Defrise, and C. De Mol, "An iterative thresholding algorithm for linear inverse problems with a sparsity constraint," Commun. Pure Appl. Math., vol. 57, no. 11, 2004.
[4] D. L. Donoho, "De-noising by soft-thresholding," IEEE Trans. Inform. Theory, vol. 41, no. 3, pp. 613–627, May 1995.
[5] E. T. Hale, W. Yin, and Y. Zhang, "Fixed-point continuation for l1-minimization: Methodology and convergence," SIAM J. Optim., vol. 19, 2008.
[6] Z. Lin, M. Chen, and L. Wu, "The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices," Univ. Illinois, Urbana-Champaign, Tech. Rep., Mar. 2009.
[7] F. Wang, W. Cao, and Z. Xu, "Convergence of multi-block Bregman ADMM for nonconvex composite problems," arXiv preprint, 2015.
[8] Y. Shen, Z. Wen, and Y. Zhang, "Augmented Lagrangian alternating direction method for matrix separation based on low-rank factorization," Optim. Method Softw., vol. 29, 2014.
[9] Y. Zheng, G. Liu, S. Sugimoto, S. Yan, and M. Okutomi, "Practical low-rank matrix approximation under robust L1-norm," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012.
[10] R. Cabral, F. De la Torre, J. Costeira, and A. Bernardino, "Unifying nuclear norm and bilinear factorization approaches for low-rank matrix decomposition," in Proc. IEEE Int. Conf. Comput. Vis., 2013.
[11] E. Kim, M. Lee, and S. Oh, "Elastic-net regularization of singular values for robust subspace learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015.
[12] T. Oh, Y. Tai, J. Bazin, H. Kim, and I. Kweon, "Partial sum minimization of singular values in robust PCA: Algorithm and applications," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 4, Apr. 2016.
[13] S. Gu, Q. Xie, D. Meng, W. Zuo, X. Feng, and L. Zhang, "Weighted nuclear norm minimization and its applications to low level vision," Int. J. Comput. Vis., vol. 121, pp. 183–208, 2017.
[14] F. Nie, H. Wang, X. Cai, H. Huang, and C. Ding, "Robust matrix completion via joint Schatten p-norm and Lp-norm minimization," in Proc. IEEE Int. Conf. Data Mining, 2012.
[15] C. Lu, J. Tang, S. Yan, and Z. Lin, "Generalized nonconvex nonsmooth low-rank minimization," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014.
[16] Y. Hu, D. Zhang, J. Ye, X. Li, and X. He, "Fast and accurate matrix completion via truncated nuclear norm regularization," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 9, Sep. 2013.
[17] K.-C. Toh and S. Yun, "An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems," Pac. J. Optim., vol. 6, 2010.

Fig. 4. Background and foreground separation results of different algorithms on the Hall, Mall, Lobby and WaterSurface data sets. One frame with missing data from each sequence (top) and its manual segmentation (bottom) are shown in (a). The results of the different algorithms are presented in (b)–(f): (b) RegL1 [9], (c) Unifying [10], (d) factEN [11], (e) S+L_{1/2}, (f) S+L_{2/3}. For each method, the top panel is the recovered background and the bottom panel is the segmentation.


More information

sparse and low-rank tensor recovery Cubic-Sketching

sparse and low-rank tensor recovery Cubic-Sketching Sparse and Low-Ran Tensor Recovery via Cubic-Setching Guang Cheng Department of Statistics Purdue University www.science.purdue.edu/bigdata CCAM@Purdue Math Oct. 27, 2017 Joint wor with Botao Hao and Anru

More information

MAT 585: Johnson-Lindenstrauss, Group testing, and Compressed Sensing

MAT 585: Johnson-Lindenstrauss, Group testing, and Compressed Sensing MAT 585: Johnson-Lindenstrauss, Group testing, and Compressed Sensing Afonso S. Bandeira April 9, 2015 1 The Johnson-Lindenstrauss Lemma Suppose one has n points, X = {x 1,..., x n }, in R d with d very

More information

HIGH dimensional data analysis is a widespread problem

HIGH dimensional data analysis is a widespread problem IEEE TRANSATIONS ON YBERNETIS 0.09/TYB.08.883566 l 0 -Motivated Low-Ran Sparse Subspace lustering Maria Brbić and Ivica Kopriva, Senior Member, IEEE arxiv:8.06580v [cs.lg] 7 Dec 08 Abstract In many applications,

More information

Lecture 5 : Projections

Lecture 5 : Projections Lecture 5 : Projections EE227C. Lecturer: Professor Martin Wainwright. Scribe: Alvin Wan Up until now, we have seen convergence rates of unconstrained gradient descent. Now, we consider a constrained minimization

More information

Minimizing the Difference of L 1 and L 2 Norms with Applications

Minimizing the Difference of L 1 and L 2 Norms with Applications 1/36 Minimizing the Difference of L 1 and L 2 Norms with Department of Mathematical Sciences University of Texas Dallas May 31, 2017 Partially supported by NSF DMS 1522786 2/36 Outline 1 A nonconvex approach:

More information

GROUP-SPARSE SUBSPACE CLUSTERING WITH MISSING DATA

GROUP-SPARSE SUBSPACE CLUSTERING WITH MISSING DATA GROUP-SPARSE SUBSPACE CLUSTERING WITH MISSING DATA D Pimentel-Alarcón 1, L Balzano 2, R Marcia 3, R Nowak 1, R Willett 1 1 University of Wisconsin - Madison, 2 University of Michigan - Ann Arbor, 3 University

More information

Robust Principal Component Pursuit via Alternating Minimization Scheme on Matrix Manifolds

Robust Principal Component Pursuit via Alternating Minimization Scheme on Matrix Manifolds Robust Principal Component Pursuit via Alternating Minimization Scheme on Matrix Manifolds Tao Wu Institute for Mathematics and Scientific Computing Karl-Franzens-University of Graz joint work with Prof.

More information

Reweighted Nuclear Norm Minimization with Application to System Identification

Reweighted Nuclear Norm Minimization with Application to System Identification Reweighted Nuclear Norm Minimization with Application to System Identification Karthi Mohan and Maryam Fazel Abstract The matrix ran minimization problem consists of finding a matrix of minimum ran that

More information

Elaine T. Hale, Wotao Yin, Yin Zhang

Elaine T. Hale, Wotao Yin, Yin Zhang , Wotao Yin, Yin Zhang Department of Computational and Applied Mathematics Rice University McMaster University, ICCOPT II-MOPTA 2007 August 13, 2007 1 with Noise 2 3 4 1 with Noise 2 3 4 1 with Noise 2

More information

A Randomized Approach for Crowdsourcing in the Presence of Multiple Views

A Randomized Approach for Crowdsourcing in the Presence of Multiple Views A Randomized Approach for Crowdsourcing in the Presence of Multiple Views Presenter: Yao Zhou joint work with: Jingrui He - 1 - Roadmap Motivation Proposed framework: M2VW Experimental results Conclusion

More information

Proximal methods. S. Villa. October 7, 2014

Proximal methods. S. Villa. October 7, 2014 Proximal methods S. Villa October 7, 2014 1 Review of the basics Often machine learning problems require the solution of minimization problems. For instance, the ERM algorithm requires to solve a problem

More information

Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms

Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms Adrien Todeschini Inria Bordeaux JdS 2014, Rennes Aug. 2014 Joint work with François Caron (Univ. Oxford), Marie

More information

Low-Rank Factorization Models for Matrix Completion and Matrix Separation

Low-Rank Factorization Models for Matrix Completion and Matrix Separation for Matrix Completion and Matrix Separation Joint work with Wotao Yin, Yin Zhang and Shen Yuan IPAM, UCLA Oct. 5, 2010 Low rank minimization problems Matrix completion: find a low-rank matrix W R m n so

More information

Proximal-like contraction methods for monotone variational inequalities in a unified framework

Proximal-like contraction methods for monotone variational inequalities in a unified framework Proximal-like contraction methods for monotone variational inequalities in a unified framework Bingsheng He 1 Li-Zhi Liao 2 Xiang Wang Department of Mathematics, Nanjing University, Nanjing, 210093, China

More information

EE 381V: Large Scale Optimization Fall Lecture 24 April 11

EE 381V: Large Scale Optimization Fall Lecture 24 April 11 EE 381V: Large Scale Optimization Fall 2012 Lecture 24 April 11 Lecturer: Caramanis & Sanghavi Scribe: Tao Huang 24.1 Review In past classes, we studied the problem of sparsity. Sparsity problem is that

More information

Grassmann Averages for Scalable Robust PCA Supplementary Material

Grassmann Averages for Scalable Robust PCA Supplementary Material Grassmann Averages for Scalable Robust PCA Supplementary Material Søren Hauberg DTU Compute Lyngby, Denmark sohau@dtu.dk Aasa Feragen DIKU and MPIs Tübingen Denmark and Germany aasa@diku.dk Michael J.

More information

An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss

An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss arxiv:1811.04545v1 [stat.co] 12 Nov 2018 Cheng Wang School of Mathematical Sciences, Shanghai Jiao

More information

arxiv: v2 [stat.ml] 1 Jul 2013

arxiv: v2 [stat.ml] 1 Jul 2013 A Counterexample for the Validity of Using Nuclear Norm as a Convex Surrogate of Rank Hongyang Zhang, Zhouchen Lin, and Chao Zhang Key Lab. of Machine Perception (MOE), School of EECS Peking University,

More information

Donald Goldfarb IEOR Department Columbia University UCLA Mathematics Department Distinguished Lecture Series May 17 19, 2016

Donald Goldfarb IEOR Department Columbia University UCLA Mathematics Department Distinguished Lecture Series May 17 19, 2016 Optimization for Tensor Models Donald Goldfarb IEOR Department Columbia University UCLA Mathematics Department Distinguished Lecture Series May 17 19, 2016 1 Tensors Matrix Tensor: higher-order matrix

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Low-rank matrix recovery via convex relaxations Yuejie Chi Department of Electrical and Computer Engineering Spring

More information

Multisets mixture learning-based ellipse detection

Multisets mixture learning-based ellipse detection Pattern Recognition 39 (6) 731 735 Rapid and brief communication Multisets mixture learning-based ellipse detection Zhi-Yong Liu a,b, Hong Qiao a, Lei Xu b, www.elsevier.com/locate/patcog a Key Lab of

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2

MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 1 Ridge Regression Ridge regression and the Lasso are two forms of regularized

More information

A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices

A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices Ryota Tomioka 1, Taiji Suzuki 1, Masashi Sugiyama 2, Hisashi Kashima 1 1 The University of Tokyo 2 Tokyo Institute of Technology 2010-06-22

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

PRINCIPAL Component Analysis (PCA) is a fundamental approach

PRINCIPAL Component Analysis (PCA) is a fundamental approach IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE Tensor Robust Principal Component Analysis with A New Tensor Nuclear Norm Canyi Lu, Jiashi Feng, Yudong Chen, Wei Liu, Member, IEEE, Zhouchen

More information

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER 2011 7255 On the Performance of Sparse Recovery Via `p-minimization (0 p 1) Meng Wang, Student Member, IEEE, Weiyu Xu, and Ao Tang, Senior

More information

A Counterexample for the Validity of Using Nuclear Norm as a Convex Surrogate of Rank

A Counterexample for the Validity of Using Nuclear Norm as a Convex Surrogate of Rank A Counterexample for the Validity of Using Nuclear Norm as a Convex Surrogate of Rank Hongyang Zhang, Zhouchen Lin, and Chao Zhang Key Lab. of Machine Perception (MOE), School of EECS Peking University,

More information

Pre-weighted Matching Pursuit Algorithms for Sparse Recovery

Pre-weighted Matching Pursuit Algorithms for Sparse Recovery Journal of Information & Computational Science 11:9 (214) 2933 2939 June 1, 214 Available at http://www.joics.com Pre-weighted Matching Pursuit Algorithms for Sparse Recovery Jingfei He, Guiling Sun, Jie

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring

More information

Network Localization via Schatten Quasi-Norm Minimization

Network Localization via Schatten Quasi-Norm Minimization Network Localization via Schatten Quasi-Norm Minimization Anthony Man-Cho So Department of Systems Engineering & Engineering Management The Chinese University of Hong Kong (Joint Work with Senshan Ji Kam-Fung

More information

arxiv: v1 [stat.ml] 1 Mar 2015

arxiv: v1 [stat.ml] 1 Mar 2015 Matrix Completion with Noisy Entries and Outliers Raymond K. W. Wong 1 and Thomas C. M. Lee 2 arxiv:1503.00214v1 [stat.ml] 1 Mar 2015 1 Department of Statistics, Iowa State University 2 Department of Statistics,

More information

An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems

An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems Kim-Chuan Toh Sangwoon Yun March 27, 2009; Revised, Nov 11, 2009 Abstract The affine rank minimization

More information

Optimal spectral shrinkage and PCA with heteroscedastic noise

Optimal spectral shrinkage and PCA with heteroscedastic noise Optimal spectral shrinage and PCA with heteroscedastic noise William Leeb and Elad Romanov Abstract This paper studies the related problems of denoising, covariance estimation, and principal component

More information

Efficient Computation of Robust Low-Rank Matrix Approximations in the Presence of Missing Data using the L 1 Norm

Efficient Computation of Robust Low-Rank Matrix Approximations in the Presence of Missing Data using the L 1 Norm Efficient Computation of Robust Low-Rank Matrix Approximations in the Presence of Missing Data using the L 1 Norm Anders Eriksson, Anton van den Hengel School of Computer Science University of Adelaide,

More information

Linearized Alternating Direction Method: Two Blocks and Multiple Blocks. Zhouchen Lin 林宙辰北京大学

Linearized Alternating Direction Method: Two Blocks and Multiple Blocks. Zhouchen Lin 林宙辰北京大学 Linearized Alternating Direction Method: Two Blocks and Multiple Blocks Zhouchen Lin 林宙辰北京大学 Dec. 3, 014 Outline Alternating Direction Method (ADM) Linearized Alternating Direction Method (LADM) Two Blocks

More information

Low Rank Matrix Completion Formulation and Algorithm

Low Rank Matrix Completion Formulation and Algorithm 1 2 Low Rank Matrix Completion and Algorithm Jian Zhang Department of Computer Science, ETH Zurich zhangjianthu@gmail.com March 25, 2014 Movie Rating 1 2 Critic A 5 5 Critic B 6 5 Jian 9 8 Kind Guy B 9

More information

New Coherence and RIP Analysis for Weak. Orthogonal Matching Pursuit

New Coherence and RIP Analysis for Weak. Orthogonal Matching Pursuit New Coherence and RIP Analysis for Wea 1 Orthogonal Matching Pursuit Mingrui Yang, Member, IEEE, and Fran de Hoog arxiv:1405.3354v1 [cs.it] 14 May 2014 Abstract In this paper we define a new coherence

More information

Multi-stage convex relaxation approach for low-rank structured PSD matrix recovery

Multi-stage convex relaxation approach for low-rank structured PSD matrix recovery Multi-stage convex relaxation approach for low-rank structured PSD matrix recovery Department of Mathematics & Risk Management Institute National University of Singapore (Based on a joint work with Shujun

More information

8.1 Concentration inequality for Gaussian random matrix (cont d)

8.1 Concentration inequality for Gaussian random matrix (cont d) MGMT 69: Topics in High-dimensional Data Analysis Falll 26 Lecture 8: Spectral clustering and Laplacian matrices Lecturer: Jiaming Xu Scribe: Hyun-Ju Oh and Taotao He, October 4, 26 Outline Concentration

More information

Least Sparsity of p-norm based Optimization Problems with p > 1

Least Sparsity of p-norm based Optimization Problems with p > 1 Least Sparsity of p-norm based Optimization Problems with p > Jinglai Shen and Seyedahmad Mousavi Original version: July, 07; Revision: February, 08 Abstract Motivated by l p -optimization arising from

More information

GoDec: Randomized Low-rank & Sparse Matrix Decomposition in Noisy Case

GoDec: Randomized Low-rank & Sparse Matrix Decomposition in Noisy Case ianyi Zhou IANYI.ZHOU@SUDEN.US.EDU.AU Dacheng ao DACHENG.AO@US.EDU.AU Centre for Quantum Computation & Intelligent Systems, FEI, University of echnology, Sydney, NSW 2007, Australia Abstract Low-rank and

More information

Robust Component Analysis via HQ Minimization

Robust Component Analysis via HQ Minimization Robust Component Analysis via HQ Minimization Ran He, Wei-shi Zheng and Liang Wang 0-08-6 Outline Overview Half-quadratic minimization principal component analysis Robust principal component analysis Robust

More information

Low-Rank Tensor Completion by Truncated Nuclear Norm Regularization

Low-Rank Tensor Completion by Truncated Nuclear Norm Regularization Low-Rank Tensor Completion by Truncated Nuclear Norm Regularization Shengke Xue, Wenyuan Qiu, Fan Liu, and Xinyu Jin College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou,

More information

Conditional Gradient Algorithms for Rank-One Matrix Approximations with a Sparsity Constraint

Conditional Gradient Algorithms for Rank-One Matrix Approximations with a Sparsity Constraint Conditional Gradient Algorithms for Rank-One Matrix Approximations with a Sparsity Constraint Marc Teboulle School of Mathematical Sciences Tel Aviv University Joint work with Ronny Luss Optimization and

More information

Optimization for Compressed Sensing

Optimization for Compressed Sensing Optimization for Compressed Sensing Robert J. Vanderbei 2014 March 21 Dept. of Industrial & Systems Engineering University of Florida http://www.princeton.edu/ rvdb Lasso Regression The problem is to solve

More information

Signal Recovery from Permuted Observations

Signal Recovery from Permuted Observations EE381V Course Project Signal Recovery from Permuted Observations 1 Problem Shanshan Wu (sw33323) May 8th, 2015 We start with the following problem: let s R n be an unknown n-dimensional real-valued signal,

More information

Latent Variable Models and EM algorithm

Latent Variable Models and EM algorithm Latent Variable Models and EM algorithm SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic 3.1 Clustering and Mixture Modelling K-means and hierarchical clustering are non-probabilistic

More information

Low-rank Matrix Completion with Noisy Observations: a Quantitative Comparison

Low-rank Matrix Completion with Noisy Observations: a Quantitative Comparison Low-rank Matrix Completion with Noisy Observations: a Quantitative Comparison Raghunandan H. Keshavan, Andrea Montanari and Sewoong Oh Electrical Engineering and Statistics Department Stanford University,

More information

Recovery of Low-Rank Plus Compressed Sparse Matrices with Application to Unveiling Traffic Anomalies

Recovery of Low-Rank Plus Compressed Sparse Matrices with Application to Unveiling Traffic Anomalies July 12, 212 Recovery of Low-Rank Plus Compressed Sparse Matrices with Application to Unveiling Traffic Anomalies Morteza Mardani Dept. of ECE, University of Minnesota, Minneapolis, MN 55455 Acknowledgments:

More information

arxiv: v1 [math.st] 10 Sep 2015

arxiv: v1 [math.st] 10 Sep 2015 Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees Department of Statistics Yudong Chen Martin J. Wainwright, Department of Electrical Engineering and

More information

L 2,1 Norm and its Applications

L 2,1 Norm and its Applications L 2, Norm and its Applications Yale Chang Introduction According to the structure of the constraints, the sparsity can be obtained from three types of regularizers for different purposes.. Flat Sparsity.

More information

Spectral k-support Norm Regularization

Spectral k-support Norm Regularization Spectral k-support Norm Regularization Andrew McDonald Department of Computer Science, UCL (Joint work with Massimiliano Pontil and Dimitris Stamos) 25 March, 2015 1 / 19 Problem: Matrix Completion Goal:

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Numerical Linear Algebra Background Cho-Jui Hsieh UC Davis May 15, 2018 Linear Algebra Background Vectors A vector has a direction and a magnitude

More information

Linear Algebra Massoud Malek

Linear Algebra Massoud Malek CSUEB Linear Algebra Massoud Malek Inner Product and Normed Space In all that follows, the n n identity matrix is denoted by I n, the n n zero matrix by Z n, and the zero vector by θ n An inner product

More information

User s Guide for LMaFit: Low-rank Matrix Fitting

User s Guide for LMaFit: Low-rank Matrix Fitting User s Guide for LMaFit: Low-rank Matrix Fitting Yin Zhang Department of CAAM Rice University, Houston, Texas, 77005 (CAAM Technical Report TR09-28) (Versions beta-1: August 23, 2009) Abstract This User

More information

Non-negative matrix factorization with fixed row and column sums

Non-negative matrix factorization with fixed row and column sums Available online at www.sciencedirect.com Linear Algebra and its Applications 9 (8) 5 www.elsevier.com/locate/laa Non-negative matrix factorization with fixed row and column sums Ngoc-Diep Ho, Paul Van

More information

15 Singular Value Decomposition

15 Singular Value Decomposition 15 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information

Relaxed Majorization-Minimization for Non-smooth and Non-convex Optimization

Relaxed Majorization-Minimization for Non-smooth and Non-convex Optimization Relaxed Majorization-Minimization for Non-smooth and Non-convex Optimization Chen Xu 1, Zhouchen Lin 1,2,, Zhenyu Zhao 3, Hongbin Zha 1 1 Key Laboratory of Machine Perception (MOE), School of EECS, Peking

More information

Bindel, Fall 2009 Matrix Computations (CS 6210) Week 8: Friday, Oct 17

Bindel, Fall 2009 Matrix Computations (CS 6210) Week 8: Friday, Oct 17 Logistics Week 8: Friday, Oct 17 1. HW 3 errata: in Problem 1, I meant to say p i < i, not that p i is strictly ascending my apologies. You would want p i > i if you were simply forming the matrices and

More information

Dense Error Correction for Low-Rank Matrices via Principal Component Pursuit

Dense Error Correction for Low-Rank Matrices via Principal Component Pursuit Dense Error Correction for Low-Rank Matrices via Principal Component Pursuit Arvind Ganesh, John Wright, Xiaodong Li, Emmanuel J. Candès, and Yi Ma, Microsoft Research Asia, Beijing, P.R.C Dept. of Electrical

More information