FAST FIRST-ORDER METHODS FOR STABLE PRINCIPAL COMPONENT PURSUIT

Size: px
Start display at page:

Download "FAST FIRST-ORDER METHODS FOR STABLE PRINCIPAL COMPONENT PURSUIT"

Transcription

1 FAST FIRST-ORDER METHODS FOR STABLE PRINCIPAL COMPONENT PURSUIT N. S. AYBAT, D. GOLDFARB, AND G. IYENGAR Abstract. The stable principal component pursuit SPCP problem is a non-smooth convex optimization problem, the solution of which has been shown both in theory and in practice to enable one to recover the low ran and sparse components of a matrix whose elements have been corrupted by Gaussian noise. In this paper, we first show how several existing fast first-order methods can be applied to this problem very efficiently. Specifically, we show that the subproblems that arise when applying optimal gradient methods of Nesterov, alternating linearization methods and alternating direction augmented Lagrangian methods to the SPCP problem either have closed-form solutions or have solutions that can be obtained with very modest effort. Later, we develop a new first order algorithm, NSA, based on partial variable splitting. All but one of the methods analyzed require at least one of the non-smooth terms in the objective function to be smoothed and obtain an ɛ-optimal solution to the SPCP problem in O1/ɛ iterations. NSA, which wors directly with the fully non-smooth objective function, is proved to be convergent under mild conditions on the sequence of parameters it uses. Our preliminary computational tests show that the latter method, NSA, although its complexity is not nown, is the fastest among the four algorithms described and substantially outperforms ASALM, the only existing method for the SPCP problem. To best of our nowledge, an algorithm for the SPCP problem that has O1/ɛ iteration complexity and has a per iteration complexity equal to that of a singular value decomposition is given for the first time. 1. Introduction. In [, 1], it was shown that when the data matrix D R m n is of the form D = X 0 + S 0, where X 0 is a low-ran matrix, i.e. ranx 0 min{m, n}, and S 0 is a sparse matrix, i.e. S 0 0 mn. 0 counts the number of nonzero elements of its argument, one can recover the low-ran and sparse components of D by solving the principal component pursuit problem 1 where ξ =. max{m,n} min X + ξ D X 1, 1.1 X R m n For X R m n, X denotes the nuclear norm of X, which is equal to the sum of its singular values, X 1 := m n i=1 j=1 X ij, X := max{ X ij : 1 i m, 1 j n} and X := σ max X, where σ max X is the maximum singular value of X. To be more precise, let X 0 R m n with ranx 0 = r and let X 0 = UΣV T = r i=1 σ iu i vi T denote the singular value decomposition SVD of X 0. Suppose that for some µ > 0, U and V satisfy max U T e i µr i m, max V T e i µr i n, UV T µr mn, 1. where e i denotes the i-th unit vector. Theorem 1.1. [] Suppose D = X 0 + S 0, where X 0 R m n with m < n satisfies 1. for some µ > 0, and the support set of S 0 is uniformly distributed. Then there are constants c, ρ r, ρ s such that with probability of at least 1 cn 10, the principal component pursuit problem 1.1 exactly recovers X 0 and S 0 provided that ranx 0 ρ r mµ 1 logn and S 0 0 ρ s mn. 1.3 In [13], it is shown that the recovery is still possible even when the data matrix, D, is corrupted with a dense error matrix, ζ 0 such that ζ 0 F δ, by solving the stable principal component pursuit SPCP problem Specifically, the following theorem is proved in [13]. P : min X,S R m n{ X + ξ S 1 : X + S D F δ}. 1.4 IEOR Department, Columbia University. nsa106@columbia.edu. IEOR Department, Columbia University. goldfarb@columbia.edu. IEOR Department, Columbia University. gi10@columbia.edu. 
Research partially supported by ONR grant N , NSF Grant DMS and DOE Grant DE-FG

2 Theorem 1.. [13] Suppose D = X 0 + S 0 + ζ 0, where X 0 R m n with m < n satisfies 1. for some µ > 0, and the support set of S 0 is uniformly distributed. If X 0 and S 0 satisfy 1.3, then for any ζ 0 such that ζ 0 F δ the solution, X, S, to the stable principal component pursuit problem 1.4 satisfies X X 0 F + S S 0 F Cmnδ for some constant C with high probability. Principal component pursuit and stable principal component pursuit both have applications in video surveillance and face recognition. For existing algorithmic approaches to solving principal component pursuit see [, 3, 6, 7, 13] and references therein. In this paper, we develop four different fast first-order algorithms to solve the SPCP problem P. The first two algorithms are direct applications of Nesterov s optimal algorithm [9] and the proximal gradient method of Tseng [11], which is inspired by both FISTA and Nesterov s infinite memory algorithms that are introduced in [1] and [9], respectively. In this paper it is shown that both algorithms can compute an ɛ-optimal, feasible solution to P in O1/ɛ iterations. The third and fourth algorithms apply an alternating direction augmented Lagrangian approach to an equivalent problem obtained by partial variable splitting. The third algorithm can compute an ɛ-optimal, feasible solution to the problem in O1/ɛ iterations, which can be easily improved to O1/ɛ complexity. Given ɛ > 0, all first three algorithms use suitably smooth versions of at least one of the norms in the objective function. The fourth algorithm NSA wors directly with the original non-smooth objective function and can be shown to converge to an optimal solution of P, provided that a mild condition on the increasing sequence of penalty multipliers holds. To best of our nowledge, an algorithm for the SPCP problem that has O1/ɛ iteration complexity and has a per iteration complexity equal to that of a singular value decomposition is given for the first time. The only algorithm that we now of that has been designed to solve the SPCP problem P is the algorithm ASALM [10]. The results of our numerical experiments comparing NSA algorithm with ASALM has shown that NSA is faster and also more robust to changes in problem parameters.. Proximal Gradient Algorithm with Smooth Objective Function. In this section we show that Nesterov s optimal algorithm [8, 9] for simple sets is efficient for solving P. For fixed parameters µ > 0 and ν > 0, define the smooth C 1,1 functions f µ. and g ν. as follows f µ X = g ν S = max X, U µ U R m n : U 1 U F,.1 max S, W ν W R m n : W 1 W F.. Clearly, f µ. and g ν. closely approximate the non-smooth functions fx := X and gs := S 1, respectively. Also let χ := {X, S R m n R m n : X + S D F δ} and L = 1 µ + 1 ν, where 1 µ and 1 ν are the Lipschitz constants for the gradients of f µ. and g ν., respectively. Then Nesterov s optimal algorithm [8, 9] for simple sets applied to the problem: min X,S R m n{f µx + ξ g ν S : X, S χ},.3 is given by Algorithm 1. Because of the simple form of the set χ, it is easy to ensure that all iterates Y x, Y s, Zx, Zs and X +1, S +1 lie in χ. Hence, Algorithm 1 enjoys the full convergence rate of OL/ of the Nesterov s method. Thus, setting µ = Ωɛ and ν = Ωɛ, Algorithm 1 computes an ɛ-optimal and feasible solution to problem P in = O1/ɛ iterations. 
The iterates Y x, Y s and Zx, Zs that need to be computed at each iteration of Algorithm 1 are solutions to an optimization problem of the form: { L P s : min X X,S R m n X F + S S } F + Q x, X + Q s, S : X, S χ..4 The following lemma shows that the solution to problems of the form P s can be computed efficiently.

3 Algorithm 1 SMOOTH PROXIMAL GRADIENTX 0, S 0 1: input: X 0 R m n, S 0 R m n : 0 3: while do 4: Compute f µx and { g νs } 5: Y x, Y s argmin X,S fµx, X + g νs, S + L X X F + S S F : X, S χ 6: Γ X, S := i+1 i=0 { fµxi, X + gνsi, S } { } 7: Z x, Z s argmin X,S Γ X, S + L X X0 F + S S 0 F : X, S χ 8: X +1, S +1 9: : end while 11: return X, S Y x, Y s + +3 Z x, Z s Lemma.1. The optimal solution X, S to problem P s can be written in closed form as follows. When δ > 0, θ X S L + θ = D q s + q x X,.5 L + θ θ S = L + θ where q x X := X 1 L Q x, q s S := S 1 L Q s and θ = max { 0, D q x X + L + θ L + θ L + θ L q x X + q s S D F δ q s S,.6 1 }..7 When δ = 0, X = 1 D q s S + 1 q x X and S = 1 D q x X + 1 q s S..8 Proof. Suppose that δ > 0. Writing the constraint in problem P s, X, S χ, as the Lagrangian function for.4 is given as LX, S; θ = L 1 X + S D F δ,.9 X X F + S S F + Q x, X X + Q s, S S + θ X + S D F δ. Therefore, the optimal solution X, S and optimal Lagrangian multiplier θ R must satisfy the Karush- Kuhn-Tucer KKT conditions: i. X + S D F δ, ii. θ 0, iii. θ X + S D F δ = 0, iv. LX X + θ X + S D + Q x = 0, v. LS S + θ X + S D + Q s = 0. Conditions iv and v imply that X, S satisfy.5 and.6, from which it follows that X + S D = L L + θ q x X + q s S D..10 3

4 Case 1: q x X + q s S D F δ. Setting X = q x X, S = q s S and θ = 0, clearly satisfies.5,.6 and conditions i from.10, ii and iii. Thus, this choice of variables satisfies all the five KKT conditions. Case : q x X+q s S D F > δ. Set θ = L qx X+q s S D F δ 1. Since q x X+q s S D F > δ, θ > 0; hence, ii is satisfied. Moreover, for this value of θ, it follows from.10 that X +S D F = δ. Thus, KKT conditions i and iii are satisfied. Therefore, setting X and S according to.5 and.6, respectively; and setting { θ L q x = max 0, X + q s S } D F 1, δ satisfies all the five KKT conditions. Now, suppose that δ = 0. Since S = D X, problem P s can be written as min X R m n X X + Qx L F + D X S + Qs L F, which is also equivalent to the problem: min X R m n X q x X F + X D q s S F. Then.8 trivially follows from first-order optimality conditions for this problem and the fact that S = D X. 3. Proximal Gradient Algorithm with Partially Smooth Objective Function. In this section we show how the proximal gradient algorithm, Algorithm 3 in [11], can be applied to the problem min X,S R m n{f µx + ξ S 1 : X, S χ}, 3.1 where f µ. is the smooth function defined in.1 such that f µ. is Lipschitz continuous with constant L µ = 1 µ. This algorithm is given in Algorithm. Algorithm PARTIALLY SMOOTH PROXIMAL GRADIENTX 0, S 0 1: input: X 0 R m n, S 0 R m n : Z0 x, Z0 s X 0, S 0, 0 3: while do 4: Y x, Y s + 5: Compute f µy x X, S + + { 6: Z+1, x Z+1 s argmin X,S i=0 7: X +1, S +1 X, S + 8: + 1 9: end while 10: return X, S + Z x, Z s i+1 } x {ξ S 1 + fµyi, X } + Lµ X X0 F : X, S χ Z+1, x Z+1 s + Mimicing the proof in [11], it is easy to show that Algorithm, which uses the prox function 1 X X 0 F, converges to the optimal solution of 3.1. Given X 0, S 0 χ, e.g. X 0 = 0 and S 0 = D, the current algorithm eeps all iterates in χ as in Algorithm 1, and hence it enjoys the full convergence rate of OL/. Thus, setting µ = Ωɛ, Algorithm computes an ɛ-optimal, feasible solution of problem P in = O1/ɛ iterations. The only thing left to be shown is that the optimization subproblems in Algorithm can be solved efficiently. The subproblem that has to be solved at each iteration to compute Z+1 x, Zs +1 has the form: { P ns : min ξ S 1 + Q, X X + ρ X X } F : X, S χ, 3. for some ρ > 0. Lemma 3.1 shows that these computations can be done efficiently. 4

5 Lemma 3.1. The optimal solution X, S to problem P ns can be written in closed form as follows. When δ > 0, S = sign D q X { max D q X ξ ρ + } θ ρθ E, 0, 3.3 X = θ ρ + θ D S + ρ ρ + θ q X, 3.4 where q X := X 1 ρ Q, E and 0 Rm n are matrices with all components equal to ones and zeros, respectively, and denotes the componentwise multiplication operator. θ = 0 if D q X F δ; otherwise, θ is the unique positive solution of the nonlinear equation φθ = δ, where { ξ φθ := min θ E, ρ ρ + θ } D q X F. 3.5 Moreover, θ can be efficiently computed in Omn logmn time. When δ = 0, S = sign D q X max { D q X } ξρ E, 0 and X = D S. 3.6 Proof. Suppose that δ > 0. Let X, S be an optimal solution to problem P ns and θ denote the optimal Lagrangian multiplier for the constraint X, S χ written as.9. Then the KKT optimality conditions for this problem are i. Q + ρx X + θ X + S D = 0, ii. ξg + θ X + S D = 0 and G S 1, iii. X + S D F δ, iv. θ 0, v. θ X + S D F δ = 0. From i and ii, we have [ ] [ ] ρ + θ I θ I X θ I θ = I S [ θ D + ρ q X θ D ξg ], 3.7 where q X = X 1 ρ Q. From 3.7 it follows that [ ρ + θ I θ I 0 ρθ ρ+θ I ] [ ] [ X S = ρθ ρ+θ θ D + ρ q X D q X ξg ]. 3.8 From the second equation in 3.8, we have ξ ρ + θ ρθ G + S + q X D = But 3.9 is precisely the first-order optimality conditions for the shrinage problem { min S R m n ξ ρ + θ ρθ S S + q X D F Thus, S is the optimal solution to the shrinage problem and is given by follows from the first equation in 3.8, and it implies X + S D = }. ρ ρ + θ S + q X D

6 Therefore, X + S D F = ρ ρ + θ S + q X D F, = ρ ρ + θ sign D q X = ρ { { D q X ξ ρ + θ } ρθ E, 0 D q X F, max ρ + θ max D q X ξ ρ + } θ ρθ E, 0 D q X F, = ρ { ρ + θ min ξ ρ + } θ ρθ E, D q X F, { } ξ = min θ E, ρ D q X ρ + θ F, 3.11 where the second equation uses 3.3. Now let φ : R + R + be { } ξ φθ := min θ E, ρ D q X F. 3.1 ρ + θ Case 1: D q X F δ. θ = 0, S = 0 and X = q X trivially satisfy all the KKT conditions. Case : D q X F > δ. It is easy to show that φ. is a strictly decreasing function of θ. Since φ0 = D q X F > δ and lim θ φθ = 0, there exists a unique θ > 0 such that φθ = δ. Given θ, S and X can then be computed from equations 3.3 and 3.4, respectively. Moreover, since θ > 0 and φθ = δ, 3.11 implies that X, S and θ satisfy the KKT conditions. We now show that θ can be computed in Omn logmn time. Let A := D q X and 0 a 1 a... a mn be the mn elements of the matrix A sorted in increasing order, which can be done in Omn logmn time. Defining a 0 := 0 and a mn+1 :=, we then have for all j {0, 1,..., mn} that ρ ρ + θ a j ξ θ ρ ρ + θ a j+1 1 ξ a j 1 ρ 1 θ 1 ξ a j+1 1 ρ For all < j mn define θ j such that 1 θ j = 1 ξ a j 1 ρ and let := max Then for all < j mn { j : 1 θ j } 0, j {0, 1,..., mn}. j φθ j = ρ ξ a ρ + θ i + mn j j θ j i=0 Also define θ := and θ mn+1 := 0 so that φθ := 0 and φθ mn+1 = φ0 = A F > δ. Note that {θ j } { <j mn} contains all the points at which φθ may not be differentiable for θ 0. Define j := max{j : φθ j δ, j mn}. Then θ is the unique solution of the system j ρ ξ a ρ + θ i + mn j = δ and θ > 0, 3.15 θ i=0 since φθ is continuous and strictly decreasing in θ for θ 0. Solving the equation in 3.15 requires finding the roots of a fourth-order polynomial a..a. quartic function; therefore, one can compute θ > 0 using the algebraic solutions of quartic equations as shown by Lodovico Ferrari in 1540, which requires O1 operations. Note that if = mn, then θ is the solution of the equation ρ mn ρ + θ a i = δ, i=1

7 i.e. θ = ρ A F δ 1 = ρ D X F δ 1. Hence, we have proved that problem P ns can be solved efficiently. Now, suppose that δ = 0. Since S = D X, problem P ns can be written as min S R m n ξ ρ S S D q X F Then 3.6 trivially follows from first-order optimality conditions for the above problem and the fact that X = D S. The following lemma will be used later in Section 5. However, we give its proof here, since it uses some equations from the proof of Lemma 3.1. Let 1 χ.,. denote the indicator function of the closed convex set χ R m n R m n, i.e. if Z, S χ, then 1 χ Z, S = 0; otherwise, 1 χ Z, S =. Lemma 3.. Suppose that δ > 0. Let X, S be an optimal solution to problem P ns and θ be an optimal Lagrangian multiplier such that X, S and θ together satisfy the KKT conditions, i-v in the proof of Lemma 3.1. Then W, W 1 χ X, S, where W := Q + ρ X X = θ X + S D. Proof. Let W := Q + ρ X X, then from i and v of the KKT optimality conditions in the proof of Lemma 3.1, we have W = θ X + S D and W F = θ X + S D = θ X + S D δ + θ δ = θ δ Moreover, for all X, S χ, it follows from the definition of χ that W, θ X + S D θ W F X + S D F θ δ W F. Thus, for all X, S χ, we have W, W = W F = θ δ W F W, θ X + S D. Hence, 0 W, θ X + S D W = W, θ X X + S S X, S χ It follows from the proof of Lemma 3.1 that if D q X F > δ, then θ > 0, where q X = X 1 ρ Q. Therefore, 3.19 implies that 0 W, X X + S S X, S χ. 3.0 On the other hand, if D q X F δ, then θ = 0. Hence W = θ X + S D = 0, and 3.0 follows trivially. Therefore, 3.0 always holds and this shows that W, W 1 χ X, S. 4. Alternating Linearization and Augmented Lagrangian Algorithms. In this and the next section we present algorithms for solving problems 3.1 and 1.4 that are based on partial variable splitting combined with alternating minimization of a suitably linearized augmented Lagrangian function. We can write problems 1.4 and 3.1 generically as min X,S Rm n{φx + ξ gs : X, S χ}. 4.1 For problem 1.4, φx = fx = X, while for problem 3.1, φx = f µ X given in.1. In this section, we first assume that assume that φ : R m n R and g : R m n R m n R are any closed convex functions such that φ is Lipschitz continuous, and χ is a general closed convex set. Here we use partial variable splitting, i.e. we only split the X variables in 4.1, to arrive at the following equivalent problem min X,S,Z Rm n{φx + ξ gs : X = Z, Z, S χ}. 4. Let ψz, S := ξ gs + 1 χ Z, S and define the augmented Lagrangian function L ρ X, Z, S; Y = φx + ψz, S + Y, X Z + ρ X Z F. 4.3 Then minimizing 4.3 by alternating between X and then Z, S leads to several possible methods that can compute a solution to 4.. These include the alternating linearization method ALM with sipping step 7

8 Algorithm 3 ALM-SY 0 1: input: X 0 R m n, S 0 R m n, Y 0 R m n : Z 0 X 0, 0 3: while 0 do 4: X +1 argmin X L ρx, Z, S ; Y 5: if φx +1 + ψx +1, S > L ρx +1, Z, S ; Y then 6: X +1 Z 7: end if 8: Z +1, S +1 argmin Z,S ψz, S + φx +1 + φx +1, Z X +1 + ρ Z X +1 F 9: Y +1 φx +1 + ρx +1 Z +1 10: : end while that has an O ρ convergence rate, and the fast version of this method with an O ρ rate see [3] for full splitting versions of these methods. In this paper, we only provide a proof of the complexity result for the alternating linearization method with sipping steps ALM-S in Theorem 4.1 below. One can easily extend the proof of Theorem 4.1 to an ALM method based on 4.3 with the function gs replaced by a suitably smoothed version see [3] for the details of ALM algorithm. Theorem 4.1. Let φ : R m n R and ψ : R m n R m n R be closed convex functions such that φ is Lipschitz continuous with Lipschitz constant L, and χ be a closed convex set. Let ΦX, S := φx+ψx, S. For ρ L, the sequence {Z, S } Z+ in Algorithm ALM-S satisfies ΦZ, S ΦX, S ρ X 0 X F + n, 4.4 where X, S = argmin X,S R m n ΦX, S, n := 1 i=0 1 {ΦX i+1,s i>l ρx i+1,z i,s i;y i} and 1 {.} is 1 if its argument is true; otherwise, 0. Proof. See Appendix A for the proof. We obtain Algorithm 4 by applying Algorithm 3 to solve problem 3.1, where the smooth function φx = f µ X, defined in.1, the non-smooth closed convex function is ξ S χ X, S and χ = {X, S R m n R m n : X + S D F δ}. Theorem 4.1 shows that Algorithm 4 has an iteration complexity of O 1 ɛ to obtain ɛ-optimal and feasible solution of P. Algorithm 4 PARTIALLY SMOOTH ALMY 0 1: input: Y 0 R m n : Z 0 0, S 0 D, 0 3: while 0 do 4: X +1 argmin X f µx + Y, X Z + ρ X Z F 5: B f µx +1 + ξ S 1 + Y, X +1 Z + ρ X +1 Z F 6: if f µx +1 + ξ S χx +1, S > B then 7: X +1 Z 8: end if 9: Z +1, S +1 argmin Z,S {ξ S 1 + f µx +1, Z X +1 + ρ Z X +1 F : Z, S χ} 10: Y +1 f µx +1 + ρx +1 Z +1 11: + 1 1: end while Using the fast version of Algorithm 3, a fast version of Algorithm 4 with Oρ/ convergence rate, employing partial splitting and alternating linearization, can be constructed. This fast version can compute an ɛ-optimal and feasible solution to problem P in O1/ɛ iterations. Moreover, lie the proximal gradient methods described earlier, each iteration for these methods can be computed efficiently. The subproblems 8

9 to be solved at each iteration of Algorithm 4 and its fast version have the following generic form: min X R m n f µx + Q, X X + ρ X X F, 4.5 min {ξ S 1 + Q, Z Z + ρ Z,S R m n Z Z F : Z, S χ}. 4.6 Let U diagσv T denote the singular value decomposition of the matrix X Q/ρ, then X, the minimizer of the subproblem in 4.5, can be easily computed as U diag σ V T. And Lemma 3.1 shows how to solve the subproblem in 4.6. σ max{ρσ, 1+ρµ} 5. Non-smooth Augmented Lagrangian Algorithm. Algorithm 5 is a Non-Smooth Augmented Lagrangian Algorithm NSA that solves the non-smooth problem P. The subproblem in Step 4 of Algorithm 5 is a matrix shrinage problem and can be solved efficiently by computing a singular value decomposition SVD of an m n matrix; and Lemma 3.1 shows that the subproblem in Step 6 can also be solved efficiently. Algorithm 5 NSAZ 0, Y 0 1: input: Z 0 R m n, Y 0 R m n : 0 3: while 0 do 4: X +1 argmin X { X + Y, X Z + ρ X Z F } 5: Ŷ +1 Y + ρ X +1 Z 6: Z +1, S +1 argmin {Z,S: Z+S D F δ } {ξ S 1 + Y, Z X +1 + ρ Z X +1 F } 7: Let θ be an optimal Lagrangian dual variable for the 1 Z + S D F δ constraint 8: Y +1 Y + ρ X +1 Z +1 9: Choose ρ +1 such that ρ +1 ρ 10: : end while We now prove that Algorithm NSA converges under fairly mild conditions on the sequence {ρ } Z+ of penalty parameters. We first need the following lemma, which extends the similar result given in [6] to partial splitting of variables. Lemma 5.1. Suppose that δ > 0. Let {X, Z, S, Y, θ } Z+ be the sequence produced by Algorithm NSA. X, X, S 1 = argmin X,Z,S { X + ξ S 1 : Z + S D F δ, X = Z} be any optimal solution, Y R m n and θ 0 be any optimal Lagrangian duals corresponding to the constraints X = Z and 1 Z + S D F δ, respectively. Then { Z X F + ρ Y Y F } Z + is a non-increasing sequence and Z + Z +1 Z F < Z + ρ Y +1 Y F <, Z + ρ 1 Y +1 + Y, S +1 S < Z + ρ 1 Ŷ+1 + Y, X +1 X <, Z + ρ 1 Y Y +1, X + S Z +1 S +1 <. Proof. See Appendix B for the proof. Given partially split SPCP problem, min X,Z,S { X + ξ S 1 : X = Z, Z, S χ}, let L be its Lagrangian function LX, Z, S; Y, θ = X + ξ S 1 + Y, X Z + θ Z + S D F δ. 5.1 Theorem 5.. Suppose that δ > 0. Let {X, Z, S, Y, θ } Z+ be the sequence produced by Algorithm NSA. Choose {ρ } Z+ such that 9

10 i 1 Z + ρ = : Then lim Z+ Z = lim Z+ X = X, lim Z+ S = S such that X, S = argmin{ X + ξ S 1 : X + S D F δ}. ii Z + 1 ρ = : If D X F δ, then lim Z+ θ = θ 0 and lim Z+ Y = Y such that X, X, S, Y, θ is a saddle point of the Lagrangian function L in 5.1. Otherwise, if D X F = δ, then there exists a limit point, Y, θ, of the sequence {Y, θ } Z+ such that Y, θ = argmax Y,θ {LX, X, S ; Y, θ : θ 0}. Remar 5.1. Requiring 1 Z + ρ = is similar to the condition in Theorem in [6], which is needed to show that Algorithm I-ALM converges to an optimal solution of the robust PCA problem. Remar 5.. Let D = X 0 + S 0 + ζ 0 such that ζ 0 F δ and X 0, S 0 satisfies the assumptions of Theorem 1.. If S 0 F > Cmnδ, then with very high probability, D X F > δ, where C is the numerical constant defined in Theorem 1.. Therefore, most of the time in applications, one does not encounter the case where D X F = δ. Proof. From Lemma 5.1 and the fact that X +1 Z +1 = 1 ρ Y +1 Y for all 1, we have > ρ Y +1 Y F = X +1 Z +1 F. Z + Z + Hence, lim Z+ X Z = 0. Let X #, X #, S # 1 = argmin X,Z,S { X + ξ S 1 : Z + S D F δ, X = Z} be any optimal solution, Y # R m n and θ # 0 be any optimal Lagrangian duals corresponding to X = Z and 1 Z + S D F δ constraints, respectively and f := X # + ξ S # 1. Moreover, let χ = {Z, S R m n R m n : Z + S D F δ} and 1 χ Z, S denote the indicator function of the closed convex set χ, i.e. 1 χ Z, S = 0 if Z, S χ; otherwise, 1 χ Z, S =. Since the sequence {Z, S } Z+ produced by NSA is a feasible sequence for the set χ, we have 1 χ Z, S = 0 for all 1. Hence, the following inequality is true for all 0 X + ξ S 1 = X + ξ S χ Z, S, X # + ξ S # χ X #, S # Ŷ, X # X Y, S # S Y, X # + S # Z S, = f + Ŷ + Y #, X X # + Y + Y #, S S # + Y # Y, X # + S # Z S 5. + Y #, Z X, where the inequality follows from the convexity of norms and the fact that Y ξ S 1, Ŷ X and Y, Y 1 χ Z, S ; the final equality follows from rearranging the terms and the fact that X #, S # χ. From Lemma 5.1, we have Z + ρ 1 1 Ŷ + Y #, X X # + Y + Y #, S S # + Y # Y, X # + S # Z S Since 1 Z + ρ =, there exists K Z + such that lim Ŷ + Y #, X X # + Y + Y #, S S # + Y # Y, X # + S # Z S K <. = and the fact that lim Z+ Z X = 0 imply that along K 5. converges to f = X # + ξ S # 1 = min{ X +ξ S 1 : X, S χ}; hence along K subsequence, { X +ξ S 1 } K is a bounded sequence. Therefore, there exists K K Z + such that lim K X, S = X, S. Also, since lim Z+ Z X = 0 and Z, S χ for all 1, we also have X, S = lim K Z, S χ. Since the limit of both sides of 5. along K gives X + ξ S 1 = lim K X + ξ S 1 f and X, S χ, we conclude that X, S = argmin{ X + ξ S 1 : X, S χ}. 10

11 It is also true that X, X, S is an optimal solution to an equivalent problem: argmin X,Z,S { X + 1 ξ S 1 : Z + S D F δ, X = Z}. Now, let Ȳ Rm n and θ 0 be optimal Lagrangian duals corresponding to X = Z and 1 Z + S D F δ constraints, respectively. From Lemma 5.1, it follows that { Z X F + ρ Y Ȳ F } Z + is a bounded non-increasing sequence. Hence, it has a unique limit point, i.e. lim Z X F = lim Z X F + ρ Y Ȳ Z + Z F = lim Z X + K F + ρ Y Ȳ F = 0, where the equalities follow from the facts that lim K Z = X, µ as and {Ŷ} Z+, {Y } Z+ are bounded sequences. lim Z+ Z X F = 0 and lim Z+ Z X = 0 imply that lim Z+ X = X. Using Lemma 3.1 for the -th subproblem given in Step 6 in Algorithm 5, we have { S +1 = sign D X ρ Y max D X Y ξ ρ } + θ, 5.4 ρ Z +1 = θ ρ + θ D S +1 + ρ ρ + θ E, 0 ρ θ X ρ Y. 5.5 If D X ρ Y F δ, then θ = 0; otherwise, θ > 0 is the unique solution such that φ θ = δ, where { φ θ := ξ min θ E, ρ ρ + θ X D } F Y. 5.6 ρ In the following, it is shown that the sequence {S } Z+ has a unique limit point S. Since lim Z+ X = X, {Y } Z+ is a bounded sequence and ρ as, we have lim Z+ X ρ Y = X. Case 1: D X F δ. Previously, we have shown that that exists a subsequence K Z + such that lim K X, S = X, S = argmin X,S { X + ξ S 1 : X + S D F δ}. On the other hand, since D X F δ, X, 0 is a feasible solution. Hence, X + ξ S X, which implies that S = 0. X + ξ S 1 = X + ξ S χ Z, S, X + ξ χ X, 0 Ŷ, X X Y, 0 S Y, X + 0 Z S, = X + Ŷ, X X + Y, Z X. 5.7 Since the sequences {Y } Z+ and {Ŷ} Z+ are bounded and lim Z+ X = lim Z+ Z = X, taing the limit on both sides of 5.7, we have X + ξ lim S 1 = lim X + ξ S 1 Z + Z + = lim Z + X + Ŷ, X X + Y, Z X = X. Therefore, lim Z+ S 1 = 0, which implies that lim Z+ S = S = 0. Case : D X F > δ. Since D X ρ Y F D X F > δ, there exists K Z + such that for all K, D X ρ Y F > δ. For all K, φ. is a continuous and strictly decreasing function of θ for θ 0. Hence, inverse function φ 1. exits around δ for all K. Thus, φ 0 = D X ρ Y F > δ and lim θ φ θ = 0 imply that θ = φ 1 δ > 0 for all K. Moreover, φ θ φθ := ξ θ E F implies that θ ξ mn δ for all K. Therefore, {θ } Z+ is a bounded sequence, which has a convergent subsequence K θ Z + such that lim Kθ θ = θ. We also have φ θ φ θ pointwise for all 0 θ ξ mn δ, where { φ θ := ξ } min θ E, D X F

12 Since φ θ = δ for all K, we have { δ = lim φ θ = ξ K min E, θ ρ ρ + θ X D } F Y = φ θ. 5.9 ρ Note that since D X F > δ, φ is invertible around δ, i.e. φ 1 exists around δ. Thus, θ = φ 1 δ. Since K θ is an arbitrary subsequence, we can conclude that θ := lim Z+ θ = φ 1 δ. Since there exists θ > 0 such that θ = lim Z+ θ, taing the limit on both sides of 5.4, we have { S := lim S +1 = sign D X max D X ξ } Z + θ E, 0, 5.10 and this completes the first part of the theorem. Now, we will show that if D X F δ, then the sequences {θ } Z+ and {Y } Z+ have unique limits. Note that from B.3, it follows that Y = θ 1 Z + S D for all 1. First suppose that D X F < δ. Since D X ρ Y F D X F < δ, there exists K Z + such that for all K, D X ρ Y F < δ. Thus, from Lemma 3.1 for all K, θ = 0, S +1 = 0, Z +1 = X ρ Y, which implies that θ := lim Z+ θ = 0 and Y = lim Z+ Y = lim Z+ θ 1 Z + S D = 0 since S = lim K S = lim Z+ S = 0, lim Z+ Z = X and D X F < δ. Now suppose that D X F > δ. In Case above we have shown that θ = lim Z+ θ. Hence, there exists Y R m n such that Y = lim Z+ θ 1 Z + S D = θ X + S D. Suppose that 1 Z + =. From Lemma 5.1, we have ρ Z + Z +1 Z F <. Equivalently, the series can be written as > Z +1 Z F = ρ Ŷ+1 Y +1 F Z + Z + Since 1 Z + =, there exists a subsequence K Z ρ + such that lim K Ŷ+1 Y +1 F lim K ρ Z +1 Z F = 0, i.e. lim K ρ Z +1 Z = 0. Using B.1, B. and B.3, we have = 0. Hence, 0 X +1 + θ Z +1 + S +1 D + ρ Z +1 Z, ξ S θ Z +1 + S +1 D If D X = δ, then there exists Y R m n such that Y = lim Z+ θ 1 Z + S D = θ X + S D. Taing the limit of 5.1,5.13 along K Z + and using the fact that lim K ρ Z +1 Z = 0, we have 0 X + θ X + S D, ξ S 1 + θ X + S D and 5.15 together imply that X, S, Y = θ X + S D and θ satisfy KKT optimality conditions for the problem min X,Z,S { X +ξ S 1 : 1 Z +S D F δ, X = Z}. Hence, X, X, S, Y, θ is a saddle point of the Lagrangian function LX, Z, S; Y, θ = X + ξ S 1 + Y, X Z + θ Z + S D F δ. Suppose that D X F = δ. Fix > 0. If D X ρ Y F δ, then θ = 0. Otherwise, θ > 0 and as shown in case in the first part of the proof θ ξ mn δ. Thus, for any > 0, 0 θ ξ mn δ. Since {θ } Z+ is a bounded sequence, there exists a further subsequence K θ K such that θ := lim Kθ θ 1 and Y := lim Kθ θ 1 Z + S D = θ X + S D exist. Thus, taing the limit of 5.1,5.13 along K θ Z + and using the facts that lim K ρ Z +1 Z = 0 and X = lim Z+ X = lim Z+ Z, S = lim Z+ S exist, we conclude that X, X, S, Y, θ is a saddle point of the Lagrangian function LX, Z, S; Y, θ. 1

13 6. Numerical experiments. Our preliminary numerical experiments showed that among the four algorithms discussed in this paper, NSA is the fastest. It also has very few parameters that need to be tuned. Therefore, we only report the results for NSA. We conducted two sets of numerical experiments with 1 NSA to solve 1.4, where ξ =. In the first set we solved randomly generated instances of the max{m,n} stable principle component pursuit problem. In this setting, first we tested only NSA to see how the run times scale with respect to problem parameters and size; then we compared NSA with another alternating direction augmented Lagrangian algorithm ASALM [10]. In the second set of experiments, we ran NSA and ASALM to extract moving objects from an airport security noisy video [5] Random Stable Principle Component Pursuit Problems. We tested NSA on randomly generated stable principle component pursuit problems. The data matrices for these problems, D = X 0 + S 0 + ζ 0, were generated as follows i. X 0 = UV T, such that U R n r, V R n r for r = c r n and U ij N 0, 1, V ij N 0, 1 for all i, j are independent standard Gaussian variables and c r {0.05, 0.1}, ii. Λ {i, j : 1 i, j n} such that cardinality of Λ, Λ = p for p = c p n and c p {0.05, 0.1}, iii. Sij 0 U[ 100, 100] for all i, j Λ are independent uniform random variables between 100 and 100, iv. ζij 0 ϱn 0, 1 for all i, j are independent Gaussian variables. We created 10 random problems of size n {500, 1000, 1500}, i.e. D R n n, for each of the two choices of c r and c p using the procedure described above, where ϱ was set such that signal-to-noise ratio of D is either 80dB or 45dB. Signal-to-noise ratio of D is given by [ ] E X 0 + S 0 F SNRD = 10 log 10 E [ ζ 0 F ] cr n + c s 100 /3 = 10 log 10 ϱ. 6.1 Hence, for a given SNR value, we selected ϱ according to 6.1. Table 6.1 displays the ϱ value we have used in our experiments. As in [10], we set δ = n + 8nϱ in 1.4 in the first set of experiments for both Table 6.1 ϱ values depending on the experimental setting SNR n c r=0.05 c p=0.05 c r=0.05 c p=0.1 c r=0.1 c p=0.05 c r=0.1 c p= dB dB NSA and ASALM. Our code for NSA was written in MATLAB 7. and can be found at ~nsa106. We terminated the algorithm when X +1, S +1 X, S F X, S F + 1 ϱ. 6. The results of our experiments are displayed in Tables 6. and 6.3. In Table 6., the row labeled CPU lists the running time of NSA in seconds and the row labeled SVD# lists the number of partial singular value decomposition SVD computed by NSA. The minimum, average and maximum CPU times and number of partial SVD taen over the 10 random instances are given for each choice of n, c r and c p values. Table C.3 and Table C.4 in the appendix list additional error statistics. With the stopping condition given in 6., the solutions produced by NSA have Xsol +S sol D F D F approximately when SNRD = 80dB and when SNRD = 45dB, regardless of the problem dimension n and the problem parameters related to the ran and sparsity of D, i.e. c r and c p. After thresholding the singular values of X sol that were less than , NSA found the true ran in all 10 13

14 random problems solved when SNRD = 80dB, and it found the true ran for 113 out of 10 problems when SNRD = 45dB, while for 6 of the remaining problems ranx sol is off from ranx 0 only by 1. Table 6. shows that the number of partial SVD was a very slightly increasing function of n, c r and c p. Moreover, Table 6.3 shows that the relative error of the solution X sol, S sol was almost constant for different n, c r and c p values. Table 6. NSA: Solution time for decomposing D R n n, n {500, 1000, 1500} c r=0.05 c p=0.05 c r=0.05 c p=0.1 c r=0.1 c p=0.05 c r=0.1 c p=0.1 SNR n Field min/avg/max min/avg/max min/avg/max min/avg/max 80dB 45dB SVD# 9/9.0/9 9/9.5/10 10/10.0/10 11/11/11 CPU 3./4.4/ /5.1/ /5./ /6./8.1 SVD# 9/9.9/10 10/10.0/10 11/11/11 1/1.0/1 CPU 16.5/19.6/ /0.7/4.3 5./6.9/ /31./36.3 SVD# 10/10.0/10 10/10.9/11 1/1.0/1 1/1./13 CPU 38.6/44.1/ /48.6/ /84.1/ /97.7/155. SVD# 6/6/6 6/6.9/7 7/7.1/8 8/8/8 CPU.3/.9/4..9/3.6/4.5.9/3.9/6. 3.5/4./6.0 SVD# 7/7.0/7 7/7.0/7 8/8.1/9 9/9.0/9 CPU 11.5/13.4/ /13.3/ /18.7/ /3.8/8.9 SVD# 7/7.9/8 8/8.0/8 9/9.0/9 9/9.0/9 CPU 34.1/37.7/ /37.1/ /59.0/ /59.7/64.8 Table 6.3 NSA: Solution accuracy for decomposing D R n n, n {500, 1000, 1500} c r=0.05 c p=0.05 c r=0.05 c p=0.1 c r=0.1 c p=0.05 c r=0.1 c p=0.1 SNR n Relative Error avg / max avg / max avg / max avg / max 80dB 45dB X sol X 0 F X 0 F 4.0E-4 / 4.E-4 5.8E-4 / 8.5E-4 3.6E-4 / 3.9E-4 4.4E-4 / 4.5E-4 S sol S 0 F S 0 F 1.7E-4 / 1.8E-4 1.6E-4 /.5E-4 1.6E-4 / 1.8E-4 1.3E-4 / 1.3E-4 X sol X 0 F X 0 F.0E-4 /.4E-4 3.8E-4 / 4.1E-4.E-4 /.E-4.8E-4 /.9E-4 S sol S 0 F S 0 F 1.E-4 / 1.4E-4 1.5E-4 / 1.6E-4 1.E-4 / 1.3E-4 1.1E-4 / 1.1E-4 X sol X 0 F X 0 F 1.8E-4 /.E-4.1E-4 /.6E-4 1.3E-4 / 1.3E-4.8E-4 /.9E-4 S sol S 0 F S 0 F 1.3E-4 / 1.6E-4 9.6E-5 / 1.1E-4 8.1E-5 / 8.5E-5 1.3E-4 / 1.4E-4 X sol X 0 F X 0 F 6.0E-3 / 6.E-3 8.0E-3 / 9.E-3 6.1E-3 / 6.3E-3 8.1E-3 / 8.E-3 S sol S 0 F S 0 F.1E-3 /.E-3.3E-3 /.7E-3.E-3 /.3E-3.7E-3 /.9E-3 X sol X 0 F X 0 F 4.1E-3 / 4.E-3 6.1E-3 / 6.E-3 4.6E-3 / 4.7E-3 6.0E-3 / 6.5E-3 S sol S 0 F S 0 F 1.9E-3 / 1.9E-3.4E-3 /.5E-3.3E-3 / 3.5E-3 3.1E-3 / 3.7E-3 X sol X 0 F X 0 F 3.4E-3 / 3.6E-3 4.7E-3 / 4.7E-3 3.9E-3 / 4.0E-3 5.3E-3 / 5.3E-3 S sol S 0 F S 0 F 1.8E-3 / 1.8E-3.3E-3 /.3E-3.6E-3 / 3.5E-3 3.1E-3 / 3.1E-3 Next, we compared NSA with ASALM [10] for a fixed problem size, i.e. n = 1500 where D R n n. In all the numerical experiments, we terminated NSA according to 6.. For random problems with SNRD = 80dB, we terminated ASALM according to 6.. However, for random problems with SNRD = 45dB, ASALM produced solutions with 99% relative errors when 6. was used. Therefore, for random problems with SNRD = 45dB, we terminated ASALM either when it computed a solution with better relative errors comparing to NSA solution for the same problem or when an iterate satisfied 6. with the righthand side replaced by 0.1ϱ. The code for ASALM was obtained from the authors of [10]. 14

15 The comparison results are displayed in Table 6.5 and Table 6.6. In Table 6.5, the row labeled CPU lists the running time of each algorithm in seconds and the row labeled SVD# lists the number of partial SVD computation of each algorithm. In Table 6.5, the minimum, average and maximum of CPU times and the number of partial SVD computation of each algorithm taen over the 10 random instances are given for each two choices of c r and c p. Moreover, Table C.1 and Table C. given in the appendix list different error statistics. We used PROPACK [4] for computing partial singular value decompositions. In order to estimate the ran of X 0, we followed the scheme proposed in Equation 17 in [6]. Both NSA and ASALM found the true ran in all 40 random problems solved when SNRD = 80dB. NSA found the true ran for 39 out of 40 problems with n = 1500 when SNRD = 45dB, while for the remaining 1 problem ranx sol is off from ranx 0 only by 1. On the other hand, when SNRD = 45dB, ASALM could not find the true ran in any of the test problems. For each of the four problem settings corresponding to different c r and c p values, in Table 6.4 we report the average and maximum of ranx sol over 10 random instances, after thresholding the singular values of X sol that were less than Table 6.5 shows that for all of the problem classes, the number of partial SVD required by ASALM was Table 6.4 NSA vs ASALM: ranx sol values for problems with n = 1500, SNRD = 45dB ranx 0 = 75 ranx 0 = 150 c r=0.05 c p=0.05 c r=0.05 c p=0.1 c r=0.1 c p=0.05 c r=0.1 c p=0.1 Alg. avg / max avg / max avg / max avg / max NSA 75 / / / / 150 ASALM / / 07.4 / / 04 more than twice the number that NSA required. On the other hand, there was a big difference in CPU times; this difference can be explained by the fact that ASALM required more leading singular values than NSA did per partial SVD computation. Table 6.6 shows that although the relative errors of the low-ran components produced by NSA were slightly better, the relative errors of the sparse components produced by NSA were significantly better than those produced by ASALM. Finally, in Figure 6.1, we plot the decomposition of D = X 0 + S 0 + ζ 0 R n n generated by NSA, where ranx 0 = 75, S 0 0 = 11, 500 and SNRD = 45. In the first row, we plot randomly selected 1500 components of S 0 and 100 leading singular values of X 0 in the first row. In the second row, we plot the same components of S sol and 100 singular of X sol produced by NSA. In the third row, we plot the absolute errors of S sol and X sol. Note that the scales of the graphs showing absolute errors of S sol and X sol are larger than those of S 0 and X 0. And in the fourth row, we plot the same 1500 random components of ζ 0. When we compare the absolute error graphs of S sol and X sol with the graph showing ζ 0, we can confirm that the solution produced by NSA is inline with Theorem 1.. Table 6.5 NSA vs ASALM: Solution time for decomposing D R n n, n = 1500 c r=0.05 c p=0.05 c r=0.05 c p=0.1 c r=0.1 c p=0.05 c r=0.1 c p=0.1 SNR Alg. Field min/avg/max min/avg/max min/avg/max min/avg/max 80dB 45dB NSA ASALM NSA ASALM SVD# 10/10.0/10 10/10.9/11 1/1.0/1 1/1./13 CPU 38.6/44.1/ /48.6/ /84.1/ /97.7/155. SVD# /.0/ 0/0.0/0 9/9.0/9 9/9.4/30 CPU 657.3/677.8/ /850.0/ /1316.1/ /1905./004.7 SVD# 7/7.9/8 8/8.0/8 9/9.0/9 9/9.0/9 CPU 34.1/37.7/ /37.1/ /59.0/ /59.7/64.8 SVD# 1/1/1 18/18.5/19 8/8.0/8 7/7.3/8 CPU 666.6/686.9/ /857.1/ /13./ /1739.1/

16 Table 6.6 NSA vs ASALM: Solution accuracy for decomposing D R n n, n = 1500 c r=0.05 c p=0.05 c r=0.05 c p=0.1 c r=0.1 c p=0.05 c r=0.1 c p=0.1 SNR Alg. Relative Error avg / max avg / max avg / max avg / max 80dB 45dB NSA ASALM NSA ASALM X sol X 0 F X 0 F 1.8E-4 /.E-4.1E-4 /.6E-4 1.3E-4 / 1.3E-4.8E-4 /.9E-4 S sol S 0 F S 0 F 1.3E-4 / 1.6E-4 9.6E-5 / 1.1E-4 8.1E-5 / 8.5E-5 1.3E-4 / 1.4E-4 X sol X 0 F X 0 F 3.9E-4 / 4.E-4 8.4E-4 / 8.8E-4 6.6E-4 / 6.8E-4 1.4E-3 / 1.4E-3 S sol S 0 F S 0 F 5.7E-4 / 6.E-4 7.6E-4 / 8.0E-4 1.1E-3 / 1.1E-3 1.4E-3 / 1.4E-3 X sol X 0 F X 0 F 3.4E-3 / 3.6E-3 4.7E-3 / 4.7E-3 3.9E-3 / 4.0E-3 5.3E-3 / 5.3E-3 S sol S 0 F S 0 F 1.8E-3 / 1.8E-3.3E-3 /.3E-3.6E-3 / 3.5E-3 3.1E-3 / 3.1E-3 X sol X 0 F X 0 F 4.6E-3 / 4.8E-3 7.3E-3 / 8.4E-3 4.7E-3 / 4.7E-3 7.8E-3 / 7.9E-3 S sol S 0 F S 0 F 4.8E-3 / 4.9E-3 5.8E-3 / 7.0E-3 5.5E-3 / 5.5E-3 7.3E-3 / 7.5E-3 Fig NSA: Comparison of randomly selected 1500 components of ζ 0 with absolute errors of those components in S sol and σx sol. D R n n, n = 1500, SNRD = 45dB 6.. Foreground Detection on a Noisy Video. We used NSA and ASALM to extract moving objects in an airport security video [5], which is a sequence of 01 grayscale frames of size We assume that the airport security video [5] was not corrupted by Gaussian noise. We formed the i-th column of the data matrix D by stacing the columns of the i th frame into a long vector, i.e. D is in R In order to have a noisy video with SNR = 0dB signal-to-noise ratio SNR, given D, we chose ϱ = D F / SNR/0 and then obtained a noisy D by D = D + ϱ randn , 01, where randnm, n produces a random matrix with independent standard Gaussian entries. Solving for X, S = argmin X,S R { X + ξ S 1 : X + S D F δ}, we decompose D into a low ran 16

17 matrix X and a sparse matrix S. We estimate the i-th frame bacground image with the i-th column of X and estimate the i-th frame moving object with the i-th column of S. Both algorithms are terminated when X +1,S +1 X,S F X,S F +1 ϱ The recovery statistics of each algorithm are are displayed in Table 6.7. X sol, S sol denote the variables corresponding to the low-ran and sparse components of D, respectively, when the algorithm of interest terminates. Figure 7 and Figure 7 show the 35-th, 100-th and 15-th frames of the noise added airport security video [5] in their first row of images. The second and third rows in these tables have the recovered bacground and foreground images of the selected frames, respectively. Even though the visual quality of recovered bacground and foreground are very similar, Table 6.7 shows that both the number of partial SVDs and the CPU time of NSA are significantly less than those for ASALM. Table 6.7 NSA vs ASALM: Recovery statistics for foreground detection on a noisy video Alg. CPU SVD# X sol S sol 1 ranx sol X sol +S sol D F D F NSA ASALM Acnowledgements. We would lie to than to Min Tao for providing the code ASALM. Dt: X sol t: S sol t: Fig Bacground extraction from a video with 0dB SNR using NSA 17

18 Dt: X sol t: S sol t: Fig. 7.. Bacground extraction from a video with 0dB SNR using ASALM 18

19 REFERENCES [1] A. Bec and M. Teboulle, A fast iterative shrinage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences, 009, pp [] E. J. Candès, X. Li, Y. Ma, and Wright J., Robust principle component analysis?, Journal of ACM, , pp [3] D. Goldfarb, S. Ma, and K. Scheinberg, Fast alternating linearization methods for minimizing the sum of two convex functions. arxiv: v, October 010. [4] R.M. Larsen, Lanczos bidiagonalization with partial reorthogonalization, Technical report DAIMI PB-357, Department of Computer Science, Aarhus University, [5] L. Li, W. Huang, I. Gu, and Q. Tian, Statistical modeling of complex bacgrounds for foreground object detection, IEEE Trans. on Image Processing, , pp [6] Z. Lin, M. Chen, L. Wu, and Y. Ma, The augmented lagrange multiplier method for exact recovery of corrupted low-ran matrices, arxiv: v, 011. [7] Z. Lin, A. Ganesh, J. Wright, L. Wu, M. Chen, and Y. Ma, Fast convex optimization algorithms for exact recovery of a corrupted low-ran matrix, tech. report, UIUC Technical Report UILU-ENG-09-14, 009. [8] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Kluwer Academic Publishers, 004. [9], Smooth minimization of nonsmooth functions, Mathematical Programming, , pp [10] M. Tao and X. Yuan, Recovering low-ran and sparse components of matrices from incomplete and noisy observations, SIAM Journal on Optimization, 1 011, pp [11] P. Tseng, On accelerated proximal gradient methods for convex-concave optimization, submitted to SIAM Journal on Optimization, 008. [1] J. Wright, A. Ganesh, S. Rao, Y. Peng, and Y. Ma, Robust principal component analysis: Exact recovery of corrupted low-ran matrices via convex optimization, in Proceedings of Neural Information Processing Systems NIPS, December 009. [13] Z. Zhou, X. Li, J. Wright, E. Candès, and Y. Ma, Stable principle component pursuit, Proceedings of International Symposium on Information Theory, 010. Appendix A. Proof of Theorem 4.1. Definition A.1. Let φ : R m n R and ψ : R m n R m n R be closed convex functions and define Q φ Z, S X := ψz, S + φx + γ φ X, Z X + ρ Z X F, Q ψ Z X, S := φz + ψx, S + γ ψ x X, S, Z X + ρ Z X F, A.1 A. and p φ x X, p φ s X := argmin Z,S R m n Q φ Z, S X, p ψ X, S := argmin Z R m n Q ψ Z X, S, where γ φ X is any subgradient in the subdifferential φ at the point X and A.3 A.4 γx ψ X, S, γs ψ X, S is any subgradient in the subdifferential ψ at the point X, S. Lemma A.. Let φ, ψ, Q φ, Q ψ, p φ x, p φ s, p ψ, γ φ, γ ψ x, γ ψ s be as given in Definition A.1. and ΦX, S := φx + ψx, S. Let X 0 R m n and define ˆX := p φ xx 0 and Ŝ := pφ s X 0. If then for any X, S R m n R m n, Moreover, if ρ ΦX, S Φ ˆX, Ŝ Φ ˆX, Ŝ Qφ ˆX, Ŝ X0, A.5 X ˆX F X X 0 F. A.6 Φ p ψ ˆX, Ŝ, Ŝ Q ψ p ψ ˆX, Ŝ ˆX, Ŝ, A.7 then for any X, S R m n R m n, ΦX, S Φ p ψ ˆX, Ŝ Ŝ, X p ψ ˆX, Ŝ F X ρ ˆX F. A.8 19

20 Proof. Let X 0 R m n satisfy A.5. Then for any X, S R m n R m n, we have ΦX, S Φ p φ xx 0, Ŝ ΦX, S Q φ ˆX, Ŝ X 0. A.9 First order optimality conditions for A.3 and ψ being a closed convex function guarantee that there exists γx ψ ˆX, Ŝ, γs ψ ˆX, Ŝ ψ ˆX, Ŝ such that γx ψ ˆX, Ŝ where ψ ˆX, Ŝ denotes the subdifferential of ψ.,. at the point ˆX, Ŝ. Moreover, using the convexity of ψ.,. and φ., we have + γ φ X 0 + ρ ˆX X 0 = 0, A.10 γs ψ ˆX, Ŝ = 0, A.11 ψx, S ψ ˆX, Ŝ + γx ψ ˆX, Ŝ, X ˆX + γs ψ ˆX, Ŝ, S Ŝ, φx φx 0 + γ φ X 0, X X 0. These two inequalities and A.11 together imply ΦX, S ψ ˆX, Ŝ + γx ψ ˆX, Ŝ, X ˆX + φx 0 + γ φ X 0, X X 0. A.1 This inequality together with A.5 and A.10 gives ΦX, S Φ ˆX, Ŝ γx ψ ˆX, Ŝ, X ˆX + γ φ X 0, X X 0 = γ φ X 0 + γx ψ ˆX, Ŝ, X ˆX ρ X X0 F, =ρ X 0 ˆX, X ˆX ρ X X0 F, = ρ X ˆX F X X 0 F. γ φ X 0, ˆX X 0 ρ X X0 F, Hence, we have A.6. Suppose that X 0 satisfies A.7. Then for any X, S R m n R m n, we have ΦX, S Φ p ψ ˆX, Ŝ, Ŝ ΦX, S Q ψ p ψ ˆX, Ŝ ˆX, Ŝ. A.13 First order optimality conditions for A.4 and φ being a closed convex function guarantee that there exists γ p φ ψ ˆX, Ŝ φ p ψ ˆX, Ŝ such that γ φ p ψ ˆX, Ŝ + γx ψ ˆX, Ŝ + ρ p ψ ˆX, Ŝ ˆX = 0. A.14 Moreover, using the convexity of φ. and ψ.,., we have φx φ p ψ ˆX, Ŝ + γ φ p ψ ˆX, Ŝ, X p ψ ˆX, Ŝ, A.15 ψx, S ψ ˆX, Ŝ + γx ψ ˆX, Ŝ, X ˆX, A.16 0

21 where A.16 follows from the fact that ˆX, Ŝ = argmin X,S Q φ X, S X 0 implies γx ψ ˆX, Ŝ, 0 ψ ˆX, Ŝ, i.e. we can set γs ψ ˆX, Ŝ = 0. Summing the two inequalities A.15 and A.16 give ΦX, S ψ ˆX, Ŝ + γx ψ ˆX, Ŝ, X ˆX + φ p ψ ˆX, Ŝ + γ φ p ψ ˆX, Ŝ, X p ψ ˆX, Ŝ. A.17 This inequality together with A.7 and A.14 gives ΦX, S Φ p ψ ˆX, Ŝ, Ŝ γx ψ ˆX, Ŝ, X ˆX + γ φ p ψ ˆX, Ŝ, X p ψ ˆX, Ŝ γx ψ ˆX, Ŝ, p ψ ˆX, Ŝ ˆX ρ pψ ˆX, Ŝ ˆX F, = γ φ p ψ ˆX, Ŝ + γx ψ ˆX, Ŝ, X p ψ ˆX, Ŝ ρ pψ ˆX, Ŝ ˆX F, =ρ ˆX p ψ ˆX, Ŝ, X p ψ ˆX, Ŝ ρ pψ ˆX, Ŝ ˆX F, = ρ X p ψ ˆX, Ŝ F X ˆX F. Hence, we have A.8. We are now ready to give the proof of Theorem 4.1. Proof. Let I := {0 i 1 : ΦX i+1, S i L ρ X i+1, Z i, S i ; Y i } and I c := {0, 1,..., 1} \ I. Since φ. is Lipschitz continuous with Lipschitz constant L and ρ L, Φp φ xx, p φ s X Q φ p φ xx, p φ s X X is true for all X R m n. Since A.5 in Lemma A. is true for all X 0 R m n, A.6 is true for all X, S R m n R m n. Particularly, since for all i I I c Z i+1, S i+1 = argmin Q φ Z, S X i+1, Z,S A.18 setting X, S := X, S and X 0 := X i+1 in Lemma A. imply that p φ xx i+1 = Z i+1, p φ s X i+1 = S i+1 and we have ρ ΦX, S ΦZ i+1, S i+1 Z i+1 X F X i+1 X F. Moreover, A.18 implies that for all i I I c, there exits γ ψ x Z i, S i, γ ψ s Z i, S i ψz i, S i such that A.19 γ ψ x Z i, S i + φx i + ρz i X i = 0, γ ψ s Z i, S i = 0. A.0 A.1 A.0 and the definition of Y i+1 of Algorithm ALM-S shown in Algorithm 3 imply that γ ψ x Z i, S i = φx i + ρx i Z i = Y i. Hence, by defining Q ψ. Z i, S i according to A. using γ ψ x Z i, S i = Y i, for all X R m n we have L ρ X, Z i, S i ; Y i = φx + ψz i, S i + Y i, X Z i + ρ X Z i F = Q ψ X Z i, S i. A. for all i I I c. Hence, for all i I X i+1 = argmin X L ρ X, Z i, S i ; Y i = argmin X Q ψ X Z i, S i. Thus, for all i I, setting X 0 := X i in Lemma A. imply p φ xx i = Z i, p φ s X i = S i and p ψ p φ xx i, p φ s X i = p ψ Z i, S i = X i+1. For all i I we have ΦX i+1, S i L ρ X i+1, Z i, S i ; Y i = Q ψ X i+1 Z i, S i. Hence, for all i I setting X 0 := X i in Lemma A. satisfies A.7. Therefore, setting X, S := X, S and X 0 := X i in Lemma A. implies that ρ ΦX, S ΦX i+1, S i X i+1 X F Z i X F. 1 A.3

22 For any i I, summing A.19 and A.3 gives ρ ΦX, S ΦX i+1, S i ΦZ i+1, S i+1 Z i+1 X F Z i X F. A.4 Moreover, since X i+1 = Z i for i I c and A.19 holds for all i I I c, we trivially have ρ ΦX, S ΦZ i+1, S i+1 Z i+1 X F Z i X F. Summing A.4 and A.5 over i = 0, 1,..., 1 gives I + I c ΦX, S ΦX i+1, S i ρ i I 1 ΦZ i+1, S i+1 Z X F Z 0 X F. i=0 A.5 A.6 For any i I I c, setting X, S := X i+1, S i and X 0 := X i+1 in Lemma A. gives ρ ΦX i+1, S i ΦZ i+1, S i+1 Z i+1 X i+1 F 0. A.7 Trivially, for i = 1,..., we also have ρ ΦX i, S i 1 ΦZ i, S i Z i X i F 0. A.8 Moreover, since for all i I setting X 0 := X i in Lemma A. satisfies A.7, setting X, S := Z i, S i and X 0 := X i in Lemma A. implies that ρ ΦZ i, S i ΦX i+1, S i X i+1 Z i F 0. A.9 And since X i+1 = Z i for all i I c, A.9 trivially holds for all i I c. Thus, for all i I I c we have ρ ΦZ i, S i ΦX i+1, S i 0. A.30 Adding A.7 and A.30 yields ΦZ i, S i ΦZ i+1, S i+1 for all i I I c and adding A.8 and A.30 yields ΦX i, S i 1 ΦX i+1, S i for all i = 1,..., 1. Hence, 1 i=0 ΦZ i+1, S i+1 ΦZ, S, and i I ΦX i+1, S i n ΦX, S 1. A.31 These two inequalities, A.6 and the fact that X 0 = Z 0 imply ρ I + Ic ΦX, S n ΦX, S 1 ΦZ, S X 0 X F. A.3 Hence, 4.4 follows from the facts: I + I c = +n and n ΦX, S 1 +ΦZ, S +n ΦZ, S due to A.7. Appendix B. Proof of Lemma 5.1. Proof. Since Y and θ are optimal Lagrangian dual variables, we have X, X, S = argmin X + ξ S 1 + Y, X Z + θ Z + S D X,Z,S F δ.

Robust Principal Component Analysis Based on Low-Rank and Block-Sparse Matrix Decomposition

Robust Principal Component Analysis Based on Low-Rank and Block-Sparse Matrix Decomposition Robust Principal Component Analysis Based on Low-Rank and Block-Sparse Matrix Decomposition Gongguo Tang and Arye Nehorai Department of Electrical and Systems Engineering Washington University in St Louis

More information

First Order Methods for Large-Scale Sparse Optimization. Necdet Serhat Aybat

First Order Methods for Large-Scale Sparse Optimization. Necdet Serhat Aybat First Order Methods for Large-Scale Sparse Optimization Necdet Serhat Aybat Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School of Arts and

More information

Robust Principal Component Analysis

Robust Principal Component Analysis ELE 538B: Mathematics of High-Dimensional Data Robust Principal Component Analysis Yuxin Chen Princeton University, Fall 2018 Disentangling sparse and low-rank matrices Suppose we are given a matrix M

More information

Robust PCA. CS5240 Theoretical Foundations in Multimedia. Leow Wee Kheng

Robust PCA. CS5240 Theoretical Foundations in Multimedia. Leow Wee Kheng Robust PCA CS5240 Theoretical Foundations in Multimedia Leow Wee Kheng Department of Computer Science School of Computing National University of Singapore Leow Wee Kheng (NUS) Robust PCA 1 / 52 Previously...

More information

Optimization for Learning and Big Data

Optimization for Learning and Big Data Optimization for Learning and Big Data Donald Goldfarb Department of IEOR Columbia University Department of Mathematics Distinguished Lecture Series May 17-19, 2016. Lecture 1. First-Order Methods for

More information

ACCELERATED LINEARIZED BREGMAN METHOD. June 21, Introduction. In this paper, we are interested in the following optimization problem.

ACCELERATED LINEARIZED BREGMAN METHOD. June 21, Introduction. In this paper, we are interested in the following optimization problem. ACCELERATED LINEARIZED BREGMAN METHOD BO HUANG, SHIQIAN MA, AND DONALD GOLDFARB June 21, 2011 Abstract. In this paper, we propose and analyze an accelerated linearized Bregman (A) method for solving the

More information

Linearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation

Linearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation Linearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation Zhouchen Lin Visual Computing Group Microsoft Research Asia Risheng Liu Zhixun Su School of Mathematical Sciences

More information

An ADMM Algorithm for Clustering Partially Observed Networks

An ADMM Algorithm for Clustering Partially Observed Networks An ADMM Algorithm for Clustering Partially Observed Networks Necdet Serhat Aybat Industrial Engineering Penn State University 2015 SIAM International Conference on Data Mining Vancouver, Canada Problem

More information

Dense Error Correction for Low-Rank Matrices via Principal Component Pursuit

Dense Error Correction for Low-Rank Matrices via Principal Component Pursuit Dense Error Correction for Low-Rank Matrices via Principal Component Pursuit Arvind Ganesh, John Wright, Xiaodong Li, Emmanuel J. Candès, and Yi Ma, Microsoft Research Asia, Beijing, P.R.C Dept. of Electrical

More information

Dual Proximal Gradient Method

Dual Proximal Gradient Method Dual Proximal Gradient Method http://bicmr.pku.edu.cn/~wenzw/opt-2016-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes Outline 2/19 1 proximal gradient method

More information

Sparsity Regularization
