Robust Principal Component Analysis: Exact Recovery of Corrupted Low-Rank Matrices via Convex Optimization


John Wright, Arvind Ganesh, Shankar Rao, and Yi Ma
Department of Electrical Engineering, University of Illinois at Urbana-Champaign
Visual Computing Group, Microsoft Research Asia

Abstract. Principal component analysis is a fundamental operation in computational data analysis, with myriad applications ranging from web search to bioinformatics to computer vision and image analysis. However, its performance and applicability in real scenarios are limited by a lack of robustness to outlying or corrupted observations. This paper considers the idealized robust principal component analysis problem of recovering a low-rank matrix A from corrupted observations D = A + E. Here, the error entries E can be arbitrarily large (modeling grossly corrupted observations common in visual and bioinformatic data), but are assumed to be sparse. We prove that most matrices A can be efficiently and exactly recovered from most error sign-and-support patterns, by solving a simple convex program. Our result holds even when the rank of A grows nearly proportionally (up to a logarithmic factor) to the dimensionality of the observation space and the number of errors in E grows in proportion to the total number of entries in the matrix. A by-product of our analysis is the first proportional growth results for the related but somewhat easier problem of completing a low-rank matrix from a small fraction of its entries. We propose an algorithm based on iterative thresholding that, for large matrices, is significantly faster and more scalable than general-purpose solvers. We give simulations and real-data examples corroborating the theoretical results.

Corresponding author: John Wright, Room 146 Coordinated Science Laboratory, 1308 West Main Street, Urbana, IL. Email: jnwright@uiuc.edu

1 Introduction

The problem of finding and exploiting low-dimensional structure in high-dimensional data is taking on increasing importance in image, audio and video processing, web search, and bioinformatics, where datasets now routinely lie in thousand- or even million-dimensional observation spaces. The curse of dimensionality is in full play here: meaningful inference with a limited number of observations requires some assumption that the data have low intrinsic complexity, e.g., that they are low-rank [15], sparse in some basis [10], or lie on some low-dimensional manifold [7, ]. Perhaps the simplest useful assumption is that the observations all lie near some low-dimensional subspace. In other words, if we stack all the observations as column vectors of a matrix M ∈ R^{m×n}, the matrix should be (approximately) low rank. Principal component analysis (PCA) [15, 20] seeks the best (in an ℓ²-sense) such low-rank representation of the given data matrix. It enjoys a number of optimality properties when the data are only mildly corrupted by small noise, and can be stably and efficiently computed via the singular value decomposition.

One major shortcoming of classical PCA is its brittleness with respect to grossly corrupted or outlying observations [20]. Gross errors are ubiquitous in modern applications in imaging and bioinformatics, where some measurements may be arbitrarily corrupted (e.g., due to occlusion or sensor failure) or simply irrelevant to the structure we are trying to identify. A number of natural approaches to robustifying PCA have been explored in the literature, including influence function techniques [19, 8], multivariate trimming [18], alternating minimization [1], and random sampling techniques [16]. Unfortunately, none of these existing approaches yields a polynomial-time algorithm with strong performance guarantees.¹

In this paper, we consider an idealization of the robust PCA problem, in which the goal is to recover a low-rank matrix A from highly corrupted measurements D = A + E.
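The brittleness of classical PCA under gross corruption is easy to reproduce numerically. The following sketch is not from the paper; the matrix sizes, corruption fraction, and error magnitude are illustrative choices. It compares the best rank-r approximation (truncated SVD) of a clean and a sparsely corrupted data matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
m, r = 200, 5

# Low-rank data matrix A0 and a sparse matrix of gross errors E
A0 = rng.standard_normal((m, r)) @ rng.standard_normal((r, m))
support = rng.random((m, m)) < 0.01            # ~1% of entries corrupted
E = 100.0 * support * rng.choice([-1.0, 1.0], size=(m, m))
D = A0 + E

def pca_rank(M, r):
    """Best rank-r approximation in the least-squares sense (classical PCA)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

clean_err = np.linalg.norm(pca_rank(A0, r) - A0) / np.linalg.norm(A0)
corrupt_err = np.linalg.norm(pca_rank(D, r) - A0) / np.linalg.norm(A0)
print(f"rank-{r} PCA relative error, clean data:     {clean_err:.2e}")
print(f"rank-{r} PCA relative error, corrupted data: {corrupt_err:.2e}")
```

On this instance the clean reconstruction is exact to machine precision, while even ~1% gross corruption leaves a large relative error, illustrating the failure mode the paper addresses.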
The errors E can be arbitrary in magnitude, but are assumed to be sparsely supported, affecting only a fraction of the entries of D. This should be contrasted with the classical setting, in which the matrix A is perturbed by small (but densely supported) noise. In that setting, classical PCA, computed via the singular value decomposition, remains optimal if the noise is Gaussian. Here, on the other hand, even a small fraction of large errors can cause arbitrary corruption in PCA's estimate of the low-rank structure, A.

Our approach to robust PCA is motivated by two recent, and tightly related, lines of research. The first set of results concerns the robust solution of over-determined linear systems of equations in the presence of arbitrary, but sparse, errors. These results imply that for generic systems of equations, it is possible to correct a constant fraction of arbitrary errors in polynomial time [7]. This is achieved by employing the ℓ¹-norm as a convex surrogate for the highly non-convex ℓ⁰-norm. A parallel (and still emerging) line of work concerns the problem of computing low-rank matrix solutions to underdetermined linear equations [6, 26]. One of the most striking results concerns the exact completion of low-rank matrices from only a small fraction of their entries [6, 8, 5]. There, a similar convex relaxation is employed, replacing the highly non-convex matrix rank with the nuclear norm (or sum of singular values). The robust PCA problem outlined above combines aspects of both of these lines of work: we wish

¹ Random sampling approaches guarantee near-optimal estimates, but have complexity exponential in the rank of the matrix A_0. Trimming algorithms have comparatively lower computational complexity, but guarantee only locally optimal solutions.
² A major difference between robust PCA and low-rank matrix completion is that here we do not know which entries are corrupted, whereas in matrix completion the support of the missing entries is given.

to recover a low-rank matrix from large but sparse errors. We will show that combining the solutions to the above problems (nuclear norm minimization for low-rank recovery and ℓ¹-minimization for error correction) yields a polynomial-time algorithm for robust PCA that provably succeeds under broad conditions:

With high probability, solving a simple convex program perfectly recovers a generic matrix A ∈ R^{m×m} of rank as large as C m / log(m), from errors affecting a constant fraction of the entries.

Notice that the conclusion holds with high probability only when the dimensionality m is high enough. Thus, we will not be able to truly witness the striking ability of the convex program in recovering the low-rank and sparse structures from the data unless m is large enough. This phenomenon has been dubbed the "blessing of dimensionality" [1]. Thus, we need scalable algorithms to solve the associated convex program. To this end, we show how a near-solution to this convex program can be obtained relatively efficiently via iterative thresholding techniques, similar to those proposed for matrix completion [4]. For large matrices, this algorithm is significantly faster and more scalable than general-purpose convex program solvers.

Finally, our proof implies strong results for the low-rank matrix completion problem, including the first results applicable to the proportional growth setting, where the rank of the matrix grows as a constant (non-vanishing) fraction of the dimensionality:

With overwhelming probability, solving a simple convex program perfectly recovers a generic matrix A ∈ R^{m×m} of rank as large as C m, from observations consisting of only a fraction ρ (ρ < 1) of its entries.

1.1 Organization of the paper

This paper is organized as follows. Section 2 formulates the robust principal component analysis problem more precisely and states the main results of this paper, placing these results in the context of existing work.
Section 3 begins the proof of the main result by providing sufficient conditions for a desired pair (A, E) to be recoverable by convex programming. These conditions will be stated in terms of the existence of a dual vector satisfying certain constraints. Section 4 then shows that with high probability such a certifying dual vector indeed exists. This establishes our main result. Section 5 explores the implications of this analysis for the related problem of completing a low-rank matrix from an observation consisting of only a small subset of its entries. Section 6 gives a simple but fast algorithm for approximately solving the robust PCA problem that extends existing iterative thresholding techniques, and is more efficient and scalable than generic interior-point methods. In Section 7, we perform simulations and experiments corroborating the theoretical results and suggesting their applicability to real-world problems in computer vision. Finally, in Section 8, we outline several promising directions for future work.

2 Problem Setting and Main Results

We assume that the observed data matrix D ∈ R^{m×n} was generated from a low-rank matrix A ∈ R^{m×n}, by corrupting some of its entries. The corruption can be represented as an additive error E ∈ R^{m×n}, so that D = A + E. Because the error affects only a portion of the entries of D, E is a sparse matrix. The idealized (or noise-free) robust PCA problem can then be formulated as follows:

Problem 2.1 (Robust PCA). Given D = A + E, where A and E are unknown, but A is known to be low rank and E is known to be sparse, recover A.

This problem formulation immediately suggests a conceptual solution: seek the lowest-rank A that could have generated the data, subject to the constraint that the errors are sparse: ||E||_0 ≤ k. The Lagrangian reformulation of this optimization problem is

min_{A,E} rank(A) + γ ||E||_0 subj A + E = D. (1)

If we could solve this problem for appropriate γ, we might hope to exactly recover the pair (A_0, E_0) that generated the data D. Unfortunately, (1) is a highly nonconvex optimization problem, and no efficient solution is known.³ We can obtain a tractable optimization problem by relaxing (1), replacing the ℓ⁰-norm with the ℓ¹-norm, and the rank with the nuclear norm ||A||_* = sum_i σ_i(A), yielding the following convex surrogate:

min_{A,E} ||A||_* + λ ||E||_1 subj A + E = D. (2)

This relaxation can be motivated by observing that ||A||_* + λ ||E||_1 is the convex envelope of rank(A) + λ ||E||_0 over the set of (A, E) such that ||A||_{2,2} + ||E||_{1,∞} ≤ 1. Moreover, recent advances in our understanding of the nuclear norm heuristic for low-rank solutions to matrix equations [6, 26] and the ℓ¹ heuristic for sparse solutions to underdetermined linear systems [7, 13] suggest that there might be circumstances under which solving the tractable problem (2) perfectly recovers the low-rank matrix A_0. The main result of this paper will be to show that this is indeed true under surprisingly broad conditions. A sketch of the result is as follows:

For almost all pairs (A_0, E_0) consisting of a low-rank matrix A_0 and a sparse matrix E_0,

(A_0, E_0) = arg min_{A,E} ||A||_* + λ ||E||_1 subj A + E = A_0 + E_0,

and the minimizer is uniquely defined.

That is, under natural probabilistic models for low-rank and sparse matrices, almost all observations D = A_0 + E_0 generated as the sum of a low-rank matrix A_0 and a sparse matrix E_0 can be efficiently and exactly decomposed into their generating parts by solving a convex program.⁴
Of course, this is only possible with an appropriate choice of the regularizing parameter λ > 0. From the optimality conditions for the convex program (2), it is not difficult to show that for matrices D ∈ R^{m×m}, the correct scaling is λ = O(1/√m). Throughout this paper, unless otherwise stated, we will fix

λ = 1/√m. (3)

³ In a sense, this problem subsumes both the low-rank matrix completion problem and the ℓ⁰-minimization problem, both of which are NP-hard and hard to approximate. While we are not aware of any formal hardness result for (1), it is reasonable to conjecture that it is similarly difficult.

⁴ Notice that this is not an equivalence result for the intractable optimization (1) and the convex program (2): rather than asserting that the solutions of these two problems are equal with high probability, we directly prove that the convex program correctly decomposes D = A_0 + E_0 into (A_0, E_0). A natural conjecture, however, is that under the conditions of our main result, Theorem 2.4, (A_0, E_0) is also the solution to (1) for some choice of γ.
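For intuition, the convex program (2) can be approximately solved with generic operator-splitting methods. The sketch below is an ADMM-style iteration combining singular value thresholding (for the nuclear norm term) with entrywise soft thresholding (for the ℓ¹ term). It is not the iterative thresholding algorithm of Section 6, and the penalty-parameter heuristic `mu` is an assumption made here for illustration:

```python
import numpy as np

def soft(X, t):
    """Entrywise soft thresholding: shrink each entry toward zero by t."""
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)

def rpca(D, lam=None, mu=None, iters=500):
    """ADMM-style sketch for  min ||A||_* + lam ||E||_1  subj  A + E = D.

    lam defaults to the 1/sqrt(m) scaling of (3); mu is a penalty parameter
    (the heuristic below is an assumption, not taken from the paper).
    """
    m = D.shape[0]
    if lam is None:
        lam = 1.0 / np.sqrt(m)
    if mu is None:
        mu = 0.25 * D.size / (np.abs(D).sum() + 1e-12)
    E = np.zeros_like(D)
    Y = np.zeros_like(D)                  # dual variable for A + E = D
    for _ in range(iters):
        # A-step: singular value thresholding of D - E + Y/mu
        U, s, Vt = np.linalg.svd(D - E + Y / mu, full_matrices=False)
        A = (U * soft(s, 1.0 / mu)) @ Vt
        # E-step: entrywise soft thresholding
        E = soft(D - A + Y / mu, lam / mu)
        # dual ascent on the equality constraint
        Y += mu * (D - A - E)
    return A, E
```

On a synthetic D = A_0 + E_0 with small rank and error fraction, a few hundred iterations of this scheme typically recover A_0 to high accuracy, consistent with Theorem 2.4's prediction that the minimizer of (2) is exactly (A_0, E_0).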

For simplicity, all of our results in this paper will be stated for square matrices D ∈ R^{m×m}, although there is little difficulty in extending them to non-square matrices.

It should be clear that not all matrices A_0 can be successfully recovered by solving the convex program (2). Consider, e.g., the rank-1 case where U = [e_i] and V = [e_j]. Without additional prior knowledge, the low-rank matrix A = USV^* cannot be recovered from even a single gross error. We therefore restrict our attention to matrices A_0 whose row and column spaces are not aligned with the standard basis. This can be done probabilistically, by asserting that the marginal distributions of U and V are uniform on the Stiefel manifold W^m_r. In [6], this was called the random orthogonal model.

Definition 2.2 (Random orthogonal model [6]). We consider a matrix A_0 to be distributed according to the random orthogonal model of rank r if its left and right singular vectors are independent uniformly distributed m × r matrices with orthonormal columns.⁵ In this model, the nonzero singular values of A_0 can be arbitrary.

Our model for errors is similarly natural: each entry of the matrix is independently corrupted with some probability ρ_s, and the signs of the corruptions are independent Rademacher random variables.

Definition 2.3 (Bernoulli error signs and support). We consider an error matrix E_0 to be drawn from the Bernoulli sign and support model with parameter ρ_s if the entries of sign(E_0) are independently distributed, each taking on value 0 with probability 1 − ρ_s, and ±1 with probability ρ_s/2 each. In this model, the magnitude of the nonzero entries in E_0 can be arbitrary.

Our main result is the following:

Theorem 2.4 (Robust recovery from non-vanishing error fractions).
For any p > 0, there exist constants (C_0 > 0, ρ_s* > 0, m_0) with the following property: if m > m_0 and (A_0, E_0) ∈ R^{m×m} × R^{m×m}, with the singular spaces of A_0 distributed according to the random orthogonal model of rank

r ≤ C_0 m / log(m), (4)

and the signs and support of E_0 distributed according to the Bernoulli sign-and-support model with error probability ρ_s ≤ ρ_s*, then with probability at least 1 − C m^{−p},

(A_0, E_0) = arg min_{A,E} ||A||_* + (1/√m) ||E||_1 subj A + E = A_0 + E_0, (5)

and the minimizer is uniquely defined.

In other words, matrices A_0 whose singular spaces are distributed according to the random orthogonal model can, with probability approaching one, be efficiently recovered from almost all corruption sign and support patterns, without prior knowledge of the pattern of corruption. Our line of analysis also implies strong results for the matrix completion problem studied in [6, 8, 5].

⁵ That is, the left and right singular vectors are independent samples from the Haar measure on the Stiefel manifold W^m_r.
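The two probabilistic models of Definitions 2.2 and 2.3 are straightforward to sample. A sketch follows; the singular value distribution and the fixed error magnitude are arbitrary illustrative choices, since both definitions leave them free:

```python
import numpy as np

def haar_stiefel(m, r, rng):
    """Uniform (Haar) sample from the Stiefel manifold W^m_r: orthonormalize
    a Gaussian matrix, fixing the sign convention of the QR factorization."""
    Q, R = np.linalg.qr(rng.standard_normal((m, r)))
    return Q * np.sign(np.diag(R))

def random_orthogonal_model(m, r, rng):
    """A0 per Definition 2.2: independent Haar-distributed singular spaces.
    The nonzero singular values may be arbitrary; here drawn from [1, 2]."""
    U, V = haar_stiefel(m, r, rng), haar_stiefel(m, r, rng)
    s = 1.0 + rng.random(r)
    return (U * s) @ V.T

def bernoulli_errors(m, rho_s, rng, magnitude=10.0):
    """E0 per Definition 2.3: each entry nonzero with probability rho_s,
    carrying an independent Rademacher sign. The fixed magnitude is an
    illustrative choice; the model allows arbitrary magnitudes."""
    support = rng.random((m, m)) < rho_s
    return magnitude * support * rng.choice([-1.0, 1.0], size=(m, m))

rng = np.random.default_rng(2)
A0 = random_orthogonal_model(200, 10, rng)
E0 = bernoulli_errors(200, 0.05, rng)
D = A0 + E0
```

Instances generated this way are the natural test cases for the recovery claims of Theorems 2.4 and 2.5.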

Theorem 2.5 (Matrix completion in proportional growth). There exist numerical constants m_0, ρ_r*, ρ_s*, C, all > 0, with the following property: if m > m_0 and A_0 ∈ R^{m×m} is distributed according to the random orthogonal model of rank

r ≤ ρ_r* m, (6)

and Υ ⊆ [m] × [m] is an independently chosen subset of [m] × [m] in which the inclusion of each pair (i, j) is an independent Bernoulli(1 − ρ_s) random variable with ρ_s ≤ ρ_s*, then with probability at least 1 − exp(−C m),

A_0 = arg min_A ||A||_* subj A(i, j) = A_0(i, j) for all (i, j) ∈ Υ, (7)

and the minimizer is uniquely defined.

Finally, in Section 6 we extend existing iterative thresholding techniques for solving equality-constrained ℓ¹-norm minimization problems [31] and nuclear norm minimization problems [4] to give an algorithm that produces a near-solution to (2) more efficiently and scalably than off-the-shelf interior-point methods.

2.1 Relationship to existing work

During the preparation of this manuscript, we were informed of results by [9] analyzing the same convex program. That paper's main result shows that under the probabilistic model studied here, exact recovery can be guaranteed with high probability when

||E_0||_0 ≤ C m^{1.5} / (log(m) max(r, log m)). (8)

While this is an interesting result, even for constant rank r it guarantees correction of only a vanishing fraction o(m^{1.5}) of errors. In contrast, our main result, Theorem 2.4, states that even if r grows proportionally to m/log(m), non-vanishing fractions of errors are corrected with high probability. Both analyses start from the optimality condition for the convex program (2). The key technical component of this improved result is a probabilistic analysis of an iterative refinement technique for producing a dual vector that certifies optimality of the pair (A_0, E_0). This proof technique is related to techniques used in [7, 30], with additional care required to handle an operator norm constraint arising from the presence of the nuclear norm in (2).
Finally, while Theorem 2.5 is merely a byproduct of our analysis and not the main focus of this paper, it is interesting in light of results by [8]. That work proves that in the probabilistic model considered here, a generic rank-r matrix can, with high probability, be efficiently and exactly completed from a subset of only

C m r log⁸(m) (9)

entries. When r > C m / log⁸(m), this bound exceeds the number m² of possible observations. In contrast, our Theorem 2.5 implies that for certain scenarios with r as large as ρ_r* m, the matrix can be completed from a subset of (1 − ρ_s) m² entries. For matrices of relatively large rank, this is a significant extension of [8]. Our result does not supersede (9) for smaller ranks, however. These results are further contrasted in Section 5.

3 Analysis Framework

In this section, we begin our analysis of the semidefinite program (2). After introducing notation, we provide two sufficient conditions for a pair (A, E) to be the unique optimal solution to (2). The first condition, given in Lemma 3.1 below, is stated in terms of the existence of a dual vector W that certifies optimality of the pair (A, E). The existence (or non-existence) of such a dual vector is itself a random convex programming feasibility problem, and so the condition in this lemma is not directly amenable to analysis. The next step, outlined in Lemma 3.3, is to show that if there exists some W_0 that does not violate the constraints of this feasibility problem too badly, then it can be refined to produce a W that satisfies the constraints and certifies optimality of (A, E). The probabilistic analysis in the following Section 4 will then show that, with high probability, such a W_0 exists.

3.1 Notation

For any n ∈ Z_+, [n] = {1, ..., n}. For M ∈ R^{m×m} and I, J ⊆ [m], M_{I,J} will denote the submatrix of M consisting of those rows indexed by I and those columns indexed by J. We will use • as shorthand for the entire index set: M_{I,•} is the submatrix consisting of those rows indexed by I. M^* will denote the transpose of M. For matrices P, Q ∈ R^{m×n}, ⟨P, Q⟩ := trace[P^* Q] will denote the (Frobenius) inner product. The symbol I will denote the identity matrix or identity operator on matrices; the distinction will be clear from context.

||M||_{p,q} will denote the operator norm of the matrix M, as a linear map between ℓ^p and ℓ^q. Important special cases are the spectral norm ||M||_{2,2}, the max row- and column norms

||M||_{∞,∞} = max_i ||M_{i,•}||_1 and ||M||_{1,1} = max_j ||M_{•,j}||_1, (10)

and the max element norm

||M||_{1,∞} = max_{i,j} |M_{ij}|. (11)

We will also often reason about linear operators on matrices. If L : R^{m×m} → R^{m×m} is a linear map, ||L||_{F,F} will denote its operator norm with respect to the Frobenius norm on matrices:

||L||_{F,F} = sup_{M ∈ R^{m×m} \ {0}} ||L[M]||_F / ||M||_F. (12)

If S is a subspace, π_S will denote the projection operator onto that subspace.
For U ∈ R^{m×r}, π_U will denote the projection operator onto the range of U. Finally, if I ⊆ [m], π_I := sum_{i∈I} e_i e_i^* will denote the projection onto those coordinates indexed by I.

We will let

Ω := {(i, j) | E_{0,ij} ≠ 0} (13)

denote the set of corrupted entries. By slight abuse of notation, we will identify Ω with the subspace {M | M_{ij} = 0 for all (i, j) ∈ Ω^c}, so π_Ω will denote the projection onto the corrupted elements. We will let Σ ∈ R^{m×m} be an iid Rademacher (Bernoulli ±1) matrix that defines the signs of the corrupted entries:

sign(E_0) = π_Ω[Σ]. (14)

Expressing sign(E_0) in this way allows us to exploit independence between Ω and Σ.
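These projections have simple matrix expressions when U and V have orthonormal columns. A sketch follows; π_Θ is computed via the identity π_Θ[M] = π_U M + M π_V − π_U M π_V spelled out in Section 4.1, and the boolean-mask representation of Ω is an implementation choice:

```python
import numpy as np

def proj_omega(M, omega):
    """pi_Omega: keep the entries on the corrupted support Omega, zero the
    rest. omega is a boolean mask of the support."""
    return np.where(omega, M, 0.0)

def proj_theta(M, U, V):
    """pi_Theta: project onto matrices with column space in range(U) or row
    space in range(V). U and V are assumed to have orthonormal columns."""
    PU = U @ U.T               # orthogonal projector onto range(U)
    PV = V @ V.T               # orthogonal projector onto range(V)
    return PU @ M + M @ PV - PU @ M @ PV
```

Both maps are orthogonal projections, so applying either twice gives the same result as applying it once; the quantity ||π_Ω π_Θ||_{F,F} appearing below can then be estimated numerically, e.g., by power iteration on the composed map.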

The symbol ⊗ will denote the Kronecker product between matrices. vec will denote the operator that vectorizes a matrix by stacking the columns. For M ∈ R^{m×m}, this operator stacks the entries M_ij in lexicographic order. For x = vec[M] ∈ R^{m²}, we write x_Ω as shorthand for x_{L(Ω)}, where L(Ω) = {jm + i | (i, j) ∈ Ω} ⊆ [m²], and as further shorthand, we will occasionally use vec_Ω[M] to denote [vec[M]]_Ω.

In both our analysis and optimization algorithm, we will make frequent use of the soft thresholding operator, which we denote S_γ:

S_γ[x] = { 0, |x| ≤ γ; sign(x)(|x| − γ), |x| > γ. (15)

For matrices X, S_γ[X] will denote the matrix obtained by applying S_γ to each element X_ij. Finally, we often find it convenient to use the notation

S_γ^{Ω^c}[X] := π_{Ω^c}[S_γ[X]]. (16)

That is, S_γ^{Ω^c} applies the soft threshold to X but only retains those elements not in Ω.

Throughout, we will use the symbol W^m_r to refer to the Stiefel manifold of matrices U ∈ R^{m×r} with orthonormal columns: U^*U = I_{r×r}. We will use SO(r) to refer to the special orthogonal group of r × r matrices R such that R^*R = RR^* = I_{r×r} and det(R) = 1.

3.2 Optimality conditions for (A_0, E_0)

We begin with a simple sufficient condition for the pair (A_0, E_0) to be the unique optimal solution to (2). These conditions are stated in terms of a dual vector, the existence of which certifies optimality. Our analysis will then show that under the stated circumstances, such a dual certificate can be produced with high probability.

Lemma 3.1. Let (A_0, E_0) ∈ R^{m×m} × R^{m×m}, with

Ω := supp(E_0) ⊆ [m] × [m]. (17)

Let A_0 = USV^* denote the compact singular value decomposition of A_0, and let Θ denote the subspace generated by matrices with column space range(U) or row space range(V):

Θ := {UM^* | M ∈ R^{m×r}} + {MV^* | M ∈ R^{m×r}} ⊆ R^{m×m}. (18)

Suppose that ||π_Ω π_Θ||_{F,F} < 1 and there exists W ∈ R^{m×m} such that

[UV^* + W]_{ij} = λ sign(E_{0,ij}) for all (i, j) ∈ Ω,
|[UV^* + W]_{ij}| < λ for all (i, j) ∈ Ω^c,
U^*W = 0, WV = 0,
||W||_{2,2} < 1. (19)

Then the pair (A_0, E_0) is the unique optimal solution to (2).

Proof. We show that if a Lagrange multiplier vector Y
:= UV^* + W satisfying this system exists, then any nonzero perturbation A_0 → A_0 + Δ, E_0 → E_0 − Δ respecting the constraint A + E = A_0 + E_0

will increase the objective function. By convexity, we may restrict our interest to Δ satisfying ||Δ||_{1,∞} < min_{(i,j)∈Ω} |E_{0,ij}|. For such Δ,

λ ||E_0 − Δ||_1 − λ ||E_0||_1 = λ sum_{(i,j)∈Ω} (−sign(E_{0,ij}) Δ_{ij}) + λ sum_{(i,j)∉Ω} |Δ_{ij}| ≥ −⟨Y, Δ⟩,

with equality if and only if π_{Ω^c}[Δ] = 0 (since on this set |Y_{ij}| is strictly smaller than λ). Similarly, since Y = UV^* + W is a subgradient of the nuclear norm at A_0,

||A_0 + Δ||_* ≥ ||A_0||_* + ⟨Y, Δ⟩. (20)

Moreover, since ||A_0||_* = ⟨UV^*, A_0⟩ = ⟨Y, A_0⟩, equality in (20) gives ||A_0 + Δ||_* = ⟨Y, A_0⟩ + ⟨Y, Δ⟩ = ⟨Y, A_0 + Δ⟩. Thus, by Lemma 3.2 below, if equality holds in (20), then Δ ∈ Θ. Summing the two subgradient bounds, we have

||A_0 + Δ||_* + λ ||E_0 − Δ||_1 ≥ ||A_0||_* + λ ||E_0||_1,

with equality only if Δ ∈ Ω ∩ Θ. If ||π_Ω π_Θ||_{F,F} < 1, then Ω ∩ Θ = {0}, and so either Δ = 0 or

||A_0 + Δ||_* + λ ||E_0 − Δ||_1 > ||A_0||_* + λ ||E_0||_1.

Lemma 3.2. Consider P ∈ R^{m×m} with ||P||_{2,2} = 1 and σ_min(P) < 1. Write the full singular value decomposition of P as

P = [U_1 U_2] [ I 0 ; 0 Σ ] [V_1 V_2]^*,

where ||Σ||_{2,2} < 1. Let Q ∈ R^{m×m} have reduced singular value decomposition Q = U Σ_Q V^*. Then if ⟨P, Q⟩ = ||Q||_*, we have U_2^* U = 0 and V_2^* V = 0.

Proof. Let r denote the rank of Q. Then

⟨P, Q⟩ = ⟨ [ I 0 ; 0 Σ ], [U_1 U_2]^* U Σ_Q V^* [V_1 V_2] ⟩.

Let Y, Z ∈ R^{m×r} denote the matrices Y := [U_1 U_2]^* U and Z := [V_1 V_2]^* V, and notice that both Y and Z have orthonormal columns. Then, after the above rotation,

⟨P, Q⟩ = sum_{i=1}^{m} sum_{j=1}^{r} σ_i(P) σ_j(Q) Y_{ij} Z_{ij} = sum_{j=1}^{r} σ_j(Q) sum_{i=1}^{m} σ_i(P) Y_{ij} Z_{ij}. (21)

For all j,

sum_{i=1}^{m} σ_i(P) Y_{ij} Z_{ij} ≤ || [ I 0 ; 0 Σ ] Z_{•,j} || · ||Y_{•,j}|| ≤ 1,

with equality if and only if Y_{•,j} = Z_{•,j} and Z_{ij} = 0 whenever σ_i(P) < 1. Since ⟨P, Q⟩ = ||Q||_* forces each of the sums sum_i σ_i(P) Y_{ij} Z_{ij} to be one, this implies that U_2^* U and V_2^* V are both zero.

Thus, we can guarantee that (A_0, E_0) is optimal with high probability by asserting that a certain random convex feasibility problem is satisfiable with high probability. While there are potentially many W satisfying these constraints, most of them lack an explicit expression. As has proven fruitful in a variety of related problems, we will instead consider a putative dual vector W_0 that does have an explicit expression: the minimum Frobenius norm solution to the equality constraints in (19).
We will see that this vector already satisfies the operator norm constraint with high probability. However, the box constraint due to sign(E_0) is likely to be violated. We then describe an iterative procedure that, with high probability, fixes the box constraint while respecting the equality and operator norm constraints. The output of this iteration is the desired certifying dual vector.
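The operators (15) and (16) used by this construction are one-liners; a sketch, representing Ω as a boolean mask:

```python
import numpy as np

def soft_threshold(X, gamma):
    """S_gamma of (15): zero entries with |x| <= gamma, shrink the rest
    toward zero by gamma."""
    return np.sign(X) * np.maximum(np.abs(X) - gamma, 0.0)

def soft_threshold_off_support(X, gamma, omega):
    """S_gamma^{Omega^c} of (16): soft threshold X, then retain only the
    entries outside the corrupted support Omega (a boolean mask)."""
    return np.where(omega, 0.0, soft_threshold(X, gamma))
```

In the iteration of Lemma 3.3 below, this restricted operator extracts the violations of the box constraint, which shrink geometrically when the restricted operator norm ξ_c is smaller than one.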

3.3 Iterative construction of the dual certificate

In constructing the dual certificate, we will use the fact that the violations of the inequality constraints, viewed as a matrix, often have sparse rows and sparse columns. To formalize this, fix a small constant c ∈ (0, 1], and define

Ψ_c := { M ∈ R^{m×m} | ||M_{•,j}||_0 ≤ cm for all j, ||M_{i,•}||_0 ≤ cm for all i, π_Ω[M] = 0 }. (22)

We will show that such matrices are near the nullspace of the equality constraints in (19). The equality constraints in (19) can be expressed as π_Ω[W] = π_Ω[λ sign(E_0) − UV^*] and π_Θ[W] = 0. We will let Γ denote the orthogonal complement of the nullspace of these constraints:

Γ := Θ + Ω. (23)

Let ξ_c denote the operator norm of π_Γ, with respect to the Frobenius norm on R^{m×m}, restricted to Ψ_c:

ξ_c := sup_{M ∈ Ψ_c \ {0}} ||π_Γ M||_F / ||M||_F ∈ [0, 1]. (24)

In Section 4.3, we establish a useful probabilistic upper bound for this quantity. Before investigating these properties, we show that if ξ_c and W_0 are well controlled, we can find a dual vector certifying optimality of (A_0, E_0).

Lemma 3.3 (Iterative construction of a dual certificate). Suppose that for some c ∈ (0, 1) and ε > 0, there exists W_0 satisfying π_Θ[W_0] = 0 and π_Ω[W_0] = π_Ω[λ sign(E_0) − UV^*], such that

||UV^* + W_0||_{1,1} + (1/(1 − ξ_c)) ||S^{Ω^c}_{λ−ε}(UV^* + W_0)||_F ≤ (λ − ε) cm, (25)

||UV^* + W_0||_{∞,∞} + (1/(1 − ξ_c)) ||S^{Ω^c}_{λ−ε}(UV^* + W_0)||_F ≤ (λ − ε) cm, (26)

and

||W_0||_{2,2} + (1/(1 − ξ_c)) ||S^{Ω^c}_{λ−ε}(UV^* + W_0)||_F < 1. (27)

Then there exists W satisfying the system (19).

Proof. We construct a convergent sequence W_0, W_1, ... whose limit satisfies (19). For each k, set

W_k = W_{k−1} − π_{Γ^⊥} S^{Ω^c}_{λ−ε}(UV^* + W_{k−1}). (28)

Notice that π_Θ[W_k] = π_Θ[W_{k−1}] and π_Ω[W_k] = π_Ω[W_{k−1}], so for all k, π_Θ[W_k] = 0 and π_Ω[W_k] = π_Ω[λ sign(E_0) − UV^*]. We will further show by induction that the sequence (W_k) satisfies the following properties:

||UV^* + W_k||_{1,1} ≤ ||UV^* + W_0||_{1,1} + ||S^{Ω^c}_{λ−ε}(UV^* + W_0)||_F sum_{i=0}^{k−1} ξ_c^i, (29)

||UV^* + W_k||_{∞,∞} ≤ ||UV^* + W_0||_{∞,∞} + ||S^{Ω^c}_{λ−ε}(UV^* + W_0)||_F sum_{i=0}^{k−1} ξ_c^i, (30)

max_i ||[S^{Ω^c}_{λ−ε}(UV^* + W_{k−1})]_{i,•}||_0 ≤ cm, (31)

max_j ||[S^{Ω^c}_{λ−ε}(UV^* + W_{k−1})]_{•,j}||_0 ≤ cm, (32)

||S^{Ω^c}_{λ−ε}(UV^* + W_k)||_F ≤ ξ_c^k ||S^{Ω^c}_{λ−ε}(UV^* + W_0)||_F. (33)

For k = 0, (29), (30) and (33) hold trivially. The sparsity assertion (31) follows from the assumptions of the lemma: for all i,

||[S^{Ω^c}_{λ−ε}(UV^* + W_0)]_{i,•}||_0 ≤ (λ − ε)^{−1} ||[UV^* + W_0]_{i,•}||_1 ≤ (λ − ε)^{−1} ||UV^* + W_0||_{∞,∞} ≤ cm.

The exact same reasoning applied to the columns gives (32). Now, suppose the statements (29)-(33) hold for W_0, ..., W_{k−1}. Then

||UV^* + W_k||_{1,1} ≤ ||UV^* + W_{k−1}||_{1,1} + ||S^{Ω^c}_{λ−ε}(UV^* + W_{k−1})||_F ≤ ||UV^* + W_{k−1}||_{1,1} + ξ_c^{k−1} ||S^{Ω^c}_{λ−ε}(UV^* + W_0)||_F ≤ ||UV^* + W_0||_{1,1} + ||S^{Ω^c}_{λ−ε}(UV^* + W_0)||_F sum_{i=0}^{k−1} ξ_c^i,

establishing (29) for k. The same reasoning applied to the (∞,∞)-norm establishes (30) for k. Bounding the sum in (29) by 1/(1 − ξ_c) and applying assumption (25) of the lemma gives that ||UV^* + W_k||_{1,1} ≤ (λ − ε) cm. This implies (32): the number of entries of each column of UV^* + W_k that exceed λ − ε in absolute value is at most cm. The same chain of reasoning, using (26) and (30), establishes (31): the rows of S^{Ω^c}_{λ−ε}(UV^* + W_k) are also cm-sparse.

Finally, notice that

S^{Ω^c}_{λ−ε}(UV^* + W_k) = S^{Ω^c}_{λ−ε}(UV^* + W_{k−1} − π_{Γ^⊥} S^{Ω^c}_{λ−ε}(UV^* + W_{k−1})) = S^{Ω^c}_{λ−ε}(UV^* + W_{k−1} − S^{Ω^c}_{λ−ε}(UV^* + W_{k−1}) + π_Γ S^{Ω^c}_{λ−ε}(UV^* + W_{k−1})).

Since the entries of UV^* + W_{k−1} − S^{Ω^c}_{λ−ε}(UV^* + W_{k−1}) have magnitude at most λ − ε, π_Γ S^{Ω^c}_{λ−ε}(UV^* + W_{k−1}) dominates S^{Ω^c}_{λ−ε}(UV^* + W_k) elementwise in absolute value. Hence,

||S^{Ω^c}_{λ−ε}(UV^* + W_k)||_F ≤ ||π_Γ S^{Ω^c}_{λ−ε}(UV^* + W_{k−1})||_F ≤ ξ_c ||S^{Ω^c}_{λ−ε}(UV^* + W_{k−1})||_F,

where we have used that S^{Ω^c}_{λ−ε}(UV^* + W_{k−1}) ∈ Ψ_c. Thus each of the three statements holds for all k, and ||S^{Ω^c}_{λ−ε}(UV^* + W_k)||_F decreases geometrically. For all k sufficiently large, ||UV^* + W_k||_{1,∞} ≤ λ. Meanwhile, notice that

||W_k||_{2,2} ≤ ||W_{k−1}||_{2,2} + ||π_{Γ^⊥} S^{Ω^c}_{λ−ε}(UV^* + W_{k−1})||_{2,2} ≤ ||W_{k−1}||_{2,2} + ||S^{Ω^c}_{λ−ε}(UV^* + W_{k−1})||_F ≤ ||W_{k−1}||_{2,2} + ξ_c^{k−1} ||S^{Ω^c}_{λ−ε}(UV^* + W_0)||_F.

By induction, it is easy to show that

||W_k||_{2,2} ≤ ||W_0||_{2,2} + ||S^{Ω^c}_{λ−ε}(UV^* + W_0)||_F sum_{i=0}^{k−1} ξ_c^i.

By assumption (27) of the lemma, this quantity is bounded by 1 for all k.

4 Probabilistic Analysis of the Initial Dual Vector

In this section, we analyze the minimum Frobenius norm solution to the equality constraints in the optimality condition (19). We show that with high probability, this initial dual vector satisfies the operator norm constraint, and that the violations of the box constraint are small enough that the iteration described in (28) will succeed in producing a dual certificate.

The analysis is organized as follows: in Section 4.1, we introduce several tools and preliminary results. Section 4.2 then shows that with overwhelming probability the operator norm of the initial dual vector is bounded by a constant that can be made arbitrarily small by assuming the error probability ρ_s and rank r are both low enough. Section 4.3 then shows that the restricted operator norm ξ_c is also bounded by a small constant, establishing that all row- and column-sparse matrices are nearly orthogonal to the subspace spanned by the equality constraints in (19). Section 4.4 analyzes the violations of the box constraint. Finally, Section 4.5 combines these results with the results of Section 3 to prove our main result, Theorem 2.4.

4.1 Tools and preliminaries

In this section, we will often need to refer to the following subspaces:

Ξ_U := {UM^* | M ∈ R^{m×r}}, Ξ_V := {MV^* | M ∈ R^{m×r}}, Ξ_{UV} := {UMV^* | M ∈ R^{r×r}}.

Notice that π_Θ = π_{Ξ_U} + π_{Ξ_V} − π_{Ξ_{UV}}, and that π_{Ξ_{UV}} = π_{Ξ_U} π_{Ξ_V} = π_{Ξ_V} π_{Ξ_U}.

In the Bernoulli support model, the number of errors in each row and column concentrates about ρ_s m. For each j ∈ [m], let

I_j := {i | (i, j) ∈ Ω}, (34)

i.e., I_j contains the indices of the errors in the j-th column. Similarly, for i ∈ [m], set

J_i := {j | (i, j) ∈ Ω}. (35)

In terms of these quantities, for each η > 0, define the event

E_Ω(η) : max_j |I_j| < (1 + η) ρ_s m and max_i |J_i| < (1 + η) ρ_s m.

Much of our analysis hinges on the operator norm of π_Ω π_Θ.
We will see that on the above event E_Ω(η), this quantity can be controlled by bounding norms of submatrices of the singular vectors U

and V. This results in a number of bounds involving the following function of the rank and error probability:

τ(r/m, ρ_s) := 2 (√(r/m) + √ρ_s) / (1 − √(r/m)). (36)

Where τ is used as shorthand, it should be understood as τ(r/m, ρ_s). We will repeatedly refer to the following good events:

E_ΩU : ||π_Ω π_{Ξ_U}||_{F,F} ≤ τ(r/m, ρ_s),
E_ΩV : ||π_Ω π_{Ξ_V}||_{F,F} ≤ τ(r/m, ρ_s),
E_ΩΘ : ||π_Ω π_Θ||_{F,F} ≤ 2 τ(r/m, ρ_s).

In establishing that these events are overwhelmingly likely, the following result on singular values of Gaussian matrices will prove useful:

Fact 4.1. Let M ∈ R^{m×n}, m > n, and suppose that the elements of M are independent identically distributed N(0, 1) random variables. Then

P[σ_max(M) ≥ √m + √n + t] ≤ exp(−t²/2), P[σ_min(M) ≤ √m − √n − t] ≤ exp(−t²/2).

This result is widely used in the literature, with various estimates of the error exponent. The form stated here can be obtained via the bounds E[σ_max(M)] ≤ √m + √n and E[σ_min(M)] ≥ √m − √n, in conjunction with [ ], Proposition 2.18, Equation (2.35), and the observation that the singular values are 1-Lipschitz.

Lemma 4.2. Fix any η ∈ (0, 1/16). Consider (U, V, Ω) drawn from the random orthogonal model of rank r < m with error probability ρ_s > 0. Then there exists t*(r/m, ρ_s) > 0 such that

P_{U,V,Ω}[E_Ω(η) ∩ E_ΩU ∩ E_ΩV ∩ E_ΩΘ] ≥ 1 − 4m exp(−m t*²/2) − 4m exp(−η² ρ_s² m/2). (37)

In particular, if for all m larger than some m_0 ∈ Z, r/m ≤ ρ_r < 1, then

P_{U,V,Ω}[E_Ω(η) ∩ E_ΩU ∩ E_ΩV ∩ E_ΩΘ] ≥ 1 − exp(−C m + O(log(m))). (38)

Proof. Each of the random variables |I_j| is a sum of m independent Bernoulli(ρ_s) random variables X_{1,j}, X_{2,j}, ..., X_{m,j}. The partial sum sum_{i=1}^k X_{i,j} − ρ_s k is a martingale whose differences are all bounded by 1. So by Azuma's inequality, we have P[|I_j| − ρ_s m > t] < exp(−t²/(2m)). The same calculation clearly holds for the |J_i|, and so, setting t = η ρ_s m,

P[E_Ω(η)^c] ≤ 2m P[|I_j| > ρ_s m (1 + η)] ≤ 2m exp(−η² ρ_s² m/2). (39)

So, with high probability, each row and column of E_0 is (1 + η)ρ_s m-sparse. We next show that such sparse vectors are nearly orthogonal to the random subspace range(U). The matrix U is uniformly

14 distributed on W r, and can be realized by orthogonalizing a Gaussian atrix. Let Z R r be an iid N (0, 1 ) atrix; then U is equal in distribution to Z(Z Z) 1/. or each I j, Z Ij, (Z Z) 1/, Z I j,, σ in (Z), where σ in (Z) denotes the r-th singular value of the r atrix Z. Now, for any t > 0 P σ in (Z) < 1 ] ( ) r/ t < exp t. (40) Meanwhile, on the event E Ω (η), P Z Ω Z Ij,, > ] r/ + I j / + t ( ) < exp t. (41) Hence, ] r/ η ρs + t P U Ij,, > 1 r/ t ] r/ η ρs + t P U Ij U Ij,, > 1 I j < (1 + η)ρ s + P I j (1 + η)ρ s ] r/ t ) ( < exp ( t + exp η ρ ) s. Set t in ( r 3, ) ηρ s. By the assuption of the lea, 1 + η + η < 4/3, and r/ η ρs + t 1 r/ t ( ) r/ η + η ρs ( 3 1 ) r/ r/ ρs ( 3 1 ) τ(r/, ρ s ). r/ So, (applying a syetric arguent to the V Ji ), then, for the event E 1 defined below, ( ) E 1 : ax ax U j Ij,,, ax V i Ji,, τ(r/, ρ s ), (4) P U,V,Ω E 1 E Ω (η)] > 1 4 exp ( t ) ( 4 exp η ρ ) s. (43) If for all > 0, r/ ρ r < 1, then t is bounded away fro zero. We next show that E 1 iplies E ΩU, E ΩV, and E ΩΘ. We can express π Ω M] in ters of its action on the coluns of M: π Ω M] = j π I j M,j e j. So, π π ΞU ΩM] = π U π Ij M,j e j, (44) j 14

15 π U π Ij M,j U π Ij M,j and π ΞU π Ω M] = j j U Ij,, M,j = j ( ) ax U j Ij,, M, and so on E 1, π Ω π ΞU, π ΞU π Ω, τ(r/, ρ s ); and so E 1 establishes that E 1 = E ΩV. Now, notice that since = E ΩU. A syetric arguent π Ω π Θ M] = π Ω π ΞU M] + π Ω π U Mπ V ], (45) if we choose a basis B R r for the orthogonal copleent of range(u) and define Ξ U V {BQV Q R r r } R, then π Ω π Θ, π Ω π ΞU, + πω π ΞU V,. (46). = Since πω π ΞU V, = sup π Ω M] M Ξ U V \{0} sup π Ω M] = π Ω π ΞV,, M Ξ V \{0} E 1 = E ΩΘ and the proof is coplete. We will need to understand concentration of Lipschitz functions of atrices that are uniforly distributed (according to the Haar easure) on two anifolds of interest: the Stiefel anifold W r and the group of r r orthogonal atrices with deterinant one, SO(r). This is governed by the concentration function on these spaces: Definition 4.3. ] Let (X, d) be a etric space. or a given easure µ on X, the concentration function is defined as α X,d,µ (t). = sup { 1 µ(a t) A X, µ(a) 1 }, (47) where A t = {x d(x, A) < t} is a t-neighborhood of A. The concentration functions for W r and SO(r) are well known: act 4.4 (4] Theores and 6.7.1). or r < the anifold W r, with distance d(x, Y ) =. X Y, the Haar easure µ has concentration function ) π α W,d,µ (t) ( 8 exp t. (48) 8 Siilarly, on SO(r) with δ(x, Y ) =. X Y, and ν the Haar easure, ) π α SO(r),δ,ν (t) ( 8 exp rt. (49) 8 15
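Both probabilistic ingredients of the proof of Lemma 4.2 — that a Bernoulli(ρ_s) support has no overly dense rows or columns, and that a Haar-distributed U (realized by orthogonalizing a Gaussian matrix) has row-submatrices of small spectral norm — are easy to check in a few lines. All sizes below are our own illustrative choices, not values from the analysis:

```python
import numpy as np

rng = np.random.default_rng(0)
m, r, rho_s, eta = 500, 25, 0.05, 1.0    # illustrative values

# 1. Row/column sparsity of a Bernoulli(rho_s) support (the event E_Omega(eta)).
Omega = rng.random((m, m)) < rho_s
counts = np.concatenate([Omega.sum(axis=0), Omega.sum(axis=1)])
frac_dense = np.mean(counts > (1 + eta) * rho_s * m)   # Azuma predicts this is ~0

# 2. A Haar-distributed U via orthogonalization: U = Z (Z^T Z)^{-1/2}.
Z = rng.standard_normal((m, r))
w, Q = np.linalg.eigh(Z.T @ Z)
U = Z @ (Q * w ** -0.5) @ Q.T            # multiply by (Z^T Z)^{-1/2}
orth_err = np.linalg.norm(U.T @ U - np.eye(r))

# A sparse row-index set I is nearly orthogonal to range(U):
# ||U_{I,.}||_{2,2} is on the order of sqrt(r/m) + sqrt(|I|/m), far below 1.
I = rng.choice(m, size=m // 10, replace=False)
sub_norm = np.linalg.norm(U[I, :], 2)
```

The spectral norm of the submatrix stays near √(r/m) + √(|I|/m), which is the quantity the function τ packages together with the error probability.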

16 Propositions 1.3 and 1.8 of ] then iply that Lipschitz functions on these spaces concentrate about their edians and expectations: Corollary 4.5. Suppose r <, and let f : R r R with Lipschitz nor f lip. = sup X Y f(x) f(y ) X Y. (50) Then if U is distributed according to the Haar easure on W r, and edian(f) denotes any edian, ( ) P f(u) edian(f) + t] exp t. (51) Siilarly, if g : R r r R with Lipschitz nor 8 f lip g lip. = sup X Y g(x) g(y ) X Y. (5) Then if R is distributed according to the Haar easure on SO(r), g satisfies the following tail bound: ] ( ) 8π P g(r) Eg(R)] g lip r + t exp rt. (53) 8 g lip Proof. The concentration result (51) is a restateent of ] Proposition 1.3. or (53), notice first that ] Proposition 1.3 iplies that ( ) P g(r) edian(g) t] exp rt. (54) If we set ā. = then ] Proposition 1.8 gives 0 exp ( rt 8 g lip ) P g(r) Eg] ā + t] exp 8 g lip dt = g lip 8π r, (55) ( rt 8 g lip ). (56) 4. Bounding the operator nor In this section, we begin our analysis of the iniu robenius nor solution to the equality constraints of the feasibility proble (19). We show that with overwheling probability this atrix also has operator nor bounded by a sall constant. An iportant byproduct of this analysis is a siple proof for that, with the sae odels studied here, the easier proble of atrix copletion filling issing entries in a low-rank atrix can be efficiently and exactly solved by convex optiization, even for cases when the rank of the atrix grows in proportion to the diensionality. Section 5 further elucidates connections to that proble. 16

17 4..1 General approach We begin by showing that with high probability the iniu robenius nor solution is unique, and giving an explicit representation for the pseudoinverse operator in that case. This operator, denoted H, is applied to the atrix λ signe 0 ] UV to give the initial dual vector. We separately bound the nor of the two ters induced by H signe 0 ], and H UV ],, respectively. Both arguents follow in a fairly straightforward anner by reducing to a net and then applying concentration inequalities. Throughout this section, N will denote a 1 -net for S 1. By ] Lea 3.18, there is such a net with size at ost exp(4). Moving fro A, = sup x,y S 1 x Ay to our product of nets loses at ost a constant factor in the estiate: A, 4 sup x,y N x Ay (see e.g., 9] Proposition.6). We will argue that for our A of interest, f(a) = x Ay concentrates, and union bound over all exp(8) pairs in N N. 4.. Representation and uniqueness of W 0. Lea 4.6 (Representation of the pseudoinverse). Suppose that we have π Θ π Ω, < 1. Then the operator I π Ω π Θ π Ω is invertible, and for any M the optiization proble has unique solution in W subj π Θ W ] = 0, π Ω W ] = π Ω M] (57) Ŵ = π Θ π Ω (I π Ω π Θ π Ω ) 1 π Ω M] = π Θ π Ω (π Ω π Θ π Ω ) k π Ω M]. Proof. Choose atrices U, V R r whose coluns for orthonoral bases for the orthogonal copleent of the ranges of U and V, respectively. Let Q denote the atrix Q. = V U V U V U Notice that the rows of Q are orthonoral, and that they for a basis for the subspace of vectors {vecm] M Θ}. The equality constraint in (57) can therefore be expressed as ] ] Q 0 vec(w ) =, (58) I Ω, vec Ω M] Here, I is the identity atrix, and I Ω, is the subatrix of I consisting of those rows indexed by Ω, taken in lexicographic order. The iniu robenius nor solution W is siply the iniu l nor solution to this syste of equations. Define the atrix Notice that for any atrix M, P. = Q I,Ω.. Q P I Ω, vecm] = vec π Θ π Ω M]. k=0 17
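The series representation in Lemma 4.6 can be verified numerically. The sketch below implements the projections π_Ω and π_Θ for small illustrative sizes of our own choosing, forms the truncated Neumann series, and checks that the result satisfies both equality constraints:

```python
import numpy as np

rng = np.random.default_rng(2)
m, r, rho_s = 60, 3, 0.05      # illustrative sizes

U, _ = np.linalg.qr(rng.standard_normal((m, r)))
V, _ = np.linalg.qr(rng.standard_normal((m, r)))
Omega = rng.random((m, m)) < rho_s

PU, PV = U @ U.T, V @ V.T

def pi_Omega(M):               # restriction to the support Omega
    return np.where(Omega, M, 0.0)

def pi_Theta(M):               # orthogonal projection onto Theta = {UM*} + {MV*}
    return PU @ M + M @ PV - PU @ M @ PV

M = rng.standard_normal((m, m))

# Truncated Neumann series:
#   W_hat = pi_{Theta^perp} pi_Omega sum_k (pi_Omega pi_Theta pi_Omega)^k pi_Omega[M]
term = pi_Omega(M)
acc = term.copy()
for _ in range(300):
    term = pi_Omega(pi_Theta(term))      # terms stay supported on Omega
    acc = acc + term
W_hat = acc - pi_Theta(acc)              # apply pi_{Theta^perp}

# W_hat satisfies both equality constraints of the feasibility problem.
constraint_theta = np.linalg.norm(pi_Theta(W_hat))
constraint_omega = np.linalg.norm(pi_Omega(W_hat) - pi_Omega(M))
```

The series converges because ‖π_Ω π_Θ π_Ω‖ < 1 at these sizes; both residuals come out at numerical noise.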

18 Since Q and I Ω, each have orthonoral rows, π Θ π Ω, = Q ] P I Ω,, = P,. So, by the I P assuption of the lea P, < 1, and the atrix P is nonsingular. We therefore have I the following explicit expression for the unique iniu l -nor solution to (58): vecŵ ] = Q I,Ω ] I P P I ] 1 0 vec Ω M] ]. (59) Applying the Schur copleent forula (which is justified since I P P ), the above is equal to ] ] Q P I,Ω (I P P ) 1 vec I Ω M] = (I Q Q)I,Ω (I P P ) 1 I Ω, vecm] = (I Q Q)I,Ω (I Ω, Q QI,Ω ) k I Ω, vecm] = vec π Θ π Ω k=0 k=0 (π Ω π Θ π Ω ) k π Ω M] yielding the representation in the stateent of the lea Effect of the singular vectors We next analyze the part of the initial dual vector W 0 induced by UV. ro the previous lea we have the expression H ] = π Θ π Ω k=0 ], (π Ω π Θ π Ω ) k π Ω ], (60) whenever π Ω π Θ, < 1. Analysis of H UV ] is coplicated by the fact that UV and Θ are dependent rando variables. Notice, however, that π Θ depends only on the subspace spanned by the coluns of U, not on any choice of basis for that subspace. UV, on the other hand, does depend on the choice of basis. We will use this fact to decouple H and UV, and show that the l operator nor of H UV ] is bounded below a sall constant with overwheling probability. Lea 4.7. Let (U, V, Ω) be sapled fro the rando orthogonal odel of rank r with error probability ρ s. Suppose that r and ρ s satisfy τ(r/, ρ s ) < 1 4. (61) Then with probability at least 1 exp ( C + O(log )), the solution W UV 0 to the optiization proble in W subj U W = 0, W V = 0, π Ω W ] = π Ω UV ] (6) is unique, and satisfies ( W UV r ) ( 0, 64 τ, ρ s r 1 ) π r 1. (63) 18

19 Proof. irst consider the event ix x, y N and write E 1 : π Ω UV ], 64τ. x π Ω UV ]y = U π Ω xy ], V. On the event E ΩU, U π Ω xy ] τ. So, as a function of V, f(v ) =. x π Ω UV ]y is τ-lipschitz. Since the distribution of V is invariant under the orthogonal transforation V V, f(v ) is a syetric rando variable and 0 is a edian. Hence, on the event E ΩU the tail bound (51) iplies that P V U,Ω f(v ) > t] < exp ( t 8τ ). (64) Set t = 16τ. A union bound over the exp(8) eleents of N N shows that on E ΩU, ] ] P V U,Ω π Ω UV ], > 64τ P V U,Ω sup x π Ω UV ]y > 16τ and hence we can conclude that x,y N N exp ( 4), P U,V,Ω E 1 ] 1 exp ( 4) PE c ΩU ] 1 exp ( C + O(log )). Now, consider the cobined event E 1 E ΩΘ. On this event, the representation in Lea 4.6 is valid and W0 UV = π Θ π Ω (π Ω π Θ π Ω ) k π Ω UV ]. (65) k=0 or any M, π Θ M] = π U Mπ V, π Θ M], M,, so W0 UV, π Ω (π Ω π Θ π Ω ) k π Ω UV ] k=0, π Ω UV ], + (π Ω π Θ π Ω ) k UV ] k=1,. (66) We have already addressed the first ter. To bound the second, we expand our probability space as follows. Consider Ũ and Ṽ distributed according to the Haar easure on W 1. Identify U and V with the first r coluns of Ũ and Ṽ, respectively, and write Û and ˆV for the reaining 1 r coluns. Notice that U and V are indeed distributed according to the rando orthogonal odel of rank r, and that Û and ˆV follow the rando orthogonal odel of rank r 1 (although, of course, U and Û now dependent rando variables). Write (π Ω π Θ π Ω ) k UV ] = Ω π Θ π Ω ), k=1(π k ŨṼ Û ˆV ], Ω π Θ π Ω ) k=1(π k ŨṼ ] + Ω π Θ π Ω ), k=1(π k Û ˆV ]., k=1 19

20 We next show that with overwheling probability each of these ters is well-controlled. Notice that if R SO ( r 1) is any orthogonal atrix, then the joint distribution Ũ, U, and Û is invariant under the ap ] I 0 Ũ Ũ. (67) 0 R This follows fro the right orthogonal invariance of the Haar easure on W 1 (see e.g., 11] Section 1.4.3). Since this ap preserves U and V, it also preserves Θ. Hence, the ter of interest, k=1 (π Ωπ Θ π Ω ) Û k ˆV ], is equal in distribution to k=1 (π Ωπ Θ π Ω ) ÛR k ˆV ],. The orthogonal atrix R is independent of all of the other ters in this expression. This independence allows us to estiate the nor by first bounding the operator nor of the ap k=1 (π Ωπ Θ π Ω ) k and then applying easure concentration on SO ( r 1). or fixed x, y N, consider the quantity ( ) x (π Ω π Θ π Ω ) k ÛR ˆV ] y = Û (π Ω π Θ π Ω ) k xy ] ˆV, R k=1. = M, R. This is a M -Lipschitz function of R. Since for any w S 1, Rw is uniforly distributed on S 1, Ee i Rw] = 0 for all i. So, ER] = 0 and E R M M, R = 0. On the event E ΩΘ, M k=1 (π Ω π Θ π Ω ) k xy ] π Ωπ Θ π Ω, xy 4τ 1 π Ω π Θ π Ω, 1 4τ 4τ 3. (68) k=1 Hence, on E ΩΘ by (53) ) 9( r 1)t P R M M, R > γ() + t] < exp ( 18 τ, (69). where γ() = 4τ( r,ρs) 1 4τ( r 8π,ρs) r 1 1 8π 3 r 1. or copactness of notation, set β(, r). =. Then, setting t = 64τ, union bounding over the exp(8) pairs in N N, and then r 1 3β oving fro N N to S 1 S 1 (losing at ost a factor of 4) gives P R Ũ,V,Ω Ω π Θ π Ω ) k=1(π k ÛR ˆV ] > 56 τ 3β(, r) + 4γ() < exp ( 4). (70), And so, Ω π Θ π Ω ) k=1(π k Û ˆV ] > 56 τ 3β(, r) + 4γ(), = P R Ũ,V,Ω Ω π Θ π Ω ) k=1(π k ÛR ˆV ] > 56 τ 3β(, r) + 4γ(), exp ( 4) + P EΩΘ] c exp ( C + O(log )). PŨ,V,Ω 0

21 An identical arguent, this tie randoizing over R =. ] I 0 shows that 0 R PŨ,V,Ω Ω π Θ π Ω ) k=1(π k ŨṼ ] > 56 τ 3β(, r) + 4γ() exp ( C + O(log )). (71), Cobining ters yields the bound Effect of the error signs We next consider the effect of the error signs. As in the previous proposition, we handle the k = 0 and k = 1... parts of the representation in Lea 4.6 separately. Lea 4.8. There exists a function φ(ρ s ) satisfying li ρs 0 φ(ρ s ) = 0, such that if E 0 is distributed according to the Bernoulli sign and support odel with error probability ρ s. P Ω,Σ sign(e 0 ), φ(ρ s ) ] exp ( C). (7) Proof. We first provide a bound on the oent-generating-function for the iid rando variables. Y ij = sign(e0ij ) of the for Eexp(tY )] exp(αt ). The oents of Y are 1 k = 0, EY k ] = 0 k odd, ρ s k > 0, k even, so Eexp(tY )] = 1 + ρ st k k=1 (k)!. Since exp(αt ) = 1 + k=1 αk t k k!, it suffices to choose α such that ρ s (k)! α k k = 1,,.... k! Since (k)! k! k k 1, α ax k=1,,... k ρ 1 k s suffices. Consider the function ψ : 1, ) 0, 1] R defined by ψ(x, y) = 1 x y 1 x. ψ(1, y) = y, and for all y li x ψ(x, y) = 0. The only stationary point occurs at ( ) x = log(1/y), and hence for y > 0 its axiu on 1, ) is ψ(x 1 (y), y) = ax y, log(y 1 ) y 1 log(y 1 ). Notice that li y 0 ψ(x (y), y) = 0, and that Now, for any fixed pair x, y N, let Z. = x sign(e 0 ) y. Then Eexp(tY )] exp(ψ(x (ρ s ), ρ s )t ). (73) Eexp(tZ)] = ij Eexp(tx i y j Y ij )] ij exp(ψ(x (ρ s ), ρ s )t (x i y j ) ) = exp(ψ(x (ρ s ), ρ s )t ). Applying a Chernoff bound (and optiizing the exponent) gives ( P x t ) sign(e 0 )y > t] < exp 4ψ(x. (74) (ρ s ), ρ s ) 1

22 Union bounding (and recognizing that oving fro N N to S 1 loses at ost a factor of 4) gives P sign(e 0 ), 4t ] ( t ) exp 8 4ψ(x. (75) (ρ s ), ρ s ) If we set, e.g., t(ρ s ) = 8ψ 1/ (x (ρ s ), ρ s ), the probability of failure will be bounded by exp( 8). urther setting φ(ρ s ) = 4t(ρ s ) gives the stateent of the lea. The above lea goes part of the way to controlling the coponent of the initial dual vector induced by the errors, H λ sign(e 0 )]. A straightforward Martingale arguent, detailed in the following lea, copletes the proof. Lea 4.9. Consider (U, V, Ω, Σ) drawn fro the rando orthogonal odel of rank r < with Bernoulli error probability ρ s and rando signs, and suppose that r and ρ s satisfy τ (r/, ρ s ) 1 4. (76) Then with probability at least 1 exp( C + O(log )) in (U, V, Ω, Σ), the solution W0 E optiization proble to the in W subj U W = 0, W V = 0, π Ω W ] = π Ω λ sign(e 0 )] (77) is unique, and satisfies W E, 0 φ(ρ s ) + 18τ(r/, ρ s), (78) 3 where φ( ) is the function defined in Lea 4.8. Proof. On event E ΩΘ, the iniizer is unique, and can be expressed as W0 E = π Θ π Ω (π Ω π Θ π Ω ) k π Ω λ sign(e 0 )]. k=0 Since π Θ is a contraction in the (, ) nor, W0 E, π Ω (π Ω π Θ π Ω ) k π Ω λ sign(e 0 )] k=0, π Ω λ signe 0 ]], + (π Ω π Θ π Ω ) k π Ω λ sign(e 0 )] k=1,. Lea 4.8 controls the first ter below φ(ρ s ) with overwheling probability. or the second ter, it is convenient to treat sign(e 0 ) as the projection of an iid Radeacher atrix Σ onto Ω: sign(e 0 ) = π Ω Σ]. ix x, y N, and notice that x (π Ω π Θ π Ω ) k π Ω λ sign(e 0 )]y = λ (π Ω π Θ π Ω ) k xy ], Σ. k=1 k=1. = M, Σ.

23 On the event E ΩΘ, M R 1 has robenius nor at ost 4τ 4τ 1 4τ 3. Order the indices (i, j) 1 i, j arbitrarily, and consider the Martingale Z defined by Z 0 = 0, Z k = Z k 1 + M ik,j k Σ ik,j k. The k-th Martingale difference is bounded by M ik,j k, and the overall squared l -nor of the differences is bounded by M. Hence, by Azua s inequality, P M, Σ > t] < exp ) ( 9t 3τ. (79) Set t = 3τ 3. A union bound over the exp(8) eleents in N N shows that on the event E ΩΘ, P Σ U,V,Ω (π Ω π Θ π Ω ) k π Ω λ sign(e 0 )] > 18τ exp ( 4). (80) 3 k=1, Hence, P U,V,Ω,Σ k=1 (π Ωπ Θ π Ω ) k π Ω λ sign(e 0 )], > 18τ 3 ] is bounded by exp ( 4) + PE c ΩΘ] exp ( C + O(log )). 4.3 Controlled feedback for sparse atrices We next bound the operator nor of the projection π Γ, restricted to row- and colun-sparse atrices Ψ c. Lea 4.10 (Representation of iterates). Let Θ, Γ, Ω be defined as above. Suppose that π Ω π Θ, < 1. Then for all M Ω, π Γ M] = π Ω π Θ k=0 (π Θ π Ω π Θ ) k π Θ M]. (81) Proof. or M Ω, vecπ Γ M] = Q I P I,Ω ] P I = Q I P I,Ω ] P I ] 1 Q I Ω, ] vecm] ] 1 Q vecm] 0 ]. Under the condition of the lea, P, < 1 so the above inverse is indeed well-defined, and can be expressed via the Schur copleent forula: vecπ Γ M] = Q (I P P ) 1 QvecM] I,Ω P (I P P ) 1 QvecM] = π Ω Q (I P P ) 1 QvecM] = π Ω Q (P P ) k QvecM]. k=0 3

24 Recognizing that Q (P P ) k QvecM] = vec k=0 k=0 π Θ (π Θ π Ω π Θ ) k π Θ M] ] copletes the proof. Lea ix any c (0, 1). Consider (U, V, Ω) drawn fro the rando orthogonal odel of rank r <, with Bernoulli error probability ρ s. urther suppose that r and ρ s satisfy Then with probability at least 1 exp ( C + O(log )), ξ c 3 9 where H(c) is the base-e binary entropy function. τ ( r, ρ ) 1 s 4. (8) ( r + c + ) H(c), (83) Proof. ro the above representation, whenever π Ω π Θ, < 1, ξ c = sup π Ω π Θ (π Θ π Ω π Θ ) k π Θ M] M Ψ c k=0 M 1 Θ π Ω π Θ ) k=0(π k sup π Θ M], M Ψ c M 1 Now, 1 1 π Θ π Ω π Θ, sup M Ψ c M 1 π Θ M]. sup M Ψ c M 1 π Θ M] sup M Ψ c M 1 sup I c U I,, + π U M + Mπ V sup J c V J,,. Identify U with the orthogonalization Z(Z Z) 1/ of an r iid N (0, 1 ) atrix Z. Then for any given I ] of size c, U I,, Z I,, σ. Now, in(z) P σ in (Z) > 1 ] ( ) r/ t 1 < exp t 1. (84) 4

25 Meanwhile, for each I of size c, again P Z I,, > r/ + ] ( ) c + t < exp t. (85) There are at ost exp (H(c) + O(log )) such subsets I, where H( ) denotes the base-e binary entropy function, so ] P ax I ( ] c) Z I,, > r/ + c + t < exp ( (t / H(c)) + O(log ) ). (86) Choosing t 1 = 1 1 r, and set t = H(c) and cobining bounds gives r/ + c + H(c) ξ c (1 4τ(r/, ρ s ) )(1 r/). (87) Noticing that r/ < τ and applying the bound τ < 1/4 to the ters in the denoinator copletes the proof. 4.4 Controlling the initial violations In this section, we analyze the initial dual vector W 0 given by the iniu robenius nor solution to the equality constraints of the optiality condition (19), and show that for any constant β, the probability that S 5 UV + W 6 0 ] > 3β approaches zero. We will treat the parts of UV + W 0 = UV + H UV ] + H λ sign(e 0 )] separately, and use the following siple lea to cobine the bounds: Lea 4.1. or all atrices A, B, and α (0, 1), Sγ Ωc A + B] Sαγ Ωc A] + S(1 α)γ Ωc B]. (88) Proof. Notice that for scalars x, S γ x] is a convex nonnegative function. X R, Sγ Ωc X] is again convex (see, e.g, 3] Exaple 3.14), and so S Ωc γ A + B] = Sγ Ωc α Sγ Ωc A α α A B α + (1 α) 1 α] ] + (1 α) S Ωc γ = S Ωc αγ A] + S Ωc (1 α)γ B]. B 1 α] Hence, for atrices The strategy, then, is to bound the expectation of Sλ/3 Ωc ] for each of the three ters, using the following lea. It will turn out that for any prespecified p, we can choose C 0 such if r < C 0 log, the expected robenius nor of the violations is O( p ); an application of the Markov inequality then bounds the probability that the value deviates above any fixed constant β. 5
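The subgaussian soft-thresholding estimates that drive this strategy behave as advertised in simulation: for X ~ N(0, 1), the second moment of the soft-thresholded variable S_γ(X) decays like exp(−γ²/2). A quick Monte Carlo sketch, with our own illustrative threshold γ:

```python
import numpy as np

rng = np.random.default_rng(4)
gamma = 2.0                        # illustrative threshold
x = rng.standard_normal(1_000_000)

# Soft thresholding: S_gamma(x) = sign(x) * max(|x| - gamma, 0).
y = np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)
second_moment = np.mean(y ** 2)

# The subgaussian tail exp(-t^2/2) predicts decay on the scale exp(-gamma^2/2).
subgaussian_bound = np.exp(-gamma ** 2 / 2)
```

For γ = 2 the empirical second moment is about 0.012, an order of magnitude below exp(−2) ≈ 0.135, consistent with the exponential decay the lemmas below exploit.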

26 Lea Let X be a syetric rando variable satisfying the (subgaussian) tail bound P X t] exp( Ct ). Let Y. = S γ (X). Then E Y ] C exp ( Cγ ). (89) Proof. Since X is syetric, Y is also syetric, so EY ] = t PY t]dt 4 (s γ) exp( Cs )ds 4 γ γ 0 t exp( C(t + γ) )dt s exp( Cs )ds. Lea Let X be a rando variable satisfying a tail bound of the for Suppose γ > a and let Y. = S γ (X). Then Proof. EY ] = = 0 γ P X a + t] C 1 exp( C t ). E Y ] C 1 C exp ( C (γ a) ). (90) t P Y t]dt = 0 tp X γ + t]dt (s γ)p X s]ds C 1 (s γ) exp ( C (s a) ) ds = C 1 (q + a γ) exp( C q )dq C 1 exp ( C (γ a) ). C γ a γ In the previous section, we were interested in controlling the quadratic products x W 0 y involving the initial dual vector W 0 and arbitrary unit vectors. Here, because the soft thresholding operator S γ acts eleentwise, in this section, we require tighter control over e i W 0e j. Because there are only pairs (e i, e j ), uch tighter control can be established, as is foralized in the following lea: Lea Consider U, V drawn fro the rando orthogonal odel of rank r. or any κ > 0, define the events 1 E U (κ) : ax U i, i κ log, 1 E V (κ) : ax V i, i κ log. 6

27 Then if r < C0 log() for soe C 0 < 1 4κ, On these good events, P U,V E U (κ) E V (κ)] 1 exp ax i,j π Θ e i e j ] ( (κ 1/ C 0 ) ). (91) 8 log κ log. (9) Proof. Notice that f(u) =. U i, is a 1-Lipschitz function of U. Since i U i, = U = r, by syetry E U i, ] = r. By the Markov inequality, f has a edian no larger than Ef] Ef ] = r. Invoking Lipschitz concentration on W r and union bounding over the choices of i, ] ( 1 P ax U i, > exp ( ) ) 1 r i κ log 8 κ log. (93) An identical calculation applies to E V (κ). Suing the two probabilities of failure copletes the proof. Lea Let (U, V ) be distributed according to the rando orthogonal odel of rank r <. or any fixed Ω ] ] and κ > 0, on the good event E U (κ) E V U S Ωc 1 UV 4 ] ( 1 3 κ 144 ). (94) κ log Proof. ix U and consider UV ] i,j = U i, Vj, as a U i, -Lipschitz function of V. On E U (κ), the Lipschitz constant is bounded by (κ log ) 1/, and ( ) P V U UV ] ij > t] < exp κt log. (95) 8 By Lea 4.13, on E U, ( ) ] E V U S 1 UV ] 3 ij ] Hence, suing over pairs (i, j) Ω c E V U S Ω c 1 3 UV ] Applying the Cauchy-Schwarz inequality copletes the proof. 16 ( κ log exp κ ) 7 log(). (96) ] 16 κ κ log (1 7 ). (97) Lea ix any β > 0, κ > 0. Consider (U, V, Ω) drawn fro the rando orthogonal odel of rank r, with Bernoulli error probability ρ s, and suppose that r, ρ s satisfy r < C 0 log, C 0 < 1 4κ, τ ( r, ρ ) 1 s < 4. (98) Then there exists 0, C 1, C > 0 such that for all > 0, S Ω P c U,V,Ω 1 H UV ] ] ] > β C 1 1 κ β κ log + P U,V,Ω E U (κ) c E V (κ) c EΩΘ] c. 7

28 Proof. We use a siilar splitting trick to that in Lea 4.7. Again let Ũ and Ṽ be uniforly distributed on W 1. Identify U and V with their first r coluns, and let Û and ˆV denote the reaining r 1 coluns. Write S Ωc 1 H UV ] ] = S Ωc 3 1 H ŨṼ Û ˆV ]] 3 S Ωc 1 H ŨṼ ]] + S Ωc 6 1 H Û ˆV ]]. 6 We first address the second ter. On the event E ΩΘ, S Ωc 1 H Û ˆV ]] = 6 SΩc 1 π 6 Θ π Ω Ω π Θ π Ω ) k=0(π k π Ω Û ˆV ]] = SΩc 1 π 6 Ω π Θ π Ω Ω π Θ π Ω ) k=0(π k π Ω Û ˆV ]] = SΩc 1 π 6 Θ π Ω Ω π Θ π Ω ) k=0(π k π Ω Û ˆV ]]. Here, we have used that π Ω π Θ π Ω = π Ω (I π Θ )π Ω = π Ω π Θ π Ω. Now, since for any R SO( r 1), the joint distribution of (Ũ, Ṽ, Ω) is invariant under the ap ] I 0 Ũ Ũ. (99) 0 R Moreover, this ap preserves U and V, and therefore H. Hence, if R is distributed accord- ing to the invariant easure on SO( r 1), SΩc 1 H Û ˆV ]] is equal in distribution to 6 π Θ π Ω k=0 (π Ωπ Θ π Ω ) k π Ω ÛR ˆV ]]. Consider the i, j eleent of this atrix, SΩc 1 6 e i π Θ π Ω k=0 (π Ω π Θ π Ω ) k π Ω ÛR ˆV ]e j = ( ) Û (π Ω π Θ π Ω ) k π Ω π Θ e i e j ] ˆV, R. = M, R. k=0 This is a M -Lipschitz function of R. On E U (κ) E V (κ) E ΩΘ, π Ω π Θ, M π Θ e i e 4τ 1 j ] 1 π Ω π Θ π Ω, 1 4τ κ log() Let δ =. r 1. Since E R M, R ] = 0, the tail bound (53) iplies that ( 9 κδ log()t where P R M M, R > q() + t] < exp q() = κ log. ), (100) 4 8π 3δ κ log. (101) 8

29 ] By Lea 4.14, E R Ũ,Ṽ S,Ω 1 M, R ] κδ log() exp κδ log() 3 ( 1 4 δ 8π κ log() ). (10) or copactness, let ζ(r/, ρ s, ). = 1 κδ 3 ( ) 1 4 8π. (103) δ κ log() Then suing over the eleents in Ω c gives that on the event E U (κ) E V (κ) E ΩΘ, S Ω E c R Ũ,Ṽ,Ω 1 H ÛR ˆV ]] ] κδ log ζ. (104) By the Markov inequality, on E U (κ) E V (κ) E ΩΘ, P R Ũ,Ṽ,Ω S Ω c 1 6 H ÛR ˆV ]] > β An identical calculation shows that if we set R = So, P U,V,Ω S Ω c 1 3 P R Ũ,Ṽ,Ω S Ω c 1 6 H UV ] ] I 0 0 R H Ũ RṼ ]] > β ] < ] > β Under the assuptions of the Lea, δ = < 3 ζ/ ] 16 ζ/ < 3βδ κ log(). (105) ] SO( 1), on E U (κ) E V (κ) E ΩΘ 16 ζ/ 3βδ κ log(). (106) 3βδ κ log() + P(E U (κ) E V (κ)) c ] + PE c ΩΘ]. r 1 that for all > δ, δ(, r) > 1 and so ζ 1 κ 64 1 C0 log 1. Hence, δ > 0 such ( π κ log ) urtherore, for any κ, κ such that for > κ, 16 8π κ log < 1 4. or > ax( δ, κ ), ζ < 1 κ 104. or such sufficiently large, the ultiplier 3 the stateent of the lea. 3δ 3 3. Choosing this value for C 1 and setting 0 = ax( δ, κ ) gives Lea 4.18 (Box violations induced by error). ix any α (0, 1). Let (U, V, Ω, Σ) be distributed according to the rando orthogonal odel of rank r and with error probability ρ s, with r and ρ s satisfying τ(r/, ρ s ) < 1/4. (107) Then on the good event E U (κ) E V (κ) E ΩΘ, E Σ U,V,Ω S Ω c α H λ sign(e 0 )] ] κ log() 1 ακ 4. (108) 9

30 Proof. We can exploit independence of Ω and Σ by writing sign(e 0 ) = π Ω Σ]. ro the representation in the previous lea, S Ωc α H λ sign(e 0 )] ] = S α (π Θ π Ω ) k λ Σ]]. The i, j eleent of the atrix of interest is (π Θ π Ω ) k λ Σ] e j = λ (π Ω π Θ ) k π Θ e i e j ], Σ e i k=1 k=1 k=1. = M, Σ. On E U (κ) E V (κ), π Θ e i e j ] 1 κ log. On E ΩΘ, k=1 (π Θπ Ω ) k, τ 1 τ 1 M κ log. The sae Martingale arguent as in Lea 4.9 shows that Then by Lea 4.13, 1, and so ) ) P Σ M, Σ > t] < exp ( t κ log()t M = exp (. (109) ( ) ] E Σ U,V,Ω S α M, Σ ] ( κ log() exp ) ακ log(). (110) Suing over the eleents in Ω c to bound E ] and then applying the Cauchy-Schwarz inequality establishes the result. The three leas in this section cobine to yield the following bound on the robenius nor of the violations. Corollary 4.19 (Control of box violations). ix any β > 0, p > 0. There exist constants C 0 (p) > 0, ρ s > 0, 0 with the following property: if > 0 and (U, V, Ω, Σ) are distributed according to the rando orthogonal odel of rank r < C 0 (p) log, (111) with Bernoulli error probability ρ s ρ s and rando signs, then S ] Ω P c U,V,Ω,Σ 5 UV + W 6 0 ] 3β 1 C β p. Proof. By Lea 4.1, S Ωc 5 UV + W 6 0 ] UV ] S Ωc S Ωc 1 3 H UV ]] + S Ωc 1 6 H λ sign(e 0 )] ]. We use the three previous leas to estiate each of these ters. Set κ = 048p + 104, and C 0 = 1 16κ. ro Lea 4.16 and the Markov inequality, for any Ω, on E U, ] P V U S Ω c 1 3 UV ] β 30 4 β κ log() 1/ κ/144,

31 and so S Ω P c U,V,Ω,Σ 1 UV ] 3 ro Lea 4.18, P U,V,Ω,Σ S Ωc 1 6 ] 4 β β κ log() 1/ κ/144 + PE U (κ) c ] = o ( p). H λ sign(e 0 )] ] ] β κ log() 1/ κ/4 + P (E U (κ) E V (κ)) c ] + P E c ΩΘ] = o ( p). inally, by Lea 4.17 P U,V,Ω,Σ S Ωc 1 3 H UV ] ] ] β = C 1 β κ log() 1/ κ/048 + P (E U (κ) E V (κ)) c ] + P E c ΩΘ] C 1 p β κ log() + o ( p). Suing the three failure probabilities copletes the proof. 4.5 Pulling it all together We close by fitting the probabilistic leas in the previous three sections together to give a proof of our ain result, Theore.4. Proof. We show that C 0 and 0 can be selected such that with high probability the conditions of Lea 3.3 are satisfied. Set ε = 1 6. ix 0 large enough and c > 0 sall enough that on the good event fro Lea 4.11, as long as C 0 1, ξ c 3 ( 1 + c 9 log + ) H(c ) 1. (11) 0 We then ust show that UV + W 0 1, + Notice that S Ωc 5 6 UV + W 0 ] 5 6 c. (113) UV + W 0 1, UV 1, + W 0 1, ax V i, + W 0,. (114) i ro Leas 4.7 and 4.9, for any choice of C 0, W 0, is with overwheling probability bounded by a linear function of τ. or any fixed C 0, li τ(r/, ρ s ) = ρ s. Hence, by choosing 0 large and ρ s sall, we can bound W 0, 5 c (115) 18 with overwheling probability. Meanwhile, on the overwhelingly likely good event E V, V i, 5 c (116) 18 31

for m sufficiently large. Finally, fix β = (5/36)√c and choose C_0 small enough and m_0 large enough that the conditions of Corollary 4.19 are satisfied. Then, with probability at least 1 − C m^{−p}, (25) is satisfied. The same calculations apply to (26). Finally, on these good events,

    ‖W_0‖_{2,2} + (1/(1 − ξ_c)) ‖S^{Ω^c}_{5λ/6}[UV* + W_0]‖_F ≤ (5/18)√c + (5/36)√c / (1 − ξ_c) < 1,    (117)

so (27) holds. By Lemma 3.3, then, there exists a certifying dual vector, and the proof is complete.

5 Implications for Low-Rank Matrix Completion

Our result has strong implications for the low-rank matrix completion problem studied in [6, 8, 5]. In matrix completion, the goal is to recover a low rank matrix A_0 from an observation consisting of a known subset Υ := [m] × [m] \ Ω of its entries.⁶ [6] and [8] studied the following convex programming heuristic for the matrix completion problem:

    Â_0 = arg min_A ‖A‖_*   subj   A(i, j) = A_0(i, j)  ∀(i, j) ∈ Υ.    (118)

Duality considerations in [6] give the following characterization of the A_0 that can be recovered by solving (118):

Lemma 5.1 ([6], Lemma 3.1). Let A_0 ∈ R^{m×m} be a rank-r matrix with reduced singular value decomposition USV*. As above, let

    Θ := span( {UM* | M ∈ R^{m×r}} ∪ {MV* | M ∈ R^{m×r}} ) ⊂ R^{m×m}.    (119)

Suppose there exists a matrix Y such that

    { π_Θ[Y] = UV*,  π_{Υ^c}[Y] = 0,  ‖π_{Θ^⊥}[Y]‖_{2,2} < 1 }.    (120)

Then A_0 is the unique solution to the semidefinite program

    min ‖A‖_*   subj   π_Υ[A] = π_Υ[A_0].    (121)

Candès and collaborators then consider the minimum Frobenius norm solution Y_0 to the system of equations (120) and show that if the number of observations |Υ| is large enough, ‖π_{Θ^⊥}[Y_0]‖_{2,2} is bounded below one with high probability. Recall that in Section 4, we analyzed the operator norm of the minimum Frobenius norm solution to a similar system of equations. In fact, we will see that Y_0 is exactly equal to UV* + W_0^{UV}, where W_0^{UV} is the part of the initial (RPCA) dual vector that was induced by the singular vectors of A_0. Using Lemma 4.7, the operator norm of W_0^{UV} can be bounded below one with overwhelming probability, even in a proportional growth setting.
This yields Theorem 2.5, which we formally prove below.

⁶In [6], Ω denotes this set.

33 Proof. Notice that (10) is feasible if and only if the syste { π Θ W ] = 0, π Υ W ] = π Υ UV ], W, < 1 } (1) is feasible in W =. Y UV. Here, Υ is the set of issing eleents fro the atrix to be copleted; we can identify this set with the set of corrupted entries Ω in the low-rank recovery proble that is the ain focus of this paper. This is again a rando subset of ] ] in which the inclusion of each pair (i, j) is an independent Bernoulli(ρ s ) rando variable. Hence, if we can show that the atrix W0 UV defined in Lea 4.7 as the iniu robenius nor solution to the equality constraints π Θ W ] = 0, π Ω W ] = π Ω UV ] has operator nor bounded below 1, we will have further established that A 0 uniquely solves (118), and hence can be efficiently recovered by nuclear nor iniization. Lea 4.7 iediately iplies Theore.5 as follows. irst, notice that ax(ρ r, ρ s ) < 1 89 = τ(ρ r, ρ s ) < 1 4. Under this condition, with probability at least 1 exp (C + O(log )), the iniizer W UV ( W0 UV, 64τ(ρ r, ρ s ) is uniquely defined and satisfies ) ρ r 1 π (1 ρ r ) 1. (13) Choose 0 large enough that the second ter is < 1 3 and ρ r 4. Then on the above 1 good event, W0 UV, < 56τ(ρ r, ρ s ) + 1. or ρ r, ρ s sufficiently sall, this is bounded below one: for exaple, ax(ρ r, ρ s ) < suffices. 7 Thus, atrices A 0 distributed according to the rando orthogonal odel of rank as large as r = C can be recovered fro incoplete subsets Υ of size (1 ρ s ). This is the first result to suggest that atrix copletion should succeed in such a proportional growth scenario. The previous best result for A 0 distributed according to the rando orthogonal odel, due to 8], showed that Υ = C r log 8 () (14) observations suffice. When r, this result becoes epty, since in this case the prescribed C log 8 () nuber of easureents exceeds. 
The fact that our result holds in proportional growth may seem surprising in light of discussions in [8]: as discussed there, if all one assumes is that the singular spaces of A_0 are incoherent with the standard basis, then |Υ| = O(mr log m) measurements are necessary. It is important to note, though, that the singular spaces of matrices A_0 distributed according to the random orthogonal model satisfy rich regularity properties in addition to incoherence. In particular, the submatrix norm considerations used in bounding ‖π_Θ π_Ω‖ in our proof of Theorem 2.4 can be viewed as a kind of restricted isometry property for the singular vectors U and V. Furthermore, our proofs make heavy use of the independence of U and V in the random orthogonal model. Independence and orthogonal invariance allow us to introduce an auxiliary randomization R over the choice of basis for U, decoupling H (which only depends on the subspace spanned by U, not on any choice of basis for the space) from UV*, which very clearly does depend on the choice of basis. Finally, one should also note that our Theorem 2.5 only supersedes (14) for relatively large rank r. For smaller (say fixed) rank r, (14) remains the strongest available result for matrix completion in the random orthogonal model.

⁷These estimates are clearly pessimistic: larger constant fractions ρ_r are possible. However, the argument given here already suffices to establish that matrix completion succeeds in a new regime of matrices whose rank is a constant fraction of the dimensionality.
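As an illustration of completion in this regime, a singular-value-thresholding iteration in the spirit of the algorithms cited above recovers a random low-rank matrix from a constant fraction of its entries. All parameters here (τ, δ, the iteration count, and the problem sizes) are heuristic choices of ours, not values prescribed by the analysis:

```python
import numpy as np

rng = np.random.default_rng(5)
m, r, p_obs = 100, 3, 0.6                    # illustrative rank and observed fraction

A0 = rng.standard_normal((m, r)) @ rng.standard_normal((r, m))
mask = rng.random((m, m)) < p_obs            # observed set Upsilon

def shrink_singular_values(M, tau):
    """Soft-threshold the singular values of M at level tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

tau, delta = 5.0 * m, 1.5                    # heuristic parameter choices
Y = np.zeros((m, m))
for _ in range(500):
    A = shrink_singular_values(Y, tau)       # current low-rank estimate
    Y += delta * np.where(mask, A0 - A, 0.0) # dual update on observed entries only

rel_err = np.linalg.norm(A - A0) / np.linalg.norm(A0)
```

With 60% of the entries observed and rank 3, the relative recovery error drops to a small value after a few hundred iterations.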

6 Iterative Thresholding for Corrupted Low-rank Matrix Recovery

There are a number of possible approaches to solving the robust PCA semidefinite program (2). For small problem sizes, interior point methods offer superior accuracy and convergence rates. However, off-the-shelf interior point solvers become impractical for data matrices larger than about 70 × 70, due to the O(m⁶) complexity of solving the associated Newton system. In this section, we propose an alternative first-order method with far better scaling behavior, with only O(m³) complexity. It can easily work for matrices up to 1,000 × 1,000 on a desktop PC. We will use this algorithm for all of our experiments in the next section.

Our approach is a straightforward combination of iterative thresholding algorithms for nuclear norm minimization [4] and ℓ¹-norm minimization [31]. As such, it solves a slightly modified version of (2), in which a small multiple of the Frobenius norm is appended:

    min_{A,E}  ‖A‖_* + λ‖E‖₁ + (1/2τ)‖A‖²_F + (1/2τ)‖E‖²_F   subj   A + E = D.    (125)

Here, τ is a large constant; as τ → ∞, we expect the solution to coincide with that of (2).⁸ Our algorithm for solving (125) combines the algorithms of [4] for nuclear norm minimization and [31] for ℓ¹-norm minimization. It maintains an estimate of a dual variable Y ∈ R^{m×n} consisting of the Lagrange multipliers for the equality constraint D = A + E. The algorithm proceeds by soft thresholding the singular values of Y to obtain an estimate of A, and soft thresholding the matrix entries themselves to obtain an estimate of E. This is made precise in Algorithm 1 below.

Algorithm 1: Iterative Thresholding for Corrupted Low-rank Matrix Recovery
Input: Observation matrix D ∈ R^{m×n}, weights λ, τ.
while not converged
    (U, S, V) ← svd(Y_{k−1}),
    A_k ← U [S − τI]₊ V*,
    E_k ← sign(Y_{k−1}) ∘ [|Y_{k−1}| − λτ 11*]₊,
    Y_k ← Y_{k−1} + δ_k (D − A_k − E_k),
end
Output: A.

We prove that the iteration in Algorithm 1 converges to the global optimum of (125). The proof is a straightforward modification of the arguments of [4]. Let φ_τ(A, E) := τ‖A‖_* + λτ‖E‖₁ + ½‖A‖²_F + ½‖E‖²_F.

Lemma 6.1.
Let (Z_1, Z_2) ∈ ∂φ_τ(A, E) and (Z_1′, Z_2′) ∈ ∂φ_τ(A′, E′). Then

  ⟨Z_1 − Z_1′, A − A′⟩ + ⟨Z_2 − Z_2′, E − E′⟩ ≥ ‖A − A′‖_F^2 + ‖E − E′‖_F^2.

Proof. Since (Z_1, Z_2) ∈ ∂φ_τ(A, E), Z_1 ∈ ∂(τ‖A‖_* + (1/2)‖A‖_F^2) and Z_2 ∈ ∂(λτ‖E‖_1 + (1/2)‖E‖_F^2). We can therefore write Z_1 = τ(UV^T + Q) + A, where USV^T is a compact singular value decomposition of A, ‖Q‖ ≤ 1, U^T Q = 0, and QV = 0. A similar expression can be given for Z_1′. So,

  ⟨Z_1 − Z_1′, A − A′⟩ = τ⟨UV^T + Q − U′V′^T − Q′, A − A′⟩ + ‖A − A′‖_F^2
   = τ( ‖A‖_* + ‖A′‖_* − ⟨UV^T + Q, A′⟩ − ⟨U′V′^T + Q′, A⟩ ) + ‖A − A′‖_F^2,   (16)

8 This can most likely be shown by a similar argument to Theorem 3.1 of [4].

where we have used that ⟨UV^T, A⟩ = tr[V U^T U S V^T] = tr[S] = ‖A‖_*. Since ⟨UV^T + Q, A′⟩ ≤ ‖A′‖_* (and likewise for ⟨U′V′^T + Q′, A⟩), the right-hand side of (16) is ≥ ‖A − A′‖_F^2. Similarly, Z_2 = λτW + E, where W_ij = sign(E_ij) for E_ij ≠ 0, and W_ij ∈ [−1, 1] if E_ij = 0. So, we have that

  ⟨Z_2 − Z_2′, E − E′⟩ = λτ⟨W − W′, E − E′⟩ + ‖E − E′‖_F^2
   = λτ( ⟨W, E⟩ + ⟨W′, E′⟩ − ⟨W, E′⟩ − ⟨W′, E⟩ ) + ‖E − E′‖_F^2
   = λτ( ‖E‖_1 + ‖E′‖_1 − ⟨W, E′⟩ − ⟨W′, E⟩ ) + ‖E − E′‖_F^2.

Since again ⟨W, E′⟩ ≤ ‖E′‖_1 and ⟨W′, E⟩ ≤ ‖E‖_1, this quantity is bounded below by ‖E − E′‖_F^2. Combining the bounds for the two parts completes the proof.

As in [4], let D_α denote the singular value shrinkage operator, which maps a matrix A with SVD A = USV^T to U[S − αI]_+ V^T, and let S_β denote the entrywise soft thresholding operator, (S_β[E])_ij = sign(E_ij) [ |E_ij| − β ]_+.

Lemma 6.2. The thresholding operator ψ_{α,β} : (A, E) ↦ (D_α[A], S_β[E]) solves the optimization problem

  min_{B_1, B_2} α‖B_1‖_* + β‖B_2‖_1 + (1/2)‖B_1 − A‖_F^2 + (1/2)‖B_2 − E‖_F^2.   (17)

Proof. This follows from the optimality properties of D_α and S_β individually.

Theorem 6.3. Suppose that 0 < δ_k < 1 for all k. Then the sequence (A_k, E_k) obtained by Algorithm 1 converges to the unique global optimum of (15).

Proof. Suppose that (A★, E★) is optimal for (15), and Y★ is the matrix of Lagrange multipliers for the constraint A + E = D. Then there exist (Z_1★, Z_2★) ∈ ∂φ_τ(A★, E★) such that Z_1★ − Y★ = 0 and Z_2★ − Y★ = 0. Similarly, since (A_k, E_k) = ψ_{τ,λτ}(Y_{k−1}, Y_{k−1}), by Lemma 6.2,^9 there exist (Z_1^k, Z_2^k) ∈ ∂φ_τ(A_k, E_k) such that Z_1^k − Y_{k−1} = 0 and Z_2^k − Y_{k−1} = 0. Combining the two, we can write

  Y_{k−1} − Y★ = Z_1^k − Z_1★   and   Y_{k−1} − Y★ = Z_2^k − Z_2★.

This implies, via Lemma 6.1, that

  ⟨A_k − A★, Y_{k−1} − Y★⟩ = ⟨A_k − A★, Z_1^k − Z_1★⟩ ≥ ‖A_k − A★‖_F^2,
  ⟨E_k − E★, Y_{k−1} − Y★⟩ = ⟨E_k − E★, Z_2^k − Z_2★⟩ ≥ ‖E_k − E★‖_F^2.

Now, using D = A★ + E★,

  ‖Y_k − Y★‖_F^2 = ‖Y_{k−1} − Y★ + δ_k(D − A_k − E_k)‖_F^2
   = ‖Y_{k−1} − Y★‖_F^2 + 2δ_k⟨Y_{k−1} − Y★, D − A_k − E_k⟩ + δ_k^2 ‖D − A_k − E_k‖_F^2
   = ‖Y_{k−1} − Y★‖_F^2 − 2δ_k⟨Y_{k−1} − Y★, A_k − A★⟩ − 2δ_k⟨Y_{k−1} − Y★, E_k − E★⟩ + δ_k^2 ‖D − A_k − E_k‖_F^2
   ≤ ‖Y_{k−1} − Y★‖_F^2 − 2δ_k‖A_k − A★‖_F^2 − 2δ_k‖E_k − E★‖_F^2 + 2δ_k^2 ‖A_k − A★‖_F^2 + 2δ_k^2 ‖E_k − E★‖_F^2,

where the last step bounds ‖D − A_k − E_k‖_F^2 = ‖(A★ − A_k) + (E★ − E_k)‖_F^2 ≤ 2‖A_k − A★‖_F^2 + 2‖E_k − E★‖_F^2.

9 Notice that, with (α, β) = (τ, λτ), the objective function in (17) can be written as φ_τ(B_1, B_2) − ⟨(B_1, B_2), (A, E)⟩ + (1/2)‖(A, E)‖_F^2.
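For concreteness, the iteration analyzed above can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the authors' reference code: the stopping rule, the fixed step size delta, and all function and variable names are our own assumptions.

```python
import numpy as np

def sv_shrink(Y, tau):
    """D_tau: soft-threshold the singular values of Y (Lemma 6.2)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ (np.maximum(s - tau, 0.0)[:, None] * Vt)

def soft(Y, beta):
    """S_beta: entrywise soft thresholding."""
    return np.sign(Y) * np.maximum(np.abs(Y) - beta, 0.0)

def robust_pca_it(D, lam, tau, delta=0.9, max_iter=20000, tol=1e-4):
    """Iterative-thresholding sketch of Algorithm 1 for problem (15):
    min ||A||_* + lam ||E||_1 + (1/2 tau)(||A||_F^2 + ||E||_F^2)
    subject to A + E = D."""
    Y = np.zeros_like(D)              # dual variable (Lagrange multipliers)
    A = np.zeros_like(D)
    E = np.zeros_like(D)
    for _ in range(max_iter):
        A = sv_shrink(Y, tau)         # A_k = U [S - tau I]_+ V^T
        E = soft(Y, lam * tau)        # E_k = sgn(Y) [|Y| - lam tau]_+
        R = D - A - E                 # primal residual
        Y = Y + delta * R             # dual ascent step, 0 < delta < 1
        if np.linalg.norm(R, 'fro') <= tol * max(np.linalg.norm(D, 'fro'), 1.0):
            break
    return A, E
```

Per Theorem 6.3, with the step sizes in (0, 1) the residual D − A_k − E_k is driven to zero, so the loop's exit test is a natural (though heuristic) convergence criterion.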

36 Hence, if 0 < δ k < 1, the sequence Y k Y is nonincreasing. This sequence therefore converges to a liit (possibly 0), and therefore A k A 0 and E k E 0, establishing the result. 7 Siulations and Experients In this section, we first perfor several siulations corroborating the our theoretical results and clarifying their iplications. We then sketch applications to two coputer vision tasks involving the recovery of intrinsically low-diensional data fro gross corruption: background estiation fro video and face alignent under varying illuination Siulation: proportional growth. We deonstrate the efficacy of the proposed iterative thresholding algorith on randoly generated atrices. or siplicity, we restrict our exaples to square atrices. We draw A 0 according to the independent rando orthogonal odel described earlier. We generate E 0 as a sparse atrix whose support is chosen uniforly at rando, and whose non-zero entries are independent and uniforly distributed in the range 500, 500]. We apply our proposed algorith on the atrix D =. A 0 + E 0 to recover  and Ê. The results are presented in Table 1. or these experients, we choose λ = 1/ and τ = 10, 000. We observe that the proposed algorith is successful in recovering A 0 even when 10% of its entries are corrupted. The relative error in recovering A 0 decreases with increase in diensionality. We note that though the algorith in soe cases overestiates the rank of A, we observe that the extra nonzero singular values of  tend to have very sall agnitude. rank(a 0 ) E 0 0  A 0 A 0 rank(â) Ê 0 #iterations tie (s) , , ,011 8, , ,014 10,000 6, , ,081 6,775 31, , ,005 10, , ,0 10, , ,091 10,000 6, , ,55 10,000 35,981 Table 1: Proportional growth. Here the rank of the atrix grows in proportion (5%) to the diensionality ; and the nuber of corrupted easureents grows in proportion to the nuber of entries, top 5% and botto 10%, respectively. 
10 Note that it might seem to be overkill to cast these relatively simple vision tasks as robust PCA problems. There may exist easier and faster solutions that harness special domain information or additional structures in the visual data. Here, we merely use these intuitive examples and data to illustrate how our algorithm can be used as a general tool to effectively extract low-dimensional and sparse structures from real data.

7.2 Simulation: phase transition w.r.t. (ρ_r, ρ_s). We examine how the rank of A_0 and the proportion of errors in E_0 affect the performance of our algorithm. We fix m = 200 and then construct a random matrix pair (A_0, E_0) satisfying rank(A_0) = ρ_r m and ‖E_0‖_0 = ρ_s m^2, where both ρ_r and ρ_s vary between 0 and 1. The parameters λ and τ are the same as in the previous experiment. For each choice of ρ_r and ρ_s, we apply the proposed algorithm to D, and term a given experiment a success if the relative error ‖Â − A_0‖_F / ‖A_0‖_F falls below a small fixed threshold. Each experiment is repeated over 10 trials, and the results are plotted in Figure 1 (left). The color of each cell reflects the empirical recovery rate: white denotes perfect recovery in all trials, and black denotes failure in all trials. We see a relatively sharp phase transition between success and failure of the algorithm near the line ρ_r + ρ_s = 0.5. To verify this behavior, we repeat the experiment, but only vary ρ_r and ρ_s between 0 and 0.3. These results, shown in Figure 1 (right), confirm that the phase transition remains fairly sharp even at higher resolution. These results are conservative, and can potentially be improved by increasing the maximum number of iterations (here chosen to be 5,000), or by varying the choices of λ and τ.

Figure 1: Phase transition w.r.t. (ρ_r, ρ_s). Left: (ρ_r, ρ_s) ∈ [0, 1]^2. Right: (ρ_r, ρ_s) ∈ [0, 0.3]^2.

7.3 Experiment: background modeling from video. Background modeling, or subtraction, from video sequences is a popular approach to detecting activity in a scene, and finds application in video surveillance from static cameras. The key problem here is to guarantee robustness to partial occlusions and illumination changes in the scene. Background subtraction fits nicely into our model: if the individual frames are stacked as columns of a matrix D, then D can be expressed as the sum of a low-rank background matrix and a sparse error matrix representing the activity in the scene. We illustrate this idea with two examples (see Figures 2 and 3).^10
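A minimal sketch of the frame stacking just described (the function names are ours): each frame becomes one column of D, and a low-rank/sparse split of D is read back frame by frame.

```python
import numpy as np

def frames_to_matrix(frames):
    """Stack h-by-w frames as the columns of D (one column per frame)."""
    return np.stack([f.ravel() for f in frames], axis=1)

def matrix_to_frames(M, shape):
    """Undo the stacking: column k of M becomes frame k of size `shape`."""
    return [M[:, k].reshape(shape) for k in range(M.shape[1])]
```

A perfectly static scene yields a rank-one D, so the background component of a real sequence is naturally low-rank, while moving objects contribute sparse errors to individual columns.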
In Figure 2, the video sequence consists of 200 frames of a scene in an airport. There is no significant change in illumination in the video, but a lot of activity in the foreground. We observe that our algorithm is very effective in separating the background from the activity. In Figure 3, we have 550 frames from a scene in a lobby. There is little activity in the video, but the illumination changes drastically towards the end of the sequence. We see that our algorithm is once again able to recover the background, irrespective of the illumination change.


More information

Highly Robust Error Correction by Convex Programming

Highly Robust Error Correction by Convex Programming Highly Robust Error Correction by Convex Prograing Eanuel J. Candès and Paige A. Randall Applied and Coputational Matheatics, Caltech, Pasadena, CA 9115 Noveber 6; Revised Noveber 7 Abstract This paper

More information

Generalized eigenfunctions and a Borel Theorem on the Sierpinski Gasket.

Generalized eigenfunctions and a Borel Theorem on the Sierpinski Gasket. Generalized eigenfunctions and a Borel Theore on the Sierpinski Gasket. Kasso A. Okoudjou, Luke G. Rogers, and Robert S. Strichartz May 26, 2006 1 Introduction There is a well developed theory (see [5,

More information

OPTIMIZATION in multi-agent networks has attracted

OPTIMIZATION in multi-agent networks has attracted Distributed constrained optiization and consensus in uncertain networks via proxial iniization Kostas Margellos, Alessandro Falsone, Sione Garatti and Maria Prandini arxiv:603.039v3 [ath.oc] 3 May 07 Abstract

More information

On Poset Merging. 1 Introduction. Peter Chen Guoli Ding Steve Seiden. Keywords: Merging, Partial Order, Lower Bounds. AMS Classification: 68W40

On Poset Merging. 1 Introduction. Peter Chen Guoli Ding Steve Seiden. Keywords: Merging, Partial Order, Lower Bounds. AMS Classification: 68W40 On Poset Merging Peter Chen Guoli Ding Steve Seiden Abstract We consider the follow poset erging proble: Let X and Y be two subsets of a partially ordered set S. Given coplete inforation about the ordering

More information

arxiv: v1 [cs.ds] 17 Mar 2016

arxiv: v1 [cs.ds] 17 Mar 2016 Tight Bounds for Single-Pass Streaing Coplexity of the Set Cover Proble Sepehr Assadi Sanjeev Khanna Yang Li Abstract arxiv:1603.05715v1 [cs.ds] 17 Mar 2016 We resolve the space coplexity of single-pass

More information

Optimal Jamming Over Additive Noise: Vector Source-Channel Case

Optimal Jamming Over Additive Noise: Vector Source-Channel Case Fifty-first Annual Allerton Conference Allerton House, UIUC, Illinois, USA October 2-3, 2013 Optial Jaing Over Additive Noise: Vector Source-Channel Case Erah Akyol and Kenneth Rose Abstract This paper

More information

Iterative Decoding of LDPC Codes over the q-ary Partial Erasure Channel

Iterative Decoding of LDPC Codes over the q-ary Partial Erasure Channel 1 Iterative Decoding of LDPC Codes over the q-ary Partial Erasure Channel Rai Cohen, Graduate Student eber, IEEE, and Yuval Cassuto, Senior eber, IEEE arxiv:1510.05311v2 [cs.it] 24 ay 2016 Abstract In

More information

Convex Programming for Scheduling Unrelated Parallel Machines

Convex Programming for Scheduling Unrelated Parallel Machines Convex Prograing for Scheduling Unrelated Parallel Machines Yossi Azar Air Epstein Abstract We consider the classical proble of scheduling parallel unrelated achines. Each job is to be processed by exactly

More information

Fixed-to-Variable Length Distribution Matching

Fixed-to-Variable Length Distribution Matching Fixed-to-Variable Length Distribution Matching Rana Ali Ajad and Georg Böcherer Institute for Counications Engineering Technische Universität München, Gerany Eail: raa2463@gail.co,georg.boecherer@tu.de

More information

Robustness and Regularization of Support Vector Machines

Robustness and Regularization of Support Vector Machines Robustness and Regularization of Support Vector Machines Huan Xu ECE, McGill University Montreal, QC, Canada xuhuan@ci.cgill.ca Constantine Caraanis ECE, The University of Texas at Austin Austin, TX, USA

More information

A Markov Framework for the Simple Genetic Algorithm

A Markov Framework for the Simple Genetic Algorithm A arkov Fraework for the Siple Genetic Algorith Thoas E. Davis*, Jose C. Principe Electrical Engineering Departent University of Florida, Gainesville, FL 326 *WL/NGS Eglin AFB, FL32542 Abstract This paper

More information

VC Dimension and Sauer s Lemma

VC Dimension and Sauer s Lemma CMSC 35900 (Spring 2008) Learning Theory Lecture: VC Diension and Sauer s Lea Instructors: Sha Kakade and Abuj Tewari Radeacher Averages and Growth Function Theore Let F be a class of ±-valued functions

More information

A note on the realignment criterion

A note on the realignment criterion A note on the realignent criterion Chi-Kwong Li 1, Yiu-Tung Poon and Nung-Sing Sze 3 1 Departent of Matheatics, College of Willia & Mary, Williasburg, VA 3185, USA Departent of Matheatics, Iowa State University,

More information

IN modern society that various systems have become more

IN modern society that various systems have become more Developent of Reliability Function in -Coponent Standby Redundant Syste with Priority Based on Maxiu Entropy Principle Ryosuke Hirata, Ikuo Arizono, Ryosuke Toohiro, Satoshi Oigawa, and Yasuhiko Takeoto

More information

arxiv: v1 [cs.ds] 29 Jan 2012

arxiv: v1 [cs.ds] 29 Jan 2012 A parallel approxiation algorith for ixed packing covering seidefinite progras arxiv:1201.6090v1 [cs.ds] 29 Jan 2012 Rahul Jain National U. Singapore January 28, 2012 Abstract Penghui Yao National U. Singapore

More information

Physics 139B Solutions to Homework Set 3 Fall 2009

Physics 139B Solutions to Homework Set 3 Fall 2009 Physics 139B Solutions to Hoework Set 3 Fall 009 1. Consider a particle of ass attached to a rigid assless rod of fixed length R whose other end is fixed at the origin. The rod is free to rotate about

More information

Supplementary to Learning Discriminative Bayesian Networks from High-dimensional Continuous Neuroimaging Data

Supplementary to Learning Discriminative Bayesian Networks from High-dimensional Continuous Neuroimaging Data Suppleentary to Learning Discriinative Bayesian Networks fro High-diensional Continuous Neuroiaging Data Luping Zhou, Lei Wang, Lingqiao Liu, Philip Ogunbona, and Dinggang Shen Proposition. Given a sparse

More information

Tight Bounds for Maximal Identifiability of Failure Nodes in Boolean Network Tomography

Tight Bounds for Maximal Identifiability of Failure Nodes in Boolean Network Tomography Tight Bounds for axial Identifiability of Failure Nodes in Boolean Network Toography Nicola Galesi Sapienza Università di Roa nicola.galesi@uniroa1.it Fariba Ranjbar Sapienza Università di Roa fariba.ranjbar@uniroa1.it

More information

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical IEEE TRANSACTIONS ON INFORMATION THEORY Large Alphabet Source Coding using Independent Coponent Analysis Aichai Painsky, Meber, IEEE, Saharon Rosset and Meir Feder, Fellow, IEEE arxiv:67.7v [cs.it] Jul

More information

Ch 12: Variations on Backpropagation

Ch 12: Variations on Backpropagation Ch 2: Variations on Backpropagation The basic backpropagation algorith is too slow for ost practical applications. It ay take days or weeks of coputer tie. We deonstrate why the backpropagation algorith

More information

The Fundamental Basis Theorem of Geometry from an algebraic point of view

The Fundamental Basis Theorem of Geometry from an algebraic point of view Journal of Physics: Conference Series PAPER OPEN ACCESS The Fundaental Basis Theore of Geoetry fro an algebraic point of view To cite this article: U Bekbaev 2017 J Phys: Conf Ser 819 012013 View the article

More information