Composite optimization for robust blind deconvolution


Vasileios Charisopoulos*   Damek Davis*   Mateo Díaz†   Dmitriy Drusvyatskiy‡

*School of Operations Research and Information Engineering, Cornell University, Ithaca, NY 14850, USA; people.orie.cornell.edu/vc333/, people.orie.cornell.edu/dsd95/. †Center for Applied Mathematics, Cornell University, Ithaca, NY 14850, USA; people.cam.cornell.edu/md85/. ‡Department of Mathematics, University of Washington, Seattle, WA 98195; www.math.washington.edu/~ddrusv. Research of Drusvyatskiy was supported by the NSF DMS 6585 and CCF awards.

Abstract. The blind deconvolution problem seeks to recover a pair of vectors from a set of rank-one bilinear measurements. We consider a natural nonsmooth formulation of the problem and show that, under standard statistical assumptions, its moduli of weak convexity, sharpness, and Lipschitz continuity are all dimension independent. This phenomenon persists even when up to half of the measurements are corrupted by noise. Consequently, standard algorithms, such as the subgradient and prox-linear methods, converge at a rapid dimension-independent rate when initialized within constant relative error of the solution. We complete the paper with a new initialization strategy, complementing the local search algorithms. The initialization procedure is both provably efficient and robust to outlying measurements. Numerical experiments, on both simulated and real data, illustrate the developed theory and methods.

1 Introduction

A variety of tasks in data science amount to solving a nonlinear system F(x) = 0, where F: R^d → R^m is a highly structured smooth map. The setting in which F is a quadratic map already subsumes important problems such as phase retrieval [,37,47], blind deconvolution [4,33,36,49], matrix completion [3,8,48], and covariance matrix estimation [5,35], to name a few. Recent works have suggested a number of two-stage procedures for globally solving such problems. The first stage (initialization) yields a rough estimate x_0 of an optimal solution, often using spectral techniques. The second stage (local refinement) uses a local search algorithm that rapidly converges to an optimal solution when initialized at x_0. For a detailed discussion, we refer the reader to the recent survey [6]. The typical starting point for local refinement is to form an optimization problem

    min_{x ∈ X}  f(x) := h(F(x)),   (1.1)

where h is a carefully chosen penalty function and X is a constraint set. Most widely used penalties are smooth and convex; e.g., the squared ℓ2-norm h(z) = ‖z‖² is ubiquitous in this context. Equipped with such penalties, the problem (1.1) is smooth and therefore gradient-based methods become immediately applicable. The main analytic challenge is that the condition number λ_max(∇²f)/λ_min(∇²f) of the problem often grows with the dimension d of the ambient space. This is the case, for example, for the phase retrieval, blind deconvolution, and matrix completion problems; see e.g. [6] and references therein. Consequently, generic nonlinear programming guarantees yield efficiency estimates that are far too pessimistic. Instead, a fruitful strategy is to recognize that the Hessian may be well-conditioned along the relevant set of directions, which suffices to guarantee rapid convergence. This is where new insight and analytic techniques for each particular problem come to bear, e.g. [37,39,49]. Smoothness of the penalty function h in (1.1) is crucially used by the aforementioned techniques.

A different recent line of work [6, 0, , 5] has instead suggested the use of nonsmooth convex penalties, most notably the ℓ1-norm h(z) = ‖z‖_1. Such a nonsmooth formulation will play a central role in our work. A number of algorithms are available for nonsmooth compositional problems, most notably the subgradient method

    x_{t+1} = proj_X( x_t − α_t v_t )   with v_t ∈ ∂f(x_t),

and the prox-linear algorithm

    x_{t+1} = argmin_{x ∈ X}  h( F(x_t) + ∇F(x_t)(x − x_t) ) + (1/(2α_t))‖x − x_t‖².

The local convergence guarantees of both methods can be succinctly described as follows. Set X* := argmin_X f and suppose there exist constants ρ, µ, L > 0 satisfying:

(approximation)  | h(F(y)) − h( F(x) + ∇F(x)(y − x) ) | ≤ (ρ/2)‖y − x‖²  for all x, y ∈ X,
(sharpness)      f(x) − inf_X f ≥ µ · dist(x, X*)  for all x ∈ X,
(Lipschitz bound) ‖v‖ ≤ L  for all v ∈ ∂f(x) with dist(x, X*) ≤ 2µ/ρ.

Then, when equipped with an appropriate sequence α_t and initialized at a point x_0 satisfying dist(x_0, X*) ≤ µ/ρ, both the subgradient and prox-linear iterates will converge to an optimal solution of the problem (1.1). The prox-linear algorithm converges quadratically, while the subgradient method converges at a linear rate governed by the ratio µ/L ∈ (0, 1).

A possible advantage of nonsmooth techniques can be gleaned from the phase retrieval problem. The papers [5, Corollary 3.3] and [, Corollary 3.8] recently showed that, for the phase retrieval problem, standard statistical assumptions imply that with high probability all the constants ρ, µ, L > 0 are dimension independent. Consequently, the completely generic guarantees outlined above, without any modification, imply that both methods converge at a dimension-independent rate when initialized within constant relative error of the optimal solution. This is in sharp contrast to the smooth formulation of the problem, where a more nuanced analysis is required, based on restricted smoothness and convexity. Moreover, this approach is robust to outliers, in the sense that analogous guarantees persist even when up to half of the measurements are corrupted by noise.
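To make the subgradient update above concrete, the following minimal Julia sketch implements one projected subgradient step for a generic nonsmooth objective; the callables `subgrad_f` and `proj_X` are placeholders for a concrete instance (they are not defined in the paper) and must be supplied by the user.

```julia
using LinearAlgebra

# One projected subgradient step for min_{x in X} f(x):
#   x⁺ = proj_X(x - α v),  where v ∈ ∂f(x).
function subgradient_step(x::AbstractVector, α::Real, subgrad_f, proj_X)
    v = subgrad_f(x)            # any element of the subdifferential at x
    return proj_X(x .- α .* v)  # projected step with stepsize α
end
```

The prox-linear update differs only in that each step solves a small convex subproblem built from the linearized map F; a sketch of that outer loop appears in Section 3.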

In light of the success of the nonsmooth penalty approach for phase retrieval, it is intriguing to determine whether nonsmooth techniques can be fruitful for a wider class of large-scale problems. Our current work fits squarely within this research program. In this work, we analyze a nonsmooth penalty technique for the problem of blind deconvolution. Formally, we consider the task of robustly recovering a pair (w̄, x̄) ∈ R^{d1} × R^{d2} from bilinear measurements

    y_i = ⟨l_i, w̄⟩⟨r_i, x̄⟩ + η_i,   i = 1, ..., m,   (1.2)

where η is an arbitrary noise corruption with frequency p_fail := |supp(η)|/m that is at most one half, and l_i ∈ R^{d1} and r_i ∈ R^{d2} are known measurement vectors. Such bilinear systems and their complex analogues arise often in biological systems, control theory, coding theory, and image deblurring, among others. Most notably, such problems appear when recovering a pair (u, v) from the convolution measurements y = (Lu) ∗ (Rv). When passing to the Fourier domain, this problem is equivalent to solving a complex bilinear system of equations; see the pioneering work [4]. All the arguments we present can be extended to the complex case; we focus on the real case for simplicity. In this work, we analyze the following nonsmooth formulation of the problem:

    min_{‖w‖ ≤ √(νM̄), ‖x‖ ≤ √(νM̄)}  f(w, x) := (1/m) Σ_{i=1}^m | ⟨l_i, w⟩⟨r_i, x⟩ − y_i |,   (1.3)

where ν ≥ 1 is a user-specified constant and M̄ = ‖w̄x̄ᵀ‖_F. Our contributions are two-fold.

Local refinement. Suppose that the vectors l_i and r_i are both i.i.d. sub-Gaussian and satisfy a mild growth condition (which is automatically satisfied for Gaussian random vectors). We will show that, as long as the number of measurements satisfies m ≳ (d1 + d2)/(1 − 2p_fail)² · ln(1/(1 − 2p_fail)), the formulation (1.3) admits dimension-independent constants ρ, L, and µ with high probability. Consequently, the subgradient and prox-linear methods rapidly converge to the optimal solution at a dimension-independent rate when initialized at a point (w_0, x_0) with constant relative error ‖w_0x_0ᵀ − w̄x̄ᵀ‖_F / ‖w̄x̄ᵀ‖_F. Analogous results also hold under more general incoherence assumptions.

Initialization. Suppose now that l_i and r_i are both i.i.d. Gaussian and are independent from the noise η. We develop an initialization procedure that, in the regime m ≳ d1 + d2 and p_fail ∈ [0, 1/10], will find a point (w_0, x_0) with small constant relative error ‖w_0x_0ᵀ − w̄x̄ᵀ‖_F / ‖w̄x̄ᵀ‖_F, with high probability. To the best of our knowledge, this is the only available initialization procedure with provable guarantees in the presence of gross outliers. We also develop complementary guarantees under the weaker assumption that the vectors (l_i, r_i) corresponding to exact measurements are independent from the noise η_i in the outlying measurements. This noise model allows one to plant outlying measurements from a completely different pair of signals, and is therefore computationally more challenging.

The literature studying bilinear systems is rich. From the information-theoretic perspective [7, 9, 34], the optimal sample complexity in the noiseless regime is d1 + d2 if no further assumptions (e.g. sparsity) are imposed on the signals. Therefore, from a sample-complexity viewpoint, our guarantees are optimal. Incidentally, to our best knowledge, all

alternative approaches are either suboptimal by a polylogarithmic factor in d1, d2 or require knowing the sign pattern of one of the underlying signals [3, 4]. Recent algorithmic advances for blind deconvolution can be classified into two main approaches: works based on convex relaxations and those employing gradient descent on a smooth nonconvex function. The influential convex techniques of [3, 4] lift the objective to a higher dimension, thereby necessitating the resolution of a high-dimensional semidefinite program. The more recent work of [, ] instead relaxes the feasible region in the natural parameter space, under the assumption that the coordinate signs of either w̄ or x̄ are known a priori. Finally, with the exception of [4], the aforementioned works do not provide guarantees in the noisy regime.

Nonconvex approaches for blind deconvolution typically apply gradient descent to a smooth formulation of the problem [7, 33, 37]. Since the condition number of the problem scales with dimension, as we mentioned previously, these works introduce a nuanced analysis that is specific to the gradient method. The authors of [33] propose applying gradient descent on a regularized objective function and identify a basin of attraction around the solution. The paper [37] instead analyzes gradient descent on the unregularized objective; they use the leave-one-out technique and prove that the iterates remain within a region where the objective function satisfies restricted strong convexity and smoothness conditions. The sample complexities of the methods in [7, 33, 37] are optimal up to polylog factors. The nonconvex strategies mentioned above all use spectral methods for initialization. These methods are not robust to outliers, since they rely on the leading singular vectors/values of a potentially noisy measurement operator. Adapting the spectral initialization of [5] to bilinear inverse problems enables us to deal with gross outliers of arbitrary magnitude. Indeed, high-variance noise makes it easier for our initialization to reject outlying measurements.

The outline of the paper is as follows. Section 2 records basic notation used throughout the paper. Section 3 reviews the impact of sharpness and weak convexity on the rapid convergence of numerical methods. Section 4 establishes estimates of the weak convexity, sharpness, and Lipschitz moduli for the blind deconvolution problem under both deterministic and statistical assumptions on the data. Section 5 introduces the initialization procedure and proves its correctness even if a constant fraction of measurements is corrupted by gross outliers. The final Section 6 presents numerical experiments illustrating the theoretical results of the paper.

2 Notation

This section records basic notation that we will use throughout the paper. We always endow R^d with the dot product ⟨x, y⟩ = xᵀy and the induced norm ‖x‖ = √⟨x, x⟩. The symbol S^{d−1} denotes the unit sphere in R^d, while B denotes the open unit ball. When convenient, we will use the notation B_d to emphasize the dimension of the ambient space. More generally, B_r(x) will stand for the open ball of radius r around x. We define the distance and the nearest-point projection of a point x onto a closed set Q ⊂ R^d by

    dist(x, Q) = inf_{y ∈ Q} ‖x − y‖   and   proj_Q(x) = argmin_{y ∈ Q} ‖x − y‖,

respectively. For any pair of real-valued functions f, g: R^d → R, the notation f ≲ g means that there exists a positive constant C such that f(x) ≤ C g(x) for all x ∈ R^d. We write f ≍ g if both f ≲ g and g ≲ f. We will always use the trace inner product ⟨X, Y⟩ = Tr(XᵀY) on the space of matrices R^{d1×d2}. The symbols ‖A‖_op and ‖A‖_F will denote the operator and Frobenius norms of A, respectively. Assuming d1 ≤ d2, the map σ: R^{d1×d2} → R^{d1}_+ returns the vector of ordered singular values σ_1(A) ≥ σ_2(A) ≥ ... ≥ σ_{d1}(A). Note the equalities ‖A‖_F = ‖σ(A)‖ and ‖A‖_op = σ_1(A).

Nonsmooth functions will appear throughout this work. Consequently, we will use some basic constructions of generalized differentiation, as set out, for example, in the monographs [8, 38, 4, 45]. Consider a function f: R^d → R ∪ {+∞} and a point x with f(x) finite. The Fréchet subdifferential of f at x, denoted ∂f(x), is the set of all vectors v ∈ R^d satisfying

    f(y) ≥ f(x) + ⟨v, y − x⟩ + o(‖y − x‖)   as y → x.   (2.1)

Thus, a vector v lies in the subdifferential ∂f(x) precisely when the function y ↦ f(x) + ⟨v, y − x⟩ locally minorizes f up to first order. We say that a point x is stationary for f whenever the inclusion 0 ∈ ∂f(x) holds. Standard results show that for convex functions f the subdifferential ∂f(x) reduces to the subdifferential in the sense of convex analysis, while for differentiable functions it consists only of the gradient: ∂f(x) = {∇f(x)}. Notice that in general the little-o term in (2.1) may depend on the base point x, and the estimate may therefore be nonuniform. In this work, we will only encounter functions whose subgradients automatically satisfy a uniform type of lower-approximation property. We say that a function f: R^d → R ∪ {+∞} is ρ-weakly convex if the perturbed function x ↦ f(x) + (ρ/2)‖x‖² is convex.¹ It is straightforward to see that for any ρ-weakly convex function f, subgradients automatically satisfy the uniform bound

    f(y) ≥ f(x) + ⟨v, y − x⟩ − (ρ/2)‖y − x‖²   for all x, y ∈ R^d, v ∈ ∂f(x).

We will comment further on the class of weakly convex functions in Section 3. We say that a random vector X in R^d is η-sub-Gaussian whenever E exp(⟨u, X⟩²/η²) ≤ 2 for all vectors u ∈ S^{d−1}. The sub-Gaussian norm of a real-valued random variable X is defined to be ‖X‖_{ψ2} = inf{t > 0 : E exp(X²/t²) ≤ 2}, while the sub-exponential norm is defined by ‖X‖_{ψ1} = inf{t > 0 : E exp(|X|/t) ≤ 2}. Given a sample y = (y_1, ..., y_m), we will write med(y) to denote its median.

¹ Weakly convex functions also go by other names, such as lower-C², uniformly prox-regular, paraconvex, and semiconvex.

3 Algorithms for sharp weakly convex problems

The central thrust of this work is that, under reasonable statistical assumptions, the penalty formulation (1.3) satisfies two key properties: the objective function is weakly convex and grows at least linearly as one moves away from the solution set. In this section, we review

the consequences of these two properties for local rapid convergence of numerical methods. The discussion mostly follows the recent work [0], though elements of this viewpoint can already be seen in the two papers [, 5] on robust phase retrieval. Setting the stage, we introduce the following assumption.

Assumption A. Consider the optimization problem

    min_{x ∈ X} f(x),   (3.1)

and suppose that the following properties hold for some real µ, ρ > 0.

(A1) (Weak convexity) The set X is closed and convex, while the function f: R^d → R is ρ-weakly convex.
(A2) (Sharpness) The set of minimizers X* := argmin_{x∈X} f(x) is nonempty, and the inequality f(x) − inf_X f ≥ µ · dist(x, X*) holds for all x ∈ X.

The class of weakly convex functions is broad and its importance in optimization is well documented [5, 40, 43, 44, 46]. It trivially includes all convex functions and all C¹-smooth functions with Lipschitz gradient. More broadly, it includes all compositions f(x) = h(F(x)), where h is convex and L-Lipschitz and F is C¹-smooth with β-Lipschitz Jacobian; indeed, the composite function f = h ∘ F is then weakly convex with parameter ρ = Lβ, see e.g. [4, Lemma 4.2]. In particular, our target problem (1.3) is clearly weakly convex, being a composition of the ℓ1-norm and a quadratic map. The estimate ρ = Lβ on the weak convexity constant is often much too pessimistic, however. Indeed, under statistical assumptions, we will see that the target problem (1.3) has a much better weak convexity constant.

The notion of sharpness, and the related error-bound property, is now ubiquitous in nonlinear optimization. Indeed, sharpness underlies much of perturbation theory and the rapid convergence guarantees of various numerical methods. For a systematic treatment of error bounds and their applications, we refer the reader to the monographs of Dontchev–Rockafellar [] and Ioffe [8], and the article of Lewis–Pang [3].

Taken together, weak convexity and sharpness provide an appealing framework for deriving local rapid convergence guarantees for numerical methods. In this work, we specifically focus on two such procedures: the subgradient and prox-linear algorithms. To this end, we aim to estimate both the radius of rapid convergence around the solution set and the rate of convergence. Our ultimate goal is to show that, when specialized to our target problem (1.3), with high probability both of these quantities are independent of the ambient dimensions d1 and d2 as soon as the number of measurements m is sufficiently large.

Both the subgradient and prox-linear algorithms have the property that, when initialized at a stationary point of the problem, they may stay there for all subsequent iterations. Since we are interested in finding global minima, and not just stationary points, we must therefore estimate the neighborhood of the solution set that has no extraneous stationary points. This is the content of the following simple lemma [0, Lemma 3].

Lemma 3.1. Suppose that Assumption A holds. Then the problem (3.1) has no stationary points x satisfying 0 < dist(x, X*) < 2µ/ρ.

Proof. Fix a stationary point x ∈ X \ X*. Letting x̄ := proj_{X*}(x), we deduce

    µ · dist(x, X*) ≤ f(x) − f(x̄) ≤ (ρ/2)‖x − x̄‖² = (ρ/2) dist²(x, X*).

Dividing through by dist(x, X*), the result follows. ∎

The estimate 2µ/ρ of the radius in Lemma 3.1 is tight. To see this, consider minimizing the univariate function f(x) = |x² − λ²| on the real line X = R. Observe that the set of minimizers is X* = {±λ}, while x = 0 is always an extraneous stationary point. A quick computation shows that the smallest valid weak convexity constant is ρ = 2, while the largest valid sharpness constant is µ = λ. We therefore deduce dist(0, X*) = λ = 2µ/ρ. Hence the radius 2µ/ρ of the region devoid of extraneous stationary points is tight.

In light of Lemma 3.1, let us define for any γ > 0 the tube

    T_γ := { z ∈ R^d : dist(z, X*) ≤ γ · µ/ρ }.   (3.2)

Thus we would like to search for algorithms whose basin of attraction is a tube T_γ for some numerical constant γ > 0. Due to the above discussion, such a basin of attraction is in essence optimal.

We next discuss two rapidly converging algorithms. The first is the Polyak subgradient method, outlined in Algorithm 1. Notice that the only parameter needed to implement the procedure is the minimal value of the problem (3.1). This value is sometimes known; case in point, the minimal value of the penalty formulation (1.3) is zero when the bilinear measurements are exact.

Algorithm 1: Polyak Subgradient Method
  Data: x_0 ∈ R^d
  Step k (k ≥ 0): Choose ζ_k ∈ ∂f(x_k). If ζ_k = 0, then exit the algorithm.
    Set x_{k+1} = proj_X( x_k − ((f(x_k) − min_X f)/‖ζ_k‖²) ζ_k ).

The rate of convergence of the method relies on the Lipschitz constant and the condition measure

    L := sup{ ‖ζ‖ : ζ ∈ ∂f(x), x ∈ T_1 }   and   τ := µ/L.

A straightforward argument [0, Lemma 3] shows τ ∈ [0, 1]. The following theorem appears as [0, Theorem 4], while its application to phase retrieval was investigated in [].

Theorem 3.2 (Polyak subgradient method). Suppose that Assumption A holds and fix a real γ ∈ (0, 1). Then Algorithm 1, initialized at any point x_0 ∈ T_γ, produces iterates that converge Q-linearly to X*, that is,

    dist²(x_{k+1}, X*) ≤ ( 1 − (1 − γ)τ² ) dist²(x_k, X*)   for all k ≥ 0.
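For concreteness, here is a minimal Julia sketch of Algorithm 1 for a generic instance; the callables `f`, `subgrad`, `proj_X` and the optimal value `fmin` are assumed to be supplied by the user (for the penalty (1.3) with exact measurements one may take `fmin = 0`). The stopping rule and iteration cap are illustrative choices, not part of the paper.

```julia
using LinearAlgebra

# Polyak subgradient method (Algorithm 1).  The step length is dictated by the
# gap f(x_k) - min f, so the only tuning input is the optimal value `fmin`.
function polyak_subgradient(x0, f, subgrad, proj_X, fmin; maxit = 500)
    x = copy(x0)
    for k in 1:maxit
        ζ = subgrad(x)
        nζ = norm(ζ)
        nζ == 0 && break                         # stationary point: stop
        x = proj_X(x .- ((f(x) - fmin) / nζ^2) .* ζ)
    end
    return x
end
```

Algorithm 2 below replaces the Polyak step length by the geometrically decaying choice λ q^k / ‖ζ_k‖, which only changes the single line computing the update.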

When the minimal value of the problem (3.1) is unknown, there is a straightforward modification of the subgradient method that converges R-linearly. The idea is to choose a geometrically decaying control sequence for the stepsize. The disadvantage is that the convergence guarantees rely on being able to tune estimates of L, ρ, and µ.

Algorithm 2: Subgradient method with geometrically decreasing stepsize
  Data: Real λ > 0 and q ∈ (0, 1)
  Step k (k ≥ 0): Choose ζ_k ∈ ∂f(x_k). If ζ_k = 0, then exit the algorithm.
    Set stepsize α_k = λ q^k. Update the iterate x_{k+1} = proj_X( x_k − α_k ζ_k/‖ζ_k‖ ).

The following theorem appears as [0, Theorem 6]. The convex version of the result dates back to Goffin [6].

Theorem 3.3 (Geometrically decaying subgradient method). Suppose that Assumption A holds, fix a real γ ∈ (0, 1), and suppose τ ≤ √(1/(2 − γ)). Set λ := γµ²/(ρL) and q := √(1 − (1 − γ)τ²). Then the iterates x_k generated by Algorithm 2, initialized at a point x_0 ∈ T_γ, satisfy

    dist²(x_k, X*) ≤ (γ²µ²/ρ²) ( 1 − (1 − γ)τ² )^k   for all k ≥ 0.   (3.4)

Notice that both subgradient Algorithms 1 and 2 are at best locally linearly convergent, with a relatively cheap per-iteration cost. As the last example, we discuss an algorithm that is specifically designed for convex compositions and is locally quadratically convergent. The caveat is that the method may have a high per-iteration cost, since in each iteration one must solve an auxiliary convex optimization problem. Setting the stage, let us introduce the following assumption.

Assumption B. Consider the optimization problem

    min_{x ∈ X} f(x) := h(F(x)),   (3.5)

and suppose that the following properties hold for some real µ, ρ > 0.

(B1) (Convexity and smoothness) The function h and the set X are convex, and F is differentiable.
(B2) (Approximation accuracy) The convex models f_x(y) := h( F(x) + ∇F(x)(y − x) ) satisfy the estimate

    | f(y) − f_x(y) | ≤ (ρ/2)‖y − x‖²   for all x, y ∈ X.   (3.6)

(B3) (Sharpness) The set of minimizers X* := argmin_{x∈X} f(x) is nonempty, and the inequality f(x) − inf_X f ≥ µ · dist(x, X*) holds for all x ∈ X.

It is straightforward to see that Assumption B implies that f is ρ-weakly convex; see e.g. [4, Lemma 7.3]. Therefore Assumption B implies Assumption A. Algorithm 3 describes the prox-linear method, a close variant of Gauss–Newton. For a historical account of the prox-linear method, see e.g. [0, 4, 3] and the references therein.

Algorithm 3: Prox-linear algorithm
  Data: Initial point x_0 ∈ R^d, proximal parameter β > 0
  Step k (k ≥ 0): Set x_{k+1} ← argmin_{x ∈ X} { h( F(x_k) + ∇F(x_k)(x − x_k) ) + (β/2)‖x − x_k‖² }.

The following theorem proves that under Assumption B the prox-linear method converges quadratically when initialized sufficiently close to the solution set. Guarantees of this type have appeared, for example, in [, 3, 3, 5]. For the sake of completeness, we provide a quick argument.

Theorem 3.4 (Prox-linear algorithm). Suppose Assumption B holds. Choose any β ≥ ρ and set γ := ρ/β. Then Algorithm 3, initialized at any point x_0 ∈ T_γ, converges quadratically:

    dist(x_{k+1}, X*) ≤ (β/µ) dist²(x_k, X*)   for all k ≥ 0.

Proof. Consider an iterate x_k and choose any x̄ ∈ proj_{X*}(x_k). Taking into account that the function x ↦ f_{x_k}(x) + (β/2)‖x − x_k‖² is β-strongly convex and x_{k+1} is its minimizer, we deduce

    f_{x_k}(x_{k+1}) + (β/2)‖x_{k+1} − x_k‖² + (β/2)‖x_{k+1} − x̄‖² ≤ f_{x_k}(x̄) + (β/2)‖x̄ − x_k‖².

Using Assumption B2, we therefore obtain

    f(x_{k+1}) + (β/2)‖x_{k+1} − x̄‖² ≤ f(x̄) + β‖x̄ − x_k‖².

Rearranging and using the sharpness Assumption B3, we conclude

    µ · dist(x_{k+1}, X*) ≤ f(x_{k+1}) − f(x̄) ≤ β · dist²(x_k, X*),

as claimed. ∎

4 Assumptions and Models

In this section, we aim to interpret the efficiency of the subgradient and prox-linear algorithms discussed in Section 3 when applied to our target problem (1.3). To this end, we must estimate the three parameters ρ, µ, L > 0. These quantities control both the size of the attraction neighborhood around the optimal solution set and the rate of convergence within that neighborhood. In particular, we will show that these quantities are independent of the ambient dimensions d1, d2 under natural assumptions on the data generating mechanism.
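Before estimating these parameters for the blind deconvolution objective, we record, for reference, a minimal Julia sketch of the prox-linear outer loop of Algorithm 3. The convex subproblem solver `solve_model` is left abstract here (Section 6 instantiates it with a graph-splitting ADMM); it, together with `h`, `F`, and the Jacobian `JF`, is an assumed user-supplied input rather than part of the paper.

```julia
using LinearAlgebra

# Outer loop of the prox-linear method (Algorithm 3).  Each iteration builds the
# convex model h(F(x_k) + JF(x_k)(x - x_k)) + (β/2)‖x - x_k‖² and hands it to a
# user-supplied convex solver, warm-started at the current iterate.
function proxlinear(x0, h, F, JF, β, solve_model; maxit = 20)
    x = copy(x0)
    for k in 1:maxit
        mdl = y -> h(F(x) + JF(x) * (y - x)) + (β / 2) * norm(y - x)^2
        x = solve_model(mdl, x)
    end
    return x
end
```

By Theorem 3.4, roughly log log(1/ε) outer iterations suffice once the method is started inside the tube T_γ, which is why only a handful of calls to `solve_model` are needed in practice.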

It will be convenient, for the time being, to abstract away from the formulation (1.3) and instead consider the function

    g(w, x) := (1/m)‖A(wxᵀ) − y‖_1,   (4.1)

where A: R^{d1×d2} → R^m is an arbitrary linear map and y ∈ R^m is an arbitrary vector. The formulation (1.3) corresponds to the particular linear map A(X) = ( l_iᵀ X r_i )_{i=1}^m. Since we will be interested in the prox-linear method, let us define the convex models

    g_{(w,x)}(ŵ, x̂) := (1/m)‖ A( wxᵀ + w(x̂ − x)ᵀ + (ŵ − w)xᵀ ) − y ‖_1.

Our strategy is as follows. Section 4.1 identifies deterministic assumptions on the data A and y that yield favorable estimates of ρ, µ, L > 0. Then Section 4.2 shows that these deterministic assumptions hold with high probability under natural statistical assumptions on the data generating mechanism.

4.1 Favorable deterministic properties

The following property, widely used in the literature, will play a central role in our analysis.

Assumption C (Restricted Isometry Property (RIP)). There exist constants c1, c2 > 0 such that for all matrices X ∈ R^{d1×d2} of rank at most two the following bound holds:

    c1‖X‖_F ≤ (1/m)‖A(X)‖_1 ≤ c2‖X‖_F.

The following proposition estimates the two constants ρ and L governing the performance of the subgradient and prox-linear methods under Assumption C.

Proposition 4.1 (Approximation accuracy and Lipschitz continuity). Suppose Assumption C holds and let K > 0 be arbitrary. Then the following estimates hold:

    | g(ŵ, x̂) − g_{(w,x)}(ŵ, x̂) | ≤ (c2/2)‖(w, x) − (ŵ, x̂)‖²   for all x, x̂ ∈ R^{d2}, w, ŵ ∈ R^{d1},
    | g(w, x) − g(ŵ, x̂) | ≤ √2 c2 K ‖(w, x) − (ŵ, x̂)‖   for all x, x̂ ∈ K·B, w, ŵ ∈ K·B.

Proof. To see the first estimate, observe

    | g(ŵ, x̂) − g_{(w,x)}(ŵ, x̂) |
      = (1/m)| ‖A(ŵx̂ᵀ) − y‖_1 − ‖A( wxᵀ + w(x̂ − x)ᵀ + (ŵ − w)xᵀ ) − y‖_1 |
      ≤ (1/m)‖ A( ŵx̂ᵀ − wxᵀ − w(x̂ − x)ᵀ − (ŵ − w)xᵀ ) ‖_1
      = (1/m)‖ A( (w − ŵ)(x − x̂)ᵀ ) ‖_1
      ≤ c2 ‖(w − ŵ)(x − x̂)ᵀ‖_F
      ≤ (c2/2)( ‖w − ŵ‖² + ‖x − x̂‖² ),

where the last estimate follows from Young's inequality 2ab ≤ a² + b². Now suppose w, ŵ ∈ K·B and x, x̂ ∈ K·B. We then successively compute

    | g(w, x) − g(ŵ, x̂) | ≤ (1/m)‖A( wxᵀ − ŵx̂ᵀ )‖_1 ≤ c2‖wxᵀ − ŵx̂ᵀ‖_F
      = c2‖ (w − ŵ)xᵀ + ŵ(x − x̂)ᵀ ‖_F
      ≤ c2‖x‖·‖w − ŵ‖ + c2‖ŵ‖·‖x − x̂‖
      ≤ √2 c2 K ‖(w, x) − (ŵ, x̂)‖.

The proof is complete. ∎

We next move on to estimates of the sharpness constant µ. To this end, consider two vectors w̄ ∈ R^{d1} and x̄ ∈ R^{d2}, and set M̄ := ‖x̄w̄ᵀ‖_F = ‖x̄‖·‖w̄‖. Without loss of generality, we henceforth suppose ‖w̄‖ = ‖x̄‖. Our estimates on the sharpness constant will be valid only on bounded sets. Consequently, define the two sets

    S_ν := √(νM̄)·( B_{d1} × B_{d2} ),   S*_ν := { (αw̄, x̄/α) : 1/√ν ≤ α ≤ √ν }.

The set S_ν simply encodes a bounded region, while S*_ν encodes all rank-one factorizations of the matrix w̄x̄ᵀ with bounded factors. We begin with the following proposition, which analyzes the sharpness properties of the idealized function (w, x) ↦ ‖wxᵀ − w̄x̄ᵀ‖_F. The proof is quite long, and we have therefore placed it in Appendix A.

Proposition 4.2. For any ν ≥ 1, we have the bound

    ‖wxᵀ − w̄x̄ᵀ‖_F ≥ ( √M̄ / (√2(ν + 1)) ) · dist( (w, x), S*_ν )   for all (w, x) ∈ S_ν.

Thus the function (w, x) ↦ ‖wxᵀ − w̄x̄ᵀ‖_F is sharp on the set S_ν with coefficient √M̄/(√2(ν + 1)). We note in passing that the analogue of Proposition 4.2 for symmetric matrices was proved in [49, Lemma 5.4]. The sharpness of the loss g(·,·) in the noiseless regime (i.e. when y = A(w̄x̄ᵀ)) is now immediate.

Proposition 4.3 (Sharpness in the noiseless regime). Suppose that Assumption C holds and that the equality y = A(w̄x̄ᵀ) holds. Then for any ν ≥ 1, we have the bound

    g(w, x) − g(w̄, x̄) ≥ ( c1√M̄ / (√2(ν + 1)) ) · dist( (w, x), S*_ν )   for all (w, x) ∈ S_ν.

Proof. Using Assumption C and Proposition 4.2, we deduce for all (w, x) ∈ S_ν

    g(w, x) − g(w̄, x̄) = (1/m)‖A( wxᵀ − w̄x̄ᵀ )‖_1 ≥ c1‖wxᵀ − w̄x̄ᵀ‖_F ≥ ( c1√M̄/(√2(ν + 1)) ) dist( (w, x), S*_ν ),

as claimed. ∎
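Assumption C can be probed numerically for a given operator A. The sketch below, a rough empirical check rather than anything from the paper, draws random rank-two matrices X and records the ratios (1/m)‖A(X)‖_1/‖X‖_F, whose minimum and maximum estimate the constants c1 and c2; the Gaussian data is illustrative only.

```julia
using LinearAlgebra, Random

# Empirically estimate the RIP constants (c1, c2) of Assumption C for the
# operator A(X)_i = l_i' X r_i, by sampling random rank-two matrices X.
function estimate_rip(L, R; trials = 1000)
    m, _ = size(L)
    d1, d2 = size(L, 2), size(R, 2)
    ratios = Float64[]
    for _ in 1:trials
        X = randn(d1, 2) * randn(2, d2)           # random rank-two matrix
        AX = sum((L * X) .* R, dims = 2)          # AX[i] = l_i' X r_i
        push!(ratios, sum(abs, AX) / (m * norm(X)))
    end
    return minimum(ratios), maximum(ratios)        # estimates of (c1, c2)
end
```

On Gaussian data with m a modest multiple of d1 + d2, the two returned values should be of the same order and essentially independent of the dimensions, matching the theory of Section 4.2.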

Sharpness in the noisy case requires an additional assumption, which we record below. Henceforth, for any set I ⊂ {1, ..., m}, we define the restricted linear map A_I: R^{d1×d2} → R^{|I|} by setting A_I(X) := ( A(X)_i )_{i∈I}.

Assumption D (I-outlier bounds). There exist a set I ⊂ {1, ..., m}, vectors w̄ ∈ R^{d1}, x̄ ∈ R^{d2}, and a constant c3 > 0 such that the following hold.

(D1) The equality y_i = A(w̄x̄ᵀ)_i holds for all i ∉ I.
(D2) For all matrices X ∈ R^{d1×d2} of rank at most two, we have

    c3‖X‖_F ≤ (1/m)( ‖A_{I^c}(X)‖_1 − ‖A_I(X)‖_1 ).   (4.2)

Combining Assumption D with Proposition 4.2 quickly yields sharpness of the objective even in the noisy setting.

Proposition 4.4 (Sharpness in the noisy regime). Suppose that Assumption D holds. Then

    g(w, x) − g(w̄, x̄) ≥ ( c3√M̄ / (√2(ν + 1)) ) · dist( (w, x), S*_ν )   for all (w, x) ∈ S_ν.

Proof. Defining η = A(w̄x̄ᵀ) − y, we have the bound

    m( g(w, x) − g(w̄, x̄) ) = ‖A( wxᵀ − w̄x̄ᵀ ) + η‖_1 − ‖η‖_1
      = Σ_{i∉I} | A( wxᵀ − w̄x̄ᵀ )_i | + Σ_{i∈I} ( | A( wxᵀ − w̄x̄ᵀ )_i + η_i | − |η_i| )
      ≥ Σ_{i∉I} | A( wxᵀ − w̄x̄ᵀ )_i | − Σ_{i∈I} | A( wxᵀ − w̄x̄ᵀ )_i |
      = ‖A_{I^c}( wxᵀ − w̄x̄ᵀ )‖_1 − ‖A_I( wxᵀ − w̄x̄ᵀ )‖_1
      ≥ m c3 ‖wxᵀ − w̄x̄ᵀ‖_F
      ≥ m ( c3√M̄/(√2(ν + 1)) ) dist( (w, x), S*_ν ),

where the first inequality follows from the reverse triangle inequality, the second inequality follows from Assumption D2, and the final inequality follows from Proposition 4.2. The proof is complete. ∎

To summarize, suppose Assumptions C and D are valid. Then, in the notation of Section 3, we may set

    ρ = c2,   L = c2√(2νM̄),   µ = c3√M̄/(√2(ν + 1)).

Consequently, the tube T_1 has radius µ/ρ = c3√M̄/(√2 c2(ν + 1)), and the linear convergence rate of the subgradient method is governed by the ratio τ = µ/L = c3/(2c2√(ν(ν + 1))). In particular, the local search algorithms must be initialized at a point (w, x) whose relative distance to the solution set, dist((w, x), S*_ν)/√(‖w̄x̄ᵀ‖_F), is upper bounded by a constant. We record this conclusion below.

Corollary 4.5 (Convergence guarantees). Suppose Assumptions C and D are valid, and consider the optimization problem

    min_{(w,x) ∈ S_ν} g(w, x) = (1/m)‖A(wxᵀ) − y‖_1.

With ρ, L, µ, τ as above, choose any pair (w_0, x_0) satisfying

    dist( (w_0, x_0), S*_ν ) / √(‖w̄x̄ᵀ‖_F) ≤ c3 / ( 4 c2(ν + 1) ).

Then the following are true.

1. (Polyak subgradient) Algorithm 1 initialized at (w_0, x_0) produces iterates that converge linearly to S*_ν, that is,

    dist²( (w_k, x_k), S*_ν ) ≤ ( c3²‖w̄x̄ᵀ‖_F / (16 c2²(ν + 1)²) ) ( 1 − c3²/(8c2²ν(ν + 1)) )^k   for all k ≥ 0.

2. (geometric subgradient) Set λ := µ²/(2ρL) and q := √(1 − τ²/2). Then the iterates generated by Algorithm 2, initialized at (w_0, x_0), converge linearly:

    dist²( (w_k, x_k), S*_ν ) ≤ ( c3²‖w̄x̄ᵀ‖_F / (8 c2²(ν + 1)²) ) ( 1 − c3²/(8c2²ν(ν + 1)) )^k   for all k ≥ 0.

3. (prox-linear) Algorithm 3 with β = ρ and initialized at (w_0, x_0) converges quadratically:

    dist( (w_k, x_k), S*_ν ) ≤ ( c3√(‖w̄x̄ᵀ‖_F) / (√2 c2(ν + 1)) ) · 2^{−2^k}   for all k ≥ 0.

4.2 Assumptions under generative models

In this section, we present natural generative models under which Assumptions C and D are guaranteed to hold. Recall that, at a high level, we aim to recover the pair of signals (w̄, x̄) from the corrupted bilinear measurements y. Formally, let us fix two disjoint sets I_in ⊂ [m] and I_out ⊂ [m], called the inlier and outlier sets. Intuitively, the index set I_in encodes exact measurements, while I_out encodes measurements that have been replaced by gross outliers. Define the corruption frequency p_fail := |I_out|/m; henceforth, we suppose p_fail ∈ [0, 1/2). Then, for an arbitrary (potentially random) sequence {ξ_i}_{i=1}^m, we consider the measurement model

    y_i := ⟨l_i, w̄⟩⟨r_i, x̄⟩  if i ∈ I_in,   y_i := ξ_i  if i ∈ I_out.   (4.3)
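A minimal Julia sketch of the measurement model (4.3) with Gaussian measurement vectors follows; the specific corruption ξ_i ~ N(0, 1) is only one admissible choice (the model places no restriction on ξ), and the function name and interface are ours.

```julia
using LinearAlgebra, Random

# Generate corrupted bilinear measurements following (4.3):
# y_i = <l_i, w̄><r_i, x̄> for inliers, y_i = ξ_i for outliers.
function generate_measurements(wtrue, xtrue, m, pfail; rng = Random.default_rng())
    d1, d2 = length(wtrue), length(xtrue)
    L, R = randn(rng, m, d1), randn(rng, m, d2)    # rows are l_i' and r_i'
    y = (L * wtrue) .* (R * xtrue)                 # exact bilinear measurements
    outliers = randperm(rng, m)[1:round(Int, pfail * m)]
    y[outliers] .= randn(rng, length(outliers))    # gross corruptions ξ_i (illustrative)
    return L, R, y, outliers
end
```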

In accordance with the previous section, we define the linear map A: R^{d1×d2} → R^m by A(X) = ( l_iᵀ X r_i )_{i=1}^m. To simplify notation, we let L ∈ R^{m×d1} denote the matrix whose rows (in column form) are l_i, and we let R ∈ R^{m×d2} denote the matrix whose rows are r_i. Note that we make no assumptions about the nature of ξ_i; in particular, ξ_i can even encode exact measurements for a different signal.

We focus on two measurement matrix models. The first model requires both matrices L and R to be random. For simplicity, the reader may assume both are Gaussian with i.i.d. entries, though the results of this paper extend beyond this case. The second model allows semi-deterministic matrices, namely deterministic L and Gaussian R with i.i.d. entries. In later parts of the paper, we will put further incoherence assumptions on the deterministic matrix L.

Random matrix models:

(M1) The vectors l_i and r_i are i.i.d. realizations of η-sub-Gaussian random vectors l ∈ R^{d1} and r ∈ R^{d2}, respectively. Suppose moreover that l and r are independent and satisfy, for some real µ_0, p_0 > 0, the nondegeneracy condition

    inf_{X : rank X ≤ 2, ‖X‖_F = 1} P( |lᵀXr| ≥ µ_0 ) ≥ p_0.   (4.4)

(M2) The matrix L is arbitrary and the matrix R is standard Gaussian.

Some comments are in order. The model M1 is fully stochastic, in the sense that l_i and r_i are generated by independent sub-Gaussian random vectors. The nondegeneracy condition (4.4) essentially asserts that, with positive probability, the products lᵀXr are non-negligible, uniformly over all unit-norm rank-two matrices X. In particular, the following example shows that Gaussian matrices with i.i.d. entries are admissible under Model M1. In contrast, the model M2 is semi-stochastic: it allows L to be deterministic, while making the stronger assumption that R is Gaussian.

Example 4 (Gaussian matrices satisfy Model M1). Assume that l and r are standard Gaussian random vectors in R^{d1} and R^{d2}, respectively. We claim this setting is admissible under M1. To see this, fix a rank-two matrix X having unit Frobenius norm. Consider a singular value decomposition X = σ_1 u_1 v_1ᵀ + σ_2 u_2 v_2ᵀ, and note the equality σ_1² + σ_2² = 1. For each index i = 1, 2, define a_i := ⟨l, u_i⟩ and b_i := ⟨v_i, r⟩. Then clearly a_1, a_2, b_1, b_2 are i.i.d. standard Gaussian; see e.g. [5, Exercise 3.3.6]. Thus, for any c ≥ 0, we compute

    P( |lᵀXr| ≥ c ) = P( |σ_1 a_1 b_1 + σ_2 a_2 b_2| ≥ c ) = E[ P( |σ_1 a_1 b_1 + σ_2 a_2 b_2| ≥ c | a_1, a_2 ) ].

Notice that, conditioned on (a_1, a_2), we have σ_1 a_1 b_1 + σ_2 a_2 b_2 ∼ N(0, σ_1²a_1² + σ_2²a_2²). Thus, letting z be a standard normal variable, we have

    P( |lᵀXr| ≥ c ) = E[ P( √(σ_1²a_1² + σ_2²a_2²)·|z| ≥ c | a_1, a_2 ) ] = P( √(σ_1²a_1² + σ_2²a_2²)·|z| ≥ c )
      ≥ P( σ_1|a_1 z| ≥ c ) ≥ P( |a_1 z| ≥ √2 c ),

where the last inequality uses σ_1 ≥ 1/√2. Therefore, we may simply set µ_0 = med(|a_1 z|)/√2 and p_0 = 1/2.
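The constants µ_0 and p_0 of the nondegeneracy condition (4.4) can also be estimated by simulation for any given distribution of (l, r). The sketch below, which is illustrative only, checks the Gaussian case of the example above: with µ_0 = med(|a z|)/√2 the returned empirical probability should be at least about 1/2 for any unit-Frobenius-norm rank-two X.

```julia
using LinearAlgebra, Random

# Monte-Carlo check of the nondegeneracy condition (4.4) for Gaussian l, r and a
# fixed unit-Frobenius-norm rank-two matrix X: estimate P(|l' X r| >= μ0).
function check_nondegeneracy(X, μ0; trials = 10^5)
    d1, d2 = size(X)
    hits = 0
    for _ in 1:trials
        l, r = randn(d1), randn(d2)
        hits += abs(dot(l, X * r)) >= μ0
    end
    return hits / trials        # empirical lower bound for p0
end
```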

4.2.1 Assumptions C and D under Model M1

In this section, we aim to prove the following theorem, which shows the validity of Assumptions C and D under M1 with high probability.

Theorem 4.6 (Measurement Model M1). Consider a set I ⊂ {1, ..., m} satisfying |I| < m/2. Then there exist constants c1, ..., c6 > 0, depending only on µ_0, p_0, η, such that the following holds. As long as

    m ≥ c1 (d1 + d2 + 1)/(1 − 2|I|/m)² · ln( c2/(1 − 2|I|/m) ),

then with probability at least 1 − 4exp( −c3 (1 − 2|I|/m)² m ), every matrix X ∈ R^{d1×d2} of rank at most two satisfies

    c4‖X‖_F ≤ (1/m)‖A(X)‖_1 ≤ c5‖X‖_F,   (4.5)
    (1/m)( ‖A_{I^c}(X)‖_1 − ‖A_I(X)‖_1 ) ≥ c6 (1 − 2|I|/m) ‖X‖_F.   (4.6)

Due to scale invariance, in the proof we only concern ourselves with matrices X of rank at most two satisfying ‖X‖_F = 1. Let us fix such a matrix X and an arbitrary index set I ⊂ {1, ..., m} with |I| < m/2. We begin with the following lemma.

Lemma 4.7 (Pointwise concentration). The random variable |lᵀXr| is sub-exponential with parameter ≲ η², and consequently the estimate holds:

    µ_0 p_0 ≤ E|lᵀXr| ≲ η².   (4.7)

Moreover, there exists a numerical constant c > 0 such that for any t ∈ (0, η²], with probability at least 1 − 2exp(−c m t²/η⁴), we have the estimate

    | (1/m)( ‖A_{I^c}(X)‖_1 − ‖A_I(X)‖_1 ) − (1/m) E( ‖A_{I^c}(X)‖_1 − ‖A_I(X)‖_1 ) | ≤ t.   (4.8)

Proof. Markov's inequality along with (4.4) implies

    E|lᵀXr| ≥ µ_0 P( |lᵀXr| ≥ µ_0 ) ≥ µ_0 p_0,

which is the lower bound in (4.7). We now address the upper bound. To that end, suppose that X has a singular value decomposition X = σ_1 U_1 V_1ᵀ + σ_2 U_2 V_2ᵀ. We then deduce

    ‖lᵀXr‖_{ψ1} = ‖ σ_1⟨l, U_1⟩⟨V_1, r⟩ + σ_2⟨l, U_2⟩⟨V_2, r⟩ ‖_{ψ1}
      ≤ σ_1‖⟨l, U_1⟩⟨V_1, r⟩‖_{ψ1} + σ_2‖⟨l, U_2⟩⟨V_2, r⟩‖_{ψ1}
      ≤ σ_1‖⟨l, U_1⟩‖_{ψ2}‖⟨V_1, r⟩‖_{ψ2} + σ_2‖⟨l, U_2⟩‖_{ψ2}‖⟨V_2, r⟩‖_{ψ2}
      ≤ (σ_1 + σ_2)η² ≲ η²,

where the second inequality follows since ‖·‖_{ψ1} is a norm and ‖XY‖_{ψ1} ≤ ‖X‖_{ψ2}‖Y‖_{ψ2} [5, Lemma 2.7.7]. This bound has two consequences: first, lᵀXr is a sub-exponential random variable with parameter ≲ η², and second, E|lᵀXr| ≲ η²; see [5, Exercise 7]. The first bound will be useful momentarily, while the second completes the proof of (4.7).

Next, define the sub-exponential random variables

    Y_i = |l_iᵀXr_i| − E|l_iᵀXr_i|  if i ∉ I,   Y_i = −( |l_iᵀXr_i| − E|l_iᵀXr_i| )  if i ∈ I.

Standard results (e.g. [5, Exercise 7.10]) imply ‖Y_i‖_{ψ1} ≲ η² for all i. Using Bernstein's inequality for sub-exponential random variables (Theorem C.6) to upper bound P( |(1/m)Σ_{i=1}^m Y_i| ≥ t ) completes the proof. ∎

Proof of Theorem 4.6. Choose ε ∈ (0, 1/4] and let N be the (ε/√2)-net guaranteed by Lemma C.1. Let E denote the event that the following two estimates hold for all matrices X ∈ N:

    | (1/m)( ‖A_{I^c}(X)‖_1 − ‖A_I(X)‖_1 ) − (1/m)E( ‖A_{I^c}(X)‖_1 − ‖A_I(X)‖_1 ) | ≤ t,   (4.9)
    | (1/m)‖A(X)‖_1 − (1/m)E‖A(X)‖_1 | ≤ t.   (4.10)

Throughout the proof, we will assume that the event E holds; we estimate the probability of E at the end of the proof. Meanwhile, seeking to establish RIP, define the quantity ĉ := sup_{X∈S} (1/m)‖A(X)‖_1, where S denotes the set of rank-at-most-two matrices of unit Frobenius norm. We aim first to provide a high-probability bound on ĉ. Let X ∈ S be arbitrary and let X̂ be the closest point to X in N. Then we have

    (1/m)‖A(X)‖_1 ≤ (1/m)‖A(X̂)‖_1 + (1/m)‖A(X − X̂)‖_1   (4.11)
      ≤ (1/m)E‖A(X̂)‖_1 + t + (1/m)‖A(X − X̂)‖_1,   (4.12)

where (4.12) follows from (4.10). To simplify the last term in (4.12), using the SVD we deduce that there exist two mutually orthogonal matrices X_1, X_2 of rank at most two satisfying X − X̂ = X_1 + X_2. With this decomposition in hand, we compute

    (1/m)‖A(X − X̂)‖_1 ≤ (1/m)‖A(X_1)‖_1 + (1/m)‖A(X_2)‖_1 ≤ ĉ( ‖X_1‖_F + ‖X_2‖_F ) ≤ ĉ√2‖X − X̂‖_F ≤ ĉε,

where the second inequality follows from the definition of ĉ and the estimate ‖X_1‖_F + ‖X_2‖_F ≤ √2·‖X_1 + X_2‖_F. Thus, we arrive at the bound

    (1/m)‖A(X)‖_1 ≤ (1/m)E‖A(X̂)‖_1 + t + ĉε.   (4.13)

As X ∈ S was arbitrary, we may take the supremum of both sides of the inequality (4.13), yielding

    ĉ ≤ sup_{X∈S} (1/m)E‖A(X)‖_1 + t + ĉε.

Rearranging, and assuming ε ≤ 1/4, we further deduce

    ĉ ≤ ( sup_{X∈S} (1/m)E‖A(X)‖_1 + t )/(1 − ε) ≤ 2σ̄,   where σ̄ := sup_{X∈S} (1/m)E‖A(X)‖_1 + t,   (4.14)

establishing that the random variable ĉ is bounded by 2σ̄ in the event E. Now let Î denote either Î = ∅ or Î = I. We provide a uniform lower bound on (1/m)( ‖A_{Î^c}(X)‖_1 − ‖A_Î(X)‖_1 ). Indeed,

    (1/m)( ‖A_{Î^c}(X)‖_1 − ‖A_Î(X)‖_1 )
      = (1/m)( ‖A_{Î^c}(X̂) + A_{Î^c}(X − X̂)‖_1 − ‖A_Î(X̂) + A_Î(X − X̂)‖_1 )
      ≥ (1/m)( ‖A_{Î^c}(X̂)‖_1 − ‖A_Î(X̂)‖_1 ) − (1/m)‖A(X − X̂)‖_1   (4.15)
      ≥ (1/m)E( ‖A_{Î^c}(X̂)‖_1 − ‖A_Î(X̂)‖_1 ) − t − (1/m)‖A(X − X̂)‖_1   (4.16)
      ≥ (1/m)E( ‖A_{Î^c}(X)‖_1 − ‖A_Î(X)‖_1 ) − t − (1/m)E‖A(X − X̂)‖_1 − (1/m)‖A(X − X̂)‖_1   (4.17)
      ≥ (1/m)E( ‖A_{Î^c}(X)‖_1 − ‖A_Î(X)‖_1 ) − t − 3σ̄ε,   (4.18)

where (4.15) uses the forward and reverse triangle inequalities, (4.16) follows from (4.9), the estimate (4.17) again follows from the forward and reverse triangle inequalities, and (4.18) follows from the argument leading to (4.13) together with (4.14). Switching the roles of I and I^c in the above sequence of inequalities, and choosing ε = t/(4σ̄), we deduce

    sup_{X∈S} | (1/m)( ‖A_{Î^c}(X)‖_1 − ‖A_Î(X)‖_1 ) − (1/m)E( ‖A_{Î^c}(X)‖_1 − ‖A_Î(X)‖_1 ) | ≤ 3t.

In particular, setting Î = ∅, we deduce

    sup_{X∈S} | (1/m)‖A(X)‖_1 − (1/m)E‖A(X)‖_1 | ≤ 3t,

and therefore, using (4.7), we conclude the RIP property

    µ_0 p_0 − 3t ≤ (1/m)‖A(X)‖_1 ≲ η² + 3t   for all X ∈ S.   (4.19)

Next, let Î = I and note that

    (1/m)E( ‖A_Î^c(X)‖_1 − ‖A_Î(X)‖_1 ) = ( (|I^c| − |I|)/m )·E|lᵀXr| ≥ µ_0 p_0 (1 − 2|I|/m).

Therefore, every X ∈ S satisfies

    (1/m)( ‖A_{I^c}(X)‖_1 − ‖A_I(X)‖_1 ) ≥ µ_0 p_0 (1 − 2|I|/m) − 3t.   (4.20)

Setting t = (1/3)·min{ µ_0p_0/2, µ_0p_0(1 − 2|I|/m)/2 } = (1/6)µ_0p_0(1 − 2|I|/m) in (4.19) and (4.20), we deduce the claimed estimates (4.5) and (4.6). Finally, let us estimate the probability of E. Using Lemma 4.7 and the union bound yields

    P(E^c) ≤ Σ_{X∈N} P( (4.9) or (4.10) fails at X ) ≤ 4|N| exp( −c m t²/η⁴ )
      ≤ 4 exp( 2(d1 + d2 + 1)ln(9/ε) − c m t²/η⁴ ),

where the second inequality follows from Lemma C.1 and c is a numerical constant. Since 1/ε = 4σ̄/t ≲ (η² + 1)/(µ_0p_0(1 − 2|I|/m)), we deduce

    P(E^c) ≤ 4 exp( c'(d1 + d2 + 1) ln( c''/(1 − 2|I|/m) ) − (c µ_0²p_0²/(36η⁴))(1 − 2|I|/m)² m ).

Hence, as long as m ≥ (72η⁴c'/(cµ_0²p_0²))·(d1 + d2 + 1) ln( c''/(1 − 2|I|/m) )/(1 − 2|I|/m)², we can be sure that P(E^c) ≤ 4 exp( −(cµ_0²p_0²/(72η⁴))(1 − 2|I|/m)² m ). The result follows immediately. ∎

Combining Theorem 4.6 with Corollary 4.5, we obtain the following guarantee.

Corollary 4.8 (Convergence guarantees). Consider the measurement model (4.3) and suppose that Model M1 is valid. Consider the optimization problem

    min_{(w,x) ∈ S_ν}  f(w, x) = (1/m) Σ_{i=1}^m | ⟨l_i, w⟩⟨r_i, x⟩ − y_i |.

Then there exist constants c1, ..., c6 > 0, depending only on µ_0, p_0, η, such that, as long as

    m ≥ c1 (d1 + d2 + 1)/(1 − 2p_fail)² · ln( c2/(1 − 2p_fail) )

and one chooses any pair (w_0, x_0) with relative error

    dist( (w_0, x_0), S*_ν ) / √(‖w̄x̄ᵀ‖_F) ≤ c6(1 − 2p_fail) / ( 4 c5(ν + 1) ),   (4.21)

then with probability at least 1 − 4exp( −c3(1 − 2p_fail)² m ) the following are true.

1. (Polyak subgradient) Algorithm 1 initialized at (w_0, x_0) produces iterates that converge linearly to S*_ν, that is,

    dist²( (w_k, x_k), S*_ν ) ≤ ( c6²(1 − 2p_fail)²‖w̄x̄ᵀ‖_F / (16 c5²(ν + 1)²) ) ( 1 − c6²(1 − 2p_fail)²/(8c5²ν(ν + 1)) )^k   for all k ≥ 0.

2. (geometric subgradient) Set λ := µ²/(2ρL) and q := √(1 − τ²/2), with ρ = c5, L = c5√(2νM̄), µ = c6(1 − 2p_fail)√M̄/(√2(ν + 1)), and τ = µ/L. Then the iterates generated by Algorithm 2, initialized at (w_0, x_0), converge linearly at the same rate as in item 1.

3. (prox-linear) Algorithm 3 with β = ρ and initialized at (w_0, x_0) converges quadratically:

    dist( (w_k, x_k), S*_ν ) ≤ ( c6(1 − 2p_fail)√(‖w̄x̄ᵀ‖_F) / (√2 c5(ν + 1)) ) · 2^{−2^k}   for all k ≥ 0.

Thus, with high probability, if one initializes the subgradient and prox-linear methods at a pair (w_0, x_0) satisfying dist((w_0, x_0), S*_ν)/√(‖w̄x̄ᵀ‖_F) ≤ c6(1 − 2p_fail)/(4c5(ν + 1)), then the methods converge to the optimal solution set at a dimension-independent rate.

4.2.2 Assumptions C and D under Model M2

In this section, we verify Assumptions C and D under Model M2 and an extra incoherence condition. Namely, we impose further conditions on the ℓ_p/ℓ_2 singular values of L,

    σ_{p,min}(L) := inf_{w ∈ S^{d1−1}} ‖Lw‖_p   and   σ_{p,max}(L) := sup_{w ∈ S^{d1−1}} ‖Lw‖_p,

which intuitively guarantee that the entries of any vector in { A(X) : rank(X) ≤ 2 } are well spread.

Proposition 4.9 (Measurement Model M2). Assume Model M2 and fix an arbitrary index set I ⊂ {1, ..., m}. Define the parameter

    Δ := (1/(m√π))·( σ_{1,min}(L) − 4 σ_{1,max}(L_I) ),

and suppose Δ > 0. Then there exist numerical constants c1, c2, c3 > 0 such that, with probability at least

    1 − 4 exp( c1(d1 + d2 + 1) ln( c2( 1 + √m·σ_{2,max}(L)/(mΔ) ) ) − c3 m²Δ²/σ_{2,max}²(L) ),

every matrix X ∈ R^{d1×d2} of rank at most two satisfies

    ( σ_{1,min}(L)/(2m√π) )·‖X‖_F ≤ (1/m)‖A(X)‖_1 ≤ ( 2σ_{2,max}(L)/√(πm) + Δ )·‖X‖_F,   (4.22)

and

    (1/m)( ‖A_{I^c}(X)‖_1 − ‖A_I(X)‖_1 ) ≥ (Δ/2)·‖X‖_F.   (4.23)

Proof. The argument mirrors the proof of Theorem 4.6, and therefore we only provide a sketch. Fix a unit Frobenius norm matrix X of rank at most two. We aim to show that, for any fixed Î ⊂ {1, ..., m}, the following random variable is highly concentrated around its mean:

    Z_Î := (1/m)( ‖A_{Î^c}(X)‖_1 − ‖A_Î(X)‖_1 ).

To that end, fix a singular value decomposition X = s_1 u_1 v_1ᵀ + s_2 u_2 v_2ᵀ. We then compute

    A(X)_i = l_iᵀ( s_1 u_1 v_1ᵀ + s_2 u_2 v_2ᵀ )r_i = s_1⟨l_i, u_1⟩⟨v_1, r_i⟩ + s_2⟨l_i, u_2⟩⟨v_2, r_i⟩,

where v_1 and v_2 are orthogonal, s_1² + s_2² = 1, and ⟨v_1, r_i⟩, ⟨v_2, r_i⟩ are i.i.d. standard normal random variables. This decomposition, together with the rotation invariance of the normal distribution, furnishes the distributional equivalence

    A(X)_i  =_d  √( s_1²⟨l_i, u_1⟩² + s_2²⟨l_i, u_2⟩² )·γ_i,

where γ_i is a standard normal random variable. Consequently, we have the following expression for the expectation:

    E[Z_Î] = √(2/π)·(1/m)( Σ_{i∈Î^c} √( s_1²⟨l_i,u_1⟩² + s_2²⟨l_i,u_2⟩² ) − Σ_{i∈Î} √( s_1²⟨l_i,u_1⟩² + s_2²⟨l_i,u_2⟩² ) ).

We now upper/lower bound this expectation. The upper bound follows from the estimate

    E[Z_Î] ≤ E[ (1/m)‖A(X)‖_1 ] ≤ √(2/π)·(1/√m)·√( ‖Lu_1‖² + ‖Lu_2‖² ) ≤ 2σ_{2,max}(L)/√(πm).

The lower bound uses the two-dimensional inequality ‖z‖_2 ≥ ‖z‖_1/√2, which holds for all z ∈ R²:

    E[Z_Î] = √(2/π)·(1/m)( Σ_{i=1}^m √( s_1²⟨l_i,u_1⟩² + s_2²⟨l_i,u_2⟩² ) − 2Σ_{i∈Î} √( s_1²⟨l_i,u_1⟩² + s_2²⟨l_i,u_2⟩² ) )
      ≥ √(2/π)·(1/m)( (1/√2)( s_1‖Lu_1‖_1 + s_2‖Lu_2‖_1 ) − 2Σ_{i∈Î}( s_1|⟨l_i,u_1⟩| + s_2|⟨l_i,u_2⟩| ) )
      ≥ √(2/π)·(1/m)( (1/√2)σ_{1,min}(L) − 2√2·σ_{1,max}(L_Î) )
      = (1/(m√π))( σ_{1,min}(L) − 4σ_{1,max}(L_Î) ).

In particular, setting Î = ∅, we deduce

    σ_{1,min}(L)/(m√π) ≤ E[ (1/m)‖A(X)‖_1 ] ≤ 2σ_{2,max}(L)/√(πm).   (4.24)

To establish concentration of the random variable Z_Î, we apply a standard result (Theorem C.5) on the concentration of weighted sums of mean-zero independent sub-Gaussian random variables. To apply Theorem C.5, we write Y_i = |γ_i| − E|γ_i| and define the weights

    a_i = (1/m)√( s_1²⟨l_i,u_1⟩² + s_2²⟨l_i,u_2⟩² )  if i ∉ Î,   a_i = −(1/m)√( s_1²⟨l_i,u_1⟩² + s_2²⟨l_i,u_2⟩² )  if i ∈ Î.

Noticing that ‖ |γ_i| − E|γ_i| ‖_{ψ2} ≤ K, where K > 0 is an absolute constant, and

    ‖a‖² = (1/m²) Σ_{i=1}^m ( s_1²⟨l_i,u_1⟩² + s_2²⟨l_i,u_2⟩² ) ≤ σ_{2,max}²(L)/m²,

it follows that, for any fixed unit Frobenius norm matrix X of rank at most two, with probability at least 1 − 2exp( −c t² m²/(K²σ_{2,max}²(L)) ), we have

    | Z_Î − E[Z_Î] | ≤ t.   (4.25)

We have thus established concentration for any fixed X. We now proceed with a covering argument in the same way as in the proof of Theorem 4.6. To this end, choose ε ∈ (0, 1/4] and let N be the (ε/√2)-net guaranteed by Lemma C.1. Let E denote the event that the estimate (4.25) holds for all matrices X ∈ N, both for Î = ∅ and Î = I. Throughout, we assume that the event E holds. By exactly the same covering argument as in Theorem 4.6, setting ε = t/(4σ̄) with σ̄ := sup_{X∈S} E[(1/m)‖A(X)‖_1] + t, we deduce

    sup_{X∈S} | (1/m)( ‖A_{Î^c}(X)‖_1 − ‖A_Î(X)‖_1 ) − E[ (1/m)( ‖A_{Î^c}(X)‖_1 − ‖A_Î(X)‖_1 ) ] | ≤ 3t,

where either Î = ∅ or Î = I. In particular, setting Î = ∅ and using the bound (4.24), we deduce

    σ_{1,min}(L)/(m√π) − 3t ≤ (1/m)‖A(X)‖_1 ≤ 2σ_{2,max}(L)/√(πm) + 3t   for all X ∈ S.

In turn, setting Î = I, we deduce

    (1/m)( ‖A_{I^c}(X)‖_1 − ‖A_I(X)‖_1 ) ≥ E[ (1/m)( ‖A_{I^c}(X)‖_1 − ‖A_I(X)‖_1 ) ] − 3t ≥ Δ − 3t.

Setting t := Δ/6, the estimates (4.22) and (4.23) follow immediately (note that Δ ≤ σ_{1,min}(L)/(m√π), so that σ_{1,min}(L)/(m√π) − 3t ≥ σ_{1,min}(L)/(2m√π)). Finally, estimating the probability of E using the union bound quickly yields

    P(E^c) ≤ 4 exp( c1(d1 + d2 + 1) ln( c2( 1 + √m·σ_{2,max}(L)/(mΔ) ) ) − c3 m²Δ²/σ_{2,max}²(L) ).

The result follows. ∎

5 Initialization

The previous sections focused on local convergence guarantees under various statistical assumptions. In particular, under Assumptions C and D, one must initialize the local search procedures at a point (w, x) whose relative distance to the solution set, dist((w, x), S*_ν)/√(‖x̄w̄ᵀ‖_F), is upper bounded by a constant. In this section, we present a new spectral initialization routine (Algorithm 4) that is able to efficiently find such a point (w, x). The algorithm is inspired by [5, Section 4] and [5].

Before describing the intuition behind the procedure, let us formally introduce our assumptions. Throughout this section, we make the following assumption on the data generating mechanism, which is stronger than Model M1:

(M3) The entries of the matrices L and R are i.i.d. Gaussian.

Our arguments rely heavily on properties of the Gaussian distribution. We note, however, that our experimental results suggest that Algorithm 4 provides high-quality initializations under weaker distributional assumptions. Recall that in the previous sections the noise ξ was arbitrary. In this section, however, we must assume more about the nature of the noise. We consider two different settings:

(N1) The measurement vectors {(l_i, r_i)}_{i=1}^m and the noise sequence {ξ_i}_{i=1}^m are independent.
(N2) The inlying measurement vectors {(l_i, r_i)}_{i∈I_in} and the corrupted observations {ξ_i}_{i∈I_out} are independent.

The noise models N1 and N2 differ in how an adversary may choose to corrupt the measurements. Model N1 allows an adversary to corrupt the signal, but does not allow observation of the measurement vectors {(l_i, r_i)}_{i=1}^m. On the other hand, Model N2 allows an adversary to observe the outlying measurement vectors {(l_i, r_i)}_{i∈I_out} and arbitrarily corrupt those measurements. For example, the adversary may replace the outlying measurements with those taken from a completely different signal: y_i = A(w̄_ip x̄_ipᵀ)_i for i ∈ I_out.

We can now describe the intuition underlying Algorithm 4. Throughout, we denote the unit vectors parallel to w̄ and x̄ by w̃ and x̃, respectively. Algorithm 4 exploits the expected near-orthogonality of the random vectors l_i and r_i to the directions w̃ and x̃, respectively, in order to select a good set of measurement vectors. Namely, since E[⟨l_i, w̃⟩] = E[⟨r_i, x̃⟩] = 0, we expect the minimal eigenvectors of L_init and R_init to be nearly parallel to w̃ and x̃, respectively. Since our measurements are bilinear, we cannot necessarily select vectors for which |⟨l_i, w̃⟩| and |⟨r_i, x̃⟩| are both small; rather, we may only select vectors

for which the product |⟨l_i, w̃⟩⟨r_i, x̃⟩| is small, leading to subtle ambiguities not present in [5, Section 4] and [5]; see Figure 1. Corruptions add further ambiguities, since the noise model N2 allows a constant fraction of measurements to be adversarially modified.

Algorithm 4: Initialization
  Data: y ∈ R^m, L ∈ R^{m×d1}, R ∈ R^{m×d2}
  1. I_sel ← { i : |y_i| ≤ med(|y|) }.
  2. Form the directional estimates:
       L_init ← Σ_{i∈I_sel} l_i l_iᵀ,   R_init ← Σ_{i∈I_sel} r_i r_iᵀ,
       ŵ ← argmin_{p ∈ S^{d1−1}} pᵀ L_init p,   x̂ ← argmin_{q ∈ S^{d2−1}} qᵀ R_init q.
  3. Estimate the norm of the signal:
       M̂ ← argmin_{β ∈ R} G(β) := (1/m) Σ_{i=1}^m | y_i − β⟨l_i, ŵ⟩⟨r_i, x̂⟩ |.
  4. Return w_0 ← sign(M̂)·|M̂|^{1/2}·ŵ and x_0 ← |M̂|^{1/2}·x̂.

Figure 1: Intuition behind the spectral initialization. The pair (l_1, r_1) will be included, since both vectors are almost orthogonal to the true directions; (l_2, r_2) is unlikely to be included, since r_2 is almost aligned with x̃.

Formally, Algorithm 4 estimates an initial signal (w_0, x_0) in two stages: first it constructs a pair of directions (ŵ, x̂) which estimate the true directions w̃ := w̄/‖w̄‖ and x̃ := x̄/‖x̄‖

up to sign; then it constructs an estimate M̂ of the signed signal norm ±M̄, which corrects for sign errors in the first stage. We now discuss both stages in more detail, starting with the direction estimate. Most proofs are deferred to Appendix B. The general proof strategy we follow is analogous to [5, Section 4] for phase retrieval, with some subtle modifications due to asymmetry.

Direction estimate. In the first stage of the algorithm, we estimate the directions w̃ and x̃ up to sign. Key to our argument is the following decomposition (for Model N1), which will be proved in Appendix B:

    L_init = |I_sel|·( I_{d1} − γ_1 w̃w̃ᵀ ) + E_L,   R_init = |I_sel|·( I_{d2} − γ_2 x̃x̃ᵀ ) + E_R,

where γ_1, γ_2 ≍ 1 and the matrices E_L, E_R have small operator norm (decreasing with (d1 + d2)/m) with high probability. Using the Davis–Kahan sin θ theorem [9], we can then show that the minimal eigenvectors of L_init and R_init are sufficiently close to {±w̃} and {±x̃}, respectively.

Proposition 5.1 (Directional estimates). There exist numerical constants c1, c2, C > 0 so that, for any p_fail ∈ [0, 1/10] and t ∈ [0, 1], with probability at least 1 − c1exp(−c2 t² m), the following hold:

    min_{s∈{±1}} ‖ŵx̂ᵀ − s·w̃x̃ᵀ‖_F ≤ C( √(max{d1, d2}/m) + t )   under Model N1, and
    min_{s∈{±1}} ‖ŵx̂ᵀ − s·w̃x̃ᵀ‖_F ≤ C( √p_fail + √(max{d1, d2}/m) + t )   under Model N2.

Norm estimate. In the second stage of the algorithm, we estimate M̄ and correct the sign of the direction estimates from the previous stage. In particular, for any (ŵ, x̂) ∈ S^{d1−1} × S^{d2−1}, define the quantity

    δ := ( (c5 + 1)/(c6(1 − 2p_fail)) ) · min_{s∈{±1}} ‖ŵx̂ᵀ − s·w̃x̃ᵀ‖_F,   (5.1)

where c5 and c6 are as in Theorem 4.6. Then we prove the following estimate (see Appendix B).

Proposition 5.2 (Norm estimate). Under either noise model, N1 or N2, there exist numerical constants c1, ..., c6 > 0 so that, if m ≥ c1(d1 + d2 + 1)/(1 − 2p_fail)² · ln( c2/(1 − 2p_fail) ), then with probability at least 1 − 4exp(−c3(1 − 2p_fail)²m), any minimizer M̂ of the function

    G(β) := (1/m) Σ_{i=1}^m | y_i − β⟨l_i, ŵ⟩⟨x̂, r_i⟩ |

satisfies |M̂ − M̄| ≤ δM̄. Moreover, if in this event δ < 1, then we have sign(M̂) = argmin_{s∈{±1}} ‖ŵx̂ᵀ − s·w̃x̃ᵀ‖_F.
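The whole procedure fits in a few lines of Julia. The sketch below is our own rendering of Algorithm 4, assuming Gaussian L and R as in Model M3. The one-dimensional least-absolute-deviations step defining M̂ is solved here via the standard weighted-median characterization (a minimizer of Σ_i |y_i − β t_i| is a weighted median of the ratios y_i/t_i with weights |t_i|); that implementation detail is ours, not necessarily the authors'.

```julia
using LinearAlgebra, Statistics

# Spectral initialization (Algorithm 4).  Rows of L, R are the measurement vectors.
function initialize(L, R, y)
    Isel = findall(abs.(y) .<= median(abs.(y)))          # keep measurements with small |y_i|
    Linit = L[Isel, :]' * L[Isel, :]                     # Σ l_i l_i' over selected indices
    Rinit = R[Isel, :]' * R[Isel, :]                     # Σ r_i r_i'
    ŵ = eigen(Symmetric(Linit)).vectors[:, 1]            # minimal eigenvector
    x̂ = eigen(Symmetric(Rinit)).vectors[:, 1]
    t = (L * ŵ) .* (R * x̂)                               # t_i = <l_i, ŵ><r_i, x̂>
    M̂ = weighted_median(y ./ t, abs.(t))                 # argmin_β (1/m) Σ |y_i - β t_i|
    return sign(M̂) * sqrt(abs(M̂)) * ŵ, sqrt(abs(M̂)) * x̂
end

# Weighted median: a minimizer of Σ w_i |z_i - β|.
function weighted_median(z, w)
    p = sortperm(z)
    c = cumsum(w[p])
    return z[p[findfirst(c .>= c[end] / 2)]]
end
```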

Proposition 5.2 thus shows that tighter estimates on the norm M̄ result from better directional estimates in the first stage of Algorithm 4. In light of Proposition 5.2, we next estimate the probability of the event {δ ≤ 1/2}, which in particular implies that with high probability sign(M̂) = argmin_{s∈{±1}} ‖ŵx̂ᵀ − s·w̃x̃ᵀ‖_F.

Proposition 5.3 (Sign estimate). Under either Model N1 or N2, there exist numerical constants c0, c1, c2, c3 > 0 such that, if p_fail ≤ c0² and m ≥ c3(d1 + d2), then the estimate holds:

    P( δ > 1/2 ) ≤ c1 exp( −c2 m ).

Proof. Using Theorem 4.6 and Proposition 5.1, we deduce that for any t ∈ [0, 1], with probability at least 1 − c1exp(−c2 t² m), we have

    δ ≤ C( √(max{d1, d2}/m) + t )   under Model N1, and
    δ ≤ C( √p_fail + √(max{d1, d2}/m) + t )   under Model N2.

Thus, under Model N1 it suffices to set t = C'√(max{d1, d2}/m); the probability of the event {δ ≤ 1/2} is then at least 1 − c1exp(−c2 C'² max{d1, d2}). On the other hand, under Model N2 it suffices to assume that C√p_fail ≤ 1/4, and then we can set t = C'( √p_fail + √(max{d1, d2}/m) ); the probability of the event {δ ≤ 1/2} is then at least 1 − c1exp(−c2 C'² (p_fail·m + max{d1, d2})). Finally, using the bound max{d1, d2} ≥ (d1 + d2)/2 ≥ m/(2c3) yields the result. ∎

Step 3: Final estimate. Putting the directional and norm estimates together, we arrive at the following theorem.

Theorem 5.4. There exist numerical constants c0, c1, c2, c3, C > 0 such that, if p_fail ≤ c0³ and m ≥ c1(d1 + d2), then for all t ∈ [0, 1], with probability at least 1 − c2exp(−c3 t² m), we have

    ‖w_0x_0ᵀ − w̄x̄ᵀ‖_F / ‖w̄x̄ᵀ‖_F ≤ C( √(max{d1, d2}/m) + t )   under Model N1, and
    ‖w_0x_0ᵀ − w̄x̄ᵀ‖_F / ‖w̄x̄ᵀ‖_F ≤ C( √p_fail + √(max{d1, d2}/m) + t )   under Model N2.

² In the case of Model N1, one can set c0 = 1/10.
³ In the case of Model N1, one can set c0 = 1/10.

Proof. Suppose that we are in the events guaranteed by Propositions 5.1, 5.2, and 5.3. Then, noting that w_0 = sign(M̂)|M̂|^{1/2}ŵ and x_0 = |M̂|^{1/2}x̂,

we find that

    ‖w_0x_0ᵀ − w̄x̄ᵀ‖_F = ‖ sign(M̂)|M̂| ŵx̂ᵀ − M̄ w̃x̃ᵀ ‖_F
      ≤ |M̂| · ‖ ŵx̂ᵀ − sign(M̂) w̃x̃ᵀ ‖_F + | |M̂| − M̄ |
      ≤ 2M̄ · min_{s∈{±1}} ‖ŵx̂ᵀ − s·w̃x̃ᵀ‖_F + M̄δ
      = M̄ ( 2 + (c5 + 1)/(c6(1 − 2p_fail)) ) · min_{s∈{±1}} ‖ŵx̂ᵀ − s·w̃x̃ᵀ‖_F,

where c5 and c6 are defined in Theorem 4.6, and we used sign(M̂) = argmin_{s∈{±1}}‖ŵx̂ᵀ − s·w̃x̃ᵀ‖_F and |M̂| ≤ (1 + δ)M̄ ≤ 2M̄. Appealing to Proposition 5.1, the result follows. ∎

Combining Corollary 4.8 and Theorem 5.4, we arrive at the following guarantee for the two-stage procedure.

Corollary 5.5 (Efficiency estimates). Suppose either of the models N1 and N2 holds.³ Let (w_0, x_0) be the output of the initialization Algorithm 4. Set M̂ = ‖w_0x_0ᵀ‖_F and consider the optimization problem

    min_{‖w‖ ≤ √(2M̂), ‖x‖ ≤ √(2M̂)}  g(w, x) = (1/m)‖A(wxᵀ) − y‖_1.   (5.2)

Set ν := 2M̂/M̄ and notice that the feasible region of (5.2) coincides with S_ν. Then there exist constants c0, c1, c2, c3, c5 > 0 and c4 ∈ (0, 1) such that, as long as m ≥ c3(d1 + d2) and p_fail ≤ c0, the following properties hold with probability at least 1 − c1exp(−c2 m).

1. (subgradient) Both Algorithms 1 and 2 (the latter with appropriate λ, q), initialized at (w_0, x_0), produce iterates that converge linearly to S*_ν, that is,

    dist²( (w_k, x_k), S*_ν ) ≤ c4 (1 − c4)^k ‖w̄x̄ᵀ‖_F   for all k ≥ 0.

2. (prox-linear) Algorithm 3, initialized at (w_0, x_0) with appropriate β > 0, converges quadratically:

    dist( (w_k, x_k), S*_ν ) ≤ c5 · 2^{−2^k} · √(‖w̄x̄ᵀ‖_F)   for all k ≥ 0.

³ In the case of Model N1, one can set c0 = 1/10.

Proof. We provide the proof under Model N1; the proof under Model N2 is completely analogous. Combining Proposition 5.2, Proposition 5.3, and Theorem 5.4, we deduce that there exist constants c0, c1, c2, c3, C such that, as long as m ≥ c3(d1 + d2) and p_fail ≤ c0, then for any t ∈ [0, 1], with probability at least 1 − c1exp(−c2 t² m), we have

    | M̂/M̄ − 1 | ≤ δ ≤ 1/2,   (5.3)

and

    ‖w_0x_0ᵀ − w̄x̄ᵀ‖_F ≤ M̄ C( √(max{d1, d2}/m) + t ).

In particular, notice from (5.3) that ν ≤ 3, and therefore the feasible region S_ν contains an optimal solution of the original problem (1.3). Using Proposition 4.2, we have

    ‖w_0x_0ᵀ − w̄x̄ᵀ‖_F ≥ ( √M̄/(√2(ν + 1)) ) · dist( (w_0, x_0), S*_ν ).

Combining the estimates, we conclude

    dist( (w_0, x_0), S*_ν ) ≤ ( √2(ν + 1)/√M̄ ) ‖w_0x_0ᵀ − w̄x̄ᵀ‖_F ≤ √2(ν + 1)√M̄·C( √(max{d1, d2}/m) + t ).

Thus, to ensure the relative-error assumption (4.21), it suffices to ensure the inequality

    √2(ν + 1)·C( √(max{d1, d2}/m) + t ) ≤ c6(1 − 2p_fail)/( 4c5(ν + 1) ),

where c5, c6 are the constants from Corollary 4.8. Using the bound ν ≤ 3, it suffices to set

    t = c6(1 − 2p_fail)/( 64 √2 c5 C ) − √(max{d1, d2}/m),

provided m is large enough that this quantity is positive. Thus the probability of the desired event becomes at least 1 − c1exp( −c2( c4 √m − √(max{d1, d2}) )² ) for some constant c4 > 0. Finally, using the bound max{d1, d2} ≤ d1 + d2 ≤ m/c3 and applying Corollary 4.8 completes the proof. ∎

6 Numerical Experiments

In this section, we demonstrate the performance and stability of the prox-linear and subgradient methods, and of the initialization procedure, when applied to real and artificial instances of Problem (1.3). All experiments were performed using the Julia [7] programming language.

Subgradient method implementation. Implementation of the subgradient method for Problem (1.3) is simple and has low per-iteration cost. Indeed, one may simply choose the subgradient

    (1/m) Σ_{i=1}^m sign( ⟨l_i, w⟩⟨x, r_i⟩ − y_i ) · ( ⟨x, r_i⟩ l_i , ⟨l_i, w⟩ r_i ) ∈ ∂f(w, x),

where sign(t) denotes the sign of t, with the convention sign(0) = 0. The cost of computing this subgradient is on the order of four matrix-vector multiplications. When applying Algorithm 2, choosing the correct parameters is important, since its convergence is especially sensitive to the value of the step-size decay q; the experiment described in Section 6.1.2, which aided us empirically in choosing q for the rest of the experiments, demonstrates this phenomenon. Setting λ = 1.0 seemed to suffice for all the experiments depicted hereafter.
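The displayed subgradient can be computed with four matrix–vector products, as the following sketch (our own Julia rendering, not the authors' code) shows.

```julia
using LinearAlgebra

# A subgradient of f(w, x) = (1/m) Σ_i |<l_i, w><r_i, x> - y_i|, computed with
# four matrix-vector products: L*w, R*x, L'*(...), R'*(...).
function subgradient(L, R, y, w, x)
    Lw, Rx = L * w, R * x
    s = sign.(Lw .* Rx .- y) ./ length(y)          # sign residuals, scaled by 1/m
    return L' * (s .* Rx), R' * (s .* Lw)          # partial subgradients in w and x
end
```

A Polyak or geometrically decaying step then updates (w, x) exactly as in Algorithms 1 and 2.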

Prox-linear method implementation. Recall that the convex models used by the prox-linear method take the form

    f_{(w_k,x_k)}(w, x) = (1/m)‖ A( w_kx_kᵀ + w_k(x − x_k)ᵀ + (w − w_k)x_kᵀ ) − y ‖_1.   (6.1)

Equivalently, one may rewrite this expression as a Least Absolute Deviation (LAD) objective:

    f_{(w_k,x_k)}(w, x) = (1/m) Σ_{i=1}^m | [ ⟨x_k, r_i⟩ l_iᵀ , ⟨l_i, w_k⟩ r_iᵀ ] [w; x] − ( y_i + ⟨l_i, w_k⟩⟨x_k, r_i⟩ ) |
      =: (1/m)‖ Ãz − ỹ ‖_1,

with z := (w; x), row i of Ã given by Ã_i := [ ⟨x_k, r_i⟩ l_iᵀ , ⟨l_i, w_k⟩ r_iᵀ ], and ỹ_i := y_i + ⟨l_i, w_k⟩⟨x_k, r_i⟩. Thus, each iteration of Algorithm 3 requires solving a strongly convex optimization problem:

    z_{k+1} = argmin_{z ∈ S_ν} { (1/m)‖Ãz − ỹ‖_1 + (1/(2α))‖z − z_k‖² }.

Motivated by the work of [5] on robust phase retrieval, we solve this subproblem with the graph-splitting variant of the Alternating Direction Method of Multipliers, as described in [4]. This iterative method applies to problems of the form

    min_{z ∈ S_ν, t}  ‖t − ỹ‖_1 + (1/(2α))‖z − z_k‖²   subject to   t = Ãz,

yielding the following subproblems, which are repeatedly executed:

    z⁺ ← argmin_{z ∈ S_ν} { (1/(2α))‖z − z_k‖² + (ρ/2)‖z − ẑ + λ‖² },
    t⁺ ← argmin_t { ‖t − ỹ‖_1 + (ρ/2)‖t − t̂ + ν‖² },
    (ẑ⁺, t̂⁺) ← projection of (z⁺ + λ, t⁺ + ν) onto the graph { (z, t) : t = Ãz },
              obtained by solving the linear system [ I_{d1+d2}  Ãᵀ ; Ã  −I_m ][ ẑ⁺ ; s ] = [ z⁺ + λ ; t⁺ + ν ] and setting t̂⁺ = Ãẑ⁺,
    λ ← λ + z⁺ − ẑ⁺,   ν ← ν + t⁺ − t̂⁺,

where λ ∈ R^{d1+d2} and ν ∈ R^m are dual multipliers and ρ > 0 is a control parameter. Each of the above steps may be computed analytically. We found in our experiments that choosing α = 1 and ρ ≈ 1 yielded fast convergence. Our stopping criterion for this subproblem is considered met when the primal residual satisfies ‖(z⁺, t⁺) − (ẑ⁺, t̂⁺)‖ ≤ ε_k √(d1 + d2 + m)·max{1, ‖(z, t)‖} and the dual residual satisfies ‖(λ⁺, ν⁺) − (λ, ν)‖ ≤ ε_k √(d1 + d2 + m)·max{1, ‖(λ, ν)‖}, with ε_k decreasing geometrically in k.

6.1 Artificial Data

We first illustrate the performance of the prox-linear and subgradient methods under noise model N1 with i.i.d. standard Gaussian noise ξ_i. Both methods are initialized with Algorithm 4. We experimented with Gaussian noise of varying variances, and observed that

higher noise levels did not adversely affect the performance of our algorithm. This is not surprising, since the theory suggests that both the objective and the initialization procedure are robust to gross outliers. We analyze the performance with problem dimensions d1 ∈ {400, 1000} and d2 = 500, and with number of measurements m = c(d1 + d2), with c varying from 1 to 8. In Figures 2 and 3, we depict how the quantity ‖w_kx_kᵀ − w̄x̄ᵀ‖_F/‖w̄x̄ᵀ‖_F changes per iteration for the prox-linear and subgradient methods. We conducted tests in both the moderate-corruption (p_fail = 0.25) and high-corruption (p_fail = 0.45) regimes. For both methods, under moderate corruption (p_fail = 0.25) we see that exact recovery is possible as long as c ≥ 5. Likewise, even in the high-corruption regime (p_fail = 0.45), exact recovery is still possible as long as c ≥ 8. We also illustrate the performance of Algorithm 1 when there is no corruption at all in Figure 2; it converges an order of magnitude faster than Algorithm 2. In terms of algorithm performance, we see that the prox-linear method takes few outer iterations, approximately 5, to achieve very high accuracy, while the subgradient method requires a few hundred iterations. This behavior is expected, as the prox-linear method converges quadratically and the subgradient method converges linearly. Although the number of iterations of the prox-linear method is small, we demonstrate in the sequel that its total run-time, including the cost of solving subproblems, can be higher than that of the subgradient method.

6.1.1 Number of matrix-vector multiplications

Each iteration of the prox-linear method requires the numerical resolution of a convex optimization problem. We solve this subproblem using the graph-splitting ADMM algorithm, as described in [4], the cost of which is dominated by the number of matrix-vector products required to reach the target accuracy. The number of inner iterations of the prox-linear method, and thus the number of matrix-vector products, is not determined a priori. The cost of each iteration of the subgradient method, on the other hand, is on the order of 4 matrix-vector products. In the subsequent plots, we solve a sequence of synthetic problems for d1 = d2 = 100 and keep track of the total number of matrix-vector multiplications performed. We run both methods until we obtain ‖wxᵀ − w̄x̄ᵀ‖_F ≤ 10^{−5}‖w̄x̄ᵀ‖_F, and we keep track of the same statistics for the subgradient method. We present the results in Figure 4. We observe that the number of matrix-vector multiplications required by the prox-linear method can be much greater than the number required by the subgradient method. Additionally, they seem to be much more sensitive to the ratio m/(d1 + d2).

6.1.2 Choice of step size decay

Due to the sensitivity of Algorithm 2 to the step-size decay q, we experiment with different choices of q in order to find an empirical range of values which yield acceptable performance. To that end, we generate synthetic problems of dimension d1 = d2 = 100 and choose q ∈ {0.90, 0.905, ..., 0.995}, and record the average error of the final iterate after 1000 iterations of the subgradient method for different choices of m = c(d1 + d2). The average is taken

Figure 2: Dimensions are (d1, d2) = (400, 500) in the first column and (d1, d2) = (1000, 500) in the second column. We plot the error ‖w_kx_kᵀ − w̄x̄ᵀ‖_F/‖w̄x̄ᵀ‖_F vs. iteration count, for c = 2, 3, 4, 5, 6, 8. Top row: Algorithm 2 with p_fail = 0.25. Second row: Algorithm 2 with p_fail = 0.45. Third row: Algorithm 1 with p_fail = 0.

Figure 3: Dimensions are (d1, d2) = (400, 500) in the first column and (d1, d2) = (1000, 500) in the second column. We plot the error ‖w_kx_kᵀ − w̄x̄ᵀ‖_F/‖w̄x̄ᵀ‖_F vs. iteration count for an application of Algorithm 3 in the two settings p_fail = 0.25 (top row) and p_fail = 0.45 (bottom row).

over 50 test runs with λ = 1.0. We test both noisy and noiseless instances to see whether corruption of entries significantly changes the effective range of q. Results are shown in Figure 5.

6.1.3 Robustness to noise

We now empirically validate the robustness of the prox-linear and subgradient algorithms to noise. In a setup familiar from other recent works [4, 5], we generate phase transition plots, where the x-axis varies with the level of corruption p_fail, the y-axis varies as the ratio m/(d1 + d2) changes, and the shade of each pixel represents the percentage of problem instances solved successfully. For every configuration (p_fail, m/(d1 + d2)), we run 100 experiments.

Noise model N1 — independent noise. Initially, we experiment with Gaussian random matrices and several choices of the dimensions d1, d2; the results can be found in Figure 6.

Figure 4: Matrix-vector multiplications needed to reach a relative accuracy of 10^{-5}, as a function of p_fail, for the prox-linear and subgradient methods with c = 4 and c = 8.

Figure 5: Final normalized error ||w_k x_k^T - w̄ x̄^T||_F / ||w̄ x̄^T||_F for Algorithm 2 with different choices of q, in the settings p_fail = 0 (left) and p_fail = 0.25 (right); curves are shown for c = 3, 6, 8.

The phase transition plots are similar for both dimensionality choices, revealing that in the moderate independent-noise regime (p_fail <= 25%), setting m >= 4(d_1 + d_2) suffices. On the other hand, for exact recovery in the high-noise regime (p_fail ≈ 45%), one may need to choose m as large as 8(d_1 + d_2). We repeat the same experiment in the setting where the matrix L is deterministic with orthogonal columns of Euclidean norm √m, and R is a Gaussian random matrix. Specifically, we take L to be a partial Hadamard matrix, formed from the first d_1 columns of an m × m Hadamard matrix. In that case, the operator v ↦ Lv can be computed efficiently in O(m log m) time by zero-padding v to length m and computing its fast Walsh–Hadamard transform (FWHT). Additionally, the products w ↦ L^T w can also be computed in O(m log m) time by taking the FWHT of w and keeping the first d_1 coordinates of the result.
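The sketch below spells out this fast application of the partial Hadamard operator and its adjoint. It assumes m is a power of two and uses the unnormalized ±1 Hadamard matrix, consistent with columns of norm √m; the pure-Python FWHT is included only to keep the example self-contained.

```python
import numpy as np

def fwht(v):
    # Fast Walsh-Hadamard transform (unnormalized), O(m log m) butterflies.
    a = np.array(v, dtype=float)
    h, n = 1, len(a)
    while h < n:
        for i in range(0, n, 2 * h):
            x, y = a[i:i + h].copy(), a[i + h:i + 2 * h].copy()
            a[i:i + h], a[i + h:i + 2 * h] = x + y, x - y
        h *= 2
    return a

def partial_hadamard(v, m):
    # v |-> L v, with L given by the first d1 columns of the m x m Hadamard
    # matrix: zero-pad v to length m and apply the FWHT.
    padded = np.zeros(m)
    padded[:len(v)] = v
    return fwht(padded)

def partial_hadamard_adjoint(u, d1):
    # u |-> L^T u: apply the FWHT and keep the first d1 coordinates.
    return fwht(u)[:d1]
```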

The phase transition plots can be found in Figure 7. A comparison with the phase transition plot in Figure 6 shows a different trend. In this case, exact recovery does not occur when the corruption level rises above p_fail = 0% and m/(d_1 + d_2) is in the range {1, ..., 8}.

Noise model N2 (arbitrary noise). We now repeat the previous experiments, but switch to noise model N2. In particular, we now adversarially hide a different signal in a subset of the measurements, i.e., we set

    y_i = ⟨l_i, w̄⟩⟨x̄, r_i⟩ for i in I_in,   and   y_i = ⟨l_i, w_imp⟩⟨x_imp, r_i⟩ for i in I_out,

where (w_imp, x_imp) in R^{d_1} × R^{d_2} is an arbitrary pair of signals. Intuitively, this is a more challenging noise model than N1, since it allows an adversary to try to trick the algorithm into recovering an entirely different signal. Our experiments confirm that this regime is indeed more difficult for the proposed algorithms, which is why we only depict the range p_fail in [0, 0.38] in Figures 8 and 9 below.

6.2 Performance of initialization on real data

We now demonstrate the proposed initialization strategy on real-world images. Specifically, we set w̄ and x̄ to be two random digits from the training subset of the MNIST dataset [30]. In this experiment, the measurement matrices L, R have i.i.d. Gaussian entries, and a fraction p_fail = 0.45 of the measurements is corrupted. We apply the initialization method and plot the resulting initial estimates in Figure 10. Evidently, the initial estimates of the images are visually similar to the true digits, up to sign; in other examples, the foreground appears to be switched with the background, which corresponds to the natural sign ambiguity. Finally, we plot the normalized error for the two recovery methods (subgradient and prox-linear) in Figure 11.

6.3 Experiments on Big Data

We apply the subgradient method to recover large-scale real color images W̄, X̄ in R^{n × n × 3}. In this setting p_fail = 0, so Algorithm 1 is applicable with min_X f = 0. We flatten the arrays W̄, X̄ into 3n²-dimensional vectors w̄, x̄. In contrast to the previous experiments, our sensing matrices are of the following form:

    L = [H S_1; ...; H S_k],   R = [H S̃_1; ...; H S̃_k],

where H in {-1, +1}^{d × d}/√d is the d × d symmetric normalized Hadamard matrix and S_i = diag(ξ_1, ..., ξ_d), with the ξ_j i.i.d. uniform on {-1, +1}, is a diagonal random sign matrix; the same holds for the S̃_i. Notice that we can perform the operations w ↦ Lw and x ↦ Rx in O(kd log d) time: we first form the elementwise product between the signal and the random signs, and then take its Hadamard transform, which can be performed in O(d log d) flops.
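A small sketch of this structured operator and its adjoint is given below. For clarity it materializes the Hadamard matrix densely via SciPy (so d must be a power of two and small in the example); in the actual experiments each block would instead be applied with a fast Walsh–Hadamard transform as described above. The 1/√d factor follows the definition of H in the text, while any additional scaling of L and R is not specified here and is omitted.

```python
import numpy as np
from scipy.linalg import hadamard

def sign_hadamard_operator(d, k, rng):
    # L = [H S_1; ...; H S_k], with H the d x d Hadamard matrix scaled by
    # 1/sqrt(d) and S_i = diag(xi_1, ..., xi_d) a random +/-1 sign matrix.
    # R is built the same way from independent sign matrices.
    H = hadamard(d) / np.sqrt(d)
    signs = rng.choice([-1.0, 1.0], size=(k, d))

    def apply(w):            # w |-> L w  (O(k d log d) with an FWHT per block)
        return np.concatenate([H @ (s * w) for s in signs])

    def apply_adjoint(p):    # p |-> L^T p, needed by the subgradient method
        return sum(s * (H @ b) for s, b in zip(signs, p.reshape(k, d)))

    return apply, apply_adjoint
```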

Figure 6: Phase transition for M1, N1 (x-axis: corruption level p_fail; y-axis: m/(d_1 + d_2); one panel per choice of d_1 = d_2).

Figure 7: Phase transition for M2, N1.

Figure 8: Phase transition for M1, N2.

Figure 9: Phase transition for M2, N2.

We can efficiently compute the adjoint operations p ↦ L^T p and q ↦ R^T q, required for the subgradient method, in a similar fashion. We recover each channel separately, which means we essentially have to solve three similar minimization problems. Notice that this results in dimensionality d_1 = d_2 = n² and m = kn² for each channel. We observed that our initialization procedure (Algorithm 4) is extremely accurate in this setting. Therefore, to better illustrate the performance of the local search algorithms, we perform the following heuristic initialization: for each channel, we first sample ŵ, x̂ uniformly from the unit sphere, rescale them by the true magnitude of the signal, and run Algorithm 1 for one step to obtain our initial estimates (w_0, x_0).
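The heuristic initialization and the single Polyak step it uses can be sketched as follows. The sketch assumes the ℓ_1 objective from the earlier sections, reads "true magnitude" as the Euclidean norms of the underlying signals, and works per channel; the names mag_w and mag_x are illustrative.

```python
import numpy as np

def polyak_step(L, R, y, w, x):
    # One Polyak subgradient step, valid when min f = 0 (no corruption):
    #   (w, x) <- (w, x) - f(w, x) / ||g||^2 * g,  with g a subgradient of f.
    Lw, Rx = L @ w, R @ x
    res = Lw * Rx - y
    fval = np.mean(np.abs(res))
    s = np.sign(res) / len(y)
    gw, gx = L.T @ (s * Rx), R.T @ (s * Lw)
    gnorm2 = np.linalg.norm(gw) ** 2 + np.linalg.norm(gx) ** 2
    # assumes a nonzero subgradient at (w, x)
    return w - (fval / gnorm2) * gw, x - (fval / gnorm2) * gx

def heuristic_init(L, R, y, d1, d2, mag_w, mag_x, rng):
    # Sample unit-norm directions, rescale by the true signal magnitudes,
    # and take a single Polyak step, per the heuristic described above.
    w = rng.standard_normal(d1); w *= mag_w / np.linalg.norm(w)
    x = rng.standard_normal(d2); x *= mag_x / np.linalg.norm(x)
    return polyak_step(L, R, y, w, x)
```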

Figure 10: Digits 5, 6 (top) and 9, 6 (bottom). Original images are shown on the left, estimates on the right. Parameters: p_fail = 0.45.

Figure 11: Relative error versus iteration count on the MNIST digits of Figure 10, for the subgradient method (left) and the prox-linear method (right).

An example where we recover a pair of 512 × 512 color images using the Polyak subgradient method (Algorithm 1) is shown below; Figure 12 shows the progression of the estimates w_k up until the 90th iteration, while Figure 13 depicts the normalized error at each iteration for the different channels of the images.

References

[1] Alireza Aghasi, Ali Ahmed, and Paul Hand. BranchHull: Convex bilinear inversion from the entrywise product of signals with known signs. arXiv preprint, 2017.

Figure 12: Progression of the iterates w_k in the color-image recovery experiment (nine snapshots up to the 90th iteration).

[2] Alireza Aghasi, Ali Ahmed, Paul Hand, and Babhru Joshi. A convex program for bilinear inversion of sparse vectors. arXiv preprint, 2018.
[3] Ali Ahmed, Alireza Aghasi, and Paul Hand. Blind deconvolutional phase retrieval via convex programming. arXiv preprint, 2018.
[4] Ali Ahmed, Benjamin Recht, and Justin Romberg. Blind deconvolution using convex programming. IEEE Transactions on Information Theory, 60(3):1711–1732, 2014.

Figure 13: Normalized error per iteration for the different channels (red, blue, green) in the image recovery.

[5] Paolo Albano and Piermarco Cannarsa. Singularities of semiconcave functions in Banach spaces. In Stochastic analysis, control, optimization and applications, Systems Control Found. Appl., pages 171–190. Birkhäuser Boston, Boston, MA, 1999.
[6] Yu Bai, Qijia Jiang, and Ju Sun. Subgradient descent learns orthogonal dictionaries. arXiv preprint arXiv:1810.10702, 2018.
[7] Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B. Shah. Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65–98, 2017.
[8] J.M. Borwein and Q.J. Zhu. Techniques of Variational Analysis. Springer-Verlag, New York, 2005.
[9] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
[10] J.V. Burke. Descent methods for composite nondifferentiable optimization problems. Math. Programming, 33(3):260–279, 1985.
[11] J.V. Burke and M.C. Ferris. A Gauss–Newton method for convex composite optimization. Math. Programming, 71(2, Ser. A):179–194, 1995.
[12] E.J. Candès, X. Li, and M. Soltanolkotabi. Phase retrieval via Wirtinger flow: theory and algorithms. IEEE Trans. Inform. Theory, 61(4):1985–2007, 2015.
[13] E.J. Candès and B. Recht. Exact matrix completion via convex optimization. Found. Comput. Math., 9(6):717–772, 2009.
[14] Emmanuel J. Candès and Yaniv Plan. Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Transactions on Information Theory, 57(4):2342–2359, 2011.

[15] Yuxin Chen, Yuejie Chi, and Andrea J. Goldsmith. Exact and stable covariance estimation from quadratic sampling via convex programming. IEEE Trans. Inform. Theory, 61(7), 2015.
[16] Yuejie Chi, Yue M. Lu, and Yuxin Chen. Nonconvex optimization meets low-rank matrix factorization: An overview. arXiv preprint, 2018.
[17] Sunav Choudhary and Urbashi Mitra. Sparse blind deconvolution: What cannot be done. In 2014 IEEE International Symposium on Information Theory (ISIT). IEEE, 2014.
[18] Mark A. Davenport and Justin Romberg. An overview of low-rank matrix recovery from incomplete observations. arXiv preprint arXiv:1601.06422, 2016.
[19] Chandler Davis and William Morton Kahan. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.
[20] Damek Davis, Dmitriy Drusvyatskiy, Kellie J. MacPhee, and Courtney Paquette. Subgradient methods for sharp weakly convex functions. arXiv preprint arXiv:1803.02461, 2018.
[21] Damek Davis, Dmitriy Drusvyatskiy, and Courtney Paquette. The nonsmooth landscape of phase retrieval. arXiv preprint arXiv:1711.03247, 2017.
[22] A.L. Dontchev and R.T. Rockafellar. Implicit Functions and Solution Mappings. Springer Monographs in Mathematics, Springer-Verlag, 2009.
[23] D. Drusvyatskiy and A.S. Lewis. Error bounds, quadratic growth, and linear convergence of proximal methods. To appear in Math. Oper. Res., arXiv:1602.06661, 2016.
[24] D. Drusvyatskiy and C. Paquette. Efficiency of minimizing compositions of convex functions and smooth maps. Preprint arXiv:1605.00125, 2016.
[25] J.C. Duchi and F. Ruan. Solving (most of) a set of quadratic equalities: Composite optimization for robust phase retrieval. Preprint, 2017.
[26] J.L. Goffin. On convergence rates of subgradient optimization methods. Math. Programming, 13(3):329–347, 1977.
[27] Wen Huang and Paul Hand. Blind deconvolution by a steepest descent algorithm on a quotient manifold. arXiv preprint, 2018.
[28] Alexander D. Ioffe. Variational Analysis of Regular Mappings: Theory and Applications. Springer Monographs in Mathematics. Springer, Cham, 2017.
[29] Michael Kech and Felix Krahmer. Optimal injectivity conditions for bilinear inverse problems with applications to identifiability of deconvolution problems. SIAM Journal on Applied Algebra and Geometry, 1(1):20–37, 2017.

[30] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov. 1998.
[31] Adrian S. Lewis and Jong-Shi Pang. Error bounds for convex inequality systems. In Generalized convexity, generalized monotonicity: recent results (Luminy, 1996), volume 27 of Nonconvex Optim. Appl., pages 75–110. Kluwer Acad. Publ., Dordrecht, 1998.
[32] A.S. Lewis and S.J. Wright. A proximal method for composite minimization. Math. Program., pages 1–46, 2015.
[33] Xiaodong Li, Shuyang Ling, Thomas Strohmer, and Ke Wei. Rapid, robust, and reliable blind deconvolution via nonconvex optimization. arXiv preprint, 2016.
[34] Yanjun Li, Kiryung Lee, and Yoram Bresler. Identifiability in blind deconvolution with subspace or sparsity constraints. IEEE Transactions on Information Theory, 62(7), 2016.
[35] Yuanxin Li, Cong Ma, Yuxin Chen, and Yuejie Chi. Nonconvex matrix factorization from rank-one measurements. arXiv preprint arXiv:1802.06286, 2018.
[36] Shuyang Ling and Thomas Strohmer. Self-calibration and biconvex compressive sensing. Inverse Problems, 31(11):115002, 2015.
[37] Cong Ma, Kaizheng Wang, Yuejie Chi, and Yuxin Chen. Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion and blind deconvolution. arXiv preprint arXiv:1711.10467, 2017.
[38] B.S. Mordukhovich. Variational Analysis and Generalized Differentiation I: Basic Theory, volume 330 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer-Verlag, Berlin, 2006.
[39] Sahand N. Negahban, Pradeep Ravikumar, Martin J. Wainwright, and Bin Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statist. Sci., 27(4), 2012.
[40] E.A. Nurminskii. The quasigradient method for the solving of the nonlinear programming problems. Cybernetics, 9(1):145–150, Jan. 1973.
[41] Neal Parikh and Stephen Boyd. Block splitting for distributed optimization. Mathematical Programming Computation, 6(1):77–102, 2014.
[42] J.-P. Penot. Calculus Without Derivatives, volume 266 of Graduate Texts in Mathematics. Springer, New York, 2013.
[43] R.A. Poliquin and R.T. Rockafellar. Prox-regular functions in variational analysis. Trans. Amer. Math. Soc., 348(5):1805–1838, 1996.
[44] R.T. Rockafellar. Favorable classes of Lipschitz-continuous functions in subgradient optimization. In Progress in Nondifferentiable Optimization, volume 8 of IIASA Collaborative Proc. Ser. CP-82, pages 125–143. Int. Inst. Appl. Sys. Anal., Laxenburg, 1982.

[45] R.T. Rockafellar and R.J.-B. Wets. Variational Analysis. Grundlehren der mathematischen Wissenschaften, Vol. 317, Springer, Berlin, 1998.
[46] S. Rolewicz. On paraconvex multifunctions. In Third Symposium on Operations Research (Univ. Mannheim, Mannheim, 1978), Section I, volume 31 of Operations Res. Verfahren. Hain, Königstein/Ts., 1979.
[47] Y. Shechtman, Y.C. Eldar, O. Cohen, H.N. Chapman, J. Miao, and M. Segev. Phase retrieval with application to optical imaging: A contemporary overview. IEEE Signal Processing Magazine, 32(3):87–109, May 2015.
[48] Ruoyu Sun and Zhi-Quan Luo. Guaranteed matrix completion via non-convex factorization. IEEE Trans. Inform. Theory, 62(11):6535–6579, 2016.
[49] Stephen Tu, Ross Boczar, Max Simchowitz, Mahdi Soltanolkotabi, and Benjamin Recht. Low-rank solutions of linear matrix equations via Procrustes flow. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, Volume 48. JMLR.org, 2016.
[50] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing, pages 210–268. Cambridge Univ. Press, Cambridge, 2012.
[51] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018.
[52] G. Wang, G.B. Giannakis, and Y.C. Eldar. Solving systems of random quadratic equations via a truncated amplitude flow. arXiv preprint, 2016.

Appendix A: Sharpness

A.1 Proof of Proposition 4

Without loss of generality, we assume that M = 1 by rescaling, and that w̄ = e_1 in R^{d_1} and x̄ = e_1 in R^{d_2} by rotation invariance. Recall that the distance to S_ν may be written succinctly as

    dist((w, x), S_ν) = inf_{1/ν <= |α| <= ν} √( ||w - α w̄||² + ||x - x̄/α||² ).

Before we establish the general result, we first consider the simpler case d_1 = d_2 = 1.

Claim. The following bound holds:

    |wx - 1| >= (1/√2) · inf_{1/ν <= |α| <= ν} √( (w - α)² + (x - 1/α)² ),   for all w, x in [-ν, ν].

Proof of Claim. Consider a pair (w, x) in R² with |w|, |x| <= ν. It is easy to see that, without loss of generality, we may assume w >= |x|. We then separate the proof into two cases, which are graphically depicted in Figure 14.

Figure 14: The regions K_1 and K_2 correspond to cases 1 and 2 of the proof of the Claim, respectively.

Case 1: w - x <= ν - 1/ν. In this case, we will traverse from (w, x) to the set S_ν along the direction (1, 1); see Figure 14. First, consider the equation in the variable t

    wx - (w + x)(t/2) + t²/4 = 1,

and note the equality

    wx - (w + x)(t/2) + t²/4 = (w - t/2)(x - t/2).

Using the quadratic formula to solve for t, we get

    t = (w + x) - √((w + x)² - 4(wx - 1)).

Note that the discriminant is nonnegative, since

    (w + x)² - 4(wx - 1) = w² + x² - 2wx + 4 = (w - x)² + 4.

Set α = w - t/2 and note the identity 1/α = x - t/2. Therefore,

    wx - 1 = (1/α)(w - α) + α(x - 1/α) + (w - α)(x - 1/α).

Observe now the equality

    wx - 1 = (x - t/2)(t/2) + (w - t/2)(t/2) + t²/4 = (t/2)(w + x - t/2) = (t/4)( (w + x) + √((w + x)² - 4(wx - 1)) ),

while (w - α)² + (x - 1/α)² = (t/2)² + (t/2)² = t²/2. Since w + x >= 0 (recall w >= |x|) and the discriminant is at least 4, the last two displays yield |wx - 1| >= |t|/2 = (1/√2) √((w - α)² + (x - 1/α)²). Hence it remains to bound α. First we note that α > 0 and 1/α > 0, since their product equals 1 and

    α + 1/α = (w - t/2) + (x - t/2) = w + x - t = √((w + x)² - 4(wx - 1)) >= 0.

In addition, since w >= x, we have α = w - t/2 >= x - t/2 = 1/α. Since α and 1/α are positive, we must therefore have α >= 1 >= 1/ν. Thus, it remains to verify the bound α <= ν. To that end, notice that

    α - 1/α = (w - t/2) - (x - t/2) = w - x <= ν - 1/ν.

Since the function t ↦ t - 1/t is increasing on (0, ∞), we deduce α <= ν.
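As a quick numerical sanity check of the Claim above (with the 1/√2 constant as transcribed here), one can compare |wx - 1| against a grid approximation of the infimum; the script below does this for random pairs in [-ν, ν]² and is purely illustrative.

```python
import numpy as np

# Monte Carlo check of the Claim: |w*x - 1| >= (1/sqrt(2)) * inf over
# 1/nu <= |alpha| <= nu of sqrt((w - alpha)^2 + (x - 1/alpha)^2).
rng = np.random.default_rng(0)
nu = 3.0
alphas = np.concatenate([np.linspace(1 / nu, nu, 2000),
                         np.linspace(-nu, -1 / nu, 2000)])
worst = np.inf
for _ in range(20000):
    w, x = rng.uniform(-nu, nu, size=2)
    dist = np.sqrt((w - alphas) ** 2 + (x - 1 / alphas) ** 2).min()
    if dist > 1e-9:
        worst = min(worst, abs(w * x - 1) / dist)
print(worst)  # expected to stay at or above roughly 1/sqrt(2) ~= 0.707
```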
