Structured signal recovery from quadratic measurements: Breaking sample complexity barriers via nonconvex optimization

Mahdi Soltanolkotabi
Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, CA
February 2017; updated May 2017

Abstract

This paper concerns the problem of recovering an unknown but structured signal $x \in \mathbb{R}^n$ from $m$ quadratic measurements of the form $y_r = |\langle a_r, x\rangle|^2$ for $r = 1, 2, \ldots, m$. We focus on the under-determined setting where the number of measurements is significantly smaller than the dimension of the signal ($m \ll n$). We formulate the recovery problem as a nonconvex optimization problem where prior structural information about the signal is enforced through constraints on the optimization variables. We prove that projected gradient descent, when initialized in a neighborhood of the desired signal, converges to the unknown signal at a linear rate. These results hold for any closed constraint set (convex or nonconvex), providing convergence guarantees to the global optimum even when the objective function and constraint set are nonconvex. Furthermore, these results hold with a number of measurements that is only a constant factor away from the minimal number of measurements required to uniquely identify the unknown signal. Our results provide the first provably tractable algorithm for this data-poor regime, breaking local sample complexity barriers that have emerged in recent literature. In this paper we utilize and further develop powerful tools for uniform convergence of empirical processes that may have broader implications for a rigorous understanding of constrained nonconvex optimization heuristics.

1 Introduction

Signal reconstruction from quadratic measurements is at the heart of many applications in signal and image processing. In this problem we acquire quadratic measurements of the form

$y_r = |\langle a_r, x\rangle|^2, \quad r = 1, 2, \ldots, m, \qquad (1.1)$

from an unknown structured signal $x \in \mathbb{C}^n$. Here, $a_r \in \mathbb{C}^n$ are known sampling vectors and $y_r \in \mathbb{R}$ are the observed measurements. Such quadratic signal recovery problems are of interest in a variety of domains ranging from combinatorial optimization to wireless communications and imaging. Focusing on signal processing applications, recovering a signal from measurements of the form (1.1) is usually referred to as the generalized phase retrieval problem. The connection with phase retrieval is due to the fact that optical detectors, especially at small wavelengths, can often only record the intensity of the light field and not its phase. Indeed, the acquired measurements in many popular coherent diffraction imaging systems, such as those based on ptychography or phase from defocus,
are of the form (1.1), with $x$ corresponding to the object of interest, $a_r$ to modulated sinusoids, and $y_r$ to the recorded data. Given the ubiquity of the generalized phase retrieval problem in signal processing, over the years many heuristics have been developed for its solution. On the one hand, the invention of new X-ray sources and new experimental setups that enable recording and reconstruction of non-crystalline objects has caused a major revival in the use of phase retrieval techniques in imaging. On the other hand, the last five years have also witnessed tremendous progress in terms of providing rigorous mathematical guarantees for the performance of some classical heuristics such as alternating minimization [58, 75], as well as newer ones based on semidefinite programming [7] and Wirtinger flows [6] and its variants [ ]. We shall review all these algorithms and mathematical results in greater detail in Section 6. These results essentially demonstrate that a signal of dimension $n$ can be recovered efficiently and reliably from on the order of $n$ generic quadratic measurements of the form (1.1).

The recent surge of applied and theoretical activity regarding phaseless imaging is in part driven by the hope that it will eventually lead to successful imaging of large protein complexes and biological specimens, enabling live imaging of bio-chemical activities at the molecular level. Furthermore, phaseless imaging techniques increasingly play a crucial role in emerging national security applications aimed at monitoring electronic products that are intended for military or infrastructure use, so as to ensure these products do not contain secret backdoors granting foreign governments cyber access to vital US infrastructure. Despite the incredible progress discussed earlier on both applied and mathematical fronts, major challenges impede the use of such techniques in these emerging domains. One major challenge is that acquiring measurements of large specimens at high resolutions (corresponding to very short wavelengths) requires time-consuming and expensive measurements. To be concrete, the most modern phaseless imaging setups require image acquisition times exceeding 500 days for imaging a micrometer-by-micrometer specimen at 10nm resolution! The mathematical results in this paper pave the way for a new generation of data-driven phaseless imaging systems that can utilize prior information to significantly reduce acquisition time and enhance image reconstruction, enabling nano-scale imaging at unprecedented speeds and resolutions.

To overcome these challenges, in this paper we aim to utilize a-priori structural information available about the signal to reduce the required number of quadratic measurements of the form (1.1). Indeed, in the application domains discussed above there is a lot of a-priori knowledge available that can be utilized to reduce acquisition time and enhance image reconstruction. For example, images of electronic chips are extremely structured, e.g. piecewise constant, and often projections of 3D rectilinear models. While historically various a-priori information such as non-negativity has been used to enhance image reconstruction, such simple forms of structural information are often not sufficient. Complicating the matter further, our mathematical understanding of how well even simple forms of a-priori information can enhance reconstruction is far from complete. To be concrete, assume we know the signal of interest is sparse, e.g. it has at most $s$ nonzero entries. In this case, for known tractable algorithms to yield accurate solutions the number of generic measurements must exceed $c\, s^2 \log n$, with $c$ a constant [60].
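For readers who want to experiment, the following is a minimal Python sketch of the measurement model (1.1) with a real Gaussian sampling map and a planted sparse signal. All names (`make_measurements`, the dimensions, the sparsity level) are our own illustrative choices, not from the paper.

```python
import numpy as np

def make_measurements(x, m, rng):
    """Return y_r = (a_r^T x)^2, r = 1..m, along with the sampling matrix A."""
    n = x.size
    A = rng.standard_normal((m, n))  # rows are the sampling vectors a_r
    y = (A @ x) ** 2                 # intensity (phaseless) measurements
    return y, A

rng = np.random.default_rng(0)
n, s, m = 1000, 10, 200              # ambient dimension, sparsity, measurements
x = np.zeros(n)
support = rng.choice(n, size=s, replace=False)
x[support] = rng.standard_normal(s)  # an s-sparse planted signal
y, A = make_measurements(x, m, rng)
# Note: y carries no sign/phase information and m << n, so any recovery
# procedure must exploit the sparsity of x.
```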
This is surprising, as the degrees of freedom of an $s$-sparse vector is on the order of $s$, and based on the compressive sensing literature one expects to be able to recover the signal $x$ from on the order of $s \log(n/s)$ generic quadratic measurements. In fact, it is known that on the order of $s \log(n/s)$ generic quadratic measurements uniquely specify the unknown signal up to a global phase factor. However, it is not known whether a tractable
algorithm can recover the signal from such a minimal number of generic quadratic measurements. The above example demonstrates a significant gap in our ability to utilize prior structural assumptions in phase retrieval problems so as to reduce the required number of measurements, or sample complexity. This is not an isolated example, and such gaps hold more generally for a variety of problems and structures (see [60, 3] for more details on related gaps). The emergence of such sample complexity barriers is quite surprising, as in many cases there is no tractable algorithm known to close this gap. In fact, for some problems such as sparse PCA it is known that closing this gap via a computationally tractable approach would yield tractable algorithms for notoriously difficult problems such as planted clique [9].

2 Minimizing (non)convex objectives with (non)convex constraints

We wish to discern an unknown but structured signal $x \in \mathbb{C}^n$ from quadratic measurements of the form $y_r = |\langle a_r, x\rangle|^2$ for $r = 1, 2, \ldots, m$. However, in the applications of interest typically the number of equations $m$ is significantly smaller than the number of variables $n$, so that there are infinitely many solutions obeying the quadratic constraints. However, it may still be possible to recover the signal by exploiting knowledge of its structure. To this aim, let $\mathcal{R}: \mathbb{R}^n \to \mathbb{R}$ be a cost function that reflects some notion of complexity of the structured solution. It is then natural to use the following optimization problem to recover the signal:

$\underset{z \in \mathbb{C}^n}{\text{minimize}} \quad \mathcal{L}(z) = \frac{1}{m}\sum_{r=1}^m \ell\left(y_r, |\langle a_r, z\rangle|^2\right) \quad \text{subject to} \quad \mathcal{R}(z) \le \mathcal{R}(x). \qquad (2.1)$

Here, $\ell(y_r, |\langle a_r, z\rangle|^2)$ is a loss function measuring the misfit between the measurements $y_r$ and the data model, and $\mathcal{R}$ is a regularization function that reflects known prior knowledge about the signal. A natural approach to solve this problem is via projected gradient type updates of the form

$z_{\tau+1} = \mathcal{P}_\mathcal{K}\left(z_\tau - \mu_\tau \nabla\mathcal{L}(z_\tau)\right). \qquad (2.2)$

Here, $\nabla\mathcal{L}$ is the Wirtinger derivative of $\mathcal{L}$ (see [6, Section 6] for details) and $\mathcal{P}_\mathcal{K}(z)$ denotes the projection of $z \in \mathbb{C}^n$ onto the set

$\mathcal{K} = \{w \in \mathbb{C}^n : \mathcal{R}(w) \le \mathcal{R}(x)\}. \qquad (2.3)$

Following [6], we shall refer to this iterative procedure as the Projected Wirtinger Flow (PWF) algorithm. A-priori it is completely unclear why the iterative updates (2.2) should converge, as not only the loss function but also the regularization function may be nonconvex! Efficient signal reconstruction from nonlinear measurements in this high-dimensional setting poses new challenges:

- When are the iterates able to escape local optima and saddle points and converge to global optima?
- How many measurements do we need? Can we break through the barriers faced by convex relaxations?
- How does the number of measurements depend on the a-priori prior knowledge available about the signal? What regularizer is best suited to utilizing a particular form of prior knowledge?
- How many passes (or iterations) of the algorithm are required to get to an accurate solution?

At the heart of answering these questions is the ability to predict the convergence behavior/rate of (non)convex constrained optimization algorithms.

3 Precise measures for statistical resources

Throughout the rest of the paper we assume that the signal $x \in \mathbb{R}^n$ and the measurement vectors $a_r \in \mathbb{R}^n$ are all real-valued. For the sake of brevity we have focused our attention on this real-valued case; however, we note that all of our definitions/results trivially extend to the complex case. We wish to characterize the rates of convergence for the projected gradient updates (2.2) as a function of the number of samples, the available prior knowledge, and the choice of the regularizer. To make these connections precise and quantitative we need a few definitions. Naturally, the required number of samples for reliable signal reconstruction depends on how well the regularization function $\mathcal{R}$ can capture the properties of the unknown signal $x$. For example, if we know our unknown parameter is approximately sparse, naturally using an $\ell_1$ norm for the regularizer is superior to using an $\ell_2$ regularizer. To quantify this capability we first need a couple of standard definitions which we adapt from [6, 6].

Definition 3.1 (Descent set and cone) The descent set of a function $\mathcal{R}$ at a point $x$ is defined as

$\mathcal{D}_\mathcal{R}(x) = \{h : \mathcal{R}(x + h) \le \mathcal{R}(x)\}.$

A cone of descent is a closed cone $\mathcal{C}_\mathcal{R}(x)$ that contains the descent set, i.e. $\mathcal{D}_\mathcal{R}(x) \subset \mathcal{C}_\mathcal{R}(x)$. The tangent cone is the conic hull of the descent set, that is, the smallest closed cone $\mathcal{C}_\mathcal{R}(x)$ obeying $\mathcal{D}_\mathcal{R}(x) \subset \mathcal{C}_\mathcal{R}(x)$.

We note that the capability of the regularizer $\mathcal{R}$ in capturing the properties of the unknown signal $x$ depends on the size of the descent cone $\mathcal{C}_\mathcal{R}(x)$. The smaller this cone is, the more suited the function $\mathcal{R}$ is at capturing the properties of $x$. To quantify the size of this set we shall use the notion of mean width.

Definition 3.2 (Gaussian width) The Gaussian width of a set $\mathcal{C} \subset \mathbb{R}^p$ is defined as

$\omega(\mathcal{C}) = \mathbb{E}_g\left[\sup_{z \in \mathcal{C}} \langle g, z\rangle\right],$

where the expectation is taken over $g \sim \mathcal{N}(0, I_p)$. Throughout, we use $\mathcal{B}^n$/$\mathcal{S}^{n-1}$ to denote the unit ball/sphere of $\mathbb{R}^n$.

We now have all the definitions in place to quantify the capability of the function $\mathcal{R}$ in capturing the properties of the unknown parameter $x$. This naturally leads us to the definition of the minimum required number of samples.

Definition 3.3 (minimal number of samples) Let $\mathcal{C}_\mathcal{R}(x)$ be a cone of descent of $\mathcal{R}$ at $x$. We define the minimal sample function as

$\mathcal{M}(\mathcal{R}, x) = \omega^2\left(\mathcal{C}_\mathcal{R}(x) \cap \mathcal{B}^n\right).$

We shall often use the shorthand $m_0 = \mathcal{M}(\mathcal{R}, x)$, with the dependence on $\mathcal{R}, x$ implied.
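As a rough numerical illustration of Definitions 3.2 and 3.3, the Python snippet below Monte-Carlo estimates a standard upper bound on $\omega^2(\mathcal{C}_\mathcal{R}(x) \cap \mathcal{B}^n)$ for $\mathcal{R} = \|\cdot\|_{\ell_1}$ at an $s$-sparse $x$, namely $\omega^2 \le \mathbb{E}[\text{dist}^2(g, \lambda\, \partial\|x\|_{\ell_1})]$, valid for any $\lambda \ge 0$ (a well-known bound from the convex-geometry literature on linear inverse problems). The helper name and the choice $\lambda = \sqrt{2\log(n/s)}$ are ours.

```python
import numpy as np

def l1_width_bound(n, s, lam, trials=2000, rng=None):
    """Monte Carlo estimate of E[ dist^2(g, lam * subdiff of ||.||_1 at x) ]."""
    rng = rng or np.random.default_rng(1)
    vals = []
    for _ in range(trials):
        g = rng.standard_normal(n)
        # WLOG the support of x is the first s coordinates; by symmetry of g
        # we may take sign(x_i) = +1 there. On the support the subgradient is
        # lam * sign(x_i); off the support it is any value in [-lam, lam].
        on = (g[:s] - lam) ** 2
        off = np.maximum(np.abs(g[s:]) - lam, 0.0) ** 2
        vals.append(on.sum() + off.sum())
    return np.mean(vals)

n, s = 1000, 10
lam = np.sqrt(2 * np.log(n / s))
print("omega^2 bound ~", l1_width_bound(n, s, lam))  # on the order of s*log(n/s)
```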

We note that $m_0$ is exactly the minimum number of samples required for structured signal recovery from linear measurements when using convex regularizers [8]. Specifically, the optimization problem

$\underset{z}{\text{minimize}} \quad \sum_{r=1}^m \left(y_r - \langle a_r, z\rangle\right)^2 \quad \text{subject to} \quad \mathcal{R}(z) \le \mathcal{R}(x) \qquad (3.1)$

succeeds at recovering the unknown signal $x$ with high probability from $m$ measurements of the form $y_r = \langle a_r, x\rangle$ if and only if $m \ge m_0$. While this result is only known to be true for convex regularization functions, we believe that $m_0$ also characterizes the minimal number of samples even for nonconvex regularizers in (3.1). See [6] for some results in the nonconvex case, as well as the role this quantity plays in the computational complexity of projected gradient schemes for linear inverse problems. Given that in phaseless imaging we have less information (we lose the phase of the linear measurements), we cannot hope to recover structured signals from fewer than $m_0$ measurements when using (2.1). Therefore, we can use $m_0$ as a lower bound on the minimum number of measurements required for the projected gradient descent iterations (2.2) to succeed in recovering the signal of interest.

4 Nonconvex regularization examples

Next we provide two examples of (non)convex regularizers which are of interest in phaseless imaging applications. These two simple examples are meant to highlight the importance of nonconvex regularizers in imaging. However, our theoretical framework is by no means limited to these simple examples and can deal with significantly more complicated nonconvex regularizers capturing much more nuanced forms of prior structure.

Piecewise constant structure. One piece of a-priori structural information available in many phaseless imaging applications is that images tend to be piecewise constant. For example, contiguous parts of biological specimens are made up of the same tissue and exhibit the same behavior under electro-magnetic radiation. Similarly, images of electronic chips are often projections of piecewise constant 3D rectilinear models. Let $z \in \mathbb{R}^{n_1 \times n_2}$ denote a 2D image consisting of an array of pixels. A popular approach for exploiting piecewise-constant structure is to use total variation regularization functions. Two common choices are the isotropic and anisotropic total variation regularizations, defined as

$\mathcal{R}_{iso}(z) = \sum_{ij}\left(|z_{i+1,j} - z_{ij}|^2 + |z_{i,j+1} - z_{ij}|^2\right)^{p/2},$
$\mathcal{R}_{ani}(z) = \sum_{ij}\left(|z_{i+1,j} - z_{ij}|^p + |z_{i,j+1} - z_{ij}|^p\right).$

When $p = 1$ these regularization functions are convex. However, for many image reconstruction tasks total variation regularization with $p < 1$, despite being nonconvex, is significantly more effective at capturing piecewise constant structure.

(We would like to note that $m_0$ only approximately characterizes the minimum number of samples required. A more precise characterization is $\phi^{-1}\left(\omega\left(\mathcal{C}_\mathcal{R}(x) \cap \mathcal{B}^n\right)\right) \approx \omega^2\left(\mathcal{C}_\mathcal{R}(x) \cap \mathcal{B}^n\right)$, where $\phi(t) = \sqrt{2}\,\Gamma\left(\frac{t+1}{2}\right)/\Gamma\left(\frac{t}{2}\right) \approx \sqrt{t}$. However, since our results have unspecified constants we avoid this more accurate characterization.)
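The following is a small Python sketch of the two total variation regularizers above for a 2D image; the function names are ours. Setting $p = 1$ gives the usual convex TV, while $0 < p < 1$ gives the nonconvex variants discussed above.

```python
import numpy as np

def tv_iso(z, p=1.0):
    """Isotropic total variation of a 2D array z with exponent p."""
    dx = np.diff(z, axis=0)[:, :-1]   # z_{i+1,j} - z_{i,j}
    dy = np.diff(z, axis=1)[:-1, :]   # z_{i,j+1} - z_{i,j}
    return np.sum((dx ** 2 + dy ** 2) ** (p / 2))

def tv_aniso(z, p=1.0):
    """Anisotropic total variation of a 2D array z with exponent p."""
    dx = np.diff(z, axis=0)
    dy = np.diff(z, axis=1)
    return np.sum(np.abs(dx) ** p) + np.sum(np.abs(dy) ** p)

z = np.zeros((64, 64)); z[16:48, 16:48] = 1.0   # a piecewise-constant test image
print(tv_iso(z, p=1.0), tv_aniso(z, p=0.5))
```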

Discrete values. Another form of a-priori knowledge that is sometimes available in imaging applications is the possible discrete values the image pixels can take. For example, in imaging electronic chips the metallic composition of the different parts pre-determines the possible discrete values, and these are known in advance. Let $z \in \mathbb{R}^{n_1 \times n_2}$ denote a 2D image consisting of an array of pixels and assume the possible discrete values are $\{a_1, a_2, \ldots, a_k\}$. A natural regularization in this case is

$\mathcal{R}(z) = \sum_{ij}\ \mathbb{I}\left(\prod_{r=1}^{k} (z_{ij} - a_r)\right), \quad \text{with} \quad \mathbb{I}(z) = 0 \ \text{if}\ z = 0 \quad \text{and} \quad \mathbb{I}(z) = +\infty \ \text{if}\ z \ne 0.$

This regularization is convenient, as projection onto its sub-level sets is easy and amounts to replacing each entry of the input vector/matrix with the closest discrete value from $\{a_1, a_2, \ldots, a_k\}$ (a.k.a. hard thresholding). Of course, this is not the only regularization function that can enforce discrete structure, and in practice soft thresholding variants may be more effective. Our framework can be used to analyze many such variants. Indeed, an interesting aspect of our results is that it allows us to understand which regularizer is best suited to enforcing a particular form of prior structure.

5 Theoretical results for Projected Wirtinger Flows

In this section we shall explain our main theoretical results. To this aim we need to define the distance to the solution set.

Definition 5.1 Let $x \in \mathbb{R}^n$ be any solution to the quadratic system $y = |Ax|^2$ (the signal we wish to recover). For each $z \in \mathbb{R}^n$, define

$\text{dist}(z, x) = \min\left(\|z - x\|_{\ell_2},\ \|z + x\|_{\ell_2}\right).$

As we mentioned earlier, we are interested in recovering structured signals from quadratic measurements via the optimization problem (2.1). Naturally, the convergence/lack of convergence as well as the rate of convergence of the projected Wirtinger Flow iterates (2.2) depends on the loss function $\ell$. We now discuss our theoretical results for two different loss functions.

5.1 Intensity-based Wirtinger Flows with convex regularizers

Our first result focuses on a quadratic loss function applied to intensity measurements, i.e. $\ell(x, y) = \frac{1}{2}(x - y)^2$ in (2.1). In this case the optimization problem takes the form

$\underset{z \in \mathbb{R}^n}{\text{minimize}} \quad \mathcal{L}_I(z) = \frac{1}{4m}\sum_{r=1}^m \left(y_r - (a_r^T z)^2\right)^2 \quad \text{subject to} \quad \mathcal{R}(z) \le \mathcal{R}(x). \qquad (5.1)$

Our first result studies the effectiveness of projected Wirtinger Flows on this objective.

Theorem 5.2 Let $x \in \mathbb{R}^n$ be an arbitrary vector and $\mathcal{R}: \mathbb{R}^n \to \mathbb{R}$ be a proper convex function. Suppose $A \in \mathbb{R}^{m \times n}$ is a Gaussian map and let $y = |Ax|^2 \in \mathbb{R}^m$ be $m$ quadratic measurements. To estimate $x$, start from a point $z_0$ obeying

$\text{dist}(z_0, x) \le \frac{1}{8}\|x\|_{\ell_2} \qquad (5.2)$
and apply the Projected Wirtinger Flow (PWF) updates

$z_{\tau+1} = \mathcal{P}_\mathcal{K}\left(z_\tau - \mu_\tau \nabla\mathcal{L}_I(z_\tau)\right) \qquad (5.3)$

with $\mathcal{K} = \{z \in \mathbb{R}^n : \mathcal{R}(z) \le \mathcal{R}(x)\}$. Also set the learning parameter sequence as $\mu_0 = 0$ and $\mu_\tau = \mu/\|x\|_{\ell_2}^2$ for all $\tau = 1, 2, \ldots$, and assume $\mu \le c_1/n$ for some fixed numerical constant $c_1$. Furthermore, let $m_0 = \mathcal{M}(\mathcal{R}, x)$, per Definition 3.3, be our lower bound on the number of measurements. Also assume

$m > c_2\, m_0 \log n \qquad (5.4)$

holds for a fixed numerical constant $c_2$. Then there is an event of probability at least $1 - 1/2n - e^{-\gamma m}$ such that on this event, starting from any initial point obeying (5.2), the updates (5.3) satisfy

$\text{dist}(z_\tau, x) \le \left(1 - \frac{\mu}{5}\right)^{\tau/2} \text{dist}(z_0, x). \qquad (5.5)$

(We note that $\|x\|_{\ell_2}^2$ can be trivially estimated from the measurements, as $\frac{1}{m}\sum_{r=1}^m y_r \approx \|x\|_{\ell_2}^2$, and our proofs are robust to this misspecification. We avoid stating this variant for ease of reading.)

As mentioned earlier, $m_0$ is the minimal number of measurements required to recover a structured signal from linear measurements. $m_0$ also serves as a lower bound on structured signal recovery from quadratic measurements, as they are even less informative (we lose sign information). Theorem 5.2 shows that PWF applied to a quadratic loss using intensity measurements can (locally) reconstruct the signal with this minimal sample complexity (up to a constant and a log factor). To be concrete, consider the case where the unknown signal $x$ is known to be $s$-sparse and we use $\mathcal{R}(z) = \|z\|_{\ell_1}$ as the regularizer in (5.1). In this case it is known that $m_0 \approx s \log(n/s)$, and Theorem 5.2 predicts that an $s$-sparse signal can (locally) be recovered from on the order of $s \log(n/s) \log(n)$ measurements. This breaks through well-known barriers that have emerged for this problem in recent literature. Indeed, for known tractable convex relaxation schemes to yield accurate solutions the number of generic measurements must exceed $c\, s^2 \log n$, with $c$ a constant [54, 60]. We also note that even recent nonconvex approaches such as [4, 78] have not succeeded at breaking through this $s^2$ barrier, even when an initialization obeying (5.2) is available.

The convergence guarantees provided above hold as long as PWF is initialized per (5.2) in a neighborhood of the unknown signal with relative error less than a constant. In this paper we are concerned only with the local convergence properties of PWF and therefore do not provide an explicit construction for such an initialization. However, in a companion paper we demonstrate that the optimization problem (5.1) has certain favorable characteristics that may allow global convergence guarantees from any initialization using second order methods.

Another interesting aspect of the above result is that the rate of convergence is geometric. Specifically, to achieve a relative error of $\epsilon$ ($\|z - x\|_{\ell_2}/\|x\|_{\ell_2} \le \epsilon$), the required number of iterations is on the order of $n \log(1/\epsilon)$. Note that the cost of each iteration is dominated by applying the matrix $A$ and its transpose $A^T$, which has computational complexity on the order of $\mathcal{O}(mn)$. This is assuming that the projection has negligible cost compared to the cost of applying $A$/$A^T$, as is the case, for example, for sparse signals when using the regularizer $\mathcal{R}(z) = \|z\|_{\ell_1}$. Therefore, in these cases, to achieve a relative error of $\epsilon$ the total computational complexity of PWF is on the order of $\mathcal{O}\left(mn^2 \log(1/\epsilon)\right)$.

Let us now discuss some ways in which this theorem is sub-optimal. Even though this theorem breaks through known sample complexity barriers, a natural question is whether it is possible to
remove the log factor so as to have a sample complexity that is only a constant factor away from the minimum sample complexity of structured signal recovery from linear measurements. Another way in which the algorithm is sub-optimal is computational complexity. While the rate of convergence of PWF stated above is geometric, it is not linear. With a linear rate of convergence, to achieve a relative error of $\epsilon$ the total computational complexity would be on the order of $\mathcal{O}(mn \log(1/\epsilon))$, which is a factor of $n$ smaller than the guarantees provided by PWF. In the next section we will show how to close these gaps in sample complexity and computational complexity by using a different loss function in (2.1). Finally, a major drawback of Theorem 5.2 is that it only applies to convex regularizers. In the next section we will show how to also remove this assumption so as to allow arbitrary nonconvex regularizers.

5.2 Amplitude-based Wirtinger Flows with (non)convex regularizers

Our second result focuses on a quadratic loss function applied to amplitude measurements, i.e. $\ell(x, y) = \frac{1}{2}(x - y)^2$ in (2.1). In this case the optimization problem takes the form

$\underset{z \in \mathbb{R}^n}{\text{minimize}} \quad \mathcal{L}_A(z) = \frac{1}{2m}\sum_{r=1}^m \left(\sqrt{y_r} - |a_r^T z|\right)^2 \quad \text{subject to} \quad \mathcal{R}(z) \le \mathcal{R}(x). \qquad (5.6)$

One challenging aspect of the above loss function is that it is not differentiable, and it is not clear how to run projected gradient descent. However, this does not pose a fundamental challenge, as the loss function is differentiable except at isolated points, and we can use the notion of generalized gradients to define the gradient at a non-differentiable point as one of the limit points of the gradient in a local neighborhood of the non-differentiable point. For the loss in (5.6) the generalized gradient takes the form

$\nabla\mathcal{L}_A(z) = \frac{1}{m}\sum_{r=1}^m \left(|a_r^T z| - \sqrt{y_r}\right) \text{sgn}(a_r^T z)\, a_r. \qquad (5.7)$

Theorem 5.3 Let $x \in \mathbb{R}^n$ be an arbitrary vector and $\mathcal{R}: \mathbb{R}^n \to \mathbb{R}$ be a proper function (convex or nonconvex). Suppose $A \in \mathbb{R}^{m \times n}$ is a Gaussian map and let $y = |Ax|^2 \in \mathbb{R}^m$ be $m$ quadratic measurements. To estimate $x$, start from a point $z_0$ obeying

$\text{dist}(z_0, x) \le \frac{1}{5}\|x\|_{\ell_2} \qquad (5.8)$

and apply the Projected Wirtinger Flow (PWF) updates

$z_{\tau+1} = \mathcal{P}_\mathcal{K}\left(z_\tau - \mu_\tau \nabla\mathcal{L}_A(z_\tau)\right) \qquad (5.9)$

with $\mathcal{K} = \{z \in \mathbb{R}^n : \mathcal{R}(z) \le \mathcal{R}(x)\}$ and $\nabla\mathcal{L}_A$ defined via (5.7). Also set the learning parameter sequence as $\mu_0 = 0$ and $\mu_\tau = 1$ for all $\tau = 1, 2, \ldots$. Furthermore, let $m_0 = \mathcal{M}(\mathcal{R}, x)$, per Definition 3.3, be our lower bound on the number of measurements. Also assume

$m > c\, m_0 \qquad (5.10)$

holds for a fixed numerical constant $c$. Then there is an event of probability at least $1 - 9e^{-\gamma m}$ such that on this event, starting from any initial point obeying (5.8), the updates (5.9) satisfy

$\text{dist}(z_\tau, x) \le \left(\frac{2}{3}\right)^{\tau} \text{dist}(z_0, x). \qquad (5.11)$

Here $\gamma$ is a fixed numerical constant.
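To make the updates (5.9) concrete, here is a minimal Python sketch of amplitude-based PWF for the sparsity constraint $\mathcal{K} = \{s\text{-sparse vectors}\}$, whose projection is hard thresholding (keep the $s$ largest entries). The step sizes follow the theorem ($\mu_0 = 0$, so the first update is a pure projection, and $\mu_\tau = 1$ afterwards). This is an illustrative sketch under our own naming, not the authors' reference implementation; it assumes the `x, A, y, s` generated in the earlier snippet.

```python
import numpy as np

def hard_threshold(z, s):
    """Project onto the set of s-sparse vectors (keep s largest magnitudes)."""
    out = np.zeros_like(z)
    keep = np.argsort(np.abs(z))[-s:]
    out[keep] = z[keep]
    return out

def amplitude_grad(z, A, y):
    """Generalized gradient (5.7): (1/m) sum (|a_r.z| - sqrt(y_r)) sgn(a_r.z) a_r."""
    az = A @ z
    return A.T @ ((np.abs(az) - np.sqrt(y)) * np.sign(az)) / len(y)

def pwf_amplitude(z0, A, y, s, iters=200):
    z = hard_threshold(z0, s)          # mu_0 = 0: the first update only projects
    for _ in range(iters):
        z = hard_threshold(z - amplitude_grad(z, A, y), s)   # mu_tau = 1
    return z

# usage, starting near the planted signal from the earlier snippet:
# z0 = x + 0.1 * np.random.default_rng(2).standard_normal(x.shape)
# xhat = pwf_amplitude(z0, A, y, s)
# print(min(np.linalg.norm(xhat - x), np.linalg.norm(xhat + x)))  # dist(z, x)
```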

The first interesting and perhaps surprising aspect of this result is its generality: it applies not only to convex regularization functions but also nonconvex ones! As we mentioned earlier, the optimization problem in (2.1) is not known to be tractable even for convex regularizers. Despite the nonconvexity of both the objective and regularizer, the theorem above shows that with a near minimal number of measurements, projected gradient descent provably converges to the original signal $x$ without getting trapped in any local optima.

The amplitude-based loss also has stronger sample complexity and computational complexity guarantees compared with the intensity-based version. Indeed, the required number of measurements improves upon the intensity-based loss by a logarithmic factor, achieving a near optimal sample complexity for this problem (up to a constant factor). Also, the convergence rate of the amplitude-based approach is now linear. Therefore, to achieve a relative error of $\epsilon$ the total number of iterations is on the order of $\mathcal{O}(\log(1/\epsilon))$. Thus, the overall computational complexity is on the order of $\mathcal{O}(mn \log(1/\epsilon))$ (in general, the cost is the total number of iterations multiplied by the cost of applying the measurement matrix $A$ and its transpose). As a result, the computational complexity is also now optimal in terms of dependence on the matrix dimensions. Indeed, for a dense matrix, even verifying that a good solution has been achieved requires one matrix-vector multiplication, which takes $\mathcal{O}(mn)$ time.

We now pause to discuss the choice of the loss function. The theoretical results above suggest that the least squares loss on amplitude values is superior to one on intensity values in terms of both sample and computational complexity. Such improved performance has also been observed empirically for more realistic models in optics [79]. Indeed, [79] shows that not only does the amplitude-based least squares have faster convergence rates, but it is also more robust to noise and model misspecification. However, we would like to point out that the least squares objective on intensity values does have certain advantages. For instance, it is possible to do exact line search (in closed form) on this objective. We have observed that this approach works rather well in some practical domains (e.g. ptychography for chip imaging) without the need for any tuning, as the step size in each iteration is calculated in closed form via exact line search. Therefore, we would like to caution against rushed judgments declaring one variant of Wirtinger Flow superior to another due to minor (e.g. logarithmic) theoretical improvements in sample complexity and/or computational complexity. (Unfortunately, such premature declarations have become exceedingly common in recent literature.) We would like to emphasize that there is no best or correct loss function that works better than others for all application domains. Ultimately, the choice of the loss function is dictated by the statistics of the noise or misspecification present in a particular domain.

6 Discussions and prior art

Phase retrieval is a century-old problem and many heuristics have been developed for its solution. For a partial review of some of these heuristics, as well as some recent theoretical advances in related problems, we refer the reader to the overview articles/chapters [ ] and [69, Part II], as well as [5, Section 2.6] and references therein such as [ ]. There has also been a surge of activity surrounding nonconvex optimization problems in the last few years. While discussing all of these results is beyond the scope of this paper, we shall briefly discuss some of the most relevant and recent literature in the coming paragraphs. We refer the reader to [7] and references therein [ ] for a more comprehensive review of such results. We also
refer the reader to [35, 9, 5] for recent algorithmic approaches based on linear programs, and [5] for characterizing the large system limits of the dynamics of phase retrieval algorithms.

The Wirtinger Flow algorithm for solving quadratic systems of equations was introduced in [6]. [6] also provides a local convergence analysis when no prior structural assumption is available about the signal. The analysis of [6] was based on the so-called regularity condition. This regularity condition and closely related notions have been utilized/generalized in a variety of interesting ways to provide rigorous convergence guarantees for related nonconvex problems arising in diverse applications ranging from matrix completion to dictionary learning and blind deconvolution [ ]. The intensity-based results presented in Section 5.1 are based on a generalization of the regularity condition in [6] so as to allow arbitrary convex constraints.

The second set of results we presented in this paper were based on a least squares fit of the amplitudes. This objective function has been historically used in phase retrieval applications [7] and has close connections with the classical Fienup algorithm [69, Chapter 3]. Focusing on more recent literature, [79] demonstrated the effectiveness of this approach in optical applications. More recently, a few interesting publications [8, 77] study variants of this loss function and develop guarantees for its convergence. The analyses presented in both of these papers are also based on variants of the regularity condition of [6] and do not utilize any structural assumptions. In this paper we have analyzed the performance of amplitude-based PWF with any closed constraint (convex or nonconvex). These results are based on a new approach to analyzing nonconvex optimization problems that differs from the regularity approach used in [6] and all of the papers mentioned above. Rather, this new technique follows a more direct route, utilizing/developing powerful concentration inequalities to directly show that the error between the iterates and the structured signal decreases at each iteration.

A more recent line of research aims to provide a more general understanding of the geometric landscape of nonconvex optimization problems by showing that in many problems there are no spurious local minimizers, and saddle points have favorable properties [ ]. A major advantage of such results is that they do not require specialized initializations, in the sense that trust region-type algorithms or noisy stochastic methods are often guaranteed to converge from a random initialization, and not just when an initial solution is available in a local neighborhood of the optimal solution. The disadvantage of such results is that the guaranteed rates of convergence of these approaches are either not linear/geometric or each iteration is very costly. These approaches also have slightly looser sample complexity bounds. Perhaps the most relevant result of this kind to this paper is the interesting work of Sun, Qu and Wright [70], which studies the geometric landscape of the objective (5.1) in the absence of any regularizer. The authors also show that a certain trust region algorithm achieves a relative error of $\epsilon$ after $\mathcal{O}\left(n^7 \log^7 n + \log\log\frac{1}{\epsilon}\right)$ iterations, as long as the number of samples exceeds $n \log^3 n$. As mentioned previously, using different proof techniques, in a companion paper we demonstrate a result of a similar flavor to [70] for the constrained problem (5.1). That result shows that with on the order of $m_0 \log n$ measurements all local optima are global, and a second order scheme recovers the global optimum (the unknown signal) in a polynomial number of iterations.

We now pause to caution against erroneous misinterpretations of the theoretical results discussed in the previous paragraph, namely that:

- There are no spurious local optima, i.e. all local optima are global, in phase retrieval applications
- Initialization is irrelevant in phase retrieval applications

The reason these conclusions are inaccurate is two-fold. First, while the results of the previous paragraph and Theorem 5.2 both require on the order of $n \log n$ samples, the multiplicative constants in these results tend to be drastically different in practice. Second, the measurement vectors occurring in practical domains are substantially more ill-conditioned than the Gaussian measurements studied in this paper. This further amplifies the gap between the sample complexity of local versus global results. Indeed, in many practical domains where phase retrieval is applied, local optima abound and are a major source of algorithmic stagnation. Therefore, carefully crafted initialization schemes or regularization methods are crucial for the convergence of local search heuristics in many phaseless imaging domains.

We would also like to mention prior work on sparse phase retrieval. For generic measurements such as the Gaussian distribution studied in this paper, [54] provides guarantees for the convex relaxation-based PhaseLift algorithm as long as the number of samples exceeds $s^2 \log n$, where $s$ is the number of non-zeros in the sparse signal. The papers [54, 60] showed that these results are essentially unimprovable when using simple SDP relaxations. More recently, interesting work by Cai, Li and Ma [4] studies the performance of Wirtinger Flow based schemes for sparse phase retrieval problems. This result also requires $s^2 \log n$ measurements, even when an initialization obeying (5.2) is available. Therefore, this result also does not break through the local $s^2$ barrier. More recently, there have been a few publications aimed at going below $s^2$ measurements. These results differ from ours in that they are either applicable to other design models [49] which differ from phase retrieval, or to specific designs which tailor the algorithm to the measurement process [ ], or require additional constraints on the coefficients of the sparse signal [78, 34]. In contrast to the above publications, in this paper we have demonstrated that locally only $s \log(n/s)$ samples suffice to recover an $s$-sparse signal from generic quadratic measurements, formally breaking through the $s^2$ barrier. (After the first version of this paper appeared on arXiv, the authors of [34] removed the extra assumption that the magnitudes of the entries of the signal on the support are equal. That paper also now formally breaks through the $s^2$ barrier.) Furthermore, our results apply to any regularizer (convex or nonconvex), allowing us to enforce various forms of prior knowledge in our reconstruction.

Finally, we would like to mention that there have also been some recent publications aimed at developing theoretical guarantees for more practical models. For instance, the papers [ ] develop theoretical guarantees for convex relaxation techniques for more realistic Fourier-based models such as coded diffraction patterns and ptychography. More recently, the papers [ ] also develop some theoretical guarantees for faster but sometimes design-specific algorithms. Despite all of this interesting progress, the known results for more realistic measurement models are far inferior to their Gaussian counterparts in terms of sample complexity, computational complexity, or stability to noise. Closing these gaps is an interesting and important future direction.

7 Proofs

In the Gaussian model the measurement vectors also obey $\|a_r\|_{\ell_2}^2 \le 6n$ for all $r = 1, 2, \ldots, m$, with probability at least $1 - me^{-1.5n}$. Throughout the proofs we assume we are on this event without explicitly mentioning it each time. Without loss of generality we will assume throughout the proofs that $\|x\|_{\ell_2} = 1$. We remind the reader that throughout, $x$ is a solution to our quadratic equations,
i.e. it obeys $y = |Ax|^2$, and that the sampling vectors are independent from $x$. We also remind the reader that for a set $\mathcal{C} \subset \mathbb{R}^n$, $\omega(\mathcal{C})$ is the mean width of $\mathcal{C}$ per Definition 3.2. Throughout, we use $\mathcal{S}^{n-1}$/$\mathcal{B}^n$ to denote the unit sphere/unit ball of $\mathbb{R}^n$. We first discuss some common background and results used for proving both theorems. Since the proofs of the two theorems follow substantially different paths, we dedicate a subsection to each: Section 7.4 for the proof of Theorem 5.2 and Section 7.5 for the proof of Theorem 5.3.

7.1 Formulas for gradients and generalized gradients

As a reminder, the intensity-based loss function is equal to

$\mathcal{L}_I(z) = \frac{1}{4m}\sum_{r=1}^m \left(y_r - (a_r^T z)^2\right)^2,$

and the gradient is equal to

$\nabla\mathcal{L}_I(z) = \frac{1}{m}\sum_{r=1}^m \left((a_r^T z)^2 - y_r\right)(a_r^T z)\, a_r.$

As a reminder, the amplitude-based loss function is equal to

$\mathcal{L}_A(z) = \frac{1}{2m}\sum_{r=1}^m \left(\sqrt{y_r} - |a_r^T z|\right)^2,$

and the generalized gradient is equal to

$\nabla\mathcal{L}_A(z) = \frac{1}{m}\sum_{r=1}^m \left(|a_r^T z| - \sqrt{y_r}\right)\text{sgn}(a_r^T z)\, a_r.$

7.2 Concentration and bounds for stochastic processes

In this section we gather some useful results on concentration of stochastic processes which will be crucial in our proofs. We begin with a lemma which is a direct consequence of Gordon's escape from the mesh lemma [30], whose proof is deferred to Appendix A.1.

Lemma 7.1 Assume $\mathcal{C} \subset \mathbb{R}^n$ is a cone and $\mathcal{S}^{n-1}$ is the unit sphere of $\mathbb{R}^n$. Also assume that

$m \ge \max\left(\frac{20\,\omega^2(\mathcal{C} \cap \mathcal{S}^{n-1})}{\delta^2},\ \frac{c}{\delta^2}\right)$

for a fixed numerical constant $c$. Then for all $h \in \mathcal{C}$,

$\left|\frac{1}{m}\sum_{r=1}^m (a_r^T h)^2 - \|h\|_{\ell_2}^2\right| \le \delta\|h\|_{\ell_2}^2$

holds with probability at least $1 - e^{-\frac{\delta^2 m}{360}}$.

We also need a generalization of the above lemma, stated below and proved in Appendix A.2.

Lemma 7.2 Assume $\mathcal{C} \subset \mathbb{R}^n$ is a cone (not necessarily convex) and $\mathcal{S}^{n-1}$ is the unit sphere of $\mathbb{R}^n$. Also assume that

$m \ge \max\left(\frac{80\,\omega^2(\mathcal{C} \cap \mathcal{S}^{n-1})}{\delta^2},\ \frac{c}{\delta^2}\right)$

for a fixed numerical constant $c$. Then for all $u, h \in \mathcal{C}$,

$\left|\frac{1}{m}\sum_{r=1}^m (a_r^T u)(a_r^T h) - u^T h\right| \le \delta\|u\|_{\ell_2}\|h\|_{\ell_2}$

holds with probability at least $1 - 6e^{-\frac{\delta^2 m}{1440}}$.

We next state a generalization of Gordon's escape through the mesh lemma, whose proof appears in Appendix A.3.

Lemma 7.3 Let $d \in \mathbb{R}^m$ be a fixed vector with nonzero entries and construct the diagonal matrix $D = \text{diag}(d)$. Also let $A \in \mathbb{R}^{m \times n}$ have i.i.d. $\mathcal{N}(0, 1)$ entries. Furthermore, assume $\mathcal{T} \subset \mathbb{R}^n$ and define

$b_m(d) = \mathbb{E}\left[\|Dg\|_{\ell_2}\right],$

where $g \in \mathbb{R}^m$ is distributed as $\mathcal{N}(0, I_m)$. Also define

$\sigma(\mathcal{T}) = \max_{v \in \mathcal{T}}\ \|v\|_{\ell_2}.$

Then for all $u \in \mathcal{T}$,

$\left|\,\|DAu\|_{\ell_2} - b_m(d)\|u\|_{\ell_2}\right| \le \|d\|_{\ell_\infty}\,\omega(\mathcal{T}) + \eta$

holds with probability at least

$1 - 6e^{-\frac{\eta^2}{8\|d\|_{\ell_\infty}^2 \sigma^2(\mathcal{T})}}.$

The previous lemma leads to the following corollary. We skip the proof, as it is identical to the way Lemma 7.1 is derived from Gordon's lemma (see Section A.1 for details).

Corollary 7.4 Let $d \in \mathbb{R}^m$ be a fixed vector with nonzero entries and assume $\mathcal{T} \subset \mathcal{B}^n$. Furthermore, assume

$\sum_{r=1}^m d_r^2 \ \ge\ \max\left(\frac{20\,\|d\|_{\ell_\infty}^2\,\omega^2(\mathcal{T})}{\delta^2},\ \frac{3}{\delta^2}\right).$

Then for all $u \in \mathcal{T}$,

$\left|\frac{\sum_{r=1}^m d_r^2 (a_r^T u)^2}{\sum_{r=1}^m d_r^2} - \|u\|_{\ell_2}^2\right| \le \frac{\delta}{3}$

holds with probability at least

$1 - 6e^{-\frac{\delta^2}{1440}\cdot\frac{\sum_{r=1}^m d_r^2}{\|d\|_{\ell_\infty}^2}}.$
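As a quick numerical sanity check of the concentration phenomenon in Lemma 7.1 (for a single fixed $h$ rather than uniformly over a cone), the following snippet can be run; the dimensions and seed are arbitrary choices of ours.

```python
import numpy as np

# For Gaussian a_r and a fixed h, (1/m) * sum_r (a_r^T h)^2 should be within
# a small relative error of ||h||^2 once m is moderately large.
rng = np.random.default_rng(3)
n, m = 500, 4000
h = rng.standard_normal(n)
A = rng.standard_normal((m, n))
lhs = np.mean((A @ h) ** 2)
print(abs(lhs - h @ h) / (h @ h))   # relative deviation, typically ~ sqrt(1/m)
```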

The above generalization of Gordon's lemma, together with its corollary, will be very useful in our proofs. In particular, it allows us to prove the following key result, whose proof is also deferred to Appendix A.4.

Lemma 7.5 Assume $\mathcal{C} \subset \mathbb{R}^n$ is a cone and $\mathcal{S}^{n-1}$ is the unit sphere of $\mathbb{R}^n$. Furthermore, let $x \in \mathbb{R}^n$ be a fixed vector. Also assume that

$m \ge 600\,\max\left(\frac{\omega^2(\mathcal{C} \cap \mathcal{S}^{n-1})}{\delta^2},\ \frac{\log n}{\delta^2}\right).$

Then for all $h \in \mathcal{C}$,

$\left|\frac{1}{m}\sum_{r=1}^m (a_r^T h)^2 (a_r^T x)^2 - \left(\|h\|_{\ell_2}^2\|x\|_{\ell_2}^2 + 2(h^T x)^2\right)\right| \le \delta\|h\|_{\ell_2}^2\|x\|_{\ell_2}^2$

holds with probability at least $1 - 1/2n - e^{-\gamma_1 m} - 7e^{-\gamma_2 \delta^2 m}$, with $\gamma_1$ and $\gamma_2$ fixed numerical constants.

We also need the following important lemma. The proof of this lemma is based on the paper [57]; please also see [5] for related calculations. We defer the proof to Appendix A.5.

Lemma 7.6 Assume $\mathcal{C}_1, \mathcal{C}_2 \subset \mathbb{R}^n$ are cones and $\mathcal{S}^{n-1}$ is the unit sphere of $\mathbb{R}^n$. Also assume that

$m \ge c\,\max\left(\omega^2(\mathcal{C}_1 \cap \mathcal{S}^{n-1}),\ \omega^2(\mathcal{C}_2 \cap \mathcal{S}^{n-1})\right)$

for a fixed numerical constant $c$. Then for any $u \in \mathcal{C}_1$ and $v \in \mathcal{C}_2$,

$\left|\frac{1}{m}\sum_{r=1}^m |u^T a_r||a_r^T v| - \mathbb{E}\left[|u^T a||a^T v|\right]\right| \le \delta\|u\|_{\ell_2}\|v\|_{\ell_2}$

holds with probability at least $1 - 2e^{-\gamma \delta^2 m}$, where $\gamma$ is a fixed numerical constant. Here, $a \in \mathbb{R}^n$ is distributed as $\mathcal{N}(0, I_n)$.

We also state a simple generalization of Lemma 7.6 above. This lemma has a near identical proof; we skip the details for brevity.

Lemma 7.7 Assume $\mathcal{C}, \mathcal{D} \subset \mathbb{R}^n$ are sets with diameters bounded by fixed numerical constants. Also assume that

$m \ge c\,\max\left(\omega^2(\mathcal{C}),\ \omega^2(\mathcal{D})\right)$

for a fixed numerical constant $c$. Then for any $u \in \mathcal{C}$ and $v \in \mathcal{D}$,

$\left|\frac{1}{m}\sum_{r=1}^m |u^T a_r||a_r^T v| - \mathbb{E}\left[|u^T a||a^T v|\right]\right| \le \delta$

holds with probability at least $1 - 2e^{-\gamma \delta^2 m}$, where $\gamma$ is a fixed numerical constant. Here, $a \in \mathbb{R}^n$ is distributed as $\mathcal{N}(0, I_n)$.

Finally, we also need the following lemma, with the proof appearing in Appendix A.6.

Lemma 7.8 For any $u, v \in \mathbb{R}^n$, define $\theta = \cos^{-1}\left(\frac{u^T v}{\|u\|_{\ell_2}\|v\|_{\ell_2}}\right)$. Then we have

$|u^T v| \ \le\ \mathbb{E}\left[|u^T a||a^T v|\right] = \frac{2}{\pi}\|u\|_{\ell_2}\|v\|_{\ell_2}\left(\sin(\theta) + \cos(\theta)\left(\frac{\pi}{2} - \theta\right)\right) \ \ge\ \frac{2}{\pi}\|u\|_{\ell_2}\|v\|_{\ell_2}.$
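The closed form in Lemma 7.8 is easy to verify numerically; the following Monte Carlo check uses arbitrary test vectors of our choosing.

```python
import numpy as np

# Check: E|u^T a||a^T v| = (2/pi) ||u|| ||v|| (sin(theta) + (pi/2 - theta) cos(theta))
# for a ~ N(0, I), with theta the angle between u and v.
rng = np.random.default_rng(4)
n, trials = 20, 200000
u, v = rng.standard_normal(n), rng.standard_normal(n)
theta = np.arccos(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
a = rng.standard_normal((trials, n))
mc = np.mean(np.abs(a @ u) * np.abs(a @ v))
closed = (2 / np.pi) * np.linalg.norm(u) * np.linalg.norm(v) * (
    np.sin(theta) + (np.pi / 2 - theta) * np.cos(theta))
print(mc, closed)   # the two numbers should agree to ~1e-2 relative error
```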

7.3 Cone and projection identities

In this section we gather a few results regarding higher dimensional cones and projections that are used throughout the proofs. These results are directly adapted from [6, Section 6.2]. We begin with a result about projections onto sets. The first part, concerning projections onto convex sets, is the well known contractivity result regarding convex projections.

Lemma 7.9 Assume $\mathcal{K} \subset \mathbb{R}^n$ is a closed set and $v \in \mathbb{R}^n$. Then, if $\mathcal{K}$ is convex, for every $u \in \mathcal{K}$ we have

$\|\mathcal{P}_\mathcal{K}(v) - u\|_{\ell_2} \le \|v - u\|_{\ell_2}. \qquad (7.1)$

Furthermore, for any closed set $\mathcal{K}$ (not necessarily convex) and for every $u \in \mathcal{K}$ we have

$\|\mathcal{P}_\mathcal{K}(v) - u\|_{\ell_2} \le 2\|v - u\|_{\ell_2}. \qquad (7.2)$

Proof. Equation (7.1) is well known. We shall prove the second result. To this aim, note that by the definition of the projection onto a set we have

$\|v - \mathcal{P}_\mathcal{K}(v)\|_{\ell_2} \le \|v - u\|_{\ell_2}. \qquad (7.3)$

Also note that

$\|v - \mathcal{P}_\mathcal{K}(v)\|_{\ell_2}^2 = \|(v - u) - (\mathcal{P}_\mathcal{K}(v) - u)\|_{\ell_2}^2 = \|v - u\|_{\ell_2}^2 + \|\mathcal{P}_\mathcal{K}(v) - u\|_{\ell_2}^2 - 2\langle \mathcal{P}_\mathcal{K}(v) - u,\ v - u\rangle.$

Combining the latter identity with (7.3) and using the Cauchy-Schwarz inequality, we have

$\|\mathcal{P}_\mathcal{K}(v) - u\|_{\ell_2}^2 = \|v - \mathcal{P}_\mathcal{K}(v)\|_{\ell_2}^2 - \|v - u\|_{\ell_2}^2 + 2\langle \mathcal{P}_\mathcal{K}(v) - u,\ v - u\rangle \le 2\langle \mathcal{P}_\mathcal{K}(v) - u,\ v - u\rangle \le 2\|\mathcal{P}_\mathcal{K}(v) - u\|_{\ell_2}\|v - u\|_{\ell_2}.$

Dividing both sides of the above inequality by $\|\mathcal{P}_\mathcal{K}(v) - u\|_{\ell_2}$ concludes the proof.

We now state a result concerning projections onto cones.

Lemma 7.10 Let $\mathcal{C} \subset \mathbb{R}^n$ be a closed cone and $v \in \mathbb{R}^n$. The following two identities hold:

$\|v\|_{\ell_2}^2 = \|v - \mathcal{P}_\mathcal{C}(v)\|_{\ell_2}^2 + \|\mathcal{P}_\mathcal{C}(v)\|_{\ell_2}^2, \qquad (7.4)$
$\|\mathcal{P}_\mathcal{C}(v)\|_{\ell_2} = \sup_{u \in \mathcal{C} \cap \mathcal{B}^n} \langle u, v\rangle. \qquad (7.5)$

The following lemma is straightforward and follows from the fact that translation preserves distances.

Lemma 7.11 Suppose $\mathcal{K} \subset \mathbb{R}^n$ is a closed set. The projection onto $\mathcal{K}$ obeys

$\mathcal{P}_\mathcal{K}(x + v) - x = \mathcal{P}_{\mathcal{K} - \{x\}}(v).$

The next lemma compares the length of the projection onto a set to the length of the projection onto the conic approximation of the set.

Lemma 7.12 (Comparison of projections) Let $\mathcal{D}$ be a closed and nonempty set that contains $0$. Let $\mathcal{C}$ be a nonempty and closed cone containing $\mathcal{D}$ ($\mathcal{D} \subset \mathcal{C}$). Then for all $v \in \mathbb{R}^n$,

$\|\mathcal{P}_\mathcal{D}(v)\|_{\ell_2} \le 2\|\mathcal{P}_\mathcal{C}(v)\|_{\ell_2}. \qquad (7.6)$

Furthermore, assume $\mathcal{D}$ is a convex set. Then for all $v \in \mathbb{R}^n$,

$\|\mathcal{P}_\mathcal{D}(v)\|_{\ell_2} \le \|\mathcal{P}_\mathcal{C}(v)\|_{\ell_2}. \qquad (7.7)$

7.4 Convergence analysis for intensity-based Wirtinger Flows

In this section we shall prove Theorem 5.2. The proof of this result is based on an extension of the framework developed in [6]; therefore, the outline of our exposition closely follows that of [6]. Section 7.4.1 discusses our general convergence analysis and shows that it follows from a certain Regularity Condition (RC). In this section we also show that the regularity condition can be proven by establishing two sufficient conditions, the Local Curvature and Local Smoothness conditions, denoted by LCC and LSC. We then prove the Local Curvature condition in Section 7.4.2 and the Local Smoothness condition in Section 7.4.3.

7.4.1 General convergence analysis

Note that (5.2) guarantees that either $\|z_0 - x\|_{\ell_2}$ or $\|z_0 + x\|_{\ell_2}$ is small. Throughout the proof, without loss of generality we assume $\|z_0 - x\|_{\ell_2}$ is the smaller one. To introduce our general convergence analysis we begin by defining

$E(\epsilon) = \{z \in \mathbb{R}^n : \mathcal{R}(z) \le \mathcal{R}(x),\ \|z - x\|_{\ell_2} \le \epsilon\}.$

Note that when condition (5.2) holds, the next iterate $z_1$ obeys $z_1 \in E(\epsilon)$ with $\epsilon = 1/8$. The reason is that when the regularizer is convex so is the set $\mathcal{K}$, and by the contractivity of projections onto convex sets (Lemma 7.9) we also have

$\|z_1 - x\|_{\ell_2} = \|\mathcal{P}_\mathcal{K}(z_0) - x\|_{\ell_2} \le \|z_0 - x\|_{\ell_2} \le \epsilon.$

We will assume that the function $\mathcal{L}_I$ satisfies a regularity condition on $E(\epsilon)$, which essentially states that the gradient of the function is well-behaved.

Condition 7.13 (Regularity Condition) We say that the function $\mathcal{L}_I$ satisfies the regularity condition $RC(\alpha, \beta, \epsilon)$ if for all vectors $z \in E(\epsilon)$ we have

$\left\langle \nabla\mathcal{L}_I(z),\ z - x\right\rangle \ge \frac{1}{\alpha}\|z - x\|_{\ell_2}^2 + \frac{1}{\beta}\|\nabla\mathcal{L}_I(z)\|_{\ell_2}^2. \qquad (7.8)$

In the lemma below we show that as long as the regularity condition holds on $E(\epsilon)$, then Projected Wirtinger Flow starting from an initial solution in $E(\epsilon)$ converges to a global optimizer at a geometric rate. Subsequent sections shall establish that this property holds.

Lemma 7.14 Assume $\mathcal{L}_I$ obeys $RC(\alpha, \beta, \epsilon)$ for all $z \in E(\epsilon)$. Furthermore, suppose $z_0 \in E(\epsilon)$ and assume $0 < \mu \le 2/\beta$. Consider the following update

$z_{\tau+1} = \mathcal{P}_\mathcal{K}\left(z_\tau - \mu\nabla\mathcal{L}_I(z_\tau)\right).$
Then for all $\tau$ we have $z_\tau \in E(\epsilon)$ and

$\|z_\tau - x\|_{\ell_2} \le \left(1 - \frac{2\mu}{\alpha}\right)^{\tau/2} \|z_0 - x\|_{\ell_2}.$

Proof. The proof is similar to a related proof in the Wirtinger Flow paper [6]. We prove that if $z \in E(\epsilon)$, then for all $0 < \mu \le 2/\beta$ the point $z_+ = z - \mu\nabla\mathcal{L}_I(z)$ obeys

$\|z_+ - x\|_{\ell_2}^2 \le \left(1 - \frac{2\mu}{\alpha}\right)\|z - x\|_{\ell_2}^2. \qquad (7.9)$

The latter implies that if $\|z - x\|_{\ell_2} \le \epsilon$ then $\|z_+ - x\|_{\ell_2} \le \epsilon$. Combining the latter with the fact that projections onto convex sets are contractive (Lemma 7.9), we conclude that

$\|\mathcal{P}_\mathcal{K}(z_+) - x\|_{\ell_2} = \|\mathcal{P}_\mathcal{K}(z_+) - \mathcal{P}_\mathcal{K}(x)\|_{\ell_2} \le \|z_+ - x\|_{\ell_2} \le \|z - x\|_{\ell_2} \le \epsilon. \qquad (7.10)$

Also, by the definition of $\mathcal{P}_\mathcal{K}$ we have $\mathcal{R}(\mathcal{P}_\mathcal{K}(z_+)) \le \mathcal{R}(x)$. Therefore, if $z \in E(\epsilon)$ then we also have $\mathcal{P}_\mathcal{K}(z_+) \in E(\epsilon)$. The lemma follows by inductively applying (7.9) and (7.10). Now let us demonstrate how (7.9) follows from simple algebraic manipulations together with the regularity condition (7.8). To this aim, note that

$\|z_+ - x\|_{\ell_2}^2 = \|z - x - \mu\nabla\mathcal{L}_I(z)\|_{\ell_2}^2 = \|z - x\|_{\ell_2}^2 - 2\mu\left\langle \nabla\mathcal{L}_I(z),\ z - x\right\rangle + \mu^2\|\nabla\mathcal{L}_I(z)\|_{\ell_2}^2$
$\le \|z - x\|_{\ell_2}^2 - 2\mu\left(\frac{1}{\alpha}\|z - x\|_{\ell_2}^2 + \frac{1}{\beta}\|\nabla\mathcal{L}_I(z)\|_{\ell_2}^2\right) + \mu^2\|\nabla\mathcal{L}_I(z)\|_{\ell_2}^2$
$= \left(1 - \frac{2\mu}{\alpha}\right)\|z - x\|_{\ell_2}^2 + \mu\left(\mu - \frac{2}{\beta}\right)\|\nabla\mathcal{L}_I(z)\|_{\ell_2}^2 \le \left(1 - \frac{2\mu}{\alpha}\right)\|z - x\|_{\ell_2}^2,$

where the last line follows from $\mu \le 2/\beta$. This concludes the proof.

For any $z \in E(\epsilon)$ we need to show that

$\left\langle \nabla\mathcal{L}_I(z),\ z - x\right\rangle \ge \frac{1}{\alpha}\|z - x\|_{\ell_2}^2 + \frac{1}{\beta}\|\nabla\mathcal{L}_I(z)\|_{\ell_2}^2. \qquad (7.11)$

We prove that (7.11) holds with $\epsilon = \frac{1}{8}$ by establishing that our gradient satisfies the local smoothness and local curvature conditions defined below. Combining these two properties gives (7.11); indeed, (7.13) implies $\lambda\|z - x\|_{\ell_2}^2 + \gamma\,\frac{1}{m}\sum_{r=1}^m (a_r^T(z-x))^4 \ge \frac{1}{\beta}\|\nabla\mathcal{L}_I(z)\|_{\ell_2}^2$, which together with (7.12) yields (7.11).

Condition 7.15 (Local Curvature Condition) We say that the function $\mathcal{L}_I$ satisfies the local curvature condition $LCC(\alpha, \epsilon, \delta)$ if for all vectors $z \in E(\epsilon)$,

$\left\langle \nabla\mathcal{L}_I(z),\ z - x\right\rangle \ge \left(\frac{1}{\alpha} + \lambda\right)\|z - x\|_{\ell_2}^2 + \gamma\,\frac{1}{m}\sum_{r=1}^m \left(a_r^T(z - x)\right)^4. \qquad (7.12)$

This condition essentially states that the function curves sufficiently upwards (along most directions) near the curve of global optimizers.

Condition 7.16 (Local Smoothness Condition) We say that the function $\mathcal{L}_I$ satisfies the local smoothness condition $LSC(\beta, \epsilon, \delta)$ if for all vectors $z \in E(\epsilon)$ we have

$\|\nabla\mathcal{L}_I(z)\|_{\ell_2}^2 \le \beta\left(\lambda\|z - x\|_{\ell_2}^2 + \gamma\,\frac{1}{m}\sum_{r=1}^m \left(a_r^T(z - x)\right)^4\right). \qquad (7.13)$

This condition essentially states that the gradient of the function is well behaved (the function does not vary too much) near the curve of global optimizers.

7.4.2 Proof of the local curvature condition

For any $z \in E(\epsilon)$ we want to prove the local curvature condition (7.12). Recall that

$\nabla\mathcal{L}_I(z) = \frac{1}{m}\sum_{r=1}^m \left((a_r^T z)^2 - y_r\right)(a_r a_r^T)z,$

and define $h = z - x$. To establish (7.12) it suffices to prove that

$\frac{1}{m}\sum_{r=1}^m \left(2(h^T a_r a_r^T x)^2 + 3(h^T a_r a_r^T x)|a_r^T h|^2 + |a_r^T h|^4\right) - \gamma\,\frac{1}{m}\sum_{r=1}^m |a_r^T h|^4 \ge \left(\frac{1}{\alpha} + \lambda\right)\|h\|_{\ell_2}^2 \qquad (7.14)$

holds for all $h$ satisfying $\|h\|_{\ell_2} \le \epsilon$. Equivalently, we only need to prove that for all $h$ satisfying $\|h\|_{\ell_2} \le \epsilon$ we have

$\frac{1}{m}\sum_{r=1}^m \left(2(h^T a_r a_r^T x)^2 + 3(h^T a_r a_r^T x)|a_r^T h|^2 + (1 - \gamma)|a_r^T h|^4\right) \ge \left(\frac{1}{\alpha} + \lambda\right)\|h\|_{\ell_2}^2. \qquad (7.15)$

Define the following cone, which is a cone of descent of $\mathcal{R}$ at $x$:

$\mathcal{C}_\mathcal{R}(x) = \{h : \mathcal{R}(x + ch) \le \mathcal{R}(x)\ \text{for some}\ c > 0\}.$

Now note that since $h = z - x$ and $\mathcal{R}(z) \le \mathcal{R}(x)$, we have $h \in \mathcal{C}_\mathcal{R}(x)$. Note that by Lemma 7.5, as long as

$m \ge 600\,\max\left(\frac{\omega^2(\mathcal{C}_\mathcal{R} \cap \mathcal{S}^{n-1})}{\delta^2},\ \frac{\log n}{\delta^2}\right) = 600\,\max\left(\frac{m_0}{\delta^2},\ \frac{\log n}{\delta^2}\right),$

then

$\frac{1}{m}\sum_{r=1}^m (a_r^T h)^2 (a_r^T x)^2 \le (1 + \delta)\|h\|_{\ell_2}^2 + 2(h^T x)^2$

holds with probability at least $1 - 1/2n - e^{-\gamma_1 m} - 7e^{-\gamma_2 \delta^2 m}$, with $\gamma_1$ and $\gamma_2$ fixed numerical constants. Therefore, to establish the local curvature condition (7.12), it suffices to show

$\frac{1}{m}\sum_{r=1}^m \left((\eta + 2)(h^T a_r a_r^T x)^2 + 3(h^T a_r a_r^T x)|a_r^T h|^2 + (1 - \gamma)|a_r^T h|^4\right) \ge \left(\frac{1}{\alpha} + \lambda + \eta(1 + \delta)\right)\|h\|_{\ell_2}^2 + 2\eta(h^T x)^2. \qquad (7.16)$

We pick $\eta$ such that $\eta + 2 = \frac{9}{4(1-\gamma)}$. This is equivalent to establishing that

$\frac{1}{m}\sum_{r=1}^m\left(\frac{3}{2\sqrt{1-\gamma}}\,h^T a_r\,a_r^T x + \sqrt{1-\gamma}\,|a_r^T h|^2\right)^2 \ge \left(\frac{1}{\alpha} + \lambda - 2(1+\delta) + \frac{9(1+\delta)}{4(1-\gamma)}\right)\|h\|_{\ell_2}^2 + \left(\frac{9}{2(1-\gamma)} - 4\right)(h^T x)^2. \qquad (7.17)$

We note that

$\frac{3}{2\sqrt{1-\gamma}}\,h^T a_r\,a_r^T x + \sqrt{1-\gamma}\,|a_r^T h|^2 = \frac{3}{2\sqrt{1-\gamma}}\,h^T a_r\,a_r^T\!\left(x + \frac{2}{3}(1-\gamma)h\right).$

Therefore, it suffices to prove

$\frac{1}{m}\sum_{r=1}^m\left(h^T a_r\,a_r^T\!\left(x + \tfrac{2}{3}(1-\gamma)h\right)\right)^2 \ge \left(\frac{4}{9}(1-\gamma)\left(\frac{1}{\alpha} + \lambda - 2(1+\delta)\right) + (1+\delta)\right)\|h\|_{\ell_2}^2 + \frac{2}{9}(1+8\gamma)(h^T x)^2.$

Noting that

$\frac{1}{m}\sum_{r=1}^m\left(h^T a_r\,a_r^T\!\left(x + \tfrac{2}{3}(1-\gamma)h\right)\right)^2 \ge \left(\frac{1}{m}\sum_{r=1}^m\left|h^T a_r\right|\left|a_r^T\!\left(x + \tfrac{2}{3}(1-\gamma)h\right)\right|\right)^2, \qquad (7.18)$

it suffices to prove

$\frac{1}{m}\sum_{r=1}^m\left|h^T a_r\right|\left|a_r^T\!\left(x + \tfrac{2}{3}(1-\gamma)h\right)\right| \ge \sqrt{\left(\frac{4}{9}(1-\gamma)\left(\frac{1}{\alpha} + \lambda - 2(1+\delta)\right) + (1+\delta)\right)\|h\|_{\ell_2}^2 + \frac{2}{9}(1+8\gamma)(h^T x)^2}. \qquad (7.19)$

To establish (7.19) we shall utilize Lemma 7.6. To this aim, note that since $h = z - x$ and $\mathcal{R}(z) \le \mathcal{R}(x)$, we have $h \in \mathcal{C}_\mathcal{R}(x)$. Now define the set

$\mathcal{T} = \left\{\frac{\frac{2\gamma+1}{3}\,x + \frac{2(1-\gamma)}{3}\,z}{\left\|\frac{2\gamma+1}{3}\,x + \frac{2(1-\gamma)}{3}\,z\right\|_{\ell_2}} :\ \mathcal{R}(z) \le \mathcal{R}(x)\ \text{and}\ \|z - x\|_{\ell_2} \le \epsilon\right\}$

and set $\mathcal{C}_2 = \text{cone}(\mathcal{T})$. Note that

$x + \frac{2}{3}(1-\gamma)h = \frac{2\gamma+1}{3}\,x + \frac{2(1-\gamma)}{3}\,z,$

so that $\left(x + \frac{2}{3}(1-\gamma)h\right) \in \mathcal{C}_2$. Also note that for all $z \in E(\epsilon)$, with $\epsilon < \frac{3}{2(1-\gamma)}$,

$\left\|\frac{2\gamma+1}{3}\,x + \frac{2(1-\gamma)}{3}\,z\right\|_{\ell_2} = \left\|x + \frac{2(1-\gamma)}{3}(z - x)\right\|_{\ell_2} \ge 1 - \frac{2(1-\gamma)}{3}\,\epsilon > 0. \qquad (7.20)$

Similarly, for all $z \in E(\epsilon)$,

$\left\|\frac{2\gamma+1}{3}\,x + \frac{2(1-\gamma)}{3}\,z\right\|_{\ell_2} = \left\|x + \frac{2(1-\gamma)}{3}(z - x)\right\|_{\ell_2} \le 1 + \frac{2(1-\gamma)}{3}\,\epsilon. \qquad (7.21)$

Define

$\mathcal{T}' = \left\{\frac{2\gamma+1}{3}\,x + \frac{2(1-\gamma)}{3}\,z :\ \mathcal{R}(z) \le \mathcal{R}(x)\ \text{and}\ \|z - x\|_{\ell_2} \le \epsilon\right\}.$

Now set $v = \arg\max_{u \in \mathcal{T}}\ a^T u$. By definition of $\mathcal{T}$, $v$ is of the form

$v = \frac{\frac{2\gamma+1}{3}\,x + \frac{2(1-\gamma)}{3}\,z_v}{\left\|\frac{2\gamma+1}{3}\,x + \frac{2(1-\gamma)}{3}\,z_v\right\|_{\ell_2}}$

for some $z_v \in E(\epsilon)$. If $a^T v \ge 0$, using (7.20) we have

$a^T v = \frac{a^T\left(\frac{2\gamma+1}{3}\,x + \frac{2(1-\gamma)}{3}\,z_v\right)}{\left\|\frac{2\gamma+1}{3}\,x + \frac{2(1-\gamma)}{3}\,z_v\right\|_{\ell_2}} \le \frac{1}{1 - \frac{2(1-\gamma)}{3}\,\epsilon}\,\max\left(\sup_{u \in \mathcal{T}'} a^T u,\ 0\right). \qquad (7.22)$

On the other hand, if $a^T v < 0$, using (7.21) we have

$a^T v = \frac{a^T\left(\frac{2\gamma+1}{3}\,x + \frac{2(1-\gamma)}{3}\,z_v\right)}{\left\|\frac{2\gamma+1}{3}\,x + \frac{2(1-\gamma)}{3}\,z_v\right\|_{\ell_2}} \le \frac{1}{1 + \frac{2(1-\gamma)}{3}\,\epsilon}\,\sup_{u \in \mathcal{T}'} a^T u. \qquad (7.23)$

Inequalities (7.22) and (7.23) immediately imply

$\max\left(a^T v,\ 0\right) \le \frac{\max\left(\sup_{u \in \mathcal{T}'} a^T u,\ 0\right)}{1 - \frac{2(1-\gamma)}{3}\,\epsilon} \quad\text{and}\quad \min\left(a^T v,\ 0\right) \le \frac{\min\left(\sup_{u \in \mathcal{T}'} a^T u,\ 0\right)}{1 + \frac{2(1-\gamma)}{3}\,\epsilon}. \qquad (7.24)$

21 By (7.) σ(t ) = sup v l 3 v T ( γ) + ɛ. Thus using (7.4) we have ω(t ) = E[sup a v] v T = E[sup (ax(a v 0) + in(a v 0))] v T E ax (sup a u 0) + in (sup a u 0) ( 3 ( γ) ɛ) u T ( 3 ( γ) + ɛ) u T = E ɛ ax (sup a u 0) + sup a u ( 9 4 ɛ ( γ) ) u T ( 3 ( γ) + ɛ) u T ɛ = ( 9 4 ɛ ( γ) ) E ax (sup a u 0) + u T ( 3 ( γ) + ɛ) ω(t ) ɛ = ( 9 4 ɛ ( γ) ) P ax (sup a u 0) t dt + 0 u T ( 3 ( γ) + ɛ) ω(t ) ɛ = ( 9 4 ɛ ( γ) ) P{sup a u t}dt + 0 u T ( 3 ( γ) + ɛ) ω(t ) = ɛ ( 9 4 ɛ ( γ) ) 0 ( 9 4 ɛ ( γ) ) ( 9 4 ( 3 e (t ω(t )) σ (T ) dt + ( 3 ( γ) + ɛ) ω(t ) ɛ π σ(t ) (erf ( ω(t ) σ(t ) ) + ) + ( 3 ( γ) + ɛ) ω(t ) 4ɛ π ɛ ( γ) ) σ(t ) + ( 3 ( γ) + ɛ) ω(t ) ( γ) ɛ) ( 8πɛ + ω(t )). (7.5)

Also, using the fact that $z \in E(\epsilon)$, we have

$\omega(\mathcal{T}') = \mathbb{E}\left[\sup_{z \in E(\epsilon)} a^T\left(\frac{2\gamma+1}{3}\,x + \frac{2(1-\gamma)}{3}\,z\right)\right] = \mathbb{E}\left[\sup_{z \in E(\epsilon)} a^T\left(x + \frac{2(1-\gamma)}{3}(z - x)\right)\right]$
$= \frac{2(1-\gamma)}{3}\,\mathbb{E}\left[\sup_{z \in E(\epsilon)} a^T(z - x)\right] \le \epsilon\,\mathbb{E}\left[\sup_{u \in \mathcal{C}_\mathcal{R}(x) \cap \mathcal{B}^n} a^T u\right] = \epsilon\,\omega\left(\mathcal{C}_\mathcal{R}(x) \cap \mathcal{B}^n\right) \le \epsilon\,\omega\left(\mathcal{C}_\mathcal{R}(x) \cap \mathcal{S}^{n-1}\right).$

Now, using (7.25) together with the above, we have

$\omega(\mathcal{C}_2 \cap \mathcal{S}^{n-1}) = \omega(\mathcal{T}) \le \frac{\epsilon}{1 - \frac{2(1-\gamma)}{3}\epsilon}\left(\sqrt{8\pi} + \omega\left(\mathcal{C}_\mathcal{R}(x) \cap \mathcal{S}^{n-1}\right)\right) \le \sqrt{8\pi} + \omega\left(\mathcal{C}_\mathcal{R}(x) \cap \mathcal{S}^{n-1}\right).$

Therefore, as long as $m \ge c\,\max\left(\omega^2\left(\mathcal{C}_\mathcal{R}(x) \cap \mathcal{S}^{n-1}\right),\ 1\right)$ for a fixed numerical constant $c$, applying Lemma 7.6 with $u = h$ and $v = x + \frac{2}{3}(1-\gamma)h$ and $\delta = \frac{1}{2\pi}$, with probability at least $1 - 2e^{-\gamma m}$ we have

$\frac{1}{m}\sum_{r=1}^m |h^T a_r|\left|a_r^T\left(x + \tfrac{2}{3}(1-\gamma)h\right)\right| \ge \mathbb{E}\left[|h^T a|\left|a^T\left(x + \tfrac{2}{3}(1-\gamma)h\right)\right|\right] - \frac{1}{2\pi}\|h\|_{\ell_2}\left\|x + \tfrac{2}{3}(1-\gamma)h\right\|_{\ell_2} \ge \frac{2}{\pi}\left(1 - \frac{1}{4}\right)\|h\|_{\ell_2}\left\|x + \tfrac{2}{3}(1-\gamma)h\right\|_{\ell_2},$

where in the last inequality we have applied Lemma 7.8. To prove (7.19) it then suffices to show

$\frac{4}{\pi^2}\left(1 - \frac{1}{4}\right)^2\|h\|_{\ell_2}^2\left\|x + \tfrac{2}{3}(1-\gamma)h\right\|_{\ell_2}^2 \ge \left(\frac{4}{9}(1-\gamma)\left(\frac{1}{\alpha} + \lambda - 2(1+\delta)\right) + (1+\delta)\right)\|h\|_{\ell_2}^2 + \frac{2}{9}(1+8\gamma)(h^T x)^2.$

Using the fact that $\left\|x + \frac{2}{3}(1-\gamma)h\right\|_{\ell_2} \ge 1 - \frac{2}{3}(1-\gamma)\epsilon$ and $(h^T x)^2 \le \|h\|_{\ell_2}^2$, it suffices to prove

$\frac{4}{\pi^2}\left(1 - \frac{1}{4}\right)^2\left(1 - \frac{2}{3}(1-\gamma)\epsilon\right)^2 \ge \frac{4}{9}(1-\gamma)\left(\frac{1}{\alpha} + \lambda - 2(1+\delta)\right) + (1+\delta) + \frac{2}{9}(1+8\gamma).$

The latter holds as long as

$\epsilon \le \frac{3}{2(1-\gamma)}\left(1 - \frac{\pi}{2}\left(1 - \frac{1}{4}\right)^{-1}\sqrt{\frac{4}{9}(1-\gamma)\left(\frac{1}{\alpha} + \lambda - 2(1+\delta)\right) + (1+\delta) + \frac{2}{9}(1+8\gamma)}\right). \qquad (7.26)$

Using the values $\alpha = 50$, $\lambda = \frac{1}{50}$, $\gamma = \frac{1}{1000}$, $\delta = \frac{1}{1000}$, and $m \ge 1000\, m_0$, the inequality in (7.26) holds as long as $\epsilon \le \frac{1}{8}$, completing the proof.

7.4.3 Proof of the local smoothness condition

For any $z \in E(\epsilon)$ we want to prove (7.13), which is equivalent to proving that for all $w \in \mathbb{R}^n$ obeying $\|w\|_{\ell_2} = 1$ we have

$\left\langle \nabla\mathcal{L}_I(z),\ w\right\rangle^2 \le \beta\left(\lambda\|z - x\|_{\ell_2}^2 + \gamma\,\frac{1}{m}\sum_{r=1}^m |a_r^T(z - x)|^4\right).$

Recall that

$\nabla\mathcal{L}_I(z) = \frac{1}{m}\sum_{r=1}^m \left((a_r^T z)^2 - y_r\right)(a_r a_r^T)z,$

and define

$g(h, w) = \frac{1}{m}\sum_{r=1}^m \left(2(h^T a_r)(w^T a_r)|a_r^T x|^2 + 3|a_r^T h|^2 (w^T a_r)(a_r^T x) + (a_r^T h)^3 (w^T a_r)\right).$

Define $h = z - x$. To establish (7.13) it suffices to prove that

$g^2(h, w) \le \beta\left(\lambda\|h\|_{\ell_2}^2 + \gamma\,\frac{1}{m}\sum_{r=1}^m |a_r^T h|^4\right) \qquad (7.27)$

holds for all $h$ and $w$ satisfying $\|h\|_{\ell_2} \le \epsilon$ and $\|w\|_{\ell_2} = 1$. Note that since $(a + b + c)^2 \le 3(a^2 + b^2 + c^2)$,

$g^2(h, w) \le \left(\frac{1}{m}\sum_{r=1}^m \left(2|h^T a_r||w^T a_r||a_r^T x|^2 + 3|h^T a_r|^2|a_r^T x||w^T a_r| + |a_r^T h|^3|w^T a_r|\right)\right)^2$
$\le 3\left(\frac{2}{m}\sum_{r=1}^m |h^T a_r||w^T a_r||a_r^T x|^2\right)^2 + 3\left(\frac{3}{m}\sum_{r=1}^m |h^T a_r|^2|a_r^T x||w^T a_r|\right)^2 + 3\left(\frac{1}{m}\sum_{r=1}^m |a_r^T h|^3|w^T a_r|\right)^2 = 3(I_1 + I_2 + I_3). \qquad (7.28)$

We now bound each of the terms on the right-hand side. For the first term, using Cauchy-Schwarz and applying Lemma 7.1 and Lemma 7.5, we have

$I_1 \le 4\left(\frac{1}{m}\sum_{r=1}^m |a_r^T x|^2|a_r^T w|^2\right)\left(\frac{1}{m}\sum_{r=1}^m |a_r^T x|^2|a_r^T h|^2\right) \le 4\cdot 6n\left(\frac{1}{m}\sum_{r=1}^m |a_r^T x|^2\right)\left((1+\delta)\|h\|_{\ell_2}^2 + 2(x^T h)^2\right) \le 24(1+\delta)\,n\,\|h\|_{\ell_2}^2 \le 48\,n\,\|h\|_{\ell_2}^2. \qquad (7.29)$

Similarly, for the second term we have

$I_2 \le 9\left(\frac{1}{m}\sum_{r=1}^m |a_r^T w|^2|a_r^T x|^2\right)\left(\frac{1}{m}\sum_{r=1}^m |a_r^T h|^4\right) \le 54(1+\delta)\,n\left(\frac{1}{m}\sum_{r=1}^m |a_r^T h|^4\right). \qquad (7.30)$

Finally, for the third term we use the Cauchy-Schwarz inequality together with Lemma 7.1 to derive

$I_3 \le \left(\frac{1}{m}\sum_{r=1}^m |a_r^T h|^3\,\max_r \|a_r\|_{\ell_2}\right)^2 \le 6n\left(\frac{1}{m}\sum_{r=1}^m |a_r^T h|^3\right)^2 \le 6n\left(\frac{1}{m}\sum_{r=1}^m |a_r^T h|^2\right)\left(\frac{1}{m}\sum_{r=1}^m |a_r^T h|^4\right) \le 6(1+\delta)\,n\,\|h\|_{\ell_2}^2\left(\frac{1}{m}\sum_{r=1}^m |a_r^T h|^4\right). \qquad (7.31)$

We now plug these inequalities into (7.28) and get

$g^2(h, w) \le 48\,n\,\|h\|_{\ell_2}^2 + 3\,n\left(\frac{1}{m}\sum_{r=1}^m |a_r^T h|^4\right) \le \beta\left(\lambda\|h\|_{\ell_2}^2 + \gamma\,\frac{1}{m}\sum_{r=1}^m |a_r^T h|^4\right), \qquad (7.32)$

which completes the proof of (7.27) and in turn establishes the local smoothness condition (7.13). However, the last line of (7.32) holds as long as

$\beta \ge \max\left(\frac{48}{\lambda},\ \frac{3}{\gamma}\right) n = 3000\,n, \qquad (7.33)$

completing the proof.

7.5 Convergence analysis for amplitude-based Wirtinger Flows

In this section we shall prove Theorem 5.3. Throughout, we use the shorthand $\mathcal{C}$ to denote the descent cone of $\mathcal{R}$ at $x$, i.e. $\mathcal{C} = \mathcal{C}_\mathcal{R}(x)$. Note that (5.8) guarantees that either $\|z_0 - x\|_{\ell_2}$ or $\|z_0 + x\|_{\ell_2}$ is small. Throughout the proof, without loss of generality we assume $\|z_0 - x\|_{\ell_2}$ is the smaller one. To introduce our general convergence analysis we begin again by defining

$E(\epsilon) = \{z \in \mathbb{R}^n : \mathcal{R}(z) \le \mathcal{R}(x),\ \|z - x\|_{\ell_2} \le \epsilon\}.$


More information

Intelligent Systems: Reasoning and Recognition. Artificial Neural Networks

Intelligent Systems: Reasoning and Recognition. Artificial Neural Networks Intelligent Systes: Reasoning and Recognition Jaes L. Crowley MOSIG M1 Winter Seester 2018 Lesson 7 1 March 2018 Outline Artificial Neural Networks Notation...2 Introduction...3 Key Equations... 3 Artificial

More information

Supplementary Material for Fast and Provable Algorithms for Spectrally Sparse Signal Reconstruction via Low-Rank Hankel Matrix Completion

Supplementary Material for Fast and Provable Algorithms for Spectrally Sparse Signal Reconstruction via Low-Rank Hankel Matrix Completion Suppleentary Material for Fast and Provable Algoriths for Spectrally Sparse Signal Reconstruction via Low-Ran Hanel Matrix Copletion Jian-Feng Cai Tianing Wang Ke Wei March 1, 017 Abstract We establish

More information

Lower Bounds for Quantized Matrix Completion

Lower Bounds for Quantized Matrix Completion Lower Bounds for Quantized Matrix Copletion Mary Wootters and Yaniv Plan Departent of Matheatics University of Michigan Ann Arbor, MI Eail: wootters, yplan}@uich.edu Mark A. Davenport School of Elec. &

More information

A Probabilistic and RIPless Theory of Compressed Sensing

A Probabilistic and RIPless Theory of Compressed Sensing A Probabilistic and RIPless Theory of Copressed Sensing Eanuel J Candès and Yaniv Plan 2 Departents of Matheatics and of Statistics, Stanford University, Stanford, CA 94305 2 Applied and Coputational Matheatics,

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lessons 7 20 Dec 2017 Outline Artificial Neural networks Notation...2 Introduction...3 Key Equations... 3 Artificial

More information

Lecture 21. Interior Point Methods Setup and Algorithm

Lecture 21. Interior Point Methods Setup and Algorithm Lecture 21 Interior Point Methods In 1984, Kararkar introduced a new weakly polynoial tie algorith for solving LPs [Kar84a], [Kar84b]. His algorith was theoretically faster than the ellipsoid ethod and

More information

13.2 Fully Polynomial Randomized Approximation Scheme for Permanent of Random 0-1 Matrices

13.2 Fully Polynomial Randomized Approximation Scheme for Permanent of Random 0-1 Matrices CS71 Randoness & Coputation Spring 018 Instructor: Alistair Sinclair Lecture 13: February 7 Disclaier: These notes have not been subjected to the usual scrutiny accorded to foral publications. They ay

More information

e-companion ONLY AVAILABLE IN ELECTRONIC FORM

e-companion ONLY AVAILABLE IN ELECTRONIC FORM OPERATIONS RESEARCH doi 10.1287/opre.1070.0427ec pp. ec1 ec5 e-copanion ONLY AVAILABLE IN ELECTRONIC FORM infors 07 INFORMS Electronic Copanion A Learning Approach for Interactive Marketing to a Custoer

More information

Spine Fin Efficiency A Three Sided Pyramidal Fin of Equilateral Triangular Cross-Sectional Area

Spine Fin Efficiency A Three Sided Pyramidal Fin of Equilateral Triangular Cross-Sectional Area Proceedings of the 006 WSEAS/IASME International Conference on Heat and Mass Transfer, Miai, Florida, USA, January 18-0, 006 (pp13-18) Spine Fin Efficiency A Three Sided Pyraidal Fin of Equilateral Triangular

More information

Chaotic Coupled Map Lattices

Chaotic Coupled Map Lattices Chaotic Coupled Map Lattices Author: Dustin Keys Advisors: Dr. Robert Indik, Dr. Kevin Lin 1 Introduction When a syste of chaotic aps is coupled in a way that allows the to share inforation about each

More information

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay A Low-Coplexity Congestion Control and Scheduling Algorith for Multihop Wireless Networks with Order-Optial Per-Flow Delay Po-Kai Huang, Xiaojun Lin, and Chih-Chun Wang School of Electrical and Coputer

More information

ORIE 6340: Mathematics of Data Science

ORIE 6340: Mathematics of Data Science ORIE 6340: Matheatics of Data Science Daek Davis Contents 1 Estiation in High Diensions 1 1.1 Tools for understanding high-diensional sets................. 3 1.1.1 Concentration of volue in high-diensions...............

More information

The Weierstrass Approximation Theorem

The Weierstrass Approximation Theorem 36 The Weierstrass Approxiation Theore Recall that the fundaental idea underlying the construction of the real nubers is approxiation by the sipler rational nubers. Firstly, nubers are often deterined

More information

A PROBABILISTIC AND RIPLESS THEORY OF COMPRESSED SENSING. Emmanuel J. Candès Yaniv Plan. Technical Report No November 2010

A PROBABILISTIC AND RIPLESS THEORY OF COMPRESSED SENSING. Emmanuel J. Candès Yaniv Plan. Technical Report No November 2010 A PROBABILISTIC AND RIPLESS THEORY OF COMPRESSED SENSING By Eanuel J Candès Yaniv Plan Technical Report No 200-0 Noveber 200 Departent of Statistics STANFORD UNIVERSITY Stanford, California 94305-4065

More information

Kernel Methods and Support Vector Machines

Kernel Methods and Support Vector Machines Intelligent Systes: Reasoning and Recognition Jaes L. Crowley ENSIAG 2 / osig 1 Second Seester 2012/2013 Lesson 20 2 ay 2013 Kernel ethods and Support Vector achines Contents Kernel Functions...2 Quadratic

More information

Soft Computing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis

Soft Computing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis Soft Coputing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis Beverly Rivera 1,2, Irbis Gallegos 1, and Vladik Kreinovich 2 1 Regional Cyber and Energy Security Center RCES

More information

Lecture 20 November 7, 2013

Lecture 20 November 7, 2013 CS 229r: Algoriths for Big Data Fall 2013 Prof. Jelani Nelson Lecture 20 Noveber 7, 2013 Scribe: Yun Willia Yu 1 Introduction Today we re going to go through the analysis of atrix copletion. First though,

More information

Chapter 6 1-D Continuous Groups

Chapter 6 1-D Continuous Groups Chapter 6 1-D Continuous Groups Continuous groups consist of group eleents labelled by one or ore continuous variables, say a 1, a 2,, a r, where each variable has a well- defined range. This chapter explores:

More information

Upper bound on false alarm rate for landmine detection and classification using syntactic pattern recognition

Upper bound on false alarm rate for landmine detection and classification using syntactic pattern recognition Upper bound on false alar rate for landine detection and classification using syntactic pattern recognition Ahed O. Nasif, Brian L. Mark, Kenneth J. Hintz, and Nathalia Peixoto Dept. of Electrical and

More information

3.8 Three Types of Convergence

3.8 Three Types of Convergence 3.8 Three Types of Convergence 3.8 Three Types of Convergence 93 Suppose that we are given a sequence functions {f k } k N on a set X and another function f on X. What does it ean for f k to converge to

More information

Inspection; structural health monitoring; reliability; Bayesian analysis; updating; decision analysis; value of information

Inspection; structural health monitoring; reliability; Bayesian analysis; updating; decision analysis; value of information Cite as: Straub D. (2014). Value of inforation analysis with structural reliability ethods. Structural Safety, 49: 75-86. Value of Inforation Analysis with Structural Reliability Methods Daniel Straub

More information

Multi-Scale/Multi-Resolution: Wavelet Transform

Multi-Scale/Multi-Resolution: Wavelet Transform Multi-Scale/Multi-Resolution: Wavelet Transfor Proble with Fourier Fourier analysis -- breaks down a signal into constituent sinusoids of different frequencies. A serious drawback in transforing to the

More information

A Theoretical Analysis of a Warm Start Technique

A Theoretical Analysis of a Warm Start Technique A Theoretical Analysis of a War Start Technique Martin A. Zinkevich Yahoo! Labs 701 First Avenue Sunnyvale, CA Abstract Batch gradient descent looks at every data point for every step, which is wasteful

More information

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation journal of coplexity 6, 459473 (2000) doi:0.006jco.2000.0544, available online at http:www.idealibrary.co on On the Counication Coplexity of Lipschitzian Optiization for the Coordinated Model of Coputation

More information

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science A Better Algorith For an Ancient Scheduling Proble David R. Karger Steven J. Phillips Eric Torng Departent of Coputer Science Stanford University Stanford, CA 9435-4 Abstract One of the oldest and siplest

More information

Polygonal Designs: Existence and Construction

Polygonal Designs: Existence and Construction Polygonal Designs: Existence and Construction John Hegean Departent of Matheatics, Stanford University, Stanford, CA 9405 Jeff Langford Departent of Matheatics, Drake University, Des Moines, IA 5011 G

More information

A note on the multiplication of sparse matrices

A note on the multiplication of sparse matrices Cent. Eur. J. Cop. Sci. 41) 2014 1-11 DOI: 10.2478/s13537-014-0201-x Central European Journal of Coputer Science A note on the ultiplication of sparse atrices Research Article Keivan Borna 12, Sohrab Aboozarkhani

More information

1 Bounding the Margin

1 Bounding the Margin COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #12 Scribe: Jian Min Si March 14, 2013 1 Bounding the Margin We are continuing the proof of a bound on the generalization error of AdaBoost

More information

arxiv: v1 [cs.ds] 3 Feb 2014

arxiv: v1 [cs.ds] 3 Feb 2014 arxiv:40.043v [cs.ds] 3 Feb 04 A Bound on the Expected Optiality of Rando Feasible Solutions to Cobinatorial Optiization Probles Evan A. Sultani The Johns Hopins University APL evan@sultani.co http://www.sultani.co/

More information

3.3 Variational Characterization of Singular Values

3.3 Variational Characterization of Singular Values 3.3. Variational Characterization of Singular Values 61 3.3 Variational Characterization of Singular Values Since the singular values are square roots of the eigenvalues of the Heritian atrices A A and

More information

Learnability and Stability in the General Learning Setting

Learnability and Stability in the General Learning Setting Learnability and Stability in the General Learning Setting Shai Shalev-Shwartz TTI-Chicago shai@tti-c.org Ohad Shair The Hebrew University ohadsh@cs.huji.ac.il Nathan Srebro TTI-Chicago nati@uchicago.edu

More information

COS 424: Interacting with Data. Written Exercises

COS 424: Interacting with Data. Written Exercises COS 424: Interacting with Data Hoework #4 Spring 2007 Regression Due: Wednesday, April 18 Written Exercises See the course website for iportant inforation about collaboration and late policies, as well

More information

RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION

RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION GUANGHUI LAN AND YI ZHOU Abstract. In this paper, we consider a class of finite-su convex optiization probles defined over a distributed

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2016 Lessons 7 14 Dec 2016 Outline Artificial Neural networks Notation...2 1. Introduction...3... 3 The Artificial

More information

Support recovery in compressed sensing: An estimation theoretic approach

Support recovery in compressed sensing: An estimation theoretic approach Support recovery in copressed sensing: An estiation theoretic approach Ain Karbasi, Ali Horati, Soheil Mohajer, Martin Vetterli School of Coputer and Counication Sciences École Polytechnique Fédérale de

More information

Bayes Decision Rule and Naïve Bayes Classifier

Bayes Decision Rule and Naïve Bayes Classifier Bayes Decision Rule and Naïve Bayes Classifier Le Song Machine Learning I CSE 6740, Fall 2013 Gaussian Mixture odel A density odel p(x) ay be ulti-odal: odel it as a ixture of uni-odal distributions (e.g.

More information

Exact tensor completion with sum-of-squares

Exact tensor completion with sum-of-squares Proceedings of Machine Learning Research vol 65:1 54, 2017 30th Annual Conference on Learning Theory Exact tensor copletion with su-of-squares Aaron Potechin Institute for Advanced Study, Princeton David

More information

Boosting with log-loss

Boosting with log-loss Boosting with log-loss Marco Cusuano-Towner Septeber 2, 202 The proble Suppose we have data exaples {x i, y i ) i =... } for a two-class proble with y i {, }. Let F x) be the predictor function with the

More information

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical IEEE TRANSACTIONS ON INFORMATION THEORY Large Alphabet Source Coding using Independent Coponent Analysis Aichai Painsky, Meber, IEEE, Saharon Rosset and Meir Feder, Fellow, IEEE arxiv:67.7v [cs.it] Jul

More information

Ştefan ŞTEFĂNESCU * is the minimum global value for the function h (x)

Ştefan ŞTEFĂNESCU * is the minimum global value for the function h (x) 7Applying Nelder Mead s Optiization Algorith APPLYING NELDER MEAD S OPTIMIZATION ALGORITHM FOR MULTIPLE GLOBAL MINIMA Abstract Ştefan ŞTEFĂNESCU * The iterative deterinistic optiization ethod could not

More information

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lesson 1 4 October 2017 Outline Learning and Evaluation for Pattern Recognition Notation...2 1. The Pattern Recognition

More information

ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS. A Thesis. Presented to. The Faculty of the Department of Mathematics

ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS. A Thesis. Presented to. The Faculty of the Department of Mathematics ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS A Thesis Presented to The Faculty of the Departent of Matheatics San Jose State University In Partial Fulfillent of the Requireents

More information

Department of Electronic and Optical Engineering, Ordnance Engineering College, Shijiazhuang, , China

Department of Electronic and Optical Engineering, Ordnance Engineering College, Shijiazhuang, , China 6th International Conference on Machinery, Materials, Environent, Biotechnology and Coputer (MMEBC 06) Solving Multi-Sensor Multi-Target Assignent Proble Based on Copositive Cobat Efficiency and QPSO Algorith

More information

A NEW ROBUST AND EFFICIENT ESTIMATOR FOR ILL-CONDITIONED LINEAR INVERSE PROBLEMS WITH OUTLIERS

A NEW ROBUST AND EFFICIENT ESTIMATOR FOR ILL-CONDITIONED LINEAR INVERSE PROBLEMS WITH OUTLIERS A NEW ROBUST AND EFFICIENT ESTIMATOR FOR ILL-CONDITIONED LINEAR INVERSE PROBLEMS WITH OUTLIERS Marta Martinez-Caara 1, Michael Mua 2, Abdelhak M. Zoubir 2, Martin Vetterli 1 1 School of Coputer and Counication

More information

Probability Distributions

Probability Distributions Probability Distributions In Chapter, we ephasized the central role played by probability theory in the solution of pattern recognition probles. We turn now to an exploration of soe particular exaples

More information

Some Perspective. Forces and Newton s Laws

Some Perspective. Forces and Newton s Laws Soe Perspective The language of Kineatics provides us with an efficient ethod for describing the otion of aterial objects, and we ll continue to ake refineents to it as we introduce additional types of

More information

NUMERICAL MODELLING OF THE TYRE/ROAD CONTACT

NUMERICAL MODELLING OF THE TYRE/ROAD CONTACT NUMERICAL MODELLING OF THE TYRE/ROAD CONTACT PACS REFERENCE: 43.5.LJ Krister Larsson Departent of Applied Acoustics Chalers University of Technology SE-412 96 Sweden Tel: +46 ()31 772 22 Fax: +46 ()31

More information

Accuracy of the Scaling Law for Experimental Natural Frequencies of Rectangular Thin Plates

Accuracy of the Scaling Law for Experimental Natural Frequencies of Rectangular Thin Plates The 9th Conference of Mechanical Engineering Network of Thailand 9- October 005, Phuket, Thailand Accuracy of the caling Law for Experiental Natural Frequencies of Rectangular Thin Plates Anawat Na songkhla

More information

SPECTRUM sensing is a core concept of cognitive radio

SPECTRUM sensing is a core concept of cognitive radio World Acadey of Science, Engineering and Technology International Journal of Electronics and Counication Engineering Vol:6, o:2, 202 Efficient Detection Using Sequential Probability Ratio Test in Mobile

More information

The Methods of Solution for Constrained Nonlinear Programming

The Methods of Solution for Constrained Nonlinear Programming Research Inventy: International Journal Of Engineering And Science Vol.4, Issue 3(March 2014), PP 01-06 Issn (e): 2278-4721, Issn (p):2319-6483, www.researchinventy.co The Methods of Solution for Constrained

More information

The Simplex Method is Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate

The Simplex Method is Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate The Siplex Method is Strongly Polynoial for the Markov Decision Proble with a Fixed Discount Rate Yinyu Ye April 20, 2010 Abstract In this note we prove that the classic siplex ethod with the ost-negativereduced-cost

More information

Research Article Robust ε-support Vector Regression

Research Article Robust ε-support Vector Regression Matheatical Probles in Engineering, Article ID 373571, 5 pages http://dx.doi.org/10.1155/2014/373571 Research Article Robust ε-support Vector Regression Yuan Lv and Zhong Gan School of Mechanical Engineering,

More information

Extension of CSRSM for the Parametric Study of the Face Stability of Pressurized Tunnels

Extension of CSRSM for the Parametric Study of the Face Stability of Pressurized Tunnels Extension of CSRSM for the Paraetric Study of the Face Stability of Pressurized Tunnels Guilhe Mollon 1, Daniel Dias 2, and Abdul-Haid Soubra 3, M.ASCE 1 LGCIE, INSA Lyon, Université de Lyon, Doaine scientifique

More information

Convexity-Based Optimization for Power-Delay Tradeoff using Transistor Sizing

Convexity-Based Optimization for Power-Delay Tradeoff using Transistor Sizing Convexity-Based Optiization for Power-Delay Tradeoff using Transistor Sizing Mahesh Ketkar, and Sachin S. Sapatnekar Departent of Electrical and Coputer Engineering University of Minnesota, Minneapolis,

More information

1 Rademacher Complexity Bounds

1 Rademacher Complexity Bounds COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #10 Scribe: Max Goer March 07, 2013 1 Radeacher Coplexity Bounds Recall the following theore fro last lecture: Theore 1. With probability

More information

New upper bound for the B-spline basis condition number II. K. Scherer. Institut fur Angewandte Mathematik, Universitat Bonn, Bonn, Germany.

New upper bound for the B-spline basis condition number II. K. Scherer. Institut fur Angewandte Mathematik, Universitat Bonn, Bonn, Germany. New upper bound for the B-spline basis condition nuber II. A proof of de Boor's 2 -conjecture K. Scherer Institut fur Angewandte Matheati, Universitat Bonn, 535 Bonn, Gerany and A. Yu. Shadrin Coputing

More information

Optical Properties of Plasmas of High-Z Elements

Optical Properties of Plasmas of High-Z Elements Forschungszentru Karlsruhe Techni und Uwelt Wissenschaftlishe Berichte FZK Optical Properties of Plasas of High-Z Eleents V.Tolach 1, G.Miloshevsy 1, H.Würz Project Kernfusion 1 Heat and Mass Transfer

More information

EMPIRICAL COMPLEXITY ANALYSIS OF A MILP-APPROACH FOR OPTIMIZATION OF HYBRID SYSTEMS

EMPIRICAL COMPLEXITY ANALYSIS OF A MILP-APPROACH FOR OPTIMIZATION OF HYBRID SYSTEMS EMPIRICAL COMPLEXITY ANALYSIS OF A MILP-APPROACH FOR OPTIMIZATION OF HYBRID SYSTEMS Jochen Till, Sebastian Engell, Sebastian Panek, and Olaf Stursberg Process Control Lab (CT-AST), University of Dortund,

More information

This model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t.

This model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t. CS 493: Algoriths for Massive Data Sets Feb 2, 2002 Local Models, Bloo Filter Scribe: Qin Lv Local Models In global odels, every inverted file entry is copressed with the sae odel. This work wells when

More information

The proofs of Theorem 1-3 are along the lines of Wied and Galeano (2013).

The proofs of Theorem 1-3 are along the lines of Wied and Galeano (2013). A Appendix: Proofs The proofs of Theore 1-3 are along the lines of Wied and Galeano (2013) Proof of Theore 1 Let D[d 1, d 2 ] be the space of càdlàg functions on the interval [d 1, d 2 ] equipped with

More information

Multi-Dimensional Hegselmann-Krause Dynamics

Multi-Dimensional Hegselmann-Krause Dynamics Multi-Diensional Hegselann-Krause Dynaics A. Nedić Industrial and Enterprise Systes Engineering Dept. University of Illinois Urbana, IL 680 angelia@illinois.edu B. Touri Coordinated Science Laboratory

More information

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians Using EM To Estiate A Probablity Density With A Mixture Of Gaussians Aaron A. D Souza adsouza@usc.edu Introduction The proble we are trying to address in this note is siple. Given a set of data points

More information

Fairness via priority scheduling

Fairness via priority scheduling Fairness via priority scheduling Veeraruna Kavitha, N Heachandra and Debayan Das IEOR, IIT Bobay, Mubai, 400076, India vavitha,nh,debayan}@iitbacin Abstract In the context of ulti-agent resource allocation

More information

Curious Bounds for Floor Function Sums

Curious Bounds for Floor Function Sums 1 47 6 11 Journal of Integer Sequences, Vol. 1 (018), Article 18.1.8 Curious Bounds for Floor Function Sus Thotsaporn Thanatipanonda and Elaine Wong 1 Science Division Mahidol University International

More information

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs BULLETIN OF THE POLISH ACADEMY OF SCIENCES TECHNICAL SCIENCES Vol. 57, No. 3, 2009 Algoriths for parallel processor scheduling with distinct due windows and unit-tie obs A. JANIAK 1, W.A. JANIAK 2, and

More information

Ph 20.3 Numerical Solution of Ordinary Differential Equations

Ph 20.3 Numerical Solution of Ordinary Differential Equations Ph 20.3 Nuerical Solution of Ordinary Differential Equations Due: Week 5 -v20170314- This Assignent So far, your assignents have tried to failiarize you with the hardware and software in the Physics Coputing

More information

Error Exponents in Asynchronous Communication

Error Exponents in Asynchronous Communication IEEE International Syposiu on Inforation Theory Proceedings Error Exponents in Asynchronous Counication Da Wang EECS Dept., MIT Cabridge, MA, USA Eail: dawang@it.edu Venkat Chandar Lincoln Laboratory,

More information

IN modern society that various systems have become more

IN modern society that various systems have become more Developent of Reliability Function in -Coponent Standby Redundant Syste with Priority Based on Maxiu Entropy Principle Ryosuke Hirata, Ikuo Arizono, Ryosuke Toohiro, Satoshi Oigawa, and Yasuhiko Takeoto

More information

Support Vector Machines. Machine Learning Series Jerry Jeychandra Blohm Lab

Support Vector Machines. Machine Learning Series Jerry Jeychandra Blohm Lab Support Vector Machines Machine Learning Series Jerry Jeychandra Bloh Lab Outline Main goal: To understand how support vector achines (SVMs) perfor optial classification for labelled data sets, also a

More information

1 Generalization bounds based on Rademacher complexity

1 Generalization bounds based on Rademacher complexity COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #0 Scribe: Suqi Liu March 07, 08 Last tie we started proving this very general result about how quickly the epirical average converges

More information

Composite optimization for robust blind deconvolution

Composite optimization for robust blind deconvolution Coposite optiization for robust blind deconvolution Vasileios Charisopoulos Daek Davis Mateo Díaz Ditriy Drusvyatskiy Abstract The blind deconvolution proble seeks to recover a pair of vectors fro a set

More information

On Conditions for Linearity of Optimal Estimation

On Conditions for Linearity of Optimal Estimation On Conditions for Linearity of Optial Estiation Erah Akyol, Kuar Viswanatha and Kenneth Rose {eakyol, kuar, rose}@ece.ucsb.edu Departent of Electrical and Coputer Engineering University of California at

More information

Symbolic Analysis as Universal Tool for Deriving Properties of Non-linear Algorithms Case study of EM Algorithm

Symbolic Analysis as Universal Tool for Deriving Properties of Non-linear Algorithms Case study of EM Algorithm Acta Polytechnica Hungarica Vol., No., 04 Sybolic Analysis as Universal Tool for Deriving Properties of Non-linear Algoriths Case study of EM Algorith Vladiir Mladenović, Miroslav Lutovac, Dana Porrat

More information