Minimax rate of convergence and the performance of ERM in phase recovery


Minimax rate of convergence and the performance of ERM in phase recovery

Guillaume Lecué (1,3)    Shahar Mendelson (2,4,5)

November 2013

Abstract

We study the performance of Empirical Risk Minimization in noisy phase retrieval problems, indexed by subsets of R^n and relative to subgaussian sampling; that is, when the given data is y_i = ⟨a_i, x_0⟩² + w_i for a subgaussian random vector a, independent noise w and a fixed but unknown x_0 that belongs to a given subset of R^n. We show that ERM produces x̂ whose Euclidean distance to either x_0 or −x_0 depends on the gaussian mean-width of the indexing set and on the signal-to-noise ratio of the problem. The bound coincides with the one for linear regression when ‖x_0‖_2 is of the order of a constant. In addition, we obtain a minimax lower bound for the problem and identify sets for which ERM is a minimax procedure. As examples, we study the class of d-sparse vectors in R^n and the unit ball in ℓ_1^n.

1 Introduction

Phase retrieval has attracted much attention recently, as it has natural applications in areas that include X-ray crystallography, transmission electron microscopy and coherent diffractive imaging (see, for example, the discussion in [2] and references therein). In phase retrieval, one attempts to identify a vector x_0 that belongs to an arbitrary set T ⊂ R^n using noisy, quadratic measurements of x_0. The

(1) CNRS, CMAP, Ecole Polytechnique, Palaiseau, France.
(2) Department of Mathematics, Technion, I.I.T, Haifa 32000, Israel.
(3) guillaume.lecue@cmap.polytechnique.fr
(4) shahar@tx.technion.ac.il
(5) Supported by the Mathematical Sciences Institute, The Australian National University.

given data is a sample of cardinality N, (a_i, y_i)_{i=1}^N, for vectors a_i ∈ R^n and

y_i = ⟨a_i, x_0⟩² + w_i,    (1.1)

for a noise vector (w_i)_{i=1}^N.

Our aim is to investigate phase retrieval from a theoretical point of view, relative to a well behaved, random sampling method. To formulate the problem explicitly, let µ be an isotropic, L-subgaussian measure on R^n and set a to be a random vector distributed according to µ. Thus, for every x ∈ R^n, E⟨x, a⟩² = ‖x‖_2² (isotropicity), and for every u ≥ 1, Pr(|⟨x, a⟩| ≥ Lu‖⟨x, a⟩‖_{L_2}) ≤ 2 exp(−u²/2) (L-subgaussian).

Given a set T ⊂ R^n and a fixed, but unknown x_0 ∈ T, the y_i are the random noisy measurements of x_0: for a sample size N, (a_i)_{i=1}^N are independent copies of a and (w_i)_{i=1}^N are independent copies of a mean-zero variable w that are also independent of (a_i)_{i=1}^N.

Clearly, due to the nature of the given measurements, x_0 and −x_0 are indistinguishable, and the best that one can hope for is a procedure that produces x̂ ∈ T that is close to one of the two points. The goal here is to find such a procedure and identify the way in which the distance between x̂ and either x_0 or −x_0 depends on the structure of T, the measure µ and the noise.

The procedure studied here is empirical risk minimization (ERM), which produces x̂ that minimizes the empirical risk in T:

P_N l_x = (1/N) Σ_{i=1}^N (⟨a_i, x⟩² − y_i)².

The loss is the standard squared loss functional, which, in this case, satisfies

l_x(a, y) = (f_x(a) − y)² = (⟨x, a⟩² − ⟨x_0, a⟩² − w)² = (⟨x − x_0, a⟩⟨x + x_0, a⟩ − w)².

Comparing the empirical and actual structures on T is a vital component in the analysis of ERM. In phase recovery, the centered empirical process that is at the heart of this approach is defined for any x ∈ T by

P_N(l_x − l_{x_0}) = (1/N) Σ_{i=1}^N ⟨x − x_0, a_i⟩²⟨x + x_0, a_i⟩² − (2/N) Σ_{i=1}^N w_i ⟨x − x_0, a_i⟩⟨x + x_0, a_i⟩.

Both the first and second components are difficult to handle directly, even when the underlying measure is subgaussian, because of the powers involved (an effective power of 4 in the first component and of 3 in the second one).
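To make the setup concrete, here is a minimal numerical sketch (my own illustration, not taken from the paper) of ERM for noisy phase retrieval over d-sparse vectors: gaussian measurement vectors stand in for the subgaussian a_i, and since exact ERM is a nonconvex problem, a thresholded gradient heuristic with a spectral initialization is used as a stand-in minimizer. All names, step sizes and constants below are illustrative assumptions.

```python
import numpy as np

def erm_phase_retrieval_sparse(A, y, d, steps=500):
    """Heuristic minimizer of the empirical squared risk
    (1/N) * sum_i (<a_i, x>^2 - y_i)^2 over d-sparse vectors.
    Exact ERM is nonconvex; this projected-gradient sketch is only illustrative."""
    N, n = A.shape
    # spectral-type initialization: top eigenvector of (1/N) sum_i y_i a_i a_i^T
    M = (A * y[:, None]).T @ A / N
    _, V = np.linalg.eigh(M)
    x = V[:, -1] * np.sqrt(max(np.mean(y), 0.0))
    lr = 0.025 / max(np.mean(y), 1e-12)           # conservative step, scaled by the signal energy
    for _ in range(steps):
        r = (A @ x) ** 2 - y                       # residuals of the quadratic measurements
        grad = 4.0 / N * A.T @ (r * (A @ x))       # gradient of the empirical squared risk
        x = x - lr * grad
        x[np.argsort(np.abs(x))[:-d]] = 0.0        # projection: keep the d largest coordinates
    return x

rng = np.random.default_rng(0)
n, d, N, sigma = 200, 5, 800, 0.1
x0 = np.zeros(n); x0[:d] = rng.normal(size=d)
A = rng.normal(size=(N, n))                        # isotropic subgaussian (here: gaussian) a_i
y = (A @ x0) ** 2 + sigma * rng.normal(size=N)     # y_i = <a_i, x_0>^2 + w_i
xhat = erm_phase_retrieval_sparse(A, y, d)
print(min(np.linalg.norm(xhat - x0), np.linalg.norm(xhat + x0)))
```

Note that the procedure can only hope to approach x_0 or −x_0, which is why the reported error is the minimum of the two distances.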

Therefore, rather than trying to employ the concentration of empirical means around the actual ones, which might not be sufficiently strong in this case, one uses a combination of a small-ball estimate for the high order part of the process, and a more standard deviation argument for the low-order component (see Section 3 and the formulation of Theorem A and Theorem B).

We assume that linear forms satisfy a certain small-ball estimate, and in particular, do not assign too much weight to small neighbourhoods of 0.

Assumption 1.1. There is a constant κ_0 > 0 satisfying that for every s, t ∈ R^n,

E|⟨a, s⟩⟨a, t⟩| ≥ κ_0 ‖s‖_2 ‖t‖_2.

Assumption 1.1 is not very restrictive and holds for many natural choices of random vectors in R^n, like the gaussian measure or any isotropic log-concave measure on R^n (see, for example, the discussion in [2]).

It is not surprising that the error rate of ERM depends on the structure of T, and because of the subgaussian nature of the random measurement vector a, the natural parameter that captures the complexity of T is the gaussian mean-width associated with normalizations of T.

Definition 1.1. Let G = (g_1, ..., g_n) be the standard gaussian vector in R^n. For T ⊂ R^n, set

ℓ(T) = E sup_{t ∈ T} Σ_{i=1}^n g_i t_i.

The normalized sets in question are

T_{−,R} = { (t − s)/‖t − s‖_2 : t, s ∈ T, R < ‖t − s‖_2 ‖t + s‖_2 },
T_{+,R} = { (t + s)/‖t + s‖_2 : t, s ∈ T, R < ‖t − s‖_2 ‖t + s‖_2 },

which have been used in [2], or their local versions,

T_{−,R}(x_0) = { (t − x_0)/‖t − x_0‖_2 : t ∈ T, R < ‖t − x_0‖_2 ‖t + x_0‖_2 },
T_{+,R}(x_0) = { (t + x_0)/‖t + x_0‖_2 : t ∈ T, R < ‖t − x_0‖_2 ‖t + x_0‖_2 }.
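The gaussian mean-width ℓ(T) of Definition 1.1 is easy to estimate by Monte Carlo: draw standard gaussian vectors G and average sup_{t∈T} ⟨G, t⟩. The sketch below is my own illustration for the set of d-sparse unit vectors, where the supremum has a closed form (the Euclidean norm of the d largest coordinates of |G|); the comparison with √(d log(en/d)) is only an order-of-magnitude check, since that estimate holds up to an absolute constant.

```python
import numpy as np

def mean_width_sparse_sphere(n, d, trials=2000, seed=0):
    """Monte Carlo estimate of l(T) = E sup_{t in T} <G, t> for
    T = {d-sparse vectors of Euclidean norm 1}: the supremum over T equals
    the l2-norm of the d largest (in absolute value) coordinates of G."""
    rng = np.random.default_rng(seed)
    G = np.abs(rng.normal(size=(trials, n)))
    topd = np.sort(G, axis=1)[:, -d:]            # d largest coordinates of |G|
    return np.mean(np.linalg.norm(topd, axis=1))

n, d = 10_000, 20
est = mean_width_sparse_sphere(n, d)
print(est, np.sqrt(d * np.log(np.e * n / d)))    # same order as sqrt(d log(en/d)), up to an absolute constant
```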

The sets in question play a central role in the exclusion argument that is used in the analysis of ERM. Setting L_x = l_x − l_{x_0}, the excess loss function associated with l and x ∈ T, it is evident that P_N L_{x̂} ≤ 0 (because L_{x_0} = 0 is a possible competitor). If one can find an event of large probability and R > 0 for which P_N L_x > 0 whenever ‖x − x_0‖_2 ‖x + x_0‖_2 ≥ R, then on that event, ‖x̂ − x_0‖_2 ‖x̂ + x_0‖_2 ≤ R, which is the estimate one is looking for.

The normalization allows one to study relative fluctuations of P_N L_x, in particular, the way these fluctuations scale with ‖x − x_0‖_2 ‖x + x_0‖_2. This is achieved by considering empirical means of products of functions ⟨u, ·⟩⟨v, ·⟩, for u ∈ T_{+,R}(x_0) and v ∈ T_{−,R}(x_0).

The obvious problem with the local sets T_{+,R}(x_0) and T_{−,R}(x_0) is that x_0 is not known. As a first attempt at bypassing this problem, one may use the global sets T_{+,R} and T_{−,R} instead, as had been done in [2]. Unfortunately, this global approach is not completely satisfactory. Roughly put, there are two types of subsets of R^n one is interested in, and that appear in applications.

The first type consists of sets for which the local complexity is essentially the same everywhere, and the sets T_{+,R}, T_{−,R} are not very different from the seemingly smaller T_{+,R}(x_0), T_{−,R}(x_0), regardless of x_0. A typical example of such a set is the set of d-sparse vectors, consisting of all the vectors in R^n that are supported on at most d coordinates. For every x_0 ∈ T and R > 0, the sets T_{+,R}(x_0), T_{−,R}(x_0), and T_{+,R}, T_{−,R} are contained in the subset of the sphere consisting of 2d-sparse vectors, which is a relatively small set. For this kind of set, the global approach, using T_{+,R} and T_{−,R}, suffices, and the choice of the target x_0 does not really influence the rate at which ‖x̂ − x_0‖_2 ‖x̂ + x_0‖_2 decays to 0 with N.

In contrast, sets of the second type one would like to study have vastly changing local complexity, with the typical example being a convex, centrally symmetric set (i.e., if x ∈ T then −x ∈ T). Consider, for example, the case T = B_1^n, the unit ball in ℓ_1^n. It is not surprising that for small R, the sets T_{+,R}(0) and T_{−,R}(0) are very different from T_{−,R}(e_1) and T_{+,R}(e_1): the ones associated with the centre 0 are the entire sphere, while for e_1 = (1, 0, ..., 0), T_{+,R}(e_1) and T_{−,R}(e_1) consist of vectors that are well approximated by sparse vectors (whose support depends on R), and thus are rather small subsets of the sphere.

The situation that one encounters in B_1^n is generic for convex, centrally-symmetric sets. The sets become locally richer the closer the centre is to 0, and at 0, for small enough R, T_{+,R}(0) and T_{−,R}(0) are the entire sphere. Since the sets T_{+,R} and T_{−,R} are blind to the location of the centre, and are, in fact, the union over all possible centres of the local sets, they are simply too big to be used in the analysis of ERM in convex sets. A correct estimate on the performance of ERM for such sets requires a more delicate

local analysis and additional information on x_0. Moreover, the rate of convergence of ERM truly depends on x_0 in the phase recovery problem, via the signal-to-noise ratio ‖x_0‖_2/σ.

We begin by formulating our results using the global sets T_{+,R} and T_{−,R}. Let T_+ = T_{+,0} and T_− = T_{−,0}, set

E_R = max{ℓ(T_{+,R}), ℓ(T_{−,R})},    E = max{ℓ(T_+), ℓ(T_−)},

and observe that, as nonempty subsets of the sphere, ℓ(T_{−,R}), ℓ(T_{+,R}) ≥ E|g| = √(2/π).

The first result presented here is that the error rate of ERM for the phase retrieval problem in T depends on the fixed points

r_N(γ) = inf{ r > 0 : E_r ≤ γ r √N }    and    r_0(Q) = inf{ r > 0 : E_r ≤ Q √N },

for constants γ and Q that will be specified later.

Recall that the ψ_2-norm of a random variable w is defined by ‖w‖_{ψ_2} = inf{c > 0 : E exp(w²/c²) ≤ 2}, and set σ = ‖w‖_{ψ_2}.

Theorem A. For every L > 1, κ_0 > 0 and β > 1, there exist constants c_0, c_1 and c_2 that depend only on L, κ_0 and β for which the following holds. Let a and w be as above and assume that w has a finite ψ_2-norm. If l is the squared loss and x̂ is produced by ERM, then with probability at least

1 − 2 exp(−c_0 min{ℓ²(T_{+,r}), ℓ²(T_{−,r})}) − N^{−β+1},

one has

‖x̂ − x_0‖_2 ‖x̂ + x_0‖_2 ≤ r := max{ r_0(c_1), r_N(c_2/(σ √(log N))) }.

When ‖w‖_∞ < ∞, the term σ√(log N) may be replaced by ‖w‖_∞.

The upper estimate of max{r_0, r_N} in Theorem A represents two ranges of noise. It follows from the definition of the fixed points that r_0 is dominant if σ ≲ r_0/√(log N). As explained in [7] for linear regression, r_0 captures the difficulty of recovery in the noise-free case, when the only reason for errors is that there are several far-away functions in the class that coincide with the target on the noiseless data. When the noise level σ surpasses that threshold, errors occur because of the interaction class members have with the noise, and the dominating term becomes r_N.

Of course, there are cases in which r_0 = 0 for N sufficiently large. This is precisely when exact recovery is possible in the noise-free environment. And, in such cases, the error of ERM tends to zero with σ. The behavior of ERM in the noise-free case is one of the distinguishing features of sets with well behaved global complexity, that is, when E is not too large.

Since E_R ≤ E for every R > 0, it follows that when N ≳ E², r_0 = 0, and that r_N(γ) ≤ E/(γ√N). Therefore, on the event from Theorem A,

‖x̂ − x_0‖_2 ‖x̂ + x_0‖_2 ≲ σ √(log N) · E/√N.

This estimate suffices for many applications. For example, when T is the set of d-sparse vectors, one may show (see, e.g., [2]) that E ≲ √(d log(en/d)). Hence, by Theorem A, when N ≳ d log(en/d), with high probability,

‖x̂ − x_0‖_2 ‖x̂ + x_0‖_2 ≲ σ √(d log(en/d)/N) · √(log N).

The proof of this observation regarding d-sparse vectors, and that this estimate is sharp in the minimax sense (up to the logarithmic term), may be found in Section 6.

One should note that Theorem A improves the main result from [2] in three ways. First of all, the error rate (the estimate on ‖x̂ − x_0‖_2 ‖x̂ + x_0‖_2) established in Theorem A is of order E/√N (up to logarithmic factors), whereas in [2], it scaled like c/N^{1/4} for very large values of N. Second, the error rate scales linearly in the noise level σ in Theorem A. On the other hand, the rate obtained in [2] does not decay with σ for σ ≤ 1. Finally, the probability estimate has been improved, though it is still likely to be suboptimal.

Although the main motivation for [2] was dealing with phase retrieval for sparse classes, for which Theorem A is well suited, we next turn to the question of more general classes, the most important example of which is a convex, centrally-symmetric class. For such a class, the global localization is simply too big to yield a good bound.

Definition 1.2. Let

r*_N(Q) = inf{ r > 0 : ℓ(T ∩ rB_2^n) ≤ Q r √N },

s*_N(η) = inf{ s > 0 : ℓ(T ∩ sB_2^n) ≤ η s² √N },

and

v*_N(ζ) = inf{ v > 0 : ℓ(T ∩ vB_2^n) ≤ ζ v³ √N }.

The parameters r*_N and s*_N have been used in [7] to obtain a sharp estimate on the performance of ERM for linear regression in an arbitrary convex set, and relative to L-subgaussian measurements. This result is formulated below in a restricted context, analogous to the phase retrieval setup: a linear model z = ⟨a, x_0⟩ + w, for an isotropic, L-subgaussian vector a, independent noise w and x_0 ∈ T. Let x̂ be the output of ERM using the data (a_i, z_i)_{i=1}^N, and set ‖w‖_{ψ_2} = σ.

Theorem 1.3. For every L there exist constants c_1, c_2, c_3 and c_4 that depend only on L for which the following holds. Let T ⊂ R^n be a convex set, put η = c_1/σ and set Q = c_2.

1. If σ ≥ c_3 r*_N(Q), then with probability at least 1 − 4 exp(−c_4 N η² (s*_N(η))²), ‖x̂ − x_0‖_2 ≤ s*_N(η).

2. If σ ≤ c_3 r*_N(Q), then with probability at least 1 − 4 exp(−c_4 N Q²), ‖x̂ − x_0‖_2 ≤ r*_N(Q).

Our main result is a phase retrieval version of Theorem 1.3.

Theorem B. For every L, κ_0 > 0 and β, there exist constants c_0, c_1, c_2, c_3, c_4, c_5 and Q that depend only on L, κ_0 and β for which the following holds. Let T ⊂ R^n be a convex, centrally-symmetric set, and let a and w be as in Theorem A. Assume that σ/‖x_0‖_2 ≥ c_0 r*_N(Q)/√(log N), set η = c_1 ‖x_0‖_2/(σ √(log N)) and let ζ = c_2/(σ √(log N)).

1. If ‖x_0‖_2 ≥ v*_N(ζ), then with probability at least 1 − 2 exp(−c_3 N η² (s*_N(η))²) − N^{−β+1},

min{ ‖x̂ − x_0‖_2, ‖x̂ + x_0‖_2 } ≤ c_4 s*_N(η).

2. If ‖x_0‖_2 ≤ v*_N(ζ), then with probability at least 1 − 2 exp(−c_3 N ζ² (v*_N(ζ))⁴) − N^{−β+1},

max{ ‖x̂ − x_0‖_2, ‖x̂ + x_0‖_2 } ≤ c_4 v*_N(ζ).

If σ/‖x_0‖_2 ≤ c_0 r*_N(Q)/√(log N), the same assertions as in 1. and 2. hold, with an upper bound of r*_N(Q) replacing s*_N(η) and v*_N(ζ).
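The fixed points r*_N, s*_N and v*_N are defined implicitly, but given a model of the localized mean-width r ↦ ℓ(T ∩ rB_2^n) they are easy to compute by bisection, since ℓ(T ∩ rB_2^n)/r^k is non-increasing in r for star-shaped T. The sketch below is my own illustration: ell_B1_cap encodes the standard order-of-magnitude estimate of ℓ(B_1^n ∩ rB_2^n) used in Section 6 with constants ignored, and fixed_point is a generic bisection solver; the sample values of Q, η and ζ are arbitrary.

```python
import numpy as np

def ell_B1_cap(r, n):
    """Model of l(B_1^n ∩ r B_2^n), up to absolute constants:
    ~ r*sqrt(n) when r <= 1/sqrt(n), ~ sqrt(log(e n r^2)) otherwise, capped at r = 1."""
    r = min(r, 1.0)
    return r * np.sqrt(n) if r <= 1.0 / np.sqrt(n) else np.sqrt(np.log(np.e * n * r ** 2))

def fixed_point(ell, N, power, coeff, r_hi=10.0, iters=80):
    """Bisection for inf{ r > 0 : ell(r) <= coeff * r**power * sqrt(N) }.
    Relies on ell(r)/r**power being non-increasing in r (true for star-shaped T)."""
    lo, hi = 0.0, r_hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if ell(mid) <= coeff * mid ** power * np.sqrt(N):
            hi = mid
        else:
            lo = mid
    return hi

n, N, sigma, norm_x0 = 10_000, 2_000, 0.5, 1.0
ell = lambda r: ell_B1_cap(r, n)
Q = 1.0
eta = norm_x0 / (sigma * np.sqrt(np.log(N)))
zeta = 1.0 / (sigma * np.sqrt(np.log(N)))
print("r*_N(Q)    =", fixed_point(ell, N, 1, Q))
print("s*_N(eta)  =", fixed_point(ell, N, 2, eta))
print("v*_N(zeta) =", fixed_point(ell, N, 3, zeta))
```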

Theorem B follows from Theorem A and a more transparent description of the localized sets T_{−,R}(x_0) and T_{+,R}(x_0) (see Lemma 4.1).

To put Theorem B in some perspective, observe that v*_N tends to zero. Indeed, since ℓ(T ∩ rB_2^n) ≤ ℓ(T), it follows that v*_N(ζ) ≤ (ℓ(T)/(ζ√N))^{1/3}. Hence, for the choice of ζ ~ 1/(σ√(log N)) as in Theorem B,

v*_N ≲ ( σ ℓ(T) √(log N) / √N )^{1/3},

which tends to zero when σ → 0 and when N → ∞. Therefore, if x_0 ≠ 0, the first part of Theorem B describes the long term behaviour of ERM. Also, and using the same argument, r*_N(Q) ≤ ℓ(T)/(Q√N). Thus, for every σ > 0 the problem becomes a high-noise one, in the sense that the condition σ/‖x_0‖_2 ≥ c_0 r*_N(Q)/√(log N) is satisfied when N is large enough.

In the typical situation, which is both high noise and large ‖x_0‖_2, the error rate depends on η = c_1 ‖x_0‖_2/(σ√(log N)). We believe that the 1/√(log N) factor is an artifact of the proof, but the other term, ‖x_0‖_2/σ, is the signal-to-noise ratio, and is rather natural.

Although Theorem A and Theorem B clearly improve the results from [2], it is natural to ask whether these are optimal in a more general sense. The final result presented here is that Theorem B is close to being optimal in the minimax sense. The formulation and proof of the minimax lower bound is presented in Section 5.

Finally, we end the article with two examples of classes that are of interest in phase retrieval: d-sparse vectors and the unit ball in ℓ_1^n. The first is a class with a fixed local complexity, and the second has a growing local complexity.

2 Preliminaries

Throughout this article, absolute constants are denoted by C, c, c_1, ... etc. Their value may change from line to line. The fact that there are absolute constants c, C for which ca ≤ b ≤ Ca is denoted by a ≍ b; a ≲ b means that a ≤ cb, while a ≲_L b means that the constants depend only on the parameter L.

For 1 ≤ p < ∞, let ‖·‖_p be the ℓ_p^n norm on R^n, and for a function f (or a random variable X) on a probability space, set ‖f‖_{L_p} to be its L_p norm. Other norms that play a significant role here are the Orlicz norms. For basic facts on these norms we refer the reader to [9, 5]. Recall that for α ≥ 1,

‖f‖_{ψ_α} = inf{ c > 0 : E exp(|f|^α/c^α) ≤ 2 },

and it is straightforward to extend the definition for 0 < α < 1.

Orlicz norms measure the rate of decay of a function. One may verify that ‖f‖_{ψ_α} ≍ sup_{p ≥ 1} ‖f‖_{L_p}/p^{1/α}. Moreover, for t ≥ 1, Pr(|f| ≥ t) ≤ 2 exp(−c t^α/‖f‖_{ψ_α}^α), and ‖f‖_{ψ_α} is equivalent to the smallest constant κ for which Pr(|f| ≥ t) ≤ 2 exp(−t^α/κ^α) for every t ≥ 1.

Definition 2.1. A random variable X is L-subgaussian if it has a bounded ψ_2 norm and ‖X‖_{ψ_2} ≤ L‖X‖_{L_2}.

Observe that for L-subgaussian random variables, all the L_p norms are equivalent and their tails exhibit a faster decay than the corresponding gaussian. Indeed, if X is L-subgaussian, then for every p ≥ 2 and t ≥ 1,

‖X‖_{L_p} ≤ √p ‖X‖_{ψ_2} ≤ L √p ‖X‖_{L_2}    and    Pr(|X| > t) ≤ 2 exp(−c t²/‖X‖_{ψ_2}²) ≤ 2 exp(−c t²/(L‖X‖_{L_2})²)

for a suitable absolute constant c.

It is standard to verify that for every f, g, ‖fg‖_{ψ_1} ≤ ‖f‖_{ψ_2}‖g‖_{ψ_2}, and that if X_1, ..., X_N are independent copies of X and α ≥ 1, then

‖max_{i ≤ N} |X_i|‖_{ψ_α} ≲ ‖X‖_{ψ_α} log^{1/α} N.    (2.1)

An additional feature of ψ_α random variables is concentration, namely that if (X_i)_{i=1}^N are independent copies of a ψ_α random variable X, then N^{−1} Σ_{i=1}^N X_i concentrates around EX. One example of such a concentration result is the following Bernstein-type inequality (see, e.g., [5]).

Theorem 2.2. There exists an absolute constant c_0 for which the following holds. If X_1, ..., X_N are independent copies of a ψ_1 random variable X, then for every t > 0,

Pr( | N^{−1} Σ_{i=1}^N X_i − EX | > t ‖X‖_{ψ_1} ) ≤ 2 exp(−c_0 N min{t², t}).
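A quick numerical illustration of Theorem 2.2 (my own sketch, with an arbitrary illustrative constant c_0 = 1/4 and the ψ_1-norm of the example variable treated as 1): the product of two independent standard gaussians is a ψ_1 random variable, and the tails of its empirical mean exhibit the two regimes exp(−cNt²) for small t and exp(−cNt) for large t.

```python
import numpy as np

rng = np.random.default_rng(1)
N, reps = 200, 20000
# X = g1 * g2 is a psi_1 (subexponential) random variable with EX = 0 and psi_1-norm of order 1
X = rng.normal(size=(reps, N)) * rng.normal(size=(reps, N))
means = X.mean(axis=1)
for t in (0.1, 0.3, 1.0, 2.0):
    emp = np.mean(np.abs(means) > t)
    bound = 2 * np.exp(-0.25 * N * min(t * t, t))   # Bernstein-type bound with illustrative c0 = 1/4
    print(f"t={t:>4}: empirical tail {emp:.2e}   bound {bound:.2e}")
```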

10 One important example of a probability space considered here is the discrete space Ω = {,..., }, endowed with the uniform probability measure. Functions on Ω can be viewed as vectors in R and the corresponding L p and ψ α norms are denoted by L p and ψ α. A significant part of the proofs of Theorem A has to do with the behaviour of a monotone non-increasing rearrangement of vectors. Given v R, let vi ) i= be a non-increasing rearrangement of v i ) i=. It turns out that the ψα norm captures information on the coordinates of vi ) i=. Lemma.3 For every α there exist constants c and c that depend only on α for which the following holds. For every v R, vi c sup i log /α e/i) v ψα c sup i v i log /α e/i). Proof. We will prove the claim only for α = as the other cases follow an identical path. Let v R and denote by P r the uniform probability measure on Ω = {,..., }. By the tail characterization of the ψ norm, {j : v j > t} = P r v > t) exp ct / v ψ ). Hence, for t i = c / v ψ loge/i), {j : vj > t i } i/e i, and for every i, v i t i. Therefore, sup i as claimed. In the reverse direction, consider v i loge/i) c / v ψ, B = { β > 0 : i, v ψ βv i / loge/i) }. It is enough to show that B is bounded by a constant that is independent of v. To that end, fix β B and without loss of generality, assume that β >. Set B = sup i βv i / loge/i) and since β B, v ψ B. Also, since /β <, i= ) /β + i ) /β dx /β x /β. 0

11 Therefore, expvi /B ) = i= i= e i expvi ) /B ) i= expβ loge/i)) i= ) /β e) /β /β /β e/β /β <, provided that β c. Thus, if β c, v ψ showing that B is bounded by c. < B which is a contradiction,. Empirical and Subgaussian processes The sampling method used here is isotropic and L-subgaussian, meaning that the vectors a i ) i= are independent and distributed according to a probability measure µ on R n that is both isotropic and L-subgaussian [5]: Definition.4 Let µ be a probability measure on R n and let a be distributed according to µ. The measure µ is isotropic if for every t R n, E a, t = t. It is L-subgaussian if for every t Rn and every u, P r a, t Lu t, a ) exp u /). Given T R n, let d T = sup t T t and put k T ) = lt )/d T ). The latter appears naturally in the context of Dvoretzky type theorems, and in particular, in Milman s proof of Dvoretzky s Theorem see, e.g., []). Theorem.5 [0] For every L there exist constants c and c that depend only on L and for which the following holds. For every u c, with probability at least exp c u k T )), for every t T and every I {,..., }, ) / ) t, ai Lu 3 lt ) + d T I loge/ I ). i I For every integer, let j T be the largest integer j in {,..., } for which lt ) d T j loge/j). It follows from Theorem.5 that if t T and I j T, i I t, ai ) / L,u lt ),

12 and if I j T, i I t, ai ) / L,u d T I loge/ I ). Therefore, if v = t, a i ) i= and vi ) i= is a monotone non-increasing rearrangement of v i ) i=, then / if i j T vi i i vj ) j= L,u lt ) i d T loge/i) otherwise..) This observation will be used extensively in what follows. The next fact deals with product processes. Theorem.6 [0] There exist absolute constants c 0, c and c for which the following holds. If T, T R n, j and u c 0, then with probability at least exp c u j ), sup ai, t a i, s E a, t a, s t T, s T i= c L u lt )lt ) + u )) lt )dt ) + lt )dt ) + j/ dt )dt ) and sup a i, t a i, s E a, t a, s t T, s T i= c L u lt )lt ) + u )) lt )dt ) + lt )dt ) + j/ dt )dt ). Remark.7 Let ε i ) i= be independent, symmetric, {, }-valued random variables. It follows from the results in [0] that with the same probability estimate in Theorem.6 and relative to the product measure ε X), sup ε t T, s T i ai, t a i, s i= L u lt )lt ) + u )) lt )dt ) + lt )dt ) + j/ dt )dt ).

13 Assume that k T )) / = lt )/dt ) lt )/dt ). Setting j/ = lt )/dt ), Theorem.6 and Remark.7 yield that with probability at least exp c u k T )), sup ai, t a i, s E a, t a, s L u lt ) lt ) + u ) dt ), t T, s T sup t T, s T and sup i= a i, t a i, s E a, t a, s L u lt ) lt ) + u ) dt ) i= t T, s T ε i ai, t a i, s L u lt ) lt ) + u ) dt )..3) i= One case which is of particular interest is when T = T = T, and then, with probability at least exp c u k T )), sup t T ai, t E a, t L u l T ) + u dt )lt )). i=. Monotone rearrangement of coordinates The first goal of this section is to investigate the coordinate structure of v R m, given information on its norm in various L m p and ψα m spaces. The vectors we will be interested in are of the form a i, t ) i= for t T, and for which, thanks to the results presented in Section., one has the necessary information at hand. It is standard to verify that if v ψ m α A, then v p p,α A m /p. Thus, v L m p p v ψ m α. It turns out that if the two norms are equivalent, v is regular in some sense. The next lemma, which is a version of the Paley-Zygmund Inequality, see, e.g. [4]), describes the regularity properties needed here in the case p = α =. Lemma.8 For every β > there exist constants c and c that depend only on β and for which the following holds. If v ψ m β v L m, there exists I {,..., m} of cardinality at least c m, and for every i I, v i c v L m. 3

14 Proof. Recall that v ψ m sup i m vi / logem/i). Hence, for every j m, j vl v ψ m l= j logem/l) β v L m j logem/j). l= Therefore, m v L m = m v l = vl + l j l= m l=j+ Setting c β) /β logeβ)) and j = c β)m, v l c 0β v L m j logem/j) + c 0 β v L m j logem/j) m/) v L m. Thus, m l=j+ v l m/) v L m, while v j+ j + l j+ v l β logeβ) v L m. m l=j+ Let I be the set of the m j smallest coordinates of v. Fix η > 0 to be named later, put I η I to be the set of coordinates in I for which v i η v L m and denote by I c η its complement in I. Therefore, Hence, m m/) v L m I l j+ v l = l I η v l + l I c η v L m I β logeβ) I η I + η Ic η I v l v j+ I η + η v L m I c η β logeβ) I η I + η I )) η m β logeβ) η) I ) η I I + η. If η = min{/4, β/) logeβ)}, then I η η/) I c β)m, as claimed. ). v l. ext, let us turn to decomposition results for vectors of the form a i, t ) i=. Recall that for a set T R, j T is the largest integer for which lt ) d T j loge/j). 4

15 Lemma.9 For every L > there exist constants c and c that depend only on L and for which the following holds. Let T R n and set W = {t/ t : t T } S n. With probability at least exp c l W )), for every t T, a i, t ) i= = v + v and v, v have the following properties:. The supports of v and v are disjoint.. v c lw ) t and suppv ) j W. 3. v ψ c t. Proof. Fix t T and let J t {,..., } be the set of the largest j W coordinates of a i, t ) i=. Set v = a j, t/ t )j Jt and v = a j, t/ t )j J c t. By Theorem.5 and the characterization of the ψ norm of a vector using the monotone rearrangement of its coordinates Lemma.3), v LlW ), and v ψ L. To conclude the proof, set v = t v and v = t v. Recall that for every R > 0, { } t + s T +,R = : t, s T, t + s t s R, t + s and a similar definition holds for T,R. Set j +,R = j T+,R, j,r = j T+,R and E R = max{lt +,R ), lt,r )}. Combining the above estimates leads to the following corollary. Corollary.0 For every L > there exist constants c, c, c 3 and c 4 that depend only on L for which the following holds. Let T R n and R > 0, and consider T +,R and T,R as above. With probability at least 4 exp c L min{l T +,R ), l T,R )}), for every s, t T for which t s t + s R,. s t, a i ) i= = v + v, for vectors v and v of disjoint supports satisfying suppv ) j,r, v c lt,r ) s t and v ψ c s t. 5

16 . s + t, a i ) i= = u + u, for vectors u and u of disjoint supports satisfying suppu ) j +,R, u c lt +,R ) s+t and u ψ c s+t. 3. If h s,t a) = s+t s+t, a s t s t, a, then h s,t a i ) E h s,t c 3 i= In particular, recalling that for every s, t T, ER + E R E s + t, a s t, a κ 0 s + t s t, it follows that if c 4 L)E R /κ 0 then 4. κ 0 s + t s t ). s + t, a i s t, ai L s + t s t. i=.4) From here on, denote by Ω,R the event on which Corollary.0 holds for the sets T +,R and T,R and samples of cardinality L E R /κ 0. Lemma. There exist constants c 0 depending only on L and c, κ that depend only on κ 0 and L for which the following holds. If c 0 E R /κ 0, then for a i ) i= Ω,R, for every s, t T for which s t s + t R, there is I s,t {,..., } of cardinality at least κ, and for every i I s,t, s t, a i s + t, ai c s t s + t. Lemma. is an empirical small-ball estimate, as it shows that with high probability, and for every pair s, t as above, many of the coordinates of a i, s t a i, s + t ) i= are large. Proof. Fix s, t T as above and set y = s t, a i ) i=, and x = s + t, a i ) i=. 6

17 Let y = v +v and x = u +u as in Corollary.0. Let j 0 = max{j,r, j +,R } and put J = suppv ) suppu ). Observe that J j 0 and that yj) xj) j J j suppv ) v j)xj) + j suppu ) j0 ) / j0 ) / v x j)) + u y j)) i= i= L lt,r ) s t j 0 loge/j 0 ) s + t +lt +,R ) s + t j 0 loge/j 0 ) s t L E R s t s + t κ 0 4 s t s + t, yj)u j) because, by the definition of j 0, j 0 loge/j 0 ) max{lt,r ), lt +,R )} and since c 0 E R /κ 0 for c 0 = c 0 L) large enough. Thus, by.4), j J c yj)xj) κ 0 s t s + t /4. Set m = J c and let z = yj)xj)) j J c = v j)u j)) j J c. Since L ER /κ 0, it is evident that j 0 /; thus / m and z L m = yj)xj) m 4m κ 0 s t s + t κ 0 s t s + t. j J c On the other hand, z ψ m v u j)) j Jc ψ m v ψ m u ψ m L s t s + t, and z satisfies the assumption of Lemma.8 for β = c L, κ 0 ). The claim follows immediately from that lemma. 3 Proof of Theorem A It is well understood that when analyzing properties of ERM relative to a loss l, studying the excess loss functional is rather natural. The excess loss shares the same empirical minimizer as the loss, but it has additional qualities: for every x T, EL x 0 and L x0 = 0. 7

Since 0 is a possible value of the minimum of {P_N L_x : x ∈ T} (indeed, L_{x_0} = 0), the minimizer x̂ satisfies P_N L_{x̂} ≤ 0, giving one a way of excluding parts of T as potential empirical minimizers. One simply has to show that with high probability, those parts belong to the set {x : P_N L_x > 0}, for example, by showing that P_N L_x is equivalent to EL_x, as the latter is positive for points that are not true minimizers.

The squared excess loss has a simple decomposition into two processes: a quadratic process and a multiplier one. Indeed, given a class of functions F and f ∈ F,

(f(a) − y)² − (f*(a) − y)² = (f(a) − f*(a))² + 2(f(a) − f*(a))(f*(a) − y),

where, as always, f* is a minimizer of the functional E(f(a) − y)² in F.

In the phase retrieval problem, y = ⟨x_0, a⟩² + w for a noise w that is independent of a, and each f_x ∈ F is given by f_x = ⟨x, ·⟩². Thus,

L_x(a, y) = l_x(a, y) − l_{x_0}(a, y) = (f_x(a) − y)² − (f_{x_0}(a) − y)² = ⟨x − x_0, a⟩²⟨x + x_0, a⟩² − 2w⟨x − x_0, a⟩⟨x + x_0, a⟩.

Since w is a mean-zero random variable that is independent of a, and by Assumption 1.1,

EL_x(a, y) = E⟨x − x_0, a⟩²⟨x + x_0, a⟩² ≥ κ_0² ‖x − x_0‖_2² ‖x + x_0‖_2².

Therefore, E(f_x(a) − y)² has a unique minimizer in F: f* = f_{x_0} = f_{−x_0}.

To show that P_N L_x > 0 on a large subset T′ ⊂ T, it suffices to obtain a high probability lower bound on

inf_{x ∈ T′} (1/N) Σ_{i=1}^N ⟨x − x_0, a_i⟩²⟨x + x_0, a_i⟩²

that dominates a high probability upper bound on

sup_{x ∈ T′} (2/N) | Σ_{i=1}^N w_i ⟨x − x_0, a_i⟩⟨x + x_0, a_i⟩ |.

The set T′ that will be used is T_R = {x ∈ T : ‖x − x_0‖_2 ‖x + x_0‖_2 ≥ R} for a well-chosen R.
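As a quick sanity check (my own illustration, not part of the paper's argument), one can verify this pointwise decomposition numerically for a gaussian a, and see the exclusion mechanism at work: once ‖x − x_0‖_2‖x + x_0‖_2 is bounded away from zero, the quadratic term dominates the multiplier term and the empirical excess risk is positive.

```python
import numpy as np

rng = np.random.default_rng(2)
n, N, sigma = 50, 400, 0.5
x0 = rng.normal(size=n); x0 /= np.linalg.norm(x0)
A = rng.normal(size=(N, n))
w = sigma * rng.normal(size=N)
y = (A @ x0) ** 2 + w

def excess_risk(x):
    # P_N L_x = (1/N) sum_i [ (<a_i,x>^2 - y_i)^2 - (<a_i,x_0>^2 - y_i)^2 ]
    return np.mean(((A @ x) ** 2 - y) ** 2 - ((A @ x0) ** 2 - y) ** 2)

def decomposition(x):
    # quadratic term minus twice the multiplier term; equals excess_risk(x) sample-wise
    u, v = A @ (x - x0), A @ (x + x0)
    return np.mean(u ** 2 * v ** 2) - 2 * np.mean(w * u * v)

x = x0 + 0.5 * rng.normal(size=n)          # a point with ||x - x0|| ||x + x0|| bounded away from 0
print(np.isclose(excess_risk(x), decomposition(x)))   # the identity holds exactly
print(excess_risk(x) > 0)                              # exclusion: the empirical excess risk is positive
```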

19 Theorem 3. There exist a constant c 0 depending only on L, and constants c, κ depending only on κ 0 and L for which the following holds. For every R > 0 and c 0 ER /κ 0, with probability at least for every x T R, 4 exp c L min{l T +,R ), l T,R )}), x0 x, a i x0 + x, a i c x 0 x x 0 + x. i= Theorem 3. is an immediate outcome of Lemma. Theorem 3. There exist absolute constants c and c for which the following holds. For every β >, with probability at least exp c L min{l T +,R ), l T,R )}) β ), for every x T R, E w i x x0, a i x + x0, a i c β w ψ log R x x 0 x+x 0. i= Proof. By standard properties of empirical process, and since w is meanzero and independent of a, it suffices to estimate sup ε i w i x x 0, a i x + x0, a i, x T R i= for independent signs ε i ) i=. By the contraction principle for Bernoulli processes see, e.g., [9]), it follows that for every fixed w i ) i= and a i) i=, P r ε sup ε x T R i w i ) x x 0 x + x 0, a i, a i x x i= 0 x + x 0 > u ) P r ε max w x x 0 x + x 0 i sup ε i, a i, a i i x x 0 x + x 0 > u. x T R i= Applying Remark.7, if L E R then with ε a) -probability of at least exp c L min{l T +,R ), l T,R )}), x x 0 x + x 0 sup ε i, a i, a i x x 0 x + x 0 c L E R. x T R i= 9

20 Also, because w is a ψ random variable, P rw t w ψ ) exp t /), and thus, w β log w ψ with probability at least β+. Combining the two estimates and a Fubini argument, it follows that with probability at least exp c L min{l T +,R ), l T,R )}) β+, for every x T R, i= w i x x0, a i x + x0, a c 3 L β w ψ log E R x x 0 x+x 0. On the intersection of the two events appearing in Theorem 3. and Theorem 3., if κ0,l ER and setting ρ = x x 0 x + x 0 R r c κ 0 /c L β)), then for every x T R, P L x c κ 0ρ c L ) E R β w ψ log ρ c κ 0R c L ) E R β w ψ log R. Therefore, if L,κ0 E R and R E R c 3 L, κ 0 ) w ψ β log, 3.) then P L x > 0 and ˆx T R. Theorem A follows from the definition of r γ) for a well chosen γ. 4 Proof of Theorem B Most of the work required for the proof of Theorem B has been carried out in Section 3. A literally identical argument, in which one replaces the sets T +,R and T,R with T +,R x 0 ) and T,R x 0 ) may be used, leading to an analogous version of Theorem A, with the obvious modifications: the complexity parameter is max{lt +,R x 0 )), lt,r x 0 ))} for the right choice of R, and 0

21 the probability estimate is exp c 0 min{l T +,R x 0 )), l T,R x 0 ))}) β+. All that remains to complete the proof of Theorem B is to analyze the structure of the local sets and identify the fixed points r 0 and r. A first step in that direction is the following: Lemma 4. There exist absolute constants c and c for which the following holds. For every R > 0 and x 0 R/4,. If x 0 min{ x x 0, x + x 0 } R then x x 0 x + x 0 c R.. If x x 0 x + x 0 R then x 0 min{ x x 0, x + x 0 } c R. Moreover, if x 0 R/4 then x x 0 x + x 0 R if and only if x R. Proof. Assume without loss of generality that x x 0 x + x 0. If x x 0 x 0 then x 0 x 0 x x 0 x + x 0 x x 0 + x 0 3 x 0. Hence, x 0 x + x 0, and then x 0 min{ x x 0, x + x 0 } x x 0 x + x 0. Otherwise, x x 0 > x 0. If, in addition, x 0 x x 0 x + x 0 ) / /4, 4 x 0 x x 0 x + x 0 ) / x 0 / x + x 0 /, and thus x + x 0 6 x 0. Since x 0 < x x 0 x + x 0, it follows that x + x 0 x x 0 x 0, and again, x 0 min{ x x 0, x + x 0 } x x 0 x + x 0. Therefore, the final case, and the only one in which there is no point-wise equivalence between x x 0 x+x 0 and x 0 min{ x x 0, x+x 0 }, is when min{ x x 0, x + x 0 } x 0 and x 0 x x 0 x + x 0 ) / /4. In that case, if x 0 R/4 then x 0 min{ x x 0, x + x 0 } x 0 R/6,

22 and x x 0 x + x 0 4 x 0 R/4, from which the first part of the claim follows immediately. For the second one, observe that x x 0 x x x 0 x + x 0 x + x x 0 + x 0, and if x 0 R/4, the equivalence is evident. In view of Lemma 4., the way the product x x 0 x + x 0 relates to min{ x x 0, x + x 0 } depends on x 0. If x 0 R/4, then {x T : x x 0 x+x 0 R} {x T : min{ x x 0, x+x 0 } c R/ x 0 }, and if x 0 R/4, {x T : x x 0 x + x 0 R} {x T : x c R}, for a suitable absolute constant c. When T is convex and centrally-symmetric, the corresponding complexity parameter - the gaussian average of T +,R x 0 ) = T,R x 0 ) is x 0 R lt c R/ x 0 )B n) if x 0 R, E R x 0 ) R lt c RB n ) if x 0 < R. The fixed point conditions appearing in Theorem A now become and r 0 = inf{r : E R x 0 ) c } 4.) r γ) = inf{r : E R x 0 ) γ R}, 4.) where one selects the slightly suboptimal γ = c /σ log. The assertion of Theorem A is that with high probability, ERM produces ˆx for which ˆx x 0 ˆx + x 0 max{r γ), r 0 }. If x 0 R, the fixed-point condition 4.) is ) R lt c R/ x 0 )B n ) c 3 4.3) x 0

23 while 4.) is, Recall that and x 0 R lt c R/ x 0 )B n ) c 4 /σ log ) R. 4.4) r Q) = inf{r > 0 : lt rb n ) Qr }, s η) = inf{s > 0 : lt sb n ) ηs }. Therefore, it is straightforward to verify that r 0 = x 0 r c 3 ) and r c /σ log ) ) = x 0 s c 4 x 0 /σ log ). Setting R = x 0 max{r c 3), s c 4 x 0 /σ log )}, it remains to ensure that x 0 R; that is, Observe that if max{s c 4 x 0 /σ log ), r c 3 )} x ) r c 3 ) c 3σ c 4 x 0 log, 4.6) then r c 3) s c 4 x 0 /σ log ). Indeed, applying the convexity of T, it is standard to verify that r Q) is attained and r Q) ρ if and only if lt ρb n) Qρ with a similar statement for s see, e.g., the discussion in [7]). Therefore, s η) r Q) if and only if lt r Q)) ηr Q)). The latter is evident because lt r Q)) = Qr Q) and recalling that Q = c 3 and η = c 4 x 0 /σ log. Under 4.6), an assumption which has been made in the formulation of Theorem B, 4.5) becomes s c 4 x 0 /σ log ) x 0 and, by the definition of s, this is the case if and only if lt x 0 B n ) c 4 x 0 σ log x 0 ; that is simply when x 0 v ζ) for ζ = c 4/σ log. Hence, by Theorem A, combined with Lemma 4., it follows that with high probability, min{ ˆx x 0, ˆx + x 0 } s c 4 x 0 /σ log ). The other cases, when either x 0 is small, or when r 0 dominates r are treated is a similar fashion, and are omitted. 3

24 5 Minimax lower bounds In this section we study the optimality of ERM as a phase retrieval procedure, in the minimax sense. The estimate obtained here is based on the maximal cardinality of separated subsets of the class with respect to the L µ) norm. Definition 5. Let B be the unit ball in a normed space. For any subset A of the space, let MA, rb) be the maximal cardinality of a subset of A that is r-separated with respect to the norm associated with B. Observe that if MA, rb) L there are x,..., x L A for which the sets x i +r/3)b are disjoint. A similar statement is true in the reverse direction. Let F be a class of functions on Ω, µ) and let a be distributed according to µ. For f 0 F and a centred gaussian variable w, which has variance σ and is independent of a, consider the gaussian regression model y = f 0 a) + w. 5.) Any procedure that performs well in the minimax sense, must do so for any choice of f 0 F in 5.). Following [7], there are two possible sources of statistical complexity that influence the error rate of gaussian regression in F.. Firstly, that there are functions in F that, despite being far away from f 0, still satisfy f 0 a i ) = fa i ) for every i, and thus are indistinguishable from f 0 on the data. This statistical complexity is independent of the noise, and for every f 0 F and A = a i ) i=, it is captured by the L µ) diameter of the set Kf 0, A) = {f F : fa i )) i= = f 0 a i )) i=}, which is denoted by d A).. Secondly, that the set F f 0 ) rd = {f f 0 : f F, f f 0 L r} is rich enough at a scale that is proportional to its L µ) diameter r. The richness of the set is measured using the cardinality of a maximal L µ)-separated set. To that end, let D be the unit ball in L µ), set and put Cr, θ 0 ) = sup r log / MF f 0 + θ 0 rd), rd) f 0 F q η) = inf { r > 0 : Cr, θ 0 ) ηr }. 5.) 4
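The packing number M(A, rB) of the definition above can be lower-bounded constructively by a greedy procedure: keep adding points of A that are at distance more than r from all points kept so far. The following sketch is my own illustration (the set of sample points, the radius and all constants are arbitrary choices); it builds a separated subset of the d-sparse unit sphere, which is the kind of set used in the lower bounds of this section and of Section 6.

```python
import numpy as np

def greedy_packing(points, r):
    """Greedily extract an r-separated subset (in Euclidean norm) of the given points;
    its cardinality is a lower bound on the packing number M(A, r B_2^n)."""
    kept = []
    for p in points:
        if not kept or np.min(np.linalg.norm(np.asarray(kept) - p, axis=1)) > r:
            kept.append(p)
    return np.asarray(kept)

rng = np.random.default_rng(3)
n, d, m = 10, 2, 2000
pts = np.zeros((m, n))
for i in range(m):                        # random points of the d-sparse unit sphere
    idx = rng.choice(n, size=d, replace=False)
    v = rng.normal(size=d)
    pts[i, idx] = v / np.linalg.norm(v)
sep = greedy_packing(pts, r=0.9)
print(len(sep), "mutually 0.9-separated 2-sparse unit vectors found")
```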

25 Theorem 5. [7] For every f 0 F let P f 0 be the probability measure that generates samples a i, y i ) i= according to 5.). For every θ 0 there exists a constant θ > 0 for which inf ˆf sup f 0 F P f 0 f0 ˆf ) max{q θ /σ), d A)/4)} /5 5.3) where inf ˆf is the infimum over all possible estimators constructed using the given data. Earlier versions of this minimax bound may be found in [4], [6] and []. To apply this general principle to the phase recovery problem, note that the regression function is f 0 x) = x 0, x := fx0 x) for some unknown vector x 0 T R n, while the estimators are ˆf = ˆx,. Also, observe that for every x, x T, f x0 f x L µ) = E x 0, a x, a ) = E x0 x, a x0 + x, a and therefore, one has to identify the L structure of the set F f x0 = { x, x0, : x T }. To obtain the desired bound, it suffices to assume the following: Assumption 5. There exist constants C and C for which, for every s, t R n, C s t s + t E s t, a s + t, a C s t s + t. It is straightforward to verify that if a is an L-subgaussian vector on R n that satisfies Assumption., then it automatically satisfies Assumption 5.. The norm x 0 plays a central role in the analysis of the rates of convergence of the ERM in phase recovery. Therefore, the minimax lower bounds presented here are not only for the entire model T but for every shell V 0 = T R 0 S n. A minimax lower bound over T follows by taking the supremum over all possible choices of R 0. To apply Theorem 5., observe that by Assumption 5., for every u, v T, C u v u + v f v f u L C u v u + v. 5

26 Fix R 0 > 0 and consider V 0 = T R 0 S n. Clearly, for every r > 0 and every x 0 V 0, { u V 0 : u x 0 u + x 0 θ } 0r {u V 0 : f u F f x0 + θ 0 rd)} C 5.4) where F = {f u : u V 0 }. Fix θ 0 > to be named later, and let θ be as in Theorem 5.. If there are x 0 V 0 and {x,..., x M } V 0 that satisfy. x i x 0 x i + x 0 θ 0 r/c,. for every i < j M, x i x j x i + x j r/c, and 3. log M > θ r/σ), then sup f0 F r log/ MF f 0 + θ 0 rd), rd) > θ r /σ, and the best possible rate in phase recovery in V 0 is larger than r. Fix x 0 V 0 and r > 0, and let R = r/c. We will present two different estimates, based on R 0 = x 0, the location of x 0. Centre of small norm. Recall that θ 0 > and assume first that R 0 = x 0 θ 0 R/4. ote that V 0 R/8)B n { u V 0 : u x 0 u + x 0 θ 0r C }, and thus it suffices to constructed a separated set in V 0 R/8)B n. Set x,..., x L to be a maximal c 3 R-separated subset of V0 R/8)B n for a constant c 3 that depends only on C and C and which will be specified later; thus, L = MV 0 R/8)B n, c 3 RB n ). Lemma 5.3 There is a subset I {,..., L} of cardinality M L/ for which x i ) i I satisfies. and.. Proof. Since x i R/8)B n and x 0 θ 0 R/4, x i x 0 x i + x 0 θ 0 + ) R/4) θ 0 R = θ 0 r/c, and thus. is satisfied for every i L. To show that. holds for a large subset, it suffices to find I {,..., L} of cardinality at least L/ such that for every i, j I, x i x j c 3 R/ and xi + x j c 3 R/ 6

27 for a well-chosen c 3. To construct the subset, observe that if there are distinct integers i, j, k L for which x i + x j < c 3 R/ and xi + x k < c 3 R/, then x j x k < c 3 R, which is impossible. Therefore, for every xi there is at most a single index j {,..., L}\{i} satisfying that x i + x j < c 3 R/. With this observation the set I is constructed inductively. Without loss of generality, assume that I = {,..., M} for M L/. If i j and i, j M, x i x j x i + x j c 3R/4 r/c, for the right choice of c 3, and thus x i ) M i= satisfies.. Centre of large norm. ext, assume that R 0 = x 0 θ 0 R/4. By Lemma 4., there is an absolute constant c 4 < /3, for which, if x 0 ρ/4 and x0 min{ x x 0, x+x 0 } c 4 ρ, then x x 0 x+x 0 ρ. Therefore, applied to the choice ρ = θ 0 R, V 0 x 0 + c 4 θ 0 R/ x 0 )B n )) V 0 x 0 + c 4 θ 0 R/ x 0 )B n )) {u V 0 : x 0 u x 0 + u θ 0 R}, and it suffices to a find a separated set in the former. ote that if x V 0 x 0 + c 4 θ 0 R/ x 0 )B n ) then x x 0 c 4 θ 0 R/ x 0 x 0 / θ 0 R/4/4 because one can choose c 4 /3, and x 3 x 0 /. Moreover, if x, x V 0 x 0 + c 4 θ 0 R/ x 0 )B n ), then then x + x x 0 c 4 θ 0 R/ x 0 x 0. Applying Lemma 4., there is an absolute constant c 5, for which, if x 0 min { x i x j, x i + x j } θ 0 R/4, 5.5) x i x j x i + x j c 5 θ 0 R/4. Hence, if x,..., x M V 0 x 0 +c 4 θ 0 R/ x 0 )B n ) is θ 0R/4 x 0 -separated, then 5.5) holds, and x i x j x i + x j c 5 θ 0 R/4 r/c, provided that θ 0 is a sufficiently large constant that depends only on C and C. 7

28 Corollary 5.4 There exist absolute constants θ, c, c and c 3 that depend only on C and C and for which the following holds. Let R 0 > 0 and set V 0 = T R 0 S n. Define log MV 0 c rb n, c rb n ) if R 0 c 3 r, q r) = ) ) ) sup x0 V 0 log M V 0 x 0 + c r R 0 B n r, c R 0 B n if R 0 > c 3 r If q r) θ r/σ), then the minimax rate in V 0 is larger than r. ow, a minimax lower bound for the risk min{ x x 0, x + x 0 } for all shells V 0 and therefore, for T ) may be derived using Lemma 4.. To that end, let c 0 be a large enough absolute constant and CR 0, r) = sup r log / M T R 0 S n ) x 0 + c 0 rb n ), rb n ). x 0 T : x 0 =R 0 Definition 5.5 Fix R 0 > 0. For every α, β > 0 set q α) = inf { r > 0 : CR 0, r) αr } 5.6) and put t β) = inf { r > 0 : CR 0, r) βr 3 }. ote that for any c > 0, if t c) then t c) q c) and q cr0 /σ ) t c/σ) if and only if q cr0 /σ ) R 0. Theorem C. There exists an absolute constant c for which the following holds. Let R 0 > 0.. If R 0 t c /σ ), then for any procedure x, there exists x 0 T with x 0 = R 0 and for which, with probability at least /5, and x x 0 x + x 0 x 0 q c x 0 ) σ min{ x x 0, x + x 0 } q c x 0 ). σ 8

29 . If R 0 t c /σ ) then for any procedure x there exists x 0 T with x 0 = R 0 for which, with probability at least /5, and x x 0 x + x 0 t c )) σ x, x x 0, x + x 0 t c ). σ Theorem C is a general minimax bound, and although it seems strange at first glance, the parameters appearing in it are very close to those used in Theorem C. Following the same path as in [7], let us show that Theorem C and Theorem C are almost sharp, under some mild structural assumptions on T. First, recall Sudakov s inequality see, e.g., [9]): Theorem 5.6 If W R n and ε > 0 then where c is an absolute constant. cε log / MW, εb n ) lw ), Fix R 0 > 0 set V 0 = T R 0 B n, and put s = s c R 0 /σ log ) ) and v = v c R 0 /σ log ) ). Assume that there is some x 0 T, with the following properties:. x 0 = R.. The localized sets V 0 x 0 + s Bn ) and V 0 x 0 + v Bn ) satisfy that and that lv 0 x 0 + s S n )) lt s B n ) lv 0 x 0 + v S n )) lt v B n ), which is a mild assumption on the complexity structure of T. 3. Sudakov s inequality is sharp at the scales s and v, namely, s log / MV 0 x 0 + c 0 s B n ), s B n ) lv 0 x 0 + s B n )) 5.7) and a similar assertion holds for v. 9

In such a case, the rates of convergence obtained in Theorem B are minimax (up to an extra log factor) thanks to Theorem C. When Sudakov's inequality is sharp as in (5.7), we believe that ERM should be a minimax procedure in phase recovery, despite the logarithmic gap between Theorem B and Theorem C. A similar conclusion for linear regression was obtained in [7].

Sudakov's inequality is sharp in many cases - most notably, when T = B_1^n, but not always. It is not sharp even for standard sets like the unit ball in ℓ_p^n for 1 + 1/log n < p < 2.

6 Examples

Here, we will present two simple applications of the upper and lower bounds on the performance of ERM in phase recovery. Naturally, there are many other examples that follow in a similar way and that can be derived using very similar arguments. The choice of the examples has been made to illustrate the question of the optimality of Theorems A, B and C, as well as an indication of the similarities and differences between phase recovery and linear regression. Since the estimates used in these examples are rather well known, some of the details will not be presented in full.

6.1 Sparse vectors

The first example we consider represents classes with a local complexity that remains unchanged, regardless of the choice of x_0. Let T = W_d be the set of d-sparse vectors in R^n for some d ≤ n/4, that is, vectors with at most d non-zero coordinates. Clearly, for every R > 0, T_{+,R}, T_{−,R} ⊂ W_{2d} ∩ S^{n−1}. Also, for any x_0 ∈ T and any I ⊂ {1, ..., n} of cardinality d that is disjoint from supp(x_0), the coordinate projections onto I of the sets

{ (x − x_0)/‖x − x_0‖_2 : x ∈ W_d }    and    { (x + x_0)/‖x + x_0‖_2 : x ∈ W_d }

contain 2^{−1/2} S_I, where S_I is the unit sphere supported on the coordinates I. With this observation, a straightforward argument shows that

ℓ(T_{+,R}), ℓ(T_{−,R}) ≍ √(d log(en/d)).

Applying Theorem A, it follows that for N ≳ d log(en/d), with probability

at least 1 − 2 exp(−c d log(en/d)) − N^{−β+1}, ERM produces x̂ that satisfies

‖x̂ − x_0‖_2 ‖x̂ + x_0‖_2 ≲_{κ_0,L,β} σ √(d log(en/d)/N) · √(log N).    (6.1)

Moreover, with the same probability estimate, if ‖x_0‖_2 ≳ (σ² d log(en/d) log N / N)^{1/4}, then by Lemma 4.1,

min{ ‖x̂ − x_0‖_2, ‖x̂ + x_0‖_2 } ≲_{κ_0,L,β} (σ/‖x_0‖_2) √(d log(en/d)/N) · √(log N),    (6.2)

and if ‖x_0‖_2 ≲ (σ² d log(en/d) log N / N)^{1/4}, then

‖x̂‖_2, ‖x̂ − x_0‖_2, ‖x̂ + x_0‖_2 ≲_{κ_0,L} ( σ² d log(en/d) log N / N )^{1/4}.    (6.3)

In particular, when ‖x_0‖_2 is of the order of a constant, the rate of convergence in (6.1) and (6.2) is identical to the one obtained in [7] for linear regression (up to a logarithmic term). In the latter, ERM achieves the minimax rate (with the same probability estimate) of

‖x̂ − x_0‖_2 ≲_L σ √(d log(en/d)/N).

Otherwise, when ‖x_0‖_2 is large, the rate of convergence of ERM in (6.2) is actually better than in linear regression, but, when ‖x_0‖_2 is small, it is worse, deteriorating to the square root of the rate in linear regression (up to logarithmic terms).

When the noise level σ tends to zero, the rates of convergence in linear regression and phase recovery tend to zero as well. In particular, exact reconstruction happens (that is, x̂ = x_0 in linear regression and x̂ = x_0 or x̂ = −x_0 in phase recovery) when N ≳ d log(en/d).

For the lower bound, it is well known that log^{1/2} M(W_d ∩ c_0 rB_2^n, rB_2^n) ≳ √(d log(en/d)) for every r > 0 (and a suitable absolute constant c_0). Combined with the results of the previous section, this suffices to show that the rate obtained in Theorem A is the minimax one (up to a log N term in the large noise regime) and that ERM is a minimax procedure for the phase retrieval problem when the signal x_0 is known to be d-sparse and N ≳ d log(en/d).

6.2 The unit ball of ℓ_1^n

Consider the set T = B_1^n, the unit ball of ℓ_1^n. Being convex and centrally symmetric, it is a natural example of a set with changing local complexity

32 which becomes very large when x 0 is close to 0. Moreover, it is an example in which one may obtain sharp estimates on lb n rbn ) at every scale. Indeed, one may show see, for example, [5]) that l B n rb n ) logenr ) if r n r n otherwise. It follows that for B n, one has )) / Q log n if n C Q 0 Q rq) if C Q n C 0 Q = 0 if n C Q. where C 0 and C are absolute constants. The only range in which this estimate is not sharp is when n Q, because in that range, r Q) decays to zero very quickly. A more accurate estimate on lb n rbn ) can be performed when n Q see [8]), but since it is not our main interest, we will not pursue it further, and only consider the cases n C Q and n C 0 Q. A straightforward computation shows that the two other fixed points satisfy: )) s η log n /4 if n η η η) and vζ) n η if n η )) ζ log n 3 /6 if n ζ /3 /3 ζ n ζ ) /4 if n ζ /3 /3. The estimates above will be used to derive rates of convergence for the ERM ˆx for the squared loss) of the form min{ ˆx x 0, ˆx + x 0 } rate. Upper bounds on the rate of convergence rate follow from Theorem B, and hold with high probability as stated in there. For the sake of brevity, we 3

33 will not present the probability estimates, but those can be easily derived from Theorem B. Thanks to Theorem B, obtaining upper bounds on rate involves the study of several different regimes, depending on x 0, the noise level σ and the way the number of observations compares with the dimension n. The noise-free case: σ = 0. In this case, rate is upper bounded by r Q), for Q that is an absolute constant. In particular, when n C 0 Q, the rate is less than log n/ )) /. At this point, it is natural to wonder whether there is a procedure that outperforms ERM in the noise-free case. The minimax lower bound d A) in Theorem 5. may be used to address this question, as no algorithm can do better than d A)/4, with probability greater than /5. In the phase recovery problem and using the notation of section 5, one has d A) = sup { f x f x0 : x B n, f x a i ) = f x0 a i ), i =,..., } sup { x x 0 x + x 0 : x B n, a i, x = a i, x 0, i =,..., } inf sup { x x 0 x + x 0 : x B n L:R n R, Lx) = Lx 0 )} with an infimum taken over all linear operators L : R n R. By Lemma 4., for x 0 = /, 0,..., 0) B n in fact, any vector x 0 in B n for which x 0 is a positive constant smaller than / would do) d A) inf sup min{ x x 0, x + x 0 } L:R n R x B n kerl x 0) inf sup x y = c B n L:R n R x,y B n kerl ) which is the Gelfand -width of B n. By a result due to Garnaev and Gluskin see [3]), { )} min, c B n log en if n ) 0 otherwise. which is of the same order as r except when n, which is not treated here). Therefore, no algorithm can outperform ERM and ERM is a minimax procedure in this case. 33

34 ote that when n c Q, exact reconstruction of x 0 or x 0 is possible and it can happens only in that case i.e. σ = 0 and n c Q ) because of the minimax lower bound provided by d A). The noisy case: σ > 0. According to Theorem B, the rate of convergence rate depends on r = r Q) for some absolute constant Q, s = s η) for η = c x 0 /σ log ) and on v = v ζ) for ζ = c /σ log ). The outcome of Theorem B is presented in Figure. rate σ/ x 0 c 0 r / log σ/ x 0 c 0 r / log x 0 v r s x 0 v r v Figure : High probability bounds on the rate of convergence of the ERM ˆx for the square loss in phase recovery: min{ ˆx x 0, ˆx + x 0 } rate. As the proof of all the estimates is similar, a detailed analysis is only presented when ζ /3 /3 η C Q, which is equivalent to σ log ) /6 x0 c c Q σ log. c The upper bounds on rate change according to the way the number of observations scales relative to n:. n C 0 Q. In this situation, r logn/)/ ) /. Therefore, if σ/ x 0 logn/)/ log ) then rate logn/)/ ) /, and if σ/ x 0 logn/)/ log ), rate σ log x 0 log σ n x 0 )) /4 σ log log σ n 3 if x 0 σ log log )) /6 otherwise. )) /6 σ n 3 6.4). c x 0 /σ log ) n C Q. In that case r = 0. In particular σ/ x 0 > c 0 r / log and therefore, the rate is upper bounded as in 6.4). 3. c /σ log ) ) /3 /3 n c x 0 /σ log ). Again, in this case, r = 0. Therefore, one is in the situation of the small 34

signal-to-noise ratio, and

rate ≲ (σ/‖x_0‖_2) √(n log N / N)    if ‖x_0‖_2 ≳ ( σ² log N · log(σ²n³/N) / N )^{1/6},
rate ≲ ( σ² log N · log(σ²n³/N) / N )^{1/6}    otherwise.

4. N ≥ c_1 n³ (σ √(log N))² (equivalently, n ≤ c N^{1/3}/(σ√(log N))^{2/3}). Once again, r* = 0, and

rate ≲ (σ/‖x_0‖_2) √(n log N / N)    if ‖x_0‖_2 ≳ ( σ² n log N / N )^{1/4},
rate ≲ ( σ² n log N / N )^{1/4}    otherwise.

One may ask whether these estimates are optimal in the minimax sense, or perhaps there is another procedure that can outperform ERM. It appears that (up to an extra log factor) ERM is indeed optimal. To see that, it is enough to apply Theorem C and verify that Sudakov's inequality is sharp in the following sense (see the discussion following Theorem C): that if ‖x_0‖_2 ≤ 1/2, then for every ε < 1/4,

ε log^{1/2} M(B_1^n ∩ (x_0 + c_0 ε B_2^n), ε B_2^n) ≳ ℓ(B_1^n ∩ ε B_2^n).

This fact is relatively straightforward to verify (see, e.g., the example in [10]). Therefore, up to the extra log factor, which we believe is parasitic, ERM is a minimax phase-recovery procedure in B_1^n.

References

[1] L. Birgé, Approximation dans les espaces métriques et théorie de l'estimation, Z. Wahrsch. Verw. Gebiete, vol. 65, 1983.
[2] Y. C. Eldar, S. Mendelson, Phase Retrieval: Stability and Recovery Guarantees, preprint.
[3] A. Yu. Garnaev and E. D. Gluskin, The widths of a Euclidean ball, Dokl. Akad. Nauk SSSR, 277(5):1048-1052, 1984.
[4] E. Giné, V. de la Peña, Decoupling: From Dependence to Independence, Springer-Verlag, 1999.
[5] Y. Gordon, A. E. Litvak, S. Mendelson and A. Pajor, Gaussian averages of interpolated bodies and applications to approximate reconstruction, J. Approx. Theory, 149(1), 2007.


Majorizing measures and proportional subsets of bounded orthonormal systems Majorizing measures and proportional subsets of bounded orthonormal systems Olivier GUÉDON Shahar MENDELSON1 Alain PAJOR Nicole TOMCZAK-JAEGERMANN Abstract In this article we prove that for any orthonormal

More information

REAL ANALYSIS LECTURE NOTES: 1.4 OUTER MEASURE

REAL ANALYSIS LECTURE NOTES: 1.4 OUTER MEASURE REAL ANALYSIS LECTURE NOTES: 1.4 OUTER MEASURE CHRISTOPHER HEIL 1.4.1 Introduction We will expand on Section 1.4 of Folland s text, which covers abstract outer measures also called exterior measures).

More information

Subspaces and orthogonal decompositions generated by bounded orthogonal systems

Subspaces and orthogonal decompositions generated by bounded orthogonal systems Subspaces and orthogonal decompositions generated by bounded orthogonal systems Olivier GUÉDON Shahar MENDELSON Alain PAJOR Nicole TOMCZAK-JAEGERMANN August 3, 006 Abstract We investigate properties of

More information

4130 HOMEWORK 4. , a 2

4130 HOMEWORK 4. , a 2 4130 HOMEWORK 4 Due Tuesday March 2 (1) Let N N denote the set of all sequences of natural numbers. That is, N N = {(a 1, a 2, a 3,...) : a i N}. Show that N N = P(N). We use the Schröder-Bernstein Theorem.

More information

THE INVERSE FUNCTION THEOREM

THE INVERSE FUNCTION THEOREM THE INVERSE FUNCTION THEOREM W. PATRICK HOOPER The implicit function theorem is the following result: Theorem 1. Let f be a C 1 function from a neighborhood of a point a R n into R n. Suppose A = Df(a)

More information

In N we can do addition, but in order to do subtraction we need to extend N to the integers

In N we can do addition, but in order to do subtraction we need to extend N to the integers Chapter 1 The Real Numbers 1.1. Some Preliminaries Discussion: The Irrationality of 2. We begin with the natural numbers N = {1, 2, 3, }. In N we can do addition, but in order to do subtraction we need

More information

arxiv: v5 [math.na] 16 Nov 2017

arxiv: v5 [math.na] 16 Nov 2017 RANDOM PERTURBATION OF LOW RANK MATRICES: IMPROVING CLASSICAL BOUNDS arxiv:3.657v5 [math.na] 6 Nov 07 SEAN O ROURKE, VAN VU, AND KE WANG Abstract. Matrix perturbation inequalities, such as Weyl s theorem

More information

Course 212: Academic Year Section 1: Metric Spaces

Course 212: Academic Year Section 1: Metric Spaces Course 212: Academic Year 1991-2 Section 1: Metric Spaces D. R. Wilkins Contents 1 Metric Spaces 3 1.1 Distance Functions and Metric Spaces............. 3 1.2 Convergence and Continuity in Metric Spaces.........

More information

Arithmetic progressions in sumsets

Arithmetic progressions in sumsets ACTA ARITHMETICA LX.2 (1991) Arithmetic progressions in sumsets by Imre Z. Ruzsa* (Budapest) 1. Introduction. Let A, B [1, N] be sets of integers, A = B = cn. Bourgain [2] proved that A + B always contains

More information

INDUSTRIAL MATHEMATICS INSTITUTE. B.S. Kashin and V.N. Temlyakov. IMI Preprint Series. Department of Mathematics University of South Carolina

INDUSTRIAL MATHEMATICS INSTITUTE. B.S. Kashin and V.N. Temlyakov. IMI Preprint Series. Department of Mathematics University of South Carolina INDUSTRIAL MATHEMATICS INSTITUTE 2007:08 A remark on compressed sensing B.S. Kashin and V.N. Temlyakov IMI Preprint Series Department of Mathematics University of South Carolina A remark on compressed

More information

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013 Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

Rademacher Averages and Phase Transitions in Glivenko Cantelli Classes

Rademacher Averages and Phase Transitions in Glivenko Cantelli Classes IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 48, NO. 1, JANUARY 2002 251 Rademacher Averages Phase Transitions in Glivenko Cantelli Classes Shahar Mendelson Abstract We introduce a new parameter which

More information

Uniform Uncertainty Principle and signal recovery via Regularized Orthogonal Matching Pursuit

Uniform Uncertainty Principle and signal recovery via Regularized Orthogonal Matching Pursuit Uniform Uncertainty Principle and signal recovery via Regularized Orthogonal Matching Pursuit arxiv:0707.4203v2 [math.na] 14 Aug 2007 Deanna Needell Department of Mathematics University of California,

More information

Least singular value of random matrices. Lewis Memorial Lecture / DIMACS minicourse March 18, Terence Tao (UCLA)

Least singular value of random matrices. Lewis Memorial Lecture / DIMACS minicourse March 18, Terence Tao (UCLA) Least singular value of random matrices Lewis Memorial Lecture / DIMACS minicourse March 18, 2008 Terence Tao (UCLA) 1 Extreme singular values Let M = (a ij ) 1 i n;1 j m be a square or rectangular matrix

More information

ROW PRODUCTS OF RANDOM MATRICES

ROW PRODUCTS OF RANDOM MATRICES ROW PRODUCTS OF RANDOM MATRICES MARK RUDELSON Abstract Let 1,, K be d n matrices We define the row product of these matrices as a d K n matrix, whose rows are entry-wise products of rows of 1,, K This

More information

Only Intervals Preserve the Invertibility of Arithmetic Operations

Only Intervals Preserve the Invertibility of Arithmetic Operations Only Intervals Preserve the Invertibility of Arithmetic Operations Olga Kosheleva 1 and Vladik Kreinovich 2 1 Department of Electrical and Computer Engineering 2 Department of Computer Science University

More information

Optimisation Combinatoire et Convexe.

Optimisation Combinatoire et Convexe. Optimisation Combinatoire et Convexe. Low complexity models, l 1 penalties. A. d Aspremont. M1 ENS. 1/36 Today Sparsity, low complexity models. l 1 -recovery results: three approaches. Extensions: matrix

More information

The small ball property in Banach spaces (quantitative results)

The small ball property in Banach spaces (quantitative results) The small ball property in Banach spaces (quantitative results) Ehrhard Behrends Abstract A metric space (M, d) is said to have the small ball property (sbp) if for every ε 0 > 0 there exists a sequence

More information

MINIMAL GRAPHS PART I: EXISTENCE OF LIPSCHITZ WEAK SOLUTIONS TO THE DIRICHLET PROBLEM WITH C 2 BOUNDARY DATA

MINIMAL GRAPHS PART I: EXISTENCE OF LIPSCHITZ WEAK SOLUTIONS TO THE DIRICHLET PROBLEM WITH C 2 BOUNDARY DATA MINIMAL GRAPHS PART I: EXISTENCE OF LIPSCHITZ WEAK SOLUTIONS TO THE DIRICHLET PROBLEM WITH C 2 BOUNDARY DATA SPENCER HUGHES In these notes we prove that for any given smooth function on the boundary of

More information

Tools from Lebesgue integration

Tools from Lebesgue integration Tools from Lebesgue integration E.P. van den Ban Fall 2005 Introduction In these notes we describe some of the basic tools from the theory of Lebesgue integration. Definitions and results will be given

More information

Approximately Gaussian marginals and the hyperplane conjecture

Approximately Gaussian marginals and the hyperplane conjecture Approximately Gaussian marginals and the hyperplane conjecture Tel-Aviv University Conference on Asymptotic Geometric Analysis, Euler Institute, St. Petersburg, July 2010 Lecture based on a joint work

More information

Introduction and Preliminaries

Introduction and Preliminaries Chapter 1 Introduction and Preliminaries This chapter serves two purposes. The first purpose is to prepare the readers for the more systematic development in later chapters of methods of real analysis

More information

Some Background Material

Some Background Material Chapter 1 Some Background Material In the first chapter, we present a quick review of elementary - but important - material as a way of dipping our toes in the water. This chapter also introduces important

More information

for all subintervals I J. If the same is true for the dyadic subintervals I D J only, we will write ϕ BMO d (J). In fact, the following is true

for all subintervals I J. If the same is true for the dyadic subintervals I D J only, we will write ϕ BMO d (J). In fact, the following is true 3 ohn Nirenberg inequality, Part I A function ϕ L () belongs to the space BMO() if sup ϕ(s) ϕ I I I < for all subintervals I If the same is true for the dyadic subintervals I D only, we will write ϕ BMO

More information

On the constant in the reverse Brunn-Minkowski inequality for p-convex balls.

On the constant in the reverse Brunn-Minkowski inequality for p-convex balls. On the constant in the reverse Brunn-Minkowski inequality for p-convex balls. A.E. Litvak Abstract This note is devoted to the study of the dependence on p of the constant in the reverse Brunn-Minkowski

More information

2. Function spaces and approximation

2. Function spaces and approximation 2.1 2. Function spaces and approximation 2.1. The space of test functions. Notation and prerequisites are collected in Appendix A. Let Ω be an open subset of R n. The space C0 (Ω), consisting of the C

More information

Geometry of log-concave Ensembles of random matrices

Geometry of log-concave Ensembles of random matrices Geometry of log-concave Ensembles of random matrices Nicole Tomczak-Jaegermann Joint work with Radosław Adamczak, Rafał Latała, Alexander Litvak, Alain Pajor Cortona, June 2011 Nicole Tomczak-Jaegermann

More information

D I S C U S S I O N P A P E R

D I S C U S S I O N P A P E R I N S T I T U T D E S T A T I S T I Q U E B I O S T A T I S T I Q U E E T S C I E N C E S A C T U A R I E L L E S ( I S B A ) UNIVERSITÉ CATHOLIQUE DE LOUVAIN D I S C U S S I O N P A P E R 2014/06 Adaptive

More information

Construction of a general measure structure

Construction of a general measure structure Chapter 4 Construction of a general measure structure We turn to the development of general measure theory. The ingredients are a set describing the universe of points, a class of measurable subsets along

More information

P-adic Functions - Part 1

P-adic Functions - Part 1 P-adic Functions - Part 1 Nicolae Ciocan 22.11.2011 1 Locally constant functions Motivation: Another big difference between p-adic analysis and real analysis is the existence of nontrivial locally constant

More information

Isomorphisms between pattern classes

Isomorphisms between pattern classes Journal of Combinatorics olume 0, Number 0, 1 8, 0000 Isomorphisms between pattern classes M. H. Albert, M. D. Atkinson and Anders Claesson Isomorphisms φ : A B between pattern classes are considered.

More information

Economics 204 Summer/Fall 2011 Lecture 5 Friday July 29, 2011

Economics 204 Summer/Fall 2011 Lecture 5 Friday July 29, 2011 Economics 204 Summer/Fall 2011 Lecture 5 Friday July 29, 2011 Section 2.6 (cont.) Properties of Real Functions Here we first study properties of functions from R to R, making use of the additional structure

More information

Spanning and Independence Properties of Finite Frames

Spanning and Independence Properties of Finite Frames Chapter 1 Spanning and Independence Properties of Finite Frames Peter G. Casazza and Darrin Speegle Abstract The fundamental notion of frame theory is redundancy. It is this property which makes frames

More information

Integral Jensen inequality

Integral Jensen inequality Integral Jensen inequality Let us consider a convex set R d, and a convex function f : (, + ]. For any x,..., x n and λ,..., λ n with n λ i =, we have () f( n λ ix i ) n λ if(x i ). For a R d, let δ a

More information

Analysis-3 lecture schemes

Analysis-3 lecture schemes Analysis-3 lecture schemes (with Homeworks) 1 Csörgő István November, 2015 1 A jegyzet az ELTE Informatikai Kar 2015. évi Jegyzetpályázatának támogatásával készült Contents 1. Lesson 1 4 1.1. The Space

More information

Irredundant Families of Subcubes

Irredundant Families of Subcubes Irredundant Families of Subcubes David Ellis January 2010 Abstract We consider the problem of finding the maximum possible size of a family of -dimensional subcubes of the n-cube {0, 1} n, none of which

More information

Metric Space Topology (Spring 2016) Selected Homework Solutions. HW1 Q1.2. Suppose that d is a metric on a set X. Prove that the inequality d(x, y)

Metric Space Topology (Spring 2016) Selected Homework Solutions. HW1 Q1.2. Suppose that d is a metric on a set X. Prove that the inequality d(x, y) Metric Space Topology (Spring 2016) Selected Homework Solutions HW1 Q1.2. Suppose that d is a metric on a set X. Prove that the inequality d(x, y) d(z, w) d(x, z) + d(y, w) holds for all w, x, y, z X.

More information

High-dimensional distributions with convexity properties

High-dimensional distributions with convexity properties High-dimensional distributions with convexity properties Bo az Klartag Tel-Aviv University A conference in honor of Charles Fefferman, Princeton, May 2009 High-Dimensional Distributions We are concerned

More information

A NEW SET THEORY FOR ANALYSIS

A NEW SET THEORY FOR ANALYSIS Article A NEW SET THEORY FOR ANALYSIS Juan Pablo Ramírez 0000-0002-4912-2952 Abstract: We present the real number system as a generalization of the natural numbers. First, we prove the co-finite topology,

More information

In N we can do addition, but in order to do subtraction we need to extend N to the integers

In N we can do addition, but in order to do subtraction we need to extend N to the integers Chapter The Real Numbers.. Some Preliminaries Discussion: The Irrationality of 2. We begin with the natural numbers N = {, 2, 3, }. In N we can do addition, but in order to do subtraction we need to extend

More information

COMPLETELY INVARIANT JULIA SETS OF POLYNOMIAL SEMIGROUPS

COMPLETELY INVARIANT JULIA SETS OF POLYNOMIAL SEMIGROUPS Series Logo Volume 00, Number 00, Xxxx 19xx COMPLETELY INVARIANT JULIA SETS OF POLYNOMIAL SEMIGROUPS RICH STANKEWITZ Abstract. Let G be a semigroup of rational functions of degree at least two, under composition

More information

Maths 212: Homework Solutions

Maths 212: Homework Solutions Maths 212: Homework Solutions 1. The definition of A ensures that x π for all x A, so π is an upper bound of A. To show it is the least upper bound, suppose x < π and consider two cases. If x < 1, then

More information

Mathematical Methods for Physics and Engineering

Mathematical Methods for Physics and Engineering Mathematical Methods for Physics and Engineering Lecture notes for PDEs Sergei V. Shabanov Department of Mathematics, University of Florida, Gainesville, FL 32611 USA CHAPTER 1 The integration theory

More information

Indeed, if we want m to be compatible with taking limits, it should be countably additive, meaning that ( )

Indeed, if we want m to be compatible with taking limits, it should be countably additive, meaning that ( ) Lebesgue Measure The idea of the Lebesgue integral is to first define a measure on subsets of R. That is, we wish to assign a number m(s to each subset S of R, representing the total length that S takes

More information

Topological properties

Topological properties CHAPTER 4 Topological properties 1. Connectedness Definitions and examples Basic properties Connected components Connected versus path connected, again 2. Compactness Definition and first examples Topological

More information

On the minimum of several random variables

On the minimum of several random variables On the minimum of several random variables Yehoram Gordon Alexander Litvak Carsten Schütt Elisabeth Werner Abstract For a given sequence of real numbers a,..., a n we denote the k-th smallest one by k-

More information

Statistical inference on Lévy processes

Statistical inference on Lévy processes Alberto Coca Cabrero University of Cambridge - CCA Supervisors: Dr. Richard Nickl and Professor L.C.G.Rogers Funded by Fundación Mutua Madrileña and EPSRC MASDOC/CCA student workshop 2013 26th March Outline

More information

A VERY BRIEF REVIEW OF MEASURE THEORY

A VERY BRIEF REVIEW OF MEASURE THEORY A VERY BRIEF REVIEW OF MEASURE THEORY A brief philosophical discussion. Measure theory, as much as any branch of mathematics, is an area where it is important to be acquainted with the basic notions and

More information

REAL AND COMPLEX ANALYSIS

REAL AND COMPLEX ANALYSIS REAL AND COMPLE ANALYSIS Third Edition Walter Rudin Professor of Mathematics University of Wisconsin, Madison Version 1.1 No rights reserved. Any part of this work can be reproduced or transmitted in any

More information

THE SMALLEST SINGULAR VALUE OF A RANDOM RECTANGULAR MATRIX

THE SMALLEST SINGULAR VALUE OF A RANDOM RECTANGULAR MATRIX THE SMALLEST SINGULAR VALUE OF A RANDOM RECTANGULAR MATRIX MARK RUDELSON AND ROMAN VERSHYNIN Abstract. We prove an optimal estimate on the smallest singular value of a random subgaussian matrix, valid

More information

Geometric intuition: from Hölder spaces to the Calderón-Zygmund estimate

Geometric intuition: from Hölder spaces to the Calderón-Zygmund estimate Geometric intuition: from Hölder spaces to the Calderón-Zygmund estimate A survey of Lihe Wang s paper Michael Snarski December 5, 22 Contents Hölder spaces. Control on functions......................................2

More information

Introduction to Real Analysis Alternative Chapter 1

Introduction to Real Analysis Alternative Chapter 1 Christopher Heil Introduction to Real Analysis Alternative Chapter 1 A Primer on Norms and Banach Spaces Last Updated: March 10, 2018 c 2018 by Christopher Heil Chapter 1 A Primer on Norms and Banach Spaces

More information

Exact Bounds on Sample Variance of Interval Data

Exact Bounds on Sample Variance of Interval Data University of Texas at El Paso DigitalCommons@UTEP Departmental Technical Reports (CS) Department of Computer Science 3-1-2002 Exact Bounds on Sample Variance of Interval Data Scott Ferson Lev Ginzburg

More information

Generalization theory

Generalization theory Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1

More information

Sampling and high-dimensional convex geometry

Sampling and high-dimensional convex geometry Sampling and high-dimensional convex geometry Roman Vershynin SampTA 2013 Bremen, Germany, June 2013 Geometry of sampling problems Signals live in high dimensions; sampling is often random. Geometry in

More information

NEWMAN S INEQUALITY FOR INCREASING EXPONENTIAL SUMS

NEWMAN S INEQUALITY FOR INCREASING EXPONENTIAL SUMS NEWMAN S INEQUALITY FOR INCREASING EXPONENTIAL SUMS Tamás Erdélyi Dedicated to the memory of George G Lorentz Abstract Let Λ n := {λ 0 < λ < < λ n } be a set of real numbers The collection of all linear

More information

Implications of the Constant Rank Constraint Qualification

Implications of the Constant Rank Constraint Qualification Mathematical Programming manuscript No. (will be inserted by the editor) Implications of the Constant Rank Constraint Qualification Shu Lu Received: date / Accepted: date Abstract This paper investigates

More information

High Dimensional Probability

High Dimensional Probability High Dimensional Probability for Mathematicians and Data Scientists Roman Vershynin 1 1 University of Michigan. Webpage: www.umich.edu/~romanv ii Preface Who is this book for? This is a textbook in probability

More information

Laplace s Equation. Chapter Mean Value Formulas

Laplace s Equation. Chapter Mean Value Formulas Chapter 1 Laplace s Equation Let be an open set in R n. A function u C 2 () is called harmonic in if it satisfies Laplace s equation n (1.1) u := D ii u = 0 in. i=1 A function u C 2 () is called subharmonic

More information

A Lower Bound Theorem. Lin Hu.

A Lower Bound Theorem. Lin Hu. American J. of Mathematics and Sciences Vol. 3, No -1,(January 014) Copyright Mind Reader Publications ISSN No: 50-310 A Lower Bound Theorem Department of Applied Mathematics, Beijing University of Technology,

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

The 123 Theorem and its extensions

The 123 Theorem and its extensions The 123 Theorem and its extensions Noga Alon and Raphael Yuster Department of Mathematics Raymond and Beverly Sackler Faculty of Exact Sciences Tel Aviv University, Tel Aviv, Israel Abstract It is shown

More information

3 Integration and Expectation

3 Integration and Expectation 3 Integration and Expectation 3.1 Construction of the Lebesgue Integral Let (, F, µ) be a measure space (not necessarily a probability space). Our objective will be to define the Lebesgue integral R fdµ

More information

14.1 Finding frequent elements in stream

14.1 Finding frequent elements in stream Chapter 14 Streaming Data Model 14.1 Finding frequent elements in stream A very useful statistics for many applications is to keep track of elements that occur more frequently. It can come in many flavours

More information

Topology. Xiaolong Han. Department of Mathematics, California State University, Northridge, CA 91330, USA address:

Topology. Xiaolong Han. Department of Mathematics, California State University, Northridge, CA 91330, USA  address: Topology Xiaolong Han Department of Mathematics, California State University, Northridge, CA 91330, USA E-mail address: Xiaolong.Han@csun.edu Remark. You are entitled to a reward of 1 point toward a homework

More information

Comparison of Orlicz Lorentz Spaces

Comparison of Orlicz Lorentz Spaces Comparison of Orlicz Lorentz Spaces S.J. Montgomery-Smith* Department of Mathematics, University of Missouri, Columbia, MO 65211. I dedicate this paper to my Mother and Father, who as well as introducing

More information

Chapter 1 The Real Numbers

Chapter 1 The Real Numbers Chapter 1 The Real Numbers In a beginning course in calculus, the emphasis is on introducing the techniques of the subject;i.e., differentiation and integration and their applications. An advanced calculus

More information

l 1 -Regularized Linear Regression: Persistence and Oracle Inequalities

l 1 -Regularized Linear Regression: Persistence and Oracle Inequalities l -Regularized Linear Regression: Persistence and Oracle Inequalities Peter Bartlett EECS and Statistics UC Berkeley slides at http://www.stat.berkeley.edu/ bartlett Joint work with Shahar Mendelson and

More information

MATH 51H Section 4. October 16, Recall what it means for a function between metric spaces to be continuous:

MATH 51H Section 4. October 16, Recall what it means for a function between metric spaces to be continuous: MATH 51H Section 4 October 16, 2015 1 Continuity Recall what it means for a function between metric spaces to be continuous: Definition. Let (X, d X ), (Y, d Y ) be metric spaces. A function f : X Y is

More information

The sample complexity of agnostic learning with deterministic labels

The sample complexity of agnostic learning with deterministic labels The sample complexity of agnostic learning with deterministic labels Shai Ben-David Cheriton School of Computer Science University of Waterloo Waterloo, ON, N2L 3G CANADA shai@uwaterloo.ca Ruth Urner College

More information

Class 2 & 3 Overfitting & Regularization

Class 2 & 3 Overfitting & Regularization Class 2 & 3 Overfitting & Regularization Carlo Ciliberto Department of Computer Science, UCL October 18, 2017 Last Class The goal of Statistical Learning Theory is to find a good estimator f n : X Y, approximating

More information

Lecture 6: September 19

Lecture 6: September 19 36-755: Advanced Statistical Theory I Fall 2016 Lecture 6: September 19 Lecturer: Alessandro Rinaldo Scribe: YJ Choe Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have

More information

FORMULATION OF THE LEARNING PROBLEM

FORMULATION OF THE LEARNING PROBLEM FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we

More information