Minimax rate of convergence and the performance of ERM in phase recovery


Minimax rate of convergence and the performance of ERM in phase recovery

Guillaume Lecué (1,3)    Shahar Mendelson (2,4,5)

November 2013

Abstract

We study the performance of Empirical Risk Minimization in noisy phase retrieval problems, indexed by subsets of R^n and relative to subgaussian sampling; that is, when the given data is y_i = ⟨a_i, x_0⟩² + w_i for a subgaussian random vector a, independent noise w and a fixed but unknown x_0 that belongs to a given subset of R^n. We show that ERM produces x̂ whose Euclidean distance to either x_0 or −x_0 depends on the gaussian mean-width of the indexing set and on the signal-to-noise ratio of the problem. The bound coincides with the one for linear regression when ‖x_0‖_2 is of the order of a constant. In addition, we obtain a minimax lower bound for the problem and identify sets for which ERM is a minimax procedure. As examples, we study the class of d-sparse vectors in R^n and the unit ball in ℓ_1^n.

1 Introduction

Phase retrieval has attracted much attention recently, as it has natural applications in areas that include X-ray crystallography, transmission electron microscopy and coherent diffractive imaging (see, for example, the discussion in [2] and references therein). In phase retrieval, one attempts to identify a vector x_0 that belongs to an arbitrary set T ⊂ R^n using noisy, quadratic measurements of x_0. The

(1) CNRS, CMAP, Ecole Polytechnique, Palaiseau, France.
(2) Department of Mathematics, Technion, I.I.T, Haifa 32000, Israel.
(3) guillaume.lecue@cmap.polytechnique.fr
(4) shahar@tx.technion.ac.il
(5) Supported by the Mathematical Sciences Institute, The Australian National University.

given data is a sample of cardinality N, (a_i, y_i)_{i=1}^N, for vectors a_i ∈ R^n and

y_i = ⟨a_i, x_0⟩² + w_i,    (1.1)

for a noise vector (w_i)_{i=1}^N.

Our aim is to investigate phase retrieval from a theoretical point of view, relative to a well behaved, random sampling method. To formulate the problem explicitly, let µ be an isotropic, L-subgaussian measure on R^n and set a to be a random vector distributed according to µ. Thus, for every x ∈ R^n, E⟨x, a⟩² = ‖x‖_2² (isotropicity), and for every u ≥ 1, Pr(|⟨x, a⟩| ≥ Lu‖⟨x, a⟩‖_{L_2}) ≤ 2 exp(−u²/2) (L-subgaussian).

Given a set T ⊂ R^n and a fixed, but unknown x_0 ∈ T, the y_i are the random noisy measurements of x_0: for a sample size N, (a_i)_{i=1}^N are independent copies of a and (w_i)_{i=1}^N are independent copies of a mean-zero variable w that are also independent of (a_i)_{i=1}^N.

Clearly, due to the nature of the given measurements, x_0 and −x_0 are indistinguishable, and the best that one can hope for is a procedure that produces x̂ ∈ T that is close to one of the two points. The goal here is to find such a procedure and identify the way in which the distance between x̂ and either x_0 or −x_0 depends on the structure of T, the measure µ and the noise.

The procedure studied here is empirical risk minimization (ERM), which produces x̂ that minimizes the empirical risk in T:

P_N l_x = (1/N) Σ_{i=1}^N (⟨a_i, x⟩² − y_i)².

The loss is the standard squared loss functional, which, in this case, satisfies

l_x(a, y) = (f_x(a) − y)² = (⟨x, a⟩² − ⟨x_0, a⟩² − w)² = (⟨x − x_0, a⟩⟨x + x_0, a⟩ − w)².

Comparing the empirical and actual structures on T is a vital component in the analysis of ERM. In phase recovery, the centered empirical process that is at the heart of this approach is defined for any x ∈ T by

P_N(l_x − l_{x_0}) = (1/N) Σ_{i=1}^N ⟨x − x_0, a_i⟩²⟨x + x_0, a_i⟩² − (2/N) Σ_{i=1}^N w_i ⟨x − x_0, a_i⟩⟨x + x_0, a_i⟩.

Both the first and second components are difficult to handle directly, even when the underlying measure is subgaussian, because of the powers involved (an effective power of 4 in the first component and of 3 in the second one).
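To make the setup concrete, here is a minimal numerical sketch (my own illustration, not taken from the paper) of ERM for noisy phase retrieval over d-sparse vectors: gaussian measurement vectors stand in for the subgaussian a_i, and since exact ERM is a nonconvex problem, a thresholded gradient heuristic with a spectral initialization is used as a stand-in minimizer. All names, step sizes and constants below are illustrative assumptions.

```python
import numpy as np

def erm_phase_retrieval_sparse(A, y, d, steps=500):
    """Heuristic minimizer of the empirical squared risk
    (1/N) * sum_i (<a_i, x>^2 - y_i)^2 over d-sparse vectors.
    Exact ERM is nonconvex; this projected-gradient sketch is only illustrative."""
    N, n = A.shape
    # spectral-type initialization: top eigenvector of (1/N) sum_i y_i a_i a_i^T
    M = (A * y[:, None]).T @ A / N
    _, V = np.linalg.eigh(M)
    x = V[:, -1] * np.sqrt(max(np.mean(y), 0.0))
    lr = 0.025 / max(np.mean(y), 1e-12)           # conservative step, scaled by the signal energy
    for _ in range(steps):
        r = (A @ x) ** 2 - y                       # residuals of the quadratic measurements
        grad = 4.0 / N * A.T @ (r * (A @ x))       # gradient of the empirical squared risk
        x = x - lr * grad
        x[np.argsort(np.abs(x))[:-d]] = 0.0        # projection: keep the d largest coordinates
    return x

rng = np.random.default_rng(0)
n, d, N, sigma = 200, 5, 800, 0.1
x0 = np.zeros(n); x0[:d] = rng.normal(size=d)
A = rng.normal(size=(N, n))                        # isotropic subgaussian (here: gaussian) a_i
y = (A @ x0) ** 2 + sigma * rng.normal(size=N)     # y_i = <a_i, x_0>^2 + w_i
xhat = erm_phase_retrieval_sparse(A, y, d)
print(min(np.linalg.norm(xhat - x0), np.linalg.norm(xhat + x0)))
```

Note that the procedure can only hope to approach x_0 or −x_0, which is why the reported error is the minimum of the two distances.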

Therefore, rather than trying to employ the concentration of empirical means around the actual ones, which might not be sufficiently strong in this case, one uses a combination of a small-ball estimate for the high order part of the process, and a more standard deviation argument for the low-order component (see Section 3 and the formulation of Theorem A and Theorem B).

We assume that linear forms satisfy a certain small-ball estimate, and in particular, do not assign too much weight to small neighbourhoods of 0.

Assumption 1.1. There is a constant κ_0 > 0 satisfying that for every s, t ∈ R^n,

E|⟨a, s⟩⟨a, t⟩| ≥ κ_0 ‖s‖_2 ‖t‖_2.

Assumption 1.1 is not very restrictive and holds for many natural choices of random vectors in R^n, like the gaussian measure or any isotropic log-concave measure on R^n (see, for example, the discussion in [2]).

It is not surprising that the error rate of ERM depends on the structure of T, and because of the subgaussian nature of the random measurement vector a, the natural parameter that captures the complexity of T is the gaussian mean-width associated with normalizations of T.

Definition 1.1. Let G = (g_1, ..., g_n) be the standard gaussian vector in R^n. For T ⊂ R^n, set

ℓ(T) = E sup_{t ∈ T} Σ_{i=1}^n g_i t_i.

The normalized sets in question are

T_{−,R} = { (t − s)/‖t − s‖_2 : t, s ∈ T, R < ‖t − s‖_2 ‖t + s‖_2 },
T_{+,R} = { (t + s)/‖t + s‖_2 : t, s ∈ T, R < ‖t − s‖_2 ‖t + s‖_2 },

which have been used in [2], or their local versions,

T_{−,R}(x_0) = { (t − x_0)/‖t − x_0‖_2 : t ∈ T, R < ‖t − x_0‖_2 ‖t + x_0‖_2 },
T_{+,R}(x_0) = { (t + x_0)/‖t + x_0‖_2 : t ∈ T, R < ‖t − x_0‖_2 ‖t + x_0‖_2 }.
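The gaussian mean-width ℓ(T) of Definition 1.1 is easy to estimate by Monte Carlo: draw standard gaussian vectors G and average sup_{t∈T} ⟨G, t⟩. The sketch below is my own illustration for the set of d-sparse unit vectors, where the supremum has a closed form (the Euclidean norm of the d largest coordinates of |G|); the comparison with √(d log(en/d)) is only an order-of-magnitude check, since that estimate holds up to an absolute constant.

```python
import numpy as np

def mean_width_sparse_sphere(n, d, trials=2000, seed=0):
    """Monte Carlo estimate of l(T) = E sup_{t in T} <G, t> for
    T = {d-sparse vectors of Euclidean norm 1}: the supremum over T equals
    the l2-norm of the d largest (in absolute value) coordinates of G."""
    rng = np.random.default_rng(seed)
    G = np.abs(rng.normal(size=(trials, n)))
    topd = np.sort(G, axis=1)[:, -d:]            # d largest coordinates of |G|
    return np.mean(np.linalg.norm(topd, axis=1))

n, d = 10_000, 20
est = mean_width_sparse_sphere(n, d)
print(est, np.sqrt(d * np.log(np.e * n / d)))    # same order as sqrt(d log(en/d)), up to an absolute constant
```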

The sets in question play a central role in the exclusion argument that is used in the analysis of ERM. Setting L_x = l_x − l_{x_0}, the excess loss function associated with l and x ∈ T, it is evident that P_N L_{x̂} ≤ 0 (because L_{x_0} = 0 is a possible competitor). If one can find an event of large probability and R > 0 for which P_N L_x > 0 whenever ‖x − x_0‖_2 ‖x + x_0‖_2 ≥ R, then on that event, ‖x̂ − x_0‖_2 ‖x̂ + x_0‖_2 ≤ R, which is the estimate one is looking for.

The normalization allows one to study relative fluctuations of P_N L_x, in particular, the way these fluctuations scale with ‖x − x_0‖_2 ‖x + x_0‖_2. This is achieved by considering empirical means of products of functions ⟨u, ·⟩⟨v, ·⟩, for u ∈ T_{+,R}(x_0) and v ∈ T_{−,R}(x_0).

The obvious problem with the local sets T_{+,R}(x_0) and T_{−,R}(x_0) is that x_0 is not known. As a first attempt at bypassing this problem, one may use the global sets T_{+,R} and T_{−,R} instead, as had been done in [2]. Unfortunately, this global approach is not completely satisfactory. Roughly put, there are two types of subsets of R^n one is interested in, and that appear in applications.

The first type consists of sets for which the local complexity is essentially the same everywhere, and the sets T_{+,R}, T_{−,R} are not very different from the seemingly smaller T_{+,R}(x_0), T_{−,R}(x_0), regardless of x_0. A typical example of such a set is the set of d-sparse vectors, consisting of all the vectors in R^n that are supported on at most d coordinates. For every x_0 ∈ T and R > 0, the sets T_{+,R}(x_0), T_{−,R}(x_0), and T_{+,R}, T_{−,R} are contained in the subset of the sphere consisting of 2d-sparse vectors, which is a relatively small set. For this kind of set, the global approach, using T_{+,R} and T_{−,R}, suffices, and the choice of the target x_0 does not really influence the rate at which ‖x̂ − x_0‖_2 ‖x̂ + x_0‖_2 decays to 0 with N.

In contrast, sets of the second type one would like to study have vastly changing local complexity, with the typical example being a convex, centrally symmetric set (i.e., if x ∈ T then −x ∈ T). Consider, for example, the case T = B_1^n, the unit ball in ℓ_1^n. It is not surprising that for small R, the sets T_{+,R}(0) and T_{−,R}(0) are very different from T_{−,R}(e_1) and T_{+,R}(e_1): the ones associated with the centre 0 are the entire sphere, while for e_1 = (1, 0, ..., 0), T_{+,R}(e_1) and T_{−,R}(e_1) consist of vectors that are well approximated by sparse vectors (whose support depends on R), and thus are rather small subsets of the sphere.

The situation that one encounters in B_1^n is generic for convex, centrally-symmetric sets. The sets become locally richer the closer the centre is to 0, and at 0, for small enough R, T_{+,R}(0) and T_{−,R}(0) are the entire sphere. Since the sets T_{+,R} and T_{−,R} are blind to the location of the centre, and are, in fact, the union over all possible centres of the local sets, they are simply too big to be used in the analysis of ERM in convex sets. A correct estimate on the performance of ERM for such sets requires a more delicate

local analysis and additional information on x_0. Moreover, the rate of convergence of ERM truly depends on x_0 in the phase recovery problem, via the signal-to-noise ratio ‖x_0‖_2/σ.

We begin by formulating our results using the global sets T_{+,R} and T_{−,R}. Let T_+ = T_{+,0} and T_− = T_{−,0}, set

E_R = max{ℓ(T_{+,R}), ℓ(T_{−,R})},    E = max{ℓ(T_+), ℓ(T_−)},

and observe that, as nonempty subsets of the sphere, ℓ(T_{−,R}), ℓ(T_{+,R}) ≥ E|g| = √(2/π).

The first result presented here is that the error rate of ERM for the phase retrieval problem in T depends on the fixed points

r_N(γ) = inf{ r > 0 : E_r ≤ γ r √N }    and    r_0(Q) = inf{ r > 0 : E_r ≤ Q √N },

for constants γ and Q that will be specified later.

Recall that the ψ_2-norm of a random variable w is defined by ‖w‖_{ψ_2} = inf{c > 0 : E exp(w²/c²) ≤ 2}, and set σ = ‖w‖_{ψ_2}.

Theorem A. For every L > 1, κ_0 > 0 and β > 1, there exist constants c_0, c_1 and c_2 that depend only on L, κ_0 and β for which the following holds. Let a and w be as above and assume that w has a finite ψ_2-norm. If l is the squared loss and x̂ is produced by ERM, then with probability at least

1 − 2 exp(−c_0 min{ℓ²(T_{+,r}), ℓ²(T_{−,r})}) − N^{−β+1},

one has

‖x̂ − x_0‖_2 ‖x̂ + x_0‖_2 ≤ r := max{ r_0(c_1), r_N(c_2/(σ √(log N))) }.

When ‖w‖_∞ < ∞, the term σ√(log N) may be replaced by ‖w‖_∞.

The upper estimate of max{r_0, r_N} in Theorem A represents two ranges of noise. It follows from the definition of the fixed points that r_0 is dominant if σ ≲ r_0/√(log N). As explained in [7] for linear regression, r_0 captures the difficulty of recovery in the noise-free case, when the only reason for errors is that there are several far-away functions in the class that coincide with the target on the noiseless data. When the noise level σ surpasses that threshold, errors occur because of the interaction class members have with the noise, and the dominating term becomes r_N.

Of course, there are cases in which r_0 = 0 for N sufficiently large. This is precisely when exact recovery is possible in the noise-free environment. And, in such cases, the error of ERM tends to zero with σ. The behavior of ERM in the noise-free case is one of the distinguishing features of sets with well behaved global complexity, that is, when E is not too large.

Since E_R ≤ E for every R > 0, it follows that when N ≳ E², r_0 = 0, and that r_N(γ) ≤ E/(γ√N). Therefore, on the event from Theorem A,

‖x̂ − x_0‖_2 ‖x̂ + x_0‖_2 ≲ σ √(log N) · E/√N.

This estimate suffices for many applications. For example, when T is the set of d-sparse vectors, one may show (see, e.g., [2]) that E ≲ √(d log(en/d)). Hence, by Theorem A, when N ≳ d log(en/d), with high probability,

‖x̂ − x_0‖_2 ‖x̂ + x_0‖_2 ≲ σ √(d log(en/d)/N) · √(log N).

The proof of this observation regarding d-sparse vectors, and that this estimate is sharp in the minimax sense (up to the logarithmic term), may be found in Section 6.

One should note that Theorem A improves the main result from [2] in three ways. First of all, the error rate (the estimate on ‖x̂ − x_0‖_2 ‖x̂ + x_0‖_2) established in Theorem A is of order E/√N (up to logarithmic factors), whereas in [2], it scaled like c/N^{1/4} for very large values of N. Second, the error rate scales linearly in the noise level σ in Theorem A. On the other hand, the rate obtained in [2] does not decay with σ for σ ≤ 1. Finally, the probability estimate has been improved, though it is still likely to be suboptimal.

Although the main motivation for [2] was dealing with phase retrieval for sparse classes, for which Theorem A is well suited, we next turn to the question of more general classes, the most important example of which is a convex, centrally-symmetric class. For such a class, the global localization is simply too big to yield a good bound.

Definition 1.2. Let

r*_N(Q) = inf{ r > 0 : ℓ(T ∩ rB_2^n) ≤ Q r √N },

s*_N(η) = inf{ s > 0 : ℓ(T ∩ sB_2^n) ≤ η s² √N },

and

v*_N(ζ) = inf{ v > 0 : ℓ(T ∩ vB_2^n) ≤ ζ v³ √N }.

The parameters r*_N and s*_N have been used in [7] to obtain a sharp estimate on the performance of ERM for linear regression in an arbitrary convex set, and relative to L-subgaussian measurements. This result is formulated below in a restricted context, analogous to the phase retrieval setup: a linear model z = ⟨a, x_0⟩ + w, for an isotropic, L-subgaussian vector a, independent noise w and x_0 ∈ T. Let x̂ be the output of ERM using the data (a_i, z_i)_{i=1}^N, and set ‖w‖_{ψ_2} = σ.

Theorem 1.3. For every L there exist constants c_1, c_2, c_3 and c_4 that depend only on L for which the following holds. Let T ⊂ R^n be a convex set, put η = c_1/σ and set Q = c_2.

1. If σ ≥ c_3 r*_N(Q), then with probability at least 1 − 4 exp(−c_4 N η² (s*_N(η))²), ‖x̂ − x_0‖_2 ≤ s*_N(η).

2. If σ ≤ c_3 r*_N(Q), then with probability at least 1 − 4 exp(−c_4 N Q²), ‖x̂ − x_0‖_2 ≤ r*_N(Q).

Our main result is a phase retrieval version of Theorem 1.3.

Theorem B. For every L, κ_0 > 0 and β, there exist constants c_0, c_1, c_2, c_3, c_4, c_5 and Q that depend only on L, κ_0 and β for which the following holds. Let T ⊂ R^n be a convex, centrally-symmetric set, and let a and w be as in Theorem A. Assume that σ/‖x_0‖_2 ≥ c_0 r*_N(Q)/√(log N), set η = c_1 ‖x_0‖_2/(σ √(log N)) and let ζ = c_2/(σ √(log N)).

1. If ‖x_0‖_2 ≥ v*_N(ζ), then with probability at least 1 − 2 exp(−c_3 N η² (s*_N(η))²) − N^{−β+1},

min{ ‖x̂ − x_0‖_2, ‖x̂ + x_0‖_2 } ≤ c_4 s*_N(η).

2. If ‖x_0‖_2 ≤ v*_N(ζ), then with probability at least 1 − 2 exp(−c_3 N ζ² (v*_N(ζ))⁴) − N^{−β+1},

max{ ‖x̂ − x_0‖_2, ‖x̂ + x_0‖_2 } ≤ c_4 v*_N(ζ).

If σ/‖x_0‖_2 ≤ c_0 r*_N(Q)/√(log N), the same assertions as in 1. and 2. hold, with an upper bound of r*_N(Q) replacing s*_N(η) and v*_N(ζ).
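The fixed points r*_N, s*_N and v*_N are defined implicitly, but given a model of the localized mean-width r ↦ ℓ(T ∩ rB_2^n) they are easy to compute by bisection, since ℓ(T ∩ rB_2^n)/r^k is non-increasing in r for star-shaped T. The sketch below is my own illustration: ell_B1_cap encodes the standard order-of-magnitude estimate of ℓ(B_1^n ∩ rB_2^n) used in Section 6 with constants ignored, and fixed_point is a generic bisection solver; the sample values of Q, η and ζ are arbitrary.

```python
import numpy as np

def ell_B1_cap(r, n):
    """Model of l(B_1^n ∩ r B_2^n), up to absolute constants:
    ~ r*sqrt(n) when r <= 1/sqrt(n), ~ sqrt(log(e n r^2)) otherwise, capped at r = 1."""
    r = min(r, 1.0)
    return r * np.sqrt(n) if r <= 1.0 / np.sqrt(n) else np.sqrt(np.log(np.e * n * r ** 2))

def fixed_point(ell, N, power, coeff, r_hi=10.0, iters=80):
    """Bisection for inf{ r > 0 : ell(r) <= coeff * r**power * sqrt(N) }.
    Relies on ell(r)/r**power being non-increasing in r (true for star-shaped T)."""
    lo, hi = 0.0, r_hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if ell(mid) <= coeff * mid ** power * np.sqrt(N):
            hi = mid
        else:
            lo = mid
    return hi

n, N, sigma, norm_x0 = 10_000, 2_000, 0.5, 1.0
ell = lambda r: ell_B1_cap(r, n)
Q = 1.0
eta = norm_x0 / (sigma * np.sqrt(np.log(N)))
zeta = 1.0 / (sigma * np.sqrt(np.log(N)))
print("r*_N(Q)    =", fixed_point(ell, N, 1, Q))
print("s*_N(eta)  =", fixed_point(ell, N, 2, eta))
print("v*_N(zeta) =", fixed_point(ell, N, 3, zeta))
```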

Theorem B follows from Theorem A and a more transparent description of the localized sets T_{−,R}(x_0) and T_{+,R}(x_0) (see Lemma 4.1).

To put Theorem B in some perspective, observe that v*_N tends to zero. Indeed, since ℓ(T ∩ rB_2^n) ≤ ℓ(T), it follows that v*_N(ζ) ≤ (ℓ(T)/(ζ√N))^{1/3}. Hence, for the choice of ζ ~ 1/(σ√(log N)) as in Theorem B,

v*_N ≲ ( σ ℓ(T) √(log N) / √N )^{1/3},

which tends to zero when σ → 0 and when N → ∞. Therefore, if x_0 ≠ 0, the first part of Theorem B describes the long term behaviour of ERM. Also, and using the same argument, r*_N(Q) ≤ ℓ(T)/(Q√N). Thus, for every σ > 0 the problem becomes a high-noise one, in the sense that the condition σ/‖x_0‖_2 ≥ c_0 r*_N(Q)/√(log N) is satisfied when N is large enough.

In the typical situation, which is both high noise and large ‖x_0‖_2, the error rate depends on η = c_1 ‖x_0‖_2/(σ√(log N)). We believe that the 1/√(log N) factor is an artifact of the proof, but the other term, ‖x_0‖_2/σ, is the signal-to-noise ratio, and is rather natural.

Although Theorem A and Theorem B clearly improve the results from [2], it is natural to ask whether these are optimal in a more general sense. The final result presented here is that Theorem B is close to being optimal in the minimax sense. The formulation and proof of the minimax lower bound is presented in Section 5.

Finally, we end the article with two examples of classes that are of interest in phase retrieval: d-sparse vectors and the unit ball in ℓ_1^n. The first is a class with a fixed local complexity, and the second has a growing local complexity.

2 Preliminaries

Throughout this article, absolute constants are denoted by C, c, c_1, ... etc. Their value may change from line to line. The fact that there are absolute constants c, C for which ca ≤ b ≤ Ca is denoted by a ≍ b; a ≲ b means that a ≤ cb, while a ≲_L b means that the constants depend only on the parameter L.

For 1 ≤ p < ∞, let ‖·‖_p be the ℓ_p^n norm on R^n, and for a function f (or a random variable X) on a probability space, set ‖f‖_{L_p} to be its L_p norm. Other norms that play a significant role here are the Orlicz norms. For basic facts on these norms we refer the reader to [9, 5]. Recall that for α ≥ 1,

‖f‖_{ψ_α} = inf{ c > 0 : E exp(|f|^α/c^α) ≤ 2 },

and it is straightforward to extend the definition for 0 < α < 1.

Orlicz norms measure the rate of decay of a function. One may verify that ‖f‖_{ψ_α} ≍ sup_{p ≥ 1} ‖f‖_{L_p}/p^{1/α}. Moreover, for t ≥ 1, Pr(|f| ≥ t) ≤ 2 exp(−c t^α/‖f‖_{ψ_α}^α), and ‖f‖_{ψ_α} is equivalent to the smallest constant κ for which Pr(|f| ≥ t) ≤ 2 exp(−t^α/κ^α) for every t ≥ 1.

Definition 2.1. A random variable X is L-subgaussian if it has a bounded ψ_2 norm and ‖X‖_{ψ_2} ≤ L‖X‖_{L_2}.

Observe that for L-subgaussian random variables, all the L_p norms are equivalent and their tails exhibit a faster decay than the corresponding gaussian. Indeed, if X is L-subgaussian, then for every p ≥ 2 and t ≥ 1,

‖X‖_{L_p} ≤ √p ‖X‖_{ψ_2} ≤ L √p ‖X‖_{L_2}    and    Pr(|X| > t) ≤ 2 exp(−c t²/‖X‖_{ψ_2}²) ≤ 2 exp(−c t²/(L‖X‖_{L_2})²)

for a suitable absolute constant c.

It is standard to verify that for every f, g, ‖fg‖_{ψ_1} ≤ ‖f‖_{ψ_2}‖g‖_{ψ_2}, and that if X_1, ..., X_N are independent copies of X and α ≥ 1, then

‖max_{i ≤ N} |X_i|‖_{ψ_α} ≲ ‖X‖_{ψ_α} log^{1/α} N.    (2.1)

An additional feature of ψ_α random variables is concentration, namely that if (X_i)_{i=1}^N are independent copies of a ψ_α random variable X, then N^{−1} Σ_{i=1}^N X_i concentrates around EX. One example of such a concentration result is the following Bernstein-type inequality (see, e.g., [5]).

Theorem 2.2. There exists an absolute constant c_0 for which the following holds. If X_1, ..., X_N are independent copies of a ψ_1 random variable X, then for every t > 0,

Pr( | N^{−1} Σ_{i=1}^N X_i − EX | > t ‖X‖_{ψ_1} ) ≤ 2 exp(−c_0 N min{t², t}).
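A quick numerical illustration of Theorem 2.2 (my own sketch, with an arbitrary illustrative constant c_0 = 1/4 and the ψ_1-norm of the example variable treated as 1): the product of two independent standard gaussians is a ψ_1 random variable, and the tails of its empirical mean exhibit the two regimes exp(−cNt²) for small t and exp(−cNt) for large t.

```python
import numpy as np

rng = np.random.default_rng(1)
N, reps = 200, 20000
# X = g1 * g2 is a psi_1 (subexponential) random variable with EX = 0 and psi_1-norm of order 1
X = rng.normal(size=(reps, N)) * rng.normal(size=(reps, N))
means = X.mean(axis=1)
for t in (0.1, 0.3, 1.0, 2.0):
    emp = np.mean(np.abs(means) > t)
    bound = 2 * np.exp(-0.25 * N * min(t * t, t))   # Bernstein-type bound with illustrative c0 = 1/4
    print(f"t={t:>4}: empirical tail {emp:.2e}   bound {bound:.2e}")
```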

10 One important example of a probability space considered here is the discrete space Ω = {,..., }, endowed with the uniform probability measure. Functions on Ω can be viewed as vectors in R and the corresponding L p and ψ α norms are denoted by L p and ψ α. A significant part of the proofs of Theorem A has to do with the behaviour of a monotone non-increasing rearrangement of vectors. Given v R, let vi ) i= be a non-increasing rearrangement of v i ) i=. It turns out that the ψα norm captures information on the coordinates of vi ) i=. Lemma.3 For every α there exist constants c and c that depend only on α for which the following holds. For every v R, vi c sup i log /α e/i) v ψα c sup i v i log /α e/i). Proof. We will prove the claim only for α = as the other cases follow an identical path. Let v R and denote by P r the uniform probability measure on Ω = {,..., }. By the tail characterization of the ψ norm, {j : v j > t} = P r v > t) exp ct / v ψ ). Hence, for t i = c / v ψ loge/i), {j : vj > t i } i/e i, and for every i, v i t i. Therefore, sup i as claimed. In the reverse direction, consider v i loge/i) c / v ψ, B = { β > 0 : i, v ψ βv i / loge/i) }. It is enough to show that B is bounded by a constant that is independent of v. To that end, fix β B and without loss of generality, assume that β >. Set B = sup i βv i / loge/i) and since β B, v ψ B. Also, since /β <, i= ) /β + i ) /β dx /β x /β. 0

11 Therefore, expvi /B ) = i= i= e i expvi ) /B ) i= expβ loge/i)) i= ) /β e) /β /β /β e/β /β <, provided that β c. Thus, if β c, v ψ showing that B is bounded by c. < B which is a contradiction,. Empirical and Subgaussian processes The sampling method used here is isotropic and L-subgaussian, meaning that the vectors a i ) i= are independent and distributed according to a probability measure µ on R n that is both isotropic and L-subgaussian [5]: Definition.4 Let µ be a probability measure on R n and let a be distributed according to µ. The measure µ is isotropic if for every t R n, E a, t = t. It is L-subgaussian if for every t Rn and every u, P r a, t Lu t, a ) exp u /). Given T R n, let d T = sup t T t and put k T ) = lt )/d T ). The latter appears naturally in the context of Dvoretzky type theorems, and in particular, in Milman s proof of Dvoretzky s Theorem see, e.g., []). Theorem.5 [0] For every L there exist constants c and c that depend only on L and for which the following holds. For every u c, with probability at least exp c u k T )), for every t T and every I {,..., }, ) / ) t, ai Lu 3 lt ) + d T I loge/ I ). i I For every integer, let j T be the largest integer j in {,..., } for which lt ) d T j loge/j). It follows from Theorem.5 that if t T and I j T, i I t, ai ) / L,u lt ),

12 and if I j T, i I t, ai ) / L,u d T I loge/ I ). Therefore, if v = t, a i ) i= and vi ) i= is a monotone non-increasing rearrangement of v i ) i=, then / if i j T vi i i vj ) j= L,u lt ) i d T loge/i) otherwise..) This observation will be used extensively in what follows. The next fact deals with product processes. Theorem.6 [0] There exist absolute constants c 0, c and c for which the following holds. If T, T R n, j and u c 0, then with probability at least exp c u j ), sup ai, t a i, s E a, t a, s t T, s T i= c L u lt )lt ) + u )) lt )dt ) + lt )dt ) + j/ dt )dt ) and sup a i, t a i, s E a, t a, s t T, s T i= c L u lt )lt ) + u )) lt )dt ) + lt )dt ) + j/ dt )dt ). Remark.7 Let ε i ) i= be independent, symmetric, {, }-valued random variables. It follows from the results in [0] that with the same probability estimate in Theorem.6 and relative to the product measure ε X), sup ε t T, s T i ai, t a i, s i= L u lt )lt ) + u )) lt )dt ) + lt )dt ) + j/ dt )dt ).

13 Assume that k T )) / = lt )/dt ) lt )/dt ). Setting j/ = lt )/dt ), Theorem.6 and Remark.7 yield that with probability at least exp c u k T )), sup ai, t a i, s E a, t a, s L u lt ) lt ) + u ) dt ), t T, s T sup t T, s T and sup i= a i, t a i, s E a, t a, s L u lt ) lt ) + u ) dt ) i= t T, s T ε i ai, t a i, s L u lt ) lt ) + u ) dt )..3) i= One case which is of particular interest is when T = T = T, and then, with probability at least exp c u k T )), sup t T ai, t E a, t L u l T ) + u dt )lt )). i=. Monotone rearrangement of coordinates The first goal of this section is to investigate the coordinate structure of v R m, given information on its norm in various L m p and ψα m spaces. The vectors we will be interested in are of the form a i, t ) i= for t T, and for which, thanks to the results presented in Section., one has the necessary information at hand. It is standard to verify that if v ψ m α A, then v p p,α A m /p. Thus, v L m p p v ψ m α. It turns out that if the two norms are equivalent, v is regular in some sense. The next lemma, which is a version of the Paley-Zygmund Inequality, see, e.g. [4]), describes the regularity properties needed here in the case p = α =. Lemma.8 For every β > there exist constants c and c that depend only on β and for which the following holds. If v ψ m β v L m, there exists I {,..., m} of cardinality at least c m, and for every i I, v i c v L m. 3

14 Proof. Recall that v ψ m sup i m vi / logem/i). Hence, for every j m, j vl v ψ m l= j logem/l) β v L m j logem/j). l= Therefore, m v L m = m v l = vl + l j l= m l=j+ Setting c β) /β logeβ)) and j = c β)m, v l c 0β v L m j logem/j) + c 0 β v L m j logem/j) m/) v L m. Thus, m l=j+ v l m/) v L m, while v j+ j + l j+ v l β logeβ) v L m. m l=j+ Let I be the set of the m j smallest coordinates of v. Fix η > 0 to be named later, put I η I to be the set of coordinates in I for which v i η v L m and denote by I c η its complement in I. Therefore, Hence, m m/) v L m I l j+ v l = l I η v l + l I c η v L m I β logeβ) I η I + η Ic η I v l v j+ I η + η v L m I c η β logeβ) I η I + η I )) η m β logeβ) η) I ) η I I + η. If η = min{/4, β/) logeβ)}, then I η η/) I c β)m, as claimed. ). v l. ext, let us turn to decomposition results for vectors of the form a i, t ) i=. Recall that for a set T R, j T is the largest integer for which lt ) d T j loge/j). 4

15 Lemma.9 For every L > there exist constants c and c that depend only on L and for which the following holds. Let T R n and set W = {t/ t : t T } S n. With probability at least exp c l W )), for every t T, a i, t ) i= = v + v and v, v have the following properties:. The supports of v and v are disjoint.. v c lw ) t and suppv ) j W. 3. v ψ c t. Proof. Fix t T and let J t {,..., } be the set of the largest j W coordinates of a i, t ) i=. Set v = a j, t/ t )j Jt and v = a j, t/ t )j J c t. By Theorem.5 and the characterization of the ψ norm of a vector using the monotone rearrangement of its coordinates Lemma.3), v LlW ), and v ψ L. To conclude the proof, set v = t v and v = t v. Recall that for every R > 0, { } t + s T +,R = : t, s T, t + s t s R, t + s and a similar definition holds for T,R. Set j +,R = j T+,R, j,r = j T+,R and E R = max{lt +,R ), lt,r )}. Combining the above estimates leads to the following corollary. Corollary.0 For every L > there exist constants c, c, c 3 and c 4 that depend only on L for which the following holds. Let T R n and R > 0, and consider T +,R and T,R as above. With probability at least 4 exp c L min{l T +,R ), l T,R )}), for every s, t T for which t s t + s R,. s t, a i ) i= = v + v, for vectors v and v of disjoint supports satisfying suppv ) j,r, v c lt,r ) s t and v ψ c s t. 5

16 . s + t, a i ) i= = u + u, for vectors u and u of disjoint supports satisfying suppu ) j +,R, u c lt +,R ) s+t and u ψ c s+t. 3. If h s,t a) = s+t s+t, a s t s t, a, then h s,t a i ) E h s,t c 3 i= In particular, recalling that for every s, t T, ER + E R E s + t, a s t, a κ 0 s + t s t, it follows that if c 4 L)E R /κ 0 then 4. κ 0 s + t s t ). s + t, a i s t, ai L s + t s t. i=.4) From here on, denote by Ω,R the event on which Corollary.0 holds for the sets T +,R and T,R and samples of cardinality L E R /κ 0. Lemma. There exist constants c 0 depending only on L and c, κ that depend only on κ 0 and L for which the following holds. If c 0 E R /κ 0, then for a i ) i= Ω,R, for every s, t T for which s t s + t R, there is I s,t {,..., } of cardinality at least κ, and for every i I s,t, s t, a i s + t, ai c s t s + t. Lemma. is an empirical small-ball estimate, as it shows that with high probability, and for every pair s, t as above, many of the coordinates of a i, s t a i, s + t ) i= are large. Proof. Fix s, t T as above and set y = s t, a i ) i=, and x = s + t, a i ) i=. 6

17 Let y = v +v and x = u +u as in Corollary.0. Let j 0 = max{j,r, j +,R } and put J = suppv ) suppu ). Observe that J j 0 and that yj) xj) j J j suppv ) v j)xj) + j suppu ) j0 ) / j0 ) / v x j)) + u y j)) i= i= L lt,r ) s t j 0 loge/j 0 ) s + t +lt +,R ) s + t j 0 loge/j 0 ) s t L E R s t s + t κ 0 4 s t s + t, yj)u j) because, by the definition of j 0, j 0 loge/j 0 ) max{lt,r ), lt +,R )} and since c 0 E R /κ 0 for c 0 = c 0 L) large enough. Thus, by.4), j J c yj)xj) κ 0 s t s + t /4. Set m = J c and let z = yj)xj)) j J c = v j)u j)) j J c. Since L ER /κ 0, it is evident that j 0 /; thus / m and z L m = yj)xj) m 4m κ 0 s t s + t κ 0 s t s + t. j J c On the other hand, z ψ m v u j)) j Jc ψ m v ψ m u ψ m L s t s + t, and z satisfies the assumption of Lemma.8 for β = c L, κ 0 ). The claim follows immediately from that lemma. 3 Proof of Theorem A It is well understood that when analyzing properties of ERM relative to a loss l, studying the excess loss functional is rather natural. The excess loss shares the same empirical minimizer as the loss, but it has additional qualities: for every x T, EL x 0 and L x0 = 0. 7

Since 0 is a possible value of the minimum of {P_N L_x : x ∈ T} (indeed, L_{x_0} = 0), the minimizer x̂ satisfies P_N L_{x̂} ≤ 0, giving one a way of excluding parts of T as potential empirical minimizers. One simply has to show that with high probability, those parts belong to the set {x : P_N L_x > 0}, for example, by showing that P_N L_x is equivalent to EL_x, as the latter is positive for points that are not true minimizers.

The squared excess loss has a simple decomposition into two processes: a quadratic process and a multiplier one. Indeed, given a class of functions F and f ∈ F,

(f(a) − y)² − (f*(a) − y)² = (f(a) − f*(a))² + 2(f(a) − f*(a))(f*(a) − y),

where, as always, f* is a minimizer of the functional E(f(a) − y)² in F.

In the phase retrieval problem, y = ⟨x_0, a⟩² + w for a noise w that is independent of a, and each f_x ∈ F is given by f_x = ⟨x, ·⟩². Thus,

L_x(a, y) = l_x(a, y) − l_{x_0}(a, y) = (f_x(a) − y)² − (f_{x_0}(a) − y)² = ⟨x − x_0, a⟩²⟨x + x_0, a⟩² − 2w⟨x − x_0, a⟩⟨x + x_0, a⟩.

Since w is a mean-zero random variable that is independent of a, and by Assumption 1.1,

EL_x(a, y) = E⟨x − x_0, a⟩²⟨x + x_0, a⟩² ≥ κ_0² ‖x − x_0‖_2² ‖x + x_0‖_2².

Therefore, E(f_x(a) − y)² has a unique minimizer in F: f* = f_{x_0} = f_{−x_0}.

To show that P_N L_x > 0 on a large subset T′ ⊂ T, it suffices to obtain a high probability lower bound on

inf_{x ∈ T′} (1/N) Σ_{i=1}^N ⟨x − x_0, a_i⟩²⟨x + x_0, a_i⟩²

that dominates a high probability upper bound on

sup_{x ∈ T′} (2/N) | Σ_{i=1}^N w_i ⟨x − x_0, a_i⟩⟨x + x_0, a_i⟩ |.

The set T′ that will be used is T_R = {x ∈ T : ‖x − x_0‖_2 ‖x + x_0‖_2 ≥ R} for a well-chosen R.
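As a quick sanity check (my own illustration, not part of the paper's argument), one can verify this pointwise decomposition numerically for a gaussian a, and see the exclusion mechanism at work: once ‖x − x_0‖_2‖x + x_0‖_2 is bounded away from zero, the quadratic term dominates the multiplier term and the empirical excess risk is positive.

```python
import numpy as np

rng = np.random.default_rng(2)
n, N, sigma = 50, 400, 0.5
x0 = rng.normal(size=n); x0 /= np.linalg.norm(x0)
A = rng.normal(size=(N, n))
w = sigma * rng.normal(size=N)
y = (A @ x0) ** 2 + w

def excess_risk(x):
    # P_N L_x = (1/N) sum_i [ (<a_i,x>^2 - y_i)^2 - (<a_i,x_0>^2 - y_i)^2 ]
    return np.mean(((A @ x) ** 2 - y) ** 2 - ((A @ x0) ** 2 - y) ** 2)

def decomposition(x):
    # quadratic term minus twice the multiplier term; equals excess_risk(x) sample-wise
    u, v = A @ (x - x0), A @ (x + x0)
    return np.mean(u ** 2 * v ** 2) - 2 * np.mean(w * u * v)

x = x0 + 0.5 * rng.normal(size=n)          # a point with ||x - x0|| ||x + x0|| bounded away from 0
print(np.isclose(excess_risk(x), decomposition(x)))   # the identity holds exactly
print(excess_risk(x) > 0)                              # exclusion: the empirical excess risk is positive
```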

19 Theorem 3. There exist a constant c 0 depending only on L, and constants c, κ depending only on κ 0 and L for which the following holds. For every R > 0 and c 0 ER /κ 0, with probability at least for every x T R, 4 exp c L min{l T +,R ), l T,R )}), x0 x, a i x0 + x, a i c x 0 x x 0 + x. i= Theorem 3. is an immediate outcome of Lemma. Theorem 3. There exist absolute constants c and c for which the following holds. For every β >, with probability at least exp c L min{l T +,R ), l T,R )}) β ), for every x T R, E w i x x0, a i x + x0, a i c β w ψ log R x x 0 x+x 0. i= Proof. By standard properties of empirical process, and since w is meanzero and independent of a, it suffices to estimate sup ε i w i x x 0, a i x + x0, a i, x T R i= for independent signs ε i ) i=. By the contraction principle for Bernoulli processes see, e.g., [9]), it follows that for every fixed w i ) i= and a i) i=, P r ε sup ε x T R i w i ) x x 0 x + x 0, a i, a i x x i= 0 x + x 0 > u ) P r ε max w x x 0 x + x 0 i sup ε i, a i, a i i x x 0 x + x 0 > u. x T R i= Applying Remark.7, if L E R then with ε a) -probability of at least exp c L min{l T +,R ), l T,R )}), x x 0 x + x 0 sup ε i, a i, a i x x 0 x + x 0 c L E R. x T R i= 9

20 Also, because w is a ψ random variable, P rw t w ψ ) exp t /), and thus, w β log w ψ with probability at least β+. Combining the two estimates and a Fubini argument, it follows that with probability at least exp c L min{l T +,R ), l T,R )}) β+, for every x T R, i= w i x x0, a i x + x0, a c 3 L β w ψ log E R x x 0 x+x 0. On the intersection of the two events appearing in Theorem 3. and Theorem 3., if κ0,l ER and setting ρ = x x 0 x + x 0 R r c κ 0 /c L β)), then for every x T R, P L x c κ 0ρ c L ) E R β w ψ log ρ c κ 0R c L ) E R β w ψ log R. Therefore, if L,κ0 E R and R E R c 3 L, κ 0 ) w ψ β log, 3.) then P L x > 0 and ˆx T R. Theorem A follows from the definition of r γ) for a well chosen γ. 4 Proof of Theorem B Most of the work required for the proof of Theorem B has been carried out in Section 3. A literally identical argument, in which one replaces the sets T +,R and T,R with T +,R x 0 ) and T,R x 0 ) may be used, leading to an analogous version of Theorem A, with the obvious modifications: the complexity parameter is max{lt +,R x 0 )), lt,r x 0 ))} for the right choice of R, and 0

21 the probability estimate is exp c 0 min{l T +,R x 0 )), l T,R x 0 ))}) β+. All that remains to complete the proof of Theorem B is to analyze the structure of the local sets and identify the fixed points r 0 and r. A first step in that direction is the following: Lemma 4. There exist absolute constants c and c for which the following holds. For every R > 0 and x 0 R/4,. If x 0 min{ x x 0, x + x 0 } R then x x 0 x + x 0 c R.. If x x 0 x + x 0 R then x 0 min{ x x 0, x + x 0 } c R. Moreover, if x 0 R/4 then x x 0 x + x 0 R if and only if x R. Proof. Assume without loss of generality that x x 0 x + x 0. If x x 0 x 0 then x 0 x 0 x x 0 x + x 0 x x 0 + x 0 3 x 0. Hence, x 0 x + x 0, and then x 0 min{ x x 0, x + x 0 } x x 0 x + x 0. Otherwise, x x 0 > x 0. If, in addition, x 0 x x 0 x + x 0 ) / /4, 4 x 0 x x 0 x + x 0 ) / x 0 / x + x 0 /, and thus x + x 0 6 x 0. Since x 0 < x x 0 x + x 0, it follows that x + x 0 x x 0 x 0, and again, x 0 min{ x x 0, x + x 0 } x x 0 x + x 0. Therefore, the final case, and the only one in which there is no point-wise equivalence between x x 0 x+x 0 and x 0 min{ x x 0, x+x 0 }, is when min{ x x 0, x + x 0 } x 0 and x 0 x x 0 x + x 0 ) / /4. In that case, if x 0 R/4 then x 0 min{ x x 0, x + x 0 } x 0 R/6,

22 and x x 0 x + x 0 4 x 0 R/4, from which the first part of the claim follows immediately. For the second one, observe that x x 0 x x x 0 x + x 0 x + x x 0 + x 0, and if x 0 R/4, the equivalence is evident. In view of Lemma 4., the way the product x x 0 x + x 0 relates to min{ x x 0, x + x 0 } depends on x 0. If x 0 R/4, then {x T : x x 0 x+x 0 R} {x T : min{ x x 0, x+x 0 } c R/ x 0 }, and if x 0 R/4, {x T : x x 0 x + x 0 R} {x T : x c R}, for a suitable absolute constant c. When T is convex and centrally-symmetric, the corresponding complexity parameter - the gaussian average of T +,R x 0 ) = T,R x 0 ) is x 0 R lt c R/ x 0 )B n) if x 0 R, E R x 0 ) R lt c RB n ) if x 0 < R. The fixed point conditions appearing in Theorem A now become and r 0 = inf{r : E R x 0 ) c } 4.) r γ) = inf{r : E R x 0 ) γ R}, 4.) where one selects the slightly suboptimal γ = c /σ log. The assertion of Theorem A is that with high probability, ERM produces ˆx for which ˆx x 0 ˆx + x 0 max{r γ), r 0 }. If x 0 R, the fixed-point condition 4.) is ) R lt c R/ x 0 )B n ) c 3 4.3) x 0

23 while 4.) is, Recall that and x 0 R lt c R/ x 0 )B n ) c 4 /σ log ) R. 4.4) r Q) = inf{r > 0 : lt rb n ) Qr }, s η) = inf{s > 0 : lt sb n ) ηs }. Therefore, it is straightforward to verify that r 0 = x 0 r c 3 ) and r c /σ log ) ) = x 0 s c 4 x 0 /σ log ). Setting R = x 0 max{r c 3), s c 4 x 0 /σ log )}, it remains to ensure that x 0 R; that is, Observe that if max{s c 4 x 0 /σ log ), r c 3 )} x ) r c 3 ) c 3σ c 4 x 0 log, 4.6) then r c 3) s c 4 x 0 /σ log ). Indeed, applying the convexity of T, it is standard to verify that r Q) is attained and r Q) ρ if and only if lt ρb n) Qρ with a similar statement for s see, e.g., the discussion in [7]). Therefore, s η) r Q) if and only if lt r Q)) ηr Q)). The latter is evident because lt r Q)) = Qr Q) and recalling that Q = c 3 and η = c 4 x 0 /σ log. Under 4.6), an assumption which has been made in the formulation of Theorem B, 4.5) becomes s c 4 x 0 /σ log ) x 0 and, by the definition of s, this is the case if and only if lt x 0 B n ) c 4 x 0 σ log x 0 ; that is simply when x 0 v ζ) for ζ = c 4/σ log. Hence, by Theorem A, combined with Lemma 4., it follows that with high probability, min{ ˆx x 0, ˆx + x 0 } s c 4 x 0 /σ log ). The other cases, when either x 0 is small, or when r 0 dominates r are treated is a similar fashion, and are omitted. 3

24 5 Minimax lower bounds In this section we study the optimality of ERM as a phase retrieval procedure, in the minimax sense. The estimate obtained here is based on the maximal cardinality of separated subsets of the class with respect to the L µ) norm. Definition 5. Let B be the unit ball in a normed space. For any subset A of the space, let MA, rb) be the maximal cardinality of a subset of A that is r-separated with respect to the norm associated with B. Observe that if MA, rb) L there are x,..., x L A for which the sets x i +r/3)b are disjoint. A similar statement is true in the reverse direction. Let F be a class of functions on Ω, µ) and let a be distributed according to µ. For f 0 F and a centred gaussian variable w, which has variance σ and is independent of a, consider the gaussian regression model y = f 0 a) + w. 5.) Any procedure that performs well in the minimax sense, must do so for any choice of f 0 F in 5.). Following [7], there are two possible sources of statistical complexity that influence the error rate of gaussian regression in F.. Firstly, that there are functions in F that, despite being far away from f 0, still satisfy f 0 a i ) = fa i ) for every i, and thus are indistinguishable from f 0 on the data. This statistical complexity is independent of the noise, and for every f 0 F and A = a i ) i=, it is captured by the L µ) diameter of the set Kf 0, A) = {f F : fa i )) i= = f 0 a i )) i=}, which is denoted by d A).. Secondly, that the set F f 0 ) rd = {f f 0 : f F, f f 0 L r} is rich enough at a scale that is proportional to its L µ) diameter r. The richness of the set is measured using the cardinality of a maximal L µ)-separated set. To that end, let D be the unit ball in L µ), set and put Cr, θ 0 ) = sup r log / MF f 0 + θ 0 rd), rd) f 0 F q η) = inf { r > 0 : Cr, θ 0 ) ηr }. 5.) 4
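The packing number M(A, rB) of the definition above can be lower-bounded constructively by a greedy procedure: keep adding points of A that are at distance more than r from all points kept so far. The following sketch is my own illustration (the set of sample points, the radius and all constants are arbitrary choices); it builds a separated subset of the d-sparse unit sphere, which is the kind of set used in the lower bounds of this section and of Section 6.

```python
import numpy as np

def greedy_packing(points, r):
    """Greedily extract an r-separated subset (in Euclidean norm) of the given points;
    its cardinality is a lower bound on the packing number M(A, r B_2^n)."""
    kept = []
    for p in points:
        if not kept or np.min(np.linalg.norm(np.asarray(kept) - p, axis=1)) > r:
            kept.append(p)
    return np.asarray(kept)

rng = np.random.default_rng(3)
n, d, m = 10, 2, 2000
pts = np.zeros((m, n))
for i in range(m):                        # random points of the d-sparse unit sphere
    idx = rng.choice(n, size=d, replace=False)
    v = rng.normal(size=d)
    pts[i, idx] = v / np.linalg.norm(v)
sep = greedy_packing(pts, r=0.9)
print(len(sep), "mutually 0.9-separated 2-sparse unit vectors found")
```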

25 Theorem 5. [7] For every f 0 F let P f 0 be the probability measure that generates samples a i, y i ) i= according to 5.). For every θ 0 there exists a constant θ > 0 for which inf ˆf sup f 0 F P f 0 f0 ˆf ) max{q θ /σ), d A)/4)} /5 5.3) where inf ˆf is the infimum over all possible estimators constructed using the given data. Earlier versions of this minimax bound may be found in [4], [6] and []. To apply this general principle to the phase recovery problem, note that the regression function is f 0 x) = x 0, x := fx0 x) for some unknown vector x 0 T R n, while the estimators are ˆf = ˆx,. Also, observe that for every x, x T, f x0 f x L µ) = E x 0, a x, a ) = E x0 x, a x0 + x, a and therefore, one has to identify the L structure of the set F f x0 = { x, x0, : x T }. To obtain the desired bound, it suffices to assume the following: Assumption 5. There exist constants C and C for which, for every s, t R n, C s t s + t E s t, a s + t, a C s t s + t. It is straightforward to verify that if a is an L-subgaussian vector on R n that satisfies Assumption., then it automatically satisfies Assumption 5.. The norm x 0 plays a central role in the analysis of the rates of convergence of the ERM in phase recovery. Therefore, the minimax lower bounds presented here are not only for the entire model T but for every shell V 0 = T R 0 S n. A minimax lower bound over T follows by taking the supremum over all possible choices of R 0. To apply Theorem 5., observe that by Assumption 5., for every u, v T, C u v u + v f v f u L C u v u + v. 5

26 Fix R 0 > 0 and consider V 0 = T R 0 S n. Clearly, for every r > 0 and every x 0 V 0, { u V 0 : u x 0 u + x 0 θ } 0r {u V 0 : f u F f x0 + θ 0 rd)} C 5.4) where F = {f u : u V 0 }. Fix θ 0 > to be named later, and let θ be as in Theorem 5.. If there are x 0 V 0 and {x,..., x M } V 0 that satisfy. x i x 0 x i + x 0 θ 0 r/c,. for every i < j M, x i x j x i + x j r/c, and 3. log M > θ r/σ), then sup f0 F r log/ MF f 0 + θ 0 rd), rd) > θ r /σ, and the best possible rate in phase recovery in V 0 is larger than r. Fix x 0 V 0 and r > 0, and let R = r/c. We will present two different estimates, based on R 0 = x 0, the location of x 0. Centre of small norm. Recall that θ 0 > and assume first that R 0 = x 0 θ 0 R/4. ote that V 0 R/8)B n { u V 0 : u x 0 u + x 0 θ 0r C }, and thus it suffices to constructed a separated set in V 0 R/8)B n. Set x,..., x L to be a maximal c 3 R-separated subset of V0 R/8)B n for a constant c 3 that depends only on C and C and which will be specified later; thus, L = MV 0 R/8)B n, c 3 RB n ). Lemma 5.3 There is a subset I {,..., L} of cardinality M L/ for which x i ) i I satisfies. and.. Proof. Since x i R/8)B n and x 0 θ 0 R/4, x i x 0 x i + x 0 θ 0 + ) R/4) θ 0 R = θ 0 r/c, and thus. is satisfied for every i L. To show that. holds for a large subset, it suffices to find I {,..., L} of cardinality at least L/ such that for every i, j I, x i x j c 3 R/ and xi + x j c 3 R/ 6

27 for a well-chosen c 3. To construct the subset, observe that if there are distinct integers i, j, k L for which x i + x j < c 3 R/ and xi + x k < c 3 R/, then x j x k < c 3 R, which is impossible. Therefore, for every xi there is at most a single index j {,..., L}\{i} satisfying that x i + x j < c 3 R/. With this observation the set I is constructed inductively. Without loss of generality, assume that I = {,..., M} for M L/. If i j and i, j M, x i x j x i + x j c 3R/4 r/c, for the right choice of c 3, and thus x i ) M i= satisfies.. Centre of large norm. ext, assume that R 0 = x 0 θ 0 R/4. By Lemma 4., there is an absolute constant c 4 < /3, for which, if x 0 ρ/4 and x0 min{ x x 0, x+x 0 } c 4 ρ, then x x 0 x+x 0 ρ. Therefore, applied to the choice ρ = θ 0 R, V 0 x 0 + c 4 θ 0 R/ x 0 )B n )) V 0 x 0 + c 4 θ 0 R/ x 0 )B n )) {u V 0 : x 0 u x 0 + u θ 0 R}, and it suffices to a find a separated set in the former. ote that if x V 0 x 0 + c 4 θ 0 R/ x 0 )B n ) then x x 0 c 4 θ 0 R/ x 0 x 0 / θ 0 R/4/4 because one can choose c 4 /3, and x 3 x 0 /. Moreover, if x, x V 0 x 0 + c 4 θ 0 R/ x 0 )B n ), then then x + x x 0 c 4 θ 0 R/ x 0 x 0. Applying Lemma 4., there is an absolute constant c 5, for which, if x 0 min { x i x j, x i + x j } θ 0 R/4, 5.5) x i x j x i + x j c 5 θ 0 R/4. Hence, if x,..., x M V 0 x 0 +c 4 θ 0 R/ x 0 )B n ) is θ 0R/4 x 0 -separated, then 5.5) holds, and x i x j x i + x j c 5 θ 0 R/4 r/c, provided that θ 0 is a sufficiently large constant that depends only on C and C. 7

28 Corollary 5.4 There exist absolute constants θ, c, c and c 3 that depend only on C and C and for which the following holds. Let R 0 > 0 and set V 0 = T R 0 S n. Define log MV 0 c rb n, c rb n ) if R 0 c 3 r, q r) = ) ) ) sup x0 V 0 log M V 0 x 0 + c r R 0 B n r, c R 0 B n if R 0 > c 3 r If q r) θ r/σ), then the minimax rate in V 0 is larger than r. ow, a minimax lower bound for the risk min{ x x 0, x + x 0 } for all shells V 0 and therefore, for T ) may be derived using Lemma 4.. To that end, let c 0 be a large enough absolute constant and CR 0, r) = sup r log / M T R 0 S n ) x 0 + c 0 rb n ), rb n ). x 0 T : x 0 =R 0 Definition 5.5 Fix R 0 > 0. For every α, β > 0 set q α) = inf { r > 0 : CR 0, r) αr } 5.6) and put t β) = inf { r > 0 : CR 0, r) βr 3 }. ote that for any c > 0, if t c) then t c) q c) and q cr0 /σ ) t c/σ) if and only if q cr0 /σ ) R 0. Theorem C. There exists an absolute constant c for which the following holds. Let R 0 > 0.. If R 0 t c /σ ), then for any procedure x, there exists x 0 T with x 0 = R 0 and for which, with probability at least /5, and x x 0 x + x 0 x 0 q c x 0 ) σ min{ x x 0, x + x 0 } q c x 0 ). σ 8

29 . If R 0 t c /σ ) then for any procedure x there exists x 0 T with x 0 = R 0 for which, with probability at least /5, and x x 0 x + x 0 t c )) σ x, x x 0, x + x 0 t c ). σ Theorem C is a general minimax bound, and although it seems strange at first glance, the parameters appearing in it are very close to those used in Theorem C. Following the same path as in [7], let us show that Theorem C and Theorem C are almost sharp, under some mild structural assumptions on T. First, recall Sudakov s inequality see, e.g., [9]): Theorem 5.6 If W R n and ε > 0 then where c is an absolute constant. cε log / MW, εb n ) lw ), Fix R 0 > 0 set V 0 = T R 0 B n, and put s = s c R 0 /σ log ) ) and v = v c R 0 /σ log ) ). Assume that there is some x 0 T, with the following properties:. x 0 = R.. The localized sets V 0 x 0 + s Bn ) and V 0 x 0 + v Bn ) satisfy that and that lv 0 x 0 + s S n )) lt s B n ) lv 0 x 0 + v S n )) lt v B n ), which is a mild assumption on the complexity structure of T. 3. Sudakov s inequality is sharp at the scales s and v, namely, s log / MV 0 x 0 + c 0 s B n ), s B n ) lv 0 x 0 + s B n )) 5.7) and a similar assertion holds for v. 9

In such a case, the rates of convergence obtained in Theorem B are minimax (up to an extra log factor) thanks to Theorem C. When Sudakov's inequality is sharp as in (5.7), we believe that ERM should be a minimax procedure in phase recovery, despite the logarithmic gap between Theorem B and Theorem C. A similar conclusion for linear regression was obtained in [7].

Sudakov's inequality is sharp in many cases - most notably, when T = B_1^n, but not always. It is not sharp even for standard sets like the unit ball in ℓ_p^n for 1 + 1/log n < p < 2.

6 Examples

Here, we will present two simple applications of the upper and lower bounds on the performance of ERM in phase recovery. Naturally, there are many other examples that follow in a similar way and that can be derived using very similar arguments. The choice of the examples has been made to illustrate the question of the optimality of Theorems A, B and C, as well as an indication of the similarities and differences between phase recovery and linear regression. Since the estimates used in these examples are rather well known, some of the details will not be presented in full.

6.1 Sparse vectors

The first example we consider represents classes with a local complexity that remains unchanged, regardless of the choice of x_0. Let T = W_d be the set of d-sparse vectors in R^n for some d ≤ n/4, that is, vectors with at most d non-zero coordinates. Clearly, for every R > 0, T_{+,R}, T_{−,R} ⊂ W_{2d} ∩ S^{n−1}. Also, for any x_0 ∈ T and any I ⊂ {1, ..., n} of cardinality d that is disjoint from supp(x_0), the coordinate projections onto I of the sets

{ (x − x_0)/‖x − x_0‖_2 : x ∈ W_d }    and    { (x + x_0)/‖x + x_0‖_2 : x ∈ W_d }

contain 2^{−1/2} S_I, where S_I is the unit sphere supported on the coordinates I. With this observation, a straightforward argument shows that

ℓ(T_{+,R}), ℓ(T_{−,R}) ≍ √(d log(en/d)).

Applying Theorem A, it follows that for N ≳ d log(en/d), with probability

at least 1 − 2 exp(−c d log(en/d)) − N^{−β+1}, ERM produces x̂ that satisfies

‖x̂ − x_0‖_2 ‖x̂ + x_0‖_2 ≲_{κ_0,L,β} σ √(d log(en/d)/N) · √(log N).    (6.1)

Moreover, with the same probability estimate, if ‖x_0‖_2 ≳ (σ² d log(en/d) log N / N)^{1/4}, then by Lemma 4.1,

min{ ‖x̂ − x_0‖_2, ‖x̂ + x_0‖_2 } ≲_{κ_0,L,β} (σ/‖x_0‖_2) √(d log(en/d)/N) · √(log N),    (6.2)

and if ‖x_0‖_2 ≲ (σ² d log(en/d) log N / N)^{1/4}, then

‖x̂‖_2, ‖x̂ − x_0‖_2, ‖x̂ + x_0‖_2 ≲_{κ_0,L} ( σ² d log(en/d) log N / N )^{1/4}.    (6.3)

In particular, when ‖x_0‖_2 is of the order of a constant, the rate of convergence in (6.1) and (6.2) is identical to the one obtained in [7] for linear regression (up to a logarithmic term). In the latter, ERM achieves the minimax rate (with the same probability estimate) of

‖x̂ − x_0‖_2 ≲_L σ √(d log(en/d)/N).

Otherwise, when ‖x_0‖_2 is large, the rate of convergence of ERM in (6.2) is actually better than in linear regression, but, when ‖x_0‖_2 is small, it is worse, deteriorating to the square root of the rate in linear regression (up to logarithmic terms).

When the noise level σ tends to zero, the rates of convergence in linear regression and phase recovery tend to zero as well. In particular, exact reconstruction happens (that is, x̂ = x_0 in linear regression and x̂ = x_0 or x̂ = −x_0 in phase recovery) when N ≳ d log(en/d).

For the lower bound, it is well known that log^{1/2} M(W_d ∩ c_0 rB_2^n, rB_2^n) ≳ √(d log(en/d)) for every r > 0 (and a suitable absolute constant c_0). Combined with the results of the previous section, this suffices to show that the rate obtained in Theorem A is the minimax one (up to a log N term in the large noise regime) and that ERM is a minimax procedure for the phase retrieval problem when the signal x_0 is known to be d-sparse and N ≳ d log(en/d).

6.2 The unit ball of ℓ_1^n

Consider the set T = B_1^n, the unit ball of ℓ_1^n. Being convex and centrally symmetric, it is a natural example of a set with changing local complexity

32 which becomes very large when x 0 is close to 0. Moreover, it is an example in which one may obtain sharp estimates on lb n rbn ) at every scale. Indeed, one may show see, for example, [5]) that l B n rb n ) logenr ) if r n r n otherwise. It follows that for B n, one has )) / Q log n if n C Q 0 Q rq) if C Q n C 0 Q = 0 if n C Q. where C 0 and C are absolute constants. The only range in which this estimate is not sharp is when n Q, because in that range, r Q) decays to zero very quickly. A more accurate estimate on lb n rbn ) can be performed when n Q see [8]), but since it is not our main interest, we will not pursue it further, and only consider the cases n C Q and n C 0 Q. A straightforward computation shows that the two other fixed points satisfy: )) s η log n /4 if n η η η) and vζ) n η if n η )) ζ log n 3 /6 if n ζ /3 /3 ζ n ζ ) /4 if n ζ /3 /3. The estimates above will be used to derive rates of convergence for the ERM ˆx for the squared loss) of the form min{ ˆx x 0, ˆx + x 0 } rate. Upper bounds on the rate of convergence rate follow from Theorem B, and hold with high probability as stated in there. For the sake of brevity, we 3

33 will not present the probability estimates, but those can be easily derived from Theorem B. Thanks to Theorem B, obtaining upper bounds on rate involves the study of several different regimes, depending on x 0, the noise level σ and the way the number of observations compares with the dimension n. The noise-free case: σ = 0. In this case, rate is upper bounded by r Q), for Q that is an absolute constant. In particular, when n C 0 Q, the rate is less than log n/ )) /. At this point, it is natural to wonder whether there is a procedure that outperforms ERM in the noise-free case. The minimax lower bound d A) in Theorem 5. may be used to address this question, as no algorithm can do better than d A)/4, with probability greater than /5. In the phase recovery problem and using the notation of section 5, one has d A) = sup { f x f x0 : x B n, f x a i ) = f x0 a i ), i =,..., } sup { x x 0 x + x 0 : x B n, a i, x = a i, x 0, i =,..., } inf sup { x x 0 x + x 0 : x B n L:R n R, Lx) = Lx 0 )} with an infimum taken over all linear operators L : R n R. By Lemma 4., for x 0 = /, 0,..., 0) B n in fact, any vector x 0 in B n for which x 0 is a positive constant smaller than / would do) d A) inf sup min{ x x 0, x + x 0 } L:R n R x B n kerl x 0) inf sup x y = c B n L:R n R x,y B n kerl ) which is the Gelfand -width of B n. By a result due to Garnaev and Gluskin see [3]), { )} min, c B n log en if n ) 0 otherwise. which is of the same order as r except when n, which is not treated here). Therefore, no algorithm can outperform ERM and ERM is a minimax procedure in this case. 33

34 ote that when n c Q, exact reconstruction of x 0 or x 0 is possible and it can happens only in that case i.e. σ = 0 and n c Q ) because of the minimax lower bound provided by d A). The noisy case: σ > 0. According to Theorem B, the rate of convergence rate depends on r = r Q) for some absolute constant Q, s = s η) for η = c x 0 /σ log ) and on v = v ζ) for ζ = c /σ log ). The outcome of Theorem B is presented in Figure. rate σ/ x 0 c 0 r / log σ/ x 0 c 0 r / log x 0 v r s x 0 v r v Figure : High probability bounds on the rate of convergence of the ERM ˆx for the square loss in phase recovery: min{ ˆx x 0, ˆx + x 0 } rate. As the proof of all the estimates is similar, a detailed analysis is only presented when ζ /3 /3 η C Q, which is equivalent to σ log ) /6 x0 c c Q σ log. c The upper bounds on rate change according to the way the number of observations scales relative to n:. n C 0 Q. In this situation, r logn/)/ ) /. Therefore, if σ/ x 0 logn/)/ log ) then rate logn/)/ ) /, and if σ/ x 0 logn/)/ log ), rate σ log x 0 log σ n x 0 )) /4 σ log log σ n 3 if x 0 σ log log )) /6 otherwise. )) /6 σ n 3 6.4). c x 0 /σ log ) n C Q. In that case r = 0. In particular σ/ x 0 > c 0 r / log and therefore, the rate is upper bounded as in 6.4). 3. c /σ log ) ) /3 /3 n c x 0 /σ log ). Again, in this case, r = 0. Therefore, one is in the situation of the small 34

signal-to-noise ratio, and

rate ≲ (σ/‖x_0‖_2) √(n log N / N)    if ‖x_0‖_2 ≳ ( σ² log N · log(σ²n³/N) / N )^{1/6},
rate ≲ ( σ² log N · log(σ²n³/N) / N )^{1/6}    otherwise.

4. N ≥ c_1 n³ (σ √(log N))² (equivalently, n ≤ c N^{1/3}/(σ√(log N))^{2/3}). Once again, r* = 0, and

rate ≲ (σ/‖x_0‖_2) √(n log N / N)    if ‖x_0‖_2 ≳ ( σ² n log N / N )^{1/4},
rate ≲ ( σ² n log N / N )^{1/4}    otherwise.

One may ask whether these estimates are optimal in the minimax sense, or perhaps there is another procedure that can outperform ERM. It appears that (up to an extra log factor) ERM is indeed optimal. To see that, it is enough to apply Theorem C and verify that Sudakov's inequality is sharp in the following sense (see the discussion following Theorem C): that if ‖x_0‖_2 ≤ 1/2, then for every ε < 1/4,

ε log^{1/2} M(B_1^n ∩ (x_0 + c_0 ε B_2^n), ε B_2^n) ≳ ℓ(B_1^n ∩ ε B_2^n).

This fact is relatively straightforward to verify (see, e.g., the example in [10]). Therefore, up to the extra log factor, which we believe is parasitic, ERM is a minimax phase-recovery procedure in B_1^n.

References

[1] L. Birgé, Approximation dans les espaces métriques et théorie de l'estimation, Z. Wahrsch. Verw. Gebiete, vol. 65, 1983.
[2] Y. C. Eldar, S. Mendelson, Phase Retrieval: Stability and Recovery Guarantees, preprint.
[3] A. Yu. Garnaev and E. D. Gluskin, The widths of a Euclidean ball, Dokl. Akad. Nauk SSSR, 277(5):1048-1052, 1984.
[4] E. Giné, V. de la Peña, Decoupling: From Dependence to Independence, Springer-Verlag, 1999.
[5] Y. Gordon, A. E. Litvak, S. Mendelson and A. Pajor, Gaussian averages of interpolated bodies and applications to approximate reconstruction, J. Approx. Theory, 149(1), 2007.


Majorizing measures and proportional subsets of bounded orthonormal systems Majorizing measures and proportional subsets of bounded orthonormal systems Olivier GUÉDON Shahar MENDELSON1 Alain PAJOR Nicole TOMCZAK-JAEGERMANN Abstract In this article we prove that for any orthonormal

More information

REAL ANALYSIS LECTURE NOTES: 1.4 OUTER MEASURE

REAL ANALYSIS LECTURE NOTES: 1.4 OUTER MEASURE REAL ANALYSIS LECTURE NOTES: 1.4 OUTER MEASURE CHRISTOPHER HEIL 1.4.1 Introduction We will expand on Section 1.4 of Folland s text, which covers abstract outer measures also called exterior measures).

More information

Subspaces and orthogonal decompositions generated by bounded orthogonal systems

Subspaces and orthogonal decompositions generated by bounded orthogonal systems Subspaces and orthogonal decompositions generated by bounded orthogonal systems Olivier GUÉDON Shahar MENDELSON Alain PAJOR Nicole TOMCZAK-JAEGERMANN August 3, 006 Abstract We investigate properties of

More information

4130 HOMEWORK 4. , a 2

4130 HOMEWORK 4. , a 2 4130 HOMEWORK 4 Due Tuesday March 2 (1) Let N N denote the set of all sequences of natural numbers. That is, N N = {(a 1, a 2, a 3,...) : a i N}. Show that N N = P(N). We use the Schröder-Bernstein Theorem.

More information

THE INVERSE FUNCTION THEOREM

THE INVERSE FUNCTION THEOREM THE INVERSE FUNCTION THEOREM W. PATRICK HOOPER The implicit function theorem is the following result: Theorem 1. Let f be a C 1 function from a neighborhood of a point a R n into R n. Suppose A = Df(a)

More information

In N we can do addition, but in order to do subtraction we need to extend N to the integers

In N we can do addition, but in order to do subtraction we need to extend N to the integers Chapter 1 The Real Numbers 1.1. Some Preliminaries Discussion: The Irrationality of 2. We begin with the natural numbers N = {1, 2, 3, }. In N we can do addition, but in order to do subtraction we need

More information

arxiv: v5 [math.na] 16 Nov 2017

arxiv: v5 [math.na] 16 Nov 2017 RANDOM PERTURBATION OF LOW RANK MATRICES: IMPROVING CLASSICAL BOUNDS arxiv:3.657v5 [math.na] 6 Nov 07 SEAN O ROURKE, VAN VU, AND KE WANG Abstract. Matrix perturbation inequalities, such as Weyl s theorem

More information

Course 212: Academic Year Section 1: Metric Spaces

Course 212: Academic Year Section 1: Metric Spaces Course 212: Academic Year 1991-2 Section 1: Metric Spaces D. R. Wilkins Contents 1 Metric Spaces 3 1.1 Distance Functions and Metric Spaces............. 3 1.2 Convergence and Continuity in Metric Spaces.........

More information

Arithmetic progressions in sumsets

Arithmetic progressions in sumsets ACTA ARITHMETICA LX.2 (1991) Arithmetic progressions in sumsets by Imre Z. Ruzsa* (Budapest) 1. Introduction. Let A, B [1, N] be sets of integers, A = B = cn. Bourgain [2] proved that A + B always contains

More information

INDUSTRIAL MATHEMATICS INSTITUTE. B.S. Kashin and V.N. Temlyakov. IMI Preprint Series. Department of Mathematics University of South Carolina

INDUSTRIAL MATHEMATICS INSTITUTE. B.S. Kashin and V.N. Temlyakov. IMI Preprint Series. Department of Mathematics University of South Carolina INDUSTRIAL MATHEMATICS INSTITUTE 2007:08 A remark on compressed sensing B.S. Kashin and V.N. Temlyakov IMI Preprint Series Department of Mathematics University of South Carolina A remark on compressed

More information

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013 Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

Rademacher Averages and Phase Transitions in Glivenko Cantelli Classes

Rademacher Averages and Phase Transitions in Glivenko Cantelli Classes IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 48, NO. 1, JANUARY 2002 251 Rademacher Averages Phase Transitions in Glivenko Cantelli Classes Shahar Mendelson Abstract We introduce a new parameter which

More information

Uniform Uncertainty Principle and signal recovery via Regularized Orthogonal Matching Pursuit

Uniform Uncertainty Principle and signal recovery via Regularized Orthogonal Matching Pursuit Uniform Uncertainty Principle and signal recovery via Regularized Orthogonal Matching Pursuit arxiv:0707.4203v2 [math.na] 14 Aug 2007 Deanna Needell Department of Mathematics University of California,

More information

Least singular value of random matrices. Lewis Memorial Lecture / DIMACS minicourse March 18, Terence Tao (UCLA)

Least singular value of random matrices. Lewis Memorial Lecture / DIMACS minicourse March 18, Terence Tao (UCLA) Least singular value of random matrices Lewis Memorial Lecture / DIMACS minicourse March 18, 2008 Terence Tao (UCLA) 1 Extreme singular values Let M = (a ij ) 1 i n;1 j m be a square or rectangular matrix

More information

ROW PRODUCTS OF RANDOM MATRICES

ROW PRODUCTS OF RANDOM MATRICES ROW PRODUCTS OF RANDOM MATRICES MARK RUDELSON Abstract Let 1,, K be d n matrices We define the row product of these matrices as a d K n matrix, whose rows are entry-wise products of rows of 1,, K This

More information

Only Intervals Preserve the Invertibility of Arithmetic Operations

Only Intervals Preserve the Invertibility of Arithmetic Operations Only Intervals Preserve the Invertibility of Arithmetic Operations Olga Kosheleva 1 and Vladik Kreinovich 2 1 Department of Electrical and Computer Engineering 2 Department of Computer Science University

More information

Optimisation Combinatoire et Convexe.

Optimisation Combinatoire et Convexe. Optimisation Combinatoire et Convexe. Low complexity models, l 1 penalties. A. d Aspremont. M1 ENS. 1/36 Today Sparsity, low complexity models. l 1 -recovery results: three approaches. Extensions: matrix

More information

The small ball property in Banach spaces (quantitative results)

The small ball property in Banach spaces (quantitative results) The small ball property in Banach spaces (quantitative results) Ehrhard Behrends Abstract A metric space (M, d) is said to have the small ball property (sbp) if for every ε 0 > 0 there exists a sequence

More information

MINIMAL GRAPHS PART I: EXISTENCE OF LIPSCHITZ WEAK SOLUTIONS TO THE DIRICHLET PROBLEM WITH C 2 BOUNDARY DATA

MINIMAL GRAPHS PART I: EXISTENCE OF LIPSCHITZ WEAK SOLUTIONS TO THE DIRICHLET PROBLEM WITH C 2 BOUNDARY DATA MINIMAL GRAPHS PART I: EXISTENCE OF LIPSCHITZ WEAK SOLUTIONS TO THE DIRICHLET PROBLEM WITH C 2 BOUNDARY DATA SPENCER HUGHES In these notes we prove that for any given smooth function on the boundary of

More information

Tools from Lebesgue integration

Tools from Lebesgue integration Tools from Lebesgue integration E.P. van den Ban Fall 2005 Introduction In these notes we describe some of the basic tools from the theory of Lebesgue integration. Definitions and results will be given

More information

Approximately Gaussian marginals and the hyperplane conjecture

Approximately Gaussian marginals and the hyperplane conjecture Approximately Gaussian marginals and the hyperplane conjecture Tel-Aviv University Conference on Asymptotic Geometric Analysis, Euler Institute, St. Petersburg, July 2010 Lecture based on a joint work

More information

Introduction and Preliminaries

Introduction and Preliminaries Chapter 1 Introduction and Preliminaries This chapter serves two purposes. The first purpose is to prepare the readers for the more systematic development in later chapters of methods of real analysis

More information

Some Background Material

Some Background Material Chapter 1 Some Background Material In the first chapter, we present a quick review of elementary - but important - material as a way of dipping our toes in the water. This chapter also introduces important

More information

for all subintervals I J. If the same is true for the dyadic subintervals I D J only, we will write ϕ BMO d (J). In fact, the following is true

for all subintervals I J. If the same is true for the dyadic subintervals I D J only, we will write ϕ BMO d (J). In fact, the following is true 3 ohn Nirenberg inequality, Part I A function ϕ L () belongs to the space BMO() if sup ϕ(s) ϕ I I I < for all subintervals I If the same is true for the dyadic subintervals I D only, we will write ϕ BMO

More information

On the constant in the reverse Brunn-Minkowski inequality for p-convex balls.

On the constant in the reverse Brunn-Minkowski inequality for p-convex balls. On the constant in the reverse Brunn-Minkowski inequality for p-convex balls. A.E. Litvak Abstract This note is devoted to the study of the dependence on p of the constant in the reverse Brunn-Minkowski

More information

2. Function spaces and approximation

2. Function spaces and approximation 2.1 2. Function spaces and approximation 2.1. The space of test functions. Notation and prerequisites are collected in Appendix A. Let Ω be an open subset of R n. The space C0 (Ω), consisting of the C

More information

Geometry of log-concave Ensembles of random matrices

Geometry of log-concave Ensembles of random matrices Geometry of log-concave Ensembles of random matrices Nicole Tomczak-Jaegermann Joint work with Radosław Adamczak, Rafał Latała, Alexander Litvak, Alain Pajor Cortona, June 2011 Nicole Tomczak-Jaegermann

More information

D I S C U S S I O N P A P E R

D I S C U S S I O N P A P E R I N S T I T U T D E S T A T I S T I Q U E B I O S T A T I S T I Q U E E T S C I E N C E S A C T U A R I E L L E S ( I S B A ) UNIVERSITÉ CATHOLIQUE DE LOUVAIN D I S C U S S I O N P A P E R 2014/06 Adaptive

More information

Construction of a general measure structure

Construction of a general measure structure Chapter 4 Construction of a general measure structure We turn to the development of general measure theory. The ingredients are a set describing the universe of points, a class of measurable subsets along

More information

P-adic Functions - Part 1

P-adic Functions - Part 1 P-adic Functions - Part 1 Nicolae Ciocan 22.11.2011 1 Locally constant functions Motivation: Another big difference between p-adic analysis and real analysis is the existence of nontrivial locally constant

More information

Isomorphisms between pattern classes

Isomorphisms between pattern classes Journal of Combinatorics olume 0, Number 0, 1 8, 0000 Isomorphisms between pattern classes M. H. Albert, M. D. Atkinson and Anders Claesson Isomorphisms φ : A B between pattern classes are considered.

More information

Economics 204 Summer/Fall 2011 Lecture 5 Friday July 29, 2011

Economics 204 Summer/Fall 2011 Lecture 5 Friday July 29, 2011 Economics 204 Summer/Fall 2011 Lecture 5 Friday July 29, 2011 Section 2.6 (cont.) Properties of Real Functions Here we first study properties of functions from R to R, making use of the additional structure

More information

Spanning and Independence Properties of Finite Frames

Spanning and Independence Properties of Finite Frames Chapter 1 Spanning and Independence Properties of Finite Frames Peter G. Casazza and Darrin Speegle Abstract The fundamental notion of frame theory is redundancy. It is this property which makes frames

More information

Integral Jensen inequality

Integral Jensen inequality Integral Jensen inequality Let us consider a convex set R d, and a convex function f : (, + ]. For any x,..., x n and λ,..., λ n with n λ i =, we have () f( n λ ix i ) n λ if(x i ). For a R d, let δ a

More information

Analysis-3 lecture schemes

Analysis-3 lecture schemes Analysis-3 lecture schemes (with Homeworks) 1 Csörgő István November, 2015 1 A jegyzet az ELTE Informatikai Kar 2015. évi Jegyzetpályázatának támogatásával készült Contents 1. Lesson 1 4 1.1. The Space

More information

Irredundant Families of Subcubes

Irredundant Families of Subcubes Irredundant Families of Subcubes David Ellis January 2010 Abstract We consider the problem of finding the maximum possible size of a family of -dimensional subcubes of the n-cube {0, 1} n, none of which

More information

Metric Space Topology (Spring 2016) Selected Homework Solutions. HW1 Q1.2. Suppose that d is a metric on a set X. Prove that the inequality d(x, y)

Metric Space Topology (Spring 2016) Selected Homework Solutions. HW1 Q1.2. Suppose that d is a metric on a set X. Prove that the inequality d(x, y) Metric Space Topology (Spring 2016) Selected Homework Solutions HW1 Q1.2. Suppose that d is a metric on a set X. Prove that the inequality d(x, y) d(z, w) d(x, z) + d(y, w) holds for all w, x, y, z X.

More information

High-dimensional distributions with convexity properties

High-dimensional distributions with convexity properties High-dimensional distributions with convexity properties Bo az Klartag Tel-Aviv University A conference in honor of Charles Fefferman, Princeton, May 2009 High-Dimensional Distributions We are concerned

More information

A NEW SET THEORY FOR ANALYSIS

A NEW SET THEORY FOR ANALYSIS Article A NEW SET THEORY FOR ANALYSIS Juan Pablo Ramírez 0000-0002-4912-2952 Abstract: We present the real number system as a generalization of the natural numbers. First, we prove the co-finite topology,

More information

In N we can do addition, but in order to do subtraction we need to extend N to the integers

In N we can do addition, but in order to do subtraction we need to extend N to the integers Chapter The Real Numbers.. Some Preliminaries Discussion: The Irrationality of 2. We begin with the natural numbers N = {, 2, 3, }. In N we can do addition, but in order to do subtraction we need to extend

More information

COMPLETELY INVARIANT JULIA SETS OF POLYNOMIAL SEMIGROUPS

COMPLETELY INVARIANT JULIA SETS OF POLYNOMIAL SEMIGROUPS Series Logo Volume 00, Number 00, Xxxx 19xx COMPLETELY INVARIANT JULIA SETS OF POLYNOMIAL SEMIGROUPS RICH STANKEWITZ Abstract. Let G be a semigroup of rational functions of degree at least two, under composition

More information

Maths 212: Homework Solutions

Maths 212: Homework Solutions Maths 212: Homework Solutions 1. The definition of A ensures that x π for all x A, so π is an upper bound of A. To show it is the least upper bound, suppose x < π and consider two cases. If x < 1, then

More information

Mathematical Methods for Physics and Engineering

Mathematical Methods for Physics and Engineering Mathematical Methods for Physics and Engineering Lecture notes for PDEs Sergei V. Shabanov Department of Mathematics, University of Florida, Gainesville, FL 32611 USA CHAPTER 1 The integration theory

More information

Indeed, if we want m to be compatible with taking limits, it should be countably additive, meaning that ( )

Indeed, if we want m to be compatible with taking limits, it should be countably additive, meaning that ( ) Lebesgue Measure The idea of the Lebesgue integral is to first define a measure on subsets of R. That is, we wish to assign a number m(s to each subset S of R, representing the total length that S takes

More information

Topological properties

Topological properties CHAPTER 4 Topological properties 1. Connectedness Definitions and examples Basic properties Connected components Connected versus path connected, again 2. Compactness Definition and first examples Topological

More information

On the minimum of several random variables

On the minimum of several random variables On the minimum of several random variables Yehoram Gordon Alexander Litvak Carsten Schütt Elisabeth Werner Abstract For a given sequence of real numbers a,..., a n we denote the k-th smallest one by k-

More information

Statistical inference on Lévy processes

Statistical inference on Lévy processes Alberto Coca Cabrero University of Cambridge - CCA Supervisors: Dr. Richard Nickl and Professor L.C.G.Rogers Funded by Fundación Mutua Madrileña and EPSRC MASDOC/CCA student workshop 2013 26th March Outline

More information

A VERY BRIEF REVIEW OF MEASURE THEORY

A VERY BRIEF REVIEW OF MEASURE THEORY A VERY BRIEF REVIEW OF MEASURE THEORY A brief philosophical discussion. Measure theory, as much as any branch of mathematics, is an area where it is important to be acquainted with the basic notions and

More information

REAL AND COMPLEX ANALYSIS

REAL AND COMPLEX ANALYSIS REAL AND COMPLE ANALYSIS Third Edition Walter Rudin Professor of Mathematics University of Wisconsin, Madison Version 1.1 No rights reserved. Any part of this work can be reproduced or transmitted in any

More information

THE SMALLEST SINGULAR VALUE OF A RANDOM RECTANGULAR MATRIX

THE SMALLEST SINGULAR VALUE OF A RANDOM RECTANGULAR MATRIX THE SMALLEST SINGULAR VALUE OF A RANDOM RECTANGULAR MATRIX MARK RUDELSON AND ROMAN VERSHYNIN Abstract. We prove an optimal estimate on the smallest singular value of a random subgaussian matrix, valid

More information

Geometric intuition: from Hölder spaces to the Calderón-Zygmund estimate

Geometric intuition: from Hölder spaces to the Calderón-Zygmund estimate Geometric intuition: from Hölder spaces to the Calderón-Zygmund estimate A survey of Lihe Wang s paper Michael Snarski December 5, 22 Contents Hölder spaces. Control on functions......................................2

More information

Introduction to Real Analysis Alternative Chapter 1

Introduction to Real Analysis Alternative Chapter 1 Christopher Heil Introduction to Real Analysis Alternative Chapter 1 A Primer on Norms and Banach Spaces Last Updated: March 10, 2018 c 2018 by Christopher Heil Chapter 1 A Primer on Norms and Banach Spaces

More information

Exact Bounds on Sample Variance of Interval Data

Exact Bounds on Sample Variance of Interval Data University of Texas at El Paso DigitalCommons@UTEP Departmental Technical Reports (CS) Department of Computer Science 3-1-2002 Exact Bounds on Sample Variance of Interval Data Scott Ferson Lev Ginzburg

More information

Generalization theory

Generalization theory Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1

More information

Sampling and high-dimensional convex geometry

Sampling and high-dimensional convex geometry Sampling and high-dimensional convex geometry Roman Vershynin SampTA 2013 Bremen, Germany, June 2013 Geometry of sampling problems Signals live in high dimensions; sampling is often random. Geometry in

More information

NEWMAN S INEQUALITY FOR INCREASING EXPONENTIAL SUMS

NEWMAN S INEQUALITY FOR INCREASING EXPONENTIAL SUMS NEWMAN S INEQUALITY FOR INCREASING EXPONENTIAL SUMS Tamás Erdélyi Dedicated to the memory of George G Lorentz Abstract Let Λ n := {λ 0 < λ < < λ n } be a set of real numbers The collection of all linear

More information

Implications of the Constant Rank Constraint Qualification

Implications of the Constant Rank Constraint Qualification Mathematical Programming manuscript No. (will be inserted by the editor) Implications of the Constant Rank Constraint Qualification Shu Lu Received: date / Accepted: date Abstract This paper investigates

More information

High Dimensional Probability

High Dimensional Probability High Dimensional Probability for Mathematicians and Data Scientists Roman Vershynin 1 1 University of Michigan. Webpage: www.umich.edu/~romanv ii Preface Who is this book for? This is a textbook in probability

More information

Laplace s Equation. Chapter Mean Value Formulas

Laplace s Equation. Chapter Mean Value Formulas Chapter 1 Laplace s Equation Let be an open set in R n. A function u C 2 () is called harmonic in if it satisfies Laplace s equation n (1.1) u := D ii u = 0 in. i=1 A function u C 2 () is called subharmonic

More information

A Lower Bound Theorem. Lin Hu.

A Lower Bound Theorem. Lin Hu. American J. of Mathematics and Sciences Vol. 3, No -1,(January 014) Copyright Mind Reader Publications ISSN No: 50-310 A Lower Bound Theorem Department of Applied Mathematics, Beijing University of Technology,

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

The 123 Theorem and its extensions

The 123 Theorem and its extensions The 123 Theorem and its extensions Noga Alon and Raphael Yuster Department of Mathematics Raymond and Beverly Sackler Faculty of Exact Sciences Tel Aviv University, Tel Aviv, Israel Abstract It is shown

More information

3 Integration and Expectation

3 Integration and Expectation 3 Integration and Expectation 3.1 Construction of the Lebesgue Integral Let (, F, µ) be a measure space (not necessarily a probability space). Our objective will be to define the Lebesgue integral R fdµ

More information

14.1 Finding frequent elements in stream

14.1 Finding frequent elements in stream Chapter 14 Streaming Data Model 14.1 Finding frequent elements in stream A very useful statistics for many applications is to keep track of elements that occur more frequently. It can come in many flavours

More information

Topology. Xiaolong Han. Department of Mathematics, California State University, Northridge, CA 91330, USA address:

Topology. Xiaolong Han. Department of Mathematics, California State University, Northridge, CA 91330, USA  address: Topology Xiaolong Han Department of Mathematics, California State University, Northridge, CA 91330, USA E-mail address: Xiaolong.Han@csun.edu Remark. You are entitled to a reward of 1 point toward a homework

More information

Comparison of Orlicz Lorentz Spaces

Comparison of Orlicz Lorentz Spaces Comparison of Orlicz Lorentz Spaces S.J. Montgomery-Smith* Department of Mathematics, University of Missouri, Columbia, MO 65211. I dedicate this paper to my Mother and Father, who as well as introducing

More information

Chapter 1 The Real Numbers

Chapter 1 The Real Numbers Chapter 1 The Real Numbers In a beginning course in calculus, the emphasis is on introducing the techniques of the subject;i.e., differentiation and integration and their applications. An advanced calculus

More information

l 1 -Regularized Linear Regression: Persistence and Oracle Inequalities

l 1 -Regularized Linear Regression: Persistence and Oracle Inequalities l -Regularized Linear Regression: Persistence and Oracle Inequalities Peter Bartlett EECS and Statistics UC Berkeley slides at http://www.stat.berkeley.edu/ bartlett Joint work with Shahar Mendelson and

More information

MATH 51H Section 4. October 16, Recall what it means for a function between metric spaces to be continuous:

MATH 51H Section 4. October 16, Recall what it means for a function between metric spaces to be continuous: MATH 51H Section 4 October 16, 2015 1 Continuity Recall what it means for a function between metric spaces to be continuous: Definition. Let (X, d X ), (Y, d Y ) be metric spaces. A function f : X Y is

More information

The sample complexity of agnostic learning with deterministic labels

The sample complexity of agnostic learning with deterministic labels The sample complexity of agnostic learning with deterministic labels Shai Ben-David Cheriton School of Computer Science University of Waterloo Waterloo, ON, N2L 3G CANADA shai@uwaterloo.ca Ruth Urner College

More information

Class 2 & 3 Overfitting & Regularization

Class 2 & 3 Overfitting & Regularization Class 2 & 3 Overfitting & Regularization Carlo Ciliberto Department of Computer Science, UCL October 18, 2017 Last Class The goal of Statistical Learning Theory is to find a good estimator f n : X Y, approximating

More information

Lecture 6: September 19

Lecture 6: September 19 36-755: Advanced Statistical Theory I Fall 2016 Lecture 6: September 19 Lecturer: Alessandro Rinaldo Scribe: YJ Choe Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have

More information

FORMULATION OF THE LEARNING PROBLEM

FORMULATION OF THE LEARNING PROBLEM FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we

More information