ORIE 6340: Mathematics of Data Science


Damek Davis

Contents

1 Estimation in High Dimensions
  1.1 Tools for understanding high-dimensional sets
    1.1.1 Concentration of volume in high dimensions
    1.1.2 Random sections of high-dimensional convex sets
    1.1.3 Gaussian width
  1.2 Estimation from linear observations
    1.2.1 Estimation based on the M* bound
    1.2.2 Estimation as a (tractable) optimization problem
  1.3 A proof of a general M* bound
    1.3.1 From expectation to overwhelming probability
    1.3.2 Consequences: estimation from noisy measurements
  1.4 Applications
    1.4.1 Sparse recovery for general dictionaries
  1.5 Exact recovery
    1.5.1 The geometrical meaning of exact recovery
    1.5.2 Escape through a mesh
    1.5.3 Exact sparse recovery

1 Estimation in High Dimensions

Disclaimer: This section is heavily based on [2, 10, 11].

The main goal. We wish to estimate a vector x contained in a set K ⊆ R^n, from measurements y_1, ..., y_m of x. The vector x may represent a signal, a parameter of a distribution, or an unknown matrix. The set K encodes prior information on x or properties we want to enforce on x.

Figure 1: Estimating a signal in high dimensions

Brief examples (more later). The goal in each problem is to recover x ∈ K, given as few measurements as possible.

1. (Compressed sensing) The set K is the set of k-sparse vectors, i.e., those with at most k nonzeros. Measurements are linear (a_i Gaussian is typical):

    y_i = ⟨a_i, x⟩,  i = 1, ..., m.

Remarkably, with high probability, x can be efficiently recovered from m = O(k log(n/k)) linear measurements.

2. (Matrix completion) The set K is the set of low-rank matrices. Measurements are a sampling of the entries of x:

    y_i = x_{l_i, k_i},  i = 1, ..., m.

Remarkably, with high probability, if the entries are chosen uniformly at random and x is incoherent, the matrix can be efficiently recovered from O(poly(rank(x), log(n), μ) n) measurements.

3. (Nonlinear measurements) Measurements may be nonlinear, e.g., y_i = sign(⟨a_i, x⟩) or E y_i = θ(⟨a_i, x⟩). The first example is simply logistic regression, while the second is called a generalized linear model.

Low-complexity structure. How many measurements are required for efficient estimation? This depends on the complexity or dimension of K. Intuitively, only m = O(dim(K)) measurements are needed to recover x ∈ K.^1 Is that the best we can do? Certainly not: the set of k-sparse vectors is full-dimensional, yet compressive sensing techniques may be used to recover a sparse signal with O(k log(n/k)) ≪ n measurements.

School of Operations Research and Information Engineering, Cornell University, Ithaca, NY 14850, USA; people.orie.cornell.edu/dsd95/.

^1 Let us define the algebraic dimension of a set to be the dimension of the smallest subspace containing that set.

In general, feasible sets usually have high algebraic dimension, but low complexity. Examples include images, adjacency matrices of networks, and regression coefficients. Oddly, a certain space of 3 × 3 images of high-contrast patches is close to a set that is topologically equivalent to a Klein bottle, which has low complexity [4].

So our goal for this section is threefold:

1. Quantify the complexity of general sets K ⊆ R^n.
2. Show that estimation is possible from few measurements for low-complexity K.
3. Design algorithmically efficient estimators.

1.1 Tools for understanding high-dimensional sets

What do high-dimensional convex bodies look like?^2 Figure 2 illustrates a counterintuitive feature of the high-dimensional ball of volume 1: most of its volume lies within a fixed slab. Somewhat contradictorily, we will see in a moment that most of the volume of the ball also lies near the boundary of the ball.

Figure 2: Balls of volume 1 in varying dimensions. The region contained within the dashed lines is the slab {−1/2 ≤ x_1 ≤ 1/2}. The slab contains 96% of the volume of each ball.

Heuristic. Convex bodies consist of two parts: the bulk and the outliers (see Figure 3). The bulk makes up most of the volume, but has small diameter (it usually looks like a ball); the outliers contribute little to volume but are large in diameter.

^2 A convex body is a closed, convex, bounded set with nonempty interior.

Figure 3: V. Milman's hyperbolic drawings of high-dimensional convex sets

For example, the Euclidean ball B inscribed in the ℓ_1 ball K = B_1 = {x : ||x||_1 ≤ 1} has radius 1/√n, but

    vol(B)^{1/n} ≍ vol(K)^{1/n} ≍ 1/n.

Conclusion: the ball B, perhaps inflated by a constant factor, forms the bulk of K. The outliers of K are the tentacles shown in Figure 3, which extend far beyond B in the coordinate directions. We can argue more rigorously using concentration inequalities.

1.1.1 Concentration of volume in high dimensions

Consider an isotropic convex body K, meaning that a random vector X distributed uniformly in K satisfies E[X] = 0 and E[XX^T] = I_n. Through translation and scaling, any convex body can be made isotropic (in other words, this is not a restrictive assumption).

A first result. At least 90% of the volume of K lies within the Euclidean ball of radius √(10n). Indeed, E[||X||^2] = Tr(E[XX^T]) = Tr(I_n) = n. Thus, Markov's inequality says that

    P(||X||_2 ≥ √(10n)) ≤ E[||X||^2] / (10n) = 0.1.

A much more powerful concentration result shows that the bulk of the volume of K lies near the sphere of radius √n.

Theorem 1.1 (Volume distribution in high-dimensional sets). For X and K as above, there exist absolute constants c, C > 0 such that the following are true:

1. (Concentration of volume) For t ≥ 1, we have

    P(||X||_2 ≥ t√n) ≤ exp(−ct√n);

2. (Thin shell) For every ε ∈ (0, 1), we have

    P(| ||X||_2 − √n | > ε√n) ≤ C exp(−cε^3 n^{1/2}).

Example: volume distribution in the hypercube. Let K = [−√3, √3]^n be the isotropic hypercube. Then Theorem 1.1 implies that most of the volume lies near the corners of the hypercube (these points have norm Θ(√n)). On the other hand, almost no volume lies near the centers of its facets (these points have norm Θ(1)).

1.1.2 Random sections of high-dimensional convex sets

What do random sections of high-dimensional convex sets look like? A useful (sometimes incorrect) heuristic is that the bulk of a convex body is a Euclidean ball. Thus, if E is a random low-dimensional subspace, we should expect that E misses the outliers and that the intersection E ∩ K looks like a ball (see Figure 4). This is the content of Dvoretsky's theorem.

Figure 4: A random section of a high-dimensional convex set

Theorem 1.2 (Dvoretsky's Theorem). Let K ⊆ R^n be an origin-symmetric convex body whose maximal volume inscribed ellipsoid is the Euclidean ball. Let ε ∈ (0, 1), and let E be a uniformly random subspace (with respect to the Haar measure) of dimension d = cε^2 log n. Then there exists an R > 0 such that, with probability 0.99, we have

    (1 − ε)B(R) ⊆ K ∩ E ⊆ (1 + ε)B(R),

where B(R) ⊆ E is the Euclidean ball of radius R in the subspace E.
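As a numerical illustration of Theorem 1.2 (this sketch and its parameters are ours, not part of the notes), one can sample a Haar-random two-dimensional subspace and compute the radial function of its section of the ℓ_1 ball: for an orthonormal pair (u_1, u_2) spanning E, the section K ∩ E has radial function r(θ) = 1/||cos(θ)u_1 + sin(θ)u_2||_1, and its max/min ratio approaches 1 as n grows, i.e., the section becomes nearly round. In Python with numpy:

    import numpy as np

    # Sketch: random 2D sections of the l1 ball B_1^n look nearly round
    # for large n, as Dvoretsky's theorem predicts.
    rng = np.random.default_rng(1)

    def roundness(n, num_angles=720):
        # Orthonormal basis of a Haar-random 2-dimensional subspace.
        Q, _ = np.linalg.qr(rng.normal(size=(n, 2)))
        u1, u2 = Q[:, 0], Q[:, 1]
        thetas = np.linspace(0, 2 * np.pi, num_angles, endpoint=False)
        r = 1.0 / np.array([np.abs(np.cos(t) * u1 + np.sin(t) * u2).sum()
                            for t in thetas])
        return r.max() / r.min()   # equals 1 for a perfect disk

    for n in (10, 100, 10000):
        print(n, roundness(n))     # the ratio decreases toward 1 as n grows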

John's theorem guarantees that every convex body contains an ellipsoid of maximal volume. Any ellipsoid may be mapped to a Euclidean ball through an affine transformation. Thus, up to an affine transformation, the assumptions of Dvoretsky's theorem are pretty mild.

What about high-dimensional sections? High-dimensional subspaces are more likely to intersect the outliers of K, so we should not expect such sections to be round. Can we estimate other properties of K ∩ E? For example, diam(K ∩ E)? We will see that the diameter of random sections is intimately connected to estimation in high dimensions. One way to get at this quantity is through the mean width of a set.

1.1.3 Gaussian width

The mean width and its variations are some of the most important concepts that we will learn about in this course. It will reappear later, for example, when we study statistical learning theory, estimation problems, and sketching.

Important. In the sequel, we no longer assume K is a convex body; it may be any bounded set.

Figure 5: The width of a set in direction η.

Figure 5 depicts the width of a set K along a direction η ∈ S^{n−1}. The width in direction η may be expressed through the following formula:

    sup_{u,v ∈ K} ⟨η, u − v⟩ = sup_{z ∈ K−K} ⟨η, z⟩,

where K − K = {u − v : u, v ∈ K} is the Minkowski sum of K and −K. This shows that the width may be expressed through the support function of K, a fundamental object in convex analysis:

    σ_K(η) + σ_K(−η) = sup_{z ∈ K−K} ⟨η, z⟩,

where σ_K(η) = sup_{u ∈ K} ⟨η, u⟩. The spherical mean width is then simply the average width over all directions,

    w_S(K) = E [ sup_{z ∈ K−K} ⟨η, z⟩ ],

where η is distributed uniformly on the sphere. While the mean width has an intuitive geometric description, the Gaussian width is a bit simpler to work with and has similar importance.

Definition 1.3. Let K ⊆ R^n, and let g ~ N(0, I_n) be a standard Gaussian vector. Then the Gaussian width of K is defined as

    w(K) = E [ sup_{z ∈ K−K} ⟨g, z⟩ ].

Note. It is common to see other variants of the Gaussian width, for example,

    E [ sup_{z ∈ K} ⟨g, z⟩ ]   and   ( E [ sup_{z ∈ K} ⟨g, z⟩^2 ] )^{1/2}.

It can easily be shown that these definitions are asymptotically equivalent to the Gaussian width defined above.

Relation between spherical and Gaussian widths. Rotation invariance of the Gaussian distribution shows that the random variable ||g|| is independent of the random vector η = (1/||g||) g, which happens to be uniformly distributed on the sphere. Thus,

    w(K) = E [ ||g|| sup_{z ∈ K−K} ⟨η, z⟩ ] = E[||g||] w_S(K).

Since E[||g||] ≈ √n, it follows that w(K) ≈ √n · w_S(K).

Invariance properties. The Gaussian width is invariant under several types of transformations:

Proposition 1.4 (Invariance properties of Gaussian width). The Gaussian width is invariant under translations, orthogonal linear transformations, and taking convex hulls.

The last property is important: the Gaussian width does not distinguish between convex and nonconvex sets: w(K) = w(conv(K)). This fact will justify convexification procedures for estimation problems over nonconvex sets.

Example 1.1. We will now briefly compute a few examples of the Gaussian width. As some of these calculations will appear on your homework, we will not go through all of the details here:

1. (The Euclidean ball.) If K = B_2^n or K = S^{n−1}, we have

    w(K) = E [ sup_{u,v ∈ K} ⟨g, u − v⟩ ] = 2 E [ sup_{u ∈ K} ⟨g, u⟩ ] = 2 E[||g||_2] ≍ √n.

2. (Sets with algebraic dimension d.) Suppose K ⊆ B_2^n is contained in a d-dimensional subspace. Then K is also contained in a d-dimensional ball, so by rotation invariance, we have w(K) ≤ 2√d.

3. (Hypercube.) Let K = [−1, 1]^n. Then

    w(K) = E [ sup_{u,v ∈ K} ⟨g, u − v⟩ ] = 2 E [ sup_{u ∈ K} ⟨g, u⟩ ] = 2 E[||g||_1] = 2√(2/π) n,

where the third equality follows from duality and the fourth is a calculation.

4. (ℓ_1-ball.) Let K = B_1^n. Then

    w(K) = E [ sup_{u,v ∈ K} ⟨g, u − v⟩ ] = 2 E [ sup_{u ∈ K} ⟨g, u⟩ ] = 2 E[||g||_∞] ≍ √(log n),

where the third equality follows from duality and the fourth is a calculation.

5. (Finite sets.) Let K ⊆ B_2^n be a finite set. Then w(K) ≲ √(log |K|). (Independent of dimension!) This will be a homework exercise.

6. (Sparsity.) Let K be the set of s-sparse unit vectors:

    K = {x ∈ R^n : ||x||_2 = 1, ||x||_0 ≤ s},

where ||x||_0 denotes the number of nonzero elements of x. Then

    w(K) ≲ √(s log(2n/s)).

This will be a homework exercise.

7. (Low rank.) Let K be the set of d_1 × d_2 matrices of rank at most r and unit Frobenius norm:

    K = {X : rank(X) ≤ r, ||X||_F = 1}.

Then, we will later see that w(K) ≲ √(r(d_1 + d_2)).
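The hypercube and ℓ_1-ball calculations above are easy to check numerically, since both suprema have closed forms (2||g||_1 and 2||g||_∞ respectively). The following Monte Carlo sketch (ours, not part of the notes; dimensions and sample sizes are illustrative) compares the empirical averages with the formulas in items 3 and 4. In Python with numpy:

    import numpy as np

    # Monte Carlo check of the hypercube and l1-ball Gaussian widths.
    rng = np.random.default_rng(0)
    n, trials = 1000, 2000
    G = rng.normal(size=(trials, n))            # rows are independent g ~ N(0, I_n)

    w_cube = 2 * np.abs(G).sum(axis=1).mean()   # estimate of 2 E||g||_1
    w_l1ball = 2 * np.abs(G).max(axis=1).mean() # estimate of 2 E||g||_inf

    print(w_cube, 2 * np.sqrt(2 / np.pi) * n)       # exact value for the cube
    print(w_l1ball, 2 * np.sqrt(2 * np.log(n)))     # leading-order value for B_1^n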

Odd behavior of the width. The spherical width of B_1^n is much smaller than its diameter:

    w_S(B_1^n) ≍ √(log(n)/n) ≪ 2.

Moreover, the Gaussian width of B_1^n is, up to a logarithmic factor, on the order of the Gaussian width of its inscribed ball. This is odd since B_1^n looks much larger than its inscribed ball; see Figure 6. On the other hand, the Gaussian width of the hypercube is roughly the same as the Gaussian width of its circumscribed ball √n B_2^n. What's going on? The hypercube has 2^n vertices, so intuitively (and rigorously, according to Theorem 1.1), most of its volume must concentrate there. Moreover, the hypercube and its circumscribed ball have roughly the same volume; hence, they are close. On the other hand, the ball B_1^n has only 2n vertices, and so with high probability^3, a random Gaussian vector is nearly orthogonal to all of them. Consequently, the width in a Gaussian direction η is not really influenced by the tentacles.

Figure 6: Odd behavior of Gaussian width.

Squared width as a stable dimension. For any bounded set K, we may define

    h(K) := E [ sup_{z ∈ K−K} ⟨g, z⟩^2 ].

It is easy to show that h(K) ≥ w(K)^2. Define the stable dimension of K to be the quantity

    d(K) = h(K) / diam(K)^2.

The stable dimension acts as a robust variant of the algebraic dimension, which more accurately captures the complexity of the underlying set. In fact, using item 2 above, we automatically get the following bound:

    w(K)^2 / diam(K)^2 ≲ algebraic dimension of K.

A slightly stronger result holds for the stable dimension.

^3 Use a simple union bound.

Lemma 1.5. The stable dimension of a bounded set is always bounded by the algebraic dimension.

This lemma will appear as an (easy) homework problem.

Mean width from a single realization? Gaussian concentration implies that

    w(K, g) = sup_{z ∈ K−K} ⟨g, z⟩

concentrates tightly about its mean, w(K), with high probability. Thus, to estimate the mean, we simply sample one g ~ N(0, I_n) and compute the supremum. Since w(K) = w(conv(K)), the supremum may be computed by solving a convex optimization problem.

Bounds on Gaussian width: connections to covering numbers. The covering number N(K, t) of a set K is the minimal number of balls of radius t whose union covers K. The Gaussian width is deeply connected to covering numbers through the following theorem.

Theorem 1.6 (Sudakov's and Dudley's inequalities). For any bounded subset K ⊆ R^n, we have

    sup_{t > 0} t √(log N(K, t)) ≲ w(K) ≲ ∫_0^∞ √(log N(K, t)) dt.

The lower bound in the above theorem is known as Sudakov's inequality and the upper bound is known as Dudley's inequality.

Random sections of small codimension: the M* bound. Returning to our question, let us bound the diameter of a random section of K.

Theorem 1.7 (M* bound). Let K ⊆ R^n be a bounded set. Let E be a uniformly random subspace of codimension m. Then

    E diam(K ∩ E) ≲ w(K)/√m.

To get a feel for this bound, let's think about the two extremes:

1. (m = Ω(n)) In this case, E diam(K ∩ E) ≲ w(K)/√n ≈ w_S(K). In other words, the size of a constant-dimensional random section is bounded by the spherical mean width. This suggests that low-dimensional subspaces pass through the bulk of K, but ignore the outliers (see Figure 4).

2. (m = O(1)) In this case, E diam(K ∩ E) ≲ w(K) ≈ √n · w_S(K). One interpretation of this bound is that, beyond the bulk, one can pick up an extra factor of √n in diameter from the tentacles.

The first bound of this sort was proven by V. Milman [5]; the statement presented here is due to Pajor and Tomczak-Jaegermann [9]. We will get back to the proof of a general version of this bound, but first let's dive into some consequences.

1.2 Estimation from linear observations

Recall that our goal is to estimate an unknown vector x ∈ K ⊆ R^n given some vector of measurements y = (y_1, ..., y_m) ∈ R^m whose coordinates are i.i.d. draws of a random function of x. In this section, we will study a simple model where the observations y_i come from Gaussian linear functions. In particular,

    y_i = ⟨a_i, x⟩,  i = 1, ..., m,

where the a_i ~ N(0, I_n) are standard Gaussian. We can rewrite this in vector form as y = Ax, where a_i is the i-th row of A. Note that A is full rank with probability one, so if m > n the problem of recovering x is trivial. The problem becomes interesting when the number of measurements is smaller than the dimension, i.e., m < n. Without additional restrictions the problem is ill-posed. Hence, we need the constraint x ∈ K to enforce additional structure.

Figure 7: Feasibility problem: estimating x in the intersection of K and {x' : Ax' = y}

1.2.1 Estimation based on the M* bound

In this setting we only have two pieces of information about x:

1. It satisfies Ax = y;
2. It belongs to K.

It is natural to define an estimator given by the following feasibility problem:

    Find x̂ ∈ K such that A x̂ = y;  (1.1)

see Figure 7 for an illustration. Now, how good of an estimate is x̂?

Theorem 1.8. Assume that K is a closed bounded subset of R^n and A is an m × n Gaussian matrix^4 with m < n. Then, the estimator given by (1.1) satisfies

    E sup_{x ∈ K} ||x̂ − x||_2 ≲ w(K)/√m.

Proof. This is a direct consequence of the M* bound, Theorem 1.7. We will make use of the following well-known fact: the random subspace E = ker A is uniformly distributed over the set of subspaces of dimension n − m. Since x̂, x ∈ K and A(x̂ − x) = 0, we have x̂ − x ∈ (K − K) ∩ E, and this set is symmetric, so ||x̂ − x||_2 ≤ (1/2) diam((K − K) ∩ E). It is easy to see that w(K − K) ≤ 2w(K). Then, using Theorem 1.7, we deduce

    E sup_{x ∈ K} ||x̂ − x||_2 ≤ (1/2) E diam((K − K) ∩ E) ≲ w(K − K)/(2√m) ≤ w(K)/√m.

1.2.2 Estimation as a (tractable) optimization problem

The next question we would like to answer is: how can we compute the estimator in (1.1)? A first step towards this goal is to substitute this feasibility problem with an optimization problem. To do so, we need to introduce an additional (mild) assumption on K. From now on we assume that K has nonempty interior and is star-shaped, i.e., the inclusion tK ⊆ K holds for all t ∈ [0, 1]. This leads us to define the Minkowski functional of K as the function ||·||_K : R^n → R given by

    ||x||_K = inf{λ > 0 : λ^{-1} x ∈ K}.

Minkowski functionals are standard notions in geometric functional analysis and convex analysis. It is not hard to see that under the current assumptions on K, the functional ||·||_K is continuous and positively homogeneous.^5 By definition, we have K = {x : ||x||_K ≤ 1}. Moreover, if K is a symmetric convex body^6, then ||·||_K defines a norm. With this notation in hand we can introduce the optimization problem

    x̂ ∈ arg min ||x'||_K subject to Ax' = y.  (1.2)

^4 As we described above, its entries are i.i.d. standard normal random variables.
^5 ||αx||_K = α ||x||_K for all α > 0.
^6 K is convex, bounded, closed, origin-symmetric (K = −K), and has nonempty interior.
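To make (1.2) concrete, here is a minimal sketch (ours, not part of the notes) for one particular choice of K: a scaled ℓ_1 ball, for which ||·||_K is proportional to the ℓ_1 norm and (1.2) becomes the familiar basis pursuit program. It assumes the cvxpy and numpy packages; the dimensions and data are illustrative.

    import cvxpy as cp
    import numpy as np

    # Program (1.2) with K an l1 ball: minimize ||x||_1 subject to Ax = y.
    rng = np.random.default_rng(0)
    n, m, s = 200, 80, 5
    x_true = np.zeros(n)
    x_true[rng.choice(n, s, replace=False)] = rng.normal(size=s)   # s-sparse signal
    A = rng.normal(size=(m, n))                                    # Gaussian measurements
    y = A @ x_true

    x = cp.Variable(n)
    problem = cp.Problem(cp.Minimize(cp.norm(x, 1)), [A @ x == y])
    problem.solve()
    print(np.linalg.norm(x.value - x_true))    # estimation error ||x_hat - x||_2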

This program leads to the following guarantee.

Theorem 1.9. Let K be a star-shaped bounded closed set with nonempty interior. Then, the estimator x̂ given by (1.2) satisfies

    E sup_{x ∈ K} ||x̂ − x||_2 ≲ w(K)/√m.

Proof. By Theorem 1.8 it suffices to check that x̂ ∈ K. This follows immediately since ||x̂||_K ≤ ||x||_K ≤ 1, by definition.

Convex relaxations. The issue with the previous optimization problem is that it could be hard to solve (in fact, solving general nonconvex programs is NP-hard). This raises the question: how do we devise a computationally tractable estimator? If K is a convex body, the problem becomes a convex program and we can use off-the-shelf solvers, like interior-point methods, subgradient algorithms, or proximal-splitting methods. Furthermore, if K is a polytope, (1.2) can be cast as a linear program, which opens the door to even faster algorithms. Given the invariance of the Gaussian width under convex hulls, it is natural to consider convexifying the set K. Define the convex relaxation

    x̂ ∈ arg min ||x'||_{conv(K)} subject to Ax' = y.  (1.3)

We recover exactly the error bound of Theorem 1.9: to see this, note that the estimator x̂ defined in (1.3) satisfies

    E sup_{x ∈ K} ||x̂ − x||_2 ≤ E sup_{x ∈ conv(K)} ||x̂ − x||_2 ≲ w(conv(K))/√m = w(K)/√m.

Program (1.3) gives a tractable approach for many convex bodies K. Indeed, there are some convex bodies for which computing ||·||_K is hard. Later we will see some relevant examples with tractable relaxations.

Information-theoretic aspects. The aforementioned results imply that if we fix a desired accuracy, E sup_{x ∈ K} ||x̂ − x||_2 ≤ ε, then m ≳ w^2(K) observations suffice, where the hidden constant depends on the accuracy. It is worth noting that this result is uniform, in the sense that we can use Markov's inequality to ensure that with fixed probability, say 0.9, the estimation is simultaneously accurate for all vectors x ∈ K. Later we will see that the actual probability is much better; it approaches one exponentially fast as a function of m.

1.3 A proof of a general M* bound

Now we will give a proof of a generalization of the M* bound, Theorem 1.7.

Theorem 1.10 (General M* bound). Let T ⊆ R^n be a bounded set. Let A be an m × n Gaussian matrix. Fix ε > 0 and define the set

    T_ε = {u ∈ T : (1/m) ||Au||_1 ≤ ε}.  (1.4)

Then

    E sup_{u ∈ T_ε} ||u||_2 ≤ √(8π/m) E sup_{u ∈ T} |⟨g, u⟩| + √(π/2) ε,  (1.5)

where g is a standard Gaussian random vector.

Exercise 1. Why does the previous statement generalize Theorem 1.7?

Before we go on, some comments are in order.

- This theorem will allow us to handle noisy measurements of the form y = Ax + ν.
- For the conclusion (1.5) we are assuming the convention that the supremum over the empty set is 0.
- A more general statement holds: one can relax the assumption on the distribution of the entries of A to sub-gaussianity. Nonetheless, we will not go into this trickier setting; we refer the interested reader to [10, Section 8].

To prove the bound we will need a couple of simple tools from empirical processes. Recall that a stochastic process is a collection of random variables (Z(t))_{t ∈ T} over the same probability space. The indexing set could denote time, as it does for Brownian motion, or it could be a subset of R^n, as we saw for example with the Gaussian width.

Proposition 1.11. Consider a finite collection of stochastic processes Z_1(t), ..., Z_m(t) indexed by t ∈ T. Let ε_i be independent Rademacher random variables.^7 Then the following hold:

- (Symmetrization)  E sup_{t ∈ T} | Σ_{i=1}^m [Z_i(t) − E Z_i(t)] | ≤ 2 E sup_{t ∈ T} | Σ_{i=1}^m ε_i Z_i(t) |;
- (Contraction)  E sup_{t ∈ T} | Σ_{i=1}^m ε_i |Z_i(t)| | ≤ 2 E sup_{t ∈ T} | Σ_{i=1}^m ε_i Z_i(t) |.

These statements are not too hard to prove, and more general statements can be found in [11] or [3].

Proof of Theorem 1.10. The conclusion (1.5) would follow if we proved the deviation inequality

    E sup_{u ∈ T} | (1/m) Σ_{i=1}^m |⟨a_i, u⟩| − √(2/π) ||u||_2 | ≤ (4/√m) E sup_{u ∈ T} |⟨g, u⟩|.  (1.6)

^7 P(ε_i = 1) = P(ε_i = −1) = 1/2.

If (1.6) holds for T, then it also holds with T_ε ⊆ T inside the supremum on the left-hand side. Moreover, for u ∈ T_ε we have

    (1/m) Σ_{i=1}^m |⟨a_i, u⟩| = (1/m) ||Au||_1 ≤ ε.

Hence (1.5) follows by an application of the triangle inequality: bound √(2/π)||u||_2 by the deviation plus (1/m)||Au||_1, take suprema and expectations, and multiply through by √(π/2).

Rotation invariance gives

    E |⟨a_i, u⟩| = √(2/π) ||u||_2.

Hence, by symmetrization and contraction, we can bound the left-hand side of (1.6):

    E sup_{u ∈ T} | (1/m) Σ_{i=1}^m ( |⟨a_i, u⟩| − E|⟨a_i, u⟩| ) |
        ≤ 2 E sup_{u ∈ T} | (1/m) Σ_{i=1}^m ε_i |⟨a_i, u⟩| |
        ≤ 4 E sup_{u ∈ T} | (1/m) Σ_{i=1}^m ε_i ⟨a_i, u⟩ |.  (1.7)

Observe that ε_i a_i and a_i have the same distribution; hence g := (1/√m) Σ_{i=1}^m ε_i a_i is a standard Gaussian vector. Thus the last quantity can be written as

    (4/√m) E sup_{u ∈ T} |⟨g, u⟩|,

proving the result.

1.3.1 From expectation to overwhelming probability

As it turns out, one can use the M* bound to derive high-probability guarantees via concentration of measure. For this, we will need the Gaussian concentration inequality.

Proposition 1.12. Assume that g ~ N(0, I_m) is a Gaussian vector and let f : R^m → R be an L-Lipschitz function. Then

    P(|f(g) − E f(g)| ≥ t) ≤ 2 exp(−t^2 / (2L^2)).

For a proof of this result see, for example, [3]. With it we get the next result.

Theorem 1.13. Let T be a bounded set. Let A be an m × n Gaussian matrix. Fix ε > 0 and define the set

    T_ε = {u ∈ T : (1/m) ||Au||_1 ≤ ε}.  (1.8)

Then

    sup_{u ∈ T_ε} ||u||_2 ≤ √(8π/m) E sup_{u ∈ T} |⟨g, u⟩| + √(π/2) (ε + t)  (1.9)

with probability at least 1 − 2 exp(−m t^2 / (2 max_{u ∈ T} ||u||_2^2)).

Exercise 2. Prove the previous theorem. Hint: Consider the function

    f(A) = sup_{u ∈ T} | (1/m) Σ_{i=1}^m |⟨a_i, u⟩| − √(2/π) ||u||_2 |.

1.3.2 Consequences: estimation from noisy measurements

Let us use the new M* bound in a slightly more general context. Assume we observe measurements

    y = Ax + ν,

where ν models bounded noise and satisfies (1/m) ||ν||_1 = (1/m) Σ_{i=1}^m |ν_i| ≤ ε. The noise ν is unknown and arbitrary; in particular, it could be correlated with A or x. To handle the noise we could consider the feasibility problem

    Find x̂ ∈ K such that (1/m) ||A x̂ − y||_1 ≤ ε,  (1.10)

leading to the following guarantee.

Theorem 1.14. If x̂ is a solution to (1.10), then

    E sup_{x ∈ K} ||x̂ − x||_2 ≤ √(8π) ( w(K)/√m + ε ).  (1.11)

Proof. Let's apply the M* bound to T = K − K and with 2ε instead of ε. We get

    E sup_{u ∈ T_{2ε}} ||u||_2 ≤ √(8π/m) E sup_{u ∈ T} |⟨g, u⟩| + √(2π) ε ≤ √(8π) ( w(K)/√m + ε ),

where the inequality follows from the definition of the Gaussian width and the fact that T is symmetric. To finish the proof we need to ensure that x̂ − x ∈ T_{2ε} for any x ∈ K. By construction, x̂, x ∈ K, so x̂ − x ∈ T. Now, by the triangle inequality and the constraint on x̂,

    (1/m) ||A(x̂ − x)||_1 = (1/m) ||A x̂ − y + ν||_1 ≤ (1/m) ||A x̂ − y||_1 + (1/m) ||ν||_1 ≤ 2ε.

Consequently, E sup_{x ∈ K} ||x̂ − x||_2 ≤ E sup_{u ∈ T_{2ε}} ||u||_2, finishing the proof.

Similarly to before, if K is a closed star-shaped bounded set with nonempty interior, we define the optimization problem

    x̂ ∈ arg min ||x'||_K subject to (1/m) ||Ax' − y||_1 ≤ ε,  (1.12)

and recover the following theorem.

Theorem 1.15. Let x̂ be a solution of (1.12). Then

    E sup_{x ∈ K} ||x̂ − x||_2 ≤ √(8π) ( w(K)/√m + ε ).

Again, we can take the convex relaxation of (1.12) and still get the same error bound.
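As with (1.2), the noisy program (1.12) is easy to set up with an off-the-shelf convex solver. The sketch below (ours, not part of the notes) again specializes to K a scaled ℓ_1 ball, so the objective is the ℓ_1 norm; it assumes the cvxpy and numpy packages, and the noise level and dimensions are illustrative.

    import cvxpy as cp
    import numpy as np

    # Program (1.12) with K an l1 ball: min ||x||_1 s.t. (1/m)||Ax - y||_1 <= eps.
    rng = np.random.default_rng(1)
    n, m, s, eps = 200, 80, 5, 0.1
    x_true = np.zeros(n)
    x_true[rng.choice(n, s, replace=False)] = rng.normal(size=s)
    A = rng.normal(size=(m, n))
    nu = eps * rng.uniform(-1, 1, size=m)     # bounded noise: (1/m)||nu||_1 <= eps
    y = A @ x_true + nu

    x = cp.Variable(n)
    problem = cp.Problem(cp.Minimize(cp.norm(x, 1)),
                         [cp.norm(A @ x - y, 1) / m <= eps])
    problem.solve()
    print(np.linalg.norm(x.value - x_true))   # compare with the error bound (1.11)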

1.4 Applications

Next, we will discuss explicit applications of the M* bound.

1.4.1 Sparse recovery for general dictionaries

In some fields, such as signal processing and harmonic analysis, it is often convenient to consider redundant data representations. One way to achieve this is by considering a dictionary, i.e., an arbitrary collection of vectors v_1, ..., v_N ∈ R^n that span R^n (they could be linearly dependent). Of course, the choice of the dictionary depends on the application. For a deeper introduction to these ideas see, for example, [6]. Given the redundancy, it is natural to wonder about sparse representations. We say that a vector x ∈ R^n is s-sparse if it can be written as a linear combination of at most s vectors in the dictionary, i.e., x = Σ_{i=1}^N α_i v_i with at most s nonzero coefficients α_i ∈ R. Our goal now is to recover such a sparse representation from noisy measurements. Just as before,

    y = Ax + ν  with  (1/m) ||ν||_1 ≤ ε.

Based on our previous success with convex programs, we consider the problem

    arg min ||α||_1 subject to (1/m) ||Ax − y||_1 ≤ ε,  x = Σ_{i=1}^N α_i v_i.  (1.13)

Exercise 3. Define K̃ := conv{±v_i}_{i=1}^N. Prove that

    ||x||_{K̃} = min{ ||α||_1 : x = Σ_{i=1}^N α_i v_i }  for all x ∈ R^n.

Theorem 1.16. Assume that all dictionary vectors satisfy ||v_i||_2 ≤ 1. Let x̂ be a solution of the convex problem (1.13). Then

    E ||x̂ − x||_2 ≤ C ||α||_2 √(s log N / m) + √(2π) ε.

Proof. Fix x ∈ R^n; we will apply the M* bound to the polytope K := ||α||_1 K̃. By assumption x ∈ K. Hence, by Exercise 3, the two problems (1.12) and (1.13) are equivalent. Thus, we immediately get the bound (1.11).

Next, let us compute the Gaussian width appearing in this error bound. By invariance under convexification and Example 1.1 we deduce

    w(K) = ||α||_1 w(K̃) ≤ C ||α||_1 √(log N) ≤ C √s ||α||_2 √(log N),

where the last inequality follows since for any s-sparse vector α we have ||α||_1 ≤ √s ||α||_2. Substituting this into (1.11) gives the desired result.

We could generalize this proof to arbitrary dictionaries at the expense of an extra factor of max_i ||v_i||_2. An important advantage of (1.13) is that it can be cast as a linear program (how?) and thus we can rely on very fast solvers! Another positive feature of this formulation is that it automatically gives us a representation α̂. Since we are only approximating x, there is a priori no reason to believe that α̂ will be sparse. Nonetheless, one can prove^8, using a simple dimensionality argument, that if x is s-sparse in the dictionary then α̂ will be s-sparse.

Recovery for the canonical dictionary. To crystallize ideas, let us consider one of the simplest dictionaries, the canonical basis:

    v_i = e_i for all i = 1, ..., n.

In this case sparsity in the dictionary coincides with sparsity of the vector itself, and (1.13) becomes

    arg min ||x'||_1 subject to (1/m) ||Ax' − y||_1 ≤ ε.  (1.14)

As a direct corollary of the previous theorem we obtain:

Corollary 1.17. Fix an s-sparse vector x ∈ R^n and let x̂ be a solution of (1.14). Then

    E ||x̂ − x||_2 ≤ C ||x||_2 √(s log n / m) + √(2π) ε.

Takeaway. Using linear programming, we can approximately recover an s-sparse vector in a general dictionary of size N from m ≍ s log N random measurements.

1.5 Exact recovery

Previously, we saw guarantees on the accuracy of our estimator. Surprisingly, in some scenarios it is possible to ensure perfect recovery, i.e., x̂ = x, with overwhelming probability. Assume for this section that we have noiseless measurements

    y = Ax,

where A is an m × n Gaussian matrix.

^8 Under very mild regularity conditions.

(a) Descent cone D(K, x) and E_x. (b) Illustration of the recovery condition in terms of the spherical cap S(K, x).

Figure 8: Geometrical concepts involved in the exact recovery condition

1.5.1 The geometrical meaning of exact recovery

In order to derive results we need to find a characterization of exact recovery. Remember that we have two pieces of information. The first one tells us that the signal we would like to recover lies in the set K, and the second one tells us that it belongs to the affine space E_x = {x' : Ax' = y}. These two pieces completely determine x if, and only if,

    K ∩ E_x = {x}.  (1.15)

We cannot hope to capture this with the M* bound; in fact, this equality implies the diameter of the section is zero. How do we describe this phenomenon geometrically? To simplify the scenario, let us assume for now that K is convex. Then it is clear that exact recovery (1.15) holds if and only if the affine subspace E_x is tangent to K at x. Notice that even when K is nonconvex this equivalence still holds locally, i.e., if we restrict (intersect) everything to a sufficiently small ball around x. Hence, we will lose nothing if we replace K by its tangent cone, also known as the descent cone: the cone of all directions that go into K emanating from x,

    D(K, x) := cone{z − x : z ∈ K},  where cone(C) = ∪_{t ≥ 0} tC.

Translating condition (1.15) to zero gives

    (K − x) ∩ (E_x − x) = {0}.  (1.16)

Locally (and globally for convex sets), this is equivalent to the same condition with D(K, x) in place of (K − x); noting that E_x − x = ker A, we arrive at

    D(K, x) ∩ ker(A) = {0}.  (1.17)

For our purposes, the descent cone can be substituted by its intersection with the sphere, i.e.,

    S(K, x) := D(K, x) ∩ S^{n−1} = { (z − x)/||z − x||_2 : z ∈ K, z ≠ x }.

Hence we obtain the following equivalent form of exact recovery (see Figure 8):

    S(K, x) ∩ ker A = ∅.

1.5.2 Escape through a mesh

Studying the likelihood of a nonempty intersection between two sets has a long history and has been widely studied in geometric probability; see, for example, the fantastic book by Klain and Rota [8]. Indeed, for the specific case of a subspace and a spherical cap there exists a sharp result; this is known as the escape through a mesh theorem and is due to Gordon [7]. The theorem is stated in terms of a slightly different version of the Gaussian width,

    w̄(S) = E sup_{u ∈ S} ⟨g, u⟩.

Theorem 1.18 (Escape through a mesh). Let S be a fixed subset of S^{n−1} and let E be a uniformly random subspace of R^n of fixed codimension m. Assume that w̄(S) < √m. Then

    S ∩ E = ∅

with probability at least 1 − 3.5 exp(−(√m − w̄(S))^2 / 18).

This theorem is much stronger than the M* bound. In fact, one can recover a similar statement using that bound, but one can only ensure that P(S ∩ E ≠ ∅) ≲ w̄(S)/√m.

Now, how do we convert this into an algorithmic statement? We already developed estimators that are able to find a point x̂ in K ∩ E_x, either by solving a feasibility problem or an optimization problem. By the discussion above, if we assume convexity of K, such a point is the true signal, x̂ = x, if, and only if, S(K, x) ∩ ker A = ∅. Thus, we immediately get the following theorem.

Theorem 1.19. Assume that K is a symmetric convex body and let x̂ be the solution of (1.1) or (1.2). Further, suppose that the number of measurements satisfies

    √m > w̄(S(K, x)).

Then x̂ = x with overwhelming probability (the same as in Theorem 1.18).

Takeaway. Provided we can solve the associated optimization problem, exact recovery is possible whenever the number of observations exceeds the squared Gaussian width of D(K, x) ∩ S^{n−1}.
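To see Theorem 1.19 in action for the ℓ_1 ball (the case worked out in the next subsection), one can run the following experiment, sketched here for illustration (ours, not part of the notes; it assumes the cvxpy and numpy packages, and the sizes and success tolerance are arbitrary): for an s-sparse x, solve the noiseless ℓ_1 program for a range of m and record how often the solution coincides with x.

    import cvxpy as cp
    import numpy as np

    # Empirical exact-recovery rate of l1 minimization as m varies.
    rng = np.random.default_rng(0)
    n, s, trials = 100, 5, 20

    def recovery_rate(m):
        successes = 0
        for _ in range(trials):
            x_true = np.zeros(n)
            x_true[rng.choice(n, s, replace=False)] = rng.normal(size=s)
            A = rng.normal(size=(m, n))
            x = cp.Variable(n)
            cp.Problem(cp.Minimize(cp.norm(x, 1)), [A @ x == A @ x_true]).solve()
            successes += np.linalg.norm(x.value - x_true) < 1e-5
        return successes / trials

    for m in (10, 20, 40, 80):
        print(m, recovery_rate(m))
    # Recovery becomes reliable once m exceeds a constant multiple of s*log(n/s),
    # matching the phase transition picture discussed below.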

Recurrent takeaway. The Gaussian width measures the sample complexity of these estimation problems!

1.5.3 Exact sparse recovery

Let us demonstrate the power of this result by considering a concrete example. Consider again the sparse recovery problem. Let x be an s-sparse vector in R^n and define the set

    K = ||x||_1 B_1^n = {x' : ||x'||_1 ≤ ||x||_1}.

One can write the descent cone as

    D(K, x) = cone{z : ||x + z||_1 ≤ ||x||_1}.^9

Then, after some bits of magic math, one gets (you will have to do this in a homework)

    w̄(S(K, x)) ≲ √(s log(2n/s)).

Thanks to our construction, ||·||_K = (1/||x||_1) ||·||_1 (why?), so in this case we can rewrite program (1.2) with the proportional objective ||·||_1. That is,

    arg min ||x'||_1 subject to Ax' = y,  (1.18)

and we get an exact-recovery analogue of Corollary 1.17.

Theorem 1.20. Assume that x ∈ R^n is an unknown s-sparse vector. Let x̂ be a solution of (1.18). Then, with probability at least 1 − 3 exp(−m), we have x̂ = x, provided that m > Cs log(n/s) for some universal constant C > 0.

This is actually sharp; see Figure 9. One can prove that if m < s log(n/s), the probability of recovery using this method is very small. We will not go into this, but we recommend [1] as a reference for phase transition phenomena in recovery problems.

^9 This is why they are called descent cones: they usually contain the directions in which some function (the Minkowski functional of K) decreases.

References

[1] Dennis Amelunxen, Martin Lotz, Michael B. McCoy, and Joel A. Tropp. Living on the edge: Phase transitions in convex programs with random data. Information and Inference: A Journal of the IMA, 3(3):224-294, 2014.

Figure 9: Empirical probability of exact recovery for different parameters. White means probability one, black probability zero. Taken from [1].

[2] Keith Ball. An elementary introduction to modern convex geometry. Flavors of Geometry, 31:1-58, 1997.

[3] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.

[4] Gunnar Carlsson, Tigran Ishkhanov, Vin De Silva, and Afra Zomorodian. On the local behavior of spaces of natural images. International Journal of Computer Vision, 76(1):1-12, 2008.

[5] V. D. Milman. Random subspaces of proportional dimension of finite dimensional normed spaces: Approach through the isoperimetric inequality. Volume 1166 of Lecture Notes in Mathematics. Springer, 1985.

[6] David L. Donoho and Michael Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization. Proceedings of the National Academy of Sciences, 100(5):2197-2202, 2003.

[7] Y. Gordon. On Milman's inequality and random subspaces which escape through a mesh in R^n. In Joram Lindenstrauss and Vitali D. Milman, editors, Geometric Aspects of Functional Analysis, pages 84-106. Springer, Berlin, Heidelberg, 1988.

[8] Daniel A. Klain and Gian-Carlo Rota. Introduction to geometric probability. Cambridge University Press, 1997.

[9] Alain Pajor and Nicole Tomczak-Jaegermann. Subspaces of small codimension of finite-dimensional Banach spaces. Proceedings of the American Mathematical Society, 97(4):637-642, 1986.

[10] Roman Vershynin. Estimation in high dimensions: a geometric perspective. In Sampling Theory, a Renaissance, pages 3-66. Springer, 2015.

[11] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press, 2018.


More information

2 Q 10. Likewise, in case of multiple particles, the corresponding density in 2 must be averaged over all

2 Q 10. Likewise, in case of multiple particles, the corresponding density in 2 must be averaged over all Lecture 6 Introduction to kinetic theory of plasa waves Introduction to kinetic theory So far we have been odeling plasa dynaics using fluid equations. The assuption has been that the pressure can be either

More information

Distributed Subgradient Methods for Multi-agent Optimization

Distributed Subgradient Methods for Multi-agent Optimization 1 Distributed Subgradient Methods for Multi-agent Optiization Angelia Nedić and Asuan Ozdaglar October 29, 2007 Abstract We study a distributed coputation odel for optiizing a su of convex objective functions

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2016 Lessons 7 14 Dec 2016 Outline Artificial Neural networks Notation...2 1. Introduction...3... 3 The Artificial

More information

1 Proof of learning bounds

1 Proof of learning bounds COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #4 Scribe: Akshay Mittal February 13, 2013 1 Proof of learning bounds For intuition of the following theore, suppose there exists a

More information

Ocean 420 Physical Processes in the Ocean Project 1: Hydrostatic Balance, Advection and Diffusion Answers

Ocean 420 Physical Processes in the Ocean Project 1: Hydrostatic Balance, Advection and Diffusion Answers Ocean 40 Physical Processes in the Ocean Project 1: Hydrostatic Balance, Advection and Diffusion Answers 1. Hydrostatic Balance a) Set all of the levels on one of the coluns to the lowest possible density.

More information

Support Vector Machines MIT Course Notes Cynthia Rudin

Support Vector Machines MIT Course Notes Cynthia Rudin Support Vector Machines MIT 5.097 Course Notes Cynthia Rudin Credit: Ng, Hastie, Tibshirani, Friedan Thanks: Şeyda Ertekin Let s start with soe intuition about argins. The argin of an exaple x i = distance

More information

Kinetic Theory of Gases: Elementary Ideas

Kinetic Theory of Gases: Elementary Ideas Kinetic Theory of Gases: Eleentary Ideas 17th February 2010 1 Kinetic Theory: A Discussion Based on a Siplified iew of the Motion of Gases 1.1 Pressure: Consul Engel and Reid Ch. 33.1) for a discussion

More information

PREPRINT 2006:17. Inequalities of the Brunn-Minkowski Type for Gaussian Measures CHRISTER BORELL

PREPRINT 2006:17. Inequalities of the Brunn-Minkowski Type for Gaussian Measures CHRISTER BORELL PREPRINT 2006:7 Inequalities of the Brunn-Minkowski Type for Gaussian Measures CHRISTER BORELL Departent of Matheatical Sciences Division of Matheatics CHALMERS UNIVERSITY OF TECHNOLOGY GÖTEBORG UNIVERSITY

More information

Lecture 21 Nov 18, 2015

Lecture 21 Nov 18, 2015 CS 388R: Randoized Algoriths Fall 05 Prof. Eric Price Lecture Nov 8, 05 Scribe: Chad Voegele, Arun Sai Overview In the last class, we defined the ters cut sparsifier and spectral sparsifier and introduced

More information

Computable Shell Decomposition Bounds

Computable Shell Decomposition Bounds Journal of Machine Learning Research 5 (2004) 529-547 Subitted 1/03; Revised 8/03; Published 5/04 Coputable Shell Decoposition Bounds John Langford David McAllester Toyota Technology Institute at Chicago

More information

Finite fields. and we ve used it in various examples and homework problems. In these notes I will introduce more finite fields

Finite fields. and we ve used it in various examples and homework problems. In these notes I will introduce more finite fields Finite fields I talked in class about the field with two eleents F 2 = {, } and we ve used it in various eaples and hoework probles. In these notes I will introduce ore finite fields F p = {,,...,p } for

More information

Stochastic Subgradient Methods

Stochastic Subgradient Methods Stochastic Subgradient Methods Lingjie Weng Yutian Chen Bren School of Inforation and Coputer Science University of California, Irvine {wengl, yutianc}@ics.uci.edu Abstract Stochastic subgradient ethods

More information

A Simple Homotopy Algorithm for Compressive Sensing

A Simple Homotopy Algorithm for Compressive Sensing A Siple Hootopy Algorith for Copressive Sensing Lijun Zhang Tianbao Yang Rong Jin Zhi-Hua Zhou National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China Departent of Coputer

More information

Solutions of some selected problems of Homework 4

Solutions of some selected problems of Homework 4 Solutions of soe selected probles of Hoework 4 Sangchul Lee May 7, 2018 Proble 1 Let there be light A professor has two light bulbs in his garage. When both are burned out, they are replaced, and the next

More information

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs BULLETIN OF THE POLISH ACADEMY OF SCIENCES TECHNICAL SCIENCES Vol. 57, No. 3, 2009 Algoriths for parallel processor scheduling with distinct due windows and unit-tie obs A. JANIAK 1, W.A. JANIAK 2, and

More information

Machine Learning Basics: Estimators, Bias and Variance

Machine Learning Basics: Estimators, Bias and Variance Machine Learning Basics: Estiators, Bias and Variance Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics in Basics

More information

Lecture 9 November 23, 2015

Lecture 9 November 23, 2015 CSC244: Discrepancy Theory in Coputer Science Fall 25 Aleksandar Nikolov Lecture 9 Noveber 23, 25 Scribe: Nick Spooner Properties of γ 2 Recall that γ 2 (A) is defined for A R n as follows: γ 2 (A) = in{r(u)

More information

4 = (0.02) 3 13, = 0.25 because = 25. Simi-

4 = (0.02) 3 13, = 0.25 because = 25. Simi- Theore. Let b and be integers greater than. If = (. a a 2 a i ) b,then for any t N, in base (b + t), the fraction has the digital representation = (. a a 2 a i ) b+t, where a i = a i + tk i with k i =

More information

M ath. Res. Lett. 15 (2008), no. 2, c International Press 2008 SUM-PRODUCT ESTIMATES VIA DIRECTED EXPANDERS. Van H. Vu. 1.

M ath. Res. Lett. 15 (2008), no. 2, c International Press 2008 SUM-PRODUCT ESTIMATES VIA DIRECTED EXPANDERS. Van H. Vu. 1. M ath. Res. Lett. 15 (2008), no. 2, 375 388 c International Press 2008 SUM-PRODUCT ESTIMATES VIA DIRECTED EXPANDERS Van H. Vu Abstract. Let F q be a finite field of order q and P be a polynoial in F q[x

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lessons 7 20 Dec 2017 Outline Artificial Neural networks Notation...2 Introduction...3 Key Equations... 3 Artificial

More information

1 Identical Parallel Machines

1 Identical Parallel Machines FB3: Matheatik/Inforatik Dr. Syaantak Das Winter 2017/18 Optiizing under Uncertainty Lecture Notes 3: Scheduling to Miniize Makespan In any standard scheduling proble, we are given a set of jobs J = {j

More information

Probability and Stochastic Processes: A Friendly Introduction for Electrical and Computer Engineers Roy D. Yates and David J.

Probability and Stochastic Processes: A Friendly Introduction for Electrical and Computer Engineers Roy D. Yates and David J. Probability and Stochastic Processes: A Friendly Introduction for Electrical and oputer Engineers Roy D. Yates and David J. Goodan Proble Solutions : Yates and Goodan,1..3 1.3.1 1.4.6 1.4.7 1.4.8 1..6

More information