Efficient Sketches for the Set Query Problem

Size: px
Start display at page:

Download "Efficient Sketches for the Set Query Problem"

Transcription

1 Efficient Setches for the Set Query Problem Eric Price Abstract We develop an algorithm for estimating the values of a vector x R n over a support S of size from a randomized sparse binary linear setch Ax of size O(. Given Ax and S, we can recover x with x x S x x S with probability at least Ω(. The recovery taes O( time. While interesting in its own right, this primitive also has a number of applications. For example, we can:. Improve the linear -sparse recovery of heavy hitters in Zipfian distributions with O( log n space from a + approximation to a + o( approximation, giving the first such approximation in O( log n space when O(n.. Recover bloc-sparse vectors with O( space and a + approximation. Previous algorithms required either ω( space or ω( approximation. Introduction In recent years, a new linear approach for obtaining a succinct approximate representation of n-dimensional vectors (or signals has been discovered. For any signal x, the representation is equal to Ax, where A is an m n matrix, or possibly a random variable chosen from some distribution over such matrices. The vector Ax is often referred to as the measurement vector or linear setch of x. Although m is typically much smaller than n, the setch Ax often contains plenty of useful information about the signal x. A particularly useful and well-studied problem is that of stable sparse recovery. The problem is typically defined as follows: for some norm parameters p and q and an approximation factor C >, given Ax, recover a vector x such that ( x x p C Err q (x,, where Err q (x, = min -sparse ˆx ˆx x q where we say that ˆx is -sparse if it has at most non-zero coordinates. Sparse recovery has applications to numerous areas such as data stream computing [Mut3, Ind7] This research has been supported in part by the David and Lucille Pacard Fellowship, MADALGO (Center for Massive Data Algorithmics, funded by the Danish National Research Association, NSF grant CCF-78645, a Cisco Fellowship, and the NSF Graduate Research Fellowship Program. MIT CSAIL and compressed sensing [CRT6, Don6], notably for constructing imaging systems that acquire images directly in compressed form (e.g., [DDT + 8, Rom9]. The problem has been a subject of extensive study over the last several years, with the goal of designing schemes that enjoy good compression rate (i.e., low values of m as well as good algorithmic properties (i.e., low encoding and recovery times. It is nown that there exist distributions of matrices A and associated recovery algorithms that for any x with high probability produce approximations x satisfying Equation ( with l p = l q = l, constant approximation factor C = +, and setch length m = O( log(n/; it is also nown that this setch length is asymptotically optimal [DIPW, FPRU]. Similar results for other combinations of l p /l q norms are nown as well. Because it is impossible to improve on the setch size in the general sparse recovery problem, recently there has been a large body of wor on more restricted problems that are amenable to more efficient solutions. This includes model-based compressive sensing [BCDH], which imposes additional constraints (or models on x beyond near-sparsity. Examples of models include bloc sparsity, where the large coefficients tend to cluster together in blocs [BCDH, EKB9]; tree sparsity, where the large coefficients form a rooted, connected tree structure [BCDH, LD5]; and being Zipfian, where we require that the histogram of coefficient size follow a Zipfian (or power law distribution. 
A sparse recovery algorithm needs to perform two tass: locating the large coefficients of x and estimating their value. Existing algorithms perform both tass at the same time. In contrast, we propose decoupling these tass. In models of interest, including Zipfian signals and bloc-sparse signals, existing techniques can locate the large coefficients more efficiently or accurately than they can estimate them. Prior to this wor, however, estimating the large coefficients after finding them had no better solution than the general sparse recovery problem. We fill this gap by giving an optimal method for estimating the values of the large coefficients after locating them. In particular, a random Gaussian matrix [CD4] or a random sparse binary matrix ([GLPS9], building on [CCF, CM4] has this property with overwhelming probability. See [GI] for an overview.

2 We refer to this tas as the Set Query Problem. Main result. (Set Query Algorithm. We give a randomized distribution over O( n binary matrices A such that, for any vector x R n and set S {,..., n} with S =, we can recover an x from Ax + ν and S with x x S ( x x S + ν where x S R n equals x over S and zero elsewhere. The matrix A has O( non-zero entries per column, recovery succeeds with probability Ω(, and recovery taes O( time. This can be achieved for arbitrarily small >, using O(/ rows. We achieve a similar result in the l norm. The set query problem is useful in scenarios when, given a setch of x, we have some alternative methods for discovering a good support of an approximation to x. This is the case, e.g., in bloc-sparse recovery, where (as we show in this paper it is possible to identify heavy blocs using other methods. It is also a natural problem in itself. In particular, it generalizes the well-studied point query problem [CM4], which considers the case that S is a singleton. We note that, although the set query problem for sets of size can be reduced to instances of the point query problem, this reduction is less space-efficient than the algorithm we propose, as elaborated below. Techniques. Our method is related to existing sparse recovery algorithms, including Count-Setch [CCF] and Count-Min [CM4]. In fact, our setch matrix A is almost identical to the one used in Count-Setch each column of A has d random locations out of O(d each independently set to ±, and the columns are independently generated. We can view such a matrix as hashing each coordinate to d bucets out of O(d. The difference is that the previous algorithms require O( log measurements to achieve our error bound (and d = O(log, while we only need O( measurements and d = O(. We overcome two obstacles to bring d down to O( and still achieve the error bound with high probability 3. First, in order to estimate the coordinates x i, we need a more elaborate method than, say, taing the median of the bucets that i was hashed into. This is because, with constant probability, all such bucets might contain some other elements from S (be heavy and therefore using any of them as an estimator for y i would result in too much error. Since, for super-constant values of S, it is highly liely that such an event will occur for at least one i S, it follows that this type of estimation results in large error. The term set query is in contrast to point query, used in e.g. [CM4] for estimation of a single coordinate. 3 In this paper, high probability means probability at least / c for some constant c >. We solve this issue by using our nowledge of S. We now when a bucet is corrupted (that is, contains more than one element of S, so we only estimate coordinates that lie in a large number of uncorrupted bucets. Once we estimate a coordinate, we subtract our estimation of its value from the bucets it is contained in. This potentially decreases the number of corrupted bucets, allowing us to estimate more coordinates. We show that, with high probability, this procedure can continue until it estimates every coordinate in S. The other issue with the previous algorithms is that their analysis of their probability of success does not depend on. This means that, even if the head did not interfere, their chance of success would be a constant (lie Ω(d rather than high probability in (meaning Ω(d. 
We show that the errors in our estimates of coordinates have low covariance, which allows us to apply Chebyshev s inequality to get that the total error is concentrated around the mean with high probability. Related wor. A similar recovery algorithm (with d = has been analyzed and applied in a streaming context in [EG7]. However, in that paper the authors only consider the case where the vector y is -sparse. In that case, the termination property alone suffices, since there is no error to bound. Furthermore, because d = they only achieve a constant probability of success. In this paper we consider general vectors y so we need to mae sure the error remains bounded, and we achieve a high probability of success. The recovery procedure also has similarities to recovering LDPCs using belief propagation, especially over the binary erasure channel. The similarities are strongest for exact recovery of -sparse y; our method for bounding the error from noise is quite different. Applications. Our efficient solution to the set query problem can be combined with existing techniques to achieve sparse recovery under several models. We say that a vector x follows a Zipfian or power law distribution with parameter α if xr(i = Θ( xr( i α where r(i is the location of the ith largest coefficient in x. When α > /, x is well approximated in the l norm by its sparse approximation. Because a wide variety of real world signals follow power law distributions ([Mit4, BKM + ], this notion (related to compressibility 4 is often considered to be much of the reason why sparse recovery is interesting [CT6, Cev8]. Prior to this wor, sparse recovery of power law distributions has only been solved via general sparse recovery methods: ( + Err (x, error in O( log(n/ measurements. However, locating the large coefficients in a power law 4 A signal is compressible when xr(i = O( xr( i α rather than Θ( xr( i α [CT6]. This allows it to decay very quicly then stop decaying for a while; we require that the decay be continuous.

3 distribution has long been easier than in a general distribution. Using O( log n measurements, the Count- Setch algorithm [CCF] can produce a candidate set S {,..., b} with S = O( that includes all of the top positions in a power law distribution with high probability (if α > /. We can then apply our set query algorithm to recover an approximation x to x S. Because we already are using O( log n measurements on Count- Setch, we use O( log n rather than O( measurements in the set query algorithm to get an / log n rather than approximation. This lets us recover a -sparse x with O( log n measurements with x x ( + Err (x,. log n This is especially interesting in the common regime where < n c for some constant c >. Then no previous algorithms achieve better than a ( + approximation with O( log n measurements, and the lower bound in [DIPW] shows that any O( approximation requires Ω( log n measurements 5. This means at Θ( log n measurements, the best approximation changes from ω( to + o(. Another application is that of finding bloc-sparse approximations. In this application, the coordinate set {... n} is partitioned into n/b blocs, each of length b. We define a (, b-bloc-sparse vector to be a vector where all non-zero elements are contained in at most /b blocs. An example of bloc-sparse data is time series data from n/b locations over b time steps, where only /b locations are active. We can define Err (x,, b = min x ˆx. (,b bloc-sparse ˆx The bloc-sparse recovery problem can now be formulated analogously to Equation. Since the formulation imposes restrictions on the sparsity patterns, it is natural to expect that one can perform sparse recovery from fewer than O( log(n/ measurements needed in the general case. Because of that reason and the prevalence of approximately bloc-sparse signals, the problem of stable recovery of variants of bloc-sparse approximations has been recently a subject of extensive research (e.g., see [EB9, SPH9, BCDH, CIHB9]. The state of the art algorithm has been given in [BCDH], who gave a probabilistic construction of a single m n matrix A, with m = O(+ b log n, and an n logo( n-time algorithm for performing the bloc-sparse recovery in the l norm (as well as other variants. If the blocs have size Ω(log n, the algorithm uses only O( measurements, which is a 5 The lower bound only applies to geometric distributions, not Zipfian ones. However, our algorithm applies to more general sub- Zipfian distributions (defined in Section 4., which includes both. substantial improvement over the general bound. However, the approximation factor C guaranteed by that algorithm was super-constant. In this paper, we provide a distribution over matrices A, with m = O( + b log n, which enables solving this problem with a constant approximation factor and in the l norm, with high probability. As with Zipfian distributions, first one algorithm tells us where to find the heavy hitters and then the set query algorithm estimates their values. In this case, we modify the algorithm of [ABI8] to find bloc heavy hitters, which enables us to find the support of the b most significant blocs using O( b log n measurements. The essence is to perform dimensionality reduction of each bloc from b to O(log n dimensions, then estimate the result with a linear hash table. For each bloc, most of the projections are estimated pretty well, so the median is a good estimator of the bloc s norm. Once the support is identified, we can recover the coefficients using the set query algorithm. Preliminaries. 
Notation For n Z +, we denote {,..., n} by [n]. Suppose x R n. Then for i [n], x i R denotes the value of the i- th coordinate in x. As an exception, e i R n denotes the elementary unit vector with a one at position i. For S [n], x S denotes the vector x R n given by x i = x i if i S, and x i = otherwise. We use supp(x to denote the support of x. We use upper case letters to denote sets, matrices, and random distributions. We use lower case letters for scalars and vectors.. Negative Association This paper would lie to mae a claim of the form We have observations each of whose error has small expectation and variance. Therefore the average error is small with high probability in. If the errors were independent this would be immediate from Chebyshev s inequality, but our errors depend on each other. Fortunately, our errors have some tendency to behave even better than if they were independent: the more noise that appears in one coordinate, the less remains to land in other coordinates. We use negative dependence to refer to this general class of behavior. The specific forms of negative dependence we use are negative association and approximate negative correlation; see Appendix A for details on these notions. 3

4 3 Set-Query Algorithm Theorem 3.. There is a randomized sparse binary setch matrix A and recovery algorithm A, such that for any x R n, S [n] with S =, x = A (Ax + ν, S R n has supp(x S and x x S ( x x S + ν with probability at least / c. A has O( c rows and O(c non-zero entries per column, and A runs in O(c time. One can achieve x x S ( x x S + ν under the same conditions, but with only O( c rows. We will first show Theorem 3. for a constant c = /3 rather than for general c. Parallel repetition gives the theorem for general c, as described in Section 3.7. We will also only show it with entries of A being in {,, }. By splitting each row in two, one for the positive and one for the negative entries, we get a binary matrix with the same properties. The paper focuses on the more difficult l result; see Appendix B for details on the l result. 3. Intuition We call x S the head and x x S the tail. The head probably contains the heavy hitters, with much more mass than the tail of the distribution. We would lie to estimate x S with zero error from the head and small error from the tail with high probability. Our algorithm is related to the standard Count- Setch [CCF] and Count-Min [CM4] algorithms. In order to point out the differences, let us examine how they perform on this tas. These algorithms show that hashing into a single w = O( sized hash table is good in the sense that each point x i has:. Zero error from the head with constant probability (namely w.. A small amount of error from the tail in expectation (and hence with constant probability. They then iterate this procedure d times and tae the median, so that each estimate has small error with probability Ω(d. With d = O(log, we get that all estimates in S are good with O( log measurements with high probability in. With fewer measurements, however, some x i will probably have error from the head. If the head is much larger than the tail (such as when the tail is zero, this is a major problem. Furthermore, with O( measurements the error from the tail would be small only in expectation, not with high probability. We mae three observations that allow us to use only O( measurements to estimate x S with error relative to the tail with high probability in.. The total error from the tail over a support of size is concentrated more strongly than the error at a single point: the error probability drops as Ω(d rather than Ω(d.. The error from the head can be avoided if one nows where the head is, by modifying the recovery algorithm. 3. The error from the tail remains concentrated after modifying the recovery algorithm. For simplicity this paper does not directly show (, only ( and (3. The modification to the algorithm to achieve ( is quite natural, and described in detail and illustrated in Section 3.. Rather than estimate every coordinate in S immediately, we only estimate those coordinates which mostly do not overlap with other coordinates in S. In particular, we only estimate x i as the median of at least d positions that are not in the image of S \ {i}. Once we learn x i, we can subtract Ax i e i from the observed Ax and repeat on A(x x i e i and S \ {i}. Because we only loo at positions that are in the image of only one remaining element of S, this avoids any error from the head. We show in Section 3.3 that this algorithm never gets stuc; we can always find some position that mostly doesn t overlap with the image of the rest of the remaining support. 
We then show that the error from the tail has low expectation, and that it is strongly concentrated. We thin of the tail as noise located in each cell (coordinate in the image space. We decompose the error of our result into two parts: the point error and the propagation. The point error is error introduced in our estimate of some x i based on noise in the cells that we estimate x i from, and equals the median of the noise in those cells. The propagation is the error that comes from point error in estimating other coordinates in the same connected component; these errors propagate through the component as we subtract off incorrect estimates of each x i. Section 3.4 shows how to decompose the total error in terms of point errors and the component sizes. The two following sections bound the expectation and variance of these two quantities and show that they obey some notions of negative dependence. We combine these errors in Section 3.7 to get Theorem 3. with a specific c (namely c = /3. We then use parallel repetition to achieve Theorem 3. for arbitrary c. 3. Algorithm We describe the setch matrix A and recovery procedure in Algorithm 3.. Unlie Count-Setch [CCF] or Count-Min [CM4], our A is not split into d hash tables 4

5 Figure : An instance of the set query problem. There are n vertices on the left, corresponding to x, and the table on the right represents Ax. Each vertex i on the left maps to d cells on the right, randomly increasing or decreasing the value in each cell by x i. We represent addition by blac lines, and subtraction by red lines. We are told the locations of the heavy hitters, which we represent by blue circles; the rest is represented by yellow circles. of size O(. Instead, it has a single w = O(d / sized hash table where each coordinate is hashed into d unique positions. We can thin of A as a random d-uniform hypergraph, where the non-zero entries in each column correspond to the terminals of a hyperedge. We say that A is drawn from G d (w, n with random signs associated with each (hyperedge, terminal pair. We do this so we will be able to apply existing theorems on random hypergraphs. Figure shows an example Ax for a given x, and Figure demonstrates running the recovery procedure on this instance. Lemma 3.. Algorithm 3. runs in time O(d. Proof. A has d entries per column. For each of the at most d rows q in the image of S, we can store the preimages P (q. We also eep trac of the sets of possible next hyperedges, J i = {j L j d i} for i {, }. We can compute these in an initial pass in O(d. Then in each iteration, we remove an element j J or J and update x j, b, and T in O(d time. We then loo at the two or fewer non-isolated vertices q in hyperedge j, and remove j from the associated P (q. If this maes P (q =, we chec whether to insert the element in P (q into the J i. Hence the inner loop taes O(d time, for O(d total. 6.5 (a (c (b Figure : Example run of the algorithm. Part (a shows the state as considered by the algorithm: Ax and the graph structure corresponding to the given support. In part (b, the algorithm chooses a hyperedge with at least d isolated vertices and estimates the value as the median of those isolated vertices multiplied by the sign of the corresponding edge. In part (c, the image of the first vertex has been removed from Ax and we repeat on the smaller graph. We continue until the entire support has been estimated, as in part (d. 3 (d 3.3 Exact Recovery The random hypergraph G d (w, of random d-uniform hyperedges on w vertices is well studied in [K L]. We use their results to show that the algorithm successfully 5

6 Definition of setch matrix A. For a constant d, let A be a w n = O( d n matrix where each column is chosen independently uniformly at random over all exactly d-sparse columns with entries in {,, }. We can thin of A as the incidence matrix of a random d-uniform hypergraph with random signs. Recovery procedure. : procedure SetQuery(A, S, b Recover approximation x to x S from b = Ax + ν : T S 3: while T > do 4: Define P (q = {j A qj, j T } as the set of hyperedges in T that contain q. 5: Define L j = {q A qj, P (q = } as the set of isolated vertices in hyperedge j. 6: Choose a random j T such that L j d. If this is not possible, find a random j T such that L j d. If neither is possible, abort. 7: x j median q L j A qj b q 8: b b x j Ae j 9: T T \ {j} : end while : return x : end procedure Algorithm 3.: Recovering a signal given its support. terminates with high probability, and that most hyperedges are chosen with at least d isolated vertices: Lemma 3.. With probability at least O(/, Algorithm 3. terminates without aborting. Furthermore, in each component at most one hyperedge is chosen with only d isolated vertices. We will show this by building up a couple lemmas. We define a connected hypergraph H with r vertices on s hyperedges to be a hypertree if r = s(d + and to be unicyclic if r = s(d. Then Theorem 4 of [K L] shows that, if the graph is sufficiently sparse, G d (w, is probably composed entirely of hypertrees and unicyclic components. The precise statement is as follows 6 : Lemma 3.3 (Theorem 4 of [K L]. Let m = w/d(d. Then with probability O(d 5 w /m 3, G d (w, is composed entirely of hypertrees and unicyclic components. We use a simple consequence: Corollary 3.. If d = O( and w d(d, then with probability O(/, G d (w, is composed entirely of hypertrees and unicyclic 6 Their statement of the theorem is slightly different. This is the last equation in their proof of the theorem. We now prove some basic facts about hypertrees and unicyclic components: Lemma 3.4. Every hypertree has a hyperedge incident on at least d isolated vertices. Every unicyclic component either has a hyperedge incident on d isolated vertices or has a hyperedge incident on d isolated vertices, the removal of which turns the unicyclic component into a hypertree. Proof. Let H be a connected component of s hyperedges and r vertices. If H is a hypertree, r = (d s +. Because H has only ds total (hyperedge, incident vertex pairs, at most (s of these pairs can involve vertices that appear in two or more hyperedges. Thus at least one of the s edges is incident on at most one vertex that is not isolated, so some edge has d isolated vertices. If H is unicyclic, r = (d s and so at most s of the (hyperedge, incident vertex pairs involve non-isolated vertices. Therefore on average, each edge has d isolated vertices. If no edge is incident on at least d isolated vertices, every edge must be incident on exactly d isolated vertices. In that case, each edge is incident on exactly two non-isolated vertices and each non-isolated vertex is in exactly two edges. Hence we can perform an Eulerian tour of all the edges, so removing any edge does not disconnect the graph. After removing the edge, the graph has s = s edges and r = r d + vertices; therefore r = (d s + so the graph is a hypertree. Corollary 3. 
and Lemma 3.4 combine to show Lemma Total error in terms of point error and component size Define C i,j to be the event that hyperedges i and j are in the same component, and D i = j C i,j to be the number of hyperedges in the same component as i. Define L i to be the cells that are used to estimate i; so L i = {q A qj, P (q = } at the round of the algorithm when i is estimated. Define Y i = median q Li A qi (b Ax S q to be the point error for hyperedge i, and x to be the output of the algorithm. Then the deviation of the output at any coordinate i is at most twice the sum of the point errors in the component containing i: Lemma 3.5. (x x S i j S Y j C i,j. Proof. Let T i = (x x S i, and define R i = {j j i, q L i s.t. A qj } to be the set of hyperedges that 6

7 overlap with the cells used to estimate i. Then from the description of the algorithm, it follows that T i = median A qi ((b Ax S q q L i j T i Y i + T j. j R i A qj T j We can thin of the R i as a directed acyclic graph (DAG, where there is an edge from j to i if j R i. Then if p(i, j is the number of paths from i to j, T i j p(j, i Y i. Let r(i = {j i R j } be the outdegree of the DAG. Because the L i are disjoint, r(i d L i. From Lemma 3., r(i for all but one hyperedge in the component, and r(i for that one. Hence p(i, j for any i and j, giving the result. We use the following corollary: Corollary 3.. Proof. x x S = i S x x S 4 i S 4 i S D i Y i (x x S i 4 i S D i j S C i,j= Y j = 4 i S ( Y j j S C i,j= D i Y i where the second inequality is the power means inequality. The D j and Y j are independent from each other, since one depends only on A over S and one only on A over [n] \ S. Therefore we can analyze them separately; the next two sections show bounds and negative dependence results for Y j and D j, respectively. 3.5 Bound on point error Recall from Section 3.4 that based entirely on the set S and the columns of A corresponding to S, we can identify the positions L i used to estimate x i. We then defined the point error Y i = median q L i A qi (b Ax S q = median A qi (A(x x S +ν q q L i and showed how to relate the total error to the point error. Here we would lie to show that the Y i have bounded moments and are negatively dependent. Unfortunately, it turns out that the Y i are not negatively associated so it is unclear how to show negative dependence directly. Instead, we will define some other variables Z i that are always larger than the corresponding Y i. We will then show that the Z i have bounded moments and negative association. We use the term NA throughout the proof to denote negative association. For the definition of negative association and relevant properties, see Appendix A. Lemma 3.6. Suppose d 7 and define µ = O( ( x x S + ν. There exist random variables Z i such that the variables Yi are stochastically dominated by Z i, the Z i are negatively associated, E[Z i ] = µ, and E[Zi ] = O(µ. Proof. The choice of the L i depends only on the values of A over S; hence conditioned on nowing L i we still have A(x x S distributed randomly over the space. Furthermore the distribution of A and the reconstruction algorithm are invariant under permutation, so we can pretend that ν is permuted randomly before being added to Ax. Define B i,q to be the event that q supp(ae i, and define D i,q {, } independently at random. Then define the random variable V q = (b Ax S q = ν q + x i B i,q D i,q. i [n]\s Because we want to show concentration of measure, we would lie to show negative association (NA of the Y i = median q Li A qi V q. We now ν is a permutation distribution, so it is NA [JP83]. The B i,q for each i as a function of q are chosen from a Fermi-Dirac model, so they are NA [DR96]. The B i,q for different i are independent, so all the B i,q variables are NA. Unfortunately, the D i,q can be negative, which means the V q are not necessarily NA. Instead we will find some NA variables that dominate the V q. We do this by considering V q as a distribution over D. Let W q = E D [Vq ] = νq + i [n]\s x i B i,q. As increasing functions of NA variables, the W q are NA. By Marov s inequality Pr D [Vq cw q ] c, so after choosing the B i,q and as a distribution over D, Vq is dominated by the random variable U q = W q F q where F q is, independently for each q, given by the p.d.f. 
f(c = /c for c and f(c = otherwise. Because the distribution of V q over D is independent for each q, the U q jointly dominate the Vq. The U q are the componentwise product of the W q with independent positive random variables, so they too are NA. Then define Z i = median q L i U q. 7

8 As an increasing function of disjoint subsets of NA variables, the Z i are NA. We also have that Y i = (median A qi V q (median V q q L i q L i = median Vq q L i median U q = Z i q L i so the Z i stochastically dominate Yi. We now will bound E[Zi ]. Define µ = E[W q ] = E[νq ] + x i E[B i,q ] Then we have i [n]\s = d w x x S + w ν ( x x S + ν. Pr[W q cµ] c Pr[U q cµ] = c f(x Pr[W q cµ/x]dx x x c dx + c + ln c dx = x c Because the U q are NA, they satisfy marginal probability bounds [DR96]: Pr[U q t q, q [w]] Pr[U q t q ] for any t q. Therefore ( Pr[Z i cµ] i [n] T L i q T T = L i / ( + ln c Li c ( Pr[Z i cµ] 4 + ln c c P r[u q cµ] d/ Li / If d 7, this maes E[Z i ] = O(µ and E[Z i ] = O(µ. 3.6 Bound on component size Lemma 3.7. Let D i be the number of hyperedges in the same component as hyperedge i. Then for any i j, Cov(D i, D j = E[D i D j ] E[D i ] O( log6. Proof. The intuition is that if one component gets larger, other components tend to get smaller. Also the graph is very sparse, so component size is geometrically distributed. There is a small probability that i and j are connected, in which case D i and D j are positively correlated, but otherwise D i and D j should be negatively correlated. However analyzing this directly is rather difficult, because as one component gets larger, the remaining components have a lower average size but higher variance. Our analysis instead taes a detour through the hypergraph where each hyperedge is piced independently with a probability that gives the same expected number of hyperedges. This distribution is easier to analyze, and only differs in a relatively small Õ( hyperedges from our actual distribution. This allows us to move between the regimes with only a loss of Õ(, giving our result. Suppose instead of choosing our hypergraph from G d (w, we chose it from G d (w, ; that is, each hyperedge appeared independently with the appropriate prob- ( w d ability to get hyperedges in expectation. This model is somewhat simpler, and yields a very similar hypergraph G. One can then modify G by adding or removing an appropriate number of random hyperedges I to get exactly hyperedges, forming a uniform G G d (w,. By the Chernoff bound, I O( log with probability. Ω( Let D i be the size of the component containing i in G, and H i = Di D i. Let E denote the event that any of the D i or D i is more than C log, or that more than C log hyperedges lie in I, for some constant C. Then E happens with probability less than for some C, so 5 it has negligible influence on E[Di D j ]. Hence the rest of this proof will assume E does not happen. Therefore H i = if none of the O( log random hyperedges in I touch the O(log hyperedges in the components containing i in G, so H i = with probability at least O( log. Even if H i, we still have H i (Di + D j O(log. Also, we show that the D i are negatively correlated, when conditioned on being in separate components. Let D(n, p denote the distribution of the component size of a random hyperedge on G d (n, p, where p is the probability an hyperedge appears. Then D(n, p dominates D(n, p whenever n > n the latter hypergraph is contained within the former. If C i,j is the event that i and j are connected in G, this means E[D i D j = t, C i,j = ] Furthermore, E[D i ] = O( and E[D4 i ] = O(. is a decreasing function in t, so we have negative corre- 8

9 lation: E[D i D j C i,j = ] E[D i C i,j = ] E[D j C i,j = ] E[D i ] E[D j]. Furthermore for i j, Pr[C i,j = ] = E[C i,j ] = l i E[C i,l] = E[Di] = O(/. Hence E[D i D j] = E[D i D j C i,j = ] Pr[C i,j = ]+ Therefore E[D i D j ] E[D i D j C i,j = ] Pr[C i,j = ] E[D i ] E[D j] + O( log4. = E[(D i + H i (D j + H j ] = E[D i D j] + E[H i D j] + E[H i H j ] E[D i ] E[D j] + O( log = E[D i H i ] + O( log6 log 4 + log log = E[D i ] E[H i ] E[D i ] + E[H i ] + O( log6 E[D i ] + O( log6 Now to bound E[Di 4 ] in expectation. Because our hypergraph is exceedingly sparse, the size of a component can be bounded by a branching process that dies out with constant probability at each step. Using this method, Equations 7 and 7 of [COMS7] state that Pr[D ] e Ω(. Hence E[D i ] = O( and E[D 4 i ] = O(. Because H i is with high probability and O(log otherwise, this immediately gives E[Di ] = O( and E[Di 4] = O(. 3.7 Wrapping it up Recall from Corollary 3. that our total error x x S 4 i Y i D i 4 i Z i D i. The previous sections show that Z i and Di each have small expectation and covariance. This allows us to apply Chebyshev s inequality to concentrate 4 i Z idi about its expectation, bounding x x S with high probability: Lemma 3.8. We can recover x from Ax + ν and S with x x S ( x x S + ν with probability at least in O( recovery time. c /3 Our A has O( c rows and sparsity O( per column. Proof. Our total error is x x S 4 i Y i D i 4 i Then by Lemma 3.6 and Lemma 3.7, E[4 i Z i D i ] = 4 i Z i D i. E[Z i ] E[D i ] = µ where µ = O( ( x x S + ν. Furthermore, E[( i Var( i Z i D i ] = i = i i E[Z i D 4 i ] + i j E[Z i Z j D i D j ] E[Z i ] E[D 4 i ] + i j E[Z i Z j ] E[D i D j ] O(µ + i j E[Z i ] E[Z j ](E[D i ] + O( log6 = O(µ log 6 + ( E[Z i D i ] Z i D i = E[( i Z i D i ] E[Z i D i ] O(µ log 6 By Chebyshev s inequality, this means Pr[4 i Z i D i ( + cµ] O( log6 c Pr[ x x S ( + cc ( x x S + ν ] O( c /3 for some constant C. Rescaling down by C( + c, we can get x x S ( x x S + ν with probability at least c /3 : Now we shall go from /3 probability of error to c error for arbitrary c, with O(c multiplicative cost in time and space. We simply perform Lemma 3.8 O(c times in parallel, and output the pointwise median of the results. By a standard parallel repetition argument, this gives our main result: Theorem 3.. We can recover x from Ax + ν and S with x x S ( x x S + ν with probability at least in O(c recovery time. c Our A has O( c rows and sparsity O(c per column. 9

10 Proof. Lemma 3.8 gives an algorithm that achieves O( /3 probability of error. We will show here how to achieve c probability of error with a linear cost in c, via a standard parallel repetition argument. Suppose our algorithm gives an x such that x x S µ with probability at least p, and that we run this algorithm m times independently in parallel to get output vectors x,..., x m. We output y given by y i = median j [m] (x j i, and claim that with high probability y x S µ 3. Let J = {j [m] x j x S µ}. Each j [m] lies in J with probability at least p, so the chance that J 3m/4 is less than ( m m/4 p m/4 (4ep m/4. Suppose that J 3m/4. Then for all i S, {j J (x j i y i } J m J /3 and similarly {j J (x j i y i } J /3. Hence for all i S, y i x i is smaller than at least J /3 of the (x j i x i for j J. Hence or J µ i S j J((x j i x i i S = J 3 y x y x 3µ J 3 (y i x i with probability at least (4ep m/4. Using Lemma 3.8 to get p = and µ = 6 /3 ( x x S + ν, with m = c repetitions we get Theorem Applications We give two applications where the set query algorithm is a useful primitive. 4. Heavy Hitters of sub-zipfian distributions For a vector x, let r i be the index of the ith largest element, so x ri is non-increasing in i. We say that x is Zipfian with parameter α if x ri = Θ( x r i α. We say that x is sub-zipfian with parameters (, α if there exists a non-increasing function f with x ri = Θ(f(ii α for all i. A Zipfian with parameter α is a sub-zipfian with parameter (, α for all, using f(i = x r. The Zipfian heavy hitters problem is, given a linear setch Ax of a Zipfian x and a parameter, to find a -sparse x with minimal x x (up to some approximation factor. We require that x be -sparse (and no more because we want to find the heavy hitters themselves, not to find them as a proxy for approximating x. Zipfian distributions are common in real-world data sets, and finding heavy hitters is one of the most important problems in data streams. Therefore this is a very natural problem to try to improve; indeed, the original paper on Count-Setch discussed it [CCF]. They show a result complementary to our wor, namely that one can find the support efficiently: Lemma 4. (Section 4. of [CCF]. If x is sub-zipfian with parameter (, α and α > /, one can recover a candidate support set S with S = O( from Ax such that {r,..., r } S. A has O( log n rows and recovery succeeds with high probability in n. Proof setch. Let S = {r,..., r }. With O( log n measurements, Count-Setch identifies each x i to within x x S with high probability. If α > /, this is less than x r /3 for appropriate. But x r9 x r /3. Hence only the largest 9 elements of x could be estimated as larger than anything in x S, so the locations of the largest 9 estimated values must contain S. It is observed in [CCF] that a two-pass algorithm could identify the heavy hitters exactly. However, with a single pass, no better method has been nown for Zipfian distributions than for arbitrary distributions; in fact, the lower bound [DIPW] on linear sparse recovery uses a geometric (and hence sub-zipfian distribution. As discussed in [CCF], using Count-Setch 7 with O( log n rows gets a -sparse x with x x r x ( + Err (x, = Θ( / α. α where, as in Section, Err (x, = min ˆx x. -sparse ˆx The set query algorithm lets us improve from a + approximation to a + o( approximation. This is not useful for approximating x, since increasing is much more effective than decreasing. 
Instead, it is useful for finding elements that are quite close to being the actual heavy hitters of x. Naïve application of the set query algorithm to the output set of Lemma 4. would only get a close O(-sparse vector, not a -sparse vector. To get a -sparse vector, we must show a lemma that generalizes one used in the proof of sparse recovery of Count-Setch (first in [CM6], but our description is more similar to [GI]. 7 Another analysis ([CM5] uses Count-Min to achieve a better polynomial dependence on, but at the cost of using the l norm. Our result is an improvement over this as well.

11 Lemma 4.. Let x, x R n. Let S and S be the locations of the largest elements (in magnitude of x and x, respectively. Then if (* (x x S S Err (x,, for, we have x S x ( + 3Err (x,. Previous proofs have shown the following weaer form: Corollary 4.. If we change the condition (* to x x Err (x,, the same result holds. The corollary is immediate from Lemma 4. and (x x S S S S (x x S S. Therefore xs\s xs \S E(E + E ( + E 5E. Plugging into Equation 3, and using (x x S E, x S x E + 5E + xs \S + x[n]\(s S 6E + x[n]\s = ( + 6E x S x ( + 3E. Proof of Lemma 4.. We have (3 x S x = (x x S + xs\s + x[n]\(s S The tricy bit is to bound the middle term x S\S. We will show that it is not much larger than xs \S. Let d = S \ S, and let a be the d-dimensional vector corresponding to the absolute values of the coefficients of x over S \ S. That is, if S \ S = {j,..., j d }, then a i = x ji for i [d]. Let a be analogous for x over S \ S, and let b and b be analogous for x and x over S \ S, respectively. Let E = Err (x, = x x S. We have xs\s xs \S = a b = (a b (a + b a b a + b a b ( b + a b a b (E + a b So we should bound a b. We now that p q p q for all p and q, so a a + b b (x x S\S + (x x S \S (x x S S E. We also now that a b and b a both contain all nonnegative coefficients. Hence a b a b + b a ( a a + b b a a + b b E a b E. With this lemma in hand, on Zipfian distributions we can get a -sparse x with a +o( approximation factor. Theorem 4.. Suppose x comes from a sub-zipfian distribution with parameter α > /. Then we can recover a -sparse x from Ax with x x log n Err (x,. with O( c log n rows and O(n log n recovery time, with probability at least c. Proof. By Lemma 4. we can identify a set S of size O( that contains all the heavy hitters. We then run the set query algorithm of Theorem 3. with 3 substituted log n for. This gives an ˆx with ˆx x S 3 log n Err (x,. Let x contain the largest coefficients of ˆx. By Lemma 4. we have x x ( + Err (x,. log n 4. Bloc-sparse vectors In this section we consider the problem of finding blocsparse approximations. In this case, the coordinate set {... n} is partitioned into n/b blocs, each of length b. We define a (, b-bloc-sparse vector to be a vector where all non-zero elements are contained in at most /b blocs. That is, we partition {,..., n} into T i = {(i b +,..., ib}. A vector x is (, b-bloc-sparse if there exist S,..., S /b {T,..., T n/b } with supp(x S i. Define Err (x,, b = min x ˆx. (,b bloc-sparse ˆx

12 Finding the support of bloc-sparse vectors is closely related to finding bloc heavy hitters, which is studied for the l norm in [ABI8]. The idea is to perform dimensionality reduction of each bloc into log n dimensions, then perform sparse recovery on the resulting log n b - sparse vector. The differences from previous wor are minor, so we relegate the details to Appendix C. Lemma 4.3. For any b and, there exists a family of matrices A with O( 5 b log n rows and column sparsity O( log n such that we can recover a support S from Ax in O( n b log n time with x x S ( + Err (x,, b with probability at least n Ω(. Once we now a good support S, we can run Algorithm 3. to estimate x S : Theorem 4.. For any b and, there exists a family of binary matrices A with O( + 5 b log n rows such that we can recover a (, b-bloc-sparse x in O( + n b log n time with x x ( + Err (x,, b with probability at least Ω(. Proof. Let S be the result of Lemma 4.3 with approximation /3, so x x S ( + 3 Err (x,, b. Then the set query algorithm on x and S uses O(/ rows to return an x with Therefore as desired. x x S 3 x x S. x x x x S + x x S ( + 3 x x S ( + 3 Err (x,, b ( + Err (x,, b If the bloc size b is at least log n and is constant, this gives an optimal bound of O( rows. 5 Conclusion and Future Wor We show efficient recovery of vectors conforming to Zipfian or bloc sparse models, but leave open extending this to other models. Our framewor decomposes the tas into first locating the heavy hitters and then estimating them, and our set query algorithm is an efficient general solution for estimating the heavy hitters once found. The remaining tas is to efficiently locate heavy hitters in other models. Our analysis assumes that the columns of A are fully independent. It would be valuable to reduce the independence needed, and hence the space required to store A. We show -sparse recovery of Zipfian distributions with + o( approximation in O( log n space. Can the o( be made smaller, or a lower bound shown, for this problem? Acnowledgments I would lie to than my advisor Piotr Indy for much helpful advice, Anna Gilbert for some preliminary discussions, and Joseph O Roure for pointing me to [K L]. References [ABI8] A. Andoni, K. Do Ba, and P. Indy. Bloc heavy hitters. MIT Technical Report TR- 8-4, 8. [BCDH] R. G. Baraniu, V. Cevher, M. F. Duarte, and C. Hegde. Model-based compressive sensing. IEEE Transactions on Information Theory, 56, No. 4:98,. [BKM + ] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomins, and J. Wiener. Graph structure in the web. Comput. Netw., 33(-6:39 3,. [CCF] M. Chariar, K. Chen, and M. Farach- Colton. Finding frequent items in data streams. ICALP,. [CD4] [Cev8] Z. Chen and J. J. Dongarra. Condition numbers of gaussian random matrices. SIAM Journal on Matrix Analysis and Applications, 7:63 6, 4. V. Cevher. Learning with compressible priors. In NIPS, Vancouver, B.C., Canada, 7 December 8. [CIHB9] V. Cevher, P. Indy, C. Hegde, and R. G. Baraniu. Recovery of clustered sparse signals from compressive measurements. SAMPTA, 9.

13 [CM4] G. Cormode and S. Muthurishnan. Improved data stream summaries: The countmin setch and its applications. Latin, 4. [CM5] Graham Cormode and S. Muthurishnan. Summarizing and mining sewed data streams. In SDM, 5. [CM6] G. Cormode and S. Muthurishnan. Combinatorial algorithms for compressed sensing. Sirocco, 6. [COMS7] A. Coja-Oghlan, C. Moore, and V. Sanwalani. Counting connected graphs and hypergraphs via the probabilistic method. Random Struct. Algorithms, 3(3:88 39, 7. [CRT6] [CT6] E. J. Candès, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math., 59(8:8 3, 6. E.J. Candès and T. Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? Information Theory, IEEE Transactions on, 5(: , dec. 6. [EG7] D. Eppstein and M. T. Goodrich. Spaceefficient straggler identification in round-trip data streams via Newton s identitities and invertible Bloom filters. WADS, 7. [EKB9] [FPRU] [GI] Y. C. Eldar, P. Kuppinger, and H. Bölcsei. Compressed sensing of bloc-sparse signals: Uncertainty relations and efficient recovery. CoRR, abs/96.373, 9. S. Foucart, A. Pajor, H. Rauhut, and T. Ullrich. The Gelfand widths of lp-balls for < p. preprint,. A. Gilbert and P. Indy. Sparse recovery using sparse matrices. Proceedings of IEEE,. [GLPS9] A. C. Gilbert, Y. Li, E. Porat, and M. J. Strauss. Approximate sparse recovery: Optimizing time and measurements. CoRR, abs/9.9, 9. [Ind7] P. Indy. Setching, streaming and sublinearspace algorithms. Graduate course notes, available at http: // stellar. mit. edu/ S/ course/ 6/ fa7/ /, 7. [DDT + 8] M. Duarte, M. Davenport, D. Tahar, J. Lasa, T. Sun, K. Kelly, and R. Baraniu. Single-pixel imaging via compressive sampling. IEEE Signal Processing Magazine, 8. [DIPW] K. Do Ba, P. Indy, E. Price, and D. Woodruff. Lower bounds for sparse recovery. SODA,. [Don6] D. L. Donoho. Compressed Sensing. IEEE Trans. Info. Theory, 5(4:89 36, Apr. 6. [JP83] [K L] [LD5] K. Joag-Dev and F. Proschan. Negative association of random variables with applications. The Annals of Statistics, (:86 95, 983. M. Karońsi and T. Lucza. The phase transition in a random hypergraph. J. Comput. Appl. Math., 4(:5 35,. C. La and M. N. Do. Signal reconstruction using sparse tree representation. In in Proc. Wavelets XI at SPIE Optics and Photonics, 5. [DPR96] [DR96] [EB9] D. Dubhashi, V. Priebe, and D. Ranjan. Negative dependence through the FKG inequality. In Research Report MPI-I-96--, Max-Planc-Institut fur Informati, Saarbrucen, 996. D. Dubhashi and D. Ranjan. Balls and bins: A study in negative dependence. Random Structures & Algorithms, 3:99 4, 996. Y.C. Eldar and H. Bolcsei. Bloc-sparsity: Coherence and efficient recovery. IEEE Int. Conf. Acoustics, Speech and Signal Processing, 9. [Mit4] M. Mitzenmacher. A brief history of generative models for power law and lognormal distributions. Internet Mathematics, :6 5, 4. [Mut3] S. Muthurishnan. Data streams: Algorithms and applications (invited tal at SODA 3. Available at http: // athos. rutgers. edu/ \ sim muthu/ stream--. ps, 3. [Rom9] J. Romberg. Compressive sampling by random convolution. SIAM Journal on Imaging Science, 9. 3

14 [SPH9] A M. Stojnic, F. Parvaresh, and B. Hassibi. On the reconstruction of bloc-sparse signals with an optimal number of measurements. IEEE Trans. Signal Processing, 9. Negative Dependence Negative dependence is a fairly common property in balls-and-bins types of problems, and can often cleanly be analyzed using the framewor of negative association ([DR96, DPR96, JP83]. Definition (Negative Association. Let (X,..., X n be a vector of random variables. Then (X,..., X n are negatively associated if for every two disjoint index sets, I, J [n], E[f(X i, i Ig(X j, j J] E[f(X i, i I]E[g(X j, j J] for all functions f : R I R and g : R J R that are both non-decreasing or both non-increasing. If random variables are negatively associated then one can apply most standard concentration of measure arguments, such as Chebyshev s inequality and the Chernoff bound. This means it is a fairly strong property, which maes it hard to prove directly. What maes it so useful is that it remains true under two composition rules: Lemma A. ([DR96], Proposition 7.. If (X,..., X n and (Y,..., Y m are each negatively associated and mutually independent, then (X,..., X n, Y,..., Y m is negatively associated.. Suppose (X,..., X n is negatively associated. Let I,..., I [n] be disjoint index sets, for some positive integer. For j [], let h j : R Ij R be functions that are all non-decreasing or all nonincreasing, and define Y j = h j (X i, i I j. Then (Y,..., Y is also negatively associated. Lemma A. allows us to relatively easily show that one component of our error (the point error is negatively associated without performing any computation. Unfortunately, the other component of our error (the component size is not easily built up by repeated applications of Lemma A. 8. Therefore we show something much weaer for this error, namely approximate negative correlation: E[X i X j ] E[X i ]E[X j ] Ω( E[X i] E[X j ] 8 This paper considers the component size of each hyperedge, which clearly is not negatively associated: if one hyperedge is in a component of size than so is every other hyperedge. But one can consider variants that just consider the distribution of component sizes, which seems plausibly negatively associated. However, this is hard to prove. for all i j. This is still strong enough to use Chebyshev s inequality. B Set Query in the l norm This section wors through all the changes to prove the set query algorithm wors in the l norm with w = O( measurements. We use Lemma 3.5 to get an l analog of Corollary 3.: (4 x x S = i S i S (x x S i C i,j Y j = D i Y i. j S i S Then we bound the expectation, variance, and covariance of D i and Y i. The bound on D i wors the same as in Section 3.6: E[D i ] = O(, E[Di ] = O(, E[D i D j ] E[D i ] O(log 4 /. The bound on Y i is slightly different. We define U q = ν q + x i B i,q i [n]\s and observe that U q V q, and U q is NA. Hence is NA, and Y i Z i. Define then Z i = median U q q L i µ = E[U q] = d w x x S + w ν ( x x S + ν Pr[Z i cµ] Li ( c Li / ( d 4 c so E[Z i ] = O(µ and E[Z i ] = O(µ. Now we will show the analog of Section 3.7. We now x x S i D i Z i and E[ D i Z i] = E[D i ] E[Z i] = µ i i for some µ = O( ( x x S + ν. Then E[( D i Z i ] = i Var( i i E[D i ] E[Z i ] + i j E[D i D j ] E[Z iz j] O(µ + i j (E[D i ] + O(log 4 / E[Z i] = O(µ log 4 + ( E[D i Z i] Z id i O(µ log 4. 4

CS 598CSC: Algorithms for Big Data Lecture date: Sept 11, 2014

CS 598CSC: Algorithms for Big Data Lecture date: Sept 11, 2014 CS 598CSC: Algorithms for Big Data Lecture date: Sept 11, 2014 Instructor: Chandra Cheuri Scribe: Chandra Cheuri The Misra-Greis deterministic counting guarantees that all items with frequency > F 1 /

More information

MAT 585: Johnson-Lindenstrauss, Group testing, and Compressed Sensing

MAT 585: Johnson-Lindenstrauss, Group testing, and Compressed Sensing MAT 585: Johnson-Lindenstrauss, Group testing, and Compressed Sensing Afonso S. Bandeira April 9, 2015 1 The Johnson-Lindenstrauss Lemma Suppose one has n points, X = {x 1,..., x n }, in R d with d very

More information

Compressed Sensing and Sparse Recovery

Compressed Sensing and Sparse Recovery ELE 538B: Sparsity, Structure and Inference Compressed Sensing and Sparse Recovery Yuxin Chen Princeton University, Spring 217 Outline Restricted isometry property (RIP) A RIPless theory Compressed sensing

More information

Linear Sketches A Useful Tool in Streaming and Compressive Sensing

Linear Sketches A Useful Tool in Streaming and Compressive Sensing Linear Sketches A Useful Tool in Streaming and Compressive Sensing Qin Zhang 1-1 Linear sketch Random linear projection M : R n R k that preserves properties of any v R n with high prob. where k n. M =

More information

Notes on Discrete Probability
