Approximating and Testing k-histogram Distributions in Sub-linear time


Electronic Colloquium on Computational Complexity, Report No. 171 (2011)

Approximating and Testing k-histogram Distributions in Sub-linear Time

Piotr Indyk*   Reut Levi†   Ronitt Rubinfeld‡

November 9, 2011

Abstract

A discrete distribution p, over [n], is a k-histogram if its probability distribution function can be represented as a piece-wise constant function with k pieces. Such a function is represented by a list of k intervals and k corresponding values. We consider the following problem: given a collection of samples from a distribution p, find a k-histogram that approximately minimizes the ℓ₂ distance to the distribution p. We give time and sample efficient algorithms for this problem. We further provide algorithms that distinguish distributions that have the property of being a k-histogram from distributions that are ɛ-far from any k-histogram in the ℓ₁ distance and ℓ₂ distance respectively.

* CSAIL, MIT, Cambridge MA. E-mail: indyk@theory.lcs.mit.edu. This material is based upon work supported by the David and Lucille Packard Fellowship, MADALGO (Center for Massive Data Algorithmics, funded by the Danish National Research Association) and NSF grant CCF.
† School of Computer Science, Tel Aviv University. E-mail: reuti.levi@gmail.com. Research supported by the Israel Science Foundation grant nos. 1147/09 and 46/08.
‡ CSAIL, MIT, Cambridge MA 02139 and the Blavatnik School of Computer Science, Tel Aviv University. E-mail: ronitt@csail.mit.edu. Research supported by NSF grants, Marie Curie Reintegration grant PIRG03-GA and the Israel Science Foundation grant nos. 1147/09 and 1675/09.

1 Introduction

The ubiquity of massive data sets is a phenomenon that began over a decade ago, and is becoming more and more pervasive. As a result, there has recently been significant interest in constructing succinct representations of the data. Ideally, such representations should take little space and computation time to operate on, while approximately preserving the desired properties of the data.

One of the most natural and useful succinct representations of the data are histograms. For a data set D whose elements come from the universe [n], a k-histogram H is a piecewise constant function defined over [n] consisting of k pieces. Note that a k-histogram can be described using O(k) numbers. A good k-histogram is such that (a) the value H(i) is a good approximation of the total number of times an element i occurs in the data set (denoted by P(i)) and (b) the value of k is small.

Histograms are a popular and flexible way to approximate the distribution of data attributes (e.g., employees' age or salary) in databases. They can be used for data visualization, analysis and approximate query answering. As a result, computing and maintaining histograms of the data has attracted a substantial amount of interest in databases and beyond; see e.g. [GMP97, JPK+98, GKS06, CMN98, TGIK02, GGI+02], or the survey [Ioa03].

A popular criterion for fitting a histogram to a distribution P is the least-squares criterion. Specifically, the goal is to find H that minimizes the ℓ₂ norm ‖P − H‖₂. Such histograms are often called v-optimal histograms, with v standing for variance. There has been a substantial amount of work on algorithms, approximate or exact, that compute the optimal k-histogram H given P and k [JPK+98, GKS06]. However, since these algorithms need to read the whole input to compute H, their running times are at least linear in n. A more efficient way to construct data histograms is to use random samples from the data set D. There have been some results on this front as well [CMN98, GMP97].
However, they have been restricted to so-called equi-depth histograms (which are essentially approximate quantiles of the data distribution) or so-called compressed histograms. Although the names by which they are referred to sound similar, both of these representations are quite different from the representations considered in this paper. We are not aware of any work on constructing v-optimal histograms from random samples with provable guarantees.

The problem of constructing an approximate histogram from random samples can be formulated in the framework of distribution property testing and estimation [Rub06, Ron08]. In this framework, an algorithm is given access to i.i.d. samples from an unknown probability distribution p, and its goal is to characterize or estimate various properties of p. In our case we define p = P/‖P‖₁. Then choosing a random element from the data set D corresponds to choosing i ∈ [n] according to the distribution p. In this paper we propose several algorithms for constructing and testing for the existence of good histograms approximating a given distribution p.

1.1 Histogram taxonomy

Formally, a histogram is a function H : [n] → [0, 1] that is defined by a sequence of intervals I₁, ..., I_k and a corresponding sequence of values v₁, ..., v_k. For t ∈ [n], H(t) represents an estimate of p(t). We consider the following classes of histograms (see [TGIK02] for a full list of classes):

1. Tiling histograms: the intervals form a tiling of [n] (i.e., they are disjoint and cover the whole domain). For any t we have H(t) = v_i, where t ∈ I_i. In practice we represent a tiling k-histogram as a sequence {(I₁, v₁), ..., (I_k, v_k)}.

2. Priority histograms: the intervals can overlap. For any t we have H(t) = v_i, where i is the largest index such that t ∈ I_i; if none exists, H(t) = 0. In practice we represent a priority k-histogram as

{(I₁, v₁, r₁), ..., (I_k, v_k, r_k)}, where r₁, ..., r_k correspond to the priorities of the intervals.

Note that if a function has a tiling k-histogram representation then it has a priority k-histogram representation. Conversely, if it has a priority k-histogram representation then it has a tiling 2k-histogram representation.

1.2 Results

The following algorithms receive as input a distribution p over [n], an accuracy parameter ɛ and an integer k.

1. In Section 3, we describe an algorithm which outputs a priority k-histogram that is closest to p in the ℓ₂ distance, up to an ɛ-additive error. The algorithm is a greedy algorithm: at each step it enumerates over all possible intervals and adds the interval which minimizes the approximated ℓ₂ distance. The sample complexity of the algorithm is Õ((k²/ɛ²) · ln n) and the running time is Õ((k²/ɛ²) · n²). We then improve the running time substantially to Õ((k²/ɛ²) · ln n) by enumerating over a partial set of intervals.

2. In Section 4, we provide a testing algorithm for the property of being a tiling k-histogram with respect to the ℓ₁ norm. The sample complexity of the algorithm is Õ(√(kn)/ɛ⁵). We provide a similar test for the ℓ₂ norm that has sample complexity O(ln² n/ɛ⁴). We prove that testing if a distribution is a tiling k-histogram in the ℓ₁-norm requires Ω(√(kn)) samples, for every k ≤ 1/ɛ.

1.3 Related Work

Our formulation of the problem falls within the framework of property testing [RS96, GGR98, BFR+00]. Properties of single and of pairs of distributions have been studied quite extensively in the past (see [BFR+10, BFF+01, AAK+07, BDKR05, GMP97, BKR04, RRSS09, Val08, VV11]). One question that has received much attention in property testing is to determine whether or not two distributions are similar. The problem referred to as Identity testing assumes that the algorithm is given access to samples of distribution p and an explicit description of distribution q. The goal is to distinguish a pair of distributions that are identical from a pair of distributions that are far from each other.
A special case of Identity testing is Uniformity testing, where the fixed distribution, q, is the uniform distribution. A uniform distribution can be represented by a tiling 1-histogram, and therefore the study of uniformity testing is closely related to our study. Goldreich and Ron [GR00] study uniformity testing in the context of approximating graph expansion. They show that counting pairwise collisions in a sample can be used to approximate the ℓ₂-norm of the probability distribution from which the sample was drawn. Several more recent works, including this one, make use of this technical tool. Batu et al. [BFR+10] note that running the [GR00] algorithm with Õ(√n) samples yields an algorithm for uniformity testing in the ℓ₁-norm. Paninski [Pan08] gives an optimal algorithm in this setting that takes a sample of size O(√n) and proves a matching lower bound of Ω(√n). Valiant [Val08] shows that a tolerant tester for uniformity with constant precision would require n^(1−o(1)) samples. Several works in property testing of distributions approximate the distribution by a small histogram distribution and use this representation in an essential way in their algorithms [BKR04, BFF+01]. Histograms were the subject of extensive research in the data stream literature; see [TGIK02, GGI+02] and the references therein. Our algorithm in Section 3 is inspired by a streaming algorithm in [TGIK02].
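To make the two histogram classes of Section 1.1 concrete, the following small Python sketch (our own illustration; helper names and the inclusive-endpoint, 0-indexed convention are assumptions, not the paper's) evaluates a tiling and a priority histogram at a point:

```python
# Illustrative sketch of the two histogram classes from Section 1.1.
# Interval endpoints are assumed inclusive and 0-indexed.

def eval_tiling(hist, t):
    """Tiling k-histogram: disjoint intervals covering the whole domain.
    hist is a list of ((a, b), v) pairs; H(t) = v_i for the tile containing t."""
    for (a, b), v in hist:
        if a <= t <= b:
            return v
    raise ValueError("tiles must cover the whole domain")

def eval_priority(hist, t):
    """Priority k-histogram: intervals may overlap.
    hist is a list of ((a, b), v, r) triples; H(t) is the value of the
    highest-priority interval containing t, or 0 if none contains t."""
    best_r, best_v = -1, 0.0
    for (a, b), v, r in hist:
        if a <= t <= b and r > best_r:
            best_r, best_v = r, v
    return best_v

# A priority 2-histogram and an equivalent tiling 3-histogram over [0, 9],
# illustrating that a priority representation converts to a tiling one
# with more pieces:
priority = [((0, 9), 0.05, 1), ((3, 5), 0.15, 2)]
tiling = [((0, 2), 0.05), ((3, 5), 0.15), ((6, 9), 0.05)]
assert all(eval_priority(priority, t) == eval_tiling(tiling, t) for t in range(10))
```

The example pair also shows why the conversion from priority to tiling form can increase the number of pieces: the overlapping interval splits the enclosing one.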

2 Preliminaries

Denote by D_n the set of all discrete distributions over [n]. A property of discrete distributions is a subset P ⊆ D_n. We say that a distribution p ∈ D_n is ɛ-far from p' ∈ D_n in the ℓ₁ distance (ℓ₂ distance) if ‖p − p'‖₁ > ɛ (‖p − p'‖₂ > ɛ). We say that an algorithm, A, is a testing algorithm for the property P if, given an accuracy parameter ɛ and a distribution p:

1. if p ∈ P, A accepts p with probability at least 2/3;
2. if p is ɛ-far (according to the specified distance measure) from every distribution in P, A rejects p with probability at least 2/3.

Let p ∈ D_n; then for every i ∈ [n], denote by p_i the probability of the i-th element. For every I ⊆ [n], let p(I) denote the weight of I, i.e. ∑_{i∈I} p_i. For every I ⊆ [n] such that p(I) ≠ 0, let p_I denote the distribution of p restricted to I, i.e. p_I(i) = p_i / p(I). Call an interval I flat if p_I is uniform or p(I) = 0. Given a set of samples from p, S, denote by S_I the samples that fall in the interval I. Define the observed collision probability of interval I as colls(S_I)/C(|S|,2), where colls(S_I) := ∑_{i∈I} C(occ(i,S_I),2), occ(i,S_I) is the number of occurrences of i in S_I, and C(m,2) = m(m−1)/2. In [GR00], in the proof of Lemma 1, it was shown that

E[colls(S_I)/C(|S_I|,2)] = ‖p_I‖₂²   (1)

and that

Pr[ |colls(S_I)/C(|S_I|,2) − ‖p_I‖₂²| > δ·‖p_I‖₂² ] < 4/(δ² · √|S_I| · ‖p_I‖₂).   (2)

In particular, since ‖p_I‖₂ ≤ 1, we also have that

Pr[ |colls(S_I)/C(|S_I|,2) − ‖p_I‖₂²| > ɛ ] < 4/(ɛ² · √|S_I|).   (3)

In a similar fashion we prove the following lemma.

Lemma 1 (Based on [GR00]) If we take m = 4/ɛ² samples, S, then, for every interval I,

Pr[ |colls(S_I)/C(|S|,2) − ∑_{i∈I} p_i²| ≤ ɛ·p(I) ] > 3/4.   (4)

Proof: For every i < j define an indicator variable C_{i,j} so that C_{i,j} = 1 if the i-th sample is equal to the j-th sample and is in the interval I. For every i < j, μ := E[C_{i,j}] = ∑_{i∈I} p_i². Let P := {(i,j) : 1 ≤ i < j ≤ m}. By Chebyshev's inequality:

Pr[ |∑_{(i,j)∈P} C_{i,j} − |P|·μ| > ɛ·p(I)·|P| ] ≤ Var[∑_{(i,j)∈P} C_{i,j}] / (ɛ² · p(I)² · |P|²).

From [GR00] we know that

Var[∑_{(i,j)∈P} C_{i,j}] ≤ |P|·μ + |P|^(3/2)·μ^(3/2),   (5)

and since μ ≤ p(I)² we have Var[∑_{(i,j)∈P} C_{i,j}] ≤ p(I)²·|P| + |P|^(3/2)·p(I)²·μ^(1/2); thus

Pr[ |∑_{(i,j)∈P} C_{i,j} − |P|·μ| > ɛ·p(I)·|P| ] < (|P| + |P|^(3/2)·μ^(1/2)) / (ɛ²·|P|²)   (6)
≤ 1/(ɛ²·|P|) + p(I)/(ɛ²·√|P|) ≤ 1/4.   (7)

3 Near-optimal Priority k-histogram

In this section we give an algorithm that, given p ∈ D_n, outputs a priority k-histogram which is close in the ℓ₂ distance to an optimal tiling k-histogram that describes p. The algorithm, based on a sketching algorithm in [TGIK02], takes a greedy strategy. Initially the algorithm starts with an empty priority histogram. It then proceeds by doing 2k ln(1/ɛ) iterations, where in each iteration it goes over all O(n²) possible intervals and adds the best one, i.e. the interval I ⊆ [n] which minimizes the distance between p and H when added to the currently constructed priority histogram H. The algorithm has an efficient sample complexity, with only logarithmic dependence on n, but the running time has polynomial dependence on n. This polynomial dependence is due to the exhaustive search for the interval which minimizes the distance between p and H.

Theorem 1 Let p ∈ D_n be the distribution and let H* be the tiling k-histogram which minimizes ‖p − H*‖₂. The priority histogram H reported by Algorithm 1 satisfies ‖p − H‖₂² ≤ ‖p − H*‖₂² + 5ɛ. The sample complexity of Algorithm 1 is Õ((k²/ɛ²) · ln n). The running time complexity of Algorithm 1 is Õ((k²/ɛ²) · n²).

Proof: By Chernoff's bound and a union bound over the intervals in [n], with high constant probability, for every I,

|y_I − p(I)| ≤ ξ.   (8)

By Lemma 1 and Chernoff's bound, with high constant probability, for every I,

|z_I − ∑_{i∈I} p_i²| ≤ ξ·p(I).   (9)

Henceforth, we assume that the estimates obtained by the algorithm are good. It is clear that we can transform any priority histogram H into any tiling k-histogram H* by adding the k intervals of H* to H with the highest priority. This implies that there exists an interval J and a value y_J such that adding them to H as described in Algorithm 1 decreases the error in the following way:

‖p − H_{J,y_J}‖₂² − ‖p − H*‖₂² ≤ (1 − 1/k)·(‖p − H‖₂² − ‖p − H*‖₂²),   (10)

Algorithm 1: Greedy algorithm for priority k-histogram
1  Obtain l = ln(12n)/ξ² samples, S, from p, where ξ := ɛ/(2k ln(1/ɛ));
2  For each interval I ⊆ [n] set y_I := |S_I|/l;
3  Obtain r = ln(6n²) sets of samples, S¹, ..., S^r, each of size m = 4/ξ², from p;
4  For each interval I ⊆ [n] set z_I := median(colls(S¹_I)/C(m,2), ..., colls(S^r_I)/C(m,2));
5  Initialize the priority histogram H to empty;
6  for i := 1 to 2k ln(1/ɛ) do
7      foreach interval J ⊆ [n] do
8          Create H_{J,y_J}, obtained by:
               adding the interval J to H with the value y_J/|J|;
               recomputing the interval to the left (right) of J, I_L (I_R), so it would not intersect with J;
               adding I_L (I_R) with the value y_{I_L} (y_{I_R}) to H;
           c_J := ∑_{I ∈ H_{J,y_J}} (z_I − y_I²/|I|);
9      Let J_min be the interval with the smallest value of c_J;
10     Update H to be H_{J_min, y_{J_min}};
11 return H;

where H_{J,y_J} is defined in Algorithm 1. Next, we would like to write the distance between H_{J,y_J} and p as a function of ∑_{i∈I} p_i² and p(I), for I ∈ H_{J,y_J}. We note that the value of x that minimizes the sum ∑_{i∈I} (p_i − x)² is x = p(I)/|I|; therefore

‖p − H_{J,y_J}‖₂² = ∑_{I∈H_{J,y_J}} ∑_{i∈I} (p_i − p(I)/|I|)²   (11)
= ∑_{I∈H_{J,y_J}} ( ∑_{i∈I} p_i² − 2·(p(I)/|I|)·p(I) + p(I)²/|I| )   (12)
= ∑_{I∈H_{J,y_J}} ( ∑_{i∈I} p_i² − p(I)²/|I| ).   (13)

Since c_J = ∑_{I∈H_{J,y_J}} (z_I − y_I²/|I|), by applying the triangle inequality twice we get that

c_J ≤ ∑_{I∈H_{J,y_J}} |z_I − ∑_{i∈I} p_i²| + ∑_{I∈H_{J,y_J}} |∑_{i∈I} p_i² − y_I²/|I||   (14)
≤ ∑_{I∈H_{J,y_J}} |z_I − ∑_{i∈I} p_i²| + ∑_{I∈H_{J,y_J}} ( ∑_{i∈I} p_i² − p(I)²/|I| ) + ∑_{I∈H_{J,y_J}} |p(I)² − y_I²|/|I|.   (15)

So after reordering the terms in Equation (15) we obtain that

c_J ≤ ∑_{I∈H_{J,y_J}} ( ∑_{i∈I} p_i² − p(I)²/|I| ) + ∑_{I∈H_{J,y_J}} |z_I − ∑_{i∈I} p_i²| + ∑_{I∈H_{J,y_J}} |y_I² − p(I)²|/|I|.   (16)

From the fact that |y_I² − p(I)²| = |y_I − p(I)|·(y_I + p(I)) and Equation (8) it follows that

|y_I² − p(I)²| ≤ ξ·(ξ + 2p(I)).   (17)

Therefore we obtain from Equations (9), (13), (16) and (17) that

c_J ≤ ‖p − H_{J,y_J}‖₂² + ∑_{I∈H_{J,y_J}} ξ·p(I) + ∑_{I∈H_{J,y_J}} ξ·(ξ + 2p(I))   (18)
≤ ‖p − H_{J,y_J}‖₂² + 3ξ + |{I ∈ H_{J,y_J}}|·ξ².   (19)

Since the algorithm calculates c_J for every interval J, we derive from Equations (10) and (19) that at the q-th step

‖p − H_{J_min,y_{J_min}}‖₂² − ‖p − H*‖₂² ≤ (1 − 1/k)·(‖p − H‖₂² − ‖p − H*‖₂²) + 3ξ + 2qξ².   (20)

So for H obtained by the algorithm after q steps we have ‖p − H‖₂² ≤ ‖p − H*‖₂² + (1 − 1/k)^q + q·(3ξ + 2qξ²). Setting q = 2k ln(1/ɛ) we obtain that ‖p − H‖₂² ≤ ‖p − H*‖₂² + 5ɛ, as desired.

3.1 Improving the Running Time

We now turn to improving the running time complexity to match the sample complexity. Instead of going over all possible intervals in [n] in search of an interval I ⊆ [n] to add to the constructed priority histogram H, we search for I over a much smaller subset of intervals; in particular, only those intervals whose endpoints are samples or neighbors of samples. In Lemma 2 we prove that if we decrease the value a histogram H assigns to an interval I, then the square of the distance between H and p in the ℓ₂-norm can grow by at most 2p(I). The lemma implies that we can treat light-weight intervals as atomic components in our search, because they do not affect the distance between H and p by much. While the running time is reduced significantly, we prove that the histogram this algorithm outputs is still close to being optimal.

Lemma 2 Let p ∈ D_n and let I be an interval in [n]. For 0 ≤ β₂ < β₁ ≤ 1,

∑_{i∈I} (p_i − β₂)² ≤ ∑_{i∈I} (p_i − β₁)² + 2p(I).

Theorem 2 Let p and H* be as in Theorem 1. There is an algorithm that outputs a priority histogram H which satisfies ‖p − H‖₂² ≤ ‖p − H*‖₂² + 8ɛ. The sample complexity of the algorithm and the running time complexity of the algorithm are Õ((k²/ɛ²) · ln n).

Proof: In the improved algorithm, as in Algorithm 1, we take l = ln(12n)/ξ² samples, T. Instead of going over all J ⊆ [n] in Step 7 we consider only a small subset of intervals as candidates. We denote this subset of intervals by 𝒯.
Let T̂ be the set of all elements in T and those that are at distance one away, i.e.

T̂ = {min{i + 1, n}, i, max{i − 1, 0} : i ∈ T}. Then 𝒯 is the set of all intervals between pairs of elements in T̂, i.e. [a, b] ∈ 𝒯 if and only if a ≤ b and a, b ∈ T̂. Thus, the size of 𝒯 is bounded above by (3l + 1)². Therefore we decrease the number of iterations in Step 7 from O(n²) to at most (3l + 1)².

It is easy to see that intervals which are not in 𝒯 have small weight. Formally, let I be an interval such that p(I) > ξ. The probability that I has no hits after taking l samples is at most (1 − ξ)^l < 1/n². Therefore, by a union bound over all the intervals in [n], with high constant probability, for every interval which has no hits after taking l samples, the weight of the interval is at most ξ.

Next we see why in Step 7 we can ignore intervals which have small weight. Consider a single run of the loop in Step 7 in Algorithm 1. Let H be the histogram constructed by the algorithm so far and let J_min be the interval added to H at the end of the run. We shall see that there is an interval J' ∈ 𝒯 such that

‖p − H_{J', p(J')/|J'|}‖₂² ≤ ‖p − H_{J_min, y_{J_min}}‖₂² + 4ξ.   (22)

Denote the endpoints of J_min by a and b, where a < b. Let I₁ = [a₁, b₁] be the largest interval in 𝒯 such that I₁ ⊆ J_min, and let I₂ = [a₂, b₂] be the smallest interval in 𝒯 such that J_min ⊆ I₂. Therefore, for every interval J = [x, y] where x ∈ {a₁, a₂} and y ∈ {b₁, b₂} we have that ∑_{i ∈ J△J_min} p_i ≤ 2ξ, where J△J_min is the symmetric difference of J and J_min. Let β₁ (β₂) be the value assigned to i ∈ [a₂, a₁] (i ∈ [b₁, b₂]) by H_{J_min, y_{J_min}}, respectively. Notice that the algorithm only assigns values to intervals with endpoints in T̂; therefore β₁ and β₂ are well defined. Take J' to be as follows. If β₁ > y_{J_min}/|J_min| then take the start-point of J' to be a₁, otherwise take it to be a₂. If β₂ > y_{J_min}/|J_min| then take the end-point of J' to be b₁, otherwise take it to be b₂. By Lemma 2 it follows that

‖p − H_{J', y_{J_min}}‖₂² ≤ ‖p − H_{J_min, y_{J_min}}‖₂² + 2·∑_{i ∈ J'△J_min} p_i ≤ ‖p − H_{J_min, y_{J_min}}‖₂² + 4ξ.   (23)
Thus, we obtain Equation (22) from the fact that

‖p − H_{J', p(J')/|J'|}‖₂² = min_δ ‖p − H_{J', δ}‖₂².   (24)

Thus, by similar calculations as in the proof of Theorem 1, after q steps,

‖p − H‖₂² ≤ ‖p − H*‖₂² + (1 − 1/k)^q + q·(3ξ + 2qξ² + 4ξ).

Setting q = 2k ln(1/ɛ) we obtain that ‖p − H‖₂² ≤ ‖p − H*‖₂² + 8ɛ.

Proof of Lemma 2:

∑_{i∈I} (p_i − β₂)² = ∑_{i∈I} ((p_i − β₁) + (β₁ − β₂))²   (25)
= ∑_{i∈I} (p_i − β₁)² + ∑_{i∈I} (β₁ − β₂)·(2p_i − β₁ − β₂)   (26)
≤ ∑_{i∈I} (p_i − β₁)² + 2p(I).   (27)

4 Testing whether a Distribution is a Tiling k-histogram

In this section we provide testing algorithms for the property of being a tiling k-histogram. The testing algorithms attempt to partition [n] into k intervals which are flat according to p (recall that an interval is flat if it has a uniform conditional distribution or it has no weight). If the algorithm fails to do so then it rejects p. Intervals that are close to being flat can be detected because either they have light weight, in which case they can be

found via sampling, or they are not light weight, in which case they have a small ℓ₂-norm. A small ℓ₂-norm can in turn be detected via estimates of the collision probability. Thus an interval that has an overall small number of samples, or alternatively a small number of pairwise collisions, is considered by the algorithm to be a flat interval. The search for the boundaries of the flat intervals is performed in a manner similar to the search for a value in a binary search. The efficiency of our testing algorithms is stated in the following theorems:

Theorem 3 Algorithm 2 is a testing algorithm for the property of being a tiling k-histogram for the ℓ₂ distance measure. The sample complexity of the algorithm is O(ln² n/ɛ⁴). The running time complexity of the algorithm is O(k ln³ n/ɛ⁴).

Theorem 4 There exists a testing algorithm for the property of being a tiling k-histogram for the ℓ₁ distance measure. The sample complexity of the algorithm is Õ(√(kn)/ɛ⁵). The running time complexity of the algorithm is Õ(k·√(kn)/ɛ⁵).

Algorithm 2: Test Tiling k-histogram
1  Obtain r = 16 ln(6n²) sets of samples, S¹, ..., S^r, each of size m = 64 ln n/ɛ⁴, from p;
2  Set previous := 1, low := 1, high := n;
3  for i := 1 to k do
4      while high ≥ low do
5          mid := low + ⌊(high − low)/2⌋;
6          if testflatness-ℓ₂([previous, mid], S¹, ..., S^r, ɛ) then
7              low := mid + 1;
8          else
9              high := mid − 1;
10     previous := low;
11     high := n;
12 If previous = n then return ACCEPT;
13 return REJECT;

Algorithm 3: testflatness-ℓ₂(I, S¹, ..., S^r, ɛ)
1  For each i ∈ [r] set p̂_i(I) := |S^i_I|/m;
2  If there exists i ∈ [r] such that |S^i_I| < ɛ²m/2 then return ACCEPT;
3  If median(colls(S¹_I)/C(|S¹_I|,2), ..., colls(S^r_I)/C(|S^r_I|,2)) ≤ 1/|I| + max_i{(ɛ²/2)·p̂_i(I)} then return ACCEPT;
4  return REJECT;

Proof of Theorem 3: Let I be an interval in [n]. We first show that

Pr[ |median{colls(S¹_I)/C(|S¹_I|,2), ..., colls(S^r_I)/C(|S^r_I|,2)} − ‖p_I‖₂²| ≤ max_i{(ɛ²/2)·p̂_i(I)} ] > 1 − 1/(6n²).   (28)

Recall that p̂_i(I) = |S^i_I|/m, where m = 64 ln n/ɛ⁴; hence |S^i_I| = 64 ln n · p̂_i(I)/ɛ⁴. By Equation (3), for each i ∈ [r],

Pr[ |colls(S^i_I)/C(|S^i_I|,2) − ‖p_I‖₂²| > (ɛ²/2)·p̂_i(I) ] ≤ 1/4.   (29)

Since each estimate colls(S^i_I)/C(|S^i_I|,2) is close to ‖p_I‖₂² with high constant probability, we get from Chernoff's bound that, for r = 16 ln(6n²), the median of the r results is close to ‖p_I‖₂² with very high probability, as stated in Equation (28). By a union bound over all the intervals in [n], with high constant probability, the following holds for every one of the at most n² intervals I in [n]:

|median{colls(S¹_I)/C(|S¹_I|,2), ..., colls(S^r_I)/C(|S^r_I|,2)} − ‖p_I‖₂²| ≤ max_i{(ɛ²/2)·p̂_i(I)}.   (30)

So henceforth we assume that this is the case.

Assume the algorithm rejects p. When this occurs it implies that there are at least k distinct intervals such that for each interval the test testflatness-ℓ₂ returned REJECT. For each of these intervals I we have p(I) ≠ 0 and median{colls(S¹_I)/C(|S¹_I|,2), ..., colls(S^r_I)/C(|S^r_I|,2)} > 1/|I| + max_i{(ɛ²/2)·p̂_i(I)}. In this case ‖p_I‖₂² > 1/|I|, and so I is not flat and contains at least one bucket boundary. Thus, there are at least k internal bucket boundaries; therefore p is not a tiling k-histogram.

Assume the algorithm accepts p. When this occurs there is a partition of [n] into k intervals, 𝓘, such that for each interval I ∈ 𝓘, testflatness-ℓ₂ returned ACCEPT. Define p' to be p(I)/|I| on the intervals obtained by the algorithm. For every I ∈ 𝓘: if it is the case that there exists i ∈ [r] such that |S^i_I| < ɛ²m/2, then by Fact 1 below, p(I) < ɛ². Therefore, from the fact that ∑_{i∈I} (p_i − x)² is minimized by x = p(I)/|I| and the Cauchy–Schwarz inequality, we get that

∑_{i∈I} (p_i − p(I)/|I|)² ≤ ∑_{i∈I} p_i² ≤ p(I)²   (31)
≤ ɛ²·p(I).   (32)

Otherwise, |S^i_I| ≥ ɛ²m/2 for every i ∈ [r], and then by the second item in Fact 1, p(I) > ɛ²/4. By the first item in Fact 1, p̂_i(I) = |S^i_I|/m ≤ 3p(I)/2, and therefore

median{colls(S¹_I)/C(|S¹_I|,2), ..., colls(S^r_I)/C(|S^r_I|,2)} ≤ 1/|I| + (3ɛ²/4)·p(I).   (33)

This implies that ‖p_I‖₂² ≤ 1/|I| + (3ɛ²/2)·p(I), i.e. ‖p_I − u‖₂² ≤ (3ɛ²/2)·p(I), where u is the uniform distribution over I. Since ∑_{i∈I} (p_i − p(I)/|I|)² = p(I)²·‖p_I − u‖₂², we get that ∑_{i∈I} (p_i − p(I)/|I|)² ≤ ɛ²·p(I). Hence ∑_{I∈𝓘} ∑_{i∈I} (p_i − p(I)/|I|)² ≤ ɛ²; thus p is ɛ-close to p' in the ℓ₂-norm.
Fact 1 If we take m = 48 ln(n²/γ)/ɛ² samples, S, then with probability greater than 1 − γ:

1. For any I such that p(I) ≥ ɛ²/4: p(I)/2 ≤ |S_I|/m ≤ 3p(I)/2.
2. For any I such that |S_I|/m ≥ ɛ²/2: p(I) > ɛ²/4.
3. For any I such that |S_I|/m < ɛ²/2: p(I) < ɛ².

Proof: Fix I. If p(I) ≥ ɛ²/4, then by Chernoff's bound, with probability greater than 1 − e^(−ɛ²m/48),

p(I)/2 ≤ |S_I|/m ≤ 3p(I)/2.   (34)

In particular, if p(I) = ɛ²/4, then |S_I|/m ≤ 3ɛ²/8; thus if |S_I|/m ≥ ɛ²/2 > 3ɛ²/8 then p(I) > ɛ²/4. If |S_I|/m < ɛ²/2 then either p(I) < ɛ²/4, or p(I) ≥ ɛ²/4 but then p(I) ≤ 2|S_I|/m < ɛ². By the union bound, with probability greater than 1 − n²·e^(−ɛ²m/48) > 1 − γ, the above is true for every I.

Algorithm 4: testflatness-ℓ₁(I, S¹, ..., S^r, ɛ)
1  If there exists i ∈ [r] such that |S^i_I| < 16³/ɛ⁴ then return ACCEPT;
2  If median(colls(S¹_I)/C(|S¹_I|,2), ..., colls(S^r_I)/C(|S^r_I|,2)) ≤ (1 + ɛ²/4)/|I| then return ACCEPT;
3  return REJECT;

Proof of Theorem 4: Apply Algorithm 2 with the following changes: take each set of samples S^i to be of size m = 2¹³·√(kn)/ɛ⁵ and replace testflatness-ℓ₂ with testflatness-ℓ₁. By Equation (2),

Pr[ |colls(S_I)/C(|S_I|,2) − ‖p_I‖₂²| > δ·‖p_I‖₂² ] < 4/(δ² · √|S_I| · ‖p_I‖₂).   (35)

Thus, if S_I is such that √|S_I| ≥ 64/(δ²·‖p_I‖₂), then

Pr[ |colls(S_I)/C(|S_I|,2) − ‖p_I‖₂²| ≤ δ·‖p_I‖₂² ] > 15/16.   (36)

By the additive Chernoff bound and the union bound, for r = 16 ln(6n²) and δ = ɛ²/16, with high constant probability, for every interval I that passes Step 1 in Algorithm 4 it holds that |colls(S_I)/C(|S_I|,2) − ‖p_I‖₂²| ≤ δ·‖p_I‖₂² (the total number of intervals in [n] is less than n²). So from this point on we assume that the algorithm obtains a δ-multiplicative approximation of ‖p_I‖₂² for every I that passes Step 1.

Assume the algorithm rejects p; then there are at least k distinct intervals such that for each interval the test testflatness-ℓ₁ returned REJECT. By our assumption each of these intervals is not flat and thus contains at least one bucket boundary. Thus, there are at least k internal bucket boundaries; therefore p is not a tiling k-histogram.

Assume the algorithm accepts p; then there is a partition of [n] into k intervals, 𝓘, such that for each interval I ∈ 𝓘, testflatness-ℓ₁ returned ACCEPT. Define p' to be p(I)/|I| on the intervals obtained by the algorithm. For

any interval I for which testflatness-ℓ₁ returned ACCEPT on Step 2 (and which passed Step 1), it holds that ‖p_I − u‖₂² < ɛ²/(2|I|), and thus, by the Cauchy–Schwarz inequality, ∑_{i∈I} |p_i − p(I)/|I|| ≤ ɛ·p(I)/√2. Denote by L the set of intervals for which testflatness-ℓ₁ returned ACCEPT on Step 1. By Chernoff's bound, for every I ∈ L, with probability greater than 1 − e^(−ɛ²m/(32k)), either p(I) ≤ ɛ²/(4k) or p(I) ≤ 2·16³/(ɛ⁴m). Hence, with probability greater than 1 − n²·r·e^(−ɛ²m/(32k)) > 1 − 1/6, the total weight of the intervals in L is

∑_{I∈L} p(I) ≤ ∑_{I∈L} max{ 2·16³/(ɛ⁴m), ɛ²/(4k) } ≤ ɛ·√(k/n) + ɛ²/4 ≤ ɛ/2,   (37)

where the last inequalities follow from the facts that |L| ≤ k and m = 2¹³·√(kn)/ɛ⁵. Therefore, p is ɛ-close to p' in the ℓ₁-norm.

4.1 Lower Bound

We prove that for every k ≤ 1/ɛ, the upper bound in Theorem 4 is tight in terms of the dependence on k and n. We note that for k = n, testing whether a distribution is a tiling k-histogram is trivial, i.e. every distribution is a tiling n-histogram. Hence, we cannot expect to have a lower bound for every k. We also note that the testing lower bound is also an approximation lower bound.

Theorem 5 Given a distribution D, testing if D is a tiling k-histogram in the ℓ₁-norm requires Ω(√(kn)) samples, for every k ≤ 1/ɛ.

Proof: Divide [n] into k intervals of equal size (up to ±1). In the YES instance the total probability of each interval alternates between 0 and 2/k, and within each interval the elements have equal probability. The NO instance is defined similarly, with one exception: randomly pick one of the intervals that have total probability 2/k, I, and within I randomly pick half of the elements to have probability 0 and the other half of the elements to have twice the probability of the corresponding elements in the YES instance. In the proof of the lower bound for testing uniformity it is shown that distinguishing a uniform distribution from a distribution that is uniform on a random half of the elements and has 0 weight on the other half requires Ω(√n) samples. Since the number of elements in I is Θ(n/k), by a similar argument we know that at least Ω(√(n/k)) samples are required from I in order to distinguish the YES instance from the NO instance.
From the fact that the total probability of I is Θ(1/k), we know that in order to obtain Θ(√(n/k)) hits in I we are required to take a total number of samples of order k·√(n/k) = √(nk); thus we obtain a lower bound of Ω(√(nk)).

References

[AAK+07] N. Alon, A. Andoni, T. Kaufman, K. Matulef, R. Rubinfeld, and N. Xie. Testing k-wise and almost k-wise independence. In Proceedings of the Thirty-Ninth Annual ACM Symposium on the Theory of Computing (STOC), 2007.

[BDKR05] T. Batu, S. Dasgupta, R. Kumar, and R. Rubinfeld. The complexity of approximating the entropy. SIAM Journal on Computing, 35(1):132–150, 2005.

[BFF+01] T. Batu, L. Fortnow, E. Fischer, R. Kumar, R. Rubinfeld, and P. White. Testing random variables for independence and identity. In Proceedings of the Forty-Second Annual Symposium on Foundations of Computer Science (FOCS), 2001.

[BFR+00] T. Batu, L. Fortnow, R. Rubinfeld, W. D. Smith, and P. White. Testing that distributions are close. In Proceedings of the Forty-First Annual Symposium on Foundations of Computer Science (FOCS), pages 259–269, Los Alamitos, CA, USA, 2000. IEEE Computer Society.

[BFR+10] T. Batu, L. Fortnow, R. Rubinfeld, W. D. Smith, and P. White. Testing closeness of discrete distributions. CoRR, 2010. This is a long version of [BFR+00].

[BKR04] T. Batu, R. Kumar, and R. Rubinfeld. Sublinear algorithms for testing monotone and unimodal distributions. In Proceedings of the Thirty-Sixth Annual ACM Symposium on the Theory of Computing (STOC), 2004.

[CMN98] S. Chaudhuri, R. Motwani, and V. Narasayya. Random sampling for histogram construction: how much is enough? SIGMOD, 1998.

[GGI+02] A. Gilbert, S. Guha, P. Indyk, Y. Kotidis, M. Muthukrishnan, and M. Strauss. Fast, small-space algorithms for approximate histogram maintenance. STOC, 2002.

[GGR98] O. Goldreich, S. Goldwasser, and D. Ron. Property testing and its connection to learning and approximation. Journal of the ACM, 45(4):653–750, 1998.

[GKS06] S. Guha, N. Koudas, and K. Shim. Approximation and streaming algorithms for histogram construction problems. ACM Transactions on Database Systems (TODS), 31(1), 2006.

[GMP97] P. B. Gibbons, Y. Matias, and V. Poosala. Fast incremental maintenance of approximate histograms. VLDB, 1997.

[GR00] O. Goldreich and D. Ron. On testing expansion in bounded-degree graphs. Electronic Colloquium on Computational Complexity, 7(20), 2000.

[Ioa03] Y. Ioannidis. The history of histograms (abridged). VLDB, 2003.

[JPK+98] H. V. Jagadish, V. Poosala, N. Koudas, K. Sevcik, S. Muthukrishnan, and T. Suel. Optimal histograms with quality guarantees. VLDB, 1998.

[Pan08] L. Paninski. Testing for uniformity given very sparsely-sampled discrete data. IEEE Transactions on Information Theory, 54(10), 2008.

[Ron08] D. Ron. Property testing: A learning theory perspective. Foundations and Trends in Machine Learning, 1(3):307–402, 2008.

[RRSS09] S. Raskhodnikova, D. Ron, A. Shpilka, and A. Smith. Strong lower bounds for approximating distribution support size and the distinct elements problem. SIAM Journal on Computing, 39(3):813–842, 2009.

[RS96] R. Rubinfeld and M. Sudan. Robust characterization of polynomials with applications to program testing. SIAM Journal on Computing, 25(2):252–271, 1996.

[Rub06] R. Rubinfeld. Sublinear time algorithms. In Proc. International Congress of Mathematicians, volume 3, 2006.

[TGIK02] Nitin Thaper, Sudipto Guha, Piotr Indyk, and Nick Koudas. Dynamic multidimensional histograms. In SIGMOD Conference, 2002.

[Val08] P. Valiant. Testing symmetric properties of distributions. In Proceedings of the Fortieth Annual ACM Symposium on the Theory of Computing (STOC), 2008.

[VV11] G. Valiant and P. Valiant. Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the Forty-Third Annual ACM Symposium on the Theory of Computing (STOC), 2011. See also the accompanying ECCC technical reports.


Electronic Colloquium on Computational Complexity, Revision 1 of Report No. 171 (2011): Approximating and Testing k-histogram Distributions in Sub-linear Time. Piotr Indyk, Reut Levi, Ronitt Rubinfeld. July 31, 2014.


1 Proof of learning bounds

1 Proof of learning bounds COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #4 Scribe: Akshay Mittal February 13, 2013 1 Proof of learning bounds For intuition of the following theore, suppose there exists a

More information

A Note on Scheduling Tall/Small Multiprocessor Tasks with Unit Processing Time to Minimize Maximum Tardiness

A Note on Scheduling Tall/Small Multiprocessor Tasks with Unit Processing Time to Minimize Maximum Tardiness A Note on Scheduling Tall/Sall Multiprocessor Tasks with Unit Processing Tie to Miniize Maxiu Tardiness Philippe Baptiste and Baruch Schieber IBM T.J. Watson Research Center P.O. Box 218, Yorktown Heights,

More information

arxiv: v1 [cs.ds] 17 Mar 2016

arxiv: v1 [cs.ds] 17 Mar 2016 Tight Bounds for Single-Pass Streaing Coplexity of the Set Cover Proble Sepehr Assadi Sanjeev Khanna Yang Li Abstract arxiv:1603.05715v1 [cs.ds] 17 Mar 2016 We resolve the space coplexity of single-pass

More information

A Theoretical Analysis of a Warm Start Technique

A Theoretical Analysis of a Warm Start Technique A Theoretical Analysis of a War Start Technique Martin A. Zinkevich Yahoo! Labs 701 First Avenue Sunnyvale, CA Abstract Batch gradient descent looks at every data point for every step, which is wasteful

More information

Bipartite subgraphs and the smallest eigenvalue

Bipartite subgraphs and the smallest eigenvalue Bipartite subgraphs and the sallest eigenvalue Noga Alon Benny Sudaov Abstract Two results dealing with the relation between the sallest eigenvalue of a graph and its bipartite subgraphs are obtained.

More information

Convex Programming for Scheduling Unrelated Parallel Machines

Convex Programming for Scheduling Unrelated Parallel Machines Convex Prograing for Scheduling Unrelated Parallel Machines Yossi Azar Air Epstein Abstract We consider the classical proble of scheduling parallel unrelated achines. Each job is to be processed by exactly

More information

e-companion ONLY AVAILABLE IN ELECTRONIC FORM

e-companion ONLY AVAILABLE IN ELECTRONIC FORM OPERATIONS RESEARCH doi 10.1287/opre.1070.0427ec pp. ec1 ec5 e-copanion ONLY AVAILABLE IN ELECTRONIC FORM infors 07 INFORMS Electronic Copanion A Learning Approach for Interactive Marketing to a Custoer

More information

This model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t.

This model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t. CS 493: Algoriths for Massive Data Sets Feb 2, 2002 Local Models, Bloo Filter Scribe: Qin Lv Local Models In global odels, every inverted file entry is copressed with the sae odel. This work wells when

More information

Computable Shell Decomposition Bounds

Computable Shell Decomposition Bounds Coputable Shell Decoposition Bounds John Langford TTI-Chicago jcl@cs.cu.edu David McAllester TTI-Chicago dac@autoreason.co Editor: Leslie Pack Kaelbling and David Cohn Abstract Haussler, Kearns, Seung

More information

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical IEEE TRANSACTIONS ON INFORMATION THEORY Large Alphabet Source Coding using Independent Coponent Analysis Aichai Painsky, Meber, IEEE, Saharon Rosset and Meir Feder, Fellow, IEEE arxiv:67.7v [cs.it] Jul

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE227C (Spring 2018): Convex Optiization and Approxiation Instructor: Moritz Hardt Eail: hardt+ee227c@berkeley.edu Graduate Instructor: Max Sichowitz Eail: sichow+ee227c@berkeley.edu October

More information

Lecture 9 November 23, 2015

Lecture 9 November 23, 2015 CSC244: Discrepancy Theory in Coputer Science Fall 25 Aleksandar Nikolov Lecture 9 Noveber 23, 25 Scribe: Nick Spooner Properties of γ 2 Recall that γ 2 (A) is defined for A R n as follows: γ 2 (A) = in{r(u)

More information

List Scheduling and LPT Oliver Braun (09/05/2017)

List Scheduling and LPT Oliver Braun (09/05/2017) List Scheduling and LPT Oliver Braun (09/05/207) We investigate the classical scheduling proble P ax where a set of n independent jobs has to be processed on 2 parallel and identical processors (achines)

More information

Block designs and statistics

Block designs and statistics Bloc designs and statistics Notes for Math 447 May 3, 2011 The ain paraeters of a bloc design are nuber of varieties v, bloc size, nuber of blocs b. A design is built on a set of v eleents. Each eleent

More information

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search Quantu algoriths (CO 781, Winter 2008) Prof Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search ow we begin to discuss applications of quantu walks to search algoriths

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Coputational and Statistical Learning Theory Proble sets 5 and 6 Due: Noveber th Please send your solutions to learning-subissions@ttic.edu Notations/Definitions Recall the definition of saple based Radeacher

More information

arxiv: v1 [cs.ds] 3 Feb 2014

arxiv: v1 [cs.ds] 3 Feb 2014 arxiv:40.043v [cs.ds] 3 Feb 04 A Bound on the Expected Optiality of Rando Feasible Solutions to Cobinatorial Optiization Probles Evan A. Sultani The Johns Hopins University APL evan@sultani.co http://www.sultani.co/

More information

Improved Guarantees for Agnostic Learning of Disjunctions

Improved Guarantees for Agnostic Learning of Disjunctions Iproved Guarantees for Agnostic Learning of Disjunctions Pranjal Awasthi Carnegie Mellon University pawasthi@cs.cu.edu Avri Blu Carnegie Mellon University avri@cs.cu.edu Or Sheffet Carnegie Mellon University

More information

Tight Information-Theoretic Lower Bounds for Welfare Maximization in Combinatorial Auctions

Tight Information-Theoretic Lower Bounds for Welfare Maximization in Combinatorial Auctions Tight Inforation-Theoretic Lower Bounds for Welfare Maxiization in Cobinatorial Auctions Vahab Mirrokni Jan Vondrák Theory Group, Microsoft Dept of Matheatics Research Princeton University Redond, WA 9805

More information

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon Model Fitting CURM Background Material, Fall 014 Dr. Doreen De Leon 1 Introduction Given a set of data points, we often want to fit a selected odel or type to the data (e.g., we suspect an exponential

More information

LogLog-Beta and More: A New Algorithm for Cardinality Estimation Based on LogLog Counting

LogLog-Beta and More: A New Algorithm for Cardinality Estimation Based on LogLog Counting LogLog-Beta and More: A New Algorith for Cardinality Estiation Based on LogLog Counting Jason Qin, Denys Ki, Yuei Tung The AOLP Core Data Service, AOL, 22000 AOL Way Dulles, VA 20163 E-ail: jasonqin@teaaolco

More information

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs BULLETIN OF THE POLISH ACADEMY OF SCIENCES TECHNICAL SCIENCES Vol. 57, No. 3, 2009 Algoriths for parallel processor scheduling with distinct due windows and unit-tie obs A. JANIAK 1, W.A. JANIAK 2, and

More information

Computable Shell Decomposition Bounds

Computable Shell Decomposition Bounds Journal of Machine Learning Research 5 (2004) 529-547 Subitted 1/03; Revised 8/03; Published 5/04 Coputable Shell Decoposition Bounds John Langford David McAllester Toyota Technology Institute at Chicago

More information

Understanding Machine Learning Solution Manual

Understanding Machine Learning Solution Manual Understanding Machine Learning Solution Manual Written by Alon Gonen Edited by Dana Rubinstein Noveber 17, 2014 2 Gentle Start 1. Given S = ((x i, y i )), define the ultivariate polynoial p S (x) = i []:y

More information

Fixed-to-Variable Length Distribution Matching

Fixed-to-Variable Length Distribution Matching Fixed-to-Variable Length Distribution Matching Rana Ali Ajad and Georg Böcherer Institute for Counications Engineering Technische Universität München, Gerany Eail: raa2463@gail.co,georg.boecherer@tu.de

More information

3.8 Three Types of Convergence

3.8 Three Types of Convergence 3.8 Three Types of Convergence 3.8 Three Types of Convergence 93 Suppose that we are given a sequence functions {f k } k N on a set X and another function f on X. What does it ean for f k to converge to

More information

Stochastic Subgradient Methods

Stochastic Subgradient Methods Stochastic Subgradient Methods Lingjie Weng Yutian Chen Bren School of Inforation and Coputer Science University of California, Irvine {wengl, yutianc}@ics.uci.edu Abstract Stochastic subgradient ethods

More information

Homework 3 Solutions CSE 101 Summer 2017

Homework 3 Solutions CSE 101 Summer 2017 Hoework 3 Solutions CSE 0 Suer 207. Scheduling algoriths The following n = 2 jobs with given processing ties have to be scheduled on = 3 parallel and identical processors with the objective of iniizing

More information

Research Article On the Isolated Vertices and Connectivity in Random Intersection Graphs

Research Article On the Isolated Vertices and Connectivity in Random Intersection Graphs International Cobinatorics Volue 2011, Article ID 872703, 9 pages doi:10.1155/2011/872703 Research Article On the Isolated Vertices and Connectivity in Rando Intersection Graphs Yilun Shang Institute for

More information

Lecture 20 November 7, 2013

Lecture 20 November 7, 2013 CS 229r: Algoriths for Big Data Fall 2013 Prof. Jelani Nelson Lecture 20 Noveber 7, 2013 Scribe: Yun Willia Yu 1 Introduction Today we re going to go through the analysis of atrix copletion. First though,

More information

On Constant Power Water-filling

On Constant Power Water-filling On Constant Power Water-filling Wei Yu and John M. Cioffi Electrical Engineering Departent Stanford University, Stanford, CA94305, U.S.A. eails: {weiyu,cioffi}@stanford.edu Abstract This paper derives

More information

Randomized Recovery for Boolean Compressed Sensing

Randomized Recovery for Boolean Compressed Sensing Randoized Recovery for Boolean Copressed Sensing Mitra Fatei and Martin Vetterli Laboratory of Audiovisual Counication École Polytechnique Fédéral de Lausanne (EPFL) Eail: {itra.fatei, artin.vetterli}@epfl.ch

More information

Necessity of low effective dimension

Necessity of low effective dimension Necessity of low effective diension Art B. Owen Stanford University October 2002, Orig: July 2002 Abstract Practitioners have long noticed that quasi-monte Carlo ethods work very well on functions that

More information

Randomized Accuracy-Aware Program Transformations For Efficient Approximate Computations

Randomized Accuracy-Aware Program Transformations For Efficient Approximate Computations Randoized Accuracy-Aware Progra Transforations For Efficient Approxiate Coputations Zeyuan Allen Zhu Sasa Misailovic Jonathan A. Kelner Martin Rinard MIT CSAIL zeyuan@csail.it.edu isailo@it.edu kelner@it.edu

More information

1 Generalization bounds based on Rademacher complexity

1 Generalization bounds based on Rademacher complexity COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #0 Scribe: Suqi Liu March 07, 08 Last tie we started proving this very general result about how quickly the epirical average converges

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Coputational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 2: PAC Learning and VC Theory I Fro Adversarial Online to Statistical Three reasons to ove fro worst-case deterinistic

More information

Handout 7. and Pr [M(x) = χ L (x) M(x) =? ] = 1.

Handout 7. and Pr [M(x) = χ L (x) M(x) =? ] = 1. Notes on Coplexity Theory Last updated: October, 2005 Jonathan Katz Handout 7 1 More on Randoized Coplexity Classes Reinder: so far we have seen RP,coRP, and BPP. We introduce two ore tie-bounded randoized

More information

Polygonal Designs: Existence and Construction

Polygonal Designs: Existence and Construction Polygonal Designs: Existence and Construction John Hegean Departent of Matheatics, Stanford University, Stanford, CA 9405 Jeff Langford Departent of Matheatics, Drake University, Des Moines, IA 5011 G

More information

Optimal Jamming Over Additive Noise: Vector Source-Channel Case

Optimal Jamming Over Additive Noise: Vector Source-Channel Case Fifty-first Annual Allerton Conference Allerton House, UIUC, Illinois, USA October 2-3, 2013 Optial Jaing Over Additive Noise: Vector Source-Channel Case Erah Akyol and Kenneth Rose Abstract This paper

More information

Exact tensor completion with sum-of-squares

Exact tensor completion with sum-of-squares Proceedings of Machine Learning Research vol 65:1 54, 2017 30th Annual Conference on Learning Theory Exact tensor copletion with su-of-squares Aaron Potechin Institute for Advanced Study, Princeton David

More information

Note on generating all subsets of a finite set with disjoint unions

Note on generating all subsets of a finite set with disjoint unions Note on generating all subsets of a finite set with disjoint unions David Ellis e-ail: dce27@ca.ac.uk Subitted: Dec 2, 2008; Accepted: May 12, 2009; Published: May 20, 2009 Matheatics Subject Classification:

More information

Mixed Robust/Average Submodular Partitioning

Mixed Robust/Average Submodular Partitioning Mixed Robust/Average Subodular Partitioning Kai Wei 1 Rishabh Iyer 1 Shengjie Wang 2 Wenruo Bai 1 Jeff Biles 1 1 Departent of Electrical Engineering, University of Washington 2 Departent of Coputer Science,

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE7C (Spring 018: Convex Optiization and Approxiation Instructor: Moritz Hardt Eail: hardt+ee7c@berkeley.edu Graduate Instructor: Max Sichowitz Eail: sichow+ee7c@berkeley.edu October 15,

More information

Hamming Compressed Sensing

Hamming Compressed Sensing Haing Copressed Sensing Tianyi Zhou, and Dacheng Tao, Meber, IEEE Abstract arxiv:.73v2 [cs.it] Oct 2 Copressed sensing CS and -bit CS cannot directly recover quantized signals and require tie consuing

More information

Intelligent Systems: Reasoning and Recognition. Perceptrons and Support Vector Machines

Intelligent Systems: Reasoning and Recognition. Perceptrons and Support Vector Machines Intelligent Systes: Reasoning and Recognition Jaes L. Crowley osig 1 Winter Seester 2018 Lesson 6 27 February 2018 Outline Perceptrons and Support Vector achines Notation...2 Linear odels...3 Lines, Planes

More information

A note on the multiplication of sparse matrices

A note on the multiplication of sparse matrices Cent. Eur. J. Cop. Sci. 41) 2014 1-11 DOI: 10.2478/s13537-014-0201-x Central European Journal of Coputer Science A note on the ultiplication of sparse atrices Research Article Keivan Borna 12, Sohrab Aboozarkhani

More information

When Short Runs Beat Long Runs

When Short Runs Beat Long Runs When Short Runs Beat Long Runs Sean Luke George Mason University http://www.cs.gu.edu/ sean/ Abstract What will yield the best results: doing one run n generations long or doing runs n/ generations long

More information

Weighted- 1 minimization with multiple weighting sets

Weighted- 1 minimization with multiple weighting sets Weighted- 1 iniization with ultiple weighting sets Hassan Mansour a,b and Özgür Yılaza a Matheatics Departent, University of British Colubia, Vancouver - BC, Canada; b Coputer Science Departent, University

More information

arxiv: v3 [cs.lg] 7 Jan 2016

arxiv: v3 [cs.lg] 7 Jan 2016 Efficient and Parsionious Agnostic Active Learning Tzu-Kuo Huang Alekh Agarwal Daniel J. Hsu tkhuang@icrosoft.co alekha@icrosoft.co djhsu@cs.colubia.edu John Langford Robert E. Schapire jcl@icrosoft.co

More information

Sharp Time Data Tradeoffs for Linear Inverse Problems

Sharp Time Data Tradeoffs for Linear Inverse Problems Sharp Tie Data Tradeoffs for Linear Inverse Probles Saet Oyak Benjain Recht Mahdi Soltanolkotabi January 016 Abstract In this paper we characterize sharp tie-data tradeoffs for optiization probles used

More information

Probability Distributions

Probability Distributions Probability Distributions In Chapter, we ephasized the central role played by probability theory in the solution of pattern recognition probles. We turn now to an exploration of soe particular exaples

More information

The Transactional Nature of Quantum Information

The Transactional Nature of Quantum Information The Transactional Nature of Quantu Inforation Subhash Kak Departent of Coputer Science Oklahoa State University Stillwater, OK 7478 ABSTRACT Inforation, in its counications sense, is a transactional property.

More information

Combining Classifiers

Combining Classifiers Cobining Classifiers Generic ethods of generating and cobining ultiple classifiers Bagging Boosting References: Duda, Hart & Stork, pg 475-480. Hastie, Tibsharini, Friedan, pg 246-256 and Chapter 10. http://www.boosting.org/

More information

Kernel Methods and Support Vector Machines

Kernel Methods and Support Vector Machines Intelligent Systes: Reasoning and Recognition Jaes L. Crowley ENSIAG 2 / osig 1 Second Seester 2012/2013 Lesson 20 2 ay 2013 Kernel ethods and Support Vector achines Contents Kernel Functions...2 Quadratic

More information

Fundamental Limits of Database Alignment

Fundamental Limits of Database Alignment Fundaental Liits of Database Alignent Daniel Cullina Dept of Electrical Engineering Princeton University dcullina@princetonedu Prateek Mittal Dept of Electrical Engineering Princeton University pittal@princetonedu

More information

Support Vector Machine Classification of Uncertain and Imbalanced data using Robust Optimization

Support Vector Machine Classification of Uncertain and Imbalanced data using Robust Optimization Recent Researches in Coputer Science Support Vector Machine Classification of Uncertain and Ibalanced data using Robust Optiization RAGHAV PAT, THEODORE B. TRAFALIS, KASH BARKER School of Industrial Engineering

More information

On Conditions for Linearity of Optimal Estimation

On Conditions for Linearity of Optimal Estimation On Conditions for Linearity of Optial Estiation Erah Akyol, Kuar Viswanatha and Kenneth Rose {eakyol, kuar, rose}@ece.ucsb.edu Departent of Electrical and Coputer Engineering University of California at

More information

1 Identical Parallel Machines

1 Identical Parallel Machines FB3: Matheatik/Inforatik Dr. Syaantak Das Winter 2017/18 Optiizing under Uncertainty Lecture Notes 3: Scheduling to Miniize Makespan In any standard scheduling proble, we are given a set of jobs J = {j

More information

New Slack-Monotonic Schedulability Analysis of Real-Time Tasks on Multiprocessors

New Slack-Monotonic Schedulability Analysis of Real-Time Tasks on Multiprocessors New Slack-Monotonic Schedulability Analysis of Real-Tie Tasks on Multiprocessors Risat Mahud Pathan and Jan Jonsson Chalers University of Technology SE-41 96, Göteborg, Sweden {risat, janjo}@chalers.se

More information

Detection and Estimation Theory

Detection and Estimation Theory ESE 54 Detection and Estiation Theory Joseph A. O Sullivan Sauel C. Sachs Professor Electronic Systes and Signals Research Laboratory Electrical and Systes Engineering Washington University 11 Urbauer

More information

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lesson 1 4 October 2017 Outline Learning and Evaluation for Pattern Recognition Notation...2 1. The Pattern Recognition

More information

Nonmonotonic Networks. a. IRST, I Povo (Trento) Italy, b. Univ. of Trento, Physics Dept., I Povo (Trento) Italy

Nonmonotonic Networks. a. IRST, I Povo (Trento) Italy, b. Univ. of Trento, Physics Dept., I Povo (Trento) Italy Storage Capacity and Dynaics of Nononotonic Networks Bruno Crespi a and Ignazio Lazzizzera b a. IRST, I-38050 Povo (Trento) Italy, b. Univ. of Trento, Physics Dept., I-38050 Povo (Trento) Italy INFN Gruppo

More information

Keywords: Estimator, Bias, Mean-squared error, normality, generalized Pareto distribution

Keywords: Estimator, Bias, Mean-squared error, normality, generalized Pareto distribution Testing approxiate norality of an estiator using the estiated MSE and bias with an application to the shape paraeter of the generalized Pareto distribution J. Martin van Zyl Abstract In this work the norality

More information

Asynchronous Gossip Algorithms for Stochastic Optimization

Asynchronous Gossip Algorithms for Stochastic Optimization Asynchronous Gossip Algoriths for Stochastic Optiization S. Sundhar Ra ECE Dept. University of Illinois Urbana, IL 680 ssrini@illinois.edu A. Nedić IESE Dept. University of Illinois Urbana, IL 680 angelia@illinois.edu

More information

On the Use of A Priori Information for Sparse Signal Approximations

On the Use of A Priori Information for Sparse Signal Approximations ITS TECHNICAL REPORT NO. 3/4 On the Use of A Priori Inforation for Sparse Signal Approxiations Oscar Divorra Escoda, Lorenzo Granai and Pierre Vandergheynst Signal Processing Institute ITS) Ecole Polytechnique

More information

Testing equivalence between distributions using conditional samples

Testing equivalence between distributions using conditional samples Testing equivalence between distributions using conditional samples (Extended Abstract) Clément Canonne Dana Ron Rocco A. Servedio July 5, 2013 Abstract We study a recently introduced framework [7, 8]

More information

New Bounds for Learning Intervals with Implications for Semi-Supervised Learning

New Bounds for Learning Intervals with Implications for Semi-Supervised Learning JMLR: Workshop and Conference Proceedings vol (1) 1 15 New Bounds for Learning Intervals with Iplications for Sei-Supervised Learning David P. Helbold dph@soe.ucsc.edu Departent of Coputer Science, University

More information

Estimating properties of distributions. Ronitt Rubinfeld MIT and Tel Aviv University

Estimating properties of distributions. Ronitt Rubinfeld MIT and Tel Aviv University Estimating properties of distributions Ronitt Rubinfeld MIT and Tel Aviv University Distributions are everywhere What properties do your distributions have? Play the lottery? Is it independent? Is it uniform?

More information

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science A Better Algorith For an Ancient Scheduling Proble David R. Karger Steven J. Phillips Eric Torng Departent of Coputer Science Stanford University Stanford, CA 9435-4 Abstract One of the oldest and siplest

More information

Non-Parametric Non-Line-of-Sight Identification 1

Non-Parametric Non-Line-of-Sight Identification 1 Non-Paraetric Non-Line-of-Sight Identification Sinan Gezici, Hisashi Kobayashi and H. Vincent Poor Departent of Electrical Engineering School of Engineering and Applied Science Princeton University, Princeton,

More information

In this chapter, we consider several graph-theoretic and probabilistic models

In this chapter, we consider several graph-theoretic and probabilistic models THREE ONE GRAPH-THEORETIC AND STATISTICAL MODELS 3.1 INTRODUCTION In this chapter, we consider several graph-theoretic and probabilistic odels for a social network, which we do under different assuptions

More information

1 Bounding the Margin

1 Bounding the Margin COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #12 Scribe: Jian Min Si March 14, 2013 1 Bounding the Margin We are continuing the proof of a bound on the generalization error of AdaBoost

More information

The Weierstrass Approximation Theorem

The Weierstrass Approximation Theorem 36 The Weierstrass Approxiation Theore Recall that the fundaental idea underlying the construction of the real nubers is approxiation by the sipler rational nubers. Firstly, nubers are often deterined

More information

Tail estimates for norms of sums of log-concave random vectors

Tail estimates for norms of sums of log-concave random vectors Tail estiates for nors of sus of log-concave rando vectors Rados law Adaczak Rafa l Lata la Alexander E. Litvak Alain Pajor Nicole Toczak-Jaegerann Abstract We establish new tail estiates for order statistics

More information

A Pass-Efficient Algorithm for Clustering Census Data

A Pass-Efficient Algorithm for Clustering Census Data A Pass-Efficient Algorithm for Clustering Census Data Kevin Chang Yale University Ravi Kannan Yale University Abstract We present a number of streaming algorithms for a basic clustering problem for massive

More information

CSE525: Randomized Algorithms and Probabilistic Analysis May 16, Lecture 13

CSE525: Randomized Algorithms and Probabilistic Analysis May 16, Lecture 13 CSE55: Randoied Algoriths and obabilistic Analysis May 6, Lecture Lecturer: Anna Karlin Scribe: Noah Siegel, Jonathan Shi Rando walks and Markov chains This lecture discusses Markov chains, which capture

More information

Feature Extraction Techniques

Feature Extraction Techniques Feature Extraction Techniques Unsupervised Learning II Feature Extraction Unsupervised ethods can also be used to find features which can be useful for categorization. There are unsupervised ethods that

More information

The Simplex Method is Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate

The Simplex Method is Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate The Siplex Method is Strongly Polynoial for the Markov Decision Proble with a Fixed Discount Rate Yinyu Ye April 20, 2010 Abstract In this note we prove that the classic siplex ethod with the ost-negativereduced-cost

More information

Constant-Space String-Matching. in Sublinear Average Time. (Extended Abstract) Wojciech Rytter z. Warsaw University. and. University of Liverpool

Constant-Space String-Matching. in Sublinear Average Time. (Extended Abstract) Wojciech Rytter z. Warsaw University. and. University of Liverpool Constant-Space String-Matching in Sublinear Average Tie (Extended Abstract) Maxie Crocheore Universite de Marne-la-Vallee Leszek Gasieniec y Max-Planck Institut fur Inforatik Wojciech Rytter z Warsaw University

More information

1 Distribution Property Testing

1 Distribution Property Testing ECE 6980 An Algorithmic and Information-Theoretic Toolbox for Massive Data Instructor: Jayadev Acharya Lecture #10-11 Scribe: JA 27, 29 September, 2016 Please send errors to acharya@cornell.edu 1 Distribution

More information

Prerequisites. We recall: Theorem 2 A subset of a countably innite set is countable.

Prerequisites. We recall: Theorem 2 A subset of a countably innite set is countable. Prerequisites 1 Set Theory We recall the basic facts about countable and uncountable sets, union and intersection of sets and iages and preiages of functions. 1.1 Countable and uncountable sets We can

More information

Birthday Paradox Calculations and Approximation

Birthday Paradox Calculations and Approximation Birthday Paradox Calculations and Approxiation Joshua E. Hill InfoGard Laboratories -March- v. Birthday Proble In the birthday proble, we have a group of n randoly selected people. If we assue that birthdays

More information

Adaptive Stabilization of a Class of Nonlinear Systems With Nonparametric Uncertainty

Adaptive Stabilization of a Class of Nonlinear Systems With Nonparametric Uncertainty IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 46, NO. 11, NOVEMBER 2001 1821 Adaptive Stabilization of a Class of Nonlinear Systes With Nonparaetric Uncertainty Aleander V. Roup and Dennis S. Bernstein

More information

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians Using EM To Estiate A Probablity Density With A Mixture Of Gaussians Aaron A. D Souza adsouza@usc.edu Introduction The proble we are trying to address in this note is siple. Given a set of data points

More information

Using a De-Convolution Window for Operating Modal Analysis

Using a De-Convolution Window for Operating Modal Analysis Using a De-Convolution Window for Operating Modal Analysis Brian Schwarz Vibrant Technology, Inc. Scotts Valley, CA Mark Richardson Vibrant Technology, Inc. Scotts Valley, CA Abstract Operating Modal Analysis

More information

Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence

Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence Best Ar Identification: A Unified Approach to Fixed Budget and Fixed Confidence Victor Gabillon Mohaad Ghavazadeh Alessandro Lazaric INRIA Lille - Nord Europe, Tea SequeL {victor.gabillon,ohaad.ghavazadeh,alessandro.lazaric}@inria.fr

More information

Design of Spatially Coupled LDPC Codes over GF(q) for Windowed Decoding

Design of Spatially Coupled LDPC Codes over GF(q) for Windowed Decoding IEEE TRANSACTIONS ON INFORMATION THEORY (SUBMITTED PAPER) 1 Design of Spatially Coupled LDPC Codes over GF(q) for Windowed Decoding Lai Wei, Student Meber, IEEE, David G. M. Mitchell, Meber, IEEE, Thoas

More information