Testing Properties of Collections of Distributions

Testing Properties of Collections of Distributions Reut Levi Dana Ron Ronitt Rubinfeld April 9, 0 Abstract We propose a fraework for studying property testing of collections of distributions, where the nuber of distributions in the collection is a paraeter of the proble. Previous work on property testing of distributions considered single distributions or pairs of distributions. We suggest two odels that differ in the way the algorith is given access to saples fro the distributions. In one odel the algorith ay ask for a saple fro any distribution of its choice, and in the other the choice of the distribution is rando. Our ain focus is on the basic proble of distinguishing between the case that all the distributions in the collection are the sae or very siilar, and the case that it is necessary to odify the distributions in the collection in a non-negligible anner so as to obtain this property. We give alost tight upper and lower bounds for this testing proble, as well as study an extension to a clusterability property. One of our lower bounds directly iplies a lower bound on testing independence of a joint distribution, a result which was left open by previous work. School of Coputer Science, Tel Aviv University. E-ail: reuti.levi@gail.co. Research supported by the Israel Science Foundation grant nos. 47/09 and 46/08 School of Electrical Engineering, Tel Aviv University. E-ail: danar@eng.tau.ac.il. Research supported by the Israel Science Foundation grant nuber 46/08. CSAIL, MIT, Cabridge MA 039 and the Blavatnik School of Coputer Science, Tel Aviv University. E-ail: ronitt@csail.it.edu. Research supported by NSF grants 073334 and 078645, Marie Curie Reintegration grant PIRG03- GA-008-3077 and the Israel Science Foundation grant nos. 47/09 and 675/09.

Introduction In recent years, several works have investigated the proble of testing various properties of data that is ost naturally thought of as saples of an unknown distribution. More specifically, the goal in testing a specific property is to distinguish the case that the saples coe fro a distribution that has the property fro the case that the saples coe fro a distribution that is far usually in ters of l nor, but other nors have been studied as well fro any distribution that has the property. To give just a few exaples, such tasks include testing whether a distribution is unifor [GR00, Pan08] or siilar to another known distribution [BFR + 0], and testing whether a joint distribution is independent [BFF + 0]. Related tasks concern sublinear estiation of various easures of a distribution, such as its entropy [BDKR05, GMV09] or its support size [RRSS09]. Recently, general techniques have been designed to obtain nearly tight lower bounds on such testing and estiation probles [Val08a, Val08b]. These types of questions have arisen in several disparate areas, including physics [Ma8, SKSB98, NBS04], cryptography and pseudorando nuber generation [Knu69], statistics [Csi67, Har75, WW95, Pan04, Pan08, Pan03], learning theory [Ya95], property testing of graphs and sequences e.g.,[gr00, CS07, KS08, NS07, RRRS07, FM08] and streaing algoriths e.g., [AMS99, FKSV99, FS00, GMV09, CMIM03, CK04, BYJK + 0, IM08, BO0a, BO0b, BO08, IKOS09]. In these works, there has been significant focus on properties of distributions over very large doains, where standard statistical techniques based on learning an approxiation of the distribution ay be very inefficient. In this work we consider the setting in which one receives data which is ost naturally thought of as saples of several distributions, for exaple, when studying purchase patterns in several geographic locations, or the behavior of linguistic data aong varied text sources. Such data could also be generated when saples of the distributions coe fro various sensors that are each part of a large sensor-net. In these exaples, it ay be reasonable to assue that the nuber of such distributions ight be quite large, even on the order of a thousand or ore. However, for the ost part, previous research has considered properties of at ost two distributions [BFR + 00, Val08a]. We propose new odels of property testing that apply to properties of several distributions. We then consider the coplexity of testing properties within these odels, beginning with properties that we view as basic and expect to be useful in constructing building blocks for future work. We focus on quantifying the dependence of the saple coplexities of the testing algoriths in ters of the nuber of distributions that are being considered, as well as the size of the doain of the distributions.. Our Contributions.. The Models We begin by proposing two odels that describe possible access patterns to ultiple distributions D,..., D over the sae doain [n]. In these odels there is no explicit description of the distribution the algorith is only given access to the distributions via saples. In the first odel, referred to as the sapling odel, at each tie step, the algorith receives a pair of the for i, j where i is selected uniforly in [] and j [n] is distributed according to D i. In the second odel, referred to as the query odel, at each tie step, the algorith is allowed to specify i [] and receives j that is distributed according to D i. It is iediate that any algorith in the sapling odel can also be used in the query odel. On the other hand, as is iplied by our results, there are property testing probles which have a significantly larger saple coplexity in the sapling odel than in the query odel. In both odels the task is to distinguish between the case that the tested distributions have the property

and the case that they are ɛ-far fro having the property, for a given distance paraeter ɛ. Distance to the property is easured in ters of the average l -distance between the tested distributions and the closest collection of distributions that have the property. In all of our results, the dependence of the algoriths on the distance paraeter ɛ is inverse polynoial. Hence, for the sake of succinctness, in all that follows we do not ention this dependence explicitly. We note that the sapling odel can be extended to allow the choice of the distribution that is, the index i to be non-unifor i.e., be deterined by a weight w i and the distance easure is adapted accordingly... Testing Equivalence in the sapling odel One of the first properties of distributions studied in the property testing odel is that of deterining whether two distributions over doain [n] are identical alternatively, very close or far according to the l -distance. In [BFR + 0], an algorith is given that uses Õn/3 saples and distinguishes between the case that the two distributions are ɛ-far and the case that they are Oɛ/ n-close. This algorith has been shown to be nearly tight in ters of the dependence on n by Valiant [Val08b]. Valiant also shows that in order to distinguish between the case that the distributions are ɛ-far and the case that they are β-close, for two constants ɛ and β, requires alost linear dependence on n. Our ain focus is on a natural generalization, which we refer to as the equivalence property of distributions D,..., D, in which the goal of the tester is to distinguish the case in which all distributions are the sae or, slightly ore generally, that there is a distribution D for which i= D i D polyɛ/ n, fro the case in which there is no distribution D for which i= D i D ɛ. To solve this proble in the unifor sapling odel with saple coplexity Õn/3 which ensures with high probability that each distribution is sapled Ωn /3 log ties, one can ake calls to the algorith of [BFR + 0] to check that every distribution is close to D. OUR ALGORITHMS. We show that one can get a better saple coplexity dependence on. Specifically, we give two algoriths, one with saple coplexity Õn/3 /3 + and the other with saple coplexity Õn/ / +n. The first result in fact holds for the case that for each saple pair i, j, the distribution D i which generated j is not selected necessarily uniforly, and furtherore, it is unknown according to what weight it is selected. The second result holds for the case where the selection is non-unifor, but the weights are known. Moreover, the second result extends to the case in which it is desired that the tester pass distributions that are close for each eleent, to within a ultiplicative factor of ± ɛ/c for soe constant c >, and for sufficiently large frequencies. Thus, starting fro the known result for =, as long as n, the coplexity grows as Õn/3 /3 + = Õn/3 /3, and once n, the coplexity is Õn / / + n = Õn/ / which is lower than the forer expression when n. Both of our algoriths build on the close relation between testing equivalence and testing independence of a joint distribution over [] [n] which was studied in [BFF + 0]. The Õn/3 /3 + algorith follows fro [BFF + 0] after we fill in a certain gap in the analysis of their algorith due to an iprecision of a clai given in [BFR + 00]. The Õn/ / + n algorith exploits the fact that i is selected uniforly or, ore generally, according to a known weight w i to iprove on the Õn/3 /3 + algorith in the case that n. ALMOST MATCHING LOWER BOUNDS. We show that the behavior of the upper bound on the saple coplexity of the proble is not just an artifact of our algoriths, but rather alost captures the coplexity of the proble. Naely, we give alost atching lower bounds of Ωn /3 /3 for n = Ω log and Ωn / / for every n and. The latter lower bound can be viewed as a generalization of a lower

bound given in [BFR + 0], but the analysis is soewhat ore subtle. Our lower bound of Ωn /3 /3 consists of two parts. The first is a general theore concerning testing syetric properties of collections of distributions. This theore extends a central lea of Valiant [Val08b] on which he builds his lower bounds, and in particular the lower bound of Ωn /3 for testing whether two distributions are identical or far fro each other i.e., the case of equivalence for =. The second part is a construction of two collections of distributions to which the theore is applied where the construction is based on the one proposed in [BFF + 0] for testing independence. As in [Val08b], the lower bound is shown by focusing on the siilarity between the typical collision statistics of a faily of collections of distributions that have the property and a faily of collections of distributions that are far fro having the property. However, since any ore types of collisions are expected to occur in the case of collections of distributions, our proof outline is ore intricate and requires new ways of upper bounding the probabilities of certain types of events...3 Testing Clusterability in the query odel The second property that we consider is a natural generalization of the equivalence property. Naely, we ask whether the distributions can be partitioned into at ost k subsets clusters, such that within in cluster the distance between every two distributions is very sall. We study this property in the query odel, and give an algorith whose coplexity does not depend on the nuber of distributions and for which the dependence on n is Õn/3. The dependence on k is alost linear. The algoriths works by cobining the diaeter clustering algorith of [ADPR03] for points in a general etric space where the algorith has access to the corresponding distance atrix with the closeness of distributions tester of [BFR + 0]. Note that the results of [Val08b] iply that this is tight to within polylogarithic factors in n...4 Iplications of our results As noted previously, in the course of proving the lower bound of Ωn /3 /3 for the equivalence property, we prove a general theore concerning testability of syetric properties of collections of distributions which extends a lea in [Val08b]. This theore ay have applications to proving other lower bounds on collections of distributions. Further byproducts of our research regard the saple coplexity of testing whether a joint distribution is independent, More precisely, the following question is considered in [BFR + 0]: Let Q be a distribution over pairs of eleents drawn fro [] [n] without loss of generality, assue n ; what is the saple coplexity in ters of and n required to distinguish independent joint distributions, fro those that are far fro the nearest independent joint distribution in ter of l distance? The lower bound claied in [BFF + 0], contains a known gap in the proof. Siilar gaps in the lower bounds of [BFR + 0] for testing the closeness of distributions and of [BDKR05] for estiating the entropy of a distribution were settled by the work of [Val08b], which applies to syetric properties. Since independence is not a syetric property, the work of [Val08b] cannot be directly applied here. In this work, we show that the lower bound of Ωn /3 /3 indeed holds. Furtherore, by the aforeentioned correction of the upper bound of Õn/3 /3 fro [BFF + 0], we get nearly tight bounds on the coplexity of testing independence.. Other related work Other works on testing and estiating properties of single or pairs of distributions include [Bat0, GMV09, BKR04, RS04, AAK + 07, RX0, BNNR09, ACS0, AIOR09]. 3

.3 Open Probles and Further Research There are several possible directions for further research on testing properties of collections of distributions, and we next give a few exaples. One natural extension of our results is to give algoriths for testing the property of clusterability for k > in the sapling odel. One ay also consider testing properties of collections of distributions that are defined by certain easures of distributions, and ay be less sensitive to the exact for of the distributions. For exaple, a very basic easure is the ean expected value of the distribution, when we view the doain [n] as integers instead of eleent naes, or when we consider other doains. Given this easure, we ay consider testing whether the distributions all have siilar eans or whether they should be odified significantly so that this holds. It is not hard to verify that this property can be quite easily tested in the query odel by selecting Θ/ɛ distributions uniforly and estiating the ean of each. On the other hand, in the sapling odel an Ω lower bound is quite iediate even for n = and a constant ɛ. We are currently investigating whether the coplexity of this proble in the sapling odel is in fact higher, and it would be interesting to consider other easures as well..4 Organization We start by providing notation and definitions in Section. In Section 3 we give the lower bound of Ωn /3 /3 for testing equivalence in the unifor sapling odel, which is the ain technical contribution of this paper. In Section 4 we give our second lower bound of Ωn / / for testing equivalence and our algoriths for the proble follow in Sections 5 and 6. We conclude with our algorith for testing clusterability in the query odel in Section 7. Preliinaries Let [n] def = {,..., n}, and let D = D,..., D be a list of distributions, where D i : [n] [0, ] and n j= D ij = for every i. For a vector v = v,..., v n R n, let v = n i= v i denote the l nor of the vector v. For a property P of lists of distributions and 0 ɛ, we say that D is ɛ-far fro having P if i= D i D i > ɛ for every list D = D,..., D that has the property P note that D i D i is twice the the statistical distance between the two distributions. Given a distance paraeter ɛ, a testing algorith for a property P should distinguish between the case that D has the property P and the case that it is ɛ-far fro P. We consider two odels within which this task is perfored.. The Query Model. In this odel the testing algorith ay indicate an index i of its choice and it gets a saple j distributed according to D i.. The Sapling Model. In this odel the algorith cannot select query a distribution of its choice. Rather, it ay obtain a pair i, j where i is selected uniforly we refer to this as the Unifor sapling odel and j is distributed according to D i. We also consider a generalization in which there is an underlying weight vector w = w,..., w where i= w i =, and the distribution D i is selected according to w. In this case the notion of ɛ-far needs to be odified accordingly. Naely, we say that D is ɛ-far fro P with respect to w if i= w i D i Di > ɛ for every list D = D,..., D that has the property P. 4

We consider two variants of this non-unifor odel: The Known-Weights sapling odel, in which w is known to the algorith, and the Unknown-Weights sapling odel in which w is known. A ain focus of this work is on the following property. We shall say that a list D = D... D of distributions over [n] belongs to P,n eq or has the property P,n eq if D i = D i for all i, i. 3 A Lower Bound of Ωn /3 /3 for Testing Equivalence in the Unifor Sapling Model when n = Ω log In this section we prove the following theore: Theore Any testing algorith for the property P eq,n in the unifor sapling odel for every ɛ /0 and for n > c log where c is soe sufficiently large constant, requires Ωn /3 /3 saples. The proof of Theore consists of two parts. The first is a general theore Theore concerning testing syetric properties of lists of distributions. This theore extends a lea of Valiant [Val08b, Le. 4.5.4] which leads to what Valiant refers to as the Wishful Thinking Theore. The second part is a construction of two lists of distributions to which Theore is applied. Our analysis uses a technique called Poissonization [Szp0] which was used in the past in the context of lower bounds for testing and estiating properties of distributions in [RRSS09, Val08a, Val08b], and hence we first introduce soe preliinaries concerning Poisson distributions. We later provide soe intuition regarding the benefits of Poissonization. 3. Preliinaries concerning Poisson distributions For a positive real nuber λ, the Poisson distribution poiλ takes the value x N where N = {0,,,...} with probability poix; λ = e λ λ x /x!. The expectation and variance of poiλ are both λ. For λ and λ we shall use the following bound on the l distance between the corresponding Poisson distributions for a proof see for exaple [RRSS09, Clai A.]: poiλ poiλ λ λ. For a vector λ = λ,..., λ d of positive real nubers, the corresponding ultivariate Poisson distribution poi λ is the product distribution poiλ... poiλ d. That is, poi λ assigns each vector x = x..., x d N d the probability d i= poix i; λ i. We shall soeties consider vectors λ whose coordinates are indexed by vectors a = a,..., a N, and will use λ a to denote the coordinate of λ that corresponds to a. Thus, poi λ a is a univariate Poisson distribution. With a slight abuse of notation, for a subset I [d] or I N, we let poi λi denote the ultivariate Poisson distributions restricted to the coordinates of λ in I. For any two d-diensional vectors λ + = λ +,..., λ+ d and λ = λ,..., λ d of positive real values, we get fro the proof of [Val08b, Lea 4.5.3] that, poi λ + poi λ d poiλ + j poiλ j, j= for our purposes we shall use the following generalized lea. 5

Lea For any two d-diensional vectors λ + = λ +,..., λ+ d and λ = λ,..., λ d of positive real values, and for any partition {I i } l i= of [d], poi λ + poi λ l poi λ + I i poi λ I i. i= Proof: Let {I i } l i= be a partition of [d], let i denote i,... i d, by the triangle inequality we have that for every k [l], poi i ; λ + poi i ; λ = poii j ; λ + j poii j ; λ j j [d] j [d] poii j ; λ + j poii j ; λ + j poii j ; λ j j [d] j [d]\i k j I k + poii j ; λ + j poii j ; λ j poii j ; λ j. j [d]\i k j I k Hence, we obtain that poi λ + poi λ = poi i ; λ + poi i ; λ i N d Thus, the lea follows by induction on l. We shall also ake use of the following Lea. j [d] poi λ + I k poi λ I k + poi λ + [d] \ I k poi λ [d] \ I k. Lea For any two d-diensional vectors λ + = λ +,..., λ+ d and λ = λ,..., λ d of positive real values, poi λ + poi λ d λ j λ+ j. Proof: In order to prove the lea we shall use the KL-divergence between distributions. Naely, for two distributions p and p over a doain X, D KL p p def = x X p x ln p x p x. Let λ + = λ +..., λ+ d, λ = λ..., λ d and let i denote i,... i d. We have that j= λ j ln poi i ; λ + poi i ; λ = = d j= ln e λ j λ+ j λ + ij j /λ j d λ j λ+ j + i j lnλ + j /λ j j= d λ j λ+ j + i j λ + j /λ j, j= 6

where in the last inequality we used the fact that ln x x for every x > 0. Therefore, we obtain that D KL poi λ + poi λ = poi i ; λ + ln poi i ; λ+ poi i ; λ i N d d λ j λ+ j + λ+ j λ+ j /λ j j= d λ j = λ+ j, j= λ j where in Equation we used the facts that i N poii; λ = and i N poii; λ i = λ. The l distance is related to the KL-divergence by D D D KL D D and thus we obtain the lea. The next lea bounds the probability that a Poisson rando variable is significantly saller than its expected value. Lea 3 Let X poiλ, then, Pr[X < λ/] < 3/4 λ/4. Proof: Consider the atching between j and j + λ/ for every j = 0,..., λ/. We consider the ratio between poij; λ and poij + λ/; λ: poij + λ/; λ poij; λ = e λ λ j+λ/ /j + λ/! e λ λ j /j! = = > λ λ/ j + λ/j + λ/ j + λ j + λ/ λ j + λ/ λ j + λ λ λ λ λ λ/ λ λ/4 3/4λ = 4/3 λ/4 This iplies that and the proof is copleted. Pr[X < λ/] = < Pr[X < λ/] Pr[λ/ X < λ] Pr[λ/ X < λ] Pr[X < λ/] Pr[λ/ X < λ] < 3/4 λ/4, 7

The next two notations will play an iportant technical role in our analysis. For a list of distributions D = D... D, an integer κ and a vector a = a,..., a N, let p D,κ j; a def = poia i ; κ D i j. 3 i= That is, for a fixed choice of a doain eleent j [n], consider perforing independent trials, one for each distribution D i, where in trial i we select a non-negative integer according to the Poisson distribution poiλ for λ = κ D i j. Then p D,κ j; a is the probability of the joint event that we get an outcoe of a i in trial i, for each i []. Let λ D,κ be a vector whose coordinates are indexed by all a N, such that λ D,κ a = n p D,κ j; a. 4 j= That is, λ D,κ a is the expected nuber of ties we get the joint outcoe a,..., a if we perfor the probabilistic process defined above independently for every j [n]. 3. Testability of syetric properties of lists of distributions In this subsection we prove the following theore which is used to prove Theore. Theore Let D + and D be two lists of distributions over [n], all of whose frequencies are at ost where κ is soe positive integer and 0 < δ <. If δ κ poi λ D +,κ poi λ D,κ < 6 30 35δ 5, 5 then testing in the unifor sapling odel any syetric property of distributions such that D + has the property, while D is Ω-far fro having the property requires Ωκ saples. A HIGH-LEVEL DISCUSSION OF THE PROOF OF THEOREM. For an eleent j [n] and a distribution D i, i [], let α i,j be the nuber of ties the pair i, j appears in the saple when the saple is selected according to soe sapling odel. Thus α,j,..., α,j is the saple histogra of the eleent j. The histogra of the eleents histogras is called the fingerprint of the saple. That is, the fingerprint indicates, for every a N, the nuber of eleents j such that α,j,..., α,j = a. As shown in [BFR + 0], when testing syetric properties of distributions, it can be assued without loss of generality that the testing algorith is provided only with the fingerprint of the saple. Furtherore, since the nuber, n, of eleents is fixed, it suffices to give the tester the fingerprint of the saple without the 0 = 0,..., 0 entry. For exaple, consider the distributions D and D over {,, 3} such that D [j] = /3 for every j {,, 3}, D [] = D [] = / and D [3] = 0. Assue that we saple D, D four ties, according to the unifor sapling odel and we get the saples,,,,,,, 3, where the first coordinate denotes the distribution, and the second coordinate denotes the eleent. Then the saple histogra of eleent is, because was selected once by D and once by D. For the eleents, 3 we have the saple histogras 0, and, 0, respectively. The fingerprint of the saple is 0,,, 0,, 0, 0,... for the following order of histogras: 0, 0, 0,,, 0,, 0,, 0,, 3, 0,.... In order to prove Theore, we would like to show that the distributions of the fingerprints when the saple is generated according to D + and when it is generated according to D are siilar, for a saple size 8

that is below the lower bound stated in the theore. For each choice of eleent j [n] and a distribution D i, the nuber of ties the saple i, j appears, i.e. α i,j, depends on the nuber of ties the other saples appear siply because the total nuber of saples is fixed. Furtherore, for each histogra a, the nuber of eleents with saple histogra identical to a is dependent on the nuber of ties the other histogras appear, because the nuber of saples is fixed. For instance, in the exaple above, if we know that we have the histogra 0, once and the histogra, once, then we know that third histogra cannot be, 0. In addition, it is dependent because the nuber of eleents is fixed. We thus see that the distribution of the fingerprints is rather difficult to analyze and therefore it is difficult to bound the statistical distance between two different such distributions. Therefore, we would like to break as uch of the above dependencies. To this end we define a slightly different process for generating the saples that involves Poissonization [Szp0]. In the Poissonized process the nuber of saples we take fro each distribution D i, denoted by κ i, is distributed according to the Poisson distribution. We prove that, while the overall nuber of saples the Poissonized process takes is bigger just by a constant factor fro the unifor process, we get with very high probability that κ i > κ i, for every i, where κ i is the nuber of saples taken fro D i. This iplies that if we prove a lower bound for algoriths that receive saples generated by the Poissonized process, then we obtain a related lower bound for algoriths that work in the unifor sapling odel. As opposed to the process that takes a fixed nuber of saples according to the unifor sapling odel, the benefit of the Poissonized process is that the α i,j s deterined by this process are independent. Therefore, the type of saple histogra that eleent j has is copletely independent of the types of saple histogras the other eleents have. We get that the fingerprint distribution is a generalized ultinoial distribution, which has been studied by Roos [Roo99] the connection is due to Valiant [Val08a]. Definition In the Poissonized unifor sapling odel with paraeter κ which we ll refer to as the κ-poissonized odel, given a list D = D,..., D of distributions, a saple is generated as follows: Draw κ,..., κ poiκ Return κ i saples distributed according to D i for each i []. Lea 4 Assue there exists a tester T in the unifor sapling odel for a property P of lists of distributions, that takes a saple of size s = κ where κ c for soe sufficiently large constant c, and works for every ɛ ɛ 0 where ɛ 0 is a constant and whose success probability is at least /3. Then there exists a tester T for P in the Poissonized unifor sapling odel with paraeter 4κ, that works for every ɛ ɛ 0 and whose success probability is at least 9 30. Proof: Roughly speaking, the tester T tries to siulate T if it has a sufficiently large saple, and otherwise it guesses the answer. More precisely, consider a tester T that receives κ saples where κ poi4κ. By Lea 3 we have that, Pr [ κ < κ ] 3/4 κ. If κ κ then T siulates T on the first κ saples that it got. Otherwise it outputs accept or reject with equal probability. The probability that κ κ is at least 3/4 κ, which is greater than 4 5 for κ > c and a sufficiently large constant c. Therefore, the success probability of T is at least 4 5 3 + 5 = 9 30, as desired. Given Lea 4 it suffices to consider saples that are generated in the Poissonized unifor sapling odel. The process for generating a saple {α,j,..., α,j } j [n] recall that α i,j is the nuber of ties 9

that eleent j was selected by distribution D i in the κ-poissonized odel is equivalent to the following process: For each i [] and j [n], independently select α i,j according to poiκ D i j see [Fel67, p. 6]. Thus the probability of getting a particular histogra a j = a,j,..., a,j for eleent j is p D,κ j; a j as defined in Equation 3. We can represent the event that the histogra of eleent j is a j by a Bernoulli rando vector b j that is indexed by all a N, is in the coordinate corresponding to a j, and is 0 elsewhere. Given this representation, the fingerprint of the saple corresponds to n j= b j. In fact, we would like b j to be of finite diension, so we have to consider only a finite nuber sufficiently large of possible histogras. Under this relaxation, b j = 0,..., 0 would correspond to the case that the saple histogra of eleent j is not in the set of histogras we consider. Roos s theore, stated next, shows that the distribution of the fingerprints can be approxiated by a ultivariate Poisson distribution the Poisson here is related to the fact that the fingerprints distributions are generalized ultinoial distributions and not related to the Poisson fro the Poissonization process. For siplicity, the theore is stated for vectors bj that are indexed directly, that is b j = b j,,..., b j,h. Theore 3 [Roo99] Let D Sn be the distribution of the su S n of n independent Bernoulli rando vectors b,..., [ ] [ ] b n in R h where Pr bj = e l = p j,l and Pr bj = 0,..., 0 = h l= p j,l here e l satisfies e j,l = and e j,l = 0 for every l l. Suppose we define an h-diensional vector λ = λ,..., λ h as follows: λ l = n j= p j,l. Then D Sn poi λ 88 5 h n j= p j,l l= n j= p j,l. 6 We next show how to obtain a bound on sus of the for given in Equation 6 under appropriate conditions. Lea 5 Given a list D = D,..., D of distributions over [n] and a real nuber 0 < δ / such that for all i [] and for all j [n], D i j for soe integer κ, we have that a N \ 0 δ κ n j= pd,κ j; a n j= pd,κ j; a δ. 7 Proof: n j= pd,κ j; a n j= pd,κ j; a a N \ 0 = a N \ 0 a N \ 0 a N \ 0 ax p D j; a j ax j δ δ a a= poia i ; κ D i j i= a +...+a a δ, 8 0

where the inequality in Equation 8 holds for δ / and the inequality in Equation 8 follows fro: and the proof is copleted. e κ Dij κ D i j a poia; κ D i j = a! κ D i j a δ a, Proof of Theore : By the first preise of the theore, D i + j, D+ i j δ κ for every i [] and j [n]. By Lea 5 this iplies that Equation 7 holds both for D = D + and for D = D. Cobining this with Theore 3 we get that the l distance between the fingerprint distribution when the saple is generated according to D + in the κ-poissonized odel, see Definition and the distribution poi λ D +,κ is at ost 88 76 5 δ = 5 δ, and an analogous stateent holds for D. By applying the preise in Equation 5 concerning the l distance between poi λ D +,κ and poi λ D,κ and the triangle inequality, we get that the l distance between the two fingerprint distributions is saller than 76 5 δ + 6 30 35δ 5 = 6 30, which iplies that the statistical difference is saller than 8 30, and thus it is not possible to distinguish between D+ and D in the κ-poissonized odel with success probability at least 9 30. By Lea 4 we get the desired result. 3.3 Proof of Theore In this subsection we show how to apply Theore to two lists of distributions, D + and D, which we will define shortly, where D + P eq = P,n eq while D is /0-far fro P eq. Recall that by the preise of Theore, n c log for soe sufficiently large constant c >. In the proof it will be convenient to assue that is even and that n which corresponds in the lea to t is divisible by 4. It is not hard to verify that it is possible to reduce the general case to this case. In order to define D, we shall need the next lea. Lea 6 For every two even integers and t, there exists a 0/-valued atrix M with rows and t coluns for which the following holds:. In each row and each colun of M, exactly half of the eleents are and the other half are 0.. For every integer x < /, and for every subset S [] of size x, the nuber of coluns j such that M[i, j] = for every i S is at least t x x x ln t, and at ost t. x + x ln t Proof: Consider selecting a atrix M randoly as follows: Denote the first t/ coluns of M by F. For each colun in F, pick, independently fro the other t/ coluns in F, a rando half of its eleents to be, and the other half of the eleents to be 0. Coluns t/ +,..., t are the negations of rows,..., t/, respectively. Thus, in each row and each colun of M, exactly half of the eleents are and the other half are 0.

Consider a fixed choice of x. For each colun j between and t, each subset of coluns S [] of size x, and b {0, }, define the indicator rando variable I S,j,b to be if and only if M[i, j] = b for every i S. Hence, Pr[I S,j,b = ] =... x. Clearly, Pr[I S,j,b = ] < x. On the other hand, Pr[I S,j,b = ] x x = x x x x x. where the last inequality is due to Bernoulli s inequality which states that + x n > + nx, for every real nuber x > 0 and an integer n > [MV70]. Let E S,b denote the expected value of t/ j= I S,j,b. Fro the fact that coluns t/ +,..., t are the negations of coluns,..., t/ it follows that t j=t/+ I S,j, = t/ j= I S,j,0. Therefore, the expected nuber of coluns j t such that M[i, j] = for every i S is siply E S, + E S,0 that is, at ost t and at least t x x x. By the additive Chernoff bound, [ t/ ] tx ln Pr I S,j,b E S,b > j= < exp t/x ln /t = x. Thus, by taking a union bound over b {0, }, [ t Pr I S,j, E S, + E S,0 > ] tx ln j= < 4 x. By taking a union bound over all subsets S we get that M has the desired properties with probability greater than 0. We first define D +, in which all distributions are identical. Specifically, for each i []: if j n/3 /3 D i + def n j = /3 /3 n if n < j n 0 o.w. 9 We now turn to defining D. Let M be a atrix as in Lea 6 for t = n/. For every i []: if j n/3 /3 n /3 /3 D i j def = n if n < j n and M[i, j n/] = 0 o.w. 0

For both D + and D, we refer to the eleents j n/3 /3 as the heavy eleents, and to the eleents n j n, as the light eleents. Observe that each heavy eleent has exactly the sae probability weight,, in all distributions D + n /3 /3 i and Di. On the other hand, for each light eleent i, while D+ i j = n for every i, in D we have that D i + j = n for half of the distributions, the distributions selected by the M, and D i + j = 0 for half of the distributions, the distributions which are not selected by M. We later use the properties of M to bound the l distance between the fingerprints distributions of D + and D. A HIGH-LEVEL DISCUSSION. To gain soe intuition before delving into the detailed proof, consider first the special case that = which was studied by Valiant [Val08a], and indeed the construction is the sae as the one he analyzes and was initially proposed in [BFR + 00]. In this case each heavy eleent has probability weight Θ/n /3 and we would like to establish a lower bound of Ωn /3 on the nuber of saples required to distinguish between D + and D. That is, we would like to show that the corresponding fingerprints distributions when the saple is of size on /3 are very siilar. The first ain observation is that since the probability weight of light eleents is Θ/n in both D + and D, the probability that a light eleent will appear ore than twice in a saple of size on /3 is very sall. That is using the fingerprints of histogras notation we introduced previously, for each a = a, a such that a + a >, the saple will not include with high probability any light eleent j such that α,j = a and α,j = a for both D + and D. Moreover, for every x {, }, the expected nuber of eleents j such that α,j, α,j = x, 0 is the sae in D + and D, as well as the variance fro syetry, the sae applies to 0, x. Thus, ost of the difference between the fingerprints distributions is due to the nubers of eleents j such that α,j, α,j =,. For this setting we do expect to see a non-negligble difference for light eleents between D + and D in particular, we cannot get the, histogra for light eleents in D, as opposed to D +. Here is where the heavy eleents coe into play. Recall that in both D + and D the heavy eleents have the sae probability weight, so that the expected nuber of heavy eleents i such that a,j, a,j =, is the sae for D + and D. However, intuitively, the variance of these nubers for the heavy eleents swaps the differences between the light eleents so that it is not possible to distinguish between D + and D. The actual proof, which foralizes and quantifies this intuition, considers the difference between the values of the vectors λ D+,k and λ D,k as defined in Equation 4 in the coordinates corresponding to a such that a + a =. We can then apply Leas and to obtain Equation 5 in Theore. Turning to >, it is no longer true that in a saple of size on /3 /3 we will not get histogra vectors a such that i= a i > for light eleents. Thus we have to deal with any ore vectors a of diension and to bound the total contribution of all of the to the difference between fingerprints of D + and of D. To this end we partition the set of all possible histogras vectors into several subsets according to their Haing weight i= a i and depending on whether all a is are in {0, }, or there exists a least one a i such that a i. In particular, to deal with the forer whose nuber, for each choice of Haing weight x is relatively large, i.e., roughly x, we use the properties of the atrix M based on which D is defined. We note that fro the analysis we see that, siilarly to when =, we need the variance of the heavy eleents to play a role just for the cases where i= a i = while in the other cases the total contribution of the light eleents is rather sall. In the reainder of this section we provide the details of the analysis. Before establishing that indeed D is Ω-far fro P eq, we introduce soe ore notation which will be used throughout the reainder of the proof of Theore. Let S x be the set of vectors that contain exactly x coordinates that are, and all the rest are 0 which corresponds to an eleent that was sapled once or 0 ties by each distribution. Let A x be the set of vector that their coordinates su up to x but ust 3

contain at least one coordinate that is which corresponds to an eleent that was saples at least twice by at least one distribution. More forally, for any integer x, we define the following two subsets of N : { def S x = a N : i= a } i = x and, i [], a i < and A x def = { a N : i= a i = x and i [], a i For a N, let sup a def = {i : a i 0} denote the support of a, and let I M a def = {j : D i j = n } i sup a. Note that in ters of the atrix M based on which D is defined, I M a consists of the coluns in M whose restriction to the support of a contains only s. In ters of the D, it corresponds to the set of light eleents that ight have a saple histogra of a when sapling according to D. Lea 7 For every > 5 and for n c ln for soe sufficiently large c, we have that i= D i D > /0 for every distribution D over [n]. That is, the list D is /0-far fro P eq. Proof: Consider any a S. By Lea 6, setting t = n/, the size of I M a, i.e. the nuber of light eleents l such that Di [l] = n for every i sup a, is at ost n 4 + 8 ln n. The sae lower bound holds for the nuber of light eleents l such that Di [l] = 0 for every i sup a. This iplies that for every i i in [], for at least n n 4 + 8 ln n of the light eleents, l, we have that Di [l] = n while D i [l] = 0, or that D i [l] = n while D i [l] = 0. Therefore, D i D i 8 ln n, which for n c ln and a sufficiently large constant c, is at least 8. Thus, by the triangle inequality we have that for every D, i= D i D 8, which greater than /0 for > 5. In what follows we work towards establishing that Equation 5 in Theore holds for D + and D. Set κ = δ n/3, where δ is a constant to be deterined later. We shall use the shorthand λ + for λ D+,κ, and λ /3 for λ D,κ recall that the notation λ D,κ was introduced in Equation 4. By the definition of λ +, for each a N, n λ + κ D i + a = ja i e κ D+ i j a i! = j= i= n /3 /3 / j= = n/3 /3 e δ i= δ/ a i e δ/ a i! δ/ a i i= a i! + + n j=n/+ i= n e δ/n/3 i= } δ/n /3 /3 a i e δ/n/3 /3 a i! δ/n /3 /3 a i. a i! By the construction of M, for every light j, i= D i j = n = n. Therefore, λ a = n/3 /3 δ/ a i δ/n /3 /3 a i e δ + a i! a i! i= e δ/n/3 j I M a i= 4.

Hence, λ + a and λ a differ only on the ter which corresponds to the contribution of the light eleents. Equations and deonstrate why we choose M with the specific properties defined in Lea 6. First of all, in order for every Di to be a probability distribution, we want each row of M to su up to exactly n/. We also want each colun of M to su up to exactly /, in order to get i= e κ D+ i j = i= e κ D i j. Finally, we would have liked I M a i= a i to equal n/ for every a. This would iply that λ + a and λ a are equal. As we show below, this is in fact true for every a S. For vectors a S x where x >, the second condition in Lea 6 ensures that I M a is sufficiently close to n x. This property of M is not necessary in order to bound the contribution of the vectors in A x. The bound that we give for those vectors is less tight, but since there are fewer such vectors, it suffices. We start by considering the contribution to Equation 5 of histogra vectors a S i.e., vectors of the for 0,..., 0,, 0,..., 0 which correspond to the nuber of eleents that are sapled only by one distribution, once. We prove that in the Poissonized unifor sapling odel, for every a S the nuber of eleents with such saple histogra is distributed exactly the sae in D + and D. Lea 8 poi λ + a poi λ a = 0. a S Proof: For every a S, the size of I M a is n 4, thus, j I M a i= δ/n /3 /3 a i = n a i! δ/n /3 /3 a i. a i! By Equations and, it follows that λ + a λ a = 0 for every a S. The lea follows by applying Equation. We now turn to bounding the contribution to Equation 5 of histogra vectors a A i.e., vectors of the for 0,..., 0,, 0,..., 0 which correspond to the nuber of eleents that are sapled only by one distribution, twice. i= Lea 9 poi λ + A poi λ A 3δ. Proof: For every a A, the size of I M a is n 4, thus, δ/n /3 /3 a i j I M a i= a i! By Equations, and it follows that δ/n /3 /3 a i = n a i! i=. λ a λ + a = = n δ/n /3 /3 a i a i! e δ/n/3 i= n /3 δ, 3 4e δ/n/3 4/3 5

and that By Equations 3 and 4 we have that λ a n/3 /3 e δ δ/ a i i= a i! = n/3 δ 4e δ. 4 5/3 λ a λ + a λ a By Equation 5 and the fact that A = we get eδ δ/n/3 δ 4 δ. 5 λ a λ a + a A λ a δ = δ The lea follows by applying Lea. Recall that for a subset I of N, poi λi denotes the ultivariate Poisson distributions restricted to the coordinates of λ that are indexed by the vectors in I. We separately deal with S x where x < /, and x /, where our ain efforts are with respect to the forer, as the latter correspond to very low probability events. Lea 0 For 6, n c ln where c is a sufficiently large constant and for δ /6 poi λ / + x= S x poi λ / x= S x 3δ. Proof: Let a be a vector in S x then by the definition of S x, every coordinate of a is 0 or. Therefore we ake the following siplification of Equation : For each a / x= S x, λ + a = n/3 /3 δ x n e δ + e δ x δ/n/3 n /3 /3. By Lea 6, for every a / x= S x the size of I M a is at ost n n x x x 4x ln n. By Equation this iplies that λ a = n/3 /3 δ x e δ + x + n e δ/n/3 x + η δ x n /3 /3, 4x ln n and at least 6

x where x + 4x ln 4x ln n η n and thus η x x x + 4 ln n. By the facts x that n c ln for soe sufficiently large constant c, and that x for every x < / and 6, we obtain that η x. So we have that λ + a λ a n e δ x x δ/n/3 n /3 /3 n 4δ x 4 n /3 4/3 x, and that Then we get, for δ /, that λ a n/3 /3 δ x e δ, λ + a λ a λ a eδ n 4/3 /3 4δ x n /3 /3 x n4/3 /3 4δ x n /3 /3 x x n4/3 4/3 4x /x δ n /3 /3 n4/3 4/3 8δ n /3 /3 x. Suing over all a / x= S x we get: a S / x= S x λ a λ + a λ a = x= n 4/3 4/3 64δ x=0 x 8δ /3 n /3 8δ /3 n /3 x 6 64δ 8δ 8δ 7 where in Equation 6 we used the fact that n >, and Equation 7 holds for δ /6. The lea follows by applying Lea. Lea For n, and δ /4, poi λ + a poi λ a 3δ 3. x / a S x 7

Proof: We first observe that S x x /x for every x 6. To see why this is true, observe that S x equals the nuber of possibilities of arranging x balls in bins, i.e., + x S x = x + xx x! x x! = x x! x x x x, where we have used the preise that and thus x 6. By Equations and and the fact that x y ax{x, y} for every positive real nubers x,y, x / a S x λ + a λ a n x / a S x i= = n a S x = n x / x=/ x=/ x x n x x=/ = 8δ 3 x=/ 3 δ n /3 /3 δ n /3 /3 n δ n /3 /3 δ n /3 /3 x δ /3 n /3 x δ /3 n /3 ai P i= a i x x 8δ 3 δ 8 6δ 3 9 where in Equation 8 we used the fact that n and Equation 9 holds for δ /4. The lea follows by applying Equation. We finally turn to the contribution of a A x such that x 3. Lea For n and δ /4, x 3 a A x poi λ + a poi λ a 6δ 3. Proof: We first observe that A x x for every x. To see why this is true, observe that A x equals the nuber of possibilities of arranging x balls, where one ball is a special double ball in bins. 8

By Equations and and the fact that x y ax{x, y} for every positive real nubers x,y, x 3 a A x λ + a λ a n δ n /3 /3 x 3 a A x i= = n δ n /3 /3 x 3 a A x x n δ n /3 /3 x=3 x = n δ /3 x=3 = 4δ 3 x=0 n /3 x δ /3 n /3 ai P i= a i x 4δ 3 δ 0 8δ 3 where in Equation 0 we used the fact that n and Equation holds for δ /4. The lea follows by applying Equation. We are now ready to finalize the proof of Theore. Proof of Theore : Let D + and D be as defined in Equations 9 and 0, respectively, and recall that κ = δ n/3 where δ will be set subsequently. By the definition of the distributions in D + and D, /3 the probability weight assigned to each eleent is at ost = δ n /3 /3 κ, as required by Theore. By Lea 7, D is /0-far fro P eq. Therefore, it reains to establish that Equation 5 holds for D + and D. Consider the following partition of N : / { a} a S, A, S x, { a} S a Sx, { a} x / a S x 3 Ax, x= where { a} a T denotes the list of all singletons of eleents in T. By Lea it follows that poi λ + poi λ poi λ + a poi λ a a S + poi λ + A poi λ A x= / / + poi λ + S x poi λ S x + x / a S x + x 3 9 a A x x= poi λ + a poi λ a poi λ + a poi λ a.

For δ < /6 we get by Leas 8 that poi λ + poi λ 35δ + 48δ 3, which is less than 6 30 35δ 5 for δ = /00. 3.4 A lower bound for testing Independence Corollary 4 Given a joint distribution Q over [] [n] ipossible to test if Q is independent or /48-far fro independent using on /3 /3 saples. Proof: Follows directly fro Lea 5 and Theore. 4 A Lower Bound of Ωn / / for Testing Equivalence in the Unifor Sapling Model In this section we prove the following theore: Theore 5 Testing the property P eq,n in the unifor sapling odel for every ɛ / and 64 requires Ωn / / saples. We assue without loss of generality that n is even or else, we set the probability weight of the eleent n to 0 in all distributions considered, and work with n that is even. Define H n to be the set of all distributions over [n] that have probability n on exactly half of the eleents and 0 on the other half. Define Hn to be the set of all possible lists of distributions fro H n. Define Un to consist of only a single list of distributions each of which is identical to U n, where U n denotes the unifor distribution over [n]. Thus the single list in Un belongs to P,n. eq On the other hand we show, in Lea 3, that Hn contains ostly lists of distributions that are Ω-far fro P,n. eq However, we also show, in Lea 4, that any tester in the unifor sapling odel that takes less than n / / /6 saples cannot distinguish between D that was uniforly drawn fro Hn and D = U n,..., U n Un. Details follow. Lea 3 For every 3, with probability at least over the choice of D Hn we have that D is /-far fro P,n. eq Proof: We need to prove that with probability at least over the choice of D Hn, for every v = v,..., v n R n which corresponds to a distribution i.e., v j 0 for every j [n] and n j= v j =, D i v >. i= We shall actually prove a slightly ore general stateent. Naely, that Equation holds for every vector v R n. We define the function, ed D : [n] [0, ], such that ed D j = µ D j,..., D j, where µ x,..., x denotes the edian of x,..., x where if is even, it is the value in position 0

in sorted non-decreasing order. The su i= x i c is iniized when c = µ x,..., x. Therefore, for every D and every vector v R n, Di ed D i= D i v. 3 i= Recall that for every D = D,..., D in Hn, and for each i, j [] [n], we have that either D i j = n, or D ij = 0. Thus, ed D j = 0 when D i j = 0 for at least half of the i s in [] and ed D j = n otherwise. We next show that for every i, j [] [n], the probability over D H n that D i j will have the sae value as ed D j is at ost a bit bigger than half. More precisely, we show that for every i, j [] [n]: [ Pr D H n Di j ed D j ]. 4 Fix i, j [] [n], and consider selecting D uniforly at rando fro Hn. Suppose we first deterine the values D i j for i i, and set D i j in the end. For each i, j the probability that D i j = 0 is /, and the probability that D i j = n is /. If ore than / of the outcoes are 0, or ore than / are n, then the value of edd j is already deterined. Conditioned on this we have that the probability that D i j ed D j is exactly /. On the other hand, if at ost / are 0 and at ost / are n that is, for odd there are / that are 0 and / that are n, and for even there are / of one kind and / of the other then necessarily ed D j = D i j. We thus bound the probability of this event. First consider the case that is odd so that is even. [ Pr Bin, = ] = =!!!. 5 By Stirling s approxiation,! = π e e λ, where λ is a paraeter that satisfies + < λ <, thus,!!! < π e e π/ / e / e /+ 6 = e 6+ π/ 7 < π/ 8, 9 where Inequalities 8 and 9 hold for 3. In case is even, the probability over the choice of D i j for i i that ed D j is deterined by D i j is Pr [ Bin, ] [ ] = + Pr Bin, =.

Hence, Equation 4 holds for all and we obtain the following bound on the expectation [ ] Di ed D n [ = E D H Di n j ed D j ] 30 E D H n i= while the axiu value is bounded as Di ed D = i= i= j= [ = n Pr D H n Dj i ed D i ] n n n 3 3 =, 33 i= j= Assue for the sake of contradiction that [ ] Di ed D / Pr D H n i= then by Equation 36 we have, [ ] Di ed D E D H n i= n j= n Di j ed D j 34 n 35 =. 36 < >, 37 + 38 =, 39 which contradicts Equation 33. Recall that for an eleent j [n] and a distribution D i, i [], we let a i,j denote the nuber of ties the pair i, j appears in the saple when the saple is selected in the unifor sapling odel. Thus a,j,..., a,j is the saple histogra of the eleent j. Since the saple points are selected independently, a saple is siply the union of the histogras of the different eleents, or equivalently, a atrix M in N n. Lea 4 Let U be the distribution of the histogra of q saples taken fro the unifor distribution over [] [n], and let H be the distribution of the histogra of q saples taken fro a rando list of distributions in H n. Then, U H 4q n. 40 Proof: For every atrix M N n, let A M be the event of getting the histogra M i.e. M[i, j] = x if eleent j is chosen exactly x ties fro distribution D i in the saple; For every x = x,..., x N, let B x be the event of getting a histogra M with exactly x i saples fro distribution D i i.e., such that