Collision-based Testers are Optimal for Uniformity and Closeness

Size: px

Start display at page:

Download "Collision-based Testers are Optimal for Uniformity and Closeness"

Jewel Marsh
5 years ago
Views:

1 Electronic Colloquiu on Coputational Coplexity, Report No. 178 (016) Collision-based Testers are Optial for Unifority and Closeness Ilias Diakonikolas Theis Gouleakis John Peebles Eric Price USC MIT MIT UT Austin Noveber 10, 016 Abstract We study the fundaental probles of (i) unifority testing of a discrete distribution, and (ii) closeness testing between two discrete distributions with bounded l -nor. These probles have been extensively studied in distribution testing and saple-optial estiators are known for the [Pan08, CDVV14, VV14, DKN15b]. In this work, we show that the original collision-based testers proposed for these probles [GR00, BFR + 00] are saple-optial, up to constant factors. Previous analyses showed saple coplexity upper bounds for these testers that are optial as a function of the doain size n, but suboptial by polynoial factors in the error paraeter ɛ. Our ain contribution is a new tight analysis establishing that these collision-based testers are inforation-theoretically optial, up to constant factors, both in the dependence on n and in the dependence on ɛ. 1 Introduction 1.1 Background and Our Results The generic inference proble in distribution property testing [BFR + 00, BFR + 13] (also see, e.g., [Rub1, Can15, Gol16b]) is the following: given saple access to one or ore unknown distributions, deterine whether they satisfy soe global property or are far fro satisfying the property. During the past couple of decades, distribution testing whose roots lie in statistical hypothesis testing [NP33, LR05] has developed into a ature field. One of the ost fundaental tasks in this field is deciding whether an unknown discrete distribution is approxiately unifor on its doain, known as the proble of unifority testing. Forally, we want to design an algorith that, given independent saples fro a discrete distribution p over [n] and a paraeter ɛ > 0, distinguishes (with high probability) the case that p is unifor fro the case that p is ɛ-far fro unifor, i.e., the total variation distance between p and the unifor distribution over [n] is at least ɛ. Unifority testing was the very first proble considered in this line of work: Goldreich and Ron [GR00], otivated by the question of testing the expansion of graphs, proposed a siple and natural unifority tester that relies on the collision probability of the unknown distribution. The collision probability of a discrete distribution p is the probability that two saples drawn according to p are equal. The key intuition here is that the unifor distribution has the iniu collision probability aong all distributions on the sae doain, and that any distribution that is ɛ-far Part of this research was perfored when the author was at the University of Edinburgh, and while visiting MIT. Supported in part by a Marie Curie Career Integration grant and an EPSRC grant. This aterial is based upon work supported by the NSF under Grant No This aterial is based upon work supported by the NSF Graduate Research Fellowship under Grant No , and by the NSF under Grant No ISSN

2 fro unifor has noticeably larger collision probability. Foralizing this intuition, Goldreich and Ron [GR00] showed that the collision-based unifority tester succeeds after drawing O(n 1/ /ɛ 4 ) saples fro the unknown distribution. An inforation-theoretic lower bound of Ω(n 1/ ) on the nuber of saples required by any unifority tester follows fro a siple birthday-paradox arguent [GR00, BFF + 01], even for constant values of the paraeter ɛ. In subsequent work, Paninski [Pan08] showed an inforation-theoretic lower bound of Ω(n 1/ /ɛ ), and also provided a atching upper bound of O(n 1/ /ɛ ) that holds under the assuption that ɛ = Ω(n 1/4 ) 1. This lower bound assuption on ɛ is not inherent: As shown in a nuber of recent works [VV14, DKN15b] (see also [ADJ + 1, CDVV14]), a variant of Pearson s χ -tester can test unifority with O(n 1/ /ɛ ) saples for all values of n, ɛ > 0. The chi-squared type testers of [CDVV14, VV14] are siple, but are also arguably slightly less natural than the original collision-based unifority tester [GR00]. Perhaps surprisingly, prior to this work, the saple coplexity of the collision unifority tester was not fully understood. In particular, it was not known whether the saple upper bound of O(n 1/ /ɛ 4 ) established in [GR00] is tight for this tester, or there exists an iproved analysis that can give a better upper bound. As our first ain contribution (Theore 1), we provide a new analysis of the collision unifority tester establishing a tight O(n 1/ /ɛ ) upper bound on its saple coplexity. That is, we show that the originally proposed unifority tester is in fact saple-optial, up to constant factors. A related testing proble of central iportance in the field is the following: Given saples fro two unknown distributions p, q over [n] with the proise that ax{ p, q } b, distinguish between the cases that p q ɛ/ and p q ɛ. That is, we want to test the closeness between two unknown distributions with sall l -nor. (We reark here that the assuption that both p and q have sall l -nor is critical in this context.) The seinal work of Batu et al. [BFR + 00] gave a collision-based tester for this proble that uses O(b /ɛ 4 +b 1/ /ɛ ) saples. Subsequent work by Chan, Diakonikolas, Valiant, and Valiant [CDVV14] gave a different chi-squared type tester that uses O(b 1/ /ɛ ); this saple bound was shown [CDVV14, VV14] to be optial, up to constant factors. Siilarly to the case of unifority testing, prior to this work, it was not known whether the analysis of the collision-based closeness tester in [BFR + 00] is tight. As our second contribution, we show (Theore 8) that (essentially) the collision-based tester of [BFR + 00] succeeds with O(b 1/ /ɛ ) saples, i.e., it is saple-optial, up to constants, for the corresponding proble. Reark. Unifority testing has been a useful algorithic priitive for several other distribution testing probles as well [BFF + 01, DDS + 13, DKN15b, DKN15a, CDGR16, Gol16a]. Notably, Goldreich [Gol16a] recently showed that the ore general proble of testing the identity of any explicitly given distribution can be reduced to unifority testing with only a constant factor loss in saple coplexity. The proble of l closeness testing for distributions with sall l nor has been identified as an iportant algorithic priitive since the original work of Batu et al. [BFR + 00] who exploited it to obtain the first l 1 closeness tester. Recently, Diakonikolas and Kane [DK16] gave a collection of reductions fro various distribution testing probles to the above l closeness testing proble. The approach of [DK16] shows that one can obtain saple-optial testers for a range of different properties of distributions by applying an optial tester for the above proble as a black-box. 1 The unifority tester of [Pan08] relies on the nuber of unique eleents, i.e., the eleents that appear in the saple set exactly once. Such a tester is only eaningful in the regie that the total nuber of saples is saller than the doain size.

3 1. Overview of Analysis We now provide a brief suary of previous analyses and a coparison with our work. The canonical way to construct and analyze distribution property testers roughly works as follows: Given independent saples s 1,..., s fro our distribution(s), we consider an appropriate rando variable (statistic) F (s 1,..., s ). If F (s 1,..., s ) exceeds an appropriately defined threshold T, our tester rejects; otherwise, it accepts. The canonical analysis proceeds by bounding the expectation and variance of F for the case that the distribution(s) satisfy the property (copleteness), and the case they are ɛ-far fro satisfying the property (soundness), followed by an application of Chebyshev s inequality. The ain difficulty is choosing the statistic F appropriately so that the expectations for the copleteness and soundness cases are sufficiently separated after a sall nuber of saples, and at the sae tie the variance of the statistic is not too large. Typically, the challenging step in the analysis is bounding fro above the variance of F in the soundness case. Our analysis follows this standard fraework. Roughly speaking, for both probles we consider, we provide a tighter analysis of the variance of the corresponding estiators, that in turn leads to the optial saple coplexity upper bound. More specifically, for the case of unifority testing, the arguent of [GR00] proceeds by showing that the collision tester yields a (1+γ)-ultiplicative approxiation of the l -nor of the unknown distribution with O(n 1/ /γ ) saples. Setting γ = ɛ gives a unifority testing under the l 1 distance that uses O(n 1/ /ɛ 4 ) saples. We note that the quadratic dependence on 1/γ in the ultiplicative approxiation of the l nor is tight in general. (For an easy exaple, consider the case that our distribution is either unifor over two eleents, or assigns probability ass 1/ γ, 1/ + γ to the eleents.) Roughly speaking, we show that we can do better when the l nor of the distribution in question is sall. More specifically, the collision unifority tester can distinguish between the case that p (1 + γ/)/n and p (1 + γ)/n with O(n1/ /γ) saples. This iediately yields the desired l 1 guarantee. For the closeness testing proble (under our bounded l nor assuption), Batu et al. [BFR + 00] construct a statistic whose expectation is proportional to the square of the l distance between the two distributions p and q. This statistic has three ters whose expectations are proportional to p, q, and p q respectively. Specifically, the first ter is obtained by considering the nuber of self-collisions of a set of saples fro p. Siilarly, the second ter is proportional to the nuber of self-collisions of a set of saples fro q. The third ter is obtained by considering the nuber of cross-collisions between soe saples fro p and q. In order to siplify the analysis, [BFR + 00] uses a separate set of fresh saples for the cross-collisions ter. This set is independent of the set of saples used for the two self-collisions ters. While this choice akes the analysis cleaner, it ends up increasing the variance of the estiator too uch leading to a sub-optial saple upper bound. We show that by reusing saples to calculate the nuber of cross-collisions, one achieves sufficiently good variance to get optial saple coplexity. This coes at the cost of a ore coplicated analysis involving a very careful calculation of the variance. 1.3 Notation We write [n] to denote the set {1,..., n}. We consider discrete distributions over [n], which are functions p : [n] [0, 1] such that n i=1 p i = 1. We use the notation p i to denote the probability of eleent i in distribution p. We will denote by U n the unifor distribution over [n]. For r 1, the l r nor of a distribution is identified with the l r nor of the corresponding vector, i.e., p r = ( n i=1 p i r ) 1/r. The l 1 (resp. l ) distance between distributions p and q is defined as the the l 1 (resp. l ) nor of the vector of their difference, i.e., p q 1 = n i=1 p i q i and p q = n i=1 (p i q i ). 3

4 Testing Unifority via Collisions In this section, we show that the natural collision unifority tester proposed in [GR00] is sapleoptial up to constant factors. More specifically, we are given saples fro a probability distribution p over [n], and we wish to distinguish (with high constant probability) between the cases that p is unifor versus ɛ-far fro unifor in l 1 -distance. The ain result of this section is that the collision-based unifority tester succeeds in this task with = O(n 1/ /ɛ ) saples. In fact, we prove the following stronger l -guarantee for the collisions tester: With = O(n 1/ /ɛ ) saples, it distinguishes between the cases that p U n ɛ /(n) (copleteness) versus p U n ɛ /n (soundness). The desired l 1 guarantee follows fro this l guarantee by an application of the Cauchy-Schwarz inequality in the soundness case. Forally, we analyze the following tester: Algorith Test-Unifority-Collisions(p, n, ɛ) Input: saple access to a distribution p over [n], and ɛ > 0. Output: YES if p U n ɛ /(n); NO if p U n ɛ /n. 1. Draw iid saples fro p.. Let σ ij be an indicator variable which is 1 if saples i and j are the sae and 0 otherwise. 3. Define the rando variable s = i<j σ ij and the threshold t = ( ) 1+3ɛ /4 4. If s t return NO ; otherwise, return YES. n The following theore characterizes the perforance of the above estiator: Theore 1. The above estiator, when given saples drawn fro a distribution p over [n] will, with probability at least 3/4, distinguish the case that p U n ɛ /(n) fro the case that p U n ɛ /n provided that 300n 1/ /ɛ. The rest of this section is devoted to the proof of Theore 1. Note that the condition of the theore is equivalent to testing whether p 1+ɛ / n versus p 1+ɛ n. Our tester takes = 300n1/ saples fro p and distinguishes between the two cases with probability at least 3/4. ɛ.1 Analysis of Test-Unifority-Collisions The analysis proceeds by bounding the expectation and variance of the estiator for the copleteness and soundness cases, and applying Chebyshev s inequality. The novelty here is a tight analysis of the variance which leads to the optial saple bound. We start by recalling the following siple closed forula for the expected value: Lea. We have that E[s] = ( ) p. Proof. For any i, j, the probability that saples i and j are equal is p. By this and linearity of expectation, we get E[s] = E ij σ ij = ij E[σ ij ] = ij p = ( ) p. 4

5 Thus, we see that in the copleteness case the expected value is at ost ( ) 1+ɛ / n. In the soundness case, the expected value is at least ( ) 1+ɛ n. This otivates our choice of the threshold t halfway between these expected values. In order to argue that the statistic will be close to its expected value, we bound its variance fro above and use Chebyshev s inequality. We bound the variance in two steps. First, we obtain the following bound: Lea 3. We have that Var[s] p + 3 ( p 3 3 p 4 ). Proof. The lea follows fro the following chain of (in-)equalities: Var[s] = E[s ] E[s] = E σ ij σ kl i<j = E k<l i<j; k<l all distinct ( )( = ( = σ ij σ kl + ) p 4 + ( ) p 4 i<j<l σ ij σ jl + σ ij σ kj + i<j i,k<j i k ( ) p ( 3 ) p ) ( p p 4 ) + ( 1)( ) ( p 3 3 p 4 ) p + 3 ( p 3 3 p 4 ). σ ij ( ( ) p ) p 4 ( ) p 4 Reark. We note that the upper bound of the previous lea is tight, up to constant factors. The 3 p 4 ter is critical for getting the optial dependence on ɛ in the saple bound. Continuing the analysis, we now derive an upper bound on the nuber of saples that suffices for the tester to have the desired success probability of 3/4. Lea 4. Let α satisfy p = 1+α n and σ be the standard deviation of s. The nuber of saples required by Test-Unifority-Collisions is at ost 5σn α 3ɛ /4, in order to get error probability at ost 1/4. Proof. By Chebyshev s inequality, we have that [ ( ) ] Pr s p kσ 1 k where σ Var[s]. 5

6 We want s to be closer to its expected value than the threshold is to its expected value because when this occurs, the tester outputs the right answer. Furtherore, to achieve our desired probability of error of at ost 1/4, we want this to happen with probability at least 3/4. So, we set k =, and then we want ( ) ( ) kσ E[s] t = p (1 + 3ɛ /4)/n = α 3ɛ /4 /n It suffices for the nuber of saples to satisfy the slightly stronger condition that So, it suffices to have σ α 3ɛ /4. 5n 5σn α 3ɛ /4. We ight as well take the sallest nuber of saples for which the tester works, which iplies the desired inequality. We are now ready to show an upper bound on the nuber of saples in the copleteness case, i.e., when p is the unifor distribution. Lea 5. In the copleteness case, the required nuber of saples is at ost in order to get error probability 1/4. 6n1/ ɛ, Proof. It is easy to see that p = 1/n and p 3 3 = p 4 = 1/n. Thus, by Lea 3, σ /n 1/. Also, we know α = 0 when p is unifor. Substituting these two facts into Lea 4 and solving for gives 6n1/ ɛ. We now turn to the soundness case, where p is far fro unifor. By Lea 4, it suffices to bound fro above the variance σ. We proceed by a case analysis based on whether the ter p or 3 ( p 3 3 p 4 ) contributes ore to the variance..1.1 Case when p is Larger Lea 6. Consider the soundness case, where p = (1 + α)/n for α ɛ. If p contributes ore to the variance, i.e., if p 3 ( p 3 3 p 4 ), then the required nuber of saples is at ost 48n1/ ɛ in order to get error probability 1/4. 6

7 Proof. Suppose that p 3 ( p 3 3 p 4 ). Then σ p = (1+α)/n. Substituting this into Lea 4 and solving for gives that the necessary nuber of saples is at ost 1 + α 8n 1/ (α 3ɛ /4). Using calculus to axiize this expression by varying α, one gets that α = ɛ axiizes the expression. Thus, 1 + ɛ 3n 1/ ɛ 3n 1/ ɛ < 48n1/ ɛ..1. Case when 3 ( p 3 3 p 4 ) is Larger Lea 7. Consider the soundness case, where p = (1 + α)/n for α ɛ. If 3 ( p 3 3 p 4 ) contributes ore to the variance, i.e., if 3 ( p 3 3 p 4 ) p, then the required nuber of saples is at ost in order to get error probability 1/4. 300n1/ ɛ Proof. Suppose that 3 ( p 3 3 p 4 ) p. Then σ 3 ( p 3 3 p 4 ). Substituting this into Lea 4 and solving for gives that the necessary nuber of saples is at ost 50n p 3 3 p 4 (α 3ɛ /4). Let us paraeterize p as p i = 1/n + a i for soe vector a. Then we have a = α/n, and we can write 50n p 3 3 p 4 (α 3ɛ /4) 50n p 3 3 p 4 (α/4) (since ɛ α) 50n p n (α/4) = 50n = 50n = 50n = 50n 50n = 50n [ n i=1 (1/n + a i) 3] 1 n (α/4) [ n n n i=1 a i + 3 n n i=1 a i + n i=1 a3 i (α/4) 3 n n i=1 a i + 3 n 3 n (α/4) n i=1 a i + n i=1 a3 i (α/4) 3 n a + a 3 3 (α/4) 50n n i=1 a i + n i=1 a3 i 3 n a + a 3 (α/4) 3 n (α/n) + (α/n)3/ (α/4) = 400 α + 800n1/ α 400 ɛ + 800n1/ ɛ 300n1/ ɛ. ] 1 n (since n a i = 0) i=1 (since ɛ α) 7

8 Note that, as entioned earlier, if we had ignored the p 4 ter, we would have had an Ω(1/ɛ 4 ) ter in our bound, which would have given us the wrong dependence on ɛ. Theore 1 now follows as an iediate consequence of these last three leas. Reark. It is worth noting that the collisions statistic analyzed in this section is very siilar to the chi-squared-like unifority tester in [DKN15b] itself a siplification of siilar testers in [CDVV14, VV14] which also achieves the optial saple coplexity of O(n 1/ /ɛ ). Specifically, if X i denotes the nuber of ties we see the i-th doain eleent in the saple, the [DKN15b] statistic is i (X i /n) X i = i<j σ ij n i X i + n. We note that the [DKN15b] analysis uses Poissonization; i.e., instead of drawing saples fro the distribution, we draw Poi() saples. Without Poissonization, the aforeentioned statistic siplifies to s n, where s is the collisions statistic. While the non-poissonized versions of the two testers are equivalent, the Poissonized versions are not. Specifically, the Poissonized version of the [DKN15b] unifority tester has sufficiently good variance to yield the saple-optial bound. On the other hand, the Poissonized version of the collisions statistic does not have good variance: Specifically, its variance does not have the p 4 ter which as noted earlier is necessary to get the optial ɛ dependence. 3 Testing Closeness via Collisions Given saples fro two unknown distributions p, q over [n] with the proise that ax{ p, q } b, we want to distinguish between the cases that p q ɛ/ versus p q ɛ. We show that a natural collisions-based tester succeeds in this task with O(b 1/ /ɛ ) saples. The estiator we analyze is a slight variant of the l tester in [BFR + 00], described in pseudocode below. We define the nuber of self-collisions in a sequence of saples fro a distribution as i<j σ ij, where σ ij is the indicator variable denoting whether saples i and j are the sae. Siilarly, we define the nuber of cross-collisions between two sequences of saples as i,j l ij, where l ij is the indicator variable denoting whether saple i fro the first sequence is the sae as saple j fro the second sequence. Algorith Test-Closeness-Collisions(p, q, n, b, ɛ) Input: saple access to distribution p, q over [n], ɛ, b > 0. Output: YES if p q ɛ/; NO if p q ɛ. 1. Draw two ultisets S p, S q of iid saples fro p, q. Let C 1 denote the nuber of self-collisions of S p, C denote the nuber of self-collisions of S q, and C 3 denote the nuber of cross-collisions between S p and S q.. Define the rando variable Z = C 1 + C 1 3. If Z t return NO ; otherwise, return YES. C 3 and the threshold t = ( ) ɛ /. The following theore characterizes the perforance of the above estiator: Theore 8. There exists an absolute constant c such that the above estiator, when given saples drawn fro each of two distributions, p, q over [n] will, with probability at least 3/4, distinguish the case p q ɛ/ fro the case that p q ɛ provided that c b1/, where b is an ɛ upper bound on p, q. 8

9 3.1 Analysis of Test-Closeness-Collisions Let X i, Y i be the nuber of ties we see the eleent i in each set of saples S p and S q, respectively. The above rando variables are distributed as follows: X i Bin(, p i ), Y i Bin(, q i ). Note that the statistic Z can be written as Z = 1 n i=1 [ (Xi Y i ) X i Y i ] + 1 n i=1 [X i (X i 1) + Y i (Y i 1)] = 1 A + 1 B, where A = n i=1 [ (Xi Y i ) X i Y i ] and B = n i=1 [X i(x i 1) + Y i (Y i 1)]. Note that { } ( 1) 1 Var[Z] 4 ax 4 Var[A], 4 Var[B] Note that B essentially corresponds to the nuber of collisions within two disjoint sets of saples, hence we already have an upper bound on its variance. The bulk of the analysis goes into bounding fro above the variance of A = n i=1 A i = n [ i=1 (Xi Y i ) ] X i Y i. Reark. The l collision-based tester we analyze here is closely related to the l -tester of [CDVV14]. Specifically, the A ter in the expression for Z has the sae forula as the l -tester of [CDVV14]. However, a key difference is that the statistic of [CDVV14] is Poissonized, which is crucial for its analysis. We now proceed to analyze the collision-based closeness tester. We start with a siple forula for its expectation: Lea 9. For the expectation of the statistic Z in the closeness tester, we have: ( ) E[Z] = p q. (1) Proof. Viewing p and q as vectors, we have ( E[Z] = E[C 1 + C 1 C 3] = ) (p p) + ( ) (q q) 1 (p q) =. ( ) p q. For the variance, we show the following upper bound: Lea 10. For the variance of the statistic Z in the closeness tester, we have: Var[Z] 116 b p q 4b 1/. To prove this lea, we will use the following proposition, whose proof is deferred to the following subsection. Proposition 11. We have that Var[A] 100 b i (p i q i )(p i q i ). Proof of Lea 10. Recall that by Lea 3 we have Var[B] 4 ( p + q ) ( p 3 3 p 4 + q 3 3 q 4 ). 9

10 Cobined with Proposition 11, we obtain: { } ( 1) 1 Var[Z] 4 ax 4 Var[A], 4 Var[B] ax{100 b (p i q i )(p i qi ), i 4( p + q ) + 4( p 3 3 p 4 + q 3 3 q 4 )}. The second ter in the ax stateent is at ost 16b. Thus, we have Var[Z] 116( 1) b + 8( 1) i (p i q i )(p i q i ) 116 b i (p i q i ) (p i + q i ) 116 b i (p i q i ) 4 i (p i + q i ) (by the Cauchy-Schwarz inequality) 116 b p q 4b 1/ (since i (p i + q i ) 4b). 3. Proof of Theore 8 By Lea 10, we have that Var[Z] 116 b p q 4b 1/ 116 b p q b 1/. We wish to show we can distinguish the copleteness case (i.e., p q ɛ/) fro the soundness case (i.e., p q ɛ). Set α = p q. Then we are proised that either α ɛ or α ɛ /4. Recall we chose t = ( )ɛ and that Lea 9 says that E[Z] = ( ) α. Since E[Z copleteness case] t E[Z soundness case], the only way we fail to distinguish the copleteness and soundness cases is if Z deviates fro its expectation additively by at least ( ) ɛ ( ) ( ) t E[Z] = α ax{α, ɛ }/4, where the last inequality follows by the proise on α in the copleteness and soundness cases. By Chebyshev s inequality, the probability this happens is at ost ( ) Pr[ Z E[Z] ax{α, ɛ }/4 ] Var[Z] [t E[Z]] 116 b αb ) 1/ ax{α, ɛ }/4] 3768 b ɛ 4 + [ ( 4096 b1/ 3768 b 4096 b1/ ɛ 4 + ɛ, { 1 in α, α } ɛ 4 where we siplified using the assuption that. Thus, if we set = O( b1/ ), we get a ɛ constant probability of error in both cases as desired. In the copleteness case where α ε /4 and E[Z] = ( ) ( α, Z has to deviate by at least ) ε /4 ( ) α to ɛ cross the threshold t = ( ) ε /. In the soundness case where α ε, Z has to deviate by at least ( ) α/ ɛ/ to cross the threshold t. 10

11 3.3 Proof of Proposition 11 Recall that A = n i=1 A i = n [ i=1 (Xi Y i ) ] X i Y i, hence Var(A) = n i=1 Var(A i) + i j Cov(A i, A j ). We proceed to bound fro above the individual variances and covariances via a sequence of eleentary but quite tedious calculations Bounding Var(A i ): Since we can write: A i = (X i Y i ) X i Y i = X i + Y i X i Y i X i Y i, Var(A i ) = Var(X i ) + Var(Y i ) + 4Var(X i Y i ) + Var(X i ) + Var(Y i ) + [ Cov(X i, X i Y i ) Cov(X i, X i ) Cov(Y i, X i Y i ) Cov(Y i, Y i ) + Cov(X i Y i, X i ) + Cov(X i Y i, Y i )]. We proceed to calculate the individual quantities: (a) Cov(Xi, X i ) = Cov([σ r = σ s = i], [σ t = i]) r,s,t [] = Cov([σ r = i], [σ r = i]) + r [] r,s [], r s Cov([σ r = σ s = i], [σ r = i]) = p i (1 p i ) + ( )(E[[σ r = σ s = i] [σ r = i]] p i p i ) = p i (1 p i ) + ( )(p i p 3 i ) = p i (1 p i )[1 + p i ( 1)] = p i (1 p i )[1 p i + p i ] = p i (1 p i )(1 p i ) + p i (1 p i ). (b) Cov(X i, X i Y i ) = E[X 3 i Y i ] E[X i ] E[X i Y i ] = Cov(X i, X i ) E[Y i ] = p i q i (1 p i )(1 p i )+ 3 p i q i (1 p i ). (c) Cov(X i, X i Y i ) = Var(X i ) E[Y i ] = p i (1 p i )q i. (d) Var(X i ) =E[X 4 i ] (E[X i ]) =p i (1 7p i + 7p i + 1p i 18p i + 6 p i 6p 3 i + 11p 3 i 6 p 3 i + 3 p 3 i ) (p i p i + p i ) =p i 7p i + 7 p i + 1p 3 i 18 p 3 i p 3 i 6p 4 i + 11 p 4 i 6 3 p 4 i + 4 p 4 i ( p i + p 4 i + 4 p 4 i p 3 i + 3 p 3 i 3 p 4 i ) =p i 7p i + 6 p i + 1p 3 i 16 p 3 i p 3 i 6p 4 i + 10 p 4 i 4 3 p 4 i =p i 7p i + 6 p i + 1p 3 i 6p 4 i 16 p 3 i p 3 i + 10 p 4 i 4 3 p 4 i. 11

12 (e) Var(X i Y i ) = E[X i Y i ] (E[X i Y i ]) = E[X i ]E[Y i ] (E[X i ]E[Y i ]) = (p i p i + p i ) (q i q i + q i ) 4 p i q i = p i q i + p i q i (p i q i + p i q i ) + 3 (p i q i + p i q i ) 3 p i q i. So, we get: Var(A i ) = p i 7p i + 6 p i + 1p 3 i 6p 4 i 16 p 3 i p 3 i + 10 p 4 i 4 3 p 4 i + q i 7q i + 6 q i + 6q 3 i 16 q 3 i q 3 i + 10 q 4 i 4 3 q 4 i + 4( (p i q i + p i q i p i q i p i q i ) + 3 (p i q i + p i q i ) 3 p i q i ) + p i (1 p i ) + q i (1 q i ) 4( p i (1 p i )(1 p i ) + 3 p i (1 p i ))q i (p i (1 p i )(1 p i ) + 4 p i (1 p i )) 4( q i (1 q i )(1 q i ) + 3 q i (1 q i ))p i q i (1 q i )(1 q i ) 4 q i (1 q i ) + 4 p i (1 p i )q i + 4 q i (1 q i )p i = [p i 7p i + 1p 3 i 6p 4 i + q i 7q i + 1q 3 i 6q 4 i + p i p i + q i q i p i (1 p i )(1 p i ) q i (1 q i )(1 q i )] + [ 4p i q i (1 p i )(1 p i ) 4p i q i (1 q i )(1 q i ) + 4p i q i ( p i q i ) + 6p i 16p 3 i + 10p 4 i + 6q i 16q 3 i + 10q 4 i + 4p i q i (1 + p i q i p i q i ) 4p i + 4p 3 i 4q i + 4q 3 i ] + 3 [4p 3 i 4p 4 i + 4q 3 i 4q 4 i + 4p i q i (p i + q i ) 8p i q i 8p 1q i 8p i q i + 8p 3 1q i + 8p i q 3 i ] = [ p i + 8p 3 i 6p 4 i q i + 8q 3 i 6q 4 i ] + [(p i + q i ) 1p 3 i + 10p 4 i + 4p i q i 8p 3 i q i + 4p i q i + 4p i q i 1q 3 i 8p i q 3 i + 10q 4 i )] (p i q i ) [p i (1 p i ) + q i (1 q i )] 8(p 3 i + q 3 i ) + 1 (p i + q i ) (p i q i ) (p i + q i ) 0 (p i + q i ) (p i q i ) (p i + q i ). 3.4 Bounding the Covariances It suffices to show that the covariances of A i and A j, for i j, are appropriately bounded fro above. Let i j. Note that if σ r is the result of saple k, we have: Cov(X i, X j ) = Cov([σ r = i], [σ u = j]) = Cov([σ r = i], [σ r = j]) = p i p j. r,u [] r [] Siilarly, Cov(Xi, X j ) = Cov([σ r = σ s = i], [σ t = j]) r,s,t [] = Cov([σ r = i], [σ r = j]) + r [] r,s [], r s = p i p j ( 1)p i p j. Cov([σ r = σ s = i], [σ r = j]) 1

13 And, Siilarly, Cov(X i, X j ) = = 4 r,s,t,u [] unique r,s,u [] Cov([σ r = σ s = i], [σ t = σ u = j]) unique r,s [] unique r,s [] unique r,t [] r [] Cov([σ r = σ s = i], [σ r = σ u = j]) Cov([σ r = σ s = i], [σ r = σ s = j]) Cov([σ r = σ s = i], [σ r = j]) Cov([σ r = i], [σ r = σ t = j]) Cov([σ r = i], [σ r = j]) = p i p j ( 1)(p i p j + p i p j + p i p j) 4( 1)( )p i p j = p i p j ( 1)(p i p j + p i p j) ( 1)( 3)p i p j. Cov(X i Y i, X j Y j ) = E[X i Y i X j Y j ] E[X i Y i ]E[X j Y j ] Also, = E[X i X j ]E[Y i Y j ] E[X i ]E[Y i ]E[X j ]E[Y j ] = (Cov(X i, X j ) + E[X i ]E[X j ]) (Cov(Y i, Y j ) + E[Y i ]E[Y j ]) E[X i ]E[X j ]E[Y i ]E[Y j ] = ( 3 )p i p j q i q j. Cov(X i Y i, X j ) = E[X i Y i X j ] E[X i Y i ]E[X j ] = E[X i X j ]E[Y i ] E[X i ]E[Y i ]E[X j ] = (Cov(X i, X j ) + E[X i ]E[X j ]) E[Y i ] E[X i ]E[X j ]E[Y i ] = Cov(X i, X j )E[Y i ]. Siilar equations hold if we swap i and j and/or we swap X and Y. Because covariance is bilinear, this gives us all the inforation we need in order to exactly copute Cov(A i, A j ). In particular, by setting W i = X i Y i, we have: Cov(A i, A j ) = Cov(W i X i Y i, W j X j Y j ) For the suands we have: (a) = Cov(X i, X j ) + Cov(Y i, Y j ) + Cov(X i, Y j ) + Cov(X j, Y i ) Cov(Wi, X j ) Cov(Wi, Y j ) Cov(Wj, X i ) Cov(Wj, Y i ) + Cov(Wi, Wj ). Cov(W i, X j ) = Cov((X i Y i ), X j ) = Cov(X i, X j ) Cov(X i Y i, X j ) = p i p j ( 1)p i p j + p i p j q i = p i p j (1 p i ) + p i p j (q i p i ). 13

14 (b) Cov(W i, Y j ) = q i q j (1 q i ) + q i q j (p i q i ). (c) Cov(W j, X i ) = p i p j (1 p j ) + p i p j (q j p j ). (d) Cov(W j, Y i ) = q i q j (1 q j ) + q i q j (p j q j ). (e) Cov(W i, W j ) = Cov(X i, X j ) + Cov(Y i, Y j ) + 4 Cov(X i Y i, X j Y j ) By substituting, we get: Cov(Xi, X j Y j ) Cov(Xj, X i Y i ) Cov(Yi, X j Y j ) Cov(Yj, X i Y i ) = Cov(Xi, Xj ) + Cov(Yi, Yj ) + 4 Cov(X i Y i, X j Y j ) Cov(X i, X j )E[Y j ] Cov(X j, X i )E[Y i ] Cov(Y i, Y j )E[X j ] Cov(Y j, Y i )E[X i ] = p i p j ( 1)(p i p j + p i p j) ( 1)( 3)p i p j q i q j ( 1)(q i q j + q i q j ) ( 1)( 3)q i q j + 4( 3 )p i p j q i q j + p i p j q j + 4 ( 1)p i p j q j + p i p j q i + 4 ( 1)p jp i q i + q i q j p j + 4 ( 1)q i q j p j + q i q j p i + 4 ( 1)q j q i p i. Cov(A i, A j ) = Cov(X i, X j ) + Cov(Y i, Y j ) Cov(W i, X j ) Cov(W i, Y j ) Cov(W j, X i ) Cov(W j, Y i ) + Cov(W i, W j ) = (p i p j + q i q j ) + p i p j (1 p i ) p i p j (q i p i ) + q i q j (1 q i ) q i q j (p i q i ) + p i p j (1 p j ) p i p j (q j p j ) + q i q j (1 q j ) q i q j (p j q j ) + Cov(W i, W j ) = [p i p j (q i + q j ) + q i q j (p i + p j )] + [p i p j (q i + q j ) + q i q j (p i + p j )] ( 1)( 3)p i p j ( 1)( 3)q i q j + 4( 3 )p i p j q i q j + 4 ( 1)(p i q j + p j q i )(p i p j + q i q j ) = 6(p i p j + q i q j ) + [10(p i p j + q i q j ) + 4p i p j q i q j 4(p i q j + p j q i )(p i p j + q i q j )] 3 [4(p i p j + q i q j ) + 8p i p j q i q j 4(p i q j + p j q i )(p i p j + q i q j )] = 6(p i p j + q i q j ) + [5(p i p j + q i q j ) + p i p j q i q j (p i q j + p j q i )(p i p j + q i q j )] 4 3 [(p i p j + q i q j ) (p i q j + p j q i )(p i p j + q i q j )]. 14

15 In suary, Cov(A i, A j ) = 6(p i p j + q i q j ) + [(5p i p j + 5q i q j ) 6p i p j q i q j p i q i (p j q j ) p j q j (p i q i ) ] 4 3 (p i q i )(p j q j )(p i p j + q i q j ). The total contribution of the covariances to the variance for all i j is i j Cov(A i, A j ). We consider the coefficients on each of the powers of separately. We have: [ 3 ] i j Cov(A i, A j ) = 4 i j (p i q i )(p j q j )(p i p j + q i q j ) = 4 i 4 i = 4 i 4 i (p i q i ) (p i + qi ) 4 (p i q i )(p j q j )(p i p j + q i q j ) i,j (p i q i ) (p i + q i ) 4 (p i q i )(p j q j )(p i p j + q i q j ) i,j (p i q i ) (p i + q i ) 4(p q) (pp + qq )(p q) (p i q i ) (p i + q i ). Also, [] i j Cov(A i, A j ) 0. Finally, we have [ ] i j Cov(A i, A j ) = i j [(5p i p j + 5q i q j ) 6p i p j q i q j p i q i (p j q j ) p j q j (p i q i ) ] 10 i j (p i p j + q i q j ) 10 i,j 3.5 Copleting the Proof (p i p j + q i q j ) = 10[p (pp )p + q (qq )q] = 10[(p p)(p p) + (q q)(q q)] = 10 p q 4 0b 0b. Var[A] = n Cov(A i, A j ) Var[A i ] + i=1 i j n ( ) 80 pi + q i (p i q i ) (p i + q i ) i=1 + 0 b i (p i q i ) (p i + q i ) 100 b i (p i q i )(p i q i ). 15

16 References [ADJ + 1] J. Acharya, H. Das, A. Jafarpour, A. Orlitsky, S. Pan, and A. Suresh. Copetitive classification and closeness testing. In COLT, 01. [BFF + 01] [BFR + 00] [BFR + 13] T. Batu, E. Fischer, L. Fortnow, R. Kuar, R. Rubinfeld, and P. White. Testing rando variables for independence and identity. In Proc. 4nd IEEE Syposiu on Foundations of Coputer Science, pages , 001. T. Batu, L. Fortnow, R. Rubinfeld, W. D. Sith, and P. White. Testing that distributions are close. In IEEE Syposiu on Foundations of Coputer Science, pages 59 69, 000. T. Batu, L. Fortnow, R. Rubinfeld, W. D. Sith, and P. White. Testing closeness of discrete distributions. J. ACM, 60(1):4, 013. [Can15] C. L. Canonne. A survey on distribution testing: Your data is big. but is it blue? Electronic Colloquiu on Coputational Coplexity (ECCC), :63, 015. [CDGR16] C. L. Canonne, I. Diakonikolas, T. Gouleakis, and R. Rubinfeld. Testing shape restrictions of discrete distributions. In 33rd Syposiu on Theoretical Aspects of Coputer Science, STACS, pages 5:1 5:14, 016. [CDVV14] S. Chan, I. Diakonikolas, P. Valiant, and G. Valiant. Optial algoriths for testing closeness of discrete distributions. In SODA, pages , 014. [DDS + 13] C. Daskalakis, I. Diakonikolas, R. Servedio, G. Valiant, and P. Valiant. Testing k-odal distributions: Optial algoriths via reductions. In SODA, pages , 013. [DK16] I. Diakonikolas and D. M. Kane. A new approach for testing properties of discrete distributions. CoRR, abs/ , 016. In FOCS 16. [DKN15a] I. Diakonikolas, D. M. Kane, and V. Nikishkin. Optial algoriths and lower bounds for testing closeness of structured distributions. In 56th Annual IEEE Syposiu on Foundations of Coputer Science, FOCS 015, 015. [DKN15b] I. Diakonikolas, D. M. Kane, and V. Nikishkin. Testing Identity of Structured Distributions. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Syposiu on Discrete Algoriths, SODA 015, 015. [Gol16a] O. Goldreich. The unifor distribution is coplete with respect to testing identity to a fixed distribution. Electronic Colloquiu on Coputational Coplexity (ECCC), 3:15, 016. [Gol16b] O. Goldreich. Lecture Notes on Property Testing of Distributions. Available at oded/pdf/pt-dist.pdf, March, 016. [GR00] O. Goldreich and D. Ron. On testing expansion in bounded-degree graphs. Electronic Colloqiu on Coputational Coplexity, 7(0), 000. [LR05] E. L. Lehann and J. P. Roano. Testing statistical hypotheses. Springer Texts in Statistics. Springer,

17 [NP33] [Pan08] J. Neyan and E. S. Pearson. On the proble of the ost efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Matheatical or Physical Character, 31( ):89 337, L. Paninski. A coincidence-based test for unifority given very sparsely-sapled discrete data. IEEE Transactions on Inforation Theory, 54: , 008. [Rub1] R. Rubinfeld. Taing big probability distributions. XRDS, 19(1):4 8, 01. [VV14] G. Valiant and P. Valiant. An autoatic inequality prover and instance optial identity testing. In FOCS, ECCC ISSN

On the Inapproximability of Vertex Cover on k-partite k-uniform Hypergraphs

On the Inapproximability of Vertex Cover on k-partite k-uniform Hypergraphs On the Inapproxiability of Vertex Cover on k-partite k-unifor Hypergraphs Venkatesan Guruswai and Rishi Saket Coputer Science Departent Carnegie Mellon University Pittsburgh, PA 1513. Abstract. Coputing