Lecture 9: March 26, 2014
COMS 6998-3: Sub-Linear Algorithms in Learning and Testing (Spring 2014)
Lecturer: Rocco Servedio
Lecture 9: March 26, 2014
Scribe: Keith Nichols

1 Overview

1.1 Last time

- Finished the analysis of the $O(n/\epsilon)$-query algorithm for monotonicity.
- Showed an $\Omega(\sqrt{n})$ lower bound for one-sided non-adaptive monotonicity testers.
- Stated and proved (one direction of) Yao's Principle: Suppose there exists a distribution $\mathcal{D}$ over functions $f\colon \{-1,1\}^n \to \{-1,1\}$ (the inputs to the property testing problem) such that any $q$-query deterministic algorithm gives the right answer with probability at most $c$. Then, given any $q$-query non-adaptive randomized testing algorithm $A$, there exists some function $f_A$ such that
$$\Pr[\, A \text{ outputs the correct answer on } f_A \,] \le c.$$

1.2 Today: lower bound for two-sided non-adaptive monotonicity testers

We will use Yao's Principle to show the following lower bound:

Theorem 1 (Chen–Servedio–Tan '14). Any 2-sided non-adaptive property tester for monotonicity, to $\epsilon_0$-test, needs $\Omega(n^{1/5})$ queries (where $\epsilon_0 > 0$ is an absolute constant).

2 The $\Omega(n^{1/5})$ lower bound: proving Theorem 1

2.1 Preliminaries

Recall the definition of the total variation distance between two distributions over the same set $\Omega$:
$$d_{\mathrm{TV}}(\mathcal{D}_1, \mathcal{D}_2) = \frac{1}{2} \sum_{x \in \Omega} \left| \mathcal{D}_1(x) - \mathcal{D}_2(x) \right|.$$
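To make this quantity concrete, here is a small illustrative sketch (mine, not part of the original notes) that computes $d_{\mathrm{TV}}$ between two distributions given as probability dictionaries; the two coin distributions are hypothetical examples.

```python
from fractions import Fraction

def total_variation(d1, d2):
    """d_TV(D1, D2) = (1/2) * sum_x |D1(x) - D2(x)| over a common support."""
    support = set(d1) | set(d2)
    return sum(abs(d1.get(x, 0) - d2.get(x, 0)) for x in support) / 2

# Example: a fair coin vs. a 60/40 coin.
fair = {"H": Fraction(1, 2), "T": Fraction(1, 2)}
biased = {"H": Fraction(3, 5), "T": Fraction(2, 5)}
print(total_variation(fair, biased))  # 1/10
```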
As a homework problem from last time, we have the lemma below, which relates the probability of distinguishing between samples from two distributions to their total variation distance:

Lemma 2 (HW problem). Let $\mathcal{D}_1, \mathcal{D}_2$ be two distributions over some set $\Omega$, and let $A$ be any algorithm (possibly randomized) that takes $x \in \Omega$ as input and outputs "Yes" or "No". Then
$$\left| \Pr_{x \sim \mathcal{D}_1}[\, A(x) = \text{Yes} \,] - \Pr_{x \sim \mathcal{D}_2}[\, A(x) = \text{Yes} \,] \right| \le d_{\mathrm{TV}}(\mathcal{D}_1, \mathcal{D}_2),$$
where the probabilities are also taken over the possible randomness of $A$. (This is sometimes referred to as a data processing inequality for the total variation distance.)

To apply this lemma, recall that given a deterministic algorithm's set of queries $Q = \{z^{(1)}, \ldots, z^{(q)}\} \subseteq \{-1,1\}^n$, a distribution $\mathcal{D}$ over Boolean functions induces a distribution $\mathcal{D}^Q$ over $\{-1,1\}^q$: a draw $x \sim \mathcal{D}^Q$ is obtained by drawing $f \sim \mathcal{D}$ and outputting $(f(z^{(1)}), \ldots, f(z^{(q)})) \in \{-1,1\}^q$.

With this observation and Yao's Principle in hand, we can state and prove a key tool for proving lower bounds in property testing:

Lemma 3 (Key Tool). Fix any property $\mathcal{P}$ (a set of Boolean functions). Let $\mathcal{D}_{\mathrm{Yes}}$ be a distribution over Boolean functions that belong to $\mathcal{P}$, and $\mathcal{D}_{\mathrm{No}}$ be a distribution over Boolean functions that all have $\mathrm{dist}(f, \mathcal{P}) > \epsilon$. Suppose that for all $q$-query sets $Q$, one has $d_{\mathrm{TV}}\left(\mathcal{D}_{\mathrm{Yes}}^Q, \mathcal{D}_{\mathrm{No}}^Q\right) \le \frac{1}{4}$. Then any (2-sided) non-adaptive $\epsilon$-tester for $\mathcal{P}$ must use at least $q+1$ queries.

Proof. Let $\mathcal{D}$ be the mixture $\mathcal{D} = \frac{1}{2}\mathcal{D}_{\mathrm{Yes}} + \frac{1}{2}\mathcal{D}_{\mathrm{No}}$ (that is, a draw from $\mathcal{D}$ is obtained by tossing a fair coin, and returning accordingly a sample drawn either from $\mathcal{D}_{\mathrm{Yes}}$ or $\mathcal{D}_{\mathrm{No}}$). Fix a $q$-query deterministic algorithm $A$, and let
$$p_Y = \Pr_{f \sim \mathcal{D}_{\mathrm{Yes}}}[\, A \text{ accepts on } f \,], \qquad p_N = \Pr_{f \sim \mathcal{D}_{\mathrm{No}}}[\, A \text{ accepts on } f \,].$$
That is, $p_Y$ is the probability that a random Yes-function is accepted, while $p_N$ is the probability that a random No-function is accepted. Via the assumption and the previous lemma, $|p_Y - p_N| \le \frac{1}{4}$.
However, this means that $A$ cannot be a successful tester, since
$$\Pr_{f \sim \mathcal{D}}[\, A \text{ gives the wrong answer} \,] = \frac{1}{2}(1 - p_Y) + \frac{1}{2} p_N = \frac{1}{2} - \frac{1}{2}(p_Y - p_N) \ge \frac{1}{2} - \frac{1}{8} = \frac{3}{8} > \frac{1}{3}.$$
So Yao's Principle tells us that any randomized non-adaptive $q$-query algorithm is wrong on some $f$ in the support of $\mathcal{D}$ with probability at least $\frac{3}{8}$; but a legitimate tester can only be wrong on any such $f$ with probability at most $\frac{1}{3}$. ∎

Exercise 4 (Generalization of Lemma 3; HW Problem). Relax the previous lemma slightly: prove that the conclusion still holds even under the weaker assumptions
$$\Pr_{f \sim \mathcal{D}_{\mathrm{Yes}}}[\, f \in \mathcal{P} \,] \ge \frac{99}{100}, \qquad \Pr_{f \sim \mathcal{D}_{\mathrm{No}}}[\, \mathrm{dist}(f, \mathcal{P}) > \epsilon \,] \ge \frac{99}{100}.$$

For our lower bound, we need to come up with $\mathcal{D}_{\mathrm{Yes}}$ (resp. $\mathcal{D}_{\mathrm{No}}$) over monotone functions (resp. functions $\epsilon_0$-far from monotone) such that for all $Q \subseteq \{-1,1\}^n$ with $|Q| = q$,
$$d_{\mathrm{TV}}\left( \mathcal{D}_{\mathrm{Yes}}^Q, \mathcal{D}_{\mathrm{No}}^Q \right) \le \frac{1}{4}.$$
At a high level, we need to argue that both distributions "look the same." One may thus think of the Central Limit Theorem: the sum of many independent, "nice" real-valued random variables converges to a Gaussian in distribution (i.e., in cumulative distribution function). For instance, a binomial distribution $\mathrm{Bin}\left(10^6, \frac{1}{2}\right)$ has the same shape ("bell curve") as the corresponding Gaussian distribution $\mathcal{N}\left(\frac{10^6}{2}, \frac{10^6}{4}\right)$. For our purpose, however, the convergence guarantees stated by the Central Limit Theorem will not be enough, as they do not give explicit bounds on the rate of convergence; we will use a quantitative version of the CLT, the Berry–Esséen Theorem. First, recall the definition of a (real-valued) Gaussian random variable:

Definition 5 (One-dimensional Gaussian distribution). A real-valued random variable is said to be Gaussian with mean $\mu$ and variance $\sigma^2$ if it follows the distribution $\mathcal{N}(\mu, \sigma^2)$, which has probability density function
$$f_{\mu,\sigma}(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad x \in \mathbb{R}.$$
Such a random variable indeed has expectation $\mu$ and variance $\sigma^2$; furthermore, the distribution is fully specified by these two parameters.
[Figure 1: the standard Gaussian $\mathcal{N}(0,1)$: (a) its cumulative distribution function (CDF) $F_{0,1}(x)$; (b) its probability density function (PDF) $f_{0,1}(x)$.]

Extending to higher dimensions, one can similarly define a $d$-dimensional Gaussian random variable:

Definition 6 ($d$-dimensional Gaussian distribution). Fix a vector $\mu \in \mathbb{R}^d$ and a symmetric non-negative definite matrix $\Sigma \in \mathbb{R}^{d \times d}$. A random variable taking values in $\mathbb{R}^d$ is said to be Gaussian with mean $\mu$ and covariance $\Sigma$ if it follows the distribution $\mathcal{N}(\mu, \Sigma)$, which has probability density function
$$f_{\mu,\Sigma}(x) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma}} \, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}, \qquad x \in \mathbb{R}^d.$$
As in the univariate case, $\mu$ and $\Sigma$ uniquely define the distribution; further, one has that for $X \sim \mathcal{N}(\mu, \Sigma)$,
$$\Sigma_{i,j} = \mathrm{Cov}(X_i, X_j) = \mathbb{E}[(X_i - \mathbb{E}X_i)(X_j - \mathbb{E}X_j)], \qquad i, j \in [d].$$

Theorem 7 (Berry–Esséen). Let $S = X_1 + \cdots + X_n$ be the sum of $n$ independent (real-valued) random variables $X_1, \ldots, X_n$ satisfying
$$\Pr[\, |X_i - \mathbb{E}[X_i]| \le \tau \,] = 1,$$
that is, every $X_i$ is almost surely bounded. For $i \in [n]$, define $\mu_i = \mathbb{E}[X_i]$ and $\sigma_i^2 = \mathrm{Var}\, X_i$, so that $\mathbb{E}S = \sum_{i=1}^n \mu_i$ and $\mathrm{Var}\, S = \sum_{i=1}^n \sigma_i^2$ (the last equality by independence). Finally, let $G$ be a $\mathcal{N}\left(\sum_{i=1}^n \mu_i, \sum_{i=1}^n \sigma_i^2\right)$ Gaussian random variable, matching the first two moments of $S$. Then, for all $\theta \in \mathbb{R}$,
$$\left| \Pr[\, S \le \theta \,] - \Pr[\, G \le \theta \,] \right| \le \frac{O(\tau)}{\sqrt{\sum_{i=1}^n \sigma_i^2}}.$$
(There exist other versions of this theorem, with weaker assumptions or phrased in terms of the third moments of the $X_i$'s; we only state here one tailored to our needs.) In other terms, letting $F_S$ (resp. $F_G$) denote the CDF of $S$ (resp. $G$), one has
$$\| F_S - F_G \|_\infty \le \frac{O(\tau)}{\sqrt{\sum_{i=1}^n \sigma_i^2}}.$$
(This quantity $\|F_S - F_G\|_\infty$ is also referred to as the Kolmogorov distance between $S$ and $G$.)

Remark. The constant hidden in the $O(\cdot)$ notation is actually very reasonable; one can take it to be equal to $1$.
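As a quick numerical sanity check (mine, not part of the notes), the sketch below estimates the Kolmogorov distance between a sum of independent uniform $\pm 1$ variables and the matching Gaussian $\mathcal{N}(0, n)$; the $O(1/\sqrt{n})$ decay is visible as $n$ grows.

```python
import math
import random

def gaussian_cdf(x, mu, var):
    """CDF of N(mu, var), via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / math.sqrt(2 * var)))

def kolmogorov_distance(n, trials=20000):
    """Estimate sup_theta |Pr[S <= theta] - Pr[G <= theta]| for S a sum of
    n independent uniform +/-1 variables and G ~ N(0, n)."""
    samples = sorted(sum(random.choice((-1, 1)) for _ in range(n))
                     for _ in range(trials))
    worst = 0.0
    for k, s in enumerate(samples):
        g = gaussian_cdf(s, 0, n)
        # compare against the empirical CDF just before and at the sample point
        worst = max(worst, abs(k / trials - g), abs((k + 1) / trials - g))
    return worst

for n in (25, 100, 400):
    print(n, round(kolmogorov_distance(n), 3))  # roughly halves as n quadruples
```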
Application: baby step towards the lower bound. Fix any string $z \in \{-1,1\}^n$, and for $i \in [n]$ let the (independent) random variables $\gamma_i$ be defined as
$$\gamma_i = \begin{cases} +1 & \text{w.p. } \frac{1}{2} \\ -1 & \text{w.p. } \frac{1}{2} \end{cases}$$
Letting $X_i = \gamma_i z_i$, we have $\mu_i = \mathbb{E}X_i = 0$ and $\sigma_i^2 = \mathrm{Var}\, X_i = 1$, and we can take $\tau = 1$ to apply the Berry–Esséen theorem to $X = X_1 + \cdots + X_n$. This allows us to conclude that
$$\forall \theta \in \mathbb{R}, \quad \left| \Pr[\, X \le \theta \,] - \Pr[\, G \le \theta \,] \right| \le \frac{O(1)}{\sqrt{n}}$$
for $G \sim \mathcal{N}(0, n)$. Now, consider a slightly different distribution than that of the $\gamma_i$'s: for the same $z \in \{-1,1\}^n$, define the independent random variables $\nu_i$ by
$$\nu_i = \begin{cases} -3 & \text{w.p. } \frac{1}{10} \\ +\frac{1}{3} & \text{w.p. } \frac{9}{10} \end{cases}$$
and let $Y_i = \nu_i z_i$ for $i \in [n]$, and $Y = Y_1 + \cdots + Y_n$. By our choice of parameters,
$$\mathbb{E}Y_i = \left( \tfrac{1}{10} \cdot (-3) + \tfrac{9}{10} \cdot \tfrac{1}{3} \right) z_i = 0 = \mathbb{E}X_i,$$
$$\mathrm{Var}\, Y_i = \mathbb{E}\left[ Y_i^2 \right] = \tfrac{1}{10} \cdot 9 + \tfrac{9}{10} \cdot \tfrac{1}{9} = 1 = \mathrm{Var}\, X_i.$$
So $\mathbb{E}[Y] = \mathbb{E}[X] = 0$ and $\mathrm{Var}\, Y = \mathrm{Var}\, X = n$; by the Berry–Esséen theorem (with $\tau$ set to $3$, and $G$ as before),
$$\forall \theta \in \mathbb{R}, \quad \left| \Pr[\, Y \le \theta \,] - \Pr[\, G \le \theta \,] \right| \le \frac{O(1)}{\sqrt{n}},$$
and by the triangle inequality
$$\forall \theta \in \mathbb{R}, \quad \left| \Pr[\, X \le \theta \,] - \Pr[\, Y \le \theta \,] \right| \le \frac{O(1)}{\sqrt{n}}. \tag{1}$$
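The moment matching above can also be checked by simulation. Here is an illustrative sketch (mine, not from the notes) that samples both sums, taking $z$ to be the all-ones string for concreteness, and compares $\Pr[X \le \theta]$ with $\Pr[Y \le \theta]$ at a few thresholds.

```python
import random

def sample_X(n):
    """Sum of n i.i.d. uniform +/-1 variables (X_i = gamma_i z_i is uniform
    over {-1, +1} for any fixed z_i in {-1, +1})."""
    return sum(random.choice((-1, 1)) for _ in range(n))

def sample_Y(n):
    """Sum of n i.i.d. nu_i (z = all-ones): -3 w.p. 1/10, +1/3 w.p. 9/10;
    mean 0 and variance 1, matching X_i."""
    return sum(-3 if random.random() < 0.1 else 1/3 for _ in range(n))

n, trials = 200, 10000
xs = [sample_X(n) for _ in range(trials)]
ys = [sample_Y(n) for _ in range(trials)]
for theta in (-10, 0, 10):
    px = sum(x <= theta for x in xs) / trials
    py = sum(y <= theta for y in ys) / trials
    print(theta, round(px, 3), round(py, 3))  # agree up to ~O(1)/sqrt(n)
```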
We can now define $\mathcal{D}_{\mathrm{Yes}}$ and $\mathcal{D}_{\mathrm{No}}$ based on this (that is, based respectively on a random draw of $\gamma, \nu \in \mathbb{R}^n$ distributed as above): a function $f_\gamma \sim \mathcal{D}_{\mathrm{Yes}}$ is given by
$$\forall z \in \{-1,1\}^n, \quad f_\gamma(z) = \mathrm{sign}(\gamma_1 z_1 + \cdots + \gamma_n z_n),$$
and similarly for $f_\nu \sim \mathcal{D}_{\mathrm{No}}$:
$$\forall z \in \{-1,1\}^n, \quad f_\nu(z) = \mathrm{sign}(\nu_1 z_1 + \cdots + \nu_n z_n).$$
With the notations above, $X \le 0$ if and only if $f_\gamma(z) = -1$, and $Y \le 0$ if and only if $f_\nu(z) = -1$. This implies that for any fixed single query $z$,
$$d_{\mathrm{TV}}\left( \mathcal{D}_{\mathrm{Yes}}^{\{z\}}, \mathcal{D}_{\mathrm{No}}^{\{z\}} \right) = \frac{1}{2}\left( \left| \Pr[\, X \le 0 \,] - \Pr[\, Y \le 0 \,] \right| + \left| \Pr[\, X > 0 \,] - \Pr[\, Y > 0 \,] \right| \right) \le \frac{O(1)}{\sqrt{n}}.$$
This almost looks like what we were aiming at; so why aren't we done? There are two problems with what we did above:

1. This only deals with the case $q = 1$; that is, it would provide a lower bound only against one-query algorithms.
   Fix: we will use a multidimensional version of the Berry–Esséen Theorem, for sums of $q$-dimensional independent random variables (converging to a multidimensional Gaussian).

2. $f_\gamma, f_\nu$ are not monotone (indeed, both the $\gamma_i$'s and the $\nu_i$'s can be negative).
   Fix: shift everything by $2$:
   - $\gamma_i \in \{1, 3\}$: $f_\gamma$ is monotone;
   - $\nu_i \in \{-1, \frac{7}{3}\}$: $f_\nu$ will be far from monotone with high probability (we will show this).

A small simulation of the single-query bound appears below.
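For a single query $z$, the two induced distributions live on the two-element set $\{-1,1\}$, so their total variation distance is just the difference of the acceptance probabilities. The following illustrative sketch (mine, not the notes') estimates it by sampling, using the unshifted weights above.

```python
import random

def f_sign(weights, z):
    """Evaluate the LTF f(z) = sign(<weights, z>), with sign(0) taken as -1."""
    return 1 if sum(w * zi for w, zi in zip(weights, z)) > 0 else -1

n, trials = 200, 20000
z = [random.choice((-1, 1)) for _ in range(n)]   # an arbitrary fixed query

def pr_plus_one(weight_sampler):
    """Estimate Pr_f[f(z) = 1] when f's weights are drawn i.i.d. from weight_sampler."""
    hits = 0
    for _ in range(trials):
        weights = [weight_sampler() for _ in range(n)]
        hits += f_sign(weights, z) == 1
    return hits / trials

p_yes = pr_plus_one(lambda: random.choice((-1, 1)))                # gamma_i
p_no = pr_plus_one(lambda: -3 if random.random() < 0.1 else 1/3)   # nu_i
# On a two-element outcome space, d_TV = |p_yes - p_no|; expect O(1)/sqrt(n).
print(abs(p_yes - p_no))
```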
2.2 The lower bound construction

Up until this point, everything has been a warmup; we are now ready to go into more detail.

$\mathcal{D}_{\mathrm{Yes}}$ and $\mathcal{D}_{\mathrm{No}}$. As mentioned in the previous section, we need to (re)define the distributions $\mathcal{D}_{\mathrm{Yes}}$ and $\mathcal{D}_{\mathrm{No}}$ (that is, of $\gamma$ and $\nu$) to solve the second issue.

$\mathcal{D}_{\mathrm{Yes}}$: Draw $f \sim \mathcal{D}_{\mathrm{Yes}}$ by independently drawing, for $i \in [n]$,
$$\gamma_i = \begin{cases} +3 & \text{w.p. } \frac{1}{2} \\ +1 & \text{w.p. } \frac{1}{2} \end{cases}$$
and setting $f\colon x \in \{-1,1\}^n \mapsto \mathrm{sign}\left(\sum_{i=1}^n \gamma_i x_i\right)$. Any such $f$ is monotone, as the weights are all positive.

$\mathcal{D}_{\mathrm{No}}$: Similarly, draw $f \sim \mathcal{D}_{\mathrm{No}}$ by independently drawing, for $i \in [n]$,
$$\nu_i = \begin{cases} -1 & \text{w.p. } \frac{1}{10} \\ +\frac{7}{3} & \text{w.p. } \frac{9}{10} \end{cases}$$
and setting $f\colon x \in \{-1,1\}^n \mapsto \mathrm{sign}\left(\sum_{i=1}^n \nu_i x_i\right)$. Such an $f$ is not always far from monotone; in fact, one of the functions in the support of $\mathcal{D}_{\mathrm{No}}$ (the one with all weights set to $7/3$) is even monotone. However, we shall argue that $f \sim \mathcal{D}_{\mathrm{No}}$ is far from monotone with overwhelming probability, and then apply the relaxation of the key tool (HW Problem 4) to conclude.

The theorem will stem from the following two lemmas, which state respectively that (i) No-functions are almost all far from monotone, and (ii) the two distributions are hard to distinguish:

Lemma 8 (i). There exists a universal constant $\epsilon_0 > 0$ such that
$$\Pr_{f \sim \mathcal{D}_{\mathrm{No}}}[\, \mathrm{dist}(f, \mathcal{M}) > \epsilon_0 \,] \ge 1 - 2^{-\Theta(n)},$$
where $\mathcal{M}$ denotes the class of monotone functions. (Note that this $1 - o(1)$ probability is actually stronger than what the relaxation from Problem 4 requires.)

Lemma 9 (ii). Let $A$ be any deterministic $q$-query algorithm. Then
$$\left| \Pr_{f_{\mathrm{Yes}} \sim \mathcal{D}_{\mathrm{Yes}}}[\, A \text{ accepts} \,] - \Pr_{f_{\mathrm{No}} \sim \mathcal{D}_{\mathrm{No}}}[\, A \text{ accepts} \,] \right| \le O\!\left( \frac{q^{5/4} (\log n)^{1/2}}{n^{1/4}} \right),$$
so that if $q = \tilde{O}\left(n^{1/5}\right)$ the RHS is at most $0.01$, which implies, together with the earlier lemmas and discussion, that at least $q + 1$ queries are needed for any 2-sided, non-adaptive randomized tester.
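As an illustrative check of this construction (mine, not the notes'): sampling $\nu$ confirms that the weights have mean $2$, matching $\gamma$, and that the number $m$ of negative weights concentrates around $n/10$, which is exactly the concentration event used in the proof of Lemma 8 below.

```python
import random

def draw_nu(n):
    """One draw of the D_No weight vector: nu_i = -1 w.p. 1/10, +7/3 w.p. 9/10."""
    return [-1 if random.random() < 0.1 else 7/3 for _ in range(n)]

n, trials = 20000, 200
draws = [draw_nu(n) for _ in range(trials)]

mean = sum(sum(d) for d in draws) / (trials * n)
print(round(mean, 3))   # ~= 2.0, the common mean of the gamma_i and the nu_i

# Fraction of draws whose number m of negative weights lies in [0.09n, 0.11n]:
inside = sum(0.09 * n <= sum(w == -1 for w in d) <= 0.11 * n for d in draws)
print(inside / trials)  # ~= 1, as the additive Chernoff bound predicts
```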
Proof of Lemma 8. By an additive Chernoff bound, with probability at least $1 - 2^{-\Theta(n)}$ the random variables $\nu_i$ satisfy
$$m = \left| \{ i \in [n] : \nu_i = -1 \} \right| \in [0.09n, 0.11n]. \tag{$*$}$$
Say that any linear threshold function for which $(*)$ holds is nice. Fix any nice $f$ in the support of $\mathcal{D}_{\mathrm{No}}$, and rename the variables so that the negative weights correspond to the first $m$ variables:
$$f(x) = \mathrm{sign}\left( -(x_1 + \cdots + x_m) + \frac{7}{3}(x_{m+1} + \cdots + x_n) \right), \qquad x \in \{-1,1\}^n.$$
It is not difficult to show that for this $f$ (remembering that $m = \Theta(n)$), these first variables have high influence, roughly of the same order as for the MAJ function:

Claim 10 (HW Problem). For $i \in [m]$, $\mathrm{Inf}_i[f] = \Omega\left(\frac{1}{\sqrt{n}}\right)$.

Observe further that $f$ is unate (i.e., monotone increasing in some coordinates, and monotone decreasing in the others). Indeed, any LTF $g\colon x \mapsto \mathrm{sign}(w \cdot x)$ is unate:
- non-decreasing in coordinate $x_i$ if and only if $w_i \ge 0$;
- non-increasing in coordinate $x_i$ if and only if $w_i \le 0$.

We saw in previous lectures that, for $g$ monotone, $\hat{g}(i) = \mathrm{Inf}_i[g]$; it turns out the same proof generalizes to unate $g$ (HW Problem), yielding
$$\hat{g}(i) = \pm \mathrm{Inf}_i[g],$$
where the sign depends on whether $g$ is non-decreasing or non-increasing in $x_i$. Back to our function $f$, this means that
$$\hat{f}(i) = \begin{cases} +\mathrm{Inf}_i[f] & \text{if } \nu_i = \frac{7}{3} \\ -\mathrm{Inf}_i[f] & \text{if } \nu_i = -1 \end{cases}$$
and thus, for all $i \in [m]$, $\hat{f}(i) = -\Omega\left(\frac{1}{\sqrt{n}}\right)$.
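Before finishing the proof, a quick empirical look at Claim 10 (an illustrative sketch, mine, not the homework solution): estimate $\mathrm{Inf}_i[f] = \Pr_x[f(x) \ne f(x^{\oplus i})]$ by sampling, for a negative-weight coordinate of a nice LTF, and compare it against $1/\sqrt{n}$.

```python
import math
import random

def influence(weights, i, trials=10000):
    """Estimate Inf_i[f] = Pr_x[f(x) != f(x with coordinate i flipped)]
    for the LTF f(x) = sign(<weights, x>), with sign(0) taken as -1."""
    n = len(weights)
    flips = 0
    for _ in range(trials):
        x = [random.choice((-1, 1)) for _ in range(n)]
        s = sum(w * xi for w, xi in zip(weights, x))
        s_flipped = s - 2 * weights[i] * x[i]   # effect of flipping x_i
        flips += (s > 0) != (s_flipped > 0)
    return flips / trials

n = 1000
m = n // 10
weights = [-1] * m + [7/3] * (n - m)            # a "nice" No-function
print(influence(weights, 0), 1 / math.sqrt(n))  # same order of magnitude
```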
Fix any monotone Boolean function $g$: we will show that $\mathrm{dist}(f, g) \ge \epsilon_0$, for some choice of $\epsilon_0 > 0$ independent of $f$ and $g$. Indeed,
$$4\,\mathrm{dist}(f, g) = \mathbb{E}_{x \sim \mathcal{U}(\{-1,1\}^n)}\left[ (f(x) - g(x))^2 \right] = \sum_{S \subseteq [n]} \left( \hat{f}(S) - \hat{g}(S) \right)^2 \qquad \text{(Parseval)}$$
$$\ge \sum_{i=1}^m \left( \hat{f}(i) - \hat{g}(i) \right)^2 = \sum_{i=1}^m \left( -\mathrm{Inf}_i[f] - \mathrm{Inf}_i[g] \right)^2 \qquad (g \text{ monotone})$$
$$= \sum_{i=1}^m \left( \mathrm{Inf}_i[f] + \mathrm{Inf}_i[g] \right)^2 \ge \sum_{i=1}^m \left( \mathrm{Inf}_i[f] \right)^2 = \sum_{i=1}^m \Omega\!\left( \frac{1}{\sqrt{n}} \right)^2 = \Omega\!\left( \frac{m}{n} \right) = \Omega(1). \qquad \blacksquare$$

Proof (sketch) of Lemma 9. Fix any deterministic, non-adaptive $q$-query algorithm $A$, and view its $q$ queries $z^{(1)}, \ldots, z^{(q)} \in \{-1,1\}^n$ as a $q \times n$ matrix $Q \in \{-1,1\}^{q \times n}$, where $z^{(i)}$ corresponds to the $i$-th row of $Q$:
$$Q = \begin{pmatrix} z^{(1)}_1 & z^{(1)}_2 & z^{(1)}_3 & \cdots & z^{(1)}_n \\ z^{(2)}_1 & z^{(2)}_2 & z^{(2)}_3 & \cdots & z^{(2)}_n \\ \vdots & & & & \vdots \\ z^{(q)}_1 & z^{(q)}_2 & z^{(q)}_3 & \cdots & z^{(q)}_n \end{pmatrix}.$$
Define the Yes-response vector $R^Y$, a random variable over $\{-1,1\}^q$, by the process of (i) drawing $f_{\mathrm{Yes}} \sim \mathcal{D}_{\mathrm{Yes}}$, where $f_{\mathrm{Yes}}(x) = \mathrm{sign}(\gamma_1 x_1 + \cdots + \gamma_n x_n)$; (ii) setting the $i$-th coordinate of $R^Y$ to $f_{\mathrm{Yes}}(Q_{i,\cdot})$ ($f_{\mathrm{Yes}}$ applied to the $i$-th row of $Q$, i.e. $z^{(i)}$). Similarly, define the No-response vector $R^N$ over $\{-1,1\}^q$. Via Lemma 2 (the homework problem on total variation distance),
$$(\text{LHS of Lemma 9}) \le d_{\mathrm{TV}}(R^Y, R^N)$$
(abusing the notation of total variation distance by identifying the random variables with their distributions). Hence, our new goal is to show that
$$d_{\mathrm{TV}}(R^Y, R^N) \stackrel{?}{\le} (\text{RHS of Lemma 9}).$$
Multidimensional Berry–Esséen setup. For fixed $Q$ as above, define two random variables $S, T \in \mathbb{R}^q$ as
$$S = Q\gamma, \quad \text{with } \gamma \sim \mathcal{U}\left(\{1,3\}^n\right); \qquad T = Q\nu, \quad \text{with, for each } i \in [n] \text{ (independently)}, \quad \nu_i = \begin{cases} -1 & \text{w.p. } \frac{1}{10} \\ +\frac{7}{3} & \text{w.p. } \frac{9}{10} \end{cases}$$
We will also need the following geometric notion:

Definition 11. An orthant in $\mathbb{R}^q$ is the analogue in $q$-dimensional Euclidean space of a quadrant in the plane $\mathbb{R}^2$; that is, it is a set of the form
$$O = O_1 \times O_2 \times \cdots \times O_q$$
where each $O_i$ is either $\mathbb{R}_+$ or $\mathbb{R}_-$. There are $2^q$ different orthants in $\mathbb{R}^q$.

The random variable $R^Y$ is fully determined by the orthant $S$ lies in: the $i$-th coordinate of $R^Y$ is the sign of the $i$-th coordinate of $S$, as $S_i = (Q\gamma)_i = \langle Q_{i,\cdot}, \gamma \rangle$. Likewise, $R^N$ is determined by the orthant $T$ lies in. Abusing notation slightly, we will write $R^Y = \mathrm{sign}(S)$ for "$R^Y_i = \mathrm{sign}(S_i)$ for $i \in [q]$" (and similarly, $R^N = \mathrm{sign}(T)$). Now, it is enough to show that for any union $O$ of orthants,
$$\left| \Pr[\, S \in O \,] - \Pr[\, T \in O \,] \right| \le O\!\left( \frac{q^{5/4} (\log n)^{1/2}}{n^{1/4}} \right), \tag{$**$}$$
as this is equivalent to proving that, for any subset $U \subseteq \{-1,1\}^q$,
$$\left| \Pr[\, R^Y \in U \,] - \Pr[\, R^N \in U \,] \right| \le O\!\left( \frac{q^{5/4} (\log n)^{1/2}}{n^{1/4}} \right)$$
(and the maximum of the LHS over all such $U$ is by definition equal to $d_{\mathrm{TV}}(R^Y, R^N)$). Note that for $q = 1$ we get back the regular Berry–Esséen Theorem; for $q > 1$, we will need a multidimensional Berry–Esséen. The key will be to have random variables with matching means and covariances (instead of means and variances, as in the one-dimensional case).

(Rest of the proof during next lecture.)
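To make the setup concrete, here is a small illustrative simulation (mine, not the notes'): for a random $q \times n$ query matrix $Q$, it samples $S = Q\gamma$ and $T = Q\nu$, maps each to its sign (orthant) pattern, and estimates the total variation distance between the resulting response-vector distributions. Note that the empirical estimate carries sampling noise of order $\sqrt{2^q/\text{trials}}$, so it is only a rough illustration.

```python
import random
from collections import Counter

def response_pattern(Q, weights):
    """Orthant of Q @ weights, i.e. the response vector sign(Qw) as a tuple
    (sign(0) taken as -1)."""
    return tuple(1 if sum(qj * wj for qj, wj in zip(row, weights)) > 0 else -1
                 for row in Q)

q, n, trials = 3, 200, 10000
Q = [[random.choice((-1, 1)) for _ in range(n)] for _ in range(q)]

yes = Counter(response_pattern(Q, [random.choice((1, 3)) for _ in range(n)])
              for _ in range(trials))
no = Counter(response_pattern(Q, [-1 if random.random() < 0.1 else 7/3
                                  for _ in range(n)])
             for _ in range(trials))

tv = sum(abs(yes[p] - no[p]) for p in set(yes) | set(no)) / (2 * trials)
print(tv)  # small, consistent with the O(q^{5/4} (log n)^{1/2} / n^{1/4}) bound
```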