Lecture 9: March 26, 2014

COMS 6998-3: Sub-Linear Algorithms in Learning and Testing (Spring 2014)
Lecturer: Rocco Servedio
Lecture 9: March 26, 2014
Scribe: Keith Nichols

1 Overview

1.1 Last time

- Finished the analysis of the $O(\sqrt{n}/\epsilon)$-query algorithm for testing monotonicity.
- Showed an $\Omega(\sqrt{n})$ lower bound for one-sided non-adaptive monotonicity testers.
- Stated and proved (one direction of) Yao's Principle: suppose there exists a distribution $D$ over functions $f\colon \{-1,1\}^n \to \{-1,1\}$ (the inputs to the property testing problem) such that any $q$-query deterministic algorithm gives the right answer with probability at most $c$. Then, for any $q$-query non-adaptive randomized testing algorithm $A$, there exists some function $f_A$ such that
$$\Pr[\, A \text{ outputs the correct answer on } f_A \,] \le c.$$

1.2 Today: a lower bound for two-sided non-adaptive monotonicity testers

We will use Yao's Principle to show the following lower bound:

Theorem 1 (Chen, Servedio, Tan 2014). Any 2-sided non-adaptive $\epsilon_0$-tester for monotonicity needs $\Omega(n^{1/5})$ queries, where $\epsilon_0 > 0$ is an absolute constant.

2 The $\Omega(n^{1/5})$ lower bound: proving Theorem 1

2.1 Preliminaries

Recall the definition of the total variation distance between two distributions over the same set $\Omega$:
$$d_{TV}(D_1, D_2) = \frac{1}{2}\sum_{x \in \Omega} |D_1(x) - D_2(x)|.$$
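To make the definition concrete, here is a minimal Python sketch that computes $d_{TV}$ for two distributions given as probability tables; the distributions themselves are toy values chosen only for illustration, not anything from the lecture.

```python
# Minimal sketch: total variation distance between two distributions over the
# same finite set, d_TV(D1, D2) = (1/2) * sum_x |D1(x) - D2(x)|.
# The two distributions below are toy values for illustration only.

def total_variation(d1, d2):
    support = set(d1) | set(d2)
    return 0.5 * sum(abs(d1.get(x, 0.0) - d2.get(x, 0.0)) for x in support)

D1 = {"a": 0.5, "b": 0.3, "c": 0.2}
D2 = {"a": 0.4, "b": 0.4, "c": 0.2}

print(total_variation(D1, D2))  # approximately 0.1
```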

As a homework problem from last time, we have the lemma below, which relates the probability of distinguishing between samples from two distributions to their total variation distance:

Lemma 2 (HW problem). Let $D_1, D_2$ be two distributions over some set $\Omega$, and let $A$ be any algorithm (possibly randomized) that takes $x \in \Omega$ as input and outputs Yes or No. Then
$$\Big| \Pr_{x \sim D_1}[\, A(x) = \text{Yes} \,] - \Pr_{x \sim D_2}[\, A(x) = \text{Yes} \,] \Big| \le d_{TV}(D_1, D_2),$$
where the probabilities are also taken over the possible randomness of $A$. (This is sometimes referred to as a data processing inequality for the total variation distance.)

To apply this lemma, recall that given a deterministic algorithm's set of queries $Q = \{z^{(1)}, \ldots, z^{(q)}\} \subseteq \{-1,1\}^n$, a distribution $D$ over Boolean functions induces a distribution $D_Q$ over $\{-1,1\}^q$: a draw $x \sim D_Q$ is obtained by drawing $f \sim D$ and outputting $\big(f(z^{(1)}), \ldots, f(z^{(q)})\big) \in \{-1,1\}^q$.
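The bound in Lemma 2 is tight for deterministic tests: over a finite domain, the best accept-set is the set of points where $D_1$ puts more mass than $D_2$, and its advantage is exactly $d_{TV}(D_1, D_2)$. The following brute-force Python check over a toy pair of distributions (made up for illustration) confirms this; a randomized test is a mixture of deterministic ones, so it cannot do better either.

```python
from itertools import combinations

# Brute-force check of Lemma 2 on a tiny domain (toy distributions, for
# illustration only): no deterministic accept-set achieves an advantage
# |Pr_{D1}[accept] - Pr_{D2}[accept]| larger than d_TV(D1, D2), and some
# accept-set achieves it exactly.

D1 = {0: 0.5, 1: 0.3, 2: 0.2}
D2 = {0: 0.4, 1: 0.4, 2: 0.2}
domain = sorted(D1)

d_tv = 0.5 * sum(abs(D1[x] - D2[x]) for x in domain)

best = 0.0
for r in range(len(domain) + 1):
    for accept in combinations(domain, r):
        adv = abs(sum(D1[x] for x in accept) - sum(D2[x] for x in accept))
        best = max(best, adv)

print(d_tv, best)  # both approximately 0.1
```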

With this observation and Yao's Principle in hand, we can state and prove a key tool for proving lower bounds in property testing:

Lemma 3 (Key Tool). Fix any property $\mathcal{P}$ (a set of Boolean functions). Let $D_{Yes}$ be a distribution over Boolean functions that belong to $\mathcal{P}$, and $D_{No}$ be a distribution over Boolean functions that all have $\mathrm{dist}(f, \mathcal{P}) > \epsilon$. Suppose that for all $q$-query sets $Q$, one has $d_{TV}\big(D_Q^{Yes}, D_Q^{No}\big) \le \frac{1}{4}$. Then any (2-sided) non-adaptive $\epsilon$-tester for $\mathcal{P}$ must use at least $q+1$ queries.

Proof. Let $D$ be the mixture $D = \frac{1}{2} D_{Yes} + \frac{1}{2} D_{No}$ (that is, a draw from $D$ is obtained by tossing a fair coin and returning, accordingly, a sample drawn either from $D_{Yes}$ or from $D_{No}$). Fix a $q$-query deterministic algorithm $A$, and let
$$p_Y = \Pr_{f \sim D_{Yes}}[\, A \text{ accepts on } f \,], \qquad p_N = \Pr_{f \sim D_{No}}[\, A \text{ accepts on } f \,].$$
That is, $p_Y$ is the probability that a random Yes-function is accepted, while $p_N$ is the probability that a random No-function is accepted. Via the assumption and the previous lemma, $|p_Y - p_N| \le \frac{1}{4}$. However, this means that $A$ cannot be a successful tester, as
$$\Pr_{f \sim D}[\, A \text{ gives the wrong answer} \,] = \frac{1}{2}(1 - p_Y) + \frac{1}{2} p_N = \frac{1}{2} + \frac{1}{2}(p_N - p_Y) \ge \frac{3}{8} > \frac{1}{3}.$$
So Yao's Principle tells us that any randomized non-adaptive $q$-query algorithm is wrong on some $f$ in the support of $D$ with probability at least $\frac{3}{8}$; but a legitimate tester can only be wrong on any such $f$ with probability at most $\frac{1}{3}$.

Exercise 4 (Generalization of Lemma 3, HW problem). Relax the previous lemma slightly: prove that the conclusion still holds under the weaker assumptions
$$\Pr_{f \sim D_{Yes}}[\, f \in \mathcal{P} \,] \ge \frac{99}{100}, \qquad \Pr_{f \sim D_{No}}[\, \mathrm{dist}(f, \mathcal{P}) > \epsilon \,] \ge \frac{99}{100}.$$

For our lower bound, we need to come up with $D_{Yes}$ (resp. $D_{No}$) supported on monotone functions (resp. functions $\epsilon_0$-far from monotone) such that for every $Q \subseteq \{-1,1\}^n$ with $|Q| = q$,
$$d_{TV}\big(D_Q^{Yes}, D_Q^{No}\big) \le \frac{1}{4}.$$
At a high level, we need to argue that both distributions "look the same." One may thus think of the Central Limit Theorem: the sum of many independent, "nice" real-valued random variables converges to a Gaussian in distribution (in cumulative distribution function). For instance, a binomial distribution $\mathrm{Bin}(10^6, \frac{1}{2})$ has the same shape ("bell curve") as the corresponding Gaussian distribution $\mathcal{N}\big(\frac{10^6}{2}, \frac{10^6}{4}\big)$. For our purpose, however, the convergence guarantees stated by the Central Limit Theorem will not be enough, as they do not give explicit bounds on the rate of convergence; we will instead use a quantitative version of the CLT, the Berry–Esséen Theorem. First, recall the definition of a (real-valued) Gaussian random variable:

Definition 5 (One-dimensional Gaussian distribution). A real-valued random variable is said to be Gaussian with mean $\mu$ and variance $\sigma^2$ if it follows the distribution $\mathcal{N}(\mu, \sigma^2)$, which has probability density function
$$f_{\mu,\sigma}(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad x \in \mathbb{R}.$$
Such a random variable indeed has expectation $\mu$ and variance $\sigma^2$; furthermore, the distribution is fully specified by these two parameters.
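As a quick numerical sanity check of the "same bell-curve shape" remark above, one can compute the Kolmogorov distance between $\mathrm{Bin}(n, 1/2)$ and the matching Gaussian $\mathcal{N}(n/2, n/4)$; a standard-library-only Python sketch (with smaller values of $n$ than the $10^6$ in the text, purely to keep the exact binomial CDF cheap):

```python
import math

# Sanity check of the CLT intuition: the Kolmogorov distance between
# Bin(n, 1/2) and the matching Gaussian N(n/2, n/4) shrinks roughly
# like 1/sqrt(n).

def gaussian_cdf(x, mu, var):
    return 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2.0 * var)))

def binom_pmf_half(n, k):
    # pmf of Bin(n, 1/2), computed in log-space to avoid underflow
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1)
               - math.lgamma(n - k + 1) - n * math.log(2))
    return math.exp(log_pmf)

def kolmogorov_distance(n):
    cdf, dist = 0.0, 0.0
    for k in range(n + 1):
        cdf += binom_pmf_half(n, k)
        dist = max(dist, abs(cdf - gaussian_cdf(k, n / 2, n / 4)))
    return dist

for n in [100, 1000, 10000]:
    print(n, kolmogorov_distance(n))
```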

[Figure 1: the standard Gaussian $\mathcal{N}(0,1)$; (a) its cumulative distribution function (CDF) $F_{0,1}$, (b) its probability density function (PDF) $f_{0,1}$.]

Extending to higher dimensions, one can similarly define a $d$-dimensional Gaussian random variable:

Definition 6 ($d$-dimensional Gaussian distribution). Fix a vector $\mu \in \mathbb{R}^d$ and a symmetric non-negative definite matrix $\Sigma \in \mathbb{R}^{d \times d}$. A random variable taking values in $\mathbb{R}^d$ is said to be Gaussian with mean $\mu$ and covariance $\Sigma$ if it follows the distribution $\mathcal{N}(\mu, \Sigma)$, which has probability density function
$$f_{\mu,\Sigma}(x) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma}}\, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}, \qquad x \in \mathbb{R}^d.$$
As in the univariate case, $\mu$ and $\Sigma$ uniquely define the distribution; further, one has that for $X \sim \mathcal{N}(\mu, \Sigma)$,
$$\Sigma_{i,j} = \mathrm{Cov}(X_i, X_j) = \mathbb{E}\big[(X_i - \mathbb{E} X_i)(X_j - \mathbb{E} X_j)\big], \qquad i, j \in [d].$$
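As a small illustration of Definition 6, the following sketch (assuming numpy is available; the particular $\mu$ and $\Sigma$ are arbitrary toy values) samples from a 2-dimensional Gaussian and checks that the empirical mean and covariance recover $\mu$ and $\Sigma$:

```python
import numpy as np

# Sketch for Definition 6: sample from a 2-dimensional Gaussian N(mu, Sigma)
# and verify that the empirical mean and covariance match mu and Sigma.
# The particular mu and Sigma below are arbitrary toy values.

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])  # symmetric, non-negative definite

samples = rng.multivariate_normal(mu, Sigma, size=200_000)

print("empirical mean:", samples.mean(axis=0))
print("empirical covariance:")
print(np.cov(samples, rowvar=False))
```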

Theorem 7 (Berry–Esséen). Let $S = X_1 + \cdots + X_n$ be the sum of $n$ independent (real-valued) random variables $X_1, \ldots, X_n$ satisfying
$$\Pr[\, |X_i - \mathbb{E}[X_i]| \le \tau \,] = 1,$$
that is, every $X_i$ is almost surely bounded. For $i \in [n]$, define $\mu_i = \mathbb{E}[X_i]$ and $\sigma_i^2 = \mathrm{Var}\, X_i$, so that $\mathbb{E} S = \sum_{i=1}^n \mu_i$ and $\mathrm{Var}\, S = \sum_{i=1}^n \sigma_i^2$ (the last equality by independence). Finally, let $G$ be an $\mathcal{N}\big(\sum_{i=1}^n \mu_i, \sum_{i=1}^n \sigma_i^2\big)$ Gaussian variable, matching the first two moments of $S$. Then, for all $\theta \in \mathbb{R}$,
$$\big| \Pr[\, S \le \theta \,] - \Pr[\, G \le \theta \,] \big| \le \frac{O(\tau)}{\sqrt{\sum_{i=1}^n \sigma_i^2}}.$$
In other terms, letting $F_S$ (resp. $F_G$) denote the CDF of $S$ (resp. $G$), one has
$$\| F_S - F_G \|_\infty \le \frac{O(\tau)}{\sqrt{\sum_{i=1}^n \sigma_i^2}}.$$
(The quantity $\|F_S - F_G\|_\infty$ is also referred to as the Kolmogorov distance between $S$ and $G$. There exist other versions of this theorem, with weaker assumptions or phrased in terms of the third moments of the $X_i$'s; we only state here one tailored to our needs.)

Remark. The constant hidden in the $O(\cdot)$ notation is actually very reasonable: one can take it to be equal to 1.

Application: a baby step towards the lower bound. Fix any string $z \in \{-1,1\}^n$, and for $i \in [n]$ let the (independent) random variables $\gamma_i$ be defined as
$$\gamma_i = \begin{cases} +1 & \text{w.p. } 1/2 \\ -1 & \text{w.p. } 1/2. \end{cases}$$
Letting $X_i = \gamma_i z_i$, we have $\mu_i = \mathbb{E} X_i = 0$ and $\sigma_i^2 = \mathrm{Var}\, X_i = 1$, and we can take $\tau = 1$ to apply the Berry–Esséen theorem to $X = X_1 + \cdots + X_n$. This allows us to conclude that
$$\forall \theta \in \mathbb{R}, \qquad \big| \Pr[\, X \le \theta \,] - \Pr[\, G \le \theta \,] \big| \le \frac{O(1)}{\sqrt{n}}$$
for $G \sim \mathcal{N}(0, n)$.

Now, consider a slightly different distribution than that of the $\gamma_i$'s: for the same $z \in \{-1,1\}^n$, define the independent random variables $\nu_i$ by
$$\nu_i = \begin{cases} +1/3 & \text{w.p. } 9/10 \\ -3 & \text{w.p. } 1/10 \end{cases}$$
and let $Y_i = \nu_i z_i$ for $i \in [n]$, and $Y = Y_1 + \cdots + Y_n$. By our choice of parameters,
$$\mathbb{E} Y_i = \Big( \frac{1}{10}\cdot(-3) + \frac{9}{10}\cdot\frac{1}{3} \Big) z_i = 0 = \mathbb{E} X_i, \qquad \mathrm{Var}\, Y_i = \mathbb{E}\big[ Y_i^2 \big] = \frac{1}{10}\cdot 9 + \frac{9}{10}\cdot\frac{1}{9} = 1 = \mathrm{Var}\, X_i.$$
So $\mathbb{E}[Y] = \mathbb{E}[X] = 0$ and $\mathrm{Var}\, Y = \mathrm{Var}\, X = n$; by the Berry–Esséen theorem (with $\tau$ set to 3, and $G$ as before),
$$\forall \theta \in \mathbb{R}, \qquad \big| \Pr[\, Y \le \theta \,] - \Pr[\, G \le \theta \,] \big| \le \frac{O(1)}{\sqrt{n}},$$
and by the triangle inequality
$$\forall \theta \in \mathbb{R}, \qquad \big| \Pr[\, X \le \theta \,] - \Pr[\, Y \le \theta \,] \big| \le \frac{O(1)}{\sqrt{n}}. \tag{1}$$
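The moment-matching computation above, and the resulting closeness of $X$ and $Y$, can be checked by simulation; a rough Python sketch (numpy assumed; $n$ and the number of trials are arbitrary illustration values, and $z$ is fixed to the all-ones string for concreteness, though the argument applies to any fixed $z$):

```python
import numpy as np

# Rough simulation sketch: check the moment matching of nu_i against the
# Rademacher gamma_i, and compare the CDFs of X = sum_i gamma_i z_i and
# Y = sum_i nu_i z_i at a few thresholds (z = all-ones for concreteness).

rng = np.random.default_rng(1)
n, trials = 400, 10_000

vals, probs = np.array([1 / 3, -3.0]), np.array([0.9, 0.1])
print("E[nu_i]   =", vals @ probs)         # 0, matching E[gamma_i]
print("Var nu_i  =", (vals ** 2) @ probs)  # 1, matching Var gamma_i

gamma = rng.choice([-1.0, 1.0], size=(trials, n))
nu = rng.choice(vals, p=probs, size=(trials, n))
X, Y = gamma.sum(axis=1), nu.sum(axis=1)

for theta in [-2 * np.sqrt(n), 0.0, np.sqrt(n)]:
    print(theta, np.mean(X <= theta), np.mean(Y <= theta))
```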

We can now define $D_{Yes}$ and $D_{No}$ based on this (that is, based respectively on a random draw of $\gamma, \nu \in \mathbb{R}^n$ distributed as above): a function $f_\gamma \sim D_{Yes}$ is given by
$$\forall z \in \{-1,1\}^n, \qquad f_\gamma(z) = \mathrm{sign}(\gamma_1 z_1 + \cdots + \gamma_n z_n),$$
and similarly for $f_\nu \sim D_{No}$:
$$\forall z \in \{-1,1\}^n, \qquad f_\nu(z) = \mathrm{sign}(\nu_1 z_1 + \cdots + \nu_n z_n).$$
With the notations above, $X \ge 0$ if and only if $f_\gamma(z) = 1$, and $Y \ge 0$ if and only if $f_\nu(z) = 1$. This implies that for any fixed single query $z$,
$$d_{TV}\big( D_{Yes}^{\{z\}}, D_{No}^{\{z\}} \big) = \frac{1}{2}\Big( \big| \Pr[\, X \ge 0 \,] - \Pr[\, Y \ge 0 \,] \big| + \big| \Pr[\, X < 0 \,] - \Pr[\, Y < 0 \,] \big| \Big) \le \frac{O(1)}{\sqrt{n}}.$$
This almost looks like what we were aiming at, so why aren't we done? There are two problems with what we did above:

1. This only deals with the case $q = 1$; that is, it would only provide a lower bound against one-query algorithms. Fix: we will use a multidimensional version of the Berry–Esséen Theorem, for sums of $q$-dimensional independent random variables (converging to a multidimensional Gaussian).

2. $f_\gamma$ and $f_\nu$ are not monotone (indeed, both the $\gamma_i$'s and the $\nu_i$'s can be negative). Fix: shift everything by 2:
- $\gamma_i \in \{1, 3\}$: $f_\gamma$ is monotone;
- $\nu_i \in \{-1, 7/3\}$: $f_\nu$ will be far from monotone with high probability (we will show this).

2.2 The lower bound construction

Up until this point, everything has been a warmup; we are now ready to go into more detail.

$D_{Yes}$ and $D_{No}$. As mentioned in the previous section, we need to (re)define the distributions $D_{Yes}$ and $D_{No}$ (that is, the distributions of $\gamma$ and $\nu$) to solve the second issue.

$D_{Yes}$: draw $f \sim D_{Yes}$ by independently drawing, for $i \in [n]$,
$$\gamma_i = \begin{cases} +3 & \text{w.p. } 1/2 \\ +1 & \text{w.p. } 1/2 \end{cases}$$
and setting $f\colon x \in \{-1,1\}^n \mapsto \mathrm{sign}\big( \sum_{i=1}^n \gamma_i x_i \big)$. Any such $f$ is monotone, as the weights are all positive.

$D_{No}$: similarly, draw $f \sim D_{No}$ by independently drawing, for $i \in [n]$,
$$\nu_i = \begin{cases} +7/3 & \text{w.p. } 9/10 \\ -1 & \text{w.p. } 1/10 \end{cases}$$
and setting $f\colon x \in \{-1,1\}^n \mapsto \mathrm{sign}\big( \sum_{i=1}^n \nu_i x_i \big)$. Such an $f$ is not always far from monotone; actually, one of the functions in the support of $D_{No}$ (the one with all weights set to $7/3$) is even monotone. However, we shall argue that $f \sim D_{No}$ is far from monotone with overwhelming probability, and then apply the relaxation of the key tool (Exercise 4) to conclude.

The theorem will stem from the following two lemmas, which state respectively that (i) the No-functions are almost all far from monotone, and (ii) the two distributions are hard to distinguish:

Lemma 8. There exists a universal constant $\epsilon_0 > 0$ such that
$$\Pr_{f \sim D_{No}}[\, \mathrm{dist}(f, \mathcal{M}) > \epsilon_0 \,] \ge 1 - 2^{-\Theta(n)},$$
where $\mathcal{M}$ denotes the class of monotone functions. (Note that this $1 - o(1)$ probability is actually stronger than what the relaxation from Exercise 4 requires.)

Lemma 9. Let $A$ be any deterministic $q$-query algorithm. Then
$$\Big| \Pr_{f_{Yes} \sim D_{Yes}}[\, A \text{ accepts} \,] - \Pr_{f_{No} \sim D_{No}}[\, A \text{ accepts} \,] \Big| \le O\!\left( \frac{q^{5/4} (\log n)^{1/2}}{n^{1/4}} \right),$$
so that if $q \le n^{1/5}/\log n$ (say), the right-hand side is at most $0.01$ for $n$ large enough. Together with the earlier lemmas and discussion, this implies that at least $q + 1$ queries are needed for any 2-sided, non-adaptive randomized tester.
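To make the construction concrete, here is a small sampler for the two distributions together with evaluation of the resulting linear threshold functions (a sketch assuming numpy; the value of $n$ is arbitrary and plays no role in the proof):

```python
import numpy as np

# Sketch: samplers for D_Yes (weights in {1, 3}) and D_No (weights in
# {7/3, -1}), plus evaluation of the corresponding linear threshold functions.

rng = np.random.default_rng(2)

def draw_yes_weights(n):
    return rng.choice([1.0, 3.0], size=n)                      # each w.p. 1/2

def draw_no_weights(n):
    return rng.choice([7.0 / 3.0, -1.0], p=[0.9, 0.1], size=n)

def ltf(weights, x):
    # x in {-1, +1}^n; sign convention: sign(0) = +1
    return 1 if weights @ x >= 0 else -1

n = 1000
w_yes, w_no = draw_yes_weights(n), draw_no_weights(n)
x = rng.choice([-1.0, 1.0], size=n)
print("f_yes(x) =", ltf(w_yes, x), "  f_no(x) =", ltf(w_no, x))
print("fraction of -1 weights in the No draw:", np.mean(w_no < 0))  # about 0.1
```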

Proof of Lemma 8. By an additive Chernoff bound, with probability at least $1 - 2^{-\Theta(n)}$ the random variables $\nu_i$ satisfy
$$m = \big|\{\, i \in [n] : \nu_i = -1 \,\}\big| \in [0.09n,\, 0.11n]. \tag{$\star$}$$
Say that any linear threshold function for which $(\star)$ holds is nice. Fix any nice $f$ in the support of $D_{No}$, and rename the variables so that the negative weights correspond to the first $m$ variables:
$$f(x) = \mathrm{sign}\Big( -(x_1 + \cdots + x_m) + \frac{7}{3}(x_{m+1} + \cdots + x_n) \Big), \qquad x \in \{-1,1\}^n.$$
It is not difficult to show that for this $f$ (remembering that $m = \Theta(n)$), these first variables have high influence, roughly of the same order as for the MAJ function:

Claim 10 (HW problem). For $i \in [m]$, $\mathrm{Inf}_i[f] = \Omega\big( \frac{1}{\sqrt{n}} \big)$.

Observe further that $f$ is unate (i.e., monotone non-decreasing in some coordinates and monotone non-increasing in the others). Indeed, any LTF $g\colon x \mapsto \mathrm{sign}(w \cdot x)$ is unate:
- non-decreasing in coordinate $x_i$ if and only if $w_i \ge 0$;
- non-increasing in coordinate $x_i$ if and only if $w_i \le 0$.

We saw in previous lectures that, for $g$ monotone, $\hat{g}(i) = \mathrm{Inf}_i[g]$; it turns out the same proof generalizes to unate $g$ (HW problem), yielding
$$\hat{g}(i) = \pm \mathrm{Inf}_i[g],$$
where the sign depends on whether $g$ is non-decreasing or non-increasing in $x_i$. Back to our function $f$, this means that
$$\mathrm{Inf}_i[f] = \begin{cases} +\hat{f}(i) & \text{if } \nu_i = 7/3 \\ -\hat{f}(i) & \text{if } \nu_i = -1, \end{cases}$$
and thus, for all $i \in [m]$, $\hat{f}(i) = -\Omega\big( \frac{1}{\sqrt{n}} \big)$.
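Claim 10, and hence the bound on $\hat{f}(i)$ just derived, can be checked empirically; the following is a rough Monte-Carlo sketch (numpy assumed, parameters arbitrary), not a proof:

```python
import numpy as np

# Empirical check of Claim 10: estimate Inf_1[f] for a negative-weight
# coordinate of the nice No-function
#   f(x) = sign(-(x_1+...+x_m) + (7/3)(x_{m+1}+...+x_n)),   m = n/10,
# and compare it with 1/sqrt(n).

rng = np.random.default_rng(3)
n = 1000
m = n // 10
weights = np.concatenate([-np.ones(m), (7.0 / 3.0) * np.ones(n - m)])

def f(x):
    # x: one input per row; sign convention: sign(0) = +1
    return np.where(x @ weights >= 0, 1, -1)

trials = 10_000
x = rng.choice([-1.0, 1.0], size=(trials, n))
before = f(x)
x[:, 0] *= -1                 # flip the first coordinate (its weight is -1)
after = f(x)

print("estimated Inf_1[f]:", np.mean(before != after))
print("1/sqrt(n):         ", 1 / np.sqrt(n))
```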

Fix any monotone Boolean function $g$: we will show that $\mathrm{dist}(f, g) \ge \epsilon_0$ for some choice of $\epsilon_0 > 0$ independent of $f$ and $g$. Indeed,
$$4\, \mathrm{dist}(f, g) = \mathbb{E}_{x \sim U_{\{-1,1\}^n}}\big[ (f(x) - g(x))^2 \big] = \sum_{S \subseteq [n]} \big( \hat{f}(S) - \hat{g}(S) \big)^2 \qquad \text{(Parseval)}$$
$$\ge \sum_{i=1}^{n} \big( \hat{f}(i) - \hat{g}(i) \big)^2 \ge \sum_{i=1}^{m} \big( \hat{f}(i) - \hat{g}(i) \big)^2 = \sum_{i=1}^{m} \big( -\mathrm{Inf}_i[f] - \mathrm{Inf}_i[g] \big)^2 \qquad \text{($g$ monotone)}$$
$$= \sum_{i=1}^{m} \big( \mathrm{Inf}_i[f] + \mathrm{Inf}_i[g] \big)^2 \ge \sum_{i=1}^{m} \big( \mathrm{Inf}_i[f] \big)^2 \ge \sum_{i=1}^{m} \Omega\Big( \frac{1}{\sqrt{n}} \Big)^2 = \Omega\Big( \frac{m}{n} \Big) = \Omega(1).$$
Since $g$ was an arbitrary monotone function, every nice $f$ satisfies $\mathrm{dist}(f, \mathcal{M}) = \Omega(1) \ge \epsilon_0$ for a suitable universal constant $\epsilon_0 > 0$; together with $(\star)$, this proves Lemma 8.

Proof (sketch) of Lemma 9. Fix any deterministic, non-adaptive $q$-query algorithm $A$, and view its $q$ queries $z^{(1)}, \ldots, z^{(q)} \in \{-1,1\}^n$ as a $q \times n$ matrix $Q \in \{-1,1\}^{q \times n}$, where $z^{(i)}$ corresponds to the $i$-th row of $Q$:
$$Q = \begin{pmatrix} z^{(1)}_1 & z^{(1)}_2 & z^{(1)}_3 & \cdots & z^{(1)}_n \\ z^{(2)}_1 & z^{(2)}_2 & z^{(2)}_3 & \cdots & z^{(2)}_n \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ z^{(q)}_1 & z^{(q)}_2 & z^{(q)}_3 & \cdots & z^{(q)}_n \end{pmatrix}.$$
Define the Yes-response vector $R_Y$, a random variable over $\{-1,1\}^q$, by the following process: (i) draw $f_{Yes} \sim D_{Yes}$, where $f_{Yes}(x) = \mathrm{sign}(\gamma_1 x_1 + \cdots + \gamma_n x_n)$; (ii) set the $i$-th coordinate of $R_Y$ to $f_{Yes}(Q_{i,\cdot})$ ($f_{Yes}$ evaluated on the $i$-th row of $Q$, i.e. on $z^{(i)}$). Similarly, define the No-response vector $R_N$ over $\{-1,1\}^q$. Via Lemma 2 (the homework problem on total variation distance),
$$\text{(LHS of Lemma 9)} \le d_{TV}(R_Y, R_N)$$
(abusing the notation of total variation distance by identifying the random variables with their distributions). Hence, our new goal is to show that
$$d_{TV}(R_Y, R_N) \stackrel{?}{\le} \text{(RHS of Lemma 9)}.$$
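For intuition about this goal, one can fix a small random query matrix $Q$ and estimate $d_{TV}(R_Y, R_N)$ by sampling weight vectors from the two distributions; a rough sketch (numpy assumed; $q$ is kept small so the $2^q$ response patterns can be enumerated, and the estimate carries Monte-Carlo noise of order $1/\sqrt{\text{samples}}$):

```python
import numpy as np
from itertools import product

# Sketch: empirical estimate of d_TV(R_Y, R_N) for one fixed random q x n
# query matrix Q, by sampling weight vectors from D_Yes and D_No and
# recording the sign patterns sign(Q gamma) and sign(Q nu) in {-1,1}^q.

rng = np.random.default_rng(4)
n, q, samples = 1000, 3, 20_000
Q = rng.choice([-1.0, 1.0], size=(q, n))

def response_counts(draw_weights):
    counts = {pattern: 0 for pattern in product([-1, 1], repeat=q)}
    for _ in range(samples):
        w = draw_weights()
        pattern = tuple(1 if v >= 0 else -1 for v in Q @ w)
        counts[pattern] += 1
    return counts

yes_counts = response_counts(lambda: rng.choice([1.0, 3.0], size=n))
no_counts = response_counts(lambda: rng.choice([7.0 / 3.0, -1.0], p=[0.9, 0.1], size=n))

d_tv_estimate = 0.5 * sum(abs(yes_counts[p] - no_counts[p]) / samples
                          for p in yes_counts)
print("estimated d_TV(R_Y, R_N):", d_tv_estimate)
```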

Multidimensional Berry–Esséen setup. For fixed $Q$ as above, define two random variables $S, T \in \mathbb{R}^q$ as
$$S = Q\gamma, \ \text{ with } \gamma \sim U_{\{1,3\}^n}; \qquad T = Q\nu, \ \text{ with } \nu_i = \begin{cases} +7/3 & \text{w.p. } 9/10 \\ -1 & \text{w.p. } 1/10 \end{cases} \ \text{ for each } i \in [n] \text{ (independently)}.$$
We will also need the following geometric notion:

Definition 11. An orthant in $\mathbb{R}^q$ is the analogue in $q$-dimensional Euclidean space of a quadrant in the plane $\mathbb{R}^2$; that is, it is a set of the form
$$O = O_1 \times O_2 \times \cdots \times O_q,$$
where each $O_i$ is either $\mathbb{R}_+ = \{x \in \mathbb{R} : x \ge 0\}$ or $\mathbb{R}_- = \{x \in \mathbb{R} : x < 0\}$. There are $2^q$ different orthants in $\mathbb{R}^q$.

The random variable $R_Y$ is fully determined by the orthant $S$ lies in: the $i$-th coordinate of $R_Y$ is the sign of the $i$-th coordinate of $S$, as $S_i = (Q\gamma)_i = \langle Q_{i,\cdot}, \gamma \rangle$. Likewise, $R_N$ is determined by the orthant $T$ lies in. Abusing the notation slightly, we will write $R_Y = \mathrm{sign}(S)$, meaning $R_{Y,i} = \mathrm{sign}(S_i)$ for $i \in [q]$ (and similarly, $R_N = \mathrm{sign}(T)$).

Now, it is enough to show that for any union $O$ of orthants,
$$\big| \Pr[\, S \in O \,] - \Pr[\, T \in O \,] \big| \le O\!\left( \frac{q^{5/4}(\log n)^{1/2}}{n^{1/4}} \right), \tag{$\star\star$}$$
as this is equivalent to proving that, for any subset $U \subseteq \{-1,1\}^q$,
$$\big| \Pr[\, R_Y \in U \,] - \Pr[\, R_N \in U \,] \big| \le O\!\left( \frac{q^{5/4}(\log n)^{1/2}}{n^{1/4}} \right)$$
(and the maximum of the left-hand side over all $U$ is by definition equal to $d_{TV}(R_Y, R_N)$). Note that for $q = 1$ we get back the regular Berry–Esséen Theorem; for $q > 1$, we will need a multidimensional Berry–Esséen theorem. The key will be to have random variables with matching means and covariances (instead of means and variances as in the one-dimensional case).

(Rest of the proof during next lecture.)