COMS 6253: Advanced Computational Learning Theory Spring 2012
Lecture 5: February 16, 2012
Lecturer: Rocco Servedio Scribe: Igor Carboni Oliveira

1 Last time and today

Previously: Finished the first unit on PAC learning via PTF degree upper bounds. Started PAC learning under the uniform distribution. Introduced Fourier analysis over the Boolean hypercube.

Today: The connection between Fourier analysis and learning under the uniform distribution. The Low-Degree Algorithm (LDA) and Fourier concentration for some classes of Boolean functions. Applications of the LDA: learning decision trees and DNFs.

Relevant Readings:

Y. Mansour. An O(n^{\log\log n}) Learning Algorithm for DNF under the Uniform Distribution. In Proceedings of COLT, 53-61 (1992). [Journal of Computer and System Sciences 50(3):543-550 (1995)].

Y. Mansour. Learning Boolean Functions via the Fourier Transform. In Theoretical Advances in Neural Computation and Learning (V. P. Roychowdhury, K.-Y. Siu, and A. Orlitsky, eds.), 391-424 (1994).

J. Köbler, W. Lindner. Learning Boolean Functions under the Uniform Distribution via the Fourier Transform. In The Computational Complexity Column.
2 Review of Fourier Analysis

Recall that any function f : {-1,1}^n → R has a unique representation of the form(1)

    f(x) = \sum_{S \subseteq [n]} \hat{f}(S) \chi_S(x),    (1)

where

    \chi_S(x) = \prod_{i \in S} x_i.    (2)

In particular, we have

    \hat{f}(S) = E_x[f(x) \chi_S(x)].    (3)

Therefore, if f : {-1,1}^n → {-1,1} is Boolean, it follows that |\hat{f}(S)| ≤ 1 for every S ⊆ [n]. Actually, using the next proposition we can say something much stronger in this case:

Proposition 1. [Plancherel's Identity] For any f, g : {-1,1}^n → R, we have:

    E_x[f(x) g(x)] = \sum_{S \subseteq [n]} \hat{f}(S) \hat{g}(S).    (4)

Proof. Using the Fourier expansions of f and g, it follows that:

    E_x[f(x) g(x)] = E_x\Big[ \Big(\sum_{S \subseteq [n]} \hat{f}(S) \chi_S(x)\Big) \Big(\sum_{T \subseteq [n]} \hat{g}(T) \chi_T(x)\Big) \Big]    (5)
                   = \sum_{S \subseteq [n]} \sum_{T \subseteq [n]} \hat{f}(S) \hat{g}(T) E_x[\chi_S(x) \chi_T(x)]    (6)
                   = \sum_{S \subseteq [n]} \hat{f}(S) \hat{g}(S),    (7)

where the last equality follows from the facts that \chi_S(x) \chi_T(x) = \chi_{S \triangle T}(x) and E_x[\chi_{S \triangle T}(x)] = 1 if S = T and 0 otherwise.

Corollary 2. [Parseval's Identity] For any Boolean f : {-1,1}^n → {-1,1}, we have:

    \sum_{S \subseteq [n]} \hat{f}(S)^2 = 1.    (8)

(1) In contrast, it is easy to see that the representation of a function as a PTF is not unique.
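As a concrete sanity check (not part of the original notes), Parseval's identity can be verified by brute force for a small Boolean function. The function `majority` and all other names below are illustrative choices, not anything fixed by the notes:

```python
# Brute-force check of Parseval's identity for a small Boolean function.
from itertools import product, combinations

n = 4
points = list(product([-1, 1], repeat=n))  # all of {-1,1}^n

def majority(x):
    # An arbitrary Boolean function f: {-1,1}^4 -> {-1,1} (ties broken to +1).
    return 1 if sum(x) >= 0 else -1

def chi(S, x):
    # Parity character chi_S(x) = prod_{i in S} x_i.
    p = 1
    for i in S:
        p *= x[i]
    return p

def fourier_coeff(f, S):
    # hat f(S) = E_x[f(x) chi_S(x)], an exact average over the hypercube.
    return sum(f(x) * chi(S, x) for x in points) / len(points)

subsets = [S for k in range(n + 1) for S in combinations(range(n), k)]
total = sum(fourier_coeff(majority, S) ** 2 for S in subsets)
print(total)  # Parseval: the squared coefficients of a Boolean f sum to 1
```

The same loop with two different functions `f` and `g` checks Plancherel's identity; Parseval is just the case f = g.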
Proof. Applying Plancherel's identity with f = g, we get

    \sum_{S \subseteq [n]} \hat{f}(S)^2 = E_x[f(x)^2] = E_x[1] = 1.    (9)

Definition 3. The Fourier degree deg(f) of a function f : {-1,1}^n → R is the smallest integer d ≥ 0 such that \hat{f}(S) = 0 for all S ⊆ [n] with |S| > d.

For example, let's consider the Fourier degree of the AND function over n variables (here -1 plays the role of TRUE):

    AND(x_1, ..., x_n) = 1 if every x_i = -1, and 0 otherwise.

It is easy to see that:

    AND(x_1, ..., x_n) = \frac{1 - x_1}{2} \cdot \frac{1 - x_2}{2} \cdots \frac{1 - x_n}{2}    (10)
                       = \sum_{S \subseteq [n]} \frac{(-1)^{|S|}}{2^n} \chi_S.    (11)

Thus we have deg(AND) = n. In other words, the AND function has maximum Fourier degree.

The following lemma will be useful later.

Lemma 4. [Fourier degree of depth-d Decision Trees] Let f : {-1,1}^n → {-1,1} be a Boolean function computed by some decision tree of depth d. Then deg(f) ≤ d.

Proof. Given a decision tree T of depth d representing f, it is easy to see that:

    f(x) = \sum_{\text{paths } P \text{ in } T} 1_P(x) \cdot f(P),    (12)

where f(P) is the value of f at the end of path P and 1_P(x) is a function of at most d variables that is 1 if x follows P and 0 otherwise. Note that 1_P(x) has Fourier degree at most d, since it depends on at most d variables. By the linearity of the Fourier transform, it follows that deg(f) ≤ d.
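The claim deg(AND) = n amounts to the top coefficient \hat{AND}([n]) = (-1)^n / 2^n being nonzero, which is easy to confirm by brute force. The sketch below assumes the convention used above (AND outputs 1 exactly when every input is -1); for n = 4 the top coefficient is 1/16:

```python
# Verify that the top Fourier coefficient of AND is nonzero, so deg(AND) = n.
from itertools import product

n = 4
points = list(product([-1, 1], repeat=n))

def AND(x):
    # 1 if every coordinate is -1 (TRUE under this convention), else 0.
    return 1 if all(xi == -1 for xi in x) else 0

def coeff(f, S):
    # hat f(S) = E_x[f(x) chi_S(x)]
    total = 0
    for x in points:
        p = f(x)
        for i in S:
            p *= x[i]
        total += p
    return total / len(points)

top = coeff(AND, range(n))
print(top)  # (-1)^n / 2^n = 1/16 for n = 4, so the degree-n coefficient survives
```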
3 Fourier Analysis and Learning Theory

Intuitively, if we could find most of the heavy Fourier coefficients of an unknown Boolean function f, then we should be able to come up with a good approximation of f. In this section we formalize this idea. First, we show how to use the example oracle EX(f) to approximate \hat{f}(S) for any S ⊆ [n]. We emphasize that the learning theory results discussed in these notes hold with respect to the uniform distribution.

Lemma 5. There is an algorithm A that, given γ > 0, δ > 0, S ⊆ [n], and oracle access to EX(f), where f : {-1,1}^n → {-1,1}, outputs with probability at least 1 - δ a value c_S such that |c_S - \hat{f}(S)| ≤ γ. The number of oracle calls made by A is O((1/γ^2) log(1/δ)) and its running time is poly(n, 1/γ^2, log(1/δ)).

Proof. For X uniform over {-1,1}^n, let Z = f(X) \chi_S(X). Then it is clear that Z ∈ {-1,1} and E[Z] = \hat{f}(S). In addition, note that an arbitrary number of independent pairs (X, f(X)) can be sampled using the oracle EX(f), and that \chi_S(X) can be easily computed given any input X. It follows from the Chernoff-Hoeffding bound that the empirical estimate c_S of E[Z] obtained using O((1/γ^2) log(1/δ)) draws from EX(f) is γ-close to \hat{f}(S) with probability at least 1 - δ.

Although the previous lemma can be used to approximate a small number of Fourier coefficients, an efficient learning algorithm cannot afford to approximate all 2^n coefficients of an unknown Boolean function. Actually, even if we are promised that there is some special Fourier coefficient \hat{f}(T) with |\hat{f}(T)| > 0.99, it is not clear how to find such a coefficient efficiently using queries to the random oracle EX(f).(2) We can still make use of Lemma 5 if we restrict the concept class C to contain only functions that have most of their Fourier weight concentrated on a (fixed) small number of coefficients.
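The estimator of Lemma 5 is just an empirical mean of Z = f(X)\chi_S(X). A minimal sketch, with an illustrative Hoeffding constant and a hypothetical example oracle for f(x) = x_0 x_1 (0-indexed):

```python
# Sketch of the estimator from Lemma 5: approximate hat f(S) from random examples.
import math
import random

def estimate_coefficient(example_oracle, S, gamma, delta):
    # Empirical mean of Z = f(X) chi_S(X) over m = O((1/gamma^2) log(1/delta))
    # draws; the constant 2 comes from the two-sided Hoeffding bound for +-1 values.
    m = math.ceil((2 / gamma ** 2) * math.log(2 / delta))
    total = 0
    for _ in range(m):
        x, fx = example_oracle()          # one draw (x, f(x)) from EX(f)
        chi = 1
        for i in S:
            chi *= x[i]
        total += fx * chi
    return total / m

# Usage with a hypothetical oracle for f(x) = x_0 * x_1 on {-1,1}^5.
def ex_oracle():
    x = [random.choice([-1, 1]) for _ in range(5)]
    return x, x[0] * x[1]

c = estimate_coefficient(ex_oracle, [0, 1], gamma=0.05, delta=0.01)
# With probability >= 0.99, c is within 0.05 of hat f({0, 1}) = 1.
```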
For example, for many interesting Boolean functions f, at least a 1 - ε fraction of the Fourier weight lives on low-degree coefficients, i.e., there exists a small d such that \sum_{|S| > d} \hat{f}(S)^2 ≤ ε. If this is the case, we only need to estimate O(n^d) coefficients to obtain a good description of the function.

Definition 6. [Fourier Concentration] Let f : {-1,1}^n → R. We say f has α(ε, n)-Fourier concentration if

    \sum_{|S| > \alpha(\epsilon, n)} \hat{f}(S)^2 \le \epsilon.    (13)

(2) However, we will see in a few lectures that if we can query the unknown function f at arbitrary input positions, then we can actually find such a coefficient efficiently.
For convenience, we associate to each function f : {-1,1}^n → R its Fourier concentration function α_f(ε, n) in the natural way: for every ε we set α_f(ε, n) to be the smallest integer k ≥ 0 such that

    \sum_{|S| > k} \hat{f}(S)^2 \le \epsilon.    (14)

Note that for Boolean f we have

    \sum_{|S| \le \alpha_f(\epsilon, n)} \hat{f}(S)^2 \ge 1 - \epsilon.    (15)

The next algorithm demonstrates the connection between learning under the uniform distribution and estimating Fourier coefficients.

Low-Degree Algorithm (LDA). This algorithm is used to approximate an unknown Boolean function f with d = α(ε, n)-Fourier concentration. The LDA is given parameters τ > 0 (accuracy) and δ > 0 (confidence) and has access to the random oracle EX(f). It computes as follows:

1. The algorithm draws m = O((n^d / τ) log(n^d / δ)) random examples (x, f(x)) from EX(f).
2. It uses these examples to find estimates c_S of \hat{f}(S) for each subset S ⊆ [n] with |S| ≤ d, as done in the proof of Lemma 5.
3. Sets h(x) := \sum_{|S| \le d} c_S \chi_S(x). Note that h may not be a binary function.
4. Outputs the hypothesis sign(h(x)).

Lemma 7. If f : {-1,1}^n → {-1,1} has α(ε, n)-Fourier concentration, then with probability at least 1 - δ the Low-Degree Algorithm constructs a function h such that

    E_x[(h(x) - f(x))^2] \le \epsilon + \tau.    (16)

Proof. Fix an S ⊆ [n] with |S| ≤ d and let γ' = \sqrt{\tau / n^d} and δ' = δ / n^d. It follows from the proof of Lemma 5 that with failure probability at most δ' the estimate c_S obtained during step 2 satisfies |c_S - \hat{f}(S)| ≤ γ'. Since there are O(n^d) sets S with |S| ≤ d, applying a union bound we get that with probability at least 1 - δ, for every |S| ≤ d we have |c_S - \hat{f}(S)| ≤ γ'. Let
g(x) = f(x) - h(x). Clearly, for every S we have \hat{g}(S) = \hat{f}(S) - \hat{h}(S). It follows using Plancherel's identity that

    E_x[(h(x) - f(x))^2] = E_x[g(x)^2]    (17)
        = \sum_{S \subseteq [n]} \hat{g}(S)^2    (18)
        = \sum_{S \subseteq [n]} (\hat{f}(S) - \hat{h}(S))^2    (19)
        = \sum_{|S| \le d} (\hat{f}(S) - c_S)^2 + \sum_{|S| > d} (\hat{f}(S) - 0)^2    (by definition of h)    (20)
        \le \sum_{|S| \le d} (\gamma')^2 + \sum_{|S| > d} \hat{f}(S)^2    (with probability ≥ 1 - δ)    (21)
        \le \tau + \epsilon,    (22)

where the last inequality uses the Fourier concentration assumption about f (recall that d = α(ε, n)).

Lemma 8. The hypothesis sign(h(x)) output by the Low-Degree Algorithm is (τ + ε)-close to f with probability at least 1 - δ.

Proof. Using the previous lemma it is enough to prove that Pr_x[f(x) ≠ sign(h(x))] ≤ E_x[(h(x) - f(x))^2]. Since f(x) ∈ {-1,1}, whenever f(x) ≠ sign(h(x)) we have (f(x) - h(x))^2 ≥ 1, so

    Pr_x[f(x) ≠ sign(h(x))] = \frac{1}{2^n} \sum_{x \in \{-1,1\}^n} 1_{f(x) \ne \mathrm{sign}(h(x))}    (23)
        \le \frac{1}{2^n} \sum_{x \in \{-1,1\}^n} (f(x) - h(x))^2    (24)
        = E_x[(h(x) - f(x))^2],    (25)

which completes the argument.

Altogether, we have shown:

Theorem 9. Let C be a class of n-variable Boolean functions such that every f ∈ C has d = α(ε, n)-Fourier concentration. Then there is a poly(n^d, 1/ε, log(1/δ))-time uniform-distribution PAC algorithm that learns any f ∈ C to accuracy 2ε.

Proof. Run the LDA with τ = ε. The result follows from Lemmas 7 and 8.
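The four steps of the LDA can be sketched compactly for small n. Everything below is illustrative: the target is a depth-2 decision tree, so by Lemma 4 it has Fourier degree at most 2 and d = 2 suffices.

```python
# A minimal sketch of the Low-Degree Algorithm (illustrative names and parameters).
import math
import random
from itertools import combinations

def low_degree_algorithm(example_oracle, n, d, m):
    # Step 1: draw one batch of m random labeled examples.
    sample = [example_oracle() for _ in range(m)]
    # Step 2: estimate hat f(S) for every |S| <= d from the same batch.
    subsets = [S for k in range(d + 1) for S in combinations(range(n), k)]
    coeffs = {S: sum(fx * math.prod(x[i] for i in S) for x, fx in sample) / m
              for S in subsets}
    # Steps 3-4: the hypothesis is sign(sum_S c_S chi_S(x)).
    def h(x):
        val = sum(c * math.prod(x[i] for i in S) for S, c in coeffs.items())
        return 1 if val >= 0 else -1
    return h

random.seed(0)

def target(x):
    # Depth-2 decision tree: query x_1; output x_0 on one branch, x_2 on the other.
    return x[0] if x[1] == 1 else x[2]

def oracle():
    x = tuple(random.choice([-1, 1]) for _ in range(4))
    return x, target(x)

h = low_degree_algorithm(oracle, n=4, d=2, m=20000)
test_points = [oracle()[0] for _ in range(1000)]
error_rate = sum(h(x) != target(x) for x in test_points) / 1000
```

With this many samples the coefficient estimates are far more accurate than needed, so the empirical error rate is essentially zero; in general Lemma 8 only promises error τ + ε.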
4 Applications of the Low-Degree Algorithm

4.1 Learning depth-d Decision Trees

It follows from Lemma 4 that if f is an n-variable Boolean function computed by a depth-d decision tree, then α_f(0, n) ≤ d. Combining this observation with Theorem 9, we immediately obtain:

Proposition 10. Let C be the class of depth-d decision trees over n variables. Then there is a poly(n^d, 1/ε, log(1/δ))-time uniform-distribution PAC algorithm that learns any f ∈ C to accuracy ε with probability at least 1 - δ.

4.2 Learning s-term DNFs

In this subsection we present a Fourier-concentration result for DNFs.

Theorem 11. If f : {-1,1}^n → {-1,1} is an s-term DNF, then α_f(ε, n) = O(log(s/ε) log(1/ε)).

Using the Low-Degree Algorithm we get:

Corollary 12. Let C be the class of s-term DNFs over n variables. Then there is a poly(n^{\log(s/\epsilon) \log(1/\epsilon)}, log(1/δ))-time uniform-distribution PAC algorithm that learns any f ∈ C to accuracy 2ε with probability at least 1 - δ.

To obtain Theorem 11, we first argue that instead of proving a Fourier concentration result for s-term DNFs, it is enough to get a concentration result for width-w DNFs.

Lemma 13. Every s-term DNF can be ε-approximated by a (log(s/ε))-width DNF.

Proof. Note that removing a term of width greater than log(s/ε) from the original DNF can only change an ε/s-fraction of its output bits, since such a term is satisfied by at most a 2^{-\log(s/\epsilon)} = ε/s fraction of inputs. Since there are at most s of these terms, the result follows by a union bound.

Lemma 14. Let f, f' : {-1,1}^n → {-1,1} be ε-close Boolean functions. If f' has α(ε, n)-Fourier concentration, we have:

    \sum_{|S| > \alpha(\epsilon, n)} \hat{f}(S)^2 \le 9\epsilon.    (26)
Proof. Since f and f' are ε-close Boolean functions, we have

    E_x[(f(x) - f'(x))^2] = 4 \Pr_x[f(x) \ne f'(x)] \le 4\epsilon.    (27)

On the other hand, using Plancherel's identity:

    E_x[(f(x) - f'(x))^2] = \sum_{S \subseteq [n]} (\hat{f}(S) - \hat{f'}(S))^2    (28)
        \ge \sum_{|S| > \alpha(\epsilon, n)} (\hat{f}(S) - \hat{f'}(S))^2    (29)
        = \|v_f - v_{f'}\|_2^2,    (by definition)    (30)

where v_f is the real-valued vector with coordinates indexed by sets S with |S| > α(ε, n) such that (v_f)_S = \hat{f}(S), and v_{f'} is defined similarly. Using inequality (27), we get \|v_f - v_{f'}\|_2 \le \sqrt{4\epsilon} = 2\sqrt{\epsilon}. In addition, since f' has α(ε, n)-Fourier concentration, we have \|v_{f'}\|_2 \le \sqrt{\epsilon}. Now using the triangle inequality \|u\|_2 \le \|u - w\|_2 + \|w\|_2, it follows that \|v_f\|_2 \le \|v_f - v_{f'}\|_2 + \|v_{f'}\|_2 \le 2\sqrt{\epsilon} + \sqrt{\epsilon} = 3\sqrt{\epsilon}, which is equivalent to \|v_f\|_2^2 \le 9\epsilon. In other words:

    \sum_{|S| > \alpha(\epsilon, n)} \hat{f}(S)^2 \le 9\epsilon.    (31)

Therefore, it follows from Lemmas 13 and 14 that to get Theorem 11 it is enough to prove the following proposition:

Proposition 15. Let f : {-1,1}^n → {-1,1} be a Boolean function computed by a width-w DNF. Then α_f(ε, n) = O(w log(1/ε)).

To achieve that, we first study how a width-w DNF simplifies under a random restriction of its input bits.

Definition 16. A restriction is a pair (I, Z), with I ⊆ [n] and Z ∈ {-1,1}^{\bar{I}}, where \bar{I} = [n] \setminus I.

Definition 17. For f : {-1,1}^n → R, we write f_{I|Z} to denote the (I, Z)-restricted version of f, that is, the function f_{I|Z} : {-1,1}^I → R obtained from f by setting the variables in \bar{I} according to Z. For example, if f(x_1, x_2, ..., x_5) : {-1,1}^5 → R is a real-valued function and (I, Z) is a restriction with I = {2, 3} and Z = (-1, 1, 1), then f_{I|Z} represents the function f(-1, x_2, x_3, 1, 1).
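The restriction operation of Definition 17 is straightforward to implement. The sketch below uses 0-indexed coordinates, so the example above with I = {2, 3} becomes I = {1, 2}; all names are illustrative.

```python
# Sketch of applying a restriction (I, Z) to a function on {-1,1}^n.
def restrict(f, n, I, Z):
    # I: the free coordinates; Z: values for the coordinates outside I,
    # listed in increasing index order.  Returns f_{I|Z} on {-1,1}^I.
    fixed = sorted(set(range(n)) - set(I))
    assignment = dict(zip(fixed, Z))
    def g(y):
        # y assigns values to the coordinates in I, in increasing index order.
        free = dict(zip(sorted(I), y))
        x = [assignment.get(i, free.get(i)) for i in range(n)]
        return f(x)
    return g

# The example from the text (0-indexed): n = 5, I = {1, 2}, Z = (-1, 1, 1).
f = lambda x: x[0] * x[1] * x[4]
g = restrict(f, 5, I={1, 2}, Z=(-1, 1, 1))
# g(y) = f(-1, y_1, y_2, 1, 1) = -y_1
```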
Definition 18. A random restriction with parameter ρ (also called a ρ-random restriction) is a pair (I, Z) chosen by:

- For each i ∈ [n], independently put i in I with probability ρ.
- Pick Z ∈ {-1,1}^{\bar{I}} uniformly at random.

It is clear that functions simplify under a restriction. For a width-w DNF, we can actually show that the function simplifies a lot.

Theorem 19. [Håstad's Switching Lemma] Let f : {-1,1}^n → {-1,1} be a width-w DNF. Then for (I, Z) a ρ-random restriction:

    \Pr_{(I,Z)}[\mathrm{DT\text{-}depth}(f_{I|Z}) > d] \le (5\rho w)^d.    (32)

For example, if ρ = 1/(10w), we get that

    \Pr_{(I,Z)}[\mathrm{DT\text{-}depth}(f_{I|Z}) > d] \le 2^{-d}.    (33)

The proof of this theorem is beyond the scope of these notes.

The Switching Lemma is a powerful tool for our purposes because we understand the Fourier concentration of decision trees pretty well. In particular, we know by Lemma 4 that depth-d decision trees are concentrated on coefficients with |S| ≤ d. Intuitively, we want to argue that since after the random restriction the resulting function has good concentration, it must be the case that the original width-w DNF has some concentration as well. To formalize this argument we introduce additional notation.

Definition 20. Let f : {-1,1}^n → R be any function and let S, I ⊆ [n]. We define F_{S,I} : {-1,1}^{\bar{I}} → R as follows:

    F_{S,I}(Z) = \widehat{f_{I|Z}}(S).    (34)

Proposition 21. Let f : {-1,1}^n → R and suppose S ⊆ I ⊆ [n]. Then

    \widehat{F_{S,I}}(T) = \hat{f}(S \cup T)    (35)

for any set T ⊆ \bar{I}.
Proof.

    \widehat{F_{S,I}}(T) = E_{Z \in \{-1,1\}^{\bar{I}}}[F_{S,I}(Z) \chi_T(Z)]    (36)
        = E_{Z \in \{-1,1\}^{\bar{I}}}[\widehat{f_{I|Z}}(S) \chi_T(Z)]    (37)
        = E_{Z \in \{-1,1\}^{\bar{I}}}[E_{Y \in \{-1,1\}^I}[f_{I|Z}(Y) \chi_S(Y)] \chi_T(Z)]    (38)
        = E_{X \in \{-1,1\}^n}[f(X) \chi_S(X) \chi_T(X)]    (39)
        = E_{X \in \{-1,1\}^n}[f(X) \chi_{S \cup T}(X)]    (since S ∩ T = ∅)    (40)
        = \hat{f}(S \cup T).    (41)

The proof of Proposition 15 will be completed next class.
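Proposition 21 can be checked numerically on a small example; the choices of f, S, I, and T below are all illustrative (0-indexed coordinates).

```python
# Numerical check of Proposition 21 with n = 4, I = {0, 1}, S = {0}, T = {2}.
# Here f(x) = x_0 * x_2, so hat f(S union T) = hat f({0, 2}) = 1.
from itertools import product

n = 4

def f(x):
    return x[0] * x[2]

def F(Z):
    # F_{S,I}(Z) = widehat{f_{I|Z}}(S): the coefficient of chi_{{0}} of the
    # restricted function, computed by brute force over y = (x_0, x_1).
    total = 0
    for y in product([-1, 1], repeat=2):
        x = [y[0], y[1], Z[0], Z[1]]   # coordinates 2 and 3 are fixed to Z
        total += f(x) * y[0]           # chi_S(y) with S = {0}
    return total / 4

# Fourier coefficient of F at T = {2}, i.e. at Z-coordinate 0.
Zpts = list(product([-1, 1], repeat=2))
hat_F_T = sum(F(Z) * Z[0] for Z in Zpts) / len(Zpts)

# hat f(S union T) computed over the whole cube {-1,1}^4.
pts = list(product([-1, 1], repeat=n))
hat_f_ST = sum(f(x) * x[0] * x[2] for x in pts) / len(pts)
print(hat_F_T, hat_f_ST)  # both equal 1.0, matching Proposition 21
```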