Lecture 5: February 16, 2012


COMS 6253: Advanced Computational Learning Theory                    Spring 2012
Lecture 5: February 16, 2012
Lecturer: Rocco Servedio                              Scribe: Igor Carboni Oliveira

1 Last time and today

Previously:
- Finished the first unit on PAC learning using PTF degree upper bounds.
- Started PAC learning under the uniform distribution.
- Introduced Fourier analysis over the Boolean hypercube.

Today:
- The connection between Fourier analysis and learning under the uniform distribution.
- The Low-Degree Algorithm (LDA) and Fourier concentration for some classes of Boolean functions.
- Applications of the LDA: learning decision trees and DNFs.

Relevant Readings:
- Y. Mansour. An O(n^{log log n}) Learning Algorithm for DNF under the Uniform Distribution. In Proceedings of COLT, 53-61 (1992). [Journal of Computer and System Sciences 50(3):543-550 (1995).]
- Y. Mansour. Learning Boolean Functions via the Fourier Transform. In Theoretical Advances in Neural Computation and Learning (V. P. Roychowdhury, K.-Y. Siu, and A. Orlitsky, eds.), 391-424 (1994).
- J. Köbler, W. Lindner. Learning Boolean Functions under the Uniform Distribution via the Fourier Transform. In The Computational Complexity Column.

2 Review of Fourier Analysis

Recall that any function f : {-1,1}^n → R has a unique representation of the form

    f(x) = \sum_{S ⊆ [n]} \hat{f}(S) χ_S(x),    (1)

where

    χ_S(x) = \prod_{i ∈ S} x_i.    (2)

(In contrast, it is easy to see that the representation of a function as a PTF is not unique.) In particular, we have

    \hat{f}(S) = E_x[f(x) χ_S(x)].    (3)

Therefore, if f : {-1,1}^n → {-1,1} is Boolean, it follows that |\hat{f}(S)| ≤ 1 for every S ⊆ [n]. In fact, using the next proposition we can say something much stronger in this case.

Proposition 1 (Plancherel's Identity). For any f, g : {-1,1}^n → R, we have

    E_x[f(x) g(x)] = \sum_{S ⊆ [n]} \hat{f}(S) \hat{g}(S).    (4)

Proof. Using the Fourier expansions of f and g, it follows that

    E_x[f(x) g(x)] = E_x[ (\sum_{S ⊆ [n]} \hat{f}(S) χ_S(x)) (\sum_{T ⊆ [n]} \hat{g}(T) χ_T(x)) ]    (5)
                   = \sum_{S, T ⊆ [n]} \hat{f}(S) \hat{g}(T) E_x[χ_S(x) χ_T(x)]    (6)
                   = \sum_{S ⊆ [n]} \hat{f}(S) \hat{g}(S),    (7)

where the last equality follows from the facts that χ_S(x) χ_T(x) = χ_{S Δ T}(x) and that E_x[χ_{S Δ T}(x)] is 1 if S = T and 0 otherwise.

Corollary 2 (Parseval's Identity). For any Boolean f : {-1,1}^n → {-1,1}, we have

    \sum_{S ⊆ [n]} \hat{f}(S)^2 = 1.    (8)

Proof. Applying Plancherel's identity with f = g, we get

    \sum_{S ⊆ [n]} \hat{f}(S)^2 = E_x[f(x)^2] = E_x[1] = 1.    (9)

Definition 3. The Fourier degree deg(f) of a function f : {-1,1}^n → R is the smallest integer d ≥ 0 such that \hat{f}(S) = 0 for all S ⊆ [n] with |S| > d.

For example, let us consider the Fourier degree of the AND function over n variables:

    AND(x_1, ..., x_n) = 1 if all x_i = -1, and 0 if some x_i = 1.

It is easy to see that

    AND(x_1, ..., x_n) = ((1 - x_1)/2) ((1 - x_2)/2) ... ((1 - x_n)/2)    (10)
                       = \sum_{S ⊆ [n]} ((-1)^{|S|} / 2^n) χ_S(x).    (11)

Thus deg(AND) = n. In other words, the AND function has maximum Fourier degree.

The following lemma will be useful later.

Lemma 4 (Fourier degree of depth-d decision trees). Let f : {-1,1}^n → {-1,1} be a Boolean function computed by some decision tree of depth d. Then deg(f) ≤ d.

Proof. Given a decision tree T of depth d representing f, it is easy to see that

    f(x) = \sum_{paths P in T} 1_P(x) f(P),    (12)

where f(P) is the value of f at the end of path P and 1_P(x) is a function of at most d variables that equals 1 if x follows P and 0 otherwise. Note that 1_P(x) has Fourier degree at most d, since it depends on at most d variables. By linearity of the Fourier transform, it follows that deg(f) ≤ d.
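
To make these definitions concrete, here is a minimal Python sketch (my own illustration, not part of the lecture) that computes all Fourier coefficients of a function on {-1,1}^n by brute force and sanity-checks Parseval's identity and the claim deg(AND) = n for a small n. All names in the snippet are mine.

# Brute-force Fourier coefficients of f : {-1,1}^n -> R (illustrative sketch).
from itertools import combinations, product

def chi(S, x):
    """Parity character chi_S(x) = product of x_i over i in S."""
    out = 1
    for i in S:
        out *= x[i]
    return out

def fourier_coefficients(f, n):
    """Return {S: f_hat(S)} via f_hat(S) = E_x[f(x) chi_S(x)], averaged over all 2^n inputs."""
    cube = list(product((-1, 1), repeat=n))
    return {S: sum(f(x) * chi(S, x) for x in cube) / 2 ** n
            for r in range(n + 1) for S in combinations(range(n), r)}

n = 4
AND = lambda x: 1 if all(xi == -1 for xi in x) else 0          # 1 iff all x_i = -1, as above
coeffs = fourier_coefficients(AND, n)
print(coeffs[tuple(range(n))])        # top coefficient (-1)^n / 2^n = 0.0625 is nonzero, so deg(AND) = n
print(sum(c ** 2 for c in coeffs.values()))
# AND is 0/1-valued, so its squared coefficients sum to E[AND^2] = 2^{-n} = 0.0625 rather than 1.
MAJ = lambda x: 1 if sum(x) > 0 else -1                        # a {-1,1}-valued example
print(sum(c ** 2 for c in fourier_coefficients(MAJ, n).values()))   # = 1.0, Parseval's identity

The brute force enumerates all 2^n inputs for each subset, so it is only a sanity check for small n.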

3 Fourier Analysis and Learning Theory

Intuitively, if we could find most of the heavy Fourier coefficients of an unknown Boolean function f, then we should be able to come up with a good approximation to f. In this section we formalize this idea. First, we show how to use the example oracle EX(f) to approximate \hat{f}(S) for any S ⊆ [n]. We emphasize that the learning results discussed in these notes hold with respect to the uniform distribution.

Lemma 5. There is an algorithm A that, given γ > 0, δ > 0, S ⊆ [n], and oracle access to EX(f), where f : {-1,1}^n → {-1,1}, outputs with probability at least 1 - δ a value c_S such that |c_S - \hat{f}(S)| ≤ γ. The number of oracle calls made by A is O((1/γ^2) log(1/δ)) and its running time is poly(n, 1/γ^2, log(1/δ)).

Proof. For X uniform over {-1,1}^n, let Z = f(X) χ_S(X). Then clearly Z ∈ {-1,1} and E[Z] = \hat{f}(S). In addition, note that an arbitrary number of independent pairs (X, f(X)) can be sampled using the oracle EX(f), and that χ_S(X) is easily computed from any input X. It follows from the Chernoff-Hoeffding bound that the empirical estimate c_S of E[Z] obtained using O((1/γ^2) log(1/δ)) draws from EX(f) is γ-close to \hat{f}(S) with probability at least 1 - δ.

Although the previous lemma can be used to approximate a small number of Fourier coefficients, an efficient learning algorithm cannot afford to approximate all coefficients of an unknown Boolean function. In fact, even if we are promised that there is some special Fourier coefficient \hat{f}(T) with |\hat{f}(T)| > 0.99, it is not clear how to find such a coefficient efficiently using queries to the random example oracle EX(f). (However, we will see in a few lectures that if we can query the unknown function f at arbitrary inputs, then we can actually find such a coefficient efficiently.)

We can still make use of Lemma 5 if we restrict the concept class C to contain only functions that have most of their Fourier weight concentrated on a (fixed) small number of coefficients. For example, for many interesting Boolean functions f, at least a 1 - ε fraction of the Fourier weight lives on low-degree coefficients; that is, there exists a small d such that \sum_{|S| > d} \hat{f}(S)^2 ≤ ε. If this is the case, we only need to estimate O(n^d) coefficients to obtain a good description of the function.

Definition 6 (Fourier Concentration). Let f : {-1,1}^n → R. We say f has α(ε, n)-Fourier concentration if

    \sum_{|S| > α(ε,n)} \hat{f}(S)^2 ≤ ε.    (13)
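
The estimator from the proof of Lemma 5 is simple enough to sketch in code. The rough Python illustration below simulates the oracle EX(f) by drawing x uniformly and evaluating f, and uses an explicit Hoeffding-style sample size; the constants and names are mine, not the lecture's.

# Estimate f_hat(S) = E[f(X) chi_S(X)] from uniform random examples (sketch of Lemma 5).
import math
import random

def estimate_coefficient(f, n, S, gamma, delta, rng=random.Random(0)):
    """Empirical mean of f(x)*chi_S(x) over m = O(gamma^-2 log(1/delta)) uniform draws."""
    m = math.ceil((2.0 / gamma ** 2) * math.log(2.0 / delta))   # Hoeffding sample size for +/-1 variables
    total = 0.0
    for _ in range(m):
        x = [rng.choice((-1, 1)) for _ in range(n)]             # one simulated draw from EX(f)
        chi_S = 1
        for i in S:
            chi_S *= x[i]
        total += f(x) * chi_S
    return total / m

# Example target: f = chi_{{0,1}} itself, so f_hat({0,1}) = 1 and every other coefficient is 0.
f = lambda x: x[0] * x[1]
print(estimate_coefficient(f, n=10, S=(0, 1), gamma=0.05, delta=0.01))   # close to 1
print(estimate_coefficient(f, n=10, S=(0,), gamma=0.05, delta=0.01))     # close to 0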

For convenience, we associate to each function f : {-1,1}^n → R its Fourier concentration function α_f(ε, n) in the natural way: for every ε, we set α_f(ε, n) to be the smallest integer k ≥ 0 such that

    \sum_{|S| > k} \hat{f}(S)^2 ≤ ε.    (14)

Note that for Boolean f we then have

    \sum_{|S| ≤ α_f(ε,n)} \hat{f}(S)^2 ≥ 1 - ε.    (15)

The next algorithm demonstrates the connection between learning under the uniform distribution and estimating Fourier coefficients.

Low-Degree Algorithm (LDA). This algorithm is used to approximate an unknown Boolean function f with d = α(ε, n) Fourier concentration. The LDA is given parameters τ > 0 (accuracy) and δ > 0 (confidence) and has access to the random example oracle EX(f). It proceeds as follows:

1. Draw m = O((n^d/τ) log(n^d/δ)) random examples (x, f(x)) from EX(f).
2. Use these examples to compute estimates c_S of \hat{f}(S) for every subset S ⊆ [n] with |S| ≤ d, as in the proof of Lemma 5.
3. Set h'(x) := \sum_{|S| ≤ d} c_S χ_S(x). Note that h' need not be a Boolean-valued function.
4. Output the hypothesis h(x) := sign(h'(x)).

Lemma 7. If f : {-1,1}^n → {-1,1} has α(ε, n)-Fourier concentration, then with probability at least 1 - δ the Low-Degree Algorithm constructs a function h' such that

    E_x[(h'(x) - f(x))^2] ≤ ε + τ.    (16)

Proof. Fix an S ⊆ [n] with |S| ≤ d, and let γ' = sqrt(τ/n^d) and δ' = δ/n^d. It follows from the proof of Lemma 5 that, with failure probability at most δ', the estimate c_S obtained in step 2 satisfies |c_S - \hat{f}(S)| ≤ γ'. Therefore, applying a union bound over the at most n^d such sets, with probability at least 1 - δ we have |c_S - \hat{f}(S)| ≤ γ' for every |S| ≤ d. Let

g(x) = f(x) - h'(x). Clearly, for every S we have \hat{g}(S) = \hat{f}(S) - \hat{h'}(S). Using Plancherel's identity, it follows that

    E_x[(h'(x) - f(x))^2] = E_x[g(x)^2]    (17)
        = \sum_{S ⊆ [n]} \hat{g}(S)^2    (18)
        = \sum_{S ⊆ [n]} (\hat{f}(S) - \hat{h'}(S))^2    (19)
        = \sum_{|S| ≤ d} (\hat{f}(S) - c_S)^2 + \sum_{|S| > d} (\hat{f}(S) - 0)^2    (by definition of h')    (20)
        ≤ \sum_{|S| ≤ d} (γ')^2 + \sum_{|S| > d} \hat{f}(S)^2    (with probability at least 1 - δ)    (21)
        ≤ τ + ε,    (22)

where the last inequality uses the Fourier concentration assumption on f (recall that d = α(ε, n)).

Lemma 8. The hypothesis h = sign(h') output by the Low-Degree Algorithm is (τ + ε)-close to f with probability at least 1 - δ.

Proof. By the previous lemma, it is enough to prove that Pr_x[f(x) ≠ sign(h'(x))] ≤ E_x[(h'(x) - f(x))^2]. Since f(x) ∈ {-1,1}, whenever f(x) ≠ sign(h'(x)) we have |f(x) - h'(x)| ≥ 1, and therefore

    Pr_x[f(x) ≠ sign(h'(x))] = (1/2^n) \sum_{x ∈ {-1,1}^n} 1[f(x) ≠ sign(h'(x))]    (23)
        ≤ (1/2^n) \sum_{x ∈ {-1,1}^n} (f(x) - h'(x))^2    (24)
        = E_x[(h'(x) - f(x))^2],    (25)

which completes the argument.

Altogether, we have shown:

Theorem 9. Let C be a class of n-variable Boolean functions such that every f ∈ C has d = α(ε, n)-Fourier concentration. Then there is a poly(n^d, 1/ε, log(1/δ))-time uniform-distribution PAC algorithm that learns any f ∈ C to accuracy 2ε.

Proof. Run the LDA with τ = ε. The result follows from Lemmas 7 and 8.
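
Putting Lemmas 5, 7, and 8 together, the Low-Degree Algorithm is short enough to sketch in code. The Python snippet below is my reconstruction of steps 1-4 above; the per-coefficient accuracy γ' and the sample size mirror the union-bound argument, but the exact constants are illustrative rather than taken from the notes.

# A compact sketch of the Low-Degree Algorithm (illustrative reconstruction).
import math
import random
from itertools import combinations

def low_degree_algorithm(f, n, d, tau, delta, rng=random.Random(0)):
    """Learn f under the uniform distribution, assuming Fourier concentration up to degree d."""
    num_sets = sum(math.comb(n, r) for r in range(d + 1))         # roughly n^d low-degree sets
    gamma = math.sqrt(tau / num_sets)                             # per-coefficient accuracy gamma'
    m = math.ceil((2.0 / gamma ** 2) * math.log(2.0 * num_sets / delta))
    sample = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(m)]   # step 1: draws from EX(f)
    labels = [f(x) for x in sample]

    def chi(S, x):
        p = 1
        for i in S:
            p *= x[i]
        return p

    # Step 2: one empirical estimate c_S per low-degree set S, reusing the same sample.
    estimates = {S: sum(y * chi(S, x) for x, y in zip(sample, labels)) / m
                 for r in range(d + 1) for S in combinations(range(n), r)}

    # Steps 3-4: h'(x) = sum_S c_S chi_S(x), rounded to a Boolean hypothesis by sign.
    def hypothesis(x):
        val = sum(c * chi(S, x) for S, c in estimates.items())
        return 1 if val >= 0 else -1
    return hypothesis

# Usage on a toy target: a depth-2 decision tree, which has Fourier degree at most 2 (Lemma 4).
tree = lambda x: x[0] if x[1] == 1 else x[2]
h = low_degree_algorithm(tree, n=6, d=2, tau=0.05, delta=0.05)
test = [[random.choice((-1, 1)) for _ in range(6)] for _ in range(2000)]
print(sum(h(x) != tree(x) for x in test) / len(test))             # small empirical error, as Lemma 8 predicts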

4 Applications of the Low-Degree Algorithm

4.1 Learning depth-d Decision Trees

It follows from Lemma 4 that if f is an n-variable Boolean function computed by a depth-d decision tree, then α_f(0, n) ≤ d. Combining this observation with Theorem 9, we immediately obtain:

Proposition 10. Let C be the class of depth-d decision trees over n variables. Then there is a poly(n^d, 1/ε, log(1/δ))-time uniform-distribution PAC algorithm that learns any f ∈ C to accuracy ε with probability at least 1 - δ.

4.2 Learning s-term DNFs

In this subsection we present a Fourier concentration result for DNFs.

Theorem 11. If f : {-1,1}^n → {-1,1} is an s-term DNF, then α_f(ε, n) = O(log(s/ε) · log(1/ε)).

Using the Low-Degree Algorithm, we get:

Corollary 12. Let C be the class of s-term DNFs over n variables. Then there is a poly(n^{log(s/ε) log(1/ε)}, log(1/δ))-time uniform-distribution PAC algorithm that learns any f ∈ C to accuracy 2ε with probability at least 1 - δ.

To obtain Theorem 11, we first argue that instead of proving a Fourier concentration result for s-term DNFs, it is enough to get a concentration result for small-width DNFs.

Lemma 13. Every s-term DNF can be ε-approximated by a DNF of width log(s/ε).

Proof. A term of size greater than log(s/ε) is satisfied by at most a 2^{-log(s/ε)} = ε/s fraction of the inputs, so removing it from the DNF changes at most an ε/s-fraction of the output bits. Since at most s terms are removed, the result follows by a union bound.

Lemma 14. Let f, f' : {-1,1}^n → {-1,1} be ε-close Boolean functions. If f' has α(ε, n)-Fourier concentration, then

    \sum_{|S| > α(ε,n)} \hat{f}(S)^2 ≤ 9ε.    (26)

Proof. Since f and f' are ε-close Boolean functions, we have

    E_x[(f(x) - f'(x))^2] = 4 Pr_x[f(x) ≠ f'(x)] ≤ 4ε.    (27)

On the other hand, using Plancherel's identity,

    E_x[(f(x) - f'(x))^2] = \sum_{S ⊆ [n]} (\hat{f}(S) - \hat{f'}(S))^2    (28)
        ≥ \sum_{|S| > α(ε,n)} (\hat{f}(S) - \hat{f'}(S))^2    (29)
        = ||v_f - v_{f'}||_2^2,    (30)

where v_f is the real-valued vector with coordinates indexed by the sets S with |S| > α(ε, n), given by (v_f)_S = \hat{f}(S), and v_{f'} is defined similarly. Using inequality (27), we get ||v_f - v_{f'}||_2 ≤ 2√ε. In addition, since f' has α(ε, n)-Fourier concentration, we have ||v_{f'}||_2 ≤ √ε. Now, using the fact that ||u - w||_2 ≥ ||u||_2 - ||w||_2 for any pair of vectors u, w, it follows that

    2√ε + √ε ≥ ||v_f - v_{f'}||_2 + ||v_{f'}||_2 ≥ ||v_f||_2,

which is equivalent to ||v_f||_2^2 ≤ 9ε. In other words,

    \sum_{|S| > α(ε,n)} \hat{f}(S)^2 ≤ 9ε.    (31)

Therefore, it follows from Lemmas 13 and 14 that to obtain Theorem 11 it is enough to prove the following proposition:

Proposition 15. Let f : {-1,1}^n → {-1,1} be a Boolean function computed by a width-w DNF. Then α_f(ε, n) = O(w log(1/ε)).

To achieve this, we first study how a width-w DNF simplifies under a random restriction of its input bits.

Definition 16. A restriction is a pair (I, Z) with I ⊆ [n] and Z ∈ {-1,1}^{Ī}, where Ī = [n] \ I.

Definition 17. For f : {-1,1}^n → R, we write f_{I|Z} to denote the (I, Z)-restricted version of f, that is, the function f_{I|Z} : {-1,1}^I → R obtained from f by setting the variables in Ī to Z.

For example, if f(x_1, x_2, ..., x_5) : {-1,1}^5 → R is a real-valued function and (I, Z) is a restriction with I = {2, 3} and Z = (-1, 1, 1), then f_{I|Z} represents the function f(-1, x_2, x_3, 1, 1).
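
Restrictions are easy to experiment with. The Python sketch below is my own illustration of Definitions 16 and 17 (with 0-indexed variables, whereas the notes index variables from 1) and reproduces the shape of the example above, fixing x_1 = -1, x_4 = 1, x_5 = 1 and leaving x_2, x_3 free; the particular f is an arbitrary choice.

# Build the restricted function f_{I|Z} from f by fixing the variables outside I (sketch).
def restrict(f, n, I, Z):
    """Return g on the free variables I (in increasing order); Z assigns the variables outside I, in order."""
    I = sorted(I)
    I_bar = [i for i in range(n) if i not in I]
    fixed = dict(zip(I_bar, Z))
    def g(y):                                   # y gives values for the free variables, in the order of I
        x = [0] * n
        for j, i in enumerate(I):
            x[i] = y[j]
        for i, b in fixed.items():
            x[i] = b
        return f(x)
    return g

f = lambda x: x[0] * x[1] + x[2] * x[3] * x[4]  # an arbitrary real-valued f on {-1,1}^5
g = restrict(f, 5, I=[1, 2], Z=(-1, 1, 1))      # 0-indexed: x[0], x[3], x[4] are fixed to -1, 1, 1
print(g((1, -1)))                               # both prints give f(-1, 1, -1, 1, 1) = -2
print(f([-1, 1, -1, 1, 1]))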

Definition 18. A random restriction with ∗-probability ρ (also called a ρ-random restriction) is a pair (I, Z) chosen as follows:

- For each i ∈ [n], put i in I independently with probability ρ.
- Pick Z ∈ {-1,1}^{Ī} uniformly at random.

It is clear that functions simplify under a restriction. For a width-w DNF, we can actually show that the function simplifies a lot.

Theorem 19 (Håstad's Switching Lemma). Let f : {-1,1}^n → {-1,1} be a width-w DNF. Then for (I, Z) a ρ-random restriction,

    Pr_{(I,Z)}[DT-depth(f_{I|Z}) > d] ≤ (5ρw)^d.    (32)

For example, if ρ = 1/(10w), we get that

    Pr_{(I,Z)}[DT-depth(f_{I|Z}) > d] ≤ 2^{-d}.    (33)

The proof of this theorem is beyond the scope of these notes. The Switching Lemma is a powerful tool for our purposes because we understand the Fourier concentration of decision trees quite well. In particular, we know by Lemma 4 that depth-d decision trees are concentrated on coefficients with |S| ≤ d. Intuitively, we want to argue that since the restricted function has good Fourier concentration, the original width-w DNF must have some concentration as well. To formalize this argument we introduce additional notation.

Definition 20. Let f : {-1,1}^n → R be any function and let S ⊆ I ⊆ [n]. We define F_{S⊆I} : {-1,1}^{Ī} → R as follows:

    F_{S⊆I}(Z) = \widehat{f_{I|Z}}(S).    (34)

Proposition 21. Let f : {-1,1}^n → R and suppose S ⊆ I ⊆ [n]. Then

    \widehat{F_{S⊆I}}(T) = \hat{f}(S ∪ T)    (35)

for any set T ⊆ Ī.

Proof.

    \widehat{F_{S⊆I}}(T) = E_{Z ∈ {-1,1}^{Ī}}[F_{S⊆I}(Z) χ_T(Z)]    (36)
        = E_{Z ∈ {-1,1}^{Ī}}[\widehat{f_{I|Z}}(S) χ_T(Z)]    (37)
        = E_{Z ∈ {-1,1}^{Ī}}[ E_{Y ∈ {-1,1}^{I}}[f_{I|Z}(Y) χ_S(Y)] χ_T(Z) ]    (38)
        = E_{X ∈ {-1,1}^n}[f(X) χ_S(X) χ_T(X)]    (39)
        = E_{X ∈ {-1,1}^n}[f(X) χ_{S∪T}(X)]    (since S ∩ T = ∅)    (40)
        = \hat{f}(S ∪ T).    (41)

The proof of Proposition 15 will be completed in the next lecture.
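
Proposition 21 is also easy to verify numerically on small examples. The brute-force Python check below (my own illustration; the helper names are hypothetical) compares \widehat{F_{S⊆I}}(T) with \hat{f}(S ∪ T) for a small real-valued f and two choices of S ⊆ I and T ⊆ Ī.

# Brute-force check of Proposition 21 on a 4-variable example (illustrative sketch).
from itertools import product

def prod_chi(x, idxs):
    p = 1
    for k in idxs:
        p *= x[k]
    return p

def coeff(g, vars_, S):
    """g_hat(S) for g given as a function of the ordered variable list vars_; S is a subset of vars_."""
    pts = list(product((-1, 1), repeat=len(vars_)))
    pos = {v: k for k, v in enumerate(vars_)}
    return sum(g(x) * prod_chi(x, [pos[v] for v in S]) for x in pts) / len(pts)

n = 4
f = lambda x: x[0] * x[3] + 0.5 * x[1] * x[2]    # coefficients straddle I and its complement
I = [0, 1]
I_bar = [i for i in range(n) if i not in I]      # here [2, 3]

def F(S):                                        # the map Z |-> (f_{I|Z})_hat(S) from Definition 20
    def inner(Z):
        def f_restricted(Y):                     # Y assigns the variables in I, Z those in I_bar
            x = [0] * n
            for k, i in enumerate(I):
                x[i] = Y[k]
            for k, i in enumerate(I_bar):
                x[i] = Z[k]
            return f(x)
        return coeff(f_restricted, I, S)
    return inner

for S, T in (([0], [3]), ([1], [2])):            # S inside I, T inside I_bar
    print(coeff(F(S), I_bar, T), coeff(f, list(range(n)), S + T))   # each pair agrees: 1.0 1.0, then 0.5 0.5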