Lecture 4: LMN Learning (Part 2)

CS 294-114 Fine-Grained Complexity and Algorithms, Sept 8, 2015
Instructor: Russell Impagliazzo. Scribe: Preetum Nakkiran.

1 Overview

Continuing from last lecture, we will finish presenting the learning algorithm of Linial, Mansour, and Nisan (LMN [1]). This will show that we can approximately learn constant-depth circuits ($\mathrm{AC}^0$) in time $O(n^{\mathrm{polylog}(n)})$, given only oracle access. The algorithm proceeds by simply querying the function on random points and estimating the low-degree Fourier coefficients from these points. For this to be approximately correct, we must show that low-depth circuits can be approximated by low-degree polynomials (i.e., their Fourier spectrum concentrates on low-degree terms). The analysis will proceed roughly as follows:

1. Use Håstad's Switching Lemma and Fourier analysis to show that low-depth circuits don't have much mass in their higher-order Fourier coefficients.

2. Show that approximately learning a function is equivalent to approximately learning its Fourier coefficients.

3. Show that the probabilistic interpretation of Fourier coefficients (as correlations with parities) naturally gives a strategy for estimating the Fourier coefficients by sampling.

The bulk of the analysis is in Step 1. This is related to our previous discussion of circuit lower bounds, roughly because if a circuit has large high-order Fourier coefficients, then it is essentially computing large parities (recall that the Fourier basis consists of parities of subsets). But large parities cannot be computed by low-depth circuits, as we showed previously.

2 Notation

Let $F$ be a Boolean function on $n$ bits:

$F : \{-1,+1\}^n \to \{-1,+1\}$   (1)

For sets $S \subseteq [n]$, let $\hat{\alpha}_S$ denote the Fourier coefficients of $F$, so that $F$ can be written as the multilinear polynomial

$F(x) = \sum_{S \subseteq [n]} \hat{\alpha}_S \prod_{i \in S} x_i$   (2)

Let $\chi_S$ denote the Fourier basis functions,

$\chi_S(x) = \prod_{i \in S} x_i$   (3)

so equivalently

$F(x) = \sum_{S \subseteq [n]} \hat{\alpha}_S \chi_S(x)$   (4)
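To make the notation concrete, the following minimal Python sketch (an illustration, not part of the original notes) computes all Fourier coefficients of a small Boolean function by brute force, averaging $F(x)\chi_S(x)$ exactly over the $2^n$ points of the cube (the inversion formula (5) below). The conventions match those above, with $-1$ playing the role of True.

    from itertools import combinations, product
    from math import prod

    def fourier_coefficients(F, n):
        # alpha_S = E_x[F(x) * chi_S(x)], averaged exactly over all 2^n inputs.
        # Exponential time: only feasible for small n.
        cube = list(product([-1, 1], repeat=n))
        coeffs = {}
        for k in range(n + 1):
            for S in combinations(range(n), k):
                coeffs[S] = sum(F(x) * prod(x[i] for i in S) for x in cube) / len(cube)
        return coeffs

    # Example: AND of 2 bits (in the +/-1 convention, -1 = True).
    AND = lambda x: -1 if x == (-1, -1) else 1
    print(fourier_coefficients(AND, 2))
    # {(): 0.5, (0,): 0.5, (1,): 0.5, (0, 1): -0.5}

The printed coefficients say exactly that $\mathrm{AND}(x) = \frac{1}{2} + \frac{1}{2}x_0 + \frac{1}{2}x_1 - \frac{1}{2}x_0 x_1$, a multilinear polynomial of the form (2).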

Recall that since the Fourier basis is orthonormal (with respect to the inner product $\langle f, g \rangle = \mathbb{E}_x[f(x)g(x)]$), we have the simple Fourier inversion formula:

$\hat{\alpha}_S = \mathbb{E}_x[F(x)\chi_S(x)]$   (5)

And since both $F$ and $\chi_S$ are $(\pm 1)$-valued, this correlation can be written in terms of the probability that they agree on a random input:

$\hat{\alpha}_S = \mathbb{E}_x[F(x)\chi_S(x)] = \Pr_x[F(x) = \chi_S(x)] - \Pr_x[F(x) \neq \chi_S(x)]$   (6)

3 Fourier Spectrum of Low-Depth Circuits

Here we show that low-depth circuits don't have much mass in their higher-order Fourier coefficients. Instead of directly analyzing the Fourier representation of circuits, we will use Håstad's Switching Lemma to consider a random restriction, which w.h.p. yields a shallow decision tree (and therefore a function with no high-degree Fourier coefficients). We know how to relate the Fourier coefficients of a function before and after a random restriction, so this will allow us to bound the Fourier spectrum of the original function. Last lecture, we proved the following lemma:

Lemma 1. If $\rho$ is a random restriction of $F$ leaving $pn$ variables unset, then

$\mathbb{E}_\rho\left[\sum_{T : |T| > D} (\hat{\alpha}'_T)^2\right] \ge \Omega\left(\sum_{T : |T| \ge O(D/p)} (\hat{\alpha}_T)^2\right)$   (7)

where $\hat{\alpha}_T$ are the Fourier coefficients of $F$, and $\hat{\alpha}'_T$ are the coefficients of $F_\rho$.

By the above lemma, it suffices to show that after a random restriction, the higher-order Fourier coefficients are together very small w.h.p.; the original higher-order Fourier coefficients must then also be small. We will use the following lemma:

Lemma 2 (Håstad's Switching Lemma). If $F$ has a depth-$d$, size-$s$ circuit ($\wedge$, $\vee$ gates of unbounded fan-in), and $D \ge \log s$, $p = O(1/D^{d-1})$, then the random restriction $F_\rho$ has low decision-tree depth w.h.p.:

$\Pr_\rho\left[F_\rho \text{ has decision-tree depth} > D\right] \le 2^{-D}$   (8)

We showed last time that a depth-$D$ decision tree can be represented (uniquely) by a degree-$D$ multilinear polynomial; in particular, a depth-$D$ decision tree has no nonzero Fourier coefficients $\hat{\alpha}_S$ with $|S| > D$. Therefore:

Corollary 1.

$\Pr_\rho\left[F_\rho \text{ has any nonzero } \hat{\alpha}'_S \text{ with } |S| > D\right] \le 2^{-D}$   (9)
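As an aside, the switching-lemma phenomenon is easy to observe experimentally. The sketch below (illustrative only; the function and parameter choices are arbitrary, not from the notes) applies a random restriction with survival probability $p$ to a small DNF and reports the exact Fourier degree of the restricted function, which is typically much smaller than that of the original.

    import random
    from itertools import combinations, product
    from math import prod

    def random_restriction(F, n, p):
        # Leave each variable free with probability p; fix the rest to uniform +/-1.
        free = [i for i in range(n) if random.random() < p]
        fixed = {i: random.choice([-1, 1]) for i in range(n) if i not in free}
        def F_rho(y):  # y assigns +/-1 values to the free variables, in order
            x = [0] * n
            for j, i in enumerate(free):
                x[i] = y[j]
            for i, b in fixed.items():
                x[i] = b
            return F(tuple(x))
        return F_rho, free

    def fourier_degree(F, n):
        # Largest |S| with a nonzero coefficient, computed exactly over the cube.
        cube = list(product([-1, 1], repeat=n))
        deg = 0
        for k in range(n + 1):
            for S in combinations(range(n), k):
                if abs(sum(F(x) * prod(x[i] for i in S) for x in cube)) > 1e-9 * len(cube):
                    deg = k
        return deg

    # A width-2 DNF on 4 variables (-1 = True); its unrestricted degree is 4.
    dnf = lambda x: -1 if (x[0] == x[1] == -1) or (x[2] == x[3] == -1) else 1
    F_rho, free = random_restriction(dnf, 4, p=0.5)
    print(len(free), "free variables; restricted degree =", fourier_degree(F_rho, len(free)))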

Moreover, the expected high-order Fourier mass of the restricted function is small:

Corollary 2.

$\mathbb{E}_\rho\left[\sum_{T : |T| > D} (\hat{\alpha}'_T)^2\right] \le 2^{-D}$   (10)

Proof. In the $\le 2^{-D}$ fraction of cases for which large coefficients exist (by Corollary 1), the sum satisfies $\sum_{T : |T| > D} (\hat{\alpha}'_T)^2 \le \sum_T (\hat{\alpha}'_T)^2 = 1$ by Parseval's identity; in all other cases it is 0.

Combining these, we can bound the higher-order Fourier mass of the original function $F$:

Corollary 3.

$\sum_{T : |T| \ge O(D^d)} (\hat{\alpha}_T)^2 \le O(2^{-D})$   (11)

Proof. Combine Lemma 1 with $p = O(1/D^{d-1})$ (so that $D/p = D^d$) and Corollary 2:

$\Omega\left(\sum_{T : |T| \ge O(D^d)} (\hat{\alpha}_T)^2\right) \le \mathbb{E}_\rho\left[\sum_{T : |T| > D} (\hat{\alpha}'_T)^2\right] \le O(2^{-D})$   (12)

Notice that a priori we might only expect the higher-order Fourier coefficients of $F$ to be individually small, but we have in fact shown the stronger result that they are collectively small. (Intuitively this should be true, because a function can't be simultaneously correlated with many large parities.)

We have shown that low-depth circuits have only a small amount of mass in their high-order Fourier coefficients. But can we simply ignore these coefficients when learning the function? Next we will show that we essentially can.

4 Approximate Learning in Fourier Domain

Here we will show that approximately learning a function is equivalent to approximately learning its Fourier coefficients. Thus, two functions whose Fourier coefficients largely agree will also largely agree as functions. One issue is that once we start approximating Fourier coefficients, the resulting function may not take values strictly in $\{\pm 1\}$. Thus, in order to analyze this, we need to extend our Boolean functions to $\mathbb{R}$. We have previously been writing functions as multilinear polynomials over the Boolean hypercube, but we may simply extend them to polynomials over $\mathbb{R}$.

Lemma 3. Let $f, g$ be multilinear polynomials over $\mathbb{R}$:

$f(x) = \sum_{S \subseteq [n]} \beta_S \prod_{i \in S} x_i, \qquad g(x) = \sum_{S \subseteq [n]} \gamma_S \prod_{i \in S} x_i$

Then

$\mathbb{E}_{x \in \{\pm 1\}^n}\left[(f(x) - g(x))^2\right] = \frac{1}{2^n} \sum_{x \in \{\pm 1\}^n} (f(x) - g(x))^2 = \sum_S (\beta_S - \gamma_S)^2$   (13)

Proof. This is clear, since the Fourier basis is orthonormal (with respect to the inner product $\mathbb{E}$), and orthogonal transforms preserve norm.
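Lemma 3 is just Parseval's identity applied to $f - g$; the following quick numerical check (a sketch under the conventions above, not part of the notes) verifies Eq. (13) for random coefficient vectors.

    import random
    from itertools import combinations, product
    from math import prod

    n = 4
    subsets = [S for k in range(n + 1) for S in combinations(range(n), k)]
    beta = {S: random.uniform(-1, 1) for S in subsets}
    gamma = {S: random.uniform(-1, 1) for S in subsets}

    def evaluate(coeffs, x):
        # Multilinear polynomial: sum_S coeffs[S] * prod_{i in S} x_i.
        return sum(c * prod(x[i] for i in S) for S, c in coeffs.items())

    cube = list(product([-1, 1], repeat=n))
    lhs = sum((evaluate(beta, x) - evaluate(gamma, x)) ** 2 for x in cube) / 2 ** n
    rhs = sum((beta[S] - gamma[S]) ** 2 for S in subsets)
    print(lhs, rhs)  # equal, up to floating-point error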

5 The Learning Algorithm

The main idea of the algorithm (and its analysis) is that approximately learning a Fourier coefficient is easy, since Fourier coefficients are just correlations with parities, which can be estimated by sampling. That is,

$\hat{\alpha}_S = \mathbb{E}_x[F(x)\chi_S(x)] = \Pr_x[F(x) = \chi_S(x)] - \Pr_x[F(x) \neq \chi_S(x)]$   (14)

The goal, given oracle access to some low-depth function $F$, is to learn a hypothesis function $h$ that agrees with $F$ except perhaps on an $\epsilon$-fraction of inputs.

We will need to set some parameters. First, the number of low-degree Fourier coefficients ($\hat{\alpha}_S$ with $|S| \le O(D^d)$) that we are going to try to approximate is

$M = \binom{n}{O(D^d)}$   (15)

We will approximate each of these coefficients to within an additive factor of

$\delta = \left(\frac{\epsilon}{2M}\right)^{1/2}$   (16)

We will approximate the higher-order coefficients as 0. We want the Fourier mass in these remaining terms to be very small (inverse polynomial), so we will set

$D = O(\log s + \log(1/\epsilon))$   (17)

so that $\sum_{T : |T| \ge O(D^d)} (\hat{\alpha}_T)^2 \le O(2^{-D})$ is small (by Corollary 3).

The algorithm is:

1. Get about $\frac{1}{\delta^2} \log\left(\frac{M}{\epsilon}\right)$ random samples $(x_i, F(x_i))$ of the function.

2. For each small monomial $\chi_T$ with $|T| \le O(D^d)$, use the samples to compute (using empirical probabilities)

$\gamma_T = \Pr_x[F(x) = \chi_T(x)] - \Pr_x[F(x) \neq \chi_T(x)]$   (18)

3. Compute an $\mathbb{R}$-valued approximation to $F$:

$g(x) = \sum_{T : |T| \le O(D^d)} \gamma_T \chi_T(x)$   (19)

4. Return the binary-valued hypothesis function

$h = \mathrm{sign}(g)$   (20)
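Putting the pieces together, here is a compact, self-contained Python sketch of the whole algorithm under the conventions above. It is an illustration, not the original authors' code: `oracle`, `degree_bound`, and `num_samples` are hypothetical parameters chosen directly for the toy run, rather than derived from $s$, $d$, and $\epsilon$ via Eqs. (15)-(17).

    import random
    from itertools import combinations
    from math import prod

    def lmn_learn(oracle, n, degree_bound, num_samples):
        # Low-degree algorithm: estimate alpha_S for all |S| <= degree_bound from
        # random samples, then return h = sign(g) for the truncated expansion g.
        samples = []
        for _ in range(num_samples):
            x = tuple(random.choice([-1, 1]) for _ in range(n))
            samples.append((x, oracle(x)))

        gamma = {}
        for k in range(degree_bound + 1):
            for S in combinations(range(n), k):
                # Empirical E[F(x) * chi_S(x)], i.e. Pr[agree] - Pr[disagree] as in (18).
                gamma[S] = sum(fx * prod(x[i] for i in S) for x, fx in samples) / len(samples)

        def h(x):
            g = sum(c * prod(x[i] for i in S) for S, c in gamma.items())
            return 1 if g >= 0 else -1

        return h

    # Toy usage: learn a 2-term DNF on 6 bits (an AC^0 function) up to degree 2.
    target = lambda x: -1 if (x[0] == x[1] == -1) or (x[2] == x[3] == -1) else 1
    h = lmn_learn(target, 6, degree_bound=2, num_samples=20000)
    test = [tuple(random.choice([-1, 1]) for _ in range(6)) for _ in range(2000)]
    err = sum(h(x) != target(x) for x in test) / len(test)
    print(f"empirical error: {err:.3f}")  # typically close to 0

The runtime here is dominated by estimating all $M$ low-degree coefficients from the samples, matching the sample-complexity bound in the Runtime discussion below.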

Analysis

We will show that our hypothesis $h$ agrees with $F$ everywhere except perhaps on an $\epsilon$-fraction of inputs:

$\Pr_x[F(x) \neq h(x)] \le \epsilon$   (21)

First notice that $1/\delta^2$ samples suffice to approximate a single Fourier coefficient to within an additive factor of $\delta$ (by a Chernoff bound). The additional $\log M$ factor in Step 1 allows us to union-bound over all Fourier coefficients, and conclude that all our estimated Fourier coefficients (in Step 2) are approximately correct with high probability:

$\forall T, |T| \le O(D^d): \quad |\gamma_T - \hat{\alpha}_T| \le \delta$ (w.h.p.)

Then, since we estimate the larger coefficients as 0, we can bound our total error in learning the Fourier coefficients:

$\sum_S (\hat{\alpha}_S - \gamma_S)^2 = \sum_{S : |S| \le O(D^d)} (\hat{\alpha}_S - \gamma_S)^2 + \sum_{S : |S| > O(D^d)} \hat{\alpha}_S^2 \le M\delta^2 + O(2^{-D}) \le \epsilon$   (22)

(by our choice of $\delta$ and $D$). Now by Lemma 3, the function $g$ (determined by our estimated Fourier coefficients $\gamma_S$) approximately agrees with the function $F$:

$\mathbb{E}_x[(F(x) - g(x))^2] = \sum_S (\hat{\alpha}_S - \gamma_S)^2 \le \epsilon$   (23)

But $g$ is not binary, so we must bound the error when we take its sign. This is easy:

$\Pr_x[F(x) \neq h(x)] = \Pr_x[F(x) \neq \mathrm{sign}(g(x))]$   (24)
$\quad = \mathbb{E}_x\left[\tfrac{1}{4}(F(x) - \mathrm{sign}(g(x)))^2\right]$ (since both functions are $\pm 1$-valued)
$\quad \le \mathbb{E}_x\left[(F(x) - g(x))^2\right]$
$\quad \le \epsilon$   (25)

where the middle inequality holds pointwise by considering the two cases: if $F(x) = \mathrm{sign}(g(x))$ the left-hand term is 0, and if $F(x) \neq \mathrm{sign}(g(x))$ the left-hand term is 1 while $|F(x) - g(x)| \ge 1$, since $g(x)$ has the opposite sign of $F(x)$ (or is 0). Therefore the hypothesis $h$ agrees with $F$ almost everywhere.

Runtime

The runtime of this algorithm is on the order of its sample complexity:

$\frac{1}{\delta^2} \log\left(\frac{M}{\epsilon}\right) = \mathrm{poly}\left(\frac{M}{\epsilon}\right)$   (26)

By our choice of $M = \binom{n}{O(D^d)}$, this is at most

$n^{O(\log(s/\epsilon))^d}$   (27)

This concludes our analysis of LMN learning.

References

[1] Nathan Linial, Yishay Mansour, and Noam Nisan. Constant depth circuits, Fourier transform, and learnability. Journal of the ACM, 40(3):607–620, 1993.