Analysis of Boolean Functions (CMU 18-859S, Spring 2007)
Lecture 10: Learning DNF, AC^0, Juntas
Feb 5, 2007
Lecturer: Ryan O'Donnell        Scribe: Elaine Shi

1 Learning DNF in Almost Polynomial Time

From previous lectures, we have learned that if a function $f$ is $\epsilon$-concentrated on some collection $\mathcal{S}$, then we can learn the function using membership queries in $\mathrm{poly}(|\mathcal{S}|, 1/\epsilon) \cdot \mathrm{poly}(n) \cdot \log(1/\delta)$ time. In the last lecture, we showed that a DNF of width $w$ is $\epsilon$-concentrated on a collection of size $n^{O(w \log(1/\epsilon))}$, and concluded that width-$w$ DNFs are learnable in time $n^{O(w \log(1/\epsilon))}$. Today, we shall improve this bound by showing that a DNF of width $w$ is $\epsilon$-concentrated on a collection of size $w^{O(w \log(1/\epsilon))}$. We shall hence conclude that $\mathrm{poly}(n)$-size DNFs are learnable in almost polynomial time.

Recall that in the last lecture we introduced Håstad's Switching Lemma, and we showed that DNFs of width $w$ are $\epsilon$-concentrated on degrees up to $O(w \log(1/\epsilon))$.

Theorem 1.1 (Håstad's Switching Lemma) Let $f$ be computable by a width-$w$ DNF. If $(I, X)$ is a random restriction with $*$-probability $\rho$, and $f_{I|X}$ denotes $f$ with the coordinates outside $I$ fixed according to $X$, then for all $d \in \mathbb{N}$,
$$\Pr_{I,X}\big[\text{DT-depth}(f_{I|X}) \ge d\big] \le (5 \rho w)^d.$$

Theorem 1.2 If $f$ is a width-$w$ DNF, then
$$\sum_{|U| \ge O(w \log(1/\epsilon))} \hat{f}(U)^2 \le \epsilon.$$

To show that a DNF of width $w$ is $\epsilon$-concentrated on a collection of size $w^{O(w \log(1/\epsilon))}$, we also need the following theorem:

Theorem 1.3 If $f$ is a width-$w$ DNF, then
$$\sum_{U} \Big(\frac{1}{20w}\Big)^{|U|} |\hat{f}(U)| \le 2.$$

Proof: Let $(I, X)$ be a random restriction with $*$-probability $\frac{1}{20w}$. After this restriction, the DNF becomes an $O(1)$-depth decision tree with high probability. Due to Håstad's Switching Lemma, we have the following:
$$\mathrm{E}_{I,X}\big[\|\widehat{f_{I|X}}\|_1\big] = \sum_{d=0}^{n} \Pr_{I,X}\big[\text{DT-depth}(f_{I|X}) = d\big] \cdot \mathrm{E}_{I,X}\big[\|\widehat{f_{I|X}}\|_1 \;\big|\; \text{DT-depth}(f_{I|X}) = d\big] \le \sum_{d=0}^{\infty} \Big(5 \cdot \frac{1}{20w} \cdot w\Big)^d 2^d = \sum_{d=0}^{\infty} 2^{-d} = 2,$$
where we used Håstad's Switching Lemma for the probability and the fact that a decision tree of size $s$ has $L_1$ Fourier norm at most $s$ (a depth-$d$ tree has size at most $2^d$).

In addition, we have
$$\mathrm{E}_{I,X}\big[\|\widehat{f_{I|X}}\|_1\big] = \mathrm{E}_I \, \mathrm{E}_X \sum_{S \subseteq I} \big|\widehat{f_{I|X}}(S)\big| = \mathrm{E}_I \sum_{S \subseteq I} \mathrm{E}_X \Big[\big|\mathrm{E}_y[f_{I|X}(y)\, y_S]\big|\Big] \ge \mathrm{E}_I \sum_{S \subseteq I} \big|\mathrm{E}_X \mathrm{E}_y[f(y, X)\, y_S]\big| = \mathrm{E}_I \sum_{S \subseteq I} |\hat{f}(S)| = \sum_{U \subseteq [n]} |\hat{f}(U)| \cdot \Pr_I[U \subseteq I] = \sum_{U \subseteq [n]} \Big(\frac{1}{20w}\Big)^{|U|} |\hat{f}(U)|.$$
Combining the two displays proves the theorem.

Corollary 1.4 If $f$ is a width-$w$ DNF, then
$$\sum_{|U| \le O(w \log(1/\epsilon))} |\hat{f}(U)| \le w^{O(w \log(1/\epsilon))}.$$

Proof:
$$2 \ge \sum_{U \subseteq [n]} \Big(\frac{1}{20w}\Big)^{|U|} |\hat{f}(U)| \ge \Big(\frac{1}{20w}\Big)^{O(w \log(1/\epsilon))} \sum_{|U| \le O(w \log(1/\epsilon))} |\hat{f}(U)|,$$
so
$$\sum_{|U| \le O(w \log(1/\epsilon))} |\hat{f}(U)| \le 2 \, (20w)^{O(w \log(1/\epsilon))} = w^{O(w \log(1/\epsilon))}.$$

Corollary 1.5 If $f$ is a width-$w$ DNF, it is $\epsilon$-concentrated on a collection of size $w^{O(w \log(1/\epsilon))}$.

Proof: Define
$$\mathcal{S} = \Big\{ U : |U| \le O(w \log(1/\epsilon)), \; |\hat{f}(U)| \ge \frac{\epsilon}{w^{O(w \log(1/\epsilon))}} \Big\}.$$
By Parseval, we get that $|\mathcal{S}| \le w^{O(w \log(1/\epsilon))}$ (the extra $\mathrm{poly}(1/\epsilon)$ factor is absorbed into the exponent). We now show that $f$ is $2\epsilon$-concentrated on $\mathcal{S}$. By Theorem 1.2, we know that
$$\sum_{U \notin \mathcal{S}, \; |U| \ge O(w \log(1/\epsilon))} \hat{f}(U)^2 \le \epsilon.$$
By Corollary 1.4, we have
$$\sum_{U \notin \mathcal{S}, \; |U| \le O(w \log(1/\epsilon))} \hat{f}(U)^2 \le \Big( \sum_{|U| \le O(w \log(1/\epsilon))} |\hat{f}(U)| \Big) \cdot \max_{U \notin \mathcal{S}, \; |U| \le O(w \log(1/\epsilon))} |\hat{f}(U)| \le w^{O(w \log(1/\epsilon))} \cdot \frac{\epsilon}{w^{O(w \log(1/\epsilon))}} = \epsilon.$$
Therefore, $f$ is $2\epsilon$-concentrated on $\mathcal{S}$.

Corollary 1.6 $\mathrm{poly}(n)$-size DNFs are $\epsilon$-concentrated on collections of size
$$\Big(\log \frac{n}{\epsilon}\Big)^{O(\log \frac{n}{\epsilon} \cdot \log \frac{1}{\epsilon})} = \Big(\frac{n}{\epsilon}\Big)^{O(\log\log \frac{n}{\epsilon} \cdot \log \frac{1}{\epsilon})},$$
and if $\epsilon = \Theta(1)$, then the above is $n^{O(\log\log n)}$. Note that this uses the fact that size-$s$ DNF formulas are $\epsilon$-close to width-$\log(s/\epsilon)$ DNFs.

An open research problem is the following question: are $\mathrm{poly}(n)$-size DNFs $\epsilon$-concentrated on a collection of size $\mathrm{poly}(n)$, assuming that $\epsilon = \Theta(1)$?
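As an aside, here is what the learning step referred to at the start of this section looks like concretely. The Python sketch below is not part of the original notes; it assumes the collection $\mathcal{S}$ is given explicitly as a list of tuples of coordinates and that we have query access to $f$ as a callable on $\pm 1$ vectors, and it simply estimates each $\hat{f}(U)$ for $U \in \mathcal{S}$ empirically and outputs the hypothesis $\mathrm{sign}\big(\sum_{U \in \mathcal{S}} \tilde{a}_U \chi_U\big)$. The function names and the per-coefficient sample count are illustrative assumptions. (The membership-query statement quoted above is stronger in that $\mathcal{S}$ need not be known in advance, but the estimation step is the same.)

import random

def chi(U, x):
    """Character chi_U(x) = prod_{i in U} x_i for x in {-1,+1}^n."""
    value = 1
    for i in U:
        value *= x[i]
    return value

def estimate_coefficient(f, n, U, m):
    """Estimate hat{f}(U) = E_x[f(x) chi_U(x)] by an empirical average
    over m uniformly random queries to the oracle f."""
    total = 0
    for _ in range(m):
        x = [random.choice([-1, 1]) for _ in range(n)]
        total += f(x) * chi(U, x)
    return total / m

def learn_from_collection(f, n, S, eps):
    """Given a collection S on which f is eps-concentrated, estimate every
    coefficient in S and return the hypothesis sign(sum_U a_U chi_U(x)).
    A Chernoff bound says O(|S|/eps * log(|S|/delta)) samples per coefficient
    suffice; the concrete count below is only illustrative."""
    m = int(8 * len(S) / eps) + 1
    coeffs = {tuple(U): estimate_coefficient(f, n, U, m) for U in S}
    def hypothesis(x):
        value = sum(a * chi(U, x) for U, a in coeffs.items())
        return 1 if value >= 0 else -1
    return hypothesis

Taking $\mathcal{S}$ to be all sets of size at most $O(w \log(1/\epsilon))$ recovers last lecture's $n^{O(w \log(1/\epsilon))}$ bound; Corollary 1.5 says a much smaller collection suffices for width-$w$ DNFs.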

2 Learning AC^0

We now study how to learn polynomial-size, constant-depth circuits, i.e., AC^0. Consider circuits with unbounded fan-in AND, OR, and NOT gates; the size of a circuit is defined as the number of AND/OR gates. Observe the following fact:

Fact 2.1 Let $d$ denote the depth of the circuit. At the expense of a factor of $d$ in size, these circuits can be taken to be layered. Here layered means that each layer consists of gates of a single type, either AND or OR, and adjacent layers contain gates of the opposite type.

In a layered circuit, the number of layers is the depth of the circuit, and we define the bottom fan-in of a layered circuit to be the maximum fan-in at the lowest layer (the layer closest to the inputs).

Theorem 2.2 (LMN) Let $f$ be computable by a circuit of size $s$, depth $D$, and bottom fan-in $w$. Then
$$\sum_{|U| \ge (10w)^D} \hat{f}(U)^2 \le O(s \cdot 2^{-w}).$$

Before we show a proof of the LMN theorem, let us first look at some corollaries implied by it.

Corollary 2.3 If $f$ has a size-$s$, depth-$D$ circuit, then
$$\sum_{|U| \ge [O(\log(s/\epsilon))]^D} \hat{f}(U)^2 \le \epsilon.$$

Proof: Notice that such an $f$ is $\epsilon$-close to a similar circuit with bottom fan-in $\log(s/\epsilon)$; now apply Theorem 2.2 to that circuit (the constants absorb the loss from the approximation).

Corollary 2.4 AC^0, i.e., the class of poly-size, constant-depth circuits, is learnable from random examples in time $n^{\mathrm{poly}(\log(n/\epsilon))}$; here the circuit size and the number of inputs are both polynomially related to $n$.

Proof: According to Corollary 2.3, AC^0 circuits are $\epsilon$-concentrated on degrees up to $\mathrm{poly}(\log(n/\epsilon))$, hence on a collection of size $n^{\mathrm{poly}(\log(n/\epsilon))}$, and concentration on low degrees can be exploited using random examples alone.

Corollary 2.5 If $f$ has a size-$s$, depth-$D$ circuit, then $\mathrm{I}(f) \le [O(\log s)]^D$.

Remark 2.6 Due to Håstad, the above bound can be improved to $[O(\log s)]^{D-1}$.

Proof (sketch): Define $F(\tau) = \sum_{|U| > \tau} \hat{f}(U)^2$, and write $T = [O(\log s)]^D$. Recall that
$$\mathrm{I}(f) = \sum_{U} |U| \, \hat{f}(U)^2 = \sum_{r=0}^{n-1} F(r) = \sum_{r=0}^{T-1} F(r) + \sum_{r=T}^{n-1} F(r) \le T + \sum_{r \ge T} F(r),$$
using $F(r) \le 1$ for the first sum. It remains to show that $\sum_{r \ge T} F(r) \le [O(\log s)]^D$. By Corollary 2.3,
$$F(\tau) = \sum_{|U| > \tau} \hat{f}(U)^2 \le s \cdot 2^{-\Omega(\tau^{1/D})}.$$
Using this fact plus some manipulation, it is not hard to show that $\sum_{r \ge T} F(r) \le [O(\log s)]^D$.
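For completeness, here is one way to carry out that manipulation; this derivation is not in the original notes, $c$ denotes the unspecified constant hidden in the $\Omega(\cdot)$ above, and we take $T = [C \log s]^D$ for a sufficiently large constant $C$:
$$\sum_{r \ge T} F(r) \;\le\; \sum_{r \ge T} s \, 2^{-c \, r^{1/D}} \;\le\; \int_{T-1}^{\infty} s \, 2^{-c \, t^{1/D}} \, dt \;=\; s \int_{(T-1)^{1/D}}^{\infty} D \, u^{D-1} \, 2^{-c u} \, du \;\le\; s \cdot O\big((C \log s)^{D-1}\big) \cdot 2^{-\Omega(C \log s)},$$
using the substitution $t = u^D$. Choosing $C$ large enough compared with $c$ and $D$ makes the factor $2^{-\Omega(C \log s)} = s^{-\Omega(C)}$ dominate the leading $s \cdot \mathrm{poly}(\log s)$, so the tail is $O(1)$, which is certainly at most $[O(\log s)]^D$.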

Using this bound on total influence, we can derive the following two corollaries.

Corollary 2.7 $\mathrm{Parity} \notin \mathrm{AC}^0$ and $\mathrm{Majority} \notin \mathrm{AC}^0$.

Proof: $\mathrm{I}(\mathrm{Parity}) = n$ and $\mathrm{I}(\mathrm{Majority}) = \Theta(\sqrt{n})$, whereas by Corollary 2.5 any function computed by a poly-size, constant-depth circuit has total influence at most $\mathrm{polylog}(n)$.

Definition 2.8 (PRFG) A function $f : \{-1,1\}^m \times \{-1,1\}^n \to \{-1,1\}$ is a pseudo-random function generator (PRFG) if for every probabilistic polynomial-time (p.p.t.) algorithm $A$ with oracle access to a function,
$$\Big| \Pr_{S \in \{-1,1\}^m, \; A\text{'s random bits}}\big[A(f(S, \cdot)) = \mathrm{YES}\big] \;-\; \Pr_{g, \; A\text{'s random bits}}\big[A(g) = \mathrm{YES}\big] \Big| \le n^{-\omega(1)},$$
where $g$ is a function picked uniformly at random from all functions $\{h \mid h : \{-1,1\}^n \to \{-1,1\}\}$.

Corollary 2.9 Pseudo-random function generators cannot be computed in AC^0.

Proof: Suppose that $f(S, \cdot) \in \mathrm{AC}^0$ for every seed $S$. Then we can construct a p.p.t. adversary $A$ that tells $f$ apart from a truly random function with non-negligible probability. The adversary $A$ picks $X \in \{-1,1\}^n$ and $i \in [n]$ uniformly at random; given an oracle for $g$, it queries $g$ at $X$ and at $X^{\oplus i}$ ($X$ with the $i$-th coordinate flipped), and outputs YES iff $g(X) \neq g(X^{\oplus i})$. If $g$ is truly random, then $\Pr[A \text{ outputs YES}] = 1/2$; if $g \in \mathrm{AC}^0$, then $\Pr[A \text{ outputs YES}] = \mathrm{I}(g)/n \le \mathrm{polylog}(n)/n$.
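The adversary in this proof is simple enough to write out. The Python sketch below is not from the original notes: the oracle $g$ is assumed to be a callable on lists of $\pm 1$ values, and while the proof's adversary makes a single comparison, the sketch repeats the test a few times purely to make the $1/2$ versus $\mathrm{polylog}(n)/n$ gap visible.

import random

def sensitivity_distinguisher(g, n, trials=1000):
    """Adversary A from Corollary 2.9: query the oracle g at a random point X
    and at X with one random coordinate flipped, and check whether the answers
    differ.  For a truly random g this happens with probability 1/2; for g
    computable in AC^0 it happens with probability I(g)/n <= polylog(n)/n."""
    differ = 0
    for _ in range(trials):
        x = [random.choice([-1, 1]) for _ in range(n)]
        i = random.randrange(n)
        y = list(x)
        y[i] = -y[i]          # X with the i-th coordinate flipped
        if g(x) != g(y):
            differ += 1
    # Declare "looks random" if the answers differed on a constant fraction of
    # the trials (any threshold strictly between polylog(n)/n and 1/2 works).
    return "YES" if differ / trials > 0.25 else "NO"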

After seeing all these applications of the LMN theorem, we now show how to prove it. To prove the LMN theorem, we need the following tools.

Observation 2.10 A depth-$w$ decision tree (DT) is expressible as a width-$w$ DNF or as a width-$w$ CNF.

Lemma 2.11 Let $f : \{-1,1\}^n \to \{-1,1\}$, and let $(I, X)$ be a random restriction with $*$-probability $\rho$. Then for all $d \ge 5$,
$$\sum_{|U| \ge 2d/\rho} \hat{f}(U)^2 \le 2 \Pr_{(I,X)}\big[\text{DT-depth}(f_{I|X}) > d\big].$$

Now using Lemma 2.11, we can prove the LMN theorem.

Proof (of Theorem 2.2): Let $(I, X)$ be a random restriction with $*$-probability $\big(\frac{1}{10w}\big)^D$.

Claim 2.12
$$\Pr_{I,X}\big[\text{DT-depth}(f_{I|X}) > w\big] \le s \cdot 2^{-w}.$$

The above claim, in combination with Lemma 2.11, completes the proof (up to the precise constants). We now show why the claim is true. Observe that choosing a random restriction with $*$-probability $\big(\frac{1}{10w}\big)^D$ can be viewed as the following process:

- First choose a random restriction with $*$-probability $\frac{1}{10w}$.
- Then choose a further random restriction, with $*$-probability $\frac{1}{10w}$, on the surviving (still free) variables.
- Repeat the above, for $D$ rounds in total.

After the first round, due to Håstad's Switching Lemma, any particular level-2 subcircuit (a bottom-level DNF or CNF of width at most $w$) fails to turn into a depth-$w$ decision tree with probability at most
$$\Big(5w \cdot \frac{1}{10w}\Big)^w = 2^{-w}.$$
Due to Observation 2.10, we can express a depth-$w$ DT as a width-$w$ DNF or as a width-$w$ CNF. Using this, we rewrite each bottom-level subcircuit in the form opposite to what it was before, i.e., each CNF as a DNF and each DNF as a CNF; this succeeds except with probability $2^{-w} \cdot (\text{number of level-2 gates})$. Now the 2nd-lowest layer and the 3rd-lowest layer have the same type of gates, so we can collapse them into a single layer. Observe that this operation preserves the bottom fan-in, because the resulting CNFs and DNFs have width $w$. We repeat this operation over the $D$ rounds of restrictions, and the probability that the final result is not a depth-$w$ decision tree is bounded by
$$\Pr[\text{DT-depth of the resulting circuit} > w] \le \big(\text{number of level-2 gates} + \text{number of level-3 gates} + \cdots\big) \cdot 2^{-w} \le s \cdot 2^{-w}.$$

3 Learning Juntas

We now study how to learn juntas. Let $\mathcal{C}_r = \{ f : \{-1,1\}^n \to \{-1,1\} \mid f \text{ is an } r\text{-junta} \}$ denote the class of $r$-juntas. Note that $\mathcal{C}_{\log n} \subseteq \{\text{poly-size DTs}\} \subseteq \{\text{poly-size DNFs}\}$.

Remark 3.1 To learn $\mathcal{C}_r$, it suffices for the algorithm to identify the $r$ relevant variables, since we can then draw $O(r \cdot 2^r)$ additional random examples and, with high probability, learn the entire truth table (of size $2^r$).

Observation 3.2 If $f$ is an $r$-junta, then every Fourier coefficient of $f$ is either $0$ or at least $2^{-r}$ in absolute value. This is immediate from the definition of the Fourier coefficients and the fact that $f$ depends on only $r$ variables: each coefficient is an average of $\pm 1$ values over the $2^r$ settings of the relevant variables, hence an integer multiple of $2^{-r}$.

Fact 3.3 If $\hat{f}(S) \neq 0$, then all variables in $S$ are relevant.

Proof: Suppose $\hat{f}(S) \neq 0$ but some $i \in S$ is irrelevant. By definition, $\hat{f}(S) = \mathrm{E}_x[f(x)\, x_S]$. For any $x$, consider $x^{\oplus i}$: since $i$ is irrelevant we have $f(x) = f(x^{\oplus i})$, while $x_S = -x^{\oplus i}_S$ because $i \in S$, so
$$f(x)\, x_S = -f(x^{\oplus i})\, x^{\oplus i}_S.$$
Therefore everything cancels out in pairs in $\hat{f}(S)$, and $\hat{f}(S) = 0$, a contradiction.

Using the above facts, we give one idea for learning juntas, and show that $r$-juntas are learnable in time $\mathrm{poly}(n, 2^r) \cdot n^r$:

- Estimate $\hat{f}(\emptyset)$.
- Estimate $\hat{f}(S)$ for all $|S| = 1$ up to accuracy $\frac{1}{4} \cdot 2^{-r}$. If we find an $S$ such that $\hat{f}(S) \neq 0$, then we know that all variables in $S$ are relevant. This takes time $\mathrm{poly}(n, 2^r) \cdot \binom{n}{1}$.
- Estimate $\hat{f}(S)$ for all $|S| = 2$ up to accuracy $\frac{1}{4} \cdot 2^{-r}$. If we find an $S$ such that $\hat{f}(S) \neq 0$, then we know that all variables in $S$ are relevant. This takes time $\mathrm{poly}(n, 2^r) \cdot \binom{n}{2}$.
- Do the above for $|S| = 3, 4, \ldots, r$.

Observation 3.4 The above gives us a $\mathrm{poly}(n, 2^r) \cdot n^r$-time algorithm for learning $r$-juntas.
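Here is a concrete sketch (Python, not part of the original notes) of the procedure just described. The oracle $f$ is assumed to be a callable on $\pm 1$ vectors, and m_per_coeff is the per-coefficient sample count, which should be on the order of $2^{2r} \log(\text{number of sets tested})$ so that each estimate is accurate to within $\frac{1}{4} \cdot 2^{-r}$ with high probability; this choice, like the function names, is illustrative.

import itertools
import random

def estimate_coeff(f, n, S, m):
    """Empirical estimate of hat{f}(S) = E_x[f(x) * prod_{i in S} x_i],
    using m uniformly random queries to f."""
    total = 0
    for _ in range(m):
        x = [random.choice([-1, 1]) for _ in range(n)]
        sign = 1
        for i in S:
            sign *= x[i]
        total += f(x) * sign
    return total / m

def find_relevant_variables(f, n, r, m_per_coeff):
    """Brute-force search from Section 3: for |S| = 1, 2, ..., r, estimate
    hat{f}(S).  By Observation 3.2 and Fact 3.3, any S whose estimate exceeds
    2^{-r}/2 in absolute value consists entirely of relevant variables, and
    every relevant variable of an r-junta appears in some such S."""
    relevant = set()
    threshold = 2.0 ** (-r) / 2
    for k in range(1, r + 1):
        for S in itertools.combinations(range(n), k):
            if abs(estimate_coeff(f, n, S, m_per_coeff)) > threshold:
                relevant.update(S)
        if len(relevant) >= r:   # an r-junta has at most r relevant variables
            break
    return relevant

Once the relevant variables are known, Remark 3.1 says that $O(r \cdot 2^r)$ further random examples suffice to fill in the truth table, matching the $\mathrm{poly}(n, 2^r) \cdot n^r$ running time of Observation 3.4.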

Proof: Suppose f(s) 0, but there exists i S irrelevant, according to the definition, f(s) = E x [f(x)x S ]. But for any x, consider x (i), we have: f(x) X S = f(x (i) ) X (i) S Therefore, everything cancels out in f(s), and f(s) = 0. Using the above facts, we give one idea for learning juntas, and show that r-juntas are learnable in time poly(n, 2 r ) n r. Estimate f( ). Estimate f(s) for all S = up to accuracy 2 r. If we find an S such that f(s) 0, then 4 we know that all variables in S are relevant. Note that this takes time poly(n, 2 r ) ( n ). Estimate f(s) for all S = 2 up to accuracy 2 r. If we find an S such that f(s) 0, then 4 we know that all variables in S are relevant. Note that this takes time poly(n, 2 r ) ( n 2). Do the above for S of size 3, 4,...,r. Observation 3.4 The above gives us a poly(n, 2 r ) n r -time algorithm for learning r-juntas. In the next lecture, we shall improve the above result. In particular, we will ask the question, what kind of functions f can have f(s) = 0 for all S d? In particular, if f is not such a function, then by step d in the above algorithm, we will have found a relevant variable. If not, i.e., f(s) = 0 for all S d, we will show an algorithm for learning such functions. 7