COMS 6998-3: Sub-Linear Algorithms in Learning and Testing (Spring 2014)
Lecture 5: 02/19/2014
Lecturer: Rocco Servedio
Scribes: Dimitris Paidarakis

1 Last time

Finished the KM algorithm; applications of the KM algorithm: learning decision trees; learning functions with sparse Fourier representation (in particular, k-juntas of parities); started learning monotone Boolean functions (via their influence).

2 Today

Finish learning monotone Boolean functions (using $\mathrm{Inf}_i[f]$) and learning $\mathrm{AC}^0$ circuits (via Fourier concentration on low-degree coefficients; no membership queries); lower bounds for learning monotone Boolean functions; learning $k$-juntas of halfspaces in $\mathrm{poly}((nk/\epsilon)^k)$ time. No Fourier here! (but membership queries).

Relevant readings:

- Mansour [Man94]: Learning Boolean Functions via the Fourier Transform.
- Gopalan, Klivans and Meka [GKM12]: Learning Functions of Halfspaces Using Prefix Covers.

3 Finish learning monotone Boolean functions

Recall:

- $f : \{-1,1\}^n \to \{-1,1\}$ is monotone if $x \preceq y$ implies $f(x) \le f(y)$.
- Influence: $\mathrm{Inf}_i[f] = \Pr_x[f(x^{i \to 1}) \ne f(x^{i \to -1})]$ for $i \in [n]$.
- Total influence: $\mathrm{Inf}[f] = \sum_{i=1}^n \mathrm{Inf}_i[f]$; an important example is $\mathrm{Inf}[\mathrm{MAJ}] \approx \sqrt{n}$ (for the majority function $\mathrm{MAJ}(x) = \mathrm{sign}(x_1 + \cdots + x_n)$).
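To make the definitions above concrete, here is a small brute-force sketch (in Python; the function names and the choice of $n$ are ours, not from the lecture) that computes $\mathrm{Inf}_i[f]$ and $\mathrm{Inf}[f]$ for the majority function by enumerating the hypercube.

```python
from itertools import product

def maj(x):
    # Majority on {-1,1}^n (take n odd so there are no ties).
    return 1 if sum(x) > 0 else -1

def influence(f, n, i):
    # Inf_i[f] = Pr_x[ f(x^{i->1}) != f(x^{i->-1}) ], computed exactly
    # by enumerating all 2^n points of the hypercube.
    flips = 0
    for x in product([-1, 1], repeat=n):
        if f(x[:i] + (1,) + x[i + 1:]) != f(x[:i] + (-1,) + x[i + 1:]):
            flips += 1
    return flips / 2 ** n

n = 5
infs = [influence(maj, n, i) for i in range(n)]
print("Inf_i[MAJ] =", infs)       # every coordinate is equally influential
print("Inf[MAJ]   =", sum(infs))  # grows like Theta(sqrt(n)) as n increases
```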

Also recall that for all Boolean functions $f$, $\mathrm{Inf}_i[f] = \sum_{S \ni i} \hat{f}(S)^2$, and thus $\mathrm{Inf}[f] = \sum_S |S| \hat{f}(S)^2$.

Claim 1. If $f$ is a monotone Boolean function, we have $\mathrm{Inf}_i[f] = \hat{f}(i)$ (where $\hat{f}(i)$ is short for $\hat{f}(\{i\})$).

Proof. Without loss of generality, we consider the case $i = 1$:
$$\mathrm{Inf}_1[f] = \Pr_{x'}\big[f(1x') \ne f(-1x')\big] = \frac{|\{x' \in \{-1,1\}^{n-1} : f(1x') = 1,\ f(-1x') = -1\}|}{2^{n-1}}$$
with the last equality holding because of the monotonicity of $f$. But on the other hand,
$$\hat{f}(1) = \mathbf{E}[f(x)x_1] = \frac{1}{2^n}\sum_{x' \in \{-1,1\}^{n-1}} \big(f(1x') - f(-1x')\big) = \frac{|\{x' \in \{-1,1\}^{n-1} : f(1x') = 1,\ f(-1x') = -1\}|}{2^{n-1}}$$
where the second equality again uses the monotonicity of $f$.

Lemma 2. For any monotone Boolean function $f$, $\mathrm{Inf}[f] \le \mathrm{Inf}[\mathrm{MAJ}] \le \sqrt{n}$.

Proof. A first approach: we can prove that $\mathrm{Inf}[f] \le \sqrt{n}$ using the Cauchy-Schwarz inequality:
$$\mathrm{Inf}[f] = \sum_{i=1}^n \hat{f}(i) \cdot 1 \le \sqrt{n} \cdot \sqrt{\sum_{i=1}^n \hat{f}(i)^2} \le \sqrt{n} \qquad \text{(Cauchy-Schwarz)}$$
since $\sum_{i=1}^n \hat{f}(i)^2 \le \sum_{S \subseteq [n]} \hat{f}(S)^2 = 1$. Or we can show the stronger statement that $\mathrm{Inf}[f] \le \mathrm{Inf}[\mathrm{MAJ}]$:
$$\mathrm{Inf}[f] = \sum_{i=1}^n \hat{f}(i) = \sum_{i=1}^n \mathbf{E}[f(x)x_i] = \mathbf{E}\Big[f(x)\sum_{i=1}^n x_i\Big] = \mathbf{E}[f(x)(x_1 + \cdots + x_n)] \le \mathbf{E}[\mathrm{MAJ}(x)(x_1 + \cdots + x_n)] = \mathrm{Inf}[\mathrm{MAJ}]$$
(where the inequality comes from observing that since $f(x) \in \{-1,1\}$, the quantity $f(x)(x_1 + \cdots + x_n)$ is at most $|x_1 + \cdots + x_n|$, with equality exactly when $f(x) = \mathrm{sign}(x_1 + \cdots + x_n)$, i.e. for $f = \mathrm{MAJ}$).
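Claim 1 is easy to check numerically on small examples. The sketch below (ours; the test function $(x_1 \wedge x_2) \vee x_3$ is just an arbitrary small monotone function) compares $\mathrm{Inf}_i[f]$ with $\hat{f}(\{i\}) = \mathbf{E}[f(x)x_i]$ coordinate by coordinate.

```python
from itertools import product

def monotone_example(x):
    # A small monotone function: (x1 AND x2) OR x3, in the +/-1 convention.
    return 1 if (x[0] == 1 and x[1] == 1) or x[2] == 1 else -1

def influence(f, n, i):
    # Inf_i[f] = Pr_x[ f(x^{i->1}) != f(x^{i->-1}) ].
    return sum(f(x[:i] + (1,) + x[i + 1:]) != f(x[:i] + (-1,) + x[i + 1:])
               for x in product([-1, 1], repeat=n)) / 2 ** n

def fourier_singleton(f, n, i):
    # \hat{f}({i}) = E_x[ f(x) * x_i ] under the uniform distribution.
    return sum(f(x) * x[i] for x in product([-1, 1], repeat=n)) / 2 ** n

n = 3
for i in range(n):
    print(i, influence(monotone_example, n, i),
          fourier_singleton(monotone_example, n, i))
    # The two columns agree, as Claim 1 predicts for monotone f.
```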

We can combine the facts above to obtain Fourier concentration for monotone Boolean functions; more generally:

Theorem 3. Let $f$ be any monotone Boolean function. Then
$$\sum_{|S| \ge \mathrm{Inf}[f]/\epsilon} \hat{f}(S)^2 \le \epsilon.$$

Proof. By contradiction, assume $\sum_{|S| \ge \mathrm{Inf}[f]/\epsilon} \hat{f}(S)^2 > \epsilon$; then
$$\mathrm{Inf}[f] = \sum_{S \subseteq [n]} |S| \hat{f}(S)^2 \ge \sum_{|S| \ge \mathrm{Inf}[f]/\epsilon} |S| \hat{f}(S)^2 \ge \frac{\mathrm{Inf}[f]}{\epsilon} \sum_{|S| \ge \mathrm{Inf}[f]/\epsilon} \hat{f}(S)^2 > \frac{\mathrm{Inf}[f]}{\epsilon} \cdot \epsilon = \mathrm{Inf}[f],$$
leading to a contradiction.

Remark 1. One can also see this proof as an application of Markov's inequality, by viewing the Fourier weights $\hat{f}(S)^2$ as the probability distribution $\hat{f}^2$ they induce over subsets of $[n]$. The theorem can then be rephrased as
$$\Pr_{S \sim \hat{f}^2}\Big[|S| \ge \frac{\mathrm{Inf}[f]}{\epsilon}\Big] \le \frac{\mathbf{E}_{S \sim \hat{f}^2}[|S|]}{\mathrm{Inf}[f]/\epsilon} = \epsilon,$$
since $\mathrm{Inf}[f] = \mathbf{E}_{S \sim \hat{f}^2}[|S|]$.

Corollary 4. Suppose $f : \{-1,1\}^n \to \{-1,1\}$ is monotone. Then $f$ is $\epsilon$-concentrated on $\mathcal{S} \stackrel{\mathrm{def}}{=} \{S \subseteq [n] : |S| \le \sqrt{n}/\epsilon\}$.

4 Lower bounds for learning monotone Boolean functions

As a direct consequence, the LMN algorithm will learn any monotone Boolean function in time $\mathrm{poly}(n^{\sqrt{n}/\epsilon}) = 2^{O(\sqrt{n}(\log n)/\epsilon)}$. While this constitutes a huge saving compared to the general $2^{\Omega(n)}$ bound, it is still a lot! Hence, an immediate question is: can we do better?
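Corollary 4 is exactly what the low-degree ("LMN") approach uses: estimate every Fourier coefficient of degree at most $\sqrt{n}/\epsilon$ from uniform random examples and predict with the sign of the resulting polynomial. A minimal sketch of that approach follows (ours; the sample size and the degree cutoff in the demo are placeholder values, not the constants from the analysis).

```python
import random
from itertools import combinations, product

def chi(S, x):
    # Fourier character chi_S(x) = prod_{i in S} x_i.
    out = 1
    for i in S:
        out *= x[i]
    return out

def low_degree_learn(f, n, degree, num_samples=20000):
    # Estimate \hat{f}(S) for every |S| <= degree from uniform random
    # examples, then output the sign of the estimated low-degree polynomial.
    sample = [tuple(random.choice([-1, 1]) for _ in range(n))
              for _ in range(num_samples)]
    labels = [f(x) for x in sample]
    subsets = [S for d in range(degree + 1)
               for S in combinations(range(n), d)]
    coeff = {S: sum(y * chi(S, x) for x, y in zip(sample, labels)) / num_samples
             for S in subsets}
    def hypothesis(x):
        return 1 if sum(c * chi(S, x) for S, c in coeff.items()) >= 0 else -1
    return hypothesis

# Demo: learn 5-variable majority using degree 3 (roughly sqrt(n)/eps here).
def maj(x): return 1 if sum(x) > 0 else -1
h = low_degree_learn(maj, 5, 3)
err = sum(h(x) != maj(x) for x in product([-1, 1], repeat=5)) / 2 ** 5
print("empirical error of the low-degree hypothesis:", err)
```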

One may first ask whether there is a better analysis of the LMN algorithm for monotone Boolean functions which would yield significantly better performance. However, the answer to this is negative: it is known that there exist monotone Boolean functions $f$ with $\sum_{|S| \ge \sqrt{n}/100} \hat{f}(S)^2 = \Omega(1)$, which implies that no low-degree learning algorithm such as LMN can do better than to deal with the $n^{\Omega(\sqrt{n})}$ Fourier coefficients of degree up to $\Omega(\sqrt{n})$.

Learning to high accuracy. Clearly, the $2^{O(\sqrt{n}(\log n)/\epsilon)}$ bound becomes trivial for $\epsilon \le 1/\sqrt{n}$; hence, this range of accuracy seems like a good regime to look for a lower bound. And indeed, one can show that we cannot efficiently learn monotone Boolean functions to high accuracy, which we do below.

Claim 5. There is a class $\mathcal{C}$ of monotone Boolean functions such that, if the target function $f$ is drawn uniformly from $\mathcal{C}$, then any learning algorithm $A$ making fewer than $\frac{1}{10} \cdot \frac{2^n}{\sqrt{n}}$ membership queries will output a hypothesis $h$ such that $\mathbf{E}[\mathrm{dist}(f, h)] \ge \frac{1}{5\sqrt{n}}$ (where the expectation is taken over the draw of $f$).

Proof. For simplicity consider $n$ even (the case of odd $n$ is similar, up to technicalities). Define $\mathcal{C}$ to be the class of monotone Boolean functions $f$ such that
$$f(x) = \begin{cases} +1 & \text{if } \sum_{i=1}^n x_i > 0 \\ -1 & \text{if } \sum_{i=1}^n x_i < 0 \\ \pm 1 & \text{if } \sum_{i=1}^n x_i = 0 \text{ (arbitrarily)} \end{cases}$$
(Any such $f$ is indeed monotone, since the points of the middle layer are pairwise incomparable.) Equivalently, drawing a function from this class amounts to tossing $\binom{n}{n/2}$ independent fair coins that specify the value of $f$ on the middle layer of the hypercube (where $\sum_i x_i = 0$). Yet, the learning algorithm makes at most $\frac{1}{10} \cdot \frac{2^n}{\sqrt{n}}$ membership queries in this middle layer, which contains between $\frac{1}{2} \cdot \frac{2^n}{\sqrt{n}}$ and $\frac{2^n}{\sqrt{n}}$ different points. So $A$ sees less than a $\frac{1}{5}$ fraction of the values of the inputs that define $f$, and misses at least a $\frac{4}{5}$ fraction of them. Each unseen point contributes in expectation $\frac{1}{2} \cdot \frac{1}{2^n}$ to the error of the hypothesis $h$. Therefore,
$$\mathbf{E}[\mathrm{error}(h)] \ge \frac{4}{5} \cdot \frac{2^n}{2\sqrt{n}} \cdot \frac{1}{2} \cdot \frac{1}{2^n} = \frac{1}{5\sqrt{n}}.$$

In fact, a stronger lower bound can be proven:
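The counting behind the proof of Claim 5 is easy to check numerically. The sketch below (ours) verifies, for a few even values of $n$, that the middle layer indeed has between $\frac{2^n}{2\sqrt{n}}$ and $\frac{2^n}{\sqrt{n}}$ points and that the resulting error lower bound stays above $\frac{1}{5\sqrt{n}}$.

```python
from math import comb, sqrt

# Numeric check of the counting used in the proof of Claim 5 (our own).
for n in [10, 20, 40, 60]:
    middle = comb(n, n // 2)                  # size of the middle layer
    in_range = 2**n / (2 * sqrt(n)) <= middle <= 2**n / sqrt(n)
    queries = 2**n / (10 * sqrt(n))           # the query budget in Claim 5
    unseen = middle - queries                 # middle-layer points never queried
    err_lb = unseen * 0.5 / 2**n              # each contributes (1/2) * 2^{-n}
    print(n, in_range, round(err_lb * sqrt(n), 3))
    # The last column is the error bound scaled by sqrt(n); it stays >= 1/5.
```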

Theorem 6 ([BBL98]). There is a (different) class $\mathcal{C}$ of monotone Boolean functions such that any algorithm that makes at most $2^{\sqrt{n}/100}$ membership queries outputs, when the target function $f$ is drawn uniformly from $\mathcal{C}$, a hypothesis $h$ such that $\mathbf{E}[\mathrm{dist}(f, h)] \ge 1/2 - o(1)$.

High-level sketch of proof. Each $f \in \mathcal{C}$ is a $2^{\sqrt{n}/50}$-term monotone DNF, $f = T_1 \vee \cdots \vee T_{2^{\sqrt{n}/50}}$, where each term $T_1, \ldots, T_{2^{\sqrt{n}/50}}$ is drawn independently from the set of all conjunctions of length $\frac{c\sqrt{n}}{50}$ (for an appropriately chosen constant $c$, so that the function is balanced with high probability). The argument then goes roughly as follows: every time a query $x$ satisfies one of the terms, the algorithm is given for free all the variables of that term. But even with this overly generous assumption, there are at most $2^{\sqrt{n}/100}$ positive examples among the queries, hence at most $2^{\sqrt{n}/100}$ terms out of the $2^{\sqrt{n}/50}$ total terms are shown to the algorithm. Intuitively, this means that the algorithm does not see anything about almost all of the terms (with high probability); furthermore, each negative example eliminates (again, with high probability) very few possible terms, so negative examples do not help either.

5 Main contribution of LMN: learning AC^0 circuits

[Figure: a size-6, depth-3 circuit over the inputs $x_1, x_3, x_4, x_7$.]

Above we see a size-6, depth-3 constant-depth circuit. Linial, Mansour and Nisan showed that if $f$ is computed by a size-$M$, depth-$d$ circuit then $f$ is $\epsilon$-concentrated on $\mathcal{S} = \{S : |S| \le (O(\log(M/\epsilon)))^d\}$. That is, we can learn the class $\mathrm{AC}^0$ of constant-depth, polynomial-size Boolean circuits in $n^{\mathrm{poly}(\log(n/\epsilon))}$ time.

(See the HW page for a related problem: there exists a depth-$d$, size-$M$ circuit with no Fourier weight on any set $S$ with $|S| \le \log^{d-1} M$, so the concentration bound above is close to best possible.)
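The HW problem just mentioned asks for circuits whose Fourier weight sits only at high degree; the textbook example of a function with such a spectrum is a parity (recall that a parity of polylogarithmically many variables can be computed by a small constant-depth circuit). The brute-force check below (ours) confirms that $\mathrm{PAR}_k$ has a single nonzero Fourier coefficient, on the full set of its $k$ variables, and hence no weight at any lower degree.

```python
from itertools import combinations, product
from math import prod

def parity(x):
    # PAR_k(x) = x_1 * x_2 * ... * x_k in the +/-1 convention.
    return prod(x)

k = 4
for d in range(k + 1):
    for S in combinations(range(k), d):
        coeff = sum(parity(x) * prod(x[i] for i in S)
                    for x in product([-1, 1], repeat=k)) / 2 ** k
        if coeff != 0:
            print(S, coeff)   # only the full set (0, 1, ..., k-1) is printed
```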

6 Learning halfspaces

Definition 7. A Boolean function $f : \{0,1\}^n \to \{-1,1\}$ is said to be a halfspace (or Linear Threshold Function (LTF)) if there exist weights $w_1, \ldots, w_n \in \mathbb{R}$ and a threshold $\theta \in \mathbb{R}$ such that $f(x) = \mathrm{sign}(w \cdot x - \theta)$ for all $x \in \{0,1\}^n$.

Fact 8 (PAC-learning halfspaces). There is an algorithm that can learn any unknown halfspace over $\{0,1\}^n$ in $\mathrm{poly}(n, \frac{1}{\epsilon}, \log\frac{1}{\delta})$ time, using only independent and identically distributed random examples drawn from an arbitrary distribution $D$ over $\{0,1\}^n$. The algorithm outputs with probability at least $1 - \delta$ a hypothesis $h$ such that $\Pr_{x \sim D}[f(x) \ne h(x)] \le \epsilon$.

This algorithm is based on polynomial-time linear programming. It works when $f$ is a halfspace, but breaks down completely if $f$ is a function of halfspaces, such as $f = h_1 \wedge h_2$. Indeed, in the arbitrary-distribution model, even if we allow membership queries, no algorithm faster than $2^n$ time is known for $f = h_1 \wedge h_2$. So we will restrict our attention (as usual) to the uniform-distribution setting.

One question to ask is whether one can use Fourier analysis to learn (under the uniform distribution) a single halfspace, or more ambitiously a function $g(h_1, \ldots, h_k)$ of halfspaces, where $g : \{0,1\}^k \to \{-1,1\}$. The following results are known here:

- Let $h = \mathrm{MAJ}$. If $\mathcal{S}$ is such that $\sum_{S \in \mathcal{S}} \hat{h}(S)^2 \ge 1 - \epsilon$, then it must be the case that $|\mathcal{S}| = n^{\Omega(1/\epsilon^2)}$. In particular, the KM algorithm will not work well.
- There exists $f = g(h_1, \ldots, h_k)$ such that if $\mathcal{S}$ satisfies $\sum_{S \in \mathcal{S}} \hat{f}(S)^2 \ge 1 - \epsilon$, then it must be the case that $|\mathcal{S}| = (n/k)^{\Omega(k^2/\epsilon^2)}$. For $k$ small compared to $n$ this is $n^{\Omega(k^2/\epsilon^2)}$.

These are bad-news results for Fourier concentration. The good news is that this is as bad as the bad news gets; it is known that any $f = g(h_1, \ldots, h_k)$ satisfies
$$\sum_{|S| \le O(k^2)/\epsilon^2} \hat{f}(S)^2 \ge 1 - \epsilon$$
and thus can be learned with LMN in $n^{O(k^2/\epsilon^2)}$ time (without membership queries). However, with membership queries it is possible to achieve a much better running time, namely polynomial in $n$ and $1/\epsilon$ for any fixed constant $k$:

Theorem 9 ([GKM12]). The class of $k$-juntas of halfspaces (functions of the form $f = g(h_1, \ldots, h_k)$ with the $h_i$'s being halfspaces) can be learned under the uniform distribution in $\mathrm{poly}((nk/\epsilon)^k)$ time, using membership queries.
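Fact 8 can be made concrete with a small linear-programming sketch: find any $(w, \theta)$ consistent with the labeled sample, here via scipy's LP solver. This is our own illustration, with an arbitrary margin normalization and a toy sample size, not the specific algorithm the lecture has in mind.

```python
import numpy as np
from scipy.optimize import linprog

def learn_halfspace(examples):
    # Find (w, theta) with y * (w.x - theta) >= 1 for every labeled example
    # (the margin of 1 is just a normalization).  This is a feasibility LP,
    # so the objective is identically zero.
    n = len(examples[0][0])
    A_ub = [list(-y * np.asarray(x, dtype=float)) + [float(y)]
            for x, y in examples]
    b_ub = [-1.0] * len(examples)
    res = linprog(c=np.zeros(n + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=(None, None), method="highs")
    assert res.success, "no consistent halfspace found"
    return res.x[:n], res.x[n]

# Toy run: examples labeled by the halfspace sign(2*x1 + x2 - 1.5) on {0,1}^4.
rng = np.random.default_rng(0)
target = lambda x: 1 if 2 * x[0] + x[1] - 1.5 > 0 else -1
data = [(x, target(x)) for x in rng.integers(0, 2, size=(200, 4))]
w, theta = learn_halfspace(data)
print("consistent on the sample:",
      all((1 if w @ x - theta > 0 else -1) == y for x, y in data))
```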

Idea of the proof. The algorithm will use a hypothesis that is a Read-Once Branching Program (ROBP).

Definition 10. A width-$W$ ROBP $M$ is a layered digraph with layers $0, 1, \ldots, n$ and at most $W$ nodes in each layer. $L(M, i)$ is the set of nodes in layer $i$, with $L(M, 0) = \{v_0\}$ ($v_0$ being the start node). Moreover, each node in $L(M, n)$ is labeled 0 or 1 (respectively REJECT or ACCEPT); for $i \in \{0, 1, \ldots, n-1\}$, each $v \in L(M, i)$ has two out-edges, one labeled 0 and the other labeled 1, both going to nodes in $L(M, i+1)$; for $z \in \{0,1\}^i$ and a node $v$, $M(v, z)$ denotes the node reached by starting from $v$ and following $i$ edges according to $z$.

We can view an ROBP as a Boolean function $M : \{0,1\}^n \to \{0,1\}$ by setting, for $z \in \{0,1\}^n$,
$$M(z) \stackrel{\mathrm{def}}{=} \begin{cases} 0 & \text{if } M(v_0, z) \text{ is labeled } 0 \\ 1 & \text{if } M(v_0, z) \text{ is labeled } 1. \end{cases}$$

Notation. We will write $U_i$ for the uniform distribution over $\{0,1\}^i$. For a prefix $x \in \{0,1\}^i$ (with $i \le n$), we define $f_x : \{0,1\}^{n-i} \to \{0,1\}$ by $f_x(z) = f(x \circ z)$, where $\circ$ stands for concatenation. Note that $\mathrm{dist}(f_x, f_y) = \Pr_{z \sim U_{n-i}}[f(x \circ z) \ne f(y \circ z)]$.

Definition 11. A function $f : \{0,1\}^n \to \{0,1\}$ is said to be $(\epsilon, W)$-prefix coverable if for all $i \in [n]$ there exists $S_i \subseteq \{0,1\}^i$ with $|S_i| \le W$ such that for every $y \in \{0,1\}^i$ there is some $x \in S_i$ with $\mathrm{dist}(f_x, f_y) \le \epsilon$. The collection $(S_1, \ldots, S_n)$ is then called an $(\epsilon, W)$-prefix cover of $f$.

The two building blocks of the proof will be the following lemmas:

Lemma 12. Every $k$-junta of LTFs $g(h_1, \ldots, h_k)$ is $(\epsilon, (4k/\epsilon)^k)$-prefix coverable.

Lemma 13. There is a membership-query algorithm which, given $\epsilon$, $W$, $\delta$, and $\mathrm{MQ}(f)$ for some $(\epsilon, W)$-prefix-coverable function $f$, outputs (as a width-$W$ ROBP) a hypothesis $h$ such that $\mathrm{dist}(h, f) \le 4n\epsilon$. Furthermore, the algorithm runs in time $\mathrm{poly}(n, W, \frac{1}{\epsilon}, \log\frac{1}{\delta})$.

Remark 2. The two lemmas above combine to yield the theorem: to learn a $k$-junta of LTFs to accuracy $\epsilon$, set $\epsilon' \stackrel{\mathrm{def}}{=} \frac{\epsilon}{4n}$ and $W \stackrel{\mathrm{def}}{=} (4k/\epsilon')^k = (16kn/\epsilon)^k$, and run the algorithm of Lemma 13.
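To make Definition 10 concrete, here is a tiny sketch (the encoding and the class are ours) of a width-2 ROBP that computes the parity of 3 bits: one node per layer records "even so far", the other "odd so far".

```python
class ROBP:
    # A width-W read-once branching program (Definition 10) in a toy encoding:
    # nodes of each layer are numbered 0..W-1, trans[i][v] = (node reached on
    # bit 0, node reached on bit 1) going from node v of layer i to layer i+1,
    # and accept is the set of layer-n nodes labeled 1 (ACCEPT).
    def __init__(self, trans, accept):
        self.trans = trans
        self.accept = accept

    def walk(self, layer, v, z):
        # M(v, z): start at node v of the given layer and follow the edges
        # dictated by the bits of z.
        for b in z:
            v = self.trans[layer][v][b]
            layer += 1
        return v

    def __call__(self, z):
        # View the ROBP as a Boolean function {0,1}^n -> {0,1}.
        return 1 if self.walk(0, 0, z) in self.accept else 0

# Node 0 = "parity even so far", node 1 = "parity odd so far"; accept node 1.
parity3 = ROBP(trans=[[(0, 1), (1, 0)]] * 3, accept={1})
print([parity3(z) for z in [(0, 0, 0), (1, 0, 1), (1, 1, 1)]])  # [0, 0, 1]
```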

7 Next time

Lemma 12 is a direct consequence of the following two claims, which we will prove next time.

Claim 14. If $h$ is any LTF, then $h$ is $(\epsilon, 2/\epsilon)$-prefix coverable.

Claim 15. Let $f_1, \ldots, f_k$ be any $(\epsilon, W)$-prefix-coverable functions, and fix any $g : \{0,1\}^k \to \{0,1\}$. Then $g(f_1, \ldots, f_k)$ is $(2k\epsilon, W^k)$-prefix coverable.

References

[BBL98] A. Blum, C. Burch, and J. Langford. On learning monotone Boolean functions. In Proceedings of the Thirty-Ninth Annual Symposium on Foundations of Computer Science (FOCS), 1998.

[GKM12] Parikshit Gopalan, Adam R. Klivans, and Raghu Meka. Learning functions of halfspaces using prefix covers. In Mannor et al. [MSW12], pages 15.1-15.10.

[Man94] Yishay Mansour. Learning Boolean functions via the Fourier transform. In Vwani Roychowdhury, Kai-Yeung Siu, and Alon Orlitsky, editors, Theoretical Advances in Neural Computation and Learning. Springer US, 1994.

[MSW12] Shie Mannor, Nathan Srebro, and Robert C. Williamson, editors. COLT 2012 - The 25th Annual Conference on Learning Theory, June 25-27, 2012, Edinburgh, Scotland, volume 23 of JMLR Proceedings. JMLR.org, 2012.
