Learning Juntas

Elchanan Mossel
CS and Statistics, U.C. Berkeley, Berkeley, CA
mossel@stat.berkeley.edu

Ryan O'Donnell
Department of Mathematics, MIT, Cambridge, MA
odonnell@theory.lcs.mit.edu

Rocco A. Servedio
Department of Computer Science, Columbia University, New York, NY
rocco@cs.columbia.edu

ABSTRACT

We consider a fundamental problem in computational learning theory: learning an arbitrary Boolean function which depends on an unknown set of k out of n Boolean variables. We give an algorithm for learning such functions from uniform random examples which runs in time roughly (n^k)^{ω/(ω+1)}, where ω < 2.376 is the matrix multiplication exponent. We thus obtain the first polynomial factor improvement on the naive n^k time bound which can be achieved via exhaustive search. Our algorithm and analysis exploit new structural properties of Boolean functions.

Categories and Subject Descriptors: F.2 [Analysis of Algorithms and Problem Complexity]: Miscellaneous

General Terms: Theory, Algorithms

Keywords: learning, juntas, relevant variables, uniform distribution, Fourier

(Author support: the first author was supported by a Miller Fellowship; most of this research was conducted when he was a postdoc at Microsoft Research. The second author was supported by NSF grant 99-1242. The third author was supported by an NSF Mathematical Sciences Postdoctoral Research Fellowship and by NSF grant CCR-98-77049; most of this research was conducted when he was a postdoc at Harvard University.)

1. INTRODUCTION

1.1 Background and motivation

One of the most important and challenging issues in machine learning is how to learn efficiently and effectively in the presence of irrelevant information. Many real-world learning problems can be modeled in the following way: we are given a set of labeled data points and we wish to find some hypothesis which accurately predicts the label of each data point. An oft-encountered situation in this framework is that each data point contains a large amount of information (i.e., each data point is a high-dimensional vector of attribute values over a fixed large set of attributes), but only a small unknown portion of this information is relevant to the label of the data point (i.e., the label is determined by a function which depends on only a few of the attributes). For example, in a computational biology scenario each data point may correspond to a long DNA sequence, and the label may be some property which depends only on a small unknown active part of this sequence.

In this paper we consider the following learning problem, which Blum [3] and Blum and Langley [4] proposed as a clean formulation of learning in the presence of irrelevant information: Let f be an unknown Boolean function over an n-bit domain which depends only on an unknown subset of k ≪ n variables. Such a function is called a k-junta. Given a data set of labeled examples ⟨x, f(x)⟩, where the points x are independently and uniformly chosen random n-bit strings, can the function f be learned by a computationally efficient algorithm? (We give a precise description of what it means to learn f in Section 2.)
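For concreteness, the following minimal Python sketch (illustrative only; the helper names make_junta and example_oracle are ours, not the paper's) shows a k-junta together with the uniform example oracle EX(f) that the learner is given:

import random

def make_junta(n, relevant, truth_table):
    # A k-junta on n bits: the label depends only on the coordinates in `relevant`
    # (the hidden set R); `truth_table` maps each setting of those bits to 0 or 1.
    def f(x):
        return truth_table[tuple(x[i] for i in relevant)]
    return f

def example_oracle(f, n, rng=random):
    # One call to EX(f): a uniformly random x in {0,1}^n together with its label f(x).
    x = tuple(rng.randint(0, 1) for _ in range(n))
    return x, f(x)

# Example: a 3-junta on 20 variables whose label is the majority of coordinates 2, 7, 13.
n, R = 20, (2, 7, 13)
table = {(a, b, c): int(a + b + c >= 2) for a in (0, 1) for b in (0, 1) for c in (0, 1)}
f = make_junta(n, R, table)
print(example_oracle(f, n))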
Note that a naive brute-force search over all possible subsets of k relevant variables can be performed in time roughly n^k; we would like to have an algorithm which runs faster than this.

We believe that the problem of efficiently learning k-juntas is the single most important open question in uniform distribution learning. In addition to being natural and elegant, learning juntas is at the heart of the most notorious open problems in uniform distribution learning, namely learning DNF formulas and decision trees of superconstant size. Since every k-junta can be expressed as a decision tree or DNF formula of size 2^k, it is clear that efficient algorithms for learning 2^k-size decision trees or DNFs would also be efficient algorithms for learning k-juntas. But in fact more is true: obtaining efficient algorithms for decision trees or DNFs requires that we be able to efficiently learn juntas. Specifically, any size-k decision tree is also a k-junta, and any k-term DNF is ɛ-indistinguishable (under the uniform distribution) from a (k log(k/ɛ))-junta. Thus, learning ω(1)-size decision trees or ω(1)-term DNFs in polynomial time is

equivalent to learning ω(1)-juntas in polynomial time.

We note that learning from uniform random examples seems to be the model in which this problem has the right amount of difficulty. As described in Section 5, allowing the learner to make membership queries makes the problem too easy, while restricting the learner to the statistical query model makes the problem provably hard.

1.2 Our results

We give the first learning algorithm for the problem of learning k-juntas which achieves a polynomial factor improvement over brute-force exhaustive search. Under the uniform distribution, our algorithm exactly learns an unknown k-junta with confidence 1 − δ in time n^{(ω/(ω+1))k} · poly(2^k, n, log(1/δ)), where ω is the exponent in the time bound for matrix multiplication. Since Coppersmith and Winograd [7] have shown that ω < 2.376, our algorithm runs in time roughly N^{0.704}, where N = n^k is the running time of a naive brute-force approach. Our algorithm and analysis exploit new structural properties of Boolean functions which may be of independent interest.

We note that since this learning problem was first posed by Blum in 1994, little progress has been made. The first improvement over the trivial n^k time bound of which we are aware is a recent algorithm due to A. Kalai and Y. Mansour [12] which runs in time roughly n^{k − Ω(k^{1/4})}. Mansour [18] later improved this to n^{k − Ω(k^{1/2})}. (In recent work Fischer et al. have studied the problem of testing k-juntas [8], but the learning and testing problems seem to require different techniques.)

1.3 Organization

In Section 2 we formally define the learning problem and give some background on polynomial representations of Boolean functions. In Section 3 we present the learning algorithms which allow us to reduce the learning problem to some questions about representing Boolean functions as polynomials. In Section 4 we prove the necessary new structural properties of Boolean functions and thus obtain our learning result. Finally, in Section 5 we use the developed machinery to analyze several variants of the juntas problem.

2. PRELIMINARIES

The learning model we consider is a uniform distribution version of Valiant's Probably Approximately Correct (PAC) model [19] which has been studied by many researchers, e.g., [6, 10, 11, 14, 15, 17, 20, 21]. In this model a concept class C is a collection ∪_{n≥1} C_n of Boolean functions, where each c ∈ C_n is a function on n bits. Let f ∈ C_n be an unknown target function. A learning algorithm A for C takes as input an accuracy parameter 0 < ɛ < 1 and a confidence parameter 0 < δ < 1. During its execution A has access to an example oracle EX(f) which, when queried, generates a random labeled example ⟨x, f(x)⟩ where x is drawn uniformly from {0, 1}^n. A outputs a hypothesis h which is a Boolean function over {0, 1}^n; the error of this hypothesis is defined to be error(h, f) = Pr[h(x) ≠ f(x)]. (Here and in the remainder of the paper, unless otherwise indicated all probabilities are taken over x chosen uniformly at random from {0, 1}^n.) We say that A is a uniform-distribution PAC learning algorithm for C if the following condition holds: for every f ∈ C and every ɛ, δ, with probability at least 1 − δ algorithm A outputs a hypothesis h which has error(h, f) ≤ ɛ. For the purposes of this paper the accuracy parameter ɛ will always be 0, so our goal is to exactly identify the unknown function.

A Boolean function f : {0, 1}^n → {0, 1} is said to depend on the ith variable if there exist inputs x, y ∈ {0, 1}^n which differ only in the ith coordinate and which have f(x) ≠ f(y). Equivalently, we say that such a variable is relevant to f. If the function f has at most k relevant variables then we call f a k-junta. The concept class we consider in this paper is the set of k-juntas over n variables, i.e., C_n = {f : {0, 1}^n → {0, 1} s.t. f is a k-junta}. Equivalently, each function f ∈ C_n is defined by a subset R = {i_1, ..., i_{k′}} ⊆ {1, ..., n} of k′ ≤ k relevant variables and a truth table of 2^{k′} bits corresponding to all possible settings of these variables. We are most interested in the case where k is O(log n) or even a large constant value. For such k the number of possible sets of relevant variables is n^{(1 − o(1))k}. Hence the naive learning algorithm which performs an exhaustive search over all possible subsets of relevant variables will take time n^{(1 − o(1))k}.
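For comparison with what follows, here is an illustrative Python sketch (ours, not the paper's) of the naive exhaustive-search learner just described: it tries every candidate set of at most k coordinates and keeps the first one on which the observed labels form a consistent truth table. With enough uniform examples this recovers the junta, but the outer loop ranges over roughly n^k candidate sets.

from itertools import combinations

def brute_force_junta(samples, n, k):
    # Naive n^k-time baseline: `samples` is a list of (x, label) pairs with x a tuple
    # in {0,1}^n.  Return the first candidate set of <= k coordinates on which the
    # sample is consistent, together with the induced partial truth table.
    for size in range(k + 1):
        for cand in combinations(range(n), size):
            table, consistent = {}, True
            for x, label in samples:
                key = tuple(x[i] for i in cand)
                if table.setdefault(key, label) != label:
                    consistent = False  # two examples agree on `cand` but carry different labels
                    break
            if consistent:
                return cand, table
    return None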
2.1 Representing Boolean functions as polynomials

A Boolean function g on n bits is a mapping {F, T}^n → {F, T}. There are many possible ways to represent g as a multilinear polynomial. Since our analysis will use several different representations, we give a general definition which encompasses all of the cases we will need:

Definition 1. Let F be a field and let f, t ∈ {−1, 0, 1} be distinct elements of F. We say that a multilinear polynomial p ⟨F, f, t⟩-represents g if p : F^n → F has the following properties: for all inputs in {f, t}^n, p outputs a value in {f, t}; and, p and g induce the same mapping when F and T are identified with f and t in the input and output. Note that since f^2, t^2 ∈ {0, 1}, the assumption that p is multilinear is without loss of generality.

It is well known that the ⟨F, f, t⟩-representation of g always exists and is unique; for completeness we give a simple proof below.

Proposition 2. Every Boolean function g has a unique multilinear ⟨F, f, t⟩-representation.

Proof. The condition that p is a multilinear polynomial which represents g is equivalent to a system of 2^n linear equations in 2^n unknowns, where the unknowns are the coefficients on the 2^n multilinear monomials. Let A_n denote the 2^n × 2^n matrix arising from this linear system, so the columns of A_n correspond to monomials and the rows correspond to truth assignments. It suffices to prove that A_n has full rank; we now prove this by induction. In the case n = 1 we have A_1 = [ 1  f ; 1  t ], which has full rank over any field since f ≠ t. In the general case, one can rearrange the rows and columns of A_n to get A_n = [ A_{n−1}  f·A_{n−1} ; A_{n−1}  t·A_{n−1} ], where the columns on the left correspond to monomials not containing x_n and the others correspond to monomials containing x_n. By performing elementary row operations on this matrix, one can get [ A_{n−1}  f·A_{n−1} ; 0  (t−f)·A_{n−1} ]. Since f ≠ t and A_{n−1} has full rank by induction, this matrix has full rank.
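As a concrete illustration of Definition 1 (an example of ours, not taken from the paper), consider g = AND on two bits. Its two representations used below are

    g_{F_2}(x_1, x_2) = x_1 x_2                                        (over F_2 with f = 0, t = 1),
    g_R(x_1, x_2) = 1/2 + (1/2) x_1 + (1/2) x_2 − (1/2) x_1 x_2        (over R with f = +1, t = −1).

Indeed g_R(−1, −1) = −1 and g_R takes the value +1 on the other three inputs of {+1, −1}^2, so it maps the all-True input to True and everything else to False. By Proposition 2 each representation is unique, and here deg_{F_2}(AND_2) = deg_R(AND_2) = 2 (contrast this with PARITY_n below, where the two degrees differ as much as possible).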

The fields we will consider in this paper are the two-element field F_2 and the field R of real numbers. In F_2 we will represent bits by f = 0, t = 1, and in R we will usually represent bits by f = +1, t = −1.

Definition 3. Given a Boolean function g on n bits: We write g_{F_2} for the multilinear polynomial which ⟨F_2, 0, 1⟩-represents g, and we say that g_{F_2} F_2-represents g. Note that g_{F_2} can be viewed as a parity of ANDs, since F_2 multiplication corresponds to AND and F_2 addition corresponds to parity. We write g_R for the multilinear polynomial which ⟨R, +1, −1⟩-represents g, and we say that g_R R-represents g. Note that g_R is precisely the Fourier representation of g. As is standard, we write ĝ(S) for the coefficient of x_S in g_R, where x_S denotes the monomial ∏_{i∈S} x_i. We call ĝ(S) the S Fourier coefficient of g.

As an example, if g = PARITY_n then we have g_{F_2} = x_1 + x_2 + ··· + x_n and g_R = x_1 x_2 ··· x_n. Note that there is a huge difference in the degrees of these two polynomial representations; we will be very interested in the degree of Boolean functions under various representations. We observe that for a given field this degree is independent of the exact choice of f, t. This is because we can pass back and forth between any two such choices by nonconstant linear transformations on the inputs and outputs, and under such transformations the monomials of highest degree can never vanish. Thus we can make the following definition:

Definition 4. deg_F(g) is defined to be deg(p), where p is any ⟨F, f, t⟩-representation of g.

Hence we have deg_{F_2}(PARITY_n) = 1 and deg_R(PARITY_n) = n. In general deg_{F_2}(g) ≤ deg_R(g):

Fact 5. For any Boolean function g, deg_{F_2}(g) ≤ deg_R(g).

Proof. Let p be the ⟨R, 0, 1⟩-representation of g and let g_{F_2} be the F_2-representation of g. We have

    p(x) = Σ_{z ∈ {0,1}^n} p(z) · (∏_{i : z_i = 1} x_i) · (∏_{i : z_i = 0} (1 − x_i)).

This polynomial clearly has integer coefficients; g_{F_2} is obtained by reducing the coefficients of p mod 2, and this operation can only decrease degree.

3. LEARNING TOOLS

In this section we give the learning algorithms we will use for solving the junta problem. We first show that it suffices to give a learning algorithm which can identify a single relevant variable. We then give two learning algorithms that look for relevant variables. Our algorithm for learning k-juntas will end up trying both algorithms, and we shall prove in Section 4 that at least one of them always works.

Throughout this section, f will denote a k-junta on n bits, R will denote the set of variables on which f depends, k′ will denote |R| (so 0 ≤ k′ ≤ k), and f′ will denote the function {F, T}^{k′} → {F, T} given by restricting f to R.

3.1 Finding a single relevant variable is enough

Proposition 6. Suppose that A is an algorithm running in time n^α · poly(2^k, n, log(1/δ)) which can identify at least one variable relevant to f with confidence 1 − δ (assuming f is nonconstant). Then there is an algorithm for exactly learning f which runs in time n^α · poly(2^k, n, log(1/δ)).

Proof. First note that if f is nonconstant then for uniform random inputs each output value occurs with frequency at least 1/2^k. Hence we can decide whether or not f is a constant function with confidence 1 − δ in time poly(2^k, n, log(1/δ)). Next, suppose ρ is any restriction fixing at most k bits. We claim that we can run any learning algorithm on f_ρ with a slowdown of at most poly(2^k).
To do so, we only need to transform the example oracle for f into one for f_ρ; this is easily done by rejecting all samples ⟨x, f(x)⟩ for which x does not agree with ρ. Since ρ fixes at most k bits, the probability that a random x agrees with ρ is at least 2^{−k}. Hence with probability 1 − δ we can get M samples for f_ρ by taking M · poly(2^k) · log(M/δ) samples from the oracle for f.

We now show how to identify all the variables R on which f depends in the requisite amount of time. By induction, suppose we have identified some relevant variables R̄ ⊆ R. For each of the 2^{|R̄|} possible restrictions ρ which fix the bits in R̄, consider the function f_ρ. Since f_ρ is also a k-junta, A can identify some variables relevant to f_ρ (or else we can check that f_ρ is constant). By running A (with the slowdown described above) for each possible ρ, we will identify new variables to add into R̄. We repeatedly add new variables to R̄, testing all restrictions on these variables, until all of the restricted subfunctions are constant. It is clear that at this point we will have identified all variables relevant to f. Note that R̄ grows by at least one variable per stage, and so we will never run A more than k·2^k times. Further, we can get confidence 1 − δ/(k·2^k) for each run, even after the rejection-sampling slowdown, in time n^α · poly(2^k, n, log(1/δ)). Hence we can identify R in time n^α · poly(2^k, n, log(1/δ)) with confidence 1 − δ.

Finally, once R is identified it is easy to learn f exactly. Simply draw poly(2^k, log(1/δ)) samples; with probability 1 − δ we will see every possible bit setting for R, so we can build the truth table of f′ and output this as our hypothesis.

3.2 The Fourier-based learning algorithm

We describe a simple Fourier-based algorithm for trying to identify a variable relevant to f. The algorithm is based on the Low Degree learning algorithm of Linial, Mansour, and Nisan [15] (see also [16]). As with the Low Degree algorithm, our Fourier-based algorithm tries to learn the unknown function f by estimating all of f's Fourier coefficients ˆf(S) with 1 ≤ |S| ≤ α. Unlike the Low Degree algorithm, our algorithm can stop as soon as it finds a nonzero coefficient, since all variables in the associated monomial must be relevant to f. We first show how to compute the exact value of any desired Fourier coefficient:

Proposition 7. We can exactly calculate any Fourier coefficient ˆf(S) with confidence 1 − δ in time poly(2^k, n, log(1/δ)).

Proof. We view bits as being ±1, as in the R-representation. After multilinear reduction we see that the polynomial x_S · f_R(x)

has ˆf(S) as its constant coefficient. Since E[x_T] = 0 for all nonempty subsets T, linearity of expectation lets us conclude that E[x_S · f_R(x)] = ˆf(S). We can clearly compute the value of x_S · f_R(x) in linear time given a labeled example ⟨x, f(x)⟩. By standard Chernoff bounds, poly(2^k, log(1/δ)) independent samples of the ±1 random variable x_S · f_R(x) are sufficient for computing the expectation to within ±1/2^{k+1} with confidence 1 − δ. Since f and f′ have the same Fourier expansion and f′ is a function on at most k variables, ˆf(S) must be of the form a/2^k for some integer a ∈ [−2^k, 2^k]. Hence by rounding the empirical expectation to the nearest integer multiple of 1/2^k we will get the exact value of ˆf(S) with confidence 1 − δ.

The next proposition says that if f has a nonzero Fourier coefficient of small (but nonzero) degree, then we can efficiently identify some variables relevant to f.

Proposition 8. If ˆf′(S) ≠ 0 for some S with 1 ≤ |S| ≤ α, then we can identify at least one relevant variable for f with confidence 1 − δ in time n^α · poly(2^k, n, log(1/δ)).

Proof. We use Proposition 7 to compute each Fourier coefficient ˆf(S), 1 ≤ |S| ≤ α, with confidence 1 − δ/n^α. Since there are at most n^α possible sets S, with confidence 1 − δ we will obtain the exact values of all the desired Fourier coefficients in time at most n^α · poly(2^k, n, log(1/δ)). Since f and f′ have the same Fourier coefficients, we will find an S with ˆf(S) ≠ 0. It is easy to see that every variable in S must be relevant to f; for if f does not depend on x_i then ˆf(S) = E[x_S f_R(x)] = E[x_i] · E[x_{S∖{i}} f_R(x)] = 0 · E[x_{S∖{i}} f_R(x)] = 0.

3.3 The F_2-based learning algorithm

In this subsection we show that if f is a low-degree polynomial over F_2, then in fact we can learn f exactly. Here we view True and False as 1 and 0 respectively. Recall the following well-known result from computational learning theory [9]:

Theorem 9. Let g : {0, 1}^N → {0, 1} be a parity function on an unknown subset of the N Boolean variables x_1, ..., x_N. There is a learning algorithm B which, given access to labeled examples ⟨x, g(x)⟩ drawn from any probability distribution D on {0, 1}^N, outputs a hypothesis h (which is a parity of some subset of x_1, ..., x_N) such that with probability 1 − δ we have Pr_{x∈D}[h(x) ≠ g(x)] ≤ ɛ. Algorithm B runs in time O((N/ɛ + log(1/δ)/ɛ)^ω), where ω < 2.376 is the exponent for matrix multiplication.

The idea behind Theorem 9 is simple: since g is a parity function, each labeled example ⟨x, g(x)⟩ corresponds to a linear equation over F_2, where the ith unknown corresponds to whether x_i is present in g. Algorithm B draws O(N/ɛ + log(1/δ)/ɛ) examples and solves the resulting system of linear equations to find some parity over x_1, ..., x_N which is consistent with all of the examples. Well-known results in PAC learning theory [5] imply that such a consistent parity will satisfy the ɛ, δ criterion.

Now suppose deg_{F_2}(f′) = α ≤ k. Then f is an F_2-linear combination (i.e., a parity) over the set of monomials (conjunctions) in x_1, ..., x_n of degree up to α. This lets us learn f in time roughly n^{ωα}:

Proposition 10. If deg_{F_2}(f′) = α, then we can learn f exactly in time n^{ωα} · poly(2^k, n, log(1/δ)) with confidence 1 − δ. (Hence we can certainly identify a variable on which f depends.)

Proof. Consider the expanded variable space consisting of all monomials over x_1, ..., x_n of degree at most α. There are at most N = n^α variables in this space. Run algorithm B from Theorem 9 on this variable space, with ɛ set to 2^{−(k+1)}.
That is, given an example ⟨x, f(x)⟩, translate it to the example ⟨(x_S)_{|S| ≤ α}, f(x)⟩, and run B using this new example oracle. Simulating a draw from this new oracle takes time N · poly(n), so constructing all the necessary examples for B takes time N^2 · poly(2^k, n, log(1/δ)). Solving the resulting system of equations takes time N^ω · poly(2^k, n, log(1/δ)). Hence the total time for the algorithm is n^{ωα} · poly(2^k, n, log(1/δ)) as claimed.

We now argue that B's output hypothesis is precisely the F_2-representation of f. Let D be the distribution over the expanded variable space induced by the uniform distribution on x_1, ..., x_n. Since f′ (equivalently f) is a parity over the expanded variable space, the output of B will be a parity hypothesis h over the expanded variable space which satisfies Pr_{x∈D}[h(x) ≠ f(x)] ≤ 2^{−(k+1)}. View both f and h as F_2-polynomials of degree α over the original variables x_1, ..., x_n. If f and h are not identical, then f + h is not identically 0 and we have Pr[f(x) ≠ h(x)] = Pr[f(x) + h(x) ≠ 0]. Now since deg_{F_2}(f + h) ≤ α and f + h is not identically 0, the polynomial f + h must be nonzero on at least a 2^{−α} ≥ 2^{−k} fraction of the points in (F_2)^n. (This is a slightly nonstandard form of the Schwartz-Zippel Lemma.) But this contradicts the fact that Pr_{x∈D}[h(x) ≠ f(x)] ≤ 2^{−(k+1)}.

4. LEARNING JUNTAS VIA STRUCTURAL PROPERTIES OF BOOLEAN FUNCTIONS

With our learning tools in hand we are ready to give the algorithm for learning k-juntas. The basic idea is to show that every Boolean function f must either have a nonzero Fourier coefficient of not too large positive degree, or must be a polynomial over F_2 of not too large degree. Then by Propositions 8 and 10, in either case we can find a relevant variable for f without performing a full-fledged exhaustive search.

The Fourier learning algorithm described earlier fails only on functions whose low-degree Fourier coefficients are all zero (except possibly the constant coefficient; even if this is nonzero the Fourier algorithm can still fail). Let us make a definition for such functions:

Definition 11. Suppose that g satisfies ĝ(S) = 0 for all 1 ≤ |S| < t. If ĝ(∅) is also 0 then we say that g is strongly balanced up to size t. If ĝ(∅) is nonzero we say that g is strongly biased up to size t.

These definitions were essentially first made by Bernasconi in [2]. The justification of the terminology is this: if g is strongly balanced up to size t, then it is easy to show that every subfunction of g obtained by fixing 0 ≤ l ≤ t − 1 bits is balanced (i.e., is true with probability exactly 1/2). Similarly, if g is strongly biased up to size t, then it is easy to show that every such subfunction has the same bias as g itself.
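To make Definition 11 concrete, here is an illustrative Python sketch (ours; it computes the coefficients exactly from a full 2^k-entry truth table rather than estimating them from samples as in Proposition 7) that classifies a k-bit function according to the definition:

from itertools import combinations, product

def fourier_coefficient(truth_table, k, S):
    # Exact Fourier coefficient of a k-bit function given as a dict mapping each
    # x in {0,1}^k (tuple key) to a value in {+1, -1}; bit 1 corresponds to the point -1.
    total = 0
    for x in product((0, 1), repeat=k):
        chi = 1
        for i in S:
            chi *= -1 if x[i] == 1 else 1
        total += truth_table[x] * chi
    return total / 2 ** k      # always an integer multiple of 1/2^k, so exact as a float

def classify_up_to(truth_table, k, t):
    # Definition 11: 'balanced' / 'biased' if every coefficient of degree 1 <= |S| < t
    # vanishes and the constant coefficient is zero / nonzero; 'neither' otherwise.
    if any(fourier_coefficient(truth_table, k, S) != 0
           for d in range(1, t) for S in combinations(range(k), d)):
        return 'neither'
    return 'balanced' if fourier_coefficient(truth_table, k, ()) == 0 else 'biased'

For instance, on PARITY_k this check returns 'balanced' for every t ≤ k, while on the unbalanced example given after Theorem 13 below it returns 'biased' for every t up to 2n/3.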

We now show that strongly balanced functions have low F_2-degree:

Theorem 12. Let g ∉ {PARITY_n, ¬PARITY_n} be a Boolean function on n bits which is strongly balanced up to size t. Then deg_{F_2}(g) ≤ n − t.

Proof. Given such a g, let h = g ⊕ PARITY_n. Then h_R = g_R · x_1 x_2 ··· x_n. By assumption, g_R has zero coefficient on all monomials x_S with |S| < t. By multilinear reduction (x_i^2 = 1) we see that h_R has zero coefficient on all monomials x_S with |S| > n − t. Hence deg_R(h) ≤ n − t, so by Fact 5, deg_{F_2}(h) ≤ n − t. But since g = h ⊕ PARITY_n, the F_2-representation of g is simply g_{F_2}(x) = h_{F_2}(x) + x_1 + ··· + x_n. Adding a degree-1 polynomial to h_{F_2} does not increase degree (since g is neither PARITY_n nor its negation, h is not a constant function and hence deg_{F_2}(h) ≥ 1), and consequently deg_{F_2}(g) ≤ n − t.

The bound n − t in Theorem 12 is best possible. To see this, consider the function g(x) = (x_1 ∧ ··· ∧ x_{n−t}) ⊕ x_{n−t+1} ⊕ ··· ⊕ x_n. This function has F_2-representation g_{F_2}(x) = x_1 ··· x_{n−t} + x_{n−t+1} + ··· + x_n, so deg_{F_2}(g) = n − t. Moreover, g is balanced and every subfunction of g fixing fewer than t bits is also balanced, since to make g unbalanced one must restrict all of x_{n−t+1}, ..., x_n.

It remains to deal with strongly biased functions. Our next theorem shows that no Boolean function can be strongly biased up to too large a size:

Theorem 13. If g is a Boolean function on n bits which is strongly biased up to size t, then t ≤ 2n/3.

Proof. Let g_R(x) = Σ_S c_S x_S be the R-representation of g. Since g is strongly biased up to size t, we have 0 < |c_∅| < 1 and c_S = 0 for all 0 < |S| < t. As in Theorem 12, we let h = g ⊕ PARITY_n, so h_R(x) = c_∅ · x_1 x_2 ··· x_n + Σ_{|S| ≤ n−t} c′_S x_S, where c′_S = c_{[n]∖S}. Let h′ : {+1, −1}^n → {1 + c_∅, 1 − c_∅, −1 + c_∅, −1 − c_∅} be the real-valued function given by h′(x) = h_R(x) − c_∅ · x_1 x_2 ··· x_n; note that deg(h′) ≤ n − t. Furthermore, for x ∈ {+1, −1}^n we have h′(x) ∈ {1 + c_∅, 1 − c_∅} iff h_R(x) = +1, and h′(x) ∈ {−1 + c_∅, −1 − c_∅} iff h_R(x) = −1. Since 0 < |c_∅| < 1, the sets {1 + c_∅, 1 − c_∅} and {−1 + c_∅, −1 − c_∅} are disjoint two-element sets. Let p : R → R be the degree-3 polynomial which maps 1 + c_∅ and 1 − c_∅ to +1, and −1 + c_∅ and −1 − c_∅ to −1. Now consider the polynomial p ∘ h′. By construction p ∘ h′ maps {+1, −1}^n → {+1, −1}, and p ∘ h′ R-represents h. But the R-representation of h is unique, so after multilinear reduction p ∘ h′ must be identical to h_R. Since c_∅ ≠ 0, we know that deg_R(h) is exactly n. Since p has degree exactly 3 and deg(h′) ≤ n − t, we conclude that 3(n − t) ≥ n, whence t ≤ 2n/3.

The bound 2n/3 in Theorem 13 is best possible. To see this, let n = 3m and consider the function f(x_1, ..., x_n) = (x_1 ⊕ ··· ⊕ x_{2m}) ∧ (x_{m+1} ⊕ ··· ⊕ x_n). It is easy to see that this function is unbalanced, and also that its bias cannot change under any restriction of fewer than 2m bits (to change the bias, one must set bits 1 ... 2m, or m+1 ... 3m, or 1 ... m, 2m+1 ... 3m).

We can now prove our main theorem:

Theorem 14. The class of k-juntas over n bits can be exactly learned under the uniform distribution with confidence 1 − δ in time n^{(ω/(ω+1))k} · poly(2^k, n, log(1/δ)).

Proof. Let f be a k-junta on n bits and f′ be the function on at most k bits given by restricting f to its relevant variables. Let t = ωk/(ω+1) > 2k/3. If f′ is strongly balanced up to size t, then by Theorem 12 f′ is an F_2-polynomial of degree at most k − t = k/(ω + 1). By Proposition 10, f can be learned in time (n^{k/(ω+1)})^ω · poly(2^k, n, log(1/δ)). On the other hand, suppose f′ is not strongly balanced up to size t. By Theorem 13, f′ cannot be strongly biased up to size t, since t > 2k/3. Hence f′ has a nonzero Fourier coefficient of degree less than t and greater than 0.
So by Proposition 8, some relevant variable for f can be identified in time n^t · poly(2^k, n, log(1/δ)). In either case, we can identify some relevant variable for f in time n^{(ω/(ω+1))k} · poly(2^k, n, log(1/δ)). Proposition 6 completes the proof.

5. VARIANTS OF THE JUNTA LEARNING PROBLEM

We can use the ideas developed thus far to analyze some variants and special cases of the juntas learning problem.

5.1 Some easier special cases

For various subclasses of k-juntas, the learning problem is more easily solved.

Monotone juntas: It is easy to verify that if f′ is a monotone function, then ˆf′({i}) > 0 for every relevant variable x_i. (Use the fact that ˆf′({i}) = E[x_i f′(x)] = Pr[f′(x) = x_i] − Pr[f′(x) ≠ x_i].) Hence monotone juntas can be learned in time poly(2^k, n, log(1/δ)) using the Fourier learning algorithm of Proposition 8.

Random juntas: As observed in [4], almost every k-junta on n variables can be learned in time poly(2^k, n, log(1/δ)). To see this, observe that if a function f′ on k bits is chosen uniformly at random, then for every S we have ˆf′(S) = 0 only if exactly half of all inputs x have f′(x) = x_S. This occurs with probability (2^k choose 2^{k−1}) / 2^{2^k} = O(1)/2^{k/2}. Consequently, with overwhelming probability in terms of k (at least 1 − O(k)/2^{k/2}), a random function on k variables will have every Fourier coefficient of degree 1 nonzero, and hence we can learn using Proposition 8.

Symmetric juntas: A symmetric k-junta is a junta whose value depends only on how many of its k relevant variables are set to 1. We can learn any symmetric k-junta in time n^{2k/3} · poly(2^k, n, log(1/δ)), which somewhat improves on our bound for arbitrary k-juntas. To prove this, we show that every symmetric function f′ on k variables, other than parity and its negation, has a nonzero Fourier coefficient ˆf′(S) for some S with 1 ≤ |S| ≤ 2k/3. Hence we can identify at least one relevant variable in time n^{2k/3} · poly(2^k, n, log(1/δ)) using Proposition 8, and we can use the algorithm of Proposition 6 since

the class of symmetric functions is closed under subfunctions.

To prove this claim about the Fourier coefficients of symmetric functions, first note that if f′ is not balanced then by Theorem 13 it must have a nonzero Fourier coefficient of positive degree at most 2k/3. Otherwise, if f′ is balanced and is neither parity nor its negation, then g := f′ ⊕ PARITY_k is a symmetric nonconstant function and deg_R(g) < k; this last fact follows because the x_1 x_2 ··· x_k coefficient of g is the constant coefficient of f′, and f′ is balanced. By a result of von zur Gathen and Roche [22], every nonconstant symmetric function g on k variables has deg_R(g) ≥ k − O(k^{0.548}). Hence ĝ(S) ≠ 0 for some S with k − O(k^{0.548}) ≤ |S| < k, so ˆf′([k] ∖ S) ≠ 0 and 1 ≤ |[k] ∖ S| ≤ O(k^{0.548}) ≤ 2k/3.

In [22] von zur Gathen and Roche conjecture that every nonconstant symmetric Boolean function f on k variables has deg_R(f) ≥ k − O(1). We note that if a somewhat stronger conjecture were true (that every nonconstant symmetric function has a nonzero Fourier coefficient of degree d for some k − O(1) ≤ d ≤ k − 1), then using the above approach we could learn symmetric juntas in poly(2^k, n, log(1/δ)) time. (The von zur Gathen/Roche conjecture does not appear to suffice, since f′ could conceivably have a nonzero Fourier coefficient of degree 0 and yet have no nonzero Fourier coefficients of positive degree O(1).)

5.2 Other learning models

Blum and Langley observed [4] that if the learning algorithm is allowed to make membership queries for the value of the target junta at points of its choice, then any k-junta can be learned in time poly(2^k, n, log(1/δ)). By drawing random examples, the learner will either determine that the function is constant or it will obtain two inputs x and y with f(x) ≠ f(y). In the latter case the learner then selects a path in the Hamming cube between x and y and queries f on all points in the path. The learner will thus find two neighboring points z and z′ on which f has different values, so the coordinate in which z and z′ differ is relevant. The learner then recurses as in Proposition 6.

While membership queries make the problem of learning juntas easy, casting the problem in the more restrictive statistical query learning model of Kearns (see [13] for background on this model) makes the problem provably hard. The class of k-juntas over n variables contains at least (n choose k) distinct parity functions, and for any two distinct parity functions x_S ≠ x_T we have that E[x_S x_T] = 0. Consequently, an information-theoretic lower bound of Bshouty and Feldman [1] implies that any statistical query algorithm for learning k-juntas under the uniform distribution must have q/τ^2 ≥ (n choose k), where q is the number of statistical queries which the algorithm makes and τ ∈ (0, 1) is the additive error tolerance required for each query. Thus, as noted earlier, the PAC model of learning from random examples seems to be the right framework for the juntas problem.

We close by observing that if the uniform distribution is replaced by a product measure in which Pr[x_i = T] = p_i, then for almost every choice of (p_1, ..., p_n) ∈ [0, 1]^n, k-juntas are learnable in time poly(2^k, n, log(1/δ)). In particular, we claim that for every product distribution except for a set of measure zero in [0, 1]^n, every k-junta f has nonzero correlation with every variable on which it depends, and consequently a straightforward variant of the Fourier-based learning algorithm will identify all relevant variables in the claimed time bound.
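The product-distribution variant just described might look as follows in code (an illustrative sketch of ours, not the paper's algorithm; here "correlation" is taken to be the empirical covariance between the label and each coordinate, and the sample size and threshold are placeholder parameters that one would set from k and δ):

import random

def correlated_coordinates(f, n, probs, num_samples=100000, threshold=0.01, rng=random):
    # Flag coordinates whose empirical covariance with the label is non-negligible.
    # `f` maps a tuple in {0,1}^n to a label in {0,1}; probs[i] = Pr[x_i = 1].
    # Under a product distribution the covariance is exactly 0 for irrelevant
    # coordinates, and (for almost every choice of probs) nonzero for relevant ones.
    samples = [tuple(1 if rng.random() < probs[i] else 0 for i in range(n))
               for _ in range(num_samples)]
    labels = [f(x) for x in samples]
    mean_label = sum(labels) / num_samples
    flagged = []
    for i in range(n):
        mean_xi = sum(x[i] for x in samples) / num_samples
        cov = sum((x[i] - mean_xi) * (y - mean_label)
                  for x, y in zip(samples, labels)) / num_samples
        if abs(cov) > threshold:   # placeholder threshold; see the note above
            flagged.append(i)
    return flagged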
The claim is a consequence of the following easily verified fact:

Fact 15. If f′ is the Fourier representation of a Boolean function, then E_{p_1,...,p_n}[f′(x) · x_i], when viewed formally as a multivariate polynomial in p_1, ..., p_n, is not identically zero.

Consequently, the set of points (p_1, ..., p_n) ∈ [0, 1]^n on which this polynomial takes value 0 has measure 0. The union of all such sets for all (finitely many) choices of i and f still has measure 0, and the claim is proved.

6. CONCLUSION

A major goal for future research is to give an algorithm which runs in polynomial time for k = log n or even k = ω(1). We hope that further study of the structural properties of Boolean functions will lead to such an algorithm. Right now, the bottleneck preventing an improved runtime for our algorithm is the case of strongly balanced juntas. A. Kalai has asked the following question:

Question: Is it true that for any Boolean function f on k bits which is strongly balanced up to size 2k/3, there is a restriction fixing at most 2k/3 bits under which f becomes a parity function?

If the answer were yes, then it would be straightforward to give a learning algorithm for k-juntas running in time n^{2k/3}. (Of course, another way to get such an algorithm would be to give a quadratic algorithm for matrix multiplication!)

Finally, we close by observing that there are still several important generalizations of the k-junta problem for which no algorithm with running time better than n^{(1−o(1))k} is known. Can we learn juntas under any fixed nonuniform product distribution? Can we learn ternary juntas (i.e., functions on {0, 1, 2}^n with k relevant variables) under the uniform distribution?

7. ACKNOWLEDGEMENTS

We would like to thank Adam Kalai and Yishay Mansour for helpful discussions and for telling us about [12, 18].

8. REFERENCES

[1] N. Bshouty and V. Feldman. On using extended statistical queries to avoid membership queries. Journal of Machine Learning Research, 2:359-395, 2002.
[2] A. Bernasconi. On a hierarchy of Boolean functions hard to compute in constant depth. Discrete Mathematics & Theoretical Computer Science, 4(2):79-90, 2001.
[3] A. Blum. Relevant examples and relevant features: Thoughts from computational learning theory. In AAAI Fall Symposium on Relevance, 1994.
[4] A. Blum and P. Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1-2):245-271, 1997.
[5] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Occam's razor. Information Processing Letters, 24:377-380, 1987.

[6] N. Bshouty, J. Jackson, and C. Tamon. More efficient PAC learning of DNF with membership queries under the uniform distribution. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pages 286-295, 1999.
[7] D. Coppersmith and S. Winograd. Matrix multiplication via arithmetic progressions. In Proceedings of the Nineteenth Symposium on Theory of Computing, pages 1-6, 1987.
[8] E. Fischer, G. Kindler, D. Ron, S. Safra, and A. Samorodnitsky. Testing juntas. In Proceedings of the 43rd IEEE Symposium on Foundations of Computer Science, 2002.
[9] D. Helmbold, R. Sloan, and M. Warmuth. Learning integer lattices. SIAM Journal on Computing, 21(2):240-266, 1992.
[10] J. Jackson. An efficient membership-query algorithm for learning DNF with respect to the uniform distribution. Journal of Computer and System Sciences, 55:414-440, 1997.
[11] J. Jackson, A. Klivans, and R. Servedio. Learnability beyond AC^0. In Proceedings of the 34th ACM Symposium on Theory of Computing, 2002.
[12] A. Kalai and Y. Mansour. Personal communication, 2001.
[13] M. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM, 45(6):983-1006, 1998.
[14] L. Kucera, A. Marchetti-Spaccamela, and M. Protassi. On learning monotone DNF formulae under uniform distributions. Information and Computation, 110:84-95, 1994.
[15] N. Linial, Y. Mansour, and N. Nisan. Constant depth circuits, Fourier transform and learnability. Journal of the ACM, 40(3):607-620, 1993.
[16] Y. Mansour. Learning Boolean functions via the Fourier transform, pages 391-424. 1994.
[17] Y. Mansour. An O(n^{log log n}) learning algorithm for DNF under the uniform distribution. Journal of Computer and System Sciences, 50:543-550, 1995.
[18] Y. Mansour. Personal communication, 2001.
[19] L. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, 1984.
[20] K. Verbeurgt. Learning DNF under the uniform distribution in quasi-polynomial time. In Proceedings of the Third Annual Workshop on Computational Learning Theory, pages 314-326, 1990.
[21] K. Verbeurgt. Learning sub-classes of monotone DNF on the uniform distribution. In Proceedings of the Ninth Conference on Algorithmic Learning Theory, pages 385-399, 1998.
[22] J. von zur Gathen and J. Roche. Polynomials with two values. Combinatorica, 17(3):345-362, 1997.