THE CHOW PARAMETERS PROBLEM

RYAN O'DONNELL AND ROCCO A. SERVEDIO

(R. O'Donnell: Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, Pennsylvania; odonnell@cs.cmu.edu. Supported in part by NSF award CCF, a CyLab Research Grant, an Okawa Foundation Grant, and a Sloan Foundation Fellowship. R. A. Servedio: Columbia University, 1214 Amsterdam Avenue, New York, New York; rocco@cs.columbia.edu. Supported in part by NSF award CCF, NSF award CCF, and a Sloan Foundation Fellowship.)

Abstract. In the 2nd Annual FOCS (1961), Chao-Kong Chow proved that every Boolean threshold function is uniquely determined by its degree-0 and degree-1 Fourier coefficients. These numbers became known as the Chow Parameters. Providing an algorithmic version of Chow's Theorem, i.e., efficiently constructing a representation of a threshold function given its Chow Parameters, has remained open ever since. This problem has received significant study in the fields of circuit complexity, game theory and the design of voting systems, and learning theory. In this paper we effectively solve the problem, giving a randomized PTAS with the following behavior: Given the Chow Parameters of a Boolean threshold function $f$ over $n$ bits and any constant $\epsilon > 0$, the algorithm runs in time $O(n^2 \log^2 n)$ and with high probability outputs a representation of a threshold function $f'$ which is $\epsilon$-close to $f$. Along the way we prove several new results of independent interest about Boolean threshold functions. In addition to various structural results, these include $\tilde{O}(n^2)$-time learning algorithms for threshold functions under the uniform distribution in the following models: (i) The Restricted Focus of Attention model, answering an open question of Birkendorf et al. (ii) An agnostic-type model; this contrasts with recent results of Guruswami and Raghavendra, who show NP-hardness for the problem under general distributions. (iii) The PAC model, with constant $\epsilon$. Our $\tilde{O}(n^2)$-time algorithm substantially improves on the previous best known running time and nearly matches the $\Omega(n^2)$ bits of training data that any successful learning algorithm must use.

Key words. Chow Parameters, threshold functions, approximation, learning theory

AMS subject classifications. 94C10, 06E30, 68Q32, 68R99, 91B12, 91B14, 42C10

1. Introduction. This paper is concerned with Boolean threshold functions:

Definition 1.1. A Boolean function $f : \{-1,1\}^n \to \{-1,1\}$ is a threshold function if it is expressible as $f(x) = \mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_n x_n)$ for some real numbers $w_0, w_1, \ldots, w_n$.

Boolean threshold functions are of fundamental interest in circuit complexity, game theory/voting theory, and learning theory. Early computer scientists studying switching functions (i.e., Boolean functions) spent an enormous amount of effort on the class of threshold functions; see for instance the books [10, 26, 36, 48, 38] on this topic. More recently, researchers in circuit complexity have worked to understand the computational power of threshold functions and shallow circuits with these functions as gates; see e.g. [21, 45, 24, 25, 22]. In game theory and social choice theory, where simple cooperative games [42] correspond to monotone Boolean functions, threshold functions (with nonnegative weights) are known as weighted majority games and have been extensively studied as models for voting; see e.g. [43, 27, 11, 54]. Finally, in various guises, the problem of learning an unknown threshold function ("halfspace") has arguably been the central problem in machine learning for much of the last two decades, with algorithms such as Perceptron, Weighted Majority, boosting, and support vector machines emerging as central tools in the field.
A beautiful result of C.-K. Chow from the 2nd FOCS conference [9] gives a surprising characterization of Boolean threshold functions: among all Boolean functions, each threshold function $f : \{-1,1\}^n \to \{-1,1\}$ is uniquely determined by the "center of mass" of its positive inputs, $\mathrm{avg}\{x \in \{-1,1\}^n : f(x) = 1\}$, and the number of positive inputs $\#\{x : f(x) = 1\}$. These $n+1$ parameters of $f$ are equivalent, after scaling and additive shifting, to its degree-0 and degree-1 Fourier coefficients (and also, essentially, to its "influences" or "Banzhaf power indices"). We give a formal definition:

Definition 1.2. Given any Boolean function $f : \{-1,1\}^n \to \{-1,1\}$, its Chow Parameters¹ are the rational numbers $\hat{f}(0), \hat{f}(1), \ldots, \hat{f}(n)$ defined by
$$\hat{f}(0) = \mathbf{E}[f(x)], \qquad \hat{f}(i) = \mathbf{E}[f(x)\,x_i] \quad \text{for } 1 \le i \le n.$$
We also say the Chow Vector of $f$ is $\vec{\chi} = \vec{\chi}_f = (\hat{f}(0), \hat{f}(1), \ldots, \hat{f}(n))$.

Throughout this paper the notation $\mathbf{E}[\cdot]$ and $\Pr[\cdot]$ refers to an $x \in \{-1,1\}^n$ chosen uniformly at random. (We note that this corresponds to the "Impartial Culture Assumption" in the theory of social choice [19].) Our notation slightly abuses the standard Fourier coefficient notation of $\hat{f}(\emptyset)$ and $\hat{f}(\{i\})$.

Chow's Theorem implies that the following algorithmic problem is in principle solvable:

The Chow Parameters Problem. Given the Chow Parameters $\hat{f}(0), \hat{f}(1), \ldots, \hat{f}(n)$ of a Boolean threshold function $f$, output a representation of $f$ as $f(x) = \mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_n x_n)$.

Unfortunately, the proof of Chow's Theorem (reviewed in Section 2.3) is completely nonconstructive and does not suggest any algorithm, much less an efficient one. As we now briefly describe, over the past five decades the Chow Parameters problem has been considered by researchers in a range of different fields.

1.1. Background on the Chow Parameters problem. As far back as 1960, researchers studying Boolean functions were interested in finding an efficient algorithm for the Chow Parameters problem [14]. Electrical engineers at the time faced the following problem: Given an explicit truth table, determine if it can be realized as a threshold circuit and if so, which one. The Chow Parameters are easily computed from a truth table, and Chow's Theorem implies that they give a unique representation for every threshold function. Several heuristics were proposed for the Chow Parameters problem [30, 56, 29, 10], an empirical study was performed to compare various methods [58], and lookup tables were produced mapping Chow Vectors into weights-based representations for each threshold function on six [39], seven [57], and eight [41] bits. Winder provides a good early survey [59]. Generalizations of Chow's Theorem were later given in [7, 46].

Researchers in game theory have also considered the Chow Parameters problem; Chow's Theorem was independently rediscovered by the game theorist Lapidot [34] and subsequently studied in [11, 13, 54, 18]. In the realm of social choice and voting theory the Chow Parameters represent the Banzhaf power indices [43, 2] of the $n$ voters, a measure of each one's influence over the outcome. Here the Chow Parameters problem is very natural: Consider designing a voting rule for, say, the European Union. Target Banzhaf power indices are given, usually in proportion to the square root of the states' populations, and one wishes to come up with a weighted majority voting rule whose power indices are as close to the targets as possible. Researchers in voting theory have recently devoted significant attention to this problem [35, 8], calling it a "fundamental constitutional problem" [16] and in particular considering its computational complexity [51, 1].
¹ Chow's Theorem was proven simultaneously by Tannenbaum [53], but the terminology "Chow Parameters" has stuck.
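To make Definition 1.2 concrete, here is a small illustrative sketch (our own Python code, not part of the paper) that computes the Chow Vector of a threshold function exactly by enumerating $\{-1,1\}^n$ for small $n$, and also estimates it from uniform random examples; the sampling version is what the approximate variant of the problem discussed below works with. The sign convention at $0$ and all parameter choices are arbitrary.

```python
import itertools
import random

def threshold_fn(w0, w):
    """Return f(x) = sgn(w0 + w.x) as a callable on tuples x in {-1,1}^n.
    Ties at 0 are broken toward +1 (an arbitrary convention for this sketch)."""
    return lambda x: 1 if w0 + sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

def chow_vector_exact(f, n):
    """Chow Vector (f^(0), f^(1), ..., f^(n)) by brute force over {-1,1}^n."""
    chow = [0.0] * (n + 1)
    for x in itertools.product([-1, 1], repeat=n):
        fx = f(x)
        chow[0] += fx                      # degree-0 coefficient E[f(x)]
        for i in range(n):
            chow[i + 1] += fx * x[i]       # degree-1 coefficients E[f(x) x_i]
    return [c / 2 ** n for c in chow]

def chow_vector_estimate(f, n, samples=50000, rng=random):
    """Estimate the Chow Vector from uniform random examples (x, f(x))."""
    chow = [0.0] * (n + 1)
    for _ in range(samples):
        x = [rng.choice([-1, 1]) for _ in range(n)]
        fx = f(x)
        chow[0] += fx
        for i in range(n):
            chow[i + 1] += fx * x[i]
    return [c / samples for c in chow]

if __name__ == "__main__":
    n = 8
    f = threshold_fn(0, [4, 3, 3, 2, 2, 1, 1, 1])
    print(chow_vector_exact(f, n))
    print(chow_vector_estimate(f, n))
```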

The Chow Parameters problem also has motivation from learning theory. Ben-David and Dichterman [3] introduced the "Restricted Focus of Attention" (RFA) model to formalize the idea that learning algorithms often have only partial access to each example vector. Birkendorf et al. [5] performed a comprehensive study of the RFA model and observed that the approximation version of the Chow Parameters problem (given approximate Chow Parameters, output an approximating threshold function) is equivalent to the problem of efficiently learning threshold functions under the uniform distribution in the 1-RFA model. (In the 1-RFA model the learner is only allowed to see one bit of each example string in addition to the label; we give details in Section 10.) As the main open question posed in [5], Birkendorf et al. asked whether there is an efficient uniform distribution learning algorithm for threshold functions in the 1-RFA model. This question motivated subsequent research [20, 47] which gave information-theoretic sample complexity upper bounds for this learning problem (see Section 3); however no computationally efficient algorithm was previously known.

To summarize, we believe that the range of different contexts in which the Chow Parameters Problem has arisen is evidence of its fundamental status.

1.2. The Chow Parameters problem reframed as an approximation problem. It is unlikely that the Chow Parameters Problem can be solved exactly in polynomial time; note that even checking the correctness of a candidate solution is #P-complete, because computing $\hat{f}(0)$ is equivalent to counting 0-1 knapsack solutions. Thus, as is implicitly proposed in [5, 1], it is natural to look for a polynomial-time approximation scheme (PTAS). Here we mean an approximation in the following sense:

Definition 1.3. The distance between two Boolean functions $f, g : \{-1,1\}^n \to \{-1,1\}$ is $\mathrm{dist}(f,g) \stackrel{\mathrm{def}}{=} \Pr[f(x) \neq g(x)]$. If $\mathrm{dist}(f,g) \le \epsilon$ we say that $f$ and $g$ are $\epsilon$-close.

We would like a PTAS which, given a value $\epsilon$ and the Chow Parameters of $f$, outputs a (representation of a) threshold function $f'$ that is $\epsilon$-close to $f$. With this relaxed goal of approximating $f$, one may even tolerate only an approximation of the Chow Parameters of $f$; this gives us the variant of the problem that Birkendorf et al. considered. (Note that, as we discuss in Section 3, it is in no way obvious that approximate Chow Parameters even information-theoretically specify an approximator to $f$.) In particular the following notion of approximate Chow Parameters proves to be most natural:

Definition 1.4. Let $f, g : \{-1,1\}^n \to \{-1,1\}$. We define
$$d_{\mathrm{Chow}}(f,g) \stackrel{\mathrm{def}}{=} \sqrt{\sum_{j=0}^{n} (\hat{f}(j) - \hat{g}(j))^2}$$
to be the Chow Distance between $f$ and $g$.

1.3. Our results. Our main result is an efficient PTAS $\mathcal{A}$ for the Chow Parameters problem which succeeds given approximations to the Chow Parameters. We prove:

Main Theorem. There is a function $\kappa(\epsilon) = 2^{-\tilde{O}(1/\epsilon^2)}$ such that the following holds: Let $f : \{-1,1\}^n \to \{-1,1\}$ be a threshold function and let $0 < \epsilon < 1/2$. Write $\vec{\chi}$ for the Chow Vector of $f$ and assume that $\vec{\alpha}$ is a vector satisfying $\|\vec{\alpha} - \vec{\chi}\| \le \kappa(\epsilon)$. Then given as input $\vec{\alpha}$ and $\epsilon$, the algorithm $\mathcal{A}$ performs $2^{\mathrm{poly}(1/\kappa(\epsilon))} \cdot n^2 \log n \cdot \log(\frac{n}{\delta})$ bit operations and outputs the (weights-based) representation of a threshold function $f^*$ which with probability at least $1-\delta$ satisfies $\mathrm{dist}(f, f^*) \le \epsilon$.

Although the running time dependence on $\epsilon$ is doubly-exponential, we emphasize that the polynomial dependence on $n$ is quadratic, independent of $\epsilon$; i.e., $\mathcal{A}$ is an EPTAS. Some of our learning applications have only singly-exponential dependence on $\epsilon$.
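Definitions 1.3 and 1.4 translate directly into code. The short sketch below (again our own illustration, not from the paper) computes $\mathrm{dist}(f,g)$ and $d_{\mathrm{Chow}}(f,g)$ exactly by brute force for small $n$; for large $n$ only sampling-based estimates of these quantities are feasible, which is the setting of the Main Theorem.

```python
import itertools
import math

def dist(f, g, n):
    """dist(f,g) = Pr_x[f(x) != g(x)] under the uniform distribution on {-1,1}^n."""
    disagree = sum(1 for x in itertools.product([-1, 1], repeat=n) if f(x) != g(x))
    return disagree / 2 ** n

def chow_distance(f, g, n):
    """d_Chow(f,g) = sqrt(sum_{j=0}^n (f^(j) - g^(j))^2), computed by brute force."""
    diffs = [0.0] * (n + 1)
    for x in itertools.product([-1, 1], repeat=n):
        d = f(x) - g(x)
        diffs[0] += d                      # difference of degree-0 coefficients
        for i in range(n):
            diffs[i + 1] += d * x[i]       # differences of degree-1 coefficients
    return math.sqrt(sum((d / 2 ** n) ** 2 for d in diffs))
```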
1.4. Our approach. We briefly describe the two main ingredients of our approach and explain how we combine them to obtain the efficient algorithm $\mathcal{A}$.

First ingredient: small Chow Distance from a threshold function implies small distance. An immediate question that arises when thinking about the Chow Parameters problem is how to recognize whether a candidate solution is a good one. If we are given the Chow Vector $\vec{\chi}_f$ of an unknown threshold function $f$ and we have a candidate threshold function $g$, we can approximate the Chow Vector $\vec{\chi}_g$ of $g$ by sampling. The following Proposition is easily proved via Fourier analysis in Section 2.3:

Proposition 1.5. $d_{\mathrm{Chow}}(f,g) \le 2\sqrt{\mathrm{dist}(f,g)}$.

This means that if $d_{\mathrm{Chow}}(f,g)$ is large then $f$ and $g$ are far apart. But if $d_{\mathrm{Chow}}(f,g)$ is small, does this necessarily mean that $f$ and $g$ are close? This question has been studied in the learning theory community, in [5] (for threshold functions with small integer weights), [20], and [47]. In Section 3 we show that the answer is yes by proving the following "robust" version of Chow's Theorem:

Theorem 1.6. Let $f : \{-1,1\}^n \to \{-1,1\}$ be any threshold function and let $g : \{-1,1\}^n \to \{-1,1\}$ be any Boolean function such that $d_{\mathrm{Chow}}(f,g) \le \epsilon$. Then $\mathrm{dist}(f,g) \le \tilde{O}\big(1/\sqrt{\log(1/\epsilon)}\big)$.

This is the first result of this nature that is completely independent of $n$. A key ingredient in the proof of Theorem 1.6 is a new result showing that every threshold function $f$ is extremely close to a threshold function $f'$ for which only a very small fraction of points have small "margin" (see Section 6 for a precise statement). We feel that this and Theorem 1.6 have independent interest as structural results about threshold functions.

Second ingredient: using the Chow Parameters as weights. The second ingredient in our approach is to establish a result, Theorem 7.1, having the following corollary:

Corollary 7.2. There is an absolute constant $C > 0$ such that the following holds. Let $f(x) = \mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_n x_n)$ be any threshold function, and let $H$ be the set of $1/\epsilon^C$ indices $i$ for which $|w_i|$ is largest.² Then there exists a threshold function $f'(x) = \mathrm{sgn}(v_0 + v_1 x_1 + \cdots + v_n x_n)$ with $\mathrm{dist}(f, f') \le \epsilon$ in which the weights $v_i$ for $i \in [n] \setminus H$ are the Chow Parameters $\hat{f}(i)$ themselves.

² As we discuss at the beginning of Section 7, for any threshold function $f$ the value $|\hat{f}(i)|$ is equal to $\mathrm{Inf}_i(f)$, the influence of the $i$-th variable on $f$. It is well known and easy to show (see e.g. Lemma 7 of [17]) that for a threshold function $f(x) = \mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_n x_n)$, if $\mathrm{Inf}_i(f) > \mathrm{Inf}_j(f)$ then $|w_i| > |w_j|$. So we may equivalently view $H$ as the set of $1/\epsilon^C$ indices $i$ for which $|\hat{f}(i)|$ is largest.

The heuristic of using the Chow Parameters as possible weights was considered by several researchers in the early 60s (see [59]); however no theorem on the efficacy of this approach was previously known. Our proof of Theorem 7.1 and its robust version Theorem 7.4 rely in part on recent work of Matulef et al. on Property Testing for threshold functions [37].

The algorithm and intuitive explanation. Given these two ingredients, our PTAS $\mathcal{A}$ for the approximate Chow Parameters problem works by constructing a small (depending only on $\epsilon$) number of candidate threshold functions. It enumerates all (in some sense) possible weight settings for the indices in $H$, and for each one produces a candidate threshold function by setting the remaining weights equal to the given Chow Parameters. The second ingredient tells us that at least one of these candidate threshold functions must be close to the unknown threshold function $f$, and thus must have small Chow Distance to $f$, by Proposition 1.5. Now the first ingredient tells us that any threshold function whose Chow Distance to the target Chow Vector is small must itself be close to the target. So the algorithm can estimate each of the candidates' Chow Vectors (this takes $\tilde{O}(n^2)$ time) and output any candidate whose Chow Distance to the target vector is small.
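The following schematic (our own heavily simplified sketch in Python, not the paper's actual algorithm $\mathcal{A}$) mirrors the structure just described: choose the head $H$ by largest degree-1 Chow Parameters (using the observation in the footnote to Corollary 7.2), enumerate a grid of head weights, fill in the tail with the given Chow Parameters, and keep a candidate whose sampled Chow Vector is close to the target. The grid, the head size, the acceptance threshold, and the sample size are placeholders; the paper's analysis dictates how such quantities must actually be set in terms of $\epsilon$.

```python
import itertools
import math
import random

def algorithm_A_schematic(alpha, head_size=2,
                          grid=(-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0),
                          threshold=0.1, samples=5000):
    """Schematic only: alpha is an (approximate) target Chow Vector of length n+1.
    Returns candidate weights (w0, ..., wn) whose sampled Chow Vector is within
    `threshold` of alpha, or None if no enumerated candidate qualifies."""
    n = len(alpha) - 1
    # Head H = indices with the largest degree-1 Chow Parameters (= largest influences).
    head = sorted(range(1, n + 1), key=lambda i: -abs(alpha[i]))[:head_size]
    tail = [i for i in range(1, n + 1) if i not in head]

    for head_choice in itertools.product(grid, repeat=head_size + 1):  # +1 for w0
        w = [0.0] * (n + 1)
        w[0] = head_choice[0]
        for h, v in zip(head, head_choice[1:]):
            w[h] = v
        for i in tail:
            w[i] = alpha[i]                 # tail weights := given Chow Parameters
        g = lambda x: 1 if w[0] + sum(w[i] * x[i - 1] for i in range(1, n + 1)) >= 0 else -1
        # Estimate the candidate's Chow Vector by sampling and compare with alpha.
        est = [0.0] * (n + 1)
        for _ in range(samples):
            x = [random.choice([-1, 1]) for _ in range(n)]
            gx = g(x)
            est[0] += gx
            for i in range(1, n + 1):
                est[i] += gx * x[i - 1]
        est = [e / samples for e in est]
        if math.sqrt(sum((a - e) ** 2 for a, e in zip(alpha, est))) <= threshold:
            return w
    return None
```

In the paper the number of candidates depends only on $\epsilon$, and estimating each candidate's Chow Vector takes $\tilde{O}(n^2)$ time, which is where the overall quadratic dependence on $n$ comes from.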
1.5. Consequences in learning theory. As we show in Section 10, our approach yields a range of new algorithmic results in learning theory. Our Main Theorem directly gives the first $\mathrm{poly}(n)$-time algorithm for learning threshold functions in the uniform distribution 1-RFA model, answering the question of [5]:

Theorem 1.7. There is an algorithm which performs $2^{2^{\tilde{O}(1/\epsilon^2)}} \cdot n^2 \log n \cdot \log(\frac{n}{\delta})$ bit operations and properly learns threshold functions to accuracy $\epsilon$ and confidence $1-\delta$ in the uniform distribution 1-RFA model.

A variant of our algorithm gives a very fast "agnostic-type" learning algorithm for threshold functions (equivalently, an algorithm for learning Boolean threshold functions from uniformly distributed examples when there is adversarial noise in the labels):

Theorem 1.8. Let $g$ be any Boolean function and let $\mathrm{opt} = \min_f \Pr[f(x) \neq g(x)]$, where the min is over all threshold functions and the probability is uniform over $\{-1,1\}^n$. Given an input parameter $\epsilon > 0$ and access to independent uniform examples $(x, g(x))$, algorithm $\mathcal{B}$ outputs the (weights-based) representation of a threshold function $f^*$ which with probability at least $1-\delta$ satisfies $\Pr[f^*(x) \neq g(x)] \le O(\mathrm{opt}^{\Omega(1)}) + \epsilon$. The algorithm performs $\mathrm{poly}(1/\epsilon) \cdot n^2 \log(\frac{n}{\delta}) + 2^{\mathrm{poly}(1/\epsilon)} \cdot n \log n \cdot \log(\frac{1}{\delta})$ bit operations.

For example, if $\mathrm{opt} = 1/\log(n)$, our algorithm takes time $O(n^2 \log n \log(\frac{n}{\delta}))$ and outputs a hypothesis with accuracy $1 - 1/\log^{\Omega(1)}(n)$. Theorem 1.8 is in interesting contrast with the algorithm of Kalai et al. [28], which constructs an $(\mathrm{opt} + \epsilon)$-accurate hypothesis but runs in $n^{\mathrm{poly}(1/\epsilon)}$ time (and does not output a threshold function). As we discuss in Section 10, recent hardness results of Guruswami and Raghavendra [23] imply that if $\mathrm{P} \neq \mathrm{NP}$ there can be no algorithm comparable to ours for learning under arbitrary (as opposed to uniform) distributions over $\{-1,1\}^n$.

Finally, as a corollary of Theorem 1.8, we obtain a uniform-distribution PAC learning algorithm for threshold functions that runs in time $\tilde{O}(n^2)$ for learning to constant accuracy $\epsilon = \Theta(1)$. The fastest previous algorithm we are aware of for learning arbitrary threshold functions in this model (linear programming, using Vaidya [55]) runs in $\tilde{O}(n^{4.5}) \cdot \mathrm{poly}(1/\epsilon)$ time. Thus our algorithm is significantly faster for learning to accuracy $\epsilon = \Theta(1)$, and in fact is faster as long as $\epsilon < 1/(\log n)^c$ for a sufficiently small constant $c > 0$. As we explain later, our time bound is very close to the $\Omega(n^2)$ bits of input that any learning algorithm must use.

2. Preliminaries.

2.1. Fourier analysis. This paper extensively uses the basics of Fourier analysis over the Boolean cube $\{-1,1\}^n$. We give a brief review. We consider functions $f : \{-1,1\}^n \to \mathbb{R}$ (though we often focus on Boolean-valued functions which map to $\{-1,1\}$), and we think of the inputs $x$ to $f$ as being distributed according to the uniform probability distribution. The set of such functions forms a $2^n$-dimensional inner product space with inner product given by $\langle f, g \rangle = \mathbf{E}_x[f(x)g(x)]$. The set of functions $(\chi_S)_{S \subseteq [n]}$ defined by $\chi_S(x) = \prod_{i \in S} x_i$ forms a complete orthonormal basis for this space. We will also often write simply $x^S$ for $\prod_{i \in S} x_i$. Given a function $f : \{-1,1\}^n \to \mathbb{R}$ we define its Fourier coefficients by $\hat{f}(S) = \mathbf{E}_x[f(x)\,x^S]$, and we have that $f(x) = \sum_S \hat{f}(S)\,x^S$. As an easy consequence of orthonormality we have Plancherel's identity $\langle f, g \rangle = \sum_S \hat{f}(S)\hat{g}(S)$, which has as a special case Parseval's identity, $\mathbf{E}_x[f(x)^2] = \sum_S \hat{f}(S)^2$. From this it follows that for every $f : \{-1,1\}^n \to \{-1,1\}$ we have $\sum_S \hat{f}(S)^2 = 1$.
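As a quick sanity check of these identities (our own illustration, not from the paper), one can compute all $2^n$ Fourier coefficients of a small Boolean function by brute force and verify Parseval's identity.

```python
import itertools

def fourier_coefficients(f, n):
    """All Fourier coefficients f^(S) = E_x[f(x) x^S], indexed by frozenset S."""
    points = list(itertools.product([-1, 1], repeat=n))
    coeffs = {}
    for bits in itertools.product([0, 1], repeat=n):
        S = frozenset(i for i in range(n) if bits[i])
        total = 0
        for x in points:
            chi_S = 1
            for i in S:
                chi_S *= x[i]
            total += f(x) * chi_S
        coeffs[S] = total / 2 ** n
    return coeffs

if __name__ == "__main__":
    n = 5
    majority = lambda x: 1 if sum(x) >= 0 else -1
    coeffs = fourier_coefficients(majority, n)
    print(sum(c ** 2 for c in coeffs.values()))   # Parseval: prints 1.0 up to rounding
```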
The following definitions are fairly standard in the analysis of Boolean functions:

Definition 2.1. A function $f : \{-1,1\}^n \to \{-1,1\}$ is said to be a junta on $J \subseteq [n]$ if $f$ only depends on the coordinates in $J$. Typically we think of $J$ as a small set in this case.

Definition 2.2. We say that $f : \{-1,1\}^n \to \mathbb{R}$ is $\tau$-regular if $|\hat{f}(i)| \le \tau$ for all $i \in [n]$.

The following simple lemma is implicit in [37]; we state and prove it explicitly here for completeness.

Lemma 2.3. Let $f : \{-1,1\}^n \to \{-1,1\}$ be a Boolean threshold function and let $J \subseteq [n]$ be any subset of coordinates. If $f$ is $\tau$-close to a junta on $J$, then $f$ is $\tau$-close to a junta on $J$ which is itself a Boolean threshold function.

Proof. We assume without loss of generality that $J$ is the set $\{1, \ldots, r\}$. It is clear that the junta over $\{-1,1\}^r$ to which $f$ is closest is the function $g(x_1, \ldots, x_r)$ that maps each input $(x_1, \ldots, x_r)$ to the more commonly occurring value of the restricted function $f_{x_1, \ldots, x_r}$ (a function of the variables $x_{r+1}, \ldots, x_n$). But for $f(x) = \mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_n x_n)$ this more common value will be $\mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_r x_r)$, because for uniform $(x_{r+1}, \ldots, x_n) \in \{-1,1\}^{n-r}$ the random variable $w_{r+1} x_{r+1} + \cdots + w_n x_n$ is centered around zero.

We will also require the following lemma, which gives a lower bound on the degree-1 Fourier weight of any threshold function in terms of its bias:

Lemma 2.4. Let $f : \{-1,1\}^n \to \{-1,1\}$ be a Boolean threshold function and suppose that $1 - \mathbf{E}[f] = p$. Then
$$\sum_{i=1}^{n} \hat{f}(i)^2 \ge p^2/2.$$

Before giving the proof let us contrast this lemma with some known results. Proposition 2.2 of Talagrand [52] gives a general upper bound $\sum_{i=1}^{n} \hat{f}(i)^2 \le O(p^2 \log(1/p))$ for any Boolean function satisfying $1 - \mathbf{E}[f] = p$. In [37] it is shown that a slightly stronger bound $\Theta(p^2 \log(1/p))$ holds for threshold functions $f$ that are sufficiently $\tau$-regular. However, when we use Lemma 2.4 we will not have regularity (and even if we did, the extra log factor would not end up improving any of our bounds).

Proof. Write $f(x) = \mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_n x_n)$, where we assume without loss of generality that $\sum_{j=1}^{n} w_j^2 = 1$ and that $w_0 + w_1 x_1 + \cdots + w_n x_n \neq 0$ for all $x \in \{-1,1\}^n$. We have
$$\mathbf{E}[f(x)(w \cdot x)] = \sum_{i=1}^{n} \hat{f}(i) w_i \le \sqrt{\sum_{i=1}^{n} \hat{f}(i)^2},$$
where the equality is Plancherel's identity and the inequality is Cauchy-Schwarz. On the other hand, using the definition of $f$ we obtain
$$\mathbf{E}[f(x)(w \cdot x)] = \mathbf{E}\big[\mathbf{1}_{\{|w \cdot x| \ge |w_0|\}}\,|w \cdot x|\big] = p \cdot \mathbf{E}\big[\,|w \cdot x| \;\big|\; |w \cdot x| \ge |w_0|\,\big].$$
The first equality above holds because each $x$ such that $|w \cdot x| < |w_0|$ can be paired with $-x$; the value of $f$ is the same on these two inputs, so their contributions to the expectation cancel each other out. The second equality above is a routine renormalization using the equality $1 - \mathbf{E}[f] = p$. We now recall the Khintchine inequality with best constant [50], which says that for any $w \in \mathbb{R}^n$ we have $\mathbf{E}[|w \cdot x|] \ge \frac{1}{\sqrt{2}}\|w\|$. Since $\|w\| = 1$ in our setting, we get $\mathbf{E}[|w \cdot x|] \ge \frac{1}{\sqrt{2}}$, so surely $\mathbf{E}\big[\,|w \cdot x| \,\big|\, |w \cdot x| \ge |w_0|\,\big] \ge 1/\sqrt{2}$. Thus combining all statements yields $\sqrt{\sum_{i=1}^{n} \hat{f}(i)^2} \ge p/\sqrt{2}$, completing the proof.

2.2. Mathematical tools. We use the following simple estimate on several occasions:

Fact 2.5. Suppose $A$ and $B$ are nonnegative and $|A - B| \le \eta$. Then $|\sqrt{A} - \sqrt{B}| \le \eta/\sqrt{B}$.

Proof. $|\sqrt{A} - \sqrt{B}| = \frac{|A - B|}{\sqrt{A} + \sqrt{B}} \le \frac{\eta}{\sqrt{B}}$.

We also will need some results from probability theory:

Definition 2.6. We write $\Phi$ for the c.d.f. (cumulative distribution function) of a standard mean-0, variance-1 Gaussian random variable. We extend the notation by writing $\Phi[a,b]$ to denote $|\Phi(b) - \Phi(a)|$, allowing $b < a$. Finally, we will use the estimate $\Phi[a,b] \le |b - a|$ without comment.

The Berry-Esseen theorem is a version of the Central Limit Theorem with explicit error bounds:

Theorem 2.7. (Berry-Esseen) Let $X_1, \ldots, X_n$ be a sequence of independent random variables satisfying $\mathbf{E}[X_i] = 0$ for all $i$, $\sqrt{\sum_i \mathbf{E}[X_i^2]} = \sigma$, and $\sum_i \mathbf{E}[|X_i|^3] = \rho_3$. Let $S = (X_1 + \cdots + X_n)/\sigma$ and let $F$ denote the c.d.f. of $S$. Then
$$\sup_x |F(x) - \Phi(x)| \le C\rho_3/\sigma^3,$$
where $\Phi$ is the c.d.f. of a standard Gaussian random variable, and $C$ is a universal constant. It is known [49] that one can take $C = .7915$.

Corollary 2.8. Let $x_1, \ldots, x_m$ denote independent $\pm 1$ random bits and let $w_1, \ldots, w_m \in \mathbb{R}$. Write $\sigma = \sqrt{\sum_i w_i^2}$, and assume $|w_i|/\sigma \le \tau$ for all $i$. Then for any interval $[a,b] \subseteq \mathbb{R}$,
$$\Big|\Pr[a \le w_1 x_1 + \cdots + w_m x_m \le b] - \Phi\big[\tfrac{a}{\sigma}, \tfrac{b}{\sigma}\big]\Big| \le 2\tau.$$
In particular, $\Pr[a \le w_1 x_1 + \cdots + w_m x_m \le b] \le \frac{b-a}{\sigma} + 2\tau$.
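The following small experiment (ours, not the paper's) illustrates Corollary 2.8: for a regular weight vector, the probability that $w_1 x_1 + \cdots + w_m x_m$ lands in an interval $[a,b]$ is close to the Gaussian estimate $\Phi[a/\sigma, b/\sigma]$, up to an error controlled by $\tau = \max_i |w_i|/\sigma$. The choices of $m$, the interval, and the sample size are arbitrary.

```python
import math
import random

def Phi(z):
    # Standard Gaussian c.d.f. via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def interval_probability(w, a, b, samples=50000, rng=random):
    """Estimate Pr[a <= w_1 x_1 + ... + w_m x_m <= b] for uniform x in {-1,1}^m."""
    hits = 0
    for _ in range(samples):
        s = sum(wi * rng.choice([-1, 1]) for wi in w)
        if a <= s <= b:
            hits += 1
    return hits / samples

if __name__ == "__main__":
    m = 100
    w = [1.0] * m                          # tau = 1/sqrt(m) = 0.1: fairly regular
    sigma = math.sqrt(sum(wi ** 2 for wi in w))
    a, b = -0.5 * sigma, 0.5 * sigma
    empirical = interval_probability(w, a, b)
    gaussian = Phi(b / sigma) - Phi(a / sigma)
    tau = max(abs(wi) for wi in w) / sigma
    print(empirical, gaussian, 2 * tau)    # |empirical - gaussian| should be <= 2*tau
```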
2.3. Margins, and Chow's Theorem. Having introduced Fourier analysis, we recall and prove Proposition 1.5:

Proposition 1.5. $d_{\mathrm{Chow}}(f,g) \le 2\sqrt{\mathrm{dist}(f,g)}$.

Proof. For $f, g : \{-1,1\}^n \to \{-1,1\}$ we have
$$\mathrm{dist}(f,g) = \tfrac{1}{4}\mathbf{E}[(f(x) - g(x))^2] = \tfrac{1}{4}\sum_{S \subseteq [n]} (\hat{f}(S) - \hat{g}(S))^2 \ge \tfrac{1}{4}\sum_{j=0}^{n} (\hat{f}(j) - \hat{g}(j))^2 = \tfrac{1}{4}\,d_{\mathrm{Chow}}(f,g)^2,$$
where the second equality is Parseval's identity.

Let us introduce a notion of "margin" for threshold functions:

Definition 2.9. Let $f : \{-1,1\}^n \to \{-1,1\}$ be a Boolean threshold function, $f(x) = \mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_n x_n)$, where the weights are scaled so that $\sum_{j \ge 0} w_j^2 = 1$. Given a particular input $x \in \{-1,1\}^n$ we define $\mathrm{marg}(f, x) = |w_0 + w_1 x_1 + \cdots + w_n x_n|$.³

³ This notation is slightly informal as it doesn't show the dependence on the representation of $f$.

Remark 2.10. The usual notion of margin from learning theory also involves scaling the data points $x$ so that $\|x\| \le 1$ for all $x$. Thus we have that the learning-theoretic margin of $f$ on $x$ is $\mathrm{marg}(f, x)/\sqrt{n}$.

We now present a proof of Chow's theorem from 1961:

Theorem 2.11. (Chow.) Let $f : \{-1,1\}^n \to \{-1,1\}$ be a Boolean threshold function and let $g : \{-1,1\}^n \to \{-1,1\}$ be a Boolean function such that $\hat{g}(j) = \hat{f}(j)$ for all $0 \le j \le n$. Then $g = f$.

Note that another way of phrasing this is: If $f$ is a Boolean threshold function, $g$ is a Boolean function, and $d_{\mathrm{Chow}}(f,g) = 0$, then $\mathrm{dist}(f,g) = 0$. Our Theorem 1.6 gives a robust version of this statement.

Proof. Write $f(x) = \mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_n x_n)$, where the weights are scaled so that $\sum_{j=0}^{n} w_j^2 = 1$. We may assume without loss of generality that $\mathrm{marg}(f, x) \neq 0$ for all $x$. (Otherwise, first perturb the weights slightly without changing $f$.) Now we have
$$0 = \sum_{j=0}^{n} w_j (\hat{f}(j) - \hat{g}(j)) = \mathbf{E}[(w_0 + w_1 x_1 + \cdots + w_n x_n)(f(x) - g(x))] = \mathbf{E}\big[\mathbf{1}_{\{f(x) \neq g(x)\}} \cdot 2\,\mathrm{marg}(f, x)\big].$$
The first equality is by the assumption that $\hat{f}(j) = \hat{g}(j)$ for all $0 \le j \le n$, the second equality is linearity of expectation (or Plancherel's identity), and the third equality uses the fact that $f(x) = \mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_n x_n)$. But since $\mathrm{marg}(f, x)$ is always strictly positive, we must have $\Pr[f(x) \neq g(x)] = 0$, as claimed.
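Proposition 1.5 is easy to exercise numerically. The self-contained sketch below (illustrative only, not from the paper) checks $d_{\mathrm{Chow}}(f,g) \le 2\sqrt{\mathrm{dist}(f,g)}$ on random pairs of threshold functions over a small number of variables.

```python
import itertools
import math
import random

def random_threshold(n, rng=random):
    w0 = rng.uniform(-1, 1)
    w = [rng.uniform(-1, 1) for _ in range(n)]
    return lambda x: 1 if w0 + sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

def check_proposition_1_5(n=8, trials=20):
    cube = list(itertools.product([-1, 1], repeat=n))
    for _ in range(trials):
        f, g = random_threshold(n), random_threshold(n)
        disagree = sum(1 for x in cube if f(x) != g(x)) / len(cube)
        diffs = [0.0] * (n + 1)
        for x in cube:
            d = f(x) - g(x)
            diffs[0] += d
            for i in range(n):
                diffs[i + 1] += d * x[i]
        d_chow = math.sqrt(sum((d / len(cube)) ** 2 for d in diffs))
        assert d_chow <= 2 * math.sqrt(disagree) + 1e-9, (d_chow, disagree)
    print("Proposition 1.5 held on all sampled pairs")

if __name__ == "__main__":
    check_proposition_1_5()
```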
3. First ingredient: small Chow Distance implies small distance. Our main result in this section is the following.

Theorem 1.6 Restated. Let $f : \{-1,1\}^n \to \{-1,1\}$ be any threshold function and let $g : \{-1,1\}^n \to \{-1,1\}$ be any Boolean function such that $d_{\mathrm{Chow}}(f,g) \le \epsilon$. Then $\mathrm{dist}(f,g) \le \tilde{O}\big(1/\sqrt{\log(1/\epsilon)}\big)$.⁴

⁴ For a quantity $q < 1$, the notation $\tilde{O}(q)$ means $O(q \log^c(1/q))$ for some absolute constant $c$.

Let us compare this with some recent results with a similar qualitative flavor. The main result of Goldberg [20] is a proof that for any threshold function $f$ and any Boolean function $g$, if $|\hat{f}(j) - \hat{g}(j)| \le (\epsilon/n)^{O(\log(n/\epsilon)\log(1/\epsilon))}$ for all $0 \le j \le n$, then $\mathrm{dist}(f,g) \le \epsilon$. Note that the condition of Goldberg's theorem requires that $d_{\mathrm{Chow}}(f,g) \le n^{-O(\log n)}$. Subsequently Servedio [47] showed that to obtain $\mathrm{dist}(f,g) \le \epsilon$ it suffices to have $|\hat{f}(j) - \hat{g}(j)| \le 1/(2^{\tilde{O}(1/\epsilon^2)}\sqrt{n})$ for all $0 \le j \le n$. This is a worse requirement in terms of $\epsilon$ but a better one in terms of $n$; however it still requires that $d_{\mathrm{Chow}}(f,g) \le 1/\sqrt{n}$. In contrast, Theorem 1.6 allows the Chow Distance between $f$ and $g$ to be an absolute constant independent of $n$. This independence of $n$ will be crucial later on when we use Theorem 1.6 to obtain a computationally efficient algorithm for the Chow Parameters problem.

At a high level, we prove Theorem 1.6 by giving a robust version of the proof of Chow's Theorem (Theorem 2.11). A first obvious approach to making the argument robust is to try to show that every threshold function has margin $\Omega(1)$ (independent of $n$) on every $x$. However this is well known to be badly false. A next attempt might be to show that every threshold function has a representation with margin $\Omega(1)$ on almost every $x$. This too turns out to be impossible (cf. our discussion after the statement of Lemma 5.1 below). The key to getting an $n$-independent margin lower bound is to also very slightly alter the threshold function. Specifically, the next few sections of the paper will be devoted to the proof of the following:

Theorem 3.1. Let $f : \{-1,1\}^n \to \{-1,1\}$ be any threshold function and let $\rho > 0$ be sufficiently small. Then there is a threshold function $f' : \{-1,1\}^n \to \{-1,1\}$ with $\mathrm{dist}(f, f') \le 2^{-1/\rho}$ satisfying
$$\Pr_x[\mathrm{marg}(f', x) \le \rho] \le \tilde{O}\big(1/\sqrt{\log(1/\rho)}\big).$$

In other words, any threshold function $f$ is very close to another threshold function $f'$ satisfying $\mathrm{marg}(f', x) \ge \Omega(1)$ for almost all $x$. We remark that although the fraction of points failing the margin bound could be as large as inverse-logarithmic in $\rho$, we only have to change $f$ on a fraction of points which is exponentially small in $1/\rho$ to achieve this. Theorem 3.1 is the key structural result for threshold functions that allows us to robustify the proof of Theorem 2.11. We will now show how Theorem 1.6 follows from Theorem 3.1.

Proof. (Theorem 1.6.) Given $f$, apply Theorem 3.1 with its parameter $\rho$ set (with foresight) to $\rho = \sqrt{\epsilon \log(1/\epsilon)}$. This yields a threshold function $f'(x) = \mathrm{sgn}(u_0 + u_1 x_1 + \cdots + u_n x_n)$, with $\sum_{j=0}^{n} u_j^2 = 1$, satisfying $\mathrm{dist}(f, f') \le 2^{-1/\rho} \le \epsilon$ and
$$\Pr_x[\mathrm{marg}(f', x) \le \rho] \le \tau \stackrel{\mathrm{def}}{=} \tilde{O}\big(1/\sqrt{\log(1/\rho)}\big) = \frac{\mathrm{poly}\log\log(1/\epsilon)}{\sqrt{\log(1/\epsilon)}}. \qquad (3.1)$$
Since $\mathrm{dist}(f, f') \le \epsilon$, by Proposition 1.5 we have $d_{\mathrm{Chow}}(f, f') \le 2\sqrt{\epsilon}$ and thus $d_{\mathrm{Chow}}(f', g) \le 3\sqrt{\epsilon}$ by the triangle inequality.
We now follow the proof of Chow's Theorem 2.11:
$$3\sqrt{\epsilon} \ge d_{\mathrm{Chow}}(f', g) = \sqrt{\sum_{j=0}^{n} u_j^2}\cdot\sqrt{\sum_{j=0}^{n} (\hat{f'}(j) - \hat{g}(j))^2} \ge \sum_{j=0}^{n} u_j(\hat{f'}(j) - \hat{g}(j)) = \mathbf{E}\big[\mathbf{1}_{\{f'(x) \neq g(x)\}} \cdot 2\,\mathrm{marg}(f', x)\big], \qquad (3.2)$$
where the second inequality is Cauchy-Schwarz. Now suppose that $\Pr[f'(x) \neq g(x)] \ge 2\tau$. Then by (3.1) we must have that for at least a $\tau$ fraction of $x$'s, both $f'(x) \neq g(x)$ and $\mathrm{marg}(f', x) > \rho$. This gives a contribution exceeding $\tau\rho$ to (3.2). But $\tau\rho = \sqrt{\epsilon}\cdot\mathrm{poly}\log\log(1/\epsilon) > 3\sqrt{\epsilon}$, a contradiction. Thus $\mathrm{dist}(f', g) \le 2\tau$ and so
$$\mathrm{dist}(f,g) \le \mathrm{dist}(f, f') + \mathrm{dist}(f', g) \le \epsilon + 2\tau = \tilde{O}\big(1/\sqrt{\log(1/\epsilon)}\big).$$

4. The critical index and anticoncentration. Fix a representation $f(x) = \mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_n x_n)$ of a threshold function. Throughout this section we adopt the convention that $w_1 \ge \cdots \ge w_n > 0$ (this will be without loss of generality, by permuting indices). The notion of the "critical index" of the sequence of weights $w_1, \ldots, w_n$ will be useful for us. Roughly speaking, it allows us to approximately decompose any linear form $w_0 + w_1 x_1 + \cdots + w_n x_n$ over random $\pm 1$ $x_i$'s into a short dominant "head", $w_0 + w_1 x_1 + \cdots + w_{\mathrm{small}} x_{\mathrm{small}}$, and a long remaining "tail" which acts like a Gaussian random variable. The $\tau$-critical index of $w_1, \ldots, w_n$ is essentially the least index $\ell$ for which the random variable $w_\ell x_\ell + \cdots + w_n x_n$ behaves like a Gaussian up to error $\tau$. The notion of a critical index was (implicitly) introduced and used in [47].

Towards proving a margin lower bound such as Theorem 3.1 for $f$, we need to show some kind of "anticoncentration" for the random variable $w_0 + w_1 x_1 + \cdots + w_n x_n$; we want it to rarely be near 0. Let us describe intuitively how analyzing the critical index helps us show this. If the critical index of $w_1, \ldots, w_n$ is large, then it must be the case that the initial weights $w_1, w_2, \ldots$ up to the critical index are rapidly decreasing (roughly speaking, if the weights $w_i, w_{i+1}, \ldots$ stayed about the same for a long stretch this would cause $w_i x_i + \cdots + w_n x_n$ to behave like a Gaussian). This rapid decrease can in turn be shown to imply that the head part $w_0 + w_1 x_1 + \cdots + w_{\mathrm{small}} x_{\mathrm{small}}$ is not too concentrated around any particular value; see Theorem 4.2 below. On the other hand, if the critical index $\ell$ is small, then the random variable $w_\ell x_\ell + \cdots + w_n x_n$ behaves like a Gaussian. Since Gaussians have good anticoncentration, the overall linear form $w_0 + w_1 x_1 + \cdots + w_n x_n$ will have good anticoncentration, regardless of the head part's value. We need to alter $f$ slightly to make these two cases go through, but having done so, we are able to bound the fraction of inputs $x$ for which $\mathrm{marg}(f, x)$ is very small, leading to Theorem 3.1.

We now give precise definitions. For $1 \le k \le n$ we write $\sigma_k$ to denote the 2-norm of the tail weights starting from $k$; i.e. $\sigma_k \stackrel{\mathrm{def}}{=} \sqrt{\sum_{i \ge k} w_i^2}$.
Definition 4.1. Fix a parameter $0 < \tau < 1/2$. We define the $\tau$-critical index of the weight vector $w$ to be the least index $\ell$ such that $w_\ell$ is small relative to $\sigma_\ell$ in the following sense:
$$\frac{w_\ell}{\sigma_\ell} \le \tau. \qquad (4.1)$$
(If no index $1 \le \ell \le n$ satisfies (4.1), as is the case for $(\frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \ldots, \frac{1}{2^n})$ for example, then we say that the $\tau$-critical index is $+\infty$.)

The connection between Equation (4.1) and "behaving like a Gaussian up to error $\tau$" is given by the Berry-Esseen Theorem, stated in Section 2.2. The following anticoncentration result shows that if the critical index is large, then the random variable $w_1 x_1 + \cdots + w_n x_n$ does not put much probability mass close to any particular value:

Theorem 4.2. Let $0 < \tau < 1/2$ and $t \ge 1$ be parameters, and define $k = O(1)\cdot\frac{t}{\tau^2}\ln\big(\frac{t}{\tau}\big)$. If the $\tau$-critical index $\ell$ for $w_1, \ldots, w_n$ satisfies $\ell \ge k$, then we have
$$\Pr_x\big[\,|w_0 + w_1 x_1 + \cdots + w_n x_n| \le \sqrt{t}\,\sigma_k\,\big] \le O(2^{-t}).$$

A similar result was established in [47]. The following subsections 4.1, 4.2, 4.3 are devoted to the proof of Theorem 4.2. Throughout, they assume $\ell$ denotes the $\tau$-critical index of $w_1, \ldots, w_n$, where $w_1 \ge \cdots \ge w_n > 0$, as in the condition of Theorem 4.2.

4.1. Partitioning weights into blocks. The following simple lemma shows that the tail weight decreases exponentially up to the $\tau$-critical index:

Lemma 4.3. For $1 \le a < b \le \ell$, we have $\sigma_b^2 < (1 - \tau^2)^{b-a}\sigma_a^2 < (1 - \tau^2)^{b-a} w_a^2/\tau^2$.

Proof. Since $a$ is less than the critical index, we have $w_a^2 > \tau^2\sigma_a^2 = \tau^2(w_a^2 + \sigma_{a+1}^2)$, or equivalently $(1 - \tau^2)w_a^2 > \tau^2\sigma_{a+1}^2$. Adding $(1 - \tau^2)\sigma_{a+1}^2$ to both sides gives $(1 - \tau^2)(w_a^2 + \sigma_{a+1}^2) > (1 - \tau^2)\sigma_{a+1}^2 + \tau^2\sigma_{a+1}^2$, which is equivalent to $(1 - \tau^2)\sigma_a^2 > \sigma_{a+1}^2$. This implies that $\sigma_b^2 < (1 - \tau^2)^{b-a}\sigma_a^2$; the second inequality follows from $w_a^2 > \tau^2\sigma_a^2$.

Fix a parameter $Z > 1$. We divide the list of weights $w_1, \ldots, w_\ell$ into "$Z$-blocks" of consecutive weights as follows. The first $Z$-block $B_1$ is $w_1, \ldots, w_{k_1}$, where $k_1$ is defined to be the first index such that $w_1$ (the largest weight in the block) is large relative to $\sigma_{k_1+1}$ (the total tail weight of all weights after the $Z$-block) in the following sense: $w_1 > Z\sigma_{k_1+1}$. Similarly, for $i = 2, 3, \ldots$ the $i$th $Z$-block $B_i$ is $w_{k_{i-1}+1}, \ldots, w_{k_i}$, where $k_i$ is the first index such that $w_{k_{i-1}+1} > Z\sigma_{k_i+1}$.
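Both the $\tau$-critical index of Definition 4.1 and the $Z$-block partition just described are straightforward to compute from a weight vector sorted in decreasing order, as assumed in the text. The sketch below (our own illustration, not from the paper) follows the definitions directly; the tail norms $\sigma_k$ are recomputed naively for clarity. On the geometric weight vector from Definition 4.1 the critical index comes out as $+\infty$, as the definition's parenthetical notes.

```python
import math

def tail_norm(w, k):
    """sigma_k = sqrt(w_k^2 + w_{k+1}^2 + ... + w_n^2), with 1-based index k."""
    return math.sqrt(sum(wi ** 2 for wi in w[k - 1:]))

def critical_index(w, tau):
    """Least 1-based index l with w_l / sigma_l <= tau; +infinity if none exists."""
    for l in range(1, len(w) + 1):
        if w[l - 1] <= tau * tail_norm(w, l):
            return l
    return math.inf

def z_blocks(w, Z):
    """Partition w_1, ..., w_n into Z-blocks: block i ends at the first k_i with
    w_{k_{i-1}+1} > Z * sigma_{k_i + 1}."""
    blocks, start = [], 1                 # start = k_{i-1} + 1
    for k in range(1, len(w) + 1):
        if k == len(w) or w[start - 1] > Z * tail_norm(w, k + 1):
            blocks.append(list(range(start, k + 1)))
            start = k + 1
    return blocks

if __name__ == "__main__":
    w = [2.0 ** (-i) for i in range(1, 13)]   # rapidly decreasing weights
    print(critical_index(w, tau=0.3))         # prints inf
    print(z_blocks(w, Z=2.0))
```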
The following lemma says each $Z$-block must be relatively short prior to the critical index:

Lemma 4.4. Suppose that the $i$th $Z$-block $B_i$ is such that $k_{i-1} + m \le \ell$, where
$$m \stackrel{\mathrm{def}}{=} \frac{1}{\tau^2}\ln(Z^2/\tau^2). \qquad (4.2)$$
Then $B_i$ is of length at most $m$.

Proof. Suppose that the length $|B_i|$ of the $i$th $Z$-block were more than $m$. Applying Lemma 4.3 with $b - a = m$, we have
$$\sigma_{k_{i-1}+1+m}^2 < (1 - \tau^2)^m w_{k_{i-1}+1}^2/\tau^2 \le e^{-\tau^2 m} w_{k_{i-1}+1}^2/\tau^2.$$
But by the assumption that the $i$th $Z$-block is longer than $m$, we also have $w_{k_{i-1}+1}^2 \le Z^2\sigma_{k_{i-1}+1+m}^2$. Combining these inequalities and plugging in our expression for $m$ we get a contradiction.

An easy consequence is that if the critical index is large, then there must be many blocks prior to it:

Corollary 4.5. For $t \ge 1$, suppose that the $\tau$-critical index $\ell$ is at least $tm$, where $m$ is defined as in (4.2). Then $k_t \le tm$, i.e. there are at least $t$ complete $Z$-blocks by the $(tm)$-th weight.

4.2. Block structure and concentration of the random variable $w \cdot x$. Let $f(x) = \mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_n x_n)$ be a threshold function with $w_1 \ge \cdots \ge w_n > 0$, and let $B_1, B_2, \ldots$ be the $Z$-blocks for $w$ as defined in the previous subsection. In this subsection we prove the following lemma, which is a slight variant of a similar result in [47]. Intuitively the lemma says that if a weight vector $w$ has many blocks, then for any $w_0 \in \mathbb{R}$, only an exponentially small fraction of points $x \in \{-1,1\}^n$ will have a small margin for the threshold function $\mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_n x_n)$. As we show in the next subsection, Theorem 4.2 will be an easy consequence of this lemma.

Lemma 4.6. Fix a value $t$ such that there exist at least $t$ complete $Z$-blocks $B_1, \ldots, B_t$ in the weight vector $w$. Then for any $w_0 \in \mathbb{R}$, we have
$$\Pr\big[\,|w_0 + w_1 x_1 + \cdots + w_n x_n| \le \sigma_{k_t+1}\,(Z/6)\,\big] \le 2^{-t} + 2te^{-Z^2/72}.$$
Here the probability is taken over a uniform random choice of $x$ from $\{-1,1\}^n$.

We first give some necessary preliminary results and then prove Lemma 4.6. Our approach follows that of [47] with slight modifications. Let us view the choice of a uniform random assignment $x$ to the variables in $Z$-blocks $B_1, \ldots, B_t$ as taking place in successive stages, where in the $i$th stage values are assigned to the variables in the $i$th $Z$-block $B_i$. Immediately after the $i$th stage, some value, call it $\xi_i$, has been determined for $w_0 + w_1 x_1 + \cdots + w_{k_i} x_{k_i}$. The following simple lemma shows that if $\xi_i$ is too far from 0, then it is unlikely that the remaining variables $x_{k_i+1}, \ldots, x_n$ will come out in such a way as to make the final sum close to 0.

Lemma 4.7. For any value $A > 0$ and any $1 \le i \le t$, if $|\xi_i| \ge 2\sigma_{k_i+1}\sqrt{2\ln(2/A)}$, then we have
$$\Pr_{x_{k_i+1}, \ldots, x_n}\big[\,|w_0 + w_1 x_1 + \cdots + w_n x_n| \le \sigma_{k_i+1}\sqrt{2\ln(2/A)}\,\big] \le A. \qquad (4.3)$$

Proof. By the lower bound on $|\xi_i|$ in the hypothesis of the lemma, it can only be the case that $|w_0 + w_1 x_1 + \cdots + w_n x_n| \le \sigma_{k_i+1}\sqrt{2\ln(2/A)}$ if
$$|w_{k_i+1} x_{k_i+1} + \cdots + w_n x_n| \ge \sigma_{k_i+1}\sqrt{2\ln(2/A)}. \qquad (4.4)$$
We now recall the Hoeffding bound (see e.g. [12]), which says that for any $0 \neq v \in \mathbb{R}^r$ and any $\gamma > 0$, we have
$$\Pr_{x \in \{-1,1\}^r}\big[\,|v_1 x_1 + \cdots + v_r x_r| \ge \gamma\sqrt{v_1^2 + \cdots + v_r^2}\,\big] \le 2e^{-\gamma^2/2}.$$
Since $w_{k_i+1}^2 + \cdots + w_n^2 = \sigma_{k_i+1}^2$, this Hoeffding bound implies that the probability of (4.4) is at most $2e^{-(\sqrt{2\ln(2/A)})^2/2} = A$.

We henceforth fix $A$ to be $A \stackrel{\mathrm{def}}{=} 2e^{-Z^2/72}$, so we have $6\sqrt{2\ln(2/A)} = Z$. We now show that regardless of the value of $\xi_{i-1}$, we have $|\xi_i| \le 2\sigma_{k_i+1}(Z/6)$ with probability at most $1/2$ over the choice of values for variables in block $B_i$ in the $i$th stage.

Lemma 4.8. For any $\xi_{i-1} \in \mathbb{R}$, we have $\Pr_{x_{k_{i-1}+1}, \ldots, x_{k_i}}\big[\,|\xi_i| \le 2\sigma_{k_i+1}(Z/6) \;\big|\; \xi_{i-1}\,\big] \le 1/2$.

Proof. Since $\xi_i$ equals $\xi_{i-1} + (w_{k_{i-1}+1} x_{k_{i-1}+1} + \cdots + w_{k_i} x_{k_i})$, we have $|\xi_i| \le 2\sigma_{k_i+1}(Z/6)$ if and only if the value $w_{k_{i-1}+1} x_{k_{i-1}+1} + \cdots + w_{k_i} x_{k_i}$ lies in the interval
$$[I_L, I_R] \stackrel{\mathrm{def}}{=} \big[-\xi_{i-1} - 2\sigma_{k_i+1}(Z/6),\; -\xi_{i-1} + 2\sigma_{k_i+1}(Z/6)\big]$$
of width $\frac{2}{3}\sigma_{k_i+1}Z$. First suppose that $0 \notin [I_L, I_R]$, i.e. the whole interval has the same sign. If this is the case then $\Pr[w_{k_{i-1}+1} x_{k_{i-1}+1} + \cdots + w_{k_i} x_{k_i} \in [I_L, I_R]] \le \frac{1}{2}$, since by symmetry the value $w_{k_{i-1}+1} x_{k_{i-1}+1} + \cdots + w_{k_i} x_{k_i}$ is equally likely to be positive or negative. Now suppose that $0 \in [I_L, I_R]$. By definition of $k_i$, we know that $\sigma_{k_i+1} \le w_{k_{i-1}+1}/Z$, and consequently we have that the width of the interval $[I_L, I_R]$ is at most $\frac{2}{3}w_{k_{i-1}+1}$. But now observe that once the value of $x_{k_{i-1}+1}$ is set to either $+1$ or $-1$, this effectively shifts the target interval, which now $w_{k_{i-1}+2} x_{k_{i-1}+2} + \cdots + w_{k_i} x_{k_i}$ must hit, by a displacement of $w_{k_{i-1}+1}$, to become $[I_L - w_{k_{i-1}+1} x_{k_{i-1}+1},\; I_R - w_{k_{i-1}+1} x_{k_{i-1}+1}]$. (Note that in the special case where $k_i = k_{i-1} + 1$, the value $w_{k_{i-1}+2} x_{k_{i-1}+2} + \cdots + w_{k_i} x_{k_i}$ which must hit the target interval is simply 0.) Since the original interval $[I_L, I_R]$ contained 0 and was of length at most $\frac{2}{3}w_{k_{i-1}+1}$, the new interval does not contain 0, and thus again by symmetry we have that the probability (now over the choice of $x_{k_{i-1}+2}, \ldots, x_{k_i}$) that $w_{k_{i-1}+1} x_{k_{i-1}+1} + \cdots + w_{k_i} x_{k_i}$ lies in $[I_L, I_R]$ is at most $\frac{1}{2}$.

In order to have $|w_0 + w_1 x_1 + \cdots + w_n x_n| \le \sigma_{k_t+1}\sqrt{2\ln(2/A)}$, it must be the case that either (i) each $|\xi_i| < 2\sigma_{k_i+1}\sqrt{2\ln(2/A)}$ for $i = 1, \ldots, t$; or (ii) for some $1 \le i \le t$ we have $|\xi_i| \ge 2\sigma_{k_i+1}\sqrt{2\ln(2/A)}$ but nonetheless $|w_0 + w_1 x_1 + \cdots + w_n x_n| < \sigma_{k_i+1}\sqrt{2\ln(2/A)}$. Lemma 4.8 gives us that the probability of (i) is at most $(1/2)^t = 2^{-t}$, and Lemma 4.7 with the union bound gives us that the probability of (ii) is at most $tA$. This proves Lemma 4.6.

4.3. Proof of Theorem 4.2. Let $Z = 12\sqrt{t}$. We take $m = \frac{1}{\tau^2}\ln(Z^2/\tau^2)$ as in (4.2), and we have $k = tm + 1$. With these choices the condition $\ell \ge k$ of Theorem 4.2 together with Corollary 4.5 implies that there are at least $t$ complete $Z$-blocks in the weight vector $w$. Thus we may apply Lemma 4.6, and we have that
$$\Pr\big[\,|w_0 + w_1 x_1 + \cdots + w_n x_n| \le \sigma_{k_t+1}\cdot 2\sqrt{t}\,\big] \le 2^{-t} + 2te^{-2t} \le O(2^{-t}).$$
Now we further observe that since there are in fact $t$ complete $Z$-blocks prior to the $k$th weight, we have $k_t + 1 \le k$ and hence $\sigma_{k_t+1} \ge \sigma_k$, so the above inequality implies
$$\Pr\big[\,|w_0 + w_1 x_1 + \cdots + w_n x_n| \le \sqrt{t}\,\sigma_k\,\big] \le O(2^{-t}).$$
This is the desired conclusion of Theorem 4.2.
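Before moving on, here is a small Monte Carlo illustration (ours, not the paper's, with arbitrary parameter choices) of the two regimes that the critical index distinguishes: regular weights, whose sum behaves like a Gaussian, and rapidly decreasing weights, to which the block analysis above applies. In both regimes only a small fraction of inputs give the linear form a small margin, which is the phenomenon Theorem 3.1 ultimately needs.

```python
import math
import random

def small_margin_fraction(w0, w, rho, samples=50000, rng=random):
    """Estimate Pr_x[|w0 + w.x| <= rho] for uniform x in {-1,1}^n."""
    hits = 0
    for _ in range(samples):
        s = w0 + sum(wi * rng.choice([-1, 1]) for wi in w)
        if abs(s) <= rho:
            hits += 1
    return hits / samples

if __name__ == "__main__":
    n = 100
    # Regime 1 (small critical index): regular weights, the sum is Gaussian-like.
    w_regular = [1.0 / math.sqrt(n)] * n          # normalized so sigma = 1
    # Regime 2 (large critical index): rapidly decreasing weights.
    w_geometric = [2.0 ** (-i) for i in range(1, n + 1)]
    for rho in (0.2, 0.1, 0.05):
        print(rho,
              small_margin_fraction(0.37, w_regular, rho),     # 0.37 is arbitrary
              small_margin_fraction(0.37, w_geometric, rho))
```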
4.4. Extension of Theorem 4.2. The same proof with a slightly different choice of $Z$ (taking $Z = O(1)\,t^C$) in fact gives us the following significantly stronger version of Theorem 4.2; however this stronger version is not more useful for our purposes:

Theorem 4.9. In the setting of Theorem 4.2, let $C \ge 1/2$ be another parameter, and suppose we instead define
$$k = O(1)\cdot\frac{t}{\tau^2}\ln\Big(\frac{t^C}{\tau}\Big).$$
Then if $\ell \ge k$,
$$\Pr_x\big[\,|w_0 + w_1 x_1 + \cdots + w_n x_n| \le t^C\sigma_k\,\big] \le O(2^{-t}).$$

5. Approximating threshold functions using not-too-large head weights. The main result of this section is a lemma which roughly says that any threshold function $f$ can be approximated by a threshold function $f'$ in which the 2-norm of the tail weights, $\sigma_k$, is at least an $\Omega(1)$ fraction of the head weights. This is important so that the Gaussian random variable to which the tail part is close has $\Omega(1)$ variance and thus sufficiently good anticoncentration.

Lemma 5.1. Let $f : \{-1,1\}^n \to \{-1,1\}$ be any threshold function, $f(x) = \mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_n x_n)$ (recall that we assume $w_1 \ge w_2 \ge \cdots \ge w_n$). Let $0 < \epsilon < 1/2$ and $1 \le k \le n$ be parameters, and write $\sigma_k \stackrel{\mathrm{def}}{=} \sqrt{\sum_{j \ge k} w_j^2}$. Assuming $\sigma_k > 0$, there are numbers $v_0, \ldots, v_{k-1}$ satisfying
$$|v_i| \le k^{(k+1)/2}\sqrt{3\ln(2/\epsilon)}\,\sigma_k \qquad (5.1)$$
such that the threshold function $f' : \{-1,1\}^n \to \{-1,1\}$ defined by
$$f'(x) = \mathrm{sgn}(v_0 + v_1 x_1 + \cdots + v_{k-1} x_{k-1} + w_k x_k + \cdots + w_n x_n)$$
satisfies $\mathrm{dist}(f, f') \le \epsilon$. One may further ensure that $v_1 \ge v_2 \ge \cdots \ge v_{k-1} \ge w_k$ and that $\mathrm{sgn}(v_i) = \mathrm{sgn}(w_i)$ for all $i$.

Before proving this lemma, let us give an illustration. Consider the threshold function
$$f(x) = \mathrm{sgn}(n x_1 + n x_2 + x_3 + \cdots + x_n), \qquad (5.2)$$
with $k = 3$. The tail weights here have $\sigma_3 = \sqrt{n-2}$, which of course is not a constant fraction of the two head weights, $n$. Further, this cannot be fixed just by choosing a different weights-based representation of the same function $f$. What Lemma 5.1 shows here is that we can shrink the head weights from $n$ all the way down to $\Theta(\sqrt{\ln(1/\epsilon)})\cdot\sqrt{n}$ without changing the function on more than an $\epsilon$ fraction of points (this heavily uses the fact that the tail acts like a Gaussian with standard deviation $\sqrt{n-2}$). Then indeed $\sigma_3$ is an $\Omega(f(\epsilon))$ fraction of the head weights for a function $f(\epsilon)$ that is independent of $n$, as desired.
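The illustration above can be checked by simulation. The sketch below (our own, with arbitrary parameter choices) compares $f(x) = \mathrm{sgn}(n x_1 + n x_2 + x_3 + \cdots + x_n)$ against the variant whose two head weights are shrunk to $C = \sqrt{3\ln(2/\epsilon)}\,\sigma_3$, the cutoff used in the proof of Lemma 5.1 below, and estimates the fraction of inputs on which the two functions disagree; it comes out far below $\epsilon$.

```python
import math
import random

def disagreement(n, eps, samples=50000, rng=random):
    """Estimated disagreement between sgn(n*x1 + n*x2 + tail) and the head-shrunk
    version sgn(C*x1 + C*x2 + tail), with C as in the proof of Lemma 5.1."""
    sigma3 = math.sqrt(n - 2)                 # 2-norm of the tail weights on x_3, ..., x_n
    C = math.sqrt(3.0 * math.log(2.0 / eps)) * sigma3
    disagree = 0
    for _ in range(samples):
        x1, x2 = rng.choice([-1, 1]), rng.choice([-1, 1])
        tail = sum(rng.choice([-1, 1]) for _ in range(n - 2))
        f_val = 1 if n * x1 + n * x2 + tail >= 0 else -1
        f_shrunk = 1 if C * x1 + C * x2 + tail >= 0 else -1
        disagree += (f_val != f_shrunk)
    return disagree / samples

if __name__ == "__main__":
    print(disagreement(n=201, eps=0.01))      # prints a value well below eps = 0.01
```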
We now give the proof of Lemma 5.1, a modification of the classic argument of [40] which bounds the weights required for exact representation of any threshold function.

Proof. We will first prove the lemma without the extra constraints $v_1 \ge v_2 \ge \cdots \ge v_{k-1} \ge w_k$ and $\mathrm{sgn}(v_i) = \mathrm{sgn}(w_i)$. At the end of the proof we will show how these constraints can also be ensured.

Let $h : \{-1,1\}^{k-1} \to \mathbb{R}$ denote the "head" of $f$,
$$h(x) = w_0 + w_1 x_1 + \cdots + w_{k-1} x_{k-1}.$$
Consider the system $S$ of $2^{k-1}$ linear equations in $k$ unknowns named $u_0, \ldots, u_{k-1}$: for each $x \in \{-1,1\}^{k-1}$ we include the equation
$$u_0 + u_1 x_1 + \cdots + u_{k-1} x_{k-1} = h(x).$$
Of course, the linear system $S$ is satisfiable, since $(u_0, \ldots, u_{k-1}) = (w_0, \ldots, w_{k-1})$ is a solution. Let $C$ be defined by $C = \sqrt{3\ln(2/\epsilon)}\,\sigma_k$, and consider the system $LP$ of $2^{k-1}$ linear (in)equalities over unknowns $u_0, \ldots, u_{k-1}$: for each $x \in \{-1,1\}^{k-1}$ we include the (in)equality
$$u_0 + u_1 x_1 + \cdots + u_{k-1} x_{k-1} \;\begin{cases} \ge C & \text{if } h(x) \ge C, \\ = h(x) & \text{if } |h(x)| < C, \\ \le -C & \text{if } h(x) \le -C. \end{cases} \qquad (5.3)$$
We have that $LP$ is feasible, since it is a relaxation of the satisfiable system $S$. Now we use the following standard result from the theory of linear inequalities, which is a straightforward consequence of Cramer's rule and is implicit in several works (see e.g. the proof at the start of Section 3 of [24]):

Lemma 5.2. Let $LP$ denote a feasible linear program over $k$ variables $u_0, \ldots, u_{k-1}$ in which the constraint matrix has all entries from $\{-1, 0, 1\}$ and the right-hand side has all entries at most $C$ in absolute value. Then there is a feasible solution $(v_0, \ldots, v_{k-1})$ in which $|v_i| \le k^{(k+1)/2} C$ for each $i$.

This implies that there is a feasible solution $(u_0, \ldots, u_{k-1}) = (v_0, \ldots, v_{k-1})$ to $LP$ in which the numbers $v_i$ are not too large in magnitude: specifically, using Lemma 5.2 we may obtain
$$|v_i| \le k^{(k+1)/2} C. \qquad (5.4)$$
We now show that the threshold function
$$f'(x) = \mathrm{sgn}(v_0 + v_1 x_1 + \cdots + v_{k-1} x_{k-1} + w_k x_k + \cdots + w_n x_n)$$
satisfies $\mathrm{dist}(f, f') \le \epsilon$. Given $x \in \{-1,1\}^n$, let us abuse notation by writing
$$h(x) = h(x_1, \ldots, x_{k-1}) = w_0 + w_1 x_1 + \cdots + w_{k-1} x_{k-1};$$
let us also write
$$h'(x) = v_0 + v_1 x_1 + \cdots + v_{k-1} x_{k-1}$$
for the head of $f'$ and $t(x) = \sum_{j \ge k} w_j x_j$ for the tail, which is common to both $f$ and $f'$. Now if $x$ is any input for which $|h(x)| < C$ then we have $h(x) = h'(x)$ by construction, and hence $f(x) = f'(x)$. Thus in order for $f(x)$ to disagree with $f'(x)$ it must at least be the case that $|h(x)| \ge C$. Moreover, it must also be the case that $|t(x)| \ge C$, for otherwise $\mathrm{sgn}(h(x) + t(x))$ will equal $\mathrm{sgn}(h'(x) + t(x))$, because $h(x)$ and $h'(x)$ have the same sign by construction. But the Hoeffding bound implies that
$$\Pr_x[\,|t(x)| \ge C\,] \le \Pr_x[\,|t(x)| \ge \sqrt{2\ln(2/\epsilon)}\,\sigma_k\,] \le 2e^{-\ln(2/\epsilon)} = \epsilon.$$
Hence indeed $\Pr[f(x) \neq f'(x)] \le \epsilon$, as desired.

Finally, we complete the proof by showing how to ensure the extra constraints $v_1 \ge v_2 \ge \cdots \ge v_{k-1} \ge w_k$ and $\mathrm{sgn}(v_i) = \mathrm{sgn}(w_i)$. First, the constraints $\mathrm{sgn}(u_i) = \mathrm{sgn}(w_i)$ can be added into $LP$; by this we mean adding constraints like "$u_1 \ge 0$", "$u_2 \le 0$", etc. Next, the constraints
$$\mathrm{sgn}(w_1) u_1 \ge \mathrm{sgn}(w_2) u_2, \quad \mathrm{sgn}(w_2) u_2 \ge \mathrm{sgn}(w_3) u_3, \quad \ldots, \quad \mathrm{sgn}(w_{k-2}) u_{k-2} \ge \mathrm{sgn}(w_{k-1}) u_{k-1}$$
can be added into $LP$; again, these are constraints like "$u_i \ge u_{i+1}$". Finally, we can add the constraint $\mathrm{sgn}(w_{k-1}) u_{k-1} \ge |w_k|$. Of course, $LP$ remains feasible after the addition of all of these constraints, since $(u_0, \ldots, u_{k-1}) = (w_0, \ldots, w_{k-1})$ is still a solution. It remains to show that there is still a solution satisfying the bounds in (5.4). But this still follows from Lemma 5.2: the added constraints only have coefficients in $\{-1, 0, 1\}$, and the added right-hand side entries are all 0, except for the last, which is $|w_k| \le \sigma_k \le C$.
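For small $k$ the linear program $LP$ from the proof can be written down and solved directly. The sketch below is our own illustration using scipy's linprog (not code from the paper); it builds the constraints (5.3) for a given head $(w_0, \ldots, w_{k-1})$ and cutoff $C$ and returns any feasible reduced head $(v_0, \ldots, v_{k-1})$, omitting the extra sign and ordering constraints discussed at the end of the proof.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def reduce_head_weights(head, C):
    """Return some (v_0, ..., v_{k-1}) feasible for the system (5.3), given the
    original head weights head = (w_0, ..., w_{k-1}) and the cutoff C."""
    k = len(head)                                      # variables u_0, ..., u_{k-1}
    A_eq, b_eq, A_ub, b_ub = [], [], [], []
    for x in itertools.product([-1, 1], repeat=k - 1):
        row = [1.0] + list(map(float, x))              # coefficients of u_0, ..., u_{k-1}
        h = head[0] + sum(w * xi for w, xi in zip(head[1:], x))
        if abs(h) < C:
            A_eq.append(row); b_eq.append(h)           # u.(1,x) = h(x)
        elif h >= C:
            A_ub.append([-c for c in row]); b_ub.append(-C)   # u.(1,x) >= C
        else:
            A_ub.append(row); b_ub.append(-C)          # u.(1,x) <= -C
    res = linprog(c=np.zeros(k),
                  A_ub=np.array(A_ub) if A_ub else None,
                  b_ub=np.array(b_ub) if b_ub else None,
                  A_eq=np.array(A_eq) if A_eq else None,
                  b_eq=np.array(b_eq) if b_eq else None,
                  bounds=[(None, None)] * k,
                  method="highs")
    return res.x if res.success else None

if __name__ == "__main__":
    # Head of the example (5.2) with n = 101 and k = 3: weights (w_0, w_1, w_2) = (0, n, n).
    n, eps = 101, 0.01
    C = math_C = (3 * np.log(2 / eps)) ** 0.5 * np.sqrt(n - 2)
    print(reduce_head_weights([0.0, float(n), float(n)], C))
```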
6. Every threshold function is close to a threshold function for which few points have small margin. In this section we show how to combine Theorem 4.2 and Lemma 5.1 to establish the following:

Theorem 6.1. Let $f : \{-1,1\}^n \to \{-1,1\}$ be any threshold function and let $0 < \tau < 1/2$. Then there is a threshold function $f' : \{-1,1\}^n \to \{-1,1\}$ with $\mathrm{dist}(f, f') \le \epsilon$ satisfying
$$\Pr_x[\mathrm{marg}(f', x) \le \rho] \le O(\tau),$$
where $\epsilon = \epsilon(\tau) = 2^{-2^{O(\log^3(1/\tau)/\tau^2)}}$ and $\rho = \rho(\tau) = 2^{-O(\log^3(1/\tau)/\tau^2)}$.

Our main structural result about margins, Theorem 3.1, is simply a rephrasing of the above theorem. Hence proving Theorem 6.1 completes the proof of Theorem 1.6, the first ingredient in our solution to the Chow Parameters Problem.

The plan for the proof of Theorem 6.1 follows the intuition described in the beginning of Section 4. We consider the location of the $\tau$-critical index of $f$. Case 1 is that it occurs quite early. In that case, the resulting tail acts like a Gaussian (up to error $\tau$), and hence we can get a good anticoncentration bound so long as the tail's
variance is large enough. To ensure this, we alter $f$ at the beginning of the argument using Lemma 5.1, which yields tail weights with total variance lower bounded by a function that depends only on $\tau$. Case 2 is that the critical index occurs late. In this case we get anticoncentration by appealing to Theorem 4.2. We again use Lemma 5.1 so that the $\sigma_k$ parameter is not too small. We now give the formal proof.

Proof. (Theorem 6.1) We intend to apply Theorem 4.2 in Case 2 with its $t$ parameter set to $\log(1/\tau)$, so that the anticoncentration is $O(\tau)$. Thus we will need to ensure the $\tau$-critical index parameter $\ell$ is at least
$$k \stackrel{\mathrm{def}}{=} O(1)\cdot\frac{\log(1/\tau)}{\tau^2}\ln\Big(\frac{\log(1/\tau)}{\tau}\Big). \qquad (6.1)$$
To that end, fix a weights-based representation of $f$,
$$f(x) = \mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_n x_n),$$
where we may assume that $w_1 \ge w_2 \ge \cdots \ge w_n > 0$. Write $\sigma_k = \sqrt{\sum_{j \ge k} w_j^2}$, and observe that $\sigma_k > 0$ since each $w_i \neq 0$. Now apply Lemma 5.1, with its parameter $\epsilon$ set to $2^{-k^{O(k)}}$. This yields a new threshold function
$$f'(x) = \mathrm{sgn}(v_0 + v_1 x_1 + \cdots + v_{k-1} x_{k-1} + w_k x_k + \cdots + w_n x_n), \qquad (6.2)$$
where each $v_i$ satisfies
$$|v_i| \le k^{O(k)}\sigma_k, \qquad (6.3)$$
and also $v_1 \ge v_2 \ge \cdots \ge v_{k-1} \ge w_k$. This $f'$ has $\mathrm{dist}(f, f') \le \epsilon = 2^{-k^{O(k)}}$.

To analyze $\mathrm{marg}(f', x)$, let us normalize the weights of $f'$ by dividing each weight by $\sqrt{v_0^2 + \cdots + v_{k-1}^2 + w_k^2 + \cdots + w_n^2}$. We thus may write
$$f'(x) = \mathrm{sgn}(u_0 + u_1 x_1 + \cdots + u_{k-1} x_{k-1} + u_k x_k + \cdots + u_n x_n),$$
where $\sum_{j \ge 0} u_j^2 = 1$. Equation (6.3) implies that for each of the $k$ values $i = 0, \ldots, k-1$ we have that $v_i^2$ is at most $k^{O(k)}$ times as large as $w_k^2 + \cdots + w_n^2$. Letting $\sigma_i$ now denote $\sqrt{\sum_{j \ge i} u_j^2}$ and recalling that $\sum_{j \ge 0} u_j^2 = 1$, this is easily seen to imply that
$$\sigma_k \ge k^{-O(k)}. \qquad (6.4)$$
Recalling that we still have $u_1 \ge u_2 \ge \cdots \ge u_n > 0$, let $\ell$ be the $\tau$-critical index for $u_1, \ldots, u_n$, and consider two cases:

Case 1: $\ell < k$. In this case, consider any fixed choice for $x_1, \ldots, x_{\ell-1}$ and write $h = u_0 + u_1 x_1 + \cdots + u_{\ell-1} x_{\ell-1}$. Using the definition of $\tau$-critical index and applying the Berry-Esseen Corollary 2.8 to $u_\ell x_\ell + \cdots + u_n x_n$ we get
$$\Pr_{x_\ell, \ldots, x_n}\big[\,-h - \gamma \le u_\ell x_\ell + \cdots + u_n x_n \le -h + \gamma\,\big] \le \frac{2\gamma}{\sigma_\ell} + 2\tau,$$
for any choice of $\gamma \ge 0$. Taking $\gamma = \tau\sigma_\ell \ge \tau\sigma_k$ we conclude
$$\Pr_x[\mathrm{marg}(f', x) \le \tau\sigma_k] \le 4\tau.$$
Case 2: $\ell \ge k$. In this case we apply Theorem 4.2, with its parameter $t$ set to $\log(1/\tau)$, as described at the beginning of the proof. With $k$ defined as in (6.1), we conclude
$$\Pr_x[\mathrm{marg}(f', x) \le \sqrt{\log(1/\tau)}\,\sigma_k] \le O(\tau).$$
Combining the results of the two cases and using $\sigma_k \ge k^{-O(k)}$ from (6.4), we conclude that we always have
$$\Pr_x[\mathrm{marg}(f', x) \le \tau k^{-O(k)}] \le O(\tau).$$
Now it only remains to observe that by definition (6.1) of $k$,
$$k^{O(k)} = 2^{O(\log^3(1/\tau)/\tau^2)}.$$
Hence we have that
$$\mathrm{dist}(f, f') \le 2^{-k^{O(k)}} \le \epsilon(\tau)$$
and
$$\tau k^{-O(k)} \ge \tau\cdot 2^{-O(\log^3(1/\tau)/\tau^2)} \ge \rho(\tau).$$

7. Second ingredient: using Chow Parameters as weights for tail variables. We begin this section with some informal motivation for and description of our second ingredient.

We first recall that every threshold function $f$ is unate; this means that for every $i$, $f$ is either monotone increasing or monotone decreasing as a function of its $i$-th coordinate. A well-known consequence of unateness is that the magnitude of the Fourier coefficient $\hat{f}(i)$ is equal to the influence of the variable $x_i$ on $f$, i.e. $\Pr[f(x) \neq f(y)]$ where $x$ is drawn uniformly from $\{-1,1\}^n$ and $y$ is $x$ with the $i$th bit flipped. As done in the first ingredient, it is natural to group together the high-influence variables, forming the "head" indices of $f$. We refer to the remaining indices as the "tail" indices. Note that an algorithm for the Chow Parameters problem can do this grouping, since it is given the $\hat{f}(i)$'s.

The following theorem states that any threshold function $f$ is either already close to a junta over the head indices, or is close to a threshold function obtained by replacing the tail weights with (suitably scaled versions of) the tail Chow Parameters. (We have made no effort to optimize the precise polynomial dependence of $\tau(\epsilon)$ on $\epsilon$.)

Theorem 7.1. There is a polynomial function $\tau(\epsilon) = \mathrm{poly}(\epsilon)$ such that the following holds: Let $f$ be a Boolean threshold function over head indices $H$ and tail indices $T$,
$$f(x) = \mathrm{sgn}\Big(v_0 + \sum_{i \in H} v_i x_i + \sum_{i \in T} w_i x_i\Big),$$
and let $0 < \epsilon < 1/2$. Assume that $H$ contains all indices $i$ such that $|\hat{f}(i)| \ge \tau(\epsilon)^2$. Then one of the following holds: (i) $f$ is $O(\epsilon)$-close to a junta over $H$; or,


More information

New Results for Random Walk Learning

New Results for Random Walk Learning Journal of Machine Learning Research 15 (2014) 3815-3846 Submitted 1/13; Revised 5/14; Published 11/14 New Results for Random Walk Learning Jeffrey C. Jackson Karl Wimmer Duquesne University 600 Forbes

More information

arxiv: v1 [cs.ds] 3 Feb 2018

arxiv: v1 [cs.ds] 3 Feb 2018 A Model for Learned Bloom Filters and Related Structures Michael Mitzenmacher 1 arxiv:1802.00884v1 [cs.ds] 3 Feb 2018 Abstract Recent work has suggested enhancing Bloom filters by using a pre-filter, based

More information

Learning and Fourier Analysis

Learning and Fourier Analysis Learning and Fourier Analysis Grigory Yaroslavtsev http://grigory.us CIS 625: Computational Learning Theory Fourier Analysis and Learning Powerful tool for PAC-style learning under uniform distribution

More information

On the Sample Complexity of Noise-Tolerant Learning

On the Sample Complexity of Noise-Tolerant Learning On the Sample Complexity of Noise-Tolerant Learning Javed A. Aslam Department of Computer Science Dartmouth College Hanover, NH 03755 Scott E. Decatur Laboratory for Computer Science Massachusetts Institute

More information

1 Last time and today

1 Last time and today COMS 6253: Advanced Computational Learning Spring 2012 Theory Lecture 12: April 12, 2012 Lecturer: Rocco Servedio 1 Last time and today Scribe: Dean Alderucci Previously: Started the BKW algorithm for

More information

FORMULATION OF THE LEARNING PROBLEM

FORMULATION OF THE LEARNING PROBLEM FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we

More information

Learning symmetric non-monotone submodular functions

Learning symmetric non-monotone submodular functions Learning symmetric non-monotone submodular functions Maria-Florina Balcan Georgia Institute of Technology ninamf@cc.gatech.edu Nicholas J. A. Harvey University of British Columbia nickhar@cs.ubc.ca Satoru

More information

Notes 6 : First and second moment methods

Notes 6 : First and second moment methods Notes 6 : First and second moment methods Math 733-734: Theory of Probability Lecturer: Sebastien Roch References: [Roc, Sections 2.1-2.3]. Recall: THM 6.1 (Markov s inequality) Let X be a non-negative

More information

Introduction to Computational Learning Theory

Introduction to Computational Learning Theory Introduction to Computational Learning Theory The classification problem Consistent Hypothesis Model Probably Approximately Correct (PAC) Learning c Hung Q. Ngo (SUNY at Buffalo) CSE 694 A Fun Course 1

More information

Proclaiming Dictators and Juntas or Testing Boolean Formulae

Proclaiming Dictators and Juntas or Testing Boolean Formulae Proclaiming Dictators and Juntas or Testing Boolean Formulae Michal Parnas The Academic College of Tel-Aviv-Yaffo Tel-Aviv, ISRAEL michalp@mta.ac.il Dana Ron Department of EE Systems Tel-Aviv University

More information

Two Comments on Targeted Canonical Derandomizers

Two Comments on Targeted Canonical Derandomizers Two Comments on Targeted Canonical Derandomizers Oded Goldreich Department of Computer Science Weizmann Institute of Science Rehovot, Israel. oded.goldreich@weizmann.ac.il April 8, 2011 Abstract We revisit

More information

Lecture 6,7 (Sept 27 and 29, 2011 ): Bin Packing, MAX-SAT

Lecture 6,7 (Sept 27 and 29, 2011 ): Bin Packing, MAX-SAT ,7 CMPUT 675: Approximation Algorithms Fall 2011 Lecture 6,7 (Sept 27 and 29, 2011 ): Bin Pacing, MAX-SAT Lecturer: Mohammad R. Salavatipour Scribe: Weitian Tong 6.1 Bin Pacing Problem Recall the bin pacing

More information

Continuity. Chapter 4

Continuity. Chapter 4 Chapter 4 Continuity Throughout this chapter D is a nonempty subset of the real numbers. We recall the definition of a function. Definition 4.1. A function from D into R, denoted f : D R, is a subset of

More information

Lecture 23: Alternation vs. Counting

Lecture 23: Alternation vs. Counting CS 710: Complexity Theory 4/13/010 Lecture 3: Alternation vs. Counting Instructor: Dieter van Melkebeek Scribe: Jeff Kinne & Mushfeq Khan We introduced counting complexity classes in the previous lecture

More information

Learning convex bodies is hard

Learning convex bodies is hard Learning convex bodies is hard Navin Goyal Microsoft Research India navingo@microsoft.com Luis Rademacher Georgia Tech lrademac@cc.gatech.edu Abstract We show that learning a convex body in R d, given

More information

A Noisy-Influence Regularity Lemma for Boolean Functions Chris Jones

A Noisy-Influence Regularity Lemma for Boolean Functions Chris Jones A Noisy-Influence Regularity Lemma for Boolean Functions Chris Jones Abstract We present a regularity lemma for Boolean functions f : {, } n {, } based on noisy influence, a measure of how locally correlated

More information

Foundations of Machine Learning and Data Science. Lecturer: Avrim Blum Lecture 9: October 7, 2015

Foundations of Machine Learning and Data Science. Lecturer: Avrim Blum Lecture 9: October 7, 2015 10-806 Foundations of Machine Learning and Data Science Lecturer: Avrim Blum Lecture 9: October 7, 2015 1 Computational Hardness of Learning Today we will talk about some computational hardness results

More information

CONSTRUCTION OF THE REAL NUMBERS.

CONSTRUCTION OF THE REAL NUMBERS. CONSTRUCTION OF THE REAL NUMBERS. IAN KIMING 1. Motivation. It will not come as a big surprise to anyone when I say that we need the real numbers in mathematics. More to the point, we need to be able to

More information

Lecture 4: LMN Learning (Part 2)

Lecture 4: LMN Learning (Part 2) CS 294-114 Fine-Grained Compleity and Algorithms Sept 8, 2015 Lecture 4: LMN Learning (Part 2) Instructor: Russell Impagliazzo Scribe: Preetum Nakkiran 1 Overview Continuing from last lecture, we will

More information

Analysis of Boolean Functions

Analysis of Boolean Functions Analysis of Boolean Functions Kavish Gandhi and Noah Golowich Mentor: Yufei Zhao 5th Annual MIT-PRIMES Conference Analysis of Boolean Functions, Ryan O Donnell May 16, 2015 1 Kavish Gandhi and Noah Golowich

More information

Learning DNF Expressions from Fourier Spectrum

Learning DNF Expressions from Fourier Spectrum Learning DNF Expressions from Fourier Spectrum Vitaly Feldman IBM Almaden Research Center vitaly@post.harvard.edu May 3, 2012 Abstract Since its introduction by Valiant in 1984, PAC learning of DNF expressions

More information

Notes for Lecture 15

Notes for Lecture 15 U.C. Berkeley CS278: Computational Complexity Handout N15 Professor Luca Trevisan 10/27/2004 Notes for Lecture 15 Notes written 12/07/04 Learning Decision Trees In these notes it will be convenient to

More information

Introduction and Preliminaries

Introduction and Preliminaries Chapter 1 Introduction and Preliminaries This chapter serves two purposes. The first purpose is to prepare the readers for the more systematic development in later chapters of methods of real analysis

More information

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 1: Quantum circuits and the abelian QFT

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 1: Quantum circuits and the abelian QFT Quantum algorithms (CO 78, Winter 008) Prof. Andrew Childs, University of Waterloo LECTURE : Quantum circuits and the abelian QFT This is a course on quantum algorithms. It is intended for graduate students

More information

Testing Lipschitz Functions on Hypergrid Domains

Testing Lipschitz Functions on Hypergrid Domains Testing Lipschitz Functions on Hypergrid Domains Pranjal Awasthi 1, Madhav Jha 2, Marco Molinaro 1, and Sofya Raskhodnikova 2 1 Carnegie Mellon University, USA, {pawasthi,molinaro}@cmu.edu. 2 Pennsylvania

More information

Continuity. Chapter 4

Continuity. Chapter 4 Chapter 4 Continuity Throughout this chapter D is a nonempty subset of the real numbers. We recall the definition of a function. Definition 4.1. A function from D into R, denoted f : D R, is a subset of

More information

Learning and Fourier Analysis

Learning and Fourier Analysis Learning and Fourier Analysis Grigory Yaroslavtsev http://grigory.us Slides at http://grigory.us/cis625/lecture2.pdf CIS 625: Computational Learning Theory Fourier Analysis and Learning Powerful tool for

More information

Polynomial regression under arbitrary product distributions

Polynomial regression under arbitrary product distributions Mach Learn (2010) 80: 273 294 DOI 10.1007/s10994-010-5179-6 Polynomial regression under arbitrary product distributions Eric Blais Ryan O Donnell Karl Wimmer Received: 15 March 2009 / Accepted: 1 November

More information

1 Review of The Learning Setting

1 Review of The Learning Setting COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #8 Scribe: Changyan Wang February 28, 208 Review of The Learning Setting Last class, we moved beyond the PAC model: in the PAC model we

More information

Stochastic Submodular Cover with Limited Adaptivity

Stochastic Submodular Cover with Limited Adaptivity Stochastic Submodular Cover with Limited Adaptivity Arpit Agarwal Sepehr Assadi Sanjeev Khanna Abstract In the submodular cover problem, we are given a non-negative monotone submodular function f over

More information

PAC Learning. prof. dr Arno Siebes. Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht

PAC Learning. prof. dr Arno Siebes. Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht PAC Learning prof. dr Arno Siebes Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht Recall: PAC Learning (Version 1) A hypothesis class H is PAC learnable

More information

The sum of d small-bias generators fools polynomials of degree d

The sum of d small-bias generators fools polynomials of degree d The sum of d small-bias generators fools polynomials of degree d Emanuele Viola April 9, 2008 Abstract We prove that the sum of d small-bias generators L : F s F n fools degree-d polynomials in n variables

More information

Computational Learning Theory

Computational Learning Theory CS 446 Machine Learning Fall 2016 OCT 11, 2016 Computational Learning Theory Professor: Dan Roth Scribe: Ben Zhou, C. Cervantes 1 PAC Learning We want to develop a theory to relate the probability of successful

More information

Introductory Analysis I Fall 2014 Homework #9 Due: Wednesday, November 19

Introductory Analysis I Fall 2014 Homework #9 Due: Wednesday, November 19 Introductory Analysis I Fall 204 Homework #9 Due: Wednesday, November 9 Here is an easy one, to serve as warmup Assume M is a compact metric space and N is a metric space Assume that f n : M N for each

More information

PROBABILISTIC ANALYSIS OF THE GENERALISED ASSIGNMENT PROBLEM

PROBABILISTIC ANALYSIS OF THE GENERALISED ASSIGNMENT PROBLEM PROBABILISTIC ANALYSIS OF THE GENERALISED ASSIGNMENT PROBLEM Martin Dyer School of Computer Studies, University of Leeds, Leeds, U.K. and Alan Frieze Department of Mathematics, Carnegie-Mellon University,

More information

A Regularity Lemma, and Low-weight Approximators, for Low-degree Polynomial Threshold Functions

A Regularity Lemma, and Low-weight Approximators, for Low-degree Polynomial Threshold Functions THEORY OF COMPUTING www.theoryofcomputing.org A Regularity Lemma, and Low-weight Approximators, for Low-degree Polynomial Threshold Functions Ilias Diakonikolas Rocco A. Servedio Li-Yang Tan Andrew Wan

More information

Online Learning, Mistake Bounds, Perceptron Algorithm

Online Learning, Mistake Bounds, Perceptron Algorithm Online Learning, Mistake Bounds, Perceptron Algorithm 1 Online Learning So far the focus of the course has been on batch learning, where algorithms are presented with a sample of training data, from which

More information

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory Part V 7 Introduction: What are measures and why measurable sets Lebesgue Integration Theory Definition 7. (Preliminary). A measure on a set is a function :2 [ ] such that. () = 2. If { } = is a finite

More information

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces. Math 350 Fall 2011 Notes about inner product spaces In this notes we state and prove some important properties of inner product spaces. First, recall the dot product on R n : if x, y R n, say x = (x 1,...,

More information

Review of Probability Theory

Review of Probability Theory Review of Probability Theory Arian Maleki and Tom Do Stanford University Probability theory is the study of uncertainty Through this class, we will be relying on concepts from probability theory for deriving

More information

Turing Machines, diagonalization, the halting problem, reducibility

Turing Machines, diagonalization, the halting problem, reducibility Notes on Computer Theory Last updated: September, 015 Turing Machines, diagonalization, the halting problem, reducibility 1 Turing Machines A Turing machine is a state machine, similar to the ones we have

More information

Polynomial time Prediction Strategy with almost Optimal Mistake Probability

Polynomial time Prediction Strategy with almost Optimal Mistake Probability Polynomial time Prediction Strategy with almost Optimal Mistake Probability Nader H. Bshouty Department of Computer Science Technion, 32000 Haifa, Israel bshouty@cs.technion.ac.il Abstract We give the

More information

Decoupling course outline Decoupling theory is a recent development in Fourier analysis with applications in partial differential equations and

Decoupling course outline Decoupling theory is a recent development in Fourier analysis with applications in partial differential equations and Decoupling course outline Decoupling theory is a recent development in Fourier analysis with applications in partial differential equations and analytic number theory. It studies the interference patterns

More information

12.1 A Polynomial Bound on the Sample Size m for PAC Learning

12.1 A Polynomial Bound on the Sample Size m for PAC Learning 67577 Intro. to Machine Learning Fall semester, 2008/9 Lecture 12: PAC III Lecturer: Amnon Shashua Scribe: Amnon Shashua 1 In this lecture will use the measure of VC dimension, which is a combinatorial

More information

NP Completeness and Approximation Algorithms

NP Completeness and Approximation Algorithms Chapter 10 NP Completeness and Approximation Algorithms Let C() be a class of problems defined by some property. We are interested in characterizing the hardest problems in the class, so that if we can

More information

Lecture 3 Small bias with respect to linear tests

Lecture 3 Small bias with respect to linear tests 03683170: Expanders, Pseudorandomness and Derandomization 3/04/16 Lecture 3 Small bias with respect to linear tests Amnon Ta-Shma and Dean Doron 1 The Fourier expansion 1.1 Over general domains Let G be

More information

CS446: Machine Learning Spring Problem Set 4

CS446: Machine Learning Spring Problem Set 4 CS446: Machine Learning Spring 2017 Problem Set 4 Handed Out: February 27 th, 2017 Due: March 11 th, 2017 Feel free to talk to other members of the class in doing the homework. I am more concerned that

More information

Lecture 3: Randomness in Computation

Lecture 3: Randomness in Computation Great Ideas in Theoretical Computer Science Summer 2013 Lecture 3: Randomness in Computation Lecturer: Kurt Mehlhorn & He Sun Randomness is one of basic resources and appears everywhere. In computer science,

More information

COS598D Lecture 3 Pseudorandom generators from one-way functions

COS598D Lecture 3 Pseudorandom generators from one-way functions COS598D Lecture 3 Pseudorandom generators from one-way functions Scribe: Moritz Hardt, Srdjan Krstic February 22, 2008 In this lecture we prove the existence of pseudorandom-generators assuming that oneway

More information

P, NP, NP-Complete, and NPhard

P, NP, NP-Complete, and NPhard P, NP, NP-Complete, and NPhard Problems Zhenjiang Li 21/09/2011 Outline Algorithm time complicity P and NP problems NP-Complete and NP-Hard problems Algorithm time complicity Outline What is this course

More information

SCALE INVARIANT FOURIER RESTRICTION TO A HYPERBOLIC SURFACE

SCALE INVARIANT FOURIER RESTRICTION TO A HYPERBOLIC SURFACE SCALE INVARIANT FOURIER RESTRICTION TO A HYPERBOLIC SURFACE BETSY STOVALL Abstract. This result sharpens the bilinear to linear deduction of Lee and Vargas for extension estimates on the hyperbolic paraboloid

More information

Math 328 Course Notes

Math 328 Course Notes Math 328 Course Notes Ian Robertson March 3, 2006 3 Properties of C[0, 1]: Sup-norm and Completeness In this chapter we are going to examine the vector space of all continuous functions defined on the

More information

CS 151 Complexity Theory Spring Solution Set 5

CS 151 Complexity Theory Spring Solution Set 5 CS 151 Complexity Theory Spring 2017 Solution Set 5 Posted: May 17 Chris Umans 1. We are given a Boolean circuit C on n variables x 1, x 2,..., x n with m, and gates. Our 3-CNF formula will have m auxiliary

More information

CS264: Beyond Worst-Case Analysis Lecture #11: LP Decoding

CS264: Beyond Worst-Case Analysis Lecture #11: LP Decoding CS264: Beyond Worst-Case Analysis Lecture #11: LP Decoding Tim Roughgarden October 29, 2014 1 Preamble This lecture covers our final subtopic within the exact and approximate recovery part of the course.

More information

Answering Many Queries with Differential Privacy

Answering Many Queries with Differential Privacy 6.889 New Developments in Cryptography May 6, 2011 Answering Many Queries with Differential Privacy Instructors: Shafi Goldwasser, Yael Kalai, Leo Reyzin, Boaz Barak, and Salil Vadhan Lecturer: Jonathan

More information

Hardness Results for Agnostically Learning Low-Degree Polynomial Threshold Functions

Hardness Results for Agnostically Learning Low-Degree Polynomial Threshold Functions Hardness Results for Agnostically Learning Low-Degree Polynomial Threshold Functions Ilias Diakonikolas Columbia University ilias@cs.columbia.edu Rocco A. Servedio Columbia University rocco@cs.columbia.edu

More information

Computational Learning Theory

Computational Learning Theory 1 Computational Learning Theory 2 Computational learning theory Introduction Is it possible to identify classes of learning problems that are inherently easy or difficult? Can we characterize the number

More information

LEBESGUE INTEGRATION. Introduction

LEBESGUE INTEGRATION. Introduction LEBESGUE INTEGATION EYE SJAMAA Supplementary notes Math 414, Spring 25 Introduction The following heuristic argument is at the basis of the denition of the Lebesgue integral. This argument will be imprecise,

More information

Testing Problems with Sub-Learning Sample Complexity

Testing Problems with Sub-Learning Sample Complexity Testing Problems with Sub-Learning Sample Complexity Michael Kearns AT&T Labs Research 180 Park Avenue Florham Park, NJ, 07932 mkearns@researchattcom Dana Ron Laboratory for Computer Science, MIT 545 Technology

More information

means is a subset of. So we say A B for sets A and B if x A we have x B holds. BY CONTRAST, a S means that a is a member of S.

means is a subset of. So we say A B for sets A and B if x A we have x B holds. BY CONTRAST, a S means that a is a member of S. 1 Notation For those unfamiliar, we have := means equal by definition, N := {0, 1,... } or {1, 2,... } depending on context. (i.e. N is the set or collection of counting numbers.) In addition, means for

More information

Randomized Complexity Classes; RP

Randomized Complexity Classes; RP Randomized Complexity Classes; RP Let N be a polynomial-time precise NTM that runs in time p(n) and has 2 nondeterministic choices at each step. N is a polynomial Monte Carlo Turing machine for a language

More information

The sample complexity of agnostic learning with deterministic labels

The sample complexity of agnostic learning with deterministic labels The sample complexity of agnostic learning with deterministic labels Shai Ben-David Cheriton School of Computer Science University of Waterloo Waterloo, ON, N2L 3G CANADA shai@uwaterloo.ca Ruth Urner College

More information

Testing for Concise Representations

Testing for Concise Representations Testing for Concise Representations Ilias Diakonikolas Columbia University ilias@cs.columbia.edu Ronitt Rubinfeld MIT ronitt@theory.csail.mit.edu Homin K. Lee Columbia University homin@cs.columbia.edu

More information

Topics in Theoretical Computer Science: An Algorithmist's Toolkit Fall 2007

Topics in Theoretical Computer Science: An Algorithmist's Toolkit Fall 2007 MIT OpenCourseWare http://ocw.mit.edu 18.409 Topics in Theoretical Computer Science: An Algorithmist's Toolkit Fall 2007 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Fourier analysis of boolean functions in quantum computation

Fourier analysis of boolean functions in quantum computation Fourier analysis of boolean functions in quantum computation Ashley Montanaro Centre for Quantum Information and Foundations, Department of Applied Mathematics and Theoretical Physics, University of Cambridge

More information

Learning and 1-bit Compressed Sensing under Asymmetric Noise

Learning and 1-bit Compressed Sensing under Asymmetric Noise JMLR: Workshop and Conference Proceedings vol 49:1 39, 2016 Learning and 1-bit Compressed Sensing under Asymmetric Noise Pranjal Awasthi Rutgers University Maria-Florina Balcan Nika Haghtalab Hongyang

More information

CS 446: Machine Learning Lecture 4, Part 2: On-Line Learning

CS 446: Machine Learning Lecture 4, Part 2: On-Line Learning CS 446: Machine Learning Lecture 4, Part 2: On-Line Learning 0.1 Linear Functions So far, we have been looking at Linear Functions { as a class of functions which can 1 if W1 X separate some data and not

More information

Spanning and Independence Properties of Finite Frames

Spanning and Independence Properties of Finite Frames Chapter 1 Spanning and Independence Properties of Finite Frames Peter G. Casazza and Darrin Speegle Abstract The fundamental notion of frame theory is redundancy. It is this property which makes frames

More information

Lecture Learning infinite hypothesis class via VC-dimension and Rademacher complexity;

Lecture Learning infinite hypothesis class via VC-dimension and Rademacher complexity; CSCI699: Topics in Learning and Game Theory Lecture 2 Lecturer: Ilias Diakonikolas Scribes: Li Han Today we will cover the following 2 topics: 1. Learning infinite hypothesis class via VC-dimension and

More information