THE CHOW PARAMETERS PROBLEM

RYAN O'DONNELL AND ROCCO A. SERVEDIO

(R. O'Donnell: Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, Pennsylvania; odonnell@cs.cmu.edu. Supported in part by NSF award CCF, a CyLab Research Grant, an Okawa Foundation Grant, and a Sloan Foundation Fellowship. R. A. Servedio: Columbia University, 1214 Amsterdam Avenue, New York, New York; rocco@cs.columbia.edu. Supported in part by NSF award CCF, NSF award CCF, and a Sloan Foundation Fellowship.)

Abstract. In the 2nd Annual FOCS (1961), Chao-Kong Chow proved that every Boolean threshold function is uniquely determined by its degree-0 and degree-1 Fourier coefficients. These numbers became known as the Chow Parameters. Providing an algorithmic version of Chow's Theorem, i.e., efficiently constructing a representation of a threshold function given its Chow Parameters, has remained open ever since. This problem has received significant study in the fields of circuit complexity, game theory and the design of voting systems, and learning theory. In this paper we effectively solve the problem, giving a randomized PTAS with the following behavior: Given the Chow Parameters of a Boolean threshold function $f$ over $n$ bits and any constant $\epsilon > 0$, the algorithm runs in time $O(n^2 \log^2 n)$ and with high probability outputs a representation of a threshold function $f'$ which is $\epsilon$-close to $f$. Along the way we prove several new results of independent interest about Boolean threshold functions. In addition to various structural results, these include $\tilde{O}(n^2)$-time learning algorithms for threshold functions under the uniform distribution in the following models: (i) The Restricted Focus of Attention model, answering an open question of Birkendorf et al. (ii) An agnostic-type model; this contrasts with recent results of Guruswami and Raghavendra, who show NP-hardness for the problem under general distributions. (iii) The PAC model, with constant $\epsilon$. Our $\tilde{O}(n^2)$-time algorithm substantially improves on the previous best known running time and nearly matches the $\Omega(n^2)$ bits of training data that any successful learning algorithm must use.

Key words. Chow Parameters, threshold functions, approximation, learning theory

AMS subject classifications. 94C10, 06E30, 68Q32, 68R99, 91B12, 91B14, 42C10

1. Introduction. This paper is concerned with Boolean threshold functions:

Definition 1.1. A Boolean function $f : \{-1,1\}^n \to \{-1,1\}$ is a threshold function if it is expressible as $f(x) = \mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_n x_n)$ for some real numbers $w_0, w_1, \ldots, w_n$.

Boolean threshold functions are of fundamental interest in circuit complexity, game theory/voting theory, and learning theory. Early computer scientists studying switching functions (i.e., Boolean functions) spent an enormous amount of effort on the class of threshold functions; see for instance the books [10, 26, 36, 48, 38] on this topic. More recently, researchers in circuit complexity have worked to understand the computational power of threshold functions and shallow circuits with these functions as gates; see e.g. [21, 45, 24, 25, 22]. In game theory and social choice theory, where simple cooperative games [42] correspond to monotone Boolean functions, threshold functions (with nonnegative weights) are known as weighted majority games and have been extensively studied as models for voting; see e.g. [43, 27, 11, 54]. Finally, in various guises, the problem of learning an unknown threshold function ("halfspace") has arguably been the central problem in machine learning for much of the last two decades, with algorithms such as Perceptron, Weighted Majority, boosting, and support vector machines emerging as central tools in the field.
A beautiful result of C.-K. Chow from the 2nd FOCS conference [9] gives a surprising characterization of Boolean threshold functions: among all Boolean functions, each threshold function $f : \{-1,1\}^n \to \{-1,1\}$ is uniquely determined by the "center of mass" of its positive inputs, $\mathrm{avg}\{x \in \{-1,1\}^n : f(x) = 1\}$, and the number of positive inputs $\#\{x : f(x) = 1\}$. These $n+1$ parameters of $f$ are equivalent, after scaling and additive shifting, to its degree-0 and degree-1 Fourier coefficients (and also, essentially, to its "influences" or "Banzhaf power indices"). We give a formal definition:

Definition 1.2. Given any Boolean function $f : \{-1,1\}^n \to \{-1,1\}$, its Chow Parameters¹ are the rational numbers $\hat{f}(0), \hat{f}(1), \ldots, \hat{f}(n)$ defined by
$$\hat{f}(0) = \mathbf{E}[f(x)], \qquad \hat{f}(i) = \mathbf{E}[f(x)\,x_i] \quad \text{for } 1 \le i \le n.$$
We also say the Chow Vector of $f$ is $\vec{\chi} = \vec{\chi}_f = (\hat{f}(0), \hat{f}(1), \ldots, \hat{f}(n))$.

Throughout this paper the notation $\mathbf{E}[\cdot]$ and $\Pr[\cdot]$ refers to an $x \in \{-1,1\}^n$ chosen uniformly at random. (We note that this corresponds to the "Impartial Culture Assumption" in the theory of social choice [19].) Our notation slightly abuses the standard Fourier coefficient notation of $\hat{f}(\emptyset)$ and $\hat{f}(\{i\})$.

Chow's Theorem implies that the following algorithmic problem is in principle solvable:

The Chow Parameters Problem. Given the Chow Parameters $\hat{f}(0), \hat{f}(1), \ldots, \hat{f}(n)$ of a Boolean threshold function $f$, output a representation of $f$ as $f(x) = \mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_n x_n)$.

Unfortunately, the proof of Chow's Theorem (reviewed in Section 2.3) is completely nonconstructive and does not suggest any algorithm, much less an efficient one. As we now briefly describe, over the past five decades the Chow Parameters problem has been considered by researchers in a range of different fields.

1.1. Background on the Chow Parameters problem. As far back as 1960, researchers studying Boolean functions were interested in finding an efficient algorithm for the Chow Parameters problem [14]. Electrical engineers at the time faced the following problem: Given an explicit truth table, determine if it can be realized as a threshold circuit and if so, which one. The Chow Parameters are easily computed from a truth table, and Chow's Theorem implies that they give a unique representation for every threshold function. Several heuristics were proposed for the Chow Parameters problem [30, 56, 29, 10], an empirical study was performed to compare various methods [58], and lookup tables were produced mapping Chow Vectors into weights-based representations for each threshold function on six [39], seven [57], and eight [41] bits. Winder provides a good early survey [59]. Generalizations of Chow's Theorem were later given in [7, 46].

Researchers in game theory have also considered the Chow Parameters problem; Chow's Theorem was independently rediscovered by the game theorist Lapidot [34] and subsequently studied in [11, 13, 54, 18]. In the realm of social choice and voting theory the Chow Parameters represent the Banzhaf power indices [43, 2] of the $n$ voters, a measure of each one's influence over the outcome. Here the Chow Parameters problem is very natural: Consider designing a voting rule for, say, the European Union. Target Banzhaf power indices are given, usually in proportion to the square root of the states' populations, and one wishes to come up with a weighted majority voting rule whose power indices are as close to the targets as possible. Researchers in voting theory have recently devoted significant attention to this problem [35, 8], calling it a "fundamental constitutional problem" [16] and in particular considering its computational complexity [51, 1].
¹ Chow's Theorem was proven simultaneously by Tannenbaum [53], but the terminology "Chow Parameters" has stuck.
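To make Definition 1.2 concrete, here is a small illustrative sketch (our own Python code, not part of the paper) that computes the Chow Vector of a threshold function exactly by enumerating $\{-1,1\}^n$ for small $n$, and also estimates it from uniform random examples; the sampling version is what the approximate variant of the problem discussed below works with. The sign convention at $0$ and all parameter choices are arbitrary.

```python
import itertools
import random

def threshold_fn(w0, w):
    """Return f(x) = sgn(w0 + w.x) as a callable on tuples x in {-1,1}^n.
    Ties at 0 are broken toward +1 (an arbitrary convention for this sketch)."""
    return lambda x: 1 if w0 + sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

def chow_vector_exact(f, n):
    """Chow Vector (f^(0), f^(1), ..., f^(n)) by brute force over {-1,1}^n."""
    chow = [0.0] * (n + 1)
    for x in itertools.product([-1, 1], repeat=n):
        fx = f(x)
        chow[0] += fx                      # degree-0 coefficient E[f(x)]
        for i in range(n):
            chow[i + 1] += fx * x[i]       # degree-1 coefficients E[f(x) x_i]
    return [c / 2 ** n for c in chow]

def chow_vector_estimate(f, n, samples=50000, rng=random):
    """Estimate the Chow Vector from uniform random examples (x, f(x))."""
    chow = [0.0] * (n + 1)
    for _ in range(samples):
        x = [rng.choice([-1, 1]) for _ in range(n)]
        fx = f(x)
        chow[0] += fx
        for i in range(n):
            chow[i + 1] += fx * x[i]
    return [c / samples for c in chow]

if __name__ == "__main__":
    n = 8
    f = threshold_fn(0, [4, 3, 3, 2, 2, 1, 1, 1])
    print(chow_vector_exact(f, n))
    print(chow_vector_estimate(f, n))
```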

The Chow Parameters problem also has motivation from learning theory. Ben-David and Dichterman [3] introduced the "Restricted Focus of Attention" (RFA) model to formalize the idea that learning algorithms often have only partial access to each example vector. Birkendorf et al. [5] performed a comprehensive study of the RFA model and observed that the approximation version of the Chow Parameters problem (given approximate Chow Parameters, output an approximating threshold function) is equivalent to the problem of efficiently learning threshold functions under the uniform distribution in the 1-RFA model. (In the 1-RFA model the learner is only allowed to see one bit of each example string in addition to the label; we give details in Section 10.) As the main open question posed in [5], Birkendorf et al. asked whether there is an efficient uniform distribution learning algorithm for threshold functions in the 1-RFA model. This question motivated subsequent research [20, 47] which gave information-theoretic sample complexity upper bounds for this learning problem (see Section 3); however no computationally efficient algorithm was previously known.

To summarize, we believe that the range of different contexts in which the Chow Parameters Problem has arisen is evidence of its fundamental status.

1.2. The Chow Parameters problem reframed as an approximation problem. It is unlikely that the Chow Parameters Problem can be solved exactly in polynomial time; note that even checking the correctness of a candidate solution is #P-complete, because computing $\hat{f}(0)$ is equivalent to counting 0-1 knapsack solutions. Thus, as is implicitly proposed in [5, 1], it is natural to look for a polynomial-time approximation scheme (PTAS). Here we mean an approximation in the following sense:

Definition 1.3. The distance between two Boolean functions $f, g : \{-1,1\}^n \to \{-1,1\}$ is $\mathrm{dist}(f,g) \stackrel{\mathrm{def}}{=} \Pr[f(x) \neq g(x)]$. If $\mathrm{dist}(f,g) \le \epsilon$ we say that $f$ and $g$ are $\epsilon$-close.

We would like a PTAS which, given a value $\epsilon$ and the Chow Parameters of $f$, outputs a (representation of a) threshold function $f'$ that is $\epsilon$-close to $f$. With this relaxed goal of approximating $f$, one may even tolerate only an approximation of the Chow Parameters of $f$; this gives us the variant of the problem that Birkendorf et al. considered. (Note that, as we discuss in Section 3, it is in no way obvious that approximate Chow Parameters even information-theoretically specify an approximator to $f$.) In particular the following notion of approximate Chow Parameters proves to be most natural:

Definition 1.4. Let $f, g : \{-1,1\}^n \to \{-1,1\}$. We define
$$d_{\mathrm{Chow}}(f,g) \stackrel{\mathrm{def}}{=} \sqrt{\sum_{j=0}^{n} (\hat{f}(j) - \hat{g}(j))^2}$$
to be the Chow Distance between $f$ and $g$.

1.3. Our results. Our main result is an efficient PTAS $\mathcal{A}$ for the Chow Parameters problem which succeeds given approximations to the Chow Parameters. We prove:

Main Theorem. There is a function $\kappa(\epsilon) = 2^{-\tilde{O}(1/\epsilon^2)}$ such that the following holds: Let $f : \{-1,1\}^n \to \{-1,1\}$ be a threshold function and let $0 < \epsilon < 1/2$. Write $\vec{\chi}$ for the Chow Vector of $f$ and assume that $\vec{\alpha}$ is a vector satisfying $\|\vec{\alpha} - \vec{\chi}\| \le \kappa(\epsilon)$. Then given as input $\vec{\alpha}$ and $\epsilon$, the algorithm $\mathcal{A}$ performs $2^{\mathrm{poly}(1/\kappa(\epsilon))} \cdot n^2 \log n \cdot \log(\frac{n}{\delta})$ bit operations and outputs the (weights-based) representation of a threshold function $f^*$ which with probability at least $1-\delta$ satisfies $\mathrm{dist}(f, f^*) \le \epsilon$.

Although the running time dependence on $\epsilon$ is doubly-exponential, we emphasize that the polynomial dependence on $n$ is quadratic, independent of $\epsilon$; i.e., $\mathcal{A}$ is an EPTAS. Some of our learning applications have only singly-exponential dependence on $\epsilon$.
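Definitions 1.3 and 1.4 translate directly into code. The short sketch below (again our own illustration, not from the paper) computes $\mathrm{dist}(f,g)$ and $d_{\mathrm{Chow}}(f,g)$ exactly by brute force for small $n$; for large $n$ only sampling-based estimates of these quantities are feasible, which is the setting of the Main Theorem.

```python
import itertools
import math

def dist(f, g, n):
    """dist(f,g) = Pr_x[f(x) != g(x)] under the uniform distribution on {-1,1}^n."""
    disagree = sum(1 for x in itertools.product([-1, 1], repeat=n) if f(x) != g(x))
    return disagree / 2 ** n

def chow_distance(f, g, n):
    """d_Chow(f,g) = sqrt(sum_{j=0}^n (f^(j) - g^(j))^2), computed by brute force."""
    diffs = [0.0] * (n + 1)
    for x in itertools.product([-1, 1], repeat=n):
        d = f(x) - g(x)
        diffs[0] += d                      # difference of degree-0 coefficients
        for i in range(n):
            diffs[i + 1] += d * x[i]       # differences of degree-1 coefficients
    return math.sqrt(sum((d / 2 ** n) ** 2 for d in diffs))
```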
1.4. Our approach. We briefly describe the two main ingredients of our approach and explain how we combine them to obtain the efficient algorithm $\mathcal{A}$.

First ingredient: small Chow Distance from a threshold function implies small distance. An immediate question that arises when thinking about the Chow Parameters problem is how to recognize whether a candidate solution is a good one. If we are given the Chow Vector $\vec{\chi}_f$ of an unknown threshold function $f$ and we have a candidate threshold function $g$, we can approximate the Chow Vector $\vec{\chi}_g$ of $g$ by sampling. The following Proposition is easily proved via Fourier analysis in Section 2.3:

Proposition 1.5. $d_{\mathrm{Chow}}(f,g) \le 2\sqrt{\mathrm{dist}(f,g)}$.

This means that if $d_{\mathrm{Chow}}(f,g)$ is large then $f$ and $g$ are far apart. But if $d_{\mathrm{Chow}}(f,g)$ is small, does this necessarily mean that $f$ and $g$ are close? This question has been studied in the learning theory community, in [5] (for threshold functions with small integer weights), [20], and [47]. In Section 3 we show that the answer is yes by proving the following "robust" version of Chow's Theorem:

Theorem 1.6. Let $f : \{-1,1\}^n \to \{-1,1\}$ be any threshold function and let $g : \{-1,1\}^n \to \{-1,1\}$ be any Boolean function such that $d_{\mathrm{Chow}}(f,g) \le \epsilon$. Then $\mathrm{dist}(f,g) \le \tilde{O}\big(1/\sqrt{\log(1/\epsilon)}\big)$.

This is the first result of this nature that is completely independent of $n$. A key ingredient in the proof of Theorem 1.6 is a new result showing that every threshold function $f$ is extremely close to a threshold function $f'$ for which only a very small fraction of points have small "margin" (see Section 6 for a precise statement). We feel that this and Theorem 1.6 have independent interest as structural results about threshold functions.

Second ingredient: using the Chow Parameters as weights. The second ingredient in our approach is to establish a result, Theorem 7.1, having the following corollary:

Corollary 7.2. There is an absolute constant $C > 0$ such that the following holds. Let $f(x) = \mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_n x_n)$ be any threshold function, and let $H$ be the set of $1/\epsilon^C$ indices $i$ for which $|w_i|$ is largest.² Then there exists a threshold function $f'(x) = \mathrm{sgn}(v_0 + v_1 x_1 + \cdots + v_n x_n)$ with $\mathrm{dist}(f, f') \le \epsilon$ in which the weights $v_i$ for $i \in [n] \setminus H$ are the Chow Parameters $\hat{f}(i)$ themselves.

² As we discuss at the beginning of Section 7, for any threshold function $f$ the value $|\hat{f}(i)|$ is equal to $\mathrm{Inf}_i(f)$, the influence of the $i$-th variable on $f$. It is well known and easy to show (see e.g. Lemma 7 of [17]) that for a threshold function $f(x) = \mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_n x_n)$, if $\mathrm{Inf}_i(f) > \mathrm{Inf}_j(f)$ then $|w_i| > |w_j|$. So we may equivalently view $H$ as the set of $1/\epsilon^C$ indices $i$ for which $|\hat{f}(i)|$ is largest.

The heuristic of using the Chow Parameters as possible weights was considered by several researchers in the early 60s (see [59]); however no theorem on the efficacy of this approach was previously known. Our proof of Theorem 7.1 and its robust version Theorem 7.4 rely in part on recent work of Matulef et al. on Property Testing for threshold functions [37].

The algorithm and intuitive explanation. Given these two ingredients, our PTAS $\mathcal{A}$ for the approximate Chow Parameters problem works by constructing a small (depending only on $\epsilon$) number of candidate threshold functions. It enumerates all (in some sense) possible weight settings for the indices in $H$, and for each one produces a candidate threshold function by setting the remaining weights equal to the given Chow Parameters. The second ingredient tells us that at least one of these candidate threshold functions must be close to the unknown threshold function $f$, and thus must have small Chow Distance to $f$, by Proposition 1.5. Now the first ingredient tells us that any threshold function whose Chow Distance to the target Chow Vector is small must itself be close to the target. So the algorithm can estimate each of the candidates' Chow Vectors (this takes $\tilde{O}(n^2)$ time) and output any candidate whose Chow Distance to the target vector is small.
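The following schematic (our own heavily simplified sketch in Python, not the paper's actual algorithm $\mathcal{A}$) mirrors the structure just described: choose the head $H$ by largest degree-1 Chow Parameters (using the observation in the footnote to Corollary 7.2), enumerate a grid of head weights, fill in the tail with the given Chow Parameters, and keep a candidate whose sampled Chow Vector is close to the target. The grid, the head size, the acceptance threshold, and the sample size are placeholders; the paper's analysis dictates how such quantities must actually be set in terms of $\epsilon$.

```python
import itertools
import math
import random

def algorithm_A_schematic(alpha, head_size=2,
                          grid=(-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0),
                          threshold=0.1, samples=5000):
    """Schematic only: alpha is an (approximate) target Chow Vector of length n+1.
    Returns candidate weights (w0, ..., wn) whose sampled Chow Vector is within
    `threshold` of alpha, or None if no enumerated candidate qualifies."""
    n = len(alpha) - 1
    # Head H = indices with the largest degree-1 Chow Parameters (= largest influences).
    head = sorted(range(1, n + 1), key=lambda i: -abs(alpha[i]))[:head_size]
    tail = [i for i in range(1, n + 1) if i not in head]

    for head_choice in itertools.product(grid, repeat=head_size + 1):  # +1 for w0
        w = [0.0] * (n + 1)
        w[0] = head_choice[0]
        for h, v in zip(head, head_choice[1:]):
            w[h] = v
        for i in tail:
            w[i] = alpha[i]                 # tail weights := given Chow Parameters
        g = lambda x: 1 if w[0] + sum(w[i] * x[i - 1] for i in range(1, n + 1)) >= 0 else -1
        # Estimate the candidate's Chow Vector by sampling and compare with alpha.
        est = [0.0] * (n + 1)
        for _ in range(samples):
            x = [random.choice([-1, 1]) for _ in range(n)]
            gx = g(x)
            est[0] += gx
            for i in range(1, n + 1):
                est[i] += gx * x[i - 1]
        est = [e / samples for e in est]
        if math.sqrt(sum((a - e) ** 2 for a, e in zip(alpha, est))) <= threshold:
            return w
    return None
```

In the paper the number of candidates depends only on $\epsilon$, and estimating each candidate's Chow Vector takes $\tilde{O}(n^2)$ time, which is where the overall quadratic dependence on $n$ comes from.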
1.5. Consequences in learning theory. As we show in Section 10, our approach yields a range of new algorithmic results in learning theory. Our Main Theorem directly gives the first $\mathrm{poly}(n)$-time algorithm for learning threshold functions in the uniform distribution 1-RFA model, answering the question of [5]:

Theorem 1.7. There is an algorithm which performs $2^{2^{\tilde{O}(1/\epsilon^2)}} \cdot n^2 \log n \cdot \log(\frac{n}{\delta})$ bit operations and properly learns threshold functions to accuracy $\epsilon$ and confidence $1-\delta$ in the uniform distribution 1-RFA model.

A variant of our algorithm gives a very fast "agnostic-type" learning algorithm for threshold functions (equivalently, an algorithm for learning Boolean threshold functions from uniformly distributed examples when there is adversarial noise in the labels):

Theorem 1.8. Let $g$ be any Boolean function and let $\mathrm{opt} = \min_f \Pr[f(x) \neq g(x)]$, where the min is over all threshold functions and the probability is uniform over $\{-1,1\}^n$. Given an input parameter $\epsilon > 0$ and access to independent uniform examples $(x, g(x))$, algorithm $\mathcal{B}$ outputs the (weights-based) representation of a threshold function $f^*$ which with probability at least $1-\delta$ satisfies $\Pr[f^*(x) \neq g(x)] \le O(\mathrm{opt}^{\Omega(1)}) + \epsilon$. The algorithm performs $\mathrm{poly}(1/\epsilon) \cdot n^2 \log(\frac{n}{\delta}) + 2^{\mathrm{poly}(1/\epsilon)} \cdot n \log n \cdot \log(\frac{1}{\delta})$ bit operations.

For example, if $\mathrm{opt} = 1/\log(n)$, our algorithm takes time $O(n^2 \log n \log(\frac{n}{\delta}))$ and outputs a hypothesis with accuracy $1 - 1/\log^{\Omega(1)}(n)$. Theorem 1.8 is in interesting contrast with the algorithm of Kalai et al. [28], which constructs an $(\mathrm{opt} + \epsilon)$-accurate hypothesis but runs in $n^{\mathrm{poly}(1/\epsilon)}$ time (and does not output a threshold function). As we discuss in Section 10, recent hardness results of Guruswami and Raghavendra [23] imply that if $\mathrm{P} \neq \mathrm{NP}$ there can be no algorithm comparable to ours for learning under arbitrary (as opposed to uniform) distributions over $\{-1,1\}^n$.

Finally, as a corollary of Theorem 1.8, we obtain a uniform-distribution PAC learning algorithm for threshold functions that runs in time $\tilde{O}(n^2)$ for learning to constant accuracy $\epsilon = \Theta(1)$. The fastest previous algorithm we are aware of for learning arbitrary threshold functions in this model (linear programming, using Vaidya [55]) runs in $\tilde{O}(n^{4.5}) \cdot \mathrm{poly}(1/\epsilon)$ time. Thus our algorithm is significantly faster for learning to accuracy $\epsilon = \Theta(1)$, and in fact is faster as long as $\epsilon < 1/(\log n)^c$ for a sufficiently small constant $c > 0$. As we explain later, our time bound is very close to the $\Omega(n^2)$ bits of input that any learning algorithm must use.

2. Preliminaries.

2.1. Fourier analysis. This paper extensively uses the basics of Fourier analysis over the Boolean cube $\{-1,1\}^n$. We give a brief review. We consider functions $f : \{-1,1\}^n \to \mathbb{R}$ (though we often focus on Boolean-valued functions which map to $\{-1,1\}$), and we think of the inputs $x$ to $f$ as being distributed according to the uniform probability distribution. The set of such functions forms a $2^n$-dimensional inner product space with inner product given by $\langle f, g \rangle = \mathbf{E}_x[f(x)g(x)]$. The set of functions $(\chi_S)_{S \subseteq [n]}$ defined by $\chi_S(x) = \prod_{i \in S} x_i$ forms a complete orthonormal basis for this space. We will also often write simply $x^S$ for $\prod_{i \in S} x_i$. Given a function $f : \{-1,1\}^n \to \mathbb{R}$ we define its Fourier coefficients by $\hat{f}(S) = \mathbf{E}_x[f(x)\,x^S]$, and we have that $f(x) = \sum_S \hat{f}(S)\,x^S$. As an easy consequence of orthonormality we have Plancherel's identity $\langle f, g \rangle = \sum_S \hat{f}(S)\hat{g}(S)$, which has as a special case Parseval's identity, $\mathbf{E}_x[f(x)^2] = \sum_S \hat{f}(S)^2$. From this it follows that for every $f : \{-1,1\}^n \to \{-1,1\}$ we have $\sum_S \hat{f}(S)^2 = 1$.
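As a quick sanity check of these identities (our own illustration, not from the paper), one can compute all $2^n$ Fourier coefficients of a small Boolean function by brute force and verify Parseval's identity.

```python
import itertools

def fourier_coefficients(f, n):
    """All Fourier coefficients f^(S) = E_x[f(x) x^S], indexed by frozenset S."""
    points = list(itertools.product([-1, 1], repeat=n))
    coeffs = {}
    for bits in itertools.product([0, 1], repeat=n):
        S = frozenset(i for i in range(n) if bits[i])
        total = 0
        for x in points:
            chi_S = 1
            for i in S:
                chi_S *= x[i]
            total += f(x) * chi_S
        coeffs[S] = total / 2 ** n
    return coeffs

if __name__ == "__main__":
    n = 5
    majority = lambda x: 1 if sum(x) >= 0 else -1
    coeffs = fourier_coefficients(majority, n)
    print(sum(c ** 2 for c in coeffs.values()))   # Parseval: prints 1.0 up to rounding
```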
The following definitions are fairly standard in the analysis of Boolean functions:

Definition 2.1. A function $f : \{-1,1\}^n \to \{-1,1\}$ is said to be a junta on $J \subseteq [n]$ if $f$ only depends on the coordinates in $J$. Typically we think of $J$ as a small set in this case.

Definition 2.2. We say that $f : \{-1,1\}^n \to \mathbb{R}$ is $\tau$-regular if $|\hat{f}(i)| \le \tau$ for all $i \in [n]$.

The following simple lemma is implicit in [37]; we state and prove it explicitly here for completeness.

Lemma 2.3. Let $f : \{-1,1\}^n \to \{-1,1\}$ be a Boolean threshold function and let $J \subseteq [n]$ be any subset of coordinates. If $f$ is $\tau$-close to a junta on $J$, then $f$ is $\tau$-close to a junta on $J$ which is itself a Boolean threshold function.

Proof. We assume without loss of generality that $J$ is the set $\{1, \ldots, r\}$. It is clear that the junta over $\{-1,1\}^r$ to which $f$ is closest is the function $g(x_1, \ldots, x_r)$ that maps each input $(x_1, \ldots, x_r)$ to the more commonly occurring value of the restricted function $f_{x_1, \ldots, x_r}$ (a function of the variables $x_{r+1}, \ldots, x_n$). But for $f(x) = \mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_n x_n)$ this more common value will be $\mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_r x_r)$, because for uniform $(x_{r+1}, \ldots, x_n) \in \{-1,1\}^{n-r}$ the random variable $w_{r+1} x_{r+1} + \cdots + w_n x_n$ is centered around zero.

We will also require the following lemma, which gives a lower bound on the degree-1 Fourier weight of any threshold function in terms of its bias:

Lemma 2.4. Let $f : \{-1,1\}^n \to \{-1,1\}$ be a Boolean threshold function and suppose that $1 - \mathbf{E}[f] = p$. Then
$$\sum_{i=1}^{n} \hat{f}(i)^2 \ge p^2/2.$$

Before giving the proof let us contrast this lemma with some known results. Proposition 2.2 of Talagrand [52] gives a general upper bound $\sum_{i=1}^{n} \hat{f}(i)^2 \le O(p^2 \log(1/p))$ for any Boolean function satisfying $1 - \mathbf{E}[f] = p$. In [37] it is shown that a slightly stronger bound $\Theta(p^2 \log(1/p))$ holds for threshold functions $f$ that are sufficiently $\tau$-regular. However, when we use Lemma 2.4 we will not have regularity (and even if we did, the extra log factor would not end up improving any of our bounds).

Proof. Write $f(x) = \mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_n x_n)$, where we assume without loss of generality that $\sum_{j=1}^{n} w_j^2 = 1$ and that $w_0 + w_1 x_1 + \cdots + w_n x_n \neq 0$ for all $x \in \{-1,1\}^n$. We have
$$\mathbf{E}[f(x)(w \cdot x)] = \sum_{i=1}^{n} \hat{f}(i) w_i \le \sqrt{\sum_{i=1}^{n} \hat{f}(i)^2},$$
where the equality is Plancherel's identity and the inequality is Cauchy-Schwarz. On the other hand, using the definition of $f$ we obtain
$$\mathbf{E}[f(x)(w \cdot x)] = \mathbf{E}\big[\mathbf{1}_{\{|w \cdot x| \ge |w_0|\}}\,|w \cdot x|\big] = p \cdot \mathbf{E}\big[\,|w \cdot x| \;\big|\; |w \cdot x| \ge |w_0|\,\big].$$
The first equality above holds because each $x$ such that $|w \cdot x| < |w_0|$ can be paired with $-x$; the value of $f$ is the same on these two inputs, so their contributions to the expectation cancel each other out. The second equality above is a routine renormalization using the equality $1 - \mathbf{E}[f] = p$. We now recall the Khintchine inequality with best constant [50], which says that for any $w \in \mathbb{R}^n$ we have $\mathbf{E}[|w \cdot x|] \ge \frac{1}{\sqrt{2}}\|w\|$. Since $\|w\| = 1$ in our setting, we get $\mathbf{E}[|w \cdot x|] \ge \frac{1}{\sqrt{2}}$, so surely $\mathbf{E}\big[\,|w \cdot x| \,\big|\, |w \cdot x| \ge |w_0|\,\big] \ge 1/\sqrt{2}$. Thus combining all statements yields $\sqrt{\sum_{i=1}^{n} \hat{f}(i)^2} \ge p/\sqrt{2}$, completing the proof.

2.2. Mathematical tools. We use the following simple estimate on several occasions:

Fact 2.5. Suppose $A$ and $B$ are nonnegative and $|A - B| \le \eta$. Then $|\sqrt{A} - \sqrt{B}| \le \eta/\sqrt{B}$.

Proof. $|\sqrt{A} - \sqrt{B}| = \frac{|A - B|}{\sqrt{A} + \sqrt{B}} \le \frac{\eta}{\sqrt{B}}$.

We also will need some results from probability theory:

Definition 2.6. We write $\Phi$ for the c.d.f. (cumulative distribution function) of a standard mean-0, variance-1 Gaussian random variable. We extend the notation by writing $\Phi[a,b]$ to denote $|\Phi(b) - \Phi(a)|$, allowing $b < a$. Finally, we will use the estimate $\Phi[a,b] \le |b - a|$ without comment.

The Berry-Esseen theorem is a version of the Central Limit Theorem with explicit error bounds:

Theorem 2.7. (Berry-Esseen) Let $X_1, \ldots, X_n$ be a sequence of independent random variables satisfying $\mathbf{E}[X_i] = 0$ for all $i$, $\sqrt{\sum_i \mathbf{E}[X_i^2]} = \sigma$, and $\sum_i \mathbf{E}[|X_i|^3] = \rho_3$. Let $S = (X_1 + \cdots + X_n)/\sigma$ and let $F$ denote the c.d.f. of $S$. Then
$$\sup_x |F(x) - \Phi(x)| \le C\rho_3/\sigma^3,$$
where $\Phi$ is the c.d.f. of a standard Gaussian random variable, and $C$ is a universal constant. It is known [49] that one can take $C = .7915$.

Corollary 2.8. Let $x_1, \ldots, x_m$ denote independent $\pm 1$ random bits and let $w_1, \ldots, w_m \in \mathbb{R}$. Write $\sigma = \sqrt{\sum_i w_i^2}$, and assume $|w_i|/\sigma \le \tau$ for all $i$. Then for any interval $[a,b] \subseteq \mathbb{R}$,
$$\Big|\Pr[a \le w_1 x_1 + \cdots + w_m x_m \le b] - \Phi\big[\tfrac{a}{\sigma}, \tfrac{b}{\sigma}\big]\Big| \le 2\tau.$$
In particular, $\Pr[a \le w_1 x_1 + \cdots + w_m x_m \le b] \le \frac{b-a}{\sigma} + 2\tau$.
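The following small experiment (ours, not the paper's) illustrates Corollary 2.8: for a regular weight vector, the probability that $w_1 x_1 + \cdots + w_m x_m$ lands in an interval $[a,b]$ is close to the Gaussian estimate $\Phi[a/\sigma, b/\sigma]$, up to an error controlled by $\tau = \max_i |w_i|/\sigma$. The choices of $m$, the interval, and the sample size are arbitrary.

```python
import math
import random

def Phi(z):
    # Standard Gaussian c.d.f. via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def interval_probability(w, a, b, samples=50000, rng=random):
    """Estimate Pr[a <= w_1 x_1 + ... + w_m x_m <= b] for uniform x in {-1,1}^m."""
    hits = 0
    for _ in range(samples):
        s = sum(wi * rng.choice([-1, 1]) for wi in w)
        if a <= s <= b:
            hits += 1
    return hits / samples

if __name__ == "__main__":
    m = 100
    w = [1.0] * m                          # tau = 1/sqrt(m) = 0.1: fairly regular
    sigma = math.sqrt(sum(wi ** 2 for wi in w))
    a, b = -0.5 * sigma, 0.5 * sigma
    empirical = interval_probability(w, a, b)
    gaussian = Phi(b / sigma) - Phi(a / sigma)
    tau = max(abs(wi) for wi in w) / sigma
    print(empirical, gaussian, 2 * tau)    # |empirical - gaussian| should be <= 2*tau
```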
2.3. Margins, and Chow's Theorem. Having introduced Fourier analysis, we recall and prove Proposition 1.5:

Proposition 1.5. $d_{\mathrm{Chow}}(f,g) \le 2\sqrt{\mathrm{dist}(f,g)}$.

Proof. For $f, g : \{-1,1\}^n \to \{-1,1\}$ we have
$$\mathrm{dist}(f,g) = \tfrac{1}{4}\mathbf{E}[(f(x) - g(x))^2] = \tfrac{1}{4}\sum_{S \subseteq [n]} (\hat{f}(S) - \hat{g}(S))^2 \ge \tfrac{1}{4}\sum_{j=0}^{n} (\hat{f}(j) - \hat{g}(j))^2 = \tfrac{1}{4}\,d_{\mathrm{Chow}}(f,g)^2,$$
where the second equality is Parseval's identity.

Let us introduce a notion of "margin" for threshold functions:

Definition 2.9. Let $f : \{-1,1\}^n \to \{-1,1\}$ be a Boolean threshold function, $f(x) = \mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_n x_n)$, where the weights are scaled so that $\sum_{j \ge 0} w_j^2 = 1$. Given a particular input $x \in \{-1,1\}^n$ we define $\mathrm{marg}(f, x) = |w_0 + w_1 x_1 + \cdots + w_n x_n|$.³

³ This notation is slightly informal as it doesn't show the dependence on the representation of $f$.

Remark 2.10. The usual notion of margin from learning theory also involves scaling the data points $x$ so that $\|x\| \le 1$ for all $x$. Thus we have that the learning-theoretic margin of $f$ on $x$ is $\mathrm{marg}(f, x)/\sqrt{n}$.

We now present a proof of Chow's theorem from 1961:

Theorem 2.11. (Chow.) Let $f : \{-1,1\}^n \to \{-1,1\}$ be a Boolean threshold function and let $g : \{-1,1\}^n \to \{-1,1\}$ be a Boolean function such that $\hat{g}(j) = \hat{f}(j)$ for all $0 \le j \le n$. Then $g = f$.

Note that another way of phrasing this is: If $f$ is a Boolean threshold function, $g$ is a Boolean function, and $d_{\mathrm{Chow}}(f,g) = 0$, then $\mathrm{dist}(f,g) = 0$. Our Theorem 1.6 gives a robust version of this statement.

Proof. Write $f(x) = \mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_n x_n)$, where the weights are scaled so that $\sum_{j=0}^{n} w_j^2 = 1$. We may assume without loss of generality that $\mathrm{marg}(f, x) \neq 0$ for all $x$. (Otherwise, first perturb the weights slightly without changing $f$.) Now we have
$$0 = \sum_{j=0}^{n} w_j (\hat{f}(j) - \hat{g}(j)) = \mathbf{E}[(w_0 + w_1 x_1 + \cdots + w_n x_n)(f(x) - g(x))] = \mathbf{E}\big[\mathbf{1}_{\{f(x) \neq g(x)\}} \cdot 2\,\mathrm{marg}(f, x)\big].$$
The first equality is by the assumption that $\hat{f}(j) = \hat{g}(j)$ for all $0 \le j \le n$, the second equality is linearity of expectation (or Plancherel's identity), and the third equality uses the fact that $f(x) = \mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_n x_n)$. But since $\mathrm{marg}(f, x)$ is always strictly positive, we must have $\Pr[f(x) \neq g(x)] = 0$, as claimed.
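Proposition 1.5 is easy to exercise numerically. The self-contained sketch below (illustrative only, not from the paper) checks $d_{\mathrm{Chow}}(f,g) \le 2\sqrt{\mathrm{dist}(f,g)}$ on random pairs of threshold functions over a small number of variables.

```python
import itertools
import math
import random

def random_threshold(n, rng=random):
    w0 = rng.uniform(-1, 1)
    w = [rng.uniform(-1, 1) for _ in range(n)]
    return lambda x: 1 if w0 + sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

def check_proposition_1_5(n=8, trials=20):
    cube = list(itertools.product([-1, 1], repeat=n))
    for _ in range(trials):
        f, g = random_threshold(n), random_threshold(n)
        disagree = sum(1 for x in cube if f(x) != g(x)) / len(cube)
        diffs = [0.0] * (n + 1)
        for x in cube:
            d = f(x) - g(x)
            diffs[0] += d
            for i in range(n):
                diffs[i + 1] += d * x[i]
        d_chow = math.sqrt(sum((d / len(cube)) ** 2 for d in diffs))
        assert d_chow <= 2 * math.sqrt(disagree) + 1e-9, (d_chow, disagree)
    print("Proposition 1.5 held on all sampled pairs")

if __name__ == "__main__":
    check_proposition_1_5()
```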
3. First ingredient: small Chow Distance implies small distance. Our main result in this section is the following.

Theorem 1.6 Restated. Let $f : \{-1,1\}^n \to \{-1,1\}$ be any threshold function and let $g : \{-1,1\}^n \to \{-1,1\}$ be any Boolean function such that $d_{\mathrm{Chow}}(f,g) \le \epsilon$. Then $\mathrm{dist}(f,g) \le \tilde{O}\big(1/\sqrt{\log(1/\epsilon)}\big)$.⁴

⁴ For a quantity $q < 1$, the notation $\tilde{O}(q)$ means $O(q \log^c(1/q))$ for some absolute constant $c$.

Let us compare this with some recent results with a similar qualitative flavor. The main result of Goldberg [20] is a proof that for any threshold function $f$ and any Boolean function $g$, if $|\hat{f}(j) - \hat{g}(j)| \le (\epsilon/n)^{O(\log(n/\epsilon)\log(1/\epsilon))}$ for all $0 \le j \le n$, then $\mathrm{dist}(f,g) \le \epsilon$. Note that the condition of Goldberg's theorem requires that $d_{\mathrm{Chow}}(f,g) \le n^{-O(\log n)}$. Subsequently Servedio [47] showed that to obtain $\mathrm{dist}(f,g) \le \epsilon$ it suffices to have $|\hat{f}(j) - \hat{g}(j)| \le 1/(2^{\tilde{O}(1/\epsilon^2)}\sqrt{n})$ for all $0 \le j \le n$. This is a worse requirement in terms of $\epsilon$ but a better one in terms of $n$; however it still requires that $d_{\mathrm{Chow}}(f,g) \le 1/\sqrt{n}$. In contrast, Theorem 1.6 allows the Chow Distance between $f$ and $g$ to be an absolute constant independent of $n$. This independence of $n$ will be crucial later on when we use Theorem 1.6 to obtain a computationally efficient algorithm for the Chow Parameters problem.

At a high level, we prove Theorem 1.6 by giving a robust version of the proof of Chow's Theorem (Theorem 2.11). A first obvious approach to making the argument robust is to try to show that every threshold function has margin $\Omega(1)$ (independent of $n$) on every $x$. However this is well known to be badly false. A next attempt might be to show that every threshold function has a representation with margin $\Omega(1)$ on almost every $x$. This too turns out to be impossible (cf. our discussion after the statement of Lemma 5.1 below). The key to getting an $n$-independent margin lower bound is to also very slightly alter the threshold function. Specifically, the next few sections of the paper will be devoted to the proof of the following:

Theorem 3.1. Let $f : \{-1,1\}^n \to \{-1,1\}$ be any threshold function and let $\rho > 0$ be sufficiently small. Then there is a threshold function $f' : \{-1,1\}^n \to \{-1,1\}$ with $\mathrm{dist}(f, f') \le 2^{-1/\rho}$ satisfying
$$\Pr_x[\mathrm{marg}(f', x) \le \rho] \le \tilde{O}\big(1/\sqrt{\log(1/\rho)}\big).$$

In other words, any threshold function $f$ is very close to another threshold function $f'$ satisfying $\mathrm{marg}(f', x) \ge \Omega(1)$ for almost all $x$. We remark that although the fraction of points failing the margin bound could be as large as inverse-logarithmic in $\rho$, we only have to change $f$ on a fraction of points which is exponentially small in $1/\rho$ to achieve this. Theorem 3.1 is the key structural result for threshold functions that allows us to robustify the proof of Theorem 2.11. We will now show how Theorem 1.6 follows from Theorem 3.1.

Proof. (Theorem 1.6.) Given $f$, apply Theorem 3.1 with its parameter $\rho$ set (with foresight) to $\rho = \sqrt{\epsilon \log(1/\epsilon)}$. This yields a threshold function $f'(x) = \mathrm{sgn}(u_0 + u_1 x_1 + \cdots + u_n x_n)$, with $\sum_{j=0}^{n} u_j^2 = 1$, satisfying $\mathrm{dist}(f, f') \le 2^{-1/\rho} \le \epsilon$ and
$$\Pr_x[\mathrm{marg}(f', x) \le \rho] \le \tau \stackrel{\mathrm{def}}{=} \tilde{O}\big(1/\sqrt{\log(1/\rho)}\big) = \frac{\mathrm{poly}\log\log(1/\epsilon)}{\sqrt{\log(1/\epsilon)}}. \qquad (3.1)$$
Since $\mathrm{dist}(f, f') \le \epsilon$, by Proposition 1.5 we have $d_{\mathrm{Chow}}(f, f') \le 2\sqrt{\epsilon}$ and thus $d_{\mathrm{Chow}}(f', g) \le 3\sqrt{\epsilon}$ by the triangle inequality.
We now follow the proof of Chow's Theorem 2.11:
$$3\sqrt{\epsilon} \ge d_{\mathrm{Chow}}(f', g) = \sqrt{\sum_{j=0}^{n} u_j^2}\cdot\sqrt{\sum_{j=0}^{n} (\hat{f'}(j) - \hat{g}(j))^2} \ge \sum_{j=0}^{n} u_j(\hat{f'}(j) - \hat{g}(j)) = \mathbf{E}\big[\mathbf{1}_{\{f'(x) \neq g(x)\}} \cdot 2\,\mathrm{marg}(f', x)\big], \qquad (3.2)$$
where the second inequality is Cauchy-Schwarz. Now suppose that $\Pr[f'(x) \neq g(x)] \ge 2\tau$. Then by (3.1) we must have that for at least a $\tau$ fraction of $x$'s, both $f'(x) \neq g(x)$ and $\mathrm{marg}(f', x) > \rho$. This gives a contribution exceeding $\tau\rho$ to (3.2). But $\tau\rho = \sqrt{\epsilon}\cdot\mathrm{poly}\log\log(1/\epsilon) > 3\sqrt{\epsilon}$, a contradiction. Thus $\mathrm{dist}(f', g) \le 2\tau$ and so
$$\mathrm{dist}(f,g) \le \mathrm{dist}(f, f') + \mathrm{dist}(f', g) \le \epsilon + 2\tau = \tilde{O}\big(1/\sqrt{\log(1/\epsilon)}\big).$$

4. The critical index and anticoncentration. Fix a representation $f(x) = \mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_n x_n)$ of a threshold function. Throughout this section we adopt the convention that $w_1 \ge \cdots \ge w_n > 0$ (this will be without loss of generality, by permuting indices). The notion of the "critical index" of the sequence of weights $w_1, \ldots, w_n$ will be useful for us. Roughly speaking, it allows us to approximately decompose any linear form $w_0 + w_1 x_1 + \cdots + w_n x_n$ over random $\pm 1$ $x_i$'s into a short dominant "head", $w_0 + w_1 x_1 + \cdots + w_{\mathrm{small}} x_{\mathrm{small}}$, and a long remaining "tail" which acts like a Gaussian random variable. The $\tau$-critical index of $w_1, \ldots, w_n$ is essentially the least index $\ell$ for which the random variable $w_\ell x_\ell + \cdots + w_n x_n$ behaves like a Gaussian up to error $\tau$. The notion of a critical index was (implicitly) introduced and used in [47].

Towards proving a margin lower bound such as Theorem 3.1 for $f$, we need to show some kind of "anticoncentration" for the random variable $w_0 + w_1 x_1 + \cdots + w_n x_n$; we want it to rarely be near 0. Let us describe intuitively how analyzing the critical index helps us show this. If the critical index of $w_1, \ldots, w_n$ is large, then it must be the case that the initial weights $w_1, w_2, \ldots$ up to the critical index are rapidly decreasing (roughly speaking, if the weights $w_i, w_{i+1}, \ldots$ stayed about the same for a long stretch this would cause $w_i x_i + \cdots + w_n x_n$ to behave like a Gaussian). This rapid decrease can in turn be shown to imply that the head part $w_0 + w_1 x_1 + \cdots + w_{\mathrm{small}} x_{\mathrm{small}}$ is not too concentrated around any particular value; see Theorem 4.2 below. On the other hand, if the critical index $\ell$ is small, then the random variable $w_\ell x_\ell + \cdots + w_n x_n$ behaves like a Gaussian. Since Gaussians have good anticoncentration, the overall linear form $w_0 + w_1 x_1 + \cdots + w_n x_n$ will have good anticoncentration, regardless of the head part's value. We need to alter $f$ slightly to make these two cases go through, but having done so, we are able to bound the fraction of inputs $x$ for which $\mathrm{marg}(f, x)$ is very small, leading to Theorem 3.1.

We now give precise definitions. For $1 \le k \le n$ we write $\sigma_k$ to denote the 2-norm of the tail weights starting from $k$; i.e. $\sigma_k \stackrel{\mathrm{def}}{=} \sqrt{\sum_{i \ge k} w_i^2}$.
Definition 4.1. Fix a parameter $0 < \tau < 1/2$. We define the $\tau$-critical index of the weight vector $w$ to be the least index $\ell$ such that $w_\ell$ is small relative to $\sigma_\ell$ in the following sense:
$$\frac{w_\ell}{\sigma_\ell} \le \tau. \qquad (4.1)$$
(If no index $1 \le \ell \le n$ satisfies (4.1), as is the case for $(\frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \ldots, \frac{1}{2^n})$ for example, then we say that the $\tau$-critical index is $+\infty$.)

The connection between Equation (4.1) and "behaving like a Gaussian up to error $\tau$" is given by the Berry-Esseen Theorem, stated in Section 2.2. The following anticoncentration result shows that if the critical index is large, then the random variable $w_1 x_1 + \cdots + w_n x_n$ does not put much probability mass close to any particular value:

Theorem 4.2. Let $0 < \tau < 1/2$ and $t \ge 1$ be parameters, and define $k = O(1)\cdot\frac{t}{\tau^2}\ln\big(\frac{t}{\tau}\big)$. If the $\tau$-critical index $\ell$ for $w_1, \ldots, w_n$ satisfies $\ell \ge k$, then we have
$$\Pr_x\big[\,|w_0 + w_1 x_1 + \cdots + w_n x_n| \le \sqrt{t}\,\sigma_k\,\big] \le O(2^{-t}).$$

A similar result was established in [47]. The following subsections 4.1, 4.2, 4.3 are devoted to the proof of Theorem 4.2. Throughout, they assume $\ell$ denotes the $\tau$-critical index of $w_1, \ldots, w_n$, where $w_1 \ge \cdots \ge w_n > 0$, as in the condition of Theorem 4.2.

4.1. Partitioning weights into blocks. The following simple lemma shows that the tail weight decreases exponentially up to the $\tau$-critical index:

Lemma 4.3. For $1 \le a < b \le \ell$, we have $\sigma_b^2 < (1 - \tau^2)^{b-a}\sigma_a^2 < (1 - \tau^2)^{b-a} w_a^2/\tau^2$.

Proof. Since $a$ is less than the critical index, we have $w_a^2 > \tau^2\sigma_a^2 = \tau^2(w_a^2 + \sigma_{a+1}^2)$, or equivalently $(1 - \tau^2)w_a^2 > \tau^2\sigma_{a+1}^2$. Adding $(1 - \tau^2)\sigma_{a+1}^2$ to both sides gives $(1 - \tau^2)(w_a^2 + \sigma_{a+1}^2) > (1 - \tau^2)\sigma_{a+1}^2 + \tau^2\sigma_{a+1}^2$, which is equivalent to $(1 - \tau^2)\sigma_a^2 > \sigma_{a+1}^2$. This implies that $\sigma_b^2 < (1 - \tau^2)^{b-a}\sigma_a^2$; the second inequality follows from $w_a^2 > \tau^2\sigma_a^2$.

Fix a parameter $Z > 1$. We divide the list of weights $w_1, \ldots, w_\ell$ into "$Z$-blocks" of consecutive weights as follows. The first $Z$-block $B_1$ is $w_1, \ldots, w_{k_1}$, where $k_1$ is defined to be the first index such that $w_1$ (the largest weight in the block) is large relative to $\sigma_{k_1+1}$ (the total tail weight of all weights after the $Z$-block) in the following sense: $w_1 > Z\sigma_{k_1+1}$. Similarly, for $i = 2, 3, \ldots$ the $i$th $Z$-block $B_i$ is $w_{k_{i-1}+1}, \ldots, w_{k_i}$, where $k_i$ is the first index such that $w_{k_{i-1}+1} > Z\sigma_{k_i+1}$.
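Both the $\tau$-critical index of Definition 4.1 and the $Z$-block partition just described are straightforward to compute from a weight vector sorted in decreasing order, as assumed in the text. The sketch below (our own illustration, not from the paper) follows the definitions directly; the tail norms $\sigma_k$ are recomputed naively for clarity. On the geometric weight vector from Definition 4.1 the critical index comes out as $+\infty$, as the definition's parenthetical notes.

```python
import math

def tail_norm(w, k):
    """sigma_k = sqrt(w_k^2 + w_{k+1}^2 + ... + w_n^2), with 1-based index k."""
    return math.sqrt(sum(wi ** 2 for wi in w[k - 1:]))

def critical_index(w, tau):
    """Least 1-based index l with w_l / sigma_l <= tau; +infinity if none exists."""
    for l in range(1, len(w) + 1):
        if w[l - 1] <= tau * tail_norm(w, l):
            return l
    return math.inf

def z_blocks(w, Z):
    """Partition w_1, ..., w_n into Z-blocks: block i ends at the first k_i with
    w_{k_{i-1}+1} > Z * sigma_{k_i + 1}."""
    blocks, start = [], 1                 # start = k_{i-1} + 1
    for k in range(1, len(w) + 1):
        if k == len(w) or w[start - 1] > Z * tail_norm(w, k + 1):
            blocks.append(list(range(start, k + 1)))
            start = k + 1
    return blocks

if __name__ == "__main__":
    w = [2.0 ** (-i) for i in range(1, 13)]   # rapidly decreasing weights
    print(critical_index(w, tau=0.3))         # prints inf
    print(z_blocks(w, Z=2.0))
```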
The following lemma says each $Z$-block must be relatively short prior to the critical index:

Lemma 4.4. Suppose that the $i$th $Z$-block $B_i$ is such that $k_{i-1} + m \le \ell$, where
$$m \stackrel{\mathrm{def}}{=} \frac{1}{\tau^2}\ln(Z^2/\tau^2). \qquad (4.2)$$
Then $B_i$ is of length at most $m$.

Proof. Suppose that the length $|B_i|$ of the $i$th $Z$-block were more than $m$. Applying Lemma 4.3 with $b - a = m$, we have
$$\sigma_{k_{i-1}+1+m}^2 < (1 - \tau^2)^m w_{k_{i-1}+1}^2/\tau^2 \le e^{-\tau^2 m} w_{k_{i-1}+1}^2/\tau^2.$$
But by the assumption that the $i$th $Z$-block is longer than $m$, we also have $w_{k_{i-1}+1}^2 \le Z^2\sigma_{k_{i-1}+1+m}^2$. Combining these inequalities and plugging in our expression for $m$ we get a contradiction.

An easy consequence is that if the critical index is large, then there must be many blocks prior to it:

Corollary 4.5. For $t \ge 1$, suppose that the $\tau$-critical index $\ell$ is at least $tm$, where $m$ is defined as in (4.2). Then $k_t \le tm$, i.e. there are at least $t$ complete $Z$-blocks by the $(tm)$-th weight.

4.2. Block structure and concentration of the random variable $w \cdot x$. Let $f(x) = \mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_n x_n)$ be a threshold function with $w_1 \ge \cdots \ge w_n > 0$, and let $B_1, B_2, \ldots$ be the $Z$-blocks for $w$ as defined in the previous subsection. In this subsection we prove the following lemma, which is a slight variant of a similar result in [47]. Intuitively the lemma says that if a weight vector $w$ has many blocks, then for any $w_0 \in \mathbb{R}$, only an exponentially small fraction of points $x \in \{-1,1\}^n$ will have a small margin for the threshold function $\mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_n x_n)$. As we show in the next subsection, Theorem 4.2 will be an easy consequence of this lemma.

Lemma 4.6. Fix a value $t$ such that there exist at least $t$ complete $Z$-blocks $B_1, \ldots, B_t$ in the weight vector $w$. Then for any $w_0 \in \mathbb{R}$, we have
$$\Pr\big[\,|w_0 + w_1 x_1 + \cdots + w_n x_n| \le \sigma_{k_t+1}\,(Z/6)\,\big] \le 2^{-t} + 2te^{-Z^2/72}.$$
Here the probability is taken over a uniform random choice of $x$ from $\{-1,1\}^n$.

We first give some necessary preliminary results and then prove Lemma 4.6. Our approach follows that of [47] with slight modifications. Let us view the choice of a uniform random assignment $x$ to the variables in $Z$-blocks $B_1, \ldots, B_t$ as taking place in successive stages, where in the $i$th stage values are assigned to the variables in the $i$th $Z$-block $B_i$. Immediately after the $i$th stage, some value, call it $\xi_i$, has been determined for $w_0 + w_1 x_1 + \cdots + w_{k_i} x_{k_i}$. The following simple lemma shows that if $\xi_i$ is too far from 0, then it is unlikely that the remaining variables $x_{k_i+1}, \ldots, x_n$ will come out in such a way as to make the final sum close to 0.

Lemma 4.7. For any value $A > 0$ and any $1 \le i \le t$, if $|\xi_i| \ge 2\sigma_{k_i+1}\sqrt{2\ln(2/A)}$, then we have
$$\Pr_{x_{k_i+1}, \ldots, x_n}\big[\,|w_0 + w_1 x_1 + \cdots + w_n x_n| \le \sigma_{k_i+1}\sqrt{2\ln(2/A)}\,\big] \le A. \qquad (4.3)$$

Proof. By the lower bound on $|\xi_i|$ in the hypothesis of the lemma, it can only be the case that $|w_0 + w_1 x_1 + \cdots + w_n x_n| \le \sigma_{k_i+1}\sqrt{2\ln(2/A)}$ if
$$|w_{k_i+1} x_{k_i+1} + \cdots + w_n x_n| \ge \sigma_{k_i+1}\sqrt{2\ln(2/A)}. \qquad (4.4)$$
We now recall the Hoeffding bound (see e.g. [12]), which says that for any $0 \neq v \in \mathbb{R}^r$ and any $\gamma > 0$, we have
$$\Pr_{x \in \{-1,1\}^r}\big[\,|v_1 x_1 + \cdots + v_r x_r| \ge \gamma\sqrt{v_1^2 + \cdots + v_r^2}\,\big] \le 2e^{-\gamma^2/2}.$$
Since $w_{k_i+1}^2 + \cdots + w_n^2 = \sigma_{k_i+1}^2$, this Hoeffding bound implies that the probability of (4.4) is at most $2e^{-(\sqrt{2\ln(2/A)})^2/2} = A$.

We henceforth fix $A$ to be $A \stackrel{\mathrm{def}}{=} 2e^{-Z^2/72}$, so we have $6\sqrt{2\ln(2/A)} = Z$. We now show that regardless of the value of $\xi_{i-1}$, we have $|\xi_i| \le 2\sigma_{k_i+1}(Z/6)$ with probability at most $1/2$ over the choice of values for variables in block $B_i$ in the $i$th stage.

Lemma 4.8. For any $\xi_{i-1} \in \mathbb{R}$, we have $\Pr_{x_{k_{i-1}+1}, \ldots, x_{k_i}}\big[\,|\xi_i| \le 2\sigma_{k_i+1}(Z/6) \;\big|\; \xi_{i-1}\,\big] \le 1/2$.

Proof. Since $\xi_i$ equals $\xi_{i-1} + (w_{k_{i-1}+1} x_{k_{i-1}+1} + \cdots + w_{k_i} x_{k_i})$, we have $|\xi_i| \le 2\sigma_{k_i+1}(Z/6)$ if and only if the value $w_{k_{i-1}+1} x_{k_{i-1}+1} + \cdots + w_{k_i} x_{k_i}$ lies in the interval
$$[I_L, I_R] \stackrel{\mathrm{def}}{=} \big[-\xi_{i-1} - 2\sigma_{k_i+1}(Z/6),\; -\xi_{i-1} + 2\sigma_{k_i+1}(Z/6)\big]$$
of width $\frac{2}{3}\sigma_{k_i+1}Z$. First suppose that $0 \notin [I_L, I_R]$, i.e. the whole interval has the same sign. If this is the case then $\Pr[w_{k_{i-1}+1} x_{k_{i-1}+1} + \cdots + w_{k_i} x_{k_i} \in [I_L, I_R]] \le \frac{1}{2}$, since by symmetry the value $w_{k_{i-1}+1} x_{k_{i-1}+1} + \cdots + w_{k_i} x_{k_i}$ is equally likely to be positive or negative. Now suppose that $0 \in [I_L, I_R]$. By definition of $k_i$, we know that $\sigma_{k_i+1} \le w_{k_{i-1}+1}/Z$, and consequently we have that the width of the interval $[I_L, I_R]$ is at most $\frac{2}{3}w_{k_{i-1}+1}$. But now observe that once the value of $x_{k_{i-1}+1}$ is set to either $+1$ or $-1$, this effectively shifts the target interval, which now $w_{k_{i-1}+2} x_{k_{i-1}+2} + \cdots + w_{k_i} x_{k_i}$ must hit, by a displacement of $w_{k_{i-1}+1}$, to become $[I_L - w_{k_{i-1}+1} x_{k_{i-1}+1},\; I_R - w_{k_{i-1}+1} x_{k_{i-1}+1}]$. (Note that in the special case where $k_i = k_{i-1} + 1$, the value $w_{k_{i-1}+2} x_{k_{i-1}+2} + \cdots + w_{k_i} x_{k_i}$ which must hit the target interval is simply 0.) Since the original interval $[I_L, I_R]$ contained 0 and was of length at most $\frac{2}{3}w_{k_{i-1}+1}$, the new interval does not contain 0, and thus again by symmetry we have that the probability (now over the choice of $x_{k_{i-1}+2}, \ldots, x_{k_i}$) that $w_{k_{i-1}+1} x_{k_{i-1}+1} + \cdots + w_{k_i} x_{k_i}$ lies in $[I_L, I_R]$ is at most $\frac{1}{2}$.

In order to have $|w_0 + w_1 x_1 + \cdots + w_n x_n| \le \sigma_{k_t+1}\sqrt{2\ln(2/A)}$, it must be the case that either (i) each $|\xi_i| < 2\sigma_{k_i+1}\sqrt{2\ln(2/A)}$ for $i = 1, \ldots, t$; or (ii) for some $1 \le i \le t$ we have $|\xi_i| \ge 2\sigma_{k_i+1}\sqrt{2\ln(2/A)}$ but nonetheless $|w_0 + w_1 x_1 + \cdots + w_n x_n| < \sigma_{k_i+1}\sqrt{2\ln(2/A)}$. Lemma 4.8 gives us that the probability of (i) is at most $(1/2)^t = 2^{-t}$, and Lemma 4.7 with the union bound gives us that the probability of (ii) is at most $tA$. This proves Lemma 4.6.

4.3. Proof of Theorem 4.2. Let $Z = 12\sqrt{t}$. We take $m = \frac{1}{\tau^2}\ln(Z^2/\tau^2)$ as in (4.2), and we have $k = tm + 1$. With these choices the condition $\ell \ge k$ of Theorem 4.2 together with Corollary 4.5 implies that there are at least $t$ complete $Z$-blocks in the weight vector $w$. Thus we may apply Lemma 4.6, and we have that
$$\Pr\big[\,|w_0 + w_1 x_1 + \cdots + w_n x_n| \le \sigma_{k_t+1}\cdot 2\sqrt{t}\,\big] \le 2^{-t} + 2te^{-2t} \le O(2^{-t}).$$
Now we further observe that since there are in fact $t$ complete $Z$-blocks prior to the $k$th weight, we have $k_t + 1 \le k$ and hence $\sigma_{k_t+1} \ge \sigma_k$, so the above inequality implies
$$\Pr\big[\,|w_0 + w_1 x_1 + \cdots + w_n x_n| \le \sqrt{t}\,\sigma_k\,\big] \le O(2^{-t}).$$
This is the desired conclusion of Theorem 4.2.
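Before moving on, here is a small Monte Carlo illustration (ours, not the paper's, with arbitrary parameter choices) of the two regimes that the critical index distinguishes: regular weights, whose sum behaves like a Gaussian, and rapidly decreasing weights, to which the block analysis above applies. In both regimes only a small fraction of inputs give the linear form a small margin, which is the phenomenon Theorem 3.1 ultimately needs.

```python
import math
import random

def small_margin_fraction(w0, w, rho, samples=50000, rng=random):
    """Estimate Pr_x[|w0 + w.x| <= rho] for uniform x in {-1,1}^n."""
    hits = 0
    for _ in range(samples):
        s = w0 + sum(wi * rng.choice([-1, 1]) for wi in w)
        if abs(s) <= rho:
            hits += 1
    return hits / samples

if __name__ == "__main__":
    n = 100
    # Regime 1 (small critical index): regular weights, the sum is Gaussian-like.
    w_regular = [1.0 / math.sqrt(n)] * n          # normalized so sigma = 1
    # Regime 2 (large critical index): rapidly decreasing weights.
    w_geometric = [2.0 ** (-i) for i in range(1, n + 1)]
    for rho in (0.2, 0.1, 0.05):
        print(rho,
              small_margin_fraction(0.37, w_regular, rho),     # 0.37 is arbitrary
              small_margin_fraction(0.37, w_geometric, rho))
```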
4.4. Extension of Theorem 4.2. The same proof with a slightly different choice of $Z$ (taking $Z = O(1)\,t^C$) in fact gives us the following significantly stronger version of Theorem 4.2; however this stronger version is not more useful for our purposes:

Theorem 4.9. In the setting of Theorem 4.2, let $C \ge 1/2$ be another parameter, and suppose we instead define
$$k = O(1)\cdot\frac{t}{\tau^2}\ln\Big(\frac{t^C}{\tau}\Big).$$
Then if $\ell \ge k$,
$$\Pr_x\big[\,|w_0 + w_1 x_1 + \cdots + w_n x_n| \le t^C\sigma_k\,\big] \le O(2^{-t}).$$

5. Approximating threshold functions using not-too-large head weights. The main result of this section is a lemma which roughly says that any threshold function $f$ can be approximated by a threshold function $f'$ in which the 2-norm of the tail weights, $\sigma_k$, is at least an $\Omega(1)$ fraction of the head weights. This is important so that the Gaussian random variable to which the tail part is close has $\Omega(1)$ variance and thus sufficiently good anticoncentration.

Lemma 5.1. Let $f : \{-1,1\}^n \to \{-1,1\}$ be any threshold function, $f(x) = \mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_n x_n)$ (recall that we assume $w_1 \ge w_2 \ge \cdots \ge w_n$). Let $0 < \epsilon < 1/2$ and $1 \le k \le n$ be parameters, and write $\sigma_k \stackrel{\mathrm{def}}{=} \sqrt{\sum_{j \ge k} w_j^2}$. Assuming $\sigma_k > 0$, there are numbers $v_0, \ldots, v_{k-1}$ satisfying
$$|v_i| \le k^{(k+1)/2}\sqrt{3\ln(2/\epsilon)}\,\sigma_k \qquad (5.1)$$
such that the threshold function $f' : \{-1,1\}^n \to \{-1,1\}$ defined by
$$f'(x) = \mathrm{sgn}(v_0 + v_1 x_1 + \cdots + v_{k-1} x_{k-1} + w_k x_k + \cdots + w_n x_n)$$
satisfies $\mathrm{dist}(f, f') \le \epsilon$. One may further ensure that $v_1 \ge v_2 \ge \cdots \ge v_{k-1} \ge w_k$ and that $\mathrm{sgn}(v_i) = \mathrm{sgn}(w_i)$ for all $i$.

Before proving this lemma, let us give an illustration. Consider the threshold function
$$f(x) = \mathrm{sgn}(n x_1 + n x_2 + x_3 + \cdots + x_n), \qquad (5.2)$$
with $k = 3$. The tail weights here have $\sigma_3 = \sqrt{n-2}$, which of course is not a constant fraction of the two head weights, $n$. Further, this cannot be fixed just by choosing a different weights-based representation of the same function $f$. What Lemma 5.1 shows here is that we can shrink the head weights from $n$ all the way down to $\Theta(\sqrt{\ln(1/\epsilon)})\cdot\sqrt{n}$ without changing the function on more than an $\epsilon$ fraction of points (this heavily uses the fact that the tail acts like a Gaussian with standard deviation $\sqrt{n-2}$). Then indeed $\sigma_3$ is an $\Omega(f(\epsilon))$ fraction of the head weights for a function $f(\epsilon)$ that is independent of $n$, as desired.
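The illustration above can be checked by simulation. The sketch below (our own, with arbitrary parameter choices) compares $f(x) = \mathrm{sgn}(n x_1 + n x_2 + x_3 + \cdots + x_n)$ against the variant whose two head weights are shrunk to $C = \sqrt{3\ln(2/\epsilon)}\,\sigma_3$, the cutoff used in the proof of Lemma 5.1 below, and estimates the fraction of inputs on which the two functions disagree; it comes out far below $\epsilon$.

```python
import math
import random

def disagreement(n, eps, samples=50000, rng=random):
    """Estimated disagreement between sgn(n*x1 + n*x2 + tail) and the head-shrunk
    version sgn(C*x1 + C*x2 + tail), with C as in the proof of Lemma 5.1."""
    sigma3 = math.sqrt(n - 2)                 # 2-norm of the tail weights on x_3, ..., x_n
    C = math.sqrt(3.0 * math.log(2.0 / eps)) * sigma3
    disagree = 0
    for _ in range(samples):
        x1, x2 = rng.choice([-1, 1]), rng.choice([-1, 1])
        tail = sum(rng.choice([-1, 1]) for _ in range(n - 2))
        f_val = 1 if n * x1 + n * x2 + tail >= 0 else -1
        f_shrunk = 1 if C * x1 + C * x2 + tail >= 0 else -1
        disagree += (f_val != f_shrunk)
    return disagree / samples

if __name__ == "__main__":
    print(disagreement(n=201, eps=0.01))      # prints a value well below eps = 0.01
```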
We now give the proof of Lemma 5.1, a modification of the classic argument of [40] which bounds the weights required for exact representation of any threshold function.

Proof. We will first prove the lemma without the extra constraints $v_1 \ge v_2 \ge \cdots \ge v_{k-1} \ge w_k$ and $\mathrm{sgn}(v_i) = \mathrm{sgn}(w_i)$. At the end of the proof we will show how these constraints can also be ensured.

Let $h : \{-1,1\}^{k-1} \to \mathbb{R}$ denote the "head" of $f$,
$$h(x) = w_0 + w_1 x_1 + \cdots + w_{k-1} x_{k-1}.$$
Consider the system $S$ of $2^{k-1}$ linear equations in $k$ unknowns named $u_0, \ldots, u_{k-1}$: for each $x \in \{-1,1\}^{k-1}$ we include the equation
$$u_0 + u_1 x_1 + \cdots + u_{k-1} x_{k-1} = h(x).$$
Of course, the linear system $S$ is satisfiable, since $(u_0, \ldots, u_{k-1}) = (w_0, \ldots, w_{k-1})$ is a solution. Let $C$ be defined by $C = \sqrt{3\ln(2/\epsilon)}\,\sigma_k$, and consider the system $LP$ of $2^{k-1}$ linear (in)equalities over unknowns $u_0, \ldots, u_{k-1}$: for each $x \in \{-1,1\}^{k-1}$ we include the (in)equality
$$u_0 + u_1 x_1 + \cdots + u_{k-1} x_{k-1} \;\begin{cases} \ge C & \text{if } h(x) \ge C, \\ = h(x) & \text{if } |h(x)| < C, \\ \le -C & \text{if } h(x) \le -C. \end{cases} \qquad (5.3)$$
We have that $LP$ is feasible, since it is a relaxation of the satisfiable system $S$. Now we use the following standard result from the theory of linear inequalities, which is a straightforward consequence of Cramer's rule and is implicit in several works (see e.g. the proof at the start of Section 3 of [24]):

Lemma 5.2. Let $LP$ denote a feasible linear program over $k$ variables $u_0, \ldots, u_{k-1}$ in which the constraint matrix has all entries from $\{-1, 0, 1\}$ and the right-hand side has all entries at most $C$ in absolute value. Then there is a feasible solution $(v_0, \ldots, v_{k-1})$ in which $|v_i| \le k^{(k+1)/2} C$ for each $i$.

This implies that there is a feasible solution $(u_0, \ldots, u_{k-1}) = (v_0, \ldots, v_{k-1})$ to $LP$ in which the numbers $v_i$ are not too large in magnitude: specifically, using Lemma 5.2 we may obtain
$$|v_i| \le k^{(k+1)/2} C. \qquad (5.4)$$
We now show that the threshold function
$$f'(x) = \mathrm{sgn}(v_0 + v_1 x_1 + \cdots + v_{k-1} x_{k-1} + w_k x_k + \cdots + w_n x_n)$$
satisfies $\mathrm{dist}(f, f') \le \epsilon$. Given $x \in \{-1,1\}^n$, let us abuse notation by writing
$$h(x) = h(x_1, \ldots, x_{k-1}) = w_0 + w_1 x_1 + \cdots + w_{k-1} x_{k-1};$$
let us also write
$$h'(x) = v_0 + v_1 x_1 + \cdots + v_{k-1} x_{k-1}$$
for the head of $f'$ and $t(x) = \sum_{j \ge k} w_j x_j$ for the tail, which is common to both $f$ and $f'$. Now if $x$ is any input for which $|h(x)| < C$ then we have $h(x) = h'(x)$ by construction, and hence $f(x) = f'(x)$. Thus in order for $f(x)$ to disagree with $f'(x)$ it must at least be the case that $|h(x)| \ge C$. Moreover, it must also be the case that $|t(x)| \ge C$, for otherwise $\mathrm{sgn}(h(x) + t(x))$ will equal $\mathrm{sgn}(h'(x) + t(x))$, because $h(x)$ and $h'(x)$ have the same sign by construction. But the Hoeffding bound implies that
$$\Pr_x[\,|t(x)| \ge C\,] \le \Pr_x[\,|t(x)| \ge \sqrt{2\ln(2/\epsilon)}\,\sigma_k\,] \le 2e^{-\ln(2/\epsilon)} = \epsilon.$$
Hence indeed $\Pr[f(x) \neq f'(x)] \le \epsilon$, as desired.

Finally, we complete the proof by showing how to ensure the extra constraints $v_1 \ge v_2 \ge \cdots \ge v_{k-1} \ge w_k$ and $\mathrm{sgn}(v_i) = \mathrm{sgn}(w_i)$. First, the constraints $\mathrm{sgn}(u_i) = \mathrm{sgn}(w_i)$ can be added into $LP$; by this we mean adding constraints like "$u_1 \ge 0$", "$u_2 \le 0$", etc. Next, the constraints
$$\mathrm{sgn}(w_1) u_1 \ge \mathrm{sgn}(w_2) u_2, \quad \mathrm{sgn}(w_2) u_2 \ge \mathrm{sgn}(w_3) u_3, \quad \ldots, \quad \mathrm{sgn}(w_{k-2}) u_{k-2} \ge \mathrm{sgn}(w_{k-1}) u_{k-1}$$
can be added into $LP$; again, these are constraints like "$u_i \ge u_{i+1}$". Finally, we can add the constraint $\mathrm{sgn}(w_{k-1}) u_{k-1} \ge |w_k|$. Of course, $LP$ remains feasible after the addition of all of these constraints, since $(u_0, \ldots, u_{k-1}) = (w_0, \ldots, w_{k-1})$ is still a solution. It remains to show that there is still a solution satisfying the bounds in (5.4). But this still follows from Lemma 5.2: the added constraints only have coefficients in $\{-1, 0, 1\}$, and the added right-hand side entries are all 0, except for the last, which is $|w_k| \le \sigma_k \le C$.
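For small $k$ the linear program $LP$ from the proof can be written down and solved directly. The sketch below is our own illustration using scipy's linprog (not code from the paper); it builds the constraints (5.3) for a given head $(w_0, \ldots, w_{k-1})$ and cutoff $C$ and returns any feasible reduced head $(v_0, \ldots, v_{k-1})$, omitting the extra sign and ordering constraints discussed at the end of the proof.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def reduce_head_weights(head, C):
    """Return some (v_0, ..., v_{k-1}) feasible for the system (5.3), given the
    original head weights head = (w_0, ..., w_{k-1}) and the cutoff C."""
    k = len(head)                                      # variables u_0, ..., u_{k-1}
    A_eq, b_eq, A_ub, b_ub = [], [], [], []
    for x in itertools.product([-1, 1], repeat=k - 1):
        row = [1.0] + list(map(float, x))              # coefficients of u_0, ..., u_{k-1}
        h = head[0] + sum(w * xi for w, xi in zip(head[1:], x))
        if abs(h) < C:
            A_eq.append(row); b_eq.append(h)           # u.(1,x) = h(x)
        elif h >= C:
            A_ub.append([-c for c in row]); b_ub.append(-C)   # u.(1,x) >= C
        else:
            A_ub.append(row); b_ub.append(-C)          # u.(1,x) <= -C
    res = linprog(c=np.zeros(k),
                  A_ub=np.array(A_ub) if A_ub else None,
                  b_ub=np.array(b_ub) if b_ub else None,
                  A_eq=np.array(A_eq) if A_eq else None,
                  b_eq=np.array(b_eq) if b_eq else None,
                  bounds=[(None, None)] * k,
                  method="highs")
    return res.x if res.success else None

if __name__ == "__main__":
    # Head of the example (5.2) with n = 101 and k = 3: weights (w_0, w_1, w_2) = (0, n, n).
    n, eps = 101, 0.01
    C = math_C = (3 * np.log(2 / eps)) ** 0.5 * np.sqrt(n - 2)
    print(reduce_head_weights([0.0, float(n), float(n)], C))
```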
6. Every threshold function is close to a threshold function for which few points have small margin. In this section we show how to combine Theorem 4.2 and Lemma 5.1 to establish the following:

Theorem 6.1. Let $f : \{-1,1\}^n \to \{-1,1\}$ be any threshold function and let $0 < \tau < 1/2$. Then there is a threshold function $f' : \{-1,1\}^n \to \{-1,1\}$ with $\mathrm{dist}(f, f') \le \epsilon$ satisfying
$$\Pr_x[\mathrm{marg}(f', x) \le \rho] \le O(\tau),$$
where $\epsilon = \epsilon(\tau) = 2^{-2^{O(\log^3(1/\tau)/\tau^2)}}$ and $\rho = \rho(\tau) = 2^{-O(\log^3(1/\tau)/\tau^2)}$.

Our main structural result about margins, Theorem 3.1, is simply a rephrasing of the above theorem. Hence proving Theorem 6.1 completes the proof of Theorem 1.6, the first ingredient in our solution to the Chow Parameters Problem.

The plan for the proof of Theorem 6.1 follows the intuition described in the beginning of Section 4. We consider the location of the $\tau$-critical index of $f$. Case 1 is that it occurs quite early. In that case, the resulting tail acts like a Gaussian (up to error $\tau$), and hence we can get a good anticoncentration bound so long as the tail's
variance is large enough. To ensure this, we alter $f$ at the beginning of the argument using Lemma 5.1, which yields tail weights with total variance lower bounded by a function that depends only on $\tau$. Case 2 is that the critical index occurs late. In this case we get anticoncentration by appealing to Theorem 4.2. We again use Lemma 5.1 so that the $\sigma_k$ parameter is not too small. We now give the formal proof.

Proof. (Theorem 6.1) We intend to apply Theorem 4.2 in Case 2 with its $t$ parameter set to $\log(1/\tau)$, so that the anticoncentration is $O(\tau)$. Thus we will need to ensure the $\tau$-critical index parameter $\ell$ is at least
$$k \stackrel{\mathrm{def}}{=} O(1)\cdot\frac{\log(1/\tau)}{\tau^2}\ln\Big(\frac{\log(1/\tau)}{\tau}\Big). \qquad (6.1)$$
To that end, fix a weights-based representation of $f$,
$$f(x) = \mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_n x_n),$$
where we may assume that $w_1 \ge w_2 \ge \cdots \ge w_n > 0$. Write $\sigma_k = \sqrt{\sum_{j \ge k} w_j^2}$, and observe that $\sigma_k > 0$ since each $w_i \neq 0$. Now apply Lemma 5.1, with its parameter $\epsilon$ set to $2^{-k^{O(k)}}$. This yields a new threshold function
$$f'(x) = \mathrm{sgn}(v_0 + v_1 x_1 + \cdots + v_{k-1} x_{k-1} + w_k x_k + \cdots + w_n x_n), \qquad (6.2)$$
where each $v_i$ satisfies
$$|v_i| \le k^{O(k)}\sigma_k, \qquad (6.3)$$
and also $v_1 \ge v_2 \ge \cdots \ge v_{k-1} \ge w_k$. This $f'$ has $\mathrm{dist}(f, f') \le \epsilon = 2^{-k^{O(k)}}$.

To analyze $\mathrm{marg}(f', x)$, let us normalize the weights of $f'$ by dividing each weight by $\sqrt{v_0^2 + \cdots + v_{k-1}^2 + w_k^2 + \cdots + w_n^2}$. We thus may write
$$f'(x) = \mathrm{sgn}(u_0 + u_1 x_1 + \cdots + u_{k-1} x_{k-1} + u_k x_k + \cdots + u_n x_n),$$
where $\sum_{j \ge 0} u_j^2 = 1$. Equation (6.3) implies that for each of the $k$ values $i = 0, \ldots, k-1$ we have that $v_i^2$ is at most $k^{O(k)}$ times as large as $w_k^2 + \cdots + w_n^2$. Letting $\sigma_i$ now denote $\sqrt{\sum_{j \ge i} u_j^2}$ and recalling that $\sum_{j \ge 0} u_j^2 = 1$, this is easily seen to imply that
$$\sigma_k \ge k^{-O(k)}. \qquad (6.4)$$
Recalling that we still have $u_1 \ge u_2 \ge \cdots \ge u_n > 0$, let $\ell$ be the $\tau$-critical index for $u_1, \ldots, u_n$, and consider two cases:

Case 1: $\ell < k$. In this case, consider any fixed choice for $x_1, \ldots, x_{\ell-1}$ and write $h = u_0 + u_1 x_1 + \cdots + u_{\ell-1} x_{\ell-1}$. Using the definition of $\tau$-critical index and applying the Berry-Esseen Corollary 2.8 to $u_\ell x_\ell + \cdots + u_n x_n$ we get
$$\Pr_{x_\ell, \ldots, x_n}\big[\,-h - \gamma \le u_\ell x_\ell + \cdots + u_n x_n \le -h + \gamma\,\big] \le \frac{2\gamma}{\sigma_\ell} + 2\tau,$$
for any choice of $\gamma \ge 0$. Taking $\gamma = \tau\sigma_\ell \ge \tau\sigma_k$ we conclude
$$\Pr_x[\mathrm{marg}(f', x) \le \tau\sigma_k] \le 4\tau.$$
Case 2: $\ell \ge k$. In this case we apply Theorem 4.2, with its parameter $t$ set to $\log(1/\tau)$, as described at the beginning of the proof. With $k$ defined as in (6.1), we conclude
$$\Pr_x[\mathrm{marg}(f', x) \le \sqrt{\log(1/\tau)}\,\sigma_k] \le O(\tau).$$
Combining the results of the two cases and using $\sigma_k \ge k^{-O(k)}$ from (6.4), we conclude that we always have
$$\Pr_x[\mathrm{marg}(f', x) \le \tau k^{-O(k)}] \le O(\tau).$$
Now it only remains to observe that by definition (6.1) of $k$,
$$k^{O(k)} = 2^{O(\log^3(1/\tau)/\tau^2)}.$$
Hence we have that
$$\mathrm{dist}(f, f') \le 2^{-k^{O(k)}} \le \epsilon(\tau)$$
and
$$\tau k^{-O(k)} \ge \tau\cdot 2^{-O(\log^3(1/\tau)/\tau^2)} \ge \rho(\tau).$$

7. Second ingredient: using Chow Parameters as weights for tail variables. We begin this section with some informal motivation for and description of our second ingredient.

We first recall that every threshold function $f$ is unate; this means that for every $i$, $f$ is either monotone increasing or monotone decreasing as a function of its $i$-th coordinate. A well-known consequence of unateness is that the magnitude of the Fourier coefficient $\hat{f}(i)$ is equal to the influence of the variable $x_i$ on $f$, i.e. $\Pr[f(x) \neq f(y)]$ where $x$ is drawn uniformly from $\{-1,1\}^n$ and $y$ is $x$ with the $i$th bit flipped. As done in the first ingredient, it is natural to group together the high-influence variables, forming the "head" indices of $f$. We refer to the remaining indices as the "tail" indices. Note that an algorithm for the Chow Parameters problem can do this grouping, since it is given the $\hat{f}(i)$'s.

The following theorem states that any threshold function $f$ is either already close to a junta over the head indices, or is close to a threshold function obtained by replacing the tail weights with (suitably scaled versions of) the tail Chow Parameters. (We have made no effort to optimize the precise polynomial dependence of $\tau(\epsilon)$ on $\epsilon$.)

Theorem 7.1. There is a polynomial function $\tau(\epsilon) = \mathrm{poly}(\epsilon)$ such that the following holds: Let $f$ be a Boolean threshold function over head indices $H$ and tail indices $T$,
$$f(x) = \mathrm{sgn}\Big(v_0 + \sum_{i \in H} v_i x_i + \sum_{i \in T} w_i x_i\Big),$$
and let $0 < \epsilon < 1/2$. Assume that $H$ contains all indices $i$ such that $|\hat{f}(i)| \ge \tau(\epsilon)^2$. Then one of the following holds: (i) $f$ is $O(\epsilon)$-close to a junta over $H$; or,


More information

New Results for Random Walk Learning

New Results for Random Walk Learning Journal of Machine Learning Research 15 (2014) 3815-3846 Submitted 1/13; Revised 5/14; Published 11/14 New Results for Random Walk Learning Jeffrey C. Jackson Karl Wimmer Duquesne University 600 Forbes

More information

arxiv: v1 [cs.ds] 3 Feb 2018

arxiv: v1 [cs.ds] 3 Feb 2018 A Model for Learned Bloom Filters and Related Structures Michael Mitzenmacher 1 arxiv:1802.00884v1 [cs.ds] 3 Feb 2018 Abstract Recent work has suggested enhancing Bloom filters by using a pre-filter, based

More information

Learning and Fourier Analysis

Learning and Fourier Analysis Learning and Fourier Analysis Grigory Yaroslavtsev http://grigory.us CIS 625: Computational Learning Theory Fourier Analysis and Learning Powerful tool for PAC-style learning under uniform distribution

More information

On the Sample Complexity of Noise-Tolerant Learning

On the Sample Complexity of Noise-Tolerant Learning On the Sample Complexity of Noise-Tolerant Learning Javed A. Aslam Department of Computer Science Dartmouth College Hanover, NH 03755 Scott E. Decatur Laboratory for Computer Science Massachusetts Institute

More information

1 Last time and today

1 Last time and today COMS 6253: Advanced Computational Learning Spring 2012 Theory Lecture 12: April 12, 2012 Lecturer: Rocco Servedio 1 Last time and today Scribe: Dean Alderucci Previously: Started the BKW algorithm for

More information

FORMULATION OF THE LEARNING PROBLEM

FORMULATION OF THE LEARNING PROBLEM FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we

More information

Learning symmetric non-monotone submodular functions

Learning symmetric non-monotone submodular functions Learning symmetric non-monotone submodular functions Maria-Florina Balcan Georgia Institute of Technology ninamf@cc.gatech.edu Nicholas J. A. Harvey University of British Columbia nickhar@cs.ubc.ca Satoru

More information

Notes 6 : First and second moment methods

Notes 6 : First and second moment methods Notes 6 : First and second moment methods Math 733-734: Theory of Probability Lecturer: Sebastien Roch References: [Roc, Sections 2.1-2.3]. Recall: THM 6.1 (Markov s inequality) Let X be a non-negative

More information

Introduction to Computational Learning Theory

Introduction to Computational Learning Theory Introduction to Computational Learning Theory The classification problem Consistent Hypothesis Model Probably Approximately Correct (PAC) Learning c Hung Q. Ngo (SUNY at Buffalo) CSE 694 A Fun Course 1

More information

Proclaiming Dictators and Juntas or Testing Boolean Formulae

Proclaiming Dictators and Juntas or Testing Boolean Formulae Proclaiming Dictators and Juntas or Testing Boolean Formulae Michal Parnas The Academic College of Tel-Aviv-Yaffo Tel-Aviv, ISRAEL michalp@mta.ac.il Dana Ron Department of EE Systems Tel-Aviv University

More information

Two Comments on Targeted Canonical Derandomizers

Two Comments on Targeted Canonical Derandomizers Two Comments on Targeted Canonical Derandomizers Oded Goldreich Department of Computer Science Weizmann Institute of Science Rehovot, Israel. oded.goldreich@weizmann.ac.il April 8, 2011 Abstract We revisit

More information

Lecture 6,7 (Sept 27 and 29, 2011 ): Bin Packing, MAX-SAT

Lecture 6,7 (Sept 27 and 29, 2011 ): Bin Packing, MAX-SAT ,7 CMPUT 675: Approximation Algorithms Fall 2011 Lecture 6,7 (Sept 27 and 29, 2011 ): Bin Pacing, MAX-SAT Lecturer: Mohammad R. Salavatipour Scribe: Weitian Tong 6.1 Bin Pacing Problem Recall the bin pacing

More information

Continuity. Chapter 4

Continuity. Chapter 4 Chapter 4 Continuity Throughout this chapter D is a nonempty subset of the real numbers. We recall the definition of a function. Definition 4.1. A function from D into R, denoted f : D R, is a subset of

More information

Lecture 23: Alternation vs. Counting

Lecture 23: Alternation vs. Counting CS 710: Complexity Theory 4/13/010 Lecture 3: Alternation vs. Counting Instructor: Dieter van Melkebeek Scribe: Jeff Kinne & Mushfeq Khan We introduced counting complexity classes in the previous lecture

More information

Learning convex bodies is hard

Learning convex bodies is hard Learning convex bodies is hard Navin Goyal Microsoft Research India navingo@microsoft.com Luis Rademacher Georgia Tech lrademac@cc.gatech.edu Abstract We show that learning a convex body in R d, given

More information

A Noisy-Influence Regularity Lemma for Boolean Functions Chris Jones

A Noisy-Influence Regularity Lemma for Boolean Functions Chris Jones A Noisy-Influence Regularity Lemma for Boolean Functions Chris Jones Abstract We present a regularity lemma for Boolean functions f : {, } n {, } based on noisy influence, a measure of how locally correlated

More information

Foundations of Machine Learning and Data Science. Lecturer: Avrim Blum Lecture 9: October 7, 2015

Foundations of Machine Learning and Data Science. Lecturer: Avrim Blum Lecture 9: October 7, 2015 10-806 Foundations of Machine Learning and Data Science Lecturer: Avrim Blum Lecture 9: October 7, 2015 1 Computational Hardness of Learning Today we will talk about some computational hardness results

More information

CONSTRUCTION OF THE REAL NUMBERS.

CONSTRUCTION OF THE REAL NUMBERS. CONSTRUCTION OF THE REAL NUMBERS. IAN KIMING 1. Motivation. It will not come as a big surprise to anyone when I say that we need the real numbers in mathematics. More to the point, we need to be able to

More information

Lecture 4: LMN Learning (Part 2)

Lecture 4: LMN Learning (Part 2) CS 294-114 Fine-Grained Compleity and Algorithms Sept 8, 2015 Lecture 4: LMN Learning (Part 2) Instructor: Russell Impagliazzo Scribe: Preetum Nakkiran 1 Overview Continuing from last lecture, we will

More information

Analysis of Boolean Functions

Analysis of Boolean Functions Analysis of Boolean Functions Kavish Gandhi and Noah Golowich Mentor: Yufei Zhao 5th Annual MIT-PRIMES Conference Analysis of Boolean Functions, Ryan O Donnell May 16, 2015 1 Kavish Gandhi and Noah Golowich

More information

Learning DNF Expressions from Fourier Spectrum

Learning DNF Expressions from Fourier Spectrum Learning DNF Expressions from Fourier Spectrum Vitaly Feldman IBM Almaden Research Center vitaly@post.harvard.edu May 3, 2012 Abstract Since its introduction by Valiant in 1984, PAC learning of DNF expressions

More information

Notes for Lecture 15

Notes for Lecture 15 U.C. Berkeley CS278: Computational Complexity Handout N15 Professor Luca Trevisan 10/27/2004 Notes for Lecture 15 Notes written 12/07/04 Learning Decision Trees In these notes it will be convenient to

More information

Introduction and Preliminaries

Introduction and Preliminaries Chapter 1 Introduction and Preliminaries This chapter serves two purposes. The first purpose is to prepare the readers for the more systematic development in later chapters of methods of real analysis

More information

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 1: Quantum circuits and the abelian QFT

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 1: Quantum circuits and the abelian QFT Quantum algorithms (CO 78, Winter 008) Prof. Andrew Childs, University of Waterloo LECTURE : Quantum circuits and the abelian QFT This is a course on quantum algorithms. It is intended for graduate students

More information

Testing Lipschitz Functions on Hypergrid Domains

Testing Lipschitz Functions on Hypergrid Domains Testing Lipschitz Functions on Hypergrid Domains Pranjal Awasthi 1, Madhav Jha 2, Marco Molinaro 1, and Sofya Raskhodnikova 2 1 Carnegie Mellon University, USA, {pawasthi,molinaro}@cmu.edu. 2 Pennsylvania

More information

Continuity. Chapter 4

Continuity. Chapter 4 Chapter 4 Continuity Throughout this chapter D is a nonempty subset of the real numbers. We recall the definition of a function. Definition 4.1. A function from D into R, denoted f : D R, is a subset of

More information

Learning and Fourier Analysis

Learning and Fourier Analysis Learning and Fourier Analysis Grigory Yaroslavtsev http://grigory.us Slides at http://grigory.us/cis625/lecture2.pdf CIS 625: Computational Learning Theory Fourier Analysis and Learning Powerful tool for

More information

Polynomial regression under arbitrary product distributions

Polynomial regression under arbitrary product distributions Mach Learn (2010) 80: 273 294 DOI 10.1007/s10994-010-5179-6 Polynomial regression under arbitrary product distributions Eric Blais Ryan O Donnell Karl Wimmer Received: 15 March 2009 / Accepted: 1 November

More information

1 Review of The Learning Setting

1 Review of The Learning Setting COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #8 Scribe: Changyan Wang February 28, 208 Review of The Learning Setting Last class, we moved beyond the PAC model: in the PAC model we

More information

Stochastic Submodular Cover with Limited Adaptivity

Stochastic Submodular Cover with Limited Adaptivity Stochastic Submodular Cover with Limited Adaptivity Arpit Agarwal Sepehr Assadi Sanjeev Khanna Abstract In the submodular cover problem, we are given a non-negative monotone submodular function f over

More information

PAC Learning. prof. dr Arno Siebes. Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht

PAC Learning. prof. dr Arno Siebes. Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht PAC Learning prof. dr Arno Siebes Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht Recall: PAC Learning (Version 1) A hypothesis class H is PAC learnable

More information

The sum of d small-bias generators fools polynomials of degree d

The sum of d small-bias generators fools polynomials of degree d The sum of d small-bias generators fools polynomials of degree d Emanuele Viola April 9, 2008 Abstract We prove that the sum of d small-bias generators L : F s F n fools degree-d polynomials in n variables

More information

Computational Learning Theory

Computational Learning Theory CS 446 Machine Learning Fall 2016 OCT 11, 2016 Computational Learning Theory Professor: Dan Roth Scribe: Ben Zhou, C. Cervantes 1 PAC Learning We want to develop a theory to relate the probability of successful

More information

Introductory Analysis I Fall 2014 Homework #9 Due: Wednesday, November 19

Introductory Analysis I Fall 2014 Homework #9 Due: Wednesday, November 19 Introductory Analysis I Fall 204 Homework #9 Due: Wednesday, November 9 Here is an easy one, to serve as warmup Assume M is a compact metric space and N is a metric space Assume that f n : M N for each

More information

PROBABILISTIC ANALYSIS OF THE GENERALISED ASSIGNMENT PROBLEM

PROBABILISTIC ANALYSIS OF THE GENERALISED ASSIGNMENT PROBLEM PROBABILISTIC ANALYSIS OF THE GENERALISED ASSIGNMENT PROBLEM Martin Dyer School of Computer Studies, University of Leeds, Leeds, U.K. and Alan Frieze Department of Mathematics, Carnegie-Mellon University,

More information

A Regularity Lemma, and Low-weight Approximators, for Low-degree Polynomial Threshold Functions

A Regularity Lemma, and Low-weight Approximators, for Low-degree Polynomial Threshold Functions THEORY OF COMPUTING www.theoryofcomputing.org A Regularity Lemma, and Low-weight Approximators, for Low-degree Polynomial Threshold Functions Ilias Diakonikolas Rocco A. Servedio Li-Yang Tan Andrew Wan

More information

Online Learning, Mistake Bounds, Perceptron Algorithm

Online Learning, Mistake Bounds, Perceptron Algorithm Online Learning, Mistake Bounds, Perceptron Algorithm 1 Online Learning So far the focus of the course has been on batch learning, where algorithms are presented with a sample of training data, from which

More information

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory Part V 7 Introduction: What are measures and why measurable sets Lebesgue Integration Theory Definition 7. (Preliminary). A measure on a set is a function :2 [ ] such that. () = 2. If { } = is a finite

More information

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces. Math 350 Fall 2011 Notes about inner product spaces In this notes we state and prove some important properties of inner product spaces. First, recall the dot product on R n : if x, y R n, say x = (x 1,...,

More information

Review of Probability Theory

Review of Probability Theory Review of Probability Theory Arian Maleki and Tom Do Stanford University Probability theory is the study of uncertainty Through this class, we will be relying on concepts from probability theory for deriving

More information

Turing Machines, diagonalization, the halting problem, reducibility

Turing Machines, diagonalization, the halting problem, reducibility Notes on Computer Theory Last updated: September, 015 Turing Machines, diagonalization, the halting problem, reducibility 1 Turing Machines A Turing machine is a state machine, similar to the ones we have

More information

Polynomial time Prediction Strategy with almost Optimal Mistake Probability

Polynomial time Prediction Strategy with almost Optimal Mistake Probability Polynomial time Prediction Strategy with almost Optimal Mistake Probability Nader H. Bshouty Department of Computer Science Technion, 32000 Haifa, Israel bshouty@cs.technion.ac.il Abstract We give the

More information

Decoupling course outline Decoupling theory is a recent development in Fourier analysis with applications in partial differential equations and

Decoupling course outline Decoupling theory is a recent development in Fourier analysis with applications in partial differential equations and Decoupling course outline Decoupling theory is a recent development in Fourier analysis with applications in partial differential equations and analytic number theory. It studies the interference patterns

More information

12.1 A Polynomial Bound on the Sample Size m for PAC Learning

12.1 A Polynomial Bound on the Sample Size m for PAC Learning 67577 Intro. to Machine Learning Fall semester, 2008/9 Lecture 12: PAC III Lecturer: Amnon Shashua Scribe: Amnon Shashua 1 In this lecture will use the measure of VC dimension, which is a combinatorial

More information

NP Completeness and Approximation Algorithms

NP Completeness and Approximation Algorithms Chapter 10 NP Completeness and Approximation Algorithms Let C() be a class of problems defined by some property. We are interested in characterizing the hardest problems in the class, so that if we can

More information

Lecture 3 Small bias with respect to linear tests

Lecture 3 Small bias with respect to linear tests 03683170: Expanders, Pseudorandomness and Derandomization 3/04/16 Lecture 3 Small bias with respect to linear tests Amnon Ta-Shma and Dean Doron 1 The Fourier expansion 1.1 Over general domains Let G be

More information

CS446: Machine Learning Spring Problem Set 4

CS446: Machine Learning Spring Problem Set 4 CS446: Machine Learning Spring 2017 Problem Set 4 Handed Out: February 27 th, 2017 Due: March 11 th, 2017 Feel free to talk to other members of the class in doing the homework. I am more concerned that

More information

Lecture 3: Randomness in Computation

Lecture 3: Randomness in Computation Great Ideas in Theoretical Computer Science Summer 2013 Lecture 3: Randomness in Computation Lecturer: Kurt Mehlhorn & He Sun Randomness is one of basic resources and appears everywhere. In computer science,

More information

COS598D Lecture 3 Pseudorandom generators from one-way functions

COS598D Lecture 3 Pseudorandom generators from one-way functions COS598D Lecture 3 Pseudorandom generators from one-way functions Scribe: Moritz Hardt, Srdjan Krstic February 22, 2008 In this lecture we prove the existence of pseudorandom-generators assuming that oneway

More information

P, NP, NP-Complete, and NPhard

P, NP, NP-Complete, and NPhard P, NP, NP-Complete, and NPhard Problems Zhenjiang Li 21/09/2011 Outline Algorithm time complicity P and NP problems NP-Complete and NP-Hard problems Algorithm time complicity Outline What is this course

More information

SCALE INVARIANT FOURIER RESTRICTION TO A HYPERBOLIC SURFACE

SCALE INVARIANT FOURIER RESTRICTION TO A HYPERBOLIC SURFACE SCALE INVARIANT FOURIER RESTRICTION TO A HYPERBOLIC SURFACE BETSY STOVALL Abstract. This result sharpens the bilinear to linear deduction of Lee and Vargas for extension estimates on the hyperbolic paraboloid

More information

Math 328 Course Notes

Math 328 Course Notes Math 328 Course Notes Ian Robertson March 3, 2006 3 Properties of C[0, 1]: Sup-norm and Completeness In this chapter we are going to examine the vector space of all continuous functions defined on the

More information

CS 151 Complexity Theory Spring Solution Set 5

CS 151 Complexity Theory Spring Solution Set 5 CS 151 Complexity Theory Spring 2017 Solution Set 5 Posted: May 17 Chris Umans 1. We are given a Boolean circuit C on n variables x 1, x 2,..., x n with m, and gates. Our 3-CNF formula will have m auxiliary

More information

CS264: Beyond Worst-Case Analysis Lecture #11: LP Decoding

CS264: Beyond Worst-Case Analysis Lecture #11: LP Decoding CS264: Beyond Worst-Case Analysis Lecture #11: LP Decoding Tim Roughgarden October 29, 2014 1 Preamble This lecture covers our final subtopic within the exact and approximate recovery part of the course.

More information

Answering Many Queries with Differential Privacy

Answering Many Queries with Differential Privacy 6.889 New Developments in Cryptography May 6, 2011 Answering Many Queries with Differential Privacy Instructors: Shafi Goldwasser, Yael Kalai, Leo Reyzin, Boaz Barak, and Salil Vadhan Lecturer: Jonathan

More information

Hardness Results for Agnostically Learning Low-Degree Polynomial Threshold Functions

Hardness Results for Agnostically Learning Low-Degree Polynomial Threshold Functions Hardness Results for Agnostically Learning Low-Degree Polynomial Threshold Functions Ilias Diakonikolas Columbia University ilias@cs.columbia.edu Rocco A. Servedio Columbia University rocco@cs.columbia.edu

More information

Computational Learning Theory

Computational Learning Theory 1 Computational Learning Theory 2 Computational learning theory Introduction Is it possible to identify classes of learning problems that are inherently easy or difficult? Can we characterize the number

More information

LEBESGUE INTEGRATION. Introduction

LEBESGUE INTEGRATION. Introduction LEBESGUE INTEGATION EYE SJAMAA Supplementary notes Math 414, Spring 25 Introduction The following heuristic argument is at the basis of the denition of the Lebesgue integral. This argument will be imprecise,

More information

Testing Problems with Sub-Learning Sample Complexity

Testing Problems with Sub-Learning Sample Complexity Testing Problems with Sub-Learning Sample Complexity Michael Kearns AT&T Labs Research 180 Park Avenue Florham Park, NJ, 07932 mkearns@researchattcom Dana Ron Laboratory for Computer Science, MIT 545 Technology

More information

means is a subset of. So we say A B for sets A and B if x A we have x B holds. BY CONTRAST, a S means that a is a member of S.

means is a subset of. So we say A B for sets A and B if x A we have x B holds. BY CONTRAST, a S means that a is a member of S. 1 Notation For those unfamiliar, we have := means equal by definition, N := {0, 1,... } or {1, 2,... } depending on context. (i.e. N is the set or collection of counting numbers.) In addition, means for

More information

Randomized Complexity Classes; RP

Randomized Complexity Classes; RP Randomized Complexity Classes; RP Let N be a polynomial-time precise NTM that runs in time p(n) and has 2 nondeterministic choices at each step. N is a polynomial Monte Carlo Turing machine for a language

More information

The sample complexity of agnostic learning with deterministic labels

The sample complexity of agnostic learning with deterministic labels The sample complexity of agnostic learning with deterministic labels Shai Ben-David Cheriton School of Computer Science University of Waterloo Waterloo, ON, N2L 3G CANADA shai@uwaterloo.ca Ruth Urner College

More information

Testing for Concise Representations

Testing for Concise Representations Testing for Concise Representations Ilias Diakonikolas Columbia University ilias@cs.columbia.edu Ronitt Rubinfeld MIT ronitt@theory.csail.mit.edu Homin K. Lee Columbia University homin@cs.columbia.edu

More information

Topics in Theoretical Computer Science: An Algorithmist's Toolkit Fall 2007

Topics in Theoretical Computer Science: An Algorithmist's Toolkit Fall 2007 MIT OpenCourseWare http://ocw.mit.edu 18.409 Topics in Theoretical Computer Science: An Algorithmist's Toolkit Fall 2007 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Fourier analysis of boolean functions in quantum computation

Fourier analysis of boolean functions in quantum computation Fourier analysis of boolean functions in quantum computation Ashley Montanaro Centre for Quantum Information and Foundations, Department of Applied Mathematics and Theoretical Physics, University of Cambridge

More information

Learning and 1-bit Compressed Sensing under Asymmetric Noise

Learning and 1-bit Compressed Sensing under Asymmetric Noise JMLR: Workshop and Conference Proceedings vol 49:1 39, 2016 Learning and 1-bit Compressed Sensing under Asymmetric Noise Pranjal Awasthi Rutgers University Maria-Florina Balcan Nika Haghtalab Hongyang

More information

CS 446: Machine Learning Lecture 4, Part 2: On-Line Learning

CS 446: Machine Learning Lecture 4, Part 2: On-Line Learning CS 446: Machine Learning Lecture 4, Part 2: On-Line Learning 0.1 Linear Functions So far, we have been looking at Linear Functions { as a class of functions which can 1 if W1 X separate some data and not

More information

Spanning and Independence Properties of Finite Frames

Spanning and Independence Properties of Finite Frames Chapter 1 Spanning and Independence Properties of Finite Frames Peter G. Casazza and Darrin Speegle Abstract The fundamental notion of frame theory is redundancy. It is this property which makes frames

More information

Lecture Learning infinite hypothesis class via VC-dimension and Rademacher complexity;

Lecture Learning infinite hypothesis class via VC-dimension and Rademacher complexity; CSCI699: Topics in Learning and Game Theory Lecture 2 Lecturer: Ilias Diakonikolas Scribes: Li Han Today we will cover the following 2 topics: 1. Learning infinite hypothesis class via VC-dimension and

More information