THE CHOW PARAMETERS PROBLEM
RYAN O'DONNELL AND ROCCO A. SERVEDIO

Abstract. In the 2nd Annual FOCS (1961), Chao-Kong Chow proved that every Boolean threshold function is uniquely determined by its degree-0 and degree-1 Fourier coefficients. These numbers became known as the Chow Parameters. Providing an algorithmic version of Chow's Theorem (i.e., efficiently constructing a representation of a threshold function given its Chow Parameters) has remained open ever since. This problem has received significant study in the fields of circuit complexity, game theory and the design of voting systems, and learning theory. In this paper we effectively solve the problem, giving a randomized PTAS with the following behavior: Given the Chow Parameters of a Boolean threshold function f over n bits and any constant ε > 0, the algorithm runs in time O(n² log² n) and with high probability outputs a representation of a threshold function f' which is ε-close to f. Along the way we prove several new results of independent interest about Boolean threshold functions. In addition to various structural results, these include Õ(n²)-time learning algorithms for threshold functions under the uniform distribution in the following models: (i) the Restricted Focus of Attention model, answering an open question of Birkendorf et al.; (ii) an agnostic-type model (this contrasts with recent results of Guruswami and Raghavendra, who show NP-hardness for the problem under general distributions); (iii) the PAC model, with constant ε. Our Õ(n²)-time algorithm substantially improves on the previous best known running time and nearly matches the Ω(n²) bits of training data that any successful learning algorithm must use.

Key words. Chow Parameters, threshold functions, approximation, learning theory

AMS subject classifications. 94C10, 06E30, 68Q32, 68R99, 91B12, 91B14, 42C10

1. Introduction. This paper is concerned with Boolean threshold functions:

Definition 1.1.
A Boolean function f : {-1,1}ⁿ → {-1,1} is a threshold function if it is expressible as f(x) = sgn(w_0 + w_1 x_1 + ... + w_n x_n) for some real numbers w_0, w_1, ..., w_n.

Boolean threshold functions are of fundamental interest in circuit complexity, game theory/voting theory, and learning theory. Early computer scientists studying switching functions (i.e., Boolean functions) spent an enormous amount of effort on the class of threshold functions; see for instance the books [10, 26, 36, 48, 38] on this topic. More recently, researchers in circuit complexity have worked to understand the computational power of threshold functions and shallow circuits with these functions as gates; see e.g. [21, 45, 24, 25, 22]. In game theory and social choice theory, where simple cooperative games [42] correspond to monotone Boolean functions, threshold functions (with nonnegative weights) are known as weighted majority games and have been extensively studied as models for voting; see e.g. [43, 27, 11, 54]. Finally, in various guises, the problem of learning an unknown threshold function ("halfspace") has arguably been the central problem in machine learning for much of the last two decades, with algorithms such as Perceptron, Weighted Majority, boosting, and support vector machines emerging as central tools in the field.

Author affiliations: Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, Pennsylvania (odonnell@cs.cmu.edu); supported in part by an NSF CCF award, a CyLab Research Grant, an Okawa Foundation Grant, and a Sloan Foundation Fellowship. Columbia University, 1214 Amsterdam Avenue, New York, New York (rocco@cs.columbia.edu); supported in part by two NSF CCF awards and a Sloan Foundation Fellowship.

A beautiful result of C.-K. Chow from the 2nd FOCS conference [9] gives a surprising characterization of Boolean threshold functions: among all Boolean functions,
each threshold function f : {-1,1}ⁿ → {-1,1} is uniquely determined by the "center of mass" of its positive inputs, avg{x ∈ {-1,1}ⁿ : f(x) = 1}, and the number of positive inputs #{x : f(x) = 1}. These n + 1 parameters of f are equivalent, after scaling and additive shifting, to its degree-0 and degree-1 Fourier coefficients (and also, essentially, to its "influences" or "Banzhaf power indices"). We give a formal definition:

Definition 1.2. Given any Boolean function f : {-1,1}ⁿ → {-1,1}, its Chow Parameters¹ are the rational numbers f̂(0), f̂(1), ..., f̂(n) defined by f̂(0) = E[f(x)] and f̂(i) = E[f(x) x_i] for 1 ≤ i ≤ n. We also say the Chow Vector of f is χ_f = (f̂(0), f̂(1), ..., f̂(n)).

Throughout this paper the notation E[·] and Pr[·] refers to an x ∈ {-1,1}ⁿ chosen uniformly at random. (We note that this corresponds to the "Impartial Culture Assumption" in the theory of social choice [19].) Our notation slightly abuses the standard Fourier coefficient notation of f̂(∅) and f̂({i}).

Chow's Theorem implies that the following algorithmic problem is in principle solvable:

The Chow Parameters Problem. Given the Chow Parameters f̂(0), f̂(1), ..., f̂(n) of a Boolean threshold function f, output a representation of f as f(x) = sgn(w_0 + w_1 x_1 + ... + w_n x_n).

Unfortunately, the proof of Chow's Theorem (reviewed in Section 2.3) is completely nonconstructive and does not suggest any algorithm, much less an efficient one. As we now briefly describe, over the past five decades the Chow Parameters problem has been considered by researchers in a range of different fields.

1.1. Background on the Chow Parameters problem. As far back as 1960, researchers studying Boolean functions were interested in finding an efficient algorithm for the Chow Parameters problem [14]. Electrical engineers at the time faced the following problem: given an explicit truth table, determine whether it can be realized as a threshold circuit, and if so, which one.
The Chow Parameters are easily computed from a truth table, and Chow's Theorem implies that they give a unique representation for every threshold function. Several heuristics were proposed for the Chow Parameters problem [30, 56, 29, 10], an empirical study was performed to compare various methods [58], and lookup tables were produced mapping Chow Vectors to weights-based representations for every threshold function on six [39], seven [57], and eight [41] bits. Winder provides a good early survey [59]. Generalizations of Chow's Theorem were later given in [7, 46].

Researchers in game theory have also considered the Chow Parameters problem; Chow's Theorem was independently rediscovered by the game theorist Lapidot [34] and subsequently studied in [11, 13, 54, 18]. In the realm of social choice and voting theory the Chow Parameters represent the Banzhaf power indices [43, 2] of the n voters, a measure of each one's influence over the outcome. Here the Chow Parameters problem is very natural: consider designing a voting rule for, say, the European Union. Target Banzhaf power indices are given, usually in proportion to the square root of the states' populations, and one wishes to come up with a weighted majority voting rule whose power indices are as close to the targets as possible. Researchers in voting theory have recently devoted significant attention to this problem [35, 8], calling it a "fundamental constitutional problem" [16] and in particular considering its computational complexity [51, 1].

¹Chow's Theorem was proven simultaneously by Tannenbaum [53], but the terminology "Chow Parameters" has stuck.
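To make concrete how the Chow Parameters of Definition 1.2 arise from a truth table, here is a minimal brute-force sketch (the helper names `chow_vector`, `sgn`, and `maj3` are ours, not the paper's; the enumeration is exponential in n, so this is only an illustration for small n):

```python
from itertools import product

def chow_vector(f, n):
    """Brute-force Chow Vector (f^(0), f^(1), ..., f^(n)) of f: {-1,1}^n -> {-1,1},
    i.e. E[f(x)] and E[f(x) x_i] under the uniform distribution (Definition 1.2)."""
    points = list(product([-1, 1], repeat=n))
    chow = [0.0] * (n + 1)
    for x in points:
        fx = f(x)
        chow[0] += fx                 # degree-0 coefficient E[f(x)]
        for i in range(n):
            chow[i + 1] += fx * x[i]  # degree-1 coefficient E[f(x) x_i]
    return [c / len(points) for c in chow]

def sgn(t):
    return 1 if t >= 0 else -1

# The majority threshold function on 3 bits: sgn(x_1 + x_2 + x_3).
maj3 = lambda x: sgn(sum(x))
print(chow_vector(maj3, 3))  # [0.0, 0.5, 0.5, 0.5]
```

For majority on 3 bits the vector is (0, 1/2, 1/2, 1/2): the function is unbiased, and each coordinate has the same degree-1 coefficient by symmetry.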
The Chow Parameters problem also has motivation from learning theory. Ben-David and Dichterman [3] introduced the "Restricted Focus of Attention" (RFA) model to formalize the idea that learning algorithms often have only partial access to each example vector. Birkendorf et al. [5] performed a comprehensive study of the RFA model and observed that the approximation version of the Chow Parameters problem (given approximate Chow Parameters, output an approximating threshold function) is equivalent to the problem of efficiently learning threshold functions under the uniform distribution in the 1-RFA model. (In the 1-RFA model the learner is only allowed to see one bit of each example string in addition to the label; we give details in Section 10.) As the main open question posed in [5], Birkendorf et al. asked whether there is an efficient uniform-distribution learning algorithm for threshold functions in the 1-RFA model. This question motivated subsequent research [20, 47] which gave information-theoretic sample complexity upper bounds for this learning problem (see Section 3); however, no computationally efficient algorithm was previously known.

To summarize, we believe that the range of different contexts in which the Chow Parameters Problem has arisen is evidence of its fundamental status.

1.2. The Chow Parameters problem reframed as an approximation problem. It is unlikely that the Chow Parameters Problem can be solved exactly in polynomial time: note that even checking the correctness of a candidate solution is #P-complete, because computing f̂(0) is equivalent to counting 0-1 knapsack solutions. Thus, as is implicitly proposed in [5, 1], it is natural to look for a polynomial-time approximation scheme (PTAS). Here we mean an approximation in the following sense:

Definition 1.3. The distance between two Boolean functions f, g : {-1,1}ⁿ → {-1,1} is dist(f, g) := Pr[f(x) ≠ g(x)]. If dist(f, g) ≤ ε we say that f and g are ε-close.
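The distance of Definition 1.3 can be computed exhaustively for small n; a minimal sketch (helper names `dist`, `sgn`, `maj3`, `dictator` are ours, for illustration only):

```python
from itertools import product

def dist(f, g, n):
    """dist(f, g) = Pr[f(x) != g(x)] over uniform x in {-1,1}^n (Definition 1.3),
    computed by exhaustive enumeration, so only practical for small n."""
    points = list(product([-1, 1], repeat=n))
    return sum(f(x) != g(x) for x in points) / len(points)

def sgn(t):
    return 1 if t >= 0 else -1

maj3 = lambda x: sgn(x[0] + x[1] + x[2])  # majority on 3 bits
dictator = lambda x: x[0]                 # the threshold function sgn(x_1)
print(dist(maj3, dictator, 3))            # disagree on 2 of the 8 inputs: 0.25
```

Majority and the first dictator disagree exactly on (1, -1, -1) and (-1, 1, 1), so they are 1/4-close but not ε-close for any ε < 1/4.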
We would like a PTAS which, given a value ε and the Chow Parameters of f, outputs a (representation of a) threshold function f' that is ε-close to f. With this relaxed goal of approximating f, one may even tolerate only an approximation of the Chow Parameters of f; this gives us the variant of the problem that Birkendorf et al. considered. (Note that, as we discuss in Section 3, it is in no way obvious that approximate Chow Parameters even information-theoretically specify an approximator to f.) In particular the following notion of approximate Chow Parameters proves to be most natural:

Definition 1.4. Let f, g : {-1,1}ⁿ → {-1,1}. We define

d_Chow(f, g) := √( Σ_{j=0}^{n} (f̂(j) − ĝ(j))² )

to be the Chow Distance between f and g.

1.3. Our results. Our main result is an efficient PTAS A for the Chow Parameters problem which succeeds given approximations to the Chow Parameters. We prove:

Main Theorem. There is a function κ(ε) = 2^{−Õ(1/ε²)} such that the following holds: Let f : {-1,1}ⁿ → {-1,1} be a threshold function and let 0 < ε < 1/2. Write χ for the Chow Vector of f and assume that α is a vector satisfying ‖α − χ‖ ≤ κ(ε). Then given as input α and ε, the algorithm A performs 2^{poly(1/κ(ε))} · n² · log n · log(n/δ) bit operations and outputs the (weights-based) representation of a threshold function f' which with probability at least 1 − δ satisfies dist(f, f') ≤ ε.

Although the running time dependence on ε is doubly exponential, we emphasize that the polynomial dependence on n is quadratic, independent of ε; i.e., A is an
EPTAS. Some of our learning applications have only singly-exponential dependence on ε.

1.4. Our approach. We briefly describe the two main ingredients of our approach and explain how we combine them to obtain the efficient algorithm A.

First ingredient: small Chow Distance from a threshold function implies small distance. An immediate question that arises when thinking about the Chow Parameters problem is how to recognize whether a candidate solution is a good one. If we are given the Chow Vector χ_f of an unknown threshold function f and we have a candidate threshold function g, we can approximate the Chow Vector χ_g of g by sampling. The following proposition is easily proved via Fourier analysis in Section 2.3:

Proposition 1.5. d_Chow(f, g) ≤ 2·√(dist(f, g)).

This means that if d_Chow(f, g) is large then f and g are far apart. But if d_Chow(f, g) is small, does this necessarily mean that f and g are close? This question has been studied in the learning theory community, in [5] (for threshold functions with small integer weights), [20], and [47]. In Section 3 we show that the answer is yes by proving the following "robust" version of Chow's Theorem:

Theorem 1.6. Let f : {-1,1}ⁿ → {-1,1} be any threshold function and let g : {-1,1}ⁿ → {-1,1} be any Boolean function such that d_Chow(f, g) ≤ ε. Then dist(f, g) ≤ Õ(1/√(log(1/ε))).

This is the first result of this nature that is completely independent of n. A key ingredient in the proof of Theorem 1.6 is a new result showing that every threshold function f is extremely close to a threshold function f' for which only a very small fraction of points have small "margin" (see Section 6 for a precise statement). We feel that this and Theorem 1.6 have independent interest as structural results about threshold functions.

Second ingredient: using the Chow Parameters as weights. The second ingredient in our approach is to establish a result, Theorem 7.1, having the following corollary:

Corollary 7.2.
There is an absolute constant C > 0 such that the following holds. Let f(x) = sgn(w_0 + w_1 x_1 + ... + w_n x_n) be any threshold function, and let H be the set of 1/ε^C indices i for which |w_i| is largest.² Then there exists a threshold function f'(x) = sgn(v_0 + v_1 x_1 + ... + v_n x_n) with dist(f, f') ≤ ε in which the weights v_i for i ∈ [n] \ H are the Chow Parameters f̂(i) themselves.

The heuristic of using the Chow Parameters as possible weights was considered by several researchers in the early 1960s (see [59]); however, no theorem on the efficacy of this approach was previously known. Our proof of Theorem 7.1 and its robust version Theorem 7.4 rely in part on recent work of Matulef et al. on property testing for threshold functions [37].

The algorithm and intuitive explanation. Given these two ingredients, our PTAS A for the approximate Chow Parameters problem works by constructing a small (depending only on ε) number of candidate threshold functions. It enumerates all (in some sense) possible weight settings for the indices in H, and for each one produces a candidate threshold function by setting the remaining weights equal to the given Chow Parameters. The second ingredient tells us that at least one of these candidate threshold functions must be close to the unknown threshold function f, and thus

²As we discuss at the beginning of Section 7, for any threshold function f the value |f̂(i)| is equal to Inf_i(f), the influence of the i-th variable on f. It is well known and easy to show (see e.g. Lemma 7 of [17]) that for a threshold function f(x) = sgn(w_0 + w_1 x_1 + ... + w_n x_n), if Inf_i(f) > Inf_j(f) then |w_i| > |w_j|. So we may equivalently view H as the set of 1/ε^C indices i for which |f̂(i)| is largest.
must have small Chow Distance to f, by Proposition 1.5. Now the first ingredient tells us that any threshold function whose Chow Distance to the target Chow Vector is small must itself be close to the target. So the algorithm can estimate each of the candidates' Chow Vectors (this takes Õ(n²) time) and output any candidate whose Chow Distance to the target vector is small.

1.5. Consequences in learning theory. As we show in Section 10, our approach yields a range of new algorithmic results in learning theory. Our Main Theorem directly gives the first poly(n)-time algorithm for learning threshold functions in the uniform distribution 1-RFA model, answering the question of [5]:

Theorem 1.7. There is an algorithm which performs 2^{2^{Õ(1/ε²)}} · n² · log n · log(n/δ) bit operations and properly learns threshold functions to accuracy ε and confidence 1 − δ in the uniform distribution 1-RFA model.

A variant of our algorithm gives a very fast agnostic-type learning algorithm for threshold functions (equivalently, an algorithm for learning Boolean threshold functions from uniformly distributed examples when there is adversarial noise in the labels):

Theorem 1.8. Let g be any Boolean function and let opt = min_f Pr[f(x) ≠ g(x)], where the min is over all threshold functions and the probability is uniform over {-1,1}ⁿ. Given an input parameter ε > 0 and access to independent uniform examples (x, g(x)), algorithm B outputs the (weights-based) representation of a threshold function f' which with probability at least 1 − δ satisfies Pr[f'(x) ≠ g(x)] ≤ O(opt^{Ω(1)}) + ε. The algorithm performs poly(1/ε) · n² · log(n/δ) + 2^{poly(1/ε)} · n · log n · log(1/δ) bit operations.

For example, if opt = 1/log(n), our algorithm takes time O(n² log n log(n/δ)) and outputs a hypothesis with accuracy 1 − 1/log^{Ω(1)}(n). Theorem 1.8 is in interesting contrast with the algorithm of Kalai et al.
[28], which constructs an (opt + ε)-accurate hypothesis but runs in n^{poly(1/ε)} time (and does not output a threshold function). As we discuss in Section 10, recent hardness results of Guruswami and Raghavendra [23] imply that if P ≠ NP there can be no algorithm comparable to ours for learning under arbitrary (as opposed to uniform) distributions over {-1,1}ⁿ.

Finally, as a corollary of Theorem 1.8, we obtain a uniform-distribution PAC learning algorithm for threshold functions that runs in time Õ(n²) for learning to constant accuracy ε = Θ(1). The fastest previous algorithm we are aware of for learning arbitrary threshold functions in this model (linear programming, using Vaidya [55]) runs in Õ(n^{4.5}) · poly(1/ε) time. Thus our algorithm is significantly faster for learning to accuracy ε = Θ(1), and in fact is faster as long as ε < 1/(log n)^c for a sufficiently small constant c > 0. As we explain later, our time bound is very close to the Ω(n²) bits of input that any learning algorithm must use.

2. Preliminaries.

2.1. Fourier analysis. This paper extensively uses the basics of Fourier analysis over the Boolean cube {-1,1}ⁿ. We give a brief review. We consider functions f : {-1,1}ⁿ → R (though we often focus on Boolean-valued functions which map to {-1,1}), and we think of the inputs x to f as being distributed according to the uniform probability distribution. The set of such functions forms a 2ⁿ-dimensional inner product space with inner product given by ⟨f, g⟩ = E_x[f(x)g(x)]. The set of functions (χ_S)_{S⊆[n]} defined by χ_S(x) = Π_{i∈S} x_i forms a complete orthonormal basis for this space. We will also often write simply x_S for Π_{i∈S} x_i. Given a function
f : {-1,1}ⁿ → R, we define its Fourier coefficients by f̂(S) = E_x[f(x) x_S], and we have that f(x) = Σ_S f̂(S) x_S. As an easy consequence of orthonormality we have Plancherel's identity ⟨f, g⟩ = Σ_S f̂(S) ĝ(S), which has as a special case Parseval's identity, E_x[f(x)²] = Σ_S f̂(S)². From this it follows that for every f : {-1,1}ⁿ → {-1,1} we have Σ_S f̂(S)² = 1.

The following definitions are fairly standard in the analysis of Boolean functions:

Definition 2.1. A function f : {-1,1}ⁿ → {-1,1} is said to be a junta on J ⊆ [n] if f only depends on the coordinates in J. Typically we think of J as a small set in this case.

Definition 2.2. We say that f : {-1,1}ⁿ → R is τ-regular if |f̂(i)| ≤ τ for all i ∈ [n].

The following simple lemma is implicit in [37]; we state and prove it explicitly here for completeness.

Lemma 2.3. Let f : {-1,1}ⁿ → {-1,1} be a Boolean threshold function and let J ⊆ [n] be any subset of coordinates. If f is τ-close to a junta on J, then f is τ-close to a junta on J which is itself a Boolean threshold function.

Proof. We assume without loss of generality that J is the set {1, ..., r}. It is clear that the junta over {-1,1}^r to which f is closest is the function g(x_1, ..., x_r) that maps each input (x_1, ..., x_r) to the more commonly occurring value of the restricted function f_{x_1,...,x_r} (a function of the variables x_{r+1}, ..., x_n). But for f(x) = sgn(w_0 + w_1 x_1 + ... + w_n x_n) this more common value will be sgn(w_0 + w_1 x_1 + ... + w_r x_r), because for uniform (x_{r+1}, ..., x_n) ∈ {-1,1}^{n−r} the random variable w_{r+1} x_{r+1} + ... + w_n x_n is centered around zero.

We will also require the following lemma, which gives a lower bound on the degree-1 Fourier weight of any threshold function in terms of its bias:

Lemma 2.4. Let f : {-1,1}ⁿ → {-1,1} be a Boolean threshold function and suppose that 1 − E[f] = p. Then

Σ_{i=1}^{n} f̂(i)² ≥ p²/2.

Before giving the proof let us contrast this lemma with some known results.
Proposition 2.2 of Talagrand [52] gives a general upper bound Σ_{i=1}^{n} f̂(i)² ≤ O(p² log(1/p)) for any Boolean function satisfying 1 − E[f] = p. In [37] it is shown that a slightly stronger bound Θ(p² log(1/p)) holds for threshold functions f that are sufficiently τ-regular. However, when we use Lemma 2.4 we will not have regularity (and even if we did, the extra log factor would not end up improving any of our bounds).

Proof. Write f(x) = sgn(w_0 + w_1 x_1 + ... + w_n x_n), where we assume without loss of generality that Σ_{j=1}^{n} w_j² = 1 and that w_0 + w_1 x_1 + ... + w_n x_n ≠ 0 for all x ∈ {-1,1}ⁿ. We have

E[f(x)(w·x)] = Σ_{i=1}^{n} f̂(i) w_i ≤ √( Σ_{i=1}^{n} f̂(i)² ),

where the equality is Plancherel's identity and the inequality is Cauchy-Schwarz. On the other hand, using the definition of f we obtain

E[f(x)(w·x)] = E[ 1_{|w·x| ≥ |w_0|} · |w·x| ] = p · E[ |w·x| : |w·x| ≥ |w_0| ].
The first equality above holds because each x such that |w·x| < |w_0| can be paired with −x; the value of f is the same on these two inputs, so their contributions to the expectation cancel each other out. The second equality above is a routine renormalization using the equality 1 − E[f] = p. We now recall the Khintchine inequality with best constant [50], which says that for any w ∈ Rⁿ we have E[|w·x|] ≥ (1/√2)·‖w‖. Since ‖w‖ = 1 in our setting, we get E[|w·x|] ≥ 1/√2, so surely E[|w·x| : |w·x| ≥ |w_0|] ≥ 1/√2. Thus combining all statements yields √(Σ_{i=1}^{n} f̂(i)²) ≥ p/√2, completing the proof.

2.2. Mathematical tools. We use the following simple estimate on several occasions:

Fact 2.5. Suppose A and B are nonnegative and |A − B| ≤ η. Then |√A − √B| ≤ η/√B.

Proof. |√A − √B| = |A − B| / (√A + √B) ≤ η/√B.

We also will need some results from probability theory:

Definition 2.6. We write Φ for the c.d.f. (cumulative distribution function) of a standard mean-0, variance-1 Gaussian random variable. We extend the notation by writing Φ[a, b] to denote |Φ(b) − Φ(a)|, allowing b < a. Finally, we will use the estimate Φ[a, b] ≤ |b − a| without comment.

The Berry-Esseen theorem is a version of the Central Limit Theorem with explicit error bounds:

Theorem 2.7. (Berry-Esseen) Let X_1, ..., X_n be a sequence of independent random variables satisfying E[X_i] = 0 for all i, Σ_i E[X_i²] = σ², and Σ_i E[|X_i|³] = ρ_3. Let S = (X_1 + ... + X_n)/σ and let F denote the c.d.f. of S. Then

sup_x |F(x) − Φ(x)| ≤ C·ρ_3/σ³,

where Φ is the c.d.f. of a standard Gaussian random variable, and C is a universal constant. An explicit small value of C is known [49].

Corollary 2.8. Let x_1, ..., x_m denote independent ±1 random bits and let w_1, ..., w_m ∈ R. Write σ = √(Σ w_i²), and assume |w_i|/σ ≤ τ for all i. Then for any interval [a, b] ⊆ R,

| Pr[a ≤ w_1 x_1 + ... + w_m x_m ≤ b] − Φ[a/σ, b/σ] | ≤ 2τ.

In particular, Pr[a ≤ w_1 x_1 + ... + w_m x_m ≤ b] ≤ |b − a|/σ + 2τ.

2.3. Margins, and Chow's Theorem. Having introduced Fourier analysis, we recall and prove Proposition 1.5:

Proposition 1.5. d_Chow(f, g) ≤ 2·√(dist(f, g)).
Proof. For f, g : {-1,1}ⁿ → {-1,1} we have

dist(f, g) = (1/4)·E[(f(x) − g(x))²] = (1/4)·Σ_{S⊆[n]} (f̂(S) − ĝ(S))² ≥ (1/4)·Σ_{j=0}^{n} (f̂(j) − ĝ(j))² = (1/4)·d_Chow(f, g)²,

where the second equality is Parseval's identity.

Let us introduce a notion of "margin" for threshold functions:

Definition 2.9. Let f : {-1,1}ⁿ → {-1,1} be a Boolean threshold function, f(x) = sgn(w_0 + w_1 x_1 + ... + w_n x_n), where the weights are scaled so that Σ_{j≥0} w_j² = 1. Given a particular input x ∈ {-1,1}ⁿ we define marg(f, x) = |w_0 + w_1 x_1 + ... + w_n x_n|.³

Remark 2.10. The usual notion of margin from learning theory also involves scaling the data points x so that ‖x‖ ≤ 1 for all x. Thus we have that the learning-theoretic margin of f on x is marg(f, x)/√n.

We now present a proof of Chow's theorem from 1961:

Theorem 2.11. (Chow.) Let f : {-1,1}ⁿ → {-1,1} be a Boolean threshold function and let g : {-1,1}ⁿ → {-1,1} be a Boolean function such that ĝ(j) = f̂(j) for all 0 ≤ j ≤ n. Then g = f.

Note that another way of phrasing this is: if f is a Boolean threshold function, g is a Boolean function, and d_Chow(f, g) = 0, then dist(f, g) = 0. Our Theorem 1.6 gives a robust version of this statement.

Proof. Write f(x) = sgn(w_0 + w_1 x_1 + ... + w_n x_n), where the weights are scaled so that Σ_{j=0}^{n} w_j² = 1. We may assume without loss of generality that marg(f, x) ≠ 0 for all x. (Otherwise, first perturb the weights slightly without changing f.) Now we have

0 = Σ_{j=0}^{n} w_j (f̂(j) − ĝ(j)) = E[(w_0 + w_1 x_1 + ... + w_n x_n)(f(x) − g(x))] = E[ 1_{f(x)≠g(x)} · 2·marg(f, x) ].

The first equality is by the assumption that f̂(j) = ĝ(j) for all 0 ≤ j ≤ n, the second equality is linearity of expectation (or Plancherel's identity), and the third equality uses the fact that f(x) = sgn(w_0 + w_1 x_1 + ... + w_n x_n). But since marg(f, x) is always strictly positive, we must have Pr[f(x) ≠ g(x)] = 0, as claimed.

3. First ingredient: small Chow Distance implies small distance. Our main result in this section is the following.

Theorem 1.6 Restated.
Let f : {-1,1}ⁿ → {-1,1} be any threshold function and let g : {-1,1}ⁿ → {-1,1} be any Boolean function such that d_Chow(f, g) ≤ ε. Then dist(f, g) ≤ Õ(1/√(log(1/ε))).⁴

Let us compare this with some recent results with a similar qualitative flavor. The main result of Goldberg [20] is a proof that for any threshold function

³This notation is slightly informal, as it doesn't show the dependence on the representation of f.
⁴For a quantity q < 1, the notation Õ(q) means O(q · log^c(1/q)) for some absolute constant c.
f and any Boolean function g, if |f̂(j) − ĝ(j)| ≤ (ε/n)^{O(log(n/ε) log(1/ε))} for all 0 ≤ j ≤ n, then dist(f, g) ≤ ε. Note that the condition of Goldberg's theorem requires that d_Chow(f, g) ≤ n^{−O(log n)}. Subsequently Servedio [47] showed that to obtain dist(f, g) ≤ ε it suffices to have |f̂(j) − ĝ(j)| ≤ 1/(2^{Õ(1/ε²)} · n) for all 0 ≤ j ≤ n. This is a worse requirement in terms of ε but a better one in terms of n; however, it still requires that d_Chow(f, g) ≤ 1/√n. In contrast, Theorem 1.6 allows the Chow Distance between f and g to be an absolute constant independent of n. This independence of n will be crucial later on when we use Theorem 1.6 to obtain a computationally efficient algorithm for the Chow Parameters problem.

At a high level, we prove Theorem 1.6 by giving a robust version of the proof of Chow's Theorem (Theorem 2.11). A first obvious approach to making the argument robust is to try to show that every threshold function has margin Ω(1) (independent of n) on every x. However, this is well known to be badly false. A next attempt might be to show that every threshold function has a representation with margin Ω(1) on almost every x. This too turns out to be impossible (cf. our discussion after the statement of Lemma 5.1 below). The key to getting an n-independent margin lower bound is to also very slightly alter the threshold function. Specifically, the next few sections of the paper will be devoted to the proof of the following:

Theorem 3.1. Let f : {-1,1}ⁿ → {-1,1} be any threshold function and let ρ > 0 be sufficiently small. Then there is a threshold function f' : {-1,1}ⁿ → {-1,1} with dist(f, f') ≤ 2^{−1/ρ} satisfying

Pr_x[ marg(f', x) ≤ ρ ] ≤ Õ(1/√(log(1/ρ))).

In other words, any threshold function f is very close to another threshold function f' satisfying marg(f', x) ≥ Ω(1) for almost all x.
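To make the margin of Definition 2.9, the quantity bounded in Theorem 3.1, concrete, here is a small exhaustive computation of the margin distribution of the majority function on 5 bits (the helper name `margin_distribution` is ours; the normalized weights of majority are w_i = 1/√5, so marg takes only the values 1/√5, 3/√5, and √5):

```python
from itertools import product
from math import sqrt

def margin_distribution(w0, w, rho):
    """Fraction of inputs x in {-1,1}^n with marg(f, x) <= rho, where
    f(x) = sgn(w0 + w.x) and the weights are rescaled so that
    w0^2 + sum_i w_i^2 = 1 (Definition 2.9).  Exhaustive, so small n only."""
    n = len(w)
    scale = sqrt(w0 ** 2 + sum(wi ** 2 for wi in w))
    points = list(product([-1, 1], repeat=n))
    small = sum(abs(w0 + sum(wi * xi for wi, xi in zip(w, x))) / scale <= rho
                for x in points)
    return small / len(points)

# Majority on 5 bits: inputs with |x_1+...+x_5| = 1 have margin 1/sqrt(5) ~ 0.447.
print(margin_distribution(0.0, [1.0] * 5, 0.5))  # 20 of the 32 inputs: 0.625
```

So for majority a constant fraction of inputs has small margin; Theorem 3.1 says that after a tiny modification of f, only a Õ(1/√(log(1/ρ)))-fraction of inputs can have margin below ρ.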
We remark that although the fraction of points failing the margin bound could be as large as inverse-logarithmic in ρ, we only have to change f on a fraction of points which is exponentially small in 1/ρ to achieve this. Theorem 3.1 is the key structural result for threshold functions that allows us to "robustify" the proof of Theorem 2.11. We will now show how Theorem 1.6 follows from Theorem 3.1.

Proof. (Theorem 1.6.) Given f, apply Theorem 3.1 with its parameter ρ set (with foresight) to ρ = √ε · log(1/ε). This yields a threshold function f'(x) = sgn(u_0 + u_1 x_1 + ... + u_n x_n), with Σ_{j=0}^{n} u_j² = 1, satisfying dist(f, f') ≤ 2^{−1/ρ} ≤ ε and

Pr_x[ marg(f', x) ≤ ρ ] ≤ τ := Õ(1/√(log(1/ρ))) = polyloglog(1/ε) / √(log(1/ε)).    (3.1)

Since dist(f, f') ≤ ε, by Proposition 1.5 we have d_Chow(f, f') ≤ 2√ε, and thus d_Chow(f', g) ≤ 3√ε by the triangle inequality. We now follow the proof of Chow's
Theorem 2.11:

3√ε ≥ d_Chow(f', g) = √( Σ_{j=0}^{n} u_j² ) · √( Σ_{j=0}^{n} (f̂'(j) − ĝ(j))² ) ≥ Σ_{j=0}^{n} u_j (f̂'(j) − ĝ(j)) = E[ 1_{f'(x)≠g(x)} · 2·marg(f', x) ],    (3.2)

where the second inequality is Cauchy-Schwarz. Now suppose that Pr[f'(x) ≠ g(x)] ≥ 2τ. Then by (3.1) we must have that for at least a τ fraction of x's, both f'(x) ≠ g(x) and marg(f', x) > ρ. This gives a contribution exceeding τ·ρ to (3.2). But τρ = √ε · √(log(1/ε)) · polyloglog(1/ε) > 3√ε, a contradiction. Thus dist(f', g) ≤ 2τ, and so dist(f, g) ≤ dist(f, f') + dist(f', g) ≤ ε + 2τ = Õ(1/√(log(1/ε))).

4. The critical index and anticoncentration. Fix a representation f(x) = sgn(w_0 + w_1 x_1 + ... + w_n x_n) of a threshold function. Throughout this section we adopt the convention that w_1 ≥ ... ≥ w_n > 0 (this will be without loss of generality, by permuting indices). The notion of the "critical index" of the sequence of weights w_1, ..., w_n will be useful for us. Roughly speaking, it allows us to approximately decompose any linear form w_0 + w_1 x_1 + ... + w_n x_n over random ±1 x_i's into a short dominant "head", w_0 + w_1 x_1 + ... + w_small x_small, and a long remaining "tail" which acts like a Gaussian random variable. The τ-critical index of w_1, ..., w_n is essentially the least index ℓ for which the random variable w_ℓ x_ℓ + ... + w_n x_n behaves like a Gaussian up to error τ. The notion of a critical index was (implicitly) introduced and used in [47].

Towards proving a margin lower bound such as Theorem 3.1 for f, we need to show some kind of "anticoncentration" for the random variable w_0 + w_1 x_1 + ... + w_n x_n; we want it to rarely be near 0. Let us describe intuitively how analyzing the critical index helps us show this. If the critical index of w_1, ..., w_n is large, then it must be the case that the initial weights w_1, w_2, ... up to the critical index are rapidly decreasing (roughly speaking, if the weights w_i, w_{i+1}, ... stayed about the same for a long stretch, this would cause w_i x_i + ... + w_n x_n to behave like a Gaussian).
This rapid decrease can in turn be shown to imply that the head part w_0 + w_1 x_1 + ... + w_small x_small is not too concentrated around any particular value; see Theorem 4.2 below. On the other hand, if the critical index ℓ is small, then the random variable w_ℓ x_ℓ + ... + w_n x_n behaves like a Gaussian. Since Gaussians have good anticoncentration, the overall linear form w_0 + w_1 x_1 + ... + w_n x_n will have good anticoncentration, regardless of the head part's value. We need to alter f slightly to make these two cases go through, but having done so, we are able to bound the fraction of inputs x for which marg(f, x) is very small, leading to Theorem 3.1.

We now give precise definitions. For 1 ≤ k ≤ n we write σ_k to denote the 2-norm of the tail weights starting from k; i.e., σ_k := √( Σ_{i≥k} w_i² ).

Definition 4.1. Fix a parameter 0 < τ < 1/2. We define the τ-critical index of the weight vector w to be the least index ℓ such that w_ℓ is small relative to σ_ℓ in the
following sense:

w_ℓ / σ_ℓ ≤ τ.    (4.1)

(If no index 1 ≤ ℓ ≤ n satisfies (4.1), as is the case for (1/2, 1/4, 1/8, ..., 1/2ⁿ) for example, then we say that the τ-critical index is +∞.)

The connection between Equation (4.1) and "behaving like a Gaussian up to error τ" is given by the Berry-Esseen Theorem, stated in Section 2.2. The following anticoncentration result shows that if the critical index is large, then the random variable w_1 x_1 + ... + w_n x_n does not put much probability mass close to any particular value:

Theorem 4.2. Let 0 < τ < 1/2 and t ≥ 1 be parameters, and define k = O(1) · (t/τ²) · ln(t/τ). If the τ-critical index ℓ for w_1, ..., w_n satisfies ℓ ≥ k, then we have

Pr_x[ |w_0 + w_1 x_1 + ... + w_n x_n| ≤ t·σ_k ] ≤ O(2^{−t}).

A similar result was established in [47]. The following subsections 4.1, 4.2, and 4.3 are devoted to the proof of Theorem 4.2. Throughout, they assume ℓ denotes the τ-critical index of w_1, ..., w_n, where w_1 ≥ ... ≥ w_n > 0 as in the condition of Theorem 4.2.

4.1. Partitioning weights into blocks. The following simple lemma shows that the tail weight decreases exponentially up to the τ-critical index:

Lemma 4.3. For 1 ≤ a < b ≤ ℓ, we have σ_b² < (1 − τ²)^{b−a}·σ_a² < (1 − τ²)^{b−a}·w_a²/τ².

Proof. Since a is less than the critical index, we have w_a² > τ²σ_a² = τ²(w_a² + σ_{a+1}²), or equivalently (1 − τ²)w_a² > τ²σ_{a+1}². Adding (1 − τ²)σ_{a+1}² to both sides gives (1 − τ²)(w_a² + σ_{a+1}²) > (1 − τ²)σ_{a+1}² + τ²σ_{a+1}², which is equivalent to (1 − τ²)σ_a² > σ_{a+1}². Iterating this inequality yields σ_b² < (1 − τ²)^{b−a}·σ_a²; the second inequality follows from w_a² > τ²σ_a².

Fix a parameter Z > 1. We divide the list of weights w_1, ..., w_ℓ into "Z-blocks" of consecutive weights as follows. The first Z-block B_1 is w_1, ..., w_{k_1}, where k_1 is defined to be the first index such that w_1 (the largest weight in the block) is large relative to σ_{k_1+1} (the total tail weight of all weights after the Z-block) in the following sense: w_1 > Z·σ_{k_1+1}. Similarly, for i = 2, 3, ...
the $i$th $Z$-block $B_i$ is $w_{k_{i-1}+1}, \ldots, w_{k_i}$, where $k_i$ is the first index such that
$$w_{k_{i-1}+1} > Z \sigma_{k_i+1}.$$
The following lemma says each $Z$-block must be relatively short prior to the critical index:

Lemma 4.4. Suppose that the $i$th $Z$-block $B_i$ is such that $k_{i-1} + m < l$, where
$$m \stackrel{\mathrm{def}}{=} \frac{1}{\tau^2} \ln(Z^2/\tau^2). \tag{4.2}$$
Then $B_i$ is of length at most $m$.
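To make the block definition concrete, the following small sketch (our own illustration, not from the paper; the function name and 0-based indexing are ours) partitions a non-increasing weight vector into $Z$-blocks exactly as defined above:

```python
import math

def z_blocks(w, Z):
    """Greedily partition weights w[0] >= w[1] >= ... into Z-blocks.

    Block i is w[start..k], where k is the first index at which the block's
    largest weight w[start] exceeds Z times the 2-norm of all weights after
    index k.  Returns (start, end) index pairs (inclusive, 0-based); trailing
    weights that never satisfy the condition form no complete block.
    """
    n = len(w)
    # suffix2[j] = sum of w[i]^2 for i >= j; in the paper's 1-based
    # notation, sigma_{j+1}^2 corresponds to suffix2[j+1] here.
    suffix2 = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        suffix2[i] = suffix2[i + 1] + w[i] * w[i]
    blocks, start = [], 0
    while start < n:
        k = start
        while k < n and w[start] <= Z * math.sqrt(suffix2[k + 1]):
            k += 1
        if k == n:
            break  # condition never met: no further complete block
        blocks.append((start, k))
        start = k + 1
    return blocks
```

For rapidly decreasing weights such as $w_i = 4^{-i}$ every block has length 1, while slowly decreasing weights (e.g. all equal) produce long blocks; Lemma 4.4 is what bounds the block length before the critical index.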
Proof. Suppose that the length $|B_i|$ of the $i$th $Z$-block were more than $m$. Applying Lemma 4.3 with $b - a = m$, we have
$$\sigma_{k_{i-1}+1+m}^2 < (1 - \tau^2)^m w_{k_{i-1}+1}^2/\tau^2 \le e^{-\tau^2 m} w_{k_{i-1}+1}^2/\tau^2.$$
But by the assumption that the $i$th $Z$-block is longer than $m$, we also have $w_{k_{i-1}+1}^2 \le Z^2 \sigma_{k_{i-1}+1+m}^2$. Combining these inequalities and plugging in our expression for $m$ we get a contradiction.

An easy consequence is that if the critical index is large, then there must be many blocks prior to it:

Corollary 4.5. For $t \ge 1$, suppose that the $\tau$-critical index $l$ is at least $tm$, where $m$ is defined as in (4.2). Then $k_t \le tm$, i.e. there are at least $t$ complete $Z$-blocks by the $(tm)$-th weight.

4.2. Block structure and concentration of the random variable $w \cdot x$. Let $f(x) = \mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_n x_n)$ be a threshold function with $w_1 \ge \cdots \ge w_n > 0$, and let $B_1, B_2, \ldots$ be the $Z$-blocks for $w$ as defined in the previous subsection. In this subsection we prove the following lemma, which is a slight variant of a similar result in [47]. Intuitively the lemma says that if a weight vector $w$ has many blocks, then for any $w_0 \in \mathbb{R}$, only an exponentially small fraction of points $x \in \{-1,1\}^n$ will have a small margin for the threshold function $\mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_n x_n)$. As we show in the next subsection, Theorem 4.2 will be an easy consequence of this lemma.

Lemma 4.6. Fix a value $t$ such that there exist at least $t$ complete $Z$-blocks $B_1, \ldots, B_t$ in the weight vector $w$. Then for any $w_0 \in \mathbb{R}$, we have
$$\Pr\big[|w_0 + w_1 x_1 + \cdots + w_n x_n| \le \sigma_{k_t+1}(Z/6)\big] \le 2^{-t} + 2te^{-Z^2/72}.$$
Here the probability is taken over a uniform random choice of $x$ from $\{-1,1\}^n$.

We first give some necessary preliminary results and then prove Lemma 4.6. Our approach follows that of [47] with slight modifications. Let us view the choice of a uniform random assignment $x$ to the variables in $Z$-blocks $B_1, \ldots, B_t$ as taking place in successive stages, where in the $i$th stage values are assigned to the variables in the $i$th $Z$-block $B_i$.
Immediately after the $i$th stage, some value (call it $\xi_i$) has been determined for $w_0 + w_1 x_1 + \cdots + w_{k_i} x_{k_i}$. The following simple lemma shows that if $\xi_i$ is too far from 0, then it is unlikely that the remaining variables $x_{k_i+1}, \ldots, x_n$ will come out in such a way as to make the final sum close to 0.

Lemma 4.7. For any value $A > 0$ and any $1 \le i \le t$, if $|\xi_i| \ge 2\sigma_{k_i+1}\sqrt{2\ln(2/A)}$, then we have
$$\Pr_{x_{k_i+1}, \ldots, x_n}\big[|w_0 + w_1 x_1 + \cdots + w_n x_n| \le \sigma_{k_i+1}\sqrt{2\ln(2/A)}\big] \le A. \tag{4.3}$$
Proof. By the lower bound on $|\xi_i|$ in the hypothesis of the lemma, it can only be the case that $|w_0 + w_1 x_1 + \cdots + w_n x_n| \le \sigma_{k_i+1}\sqrt{2\ln(2/A)}$ if
$$|w_{k_i+1} x_{k_i+1} + \cdots + w_n x_n| \ge \sigma_{k_i+1}\sqrt{2\ln(2/A)}. \tag{4.4}$$
We now recall the Hoeffding bound (see e.g. [12]), which says that for any $0 \ne v \in \mathbb{R}^r$ and any $\gamma > 0$, we have
$$\Pr_{x \in \{-1,1\}^r}\big[|v_1 x_1 + \cdots + v_r x_r| \ge \gamma\sqrt{v_1^2 + \cdots + v_r^2}\big] \le 2e^{-\gamma^2/2}.$$
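As a quick empirical sanity check (ours, not the paper's; the function name and parameters are illustrative), the Hoeffding bound just recalled can be compared against a Monte Carlo estimate:

```python
import math
import random

def signed_sum_tail(v, gamma, trials=20000, seed=0):
    """Estimate Pr[|v_1 x_1 + ... + v_r x_r| >= gamma * ||v||_2] for
    uniformly random signs x_i in {-1, +1}."""
    rng = random.Random(seed)
    norm = math.sqrt(sum(vi * vi for vi in v))
    hits = sum(
        abs(sum(vi * rng.choice((-1, 1)) for vi in v)) >= gamma * norm
        for _ in range(trials)
    )
    return hits / trials

# The estimate should always sit below the bound 2 * exp(-gamma**2 / 2);
# for gamma = 2 the bound is about 0.27, while the true tail for a sum
# of 50 equal weights is only a few percent.
```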
Since $w_{k_i+1}^2 + \cdots + w_n^2 = \sigma_{k_i+1}^2$, this Hoeffding bound implies that the probability of (4.4) is at most $2e^{-(\sqrt{2\ln(2/A)})^2/2} = A$.

We henceforth fix $A \stackrel{\mathrm{def}}{=} 2e^{-Z^2/72}$, so we have $6\sqrt{2\ln(2/A)} = Z$. We now show that regardless of the value of $\xi_{i-1}$, we have $|\xi_i| \le 2\sigma_{k_i+1}(Z/6)$ with probability at most $1/2$ over the choice of values for variables in block $B_i$ in the $i$th stage.

Lemma 4.8. For any $\xi_{i-1} \in \mathbb{R}$, we have
$$\Pr_{x_{k_{i-1}+1}, \ldots, x_{k_i}}\big[|\xi_i| \le 2\sigma_{k_i+1}(Z/6) \mid \xi_{i-1}\big] \le 1/2.$$
Proof. Since $\xi_i$ equals $\xi_{i-1} + (w_{k_{i-1}+1} x_{k_{i-1}+1} + \cdots + w_{k_i} x_{k_i})$, we have $|\xi_i| \le 2\sigma_{k_i+1}(Z/6)$ if and only if the value $w_{k_{i-1}+1} x_{k_{i-1}+1} + \cdots + w_{k_i} x_{k_i}$ lies in the interval
$$[I_L, I_R] \stackrel{\mathrm{def}}{=} \big[-\xi_{i-1} - 2\sigma_{k_i+1}(Z/6),\ -\xi_{i-1} + 2\sigma_{k_i+1}(Z/6)\big]$$
of width $\frac{2}{3}\sigma_{k_i+1} Z$. First suppose that $0 \notin [I_L, I_R]$, i.e. the whole interval has the same sign. If this is the case then $\Pr[w_{k_{i-1}+1} x_{k_{i-1}+1} + \cdots + w_{k_i} x_{k_i} \in [I_L, I_R]] \le \frac{1}{2}$, since by symmetry the value $w_{k_{i-1}+1} x_{k_{i-1}+1} + \cdots + w_{k_i} x_{k_i}$ is equally likely to be positive or negative.

Now suppose that $0 \in [I_L, I_R]$. By definition of $k_i$, we know that $\sigma_{k_i+1} \le w_{k_{i-1}+1}/Z$, and consequently we have that the width of the interval $[I_L, I_R]$ is at most $\frac{2}{3} w_{k_{i-1}+1}$. But now observe that once the value of $x_{k_{i-1}+1}$ is set to either $+1$ or $-1$, this effectively shifts the target interval, which now $w_{k_{i-1}+2} x_{k_{i-1}+2} + \cdots + w_{k_i} x_{k_i}$ must hit, by a displacement of $w_{k_{i-1}+1}$, to become $[I_L - w_{k_{i-1}+1} x_{k_{i-1}+1},\ I_R - w_{k_{i-1}+1} x_{k_{i-1}+1}]$. (Note that in the special case where $k_i = k_{i-1} + 1$, the value $w_{k_{i-1}+2} x_{k_{i-1}+2} + \cdots + w_{k_i} x_{k_i}$ which must hit the target interval is simply 0.) Since the original interval $[I_L, I_R]$ contained 0 and was of length at most $\frac{2}{3} w_{k_{i-1}+1}$, the new interval does not contain 0, and thus again by symmetry we have that the probability (now over the choice of $x_{k_{i-1}+2}, \ldots, x_{k_i}$) that $w_{k_{i-1}+1} x_{k_{i-1}+1} + \cdots + w_{k_i} x_{k_i}$ lies in $[I_L, I_R]$ is at most $\frac{1}{2}$.
In order to have $|w_0 + w_1 x_1 + \cdots + w_n x_n| \le \sigma_{k_t+1}\sqrt{2\ln(2/A)}$, it must be the case that either (i) each $|\xi_i| < 2\sigma_{k_i+1}\sqrt{2\ln(2/A)}$ for $i = 1, \ldots, t$; or (ii) for some $1 \le i \le t$ we have $|\xi_i| \ge 2\sigma_{k_i+1}\sqrt{2\ln(2/A)}$ but nonetheless $|w_0 + w_1 x_1 + \cdots + w_n x_n| < \sigma_{k_i+1}\sqrt{2\ln(2/A)}$. Lemma 4.8 gives us that the probability of (i) is at most $(1/2)^t = 2^{-t}$, and Lemma 4.7 with the union bound gives us that the probability of (ii) is at most $t \cdot A$. This proves Lemma 4.6.

4.3. Proof of Theorem 4.2. Let $Z = 12\sqrt{t}$. We take $m = \frac{1}{\tau^2}\ln(Z^2/\tau^2)$ as in (4.2), and we have $k = tm + 1$. With these choices the condition $l \ge k$ of Theorem 4.2 together with Corollary 4.5 implies that there are at least $t$ complete $Z$-blocks in the weight vector $w$. Thus we may apply Lemma 4.6 (note that $Z/6 = 2\sqrt{t}$ and $Z^2/72 = 2t$), and we have that
$$\Pr\big[|w_0 + w_1 x_1 + \cdots + w_n x_n| \le 2\sqrt{t}\,\sigma_{k_t+1}\big] \le 2^{-t} + 2te^{-2t} \le O(2^{-t}).$$
Now we further observe that since there are in fact $t$ complete $Z$-blocks prior to the $k$th weight, we have $k_t + 1 \le k$ and hence $\sigma_{k_t+1} \ge \sigma_k$, so the above inequality implies
$$\Pr\big[|w_0 + w_1 x_1 + \cdots + w_n x_n| \le \sqrt{t}\,\sigma_k\big] \le O(2^{-t}).$$
This is the desired conclusion of Theorem 4.2.
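The statement just proved can also be checked numerically. The sketch below (our own illustration; the parameter choices are for demonstration and do not match the theorem's constants) computes the $\tau$-critical index of a weight vector and estimates the anticoncentration probability for a geometrically decreasing vector, whose critical index is $+\infty$:

```python
import math
import random

def critical_index(w, tau):
    """Least 0-based index l with w[l] <= tau * sigma_l, where sigma_l is
    the 2-norm of w[l:]; returns None if no index qualifies (the
    +infinity case from Definition 4.1)."""
    suffix2, sigmas = 0.0, []
    for wi in reversed(w):
        suffix2 += wi * wi
        sigmas.append(math.sqrt(suffix2))
    sigmas.reverse()
    for l, (wl, sl) in enumerate(zip(w, sigmas)):
        if wl <= tau * sl:
            return l
    return None

def small_sum_prob(w, thresh, trials=20000, seed=0):
    """Estimate Pr[|sum_i w_i x_i| <= thresh] over uniform signs x."""
    rng = random.Random(seed)
    hits = sum(
        abs(sum(wi * rng.choice((-1, 1)) for wi in w)) <= thresh
        for _ in range(trials)
    )
    return hits / trials

w = [2.0 ** -i for i in range(30)]  # critical index is +infinity
t, k = 4, 20
sigma_k = math.sqrt(sum(wi * wi for wi in w[k:]))
# With these rapidly decreasing weights, the signed sum is almost never
# within sqrt(t) * sigma_k of zero, in the spirit of Theorem 4.2.
```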
4.4. Extension of Theorem 4.2. The same proof with a slightly different choice of $Z$ (taking $Z = O(1) \cdot t^C$) in fact gives us the following significantly stronger version of Theorem 4.2; however, this stronger version is not more useful for our purposes:

Theorem 4.9. In the setting of Theorem 4.2, let $C \ge 1/2$ be another parameter, and suppose we instead define
$$k = O(1) \cdot \frac{t}{\tau^2} \ln\left(\frac{t^C}{\tau}\right).$$
Then if $l \ge k$,
$$\Pr_x\big[|w_0 + w_1 x_1 + \cdots + w_n x_n| \le t^C \sigma_k\big] \le O(2^{-t}).$$

5. Approximating threshold functions using not-too-large head weights. The main result of this section is a lemma which roughly says that any threshold function $f$ can be approximated by a threshold function $f'$ in which the 2-norm of the tail weights, $\sigma_k$, is at least an $\Omega(1)$ fraction of the head weights. This is important so that the Gaussian random variable to which the tail part is close has $\Omega(1)$ variance and thus sufficiently good anticoncentration.

Lemma 5.1. Let $f : \{-1,1\}^n \to \{-1,1\}$ be any threshold function, $f(x) = \mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_n x_n)$ (recall that we assume $w_1 \ge w_2 \ge \cdots \ge w_n$). Let $0 < \epsilon < 1/2$ and $1 \le k \le n$ be parameters, and write $\sigma_k \stackrel{\mathrm{def}}{=} \sqrt{\sum_{j \ge k} w_j^2}$. Assuming $\sigma_k > 0$, there are numbers $v_0, \ldots, v_{k-1}$ satisfying
$$|v_i| \le k^{(k+1)/2} \cdot 3\sqrt{\ln(2/\epsilon)}\,\sigma_k \tag{5.1}$$
such that the threshold function $f' : \{-1,1\}^n \to \{-1,1\}$ defined by
$$f'(x) = \mathrm{sgn}(v_0 + v_1 x_1 + \cdots + v_{k-1} x_{k-1} + w_k x_k + \cdots + w_n x_n)$$
satisfies $\mathrm{dist}(f, f') \le \epsilon$. One may further ensure that $|v_1| \ge |v_2| \ge \cdots \ge |v_{k-1}| \ge w_k$ and that $\mathrm{sgn}(v_i) = \mathrm{sgn}(w_i)$ for all $i$.

Before proving this lemma, let us give an illustration. Consider the threshold function
$$f(x) = \mathrm{sgn}(n x_1 + n x_2 + x_3 + \cdots + x_n), \tag{5.2}$$
with $k = 3$. The tail weights here have $\sigma_3 = \sqrt{n-2}$, which of course is not a constant fraction of the two head weights, $n$. Further, this cannot be fixed just by choosing a different weights-based representation of the same function $f$.
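To see the phenomenon behind Lemma 5.1 on the example (5.2), the following Monte Carlo sketch (ours, not the paper's; names and parameter choices are illustrative) estimates the disagreement between $f$ and the variant whose two head weights are shrunk:

```python
import math
import random

def head_shrink_disagreement(n, new_head, trials=20000, seed=1):
    """Estimate dist(f, f') where f  = sgn(n*x1 + n*x2 + x3 + ... + xn)
    and f' = sgn(new_head*(x1 + x2) + x3 + ... + xn), x uniform."""
    rng = random.Random(seed)
    diff = 0
    for _ in range(trials):
        head = rng.choice((-1, 1)) + rng.choice((-1, 1))  # x1 + x2
        tail = sum(rng.choice((-1, 1)) for _ in range(n - 2))
        f_val = 1 if n * head + tail >= 0 else -1
        g_val = 1 if new_head * head + tail >= 0 else -1
        diff += f_val != g_val
    return diff / trials

n, eps = 400, 0.01
shrunk = 3.0 * math.sqrt(math.log(1.0 / eps) * n)  # Theta(sqrt(ln(1/eps)*n))
# Shrinking the head weights from n down to ~sqrt(ln(1/eps)*n) barely
# changes f, while shrinking them all the way to O(1) changes f on a
# constant fraction of inputs.
```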
What Lemma 5.1 shows here is that we can shrink the head weights from $n$ all the way down to $\Theta(\sqrt{\ln(1/\epsilon)} \cdot \sqrt{n})$ without changing the function on more than an $\epsilon$ fraction of points (this heavily uses the fact that the tail acts like a Gaussian with standard deviation $\sqrt{n-2}$). Then indeed $\sigma_3$ is an $\Omega(1/\sqrt{\ln(1/\epsilon)})$ fraction of the head weights, a fraction that is independent of $n$, as desired.

We now give the proof of Lemma 5.1, a modification of the classic argument of [40] which bounds the weights required for exact representation of any threshold function.

Proof. We will first prove the theorem without the extra constraints $|v_1| \ge |v_2| \ge \cdots \ge |v_{k-1}| \ge w_k$ and $\mathrm{sgn}(v_i) = \mathrm{sgn}(w_i)$. At the end of the proof we will show how these constraints can also be ensured.
Let $h : \{-1,1\}^{k-1} \to \mathbb{R}$ denote the "head" of $f$,
$$h(x) = w_0 + w_1 x_1 + \cdots + w_{k-1} x_{k-1}.$$
Consider the system $\mathcal{S}$ of $2^{k-1}$ linear equations in $k$ unknowns named $u_0, \ldots, u_{k-1}$: for each $x \in \{-1,1\}^{k-1}$ we include the equation
$$u_0 + u_1 x_1 + \cdots + u_{k-1} x_{k-1} = h(x).$$
Of course, the linear system $\mathcal{S}$ is satisfiable, since $(u_0, \ldots, u_{k-1}) = (w_0, \ldots, w_{k-1})$ is a solution. Let $C$ be defined by $C = 3\sqrt{\ln(2/\epsilon)}\,\sigma_k$, and consider the system $\mathcal{LP}$ of $2^{k-1}$ linear inequalities over unknowns $u_0, \ldots, u_{k-1}$: for each $x \in \{-1,1\}^{k-1}$ we include the (in)equality
$$u_0 + u_1 x_1 + \cdots + u_{k-1} x_{k-1} \begin{cases} \ge C & \text{if } h(x) \ge C, \\ = h(x) & \text{if } |h(x)| < C, \\ \le -C & \text{if } h(x) \le -C. \end{cases} \tag{5.3}$$
We have that $\mathcal{LP}$ is feasible, since it is a relaxation of the satisfiable system $\mathcal{S}$. Now we use the following standard result from the theory of linear inequalities, which is a straightforward consequence of Cramer's rule and is implicit in several works (see e.g. the proof at the start of Section 3 of [24]):

Lemma 5.2. Let $\mathcal{LP}$ denote a feasible linear program over $k$ variables $u_0, \ldots, u_{k-1}$ in which the constraint matrix has all entries from $\{-1, 0, 1\}$ and the right-hand side has all entries at most $C$ in absolute value. Then there is a feasible solution $(v_0, \ldots, v_{k-1})$ in which $|v_i| \le k^{(k+1)/2} C$ for each $i$.

This implies that there is a feasible solution $(u_0, \ldots, u_{k-1}) = (v_0, \ldots, v_{k-1})$ to $\mathcal{LP}$ in which the numbers $v_i$ are not too large in magnitude: specifically, using Lemma 5.2 we may obtain
$$|v_i| \le k^{(k+1)/2} C. \tag{5.4}$$
We now show that the threshold function
$$f'(x) = \mathrm{sgn}(v_0 + v_1 x_1 + \cdots + v_{k-1} x_{k-1} + w_k x_k + \cdots + w_n x_n)$$
satisfies $\mathrm{dist}(f, f') \le \epsilon$. Given $x \in \{-1,1\}^n$, let us abuse notation by writing
$$h(x) = h(x_1, \ldots, x_{k-1}) = w_0 + w_1 x_1 + \cdots + w_{k-1} x_{k-1};$$
let us also write
$$h'(x) = v_0 + v_1 x_1 + \cdots + v_{k-1} x_{k-1}$$
for the "head" of $f'$, and
$$t(x) = \sum_{j \ge k} w_j x_j$$
for the "tail", which is common to both $f$ and $f'$. Now if $x$ is any input for which $|h(x)| < C$, then we have $h(x) = h'(x)$ by construction, and hence $f(x) = f'(x)$. Thus in order for $f(x)$ to disagree with $f'(x)$ it must at least be the case that $|h(x)| \ge C$. Moreover, it must also be the case that $|t(x)| \ge C$, for otherwise $\mathrm{sgn}(h(x) + t(x))$ will equal $\mathrm{sgn}(h'(x) + t(x))$, because $h(x)$ and $h'(x)$ have the same sign and both have magnitude at least $C$ by construction. But the Hoeffding bound implies that
$$\Pr_x[|t(x)| \ge C] \le \Pr_x\big[|t(x)| \ge \sqrt{2\ln(2/\epsilon)}\,\sigma_k\big] \le 2e^{-\ln(2/\epsilon)} = \epsilon.$$
Hence indeed $\Pr[f(x) \ne f'(x)] \le \epsilon$, as desired.

Finally, we complete the proof by showing how to ensure the extra constraints $|v_1| \ge |v_2| \ge \cdots \ge |v_{k-1}| \ge w_k$ and $\mathrm{sgn}(v_i) = \mathrm{sgn}(w_i)$. First, the constraints $\mathrm{sgn}(u_i) = \mathrm{sgn}(w_i)$ can be added into $\mathcal{LP}$; by this we mean adding constraints like $u_1 \ge 0$, $u_2 \le 0$, etc. Next, the constraints
$$\mathrm{sgn}(w_1) u_1 \ge \mathrm{sgn}(w_2) u_2, \quad \mathrm{sgn}(w_2) u_2 \ge \mathrm{sgn}(w_3) u_3, \quad \ldots, \quad \mathrm{sgn}(w_{k-2}) u_{k-2} \ge \mathrm{sgn}(w_{k-1}) u_{k-1}$$
can be added into $\mathcal{LP}$; again, these are constraints like $\pm u_i \ge \pm u_{i+1}$. Finally, we can add the constraint $\mathrm{sgn}(w_{k-1}) u_{k-1} \ge w_k$. Of course, $\mathcal{LP}$ remains feasible after the addition of all of these constraints, since $(u_0, \ldots, u_{k-1}) = (w_0, \ldots, w_{k-1})$ is still a solution. It remains to show that there is still a solution satisfying the bounds in (5.4). But this still follows from Lemma 5.2: the added constraints only have coefficients in $\{-1, 0, 1\}$, and the added right-hand side entries are all 0, except for the last, which is $w_k \le \sigma_k \le C$.

6. Every threshold function is close to a threshold function for which few points have small margin. In this section we show how to combine Theorem 4.2 and Lemma 5.1 to establish the following:

Theorem 6.1. Let $f : \{-1,1\}^n \to \{-1,1\}$ be any threshold function and let $0 < \tau < 1/2$. Then there is a threshold function $f' : \{-1,1\}^n \to \{-1,1\}$ with $\mathrm{dist}(f, f') \le \epsilon$ satisfying
$$\Pr_x[\mathrm{marg}(f', x) \le \rho] \le O(\tau),$$
where $\epsilon = \epsilon(\tau) = 2^{-2^{O(\log^3(1/\tau)/\tau^2)}}$ and $\rho = \rho(\tau) = 2^{-O(\log^3(1/\tau)/\tau^2)}$.
Our main structural result about margins, Theorem 3.1, is simply a rephrasing of the above theorem. Hence proving Theorem 6.1 completes the proof of Theorem 1.6, the first ingredient in our solution to the Chow Parameters Problem.

The plan for the proof of Theorem 6.1 follows the intuition described at the beginning of Section 4. We consider the location of the $\tau$-critical index of $f$. Case 1 is that it occurs quite early. In that case, the resulting tail acts like a Gaussian (up to error $\tau$), and hence we can get a good anticoncentration bound so long as the tail's
variance is large enough. To ensure this, we alter $f$ at the beginning of the argument using Lemma 5.1, which yields tail weights with total variance lower bounded by a function that depends only on $\tau$. Case 2 is that the critical index occurs late. In this case we get anticoncentration by appealing to Theorem 4.2. We again use Lemma 5.1 so that the $\sigma_k$ parameter is not too small. We now give the formal proof.

Proof. (Theorem 6.1) We intend to apply Theorem 4.2 in Case 2 with its $t$ parameter set to $\log(1/\tau)$, so that the anticoncentration is $O(\tau)$. Thus we will need to ensure the $\tau$-critical index parameter $l$ is at least
$$k \stackrel{\mathrm{def}}{=} O(1) \cdot \frac{\log(1/\tau)}{\tau^2} \ln\left(\frac{\log(1/\tau)}{\tau}\right). \tag{6.1}$$
To that end, fix a weights-based representation of $f$,
$$f(x) = \mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_n x_n),$$
where we may assume that $w_1 \ge w_2 \ge \cdots \ge w_n > 0$. Write $\sigma_k = \sqrt{\sum_{j \ge k} w_j^2}$, and observe that $\sigma_k > 0$ since each $w_i \ne 0$. Now apply Lemma 5.1, with its parameter $\epsilon$ set to $2^{-k^{O(k)}}$. This yields a new threshold function
$$f'(x) = \mathrm{sgn}(v_0 + v_1 x_1 + \cdots + v_{k-1} x_{k-1} + w_k x_k + \cdots + w_n x_n), \tag{6.2}$$
where each $v_i$ satisfies
$$|v_i| \le k^{O(k)} \sigma_k, \tag{6.3}$$
and also $|v_1| \ge |v_2| \ge \cdots \ge |v_{k-1}| \ge w_k$. This $f'$ has $\mathrm{dist}(f, f') \le \epsilon = 2^{-k^{O(k)}}$.

To analyze $\mathrm{marg}(f', x)$, let us normalize the weights of $f'$ by dividing each weight by $\sqrt{v_0^2 + \cdots + v_{k-1}^2 + w_k^2 + \cdots + w_n^2}$. We thus may write
$$f'(x) = \mathrm{sgn}(u_0 + u_1 x_1 + \cdots + u_{k-1} x_{k-1} + u_k x_k + \cdots + u_n x_n),$$
where $\sum_{j \ge 0} u_j^2 = 1$. Equation (6.3) implies that for each of the $k$ values $i = 0, \ldots, k-1$ we have that $v_i^2$ is at most $k^{O(k)}$ times as large as $w_k^2 + \cdots + w_n^2$. Letting $\sigma'_i$ denote $\sqrt{\sum_{j \ge i} u_j^2}$ and recalling that $\sum_{j \ge 0} u_j^2 = 1$, this is easily seen to imply that
$$\sigma'_k \ge k^{-O(k)}. \tag{6.4}$$
Recalling that we still have $u_1 \ge u_2 \ge \cdots \ge u_n > 0$, let $l$ be the $\tau$-critical index for $u_1, \ldots, u_n$, and consider two cases:

Case 1: $l < k$. In this case, consider any fixed choice for $x_1, \ldots, x_{l-1}$ and write $h = u_0 + u_1 x_1 + \cdots + u_{l-1} x_{l-1}$.
Using the definition of the $\tau$-critical index and applying the Berry-Esseen Corollary 2.8 to $u_l x_l + \cdots + u_n x_n$, we get
$$\Pr_{x_l, \ldots, x_n}\big[-h - \gamma \le u_l x_l + \cdots + u_n x_n \le -h + \gamma\big] \le \frac{2\gamma}{\sigma'_l} + 2\tau$$
for any choice of $\gamma \ge 0$. Taking $\gamma = \tau\sigma'_l \ge \tau\sigma'_k$ we conclude
$$\Pr_x[\mathrm{marg}(f', x) \le \tau\sigma'_k] \le 4\tau.$$
Case 2: $l \ge k$. In this case we apply Theorem 4.2, with its parameter $t$ set to $\log(1/\tau)$, as described at the beginning of the proof. With $k$ defined as in (6.1), we conclude
$$\Pr_x\big[\mathrm{marg}(f', x) \le \sqrt{\log(1/\tau)}\,\sigma'_k\big] \le O(\tau).$$
Combining the results of the two cases and using $\sigma'_k \ge k^{-O(k)}$ from (6.4), we conclude that we always have
$$\Pr_x\big[\mathrm{marg}(f', x) \le \tau k^{-O(k)}\big] \le O(\tau).$$
Now it only remains to observe that by the definition (6.1) of $k$,
$$k^{O(k)} = 2^{O(\log^3(1/\tau)/\tau^2)}.$$
Hence we have that
$$\mathrm{dist}(f, f') \le 2^{-k^{O(k)}} \le \epsilon(\tau)$$
and
$$\tau k^{-O(k)} \ge \tau \cdot 2^{-O(\log^3(1/\tau)/\tau^2)} \ge \rho(\tau).$$

7. Second ingredient: using Chow Parameters as weights for tail variables. We begin this section with some informal motivation for and description of our second ingredient. We first recall that every threshold function $f$ is unate; this means that for every $i$, $f$ is either monotone increasing or monotone decreasing as a function of its $i$th coordinate. A well-known consequence of unateness is that the magnitude of the Fourier coefficient $\hat{f}(i)$ is equal to the influence of the variable $x_i$ on $f$, i.e. $\Pr[f(x) \ne f(y)]$, where $x$ is drawn uniformly from $\{-1,1\}^n$ and $y$ is $x$ with the $i$th bit flipped.

As done in the first ingredient, it is natural to group together the high-influence variables, forming the "head" indices of $f$. We refer to the remaining indices as the "tail" indices. Note that an algorithm for the Chow Parameters Problem can do this grouping, since it is given the $\hat{f}(i)$'s. The following theorem states that any threshold function $f$ is either already close to a junta over the head indices, or is close to a threshold function obtained by replacing the tail weights with (suitably scaled versions of) the tail Chow Parameters. (We have made no effort to optimize the precise polynomial dependence of $\tau(\epsilon)$ on $\epsilon$.)

Theorem 7.1.
There is a polynomial function $\tau(\epsilon) = \mathrm{poly}(\epsilon)$ such that the following holds: Let $f$ be a Boolean threshold function over head indices $H$ and tail indices $T$,
$$f(x) = \mathrm{sgn}\Big(v_0 + \sum_{i \in H} v_i x_i + \sum_{i \in T} w_i x_i\Big),$$
and let $0 < \epsilon < 1/2$. Assume that $H$ contains all indices $i$ such that $|\hat{f}(i)| \ge \tau(\epsilon)^2$. Then one of the following holds: (i) $f$ is $O(\epsilon)$-close to a junta over $H$; or,
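The unateness fact quoted above, that $|\hat{f}(i)|$ equals the influence of $x_i$ for any threshold function, can be verified exactly by brute force on small examples. The sketch below is our own (the example function is arbitrary, chosen so its linear form is never zero):

```python
from itertools import product

def chow(f, n, i):
    """Degree-1 Fourier coefficient f^(i) = E[f(x) * x_i], uniform x."""
    return sum(f(x) * x[i] for x in product((-1, 1), repeat=n)) / 2 ** n

def influence(f, n, i):
    """Inf_i(f) = Pr[f(x) != f(x with the i-th bit flipped)]."""
    flips = 0
    for x in product((-1, 1), repeat=n):
        y = x[:i] + (-x[i],) + x[i + 1:]
        flips += f(x) != f(y)
    return flips / 2 ** n

def f(x):
    # An example threshold function sgn(2*x1 + x2 + x3 - x4); the linear
    # form is always odd, so sgn is never evaluated at 0.
    return 1 if 2 * x[0] + x[1] + x[2] - x[3] >= 0 else -1

# Since every threshold function is unate, |chow(f, 4, i)| should equal
# influence(f, 4, i) for every coordinate i.
```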
The sum of d small-bias generators fools polynomials of degree d Emanuele Viola April 9, 2008 Abstract We prove that the sum of d small-bias generators L : F s F n fools degree-d polynomials in n variables
More informationComputational Learning Theory
CS 446 Machine Learning Fall 2016 OCT 11, 2016 Computational Learning Theory Professor: Dan Roth Scribe: Ben Zhou, C. Cervantes 1 PAC Learning We want to develop a theory to relate the probability of successful
More informationIntroductory Analysis I Fall 2014 Homework #9 Due: Wednesday, November 19
Introductory Analysis I Fall 204 Homework #9 Due: Wednesday, November 9 Here is an easy one, to serve as warmup Assume M is a compact metric space and N is a metric space Assume that f n : M N for each
More informationPROBABILISTIC ANALYSIS OF THE GENERALISED ASSIGNMENT PROBLEM
PROBABILISTIC ANALYSIS OF THE GENERALISED ASSIGNMENT PROBLEM Martin Dyer School of Computer Studies, University of Leeds, Leeds, U.K. and Alan Frieze Department of Mathematics, Carnegie-Mellon University,
More informationA Regularity Lemma, and Low-weight Approximators, for Low-degree Polynomial Threshold Functions
THEORY OF COMPUTING www.theoryofcomputing.org A Regularity Lemma, and Low-weight Approximators, for Low-degree Polynomial Threshold Functions Ilias Diakonikolas Rocco A. Servedio Li-Yang Tan Andrew Wan
More informationOnline Learning, Mistake Bounds, Perceptron Algorithm
Online Learning, Mistake Bounds, Perceptron Algorithm 1 Online Learning So far the focus of the course has been on batch learning, where algorithms are presented with a sample of training data, from which
More informationPart V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory
Part V 7 Introduction: What are measures and why measurable sets Lebesgue Integration Theory Definition 7. (Preliminary). A measure on a set is a function :2 [ ] such that. () = 2. If { } = is a finite
More informationMath 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.
Math 350 Fall 2011 Notes about inner product spaces In this notes we state and prove some important properties of inner product spaces. First, recall the dot product on R n : if x, y R n, say x = (x 1,...,
More informationReview of Probability Theory
Review of Probability Theory Arian Maleki and Tom Do Stanford University Probability theory is the study of uncertainty Through this class, we will be relying on concepts from probability theory for deriving
More informationTuring Machines, diagonalization, the halting problem, reducibility
Notes on Computer Theory Last updated: September, 015 Turing Machines, diagonalization, the halting problem, reducibility 1 Turing Machines A Turing machine is a state machine, similar to the ones we have
More informationPolynomial time Prediction Strategy with almost Optimal Mistake Probability
Polynomial time Prediction Strategy with almost Optimal Mistake Probability Nader H. Bshouty Department of Computer Science Technion, 32000 Haifa, Israel bshouty@cs.technion.ac.il Abstract We give the
More informationDecoupling course outline Decoupling theory is a recent development in Fourier analysis with applications in partial differential equations and
Decoupling course outline Decoupling theory is a recent development in Fourier analysis with applications in partial differential equations and analytic number theory. It studies the interference patterns
More information12.1 A Polynomial Bound on the Sample Size m for PAC Learning
67577 Intro. to Machine Learning Fall semester, 2008/9 Lecture 12: PAC III Lecturer: Amnon Shashua Scribe: Amnon Shashua 1 In this lecture will use the measure of VC dimension, which is a combinatorial
More informationNP Completeness and Approximation Algorithms
Chapter 10 NP Completeness and Approximation Algorithms Let C() be a class of problems defined by some property. We are interested in characterizing the hardest problems in the class, so that if we can
More informationLecture 3 Small bias with respect to linear tests
03683170: Expanders, Pseudorandomness and Derandomization 3/04/16 Lecture 3 Small bias with respect to linear tests Amnon Ta-Shma and Dean Doron 1 The Fourier expansion 1.1 Over general domains Let G be
More informationCS446: Machine Learning Spring Problem Set 4
CS446: Machine Learning Spring 2017 Problem Set 4 Handed Out: February 27 th, 2017 Due: March 11 th, 2017 Feel free to talk to other members of the class in doing the homework. I am more concerned that
More informationLecture 3: Randomness in Computation
Great Ideas in Theoretical Computer Science Summer 2013 Lecture 3: Randomness in Computation Lecturer: Kurt Mehlhorn & He Sun Randomness is one of basic resources and appears everywhere. In computer science,
More informationCOS598D Lecture 3 Pseudorandom generators from one-way functions
COS598D Lecture 3 Pseudorandom generators from one-way functions Scribe: Moritz Hardt, Srdjan Krstic February 22, 2008 In this lecture we prove the existence of pseudorandom-generators assuming that oneway
More informationP, NP, NP-Complete, and NPhard
P, NP, NP-Complete, and NPhard Problems Zhenjiang Li 21/09/2011 Outline Algorithm time complicity P and NP problems NP-Complete and NP-Hard problems Algorithm time complicity Outline What is this course
More informationSCALE INVARIANT FOURIER RESTRICTION TO A HYPERBOLIC SURFACE
SCALE INVARIANT FOURIER RESTRICTION TO A HYPERBOLIC SURFACE BETSY STOVALL Abstract. This result sharpens the bilinear to linear deduction of Lee and Vargas for extension estimates on the hyperbolic paraboloid
More informationMath 328 Course Notes
Math 328 Course Notes Ian Robertson March 3, 2006 3 Properties of C[0, 1]: Sup-norm and Completeness In this chapter we are going to examine the vector space of all continuous functions defined on the
More informationCS 151 Complexity Theory Spring Solution Set 5
CS 151 Complexity Theory Spring 2017 Solution Set 5 Posted: May 17 Chris Umans 1. We are given a Boolean circuit C on n variables x 1, x 2,..., x n with m, and gates. Our 3-CNF formula will have m auxiliary
More informationCS264: Beyond Worst-Case Analysis Lecture #11: LP Decoding
CS264: Beyond Worst-Case Analysis Lecture #11: LP Decoding Tim Roughgarden October 29, 2014 1 Preamble This lecture covers our final subtopic within the exact and approximate recovery part of the course.
More informationAnswering Many Queries with Differential Privacy
6.889 New Developments in Cryptography May 6, 2011 Answering Many Queries with Differential Privacy Instructors: Shafi Goldwasser, Yael Kalai, Leo Reyzin, Boaz Barak, and Salil Vadhan Lecturer: Jonathan
More informationHardness Results for Agnostically Learning Low-Degree Polynomial Threshold Functions
Hardness Results for Agnostically Learning Low-Degree Polynomial Threshold Functions Ilias Diakonikolas Columbia University ilias@cs.columbia.edu Rocco A. Servedio Columbia University rocco@cs.columbia.edu
More informationComputational Learning Theory
1 Computational Learning Theory 2 Computational learning theory Introduction Is it possible to identify classes of learning problems that are inherently easy or difficult? Can we characterize the number
More informationLEBESGUE INTEGRATION. Introduction
LEBESGUE INTEGATION EYE SJAMAA Supplementary notes Math 414, Spring 25 Introduction The following heuristic argument is at the basis of the denition of the Lebesgue integral. This argument will be imprecise,
More informationTesting Problems with Sub-Learning Sample Complexity
Testing Problems with Sub-Learning Sample Complexity Michael Kearns AT&T Labs Research 180 Park Avenue Florham Park, NJ, 07932 mkearns@researchattcom Dana Ron Laboratory for Computer Science, MIT 545 Technology
More informationmeans is a subset of. So we say A B for sets A and B if x A we have x B holds. BY CONTRAST, a S means that a is a member of S.
1 Notation For those unfamiliar, we have := means equal by definition, N := {0, 1,... } or {1, 2,... } depending on context. (i.e. N is the set or collection of counting numbers.) In addition, means for
More informationRandomized Complexity Classes; RP
Randomized Complexity Classes; RP Let N be a polynomial-time precise NTM that runs in time p(n) and has 2 nondeterministic choices at each step. N is a polynomial Monte Carlo Turing machine for a language
More informationThe sample complexity of agnostic learning with deterministic labels
The sample complexity of agnostic learning with deterministic labels Shai Ben-David Cheriton School of Computer Science University of Waterloo Waterloo, ON, N2L 3G CANADA shai@uwaterloo.ca Ruth Urner College
More informationTesting for Concise Representations
Testing for Concise Representations Ilias Diakonikolas Columbia University ilias@cs.columbia.edu Ronitt Rubinfeld MIT ronitt@theory.csail.mit.edu Homin K. Lee Columbia University homin@cs.columbia.edu
More informationTopics in Theoretical Computer Science: An Algorithmist's Toolkit Fall 2007
MIT OpenCourseWare http://ocw.mit.edu 18.409 Topics in Theoretical Computer Science: An Algorithmist's Toolkit Fall 2007 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
More informationFourier analysis of boolean functions in quantum computation
Fourier analysis of boolean functions in quantum computation Ashley Montanaro Centre for Quantum Information and Foundations, Department of Applied Mathematics and Theoretical Physics, University of Cambridge
More informationLearning and 1-bit Compressed Sensing under Asymmetric Noise
JMLR: Workshop and Conference Proceedings vol 49:1 39, 2016 Learning and 1-bit Compressed Sensing under Asymmetric Noise Pranjal Awasthi Rutgers University Maria-Florina Balcan Nika Haghtalab Hongyang
More informationCS 446: Machine Learning Lecture 4, Part 2: On-Line Learning
CS 446: Machine Learning Lecture 4, Part 2: On-Line Learning 0.1 Linear Functions So far, we have been looking at Linear Functions { as a class of functions which can 1 if W1 X separate some data and not
More informationSpanning and Independence Properties of Finite Frames
Chapter 1 Spanning and Independence Properties of Finite Frames Peter G. Casazza and Darrin Speegle Abstract The fundamental notion of frame theory is redundancy. It is this property which makes frames
More informationLecture Learning infinite hypothesis class via VC-dimension and Rademacher complexity;
CSCI699: Topics in Learning and Game Theory Lecture 2 Lecturer: Ilias Diakonikolas Scribes: Li Han Today we will cover the following 2 topics: 1. Learning infinite hypothesis class via VC-dimension and
More information