Course Notes, Part II: Probabilistic Combinatorics and Algorithms

J. A. Verstraete
Department of Mathematics
University of California San Diego
9500 Gilman Drive, La Jolla, California

2 Basic probabilistic inequalities

The probability spaces we deal with will generally be discrete. The reader is referred to Feller for a complete and formal background of probability theory, and to Williams for a shorter but still complete text.

2.1 Probability and Expectation

Recall that a probability space is a triple $(\Omega, \mathcal{F}, P)$, where $\Omega$ is a set, $\mathcal{F}$ is a family of subsets of $\Omega$ containing $\emptyset$, closed under complementation, and closed under countable unions, and $P : \mathcal{F} \to [0,1]$ is a countably additive function on $\mathcal{F}$ with $P(\emptyset) = 0$ and $P(\Omega) = 1$. The elements of $\Omega$ are called sample points and the sets in $\mathcal{F}$ are called events. If $\Omega$ is finite, then $P$ is determined completely by its value on each $\omega \in \Omega$. A random variable is a real-valued function $X : \Omega \to \mathbb{R}$ such that the inverse image $X^{-1}$ maps Borel subsets of $\mathbb{R}$ (that is, sets which consist of unions and intersections of countably many half-closed intervals $(a, b]$) to events in $\mathcal{F}$.

If $(\Omega, \mathcal{F}, P)$ is a probability space and $A \in \mathcal{F}$ is any event, then we write "$A$ a.s." (almost surely) instead of $P(A) = 1$. For a sequence of events $A_1, A_2, \ldots \in \mathcal{F}$, we write "$A_n$ a.a.s." to mean $\lim_{n \to \infty} P(A_n) = 1$; this says $A_n$ occurs asymptotically almost surely.

2.1.1 Expectation

In the following definitions, integrals are taken to denote Lebesgue integrals. For the purposes we have in mind, most of the integrals will be Riemann integrals, or even finite sums. Let $(\Omega, \mathcal{F}, P)$ be a probability space. The expectation (or first moment) of a random variable $X$, when it exists, is defined by

$$E(X) = \int_\Omega X(\omega) \, dP(\omega).$$

Let $F(x)$ denote $P(X \le x)$ for each $x \in \mathbb{R}$. This is the cumulative distribution function of $X$, and its derivative, when it exists, is called the probability density function of $X$. In practice, if $f$ is the density function of $X$ and $X$ has range $R \subseteq \mathbb{R}$, then

$$E(X) = \int_R x f(x) \, dx.$$

We will assume henceforth that when we write an expression involving $E(X)$, the mean of $X$ exists. In the last section, we used the fact that the expectation is a linear operator, together with the fact that for any random variable $X$, there is a point $\omega \in \Omega$ such that $X(\omega) \ge E(X)$ and a point $\omega' \in \Omega$ such that $X(\omega') \le E(X)$. This simple notion is very useful in general when applying the probabilistic method.

In general, the $r$th moment of $X$ is the expectation of $X^r$, and the variance of $X$ is $\mathrm{var}(X) = E(X^2) - E(X)^2$. The standard deviation of $X$ is $\sqrt{\mathrm{var}(X)}$, and we often write $\mathrm{var}(X) = \sigma^2$ and the standard deviation as $\sigma$. Amongst other things, we will use these as parameters to measure the concentration of a random variable $X$. One of the crucial properties of expectation we shall use is that it is a linear operator: if $X_1, X_2, \ldots, X_n$ are random variables, then

$$E(X_1 + X_2 + \cdots + X_n) = E(X_1) + E(X_2) + \cdots + E(X_n).$$

It is not true in general, however, that $E(XY) = E(X)E(Y)$.
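The unconditional validity of linearity, and the failure of $E(XY) = E(X)E(Y)$ under dependence, are easy to see by simulation. The following is a minimal sketch (not part of the original notes) taking $Y = X$ for a uniform $\pm 1$ variable $X$.

```python
import random

# A minimal sketch: linearity of expectation holds regardless of dependence,
# but E(XY) = E(X)E(Y) can fail for dependent variables.
random.seed(0)
trials = 100_000

sum_x = sum_xy = sum_x_plus_y = 0.0
for _ in range(trials):
    x = random.choice([-1, 1])   # uniform on {-1, +1}
    y = x                        # Y = X, maximally dependent on X
    sum_x += x
    sum_xy += x * y
    sum_x_plus_y += x + y

ex = sum_x / trials
# Linearity: E(X + Y) = E(X) + E(Y), both approximately 0.
print(sum_x_plus_y / trials, 2 * ex)
# But E(XY) = E(X^2) = 1, while E(X)E(Y) is approximately 0.
print(sum_xy / trials, ex * ex)
```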

2.1.2 Independence

Events $E_1, E_2, \ldots, E_n \in \mathcal{F}$ are said to be independent if for every subset $I \subseteq [n]$,

$$P\Big(\bigcap_{i \in I} E_i\Big) = \prod_{i \in I} P(E_i).$$

Random variables $X_1, X_2, \ldots, X_n$ are independent if the events $E_i = \{X_i \le x_i\}$ are independent for all choices of $x_1, \ldots, x_n$. There are many important theorems regarding independent events and random variables. For example, if $X_1, X_2, \ldots, X_n$ are independent random variables, then there are many things one can say about their sum $S_n = \sum_{i=1}^n X_i$, especially as $n$ tends to infinity. A special case of the central limit theorem is that if $P(X_i = 1) = P(X_i = -1) = \frac12$, then $n^{-1/2} S_n$ tends in distribution to a standard Gaussian. One of the basic facts concerning independent random variables $X_1, X_2, \ldots, X_n$ is that

$$E(X_1 X_2 \cdots X_n) = \prod_{i=1}^n E(X_i).$$

The reader should consult Williams for a succinct exposition of the notions and theorems concerning independence.

2.1.3 Conditional Expectation

For the probability space $(\Omega, \mathcal{F}, P)$ and a set $B \in \mathcal{F}$ of non-zero measure, the probability of $A \in \mathcal{F}$ given $B$ is $P(A \cap B)/P(B)$, which is written $P(A \mid B)$. We read this as "the conditional probability of $A$ given $B$" or, simply, "the probability of $A$ given $B$". This defines a probability measure $P_B$ on $(\Omega, \mathcal{F})$, which allows us to define a random variable from a given random variable $X$ by considering $X$ as a function on $(\Omega, \mathcal{F}, P_B)$. This random variable is denoted $X \mid B$. While it is true that for disjoint events $A$ and $B$, $P(A \cup B) = P(A) + P(B)$, the reader should easily come up with examples where $P(C \mid A \cup B) \ne P(C \mid A) + P(C \mid B)$. The expectation of this random variable is called the conditional expectation of $X$ given $B$, written $E(X \mid B)$. Furthermore, if $Y$ is a random variable, then $E(X \mid Y = y)$ is just $\sum_x x\,P(X = x \mid Y = y)$, the sum being over the range of $X$. If we do not specify the value of $Y$, then one obtains a random variable called the conditional expectation of $X$ given $Y$, which is denoted $E(X \mid Y)$. One of the main properties of this random variable is that its expectation can be computed from the formula

$$E(X) = E(E(X \mid Y)).$$

This is sometimes called the tower property of conditional expectation, and is fundamental to the definition of martingales. We will return to this important notion in greater depth at a later stage.

2.1.4 Basic Inequalities

We have already discussed some basic combinatorial inequalities involving binomial coefficients in Part I. There are a number of useful inequalities concerning probability and expectation of random variables. Perhaps the most commonly useful inequality is the Cauchy–Schwarz inequality, which, in its simplest form, states that $E(X)^2 \le E(X^2)$ for any random variable $X$. This is a special case of a much more general inequality known as Jensen's inequality: let $f$ be a convex function on an open interval $I \subseteq \mathbb{R}$, and let $X$ be a random variable such that $P(X \in I) = 1$ and $E(f(X))$ and $E(|X|)$ are both finite. Then

$$E(f(X)) \ge f(E(X)).$$
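Jensen's inequality is easy to see numerically. The following is a minimal sketch (not part of the original notes) with the convex function $f(x) = e^x$ and $X$ uniform on $(-1, 1)$.

```python
import random, math

# A minimal sketch: Jensen's inequality E(f(X)) >= f(E(X)) for convex f.
random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(100_000)]

mean = sum(xs) / len(xs)
e_fx = sum(math.exp(x) for x in xs) / len(xs)

print("E(f(X)) =", e_fx)            # about sinh(1) = 1.1752...
print("f(E(X)) =", math.exp(mean))  # about e^0 = 1
```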

A second important inequality is Hölder's inequality, which can be deduced from Jensen's inequality: let $p, q \ge 1$ be real numbers with $1/p + 1/q = 1$. Then

$$E(XY) \le E(X^p)^{1/p} E(Y^q)^{1/q}$$

when $X$ and $Y$ are non-negative random variables such that $E(X)$, $E(Y)$ and $E(XY)$ are all finite. This inequality is fundamental not only in probability theory, but also in functional analysis (e.g. duality of normed spaces).

We will also require some inequalities for real numbers. The Taylor series of a single-variable function $f$ about zero, when it exists, is defined by

$$\sum_{k=0}^\infty f^{(k)}(0) \frac{x^k}{k!}.$$

One can truncate the Taylor series at some term to obtain an approximation to the function $f$. This idea allows us to obtain several useful inequalities concerning real numbers. Two very familiar Taylor series are

$$e^x = \sum_{k=0}^\infty \frac{x^k}{k!} \qquad \text{and} \qquad \ln(1+x) = \sum_{k=0}^\infty \frac{(-1)^k x^{k+1}}{k+1},$$

where the second is valid only for $-1 < x \le 1$. Using these series, it is fairly straightforward to prove some of the following inequalities for positive real numbers $x_i$ (the proofs are left as exercises):

1. $\prod_{i=1}^n (1 - x_i) \le \exp\left(-\sum_{i=1}^n x_i\right)$
2. $\left(\prod_{i=1}^n x_i\right)^{1/n} \le \frac{1}{n} \sum_{i=1}^n x_i$

It is always useful to remember

$$\Big(1 - \frac{1}{n}\Big)^n < \frac{1}{e} < \Big(1 - \frac{1}{n}\Big)^{n-1}.$$

More inequalities may be deduced by converting sums to integrals. We know that a convergent Riemann integral is the limiting value of a Riemann sum. Thus, for example,

$$\int_0^1 f(t) \, dt = \lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^n f(k/n)$$

for continuous bounded functions $f$ on $[0,1]$. Another trick is to convert the sum directly to an integral. For instance, if $f_k^+$ denotes the maximum value of $f$ on the interval $[k, k+1]$ and $f_k^-$ the minimum value, then clearly

$$\sum_{k=1}^{n-1} f_k^- \le \int_1^n f(t) \, dt \le \sum_{k=1}^{n-1} f_k^+.$$

Another thing to note is that sums over subsets frequently arise: for example one might recognise the identity

$$\prod_{i=1}^n (1 + x_i) - 1 = \sum_{\emptyset \ne S \subseteq [n]} \prod_{i \in S} x_i,$$

where the sum is over all non-empty subsets $S$ of $[n]$. The expression on the left is much more manageable to estimate using preceding inequalities. But suppose the sum on the right is only over subsets of $[n]$ of size $k$. Then the product on the left is still an upper bound for that sum, but often not a good one. To fix this we introduce a weight $\alpha > 0$ as follows:

$$\sum_{|S| = k} \prod_{i \in S} x_i = \alpha^{-k} \sum_{|S| = k} \prod_{i \in S} (\alpha x_i) \le \alpha^{-k} \prod_{i=1}^n (1 + \alpha x_i).$$

Having an estimate for the product then allows us to minimize the result over $\alpha$. With a bit of luck, an appropriate choice of $\alpha$ will give a good estimate. There are many more analytic techniques for evaluating sums, but we will not give them here. The reader may wish to consult some texts on generating functions.
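The effect of the weight $\alpha$ is easy to see numerically. The following is a minimal sketch (not part of the original notes) comparing the size-$k$ subset sum with the bound $\alpha^{-k}\prod_i(1 + \alpha x_i)$ at $\alpha = 1$ and at a numerically optimized $\alpha$; the values of the $x_i$ are arbitrary illustrative choices.

```python
import itertools, math

# A minimal sketch: the weighting trick for size-k subset sums.
x = [0.3, 0.1, 0.25, 0.05, 0.2, 0.15]
k = 3

# Exact value: the k-th elementary symmetric sum of the x_i.
e_k = sum(math.prod(s) for s in itertools.combinations(x, k))

def bound(alpha):
    return alpha ** (-k) * math.prod(1 + alpha * xi for xi in x)

# Crude one-dimensional search over the weight alpha > 0.
best = min(bound(a / 100) for a in range(1, 2000))

print(e_k, bound(1.0), best)  # the optimized bound is much tighter than alpha = 1
```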

5 over subsets of [n] of size k. Then the product on the left is still an upper bound for that sum, but often not a good one. To fix this we introduce a weight α > 0 as follows: x i = α k (αx i ) α k S =k i S S =k i S n (1 + αx i ). Having an estimate for the product then allows us to minimize the result over α. With a bit of luck, an appropriate choice of α will give a good estimate. There are many more analytic techniques for evaluating sums, but we will not give them here. The reader may wish to consult some texts on generating functions Classical Distributions In most of our work, the same probability distributions will recur. These are, principally, the binomial distribution, the Poisson distribution, and the normal distribution (or Gaussian distribution. These are respectively defined by x ( ) n P (X x) = p t (1 p) n t t P (X x) = P (X x) = t=0 x t=0 x e λ λ t t! 1 2πσ e 1 2σ 2 (t µ)2 One should recall the meaning of the parameters in each of these distributions, for example the rate of the Poisson process is the parameter λ featuring in the second distribution function above, and the mean and variance of a Gaussian random variable with the distribution in the third line are µ and σ respectively. Other useful distributions in our work are the geometric distribution, the negative binomial distribution, the exponential distribution and the hypergeometric distribution. A short account of the basic facts concerning these distributions, presented in a way that is tailored for the material to follow, may be found in Bollobás book on random graphs. The moment generating function of a random variable X is denoted by M X (t) and defined by M X (t) = E(e tx ). The probability generating function of a random variable X is denoted by G X (t) and defined by G X (t) = E(t X ). The names assigned to these functions are natural, in the sense, for example, that the rth derivative of M X (t) evaluated at zero is precisely the rth moment of X. Finally, the characteristic function ϕ X of X is the complex-valued function defined by ϕ X (t) = E(e itx ) = e itx df X (x) where F X is the cumulative distribution function of X. A fundamental theorem here is Lévy s convergence theorem: Theorem 1 Let ϕ n be the characteristic function of a distribution function F n, for n N, and suppose that ϕ(t) = lim n ϕ n (t) exists and is continuous for any real number t. Then there is a distribution function F of which ϕ is the characteristic function. One of the consequences of this theorem is the famous central limit theorem, which we state in the next chapter. 4 R dt.

2.2 Markov's and Chebyshev's Inequalities

In the present chapter, we start to develop some tools from probability theory which, although remaining simple, allow us to increase the breadth of applicability of the probabilistic method. The main theme is that of concentration: in many situations, one is required to know not only the expectation of a random variable, but also how far the random variable deviates from its expectation. Therefore most of the inequalities we develop will be collected under the title of concentration inequalities.

Two inequalities which are applicable regardless of the distribution of our random variable are Markov's and Chebyshev's inequalities. Based only on the variance and the mean of a random variable $X$, Markov's and Chebyshev's inequalities tell us something about the concentration of $X$ around its mean. Throughout what follows, $\sigma^2$ denotes the variance of a random variable $X$ and $\sigma$ denotes the standard deviation of $X$.

Markov's inequality follows from the simple fact that if $X$ is a non-negative random variable and $\lambda > 0$, then $X \ge \lambda I_{X \ge \lambda}$, so

$$\lambda P(X \ge \lambda) = E(\lambda I_{X \ge \lambda}) \le E(X).$$

Here $I_{X \ge \lambda}$ denotes the indicator function of the event $X \ge \lambda$: $I_{X \ge \lambda}(\omega) = 0$ if $X(\omega) < \lambda$ and $I_{X \ge \lambda}(\omega) = 1$ if $X(\omega) \ge \lambda$. Therefore we obtain Markov's inequality:

$$P(X \ge \lambda E(X)) \le \frac{1}{\lambda}.$$

To obtain Chebyshev's inequality, replace $X$ with the non-negative random variable $(X - E(X))^2$. Then we obtain

$$P(|X - E(X)| \ge \lambda\sigma) \le \frac{1}{\lambda^2}.$$

If $E(X) = \mu \ne 0$, then taking $\lambda = \mu/\sigma$ (and noting that $X = 0$ implies $|X - \mu| \ge \mu$) this reduces to

$$P(X = 0) \le \frac{\sigma^2}{\mu^2}.$$

Markov's and Chebyshev's inequalities have many applications in combinatorics and elsewhere, due to their generality. Later we will see that for the distributions we are interested in, much stronger concentration inequalities can be found.
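The following minimal sketch (not part of the original notes) compares both inequalities with the empirical tails of a binomial random variable; the bounds are valid but, as expected, far from tight here.

```python
import random

# A minimal sketch: Markov's and Chebyshev's bounds versus empirical tails
# of a Binomial(n, 1/2) random variable.
random.seed(1)
n, trials = 100, 20_000
mu, sigma = n / 2, (n / 4) ** 0.5            # mean and standard deviation

samples = [sum(random.random() < 0.5 for _ in range(n)) for _ in range(trials)]

lam = 1.2                                    # Markov: P(X >= lam*mu) <= 1/lam
tail = sum(x >= lam * mu for x in samples) / trials
print("P(X >= 1.2 mu) =", tail, "  Markov bound:", 1 / lam)

k = 2.0                          # Chebyshev: P(|X - mu| >= k*sigma) <= 1/k^2
dev = sum(abs(x - mu) >= k * sigma for x in samples) / trials
print("P(|X - mu| >= 2 sigma) =", dev, "  Chebyshev bound:", 1 / k ** 2)
```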

2.2.1 Subset sums

We give a simple application of Chebyshev's inequality in combinatorial number theory. Although it does not give much better than a straightforward counting argument, it is how the inequality is used that should be retained: at first impression there is no probability in sight. The basic problem is this: what is the largest size of a subset $A$ of $[n]$ such that no two non-empty subsets of $A$ have the same sum? Since there are $2^k$ subset sums in a set of size $k$, all of them integers between $0$ and $nk$, we must have $2^k \le nk + 1$, giving the upper bound $k \le \log_2 n + \log_2\log_2 n + 1$. A modest improvement of the second-order term is given by Chebyshev's inequality:

Theorem 2. The maximum number of elements of $[n]$ that can be chosen so that all subsets have distinct sums is between $\log_2 n$ and $\log_2 n + \frac12 \log_2\log_2 n + 3$.

Proof. Since the sequence of powers of two in $[n]$ has distinct subset sums, the lower bound is proved. We can do better than counting using second moments. Let $A = \{x_1, x_2, \ldots, x_k\}$ be a subset of $[n]$ in which all subset sums are distinct, and assign a random weight $\varepsilon_i \in \{0,1\}$ to $x_i$, uniformly and independently for all $i \in [k]$. If $X = \sum_{i=1}^k \varepsilon_i x_i$, then

$$E(X) = \frac12 \sum_{i=1}^k x_i \qquad \text{and} \qquad \mathrm{var}(X) = \frac14 \sum_{i=1}^k x_i^2.$$

This last sum is clearly at most $kn^2/4$, since each $x_i$ is an element of $[n]$; in particular $\sigma \le n\sqrt{k}/2$. Take a real number $\lambda > 1$. By Chebyshev's inequality,

$$P\big(|X - E(X)| < \lambda n\sqrt{k}/2\big) \ge 1 - \frac{1}{\lambda^2}.$$

Now (this is a key point in the proof) the probability $P(X = x)$ is either zero or $2^{-k}$, since no pair of distinct subsets of $A$ has the same sum, by assumption. As the interval $(E(X) - \lambda n\sqrt{k}/2, \, E(X) + \lambda n\sqrt{k}/2)$ contains fewer than $\lambda n\sqrt{k} + 1$ integers, we get

$$P\big(|X - E(X)| < \lambda n\sqrt{k}/2\big) \le 2^{-k}\big(\lambda n\sqrt{k} + 1\big).$$

So if $\lambda = 2$, then $2^{k-2} \le n\sqrt{k}$, which gives the required bound on $k$. $\square$

The problem of determining whether there is a constant $c$ so that any subset of $[n]$ whose subset sums are distinct has size at most $\log_2 n + c$ is one of Erdős' oldest problems, dating back to the 1930s.
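As a quick illustration (a sketch, not part of the original notes), one can test small sets for the distinct-subset-sum property by brute force; the set $\{3,5,6,7\} \subseteq [7]$ shows that powers of two are not optimal.

```python
from itertools import chain, combinations

# A minimal sketch: test whether all non-empty subsets of a set have
# distinct sums.
def distinct_subset_sums(a):
    sums = [sum(s) for s in chain.from_iterable(
        combinations(a, r) for r in range(1, len(a) + 1))]
    return len(sums) == len(set(sums))

print(distinct_subset_sums([1, 2, 4, 8]))  # powers of two: True
print(distinct_subset_sums([3, 5, 6, 7]))  # True: four elements of [7],
                                           # one more than {1, 2, 4}
print(distinct_subset_sums([2, 3, 4, 5]))  # False: 2 + 5 = 3 + 4
```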

2.2.2 Large Chromatic Number and Girth

The chromatic number of a graph $G$ is the minimum number of colours which can be assigned to the vertices of $G$ so that no two adjacent vertices are assigned the same colour. For example, the complete graph on $n$ vertices has chromatic number $n$, and bipartite graphs have chromatic number at most two. The famous Four Colour Theorem asserts that planar graphs have chromatic number at most four. The chromatic number of a graph is a notoriously hard parameter to deal with. It seems to depend globally on the structure of the graph, and the following result of Erdős shows that even if a graph is locally very sparse, it can still have high chromatic number. The proof of this result is a slightly more subtle alteration of the construction of dense graphs of large girth in Part I.

Theorem 3. For every pair of numbers $g, k$, there is a graph of chromatic number at least $k$ in which every cycle has length greater than $g$.

Proof. We select each edge of the complete graph on $n$ vertices independently with probability $p = n^{\gamma-1}$, where $0 < \gamma < 1/g$, to obtain a random graph $G$. We let $n \to \infty$ throughout the proof. The probability that $(v_1, v_2, \ldots, v_l, v_1)$ is a cycle in $G$ is clearly $p^l$, so the expected number of cycles $X$ of length at most $g$ in $G$ is exactly

$$E(X) = \sum_{l=3}^{g} \frac{n(n-1)(n-2)\cdots(n-l+1)}{2l}\, p^l \le \sum_{l=3}^{g} \frac{n^{\gamma l}}{2l} = o(n),$$

since $\gamma g < 1$. So by Markov's inequality, $P(X > n/2) \to 0$. Now let $a = \lceil 3p^{-1}\ln n \rceil$ and let $Y$ denote the number of sets of $a$ vertices of $G$ with no edges between them (such a set is called an independent set or stable set of $G$). Then

$$E(Y) = \binom{n}{a}(1-p)^{\binom{a}{2}} \to 0,$$

so Markov's inequality shows $P(Y \ge 1) \to 0$. Therefore, for large $n$, there exists a specific graph $G$ for which $X \le n/2$ and $Y = 0$. Now remove from $G$ one vertex of every cycle of length at most $g$, to get a graph $H$ with at least $\frac12 n$ vertices, with no cycle of length at most $g$, and with no stable set of size $a$. Now we make the following observation: if a graph on $m$ vertices has no stable set of size $a$, then its chromatic number is more than $m/a$. This follows from the fact that the vertices of any one colour form a stable set, so each colour class has fewer than $a$ vertices, and the number of colours must therefore be more than $m/a$. Applying this observation in $H$, we see that the chromatic number of $H$ is at least

$$\frac{n}{2a} \ge \frac{n^\gamma}{6\ln n}.$$

If $n$ is large enough, this is as large as we wish. $\square$

Explicit constructions of graphs of large chromatic number and girth (the girth of a graph containing a cycle is the length of its shortest cycle) were first given by Lovász. Since then, many other constructions have been found; in particular, certain Ramanujan graphs of Lubotzky, Phillips and Sarnak give, for arbitrary $k$, $n$-vertex graphs of girth $\Theta(\log n)$ and chromatic number larger than $k$. All this indicates the difficulty in dealing with colouring: even if the graph is locally a tree (the case for graphs of large girth), the chromatic number may still be large. Many other results show that the chromatic number of a graph appears to be a global property.

2.3 The Chernoff Bound

One of the most fundamental theorems in probability is the central limit theorem. It is a statement about the convergence in distribution of sums of many independent random variables to a Gaussian or normal distribution, based on mild assumptions on the moments of the random variables. There are many versions of this theorem, of which we will state only one. The theorem is as follows; here $Y$ denotes a standard Gaussian random variable, and we write "$X_i$ i.i.d." as shorthand for "the random variables $X_i$ are independently and identically distributed".

Theorem 4. If $X_1, X_2, \ldots, X_n$ are independent random variables with means $\mu_i$ and bounded variances $\sigma_i^2$, then with $S_n = \sum_{i=1}^n X_i$, $\mu_n = \sum_{i=1}^n \mu_i$ and $\sigma_n^2 = \sum_{i=1}^n \sigma_i^2$,

$$\frac{S_n - \mu_n}{\sigma_n} \xrightarrow{d} Y.$$

See Feller's probability theory book or Kallenberg's introduction to probability for more general versions, the precise hypotheses (a Lindeberg-type condition is needed in this generality), and proofs of this theorem. The standard proof of the central limit theorem involves an application of Lévy's convergence theorem. We shall use the central limit theorem (actually a minor modification of it) to prove the Erdős–Kac theorem on prime divisors in a later section.

Chebyshev's inequality gives a polynomial bound on the probability that a random variable is a certain number of standard deviations away from its expectation. The central limit theorem sometimes provides an exponentially small bound for these so-called tail events or large deviations. For example, when $\lambda$ is fixed, we have

$$P(|S_n - \mu_n| \ge \lambda\sigma_n) \to P(|Y| \ge \lambda) \le e^{-\frac12\lambda^2}.$$
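The following minimal sketch (not part of the original notes) illustrates this Gaussian tail bound for sums of independent $\pm 1$ variables; here $\mu_n = 0$ and $\sigma_n = \sqrt{n}$.

```python
import random, math

# A minimal sketch: the tail P(|S_n| >= lam * sqrt(n)) for a sum of n
# independent +-1 variables, against the Gaussian bound e^{-lam^2/2}.
random.seed(2)
n, trials, lam = 100, 20_000, 2.0

count = 0
for _ in range(trials):
    s = sum(random.choice((-1, 1)) for _ in range(n))
    if abs(s) >= lam * math.sqrt(n):
        count += 1

print("empirical tail:", count / trials)           # about 0.046
print("bound e^{-lam^2/2}:", math.exp(-lam * lam / 2))  # about 0.135
```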

This is fine when $\lambda$ is constant, but when $\lambda$ depends on $n$, then the quality of convergence to normality becomes important. It is certainly possible to find random variables where the convergence in distribution is very slow. However, when the $X_i$ are independent with Bernoulli distributions, then $X = \sum X_i$ has a binomial distribution (when the means are equal), and a bound can be obtained for all $n$, which is called the Chernoff bound (1952):

Theorem 5. Let $X_i$, $i = 1, 2, \ldots, n$, be independent random variables with Bernoulli distributions with means $p_i$, $i \in [n]$, and let $p = \frac{1}{n}\sum_{i=1}^n p_i$ and $h \ge 0$. Then

$$P(S_n > (p+h)n) \le e^{-h^2 n/2a}, \qquad P(S_n < (p-h)n) \le e^{-h^2 n/2b},$$

where $S_n = X_1 + X_2 + \cdots + X_n$, $a$ is the maximum of $\alpha(1-\alpha)$ for $p \le \alpha \le p+h$, and $b$ is the maximum of $\beta(1-\beta)$ for $p-h \le \beta \le p$. In particular, if $X$ has binomial distribution with probability $p$ and mean $pn$, then for any $0 < \varepsilon < 1$,

$$P(|X - pn| > \varepsilon pn) < 2e^{-\varepsilon^2 pn/3}.$$

Proof. Let $t, \gamma > 0$ be real numbers, let $S = S_n$, and write

$$P(S \ge \gamma n) = P(e^{tS} \ge e^{t\gamma n}) \le E(e^{tS})\, e^{-t\gamma n} = M_S(t)\, e^{-t\gamma n}.$$

In the second step we used Markov's inequality. Recall that the moment generating function of a random variable $X$ is $M_X(t) := E(e^{tX})$. Now the moment generating function $M_S(t)$ is exactly the product of the moment generating functions $M_{X_i}(t)$, since the $X_i$ are independent random variables. Since $M_{X_i}(t) = 1 - p_i + p_i e^t$, we have

$$P(S \ge \gamma n) \le e^{-t\gamma n} \prod_{i=1}^n (1 - p_i + p_i e^t) \le e^{-t\gamma n} \big((1-p) + pe^t\big)^n.$$

In the last step we used the arithmetic mean-geometric mean inequality. Minimizing this expression over $t \in [0, \infty)$, we obtain a minimum of

$$\Big(\frac{p}{\gamma}\Big)^{\gamma n} \Big(\frac{1-p}{1-\gamma}\Big)^{(1-\gamma)n} = e^{-I(\gamma) n} \qquad \text{when} \qquad e^t = \frac{\gamma(1-p)}{p(1-\gamma)},$$

where $I(\gamma) = \gamma \ln\frac{\gamma}{p} + (1-\gamma)\ln\frac{1-\gamma}{1-p}$. Now we put $\gamma = p + h$ and use some simple estimates from first-year calculus to get the result. The second statement of concentration is proved similarly (or by considering complementary events). To get the last statement of the theorem, take $\gamma = (1+\varepsilon)p$ and $\gamma = (1-\varepsilon)p$ in the bound $e^{-I(\gamma)n}$, note that more first-year calculus gives $I((1+\varepsilon)p) \ge \varepsilon^2 p/3$ and $I((1-\varepsilon)p) \ge \varepsilon^2 p/2$ for $0 < \varepsilon < 1$, and add the two bounds together. $\square$

There are many other forms of the Chernoff bound. In Assignment 2, you are asked to prove the following inequality: if $X$ is a sum of independent random variables $X_i$ where $|X_i| \le 1$ and $E(X_i) = 0$, then for $0 \le \lambda \le 2\sigma$,

$$P(|X| > \lambda\sigma) \le 2e^{-\frac14\lambda^2},$$

where $\sigma$ is the standard deviation of $X$. In the next few subsections we give some applications of the Chernoff bound.
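Before turning to the applications, here is a small numerical sanity check (not part of the original notes) of the optimization step in the proof: the stated choice of $t$ minimizes $e^{-t\gamma}(1 - p + pe^t)$, and the minimum equals $e^{-I(\gamma)}$.

```python
import math

# A minimal sketch: e^t = gamma(1-p)/(p(1-gamma)) minimizes
# f(t) = e^{-t*gamma} (1 - p + p e^t), the per-variable Chernoff factor.
p, gamma = 0.3, 0.45

def f(t):
    return math.exp(-t * gamma) * (1 - p + p * math.exp(t))

t_star = math.log(gamma * (1 - p) / (p * (1 - gamma)))
grid_min = min(f(k / 1000) for k in range(0, 5000))   # search t in [0, 5]

print(f(t_star), grid_min)   # agree to the grid resolution

# f(t_star) equals e^{-I(gamma)} with I the relative entropy:
I = gamma * math.log(gamma / p) + (1 - gamma) * math.log((1 - gamma) / (1 - p))
print(math.exp(-I))
```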

2.3.1 Triangles in Random Graphs

A very natural instance of the binomial distribution comes from random graphs. We consider the sample space $\Omega_n$ of all graphs on $n$ vertices, with probability measure

$$P_p(G) = p^{e(G)} (1-p)^{\binom{n}{2} - e(G)}.$$

Since we can view $\Omega_n$ as a product of $\binom{n}{2}$ copies of $\{0,1\}$ (one probability space for each edge), the probability measure $P_p$ is a product probability measure. In words, edges of a random graph $\omega \in \Omega_n$ appear independently with probability $p$. For simplicity, we refer to a random graph as $G_{n,p}$ if it is taken from the sample space $\Omega_n$ with probability measure $P_p$. So the number of edges in a random graph has a binomial distribution, and we may apply the Chernoff bound. When we get to the section on random graphs, we'll look in detail at their structure. For now, as an example of the Chernoff bound, we give another structural property of random graphs: every pair of vertices has roughly the same number of common neighbours.

Theorem 6. Let $\Delta(u,v)$ be the number of triangles in $G_{n,p}$ containing an edge $\{u,v\} \in E(G_{n,p})$. If $p^2 n/\log n \to \infty$, then a.a.s.

$$\Delta(u,v) \sim p^2 n \quad \text{for every edge } \{u,v\}.$$

Proof. It is sufficient to prove that a.a.s. every pair $\{u,v\}$ of vertices of $G_{n,p}$ has codegree $d(u,v) \sim p^2 n$, where the codegree of $u$ and $v$ is the number of common neighbours of $u$ and $v$ (vertices adjacent to both $u$ and $v$). Let $\mu = p^2(n-2)$ and let $X = d(u,v)$ for a fixed pair of vertices $\{u,v\}$. Note that $\mu = E(X)$. For $w \notin \{u,v\}$, let $X_w = 1$ if $uw, vw \in G_{n,p}$ and $X_w = 0$ otherwise. Then $X_w$ has Bernoulli distribution with probability $p^2$. More importantly, the random variables $X_w$ are independent. Now $X = \sum_{w \ne u,v} X_w$, so we can apply the Chernoff bound. If $0 < \varepsilon < 1$ and $n$ is large enough, then

$$P(|X - \mu| > \varepsilon\mu) < 2e^{-\varepsilon^2\mu/3} < \frac{1}{n^3}.$$

Here we used the assumption $p^2 n/\log n \to \infty$. It follows that the expected number of pairs $u, v$ with $|X - \mu| > \varepsilon\mu$ is less than $\frac{1}{n}$, which, by Markov's inequality, proves the theorem. $\square$

We wrote the statement of the theorem in fairly succinct form. Another way to write it is as follows (it should be clear why we chose the succinct form!): for all real numbers $\delta, \varepsilon > 0$, there exist positive integers $M = M(\varepsilon, \delta)$ and $N = N(\varepsilon, \delta)$ such that for every integer $n > N$, if $p : \mathbb{N} \to [0,1]$ is a function satisfying $p(n)^2 > M(\log n)/n$, then

$$P\big(\forall u, v \in G_{n,p} : |d(u,v) - p^2 n| < \delta p^2 n\big) > 1 - \varepsilon.$$
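The following minimal sketch (not part of the original notes) samples one graph $G_{n,p}$ and computes all codegrees directly; the chosen $n$ and $p$ are illustrative values with $p^2$ well above $(\log n)/n$.

```python
import random

# A minimal sketch: codegrees in G(n, p) concentrate around p^2 * n.
random.seed(3)
n, p = 300, 0.3                  # p^2 * n = 27, while log(n)/n is about 0.019

adj = [[False] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        if random.random() < p:
            adj[i][j] = adj[j][i] = True

codegrees = [sum(adj[i][w] and adj[j][w] for w in range(n))
             for i in range(n) for j in range(i + 1, n)]

print("p^2 n =", p * p * n)
print("min/max codegree over all pairs:", min(codegrees), max(codegrees))
```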

2.3.2 Upsets in Tournaments

We consider another striking example of an application of the Chernoff bound, concerning tournaments. Let $\sigma$ be a permutation of the $n$ letters $\{1, 2, \ldots, n\}$ and let $T$ be a tournament with players $\{1, 2, \ldots, n\}$. The permutation tells us which players should beat which, i.e. it is a ranking of the players. Accordingly, the game involving players $i$ and $j$ is called an upset if $i$ beats $j$ and $\sigma(i) < \sigma(j)$. The question is: in a given tournament, is there a ranking of the players that assures few upsets? For example, can we make sure that at most $\frac13$ of the games are upsets? It turns out that there are tournaments for which every ranking fails in this regard, as proved by Erdős and Moon (1963).

Theorem 7. There exists a tournament $T$ such that, for every ranking $\sigma$, the difference $D(\sigma)$ between the number of non-upsets and the number of upsets is at most $2n^{3/2}(\log n)^{1/2}$.

Proof. We may assume $n \ge 3$. We take a random tournament, in which the winner of each game is decided by an independent fair coin flip; then, for a fixed ranking $\sigma$ of $\{1, 2, \ldots, n\}$, each game is an upset with probability $\frac12$, independently of all other games. We have $D(\sigma) = \sum_e X_e$, where $X_e = -1$ if $e$ represents an upset and $X_e = 1$ otherwise. If we rescale the random variables $X_e$ using $Y_e = \frac12(X_e + 1)$, then the sum of the $Y_e$'s satisfies the requirements of Theorem 5 with $N = \binom{n}{2}$ variables of mean $\frac12$; applying that theorem with $hN = n^{3/2}(\log n)^{1/2}$ and $a \le \frac14$ gives

$$P\big(D(\sigma) > 2n^{3/2}(\log n)^{1/2}\big) \le e^{-2h^2 N} = e^{-4n^2\log n/(n-1)} < n^{-(n-1)} < \frac{1}{n!},$$

where the last inequality holds since $n! < n^{n-1}$ for $n \ge 3$. Since there are $n!$ possible rankings of our tournament, the expected number of rankings for which $D(\sigma) > 2n^{3/2}(\log n)^{1/2}$ is less than one. Therefore there is a tournament for which every ranking $\sigma$ produces $D(\sigma) \le 2n^{3/2}(\log n)^{1/2}$. $\square$

De la Vega showed, using more advanced techniques, that we can find tournaments of $n$ players such that $D(\sigma) = O(n^{3/2})$ for every ranking $\sigma$, and this is best possible apart from the implicit constant.

2.3.3 Max Cut

A cut in a graph $G$ is the set of edges of $G$ with one end in each part of a partition $(X, Y)$ of the vertex set of $G$. It is not hard to show that in every graph $G$, there is a cut with at least $\frac12 e(G)$ edges. Indeed, suppose $G$ has $2n$ vertices, and let $e(X,Y)$ be the number of edges in a cut $(X,Y)$ where $|X| = |Y| = n$. Each edge of $G$ is separated by exactly $\binom{2n-2}{n-1}$ of the $\frac12\binom{2n}{n}$ equipartitions, so

$$\sum_{(X,Y)} e(X,Y) = \binom{2n-2}{n-1}\, e(G),$$

and there exists an equipartition $(X,Y)$ for which

$$e(X,Y) \ge \frac{2\binom{2n-2}{n-1}}{\binom{2n}{n}}\, e(G) = \frac{n}{2n-1}\, e(G),$$

which implies $e(X,Y) \ge \frac12 e(G)$ for some $(X,Y)$. A slightly weaker result is obtained by choosing a set $X$ of vertices at random, including each vertex of $G$ independently with probability $\frac12$. Then the expected number of edges between $X$ and $Y = V(G) \setminus X$ is $\frac12 e(G)$, so there must be an $X$ for which $e(X,Y) \ge \frac12 e(G)$.
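The random argument in the last paragraph is easily simulated. The following minimal sketch (not part of the original notes) cuts a random graph with a uniformly random bipartition; the ratio of cut edges to all edges comes out close to $\frac12$.

```python
import random

# A minimal sketch: a uniformly random bipartition cuts about half of the
# edges of a graph (here a sample of G(n, p)).
random.seed(5)
n, p = 200, 0.1
edges = [(i, j) for i in range(n) for j in range(i + 1, n)
         if random.random() < p]

side = [random.random() < 0.5 for _ in range(n)]   # independent coin flips
cut = sum(side[i] != side[j] for i, j in edges)

print("edges:", len(edges), " cut:", cut, " ratio:", cut / len(edges))
```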

A natural question is whether we can always find substantially larger cuts in a graph. We give a very natural application of the Chernoff bound to show that this is not the case.

Theorem 8. For all $\varepsilon > 0$, there exist graphs on $2n$ vertices of average degree $d \ge (12\log 2)/\varepsilon^2$ such that for every equipartition of the vertices into two classes $X$ and $Y$, the number of edges between $X$ and $Y$ is at least $(1-\varepsilon)\frac{dn}{2}$ and at most $(1+\varepsilon)\frac{dn}{2}$.

Proof. We take a random graph: let $p = \frac{d}{2n}$ and let $G$ be a graph in which each edge of $K_{2n}$ is present independently with probability $p$. For a given equipartition $(X,Y)$ of the vertices of $G$, the number of edges between $X$ and $Y$ is a sum of $n^2$ independent Bernoulli variables, with expectation $pn^2 = \frac{dn}{2}$. The Chernoff bound provides us with the necessary concentration: the probability that the number of edges between $X$ and $Y$ is less than $(1-\varepsilon)\frac{dn}{2}$ or more than $(1+\varepsilon)\frac{dn}{2}$ is at most $2e^{-\varepsilon^2 dn/6}$. So the expected number of equipartitions $(X,Y)$ with fewer than $(1-\varepsilon)\frac{dn}{2}$ or more than $(1+\varepsilon)\frac{dn}{2}$ edges between $X$ and $Y$ is at most

$$\frac12\binom{2n}{n} \cdot 2e^{-\varepsilon^2 dn/6} < e^{n\log 4 - \varepsilon^2 dn/6}.$$

Here we used that $\binom{2n}{n} < 2^{2n}$. Since $d \ge (12\log 2)/\varepsilon^2$, this is less than one. Therefore there exists a graph on $2n$ vertices for which every equipartition has between $(1-\varepsilon)\frac{dn}{2}$ and $(1+\varepsilon)\frac{dn}{2}$ edges. $\square$

The max cut problem is one of the central problems in combinatorial optimization. While it is NP-hard to determine the exact size of a largest cut in a graph, one may ask for efficient algorithms which give a good approximation to the maximum cut. One of the most important results in approximation algorithms, due to Goemans and Williamson, states that the maximum cut can be approximated to a factor of about eighty-eight percent. A recent result of Khot, Kindler, Mossel and O'Donnell shows that this is best possible, based on a controversial conjecture called the unique games conjecture.
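The following minimal sketch (not part of the original notes) illustrates the concentration behind Theorem 8: sampling many random equipartitions of a random graph, all observed cut sizes stay close to $dn/2$. A simulation can of course only sample equipartitions, not check all of them.

```python
import random

# A minimal sketch: cut sizes of random equipartitions of G(2n, p) with
# p = d/(2n) concentrate around dn/2.
random.seed(7)
n, d = 100, 40
p = d / (2 * n)

edges = [(i, j) for i in range(2 * n) for j in range(i + 1, 2 * n)
         if random.random() < p]

verts = list(range(2 * n))
cuts = []
for _ in range(1000):                 # sample 1000 random equipartitions
    random.shuffle(verts)
    x = set(verts[:n])
    cuts.append(sum((i in x) != (j in x) for i, j in edges))

print("dn/2 =", d * n // 2)
print("min/max sampled cut:", min(cuts), max(cuts))
```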

2.4 Classical Bases and the Borel–Cantelli Lemma

A set $A$ of non-negative integers is called a basis of order $k$ for a set $S \subseteq \mathbb{Z}$ if every element of $S$ may be written as a sum of $k$ elements of $A$. There are some very famous open problems concerning bases: for example, the Goldbach conjecture is that the set of all primes forms a basis of order two for the even integers greater than two. Fermat's Last Theorem (proved by Wiles) states that the set of $k$th powers is not a basis of order two for itself when $k > 2$. On the positive side, Vinogradov showed that every sufficiently large odd integer is a sum of three primes. A very general result on bases uses the following definition of density: the Schnirelmann density of a set $A \subseteq \mathbb{Z}$ is given by

$$\sigma(A) = \inf_{n \ge 1} \frac{|A \cap [n]|}{n}.$$

A fundamental theorem in number theory states that if $\sigma(A) > 0$ and $0 \in A$, then $A$ is a basis of order $k$ for the non-negative integers, for some positive integer $k$. Schnirelmann density is discussed at length in the book of Halberstam and Roth on sequences.

It is of interest in number theory to find bases which do not contain many elements. Some early examples include the set of triangular numbers (Gauss's theorem), the set of squares (Lagrange's theorem) and the set of $k$th powers, proved by Hilbert (this is known generally as Waring's problem). It is easy to see that a basis of order $k$ cannot contain too few elements of $[n]$: at most $|A|^k$ integers are a sum of $k$ elements of $A$, and since every integer in $[n]$ is a sum of $k$ elements of $A$, $|A| \ge n^{1/k}$.

Erdős and Turán conjectured (1941) that there exists no basis $A$ of order $k > 1$ such that every integer may be written as a sum of $k$ elements of $A$ in at most a constant number of ways. This is particular to the integers; indeed, in the cyclic group $\mathbb{Z}_{q^2+q+1}$, when $q$ is a prime power, one can construct perfect difference sets: a set $A$ such that every non-zero $x \in \mathbb{Z}_{q^2+q+1}$ can be written in exactly one way as a difference of two elements of $A$.

In this subsection, we are interested in finding a basis $A$ of order two which has the property that every integer in $[n]$ can be written in only a few ways as a sum of two elements of $A$ (such an $A$ is sometimes called a thin basis). We need the following lemma, known in probability theory as the first Borel–Cantelli lemma.

2.4.1 The Borel–Cantelli Lemma

For events $A_1, A_2, \ldots$, the event that $A_n$ occurs infinitely often, written sometimes as $\{A_n \text{ i.o.}\}$ or $\limsup A_n$, is defined by

$$\{A_n \text{ i.o.}\} := \bigcap_{i \ge 1} \bigcup_{j \ge i} A_j.$$

Lemma 9. Let $A_1, A_2, \ldots$ be a sequence of events with $\sum_{i \ge 1} P(A_i)$ finite. Then the probability that $A_n$ occurs infinitely often is zero.

Proof. For any positive integer $i$,

$$P(A_n \text{ i.o.}) \le P\Big(\bigcup_{j \ge i} A_j\Big) \le \sum_{j \ge i} P(A_j).$$

Since the sum of the $P(A_i)$ is finite, the right-hand side tends to zero as $i \to \infty$. $\square$

2.4.2 Back to bases

The following technical lemma is needed:

Lemma 10. Let $f(x) = \big(\frac{\log x}{x}\big)^{1/2}$ on $[n]$ and let $S_n = \sum f(x)f(y)$, where the sum is over all pairs $(x,y)$ with $x + y = n$ and $x, y \in [n]$. Then, as $n$ tends to infinity, $S_n \sim \pi \log n$.

Proof. Let $N = n/2$ and $t = x/N$. By symmetry, we may write

$$S_n = 2\sum_{0 \le x < N} f(N+x)f(N-x) \sim \frac{2\log n}{N} \sum_{0 \le x < N} \frac{1}{(1-t^2)^{1/2}},$$

since $f(N+x)f(N-x) = \sqrt{\log(N+x)\log(N-x)}/\sqrt{N^2 - x^2}$ and $\log(N \pm x) \sim \log n$ for all but a negligible proportion of the terms. Since $t = x/N$, the right-hand side is a Riemann sum, which converges to an integral:

$$\frac{S_n}{\log n} \to 2\int_0^1 \frac{dt}{(1-t^2)^{1/2}} = \pi.$$

This completes the proof. $\square$
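Lemma 10 is easy to check numerically. The following minimal sketch (not part of the original notes) computes $S_n$ directly and compares with $\pi$; the convergence is slow, as one expects from the logarithmic error terms.

```python
import math

# A minimal sketch: S_n = sum over x + y = n of f(x) f(y), with
# f(x) = sqrt(log(x)/x), satisfies S_n ~ pi * log(n).
def f(x):
    return math.sqrt(math.log(x) / x)

for n in (10**3, 10**4, 10**5, 10**6):
    s = sum(f(x) * f(n - x) for x in range(1, n))  # f(1) = 0, harmless
    print(n, s / math.log(n))                      # approaches pi = 3.14159...
```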

Theorem 11. There exists a set $A \subseteq \mathbb{Z}^+$ such that every sufficiently large $n \in \mathbb{Z}^+$ can be represented in $c_n \log n$ ways as a sum of two elements of $A$, where $8 \le c_n \le 24$.

Proof. Let $f(x)$ be the function defined in the last lemma. Construct a random set $A \subseteq \mathbb{Z}^+$ by taking $x \in A$ independently with probability $af(x)$ (which is at most one for all $x$ large enough), where the constant $a$ is to be determined. For all positive integers $x$ and $y$ such that $x + y = n$, let $I_{xy}$ be the indicator that $x, y \in A$, and let $A(n)$ be the number of ways of writing $n$ as a sum of two elements of $A$. Then the preceding lemma applies to give

$$a(n) := E(A(n)) = \sum_{x+y=n} E(I_{xy}) \sim \pi a^2 \log n.$$

Now the indicator variables $I_{xy}$ are independent, since we restricted to pairs $(x,y)$ with $x + y = n$, and distinct such pairs involve disjoint pairs of elements. So by the Chernoff bound,

$$P\big(|A(n) - a(n)| > \tfrac12 a(n)\big) < 2e^{-a(n)/12} \le n^{-4/3}$$

when $a^2 = 16/\pi$ and $n$ is large enough. Since $\sum_n n^{-4/3}$ converges, the first Borel–Cantelli lemma implies that the probability that $|A(n) - a(n)| > \frac12 a(n)$ occurs infinitely often is zero. Therefore there exists a set $A \subseteq \mathbb{Z}^+$ such that $|A(n) - a(n)| \le \frac12 a(n)$ for all sufficiently large $n$. Since $a(n) \sim \pi a^2 \log n = 16\log n$, we are done. $\square$

Erdős and Tetali extended this result to bases of order $k$: there exists a basis $A$ of order $k$ for the integers such that every integer $n$ can be represented as a sum of $k$ elements of $A$ in $\Theta(\log n)$ ways. The key difference between this and the proof above is that, unfortunately, the random variables corresponding to $I_{xy}$ are no longer independent.
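The construction in Theorem 11 can be simulated directly. The following minimal sketch (not part of the original notes) samples the random set $A$ with $a^2 = 16/\pi$, capping $af(x)$ at one for small $x$ (an implementation detail), and counts unordered representations, which should be close to $a(n)/2 \approx 8\log n$.

```python
import math, random

# A minimal sketch: sample A with P(x in A) = a*f(x), f(x) = sqrt(log(x)/x)
# and a^2 = 16/pi, then count representations of n as x + (n - x), x <= n/2.
random.seed(6)
a = math.sqrt(16 / math.pi)
N = 200_000

in_A = [False] * (N + 1)
for x in range(2, N + 1):
    prob = min(1.0, a * math.sqrt(math.log(x) / x))  # cap needed for small x
    in_A[x] = random.random() < prob

for n in (50_000, 100_000, 200_000):
    reps = sum(in_A[x] and in_A[n - x] for x in range(2, n // 2 + 1))
    print(n, "A(n) =", reps, "  8 log n =", round(8 * math.log(n)))
```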


Appendix B: Inequalities Involving Random Variables and Their Expectations Chapter Fourteen Appendix B: Inequalities Involving Random Variables and Their Expectations In this appendix we present specific properties of the expectation (additional to just the integral of measurable

More information

Course: ESO-209 Home Work: 1 Instructor: Debasis Kundu

Course: ESO-209 Home Work: 1 Instructor: Debasis Kundu Home Work: 1 1. Describe the sample space when a coin is tossed (a) once, (b) three times, (c) n times, (d) an infinite number of times. 2. A coin is tossed until for the first time the same result appear

More information

1 Independent increments

1 Independent increments Tel Aviv University, 2008 Brownian motion 1 1 Independent increments 1a Three convolution semigroups........... 1 1b Independent increments.............. 2 1c Continuous time................... 3 1d Bad

More information

µ X (A) = P ( X 1 (A) )

µ X (A) = P ( X 1 (A) ) 1 STOCHASTIC PROCESSES This appendix provides a very basic introduction to the language of probability theory and stochastic processes. We assume the reader is familiar with the general measure and integration

More information

1.1 Szemerédi s Regularity Lemma

1.1 Szemerédi s Regularity Lemma 8 - Szemerédi s Regularity Lemma Jacques Verstraëte jacques@ucsd.edu 1 Introduction Szemerédi s Regularity Lemma [18] tells us that every graph can be partitioned into a constant number of sets of vertices

More information

Lecture 7: February 6

Lecture 7: February 6 CS271 Randomness & Computation Spring 2018 Instructor: Alistair Sinclair Lecture 7: February 6 Disclaimer: These notes have not been subjected to the usual scrutiny accorded to formal publications. They

More information

We will briefly look at the definition of a probability space, probability measures, conditional probability and independence of probability events.

We will briefly look at the definition of a probability space, probability measures, conditional probability and independence of probability events. 1 Probability 1.1 Probability spaces We will briefly look at the definition of a probability space, probability measures, conditional probability and independence of probability events. Definition 1.1.

More information

Preliminaries. Probability space

Preliminaries. Probability space Preliminaries This section revises some parts of Core A Probability, which are essential for this course, and lists some other mathematical facts to be used (without proof) in the following. Probability

More information

Disjoint Subgraphs in Sparse Graphs 1

Disjoint Subgraphs in Sparse Graphs 1 Disjoint Subgraphs in Sparse Graphs 1 Jacques Verstraëte Department of Pure Mathematics and Mathematical Statistics Centre for Mathematical Sciences Wilberforce Road Cambridge CB3 OWB, UK jbav2@dpmms.cam.ac.uk

More information

Asymptotic Statistics-III. Changliang Zou

Asymptotic Statistics-III. Changliang Zou Asymptotic Statistics-III Changliang Zou The multivariate central limit theorem Theorem (Multivariate CLT for iid case) Let X i be iid random p-vectors with mean µ and and covariance matrix Σ. Then n (

More information

n px p x (1 p) n x. p x n(n 1)... (n x + 1) x!

n px p x (1 p) n x. p x n(n 1)... (n x + 1) x! Lectures 3-4 jacques@ucsd.edu 7. Classical discrete distributions D. The Poisson Distribution. If a coin with heads probability p is flipped independently n times, then the number of heads is Bin(n, p)

More information