CPSC 536N: Randomized Algorithms Term 2. Lecture 2


CPSC 536N: Randomized Algorithms
2014-15 Term 2, Lecture 2
Prof. Nick Harvey, University of British Columbia

In this lecture we continue our introduction to randomized algorithms by discussing the Max Cut problem. We will achieve a non-trivial result for this problem using only some elementary tools of discrete probability. An important technical point that arises is the difference between algorithms that are good in expectation and those that are good with high probability.

1 Example: Max Cut

The Max Cut problem is a foundational problem in combinatorial optimization and approximation algorithms; many important techniques have been developed during the study of this problem. It is defined as follows. Let G = (V, E) be an undirected graph. For U ⊆ V, let

    δ(U) = { uv ∈ E : u ∈ U and v ∉ U }.

The set δ(U) is called the cut determined by the vertex set U. The Max Cut problem is to solve

    max { |δ(U)| : U ⊆ V }.

This is NP-hard (and in fact was one of the original problems shown to be NP-hard by Karp in his famous 1972 paper), so we cannot hope to solve it exactly. Instead, we will be content to find a cut that is sufficiently large. More precisely, let OPT denote the size of the maximum cut. We want an algorithm for which there exists a factor α > 0 (independent of G) such that the set U output by the algorithm is guaranteed to satisfy |δ(U)| ≥ α·OPT. If the algorithm is randomized, we want this guarantee to hold with some probability close to 1.

Here is a brief summary of what is known about this problem.

- Folklore: there is an algorithm with α = 1/2. In fact, there are several such algorithms.
- Goemans and Williamson 1995: there is an algorithm with α = 0.878...
- Håstad 2001: no efficient algorithm has α > 16/17, unless P = NP.
- Khot, Kindler, Mossel, O'Donnell and Oleszkiewicz 2004-2005: no efficient algorithm has α > 0.878..., assuming the Unique Games Conjecture. Khot won the Nevanlinna Prize in 2014 partially for this result.

We will give a randomized algorithm achieving α = 1/2.
In fact, this algorithm appears in an old paper of Erdős [1]. The algorithm couldn't possibly be any simpler: it simply lets U be a uniformly random subset of V. One can check that this is equivalent to independently adding each vertex to U with probability 1/2. (This relates to the first question in Assignment 0.) Note that the algorithm does not even look at the edges of G! Philosophically, this algorithm can be viewed as trying to find hay in a haystack: the cuts are the haystack, and the near-maximum cuts are the hay. The following claim analyzes this algorithm.
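The single-sample algorithm can be sketched in a few lines. This is an illustrative sketch, not code from the notes: the function name `random_cut` and the edge-list graph representation are my own choices.

```python
import random

def random_cut(n, edges):
    """Pick a uniformly random U subseteq {0,...,n-1} and return (U, |delta(U)|).

    Each vertex joins U independently with probability 1/2, which is
    equivalent to choosing U uniformly among all 2^n subsets.
    """
    U = {v for v in range(n) if random.random() < 0.5}
    # An edge uv crosses the cut exactly when u and v are on different sides.
    cut_size = sum(1 for (u, v) in edges if (u in U) != (v in U))
    return U, cut_size
```

On the complete graph K4 (six edges), the only achievable cut sizes are 0, 3 and 4, and averaging over many runs should give roughly |E|/2 = 3, matching Claim 1 below.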

Claim 1. Let U be the set chosen by the algorithm. Then E[|δ(U)|] ≥ OPT/2.

Proof: For every edge uv ∈ E, let X_uv be the indicator random variable which is 1 if uv ∈ δ(U). Then

    E[|δ(U)|] = E[ Σ_{uv ∈ E} X_uv ]
              = Σ_{uv ∈ E} E[X_uv]          (linearity of expectation)
              = Σ_{uv ∈ E} Pr[X_uv = 1].

Now we note that

    Pr[X_uv = 1] = Pr[ (u ∈ U ∧ v ∉ U) ∨ (u ∉ U ∧ v ∈ U) ]
                 = Pr[u ∈ U ∧ v ∉ U] + Pr[u ∉ U ∧ v ∈ U]       (these are disjoint events)
                 = Pr[u ∈ U]·Pr[v ∉ U] + Pr[u ∉ U]·Pr[v ∈ U]    (independence)
                 = 1/2.

Thus E[|δ(U)|] = |E|/2 ≥ OPT/2, since clearly OPT ≤ |E|. ∎

So we have shown that the algorithm outputs a cut whose expected size is large. We might instead prefer a different sort of guarantee: perhaps we'd like to know that, with high probability, the algorithm outputs a cut that is large. Mathematically, we have shown that E[|δ(U)|] ≥ OPT/2, but perhaps we want to know that Pr[|δ(U)| ≥ OPT/2] is large. To connect these two statements, we need to show that |δ(U)| is likely to be close to its expectation.

2 Concentration Inequalities

One of the most important tasks in analyzing randomized algorithms is to understand which random variables arise and how well they are concentrated. A variable with good concentration is one that is close to its mean with good probability. A concentration inequality is a theorem proving that a random variable has good concentration. Such theorems are also known as tail bounds.

2.1 Markov's inequality

This is the simplest concentration inequality. The downside is that it gives only very weak bounds, but the upside is that it needs almost no assumptions about the random variable. It is often useful in scenarios where not much concentration is needed, or where the random variable is too complicated to be analyzed by more powerful inequalities.

Theorem 2. Let Y be a real-valued random variable that assumes only nonnegative values.¹ Then, for all a > 0,

    Pr[Y ≥ a] ≤ E[Y]/a.

¹ Pedantic detail: we should also assume that E[Y] is finite.

References: Mitzenmacher-Upfal Theorem 3.1, Motwani-Raghavan Theorem 3.2, Wikipedia, Grimmett-Stirzaker Lemma 7.2.7, Durrett Theorem 1.6.4.

Note that if a ≤ E[Y] then the right-hand side of the inequality is at least 1, and so the statement is trivial. So Markov's inequality is only useful if a > E[Y]. Typically we use Markov's inequality to prove that Y has only a constant probability of exceeding its mean by a constant factor, e.g., Pr[Y ≥ 2·E[Y]] ≤ 1/2.

Proof: Let X be the indicator random variable that is 1 if Y ≥ a. Since Y is non-negative, we have

    X ≤ Y/a.    (1)

Then

    Pr[Y ≥ a] = Pr[X = 1] = E[X] ≤ E[Y/a] = E[Y]/a,

where the inequality comes from taking the expectation of both sides of (1). ∎

Note that Markov's inequality only bounds the right tail of Y, i.e., the probability that Y is much greater than its mean.

2.2 The Reverse Markov inequality

In some scenarios, we would also like to bound the probability that Y is much smaller than its mean. Markov's inequality can be used for this purpose if we know an upper bound on Y. The following result is an immediate corollary of Theorem 2.

Corollary 3. Let Y be a random variable that is never larger than B. Then, for all a < B,

    Pr[Y ≤ a] ≤ (B − E[Y]) / (B − a).

References: Grimmett-Stirzaker Theorem 7.3.5.

2.3 Application to Max Cut

Let us now analyze the probability that our randomized algorithm for Max Cut gives a large cut. Let Y = |δ(U)| and B = |E|, and note that Y is never larger than B. Fix any ε ∈ [0, 1/2] and set a = (1/2 − ε)|E|. By the Reverse Markov inequality,

    Pr[ |δ(U)| ≤ (1/2 − ε)|E| ] ≤ ( |E| − E[|δ(U)|] ) / ( |E| − (1/2 − ε)|E| )
                                = ( |E| − |E|/2 ) / ( (1/2 + ε)|E| )    (since E[|δ(U)|] = |E|/2)
                                = 1 / (1 + 2ε)
                                ≤ 1 − ε,

where the last step uses Inequality 2 from the Notes on Convexity Inequalities. This shows that, with probability at least ε, the algorithm outputs a set U satisfying |δ(U)| > (1/2 − ε)|E| ≥ (1/2 − ε)·OPT. This statement is quite unsatisfying.
If we want a 0.499-approximation (taking ε = 0.001), then we have only shown that the algorithm succeeds with probability 0.001. Next we will show how to increase the probability of success.
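Corollary 3 can be checked exactly on a small example. The following sketch is my own worked example, not from the notes: it takes Y = |δ(U)| for a uniformly random U on K4, whose distribution (cut size 0 with probability 2/16, size 3 with probability 8/16, size 4 with probability 6/16) is easy to compute by hand, and compares the exact lower tail against the Reverse Markov bound using rational arithmetic.

```python
from fractions import Fraction

# Distribution of Y = |delta(U)| on K4 under a uniformly random U.
dist = {0: Fraction(2, 16), 3: Fraction(8, 16), 4: Fraction(6, 16)}
B = 6                                    # Y is never larger than |E| = 6
EY = sum(y * p for y, p in dist.items())  # = 3 = |E|/2, as Claim 1 predicts

# Threshold a = (1/2 - eps)|E| with eps = 1/4.
a = Fraction(3, 2)
tail = sum(p for y, p in dist.items() if y <= a)  # exact Pr[Y <= a]
bound = (B - EY) / (B - a)                        # Reverse Markov: (B - E[Y])/(B - a)
```

Here `tail` is 1/8 while `bound` is 2/3, so the inequality holds with plenty of room; the bound 1 − ε = 3/4 from the derivation above is looser still.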

3 Amplification by Independent Trials

In many cases where the probability of success is positive but small, we can amplify that probability by performing several independent trials and taking the best outcome. We already used this technique in the first lecture to improve the success probability of our equality test.

For Max Cut, consider the following algorithm. First it picks several sets U_1, ..., U_k ⊆ V, independently and uniformly at random. Let j be the index for which |δ(U_j)| is largest. The algorithm simply outputs the set U_j. (Our initial Max Cut algorithm above did not even look at the edges of the graph, but this new algorithm must look at the edges to compute |δ(U_j)|.)

To analyze this algorithm, we wish to argue that |δ(U_j)| is large. Well, |δ(U_j)| is small only if all |δ(U_i)| are small, and this is rather unlikely:

    Pr[ |δ(U_i)| ≤ (1/2 − ε)|E| for all i = 1, ..., k ]
        = ∏_{i=1}^{k} Pr[ |δ(U_i)| ≤ (1/2 − ε)|E| ]    (by independence)
        ≤ (1 − ε)^k                                    (by our analysis above)
        ≤ e^{−εk}.

Here we have used the standard trick 1 + x ≤ e^x, which is Inequality 1 from the Notes on Convexity Inequalities. Thus, setting k = ln(1/δ)/ε, we obtain

    Pr[ every |δ(U_i)| ≤ (1/2 − ε)|E| ] ≤ δ.

In summary, our new Max Cut algorithm picks k = ln(1/δ)/ε sets U_1, ..., U_k, finds the j which maximizes |δ(U_j)|, then outputs U_j. With probability at least 1 − δ we have |δ(U_j)| > (1/2 − ε)|E| ≥ (1/2 − ε)·OPT.

In this analysis we used the following general fact, which is elementary but often very useful.

Claim 4. Consider a random trial in which the probability of success is p. If we perform k mutually independent trials, then

    Pr[all failures] = (1 − p)^k ≤ e^{−pk},

and therefore

    Pr[at least one success] = 1 − (1 − p)^k ≥ 1 − e^{−pk}.

3.1 Example: Flipping Coins

Consider flipping a fair coin. If the coin turns up heads, call this a success, so the probability of success is p = 1/2. By Claim 4,

    Pr[at least one heads after k trials] = 1 − 1/2^k.
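The amplified algorithm can be sketched as follows. Again this is an illustrative sketch rather than code from the notes; the name `best_of_k_cuts` and the seeded generator (used so runs are reproducible) are my own.

```python
import random

def best_of_k_cuts(n, edges, k, seed=0):
    """Run the uniform-random-cut algorithm k times independently;
    return the subset achieving the largest cut seen."""
    rng = random.Random(seed)
    best_U, best_size = set(), -1
    for _ in range(k):
        U = {v for v in range(n) if rng.random() < 0.5}
        size = sum(1 for (u, v) in edges if (u in U) != (v in U))
        if size > best_size:
            best_U, best_size = U, size
    return best_U, best_size
```

On K4 a single trial gives a cut of size at least 3 (i.e., at least |E|/2) with probability 7/8, so with k = 50 trials the failure probability (1/8)^50 is negligible, exactly as Claim 4 predicts.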
This is a true statement, but not all that interesting: it is quite obvious that after k coin flips, the probability of seeing at least one head is very close to 1. Note that the expected number of heads is k/2. Can't we say that there are probably close to k/2 heads? This takes us beyond the subject of amplification and into a discussion of the binomial distribution.

4 The binomial distribution

Again, let us flip a fair coin k times and let X be the number of heads seen. Then X has the binomial distribution, so

    Pr[exactly i heads] = (k choose i) · 2^{−k}.

How can we show that X is probably close to its expected value (which is k/2)? Well, the probability of X being small is

    Pr[at most i heads] = Σ_{0 ≤ j ≤ i} (k choose j) · 2^{−k}.    (2)

Here the meaning of "small" depends on the choice of i. For what values of i is this sum small? Unfortunately the sum is a bit too complicated to get a feel for its magnitude. Can we simplify the expression so it's easier to see what's going on? As far as I know, this sum does not have a simple closed-form expression. Instead, can we usefully approximate it? It would be great to have a provable upper bound for the sum that is simple, useful and asymptotically nearly tight. The Chernoff bound, which we discuss in Section 5, is a very general and powerful tool for analyzing tails of probability distributions, and it gives a fantastic bound on (2). But first, to motivate the bound that we will obtain, let us describe some interesting behavior of binomial distributions.

4.1 Different behavior in different regimes

In most introductory courses on probability theory, one often encounters statements that describe the limiting behavior of the binomial distribution. For example:

Fact 5. Let X be a binomially distributed random variable with parameters n and p. (That is, X gives the number of successes in n independent trials where each trial succeeds with probability p.) As n gets large while p remains fixed, the distribution of X is well approximated by the normal distribution with mean np and variance np(1 − p).

References: Durrett Theorem 3.4.1 and Example 3.4.3.

The following plot shows the binomial distribution for n = 20 and p = 1/2 in red and the corresponding normal approximation in blue.
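Although the sum in (2) has no simple closed form, it is easy to evaluate exactly for concrete k and i. A small sketch (the helper name `pr_at_most` is mine):

```python
from math import comb

def pr_at_most(i, k):
    """Exact Pr[at most i heads in k fair coin flips]: the sum in (2)."""
    return sum(comb(k, j) for j in range(i + 1)) / 2**k
```

For instance, with k = 20 the probability of at most 4 heads is already below 1%, while the probability of at most 10 heads (the mean) is just over 1/2, giving a feel for how quickly the lower tail dies off.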

However, the binomial distribution has qualitatively different behavior when p is very small.

Fact 6. Let X be a binomially distributed random variable with parameters n and p. As n gets large while np remains fixed, the distribution of X is well approximated by the Poisson distribution with parameter λ = np.

References: Mitzenmacher-Upfal Theorem 5.5, Durrett Theorem 3.6.1.

The following plot shows the binomial distribution for n = 50 and p = 4/n in red and the corresponding Poisson approximation in blue. Note that the red plot is quite asymmetric, so we would not expect a normal distribution (which is symmetric) to approximate it well.

Ideally we would like an upper bound on (2) which works well for all ranges of parameters, and captures the phenomena described in Fact 5 and Fact 6. Remarkably, the Chernoff bound is able to capture both of these phenomena.
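The Poisson approximation of Fact 6 can be checked numerically for the plotted parameters. This is a quick sanity check of mine, not part of the notes: it computes the largest pointwise gap between the Binomial(50, 4/50) and Poisson(4) probability mass functions.

```python
from math import comb, exp, factorial

# Regime of Fact 6: n large-ish with np fixed at lambda = 4.
n, p = 50, 4 / 50
lam = n * p

max_gap = max(
    abs(comb(n, i) * p**i * (1 - p)**(n - i)    # Binomial(n, p) pmf at i
        - exp(-lam) * lam**i / factorial(i))    # Poisson(lambda) pmf at i
    for i in range(n + 1)
)
```

Even at this modest n the two pmfs agree pointwise to within about 0.01, and the agreement improves as n grows with np held fixed.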

5 The Chernoff Bound

The Chernoff bound is used to bound the tails of the distribution of a sum of independent random variables, under a few mild assumptions. Since binomial random variables are sums of independent Bernoulli random variables, it can be used to bound (2). Not only is the Chernoff bound itself very useful, but its proof techniques are very general and can be applied in a wide variety of settings. The Chernoff bound is by far the most useful tool in randomized algorithms. Numerous papers on randomized algorithms have only three ingredients: Chernoff bounds, union bounds and cleverness. Of course, there is an art in being clever and finding the right way to assemble these ingredients!

5.1 Formal Statement

Let X_1, ..., X_n be independent random variables such that X_i always lies in the interval [0, 1]. Define X = Σ_{i=1}^{n} X_i and p_i = E[X_i]. Let µ_min and µ_max be any values with µ_min ≤ Σ_i p_i ≤ µ_max.

Theorem 7. For all δ > 0,

    (a)  Pr[X ≥ (1 + δ)·µ_max] ≤ ( e^δ / (1 + δ)^{1+δ} )^{µ_max}

    (b)  Pr[X ≥ (1 + δ)·µ_max] ≤ e^{−δ²·µ_max/3}                              (if δ ≤ 1)
         Pr[X ≥ (1 + δ)·µ_max] ≤ e^{−(1+δ)·ln(1+δ)·µ_max/4} ≤ e^{−δ·µ_max/3}  (if δ ≥ 1)

    (c)  Pr[X ≤ (1 − δ)·µ_min] ≤ ( e^{−δ} / (1 − δ)^{1−δ} )^{µ_min}

    (d)  Pr[X ≤ (1 − δ)·µ_min] ≤ e^{−δ²·µ_min/2}.

Inequalities (c) and (d) are only valid for δ < 1, but Pr[X ≤ (1 − δ)·µ_min] = 0 if δ > 1.

Remarks. All of these bounds are useful in different scenarios. In inequality (b), the cases δ ≤ 1 and δ > 1 are different due to the different phenomena illustrated in Facts 5 and 6.

Historical remark. Chernoff actually only considered the case in which the X_i random variables are identically distributed. The more general form stated here is due to Hoeffding.

References

[1] P. Erdős. Gráfok páros körüljárású részgráfjairól (On bipartite subgraphs of graphs, in Hungarian). Mat. Lapok, 18:283-288, 1967.
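As a concrete check of Theorem 7 (this numeric example is mine, not from the notes), take k = 100 fair coin flips, so each X_i is Bernoulli(1/2) and we may take µ_min = µ_max = k/2 = 50. With δ = 0.2 we compare the exact binomial tails from (2) against bounds (b) and (d):

```python
from math import comb, exp

k, mu, delta = 100, 50, 0.2
# (1 + delta)*mu = 60 and (1 - delta)*mu = 40, so the exact tails are:
upper = sum(comb(k, j) for j in range(60, k + 1)) / 2**k  # Pr[X >= 60]
lower = sum(comb(k, j) for j in range(0, 41)) / 2**k      # Pr[X <= 40]

bound_b = exp(-delta**2 * mu / 3)  # Chernoff bound (b), valid since delta <= 1
bound_d = exp(-delta**2 * mu / 2)  # Chernoff bound (d)
```

Both exact tails are below 3%, while the bounds evaluate to about 0.51 and 0.37: valid, but loose for this small deviation, which is typical of Chernoff bounds away from the asymptotic regime.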