Tail and Concentration Inequalities


CSE 694: Probabilistic Analysis and Randomized Algorithms
Lecturer: Hung Q. Ngo, SUNY at Buffalo, Spring 2011
Last update: February 19, 2011

From here on, we use $1_A$ to denote the indicator variable for event $A$, i.e. $1_A = 1$ if $A$ holds and $1_A = 0$ otherwise. Our presentation follows closely the first chapter of [1].

1 Markov Inequality

Theorem 1.1. If $X$ is a r.v. taking only non-negative values and $\mu = E[X]$, then for all $a > 0$,
$$\operatorname{Prob}[X \ge a] \le \frac{\mu}{a}. \tag{1}$$

Proof. From the simple fact that $a \cdot 1_{\{X \ge a\}} \le X$, taking expectations on both sides we get $a\,E[1_{\{X \ge a\}}] \le \mu$, which implies (1).

Problem 1. Use Markov's inequality to prove the following. Let $c \ge 1$ be an arbitrary constant. If $n$ people have a total of $d$ dollars, then there are at least $(1 - 1/c)n$ of them each of whom has less than $cd/n$ dollars. (You can easily prove the above statement from first principles. However, please set up a probability space and a random variable, and use Markov's inequality to prove it. It is instructive!)

2 Chebyshev Inequality

Theorem 2.1 (Two-sided Chebyshev's Inequality). If $X$ is a r.v. with mean $\mu$ and variance $\sigma^2$, then for all $a > 0$,
$$\operatorname{Prob}\bigl[|X - \mu| \ge a\bigr] \le \frac{\sigma^2}{a^2}.$$

Proof. Let $Y = (X - \mu)^2$; then $E[Y] = \sigma^2$ and $Y$ is a non-negative r.v. From Markov's inequality (1) we have
$$\operatorname{Prob}\bigl[|X - \mu| \ge a\bigr] = \operatorname{Prob}[Y \ge a^2] \le \frac{\sigma^2}{a^2}.$$
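Before moving on, it may help to see how loose or tight these bounds are in a concrete case. The following is a minimal numerical sketch, not part of the original notes: it assumes NumPy is available, and the Binomial(100, 0.3) variable and the threshold $a = 45$ are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# X ~ Binomial(100, 0.3) is non-negative, with mu = 30 and sigma^2 = 21.
n, p = 100, 0.3
mu, sigma2 = n * p, n * p * (1 - p)
X = rng.binomial(n, p, size=200_000)

a = 45.0
empirical = np.mean(X >= a)
markov = mu / a                       # Theorem 1.1: Prob[X >= a] <= mu / a
chebyshev = sigma2 / (a - mu) ** 2    # Theorem 2.1 applied to the deviation a - mu

print(f"empirical Prob[X >= {a:.0f}] = {empirical:.5f}")
print(f"Markov bound    = {markov:.3f}")
print(f"Chebyshev bound = {chebyshev:.3f}")
```

Both bounds are valid but far from sharp here; the exponential bounds of Section 3 do much better for sums of independent variables.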

The one-sided versions of Chebyshev's inequality are sometimes called Cantelli's inequality.

Theorem 2.2 (One-sided Chebyshev's Inequality). Let $X$ be a r.v. with $E[X] = \mu$ and $\operatorname{Var}[X] = \sigma^2$. Then for all $a > 0$,
$$\operatorname{Prob}[X \ge \mu + a] \le \frac{\sigma^2}{\sigma^2 + a^2}, \tag{2}$$
$$\operatorname{Prob}[X \le \mu - a] \le \frac{\sigma^2}{\sigma^2 + a^2}. \tag{3}$$

Proof. Let $Y = X - \mu$; then $E[Y] = 0$ and $\operatorname{Var}[Y] = \operatorname{Var}[X] = \sigma^2$. (Why?) Thus, for any $t$ such that $t + a > 0$ we have
$$\operatorname{Prob}[Y \ge a] = \operatorname{Prob}[Y + t \ge a + t] = \operatorname{Prob}\left[\frac{Y + t}{a + t} \ge 1\right] \le \operatorname{Prob}\left[\left(\frac{Y + t}{a + t}\right)^2 \ge 1\right] \le E\left[\left(\frac{Y + t}{a + t}\right)^2\right] = \frac{\sigma^2 + t^2}{(a + t)^2}.$$
The second inequality follows from Markov's inequality. The above analysis holds for any $t$ such that $t + a > 0$; we pick $t$ to minimize the right-hand side, which is $t = \sigma^2/a > 0$. That proves (2).

Problem 2. Prove (3).

3 Bernstein, Chernoff, Hoeffding

3.1 The basic bound using Bernstein's trick

Let us consider the simplest case, and then relax assumptions one by one. For $i \in [n]$, let the $X_i$ be i.i.d. Bernoulli random variables with parameter $p$, and let $X = \sum_{i=1}^n X_i$. Then $E[X] = np$. We will prove that, as $n$ gets large, $X$ is far from $E[X]$ with exponentially low probability. Let $m$ be such that $np < m < n$; we want to bound $\operatorname{Prob}[X \ge m]$. For notational convenience, let $q = 1 - p$.

Bernstein taught us the following trick. For any $t > 0$,
$$\operatorname{Prob}[X \ge m] = \operatorname{Prob}[tX \ge tm] = \operatorname{Prob}\bigl[e^{tX} \ge e^{tm}\bigr] \le \frac{E\bigl[e^{tX}\bigr]}{e^{tm}} = \frac{E\bigl[\prod_{i=1}^n e^{tX_i}\bigr]}{e^{tm}} = \frac{\prod_{i=1}^n E\bigl[e^{tX_i}\bigr]}{e^{tm}} = \frac{(pe^t + q)^n}{e^{tm}},$$
where the expectation of the product splits because the $X_i$ are independent, and the inequality follows from Markov's inequality (1). Naturally, we set $t$ to minimize the right-hand side, which is
$$t_0 = \ln\frac{mq}{(n - m)p} > 0.$$
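As a sanity check on the choice of $t_0$ (my addition, not in the notes; it assumes NumPy and SciPy are available, and the parameters $n = 200$, $p = 0.3$, $m = 90$ are arbitrary), the sketch below evaluates the bound $(pe^t + q)^n e^{-tm}$ at $t_0$ and compares it with the exact binomial tail.

```python
import numpy as np
from scipy.stats import binom

n, p = 200, 0.3
q = 1 - p
m = 90                                   # np = 60 < m < n

t0 = np.log(m * q / ((n - m) * p))       # the minimizer derived above
mgf_bound = (p * np.exp(t0) + q) ** n * np.exp(-t0 * m)
exact_tail = binom.sf(m - 1, n, p)       # exact Prob[X >= m]

print(f"t0 = {t0:.4f}")
print(f"exact Prob[X >= {m}] = {exact_tail:.3e}")
print(f"bound at t0          = {mgf_bound:.3e}")
```

Trying other values of $t$ in place of $t_0$ only makes the bound larger, which is a quick way to convince yourself the optimization was carried out correctly.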

Plugging $t_0$ in, we obtain the following after simple algebraic manipulations:
$$\operatorname{Prob}[X \ge m] \le \left(\frac{pn}{m}\right)^m \left(\frac{qn}{n - m}\right)^{n - m}. \tag{4}$$
This is still quite a mess, but there is a way to make it easier to remember. The relative entropy (or Kullback-Leibler distance) between two Bernoulli distributions with parameters $p$ and $p'$ is defined to be
$$\operatorname{RE}(p \,\|\, p') := p \ln\frac{p}{p'} + (1 - p)\ln\frac{1 - p}{1 - p'}.$$
There are several different interpretations of the relative entropy function; you can find them in the Wikipedia entry on relative entropy. It can be shown that $\operatorname{RE}(p \,\|\, p') \ge 0$ for all $p, p' \in (0, 1)$. Anyhow, we can rewrite (4) simply as
$$\operatorname{Prob}[X \ge m] \le e^{-n \operatorname{RE}(m/n \,\|\, p)}. \tag{5}$$

Next, suppose the $X_i$ are still Bernoulli variables but with different parameters $p_i$. Let $q_i = 1 - p_i$, $p = \bigl(\sum_i p_i\bigr)/n$ and $q = 1 - p$. Note that $E[X] = np$ as before. A similar analysis leads to
$$\operatorname{Prob}[X \ge m] \le \frac{\prod_{i=1}^n (p_i e^t + q_i)}{e^{tm}} \le \frac{(pe^t + q)^n}{e^{tm}}.$$
The second inequality is due to the geometric-arithmetic means inequality, which states that, for any non-negative real numbers $a_1, \dots, a_n$, we have
$$\left(\frac{a_1 + \cdots + a_n}{n}\right)^n \ge a_1 \cdots a_n.$$
Thus, (5) holds when the $X_i$ are Bernoulli even if they are not identically distributed.

Finally, consider a fairly general case where the $X_i$ do not even have to be discrete variables. Suppose the $X_i$ are independent random variables with $E[X_i] = p_i$ and $X_i \in [0, 1]$ for all $i$. Again, let $p = \sum_i p_i/n$ and $q = 1 - p$. Bernstein's trick leads us to
$$\operatorname{Prob}[X \ge m] \le \frac{\prod_{i=1}^n E\bigl[e^{tX_i}\bigr]}{e^{tm}}.$$
The problem is that we can no longer compute $E\bigl[e^{tX_i}\bigr]$, because we do not know the $X_i$'s distributions. Hoeffding taught us another trick. For $t > 0$, the function $f(x) = e^{tx}$ is convex; hence the curve of $f(x)$ inside $[0, 1]$ lies below the linear segment connecting the points $(0, f(0))$ and $(1, f(1))$. The segment's equation is
$$y = (f(1) - f(0))x + f(0) = (e^t - 1)x + 1 = e^t x + (1 - x).$$
Hence,
$$E\bigl[e^{tX_i}\bigr] \le E\bigl[e^t X_i + (1 - X_i)\bigr] = p_i e^t + q_i.$$
We thus obtain (4) as before. Overall, we have just proved the following theorem.

Theorem 3.1 (Bernstein-Chernoff-Hoeffding). Let $X_i \in [0, 1]$ be independent random variables with $E[X_i] = p_i$, $i \in [n]$. Let $X = \sum_{i=1}^n X_i$, $p = \sum_{i=1}^n p_i/n$ and $q = 1 - p$. Then, for any $m$ such that $np < m < n$, we have
$$\operatorname{Prob}[X \ge m] \le e^{-n\operatorname{RE}(m/n \,\|\, p)}. \tag{6}$$

Problem 3. Let $X_i \in [0, 1]$ be independent random variables with $E[X_i] = p_i$, $i \in [n]$. Let $X = \sum_{i=1}^n X_i$, $p = \sum_{i=1}^n p_i/n$ and $q = 1 - p$. Prove that, for any $m$ such that $0 < m < np$, we have
$$\operatorname{Prob}[X \le m] \le e^{-n\operatorname{RE}(m/n \,\|\, p)}. \tag{7}$$
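Equation (5) is literally the same number as (4), just written through the relative entropy. A small check of this identity (my addition; it assumes NumPy, and the parameters are arbitrary):

```python
import numpy as np

def relative_entropy(a, b):
    """RE(a || b) for Bernoulli parameters a, b in (0, 1)."""
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

n, p, m = 200, 0.3, 90
q = 1 - p

bound_4 = (p * n / m) ** m * (q * n / (n - m)) ** (n - m)    # right-hand side of (4)
bound_5 = np.exp(-n * relative_entropy(m / n, p))            # right-hand side of (5)
print(bound_4, bound_5)   # the two values coincide
```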

3.2 Instantiations

There are a variety of different bounds we can get out of (6) and (7).

Theorem 3.2 (Hoeffding Bounds). Let $X_i \in [0, 1]$ be independent random variables with $E[X_i] = p_i$, $i \in [n]$. Let $X = \sum_{i=1}^n X_i$. Then, for any $t > 0$, we have
$$\operatorname{Prob}[X \ge E[X] + t] \le e^{-2t^2/n} \tag{8}$$
and
$$\operatorname{Prob}[X \le E[X] - t] \le e^{-2t^2/n}. \tag{9}$$

Proof. We prove (8), leaving (9) as an exercise. Let $p = \sum_{i=1}^n p_i/n$ and $q = 1 - p$. WLOG, we assume $0 < p < 1$. Define $m = (p + x)n$, where $0 < x < q = 1 - p$, so that $np < m < n$. Also, define
$$f(x) = \operatorname{RE}\Bigl(\frac{m}{n} \,\Big\|\, p\Bigr) = \operatorname{RE}(p + x \,\|\, p) = (p + x)\ln\frac{p + x}{p} + (q - x)\ln\frac{q - x}{q}. \tag{10}$$
Routine manipulations give
$$f'(x) = \ln\frac{p + x}{p} - \ln\frac{q - x}{q}, \qquad f''(x) = \frac{1}{(p + x)(q - x)}.$$
By Taylor's expansion, for any such $x$ there is some $\xi \in [0, x]$ such that
$$f(x) = f(0) + x f'(0) + \frac{1}{2}x^2 f''(\xi) = \frac{1}{2}x^2 \cdot \frac{1}{(p + \xi)(q - \xi)} \ge 2x^2.$$
The last inequality follows from the fact that $(p + \xi)(q - \xi) \le \bigl(\frac{p + q}{2}\bigr)^2 = 1/4$. Finally, set $x = t/n$. Then $m = np + t = E[X] + t$, and from (6) we get
$$\operatorname{Prob}[X \ge E[X] + t] \le e^{-n f(x)} \le e^{-2x^2 n} = e^{-2t^2/n}.$$

Problem 4. Prove (9).
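A quick empirical illustration of (8), not from the notes: it assumes NumPy, draws independent Bernoulli($p_i$) variables with arbitrarily chosen, non-identical $p_i$, and uses an arbitrary deviation $t = 15$.

```python
import numpy as np

rng = np.random.default_rng(1)

n, trials = 100, 100_000
p = rng.uniform(0.1, 0.9, size=n)              # E[X_i] = p_i, not identically distributed
X = (rng.random((trials, n)) < p).sum(axis=1)  # each row is one sample of X = sum_i X_i

EX = p.sum()
t = 15.0
empirical = np.mean(X >= EX + t)
hoeffding = np.exp(-2 * t**2 / n)              # bound (8)

print(f"empirical Prob[X >= E[X] + {t:.0f}] = {empirical:.5f}")
print(f"Hoeffding bound = {hoeffding:.5f}")
```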

Theorem 3.3 (Chernoff Bounds). Let $X_i \in [0, 1]$ be independent random variables with $E[X_i] = p_i$, $i \in [n]$. Let $X = \sum_{i=1}^n X_i$. Then,

i) for any $0 < \delta \le 1$,
$$\operatorname{Prob}[X \ge (1 + \delta)E[X]] \le e^{-E[X]\delta^2/3}; \tag{11}$$

ii) for any $0 < \delta < 1$,
$$\operatorname{Prob}[X \le (1 - \delta)E[X]] \le e^{-E[X]\delta^2/2}; \tag{12}$$

iii) if $t > 2eE[X]$, then
$$\operatorname{Prob}[X \ge t] \le 2^{-t}. \tag{13}$$

Proof. To bound the upper tail, we apply (6) with $m = (p + \delta p)n$. Without loss of generality, we can assume $m < n$, or equivalently $\delta < q/p$. In particular, we will analyze the function
$$g(x) = \operatorname{RE}(p + xp \,\|\, p) = (1 + x)p\ln(1 + x) + (q - px)\ln\frac{q - px}{q},$$
for $0 < x \le \min\{q/p, 1\}$. First, observe that
$$\ln\frac{q - px}{q} = \ln\Bigl(1 - \frac{px}{q}\Bigr) \ge \frac{-px}{q - px},$$
using $\ln(1 + y) \ge y/(1 + y)$ for $y > -1$. Hence $(q - px)\ln\frac{q - px}{q} \ge -px$, from which we can infer that
$$g(x) \ge (1 + x)p\ln(1 + x) - px = p\bigl[(1 + x)\ln(1 + x) - x\bigr].$$
Now, define
$$h(x) = (1 + x)\ln(1 + x) - x - x^2/3.$$
Then
$$h'(x) = \ln(1 + x) - 2x/3, \qquad h''(x) = \frac{1}{1 + x} - \frac{2}{3}.$$
Thus, $x = 1/2$ is a local extremum of $h'(x)$. Note that $h'(0) = 0$, $h'(1/2) \approx 0.07 > 0$, and $h'(1) \approx 0.026 > 0$; since $h'$ increases on $[0, 1/2]$ and decreases on $[1/2, 1]$, we get $h'(x) \ge 0$ for all $x \in (0, 1]$. The function $h(x)$ is thus non-decreasing, so $h(x) \ge h(0) = 0$ for all $x \in [0, 1]$. Consequently,
$$g(x) \ge p\bigl[(1 + x)\ln(1 + x) - x\bigr] \ge px^2/3$$
for all $x \in [0, 1]$. Thus, from (6) we have
$$\operatorname{Prob}[X \ge (1 + \delta)E[X]] = \operatorname{Prob}[X \ge (1 + \delta)pn] \le e^{-n\,g(\delta)} \le e^{-\delta^2 E[X]/3}.$$

Problem 5. Prove (12).
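To see how much is given up when passing from the entropy bound (6) to the cleaner form (11), here is a small comparison (my addition; it assumes NumPy and SciPy, and the parameters are arbitrary).

```python
import numpy as np
from scipy.stats import binom

def relative_entropy(a, b):
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

n, p, delta = 300, 0.2, 0.5
mu = n * p
m = (1 + delta) * mu                                 # threshold (1 + delta) E[X]

exact    = binom.sf(int(np.ceil(m)) - 1, n, p)       # exact Prob[X >= m]
entropy  = np.exp(-n * relative_entropy(m / n, p))   # bound (6)
chernoff = np.exp(-mu * delta**2 / 3)                # bound (11)

print(f"exact = {exact:.2e} <= entropy bound = {entropy:.2e} <= Chernoff bound = {chernoff:.2e}")
```

For $\delta$ in the range of part (i), the entropy bound is at least as strong; (11) trades tightness for a formula that is easy to remember and to plug into union bounds.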

Problem 6. Let $X_i \in [0, 1]$ be independent random variables with $E[X_i] = p_i$, $i \in [n]$. Let $X = \sum_{i=1}^n X_i$ and $\mu = E[X]$. Prove the following.

i) For any $\delta, t > 0$ we have
$$\operatorname{Prob}[X \ge (1 + \delta)E[X]] \le \left(\frac{e^{e^t - 1}}{e^{t(1 + \delta)}}\right)^{\mu}.$$
(Hint: repeat the basic structure of the proof using Bernstein's trick. Then, because $1 + x \le e^x$, we can apply $p_i e^t + 1 - p_i = 1 + p_i(e^t - 1) \le e^{p_i e^t - p_i}$.)

ii) Show that, for any $\delta > 0$, we have
$$\operatorname{Prob}[X \ge (1 + \delta)\mu] \le \left(\frac{e^{\delta}}{(1 + \delta)^{1 + \delta}}\right)^{\mu}.$$

iii) Prove that, for any $t > 2eE[X]$,
$$\operatorname{Prob}[X \ge t] \le 2^{-t}.$$

Problem 7. Let $X_i \in [a_i, b_i]$ be independent random variables, where the $a_i, b_i$ are real numbers. Let $X = \sum_{i=1}^n X_i$. Repeat the basic proof structure to show the following slightly more general Hoeffding bounds, valid for any $t > 0$:
$$\operatorname{Prob}[X \ge E[X] + t] \le \exp\left(\frac{-2t^2}{\sum_{i=1}^n (a_i - b_i)^2}\right),$$
$$\operatorname{Prob}[X \le E[X] - t] \le \exp\left(\frac{-2t^2}{\sum_{i=1}^n (a_i - b_i)^2}\right).$$

Problem 8. Prove that, for any $0 \le \alpha \le 1/2$,
$$\sum_{0 \le k \le \alpha n} \binom{n}{k} \le 2^{H(\alpha)n},$$
where $H(\alpha) = -\alpha\log_2\alpha - (1 - \alpha)\log_2(1 - \alpha)$ is the binary entropy function.

References

[1] D. P. Dubhashi and A. Panconesi, Concentration of Measure for the Analysis of Randomized Algorithms, Cambridge University Press, Cambridge, 2009.