Chapter 5: Inequalities

5.1 The Markov and Chebyshev inequalities

As you have probably seen on today's front page: every person in the upper tenth percentile earns at least 12 times more than the average salary. In other words, if you pick a random person, the probability that his salary is more than 12 times the average salary is more than 0.1. Put into a formal setting, denote by $\Omega$ the set of people, endowed with a uniform probability, and let $X(\omega)$ be the person's salary; then
\[
P(X > 12\, E[X]) > 0.1.
\]
The following theorem will show that this is not possible.

Theorem 5.1 (Markov inequality): Let $X$ be a random variable assuming non-negative values, i.e., $X(\omega) \geq 0$ for every $\omega$. Then for every $a > 0$,
\[
P(X \geq a) \leq \frac{E[X]}{a}.
\]

Comment: Andrey Andreyevich Markov (1856-1922) was a Russian mathematician notably known for his work on stochastic processes.
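As a quick numerical illustration of the opening example (not part of the original notes; the lognormal "salary" distribution, the numpy library, and the seed are arbitrary choices for this sketch), one can check that no non-negative distribution can put more than $1/12$ of its mass above 12 times its mean:

import numpy as np

# Opening example via Markov's inequality: for any non-negative "salary"
# distribution, the fraction earning more than 12 times the mean is at most
# 1/12 < 0.1. The lognormal distribution and the seed are arbitrary choices.
rng = np.random.default_rng(0)
salaries = rng.lognormal(mean=10.0, sigma=1.5, size=1_000_000)
fraction = (salaries > 12 * salaries.mean()).mean()
print(f"fraction above 12x mean: {fraction:.4f}   (Markov bound: {1/12:.4f})")

Whatever non-negative distribution is plugged in, the printed fraction stays below $1/12 \approx 0.083$, so a claim of "more than 0.1" cannot hold.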

Comment: Note first that this is a vacuous statement for $a < E[X]$. For $a > E[X]$ this inequality limits the probability that $X$ assumes values larger than its expected value.

This is the first time in this course that we derive an inequality. Inequalities, in general, are an important tool in analysis, where estimates (rather than exact identities) are often needed. The strength of such an inequality is that it holds for every random variable. As a result, this estimate may be far from being tight.

Proof: Let $A$ be the event that $X$ is at least $a$, that is, $A = \{\omega : X(\omega) \geq a\}$. Define a new random variable $Y = a\, I_A$, where $I_A$ is the indicator of $A$. That is,
\[
Y(\omega) = \begin{cases} a & X(\omega) \geq a \\ 0 & \text{otherwise.} \end{cases}
\]
Clearly, $X \geq Y$, so by the monotonicity of expectation,
\[
E[X] \geq E[Y] = a\, E[I_A] = a\, P(A) = a\, P(X \geq a),
\]
where we used the fact that $I_A$ is a Bernoulli variable, hence its expectation equals the probability that it is equal to 1.

A corollary of Markov's inequality is another inequality due to Chebyshev:

Theorem 5.2 (Chebyshev's inequality): Let $X$ be a random variable with expected value $\mu$ and variance $\sigma^2$. Then, for every $a > 0$,
\[
P(|X - \mu| \geq a) \leq \frac{\sigma^2}{a^2}.
\]

Comment: Pafnuty Lvovich Chebyshev (1821-1894) was a Russian mathematician known for many contributions; he was Markov's advisor.

Comment: If we write $a = k\sigma$, this theorem states that the probability that a random variable assumes a value whose absolute distance from its expected value is more than $k$ times its standard deviation is at most $1/k^2$:
\[
P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}.
\]
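Before turning to the proof, here is a small numerical check of the $k\sigma$ form of the bound (a sketch, not part of the notes; the standard normal distribution and the sample size are arbitrary choices):

import numpy as np

# Chebyshev's k-sigma form: P(|X - mu| >= k*sigma) <= 1/k**2.
# The standard normal below is an arbitrary choice; the bound holds for
# any distribution with finite variance, however loosely.
rng = np.random.default_rng(1)
X = rng.standard_normal(1_000_000)
mu, sigma = X.mean(), X.std()
for k in [1, 2, 3, 4]:
    empirical = (np.abs(X - mu) >= k * sigma).mean()
    print(f"k = {k}   P(|X - mu| >= k*sigma) ≈ {empirical:.5f}   bound 1/k^2 = {1/k**2:.5f}")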

Proof: Define $Y = (X - \mu)^2$. $Y$ is a non-negative random variable satisfying, by definition, $E[Y] = \sigma^2$. Applying Markov's inequality to $Y$,
\[
P(|X - \mu| \geq a) = P(Y \geq a^2) \leq \frac{E[Y]}{a^2} = \frac{\sigma^2}{a^2}.
\]

Example: On average, John Doe drinks 25 liters of wine every week. What can be said about the probability that he drinks more than 50 liters of wine in a given week? Here we apply the Markov inequality. If $X$ is the amount of wine John Doe drinks in a given week, then
\[
P(X > 50) \leq \frac{E[X]}{50} = \frac{1}{2}.
\]
Note that this may be a very loose estimate; it may well be that John Doe drinks exactly 25 liters of wine every week, in which case this probability is zero.

Example: Let $X \sim B(n, 1/2)$. Then $E[X] = n/2$ and $\mathrm{Var}[X] = n/4$. If you toss a fair coin $10^6$ times, the distribution of the number of Heads is $X \sim B(10^6, 1/2)$; the probability of getting between 495,000 and 505,000 Heads can be bounded by
\[
P(|X - E[X]| < 5000) = 1 - P(|X - E[X]| \geq 5000) \geq 1 - \frac{10^6/4}{5000^2} = 0.99.
\]
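For comparison, the exact tail probability in this coin-tossing example can be computed numerically; the sketch below uses scipy.stats.binom, which is a choice of convenience rather than anything in the notes:

from scipy.stats import binom

# Coin-tossing example: X ~ B(10**6, 1/2). Chebyshev bounds the tail
# P(|X - 500000| >= 5000) by (10**6/4) / 5000**2 = 0.01; the exact tail
# (computed here with scipy.stats.binom) is astronomically smaller.
n, p, dev = 10**6, 0.5, 5000
chebyshev_tail = (n * p * (1 - p)) / dev**2
exact_tail = binom.cdf(n // 2 - dev, n, p) + binom.sf(n // 2 + dev - 1, n, p)
print(f"Chebyshev tail bound: {chebyshev_tail:.3g}")
print(f"Exact tail:           {exact_tail:.3g}")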

Since a binomial variable $B(n, p)$ can be represented as a sum of $n$ Bernoulli variables, the following generalization is called for. Let $X_1, X_2, \dots, X_n$ be independent and identically distributed random variables, with $E[X_1] = \mu$ and $\mathrm{Var}[X_1] = \sigma^2$. Define
\[
X = \sum_{k=1}^{n} X_k.
\]
Then, by the linearity of the expectation and the independence of the random variables, $E[X] = n\mu$ and $\mathrm{Var}[X] = n\sigma^2$. Chebyshev's inequality yields
\[
P(|X - n\mu| \geq a) \leq \frac{n\sigma^2}{a^2}.
\]
This inequality starts getting useful when $a > \sqrt{n}\,\sigma$. For further reference, we note that in the particular case where $\mu = 0$, we get a fortiori that for every $a > 0$,
\[
P(X \geq a) \leq \frac{n\sigma^2}{a^2}.
\]

5.2 Hoeffding's inequality

We have just seen that if we consider a sum of independent random variables, we can use Chebyshev's inequality to bound the probability that this sum deviates by much from its expectation. We now ask: is this result tight? It turns out there are much better bounds that we can get in this setting.

Theorem 5.3 (Hoeffding's inequality): Let $X_1, X_2, \dots, X_n$ be independent random variables (not necessarily identically distributed) satisfying $E[X_i] = 0$ and $|X_i| \leq 1$ for all $1 \leq i \leq n$ (that is, for every $i$ and every $\omega$, $|X_i(\omega)| \leq 1$). Then, for every $a > 0$,
\[
P\Big(\sum_{i=1}^{n} X_i \geq a\Big) \leq \exp\Big(-\frac{a^2}{2n}\Big).
\]

Recall the definition of the moment generating function of $X$,
\[
M_X(t) = E[e^{tX}].
\]
The proof will use moment generating functions and Markov's inequality. We need the following lemma.

Lemma 5.1: Let $X$ be a random variable satisfying $E[X] = 0$ and $|X| \leq 1$. Then for all $t \in \mathbb{R}$,
\[
M_X(t) \leq \exp\Big(\frac{t^2}{2}\Big).
\]
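Before the proof, a quick numerical look at the lemma in its simplest instance (a fair $\pm 1$ variable, for which $M_X(t) = \cosh(t)$); this sketch and its grid of $t$ values are not part of the notes:

import numpy as np

# Lemma 5.1 in its simplest case: for X uniform on {-1, +1}, E[X] = 0, |X| <= 1,
# and M_X(t) = cosh(t); the lemma asserts cosh(t) <= exp(t**2 / 2) for all t.
for t in np.linspace(-3.0, 3.0, 7):
    mgf, bound = np.cosh(t), np.exp(t**2 / 2)
    print(f"t = {t:+.1f}   M_X(t) = {mgf:8.4f}   exp(t^2/2) = {bound:8.4f}   ok = {mgf <= bound}")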

Proof: Fix $t \in \mathbb{R}$, and consider the function $f(x) = e^{tx}$. This function is convex (the second derivative of $f$ is everywhere positive), meaning that any line segment between two points on its graph lies above the graph; i.e., for every $a, b \in \mathbb{R}$ and $0 \leq \lambda \leq 1$,
\[
f(\lambda a + (1 - \lambda) b) \leq \lambda f(a) + (1 - \lambda) f(b).
\]
In particular, taking $a = 1$ and $b = -1$ we get that for every $t$ and every $0 \leq \lambda \leq 1$,
\[
e^{t(\lambda - (1 - \lambda))} \leq \lambda e^{t} + (1 - \lambda) e^{-t}.
\]
Setting $x = 2\lambda - 1$, i.e., $\lambda = \tfrac{1}{2}(1 + x)$ and $1 - \lambda = \tfrac{1}{2}(1 - x)$, we get that for any $-1 \leq x \leq 1$,
\[
e^{tx} \leq \frac{1 + x}{2} e^{t} + \frac{1 - x}{2} e^{-t} = \frac{e^{t} + e^{-t}}{2} + x\, \frac{e^{t} - e^{-t}}{2}.
\]
Since $X$ is always between $-1$ and $1$, it follows that
\[
e^{tX} \leq \frac{e^{t} + e^{-t}}{2} + X\, \frac{e^{t} - e^{-t}}{2},
\]
which is an inequality between random variables. By monotonicity of expectation, using the linearity of the expectation and the fact that $E[X] = 0$,
\[
M_X(t) = E[e^{tX}] \leq \frac{e^{t} + e^{-t}}{2}.
\]
It remains to show that for all $t \in \mathbb{R}$,
\[
\frac{e^{t} + e^{-t}}{2} \leq \exp\Big(\frac{t^2}{2}\Big).
\]
This is done by comparison of the Taylor series around 0,
\[
\frac{e^{t} + e^{-t}}{2} = \frac{1}{2}\sum_{k=0}^{\infty} \frac{t^k + (-t)^k}{k!} = \sum_{\ell=0}^{\infty} \frac{t^{2\ell}}{(2\ell)!} \leq \sum_{\ell=0}^{\infty} \frac{t^{2\ell}}{2^{\ell}\, \ell!} = \sum_{\ell=0}^{\infty} \frac{(t^2/2)^{\ell}}{\ell!} = \exp\Big(\frac{t^2}{2}\Big),
\]
where we used the fact that $(2\ell)! \geq 2^{\ell}\, \ell!$.
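The last step rests on the term-wise inequality $(2\ell)! \geq 2^{\ell}\, \ell!$; a short check of the first few terms (illustrative only, not part of the notes):

from math import factorial

# Term-wise comparison used at the end of the proof:
# t^(2l)/(2l)! <= (t^2/2)^l / l!  because (2l)! >= 2**l * l!.
for l in range(8):
    lhs, rhs = factorial(2 * l), 2**l * factorial(l)
    print(f"l = {l}   (2l)! = {lhs:>12}   2^l * l! = {rhs:>8}   ok = {lhs >= rhs}")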

Proof (of Hoeffding's inequality): Let $X = \sum_{i=1}^{n} X_i$. First note that
\[
M_X(t) = E\Big[\exp\Big(t \sum_{i=1}^{n} X_i\Big)\Big] = E\Big[\prod_{i=1}^{n} e^{t X_i}\Big] = \prod_{i=1}^{n} E\big[e^{t X_i}\big] = \prod_{i=1}^{n} M_{X_i}(t),
\]
where we used Proposition 4.3 and the independence of the random variables. Using the above lemma we get that
\[
M_X(t) \leq \exp\Big(\frac{n t^2}{2}\Big).
\]
For $t > 0$ the event $X \geq a$ is the same as the event $e^{tX} \geq e^{ta}$. Using Markov's inequality we get that for any $t > 0$,
\[
P(X \geq a) = P(e^{tX} \geq e^{ta}) \leq \frac{E[e^{tX}]}{e^{ta}} = \frac{M_X(t)}{e^{ta}} \leq \exp\Big(\frac{n t^2}{2} - ta\Big).
\]
The right-hand side depends on the parameter $t$, whereas the left-hand side does not; to get a tighter bound, we may choose $t$ so as to minimize the right-hand side. The minimum is achieved when $t = a/n$, hence
\[
P(X \geq a) \leq \exp\Big(\frac{n (a/n)^2}{2} - \frac{a}{n}\, a\Big) = \exp\Big(-\frac{a^2}{2n}\Big).
\]

Comment: Wassily Hoeffding (1914-1991) was a Finnish statistician. The inequality named after him was proved in 1963.

Comment: We have stated Hoeffding's inequality under the assumptions that the random variables are bounded by 1 in absolute value and that the expectation is 0. In general, if we have bounded random variables we can apply an affine transformation to get random variables which satisfy the conditions of the theorem. For example, if $E[X_i] = \mu$ and $|X_i| \leq b$, then
\[
Y_i = \frac{X_i - \mu}{b + |\mu|}
\]
satisfies $E[Y_i] = 0$ and
\[
|Y_i| \leq \frac{|X_i| + |\mu|}{b + |\mu|} \leq 1.
\]
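The centering and rescaling described in the comment can be written out directly; a minimal sketch, assuming for illustration that the $X_i$ are uniform on $[0, b]$ with $b = 3$ (so $\mu = b/2$ exactly):

import numpy as np

# Affine rescaling from the comment: if E[X_i] = mu and |X_i| <= b, then
# Y_i = (X_i - mu) / (b + |mu|) satisfies E[Y_i] = 0 and |Y_i| <= 1.
# Illustrative choice: X_i uniform on [0, b], so mu = b/2 exactly.
rng = np.random.default_rng(2)
b = 3.0
mu = b / 2
X = rng.uniform(0.0, b, size=100_000)
Y = (X - mu) / (b + abs(mu))
print(f"mean(Y) ≈ {Y.mean():+.4f} (should be ≈ 0),   max|Y| = {np.abs(Y).max():.3f} <= 1")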

Example: Going back to the last example from the previous section, if $X_i$ are independent Bernoulli$(1/2)$ random variables, then $X = \sum_{i=1}^{n} X_i \sim B(n, 1/2)$. We define new random variables $Y_i = 2X_i - 1$, so that $|Y_i| \leq 1$ and $E[Y_i] = 0$. Furthermore, set $Y = \sum_{i=1}^{n} Y_i$, so that $Y = 2X - n$. The event $X \geq \tfrac{n}{2} + 5 \cdot 10^3$ equals the event $Y \geq 10^4$. Hoeffding's inequality gives us the bound
\[
P(Y \geq 10^4) \leq \exp\Big(-\frac{(10^4)^2}{2 \cdot 10^6}\Big) = e^{-50}.
\]
A similar bound holds for $X \leq \tfrac{n}{2} - 5 \cdot 10^3$. Put together,
\[
P\big(|X - \tfrac{n}{2}| \geq 5 \cdot 10^3\big) \leq 2 e^{-50}.
\]
So, if you tossed a fair coin $10^6$ times, the probability that you get between 495,000 and 505,000 heads is at least
\[
1 - 2e^{-50} \approx 0.99999999999999999999961425\ldots
\]
Quite an improvement over our previous bound of 99 percent!
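To put the two bounds for the coin-tossing example side by side (a sketch, not part of the notes; the complement of the Hoeffding tail bound is the $1 - 2e^{-50}$ figure quoted above):

import numpy as np

# Coin-tossing example with n = 10**6 fair coins and deviation 5000 heads.
# Chebyshev:  P(|X - n/2| >= 5000) <= (n/4) / 5000**2 = 0.01
# Hoeffding (applied to Y = 2X - n, |Y_i| <= 1): P(|Y| >= 10**4) <= 2*exp(-50)
n, dev = 10**6, 5000
chebyshev = (n / 4) / dev**2
hoeffding = 2 * np.exp(-(2 * dev) ** 2 / (2 * n))
print(f"Chebyshev tail bound: {chebyshev:.3g}")   # 0.01
print(f"Hoeffding tail bound: {hoeffding:.3g}")   # ~3.86e-22, i.e. 2*e^-50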

5.3 Jensen's inequality

Recall that a function $f : \mathbb{R} \to \mathbb{R}$ is called convex if its graph between every two points lies under the secant, i.e., if for every $x, y \in \mathbb{R}$ and $0 < \lambda < 1$,
\[
f(\lambda x + (1 - \lambda) y) \leq \lambda f(x) + (1 - \lambda) f(y).
\]

Proposition 5.1 (Jensen's inequality): If $g$ is a convex real-valued function and $X$ is a real-valued random variable, then
\[
g(E[X]) \leq E[g(X)],
\]
provided that the expectations exist.

Proof: Let's start with a proof under the additional assumption that $g$ is twice differentiable. Since $g$ is convex, it follows that $g''(x) \geq 0$ for all $x$. Taylor expanding $g$ about $\mu = E[X]$,
\[
g(x) = g(\mu) + g'(\mu)(x - \mu) + \tfrac{1}{2}\, g''(\xi)(x - \mu)^2
\]
for some $\xi$. The last term is non-negative, so we get that for all $x$,
\[
g(x) \geq g(\mu) + g'(\mu)(x - \mu),
\]
and as an inequality between random variables:
\[
g(X) \geq g(\mu) + g'(\mu)(X - \mu).
\]
By the monotonicity of expectation,
\[
E[g(X)] \geq E[g(\mu) + g'(\mu)(X - \mu)] = g(\mu) + g'(\mu)(E[X] - \mu) = g(E[X]),
\]
which is precisely what we need to show.

What about the more general case? Any convex function is continuous, and has one-sided derivatives with
\[
g'_{-}(x) = \lim_{y \uparrow x} \frac{g(x) - g(y)}{x - y} \leq \lim_{y \downarrow x} \frac{g(x) - g(y)}{x - y} = g'_{+}(x).
\]
For every $m \in [g'_{-}(\mu), g'_{+}(\mu)]$,
\[
g(x) \geq g(\mu) + m(x - \mu),
\]
so the same proof holds with $m$ replacing $g'(\mu)$.

Comment: Jensen's inequality is valid also for convex functions of several variables.

Example: Since $\exp$ is a convex function,
\[
\exp(t\, E[X]) \leq E[e^{tX}] = M_X(t),
\]
or, $E[X] \leq \tfrac{1}{t} \log M_X(t)$, for all $t > 0$.
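A simulation-style illustration of Jensen's inequality (not in the original notes): here $X$ is standard normal and $g(x) = e^x$, both arbitrary choices, for which $g(E[X]) = 1$ while $E[g(X)] = e^{1/2} \approx 1.65$:

import numpy as np

# Jensen's inequality: g(E[X]) <= E[g(X)] for convex g.
# Illustrative choices: X standard normal, g(x) = exp(x);
# exactly, g(E[X]) = e^0 = 1 and E[g(X)] = e^(1/2) ≈ 1.6487.
rng = np.random.default_rng(3)
X = rng.standard_normal(1_000_000)
print(f"g(E[X]) ≈ {np.exp(X.mean()):.4f}")
print(f"E[g(X)] ≈ {np.exp(X).mean():.4f}")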

Example: Consider a discrete random variable $X$ assuming the positive values $x_1, \dots, x_n$ with equal probability $1/n$. Jensen's inequality for $g(x) = -\log(x)$ gives
\[
-\log\Big(\frac{1}{n}\sum_{i=1}^{n} x_i\Big) \leq \frac{1}{n}\sum_{i=1}^{n} \big(-\log(x_i)\big) = -\log\Big(\prod_{i=1}^{n} x_i^{1/n}\Big).
\]
Reversing signs, and exponentiating, we get
\[
\frac{1}{n}\sum_{i=1}^{n} x_i \geq \prod_{i=1}^{n} x_i^{1/n},
\]
which is the classical arithmetic mean-geometric mean inequality. In fact, this inequality can be generalized for arbitrary distributions, $p_X(x_i) = p_i$, yielding
\[
\sum_{i=1}^{n} p_i x_i \geq \prod_{i=1}^{n} x_i^{p_i}.
\]
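The weighted AM-GM inequality at the end of the example is easy to verify numerically for any particular choice of values and weights; the numbers below are arbitrary (a sketch, not part of the notes):

import numpy as np

# Weighted AM-GM from Jensen: sum(p_i * x_i) >= prod(x_i ** p_i)
# for positive x_i and probabilities p_i. The values below are arbitrary.
x = np.array([0.5, 2.0, 3.0, 10.0])
p = np.array([0.1, 0.4, 0.3, 0.2])   # non-negative, sums to 1
print(f"weighted arithmetic mean = {np.sum(p * x):.4f}")
print(f"weighted geometric mean  = {np.prod(x ** p):.4f}")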

5.4 Kolmogorov's inequality (not taught in 2016-7)

The Kolmogorov inequality may at first seem to be of a similar flavor as Chebyshev's inequality, but it is considerably stronger. I have decided to include it here because its proof involves some interesting subtleties. First, a lemma:

Lemma 5.2: If $X, Y, Z$ are random variables such that $Y$ is independent of $X$ and $Z$, then
\[
E[XY \mid Z] = E[X \mid Z]\, E[Y].
\]

Proof: The fact that $Y$ is independent of both $X, Z$ implies that (in the case of discrete variables),
\[
p_{X,Y,Z}(x, y, z) = p_{X,Z}(x, z)\, p_Y(y).
\]
Now, for every $z$,
\[
E[XY \mid Z = z] = \sum_{x,y} x y\, p_{X,Y \mid Z}(x, y \mid z) = \sum_{x,y} x y\, \frac{p_{X,Y,Z}(x, y, z)}{p_Z(z)} = \sum_{x,y} x y\, \frac{p_{X,Z}(x, z)\, p_Y(y)}{p_Z(z)} = \sum_{y} y\, p_Y(y) \sum_{x} x\, \frac{p_{X,Z}(x, z)}{p_Z(z)} = E[Y]\, E[X \mid Z = z].
\]

Theorem 5.4 (Kolmogorov's inequality): Let $X_1, \dots, X_n$ be independent random variables such that $E[X_k] = 0$ and $\mathrm{Var}[X_k] = \sigma_k^2 < \infty$. Then, for all $a > 0$,
\[
P\Big(\max_{1 \leq k \leq n} |X_1 + \dots + X_k| \geq a\Big) \leq \frac{1}{a^2} \sum_{i=1}^{n} \sigma_i^2.
\]

Comment: For $n = 1$ this is nothing but the Chebyshev inequality. For $n > 1$ it would still be Chebyshev's inequality if the maximum over $1 \leq k \leq n$ were replaced by $k = n$, since by independence $\mathrm{Var}[X_1 + \dots + X_n] = \sum_{i=1}^{n} \sigma_i^2$.
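A Monte Carlo check of Theorem 5.4 before its proof (illustrative only, not part of the notes; the $\pm 1$ steps, the threshold $a$, and the number of trials are arbitrary choices):

import numpy as np

# Kolmogorov's inequality: P(max_k |S_k| >= a) <= (sum of variances) / a**2.
# Here the X_i are fair +/-1 steps, so each variance is 1 and the bound is n/a**2.
rng = np.random.default_rng(4)
n, a, trials = 100, 25.0, 50_000
steps = rng.choice([-1.0, 1.0], size=(trials, n))
max_abs_partial = np.abs(np.cumsum(steps, axis=1)).max(axis=1)
empirical = (max_abs_partial >= a).mean()
print(f"P(max_k |S_k| >= {a}) ≈ {empirical:.4f}   bound n/a^2 = {n / a**2:.4f}")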

Proof: We introduce the notation $S_k = X_1 + \dots + X_k$. This theorem is concerned with the probability that $|S_k| > a$ for some $k$. We define the random variable $N(\omega)$ to be the smallest integer $k$ for which $|S_k(\omega)| > a$; if there is no such number we set $N(\omega) = n$. We observe the equivalence of events,
\[
\Big\{\omega : \max_{1 \leq k \leq n} |S_k(\omega)| > a\Big\} = \big\{\omega : |S_{N(\omega)}(\omega)| > a\big\},
\]
and from the Markov inequality,
\[
P\Big(\max_{1 \leq k \leq n} |S_k| > a\Big) \leq \frac{1}{a^2}\, E[S_N^2].
\]
We need to estimate the right-hand side. If we could replace $E[S_N^2]$ by $E[S_n^2] = \mathrm{Var}[S_n] = \sum_{i=1}^{n} \sigma_i^2$, then we would be done. The trick is to show that $E[S_N^2] \leq E[S_n^2]$ by using conditional expectations. If
\[
E[S_N^2 \mid N = k] \leq E[S_n^2 \mid N = k] \qquad \text{for all } 1 \leq k \leq n,
\]
then the inequality holds, since we then have an inequality between random variables, $E[S_N^2 \mid N] \leq E[S_n^2 \mid N]$, and applying expectations on both sides gives the desired result. For $k = n$ the identity holds trivially, $E[S_N^2 \mid N = n] = E[S_n^2 \mid N = n]$. Otherwise, we write
\[
E[S_n^2 \mid N = k] = E[S_k^2 \mid N = k] + E[(X_{k+1} + \dots + X_n)^2 \mid N = k] + 2\, E[S_k (X_{k+1} + \dots + X_n) \mid N = k].
\]
The first term on the right-hand side equals $E[S_N^2 \mid N = k]$, whereas the second term is non-negative. There remains the third term, for which we remark that $X_{k+1} + \dots + X_n$ is independent of both $S_k$ and $N$, and by the previous lemma,
\[
E[S_k (X_{k+1} + \dots + X_n) \mid N = k] = E[S_k \mid N = k]\, E[X_{k+1} + \dots + X_n] = 0.
\]
Putting it all together, $E[S_n^2 \mid N = k] \geq E[S_N^2 \mid N = k]$. Since this holds for all $k$'s, we have thus shown that
\[
E[S_N^2] \leq E[S_n^2] = \sum_{i=1}^{n} \sigma_i^2,
\]
which completes the proof.
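The key step of the proof, $E[S_N^2] \leq E[S_n^2]$, can also be checked by simulation; a sketch using the same $\pm 1$ walk as above (the values of $n$, $a$ and the trial count are arbitrary, and not part of the notes):

import numpy as np

# Check of the key step: E[S_N^2] <= E[S_n^2] = n, where N is the first index k
# with |S_k| > a (and N = n if the threshold is never crossed).
rng = np.random.default_rng(5)
n, a, trials = 100, 15.0, 50_000
S = np.cumsum(rng.choice([-1.0, 1.0], size=(trials, n)), axis=1)
exceeded = np.abs(S) > a
# first (0-based) index where |S_k| > a, or n-1 (i.e. N = n) if never crossed
N = np.where(exceeded.any(axis=1), exceeded.argmax(axis=1), n - 1)
S_N = S[np.arange(trials), N]
print(f"E[S_N^2] ≈ {np.mean(S_N**2):.1f}  <=  E[S_n^2] ≈ {np.mean(S[:, -1]**2):.1f}")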