MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013. Large Deviations for i.i.d. Random Variables


Content.

1. Chernoff bound using exponential moment generating functions.
2. Properties of moment generating functions.
3. Legendre transforms.

1 Preliminary notes

The Weak Law of Large Numbers tells us that if $X_1, X_2, \ldots$ is an i.i.d. sequence of random variables with mean $\mu \triangleq E[X_1] < \infty$, then for every $\epsilon > 0$,

$$P\Big(\Big|\frac{X_1 + \cdots + X_n}{n} - \mu\Big| > \epsilon\Big) \to 0, \quad \text{as } n \to \infty.$$

But how quickly does this convergence to zero occur? We can try to use the Chebyshev inequality, which says

$$P\Big(\Big|\frac{X_1 + \cdots + X_n}{n} - \mu\Big| > \epsilon\Big) \le \frac{\mathrm{Var}(X_1)}{n\epsilon^2}.$$

This suggests a decay rate of order $1/n$, if we treat $\mathrm{Var}(X_1)$ and $\epsilon$ as constants. Is this an accurate rate? Far from so. In fact, if a higher moment of $X_1$ is finite, for example $E[X_1^{2m}] < \infty$, then using a similar bound we could show that the decay rate is at least $1/n^m$ (exercise). The goal of large deviations theory is to show that in many interesting cases the decay rate is in fact exponential: $e^{-cn}$. The exponent $c > 0$ is called the large deviations rate, and in many cases it can be computed explicitly or numerically.
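To make the contrast concrete, here is a minimal simulation sketch (an illustration added to these notes, assuming NumPy is available; it uses i.i.d. Exponential(1) samples, for which $\mu = \mathrm{Var}(X_1) = 1$):

```python
# Hedged illustration: compare the Chebyshev bound Var(X_1)/(n * eps^2)
# with the simulated tail P(|(X_1+...+X_n)/n - mu| > eps)
# for X_i ~ Exponential(1), where mu = Var(X_1) = 1.
import numpy as np

rng = np.random.default_rng(0)
mu, eps, trials = 1.0, 0.5, 100_000

for n in [10, 50, 100]:
    means = rng.exponential(1.0, size=(trials, n)).mean(axis=1)
    empirical = np.mean(np.abs(means - mu) > eps)  # simulated tail frequency
    chebyshev = min(1.0, 1.0 / (n * eps**2))       # Chebyshev upper bound
    print(f"n={n:4d}  simulated={empirical:.2e}  chebyshev={chebyshev:.2e}")
```

Already at $n = 100$ the simulated frequency sits well below the $1/n$ bound, consistent with the exponential decay established below.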

2 Large deviations upper bound (Chernoff bound)

Consider an i.i.d. sequence with common probability distribution function $F(x) = P(X \le x)$, $x \in \mathbb{R}$. Fix a value $a > \mu$, where $\mu$ is again the expectation corresponding to the distribution $F$. We consider the probability that the average of $X_1, \ldots, X_n$ exceeds $a$. The WLLN tells us that this happens with probability converging to zero as $n$ increases, and now we obtain an estimate on this probability. Fix a positive parameter $\theta > 0$. We have

$$P\Big(\frac{\sum_{1 \le i \le n} X_i}{n} > a\Big) = P\Big(\sum_{1 \le i \le n} X_i > na\Big) = P\big(e^{\theta \sum_{1 \le i \le n} X_i} > e^{\theta n a}\big) \le \frac{E\big[e^{\theta \sum_{1 \le i \le n} X_i}\big]}{e^{\theta n a}} \ \text{(Markov inequality)} = \frac{E\big[\prod_{1 \le i \le n} e^{\theta X_i}\big]}{e^{\theta n a}}.$$

But recall that the $X_i$'s are i.i.d. Therefore $E[\prod_{1 \le i \le n} e^{\theta X_i}] = (E[e^{\theta X_1}])^n$. Thus we obtain the upper bound

$$P\Big(\frac{\sum_{1 \le i \le n} X_i}{n} > a\Big) \le \Big(\frac{E[e^{\theta X_1}]}{e^{\theta a}}\Big)^n. \tag{1}$$

Of course this bound is meaningful only if the ratio $E[e^{\theta X_1}]/e^{\theta a}$ is less than unity. We recognize $E[e^{\theta X_1}]$ as the moment generating function of $X_1$ and denote it by $M(\theta)$. For the bound to be useful, we need $E[e^{\theta X_1}]$ to be at least finite. If we could show that this ratio is less than unity, we would be done: exponentially fast decay of the probability would be established.
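Bound (1) is easy to evaluate numerically. The following sketch (an added illustration assuming NumPy; not part of the original notes) checks it for $X_i \sim$ Exponential(1), using the moment generating function $M(\theta) = 1/(1 - \theta)$, $\theta < 1$, computed in Section 3, with $a = 1.2$ and $\theta = 1/6$ (Section 4 identifies this $\theta$ as the minimizing choice):

```python
# Hedged sanity check of bound (1) for X_i ~ Exponential(1), a = 1.2.
# M(theta) = 1/(1 - theta) for theta < 1; any theta with
# M(theta)/exp(theta*a) < 1 yields a valid exponential bound.
import numpy as np

rng = np.random.default_rng(1)
a, theta, n, trials = 1.2, 1.0 / 6.0, 50, 200_000

M = 1.0 / (1.0 - theta)                  # MGF of Exponential(1) at theta
bound = (M / np.exp(theta * a)) ** n     # right-hand side of (1)
means = rng.exponential(1.0, size=(trials, n)).mean(axis=1)
print(f"bound={bound:.3e}  simulated={np.mean(means > a):.3e}")
```

For this $n$ the simulated probability indeed stays below the bound.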

Similarly, suppose we want to estimate

$$P\Big(\frac{\sum_{1 \le i \le n} X_i}{n} < a\Big)$$

for some $a < \mu$. Fixing now a negative $\theta < 0$, we obtain

$$P\Big(\frac{\sum_{1 \le i \le n} X_i}{n} < a\Big) = P\big(e^{\theta \sum_{1 \le i \le n} X_i} > e^{\theta n a}\big) \le \Big(\frac{M(\theta)}{e^{\theta a}}\Big)^n,$$

and now we need to find a negative $\theta$ such that $M(\theta) < e^{\theta a}$. In particular, we need to focus on $\theta$ for which the moment generating function is finite. For this purpose let $D(M) \triangleq \{\theta : M(\theta) < \infty\}$. Namely, $D(M)$ is the set of values $\theta$ for which the moment generating function is finite. Thus we call $D(M)$ the domain of $M$.

3 Moment generating function. Examples and properties

Let us consider some examples of computing the moment generating functions.

Exponential distribution. Consider an exponentially distributed random variable $X$ with parameter $\lambda$. Then

$$M(\theta) = \int_0^\infty e^{\theta x} \lambda e^{-\lambda x}\, dx = \lambda \int_0^\infty e^{-(\lambda - \theta)x}\, dx.$$

When $\theta < \lambda$ this integral equals $\lambda \big[-\frac{1}{\lambda - \theta} e^{-(\lambda - \theta)x}\big]_0^\infty = \lambda/(\lambda - \theta)$. But when $\theta \ge \lambda$, the integral is infinite. Thus the exponential moment generating function is finite iff $\theta < \lambda$, in which case $M(\theta) = \lambda/(\lambda - \theta)$ and the domain of the moment generating function is $D(M) = (-\infty, \lambda)$.

Standard Normal distribution. When $X$ has the standard Normal distribution, we obtain

$$M(\theta) = E[e^{\theta X}] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{\theta x} e^{-\frac{x^2}{2}}\, dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{x^2 - 2\theta x + \theta^2}{2} + \frac{\theta^2}{2}}\, dx = e^{\frac{\theta^2}{2}} \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{(x - \theta)^2}{2}}\, dx.$$

Introducing the change of variables $y = x - \theta$, we obtain that the last integral equals $\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{y^2}{2}}\, dy = 1$ (the integral of the density of the standard Normal distribution). Therefore $M(\theta) = e^{\frac{\theta^2}{2}}$. We see that it is always finite and $D(M) = \mathbb{R}$. In retrospect it is not surprising that in this case $M(\theta)$ is finite for all $\theta$: the density of the standard Normal distribution decays like $e^{-x^2/2}$, which is faster than the exponential growth $e^{\theta x}$, so no matter how large $\theta$ is, the overall product is integrable.
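As a quick numerical cross-check (an added sketch assuming NumPy; not part of the original notes), both closed forms can be compared against Monte Carlo estimates of $E[e^{\theta X}]$:

```python
# Hedged check of the closed forms above: M(theta) = lambda/(lambda - theta)
# for Exponential(lambda) (theta < lambda), and M(theta) = exp(theta^2/2)
# for the standard Normal.
import numpy as np

rng = np.random.default_rng(2)
lam, theta, trials = 2.0, 0.7, 1_000_000   # requires theta < lam

x = rng.exponential(1.0 / lam, size=trials)   # NumPy's scale = 1/lambda
print(np.exp(theta * x).mean(), lam / (lam - theta))

z = rng.standard_normal(trials)
print(np.exp(theta * z).mean(), np.exp(theta**2 / 2))
```

Each pair of printed numbers should agree to a few decimal places.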

Poisson distribution. Suppose $X$ has a Poisson distribution with parameter $\lambda$. Then

$$M(\theta) = E[e^{\theta X}] = \sum_{m \ge 0} e^{\theta m} \frac{\lambda^m e^{-\lambda}}{m!} = e^{-\lambda} \sum_{m \ge 0} \frac{(e^{\theta} \lambda)^m}{m!} = e^{e^{\theta} \lambda - \lambda}$$

(where we use the formula $\sum_{m \ge 0} \frac{t^m}{m!} = e^t$). Thus again $D(M) = \mathbb{R}$. This again has to do with the fact that $\lambda^m/m!$ decays at a rate similar to $1/m!$, which is faster than any exponential growth rate $e^{\theta m}$.

We now establish several properties of moment generating functions.

Proposition 1. The moment generating function $M(\theta)$ of a random variable $X$ satisfies the following properties:

(a) $M(0) = 1$. If $M(\theta) < \infty$ for some $\theta > 0$, then $M(\theta') < \infty$ for all $\theta' \in [0, \theta]$. Similarly, if $M(\theta) < \infty$ for some $\theta < 0$, then $M(\theta') < \infty$ for all $\theta' \in [\theta, 0]$. In particular, the domain $D(M)$ is an interval containing zero.

(b) Suppose $(\theta_1, \theta_2) \subset D(M)$. Then $M(\theta)$ as a function of $\theta$ is differentiable at every $\theta_0 \in (\theta_1, \theta_2)$, and furthermore

$$\frac{d M(\theta)}{d\theta}\Big|_{\theta = \theta_0} = E[X e^{\theta_0 X}] < \infty.$$

Namely, the order of the differentiation and expectation operators can be interchanged.

Proof. Part (a) is left as an exercise. We now establish part (b). Fix any $\theta_0 \in (\theta_1, \theta_2)$ and consider the $\theta$-indexed family of random variables

$$Y_\theta \triangleq \frac{\exp(\theta X) - \exp(\theta_0 X)}{\theta - \theta_0}.$$

Since $\frac{d}{d\theta} \exp(\theta x) = x \exp(\theta x)$, almost surely $Y_\theta \to X \exp(\theta_0 X)$ as $\theta \to \theta_0$. Thus to establish the claim it suffices to show that convergence of the expectations holds as well, namely $\lim_{\theta \to \theta_0} E[Y_\theta] = E[X \exp(\theta_0 X)]$, with $E[X \exp(\theta_0 X)] < \infty$. For this purpose we will use the Dominated Convergence Theorem. Namely, we will identify a random variable $Z$ such that $|Y_\theta| \le Z$ almost surely for all $\theta$ in some interval $(\theta_0 - \epsilon, \theta_0 + \epsilon)$, with $E[Z] < \infty$.

Fix $\epsilon > 0$ small enough so that $(\theta_0 - \epsilon, \theta_0 + \epsilon) \subset (\theta_1, \theta_2)$. Let $Z = \epsilon^{-1} \exp(\theta_0 X + \epsilon |X|)$. Using the Taylor expansion of the $\exp(\cdot)$ function, for every $\theta \in (\theta_0 - \epsilon, \theta_0 + \epsilon)$ we have

$$Y_\theta = \exp(\theta_0 X) \Big( X + \frac{1}{2!} (\theta - \theta_0) X^2 + \frac{1}{3!} (\theta - \theta_0)^2 X^3 + \cdots + \frac{1}{n!} (\theta - \theta_0)^{n-1} X^n + \cdots \Big),$$

which gives

$$|Y_\theta| \le \exp(\theta_0 X) \Big( |X| + \frac{1}{2!} |\theta - \theta_0| |X|^2 + \cdots + \frac{1}{n!} |\theta - \theta_0|^{n-1} |X|^n + \cdots \Big) \le \exp(\theta_0 X) \Big( |X| + \frac{1}{2!} \epsilon |X|^2 + \cdots + \frac{1}{n!} \epsilon^{n-1} |X|^n + \cdots \Big) = \exp(\theta_0 X)\, \epsilon^{-1} \big( \exp(\epsilon |X|) - 1 \big) \le \exp(\theta_0 X)\, \epsilon^{-1} \exp(\epsilon |X|) = Z.$$

It remains to show that $E[Z] < \infty$. We have

$$E[Z] = \epsilon^{-1} E[\exp(\theta_0 X + \epsilon X) 1\{X \ge 0\}] + \epsilon^{-1} E[\exp(\theta_0 X - \epsilon X) 1\{X < 0\}] \le \epsilon^{-1} E[\exp(\theta_0 X + \epsilon X)] + \epsilon^{-1} E[\exp(\theta_0 X - \epsilon X)] = \epsilon^{-1} M(\theta_0 + \epsilon) + \epsilon^{-1} M(\theta_0 - \epsilon) < \infty,$$

since $\epsilon$ was chosen so that $(\theta_0 - \epsilon, \theta_0 + \epsilon) \subset (\theta_1, \theta_2) \subset D(M)$. This completes the proof of the proposition.

Problem 1. (a) Establish part (a) of Proposition 1.

(b) Construct an example of a random variable for which the corresponding interval is trivial: $D(M) = \{0\}$. Namely, $M(\theta) = \infty$ for every $\theta \ne 0$.

(c) Construct an example of a random variable $X$ such that $D(M) = [\theta_1, \theta_2]$ for some $\theta_1 < 0 < \theta_2$. Namely, the domain $D(M)$ is a closed interval of non-zero length containing zero.

Now suppose the i.i.d. sequence $X_i$, $i \ge 1$, is such that $0 \in (\theta_1, \theta_2) \subset D(M)$, where $M$ is the moment generating function of $X_1$. Namely, $M$ is finite in a neighborhood of $0$. Let $a > \mu = E[X_1]$. Applying Proposition 1, let us differentiate the ratio $M(\theta)/e^{\theta a}$ with respect to $\theta$ at $\theta = 0$:

$$\frac{d}{d\theta} \frac{M(\theta)}{e^{\theta a}} \Big|_{\theta = 0} = \frac{E[X_1 e^{\theta X_1}] e^{\theta a} - a e^{\theta a} E[e^{\theta X_1}]}{e^{2\theta a}} \Big|_{\theta = 0} = \mu - a < 0.$$

Note that $M(\theta)/e^{\theta a} = 1$ when $\theta = 0$. Therefore, for sufficiently small positive $\theta$, the ratio $M(\theta)/e^{\theta a}$ is smaller than unity, and (1) provides an exponential bound on the tail probability for the average of $X_1, \ldots, X_n$. Similarly, if $a < \mu$, the ratio $M(\theta)/e^{\theta a} < 1$ for sufficiently small negative $\theta$. We now summarize our findings.

Theorem 1 (Chernoff bound). Given an i.i.d. sequence $X_1, \ldots, X_n$, suppose the moment generating function $M(\theta)$ is finite in some interval $(\theta_1, \theta_2) \ni 0$. Let $a > \mu = E[X_1]$. Then there exists $\theta > 0$ such that $M(\theta)/e^{\theta a} < 1$ and

$$P\Big(\frac{\sum_{1 \le i \le n} X_i}{n} > a\Big) \le \Big(\frac{M(\theta)}{e^{\theta a}}\Big)^n.$$

Similarly, if $a < \mu$, then there exists $\theta < 0$ such that $M(\theta)/e^{\theta a} < 1$ and

$$P\Big(\frac{\sum_{1 \le i \le n} X_i}{n} < a\Big) \le \Big(\frac{M(\theta)}{e^{\theta a}}\Big)^n.$$

How small can we make the ratio $M(\theta)/\exp(\theta a)$? We have some freedom in choosing $\theta$, as long as $E[e^{\theta X_1}]$ is finite. So we could try to find the $\theta$ which minimizes the ratio $M(\theta)/e^{\theta a}$. This is what we will do in the rest of the lecture. The surprising conclusion of large deviations theory is very often that such a minimizing value $\theta^*$ exists and is tight; namely, it provides the correct decay rate! In this case we will be able to say

$$P\Big(\frac{\sum_{1 \le i \le n} X_i}{n} > a\Big) \approx \exp(-n I(a, \theta^*)),$$

where $I(a, \theta^*) = -\log\big(M(\theta^*)/e^{\theta^* a}\big) = \theta^* a - \log M(\theta^*)$.
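The minimization over $\theta$ can also be carried out numerically. Here is a brief sketch (an added illustration assuming SciPy is available; not part of the original notes) for $X_1 \sim$ Exponential(1) and $a = 1.2$, whose closed-form minimizer $\theta^* = \lambda - 1/a$ is derived in Section 4:

```python
# Hedged numerical minimization of M(theta)/exp(theta*a) from Theorem 1,
# for X_1 ~ Exponential(1) (lambda = 1) and a = 1.2.
import numpy as np
from scipy.optimize import minimize_scalar

lam, a = 1.0, 1.2

def ratio(theta):
    # M(theta) = lam/(lam - theta), finite only for theta < lam
    return (lam / (lam - theta)) / np.exp(theta * a)

res = minimize_scalar(ratio, bounds=(0.0, 0.99), method="bounded")
print(res.x, lam - 1.0 / a)    # numerical vs closed-form minimizer theta*
print(-np.log(res.fun))        # the resulting rate I(a, theta*)
```

The optimizer recovers $\theta^* = 1/6$ and the rate $I(1.2, \theta^*) = 0.2 - \log 1.2 \approx 0.0177$, matching the exponential-distribution example in the next section.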

4 Legendre transforms

Theorem 1 gave us a large deviations bound $(M(\theta)/e^{\theta a})^n$, which we rewrite as $e^{-n(\theta a - \log M(\theta))}$. We now study in more detail the exponent $\theta a - \log M(\theta)$.

Definition 1. The Legendre transform of a random variable $X$ is the function

$$I(a) \triangleq \sup_{\theta \in \mathbb{R}} \big( \theta a - \log M(\theta) \big).$$

Let us go over the examples of some distributions and compute their corresponding Legendre transforms.

Exponential distribution with parameter $\lambda$. Recall that $M(\theta) = \lambda/(\lambda - \theta)$ when $\theta < \lambda$ and $M(\theta) = \infty$ otherwise. Therefore the supremum is attained over $\theta < \lambda$, and

$$I(a) = \sup_{\theta < \lambda} \Big( a\theta - \log \frac{\lambda}{\lambda - \theta} \Big) = \sup_{\theta < \lambda} \big( a\theta - \log \lambda + \log(\lambda - \theta) \big).$$

Setting the derivative of $g(\theta) = a\theta - \log \lambda + \log(\lambda - \theta)$ equal to zero, we obtain the equation $a - 1/(\lambda - \theta) = 0$, which has the unique solution $\theta^* = \lambda - 1/a$. For the boundary cases, $a\theta - \log \lambda + \log(\lambda - \theta) \to -\infty$ when either $\theta \uparrow \lambda$ or $\theta \to -\infty$ (check). Therefore

$$I(a) = a(\lambda - 1/a) - \log \lambda + \log(\lambda - \lambda + 1/a) = a\lambda - 1 - \log \lambda + \log(1/a) = a\lambda - 1 - \log \lambda - \log a.$$

The large deviations bound then tells us that when $a > 1/\lambda$,

$$P\Big(\frac{\sum_{1 \le i \le n} X_i}{n} > a\Big) \le e^{-n(a\lambda - 1 - \log \lambda - \log a)}.$$

Say $\lambda = 1$ and $a = 1.2$. Then the approximation gives us $e^{-n(0.2 - \log 1.2)}$. Note that we can obtain an exact expression for this tail probability. Indeed, $X_1, X_1 + X_2, \ldots, X_1 + X_2 + \cdots + X_n, \ldots$ are the event times of a Poisson process with parameter $\lambda = 1$. Therefore we can compute the probability $P(\sum_{1 \le i \le n} X_i / n > 1.2)$ exactly: it is the probability that the Poisson process has at most $n - 1$ events before time $1.2n$.

Thus

$$P\Big(\frac{\sum_{1 \le i \le n} X_i}{n} > 1.2\Big) = P\Big(\sum_{1 \le i \le n} X_i > 1.2n\Big) = \sum_{0 \le k \le n-1} e^{-1.2n} \frac{(1.2n)^k}{k!}.$$

It is not at all clear how revealing this expression is. In hindsight, we know that it is approximately $e^{-n(0.2 - \log 1.2)}$, obtained via large deviations theory.

Standard Normal distribution. Recall that $M(\theta) = e^{\frac{\theta^2}{2}}$ when $X_1$ has the standard Normal distribution. The expected value is $\mu = 0$. Thus we fix $a > 0$ and obtain

$$I(a) = \sup_{\theta} \Big( a\theta - \frac{\theta^2}{2} \Big) = \frac{a^2}{2},$$

achieved at $\theta^* = a$. Thus for $a > 0$, large deviations theory predicts that

$$P\Big(\frac{\sum_{1 \le i \le n} X_i}{n} > a\Big) \le e^{-n\frac{a^2}{2}}.$$

Again we could compute this probability directly. We know that $\frac{\sum_{1 \le i \le n} X_i}{n}$ is distributed as a Normal random variable with mean zero and variance $1/n$. Thus

$$P\Big(\frac{\sum_{1 \le i \le n} X_i}{n} > a\Big) = \sqrt{\frac{n}{2\pi}} \int_a^{\infty} e^{-\frac{nt^2}{2}}\, dt.$$

After a little bit of technical work one could show that this integral is dominated by its part around $a$, namely $\int_a^{a+\epsilon}$, which is further approximated by the value of the integrand at $a$ itself, namely $\sqrt{\frac{n}{2\pi}}\, e^{-\frac{na^2}{2}}$. This is consistent with the value given by large deviations theory: the lower order magnitude term $\sqrt{n/(2\pi)}$ simply disappears in the approximation on the log scale.
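The log-scale agreement can be observed directly. A hedged sketch (assuming SciPy; not part of the original notes) showing that $\frac{1}{n} \log P(\sum_i X_i / n > a)$ approaches $-a^2/2$ as $n$ grows:

```python
# Hedged check for the standard Normal example: compare the exact tail
# P(N(0, 1/n) > a) = P(N(0,1) > a*sqrt(n)) with exp(-n a^2 / 2) on log scale.
import numpy as np
from scipy.stats import norm

a = 0.5
for n in [10, 100, 1000]:
    exact = norm.sf(a * np.sqrt(n))   # survival function of N(0,1)
    print(f"n={n:5d}  log(exact)/n={np.log(exact)/n:.5f}  -a^2/2={-a**2/2:.5f}")
```

The normalized log-probability converges to $-a^2/2 = -0.125$; the residual gap comes from the sub-exponential prefactor, which vanishes on the log scale.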

Poisson distribution. Suppose $X$ has a Poisson distribution with parameter $\lambda$. Recall that in this case $M(\theta) = e^{e^{\theta} \lambda - \lambda}$. Then

$$I(a) = \sup_{\theta} \big( a\theta - (e^{\theta} \lambda - \lambda) \big).$$

Setting the derivative to zero we obtain $\theta^* = \log(a/\lambda)$ and $I(a) = a \log(a/\lambda) - (a - \lambda)$. Thus for $a > \lambda$, large deviations theory predicts that

$$P\Big(\frac{\sum_{1 \le i \le n} X_i}{n} > a\Big) \le e^{-n(a \log(a/\lambda) - a + \lambda)}.$$

In this case as well we can compute the large deviations probability explicitly. The sum $X_1 + \cdots + X_n$ of Poisson random variables is also a Poisson random variable, with parameter $n\lambda$. Therefore

$$P\Big(\sum_{1 \le i \le n} X_i > na\Big) = \sum_{m > na} e^{-n\lambda} \frac{(n\lambda)^m}{m!}.$$

But again it is hard to infer a more explicit rate of decay from this expression.

5 Additional reading materials

Chapter 0 of [2]. This is a non-technical introduction to the field, which describes the motivation and various applications of large deviations theory. Soft reading.

Chapter 2.2 of [1].

References

[1] A. Dembo and O. Zeitouni, Large deviations techniques and applications, Springer, 1998.

[2] A. Shwartz and A. Weiss, Large deviations for performance analysis, Chapman and Hall, 1995.

MIT OpenCourseWare
http://ocw.mit.edu

15.070J / 6.265J Advanced Stochastic Processes
Fall 2013

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.