Convergence in Distribution

Convergence in Distribution

Undergraduate version of the central limit theorem: if X_1, ..., X_n are iid from a population with mean µ and standard deviation σ, then n^{1/2}(X̄ − µ)/σ has approximately a normal distribution. Also, a Binomial(n, p) random variable has approximately a N(np, np(1 − p)) distribution.

What is the precise meaning of statements like "X and Y have approximately the same distribution"? Desired meaning: X and Y have nearly the same cdf. But care is needed.

Q1) If n is a large number, is the N(0, 1/n) distribution close to the distribution of X ≡ 0?
Q2) Is N(0, 1/n) close to the N(1/n, 1/n) distribution?
Q3) Is N(0, 1/n) close to the N(1/√n, 1/n) distribution?
Q4) If X_n ≡ 2^{−n}, is the distribution of X_n close to that of X ≡ 0?

Answers depend on how close "close" needs to be, so it is a matter of definition. In practice the usual sort of approximation we want to make is to say that some random variable X, say, has nearly some continuous distribution, like N(0,1). So: we want to know that probabilities like P(X > x) are nearly P(N(0,1) > x). Real difficulty: the case of discrete random variables or infinite dimensions; not done in this course. Mathematicians' meaning of close: either they can provide an upper bound on the distance between the two things, or they are talking about taking a limit. In this course we take limits and use metrics.

Definition: A sequence of random variables X_n taking values in a separable metric space (S, d) converges in distribution to a random variable X if E(g(X_n)) → E(g(X)) for every bounded continuous function g mapping S to the real line. Notation: X_n ⇒ X.

Remark: This is abusive language. It is the distributions that converge, not the random variables.

Example: If U is Uniform[0,1] and X_n = U for every n while X = 1 − U, then X_n converges in distribution to X (the two distributions are identical) even though X_n is never close to X itself.

Other jargon: weak convergence, convergence in law.

General properties: If X_n ⇒ X and h is continuous from S_1 to S_2, then Y_n = h(X_n) ⇒ Y = h(X).

Theorem 1 (Slutsky) If X_n ⇒ X, Y_n → y_0, and h is continuous from S_1 × S_2 to S_3 at (x, y_0) for each x, then Z_n = h(X_n, Y_n) ⇒ Z = h(X, y_0).

We will begin by specializing to the simplest case: S is the real line and d(x, y) = |x − y|. In the following we suppose that X_n, X are real valued random variables.

Theorem 2 The following are equivalent:
1. X_n converges in distribution to X.
2. P(X_n ≤ x) → P(X ≤ x) for each x such that P(X = x) = 0.
3. The limit of the characteristic functions of X_n is the characteristic function of X: E(e^{itX_n}) → E(e^{itX}) for every real t.

These are all implied by M_{X_n}(t) → M_X(t) < ∞ for all |t| ≤ ε, for some positive ε.

Now let's go back to the questions I asked. Take X_n ∼ N(0, 1/n) and X = 0. Then

P(X_n ≤ x) → 1 if x > 0, 0 if x < 0, 1/2 if x = 0.

The limit is the cdf of X = 0 except at x = 0, and the cdf of X is not continuous at x = 0. So: X_n ⇒ X.

Does X_n ∼ N(1/n, 1/n) have distribution close to that of Y_n ∼ N(0, 1/n)? Find a limit X and prove both X_n ⇒ X and Y_n ⇒ X. Take X = 0. Then

E(e^{tX_n}) = e^{t/n + t²/(2n)} → 1 = E(e^{tX}) and E(e^{tY_n}) = e^{t²/(2n)} → 1,

so both X_n and Y_n have the same limit in distribution.
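These limits are also easy to check numerically. Here is a small sketch in Python (not from the original notes; the setup and names are assumptions) that compares the cdfs plotted in the figures below:

# Sketch: compare the N(0, 1/n) cdf with the cdf of the constant X = 0,
# and with the N(1/n, 1/n) cdf, for a large n.
import numpy as np
from scipy.stats import norm

n = 10_000
x = np.linspace(-0.03, 0.03, 7)

cdf_point_mass = (x >= 0).astype(float)               # cdf of X = 0
cdf_a = norm.cdf(x, loc=0.0, scale=1 / np.sqrt(n))    # N(0, 1/n)
cdf_b = norm.cdf(x, loc=1 / n, scale=1 / np.sqrt(n))  # N(1/n, 1/n)

for xi, p0, pa, pb in zip(x, cdf_point_mass, cdf_a, cdf_b):
    print(f"x={xi:+.2f}  1(x>=0)={p0:.2f}  N(0,1/n)={pa:.3f}  N(1/n,1/n)={pb:.3f}")

# Away from x = 0 all three cdfs agree closely; at x = 0 the normal cdfs are
# near 1/2, which is why the definition only requires convergence at
# continuity points of the limit cdf.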

[Figure: cdf of N(0, 1/n) vs the cdf of X ≡ 0, n = 10000; one panel on (−3, 3) and one zoomed to (−0.03, 0.03).]

[Figure: cdf of N(1/n, 1/n) vs cdf of N(0, 1/n), n = 10000; one panel on (−3, 3) and one zoomed to (−0.03, 0.03).]

Multiply both X_n and Y_n by n^{1/2} and let X ∼ N(0,1). Then √n X_n ∼ N(n^{−1/2}, 1) and √n Y_n ∼ N(0, 1). Use characteristic functions to prove that both √n X_n and √n Y_n converge to N(0,1) in distribution.

If you now let X_n ∼ N(1/√n, 1/n) and Y_n ∼ N(0, 1/n), then again both X_n and Y_n converge to 0 in distribution. If you multiply the X_n and Y_n in the previous point by n^{1/2}, then n^{1/2} X_n ∼ N(1, 1) and n^{1/2} Y_n ∼ N(0, 1), so that n^{1/2} X_n and n^{1/2} Y_n are not close together in distribution.

You can check that X_n ≡ 2^{−n} converges to 0 in distribution.

[Figure: cdf of N(1/√n, 1/n) vs cdf of N(0, 1/n), n = 10000; one panel on (−3, 3) and one zoomed to (−0.03, 0.03).]

Summary: to derive approximate distributions:

Show that a sequence of rvs X_n converges weakly to some X. The limit distribution (i.e. the distribution of X) should be non-trivial, like, say, N(0,1).

Don't say: X_n is approximately N(1/n, 1/n).

Do say: n^{1/2}(X_n − 1/n) converges to N(0,1) in distribution.

The Central Limit Theorem

Theorem 3 If X_1, X_2, ... are iid with mean 0 and variance 1, then n^{1/2} X̄ converges in distribution to N(0,1). That is,

P(n^{1/2} X̄ ≤ x) → (2π)^{−1/2} ∫_{−∞}^{x} e^{−y²/2} dy.

Proof: We will show E(e^{it n^{1/2} X̄}) → e^{−t²/2}. This is the characteristic function of N(0,1), so we are done by our theorem.
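Before the proof, here is a quick Monte Carlo check of the statement; this is a sketch with assumed names and an assumed population, not part of the argument:

# Sketch: check that sqrt(n) * Xbar is approximately N(0,1) for iid mean-0,
# variance-1 summands (here: centred Exponential(1) variables).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, reps = 400, 20_000
samples = rng.exponential(scale=1.0, size=(reps, n)) - 1.0  # mean 0, variance 1
z = np.sqrt(n) * samples.mean(axis=1)

for x in (-1.5, 0.0, 1.5):
    print(f"P(sqrt(n)*Xbar <= {x:+.1f}): empirical {np.mean(z <= x):.3f}, "
          f"N(0,1) {norm.cdf(x):.3f}")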

Some basic facts: If Z ∼ N(0,1) then E(e^{itZ}) = e^{−t²/2}.

Theorem 4 If X is a real random variable with E(|X|^k) < ∞, then the function ψ(t) = E(e^{itX}) has k continuous derivatives as a function of the real variable t. (The real part and the imaginary part each have that many derivatives.) Moreover, for 1 ≤ j ≤ k we find ψ^{(j)}(t) = i^j E(X^j e^{itX}).

Theorem 5 (Taylor Expansion) For such an X:

ψ(t) = 1 + Σ_{j=1}^{k} i^j E(X^j) t^j / j! + R(t)

where the remainder function R(t) satisfies lim_{t→0} R(t)/t^k = 0.
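The first fact is easy to verify numerically; a small sketch (assumed setup, not from the notes):

# Sketch: Monte Carlo check that E(exp(itZ)) = exp(-t^2/2) for Z ~ N(0,1).
import numpy as np

rng = np.random.default_rng(1)
z = rng.standard_normal(1_000_000)
for t in (0.5, 1.0, 2.0):
    emp = np.mean(np.exp(1j * t * z))
    print(f"t={t}: empirical {emp.real:.4f}{emp.imag:+.4f}i, exact {np.exp(-t**2 / 2):.4f}")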

Finish proof: let ψ(t) = E(exp(itX_1)). Then

E(e^{it√n X̄}) = ψ^n(t/√n).

Since the variance is 1 and the mean is 0: ψ(t) = 1 − t²/2 + R(t) where lim_{t→0} R(t)/t² = 0. Fix t and replace t by t/√n:

ψ^n(t/√n) = [1 − t²/(2n) + R(t/√n)]^n.

Define x_n = −t²/2 + n R(t/√n). Notice x_n → −t²/2, since n R(t/√n) = t² · R(t/√n)/(t/√n)² → 0 by the property of R, and use the fact that x_n → x implies (1 + x_n/n)^n → e^x, valid for all complex x. We get

E(e^{it n^{1/2} X̄}) → e^{−t²/2},

which finishes the proof.

Proof of Theorem 4: do the case k = 1. Must show

lim_{h→0} [ψ(t + h) − ψ(t)]/h = i E(X e^{itX}).

But

[ψ(t + h) − ψ(t)]/h = E( [e^{i(t+h)X} − e^{itX}]/h ).

Fact: |[e^{i(t+h)X} − e^{itX}]/h| ≤ |X| for any t (since |e^{iu} − 1| ≤ |u|). By the Dominated Convergence Theorem we can take the limit inside the expectation to get

ψ′(t) = i E(X e^{itX}).

Multivariate convergence in distribution

Definition: X_n ∈ R^p converges in distribution to X ∈ R^p if E(g(X_n)) → E(g(X)) for each bounded continuous real valued function g on R^p.

This is equivalent to either of:

Cramér-Wold device: aᵀX_n converges in distribution to aᵀX for each a ∈ R^p; or

Convergence of characteristic functions: E(e^{iaᵀX_n}) → E(e^{iaᵀX}) for each a ∈ R^p.

Extensions of the CLT (a simulation sketch of the multivariate case follows this list):

1. Y_1, Y_2, ... iid in R^p with mean µ and variance Σ; then n^{1/2}(Ȳ − µ) ⇒ MVN(0, Σ).

2. Lyapunov CLT: for each n, X_{n1}, ..., X_{nn} are independent rvs with
E(X_{ni}) = 0, (1)
Var(Σ_i X_{ni}) = 1, (2)
Σ_i E(|X_{ni}|³) → 0, (3)
then Σ_i X_{ni} ⇒ N(0,1).

3. Lindeberg CLT: if conditions (1), (2) hold and Σ_i E(X²_{ni} 1(|X_{ni}| > ε)) → 0 for each ε > 0, then Σ_i X_{ni} ⇒ N(0,1). (Lyapunov's condition implies Lindeberg's.)

4. Non-independent rvs: m-dependent CLT, martingale CLT, CLT for mixing processes.

5. Not sums: Slutsky's theorem, δ method.
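As a check of extension 1 (and of the Cramér-Wold device), here is a simulation sketch; the particular random vector (U, U²) with U uniform, and all names, are assumptions made for illustration:

# Sketch: multivariate CLT via Cramer-Wold.  For iid vectors Y_i with mean mu
# and covariance Sigma, a^T sqrt(n)(Ybar - mu) should be approx N(0, a^T Sigma a).
import numpy as np

rng = np.random.default_rng(2)
n, reps = 500, 5_000
u = rng.uniform(size=(reps, n))
y = np.stack([u, u**2], axis=-1)          # Y_i = (U_i, U_i^2), shape (reps, n, 2)
mu = np.array([1/2, 1/3])                 # E(U), E(U^2)
Sigma = np.array([[1/12, 1/12],           # Var(U),      Cov(U, U^2)
                  [1/12, 4/45]])          # Cov(U, U^2), Var(U^2)
a = np.array([1.0, -2.0])                 # a fixed direction

z = np.sqrt(n) * (y.mean(axis=1) - mu) @ a
print(f"empirical var {z.var():.4f}  vs  a^T Sigma a = {a @ Sigma @ a:.4f}")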

Slutsky's Theorem in R^p: If X_n ⇒ X and Y_n converges in distribution (or in probability) to c, a constant, then X_n + Y_n ⇒ X + c. More generally, if f(x, y) is continuous then f(X_n, Y_n) ⇒ f(X, c).

Warning: the hypothesis that the limit of Y_n is constant is essential.

Definition: Y_n → Y in probability if for all ε > 0: P(d(Y_n, Y) > ε) → 0.

Fact: for Y constant, convergence in distribution and in probability are the same. Convergence in probability always implies convergence in distribution. Both are weaker than almost sure convergence:

Definition: Y_n → Y almost surely if P({ω ∈ Ω : lim_n Y_n(ω) = Y(ω)}) = 1.
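A typical use of Slutsky's theorem is replacing σ by a consistent estimate in a standardized mean. A simulation sketch (setup and names assumed):

# Sketch: sqrt(n)(Xbar - mu)/sigma => N(0,1); replacing sigma by the consistent
# estimator s (Slutsky, with h(x, y) = x*sigma/y) leaves the limit unchanged.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n, reps = 200, 20_000
mu, sigma = 3.0, 2.0
data = rng.normal(mu, sigma, size=(reps, n))

xbar = data.mean(axis=1)
s = data.std(axis=1, ddof=1)              # converges to sigma in probability
t_stat = np.sqrt(n) * (xbar - mu) / s     # h(X_n, Y_n), h continuous for y > 0

print(f"P(T <= 1.0): empirical {np.mean(t_stat <= 1.0):.3f}, N(0,1) {norm.cdf(1.0):.3f}")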

Theorem 6 (The delta method) Suppose: the sequence Y_n → y, a constant; if X_n = a_n(Y_n − y) then X_n ⇒ X for some random variable X; and f is a function defined on a neighbourhood of y ∈ R^p which is differentiable at y. Then a_n(f(Y_n) − f(y)) converges in distribution to f′(y)X. If X_n ∈ R^p and f : R^p → R^q, then f′(y) is the q × p matrix of first derivatives of the components of f.

Proof: The function f : R^p → R^q is differentiable at y ∈ R^p if there is a matrix Df such that

lim_{h→0} ||f(y + h) − f(y) − Df h|| / ||h|| = 0,

that is, for each ε > 0 there is a δ > 0 such that ||h|| ≤ δ implies

||f(y + h) − f(y) − Df h|| ≤ ε||h||.

Define

R_n = a_n(f(Y_n) − f(y)) − a_n Df(Y_n − y) and S_n = a_n Df(Y_n − y) = Df X_n.

According to Slutsky's theorem, S_n ⇒ Df X. If we now prove R_n → 0, then by Slutsky's theorem we find

a_n(f(Y_n) − f(y)) = S_n + R_n ⇒ Df X.

Now fix ε_1, ε_2 > 0. I claim there is a K so big that for all n

P(B_n) ≡ P(||a_n(Y_n − y)|| > K) ≤ ε_1

(possible because X_n = a_n(Y_n − y) converges in distribution and is therefore tight). Let δ > 0 be the value in the definition of the derivative corresponding to ε_2/K. Choose N so large that n ≥ N implies K/a_n ≤ δ. For n ≥ N we have

{||R_n|| > ε_2} ⊆ {||Y_n − y|| > δ} ∪ {||a_n(Y_n − y)|| > K} = {||a_n(Y_n − y)|| > K},

so that n ≥ N implies P(||R_n|| > ε_2) ≤ ε_1, which means R_n → 0 in probability.

Example: Suppose X_1, ..., X_n are a sample from a population with mean µ, variance σ², and third and fourth central moments µ_3 and µ_4. Then

n^{1/2}(s² − σ²) ⇒ N(0, µ_4 − σ⁴)

where ⇒ is notation for convergence in distribution. For simplicity I define s² = n^{−1} Σ_i X_i² − X̄², the average of the squares minus the square of the average.
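Before doing the δ method calculation, here is a simulation sketch checking the claimed limit (the Exponential(1) population and all names are assumptions; for it, σ² = 1 and µ_4 = 9, so µ_4 − σ⁴ = 8):

# Sketch: empirical variance of sqrt(n)(s^2 - sigma^2) for Exponential(1) data,
# to be compared with the claimed limit mu_4 - sigma^4 = 8.
import numpy as np

rng = np.random.default_rng(4)
n, reps = 500, 20_000
data = rng.exponential(1.0, size=(reps, n))

s2 = np.mean(data**2, axis=1) - np.mean(data, axis=1)**2   # s^2 as defined above
z = np.sqrt(n) * (s2 - 1.0)
print(f"empirical variance {z.var():.2f}  (theory: 8)")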

How to apply the δ method:

1) Write the statistic as a function of averages. Define

W_i = [X_i², X_i]ᵀ and W̄_n = n^{−1} Σ_i W_i = [n^{−1} Σ_i X_i², X̄]ᵀ.

Define f(x_1, x_2) = x_1 − x_2² and see that s² = f(W̄_n).

2) Compute the mean of your averages:

µ_W ≡ E(W̄_n) = [E(X_i²), E(X_i)]ᵀ = [µ² + σ², µ]ᵀ.

3) In the δ method theorem take Y_n = W̄_n and y = E(Y_n).

4) Take a_n = n^{1/2}.

5) Use the central limit theorem:

n^{1/2}(Y_n − y) ⇒ MVN(0, Σ)

where Σ = Var(W_i).

6) To compute Σ take the expected value of (W − µ_W)(W − µ_W)ᵀ. There are 4 entries in this matrix. The top left entry is (X² − µ² − σ²)². This has expectation

E{(X² − µ² − σ²)²} = E(X⁴) − (µ² + σ²)².

Using the binomial expansion:

E(X⁴) = E{(X − µ + µ)⁴} = µ_4 + 4µµ_3 + 6µ²σ² + 4µ³E(X − µ) + µ⁴ = µ_4 + 4µµ_3 + 6µ²σ² + µ⁴.

So

Σ_11 = µ_4 − σ⁴ + 4µµ_3 + 4µ²σ².

The top right entry is the expectation of (X² − µ² − σ²)(X − µ), which is E(X³) − µE(X²). Similarly to the 4th moment calculation, this gives µ_3 + 2µσ². The lower right entry is σ². So

Σ = [ µ_4 − σ⁴ + 4µµ_3 + 4µ²σ²    µ_3 + 2µσ² ]
    [ µ_3 + 2µσ²                   σ²         ]

7) Compute the derivative (gradient) of f: it has components (1, −2x_2). Evaluate at y = (µ² + σ², µ) to get aᵀ = (1, −2µ). This leads to

n^{1/2}(s² − σ²) ≈ [1, −2µ] n^{1/2} [ n^{−1} Σ_i X_i² − (µ² + σ²) , X̄ − µ ]ᵀ,

which converges in distribution to (1, −2µ) MVN(0, Σ). This rv is N(0, aᵀΣa) = N(0, µ_4 − σ⁴).
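The final simplification aᵀΣa = µ_4 − σ⁴ is easy to verify symbolically; a sketch using sympy (the symbol names are assumptions):

# Sketch: check that (1, -2*mu) Sigma (1, -2*mu)^T simplifies to mu4 - sigma^4.
import sympy as sp

mu, sigma2, mu3, mu4 = sp.symbols('mu sigma2 mu3 mu4')   # sigma2 stands for sigma^2
Sigma = sp.Matrix([[mu4 - sigma2**2 + 4*mu*mu3 + 4*mu**2*sigma2, mu3 + 2*mu*sigma2],
                   [mu3 + 2*mu*sigma2,                           sigma2]])
a = sp.Matrix([1, -2*mu])
print(sp.simplify((a.T * Sigma * a)[0]))   # prints mu4 - sigma2**2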

An alternative approach is worth pursuing. Suppose c is a constant. Define X_i* = X_i − c. Then the sample variance of the X_i* is the same as the sample variance of the X_i. Notice that all central moments of the X_i* are the same as for the X_i. Conclusion: there is no loss of generality in taking µ = 0. In this case aᵀ = (1, 0) and

Σ = [ µ_4 − σ⁴   µ_3 ]
    [ µ_3        σ²  ].

Notice that aᵀΣ = [µ_4 − σ⁴, µ_3] and aᵀΣa = µ_4 − σ⁴.

Special case: if the population is N(µ, σ²) then µ_3 = 0 and µ_4 = 3σ⁴. Our calculation gives

n^{1/2}(s² − σ²) ⇒ N(0, 2σ⁴).

You can divide through by σ² and get

n^{1/2}(s²/σ² − 1) ⇒ N(0, 2).

In fact ns²/σ² has a χ²_{n−1} distribution, and so the usual central limit theorem shows that

(n − 1)^{−1/2}[ns²/σ² − (n − 1)] ⇒ N(0, 2)

(using that a χ²_1 variable has mean 1 and variance 2). Factor out n^{1/2} to get

[n/(n − 1)]^{1/2} n^{1/2}(s²/σ² − 1) + (n − 1)^{−1/2} ⇒ N(0, 2),

which is the δ method calculation except for some constants. The difference is unimportant: Slutsky's theorem.
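A quick numerical check of the normal special case (assumed setup):

# Sketch: for normal data, Var(sqrt(n)(s^2 - sigma^2)) should approach 2*sigma^4.
import numpy as np

rng = np.random.default_rng(5)
n, reps, sigma = 500, 20_000, 2.0
data = rng.normal(0.0, sigma, size=(reps, n))

s2 = np.mean(data**2, axis=1) - np.mean(data, axis=1)**2
z = np.sqrt(n) * (s2 - sigma**2)
print(f"empirical variance {z.var():.1f}  vs  2*sigma^4 = {2 * sigma**4:.1f}")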

Extending the ideas to higher dimensions: take W_1, W_2, ... iid with density f(x) = exp(−(x + 1))1(x > −1), a mean-0 shifted exponential. Set S_k = W_1 + ··· + W_k. Plot S_k against k for k = 1, ..., n; label the x axis to run from 0 to 1 (i.e. plot against k/n) and rescale the vertical axis by √n so the path fits in a square.
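A plotting sketch (using matplotlib; the construction follows the description above, but the code itself is an assumption) that produces pictures like the figure below:

# Sketch: rescaled partial-sum paths S_k/sqrt(n) of mean-zero shifted
# exponential variables, plotted against k/n for several n.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax, n in zip(axes.ravel(), (100, 400, 1600, 10_000)):
    w = rng.exponential(1.0, size=n) - 1.0        # density exp(-(x+1)) on x > -1
    s = np.cumsum(w)
    ax.plot(np.arange(1, n + 1) / n, s / np.sqrt(n))
    ax.set_title(f"n={n}")
    ax.set_xlabel("k/n")
    ax.set_ylabel("S_k/sqrt(n)")
plt.tight_layout()
plt.show()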

[Figure: paths of S_k/√n plotted against k/n for n = 100, 400, 1600, 10000.]

Proof of Slutsky's Theorem: First: why is it true? If X_n ⇒ X and Y_n → y, then we will show (X_n, Y_n) ⇒ (X, y). The point is that the joint law of (X, y) is determined by the marginal laws! Once this is done, then by definition E(h(X_n, Y_n)) → E(h(X, y)).

Note: You don't need continuity at all (x, y), but I will do only the easy case.

Definition: A family {P_α, α ∈ A} of probability measures on (S, d) is tight if for each ε > 0 there is a compact K in S such that for every α ∈ A, P_α(K) ≥ 1 − ε.

Theorem 7 If S is a complete separable metric space, then each probability measure P on the Borel sets in S is tight.

Proof: Let x_1, x_2, ... be dense in S. For each n draw balls B_{n,1}, B_{n,2}, ... of radius 1/n around x_1, x_2, .... Each point in S is in one of these balls because the sequence x_j is dense. That is:

S = ∪_{j=1}^{∞} B_{n,j}.

Thus

1 = P(S) = lim_{J→∞} P(∪_{j=1}^{J} B_{n,j}).

Pick J_n so large that

P(∪_{j=1}^{J_n} B_{n,j}) ≥ 1 − ε/2^n.

Let F_n be the closure of ∪_{j=1}^{J_n} B_{n,j}. Let K = ∩_{n=1}^{∞} F_n. I claim K is compact and has probability at least 1 − ε. First,

P(K) = 1 − P(K^c) = 1 − P(∪_n F_n^c) ≥ 1 − Σ_n P(F_n^c) ≥ 1 − Σ_n ε/2^n = 1 − ε.

(Incidentally you see that K is not empty!)

Second: K is closed (an intersection of closed sets), hence complete as a closed subset of the complete space S. Third: K is totally bounded since each F_n covers K with finitely many (closed) balls of radius 1/n. So K is compact.

Theorem 8 If X_n converges in distribution to some X in a complete separable metric space S, then the sequence X_n is tight.

Conversely:

Theorem 9 If the sequence X_n is tight, then every subsequence is also tight, and there is a subsequence X_{n_k} and a random variable X such that X_{n_k} ⇒ X as k → ∞.

Theorem 10 If there is a rv X such that every subsequence of X_n has a further sub-subsequence converging in distribution to X, then X_n ⇒ X.

Proof of Theorem 9: do the case R^p. The first assertion is obvious. Let x_1, x_2, ... be dense in R^p. Find a sequence n_{1,1} < n_{1,2} < ... such that the sequence F_{n_{1,k}}(x_1) has a limit, which we denote y_1. This exists because probabilities are trapped in [0, 1] (Bolzano-Weierstrass). Pick n_{2,1} < n_{2,2} < ..., a subsequence of the n_{1,k}, such that F_{n_{2,k}}(x_2) has a limit, which we denote y_2. Continue, picking the subsequence n_{m+1,k} from the sequence n_{m,k} so that F_{n_{m+1,k}}(x_{m+1}) has a limit, which we denote y_{m+1}.

Trick: diagonalization. Consider the sequence n_{1,1} < n_{2,2} < .... After the kth entry, all remaining terms are a subsequence of the kth subsequence n_{k,j}. So for each j,

lim_{k→∞} F_{n_{k,k}}(x_j) = y_j.

Idea: we would like to define F_X(x_j) = y_j, but that might not give a cdf. Instead set

F_X(x) = inf{y_j : x_j > x}.

Next: prove F_X is a cdf. Then prove the subsequence converges to F_X.