Notes on Asymptotic Theory: Convergence in Probability and Distribution
Introduction to Econometric Theory, Econ. 770
Jonathan B. Hill
Dept. of Economics, University of North Carolina - Chapel Hill
November 9

1 Introduction

Let $(\Omega, \mathcal{F}, P)$ be a probability space. Throughout $\theta$ is a parameter of interest like the mean, variance, correlation, or distribution parameters like Poisson, Binomial, or exponential. Throughout $\{\hat\theta_n\}$ is a sequence of estimators of $\theta$ based on a sample of data $\{x_i\}_{i=1}^n$ with sample size $n$. Assume $\hat\theta_n$ is $\mathcal{F}$-measurable for any $n$. Unless otherwise noted, assume the $x_i$ have the same mean and variance: $x_i \sim (\mu, \sigma^2)$. If appropriate, we may have a bivariate sample $\{x_i, y_i\}_{i=1}^n$ where $x_i \sim (\mu_x, \sigma_x^2)$ and $y_i \sim (\mu_y, \sigma_y^2)$. Examples include the sample mean, variance, or correlation:

Sample Mean: $\bar{x}_n := \frac{1}{n}\sum_{i=1}^n x_i$

Sample Variance #1: $s_n^2 := \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x}_n)^2$

Sample Variance #2: $\hat\sigma_n^2 := \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x}_n)^2$

Sample Correlation: $\hat\rho_n := \dfrac{\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x}_n)(y_i - \bar{y}_n)}{\hat\sigma_{x,n}\,\hat\sigma_{y,n}}$

Similarly, we may estimate a probability by using a sample relative frequency:
$$\hat{P}_n(A) := \frac{1}{n}\sum_{i=1}^n I(x_i \in A),$$
the sample percentage of $x_i \in A$.
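Each estimator above is a one-line computation. The following minimal sketch uses hypothetical simulated data (the sample, the event $\{x_i > 75\}$, and all numeric values are assumptions for illustration); NumPy's `ddof` argument selects the $n-1$ or $n$ divisor.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(75.0, 2.0, size=n)            # hypothetical sample
y = 0.5 * x + rng.normal(0.0, 1.0, size=n)   # hypothetical bivariate partner

xbar = x.mean()                 # sample mean
s2   = x.var(ddof=1)            # sample variance #1: divides by n - 1
sig2 = x.var(ddof=0)            # sample variance #2: divides by n
rho  = np.corrcoef(x, y)[0, 1]  # sample correlation
phat = np.mean(x > 75.0)        # relative frequency of the event {x_i > 75}
```

Note that the two variance estimators differ only by the factor $n/(n-1)$, which is why they behave identically as $n$ grows.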
Notice $\hat{P}_n(A)$ estimates $P(x_i \in A)$. We will look at estimator properties: what $\hat\theta_n$ is on average for any sample size $n$; and what $\hat\theta_n$ becomes as the sample size $n$ grows. In every case above the estimator is a variant of a straight average (e.g. $\frac{1}{n}\sum_{i=1}^n (x_i-\bar{x}_n)(y_i-\bar{y}_n)$ is a straight average of $(x_i-\bar{x}_n)(y_i-\bar{y}_n)$), or a function of a straight average (e.g. $\hat\sigma_n := (\frac{1}{n}\sum_{i=1}^n (x_i-\bar{x}_n)^2)^{1/2}$, the square root of the average $(x_i-\bar{x}_n)^2$). We therefore pay particular attention to the sample mean.

2 Unbiasedness

Defn. We say $\hat\theta_n$ is an unbiased estimator of $\theta$ if $E[\hat\theta_n] = \theta$. Define bias as
$$B(\hat\theta_n) := E[\hat\theta_n] - \theta.$$
An unbiased estimator has zero bias: $B(\hat\theta_n) = 0$. If we had an infinite number of samples of size $n$, then the average estimate $\hat\theta_n$ across all samples would be $\theta$. An asymptotically unbiased estimator satisfies $B(\hat\theta_n) \to 0$ as $n \to \infty$.

Claim (Weighted Average): Let $x_i$ have a common mean $\mu := E[x_i]$. Then the weighted average $\hat\mu_n := \sum_{i=1}^n \omega_i x_i$ is an unbiased estimator of $\mu$ if $\sum_{i=1}^n \omega_i = 1$.

Proof:
$$E\left[\sum_{i=1}^n \omega_i x_i\right] = \sum_{i=1}^n \omega_i E[x_i] = \mu \sum_{i=1}^n \omega_i = \mu. \quad QED.$$

Corollary (Straight Average): The sample mean $\bar{x}_n := \frac{1}{n}\sum_{i=1}^n x_i$ is a weighted average with flat or uniform weights $\omega_i = 1/n$, hence trivially $\sum_{i=1}^n \omega_i = 1$, hence $E[\bar{x}_n] = \mu$.

The problem then arises as to which weighted average $\sum_{i=1}^n \omega_i x_i$ may be preferred in practice, since any with unit summed weights is unbiased. We discuss the concept of efficiency below, but the minimum mean-squared-error unbiased estimator has uniform weights $\omega_i = 1/n$ if $x_i \sim iid(\mu, \sigma^2)$. That is:

Claim (Sample Mean is Best): Let $x_i \sim iid(\mu, \sigma^2)$. Then $\bar{x}_n$ is the best linear unbiased estimator of $\mu$ (i.e. it is BLUE).

Proof: We want to solve
$$\min_{\omega_1,\dots,\omega_n} Var\left(\sum_{i=1}^n \omega_i x_i\right) \quad\text{subject to}\quad \sum_{i=1}^n \omega_i = 1.$$
The Lagrangian is
$$\mathcal{L}(\omega, \lambda) := Var\left(\sum_{i=1}^n \omega_i x_i\right) + \lambda\left(1 - \sum_{i=1}^n \omega_i\right)$$
where by independence $Var(\sum_{i=1}^n \omega_i x_i) = \sigma^2 \sum_{i=1}^n \omega_i^2$, hence
$$\mathcal{L}(\omega, \lambda) := \sigma^2 \sum_{i=1}^n \omega_i^2 + \lambda\left(1 - \sum_{i=1}^n \omega_i\right).$$
The first order conditions are
$$\frac{\partial}{\partial \omega_i}\mathcal{L}(\omega, \lambda) = 2\sigma^2\omega_i - \lambda = 0 \quad\text{and}\quad \frac{\partial}{\partial \lambda}\mathcal{L}(\omega, \lambda) = 1 - \sum_{i=1}^n \omega_i = 0.$$
Therefore $\omega_i = \lambda/(2\sigma^2)$ is a constant that sums to one: $\sum_{i=1}^n \omega_i = 1$. Write $\omega_i = \lambda/(2\sigma^2) =: \omega$. Since $\sum_{i=1}^n \omega_i = n\omega = 1$ it follows $\omega_i = 1/n$. QED.

Remark: As in many cases here and below, independence can be substituted for uncorrelatedness since the same proof applies: $E[x_i x_j] = E[x_i]E[x_j]$ for all $i \neq j$. We can also substitute uncorrelatedness with a condition that restricts the total correlation across all $i$ and $j$ for $i \neq j$, but such generality is typically only exploited in time series settings (where $x_i$ is observed at a different time period $i$).

Claim (Sample Variance): Let $x_i \sim iid(\mu, \sigma^2)$. The estimator $s_n^2$ is unbiased and $\hat\sigma_n^2$ is negatively biased but asymptotically unbiased.

Proof: Notice
$$(n-1)s_n^2 = n\hat\sigma_n^2 = \sum_{i=1}^n (x_i - \bar{x}_n)^2 = \sum_{i=1}^n \left((x_i - \mu) - (\bar{x}_n - \mu)\right)^2$$
$$= \sum_{i=1}^n (x_i - \mu)^2 - 2(\bar{x}_n - \mu)\sum_{i=1}^n (x_i - \mu) + n(\bar{x}_n - \mu)^2 = \sum_{i=1}^n (x_i - \mu)^2 - n(\bar{x}_n - \mu)^2,$$
using $\sum_{i=1}^n (x_i - \mu) = n(\bar{x}_n - \mu)$. By the iid assumption and the fact that $\bar{x}_n$ is unbiased,
$$E\left[(\bar{x}_n - \mu)^2\right] = Var(\bar{x}_n) = \frac{\sigma^2}{n}.$$
Further, by definition $\sigma^2 := E[(x_i - \mu)^2]$, hence
$$E\left[\sum_{i=1}^n (x_i - \mu)^2\right] = \sum_{i=1}^n E\left[(x_i - \mu)^2\right] = n\sigma^2.$$
Therefore
$$E\left[(n-1)s_n^2\right] = E\left[n\hat\sigma_n^2\right] = n\sigma^2 - \sigma^2 = (n-1)\sigma^2.$$
This implies each claim: $E[s_n^2] = \sigma^2$ ($s_n^2$ is unbiased), $E[\hat\sigma_n^2] = \frac{n-1}{n}\sigma^2$ ($\hat\sigma_n^2$ is negatively biased), and $E[\hat\sigma_n^2] = \frac{n-1}{n}\sigma^2 \to \sigma^2$ ($\hat\sigma_n^2$ is asymptotically unbiased). QED.

Example: We simulate repeated samples of normal $x_i$ with mean 75 and a common fixed sample size $n$. In Figure 1 we plot $\bar{x}_n$ for each sample. The simulation average of all $\bar{x}_n$ is 74.98394. In Figure 2 we plot $\hat\mu_n = \sum_{i=1}^n \omega_i x_i$ for each sample with weights $\omega_i = i/\sum_{j=1}^n j$. The simulation average of all $\hat\mu_n$ is 74.98795. Thus, both display the same property of unbiasedness, but $\bar{x}_n$ exhibits less dispersion across samples.

[Figure 1: $\bar{x}_n$ across samples. Figure 2: $\hat\mu_n$ across samples.]

3 Convergence in Mean-Square or $L_2$-Convergence

Defn. We say $\hat\theta_n \in \mathbb{R}$ converges to $\theta$ in mean-square if
$$MSE(\hat\theta_n) := E\left[(\hat\theta_n - \theta)^2\right] \to 0.$$
We also write $\hat\theta_n \xrightarrow{ms} \theta$ and $\hat\theta_n \to \theta$ in mean-square. If $\hat\theta_n$ is unbiased for $\theta$ then
$$MSE(\hat\theta_n) = E\left[\left(\hat\theta_n - E[\hat\theta_n]\right)^2\right] = Var(\hat\theta_n).$$
Convergence in mean-square certainly does not require unbiasedness. In general, the MSE is
$$MSE(\hat\theta_n) = E\left[(\hat\theta_n - \theta)^2\right] = E\left[\left((\hat\theta_n - E[\hat\theta_n]) + (E[\hat\theta_n] - \theta)\right)^2\right]$$
$$= E\left[\left(\hat\theta_n - E[\hat\theta_n]\right)^2\right] + \left(E[\hat\theta_n] - \theta\right)^2 + 2\left(E[\hat\theta_n] - \theta\right)E\left[\hat\theta_n - E[\hat\theta_n]\right]$$
$$= E\left[\left(\hat\theta_n - E[\hat\theta_n]\right)^2\right] + \left(E[\hat\theta_n] - \theta\right)^2$$
since $E[\hat\theta_n] - \theta$ is just a constant and $E[\hat\theta_n - E[\hat\theta_n]] = E[\hat\theta_n] - E[\hat\theta_n] = 0$. Hence MSE is the variance plus bias squared:
$$MSE(\hat\theta_n) = E\left[\left(\hat\theta_n - E[\hat\theta_n]\right)^2\right] + \left(E[\hat\theta_n] - \theta\right)^2 = Var(\hat\theta_n) + B(\hat\theta_n)^2.$$

If $\hat\theta_n \in \mathbb{R}^k$ then we write
$$MSE(\hat\theta_n) := E\left[(\hat\theta_n - \theta)'(\hat\theta_n - \theta)\right] \to 0,$$
hence component-wise convergence. We may similarly write convergence in Euclidean norm
$$E\left\|\hat\theta_n - \theta\right\| \to 0 \quad\text{where } \|a\| := \left(\sum_{i=1}^k a_i^2\right)^{1/2},$$
or convergence in matrix (spectral) norm:
$$E\left\|\hat\theta_n - \theta\right\| \to 0 \quad\text{where } \|A\| \text{ is the square root of the largest eigenvalue of } A'A.$$
Both imply convergence with respect to each element: $E[(\hat\theta_{i,n} - \theta_i)^2] \to 0$.

Defn. We say $\hat\theta_n \in \mathbb{R}$ has the property of $L_p$-convergence, or convergence in $L_p$-norm, to $\theta$ if for some $p > 0$
$$E\left|\hat\theta_n - \theta\right|^p \to 0.$$
Clearly $L_2$-convergence and mean-square convergence are equivalent.

Claim (Sample Mean): Let $x_i \sim iid(\mu, \sigma^2)$. Then $\bar{x}_n \to \mu$ in mean square.

Proof:
$$E\left[(\bar{x}_n - \mu)^2\right] = Var(\bar{x}_n) = \frac{\sigma^2}{n} \to 0. \quad QED.$$

We only require uncorrelatedness since $Var(\bar{x}_n) = \sigma^2/n$ still holds.

Claim (Sample Mean): Let $x_i \sim (\mu, \sigma^2)$ be uncorrelated. Then $\bar{x}_n \to \mu$ in mean square.

Proof:
$$E\left[(\bar{x}_n - \mu)^2\right] = Var(\bar{x}_n) = \frac{\sigma^2}{n} \to 0. \quad QED.$$

In fact, we only need all cross covariances to not be too large as the sample size grows.

Claim (Sample Mean): Let $x_i \sim (\mu, \sigma^2)$ satisfy $\frac{1}{n^2}\sum_{i \neq j} Cov(x_i, x_j) \to 0$. Then $\bar{x}_n \to \mu$ in mean square.

Proof:
$$E\left[(\bar{x}_n - \mu)^2\right] = Var(\bar{x}_n) = \frac{\sigma^2}{n} + \frac{1}{n^2}\sum_{i \neq j} Cov(x_i, x_j) \to 0. \quad QED.$$

Remark: In micro-economic contexts involving cross-sectional data this type of correlatedness is evidently rarely or never entertained. Typically we assume the $x_i$ are uncorrelated. It is, however, profoundly popular in macroeconomic and
finance contexts where data are time series. A very large class of time series random variables satisfies both $Cov(x_i, x_j) \neq 0$ for all $i \neq j$ and $\frac{1}{n^2}\sum_{i \neq j} Cov(x_i, x_j) \to 0$, and therefore exhibits $\bar{x}_n \to \mu$ in mean square.

If $x_i \sim iid(\mu, \sigma^2)$ then $\bar{x}_n \to \mu$ in $L_p$-norm for any $p \in (0, 2]$, but proving the result for non-integer $p \in (0, 2)$ is quite a bit more difficult. There are many types of "maximal inequalities", however, that can be used to prove
$$E\left|\sum_{i=1}^n (x_i - \mu)\right|^p \leq Kn \quad\text{for } p \in (1, 2),$$
where $K > 0$ is a finite constant.

Claim (Sample Mean): Let $x_i \sim iid(\mu, \sigma^2)$. Then $\bar{x}_n \to \mu$ in $L_p$-norm for any $p \in (1, 2)$, since
$$E\left|\bar{x}_n - \mu\right|^p = \frac{1}{n^p}E\left|\sum_{i=1}^n (x_i - \mu)\right|^p \leq \frac{Kn}{n^p} = Kn^{1-p} \to 0. \quad QED.$$

Example: We simulate normal $x_i$ with variance $\sigma^2 = 400$ over a grid of sample sizes up to $n = 10{,}000$. In Figure 3 we plot $\bar{x}_n$ and $Var[\bar{x}_n] = 400/n$ over sample size $n$. Notice the high volatility for small $n$.

[Figure 3: $\bar{x}_n$ and $Var[\bar{x}_n]$ over sample size $n$.]

4 Convergence in Probability : WLLN

Defn. We say $\hat\theta_n$ converges in probability to $\theta$ if
$$\lim_{n\to\infty} P\left(\left|\hat\theta_n - \theta\right| > \varepsilon\right) = 0 \quad \forall \varepsilon > 0.$$
We variously write
$$\hat\theta_n \xrightarrow{p} \theta \quad\text{and}\quad \mathrm{plim}_{n\to\infty}\, \hat\theta_n = \theta,$$
and we say $\hat\theta_n$ is a consistent estimator of $\theta$. Since probability convergence is convergence in the number sequence $\{P(|\hat\theta_n - \theta| > \varepsilon)\}_{n=1}^\infty$, by the definition of a limit it follows that for every $\delta > 0$ there exists $N \geq 1$ such that
$$P\left(\left|\hat\theta_n - \theta\right| > \varepsilon\right) < \delta \quad \forall n \geq N.$$
That is, for a large enough sample size, $\hat\theta_n$ is guaranteed to be as close to $\theta$ as we choose (i.e. the $\varepsilon$) with as great a probability as we choose (i.e. $1 - \delta$).

Claim (Law of Large Numbers = LLN): If $x_i \sim iid(\mu, \sigma^2)$ then $\bar{x}_n \xrightarrow{p} \mu$.

Proof: By Chebyshev's inequality and independence, for any $\varepsilon > 0$
$$P\left(\left|\bar{x}_n - \mu\right| > \varepsilon\right) \leq \frac{Var(\bar{x}_n)}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2} \to 0. \quad QED.$$

Remark 1: We call this a Weak Law of Large Numbers [WLLN] since convergence is in probability. A Strong LLN based on a stronger form of convergence is given below.

Remark 2: We only need uncorrelatedness to get $Var(\bar{x}_n) = \sigma^2/n \to 0$. The WLLN, however, extends to many forms of dependent random variables.

Remark 3: In the iid case we only need $E|x_i| < \infty$, although the proof is substantially more complicated. Even for non-iid data we typically only need $E|x_i|^{1+\delta} < \infty$ for infinitesimal $\delta > 0$ (pay close attention to scholarly articles you read, and to your own assumptions: usually far stronger assumptions are imposed than are actually required).

The weighted average $\sum_{i=1}^n \omega_{i,n} x_i$ is also consistent as long as the weights decay with the sample size. Thus we write the weight as $\omega_{i,n}$.

Claim: If $x_i \sim iid(\mu, \sigma^2)$ then $\sum_{i=1}^n \omega_{i,n} x_i \xrightarrow{p} \mu$ if $\sum_{i=1}^n \omega_{i,n} = 1$ and $\sum_{i=1}^n \omega_{i,n}^2 \to 0$.

Proof: By Chebyshev's inequality, independence and $\sum_{i=1}^n \omega_{i,n} = 1$, for any $\varepsilon > 0$
$$P\left(\left|\sum_{i=1}^n \omega_{i,n} x_i - \mu\right| > \varepsilon\right) \leq \frac{Var\left(\sum_{i=1}^n \omega_{i,n} x_i\right)}{\varepsilon^2} = \frac{\sum_{i=1}^n \omega_{i,n}^2\, E\left[(x_i - \mu)^2\right]}{\varepsilon^2} = \frac{\sigma^2}{\varepsilon^2}\sum_{i=1}^n \omega_{i,n}^2 \to 0,$$
which proves the claim. QED.

An example is $\bar{x}_n$ with $\omega_{i,n} = 1/n$, but also the weights $\omega_{i,n} = i/\sum_{j=1}^n j$ used in Figure 2.

Example: We simulate normal $x_i$ with mean 75 over a grid of sample sizes up to $n = 10{,}000$. In Figures 4 and 5 we plot $\bar{x}_n$ and $\hat\mu_n = \sum_{i=1}^n \omega_{i,n} x_i$ over sample size $n$. Notice the high volatility for small $n$.
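The Chebyshev argument behind the WLLN can be checked directly by simulation. The sketch below (the values $\mu = 75$, $\sigma = 10$, $\varepsilon = 1$, and the sample-size grid are assumptions for illustration) estimates $P(|\bar{x}_n - \mu| > \varepsilon)$ across repeated samples and compares it to the bound $\sigma^2/(n\varepsilon^2)$.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, eps, reps = 75.0, 10.0, 1.0, 2000
exceed, bound = {}, {}
for n in (10, 100, 1000):
    # reps independent samples of size n, one sample mean per row
    xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
    exceed[n] = np.mean(np.abs(xbar - mu) > eps)   # estimated P(|x̄ - μ| > ε)
    bound[n]  = min(sigma**2 / (n * eps**2), 1.0)  # Chebyshev bound σ²/(n ε²)
```

The exceedance frequency sits below the (crude) Chebyshev bound at every $n$, and both vanish as $n$ grows, which is exactly the content of the proof.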
[Figure 4: $\bar{x}_n$ over sample size $n$. Figure 5: $\hat\mu_n$ over sample size $n$.]

Claim (Slutsky Theorem): Let $\hat\theta_n \in \mathbb{R}^k$. If $\hat\theta_n \xrightarrow{p} \theta$ and $g: \mathbb{R}^k \to \mathbb{R}$ is continuous (except possibly with countably many discontinuity points) then $g(\hat\theta_n) \xrightarrow{p} g(\theta)$.

Corollary: Let $\hat\theta_{i,n} \xrightarrow{p} \theta_i$, $i = 1, 2$. Then $\hat\theta_{1,n} + \hat\theta_{2,n} \xrightarrow{p} \theta_1 + \theta_2$, $\hat\theta_{1,n}\hat\theta_{2,n} \xrightarrow{p} \theta_1\theta_2$, and if $\theta_2 \neq 0$ and $\liminf_{n\to\infty} |\hat\theta_{2,n}| > 0$ then $\hat\theta_{1,n}/\hat\theta_{2,n} \xrightarrow{p} \theta_1/\theta_2$.

Claim: If $x_i \sim iid(\mu, \sigma^2)$ and $E[x_i^4] < \infty$ then $\hat\sigma_n^2 \xrightarrow{p} \sigma^2$.

Proof: Note
$$\hat\sigma_n^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x}_n)^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2 - (\bar{x}_n - \mu)^2.$$
By the LLN $\bar{x}_n \xrightarrow{p} \mu$, therefore by the Slutsky Theorem $(\bar{x}_n - \mu)^2 \xrightarrow{p} 0$. By $E[x_i^4] < \infty$ it follows $(x_i - \mu)^2$ is iid with a finite variance, hence it satisfies the LLN: $\frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2 \xrightarrow{p} E[(x_i - \mu)^2] = \sigma^2$. QED.

Claim: If $(x_i, y_i)$ are iid with finite fourth moments then the sample correlation $\hat\rho_n \xrightarrow{p} \rho$, the population correlation.

Example: We simulate $x_i$ with variance $\sigma_x^2 = 400$ and independent $\varepsilon_i \sim N(0, 900)$, and construct $y_i = 43 + 2x_i + \varepsilon_i$. The true correlation is
$$\rho = \frac{Cov(x_i, y_i)}{\sqrt{Var(x_i)\,Var(y_i)}} = \frac{2 \times 400}{\sqrt{400}\sqrt{4 \times 400 + 900}} = \frac{800}{20 \times 50} = .8.$$
We estimate the correlation for samples with size $n = 5, 55, \dots, 10{,}000$. Figure 6 demonstrates consistency and therefore the Slutsky Theorem.
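A minimal sketch of the correlation consistency experiment, assuming the design $y_i = 43 + 2x_i + \varepsilon_i$ with $Var(x_i) = 400$ and $Var(\varepsilon_i) = 900$ (so $\rho = .8$); the mean of $x_i$ is set to 75 here as an assumption, and it drops out of the correlation.

```python
import numpy as np

rng = np.random.default_rng(2)
rho_hat = {}
for n in (50, 1000, 100_000):
    x = rng.normal(75.0, 20.0, size=n)                  # sd 20, so Var(x) = 400
    y = 43.0 + 2.0 * x + rng.normal(0.0, 30.0, size=n)  # sd 30, so Var(ε) = 900
    rho_hat[n] = np.corrcoef(x, y)[0, 1]                # sample correlation
```

As the claim predicts, $\hat\rho_n$ settles near $.8$ once $n$ is large.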
[Figure 6: the sample correlation $\hat\rho_n$ over sample size $n$.]

5 Almost Sure Convergence : SLLN

Defn. We say $\hat\theta_n$ converges almost surely to $\theta$ if
$$P\left(\lim_{n\to\infty} \hat\theta_n = \theta\right) = 1.$$
This is identical to
$$\lim_{n\to\infty} P\left(\sup_{m \geq n}\left|\hat\theta_m - \theta\right| > \varepsilon\right) = 0 \quad \forall \varepsilon > 0.$$
We variously write $\hat\theta_n \xrightarrow{a.s.} \theta$ and $\hat\theta_n \to \theta$ a.s., and we say $\hat\theta_n$ is strongly consistent for $\theta$. We have the following relationships.

Claim: 1. $\hat\theta_n \xrightarrow{ms} \theta$ implies $\hat\theta_n \xrightarrow{p} \theta$; 2. $\hat\theta_n \xrightarrow{a.s.} \theta$ implies $\hat\theta_n \xrightarrow{p} \theta$.

Proof: 1. $P(|\hat\theta_n - \theta| > \varepsilon) \leq E[(\hat\theta_n - \theta)^2]/\varepsilon^2$ by Chebyshev's inequality. If $E[(\hat\theta_n - \theta)^2] \to 0$ (i.e. $\hat\theta_n \xrightarrow{ms} \theta$) then $P(|\hat\theta_n - \theta| > \varepsilon) \to 0$ where $\varepsilon > 0$ is arbitrary. Therefore $\hat\theta_n \xrightarrow{p} \theta$. 2. $P(|\hat\theta_n - \theta| > \varepsilon) \leq P(\sup_{m \geq n}|\hat\theta_m - \theta| > \varepsilon)$ since $\sup_{m \geq n}|\hat\theta_m - \theta| \geq |\hat\theta_n - \theta|$. Therefore if $P(\sup_{m \geq n}|\hat\theta_m - \theta| > \varepsilon) \to 0$ $\forall \varepsilon > 0$ (i.e. $\hat\theta_n \xrightarrow{a.s.} \theta$) then $P(|\hat\theta_n - \theta| > \varepsilon) \to 0$ $\forall \varepsilon > 0$ (i.e. $\hat\theta_n \xrightarrow{p} \theta$). QED.

If $\hat\theta_n$ is bounded wp1 then $\hat\theta_n \xrightarrow{p} \theta$ implies $E[\hat\theta_n] \to \theta$, which is asymptotic unbiasedness (see Bierens). By the Slutsky Theorem $\hat\theta_n \xrightarrow{p} \theta$ implies $(\hat\theta_n - \theta)^2 \xrightarrow{p} 0$, hence by boundedness $E[(\hat\theta_n - \theta)^2] \to 0$: convergence in probability implies convergence in mean-square. This proves the following (and gives almost sure convergence as the "strongest" form: the one that implies all the rest).

Claim (a.s. $\Rightarrow$ i.p. $\Rightarrow$ m.s.): Let $\hat\theta_n$ be bounded wp1: $P(|\hat\theta_n| \leq K) = 1$
for finite $K > 0$. Then $\hat\theta_n \xrightarrow{a.s.} \theta$ implies $\hat\theta_n \xrightarrow{p} \theta$, which implies asymptotic unbiasedness and $\hat\theta_n \xrightarrow{ms} \theta$.

Claim (Strong Law of Large Numbers = SLLN): If $x_i \sim iid(\mu, \sigma^2)$ then $\bar{x}_n \xrightarrow{a.s.} \mu$.

Remark: The Slutsky Theorem carries over to strong convergence.

Example: Let $x_i \sim iid(\mu, \sigma^2)$ and define the bounded transform
$$\hat\theta_n := \frac{1}{1 + \bar{x}_n^2}.$$
Then $P(|\hat\theta_n| \leq 1) = 1$. Moreover, under the iid assumption $\bar{x}_n \xrightarrow{a.s.} \mu$ by the SLLN, hence by the Slutsky Theorem
$$\hat\theta_n \xrightarrow{a.s.} \frac{1}{1 + \mu^2}.$$
Therefore
$$E[\hat\theta_n] \to \theta = \frac{1}{1 + \mu^2} \quad\text{and}\quad E\left[\left(\hat\theta_n - \frac{1}{1 + \mu^2}\right)^2\right] \to 0.$$

6 Convergence in Distribution : CLT

Defn. We say $\hat\theta_n$ converges in distribution to a distribution $F$, or to a random variable $z$ with distribution $F$, if
$$\lim_{n\to\infty} P\left(\hat\theta_n \leq c\right) = F(c) \text{ for every } c \text{ on the support.}$$
Thus, while $\hat\theta_n$ may itself not be $F$-distributed, asymptotically it is. We write $\hat\theta_n \xrightarrow{d} F$ or $\hat\theta_n \xrightarrow{d} z$ where $z \sim F$. The notation $\hat\theta_n \xrightarrow{d} z$ is a bit awkward, because $F$ characterizes infinitely many random variables. We are therefore saying there is some random draw $z$ from $F$ that $\hat\theta_n$ is becoming. Which random draw is not specified.

6.1 Central Limit Theorem

By far the most famous result concerns the sample mean $\bar{x}_n$. Convergence of some estimator $\hat\theta_n$ in a monumentally large number of cases reduces to convergence of a sample mean of something, call it $z_i$. This carries over to the sample correlation, regression model estimation methods like Ordinary Least Squares, GMM, and Maximum Likelihood, as well as non-parametric estimation, and on and on.
As usual, we limit ourselves to the iid case. The following substantially carries over to non-iid data, and based on a rarely cited obscure fact does not even require a finite variance (I challenge you to find a proof of this, or to ever discover any econometrics textbook that accurately states this).

Claim (Central Limit Theorem = CLT): If $x_i \sim iid(\mu, \sigma^2)$ then
$$z_n := \sqrt{n}\,\frac{\bar{x}_n - \mu}{\sigma} \xrightarrow{d} N(0, 1).$$

Remark 1: This is famously cited as the Lindeberg-Lévy CLT. Historically, however, the proof arose in different camps sometime between 1910-1930 (covering Lindeberg, Lévy, Chebyshev, Markov and Lyapunov).

Remark 2: Notice by construction $z_n := \sqrt{n}(\bar{x}_n - \mu)/\sigma$ is a standardized sample mean because $E[\bar{x}_n] = \mu$ by identical distributedness and $Var[\bar{x}_n] = \sigma^2/n$ by independence and identical distributedness. Thus
$$z_n := \sqrt{n}\,\frac{\bar{x}_n - \mu}{\sigma} = \frac{\bar{x}_n - E[\bar{x}_n]}{\sqrt{Var[\bar{x}_n]}}.$$
Therefore $\sqrt{n}(\bar{x}_n - \mu)/\sigma$ has mean 0 and variance 1:
$$E\left[\sqrt{n}\,\frac{\bar{x}_n - \mu}{\sigma}\right] = \sqrt{n}\,\frac{E[\bar{x}_n] - \mu}{\sigma} = 0$$
$$Var\left[\sqrt{n}\,\frac{\bar{x}_n - \mu}{\sigma}\right] = \frac{n}{\sigma^2}Var[\bar{x}_n] = \frac{n}{\sigma^2}\cdot\frac{\sigma^2}{n} = 1.$$
Thus, even as $n \to \infty$ the random variable $z_n \sim (0, 1)$. Although this is a long way from proving $z_n$ has a definable distribution, even in the limit, it does help to point out that the term $\sqrt{n}$ is necessary to stabilize $\bar{x}_n - \mu$, for otherwise we simply have $\bar{x}_n - \mu \xrightarrow{p} 0$.

Remark 3: Asymptotically $z_n := \sqrt{n}(\bar{x}_n - \mu)/\sigma$ has a standard normal density $\phi(c) = (2\pi)^{-1/2}\exp\{-c^2/2\}$. Define $z_i := (x_i - \mu)/\sigma$, hence
$$z_n := \sqrt{n}\,\frac{\bar{x}_n - \mu}{\sigma} = \frac{1}{\sqrt{n}}\sum_{i=1}^n z_i.$$
We will show the characteristic function $E[e^{i\lambda z_n}] \to e^{-\lambda^2/2}$. The latter is the characteristic function of a standard normal, while characteristic functions and distributions have a unique correspondence: only standard normals have a characteristic function like $e^{-\lambda^2/2}$.
Proof: By independence and identical distributedness,
$$E\left[e^{i\lambda z_n}\right] = E\left[\prod_{j=1}^n e^{i\lambda z_j/\sqrt{n}}\right] = \prod_{j=1}^n E\left[e^{i\lambda z_j/\sqrt{n}}\right] = \left(E\left[e^{i\lambda z_1/\sqrt{n}}\right]\right)^n. \tag{1}$$
Now expand $e^{i\lambda z_1/\sqrt{n}}$ around $\lambda = 0$ by a second order Taylor expansion:
$$e^{i\lambda z_1/\sqrt{n}} = 1 + i\frac{\lambda}{\sqrt{n}}z_1 + \frac{(i\lambda)^2}{2!}\frac{z_1^2}{n} + R_n = 1 + i\frac{\lambda}{\sqrt{n}}z_1 - \frac{\lambda^2}{2n}z_1^2 + R_n,$$
where $R_n$ is a remainder term that is a function of $\lambda z_1/\sqrt{n}$. Now take expectations as in (1), and note $E[z_1] = E[(x_1 - \mu)/\sigma] = 0$ and $E[z_1^2] = E[(x_1 - \mu)^2]/\sigma^2 = \sigma^2/\sigma^2 = 1$:
$$E\left[e^{i\lambda z_1/\sqrt{n}}\right] = 1 + i\frac{\lambda}{\sqrt{n}}E[z_1] - \frac{\lambda^2}{2n}E[z_1^2] + E[R_n] = 1 - \frac{\lambda^2}{2n} + \frac{r_n}{n} \quad\text{where } r_n := nE[R_n].$$
It is easy to prove $R_n$ is a bounded random variable, in particular $|R_n| \leq \lambda^2 z_1^2/n$ wp1 (see Bierens), so even if $z_1$ does not have higher moments we know $|r_n| \leq \lambda^2 E[z_1^2] = \lambda^2 < \infty$. Further, $r_n \to 0$: pointwise $n|R_n| \to 0$ while $n|R_n| \leq \lambda^2 z_1^2$ is integrable, so $nE[R_n] \to 0$ by dominated convergence. Now take the $n$-th power in (1): by the Binomial expansion
$$\left(E\left[e^{i\lambda z_1/\sqrt{n}}\right]\right)^n = \left(1 - \frac{\lambda^2}{2n} + \frac{r_n}{n}\right)^n = \sum_{k=0}^n \binom{n}{k}\left(1 - \frac{\lambda^2}{2n}\right)^{n-k}\left(\frac{r_n}{n}\right)^k$$
$$= \left(1 - \frac{\lambda^2}{2n}\right)^n + \sum_{k=1}^n \binom{n}{k}\left(1 - \frac{\lambda^2}{2n}\right)^{n-k}\left(\frac{r_n}{n}\right)^k.$$
The first term satisfies
$$\left(1 - \frac{\lambda^2}{2n}\right)^n \to e^{-\lambda^2/2}$$
because the sequence $\{(1 + a/n)^n\}$ converges: $(1 + a/n)^n \to e^a$ (simply put $a = -\lambda^2/2$). For the second term, notice for large enough $n$ we have $|1 - \lambda^2/(2n)| \leq 1$, hence
$$\left|\sum_{k=1}^n \binom{n}{k}\left(1 - \frac{\lambda^2}{2n}\right)^{n-k}\left(\frac{r_n}{n}\right)^k\right| \leq \sum_{k=1}^n \binom{n}{k}\left(\frac{|r_n|}{n}\right)^k = \left(1 + \frac{|r_n|}{n}\right)^n - 1.$$
See Bierens for details that verify $(1 + |r_n|/n)^n - 1 \to 0$. QED.

Example (Bernoulli): The most striking way to demonstrate the CLT is to begin with the least normal of data, a Bernoulli random variable, which is discrete and takes only two finite values, and show $\sqrt{n}(\bar{x}_n - p)/\sqrt{p(1-p)} \xrightarrow{d} N(0, 1)$, a continuous random variable with infinite support. We simulate $x_i \sim \text{Bernoulli}(p)$ for $n$ = 5, 50, 500, 5000 and compute
$$z_n := \sqrt{n}\,\frac{\bar{x}_n - p}{\sqrt{p(1-p)}}.$$
In order to show the small sample distribution of $z_n$ we need a sample of $z_n$'s, so we repeat the simulation 1000 times. We plot the relative frequencies of the sample of $z_n$'s for each $n$. Let $\{z_{n,j}\}_{j=1}^{1000}$ be the simulated sample of $z_n$'s. The relative frequencies are the percentages $\frac{1}{1000}\sum_{j=1}^{1000} I(a_k < z_{n,j} \leq a_{k+1})$ for interval endpoints $a_k \in [-2.50, -2.49, -2.48, \dots, 2.49, 2.50]$. See Figure 7. For the sake of comparison, in Figure 8 we plot the relative frequencies for one sample of 1000 iid standard normal random variables $z_j \sim N(0, 1)$.

Another way to see how $z_n$ becomes a standard normal random variable is to compute the quantile $q_n$ such that $P(z_n \leq q_n) = .975$. A standard normal satisfies $P(z \leq 1.96) = .975$. We call $q_n$ an empirical quantile since it is based on a simulated set of samples. We simulate 10,000 samples for each size $n = 5, 105, 205, \dots, 5005$ and compute $q_n$. See Figure 9. As $n$ increases, $q_n \to 1.96$.

[Figure 7: relative frequencies of the standardized means $z_n$ for Bernoulli data, 1000 $z_n$'s for each of $n$ = 5, 50, 500, 5000.]
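A minimal sketch of the Bernoulli experiment (the value $p = .8$ and the repetition count are assumptions for illustration): for each $n$, draw 10,000 standardized means and record the share landing in $[-1, 1]$, which should approach $\Phi(1) - \Phi(-1) \approx .6827$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(3)
p, reps = 0.8, 10_000
share_mid = {}
for n in (5, 500, 5000):
    # n·x̄ is Binomial(n, p), so draw the counts directly instead of n Bernoullis
    xbar = rng.binomial(n, p, size=reps) / n
    z = np.sqrt(n) * (xbar - p) / np.sqrt(p * (1 - p))
    share_mid[n] = np.mean(np.abs(z) <= 1.0)  # share of z's in [-1, 1]
```

At $n = 5$ the discreteness of $z_n$ keeps this share far from the normal value; by $n = 5000$ it is close to $.6827$, mirroring Figure 7.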
[Figure 8: relative frequencies for one sample of 1000 iid standard normal random variables.]

[Figure 9: empirical quantiles $q_n$ over sample size $n = 5, 505, 1005, \dots, 5005$; as $n$ grows $q_n$ settles near 1.96.]
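The empirical-quantile experiment behind Figure 9 can be sketched the same way (again with $p = .8$ assumed): for each $n$, compute the empirical 97.5% quantile of 10,000 simulated $z_n$'s and compare it to the standard normal value 1.96.

```python
import numpy as np

rng = np.random.default_rng(4)
p, reps = 0.8, 10_000
q = {}
for n in (5, 5005):
    xbar = rng.binomial(n, p, size=reps) / n            # n·x̄ ~ Binomial(n, p)
    z = np.sqrt(n) * (xbar - p) / np.sqrt(p * (1 - p))  # standardized mean
    q[n] = np.quantile(z, 0.975)                        # empirical 97.5% quantile
```

For tiny $n$ the quantile is distorted by the discreteness of the Bernoulli mean; for large $n$ it sits near 1.96, as the CLT predicts.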