Elements of Asymptotic Theory

James L. Powell
Department of Economics
University of California, Berkeley

Objectives of Asymptotic Theory

While exact results are available for, say, the distribution of the classical least squares estimator for the normal linear regression model, and for other leading special combinations of distributions and statistics, distributional results are generally unavailable when statistics are nonlinear in the underlying data and/or when the joint distribution of those data is not parametrically specified. Fortunately, most statistics of interest can be shown to be smooth functions of sample averages of some observable random variables, in which case, under fairly general conditions, their distributions can be approximated by a normal distribution, usually with a covariance matrix that shrinks to zero as the number of random variables in the averages increases. Thus, as the sample size increases, the gap between the true distribution of the statistic and its approximating distribution asymptotes to zero, motivating the term "asymptotic theory."

There are two main steps in the approximation of the distribution of a statistic by a normal distribution. The first step is the demonstration that a sample average converges (in an appropriate sense) to its expectation, and that a standardized version of the statistic has a distribution function that converges to a standard normal. Such results are called limit theorems; the former (approximation of a sample average by its expectation) is called a law of large numbers (abbreviated "LLN"), and the latter (approximation of the distribution of a standardized sample average by a standard normal distribution) is known as a central limit theorem (abbreviated "CLT"). There are a number of LLNs and CLTs in the statistics literature, which impose different combinations of restrictions on the number of moments of the variables being averaged and the dependence between them; some basic versions are given below. The second step in deriving a normal approximation of the distribution of a smooth function of sample averages is the demonstration that this smooth function is approximately a linear function of sample averages (which is itself a sample average). Useful results for this step, which use the usual calculus approximation of smooth (differentiable) functions by linear functions, are grouped under the heading "Slutsky theorems," after the author of one of the particularly useful results. So, as long as the conditions of the limit theorems and Slutsky theorems apply, smooth functions of sample averages are approximately sample averages themselves, and thus their distribution functions are approximately normal.

The whole apparatus of asymptotic theory is made up of the regularity conditions for applicability of the limit and Slutsky theorems, along with the rules for mathematical manipulation of the approximating objects. In order to discuss approximation of a statistic by a constant, or of its distribution by a normal distribution, we first have to extend the usual definitions of limits of deterministic sequences to accommodate sequences of random variables (yielding the concept of "convergence in probability") or their distribution functions ("convergence in distribution").

Convergence in Distribution and Probability

For most applications of asymptotic theory to statistical problems, the objects of interest are sequences of random vectors (or matrices), say $X_N$ or $Y_N$, which are generally statistics indexed by the sample size $N$; the object is to find simple approximations for the distribution functions of $X_N$ or $Y_N$ which can be made arbitrarily close to their true distribution functions for $N$ large enough. Writing the cumulative distribution function of $X_N$ as

$$F_N(x) \equiv \Pr\{X_N \le x\}$$

(where the argument $x$ has the same dimension as $X_N$ and the inequality holds only if it is true for each component), we define convergence in distribution (sometimes called "convergence in law") in terms of convergence of the function $F_N(x)$ to some fixed distribution function $F(x)$ at any point $x$ where $F$ is continuous. That is, the random vector $X_N$ converges in distribution to $X$, denoted

$$X_N \overset{d}{\to} X \quad \text{as } N \to \infty,$$

if

$$\lim_{N \to \infty} F_N(x) = F(x)$$

at all points $x$ where $F(x)$ is continuous.

A few comments on this definition and its implications are warranted:

1. The limiting random vector $X$ in this definition is really just a placeholder for the distribution function $F(x)$; there need be no real random vector $X$ which has the limiting distribution function. In fact, it is often customary to replace $X$ with the limiting distribution function $F(x)$ in the definition, i.e., to write $X_N \overset{d}{\to} F(x)$ as $N \to \infty$; if, say, $F(x)$ is a multivariate standard normal, then we would write $X_N \overset{d}{\to} N(0, I)$, where the condition "$N \to \infty$" is usually not stated, but implied.

2. If $X_N \overset{d}{\to} F(x)$, then we can use the c.d.f. $F(x)$ to approximate probabilities for $X_N$ if $N$ is large, i.e.,

$$\Pr\{X_N \in A\} \cong \int_A dF(x),$$

where the symbol "$\cong$" means that the difference between the left- and right-hand sides goes to zero as $N$ increases.

3. We would like to construct statistics so that their limiting distributions are of convenient (tabulated) form, e.g., a normal or chi-squared distribution. This often involves standardizing the statistics by subtracting off their expectations and dividing by their standard deviations (or the matrix equivalent).

4. If $\lim_{N \to \infty} F_N(x)$ is discontinuous, it may not be a distribution function, which is why the qualification "at all points where $F(x)$ is continuous" has to be appended to the definition of convergence in distribution. To make this more concrete, suppose

$$X_N \sim N\!\left(0, \frac{1}{N}\right),$$

so that

$$F_N(x) = \Phi\!\left(\sqrt{N}\,x\right) \to \begin{cases} 0 & \text{if } x < 0, \\ \frac{1}{2} & \text{if } x = 0, \\ 1 & \text{if } x > 0. \end{cases}$$

Obviously here $\lim_{N \to \infty} F_N(x)$ is not right-continuous, so we would never use it to approximate probabilities for $X_N$ (since it isn't a c.d.f.). Instead, a reasonable approximating function for $F_N(x)$ would be

$$F(x) \equiv 1\{x \ge 0\},$$

where $1\{A\}$ is the indicator function of the statement $A$, i.e., it equals one if $A$ is true and is zero otherwise. The function $F(x)$ is the c.d.f. of a degenerate random variable $X$ which equals zero with probability one, which is a pretty good approximation for $F_N(x)$ when $N$ is large for this problem.
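The collapse of $F_N(x) = \Phi(\sqrt{N}\,x)$ toward the step function is easy to see numerically. The following short Python sketch (an illustration added here, not part of the original notes; the grid of $x$ and $N$ values is an arbitrary choice) tabulates $F_N(x)$ for growing $N$:

import math

def Phi(z):
    # standard normal c.d.f. via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

for x in (-0.5, -0.1, 0.0, 0.1, 0.5):
    values = [Phi(math.sqrt(N) * x) for N in (10, 100, 10_000)]
    print(f"x = {x:+.1f}:", "  ".join(f"{v:.4f}" for v in values))
# Each row approaches 0 (for x < 0) or 1 (for x > 0), but stays at
# 0.5000 at x = 0, the one point where the limit fails to be a c.d.f.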

The last example, convergence of the c.d.f. $F_N(x)$ to the c.d.f. of a degenerate ("nonrandom") random variable, is a special case of the other key convergence concept for asymptotic theory, convergence in probability. Following Ruud's text, we define convergence in probability of the sequence of random vectors $X_N$ to the (nonrandom) vector $x$, denoted

$$X_N \overset{p}{\to} x \quad \text{as } N \to \infty$$

or

$$\operatorname{plim} X_N = x,$$

by the condition $X_N - x \overset{d}{\to} 0$, i.e., the c.d.f. of $X_N - x$ satisfies

$$F_{X_N - x}(z) \to 1\{0 \le z\}$$

whenever all components of $z$ are nonzero. (The symbol "plim" stands for the term "probability limit.") We can generalize this definition of convergence in probability of $X_N$ to a constant $x$ to permit $X_N$ to converge in probability to a nondegenerate random vector $X$ by specifying that $X_N \overset{p}{\to} X$ means that $X_N - X \overset{d}{\to} 0$.

An equivalent definition of convergence in probability, which is the definition usually seen in textbooks, is that $X_N \overset{p}{\to} x$ if, for any number $\varepsilon > 0$,

$$\Pr\{\|X_N - x\| > \varepsilon\} \to 0$$

as $N \to \infty$. Thus, $X_N$ converges in probability to $x$ if all the probability mass for $X_N$ eventually ends up in an $\varepsilon$-neighborhood of $x$, no matter how small $\varepsilon$ is.
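To make the $\varepsilon$-definition concrete, here is a small Monte Carlo sketch (my illustration, not from the notes; the uniform distribution, $\varepsilon = 0.05$, and the simulation sizes are arbitrary choices) that estimates $\Pr\{|\bar{X}_N - \mu| > \varepsilon\}$ for increasing $N$:

import numpy as np

rng = np.random.default_rng(0)
mu, eps, reps = 0.5, 0.05, 1_000
for N in (10, 100, 1000, 10_000):
    # X_i ~ Uniform[0, 1], so E(X_i) = 0.5; reps independent sample means
    xbar = rng.uniform(0.0, 1.0, size=(reps, N)).mean(axis=1)
    print(N, np.mean(np.abs(xbar - mu) > eps))
# The estimated Pr{|Xbar_N - mu| > eps} falls toward zero as N grows.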

Showing the equivalence of these two definitions of probability limit is a routine exercise in probability theory and real analysis.

A stronger form of convergence of $X_N$ to $x$, which can often be used to show convergence in probability, is quadratic mean convergence. We say

$$X_N \overset{qm}{\to} x \quad \text{as } N \to \infty$$

if

$$E[\|X_N - x\|^2] \to 0.$$

If this condition can be shown, then it follows that $X_N$ converges in probability to $x$ because of Markov's Inequality: If $Z$ is a nonnegative (scalar) random variable, i.e., $\Pr\{Z \ge 0\} = 1$, then

$$\Pr\{Z > K\} \le \frac{E(Z)}{K}$$

for any $K > 0$.

The proof of this inequality is worth remembering; it is based upon a decomposition of $E(Z)$ (which may be infinite) as

$$E(Z) = \int_0^\infty z \, dF_Z(z) = \int_0^K z \, dF_Z(z) + \int_K^\infty (z - K) \, dF_Z(z) + \int_K^\infty K \, dF_Z(z).$$

Since the first two terms in this sum are nonnegative, it follows that

$$E(Z) \ge \int_K^\infty K \, dF_Z(z) = K \Pr\{Z > K\},$$

from which Markov's inequality immediately follows. A better-known variant of this inequality, Tchebysheff's Inequality, takes $Z = (X - E(X))^2$ for a scalar random variable $X$ (with finite second moment) and $K = \varepsilon^2$ for any $\varepsilon > 0$, from which it follows that

$$\Pr\{|X - E(X)| > \varepsilon\} = \Pr\{(X - E(X))^2 > \varepsilon^2\} \le \frac{\operatorname{Var}(X)}{\varepsilon^2},$$

since $\operatorname{Var}(X) = E(Z)$ for this example. A similar consequence of Markov's inequality is that convergence in quadratic mean implies convergence in probability: if

$$E[\|X_N - x\|^2] \to 0,$$

then

$$\Pr\{\|X_N - x\| > \varepsilon\} \le \frac{E[\|X_N - x\|^2]}{\varepsilon^2} \to 0$$

for any $\varepsilon > 0$.
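Tchebysheff's bound is crude but always valid. A quick numerical check (again an added illustration, using an arbitrary exponential example with $E(X) = \operatorname{Var}(X) = 1$) compares the empirical tail probability with the bound $\operatorname{Var}(X)/\varepsilon^2$:

import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=1_000_000)  # E(X) = 1, Var(X) = 1
for eps in (1.0, 2.0, 3.0):
    empirical = np.mean(np.abs(x - 1.0) > eps)
    bound = 1.0 / eps**2                        # Var(X) / eps^2
    print(f"eps = {eps}: empirical {empirical:.4f} <= bound {bound:.4f}")
# The empirical tail probability always sits below the Tchebysheff bound.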

There is another, stronger version of convergence of a random variable to a limit, known as almost sure convergence (or "convergence almost everywhere" or "convergence with probability one"), which is denoted

$$X_N \overset{a.s.}{\to} x.$$

Rather than give the detailed definition of this form of convergence, which is best left to a more advanced course, it is worth pointing out that it is indeed stronger than convergence in probability (which is sometimes called "weak convergence" for that reason), but that, for the usual purposes of asymptotic theory, the weaker concept is sufficient.

Limit Theorems

With the concepts of convergence in probability and distribution, we can more precisely state how a sample average is approximately equal to its expectation, or how its standardized version is approximately normal. The first result, on convergence in probability, is the

Weak Law of Large Numbers (WLLN): If $\bar{X}_N \equiv \frac{1}{N} \sum_{i=1}^N X_i$ is a sample average of scalar, i.i.d. random variables $\{X_i\}_{i=1}^N$ which have $E(X_i) = \mu$ and $\operatorname{Var}(X_i) = \sigma^2$ finite, then

$$\bar{X}_N \overset{p}{\to} \mu = E(X_i).$$

Proof: By the usual calculations for means and variances of sums of i.i.d. random variables,

$$E(\bar{X}_N) = \mu, \qquad \operatorname{Var}(\bar{X}_N) = \frac{\sigma^2}{N} = E(\bar{X}_N - \mu)^2,$$

which tends to zero as $N \to \infty$. By definition, $\bar{X}_N$ tends to $\mu$ in quadratic mean, and thus in probability.

Inspection of the proof of the WLLN suggests a more general result for averages of random variables that are not i.i.d., namely, that

$$\bar{X}_N - E(\bar{X}_N) \overset{p}{\to} 0 \quad \text{if} \quad \frac{1}{N^2} \sum_{i=1}^N \sum_{j=1}^N \operatorname{Cov}(X_i, X_j) \to 0.$$
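The quadratic-mean calculation in this proof is easy to verify by simulation. The sketch below (an added illustration, with the arbitrary choice $\sigma^2 = 4$ and arbitrary simulation sizes) checks that the Monte Carlo variance of $\bar{X}_N$ tracks $\sigma^2/N$:

import numpy as np

rng = np.random.default_rng(2)
sigma2, reps = 4.0, 10_000               # X_i ~ N(0, 4), so sigma^2 = 4
for N in (10, 100, 1000):
    xbar = rng.normal(0.0, 2.0, size=(reps, N)).mean(axis=1)
    print(N, round(xbar.var(), 5), sigma2 / N)
# The Monte Carlo value of E(Xbar_N - mu)^2 tracks sigma^2 / N, the
# quantity whose decay drives quadratic-mean (hence probability) convergence.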

This general condition can be specialized to yield other LLNs for heterogeneous (non-constant variances) or dependent (nonzero covariances) data. Other laws of large numbers exist which relax conditions on the existence of second moments, and which show almost sure convergence (these are known as "strong laws of large numbers").

The next main result, on approximate normality of sample averages, again applies to i.i.d. data:

Central Limit Theorem (Lindeberg-Levy): If $\bar{X}_N \equiv \frac{1}{N} \sum_{i=1}^N X_i$ is a sample average of scalar, i.i.d. random variables $\{X_i\}_{i=1}^N$ which have $E(X_i) = \mu$ and $\operatorname{Var}(X_i) = \sigma^2$ finite, then

$$Z_N \equiv \frac{\bar{X}_N - E(\bar{X}_N)}{\sqrt{\operatorname{Var}(\bar{X}_N)}} = \sqrt{N}\left(\frac{\bar{X}_N - \mu}{\sigma}\right) \overset{d}{\to} N(0, 1).$$

The proof of this result is a bit more involved, requiring manipulation of characteristic functions, and will not be presented here. The result of this CLT is often rewritten as

$$\sqrt{N}(\bar{X}_N - \mu) \overset{d}{\to} N(0, \sigma^2),$$

which points out that $\bar{X}_N$ converges to its mean $\mu$ at exactly the rate at which $\sqrt{N}$ increases, so that the product $\sqrt{N}(\bar{X}_N - \mu)$ eventually "balances out" to yield a random variable with a nondegenerate normal distribution. Another way to rewrite the consequence of the CLT is

$$\bar{X}_N \overset{A}{\sim} N\!\left(\mu, \frac{\sigma^2}{N}\right),$$

where the symbol "$\overset{A}{\sim}$" means "is approximately distributed as" (or "is asymptotically distributed as"). It is very important here to make the distinction between the limiting distribution of the standardized mean $Z_N$, which is a standard normal, and the asymptotic distribution of the (non-standardized) sample mean $\bar{X}_N$, which is actually a sequence of approximating normal distributions with variances that shrink to zero at speed $1/N$. In short, a limiting distribution cannot depend upon $N$, which has passed to its (infinite) limit, while an asymptotic distribution can involve the sample size $N$.

Like the Weak Law of Large Numbers, the Lindeberg-Levy Central Limit Theorem admits a number of different generalizations (often named after their authors) which relax the assumption of i.i.d. data while imposing various stronger restrictions on the existence of moments or the tail behavior of the true data distributions.
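A simulation makes the force of the CLT visible even for a strongly skewed distribution. The following sketch (added for illustration, with Exponential(1) data so that $\mu = \sigma = 1$, and arbitrary simulation sizes) compares empirical quantiles of $Z_N$ with standard normal quantiles:

import numpy as np

rng = np.random.default_rng(3)
N, reps = 500, 20_000
# X_i ~ Exponential(1) is strongly skewed, with mu = sigma = 1.
xbar = rng.exponential(1.0, size=(reps, N)).mean(axis=1)
z = np.sqrt(N) * (xbar - 1.0) / 1.0      # the standardized mean Z_N
print(np.quantile(z, [0.025, 0.5, 0.975]))
# The empirical quantiles come out close to the N(0,1) values
# (-1.96, 0.00, 1.96), despite the skewness of the underlying data.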

To this point, the WLLN and CLT apply only to scalar random variables $\{X_i\}$; to extend them to random vectors $\{X_i\}$, we can use the intriguingly-named Cramér-Wold device, which is really a theorem that states that if the scalar random variable $Y_N \equiv \lambda' X_N$ converges in distribution to $Y \equiv \lambda' X$ for any fixed vector $\lambda$ having the same dimension as $X_N$, that is, if

$$\lambda' X_N \overset{d}{\to} \lambda' X \quad \text{for any } \lambda,$$

then the vector $X_N$ converges in distribution to $X$,

$$X_N \overset{d}{\to} X.$$

This result immediately yields a multivariate version of the Lindeberg-Levy CLT: a sample mean $\bar{X}_N$ of i.i.d. random vectors $\{X_i\}$ with $E(X_i) = \mu$ and $V(X_i) = \Sigma$ has

$$\sqrt{N}(\bar{X}_N - \mu) \overset{d}{\to} N(0, \Sigma).$$

And, since convergence in probability is a special case of convergence in distribution, the Cramér-Wold device also immediately yields a multivariate version of the WLLN.
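A sketch of the multivariate version in action (an added illustration; the covariance matrix $\Sigma$ and the uniform draws are arbitrary choices, picked so the data are non-normal with known $\Sigma$): across replications, the sample covariance of $\sqrt{N}(\bar{X}_N - \mu)$ should approximate $\Sigma$.

import numpy as np

rng = np.random.default_rng(4)
N, reps = 400, 10_000
Sigma = np.array([[1.0, 0.6], [0.6, 2.0]])
L = np.linalg.cholesky(Sigma)
# Non-normal i.i.d. vectors X_i with mean 0 and covariance Sigma:
# independent uniforms scaled to unit variance, then mixed by L.
u = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(reps, N, 2))
x = u @ L.T
root_n_xbar = np.sqrt(N) * x.mean(axis=1)   # sqrt(N) * (Xbar_N - mu)
print(np.cov(root_n_xbar, rowvar=False))    # approximately Sigma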

Slutsky Theorems

To extend the limit theorems for sample averages to statistics which are functions of sample averages, asymptotic theory uses smoothness properties of those functions (i.e., continuity and differentiability) to approximate those functions by polynomials, usually constant or linear functions. The simplest of these approximation results is the continuity theorem, which states that probability limits share an important property of ordinary limits, namely, that the probability limit of a continuous function is the value of that function evaluated at the probability limit:

Continuity Theorem: If $X_N \overset{p}{\to} x_0$ and the function $g(x)$ is continuous at $x = x_0$, then

$$g(X_N) \overset{p}{\to} g(x_0).$$

Proof: For any $\varepsilon > 0$, continuity of the function $g(x)$ at $x_0$ implies that there is some $\delta > 0$ so that $\|X_N - x_0\| \le \delta$ implies $\|g(X_N) - g(x_0)\| \le \varepsilon$. Thus

$$\Pr\{\|g(X_N) - g(x_0)\| \le \varepsilon\} \ge \Pr\{\|X_N - x_0\| \le \delta\}.$$

But the probability on the right-hand side converges to one because $X_N \overset{p}{\to} x_0$, so the probability on the left-hand side does, too.

With the continuity theorem, it should be clear that probability limits are more conveniently manipulated than, say, mathematical expectations, since, for example, if

$$\hat{\theta}_N \equiv \frac{\bar{Y}_N}{\bar{X}_N}$$

is an estimator of the ratio of the means of $Y_i$ and $X_i$,

$$\theta_0 \equiv \frac{E(Y_i)}{E(X_i)} \equiv \frac{\mu_Y}{\mu_X},$$

then $\hat{\theta}_N \overset{p}{\to} \theta_0$ as long as $\mu_X \ne 0$, even though $E(\hat{\theta}_N) \ne \theta_0$ in general.
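The ratio-of-means example can be simulated directly. In the sketch below (an added illustration with arbitrary uniform distributions for $X_i$ and $Y_i$, chosen so that $\theta_0 = 2$), the Monte Carlo mean of $\hat{\theta}_N$ reveals the finite-sample bias, while the shrinking deviation probability shows consistency:

import numpy as np

rng = np.random.default_rng(5)
reps = 10_000
for N in (10, 100, 1000):
    y = rng.uniform(1.5, 2.5, size=(reps, N))   # E(Y_i) = 2
    x = rng.uniform(0.5, 1.5, size=(reps, N))   # E(X_i) = 1, so theta_0 = 2
    theta_hat = y.mean(axis=1) / x.mean(axis=1)
    print(N, round(theta_hat.mean(), 4), np.mean(np.abs(theta_hat - 2.0) > 0.1))
# The mean of theta_hat exceeds 2 in small samples (E(theta_hat) != theta_0),
# yet the probability of a deviation larger than 0.1 still goes to zero.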

Another key approximation result is Slutsky's Theorem, which states that the sum (or product) of a random vector that converges in probability and another that converges in distribution itself converges in distribution:

Slutsky's Theorem: If the random vectors $X_N$ and $Y_N$ have the same length, and if $X_N \overset{p}{\to} x_0$ and $Y_N \overset{d}{\to} Y$, then

$$X_N + Y_N \overset{d}{\to} x_0 + Y \quad \text{and} \quad X_N' Y_N \overset{d}{\to} x_0' Y.$$

The typical application of Slutsky's theorem is to estimators that can be represented in terms of a linear approximation involving some statistics that converge either in probability or distribution. To be more concrete, suppose the (scalar) estimator $\hat{\theta}_N$ can be decomposed as

$$\sqrt{N}(\hat{\theta}_N - \theta_0) = X_N' Y_N + Z_N,$$

where

$$X_N \overset{p}{\to} x_0, \qquad Y_N \overset{d}{\to} N(0, \Sigma), \qquad Z_N \overset{p}{\to} 0.$$

This decomposition is often obtained by a first-order Taylor's series expansion of $\hat{\theta}_N$, and the components $X_N$, $Y_N$, and $Z_N$ are sample averages to which a LLN or CLT can be applied; an example is the so-called "delta method" discussed below. Under these conditions, Slutsky's theorem implies that

$$\sqrt{N}(\hat{\theta}_N - \theta_0) \overset{d}{\to} N(0, x_0' \Sigma x_0).$$

Both the continuity theorem and Slutsky's theorem are special cases of the continuous mapping theorem, which says that convergence in distribution, like convergence in probability, interchanges with continuous functions, as long as they are continuous everywhere:

Continuous Mapping Theorem: If $X_N \overset{d}{\to} X$ and the function $g(x)$ is continuous for all $x$, then

$$g(X_N) \overset{d}{\to} g(X).$$

The continuity requirement for $g(x)$ can be relaxed to hold only on the support of $X$ if needed. The continuous mapping theorem can be used to get approximations for the distributions of test statistics based upon asymptotically-normal estimators; for example, if

$$Z_N \equiv \sqrt{N}(\hat{\theta}_N - \theta_0) \overset{d}{\to} N(0, \Sigma),$$

then, for nonsingular $\Sigma$,

$$T \equiv Z_N' \Sigma^{-1} Z_N = N(\hat{\theta}_N - \theta_0)' \Sigma^{-1} (\hat{\theta}_N - \theta_0) \overset{d}{\to} \chi^2_p,$$

where $p \equiv \dim\{\hat{\theta}_N\}$.

The final approximation result discussed here, which is an application of Slutsky's theorem, is the so-called "delta method," which states that continuously-differentiable functions of asymptotically-normal statistics are themselves asymptotically normal:

The Delta Method: If $\hat{\theta}_N$ is a random vector which is asymptotically normal, with

$$\sqrt{N}(\hat{\theta}_N - \theta_0) \overset{d}{\to} N(0, \Sigma)$$

for some $\theta_0$, and if $g(\theta)$ is continuously differentiable at $\theta = \theta_0$ with Jacobian matrix

$$G_0 \equiv \frac{\partial g(\theta_0)}{\partial \theta'}$$

that has full row rank, then

$$\sqrt{N}\left(g(\hat{\theta}_N) - g(\theta_0)\right) \overset{d}{\to} N(0, G_0 \Sigma G_0').$$

Proof: First suppose that $g(\theta)$ is scalar-valued. Since $\hat{\theta}_N$ converges in probability to $\theta_0$ (by Slutsky's Theorem), and since $\partial g(\theta)/\partial \theta'$ is continuous at $\theta_0$, the Mean Value Theorem of differential calculus implies that

$$g(\hat{\theta}_N) = g(\theta_0) + \frac{\partial g(\theta_N^*)}{\partial \theta'}(\hat{\theta}_N - \theta_0)$$

for some value $\theta_N^*$ between $\hat{\theta}_N$ and $\theta_0$ (with arbitrarily high probability). Since $\hat{\theta}_N \overset{p}{\to} \theta_0$, it follows that $\theta_N^* \overset{p}{\to} \theta_0$, so

$$\frac{\partial g(\theta_N^*)}{\partial \theta'} \overset{p}{\to} G_0 \equiv \frac{\partial g(\theta_0)}{\partial \theta'}.$$

Thus, by Slutsky's Theorem,

$$\sqrt{N}\left(g(\hat{\theta}_N) - g(\theta_0)\right) = \frac{\partial g(\theta_N^*)}{\partial \theta'} \cdot \sqrt{N}(\hat{\theta}_N - \theta_0) \overset{d}{\to} G_0 \cdot N(0, \Sigma) = N(0, G_0 \Sigma G_0').$$

To extend this argument to the case where $g(\hat{\theta}_N)$ is vector-valued, use the same argument to show that any (scalar) linear combination $\lambda' g(\hat{\theta}_N)$ has the asymptotic distribution

$$\sqrt{N}\left(\lambda' g(\hat{\theta}_N) - \lambda' g(\theta_0)\right) \overset{d}{\to} N(0, \lambda' G_0 \Sigma G_0' \lambda),$$

from which the result follows from the Cramér-Wold device.
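As a final added illustration (not part of the original notes), the delta-method variance formula can be checked by simulation for the hypothetical choice $g(\theta) = 1/\theta$ applied to the mean of exponential data, where $G_0 = -1/\theta_0^2$; the distribution, sample size, and seed are arbitrary:

import numpy as np

rng = np.random.default_rng(6)
N, reps = 500, 20_000
theta0, sigma2 = 0.5, 0.25               # X_i ~ Exponential: mean 0.5, variance 0.25
xbar = rng.exponential(theta0, size=(reps, N)).mean(axis=1)
g_hat = 1.0 / xbar                       # g(theta) = 1/theta, so g(theta0) = 2
G0 = -1.0 / theta0**2                    # the (scalar) Jacobian at theta0
print(round(np.var(np.sqrt(N) * (g_hat - 2.0)), 3))  # Monte Carlo variance
print(G0**2 * sigma2)                    # delta-method value G0 * Sigma * G0' = 4.0

The two printed variances should be close, confirming that $\sqrt{N}(g(\hat{\theta}_N) - g(\theta_0))$ is approximately $N(0, G_0 \Sigma G_0')$ in this example.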