V. Properties of estimators {Parts C, D & E in this file}


A. Definitions & Desiderata
   1. model
   2. estimator
   3. sampling errors and sampling distribution
   4. unbiasedness
   5. low sampling variance
   6. low mean squared error (MSE)
      i. definition of MSE
      ii. implied loss function on errors
      iii. relationship to variance and bias
   7. evocative simple examples

B. Finite sample (exact) properties (also called small sample properties, altho equally valid for large samples)
   1. relative efficiency
   2. best linear unbiasedness (BLUness)
   3. efficiency
      a. definition and relationship to BLUness
      b. [SKIP] efficiency using Cramer-Rao lower bounds (Part 1)
         - what the bounds mean
         - what must be assumed in order to calculate the lower bounds
         - how the bounds can show (or fail to show) efficiency
      c. [SKIP] likelihood function of a random sample
         - motivation
         - gaussian case
      d. [SKIP] showing efficiency using Cramer-Rao lower bounds (Part 2)
         - Cramer-Rao result (how to calculate the lower bounds in one- and two-parameter cases)
         - example: efficiency of ȳ when yᵢ ~ NIID[µ, σ²] & σ² known (one-parameter case)

         - example: efficiency of ȳ when yᵢ ~ NIID[µ, σ²] & σ² unknown (two-parameter case)

C. Large Sample or Asymptotic properties
   1. limiting distribution vs. limit of sampling distribution
   2. asymptotic variance
   3. asymptotic mean & asymptotic unbiasedness
   4. consistency
      i. consistency through vanishing limit of MSE (squared error consistency)
      ii. probability limits (plims)
         - definition & relation to consistency
         - theorems for manipulating/evaluating probability limits
         - illustrative examples
   5. asymptotic efficiency
      i. definition
      ii. [SKIP] producing asymptotically efficient estimators (maximum likelihood estimation)
      iii. [SKIP] an astronomically terrible asymptotically efficient estimator

D. Minimum MSE linear estimator

E. The Zen of Econometrics

C. Large Sample or Asymptotic Properties

C.1. Limiting distribution vs. limit of the sampling distribution

Economic data is often not NIID (nor even IID) because it is usually not the result of a controlled experiment. Consequently, it is often difficult or impossible to obtain estimators which we can show are either efficient or BLU, and we must resort to evaluating the "goodness" of our estimators by means of what are called "large sample" or "asymptotic" properties. These properties are meaningful only when the actual sample is sufficiently large. Unfortunately, in practice it is not always clear just how large is "large enough." Nevertheless, since large sample properties are often all we can get, they are much prized.

Suppose that yᵢ ~ NIID(µ, σ²) for i = 1 ... N. Then we know that

   ȳ ~ N[µ, σ²/N], using all N observations, and
   ỹ ~ N[µ, 2σ²/N], using only the first N/2 observations.

ȳ is obviously preferable to ỹ for any finite sample size, but notice that the limits of the sampling distributions of these two estimators as the sample size becomes arbitrarily large are identical: for an infinitely large sample, both ȳ and ỹ are distributed N[µ, 0]! (I.e., the density function is a spike centered on µ with zero width and unit area.¹)

¹ Since the area under a density function must equal one, this function must have infinite height.

In fact, because

   var(practically any estimator) = c/N + terms that become negligible compared to 1/N as N → ∞,

the limit of the sampling distribution of practically every estimator is a spike with zero width and unit area. Consequently, the limit of the sampling distribution is not a useful large-sample concept.

C.2. Asymptotic variance

However, re-writing the last equation more explicitly,

   var(θ̂) = c/N + terms that become negligible compared to 1/N as N → ∞,

it then follows from var(cx) = c²·var(x) that

   var(√N·θ̂) = N·var(θ̂) = c + terms that become negligible as N → ∞,

so if we take the limit of this as N → ∞ and then divide by N, we get an expression for how fast the sampling variance of θ̂ is going to zero when the sample size is large. This is called "the asymptotic variance of θ̂" or Avar(θ̂) or Asyvar(θ̂):

   Avar(θ̂) = (1/N)·lim_{N→∞} { N·var(θ̂) }
            = (1/N)·lim_{N→∞} { c + terms that become negligible as N → ∞ }
            = c/N
            = the leading term in var(θ̂)

Notice that, even though the limit of var(ȳ) = limit of var(ỹ) = 0, these estimators have asymptotic variances of σ²/N and 2σ²/N, respectively.
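
To make the Avar idea concrete, here is a minimal simulation sketch. (This is my own illustration, not part of the original notes; the values µ = 8 and σ = 2, the sample sizes, and the variable names are all arbitrary choices.) Although var(ȳ) and var(ỹ) both go to zero, N·var settles down near σ² = 4 for ȳ and near 2σ² = 8 for ỹ, which are exactly the Avar leading terms.

    # Minimal sketch (not from the notes): estimate N*var for the full-sample
    # mean and the half-sample mean.  Both variances vanish as N grows, but
    # N*var approaches sigma^2 = 4 and 2*sigma^2 = 8, respectively.
    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 8.0, 2.0

    for N in (100, 1000, 10000):
        draws = rng.normal(mu, sigma, size=(2000, N))   # 2000 replicated samples
        ybar = draws.mean(axis=1)                       # uses all N observations
        ytilde = draws[:, : N // 2].mean(axis=1)        # uses only the first N/2
        print(N, ybar.var(), N * ybar.var(), N * ytilde.var())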

Therefore we define the "asymptotic distribution of θ̂" or "limiting distribution of θ̂" as the distribution that θ̂ "tends toward" as the sample size (while finite) becomes arbitrarily large. By definition, this limiting distribution will have variance equal to Avar(θ̂).

C.3. Asymptotic mean & asymptotic unbiasedness

The mean of the limiting distribution of θ̂ is called the asymptotic mean of θ̂, or AE(θ̂) or AsyE(θ̂). Its definition is straightforward:

   AE(θ̂) = lim_{N→∞} E(θ̂)   and hence   Abias(θ̂) = AE(θ̂) − θ

Note #1: Asymptotic unbiasedness is a substantially weaker property than "ordinary" or "finite-sample" unbiasedness. For example, an estimator whose finite-sample bias is proportional to 1/N is biased in every finite sample, yet its Abias is lim_{N→∞} (constant/N) = 0, so it is asymptotically unbiased.

Note #2: Since virtually all of the estimators ordinarily considered are (at least asymptotically) unbiased, there is usually no need to distinguish between the following three definitions of Avar(θ̂), all of which appear in the literature:

   Avar(θ̂) = (1/N)·lim_{N→∞} N·E{[θ̂ − E(θ̂)]²}
   or Avar(θ̂) = (1/N)·lim_{N→∞} N·E{[θ̂ − AE(θ̂)]²}
   or Avar(θ̂) = (1/N)·lim_{N→∞} N·E{[θ̂ − θ]²}

even though the last of these three definitions would more properly be called the asymptotic MSE(θ̂). In any case, the distinction between these three alternatives is certainly not important for what we are doing; consequently, we will use the last of the three, since it is a bit simpler.

C.4. Consistency

The "probability limit" or "plim" of an estimator is a formalization of the intuitively appealing notion that the sampling density function of (practically) all estimators "spikes out" (becomes a spike with unit area and zero width) as the sample size goes to infinity. The plim of the estimator is just the value over which the spike sits; we say that the estimator is "consistent" if its probability limit equals the population value that the estimator is trying to estimate.

It turns out to be extraordinarily useful to formalize this notion of a probability limit because the formal definition leads to a number of results which (while not easy to prove) are both intuitively appealing and extremely easy to apply. Consequently, it is usually possible to compute the probability limit of one's estimator (and to thereby determine whether or not it is consistent) even in the messy circumstances we often face with (typically) non-experimental economic data, such as data which is not only non-gaussian, but also not IID. (In other words, people place exceptional value on consistency in their estimators because it is often the only property they can get!)

C.4.i. Consistency through vanishing limit of MSE (squared error consistency)

First, a way to prove consistency that does not require dealing with probability limits: θ̂ is (squared error) consistent if and only if

   a. MSE(θ̂) = E{[θ̂ − θ]²} → 0 as N → ∞, or (equivalently)
   b. Avar(θ̂) → 0 as N → ∞ and Abias(θ̂) = 0.

This kind of consistency is called "squared error consistency" to distinguish it from the definition of consistency based on probability limits. The two forms of consistency are equivalent except in the case of an estimator whose sampling density function has tails which are so thick that the estimator does not have a finite variance or MSE; such an estimator can be consistent (based on its probability limit being correct) even though it cannot be squared error consistent, because its MSE is not well-defined for finite samples.

C.4.ii. Probability Limits (plims)

Definition of plim: plim(θ̂) = θ* if and only if

   a. (verbal) The sampling density function of θ̂ becomes a spike centered on the value θ* as the sample size goes to infinity.
   b. (formal) lim_{N→∞} { probability that θ̂ lies in (θ* − δ, θ* + δ) } = 1, no matter how small δ > 0 is.

Definition of Consistency: θ̂ is consistent for θ if and only if plim(θ̂) = θ.

Theorems for manipulating plims:

   1. (Slutsky's Theorem) If g(·) is a continuous² function, then plim{ g(X) } = g( plim{ X } ).
   2. If f(N) is not random, then plim{ f(N) } = lim{ f(N) }, where "lim" denotes an ordinary limit as N → ∞.

² I.e., the limit of g(x) as x approaches z is g(z) for all values of z.

   3. So long as plim{ X } and plim{ Y } exist, plim{ X + Y } = plim{ X } + plim{ Y }.
   4. So long as plim{ X } and plim{ Y } exist, plim{ X · Y } = plim{ X } · plim{ Y }.
   5. So long as plim{ X } and plim{ Y } exist and plim{ Y } is not zero, plim{ X / Y } = plim{ X } / plim{ Y }.
   6. (Khintchine's Theorem) plim{ sample moment } = corresponding population moment.

Note #1: Suppose, for example, that you have an unbiased estimate of the variance of some random variable, but what you need is an estimate of the variable's standard deviation. The square root of the unbiased variance estimator is not unbiased for the square root of the population variance. But Slutsky's theorem guarantees that the square root of a consistent estimator of the variance is a consistent estimator of the square root of the population variance.

Note #2: Strategy for evaluating probability limits:

   a. write your estimator as a function of sample moments
   b. use theorems #1-5 to write the plim of the estimator as a function of the plims of sample moments
   c. use Khintchine's theorem to evaluate the plims of the sample moments.
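
To see the Note #2 strategy in action, here is a small sketch (my own illustration, on simulated data with arbitrary parameter values) using a close cousin of Note #1's estimator: the sample standard deviation is the square root (a continuous function) of a combination of the first two sample moments, so Khintchine's theorem plus Slutsky's theorem imply that it is consistent for σ even though it is biased in finite samples.

    # Sketch: consistency of the sample standard deviation via the
    # moments-then-Slutsky route.  The estimator is g(m1, m2) = sqrt(m2 - m1^2),
    # a continuous function of the first two sample moments.
    import numpy as np

    rng = np.random.default_rng(1)
    sigma = 3.0

    for N in (10, 1_000, 100_000, 10_000_000):
        y = rng.normal(0.0, sigma, size=N)
        m1, m2 = np.mean(y), np.mean(y**2)   # sample moments (Khintchine)
        print(N, np.sqrt(m2 - m1**2))        # wanders toward sigma = 3 as N grows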

Illustrative Examples:

a. plim(ȳ) = plim(ỹ) = µ. {This follows directly from Khintchine's theorem, since these are sample moments using all of the sample data and the first half of the sample data, respectively.} Since in this case we know the sampling distributions of the estimators, it is also easy to show squared error consistency. For example:

   ỹ ~ N[µ, 2σ²/N], using only the first N/2 observations; therefore

   MSE(ỹ) = var(ỹ) + [bias(ỹ)]² = 2σ²/N + 0 → 0 as N → ∞.

b. Assuming that c is a fixed constant and that plim(θ̂) exists,

   plim{ θ̂ + c/N } = plim{ θ̂ } + plim{ c/N } = plim{ θ̂ } + lim{ c/N } = plim{ θ̂ } + 0 = plim{ θ̂ }.

c. Assuming that plim(θ̂) exists,

   plim{ θ̂ + 10⁸/N } = plim{ θ̂ } + plim{ 10⁸/N } = plim{ θ̂ } + lim{ 10⁸/N } = plim{ θ̂ } + 0 = plim{ θ̂ }.

Therefore, if θ̂ is consistent for θ, then so is θ̂ + (10⁸/N)!!⁴ (See the numerical sketch at the end of this subsection.)

C.5. Asymptotic efficiency

C.5.i. Definition: θ̂ is asymptotically efficient if and only if

   1. θ̂ is squared error consistent (which implies that Abias(θ̂) = 0), and
   2. Avar(θ̂) ≤ Avar(θ̃) for all squared error consistent θ̃.

⁴ And so, for that matter, is θ̂ + (c/N^(0.5+ε)) for any fixed constant c and any ε > 0; see section C.5.iii below.
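
Returning to example c: the sketch below (my own code and numbers, in the spirit of the notes' estimator) shows that ȳ + 10⁸/N is a terrible estimate of µ at every realistic sample size, because the deterministic 10⁸/N term dominates, yet that term does vanish as N → ∞, which is all that consistency requires.

    # Sketch: ybar + 10**8/N is consistent for mu = 8 but astronomically biased
    # at any sample size one is ever likely to have.
    import numpy as np

    rng = np.random.default_rng(2)
    mu = 8.0

    for N in (10, 1_000, 100_000, 10_000_000):
        y = rng.normal(mu, 1.0, size=N)
        print(N, y.mean() + 1e8 / N, 1e8 / N)   # the estimate, and its bias term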

C.5.ii. [SKIP] Producing asymptotically efficient estimators: Maximum Likelihood Estimation

Asymptotic efficiency is the most desirable large-sample property an estimator can have. Indeed, if the sample is large enough for it to be credibly meaningful, asymptotic efficiency is a wonderful property. Unfortunately, just like finite-sample efficiency, it looks like asymptotic efficiency will be difficult to prove. Remarkably enough, that turns out not to be the case, because it turns out that maximum likelihood estimators are all asymptotically efficient.

First I will motivate our interest in maximum likelihood estimation (MLE) by discussing the nice (asymptotic) properties that MLE estimators have; then I will show you how to obtain MLE estimators.

If the model includes enough distributional information that we can express the likelihood of the observed sample as a function of the sample data and a finite number of unknown parameters (such as µ, σ² or θ₁, θ₂, ..., θ_k), then the unknown parameters can be estimated using the maximum likelihood method and the resulting estimators have the following nice asymptotic properties:⁵

   a. They are asymptotically unbiased.
   b. They are consistent.
   c. They are asymptotically efficient.
   d. They are asymptotically normal.
   e. g(θ̂ᵢ^ML) is the maximum likelihood estimator of g(θᵢ) {and hence has all these nice asymptotic properties as an estimator of g(θᵢ)} for any continuous function g(x).
   f. In fact θ̂ᵢ^ML is asymptotically efficient because Avar(θ̂ᵢ^ML) equals the Cramer-Rao lower bound.

⁵ The proof of these results also requires that the density is a continuous function of the observations and that the log-likelihood function be a sufficiently smooth function of the unknown parameters that its third partial derivatives with respect to these parameters are all finite. These conditions are all satisfied for gaussian data.

Therefore, calculation of the k x k matrix of Cramer-Rao lower bounds for θ₁, θ₂, ..., θ_k provides a straightforward way to obtain V, the k x k (asymptotic) variance-covariance matrix for θ̂₁^ML, θ̂₂^ML, ..., θ̂_k^ML. Thus,

   θ̂ᵢ^ML   asymptotically ~   N[ θᵢ, vᵢᵢ ]

where vᵢᵢ is the Cramer-Rao lower bound on the sampling variance of any unbiased estimator of θᵢ, obtained as the ith diagonal element of the inverse of the k x k information matrix {expressed below for k = 3}:

   [ −E{∂²L(θ₁,θ₂,θ₃)/∂θ₁²}     −E{∂²L(θ₁,θ₂,θ₃)/∂θ₁∂θ₂}   −E{∂²L(θ₁,θ₂,θ₃)/∂θ₁∂θ₃} ]
   [ −E{∂²L(θ₁,θ₂,θ₃)/∂θ₂∂θ₁}   −E{∂²L(θ₁,θ₂,θ₃)/∂θ₂²}     −E{∂²L(θ₁,θ₂,θ₃)/∂θ₂∂θ₃} ]
   [ −E{∂²L(θ₁,θ₂,θ₃)/∂θ₃∂θ₁}   −E{∂²L(θ₁,θ₂,θ₃)/∂θ₃∂θ₂}   −E{∂²L(θ₁,θ₂,θ₃)/∂θ₃²}   ]

where L(θ₁, θ₂, θ₃) is a contraction for the log-likelihood function, log{ likelihood(y₁, y₂, ..., y_N; θ₁, θ₂, θ₃) }. The calculation of this matrix (and its inverse) can be tedious, but at least it is straightforward.
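
In practice both the maximization and the information matrix are often handled numerically. The sketch below is my own illustration (not the notes' code; the data, starting values, and names are arbitrary): it minimizes the negative log-likelihood for the gaussian case and reads an approximate asymptotic covariance matrix off the inverse Hessian that the BFGS optimizer reports. Because the code reparameterizes to log(σ̂²) to keep the variance positive, the reported covariance is that of (µ̂, log σ̂²).

    # Sketch: numerical MLE plus Hessian-based (approximate) asymptotic standard
    # errors for gaussian data.  True values here: mu = 8, sigma^2 = 4.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(3)
    y = rng.normal(8.0, 2.0, size=500)

    def neg_loglik(params):
        mu, log_s2 = params          # work with log(sigma^2) so it stays positive
        s2 = np.exp(log_s2)
        return 0.5 * np.sum(np.log(2.0 * np.pi * s2) + (y - mu) ** 2 / s2)

    start = np.array([y.mean(), np.log(y.var())])
    res = minimize(neg_loglik, x0=start, method="BFGS")
    mu_ml, s2_ml = res.x[0], np.exp(res.x[1])
    print(mu_ml, s2_ml)      # ~ ybar and ~ (1/N) * sum of (y - ybar)^2
    print(res.hess_inv)      # ~ asymptotic covariance of (mu_hat, log s2_hat)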

So maximum likelihood estimation is rather wonderful if one is willing to settle for asymptotic properties and if one can actually obtain θ̂₁^ML, θ̂₂^ML, ..., θ̂_k^ML. How do you obtain these estimators?

First of all, recall that the likelihood function, likelihood(y₁, y₂, ..., y_N; θ₁, θ₂, ..., θ_k), takes θ₁, θ₂, ..., θ_k (the parameters specifying the distribution of yᵢ) as given and expresses the likelihood of having observed y₁, y₂, ..., y_N (the sample observations) as a function of y₁, y₂, ..., y_N. Suppose that we turned this around in our mind and instead viewed the likelihood function as a function of our estimates (θ̂₁, θ̂₂, ..., θ̂_k) of the (unknown) parameters (θ₁, θ₂, ..., θ_k), given that we have observed the sample (y₁, y₂, ..., y_N):

   likelihood(θ̂₁, θ̂₂, ..., θ̂_k; y₁, y₂, ..., y_N)

Now this likelihood function expresses how likely it is that the underlying parameters (θ₁, θ₂, ..., θ_k) equal the particular values (θ̂₁, θ̂₂, ..., θ̂_k), given that we have observed the sample (y₁, y₂, ..., y_N). The maximum likelihood estimates of (θ₁, θ₂, ..., θ_k) are thus the values of (θ̂₁, θ̂₂, ..., θ̂_k) that maximize likelihood(θ̂₁, θ̂₂, ..., θ̂_k; y₁, y₂, ..., y_N) for the given observation values (y₁, y₂, ..., y_N).

Recall our earlier example in which we had three observations (y₁, y₂, y₃) and we assumed that yᵢ ~ NIID(µ, σ²) for i = 1 ... 3. This led to the likelihood function

   likelihood(y₁, y₂, y₃; µ, σ²) = (2πσ²)^(-3/2) · e^{-(y₁-µ)²/(2σ²)} · e^{-(y₂-µ)²/(2σ²)} · e^{-(y₃-µ)²/(2σ²)}

At that point we assumed that E{yᵢ} = µ = 8 and that var(yᵢ) = σ² took a particular value, and went on to evaluate likelihood(y₁, y₂, y₃; 8, σ²) for various possible samples, such as (8, 8, 8). To estimate µ and σ² using the maximum likelihood method, we instead take y₁, y₂, and y₃ as given, consider the likelihood function

   likelihood(µ̂, σ̂²; y₁, y₂, y₃) = (2πσ̂²)^(-3/2) · e^{-(y₁-µ̂)²/(2σ̂²)} · e^{-(y₂-µ̂)²/(2σ̂²)} · e^{-(y₃-µ̂)²/(2σ̂²)}

and ask questions like "given the sample we have observed, how likely is it that µ and σ² took these particular values in the distribution from which these observations were picked?" The values of µ̂ and σ̂² that maximize the likelihood of having observed the sample we did observe are the maximum likelihood estimates of µ and σ².

Since the slope of the logarithm function is always positive, the pair of values (µ̂ and σ̂²) that maximizes likelihood(µ̂, σ̂²; y₁, y₂, y₃) also maximizes log{ likelihood(µ̂, σ̂²; y₁, y₂, y₃) }.

In this case { yᵢ ~ NIID(µ, σ²) for i = 1 ... 3 }, this log-likelihood equals

   L(µ̂, σ̂²; y₁, y₂, y₃) = Σᵢ log f(yᵢ; µ̂, σ̂²)
                        = Σᵢ [ -(1/2)·log(2π) - (1/2)·log(σ̂²) - (yᵢ - µ̂)²/(2σ̂²) ]
                        = -(3/2)·log(2π) - (3/2)·log(σ̂²) - (1/(2σ̂²))·Σᵢ (yᵢ - µ̂)²

Therefore the maximum likelihood estimators µ̂^ML and σ̂²^ML of θ₁ = µ and θ₂ = σ² must satisfy the two first order conditions for maximizing log{ likelihood(µ̂, σ̂²; y₁, y₂, y₃) } with respect to µ̂ and σ̂², namely:

   ∂L(µ̂, σ̂²; y₁, y₂, y₃)/∂µ̂ = (1/σ̂²)·Σᵢ (yᵢ - µ̂) = 0                                (I)

   ∂L(µ̂, σ̂²; y₁, y₂, y₃)/∂σ̂² = -(3/2)·(1/σ̂²) + (1/(2σ̂⁴))·Σᵢ (yᵢ - µ̂)² = 0           (II)

Equation I yields:

   (1/σ̂²)·Σᵢ (yᵢ - µ̂) = (1/σ̂²)·( Σᵢ yᵢ - 3·µ̂ ) = 0

so that, since the variance estimate σ̂² is surely positive,

   µ̂^ML = (1/3)·Σᵢ yᵢ = ȳ.

Substituting this result into equation II and letting

   SSD ≡ Σᵢ (yᵢ - ȳ)²

yields

   L(ȳ, σ̂²; y₁, y₂, y₃) = -(3/2)·log(2π) - (3/2)·log(σ̂²) - SSD/(2σ̂²)

Noting that d log(x)/dx = 1/x and that d(1/x)/dx = -1/x², this yields

   ∂L(ȳ, σ̂²; y₁, y₂, y₃)/∂σ̂² = -(3/2)·(1/σ̂²) + SSD/(2σ̂⁴) = 0

Multiplying both sides of this equation by -(2σ̂⁴)/3 yields

   σ̂² - SSD/3 = 0,   i.e.   σ̂²^ML = (1/3)·Σᵢ (yᵢ - ȳ)².

Thus, we can conclude that

   µ̂^ML = ȳ, the sample mean
   σ̂²^ML = (1/N)·Σᵢ (yᵢ - ȳ)², the sample variance.

Nothing was special about a sample size of three, so these results are valid for any sample size N. Therefore, if yᵢ ~ NIID(µ, σ²) for i = 1 ... N, then ȳ and σ̂² are both

   1. consistent
   2. asymptotically unbiased
   3. asymptotically normal
   4. asymptotically efficient.

This is no news for ȳ, since it is (for yᵢ ~ NIID(µ, σ²), i = 1 ... N) efficient even in finite samples, but it is a useful result for σ̂². Moreover, since maximum likelihood estimators asymptotically attain the Cramer-Rao lower bounds, we can use these to complete the specification of the joint (asymptotic) density function for this pair of estimators. Using the matrix of Cramer-Rao lower bounds for this problem calculated in the Cramer-Rao section of Part B above, the estimators (µ̂^ML, σ̂²^ML) = (ȳ, σ̂²) are thus asymptotically distributed

   [ ȳ  ]   asy     [ µ  ]   [ σ²/N     0     ]
   [ σ̂² ]    ~    N([ σ² ],  [ 0       2σ⁴/N  ])

We see then (for yᵢ ~ NIID(µ, σ²), i = 1 ... N) that ȳ and σ̂² are asymptotically independent (since they are asymptotically normal and asymptotically uncorrelated), but we knew this already from the fact that ȳ and s² are in this case independently distributed for any sample size.
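
A quick simulation check on the two diagonal entries of this asymptotic covariance matrix (my own sketch, with arbitrary parameter values): across many replicated samples, N·var(ȳ) should sit near σ² and N·var(σ̂²) near 2σ⁴.

    # Sketch: with sigma^2 = 4, N*var(ybar) should come out near 4 and
    # N*var(sigma2_hat) near 2*sigma^4 = 32.
    import numpy as np

    rng = np.random.default_rng(4)
    mu, sigma2, N = 8.0, 4.0, 2000

    draws = rng.normal(mu, np.sqrt(sigma2), size=(10000, N))
    ybar = draws.mean(axis=1)
    s2_ml = draws.var(axis=1)               # numpy's default ddof=0 is the /N MLE
    print(N * ybar.var(), sigma2)           # both ~ 4
    print(N * s2_ml.var(), 2 * sigma2**2)   # both ~ 32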

We can also conclude that

   σ̂²   asy~   N[ σ², 2σ⁴/N ].

On your homework assignment you will use the known result that var(s²) = 2σ⁴/(N-1) to show that var(σ̂²) < 2σ⁴/N for any sample size. But, since we just saw that 2σ⁴/N is the Cramer-Rao lower bound, how can σ̂² have a finite sample variance smaller than that? (Answer: as we showed earlier, σ̂² is a biased estimator for σ².)

C.5.iii [SKIP] An astronomically terrible (but asymptotically efficient) estimator

Recall that

   Avar(θ̂) = (1/N)·lim_{N→∞} N·E{[θ̂ - θ]²}

and suppose that θ̂ is known to be asymptotically efficient. What about θ̂ + (10⁸/N)? First note that, since θ̂ is asymptotically efficient, it is asymptotically unbiased; consequently, θ̂ + (10⁸/N) is asymptotically unbiased also:

   Abias(θ̂ + 10⁸/N) = lim_{N→∞} E{θ̂ + 10⁸/N} - θ
                    = lim_{N→∞} E{θ̂} + lim_{N→∞} (10⁸/N) - θ
                    = lim_{N→∞} E{θ̂} - θ
                    = Abias(θ̂) = 0

Next note that both θ̂ and θ̂ + (10⁸/N) in fact have the same asymptotic variance:

   Avar(θ̂ + 10⁸/N) = (1/N)·lim_{N→∞} N·E{[θ̂ + 10⁸/N - θ]²}
                   = (1/N)·lim_{N→∞} N·E{[θ̂ - θ]² + 2·(10⁸/N)·(θ̂ - θ) + 10¹⁶/N²}
                   = (1/N)·lim_{N→∞} { N·E[θ̂ - θ]² + 2·10⁸·E[θ̂ - θ] + 10¹⁶/N }
                   = (1/N)·lim_{N→∞} { N·E[θ̂ - θ]² + 2·10⁸·Bias(θ̂) + 10¹⁶/N }
                   = (1/N)·lim_{N→∞} N·E{[θ̂ - θ]²}      {since Bias(θ̂) → Abias(θ̂) = 0 and 10¹⁶/N → 0}
                   = Avar(θ̂)

So, if θ̂ is asymptotically efficient (and hence asymptotically unbiased) then so is θ̂ + (10⁸/N)!!

In fact, a close look at this derivation shows that θ̂ and [θ̂ + c/N] both have the same asymptotic variance no matter how large the constant c is, as do θ̂ and [θ̂ + c/N^(0.5+ε)] for any ε > 0. Thus, since s² = σ̂²·N/(N-1), we see that σ̂² and s² have the same asymptotic variance.⁶

⁶ Indeed, they thus have the same asymptotic distribution.

D. Minimum MSE estimator

If one's loss function on estimation errors is not proportional to the square of the error, then one might prefer some other estimator over the efficient estimator. But suppose that MSE really is what one cares about; is it possible to find an estimator with a smaller MSE than that of the efficient estimator? It turns out that the answer to this question is both "yes" and "no":

Let θ̂ be the efficient estimator for θ and let

   θ̃(k) ≡ k·θ̂   and   t ≡ θ² / var(θ̂)

where θ̃(k) is an alternative estimator of θ and k is some positive number. What value of k yields the θ̃(k) with the smallest MSE?

   bias(θ̃(k)) = E{k·θ̂} - θ = k·E{θ̂} - θ = k·θ - θ = (k-1)·θ

   var(θ̃(k)) = var(k·θ̂) = k²·var(θ̂)

Thus,

   MSE{θ̃(k)} = var(θ̃(k)) + [bias(θ̃(k))]²
             = k²·var(θ̂) + (k-1)²·θ²
             = k²·var(θ̂) + (k-1)²·t·var(θ̂)
             = var(θ̂)·[ k² + (k-1)²·t ]

Thus k*, the value of k which minimizes MSE{θ̃(k)}, must satisfy

   dMSE{θ̃(k)}/dk = var(θ̂)·(d/dk)[ k² + (k-1)²·t ] = var(θ̂)·[ 2k + 2·(k-1)·t ] = 0

which implies that

   k* = t/(1+t) < 1.

Therefore, it is always optimal (i.e., it lowers the MSE) to bias the estimator a bit toward zero. However, it must be noted that θ̃(k*) is not in fact a feasible estimator, since the expression for k* involves t, which depends on θ.
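
A numeric illustration (my own sketch; the values θ = 2 and var(θ̂) = 1 are arbitrary): then t = θ²/var(θ̂) = 4 and k* = t/(1+t) = 0.8, and scanning MSE(k) = var(θ̂)·[k² + (k-1)²·t] over a grid of k values confirms both the location of the minimum and that the shrunken estimator beats the efficient one on MSE.

    # Sketch: the MSE-minimizing shrinkage factor.  With theta = 2 and
    # var(theta_hat) = 1, t = 4 and k* = 0.8; the shrunken MSE is 0.8 < 1.
    import numpy as np

    theta, var = 2.0, 1.0
    t = theta**2 / var
    k = np.linspace(0.0, 1.2, 121)              # grid of shrinkage factors
    mse = var * (k**2 + (k - 1.0) ** 2 * t)     # MSE of k * theta_hat
    print(k[np.argmin(mse)], t / (1.0 + t))     # grid minimum vs. the k* formula
    print(mse.min(), var)                       # 0.8 < 1: shrinking lowers MSE

Of course, computing k* here required knowing θ itself, which is exactly why θ̃(k*) is not feasible in practice.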


E. The Zen of Econometrics

We have now gone to considerable trouble to see that ȳ is a good (in some senses, optimal) estimator of µ when yᵢ ~ NIID(µ, σ²). But what if our actual sample observations are clearly not gaussian? For example, consider the problem of estimating the mean size of a U.S. firm, as quantified by its annual sales revenues. The empirically observed distribution of firm sizes looks more like the graph of a log-normal density function than it does like the graph of a gaussian density function. However, since the size of a firm can be viewed as the product of a large number of (more or less) independent factors, the empirically observed distribution of the logarithm of the firm sizes looks reasonably gaussian.

Therefore there are two different ways to approach the problem of estimating the mean size of a U.S. firm:

1. Try to figure out the most efficient estimator for the (population) mean of a random sample drawn from a log-normally distributed population. Then one must still face the problem of calculating its sampling distribution so that you can do inference. These problems can be dealt with to some degree, but even their approximate solution is difficult and complicated.

2. Since it is reasonable to suppose that the observations on log(firm sales revenue) are NIID, use the sample mean of the logarithms of the observed sales revenue figures to efficiently estimate the population mean of the logarithm of firm sales revenue. Similarly, use the methods we have covered to obtain a 95% confidence interval for the population mean of the logarithm of firm sales revenue. (This interval will be centered around the sample mean of the logarithms of the observed sales revenue data.)

Letting "c95upper" and "c95lower" denote the upper and lower endpoints of this 95% confidence interval for the population mean of the logarithm of firm sales revenue, one can immediately infer that the interval [e^c95lower, e^c95upper] is a 95% confidence interval for the exponential of the population mean of the logarithm of sales revenue, i.e., for the median of sales revenue itself. (This interval will not be centered around the exponential of the sample mean of the logarithms of the observed sales revenue data, however.)

Or, more reasonably, one can (and, I would argue, typically should) decide that the data itself is telling us that the size of the firm ought to be quantified by the logarithm of its annual sales revenues, in which case the sample mean of the logarithms of the observed sales revenue figures is precisely the estimator one wants.

This example suggests a completely different way of looking at the theoretical results we have obtained. Instead of seeing them as arbitrary special cases where everything worked out nicely and resolving to always assume that our data is NIID (whether it is or not) so that we can "use" these results, this example suggests that we could more gracefully observe that our statistical machinery for using data to obtain knowledge about the world "works" best (i.e., most simply and effectively) when we have framed the problem in such a way that the sample data is NIID, and act accordingly.
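
Here is a short sketch of approach #2 (my own code, run on simulated data with arbitrary parameter values): form the usual t-based 95% confidence interval for the population mean of the logs, then exponentiate its endpoints. The log-scale interval is symmetric about the sample mean of the logs; the exponentiated interval is not symmetric about the exponential of that mean, just as described above.

    # Sketch: 95% CI for the mean of log(sales), then exponentiate the endpoints.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    log_sales = rng.normal(3.0, 1.5, size=400)   # pretend log(sales revenue) data

    N = log_sales.size
    m = log_sales.mean()
    s = log_sales.std(ddof=1)                    # ddof=1: the unbiased s, not the MLE
    tcrit = stats.t.ppf(0.975, df=N - 1)
    c95lower = m - tcrit * s / np.sqrt(N)
    c95upper = m + tcrit * s / np.sqrt(N)
    print(c95lower, c95upper)                    # centered on m
    print(np.exp(c95lower), np.exp(c95upper))    # not centered on exp(m)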
