Parameter Estimation

Consider a sample of observations on a random variable Y. This generates the random variables $(y_1, y_2, \ldots, y_n)$. A random sample is a sample $(y_1, y_2, \ldots, y_n)$ where the random variables $y_t$, $t = 1, \ldots, n$, are independently and identically distributed (often written i.i.d.). This means that the joint probability density function of the sample factors as
$$f(y_1, y_2, \ldots, y_n) = f(y_1)\, f(y_2)\, f(y_3) \cdots f(y_n),$$
where $f(y_t)$ is the marginal probability density function of $y_t$, $t = 1, 2, \ldots, n$.

Estimators

Consider the estimation of a $(K \times 1)$ parameter vector $\theta$ based on a sample $(y_1, y_2, \ldots, y_n)$. An estimator of the $(K \times 1)$ parameter vector $\theta$ is a function $\hat{\theta}(y_1, y_2, \ldots, y_n)$, where $y_t$ is the t-th random variable in the sample. Since it is a function of random variables, an estimator $\hat{\theta}(y_1, \ldots, y_n)$ is itself a vector of random variables. An estimate of $\theta$ is the value of $\hat{\theta}(y_1, \ldots, y_n)$ evaluated at the observed values of the random variables $y_t$, $t = 1, 2, \ldots, n$. Since the $y_t$'s are observed numbers, the estimate is not a vector of random variables. The observed values depend on the sample, implying that the estimate takes a specific value for a given sample but varies across samples.

How do we choose the function $\hat{\theta}(\cdot)$?

The Method of Moments

Consider the r-th sample moment of the random variable Y, $\hat{\mu}_r = \sum_{t=1}^{n} y_t^r / n$, $r = 1, 2, \ldots, K$. Assume that the first K population moments of Y are related to the $(K \times 1)$ parameter vector $\theta$ by the known functions $\mu_r = h_r(\theta)$, $r = 1, 2, \ldots, K$. The method of moments consists in equating the sample moments $\hat{\mu}_r$ to the true moments $\mu_r$, and solving the resulting system of K equations, $\hat{\mu}_r = h_r(\theta)$, $r = 1, \ldots, K$, for $\theta$. The resulting estimator $\hat{\theta}_m$ is the method of moments estimator.
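As a minimal numerical sketch (not part of the notes), the moment equations can also be solved with a root finder when no closed form is available. The helper `method_of_moments` and the parameter values below are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import fsolve

def method_of_moments(y, h, theta0):
    """Solve the K moment equations mu_hat_r = h_r(theta) for theta.

    y      : 1-D array of observations
    h      : function mapping theta (length-K array) to the vector of
             population moments (h_1(theta), ..., h_K(theta))
    theta0 : starting values for the root finder
    """
    K = len(theta0)
    # r-th sample moment: sum_t y_t^r / n, for r = 1, ..., K
    mu_hat = np.array([np.mean(y ** r) for r in range(1, K + 1)])
    # Find theta that equates the population moments to the sample moments
    return fsolve(lambda th: h(th) - mu_hat, theta0)

# Example: y_t ~ (beta, sigma^2), so mu_1 = beta and mu_2 = sigma^2 + beta^2
rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, scale=2.0, size=1_000)
h = lambda th: np.array([th[0], th[1] + th[0] ** 2])   # th = (beta, sigma^2)
beta_m, sigma2_m = method_of_moments(y, h, theta0=np.array([1.0, 1.0]))
```

For this two-moment case the solver simply reproduces the closed-form solution derived in the example below.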
Let $(y_1, y_2, \ldots, y_n)$ be a sample where $y_t$ is distributed with mean $\beta$ and variance $\sigma^2$, $t = 1, 2, \ldots, n$, written $y_t \sim (\beta, \sigma^2)$. Therefore $\theta = (\beta, \sigma^2)$. We have $E(y_t) = \mu_1 = \beta$ and $E[(y_t - \beta)^2] = E(y_t^2) - [E(y_t)]^2 = \mu_2 - \mu_1^2 = \sigma^2$. Then, equating the sample moments to the population moments gives
$$\hat{\mu}_1 = \sum_{t=1}^{n} y_t / n = \beta, \qquad \hat{\mu}_2 = \sum_{t=1}^{n} y_t^2 / n = \sigma^2 + \beta^2.$$
Solving these two equations for $\theta = (\beta, \sigma^2)$ gives the method of moments estimator of the mean, $\hat{\beta}_m = \sum_{t=1}^{n} y_t / n$, and of the variance, $\hat{\sigma}^2_m = \big(\sum_{t=1}^{n} y_t^2 / n\big) - \hat{\beta}_m^2 = \sum_{t=1}^{n} (y_t - \hat{\beta}_m)^2 / n$.

The Maximum Likelihood Method

Let the joint probability density function of the sample $(y_1, y_2, \ldots, y_n)$ be $f(y_1, \ldots, y_n \mid \theta)$, where $\theta$ is a $(K \times 1)$ vector of unknown parameters that belongs to the parameter space $\Omega$, $\theta \in \Omega$. Define the likelihood function of the sample as $l(\theta \mid y_1, \ldots, y_n) = f(y_1, \ldots, y_n \mid \theta)$. The maximum likelihood estimator is the value $\hat{\theta}_l$ that solves the maximization problem $\max_\theta \{ l(\theta \mid y_1, \ldots, y_n) : \theta \in \Omega \}$. Define the log-likelihood function of the sample as $L(\theta) = \ln[l(\theta \mid y_1, \ldots, y_n)]$. Since the logarithmic function is monotonic, $l(\cdot)$ and $\ln l(\cdot)$ attain their maxima at the same value of $\theta$. As a result, it is often convenient to define the maximum likelihood estimator of $\theta$ as the value $\hat{\theta}_l$ that solves $\max_\theta \{ L(\theta) : \theta \in \Omega \}$.

Let $(y_1, \ldots, y_n)$ be a random sample where $y_t \sim N(\beta, \sigma^2)$, $\theta = (\beta, \sigma^2)$, $\sigma^2 > 0$. Thus the probability density function of $y_t$ is
$$f(y_t \mid \beta, \sigma^2) = \exp[-(y_t - \beta)^2/(2\sigma^2)] \,/\, (2\pi\sigma^2)^{1/2},$$
and $l(\beta, \sigma^2 \mid y_1, \ldots, y_n) = f(y_1, \ldots, y_n \mid \theta) = \prod_{t=1}^{n} f(y_t \mid \beta, \sigma^2)$.
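Maximum likelihood problems of this kind are routinely solved numerically. The sketch below (illustrative, with assumed data and starting values) codes the normal log-likelihood derived just below and maximizes it by minimizing its negative; the maximizer agrees with the closed-form solution obtained next:

```python
import numpy as np
from scipy.optimize import minimize

def normal_log_likelihood(beta, sigma2, y):
    """Log-likelihood of an i.i.d. N(beta, sigma2) sample y."""
    n = len(y)
    return (-n / 2) * np.log(2 * np.pi) - (n / 2) * np.log(sigma2) \
           - np.sum((y - beta) ** 2) / (2 * sigma2)

rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=3.0, size=500)

# Minimize the negative log-likelihood over theta = (beta, sigma2),
# keeping sigma2 strictly positive.
neg_ll = lambda th: -normal_log_likelihood(th[0], th[1], y)
res = minimize(neg_ll, x0=np.array([0.0, 1.0]),
               bounds=[(None, None), (1e-6, None)])
# res.x approximates (y.mean(), y.var()), the closed-form ML solution.
```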
Taking logs, the log-likelihood function of the sample is
$$L = \ln\Big[\prod_{t=1}^{n} f(y_t \mid \beta, \sigma^2)\Big] = \sum_{t=1}^{n}\Big[-\tfrac{1}{2}(y_t - \beta)^2/\sigma^2 - \tfrac{1}{2}\ln(2\pi\sigma^2)\Big] = -\tfrac{n}{2}\ln(2\pi) - \tfrac{n}{2}\ln(\sigma^2) - \tfrac{1}{2}\sum_{t=1}^{n}(y_t - \beta)^2/\sigma^2.$$
Note that L is a concave function of $\theta = (\beta, \sigma^2)$. The first-order conditions for a maximum with respect to $\theta = (\beta, \sigma^2)$ are:
$$\partial L/\partial \beta = \Big[\Big(\sum_{t=1}^{n} y_t\Big) - n\beta\Big]\Big/\sigma^2 = 0,$$
$$\partial L/\partial(\sigma^2) = -n/(2\sigma^2) + \Big[\sum_{t=1}^{n}(y_t - \beta)^2\Big]\Big/(2\sigma^4) = 0.$$
Solving these two equations for $\theta = (\beta, \sigma^2)$ gives the maximum likelihood estimator of the mean, $\hat{\beta}_l = \sum_{t=1}^{n} y_t / n$, and of the variance, $\hat{\sigma}^2_l = \sum_{t=1}^{n}(y_t - \hat{\beta}_l)^2 / n$. The maximum likelihood method requires knowing the probability distribution of the $y_t$'s (other methods can be less demanding). In this case $\hat{\beta}_l = \hat{\beta}_m$ and $\hat{\sigma}^2_l = \hat{\sigma}^2_m$; i.e., the method of moments and the maximum likelihood method give identical estimators of $\theta = (\beta, \sigma^2)$.

The Least-Squares Method

Assume that we know functions $h_t(\theta)$ satisfying $E(y_t) = h_t(\theta)$, $t = 1, 2, \ldots, n$, where $\theta$ is a $(K \times 1)$ vector of parameters, $\theta \in \Omega$. Define the error term $e_t = y_t - h_t(\theta)$. It follows that $e_t$ is a random variable (since it is a function of the random variable $y_t$) and has mean zero: $E(e_t) = E(y_t) - h_t(\theta) = 0$. Define the error sum of squares $S(y_1, \ldots, y_n, \theta) = \sum_{t=1}^{n}[y_t - h_t(\theta)]^2$. The least squares estimator of $\theta$ is the value $\hat{\theta}_s$ that solves the minimization problem $\min_\theta \{ S(y_1, \ldots, y_n, \theta) : \theta \in \Omega \}$.

Let $(y_1, \ldots, y_n)$ be a sample where $y_t$ is distributed with mean $\beta$ and some finite variance, $t = 1, 2, \ldots, n$. Given $E(y_t) = \beta$, let $h_t(\beta) = \beta$. Then the least squares estimator of $\beta$ is obtained by minimizing $S = \sum_{t=1}^{n}(y_t - \beta)^2$. Note that S is a convex function of $\beta$.
The first-order necessary condition for a minimum of S is
$$\partial S/\partial \beta = -2 \sum_{t=1}^{n} (y_t - \beta) = 0.$$
Solving this equation for $\beta$ gives the least squares estimator $\hat{\beta}_s = \sum_{t=1}^{n} y_t / n$.

Note: In this case $\hat{\beta}_l = \hat{\beta}_m = \hat{\beta}_s$; i.e., the method of moments, the maximum likelihood method, and the least squares method all give identical estimators of the mean $\beta$.
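As a quick numerical check of this note (a sketch with simulated data and illustrative names, not part of the notes), all three methods recover the sample mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
y = rng.normal(loc=4.0, scale=1.5, size=200)

beta_m = np.mean(y)                                          # method of moments
beta_s = minimize_scalar(lambda b: np.sum((y - b) ** 2)).x   # least squares
beta_l = np.mean(y)                                          # ML closed form (normal case)

# All three coincide (up to solver tolerance) with the sample mean.
assert np.allclose([beta_m, beta_s, beta_l], np.mean(y))
```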
Properties of Estimators

Again, we consider an estimator $\hat{\theta}(y_1, \ldots, y_n)$ of a $(K \times 1)$ parameter vector $\theta$ based on a sample $(y_1, y_2, \ldots, y_n)$.

Finite Sample Properties (based on a sample of given size n)

Unbiased estimator: An estimator $\hat{\theta}(y_1, \ldots, y_n)$ of $\theta$ is unbiased if $E(\hat{\theta}) = \theta$. If $E(\hat{\theta}) \neq \theta$, the estimator $\hat{\theta}$ is said to be biased, its bias being $E(\hat{\theta}) - \theta \neq 0$.

Efficient estimator: An estimator $\hat{\theta}(y_1, \ldots, y_n)$ of $\theta$ is efficient if it is unbiased and has the smallest possible variance among all unbiased estimators.

Cramér-Rao lower bound: An unbiased estimator $\hat{\theta}$ is efficient if its variance satisfies $V(\hat{\theta}) = -\{E[\partial^2 L(\theta)/\partial \theta^2]\}^{-1} = I(\theta)^{-1}$, where $L(\theta) = \ln[l(\theta \mid y_1, \ldots, y_n)]$ is the log-likelihood function of the sample, $I(\theta) = -E[\partial^2 L(\theta)/\partial \theta^2]$ is a $(K \times K)$ matrix called the information matrix, and $I(\theta)^{-1}$ is called the Cramér-Rao lower bound. Note: this requires knowing the probability distribution of the $y_t$'s.

Best Linear Unbiased Estimator (BLUE): An estimator $\hat{\theta}$ is best linear unbiased if it is linear, i.e. $\hat{\theta} = \sum_{t=1}^{n} a_t y_t$ for some constants $a_t$; if it is unbiased, i.e. $E(\hat{\theta}) = \theta$; and if it has the smallest variance among all linear unbiased estimators. This does not require knowing the probability distribution of the $y_t$'s.

Let $(y_1, y_2, \ldots, y_n)$ be a random sample of size n, where $y_t \sim (\beta, \sigma^2)$, $t = 1, 2, \ldots, n$. By definition of the mean, $E(y_t) = \beta$. Consider the following estimator of the mean $\beta$: $\hat{\beta} = \sum_{t=1}^{n} y_t / n$. We have
$$E(\hat{\beta}) = E\Big(\sum_{t=1}^{n} y_t / n\Big) = \sum_{t=1}^{n} E(y_t)/n = n\beta/n = \beta.$$
It follows that $\hat{\beta} = \sum_{t=1}^{n} y_t/n$ is an unbiased estimator of $\beta$. The variance of $\hat{\beta}$ is
$$V(\hat{\beta}) = V\Big(\sum_{t=1}^{n} y_t/n\Big) = (1/n^2)\sum_{t=1}^{n} V(y_t) = (1/n^2)(n\sigma^2) = \sigma^2/n,$$
using the independence of the $y_t$'s in a random sample and $V(y_t) = \sigma^2$, $t = 1, \ldots, n$. Noting that the estimator $\hat{\beta} = \sum y_t/n$ is linear, it can be shown to have the smallest variance among all linear unbiased estimators. Thus $\hat{\beta}$ is the best linear unbiased estimator (BLUE) of $\beta$.

If we know that $y_t$ is normally distributed, i.e. $y_t \sim N(\beta, \sigma^2)$, then it can be shown that
$$I(\theta) = -E[\partial^2 L/\partial \theta^2] = \begin{bmatrix} n/\sigma^2 & 0 \\ 0 & n/(2\sigma^4) \end{bmatrix}, \qquad \theta = (\beta, \sigma^2).$$
It follows that the Cramér-Rao lower bound is
$$I(\theta)^{-1} = \begin{bmatrix} \sigma^2/n & 0 \\ 0 & 2\sigma^4/n \end{bmatrix}.$$
Since the variance of $\hat{\beta}$, $V(\hat{\beta}) = \sigma^2/n$, is equal to the corresponding Cramér-Rao lower bound, it follows that $\hat{\beta}$ is efficient. This means that, under the normality assumption, $\hat{\beta}$ has the smallest variance among all unbiased estimators (whether linear or not). Since $\hat{\beta} = \hat{\beta}_m = \hat{\beta}_l = \hat{\beta}_s = \sum y_t/n$, the estimator of the mean $\beta$ obtained from the method of moments, the maximum likelihood method, or the least squares method is unbiased, BLUE, and efficient under a normal distribution.
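A Monte Carlo sketch (assumed parameter values, not from the notes) checking the two finite-sample results just derived, $E(\hat{\beta}) = \beta$ and $V(\hat{\beta}) = \sigma^2/n$:

```python
import numpy as np

rng = np.random.default_rng(3)
beta, sigma, n, reps = 2.0, 3.0, 50, 100_000

# Each row is one sample of size n; each sample mean is one draw of beta_hat.
estimates = rng.normal(loc=beta, scale=sigma, size=(reps, n)).mean(axis=1)
print(estimates.mean())   # close to beta = 2.0       (unbiasedness)
print(estimates.var())    # close to sigma^2/n = 0.18 (variance of beta_hat)
```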
Let $(y_1, y_2, \ldots, y_n)$ be a random sample of size n, where $y_t$ is distributed with mean $\beta$ and variance $\sigma^2$, $t = 1, 2, \ldots, n$. Consider the following estimator of the variance $\sigma^2$: $\hat{\sigma}^2 = \sum_{t=1}^{n}(y_t - \hat{\beta})^2/n$, where $\hat{\beta} = \sum_{t=1}^{n} y_t/n$. Since $\hat{\sigma}^2 = \hat{\sigma}^2_m = \hat{\sigma}^2_l$, the estimator $\hat{\sigma}^2$ is identical to both the method of moments estimator and the maximum likelihood estimator (under normality) of $\sigma^2$. Note that
$$E(\hat{\sigma}^2) = E\Big[(1/n)\sum_{t=1}^{n}(y_t - \hat{\beta})^2\Big] = E\Big[(1/n)\sum_{t=1}^{n}\big[(y_t - \beta) - (\hat{\beta} - \beta)\big]^2\Big]$$
$$= E\Big[(1/n)\sum_{t=1}^{n}\big[(y_t - \beta)^2 + (\hat{\beta} - \beta)^2 - 2(y_t - \beta)(\hat{\beta} - \beta)\big]\Big]$$
$$= E\Big[\sum_{t=1}^{n}(y_t - \beta)^2/n + (\hat{\beta} - \beta)^2 - 2(\hat{\beta} - \beta)^2\Big], \quad \text{since } \sum_{t=1}^{n}(y_t - \beta)/n = \hat{\beta} - \beta,$$
$$= E\Big[\sum_{t=1}^{n}(y_t - \beta)^2/n - (\hat{\beta} - \beta)^2\Big] = \sum_{t=1}^{n} E(y_t - \beta)^2/n - E(\hat{\beta} - \beta)^2$$
$$= \sigma^2 - \sigma^2/n, \quad \text{since } \sigma^2 = E(y_t - \beta)^2 \text{ and } E(\hat{\beta} - \beta)^2 = V(\hat{\beta}) = \sigma^2/n,$$
$$= \sigma^2 (n-1)/n.$$
It follows that $E(\hat{\sigma}^2) = [(n-1)/n]\,\sigma^2 < \sigma^2$. This implies that $\hat{\sigma}^2 = \hat{\sigma}^2_m = \hat{\sigma}^2_l$ is a biased estimator of the variance $\sigma^2$: the method of moments and the maximum likelihood method both give a biased estimator of $\sigma^2$. This suggests that an unbiased estimator of the variance $\sigma^2$ is
$$\hat{\sigma}^2_u = \sum_{t=1}^{n}(y_t - \hat{\beta})^2/(n-1), \qquad \hat{\beta} = \sum_{t=1}^{n} y_t/n.$$
The estimator $\hat{\sigma}^2_u$ is indeed unbiased since $\hat{\sigma}^2_u = \hat{\sigma}^2\,[n/(n-1)]$, which implies $E(\hat{\sigma}^2_u) = E(\hat{\sigma}^2)\,[n/(n-1)] = \sigma^2$.
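A Monte Carlo sketch (assumed parameter values) of the bias just derived: dividing by n understates $\sigma^2$ by the factor $(n-1)/n$, while dividing by $n-1$ is unbiased:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2, n, reps = 9.0, 10, 200_000

samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, n))
print(samples.var(axis=1, ddof=0).mean())  # ~ sigma2*(n-1)/n = 8.1 (biased)
print(samples.var(axis=1, ddof=1).mean())  # ~ sigma2 = 9.0        (unbiased)
```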
Asymptotic Properties (as the sample size n becomes large, approaching infinity)

Again, we consider an estimator $\hat{\theta}(y_1, \ldots, y_n)$ of a $(K \times 1)$ parameter vector $\theta$ based on a sample of size n.

Consistent estimator: An estimator $\hat{\theta}$ of $\theta$ is said to be consistent if $\lim_{n\to\infty} P(|\hat{\theta} - \theta| < \varepsilon) = 1$, where $\varepsilon$ is an arbitrarily small positive number. Equivalently, the estimator $\hat{\theta}$ is consistent if it converges in probability to the constant $\theta$, where $\theta$ is said to be the probability limit of $\hat{\theta}$: $\operatorname{plim}_{n\to\infty} \hat{\theta} = \theta$.

Note: Sufficient conditions for $\hat{\theta}$ to be a consistent estimator of $\theta$ are that $\lim_{n\to\infty} E(\hat{\theta}) = \theta$ (i.e., $\hat{\theta}$ is asymptotically unbiased) and $\lim_{n\to\infty} V(\hat{\theta}) = 0$.

Let $(y_1, \ldots, y_n)$ be a sample where $y_t$ has mean $\beta$ and variance $\sigma^2$. Consider the estimator $\hat{\beta} = \sum_{t=1}^{n} y_t/n$ of $\beta$. We have shown that the estimator $\hat{\beta}$ has mean $\beta$ and variance $\sigma^2/n$. Since $\hat{\beta}$ is unbiased (i.e., $E(\hat{\beta}) = \beta$) for any sample size n, it is also asymptotically unbiased (as n becomes large). In addition, its variance $V(\hat{\beta}) = \sigma^2/n$ clearly goes to zero as n becomes large. Thus $\hat{\beta} = \sum y_t/n$ is a consistent estimator of $\beta$.

Central Limit Theorem: Let $(y_1, \ldots, y_n)$ be a random sample where $y_t \sim (\beta, \sigma^2)$, $t = 1, \ldots, n$, and let $\hat{\beta} = \sum_{t=1}^{n} y_t/n$. Then, as $n \to \infty$, $n^{1/2}(\hat{\beta} - \beta)$ converges in distribution to a $N(0, \sigma^2)$ random variable:
$$n^{1/2}(\hat{\beta} - \beta) \xrightarrow{d} N(0, \sigma^2).$$
Implications: This result holds for any distribution of the $y_t$'s. The Central Limit Theorem says that if the sample size n is reasonably large (e.g., n > 30), then $n^{1/2}(\hat{\beta} - \beta)$ is approximately normally distributed with mean 0 and variance $\sigma^2$. Equivalently stated, when n is reasonably large, $\hat{\beta}$ is approximately normally distributed: $\hat{\beta} \approx N(\beta, \sigma^2/n)$. Note that this result is consistent with our earlier results that $\hat{\beta}$ has mean $\beta$ and variance $\sigma^2/n$. What is new here is the asymptotic normality of $\hat{\beta}$ (or of $n^{1/2}(\hat{\beta} - \beta)$) for any distribution of the $y_t$'s; a simulation with skewed data appears below.

Asymptotic Efficiency: An estimator $\hat{\theta}$ of $\theta$ is said to be asymptotically efficient if it is consistent and has the smallest possible asymptotic variance among all consistent estimators. An estimator $\hat{\theta}$ of $\theta$ is asymptotically efficient if it satisfies
$$n^{1/2}(\hat{\theta} - \theta) \xrightarrow{d} N\big(0, \lim_{n\to\infty}[(1/n)I(\theta)]^{-1}\big),$$
where $I(\theta) = -E[\partial^2 L/\partial \theta^2]$ is the information matrix defined above.

Under fairly general conditions, the maximum likelihood estimator $\hat{\theta}_l$ of $\theta$ is: consistent, asymptotically normal, asymptotically unbiased, and asymptotically efficient.
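The following sketch (assumed setup; the exponential distribution is chosen deliberately because it is skewed) illustrates the Central Limit Theorem referenced above: the standardized sample mean $n^{1/2}(\hat{\beta} - \beta)$ is approximately $N(0, \sigma^2)$ even though the $y_t$'s are far from normal:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
beta = 1.0          # mean of Exponential(1); its variance is also 1
n, reps = 100, 50_000

means = rng.exponential(scale=beta, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (means - beta)           # n^(1/2) * (beta_hat - beta)
print(z.mean(), z.var())                  # ~ 0 and ~ sigma^2 = 1
print(stats.kstest(z, "norm").statistic)  # small => close to N(0, 1)
```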
Let $(y_1, \ldots, y_n)$ be a random sample of size n, where $y_t \sim N(\beta, \sigma^2)$, $t = 1, 2, \ldots, n$. The maximum likelihood estimator $\hat{\theta}_l = (\hat{\beta}_l, \hat{\sigma}^2_l)$ of $\theta = (\beta, \sigma^2)$ is $\hat{\beta}_l = \sum_{t=1}^{n} y_t/n$ and $\hat{\sigma}^2_l = (1/n)\sum_{t=1}^{n}(y_t - \hat{\beta}_l)^2$. From the above results, the maximum likelihood estimator $\hat{\theta}_l = (\hat{\beta}_l, \hat{\sigma}^2_l)$ of $\theta$ is consistent, asymptotically normal, and asymptotically efficient. In addition, we have seen that the information matrix is
$$I(\theta) = -E[\partial^2 L/\partial \theta^2] = \begin{bmatrix} n/\sigma^2 & 0 \\ 0 & n/(2\sigma^4) \end{bmatrix}.$$
It follows that the asymptotic distribution of $\hat{\theta}_l = (\hat{\beta}_l, \hat{\sigma}^2_l)$ is
$$n^{1/2}(\hat{\theta}_l - \theta) \xrightarrow{d} N\Big(0, \lim_{n\to\infty}[(1/n)I(\theta)]^{-1}\Big) = N\Big(0, \begin{bmatrix} \sigma^2 & 0 \\ 0 & 2\sigma^4 \end{bmatrix}\Big).$$
This shows that the asymptotic variances of $\hat{\theta}_l$ are $V(\hat{\beta}_l) \approx \sigma^2/n$ (which is identical to the variance derived earlier) and $V(\hat{\sigma}^2_l) \approx 2\sigma^4/n$ (which is a new result).
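A final Monte Carlo sketch (assumed parameter values) checking this new result: under normality, $n \cdot V(\hat{\sigma}^2_l)$ should be close to $2\sigma^4$:

```python
import numpy as np

rng = np.random.default_rng(6)
beta, sigma2, n, reps = 0.0, 4.0, 500, 20_000

samples = rng.normal(loc=beta, scale=np.sqrt(sigma2), size=(reps, n))
sigma2_hat = samples.var(axis=1, ddof=0)  # ML estimator: divide by n
print(n * sigma2_hat.var())               # ~ 2*sigma^4 = 32
```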