Lecture 5: Linear Regressions
In Lecture 2, we introduced stationary linear time series models. In that lecture, we discussed the data generating processes and their characteristics, assuming that we know all parameters (autoregressive or moving average coefficients). However, in empirical studies we have to specify an econometric model, estimate this model, and draw inferences based on the estimates. In this lecture, we provide an introduction to parametric estimation of a linear model with time series observations. Three commonly used estimation methods are least squares estimation (LS), maximum likelihood estimation (MLE), and the generalized method of moments (GMM). In this lecture, we discuss LS and MLE.

1 Least Squares Estimation

Least squares (LS) estimation is one of the first techniques we learn in econometrics. It is both intuitive and easy to implement, and the famous Gauss-Markov theorem tells us that under certain assumptions the ordinary least squares (OLS) estimator is the best linear unbiased estimator (BLUE). We start with a review of classical LS estimation and then consider estimation under relaxed assumptions. Below are our notation for this lecture and the basic algebra of LS estimation.

Consider the regression

$$y_t = x_t'\beta_0 + u_t, \quad t = 1, \ldots, n, \tag{1}$$

where $x_t$ is a $k \times 1$ vector and $\beta_0$, also a $k \times 1$ vector, is the true parameter. Then the OLS estimator of $\beta_0$, denoted $\hat\beta_n$, is

$$\hat\beta_n = \left( \sum_{t=1}^n x_t x_t' \right)^{-1} \sum_{t=1}^n x_t y_t, \tag{2}$$

and the OLS sample residual is $\hat u_t = y_t - x_t'\hat\beta_n$. Sometimes it is more convenient to work in matrix form. Define

$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \qquad X = \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_n' \end{pmatrix}, \qquad U = \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}.$$

Then the regression can be written as

$$Y = X\beta_0 + U, \tag{3}$$

(Copyright by Ling Hu.)
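The basic algebra above can be checked numerically. The following sketch (ours, not part of the original notes; all names and numbers are illustrative, and `numpy` is assumed) verifies that the summation form (2) and the matrix form of the OLS estimator coincide, and that the residuals are orthogonal to the regressors.

```python
import numpy as np

# Build the regression (1)/(3) with a known true beta_0.
rng = np.random.default_rng(0)
n, k = 500, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # rows are x_t'
beta0 = np.array([1.0, -0.5])
U = rng.normal(0.0, 1.0, n)
Y = X @ beta0 + U

# Summation form of equation (2): (sum x_t x_t')^{-1} sum x_t y_t.
S_xx = sum(np.outer(X[t], X[t]) for t in range(n))
S_xy = sum(X[t] * Y[t] for t in range(n))
beta_hat_sum = np.linalg.solve(S_xx, S_xy)

# Matrix form (X'X)^{-1} X'Y.
beta_hat_mat = np.linalg.solve(X.T @ X, X.T @ Y)

# OLS residuals; the normal equations force X' u_hat = 0.
u_hat = Y - X @ beta_hat_mat
```

Both forms are the same estimator, so they agree to machine precision; the orthogonality $X'\hat U = 0$ is exactly the first-order condition of the least squares problem.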
and the OLS estimator can be written as

$$\hat\beta_n = (X'X)^{-1}X'Y. \tag{4}$$

Define $M_X = I_n - X(X'X)^{-1}X'$. It is easy to see that $M_X$ is symmetric, idempotent ($M_X M_X = M_X$), and orthogonal to the columns of $X$. Then we have

$$\hat U = Y - X\hat\beta_n = M_X Y.$$

To derive the distribution of the estimator $\hat\beta_n$, write

$$\hat\beta_n = (X'X)^{-1}X'Y = (X'X)^{-1}X'(X\beta_0 + U) = \beta_0 + (X'X)^{-1}X'U. \tag{5}$$

Therefore, the properties of $\hat\beta_n$ depend on $(X'X)^{-1}X'U$. For example, if $E[(X'X)^{-1}X'U] = 0$, then $\hat\beta_n$ is an unbiased estimator.

1.1 Case 1: OLS with deterministic regressors and i.i.d. Gaussian errors

Assumption 1. (a) $x_t$ is deterministic; (b) $u_t \sim \text{i.i.d.}(0, \sigma^2)$; (c) $u_t \sim \text{i.i.d. } N(0, \sigma^2)$.

Under assumptions (a) and (b), $E(U) = 0$ and $E(UU') = \sigma^2 I_n$. Then from (5) we have

$$E(\hat\beta_n) = \beta_0 + (X'X)^{-1}X'E(U) = \beta_0$$

and

$$E(\hat\beta_n - \beta_0)(\hat\beta_n - \beta_0)' = E[(X'X)^{-1}X'UU'X(X'X)^{-1}] = (X'X)^{-1}X'E(UU')X(X'X)^{-1} = \sigma^2(X'X)^{-1}.$$

Under these assumptions, the Gauss-Markov theorem tells us that the OLS estimator $\hat\beta_n$ is the best linear unbiased estimator of $\beta_0$. The OLS estimator of $\sigma^2$ is

$$s^2 = \hat U'\hat U/(n-k) = U'M_X'M_X U/(n-k) = U'M_X U/(n-k). \tag{6}$$

Since $M_X$ is symmetric, there exists an $n \times n$ matrix $P$ such that $M_X = P\Lambda P'$ and $P'P = I_n$, where $\Lambda$ is an $n \times n$ matrix with the eigenvalues of $M_X$ along the principal diagonal and zeros elsewhere. From the properties of $M_X$ we can compute that $\Lambda$ contains $k$ zeros and $n-k$ ones along its principal diagonal. Then

$$RSS = U'M_X U = U'P\Lambda P'U = (P'U)'\Lambda(P'U) = W'\Lambda W = \sum_{t=1}^n \lambda_t w_t^2,$$
where $W = P'U$. Then $E(WW') = P'E(UU')P = \sigma^2 I_n$; therefore the $w_t$ are uncorrelated with mean 0 and variance $\sigma^2$. Therefore,

$$E(U'M_X U) = \sum_{t=1}^n \lambda_t E(w_t^2) = (n-k)\sigma^2.$$

So the $s^2$ defined in (6) is an unbiased estimator of $\sigma^2$: $E(s^2) = \sigma^2$. With the Gaussian assumption (c), $\hat\beta_n$ is also Gaussian,

$$\hat\beta_n - \beta_0 \sim N(0, \sigma^2(X'X)^{-1}).$$

Note that here $\hat\beta_n$ is exactly normal, while many of the estimators in our later discussions are only asymptotically normal. Actually, under Assumption 1, the OLS estimator is optimal. Also, with the Gaussian assumption, $w_t \sim \text{i.i.d. } N(0, \sigma^2)$, therefore

$$U'M_X U/\sigma^2 \sim \chi^2(n-k).$$

1.2 Case 2: OLS with stochastic regressors and i.i.d. Gaussian errors

The assumption of deterministic regressors is very strong for empirical studies in economics. Some examples of deterministic regressors are constants and deterministic trends (i.e. $x_t = (1, t, t^2, \ldots)'$). However, most data we have for econometric regressions are stochastic. Therefore, from this subsection on we allow the regressors to be stochastic. However, in Case 2 and Case 3 we assume that $x_t$ is independent of the errors at all leads and lags. This is still too strong in time series, as it rules out many processes, including ARMA models.

Assumption 2. (a) $x_t$ is stochastic and independent of $u_s$ for all $t, s$; (b) $u_t \sim \text{i.i.d. } N(0, \sigma^2)$.

This assumption can be equivalently written as $U \mid X \sim N(0, \sigma^2 I_n)$. Under these assumptions, $\hat\beta_n$ is still unbiased:

$$E(\hat\beta_n) = \beta_0 + E[(X'X)^{-1}X']E(U) = \beta_0.$$

Conditional on $X$, $\hat\beta_n$ is normal, $\hat\beta_n \mid X \sim N(\beta_0, \sigma^2(X'X)^{-1})$. To get the unconditional probability distribution of $\hat\beta_n$, we have to integrate this conditional density over $X$; therefore, the unconditional distribution of $\hat\beta_n$ will depend on the distribution of $X$. However, we still have the unconditional distribution for the estimate of the variance,

$$U'M_X U/\sigma^2 \sim \chi^2(n-k).$$

1.3 Case 3: OLS with stochastic regressors and i.i.d. non-Gaussian errors

Compared to Case 2, in this section we let the error terms follow an arbitrary i.i.d. distribution with finite fourth moments.
Since this is an arbitrary unknown distribution, it is very hard to obtain the exact (finite sample) distribution of $\hat\beta_n$; instead, we apply asymptotic theory to this problem.

Assumption 3. (a) $x_t$ is stochastic and independent of $u_s$ for all $t, s$; (b) $u_t \sim \text{i.i.d.}(0, \sigma^2)$, and $E(u_t^4) = \mu_4 < \infty$; (c) $E(x_t x_t') = Q_t$, a positive definite matrix, with $(1/n)\sum_{t=1}^n Q_t \to Q$, a positive definite matrix; (d) $E(x_{it} x_{jt} x_{kt} x_{lt}) < \infty$ for all $i, j, k, l$ and $t$; (e) $(1/n)\sum_{t=1}^n x_t x_t' \to_p Q$.
With assumption (a), we still have that $\hat\beta_n$ is an unbiased estimator of $\beta_0$. Assumptions (c) to (e) are restrictions on $x_t$; basically, we want $(1/n)\sum x_t x_t'$ to converge in probability to the limit of $(1/n)\sum E(x_t x_t')$. We have

$$\hat\beta_n - \beta_0 = \left( \sum_{t=1}^n x_t x_t' \right)^{-1} \sum_{t=1}^n x_t u_t = \left[ (1/n)\sum_{t=1}^n x_t x_t' \right]^{-1} (1/n)\sum_{t=1}^n x_t u_t.$$

From the assumptions and the continuous mapping theorem, we have

$$\left[ (1/n)\sum_{t=1}^n x_t x_t' \right]^{-1} \to_p Q^{-1}.$$

$x_t u_t$ is a martingale difference sequence with finite variance; then by the LLN for mixingales, we have $(1/n)\sum_{t=1}^n x_t u_t \to_p 0$. Therefore $\hat\beta_n \to_p \beta_0$, so $\hat\beta_n$ is a consistent estimator. Next, we derive its distribution. This is the first time we derive the asymptotic distribution of an OLS estimator; the routine for deriving the asymptotic distribution of $\hat\beta_n$ is as follows. First we apply an LLN to the term $\sum x_t x_t'$, after proper norming (so that the limit is a constant), and then apply the continuous mapping theorem to get the limit of $[\sum x_t x_t']^{-1}$; we already obtained this in the proof of consistency above. Then we apply a CLT to the term $\sum x_t u_t$, also after proper norming (so that the limit is nondegenerate). Note that $E(x_t x_t' u_t^2) = \sigma^2 Q_t$ and $(1/n)\sum \sigma^2 Q_t \to \sigma^2 Q$. By the CLT for mds, we have

$$(1/\sqrt{n}) \sum_{t=1}^n x_t u_t \to_d N(0, \sigma^2 Q).$$

Therefore,

$$\sqrt{n}(\hat\beta_n - \beta_0) = \left[ (1/n)\sum_{t=1}^n x_t x_t' \right]^{-1} (1/\sqrt{n})\sum_{t=1}^n x_t u_t \to_d N(0, Q^{-1}(\sigma^2 Q)Q^{-1}) = N(0, \sigma^2 Q^{-1}),$$

so $\hat\beta_n$ approximately follows

$$\hat\beta_n \sim N(\beta_0, \sigma^2 Q^{-1}/n).$$

Note that this distribution is not exact, but approximate, so we should read it as "approximately distributed as normal."
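A small Monte Carlo sketch (ours, with illustrative values and `numpy` assumed) can make the approximation concrete: with a scalar regressor, $Q = E(x_t^2) = 1$ and $\sigma^2 = 1$, so $\sqrt{n}(\hat\beta_n - \beta_0)$ should have mean near 0 and variance near $\sigma^2/Q = 1$ across replications, even with non-Gaussian errors.

```python
import numpy as np

# Monte Carlo check of sqrt(n)(beta_hat - beta_0) ~ N(0, sigma^2 Q^{-1})
# with uniform (non-Gaussian, finite fourth moment) errors.
rng = np.random.default_rng(1)
n, reps = 400, 2000
draws = np.empty(reps)
for r in range(reps):
    x = rng.normal(size=n)                       # Q = E(x_t^2) = 1
    u = rng.uniform(-np.sqrt(3), np.sqrt(3), n)  # i.i.d.(0, 1)
    y = 0.7 * x + u
    b = (x @ y) / (x @ x)                        # scalar OLS
    draws[r] = np.sqrt(n) * (b - 0.7)

mc_mean = draws.mean()   # should be near 0
mc_var = draws.var()     # should be near sigma^2 / Q = 1
```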
To compute this variance, we need to know $\sigma^2$. When it is unknown, the OLS estimator $s^2$ is still consistent under Assumption 3. We have

$$u_t^2 = (y_t - x_t'\beta_0)^2 = \left[ (y_t - x_t'\hat\beta_n) + x_t'(\hat\beta_n - \beta_0) \right]^2 = (y_t - x_t'\hat\beta_n)^2 + 2(y_t - x_t'\hat\beta_n) x_t'(\hat\beta_n - \beta_0) + [x_t'(\hat\beta_n - \beta_0)]^2.$$

By the LLN, $(1/n)\sum u_t^2 \to_p \sigma^2$. There are three terms in the expansion above. For the second term, we have

$$(1/n)\sum_{t=1}^n (y_t - x_t'\hat\beta_n) x_t'(\hat\beta_n - \beta_0) = 0,$$

as $(y_t - x_t'\hat\beta_n)$ is orthogonal to $x_t$. For the third term,

$$(\hat\beta_n - \beta_0)' \left[ (1/n)\sum_{t=1}^n x_t x_t' \right] (\hat\beta_n - \beta_0) \to_p 0,$$

as $\hat\beta_n - \beta_0$ is $o_p(1)$ and $(1/n)\sum x_t x_t' \to_p Q$. Therefore, if we define

$$\hat\sigma_n^2 = (1/n)\sum_{t=1}^n (y_t - x_t'\hat\beta_n)^2,$$

we have

$$\hat\sigma_n^2 = (1/n)\sum_{t=1}^n u_t^2 - (1/n)\sum_{t=1}^n [x_t'(\hat\beta_n - \beta_0)]^2 \to_p \sigma^2.$$

This estimator is only slightly different from $s^2$ ($\hat\sigma_n^2 = (n-k)s^2/n$). Since $(n-k)/n \to 1$ as $n \to \infty$, if $\hat\sigma_n^2$ is consistent, so is $s^2$. Next, to derive the distribution of $\hat\sigma_n^2$,

$$\sqrt{n}(\hat\sigma_n^2 - \sigma^2) = (1/\sqrt{n})\sum_{t=1}^n (u_t^2 - \sigma^2) - \sqrt{n}(\hat\beta_n - \beta_0)' \left[ (1/n)\sum_{t=1}^n x_t x_t' \right] (\hat\beta_n - \beta_0).$$

The second term goes to zero since $(1/n)\sum x_t x_t' \to_p Q$, $\sqrt{n}(\hat\beta_n - \beta_0) = O_p(1)$, and $\hat\beta_n - \beta_0 \to_p 0$. Define $z_t = u_t^2 - \sigma^2$; then $z_t$ is i.i.d. with mean zero and variance $E(u_t^4) - \sigma^4 = \mu_4 - \sigma^4$. Applying the CLT, we have

$$(1/\sqrt{n})\sum_{t=1}^n z_t \to_d N(0, \mu_4 - \sigma^4), \quad \text{therefore} \quad \sqrt{n}(\hat\sigma_n^2 - \sigma^2) \to_d N(0, \mu_4 - \sigma^4).$$

The same limit distribution applies for $s^2$, since the difference between $\hat\sigma_n^2$ and $s^2$ is $o_p(n^{-1/2})$.
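The relation $\hat\sigma_n^2 = (n-k)s^2/n$ and the consistency of both estimators can be sketched numerically (ours, illustrative values, `numpy` assumed), again with a non-Gaussian error satisfying the finite fourth moment condition:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 50_000, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta0 = np.array([0.3, -1.0])
sigma2 = 4.0
# Uniform(-a, a) has variance a^2/3, so a = sqrt(3 * sigma2) gives variance 4.
U = rng.uniform(-np.sqrt(3 * sigma2), np.sqrt(3 * sigma2), n)
Y = X @ beta0 + U

beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
resid = Y - X @ beta_hat
sigma2_hat = resid @ resid / n    # the estimator sigma_hat_n^2
s2 = resid @ resid / (n - k)      # the degrees-of-freedom corrected s^2
```

With $n = 50{,}000$ both estimates sit close to the true $\sigma^2 = 4$, and their difference is of order $k/n$, negligible at this sample size.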
1.4 Case 4: OLS estimation in an autoregression with i.i.d. errors

In an autoregression, say $x_t = \phi_0 x_{t-1} + \epsilon_t$ where $\epsilon_t$ is i.i.d., the regressors are no longer independent of the errors at all leads and lags. In this case, the OLS estimator of $\phi_0$ is biased. However, we will show that under Assumption 4 the estimator is consistent.

Assumption 4. The regression model is

$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \epsilon_t,$$

with the roots of $1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p = 0$ outside the unit circle (so $y_t$ is stationary), and with $\epsilon_t$ i.i.d. with mean zero, variance $\sigma^2$, and finite fourth moment $\mu_4$.

Hamilton presents the general AR(p) case with a constant. We will use an AR(2) as an example,

$$y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \epsilon_t.$$

Let $x_t = (y_{t-1}, y_{t-2})'$, $u_t = \epsilon_t$, and $y_t = x_t'\beta_0 + u_t$ (so $\beta_0 = (\phi_1, \phi_2)'$). Then

$$\sqrt{n}(\hat\beta_n - \beta_0) = \left[ (1/n)\sum x_t x_t' \right]^{-1} (1/\sqrt{n})\sum x_t u_t. \tag{7}$$

The first term contains

$$(1/n)\sum x_t x_t' = (1/n) \begin{pmatrix} \sum y_{t-1}^2 & \sum y_{t-1} y_{t-2} \\ \sum y_{t-1} y_{t-2} & \sum y_{t-2}^2 \end{pmatrix}.$$

In this matrix, the diagonal terms $(1/n)\sum y_{t-j}^2$ converge to $\gamma_0$, and the remaining (off-diagonal) terms converge to $\gamma_1$. Therefore,

$$(1/n)\sum x_t x_t' \to_p Q = \begin{pmatrix} \gamma_0 & \gamma_1 \\ \gamma_1 & \gamma_0 \end{pmatrix}.$$

Applying the CLT for mds to the second term in (7),

$$(1/\sqrt{n})\sum x_t u_t \to_d N(0, \sigma^2 Q),$$

therefore

$$\sqrt{n}(\hat\beta_n - \beta_0) \to_d N(0, \sigma^2 Q^{-1}).$$

So far we have considered four cases of OLS regressions. The common assumption in all four cases is i.i.d. errors. From the next section on, we consider cases where the errors are not i.i.d.

1.5 OLS with non-i.i.d. errors

When the error $u_t$ is i.i.d., the variance-covariance matrix is $V = E(UU') = \sigma^2 I_n$. If $V$ is still diagonal but the elements are not all equal, for example when the errors on some dates display larger variance than on others, then the errors are said to exhibit heteroskedasticity. If $V$ is non-diagonal, then the errors are said to be autocorrelated; for example, if $u_t = \epsilon_t - \phi\epsilon_{t-1}$ where $\epsilon_t$ is i.i.d., then $u_t$ is a serially correlated error. Case 5 in Hamilton assumes
Assumption 5. (a) $x_t$ is stochastic; (b) conditional on the full matrix $X$, the vector $U \sim N(0, \sigma^2 V)$; (c) $V$ is a known positive definite matrix.

Under these assumptions, the exact distribution of $\hat\beta_n$ can be derived. However, this is a very strong assumption, and it rules out autoregressions. Also, the assumption that $V$ is known rarely holds in applications. Case 6 in Hamilton assumes uncorrelated but heteroskedastic errors with an unknown covariance matrix. Under Assumption 6, the OLS estimator is still consistent and asymptotically normal.

Assumption 6. (a) $x_t$ stochastic, including perhaps lagged values of $y$; (b) $x_t u_t$ is a martingale difference sequence; (c) $E(u_t^2 x_t x_t') = \Omega_t$, a positive definite matrix, with $(1/n)\sum \Omega_t \to \Omega$ and $(1/n)\sum u_t^2 x_t x_t' \to_p \Omega$; (d) $E(u_t^4 x_{it} x_{jt} x_{lt} x_{kt}) < \infty$ for all $i, j, k, l$ and $t$; (e) the plims of $(1/n)\sum u_t x_{it} x_t x_t'$ and $(1/n)\sum x_{it} x_{jt} x_t x_t'$ exist and are finite for all $i, j$, and $(1/n)\sum x_t x_t' \to_p Q$, a nonsingular matrix.

Again, write the OLS estimator as

$$\sqrt{n}(\hat\beta_n - \beta_0) = \left[ (1/n)\sum x_t x_t' \right]^{-1} (1/\sqrt{n})\sum x_t u_t.$$

Assumption 6(e) ensures that $(1/n)\sum x_t x_t' \to_p Q$. Applying the CLT for mds,

$$(1/\sqrt{n})\sum x_t u_t \to_d N(0, \Omega),$$

therefore

$$\sqrt{n}(\hat\beta_n - \beta_0) \to_d N(0, Q^{-1}\Omega Q^{-1}).$$

However, both $Q$ and $\Omega$ are not observable, and we need to find consistent estimates for them. White proposes the following estimators: $\hat Q = (1/n)\sum x_t x_t'$ and $\hat\Omega = (1/n)\sum \hat u_t^2 x_t x_t'$, where $\hat u_t$ is the OLS residual $y_t - x_t'\hat\beta_n$.

Proposition 1. With heteroskedasticity of unknown form satisfying Assumption 6, the asymptotic variance-covariance matrix of the OLS coefficient vector can be consistently estimated by

$$\hat Q^{-1}\hat\Omega\hat Q^{-1} \to_p Q^{-1}\Omega Q^{-1}. \tag{8}$$

Proof: Assumption 6(e) ensures $\hat Q \to_p Q$, and Assumption 6(c) ensures that $\tilde\Omega \equiv (1/n)\sum u_t^2 x_t x_t' \to_p \Omega$. So to prove (8), we only need to show that

$$\hat\Omega - \tilde\Omega = (1/n)\sum_{t=1}^n (\hat u_t^2 - u_t^2) x_t x_t' \to_p 0.$$
The trick here is to make use of the known fact that $\hat\beta_n - \beta_0 \to_p 0$. If we can write $\hat\Omega - \tilde\Omega$ (where $\tilde\Omega \equiv (1/n)\sum u_t^2 x_t x_t'$) as a sum of products of $\hat\beta_n - \beta_0$ and terms that are bounded, then $\hat\Omega - \tilde\Omega \to_p 0$. Now

$$\hat u_t^2 - u_t^2 = (\hat u_t + u_t)(\hat u_t - u_t) = \left[ 2(y_t - \beta_0'x_t) - (\hat\beta_n - \beta_0)'x_t \right] \left[ -(\hat\beta_n - \beta_0)'x_t \right] = -2u_t (\hat\beta_n - \beta_0)'x_t + [(\hat\beta_n - \beta_0)'x_t]^2.$$

Then

$$\hat\Omega - \tilde\Omega = (-2/n)\sum_{t=1}^n u_t (\hat\beta_n - \beta_0)'x_t \, (x_t x_t') + (1/n)\sum_{t=1}^n [(\hat\beta_n - \beta_0)'x_t]^2 (x_t x_t').$$

Write the first term as

$$(-2/n)\sum_{t=1}^n u_t (\hat\beta_n - \beta_0)'x_t \, (x_t x_t') = -2\sum_{i=1}^k (\hat\beta_{in} - \beta_{i0}) \left[ (1/n)\sum_{t=1}^n u_t x_{it} (x_t x_t') \right].$$

The term in brackets has a finite plim by Assumption 6(e), and we have $\hat\beta_{in} - \beta_{i0} \to_p 0$ for each $i$, so this term converges to zero. (If this looks messy, take $k = 1$; then you can simply move $(\hat\beta_n - \beta_0)$ out of the summation: $\hat\beta_n - \beta_0 \to_p 0$ and the sum has a finite plim, so the product goes to zero.) Similarly, for the second term,

$$(1/n)\sum_{t=1}^n [(\hat\beta_n - \beta_0)'x_t]^2 (x_t x_t') = \sum_{i=1}^k \sum_{j=1}^k (\hat\beta_{in} - \beta_{i0})(\hat\beta_{jn} - \beta_{j0}) \left[ (1/n)\sum_{t=1}^n x_{it} x_{jt} (x_t x_t') \right] \to_p 0,$$

as the terms in brackets have finite plims. Therefore $\hat\Omega - \tilde\Omega \to_p 0$. $\square$

Define $\hat V = \hat Q^{-1}\hat\Omega\hat Q^{-1}$; then $\hat\beta_n \approx N(\beta_0, \hat V/n)$, and $\hat V/n$ is a heteroskedasticity-consistent estimate of the variance-covariance matrix. Newey and West propose the following estimator of the variance-covariance matrix, which is heteroskedasticity and autocorrelation consistent (HAC):

$$\hat V/n = (X'X)^{-1} \left[ \sum_{t=1}^n \hat u_t^2 x_t x_t' + \sum_{k=1}^q \left( 1 - \frac{k}{q+1} \right) \sum_{t=k+1}^n (x_t \hat u_t \hat u_{t-k} x_{t-k}' + x_{t-k} \hat u_{t-k} \hat u_t x_t') \right] (X'X)^{-1}.$$

1.6 Generalized least squares

Generalized least squares (GLS) and feasible generalized least squares (FGLS) are preferred in least squares estimation when the errors are heteroskedastic and/or autocorrelated. Let $x_t$ be stochastic and $U \mid X \sim N(0, \sigma^2 V)$, where $V$ is known (Assumption 5). Since $V$ is symmetric and positive definite, there exists a matrix $L$ such that $V^{-1} = L'L$. Premultiplying our regression by $L$ gives

$$LY = LX\beta_0 + LU.$$
Then the new error $\tilde U = LU$ is i.i.d. conditional on $X$:

$$E(\tilde U \tilde U' \mid X) = L\,E(UU' \mid X)\,L' = \sigma^2 LVL' = \sigma^2 I_n.$$

Then the estimator

$$\tilde\beta = (X'L'LX)^{-1}X'L'LY = (X'V^{-1}X)^{-1}X'V^{-1}Y$$

is known as the generalized least squares estimator. However, as we remarked earlier, in applications $V$ is rarely known, and we have to estimate it. The GLS estimator obtained using an estimated $V$ is known as the feasible GLS (FGLS) estimator. Usually, FGLS requires that we specify a parametric model for the error. For example, let the error $u_t$ follow an AR(1) process, $u_t = \rho_0 u_{t-1} + \epsilon_t$, where $\epsilon_t \sim \text{i.i.d.}(0, \sigma^2)$. In this case, we can run OLS first and obtain the OLS residuals $\hat u_t$, and then run an OLS estimation of $\rho$ using the $\hat u_t$. This estimator, denoted $\hat\rho$, is a consistent estimator of $\rho_0$. To show this, write

$$\hat u_t = y_t - x_t'\hat\beta_n = u_t + (\beta_0 - \hat\beta_n)'x_t.$$

Then

$$(1/n)\sum \hat u_t \hat u_{t-1} = (1/n)\sum \left[ u_t + (\beta_0 - \hat\beta_n)'x_t \right] \left[ u_{t-1} + (\beta_0 - \hat\beta_n)'x_{t-1} \right]$$
$$= (1/n)\sum u_t u_{t-1} + (\beta_0 - \hat\beta_n)'(1/n)\sum (u_t x_{t-1} + u_{t-1} x_t) + (\beta_0 - \hat\beta_n)' \left[ (1/n)\sum x_t x_{t-1}' \right] (\beta_0 - \hat\beta_n)$$
$$= (1/n)\sum u_t u_{t-1} + o_p(1) = (1/n)\sum (\epsilon_t + \rho_0 u_{t-1}) u_{t-1} + o_p(1) \to_p \rho_0 \text{Var}(u_t).$$

Similarly, we can show that $(1/n)\sum \hat u_{t-1}^2 \to_p \text{Var}(u_t)$; hence $\hat\rho \to_p \rho_0$. Still using similar methods, we can show that $(1/\sqrt{n})\sum \hat u_t \hat u_{t-1} = (1/\sqrt{n})\sum u_t u_{t-1} + o_p(1)$; hence

$$\sqrt{n}(\hat\rho - \rho_0) \to_d N(0, 1 - \rho_0^2).$$

Finally, the FGLS estimator of $\beta_0$ based on $V(\hat\rho)$ has the same limit distribution as the GLS estimator based on $V(\rho_0)$ (see Hamilton).
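The two-step FGLS procedure for AR(1) errors can be sketched as follows (ours, with illustrative values and `numpy` assumed): run OLS, estimate $\rho$ from the residuals, then quasi-difference the data and rerun OLS (a Cochrane-Orcutt-style implementation that drops the first observation rather than weighting it).

```python
import numpy as np

rng = np.random.default_rng(3)
n, rho0 = 10_000, 0.8
x = rng.normal(size=n)
eps = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):                 # AR(1) errors: u_t = rho0 u_{t-1} + eps_t
    u[t] = rho0 * u[t - 1] + eps[t]
X = np.column_stack([np.ones(n), x])
beta0 = np.array([0.5, 1.5])
Y = X @ beta0 + u

# Step 1: OLS, then estimate rho from the OLS residuals.
b_ols = np.linalg.lstsq(X, Y, rcond=None)[0]
e = Y - X @ b_ols
rho_hat = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])

# Step 2: quasi-difference and rerun OLS on the transformed data.
Xs = X[1:] - rho_hat * X[:-1]
Ys = Y[1:] - rho_hat * Y[:-1]
b_fgls = np.linalg.lstsq(Xs, Ys, rcond=None)[0]
```

After the transformation, the errors of the transformed regression are (approximately) i.i.d., so OLS on $(Y^*, X^*)$ is the FGLS estimator.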
1.7 Statistical inference with LS estimation

Some commonly used test statistics for LS estimators are $t$ statistics and $F$ statistics. A $t$ statistic is used to test a hypothesis about a single parameter, say $\beta_i = c$. For simplicity, we assume that $c = 0$, so we use the $t$ statistic to test whether a variable is significant. The $t$ statistic is defined as the ratio $\hat\beta_i/\text{sd}(\hat\beta_i)$. Let the estimate of the variance of $\hat\beta$ be denoted $s^2\hat W$; then the standard deviation of $\hat\beta_i$ is the product of $s$ and the square root of the $i$th element on the diagonal of $\hat W$, i.e.,

$$t = \frac{\hat\beta_i}{\sqrt{s^2 \hat w_{ii}}}. \tag{9}$$

Recall that if $X/\sigma \sim N(0,1)$, $Y^2/\sigma^2 \sim \chi^2(m)$, and $X$ and $Y$ are independent, then $t = X\sqrt{m}/Y$ follows an exact Student $t$ distribution with $m$ degrees of freedom.

An $F$ statistic is used to test a hypothesis of $m$ different linear restrictions on $\beta$, say $H_0: R\beta = r$, where $R$ is an $m \times k$ matrix. The $F$ statistic is then defined as

$$F = (R\hat\beta - r)' \left[ \text{Var}(R\hat\beta - r) \right]^{-1} (R\hat\beta - r). \tag{10}$$

This is a Wald statistic. To derive the distribution of the statistic, we need the following result.

Proposition 2. If a $k \times 1$ vector $X \sim N(\mu, \Sigma)$, then $(X - \mu)'\Sigma^{-1}(X - \mu) \sim \chi^2(k)$.

Also recall that an exact $F(m, n)$ distribution is defined by

$$F(m, n) = \frac{\chi^2(m)/m}{\chi^2(n)/n}.$$

Under Assumption 1, $\hat W = (X'X)^{-1}$, and under the null hypothesis $\hat\beta_i \sim N(0, \sigma^2 w_{ii})$. We can then write

$$t = \frac{\hat\beta_i/\sqrt{\sigma^2 w_{ii}}}{\sqrt{s^2/\sigma^2}}.$$

Since the numerator is $N(0,1)$, the denominator is the square root of a $\chi^2(n-k)$ variable divided by $n-k$ (since $RSS/\sigma^2 \sim \chi^2(n-k)$), and the numerator and denominator are independent, the $t$ statistic (9) under Assumption 1 follows an exact $t(n-k)$ distribution.

With Assumption 1 and under the null hypothesis, we have $R\hat\beta - r \sim N(0, \sigma^2 R(X'X)^{-1}R')$; then by Proposition 2, the statistic

$$W_0 = (R\hat\beta - r)' \left[ \sigma^2 R(X'X)^{-1}R' \right]^{-1} (R\hat\beta - r) \sim \chi^2(m)$$

under $H_0$. If we replace $\sigma^2$ with $s^2$ and divide by the number of restrictions $m$, we get the OLS $F$ test of a linear hypothesis:

$$F = (R\hat\beta - r)' \left[ s^2 R(X'X)^{-1}R' \right]^{-1} (R\hat\beta - r)/m = \frac{W_0/m}{(RSS/\sigma^2)/(n-k)},$$
so $F$ follows an exact $F(m, n-k)$ distribution. An alternative way to express the $F$ statistic is to compute the estimator without the restriction, $\hat\beta$, and its associated sum of squared residuals $RSS_u$, and the estimator with the restriction, $\tilde\beta$, and its associated sum of squared residuals $RSS_r$; then we can write

$$F = \frac{(RSS_r - RSS_u)/m}{RSS_u/(n-k)}.$$

Now, with Assumption 2, $X$ is stochastic, $\hat\beta$ is normal conditional on $X$, and $RSS/\sigma^2 \sim \chi^2(n-k)$ conditional on $X$. This conditional distribution of $RSS$ is the same for all $X$; therefore, the unconditional distribution of $RSS$ is the same as the conditional distribution. The same is true for the $t$ and $F$ statistics. Therefore, we have the same results under Assumption 2 as under Assumption 1.

From Case 3 on, we no longer have an exact distribution for the estimator, and we have to derive its asymptotic distribution, so we also use asymptotic distributions for the test statistics. Write

$$t = \frac{\hat\beta_i}{s\sqrt{\hat w_{ii}}} = \frac{\sqrt{n}\,\hat\beta_i}{s\sqrt{n\hat w_{ii}}},$$

where $\hat w_{ii}$ is the $i$th diagonal element of $(X'X)^{-1}$, so that $n\hat w_{ii} \to_p q^{ii}$, the $i$th diagonal element of $\hat\beta$'s asymptotic variance $Q^{-1}$. Under the null, $\sqrt{n}\,\hat\beta_i \to_d N(0, \sigma^2 q^{ii})$. Recall that under Assumption 3, $s \to_p \sigma$; therefore $t \to_d N(0,1)$. Next, write

$$F = (R\hat\beta - r)' \left[ s^2 R(X'X)^{-1}R' \right]^{-1} (R\hat\beta - r)/m = \sqrt{n}(R\hat\beta - r)' \left[ s^2 R(X'X/n)^{-1}R' \right]^{-1} \sqrt{n}(R\hat\beta - r)/m.$$

Now $s^2 \to_p \sigma^2$, $X'X/n \to_p Q$, and under the null,

$$\sqrt{n}(R\hat\beta - r) = \sqrt{n}\,R(\hat\beta - \beta_0) \to_d N(0, \sigma^2 RQ^{-1}R').$$

Then by Proposition 2, we have $mF \to_d \chi^2(m)$.

We can use similar methods to derive the distributions in the other cases. In general, if $\hat\beta \to_p \beta_0$ and is asymptotically normal, $s^2 \to_p \sigma^2$, and we have found a consistent estimate of the variance of $\hat\beta$, then the $t$ and $F$ statistics follow asymptotically normal and $\chi^2(m)$ distributions, respectively. Actually, under Assumption 1 or 2, when the sample size is large, we can also use the normal and $\chi^2$ distributions to approximate the exact $t$ and $F$ distributions. Further, since we are using the asymptotic distribution, the Wald test can also be used to test nonlinear restrictions.
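The two expressions for the $F$ statistic, the Wald form in (10) (with $s^2$) and the restricted-versus-unrestricted $RSS$ form, are algebraically identical for linear restrictions, which can be checked numerically (a sketch with illustrative values; `numpy` assumed):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta0 = np.array([1.0, 0.0, 0.0])   # the last two coefficients are truly zero
Y = X @ beta0 + rng.normal(size=n)

b = np.linalg.lstsq(X, Y, rcond=None)[0]
resid = Y - X @ b
RSS_u = resid @ resid
s2 = RSS_u / (n - k)

# H0: beta_2 = beta_3 = 0, i.e. R b = r with m = 2 restrictions.
R = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
r = np.zeros(2)
m = 2

# Wald form of the F statistic (equation (10) with s^2, divided by m).
XtX_inv = np.linalg.inv(X.T @ X)
d = R @ b - r
F_wald = d @ np.linalg.inv(s2 * R @ XtX_inv @ R.T) @ d / m

# RSS form: restricted regression here is intercept-only.
RSS_r = ((Y - Y.mean()) ** 2).sum()
F_rss = ((RSS_r - RSS_u) / m) / (RSS_u / (n - k))
```

The two computations agree to machine precision; under $H_0$ either one follows an exact $F(m, n-k)$ distribution when the errors are Gaussian.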
2 Maximum Likelihood Estimation

2.1 Review: the maximum likelihood principle and the Cramér-Rao lower bound

The basic idea of the maximum likelihood principle is to choose the parameter estimates that maximize the probability of obtaining the observed sample. Suppose that we observe a sample $X = (x_1, x_2, \ldots, x_n)$ and assume that the sample is drawn from an i.i.d. distribution whose associated parameters are denoted $\theta$. Let $p(x_t; \theta)$ denote the pdf of the $t$th observation. For example, when $x_t \sim \text{i.i.d. } N(\mu, \sigma^2)$, then $\theta = (\mu, \sigma^2)$ and

$$p(x_t; \theta) = (2\pi\sigma^2)^{-1/2} \exp\left[ -\frac{(x_t - \mu)^2}{2\sigma^2} \right].$$

The likelihood function for the whole sample $X$ is

$$L(X; \theta) = \prod_{t=1}^n p(x_t; \theta),$$

and the log likelihood function is

$$l(X; \theta) = \sum_{t=1}^n \log p(x_t; \theta).$$

The maximum likelihood estimates of $\theta$ are chosen so that $l(X; \theta)$ is maximized.

Define the score function $S(\theta) = \partial l(\theta)/\partial\theta$ and the Hessian matrix $H(\theta) = \partial^2 l(\theta)/\partial\theta\partial\theta'$. The famous Cramér-Rao inequality tells us that the lower bound for the variance of an unbiased estimator of $\theta$ is the inverse of the information matrix $I(\theta_0) = E[S(\theta_0)S(\theta_0)']$, where $\theta_0$ denotes the true value of the parameter. An estimator whose variance equals this bound is known as efficient. Under some regularity conditions, which are satisfied for the Gaussian density, we have the equality

$$I(\theta) = -E[H(\theta)] = -E\left[ \frac{\partial^2 l(\theta)}{\partial\theta\partial\theta'} \right].$$

So, if we find an unbiased estimator whose variance achieves the Cramér-Rao lower bound, then we know that this estimator is efficient: no other unbiased estimator (linear or nonlinear) could have a smaller variance. However, this lower bound is not always achievable. If an estimator does achieve this bound, then it is identical to the MLE. Note that the Cramér-Rao inequality holds for unbiased estimators, while ML estimators are sometimes biased. If an estimator is biased but consistent, and its variance approaches the Cramér-Rao bound asymptotically, then it is known as asymptotically efficient.

Example 1 (MLE for an i.i.d.
Gaussian distribution). Let $x_t \sim \text{i.i.d. } N(\mu, \sigma^2)$, so the parameter is $\theta = (\mu, \sigma^2)$. Then we have

$$p(x_t; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x_t - \mu)^2}{2\sigma^2} \right\},$$

$$l(X; \theta) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{t=1}^n (x_t - \mu)^2,$$
$$S(X; \mu) = \frac{\partial l(X; \theta)}{\partial\mu} = \frac{1}{\sigma^2}\sum_{t=1}^n (x_t - \mu), \qquad S(X; \sigma^2) = \frac{\partial l(X; \theta)}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{t=1}^n (x_t - \mu)^2.$$

Setting the score functions to zero, we find the MLEs of $\theta$:

$$\hat\mu = \bar X_n, \qquad \hat\sigma^2 = \frac{1}{n}\sum_{t=1}^n (x_t - \hat\mu)^2.$$

It is easy to verify that $E(\hat\mu) = E(\bar X_n) = \mu$, so $\hat\mu$ is unbiased, and its variance is $\text{Var}(\hat\mu) = \sigma^2/n$, while

$$E(\hat\sigma^2) = \frac{1}{n}\sum_{t=1}^n E(x_t - \hat\mu)^2 = E\left[ (x_t - \mu) + (\mu - \hat\mu) \right]^2 = \sigma^2 - \frac{2}{n}\sigma^2 + \frac{1}{n}\sigma^2 = \frac{n-1}{n}\sigma^2,$$

so $\hat\sigma^2$ is biased, but it is consistent, as $\hat\sigma^2 \to \sigma^2$ when $n \to \infty$. Define $s^2 = \frac{1}{n-1}\sum_{t=1}^n (x_t - \hat\mu)^2$; then $E(s^2) = \sigma^2$ and $\text{Var}(s^2) = 2\sigma^4/(n-1)$.

We can further compute the Hessian matrix,

$$H(X; \theta) = \begin{pmatrix} \dfrac{\partial^2 l}{\partial\mu^2} & \dfrac{\partial^2 l}{\partial\mu\partial\sigma^2} \\ \dfrac{\partial^2 l}{\partial\sigma^2\partial\mu} & \dfrac{\partial^2 l}{\partial(\sigma^2)^2} \end{pmatrix},$$

where

$$\frac{\partial^2 l}{\partial\mu^2} = -\frac{n}{\sigma^2}, \qquad \frac{\partial^2 l}{\partial\mu\partial\sigma^2} = \frac{\partial^2 l}{\partial\sigma^2\partial\mu} = -\frac{1}{\sigma^4}\sum_{t=1}^n (x_t - \mu), \qquad \frac{\partial^2 l}{\partial(\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{1}{\sigma^6}\sum_{t=1}^n (x_t - \mu)^2.$$

We can also compute that $|H(X; \theta)|$ evaluated at $\theta = \hat\theta$ equals $n^2/(2\hat\sigma^6) > 0$, which (together with the negative diagonal entries) shows that we have found a maximum (not a minimum) of the likelihood function. Next, compute the information matrix, using $E_\theta(x_t - \mu) = 0$ and $E_\theta(x_t - \mu)^2 = \sigma^2$;
therefore the information matrix is

$$I(\theta) = -E[H(X; \theta)] = \begin{pmatrix} \dfrac{n}{\sigma^2} & 0 \\ 0 & \dfrac{n}{2\sigma^4} \end{pmatrix}.$$

So the MLE of $\mu$ has achieved the Cramér-Rao lower bound of the variance, $\sigma^2/n$. Although $s^2$ does not achieve the lower bound, it turns out that it is still the unbiased estimator of $\sigma^2$ with minimum variance.

2.2 Asymptotic Normality of MLE

There are a few regularity conditions to ensure that the MLE is consistent. First, we assume that the data are strictly stationary and ergodic (for example, i.i.d.). Second, we assume that the parameter space $\Theta$ is convex and that neither the estimate $\hat\theta$ nor the true parameter $\theta_0$ lies on the boundary of $\Theta$. Third, we require that the likelihood function evaluated at any $\hat\theta \neq \theta_0$ in $\Theta$ differs from its value at $\theta_0$; this is known as the identification condition. Finally, we assume that $E\sup_{\theta\in\Theta} |l(X; \theta)| < \infty$. With all these conditions satisfied, the MLE is consistent: $\hat\theta \to_p \theta_0$.

Next, we discuss asymptotic results for the score function $S(X; \theta)$, the Hessian matrix $H(X; \theta)$, and the asymptotic distribution of the MLE $\hat\theta$. First, we want to show that $E[S(X; \theta_0)] = 0$ and $E[S(X; \theta_0)S(X; \theta_0)'] = -E[H(X; \theta_0)]$. Let the integral operator $\int \cdot \, dX$ denote integration over $(x_1, x_2, \ldots, x_n)$; then we have

$$\int L(X; \theta_0)\, dX = 1.$$

Taking the derivative with respect to $\theta$, we have

$$\int \frac{\partial L(X; \theta_0)}{\partial\theta}\, dX = 0.$$

Meanwhile, we can write

$$\int \frac{\partial L(X; \theta_0)}{\partial\theta}\, dX = \int \frac{1}{L(X; \theta_0)} \frac{\partial L(X; \theta_0)}{\partial\theta} L(X; \theta_0)\, dX = \int \frac{\partial l(X; \theta_0)}{\partial\theta} L(X; \theta_0)\, dX = E[S(X; \theta_0)].$$

So we know that $E[S(X; \theta_0)] = 0$. Next, differentiating the integral above (which equals zero) with respect to $\theta'$, we get

$$\int \frac{\partial l(X; \theta_0)}{\partial\theta} \frac{\partial L(X; \theta_0)}{\partial\theta'}\, dX + \int \frac{\partial^2 l(X; \theta_0)}{\partial\theta\partial\theta'} L(X; \theta_0)\, dX = 0.$$

The second term is just $E[H(X; \theta_0)]$. The first can be written as

$$\int \frac{\partial l(X; \theta_0)}{\partial\theta} \left( \frac{1}{L(X; \theta_0)} \frac{\partial L(X; \theta_0)}{\partial\theta'} \right) L(X; \theta_0)\, dX = \int \frac{\partial l(X; \theta_0)}{\partial\theta} \frac{\partial l(X; \theta_0)}{\partial\theta'} L(X; \theta_0)\, dX = E[S(X; \theta_0)S(X; \theta_0)'].$$
Now, since $E[S(X; \theta_0)S(X; \theta_0)'] + E[H(X; \theta_0)] = 0$, we have that $E[S(X; \theta_0)S(X; \theta_0)'] = -E[H(X; \theta_0)]$.

Next, define $s(x_t; \theta) = \partial\log p(x_t; \theta)/\partial\theta$; then we can write the score function as the sum of the $s(x_t; \theta)$, i.e., $S(X; \theta) = \sum_{t=1}^n s(x_t; \theta)$. The $s(x_t; \theta)$ are i.i.d., and we can show that $E[s(x_t; \theta_0)] = 0$ and $E[s(x_t; \theta_0)s(x_t; \theta_0)'] = -E[H(x_t; \theta_0)]$, where $H(x_t; \theta) = \partial^2\log p(x_t; \theta)/\partial\theta\partial\theta'$. Applying the Lindeberg-Lévy CLT, we obtain the asymptotic normality of the score function:

$$n^{-1/2} S(X; \theta_0) \to_d N(0, -E[H(x_t; \theta_0)]).$$

Next, we consider the properties of the Hessian matrix. First, we assume that $E[H(x_t; \theta_0)]$ is nonsingular. Let $N_\epsilon$ be a neighborhood of $\theta_0$, and assume that

$$E\sup_{\theta\in N_\epsilon} \|H(x_t; \theta)\| < \infty, \qquad (1/n)\sum_{t=1}^n H(x_t; \bar\theta_n) \to_p E[H(x_t; \theta_0)],$$

where $\bar\theta_n$ is any consistent estimator of $\theta_0$. Applying the LLN, we have

$$(1/n)H(X; \theta_0) = (1/n)\sum_{t=1}^n H(x_t; \theta_0) \to_p E[H(x_t; \theta_0)] \equiv -\Sigma.$$

With the notation $\Sigma$, we can write $n^{-1/2}S(X; \theta_0) \to_d N(0, \Sigma)$.

Proposition 3 (Asymptotic normality of the MLE). With all the conditions we have outlined above,

$$\sqrt{n}(\hat\theta - \theta_0) \to_d N(0, \Sigma^{-1}).$$

Proof: Do a Taylor expansion of $S(X; \hat\theta)$ around $\theta_0$,

$$0 = S(X; \hat\theta) \approx S(X; \theta_0) + H(X; \theta_0)(\hat\theta - \theta_0).$$

Therefore, we have

$$\sqrt{n}(\hat\theta - \theta_0) = -\left[ (1/n)H(X; \theta_0) \right]^{-1} n^{-1/2}S(X; \theta_0) \to_d N(0, \Sigma^{-1}\Sigma\Sigma^{-1}) = N(0, \Sigma^{-1}). \quad \square$$

Note that $\Sigma = -E[H(x_t; \theta_0)]$ is the information matrix for one observation, so the asymptotic distribution of $\hat\theta$ can be written as $\hat\theta \approx N(\theta_0, I(\theta_0)^{-1})$. However, $I(\theta_0)$ depends on $\theta_0$, which is unknown, so we need to find a consistent estimator for it, denoted $\hat V$. There are two methods to compute this variance matrix of $\hat\theta$. One way is that
we compute the Hessian matrix and evaluate it at $\theta = \hat\theta$, i.e., $\hat V = -(1/n)H(X; \hat\theta)$. The second way is to use the outer product estimate, which is

$$\hat V = (1/n)\sum_{t=1}^n s(x_t; \hat\theta)s(x_t; \hat\theta)'.$$

2.3 Statistical Inference for MLE

There are three asymptotically equivalent tests for MLE: the likelihood ratio (LR) test, the Wald test, and the Lagrange multiplier (LM) test or score test. You can probably find discussions of these three tests in any graduate textbook in econometrics, so we only describe them briefly here.

The likelihood ratio test is based on the difference between the likelihood you computed (maximized) with and without the restriction. Let $l_u$ denote the likelihood without the restriction and $l_r$ the likelihood with the restriction (note that $l_r \leq l_u$). If the restriction is valid, then we expect $l_r$ not to be too much lower than $l_u$. Therefore, to test whether the restriction is valid, the statistic we compute is $2(l_u - l_r)$, which follows a $\chi^2$ distribution with degrees of freedom equal to the number of restrictions imposed.

To do an LR test, we have to compute the likelihood under both the restricted and unrestricted specifications. In comparison, the other two tests use only either the estimator without the restriction (denoted $\hat\theta$) or the estimator with the restriction (denoted $\tilde\theta$). Let the restriction be $H_0: R(\theta) = r$. The idea of the Wald test is that if this restriction is valid, then the estimator obtained without the restriction, $\hat\theta$, will make $R(\hat\theta) - r$ close to zero. Therefore, the Wald statistic is

$$W = (R(\hat\theta) - r)' \left[ \text{Var}(R(\hat\theta) - r) \right]^{-1} (R(\hat\theta) - r),$$

which also follows a $\chi^2$ distribution with degrees of freedom equal to the number of restrictions imposed.

To find the ML estimator, we set the score function equal to zero and solve for the estimator, i.e., $S(\hat\theta) = 0$. If the restriction is valid, and the estimator obtained with the restriction is $\tilde\theta$, then we expect $S(\tilde\theta)$ to be close to zero. This idea leads to the LM test or score test.
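The three tests can be illustrated on the simplest possible example (ours, with illustrative values and `numpy` assumed): testing $H_0: \mu = 0$ in an i.i.d. $N(\mu, \sigma^2)$ sample, where all three statistics have closed forms built from the restricted and unrestricted variance estimates.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5_000
x = rng.normal(0.0, 2.0, n)          # data generated under H0: mu = 0

mu_hat = x.mean()                    # unrestricted MLE of mu
s2_u = ((x - mu_hat) ** 2).mean()    # unrestricted MLE of sigma^2
s2_r = (x ** 2).mean()               # restricted MLE of sigma^2 (mu fixed at 0)

# LR: twice the log-likelihood difference reduces to n*log(s2_r/s2_u).
LR = n * np.log(s2_r / s2_u)
# Wald: built from the unrestricted estimates only.
W = n * mu_hat ** 2 / s2_u
# LM/score: built from the restricted estimates only.
LM = n * mu_hat ** 2 / s2_r
```

All three are asymptotically $\chi^2(1)$ under $H_0$; in this Gaussian-mean example they also satisfy the classical ordering $W \geq LR \geq LM$ in every finite sample.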
The LM statistic is

$$LM = S(\tilde\theta)' I(\tilde\theta)^{-1} S(\tilde\theta),$$

which also follows a $\chi^2$ distribution with degrees of freedom equal to the number of restrictions imposed.

2.4 LS and MLE

In a regression $Y = X\beta_0 + U$ where $U \mid X \sim N(0, \sigma^2 I_n)$ (as in Assumption 2), the conditional density of $Y$ given $X$ is

$$f(Y \mid X; \theta) = (2\pi\sigma^2)^{-n/2} \exp\left[ -\frac{1}{2\sigma^2}(Y - X\beta)'(Y - X\beta) \right].$$

The log likelihood function is

$$l(Y \mid X; \theta) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}(Y - X\beta)'(Y - X\beta).$$
Note that the $\hat\beta$ that maximizes $l$ is the vector that minimizes the sum of squares; therefore, under Assumption 2, the OLS estimator is equivalent to the ML estimator of $\beta_0$. It can be shown that this estimator is unbiased and achieves the Cramér-Rao lower bound; therefore, under Assumption 2, the OLS/ML estimator is efficient (compared to all unbiased estimators, linear or nonlinear). Recall that under Assumption 1 we have the Gauss-Markov theorem to show that the OLS estimator is the best linear unbiased estimator; now the Cramér-Rao inequality establishes the optimality of the OLS estimator under Assumption 2. The ML estimator of $\sigma^2$ is $(Y - X\hat\beta)'(Y - X\hat\beta)/n$. We introduced this estimator a moment ago, and we showed that the difference between $\hat\sigma^2$ and the OLS estimator $s^2$ becomes arbitrarily small as $n \to \infty$.

Next, consider Assumption 5, where $U \mid X \sim N(0, \sigma^2 V)$ and $V$ is known. Then the log likelihood function, omitting constant terms, is

$$l(Y \mid X, \beta) = -\frac{1}{2}\log|V| - \frac{1}{2}(Y - X\beta)'V^{-1}(Y - X\beta).$$

The ML estimator is $\hat\beta = (X'V^{-1}X)^{-1}X'V^{-1}Y$, which is equivalent to the GLS estimator. The score vector is $S(\beta) = (Y - X\beta)'V^{-1}X$, and the Hessian matrix is $H(\beta) = -X'V^{-1}X$; therefore, the information matrix is $I(\beta) = X'V^{-1}X$. The GLS/ML estimator is efficient, as it achieves the Cramér-Rao lower bound $(X'V^{-1}X)^{-1}$. When $V$ is unknown, we can parameterize it as $V(\psi)$, say, and maximize the likelihood

$$l(Y \mid X, \beta, \psi) = -\frac{1}{2}\log|V(\psi)| - \frac{1}{2}(Y - X\beta)'V(\psi)^{-1}(Y - X\beta).$$

2.5 Example: MLE in autoregressive estimation

In Hamilton's book, you can find many detailed discussions of MLE for ARMA models in Chapter 5. We will take an AR(1) model as an example. Consider an AR(1) model, $x_t = c + \beta x_{t-1} + u_t$, where $u_t \sim \text{i.i.d. } N(0, \sigma^2)$. Let $\theta = (c, \beta, \sigma^2)$ and let the sample size be denoted $n$. There are two ways to construct the likelihood function, and the difference lies in how to treat the initial observation $x_1$. If we let $x_1$ be random, we know that the unconditional distribution of $x_t$ is $N(c/(1-\beta), \sigma^2/(1-\beta^2))$, and this will lead to an exact likelihood function.
Alternatively, we can assume that $x_1$ is observable (known), and this will lead to a conditional likelihood function. We first consider the exact likelihood function. We know that

$$p(x_1; \theta) = \left( \frac{2\pi\sigma^2}{1-\beta^2} \right)^{-1/2} \exp\left[ -\frac{(x_1 - c/(1-\beta))^2}{2\sigma^2/(1-\beta^2)} \right].$$

Conditional on $x_1$, the conditional distribution of $x_2$ is $N(c + \beta x_1, \sigma^2)$, so the conditional probability density of the second observation is

$$p(x_2 \mid x_1; \theta) = (2\pi\sigma^2)^{-1/2} \exp\left[ -\frac{(x_2 - c - \beta x_1)^2}{2\sigma^2} \right].$$

So the joint probability density of $(x_1, x_2)$ is $p(x_1, x_2; \theta) = p(x_2 \mid x_1; \theta)\,p(x_1; \theta)$.
Similarly, the probability density of the $t$th observation conditional on $x_{t-1}$ is

$$p(x_t \mid x_{t-1}; \theta) = (2\pi\sigma^2)^{-1/2} \exp\left[ -\frac{(x_t - c - \beta x_{t-1})^2}{2\sigma^2} \right],$$

and the density of the joint observations $X = (x_1, x_2, \ldots, x_n)$ is

$$L(X; \theta) = p(x_1; \theta) \prod_{t=2}^n p(x_t \mid x_{t-1}; \theta).$$

Taking logs, we get the exact log likelihood function (omitting constant terms for simplicity)

$$l(X; \theta) = -\frac{1}{2}\log\left( \frac{\sigma^2}{1-\beta^2} \right) - \frac{(x_1 - c/(1-\beta))^2}{2\sigma^2/(1-\beta^2)} - \frac{n-1}{2}\log(\sigma^2) - \sum_{t=2}^n \frac{(x_t - c - \beta x_{t-1})^2}{2\sigma^2}. \tag{11}$$

Next, to construct the conditional likelihood, assume that $x_1$ is observable; then the log likelihood function is (again, constant terms are omitted)

$$l(X; \theta) = -\frac{n-1}{2}\log(\sigma^2) - \sum_{t=2}^n \frac{(x_t - c - \beta x_{t-1})^2}{2\sigma^2}. \tag{12}$$

The maximum likelihood estimates $\hat c$ and $\hat\beta$ are obtained by maximizing (12), or by solving the score equations. Note that maximizing (12) with respect to $(c, \beta)$ is equivalent to minimizing $\sum_{t=2}^n (x_t - c - \beta x_{t-1})^2$, which is the objective function in OLS. Compared to the exact likelihood function, we see that the conditional likelihood function is much easier to work with. Actually, when the sample size is large, the first observation becomes negligible in the total likelihood: when $|\beta| < 1$, the estimator computed from the exact likelihood and the estimator from the conditional likelihood are asymptotically equivalent.

Finally, if the residual is not Gaussian, and if we estimate the parameters using the conditional Gaussian likelihood as in (12), then the estimate we obtain is known as a quasi-maximum likelihood estimate (QMLE). QMLE is also very frequently used in empirical estimation. Although we have misspecified the density function, in many cases the QMLE is still consistent. For instance, in an AR(p) process, if the sample second moments converge to the population second moments, then the QMLE using (12) is consistent, whether or not the errors are Gaussian. However, standard errors for the estimated coefficients that are computed with the Gaussian assumption need not be correct if the true data are not Gaussian (White, 1982).

3 Model Selection

In the discussion of estimation above, we assumed that the order of the lags is known.
However, in empirical estimation, we have to choose a proper order. A larger number of lags (parameters) will increase the fit of the model; therefore, we need some criterion to balance the goodness of fit and model parsimony. There are three commonly used criteria: the Akaike information criterion (AIC), Schwarz's Bayesian information criterion (BIC), and the posterior information criterion (PIC) developed by Phillips (1996). In all of these criteria, we specify a maximum order $k_{\max}$ and then choose $\hat k$ to minimize a criterion function:

$$AIC(k) = \log\left( \frac{SSR_k}{n} \right) + \frac{2k}{n}, \tag{13}$$

where $n$ is the sample size, $k = 1, 2, \ldots, k_{\max}$ is the number of parameters in the model, and $SSR_k$ is the sum of squared residuals from the fitted model. When $k$ increases, the fit increases, so $SSR_k$ decreases, but the second term increases; this captures the trade-off between fit and parsimony. Since the model is estimated using different numbers of lags, the usable sample size also varies: we can either use the varying sample size $n - k$ or a fixed sample size $n - k_{\max}$. Ng and Perron (2000) have recommended using the fixed sample size and replacing $n$ by it in the criterion. However, the AIC rule is not consistent and tends to overfit the model by choosing a larger $k$. With all other issues as in the AIC rule, the BIC rule imposes a larger penalty on the number of parameters:

$$BIC(k) = \log\left( \frac{SSR_k}{n} \right) + \frac{k\log(n)}{n}. \tag{14}$$

BIC suggests a smaller $k$ than AIC, and the BIC rule is consistent for stationary data, i.e., $\lim_{n\to\infty} \hat k_{BIC} = k$. Further, Hannan and Deistler (1988) have shown that $\hat k_{BIC}$ is consistent when we set $k_{\max} = [c\log(n)]$ (the integer part of $c\log(n)$) for any $c > 0$; therefore, we can estimate $\hat k_{BIC}$ consistently without knowing an upper bound on $k$.

Finally, to present the PIC criterion, let $K = k_{\max}$, and let $X(K)$ and $X(k)$ denote the regressor matrices with $K$ and $k$ parameters respectively; similarly for $\beta$, the parameter vector. Write

$$Y = X(K)\beta(K) + \text{error} = X(k)\beta(k) + X(\bar k)\beta(\bar k) + \text{error},$$

where $X(\bar k)$ collects the $K - k$ excluded regressors. Define

$$A(\bar k) = X(\bar k)'X(\bar k), \qquad A(k) = X(k)'X(k), \qquad A(\bar k, k) = X(\bar k)'X(k),$$
$$\bar A(\bar k) = A(\bar k) - A(\bar k, k)A(k)^{-1}A(k, \bar k),$$
$$\hat\beta(\bar k) = \left[ X(\bar k)'X(\bar k) - X(\bar k)'X(k)(X(k)'X(k))^{-1}X(k)'X(\bar k) \right]^{-1} \left[ X(\bar k)'Y - X(\bar k)'X(k)(X(k)'X(k))^{-1}X(k)'Y \right],$$
$$\hat\sigma_K^2 = SSR_K/(n - K);$$

then

$$PIC = \left| \bar A(\bar k)/\hat\sigma_K^2 \right|^{1/2} \exp\left\{ -\frac{1}{2\hat\sigma_K^2} \hat\beta(\bar k)'\bar A(\bar k)\hat\beta(\bar k) \right\}.$$
PIC is asymptotically equivalent to the BIC criterion when the data is stationary, and when the data is nonstationary, PIC is still consistent. Reading: Hamilton, Ch. 5, 8.
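To illustrate the AIC and BIC rules above, here is a minimal sketch in Python/NumPy (not from the lecture; the `order_select` function and the simulated AR(2) process are invented for this example). It fits AR(k) models by OLS for k = 1, ..., k_max on the fixed common sample of size n − k_max, following the Ng and Perron (2000) recommendation, and returns the order minimizing each criterion.

```python
import numpy as np

def order_select(y, kmax):
    """Fit AR(k) by OLS for k = 1..kmax on the fixed common sample of
    size n = len(y) - kmax, and return the orders minimizing AIC and BIC."""
    T = len(y)
    n = T - kmax                      # fixed effective sample size
    crit = {}
    for k in range(1, kmax + 1):
        # lag matrix: column j holds y_{t-j} for t = kmax, ..., T-1
        X = np.column_stack([y[kmax - j:T - j] for j in range(1, k + 1)])
        yy = y[kmax:]
        beta = np.linalg.lstsq(X, yy, rcond=None)[0]
        ssr = np.sum((yy - X @ beta) ** 2)
        aic = np.log(ssr / n) + 2 * k / n
        bic = np.log(ssr / n) + k * np.log(n) / n
        crit[k] = (aic, bic)
    k_aic = min(crit, key=lambda k: crit[k][0])
    k_bic = min(crit, key=lambda k: crit[k][1])
    return k_aic, k_bic

# simulate a stationary AR(2): y_t = 0.5 y_{t-1} - 0.3 y_{t-2} + u_t
rng = np.random.default_rng(0)
u = rng.standard_normal(500)
y = np.zeros(500)
for t in range(2, 500):
    y[t] = 0.5 * y[t - 1] - 0.3 * y[t - 2] + u[t]

k_aic, k_bic = order_select(y, kmax=8)
print("AIC picks", k_aic, "BIC picks", k_bic)
```

Because the BIC penalty k log(n)/n exceeds the AIC penalty 2k/n whenever log(n) > 2, the BIC choice can never exceed the AIC choice in this setup, consistent with the text's remark that BIC suggests a smaller k.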