Econ 5150: Applied Econometrics
Dynamic Demand Models and Model Selection
Sung Y. Park, CUHK
Simple dynamic models

A typical simple model:
$$y_t = \alpha_0 + \alpha_1 y_{t-1} + \alpha_2 y_{t-2} + x_t'\beta_0 + x_{t-1}'\beta_1 + u_t,$$
where $y_t$ is per-capita U.S. gasoline consumption and $x_t$ is a vector of exogenous variables, for example, $x_t = (1, p_t, z_t)'$.

The lag operator:
$$y_{t-1} = L y_t, \qquad y_{t-2} = L y_{t-1} = L^2 y_t, \quad \ldots$$
Then
$$(1 - \alpha_1 L - \alpha_2 L^2)\, y_t = \alpha_0 + (\beta_0 + \beta_1 L)' x_t + u_t.$$
Simple dynamic models

More compactly,
$$A(L)\, y_t = \alpha_0 + B(L)' x_t + u_t.$$
It is tempting to solve the above by writing
$$y_t = A(L)^{-1}\alpha_0 + A(L)^{-1} B(L)' x_t + A(L)^{-1} u_t.$$
This model is called a linear transfer function model. How can we interpret it? We may want to explain the notion of equilibrium forms of this model.
Simple dynamic models

Stability in linear difference equations: consider the simplest possible case,
$$X_t = a X_{t-1}.$$
By repeated substitution,
$$X_t = a X_{t-1} = a^2 X_{t-2} = \cdots = a^t X_0,$$
where $X_0$ denotes an initial condition.
- $|a| < 1$: $X_t \to 0$
- $|a| > 1$: $X_t$ diverges
- $|a| = 1$: either $X_t = X_0$ (when $a = 1$) or $X_t = \pm X_0$, alternating in sign (when $a = -1$)
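A minimal simulation of these three regimes (a Python sketch; the values of $a$, the horizon, and the function name are arbitrary choices for illustration):

```python
import numpy as np

def iterate(a, x0=1.0, T=20):
    """Iterate X_t = a * X_{t-1} forward from the initial condition x0."""
    x = np.empty(T + 1)
    x[0] = x0
    for t in range(1, T + 1):
        x[t] = a * x[t - 1]
    return x

# stable, unit root, oscillating, explosive (values chosen for illustration)
for a in (0.5, 1.0, -1.0, 1.1):
    print(f"a = {a:4.1f}: X_20 = {iterate(a)[-1]:.4f}")
```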
Simple dynamic models

Consider the second-order difference equation:
$$X_t = a_1 X_{t-1} + a_2 X_{t-2}.$$
The solutions take the form
$$X_t = A_1 \theta_1^t + A_2 \theta_2^t,$$
where $A_1$ and $A_2$ are parameters determined by initial conditions and the $\theta$'s are dependent on the $a$'s. By substituting,
$$A_1 \theta_1^t + A_2 \theta_2^t = a_1 (A_1 \theta_1^{t-1} + A_2 \theta_2^{t-1}) + a_2 (A_1 \theta_1^{t-2} + A_2 \theta_2^{t-2}),$$
or
$$0 = A_1 \theta_1^t (1 - a_1 \theta_1^{-1} - a_2 \theta_1^{-2}) + A_2 \theta_2^t (1 - a_1 \theta_2^{-1} - a_2 \theta_2^{-2}).$$
Simple dynamic models

Suppose we find the roots of the quadratic equation
$$1 - a_1 z - a_2 z^2 = 0$$
and call these roots $\theta_1^{-1}$ and $\theta_2^{-1}$. Done... Stability?

Suppose that all the roots are real: both $\theta_1$ and $\theta_2$ must be less than one in absolute value. If $\theta$ is complex, $\theta = \lambda_1 + \lambda_2 i$, we can represent $\theta$ in polar coordinates,
$$\theta = r(\cos\varphi + i\sin\varphi),$$
where $r = (\lambda_1^2 + \lambda_2^2)^{1/2}$, $\cos\varphi = \lambda_1/r$, $\sin\varphi = \lambda_2/r$; stability then requires the modulus $r$ to be less than one.
Simple dynamic models

Thus, it is necessary that the roots of the equation
$$1 - a_1 z - a_2 z^2 = 0$$
lie outside the unit circle.
- Roots outside the unit circle: stability.
- Roots inside the unit circle: explosive behavior.
- Roots on the unit circle: unit root.
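The stability condition is easy to check numerically. A small sketch; the coefficient pairs are hypothetical examples of the three cases:

```python
import numpy as np

def stable(a1, a2):
    """Check whether X_t = a1*X_{t-1} + a2*X_{t-2} is stable: the roots of
    1 - a1*z - a2*z^2 = 0 must all lie outside the unit circle."""
    # np.roots takes coefficients from the highest power down: -a2*z^2 - a1*z + 1
    roots = np.roots([-a2, -a1, 1.0])
    return np.all(np.abs(roots) > 1.0)

print(stable(0.5, 0.3))   # True: both roots outside the unit circle
print(stable(0.5, 0.5))   # False: z = 1 is a root (unit root)
print(stable(1.2, 0.3))   # False: a root inside the unit circle (explosive)
```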
Impulse response functions

Interpreting the expression $D(L) = A(L)^{-1} B(L)$: consider
$$B(L) = A(L)\, D(L),$$
or
$$\beta_0 + \beta_1 L + \cdots + \beta_s L^s = (1 - \alpha_1 L - \cdots - \alpha_r L^r)(\delta_0 + \delta_1 L + \cdots).$$
Impulse response functions

Matching coefficients on powers of $L$, for $j \le s$:
$$\beta_0 = \delta_0$$
$$\beta_1 = -\delta_0 \alpha_1 + \delta_1$$
$$\beta_2 = -\delta_0 \alpha_2 - \delta_1 \alpha_1 + \delta_2$$
$$\vdots$$
$$\beta_j = -\delta_0 \alpha_j - \cdots - \delta_{j-1} \alpha_1 + \delta_j$$
This means that, given the $\alpha$'s and $\beta$'s, the system can be solved recursively for the $\delta$'s. More generally,
$$\delta_j = \begin{cases} \sum_{i=1}^{\min(j,r)} \alpha_i \delta_{j-i} + \beta_j, & j \le s \\ \sum_{i=1}^{\min(j,r)} \alpha_i \delta_{j-i}, & j > s. \end{cases}$$
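The recursion is straightforward to implement. A sketch in Python; the function name and the coefficient values $A(L) = 1 - 0.6L$, $B(L) = 0.5 + 0.2L$ are illustrative assumptions, not from the lecture:

```python
import numpy as np

def impulse_response(alpha, beta, J=20):
    """Solve recursively for delta_0, ..., delta_J given the AR coefficients
    alpha = (alpha_1, ..., alpha_r) and lag coefficients beta = (beta_0, ..., beta_s)."""
    r, s = len(alpha), len(beta) - 1
    delta = np.zeros(J + 1)
    for j in range(J + 1):
        acc = sum(alpha[i - 1] * delta[j - i] for i in range(1, min(j, r) + 1))
        delta[j] = acc + (beta[j] if j <= s else 0.0)
    return delta

# hypothetical coefficients: A(L) = 1 - 0.6L, B(L) = 0.5 + 0.2L
delta = impulse_response(alpha=[0.6], beta=[0.5, 0.2])
print(delta[:5])   # 0.5, 0.5, 0.3, 0.18, 0.108
```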
Impulse response functions

The function of cumulative sums of the $\delta$'s,
$$\Delta(j) = \sum_{i=0}^{j} \delta_i,$$
is the impulse response function: it provides a complete picture of the time path of the response of $y$ to a once-and-for-all unit shock in $x$.

Case: a single exogenous variable $x$ stays at $x_0$ for a long time, so $y$ fluctuates randomly around an equilibrium value $y_0$. Now $x$ changes to $x_1$ and stays there. What happens to $y$?
Impulse response functions

$$E\,\Delta y_t = A(L)^{-1} B(L)\, \Delta x_t = D(L)\, \Delta x_t$$
$$D(1)\,\Delta x = \sum_{i=0}^{\infty} \delta_i\, \Delta x$$
- The new equilibrium is the accumulation of the short-run impulse responses.
- The new equilibrium can also be calculated simply by letting $y_t = y_e$ and $x_t = x_e$.
- This is valid if the roots of $A(z) = 0$ lie outside the unit circle. Inferences?
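As a check, with the same hypothetical coefficients as in the sketch above, the long-run effect of a permanent unit change in $x$ equals $D(1) = B(1)/A(1)$, which the cumulative impulse responses should converge to:

```python
# With A(L) = 1 - 0.6L and B(L) = 0.5 + 0.2L (hypothetical coefficients),
# the new equilibrium effect is D(1) = B(1)/A(1), valid here because the
# root of A(z) = 0 is z = 1/0.6 > 1 (outside the unit circle).
alpha, beta = [0.6], [0.5, 0.2]
long_run = sum(beta) / (1 - sum(alpha))
print(long_run)   # 1.75: the cumulative sums Delta(j) converge to this value
```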
Error correction form

Consider the following simple dynamic model:
$$y_t = \alpha_1 y_{t-1} + \alpha_0 + \beta_0 x_t + \beta_1 x_{t-1} + u_t.$$
In equilibrium with $x_t \equiv x_e$,
$$y_e = \frac{\alpha_0}{1 - \alpha_1} + \frac{\beta_0 + \beta_1}{1 - \alpha_1}\, x_e + \frac{1}{1 - \alpha_1}\, u_t.$$
Subtract $y_{t-1}$ from both sides of the model and add and subtract $\beta_0 x_{t-1}$:
$$\Delta y_t = (\alpha_1 - 1) y_{t-1} + \alpha_0 + \beta_0 \Delta x_t + (\beta_0 + \beta_1) x_{t-1} + u_t,$$
or
$$\Delta y_t = \beta_0 \Delta x_t + (\alpha_1 - 1)\left[ y_{t-1} - \frac{\alpha_0}{1 - \alpha_1} - \frac{\beta_0 + \beta_1}{1 - \alpha_1}\, x_{t-1} \right] + u_t.$$
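A quick numerical check, with made-up parameter values, that the levels form and the error-correction form imply exactly the same $\Delta y_t$:

```python
import numpy as np

rng = np.random.default_rng(0)
a0, a1, b0, b1 = 1.0, 0.7, 0.4, 0.2           # hypothetical parameter values
x = np.cumsum(rng.normal(size=200))            # an arbitrary exogenous path
u = rng.normal(size=200)
y = np.zeros(200)
for t in range(1, 200):
    y[t] = a0 + a1 * y[t - 1] + b0 * x[t] + b1 * x[t - 1] + u[t]

# Error-correction form: Delta y_t = b0*Delta x_t
#   + (a1 - 1) * [y_{t-1} - a0/(1-a1) - (b0+b1)/(1-a1) * x_{t-1}] + u_t
t = np.arange(1, 200)
dy_ecm = (b0 * np.diff(x)
          + (a1 - 1) * (y[t - 1] - a0 / (1 - a1)
                        - (b0 + b1) / (1 - a1) * x[t - 1])
          + u[t])
print(np.allclose(np.diff(y), dy_ecm))         # True: the two forms coincide
```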
Model selection

Consider a collection of parametric models $\{f_j(x, \theta)\}$, where $\theta \in \Theta_j$ for $j = 1, \ldots, J$. Some linear structure is usually imposed on the parameter space: $\Theta_j = M_j \theta_j$, where $M_j$ is a linear subspace of $R^{p_J}$ of dimension $p_j$ and $p_1 < p_2 < \cdots < p_J$. Also assume that the models are nested:
$$\Theta_1 \subset \Theta_2 \subset \cdots \subset \Theta_J.$$
Model selection

Akaike information criterion [Akaike (1969)]:
$$AIC(j) = l_j(\hat\theta_j) - p_j,$$
where $l_j(\hat\theta_j)$ denotes the maximized log-likelihood corresponding to the $j$-th model. Akaike's selection rule is simply: choose the model $j$ which maximizes $AIC(j)$.

Schwarz's information criterion [Schwarz (1978)]:
$$SIC(j) = l_j(\hat\theta_j) - \tfrac{1}{2}\, p_j \log n.$$
With $\hat{j} = \arg\max_j SIC(j)$, we have $P(\hat{j} = j^*) \to 1$. Since $(1/2)\log n > 1$ for $n > 8$, the SIC penalty is larger than the AIC penalty.
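A sketch of how both criteria can be computed for nested Gaussian regression models; the helper name and the simulated design are illustrative assumptions:

```python
import numpy as np

def gaussian_ic(y, X_list):
    """Compare nested Gaussian regression models by AIC and SIC.
    X_list: design matrices, ordered from smallest to largest model."""
    n = len(y)
    for j, X in enumerate(X_list):
        beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        s2 = np.sum((y - X @ beta_hat) ** 2) / n       # sigma^2_hat = S/n
        loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(s2) + 1)
        p = X.shape[1]
        print(f"model {j}: AIC = {loglik - p:8.2f}, "
              f"SIC = {loglik - 0.5 * p * np.log(n):8.2f}")

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=(n, 3))
y = 1.0 + 0.8 * x[:, 0] + rng.normal(size=n)           # only x1 matters
ones = np.ones((n, 1))
gaussian_ic(y, [ones, np.hstack([ones, x[:, :1]]), np.hstack([ones, x])])
```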
Model selection

Connection with classical hypothesis testing: under quite general conditions, for nested models with $p_j > p_i = p^*$,
$$2\,(l_j(\hat\theta_j) - l_i(\hat\theta_i)) \sim \chi^2_{p_j - p_i}.$$
SIC would choose $j$ over $i$ iff
$$\frac{2(l_j - l_i)}{p_j - p_i} > \log n,$$
so $\log n$ can be interpreted as an implicit critical value for the model selection decision based on SIC. Does this make sense? For AIC the implicit critical value is 2, which implies a positive probability of Type I error.
Model selection

SIC in the linear regression model: consider the Gaussian linear regression model with log-likelihood
$$l(\beta, \sigma) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{S}{2\sigma^2},$$
where $S = (y - X\beta)'(y - X\beta)$. Evaluating at $\hat\beta$ and $\hat\sigma^2 = S/n$,
$$l(\hat\beta, \hat\sigma) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\hat\sigma^2 - \frac{n}{2}.$$
Thus maximizing
$$SIC(j) = l_j - \tfrac{1}{2}\, p_j \log n$$
is the same as minimizing
$$\log\hat\sigma_j^2 + (p_j/n)\log n.$$
Model selection

Connection with the F-test statistic: note
$$l_i - l_j = \frac{n}{2}\left(\log\hat\sigma_j^2 - \log\hat\sigma_i^2\right) = \frac{n}{2}\log(\hat\sigma_j^2/\hat\sigma_i^2) = \frac{n}{2}\log\left(1 - \frac{\hat\sigma_i^2 - \hat\sigma_j^2}{\hat\sigma_i^2}\right).$$
Using the usual Taylor-series approximation for $\log(1 \pm a)$ for $a$ small,
$$2(l_i - l_j) \approx \frac{n(\hat\sigma_j^2 - \hat\sigma_i^2)}{\hat\sigma_i^2}.$$
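A one-line numerical check of the approximation, with hypothetical residual variances for a short model $i$ and a long model $j$:

```python
import numpy as np

# Taylor check: for nearby residual variances, the likelihood-ratio statistic
# is close to n * (s2_i - s2_j) / s2_i (values below are made up).
n, s2_i, s2_j = 100, 1.00, 0.95
lr = n * (np.log(s2_i) - np.log(s2_j))     # 2 * (l_j - l_i)
approx = n * (s2_i - s2_j) / s2_i
print(lr, approx)                           # about 5.13 vs 5.00
```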
Model selection, Shrinkage and the LASSO

The information criterion approach balances two objectives: simplicity (the penalty) and goodness-of-fit (fidelity).
- Too simple a model risks serious bias.
- Too complicated a model risks a high degree of uncertainty.
Start with the Bayesian method for the linear regression model: shrinkage methods or Stein-rule methods.
Model selection, Shrinkage and the LASSO

Consider the linear model
$$y = X\beta + u,$$
where $u \sim N(0, \sigma^2 I)$. As a function of $b$ (up to a factor not depending on $b$),
$$L(y \mid b) = (2\pi)^{-n/2}\sigma^{-n}\exp\left\{ -\frac{1}{2\sigma^2}(\hat\beta - b)' X'X (\hat\beta - b) \right\}.$$
Suppose that we have a prior $\beta \sim N(\beta_0, \Omega)$, i.e.,
$$\pi(b) = (2\pi)^{-p/2}|\Omega|^{-1/2}\exp\left\{ -\frac{1}{2}(b - \beta_0)'\Omega^{-1}(b - \beta_0) \right\}.$$
Using Bayes rule,
$$p(b \mid y) = \frac{L(y \mid b)\,\pi(b)}{\int L(y \mid b)\,\pi(b)\,db}.$$
Model selection, Shrinkage and the LASSO

Then
$$p(b \mid y) = \kappa \exp\left\{ -\frac{1}{2}(b - \bar\beta)'\left(\sigma^{-2} X'X + \Omega^{-1}\right)(b - \bar\beta) \right\},$$
where $\kappa$ is a constant and
$$\bar\beta = \left(\sigma^{-2} X'X + \Omega^{-1}\right)^{-1}\left(\sigma^{-2} X'X \hat\beta + \Omega^{-1}\beta_0\right).$$
The posterior distribution is also Gaussian, with mean $\bar\beta$. Here $\hat\beta$ and $\beta_0$ have covariance matrices $\sigma^2(X'X)^{-1}$ and $\Omega$, respectively, and they are weighted by the inverses of these covariance matrices.
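A minimal sketch of the posterior-mean formula; the function name, prior, and simulated data are assumptions for illustration:

```python
import numpy as np

def posterior_mean(X, y, sigma2, beta0, Omega):
    """Posterior mean beta_bar = (X'X/sigma^2 + Omega^{-1})^{-1}
    (X'X/sigma^2 @ beta_hat + Omega^{-1} @ beta0): a precision-weighted
    average of the OLS estimate and the prior mean."""
    XtX = X.T @ X
    beta_hat = np.linalg.solve(XtX, X.T @ y)
    Oinv = np.linalg.inv(Omega)
    A = XtX / sigma2 + Oinv
    return np.linalg.solve(A, (XtX / sigma2) @ beta_hat + Oinv @ beta0)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, -0.5]) + rng.normal(size=50)
bbar = posterior_mean(X, y, sigma2=1.0,
                      beta0=np.zeros(2), Omega=0.1 * np.eye(2))
print(bbar)   # shrunk from the OLS estimate toward the prior mean beta0 = 0
```

A tight prior (small $\Omega$) pulls $\bar\beta$ toward $\beta_0$; a diffuse prior leaves $\bar\beta$ close to $\hat\beta$.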
Model selection, Shrinkage and the LASSO

Tibshirani (1996) considered the $l_1$ norm in the penalty term,
$$Pen(\theta) = \sum_{i=1}^{p} |\theta_i|,$$
and proposed the following regression estimator for some appropriately chosen $\lambda$:
$$\min_\theta \sum (y_i - x_i'\theta)^2 + \lambda\, Pen(\theta),$$
the lasso (least absolute shrinkage and selection operator). Compare ridge regression:
$$\min_\theta \sum (y_i - x_i'\theta)^2 + \lambda \sum_{i=1}^{p}\theta_i^2.$$
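For illustration, both estimators are available in scikit-learn; note that sklearn's `alpha` plays the role of $\lambda$ up to the package's own scaling of the fidelity term, and the data below are simulated:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=100)   # only 2 of 10 matter

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)
print("lasso:", np.round(lasso.coef_, 2))   # many coefficients exactly zero
print("ridge:", np.round(ridge.coef_, 2))   # small but nonzero everywhere
```

This illustrates the key difference: the $l_1$ penalty sets coefficients exactly to zero (selection), while the $l_2$ penalty only shrinks them.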
Model selection, Shrinkage and the LASSO

One can also use the $l_1$ fidelity criterion:
$$\min_\theta \sum |y_i - x_i'\theta| + \lambda\, Pen(\theta).$$
This has been done by Wang, Li and Jiang (JBES, 2007).
Model selection, Shrinkage and the LASSO Figure: LASSO and Ridge shrinkage
Bias and Variance

Consider the following stylized situation in regression:
$$y = X\beta + Z\gamma + u \qquad \text{(long model)}$$
$$y = X\beta + v \qquad \text{(short model)}$$
What is the price we pay when we misspecify the model?
Bias and Variance

Assume that the long model is true and we estimate the short model (omitted variables):
$$E\hat\beta_s = E(X'X)^{-1}X'y = E(X'X)^{-1}X'(X\beta + Z\gamma + u) = \beta + (X'X)^{-1}X'Z\gamma.$$
The bias associated with estimation of $\beta$ is
$$G\gamma = (X'X)^{-1}X'Z\gamma,$$
where $G$ is obtained by regressing the columns of $Z$ on the columns of $X$. The bias vanishes if $\gamma = 0$ or if $X$ is orthogonal to $Z$.
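A small simulation of this omitted-variables bias; the coefficient values ($\beta = 1$, $\gamma = 2$) and the $X$-$Z$ correlation are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
x = rng.normal(size=n)
z = 0.8 * x + 0.6 * rng.normal(size=n)       # Z correlated with X
y = 1.0 * x + 2.0 * z + rng.normal(size=n)   # long model is true: beta=1, gamma=2

X = np.column_stack([np.ones(n), x])
beta_s = np.linalg.lstsq(X, y, rcond=None)[0]
G = np.linalg.lstsq(X, z, rcond=None)[0]     # regress Z on the columns of X
print(beta_s[1])                              # short-model estimate of beta
print(1.0 + G[1] * 2.0)                       # beta + G*gamma, about 2.6
```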
Bias and Variance

Example: one estimates a static model when a dynamic one is the true model. Suppose the correct specification is
$$y_t = \alpha + \sum_{i=0}^{p} \beta_i x_{t-i} + u_t,$$
where $x_t$ is an exogenous variable. Instead we estimate the static model
$$y_t = \alpha + \beta_0 x_t + v_t.$$
What is the relationship between our estimate of $\beta_0$ in the static model and the coefficients of the dynamic model?
Bias and Variance

$$E\hat\beta_0 = \beta_0 + \sum_{i=1}^{p} g_i \beta_i,$$
where $g_i$ denotes the slope coefficient obtained in a regression of $x_{t-i}$ on $x_t$ and an intercept. If $x_t$ is strongly trended, then these $g_i$ will tend to be close to one and $E\hat\beta_0$ will be close to $\sum_{i=0}^{p}\beta_i$: the long-run effect.
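A sketch illustrating this with a strongly trended regressor; the trend slope and lag coefficients are made up, and chosen so that the long-run effect $\sum_i \beta_i$ equals 1:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 500, 4
t = np.arange(n + p, dtype=float)
x = 0.05 * t + rng.normal(size=n + p)            # strongly trended regressor
beta = np.array([0.5, 0.3, 0.1, 0.1, 0.0])       # beta_0..beta_p, sum = 1.0
y = (1.0 + sum(beta[i] * x[p - i : n + p - i] for i in range(p + 1))
     + rng.normal(size=n))

X = np.column_stack([np.ones(n), x[p:]])         # static regression of y on x_t
b0_hat = np.linalg.lstsq(X, y, rcond=None)[0][1]
print(b0_hat)   # close to the long-run effect 1.0, not to beta_0 = 0.5
```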
Bias and Variance

Assume instead that the short model is true and we estimate the long model. Bias? With $M_Z = I - Z(Z'Z)^{-1}Z'$,
$$E\hat\beta_L = E(X'M_Z X)^{-1}X'M_Z y = E(X'M_Z X)^{-1}X'M_Z(X\beta + u) = \beta.$$
Happy? There is still a price to be paid for estimating the parameters $\gamma$...
Bias and Variance

Proposition. $\hat\beta_s = \hat\beta_L + G\hat\gamma_L$.

Proposition. Assuming $V(y) = E(y - Ey)(y - Ey)' = \sigma^2 I$,
$$V(\hat\beta_L) = V(\hat\beta_s) + G\, V(\hat\gamma_L)\, G'.$$
... the variability of the long estimate always exceeds the variability of the short estimate... but...
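The first proposition is an exact algebraic identity, which a simulation can confirm; the design below is an arbitrary example:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = (0.5 * X[:, 1] + rng.normal(size=n)).reshape(-1, 1)
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)        # short model is true

W = np.hstack([X, Z])                                     # long model design
coef_L = np.linalg.lstsq(W, y, rcond=None)[0]
beta_L, gamma_L = coef_L[:2], coef_L[2:]
beta_s = np.linalg.lstsq(X, y, rcond=None)[0]
G = np.linalg.lstsq(X, Z, rcond=None)[0]                  # Z regressed on X
print(np.allclose(beta_s, beta_L + G @ gamma_L))          # True, exactly
```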
Fishing

Fishing concerns the difficulties associated with preliminary testing and model selection... based on Freedman (1983, American Statistician) (see also Leeb and Pötscher (2005, ET)).

He considers a model of the form
$$y_i = x_i'\beta_0 + u_i,$$
where $u_i \sim$ iid $N(0, \sigma^2)$. The matrix $X = (x_i)$ is $n \times p$ and $X'X = I_p$. Moreover, $p \to \infty$ as $n \to \infty$ so that $p/n \to \rho$ for some $0 < \rho < 1$. He also assumes $\beta_0 = 0$.
Fishing

Theorem. For the above model, $R_n^2 \to \rho$ and $F_n \to 1$.

Proof: Since $\beta_0 = 0$, the usual $F_n$ statistic for the model is really distributed as $F$. So
$$EF_n = (n - p)/(n - p - 2),$$
which tends to 1. And
$$F_n = \frac{n - p - 1}{p} \cdot \frac{R_n^2}{1 - R_n^2} \quad \Longleftrightarrow \quad R_n^2 = F_n \Big/ \left( \frac{n - p - 1}{p} + F_n \right).$$
Thus, since $F_n \to 1$ and $(n - p - 1)/p \to (1 - \rho)/\rho$, we have $R_n^2 \to \rho$.
Fishing

Now consider the following case: all $p$ variables are initially tried. Those attaining $\alpha$-level significance in a standard t-test are retained, say $q_{n,\alpha}$ of them. Then the model is re-estimated with only these variables.

Theorem. For the above model,
$$R_{n,\alpha}^2 \to g(\lambda_\alpha)\,\rho \qquad \text{and} \qquad F_{n,\alpha} \to \frac{g(\lambda_\alpha)/\alpha}{(1 - g(\lambda_\alpha)\rho)/(1 - \alpha\rho)},$$
where
$$g(\lambda) = \int_{|z| > \lambda} z^2 \varphi(z)\,dz$$
and $\lambda_\alpha$ is chosen so that $\Phi(\lambda_\alpha) = 1 - \alpha/2$.
Fishing

Example: suppose that $n = 100$ and $p = 50$, so $\rho = 1/2$. Set $\alpha = 0.25$, so $\lambda = 1.15$ and $g(\lambda) = 0.72$. Then $E(Z^2 \mid |z| > \lambda) \approx 2.9$, and
$$R_{n,\alpha}^2 \approx g(\lambda)\,\rho = 0.72 \times 0.5 = 0.36,$$
$$F_{n,\alpha} \approx \frac{g(\lambda)/\alpha}{(1 - g(\lambda)\rho)/(1 - \alpha\rho)} \approx 4.0,$$
$$Eq_{n,\alpha} = \alpha\rho n = 0.25 \times 0.50 \times 100 = 12.5.$$
For comparison, $F_{12,88,0.05} = 1.88$ and $P(F_{12,88} > 4.0) \approx 0.0001$: the second-pass regression looks highly significant even though $\beta_0 = 0$.
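Freedman's experiment is easy to replicate. A Monte Carlo sketch under his design (one replication, arbitrary seed; the screening uses the normal critical value $\lambda_\alpha$ from the theorem):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, p, alpha = 100, 50, 0.25
X = np.linalg.qr(rng.normal(size=(n, p)))[0]   # orthonormal columns: X'X = I_p
y = rng.normal(size=n)                          # beta_0 = 0: y is pure noise

# First pass: keep regressors significant at level alpha
b = X.T @ y                                     # OLS, since X'X = I_p
s2 = np.sum((y - X @ b) ** 2) / (n - p)
tstat = b / np.sqrt(s2)
keep = np.abs(tstat) > stats.norm.ppf(1 - alpha / 2)

# Second pass: re-estimate using only the "significant" regressors
Xk = X[:, keep]
bk = np.linalg.lstsq(Xk, y, rcond=None)[0]
resid = y - Xk @ bk
q = keep.sum()
R2 = 1 - resid @ resid / (y @ y - n * y.mean() ** 2)
F = (R2 / q) / ((1 - R2) / (n - q - 1))
print(f"q = {q}, R^2 = {R2:.2f}, F = {F:.2f}")  # a 'significant' fit from noise
```

The printed values should be close to the theoretical limits above ($q \approx 12.5$, $R^2 \approx 0.36$, $F \approx 4$), illustrating how preliminary screening invalidates the nominal distribution of the second-pass test.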