Econometric Modelling


David F. Hendry, Nuffield College, Oxford University. July 8, 2000

Abstract

The theory of reduction explains the origins of empirical models, by delineating all the steps involved in mapping from the actual data generation process (DGP) in the economy (far too complicated and high-dimensional ever to be completely modelled) to an empirical model thereof. Each reduction step involves a potential loss of information from: aggregating, marginalizing, conditioning, approximating, and truncating, leading to a local DGP which is the actual generating process in the space of variables under analysis. Tests of losses from many of the reduction steps are feasible. Models that show no losses are deemed congruent; those that explain rival models are called encompassing. The main reductions correspond to well-established econometric concepts (causality, exogeneity, invariance, innovations, etc.), which are the null hypotheses of the mis-specification tests, so the theory has considerable excess content. General-to-specific (Gets) modelling seeks to mimic reduction by commencing from a general congruent specification that is simplified to a minimal representation consistent with the desired criteria and the data evidence (essentially represented by the local DGP). However, in small data samples, model selection is difficult. We reconsider model selection from a computer-automation perspective, focusing on general-to-specific reductions, embodied in PcGets, an Ox package for implementing this modelling strategy for linear, dynamic regression models. We present an econometric theory that explains the remarkable properties of PcGets. Starting from a general congruent model, standard testing procedures eliminate statistically-insignificant variables, with diagnostic tests checking the validity of reductions, ensuring a congruent final selection. Path searches in PcGets terminate when no variable meets the pre-set criteria, or any diagnostic test becomes significant. Non-rejected models are tested by encompassing: if several are acceptable, the reduction recommences from their union; if they re-appear, the search is terminated using the Schwarz criterion. Since model selection with diagnostic testing has eluded theoretical analysis, we study modelling strategies by simulation. The Monte Carlo experiments show that PcGets recovers the DGP specification from a general model with size and power close to commencing from the DGP itself, so model selection can be relatively non-distortionary even when the mechanism is unknown. Empirical illustrations for consumers' expenditure and money demand will be shown live. Next, we discuss sample-selection effects on forecast failure, with a Monte Carlo study of their impact. This leads to a discussion of the role of selection when testing theories, and the problems inherent in conventional approaches. Finally, we show that selecting policy-analysis models by forecast accuracy is not generally appropriate. We anticipate that Gets will perform well in selecting models for policy.

Financial support from the UK Economic and Social Research Council under grant L , Modelling Nonstationary Economic Time Series, R , and Forecasting and Policy in the Evolving Macro-economy, L , is gratefully acknowledged. The research is based on joint work with Hans-Martin Krolzig of Oxford University.

Contents

1 Introduction
2 Theory of reduction
  2.1 Empirical models
  2.2 DGP
  2.3 Data transformations and aggregation
  2.4 Parameters of interest
  2.5 Data partition
  2.6 Marginalization
  2.7 Sequential factorization
      Sequential factorization of W_T
      Marginalizing with respect to V_{t-1}
  2.8 Mapping to I(0)
  2.9 Conditional factorization
  2.10 Constancy
  2.11 Lag truncation
  2.12 Functional form
  2.13 The derived model
  2.14 Dominance
  2.15 Econometric concepts as measures of no information loss
  2.16 Implicit model design
  2.17 Explicit model design
  2.18 A taxonomy of evaluation information
3 General-to-specific modelling
  3.1 Pre-search reductions
  3.2 Additional paths
  3.3 Encompassing
  3.4 Information criteria
  3.5 Sub-sample reliability
  3.6 Significant mis-specification tests
4 The econometrics of model selection
  4.1 Search costs
  4.2 Selection probabilities
  4.3 Deletion probabilities
  4.4 Path selection probabilities
  4.5 Improved inference procedures
5 PcGets
  5.1 The multi-path reduction process of PcGets
  5.2 Settings in PcGets
  5.3 Limits to PcGets
      Collinearity
      Integrated variables
6 Some Monte Carlo results
  6.1 Aim of the Monte Carlo
  6.2 Design of the Monte Carlo
  6.3 Evaluation of the Monte Carlo
      Diagnostic tests
      Size and power of variable selection
      Test size analysis
7 Empirical illustrations
  7.1 DHSY
  7.2 UK money demand
8 Model selection in forecasting, testing, and policy analysis
  8.1 Model selection for forecasting
      Sources of forecast errors
      Sample selection experiments
  8.2 Model selection for theory testing
  8.3 Model selection for policy analysis
  8.4 Congruent modelling
9 Conclusions
Appendix: encompassing
References

1 Introduction

The economy is a complicated, dynamic, non-linear, simultaneous, high-dimensional, and evolving entity; social systems alter over time; laws change; and technological innovations occur. Time-series data samples are short, highly aggregated, heterogeneous, non-stationary, time-dependent and interdependent. Economic magnitudes are inaccurately measured, subject to revision, and important variables are often not observable. Economic theories are highly abstract and simplified, with suspect aggregation assumptions; they change over time, and often rival, conflicting explanations co-exist. In the face of this welter of problems, econometric modelling of economic time series seeks to discover sustainable and interpretable relationships between observed economic variables.

However, the situation is not as bleak as it may seem, provided some general scientific notions are understood. The first key is that knowledge accumulation is progressive: one does not need to know all the answers at the start (otherwise, no science could have advanced). Although the best empirical model at any point will later be supplanted, it can provide a springboard for further discovery. Thus, model selection problems (e.g., data mining) are not a serious concern: this is established below by the actual behaviour of model-selection algorithms. The second key is that determining inconsistencies between the implications of any conjectured model and the observed data is easy. Indeed, the ease of rejection worries some economists about econometric models, yet it is a powerful advantage. Conversely, constructive progress is difficult, because we do not know what we do not know, so cannot know how to find out. The dichotomy between construction and destruction is an old one in the philosophy of science: critically evaluating empirical evidence is a destructive use of econometrics, but can establish a legitimate basis for models.

To understand modelling, one must begin by assuming a probability structure and conjecturing the data generation process. However, the relevant probability basis is unclear, since the economic mechanism is unknown. Consequently, one must proceed iteratively: conjecture the process, develop the associated probability theory, use that for modelling, and revise the starting point when the results prove inconsistent with it. This can be seen in the gradual progress from stationarity assumptions, through integrated-cointegrated systems, to general non-stationary, mixing processes: further developments will undoubtedly occur, leading to a more useful probability basis for empirical modelling.

These notes first review the theory of reduction in section 2 to explain the origins of empirical models, then discuss some methodological issues that concern many economists. Despite the controversy surrounding econometric methodology, the LSE approach (see Hendry, 1993, for an overview) has emerged as a leading approach to empirical modelling. One of its main tenets is the concept of general-to-specific modelling (Gets): starting from a general dynamic statistical model, which captures the essential characteristics of the underlying data set, standard testing procedures are used to reduce its complexity by eliminating statistically-insignificant variables, checking the validity of the reductions at every stage to ensure the congruence of the selected model. Section 3 discusses Gets, and relates it to the empirical analogue of reduction.
Recently, econometric model selection has been automated in a program called PcGets, an Ox package (see Doornik, 1999, and Hendry and Krolzig, 1999a) designed for Gets modelling, currently focusing on reduction approaches for linear, dynamic, regression models. The development of PcGets has been stimulated by Hoover and Perez (1999), who sought to evaluate the performance of Gets. To implement a general-to-specific approach in a computer algorithm, all decisions must be mechanized. In doing so, Hoover and Perez made some important advances in practical modelling, and our approach builds on these by introducing further improvements. Given an initial general model, many reduction paths could be considered, and different selection strategies adopted for each path. Some of

these searches may lead to different terminal specifications, between which a choice must be made. Consequently, the reduction process is inherently iterative. Should multiple congruent contenders eventuate after a reduction round, encompassing can be used to test between them, with only the surviving (usually non-nested) specifications retained. If multiple models still remain after this 'testimation' process, a new general model is formed from their union, and the simplification process re-applied. Should that union repeat, a final selection is made using information criteria; otherwise a unique congruent and encompassing reduction has been located. Automating Gets throws further light on several methodological issues, and prompts some new ideas, which are discussed in section 4. While the joint issue of variable selection and diagnostic testing using multiple criteria has eluded most attempts at theoretical analysis, computer automation of the model-selection process allows us to evaluate econometric model-selection strategies by simulation. Section 6 presents the results of some Monte Carlo experiments to investigate whether the model-selection process works well or fails badly; their implications for the calibration of PcGets are also analyzed. The empirical illustrations presented in section 7 demonstrate the usefulness of PcGets for applied econometric research. Section 8 then investigates model selection in forecasting, testing, and policy analysis, and shows the drawbacks of some widely-used approaches.

2 Theory of reduction

First we define the notion of an empirical model, then explain the origins of such models by the theory of reduction.

2.1 Empirical models

In an experiment, the output is caused by the inputs and can be treated as if it were a mechanism:

    y_t = f(z_t) + ν_t
    [output] [input] [perturbation]                                   (1)

where y_t is the observed outcome of the experiment when z_t is the experimental input, f(·) is the mapping from input to output, and ν_t is a small, random perturbation which varies between experiments conducted at the same values of z. Given the same inputs {z_t}, repeating the experiment generates essentially the same outputs. In an econometric model, however:

    y_t = g(z_t) + ε_t
    [observed] [explanation] [remainder]                              (2)

y_t can always be decomposed into two components, namely g(z_t) (the part explained) and ε_t (the part unexplained). Such a partition is feasible even when y_t does not depend on g(z_t). In econometrics:

    ε_t = y_t - g(z_t).                                               (3)

Thus, models can be designed by selection of z_t. Design criteria must be analyzed, and lead to the notion of a congruent model: one that matches the data evidence on the measured attributes. Successive congruent models should be able to explain previous ones, which is the concept of encompassing, and thereby progress can be achieved.
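To make the contrast between (1) and (2) concrete, the following small simulation (an illustrative Python sketch of my own, not part of the original notes; all names and numbers are assumptions) shows that the decomposition in (3) can always be constructed, even when the chosen z_t played no role in generating y_t, which is precisely why congruence and encompassing, rather than fit alone, are needed as design criteria.

    # Illustrative sketch (assumed setup): the decomposition y_t = g(z_t) + e_t
    # in (2)-(3) exists by construction even when y_t does not depend on z_t.
    import numpy as np

    rng = np.random.default_rng(0)
    T = 200
    z = rng.normal(size=T)          # candidate "explanation", irrelevant by construction
    y = 1.0 + rng.normal(size=T)    # outcome generated without any role for z

    # Least-squares "explained" part g(z_t) = a + b*z_t and its derived remainder
    X = np.column_stack([np.ones(T), z])
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ coef                # e_t = y_t - g(z_t): a designed, not autonomous, error

    print("fitted (a, b):", coef.round(3))
    print("share of variance 'explained':", round(1 - e.var() / y.var(), 4))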

2.2 DGP

Let {u_t} denote a stochastic process where u_t is a vector of n random variables. Consider the sample U_T^1 = (u_1 ... u_T), where U_t^1 = (u_1 ... u_t). Denote the initial conditions by U_0 = (... u_{-r} ... u_{-1} u_0), and let U_t = (U_0 : U_t^1). The density function of U_T^1 conditional on U_0 is given by D_U(U_T^1 | U_0, ψ), where D_U(·) is represented parametrically by a k-dimensional vector of parameters ψ = (ψ_1 ... ψ_k) with parameter space Ψ ⊆ R^k. All elements of ψ need not be the same at each t, and some of the {ψ_i} may reflect transient effects or regime shifts. The data generation process (DGP) of {u_t} is written as:

    D_U(U_T^1 | U_0, ψ)  with  ψ ∈ Ψ ⊆ R^k.                          (4)

The complete sample {u_t, t = 1, ..., T} is generated from D_U(·) by a population parameter value ψ_p. The sample joint data density D_U(U_T^1 | U_0, ψ) is called the Haavelmo distribution (see e.g., Spanos, 1989). The complete set of random variables relevant to the economy under investigation over t = 1, ..., T is denoted {u*_t}, where * denotes a perfectly measured variable, and U*_T = (u*_1, ..., u*_T), defined on the probability space (Ω, F, P). The DGP induces U*_T = (u*_1, ..., u*_T), but U*_T is unmanageably large. Operational models are defined by a sequence of data reductions, organized into eleven stages.

2.3 Data transformations and aggregation

One-one mapping of U*_T to a new data set W_T: U*_T ↔ W_T. The DGP of U*_T, and so of W_T, is characterized by the joint density:

    D_U(U*_T | U_0, ψ_T) = D_W(W_T | W_0, φ_T)                        (5)

where ψ_T ∈ Ψ and φ_T ∈ Φ, making parameter change explicit. The transformation from U* to W affects the parameter space, so Ψ is transformed into Φ.

2.4 Parameters of interest

µ ∈ M: identifiable, and invariant to an interesting class of interventions.

2.5 Data partition

Partition W_T into the two sets:

    W_T = (X_T : V_T)                                                 (6)

where the X_T matrix is T × n_1. Everything about µ must be learnt from analyzing X_T alone, so that V_T is not essential to inference about µ.

2.6 Marginalization

    D_W(W_T | W_0, φ_T) = D_{V|X}(V_T | X_T, W_0, Λ_{a,T}) D_X(X_T | W_0, Λ_{b,T}).    (7)

Eliminate V_T by discarding the conditional density D_{V|X}(V_T | X_T, W_0, Λ_{a,T}) in (7), while retaining the marginal density D_X(X_T | W_0, Λ_{b,T}). µ must be a function of Λ_{b,T} alone, given by µ = f(Λ_{b,T}). A cut is required, so that (Λ_{a,T} : Λ_{b,T}) ∈ Λ_a × Λ_b.
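The requirement that µ be recoverable from Λ_{b,T} alone, together with the cut, is not automatic: discarding a correlated variable that matters changes the retained parameters. The sketch below (my own numerical illustration in Python; the coefficients and names are assumptions, not taken from the lecture) shows the marginal coefficient on x differing from the corresponding parameter of the joint analysis once v is dropped.

    # Sketch (assumed numbers): marginalizing with respect to a relevant,
    # correlated variable v alters the parameters of the retained model for x.
    import numpy as np

    rng = np.random.default_rng(1)
    T = 20000
    v = rng.normal(size=T)
    x = 0.8 * v + rng.normal(size=T)                 # x and v are correlated
    y = 1.0 * x + 0.5 * v + rng.normal(size=T)       # both matter for y

    b_joint = np.linalg.lstsq(np.column_stack([x, v]), y, rcond=None)[0]
    b_marg = np.linalg.lstsq(x[:, None], y, rcond=None)[0]

    print("joint analysis (x, v):", b_joint.round(2))    # close to (1.0, 0.5)
    print("marginal analysis (x):", b_marg.round(2))     # x's coefficient absorbs part of v's role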

2.7 Sequential factorization

To create the innovation process, sequentially factorize X_T as:

    D_X(X_T | W_0, Λ_{b,T}) = ∏_{t=1}^{T} D_x(x_t | X_{t-1}, W_0, λ_{b,t}).            (8)

The mean-innovation error process is ε_t = x_t - E[x_t | X_{t-1}].

Sequential factorization of W_T. Alternatively:

    D_W(W_T | W_0, φ_T) = ∏_{t=1}^{T} D_w(w_t | W_{t-1}, δ_t).                         (9)

The RHS innovation process is η_t = w_t - E[w_t | W_{t-1}].

Marginalizing with respect to V_{t-1}. Since W_{t-1} = {V_{t-1}, X_{t-1}, W_0}:

    D_w(w_t | W_{t-1}, δ_t) = D_{v|x}(v_t | x_t, W_{t-1}, δ_{a,t}) D_x(x_t | V_{t-1}, X_{t-1}, W_0, δ_{b,t}).    (10)

µ must be obtained from {δ_{b,t}} alone. Marginalize with respect to V_{t-1}:

    D_x(x_t | V_{t-1}, X_{t-1}, W_0, δ_{b,t}) = D_x(x_t | X_{t-1}, W_0, δ*_{b,t}).     (11)

There is no loss of information if and only if δ_{b,t} = δ*_{b,t} ∀t, so that the conditional, sequential distribution of {x_t} does not depend on V_{t-1} (Granger non-causality).

2.8 Mapping to I(0)

Needed to ensure conventional inference is valid, though many inferences will be valid even if this reduction is not enforced. Cointegration would need to be treated in a separate set of lectures.

2.9 Conditional factorization

Factorize the density of x_t into sets of n_1 and n_2 variables, where n_1 + n_2 = n:

    x_t' = (y_t' : z_t'),                                                              (12)

where the y_t are endogenous and the z_t are non-modelled:

    D_x(x_t | X_{t-1}, W_0, λ_{b,t}) = D_{y|z}(y_t | z_t, X_{t-1}, W_0, θ_{a,t}) D_z(z_t | X_{t-1}, W_0, θ_{b,t}).    (13)

z_t is weakly exogenous for µ if (i) µ = f(θ_{a,t}) alone; and (ii) (θ_{a,t}, θ_{b,t}) ∈ Θ_a × Θ_b.

2.10 Constancy

Complete parameter constancy is:

    θ_{a,t} = θ_a  ∀t                                                                  (14)

where θ_a ∈ Θ_a, so that µ is a function of θ_a: µ = f(θ_a). Then:

    ∏_{t=1}^{T} D_{y|z}(y_t | z_t, X_{t-1}, W_0, θ_a)                                  (15)

with θ_a ∈ Θ_a.
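The Granger non-causality condition in (11) is directly testable: lags of v_t should add no explanatory power for x_t given its own history. A minimal sketch of such a check on simulated data (my own illustration; the lag length, data and helper function are assumptions, not PcGets output) is:

    # Sketch of a Granger non-causality check corresponding to (10)-(11);
    # simulated data, lag length and helper names are illustrative assumptions.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    T, p = 300, 2                                # sample size and lag length
    v = rng.normal(size=T)
    x = np.zeros(T)
    for t in range(1, T):
        x[t] = 0.6 * x[t - 1] + rng.normal()     # v plays no role in generating x

    def lags(s, p):
        # lags 1..p of series s, aligned with observations p, ..., T-1
        return np.column_stack([s[p - j:len(s) - j] for j in range(1, p + 1)])

    y = x[p:]
    X0 = np.column_stack([np.ones(T - p), lags(x, p)])   # restricted: own history only
    X1 = np.column_stack([X0, lags(v, p)])               # unrestricted: adds lags of v

    def rss(X):
        return np.sum((y - X @ np.linalg.lstsq(X, y, rcond=None)[0]) ** 2)

    F = ((rss(X0) - rss(X1)) / p) / (rss(X1) / (len(y) - X1.shape[1]))
    print("F =", round(F, 2), " p-value =", round(stats.f.sf(F, p, len(y) - X1.shape[1]), 2))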

2.11 Lag truncation

Fix the extent of the history of X_{t-1} in (15) at s earlier periods:

    D_{y|z}(y_t | z_t, X_{t-1}, W_0, θ_a) = D_{y|z}(y_t | z_t, X_{t-s}^{t-1}, W_0, δ).    (16)

2.12 Functional form

Map y_t into y*_t = h(y_t) and z_t into z*_t = g(z_t), and denote the resulting data by X*. Assume that y*_t and z*_t simultaneously make D_{y|z}(·) approximately normal and homoscedastic, denoted N_{n_1}[η_t, Υ]:

    D_{y|z}(y_t | z_t, X_{t-s}^{t-1}, W_0, δ) = D_{y*|z*}(y*_t | z*_t, X*_{t-s}^{t-1}, W_0, γ).    (17)

2.13 The derived model

    A(L) h(y)_t = B(L) g(z)_t + ε_t                                                       (18)

where ε_t is approximately distributed as N_{n_1}[0, Σ_ε], and A(L) and B(L) are polynomial matrices (i.e., matrices whose elements are polynomials) of order s in the lag operator L. ε_t is a derived, and not an autonomous, process, defined by:

    ε_t = A(L) h(y)_t - B(L) g(z)_t.                                                      (19)

The reduction to the generic econometric equation involves all the stages of aggregation, marginalization, conditioning etc., transforming the parameters from ψ, which determines the stochastic features of the data, to the coefficients of the empirical model.

2.14 Dominance

Consider two distinct scalar empirical models denoted M_1 and M_2, with mean-innovation processes (MIPs) {ν_t} and {ε_t} relative to their own information sets, where ν_t and ε_t have constant, finite variances σ_ν^2 and σ_ε^2 respectively. Then M_1 variance dominates M_2 if σ_ν^2 < σ_ε^2, denoted by M_1 ≻ M_2. Variance dominance is transitive, since if M_1 ≻ M_2 and M_2 ≻ M_3 then M_1 ≻ M_3; and anti-symmetric, since if M_1 ≻ M_2 then it cannot be true that M_2 ≻ M_1. A model without a MIP error can be variance dominated by a model with a MIP on a common data set. The DGP cannot be variance dominated in the population by any models thereof (see e.g. Theil, 1971, p. 543). Let U_t denote the universe of information for the DGP and let X_t be the subset, with associated innovation sequences {ν_{u,t}} and {ν_{x,t}}. Then, as {X_t} ⊆ {U_t}, E[ν_{u,t} | X_t] = 0, whereas E[ν_{x,t} | U_t] need not be zero. A model with an innovation error cannot be variance dominated by a model which uses only a subset of the same information. If ε_t = x_t - E[x_t | X_{t-1}], then σ_ε^2 is no larger than the variance of any other empirical-model error defined by ξ_t = x_t - G[x_t | X_{t-1}], whatever the choice of G[·]: the conditional expectation is the minimum mean-square error predictor. These implications favour general rather than simple empirical models, given any choice of information set, and suggest modelling the conditional expectation. A model which nests all contending explanations as special cases must variance dominate in its class. Let model M_j be characterized by parameter vector ψ_j with κ_j elements; then, as in Hendry and Richard (1982): M_1 is parsimoniously undominated in the class {M_i} if ∀i, κ_1 ≤ κ_i and no M_i ≻ M_1. Model selection procedures (such as AIC or the Schwarz criterion: see Judge, Griffiths, Hill, Lütkepohl and Lee (1985)) seek parsimoniously undominated models, but do not check for congruence.
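The claim that the conditional expectation is the minimum mean-square error predictor, which underpins variance dominance, is easy to verify numerically. The following sketch (my own illustration; the AR(1) design and the alternative predictor are assumptions) compares the residual variance of E[x_t | x_{t-1}] with that of an arbitrary G[x_t | x_{t-1}].

    # Sketch (assumed AR(1) design): the conditional expectation yields the
    # smallest error variance among predictors based on the same information.
    import numpy as np

    rng = np.random.default_rng(3)
    T, rho = 5000, 0.7
    x = np.zeros(T)
    for t in range(1, T):
        x[t] = rho * x[t - 1] + rng.normal()

    e_mip = x[1:] - rho * x[:-1]            # error of E[x_t | x_{t-1}] under the DGP
    e_alt = x[1:] - (0.3 * x[:-1] + 0.2)    # error of an arbitrary choice of G[.]

    print("variance of MIP error:        ", round(e_mip.var(), 3))
    print("variance of alternative error:", round(e_alt.var(), 3))   # larger, up to sampling noise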

2.15 Econometric concepts as measures of no information loss

[1] Aggregation entails no loss of information on marginalizing with respect to disaggregates when the retained information comprises a set of sufficient statistics for the parameters of interest µ.
[2] Transformations per se do not entail any associated reduction, but directly introduce the concept of parameters of interest, and indirectly the notions that parameters should be invariant and identifiable.
[3] Data partition is a preliminary, although the decision about which variables to include and which to omit is perhaps the most fundamental determinant of the success or otherwise of empirical modelling.
[4] Marginalizing with respect to v_t is without loss provided the remaining data are sufficient for µ, whereas marginalizing without loss with respect to V_{t-1} entails both Granger non-causality for x_t and a cut in the parameters.
[5] Sequential factorization involves no loss if the derived error process is an innovation relative to the history of the random variables, and, via the notion of common factors, reveals that autoregressive errors are a restriction and not a generalization.
[6] Integrated data systems can be reduced to I(0) by suitable combinations of cointegration and differencing, allowing conventional inference procedures to be applied to more parsimonious relationships.
[7] Conditional factorization reductions, which eliminate marginal processes, lead to no loss of information relative to the joint analysis when the conditioning variables are weakly exogenous for the parameters of interest.
[8] Parameter constancy implicitly relates to invariance, as constancy across interventions which affect the marginal processes.
[9] Lag truncation involves no loss if the error process remains an innovation despite excluding some of the past of relevant variables.
[10] Functional form approximations need involve no reduction (e.g. logs of log-normally distributed variables): when the two densities in (17) are equal.
[11] The derived model, as a reduction of the DGP, is nested within that DGP and its properties are explained by the reduction process: knowledge of the DGP entails knowledge of all reductions thereof. When knowledge of one model entails knowledge of another, the first is said to encompass the second.

2.16 Implicit model design

This corresponds to the symptomatology approach in econometrics: testing for problems (autocorrelation, heteroscedasticity, omitted variables, multicollinearity, non-constant parameters etc.), and correcting these.

2.17 Explicit model design

Mimic reduction theory in practical research to minimize the losses due to the reductions selected: this leads to Gets modelling.

2.18 A taxonomy of evaluation information

Partition the data X_T^1 used in modelling into the three information sets:

    X_T^1 = (X_{t-1}^1 : x_t : X_{t+1}^T)                                              (20)

[a] past data;
[b] present data;
[c] future data;

[d] theory information, which often is the source of parameters of interest, and is a creative stimulus in economics;
[e] measurement information, including price index theory, constructed identities such as consumption equals income minus savings, data accuracy and so on; and:
[f] data of rival models, which could be analyzed into past, present and future in turn.

The six main criteria which result for selecting an empirical model are:
[a] homoscedastic innovation errors;
[b] weakly exogenous conditioning variables for the parameters of interest;
[c] constant, invariant parameters of interest;
[d] theory-consistent, identifiable structures;
[e] data-admissible formulations on accurate observations; and
[f] encompassing of rival models.
Models which satisfy the first five information sets are said to be congruent: an encompassing congruent model satisfies all six criteria.

3 General-to-specific modelling

The practical embodiment of reduction is general-to-specific (Gets) modelling. The DGP is replaced by the concept of the local DGP (LDGP), namely the joint distribution of the subset of variables under analysis. Then a general unrestricted model (GUM) is formulated to provide a congruent approximation to the LDGP, given the theoretical and previous empirical background. The empirical analysis commences from this general specification after testing for mis-specifications, and, if none are apparent, it is simplified to a parsimonious, congruent representation, each simplification step being checked by diagnostic testing. Simplification can be done in many ways: although the goodness of a model is intrinsic to it, and not a property of the selection route, poor routes seem unlikely to deliver useful models. Even so, some economists worry about the impact of selection rules on the properties of the resulting models, and insist on the use of a priori specifications: but these require knowledge of the answer before we start, so deny empirical modelling any useful role, and in practice that approach has rarely contributed.

Few studies have investigated how well general-to-specific modelling does. However, Hoover and Perez (1999) offer important evidence in a major Monte Carlo study, reconsidering the Lovell (1983) experiments. They place 20 macro variables in a databank; generate one (y) as a function of 0 to 5 others; regress y on all 20 plus all lags thereof; then let their algorithm simplify that GUM till it finds a congruent (encompassing) irreducible result. They check up to 10 different paths, testing for mis-specification, collect the results from each, then select one choice from the remainder. By following many paths, the algorithm is protected against chance false routes, and delivers an undominated congruent model. Nevertheless, Hendry and Krolzig (1999b) improve on their algorithm in several important respects, and this section now describes these improvements.

3.1 Pre-search reductions

First, groups of variables are tested in the order of their absolute t-values, commencing with a block where all the p-values exceed 0.9, and continuing down towards the pre-assigned selection criterion, when deletion must become inadmissible. A less-stringent significance level is used at this step, usually 10%, since the insignificant variables are deleted permanently. If no test is significant, the F-test on all variables in the GUM has in effect been calculated, establishing that there is nothing to model.
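A stripped-down version of this pre-search step can be written in a few lines. The sketch below (an illustrative Python simplification under assumed data; it is not the PcGets/Ox implementation and omits diagnostic tracking) orders regressors by their absolute t-values in the GUM and deletes the largest low-|t| block whose joint F-test is insignificant at a loose pre-search level.

    # Illustrative pre-search block reduction (simplified; not the PcGets algorithm):
    # order regressors by |t| in the GUM and drop the largest block of small-|t|
    # variables whose joint deletion F-test is insignificant at level gamma.
    import numpy as np
    from scipy import stats

    def ols(y, X):
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        e = y - X @ b
        s2 = e @ e / (len(y) - X.shape[1])
        se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
        return b, se, e @ e

    def presearch(y, X, gamma=0.10):
        T, k = X.shape
        b, se, rss_full = ols(y, X)
        order = np.argsort(np.abs(b / se))              # least significant first
        for m in range(k - 1, 0, -1):                   # try the largest block first
            drop = set(order[:m].tolist())
            keep = [j for j in range(k) if j not in drop]
            _, _, rss_r = ols(y, X[:, keep])
            F = ((rss_r - rss_full) / m) / (rss_full / (T - k))
            if stats.f.sf(F, m, T - k) > gamma:         # joint deletion not rejected
                return keep
        return list(range(k))

    # Example GUM: intercept plus 20 regressors, of which only 3 are relevant
    rng = np.random.default_rng(4)
    T = 140
    Z = rng.normal(size=(T, 20))
    y = Z[:, :3] @ np.array([0.5, 0.4, 0.3]) + rng.normal(size=T)
    print("retained columns:", presearch(y, np.column_stack([np.ones(T), Z])))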

3.2 Additional paths

Blocks of variables constitute feasible search paths, in addition to individual coefficients: these resemble the block F-tests in the preceding sub-section, but are conducted along search paths. All paths that commence with an insignificant t-deletion are also explored.

3.3 Encompassing

Encompassing tests select between the candidate congruent models at the end of the path searches. Each contender is tested against their union, dropping those which are dominated by, and do not dominate, another contender. If a unique model results, select that; otherwise, if some are rejected, form the union of the remaining models, and repeat this round till no encompassing reductions result. That union then constitutes a new starting point, and the complete path-search algorithm repeats till the union is unchanged between successive rounds.

3.4 Information criteria

When a union coincides with the original GUM, or with a previous union, so that no further feasible reductions can be found, PcGets selects a model by an information criterion. The preferred final-selection rule presently is the Schwarz criterion, or BIC, defined as:

    SC = -2 log L/T + p log(T)/T,

where L is the maximized likelihood, p is the number of parameters and T is the sample size. For T = 140 and p = 40, minimum SC corresponds approximately to retaining the marginal regressor satisfying |t| ≥ 1.9.

3.5 Sub-sample reliability

For the finally-selected model, sub-sample reliability is evaluated by the Hoover-Perez overlapping split-sample test. PcGets concludes that some variables are definitely excluded, some definitely included, and some have an uncertain role, varying from a reliability of 25% (included in the final model, but insignificant overall and in both sub-samples), through to 75% (significant overall and in one sub-sample, or in both sub-samples).

3.6 Significant mis-specification tests

If the initial mis-specification tests are significant at the pre-specified level, we raise the required significance level, terminating search paths only when that higher level is violated. Empirical investigators would re-specify the GUM on rejection. To see why Gets does well, we develop the analytics for several of its procedures.

4 The econometrics of model selection

The key issue for any model-selection procedure is the cost of search, since there are always bound to be mistakes in statistical inference: specifically, how bad is it to search across many alternatives? The conventional statistical analysis of repeated testing provides a pessimistic background: every test has a non-zero null rejection frequency (or size, if independent of nuisance parameters), and so type I errors

accumulate. Setting a small size for every test can induce low power to detect the influences that really matter. Critics of general-to-specific methods have pointed to a number of potential difficulties, including the problems of lack of identification, measurement without theory, data mining, pre-test biases, ignoring selection effects, repeated testing, and the potential path dependence of any selection: see, inter alia, Faust and Whiteman (1997), Koopmans (1947), Lovell (1983), Judge and Bock (1978), Leamer (1978), Hendry, Leamer and Poirier (1990), and Pagan (1987). The following discussion draws on Hendry (2000a).

Koopmans's critique followed up the earlier attack by Keynes (1939, 1940) on Tinbergen (1940a, 1940b), and set the scene for doubting all econometric analyses that failed to commence from pre-specified models. Lovell's study of trying to select a small relation (zero to five regressors) hidden in a large database (40 variables) found a low success rate, thereby suggesting that search procedures had high costs, and supporting an adverse view of data-based model selection. The third criticism concerned applying significance tests to select variables, arguing that the resulting estimator was biased in general by being a weighted average of zero (when the variable was excluded) and an unbiased coefficient (on inclusion). The fourth concerned biases in reported coefficient standard errors from treating the selected model as if there was no uncertainty in the choice. The next argued that the probability of retaining variables that should not enter a relationship would be high, because a multitude of tests on irrelevant variables must deliver some significant outcomes. The sixth suggested that how a model was selected affected its 'credibility': at its extreme, we find the claim in Leamer (1983) that 'the mapping is the message', emphasizing the selection process over the properties of the final choice. In the face of this barrage of criticism, many economists came to doubt the value of empirical evidence, even to the extent of referring to it as 'a scientific illusion' (Summers, 1991).

The upshot of these attacks on empirical research was that almost all econometric studies had to commence from pre-specified models (or pretend they did). Summers (1991) failed to notice that this was the source of his claimed 'scientific illusion': econometric evidence had become theory dependent, with little value added, and a strong propensity to be discarded when fashions in theory changed. Much empirical evidence only depends on low-level theories, which are part of the background knowledge base not subject to scrutiny in the current analysis, so a data-based approach to studying the economy is feasible. Since theory dependence has at least as many drawbacks as sample dependence, data modelling procedures are essential: see Hendry (1995a). Indeed, all of these criticisms are refutable, as we now show.

First, identification has three attributes, as discussed in Hendry (1997), namely uniqueness, satisfying the required interpretation, and correspondence to the desired entity. A non-unique result is clearly not identified, so the first attribute is necessary, but insufficient, since uniqueness can be achieved by arbitrary restrictions (criticized by Sims, 1980, inter alia). There can exist a unique combination of several relationships which is incorrectly interpreted as one of those equations: e.g., a reduced form that has a positive price effect, wrongly interpreted as a supply relation.
Finally, a unique, interpretable model of (say) a money-demand relation may in fact correspond to a Central Bank's supply schedule, and this too is sometimes called a failure to identify the demand relation. Because economies are highly interdependent, simultaneity was long believed to be a serious problem, but higher frequencies of observation have attenuated this problem. Anyway, simultaneity is not invariant under linear transformations (although linear systems are), so it can be avoided by eschewing contemporaneous regressors until weak exogeneity is established. Conditioning ensures a unique outcome, although it cannot guarantee that the resulting model corresponds to the underlying reality.

Next, Keynes appears to have believed that statistical work in economics is impossible without

knowledge of everything in advance. But if partial explanations are devoid of use, and empirically we could discover nothing not already known, then no science could have progressed. That is clearly refuted by the historical record. The fallacy in Keynes's argument is that, since theoretical models are incomplete and incorrect, an econometrics that is forced to use such theories as the only permissible starting point for data analysis can contribute little useful knowledge, except perhaps rejecting the theories. When invariant features of reality exist, progressive research can discover them in part without prior knowledge of the whole: see Hendry (1995b). A similar analysis applies to the attack in Koopmans on the study by Burns and Mitchell: he relies on the (unstated) assumption that only one sort of economic theory is applicable, that it is correct, and that it is immutable (see Hendry and Morgan, 1995).

Data mining is revealed when conflicting evidence exists or when rival models cannot be encompassed; and if they can, then an undominated model results despite the inappropriate procedure. Thus, stringent critical evaluation renders the data-mining criticism otiose. Gilbert (1986) suggests separating output into two groups: the first contains only redundant results (those parsimoniously encompassed by the finally-selected model), and the second contains all other findings. If the second group is not null, then there has been data mining. On such a characterization, Gets cannot involve data mining, despite being heavily based on the data.

When the LDGP is known a priori from economic theory, but an investigator did not know that the resulting model was in fact true, so sought to test conventional null hypotheses on its coefficients, then inferential mistakes will occur in general. These will vary as a function of the characteristics of the LDGP, and of the particular data sample drawn, but for many parameter values the selected model will differ from the LDGP, and hence have biased coefficients. This is the pre-test problem, and is quite distinct from the costs of searching across a general set of specifications for a congruent representation of the LDGP.

If a wide variety of models would be reported when applying any given selection procedure to different samples from a common DGP, then the results using a single sample apparently understate the true uncertainty. Coefficient standard errors only reflect sampling variation conditional on a fixed specification, with no additional terms from changes in that specification (see e.g., Chatfield, 1995). Thus, reported empirical estimates must be judged conditional on the resulting equation being a good approximation to the LDGP. Undominated (i.e., encompassing) congruent models have a strong claim to provide such an approximation, and, conditional on that, their reported uncertainty is a good measure of the uncertainty inherent in such a specification for the relevant LDGP.

The theory of repeated testing is easily understood: the probability p_α that none of n independent tests of correct null hypotheses rejects at the 100α% level is p_α = (1 - α)^n. When 40 such tests are conducted at α = 0.05, p_0.05 ≈ 0.13, whereas p_0.01 ≈ 0.67. However, it is difficult to obtain spurious t-test values much in excess of three despite repeated testing: as Sargan (1981) pointed out, the t-distribution is thin tailed, so even the 1% critical value is less than three for 50 degrees of freedom. Unfortunately, stringent criteria for avoiding rejections when the null is true lower the power of rejection when it is false.
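The quoted repeated-testing probabilities are straightforward to reproduce; the short sketch below (illustrative Python, with the significance levels assumed to be 5% and 1%) evaluates p_α = (1 - α)^n for n = 40, and the overall false-rejection frequency of a battery of four 5% diagnostic tests discussed next.

    # Repeated-testing arithmetic: p_alpha = (1 - alpha)^n for n independent
    # tests of correct nulls (the alpha values of 5% and 1% are assumed).
    n = 40
    for alpha in (0.05, 0.01):
        print(f"alpha = {alpha:.2f}: P(no rejection in {n} tests) = {(1 - alpha) ** n:.2f}")

    # A battery of four independent diagnostic tests, each at 5%
    print("overall false-rejection frequency:", round(1 - 0.95 ** 4, 3))   # about 0.19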
The logic of repeated testing is accurate as a description of the statistical properties of mis-specification testing: conducting four independent diagnostic tests at 5% will lead to about 19% false rejections. Nevertheless, even in that context there are possible solutions, such as using a single combined test, which can substantially lower the size without too great a power loss (see e.g., Godfrey and Veale, 1999). It is less clear that the analysis is a valid characterization of selection procedures in general when more than one path is searched, since there is then error correction for wrong reductions. In fact, the serious practical difficulty is not one of avoiding

spuriously significant regressors because of repeated testing when many hypotheses are tested; it is retaining all the variables that genuinely matter. Path dependence arises when the results obtained in a modelling exercise depend on the simplification sequence adopted. Since the quality of a model is intrinsic to it, and progressive research induces a sequence of mutually-encompassing congruent models, proponents of Gets consider that the path adopted is unlikely to matter. As Hendry and Mizon (1990) expressed the matter: 'the model is the message'. Nevertheless, it must be true that some simplifications lead to poorer representations than others. One aspect of the value-added of the approach discussed below is that it ensures a unique outcome, so the path does not matter.

We conclude that each of these criticisms of Gets can be refuted. Indeed, White (1990) showed that, with sufficiently-rigorous testing, the selected model will converge to the DGP. Thus, any overfitting and mis-specification problems are primarily finite sample. Moreover, Mayo (1981) emphasized the importance of diagnostic test information being effectively independent of the sufficient statistics from which parameter estimates are derived. Hoover and Perez (1999) show how much better Gets is than any method Lovell considered, suggesting that modelling per se need not be bad. Indeed, overall, the size of their selection procedure is close to that expected, and the power is reasonable. Moreover, re-running their experiments using our version (PcGets) delivered substantively better outcomes (see Hendry and Krolzig, 1999b). Thus, the case against model selection is far from proved.

4.1 Search costs

Let p_i^dgp denote the probability of retaining the i-th variable out of k when commencing from the DGP specification and applying the relevant selection test at the same significance level as the search procedure. Then 1 - p_i^dgp is the expected cost of inference. For irrelevant variables, p_i^dgp = 0, so that the whole cost for those is attributed to search. Let p_i^gum denote the probability of retaining the i-th variable when commencing from the GUM, and applying the same selection test and significance level. Then the search costs are p_i^dgp - p_i^gum. False rejection frequencies of the null can be lowered by tightening the significance levels of the selection tests, but only at the cost of also reducing power. However, it is feasible to lower the former and raise the latter simultaneously by an improved search algorithm, subject to the bound of attaining the same performance as knowing the DGP from the outset.

To keep search costs low, any model-selection process must satisfy a number of requirements. First, it must start from a congruent statistical model to ensure that selection inferences are reliable: consequently, it must test for model mis-specification initially, and such tests must be well calibrated (nominal size close to actual). Secondly, it must avoid getting stuck in search paths that initially inadvertently delete relevant variables, thereby retaining many other variables as proxies: consequently, it must search many paths. Thirdly, it must check that eliminating variables does not induce diagnostic tests to become significant during searches: consequently, model mis-specification tests must be computed at every stage. Fourthly, it must ensure that any candidate model parsimoniously encompasses the GUM, so that no loss of information has occurred.
Fifthly, it must have a high probability of retaining relevant variables: consequently, a loose significance level and powerful selection tests are required. Sixthly, it must have a low probability of retaining variables that are actually irrelevant: this clashes in part with the fifth objective, and consequently requires an alternative use of the available information. Finally, it must have powerful procedures to select between the candidate models, and any models derived from them, to end with a good model choice, namely one for which:

    L = Σ_{i=1}^{k} | p_i^dgp - p_i^gum |

is close to zero.

4.2 Selection probabilities

When searching a large database for the DGP, an investigator could well retain the relevant regressors much less often than when the correct specification is known, in addition to retaining irrelevant variables in the finally-selected model. We first examine the problem of retaining significant variables commencing from the DGP, then turn to any additional power losses resulting from search. For a regression coefficient β_i, hypothesis tests of the null H_0: β_i = 0 will reject with a probability dependent on the non-centrality parameter of the test. We consider the slightly more general setting where t-tests are used to check an hypothesis, denoted t(n, ψ) for n degrees of freedom, where ψ is the non-centrality parameter, equal to zero under the null. For a critical value c_α, P(|t| ≥ c_α | H_0) = α, where H_0 implies ψ = 0. The following table records some approximate power calculations when one coefficient null hypothesis is tested and when four are tested, in each case precisely once.

[Table: t-test powers, with columns for ψ, n, α, P(|t| ≥ c_α) for a single test, and the probability that all four of four such independent tests reject; the numerical entries were not preserved in this transcription.]

Thus, there is little hope of retaining variables with ψ = 1, and only about a 50% chance of retaining a single variable with a theoretical t of 2 when the critical value is also 2, falling further for a critical value of 2.6. When ψ = 3, the power of detection is sharply higher, but still leads to more than 35% mis-classifications. Finally, when ψ = 4, one such variable will almost always be retained. However, the final column shows that the probability of retaining all four relevant variables with the given non-centrality is essentially negligible even when they are independent, except in the last few cases. Mixed cases (with different values of ψ) can be calculated by multiplying the probabilities in the fourth column (e.g., for ψ = 2, 3, 4, 6 at α = 0.01). Such combined probabilities are highly non-linear in ψ, since one is almost certain to retain all four when ψ = 6, even at a 1% significance level. The important conclusion is that, despite knowing the DGP, low signal-to-noise variables will rarely be retained using t-tests when there is any need to test the null; and if there are many relevant variables, all of them are unlikely to be retained even when they have quite large non-centralities.

4.3 Deletion probabilities

The most extreme case where low deletion probabilities might entail high search costs is when many variables are included but none actually matters. PcGets systematically checks the reducibility of the GUM by testing simplifications up to the empty model. A one-off F-test F_G of the GUM against the null model using critical value c_γ would have size P(F_G ≥ c_γ) = γ under the null if it were the only test implemented. Consequently, path searches would only commence 100γ% of the time, and some of these could also terminate at the null model. Let there be k regressors in the GUM, of which n are retained

when t-test selection is used, should the null model be rejected. In general, when there are no relevant variables, the probability of retaining no variables using t-tests with critical value c_α is:

    P(|t_i| < c_α, i = 1, ..., k) = (1 - α)^k.                                        (21)

Combining (21) with the F_G-test, the null model will be selected with approximate probability:

    p_G = (1 - γ) + γ (1 - α)^k,                                                      (22)

where the second term approximates the probability of F_G rejecting yet no regressors being retained (conditioning on F_G ≥ c_γ cannot decrease the probability of at least one rejection, so (22) is an upper bound). Since γ is set at quite a high value, such as 0.20, whereas α = 0.05 is more usual, F_G ≥ c_0.20 can occur without any |t_i| ≥ c_0.05. Evaluating (22) for γ = 0.20, α = 0.05 and k = 20 yields p_G ≈ 0.87; whereas the re-run of the Hoover-Perez experiments with k = 40 reported by Hendry and Krolzig (1999b) using γ = 0.01 yielded 97.2% in the Monte Carlo, as against a theory prediction from (22) of 99%. Alternatively, when γ = 0.1 and α = 0.01, (22) has an upper bound of 96.7%, falling to 91.3% for α = 0.05. Thus, it is relatively easy to obtain a high probability of locating the null model, even when 40 irrelevant variables are included, using relatively tight significance levels, or a reasonable probability for looser significance levels.

4.4 Path selection probabilities

We now calculate how many spurious regressors will be retained in path searches. The probability distribution of one or more null coefficients being significant in pure t-test selection at significance level α is given by the k + 1 terms of the binomial expansion of (α + (1 - α))^k. The following table illustrates by enumeration for k = 3:

    event                                    probability       number retained
    P(|t_i| < c_α, i = 1, ..., 3)            (1 - α)^3          0
    P(|t_i| ≥ c_α, |t_j| < c_α, j ≠ i)       3α(1 - α)^2        1
    P(|t_i| < c_α, |t_j| ≥ c_α, j ≠ i)       3(1 - α)α^2        2
    P(|t_i| ≥ c_α, i = 1, ..., 3)            α^3                3

Thus, for k = 3, the average number of variables retained is:

    n = 3α^3 + 2 × 3(1 - α)α^2 + 3α(1 - α)^2 = 3α = kα.

The result n = kα is general. When α = 0.05 and k = 40, n equals 2, falling to 0.4 for α = 0.01: so even if only t-tests are used, few spurious variables will be retained. Combining the probability of a non-null model with the number of variables selected when the GUM F-test rejects gives p = γα (where p is the probability that any given variable will be retained), which does not depend on k. For γ = 0.1 and α = 0.01, we have p = 0.001. Even for γ = 0.25 and α = 0.05, p = 0.0125, before search paths and diagnostic testing are included in the algorithm. The actual behaviour of PcGets is much more complicated than this, but can deliver a small overall size. Following the event F_G ≥ c_γ when γ = 0.1 (so the null is incorrectly rejected 10% of the time), and approximating by one variable retained when

that occurs, the average non-deletion probability (i.e., the probability that any given variable will be retained) is p_r = γn/k = 0.25%, as against the reported value of 0.19% found by Hendry and Krolzig (1999b). These are very small retention rates of spuriously-significant variables. Thus, in contrast to the relatively high costs of inference discussed in the previous section, those of search arising from retaining additional irrelevant variables are almost negligible. For a reasonable GUM with (say) 40 variables where 25 are irrelevant, even without the pre-selection and multiple path searches of PcGets, and using just t-tests at 5%, roughly one spuriously significant variable will be retained by chance. Against that, from the previous section, there is at most a 50% chance of retaining each of the variables that have non-centralities around 2, and little chance of keeping them all: the difficult problem is retention of relevance, not elimination of irrelevance. The only two solutions are better inference procedures, or looser critical values; we will consider them both.

4.5 Improved inference procedures

An inference procedure involves a sequence of steps. As a simple example, consider a procedure comprising two F-tests: the first is conducted at the γ = 50% level, the second at δ = 5%. The variables to be tested are first ordered by their t-values in the GUM, such that t_1^2 ≤ t_2^2 ≤ ... ≤ t_k^2, and the first F-test adds in variables from the smallest observed t-values till a rejection would occur, with either F_1 > c_γ or an individual |t| > c_α (say). All those variables except the last are then deleted from the model, and a second F-test is conducted of the joint significance of all the remaining variables. If that rejects, so F_2 > c_δ, all the remaining variables are retained; otherwise, all are eliminated. We will now analyze the probability properties of this two-step test when all k regressors are orthogonal, for a regression model estimated from T observations.

Once m variables are included in the first step, non-rejection requires that (a) the diagnostics are insignificant; (b) the first m - 1 variables did not induce rejection; (c) |t_m| < c_α; and (d):

    F_1(m, T - k) = (1/m) Σ_{i=1}^{m} t_i^2 ≤ c_γ.                                    (23)

Clearly, any t_i^2 < 1 reduces the mean F statistic, and since P(|t_i| < 1) ≈ 0.68, when k = 40 approximately 28 variables fall in that group; and P(|t_i| ≥ 1.65) ≈ 0.1, so only 4 variables should chance to have a larger |t_i| value on average. In the conventional setting where α = 0.05, with P(|t_i| < 2) ≈ 0.95, only 2 variables will chance to have larger t-values, whereas slightly more than half will have t_i^2 of 0.5 or smaller. Since P(F(20, 100) < 1 | H_0) ≈ 0.5, a first step with γ = 0.5 should eliminate all variables with t_i^2 ≤ 1, and some larger t-values as well, hence the need to check that |t_m| < c_α (below we explain why collinearity between variables that matter and those that do not should not jeopardize this step). A crude approximation to the likely value of (23) under H_0 is to treat all t-values within blocks as having a value equal to the block mid-point. We use the five ranges t_i^2 < 0.5, 0.5-1, 1-1.65^2, 1.65^2-4, and greater than 4, together with the expected numbers falling in each of the first four blocks, which yields F(38, 100) ≈ 0.84, noting that P(F(38, 100) < 0.84 | H_0) ≈ 0.72 (setting all t-values equal to the upper bound of each block yields an illustrative upper bound of about 1.3 for F). Thus, surprisingly large values of γ, such as 0.75, can be selected for this step yet have a high probability of eliminating almost all the irrelevant variables.
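The probability calculations underlying sections 4.2 to 4.4 can all be reproduced with a few lines of code. The sketch below (illustrative Python; the residual degrees of freedom are assumed to be T - k = 100, and the grids of non-centralities and significance levels are my choices) computes the non-central-t powers behind the selection-probability table, the null-model probability in (22), and the average number kα of spuriously retained regressors.

    # Sketch reproducing the probability calculations of sections 4.2-4.4
    # (degrees of freedom, non-centralities and levels are assumptions).
    from scipy import stats

    df = 100                                    # roughly T - k for T = 140, k = 40
    for alpha in (0.05, 0.01):
        c = stats.t.ppf(1 - alpha / 2, df)      # two-sided critical value
        print(f"alpha = {alpha}:")
        for psi in (1, 2, 3, 4, 6):
            power = stats.nct.sf(c, df, psi) + stats.nct.cdf(-c, df, psi)
            print(f"  psi = {psi}: P(|t| >= c) = {power:.2f},  all four retained = {power ** 4:.3f}")

    # Equation (22): probability of selecting the null model
    p_G = lambda gamma, alpha, k: (1 - gamma) + gamma * (1 - alpha) ** k
    print("p_G(gamma=0.20, alpha=0.05, k=20) =", round(p_G(0.20, 0.05, 20), 2))   # about 0.87
    print("p_G(gamma=0.01, alpha=0.05, k=40) =", round(p_G(0.01, 0.05, 40), 2))   # about 0.99

    # Average number of spuriously retained regressors under pure t-test selection
    k = 40
    for alpha in (0.05, 0.01):
        print(f"k * alpha at alpha = {alpha}:", k * alpha)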
Indeed, using γ = 0.75 entails c_γ ≈ 0.75 when m = 20, since:

    P(F(20, 100) ≥ 0.75 | H_0) ≈ 0.75,

REED TUTORIALS (Pty) LTD ECS3706 EXAM PACK

REED TUTORIALS (Pty) LTD ECS3706 EXAM PACK REED TUTORIALS (Pty) LTD ECS3706 EXAM PACK 1 ECONOMETRICS STUDY PACK MAY/JUNE 2016 Question 1 (a) (i) Describing economic reality (ii) Testing hypothesis about economic theory (iii) Forecasting future

More information

Covers Chapter 10-12, some of 16, some of 18 in Wooldridge. Regression Analysis with Time Series Data

Covers Chapter 10-12, some of 16, some of 18 in Wooldridge. Regression Analysis with Time Series Data Covers Chapter 10-12, some of 16, some of 18 in Wooldridge Regression Analysis with Time Series Data Obviously time series data different from cross section in terms of source of variation in x and y temporal

More information

Testing for Unit Roots with Cointegrated Data

Testing for Unit Roots with Cointegrated Data Discussion Paper No. 2015-57 August 19, 2015 http://www.economics-ejournal.org/economics/discussionpapers/2015-57 Testing for Unit Roots with Cointegrated Data W. Robert Reed Abstract This paper demonstrates

More information

The Linear Regression Model

The Linear Regression Model The Linear Regression Model Carlo Favero Favero () The Linear Regression Model 1 / 67 OLS To illustrate how estimation can be performed to derive conditional expectations, consider the following general

More information

A Non-Parametric Approach of Heteroskedasticity Robust Estimation of Vector-Autoregressive (VAR) Models

A Non-Parametric Approach of Heteroskedasticity Robust Estimation of Vector-Autoregressive (VAR) Models Journal of Finance and Investment Analysis, vol.1, no.1, 2012, 55-67 ISSN: 2241-0988 (print version), 2241-0996 (online) International Scientific Press, 2012 A Non-Parametric Approach of Heteroskedasticity

More information

Testing Restrictions and Comparing Models

Testing Restrictions and Comparing Models Econ. 513, Time Series Econometrics Fall 00 Chris Sims Testing Restrictions and Comparing Models 1. THE PROBLEM We consider here the problem of comparing two parametric models for the data X, defined by

More information

Empirical Model Discovery

Empirical Model Discovery Automatic Methods for Empirical Model Discovery p.1/151 Empirical Model Discovery David F. Hendry Department of Economics, Oxford University Statistical Science meets Philosophy of Science Conference LSE,

More information

ECOM 009 Macroeconomics B. Lecture 3

ECOM 009 Macroeconomics B. Lecture 3 ECOM 009 Macroeconomics B Lecture 3 Giulio Fella c Giulio Fella, 2014 ECOM 009 Macroeconomics B - Lecture 3 84/197 Predictions of the PICH 1. Marginal propensity to consume out of wealth windfalls 0.03.

More information

Inflation Revisited: New Evidence from Modified Unit Root Tests

Inflation Revisited: New Evidence from Modified Unit Root Tests 1 Inflation Revisited: New Evidence from Modified Unit Root Tests Walter Enders and Yu Liu * University of Alabama in Tuscaloosa and University of Texas at El Paso Abstract: We propose a simple modification

More information

DEPARTMENT OF ECONOMICS DISCUSSION PAPER SERIES

DEPARTMENT OF ECONOMICS DISCUSSION PAPER SERIES ISSN 1471-0498 DEPARTMENT OF ECONOMICS DISCUSSION PAPER SERIES STEP-INDICATOR SATURATION Jurgen A. Doornik, David F. Hendry and Felix Pretis Number 658 June 2013 Manor Road Building, Manor Road, Oxford

More information

Econometría 2: Análisis de series de Tiempo

Econometría 2: Análisis de series de Tiempo Econometría 2: Análisis de series de Tiempo Karoll GOMEZ kgomezp@unal.edu.co http://karollgomez.wordpress.com Segundo semestre 2016 IX. Vector Time Series Models VARMA Models A. 1. Motivation: The vector

More information

Evaluating Automatic Model Selection

Evaluating Automatic Model Selection Evaluating Automatic Model Selection Jennifer L. Castle, Jurgen A. Doornik and David F. Hendry Department of Economics, University of Oxford, UK Abstract We outline a range of criteria for evaluating model

More information

MULTIPLE TIME SERIES MODELS

MULTIPLE TIME SERIES MODELS MULTIPLE TIME SERIES MODELS Patrick T. Brandt University of Texas at Dallas John T. Williams University of California, Riverside 1. INTRODUCTION TO MULTIPLE TIME SERIES MODELS Many social science data

More information

Econ 423 Lecture Notes: Additional Topics in Time Series 1

Econ 423 Lecture Notes: Additional Topics in Time Series 1 Econ 423 Lecture Notes: Additional Topics in Time Series 1 John C. Chao April 25, 2017 1 These notes are based in large part on Chapter 16 of Stock and Watson (2011). They are for instructional purposes

More information

Economics 308: Econometrics Professor Moody

Economics 308: Econometrics Professor Moody Economics 308: Econometrics Professor Moody References on reserve: Text Moody, Basic Econometrics with Stata (BES) Pindyck and Rubinfeld, Econometric Models and Economic Forecasts (PR) Wooldridge, Jeffrey

More information

Problems in model averaging with dummy variables

Problems in model averaging with dummy variables Problems in model averaging with dummy variables David F. Hendry and J. James Reade Economics Department, Oxford University Model Evaluation in Macroeconomics Workshop, University of Oslo 6th May 2005

More information

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS Page 1 MSR = Mean Regression Sum of Squares MSE = Mean Squared Error RSS = Regression Sum of Squares SSE = Sum of Squared Errors/Residuals α = Level

More information

This is a repository copy of The Error Correction Model as a Test for Cointegration.

This is a repository copy of The Error Correction Model as a Test for Cointegration. This is a repository copy of The Error Correction Model as a Test for Cointegration. White Rose Research Online URL for this paper: http://eprints.whiterose.ac.uk/9886/ Monograph: Kanioura, A. and Turner,

More information

1 Motivation for Instrumental Variable (IV) Regression

1 Motivation for Instrumental Variable (IV) Regression ECON 370: IV & 2SLS 1 Instrumental Variables Estimation and Two Stage Least Squares Econometric Methods, ECON 370 Let s get back to the thiking in terms of cross sectional (or pooled cross sectional) data

More information

Econometrics Summary Algebraic and Statistical Preliminaries

Econometrics Summary Algebraic and Statistical Preliminaries Econometrics Summary Algebraic and Statistical Preliminaries Elasticity: The point elasticity of Y with respect to L is given by α = ( Y/ L)/(Y/L). The arc elasticity is given by ( Y/ L)/(Y/L), when L

More information

EC408 Topics in Applied Econometrics. B Fingleton, Dept of Economics, Strathclyde University

EC408 Topics in Applied Econometrics. B Fingleton, Dept of Economics, Strathclyde University EC408 Topics in Applied Econometrics B Fingleton, Dept of Economics, Strathclyde University Applied Econometrics What is spurious regression? How do we check for stochastic trends? Cointegration and Error

More information

Applied Econometrics (MSc.) Lecture 3 Instrumental Variables

Applied Econometrics (MSc.) Lecture 3 Instrumental Variables Applied Econometrics (MSc.) Lecture 3 Instrumental Variables Estimation - Theory Department of Economics University of Gothenburg December 4, 2014 1/28 Why IV estimation? So far, in OLS, we assumed independence.

More information

Lecture 5: Unit Roots, Cointegration and Error Correction Models The Spurious Regression Problem

Lecture 5: Unit Roots, Cointegration and Error Correction Models The Spurious Regression Problem Lecture 5: Unit Roots, Cointegration and Error Correction Models The Spurious Regression Problem Prof. Massimo Guidolin 20192 Financial Econometrics Winter/Spring 2018 Overview Stochastic vs. deterministic

More information

405 ECONOMETRICS Chapter # 11: MULTICOLLINEARITY: WHAT HAPPENS IF THE REGRESSORS ARE CORRELATED? Domodar N. Gujarati

405 ECONOMETRICS Chapter # 11: MULTICOLLINEARITY: WHAT HAPPENS IF THE REGRESSORS ARE CORRELATED? Domodar N. Gujarati 405 ECONOMETRICS Chapter # 11: MULTICOLLINEARITY: WHAT HAPPENS IF THE REGRESSORS ARE CORRELATED? Domodar N. Gujarati Prof. M. El-Sakka Dept of Economics Kuwait University In this chapter we take a critical

More information

Volume 30, Issue 1. The relationship between the F-test and the Schwarz criterion: Implications for Granger-causality tests

Volume 30, Issue 1. The relationship between the F-test and the Schwarz criterion: Implications for Granger-causality tests Volume 30, Issue 1 The relationship between the F-test and the Schwarz criterion: Implications for Granger-causality tests Erdal Atukeren ETH Zurich - KOF Swiss Economic Institute Abstract In applied research,

More information

Oil price and macroeconomy in Russia. Abstract

Oil price and macroeconomy in Russia. Abstract Oil price and macroeconomy in Russia Katsuya Ito Fukuoka University Abstract In this note, using the VEC model we attempt to empirically investigate the effects of oil price and monetary shocks on the

More information

Christopher Dougherty London School of Economics and Political Science

Christopher Dougherty London School of Economics and Political Science Introduction to Econometrics FIFTH EDITION Christopher Dougherty London School of Economics and Political Science OXFORD UNIVERSITY PRESS Contents INTRODU CTION 1 Why study econometrics? 1 Aim of this

More information

Stationarity and cointegration tests: Comparison of Engle - Granger and Johansen methodologies

Stationarity and cointegration tests: Comparison of Engle - Granger and Johansen methodologies MPRA Munich Personal RePEc Archive Stationarity and cointegration tests: Comparison of Engle - Granger and Johansen methodologies Faik Bilgili Erciyes University, Faculty of Economics and Administrative

More information

Title. Description. var intro Introduction to vector autoregressive models

Title. Description. var intro Introduction to vector autoregressive models Title var intro Introduction to vector autoregressive models Description Stata has a suite of commands for fitting, forecasting, interpreting, and performing inference on vector autoregressive (VAR) models

More information

A General Overview of Parametric Estimation and Inference Techniques.

A General Overview of Parametric Estimation and Inference Techniques. A General Overview of Parametric Estimation and Inference Techniques. Moulinath Banerjee University of Michigan September 11, 2012 The object of statistical inference is to glean information about an underlying

More information

The Identification of ARIMA Models

The Identification of ARIMA Models APPENDIX 4 The Identification of ARIMA Models As we have established in a previous lecture, there is a one-to-one correspondence between the parameters of an ARMA(p, q) model, including the variance of

More information

Forecast comparison of principal component regression and principal covariate regression

Forecast comparison of principal component regression and principal covariate regression Forecast comparison of principal component regression and principal covariate regression Christiaan Heij, Patrick J.F. Groenen, Dick J. van Dijk Econometric Institute, Erasmus University Rotterdam Econometric

More information

Econometric Methods. Prediction / Violation of A-Assumptions. Burcu Erdogan. Universität Trier WS 2011/2012

Econometric Methods. Prediction / Violation of A-Assumptions. Burcu Erdogan. Universität Trier WS 2011/2012 Econometric Methods Prediction / Violation of A-Assumptions Burcu Erdogan Universität Trier WS 2011/2012 (Universität Trier) Econometric Methods 30.11.2011 1 / 42 Moving on to... 1 Prediction 2 Violation

More information

We Ran One Regression

We Ran One Regression We Ran One Regression David F. Hendry and Hans-Martin Krolzig Department of Economics, Oxford University. March 10, 2004 Abstract The recent controversy over model selection in the context of growth regressions

More information

A Guide to Modern Econometric:

A Guide to Modern Econometric: A Guide to Modern Econometric: 4th edition Marno Verbeek Rotterdam School of Management, Erasmus University, Rotterdam B 379887 )WILEY A John Wiley & Sons, Ltd., Publication Contents Preface xiii 1 Introduction

More information

TAKEHOME FINAL EXAM e iω e 2iω e iω e 2iω

TAKEHOME FINAL EXAM e iω e 2iω e iω e 2iω ECO 513 Spring 2015 TAKEHOME FINAL EXAM (1) Suppose the univariate stochastic process y is ARMA(2,2) of the following form: y t = 1.6974y t 1.9604y t 2 + ε t 1.6628ε t 1 +.9216ε t 2, (1) where ε is i.i.d.

More information

Wooldridge, Introductory Econometrics, 3d ed. Chapter 9: More on specification and data problems

Wooldridge, Introductory Econometrics, 3d ed. Chapter 9: More on specification and data problems Wooldridge, Introductory Econometrics, 3d ed. Chapter 9: More on specification and data problems Functional form misspecification We may have a model that is correctly specified, in terms of including

More information

LECTURE 10. Introduction to Econometrics. Multicollinearity & Heteroskedasticity

LECTURE 10. Introduction to Econometrics. Multicollinearity & Heteroskedasticity LECTURE 10 Introduction to Econometrics Multicollinearity & Heteroskedasticity November 22, 2016 1 / 23 ON PREVIOUS LECTURES We discussed the specification of a regression equation Specification consists

More information

The regression model with one stochastic regressor (part II)

The regression model with one stochastic regressor (part II) The regression model with one stochastic regressor (part II) 3150/4150 Lecture 7 Ragnar Nymoen 6 Feb 2012 We will finish Lecture topic 4: The regression model with stochastic regressor We will first look

More information

Lecture 6: Dynamic Models

Lecture 6: Dynamic Models Lecture 6: Dynamic Models R.G. Pierse 1 Introduction Up until now we have maintained the assumption that X values are fixed in repeated sampling (A4) In this lecture we look at dynamic models, where the

More information

Applied Microeconometrics (L5): Panel Data-Basics

Applied Microeconometrics (L5): Panel Data-Basics Applied Microeconometrics (L5): Panel Data-Basics Nicholas Giannakopoulos University of Patras Department of Economics ngias@upatras.gr November 10, 2015 Nicholas Giannakopoulos (UPatras) MSc Applied Economics

More information

Forecasting Levels of log Variables in Vector Autoregressions

Forecasting Levels of log Variables in Vector Autoregressions September 24, 200 Forecasting Levels of log Variables in Vector Autoregressions Gunnar Bårdsen Department of Economics, Dragvoll, NTNU, N-749 Trondheim, NORWAY email: gunnar.bardsen@svt.ntnu.no Helmut

More information

Hypothesis testing Goodness of fit Multicollinearity Prediction. Applied Statistics. Lecturer: Serena Arima

Hypothesis testing Goodness of fit Multicollinearity Prediction. Applied Statistics. Lecturer: Serena Arima Applied Statistics Lecturer: Serena Arima Hypothesis testing for the linear model Under the Gauss-Markov assumptions and the normality of the error terms, we saw that β N(β, σ 2 (X X ) 1 ) and hence s

More information

An Introduction to Path Analysis

An Introduction to Path Analysis An Introduction to Path Analysis PRE 905: Multivariate Analysis Lecture 10: April 15, 2014 PRE 905: Lecture 10 Path Analysis Today s Lecture Path analysis starting with multivariate regression then arriving

More information

4.8 Instrumental Variables

4.8 Instrumental Variables 4.8. INSTRUMENTAL VARIABLES 35 4.8 Instrumental Variables A major complication that is emphasized in microeconometrics is the possibility of inconsistent parameter estimation due to endogenous regressors.

More information

Tests of the Present-Value Model of the Current Account: A Note

Tests of the Present-Value Model of the Current Account: A Note Tests of the Present-Value Model of the Current Account: A Note Hafedh Bouakez Takashi Kano March 5, 2007 Abstract Using a Monte Carlo approach, we evaluate the small-sample properties of four different

More information

DATABASE AND METHODOLOGY

DATABASE AND METHODOLOGY CHAPTER 3 DATABASE AND METHODOLOGY In the present chapter, sources of database used and methodology applied for the empirical analysis has been presented The whole chapter has been divided into three sections

More information

Section 2 NABE ASTEF 65

Section 2 NABE ASTEF 65 Section 2 NABE ASTEF 65 Econometric (Structural) Models 66 67 The Multiple Regression Model 68 69 Assumptions 70 Components of Model Endogenous variables -- Dependent variables, values of which are determined

More information

THE LONG-RUN DETERMINANTS OF MONEY DEMAND IN SLOVAKIA MARTIN LUKÁČIK - ADRIANA LUKÁČIKOVÁ - KAROL SZOMOLÁNYI

THE LONG-RUN DETERMINANTS OF MONEY DEMAND IN SLOVAKIA MARTIN LUKÁČIK - ADRIANA LUKÁČIKOVÁ - KAROL SZOMOLÁNYI 92 Multiple Criteria Decision Making XIII THE LONG-RUN DETERMINANTS OF MONEY DEMAND IN SLOVAKIA MARTIN LUKÁČIK - ADRIANA LUKÁČIKOVÁ - KAROL SZOMOLÁNYI Abstract: The paper verifies the long-run determinants

More information

ECON 4160, Lecture 11 and 12

ECON 4160, Lecture 11 and 12 ECON 4160, 2016. Lecture 11 and 12 Co-integration Ragnar Nymoen Department of Economics 9 November 2017 1 / 43 Introduction I So far we have considered: Stationary VAR ( no unit roots ) Standard inference

More information

A Robust Approach to Estimating Production Functions: Replication of the ACF procedure

A Robust Approach to Estimating Production Functions: Replication of the ACF procedure A Robust Approach to Estimating Production Functions: Replication of the ACF procedure Kyoo il Kim Michigan State University Yao Luo University of Toronto Yingjun Su IESR, Jinan University August 2018

More information

Inference with few assumptions: Wasserman s example

Inference with few assumptions: Wasserman s example Inference with few assumptions: Wasserman s example Christopher A. Sims Princeton University sims@princeton.edu October 27, 2007 Types of assumption-free inference A simple procedure or set of statistics

More information

Measurement Independence, Parameter Independence and Non-locality

Measurement Independence, Parameter Independence and Non-locality Measurement Independence, Parameter Independence and Non-locality Iñaki San Pedro Department of Logic and Philosophy of Science University of the Basque Country, UPV/EHU inaki.sanpedro@ehu.es Abstract

More information

Topic 4: Model Specifications

Topic 4: Model Specifications Topic 4: Model Specifications Advanced Econometrics (I) Dong Chen School of Economics, Peking University 1 Functional Forms 1.1 Redefining Variables Change the unit of measurement of the variables will

More information

Lecture 2: Univariate Time Series

Lecture 2: Univariate Time Series Lecture 2: Univariate Time Series Analysis: Conditional and Unconditional Densities, Stationarity, ARMA Processes Prof. Massimo Guidolin 20192 Financial Econometrics Spring/Winter 2017 Overview Motivation:

More information

Are US Output Expectations Unbiased? A Cointegrated VAR Analysis in Real Time

Are US Output Expectations Unbiased? A Cointegrated VAR Analysis in Real Time Are US Output Expectations Unbiased? A Cointegrated VAR Analysis in Real Time by Dimitrios Papaikonomou a and Jacinta Pires b, a Ministry of Finance, Greece b Christ Church, University of Oxford, UK Abstract

More information

Lecture Notes 1: Decisions and Data. In these notes, I describe some basic ideas in decision theory. theory is constructed from

Lecture Notes 1: Decisions and Data. In these notes, I describe some basic ideas in decision theory. theory is constructed from Topics in Data Analysis Steven N. Durlauf University of Wisconsin Lecture Notes : Decisions and Data In these notes, I describe some basic ideas in decision theory. theory is constructed from The Data:

More information

Empirical Economic Research, Part II

Empirical Economic Research, Part II Based on the text book by Ramanathan: Introductory Econometrics Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna December 7, 2011 Outline Introduction

More information

Lectures 5 & 6: Hypothesis Testing

Lectures 5 & 6: Hypothesis Testing Lectures 5 & 6: Hypothesis Testing in which you learn to apply the concept of statistical significance to OLS estimates, learn the concept of t values, how to use them in regression work and come across

More information

Interpreting Regression Results

Interpreting Regression Results Interpreting Regression Results Carlo Favero Favero () Interpreting Regression Results 1 / 42 Interpreting Regression Results Interpreting regression results is not a simple exercise. We propose to split

More information

Bootstrapping the Grainger Causality Test With Integrated Data

Bootstrapping the Grainger Causality Test With Integrated Data Bootstrapping the Grainger Causality Test With Integrated Data Richard Ti n University of Reading July 26, 2006 Abstract A Monte-carlo experiment is conducted to investigate the small sample performance

More information

ECON3327: Financial Econometrics, Spring 2016

ECON3327: Financial Econometrics, Spring 2016 ECON3327: Financial Econometrics, Spring 2016 Wooldridge, Introductory Econometrics (5th ed, 2012) Chapter 11: OLS with time series data Stationary and weakly dependent time series The notion of a stationary

More information

Exogeneity and Causality

Exogeneity and Causality Università di Pavia Exogeneity and Causality Eduardo Rossi University of Pavia Factorization of the density DGP: D t (x t χ t 1, d t ; Ψ) x t represent all the variables in the economy. The econometric

More information

Glossary. Appendix G AAG-SAM APP G

Glossary. Appendix G AAG-SAM APP G Appendix G Glossary Glossary 159 G.1 This glossary summarizes definitions of the terms related to audit sampling used in this guide. It does not contain definitions of common audit terms. Related terms

More information

Testing methodology. It often the case that we try to determine the form of the model on the basis of data

Testing methodology. It often the case that we try to determine the form of the model on the basis of data Testing methodology It often the case that we try to determine the form of the model on the basis of data The simplest case: we try to determine the set of explanatory variables in the model Testing for

More information

When Should We Use Linear Fixed Effects Regression Models for Causal Inference with Longitudinal Data?

When Should We Use Linear Fixed Effects Regression Models for Causal Inference with Longitudinal Data? When Should We Use Linear Fixed Effects Regression Models for Causal Inference with Longitudinal Data? Kosuke Imai Princeton University Asian Political Methodology Conference University of Sydney Joint

More information

ARDL Cointegration Tests for Beginner

ARDL Cointegration Tests for Beginner ARDL Cointegration Tests for Beginner Tuck Cheong TANG Department of Economics, Faculty of Economics & Administration University of Malaya Email: tangtuckcheong@um.edu.my DURATION: 3 HOURS On completing

More information

Obtaining Critical Values for Test of Markov Regime Switching

Obtaining Critical Values for Test of Markov Regime Switching University of California, Santa Barbara From the SelectedWorks of Douglas G. Steigerwald November 1, 01 Obtaining Critical Values for Test of Markov Regime Switching Douglas G Steigerwald, University of

More information

Chapter Three. Hypothesis Testing

Chapter Three. Hypothesis Testing 3.1 Introduction The final phase of analyzing data is to make a decision concerning a set of choices or options. Should I invest in stocks or bonds? Should a new product be marketed? Are my products being

More information

Variable Selection in Predictive Regressions

Variable Selection in Predictive Regressions Variable Selection in Predictive Regressions Alessandro Stringhi Advanced Financial Econometrics III Winter/Spring 2018 Overview This chapter considers linear models for explaining a scalar variable when

More information