All models are wrong but some are useful. George Box (1979)

Ariel Hopkins
The problem of model selection is beset by a serious difficulty: even if a criterion could be settled on to determine optimality, it is hard to know even what class of models to consider. We will now discuss some criteria for assessing which models are better, but there is not necessarily a unique best criterion, just as there is not necessarily a unique best model (or at least, we'll leave that to the philosophers). As someone once pointed out to me, there may not be a unique worst model either (?!).

Let's just keep in mind that the answer we get will depend on the question we ask, so we should ask a suitable question (in this case, this means we should choose a suitable criterion). For example, when deciding how many types of restrooms to build, we usually consider gender as a criterion for similarity, not height or weight or some other such thing.

To begin with, suppose we have a process which is governed by some (unknown) probability distribution f. We might consider modeling that distribution with a regression model, an ARIMA model, a GARCH model, etc. It may not be clear which model or models will give us the best possible approximation to f. In order to have any chance of answering this question, we need a way of evaluating how far off a candidate model g is from f: we call this a discrepancy, and we are free to assign it in any number of different ways. Whatever rule we use to assign it, the discrepancy should be a number that reflects in some way how far off g is from f.

Simply by the way that the discrepancy is defined, model selection now amounts to finding the model, from the class of models we're investigating, that minimizes the discrepancy. So we will have three models in the picture: the ideal f, which is simply a distribution (no parameters or anything); the best possible approximation $g_{\text{best}}$ to f within the set of models that we are considering (relative to the discrepancy function we have defined); and the fitted model $\hat{g}$, which is our best estimate of $g_{\text{best}}$. Now we generally have to estimate the parameters for g based on the sample, and in addition, we don't even know what the correct f is, so finding $g_{\text{best}}$ is no easy task. Therefore we resort to a model selection criterion, which we define as an estimator of the expected (over all samples) discrepancy.
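As a rough illustration of how the three models relate, the sketch below uses an entirely hypothetical setup: a known distribution f on the values 0, 1, 2, the family of Binomial(2, θ) models, and the discrete form of the Kullback-Leibler discrepancy, −Σ f(x) log g(x). In practice f is unknown, so this is only a picture of the idea; the numbers, the sample and the function names are assumptions made for the example.

```python
import math

# Hypothetical "true" distribution f on the values 0, 1, 2 (unknown in practice).
f = [0.5, 0.3, 0.2]

def binomial2(theta):
    """Candidate model g: Binomial(2, theta) probabilities for x = 0, 1, 2."""
    return [(1 - theta) ** 2, 2 * theta * (1 - theta), theta ** 2]

def kl_discrepancy(f, g):
    """Discrete Kullback-Leibler discrepancy: -sum of f(x) * log g(x)."""
    return -sum(fx * math.log(gx) for fx, gx in zip(f, g))

# g_best: the member of the family that minimizes the discrepancy from f,
# found here by a simple grid search over theta = 0.01, 0.02, ..., 0.99.
grid = [i / 100 for i in range(1, 100)]
theta_best = min(grid, key=lambda t: kl_discrepancy(f, binomial2(t)))

# Fitted model: theta estimated by maximum likelihood from a made-up sample.
sample = [0, 0, 1, 0, 2, 1, 0, 1]
theta_hat = sum(sample) / (2 * len(sample))

d_best = kl_discrepancy(f, binomial2(theta_best))
d_fitted = kl_discrepancy(f, binomial2(theta_hat))
print(f"theta_best = {theta_best:.2f}, discrepancy = {d_best:.4f}")
print(f"theta_hat  = {theta_hat:.4f}, discrepancy = {d_fitted:.4f}")
```

The fitted model's discrepancy sits above that of the best approximation, and the size of the gap changes from sample to sample, which is why the expected discrepancy is the quantity a criterion tries to estimate.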

There are several general-purpose discrepancy functions.

The Kullback-Leibler discrepancy is given by:

$$D_{KL}(f, g) = -\int \log\bigl(g(x)\bigr)\, f(x)\, dx,$$

which is closely related to the information (in the technical sense of the word) lost by using g to approximate f.

The Pearson chi-squared discrepancy is given by:

$$D_P(f, g) = \sum_x \frac{\bigl(f(x) - g(x)\bigr)^2}{g(x)},$$

and is particularly useful for discrete data or grouped data.

The Gauss discrepancy is given by:

$$D_G(f, g) = \sum_x \bigl(f(x) - g(x)\bigr)^2,$$

which is essentially the mean squared error (for minimization purposes).
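For distributions on a small finite set of values, all three discrepancies can be computed directly. The following is a minimal sketch under that assumption (the Kullback-Leibler integral is replaced by a sum, and the distributions f, g_close and g_far are invented for illustration).

```python
import math

def kl_discrepancy(f, g):
    """Kullback-Leibler discrepancy, discrete form: -sum of f(x) * log g(x)."""
    return -sum(fx * math.log(gx) for fx, gx in zip(f, g) if fx > 0)

def pearson_discrepancy(f, g):
    """Pearson chi-squared discrepancy: sum of (f(x) - g(x))^2 / g(x)."""
    return sum((fx - gx) ** 2 / gx for fx, gx in zip(f, g))

def gauss_discrepancy(f, g):
    """Gauss discrepancy: sum of (f(x) - g(x))^2."""
    return sum((fx - gx) ** 2 for fx, gx in zip(f, g))

# Invented probabilities over three outcomes.
f = [0.5, 0.3, 0.2]          # the distribution being approximated
g_close = [0.45, 0.35, 0.20]  # a good approximation
g_far = [0.1, 0.1, 0.8]       # a poor approximation

for name, disc in [("Kullback-Leibler", kl_discrepancy),
                   ("Pearson", pearson_discrepancy),
                   ("Gauss", gauss_discrepancy)]:
    print(f"{name}: close = {disc(f, g_close):.4f}, far = {disc(f, g_far):.4f}")
```

Note that the Pearson and Gauss discrepancies are zero when g equals f, whereas the Kullback-Leibler discrepancy in this form is not; only differences between candidate models matter when it is minimized.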

Having decided on a discrepancy, our next step will be to find a criterion with which to estimate the expected discrepancy. There are three general methods for this:

1. Asymptotic methods
2. Bootstrap methods
3. Cross-validation methods

We will focus only on the first one here.

Usually, one would like to obtain a criterion by first deriving a formula for the expected discrepancy and then finding a way to estimate it. Deriving such a formula, however, is usually not possible (which makes estimating it rather difficult!). In many cases, though, it is possible to derive an expression for the asymptotic value of the expected discrepancy (i.e., its limiting value as the sample size approaches infinity) and to find an unbiased estimator of that value, called an asymptotic criterion. Of course the asymptotic part causes some real problems, since: we only deal with finite samples; the standard error of these asymptotic criteria can be quite large; and assessing their performance can generally only be done through tedious Monte Carlo simulations (if at all).

Asymptotic methods applied to the Kullback-Leibler discrepancy give rise to the Akaike information criterion (AIC) (Akaike, 1973), defined by:

$$\mathrm{AIC} = -2 \log(L) + 2p,$$

where L is the likelihood of the model (given the data) and p is the number of parameters in the model. Remember that a criterion is an estimate of the expected discrepancy, and a low discrepancy is desired, so we should seek to minimize the AIC if we use it as the criterion. There are different conventions as to what (if any) constant should be added to the above definition, but these are irrelevant when comparisons are being made between models.
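To make the formula concrete, here is a small sketch that compares two Gaussian models for the same made-up data: one with the mean fixed at zero (one parameter, the variance) and one with a free mean (two parameters). The data values and helper names are assumptions for illustration, and L is evaluated at the maximum likelihood estimates.

```python
import math

def gaussian_loglik(data, mu, sigma2):
    """Log-likelihood of i.i.d. N(mu, sigma2) observations."""
    n = len(data)
    rss = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - rss / (2 * sigma2)

def aic(loglik, p):
    """AIC = -2 log(L) + 2p."""
    return -2.0 * loglik + 2.0 * p

# Made-up observations.
data = [2.1, 1.7, 2.9, 2.4, 1.9, 2.6, 2.2, 1.8]
n = len(data)

# Model A: mean fixed at 0; the only parameter is the variance (p = 1).
sigma2_a = sum(x ** 2 for x in data) / n
aic_a = aic(gaussian_loglik(data, 0.0, sigma2_a), p=1)

# Model B: mean and variance both estimated (p = 2).
mu_b = sum(data) / n
sigma2_b = sum((x - mu_b) ** 2 for x in data) / n
aic_b = aic(gaussian_loglik(data, mu_b, sigma2_b), p=2)

print(f"AIC, mean fixed at 0: {aic_a:.2f}")
print(f"AIC, free mean:       {aic_b:.2f}")
```

Here the extra parameter is easily worth its penalty of 2, because the data are clearly not centered at zero.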

It is also important to keep in mind that it might not be best simply to select the model with the lowest AIC; it helps to have some sense of what AIC differences of various sizes may mean:

An AIC difference of less than about 2 indicates almost no evidence of a difference between the fit of the models.
An AIC difference of about 2 to 10 indicates moderate evidence of a difference between the fit of the models.
An AIC difference of more than 10 indicates strong evidence of a difference between the fit of the models.
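These rules of thumb can be applied mechanically to a set of candidate models, as in the sketch below. The AIC values and model labels are invented, and the exact treatment of the boundaries at 2 and 10 is an assumption, since the guideline itself is only approximate.

```python
def aic_differences(aic_by_model):
    """Return each model's AIC difference from the lowest AIC, with a rough label."""
    best = min(aic_by_model.values())
    report = {}
    for name, value in aic_by_model.items():
        delta = value - best
        if delta < 2:
            label = "almost no evidence of a difference"
        elif delta <= 10:
            label = "moderate evidence of a difference"
        else:
            label = "strong evidence of a difference"
        report[name] = (delta, label)
    return report

# Invented AIC values for three candidate models.
aics = {"AR(1)": 412.3, "ARMA(1,1)": 411.1, "ARMA(2,2)": 425.8}
report = aic_differences(aics)
for name, (delta, label) in report.items():
    print(f"{name}: difference = {delta:.1f} ({label})")
```

In this made-up case the two simplest models are practically indistinguishable by AIC, so the choice between them should rest on other considerations.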

The Schwarz criterion (Schwarz, 1978), or Bayesian information criterion (BIC), is also arrived at by asymptotic methods, although rather different ones. It has a similar form to the AIC, with the last term 2p (the "parameter penalty") being replaced by p log n:

$$\mathrm{BIC} = -2 \log L + p \log n,$$

where L is the likelihood of the model (given the data), p is the number of parameters in the model, and n is the number of observations. The BIC is obtained by an asymptotic expansion of the Bayes factor, which gives the posterior odds that the second model (versus the first model) is correct, assuming that beforehand they were assumed to be equally likely.
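The difference between the two penalties is easiest to see on a case where the criteria disagree. The sketch below uses invented log-likelihoods for a smaller and a larger model fitted to n = 200 observations; the numbers and names are assumptions chosen purely to show the effect.

```python
import math

def aic(loglik, p):
    """AIC = -2 log L + 2p."""
    return -2.0 * loglik + 2.0 * p

def bic(loglik, p, n):
    """BIC = -2 log L + p log n."""
    return -2.0 * loglik + p * math.log(n)

n = 200                                 # number of observations (invented)
small = {"loglik": -100.0, "p": 2}      # simpler model
large = {"loglik": -97.0, "p": 4}       # better fit, two more parameters

for name, model in [("small", small), ("large", large)]:
    a = aic(model["loglik"], model["p"])
    b = bic(model["loglik"], model["p"], n)
    print(f"{name}: AIC = {a:.2f}, BIC = {b:.2f}")

# Penalty per parameter: 2 for AIC versus log n for BIC.
print(f"log n = {math.log(n):.3f}")
```

With these numbers AIC prefers the larger model while BIC prefers the smaller one, because each extra parameter costs log 200 (about 5.3) under BIC instead of 2.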

Both AIC and BIC are widely used and are part of essentially every statistics software package. Tests of the performance of AIC versus BIC have been inconclusive at best. You should feel free to use either one, with perhaps a few words of advice. First, BIC has the harsher penalty for additional parameters (much more so for large samples; log n exceeds 2 as soon as n is at least 8), so if you are feeling parsimonious, that might be the criterion to use.

From a less practical and more philosophical perspective, you should understand that:

AIC is more often what frequentists prefer. It is built upon information-theoretic grounds and seeks to answer which model among those being considered leads to a better fit on average (if sampling were to be done again and again).

BIC is more often what Bayesians prefer. It attempts to answer the question of which of the models being considered is more likely to be the actual model, assuming they were viewed as equally likely before the data were collected.

But again, these are not usually issues from a practical standpoint: the issue of how much weight to assign to model parsimony tends to override such considerations.

It is also worth repeating that AIC and BIC are readily accessible in all the standard statistical software packages. In particular, comparing several different ARIMA(p, d, q)-GARCH(P, Q) models is quite straightforward (for example, there is no difficulty fitting and comparing all GARCH(P, Q) models where P < 3 and Q < 5, say). In addition, there are generally stepwise regression implementations in which, for example, variables are thrown out one at a time, AIC (or BIC) is computed for each such reduced model, and if a better model is found, the process is repeated on it until a model is arrived at for which the AIC (or BIC) isn't lowered by removing any single variable.
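The backward stepwise procedure just described can be sketched in a few lines. This is a toy version written from scratch, not the routine of any particular package: it assumes a Gaussian linear model with an intercept, counts the error variance as a parameter, and uses contrived data in which x2 is unrelated to y by construction. All names are hypothetical.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols_aic(columns, y):
    """AIC of a Gaussian linear model with an intercept plus the given columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    rss = sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
              for row, yi in zip(design, y))
    loglik = -0.5 * n * (math.log(2 * math.pi * rss / n) + 1)
    return -2.0 * loglik + 2.0 * (k + 1)  # k coefficients plus the error variance

def backward_eliminate(predictors, y):
    """Drop one variable at a time for as long as doing so lowers the AIC."""
    current = dict(predictors)
    best = ols_aic(list(current.values()), y)
    while current:
        trials = {name: ols_aic([v for other, v in current.items() if other != name], y)
                  for name in current}
        drop = min(trials, key=trials.get)
        if trials[drop] >= best:
            break
        best = trials[drop]
        del current[drop]
    return sorted(current), best

# Contrived data: y = 2 + 3 * x1 + noise, with x2 unrelated to y.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0]
y = [5.5, 7.5, 11.5, 13.5, 16.5, 20.5, 22.5, 26.5]

selected, best_aic = backward_eliminate({"x1": x1, "x2": x2}, y)
print("selected variables:", selected)
print(f"AIC of selected model: {best_aic:.3f}")
```

Swapping the `2.0 * (k + 1)` penalty for `(k + 1) * math.log(n)` would turn this into the BIC version of the same search.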

A word of caution is in order: it is highly recommended that you do as much of the model selection as you can by using graphical and other analytical tools, rather than feeding a huge number of models to the computer and just going with the best AIC or BIC. AIC and BIC are useful tools to use in conjunction with all of the other ways we have developed for finding good models; AIC and BIC are certainly not a substitute for these other ways. In short, avoid turning model selection over to the computer if possible!
