JACKKNIFE MODEL AVERAGING


BRUCE E. HANSEN AND JEFFREY S. RACINE

Abstract. We consider the problem of obtaining appropriate weights for averaging $M$ approximate (misspecified) models for improved estimation of an unknown conditional mean in the face of model uncertainty in heteroskedastic error settings. We propose a jackknife model averaging (JMA) estimator which selects the weights by minimizing a cross-validation criterion. In models that are linear in the parameters, this criterion is quadratic in the weights, so computation of the weights is a simple application of quadratic programming. We show that our estimator is asymptotically optimal in the sense of achieving the lowest possible expected squared error. Monte Carlo simulations and an illustrative application show that JMA can achieve significant efficiency gains over existing model selection and averaging methods in the presence of heteroskedasticity.

1. Introduction

Confronting parametric model uncertainty has a rich heritage in the statistics and econometrics literature. The two most popular approaches to dealing with model uncertainty are model selection and model averaging, and both Bayesian and non-Bayesian approaches have been proposed.

Model selection remains perhaps the most popular approach for dealing with model uncertainty. This method has the user first adopt an estimation criterion such as the Akaike information criterion (AIC; Akaike (1970)) and then select, from among a set of candidate models, the one which scores most highly on this criterion. Unfortunately, a variety of such criteria have been proposed in the literature, different criteria favor different models from within a given set of candidate models, and seasoned practitioners know that some criteria will favor more parsimonious models (e.g., the Schwarz-Bayes information criterion (BIC); Schwarz (1978)) while others will favor more heavily parameterized models (e.g., AIC).

Model averaging, on the other hand, deals with model uncertainty not by having the user select one model from among a set of candidate models according to a criterion such as AIC or BIC, but rather by averaging over the set of candidate models in a particular manner. Many readers will no doubt be familiar with the Bayesian model averaging literature, and we direct the interested reader to Hoeting, Madigan, Raftery & Volinsky (1999) for a comprehensive review of this literature. There also exist frequentist methods for model averaging. See, for instance, Buckland, Burnham & Augustin (1997), who suggested a method based upon exponential weights, and Hansen (2007), who suggested a Mallows model averaging (MMA) method based on a Mallows criterion.

Date: July 13,

Hansen would like to gratefully acknowledge support from the National Science Foundation. Racine would like to gratefully acknowledge support from the Natural Sciences and Engineering Research Council of Canada (NSERC), the Social Sciences and Humanities Research Council of Canada (SSHRC), and the Shared Hierarchical Academic Research Computing Network (SHARCNET).

Existing frequentist approaches, however, exclude heteroskedasticity, which limits their applicability.

In this paper we propose a new frequentist model averaging method, which we term jackknife model averaging (hereafter JMA), that selects the weights by minimizing a cross-validation criterion. In models that are linear in the parameters the cross-validation criterion is a simple quadratic function of the weights, so the solution is found through a standard application of numerical quadratic programming. Applying the theory developed in Li (1987), Andrews (1991) and Hansen (2007), we provide an asymptotic optimality justification for the JMA estimator which permits heteroskedasticity of unknown form, unlike existing model averaging methods. We show that the JMA estimator is asymptotically optimal in the sense of achieving the lowest possible expected squared error over the class of linear estimators constructed from a countable set of weights. The class of linear estimators includes, but is not limited to, linear least squares, ridge regression, Nadaraya-Watson and local polynomial kernel regression with fixed bandwidths, nearest neighbor estimators, series estimators, estimators of additive interaction models, and spline estimators. In particular, our analysis of nested linear models shows that JMA achieves asymptotically efficient estimation quite generally.

We demonstrate the potential efficiency gains through a simple Monte Carlo experiment. In a classic regression setting, we show that JMA achieves lower mean squared error than competing methods. In the presence of homoskedastic errors, JMA and MMA are nearly equivalent, but when the errors are heteroskedastic, JMA has significantly lower MSE. We also illustrate the method with an empirical application to cross-section earnings forecasting. We find that JMA achieves out-of-sample forecast squared error which is either equivalent to or lower than that achieved by all other methods considered.

The remainder of the paper is organized as follows. Section 2 presents the jackknife weighting method. Section 3 provides an asymptotic optimality theory under high-level conditions for a broad class of linear estimators. Section 4 presents the optimality theory for nested linear regression models. Section 5 reports the Monte Carlo simulation experiment, and Section 6 the application to forecasting earnings. Section 7 concludes, and the mathematical proofs are presented in the appendix.

2. Jackknife Weighting

Consider a sample of independent observations that satisfy $y_i = \mu_i + e_i$ for $i = 1, \ldots, n$, where $\mu_i$ is the unobserved mean of $y_i$, which depends on the regressor vector $x_i$. The error $e_i$ is mean zero with heteroskedastic variance $\sigma_i^2$. Define the vectors $y = (y_1, \ldots, y_n)'$, $\mu = (\mu_1, \ldots, \mu_n)'$ and $e = (e_1, \ldots, e_n)'$.

Suppose that to estimate $\mu$ there is a set of $M$ estimators $\{\hat{\mu}^1, \ldots, \hat{\mu}^M\}$, where $M$ may depend on $n$. For example, in the case of least-squares estimation, $\hat{\mu}^m = P^m y$, where $P^m = X^m (X^{m\prime} X^m)^{-1} X^{m\prime}$ and the $i$th row of $X^m$ is $x_i^m$, a function of $x_i$.

Let $w_m$ be the weight assigned to estimator $\hat{\mu}^m$ and define the weight vector $w = (w_1, \ldots, w_M)'$. Typically we will restrict the weights to be non-negative, no greater than one, and summing to one, in which case $w$ lies on the $\mathbb{R}^M$ unit simplex

(1) $H_n = \left\{ w \in [0,1]^M : \sum_{m=1}^M w_m = 1 \right\}$.

An averaging estimator for $\mu$ takes the form

$\hat{\mu}(w) = \sum_{m=1}^M w_m \hat{\mu}^m$.

Note that we can write $\hat{\mu}(w) = \hat{\mu} w$, where $\hat{\mu} = (\hat{\mu}^1, \ldots, \hat{\mu}^M)$ is the $n \times M$ matrix of estimates.

In this paper we consider jackknife selection of $w$ (also known as leave-one-out cross-validation). The jackknife estimate of the mean is

$\tilde{\mu}(w) = \sum_{m=1}^M w_m \tilde{\mu}^m$,

where $\tilde{\mu}^m = (\tilde{\mu}^m_1, \ldots, \tilde{\mu}^m_n)'$ and $\tilde{\mu}^m_i$ is a version of $\hat{\mu}^m_i$ computed with the $i$th observation deleted. In the least-squares example discussed above, $\tilde{\mu}^m_i = x^{m\prime}_i \bigl( X^{m\prime}_{(-i)} X^m_{(-i)} \bigr)^{-1} X^{m\prime}_{(-i)} y_{(-i)}$, where $X^m_{(-i)}$ and $y_{(-i)}$ denote the matrix $X^m$ and vector $y$ with the $i$th row deleted. We can alternatively write $\tilde{\mu}(w) = \tilde{\mu} w$, where $\tilde{\mu} = (\tilde{\mu}^1, \ldots, \tilde{\mu}^M)$. Thus the jackknife residual vector is $\tilde{e}(w) = y - \tilde{\mu}(w) = \tilde{e} w$, where $\tilde{e} = (\tilde{e}^1, \ldots, \tilde{e}^M)$ and $\tilde{e}^m = y - \tilde{\mu}^m$.

This estimator is particularly convenient in the case of linear regression estimates, for it is well known that $\tilde{e}^m = D^m \hat{e}^m$, where $\hat{e}^m = y - P^m y$ is the least-squares residual vector and $D^m$ is the $n \times n$ diagonal matrix whose $i$th diagonal element equals $(1 - h^m_{ii})^{-1}$, $h^m_{ii}$ being the $i$th diagonal element of $P^m$ (see, for example, equation (1.4) of Li (1987), or the generalization by Racine (1997)). Using this representation, the jackknife residual vector $\tilde{e}^m$ can be computed with a simple linear operation and does not require $n$ separate regressions.

The jackknife estimate of expected apparent error¹ is

$E_n(w) = \frac{1}{n} \|\tilde{e}(w)\|^2 = w' S_n w$,

where $\|\cdot\|$ denotes the Euclidean norm and $S_n = \frac{1}{n}\tilde{e}'\tilde{e}$ is $M \times M$. The jackknife (or cross-validation) choice of $w$ is the value which minimizes $E_n(w)$:

(2) $\hat{w} = \operatorname{argmin}_{w \in H_n} E_n(w)$.

The JMA estimator of $\mu$ is $\hat{\mu}(\hat{w})$.

Even though $E_n(w)$ is a quadratic function of $w$, the solution to (2) is not typically available in closed form due to the inequality constraints on $w$. The minimization problem (2) falls in the class of quadratic programming, for which numerical solutions have been thoroughly studied and algorithms are widely available. For example, in the R language (R Development Core Team (2007)) it is solved using the quadprog package, in GAUSS using the qprog command, and in MATLAB using the quadprog command. Even when $M$ is quite large, the solution to (2) is computationally fast using any of these packages.

¹For a detailed overview of expected apparent and excess error, we direct the reader to Efron (1982, Chapter 7).
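To fix ideas, here is a minimal R sketch of the weight computation; it is an illustration, not the authors' implementation. It simulates a small hypothetical data set, forms the leave-one-out residuals for each of a set of nested polynomial models via the shortcut $\tilde{e}^m = D^m \hat{e}^m$, and solves (2) with the quadprog package. The data-generating process and model set are illustrative assumptions of ours.

    # Minimal JMA sketch (illustrative data and models, not the authors' code).
    library(quadprog)

    set.seed(1)
    n <- 100
    x <- rnorm(n)
    y <- 1 + x - 0.5 * x^2 + rnorm(n, sd = abs(x))  # heteroskedastic errors

    M <- 4                               # nested polynomial candidate models
    etilde <- matrix(0, n, M)            # jackknife (leave-one-out) residuals
    for (m in 1:M) {
      X <- outer(x, 0:m, `^`)            # model m: polynomial of degree m
      H <- X %*% solve(crossprod(X), t(X))   # projection matrix P^m
      ehat <- y - H %*% y                    # least-squares residuals
      etilde[, m] <- ehat / (1 - diag(H))    # D^m shortcut: no n extra fits
    }
    Sn <- crossprod(etilde) / n          # S_n = (1/n) etilde' etilde

    # Solve (2): min_w w'S_n w subject to sum(w) = 1, w >= 0. solve.QP
    # minimizes (1/2) b'D b - d'b subject to A'b >= b0, with the first meq
    # constraints treated as equalities, so set D = 2*S_n and d = 0.
    # (A small ridge could be added to S_n if it is numerically singular.)
    qp <- solve.QP(Dmat = 2 * Sn, dvec = rep(0, M),
                   Amat = cbind(1, diag(M)), bvec = c(1, rep(0, M)), meq = 1)
    w_hat <- qp$solution                 # estimated simplex weights

The fitted JMA estimate is then the corresponding combination of the full-sample fits, $\hat{\mu}(\hat{w}) = \sum_{m} \hat{w}_m \hat{\mu}^m$.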

Algebraically, (2) is a constrained least-squares problem. The vector $\hat{w}$ is the $M \times 1$ coefficient vector obtained by the constrained regression of $y$ on the $n \times M$ matrix $\tilde{\mu}$. We can write this regression problem as $y = \tilde{\mu}\hat{w} + \tilde{e}$. The jackknife weight vector $\hat{w}$ gives the linear combination of the different estimators which yields the lowest leave-one-out squared error. This is a generalization of jackknife model selection, which picks the individual model which minimizes the leave-one-out squared error.

Also, it is useful to note that the unrestricted solution to (2) is simply $\hat{w} = (\tilde{\mu}'\tilde{\mu})^{-1}\tilde{\mu}' y$. These are the jackknife weights obtained if the restrictions (1) are not imposed. While it may appear intuitive to use these unrestricted weights, the optimality theory suggests otherwise, and we therefore recommend that the weights always be restricted to the unit simplex (1).

3. Asymptotic Optimality

Define the expected squared error

(3) $R_n(w) = E\, \frac{1}{n} \|\mu - \hat{\mu}(w)\|^2$.

We seek conditions under which the jackknife-selected weight vector $\hat{w}$ is optimal in the sense of making $R_n(w)$ as small as possible. Ideally, we wish to show that $R_n(\hat{w}) / \inf_{w \in H_n} R_n(w) \to_p 1$. For technical reasons we follow Hansen (2007) and establish a more limited result by replacing $H_n$ with a countably discrete subset $H_n^*$. Let $\hat{w}$ be the weight vector selected under this restriction:

$\hat{w} = \operatorname{argmin}_{w \in H_n^*} E_n(w)$.

Our optimality theory also restricts attention to linear estimators, which take the form $\hat{\mu}^m = P^m y$ where $P^m$ is not a function of $y$. As mentioned in the introduction, this class of estimators includes linear least squares, ridge regression, Nadaraya-Watson and local polynomial kernel regression with fixed bandwidths, nearest neighbor estimators, series estimators, estimators of additive interaction models, and spline estimators. When the class of estimators is linear, the averaging estimator is also linear. That is, we can write $\hat{\mu}(w) = P(w) y$, where

(4) $P(w) = \sum_{m=1}^M w_m P^m$

is a linear operator indexed by $w$.

Similarly, the jackknife estimates take the linear form $\tilde{\mu}^m = \tilde{P}^m y$ and $\tilde{\mu}(w) = \tilde{P}(w) y$, where $\tilde{P}(w) = \sum_{m=1}^M w_m \tilde{P}^m$, and the matrices $\tilde{P}^m$ and $\tilde{P}(w)$ have zeros on the diagonal.

In this section we establish optimality for the JMA estimator under a set of high-level conditions. Define the jackknife average squared error

$\tilde{R}_n(w) = E\, \frac{1}{n} \|\mu - \tilde{\mu}(w)\|^2$.

Let $\lambda(A)$ denote the largest eigenvalue of the matrix $A$. For some integer $N \ge 1$, assume the following conditions hold:

(A.1) $0 < \inf_i \sigma_i^2 \le \sup_i \sigma_i^2 < \infty$,
(A.2) $\sup_i E\, e_i^{4N} < \infty$,
(A.3) $\limsup_{n \to \infty} \sup_{w \in H_n} \lambda(P(w)) < \infty$,
(A.4) $\sum_{w \in H_n^*} (n R_n(w))^{-N} \to 0$,
(A.5) $\limsup_{n \to \infty} \sup_{w \in H_n} \lambda(\tilde{P}(w)) < \infty$,
(A.6) $\sup_{w \in H_n} \left| \tilde{R}_n(w)/R_n(w) - 1 \right| \to 0$.

Theorem 1. If (A.1)-(A.6) hold, then $\hat{w}$ is asymptotically optimal in the sense that

(5) $R_n(\hat{w}) \big/ \inf_{w \in H_n^*} R_n(w) \to_p 1$.

Condition (A.1) excludes unbounded heteroskedasticity and (A.2) is a moment bound. Condition (A.3) is quite mild; typically $\lambda(P(w)) \le 1$. To explain condition (A.4), note that in most applications the expected squared error $R_n(w)$ is of order $n^{-1+\delta}$ for some $0 \le \delta < 1$. If the cardinality of $H_n^*$ is of polynomial order and $\delta > 0$, then (A.4) follows for sufficiently large $N$. Condition (A.5) is an analogue of (A.3) for the leave-one-out estimator. Condition (A.6) says that as $n$ gets large, the fits of the regular and leave-one-out estimators become equivalent, uniformly over the class of averaging estimators. This excludes the possibility of extremely unbalanced designs.

Theorem 1 is an application of results in Li (1987) and Andrews (1991). The theorem shows that the average squared error of the jackknife estimator is asymptotically as small as that of the infeasible best possible choice in the index set. It is similar to Theorem 5.1 of Li (1987) and Theorem 4.1* of Andrews (1991), which also give high-level conditions for the asymptotic optimality of cross-validation, but makes two different assumptions. One difference is that our condition (A.6) is somewhat stronger than theirs, as (A.6) specifies uniform convergence over the index set $H_n$, while Li and Andrews only require such convergence for sequences $w_n$ for which $R_n(w_n) \to 0$ or $\tilde{R}_n(w_n) \to 0$. This distinction does not appear to exclude relevant applications, however. The other difference is that Li and Andrews assume that $\inf_{w \in H_n} R_n(w) \to 0$, while this assumption is not required in our Theorem 1. Their assumption means that as $n \to \infty$ the optimal expected squared error declines to zero. In practice this means that the index set $H_n$ needs to expand with $n$ in a way such that the estimate $P(w) y$ gets close to the unknown mean $\mu$. Our Theorem 1 does not require this extra assumption.
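The role of the zero-diagonal property of $\tilde{P}(w)$, which the appendix exploits, can be made explicit with a short calculation; this is a standard cross-validation identity, spelled out here for convenience. Since $y - \tilde{\mu}(w) = (\mu - \tilde{\mu}(w)) + e$,

$E_n(w) = \frac{1}{n}\|\mu - \tilde{\mu}(w)\|^2 + \frac{2}{n}\, e'\bigl(\mu - \tilde{\mu}(w)\bigr) + \frac{1}{n}\|e\|^2$.

Because $E\, e = 0$ and $E\, e'\tilde{P}(w) e = \operatorname{tr}(\tilde{P}(w)\Omega) = 0$, where $\Omega = \operatorname{diag}\{\sigma_1^2, \ldots, \sigma_n^2\}$ (the diagonal of $\tilde{P}(w)$ is zero), the middle term has mean zero, so $E\, E_n(w) = \tilde{R}_n(w) + n^{-1}\sum_{i=1}^n \sigma_i^2$. That is, the jackknife criterion is an unbiased estimate of $\tilde{R}_n(w)$ up to a constant that does not depend on $w$, which is why minimizing $E_n(w)$ targets $\tilde{R}_n(w)$, and, via (A.6), targets $R_n(w)$.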

4. Nested Linear Models

For the remainder of the paper we focus attention on nested linear regression estimates. These are estimators which take the form $\hat{\mu}^m = P^m y$, where $P^m = X^m (X^{m\prime} X^m)^{-1} X^{m\prime}$ and $X^m$ is an $n \times k_m$ matrix of regressors. Since the models are nested, $k_m$ is an increasing sequence and $\operatorname{span}(X^m) \subset \operatorname{span}(X^{m+1})$.

To apply Theorem 1 we need to construct a countably discrete subset $H_n^*$ of $H_n$. For $N$ as defined in conditions (A.2) and (A.4), let the weights $w_m$ be restricted to the set $\{0, \frac{1}{N}, \frac{2}{N}, \ldots, 1\}$. Let $H_n^*$ be the subset of $H_n$ restricted to this set of weights; a sketch enumerating this set appears at the end of this section. Let $h^m_{ii}$ denote the $i$th diagonal element of $P^m$.

Theorem 2. Suppose that conditions (A.1) and (A.2) hold,

(A.7) $\max_{1 \le m \le M} \max_{1 \le i \le n} h^m_{ii} \to 0$,

and

(A.8) $\xi_n = \inf_{w \in H_n} n R_n(w) \to \infty$

as $n \to \infty$. Then conditions (A.3)-(A.6) hold, and hence $\hat{w}$ is asymptotically optimal in the sense of (5).

Condition (A.7) requires that the self-weights $h^m_{ii}$ be asymptotically negligible for all models considered. This is a reasonable restriction, as it excludes only extremely unbalanced designs. Instead of (A.7), Li (1987) and Andrews (1991) assume the uniform bound $h^m_{ii} \le \Lambda k_m / n$ for some $\Lambda < \infty$. If $k_M / n \to 0$ this implies (A.7), but in general neither condition is strictly stronger than the other. Condition (A.8) is identical to the condition in Hansen (2007) for Mallows weight selection, and is quite similar to the conditions of Li (1987) and Andrews (1991). It requires that all finite-dimensional models are approximations.

Theorem 2 is our main result. It shows that under quite minimal primitive conditions, the JMA estimator is asymptotically optimal in the class of weighted averages of linear estimators. The most significant limitation is the restriction to the countable weight set $H_n^*$. However, as $N$ can be selected to be arbitrarily large, this restriction need not be viewed as substantive. In any event, we view this limitation as a technical artifact of the proof technique. For empirical practice we ignore the restriction to $H_n^*$ and use the unrestricted weights $\hat{w}$ as defined in (2).
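For concreteness, the discrete set $H_n^*$ can be enumerated by listing all nonnegative integer vectors $(k_1, \ldots, k_M)$ with $k_1 + \cdots + k_M = N$ and setting $w = k/N$. The following R snippet does this for the illustrative (hypothetical) values $N = 4$ and $M = 3$:

    # Enumerate H_n*: all w with entries in {0, 1/N, ..., 1} summing to 1.
    compositions <- function(N, M) {
      if (M == 1) return(matrix(N, 1, 1))
      do.call(rbind,
              lapply(0:N, function(k) cbind(k, compositions(N - k, M - 1))))
    }
    H_star <- compositions(4, 3) / 4   # N = 4 grid, M = 3 models
    nrow(H_star)                       # choose(4 + 3 - 1, 3 - 1) = 15 points

The cardinality $\binom{N+M-1}{M-1}$ is polynomial in $M$ for fixed $N$, which is the sense in which $H_n^*$ is a countable (indeed finite) index set of polynomial order.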

5. Finite-Sample Performance

In this section we investigate the finite-sample mean squared error of the JMA estimator via a modest Monte Carlo experiment. We follow Hansen (2007) and consider an infinite-order regression model of the form $y_i = \sum_{j=1}^{\infty} \theta_j x_{ji} + \epsilon_i$, $i = 1, \ldots, n$. We set $x_{1i} = 1$ to be the intercept, while the remaining $x_{ji}$ are independent and identically distributed $N(0,1)$ random variates. The error $\epsilon_i$ is also normal, with variance $\sigma_i^2$, and is independent of the $x_{ji}$. For the homoskedastic simulation we set $\sigma_i = 1$ for all $i$, while for the heteroskedastic simulation we set $\sigma_i = x_{2i}^2$, which has a mean value of 1. (Other specifications for the heteroskedasticity were considered and produced similar qualitative results.) The parameters are determined by the rule $\theta_j = c \sqrt{2\alpha}\, j^{-\alpha - 1/2}$; the population $R^2 = c^2/(1 + c^2)$ is controlled by the parameter $c$. The sample size is varied among $n = 25$, 50, 75, and 100. The parameter $\alpha$ is varied among 1/2, 1, and 3/2;² larger values of $\alpha$ imply that the coefficients $\theta_j$ decline more quickly with $j$. The number of models $M$ is determined by the rule $M = 3 n^{1/3}$ (so $M = 9$, 11, 13, and 14 for the four sample sizes considered herein). The coefficient $c$ was selected so that the population $R^2$ varies on a grid between 0.1 and 0.8. A short simulation sketch of this design appears at the end of this section.

We consider five estimators: (1) AIC model selection, (2) BIC model selection, (3) Mallows $C_p$ model selection, (4) jackknife model averaging (JMA), and (5) Mallows model averaging (MMA; Hansen (2007)). To evaluate the estimators, we compute the risk (expected squared error) by averaging across 10,000 simulation draws. For each parameterization, we normalize the risk by dividing by the risk of the infeasible optimal least-squares estimator (the risk of the best-fitting model $m$, averaged over all replications).

5.1. Homoskedastic Error Processes. We first compare risk under homoskedastic errors. The risk calculations are summarized in Figure 1. The four panels display results for the four sample sizes. In each panel, risk (expected squared error) is displayed on the y axis and the population $R^2$ on the x axis. It can be seen from Figure 1 that for the homoskedastic data-generating process, the proposed JMA method dominates its peers for all sample sizes and ranges of the population $R^2$ considered, though there is only a small gain relative to the MMA method, which performs very well overall and becomes indistinguishable from JMA as $n$ increases beyond some modest level.

5.2. Heteroskedastic Error Processes. Next, we compare risk under heteroskedastic errors; the risk calculations are summarized in Figure 2. It can be seen from Figure 2 that for the heteroskedastic data-generating process considered herein, the proposed JMA method dominates its peers for all sample sizes and ranges of goodness of fit considered. Furthermore, the gain over the MMA approach is substantially larger than for the homoskedastic case summarized in Figure 1, especially for small values of $R^2$.

²We report results for $\alpha = 1/2$ only, for space considerations. All results are available on request from the authors.
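As a concrete illustration, the following R sketch draws one sample from the heteroskedastic design just described; truncating the infinite sum at $J = 500$ terms, like the other parameter values shown, is our simplification for simulation purposes:

    # One draw from the Section 5 design (a sketch; the truncation J = 500
    # and the values of n, alpha, and c below are illustrative choices).
    set.seed(42)
    n <- 100; alpha <- 1/2; c0 <- 1         # population R^2 = c0^2/(1+c0^2)
    J <- 500
    theta <- c0 * sqrt(2 * alpha) * (1:J)^(-alpha - 1/2)
    X <- cbind(1, matrix(rnorm(n * (J - 1)), n, J - 1))  # x_{1i} = 1
    sigma <- X[, 2]^2                       # heteroskedastic: sigma_i = x_{2i}^2
    y <- drop(X %*% theta) + rnorm(n, sd = sigma)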

[Figure 1 about here: finite-sample performance, homoskedastic errors, $\alpha = 1/2$, $\sigma_i = 1$; one panel for each of $n = 25$, 50, 75, and 100.]

6. Forecasting Earnings

We employ Wooldridge's (2003, pg. 226) popular wage1 dataset, a cross-section consisting of a random sample taken from the U.S. Current Population Survey for the year 1976. There are 526 observations in total. The dependent variable is the log of average hourly earnings (lwage), while the explanatory variables include the dummy variables nonwhite, female, married, numdep, smsa, northcen, south, west, construc, ndurman, trcommpu, trade, services, profserv, profocc, clerocc, and servocc, the non-dummy variables educ, exper, and tenure, and the interaction variables nonwhite×educ, nonwhite×exper,

nonwhite×tenure, female×educ, female×exper, female×tenure, married×educ, married×exper, and married×tenure.³

[Figure 2 about here: finite-sample performance, heteroskedastic errors, $\alpha = 1/2$, $\sigma_i = x_{2i}^2$; one panel for each of $n = 25$, 50, 75, and 100.]

We presume there exists uncertainty about the appropriate model but need to forecast earnings for a set of hold-out data. We consider two cases: (i) a set of thirty models ranging from the unconditional mean (k = 1) through a model that includes all of the variables listed above (k = 30), and (ii) a set of eleven models that use the dummy variables female and married and the non-dummy variables educ, exper, and tenure, ranging from simple bivariate models having each non-dummy regressor as covariate (k = 2) through a model that contains third-order polynomials in all non-dummy regressors and allows for interactions among all variables. A sketch of how such a dataset might be loaded appears below.

³See Wooldridge (2003) for a full description of the data.
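To make the data setup concrete, a minimal R sketch follows; it assumes the wage1 data are available through the third-party wooldridge package, which is our assumption rather than the authors' stated workflow:

    # Load wage1 and build one illustrative candidate design for case (ii).
    library(wooldridge)   # assumption: CRAN package providing wage1
    data("wage1")
    y <- wage1$lwage      # log average hourly earnings, 526 observations
    X_small <- model.matrix(~ educ + exper + tenure + female + married,
                            data = wage1)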

Next, we shuffle the sample into a training set of size $n_1$ and an evaluation set of size $n_2 = n - n_1$, select models via model selection based on AIC, BIC, and $C_p$, and then consider model-average-based models using JMA and MMA. Finally, we evaluate the models on the independent hold-out data by computing their average squared prediction error (ASPE). We repeat this procedure 100 times and report the median ASPE over the 100 splits. We vary $n_1$, considering $n_1 = 100$, 200, 300, 400, and 500. Tables 1 and 2 report the median ASPE for this experiment, expressed relative to the JMA method; entries greater than one indicate inferior performance relative to JMA. A sketch of this evaluation loop appears at the end of this section.

[Table 1 about here: case (i), relative out-of-sample predictive efficiency of the model-selection and MMA methods by training-set size $n_1$; entries greater than one indicate inferior performance relative to the JMA method.]

[Table 2 about here: case (ii), relative out-of-sample predictive efficiency, in the same format as Table 1.]

Table 1 reveals that the proposed JMA method delivers models that are no worse than existing model-selection-based or model-average-based models across the range of sample sizes considered, often delivering an improved model. The MMA method works very well for case (i), yielding models comparable to those selected by JMA. However, this is not the situation for case (ii), as Table 2 reveals. It would appear that the MMA method is somewhat sensitive to the estimate of $\sigma^2$ needed for its computation, and relying on a large approximating model (Hansen (2007)) may not be sufficient to deliver optimal results. This issue deserves further study.

This application suggests that the proposed JMA method could be of interest to those who forecast using model-selection criteria. These results, particularly in light of the simulation evidence, suggest that the JMA method could be recommended for general use, though it would of course be prudent to base this recommendation on more extensive simulation evidence than that outlined in Section 5.
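The split-sample exercise is straightforward to mimic. The sketch below is ours, not the authors' code: it uses an illustrative set of four nested models (not the paper's model sets), computes JMA weights on each training sample via the hat-diagonal shortcut and quadprog, and reports the median ASPE over 100 random splits.

    # Median JMA ASPE over random splits (a sketch with illustrative models;
    # assumes the 'quadprog' and 'wooldridge' packages).
    library(quadprog); library(wooldridge)
    data("wage1")

    aspe_split <- function(n1) {
      idx   <- sample(nrow(wage1), n1)
      train <- wage1[idx, ]; test <- wage1[-idx, ]
      forms <- list(lwage ~ educ,
                    lwage ~ educ + exper,
                    lwage ~ educ + exper + tenure,
                    lwage ~ poly(educ, 2) + poly(exper, 2) + tenure)
      M    <- length(forms)
      fits <- lapply(forms, lm, data = train)
      # leave-one-out residuals via etilde = ehat / (1 - h_ii)
      etilde <- sapply(fits, function(f) residuals(f) / (1 - hatvalues(f)))
      Sn <- crossprod(etilde) / n1
      qp <- solve.QP(2 * Sn, rep(0, M), cbind(1, diag(M)),
                     c(1, rep(0, M)), meq = 1)
      preds <- sapply(fits, predict, newdata = test)   # n2 x M fitted values
      mean((test$lwage - preds %*% qp$solution)^2)     # JMA ASPE on hold-out
    }

    set.seed(7)
    median(replicate(100, aspe_split(300)))            # e.g. n1 = 300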

7. Conclusion

We propose a frequentist method for model averaging that does not preclude heteroskedastic settings, unlike many of its peers. The method is computationally tractable, is asymptotically optimal, and exhibits finite-sample performance better than that of its peers. An application to forecasting wages demonstrates performance no worse than the other model selection and averaging methods considered by way of comparison, and often better, over the range of sample sizes investigated.

References

Akaike, H. (1970), Statistical predictor identification, Annals of the Institute of Statistical Mathematics 22.

Andrews, D. W. K. (1991), Asymptotic optimality of generalized $C_L$, cross-validation, and generalized cross-validation in regression with heteroskedastic errors, Journal of Econometrics 47.

Buckland, S. T., Burnham, K. P. & Augustin, N. H. (1997), Model selection: An integral part of inference, Biometrics 53.

Efron, B. (1982), The Jackknife, the Bootstrap, and Other Resampling Plans, Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania.

Hansen, B. E. (2007), Least squares model averaging, Econometrica 75.

Hoeting, J. A., Madigan, D., Raftery, A. E. & Volinsky, C. T. (1999), Bayesian model averaging: A tutorial, Statistical Science 14.

Li, K.-C. (1987), Asymptotic optimality for $C_p$, $C_L$, cross-validation and generalized cross-validation: Discrete index set, The Annals of Statistics 15.

R Development Core Team (2007), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria.

Racine, J. (1997), Feasible cross-validatory model selection for general stationary processes, Journal of Applied Econometrics 12.

Schwarz, G. (1978), Estimating the dimension of a model, The Annals of Statistics 6.

Wooldridge, J. M. (2003), Introductory Econometrics, Thomson South-Western.

Appendix A. Mathematical Proofs

Proof of Theorem 1. Observe that

(6) $\sum_{w \in H_n^*} \bigl( n \tilde{R}_n(w) \bigr)^{-N} = \sum_{w \in H_n^*} \bigl( n R_n(w) \bigr)^{-N} \left( \frac{R_n(w)}{\tilde{R}_n(w)} \right)^N \le \left( \sum_{w \in H_n^*} \bigl( n R_n(w) \bigr)^{-N} \right) \sup_{w \in H_n^*} \left( \frac{R_n(w)}{\tilde{R}_n(w)} \right)^N \to 0$

by (A.4) and (A.6). Set $\Omega = \operatorname{diag}\{\sigma_1^2, \ldots, \sigma_n^2\}$ and note that $E\, e'\tilde{P}(w) e = \operatorname{tr}(\tilde{P}(w)\Omega) = 0$, since the diagonal elements of $\tilde{P}(w)$ are zero. Then, by the argument given in the proof of Theorem 2.1 of Li (1987), as extended by Andrews (1991) to the heteroskedastic case, conditions (A.1), (A.2), (A.5) and (6) are sufficient to imply that

$\tilde{R}_n(\hat{w}) \big/ \inf_{w \in H_n^*} \tilde{R}_n(w) \to_p 1$.

Combined with condition (A.6) we obtain (5), as desired. ∎

Proof of Theorem 2. Hansen (2007, Lemma 1) shows that (A.3) holds and, in the proof of his Theorem 1, shows that (A.4) holds under homoskedasticity, (A.2) and (A.8). However, if $\sigma^2$ in that proof is replaced with $\inf_i \sigma_i^2$, the argument is unchanged and (A.4) is established.

Now, as shown in equation (1.4) of Li (1987),

(7) $\tilde{P}^m = D^m (P^m - I) + I$,

where $D^m$ is the $n \times n$ diagonal matrix with $i$th diagonal element equal to $(1 - h^m_{ii})^{-1}$. Thus

$\tilde{P}(w) = \sum_{m=1}^M w_m \tilde{P}^m = \sum_{m=1}^M w_m \bigl( D^m (P^m - I) + I \bigr)$.

Let $\bar{\lambda}_m$ and $\underline{\lambda}_m$ denote the largest and smallest diagonal elements of $D^m$. Condition (A.7) implies that

(8) $\bar{\lambda}_m = 1 + o(1) = \underline{\lambda}_m$

uniformly in $m \le M$. Hence

$\lambda\bigl( \tilde{P}(w) \bigr) \le 1 + \sum_{m=1}^M w_m \lambda(D^m) = 2\,(1 + o(1)),$

and thus (A.5) holds.

It remains to establish (A.6). We calculate that

$n R_n(w) = \mu' (I - P(w))' (I - P(w)) \mu + \operatorname{tr}\bigl( P(w)' P(w) \Omega \bigr) = \sum_{m=1}^M \sum_{l=1}^M w_m w_l \bigl[ \mu' (I - P^m)(I - P^l) \mu + \operatorname{tr}( P^m P^l \Omega ) \bigr]$,

where $\Omega = \operatorname{diag}\{\sigma_1^2, \ldots, \sigma_n^2\}$. The first equality is from Andrews (1991) and the second uses (4). Similarly,

$n \tilde{R}_n(w) = \sum_{m=1}^M \sum_{l=1}^M w_m w_l \bigl[ \mu' (I - \tilde{P}^m)' (I - \tilde{P}^l) \mu + \operatorname{tr}( \tilde{P}^{m\prime} \tilde{P}^l \Omega ) \bigr]$.

It is sufficient to establish that

(9) $\operatorname{tr}( \tilde{P}^{m\prime} \tilde{P}^l \Omega ) = \operatorname{tr}( P^m P^l \Omega ) (1 + o(1))$

and

(10) $\mu' (I - \tilde{P}^m)' (I - \tilde{P}^l) \mu = \mu' (I - P^m)(I - P^l) \mu \, (1 + o(1))$,

where the $o(1)$ terms are uniform in $m \le M$ and $l \le M$.

First take (9). Using (7) and the fact that the diagonal elements of $\tilde{P}^m$ are zero, then (8), again (7), and finally (8),

$\operatorname{tr}( \tilde{P}^{m\prime} \tilde{P}^l \Omega ) = \operatorname{tr}( \tilde{P}^{m\prime} D^l P^l \Omega ) - \operatorname{tr}( \tilde{P}^{m\prime} D^l \Omega ) + \operatorname{tr}( \tilde{P}^{m\prime} \Omega )$
$= \operatorname{tr}( \tilde{P}^{m\prime} D^l P^l \Omega )$
$= \operatorname{tr}( \tilde{P}^{m\prime} P^l \Omega ) (1 + o(1))$
$= \bigl( \operatorname{tr}( (P^m - I) D^m P^l \Omega ) + \operatorname{tr}( P^l \Omega ) \bigr) (1 + o(1))$
$= \bigl( \operatorname{tr}( (P^m - I) P^l \Omega ) + \operatorname{tr}( P^l \Omega ) \bigr) (1 + o(1))$
$= \operatorname{tr}( P^m P^l \Omega ) (1 + o(1)),$

establishing (9). Finally, (7) implies $I - \tilde{P}^m = D^m (I - P^m)$, which combined with (8) directly implies (10). We have established (9) and (10), which are sufficient for (A.6). Conditions (A.3)-(A.6) have been verified, so an application of Theorem 1 yields (5), as stated. ∎

Bruce E. Hansen: Department of Economics, Social Science Building, University of Wisconsin, Madison, WI, USA, bhansen@ssc.wisc.edu.

Jeffrey S. Racine: Department of Economics, Kenneth Taylor Hall, McMaster University, Hamilton, ON, Canada L8S 4M4, racinej@mcmaster.ca.
