Simple robust averages of forecasts: Some empirical results


Available online at www.sciencedirect.com
International Journal of Forecasting 24 (2008) 163–169
www.elsevier.com/locate/ijforecast
doi:10.1016/j.ijforecast.2007.06.001

Simple robust averages of forecasts: Some empirical results

Victor Richmond R. Jose, Robert L. Winkler
The Fuqua School of Business, Duke University, Box 90120, Durham, NC 27708-0120, USA
E-mail addresses: vrj@duke.edu (V.R.R. Jose), rwinkler@duke.edu (R.L. Winkler)

Abstract

An extensive body of literature has shown that combining forecasts can improve forecast accuracy, and that a simple average of the forecasts (the mean) often does better than more complex combining schemes. The fact that the mean is sensitive to extreme values suggests that deleting such values or reducing their extremity might be worthwhile. We study the performance of two simple robust methods, trimmed and Winsorized means, which are easy to use and understand. For the data sets we consider, they provide forecasts which are slightly more accurate than the mean, and reduce the risk of high errors. Our results suggest that moderate trimming of 10–30% or Winsorizing of 15–45% of the forecasts can provide improved combined forecasts, with more trimming or Winsorizing being indicated when there is more variability among the individual forecasts. There are some differences in the performance of the trimmed and Winsorized means, but overall such differences are not large. © 2007 International Institute of Forecasters. Published by Elsevier B.V. All rights reserved.

Keywords: Combining forecasts; Robustness; Trimming; Winsorizing; M3 Competition

1. Introduction

Research in the area of forecast aggregation has shown that combining forecasts generally leads to improvements in forecast accuracy. In addition, the evidence has shown that a simple average (the mean of the forecasts being combined) often provides better results than more complicated and sophisticated combining models (Clemen, 1989). The beauty of using simple averages is that they are easy to understand and implement, not requiring any estimation of weights or other parameters. This makes them robust, in that they are not sensitive to estimation errors, which can sometimes be substantial. For example, Winkler and Clemen (1992) consider weighted averages and show how relatively small estimation errors can lead to negative weights or weights much greater than one, and to weighted averages that are considerably outside the range of the individual forecasts being combined.

It is well known that simple averages themselves are quite sensitive to extreme values, and forecasts can sometimes vary considerably. Because of this, some authors have suggested other, more robust alternatives. The median, which is less sensitive to extreme values than the mean, has also been considered, with mixed results. The median performed better than the mean in Agnew (1985), worse in Stock and Watson (2004), and

about the same in McNees (1992). Armstrong (2001, p. 422) suggests the use of the trimmed mean: "Individual forecasts may have large errors because of miscalculations, errors in data, or misunderstandings. For this reason, it may be useful to throw out the high and low forecasts. I recommend the use of trimmed means when you have at least five forecasts. Forecasters have not studied the effects of trimming, so this principle is speculative." Stock and Watson (2004) used a single trimmed mean, with symmetric 5% trimming, and found that it performed about the same as the mean.

In this paper, we investigate the empirical performance of the trimmed mean as a combining method more comprehensively, considering varying amounts of trimming. We also study the performance of another robust measure from statistics, the Winsorized mean, for varying amounts of Winsorizing. We primarily use the extensive data set of forecasts from the M3 Competition (Makridakis & Hibon, 2000), but also analyze some data from the Federal Reserve Bank of Philadelphia's Survey of Professional Forecasters (Croushore, 1993). In Section 2 we discuss the methods and data sets that are used. The results are presented in Section 3, and Section 4 contains a brief summary and discussion.

2. Methods and data

Suppose that we are interested in forecasting Y, and we have n forecasts X_1, …, X_n of Y. These forecasts might come from forecasting models, from human forecasters, or from other sources. One class of widely used location estimators is the family of L-estimators (Huber, 2004), which stands for linear combinations of order statistics. This family includes the mean and median, as well as measures such as trimmed means and Winsorized means.
If $X_{(i)}$ is the ith order statistic for $X_1, \ldots, X_n$, then the trimmed and Winsorized means are defined as follows, where i is an integer with $0 \le i < n/2$:

$$\text{Trimmed Mean} = T(i) = \frac{1}{n-2i} \sum_{k=i+1}^{n-i} X_{(k)},$$

$$\text{Winsorized Mean} = W(i) = \frac{1}{n}\left[\, i X_{(i+1)} + \sum_{k=i+1}^{n-i} X_{(k)} + i X_{(n-i)} \right].$$

These measures involve taking the i smallest and i largest forecasts and either deleting them or setting them equal to the (i+1)th smallest and (i+1)th largest forecasts, respectively. They correspond to symmetric 100(2i/n)% trimming or Winsorizing. The tradeoff in terms of i is that a smaller value of i means that more information is being used (fewer data points are being deleted or adjusted), but the mean is more likely to be overly influenced by an extreme value. At one limit (i = 0) we have the mean, or simple average, which corresponds to no trimming or Winsorizing. At the other limit (i = the greatest integer less than n/2) is the median, which represents the maximum amount of trimming or Winsorizing. Note that trimming and Winsorizing make sense only if n ≥ 3, and are distinct from each other when n ≥ 5. The comparative advantage of these measures is that they can be more robust than the mean, yet not as limited as the median, which ignores a great deal of the information contained in the set of forecasts. A rough analogy from time series forecasting is the use of a moving average (the average of a subset of the past data), as opposed to the two extremes of the average of all of the past data or just the most recent observation.

The primary data set used here, from the M3 Competition (Makridakis & Hibon, 2000), consists of forecasts of 3003 different time series with varying time horizons (6 years for the 645 yearly series, 8 quarters for the 756 quarterly series, 18 months for the 1428 monthly series, and 8 periods for the other series). We use only 22 of the 24 forecasting methods from this data set because the other two (AAM1 and AAM2) do not provide forecasts for all of the series.
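The two definitions above can be sketched directly in Python. This is a minimal illustration, not code from the paper; the function names are ours, and only the formulas come from the text. Sorting the forecasts yields the order statistics, with X_(k) corresponding to the 0-indexed x[k-1].

```python
def trimmed_mean(forecasts, i):
    """T(i): drop the i smallest and i largest forecasts, average the rest."""
    n = len(forecasts)
    assert 0 <= i < n / 2
    x = sorted(forecasts)
    kept = x[i:n - i]                 # X_(i+1), ..., X_(n-i)
    return sum(kept) / (n - 2 * i)

def winsorized_mean(forecasts, i):
    """W(i): replace the i smallest forecasts by X_(i+1) and the i largest
    by X_(n-i), then average all n values."""
    n = len(forecasts)
    assert 0 <= i < n / 2
    x = sorted(forecasts)
    total = i * x[i] + sum(x[i:n - i]) + i * x[n - i - 1]
    return total / n
```

For example, with forecasts [1, 2, 3, 4, 100] and i = 1, both methods neutralize the outlier 100: the trimmed mean averages [2, 3, 4] while the Winsorized mean averages [2, 2, 3, 4, 4], each giving 3.0; the untrimmed mean is 22.0.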
The M3 Competition and earlier versions of the M competition have been influential in the forecasting literature, showing, among other things, that (1) simpler models tend to perform better than more complex forecasting models, (2) different methods perform better with respect to different forecasting accuracy measures, and (3) combined forecasts (particularly simple averages) tend to perform better than individual forecasts. To avoid reliance on a single data set, we also analyze quarterly forecasts of the nominal gross domestic product (NGDP) from The Federal Reserve Bank of Philadelphia's Survey of Professional Forecasters, formerly the ASA/NBER Economic Outlook Survey

(Croushore, 1993). We consider 149 quarters, from the 4th quarter of 1968 to the 4th quarter of 2005, each with forecasts for the upcoming four quarters. Variants of this data source have often been used in studies of forecast aggregation methods (e.g., Clemen & Winkler, 1986; Graham, 1996; Zarnowitz, 1984).

In evaluating the performance of the robust combining methods, we utilize the following forecast accuracy measures: (1) Symmetric Mean Absolute Percentage Error (smape), the mean of sape = 100|X − Y|/[(X + Y)/2] for a forecast X of Y; (2) Symmetric Median Absolute Percentage Error (smedape), the median of sape; and (3) the percentage of times a method's sape is better than or equal to that of the simple average (% Better). To provide measures of the risk of high errors, we also present the percentage of times the symmetric APE exceeds certain values (% sape exceeds). Finally, we study the relative performance of the robust combining methods as a function of the coefficient of variation of the forecasts from the individual methods. We base our evaluations on smape rather than the simple MAPE (which entails dividing by Y instead of (X + Y)/2) because we feel that smape is more stable and avoids some of the potential difficulties that can arise with MAPE (e.g., see Makridakis, 1993). The nature of our results is not changed if MAPE is used instead of smape.

3. Empirical results

Using the entire set of n = 22 individual methods, we study the performance of 20 different combining methods: the mean, the median, and the trimmed and Winsorized means for i = 1, …, 9. Some overall summary statistics are provided in Table 1, and comprehensive lists of smape values for different forecasting horizons are provided in Tables 2 and 3. More detailed results are available in an online supplement at http://www.forecasters.org/ijf/data.htm.
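The sape and smape measures defined above are straightforward to compute. The sketch below is ours (the function names simply mirror the measure names in the text, and the example values are hypothetical):

```python
def sape(forecast, actual):
    """Symmetric absolute percentage error for one forecast X of actual Y:
    sape = 100 * |X - Y| / [(X + Y) / 2]."""
    return 100.0 * abs(forecast - actual) / ((forecast + actual) / 2.0)

def smape(forecasts, actuals):
    """Symmetric Mean APE: the mean of sape over forecast/actual pairs."""
    errors = [sape(x, y) for x, y in zip(forecasts, actuals)]
    return sum(errors) / len(errors)
```

Note the symmetry: a forecast of 110 against an actual of 90 and a forecast of 90 against an actual of 110 both give sape = 100·20/100 = 20, whereas the simple MAPE (dividing by Y alone) would penalize them differently.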
From Table 1, we can see that the differences among the combining methods are not large, but that the performances of the trimmed and Winsorized means follow a similar pattern for smape and smedape as i increases from i = 0 (the mean) to i = 10 (the median). Performance improves at first as i increases, then gets worse beyond some particular point. The net result of this pattern is that many of the trimmed and Winsorized means outperform the mean, which in turn outperforms the median, using these measures. For example, all but T(8), T(9), W(9) and the median do better than the mean according to smape. The trimmed means have lower smapes for i ≤ 3 (9–27% trimming) than for larger values of i, whereas the Winsorized means do better for 2 ≤ i ≤ 5 (18–45% Winsorizing) than for other values of i.

Table 1
Accuracy measures and measures of the risk of large percentage errors for combinations of the 22 individual methods from the M3 Competition. Entries are T(i)/W(i); i = 0 is the mean and i = 10 is the median.

i     smape          smedape      % Better       % sape>40     % sape>60    % sape>80
Combined forecasts
0     12.74          5.79         -              7.32          3.26         1.53
1     12.67/12.69    5.74/5.73    50.32/49.57    7.26/7.31     3.22/3.24    1.48/1.50
2     12.66/12.66    5.76/5.73    50.04/50.03    7.22/7.26     3.21/3.22    1.47/1.47
3     12.66/12.65    5.78/5.74    49.84/50.30    7.23/7.24     3.19/3.21    1.49/1.48
4     12.68/12.65    5.78/5.77    49.57/50.17    7.22/7.24     3.21/3.20    1.49/1.47
5     12.69/12.66    5.79/5.78    49.17/50.06    7.25/7.24     3.23/3.22    1.49/1.49
6     12.71/12.68    5.81/5.79    48.89/49.50    7.25/7.22     3.23/3.22    1.49/1.49
7     12.72/12.70    5.82/5.80    48.72/49.08    7.25/7.24     3.24/3.22    1.48/1.47
8     12.74/12.72    5.81/5.82    48.52/48.95    7.27/7.24     3.24/3.23    1.48/1.47
9     12.76/12.75    5.80/5.81    48.46/48.49    7.31/7.30     3.26/3.25    1.47/1.49
10    12.77          5.81         48.16          7.33          3.25         1.48
Individual forecasts
Min   13.01          6.00         -              7.52          3.32         1.55
Ave   14.31          6.62         -              9.14          4.50         2.36
Max   16.30          7.99         -              13.51         8.46         6.34

Comparing T(i) and W(i) for

fixed values of i shows that trimming generally does better for i = 1 and Winsorizing does better for i ≥ 3.

The measures smape and smedape give averages over all of the forecasts. On a forecast-by-forecast basis, the results for % Better indicate that the trimmed means for i ≤ 2 and the Winsorized means for 2 ≤ i ≤ 5 have smaller errors than the mean for slightly over 50% of the forecasts. For higher values of i, the mean is equal or better slightly more than 50% of the time. One argument for robust methods is that they can reduce the risk of large errors. From Table 1, we see that sape exceeds 40, 60, and 80 more frequently with the mean than with any of the trimmed or Winsorized means. This helps to explain why, even for higher values of i, most of the T(i)s and W(i)s have lower smapes, despite beating the mean for slightly under half of the individual forecasts.

Table 2
Symmetric Mean Absolute Percentage Error (smape): All 3003 series using trimmed means

Method   Forecasting horizon                               Average of horizons
         1     2     3      4      6      12     18        1–4    1–6    1–12   1–18
Mean     8.50  9.41  11.33  12.62  13.11  13.02  17.45     10.47  11.23  11.65  12.74
T(1)     8.46  9.37  11.27  12.61  13.07  12.90  17.29     10.43  11.19  11.60  12.67
T(2)     8.44  9.35  11.27  12.59  13.06  12.88  17.29     10.41  11.18  11.59  12.66
T(3)     8.42  9.34  11.26  12.58  13.07  12.89  17.32     10.40  11.18  11.59  12.66
T(4)     8.41  9.34  11.26  12.58  13.08  12.91  17.36     10.40  11.18  11.59  12.68
T(5)     8.42  9.35  11.26  12.58  13.09  12.93  17.40     10.40  11.19  11.60  12.69
T(6)     8.42  9.36  11.26  12.58  13.10  12.97  17.45     10.41  11.20  11.62  12.71
T(7)     8.43  9.36  11.28  12.59  13.12  13.01  17.50     10.42  11.21  11.63  12.73
T(8)     8.44  9.37  11.28  12.60  13.13  13.03  17.55     10.42  11.22  11.64  12.74
T(9)     8.45  9.38  11.30  12.61  13.14  13.06  17.59     10.43  11.23  11.65  12.76
Median   8.45  9.38  11.31  12.61  13.16  13.07  17.62     10.44  11.24  11.67  12.77
Tables 2 and 3 give smape values by horizon, and they show that the results discussed above hold for individual horizons as well as for the overall data set. The amounts of trimming or Winsorizing that yield better performances might be slightly larger for shorter horizons. However, the insensitivity of the measures to the amount of trimming or Winsorizing for a given horizon is striking. Similar results hold if we use MAPE instead of smape, and also for other measures, as can be seen from the online supplement.

Table 3
Symmetric Mean Absolute Percentage Error (smape): All 3003 series using Winsorized means

Method   Forecasting horizon                               Average of horizons
         1     2     3      4      6      12     18        1–4    1–6    1–12   1–18
Mean     8.50  9.41  11.33  12.62  13.11  13.02  17.45     10.47  11.23  11.65  12.74
W(1)     8.48  9.39  11.28  12.63  13.09  12.93  17.31     10.45  11.21  11.62  12.69
W(2)     8.47  9.36  11.28  12.61  13.06  12.88  17.24     10.43  11.19  11.60  12.66
W(3)     8.45  9.35  11.27  12.60  13.07  12.87  17.26     10.42  11.19  11.59  12.65
W(4)     8.42  9.34  11.26  12.58  13.07  12.87  17.27     10.40  11.18  11.58  12.65
W(5)     8.42  9.34  11.26  12.57  13.08  12.86  17.29     10.39  11.18  11.58  12.66
W(6)     8.40  9.35  11.25  12.57  13.08  12.92  17.36     10.39  11.18  11.60  12.68
W(7)     8.41  9.35  11.27  12.59  13.10  12.96  17.42     10.40  11.20  11.61  12.70
W(8)     8.43  9.37  11.27  12.60  13.12  12.99  17.50     10.42  11.21  11.63  12.72
W(9)     8.45  9.38  11.29  12.60  13.13  13.06  17.58     10.43  11.23  11.65  12.75
Median   8.45  9.38  11.31  12.61  13.16  13.07  17.62     10.44  11.24  11.67  12.77

Of course, the reason for combining forecasts is the expectation that they will perform better than the individual forecasts being combined, and that is definitely the case here. The smapes for all of the combined forecasts in Table 1 are lower than the best of the smapes for the 22 individual methods, and the same is true for the smedapes. It is also true for the occurrence

of high errors; the upper-tail percentages for sape for all of the combining methods are smaller than the corresponding values for each of the individual methods.

The notion of trimming or Winsorizing when combining forecasts seems more compelling when some of the individual forecasts are relatively extreme. We divided the 3003 forecast situations into three categories, based on the size of the coefficient of variation of the individual forecasts, using the tertiles to determine the categories. The rankings of the smapes of the combining methods for these three categories are given in Table 4. When the coefficient of variation is low, the mean outperforms all of the trimmed and Winsorized means, and the best trimmed and Winsorized means are those with i = 1. For coefficients of variation in the middle range, the rankings are more mixed, with the mean dropping somewhat and the Winsorized mean with i = 4 doing best. For situations with a high coefficient of variation, the mean has the worst performance of all 20 combining methods, and the best combining methods are W(i) for i = 4, 3, and 5, followed by T(i) for i = 3, 2, and 4. One caveat is that the differences in smape within each category are not large, but it does appear that the robust methods are most useful when there is higher variability among the individual forecasts.

Table 4
Ranking of combining methods for low, medium, and high coefficients of variation of the forecasts being combined

Rank  Low     Medium  High        Rank  Low     Medium  High
1     Mean    W(4)    W(4)        11    W(6)    T(4)    W(7)
2     W(1)    T(1)    W(3)        12    T(5)    T(5)    T(6)
3     T(1)    W(1)    W(5)        13    W(7)    W(7)    W(8)
4     W(2)    W(3)    T(3)        14    T(6)    T(6)    T(7)
5     W(4)    W(2)    T(2)        15    T(7)    W(8)    W(1)
6     W(3)    Mean    T(4)        16    W(8)    T(7)    T(8)
7     T(2)    W(5)    W(2)        17    T(8)    T(8)    W(9)
8     W(5)    T(2)    W(6)        18    W(9)    W(9)    T(9)
9     T(3)    T(3)    T(5)        19    T(9)    T(9)    Median
10    T(4)    W(6)    T(1)        20    Median  Median  Mean
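The coefficient-of-variation split described above can be sketched as follows. The helper names and the exact tertile-boundary convention are ours; the paper only states that the 3003 situations were split into three equal-sized categories by the tertiles of the coefficient of variation.

```python
import statistics

def coefficient_of_variation(forecasts):
    """Population standard deviation of the forecasts divided by |mean|."""
    return statistics.pstdev(forecasts) / abs(statistics.mean(forecasts))

def tertile_labels(cvs):
    """Label each forecast situation low/medium/high by the tertiles of
    its coefficient of variation (boundary convention is illustrative)."""
    ranked = sorted(cvs)
    t1 = ranked[len(cvs) // 3]
    t2 = ranked[2 * len(cvs) // 3]
    return ["low" if c < t1 else "medium" if c < t2 else "high" for c in cvs]
```

A set of identical forecasts has a coefficient of variation of zero; more dispersed forecast sets land in the higher categories, where Table 4 suggests the robust combining methods pay off most.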
Also, the smapes increase dramatically from category to category, averaging about 3.3, 10.2, and 24.6, respectively, for the low, medium, and high coefficients of variation. This suggests that higher variability among the individual forecasts should lead us to expect higher forecast errors from the combined forecasts.

However, we do not always have the luxury of having 22 different forecasts to combine. To see how T(i) and W(i) perform for smaller values of n, we randomly chose 1000 subsets, each of sizes 5–9, from the 22 methods, and evaluated the combined methods based on each of those subsets. The results for n = 5, 7, and 9 are summarized in Table 5, and they are consistent with the results for n = 22. The trimmed and Winsorized means are roughly comparable, and they outperform both the mean and the median. For n = 5, the only option is i = 1 (40% trimming or Winsorizing); i = 1 does slightly better than i = 2 for n = 7 (29% trimming or Winsorizing, compared with 57% for i = 2) and n = 9 (22% trimming or Winsorizing). Note that increasing the value of n improves the accuracy of the combined forecasts, but at a decreasing rate, as would be expected. The improvements in smape in going from n = 5 to n = 9 are roughly comparable to the gains in going from n = 9 to n = 22. Moreover, starting with n = 5 and no trimming or Winsorizing, the improvement due to trimming or Winsorizing with i = 1 is roughly the same as the improvement due to increasing the number of forecasts to n = 9 without trimming or Winsorizing.

For the NGDP data set, the number of forecasters participating in a quarter varied between 9 and 87.
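The subset experiment described above can be sketched in a few lines. This is a simplified illustration, not the authors' code: the resampling scheme and the trimmed-mean combining rule follow the text, but the function names, the seeding, and the single-target setup are our assumptions.

```python
import random

def sape(forecast, actual):
    # Symmetric APE: 100 * |X - Y| / [(X + Y) / 2]
    return 100.0 * abs(forecast - actual) / ((forecast + actual) / 2.0)

def trimmed_mean(forecasts, i):
    # T(i): average after dropping the i smallest and i largest forecasts
    n = len(forecasts)
    x = sorted(forecasts)
    return sum(x[i:n - i]) / (n - 2 * i)

def subset_smape(all_forecasts, actual, n, i, n_subsets=1000, seed=0):
    """Average sape of T(i) over random size-n subsets of the forecasts."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_subsets):
        subset = rng.sample(all_forecasts, n)
        errors.append(sape(trimmed_mean(subset, i), actual))
    return sum(errors) / len(errors)
```

In the paper this resampling is done separately for every series (M3) or every quarter (NGDP) and for each combining method, with the per-subset errors then averaged into the smape figures of Tables 5 and 6.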
For each quarter, we randomly chose 10,000 subsets, each of sizes 5–9, from the forecasts available, and evaluated the combined methods based on each of those subsets. The results for n = 5, 7, and 9 are summarized in Table 6. The smapes and the smedapes are much smaller than in Table 5, reflecting the fact that the forecasters are able to forecast NGDP more accurately than the forecasting methods can forecast the mélange of series in the M3 Competition. Of course, the forecasters are experts at forecasting economic variables, and they no doubt consider much more information than just the historical series of NGDP. What is particularly notable is that the relative performances of the different methods replicate the results of Table 5 quite closely, providing further support for the conclusions from our analysis of the M3 Competition data.

Table 5
Accuracy measures for combinations of 5, 7, and 9 individual methods from the M3 Competition

         smape                  smedape               % Better
         n=5    n=7    n=9     n=5   n=7   n=9       n=5    n=7    n=9
Mean     13.16  13.05  12.98   6.05  6.04  6.00      -      -      -
T(1)     12.96  12.86  12.80   5.94  5.88  5.84      49.50  49.80  49.94
T(2)     -      12.88  12.81   -     5.91  5.86      -      49.20  49.43
T(3)     -      -      12.84   -     -     5.89      -      -      48.99
Median   13.05  12.95  12.90   6.00  5.95  5.92      48.60  48.50  48.46
W(1)     12.97  12.87  12.81   5.94  5.88  5.84      49.40  49.70  49.70
W(2)     -      12.88  12.80   -     5.91  5.86      -      49.30  49.55
W(3)     -      -      12.84   -     -     5.88      -      -      49.07

Table 6
Accuracy measures for combinations of 5, 7, and 9 individual forecasts from the NGDP data set

         smape                  smedape                  % Better
         n=5    n=7    n=9     n=5    n=7    n=9        n=5    n=7    n=9
Mean     0.925  0.918  0.915   0.590  0.586  0.581      -      -      -
T(1)     0.920  0.914  0.911   0.587  0.584  0.583      55.50  52.53  54.36
T(2)     -      0.915  0.912   -      0.586  0.584      -      52.10  51.42
T(3)     -      -      0.913   -      -      0.586      -      -      51.50
Median   0.923  0.918  0.915   0.590  0.588  0.586      55.16  51.81  51.19
W(1)     0.920  0.914  0.912   0.586  0.583  0.583      58.02  56.77  56.96
W(2)     -      0.915  0.912   -      0.586  0.584      -      53.00  53.03
W(3)     -      -      0.913   -      -      0.586      -      -      51.86
Individ  1.011  1.010  1.011   0.631  0.631  0.630      -      -      -

4. Summary and discussion

The trimmed and Winsorized means seem to be viable alternatives for use when combining forecasts. The empirical results support Armstrong's (2001) recommendation to use trimmed means, and also support the use of Winsorized means. Trimmed and Winsorized means are simple to use and are quite robust. For the data sets analyzed here, they provide slightly more accurate forecasts than the simple average of the forecasts, and also beat all of the individual forecasts being combined.
They do well with accuracy measures such as smape, and, perhaps even more importantly, they do well on measures of the risk of high errors.

How much trimming or Winsorizing is recommended? The results here suggest that for T(i), moderate trimming of 10–30% (i.e., 5–15% on each side) yields good results. For W(i), slightly greater Winsorizing of 15–45% (7.5–22.5% on each side) works well. The performance of the Winsorized mean seems to be a little less sensitive to the choice of i than that of the trimmed mean, which might make it preferable on balance. Winsorizing is also more appealing in the sense that the modifications to the mean provided by Winsorizing are less extreme in spirit than those from trimming. Winsorizing adjusts some forecasts to make them less extreme, whereas trimming eliminates those forecasts entirely. Nonetheless, our empirical results indicate that the overall improvements in the combined forecasts are similar for trimming and Winsorizing. Therefore, the nature of the situation and the forecasts should be considered when deciding whether to use one of these robust methods, and if so, which one to use (i.e., trimming or Winsorizing, and which value of i). For example, the rankings of combining methods for the low, medium, and high coefficients of variation of the forecasts indicate that the robust methods are more advantageous, and that greater amounts of trimming or Winsorizing might work better, when there is more variability among the individual forecasts. More variability among the individual forecasts also leads to higher smapes, in which case getting additional forecasts might be desirable. Also, insofar as getting additional forecasts is concerned, accuracy improves at a decreasing rate as more forecasts are added. For example, improvements in smape in going from n = 5 to n = 9 with the forecasts from the M3 Competition are comparable to gains in going from n = 9 to n = 22.

Our conclusions echo the spirit of some of the results of the M Competitions and of various other studies of combined forecasts: (1) Combined methods do better than their component parts, the individual methods. (2) Specific details (e.g., how much trimming or Winsorizing works best) differ depending on the accuracy measure used, but the general pattern is quite consistent and robust. (3) These simple robust combining methods do better than the equally simple but perhaps less robust simple average, which has been shown in previous studies to tend to do better than more complex combining methods.

We have only considered symmetric trimming or Winsorizing, where equal numbers of forecasts are deleted or adjusted at the low and high ends. Asymmetric trimming could also be used, but would be closer to outlier detection. More complex and robust measures could also be considered; a whole strand of literature on such measures can be found in Huber (2004). However, the beauty of aggregation schemes such as trimmed and Winsorized means (and indeed, the simple average as well) is that they are simple to apply, understand, and explain to the end-users of the forecast, regardless of their technical backgrounds. Trimming can be explained as the removal of extreme values, while Winsorizing can be thought of as a reduction in the extremity of the outermost values. Recent behavioral studies such as Yaniv (1997) and Harries, Yaniv, and Harvey (2004) have even shown that trimming is commonly used as a decision heuristic when individuals aggregate information subjectively under conditions of uncertainty. As noted above, the trimmed and Winsorized means are appealing because of their simplicity and robust performance.
On the question of which is better, trimming or Winsorizing, we have a slight preference for Winsorizing, but believe that the answer in a given situation depends on the circumstances and perhaps also on taste. In any event, the evidence suggests that in averaging forecasts, we could benefit from incorporating some degree of robustness into our aggregation schemes.

Acknowledgements

We are grateful to Michèle Hibon for providing the forecasts submitted by the forecasters in the M3 Competition, and to Bob Clemen, Luca Rigotti, and two reviewers for helpful comments.

Appendix A. Supplementary data

Supplementary data associated with this article can be found at http://www.forecasters.org/ijf/data.htm.

References

Agnew, C. (1985). Bayesian consensus forecasts of macroeconomic variables. Journal of Forecasting, 4, 363–376.
Armstrong, J. S. (2001). Combining forecasts. In J. S. Armstrong (Ed.), Principles of forecasting: A handbook for researchers and practitioners (pp. 417–439). Norwell, MA: Kluwer Academic Publishers.
Clemen, R. T. (1989). Combining forecasts: A review and annotated bibliography. International Journal of Forecasting, 5, 559–583.
Clemen, R. T., & Winkler, R. L. (1986). Combining economic forecasts. Journal of Business and Economic Statistics, 4, 39–46.
Croushore, D. (1993, November/December). Introducing the Survey of Professional Forecasters. Business Review (pp. 3–13). Federal Reserve Bank of Philadelphia.
Graham, J. R. (1996). Is a group of economists better than one? Than none? Journal of Business, 69(2), 193–232.
Harries, C., Yaniv, I., & Harvey, N. (2004). Combining advice: The weight of a dissenting opinion in the consensus. Journal of Behavioral Decision Making, 17, 333–348.
Huber, P. J. (2004). Robust statistics. New York: Wiley-Interscience.
Makridakis, S. (1993). Accuracy measures: Theoretical and practical concerns. International Journal of Forecasting, 9, 527–529.
Makridakis, S., & Hibon, M. (2000). The M3-Competition: Results, conclusions and implications. International Journal of Forecasting, 16, 451–476.
McNees, S. K. (1992). The uses and abuses of consensus forecasts. Journal of Forecasting, 11, 703–711.
Stock, J. H., & Watson, M. W. (2004). Combination forecasts of output growth in a seven-country data set. Journal of Forecasting, 23, 405–430.
Winkler, R. L., & Clemen, R. T. (1992). Sensitivity of weights in combining forecasts. Operations Research, 40(3), 609–614.
Yaniv, I. (1997). Weighting and trimming: Heuristics for aggregating judgments under uncertainty. Organizational Behavior and Human Decision Processes, 69(3), 237–249.
Zarnowitz, V. (1984). The accuracy of individual and group forecasts from business outlook surveys. Journal of Forecasting, 3, 11–26.