Deterministic vs. Ensemble Forecasts: The Case from Sandy

Robert Gall, David McCarren and Fred Toepfer
Hurricane Forecast Improvement Program (HFIP); National Weather Service (NWS); National Oceanic and Atmospheric Administration (NOAA); Silver Spring, MD

Introduction

One could argue that Hurricane Sandy was one of the best-forecast major weather events ever. There are other examples of impressive forecasts, such as the so-called Snowmageddon event of February 2010 in the Washington, DC area, and probably others from around the world, but the Sandy forecasts are noteworthy particularly for their lead times, on the order of a week, suggesting that rather accurate hurricane forecasts as much as a week ahead of time are possible. Since these forecasts were produced by global models, it is likely that this skill translates to other weather events such as major snowstorms and floods.

Figure 1. Forecasts from the ECMWF T1299 deterministic model for the track of Hurricane Sandy. The white line is the forecast track, the black line the observed track. (from Mike Fiorino, personal communication)
Vol.3 No.2 June 2013

Much has been said in the press and elsewhere about the skill of the European Centre for Medium-Range Weather Forecasts (ECMWF) predictions for Sandy. Figure 1 shows the high-resolution ECMWF (T1299) forecasts every twelve hours from October 22, 2012 00Z until October 23, 2012 12Z. The actual track of Sandy is shown as a black line.

Figure 2. Ensemble forecast tracks from the HFIP/GFS ensemble (20 members, T382), ECMWF (51 members, T574) and GEFS (20 members, T254). The HFIP/GFS and GEFS are constructed from the NCEP operational GFS and use hybrid DA and GSI DA respectively. ECMWF uses a 4DVAR initialization. Gray lines are individual member tracks; the heavy black line is the ensemble mean. The ellipses enclose 80% of the members at one-day intervals. (from Jeff Whittaker, personal communication)

The forecast for October 22, 2012 at 00Z, 8 days before landfall, was absolutely remarkable: the forecast point of landfall was almost precisely at the actual landfall. Note, however, that subsequent forecasts over the next 36 hours, while still very good, moved the storm's landfall farther north, as far as Boston, then south with each 12-hour forecast to a landfall in Virginia (after the times shown in figure 1), before finally moving back to correctly predict a landfall in New Jersey. This tendency of deterministic models to predict tracks that swing back and forth in consecutive forecasts has been called "windshield wipering" by operational forecasters at the National Hurricane Center (NHC). A similar trait of deterministic models, associated with times of large uncertainty in forward speed, is called "tromboning." We feel these results are a good starting point for discussing the relative merits of deterministic versus ensemble forecasts, at least from the perspective of hurricane forecasting.
Deterministic vs. Ensemble Forecasts

There are two points worth emphasizing for this discussion:

1. A single run of a deterministic model is just one member of some virtual ensemble. Assuming that the ensemble is constructed acceptably for hurricanes, it will be manifested as a fan of tracks spreading out from the initial location of the hurricane. If for no other reason, this spread reflects uncertainty in the initial state of the atmosphere. At longer lead times the spread of the fan can be quite large.

2. When averaged over a large number of cases and over many storms (in the case of hurricanes), the skill of an ensemble mean is approximately equal to the skill of a deterministic model run at twice the resolution of the ensemble members. This cannot be proven mathematically and will depend on how the ensemble is constructed, but it is a rough rule of thumb borne out by experience. Evidence for this is shown in figure 6.

Figure 2, which shows forecast tracks for Hurricane Sandy, illustrates the first point. The middle panel in the upper row is the ECMWF 51-member ensemble forecast (T574 truncation) made at the same time as the upper-right panel of figure 1. The ellipses enclose 80% of the members at the indicated forecast lead time as a measure of uncertainty: when oriented along the track the uncertainty is mostly in forward speed, and when oriented across the track it is mostly in position. The gray tracks are the individual ensemble members and the black tracks the ensemble-mean positions. The fan of gray tracks in figure 2 is typical of an appropriately constructed ensemble, which at the least expresses uncertainty in the initial state of the atmosphere.
So any appropriately constructed ensemble should show a similar fan of hurricane tracks, including one at twice the resolution of the ensembles shown (the resolution of the ECMWF deterministic model): the so-called virtual ensemble associated with the deterministic model, noted in point 1 above. The deterministic model simply picks one of those members, and there is no way of knowing a priori which member of the virtual ensemble it will pick. It could pick a track on the mean or on either side of it; hence the tendency to windshield-wiper when there is large position uncertainty and to trombone when there is large speed uncertainty. There was therefore a degree of chance in the exceptional 8-day forecast shown in the upper-left panel of figure 1. When the errors of a deterministic model are averaged over a large number of cases, the windshield-wiper and trombone effects average out, but those two forms of error are always present, though not quantifiable, in any single deterministic forecast. Ensembles, on the other hand, if properly constructed and with enough members, average out these effects. So if the statistical skill of an ensemble mean equals that of a deterministic model, the ensemble mean has to be the more reliable forecast. As was characteristic of early forecasts for Sandy (near the time of its formation), the ECMWF ensemble showed a huge uncertainty in position at the end of the 7-day forecast, with approximately half of the members going east and half going west from a position off the South Carolina coast. This was also true for the HFIP/GFS ensemble (basically the operational Global Forecast System (GFS) model using the hybrid data assimilation system at T382 truncation) and the Global Ensemble Forecast System (GEFS, basically the operational GFS model using the GSI data assimilation system at T256 truncation).
The GEFS had all members going east at 10/22/12 12Z but showed the 50-50 split twelve hours later. Most of the other operational ensembles from around the world also showed this position uncertainty on October 22. The Hurricane Forecast Improvement Program (HFIP) of NOAA is described at http://www.hfip.org, where other reading relevant to the discussion in this paper, such as HFIP annual reports and presentations, can be found.
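The statistical argument above, that a properly constructed ensemble averages out the member-to-member swings a single deterministic run is subject to, can be illustrated with a toy one-dimensional simulation. Everything below is an assumption made for illustration: members are treated as independent, unbiased draws around a true landfall position, which real ensemble members are not.

```python
import random

random.seed(42)
TRUE_LANDFALL = 0.0   # hypothetical 1-D landfall position (degrees)
SPREAD = 1.0          # assumed member error standard deviation
N_MEMBERS = 20
N_CASES = 10_000

mean_errors, single_errors = [], []
for _ in range(N_CASES):
    # Draw one virtual ensemble of landfall positions.
    members = [TRUE_LANDFALL + random.gauss(0, SPREAD) for _ in range(N_MEMBERS)]
    # Error of the ensemble mean vs. error of one arbitrary member
    # (the "deterministic" pick from the virtual ensemble).
    mean_errors.append(abs(sum(members) / N_MEMBERS - TRUE_LANDFALL))
    single_errors.append(abs(members[0] - TRUE_LANDFALL))

avg_mean_error = sum(mean_errors) / N_CASES      # ensemble-mean error (small)
avg_single_error = sum(single_errors) / N_CASES  # single-member error (larger)
print(avg_mean_error, avg_single_error)
```

Under these idealized assumptions the ensemble-mean error is smaller by roughly a factor of the square root of the member count; in reality members are correlated and the gain is more modest, but the direction of the effect is the same.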
Figure 3. A cartoon illustrating possible tracks that Sandy could have taken as it interacted with the developing omega block over the Atlantic. See text for more discussion. The cartoon is overlaid on the 500 mb geopotential height map for 10/30/12 at 12Z.

The reason for this uncertainty in position was the interaction of Sandy with an omega block that was forming over the Atlantic Ocean as Sandy moved northward. This is illustrated in figure 3, a cartoon based on the 500 mb geopotential height map over the Atlantic on 10/30/12 at 12Z, just after Sandy's landfall. The main features are labeled, and the red dashed lines indicate two possible tracks as Sandy interacted with this large-scale midlatitude feature: it could either have gone eastward with the jet under the block or, as actually happened, been caught up in the developing low-pressure system on the western side of the block. Interactions such as this are rare for hurricanes, and this one led to the very odd path and evolution of Sandy. From figure 2 it is obvious that the models, particularly as shown by the ensembles, were far from certain about the path of the hurricane on the order of a week ahead of time. Conventional wisdom would take the hurricane eastward off the coast, which was part of the reason early NHC forecasts had the storm staying out to sea. So a question is: when did the models (especially the ensembles) resolve this choice of direction for the storm? Obviously they eventually got it right, since the forecast lead times for this storm were exceptional. Figure 4 shows what happened. For figure 4 we counted the number of ensemble members from the various ensembles that tracked eastward into the Atlantic versus those that tracked westward toward the coast and plotted the percentage of the total.
This was accomplished by counting tracks on plots like those in figure 2, so there is a degree of subjectivity in the calculations, but they were performed independently by two of the authors for the various ensembles. Shown are the HFIP/GFS ensemble (left column in figure 2), the ECMWF ensemble (middle column in figure 2) and the National Unified Operational Prediction Capability (NUOPC) ensemble, which includes the Canadian operational ensemble (CMC), the Navy operational global ensemble (NOGAPS) and the GEFS (right column of figure 2). Together there are 130 ensemble members from a total of 5 different global models. Figure 4 suggests that all the ensembles performed with roughly the same skill by this measure even though a wide range of resolutions was involved (NUOPC is the lowest, approximately 100 km). They all show the initial uncertainty, with roughly 50% of the members going east and 50% going west from 10/22/12 00Z until 10/24/12 00Z. That all changed around 12Z on 10/24/12, when most members settled on the western track, 5.5 days before landfall. ECMWF (the highest-resolution ensemble) led the group, followed closely by the HFIP/GFS. The NUOPC ensemble was slower to come to a consensus but had basically resolved the ambiguity by 10/25/12 00Z. This illustrates the use of the uncertainty information in an ensemble to resolve a particular forecasting dilemma: whether Sandy would take a normal track or one widely accepted to be very improbable. The next question is: how soon did the various ensembles resolve the landfall location?

Figure 4. Percentage of ensemble members tending to move eastward into the Atlantic versus those tending to move westward toward the US East Coast for Sandy, starting at 10/22/12 00Z, for three ensembles: the HFIP/GFS, ECMWF and the GEFS. See text for more discussion.

In figure 5 we arbitrarily chose a landfall target zone between Delaware Bay and New York City, essentially the New Jersey coast, with the actual landfall near its middle. Again we simply counted tracks from plots like those in figure 2, but this time took the ratio of members passing through the zone to the total number of members in the ensemble. The same ensembles are shown in figure 5 as in figure 4.
Note that the HFIP/GFS has only 20 members, as opposed to 51 for ECMWF and 60 for NUOPC. It is interesting to note in figure 5 that the various ensembles had members making landfall in this region even 8 days before landfall, though the percentage was very low. The percentage of forecast landfalls in New Jersey increased roughly linearly throughout the 8 days prior to landfall and exceeded 75% three days prior. One can argue about what percentage constitutes a good forecast of landfall, but certainly by 10/28/12 12Z they had all gotten it right. Note that a figure like figure 5 could easily be generated automatically for forecaster-chosen coastal zones and plotted in real time. A steadily increasing percentage of ensemble members passing through a given zone will greatly increase confidence in a forecast of landfall in that zone, while a decreasing percentage is likely to eliminate that zone as a possibility. Figure 4 is actually an example of choosing two zones, one in the eastern Atlantic and one along the East Coast of the US; obviously after 10/24/12 12Z one is not likely to choose the former.

(Upper) Figure 5. Percentage of ensemble members passing over a zone from Delaware Bay to New York City for Sandy, starting at 10/22/12 00Z, for three ensembles: the HFIP/GFS, ECMWF and the GEFS. See text for more discussion.

(Lower) Figure 6. Seven-day track forecasts for each forecast cycle, every 12 hours, for Sandy starting at 10/23/12 00Z for the ECMWF and HFIP/GFS ensembles and the ECMWF deterministic model. (from Vijay Tallapragada, personal communication)
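The zone-counting diagnostic described above is simple enough to sketch in code. Everything in the sketch below is an assumption for illustration: the zone bounds are rough values for the Delaware Bay to New York City box, not those used for figure 5, and member tracks are represented as lists of (lon, lat) points; operationally the tracks would come from each center's forecast output.

```python
# Fraction of ensemble member tracks passing through a chosen lat/lon box.
ZONE = {"lon": (-75.5, -73.5), "lat": (38.5, 41.0)}  # assumed, rough bounds

def passes_through(track, zone):
    """True if any (lon, lat) point of the track falls inside the box."""
    return any(zone["lon"][0] <= lon <= zone["lon"][1]
               and zone["lat"][0] <= lat <= zone["lat"][1]
               for lon, lat in track)

def zone_fraction(tracks, zone=ZONE):
    """Fraction of ensemble members whose track crosses the zone."""
    return sum(passes_through(t, zone) for t in tracks) / len(tracks)

# Three illustrative (hypothetical) members: two making a New Jersey
# landfall, one recurving east into the Atlantic.
members = [
    [(-75.0, 25.0), (-73.5, 32.0), (-74.5, 39.5)],
    [(-75.5, 25.0), (-72.5, 33.0), (-74.0, 40.0)],
    [(-75.0, 25.0), (-68.0, 32.0), (-55.0, 40.0)],
]
print(zone_fraction(members))  # 2 of 3 members pass through the zone
```

Plotting this fraction for each forecast cycle, for any forecaster-chosen zone, reproduces the kind of real-time display suggested above; the figure 4 east/west split is the same calculation with two large zones.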
Finally, figure 6 shows the performance of three separate models for Sandy: the 51-member ECMWF ensemble (T574), the ECMWF deterministic model (T1299) and the 20-member HFIP/GFS ensemble (T382). For the ensembles, the mean tracks are plotted for each forecast cycle, every 12 hours, starting at 10/23/12 00Z, with the forecasts going out 7 days; the same is done for the deterministic model tracks. The observed track is shown but is hard to see because the forecast tracks lie very close to it. Figure 6 suggests that all three models performed very well for Sandy and had similar track skill. The ensemble means show the east-versus-west uncertainty in the early forecasts (the lower-numbered mean tracks), which pulls the means eastward; after 10/25/12 the tendency to show eastward tracks ended. The deterministic model shows no such tracks, since it is only a single member and that member never chose the eastward possibility. The left and middle panels compare an ensemble with a deterministic model run at twice the ensemble's resolution. The two forecasts are very similar (point 2 above), though if one discounts the early forecasts (tracks 2 and 3 of the ensemble), which were influenced by the early east-versus-west ambiguity, one could argue that the ensemble mean gave a better forecast of landfall than the higher-resolution deterministic model. The HFIP/GFS ensemble, with fewer members and slightly lower resolution than the ECMWF ensemble, arguably gave the best forecast of the three shown in figure 6.

Conclusions

We believe the above discussion illustrates the two points noted earlier: the relative skill of ensembles versus deterministic models as a function of resolution, and the uncertainty inherent in a deterministic model because it is a single member of a virtual ensemble (the windshield-wiper and trombone effects). A third point leads to our final conclusion.

3.
A 10-member ensemble costs about the same computationally as a deterministic model at twice the resolution of the ensemble members. This is especially true when the models are run on multi-processor machines, since the lower-resolution ensemble scales better with the number of cores than the high-resolution deterministic model: ensemble members run independently, so a two-member ensemble takes exactly twice the number of cores as a one-member ensemble, and so forth.

Given this point, we ask: if we are running the deterministic model at only twice the resolution of the ensemble, why run it at all? It has the same statistical skill as the lower-resolution ensemble mean, and the ensemble mean largely avoids the windshield-wiper and trombone uncertainty. In fact, since most centers run both an ensemble and a deterministic model at twice the ensemble's resolution, eliminating the deterministic model means a 20-member ensemble could be run at the same (and probably slightly lower) cost as the deterministic-plus-ensemble combination currently being run. Note that the results from the 20-member HFIP/GFS ensemble are similar to those of the 51-member ECMWF ensemble, which suggests that 20 members may be sufficient. It is possible that a deterministic model at three or four times the ensemble resolution would have improved skill over the ensemble, but it would still be a member of a virtual ensemble and therefore subject to random error, and there is not enough experience comparing deterministic models at those resolutions with ensembles constructed from the same models. We also note that models constructed differently but starting from similar initial conditions give different answers, even when each is run as an ensemble (see figure 2). This is well known and is the basis for insisting on multi-model ensembles; we agree, and figures 4 and 5 are examples of comparing multiple model ensembles. Our arguments above are for hurricane forecasting.
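The cost comparison in point 3 can be checked with back-of-envelope scaling. The sketch below assumes the common rule of thumb that doubling a global model's horizontal resolution roughly quadruples the number of grid points and halves the stable time step, so a single run costs about 2^3 = 8 times as much; the exact exponent varies by model and is an assumption here, not a figure from this paper.

```python
def relative_cost(n_members, resolution_factor, scaling_exponent=3):
    """Cost relative to one model run at the base resolution, assuming
    cost scales like resolution_factor ** scaling_exponent per member
    (an assumed rule of thumb, not an exact model property)."""
    return n_members * resolution_factor ** scaling_exponent

ensemble_cost = relative_cost(10, 1.0)  # 10-member ensemble at base resolution
determ_cost = relative_cost(1, 2.0)     # one run at twice the resolution

print(ensemble_cost, determ_cost)  # 10.0 8.0 -> roughly comparable
```

By the same arithmetic, a deterministic run at four times the ensemble resolution costs about 64 units, far more than a 20-member ensemble at base resolution (20 units), which is why comparisons at those resolutions remain largely untested.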
We are not sure how well those arguments translate to other types of weather systems, but it is likely that they will also apply.