Methodology for 2013 Stream 1.5 Candidate Evaluation

20 May 2013

TCMT Stream 1.5 Analysis Team: Louisa Nance, Mrinal Biswas, Barbara Brown, Tressa Fowler, Paul Kucera, Kathryn Newman, Jonathan Vigh, and Christopher Williams

Each Stream 1.5 candidate was evaluated based on three basic criteria: (1) a direct comparison between the Stream 1.5 candidate and each of last year's top-flight models, (2) an assessment of how the Stream 1.5 candidate performed relative to last year's top-flight models as a group, and (3) an evaluation of the Stream 1.5 candidate's impact on operational consensus forecasts, or a direct comparison between the Stream 1.5 candidate and the operational consensus where appropriate. The following describes the baselines used for each aspect of the evaluation, how the experimental forecasts and the operational baselines were processed, and the approaches used for each type of analysis.

Baselines

The operational models used as baselines, or as components of baselines, for the Stream 1.5 analysis are described in Table 1. Note that only the early versions [1] of the model guidance were considered in this analysis. The operational baselines used in evaluations as top-flight models are ECMWF, GFS, and GFDL for track, and LGEM, DSHP, and GFDL for intensity. For evaluations of the Stream 1.5 candidate's impact on model consensus, the variable consensus aids TVCA and TVCE were used as the track baselines for the Atlantic and eastern North Pacific basins, respectively, and the fixed consensus, ICON, was used as the intensity baseline for both basins. The membership of the variable track consensus and the fixed intensity consensus is defined in Table 1. Note that the variable consensus requires that at least two of its members be present for a consensus forecast to be computed, whereas the fixed consensus requires that all members be present. Because the evaluation of each Stream 1.5 candidate is based on a homogeneous sample, the variable consensus that includes the Stream 1.5 candidate effectively requires at least three members to be present. In other words, cases for which the Stream 1.5 candidate is available but only one member of the operational variable consensus is available are not included in the homogeneous sample, because the operational consensus would not be available for comparison.

[1] The National Hurricane Center (NHC) characterizes forecast models as early or late depending on whether their numerical guidance is available to the forecaster during the forecast cycle. Models that are available shortly after they are initialized are referred to as early models. Models with run times such that the numerical guidance is not available until after the forecaster needs to release the forecast are considered late models. Early versions of late models are generated through an objective adjustment process provided by an interpolator program.
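To make the consensus membership rules concrete, the following sketch (in R, the language used elsewhere in this analysis) computes a consensus from a set of member forecasts. It illustrates the rules described above and is not the operational ATCF consensus code; the function name, the use of a named vector with NA for missing members, and the example values are all hypothetical.

```r
# Minimal sketch of the consensus rules described above (not the operational code).
# 'members' is a named numeric vector of member forecasts for one case and lead
# time, with NA where a member is unavailable.
consensus <- function(members, type = c("variable", "fixed"), min_members = 2) {
  type <- match.arg(type)
  available <- members[!is.na(members)]
  if (type == "fixed" && length(available) < length(members)) {
    return(NA_real_)   # fixed consensus (e.g., ICON): all members must be present
  }
  if (type == "variable" && length(available) < min_members) {
    return(NA_real_)   # variable consensus (e.g., TVCA/TVCE): at least two members
  }
  mean(available)
}

# Hypothetical example: a TVCA-like variable consensus with one member missing
tvca_members <- c(EMXI = 62, GFSI = 71, EGRI = NA, GHMI = 68, HWFI = 65)
consensus(tvca_members, "variable")   # mean of the four available members
```

When a Stream 1.5 candidate is added to the variable consensus, the homogeneous-sample requirement described above effectively raises the minimum membership to three: at least two operational members (so that the operational consensus exists for comparison) plus the candidate.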

Early Model Conversion

Modeling groups participating in the 2013 Stream 1.5 exercise submitted forecast storm properties as text files conforming to the Automated Tropical Cyclone Forecast (ATCF) file format specifications. Each modeling group generated these data, which are referred to as Tier 1 data, by applying its own storm-tracking method to its model output. The forecasts from all of the dynamical models are considered late model guidance. In addition, the Florida State University Multi-Model Super Ensemble (FSU-MMSE) is considered late model guidance because the members of this consensus are a combination of early and late models. To perform the analysis in terms of early models, early model versions of the late model forecasts were generated using an interpolator package with the same functionality as the software used by the National Hurricane Center (NHC). The interpolator first applies a smoother to an individual track and intensity forecast and then applies the appropriate time lag to the forecast based on when the model guidance is available. The time-lagged track or intensity forecast is then adjusted, or shifted, such that the forecast for the new initial (zero-hour) guidance matches the analyzed position and intensity of the tropical cyclone. For track, this adjustment is applied to all lead times, whereas the operational interpolator offers two adjustment methods for intensity. The first option, the full offset option, applies the same adjustment to all lead times. The second option applies the full adjustment to the time-lagged forecast out to a specified lead time tf, applies a linearly decreasing adjustment from lead time tf to lead time tn, and applies no adjustment for the remaining forecast lead times. Each modeling group was asked to select the intensity offset option it felt was most appropriate for its model, including the parameters tf and tn if the variable offset option was selected. The one exception was the FSU-MMSE: based on discussions with NHC, it was decided to simply time-lag the FSU-MMSE intensity guidance without applying any adjustment to match the analyzed intensity at the zero hour. All Stream 1.5 candidates in the late model category were converted to early model versions using the assumption that their run times are short enough for the guidance to be available for the forecast cycle six hours after model initialization (i.e., the 6-h forecast is converted to a 0-h forecast).
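The two intensity offset options can be sketched as follows. This is not the NHC interpolator code; the smoothing and 6-h/12-h time-lag steps are omitted, and the function name, argument names, and the default values of tf and tn are illustrative assumptions.

```r
# Illustrative sketch of the two intensity offset options described above
# (not the operational interpolator; smoothing and time-lagging are omitted).
# 'lead_times' are in hours, 'lagged_fcst' is the time-lagged intensity forecast,
# and 'offset' is the analyzed intensity minus the lagged forecast at the new 0 h.
apply_intensity_offset <- function(lead_times, lagged_fcst, offset,
                                   option = c("full", "variable"),
                                   t_f = 6, t_n = 48) {   # t_f, t_n: illustrative defaults
  option <- match.arg(option)
  if (option == "full") {
    weight <- rep(1, length(lead_times))                  # same adjustment at all leads
  } else {
    weight <- ifelse(lead_times <= t_f, 1,                # full adjustment out to t_f
              ifelse(lead_times >= t_n, 0,                # no adjustment beyond t_n
                     (t_n - lead_times) / (t_n - t_f)))   # linear ramp-down in between
  }
  lagged_fcst + weight * offset
}

# Example with a dummy lagged intensity forecast (kt)
leads <- seq(0, 120, by = 6)
fcst  <- 50 + 0.5 * leads
adj   <- apply_intensity_offset(leads, fcst, offset = 5, option = "variable")
```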
Error Distributions

The errors associated with each forecast (Stream 1.5 and operational baselines) were computed relative to the Best Track analysis [2] using the Model Evaluation Tools Tropical Cyclone (MET-TC) package. This software was also used to generate the variable and fixed consensus forecasts with and without the Stream 1.5 candidate and to compute the errors associated with each of these consensus forecasts. The statistics for the individual cases were aggregated using a script in the R statistical language. All aggregations were done for homogeneous samples (i.e., only cases for which both the experimental and operational forecasts were available were included in the aggregated statistics). Given the distribution of errors and absolute errors at a given lead time, several parameters of the distribution were computed: mean, median, quartiles, and outliers. In addition, confidence intervals (CIs) on the mean were computed using a parametric method with a correction for first-order autocorrelation (Chambers et al. 1983; McGill et al. 1978). Only lead times and errors for which the distribution contained at least 11 samples are considered in the statistical significance (SS) discussions, because the error distribution parameters cannot be accurately estimated for smaller samples. High autocorrelation can reduce the effective sample size, which can lead to samples for which 11 cases are insufficient to accurately estimate the variability and confidence; for those samples, the minimum sample size was increased to 20. Confidence intervals are displayed only for those samples where these measures could be accurately estimated. The 95% confidence level was selected as the criterion for determining statistical significance.

[2] The Best Track analysis was obtained from the NOAA Web Operations Center (ftp://ftp.nhc.noaa.gov/atcf) on 1 April 2013.
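One common way to implement a parametric confidence interval on the mean with a first-order autocorrelation correction is to inflate the standard error through a reduced effective sample size based on the lag-1 autocorrelation. The sketch below illustrates that approach; the exact formulation used in the TCMT analysis may differ, and the function name and bounds on the effective sample size are assumptions.

```r
# Sketch of a parametric CI on the mean with a first-order autocorrelation
# correction via the effective sample size (one common formulation; the exact
# form used in the TCMT analysis may differ).
mean_ci_ar1 <- function(x, conf = 0.95) {
  x  <- x[!is.na(x)]
  n  <- length(x)
  r1 <- acf(x, lag.max = 1, plot = FALSE)$acf[2]   # lag-1 autocorrelation
  n_eff <- n * (1 - r1) / (1 + r1)                 # reduced effective sample size
  n_eff <- max(min(n_eff, n), 2)                   # keep within sensible bounds
  se <- sd(x) / sqrt(n_eff)
  tcrit <- qt(1 - (1 - conf) / 2, df = n_eff - 1)
  c(mean = mean(x), lower = mean(x) - tcrit * se, upper = mean(x) + tcrit * se)
}
```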

Baseline Comparisons

The errors for some cases were substantial. Such outliers and/or large variability in the error distributions increase the likelihood that the confidence intervals for the errors of two models will overlap even if one model is consistently performing better than the other. By comparing the error differences rather than the errors themselves (i.e., using a paired test rather than a two-sample test), the variability due to difficult forecasts and large errors is removed. Hence, for criteria 1 and 3 of this evaluation (described in the first paragraph of this document), a pairwise technique was used to address the question of whether the differences between the experimental and operational forecasts are statistically significant (SS). For this technique, the absolute error of a given quantity (e.g., intensity error) for a Stream 1.5 forecast or the experimental consensus forecast is subtracted from the same metric for the operational baseline. This subtraction is done separately for each lead time of each case, yielding a distribution of forecast error differences. The parameters of this difference distribution are then computed using the same methodology applied to the error distributions for a single model or model consensus.

Knowing whether substantial error differences more often favor one model or scheme over the other is valuable information when selecting numerical guidance to be included in the operational forecast process. When negative and positive error differences occur at approximately the same frequency, the median of the error difference distribution is insensitive to the size of the differences, whereas the mean error difference is sensitive to both the direction and size of the error differences. Hence, the mean error difference is used in this study to assess SS. A SS difference between the forecast verification metrics of the Stream 1.5 candidate (or the consensus including the Stream 1.5 candidate) and the operational baseline (or operational consensus) was noted when it was possible to ascertain with 95% confidence that the mean of the pairwise differences was not equal to zero. The pairwise method enables the identification of subtle differences between two error distributions that may go undetected when the mean absolute error (MAE) or root mean square error (RMSE) of each distribution is computed and the overlap of the CIs for the means is used to ascertain differences (e.g., Lanzante 2005; Snedecor and Cochran 1980). Positive (negative) mean error differences and percent improvement values indicate the errors associated with the Stream 1.5 candidate are smaller (larger) on average than those of the operational baseline.
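The pairwise technique can be sketched as follows, reusing the autocorrelation-corrected confidence interval helper from the Error Distributions sketch above. The sign convention follows the text (baseline error minus candidate error, so positive differences favor the Stream 1.5 candidate); the percent improvement normalization (relative to the mean baseline error) is our assumption.

```r
# Sketch of the pairwise (paired-difference) significance test described above,
# for a homogeneous sample at one lead time.  'err_baseline' and 'err_candidate'
# are absolute errors for the same cases, in the same order.
pairwise_test <- function(err_baseline, err_candidate, conf = 0.95) {
  d  <- err_baseline - err_candidate       # > 0 favors the Stream 1.5 candidate
  ci <- mean_ci_ar1(d, conf)               # CI on the mean difference (sketch above)
  list(mean_diff   = unname(ci["mean"]),
       # percent improvement relative to the mean baseline error (assumed convention)
       pct_improve = 100 * mean(d) / mean(err_baseline),
       # SS when zero lies outside the CI for the mean pairwise difference
       significant = ci["lower"] > 0 || ci["upper"] < 0)
}
```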
Comparison with Top-flight Models as a Group

To assess the Stream 1.5 candidate's performance relative to last year's top-flight models as a group, rankings of the Stream 1.5 candidate's performance with respect to the top-flight operational models were determined for each case and lead time within a homogeneous sample, where a ranking of one corresponds to the Stream 1.5 candidate having the smallest error and a ranking of four (or five for some comparisons) corresponds to the Stream 1.5 candidate having the largest error. In some cases, the Stream 1.5 error can be the same as that of one of the top-flight models, resulting in ties in the rankings. In the case of ties, the rankings are randomly assigned to each model. This approach to handling ties allows one to determine point-wise confidence intervals around the proportion of cases for each ranking. Once again, 95% confidence intervals were selected for this evaluation. The relative frequency of each ranking provides useful information about how the performance of the Stream 1.5 model relates to the performance of the top-flight operational models. When the frequency of each ranking (first through fourth, or fifth for some comparisons) is approximately 25% of the cases for a comparison of four models, and approximately 20% for a comparison of five models, the candidate model errors are indistinguishable from the errors of the top-flight models.

A high frequency of rank one indicates the Stream 1.5 candidate is matching or outperforming the top-flight models on a regular basis, whereas a high frequency of rank four (or rank five for comparisons of five models) indicates the Stream 1.5 candidate is not improving upon the operational guidance. Frequencies of the error rankings were also computed in which the lowest ranking was awarded to the Stream 1.5 candidate in the event of a tie; this additional analysis was done to provide a common context with the approach used for the 2011 Stream 1.5 evaluation and to provide information regarding the frequency of such ties.

Methods for Displaying Results

Mean errors with confidence intervals

Graphs displaying the mean errors of an experimental scheme and the corresponding operational baseline as a function of lead time, including 95% confidence intervals for each mean error, were used to provide a quick visual summary of the size of the mean errors, the relationship between the mean errors of the experimental scheme and the corresponding baseline, trends with lead time, and a measure of the variability in the error distribution. For these graphs, black symbols and lines always correspond to the properties of the operational baseline errors and red always corresponds to the experimental scheme.

Frequency of Superior Performance

To provide a quick summary of whether one model consistently outperforms the other for cases in which the error differences exceed the precision of the input data, the number of cases for which the error differences for intensity (track) equaled or exceeded 1 kt (6 nm) was tallied for each lead time, keeping track of whether the error difference favored the experimental scheme or the operational baseline (a minimal sketch of this tally appears after the Boxplots subsection below). This information is displayed in terms of the percent of cases as a function of lead time, where the black line corresponds to the percent of cases favoring the operational baseline and the red line corresponds to the percent of cases favoring the experimental scheme. Confidence intervals for these plots are calculated using the standard interval for proportions. This analysis categorizes the errors, so the size of each error has no effect on the results once the category is determined. Furthermore, by examining the frequency rather than the magnitude of the errors, different information can be obtained. Forecasts may have similar average errors even though one forecast is frequently better than the other; conversely, forecasts may have very different average errors even though each is best on a similar number of cases. Typically, the frequency analysis confirms the conclusions from the error magnitude analysis. When it does not, it is important to understand the forecast behavior. In this way, the frequency analysis complements the pairwise analysis and provides additional information.

Boxplots

Boxplots are used to display the various attributes of the error distributions in a concise format. Figure 1 illustrates the basic properties of the boxplots used in the Stream 1.5 candidate reports. The mean of the distribution is depicted as a star and the median as a bold horizontal bar. The 95% CIs for the median are shown as the waist, or notch, of the boxplot. Note that the notches (CIs) generated by the R boxplot function do not include a correction for first-order autocorrelation. The outliers shown in this type of display are useful for obtaining information about the size and frequency of substantial errors or error differences.
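The tally described under Frequency of Superior Performance can be sketched as follows. The 1-kt and 6-nm precision thresholds are taken from the text; the normal-approximation interval is used here as one reading of "the standard interval for proportions", and the function name is hypothetical.

```r
# Sketch of the superior-performance tally: error differences smaller than the
# data precision (1 kt for intensity, 6 nm for track) are ignored, and the
# remaining cases are counted by which scheme they favor.
superiority_freq <- function(err_baseline, err_candidate, precision, conf = 0.95) {
  d <- err_baseline - err_candidate
  d <- d[abs(d) >= precision]                 # keep only meaningful differences
  n <- length(d)
  p_candidate <- mean(d > 0)                  # fraction of cases favoring the candidate
  z  <- qnorm(1 - (1 - conf) / 2)
  hw <- z * sqrt(p_candidate * (1 - p_candidate) / n)   # normal-approximation half-width
  c(n = n, pct_candidate = 100 * p_candidate,
    lower = 100 * (p_candidate - hw), upper = 100 * (p_candidate + hw))
}

# Example call for intensity errors with the 1-kt precision threshold:
# superiority_freq(abs_err_baseline, abs_err_candidate, precision = 1)
```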

Figure 1: Description of the boxplot properties.

Summary tables

Tables are used to provide a concise summary of the pairwise difference analysis. Each cell in the table contains three numbers corresponding to the mean error difference (top), the percent improvement or degradation (middle), and the probability of having an error difference closer to zero than the mean error difference (bottom). A blank probability entry means the effective sample size was such that no meaningful probability statistic could be computed. Color shading is used to highlight the mean error differences that are SS, where green is used for mean error differences favoring the experimental scheme and red is used for mean error differences favoring the operational baseline. The darkness of the shading highlights the size of the percent improvement or degradation. For track, the shading thresholds are based on the Stream 1.5 selection criteria: light shading corresponds to mean track error differences that do not meet the selection criteria (< 4%), medium shading indicates mean track error differences that meet the criteria (4-5%), and dark shading indicates mean track error differences that go well beyond the criteria (≥ 6%). In contrast, the selection criteria for intensity guidance do not put forth a minimum percent improvement, so the shading thresholds for intensity were simply selected to provide a quick visualization of basic percent change ranges: light shading indicates mean intensity error differences with percent changes of less than 5%, medium shading percent changes of 5 to 9%, and dark shading percent changes equaling or exceeding 10%. Colored fonts are used to distinguish which scheme has smaller errors for those mean error differences that do not meet the SS criteria. Figure 2 illustrates the basic properties of these summary tables.
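For reference, the shading thresholds described above can be expressed as a simple categorization. The handling of values that fall exactly on a category boundary is our reading of Figure 2, not something stated explicitly in the text, and the function name is hypothetical.

```r
# Sketch of the summary-table shading categories: track thresholds follow the
# Stream 1.5 selection criteria (< 4%, 4-5%, >= 6%); intensity thresholds are
# the simple < 5%, 5-9%, >= 10% ranges described above.
shading_category <- function(pct_change, metric = c("track", "intensity")) {
  metric <- match.arg(metric)
  breaks <- if (metric == "track") c(0, 4, 6, Inf) else c(0, 5, 10, Inf)
  cut(abs(pct_change), breaks = breaks,
      labels = c("light", "medium", "dark"),
      right = FALSE, include.lowest = TRUE)
}

shading_category(4.5, "track")      # "medium": meets the track selection criteria
shading_category(12,  "intensity")  # "dark": percent change of 10% or more
```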

Figure 2: Description of entries in summary tables and color schemes. (Table entries: mean error difference; % improvement (+) / degradation (-); probability of having an error difference closer to zero than the mean error difference.)

Rankings line plots

By examining the error rankings, model performance can be gauged based on the frequency of superior (or inferior) performance. This analysis complements the other assessments based on distributions, means, and SS error differences. Typically, the results from a ranking frequency analysis confirm the results from the other types of analyses. Occasionally, however, model performance shows little difference from the top-flight models in the SS tables but large differences in rank frequency; thus, it is important to examine both. Ranks one (smallest error) through four, or five for some model comparisons (largest error), are assigned to all model errors, with ties randomly assigned. The frequency of each rank for the candidate model is displayed in a line plot by lead time. An example of this type of plot is shown in Fig. 3. Rankings one through four or five (1-4 or 1-5) are color coded and labeled with the appropriate ranking number. This display is similar to presenting a rank histogram for each lead time, but condensed into one figure. The dashed lines, color coded to match the appropriate rank, show the point-wise 95% confidence intervals around the proportion of cases for each ranking. The 25% (20%) frequency is highlighted by a solid grey line. When a ranking frequency line lies above or below this 25% (20%) line with CIs on the same side of the line (e.g., rank 1 and its CIs in the example lie above the 25% line for longer lead times), the results suggest the performance of the candidate model can be deemed statistically distinguishable, for better or worse, from that of the top-flight models. If all confidence intervals include the 25% (20%) line, the performance of the candidate model cannot be deemed statistically distinguishable from that of the top-flight models (e.g., the rankings in the example at 48 h). The black numbers show the frequency of the first and fourth (fifth) rankings when the candidate model is assigned the better (lower) ranking for all ties. For cases with many ties, the frequencies will differ considerably (e.g., rank 4 in the example for short lead times); when there are only a few ties, the frequencies of the rankings will be quite similar.
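The ranking procedure with random tie assignment can be sketched as follows. The input layout (candidate errors in the first column, top-flight model errors in the remaining columns) and the normal-approximation interval for the rank proportions are assumptions for illustration; the point-wise intervals in the reports may use a different formula.

```r
# Sketch of the ranking frequency analysis at one lead time: for each case the
# candidate is ranked against the top-flight models, with ties broken at random,
# and the frequency of each rank is accumulated with an approximate proportion CI.
# Column 1 of the matrix 'err' holds the candidate's absolute errors; the
# remaining columns hold the top-flight models' absolute errors for the same cases.
rank_frequencies <- function(err, conf = 0.95) {
  ranks <- apply(err, 1, function(e) rank(e, ties.method = "random")[1])
  n  <- length(ranks)
  p  <- tabulate(ranks, nbins = ncol(err)) / n   # frequency of each rank 1..k
  z  <- qnorm(1 - (1 - conf) / 2)
  hw <- z * sqrt(p * (1 - p) / n)                # normal-approximation half-widths
  data.frame(rank = seq_len(ncol(err)), freq = p, lower = p - hw, upper = p + hw)
}
```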

Figure 3: Sample error ranking line plot.

References

Chambers, J. M., W. S. Cleveland, B. Kleiner, and P. A. Tukey, 1983: Graphical Methods for Data Analysis. Wadsworth & Brooks/Cole Publishing Company.

Lanzante, J. R., 2005: A cautionary note on the use of error bars. J. Climate, 18, 3699-3703.

McGill, R., J. W. Tukey, and W. A. Larsen, 1978: Variations of box plots. The American Statistician, 32, 12-16.

Snedecor, G. W., and W. G. Cochran, 1980: Statistical Methods. Iowa State University Press, pp. 99-100.

Table 1: Descriptions of the operational models, including their ATCF IDs, used as baselines or as components of baselines for the 2013 Stream 1.5 evaluation.

ATCF ID | Type | Description
EMXI | Global, interpolated dynamical | Previous cycle ECMWF global model, adjusted using the full offset (ECMWF is only available every 12 hours, so the interpolated guidance used for evaluation is a combination of 6-h and 12-h time-lagged forecasts)
GFSI | Global, interpolated dynamical | Previous cycle Global Forecast System (GFS) model, adjusted using the full offset
EGRI | Global, interpolated dynamical | Previous cycle United Kingdom Met Office (UKMET) model, automated tracker with subjective quality control applied, adjusted using the full offset (UKMET is only available every 12 hours, so the interpolated guidance used for evaluation is a combination of 6-h and 12-h time-lagged forecasts)
GHMI | Regional, interpolated dynamical | Previous cycle Geophysical Fluid Dynamics Laboratory (GFDL) model, adjusted using a variable intensity offset correction that is a function of forecast time
LGEM | Statistical-dynamical | Logistic Growth Equation Model
HWFI | Regional, interpolated dynamical | Previous cycle Hurricane Weather Research and Forecasting (HWRF) model, adjusted using the full offset
DSHP | Statistical-dynamical | SHIPS with inland decay
ICON | Fixed consensus | Average of DSHP, LGEM, GHMI, and HWFI; all members must be present
TVCA | Variable consensus (Atlantic) | Average of EMXI, GFSI, EGRI, GHMI, and HWFI; at least two members must be present
TVCE | Variable consensus (eastern North Pacific) | Average of EMXI, GFSI, EGRI, GHMI, and HWFI; at least two members must be present