A spatial verification method applied to the evaluation of high-resolution ensemble forecasts


METEOROLOGICAL APPLICATIONS Meteorol. Appl. 15: 125-143 (2008). Published online in Wiley InterScience (www.interscience.wiley.com)

A spatial verification method applied to the evaluation of high-resolution ensemble forecasts

Chiara Marsigli,* Andrea Montani and Tiziana Paccagnella
ARPA SIM (HydroMeteorological Service of Emilia-Romagna), Bologna, Italy

ABSTRACT: The verification of ensemble systems is carried out operationally in several meteorological centres. However, the main operational ensemble systems have a coarser spatial resolution than the deterministic runs, and only recently have high-resolution limited-area ensembles started to be run on a regular basis. Their verification requires combining the usual probabilistic evaluation with the statistical verification techniques being developed for high-resolution model forecasts (1-10 km). These techniques make it possible to evaluate a deterministic forecast in a probabilistic manner, by taking into account the spatio-temporal distribution of the forecast at different scales. In this work, a spatial verification technique called the distributional method is used to verify the Consortium for Small-scale MOdeling Limited-area Ensemble Prediction System (COSMO LEPS), a mesoscale ensemble with 10 km horizontal resolution. The system is mainly designed to give probabilistic assistance in the forecasting of severe weather, in particular of intense precipitation possibly leading to floods; verification is therefore focused on the ability of the system to forecast precipitation at high spatial resolution. The methodology is based on a comparison of forecasts and observations in terms of some parameters of their distributions, evaluated after the values are aggregated over boxes of selected size. In particular, performance in terms of the average, a few percentiles and the maximum forecast value in a box is considered.
The system is compared against the European Centre for Medium-Range Weather Forecasts Ensemble Prediction System (ECMWF EPS), addressing the issues of an intercomparison between a higher-resolution smaller-size ensemble and a lower-resolution larger-size one. Results show that, as far as the forecast of the average amount of precipitation over an area is concerned, COSMO LEPS is more skilful than the Ensemble Prediction System (EPS) only from the resolution point of view. That is, although not properly calibrated, it is more capable of distinguishing between events and non-events, especially for moderate and high precipitation. Furthermore, COSMO LEPS has skill in forecasting the occurrence of precipitation peaks over an area, irrespective of their exact location. The analysis of the score behaviour as a function of the distribution parameter shows that the EPS has maximum skill in reproducing the central part of the observed precipitation distribution over an area of about 10 000 km², while COSMO LEPS is more skilful in reproducing the tail of the observed precipitation distribution. The problem of the predictability of precipitation at different spatial scales is also investigated, showing the role of the different system resolutions. Copyright 2008 Royal Meteorological Society

KEY WORDS verification; ensemble; high-resolution

Received 14 September 2007; Revised 18 January 2008; Accepted 18 January 2008

1. Introduction

Ensemble forecasting is nowadays recognized as part of the usual set of weather forecasting tools, available in meteorological operational rooms to help forecasters issue predictions.
Since the start of operational ensemble prediction by the European Centre for Medium-Range Weather Forecasts (ECMWF), the National Centers for Environmental Prediction (NCEP) and the Canadian Meteorological Centre (CMC) in the nineties (Molteni et al., 1996; Tracton and Kalnay, 1993; Houtekamer et al., 1996), several systems, both global and limited-area, have been developed in order to fulfil the requirements of more diversity and higher resolution that have arisen in the meantime. Among the high-resolution ensemble systems currently available, the Consortium for Small-scale MOdeling Limited-area Ensemble Prediction System (COSMO LEPS) is the mesoscale limited-area ensemble of the COSMO Consortium, developed by ARPA SIM and running since November 2002 (Montani et al., 2003). COSMO LEPS currently runs as a time-critical application at ECMWF, on resources made available by the COSMO countries. The system aims at improving the early and medium-range predictability of extreme and localized weather events, especially when orographic and mesoscale-related processes play a crucial role. The ensemble is based on 16 runs of the COSMO model (Steppeler et al., 2003), formerly known as LM, the non-hydrostatic limited-area model developed by the COSMO Consortium (Paccagnella, 2006).

* Correspondence to: Chiara Marsigli, ARPA SIM (HydroMeteorological Service of Emilia-Romagna), Bologna, Italy. E-mail: cmarsigli@arpa.emr.it

COSMO LEPS runs at a horizontal resolution of 10 km, with 40 levels in the vertical, while the EPS currently runs at an approximate horizontal resolution of about 50 km. The integration domain is shown in Figure 1. The ensemble is generated as a downscaling of the global ECMWF EPS (Molteni et al., 2001; Marsigli et al., 2001), aiming at linking the forecast capabilities of a high-resolution non-hydrostatic limited-area model to the ensemble approach. In the COSMO LEPS system, the perturbations are ingested mainly through the initial and boundary conditions, which are provided by selected members of the operational ECMWF EPS, denoted as representative members (RMs). Because the COSMO LEPS perturbations are derived from the EPS ones, the system is especially useful in the early medium range (days 3-5). Furthermore, a random choice of the scheme used for the parameterization of deep convection (either Tiedtke or Kain-Fritsch) is allowed in the COSMO runs. A preliminary study (Marsigli et al., 2005a) suggested that the perturbations applied in this way play a minor role with respect to the perturbations at the boundaries, which explain the major part of the spread. For a complete description of the methodology, the reader is referred to the above-mentioned papers.

The main purpose of COSMO LEPS is to give assistance to civil protection activities, especially in situations of intense precipitation possibly leading to floods. The system has undergone both subjective and objective verification (Marsigli et al., 2005a, 2005b, 2006), focusing on its ability to provide a skilful probabilistic quantitative precipitation forecast (PQPF). Both the characteristics of the precipitation field and the potential of a mesoscale forecasting system require the use of a high-resolution verification methodology.
In this work, the distributional method (DIST), developed in recent years at ARPA SIM for the verification of COSMO LEPS high-resolution precipitation forecasts, is presented and applied to verify the system performance over a season. DIST, which is also used at ARPA SIM for the verification of deterministic forecasts, consists of an upscaling of both forecast and observed fields over boxes of selected size. The upscaling is performed by selecting some parameters of the precipitation distribution, providing a single forecast-observation pair for each box. This work aims at analysing both the characteristics of the verification methodology and the performance of COSMO LEPS in terms of PQPF.

An evaluation of the PQPF issued by a mesoscale ensemble system faces the difficulties of both a high-resolution verification and an ensemble evaluation. The former has been tackled in recent years by a number of techniques, since it has been recognized that traditional verification statistics are not suitable for the evaluation of high-resolution forecasts (Bougeault, 2003). It is often recognized that the surface fields forecast by a high-resolution system look more realistic than the corresponding lower-resolution fields; in particular, for intense precipitation, peak values closer to the observed ones can be predicted. Nonetheless, it is difficult to prove the benefit brought by a higher-resolution forecasting system by applying traditional objective verification techniques, especially if models with a grid spacing of less than 10 km are considered (Mass et al., 2002).

Figure 1. COSMO LEPS integration domain (shaded area).

In general, traditional verification measures tend to penalize high-resolution predictions, because of the smaller-scale matching between forecasts and observations that they require. Among the various strategies developed to overcome this problem, the fuzzy verification techniques allow a certain amount of uncertainty in the matching between forecast and observed values required by the indices (Ebert, 2008). DIST can be considered a fuzzy technique, since the forecast-observation matching is performed on upscaled values. The fuzzy verification strategies also make it possible to bring the problem of the actual resolution into the verification process. It is, in fact, widely recognized that the actual forecast resolution is not the horizontal grid resolution Δx; hence, a verification carried out by taking the grid-point forecasts at face value turns out to be penalized (Davis and Carr, 2000). Recent studies have shown that an actual resolution of a few Δx has, instead, to be attributed to high-resolution forecasts, e.g. Skamarock (2004). Therefore, the use of upscaling/fuzzy verification techniques is more appropriate, as they evaluate the model forecasts at a scale where the models are more reliable.

The difficulties associated with an ensemble evaluation arise from the fact that an ensemble forecast is intrinsically different from a deterministic one: the forecast entity (the pdf, or a probability) is never known, neither a priori (as is also the case for a deterministic forecast) nor a posteriori (which is strictly true only for the ensemble). In fact, since probability is not a physically observable quantity, it is not possible to assess the quality of an individual probabilistic forecast (Candille and Talagrand, 2005). The assessment can only be statistical, and a large sample is needed.
To bring these features into the verification process, a number of dedicated score measures are adopted for the ensemble, which permit evaluation, in a statistical sense, of the degree of matching of the forecast probabilities with the observed frequencies, or of the capability of the system to detect an event as a function of a set of probability thresholds. The application of the proposed fuzzy spatial verification method to the ensemble evaluation has therefore required the merging of two different probabilistic approaches. In this work, this has been done by decoupling the applications of the two techniques: first, the ensemble members are treated separately, applying the distributional method to each of them; the 16-member forecasts are spatialized from the grid points to the verification boxes, and then the ensemble of the mapped forecasts is used in the computation of the usual probabilistic scores.

In this work, the COSMO LEPS ensemble is compared against the EPS, the global ensemble from which it is derived. This permits the issues of an intercomparison between a higher-resolution smaller-size ensemble and a lower-resolution larger-size one to be addressed. An analysis of the score behaviour when aggregating forecasts and observations at different spatial scales is also carried out.

The paper is organized as follows: the verification methodology is described in Section 2, while the implementation details are presented in Section 3. Verification results are discussed in Section 4, which is organized in sub-sections. In Section 4.1 COSMO LEPS is compared against the EPS over boxes of 1.0 × 1.0 degrees in terms of the mean and maximum distribution parameters, while the score behaviour as a function of a number of parameters is analysed in Section 4.2. Section 4.3 discusses the extent to which the COSMO LEPS and EPS scores vary as a function of the aggregation scale. Finally, conclusions are drawn in Section 5.
A brief description of the verification indices is presented in the Appendix.

2. Verification methodology

The spatial verification method proposed here is very simple, but quite intuitive in terms of the interpretation of the verification results. DIST is based on the verification of the precipitation distributions within boxes of selected size. The verification domain is subdivided into a number of boxes, each of them containing a certain number of observed and forecast values. For each box, several parameters of the distribution of both the observed and the forecast values falling in the box are computed (mean, median, percentiles, maximum). Verification is then performed using a categorical approach, by comparing, for each box, each parameter of the forecast distribution against the corresponding parameter of the observed distribution, using a set of indices. This approach also permits evaluation of the performance of the model when aggregating forecast and observed values over boxes of different sizes, to estimate the scale dependence of the skill.

When the mean value is considered, the approach is similar to other upscaling methods, such as that followed in Ghelli and Lalaurette (2000), with the difference that in DIST both forecasts and observations are upscaled to a common coarser resolution, to allow a probabilistic treatment of the forecasts. A more similar approach is the one followed by Zepeda-Arce et al. (2000), where categorical scores for a deterministic forecast are computed for a set of spatial scales to which both forecasts and observations are aggregated. The need to evaluate different characteristics of the precipitation distribution, not only average values, is what led to the development of the DIST methodology.
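As an illustration, the box-wise upscaling at the core of the method can be sketched as follows. This is a minimal Python sketch, not the authors' code: the function name, the flat longitude/latitude/value arrays and the simple degree-based binning are assumptions made for the example.

```python
# Sketch of the DIST upscaling step: all values (station observations, or the
# grid points of one ensemble member) falling inside each box are reduced to a
# few parameters of their distribution. Boxes with too few points are rejected,
# as done in the paper for poorly sampled boxes.
import numpy as np

def box_parameters(lons, lats, values, box_size=1.0, min_points=5):
    """Aggregate point values into box_size x box_size degree boxes and
    return, per box, parameters of the value distribution."""
    lons, lats = np.asarray(lons), np.asarray(lats)
    values = np.asarray(values, dtype=float)
    # Index each point by the box it falls in (simple floor binning in degrees).
    keys = np.stack([np.floor(lons / box_size), np.floor(lats / box_size)], axis=1)
    params = {}
    for key in np.unique(keys, axis=0):
        in_box = values[(keys == key).all(axis=1)]
        if in_box.size < min_points:        # reject poorly sampled boxes
            continue
        params[tuple(key)] = {
            "mean": in_box.mean(),
            "median": np.median(in_box),
            "p75": np.percentile(in_box, 75),
            "p95": np.percentile(in_box, 95),
            "max": in_box.max(),
        }
    return params
```

Applied once to the rain-gauge values and once to each ensemble member, this yields the observation/forecast pairs per box and per parameter on which the categorical and probabilistic scores are then computed.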
This need was pushed forward by the highly non-Gaussian shape of the precipitation distribution and by the potential of a high-resolution forecasting system, which can show some skill in predicting intense and localized precipitation values. The method has been developed in order to satisfy three major requirements:

1. to verify the precipitation forecast by a high-resolution system;
2. to compare scores obtained by two systems with different horizontal resolutions;
3. to produce verification results which can be used directly to interpret how to use the forecast system.

The extent to which the method fulfils the first two requirements relies mainly on it being an upscaling methodology. Observed and forecast values are aggregated over boxes large enough to contain a number of both stations and grid points; a spatial homogenization of observations and forecasts is then performed, ending with only a pair of representative values per box. By selecting the box size conveniently, forecasts issued by models with different resolutions can also be spatially homogenized, since they are brought to the same scale, larger than the scales of the single modelling systems (i.e. requirement 2). Since the forecasts and the observations lying within a box are treated in a probabilistic sense from the localization point of view, the technique is a fuzzy verification method: all the values falling within a box, no matter their precise location, jointly determine the forecast/observation pair of values that will contribute to the score coming from that box. Furthermore, since several parameters of the precipitation distribution are considered, the peculiar capabilities of a mesoscale forecasting system can be highlighted by this approach, satisfying requirement 1.

From the point of view of an operational use of the precipitation forecast issued by the modelling chain (i.e. requirement 3), the necessity of evaluating different parameters of the distribution is easily understood. The correct reproduction of the average precipitation fallen over an area of, e.g., 50 × 50 km² is indicative of the skill of the modelling system, but it is not the only, or the most important, information needed for civil protection purposes and for issuing alerts.
Even if the average value is well below any warning threshold, the possible presence of localized precipitation peaks is important information for prevention. This is even more true in the case of thunderstorms, typical of the spring and summer seasons over our area. The capability of forecasting the possible occurrence of localized precipitation exceeding pre-defined thresholds anywhere within an area can also be regarded as an indicator of skill when the forecasting system is used for these purposes. The distributional method allows this evaluation by comparing forecasts against observations in terms of maxima over boxes. In order to give a more complete representation of the distribution, characterized by a very skewed shape, not only maximum values are considered but also high percentiles, so as to represent the skill of the system in forecasting the tail of the precipitation distribution. This kind of areal evaluation, being a way to account for the uncertainty arising from localization errors, is also well suited to the needs of forecasters, at least in the Italian system, where alerts for civil protection purposes are issued over alert areas instead of single station points.

The proposed methodology fits the framework proposed by Ebert (2008), belonging to the neighbourhood observation-neighbourhood forecast type of methods, since a degree of allowable displacement (i.e. the box size) is accepted for both observations and forecasts. Nonetheless, by using a set of distribution parameters and not just one of them, the final system evaluation is performed by summarizing the results obtained for each parameter in a global judgment, where the behaviours in terms of mean, maximum, 95th percentile and other parameters are all considered. This kind of skill assessment allows a lower degree of uncertainty in the system evaluation.
In fact, an infinite number of precipitation distributions can give a particular mean value, just as an infinite number of precipitation distributions can have the same maximum value. But considering together, e.g., the mean, maximum, 75th and 95th percentiles, the constraint becomes stronger and the matching becomes more difficult to satisfy. In principle, if all the percentiles are considered, this ends in a verification where all the intensities have to be matched and some uncertainty is allowed only in the precipitation location. From this point of view, the DIST method is also similar to an object-oriented approach: if the boxes are considered as pre-defined objects to be matched, determined a priori, the method is similar to the intensity-evaluation part of the MODE system (Davis et al., 2006), where the statistical distribution of rainfall within a rain area is the verified quantity, the pre-defined object being considered as already matched. Finally, it has to be pointed out that the proposed technique is neither equally generous nor equally stringent across the different parameters, since, e.g., the matching between maximum values is more generous than that between average values (Ebert, 2008). This has to be kept in mind when comparing the results.

It is worth noting that the proposed DIST methodology can also be used for the evaluation of a deterministic forecast system. In this work, it has been applied to the verification of an ensemble system, thus combining the fuzzy verification approach with the usual probabilistic evaluation suited to an ensemble system.
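The under-constraint of a single parameter can be illustrated with two hypothetical sets of 25 grid-point values in one box (the numbers are invented for illustration): both have the same box mean, so a mean-only verification cannot separate them, while the maximum immediately does.

```python
# Two hypothetical rainfall distributions inside the same box: identical box
# mean, very different tails. Verifying the mean alone would call these a
# perfect match; adding the maximum (or a high percentile) separates them.
import numpy as np

stratiform = np.full(25, 8.0)                    # uniform 8 mm everywhere
convective = np.array([0.0] * 24 + [200.0])      # dry box with one 200 mm peak

assert np.isclose(stratiform.mean(), convective.mean())   # both means are 8 mm
assert stratiform.max() != convective.max()               # 8 mm vs 200 mm
```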
The probabilistic indices, which permit evaluation of the probability distribution coming from the ensemble, are computed here as in a common ensemble evaluation, that is, by quantifying the match between the probability with which a given event is forecast by the ensemble at a given point, expressed as the percentage of ensemble members forecasting it, and the observed frequency of the event at that point, either 1 or 0 depending on whether the event is observed. The DIST methodology plays its role in determining the two values at the point, viz. the forecast probability and the observed frequency. These are not determined from just one observation and one ensemble of forecast values over the selected point but are derived, taking the mean parameter as an example, from the average value of all the observations falling within the box whose centre is the selected point, and from the ensemble of the average values of all the forecasts issued at the grid points falling within the same box.

The evaluation of the parameters of the precipitation distribution over the boxes requires some care in order to avoid obtaining meaningless values. First of all, the number of points, both forecast and observed, falling within each box has to be large enough to permit a reasonable sampling of the distribution. A large number of points would be desirable for every box; though this is a stringent requirement in principle, the assumption had to be relaxed somewhat in order to also obtain a large enough verification sample. When boxes of 1.0 × 1.0 and of 0.5 × 0.5 degrees are considered, the minimum number of observations has been fixed at 5, while this number has been increased to 10 when 2.0 × 2.0 boxes are considered, partially to balance the great increase in the number of model grid points falling within these boxes. As for the model grid points, COSMO LEPS has enough points for the 0.5 (25 points), 1.0 (100 points) and 2.0 (400 points) boxes, while the EPS grid points number 16, 4 and just 1 for the 2.0, 1.0 and 0.5 boxes, respectively.

Another problem arises at the boundaries of the area covered by observations (see Figure 2). While the models have grid points covering an area larger than the verification area, the observations do not cover the verification boxes completely. This can affect the results, especially in boxes that are only partially covered by observations but totally covered by grid-point forecasts. An example of the error that can be introduced is the case of intense precipitation falling only in the part of the box where observations are not present: if this event is correctly forecast, the distribution of the forecast values within the box contains high precipitation values, while the observed distribution does not, because the stations are located only in the part of the box where it is not raining. In order to prevent the errors due to this inhomogeneity from affecting the score computation, an observational mask has been introduced.
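A minimal sketch of such a mask follows. The names are illustrative, and the use of a plain latitude-longitude distance is an assumption; the paper specifies only that each grid point must be close enough (0.15 degrees in this work) to at least one station.

```python
# Sketch of the observational mask: a 0/1 field on the model grid, set to 1
# only where a station lies within a given radius of the grid point. Grid
# points far from any observation are excluded from the score computation.
# The Euclidean distance in degrees is an assumption for this sketch.
import numpy as np

def observation_mask(grid_lons, grid_lats, stn_lons, stn_lats, radius=0.15):
    """Return a boolean mask over the (flattened) model grid points that are
    within `radius` degrees of at least one station."""
    glon = np.asarray(grid_lons, dtype=float)[:, None]   # (n_grid, 1)
    glat = np.asarray(grid_lats, dtype=float)[:, None]
    slon = np.asarray(stn_lons, dtype=float)[None, :]    # (1, n_stations)
    slat = np.asarray(stn_lats, dtype=float)[None, :]
    dist = np.hypot(glon - slon, glat - slat)            # degrees
    return (dist <= radius).any(axis=1)
```

A coarser grid (such as the EPS one) simply yields fewer masked-in points for the same station network, which is why each model gets its own mask.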
The mask is similar to a land-sea mask: it is like a model field, with a value for every model grid point. Values of the observational mask are set to 0 at the grid points that are not close enough to any observation and to 1 otherwise. The required closeness of each grid point to a station point, which can be chosen arbitrarily, has been set to 0.15 degrees for this work. The observational masks of both COSMO LEPS and EPS are shown in Figure 3, while the observational network is shown in Figure 2.

Figure 2. Network of stations used for the verification.

Figure 3. COSMO LEPS and EPS observational masks. Only the grid points whose values have been set to 1 are shown.

3. Verification implementation

In this work, the COSMO LEPS performance is compared with that of the operational ECMWF EPS, both the full-size 51-member EPS and the reduced-size 16-member EPS made up of the 16 selected representative members. The latter is considered the reference ensemble for evaluating the COSMO LEPS performance, in order to analyse the impact of the high-resolution mesoscale model alone, the two systems having an identical number of members. The three systems will be referred to as:

cleps: COSMO LEPS, 16 members, 10 km horizontal resolution.
epsrm: reduced EPS made up of the RMs, 16 members, 50 km horizontal resolution.
eps51: full EPS starting at 1200 UTC, 51 members, 50 km horizontal resolution.

The need to compare the 10 km COSMO LEPS with the 50 km EPS determines the choice of boxes as large as 1.0 × 1.0 degrees, in order to have enough EPS grid points within a box to make the computation of the statistical properties possible.

The verification has been performed over spring 2006 (March, April and May, 92 days), and the available observational data set is provided by a dense network of rain gauges covering part of the alpine area. The data were made available by a number of Italian regions and by MeteoSwiss, in the framework of the COSMO cooperation. The gauges measuring precipitation number about 1400 (Figure 2). This period was chosen for an extensive verification exercise because it was the first season after ECMWF increased the EPS resolution from about 80 to about 50 km, decreasing the gap in resolution between the two systems. Only daily precipitation is evaluated, accumulated from 0600 to 0600 UTC, because of the Swiss data availability and of the extended forecast range covered by the COSMO LEPS system (132 h).

It has already been underlined that both the ensemble verification and the high-resolution PQPF verification require a large statistical sample. The effect of the degradation of the probabilistic scores because of the finiteness of the sample has been discussed in Candille and Talagrand (2005), while the problems linked with precipitation can easily be understood by recalling that the interest focuses on intense precipitation, which is a rare event, limiting the sample size. The scores presented in this paper are affected by both problems, though the sample has been carefully selected. The verification domain has been chosen as a compromise between having enough verification points and having a geographically homogeneous area. Furthermore, the

SPATIAL VERIFICATION OF HIGH-RESOLUTION ENSEMBLE FORECASTS 131

Figure 4. Brier Skill Score (panels (a) to (d)) and ROC area (panels (e) to (h)) as a function of the precipitation threshold (mm d⁻¹) when average values over boxes of 1.0° × 1.0° are considered. The solid line with circles is for COSMO LEPS, the dashed line with squares is for the 16-member EPS, while the dotted line with triangles is for the 51-member EPS. The error bars represent the confidence interval at the 95% level. Four different forecast ranges are shown: 18-42 h (panels (a) and (e)), 42-66 h (panels (b) and (f)), 66-90 h (panels (c) and (g)), 90-114 h (panels (d) and (h)).

alpine area has been selected, where high precipitation values are relatively common. The verification period has been chosen as a compromise as well: long enough to increase the sample, but limited to a single season for homogeneity reasons. Results are presented for a number of probabilistic indices: Brier Skill Score, Ranked Probability Skill Score, Relative Operating Characteristic (ROC) area and Percentage of Outliers, which are briefly described in the Appendix. A test has also been applied to assess the extent to which the score values obtained for COSMO LEPS are statistically significantly different from those obtained for the reference ensemble system (reduced-size EPS), based on the resampling technique (Wilks, 1995; Hamill, 1999).

One thousand sets of scores have been computed, relative to 1000 ensembles obtained by resampling, i.e. by mixing the members of the two ensembles under testing. The distribution of the 1000 values has then been considered and the 5th and 95th percentile values have been computed. These are used to build the 95% confidence intervals against which the difference between the cleps and epsrm scores is assessed. The confidence intervals are shown in the plots as error bars around the cleps score line.

4. Results

4.1. Comparison of COSMO LEPS with the two EPS systems

COSMO LEPS has been compared against the EPS over boxes of 1.0° × 1.0°, each corresponding approximately to an area of 100 × 100 km². The number of COSMO LEPS grid points falling within each box is about 100, while the number of station points varies from one box to another, from a minimum of 5 (boxes where this density is not reached are rejected) up to a maximum of about 100. The number of EPS grid points falling within each box, instead, is about 4. The number of observations falling in each class defined by the verification thresholds is shown in Table I.

Table I. Number of observed occurrences for each precipitation class for mean and maximum (1.0° × 1.0° boxes).

thresholds (mm d⁻¹)
parameter   1     5     10   20   30
mean        1088  466   234  83   27
maximum     1656  1251  897  469  266

4.1.1. Average values

In order to evaluate the skill of the systems in forecasting the total precipitation fallen over an area, the verification in terms of average values is examined first. In Figure 4, the Brier Skill Score values obtained for this parameter are shown in panels (a) to (d), while the ROC area values are shown in panels (e) to (h). In terms of BSS (Figure 4(a) to (d)), the discrimination between rain and no-rain (first threshold, 1 mm d⁻¹) is handled well by all the ensemble systems, the score going from 0.3 to 0.2 with increasing forecast range.
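The probabilistic indices and the significance test described in this section can be sketched as follows. This is a minimal, hypothetical illustration (not the authors' code), assuming per-case forecast probabilities and binary event occurrences; the ROC area is integrated with the trapezoidal rule over probability thresholds.

```python
import numpy as np

def brier_skill_score(prob, obs):
    """BSS of probability forecasts relative to sample climatology.
    prob: forecast probabilities in [0, 1]; obs: binary occurrences (0/1)."""
    prob, obs = np.asarray(prob, float), np.asarray(obs, float)
    bs = np.mean((prob - obs) ** 2)            # Brier score of the forecast
    bs_ref = np.mean((obs.mean() - obs) ** 2)  # Brier score of constant climatology
    return 1.0 - bs / bs_ref

def roc_area(prob, obs, thresholds=np.linspace(0.0, 1.0, 11)):
    """ROC area: hit rate vs false-alarm rate, integrated over
    probability thresholds with the trapezoidal rule."""
    prob, obs = np.asarray(prob, float), np.asarray(obs, float)
    pts = []
    for t in thresholds:
        yes = prob >= t
        pts.append((np.mean(yes[obs == 0]),    # false-alarm rate
                    np.mean(yes[obs == 1])))   # hit rate
    pts = [(0.0, 0.0)] + sorted(pts) + [(1.0, 1.0)]  # anchor the curve ends
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts[:-1], pts[1:]))

def resampling_interval(score_a, score_b, n_resamples=1000, seed=0):
    """5th/95th percentiles of the score difference obtained by randomly
    mixing the two competing ensembles' per-case scores, in the spirit of
    Wilks (1995) and Hamill (1999)."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(score_a, float), np.asarray(score_b, float)
    diffs = np.empty(n_resamples)
    for k in range(n_resamples):
        # case-by-case random swap of the two systems' scores ("mixing")
        swap = rng.integers(0, 2, size=a.size).astype(bool)
        diffs[k] = np.where(swap, b, a).mean() - np.where(swap, a, b).mean()
    return np.percentile(diffs, 5), np.percentile(diffs, 95)
```

An observed score difference lying outside the resampled interval is then judged significant, and the interval itself can be drawn as the error bars around the cleps score line.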
As the threshold increases, the scores drop and the systems begin to exhibit different behaviours. While at the 18-42 h forecast range (Figure 4(a)) the differences in BSS between cleps and epsrm are not statistically significant at the 95% level, in the second range (Figure 4(b)) eps51 obtains a higher score for all the thresholds and epsrm also exhibits a positive score, while cleps has a negative BSS at the intermediate thresholds. At the 66-90 h range (Figure 4(c)) the cleps score improves and its difference from epsrm is not significant. At the longer forecast range (Figure 4(d)), the two EPS systems still have a positive BSS, while the cleps values are lower, being negative for the higher thresholds. The Ranked Probability Skill Score (Figure 5) summarizes these findings by considering all the thresholds together, showing that, in terms of Brier-like scores, when averages over boxes are considered, eps51 gets the best results, while cleps is worse than, or indistinguishable from, epsrm. Considering the ROC area score (Figure 4(e) to (h)), the situation is very different. At the first forecast range (Figure 4(e)), cleps gets the highest values for all the thresholds, the differences with respect to the epsrm scores being significant. At the second forecast range (Figure 4(f)), the ROC area values obtained by cleps are intermediate between the eps51 and the epsrm ones, but the difference is not significant. At the 66-90 h forecast range (Figure 4(g)), the cleps ROC area is lower for the 1 mm threshold but higher for the higher thresholds, while at the longer forecast range (Figure 4(h)) the cleps scores are not significantly different from the epsrm ones. The marked difference between the results in terms of BSS and ROC area is due to the different kind of evaluation of the forecasts they provide. The Brier Skill Score, like all Brier-like scores, contains information about both the reliability and the resolution of the forecast. Reliability is

Figure 5. Ranked Probability Skill Score and Percentage of Outliers for the different forecast ranges when average values over boxes of 1.0° × 1.0° are considered. The solid line with circles is for COSMO LEPS, the dashed line with squares is for the 16-member EPS, while the dotted line with triangles is for the 51-member EPS.

determined by the capability of the system to provide forecast probabilities which match the observed frequencies, while resolution expresses how well the system discriminates among events in different categories. Since a system can be calibrated in order to obtain a match between probabilities and frequencies, reliability can be improved, while resolution cannot be improved by statistical post-processing. The ROC area, instead, contains information only about the resolution of the forecast, irrespective of the reliability. It is more representative of the potential skill, the one that would actually be obtained after a good calibration. The ROC area, which is based on signal detection theory, indicates the ability of a forecast system to discriminate between events and non-events. In order to show the difference in reliability among the three systems, reliability diagrams for the 42-66 h forecast range are shown in Figure 6, for the 5 and 10 mm thresholds. The reliabilities of the three ensembles are similar for the 5 mm threshold (Figure 6(a)), being quite good for low probabilities but worsening when higher probabilities are forecast. At the 10 mm threshold (Figure 6(b)) reliability is generally lower, especially for the COSMO LEPS system. The average precipitation outliers are shown in Figure 5. The percentage of outliers produced by cleps goes from about 30% to about 10% with increasing forecast range and is significantly smaller than that produced by epsrm, the EPS reduced to the same population. For both ensembles, the theoretical percentage of outliers (2/(M + 1), where M is the number of ensemble members) is about 12%. The outliers of eps51 are fewer than the cleps ones, but well above the theoretical percentage of 4% for this ensemble.

4.1.2.
Maximum values

The total amount of precipitation falling over an area as large as 10 000 km² is not the only important parameter for judging the usefulness of a precipitation forecast, especially when the interest is focused on heavy rainfall in general and on high, localized precipitation maxima in particular. It should also be underlined that, over the considered area, precipitation during spring is often associated with thunderstorms. A more complete view of the usefulness of the high-resolution forecasting system can then be obtained by looking at its ability to predict high and localized precipitation values within an area for which an alert can be issued. The verification in terms of maxima has been designed for this purpose: to quantify the skill of the system in forecasting the highest precipitation values occurring anywhere inside an area of pre-defined size. In Figure 7, panels (a) to (d), the Brier Skill Score values obtained for this parameter are shown, while the ROC area values are shown in panels (e) to (h). Using this indicator, the BSS of cleps (Figure 7(a) to (d)) is the highest for all the forecast ranges and for all the thresholds. The values are almost always positive, showing that the system has skill (BSS > 0) in forecasting the occurrence of precipitation peaks over the boxes, up to the 10 mm threshold. For the last two thresholds, the skill is very small or absent, depending on the forecast range. Both EPS systems show a consistently negative BSS value, except for the 1 mm threshold, indicating that these systems have no skill in forecasting this kind of event. The ROC area values (Figure 7(e) to (h)) provide complementary information: there is some skill in the forecasts provided by eps51 and epsrm in terms of signal detection, the score being above the no-skill value of 0.5 but decreasing rapidly with increasing threshold.
These results confirm that it is the horizontal resolution of the forecast that penalizes the EPS ensembles when this indicator is considered, since the model is not able to produce high (and possibly localized) precipitation values. In general, cleps has the highest score values.

Figure 6. Reliability diagrams for the 42-66 h forecast range when average values over boxes of 1.0° × 1.0° are considered, for the 5 mm and 10 mm thresholds. The solid line with circles is for COSMO LEPS, the dashed line with squares is for the 16-member EPS, while the dotted line with triangles is for the 51-member EPS.

Figure 7. Brier Skill Score (panels (a) to (d)) and ROC area (panels (e) to (h)) as a function of the precipitation threshold (mm d⁻¹) when maximum values over boxes of 1.0° × 1.0° are considered. The solid line with circles is for COSMO LEPS, the dashed line with squares is for the 16-member EPS, while the dotted line with triangles is for the 51-member EPS. The error bars represent the confidence interval at the 95% level. Four different forecast ranges are shown: 18-42 h (panels (a) and (e)), 42-66 h (panels (b) and (f)), 66-90 h (panels (c) and (g)), 90-114 h (panels (d) and (h)).

In particular, the cleps values are markedly higher than the EPS ones for the three highest thresholds. The fact that high values of the ROC area are associated with low values of the BSS for the highest thresholds is due to the already mentioned difference in the information provided by the two indices. A possible explanation is that, for the highest thresholds, cleps has some capability to detect the occurrence of an event, but only with low probability. This permits quite a high value of the ROC area but implies a lowering of the BSS. The Ranked Probability Skill Score (Figure 8) is higher for the cleps system, for all the forecast ranges, and is negative for both EPS systems.
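The Ranked Probability Skill Score used above could be computed along these lines. This is a hedged sketch with hypothetical helper names, taking the sample climatology as the reference forecast and expressing each forecast as cumulative probabilities over ordered precipitation categories.

```python
import numpy as np

def rps(prob_cum, obs_cat, n_cat):
    """Ranked Probability Score for a single case. prob_cum holds the
    cumulative forecast probabilities over the n_cat ordered categories
    (last entry 1); obs_cat is the index of the observed category."""
    obs_cum = (np.arange(n_cat) >= obs_cat).astype(float)
    return float(np.sum((np.asarray(prob_cum, float) - obs_cum) ** 2))

def rpss(prob_cum_all, obs_cats, n_cat):
    """RPSS relative to the sample climatology used as reference forecast."""
    obs_cats = np.asarray(obs_cats)
    # climatological cumulative distribution built from the observed categories
    clim = np.array([(obs_cats <= k).mean() for k in range(n_cat)])
    rps_fc = np.mean([rps(p, o, n_cat) for p, o in zip(prob_cum_all, obs_cats)])
    rps_cl = np.mean([rps(clim, o, n_cat) for o in obs_cats])
    return 1.0 - rps_fc / rps_cl
```

A perfect ensemble, whose cumulative probabilities step from 0 to 1 exactly at the observed category, obtains RPSS = 1, while a forecast no better than climatology obtains RPSS = 0.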

The maximum precipitation outliers are shown in Figure 8. The cleps outliers go from about 40% to about 15% with increasing forecast range, against a theoretical limit of about 12%. The values are significantly smaller than both the epsrm and the eps51 percentages, whose theoretical limits are about 12% and 4%, respectively.

4.2. Scores for different percentiles

A more refined evaluation of the precipitation distribution can also be carried out, by comparing the values of the percentiles of the forecast distribution against those of the percentiles of the observed distribution. The same probabilistic scores are used, considering the precipitation distribution forecast by each ensemble member separately. The indices are then computed by taking the percentile values as forecast and observed values. Since the shape of the precipitation distribution is very skewed, even the 90th percentile is representative of moderate to high values rather than of extreme values. In order to be representative of the tail of the precipitation distribution, the 95th percentile has been considered. It has to be pointed out that a categorical verification of the precipitation distribution carries with it the problem that the thresholds used to define the categories have different meanings when different parameters of the distribution are considered. It is clear that the events 'average precipitation exceeds 10 mm d⁻¹' and '95th percentile of the precipitation exceeds 10 mm d⁻¹' detect different subsamples of the verification sample, the second group representing a greater fraction of points than the first, and the first identifying more extreme events. This issue has to be kept in mind when the behaviour of the scores as a function of the parameter of the distribution is evaluated.
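The DIST idea of comparing distribution parameters within a box can be sketched as follows. This is a minimal illustration with hypothetical names, not the authors' implementation: each box yields one value per parameter, and exceedance of a threshold by that parameter defines the binary event.

```python
import numpy as np

def box_parameters(values):
    """Parameters of the distribution of the values (forecast grid points
    or station observations) falling inside one box."""
    v = np.asarray(values, float)
    return {"mean": v.mean(), "median": np.percentile(v, 50),
            "p75": np.percentile(v, 75), "p90": np.percentile(v, 90),
            "p95": np.percentile(v, 95), "max": v.max()}

def event_occurred(values, parameter, threshold):
    """Binary event: does the chosen distribution parameter exceed the
    precipitation threshold?"""
    return bool(box_parameters(values)[parameter] > threshold)
```

Applying `event_occurred` to each ensemble member's grid-point values in a box and taking the fraction of members for which it is True gives the forecast probability of the event; applying it to the station values gives the observed occurrence.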
This evaluation has been performed for the COSMO LEPS and EPS RM systems only, so that the comparison is not affected by the size of the ensemble population. It should be recalled that the different horizontal resolutions of the two systems affect this evaluation as well, since a greater number of values per box concur to determine the precipitation distribution for the COSMO LEPS system (about 100 points), while a much smaller number is involved in the determination of the distribution for the EPS RM system (four points). Because of this discrepancy, the scores for the 95th percentile are probably not reliable; they are shown in the figures only to give continuity to the curves. A comparison between the COSMO LEPS and the 16 RM EPS percentiles is shown in Figures 9 and 10. The graphs, one for each forecast range and each threshold, show how the scores change with the distribution parameter for the two ensemble systems. The three columns refer to the three different forecast ranges, while the four rows are for the four thresholds. Results for the second forecast range (42-66 h) are not shown, being similar to those for the third forecast range. Results for the last threshold (30 mm d⁻¹) are also omitted, because of the scarcity of the sample. Analysing the BSS, two different behaviours can be ascribed to the cleps and epsrm systems, at least for the first three thresholds (Figure 9(a) to (i)). In fact, while the cleps scores increase almost uniformly moving towards the tail of the distribution, the epsrm scores have a dome shape, being maximum when the mean or the 75th percentile parameters are considered and decreasing sharply for the 95th percentile and the maximum. This comparison shows that epsrm has its maximum skill in reproducing the central part of the observed precipitation distribution over an area of about 10 000 km², having higher BSS values than cleps for moderate thresholds.
On the contrary, cleps is more skilful in reproducing the tail of the observed precipitation distribution, even for low thresholds. Considering the ROC area values (Figure 10), the score behaviours are quite different. For the 1 mm threshold (Figure 10(a) to (c)) both the cleps and epsrm scores are almost uniform over the different parameters. At the 10 mm threshold (Figure 10(g) to (i)), while the cleps score is quite uniform, the epsrm ROC area is maximum for the mean parameter and decreases moving towards the tail of the distribution. The 5 mm threshold is characterized by an intermediate behaviour, while, for higher thresholds, it is difficult to draw general conclusions. In terms of this index, it is evident that cleps outperforms epsrm for moderate and high thresholds, while the two are almost indistinguishable for low values. The tendency noticed in the BSS, that epsrm has maximum skill for the mean parameter and decreasing skill towards the tail, is also observed in the ROC area, except for the first threshold (Figure 10(a) to (c)). The behaviour shown by the indices for the 95th percentile is somewhat intermediate between those for average and maximum values (shown in Figures 4 and 7). Generally speaking, the observed precipitation distribution within boxes of this size is increasingly better represented by the COSMO LEPS ensemble as the tail is approached.

4.3. Scores for different box sizes

The scores have also been computed by aggregating both forecasts and observations over a range of scales, from 0.5° to 2.0°, in order to analyse the scale dependency of the performance of the ensemble systems. As already underlined, the box size is representative of the spatial scale at which the fuzzy approach is introduced. The 0.5° × 0.5° boxes, corresponding approximately to an area of 50 × 50 km², contain approximately 25 COSMO LEPS grid points, but only one EPS grid point.
This implies that the mean and the maximum values forecast by the EPS over each box coincide. The number of station points varies from one box to another, from a minimum of 5 (boxes where this density is not reached are rejected) up to a maximum of about 25. The 2.0° × 2.0° boxes, corresponding approximately to an area of 200 × 200 km², contain a number of COSMO LEPS

grid points approximately equal to 400 and a number of EPS grid points equal to 9, while the minimum number of observations per box is fixed at 10.

Figure 8. Ranked Probability Skill Score and Percentage of Outliers for the different forecast ranges when maximum values over boxes of 1.0° × 1.0° are considered. The solid line with circles is for COSMO LEPS, the dashed line with squares is for the 16-member EPS, while the dotted line with triangles is for the 51-member EPS.

Figure 9. Brier Skill Score values as a function of the parameter of the distribution considered in the index computation. Panels (a) to (c) are for the 1 mm d⁻¹ threshold, for the three forecast ranges 18-42 h, 66-90 h and 90-114 h, respectively. Panels (d) to (f) are for the 5 mm d⁻¹ threshold, panels (g) to (i) for the 10 mm d⁻¹ threshold and panels (j) to (l) for the 20 mm d⁻¹ threshold. The solid line with circles is for cleps, while the dashed line with squares is for epsrm.

In Tables II and III, the number of occurrences of the considered events (defined by the thresholds) in the boxes of different sizes is shown. Only

Figure 9. (Continued).

four thresholds have been considered (1, 5, 10 and 20 mm d⁻¹), because of the excessive reduction of the sample for the highest threshold at the larger scales. The scores in terms of the mean parameter are reported in Figure 11 for the 66-90 h forecast range. Only results for this forecast range are shown, since the behaviour of the scores as the aggregation scale varies is almost the same for any forecast range. The BSS values increase with increasing box size for both cleps (Figure 11(a)) and epsrm (Figure 11(b)), the epsrm scores being generally higher than the corresponding cleps ones at any aggregation scale.

Table II. Mean: number of observed occurrences for each class and for each box size.

thresholds (mm d⁻¹)
box size   1     5     10   20
0.5        2715  1231  625  225
1.0        1088  466   234  83
2.0        394   150   73   22

Table III. Maximum: number of observed occurrences for each class and for each box size.

thresholds (mm d⁻¹)
box size   1     5     10   20
0.5        1253  779   443  198
1.0        1363  892   537  246
2.0        1656  1251  897  469

A positive value at all scales is obtained by COSMO LEPS only for the 1 and 5 mm thresholds, while skill for higher thresholds is exhibited only at the two larger scales. On the other hand, epsrm is always skilful, especially at the 2.0° scale. The evaluation has been repeated in terms of ROC area (Figure 11(c) and (d)), showing the same increase of the score with the scale to which forecast and observed values are aggregated, with the exception of the 20 mm threshold for the 2.0° box size, the ROC area value being

lower than the corresponding one for the 1.0° box size.

Figure 10. ROC area values as a function of the parameter of the distribution considered in the index computation. Panels (a) to (c) are for the 1 mm d⁻¹ threshold, for the three forecast ranges 18-42 h, 66-90 h and 90-114 h, respectively. Panels (d) to (f) are for the 5 mm d⁻¹ threshold, panels (g) to (i) for the 10 mm d⁻¹ threshold and panels (j) to (l) for the 20 mm d⁻¹ threshold. The solid line with circles is for cleps, while the dashed line with squares is for epsrm.

In terms of signal detection, COSMO LEPS is skilful for any threshold and any box dimension, and it is more skilful than epsrm for the 10 and 20 mm thresholds at any aggregation scale. This analysis confirms that, when the average value of the precipitation distribution over a box is considered, the skill of the forecast increases with box size, because errors are increasingly filtered out as the scale increases. The scale evaluation has also been performed for the maximum values (Figure 12). The BSS values of cleps (Figure 12(a)) show that the value for the 1 mm threshold at the 1.0° scale is the same as at the 0.5° scale, while for the remaining thresholds the 1.0° BSS is higher than the 0.5° one. At the 2.0° scale, the score for the lower thresholds is smaller than at the 1.0° scale, the inversion being observed around the 10 mm threshold. For the lower thresholds, the 2.0° score is even smaller than the 0.5° one. This seems to indicate that, when the maximum value falling within a box is considered, the intermediate scales are better than the largest one if low precipitation is involved. The same kind of behaviour of COSMO LEPS can be found in the ROC area plot (Figure 12(c)), suggesting that this feature may be due to an increasing number of false alarms for these thresholds, i.e. an excessive number of forecast precipitation values over 1 mm that are not found in the observations.
In general, COSMO LEPS is skilful in terms of BSS for any threshold and any scale, with the exception of the 20 mm threshold at the 0.5° scale. On the other hand, epsrm has almost always negative BSS values (Figure 12(b)), except for the 1 mm threshold at the two smaller scales. In terms of ROC area, the cleps values are higher than the epsrm ones, the reverse being true only for the 1 mm threshold and for both the 1 and 5 mm thresholds at the 2.0° scale.
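The multi-scale aggregation examined in this section can be sketched as follows. This is a minimal sketch under stated assumptions (hypothetical names; boxes anchored at an arbitrary origin; minimum-density rejection as described in the text), not the authors' implementation.

```python
import numpy as np

def aggregate_boxes(lats, lons, values, box_deg, lat0, lon0, min_points=5):
    """Group point values (model grid points or station observations) into
    box_deg x box_deg boxes anchored at (lat0, lon0) and return, per box,
    the mean and the maximum. Boxes with fewer than min_points values are
    rejected, mirroring the minimum-density requirement in the text."""
    lats, lons, values = (np.asarray(a, float) for a in (lats, lons, values))
    # integer box indices along latitude and longitude
    iy = np.floor((lats - lat0) / box_deg).astype(int)
    ix = np.floor((lons - lon0) / box_deg).astype(int)
    stats = {}
    for key in set(zip(iy.tolist(), ix.tolist())):
        sel = (iy == key[0]) & (ix == key[1])
        if sel.sum() >= min_points:
            stats[key] = (values[sel].mean(), values[sel].max())
    return stats
```

Running the same routine with `box_deg` set to 0.5, 1.0 and 2.0 on both the forecast grid points and the station observations reproduces the scale dependency studied above: larger boxes pool more points, filtering out small-scale errors for the mean while diluting localized maxima.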

Figure 10. (Continued).

5. Conclusions

A spatial verification method (DIST) has been applied here to evaluate the PQPF issued by the COSMO LEPS mesoscale ensemble system. DIST is based on the verification of the precipitation distribution within boxes of selected size. For each box, several parameters of the distributions of both the observed and the forecast values falling in the box are compared (mean, median, percentiles, maximum). The 16-member COSMO LEPS is compared against the global EPS, considering both the full-size ensemble (51 members) and the reduced ensemble made up of the 16 RMs driving the COSMO LEPS integrations. The main findings can be summarized as follows. Considering the forecast of average values over boxes of 1.0° × 1.0°, the global ensembles tend to exhibit better performance in terms of Brier-like scores [BSS and Ranked Probability Skill Score (RPSS)]. Looking at the ROC area index, the results are highly variable as a function of the forecast range and of the threshold, COSMO LEPS being generally better for the higher thresholds. The discrepancy between the results in terms of BSS and ROC area is due to the different kind of information provided by the two scores. It turns out that COSMO LEPS shows some skill in discriminating between events and non-events, expressed by the higher ROC area values, while the generally negative BSS is probably due to the low degree of reliability. When the maximum values over boxes of 1.0° × 1.0° are considered, COSMO LEPS scores better than both EPS systems for all the forecast ranges and all the thresholds in terms of BSS and RPSS. The ROC area values are higher for COSMO LEPS except at the lowest thresholds, possibly because of a higher number of false alarms at these thresholds with respect to the coarser-resolution EPSs.
Considering also the evaluation in terms of percentile values, the observed precipitation distribution within boxes is increasingly better represented by the COSMO LEPS ensemble as the tail of the distribution is approached. With regard to the outliers, while for the maximum values the Percentage of Outliers affecting COSMO LEPS is noticeably smaller than those of the global