Verification of WAQUA/DCSMv5's operational water level probability forecasts


Ministry of Infrastructure and Water Management
Verification of WAQUA/DCSMv5's operational water level probability forecasts
N. Wagenaar
KNMI Internal report IR-18-1

UTRECHT UNIVERSITY, DEPARTMENT OF PHYSICAL GEOGRAPHY
KNMI DE BILT
5.5 ECTS
Verification of WAQUA/DCSMv5's operational water level probability forecasts
Author: Nils WAGENAAR, student nr. 5574137
Supervisors: dr. Hans DE VRIES, dr. ir. Renske DE WINTER
February 2018

Abstract

Rising sea levels due to climate change pose a serious threat to countries where much of the economic activity is close to the coast or below sea level. In the Netherlands, most political and economic centres are situated in low-lying areas. Accurate storm surge forecasts are therefore of paramount importance, as extreme high waters are expected as a result of rising mean sea levels. Forecasts up to two days ahead serve as guidance for local authorities and for optimal management of the storm surge barriers, and forecasts up to 10 days ahead are beneficial for the preparation for severe storm surge events. Recently, there has been a growing demand from the forecasts' end-users for uncertainty information and for an extension of the forecast horizon. To fulfill this demand, the Integrated Forecasting System (IFS), developed by the European Centre for Medium-Range Weather Forecasts (ECMWF), has been used to generate storm surge probability forecasts. The storm surge model has run with input from ECMWF-IFS for approximately 10 years, which opens the possibility for forecast verification with robust statistics. This research therefore performed a verification study to quantify the quality of the issued storm surge probability forecasts for three locations along the Dutch coast. In general, the forecasts of positive surges were of good quality, with Brier skill scores >0.5 for lead times <48h and Brier skill scores >0.4 for lead times between 48h and 84h; this holds true for all three locations. Furthermore, the forecasts are skillful up to 5-7 days ahead. In addition, winter forecasts are more skillful than summer forecasts, and high-tide forecasts have higher skill than low-tide forecasts. Finally, this research showed that the bias correction and calibration currently applied to the forecasts are sub-optimal.
In fact, the ensemble's bias with respect to the observations needs to be considered as a dynamic variable that changes from year to year, rather than a static one. A revision of both the bias correction and the calibration should therefore be considered in the future to improve forecast quality.

Contents

1 Introduction
2 Methodology
  2.1 General verification framework
  2.2 Functions and definitions for forecast verification
    2.2.1 Brier Score (BS) - Brier Skill Score (BSS)
    2.2.2 Decomposition of the Brier Score
    2.2.3 Continuous Ranked Probability Score (CRPS)
    2.2.4 Attributes diagram
    2.2.5 ROC-curve
    2.2.6 Rank Histograms and Cumulative Rank Histograms
  2.3 Data description
3 Results
  3.1 Verification of WAQUA/DCSMv5's coastal water level probability forecast
    3.1.1 Forecast sharpness
    3.1.2 Hoek van Holland
    3.1.3 Den Helder
    3.1.4 Delfzijl
  3.2 Evaluation of the existing ensemble's post-processing methods for sea level probability forecasts
4 Discussion
5 Conclusion
A Appendices
  A.1 Bibliography
  A.2 List of Tables
  A.3 List of Figures

1 Introduction

Rising sea levels due to climate change pose a serious threat to countries where much of the economic activity is close to the coast or below sea level. In the Netherlands, a substantial part of the country lies below mean sea level (Fig. 1). Moreover, most political and economic centres are situated in these low-lying areas. For these reasons, the Netherlands is a relatively vulnerable country and the development of adaptation strategies is urgent. Simulations by NASA demonstrated that many low-lying coastal deltas, like the Netherlands, are expected to be flooded during the course of the 21st century (Klijn et al.). In order to cope with these expected extreme high waters, accurate storm surge forecasts are of paramount importance. These forecasts play an important role in the planning of activities, especially in situations with storm surges occurring. In fact, accurate storm surge forecasts up to two days ahead are essential as guidance for local authorities and for optimal management of the storm surge barriers (de Vries, 2013). In addition, it is beneficial for the preparation for severe storm surge events to have accurate storm surge forecasts up to 10 days ahead (de Vries, 2013).

Fig. 1: Distribution of flood-prone areas in the Netherlands. Source: PBL, 2009

Storm surge forecasts for the Dutch coast are generated by the Public Works and Water Management Authority (Rijkswaterstaat) together with the Water Management Centre for the Netherlands (WMCN-coast), with meteorological input from the Royal Netherlands Meteorological Institute. Recently, there has been more attention to both the vulnerability and the changing susceptibility to flooding due to global warming. This has led to a growing demand for uncertainty information and an extension of the forecast horizon.
The Integrated Forecasting System (IFS), developed by the European Centre for Medium-Range Weather Forecasts (ECMWF), has been used to generate storm surge forecasts, as these weather data are readily available both from the archives and in real time. Moreover, the use of ECMWF-IFS fulfills the users' demand both to reveal uncertainty information about storm surge forecasts and to extend the forecast horizon.

The ECMWF-IFS generates a 10-day ensemble weather forecast twice a day. This ensemble prediction system (ENS) consists of 52 individual ensemble members. The spatial resolution of one member (HRES) is higher than that of the others. Furthermore, the ENS contains one member (CNTL) with lower spatial resolution than the HRES, to which perturbations are applied in order to generate realizations with slightly different initial states (50 perturbed ensemble members). In addition, each ensemble member forecast is based on a slightly different set of model equations, in order to take model uncertainties into account as well (ECMWF). These ECMWF 10-day ensemble weather forecasts serve as input data for the storm surge model (WAQUA/DCSMv5). This storm surge model calculates water levels along the Dutch coast for all ensemble members, which in turn can be converted to water level probabilities. The storm surge model WAQUA/DCSMv5 has run with input from ECMWF for approximately 10 years, which enables forecast verification with robust statistics. Forecast verification is essential in order to gain insight into the accuracy, consistency, discrimination and precision of these forecasts (WMO, 2013). In fact, the verification of storm surge forecasts exposes information about the ability of the forecast to capture the meteorology. In addition, forecast verification is essential to reveal strengths and weaknesses and to identify potential improvements. It is important that stakeholders are identified and that the stakeholders' questions of interest about forecast quality are kept in mind, as this influences the selection of the appropriate methodology. Over the years, statistical scores have been developed for the verification of probabilistic forecasts, which differ from ordinary statistics between forecasts and observations.
These probabilistic verification metrics will be applied to the storm surge probability forecasts driven by the ECMWF-IFS in order to quantify forecast quality. For the evaluation of forecast quality for water levels along the Dutch coast, several research questions are kept in mind:

- What is an appropriate methodology for storm surge forecast verification?
- How does the forecast performance for water levels vary for different locations along the Dutch coast?
- Is there a trend in forecast performance based on the selected statistical scores through time (i.e. forecast improvement or deterioration over the years)?
- How is forecast quality influenced by lead time, season, type of surge (positive/negative) and low/high tides?
- Are the default calibration and bias correction applied to the storm surge probability forecasts still suitable and performing sufficiently well?
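The conversion from an ensemble of predicted water levels to a probability forecast, mentioned above, can be sketched in a few lines. This is a minimal illustration with made-up numbers, not the operational WAQUA/DCSMv5 post-processing; the function name and the 5-member example are assumptions.

```python
import numpy as np

# Minimal sketch: turn the water levels predicted by the individual ensemble
# members at one forecast point into an exceedance probability for a threshold.
# Names and values are illustrative, not taken from the report.
def exceedance_probability(member_levels_cm, threshold_cm):
    """Fraction of ensemble members predicting a level above the threshold."""
    members = np.asarray(member_levels_cm, dtype=float)
    return float(np.mean(members > threshold_cm))

# Hypothetical 5-member ensemble: 3 of 5 members exceed 180 cm.
print(exceedance_probability([150.0, 185.0, 190.0, 200.0, 170.0], 180.0))  # 0.6
```

In the operational system the same idea is applied with all ensemble members and with the predefined surge and absolute-level thresholds discussed in the Methodology.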

2 Methodology

2.1 General verification framework

Issued storm surge forecasts are archived, which enables research using data from the past. Forecast data are available from December 2008 to present. This research will use different statistical metrics for verification, related to different measures of forecast quality. Firstly, the Brier Score and the Continuous Ranked Probability Score are used to measure forecast accuracy. The forecasts' ability to discriminate between events and non-events will be visualized and quantified by ROC-curves and ROC skill scores. Lastly, this research includes attributes diagrams, which visualize the forecast reliability for each forecasted probability level. The forecasts will be divided in two groups in order to evaluate whether forecast performance improved over the years: forecasts from December 2008 through 31-12-2010 and forecasts from 01-01-2013 through 31-12-2016. Furthermore, verification will be conducted for different locations to explore the spatial differences in forecast performance along the Dutch coast; in this research, forecast verification is performed and presented for Hoek van Holland, Delfzijl and Den Helder. In addition, verification will be performed for both storm surge types (>0 cm = "positive surge", <0 cm = "negative surge"). In other words, the two surge types lead to either higher or lower sea levels compared with the level of the astronomical extreme. Next to this, verification for the Information Level and the Pre-warning Level will be performed as well. The Information Level and Pre-warning Level are both absolute water levels, and these levels differ between locations along the Dutch coast. Table 1 summarizes, for the above-mentioned locations, the surge levels and the predefined absolute coastal water level thresholds.

Tab. 1: The selected water levels used for verification of the WAQUA/DCSMv5 storm surge model.
The medium positive surge/medium negative surge level is selected for the verification of positive/negative surges. The absolute levels correspond with the Information Level (IL) and the Pre-warning level (VP). For each of Hoek van Holland, Den Helder and Delfzijl, the table lists the medium positive surge level, the medium negative surge level, the IL and the VP (in cm).

According to de Vries (2013), when the absolute water level exceeds the Warning Level within 8 days with a probability of more than 5% in the forecast, KNMI informs WMCN-coast. The forecasts will be further subdivided in four groups based on forecast lead time: 0-48, 48-84, 84-120 and 120-240 hours ahead (0-2, 2-3.5, 3.5-5 and 5-10 days), to examine the relation between forecast performance and forecast lead time. In addition, forecast performance will be evaluated for high/low tide forecasts and for the summer/winter seasons, which will reveal the dependency of forecast performance on these factors. Summer forecasts are those issued between 4-Jun and 4-Sep; winter forecasts are those issued between 4-Dec and 4-Mar. The performance of the forecasts will be put into perspective by comparing it with a reference system, which is mostly persistence or climatology (Daan, 1984; Murphy and Daan, 1985). A skill score quantifies the quality of the forecast as the improvement compared with the reference forecast (Eq. 1).

SS = (score_fc - score_ref) / (score_perfect - score_ref)   (1)

In this research, climatology will be used as the reference to obtain skill scores. Data from December 2008 to 31-12-2016 are used as climatology. The next subsection describes the selected statistical metrics for probability forecast verification.

2.2 Functions and definitions for forecast verification

In this subsection, functions and definitions related to the verification of forecasts are explained. As mentioned earlier, this research will use statistical metrics developed for the verification of probabilistic forecasts. For some scores, over-predictions lead to satisfying results (Kok). As a result, if only one metric is applied for verification, it is generally easy to adapt the forecast system in such a way that better results are attained for that score. Therefore, multiple verification metrics are used simultaneously to draw conclusions regarding the performance of the forecasts.

2.2.1 Brier Score (BS) - Brier Skill Score (BSS)

The Brier Score assesses the relative accuracy of a probabilistic forecasting system (Brier, 1950). It is applicable to forecasts of mutually exclusive events (e.g. water level exceeded/not exceeded, or rain/no rain). For N verification/forecast pairs, the Brier Score is:

BS = (1/N) sum_{t=1}^{N} (p_t - y_t)^2   (2)

where p_t is the forecast probability and y_t is a binary variable, which equals 1 if the event occurs and 0 if it does not. N equals the number of forecast/observation pairs. The Brier Score of a perfect forecast, BS_perf, equals 0; so, the lower the Brier Score, the better the forecast. The Brier Score allows us to make statements about the accuracy of probabilistic forecasting systems. However, it is more interesting to demonstrate whether particular forecasts offer an improvement by comparing them with an unskilled standard forecast. To address this issue, Eq. 1 is used.
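As a quick numerical illustration of Eq. 2 and of the skill-score construction of Eq. 1, the following sketch uses made-up forecast/observation pairs (not data from this study):

```python
import numpy as np

# Sketch of the Brier Score (Eq. 2) and the generic skill score (Eq. 1),
# with invented forecast/observation pairs for illustration only.
def brier_score(p, y):
    """Mean squared difference between forecast probabilities and outcomes."""
    p, y = np.asarray(p, dtype=float), np.asarray(y, dtype=float)
    return float(np.mean((p - y) ** 2))

def skill_score(score_fc, score_ref, score_perfect=0.0):
    """Eq. 1: improvement of the forecast over a reference forecast."""
    return (score_fc - score_ref) / (score_perfect - score_ref)

p = [0.9, 0.1, 0.7, 0.2]        # issued exceedance probabilities (hypothetical)
y = [1, 0, 1, 0]                # observed outcomes (event / no event)
clim = [0.5, 0.5, 0.5, 0.5]     # climatological reference forecast

bs, bs_clim = brier_score(p, y), brier_score(clim, y)
print(bs, bs_clim, skill_score(bs, bs_clim))  # 0.0375 0.25 0.85
```

A positive skill score indicates that the probability forecast beats the climatological reference, as discussed below for the Brier Skill Score.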
For this Brier Skill Score, the Brier Score of the forecast is compared with the Brier Score of the sample's climatology:

BSS = 1 - BS / BS_clim   (3)

With BSS < 0, the forecasts have less skill than climatology; with BSS > 0, the forecasts have more skill than climatology. According to Eq. 3, perfect forecasts have a BSS of 1.

2.2.2 Decomposition of the Brier Score

Assume the issued forecasts p_t take K distinct values over all t. Then, Murphy (1973) demonstrated that the Brier Score (Eq. 2) can be decomposed into three components:

BS = REL - RES + UNC   (4)

in which:

REL = sum_{k=1}^{K} (n_k/N) (o_k/n_k - P_k)^2   (5)

and

RES = sum_{k=1}^{K} (n_k/N) (o_k/n_k - ō)^2   (6)

P_k is the forecast probability of category k, o_k the number of events occurring for that probability value and n_k the number of times that probability was issued. ō equals the climatological event frequency of the sample, defined as:

ō = (1/N) sum_{t=1}^{N} y_t   (7)

where y_t = 1 if the event occurs and y_t = 0 if it does not. The uncertainty term in Eq. 4 is written as:

UNC = ō (1 - ō)   (8)

A forecast is considered reliable when the forecast probabilities equal the relative frequency of the observations for that particular forecast value (i.e. P_k = o_k/n_k). It is easy to see that maximum reliability is obtained for REL = 0. Whenever the conditional observed relative frequencies (i.e. o_k/n_k) are equal for all forecast categories, the forecast is uninformative and not able to distinguish events that are either less or more likely than average, which results in RES = 0 and a higher value of the Brier Score.

2.2.3 Continuous Ranked Probability Score (CRPS)

Another statistical metric which quantifies forecast reliability and resolution is the Continuous Ranked Probability Score (CRPS). Suppose we are interested in a parameter x, which could stand for the 10-m wind speed. The PDF of a forecast system for the 10-m wind speed is then given by p(x) and the actual outcome by x_a. The CRPS is the integrated squared difference between the forecast cumulative distribution function P(x) and the cumulative distribution function of the observation:

CRPS(P(x), x_a) = ∫ [P(x) - P_a(x)]^2 dx   (9)
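For a finite ensemble, the integral in Eq. 9 can be approximated numerically by discretising the water level axis; a minimal sketch, with an illustrative grid and a hypothetical ensemble rather than the operational implementation:

```python
import numpy as np

# Numerical sketch of Eq. 9: CRPS as the integrated squared difference between
# the empirical ensemble CDF and the step-function CDF of the observation.
# Grid, members and observation are illustrative assumptions.
def crps_ensemble(members, obs, grid):
    members = np.sort(np.asarray(members, dtype=float))
    # Empirical forecast CDF evaluated on the grid.
    forecast_cdf = np.searchsorted(members, grid, side="right") / members.size
    obs_cdf = (grid >= obs).astype(float)   # Heaviside step at the observation
    dx = grid[1] - grid[0]                  # uniform grid spacing
    return float(np.sum((forecast_cdf - obs_cdf) ** 2) * dx)

grid = np.linspace(-100.0, 300.0, 4001)          # water levels in cm
members = [150.0, 185.0, 190.0, 200.0, 170.0]    # hypothetical ensemble
print(crps_ensemble(members, obs=180.0, grid=grid))
```

The score is 0 only for a perfect deterministic forecast (all members on the observed value) and grows as the forecast distribution departs from the observation.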

In Eq. 9, the cumulative distribution functions of both the forecasts (P(x)) and the observations (P_a(x)) are used:

P(x) = ∫_{-∞}^{x} p(y) dy   (10)

P_a(x) = H(x - x_a)   (11)

in which:

H(x - x_a) = 0 for x < x_a; 1 for x >= x_a   (12)

Eq. 12 is the Heaviside step function. The CRPS (Eq. 9) quantifies the difference between the cumulative distributions of the forecasts and the observations. Hence, its minimum value of 0 is attained when P equals P_a, which happens for a perfect deterministic forecast. Often, the CRPS is averaged over n cases or n grid points (Hersbach, 2000):

CRPS_mean = (1/n) sum_{k=1}^{n} CRPS(P(x)_k, x_a^k)   (13)

where k denotes a specific case. Revisiting Eq. 2, the relation between the Brier Score and the CRPS can be written as (Hersbach, 2000):

CRPS = ∫ BS(x_t) dx_t   (14)

in which x_t is a threshold value: the CRPS is the integral of the Brier Scores over all threshold values. The skill score for the CRPS (CRPSS) is calculated according to Eq. 1. In order to obtain CRPS_clim, 51 quantiles have been generated from the observations during the period December 2008 to 31-12-2016. These quantiles represent a naive ensemble prediction system in which each ensemble member predicts the same surge level at every forecast point; the surge forecast of each ensemble member is given by the value of its quantile.

2.2.4 Attributes diagram

A forecast is considered reliable when the relative frequency of occurrence conditioned on a forecast probability p is close to p (Murphy, 1986). In a reliability diagram, forecasts are divided in bins; for this research, ten bins of equal width (10%) are used. Then, the observed relative frequency corresponding to each forecast bin is plotted. The observed relative frequency conditioned on a forecast bin k can be written as:

Observed relative frequency(k) = o_k / n_k   (15)

Methodology 7 where k represents the forecast bins, which ranges between and 1. o k and n k correspond with the amount of observations and forecasts falling into forecast bin k respectively. In a perfect forecast, the reliability plot would have a slope of 1. The attribute diagram is an extended version of this reliability diagram. In the reliability diagram, observed frequency for each forecast bin is plotted against the forecast probability bins. However, the attribute diagram includes no resolution lines (x, y = p clim ). A forecast has RES =, when the observed relative frequencies for all forecast bins are the same and equal to climatology (Eq. 6). For a climatological forecasts, both the forecast and the observed relative frequency are equal to the climatological probability, which results in REL =. The Brier score for a climatological forecast equals the uncertainty term in Eq., since these forecasts have no resolution and no reliability. So, for a climatological forecast, substitution of Eq. in Eq. 1 for BS clim = UNC leads to: RES REL BSS = UNC. (16) In Eq. 16 the uncertainty term is always larger than zero. Therefore, forecast skill increases, when the reliability term is smaller than the resolution term. This means, geometrically, that points on the attribute diagram need to be closer to the perfect reliability line (1:1) than the no-resolution line. So, this no-skill line is included in the attribute diagram. An example of an attribute diagram is included (Fig. ). Confidence intervals are shown as black vertical lines. For this research, 9 % confidence intervals are calculated by bootstrap re-sampling and will be included in the attribute diagrams. Fig. : Example of an attribute diagram. line corresponds with the climatology for a specified event. This probability is represented as two straight lines starting at the value for the climatological probability at both the x and y axis. The area contributing to the skill of the forecast is indicated in gray.

Methodology 8..5 ROC-curve Another graphical representation of forecast performance is the Relative Operating Characteristic (ROC) curve. Firstly, contingency tables are made for different probability threshold values. The contingency table holds the number of hits, misses, false alarms and correct non-events. In these contingency tables, observed events are categorized by means of an exceedance level or threshold. For example, an event could be the absolute coastal water level at a certain location exceeding 7cm. For the probability forecasts, a predefined probability threshold value needs to be set in order to convert the probability forecast to a dichotomous (yes/no) forecast. An example of a contingency table is shown in Table. Tab. : Contingency table for categorical events (yes - no). source: WMO, 14 Hit rate = hits (hits + misses) (17) False alarm rate = False alarms (False alarms + Correct non event) (18) The hit-rate (Eq. 17) and false alarm rate (Eq. 18) of a probability forecast are determined by the amount of hits and false alarms issued by a forecast based on a predetermined probability threshold value. The ROC-curve visualizes the changes in both the hit-rate and the false-alarm rate by altering this probability threshold value. So, in order to construct the ROC-curve, several hit-rate/false alarm-rate pairs for these different probability threshold values are plotted. A good forecast in terms of the ROC-curve is defined as one, where the hit rate increases faster than the false alarm rate by a decrease in probability threshold value. Especially in diagnostic accuracy studies, the terms hit rate and false alarm rate are often replaced by sensitivity and specificity respectively (C. M. Florkowski, 8). An example of a ROC-curve is shown in Fig. 3.

Fig. 3: An example of different ROC curves indicating different levels of forecast performance.

If the ROC-curve lies close to the 1:1 line, the forecast has skill equal to a random draw from climatology (Hamill & Juras, 2006). In other words, the larger the Area Under the Curve (AUC), the better the skill of the forecast. The AUC is calculated by Eq. 19:

AUC = ∫_0^1 ROC(u) du   (19)

The AUC value measures the forecasts' ability to discriminate between events and non-events; moreover, it is a measure of forecast accuracy (Pontius & Schneider, 2001). The value ranges between 0 and 1, where 1 represents perfect discrimination between events and non-events and 0.5 equals the skill of a random draw from climatology. This means that if the AUC value is below 0.5, the forecast has no skill. Like the CRPS and the Brier Score, the AUC value can be converted to a skill score by means of Eq. 1. In Table 3, a classification of the ROC skill score (ROCSS) is given.

Tab. 3: Classification of the ROC skill score, based on the Area Under the Curve (AUC) values compared with climatology.

Classification   ROC Skill Score
Excellent        0.8 - 1.0
Good             0.5 - 0.8
Sufficient       0.25 - 0.5
Poor             0.0 - 0.25
Fail             < 0.0
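The threshold sweep that produces the ROC-curve, and the integration of Eq. 19, can be sketched as follows; the forecast/observation pairs are invented for illustration:

```python
import numpy as np

# Sketch of the ROC construction: sweep the probability threshold, collect
# (false alarm rate, hit rate) pairs (Eq. 17-18) and integrate the curve to
# obtain the AUC (Eq. 19). Forecast/observation pairs are made up.
def roc_points(p, y, thresholds):
    p, y = np.asarray(p, dtype=float), np.asarray(y, dtype=int)
    pts = []
    for t in thresholds:
        warn = p >= t                               # dichotomous yes/no forecast
        hits = np.sum(warn & (y == 1))
        misses = np.sum(~warn & (y == 1))
        false_alarms = np.sum(warn & (y == 0))
        correct_non = np.sum(~warn & (y == 0))
        pts.append((false_alarms / (false_alarms + correct_non),
                    hits / (hits + misses)))
    return pts

p = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]   # issued probabilities
y = [1,   1,   0,   1,   0,   0,   0,   0]      # observed events
pts = sorted(roc_points(p, y, np.linspace(0.0, 1.0, 21)))
far = [pt[0] for pt in pts]
hr = [pt[1] for pt in pts]
# Trapezoidal integration of the curve over the false alarm rate axis.
auc = sum((far[i + 1] - far[i]) * (hr[i + 1] + hr[i]) / 2 for i in range(len(far) - 1))
print(round(auc, 3))  # ≈ 0.933 for this toy example
```

With Eq. 1 and a climatological AUC of 0.5, this example would score ROCSS = 2·AUC − 1 ≈ 0.87, i.e. "Excellent" in the classification of Table 3.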

2.2.6 Rank Histograms and Cumulative Rank Histograms

In this subsection, we describe the tools used for the calibration of the storm surge probability forecasts. Firstly, Rank Histograms (Talagrand Diagrams) and Cumulative Rank Histograms will be made to assign probabilities to the ensemble predictions. For the construction of the Talagrand Diagram, the members of each ensemble forecast are ordered from low to high. Then, at each forecast point, the position of the verifying observation with respect to the ordered ensemble members is determined. Finally, a Cumulative Rank Histogram is constructed by summing up the Talagrand Diagram (de Vries, 2008). Often, the observations are unequally distributed over the ranks of the Talagrand Diagram. In an ideal forecasting system, however, every rank is equally likely to contain the verifying observed value; in that case the Talagrand Diagram is flat, meaning that the ensemble captures the probability distribution of the observations. A forecast system with too little variation among the ensemble member predictions leads to a u-shaped Talagrand Diagram, while too much variation causes a bell-shaped Talagrand Diagram (de Vries, 2008). In some cases, systematic biases exist in the forecast system, where the ensemble systematically over- or under-predicts. To overcome this problem, de Vries (2008) introduced a bias correction that reduces the difference between the ensemble mean and the observations, based on the relation between this difference and the standard deviation of the ensemble (Eq. 20). The bias correction is applied to each ensemble member:

H_mod,new = H_mod - α_0 - α_1 σ   (20)

where H_mod is the low/high tide skew surge, α_0 and α_1 are bias correction coefficients and σ is the standard deviation of the ensemble.
De Vries (2008) found that for negative surges, the difference between observations and the ensemble mean becomes more positive with increasing ensemble standard deviation, while for positive surges this difference becomes more negative with increasing ensemble standard deviation. The bias correction coefficients are based on forecasts and observations from the 2006/2007 season and are calculated for positive/negative surge, for low/high tide, and for every 12 hours of lead-time up to 240 hours. The least squares method is used to determine the bias correction coefficients. Fig. 4 demonstrates the relation between the ensemble-mean-minus-observation difference and the ensemble's standard deviation for both positive and negative surges (de Vries, 2008).
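The least-squares step can be sketched as follows, assuming the linear model (ensemble mean − observation) ≈ α₀ + α₁σ that Eq. 20 corrects for (function names are illustrative):

```python
import numpy as np

def fit_bias_coefficients(ens_mean, ens_std, obs):
    """Least-squares fit of (ensemble mean - observation) = a0 + a1 * sigma."""
    ens_std = np.asarray(ens_std, dtype=float)
    A = np.column_stack([np.ones_like(ens_std), ens_std])   # design matrix
    b = np.asarray(ens_mean, dtype=float) - np.asarray(obs, dtype=float)
    coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(coeffs[0]), float(coeffs[1])

def bias_correct(members, a0, a1):
    """Apply Eq. 20 to every member of each ensemble (rows = forecast cases)."""
    members = np.asarray(members, dtype=float)
    sigma = members.std(axis=1, keepdims=True)
    return members - a0 - a1 * sigma
```

In practice one such pair (α₀, α₁) would be fitted per combination of surge sign, tidal phase and 12-hour lead-time group, as described above.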

Fig. 4: The mean of the ensemble minus the observation as a function of the standard deviation of the ensemble, for positive and negative surges at Hoek van Holland (de Vries, 2008).

Talagrand Diagrams are useful to evaluate whether the bias correction still suffices in removing the ensemble's bias. This evaluation needs to be done after the spin-up time of the ensemble system, which is after approximately 48 hours lead-time. The forecast probabilities are calibrated by matching the ensemble rank with the rank in the Cumulative Rank Histogram. De Vries (2008) illustrates this with (a) a Talagrand Diagram and (b) a CRH diagram for all high tide forecasts for Hoek van Holland, with and without the wind correction applied to the ensemble.

2.3 Data description

This research uses storm surge probability forecasts, generated by the WAQUA/DCSMv5 storm surge model with meteorological input from the ECMWF-IFS. The ECMWF-IFS generates an ensemble of 50 members by applying perturbations to the control forecast (CNTL). Next to this, a high spatial resolution member (HRES) is included as well. So, in total the ECMWF-IFS ensemble consists of 52 members, which serve as input for the WAQUA/DCSMv5 storm surge model. These probabilistic storm surge forecasts are used for the medium range (2-10 days ahead); for the first 48 hours, forecasts are generated by WAQUA/DCSMv6.

The database used for this verification research consists of archived storm surge probability forecasts issued from 2008 to the present. Forecasts have been generated for the locations Vlissingen, Roompot Buiten, Hoek van Holland, IJmuiden, Den Helder, Harlingen, Huibertgat and Delfzijl. However, this research verifies the performance for Hoek van Holland, Den Helder and Delfzijl, and the results for these locations will be presented in this report.

As mentioned earlier, in order to remove systematic over- or under-prediction by the ensemble, de Vries (2008) introduced a bias correction. This bias correction is based on the relation between the ensemble's standard deviation and the difference between the ensemble mean and the observations, and it has been applied to each ensemble member by means of Eq. 20. The coefficients in Eq. 20 are derived from observations and forecasts between September 2006 and May 2007. For the archived storm surge probability forecasts forced by ECMWF-IFS (2008 - present), these bias correction coefficients have been used for all forecasts to correct for the ensemble's bias. After the bias correction has been applied to each of the ensemble members, the ensemble probability forecasts have been calibrated using observations and forecasts from the 2006/2007 season.
For the calibration of the storm surge probability forecasts, the Talagrand diagram is summed to make the Cumulative Rank Histogram, which converts the ensemble rank to calibrated storm surge probabilities.
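The rank-to-probability step can be sketched as follows (a minimal illustration; the function name and the exceedance convention are my own assumptions, not the report's exact procedure):

```python
import numpy as np

def exceedance_probability(rank_counts, member_index):
    """Calibrated probability that the verifying observation falls above
    ensemble member `member_index`, read off the Cumulative Rank Histogram
    built from `rank_counts` (one count per rank, ranks 0..N)."""
    rank_counts = np.asarray(rank_counts, dtype=float)
    cum = np.cumsum(rank_counts) / np.sum(rank_counts)   # CRH, normalised
    return float(1.0 - cum[member_index])
```

For a perfectly flat Talagrand Diagram this reduces to the naive ensemble fraction; a non-flat histogram shifts the probabilities, which is exactly the calibration effect described above.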

3 Results

Firstly, this section presents the results for the verification of the archived storm surge probability forecasts for Hoek van Holland (Sect. 3.1.2), Den Helder (Sect. 3.1.3) and Delfzijl (Sect. 3.1.4). For these archived ensemble forecasts, a bias correction has been applied to each ensemble member, based on forecasts and observations between September 2006 and May 2007. After that, these storm surge forecasts have been calibrated and converted to probabilities using observations and forecasts from the same period. In Sect. 3.2 the default bias correction and calibration are evaluated for suitability and performance.

3.1 Verification of WAQUA/DCSMv5's coastal water level probability forecasts

In this subsection, probability forecasts for coastal water levels are verified using the statistical scores described in Sect. 2.2 of this report. In Sect. 3.1.1, forecast sharpness is evaluated. Sect. 3.1.2, 3.1.3 and 3.1.4 are dedicated to the verification of storm surge probability forecasts for three locations along the Dutch coast: Hoek van Holland, Den Helder and Delfzijl, respectively. For each of these locations, the model's performance is analyzed using the statistical scores described in Sect. 2.2. It is important to note that WAQUA/DCSMv5's probabilistic forecasts for coastal water levels are calibrated based on observations and forecasts in the period September 2006 to May 2007. For the analysis, forecasts are first split into those issued in 2008-2012 and those issued in 2013-2016, to investigate whether the WAQUA/DCSMv5 forecasts have improved over time. To this end, verification is performed for both surge levels and absolute water levels. Then, forecasts are divided into low tide and high tide forecasts to see whether forecasts perform better for a certain tidal extreme.
Lastly, model performance for summer forecasts is compared with winter forecasts to test the forecasts' ability to capture the meteorology.

3.1.1 Forecast sharpness

Sharpness indicates the decisiveness of a forecast; the sharper the forecast, the better. Figures 5, 6, 7 and 8 show the relation between forecast lead-time and the distribution of issued forecast probabilities. This has been analyzed for forecasts issued in 2008-2012 and in 2013-2016. Sharpness diagrams have been generated for medium positive surge, medium negative surge, Information Level and Pre-warning Level. In these diagrams, the number of issued forecast probabilities is plotted against the forecast probability bins. Firstly, an increase in forecast lead-time results in fewer forecasts falling into the higher forecast probability bins. In other words, sharpness decreases with lead-time. Secondly, an increase in lead-time has no substantial influence on the frequency of forecasts in the lowest bin for all evaluated coastal water level forecasts except medium negative surge. This high number of forecasts in the lowest probability bin for all lead-times is due to the fact that this research evaluates rare events, and rare event forecasts obtain less bell-shaped sharpness diagrams for long lead-times compared with normal event forecasts. Moreover, for medium negative surge, the absence of forecasts in the lowest bin and the high values in the second bin for forecast lead-time

< 48 hours is due to the calibration based on the 2006/2007 season, where all forecast probabilities have been calibrated to substantially higher values. It is important to note that the ensemble needs spin-up time for the perturbations to develop; as a result, the spread is too low for the first 48 hours. Moreover, deterministic forecasts by WAQUA/DCSMv6 are used for the first 48 hours and the ensemble forecasts by ECMWF for the medium range (2-10 days). Still, also for medium negative surge, longer forecast lead-times result in fewer forecasts falling in the highest probability bin. The trends found in these figures hold for the other evaluated locations along the Dutch coast as well. In general, forecast users prefer sharp forecasts, as the range of possible outcomes is then reduced, which supports proper anticipation. On the other hand, a sharp forecast may be unreliable, and this combination leads to overconfidence in the forecast and to poor decisions. This research uses sharpness diagrams as a tool to provide information about the frequency of probabilities issued by a forecast. As a result, sharpness diagrams serve as background knowledge for the interpretation of the attribute diagrams, which will be presented later in this subsection. In this research, sharpness diagrams are only presented for Hoek van Holland, as these diagrams are more or less equal for all other stations.

Fig. 5: Sharpness diagrams for medium positive surge probability forecasts, constructed for different lead-time groups for the location Hoek van Holland.
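The counts behind such a sharpness diagram can be sketched as follows (illustrative function name; the log-scaled plotting of the figures is left out):

```python
import numpy as np

def sharpness_counts(probs, n_bins=10):
    """Number of issued forecast probabilities per probability bin,
    with bin edges 0.0, 0.1, ..., 1.0 as in the sharpness diagrams."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # digitize returns 1-based bin indices; clip keeps prob = 1.0 in the top bin
    idx = np.clip(np.digitize(probs, edges) - 1, 0, n_bins - 1)
    return np.bincount(idx, minlength=n_bins)
```

Applied per lead-time group, the resulting count vectors reproduce one panel of a sharpness diagram each; a concentration in the outer bins indicates a sharp forecast.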

Fig. 6: Sharpness diagrams for medium negative surge probability forecasts, constructed for different lead-time groups for the location Hoek van Holland.

Fig. 7: Sharpness diagrams for Information Level probability forecasts, constructed for different lead-time groups for the location Hoek van Holland.

Fig. 8: Sharpness diagrams for Pre-warning Level probability forecasts, constructed for different lead-time groups for the location Hoek van Holland.

3.1.2 Hoek van Holland

3.1.2.1 2008-2012 versus 2013-2016

In Fig. 9, Brier skill scores are presented for different lead-time groups. The Brier skill score is calculated by means of Eq. 3. The minimum value plotted for the Brier skill score is set to -1. For Hoek van Holland, forecasts generated between 2013 and 2016 have higher Brier skill scores for medium positive surge than forecasts from the period 2008-2012. Forecasts for medium negative surges have low Brier skill scores for nearly all lead-time groups. This could be a result of systematic under- or over-estimation of negative surges by the WAQUA/DCSMv5 storm surge model. Higher Brier skill scores for both absolute water levels are obtained by forecasts issued between 2008 and 2012 for lead-times up to 84 hours ahead (Fig. 9). However, the differences are not substantial compared to forecasts issued between 2013 and 2016. Moreover, forecasts from 2013-2016 maintain skill even after 5 days lead-time for both absolute water levels.

Fig. 9: Brier skill scores for both surge and both absolute water level probability forecasts, calculated for different lead-time groups for the location Hoek van Holland.
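As a sketch, the Brier skill score against a climatological base-rate reference could be computed like this (function names and the choice of the sample base rate as reference are my own; the report's Eq. 3 may define the reference differently):

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared difference between forecast probability and outcome (0/1)."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((probs - outcomes) ** 2))

def brier_skill_score(probs, outcomes):
    """BSS = 1 - BS / BS_climatology, with the sample base rate as climatology.
    BSS = 1 is a perfect forecast; BSS <= 0 means no skill over climatology."""
    outcomes = np.asarray(outcomes, dtype=float)
    base_rate = outcomes.mean()
    bs_clim = brier_score(np.full(outcomes.shape, base_rate), outcomes)
    return 1.0 - brier_score(probs, outcomes) / bs_clim
```

Grouping the (probability, outcome) pairs by lead-time before calling `brier_skill_score` reproduces the per-lead-time-group scores plotted in Fig. 9.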

Fig. 10 holds attribute diagrams for different lead-time groups, based on coastal water level probability forecasts for medium positive surge exceedance. For the lower forecast probability bins, forecasts from 2013-2016 obtain higher skill than forecasts from 2008-2012 for lead-times up to 84 hours ahead. Importantly, most forecasts fall in the lowest forecast bin, as this research primarily verifies rare events. As a result, the value of the Brier score, which consists of reliability, resolution and uncertainty, is mainly determined by the lowest forecast bin. In addition, the relative frequency points are plotted halfway through each forecast bin in the attribute diagram. These figures could therefore give the impression that there is no skill for this lowest bin, while there actually is skill. Still, these attribute diagrams make it possible to visualize the model's reliability for forecasts with higher probabilities of exceedance. For all lead-times, there is a tendency towards over-forecasting, and this tendency becomes larger with increasing lead-time (Fig. 10).

Fig. 10: Attribute diagrams for probability forecasts based on medium positive surge exceedance, constructed for different lead-time groups for the location Hoek van Holland. 90% confidence intervals are included for both forecast periods.
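The points of such an attribute (reliability) diagram can be sketched as follows (illustrative names; confidence intervals and the no-resolution/no-skill reference lines are omitted):

```python
import numpy as np

def reliability_points(probs, outcomes, n_bins=10):
    """Mean forecast probability (x) and observed relative frequency (y)
    per probability bin; bins without any forecasts are skipped."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(probs, edges) - 1, 0, n_bins - 1)
    fc, ob = [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            fc.append(probs[mask].mean())      # x-coordinate of the point
            ob.append(outcomes[mask].mean())   # y-coordinate of the point
    return np.array(fc), np.array(ob)
```

Points below the 1:1 line correspond to over-forecasting, which is the tendency noted above for all lead-time groups.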

Fig. 11 shows the attribute diagrams for medium negative surge for different lead-time groups and for both forecast periods. Again, most forecasts fall into the lowest probability bin, and this figure clearly shows that forecasts from both periods are unreliable for lead-times up to 48 hours ahead. For the higher probability bins, forecasts from 2008-2012 are slightly more reliable. This also holds for lead-times between 48-84 hours ahead. However, for lead-times between 84-120 hours ahead, forecasts from both periods are equally reliable for the higher probability bins. Furthermore, forecasts from both periods have no skill for most probability bins for lead-times between 120-240 hours ahead. For both forecast periods and for all lead-time groups, there is a tendency towards over-forecasting.

Fig. 11: Attribute diagrams for probability forecasts based on medium negative surge exceedance, constructed for different lead-time groups for the location Hoek van Holland. 90% confidence intervals are included for both forecast periods.

Fig. 12 and 13 contain attribute diagrams for Information Level and Pre-warning Level forecasts, respectively. For lead-times up to 2 days ahead, forecasts from 2013-2016 are more reliable for probability forecasts based on Information Level exceedance (Fig. 12). For lead-times between 48-84 hours ahead, forecasts from 2008-2012 are more reliable for forecasts with probabilities > 0.4, while forecasts from 2013-2016 are more reliable for forecasts with probabilities < 0.4. Forecasts from both periods have no skill for forecasts with probabilities < 0.6 for lead-times between 120-240 hours ahead. For higher probabilities, forecasts from 2013-2016 are more reliable. Again, there is a bias towards over-forecasting for all lead-time groups.

Fig. 12: Attribute diagrams for probability forecasts based on Information Level exceedance, constructed for different lead-time groups for the location Hoek van Holland. 90% confidence intervals are included for both forecast periods.

For probability forecasts based on Pre-warning Level exceedance, forecasts from both periods are about equally reliable for probabilities > 0.1 for lead-times up to 2 days ahead (Fig. 13). However, the deterministic WAQUA/DCSMv6 is used for the first 48 hours. For lead-times of 2-5 days ahead, forecasts from 2008-2012 are more reliable than forecasts from 2013-2016. After 5 days, forecasts from 2013-2016 are more reliable than those from the other forecast period. It is important to note that, especially for the Pre-warning Level, the confidence intervals are very large. This is due to the fact that this absolute water level is rarely reached, and as a result, conclusions based on these figures need to be drawn with care.

Fig. 13: Attribute diagrams for probability forecasts based on Pre-warning Level exceedance, constructed for different lead-time groups for the location Hoek van Holland. 90% confidence intervals are included for both forecast periods.

The next figures (Fig. 14, 15, 16 and 17) present ROC curves for the two surge levels and the two absolute water levels. The ROC curves plot the hit rate versus the false-alarm rate for different threshold values. The larger the Area Under the Curve, the higher the skill of the forecast. The 1:1 line corresponds to random forecasts based on climatology; in that case, the hit rate and the false-alarm rate are equal for each threshold value. For probability forecasts based on medium positive surge exceedance, forecasts generated between 2008 and 2012 are generally better at discriminating events from non-events for the lower lead-time groups (Fig. 14). Also, for both periods, the ability to discriminate between events and non-events deteriorates with increasing lead-time. As a result, the curves move towards the no-skill line (1:1 line) for longer lead-times, which results in lower ROC skill scores (ROCSS) (Table 4). Overall, for both forecast periods and for all lead-time groups, the ROCSS are classified as good to excellent (Table 3), which implies that the forecasts have a high event/non-event discrimination ability.

Fig. 14: ROC diagrams for probability forecasts based on medium positive surge exceedance, constructed for different forecast lead-time groups for the location Hoek van Holland.

For probability forecasts based on medium negative surge exceedance, forecasts generated between 2013 and 2016 have a higher ability to discriminate events from non-events for all lead-times except between 5 and 10 days (Fig. 15). This is also shown in Table 4. Forecasts from both periods obtain their highest skill between 2 and 5 days lead-time. For longer lead-times, skill declines.

Fig. 15: ROC diagrams for probability forecasts based on medium negative surge exceedance, constructed for different forecast lead-time groups for the location Hoek van Holland.

For Information Level probability forecasts, higher skill is achieved by forecasts from the period 2013-2016 than by forecasts from the period 2008-2012 for lead-times between 2 and 5 days ahead (Fig. 16, Table 4). The lowest skill is obtained for lead-times larger than 5 days for both forecast periods.

Fig. 16: ROC diagrams for probability forecasts based on Information Level exceedance, constructed for different forecast lead-time groups for the location Hoek van Holland.

For Pre-warning Level probability forecasts, forecasts generated between 2008 and 2012 have higher skill for all lead-times (Fig. 17, Table 4). Both forecast periods possess their lowest skill for lead-times larger than 5 days.

Fig. 17: ROC diagrams for probability forecasts based on Pre-warning Level exceedance, constructed for different forecast lead-time groups for the location Hoek van Holland.

Tab. 4: ROC skill scores for different lead-time groups for forecasts from 2008-2012 and 2013-2016 for Hoek van Holland. ROCSS are calculated for both surge levels and both absolute water levels.

  Water level             Years      0-48h   48-84h   84-120h   120-240h
  Medium positive surge   2008-2012  0.96    0.96     0.94      0.76
                          2013-2016  0.76    0.86     0.9       0.78
  Medium negative surge   2008-2012  0.78    0.86     0.88      0.68
                          2013-2016  0.9     0.98     0.96      0.6
  Information Level       2008-2012  0.9     0.9      0.9       0.8
                          2013-2016  0.8     0.9      0.94      0.8
  Pre-warning Level       2008-2012  0.9     0.94     1.0       0.8
                          2013-2016  0.8     0.94     0.94      0.74

3.1.2.2 CRPS

The previous scores quantified the performance of the WAQUA/DCSMv5 probability forecasts based on pre-selected thresholds for coastal water levels. The next score, the Continuous Rank Probability Score (CRPS), holds information about the model's performance for all possible coastal water level thresholds. In fact, the CRPS corresponds to the integral of the Brier scores over all thresholds for coastal water levels (see Sect. 2.2.3). As a result, this score provides an overall impression of the model's performance. The CRPS is converted to a skill score (CRPSS) by means of Eq. 1, which compares the score with a perfect score and with climatology. The CRPSS as a function of lead-time for combinations of high/low tide and 2008-2012/2013-2016 forecasts is presented in Fig. 18. Firstly, this figure clearly shows that maximum skill is obtained at approximately 48 hours lead-time for all combinations. Secondly, for lead-times < 4 days, the low tide/2013-2016 forecasts have the highest skill. After 6 days lead-time, the high tide/2008-2012 forecasts obtain the highest skill.

Fig. 18: CRPSS for the raw, un-calibrated ensemble forecasts as a function of lead-time for the location Hoek van Holland. HT = high tide and LT = low tide forecasts.
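For a single ensemble forecast, the CRPS can be sketched via the energy form (my own choice of formulation; for an empirical ensemble CDF it is mathematically equivalent to integrating the Brier score over all thresholds, as described above):

```python
import numpy as np

def crps_ensemble(members, obs):
    """CRPS of one ensemble forecast in the energy form:
    mean |x_i - y| - 0.5 * mean |x_i - x_j|."""
    x = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(x - obs))                      # spread around the obs
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))  # internal spread
    return float(term1 - term2)

def crpss(crps_forecast, crps_climatology):
    """Skill score relative to climatology; a perfect forecast has CRPS = 0."""
    return 1.0 - crps_forecast / crps_climatology
```

Averaging `crps_ensemble` over all forecast cases in a lead-time group, and doing the same for a climatological ensemble, yields the CRPSS curves of Fig. 18.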

3.1.2.3 Low/high tide and positive/negative surge

In this paragraph, the performance of forecasts at high and low tides is evaluated, based on the WAQUA/DCSMv5 probability forecasts for Hoek van Holland, in order to provide more insight into potential strengths and weaknesses of the storm surge model. This subsection solely evaluates performance for the medium positive surge and medium negative surge levels, as the predefined absolute water levels are not reached during low tides. The Brier skill scores corresponding to the WAQUA/DCSMv5 model performance are calculated for different lead-time groups and presented in Fig. 19. Again, the minimum plotted value for the Brier skill score is set to -1. In general, for both surge levels, forecasts at high tide perform better than forecasts at low tide. For medium positive surge probability forecasts, forecast skill decreases with increasing lead-time, resulting in no skill for forecasts after day 5. For medium negative surge probability forecasts, there is still skill for forecasts during high tide after 3.5 days, but this skill diminishes rapidly for longer lead-times. As mentioned earlier in this research and in de Vries (2008), the WAQUA/DCSM storm surge model has difficulties with forecasting negative surges.

Fig. 19: Brier skill scores for both surge levels, calculated for different lead-time groups for the location Hoek van Holland.

3 Results 3 In Fig., the reliability of forecasts is represented for forecasts issued during both high and low tides for different lead-times. For Hoek van Holland, forecast performance deteriorates with increasing lead times regardless of whether the forecast was at either high tides or low tides. Furthermore, the ensemble forecasts surges are better at high tides than at low tides for low leadtimes (-48 hours ahead). In addition, high tide forecasts with lead-times between 1 and 4 hours ahead obtain reliability for the higher forecast bins. 48h 48 84h...4.6.8 1. low tide high tide...4.6.8 1. low tide high tide...4.6.8 1....4.6.8 1. 84 1h 1 4h...4.6.8 1. low tide high tide No resolution...4.6.8 1. low tide high tide...4.6.8 1....4.6.8 1. Fig. : Attribute diagrams for medium positive surge probability forecasts, constructed for different lead-time groups for the location Hoek van Holland. 9 % confidence intervals are included for both forecast periods.