Ensemble Verification Metrics


Ensemble Verification Metrics. Debbie Hudson (Bureau of Meteorology, Australia). ECMWF Annual Seminar 2017. Acknowledgements: Beth Ebert

Overview
1. Introduction
2. Attributes of forecast quality
3. Metrics: full ensemble
4. Metrics: probabilistic forecasts
5. Metrics: ensemble mean
6. Key considerations: sampling issues; stratification; uncertainty; communicating verification

Purposes of ensemble verification
User-oriented: How accurate are the forecasts? Do they enable better decisions than could be made using alternative information (persistence, climatology)?
Inter-comparison and monitoring: How do forecast systems differ in performance? How does performance change over time?
Calibration: Assist in bias removal and downscaling.
Diagnosis: Pinpoint sources of error in the ensemble forecast system; diagnose the impact of model improvements, changes to data assimilation and/or ensemble generation, etc.; diagnose/understand mechanisms and sources of predictability.
These purposes span both operations and research.

Evaluating Forecast Quality
A large number of forecasts and observations is needed to evaluate ensembles and probability forecasts.
Forecast quality vs. value.
Attributes of forecast quality: accuracy; skill; reliability; discrimination and resolution; sharpness.

Accuracy and Skill
Accuracy: the overall correspondence/level of agreement between forecasts and observations.
Skill: a set of forecasts is skilful if it is better than a reference set, i.e. skill is a comparative quantity. Reference sets include, e.g., persistence, climatology or random forecasts.

Reliability
The ability to give unbiased probability estimates for dichotomous (yes/no) forecasts: can I trust the probabilities?
Defines whether the certainty communicated in the forecasts is appropriate, i.e. whether the forecast distribution represents the distribution of observations.
Reliability can be improved by calibration.

Discrimination and Resolution
Resolution: how much does the observed outcome change as the forecast changes, i.e. "Do outcomes differ given different forecasts?" Conditioned on the forecasts.
Discrimination: can different observed outcomes be discriminated by the forecasts? Conditioned on the observations. Indicates potential "usefulness". Cannot be improved by calibration.

Discrimination
[Figure: frequency of forecasts for observed events vs. observed non-events, in three panels: (a) good discrimination, (b) poor discrimination, (c) good discrimination.]

Sharpness
Sharpness is the tendency to forecast extreme values (probabilities near 0 or 100%) rather than values clustered around the mean (a forecast of climatology has no sharpness). It is a property of the forecast only. Sharp forecasts are "useful", BUT we don't want sharp forecasts that are not reliable: that implies unrealistic confidence.

What are we verifying? How are the forecasts being used?
Ensemble distribution: the set of forecasts making up the ensemble distribution; use the individual members or fit a distribution.
Probabilistic forecasts generated from the ensemble: create probabilities by applying thresholds.
Ensemble mean.

Commonly used verification metrics: characteristics of the full ensemble
Rank histogram; spread vs. skill; Continuous Ranked Probability Score (CRPS) (discussed under probability forecasts).

Rank histogram
Measures consistency and reliability: whether the observation is statistically indistinguishable from the ensemble members.
For each observation, rank the N ensemble members from lowest to highest and identify the rank of the observation with respect to the forecasts.
[Figure: example for 10 ensemble members (temperature in degC); depending on where the observation falls relative to the ensemble, its rank is, e.g., 2, 8 or 3 out of 11.]
Need lots of samples to evaluate the ensemble.
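As an illustration of the ranking step (not part of the talk), a minimal NumPy sketch is given below; the array shapes and the simple tie handling are assumptions.

```python
import numpy as np

def rank_histogram(ensemble, obs):
    """ensemble: (n_cases, n_members); obs: (n_cases,).
    Returns counts of the observation's rank, 1..n_members+1."""
    n_cases, n_members = ensemble.shape
    # Rank = 1 + number of members strictly below the observation
    # (ties would need random splitting for discrete variables).
    ranks = 1 + np.sum(ensemble < obs[:, None], axis=1)
    return np.bincount(ranks, minlength=n_members + 2)[1:]

# Toy example: a statistically consistent ensemble gives a roughly flat histogram.
rng = np.random.default_rng(0)
ens = rng.normal(size=(5000, 10))
ob = rng.normal(size=5000)
print(rank_histogram(ens, ob))
```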

Rank histogram
[Figure: schematic rank histograms (rank of observation, 1-11, on the x-axis) illustrating negative bias, positive bias, a consistent/reliable (flat) histogram, an under-dispersive (overconfident) ensemble and an over-dispersive (underconfident) ensemble.]

Rank histogram
A flat rank histogram does not necessarily indicate a skilful forecast.
The rank histogram shows conditional/unconditional biases BUT not the full picture: it only measures whether the observed probability distribution is well represented by the ensemble.
It does NOT show sharpness: climatological forecasts are perfectly consistent (flat rank histogram) but not useful.

Spread-skill evaluation
For a reliable ensemble the ensemble spread (S_ens) should match the RMSE of the ensemble mean: S_ens ≈ RMSE is consistent/reliable; S_ens < RMSE is under-dispersive (overconfident); S_ens > RMSE is over-dispersive (underconfident).
[Figure: RMSE and ensemble spread of 500 hPa geopotential height (20-60S) for a seasonal prediction system in which the ensemble is generated using (A) stochastic physics only, which is under-dispersive (S_ens < RMSE), and (B) stochastic physics AND perturbed initial conditions. Hudson et al. (2017).]
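A minimal sketch of a spread-skill check, added here for illustration (not from the talk); it assumes an array of ensemble forecasts and matching observations, and compares the mean ensemble spread with the RMSE of the ensemble mean.

```python
import numpy as np

def spread_and_rmse(ensemble, obs):
    """ensemble: (n_cases, n_members); obs: (n_cases,)."""
    ens_mean = ensemble.mean(axis=1)
    rmse = np.sqrt(np.mean((ens_mean - obs) ** 2))            # skill of the mean
    spread = np.sqrt(np.mean(ensemble.var(axis=1, ddof=1)))   # average spread
    return spread, rmse

# Toy example where the observation is exchangeable with the members:
# spread and RMSE are then approximately equal.
rng = np.random.default_rng(1)
signal = rng.normal(size=(2000, 1))
obs = (signal + rng.normal(size=(2000, 1))).ravel()
ens = signal + rng.normal(size=(2000, 10))
print(spread_and_rmse(ens, obs))
```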

Commonly used verification metrics: probability forecasts
Reliability/attributes diagram; Brier Score (BS and BSS); Ranked Probability Score (RPS and RPSS); Continuous Ranked Probability Score (CRPS and CRPSS); Relative Operating Characteristic (ROC and ROCS); Generalized Discrimination Score (GDS).

Reliability (attributes) diagram (dichotomous forecasts)
Measures how well the predicted probabilities of an event correspond to their observed frequencies (reliability).
Plot the observed relative frequency against the forecast probability for all probability categories; a big enough sample is needed.
A histogram of how often each probability was issued shows sharpness and potential sampling issues.
[Figure: observed relative frequency vs. forecast probability, with the perfect-reliability diagonal, a horizontal "no resolution" (climatology) line, and an inset histogram of the number of forecasts per probability bin.]
The curve tells what the observed frequency was for a given forecast probability. Conditioned on the forecasts.
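For illustration (not from the talk), a minimal sketch of the binning behind a reliability diagram, assuming probability forecasts and binary observations stored as NumPy arrays.

```python
import numpy as np

def reliability_points(prob, obs, n_bins=10):
    """prob: forecast probabilities in [0, 1]; obs: binary occurrences (0/1).
    Returns bin centres, observed relative frequency and forecast count per bin."""
    prob, obs = np.asarray(prob, float), np.asarray(obs, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(prob, edges) - 1, 0, n_bins - 1)   # bin index per case
    counts = np.bincount(idx, minlength=n_bins)
    events = np.bincount(idx, weights=obs, minlength=n_bins)
    with np.errstate(divide="ignore", invalid="ignore"):
        obs_freq = np.where(counts > 0, events / counts, np.nan)
    centres = 0.5 * (edges[:-1] + edges[1:])
    return centres, obs_freq, counts
```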

Interpretation of reliability diagrams
[Figure: four schematic reliability diagrams (observed frequency vs. forecast probability) illustrating underforecasting, no resolution, overconfidence, and a curve that is probably under-sampled.]

Reliability diagram: example
Predictions of above-normal seasonal SON rainfall from a statistical forecast scheme (STAT) and a dynamical forecast scheme (OPR).
[Figure: reliability diagram of observed relative frequency vs. forecast probability, showing the perfect-reliability diagonal and the climatology/no-skill lines; the size of the circles is proportional to the number of forecasts issuing that probability.]
Most of the statistical forecasts issued have probabilities near 50%, whereas the dynamical scheme issues a range of forecast probabilities. The statistical system often gave forecasts close to climatology: reliable BUT poor sharpness, and of limited use for decision-makers!

Brier score (BS) (dichotomous forecasts)
The Brier score measures the mean squared probability error:
BS = \frac{1}{N} \sum_{i=1}^{N} (p_i - o_i)^2
where p_i is the forecast probability and o_i the observed occurrence (0 or 1). Score range: 0 to 1; perfect BS: 0.
Murphy's (1973) decomposition into three terms (for K probability classes and N samples):
BS = \frac{1}{N} \sum_{k=1}^{K} n_k (p_k - \bar{o}_k)^2 - \frac{1}{N} \sum_{k=1}^{K} n_k (\bar{o}_k - \bar{o})^2 + \bar{o}(1 - \bar{o}) = reliability - resolution + uncertainty
Useful for exploring the dependence of probability forecasts on ensemble characteristics. The uncertainty term measures the variability of the observations and has nothing to do with forecast quality!
The BS is sensitive to the climatological frequency of an event: the rarer an event, the easier it is to get a good BS without having any real skill.
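A minimal sketch of the Brier score and Murphy's decomposition, added for illustration; it assumes the forecasts take a finite set of probability values (e.g. m/M from an M-member ensemble), so the decomposition is exact when binned by the issued probabilities.

```python
import numpy as np

def brier_decomposition(prob, obs):
    """prob: forecast probabilities; obs: 0/1 occurrences.
    Returns (BS, reliability, resolution, uncertainty), with BS = rel - res + unc."""
    prob, obs = np.asarray(prob, float), np.asarray(obs, float)
    n = prob.size
    bs = np.mean((prob - obs) ** 2)
    obar = obs.mean()
    rel = res = 0.0
    for p_k in np.unique(prob):            # each distinct probability class
        sel = prob == p_k
        n_k = sel.sum()
        obar_k = obs[sel].mean()
        rel += n_k * (p_k - obar_k) ** 2
        res += n_k * (obar_k - obar) ** 2
    unc = obar * (1.0 - obar)
    return bs, rel / n, res / n, unc

# Example with 10 hypothetical forecast/observation pairs.
p = np.array([0.1, 0.1, 0.4, 0.4, 0.4, 0.7, 0.7, 0.9, 0.9, 0.9])
o = np.array([0,   0,   1,   0,   0,   1,   0,   1,   1,   1  ])
print(brier_decomposition(p, o))
```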

BS, Brier Skill Score (BSS) and the attributes diagram
Reliability term (BS_rel): measures the deviation of the curve from the diagonal line, i.e. the error in the probabilities.
Resolution term (BS_res): measures the deviation of the curve from the sample-climate horizontal line, i.e. the degree to which the forecast can separate different situations.
Brier skill score: measures the relative skill of the forecast compared to climatology,
BSS = 1 - \frac{BS}{BS_{clim}}
Perfect: BSS = 1.0; climatology: BSS = 0.0.
[Figure: attributes diagram (observed frequency vs. forecast probability) showing the perfect-reliability diagonal, the climatology and no-resolution lines, the penalty for lack of reliability, the reward for resolution, and a shaded region whose points contribute to positive BSS.]
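For illustration, a sketch of the BSS against a climatological reference that always issues the sample base rate; the choice of reference is an assumption.

```python
import numpy as np

def brier_skill_score(prob, obs):
    """BSS = 1 - BS / BS_clim with a constant base-rate reference forecast."""
    prob, obs = np.asarray(prob, float), np.asarray(obs, float)
    bs = np.mean((prob - obs) ** 2)
    bs_clim = np.mean((obs.mean() - obs) ** 2)   # climatological reference
    return 1.0 - bs / bs_clim
```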

BS_rel and BS_res: example
[Figure: maps of the reliability (BS_rel; smaller is better) and resolution (BS_res; bigger is better) terms for the probability of above-normal seasonal mean rainfall over Australia in the Aug-Sep-Oct season, comparing ACCESS-S and POAMA.]

Continuous ranked probability score (CRPS)
The CRPS measures the difference between the forecast and observed CDFs:
CRPS = \int \left( P_{fcst}(x) - P_{obs}(x) \right)^2 dx
It is the same as the Brier score integrated over all thresholds. It works on a continuous scale: no reduction of the ensemble forecasts to discrete probabilities of binary or categorical events is needed (for multi-category forecasts use the Ranked Probability Score).
For deterministic forecasts it is the same as the Mean Absolute Error. It has the dimensions of the observed variable; perfect score: 0. It rewards small spread (sharpness) if the forecast is accurate.
CRPS skill score with respect to climatology: CRPSS = 1 - \frac{CRPS}{CRPS_{clim}}
[Figure: forecast and observed PDFs, the corresponding CDFs, and the squared CDF difference that the CRPS integrates.]
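A minimal sketch of the CRPS of an ensemble against a single observation, using the standard identity CRPS = E|X - y| - 0.5 E|X - X'| for the empirical ensemble CDF; added here for illustration, not from the talk.

```python
import numpy as np

def crps_ensemble(members, obs):
    """CRPS of an ensemble (empirical CDF) against a scalar observation."""
    x = np.asarray(members, float)
    term1 = np.mean(np.abs(x - obs))                        # E|X - y|
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))  # 0.5 E|X - X'|
    return term1 - term2

print(crps_ensemble([1.0, 2.0, 3.0, 4.0], 2.5))   # small toy example
```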

Relative Operating Characteristic (ROC) (dichotomous forecasts)
Measures the ability of the forecast to discriminate between events and non-events (discrimination).
Plot the hit rate vs. the false alarm rate using a set of varying probability thresholds to make the yes/no decision. A curve close to the upper left corner indicates good discrimination; a curve close to or below the diagonal indicates poor discrimination.
The area under the curve ("ROC area") is a useful summary measure of forecast skill: ROC area = 1 for a perfect forecast and 0.5 for a climatological (no skill) forecast.
ROC skill score: ROCS = 2 (ROC area - 0.5).
The ROC is conditioned on the observations. Reliability and ROC diagrams are good companions.
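For illustration (not from the talk), a sketch of the hit rate / false-alarm rate calculation over probability thresholds and the resulting ROC area; the default thresholds are an assumption, and the sample is assumed to contain both events and non-events.

```python
import numpy as np

def roc_curve(prob, obs, thresholds=np.linspace(0.0, 1.0, 11)):
    """prob: forecast probabilities; obs: 0/1 occurrences.
    Returns false-alarm rates, hit rates and the ROC area."""
    prob = np.asarray(prob, float)
    obs = np.asarray(obs, bool)
    hr, far = [], []
    for t in thresholds:                 # ascending thresholds
        yes = prob >= t
        hr.append(np.mean(yes[obs]))     # hits / observed events
        far.append(np.mean(yes[~obs]))   # false alarms / observed non-events
    hr.append(0.0)                       # threshold above 1: "yes" never forecast
    far.append(0.0)
    hr, far = np.array(hr), np.array(far)
    # Trapezoidal area under the (FAR, HR) curve; FAR decreases along the arrays.
    area = np.sum(0.5 * (hr[:-1] + hr[1:]) * np.abs(np.diff(far)))
    return far, hr, area
```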

ROC: example
[Figure: ROC area for the probability of a heatwave, for all forecasts initialised in DJF (weeks 1-2), ranging from poor to good discrimination. Hudson and Marshall (2016).]

Generalized Discrimination Score (GDS) (binary, multi-category and continuous forecasts)
A rank-based measure of discrimination: for each pair of cases, does the forecast successfully rank (discriminate) the two different observations?
The GDS is equivalent to the ROC area for dichotomous forecasts and has the same scaling.
[Schematic: every pair of cases is compared, e.g. (Observation 1, Observation 2) with (Forecast 1, Forecast 2), (Observation 1, Observation 3) with (Forecast 1, Forecast 3), ..., (Observation N-1, Observation N) with (Forecast N-1, Forecast N), asking each time "Obs correctly discriminated? YES / NO".]
GDS = proportion of successful rankings (no skill = 0.5). Mason & Weigel (2009); Weigel & Mason (2011).
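A simplified pairwise sketch in the 2AFC spirit of the GDS, added for illustration: it assumes a single scalar forecast per case (e.g. a probability or an ensemble mean) and scalar observations, and counts tied forecasts as half a success; the full GDS of Mason & Weigel handles ensembles and categorical forecasts more generally.

```python
import numpy as np

def two_afc(fcst, obs):
    """Proportion of observation pairs correctly ranked by the forecasts."""
    fcst, obs = np.asarray(fcst, float), np.asarray(obs, float)
    wins = trials = 0.0
    n = obs.size
    for i in range(n):
        for j in range(i + 1, n):
            if obs[i] == obs[j]:
                continue                  # only pairs with different outcomes
            trials += 1
            lo, hi = (i, j) if obs[i] < obs[j] else (j, i)
            if fcst[hi] > fcst[lo]:
                wins += 1
            elif fcst[hi] == fcst[lo]:
                wins += 0.5               # ties count as half a success
    return wins / trials                  # 0.5 = no skill, 1.0 = perfect

print(two_afc([0.2, 0.6, 0.9, 0.4], [0, 1, 1, 0]))
```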

GDS (and ROC): example
[Figure: discrimination of forecasts of seasonal SON rainfall, ranging from no/poor discrimination to good discrimination. See https://meteoswiss-climate.shinyapps.io/skill_metrics/]

Commonly used verification metrics: ensemble mean
e.g., RMSE, correlation.

Verification of the ensemble mean
There is debate as to whether or not this is a good idea.
Pros: the ensemble mean filters out smaller unpredictable scales; it is needed for spread-skill evaluation; forecasters and others use the ensemble mean.
Cons: it is not a realization of the ensemble; it has different statistical properties to the ensemble and observations.
Scores: RMSE, anomaly correlation, other deterministic verification scores.
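A minimal sketch of the two ensemble-mean scores, added for illustration; the climatology used for the anomalies is assumed to be supplied by the user.

```python
import numpy as np

def ens_mean_scores(ensemble, obs, clim):
    """ensemble: (n_cases, n_members); obs, clim: (n_cases,).
    Returns RMSE and anomaly correlation of the ensemble mean."""
    fc = ensemble.mean(axis=1)
    rmse = np.sqrt(np.mean((fc - obs) ** 2))
    fa, oa = fc - clim, obs - clim            # forecast and observed anomalies
    acc = np.sum(fa * oa) / np.sqrt(np.sum(fa ** 2) * np.sum(oa ** 2))
    return rmse, acc
```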

Key considerations: sampling issues
Rare and extreme events (see Chris Ferro's talk on verification of extremes): it is difficult to verify probabilities in the "tail" of the PDF; there are too few samples to get robust statistics, especially for reliability; and a finite number of ensemble members may not resolve the tail of the forecast PDF.
Size of the ensemble vs. number of verification samples: the robustness of the verification depends on both!

Key considerations: stratification
Verification results vary with region, season and climate driver; pooling samples can mask variations in forecast performance. Stratify the data into sub-samples, BUT there must be enough samples to give robust statistics!
[Figure: MJO example, bivariate correlation for the RMM index. Hudson et al. (2017).]

Key considerations: uncertainty
Are the forecasts significantly better than a reference forecast? Does ensemble A perform significantly better than ensemble B? Take sampling variability into account using significance levels and/or confidence intervals, e.g. from non-parametric resampling methods (Monte Carlo, bootstrap).
Effects of observation errors: they add uncertainty to the verification results (the true forecast skill is unknown) and add extra dispersion to the observed PDF. An active area of research.
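For illustration, a sketch of a non-parametric bootstrap confidence interval for any score function, resampling forecast/observation pairs with replacement; the number of resamples and the percentile method are assumptions, and serial correlation between cases is ignored here.

```python
import numpy as np

def bootstrap_ci(score_fn, fcst, obs, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for score_fn(fcst, obs), resampling cases."""
    fcst, obs = np.asarray(fcst), np.asarray(obs)
    rng = np.random.default_rng(seed)
    n = obs.shape[0]
    scores = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample cases with replacement
        scores[b] = score_fn(fcst[idx], obs[idx])
    return np.quantile(scores, [alpha / 2, 1 - alpha / 2])
```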

Key considerations: communicating verification to users
It is challenging to communicate ensemble verification. Forecast quality does not necessarily reflect value. A summary skill measure averages skill over the reforecasts and does not show how skill changes over time (windows of forecast opportunity). There is large sampling uncertainty around scores for the quantities that are of most interest to the user, e.g. regional rainfall.
Related considerations: using reforecasts to estimate skill (smaller ensemble size than real time); models are becoming more computationally expensive, which constrains reforecast size. What is the optimal reforecast configuration (number of years, start dates and ensemble size), from a sub-seasonal to seasonal forecasting perspective?

Thanks to Ian Jolliffe and Beth Ebert.

Useful general references
WMO Verification working group forecast verification web page: http://www.cawcr.gov.au/projects/verification/
Wilks, D.S., 2011: Statistical Methods in the Atmospheric Sciences. 3rd Edition. Elsevier, 676 pp.
Jolliffe, I.T., and D.B. Stephenson, 2012: Forecast Verification: A Practitioner's Guide in Atmospheric Science. 2nd Edition, Wiley and Sons Ltd.
Special issues of Meteorological Applications on Forecast Verification (Vol. 15, 2008 and Vol. 20, 2013).

Thank you. Debbie Hudson, Debbie.Hudson@bom.gov.au