Ensemble based first guess support towards a risk-based severe weather warning service


METEOROLOGICAL APPLICATIONS
Meteorol. Appl. 21: 563–577 (2014)
Published online 22 February 2013 in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/met.1377

Robert A. Neal,* Patricia Boyle, Nicholas Grahame, Kenneth Mylne and Michael Sharpe
Met Office, Exeter, UK

*Correspondence: R. A. Neal, Met Office, Fitzroy Road, Exeter EX1 3PB, UK. E-mail: robert.neal@metoffice.gov.uk

© 2013 British Crown copyright, the Met Office.

ABSTRACT: This paper describes an ensemble-based first guess support tool for severe weather, which has evolved over time to support changing requirements from the UK National Severe Weather Warning Service (NSWWS). This warning tool post-processes data from the regional component of the Met Office Global and Regional Ensemble Prediction System (MOGREPS), and is known as MOGREPS-W ('W' standing for 'warnings'). The original system produced area-based probabilistic first guess warnings for severe and extreme weather, providing forecasters with an objective basis for assessing risk and making probability statements. The NSWWS underwent significant changes in spring 2011, removing area boundaries for warnings and focusing more on a risk-based approach. Warnings now include details of both likelihood and impact, whereby the higher the likelihood and impact, the greater the risk of disruption. This paper describes these changes to the NSWWS along with the corresponding changes to MOGREPS-W, using case studies from both the original and new systems. Calibration of the original MOGREPS-W system improves the forecast accuracy of severe wind gust and rainfall warnings by reducing under-forecasting. In addition, verification of forecasts from groups of areas of different sizes shows that larger areas have better forecast accuracy than smaller areas.

KEY WORDS: probabilistic forecast verification; severe weather impact; forecast application

Received 3 April 2012; Revised 12 June 2012; Accepted 25 October 2012

1. Introduction

Severe weather warnings are increasingly becoming more risk-based as forecasters aim to communicate both the uncertainty in the forecasts and the likely levels of impact of severe weather. Adding a probabilistic component to warnings may enable them to be issued earlier by initially using low probabilities (such as for a low likelihood, high impact event) and then increasing (decreasing) them as forecast lead time reduces and confidence in the event occurring increases (reduces). Ensemble prediction systems (EPSs) provide an ideal tool for deriving these levels of uncertainty, whereby probabilistic first guess warnings of severe weather can be generated using pre-defined criteria.

An ensemble-based first guess support tool for severe weather in the short range (1–2 days) has been developed at the Met Office. This tool post-processes data from the regional component of the Met Office Global and Regional Ensemble Prediction System (MOGREPS) and is known as MOGREPS-W ('W' standing for 'warnings'). MOGREPS-W is a specific customer-orientated application with the purpose of supporting the UK National Severe Weather Warning Service (NSWWS). The Met Office provides the NSWWS as part of its public weather service (PWS) responsibilities. PWS forecasts (including warnings) have been estimated to contribute at least £260.5 million to the UK economy and society per annum (PWS Customer Group, 2007), highlighting the benefit of accurate and timely forecasts. NSWWS warnings are designed to inform both the public and national and local government authorities, with defined levels of response to different levels of warning, such as 'be aware', 'be prepared' and 'take action'. Warnings are issued for the most disruptive types of weather in the UK, including severe wind, rain, snow, fog and ice.

MOGREPS-W has evolved over time to support recent changes in customer requirements from the NSWWS. The majority of this paper focuses on the initial version of MOGREPS-W (before significant changes to the NSWWS in spring 2011). This system produced area-based probabilistic first guess warnings for severe and extreme weather, providing forecasters with a more objective basis for assessing risk and making probability statements. Section 2 describes the NSWWS in more detail. Section 3 introduces the concept of ensemble prediction systems (EPSs). Section 4 describes the setup of MOGREPS-W. The MOGREPS-W calibration methodology is described in Section 5 and results are given in Section 6. Section 7 describes changes to MOGREPS-W in response to a new impact-based framework for the NSWWS, introduced in spring 2011. Finally, a discussion and conclusion is given in Section 8.

2. History of the National Severe Weather Warning Service (NSWWS)

The NSWWS (http://www.metoffice.gov.uk/publicsector/nswws) was set up in 1988 following the Great Storm of 1987, to warn and inform the public of upcoming severe weather events. This county-based warning service consisted of Early Warnings (issued 1–5 days in advance) and Flash Warnings (issued up to 24 h in advance).

A county here refers to one of the 147 regional government areas in the UK with civil response responsibility, ranging from small unitary authorities (less than 100 km²) such as Torbay, to large counties (greater than 10 000 km²) such as the Highlands. Warnings had a probabilistic component from an early stage, with set criteria for severe and extreme events. Early Warnings for severe (extreme) weather were issued when there was a ≥60% (≥40%) chance of widespread disruption. Flash Warnings for severe (extreme) weather were issued when there was a ≥80% (≥60%) chance of widespread disruption. The term 'widespread disruption' was fairly loosely defined, with the Chief Forecaster making a subjective judgement on the disruption expected depending on the vulnerability within each county. For example, if the severe weather was forecast to occur in a densely populated urban area, then a warning was more likely to be issued than for a more remote rural area. Warnings had set threshold criteria, as set out in Table 1.

Following the Civil Contingencies Act of 2004, the Met Office introduced the role of the PWS Advisor, whose job it is to work closely with responders to help them prepare for severe and extreme weather, as well as to give advice and support when a weather warning is issued. In 2007 the PWS Customer Group (PWSCG) was set up to act as the customer for the NSWWS and ensure value for money (Goldstraw, 2012). They conduct consultation with public and professional users of the service and liaise with other government stakeholders. Soon after the PWSCG was set up, and following feedback from responders through the PWS Advisors, a change was made to the NSWWS as Advisories were added to the Early and Flash Warnings. Advisories for severe (extreme) weather were issued when there was a ≥40% (≥20%) chance of widespread disruption. This gave responders an early indication that there was a low probability of a large-scale severe or extreme weather event coming up, in order to give them more opportunity to prepare.

Following widespread flooding across parts of England and Wales in summer 2007, the resulting Pitt review into lessons learnt from these floods (Pitt, 2008), and feedback from responders, the PWSCG tasked the Met Office with reviewing the NSWWS in 2009. As part of this review the Met Office wanted to check the requirements of both the public and the responders and ensure the NSWWS met the needs of those groups. Technology had changed since the introduction of the NSWWS in 1988 and there were new services like the Flood Forecasting Centre (a joint partnership between the Environment Agency and the Met Office). Seven responder workshops were held across the UK, and there were public focus groups and surveys. Problems identified with the NSWWS included the way it was presented on the Met Office website, the technical language used and the perception that there were too many warnings. The outcome of the review was that a new impact-based warning service should be introduced (as described in Section 7.1), which was implemented operationally in spring 2011. The requirement was that language would be simple, non-technical and easy to understand, there would be a graphical representation of the risk of disruption on the Met Office website, the service would be delivered across many platforms, and there would be recovery forecasts and post-event analysis. The concept of the probability of disruption due to severe or extreme weather has been part of the NSWWS since its introduction, but has become increasingly important with the introduction of Advisories in 2007 and the emphasis on risk following the 2009 review.
This has increased the need for objective forecast data to support forecasters in issuing these probabilistic warnings. Ensemble prediction systems (EPSs) have been the ideal tool to support forecasters in this way.

Table 1. NSWWS criteria and corresponding MOGREPS-W criteria (used up until the end of March 2011). MOGREPS-W criteria are based on available model diagnostics and model limitations.

Severe 3 h rainfall
  NSWWS: Heavy rain expected to persist for at least 2 h and to give at least 15 mm within a 3 h period.
  MOGREPS-W: 15 mm precipitation in 6 h. MOGREPS-R provides precipitation accumulations at 3 h time-steps; 6 h accumulations are used to capture events which span two 3 h forecast lead times. Rainfall (as a precipitation type) is not forecast independently in model output; therefore, total precipitation is used.

Extreme 3 h rainfall
  NSWWS: Heavy rain expected to persist for at least 2 h and to give at least 40 mm within a 3 h period.
  MOGREPS-W: 40 mm precipitation in 6 h (6 h accumulations used for the same reason as above).

Severe 24 h rainfall
  NSWWS: 25 mm in 24 h.
  MOGREPS-W: 25 mm precipitation in 24 h.

Extreme 24 h rainfall
  NSWWS: 100 mm in 24 h.
  MOGREPS-W: 100 mm precipitation in 24 h.

Severe snowfall
  NSWWS: Snow falling at a rate of approximately 2 cm h⁻¹ or more and expected for at least 2 h.
  MOGREPS-W: 4 cm in 6 h. A rate of 2 cm h⁻¹ for at least 2 h would equate to 4 cm in 2 h; the closest MOGREPS can forecast to this is 4 cm in 3 h due to forecast time-step limitations. As with rainfall, 6 h accumulations are used instead of 3 h to capture events which occur across two 3 h forecast lead times.

Extreme snowfall
  NSWWS: Very heavy snowfall expected to give depths of 15 cm or more.
  MOGREPS-W: 15 cm in 6 h (6 h accumulations used for the same reason as above).

Severe wind gusts
  NSWWS: Two or more gusts of 70 mph or more at separate hours within the period of the warning.
  MOGREPS-W: One or more gusts of 70 mph or more at separate hours within the period of the warning. The 10 m gust diagnostic is used and provides the maximum wind gust over a 3 h period.

Extreme wind gusts
  NSWWS: Gusts of 80 mph or more within the period of the warning.
  MOGREPS-W: One or more gusts of 80 mph or more at separate hours within the period of the warning.

3. Background to Ensemble Prediction Systems (EPSs)

3.1. Introduction to EPSs

EPSs aim to take account of uncertainty in weather forecasts by providing a range of forecast scenarios. Forecast uncertainty is caused by a combination of initial condition uncertainty within the model analysis and model error, primarily caused by parameterizations of unresolved processes.

An EPS typically runs a numerical weather prediction (NWP) model multiple times, with each run having slightly different initial conditions and slight perturbations to model physics, to produce an ensemble of forecasts. Different EPSs use a variety of methods to perturb initial conditions and to address model error, but most systems include both of these essential elements. The ensemble can be used by forecasters in a range of ways, from the identification of most likely or worst-case scenarios through to forming the basis for a probabilistic forecast. Probabilistic forecasts potentially provide additional value over deterministic (or categorical) forecasts and aid improved decision making, as highlighted by many authors including Mylne (2002), Zhu et al. (2002) and Buizza (2008). There are many established global EPSs, including those run at the National Centers for Environmental Prediction (NCEP) (Toth and Kalnay, 1993), the European Centre for Medium-Range Weather Forecasts (ECMWF) (Buizza et al., 2007), and the Meteorological Service of Canada (MSC) (Houtekamer et al., 1996). EPSs are increasingly used for forecasting severe weather, aided by the introduction of high resolution regional ensembles and, more recently, the development of convection-permitting very high resolution ensembles. Examples include the Norwegian 4 km resolution ensemble (UMEPS), which was developed to improve forecasting of polar lows (Aspelien et al., 2011; Kristiansen et al., 2011), and the German 2.8 km ensemble (COSMO-DE-EPS) (Gebhardt et al., 2011; Peralta et al., 2012).

3.2. Met Office Global and Regional Ensemble Prediction System (MOGREPS)

The Met Office EPS, MOGREPS, was made operational in 2008 (Bowler et al., 2008). It has 23 perturbed member forecasts and 1 control member, using the Unified Model (Walters et al., 2011). The global component (MOGREPS-G) has a N216L70 resolution (approximately 60 km with 70 vertical levels) and runs out to T+72 h, initialized at 0000 UTC and 1200 UTC daily. MOGREPS-G uses the ensemble transform Kalman filter (ETKF) to generate initial condition perturbations (Bowler et al., 2009; Bowler and Mylne, 2009). A stochastic kinetic energy backscatter scheme (SKEB) and a random parameters (RP) scheme take account of model uncertainty (Tennant et al., 2010). MOGREPS-15 runs the same system out to 15 days and is available as part of the THORPEX Interactive Grand Global Ensemble (TIGGE) for research purposes (Bougeault et al., 2010; http://www.tigge.ecmwf.int/). The regional component (MOGREPS-R) has an 18 km L70 resolution and runs out to T+54 h. MOGREPS-R takes its boundary conditions from MOGREPS-G and is initialized at 0600 UTC and 1800 UTC daily. A high resolution convection-permitting version of MOGREPS over the UK is planned for implementation in 2012.

3.3. Existing EPS applications for severe weather

The Met Office first set up a probabilistic first-guess early warning (FGEW) system for severe weather in 2002 (Legg and Mylne, 2004). This system post-processes data from the ECMWF EPS to generate probabilistic area-based warnings of severe weather for 12 regions of the UK, using criteria from the NSWWS operational at that time. Using global EPS data, this system focuses on early warnings 2–5 days ahead and has assisted forecasters in issuing earlier warnings of severe events.

The ECMWF have developed severe weather applications from their own EPS.
An extreme forecast index (EFI) concept was developed which measures the difference between a forecast probability distribution and the model's climatological probability (Lalaurette, 2003). The EFI ranks the departure between the forecast and the model climate between −1 (the forecast gives a 100% probability that record low values will be reached) and +1 (record-breaking high values). By relating the forecast distribution to model climatology, the EFI provides a useful indication of upcoming severe weather even if the model does not capture the true intensity of the event due to limitations of model resolution. A disadvantage of the EFI is that it is difficult to translate the index values into probabilities of severe weather impact. More recently, the Probability of RETurn (PRET) product has been developed at the ECMWF (Prates and Buizza, 2011), which addresses this issue. PRET computes the fraction of EPS members that exceed a return period, such as once every 5 years. Return periods are estimated with a distribution fitted to the annual extremes from the model climate (the same as used in the EFI).

Forecasting the impact of severe weather on society is becoming more important. Recent developments in impact modelling have included the coupling of MOGREPS with a storm surge model (Flowerdew et al., 2009, 2012) and the testing of MOGREPS rainfall forecasts in an operational flood forecasting system (Schellekens et al., 2011).

4. The MOGREPS-W system

A probabilistic first guess warning system for severe weather (MOGREPS-W) has been running routinely at the Met Office since 2010, using criteria from the NSWWS (Table 1). The system was developed to address short-range (1–2 days ahead) warnings and complements the FGEW system already running, which is designed for the medium range. The initial version of MOGREPS-W provided probabilistic area-based warnings of severe and extreme weather for all counties in the UK. The aims of MOGREPS-W are: (1) to give forecasters early alerts of potential severe weather, providing the opportunity to increase the lead time of forecaster-issued weather warnings; and (2) to provide a more objective basis for assessing risk and making probability statements.

MOGREPS-W post-processes gridded MOGREPS-R data. These data are interpolated onto a regular rotated-pole latitude/longitude grid at a 0.1° resolution (approximately 11 km). Re-gridding to a slightly finer resolution eases interpretation for smaller counties although, of course, it does not add any information to that available from the model grid. Area probabilities were first introduced in the FGEW system (Legg and Mylne, 2004) and take into account all grid-points which lie inside a warning region. For the example given in Figure 1, only the control member and member 23 contain grid-points which exceed the warning threshold (e.g. wind gusts ≥70 mph). One method to calculate an area probability would be to calculate the individual grid-point probabilities and then take the highest probability in the county. In Figure 1 no grid-point is exceeded by more than one ensemble member, equating to a 4% area probability (1 out of 24 ensemble members). Another method would be to calculate the probability that the event will occur at any grid-point within the county area. In Figure 1, two out of 24 members exceed the parameter threshold at one or more grid-points, equating to an 8% area probability. The latter method has been chosen as it estimates the probability of the event occurring anywhere within the county area, as required by the NSWWS.
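To make the distinction between the two methods concrete, the sketch below (illustrative Python, not the operational MOGREPS-W code; the array names and shapes are assumptions) computes both quantities from a stack of per-member threshold-exceedance masks:

```python
import numpy as np

def area_probability(exceed, in_county):
    """Probability of the event occurring anywhere in the county.

    exceed    : bool array (n_members, ny, nx); True where a member
                exceeds the warning threshold at a grid-point
    in_county : bool array (ny, nx); True inside the county boundary
    """
    # A member counts towards the area probability if it exceeds the
    # threshold at one or more grid-points anywhere inside the county.
    member_hits = (exceed & in_county).any(axis=(1, 2))
    return member_hits.mean()

def max_gridpoint_probability(exceed, in_county):
    """Alternative method: highest single grid-point probability."""
    gridpoint_prob = exceed.mean(axis=0)   # fraction of members per point
    return gridpoint_prob[in_county].max()
```

For the Figure 1 situation (24 members, with only the control member and member 23 exceeding the threshold, at different grid-points), area_probability returns 2/24 ≈ 8%, whereas max_gridpoint_probability returns 1/24 ≈ 4%.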
Figure 2 compares contoured grid-point probabilities with area probabilities for a snow case in January 2010, and highlights two key points.

Figure 1. Diagrammatic explanation of MOGREPS-W area probabilities, assuming only the control member and member 23 (out of 24 members) have model grid-points which exceed the warning threshold. Grey line = county boundary; dots = model grid-points; circled dots = grid-points where the event threshold is forecast to be exceeded.

Figure 2. Comparison of MOGREPS-R grid-point probabilities (a) and MOGREPS-W area probabilities (b). Both maps are for the same forecast period. Forecast initiated at 1800 UTC on 18 January 2010 (T+42 h). Probability of 6 h snowfall ≥1 cm. The grid-point probabilities map has the ensemble mean PMSL over-plotted.

1. Area probabilities should always be as high as, and usually higher than, grid-point probabilities. In Figure 2 this is particularly evident in parts of Wales and western England, where area probabilities greater than 90% are observed. This is due to the way area probabilities are calculated, whereby different ensemble members do not have to exceed the warning threshold at exactly the same location within a county to count towards the final probability.

2. Larger counties trigger more warnings, and at higher probabilities, due to the larger number of grid-points at which the events are forecast. In Figure 2 the Highlands (the largest county in the UK) is shaded orange (40–60% probability) whereas, looking at the gridded contour plot, only the far south and east of the area has probabilities. In this situation a forecaster may issue a warning for the Highlands, but note in the descriptive text that only the far southeast of the county will be affected.

An example of the initial MOGREPS-W system is given in Figure 3. For this severe weather event, a frontal system moving in from the southwest on Wednesday 25 August 2010 led to persistent rainfall in the south of the UK. This fell initially across southern Wales and southwest England, moving into the Midlands, East Anglia and southeast England. A surface warm front and triple point led to bursts of heavy rain, most intense during the morning in southwest England and into the evening in southeast England. The first MOGREPS-W output to fully span the period of this event came from the run initiated at 0600 UTC on 24 August 2010 (Figure 3(a)). The initial forecaster-issued advisories (Figure 3(b)), issued without reference to MOGREPS-W, match well with those forecast by MOGREPS-W. As confidence of the event occurring increased, the advisories were replaced by early warnings (Figure 3(c)), which didn't stretch as far north as the original advisories.

Figure 3. Example of MOGREPS-W support for the old NSWWS for a heavy rainfall event in southern England and Wales on 25 August 2010. (a) MOGREPS-W forecast from the run initiated at 0600 UTC on 24 August 2010 showing the probability of 24 h precipitation ≥25 mm. Valid for the 24 h period up to 0900 UTC 26 August 2010 (T+51 h). (b) and (c) Evolution of forecaster-issued warnings valid 25 August 2010: (b) issued 24 August 2010; (c) issued 25 August 2010. Yellow shading indicates advisories and orange shading indicates early warnings.

5. MOGREPS-W verification and calibration methodology

MOGREPS-W precipitation and wind gust forecasts for a range of thresholds have been verified against 2 km resolution analyses. The Met Office post-processing system routinely down-scales a range of model runs (from high-resolution nowcasts to medium-range global model forecasts) onto the same 2 km grid. This process uses physical downscaling techniques to add topographically-forced detail to some fields, including wind speed, temperature and precipitation (Moseley, 2011). The same system also generates an analysis at hourly time steps. The precipitation analysis is formed from a blend of radar observations (which have been calibrated using surface observations), satellite-derived observations and downscaled components of the latest model runs. The wind gust analysis also uses the latest downscaled model runs along with observed data from surface synoptic observations (SYNOPs), ships and meteorological airfield reports (METARs).

NSWWS warnings typically focus on areas where the greatest impact of any severe weather will be felt. As a result, a height mask for grid-points above 1000 ft (305 m) has been applied to eliminate strong winds at altitude, where population densities tend to be very low. No height mask is applied to MOGREPS-R output as the model orography is on a much coarser resolution (approximately 18 km).

Caution is required when using these analysis data as the truth because they may contain more errors than traditional observations: although they have better spatial coverage, they may be less accurate. In order to avoid identifying too many spurious events due to analysis errors, an event is only considered to occur in a county if at least 1% of the analysis grid-points in the area exceed the event threshold. A proportion of grid-points is used, rather than a fixed number, due to the large range of county sizes.
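A minimal sketch of this event definition (hypothetical names; the operational implementation is not shown in the paper):

```python
import numpy as np

EVENT_FRACTION = 0.01  # at least 1% of analysis grid-points must exceed

def county_event_observed(analysis, in_county, threshold):
    """Decide whether an event is observed in a county in the 2 km analysis.

    analysis  : 2D array of the analysed field (e.g. 3 h maximum gust, mph)
    in_county : 2D bool mask of analysis grid-points inside the county
    threshold : event threshold (e.g. 40.0 for gusts >= 40 mph)
    """
    points = analysis[in_county]
    if points.size == 0:
        return False
    # A fraction, rather than a fixed number, of grid-points keeps the
    # test comparable across counties of very different sizes.
    return (points >= threshold).mean() >= EVENT_FRACTION
```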

The majority of results presented use a 9 month period (1 June 2010 to 28 February 2011) for precipitation and a 6 month period (1 September 2010 to 28 February 2011) for wind gusts. These periods span a full autumn and winter season, when typically most severe weather occurs.

Calibrating probabilistic forecasts of severe weather can be difficult due to their low frequency. As a result, thresholds lower than the full warning criteria have been used in an attempt to increase the number of forecast and observed events. For this study, 6 h precipitation accumulations with a threshold of 10 mm and 3 h maximum 10 m wind gusts of 40 mph were used. These thresholds were selected as a compromise between thresholds high enough to cause severe weather and thresholds which provide a large enough sample size.

The 3 h forecast–observation pairs from all counties are grouped together and analysed for various forecast periods (Table 2). They are grouped over the 48 h period from T+9 h (when forecasts become fully available for use) to T+54 h (the end of the MOGREPS-W forecast range) (period 0), and also into four 12 h periods (periods 1–4).

Table 2. MOGREPS-W verification forecast lead-time periods.
  Period 0 (P0): T+9 h to T+54 h (16 forecast lead times; spans P1 to P4)
  Period 1 (P1): T+9 h to T+18 h (four forecast lead times)
  Period 2 (P2): T+21 h to T+30 h (four forecast lead times)
  Period 3 (P3): T+33 h to T+42 h (four forecast lead times)
  Period 4 (P4): T+45 h to T+54 h (four forecast lead times)

MOGREPS-W 54 h forecasts are updated every 12 h and thus have validity times which overlap with the previous three forecasts. For period 0, this means that every 3 h observation is used four times against different forecasts. Splitting results down into forecast periods reduces duplication of forecast events and provides an indication of how MOGREPS-W performance varies over different forecast lead times. The total number of cases in each period consists of the sum of events and non-events as observed in the analyses. Hits (H), misses (M), false alarms (F) and correct rejections (R) are derived from the numbers of events and non-events. Forecasts are divided into 25 probability bins (0–24 ensemble members) before being verified.

A simple approach to calibration has been taken, whereby a range of forecast thresholds are verified against the same observation threshold. For example, probability forecasts for wind gusts ≥34, ≥36, ≥38, ≥40 and ≥42 mph are all verified as if they are forecasting wind gusts ≥40 mph. The idea is that a lower or higher forecast threshold may provide better guidance. As thresholds lower than warning criteria are used, the assumption is made that any calibration applied to these thresholds can be linearly scaled up to warning criteria. In making this assumption, the uncalibrated reliability diagrams for a range of thresholds were examined to see whether they have similar reliability tendencies (Figure 4(a) and (b)). Plotting observed frequency against forecast probability for a set of probability forecasts creates a reliability diagram, where reliability is indicated by the proximity of the plotted line to the diagonal (Wilks, 1995). Lines below the diagonal indicate over-forecasting (probabilities too high) and lines above the diagonal indicate under-forecasting (probabilities too low). MOGREPS-W wind gust forecasts ranging between 34 and 50 mph (Figure 4(a)) have very similar reliability tendencies, with all thresholds appearing to under-forecast. Therefore the assumption is made that any calibration applied to 40 mph gusts can be linearly scaled up to warning criteria.
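The binning behind the reliability diagrams can be sketched as follows (illustrative only; with a 24-member ensemble each forecast probability is a member count k/24, giving the 25 bins mentioned above):

```python
import numpy as np

N_MEMBERS = 24  # control + 23 perturbed members

def reliability_points(n_exceed, observed):
    """Points for a reliability diagram from member counts and outcomes.

    n_exceed : int array, members exceeding the forecast threshold (0..24)
    observed : bool array, whether the event was observed in the county
    """
    n_exceed = np.asarray(n_exceed)
    observed = np.asarray(observed, dtype=bool)
    f, o, n = [], [], []
    for k in range(N_MEMBERS + 1):        # one bin per member count
        sel = n_exceed == k
        if sel.any():
            f.append(k / N_MEMBERS)       # forecast probability of the bin
            o.append(observed[sel].mean())  # observed relative frequency
            n.append(int(sel.sum()))      # bin sample size (sharpness)
    return f, o, n

# Threshold sweep used for calibration: verify several forecast thresholds
# against a single observation threshold (here gusts >= 40 mph); the
# 'counts_by_threshold' mapping is an assumed data structure.
# for t in (34, 36, 38, 40, 42):
#     f, o, n = reliability_points(counts_by_threshold[t], obs_ge_40mph)
```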
All precipitation thresholds have a tendency to under-forecast for lower probability forecasts (where samples are high) and to over-forecast for higher probability forecasts, where samples are low (Figure 4(b)). Lower forecast thresholds have better reliability, with lines closest to the diagonal.

Figure 4. MOGREPS-W wind gust (a) and precipitation (b) reliability diagrams for all forecast periods combined (P0). Reliability for a range of uncalibrated thresholds is shown. Wind gust verification period 1 September 2010 to 28 February 2011; precipitation verification period 1 June 2010 to 28 February 2011.

Four millimetres has a reliability score of 0.0013 and 12 mm has a reliability score of 0.006, where the lower the reliability score, the better the forecast system. (The reliability score measures the divergence of the reliability curve from the diagonal, weighted by the frequency of samples at each probability, and is explained in detail later in this section.) Lower precipitation totals are better represented by MOGREPS-R, with the forcing mechanisms behind higher precipitation totals often unresolved. This may be related to a difference in forecast reliability between dynamic and convective scale precipitation (an idea explored further in Section 6.2). As with wind gusts, the assumption is made that any calibration applied to 10 mm events can be scaled up to warning criteria. This assumption is unlikely to work as well as for wind gusts, due to the differing reliability tendencies between thresholds; however, it will still provide a first guess at calibrating severe precipitation forecasts. Finally, verification of forecasts from different county size groups has been carried out to provide an indication of how county size affects forecast accuracy.

A number of standard verification scores for probability forecasts are used to understand forecast accuracy for both the calibrated and uncalibrated forecasts. The Brier score (BS; Equation (1)) measures the overall accuracy of a set of probability assessments, taking account of both reliability and resolution (Brier, 1950). For a set number of forecast and observation pairs (N), the Brier score measures the average squared deviation between the predicted probabilities (f) and their outcomes (o = 0 if the event does not happen and o = 1 if it does). The score ranges from 0 to 1, with a perfect score being 0:

BS = \frac{1}{N} \sum_{j=1}^{N} (f_j - o_j)^2    (1)

If forecasts are binned according to defined probability categories, the Brier score may be decomposed into its three components: reliability (REL; Equation (2)), resolution (RES; Equation (3)) and uncertainty (UNC; Equation (4)) (Murphy, 1973), where BS = REL − RES + UNC.

The reliability score (REL; Equation (2)) quantifies the reliability by measuring the mean vertical distance from the diagonal line in the reliability diagram, weighted by n_i (the number of samples in each bin). Here N is the total number of forecasts over all probability bins, I is the number of probability bins (25), f_i is the forecast probability in probability bin i, and o_i is the frequency with which the event is observed when forecast with f_i (o_i = events(i)/n_i). The lower the reliability score, the more reliable the forecast system (if REL = 0, the forecast is perfectly reliable). Sharpness is described by the number of samples in each bin:

REL = \frac{1}{N} \sum_{i=1}^{I} n_i (f_i - o_i)^2    (2)

Resolution relates to the difference between the conditional mean observation in each probability bin (o_i) and the overall mean observation (c) (Murphy, 1993), and thus measures the degree to which the forecasts are able to divide the observations into samples which differ from climatology. The higher the resolution score (RES; Equation (3)) the better the forecast system; however, these results need comparing with the reliability to say how good the forecasts are. In the worst case, when the climatological probability is always forecast (o_i = c), the resolution score is zero. In the best case, when the conditional probabilities are zero and one, the resolution score is equal to the uncertainty score (Equation (4)):

RES = \frac{1}{N} \sum_{i=1}^{I} n_i (o_i - c)^2    (3)

Uncertainty measures the variance of the observed frequency in the sample. The maximum uncertainty value is 0.25, which occurs when the climatological frequency (c) is 0.5. If the event occurs close to 100% (0%) of the time, such as for a very low (high) threshold event, uncertainty approaches its lowest value of 0:

UNC = c(1 − c)    (4)
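Equations (1)–(4) can be computed directly; the sketch below (an illustration under the binning described above, not the verification code used in the paper) returns the Brier score alongside its Murphy (1973) components:

```python
import numpy as np

def brier_decomposition(f, o, n_bins=25):
    """Brier score with its decomposition BS = REL - RES + UNC.

    f : array of forecast probabilities (here multiples of 1/24)
    o : array of outcomes (0 or 1)
    """
    f = np.asarray(f, dtype=float)
    o = np.asarray(o, dtype=float)
    N = f.size
    bs = np.mean((f - o) ** 2)                     # Equation (1)
    c = o.mean()                                   # sample climatology
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(f, edges[1:-1]), 0, n_bins - 1)
    rel = 0.0
    res = 0.0
    for i in range(n_bins):
        sel = idx == i
        n_i = sel.sum()
        if n_i > 0:
            f_i = f[sel].mean()                    # bin forecast probability
            o_i = o[sel].mean()                    # bin observed frequency
            rel += n_i * (f_i - o_i) ** 2          # Equation (2) term
            res += n_i * (o_i - c) ** 2            # Equation (3) term
    rel /= N
    res /= N
    unc = c * (1.0 - c)                            # Equation (4)
    return bs, rel, res, unc
```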
A Brier skill score can be calculated using a reference forecast and measures the improvement of the probabilistic forecast relative to that reference. Sample climatology (c) has been used as the reference forecast for MOGREPS-W, in which case the Brier skill score can be defined as in Equation (5):

BSS = \frac{RES - REL}{UNC}    (5)

The Relative Operating Characteristic (ROC) measures the forecast system's ability to discriminate between occurrences and non-occurrences of an event (Stanski et al., 1989). Hit rates (HR = H/(H + M)) and false alarm rates (FAR = F/(F + R)) are evaluated for a range of probability thresholds. For each probability threshold, the event is deemed to be forecast (giving a hit or a false alarm) if the forecast probability equals or exceeds the probability threshold, and not forecast (giving a miss or a correct rejection) if the forecast probability is below the threshold. The ROC curve is formed by plotting HR against FAR for the range of probability thresholds. Forecasts with no discrimination, where HR = FAR, produce a ROC curve lying along the diagonal; forecasts with skill curve above the diagonal, where HR > FAR. The ROC skill score (ROCSS) can be calculated from the area under the ROC curve (A) and ranges between −1 and 1 (ROCSS = 2A − 1). If A = 0.5 the forecast model has no skill (ROCSS = 0); if A = 1 the forecast model is perfect (ROCSS = 1).

For the ROC results presented in this paper, the finite sampling of the ensemble has led to a large proportion of the ROC area lying under the straight line from the last point on the curve to the top-right of the graph. Although this represents the true skill of the forecast system, it can cause difficulties when comparing different calibrated thresholds. ROC curve fitting has therefore been used to provide a better representation of the ROC area. The program used to fit the ROC curve is based on the methodology of Wilson (2000), whereby standard normal deviates corresponding to the empirical HRs and FARs are calculated and fitted to a straight line by the least squares method. This line is then transformed back to probability space at 100 equally spaced points for the purposes of plotting the ROC curve.

A final test of the calibration of the system is the probabilistic bias, which compares the mean forecast probability with the observed frequency of events, c.
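An empirical version of the ROC calculation, along with the probabilistic bias check, might look like the following sketch (the Wilson (2000) curve fitting is not reproduced here; function and variable names are assumptions):

```python
import numpy as np

def roc_skill_score(n_exceed, observed, n_members=24):
    """Empirical ROCSS = 2A - 1 by stepping through probability thresholds."""
    n_exceed = np.asarray(n_exceed)
    obs = np.asarray(observed, dtype=bool)
    hr = [1.0]    # trivial threshold: always forecast the event
    far = [1.0]
    for k in range(1, n_members + 1):   # forecast 'yes' if >= k members agree
        yes = n_exceed >= k
        h = np.sum(yes & obs)
        m = np.sum(~yes & obs)
        f = np.sum(yes & ~obs)
        r = np.sum(~yes & ~obs)
        hr.append(h / (h + m))          # assumes both events and non-events
        far.append(f / (f + r))         # are present in the sample
    hr.append(0.0)
    far.append(0.0)
    area = -np.trapz(hr, far)           # negate: FAR decreases along the curve
    return 2.0 * area - 1.0

def probabilistic_bias(f, observed):
    """Mean forecast probability minus the observed event frequency."""
    return float(np.mean(f)) - float(np.mean(observed))
```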

6. MOGREPS-W verification and calibration results

6.1. Wind gust calibration results

Forecasts for gusts ≥34, ≥36, ≥38, ≥40 and ≥42 mph have been verified separately, as if they are each forecasting gusts ≥40 mph (Figure 5). This calibration technique is used in an attempt to select the best threshold for forecasting gusts of this magnitude.

Figure 5. MOGREPS-W wind gust calibration results. All forecast thresholds are verified as if they are forecasting gusts ≥40 mph. Grey horizontal lines provide scores over all forecast periods combined. The black horizontal line in the mean forecast probability bar graph shows the climatological frequency of gusts ≥40 mph based on a Met Office 2 km resolution analysis. Wind gust verification period 1 September 2010 to 28 February 2011.

Forecast thresholds lower than 40 mph tend to have better reliability when forecasting gusts ≥40 mph. This is caused by MOGREPS-W under-forecasting the higher forecast thresholds, which are perhaps not captured by the resolution of the model. The dashed line (36 mph) is closest to the diagonal in the reliability diagram and also has the best reliability score over all forecast lead times, suggesting that MOGREPS-W forecasts for gusts ≥36 mph are the most reliable at forecasting gusts ≥40 mph. Reliability scores improve with forecast lead time for all thresholds, caused by increasing ensemble spread.

Sharpness is shown for the forecast threshold of 40 mph. The forecasts are shown to be sharp, with a peak of samples in the low and high probability bins. Even the middle bins have a useful number of cases, totalling >1500 per bin. The number of cases for 0% forecasts is omitted from the graph and totals hundreds of thousands.

The lower forecast thresholds have better ROCSSs when forecasting gusts ≥40 mph. Both the normal and fitted ROC curves produce similar results, with the fitted results having smaller differences between thresholds: 34 mph (dash-dot line) has the highest ROCSS and 42 mph (long-dashed line) has the lowest ROCSS.

As higher forecast thresholds are used, the model is less likely to forecast the intensity of the event and is therefore more likely to have a lower number of hits, hence lowering the HR. The FAR also drops as higher forecast thresholds are used (but to a lesser extent than the HR) due to a lower number of false alarms.

The lower forecast thresholds have the best resolution scores when forecasting gusts ≥40 mph: 34 mph (dash-dot line) has the best resolution score and 42 mph (long-dashed line) has the worst. Resolution reduces with forecast lead time, indicative of increased ensemble spread and the model converging towards climatology.

The lower forecast thresholds also tend to have better Brier scores and Brier skill scores when forecasting gusts ≥40 mph: 34 mph (dash-dot line) and 36 mph (dashed line) jointly have the best scores. All forecast thresholds have a positive Brier skill score, indicating that they are all better than using sample climatology. The mean forecast probability for forecast thresholds between 34 and 36 mph is closest to the sample climatology of 40 mph gusts. Thresholds higher than this move further away from climatology, with the mean forecast probability for 40 mph being 7% less than the climatological frequency of 16%.

In summary, all results show that MOGREPS-W under-forecasts 40 mph gusts. Taking all results into account, a forecast threshold somewhere between 34 and 36 mph may provide the best guidance for forecasting gusts ≥40 mph. Taking 35 mph as the optimal threshold and scaling the threshold difference up to warning criteria suggests that forecasts for gusts ≥60 mph should be used for severe events (gusts ≥70 mph) and forecasts for gusts ≥70 mph should be used for extreme events (gusts ≥80 mph).
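The scaling step can be expressed as a simple ratio; the sketch below is one reading of it (using the multiplicative interpretation that is also applied in Section 7.3, where warning thresholds are scaled by a factor of 0.9):

```python
def scale_to_warning_criteria(warning_threshold, optimal=35.0, verified=40.0):
    """Scale the calibrated forecast threshold up to warning criteria.

    With 35 mph found optimal for forecasting gusts >= 40 mph, the ratio
    35/40 gives ~61 mph for the 70 mph severe criterion (rounded to 60 mph
    in the text) and 70 mph for the 80 mph extreme criterion.
    """
    return warning_threshold * (optimal / verified)

# scale_to_warning_criteria(70.0) -> 61.25 (severe: use ~60 mph forecasts)
# scale_to_warning_criteria(80.0) -> 70.0  (extreme: use 70 mph forecasts)
```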
6.2. Precipitation calibration results

Similarly, forecasts for 6 h precipitation accumulations ≥4, ≥6, ≥8, ≥10 and ≥12 mm have been verified separately, as if they are each forecasting 3 h accumulations ≥10 mm (Figure 6). Note that 6 h forecast accumulations are used to predict 3 h observed accumulations in order to ensure that any 3 h accumulation is entirely captured within a forecast output period, regardless of precipitation onset time (see first row, Table 1).

The higher forecast thresholds have the better reliability scores (12 and 10 mm respectively) when forecasting 3 h accumulations ≥10 mm. Thresholds lower than this show substantial over-forecasting at higher probabilities, as seen in the reliability diagram. The reliability diagram also shows that reliability is best when the forecast probability is low (<20%) and worst when the forecast probability is high. However, the apparent greater extent of over-forecasting at higher probabilities carries little weight in the overall calculation of the reliability and resolution scores, due to the small sample sizes in the higher probability bins, as seen in the sharpness diagram. In the reliability diagram, 12 mm (long-dashed line) is closest to the diagonal, suggesting it is the most reliable; this is also evident in the reliability scores, with 12 mm having the best score. Reliability scores improve with forecast lead time for most forecast thresholds, linked to increasing ensemble spread. The higher the forecast threshold used, the better the reliability score; however, there is convergence in the reliability scores for the highest two thresholds (10 and 12 mm).

The higher forecast thresholds also have the best Brier scores and Brier skill scores (12 and 10 mm respectively) when forecasting 3 h accumulations ≥10 mm. It is important to note that only the highest two thresholds have Brier skill scores greater than zero, indicating that thresholds lower than this are worse than sample climatology.

There is a difference in the ROC results when comparing the normal ROC curves with the fitted ROC curves. The normal ROC curves show the lower forecast thresholds to have better ROCSSs: here, 4 mm forecasts (dash-dot line) have the best ROCSS and 12 mm forecasts (long-dashed line) have the lowest. However, the fitted ROC curves show very little difference between thresholds and are probably a fairer comparison given the finite sampling of the ensemble.

The lower forecast thresholds tend to have better resolution scores (6, 4 and 8 mm respectively) when forecasting 3 h accumulations ≥10 mm. This differs from the reliability and Brier score results, which show the higher forecast thresholds to have better skill. Resolution reduces with forecast lead time, indicative of increasing ensemble spread and the model converging towards climatology.

The mean forecast probability for 6 h accumulations of 8 mm is closest to the climatology of 10 mm events. Averaged across all counties in the UK there is a 3% probability that any 3 h period will experience ≥10 mm; however, the mean forecast probability for 10 mm events is only 1.5%. This suggests that MOGREPS-W under-forecasts rainfall events. The reliability diagram is misleading in this case in appearing to suggest over-forecasting, which is due to the much higher weighting of the lower probability bins.

The apparent conflict between reliability and resolution may occur due to the varying forcing mechanisms behind different precipitation events. Dynamic precipitation is well represented by MOGREPS-R, but convective precipitation is sub-gridscale and thus poorly resolved. The convective parameterization returns a grid-square average precipitation rate, whereas the observations will often record heavier precipitation over small localized areas. Convective events therefore require a much lower forecast threshold to capture them, whereas dynamic events can have a higher, and possibly more realistic, threshold. Unfortunately, MOGREPS-W needs a single threshold for both dynamic and convective precipitation, which leads to an inevitable compromise. Thus, when forecasting 3 h accumulations ≥10 mm, using 8–12 mm forecasts captures the dynamic events well (leading to better reliability overall) but misses many convective events (leading to poorer resolution overall). With 4–7 mm, MOGREPS-W over-forecasts dynamic events (leading to poorer reliability overall) but captures a few more shower events (leading to better resolution overall). Ultimately, there is limited use in having a forecast system with good resolution unless the forecasts are also reliable, so a threshold of 8–10 mm may be best at forecasting 10 mm events. A threshold of 8 mm would have been preferable given that its mean forecast probability is closest to the sample climatology; however, it also has a negative Brier skill score. It is therefore difficult to select a single best threshold, but one closer to 10 mm than 8 mm may be best. These results show the importance of considering several verification scores when undertaking calibration.

6.3. County size verification

Three different-sized groups of counties have been verified individually using uncalibrated forecasts to analyse the effects of county size on forecast accuracy. These size groupings comprise the 25% largest counties, the 25% middle-sized counties and the 25% smallest counties.
Not all counties have been used, in order to distinguish more clearly between the different county size sets.

Figure 6. MOGREPS-W precipitation calibration results. All forecast thresholds are verified as if they are forecasting 3 h accumulations ≥10 mm. Grey horizontal lines provide scores over all forecast periods combined. The black horizontal line in the mean forecast probability plot shows the climatological frequency of 3 h precipitation ≥10 mm based on a Met Office 2 km resolution analysis. Precipitation verification period 1 June 2010 to 28 February 2011.

Results for wind gusts and precipitation are similar; hence only the wind gust results are presented here (Figure 7). The reliability diagram shows that all county size groups perform broadly similarly, with some under-forecasting, but with the larger counties (solid line) having the best reliability over the majority of forecast probabilities between 20 and 80%. The sharpness diagram shows that all three size groups have a similar pattern of sampling across different forecast probabilities, but in general the larger counties (black) have a larger proportion of samples in the higher probabilities. This is to be expected since, as discussed earlier, the definition of area probabilities means one would expect, on average, higher probabilities and also higher observation frequencies in larger counties. Although the sharpness diagram shows that all county size groups have good sample sizes across all forecast probabilities, the different sample frequencies of observations mean that it is not meaningful to compare most of the verification scores directly. In particular, it is well known that the Brier score is typically smaller for rarer events, due to the smaller uncertainty term (UNC), which corresponds to the lower Brier scores for the smaller counties seen in Figure 7.

Figure 7. MOGREPS-W wind gust verification results split up according to county size. All MOGREPS-W probabilities are for gusts ≥40 mph (uncalibrated). Solid lines = 25% largest counties; dashed lines = 25% middle-sized counties; dotted lines = 25% smallest counties. Sharpness is given for all county size groups: large = black shading, medium = grey shading and small = light grey shading. Grey horizontal lines provide scores over all forecast periods combined. Wind gust verification period 1 September 2010 to 31 December 2010.

A useful comparison can be made with the Brier skill score and the ROCSS. The Brier skill score is measured relative to the skill of sample climatology, and here this shows the largest skill for the large counties and the lowest skill for the small counties. This is likely to be because larger counties have a greater margin for positional errors in the forecast, resulting in a greater likelihood of correctly forecasting the severe weather in the correct area. The ROCSS can be derived from the ROC areas. Here, the normal ROC curves show better ROC areas for the larger counties; however, the fitted ROC curves show little difference between ROC areas.

All of the small counties have an area of less than 400 km², so every county in this group contains fewer than 100 analysis grid-points. Therefore, if just one analysis grid-point exceeds the parameter threshold in one of these counties, it is enough to trigger an event at the 1% level. Unfortunately, it is unlikely that one grid-point provides enough confidence that an event has occurred. Therefore, in the small counties group it is likely that more correct rejections are mistakenly identified as missed events and more false alarms are mistakenly identified as hits. In Figure 7 this may help to explain why the small counties group has the highest observed frequency in the reliability diagram.

7. The new NSWWS and changes to MOGREPS-W

7.1. The new NSWWS

The way warnings are issued as part of the NSWWS changed in spring 2011. Warnings are now presented as shaded areas on a map rather than by county region. Three warning colours (yellow, amber or red) are issued based on the likelihood of a particular severe weather event occurring and its expected impact if it does occur, as selected from a weather impact matrix (Figure 8). In this matrix, green is used to signify weather with no significant impact on people's day-to-day activities, including all very low impact events as well as some lower likelihood, low impact events. Yellow signifies 'be aware' and stay up to date with the latest forecast; amber signifies 'be prepared' to take action. Finally, red warnings signify 'take action' to mitigate any damages and are issued very rarely. There are two types of warnings for the three colour states: alerts are issued more than 24 h ahead and warnings are issued less than 24 h ahead.

An important aspect of the impact assessment is that the same weather may have different levels of impact in different parts of the UK.

For example, wind gusts of 60 mph may have virtually no impact in northwest Scotland, where strong winds are common, but have a much higher impact in southeast England, where they are rarer and the population density is much higher. Impact may also be higher depending on season; for example, trees with their leaves still on are more vulnerable to wind damage.

Figure 8. Weather impact matrix and colour key for the new NSWWS.

7.2. Changes to MOGREPS-W

MOGREPS-W was modified during summer 2011 to adapt to the new NSWWS. The use of county area-probabilities has been replaced by grid-point probabilities to aid forecasters in drawing warning areas. MOGREPS-W now uses low, medium and high impact thresholds for each parameter (similar to the severe and extreme thresholds used in the old system). However, these impact thresholds now vary according to county, taking account of the varying levels of impact of severe weather in different parts of the UK. The capability is also available for the Chief Forecaster to vary these thresholds by season and even according to antecedent conditions (e.g. soil saturation). The MOGREPS-W probabilities associated with each likelihood level are <20% for very low likelihood events, ≥20% (and <40%) for low likelihood events, ≥40% (and <60%) for medium likelihood events and ≥60% for high likelihood events.

The new MOGREPS-W forecaster interface has a range of options available for displaying warnings. County maps of the UK are contoured according to the three colour warning states (yellow, amber and red) with the following display options; the likelihood-to-colour logic underlying them is sketched in code after the list.

An option is available to view an overall warning colour map for each parameter which is valid over all forecast time steps. These maps display the highest warning colour at each grid-point over the three impact levels (low, medium and high) and over all time steps. Here, impact colour takes priority over probability, giving emphasis where appropriate to low probability, high impact events. For example (using the weather impact matrix; Figure 8), a 25% high impact warning (amber) would take priority over a 75% low impact warning (yellow), even though the yellow warning has a higher probability. These maps are designed to give forecasters a quick overview of severe weather: if they appear clear, then the forecaster can quickly move on to other tasks. These overall warning colour maps can also be animated by forecast lead time (Figure 9(a)), allowing forecasters to determine when in the forecast range the severe weather is likely to occur.

Separate map animation options are available for low, medium and high impact weather, with the warning colour scale changing accordingly. This option allows forecasters to determine which impact level to issue the warning at. For example, the overall warning colour maps could be showing amber contouring over parts of the country; however, it is possible to derive amber from both medium and high impact warnings.

An option is available to view previous runs, allowing forecasters to check on forecast evolution and consistency. County area probabilities are still available, although emphasis is now put on the grid-point warnings, which aid forecasters in drawing the areas affected on a map. County analysis guides which authorities should receive the warnings.
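A compact sketch of the likelihood binning and the 'impact takes priority' colour selection described above (the cell colours in MATRIX are illustrative placeholders consistent with the examples in the text; the operational colours are those of Figure 8):

```python
IMPACTS = ("very low", "low", "medium", "high")
RANK = {"green": 0, "yellow": 1, "amber": 2, "red": 3}

def likelihood_level(p):
    """Map a MOGREPS-W probability to an NSWWS likelihood level."""
    if p >= 0.60:
        return "high"
    if p >= 0.40:
        return "medium"
    if p >= 0.20:
        return "low"
    return "very low"

# Rows = likelihood level, columns = impact level (order as in IMPACTS).
# Placeholder colours only: chosen so that a 25% high impact event maps
# to amber and a 75% low impact event maps to yellow, as in the text.
MATRIX = {
    "very low": ("green", "green", "green", "yellow"),
    "low":      ("green", "green", "yellow", "amber"),
    "medium":   ("green", "yellow", "yellow", "amber"),
    "high":     ("green", "yellow", "amber", "red"),
}

def warning_colour(prob_by_impact):
    """Highest-ranked colour over the impact levels at a grid-point.

    prob_by_impact : e.g. {"low": 0.75, "medium": 0.30, "high": 0.25}
    """
    colours = [MATRIX[likelihood_level(p)][IMPACTS.index(impact)]
               for impact, p in prob_by_impact.items()]
    return max(colours, key=RANK.get)

# warning_colour({"low": 0.75, "high": 0.25}) -> "amber": the low
# probability, high impact warning outranks the likelier yellow one.
```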
7.3. Case study of the new NSWWS and MOGREPS-W

Figure 9(a) shows an example of the overall warning colour map from MOGREPS-W, taken from a case study used to illustrate MOGREPS-W capabilities before the forecaster trials during winter 2011/2012. In this example MOGREPS-W has been calibrated using the results from this study, whereby all warning thresholds have been multiplied by a factor of 0.9. For example, an 80 mph medium impact warning in the Highlands of Scotland would be triggered if any MOGREPS-R member forecasts wind gusts ≥72 mph.

This was one of the first MOGREPS-W forecasts for post-tropical storm Katia, which reached the UK on 12 September 2011. The storm led to damaging winds across the populated central belt of Scotland, with maximum recorded gusts of 72 mph in Glasgow and 75 mph in Edinburgh. One driver was killed by a falling tree, and there was some structural damage to homes and buildings across the affected regions. Although forecasters were not using the new MOGREPS-W system at the time, the forecaster-issued warnings (Figure 9(b)) compare well with the MOGREPS-W output, suggesting that the system provided useful guidance. A yellow, low likelihood, medium impact warning was first issued on 9 September 2011, 4 days in advance and just before the MOGREPS-W timeframe. The warning area covered much of the UK, reflecting uncertainty in the track of Katia as shown by medium-range ensemble output from MOGREPS-15. Later updates to this warning (from 3 days in advance onwards, and within the MOGREPS-W timeframe) reduced the size of the yellow area and introduced an amber area for medium impact weather as confidence in the exact track and intensity of Katia increased. At the time of these warnings, the NSWWS criteria for issuing warnings for severe gales in the central belt of Scotland were 65 mph for low impact gusts, 70 mph for medium impact gusts and 80 mph for high impact gusts, showing that an amber medium impact warning was correctly issued. As shown in Figure 9(a), MOGREPS-W actually suggested a red warning in this run for coastal regions of Argyll and Bute (to the northwest of the central belt), showing the potential for exceptionally strong winds; however, this was correctly downgraded to amber in later runs.

During the following winter trial (2011/2012), MOGREPS-W was used to provide guidance for two red warnings for severe gales in the central belt of Scotland on 8 December 2011 and 3 January 2012, a yellow warning for snow in parts of England and Wales on 16 December 2011 and an amber warning for snow in parts of England on 4/5 February 2012.

8. Discussion and conclusion

MOGREPS-R has been used to develop a probabilistic first guess warning system for severe weather, known as MOGREPS-W. The system is a customer-focused application and is designed to meet the specific needs of the UK NSWWS. Warning criteria from the NSWWS are fed into MOGREPS-W, allowing the generation of tailored first-guess warnings.

Figure 9. The new NSWWS for post-tropical storm Katia. (a) MOGREPS-W calibrated forecast from the run initialized at 0600 UTC on 11 September 2011. Valid 2100 UTC on 12 September 2011 (T+39 h). (b) Evolution of forecaster-issued warnings, valid 12 September 2011. From left to right: (1) issued 9 September 2011; (2) issued 10 September 2011; (3) issued 11 September 2011; (4) issued 12 September 2011.

MOGREPS-W has the following aims: (1) to provide forecasters with advanced warning of upcoming severe weather, offering the opportunity to increase the lead time of publicly issued weather warnings; and (2) to provide a more objective basis for assessing risk and making probability statements.

This study has calibrated MOGREPS-W wind gust and precipitation area probability forecasts for all 147 counties and unitary authority areas in the UK, using a 6 month period for wind gusts (1 September 2010 to 28 February 2011) and a 9 month period for precipitation (1 June 2010 to 28 February 2011). This period captures a full autumn and winter season, which is typically when most weather warnings are issued. Calibration has been carried out using a Met Office 2 km resolution analysis. This gridded data set is beneficial over the use of weather station data as it provides 100% coverage of the whole UK.

A simple approach to calibration has been taken, whereby a range of forecast thresholds is examined to see which provides the best verification when forecasting a particular magnitude of event. Providing reliable verification results for severe weather is difficult due to its low frequency; therefore, thresholds lower than warning criteria have been used in an attempt to increase the verification sample size. The calibration difference is then linearly scaled up to warning criteria, assuming that model characteristics are similar for all forecast thresholds. This assumption was made after examining the reliability characteristics for a range of hazard thresholds, which showed similarities, especially for the different wind gust thresholds. This is only a first attempt at calibrating the system and future updates will use longer data periods and, hopefully, higher forecast thresholds. Although calibration has been shown to improve forecast skill for the period used in this paper, verification should eventually be carried out on an independent data set after the calibration has been applied to the system.

Results show that both wind gust and precipitation forecasts have skill, with wind gust forecasts having better forecast skill than precipitation forecasts. Area probability forecasts for wind