The COLA Anomaly Coupled Model: Ensemble ENSO Prediction


BEN P. KIRTMAN

George Mason University, Fairfax, Virginia, and Center for Ocean-Land-Atmosphere Studies, Calverton, Maryland

(Manuscript received 12 November 2002, in final form 10 March 2003)

Corresponding author address: Dr. Ben P. Kirtman, Center for Ocean-Land-Atmosphere Studies, 4041 Powder Mill Road, Calverton, MD 20705. E-mail: kirtman@cola.iges.org

ABSTRACT

Results are described from a large sample of coupled ocean-atmosphere retrospective forecasts during 1980-99. The prediction system includes a global anomaly coupled general circulation model and a state-of-the-art ocean data assimilation system. The retrospective forecasts are initialized each January, April, July, and October of each year, and ensembles of six forecasts are run for each initial month, yielding a total of 480 1-yr predictions. In generating the ensemble members, perturbations are added to the atmospheric initial state only. The skill of the prediction system is analyzed from both a deterministic and a probabilistic perspective. The probabilistic approach is used to quantify the uncertainty in any given forecast. The deterministic measures of skill for eastern tropical Pacific SST anomalies (SSTAs) suggest that the ensemble mean forecasts are useful up to lead times of 7-9 months. At somewhat shorter leads, the forecasts capture some aspects of the variability in the tropical Indian and Atlantic Oceans. The ensemble mean precipitation anomaly has disappointingly low correlation with observed rainfall. The probabilistic measures of skill (relative operating characteristics) indicate that the distribution of the ensemble provides useful forecast information that could not easily be gleaned from the ensemble mean. In particular, the prediction system has more skill at forecasting cold ENSO events than warm events. Despite the fact that the ensemble mean rainfall is not well correlated with the observed rainfall, the ensemble distribution does indicate significant regions where there is useful information in the forecast ensemble. In fact, it is possible to detect that droughts over land are more predictable than floods. It is argued that probabilistic verification is an important complement to any deterministic verification and provides a useful and quantitative way to measure uncertainty.

1. Introduction

Motivated by the pioneering work of Cane et al. (1986), a number of El Niño-Southern Oscillation (ENSO) prediction systems have been developed. These prediction systems include both statistical and dynamical techniques of varying degrees of sophistication in terms of their scientific approaches, numerical techniques, and use of observational data. Latif et al. (1998) and Kirtman et al. (2002b) review how well these methodologies have performed in retrospective forecast mode and offer considerable optimism for predicting sea surface temperature anomalies (SSTA) in the tropical eastern Pacific. However, this optimism must be tempered by the experience of real prediction, which indicates that the state of the art in ENSO prediction has not realized its full potential. This is particularly the case with the dynamical prediction systems, one of which is the focus of this paper. Goddard et al. (2001) also provide an excellent summary of a number of different seasonal-to-interannual prediction systems.

The reason for the tempered optimism is based on the assessment of Barnston et al.
(1999) and Landsea and Knaff (2000), who examined the performance of many different prediction systems during the 1997/99 ENSO episode. Arguably, there were substantial qualitative forecasting successes. For example, almost all the models predicted that the boreal winter of 1997/98 would be a warm event one to two seasons in advance. Despite these successes, there have also been some striking quantitative failures. For instance, none of the models captured the early onset of the event or its amplitude, and many of the forecast systems had difficulty capturing the demise of the warm event and the development of cold anomalies that persisted through 2001. More recently, many models have failed to predict the three consecutive years (1999-2001) of relatively cold conditions and the development of warm anomalies in the central Pacific during the boreal summer of 2002.

There are two important messages in these successes and failures. First, the fidelity of climate prediction systems needs to be continually assessed and improved, both in retrospective forecast mode and with real-time forecasts. Although this paper focuses on retrospective forecasts, the importance of examining real-time forecasts should not be underestimated. Second, there is considerable uncertainty in ENSO predictions even at relatively short lead times. This second point is due to the chaotic or irregular aspect of climate variability, and, because of this, forecasts must necessarily include some quantitative assessment of this uncertainty.

In fact, there have been some attempts at predicting the uncertainty in deterministic forecasts. For instance, using an intermediate coupled model, Moore and Kleeman (1998) found a strong relationship between forecast skill and ensemble spread, provided the ensemble perturbations are appropriately chosen. When using random perturbations, no relationship between skill and spread could be found. Kleeman and Moore (1999) also found, in the same intermediate coupled model, that the amplitude of the least damped or slowest growing normal mode has a strong relationship to forecast skill.

In this paper, we argue for a different approach to presenting quantitative information regarding forecast uncertainty. Specifically, we assert that the forecast should include probabilistic information and that the forecast skill and uncertainty should be assessed in probabilistic terms. The potential utility of probabilistic climate forecasts is primarily based on end-user-specific decision-model analysis (Palmer et al. 2000). While this is generally applied to terrestrial variables, such as land surface temperature or precipitation anomalies, the potential utility argument is also valid for SSTA. Moreover, the results presented here indicate that the probabilistic forecast information and skill assessment for SSTA and precipitation anomalies provide additional information that could not be gleaned from a purely deterministic approach.

Despite the fact that climate forecasts should include probabilistic information, much of the verification that appears in the refereed literature emphasizes the deterministic approach. This is particularly the case for dynamical coupled ocean-atmosphere predictions of SSTA and is largely due to historical precedent. For example, Cane et al. (1986), using a simple intermediate coupled model, were the first to present and verify deterministic ENSO hindcasts. Since this initial work there have been a number of ENSO forecasting activities using intermediate coupled models (Kang and Kug 2000; Kirtman and Zebiak 1997; Kleeman et al. 1995; Chen et al. 1995; Balmaseda et al. 1995; Kleeman 1993) and hybrid coupled models (Barnett et al. 1993; Syu et al. 1995; Tang and Hsieh 2002). In contrast to the intermediate model strategy, in which both component models are simplified, the hybrid model approach combines one sophisticated component with one simplified component. All of these approaches show considerable skill at lead times of 6-9 months using deterministic measures. There are no published probabilistic measures of skill for the hybrid or intermediate coupled models. There are also a number of prediction efforts using coupled general circulation models with various methods of data assimilation, initialization, and coupling strategies (Schneider et al. 1999, 2001; Wang et al. 2002; Ji et al. 1994, 1996, 1998; Stockdale et al. 1998; Kirtman et al. 1997; Rosati et al. 1997; Ji and Leetmaa 1997; Stockdale 1997; Leetmaa and Ji 1989). There is also a coupled general circulation model prediction activity at the National Aeronautics and Space Administration (M. Rienecker 2002, personal communication).
Several of these prediction systems make regular contributions to the Experimental Long-Lead Forecast Bulletin (Kirtman 2002), and, while the forecasts are often presented in some probabilistic format (e.g., Niño-3.4 forecast plumes), almost universally the skill is evaluated using only deterministic measures (see online at http://www.ecmwf.int/products/forecasts/seasonal).[1] While measures such as the Niño-3.4 correlation and root-mean-square error are useful, particularly for model developers, they are of limited value to the end-user community and fail to provide much of the global detail of the prediction skill. It should be noted that deterministic verification has an important role to play in assessing the fidelity of forecast models. However, we are suggesting that, given the recognition that ENSO forecasts should be probabilistic, verification must also include a probabilistic assessment of skill. In fact, probabilistic forecasts are an important complement to the deterministic forecast.

This paper provides both a deterministic and a probabilistic assessment of the skill of a large number of retrospective forecasts using a recently updated version of the Center for Ocean-Land-Atmosphere Studies (COLA) anomaly coupled general circulation model (GCM; Kirtman et al. 2002a). The paper is outlined as follows. Section 2 briefly describes the COLA anomaly coupled model. The design of the retrospective forecast experiments and the initial conditions are discussed in section 3. It is important to note that the ensembles are generated by perturbing the atmospheric states only. With this strategy, we are assuming a perfect ocean initialization, which is an obvious limitation. This means that we may be underestimating the forecast spread and uncertainty; nevertheless, this approach does provide a quantitative assessment of how uncertainty in the atmospheric initial states affects forecast spread and skill. Moreover, the probabilistic verification strategy identifies whether the prediction system is overconfident and thus compensates for some of the limitations in our ensemble strategy. In this context, an overconfident forecast occurs when most of the ensemble members agree but the verification consistently indicates otherwise. Section 4 presents the deterministic and probabilistic skill assessments for the predicted sea surface temperature anomalies, heat content anomalies, and precipitation anomalies. We also provide quantitative estimates of the relative skill of warm events versus cold events, and of droughts versus floods. Concluding remarks are given in section 5.

[1] Online, the European Centre for Medium-Range Weather Forecasts (ECMWF) provides both probabilistic and deterministic verification of their coupled predictions for SSTA and several atmospheric variables. However, these skill assessments have not appeared in the refereed literature.

2. Models and coupling strategies

The anomaly coupled GCM (ACGCM) combines the COLA atmospheric GCM and the Geophysical Fluid Dynamics Laboratory (GFDL) Modular Ocean Model (MOM), version 3.0. Brief descriptions of these models and the coupling procedures are given below. How well the model performs in long climate simulations is described in detail in Kirtman et al. (2002a) and Kirtman and Shukla (2002).

a. Atmospheric model

A number of changes to the atmospheric model have been made since the original coupled models were developed. The dynamic core used in the National Center for Atmospheric Research (NCAR) Community Climate Model (CCM) version 3.0 has been adopted (Schneider 2002). The core is spectral (triangular truncation at total wavenumber 42) with semi-Lagrangian transport. There are 18 unevenly spaced σ-coordinate vertical levels. The parameterization of solar radiation is after Briegleb (1992), and terrestrial radiation follows Harshvardhan et al. (1987). The deep convection is an implementation of the relaxed Arakawa-Schubert (RAS) scheme of Moorthi and Suarez (1992) described by DeWitt (1996). The convective cloud fraction follows the scheme used by the NCAR CCM (Kiehl et al. 1994; see DeWitt and Schneider 1996 for additional details). The model includes a turbulent closure scheme for the subgrid-scale exchange of heat, momentum, and moisture after Miyakoda and Sirutis (1977) and Mellor and Yamada (1982). Additional details regarding the AGCM physics can be found in Kinter et al. (1988) and DeWitt (1996). Model documentation is given in Kinter et al. (1997).

b. Ocean model

The ocean model is version 3 of the GFDL MOM (Pacanowski and Griffies 1998), a finite-difference treatment of the primitive equations of motion using the Boussinesq and hydrostatic approximations in spherical coordinates. The domain is that of the World Ocean between 74°S and 65°N. The coastline and bottom topography are realistic, except that ocean depths less than 100 m are set to 100 m and the maximum depth is set to 6000 m. The artificial high-latitude meridional boundaries are impermeable and insulating. The zonal resolution is 1.5°. The meridional grid spacing is 0.5° between 10°S and 10°N, gradually increasing to 1.5° at 30°N and 30°S, and fixed at 1.5° in the extratropics. There are 25 levels in the vertical, with 17 levels in the upper 450 m. The vertical mixing scheme is the nonlocal K-profile parameterization of Large et al. (1994). The horizontal mixing of tracers and momentum is Laplacian. The momentum mixing uses the space-time-dependent scheme of Smagorinsky (1963), and the tracer mixing uses Redi (1982) diffusion along with Gent and McWilliams (1990) quasi-adiabatic stirring.

c. Coupling strategy

The anomaly coupling strategy is described in detail in Kirtman et al. (1997) and Kirtman et al. (2002a). The main idea is that the ocean and atmosphere exchange predicted anomalies, which are computed relative to their own model climatologies, while the climatology upon which the anomalies are superimposed is specified from observations. The anomaly coupling strategy requires atmospheric model climatologies of momentum, heat, and freshwater flux, and an ocean model SST climatology. Similarly, observed climatologies of momentum, heat, and freshwater flux and of SST are also required. The model climatologies are defined by separate uncoupled extended simulations of the ocean and atmospheric models.
In the case of the atmosphere, the model climatology is computed from a 30-yr (1961-90) integration with specified observed SST and sea ice. This SST is also used to define the observed SST climatology. In the case of the ocean model SST climatology, an extended uncoupled ocean model simulation is made using 30 years of 1000-mb National Centers for Environmental Prediction (NCEP) reanalysis winds. The NCEP winds are converted to a wind stress following Trenberth et al. (1990). As with the SST, this observed wind stress product is used to define the observed momentum flux climatology. The heat flux and the freshwater flux in this ocean-only simulation are parameterized by damping the SST and sea surface salinity to observed conditions with a 100-day timescale. The observed heat and freshwater flux climatologies are then calculated from the results of this extended ocean-only simulation. The ocean and atmosphere models exchange daily mean fluxes and SST once a day.
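The exchange itself is simple arithmetic. The following minimal sketch in Python (hypothetical variable names and made-up numbers; an illustration of the idea, not the actual COLA coupler code) shows how a flux passed from one component to the other would be assembled from the sending model's anomaly and the corresponding observed climatology.

    import numpy as np

    def anomaly_couple(model_field, model_clim, obs_clim):
        # The models exchange anomalies computed relative to their own
        # climatologies, superimposed on an observed climatology.
        anomaly = model_field - model_clim
        return obs_clim + anomaly

    # Made-up numbers: a zonal wind stress (N m^-2) passed to the ocean.
    tau_model = np.array([0.052, 0.047])       # current model momentum flux
    tau_model_clim = np.array([0.050, 0.050])  # model climatology, this calendar month
    tau_obs_clim = np.array([0.060, 0.055])    # observed climatology, this month

    print(anomaly_couple(tau_model, tau_model_clim, tau_obs_clim))

The same operation, with the roles reversed, applies to the SST returned to the atmosphere.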

3. Retrospective forecast experiments

In order to assess the potential predictive skill of the coupled model, a large sample of retrospective forecast experiments has been made and compared to available observations. The retrospective forecasts, or hindcasts, cover the period 1980-99. A 12-month hindcast is initialized each January, April, July, and October during this 20-yr period. For each initial month, an ensemble of six hindcasts is run, yielding a total of 480 retrospective forecasts to be verified. The ocean and atmosphere initial states and the method of generating the ensemble members are described below. The definition of forecast lead time is also given below.

The ocean initial conditions were taken from a 1980-99 ocean data assimilation produced at GFDL using variational optimal interpolation (Derber and Rosati 1989). The GFDL ocean initial states were generated using a somewhat higher-resolution ocean model. However, the physics and parameter settings were identical to those used in the forecast experiments. In the forecast experiments presented here, the ocean initial states were interpolated to the lower resolution. Based on a series of fully coupled hindcast experiments with the current ocean resolution and with the higher-resolution ocean model, we can say that increasing the ocean resolution has little impact on the forecast skill (Schneider et al. 2001).

The hindcast ensembles are generated by atmospheric perturbations only, and no attempt has been made to find optimal perturbations. The ocean initial state for each ensemble member is identical. This limitation in our ensemble strategy is due to the fact that the ocean data assimilation was performed externally, and we acknowledge that with this approach we may be underestimating the uncertainty in any individual forecast. However, we are also employing a probabilistic verification strategy that identifies whether the prediction system is overconfident and can be used to quantitatively interpret any individual forecast. From a probabilistic perspective, we may also be underestimating the skill of the hindcasts by not including the uncertainty in the ocean initial states.

The atmospheric initial states are taken from an extended atmosphere-only simulation with prescribed observed SST. The atmospheric ensemble members were obtained by resetting the model calendar back one week and integrating the model forward one week with prescribed observed SST. In this way, it is possible to generate an unlimited sample of initial conditions that are synoptically independent (separated by one week) but have the same initial date. This procedure was also used by Kirtman et al. (2001) to generate a 100-member ensemble for atmospheric seasonal prediction experiments.

Throughout the ENSO prediction literature there is some confusion regarding the appropriate definition of forecast lead time. For the purposes of this paper, forecast lead time is defined by the following example. Suppose a forecast is labeled as being initialized in January 1982. This means that the forecast was actually initialized at 0000 UTC 1 January 1982. The first monthly mean (i.e., the average of 1-31 January 1982) of the forecast is defined as the 1-month lead. Similarly, the second monthly mean (i.e., 1-28 February 1982) is defined as the 2-month lead. The remaining lead times are defined analogously.
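Because this convention is easy to mishandle, a short sketch may help; the function below is purely illustrative (not part of the prediction system) and maps an initialization date and a lead to the calendar month being verified.

    from datetime import date

    def lead_month(init: date, lead: int):
        # Return the (year, month) verified by the `lead`-month mean of a
        # forecast initialized at 0000 UTC on the first of init's month.
        # Lead 1 is the monthly mean of the initial month itself.
        total = init.month - 1 + (lead - 1)
        return init.year + total // 12, total % 12 + 1

    init = date(1982, 1, 1)
    for lead in (1, 2, 12):
        print(lead, lead_month(init, lead))  # -> (1982, 1), (1982, 2), (1982, 12)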
4. Results and skill assessment

The skill of the hindcasts was examined from both a deterministic and a probabilistic perspective. In all the results presented here, the systematic error of the hindcast has been removed by subtracting the average of all the forecasts for that particular initial month; in this way, the estimated systematic error is a function of lead time. This same approach was used by Kirtman et al. (1997). We begin by examining the skill of the forecasts from the deterministic perspective.
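The correction is compact enough to state in code. The sketch below assumes a hypothetical array layout (start year x initial month x member x lead) for the 480 hindcasts, with random numbers standing in for the actual model output.

    import numpy as np

    # Hypothetical layout: 20 start years x 4 initial months x 6 members
    # x 12 monthly leads, filled with random numbers as a stand-in.
    rng = np.random.default_rng(0)
    hindcast = rng.standard_normal((20, 4, 6, 12))

    # Systematic error: the mean over all years and members, computed
    # separately for each initial month, and hence a function of lead time.
    systematic = hindcast.mean(axis=(0, 2), keepdims=True)  # shape (1, 4, 1, 12)
    hindcast_anomaly = hindcast - systematic
    print(hindcast_anomaly.shape)  # (20, 4, 6, 12)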

a. Hindcast evolution

Figure 1 shows all 480 hindcasts (dashed lines) of Niño-3.4 SSTA compared to the observations (taken from the ocean data assimilation) for lead times of 3, 6, and 9 months. The labeling along the x axis corresponds to the initial condition time, not the verification time. As a consequence, the verification curve on each successive panel shifts by 3 months. This is necessary so that forecasts and verification are directly comparable.

FIG. 1. Time series of observed Niño-3.4 SSTA and predicted SSTA: (top) lead time of 3 months, (middle) lead time of 6 months, and (bottom) lead time of 9 months. The solid line corresponds to the observations and the dotted lines correspond to the hindcasts. The labeling along the x axis corresponds to forecast initialization time, not verification time, so that the observations are shifted by 3 months on each successive panel. The shifting of the observations allows for direct comparison of the forecasts and the observations.

Overall, there appears to be some useful information at each of the three lead times, and there do not appear to be any major misfires in terms of predictions of erroneous warm or cold events. At 3 months lead time, the hindcasts agree with the observations quite well. There is, however, a tendency for the 3-month hindcasts to overpredict the strength of the cold events, for example, during 1984/85 and 1988/89. By 6 months lead time, there is a noticeable deterioration in the agreement between the hindcasts and the observations. There is a strong suggestion that the hindcasts are underpredicting the strength of the warm events during the peak phase of the event and are tending to persist the events about 3 months longer than observed. This can be seen at the end of both the 1982/83 and the 1986/87 warm events. The extension of the warm events can also be seen as a delay in the onset of cold events. At 9 months lead, the observed warm and cold events can be detected in the hindcasts, but the amplitudes are very weak. As will be shown in section 4b, the correlation coefficient is above 0.5 at 9 months, but the rmse is near saturation. Substantial spread in the forecasts can also be detected at 9 months lead.

A global perspective of the forecast evolution is presented in Figs. 2 and 3, which show the SSTA, heat content anomaly (i.e., vertically averaged temperature anomaly in the upper 400 m), and precipitation anomaly for the hindcasts and the observations. For the SSTA and heat content anomaly, we use the results from the ocean data assimilation system for verification, and for precipitation we use the Xie and Arkin (1996) rainfall estimates. Here we show the hindcasts for December 1982 (Fig. 2) and December 1988 (Fig. 3) at lead times of 3, 6, and 9 months. Each plot shows the ensemble average of six hindcasts, so that a total of 36 retrospective forecasts are used in making Figs. 2 and 3.

For both December 1982 and December 1988, the ENSO signal is captured in all three fields at all lead times. Not surprisingly, the hindcast fidelity decreases with increasing lead time. In the December 1982 case, there is a clear tendency for the SSTA to decay too rapidly just to the east of the date line. Consistent errors are also apparent in the heat content anomaly and can be seen in the rainfall anomaly as excessive signals in the eastern Pacific. This error can also, although to a lesser degree, be detected in the 1988 case.

The erroneous eastward migration of the anomalies discussed above is somewhat surprising given the canonical ENSO behavior of this model. Kirtman et al. (2002a) examined a 200-yr simulation with this model and found that the ENSO events typically extended too far into the western Pacific in terms of SSTA, heat content anomaly, and rainfall anomaly. Although it is not shown here, we believe that the erroneous eastward migration is a signature of initialization shock.

The hindcast evolution in the Indian Ocean is also interesting. In both sets of hindcasts, there are clear heat content anomaly signals along 10°S that can be detected at all lead times. These are relatively strong heat content anomalies that offer some prospects for seasonal-to-interannual prediction in the Indian Ocean. This is despite the fact that the associated SSTA are extremely weak, at best, and that these hindcasts show no associated rainfall signal.

Overall, outside the tropical Pacific there is little agreement between the hindcast rainfall anomalies and the observed anomalies. A generous interpretation of Figs. 2 and 3 would be that there is some useful information in the ensemble mean at short lead time (3 months) over tropical South America and equatorial Africa. The potential usefulness of the rainfall hindcasts is discussed in more detail below and in the section on probabilistic verification.

FIG. 2. Dec 1982 hindcast and observed (a)-(d) SSTA, (e)-(h) heat content anomaly, and (i)-(l) precipitation anomaly at lead times of 3 (Oct 1982 initial condition), 6 (Jul 1982 initial condition), and 9 (Apr 1982 initial condition) months. The shading in the SSTA and precipitation anomaly hindcasts corresponds to the percentage of ensemble members predicting in either the upper or lower tercile. If the ensemble mean is positive (negative), the shading indicates the percentage of ensemble members predicting the upper (lower) tercile. The contour interval for SSTA is 0.5°C, for vertically averaged temperature is 0.3°C, and for precipitation is 2 mm day⁻¹.


A typical method for presenting probabilistic forecasts is to indicate the number of ensemble members (or the percentage of ensemble members) predicting a particular event. For instance, the shaded regions in Fig. 2a show the percentage of ensemble members predicting either warm or cold SSTA for December 1982 at 3 months lead time. The shading corresponds to the sign of the SSTA, so that if the ensemble mean (contours) is positive, the shading indicates the fraction of ensemble members in the upper tercile. Similarly, if the ensemble mean is negative, the shading indicates the fraction of ensemble members predicting SSTA in the lower tercile. The shading is a measure of the consistency among the ensemble members and, based on the verification shown later, provides a mechanism for measuring the risk associated with taking action based on the forecast.

For the 3-month lead December 1982 hindcast (Fig. 2a), the ensemble members were consistent in the tropical eastern Pacific, with six out of six predicting SSTA in the upper tercile. In the Indian and south tropical Atlantic Oceans there are broad regions where at least four out of six ensemble members predicted relatively warm SSTA. The ensemble members are consistent in predicting cold temperatures in large regions of the tropical Atlantic. At this lead time, the consistency in the subtropical Pacific is noticeably smaller than in the Tropics.

The consistency in the rainfall hindcast is smaller than for SSTA, but there is strong agreement among the ensemble members in much of the tropical Pacific and Atlantic. This agreement statistic is useful for identifying signals that appear to be strong but are not robust among the ensemble members. In particular, the ensemble mean rainfall in the Bay of Bengal and the Arabian Sea appears to be fairly strong and positive; however, there is only modest agreement among the ensemble members. Conversely, there are weak negative rainfall signals in tropical South America that would be easy to dismiss except for the fact that all the ensemble members are giving similar predictions. Although the signal is weak, the hindcasts consistently predict below-normal rainfall over lower Central America at all lead times, which was also observed during December 1982. As will be shown later, this is also a region where the model has skill at predicting below-normal rainfall.

For these hindcast examples, the consistency in the SSTA in the tropical Pacific is maintained for all lead times. In the December 1982 case, the consistency in the Indian Ocean is lost fairly rapidly, whereas in the December 1988 case the hindcasts are in better agreement at longer leads.
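The consistency statistic behind this shading is a simple count. In the sketch below, the tercile bounds are illustrative; in practice they would come from the hindcast climatology at each grid point.

    import numpy as np

    def consistency(members, lower, upper):
        # Fraction of members in the upper (lower) tercile when the
        # ensemble mean anomaly is positive (negative).
        members = np.asarray(members, dtype=float)
        if members.mean() >= 0.0:
            return np.mean(members > upper)
        return np.mean(members < lower)

    # Six members at one grid point, with illustrative tercile bounds.
    sst_members = [0.9, 1.2, 0.7, 1.1, 0.4, 1.0]
    print(consistency(sst_members, lower=-0.43, upper=0.43))  # 5 of 6 -> 0.83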
FIG. 3. Dec 1988 hindcast and observed (a)-(d) SSTA, (e)-(h) heat content anomaly, and (i)-(l) precipitation anomaly at lead times of 3 (Oct 1988 initial condition), 6 (Jul 1988 initial condition), and 9 (Apr 1988 initial condition) months. The shading in the SSTA and precipitation anomaly hindcasts corresponds to the percentage of ensemble members predicting in either the upper or lower tercile. If the ensemble mean is positive (negative), the shading indicates the percentage of ensemble members predicting the upper (lower) tercile. The contour interval for SSTA is 0.5°C, for vertically averaged temperature is 0.3°C, and for precipitation is 2 mm day⁻¹.

b. Deterministic skill assessment

Typically, the most common skill assessment is to calculate the Niño-3.4 (or Niño-3) anomaly correlation and rmse. These are often referred to as the Niño-3.4 skill scores. Figure 4 shows the skill scores for Niño-3.4 in a slightly different format. The skill scores for the ensemble mean hindcast are depicted by the filled circles. The shaded region indicates the range (plus or minus one standard deviation) of these skill scores calculated from 10 000 possible one-member ensembles that were randomly chosen.

FIG. 4. Niño-3.4 skill scores: (a) anomaly correlation coefficient and (b) rmse. The skill of the persistence forecast is denoted by the solid blue line, and the ensemble mean skill is indicated by the red line with filled circles. The green shaded region indicates the spread (plus or minus one standard deviation) for 10 000 randomly chosen one-member ensembles.

In other words, this is the skill score for 10 000 possible combinations of hindcasts where there is one hindcast for each initial month. This is a small subsample when one considers the fact that there are 6^80 possible one-member ensembles. A subsample of 5000 shows similar spread in the correlation, indicating that this is a robust estimate of the uncertainty in the correlation.

The skill scores for a persistence hindcast are also shown in Fig. 4. In terms of the correlation coefficient, both the ensemble mean and the one-member ensembles significantly beat a persistence hindcast for all lead times. The ensemble mean correlation coefficient remains above 0.6 for lead times up to 9 months and, for a one-member ensemble, the correlation coefficient remains above 0.6 for lead times up to 7-9 months. In the first 2 months, the rmse for persistence, the ensemble mean, and the one-member ensembles are indistinguishable from each other. Beyond 3 months lead time, the CGCM hindcasts (ensemble mean or one-member) beat persistence. At a lead time of 7 months, the persistence hindcast rmse becomes larger than the climatological hindcast rmse (0.88)[2], indicating that, for lead times greater than 7 months, persistence is a particularly poor measure of minimum skill.

The shaded region in Fig. 4 is plotted for two reasons. First, it provides a conservative estimate of the range of uncertainty in these skill scores. We have not provided error estimates for the ensemble mean skill scores, but suggest that the error bar would be smaller than that for the one-member ensemble. Second, the shaded region also indicates the increase in skill that is expected by using an ensemble of hindcasts produced by a single model. In this case, the increase in skill comes from averaging across an ensemble of six members, and for both the correlation coefficient and the rmse, the ensemble average is about one standard deviation better than the expected value for a one-member ensemble. This improvement in skill is less than one would expect from a multimodel ensemble (Kirtman et al. 2002b; Schneider et al. 2003) and possibly suggests a limitation in the ensemble strategy.

It should be noted that calculating these skill scores using all initial months masks their strong seasonality. In particular, forecasts initialized in the boreal winter tend to have the lowest 6-month lead skill, whereas forecasts initialized in boreal summer have the highest 6-month lead skill score. This is often referred to as the spring prediction barrier, in that the forecasts initialized in boreal winter have a precipitous drop in skill during the boreal spring. While it is clear that this remains a problem with many prediction systems (e.g., Kirtman et al. 2002b), it is an open question whether the spring prediction barrier is a model error question or is associated with a fundamental limit in ENSO predictability. The model presented here also suffers from a spring prediction barrier: the 6-month lead forecasts initialized in January have a correlation coefficient of 0.58, compared to the forecasts initialized in July, which have a correlation coefficient of 0.91.

[2] The climatological hindcast is to predict a zero anomaly, so that the climatological rmse is the standard deviation of the observed SSTA.
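For concreteness, a minimal sketch of the two skill scores and of the one-member subsampling behind the shaded range in Fig. 4; the synthetic data are a stand-in for the actual hindcasts.

    import numpy as np

    rng = np.random.default_rng(1)
    n_starts, n_members = 80, 6              # 80 initial months, 6 members each
    obs = rng.standard_normal(n_starts)      # observed Nino-3.4 SSTA at one lead
    ens = obs[:, None] + rng.standard_normal((n_starts, n_members))

    def acc(f, o):                           # anomaly correlation coefficient
        return np.corrcoef(f, o)[0, 1]

    def rmse(f, o):
        return float(np.sqrt(np.mean((f - o) ** 2)))

    ens_mean = ens.mean(axis=1)
    print("ensemble mean:", acc(ens_mean, obs), rmse(ens_mean, obs))

    # Spread of skill across 10 000 randomly chosen one-member ensembles
    # (one member per initial month), as in the shaded region of Fig. 4.
    picks = rng.integers(0, n_members, size=(10_000, n_starts))
    single = ens[np.arange(n_starts), picks]
    scores = np.array([acc(s, obs) for s in single])
    print("one-member ACC: %.2f +/- %.2f" % (scores.mean(), scores.std()))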
The global distribution of the CGCM and persistence SSTA skill scores at a lead time of 6 months is shown in Fig. 5. The hindcast correlation suggests skill in the eastern tropical Pacific, the western tropical Indian Ocean, and a small region in the tropical eastern Atlantic. There is also some skill in the subtropical western Pacific and in small regions of the extratropics. The regions of relatively high correlation in the Indian and Atlantic Oceans suggest potential seasonal-to-interannual predictability that is, as of yet, untapped. The hindcasts have larger correlation coefficients than persistence in most regions; the notable exception is near the date line in the tropical Pacific. Not surprisingly, the rmse is largest in regions with the largest variability, such as the western boundary currents and the tropical eastern Pacific. The rmse is generally smaller for the hindcasts than for persistence.

Figure 6 shows the same hindcast and persistence statistics as Fig. 5, except that the field being verified is the heat content anomaly. As with the SSTA, the hindcast heat content is convincingly more skillful than the persistence forecasts. There is some correspondence between regions of relatively high heat content skill and high SSTA skill, for example, in the tropical eastern Pacific and western Atlantic, the subtropical western Pacific, and the tropical Indian Ocean. There are notable minima in the correlation in the central Pacific and Atlantic. These are regions near nodal points for the heat content (see Figs. 2-3), and relatively low correlations are to be expected. At 6 months lead, the heat content rmse is generally smaller for the hindcast than for persistence. Again, there is a center of relatively large persistence rmse in the Indian Ocean along 10°S where the hindcast has low rmse, indicating some potential predictability.

The 6-month lead precipitation hindcast correlation and rmse are shown in Fig. 7. Neither the hindcasts nor persistence show much skill at 6 months lead. The hindcasts have very modest skill in the eastern Pacific and some hints of skill over Indonesia and northeast Brazil. A persistence forecast has almost no areas where the correlation exceeds 0.4. The situation is somewhat more encouraging for the rmse: the hindcasts have smaller (by approximately 25%) rmse almost everywhere. Figure 7 provides a fairly bleak outlook for a precipitation anomaly forecast using this one-tiered prediction system. However, as shown in the next section, this picture is considerably more optimistic when verifying a probabilistic forecast.

c. Probabilistic skill assessment

There are a number of possible ways of verifying probabilistic forecasts. Here we have chosen the method commonly referred to as relative operating characteristics (ROC; Mason and Graham 1999).

FIG. 5. Global SSTA 6-month lead skill scores: (a) hindcast anomaly correlation coefficient, (b) hindcast rmse, (c) persistence anomaly correlation coefficient, and (d) persistence rmse. The contour interval for the correlation is 0.1 and for the rmse is 0.2°C. Only correlations greater than 0.4 are plotted in (a) and (c).

The method uses ratios that measure the fraction of events and nonevents for which warnings are issued. It compares the fraction of events that were forewarned (i.e., the hit rate) with the fraction of nonevents that occurred after a warning was issued (i.e., the false alarm rate). The ratios are determined from contingency tables, and the events are predefined and expressed in binary terms. Given an ensemble of hindcasts, an ROC curve showing the different combinations of hit and false alarm rates for different forecast probabilities can be constructed. In other words, the ROC curve indicates what the hit rate and false alarm rate are when, for example, four out of six ensemble members forecast an event to occur. The ROC curve is useful for identifying optimum strategies for issuing warnings by indicating the trade-off between false alarms and misses. It is for this reason that the ROC curve is most useful in the forecast applications community (S. Mason 2001, personal communication).

Details and examples of the ROC calculation can be found in Mason and Graham (1999). Here we summarize the key elements. The ROC calculation is based on hit rates and false alarm rates that are calculated from a 2 x 2 contingency table. Table 1, for example, shows a standard 2 x 2 contingency table, where O1 is the number of correct forecasts (or hits) of the event, O2 is the number of misses, NO1 is the number of false alarms, and NO2 is the number of correct rejections.

TABLE 1. Generalized contingency table.

                                Observations
    Forecasts           Occurrences    Nonoccurrences
    Occurrences             O1              NO1
    Nonoccurrences          O2              NO2

The hit rate (HR) is then given by HR = O1/(O1 + O2), and the false alarm rate (FAR) is given by FAR = NO1/(NO1 + NO2). An HR of one means that all occurrences of the event were correctly predicted, and an HR of zero indicates that none of the events were correctly predicted. The FAR also ranges from zero to one, with a value of zero indicating that no false alarms were issued.

The standard 2 x 2 contingency table can be generalized for probabilistic ensemble forecasts by issuing a warning when the number of ensemble members forecasting an event exceeds some threshold. The 2 x 2 contingency table can then be recalculated for each threshold up to the total number of ensemble members, which in this case is six. Similarly, the HR and FAR are generalized for probabilistic ensemble forecasts. The ROC curve can then be formulated by plotting the HR versus the FAR for each threshold.
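In code, the generalized contingency table reduces to counting warnings at each threshold. The sketch below builds an ROC curve for an upper-tercile event from synthetic six-member hindcasts, and also computes the area under the curve with the trapezoidal rule, anticipating the ROC score introduced below; the data and the tercile threshold are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(2)
    n_cases, n_members = 400, 6
    obs = rng.standard_normal(n_cases)
    ens = 0.6 * obs[:, None] + 0.8 * rng.standard_normal((n_cases, n_members))

    upper = np.quantile(obs, 2 / 3)          # upper-tercile threshold
    event = obs > upper                      # observed warm events
    n_warn = (ens > upper).sum(axis=1)       # members forecasting the event

    hr, far = [1.0], [1.0]                   # the (1, 1) corner: always warn
    for k in range(1, n_members + 1):        # warn when >= k members agree
        warn = n_warn >= k
        o1 = np.sum(warn & event)            # hits
        o2 = np.sum(~warn & event)           # misses
        no1 = np.sum(warn & ~event)          # false alarms
        no2 = np.sum(~warn & ~event)         # correct rejections
        hr.append(o1 / (o1 + o2))            # HR = O1/(O1 + O2)
        far.append(no1 / (no1 + no2))        # FAR = NO1/(NO1 + NO2)
    hr.append(0.0); far.append(0.0)          # the (0, 0) corner: never warn

    # Area under the curve by the trapezoidal rule, over increasing FAR.
    x, y = np.array(far[::-1]), np.array(hr[::-1])
    roc_score = np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x))
    print("ROC score:", round(float(roc_score), 3))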

FIG. 6. Tropical vertically averaged (0-400 m) temperature anomaly 6-month lead skill scores: (a) hindcast anomaly correlation coefficient, (b) persistence anomaly correlation coefficient, (c) hindcast rmse, and (d) persistence rmse. The contour interval for the correlation is 0.1 and for the rmse is 0.2°C.

ROC curves for the Niño-3.4 hindcasts at lead times of 3, 6, and 12 months are shown in Fig. 8. In calculating the ROC curves, the corresponding contingency tables have been aggregated over all model grid points in the Niño-3.4 region. There is a contingency table for each grid box, and the aggregation consists of summing the respective elements of the contingency tables for all grid boxes in the Niño-3.4 region. For all three lead times there are three curves; equivalently, we have considered three different events: (i) warm events (upper tercile), (ii) cold events (lower tercile), and (iii) near normal (middle tercile), where both the retrospective forecasts and the observations have been normalized by their local standard deviation. This ability to easily verify the hindcast skill of warm events and cold events separately is one of the advantages of the ROC calculation.

An ideal probabilistic forecast system would have relatively large hit rates and small false alarm rates, so that all the points on the ROC curve would cluster in the upper-left corner of the diagram. For a relatively poor forecast system, all the points of the ROC curve would lie very close to the dashed diagonal line, indicating that the hit rate and the false alarm rate were nearly the same (i.e., no skill). The six interior points on the ROC curve indicate the number of ensemble members forecasting a particular event.

FIG. 7. Global precipitation 6-month lead skill scores: (a) hindcast anomaly correlation coefficient, (b) hindcast rmse, (c) persistence anomaly correlation coefficient, and (d) persistence rmse. The contour interval for the correlation is 0.1 and for the rmse is 1 mm day⁻¹.

Progressing from the lower-left corner to the upper-right corner, the first interior point on the curve (i.e., the first point that does not lie on the lower-left corner) corresponds to six out of six ensemble members forecasting the event. This point indicates how skillful the forecast system is when the model consistently forecasts a given event to occur. Not surprisingly, the hit rates are modest, but if the false alarm rate is also low, then a confident forecast is very useful. The second point indicates that five out of six ensemble members forecasted the event, and the remaining points along the curve vary similarly. The last interior point corresponds to only one ensemble member forecasting the event, and the hit rates tend to be high. Unfortunately, the false alarm rates are also high, so that there is considerable risk in taking action based on only one ensemble member forecasting an event.

For lead times of both 3 and 6 months (top and middle panels of Fig. 8), both warm and cold events are fairly well predicted. The false alarm rates are low and the hit rates are relatively high when the agreement among the ensemble members is relatively large. The model is not particularly overconfident, and thus there are no serious limitations associated with the ensemble strategy. In this case, overconfidence would be apparent as large false alarm rates for high forecast agreement. Even when there is only modest agreement among the ensemble members, the hit rates are significantly larger than the false alarm rates. For a near-normal forecast, the 3-month lead has some skill, although less than for the extremes, whereas for both 6- and 12-month leads the ROC curve lies close to the diagonal, indicating little skill.

At 12 months lead time, there is a considerable drop in skill. High-confidence forecasts for warm events are only marginally better than those for near-normal conditions, suggesting that a confident forecast for a warm event at 12 months lead time is not particularly useful. However, confident forecasts for cold events appear to be more skillful at longer lead times. This also appears to be the case with the earlier version of the anomaly coupled model (Kirtman et al. 1997) in real forecast situations (e.g., Barnston et al. 1999).

In order to assess the uncertainty in the ROC curves, we have applied two approaches. To facilitate comparisons between the two approaches, we have used only five-member ensembles. The reason for this reduced ensemble size is that one of the approaches is a Monte Carlo technique that randomly selects ensemble members. The Monte Carlo approach is to repeat the ROC curve calculation a large number of times, where, for each initial month, we randomly select a five-member ensemble from the sample of six.

FIG. 8. Niño-3.4 ROC curves for a lead time of (top) 3 months, (middle) 6 months, and (bottom) 12 months. Warm events (upper tercile) are denoted in red, near-normal conditions (middle tercile) are denoted in green, and cold events (lower tercile) are denoted in blue.

The HR and FAR corresponding to a probability threshold of 80% (i.e., four out of five ensemble members forecasting temperatures in the upper tercile) at a lead time of 6 months are shown in the bottom panel of Fig. 9. This calculation suggests that the uncertainty in the HR and FAR is rather modest, with the HR having slightly more uncertainty than the FAR. The top panel of Fig. 9 indicates the uncertainty in the ROC curves due to temporal sampling. Here we have plotted the ROC curve for warm events during the 1980s, the 1990s, and the entire period. Compared to the ensemble member subsampling, the temporal subsampling shows considerably more uncertainty. The hindcasts during the 1980s have more skill than the hindcasts from the 1990s. This has been the experience with a number of prediction systems (e.g., Ji et al. 1996) and, according to several idealized studies, should be expected (Thompson and Battisti 2001; Kirtman and Schopf 1998; Flügel and Chang 1996; Balmaseda et al. 1995).

FIG. 9. Niño-3.4 ROC uncertainty estimates. (top) The ROC curves for warm events only, calculated separately for the 1980s (red), the 1990s (green), and the complete record (1980-99). Only five ensemble members have been used in the calculation. (bottom) The hit rate (green) and false alarm rate (red) for a probability threshold of 80% for warm events (i.e., four out of five ensemble members). In this case, 1000 randomly chosen ensembles of five forecasts were used to calculate the uncertainty in the hit rate and false alarm rate.

To provide a global perspective, the ROC score has been calculated. The ROC score is the area under the ROC curve. A perfect forecast system has an area of one, and a curve lying on the diagonal (no information) has an area of 0.5. In order to calculate the area under the ROC curve, the trapezoidal rule has been applied. There is some sensitivity due to the numerical integration scheme; however, we compared the results with ensemble sizes of both five and six members and found only small quantitative differences.

The SSTA ROC scores for the upper and lower terciles are shown in Figs. 10 and 11, respectively. The three panels of each figure correspond to lead times of 6, 9, and 12 months, and the shaded regions correspond to ROC scores above 0.6. At a lead time of 6 months, for both above and below normal temperatures, the ensemble hindcasts have useful information throughout much of the world oceans. Regions where the hindcasts have no information include the tropical western Pacific and much of the Atlantic Ocean north of 20°N. The warm SSTA and cold SSTA appear to be equally well predicted, but there is an indication that hindcasts of cold temperatures have somewhat more skill in the central Pacific at longer leads. This is in agreement with the aggregated Niño-3.4 results shown in Fig. 8. For the most part, the regions of high ROC scores agree with the correlation coefficient map shown in Fig. 5. However, the regions of relatively high ROC scores are larger, and there appears to be useful information at longer lead times.

FIG. 10. Upper tercile SSTA ROC score (see text for details) for lead times of (top) 6, (middle) 9, and (bottom) 12 months. The contour interval is 0.1.

FIG. 11. Lower tercile SSTA ROC score (see text for details) for lead times of (top) 6, (middle) 9, and (bottom) 12 months. The contour interval is 0.1.

While it is a difficult target, one of the main purposes of using a CGCM to make seasonal-to-interannual forecasts is to provide predictions of rainfall. Examples of rainfall forecasts that include probabilistically based uncertainty estimates were shown in Figs. 2-3. Here we examine how a probabilistic hindcast performs over the entire sample of cases and apply the skill assessment to interpret the uncertainty estimates. Figures 12 and 13 show the upper (wet conditions) and lower (dry conditions) tercile rainfall ROC scores, respectively. As with the SSTA, only ROC scores exceeding 0.6 are plotted. For relatively short lead times, both wet and dry hindcasts have significant regions of skill over the tropical Pacific and Atlantic Oceans. The skill drops noticeably in the Pacific as the lead time increases. Encouragingly, there are regions over land that show skill. For example, the ROC score over much of South Africa is relatively high. There are also indications of useful information over northern Australia, coastal North and South America, and parts of Eurasia. This assessment of skill is considerably more optimistic than that shown in Fig. 7.

Although the regions of skill are largely the same, hindcasts for drought verify better than hindcasts for flood. Perhaps this is to be expected, since droughts tend to be more large scale and persistent (J. Shukla 2002, personal communication). This asymmetry can be easily seen in Fig. 14, where we have plotted the HR minus the FAR when at least four out of six ensemble members predict the event to occur. The shaded values indicate regions where the HR is greater than the FAR by at least 0.1. As expected, when there is strong agreement (at least 67%) among the ensemble members, excessive rainfall is well predicted in the tropical eastern Pacific. Over land, there are only limited coastal areas where consistent forecasts of flood verify. In terms of drought, there are many more coherent land areas that have skill. These areas include much of Australia, Central America, and South America, and smaller regions of Africa and Eurasia. There is also apparent skill along the west coast of North America for both droughts and floods that was not detected in the correlation coefficient. It is interesting to note that droughts over India are reasonably well forecast, but floods are not well predicted. There is a region in the eastern Pacific where the hindcasts show no skill.
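The statistic mapped in Fig. 14 can be reproduced from the same contingency counts; the following sketch, again with synthetic stand-in data, evaluates HR minus FAR for the four-of-six warning rule.

    import numpy as np

    def hr_minus_far(warn, event):
        # HR - FAR for one fixed warning rule, here at least four of six
        # members predicting the event, as mapped in Fig. 14.
        hits = np.sum(warn & event)
        misses = np.sum(~warn & event)
        false_alarms = np.sum(warn & ~event)
        rejections = np.sum(~warn & ~event)
        return hits / (hits + misses) - false_alarms / (false_alarms + rejections)

    rng = np.random.default_rng(3)
    obs = rng.standard_normal(200)
    ens = 0.5 * obs[:, None] + rng.standard_normal((200, 6))
    thr = np.quantile(obs, 1 / 3)            # lower tercile: a drought-type event
    warn = (ens < thr).sum(axis=1) >= 4      # at least four of six members agree
    print(hr_minus_far(warn, obs < thr))     # shaded in Fig. 14 where > 0.1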

FIG. 12. Upper tercile precipitation anomaly ROC score (see text for details) for lead times of (top) 6, (middle) 9, and (bottom) 12 months. The contour interval is 0.1. The shading corresponds to values greater than 0.6.

FIG. 13. Lower tercile precipitation anomaly ROC score (see text for details) for lead times of (top) 6, (middle) 9, and (bottom) 12 months. The contour interval is 0.1. The shading corresponds to values greater than 0.6.

Figure 14 is also useful for interpreting the hindcasts presented in Figs. 2 and 3. For example, the 3-month lead precipitation hindcast for December 1982 had small but consistent drought signals over parts of South America. The bottom panel of Fig. 14 indicates that this is a region of retrospective forecast skill, suggesting that, despite the weak signal, the hindcasts are producing useful information. Conversely, there is a positive rainfall signal just off the coast of Africa in the equatorial Indian Ocean where the hindcasts are consistent but where there is little indication of skill. Such signals can be either disregarded or accepted based on the results shown in Figs. 12-14.

5. Discussion and concluding remarks

A large number of retrospective coupled ocean-atmosphere GCM forecasts for the period 1980-99 were presented here. The skill of these hindcasts was examined from both the deterministic and probabilistic perspectives. As measured by the correlation coefficient and rmse, the ensemble mean hindcast of SSTA is skillful for lead times up to 7-9 months throughout much of the tropical eastern Pacific, with some regions of skill in the tropical Atlantic and Indian Oceans. The Niño-3.4 anomaly correlation coefficient and rmse indicate that the prediction system is competitive with any of the prediction systems that regularly publish in the Experimental Long-Lead Forecast Bulletin [see Table 2 of Kirtman et al. (2002b) for comparisons with other prediction systems]. The improvement in skill due to ensemble averaging is modest, and this may be due to a limitation in the strategy employed to generate the ensembles. It is also interesting that the ensemble average does not improve the skill as much as a multimodel ensemble (e.g., Kirtman et al. 2002b; Schneider et al. 2003). This suggests that ensemble averaging of one model cannot compensate for systematic errors in the variability of the model, which has important implications for designing the best operational real-time seasonal-to-interannual prediction system. The heat content hindcasts also have considerable deterministic skill in much of the tropical oceans. The ensemble mean rainfall hindcasts, however, have only marginal skill that is primarily limited to a very narrow region in the eastern Pacific.

The probabilistic assessment of skill was performed using an ROC analysis that gives a different and more