Attribution of observed historical near-surface temperature variations to anthropogenic and natural causes using CMIP5 simulations

JOURNAL OF GEOPHYSICAL RESEARCH: ATMOSPHERES, VOL. 118, 4001-4024, doi:10.1002/jgrd.50239, 2013

Gareth S. Jones,1 Peter A. Stott,1 and Nikolaos Christidis1

1 Met Office, Exeter, UK.
Corresponding author: G. S. Jones, Met Office, FitzRoy Rd., Exeter, UK. (gareth.s.jones@metoffice.gov.uk)

Received 19 July 2012; revised 7 January 2013; accepted 3 February 2013; published 1 May 2013.

[1] We have carried out an investigation into the causes of changes in near-surface temperatures from 1860 to 2010. We analyze the HadCRUT4 observational data set, which has the most comprehensive set of adjustments available to date for systematic biases in sea surface temperatures, and the CMIP5 ensemble of coupled models, which represents the most sophisticated multi-model climate modeling exercise yet carried out. Simulations that incorporate both anthropogenic and natural factors span changes in observed temperatures between 1860 and 2010, while simulations of natural factors do not warm as much as observed. As a result of sampling a much wider range of structural modeling uncertainty, we find a wider spread of historic temperature changes in CMIP5 than was simulated by the previous multi-model ensemble, CMIP3. However, calculations of attributable temperature trends based on optimal detection support previous conclusions that human-induced greenhouse gases dominate observed global warming since the mid-20th century. With a much wider exploration of model uncertainty than previously carried out, we find that individually the models give a wide range of possible counteracting cooling from the direct and indirect effects of aerosols and other non-greenhouse gas anthropogenic forcings. Analyzing the multi-model mean over 1951-2010 (focusing on the most robust result), we estimate a range of possible contributions to the observed warming of approximately 0.6 K from greenhouse gases of between 0.6 and 1.2 K, balanced by a counteracting cooling from other anthropogenic forcings of between 0 and -0.5 K.

Citation: Jones, G. S., P. A. Stott, and N. Christidis (2013), Attribution of observed historical near-surface temperature variations to anthropogenic and natural causes using CMIP5 simulations, J. Geophys. Res. Atmos., 118, 4001-4024, doi:10.1002/jgrd.50239.

Additional supporting information may be found in the online version of this article.
©2013. Crown copyright. This article is published with the permission of the Controller of HMSO and the Queen's Printer for Scotland. 2169-897X/13/10.1002/jgrd.50239

1. Introduction

[2] Successive reports of the Intergovernmental Panel on Climate Change (IPCC) have come to increasingly confident assessments of the dominant role of human-induced greenhouse gas emissions in causing increasing global near-surface temperatures [e.g., IPCC, 2007a]. Furthermore, analyses of changes across the climate system, including of sub-surface ocean temperatures, of the water cycle, and of the cryosphere, show that as the observational evidence accumulates, there is an increasingly remote possibility that climate change is dominated by natural rather than anthropogenic factors [Stott et al., 2010]. Thus, the IPCC concluded in its fourth assessment report that most of the observed increase in global average temperatures since the mid-20th century is very likely due to the observed increase in anthropogenic greenhouse gas concentrations [IPCC, 2007a].

[3] However, despite the high level of confidence in the conclusion that greenhouse gases caused a substantial part of the observed warming, there remain many uncertainties that have so far limited the potential to be more precise about such attribution statements.
These uncertainties are associated with observational uncertainties caused by remaining biases in data sets and gaps in global coverage [Morice et al., 1] and modeling uncertainties, which limit the ability to define the expected fingerprints of change due to anthropogenic and natural factors, and which result from errors in model formulation, deficiencies in model resolution, and inadequacies in the way external climate forcings are specified. All attribution results are contingent on such remaining uncertainties and until now they have been explored in a relatively limited way. Many previous attribution studies have been limited to a single observational data set and a single climate model [e.g., Tett et al., ] or a rather limited ensemble of different climate models [e.g., Hegerl et al., ; Gillett et al., ; Huntingford et al., 6] while another study used simple climate models to emulate global mean temperatures from over a dozen models [Stone et al., 7].

Table 1. Model Experiment Definitions as Used in the CMIP5 Archive [Taylor et al., 2012], With Equivalent CMIP3 Terms [Meehl et al., 2007a]

Experiment Name | Description | CMIP3 Name
picontrol | Pre-industrial control simulation | picntrl
historical | 20th century (1850-2005) forced by anthropogenic and natural factors | 20c3m
historicalnat | 20th century (1850-2005) forced by only natural factors | N.A.
historicalghg | 20th century (1850-2005) forced by only greenhouse gas factors | N.A.
historicalext | Extension to historical experiment from 2005 onward | A1B
rcp45 | 21st century (2005-2100) forced by RCP4.5 factors | A1B

[4] There have also been studies that have used simple time series methods to determine contributions of forcings to observed global temperatures [Lockwood, 2008; Lean and Rind, 2008; Kaufmann et al., 2011; Foster and Rahmstorf, 2011]. While many such studies are broadly consistent with studies using coupled climate models in finding a dominant role for greenhouse gases in explaining recent warming, the advantage of using climate models rather than simple statistical relationships with forcings is that coupled atmosphere-ocean general circulation models (AOGCMs) attempt to simulate all the most important physical processes in the climate system that lead to a model's response to a particular forcing. Responses are therefore emergent from the model and not imposed upon it [Hegerl and Zwiers, 2011]. Climate models also produce temperature variations over space as well as time, and this study analyzes these space-time variations.

[5] The climate science community has increasingly come to realize the importance of testing the robustness of results to observational uncertainty. One study [Hegerl et al., 2001] deduced that one type of observational uncertainty, grid box sampling error, had little impact on the attribution of anthropogenic influence on temperature trends.
A more recent study [Jones and Stott, 2011] analyzed four global temperature data sets (GISS, NCDC, JMA, and HadCRUT3; see section 3) together with one climate model and concluded that the choice of observational data set had little impact on the attribution of greenhouse gas warming. Therefore, the conclusion that greenhouse gases were the dominant contributor to global warming over the last 50 years of the 20th century is robust to that observational uncertainty.

[6] Since that study, there have been two important developments. First, a new analysis of global temperatures, HadCRUT4, has been released that includes a much more thorough investigation of systematic biases in sea surface temperature measurements than carried out previously [Morice et al., 2012]. While an overall global warming trend is still seen, the detailed nature of the time series has changed, particularly in the middle part of the 20th century [Thompson et al., 2008]. It is therefore worth exploring how this change affects attribution results. Second, results from the CMIP5 experiment have become available, the most complete exploration of climate model uncertainty in simulating the last 150 years ever undertaken [Taylor et al., 2012]. The new multi-model ensemble of opportunity includes a new generation of climate models with more sophisticated treatments of forcings, including aerosols and land use changes. As part of the experimental design, it also includes ensembles of simulations with both anthropogenic and natural forcings, as well as alternative ensembles which include just natural forcings, and ensembles which include just changes in well-mixed greenhouse gases. Therefore, the CMIP5 ensemble provides a much more thorough and up-to-date exploration of modeling uncertainty than was available from the CMIP3 ensemble [Meehl et al., 2007a].
[7] In this paper, therefore, we have the opportunity to undertake the widest exploration yet of the effects of modeling uncertainty when applied to the most up-to-date observational estimates of data until 2010. In section 2 we describe the CMIP model simulations, in section 3 we discuss the observational data sets we use, in section 4 we compare results from the CMIP3 and CMIP5 model ensembles, and in section 5 we compare the CMIP models with observations. In section 6 we carry out optimal detection analyses and describe the results using standard techniques that have been widely applied to near-surface temperature data and other climate data over the last 10 years [Tett et al., ; Gillett et al., ; Nagashima et al., 2006; Zhang et al., 2007]. In section 7 we provide conclusions.

2. Climate Model Intercomparison Project

[8] The World Climate Research Programme's Coupled Model Intercomparison Project phase 3 (CMIP3) and phase 5 (CMIP5) are arguably the largest collaborative effort for bringing together climate model data for access by climate scientists across the world. The CMIP3 repository [Meehl et al., 2007a] was a major contributor of model data used in studies assessed by the IPCC's fourth assessment report [IPCC, 2007b], while the CMIP5 repository [Taylor et al., 2012] will be used in many studies to be assessed by the IPCC's fifth assessment report, due to be published in 2013. Over different institutions and groups have used over 6 different climate models to produce simulations for dozens of experiments to contribute data to both CMIP archives. All the climate models examined in this study are atmosphere-ocean general circulation models (AOGCMs), where the ocean and atmosphere components are coupled together, covering a wide range of resolutions and sophistication of physical modeling.

[9] The different experiments represent sets of simulations that had different scenarios of forcing factors applied [Taylor et al., 2012].
The basic experiments examined in this study are picontrol, historical, historicalnat, and historicalghg (Table 1). The picontrol experiment is a long simulation with no variations in external forcings; factors such as greenhouse gas concentrations are held at pre-industrial settings. The other experiments are parallel to the picontrol but with different forcing factor variations applied, and are initialized from different times in the picontrol. It should be noted that there are differences between the names of the experiments in CMIP5 and CMIP3. For instance, the historical experiment was called 20C3M in CMIP3 [Meehl et al., 2007a]. The 20C3M CMIP3 experiments comprised not only simulations driven by anthropogenic and natural forcing factors but also simulations driven by anthropogenic forcing factors only [Stone et al., 2007]. To be consistent with the CMIP5 historical experiments, we use only those 20C3M CMIP3 simulations that were driven by both anthropogenic and natural forcing factors. There are no historicalnat and historicalghg type simulations in the CMIP3 archive. In Figure 9.5 of Hegerl et al. [2007], historicalnat simulations are presented. The authors of Hegerl et al. [2007] retrieved the historicalnat simulations from the institutions concerned (Dáithí Stone, personal communication). We have tried to collect the same data [Supplementary Materials in Hegerl et al., 2007] from the institutions and will call them CMIP3 simulations for simplicity's sake. Additional CMIP5 experiments used are historicalext and rcp45 (Table 1), both extensions to the historical experiments [Taylor et al., 2012]. The 20C3M CMIP3 experiments are extended if an appropriate A1B simulation is available [Meehl et al., 2007a]. These experiments extend the historical simulations by taking their initial conditions from the end of the equivalent historical experiment, in the case of CMIP5 in 2005. To avoid the use of confusingly different terminology, we will generally use the CMIP5 terms (Table 1).

Table 2. Institutions That Supplied Model Data to CMIP3 and CMIP5 Repositories Used in This Study a

Key | Institution
a | Atmosphere and Ocean Research Institute (The University of Tokyo), Japan
b | Beijing Climate Center, China Meteorological Administration, China
c | Bureau of Meteorology, Australia
d | Canadian Centre for Climate Modelling and Analysis, Canada
e | Centre National Recherches Météorologiques, Météo-France, France
f | Centre Europeen de Recherches et de Formation Avancee en Calcul Scientifique, France
g | Commonwealth Scientific and Industrial Research Organisation, Marine and Atmospheric Research, Australia
h | Institute for Numerical Mathematics, Russia
i | Institute of Atmospheric Physics, Chinese Academy of Sciences
j | Institut Pierre Simon Laplace, France
k | Japan Agency for Marine-Earth Science and Technology, Japan
l | Max Planck Institute for Meteorology, Germany
m | Met Office Hadley Centre, UK
n | Meteorological Institute of the University of Bonn, Germany
o | Meteorological Research Institute, Japan
p | Meteorological Research Institute of KMA, Korea
q | NASA Goddard Institute for Space Studies, USA
r | National Center for Atmospheric Research, USA
s | National Institute for Environmental Studies, Japan
t | NOAA Geophysical Fluid Dynamics Laboratory, USA
u | Norwegian Climate Centre, Norway
v | Queensland Climate Change Centre of Excellence, Australia
w | Technology and National Institute for Environmental Studies, Japan
x | Tsinghua University, China

a The key is used in Tables 3 and 4.

2.1. Models

[10] A list of the different institutions involved in producing data for CMIP3 and CMIP5 is given in Table 2. Tables 3 and 4 list the models from CMIP3 and CMIP5 used in this study, indicating the institutions involved and describing which experiments were made, how many initial condition ensembles were produced, and the length of the pre-industrial control simulations that were available.

[11] For a given model experiment, up to 10 initial condition ensembles were produced. These took their initial conditions from the model's picontrol, with the period between samples varying greatly from model to model. We do not consider the impact of different periods between initial conditions, although some studies have tried to minimize the impact of any sampling bias by selecting initial conditions according to the ocean's state [e.g., Jones et al., 2011a].

[12] The CMIP3 and CMIP5 simulations, a multi-model ensemble (MME), are often called an ensemble of opportunity [Allen and Ingram, ]. Ideally models should sample uncertainties (physical modeling, forcing, and internal variability) as widely as possible [Collins et al., 1].
In practice, an ensemble of opportunity, like CMIP3 and CMIP5, would not methodically sample the full range of possible uncertainties. For instance, the models are not independent [Jun et al., 8; Abramowitz and Gupta, 8; Masson and Knutti, 11], with many sharing common components and algorithms [Knutti et al., 1], which can be seen in climate responses sharing common patterns [Masson and Knutti, 11]. As a result, the effective number of independent models in such an ensemble of opportunity is less than the actual number [Pennell and Reichler, 11]. Ensembles of models in which parameters in physics schemes have been perturbed (so-called perturbed physics ensembles) have been used to sample model uncertainty in a more methodical manner [Murphy et al., ; Stainforth et al., 5] although they do not sample structural model uncertainty, i.e., uncertainty due to processes not incorporated in a particular model. Similarly the CMIP5 models do not systematically sample forcing uncertainties [Taylor et al., 1]. [13] Bearing in mind the caveat that there are limitations to the statistical interpretation of ensembles of opportunity [Knutti, 1; Pennell and Reichler, 11; von Storch and Zwiers, 13], we use the CMIP3 and CMIP5 MME spread to examine the confidence of any agreement between the ensembles and the observations [Taylor et al., 1]. In this study, we generally treat each model as being an equal member of the MME. While such one model one vote [Knutti, 1] methods may underestimate the uncertainty in the model spread, it is arguably the simplest approach to use [Weigel et al., 1]. In any event, it is difficult to justify any particular weighting scheme based on a model s base climate since a measure of skill of the CMIP5 models to represent mean climate bears little relation to the skill of the models in simulating the observed trend [Jun et al., 8]. 
[1] While the CMIP3 archive was about 36 TB in size, the CMIP5 archive is estimated to be.5 3 PB in or even larger in size [Overpeck et al., 11; Taylor et al., 1]. In this study, because we are interested in near-surface temperatures only, we examine only a tiny fraction of the total CMIP archive; monthly means of one climate variable on a single level for up to 1 initial condition ensemble members for each of seven experiments from 6 models. In total, more than 66 thousand model years equating to about 65 GB of storage are analyzed in this study. [15] We have endeavored to retrieve the latest versions of the data available in the CMIP5 archives up to 1 March 1 in order to allow time to make the analysis and write up the results. Due to CMIP5 s limited version control, it has been a non-trivial task to keep track of data set changes. Thus,

Table 3. CMIP3 Models Used in This Study, Their Institutions ( Inst., See Table ), the Number of Initial Condition Ensemble Members for Each Experiment, and the Number of Years Available From Each picontrol a Model Inst. Number of Ensemble picontrol Length Members historical historicalnat cccma_cgcm3.1 d - - 1 cccma_cgcm3.1(t63) d - - 35 cnrm_cm3 e - - 5 csiro_mk3. g - - 38 csiro_mk3.5 g - - 5 gfdl_cm. t 3 [1] - 5 gfdl_cm.1 t 5 [3] - 5 giss_aom q - - 5 giss_model_eh q 5 [3] - giss_model_er q 9 [5] - 5 inmcm3_ h 1 [1] - 33 ipsl_cm j - - 75 miroc3._hires a,k,s 1 [1] - - miroc3._medres a,k,s 1 [1] 1 [] 5 miub_echo_g n,p [3] 3 [] 3 mpi_echam5 l - - 56 mri_cgcm.3.a o 5 [] [] 5 ncar_ccsm3 r 9 [7] 5 [] 3 ncar_pcm1 r [] [] 35 ukmo_hadcm3 m 3 [] [] 3 ukmo_hadgem1 m [] - a Numbers in square brackets (e.g., [1] ) represent numbers of ensembles members that cover period ending in 1. Names of models are same as used in CMIP3. There are some minor differences between what historical and historicalnat CMIP3 simulations we use and those used in Hegerl et al. [7] (Table S9.1) and in Stone et al. [7] Table. CMIP5 Models Used in This Study, Their Institutions ( Inst., See Table ), the Number of Initial Condition Ensemble Members for Each Experiment, and the Number of Years Available From the picontrol of the Model a Model Inst. 
Number of Ensemble Members picontrol Length Historical historicalnat historicalghg ACCESS1- g,c 1 [1] - - 5 CCSM r 6 [6] [] 6 [] 5 CNRM-CM5 e,f 1 [1] 6 [6] 6 [6] 85 CSIRO-Mk3-6- g,v 1 [1] 5 [5] 5 [5] 5 CanESM d 5 [5] 5 [5] 5 [5] 995 FGOALS-g i,x [] - 1 [1] 9 FGOALS-s i 3 [3] - - 5 GFDL-CM3 t 5 [1] 3 [] 3 [] 5 GFDL-ESMG t 3 [1] - - 5 GFDL-ESMM t 1 [1] 1 [] 1 [] 5 GISS-E-H q 5 [5] 5 [5] 5 [5], GISS-E-R q 6 [5] 5 [5] 5 [5] 3,5 HadCM3 m 7 [7] - - - HadGEM-CC m 1 [1] - - 5 HadGEM-ES m [] [] [] 13 IPSL-CM5A-LR j 5 [] 3 [] 1 [] 95 IPSL-CM5A-MR j 1 [1] - - 3 MIROC-ESM a,k,s 3 [1] 1 [] 1 [] 53 MIROC-ESM-CHEM a,k,s 1 [1] 1 [] 1 [] 5 MIROC5 a,k,s [] - - 67 MPI-ESM-LR l 3 [3] - - 1 MRI-CGCM3 o 3 [3] 1 [] 1 [] 5 NorESM1-M u 3 [3] 1 [1] 1 [1] 5 NorESM1-ME u 1 [] - - - bcc-csm1-1 b 3 [3] 1 [1] 1 [1] 5 inmcm h 1 [1] - - 5 a Numbers in square brackets (e.g., [1] ) represent numbers of ensembles members that cover period ending in 1. Names of models as used in CMIP5. GISS-E-H and GISS-E-R models provided two separate picontrol simulations, thus the two numbers in the picontrol length column for the models.

Table 5. Forcing Factors Included in CMIP3 Historical Experiments Additional to Greenhouse Gases, Sulfate Direct Effects, Ozone, Solar Irradiance, and Stratospheric Volcanic Aerosol Factors a

Model | SI | CA | LU
gfdl_cm2.0 | N | Y | Y
gfdl_cm2.1 | N | Y | Y
giss_model_eh | Y | Y | Y
giss_model_er | Y | Y | Y
inmcm3.0 | N | N | N
miroc3.2_hires | Y | Y | Y
miroc3.2_medres | Y | Y | Y
miub_echo_g | Y | N | N
mri_cgcm2.3.2a | N | N | N
ncar_ccsm3 | N | Y | N
ncar_pcm1 | N | N | N
ukmo_hadcm3 | Y | N | N
ukmo_hadgem1 | Y | Y | Y

a SI, sulfate indirect effects (first and/or second effects); CA, carbonaceous aerosols (black carbon and organic carbon); and LU, land use changes (Y, factor included; N, factor not included).

we cannot guarantee that all the data used in this study were up to date as of March 2012. Nonetheless, the CMIP5 data repository is a substantial undertaking and a great success, and without such a project studies such as this would not be possible. Since March 2012 more models have been added to the CMIP5 archive; we hope to be able to include these models in future analyses.

2.2. External Forcing Factors

[16] Exactly which forcing factors are applied, and how they are modeled, for a given experiment (Table 1) differs somewhat across the models (see each model's documentation for more details). Details of which forcings were included in the CMIP3 historical experiments were deduced from Table 10.1 in Meehl et al. [2007b]. Information about which forcings were included in the CMIP5 simulations was extracted from the metadata contained within the data [Taylor et al., 2012], with additional information being obtained from the institutions. The minimum criterion for a model to be included in the following analyses is that its historical experiments must include at least variations in long-lived well-mixed greenhouse gases, direct sulfate aerosol, ozone (tropospheric and stratospheric), solar, and explosive volcanic influences. Therefore, as stated earlier, we do not examine those CMIP3 historical experiments (20C3M) that did not also include natural forcings.
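The minimum inclusion criterion above amounts to a subset test on each model's list of forcings. The following is a minimal sketch only, not the authors' procedure: the forcing labels, the function names, and the example catalogue are hypothetical, and real CMIP metadata would first need mapping onto such labels.

```python
# Hypothetical labels for the five forcings named in the minimum criterion.
REQUIRED_FORCINGS = frozenset(
    {"ghg", "sulfate_direct", "ozone", "solar", "volcanic"}
)


def meets_minimum_forcings(forcings):
    """Return True if every required forcing is present in `forcings`."""
    return REQUIRED_FORCINGS.issubset(forcings)


def select_models(catalogue):
    """Return the sorted names of models whose historical runs qualify.

    `catalogue` maps a model name to the set of forcing labels applied
    in its historical experiment.
    """
    return sorted(
        name for name, forcings in catalogue.items()
        if meets_minimum_forcings(forcings)
    )


# Illustrative catalogue with made-up model names.
example = {
    "model_all": {"ghg", "sulfate_direct", "ozone", "solar", "volcanic",
                  "land_use"},
    "model_anthro_only": {"ghg", "sulfate_direct", "ozone"},
}
selected = select_models(example)
```

Extra forcings (here `land_use`) do not affect eligibility; only a missing required forcing excludes a model, which is how the anthropogenic-only runs drop out.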
Only historicalnat simulations that include changes in both total solar irradiance and stratospheric volcanic aerosols are examined. The long-lived well-mixed greenhouse gases simulated in the historical and historicalghg experiments include concentration changes in carbon dioxide, methane, and nitrous oxides (or carbon dioxide equivalent), with some variation across the models in which CFC species are included. All the historical simulations include direct sulfate aerosol effects, but how other non-greenhouse gas anthropogenic forcing factors are applied differs across the models. Tables 5 and 6 give a summary of which model historical simulations include the indirect effects of sulfate aerosols, the effects of carbonaceous aerosols (black carbon and/or organic carbon), and land use influences. Further details of the intricacies of the forcings and how they are implemented in particular models are available from the individual model's documentation.

[17] There are a few notable oddities in the way some model experiments have been set up which make them different from the rest of the models in the archive. The historicalghg experiments for the CNRM-CM5, GFDL-CM3, MIROC-ESM, MIROC-ESM-CHEM, MRI-CGCM3, and NorESM1-M models include variations in ozone concentrations in addition to the well-mixed greenhouse gas variations. The CMIP3 models miub_echo_g and mri_cgcm2.3.2a and the CMIP5 model IPSL-CM5A-LR simulate volcanic influences by perturbing the shortwave radiation at the top of the atmosphere. Whereas the ukmo_hadcm3 (run1 and run2) and ukmo_hadgem1 (run1) 20C3M simulations listed in CMIP3 contained anthropogenic only forced simulations, the ukmo_hadcm3 and ukmo_hadgem1 20C3M simulations we analyze here include both anthropogenic and natural forcings [Stott et al., 2000, 2006b].

2.3. Simulation Details

[18] Monthly mean near-surface air temperatures (TAS) were retrieved from the CMIP3 and CMIP5 archives.
The historical CMIP3 simulations had start dates varying between 185 and 19 and end dates varying between 1999 and. To enable a continuation of the CMIP3 historical simulations to near-present day, we use any available A1B SRES scenario simulations [Meehl et al., 2007a] that are continuations of the 20C3M experiments. For CMIP5, the historic period is defined as starting in the mid-19th century and ending in 2005, so to extend the historical simulations up to 2010, we append the historicalext experiment or, if that is not available, the rcp45 experiment (Table 1) [Taylor et al., 2012]. There are some minor differences between the anthropogenic emissions and concentrations of the different representative concentration pathways (RCPs) during the first 10 years of the 21st century [van Vuuren et al., 2011], but these are very small compared to the differences over the whole century. Which RCP experiment is chosen to extend the historical experiment to 2010 is therefore unlikely to be important. There are bigger differences between the forcing factors in the CMIP3 SRES and the CMIP5 RCP experiments over the first few decades of the 21st century, but differences in the climatic responses are also relatively small [Knutti and Sedláček, 2013]. How the solar and volcanic forcing factors are applied during this period will also differ across the models, so there will be additional forcing uncertainties due to these choices. While the official CMIP5 guidance was for the historicalnat and historicalghg simulations to cover the mid-19th century to 2005 period [Taylor et al., 2012], a number of institutions supplied historicalnat and historicalghg data to CMIP5 to cover the period up to 2010 (Table 4). None of these CMIP3 historicalnat simulations have data beyond the year.

Table 6. Same as Table 5 but for CMIP5 Models

Model | SI | CA | LU
ACCESS1-0 | Y | Y | N
CCSM4 | N | Y | Y
CNRM-CM5 | Y | Y | N
CSIRO-Mk3-6-0 | Y | Y | N
CanESM2 | Y | Y | Y
FGOALS-g2 | Y | Y | N
FGOALS-s2 | N | Y | N
GFDL-CM3 | Y | Y | Y
GFDL-ESM2G | N | Y | Y
GFDL-ESM2M | N | Y | Y
GISS-E2-H | Y | Y | Y
GISS-E2-R | Y | Y | Y
HadCM3 | Y | N | N
HadGEM2-CC | Y | Y | Y
HadGEM2-ES | Y | Y | Y
IPSL-CM5A-LR | Y | Y | Y
IPSL-CM5A-MR | Y | Y | Y
MIROC-ESM | Y | Y | Y
MIROC-ESM-CHEM | Y | Y | Y
MIROC5 | Y | Y | Y
MPI-ESM-LR | N | N | Y
MRI-CGCM3 | Y | Y | Y
NorESM1-M | Y | Y | N
NorESM1-ME | Y | Y | N
bcc-csm1-1 | N | Y | N
inmcm4 | N | N | N

3. Observed Near-Surface Temperatures

[19] Measurements of near-surface temperatures have been used to create the longest global scale diagnostics of observed climate, going back to the mid-19th century. In this study we analyze HadCRUT4 produced by the Met Office and Climate Research Unit, University of East Anglia [Morice et al., 2012], GISS produced by NASA Goddard Institute for Space Studies [Hansen et al., 2006], NCDC produced by NOAA's National Climatic Data Center [Smith et al., 2008], and JMA produced by the Japan Meteorological Agency [http://ds.data.jma.go.jp/tcc/tcc/products/gwp/gwp.html accessed 8/8/11; Ishii et al., 2005]. The data sets differ in how raw data have been quality controlled, and in the homogenization adjustments and bias corrections made. They also differ in their coverage.
While HadCRUT4 and JMA have areas of missing data where no observations are available (i.e., no infilling outside of grid boxes with data), GISS extrapolate over data-sparse regions using data within a radius of 1200 km, and NCDC use large area averages from low-frequency components of the data and spatial covariance patterns for the high-frequency components to extrapolate data. The data sets incorporate land air temperatures and near-surface temperatures over the ocean (e.g., sea surface temperatures) into a global temperature record. The main data set we analyze is HadCRUT4, an update to HadCRUT3 [Brohan et al., 2006] with additional data, bias corrections, and a sophisticated error model. The data set is provided as an ensemble that samples a number of uncertainties and bias corrections that are correlated in time and space, as well as statistical descriptions of the other uncertainties [Kennedy et al., 2011; Morice et al., 2012]. For the purposes of this study, we use the median field of the HadCRUT4 bias realizations (see Morice et al. [2012] for more details) as the best estimate of the data set. We plan to do a thorough investigation using HadCRUT4's error model in a future study. The HadCRUT4, NCDC, and JMA data sets have a gridded spatial resolution of 5° × 5° and GISS a resolution of 1° × 1°. The periods covered by the data sets are 1850-2010 for HadCRUT4, 1880-2010 for GISS and NCDC, and 1891-2010 for JMA.

4. Comparison of CMIP3 With CMIP5 Models

[20] We first compare the CMIP3 and CMIP5 multi-model ensembles (MMEs). Annual mean spatial fields are created by averaging the monthly gridded data (January-December), and when comparing spatial patterns, model data are projected onto a 5° × 5° spatial grid.

4.1. Differences in CMIP3 and CMIP5 Variability

[21] For consistency when comparing the CMIP3 and CMIP5 picontrol experiments, we limit the length of simulations being examined to that of the shortest picontrol simulation, 3 years (Tables 3 and 4).
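The annual averaging of monthly fields described in this section can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: it assumes a complete record starting in January on a regular latitude-longitude grid with no missing data, gives every month equal weight, and uses hypothetical function names. The cosine-latitude global mean is a common convention added here for illustration; the regridding step to the common grid is not shown.

```python
import numpy as np


def annual_means(monthly):
    """Average monthly fields into calendar-year (January-December) means.

    `monthly` has shape (n_months, n_lat, n_lon) and is assumed to start
    in January. Trailing months that do not complete a year are dropped.
    Each month is weighted equally (month lengths are ignored).
    """
    monthly = np.asarray(monthly, dtype=float)
    n_years = monthly.shape[0] // 12
    trimmed = monthly[: n_years * 12]
    return trimmed.reshape(n_years, 12, *monthly.shape[1:]).mean(axis=1)


def global_mean(field, lats_deg):
    """Area-weighted global mean using cos(latitude) weights.

    `field` has shape (..., n_lat, n_lon); `lats_deg` holds the grid box
    centre latitudes in degrees. Assumes no missing grid boxes.
    """
    weights = np.cos(np.deg2rad(np.asarray(lats_deg, dtype=float)))
    zonal = np.asarray(field, dtype=float).mean(axis=-1)
    return (zonal * weights).sum(axis=-1) / weights.sum()
```

For observational data sets with gaps, such as HadCRUT4, the model fields would additionally need masking to the observed coverage before averaging, which this sketch does not attempt.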
Often a model simulation with no changes in external forcing (picontrol) will have a drift in the climate diagnostics due to various flux imbalances in the model [Gupta et al., 2012]. Some studies attempt to account for possible model climate drifts; for instance, Figure 9.5 in Hegerl et al. [2007] did not include transient simulations of the 20th century if the long-term trend of the picontrol was greater in magnitude than 0.2 K/century (Appendix 9.C in Hegerl et al. [2007]). Another technique is to remove from the transient simulations the trend deduced from a parallel section of picontrol [e.g., Knutson et al., 2006]. However, whether one should always remove the picontrol trend, and how to do it in practice, is not a trivial issue [Taylor et al., 2012; Gupta et al., 2012]. Only two of the CMIP model simulations analyzed in this paper, giss_model_eh and csiro_mk3.0 (both from CMIP3), have trends with magnitude greater than 0.2 K/century (Figure S1 in the supporting information). We choose not to remove the picontrol trend from parallel simulations of the same model in this study because of the impact it would have on long-term variability: part of the trend in the picontrol may be long-term internal variability that may or may not happen in a parallel experiment when additional forcing has been applied. The overall range of CMIP5 picontrol trends has a smaller spread than the CMIP3 picontrol trends, and all are lower in magnitude than about 0.1 K/century. While the variance of the annual mean TAS of the first 3 years of picontrol (Figure S1) has a lower spread in CMIP5 than in CMIP3, the differences are not statistically significant. The models with the very highest variability, which are all in CMIP3, have lower variability when the drift in the picontrol is removed (Figure S1).

[22] The interannual variability across latitudes in the CMIP3 and CMIP5 models is shown in Figure 1.
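The drift screening discussed in paragraph [21] can be illustrated with a short sketch: fit a linear trend to a control run's annual global mean TAS, express it in K/century, and compare the variance before and after removing the fitted line. This is an assumption-laden illustration rather than the paper's method; the least-squares linear fit and the function names are choices made here.

```python
import numpy as np


def trend_per_century(annual_tas):
    """Least-squares linear trend of an annual series, in K per century."""
    series = np.asarray(annual_tas, dtype=float)
    years = np.arange(series.size)
    slope, _intercept = np.polyfit(years, series, 1)
    return 100.0 * slope


def variance_raw_and_detrended(annual_tas):
    """Return (variance of the series, variance after linear detrending)."""
    series = np.asarray(annual_tas, dtype=float)
    years = np.arange(series.size)
    slope, intercept = np.polyfit(years, series, 1)
    residual = series - (slope * years + intercept)
    return series.var(ddof=1), residual.var(ddof=1)


# Synthetic example: a pure drift of 0.3 K/century with no variability.
drift_only = 0.003 * np.arange(300)
drift_rate = trend_per_century(drift_only)
```

A drifting control run inflates the raw variance, which is consistent with the remark that the highest-variability models show lower variability once the drift is removed. The cost of detrending, as the text notes, is that genuine long-term internal variability is removed along with the drift.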
The spread of the models is greatest around the tropics and high latitudes, with the range across the models being quite large in places. For CMIP3, the tropical variability is very weak in a few models and very strong in others; for CMIP5, there is closer agreement in the tropics. The reduction in the range across the models of tropical variability in CMIP5 is probably due to a reduction in the range of El Niño-Southern Oscillation variability across the models relative to CMIP3 [Guilyardi et al., 2012; Kim and Yu, 2012]. Variability across the models over the high latitudes is similar between CMIP3 and CMIP5 apart from one model forming an outlier in CMIP5.

Figure 1. Standard deviation of zonal annual mean TAS (K) against latitude from picontrols for CMIP3 (top) and CMIP5 (bottom). The first 3 years of each model's picontrol were used to calculate annual latitudinal zonal means, and then the standard deviation for each latitude was calculated.

4.2. Historic Transient Experiments

[23] Global annual mean TAS for simulations for the historic period from 1850 to 2010 for CMIP3 (Table 3) and CMIP5 models (Table 4) are shown in Figure 2. Included in the figure are the individual historical, historicalnat, and historicalghg simulations for CMIP3/5 together with the CMIP3 and CMIP5 ensemble averages. While for Figure 9.5 in Hegerl et al. [2007] the simple average of all the historical and historicalnat simulations was shown together with each individual simulation, here we take a slightly different approach. Because some models have as many as 10 initial condition ensemble members in the CMIP5 archive while others have only a single simulation for a given experiment, a simple average would give most weight to the model with the most ensemble members. Therefore, to avoid this bias toward models with the most ensemble members, we calculate the weighted average that gives equal weight to each model [Santer et al., 2007] regardless of how many ensemble members a model has (see the supporting information for details). A simulation, one of an ensemble for a given model, will have a weight which is the inverse of the number of ensemble members for that model multiplied by the inverse of the number of different models.
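The equal-model weighting just described, and the weighted percentile ranges discussed in the same paragraph, can be sketched as below. This is a minimal sketch, not the procedure from the supporting information: the function names are hypothetical, and the percentile rule used here (the first ranked value whose cumulative weight reaches the requested fraction) is one simple assumption among several possible definitions.

```python
import numpy as np


def model_weights(model_ids):
    """Weight each simulation by 1 / (n_models * n_members_of_its_model)."""
    ids = np.asarray(model_ids)
    names, inverse, counts = np.unique(
        ids, return_inverse=True, return_counts=True
    )
    return 1.0 / (len(names) * counts[inverse])


def weighted_mean(values, weights):
    """Weighted average; equals the mean of the per-model ensemble means."""
    return float(np.sum(np.asarray(values, dtype=float) * weights))


def weighted_percentile(values, weights, q):
    """Value at percentile q (0-100) of the weighted cumulative distribution.

    Simulations are ranked by value; the result is the first ranked value
    whose cumulative weight reaches q percent of the total weight.
    """
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    cumulative = np.cumsum(weights[order])
    index = np.searchsorted(cumulative, q / 100.0 * cumulative[-1])
    return float(values[order][min(index, values.size - 1)])


# Illustrative case: model "A" has three members, model "B" has one.
ids = ["A", "A", "A", "B"]
vals = [1.0, 2.0, 3.0, 10.0]
w = model_weights(ids)
mme_mean = weighted_mean(vals, w)
```

In this example the unweighted mean of the four simulations is 4.0, whereas the weighted mean equals the average of the two model means (2.0 and 10.0), so the single-member model is not swamped by the three-member one. Setting every weight to the inverse of the total number of simulations recovers the unweighted statistics.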
The resultant weighted average is equivalent to taking the average of all the models' ensemble averages. To estimate the statistical spread of the MME, we rank the simulations and create a cumulative probability distribution, with the probability assigned to each simulation equal to its weight (as described above). The cumulative probability distribution can then be sampled at whatever percentiles are of interest to obtain MME ranges. A more basic analysis could just look at the average and the percentile range of the ranked simulations, which would be equivalent to simply setting the weights to the inverse of the total number of simulations. However, by giving equal weight to each model, rather than to each simulation, the statistical properties of the MME will not be dominated by those models with many ensemble members. The weighted ensemble means for CMIP3 and CMIP5, separately, are shown as thick lines in Figure.

[] All the CMIP3 and CMIP5 historical experiments (see Figures S and S3 for the responses for each model shown separately) show warming over the historic period. The spread in the increase in TAS across the models is larger than the spread across the initial condition ensemble for any of the individual models. The historicalnat simulations show little overall warming due to the combined influences of solar and volcanic activity. In most models, the historicalghg experiment warms more than the historical experiment, consistent with the other anthropogenic factors (sulfate and carbonaceous aerosols, ozone, and land use) and natural factors having an overall net cooling influence. However, the CCSM, GFDL-ESMM, and bcc-csm1 models have historicalghg simulations that warm by similar amounts to the equivalent historical simulations (Figure S3), which implies that the non-greenhouse gas forcing factors have little or no net warming or cooling influence over the whole period.
These models were the only CMIP5 models that provided historical and historicalghg experiments that did not include the indirect effects of sulfate aerosols in the historical simulations, i.e., the effects of aerosols that make clouds brighter (first indirect effect) or longer lasting (second indirect effect).

[5] The gradual warming seen in the historical simulations is punctuated by short periods of cooling of varying degrees following the major volcanic eruptions [Driscoll et al., 1]. Far less obvious is any response to solar irradiance changes, with little evidence of an 11 year cycle in the weighted mean of the historical or historicalnat simulations. This supports previous examinations of the response of climate models to solar forcing, which suggest that they have a weak global mean response to the 11 year solar cycle [Jones et al., 1].

[6] With the increased number of models available in CMIP5 compared to CMIP3, there is a wider variety of responses in CMIP5 than in CMIP3. The spread in TAS in the first decade of the 21st century for the historical CMIP5 simulations is somewhat wider than for the CMIP3 simulations (Figure ). The spread is almost identical up to 196 but then widens for CMIP5 relative to CMIP3. A non-parametric Kolmogorov-Smirnov test [Press et al., 199, p. 617] does not rule out that the CMIP3 and CMIP5 distributions of 1 1 mean temperatures are drawn from the same population distribution, i.e., the distributions are not significantly different. However, the test is not robust to differences in the tails of distributions, which seem to be the main difference between CMIP3 and CMIP5. One of the

Figure. Global annual mean TAS (temperature anomaly in K, plotted against year) for both CMIP3 and CMIP5 for historical (top), historicalnat (middle), and historicalghg (bottom) individual ensemble members (thin light lines). The weighted ensemble averages for CMIP3 (blue thick line) and CMIP5 (red thick line) are estimated by finding the average of the model ensemble means (supporting information). TAS annual means are shown with respect to 188 1919.

striking differences between the CMIP3 and CMIP5 MME is the stronger cooling that some of the models have for the historical experiments around the 195s to 198s (Figures S and S3). These differences indicate that some changes in modeling are potentially increasing the variety of temperature responses in the CMIP5 historical simulations. This is despite the similarity of the range of the transient climate response (TCR) seen in CMIP3 and CMIP5 [Andrews et al., 1]. Differences in the way forcing factors are applied in the CMIP models and the resulting uncertainty in the radiative forcings [Forster and Taylor, 6; Forster et al., 13] may also be contributing to the wider spread in TAS responses in CMIP5. For instance, a higher proportion of the CMIP5 models include land use changes and a wider range of aerosol influences than in CMIP3 (Tables 5 and 6). Also, the 7 IPCC assessment [Forster et al., 7] estimated that historic total solar irradiance (TSI) increases were half that estimated by the previous report, leading to a recommendation for CMIP5 to use a TSI reconstruction which had a smaller increase over the first half of the 20th century than those used by CMIP3.
On the other hand, some of the CMIP5 models have very similar variations in historical TAS to those of their earlier generation counterparts in CMIP3, for example giss_model_e_r and GISS-E-R, or ncar_ccsm3_ and CCSM (Figures S and S3).

[7] The multi-model weighted average of the CMIP3 historical ensemble is very similar to the CMIP5 historical weighted average (Figure ), which is perhaps surprising given the wide range of responses in the individual models in CMIP5 compared to CMIP3 and the differences in the models and forcing factors applied. The weighted means for the CMIP3 and CMIP5 historicalnat simulations are also very similar, as is the mean cooling following the major volcanic eruptions, even though there are large differences between individual models.

5. Comparison of CMIP Models With Observed Temperatures

[8] In this section we compare the CMIP models with observed near-surface temperatures. When comparing models with observations, it is important to treat the model data in as similar a way as possible to the observed data [Santer et al., 1995]. All data are projected onto a 5° × 5° spatial grid and then monthly anomalies, relative to 1961 199, are masked by HadCRUT's spatial coverage. Annual means (January to December) are calculated for a gridpoint if at least months of data are available (see the supporting information for more details of the preprocessing). Imposing HadCRUT spatial coverage on all data means that the other observational data sets, in particular GISS and NCDC, have reduced coverage [Jones and Stott, 11], but this allows a consistent comparison between the observations and the models.

5.1. Global Annual Mean Temperatures

[9] Global annual mean near-surface temperatures for each model and the four observational data sets are shown in Figure 3 for the CMIP5 MME (see Figure S7 in the supporting information for the equivalent figure for the CMIP3 models).
Global mean anomalies were calculated by removing the average of the global annual means over the 188 1919 period (see the supporting information). As has been noted elsewhere [Jones and Stott, 11; Morice et al., 1], all observational data sets track each other relatively closely over the 185 1 period. Each model's historical simulations warm more than, and track the observations more closely than, the equivalent historicalnat simulations.
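The preprocessing described in this section (masking model data by the observational coverage, forming annual means only when enough months are present, and expressing results as anomalies from a reference period) can be sketched for a single grid point or series as follows. This is only an illustration under stated assumptions: the function names are hypothetical, missing data are represented by `None`, and the minimum number of months and the reference years are left as parameters because the values used in the analysis are given in the supporting information.

```python
def mask_like_observations(model_months, obs_months):
    """Set model values to None wherever the observations are missing."""
    return [m if o is not None else None
            for m, o in zip(model_months, obs_months)]


def annual_mean(months, min_months):
    """Mean of the available months, or None if fewer than min_months exist.

    min_months is a hypothetical parameter, not a value taken from the paper.
    """
    present = [v for v in months if v is not None]
    if len(present) < min_months:
        return None
    return sum(present) / len(present)


def anomalies(series_by_year, base_start, base_end):
    """Subtract the mean over base_start..base_end (inclusive) from each year."""
    base = [v for y, v in series_by_year.items()
            if base_start <= y <= base_end and v is not None]
    baseline = sum(base) / len(base)
    return {y: (None if v is None else v - baseline)
            for y, v in series_by_year.items()}


# Hypothetical example: observations exist for only six months of the year.
model = [1.0] * 12
obs = [0.5] * 6 + [None] * 6
masked = mask_like_observations(model, obs)
year_ok = annual_mean(masked, min_months=6)       # enough months: 1.0
year_missing = annual_mean(masked, min_months=7)  # too few months: None
anoms = anomalies({2000: 1.0, 2001: 2.0, 2002: 4.0}, 2000, 2001)
```

Applying the same mask to the models and to every observational data set is what makes the later trend comparisons like-for-like.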

Figure 3. Global annual mean TAS variations (temperature anomaly in K), 185 1, for the CMIP5 historical, historicalnat and historicalghg experiments and the four observational data sets. TAS annual means are shown with respect to 188 1919. Model and observed data have the same spatial coverage as HadCRUT. Panels (one per model): ACCESS1-, CCSM, CNRM-CM5, CSIRO-Mk3-6-, CanESM, FGOALS-g, FGOALS-s, GFDL-CM3, GFDL-ESMG, GFDL-ESMM, GISS-E-H, GISS-E-R, HadCM3, HadGEM-CC, HadGEM-ES, IPSL-CM5A-LR, IPSL-CM5A-MR, MIROC-ESM, MIROC-ESM-CHEM, MIROC5, MPI-ESM-LR, MRI-CGCM3, NorESM1-M, NorESM1-ME, bcc-csm1-1, inmcm.

A comparison with simulations that have complete spatial coverage (supporting information) shows that the vast majority (>95%) of the historical simulations warm less over the historic period when masked by the observational coverage. This is predominantly caused by the high northern latitude warming from greenhouse gases in the models being masked by the lack of coverage in HadCRUT in that region. The largest reduction in the linear trend

Figure. Global annual mean TAS (temperature anomaly in K, plotted against year) for CMIP3 (thin blue lines) and CMIP5 (thin red lines) for (a) historical, (b) historicalnat, and (c) historicalghg ensemble members, compared to the four observational data sets (HadCRUT, GISS, NCDC, and JMA; black lines), also shown individually in the insert of Figure a. The weighted ensemble averages for CMIP3 (blue thick line) and CMIP5 (red thick line) are estimated by giving equal weight to each model's ensemble mean (supporting information). All model and observed data have the same spatial coverage as HadCRUT. TAS anomalies are with respect to the 188 1919 period.

between 191 and 1, in an individual simulation due to applying the observational coverage mask, is .18 K per 11 years, demonstrating the importance of comparing like with like. The historicalghg simulations consistently warm more than the observations across all the CMIP5 models (Figure 3).

[3] Many of the models' historical simulations (Figure 3) capture the general temporal shape of the observed TAS: an increase from the 19s to the 19s, then a flattening or even cooling to the 197s, followed by an increase to the present day. The spread in a given model's warming across its ensemble is relatively small over the whole period compared to the spread in warming across the models. This indicates that over the 1 year timescale, differences in the forcing factors applied to the models and in their responses are more important than internal variability, although on shorter timescales the opposite may be the case [Smith et al., 7; Hawkins and Sutton, 9].
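Linear trends of the kind quoted in this section can be computed with an ordinary least squares fit to the annual means. The sketch below assumes least squares (the text does not state the fitting method), skips years with missing data, and uses a hypothetical function name and a hypothetical `per_years` argument to rescale the slope to a chosen number of years.

```python
def linear_trend(years, values, per_years=1.0):
    """Least squares slope of values against years, scaled to per_years.

    Years whose value is None (missing data) are ignored.
    """
    pairs = [(y, v) for y, v in zip(years, values) if v is not None]
    n = len(pairs)
    mean_y = sum(y for y, _ in pairs) / n
    mean_v = sum(v for _, v in pairs) / n
    sxx = sum((y - mean_y) ** 2 for y, _ in pairs)
    sxy = sum((y - mean_y) * (v - mean_v) for y, v in pairs)
    return per_years * sxy / sxx


# Hypothetical example: a series warming by exactly 0.01 K per year.
years = list(range(2000, 2010))
values = [0.01 * (y - 2000) for y in years]
trend = linear_trend(years, values, per_years=100.0)  # about 1.0 K per 100 years

# The same series with one missing year gives the same slope.
values_with_gap = values[:4] + [None] + values[5:]
trend_with_gap = linear_trend(years, values_with_gap, per_years=100.0)
```

Running the same function on a masked and an unmasked version of one simulation gives the change in trend that is attributable to the coverage mask alone.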
[31] Global annual mean TAS for the MME for both CMIP3 and CMIP5 for the historical, historicalnat and historicalghg simulations is shown in Figure, together with the CMIP3 and CMIP5 model weighted averages and the four observational data sets. As suggested in previous analyses [e.g., Stott et al., ] and as documented in Figure 9.5 in Hegerl et al. [7], the historical simulations describe the variations of the observed annual mean near-surface temperatures fairly well (Figure S8 in the supporting information is an alternative version of Figure showing the overall spread of the CMIP3 and CMIP5 simulations combined). Linear trends for the observed data sets and the model experiments are given in Table 7 for different periods. While all of the observational data sets show similar time histories (Figure ), there is a small spread in linear trends, with the GISS data set having a slightly smaller trend than the other observations [Jones and Stott, 11; Morice et al., 1] over 191 1. The observational data sets show a linear warming of between .6 and .75 K per 1 years over 191 1, while the spread of the central 95% of the historical simulations' trends is .33 to 1.11 K per 1 years. It should be noted that linear trends in these circumstances are just summary statistics and do not imply that linear climate changes are expected or observed.

[3] While the observed trend over the first half of the 20th century is higher than the historical MME mean, it is within the historical ensemble spread but not the historicalnat ensemble spread (Table 7). The spread of the historical MME trends over the 1951 1 and 1979 1 periods encompasses the observed trends, while that of the historicalnat MME does not. One must be careful not to draw too many conclusions about the significance of differences between just the model mean and the observations.
The process of averaging many models' simulations substantially reduces any internal variability in the ensemble mean of the historical experiments, leaving an average model response that is predominantly a forced signal with almost no internal variability, even though the observations still contain internal variability. So, leaving aside issues of observational uncertainty, the difference between the multi-model mean and the observations is mainly due to mean model error and observational internal variability [Weigel et al., 1]. This means it is important to account for the MME spread, and measures of internal variability, when comparing models with observations [Santer

Table 7. Global Mean Linear Trends for the Observed Data Sets and Both CMIP3 and CMIP5 MME (K per 1 Years) a
Periods (columns): 191 1; 191 195; 1951 1; 1979 1; 1 1
HadCRUT .7 1. 1.9 1.78.35
GISS .6.81 5.1
NCDC .75.95 1.1 1.6.17
JMA .7.9 1.6 1.7.16
historical .65 (.33,1.11).65 (.,1.11) 1.3 (.63,1.93).11 (.91,3.3) 1.87 (.7,.9)
historicalnat . (.13,.13).3 (.8,.78).1 (.58,.15).16 (.79,1.7).7 (.9,.3)
historicalghg 1.9 (.81,9).37 (.5,.7) 1.93 (1.7,.7).7 (1.6,3.13) 1.93 (.1,.1)
a The average of the MME trends together with the .5 97.5% range (in brackets) is given for the CMIP experiments (giving equal weight to each model). All observations and model simulations have the same temporal-spatial coverage as HadCRUT. Trends are calculated for a period when less than 1 years have missing data, apart from the 1 1 period, for which the trend is calculated only when all 1 years are available.

et al., 8], and not just contrast the MME mean with an observational data set [e.g., Wild, 1].

[33] A comparison of the variability of the global mean of the models with the observations on different timescales is shown in Figure 5 as a power spectral density (PSD) plot (see also Figures S1 and S11 for the individual models' PSDs). The method used is described elsewhere [Mitchell et al., 1; Allen et al., 6; Stone et al., 7; Hegerl et al., 7]. The spectra contain variance from internal variability and the response to external forcings, as the data have not been de-trended. The CMIP3 and CMIP5 historical MME encompass the variability of all four observational data sets on all the timescales examined. The historicalnat MME starts to diverge from the observations after periodicities of or so years, and for periodicities of about 35 years no historicalnat simulations have variability as large as observed.
Together with Figure, this is strong evidence that observed temperature variations are detectable over internal and externally forced natural variability on the longer timescales, whereas on timescales shorter than 3 years changes are indistinguishable [Hegerl and Zwiers, 11].

[3] Figure 6 shows a summary of three statistical indicators for the CMIP simulations compared with HadCRUT on a Taylor diagram [Taylor, 1]. The Taylor diagram enables the simultaneous representation of the standard deviation of each simulation's and HadCRUT's global annual mean TAS, together with the root mean square error (RMSE) and correlation of the simulations with HadCRUT. The period 191 5 is used, to increase the number of simulations that can be examined, with global annual means having their whole-period mean removed. Perhaps unsurprisingly, the historicalnat simulations (green points in Figure 6) have the lowest standard deviations and the lowest correlations with HadCRUT. None of the historicalnat simulations has an RMSE lower than. K. All the historicalghg simulations have correlations with HadCRUT around.8 and RMSEs up to. K. The historical ensemble includes some of the simulations with the lowest RMSE, with correlations with HadCRUT varying from just above. up to just below.9. While the historicalnat simulations are clustered away from the other simulations, there is some overlap between the clusters of historical and historicalghg simulations.

5.. Continental-Scale Mean Temperatures

[35] Climate changes from internal variability and external forcings would not be expected to be uniform across the globe [Santer et al., 1995]. We examine annual mean temperatures over sea, land and six continental land areas. We group pre-defined regions used by the IPCC in a report on climate extremes [SREX, 1] into six continental regions (Figure 7 insert).
These SREX areas (Figure 3.1 and Table 3.A-1 in SREX [1]) do not always align perfectly with common geographic or political definitions of the continents, but for convenience we group them and call the areas North America, South America, Africa, Europe, Asia, Australasia and Antarctica (insert in Figure 7). All data, both models and HadCRUT, are processed in the same way to construct the global and regional land and global ocean temperatures. We use the proportion of land area in each of HadCRUT's grid boxes to decide which grid boxes, in HadCRUT and the models, to use. Only those grid boxes where there is 5% or more land in HadCRUT are used to calculate land temperatures, and only those grid boxes with % land are used to calculate ocean temperatures (see the supporting information for further details).

[36] The observed (HadCRUT) data coverage across the regions changes substantially over the period being examined (Figure S6). Europe has the least amount of

Figure 5. Power spectral density (K yr -1, plotted against period in years) for the 191 1 period for both CMIP3 and CMIP5 simulations and the observations. The analysis is on annual mean data as shown in Figure. A Tukey-Hanning window of 97 years in length is applied to all data. The central 9% ranges of the historical and historicalnat multi-model ensembles are shown separately as shaded areas. The 5 95% ranges are calculated giving equal weight to each model (see section.). The HadCRUT, GISS, NCDC, and JMA global mean near-surface temperature observations are as shown in the key.