Statistical concepts in climate research

Statistical concepts in climate research (some misuses of statistics in climatology) Slava Kharin Canadian Centre for Climate Modelling and Analysis, Environment Canada, Victoria, British Columbia, Canada Banff Summer School, May 2008

Statistical concepts in climate research Lecture I - Misuses of statistics in climate research testing hypotheses suggested by the data; serial correlation; using statistical recipes as black-box tools; Lecture II - Hypothesis testing Type I and Type II errors, significance, power, etc historical developments and controversy around classical statistical significance test and its interpretation. Lecture III - Basics of Bayesian statistics. Introduction to Bayesian statistics. Bayesian climate change assessment.

References Analysis of Climate Variability: Applications of Statistical Techniques, by H. von Storch and A. Navarra (Eds.) (proceedings of an Autumn School on Elba, 1993) Statistical Analysis in Climate Research, H. von Storch and F. W. Zwiers, 1999, Cambridge.

Misuses of Statistical Analysis in Climate Research (von Storch 1993) Obsession with statistical recipes, in particular, hypothesis testing Pavlov s dog reaction to any hypothesis derived from data, demanding statistical significance test; Use of statistical techniques as a black-box, or cook-book recipe (standard example is disregard of serial correlation). Misunderstanding or misinterpreting the names (e.g. decorrelation time, p-values as probability of hypotheses) Use of sophisticated techniques...there is sometimes unwarranted expectation of miracle-like results from very advanced techniques.

Statistical analysis in climate research There is basically one observational record in climate research. It is hardly possible to perform real independent experiments. Classical hypothesis testing is frequentist in nature (repeated sampling). Dynamical models can create independent data, subject to model deficiencies. Data are inter-related, both in space and time. This is useful for reconstruction of the space-time state of the climate system from a limited set of observations. But the basic premise of many standard statistical tests is violated.

Two types of statistical data analysis (Tukey 1977) Exploratory analysis (EA) is the art of extracting information from a data set. The objectives of EA are: to suggest and formulate hypotheses about the workings of the climate system; to assess assumptions on which statistical inference (i.e., estimation and hypothesis testing) will be based; to support selection of statistical tools and techniques; provide a basis for further data collection (e.g., through numerical experiments). Confirmatory analysis (CA) is then performed to confirm independently the hypotheses.

Climatology as a one-experiment science Climatology is a one-experiment science. There is basically one observational record in climate. Climate researchers essentially look at the same record to build and test their hypotheses. The processes of building and testing hypotheses are hardly separable. Only dynamical models can provide independent datasets but subject to model limitations to describe the real world. But even dynamical models (GCMs,CGCMs,ESMs) are tuned to the available observational record. This leads to testing hypotheses suggested by the data using the same data.

Mexican Hat Mexican Hat is a unique combination of vertically arranged stones in the desert at the border of Utah and Arizona (google images).

Is the Mexican Hat man-made? (von Storch & Zwiers 1999) The null hypothesis: The Mexican Hat is of natural origin Form a statistic: { 1 if x forms a Mexican Hat t(x) = 0 otherwise where x is any pile of rocks. We need the probability distribution of t(x) under null hypothesis: collect samples of x, say, n = 10 6. Mexican Hat is famous for good reasons there is only one t(p) = 1. Estimate the probability distribution of t(p): { 10 6 for k = 1 prob(t(x) = k) = 1 10 6 for k = 0 We reject the null hypothesis at a small significance level.

Where is the logic flaw in the Mexican Hat paradox? Fundamental problem: the null hypothesis is not independent of the data which are used to conduct the test. We know a priori that the Mexican Hat is a very rare event. The fact that we didn t find any other combination like that cannot be used as independent evidence against its natural origin. The same trick can be used to prove that any rare event is non-natural (and therefore we are tricked into searching for external causes)

Labitzke and van Loon, 1988 K.Labitzke studied the relationship between solar activity and the stratospheric relationship at the North Pole. there was no obvious relationship if all years were considered. She saw that there was positive correlation during WQBO and negative correlation during EQBO.

Testing hypotheses suggested by the data The limitation of a single observational record has a severe consequence: Many people screen the observational record for rare and unusual events. Typically, only unusual results are eventually published in literature (publication bias). Presumably, some of these unusual events are nevertheless usual. Although we can compare these unusual events with all other events in the record, they cannot be contested with a statistical test since independent data are not available.

Testing hypotheses suggested by the data No statistical test, regardless of its power, can overcome this problem! Possible solutions: extend record backwards; use suitably designed experiments with GCMs; postpone testing until new data become available.

Neglecting Serial Correlation Problem: Most standard statistical techniques are derived with explicit need for statistically independent data. However, almost all climatic data are somehow correlated in time. Consequence: Statistical tests becomes too liberal, that is, the no-signal hypothesis is incorrectly rejected more often than it is implied by the significance level when it is true. The results of a statistical tests depends strongly on the serial correlation. Even if serial correlation is not neglected, its effects are not always correctly accounted for in statistical tests.

Comparison of means test (Zwiers & von Storch 1994) There are two main assumptions in the standard t-test: sampling assumption observations are statistically independent; distributional assumption observations are normally distributed; t-statistic for one sample X: t = X µ 0 S X / n = D S D where S X is the sample standard deviation. The t-statistic has Student s t distribution with n 1 d.f.

Comparison of means test: violation of assumptions How sensitive is the t-test to departures from the sampling and distributional assumptions? the t-test is moderately robust against departures from the normal distribution, especially when large samples are available, but it depends on the nature of deviations from the normality. the t-test is strongly affected by serial correlation, both for small or large samples.

Comparison of means test Type I errors: departures from the normality (Pearson distribution family) Rejection rate (%) 25 20 15 10 5 0 skew=0 skew=0.5 skew=1 skew=1.5 skew=2 20 40 60 80 100 Sample size The t test is slightly too liberal when distributions are skewed but not excessively so.

Comparison of means test Power: departures from the normality (Pearson distribution family) 100 50 L=30 L=50 L=100 Power (%) 20 10 5 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 σ (%)

Comparison of means test Type I errors: departures from the normality (bimodal) 15 Rejection rate (%) 10 5 0 20 40 60 80 100 Sample size

Comparison of means test Power: departures from the normality (bimodal) 100 50 L=30 L=50 L=100 Power (%) 20 10 5 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 σ (%)

Comparison of means test: departures from the independence We assume that a climate process can be approximately represented by a auto-regressive first order (AR1) process: X t µ X = ρ 1 (X t 1 µ X ) + ǫ t where ǫ t is a Gaussian white noise process, ρ 1 is the lag-1 correlation coefficient. (ENSO, QBO, or MJO are suspect)

Comparison of means test: departures from the independence Rejection rate 0.5 0.4 0.3 0.2 0.1 0.0 rho=0 rho=0.3 rho=0.45 rho=0.6 rho=0.75 20 40 60 80 100

Subsampling the data If a data record is long, one can try to form subsamples, separated by some time interval. t = x µ S X / n = D S D if one is interested in the differences of means in a particular season (DJF), it is often more prudent to use seasonal means, rather than monthly means. Choice of sampling interval is not trivial (either from physically considerations, or from inspecting autocorrelation function). throwing away much of data.

The equivalent sample size, ESS The main reason why the t test is too liberal for serially correlated data is that the denominator underestimate the sampling standard errors of the numerator: The concept of effective or equivalent sample size is used to correct this problem (Thiebaux and Zwiers (1984) discuss interpretation and estimation of ESS). If {X 1,...,X n } are n i.i.d. random variables then Var(X) = 1 n Var(X) If {X 1,...,X n } are serially correlated then Var(X) = 1 n e Var(X) n n e = 1 + 2 n 1 τ=1 (1 τ n )ρ(τ) n1 ρ 1, n 1, 1 + ρ 1 ρ(τ) is the autocorrelation function.

The equivalent sample size Interpretation: samples of independent observations of size n e contain as much information about the difference of means as samples of serially correlated observations of size n. Important: ESS is not uniquely defined. T = D/S D The expression for n e depends on the definition of D. If our interest is something else (e.g., variance) the definition of information, and therefore of the ESS, changes. Therefore, it is incorrect to interpret n e as the number of effectively independent observations in the absolute sense.

t-test with the equivalent sample size When samples are sufficiently large then t = X µ 0 S X / n e has approximately a standard normal distribution N(0, 1). When samples are small it is often assumed that t behaves as Student s t statistic with (2n e 2) d.f. Correct? Wrong! This assumption is not correct for small samples.

t-test with the known equivalent sample size The actual rejection rate of the t test with known n e (adapted after Zwiers & von Storch 1994) 0.06 0.05 Rejection rate 0.04 0.03 0.02 0.01 0.00 rho=0 rho=0.3 rho=0.45 rho=0.6 rho=0.75 20 40 60 80 100 The t test is too conservative for small sample sizes, and therefore, wrong.

t-test with the estimated equivalent sample size The actual rejection rate of the t test with n e estimated from the data Rejection rate 0.25 0.20 0.15 0.10 0.05 0.00 rho=0 rho=0.3 rho=0.45 rho=0.6 rho=0.75 20 40 60 80 100 The t test is too liberal for small sample sizes, and therefore, wrong.

t-test with the table look-up Zwiers and Storch (1994) recommend: t test should not be used with equivalent sample sizes smaller than 30 (note that 30 is for the ESS, not sample size). If ESS<30, table look-up test should be used (2 sample test): compute standard t = X Y (S 2 X +SY 2 )/n compute the pooled lag-1 correlation coefficient n i=2 ρ 1 = x i x i 1 + n i=2 y i y i 1 n i=1 y 2 + n i=1 y i 2 Use the look-up tables for n and ρ 1 to determine critical values.

Example of using the ESS in literature

Example of using the ESS in literature Use ERA-40 reanalyses to study ENSO/QBO effects on polar temperature, e.g. in the NDFJ season ( 200 months). composites for different ENSO/QBO phases are based on 10-50 months. Use monthly data to increase sample size. ESS is estimated as N e = N 1 r 1+r (an approximation for N is large, bad small sample behaviour). normality is evaluated in MC tests assuming that monthly means are independent. perform a great number of statistical significance tests. This study highlight several possible misuses of statistics: obsession with hypothesis testing (perhaps providing confidence intervals would be more preferable) the use of statistical techniques like a cook-book recipe (equivalent degrees of freedom)

Conclusions Don t be obsessed with hypothesis testing when examining observational data statistical hypothesis testing is problematic and cannot be viewed as objective and unbiased judge of the null hypothesis under these circumstances; Be careful about the effects of serial correlation on the results of statistical inference (such as estimation and hypothesis testing) some off-the-shelf methods and tools behave badly for small samples.

References Analysis of climate variability Application of Statistical Techniques. Edited by H. von Storch and A. Navarra. Springer. 1995. Thiébaux, H. J., and F. W. Zwiers, 1984: The Interpretation and Estimation of Effective Sample Size. J. Appl. Meteor., 5, 800 811 Francis W. Zwiers, and H. von Storch, 1994: Taking Serial Correlation into Account in Tests of the Mean. J. Climate, 8, 336 351. Zhang, X., and F.W. Zwiers: 2004: Comment on Applicability of prewhitening to eliminate the influence of serial correlation on the Mann-Kendall test. Water Resources Research. von Storch, H., and F. W. Zwiers, 1999: Statistical Analysis in Climate Research. Cambridge.