Monte Carlo error analyses of Spearman s rank test

Similar documents
7.1 Sampling Error The Need for Sampling Distributions

UNIT 4 RANK CORRELATION (Rho AND KENDALL RANK CORRELATION

Bootstrap. Director of Center for Astrostatistics. G. Jogesh Babu. Penn State University babu.

1 Overview. Coefficients of. Correlation, Alienation and Determination. Hervé Abdi Lynne J. Williams

Analysing data: regression and correlation S6 and S7

VERITAS detection of VHE emission from the optically bright quasar OJ 287

Lecture 30. DATA 8 Summer Regression Inference

Outline. Confidence intervals More parametric tests More bootstrap and randomization tests. Cohen Empirical Methods CS650

Physics 509: Bootstrap and Robust Parameter Estimation

Statistics notes. A clear statistical framework formulates the logic of what we are doing and why. It allows us to make precise statements.

The assumptions are needed to give us... valid standard errors valid confidence intervals valid hypothesis tests and p-values

Detection ASTR ASTR509 Jasper Wall Fall term. William Sealey Gosset

Practical Statistics

Brandon C. Kelly (Harvard Smithsonian Center for Astrophysics)

Bootstrapping, Randomization, 2B-PLS

The H.E.S.S. Standard Analysis Technique

Dos and don ts of reduced chi-squared

The Normal Distribution. Chapter 6

Physics 509: Non-Parametric Statistics and Correlation Testing

arxiv: v1 [nucl-th] 26 Mar 2017

Correlation. We don't consider one variable independent and the other dependent. Does x go up as y goes up? Does x go down as y goes up?

Correlation and Regression

Correlation. Engineering Mathematics III

The Nonparametric Bootstrap

Permutation Tests. Noa Haas Statistics M.Sc. Seminar, Spring 2017 Bootstrap and Resampling Methods

A Calculator for Confidence Intervals

Critical Values for the Test of Flatness of a Histogram Using the Bhattacharyya Measure.

The exact bootstrap method shown on the example of the mean and variance estimation

Cosmology with Galaxy Clusters. I. A Cosmological Primer

Simulating Realistic Ecological Count Data

Nonparametric Statistics Notes

CORRELATION. compiled by Dr Kunal Pathak

RECOMMENDED TOOLS FOR SENSITIVITY ANALYSIS ASSOCIATED TO THE EVALUATION OF MEASUREMENT UNCERTAINTY

Uncertainty in Ranking the Hottest Years of U.S. Surface Temperatures

The Jackknife-Like Method for Assessing Uncertainty of Point Estimates for Bayesian Estimation in a Finite Gaussian Mixture Model

Correlation and regression

STAT440/840: Statistical Computing

Nanda Rea. XMM-Newton reveals magnetars magnetospheric densities. University of Amsterdam Astronomical Institute Anton Pannekoek NWO Veni Fellow

Bootstrap tests. Patrick Breheny. October 11. Bootstrap vs. permutation tests Testing for equality of location

A general assistant tool for the checking results from Monte Carlo simulations

Galactic and extragalactic hydrogen in the X-ray spectra of Gamma Ray Bursts

arxiv: v1 [astro-ph.im] 14 Sep 2017

Does the optical-to-x-ray energy distribution of quasars depend on optical luminosity?

Rest-frame properties of gamma-ray bursts observed by the Fermi Gamma-Ray Burst Monitor

Selection should be based on the desired biological interpretation!

The bootstrap. Patrick Breheny. December 6. The empirical distribution function The bootstrap

Chapter 11. Correlation and Regression

Data Analysis I. Dr Martin Hendry, Dept of Physics and Astronomy University of Glasgow, UK. 10 lectures, beginning October 2006

Inflation in a general reionization scenario

Stages in scientific investigation: Frequency distributions and graphing data: Levels of measurement:

Abstract Title Page. Title: Degenerate Power in Multilevel Mediation: The Non-monotonic Relationship Between Power & Effect Size

Estimation of Quantiles

Power and nonparametric methods Basic statistics for experimental researchersrs 2017

Statistics of Accumulating Signal

Is the Relationship between Numbers of References and Paper Lengths the Same for All Sciences?

Nonparametric Statistics. Leah Wright, Tyler Ross, Taylor Brown

Fully Bayesian Analysis of Calibration Uncertainty In High Energy Spectral Analysis

Class 11 Maths Chapter 15. Statistics

THE CONNECTION BETWEEN SPECTRAL EVOLUTION AND GAMMA-RAY BURST LAG Dan Kocevski and Edison Liang

MASSACHUSETTS INSTITUTE OF TECHNOLOGY PHYSICS DEPARTMENT

A Semi-Analytical Computation of the Theoretical Uncertainties of the Solar Neutrino Flux

Psychology 282 Lecture #4 Outline Inferences in SLR

Problem 1: Toolbox (25 pts) For all of the parts of this problem, you are limited to the following sets of tools:

Using BATSE to Measure. Gamma-Ray Burst Polarization. M. McConnell, D. Forrest, W.T. Vestrand and M. Finger y

Optical Systems Program of Studies Version 1.0 April 2012

2D Image Processing (Extended) Kalman and particle filter

A better way to bootstrap pairs

A SEARCH FOR FAST RADIO BURSTS WITH THE GBNCC SURVEY. PRAGYA CHAWLA McGill University (On Behalf of the GBNCC Collaboration)

Session III: New ETSI Model on Wideband Speech and Noise Transmission Quality Phase II. STF Validation results

Sensitivity Analysis with Correlated Variables

HYPOTHESIS TESTING II TESTS ON MEANS. Sorana D. Bolboacă

Search for a diffuse cosmic neutrino flux with ANTARES using track and cascade events

Advanced Statistics II: Non Parametric Tests

Big Data Analysis with Apache Spark UC#BERKELEY

Parameter estimation and forecasting. Cristiano Porciani AIfA, Uni-Bonn

Parameter Estimation. William H. Jefferys University of Texas at Austin Parameter Estimation 7/26/05 1

Long-Term X-ray Spectral Variability of AGN

Glauber modelling in high-energy nuclear collisions. Jeremy Wilkinson

Practical Solutions to Behrens-Fisher Problem: Bootstrapping, Permutation, Dudewicz-Ahmed Method

Finite Population Correction Methods

Simulating Uniform- and Triangular- Based Double Power Method Distributions

An Auger Observatory View of Centaurus A

Confidence Intervals in Ridge Regression using Jackknife and Bootstrap Methods

Statistical Methods for Astronomy

Accounting for Calibration Uncertainty in Spectral Analysis. David A. van Dyk

Monte Carlo Radiation Transport Kenny Wood

2010 Physics GA 3: Examination 2

A Pedagogic Demonstration of Attenuation of Correlation Due to Measurement Error

Permutation tests. Patrick Breheny. September 25. Conditioning Nonparametric null hypotheses Permutation testing

Average value of available measurements of the absolute air-fluorescence yield

Primer on statistics:

Numerical integration and importance sampling

Using Gamma Ray Bursts to Estimate Luminosity Distances. Shanel Deal

Parameter Uncertainty in the Estimation of the Markov Model of Labor Force Activity: Known Error Rates Satisfying Daubert

arxiv: v1 [astro-ph.he] 14 Feb 2013

arxiv: v1 [stat.co] 26 May 2009

E509A: Principle of Biostatistics. (Week 11(2): Introduction to non-parametric. methods ) GY Zou.

Accounting for Calibration Uncertainty in Spectral Analysis. David A. van Dyk

A thermodynamic origin of a universal mass-angular momentum relationship

TECHNICAL REPORT # 59 MAY Interim sample size recalculation for linear and logistic regression models: a comprehensive Monte-Carlo study

Transcription:

Monte Carlo error analyses of Spearman s rank test arxiv:1411.3816v2 [astro-ph.im] 29 May 2015 Peter A. Curran International Centre for Radio Astronomy Research, Curtin University, GPO Box U1987, Perth, WA 6845, Australia Spearman s rank correlation test is commonly used in astronomy to discern whether a set of two variables are correlated or not. Unlike most other quantities quoted in astronomical literature, the Spearman s rank correlation coefficient is generally quoted with no attempt to estimate the errors on its value. This is a practice that would not be accepted for those other quantities, as it is often regarded that an estimate of a quantity without an estimate of its associated uncertainties is meaningless. This manuscript describes a number of easily implemented, Monte Carlo based methods to estimate the uncertainty on the Spearman s rank correlation coefficient, or more precisely to estimate its probability distribution. 1 Introduction Spearman s rank correlation test (Spearman, 1904) is commonly used in astronomy to discern whether a pair of two variables are correlated or not. It has an advantage over the Pearson correlation test 1 which requires a linear relationship between the two variables as it is non-parametric. This non-parametric nature of the test is useful as it is model independent so long as the relationship is monotonic, i.e. is a steadily increasing or decreasing function. Unlike other quantities in literature, Spearman s rank correlation coefficient,ρ, is generally quoted with no attempt to estimate the uncertainties on its value; though the data used to calculate the coefficient will almost certainly have some measurement uncertainty, in at least one of the variables. In the fields of e.g. bio-medicine and clinical medicine, where Spearman s rank test is regularly used, the method of Monte Carlo bootstrapping (Efron, 1979) is used to estimate the confidence intervals of the coefficient (e.g., Haukoos & Lewis 2005); this involves resampling the data by drawing random entries from the original data set to create multiple resampled data sets of the same size as the original. The method does have limitations, as described by Haukoos & Lewis (2005), primarily that the method assumes that the sample is representative of the overall population which is of particular concern for small sample sizes. However, few other methods are available to estimate the uncertainties. The principal difference between astronomical and clinical data is that the astronomical data will often, though not always, have associated measurement uncertainties (e.g., flux, magnitude, distance) while clinical data is often count-based (e.g., frequency of success in clinical trials) with no intrinsic probability distribution on the individual data points. Of course count based astronomical data is commonplace (e.g. population studies) but even in these cases an attempt is often made to approximate the email:peter.curran@curtin.edu.au 1 Here error analyses of the Spearman test are discussed, though the methods can (and should) also be applied to the Pearson test. 1

probability distribution of the data by assuming e.g. Poissonian uncertainties. Here three Monte Carlo methods of error analyses of the Spearman s rank coefficient the bootstrap/resampling method, the perturbation method and composite method are described and demonstrated using a sample data set. Importantly, the latter two methods exploit the uncertainties or probability distributions which are generally associated with astronomical data. 2 Method Given a data set consisting of N data pairs, X i and Y i, each individual entry is assigned an ascending order rank, RX i and RY i, and the Spearman s rank correlation coefficient,ρ, for the sample is calculated from the square of the difference of the two ranks for each pair by ρ=1 6 N (RX i RY i ) 2 i=1 N(N 2 1). A coefficient of 0 corresponds to no correlation between the variables, while a value of+1 or 1 corresponds to a perfect increasing or decreasing monotonic correlation. The significance of the correlation may be calculated in a number of ways, such as a Student s t-test (Zar, 1972). Here the z-score, z, is used since it approximately follows a Gaussian, or normal, distribution in this case, i.e., a z-score of z approximately corresponds to a Gaussian significance of the correlation of σ z. The N 3 z-score is calculated by z = F(ρ), where F(ρ)=arctanh(ρ)=ln[(1+ρ)/(1 ρ)]/2 is the Fisher 1.06 transformation of the correlation coefficient, which goes to infinity asρgoes to 1. The aforementioned bootstrap or resampling method of error analysis involves creating M new data sets, each consisting of N data pairs, x i and y i, where for statistical significance, M 1000. Each of these new pairs is a randomly chosen pair from the original data set, X j and Y j, where j is the randomly chosen entry, e.g., x i = X j and y i = Y j, such that some of the original pairs may appear more than once in a given data set or not at all. The data points are again assigned a rank, Rx i and Ry i, and the Spearman s rank correlation coefficient and z-score for each of these M new data sets calculated. The distributions of the returned values are used to estimate the probability distribution of the two quantities, by normalising so that the integral is equal to one. In the simplest case of this probability distribution being Gaussian, the estimate of the correlation coefficient, ˆρ= ρ, the average of the calculated values and the estimate of the error is the Gaussian width of the distribution,σ ρ, i.e., the standard deviation of the calculated values; the z-score is similarly estimated as z±σ 2 z. The resampling method obviously does not take into account the uncertainties, X i and Y i, likely associated with the pairs. To do so a perturbation method may be used, where one creates multiple new data sets (M 1000) consisting of N data pairs, x i and y i, each of which is perturbed from the original values, X i and Y i, by adding a random Gaussian 3 number times the uncertainty on that point, e.g., x i = X i +G X i, wheregis a number, drawn randomly from a Gaussian distribution of width 1 and centered on 0. This is done independently per point (assuming that the uncertainties are independent) so that a different random Gaussian number,g, is used to perturb the X and Y values. The Spearman s rank correlation coefficient and z-score are then calculated for each of these M data 2 For a much fuller discussion of using probability distributions, Gaussian or otherwise, to estimate quantities see Andrae (2010), which also includes discussion on the use of Monte Carlo methods to estimate errors. 3 Here, as is often done, it is assumed that the measurement uncertainties are Gaussian; however, these methods only require that the probability distribution of the measurement uncertainties are known. 2

Photon Index v magnitude 1 1.5 2 2.5 17.5 18 18.5 19 Photon Index 1 1.5 2 2.5 5580 5600 5620 5640 5660 MJD 50,000 17.5 18 18.5 19 v magnitude Figure 1: Left: MAXI J0556-332 v-band Swift-UVOT light curve and photon index of spectral fit of the UVOT data. Right: the scatter plot of the same data (53 data pairs). sets and the distribution of returned values used as an estimate of the probability distribution of the two quantities, as before. Alternatively, the composite method may be used, combining the traditional resampling method and the above outlined perturbation method. In this case each new data pair, x i and y i, is perturbed from a random entry of the original data set, X j and Y j, where j is the randomly chosen entry, e.g., x i = X j +G X j. Again, the random Gaussian numbers,g, must be independent, in so far as that is possible given whatever random number generator is being used, but the two variables must have originated from the same source pair. The probability distribution of the two quantities, ρ and z, are again estimated as above. 3 Example Here I apply the standard method of estimating the Spearman s rank correlation coefficient (no error) and the different Monte Carlo methods (resampling, perturbation and composite) 4 to real physical data from the X-ray transient, MAXI J0556-332. During its 2011 outburst, MAXI J0556-332 was observed by the UVOT instrument on board the Swift satellite. When the photon indices of these optical observations were compared to the v-band magnitudes, an apparent correlation was clear (Figure 1). In fact, the Spearman s rank correlation coefficient was 0.83, implying a z-score significance of z=8.2, corresponding to a significance of the correlation of 8.2σ (N= 53). If one were to take the basic step of applying the resampling method to this (M= 1000 in this and the following cases), which is not often done, the distributions of ρ and z plotted in Figure 2 (black) are obtained. Assuming that these may be described as Gaussian distributions (clearly not the case for the correlation coefficient but a reasonable simplifying assumption nonetheless), their averages and standard deviations are as given in Table 1. It is clear from Figure 1 that there is significant uncertainties on both the spectral indices and the magnitudes which may reduce the significance of the correlation, and should be taken into account. Applying the perturbation method one finds that, indeed, the distribution and average of the 4 For these results, the methods were implemented within a C program, MCSpearman (available online; Curran 2015), but are easily implemented in many languages or programs. 3

Number 0 50 100 0.5 0.6 0.7 0.8 0.9 Spearman rank coefficient, ρ Number 0 50 100 4 6 8 10 z score Figure 2: Left: Histogram plot of Spearman rank coefficients for the MAXI J0556-332 data as derived from the resampling method (black), the perturbation method (red) and the composite method (green). Right: the distribution of the z-scores for those values (1000 iterations per test). In both plots, the arrow represents the result of the standard method. Table 1: Average ± standard deviations of the correlation coefficients, ρ, and z-scores, z, for the different methods as applied to the MAXI J0556-332 data. method ρ z standard 0.83 8.2 bootstrap 0.82 ± 0.05 8.2 ± 1.0 perturbation 0.78 ± 0.04 7.2 ± 0.6 composite 0.77 ± 0.06 7.1 ± 1.0 correlation coefficient is reduced, as is the z-score significance (Figure 2, red). Applying the composite method, one finds that it returns similar values to the perturbation method but with a wider distribution of values (Figure 2, green). In fact, the plotted distributions of the composite method clearly demonstrate that what was considered a correlation at the 8.2σ level is better described by a significance of (7.1±1.0)σ, which has a non-negligible probability of being<5σ. 4 Discussion Only one test case of these methods is presented so general conclusions should not be drawn without serious caution. For example, the results will be heavily dependent on the number of data points, the distribution of these data points in x and y, and, in the cases of the perturbation and composite methods, the size of the uncertainties on the data points. For a discussion regarding the number and distribution of data points see Haukoos & Lewis (2005). Clearly, taking the data uncertainties into account weakens the correlation between the two variables; as the data uncertainties go to zero, the perturbation method will tend to a delta function at the value returned by the standard method, while the results of the composite method will tend to those of the resampling method. It is important to understand the difference between the two (non-composite) methods and the distribution they return. The resampling method estimates the uncertainty of the correlation coefficient 4

given the uncertainty of the sample, i.e., the sample being tested is only a sub-sample of the population of all possible data pairs and the resampling method estimates the uncertainties associated with the lack of information of all those data pairs. The perturbation method estimates the uncertainty of the correlation coefficient, given only the uncertainties on the data points, i.e., this method assumes the given sample is absolutely representative of the population, or alternatively, the method only estimates the error of the given pairs in the sample, not the population as a whole. In some circumstances, one may only want to estimate the correlation coefficient (and uncertainty) of the given sample in which case the perturbation method should be used. In other circumstances, it is the uncertainty associated with the population, including its unknown entries, which dominates, and the resampling method should be used. In many cases, though only one of these dominate the probability distribution of the correlation coefficient, both should be taken into account via the composite approach. 5 Conclusion When calculating the Spearman s rank correlation coefficient of a data set, an estimate of the uncertainty of the coefficient is required, just as it is for most other quantities in astronomy. Furthermore, when the given data set has uncertainties on the individual entries, these uncertainties must be taken into account when estimating the correlation coefficient. Here I have suggested and discussed three Monte Carlo methods of accounting for the uncertainties in the data points which are easily implemented and return significant information regarding the probability distribution of the correlation coefficient. Acknowledgements The author thanks Anna D. Kapińska for constructive comments and suggestions on drafts of this manuscript. The work was supported by the Centre National d Etudes Spatiales (CNES), France through MINE: the Multi-wavelength INTEGRAL NEtwork. References Andrae R., 2010, arxiv, 1009.2755 Curran P. A., 2015, Astrophysics Source Code Library, ascl:1504.008 Efron B., 1979, The Annals of Statistics, 7, 1 Haukoos J. S., Lewis R. J., 2005, Academic Emergency Medicine, 12, 360 Spearman C., 1904, American Journal of Psychology, 15, 72 Zar J. H., 1972, Journal of the American Statistical Association, 67, 578 5