Influence Functions of the Spearman and Kendall Correlation Measures Croux, C.; Dehon, C.


Tilburg University. Citation for published version (APA): Croux, C., & Dehon, C. (2010). Influence Functions of the Spearman and Kendall Correlation Measures. (CentER Discussion Paper; Vol. 2010-40). Tilburg: Econometrics.

No. 2010-40. INFLUENCE FUNCTIONS OF THE SPEARMAN AND KENDALL CORRELATION MEASURES. By Christophe Croux, Catherine Dehon. March 2010. ISSN 0924-7815.

myjournal manuscript No. (will be inserted by the editor)

Influence functions of the Spearman and Kendall correlation measures

Christophe Croux · Catherine Dehon

Received: date / Accepted: date

Abstract Nonparametric correlation estimators such as the Kendall and Spearman correlation are widely used in the applied sciences. They are often said to be robust, in the sense of being resistant to outlying observations. In this paper we formally study their robustness by means of their influence functions and gross-error sensitivities. Since robustness of an estimator often comes at the price of an increased variance, we also compute statistical efficiencies at the normal model. We conclude that both the Spearman and Kendall correlation estimators combine a bounded and smooth influence function with a high efficiency. In a simulation experiment we compare these nonparametric estimators with correlations based on a robust covariance matrix estimator.

Keywords Asymptotic Variance · Correlation · Gross-Error Sensitivity · Influence function · Kendall correlation · Robustness · Spearman correlation

Mathematics Subject Classification (2000) 62G35 · 62F99

Christophe Croux: Faculty of Business and Economics, K.U.Leuven, Naamsestraat 69, B-3000 Leuven, Belgium, and Tilburg University, The Netherlands. E-mail: Christophe.Croux@econ.kuleuven.be
Catherine Dehon: Université libre de Bruxelles, ECARES, and Institut de Recherche en Statistique, CP-114, Av. F.D. Roosevelt 50, B-1050 Brussels, Belgium. E-mail: cdehon@ulb.ac.be

1 Introduction

The Pearson correlation is one of the most often used statistical estimators. But its value may be seriously affected by only one outlier. An important tool to measure the robustness of an estimator is the influence function. It measures the effect that an observation has on an estimator (Hampel et al., 1986). Devlin et al. (1975) showed that the influence function of the classical Pearson correlation is unbounded, proving the lack of robustness of this estimator. We refer to Morgenthaler (2007) for a survey on robust statistics.

In this paper we provide expressions for the influence functions of the popular Spearman and Kendall correlation. We show that their influence function is bounded. This confirms the general belief that these nonparametric measures of correlation are robust to outliers. Besides being robust, an estimator should also have a high statistical efficiency. At the normal distribution the Pearson correlation estimator is the most efficient, but the statistical efficiency of the Spearman and Kendall correlation estimators remains above 70% for all possible values of the population correlation. Hence they provide a good compromise between robustness and efficiency.

The paper is organized as follows. In Section 2 we review the definitions of the Spearman, Kendall and Quadrant correlation. Their influence functions are presented in Section 3 and gross-error sensitivities are given in Section 4. The asymptotic variances are computed in Section 5. Section 6 presents a simulation study comparing the performance of the different estimators at finite samples. A comparison with a robust correlation measure derived from a bivariate robust covariance matrix estimator is made. The conclusions are in Section 7.

2 Measures of Correlation

For a bivariate sample $\{(x_i, y_i), 1 \le i \le n\}$, the classical Pearson estimator of correlation is given by
$$ r_P = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}}, $$
where $\bar{x}$ and $\bar{y}$ are the sample means. To compute influence functions, it is necessary to consider the associated functional form of the estimator. Let $(X, Y) \sim H$, with $H$ an arbitrary distribution (having second moments). The population version of Pearson's correlation measure is
$$ R_P(H) = \frac{E_H[XY] - E_H[X]E_H[Y]}{\sqrt{(E_H[X^2] - E_H[X]^2)(E_H[Y^2] - E_H[Y]^2)}}, $$
and the function $H \mapsto R_P(H)$ is called the functional representation of this estimator. If the sample $(x_1, y_1), \ldots, (x_n, y_n)$ has been generated according to the distribution $H$, then the estimator $r_P$ converges in probability to $R_P(H)$. For the bivariate normal distribution with population correlation coefficient $\rho$, denoted by $\Phi_\rho$, we have $R_P(\Phi_\rho) = \rho$. The above property is called the Fisher consistency of $R_P$ at the normal model (e.g. Maronna et al., 2006).

As an alternative to Pearson's correlation, nonparametric measures of correlation using ranks and signs have been introduced. We first consider the Quadrant correlation $r_Q$ (Mosteller, 1946). It is computed by first centering the data by the coordinatewise median. Then $r_Q$ equals the frequency of observations in the first or third quadrant, minus the frequency of observations in the second or fourth quadrant:
$$ r_Q = \frac{1}{n} \sum_{i=1}^n \mathrm{sign}\{(x_i - \mathrm{median}_j(x_j))(y_i - \mathrm{median}_j(y_j))\}. $$
Here, the sign function equals 1 for positive arguments, $-1$ for negative arguments, and $\mathrm{sign}(0) = 0$. The associated functional is given by
$$ R_Q(H) = E_H[\mathrm{sign}\{(X - \mathrm{median}(X))(Y - \mathrm{median}(Y))\}]. $$
When comparing a nonparametric correlation measure with the classical Pearson correlation, one must realize that they estimate different population quantities. For $\Phi_\rho$ the bivariate normal distribution with correlation $\rho$, one has (Blomqvist, 1950)
$$ \rho_Q := R_Q(\Phi_\rho) = \frac{2}{\pi} \arcsin(\rho), $$
which is different from $\rho$, for any $\rho \neq 0$. To obtain a consistent version of the Quadrant correlation at the normal model, we apply the transformation
$$ \tilde{R}_Q(H) = \sin\Big(\frac{1}{2}\pi R_Q(H)\Big). $$

Another nonparametric correlation measure based on signs is Kendall's correlation (Kendall, 1938), given by
$$ r_K = \frac{2}{n(n-1)} \sum_{i<j} \mathrm{sign}((x_i - x_j)(y_i - y_j)). $$
The corresponding functional representation is then
$$ R_K(H) = E_H[\mathrm{sign}((X_1 - X_2)(Y_1 - Y_2))] \qquad (1) $$
where $(X_1, Y_1)$ and $(X_2, Y_2)$ are two independent copies distributed according to $H$. At normal distributions, $r_K$ estimates the same parameter as the Quadrant correlation (Blomqvist, 1950), so $\rho_K = \rho_Q = R_K(\Phi_\rho)$, and the Fisher-consistent version of Kendall's correlation is $\tilde{R}_K(H) = \sin(\frac{1}{2}\pi R_K(H))$.

Finally, the most popular nonparametric correlation measure is Spearman's rank correlation (Spearman, 1904), which equals the Pearson correlation computed from the ranks of the observations. Take $(X, Y) \sim H$, and denote $F(t) = P_H(X \le t)$ and $G(t) = P_H(Y \le t)$ the marginal cumulative distribution functions of $X$ and $Y$. Then the functional representation of Spearman's correlation is given by
$$ R_S(H) = \mathrm{Corr}(F(X), G(Y)) = 12\, E_H[F(X)G(Y)] - 3. \qquad (2) $$
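To make the estimators of this section concrete, the following minimal sketch in base R computes the four sample correlations, together with the Fisher-consistent transforms used at the normal model (the $2\sin(\pi r/6)$ transform for Spearman is formula (3) below); the simulated sample and variable names are illustrative, not taken from the paper.

```r
# Minimal sketch (base R): the four correlation estimators of Section 2.
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 0.8 * x + sqrt(1 - 0.8^2) * rnorm(n)    # bivariate normal sample, rho = 0.8

r_P <- cor(x, y)                             # Pearson correlation
r_S <- cor(x, y, method = "spearman")        # Spearman rank correlation
r_K <- cor(x, y, method = "kendall")         # Kendall correlation
r_Q <- mean(sign((x - median(x)) * (y - median(y))))   # Quadrant correlation

# Fisher-consistent versions at the bivariate normal model:
c(quadrant = sin(pi * r_Q / 2),
  kendall  = sin(pi * r_K / 2),
  spearman = 2 * sin(pi * r_S / 6))
```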

At the normal model $\Phi_\rho$ we have
$$ \rho_S := R_S(\Phi_\rho) = \frac{6}{\pi} \arcsin\Big(\frac{\rho}{2}\Big), $$
see Moran (1948). Again we see that the Spearman correlation differs from the correlation coefficient $\rho$ of the bivariate normal distribution, and the Fisher-consistent version is given by
$$ \tilde{R}_S(H) = 2 \sin\Big(\frac{1}{6}\pi R_S(H)\Big). \qquad (3) $$

3 Influence Function

As model distribution for $(X, Y)$ we take the bivariate normal distribution $\Phi_\rho$, with correlation coefficient $\rho$. We assume that the population means of $X$ and $Y$ are equal to zero, and their variances equal to one. Since all correlation measures considered in this paper are invariant with respect to linear transformations of $X$, respectively $Y$, the latter assumption is without loss of generality. The influence function (IF) of a statistical functional $R$ at the model distribution $\Phi_\rho$ is defined as
$$ \mathrm{IF}((x,y), R, \Phi_\rho) = \lim_{\varepsilon \to 0} \frac{R((1-\varepsilon)\Phi_\rho + \varepsilon \Delta_{(x,y)}) - R(\Phi_\rho)}{\varepsilon}, $$
where $\Delta_{(x,y)}$ is a Dirac measure putting all its mass at $(x,y)$. It can be interpreted as the infinitesimal effect that a small amount of contamination placed at $(x,y)$ has on $R$, for data coming from the model distribution $\Phi_\rho$. Note that the influence function is defined at the population level, and that the IF of an estimator refers to the IF of the associated functional representation of the estimator. An estimator is called B-robust if its influence function is bounded (see Hampel et al., 1986). For the Pearson correlation, Devlin et al. (1975) derived
$$ \mathrm{IF}((x,y), R_P, \Phi_\rho) = xy - \frac{\rho}{2}(x^2 + y^2), \qquad (4) $$
which is an unbounded function, showing that $R_P$ is not B-robust. The influence function associated to the Quadrant correlation is given by
$$ \mathrm{IF}((x,y), R_Q, \Phi_\rho) = \mathrm{sign}(xy) - \rho_Q, \qquad (5) $$
see Shevlyakov and Vilchevski (2002). The IF of the Kendall and Spearman correlation do not seem to have been published in the printed literature, even if they are not difficult to obtain.

Proposition 1 The influence function of the Kendall correlation is given by
$$ \mathrm{IF}((x,y), R_K, H) = 2\{2 P_H[(X - x)(Y - y) > 0] - 1 - R_K(H)\}, \qquad (6) $$
for any distribution $H$. At the bivariate normal model distribution $\Phi_\rho$ we have
$$ \mathrm{IF}((x,y), R_K, \Phi_\rho) = 2\{4\Phi_\rho(x,y) - 2\Phi(x) - 2\Phi(y) + 1 - \rho_K\}. \qquad (7) $$
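As a numerical illustration of (7), the sketch below evaluates the Kendall influence function at the bivariate normal model; it assumes the mvtnorm package for the bivariate normal cdf $\Phi_\rho(x, y)$ and is not code from the paper.

```r
# Sketch: Kendall influence function (7) at the bivariate normal model.
library(mvtnorm)   # assumed available, for pmvnorm()

IF_kendall <- function(x, y, rho) {
  Phi2 <- pmvnorm(upper = c(x, y),
                  corr = matrix(c(1, rho, rho, 1), 2))   # Phi_rho(x, y)
  rho_K <- (2 / pi) * asin(rho)                          # R_K(Phi_rho)
  2 * (4 * Phi2[1] - 2 * pnorm(x) - 2 * pnorm(y) + 1 - rho_K)
}

IF_kendall(2, -2, 0.5)   # strong negative influence in a discordant quadrant
```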

Proposition 2 The influence function of the Spearman correlation is given by
$$ \mathrm{IF}((x,y), R_S, H) = -3 R_S(H) - 9 + 12\{F(x)G(y) + E_H[F(X) I(Y \ge y)] + E_H[G(Y) I(X \ge x)]\}, \qquad (8) $$
for any distribution $H$, with $F$ and $G$ the marginal distributions of $H$, and where $I(\cdot)$ stands for the indicator function. At the bivariate normal model distribution $\Phi_\rho$ we have
$$ \mathrm{IF}((x,y), R_S, \Phi_\rho) = -3\rho_S - 9 + 12\Big\{\Phi(x)\Phi(y) + E_\Phi\Big[\Phi(X)\Phi\Big(\frac{\rho X - y}{\sqrt{1-\rho^2}}\Big)\Big] + E_\Phi\Big[\Phi(Y)\Phi\Big(\frac{\rho Y - x}{\sqrt{1-\rho^2}}\Big)\Big]\Big\}, \qquad (9) $$
with $\Phi$ the cumulative distribution function of a standard normal, and for $\rho^2 < 1$.

In an unpublished manuscript of Grize (1978), similar expressions are obtained. Proofs of Propositions 1 and 2 can be found in the Appendix. For comparing the numerical values of the different IF, it is important that all considered estimators estimate the same population quantity, i.e. are Fisher consistent. Figure 1 plots the influence function of $R_P$ and of the transformed measures $\tilde{R}_Q$, $\tilde{R}_K$ and $\tilde{R}_S$, for $\rho = 0.5$. The analytical expressions of the IF of the transformed measures are given by
$$ \mathrm{IF}((x,y), \tilde{R}_Q, \Phi_\rho) = \frac{\pi}{2}\sqrt{1-\rho^2}\; \mathrm{IF}((x,y), R_Q, \Phi_\rho) \qquad (10) $$
$$ \mathrm{IF}((x,y), \tilde{R}_K, \Phi_\rho) = \frac{\pi}{2}\sqrt{1-\rho^2}\; \mathrm{IF}((x,y), R_K, \Phi_\rho) \qquad (11) $$
$$ \mathrm{IF}((x,y), \tilde{R}_S, \Phi_\rho) = \frac{\pi}{3}\sqrt{1-\frac{\rho^2}{4}}\; \mathrm{IF}((x,y), R_S, \Phi_\rho). \qquad (12) $$

One can see from Figure 1 that the IF of the Pearson correlation is indeed unbounded. On the other hand, the influence function for the Quadrant estimator is bounded but has jumps at the coordinate axes. This means that small changes in data points close to the median of one of the marginals lead to relatively large changes in the estimator. For Kendall and Spearman the influence functions are bounded and smooth. The value of the IF for $\tilde{R}_K$ and $\tilde{R}_S$ increases fastest along the first bisection axis. It can be checked that for $\rho = 0$ the influence functions of the Spearman and Kendall estimators are exactly the same, i.e.
$$ \mathrm{IF}((x,y), \tilde{R}_K, \Phi_0) = \mathrm{IF}((x,y), \tilde{R}_S, \Phi_0) = 4\pi(\Phi(x) - 0.5)(\Phi(y) - 0.5), $$
but they differ slightly for other values of $\rho$.

4 Gross-error sensitivity

An influence function can be summarized in a single index, the gross-error sensitivity (GES), giving the maximal influence an observation may have. Formally, the GES of the functional $R$ at the model distribution $\Phi_\rho$ is defined as
$$ \mathrm{GES}(R, \Phi_\rho) = \sup_{(x,y)} |\mathrm{IF}((x,y), R, \Phi_\rho)|. \qquad (13) $$
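The expectations in (9) are one-dimensional integrals, so the Spearman influence function can be evaluated numerically with base R alone. The sketch below is an illustration under that reading of (9), not the authors' code; the final two lines check the $\rho = 0$ coincidence with the transformed Kendall IF noted above.

```r
# Sketch: Spearman influence function (9), using integrate() for the
# expectations E[Phi(X) Phi((rho X - y) / sqrt(1 - rho^2))].
IF_spearman <- function(x, y, rho) {
  s  <- sqrt(1 - rho^2)
  E1 <- integrate(function(t) pnorm(t) * pnorm((rho * t - y) / s) * dnorm(t),
                  -Inf, Inf)$value
  E2 <- integrate(function(t) pnorm(t) * pnorm((rho * t - x) / s) * dnorm(t),
                  -Inf, Inf)$value
  rho_S <- (6 / pi) * asin(rho / 2)          # R_S(Phi_rho)
  -3 * rho_S - 9 + 12 * (pnorm(x) * pnorm(y) + E1 + E2)
}

# At rho = 0 the transformed Spearman IF equals 4*pi*(Phi(x)-.5)*(Phi(y)-.5):
(pi / 3) * IF_spearman(1, 1, 0)
4 * pi * (pnorm(1) - 0.5)^2
```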

Fig. 1 Influence functions of the (a) Pearson, (b) Quadrant, (c) Spearman and (d) Kendall measures, evaluated at the bivariate normal distribution with $\rho = 0.5$.

For example, since the classical Pearson estimator is not B-robust, $\mathrm{GES}(R_P, \Phi_\rho) = \infty$. The following proposition gives the GES associated with the nonparametric measures of correlation we consider.

Proposition 3 The gross-error sensitivities (GES) of the three transformed nonparametric correlation measures are given by
$$ \text{(i)}\quad \mathrm{GES}(\tilde{R}_Q, \Phi_\rho) = \frac{\pi}{2}\sqrt{1-\rho^2}\Big[\frac{2}{\pi}\arcsin(|\rho|) + 1\Big] $$
$$ \text{(ii)}\quad \mathrm{GES}(\tilde{R}_K, \Phi_\rho) = \pi\sqrt{1-\rho^2}\Big[\frac{2}{\pi}\arcsin(|\rho|) + 1\Big] $$
$$ \text{(iii)}\quad \mathrm{GES}(\tilde{R}_S, \Phi_\rho) = \pi\sqrt{1-\frac{\rho^2}{4}}\Big[\frac{6}{\pi}\arcsin\Big(\frac{|\rho|}{2}\Big) + 1\Big], $$
for $-1 \le \rho \le 1$.
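The GES expressions of Proposition 3 are straightforward to code; the base-R sketch below reproduces the curves of Figure 2 (function names are illustrative).

```r
# Sketch: gross-error sensitivities of Proposition 3, as plotted in Figure 2.
ges_Q <- function(rho) (pi / 2) * sqrt(1 - rho^2) * ((2 / pi) * asin(abs(rho)) + 1)
ges_K <- function(rho) pi * sqrt(1 - rho^2) * ((2 / pi) * asin(abs(rho)) + 1)
ges_S <- function(rho) pi * sqrt(1 - rho^2 / 4) * ((6 / pi) * asin(abs(rho) / 2) + 1)

rho <- seq(0, 1, by = 0.01)
matplot(rho, cbind(ges_S(rho), ges_K(rho), ges_Q(rho)), type = "l",
        lty = 1:3, col = 1, xlab = expression(rho), ylab = "GES")
legend("topright", c("Spearman", "Kendall", "Quadrant"), lty = 1:3)
```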

Fig. 2 Gross-error sensitivities of the Quadrant, Spearman, and Kendall correlation, as a function of the correlation $\rho$ of the bivariate normal model distribution.

The proof of Proposition 3 is elementary. For a positive value of $\rho$, the sup in definition (13) is attained for $x$ tending to infinity, and $y$ to minus infinity (or, equivalently, $x$ tending to $-\infty$ and $y$ to $+\infty$). For a negative value of $\rho$, the largest influence corresponds to contamination at $(\infty, \infty)$ or $(-\infty, -\infty)$. The gross-error sensitivities depend on the parameter $\rho$ in a non-linear way, and are plotted in Figure 2. The Quadrant estimator has uniformly a lower GES than Kendall and Spearman, and its GES is exactly half of the GES of Kendall. On the other hand, Kendall's measure is preferable to Spearman, although the difference in GES is negligible for smaller values of $\rho$. A striking feature of Figure 2 is that the GES converges to zero, if $\rho$ tends to one, for the transformed Quadrant and Kendall correlation, but not for Spearman. The reason is that the transformation function $g(r) = \sin(\pi r/2)$ has derivative zero at $r = 1$, which is not true for the transformation needed for the consistency of the Spearman correlation, $g(r) = 2\sin(\pi r/6)$.

5 Asymptotic Variance

All considered correlation estimators are asymptotically normal at the model distribution. Let $r$ be the correlation estimator associated with the functional $R$; then
$$ \sqrt{n}(r - \rho) \xrightarrow{d} N(0, \mathrm{ASV}(R, \Phi_\rho)). $$
We use influence functions to compute the asymptotic variance, using the formula
$$ \mathrm{ASV}(R, H) = E_H[\mathrm{IF}((X,Y), R, H)^2], $$
see Hampel et al. (1986). The next proposition lists the expressions for $\mathrm{ASV}(R, \Phi_\rho)$. The proof is in the Appendix.

Proposition 4 At the model distribution $\Phi_\rho$, and for any $-1 \le \rho \le 1$, we have:
$$ \text{(i)}\quad \mathrm{ASV}(R_P, \Phi_\rho) = (1-\rho^2)^2 $$
$$ \text{(ii)}\quad \mathrm{ASV}(\tilde{R}_Q, \Phi_\rho) = (1-\rho^2)\Big(\frac{\pi^2}{4} - \arcsin^2(\rho)\Big) $$
$$ \text{(iii)}\quad \mathrm{ASV}(\tilde{R}_K, \Phi_\rho) = \pi^2(1-\rho^2)\Big(\frac{1}{9} - \frac{4}{\pi^2}\arcsin^2\Big(\frac{\rho}{2}\Big)\Big) \qquad (14) $$
$$ \text{(iv)}\quad \mathrm{ASV}(\tilde{R}_S, \Phi_\rho) = \frac{\pi^2}{9}\Big(1-\frac{\rho^2}{4}\Big)\,144\,\Big\{\frac{1}{144} - \frac{9}{4\pi^2}\arcsin^2\Big(\frac{\rho}{2}\Big) + \frac{1}{\pi^2}\int_0^{\arcsin(\rho/2)}\!\arcsin\Big(\frac{\sin x}{1+2\cos x}\Big)dx + \frac{2}{\pi^2}\int_0^{\arcsin(\rho/2)}\!\arcsin\Big(\frac{\sin x}{2(1+\cos x)}\Big)dx + \frac{1}{\pi^2}\int_0^{\arcsin(\rho/2)}\!\arcsin\Big(\frac{\sin x}{2\cos x}\Big)dx + \frac{1}{\pi^2}\int_0^{\arcsin(\rho/2)}\!\arcsin\Big(\frac{3\sin x - \sin 3x}{4\cos x}\Big)dx\Big\} \qquad (15) $$

The results in the previous proposition are not new, since expressions for the asymptotic variances can be derived from Blomqvist (1950) for the Quadrant and Kendall correlation, and from David and Mallows (1961) for the Spearman correlation at normal samples. In these older papers asymptotic expansions of $\mathrm{Var}(r)$ as a function of the sample size are given. From these, the same asymptotic variances listed above result. It is surprising, however, that in more recent literature not much attention is given to the asymptotic variances of nonparametric correlation estimators. In Bonett and Wright (2000), for example, confidence intervals for the Spearman and Kendall correlation are constructed using approximations of the asymptotic variances, while Proposition 4 provides the closed-form expressions. Most complicated is the expression for $\mathrm{ASV}(\tilde{R}_S, \Phi_\rho)$, requiring numerical integration of univariate integrals. A similar result, but expressed in more general terms, is given in Borkowf (2002).

All asymptotic variances are decreasing in $|\rho|$, and converge to zero for $|\rho|$ tending to one. The case $|\rho| = 1$ is degenerate; the data are then lying on a straight line, and estimators always return one, without any estimation error. In Figure 3 we plot asymptotic efficiencies (relative to the Pearson correlation) as a function of $\rho$, with $0 \le \rho < 1$. Most striking are the high efficiencies for the Kendall and Spearman correlation, being larger than 70% for all possible values of $\rho$. This means that Kendall and Spearman are at the same time B-robust and quite efficient. Comparing Kendall's with Spearman's correlation is favorable for Kendall, but the difference in efficiency is rather small. The Quadrant correlation has a much lower efficiency. Its efficiency even converges to zero if the true correlation is close to one.
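The closed forms (i)-(iii) and the reconstruction of (15) given above can be coded in a few lines of R, with integrate() handling the univariate integrals in (iv); this is a hedged sketch, not the authors' code, and the Spearman term inherits whatever imprecision remains in the reconstructed (15). At $\rho = 0$ the integrals vanish, so the last lines recover the classical efficiency $9/\pi^2 \approx 0.91$ for both rank estimators regardless of that reconstruction.

```r
# Sketch: asymptotic variances of Proposition 4 and efficiencies of Figure 3.
asv_P <- function(rho) (1 - rho^2)^2
asv_Q <- function(rho) (1 - rho^2) * (pi^2 / 4 - asin(rho)^2)
asv_K <- function(rho) pi^2 * (1 - rho^2) * (1 / 9 - (4 / pi^2) * asin(rho / 2)^2)
asv_S <- function(rho) {
  J <- function(f) integrate(function(x) asin(f(x)), 0, asin(rho / 2))$value
  (pi^2 / 9) * (1 - rho^2 / 4) * 144 *
    (1 / 144 - (9 / (4 * pi^2)) * asin(rho / 2)^2 +
     (1 / pi^2) * J(function(x) sin(x) / (1 + 2 * cos(x))) +
     (2 / pi^2) * J(function(x) sin(x) / (2 * (1 + cos(x)))) +
     (1 / pi^2) * J(function(x) sin(x) / (2 * cos(x))) +
     (1 / pi^2) * J(function(x) (3 * sin(x) - sin(3 * x)) / (4 * cos(x))))
}

# Efficiencies relative to Pearson at rho = 0:
asv_P(0) / asv_K(0)   # 9 / pi^2, about 0.91
asv_P(0) / asv_S(0)   # also 9 / pi^2 at rho = 0
```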

Fig. 3 Asymptotic efficiencies of the Quadrant, Spearman, and Kendall correlation measures, as a function of the correlation $\rho$ of the bivariate normal model distribution.

6 Simulation Study

By means of a simulation experiment, we try to answer two questions. First we verify whether the finite-sample variances of the estimators are close to their asymptotic counterparts, derived in Section 5. Secondly, we study how the estimators behave when outliers are introduced in the sample. We make a comparison with a robust correlation estimator derived from a robust covariance matrix. If $C(X,Y)$ is a robust covariance matrix, then the associated robust correlation measure equals
$$ R_C(H) = \frac{C_{12}(X,Y)}{\sqrt{C_{11}(X,Y)\, C_{22}(X,Y)}}. \qquad (16) $$
Hence, any robust bivariate covariance matrix $C$ leads to a robust correlation coefficient. We take for $C$ in (16) the Minimum Covariance Determinant estimator (MCD, Rousseeuw and Van Driessen, 1999) with 50% breakdown point, and additional reweighting step.¹ Since the MCD estimator estimates a multiple of the population covariance matrix at the normal distribution, we have $R_C(\Phi_\rho) = \rho$.

Simulation design without outliers. We generate $m = 1000$ samples of size $n = 20, 50, 100, 200$ from a bivariate normal with $\rho = 0.8$ (simulations for other values of $\rho$ result in similar conclusions). For each sample $j$, the correlation coefficient is estimated by $\hat{\rho}_j$, and the mean squared error (MSE) is computed as
$$ \mathrm{MSE} = \frac{1}{m} \sum_{j=1}^m (\hat{\rho}_j - \rho)^2. $$

¹ We use the R command covMcd from the robustbase package, with default options, for computing the MCD.
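A condensed version of this simulation design reads as follows in R; it assumes the MASS and robustbase packages, and cor_mcd is an illustrative helper implementing (16), not code from the paper.

```r
# Simulation sketch for Table 1 (design without outliers).
library(MASS)         # mvrnorm()
library(robustbase)   # covMcd()

cor_mcd <- function(z) {               # correlation from MCD scatter, cf. (16)
  C <- covMcd(z)$cov
  C[1, 2] / sqrt(C[1, 1] * C[2, 2])
}

rho <- 0.8; n <- 200; m <- 1000
Sigma <- matrix(c(1, rho, rho, 1), 2)
est <- replicate(m, {
  z <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
  c(pearson  = cor(z[, 1], z[, 2]),
    spearman = 2 * sin(pi * cor(z[, 1], z[, 2], method = "spearman") / 6),
    kendall  = sin(pi * cor(z[, 1], z[, 2], method = "kendall") / 2),
    quadrant = sin(pi * mean(sign((z[, 1] - median(z[, 1])) *
                                  (z[, 2] - median(z[, 2])))) / 2),
    mcd      = cor_mcd(z))
})
round(n * rowMeans((est - rho)^2), 2)  # n x MSE, as reported in Table 1
```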

Table 1 MSE (multiplied by n) of several correlation estimators at a bivariate normal distribution with $\rho = 0.8$, for sample sizes n = 20, 50, 100 and 200.

n × MSE     n = 20   n = 50   n = 100   n = 200   n = ∞
Pearson     0.17     0.14     0.14      0.13      0.13
Spearman    0.26     0.21     0.18      0.18      0.16
Kendall     0.25     0.19     0.17      0.16      0.15
Quadrant    0.67     0.66     0.62      0.60      0.58

Fig. 4 Boxplots of 1000 correlation estimates for samples of size n = 200 from a bivariate normal model distribution with $\rho = 0.8$, for several correlation measures.

Table 1 reports the MSE for the different estimators we considered. Each MSE is multiplied by the corresponding sample size n, and these quantities should converge to the asymptotic variances given in Proposition 4. As we can see from Table 1, the finite-sample MSE converges rather quickly to its asymptotic counterpart (reported under the column n = ∞). The simulation experiment confirms the findings of Section 5; the precision of the Spearman and Kendall estimators is quite close to that of the Pearson correlation, and Kendall has slightly smaller MSEs than Spearman. On the other hand, the MSE of the Quadrant correlation is much larger. To gain insight in the distribution of the (transformed) Quadrant, Spearman, and Kendall correlation, we present boxplots of the m = 1000 simulated estimates, for n = 200. From Figure 4 we see that all correlation estimators are nearly unbiased. The distributions are almost symmetric, where the lower tail is slightly more pronounced than the upper tail.

Simulation designs with outliers. In the second simulation scheme we generate m = 1000 samples of size n = 200 from the distribution $\Phi_\rho$, with $\rho = 0.8$. A certain percentage $\varepsilon$ of the observations is then replaced by outliers. We consider (i) outliers at position (5, −5), where the influence function is close to its most extreme value, see Figure 1; (ii) correlation outliers, i.e. outliers that are not visible in the marginal

distributions, generated from the distribution $\Phi_{-\rho}$, which has the opposite correlation structure as the model distribution. The MSEs are reported in Tables 2 and 3.

Table 2 MSE (multiplied by n) of several correlation estimators at a bivariate normal distribution with $\rho = 0.8$, for sample size n = 200 and a fraction $\varepsilon$ of outliers at position (5, −5).

n × MSE    ε = 0%   ε = 1%   ε = 5%    ε = 10%
Pearson    0.13     26.87    201.96    331.28
Spearman   0.17     0.75     12.97     47.35
Kendall    0.16     0.37     5.69      24.02
Quadrant   0.61     0.72     2.49      10.42
MCD        0.37     0.36     0.31      0.28

Table 3 MSE (multiplied by n) of several correlation estimators at a bivariate normal distribution with $\rho = 0.8$, for sample size n = 200 and a fraction $\varepsilon$ of correlation outliers.

n × MSE    ε = 0%   ε = 1%   ε = 5%   ε = 10%
Pearson    0.13     0.17     0.56     1.61
Spearman   0.17     0.21     0.57     1.54
Kendall    0.16     0.18     0.40     1.14
Quadrant   0.58     0.63     0.91     1.57
MCD        0.37     0.42     0.51     0.76

Although the MSE is the smallest for the Pearson correlation if no outliers are present, this does not hold anymore in the presence of outliers. As we see from Table 2, already for 1% of outliers the MSE for Pearson is by far the largest of all considered estimators. This confirms the non-robustness of the Pearson correlation. For 1% of contamination, the MSE of the Spearman and Kendall correlation remains within bounds, with Kendall being more resistant to outliers. But for larger amounts of contamination, a substantial increase in MSE is observed for these two estimators. For $\varepsilon$ = 5%, the Quadrant estimator performs better than the two other nonparametric correlation measures. Finally, note the high robustness of the MCD-based estimator, whose MSE remains low for even 10% of contamination. We conclude that the correlation estimator associated to a highly robust covariance matrix estimator is the most resistant in the presence of clusters of large outliers.

Correlation outliers do not show up in the marginal distributions, but may still have an important effect on the sample correlation coefficient, see Table 3. For $\varepsilon$ = 1%, the Kendall correlation is most robust. But for larger levels of contamination the MCD is again the best, followed by the Kendall correlation. It is interesting to notice that the Quadrant correlation yields the highest MSE of the three nonparametric correlation estimators we considered (for $\varepsilon \le 0.1$), and is more vulnerable to correlation outliers than to more extreme outliers.
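The two contamination schemes can be sketched as a small R helper; the outlier position (5, −5) and the $\Phi_{-\rho}$ draws follow the reconstruction of the design above, and the function name is illustrative.

```r
# Sketch: the two contamination schemes of the outlier simulations.
contaminate <- function(z, eps, type = c("point", "correlation"), rho = 0.8) {
  type <- match.arg(type)
  k <- round(eps * nrow(z))
  if (k == 0) return(z)
  idx <- sample(nrow(z), k)
  z[idx, ] <- if (type == "point") {
    matrix(c(5, -5), k, 2, byrow = TRUE)     # point outliers at (5, -5)
  } else {                                   # correlation outliers, Phi_{-rho}
    MASS::mvrnorm(k, mu = c(0, 0), Sigma = matrix(c(1, -rho, -rho, 1), 2))
  }
  z
}

# Example: replace 5% of a clean sample z by correlation outliers
# z <- contaminate(z, eps = 0.05, type = "correlation")
```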

7 Conclusion

In this paper we compute the influence functions of some widely used nonparametric measures of correlation. The Spearman and Kendall correlation have a bounded and smooth influence function, and reasonably small values for the gross-error sensitivity. The gross-error sensitivity, as well as the efficiencies, depend on the true value of the correlation in a nonlinear way. The Kendall correlation measure is more robust and slightly more efficient than Spearman's rank correlation, making it the preferable estimator from both perspectives. The Quadrant correlation measure was also studied, and shown to be very robust but at the price of a low Gaussian efficiency.

Although the nonparametric correlation measures discussed in this paper are well known, and frequently used in applications, there are few papers presenting a formal treatment of their robustness properties. This paper focuses on studying the influence that observations have on the estimators, as is common in the robustness literature (e.g. Atkinson et al. 2004, Olkin and Raveh 2009). We did not consider breakdown properties. The rejoinder of Davies and Gather (2005) discusses the difficulties of finding an appropriate definition of breakdown point for correlation measures. Breakdown properties of the test statistics for independence using the Spearman and Kendall correlation are studied in Caperaa and Garralda Guillem (1997).

While this paper focuses on the Spearman and Kendall coefficient, other proposals for robust estimation of correlation have been made. For example, a correlation coefficient based on MAD and co-medians (Falk, 1998), a correlation coefficient based on the decomposition of the covariance into a difference of variances (Genton and Ma, 1999), and a multiple skipped correlation (Wilcox, 2003) have been proposed. In the simulation study we make a comparison with the robust correlation estimator associated to the Minimum Covariance Determinant, a standard robust covariance matrix estimator (see also Cerioli 2010). For small amounts of outliers, the Kendall correlation can compete in terms of robustness with the MCD, while being much simpler to compute. But in the presence of multiple outliers, the MCD is preferable.

Appendix

Proof of Proposition 1. Let $H_\varepsilon = (1-\varepsilon)H + \varepsilon\Delta_{(x,y)}$ be the contaminated distribution. It follows from (1) that
$$ R_K(H_\varepsilon) = (1-\varepsilon)^2 E_H[\mathrm{sign}((X_1-X_2)(Y_1-Y_2))] + 2\varepsilon(1-\varepsilon)E_H[\mathrm{sign}((X-x)(Y-y))] + \varepsilon^2\,\mathrm{sign}((x-x)(y-y)), $$
from which it follows that
$$ \mathrm{IF}((x,y), R_K, H) = -2R_K(H) + 2E_H[\mathrm{sign}((X-x)(Y-y))] = 2\{-R_K(H) + P_H[(X-x)(Y-y) > 0] - P_H[(X-x)(Y-y) < 0]\}. $$
At continuous distributions $H$ the above expression simplifies further into
$$ \mathrm{IF}((x,y), R_K, H) = 2\{-R_K(H) + 2P_H[(X-x)(Y-y) > 0] - 1\}, $$
confirming (6).

Using $P_{\Phi_\rho}[(X-x)(Y-y) > 0] = 2\Phi_\rho(x,y) - \Phi(x) - \Phi(y) + 1$ then yields the expression (7).

Proof of Proposition 2. Let $H_\varepsilon = (1-\varepsilon)H + \varepsilon\Delta_{(x,y)}$ be the contaminated model distribution. Then $H_\varepsilon$ has marginal distributions $F_\varepsilon = (1-\varepsilon)F + \varepsilon\Delta_x$ and $G_\varepsilon = (1-\varepsilon)G + \varepsilon\Delta_y$. It follows from (2) that
$$ R_S(H_\varepsilon) = 12(1-\varepsilon)E_H[F_\varepsilon(X)G_\varepsilon(Y)] + 12\varepsilon F_\varepsilon(x)G_\varepsilon(y) - 3, $$
from which it follows that
$$ \mathrm{IF}((x,y), R_S, H) = 12A - 12E_H[F(X)G(Y)] + 12F(x)G(y), \qquad (17) $$
with $A$ the derivative with respect to $\varepsilon$, evaluated at $\varepsilon = 0$, of
$$ E_H[F_\varepsilon(X)G_\varepsilon(Y)] = (1-\varepsilon)^2 E_H[F(X)G(Y)] + \varepsilon(1-\varepsilon)E_H[F(X)I(Y \ge y)] + \varepsilon(1-\varepsilon)E_H[G(Y)I(X \ge x)] + \varepsilon^2 E_H[I(Y \ge y)I(X \ge x)]. $$
But then
$$ A = -2E_H[F(X)G(Y)] + E_H[F(X)I(Y \ge y)] + E_H[G(Y)I(X \ge x)]. $$
Using the above formula, (17) becomes
$$ \mathrm{IF}((x,y), R_S, H) = 12\{E_H[F(X)I(Y \ge y)] + E_H[G(Y)I(X \ge x)] - 3E_H[F(X)G(Y)] + F(x)G(y)\}, $$
from which, using that $R_S(H) = 12E_H[F(X)G(Y)] - 3$, result (8) follows. For the bivariate normal distribution, the marginals are given by $F = G = \Phi$. Furthermore, one can write $Y = \rho X + \sqrt{1-\rho^2}\, Z$, with $Z$ independent of $X$ and standard normal. Then
$$ E_{\Phi_\rho}[\Phi(X)I(Y \ge y)] = E_\Phi\Big[\Phi(X)\Phi\Big(\frac{\rho X - y}{\sqrt{1-\rho^2}}\Big)\Big]. $$

Proof of Proposition 4. (i) From (4) it follows that
$$ \mathrm{ASV}(R_P, \Phi_\rho) = E_{\Phi_\rho}\Big[\Big(XY - \frac{\rho}{2}(X^2+Y^2)\Big)^2\Big] = (1-\rho^2)^2, $$
since $E_{\Phi_\rho}[X^4] = E_{\Phi_\rho}[Y^4] = 3$, $E_{\Phi_\rho}[X^2Y^2] = 1 + 2\rho^2$ and $E_{\Phi_\rho}[X^3Y] = E_{\Phi_\rho}[XY^3] = 3\rho$.

(ii) For the nonparametric Quadrant measure, using (5) and (10), we get
$$ \mathrm{ASV}(\tilde{R}_Q, \Phi_\rho) = \frac{\pi^2}{4}(1-\rho^2)(1-\rho_Q^2) = (1-\rho^2)\Big(\frac{\pi^2}{4} - \arcsin^2(\rho)\Big), $$
since $E[\mathrm{sign}(XY)] = \rho_Q$ and $E[\mathrm{sign}^2(XY)] = 1$.

(iii) From (6) and (11), we obtain
$$ \mathrm{ASV}(\tilde{R}_K, \Phi_\rho) = \pi^2(1-\rho^2)\,E_{\Phi_\rho}\Big[\Big(2P_{\Phi_\rho}[(X-X_1)(Y-Y_1) > 0 \mid X, Y] - 1 - \frac{2}{\pi}\arcsin(\rho)\Big)^2\Big], $$
which can be rewritten as
$$ \mathrm{ASV}(\tilde{R}_K, \Phi_\rho) = c\,E[(K(X,Y) - E[K(X,Y)])^2] = c\,\{E[K^2(X,Y)] - \rho_K^2\}, \qquad (18) $$
where $K(x,y) = 2P_{\Phi_\rho}[(X-x)(Y-y) > 0] - 1 = 1 - 2(\Phi(x)+\Phi(y)) + 4\Phi_\rho(x,y)$ and $c = \pi^2(1-\rho^2)$. Now
$$ E[K^2(X,Y)] = E[\mathrm{sign}((X-X_1)(Y-Y_1)(X-X_2)(Y-Y_2))] = 2P((X-X_1)(Y-Y_1)(X-X_2)(Y-Y_2) > 0) - 1, $$
where $(X_1, Y_1)$ and $(X_2, Y_2)$ are independent copies of $(X, Y)$. To simplify the above expression, denote $Z_1 = (X-X_1)/\sqrt{2}$, $Z_2 = (Y-Y_1)/\sqrt{2}$, $Z_3 = (X-X_2)/\sqrt{2}$ and $Z_4 = (Y-Y_2)/\sqrt{2}$, yielding
$$ E[K^2(X,Y)] = 2P(Z_1 Z_2 Z_3 Z_4 > 0) - 1. \qquad (19) $$
It is now easy to show that
$$ \mathrm{Cov}\begin{pmatrix} Z_1 \\ Z_2 \\ Z_3 \\ Z_4 \end{pmatrix} = \begin{pmatrix} 1 & \rho & 1/2 & \rho/2 \\ \rho & 1 & \rho/2 & 1/2 \\ 1/2 & \rho/2 & 1 & \rho \\ \rho/2 & 1/2 & \rho & 1 \end{pmatrix}. $$
By symmetry, we have
$$ P(Z_1 Z_2 Z_3 Z_4 > 0) = 2\,[P(Z_1>0, Z_2>0, Z_3>0, Z_4>0) + P(Z_1>0, Z_2>0, Z_3<0, Z_4<0) + P(Z_1>0, Z_3>0, Z_2<0, Z_4<0) + P(Z_1>0, Z_4>0, Z_2<0, Z_3<0)]. $$
The first term in the above expression is of type (r), the second term of type (w), the third term of type (r) and the fourth term of type (w), where the (r) and (w) types are defined in the Appendix of David and Mallows (1961). We then obtain
$$ P(Z_1 Z_2 Z_3 Z_4 > 0) = 2\Big[\frac{5}{18} + \frac{1}{\pi^2}\Big(\arcsin^2(\rho) - \arcsin^2\Big(\frac{\rho}{2}\Big)\Big)\Big]. \qquad (20) $$
Combining (18), (19) and (20) yields (14).
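Equation (20) can be checked by simulation: the sketch below draws from the quadrivariate normal with the covariance matrix displayed above (assuming the mvtnorm package) and compares the empirical orthant probability with the closed form.

```r
# Monte Carlo check of the orthant probability (20).
library(mvtnorm)   # assumed available, for rmvnorm()
rho <- 0.5
S <- matrix(c(1,       rho,     1 / 2,   rho / 2,
              rho,     1,       rho / 2, 1 / 2,
              1 / 2,   rho / 2, 1,       rho,
              rho / 2, 1 / 2,   rho,     1), 4, 4)
Z <- rmvnorm(1e6, sigma = S)
mean(Z[, 1] * Z[, 2] * Z[, 3] * Z[, 4] > 0)             # simulated probability
2 * (5 / 18 + (asin(rho)^2 - asin(rho / 2)^2) / pi^2)   # closed form (20)
```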

(iv) For the transformed Spearman measure, one can rewrite (12) as
$$ \mathrm{IF}((x,y), \tilde{R}_S, \Phi_\rho) = 12c\,\{k(x,y) - E[k(X,Y)]\}, $$
where $k(x,y) = F(x)G(y) + E_{\Phi_\rho}[F(X)I(Y \ge y)] + E_{\Phi_\rho}[G(Y)I(X \ge x)]$ and $c = \frac{\pi}{3}\sqrt{1-\frac{\rho^2}{4}}$. It follows that
$$ \mathrm{ASV}(\tilde{R}_S, \Phi_\rho) = 144\,\frac{\pi^2}{9}\Big(1-\frac{\rho^2}{4}\Big)\Big\{E[k^2(X,Y)] - 9\Big(\frac{1}{4} + \frac{1}{2\pi}\arcsin\Big(\frac{\rho}{2}\Big)\Big)^2\Big\}. \qquad (21) $$
Now, we must compute the expression $E[k^2(X,Y)]$, with
$$ k(x,y) = E[I(X_1 \le x)I(Y_2 \le y)] + E[I(X_2 \le X_1)I(Y_1 \ge y)] + E[I(Y_2 \le Y_1)I(X_1 \ge x)]. $$
Tedious calculations result in
$$ E[k(X,Y)^2] = E[I(X_1 \le X)I(Y_2 \le Y)I(X_3 \le X)I(Y_4 \le Y)] + 2E[I(X_1 \le X)I(Y_2 \le Y)I(X_4 \le X_3)I(Y_3 \ge Y)] + 2E[I(X_1 \le X)I(Y_2 \le Y)I(X_3 \ge X)I(Y_4 \le Y_3)] + E[I(X_2 \le X_1)I(Y_1 \ge Y)I(X_4 \le X_3)I(Y_3 \ge Y)] + 2E[I(X_2 \le X_1)I(Y_1 \ge Y)I(X_3 \ge X)I(Y_4 \le Y_3)] + E[I(X_1 \ge X)I(Y_2 \le Y_1)I(X_3 \ge X)I(Y_4 \le Y_3)], $$
from which, using the Appendix of David and Mallows (1961), we obtain the following sum of six terms:
$$ E[k(X,Y)^2] = \frac{82}{144} + \frac{9}{4\pi}\arcsin\Big(\frac{\rho}{2}\Big) + \frac{1}{\pi^2}\int_0^{\arcsin(\rho/2)}\!\arcsin\Big(\frac{\sin x}{1+2\cos x}\Big)dx + \frac{2}{\pi^2}\int_0^{\arcsin(\rho/2)}\!\arcsin\Big(\frac{\sin x}{2(1+\cos x)}\Big)dx + \frac{1}{\pi^2}\int_0^{\arcsin(\rho/2)}\!\arcsin\Big(\frac{\sin x}{2\cos x}\Big)dx + \frac{1}{\pi^2}\int_0^{\arcsin(\rho/2)}\!\arcsin\Big(\frac{3\sin x - \sin 3x}{4\cos x}\Big)dx. $$
Using the above expression and (21) results in (15).

References

1. Atkinson, A.C., Riani, M., & Cerioli, A. Exploring Multivariate Data with the Forward Search. Springer, New York (2004)
2. Blomqvist, N. On a measure of dependence between two random variables. Annals of Mathematical Statistics, 21, 593-600 (1950)
3. Bonett, D.G., & Wright, T.A. Sample size requirements for estimating Pearson, Kendall and Spearman correlations. Psychometrika, 65, 23-28 (2000)
4. Borkowf, C. Computing the nonnull asymptotic variance and the asymptotic relative efficiency of Spearman's rank correlation. Computational Statistics and Data Analysis, 39, 271-286 (2002)
5. Caperaa, P., & Garralda Guillem, A.I. Taux de résistance des tests de rang d'indépendance. The Canadian Journal of Statistics, 25, 113-124 (1997)
6. Cerioli, A. Multivariate outlier detection with high-breakdown estimators. Journal of the American Statistical Association, to appear (2010)
7. David, F.N., & Mallows, C.L. The variance of Spearman's rho in normal samples. Biometrika, 48, 19-28 (1961)
8. Davies, P.L., & Gather, U. Breakdown and groups (with discussion). The Annals of Statistics, 33, 977-1035 (2005)
9. Devlin, S.J., Gnanadesikan, R., & Kettenring, J.R. Robust estimation and outlier detection with correlation coefficients. Biometrika, 62, 531-545 (1975)
10. Falk, M. A note on the comedian for elliptical distributions. Journal of Multivariate Analysis, 67, 306-317 (1998)
11. Genton, M.G., & Ma, Y. Robustness properties of dispersion estimators. Statistics and Probability Letters, 44, 343-350 (1999)

12. Grize, Y.L. Robustheitseigenschaften von Korrelations-Schätzungen. Unpublished Diplomarbeit, ETH Zürich (1978)
13. Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., & Stahel, W.A. Robust Statistics: The Approach Based on Influence Functions. John Wiley and Sons, New York (1986)
14. Kendall, M.G. A new measure of rank correlation. Biometrika, 30, 81-93 (1938)
15. Maronna, R., Martin, D., & Yohai, V. Robust Statistics. Wiley, New York (2006)
16. Moran, P.A.P. Rank correlation and permutation distributions. Proceedings of the Cambridge Philosophical Society, 44, 142-144 (1948)
17. Morgenthaler, S. A survey of robust statistics. Statistical Methods and Applications, 15, 271-293 (2007)
18. Mosteller, F. On some useful inefficient statistics. Annals of Mathematical Statistics, 17, 377-408 (1946)
19. Olkin, I., & Raveh, A. Bounds for how much influence an observation can have. Statistical Methods and Applications, 18, 1-11 (2009)
20. Rousseeuw, P.J., & Van Driessen, K. A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41, 212-223 (1999)
21. Shevlyakov, G.L., & Vilchevski, N.O. Robustness in Data Analysis: Criteria and Methods. Modern Probability and Statistics, Utrecht (2002)
22. Spearman, C. General intelligence objectively determined and measured. American Journal of Psychology, 15, 201-293 (1904)
23. Wilcox, R.R. Inferences based on multiple skipped correlations. Computational Statistics and Data Analysis, 44, 223-236 (2003)