A Robust Method for Calculating the Correlation Coefficient

Similar documents
/ n ) are compared. The logic is: if the two

Uncertainty as the Overlap of Alternate Conditional Distributions

Comparison of Regression Lines

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4)

Psychology 282 Lecture #24 Outline Regression Diagnostics: Outliers

Department of Quantitative Methods & Information Systems. Time Series and Their Components QMIS 320. Chapter 6

Linear Correlation. Many research issues are pursued with nonexperimental studies that seek to establish relationships among 2 or more variables

Chapter 9: Statistical Inference and the Relationship between Two Variables

BOOTSTRAP METHOD FOR TESTING OF EQUALITY OF SEVERAL MEANS. M. Krishna Reddy, B. Naveen Kumar and Y. Ramu

Simulated Power of the Discrete Cramér-von Mises Goodness-of-Fit Tests

Chapter 13: Multiple Regression

Durban Watson for Testing the Lack-of-Fit of Polynomial Regression Models without Replications

Negative Binomial Regression

Statistics Chapter 4

Statistics for Economics & Business

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction

Chapter 3 Describing Data Using Numerical Measures

Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur

Statistics II Final Exam 26/6/18

SIMPLE LINEAR REGRESSION

On Outlier Robust Small Area Mean Estimate Based on Prediction of Empirical Distribution Function

Correlation and Regression. Correlation 9.1. Correlation. Chapter 9

The Ordinary Least Squares (OLS) Estimator

Basically, if you have a dummy dependent variable you will be estimating a probability.

on the improved Partial Least Squares regression

1. Inference on Regression Parameters a. Finding Mean, s.d and covariance amongst estimates. 2. Confidence Intervals and Working Hotelling Bands

Gaussian Mixture Models

Copyright 2017 by Taylor Enterprises, Inc., All Rights Reserved. Adjusted Control Limits for P Charts. Dr. Wayne A. Taylor

Basic Business Statistics, 10/e

Chapter 11: Simple Linear Regression and Correlation

Lecture Notes on Linear Regression

Cathy Walker March 5, 2010

Statistics for Managers Using Microsoft Excel/SPSS Chapter 13 The Simple Linear Regression Model and Correlation

Boostrapaggregating (Bagging)

Parametric fractional imputation for missing data analysis. Jae Kwang Kim Survey Working Group Seminar March 29, 2010

Statistics for Business and Economics

UNR Joint Economics Working Paper Series Working Paper No Further Analysis of the Zipf Law: Does the Rank-Size Rule Really Exist?

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore

2016 Wiley. Study Session 2: Ethical and Professional Standards Application

18.1 Introduction and Recap

[The following data appear in Wooldridge Q2.3.] The table below contains the ACT score and college GPA for eight college students.

Comparison of the Population Variance Estimators. of 2-Parameter Exponential Distribution Based on. Multiple Criteria Decision Making Method

Economics 130. Lecture 4 Simple Linear Regression Continued

Lecture 3 Stat102, Spring 2007

Properties of Least Squares

Computation of Higher Order Moments from Two Multinomial Overdispersion Likelihood Models

Uncertainty in measurements of power and energy on power networks

Supporting Information

Turbulence classification of load data by the frequency and severity of wind gusts. Oscar Moñux, DEWI GmbH Kevin Bleibler, DEWI GmbH

Global Sensitivity. Tuesday 20 th February, 2018

Using the estimated penetrances to determine the range of the underlying genetic model in casecontrol

January Examinations 2015

This column is a continuation of our previous column

Introduction to Regression

MLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X

Notes on Frequency Estimation in Data Streams

Testing for seasonal unit roots in heterogeneous panels

The topics in this section concern with the second course objective. Correlation is a linear relation between two random variables.

10.34 Fall 2015 Metropolis Monte Carlo Algorithm

Open Systems: Chemical Potential and Partial Molar Quantities Chemical Potential

RELIABILITY ASSESSMENT

Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur

Markov Chain Monte Carlo Lecture 6

Learning Objectives for Chapter 11

j) = 1 (note sigma notation) ii. Continuous random variable (e.g. Normal distribution) 1. density function: f ( x) 0 and f ( x) dx = 1

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity

Chapter 12 Analysis of Covariance

STAT 3008 Applied Regression Analysis

U-Pb Geochronology Practical: Background

Research Article Green s Theorem for Sign Data

On the Influential Points in the Functional Circular Relationship Models

Markov Chain Monte Carlo (MCMC), Gibbs Sampling, Metropolis Algorithms, and Simulated Annealing Bioinformatics Course Supplement

Resource Allocation and Decision Analysis (ECON 8010) Spring 2014 Foundations of Regression Analysis

LINEAR REGRESSION ANALYSIS. MODULE VIII Lecture Indicator Variables

18. SIMPLE LINEAR REGRESSION III

MACHINE APPLIED MACHINE LEARNING LEARNING. Gaussian Mixture Regression

Chapter 8 Indicator Variables

Cokriging Partial Grades - Application to Block Modeling of Copper Deposits

x i1 =1 for all i (the constant ).

STATISTICS QUESTIONS. Step by Step Solutions.

ANOMALIES OF THE MAGNITUDE OF THE BIAS OF THE MAXIMUM LIKELIHOOD ESTIMATOR OF THE REGRESSION SLOPE

Chapter 6. Supplemental Text Material

28. SIMPLE LINEAR REGRESSION III

Econ Statistical Properties of the OLS estimator. Sanjaya DeSilva

UNIVERSITY OF TORONTO Faculty of Arts and Science. December 2005 Examinations STA437H1F/STA1005HF. Duration - 3 hours

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M

LECTURE 9 CANONICAL CORRELATION ANALYSIS

Tracking with Kalman Filter

Assignment 5. Simulation for Logistics. Monti, N.E. Yunita, T.

MDL-Based Unsupervised Attribute Ranking

Feature Selection: Part 1

STAT 511 FINAL EXAM NAME Spring 2001

Pulse Coded Modulation

Copyright 2017 by Taylor Enterprises, Inc., All Rights Reserved. Adjusted Control Limits for U Charts. Dr. Wayne A. Taylor

Biostatistics 360 F&t Tests and Intervals in Regression 1

Spatial Statistics and Analysis Methods (for GEOG 104 class).

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

DERIVATION OF THE PROBABILITY PLOT CORRELATION COEFFICIENT TEST STATISTICS FOR THE GENERALIZED LOGISTIC DISTRIBUTION

Probability and Random Variable Primer

Transcription:

A Robust Method for Calculatng the Correlaton Coeffcent E.B. Nven and C. V. Deutsch Relatonshps between prmary and secondary data are frequently quantfed usng the correlaton coeffcent; however, the tradtonal Pearson correlaton coeffcent s known to be heavly nfluenced by outler data ponts. Ths paper presents a method for calculatng a more robust, updated correlaton coeffcent. The updated correlaton coeffcent s a weghted average of correlatons calculated by leavng varous combnatons of data out. The proposed updated correlaton coeffcent s less senstve to outler data than ether the Pearson or Spearman correlaton coeffcents. Introducton Although obtanng data from wells s expensve, especally n offshore envronments, sesmc geophyscal data can be acqured at a fracton of the cost and ts vast areal coverage provdes detals on the reservor that are unreachable by wells. As a result, sesmc attrbutes are wdely used as predctors of reservor propertes. When a physcally justfable relatonshp between a sesmc attrbute and a reservor property s demonstrated, the uncertanty of the nterwell predctons of reservor propertes can be sgnfcantly reduced leadng to better geostatstcal models. Wth well data beng so expensve to obtan, the ntegraton of multvarate data n the presence of sparse data s of partcular mportance. The redundancy and the relatonshps between the secondary data and the varable beng estmated are frequently quantfed by the correlaton coeffcent. The tradtonal Pearson correlaton coeffcent s known to be heavly nfluenced by outler data (Isaaks and Srvastava, 1989; Abdullah, 1990; Shevlyakov, 1997). Ths research presents a method to calculate a correlaton coeffcent that s more robust and less senstve to outlers. Pearson s Correlaton Coeffcent Let ( x1, y1),...,( xn, y n) be n observatons from a bvarate normal dstrbuton wth parameters 2 2 2 ( μx, μy, σx, σy, ρ ), where μx and σ x are the mean and varance of x, μ and 2 y σ are the mean and varance y of y, and ρ s the correlaton coeffcent between x and y gven by ρ = βσ x / σ where β s the slope y parameter of regresson of y on x. The sample correlaton coeffcent commonly used for estmatng ρ s the Pearson s product-moment correlaton coeffcent defned by: r p ( x x)( y y) = 2 2 ( x x) ( y y) The sample means for x and y are known to be senstve to outlers. As a result, the estmate n (1) s also senstve to outlers n ether x, y, or both varables (Abdullah, 1990 and Shevlyakov, 1997). 1/2 (1) Spearman s Rank Correlaton Coeffcent As an alternatve to Pearson s correlaton coeffcent, the non-parametrc Spearman s rank correlaton coeffcent, r s, can be calculated as follows: 2 6D rs = 1 nn 2 ( 1) (2) Where 2 2 D = ( r r ) n y x 116-1

And r and x r are the ranks of y x and y, respectvely. Spearman s correlaton coeffcent does not requre the assumpton that there s a lnear relatonshp between the varables and s generally more resstant to outlers than Pearson s coeffcent. However, as s shown later n ths paper, Spearman s rank correlaton coeffcent s stll unacceptably senstve to outlers, partcularly n the presence of sparse data. Correlaton n the Presence of Outlers Fgure 1 shows a scatterplot of the bvarate relatonshp between two varables, x and y. The scatterplot shows what appears to be a strong correlaton between the two varables marred by one potental outler data pont at a locaton of (7, 1). The Pearson and Spearman correlaton coeffcents for the data shown n Fgure 1 are 0.291 and 0.214, respectvely. If we were sure the pont at (7, 1) was an outler, we could smply remove t from the plot and the Pearson and Spearman correlatons would ncrease to 0.910 and 0.943, respectvely. However, f we are not sure that the pont s an outler we should probably not remove t. Note that, whle the Spearman correlaton coeffcent s generally more resstant to the effects of outlers, n ths case t s more strongly affected by the potental outler data pont. A More Robust Correlaton Coeffcent We could conduct a leave one out test (LOOT), whereby a data pont s removed from the dataset and the correlaton s recalculated. Ths procedure can be repeated n tmes for a dataset wth n ponts, leavng a dfferent data pont out each tme. The result s n calculated correlaton coeffcents. A LOOT was conducted for the data shown n Fgure 1 and the results are shown n Table 1. For the frst 6 leave one out tests, the resultng correlatons are very low and unrepresentatve of the obvous correlaton n the data. However, the last test calculates a correlaton of 0.910. Table 1: Resultng correlatons from a leave one out test (actual correlaton s 0.23) Coordnates of Left Out Pont (x,y) Resultng Correlaton (1.00, 1.98) 0.081 (2.00, 3.20) 0.235 (3.00, 3.53) 0.269 (4.00, 7.25) 0.318 (5.00, 5.44) 0.272 (6.00, 9.31) 0.002 (7.00, 1.00) 0.910 The more robust correlaton coeffcent s based upon the dea of weghted average of the correlatons calculated n the LOOT. The dea s to weght the correlatons accordng to ther dfference from the actual correlaton as follows: α w ρ Actual ρ, LOOT = (3) Where: w s the resultng weght assgned to each correlaton calculated n the leave one out test Actual s the Pearson s correlaton coeffcent calculated from the orgnal data LOOT, s the th Pearson s correlaton coeffcent calculated from leavng out the th data pont n the leave one out test α s a weghtng coeffcent and s a functon of the number of data ( α = 1 + n /20) 116-2

In the above example the correlaton obtaned from removng the pont at (7, 1) s the most dfferent from the actual correlaton of 0.291 and thus receves the most weght. Then, the more robust correlaton coeffcent calculated from the LOOT for sparse datasets s defned as: r Robust, LOOT = n = 1 w ρ n = 1, LOOT w (4) The updated correlaton coeffcent s essentally a weghted average correlaton of the correlatons calculated n the LOOT, where the weghts are defned n (3). The weghtng scheme n (3) assgns the greatest weghts to the correlatons that are the most dfferent from the actual Pearson s correlaton coeffcent. The dea s that the data ponts that have the bggest mpact on the correlaton are the ones that are most lkely to be outlers. For the data shown n Fgure 1, the LOOT correlaton weghts and updated correlaton coeffcent s as follows n Fgure 1. As s shown n the table, the updated correlaton coeffcent from the LOOT s calculated to be 0.569, seems reasonable gven that we may not want to totally exclude the potental outler. Table 2: Resultng correlatons from a leave one out test (actual correlaton s 0.23) Coordnates of Left Out Pont (x,y) Resultng Correlaton Weght (w ) (1.00, 1.98) 0.081 0.12 (2.00, 3.20) 0.235 0.02 (3.00, 3.53) 0.269 0.01 (4.00, 7.25) 0.318 0.08 (5.00, 5.44) 0.272 0.05 (6.00, 9.31) 0.002 0.19 (7.00, 1.00) 0.910 0.52 Updated Correlaton from Leave One Out Test = 0.569 Consder another example data set shown to the rght of Fgure 2. The fgure shows a dataset wth 9 data ponts. Wth the excepton of two potental outler data ponts near a (x, y) locaton of (12, 2), the fgure appears to show a strong correlaton between x and y. However, due to the nfluence of the two potental outler ponts, the resultng Pearson s and Spearman s correlaton coeffcents are -0.050 and -0.133, respectvely. If we were certan that the two ponts near (12, 2) were outlers, we could remove them from the dataset and the Pearson correlaton coeffcent would ncrease to 0.853. Note that the data shown to the rght of Fgure 2 represent another case where the rank correlaton coeffcent performs worse than the Pearson s correlaton coeffcent. If we calculate an updated correlaton based upon the LOOT, the updated correlaton mproves to 0.083. The reason for the meager mprovement n the updated correlaton calculated from the LOOT s that there are two outlers affectng the otherwse strong correlaton n the data. So, when only one of those ponts s left out, the correlaton s stll adversely affected by the other remanng outler. Thus, we could perform a leave two out test, where every combnaton of two data are left out wth a correlaton calculated each tme. In the leave two out test, the resultng updated correlaton s 0.176, whch s a slght mprovement over the LOOT updated correlaton. We could also perform a leave three out test and a leave four out test and so on, up to a leave n-3 out test. For the leave n-3 out test, correlatons are calculated for each combnaton of three pont subset of the orgnal data. If we performed each of these tests, leavng out a dfferent amount of data each tme, we would be left wth n-3 updated correlatons. We could then weght each of those correlatons as follows: 116-3

α w X, ρ Actual ρ X, = (5) Where: Actual s the orgnal data correlaton X, s the updated correlaton calculated n each leave x-data out test w X, are the weghts calculated for each updated correlaton from the leave x-data out test α s the same weghtng exponent as n (3) (.e. α = 1 + n /20) Then the overall updated correlaton s calculated as follows: r OverallUpdated = 3n 3 4 w X, X, X = 1 3n 3 4 X = 1 w ρ X, (6) It was noted that as the number of data ncreases the updated correlatons calculated usng only small amounts of data (for example, say we have 20 data and we want to calculate a correlaton by usng 4 data ponts and leavng out 16) tend to adversely affect the overall updated correlaton. So, rather than consderng correlatons for every possble subset of data, the overall updated correlaton only consders correlatons calculated by leavng out 1 through 3 n 3 (where the rato s always rounded down) data 4 ponts. Although ths s a rather arbtrary decson, t seems to have mproved the results consderably. When the overall updated correlaton coeffcent s calculated for the data n Fgure 1, the algorthm calculates updated correlatons leavng out 1 through 4 data ponts. The overall updated correlaton s calculated to be 0.229. The updated correlatons and the overall updated value s shown n Table 3. Table 3: Updated correlatons from the leave x-data out test and the overall updated correlaton for the data n Fgure 1 (rght) Leavng out the followng number of data Updated Correlaton 1 Data Pont 0.280 2 Data Ponts 0.251 3 Data Ponts 0.176 4 Data Ponts 0.083 Overall Updated Correlaton Coeffcent = 0.229 Breakdown Propertes of the Overall Updated Correlaton To llustrate the breakdown propertes of the overall updated correlaton coeffcent n (6) compared wth the Pearson and Spearman correlaton coeffcents, a smulaton study was carred out as presented n Abdullah (1990). 116-4

Frst, 100 good observatons are generated accordng to the lnear relaton y = 2 + x + u where x s drawn randomly from a normal dstrbuton wth a mean of 5.0 and a varance of 1.0. u s drawn from a normal dstrbuton wth a mean of 0 and a standard devaton of 0.2. The results were as follows: r Pearson =0.984, r Spearman =0.976 and r OverallUpdated =0.941. Next, the data was slowly contamnated. In ncrements of 10 data ponts, the good data was replaced wth bad data ponts. The conatamnated data ponts were generated accordng to the lnear relaton where x s unformly dstrbuted on [5, 10] and y s drawn from a normal dstrbuton wth a mean of 2 and a standard devaton of 0.2. Ths was repeated untl only 50 good observatons remaned. Table 4 shows the values of Pearson s, Spearman s and the overall updated correlaton when good observatons are replaced by ncreasng fractons of outlers. A breakdown plot that llustrates the value fo the correlaton coeffcents as a functon of the percentage of outlers s shown n Fgure 2. Table 4: Updated correlatons from the leave x-data out test and the overall updated correlaton for the data n Error! Reference source not found. Contamnaton (%) Pearson ρ Spearman ρ Overall Updated ρ 0 0.984 0.976 0.941 10-0.070 0.503 0.963 20-0.317 0.195 0.92 30-0.451-0.091 0.947 40-0.603-0.380 0.391 50-0.601-0.489 0.266 Dscusson A Fortran 90 program called ROBUSTCORRCO was created whch automatcally calculates the updated correlaton coeffcents for each as well as an overall updated correlaton. In cases where there are more than approxmately 20 data ponts, the tme to calculate the number of combnatons of data n the becomes prohbtvely large. As a result, a monte-carlo-lke smulaton approach was mplemented n whch 1000 data combnatons for each are randomly sampled rather than calculatng every possble data combnaton. One example of an applcaton of the more robust updated correlaton coeffcent s n collocated cosmulaton. In collocated co-smulaton a Markov-type assumpton s made where collocated secondary nformaton s assumed to screen further away data of the same type. The correlaton coeffcent s used to specfy the relatonshp between prmary and secondary data. However, n some cases such as n off shore ol and gas reservors, there may be few wells or samples upon whch a correlaton may be calculated. In ths case, the overall updated correlaton coeffcent could be calculated and compared to the tradtonal Pearson correlaton. However, even usng the updated correlaton coeffcent, there s stll sgnfcant uncertanty n the true underlyng correlaton. The overall updated correlaton and the number of data could be used to defne a dstrbuton of uncertanty n the correlaton coeffcent as outlned n Nven and Deutsch (2008). Conclusons The relatonshp between prmary and secondary data n geostatstcs s frequently estmated usng the correlaton coeffcent. However, the tradtonal Pearson and Spearman correlaton coeffcents are very senstve to outler data ponts. A method for calculatng an updated correlaton coeffcent s presented n ths paper. The updated correlaton coeffcent s shown to be more robust than ether Pearson s or 116-5

Spearman s correlaton coeffcent. Although the method s somewhat ad-hoc, t appears to work well for a large range of examples. References Abdullah, M. B. 1990. On a robust correlaton coeffcent. The Statstcan 39, : 455-60. Isaaks, E. H., and R. M. Srvastava. 1989. An ntroducton to appled geostatstcs. New York: Oxford Unversty Press, Inc. Nven, E. B., and C. V. Deutsch. 2008. Applcaton of the samplng dstrbuton for the correlaton coeffcent. Centre for Computatonal Geostatstcs: Report 10. Shevlyakov, G. L. 1997. On robust estmaton of a correlaton coeffcent. Journal of Mathematcal Scences 83, (3): 434-8. Fgure 1: An example dataset wth an otherwse hgh correlaton between x and y marred by one apparent outler data pont and a second example dataset wth two potental outlers. Correlaton Coeffcent 1 0.8 0.6 0.4 0.2 0 0.2 0.4 0.6 0.8 1 0 10 20 30 40 50 Contamnaton (%) Pearson Overall Updated Spearman Fgure 2: Smulaton study comparng effect of contamnated data on the Pearson, Spearman and proposed overall updated correlaton coeffcent. 116-6