A Robust Method for Calculating the Correlation Coefficient

E.B. Niven and C.V. Deutsch

Relationships between primary and secondary data are frequently quantified using the correlation coefficient; however, the traditional Pearson correlation coefficient is known to be heavily influenced by outlier data points. This paper presents a method for calculating a more robust, updated correlation coefficient. The updated correlation coefficient is a weighted average of correlations calculated by leaving various combinations of data out. The proposed updated correlation coefficient is less sensitive to outlier data than either the Pearson or Spearman correlation coefficients.

Introduction

Although obtaining data from wells is expensive, especially in offshore environments, seismic geophysical data can be acquired at a fraction of the cost and its vast areal coverage provides details on the reservoir that are unreachable by wells. As a result, seismic attributes are widely used as predictors of reservoir properties. When a physically justifiable relationship between a seismic attribute and a reservoir property is demonstrated, the uncertainty of the interwell predictions of reservoir properties can be significantly reduced, leading to better geostatistical models.

With well data being so expensive to obtain, the integration of multivariate data in the presence of sparse data is of particular importance. The redundancy and the relationships between the secondary data and the variable being estimated are frequently quantified by the correlation coefficient. The traditional Pearson correlation coefficient is known to be heavily influenced by outlier data (Isaaks and Srivastava, 1989; Abdullah, 1990; Shevlyakov, 1997). This research presents a method to calculate a correlation coefficient that is more robust and less sensitive to outliers.
Pearson's Correlation Coefficient

Let $(x_1, y_1), \ldots, (x_n, y_n)$ be n observations from a bivariate normal distribution with parameters $(\mu_x, \mu_y, \sigma_x^2, \sigma_y^2, \rho)$, where $\mu_x$ and $\sigma_x^2$ are the mean and variance of x, $\mu_y$ and $\sigma_y^2$ are the mean and variance of y, and $\rho$ is the correlation coefficient between x and y given by $\rho = \beta \sigma_x / \sigma_y$, where $\beta$ is the slope parameter of the regression of y on x. The sample correlation coefficient commonly used for estimating $\rho$ is Pearson's product-moment correlation coefficient, defined by:

$$ r_p = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\left[ \sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2 \right]^{1/2}} \qquad (1) $$

The sample means for x and y are known to be sensitive to outliers. As a result, the estimate in (1) is also sensitive to outliers in either x, y, or both variables (Abdullah, 1990; Shevlyakov, 1997).

Spearman's Rank Correlation Coefficient

As an alternative to Pearson's correlation coefficient, the non-parametric Spearman's rank correlation coefficient, $r_s$, can be calculated as follows:

$$ r_s = 1 - \frac{6 D^2}{n (n^2 - 1)} \qquad (2) $$

where

$$ D^2 = \sum_{i=1}^{n} \left( r_{y_i} - r_{x_i} \right)^2 $$

and $r_{x_i}$ and $r_{y_i}$ are the ranks of $x_i$ and $y_i$, respectively. Spearman's correlation coefficient does not require the assumption that there is a linear relationship between the variables and is generally more resistant to outliers than Pearson's coefficient. However, as is shown later in this paper, Spearman's rank correlation coefficient is still unacceptably sensitive to outliers, particularly in the presence of sparse data.

Correlation in the Presence of Outliers

Figure 1 shows a scatterplot of the bivariate relationship between two variables, x and y. The scatterplot shows what appears to be a strong correlation between the two variables marred by one potential outlier data point at a location of (7, 1). The Pearson and Spearman correlation coefficients for the data shown in Figure 1 are 0.291 and 0.214, respectively. If we were sure the point at (7, 1) was an outlier, we could simply remove it from the dataset and the Pearson and Spearman correlations would increase to 0.910 and 0.943, respectively. However, if we are not sure that the point is an outlier we should probably not remove it. Note that, while the Spearman correlation coefficient is generally more resistant to the effects of outliers, in this case it is more strongly affected by the potential outlier data point.

A More Robust Correlation Coefficient

We could conduct a leave one out test (LOOT), whereby a data point is removed from the dataset and the correlation is recalculated. This procedure can be repeated n times for a dataset with n points, leaving a different data point out each time. The result is n calculated correlation coefficients. A LOOT was conducted for the data shown in Figure 1 and the results are shown in Table 1. For the first 6 leave one out tests, the resulting correlations are very low and unrepresentative of the obvious correlation in the data. However, the last test calculates a correlation of 0.910.
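The quantities discussed so far can be checked with a short Python sketch. It is only an illustration, not the authors' program: the function names are hypothetical, the data are the seven coordinates listed in Table 1 (rounded to two decimals, so printed values may differ from the paper in the last digit), and the rank helper assumes there are no tied values.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation, equation (1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Ranks 1..n in ascending order (assumption: no tied values)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation, equation (2)."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def leave_one_out(xs, ys):
    """Pearson correlation with each point removed in turn (the LOOT)."""
    return [pearson(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
            for i in range(len(xs))]

# Coordinates as listed in Table 1; the last point is the potential outlier.
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
Y = [1.98, 3.20, 3.53, 7.25, 5.44, 9.31, 1.00]

print(f"all points:    Pearson {pearson(X, Y):.3f}, Spearman {spearman(X, Y):.3f}")
print(f"without (7,1): Pearson {pearson(X[:-1], Y[:-1]):.3f}, "
      f"Spearman {spearman(X[:-1], Y[:-1]):.3f}")
for x, y, r in zip(X, Y, leave_one_out(X, Y)):
    print(f"left out ({x:.2f}, {y:.2f}): r = {r:.3f}")
```

Writing the helpers with the standard library only keeps the sketch free of external dependencies; a statistics package would normally be used in practice.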
Table 1: Resulting correlations from a leave one out test (actual correlation is 0.291)

Coordinates of Left Out Point (x, y)    Resulting Correlation
(1.00, 1.98)                            0.081
(2.00, 3.20)                            0.235
(3.00, 3.53)                            0.269
(4.00, 7.25)                            0.318
(5.00, 5.44)                            0.272
(6.00, 9.31)                            0.002
(7.00, 1.00)                            0.910

The more robust correlation coefficient is based upon the idea of a weighted average of the correlations calculated in the LOOT. The idea is to weight the correlations according to their difference from the actual correlation as follows:

$$ w_i = \left| \rho_{Actual} - \rho_{i,LOOT} \right|^{\alpha} \qquad (3) $$

where:
$w_i$ is the resulting weight assigned to each correlation calculated in the leave one out test
$\rho_{Actual}$ is the Pearson's correlation coefficient calculated from the original data
$\rho_{i,LOOT}$ is the ith Pearson's correlation coefficient, calculated from leaving out the ith data point in the leave one out test
$\alpha$ is a weighting exponent and is a function of the number of data ($\alpha = 1 + n/20$)
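A minimal Python sketch of the weighting in equation (3), carried through to the weighted average of the LOOT correlations, is given here for the seven points of Table 1. The function names are hypothetical, and the guard for the case where every weight is zero is an assumption added for this sketch, not something the paper specifies.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation, equation (1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def loot_updated_correlation(xs, ys):
    """Return (updated correlation, weights) from a leave one out test."""
    n = len(xs)
    actual = pearson(xs, ys)          # correlation of the original data
    alpha = 1 + n / 20                # weighting exponent from equation (3)
    loot = [pearson(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
            for i in range(n)]
    # Equation (3): the further a LOOT correlation is from the actual
    # correlation, the larger its weight.
    weights = [abs(actual - r) ** alpha for r in loot]
    total = sum(weights)
    if total == 0:                    # assumption: fall back to the actual value
        return actual, weights
    updated = sum(w * r for w, r in zip(weights, loot)) / total
    return updated, weights

# Coordinates as listed in Table 1 (rounded to two decimals).
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
Y = [1.98, 3.20, 3.53, 7.25, 5.44, 9.31, 1.00]

updated, weights = loot_updated_correlation(X, Y)
for x, y, w in zip(X, Y, weights):
    print(f"left out ({x:.2f}, {y:.2f}): weight = {w:.2f}")
print(f"updated correlation from the LOOT: {updated:.3f}")
```

Note that the weights are used unnormalized; only their sum in the denominator of the weighted average rescales them.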

In the above example the correlation obtained from removing the point at (7, 1) is the most different from the actual correlation of 0.291 and thus receives the most weight. Then, the more robust correlation coefficient calculated from the LOOT for sparse datasets is defined as:

$$ r_{Robust,LOOT} = \frac{\sum_{i=1}^{n} w_i \, \rho_{i,LOOT}}{\sum_{i=1}^{n} w_i} \qquad (4) $$

The updated correlation coefficient is essentially a weighted average of the correlations calculated in the LOOT, where the weights are defined in (3). The weighting scheme in (3) assigns the greatest weights to the correlations that are the most different from the actual Pearson's correlation coefficient. The idea is that the data points that have the biggest impact on the correlation are the ones that are most likely to be outliers. For the data shown in Figure 1, the LOOT correlations, their weights and the updated correlation coefficient are shown in Table 2. As is shown in the table, the updated correlation coefficient from the LOOT is calculated to be 0.569, which seems reasonable given that we may not want to totally exclude the potential outlier.

Table 2: Resulting correlations and weights from a leave one out test (actual correlation is 0.291)

Coordinates of Left Out Point (x, y)    Resulting Correlation    Weight (w_i)
(1.00, 1.98)                            0.081                    0.12
(2.00, 3.20)                            0.235                    0.02
(3.00, 3.53)                            0.269                    0.01
(4.00, 7.25)                            0.318                    0.08
(5.00, 5.44)                            0.272                    0.05
(6.00, 9.31)                            0.002                    0.19
(7.00, 1.00)                            0.910                    0.52

Updated Correlation from Leave One Out Test = 0.569

Consider another example dataset, shown on the right of Figure 1. The figure shows a dataset with 9 data points. With the exception of two potential outlier data points near an (x, y) location of (12, 2), the figure appears to show a strong correlation between x and y. However, due to the influence of the two potential outlier points, the resulting Pearson's and Spearman's correlation coefficients are -0.050 and -0.133, respectively. If we were certain that the two points near (12, 2) were outliers, we could remove them from the dataset and the Pearson correlation coefficient would increase to 0.853.
Note that the data shown on the right of Figure 1 represent another case where the rank correlation coefficient performs worse than Pearson's correlation coefficient. If we calculate an updated correlation based upon the LOOT, the updated correlation improves to 0.083. The reason for the meager improvement in the updated correlation calculated from the LOOT is that there are two outliers affecting the otherwise strong correlation in the data. So, when only one of those points is left out, the correlation is still adversely affected by the other remaining outlier.

Thus, we could perform a leave two out test, where every combination of two data is left out with a correlation calculated each time. In the leave two out test, the resulting updated correlation is 0.176, which is a slight improvement over the LOOT updated correlation. We could also perform a leave three out test and a leave four out test and so on, up to a leave n-3 out test. For the leave n-3 out test, correlations are calculated for each three-point subset of the original data. If we performed each of these tests, leaving out a different amount of data each time, we would be left with n-3 updated correlations. We could then weight each of those correlations as follows:

$$ w_X = \left| \rho_{Actual} - \rho_X \right|^{\alpha} \qquad (5) $$

where:
$\rho_{Actual}$ is the original data correlation
$\rho_X$ is the updated correlation calculated in each leave x-data out test
$w_X$ are the weights calculated for each updated correlation from the leave x-data out test
$\alpha$ is the same weighting exponent as in (3) (i.e. $\alpha = 1 + n/20$)

Then the overall updated correlation is calculated as follows:

$$ r_{OverallUpdated} = \frac{\sum_{X=1}^{\lfloor (3n-3)/4 \rfloor} w_X \, \rho_X}{\sum_{X=1}^{\lfloor (3n-3)/4 \rfloor} w_X} \qquad (6) $$

It was noted that as the number of data increases, the updated correlations calculated using only small amounts of data (for example, with 20 data, a correlation calculated from 4 data points while leaving out 16) tend to adversely affect the overall updated correlation. So, rather than considering correlations for every possible subset of data, the overall updated correlation only considers correlations calculated by leaving out 1 through $(3n-3)/4$ data points (where the ratio is always rounded down). Although this is a rather arbitrary decision, it seems to have improved the results considerably.

When the overall updated correlation coefficient is calculated for the data in Figure 1, the algorithm calculates updated correlations leaving out 1 through 4 data points. The overall updated correlation is calculated to be 0.229. The updated correlations and the overall updated value are shown in Table 3.

Table 3: Updated correlations from the leave x-data out test and the overall updated correlation for the data in Figure 1 (right)

Number of Data Left Out    Updated Correlation
1 Data Point               0.280
2 Data Points              0.251
3 Data Points              0.176
4 Data Points              0.083

Overall Updated Correlation Coefficient = 0.229

Breakdown Properties of the Overall Updated Correlation

To illustrate the breakdown properties of the overall updated correlation coefficient in (6) compared with the Pearson and Spearman correlation coefficients, a simulation study was carried out as presented in Abdullah (1990).
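Since equation (6) defines the statistic examined in this study, a hedged Python sketch of equations (5) and (6) may help before the details of the simulation. It rests on several assumptions that the paper does not spell out: each leave x-data out test is assumed to produce its updated correlation with the same weighting as the LOOT, the upper limit is taken as (3n - 3)/4 rounded down, subsets with zero variance are skipped, and random sampling replaces exhaustive enumeration whenever there are more than `max_combos` combinations. All function and parameter names are hypothetical.

```python
import random
from itertools import combinations
from math import comb, sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation, equation (1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def weighted_update(actual, rhos, alpha):
    """Weighted average with weights |actual - rho|**alpha."""
    weights = [abs(actual - r) ** alpha for r in rhos]
    total = sum(weights)
    if total == 0:                      # assumption: nothing differs from actual
        return actual
    return sum(w * r for w, r in zip(weights, rhos)) / total

def leave_x_out(xs, ys, x_out, actual, alpha, max_combos, rng):
    """Updated correlation from leaving x_out points out at a time."""
    n, keep = len(xs), len(xs) - x_out
    if comb(n, keep) <= max_combos:     # small case: every combination
        subsets = combinations(range(n), keep)
    else:                               # large case: random subsets instead
        subsets = (rng.sample(range(n), keep) for _ in range(max_combos))
    rhos = []
    for subset in subsets:
        try:
            rhos.append(pearson([xs[i] for i in subset],
                                [ys[i] for i in subset]))
        except ZeroDivisionError:       # assumption: skip degenerate subsets
            continue
    return weighted_update(actual, rhos, alpha)

def overall_updated_correlation(xs, ys, max_combos=1000, seed=0):
    """Return (overall updated correlation, {x_out: updated correlation})."""
    n = len(xs)
    actual = pearson(xs, ys)
    alpha = 1 + n / 20
    max_out = (3 * n - 3) // 4          # assumed reading of the upper limit
    rng = random.Random(seed)
    per_x = {k: leave_x_out(xs, ys, k, actual, alpha, max_combos, rng)
             for k in range(1, max_out + 1)}
    return weighted_update(actual, list(per_x.values()), alpha), per_x

# Seven-point example, coordinates as listed in Table 1.
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
Y = [1.98, 3.20, 3.53, 7.25, 5.44, 9.31, 1.00]

overall, per_x = overall_updated_correlation(X, Y)
for k, r in per_x.items():
    print(f"leaving out {k}: updated correlation = {r:.3f}")
print(f"overall updated correlation = {overall:.3f}")
```

The sketch is run on the seven-point dataset because those coordinates are the only ones tabulated in the paper; it is not expected to reproduce Table 3, which refers to the nine-point dataset.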

First, 100 good observations are generated according to the linear relation y = 2 + x + u, where x is drawn randomly from a normal distribution with a mean of 5.0 and a variance of 1.0, and u is drawn from a normal distribution with a mean of 0 and a standard deviation of 0.2. The results were as follows: $r_{Pearson}$ = 0.984, $r_{Spearman}$ = 0.976 and $r_{OverallUpdated}$ = 0.941.

Next, the data were slowly contaminated. In increments of 10 data points, the good data were replaced with bad data points. The contaminated data points were generated with x uniformly distributed on [5, 10] and y drawn from a normal distribution with a mean of 2 and a standard deviation of 0.2. This was repeated until only 50 good observations remained. Table 4 shows the values of Pearson's, Spearman's and the overall updated correlation when good observations are replaced by increasing fractions of outliers. A breakdown plot that illustrates the value of the correlation coefficients as a function of the percentage of outliers is shown in Figure 2.

Table 4: Pearson, Spearman and overall updated correlation coefficients as good observations are replaced by increasing fractions of outliers

Contamination (%)    Pearson ρ    Spearman ρ    Overall Updated ρ
0                     0.984        0.976        0.941
10                   -0.070        0.503        0.963
20                   -0.317        0.195        0.92
30                   -0.451       -0.091        0.947
40                   -0.603       -0.380        0.391
50                   -0.601       -0.489        0.266

Discussion

A Fortran 90 program called ROBUSTCORRCO was created which automatically calculates the updated correlation coefficients for each leave x-data out test as well as an overall updated correlation. In cases where there are more than approximately 20 data points, the number of data combinations in the leave x-data out tests, and hence the time needed to calculate them, becomes prohibitively large. As a result, a Monte Carlo-like simulation approach was implemented in which 1000 data combinations for each test are randomly sampled rather than calculating every possible data combination.

One example of an application of the more robust updated correlation coefficient is in collocated cosimulation.
In collocated co-simulation a Markov-type assumption is made where collocated secondary information is assumed to screen further away data of the same type. The correlation coefficient is used to specify the relationship between primary and secondary data. However, in some cases, such as in offshore oil and gas reservoirs, there may be few wells or samples upon which a correlation may be calculated. In this case, the overall updated correlation coefficient could be calculated and compared to the traditional Pearson correlation. However, even using the updated correlation coefficient, there is still significant uncertainty in the true underlying correlation. The overall updated correlation and the number of data could be used to define a distribution of uncertainty in the correlation coefficient as outlined in Niven and Deutsch (2008).

Conclusions

The relationship between primary and secondary data in geostatistics is frequently estimated using the correlation coefficient. However, the traditional Pearson and Spearman correlation coefficients are very sensitive to outlier data points. A method for calculating an updated correlation coefficient is presented in this paper. The updated correlation coefficient is shown to be more robust than either Pearson's or

Spearman's correlation coefficient. Although the method is somewhat ad hoc, it appears to work well for a large range of examples.

References

Abdullah, M. B. 1990. On a robust correlation coefficient. The Statistician 39: 455-60.

Isaaks, E. H., and R. M. Srivastava. 1989. An introduction to applied geostatistics. New York: Oxford University Press, Inc.

Niven, E. B., and C. V. Deutsch. 2008. Application of the sampling distribution for the correlation coefficient. Centre for Computational Geostatistics: Report 10.

Shevlyakov, G. L. 1997. On robust estimation of a correlation coefficient. Journal of Mathematical Sciences 83, (3): 434-8.

Figure 1: An example dataset with an otherwise high correlation between x and y marred by one apparent outlier data point and a second example dataset with two potential outliers.

[Figure 2 plot: correlation coefficient (vertical axis) against contamination in % (horizontal axis, 0 to 50), with curves for Pearson, Overall Updated and Spearman]

Figure 2: Simulation study comparing effect of contaminated data on the Pearson, Spearman and proposed overall updated correlation coefficient.
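The contamination scheme behind Figure 2 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used for the study: the function names and the random seed are hypothetical, only the Pearson and Spearman coefficients are computed (the overall updated correlation is omitted to keep the run short), and the rank helper assumes no tied values, which is reasonable for continuous random draws. Because the draws are random, the printed values will not match Table 4 exactly.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation, equation (1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Ranks 1..n in ascending order (assumption: no tied values)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation, equation (2)."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def contamination_study(n_good=100, step=10, max_bad=50, seed=1):
    """Return [(contamination %, Pearson, Spearman), ...]."""
    rng = random.Random(seed)
    # Good data: y = 2 + x + u, x ~ N(5, variance 1), u ~ N(0, sd 0.2).
    xs = [rng.gauss(5.0, 1.0) for _ in range(n_good)]
    ys = [2.0 + x + rng.gauss(0.0, 0.2) for x in xs]
    results = [(0.0, pearson(xs, ys), spearman(xs, ys))]
    # Replace good points with bad ones, `step` at a time, cumulatively.
    for start in range(0, max_bad, step):
        for i in range(start, start + step):
            xs[i] = rng.uniform(5.0, 10.0)   # bad x ~ U[5, 10]
            ys[i] = rng.gauss(2.0, 0.2)      # bad y ~ N(2, sd 0.2)
        percent = 100.0 * (start + step) / n_good
        results.append((percent, pearson(xs, ys), spearman(xs, ys)))
    return results

for percent, r_p, r_s in contamination_study():
    print(f"{percent:5.1f}% contaminated: Pearson {r_p:+.3f}, Spearman {r_s:+.3f}")
```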