5.3 Rank correlations. 194 Multidimensional semiquantitative data


Excerpt from Chapter 5 of: Legendre, P. & L. Legendre. 1998. Numerical ecology, 2nd English edition. Elsevier Science BV, Amsterdam. xv + 853 pages.

[...] tests the hypothesis ($H_0$) that the numbers of pairs with each sign are equal; an equivalent formulation is that the proportion of pairs with either sign is equal to 0.5. This test uses information about the direction of the differences between pairs. When the relative magnitude of the differences between pairs is also known, it becomes possible to use the more powerful Wilcoxon matched-pairs signed-ranks test. Differences between pairs are first ranked according to their magnitude (absolute values), after which the sign of the difference is affixed to each rank. The null hypothesis of the test ($H_0$) is that the sum of the ranks having a (+) sign is equal to that of the ranks with a (−) sign.

The McNemar test provides a means of comparing paired samples of binary data. For example, using binary observations (e.g. presence or absence) made at the same sites, before and after some event, one could test ($H_0$) that no overall change has occurred.

When there are several related samples ($k \geq 2$) and the data are quantitative, the parametric approach for testing ($H_0$) that the means of the k groups are equal is two-way analysis of variance, with or without replication. One classification criterion of the two-way ANOVA accounts for the variability among the k groups (as in one-way ANOVA above, for k independent samples) and the other for that among the related samples. Consider, for an example, 6 sites (i.e. k groups) that have been sampled at 5 depths in the water column (or at 5 different times, or using 5 different methods, etc.). The nonparametric equivalent, for semiquantitative data, is Friedman's two-way analysis of variance by ranks without replication, which is based on a two-way table similar to Table 5.7. In the two-way table, the k groups (e.g. 6 sites) are in rows and the corresponding samples (e.g. 5 depths) are in columns.
Values within each column are ranked separately, and the Friedman statistic (eq. 5.10) is used to test ($H_0$) that the rank totals of the various rows (i.e. groups) are equal. For binary data, the Cochran Q test is an extension to k groups of the McNemar test, described above for k = 2.

Finally, when there are several samples ($k \geq 2$), related across several classification criteria (e.g. 6 sites all sampled at 8 different times, using each time 5 different methods), multiway ANOVA is the standard parametric method for testing the null hypothesis ($H_0$) that the means of the k groups are equal (F-test). In that case, there are no obvious equivalent approaches for semiquantitative or qualitative data.

5.3 Rank correlations

Textbooks of nonparametric statistics propose a few methods only for the analysis of bi- or multivariate semiquantitative data. Section 5.1 has shown that there actually exist many numerical approaches for analysing multidimensional data, corresponding to all levels of precision (Table 5.1). These methods, which include most of those described in this book, belong to nonparametric statistics in a general sense, because they do not focus on the parameters of the data distributions. Within the specific realm of ranking tests, however, the only statistical techniques available for multidimensional semiquantitative data are two rank correlation coefficients (Spearman $r$ and Kendall $\tau$), which both quantify the relationship between two descriptors, and the coefficient of concordance (Kendall $W$), which assesses the relationship among several descriptors. These are described in some detail in the present section.

Table 5.3  Numerical example. Perfect rank correlation between descriptors $y_1$ and $y_2$.

Objects               Ranks of objects on the two descriptors
(observation units)   y1      y2
x1                    5       5
x2                    …       …
x3                    …       …
x4                    …       …
x5                    …       …

Spearman corr. coeff.

The Spearman $r$ statistic, also called $\rho$ (rho), is based on the idea that two descriptors $y_1$ and $y_2$ carry the same information if the largest object on $y_1$ also has the highest rank on $y_2$, and so on for all other objects. Two descriptors are said to be in perfect correlation when the ranks of all objects are the same on both descriptors, as in the numerical example of Table 5.3. If, however, object $x_1$, which has rank 5 on $y_1$, had rank 2 on $y_2$, it would be natural to use the difference between these ranks, $d_1 = (y_{11} - y_{12}) = (5 - 2) = 3$, as a measure of the difference between the two descriptors, for this object. For the whole set of objects, differences $d_i$ are squared before summing them, in order to prevent differences with opposite signs from cancelling each other out.

The expression for the Spearman $r$ may be derived from the general formula of correlation coefficients (Kendall, 1948):

$$r_{jk} = \frac{\sum_{i=1}^{n} (y_{ij} - \bar{y}_j)(y_{ik} - \bar{y}_k)}{\sqrt{\sum_{i=1}^{n} (y_{ij} - \bar{y}_j)^2 \, \sum_{i=1}^{n} (y_{ik} - \bar{y}_k)^2}} \tag{5.1}$$

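For ranked data, the average ranks $\bar{y}_j$ and $\bar{y}_k$ are equal, so that $(y_{ij} - \bar{y}_j) - (y_{ik} - \bar{y}_k) = (y_{ij} - y_{ik})$. One can write the difference between the ranks of object i on the two descriptors as $d_i = (y_{ij} - y_{ik}) = (y_{ij} - \bar{y}_j) - (y_{ik} - \bar{y}_k)$, which leads to:

$$\sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} (y_{ij} - \bar{y}_j)^2 + \sum_{i=1}^{n} (y_{ik} - \bar{y}_k)^2 - 2 \sum_{i=1}^{n} (y_{ij} - \bar{y}_j)(y_{ik} - \bar{y}_k)$$

Isolating the right-hand sum gives:

$$\sum_{i=1}^{n} (y_{ij} - \bar{y}_j)(y_{ik} - \bar{y}_k) = \frac{1}{2} \left[ \sum_{i=1}^{n} (y_{ij} - \bar{y}_j)^2 + \sum_{i=1}^{n} (y_{ik} - \bar{y}_k)^2 - \sum_{i=1}^{n} d_i^2 \right]$$

Using this result, eq. 5.1 is rewritten as:

$$r_{jk} = \frac{\frac{1}{2} \left[ \sum_{i=1}^{n} (y_{ij} - \bar{y}_j)^2 + \sum_{i=1}^{n} (y_{ik} - \bar{y}_k)^2 - \sum_{i=1}^{n} d_i^2 \right]}{\sqrt{\sum_{i=1}^{n} (y_{ij} - \bar{y}_j)^2 \, \sum_{i=1}^{n} (y_{ik} - \bar{y}_k)^2}} \tag{5.2}$$

The sum of ranks for each descriptor, which is the sum of the first n integers, is equal to

$$\sum_{i=1}^{n} y_{ij} = \frac{n(n + 1)}{2}$$

and the sum of their squares is

$$\sum_{i=1}^{n} y_{ij}^2 = \frac{n(n + 1)(2n + 1)}{6}$$

Since the sum of squared deviations from the mean rank is

$$\sum_{i=1}^{n} (y_{ij} - \bar{y}_j)^2 = \sum_{i=1}^{n} y_{ij}^2 - \frac{1}{n} \left( \sum_{i=1}^{n} y_{ij} \right)^2$$

one can write:

$$\sum_{i=1}^{n} (y_{ij} - \bar{y}_j)^2 = \frac{n(n + 1)(2n + 1)}{6} - \frac{n(n + 1)^2}{4} = \frac{n^3 - n}{12}$$

It follows that, when using ranks, the numerator of eq. 5.2 becomes:

$$\frac{1}{2} \left[ \sum_{i=1}^{n} (y_{ij} - \bar{y}_j)^2 + \sum_{i=1}^{n} (y_{ik} - \bar{y}_k)^2 - \sum_{i=1}^{n} d_i^2 \right] = \frac{n^3 - n}{12} - \frac{1}{2} \sum_{i=1}^{n} d_i^2$$

while its denominator reduces to:

$$\sqrt{\sum_{i=1}^{n} (y_{ij} - \bar{y}_j)^2 \, \sum_{i=1}^{n} (y_{ik} - \bar{y}_k)^2} = \frac{n^3 - n}{12}$$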

Table 5.4  Numerical example. Ranks of four objects on two descriptors, $y_1$ and $y_2$.

Objects               Ranks of objects on the two descriptors
(observation units)   y1      y2
x1                    3       3
x2                    4       1
x3                    2       4
x4                    1       2

The final formula is obtained by replacing the above two expressions in eq. 5.2. This development shows that, when using ranks, eq. 5.1 simplifies to the following formula for Spearman's $r$:

$$r_{jk} = \frac{\dfrac{n^3 - n}{12} - \dfrac{1}{2} \sum_{i=1}^{n} d_i^2}{\dfrac{n^3 - n}{12}} = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n^3 - n} \tag{5.3}$$

Alternatively, the Spearman rank correlation coefficient may be obtained in two steps: (1) replace all observations by ranks (columnwise) and (2) compute the Pearson correlation coefficient (eq. 4.7) between the ranked variables. The result is the same as obtained from eq. 5.3.

The Spearman $r$ coefficient varies between +1 and −1, just like the Pearson $r$. Descriptors that are perfectly matched, in terms of ranks, exhibit values $r = +1$ (direct relationship) or $r = -1$ (inverse relationship), whereas $r = 0$ indicates the absence of a monotonic relationship between the two descriptors. (Relationships that are not monotonic, e.g. Fig. 4.4d, can be quantified using polynomial or nonlinear regression, or else contingency coefficients; see Sections 6.2 and 10.3.)

Numerical example. A small example (ranked data, Table 5.4) illustrates the equivalence between eq. 5.1 computed on ranks and eq. 5.3. The mean rank is 2.5 on both descriptors, the sum of cross-products of deviations is −2, and each sum of squared deviations is 5. Using eq. 5.1 gives:

$$r_{12} = \frac{-2}{\sqrt{5 \times 5}} = \frac{-2}{5} = -0.40$$

The same result is obtained from eq. 5.3, with $\sum d_i^2 = 0 + 9 + 4 + 1 = 14$:

$$r_{12} = 1 - \frac{6 \times 14}{4^3 - 4} = 1 - \frac{84}{60} = 1 - 1.40 = -0.40$$
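The equivalence between the two routes to Spearman's $r$ can be checked numerically. The following is a minimal sketch, not a reference implementation: the function names are illustrative, and the ranks are those read from Table 5.4, assumed here to be $x_1 = (3, 3)$, $x_2 = (4, 1)$, $x_3 = (2, 4)$ and $x_4 = (1, 2)$. The shortcut formula is only valid when there are no ties.

```python
from math import sqrt


def spearman_from_differences(ranks_j, ranks_k):
    """Spearman r from squared rank differences (eq. 5.3); assumes no ties."""
    n = len(ranks_j)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_j, ranks_k))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


def pearson(x, y):
    """General correlation formula (eq. 5.1), applied here to ranks."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


# Ranks of objects x1..x4 on descriptors y1 and y2 (assumed from Table 5.4)
y1 = [3, 4, 2, 1]
y2 = [3, 1, 4, 2]

r_eq53 = spearman_from_differences(y1, y2)
r_eq51 = pearson(y1, y2)
print(round(r_eq53, 2), round(r_eq51, 2))
```

Both values should agree up to floating-point rounding. With tied ranks, only the Pearson-on-ranks route remains valid.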

Two or more objects may have the same rank on a given descriptor. This is often the case with descriptors used in ecology, which may have a small number of states or ordered classes. Such observations are said to be tied. Each of them is assigned the average of the ranks which would have been assigned had no ties occurred. If the proportion of tied observations is large, correction factors must be introduced into the sums of squared deviations of eq. 5.2, which become:

$$\sum_{i=1}^{n} (y_{ij} - \bar{y}_j)^2 = \frac{1}{12} \left[ n^3 - n - \sum_{r=1}^{q} (t_{rj}^3 - t_{rj}) \right]$$

and

$$\sum_{i=1}^{n} (y_{ik} - \bar{y}_k)^2 = \frac{1}{12} \left[ n^3 - n - \sum_{r=1}^{s} (t_{rk}^3 - t_{rk}) \right]$$

where $t_{rj}$ and $t_{rk}$ are the numbers of observations in descriptors $y_j$ and $y_k$ which are tied at ranks r, these values being summed over the q sets of tied observations in descriptor j and the s sets in descriptor k.

Significance of the Spearman $r$ is usually tested against the null hypothesis $H_0$: $r = 0$. When $n \geq 10$, the test statistic is the same as for Pearson's $r$ (eq. 4.13):

$$t = \sqrt{n - 2} \, \frac{r_{ki}}{\sqrt{1 - r_{ki}^2}} \tag{5.4}$$

$H_0$ is tested by comparing statistic t to the value found in a table of critical values of t, with $\nu = n - 2$ degrees of freedom. $H_0$ is rejected when the probability corresponding to t is smaller than a predetermined level of significance ($\alpha$, for a two-tailed test). The rules for one-tailed and two-tailed tests are the same as for the Pearson $r$ (Section 4.2). When $n < 10$, which is not often the case in ecology, one must refer to a special table of critical values of the Spearman rank correlation coefficient, found in textbooks of nonparametric statistics.

Kendall corr. coeff.

Kendall's $\tau$ (tau) is another rank correlation coefficient, which can be used for the same types of descriptors as Spearman's $r$. One major advantage of $\tau$ over Spearman's $r$ is that the former can be generalized to a partial correlation coefficient (below), which is not the case for the latter. While Spearman's $r$ was based on the differences between the ranks of objects on the two descriptors being compared, Kendall's $\tau$ refers to a somewhat different concept, which is best explained using an example.

Numerical example.
Kendall's $\tau$ is calculated on the example of Table 5.4, already used for computing Spearman's $r$. In Table 5.5, the order of the objects was rearranged so as to obtain increasing ranks on one of the two descriptors (here $y_1$). The table is used to determine the degree of dependence between the two descriptors.

Table 5.5  Numerical example. The order of the four objects from Table 5.4 has been rearranged in such a way that the ranks on $y_1$ are now in increasing order.

Objects               Ranks of objects on the two descriptors
(observation units)   y1      y2
x4                    1       2
x3                    2       4
x1                    3       3
x2                    4       1

Since the ranks are now in increasing order on $y_1$, it is sufficient to determine how many pairs of ranks are also in increasing order on $y_2$ to obtain a measure of the association between the two descriptors. Considering the object in first rank (i.e. $x_4$), at the top of the right-hand column, the first pair of ranks (2 and 4, belonging to objects $x_4$ and $x_3$) is in increasing order; a score of +1 is assigned to it. The same goes for the second pair (2 and 3, belonging to objects $x_4$ and $x_1$). The third pair of ranks (2 and 1, belonging to objects $x_4$ and $x_2$) is in decreasing order, however, so that it earns a negative score, −1. The same operation is repeated for every object in successive ranks along $y_1$, i.e. for the object in second rank ($x_3$): first pair of ranks (4 and 3, belonging to objects $x_3$ and $x_1$), etc. The sum S of scores assigned to each of the $n(n - 1)/2$ different pairs of ranks is then computed.

Kendall's rank correlation coefficient is defined as follows:

$$\tau_a = \frac{S}{n(n - 1)/2} = \frac{2S}{n(n - 1)} \tag{5.5}$$

where S stands for "sum of scores". Kendall's $\tau_a$ is thus the sum of scores for pairs in increasing and decreasing order, divided by the total number of pairs, $n(n - 1)/2$. For the example of Tables 5.4 and 5.5, the six scores are +1, +1, −1, −1, −1 and −1, so that $S = -2$ and $\tau_a$ is:

$$\tau_a = \frac{2(-2)}{4 \times 3} = -0.33$$

Clearly, in the case of perfect agreement between two descriptors, all pairs receive a positive score, so that $S = n(n - 1)/2$ and thus $\tau_a = +1$. When there is complete disagreement, $S = -n(n - 1)/2$ and thus $\tau_a = -1$. When the descriptors are totally unrelated, the positive and negative scores cancel out, so that S as well as $\tau_a$ are near 0.

Equation 5.5 cannot be used for computing $\tau$ when there are tied observations. This is often the case with ecological semiquantitative descriptors, which may have a small number of states. The Kendall rank correlation is then computed on a contingency table (see Chapter 6) crossing two semiquantitative descriptors.
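The scoring procedure described above can be sketched as follows. This is an illustrative implementation of eq. 5.5 for untied ranks, not a library routine; the function name is hypothetical, and the ranks are those assumed for Table 5.5 (objects $x_4$, $x_3$, $x_1$, $x_2$ in order of increasing rank on $y_1$). Sorting is not actually required, because the sign of the product of rank differences gives the score of each pair directly.

```python
from itertools import combinations


def kendall_tau_a(ranks_j, ranks_k):
    """Return (S, tau_a) following eq. 5.5; assumes there are no tied ranks."""
    n = len(ranks_j)
    score_sum = 0
    # Each pair of objects scores +1 if both descriptors order it the same way,
    # and -1 if they order it in opposite ways.
    for (a1, b1), (a2, b2) in combinations(zip(ranks_j, ranks_k), 2):
        product = (a1 - a2) * (b1 - b2)
        if product > 0:
            score_sum += 1
        elif product < 0:
            score_sum -= 1
    return score_sum, 2 * score_sum / (n * (n - 1))


# Ranks assumed from Table 5.5 (objects x4, x3, x1, x2)
y1 = [1, 2, 3, 4]
y2 = [2, 4, 3, 1]

score_sum, tau_a = kendall_tau_a(y1, y2)
print(score_sum, round(tau_a, 2))
```

The loop visits all $n(n - 1)/2$ pairs, which is adequate for small data sets; faster algorithms exist for large n.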

Table 5.6  Numerical example. Contingency table giving the distribution of 80 objects among the states of two semiquantitative descriptors, a and b. Numbers in the table are frequencies (f).

        b1     b2     b3     b4     t_j
a1      20     10     10      0     40
a2       0     10      0     10     20
a3       0      0     10      0     10
a4       0      0      0     10     10
t_k     20     20     20     20     80

Table 5.6 is a contingency table crossing two ordered descriptors. For example, descriptor a could represent the relative abundances of arthropods in soil enumerated on a semiquantitative scale (e.g. absent, present, abundant and very abundant), while descriptor b could be the concentration of organic matter in the soil, divided into 4 classes. For simplicity, descriptors are called a and b here, as in Chapter 6. The states of a vary from 1 to r (number of rows) while the states of b go from 1 to c (number of columns).

To compute $\tau$ with tied observations, S is calculated as the difference between the numbers of positive (P) and negative (Q) scores, $S = P - Q$. P is the sum of all frequencies f in the contingency table, each one multiplied by the sum of all frequencies located lower and on its right:

$$P = \sum_{j=1}^{r} \sum_{k=1}^{c} \left[ f_{jk} \times \sum_{l=j+1}^{r} \sum_{m=k+1}^{c} f_{lm} \right]$$

Likewise, Q is the sum of all frequencies f in the table, each one multiplied by the sum of all frequencies lower and on its left:

$$Q = \sum_{j=1}^{r} \sum_{k=1}^{c} \left[ f_{jk} \times \sum_{l=j+1}^{r} \sum_{m=1}^{k-1} f_{lm} \right]$$

Numerical example. For Table 5.6:

$$P = (20 \times 40) + (10 \times 30) + (10 \times 20) + (10 \times 20) + (10 \times 10) = 1600$$

$$Q = (10 \times 10) + (10 \times 10) = 200$$

$$S = P - Q = 1600 - 200 = 1400$$

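Using this value S, there are two approaches for calculating $\tau$, depending on the numbers of states in the two descriptors. When a and b have the same numbers of states ($r = c$), $\tau_b$ is computed using a formula that includes the total number of pairs $n(n - 1)/2$, as in the case of $\tau_a$ (eq. 5.5). The difference with eq. 5.5 is that $\tau_b$ includes corrections for the number of pairs $L_1$ tied in a and the number of pairs $L_2$ tied in b, where

$$L_1 = \sum_{j=1}^{r} \frac{1}{2} t_j (t_j - 1)$$

in which $t_j$ is the marginal total for row j, and

$$L_2 = \sum_{k=1}^{c} \frac{1}{2} t_k (t_k - 1)$$

in which $t_k$ is the marginal total for column k. The formula for $\tau_b$ is:

$$\tau_b = \frac{S}{\sqrt{\left[ \frac{1}{2} n(n - 1) - L_1 \right] \left[ \frac{1}{2} n(n - 1) - L_2 \right]}} \tag{5.6}$$

When there are no tied observations, $L_1 = L_2 = 0$ and eq. 5.6 becomes identical to eq. 5.5.

Numerical example. For Table 5.6, with row totals 40, 20, 10, 10 and column totals all equal to 20:

$$L_1 = \frac{40 \times 39}{2} + \frac{20 \times 19}{2} + \frac{10 \times 9}{2} + \frac{10 \times 9}{2} = 1060$$

$$L_2 = 4 \times \frac{20 \times 19}{2} = 760$$

$$\tau_b = \frac{1400}{\sqrt{\left[ \frac{1}{2}(80 \times 79) - 1060 \right] \left[ \frac{1}{2}(80 \times 79) - 760 \right]}} = \frac{1400}{\sqrt{2100 \times 2400}} = 0.62$$

Without correction for ties, the calculated value (eq. 5.5) would have been

$$\tau_a = \frac{2 \times 1400}{80 \times 79} = 0.44$$

The second approach for calculating $\tau$ with tied observations should be used when a and b do not have the same number of states ($r \neq c$). The formula for $\tau_c$ uses the minimum number of states in either a or b, min(r, c):

$$\tau_c = \frac{S}{\frac{1}{2} n^2 \left( \dfrac{\min - 1}{\min} \right)} \tag{5.7}$$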
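The contingency-table computations of P, Q, $L_1$, $L_2$, $\tau_b$ (eq. 5.6) and $\tau_c$ (eq. 5.7) can be sketched as below. The function names are illustrative. The 4 × 4 frequency table is an assumption: it holds 80 objects and was chosen so that the products entering P and Q match those quoted in the numerical example, and it should not be read as authoritative data.

```python
from math import sqrt


def concordance_scores(table):
    """Return (P, Q): each frequency times the frequencies lower-right / lower-left of it."""
    n_rows, n_cols = len(table), len(table[0])
    p = q = 0
    for j in range(n_rows):
        for k in range(n_cols):
            lower_right = sum(table[l][m]
                              for l in range(j + 1, n_rows)
                              for m in range(k + 1, n_cols))
            lower_left = sum(table[l][m]
                             for l in range(j + 1, n_rows)
                             for m in range(k))
            p += table[j][k] * lower_right
            q += table[j][k] * lower_left
    return p, q


def tied_pairs(marginal_totals):
    """Number of pairs tied within the states of one descriptor (L1 or L2)."""
    return sum(t * (t - 1) / 2 for t in marginal_totals)


def kendall_tau_b(table):
    """Kendall tau_b (eq. 5.6), intended for square tables (r = c)."""
    n = sum(map(sum, table))
    p, q = concordance_scores(table)
    l1 = tied_pairs([sum(row) for row in table])          # row totals t_j
    l2 = tied_pairs([sum(col) for col in zip(*table)])    # column totals t_k
    pairs = n * (n - 1) / 2
    return (p - q) / sqrt((pairs - l1) * (pairs - l2))


def kendall_tau_c(table):
    """Kendall tau_c (eq. 5.7), intended for tables where r differs from c."""
    n = sum(map(sum, table))
    p, q = concordance_scores(table)
    smallest = min(len(table), len(table[0]))
    return (p - q) / (0.5 * n ** 2 * (smallest - 1) / smallest)


# Assumed frequencies (rows a1..a4, columns b1..b4), n = 80
freq = [
    [20, 10, 10, 0],
    [0, 10, 0, 10],
    [0, 0, 10, 0],
    [0, 0, 0, 10],
]

p_score, q_score = concordance_scores(freq)
tau_b = kendall_tau_b(freq)
print(p_score, q_score, round(tau_b, 2))
```

Since this table is square, $\tau_b$ is the appropriate coefficient; `kendall_tau_c` is included only to show how eq. 5.7 differs in its denominator.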

The significance of Kendall's $\tau$ is tested by reference to the null hypothesis $H_0$: $r = 0$ (i.e. independence of the two descriptors). A test statistic is obtained by transforming $\tau$ into z (or $t_\infty$) using the following formula (Kendall, 1948):

$$z = |\tau| \sqrt{\frac{9n(n - 1)}{2(2n + 5)}} - \sqrt{\frac{18}{n(n - 1)(2n + 5)}} \tag{5.8}$$

When $n \geq 30$, the second term of eq. 5.8 becomes negligible (at n = 30, the value of this term is only 0.0178). For $n \geq 10$, the sampling distribution of $\tau$ is almost the same as the normal distribution, so that $H_0$ is tested using a table of z. Since z tables are one-tailed, the z statistic of eq. 5.8 may be used directly for one-tailed tests by comparing it to the value $z_\alpha$ read in the table. For two-tailed tests, the statistic is compared to the value $z_{\alpha/2}$ from the z table. When $n < 10$, which is seldom the case in ecology, one should refer to Table B, at the end of this book. Table B gives the critical values of $\tau_\alpha$ for $4 \leq n \leq 50$ (one-tailed and two-tailed tests).

Power

Spearman's $r$ provides a better approximation of Pearson's $r$ when the data are almost quantitative and there are but a few tied observations, whereas Kendall's $\tau$ does better when there are many ties. Computing both Spearman's $r$ and Kendall's $\tau_a$ on the same numerical example, above, produced different numerical values (i.e. $r = -0.40$ versus $\tau_a = -0.33$). This is because the two coefficients have different underlying scales, so that their numerical values cannot be directly compared. However, given their different sampling distributions, they both reject $H_0$ at the same level of significance. If applied to quantitative data that are meeting all the requirements of Pearson's $r$, both Spearman's $r$ and Kendall's $\tau$ have power nearly as high (about 91%; Hotelling & Pabst, 1936) as their parametric equivalent. In all other cases, they are more powerful than Pearson's $r$. This refers to the notion of power of statistical tests: a test is more powerful than another if it is more likely to detect small deviations from $H_0$ (i.e. smaller type II error), for constant type I error.
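The normal approximation used to test $\tau$ can be sketched as follows. This is an assumption-laden illustration: it takes the test statistic to be the usual normal approximation for Kendall's S with a continuity correction, rewritten in terms of $\tau$, so that the second term equals $\sqrt{18 / (n(n-1)(2n+5))}$. That form reproduces the value 0.0178 quoted in the text for n = 30. The function names and the example values of $\tau$ and n are hypothetical.

```python
from math import erfc, sqrt


def continuity_term(n):
    """Second term of the z statistic: continuity correction expressed on the z scale."""
    return sqrt(18 / (n * (n - 1) * (2 * n + 5)))


def kendall_z(tau, n):
    """Approximate z statistic for Kendall's tau (no ties), for n of about 10 or more."""
    main_term = abs(tau) * sqrt(9 * n * (n - 1) / (2 * (2 * n + 5)))
    return main_term - continuity_term(n)


def upper_tail_probability(z):
    """One-tailed probability P(Z >= z) under the standard normal distribution."""
    return 0.5 * erfc(z / sqrt(2))


second_term_n30 = continuity_term(30)
print(round(second_term_n30, 4))

# Hypothetical example: tau = 0.5 observed on n = 30 objects
z_example = kendall_z(0.5, 30)
p_one_tailed = upper_tail_probability(z_example)
p_two_tailed = 2 * p_one_tailed
```

Doubling the one-tailed probability for the two-tailed test is equivalent to comparing z to $z_{\alpha/2}$.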
The chief advantage of Kendall's $\tau$ over Spearman's $r$, as already mentioned, is that it can be generalized to a partial correlation coefficient, which cannot be done with Spearman's (Siegel, 1956: 214). The formula for a partial $\tau$ is:

$$\tau_{12.3} = \frac{\tau_{12} - \tau_{13} \tau_{23}}{\sqrt{(1 - \tau_{13}^2)(1 - \tau_{23}^2)}} \tag{5.9}$$

This formula is algebraically the same as that of first-order partial Pearson $r$ (eq. 4.36) although, according to Kendall (1948: 103), this would be merely coincidental because the two formulae are derived using entirely different approaches. The three $\tau$ coefficients on the right-hand side of eq. 5.9 may themselves be partial $\tau$'s, thus allowing one to control for more than one descriptor (i.e. high order partial correlation coefficients). Siegel & Castellan (1988) give tables for testing the significance of the Kendall partial correlation coefficient.
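The first-order partial coefficient has the same algebraic form as the first-order partial Pearson $r$, so it can be sketched in a few lines. The function name and the three input values are hypothetical, chosen only for illustration.

```python
from math import sqrt


def partial_tau(tau_12, tau_13, tau_23):
    """First-order partial Kendall tau between descriptors 1 and 2, controlling for 3."""
    return (tau_12 - tau_13 * tau_23) / sqrt((1 - tau_13 ** 2) * (1 - tau_23 ** 2))


# Hypothetical pairwise coefficients among three descriptors
tau_12_3 = partial_tau(0.6, 0.5, 0.4)
print(round(tau_12_3, 3))
```

Because the inputs may themselves be partial coefficients, calling the function on first-order results gives a second-order partial $\tau$. Significance still has to be assessed with dedicated tables, as noted above.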

Rank correlation coefficients should not be used in the Q mode, i.e. for comparing objects instead of descriptors. This is also the case for the Pearson $r$ (Section 7.5). The reasons for this are the following:

• While physical dimensions disappear when computing correlation coefficients between variables expressed in different units, the same coefficients computed between objects have complex and non-interpretable physical dimensions.

• Physical descriptors are usually expressed in somewhat arbitrary units (e.g. mm, cm, m, or km are all equally correct, in principle). Any arbitrary change in units could dramatically change the values of correlations computed between objects.

• Descriptors may be standardized first to alleviate these problems but standardization of quantitative descriptors, before rank-ordering the data within objects, changes the values along object vectors in a nonmonotonic way. The correlation between two objects is a function of the values of all the other objects in the data set.

• Consider species abundance data. At most sampling sites, several species are represented by a small number of individuals, this number being subject to stochastic variability. It follows that their ranks, in a given observation unit, may not strictly correspond to their quantitative importance in the ecosystem. A rank correlation coefficient computed between observation units would thus have high variance since it would be computed on many uncertain ranks, giving a preponderant importance to the many poorly sampled species.

• While the central limit theorem insures that means, variances, covariances, and correlations converge towards their population values when the number of objects increases, computing these same parameters in the Q mode is likely to have the opposite effect since the addition of new variables into the calculations is likely to change the values of these parameters in a non-trivial way.

• Correlation coefficients can be tested by the method of permutations, as described in Subsection 1.2.3. In the R mode, permuting the values of a variable within a column makes physical sense: under $H_0$, that value could be found at any one site. In the Q mode, however, permuting values within a row of the data matrix does not make sense because, in the real world, these values could not belong to different variables. As an illustration, it would not make sense to move a salinity of 35 psu to the pH column.

Kendall coeff. of concordance

The rank correlation coefficients described above measure the correlation for pairs of descriptors, based on n objects. In contrast, Kendall's coefficient of concordance W measures the relationship among several rank-ordered variables for n objects. In Table 5.1, Kendall's W is listed as equivalent to the coefficient of multiple linear correlation R, but the approach is actually quite different.

The analysis is conducted on a table which contains, in each column, the ranks of the n objects on one of the p descriptors, e.g. Table 5.7. Friedman (1937) has shown that, when the number of rows and/or columns is large enough, the following statistic is approximately distributed as $\chi^2$ with $\nu = n - 1$ degrees of freedom:

$$X^2 = \frac{12}{pn(n + 1)} \sum_{i=1}^{n} R_i^2 - 3p(n + 1) \tag{5.10}$$

where $R_i$ is the sum of the ranks for row i. This is Friedman's statistic for two-way analysis of variance by ranks. Kendall's coefficient of concordance (Kendall, 1948) can be obtained by transforming Friedman's $X^2$ statistic as follows:

$$W = \frac{X^2}{p(n - 1)} \tag{5.11}$$

It can be shown that the following expression is equivalent to eq. 5.11:

$$W = \frac{12 \sum_{i=1}^{n} (R_i - \bar{R})^2}{p^2 (n^3 - n)} \tag{5.12}$$

Kendall's W statistic is simply the variance of the row sums of ranks $R_i$ divided by the maximum possible value that this variance can take; this occurs when all variables are in total agreement. Hence $0 \leq W \leq 1$. Two properties are used to demonstrate the equivalence of eqs. 5.11 and 5.12. The first one is that

$$\sum_{i=1}^{n} (R_i - \bar{R})^2 = \sum_{i=1}^{n} R_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} R_i \right)^2$$

and the second is that the sum of all the $R_i$ values in the table is $pn(n + 1)/2$.

In the presence of tied values, the formula is modified as follows: *** develop from the Kendall paper ***

When there are no ties, W can be computed from the mean, $\bar{r}$, of the Spearman correlations among all variables (Siegel & Castellan 1988):

$$W = \frac{(p - 1)\bar{r} + 1}{p} \tag{5.13}$$
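The three routes to W (via Friedman's $X^2$, via the deviations of the row sums, and via the mean Spearman correlation) can be compared numerically. The sketch below uses illustrative function names and an assumed table of untied ranks for n = 6 objects on p = 3 descriptors. The ranks were chosen so that $\sum (R_i - \bar{R})^2 = 25.5$, the value used in the numerical example of this section, and are an assumption rather than confirmed data.

```python
from itertools import combinations


def row_sums(rank_table):
    """Sum of ranks R_i for each object (row)."""
    return [sum(row) for row in rank_table]


def friedman_x2(rank_table):
    """Friedman statistic (eq. 5.10): n objects in rows, p descriptors in columns."""
    n, p = len(rank_table), len(rank_table[0])
    sum_r2 = sum(r * r for r in row_sums(rank_table))
    return 12 * sum_r2 / (p * n * (n + 1)) - 3 * p * (n + 1)


def kendall_w(rank_table):
    """Kendall W from squared deviations of the row sums (eq. 5.12); no ties."""
    n, p = len(rank_table), len(rank_table[0])
    sums = row_sums(rank_table)
    mean_r = sum(sums) / n
    ss_dev = sum((r - mean_r) ** 2 for r in sums)
    return 12 * ss_dev / (p ** 2 * (n ** 3 - n))


def spearman(ranks_j, ranks_k):
    """Spearman r for untied ranks (eq. 5.3)."""
    n = len(ranks_j)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_j, ranks_k))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


def w_from_mean_spearman(rank_table):
    """Kendall W from the mean pairwise Spearman correlation (eq. 5.13)."""
    p = len(rank_table[0])
    columns = list(zip(*rank_table))
    pairs = list(combinations(columns, 2))
    mean_r = sum(spearman(a, b) for a, b in pairs) / len(pairs)
    return ((p - 1) * mean_r + 1) / p


# Assumed ranks of objects x1..x6 (rows) on descriptors y1, y2, y3 (columns)
ranks = [
    [1, 1, 6],
    [6, 5, 3],
    [3, 6, 2],
    [2, 4, 5],
    [5, 2, 4],
    [4, 3, 1],
]

n_objects, n_descriptors = len(ranks), len(ranks[0])
x2 = friedman_x2(ranks)
w_eq511 = x2 / (n_descriptors * (n_objects - 1))   # eq. 5.11
w_eq512 = kendall_w(ranks)
w_eq513 = w_from_mean_spearman(ranks)
print(round(x2, 2), round(w_eq511, 3), round(w_eq512, 3), round(w_eq513, 3))
```

The three values of W should agree up to rounding. Agreement between eqs. 5.12 and 5.13 holds only without ties, as stated in the text.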

Table 5.7  Numerical example. Ranks of six objects on three descriptors, $y_1$, $y_2$, and $y_3$.

Objects               Ranks of objects on the three descriptors     Row sums
(observation units)   y1      y2      y3                            R_i
x1                    1       1       6                             8
x2                    6       5       3                             14
x3                    3       6       2                             11
x4                    2       4       5                             11
x5                    5       2       4                             11
x6                    4       3       1                             8

Coefficient W varies between 0 (no concordance) and 1 (maximum concordance). Its significance is tested either using eq. 5.11 directly, or after transforming W into the associated $X^2$ statistic:

$$X^2 = p(n - 1)W$$

The null hypothesis ($H_0$) subjected to testing is that the row sums $R_i$ are equal or, in other words, that the p sets of ranks (or the p semiquantitative descriptors) are independent of one another. The $X^2$ statistic is compared to a $\chi^2_\alpha$ value read in a table of critical values of $\chi^2$, for $\nu = (n - 1)$. When $X^2$ is smaller than the critical value $\chi^2_\alpha$ (i.e. probability larger than $\alpha$), the null hypothesis that the row sums $R_i$ are equal cannot be rejected; this leads to the conclusion that the p descriptors are independent and differ in the way they rank the n objects. On the contrary, $X^2 \geq \chi^2_\alpha$ (i.e. probability smaller than or equal to $\alpha$) indicates good agreement among the descriptors in the way they rank the objects. Textbooks of nonparametric statistics provide modified formulae for $X^2$, for data sets with tied observations.

Numerical example. Calculation of Kendall's coefficient of concordance is illustrated using the numerical example of Table 5.7. Data could be semiquantitative rank scores, or quantitative descriptors coded into ranks. It is important to note that the n = 6 objects are ranked on each descriptor (column) separately. The last column gives, for each object i, the sum $R_i$ of its ranks on the p = 3 descriptors. The sum of squared deviations from the mean, $\sum (R_i - \bar{R})^2$, is equal to 25.5 for this example, and $\sum R_i^2 = 687$. The Friedman statistic is calculated with eq. 5.10:

$$X^2 = \frac{12}{3 \times 6 \times (6 + 1)} \times 687 - 3 \times 3 \times (6 + 1) = 65.43 - 63 = 2.43$$

Using eq. 5.11, the $X^2$ statistic is transformed into Kendall's W:

$$W = \frac{2.43}{3(6 - 1)} = 0.162$$

Alternatively, W could have been computed using eq. 5.12:

$$W = \frac{12 \times 25.5}{3^2 (6^3 - 6)} = \frac{306}{1890} = 0.162$$

A table of critical values of $\chi^2$ indicates that $X^2 = 2.43$, for $\nu = 6 - 1 = 5$, corresponds to a probability of ca. 0.80; the probability associated with this $X^2$ statistic is actually 0.787. The hypothesis ($H_0$) that the row sums $R_i$ are equal cannot be rejected. One concludes that the three descriptors differ in the way they rank the 6 objects.

Permutation test

The Kendall coefficient of concordance can also be tested by permutation. The Kendall concordance method with permutation testing has been used for the search for species association (Legendre 2005), which is one of the classical problems of community ecology. It is implemented in the functions kendall.global (global test of the concordance among all members of an association) and kendall.post (a posteriori test of the contribution of individual species to the overall concordance of their group) of the R-language library vegan (Oksanen et al. 2009).

The concordance among distance matrices (CADM) can be tested using a test of significance proposed by Legendre & Lapointe (2004, 2005). The distance matrices under comparison are strung out like the descriptors in Table 5.7. The coefficient of concordance is computed and tested using the same permutation procedure as in the Mantel test (Subsection 10.5.1). This test is actually a generalization of the Mantel test of matrix correspondence to any number of distance matrices. This method is available in the functions CADM.global and CADM.post of the R-language library ape (Paradis et al. 2009).

Additional references

Legendre, P. 2005. Species associations: the Kendall coefficient of concordance revisited. Journal of Agricultural, Biological, and Environmental Statistics 10: 226-245.

Oksanen, J., R. Kindt, P. Legendre, B. O'Hara, G. L. Simpson, P. Solymos, M. H. H. Stevens, and H. Wagner. 2009. vegan: Community Ecology Package. R package version [...].

Paradis, E., B. Bolker, J. Claude, H. S. Cuong, R. Desper, B. Durand, J. Dutheil, O. Gascuel, G. Jobb, C. Heibl, D. Lawson, V. Lefort, P. Legendre, J. Lemon, Y. Noel, J. Nylander, R. Opgen-Rhein, K. Strimmer, and D. de Vienne. 2009. ape: Analyses of Phylogenetics and Evolution. R package version 2.3.
