4 Multidimensional quantitative data


Lee Norris
5 years ago

4.0 Multidimensional statistics

Basic statistics are now part of the curriculum of most ecologists. However, statistical techniques based on such simple distributions as the unidimensional normal distribution are not really appropriate for analysing complex ecological data sets. Nevertheless, researchers sometimes perform series of simple analyses on the various descriptors in the data set, expecting to obtain results that are pertinent to the problem under study. This type of approach is incorrect because it does not take into account the covariance among descriptors; see also Box 1.3, where the statistical problem created by multiple testing is explained. In addition, such an approach extracts only minimum information from data which have often been collected at great cost, and it usually generates a mass of results from which it is difficult to draw much sense. Finally, in studies involving species assemblages, it is usually more interesting to describe the variability of the structure of the assemblage as a whole (i.e. mensurative variation observed through space or time, or manipulative variation resulting from experimental manipulation; Hurlbert, 1984) than to look at each species independently.

Fortunately, methods derived from multidimensional statistics, which are used throughout this book, are designed for analysing complex data sets. These methods take into account the co-varying nature of ecological data and can reveal the structures that underlie the data. The present chapter discusses the basic theory and characteristics of multidimensional data analysis. Mathematics is kept to a minimum, so that readers can easily reach a high level of understanding. Many approaches of practical interest are discussed, including several types of linear correlation, with their statistical tests. It must be noted that this chapter is limited to linear statistics.

A number of excellent textbooks deal with detailed aspects of multidimensional statistics. For example, formal presentations of the subject are found in Muirhead (1982) and Anderson (1984).
Researchers less interested in mathematical theory may refer to Cooley & Lohnes (1971), Tatsuoka (1971), Press (1972), Graybill (1983), or Morrison (1990). These books describe a number of useful methods, including the multidimensional analysis of variance. However, none of these books deals specifically with ecological data.

Several authors use the term multivariate as an abbreviation for multidimensional variate (the latter term meaning random variable; Section 1.0). As an adjective, multivariate is interchangeable with multidimensional.

4.1 Multidimensional variables and dispersion matrix

As stated in Section 1.0, the present textbook deals with the analysis of random variables. Ecological data matrices have n rows and p columns (Section 2.1). Each row is a vector (Section 2.4) which is, statistically speaking, one realization of a p-dimensional random variable. In other words, when p species are observed at n sampling sites, the species are the p dimensions of a random variable "species" and each site is one realization of this p-dimensional random variable.

To illustrate this concept, four sampling units with two species (Table 4.1) are plotted in a two-dimensional Euclidean space (Fig. 4.1). Vector "site 1" is the doublet (5,1). It is plotted in the same two-dimensional space as the three other vectors "site i". Each row of the data matrix is a two-dimensional vector, which is one realization of the (bivariate) random variable "species". The random variable "species" is said to be two-dimensional because the sampling units (objects) contain two species (descriptors), the two dimensions being species 1 and 2, respectively.

Table 4.1 Numerical example of two species observed at four sampling sites. Figure 4.1 shows that each row of the data matrix may be construed as a vector, as defined in Section 2.4.

Sampling sites    Species (descriptors)
(objects)         1      2
1                 5      1
2                 3      2
3                 8      3
4                 6      4
(n = 4)           (p = 2)

Figure 4.1 Four realizations (sampling sites from Table 4.1) of the two-dimensional random variable "species" are plotted in a two-dimensional Euclidean space. [Plot with axes species 1 and species 2; points: site 1 (5,1), site 2 (3,2), site 3 (8,3), site 4 (6,4).]

As the number of descriptors (e.g. species) increases, the number of dimensions of the random variable "species" similarly increases, so that more axes are necessary to construct the space in which the objects are plotted. Thus, the p descriptors make up a p-dimensional random variable and the n vectors of observations are as many realizations of the p-dimensional vector "descriptors". The present chapter does not deal with samples of observations, which result from field or laboratory work (for a brief discussion on sampling, see Section 1.1), but it focuses instead on populations, which are investigated by means of the samples.

Before approaching the multidimensional normal distribution, it is necessary to define a p-dimensional random variable "descriptors":

$$\mathbf{Y} = [y_1, y_2, \ldots, y_j, \ldots, y_p] \qquad (4.1)$$

Each element $y_j$ of the multidimensional variable Y is a unidimensional random variable. Every descriptor $y_j$ is observed in each of the n vectors "object", each sampling unit i providing one realization of the p-dimensional random variable (Fig. 4.2).

In ecology, the structure of dependence among descriptors is, in many instances, the matter being investigated. Researchers who study multidimensional data sets using univariate statistics assume that the p unidimensional $y_j$ variables in Y are independent of one another (this refers to the third meaning of independence in Box 1.1). This is the reason why univariate statistical methods are inappropriate with most ecological data and why methods that take into account the dependence among descriptors must be used when analysing sets of multidimensional data. Only these methods will generate proper results when there is dependence among descriptors; it is never acceptable to replace a multidimensional analysis by a series of unidimensional treatments.
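To make the notation of eq. 4.1 concrete, the small sketch below stores the data of Table 4.1 as an array with objects in rows and descriptors in columns. The use of NumPy and the variable names (`Y`, `site_1`, `species_1`) are illustrative choices, not part of the original text.

```python
import numpy as np

# Data of Table 4.1: n = 4 sampling sites (rows) by p = 2 species (columns).
Y = np.array([[5, 1],
              [3, 2],
              [8, 3],
              [6, 4]])

n, p = Y.shape         # number of objects, number of descriptors
site_1 = Y[0]          # one realization of the 2-dimensional variable "species"
species_1 = Y[:, 0]    # the unidimensional variable y_1 observed at all sites

print(n, p, site_1, species_1)
```

Reading a row gives one realization of the p-dimensional variable; reading a column gives one descriptor across all objects.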

Figure 4.2 Structure of ecological data. Given their nature, ecological descriptors are dependent on one another. In statistics, the objects are often assumed to be independent observations, but this is generally not the case in ecology (Section 1.1). [Diagram: the data matrix of n objects (rows i) by p descriptors (columns j), with elements $y_{ij}$, shown next to the p × p dispersion (covariance) matrix of the descriptors, with elements $s_{jk} = \frac{1}{n-1}\sum_{i=1}^{n}(y_{ij}-\bar{y}_j)(y_{ik}-\bar{y}_k)$.]

The usual tests of significance require, however, that "successive sample observation vectors from the multidimensional population have been drawn in such a way that they can be construed as realizations of independent random vectors" (Morrison, 1990, p. 80). Section 1.1 has shown that this assumption of independence among observations is most often not realistic in ecology. Lack of independence among the observations (data rows) does not really matter when statistical models are used for descriptive purposes only, as is often the case in the present book. For statistical testing, however, corrected tests of significance have to be used when the observations are autocorrelated (Section 1.1).

To sum up:
(1) the p descriptors in ecological data matrices are the p dimensions of a random variable "descriptors";
(2) in general, the p descriptors are not linearly independent of one another; methods of multidimensional analysis are designed to bring out the structure of linear dependence among descriptors;
(3) each of the n sampling units is a realization of the p-dimensional vector "descriptors";
(4) the usual tests of significance assume that the n sampling units are realizations of independent random vectors.

The latter condition is generally not met in ecology, with consequences that were mentioned in the previous paragraph and discussed in Section 1.1. For the various meanings of the term independence in statistics, see Box 1.1.

Dispersion matrix. The dependence among quantitative variables $y_j$ brings up the concept of covariance. Covariance is the extension, to two descriptors, of the concept of variance. Variance is a measure of the dispersion of a random variable $y_j$ around its mean; it is denoted $\sigma_j^2$. Covariance measures the joint dispersion of two random variables $y_j$ and $y_k$ around their means; it is denoted $\sigma_{jk}$. The dispersion matrix of Y, called matrix $\boldsymbol{\Sigma}$ (sigma), contains the variances and covariances of the p descriptors (Fig. 4.2):

$$\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{bmatrix} \qquad (4.2)$$

Matrix $\boldsymbol{\Sigma}$ is an association matrix [descriptors × descriptors] (Section 2.2). The elements $\sigma_{jk}$ of matrix $\boldsymbol{\Sigma}$ are the covariances between all pairs of the p random variables. The matrix is symmetric because the covariance of $y_j$ and $y_k$ is identical to that of $y_k$ and $y_j$. A diagonal element of $\boldsymbol{\Sigma}$ is the covariance of a descriptor $y_j$ with itself, which is the variance of $y_j$, so that $\sigma_{jj} = \sigma_j^2$.

Variance. The estimate of the variance of $y_j$, denoted $s_j^2$, is computed on the centred variable $(y_{ij} - \bar{y}_j)$. Variable $y_j$ is centred by subtracting the mean $\bar{y}_j$ from each of the n observations $y_{ij}$. As a result, the mean of the centred variable is zero. The unbiased estimator $s_j^2$ of the population variance $\sigma_j^2$ is computed using the well-known formula:

$$s_j^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_{ij}-\bar{y}_j)^2 \qquad (4.3)$$

where the sum of squares of the centred data, for descriptor j, is divided by the number of objects minus one (n − 1). The summation is over the n observations of descriptor j.

Covariance. In the same way, the estimate ($s_{jk}$) of the covariance ($\sigma_{jk}$) of $y_j$ and $y_k$ is computed on the centred variables $(y_{ij} - \bar{y}_j)$ and $(y_{ik} - \bar{y}_k)$, using the formula of a "bivariate variance". The covariance $s_{jk}$ is calculated as:

$$s_{jk} = \frac{1}{n-1}\sum_{i=1}^{n}(y_{ij}-\bar{y}_j)(y_{ik}-\bar{y}_k) \qquad (4.4)$$

When k = j, eq. 4.4 is identical to eq. 4.3.

Standard deviation. The positive square root of the variance is called the standard deviation ($\sigma_j$). Its estimate $s_j$ is thus:

$$s_j = \sqrt{s_j^2} \qquad (4.5)$$
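As a hedged illustration of eqs. 4.3 to 4.5, the sketch below computes the variance, covariance and standard deviation for the two species of Table 4.1. The function name `covariance` and the use of NumPy are assumptions made for the example; only the formulas come from the text.

```python
import numpy as np

# Table 4.1 data: rows are sites (objects), columns are species (descriptors).
Y = np.array([[5.0, 1.0],
              [3.0, 2.0],
              [8.0, 3.0],
              [6.0, 4.0]])

def covariance(y_j, y_k):
    """Unbiased estimate s_jk (eq. 4.4); with y_k = y_j it reduces to eq. 4.3."""
    n = len(y_j)
    centred_j = y_j - y_j.mean()   # centring: the mean becomes zero
    centred_k = y_k - y_k.mean()
    return float(np.sum(centred_j * centred_k) / (n - 1))

var_1 = covariance(Y[:, 0], Y[:, 0])   # s_1^2 (eq. 4.3)
var_2 = covariance(Y[:, 1], Y[:, 1])   # s_2^2 (eq. 4.3)
cov_12 = covariance(Y[:, 0], Y[:, 1])  # s_12  (eq. 4.4)
sd_1 = np.sqrt(var_1)                  # s_1   (eq. 4.5)

print(var_1, var_2, cov_12, sd_1)
```

For these data the hand calculation gives $s_1^2 = 13/3$, $s_2^2 = 5/3$ and $s_{12} = 4/3$, which the function reproduces.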

Table 4.2 Symbols used to identify (population) parameters and (sample) statistics.

                Parameter                       Statistic
                Matrix or vector   Elements     Matrix or vector   Elements
Covariance      Σ (sigma)          σ_jk (sigma) S                  s_jk
Correlation     Ρ (rho)            ρ_jk (rho)   R                  r_jk
Mean            μ (mu)             μ_j (mu)     ȳ                  ȳ_j

The symbols for matrix Σ and summation ∑ should not be confused.

Coefficient of variation. The coefficient of variation is a dimensionless measure of variation. CV is used to compare the variation of variables expressed in different physical units. It is obtained by dividing the standard deviation $s_j$ by the mean of variable j:

$$CV_j = s_j / \bar{y}_j$$

Since the standard deviation and the mean of a variable have the same physical units, $CV_j$ is dimensionless. $CV_j$ is only defined for quantitative variables that have non-zero means, and it does not make sense for interval-scale variables (Subsection 1.4.1), for which the value of the mean is arbitrary. The coefficient of variation may be rescaled to percentages by multiplying its value by 100. For small n, an estimate with reduced bias is obtained by multiplying CV by (1 + 1/(4n)).

Contrary to the variance, which is always positive, the covariance may take positive or negative values. To understand the meaning of the covariance, imagine that the object points are plotted in a scatter diagram where the axes are descriptors $y_j$ and $y_k$. The data are centred by drawing new axes, whose origin is at the centroid $(\bar{y}_j, \bar{y}_k)$ of the cloud of points (centred plots of that kind with positive and negative correlations are shown in Fig. 4.7). A positive covariance means that most of the points are in quadrants I and III of the centred plot, where the centred values $(y_{ij} - \bar{y}_j)$ and $(y_{ik} - \bar{y}_k)$ have the same signs. This corresponds to a positive relationship between the two descriptors. The converse is true for a negative covariance, for which most of the points are in quadrants II and IV of the centred plot. When the covariance is null or small, the points are equally distributed among the four quadrants of the centred plot.

Parameter, statistic. Greek and roman letters are used here (Table 4.2). The properties of a population (called parameters) are denoted by Greek letters. Their estimates (called statistics), computed from samples, are symbolized by the corresponding roman letters. These conventions are complemented by those pertaining to matrix notation (Section 2.1).
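The coefficient of variation described above can be sketched in a few lines. The function name and its `small_sample` flag are hypothetical; the formula $CV_j = s_j/\bar{y}_j$ and the (1 + 1/(4n)) correction are those given in the text, applied here to species 1 of Table 4.1.

```python
import statistics

def coefficient_of_variation(values, small_sample=False):
    """CV = s / mean; optionally apply the (1 + 1/(4n)) correction for small n."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("CV is not defined for a variable with a zero mean")
    cv = statistics.stdev(values) / mean   # stdev uses the (n - 1) divisor
    if small_sample:
        cv *= 1 + 1 / (4 * len(values))
    return cv

species_1 = [5, 3, 8, 6]                   # species 1 of Table 4.1
cv = coefficient_of_variation(species_1)
cv_corrected = coefficient_of_variation(species_1, small_sample=True)
print(f"CV = {100 * cv:.1f}%, corrected = {100 * cv_corrected:.1f}%")
```

Raising an error for a zero mean mirrors the statement that CV is only defined for variables with non-zero means; whether a given variable is ratio-scale rather than interval-scale remains the user's responsibility.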

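The dispersion matrix* S can be computed directly, by multiplying the matrix of centred data $[\mathbf{y} - \bar{\mathbf{y}}]$ with its transpose $[\mathbf{y} - \bar{\mathbf{y}}]'$:

$$\mathbf{S} = \frac{1}{n-1}\,[\mathbf{y} - \bar{\mathbf{y}}]'\,[\mathbf{y} - \bar{\mathbf{y}}] \qquad (4.6)$$

$$\mathbf{S} = \frac{1}{n-1} \begin{bmatrix} (y_{11}-\bar{y}_1) & (y_{21}-\bar{y}_1) & \cdots & (y_{n1}-\bar{y}_1) \\ (y_{12}-\bar{y}_2) & (y_{22}-\bar{y}_2) & \cdots & (y_{n2}-\bar{y}_2) \\ \vdots & \vdots & & \vdots \\ (y_{1p}-\bar{y}_p) & (y_{2p}-\bar{y}_p) & \cdots & (y_{np}-\bar{y}_p) \end{bmatrix} \begin{bmatrix} (y_{11}-\bar{y}_1) & (y_{12}-\bar{y}_2) & \cdots & (y_{1p}-\bar{y}_p) \\ (y_{21}-\bar{y}_1) & (y_{22}-\bar{y}_2) & \cdots & (y_{2p}-\bar{y}_p) \\ \vdots & \vdots & & \vdots \\ (y_{n1}-\bar{y}_1) & (y_{n2}-\bar{y}_2) & \cdots & (y_{np}-\bar{y}_p) \end{bmatrix}$$

$$\mathbf{S} = \frac{1}{n-1} \begin{bmatrix} \sum_{i}(y_{i1}-\bar{y}_1)^2 & \sum_{i}(y_{i1}-\bar{y}_1)(y_{i2}-\bar{y}_2) & \cdots & \sum_{i}(y_{i1}-\bar{y}_1)(y_{ip}-\bar{y}_p) \\ \sum_{i}(y_{i2}-\bar{y}_2)(y_{i1}-\bar{y}_1) & \sum_{i}(y_{i2}-\bar{y}_2)^2 & \cdots & \sum_{i}(y_{i2}-\bar{y}_2)(y_{ip}-\bar{y}_p) \\ \vdots & \vdots & & \vdots \\ \sum_{i}(y_{ip}-\bar{y}_p)(y_{i1}-\bar{y}_1) & \sum_{i}(y_{ip}-\bar{y}_p)(y_{i2}-\bar{y}_2) & \cdots & \sum_{i}(y_{ip}-\bar{y}_p)^2 \end{bmatrix}$$

where each sum runs over i = 1, …, n. This elegant and rapid procedure shows once again the advantage of matrix algebra in numerical ecology, where the data sets are generally large.

Numerical example. Four species (p = 4) were observed at five stations (n = 5). The estimated population parameters, for the species, are the means ($\bar{y}_j$), the variances ($s_j^2$), and the covariances ($s_{jk}$). The original and centred data are shown in Table 4.3. Because $s_{jk} = s_{kj}$, the dispersion matrix is symmetric. The mean of each centred variable is zero.

In this numerical example, the covariance between species 2 and the other three species is zero. This does not necessarily mean that species 2 is independent of the other three, but simply that the joint linear dispersion of species 2 with any one of the other three is zero. This example will be revisited in Section 4.2.

* Some authors call $[\mathbf{y} - \bar{\mathbf{y}}]'[\mathbf{y} - \bar{\mathbf{y}}]$ a dispersion matrix and S a covariance matrix. For these authors, a covariance matrix is then a dispersion matrix divided by (n − 1).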
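A minimal sketch of eq. 4.6 follows. Because the entries of Table 4.3 are not available here, the example uses the two-species data of Table 4.1 instead; the function name `dispersion_matrix` is an illustrative choice.

```python
import numpy as np

def dispersion_matrix(Y):
    """S = [y - ybar]' [y - ybar] / (n - 1), as in eq. 4.6."""
    Y = np.asarray(Y, dtype=float)
    n = Y.shape[0]
    centred = Y - Y.mean(axis=0)        # subtract each column (descriptor) mean
    return centred.T @ centred / (n - 1)

# Table 4.1 data: 4 sites (rows) by 2 species (columns).
Y = np.array([[5, 1],
              [3, 2],
              [8, 3],
              [6, 4]])

S = dispersion_matrix(Y)
print(S)
```

The result agrees with `np.cov(Y, rowvar=False)`, which also uses the (n − 1) divisor by default; the explicit product is shown only to mirror the matrix formulation of the text.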

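Table 4.3 Numerical example. Calculation of centred data and covariances. [Table layout: for the five sites, the original data Y and the centred data $[\mathbf{y} - \bar{\mathbf{y}}]$; below them, the row of means $\bar{\mathbf{y}}'$ and the dispersion matrix $\mathbf{S} = \frac{1}{n-1}[\mathbf{y} - \bar{\mathbf{y}}]'[\mathbf{y} - \bar{\mathbf{y}}]$ with n − 1 = 4. Numerical entries not reproduced.]

The square root of the determinant of the dispersion matrix, $|\mathbf{S}|^{1/2}$, is known as the generalized variance. It is also equal to the square root of the product of the eigenvalues of S.

Any dispersion matrix S is positive semidefinite (Table 2.2). Indeed, the quadratic form of S (p × p) with any real and non-null vector t (of size p) is:

$$\mathbf{t}'\mathbf{S}\mathbf{t}$$

This expression may be expanded using eq. 4.6:

$$\mathbf{t}'\mathbf{S}\mathbf{t} = \mathbf{t}'\,\frac{1}{n-1}\,[\mathbf{y} - \bar{\mathbf{y}}]'\,[\mathbf{y} - \bar{\mathbf{y}}]\,\mathbf{t}$$

$$\mathbf{t}'\mathbf{S}\mathbf{t} = \frac{1}{n-1}\,\big([\mathbf{y} - \bar{\mathbf{y}}]\,\mathbf{t}\big)'\,\big([\mathbf{y} - \bar{\mathbf{y}}]\,\mathbf{t}\big) = \text{a scalar}$$

This scalar is the variance of the variable resulting from the product Yt. Since a variance, which is a sum of squared values, can only be positive or null, it follows that:

$$\mathbf{t}'\mathbf{S}\mathbf{t} \geq 0$$

so that S is positive semidefinite.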
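The two properties stated above (generalized variance as the square root of the product of the eigenvalues, and a non-negative quadratic form) can be checked numerically. The sketch below is an illustration under stated assumptions: it uses the dispersion matrix of the Table 4.1 data and a few arbitrary random vectors t drawn with a fixed seed.

```python
import numpy as np

# Dispersion matrix of the Table 4.1 data (4 sites, 2 species).
Y = np.array([[5, 1], [3, 2], [8, 3], [6, 4]], dtype=float)
S = np.cov(Y, rowvar=False)

# Generalized variance: |S|^(1/2), also sqrt of the product of the eigenvalues.
eigenvalues = np.linalg.eigvalsh(S)          # S is symmetric, so eigvalsh applies
gen_var_det = np.sqrt(np.linalg.det(S))
gen_var_eig = np.sqrt(np.prod(eigenvalues))

# Quadratic form t'St for arbitrary non-null vectors t (illustrative draws).
rng = np.random.default_rng(42)
quadratic_forms = [float(t @ S @ t) for t in rng.normal(size=(100, 2))]

print(gen_var_det, gen_var_eig, min(quadratic_forms))
```

For these data $|\mathbf{S}| = 49/9$, so the generalized variance is 7/3. A finite number of random vectors does not prove positive semidefiniteness, of course; the eigenvalue check is the decisive one.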

An important property can be derived by computing the quadratic form of the dispersion matrix S using eq. 2.28 (right). In that case, $\mathbf{U}^{-1} = \mathbf{U}'$ because S is symmetric (property #7 of inverses, Section 2.8), and eq. 2.28 (right) becomes:

$$\boldsymbol{\Lambda} = \mathbf{U}'\mathbf{S}\mathbf{U}$$

As vector t in the quadratic form, use the successive eigenvectors $\mathbf{u}_j$ from matrix U. For each vector $\mathbf{u}_j$, the development above shows that $\mathbf{u}_j'\mathbf{S}\mathbf{u}_j \geq 0$. Since $\mathbf{u}_j'\mathbf{S}\mathbf{u}_j = \lambda_j$, this demonstrates that all the eigenvalues $\lambda_j$ of S are positive or null. This property of dispersion matrices is fundamental in numerical ecology: it allows one to partition the variance among real principal axes (Sections 4.4 and 9.1).

Ideally, matrix S (of order p) should be estimated from a number of observations n larger than the number of descriptors p. When n ≤ p, the rank of matrix S is n − 1 and, consequently, only n − 1 of its rows or columns are independent, so that p − (n − 1) null eigenvalues are produced. The only practical consequence of n ≤ p is thus the presence of null eigenvalues in the principal component solution (Section 9.1). The first few eigenvalues of S, which are generally those of interest, are positive.

4.2 Correlation matrix

The previous section has shown that the covariance provides information on the orientation of the cloud of data points in the space defined by the descriptors. That statistic, however, does not provide any information on the intensity of the relationship between variables $y_j$ and $y_k$. Indeed, the covariance may increase or decrease without changing the relationship between $y_j$ and $y_k$. For example, in Fig. 4.3, the two clouds of points correspond to different covariances (factor two in size, and thus in covariance), but the relationship between variables is identical (same shape). Since the covariance depends on the dispersion of points around the mean of each variable (i.e. their variances), determining the intensity of the relationship between variables requires controlling for the variances.

The covariance measures the joint dispersion of two random variables around their means. The correlation is defined as a measure of the dependence between two random variables $y_j$ and $y_k$.
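Two statements made above lend themselves to a quick numerical check: the rank of S when n ≤ p, and the fact that the covariance can change while the correlation does not. The sketch below is illustrative only. The random data (fixed seed), the choice n = 3 and p = 5, and the rescaling of the Table 4.1 data by √2 to double the covariance are all assumptions made for the example.

```python
import numpy as np

# (1) Rank of S when n <= p: here n = 3 objects and p = 5 descriptors.
rng = np.random.default_rng(0)
n, p = 3, 5
Y_wide = rng.normal(size=(n, p))
S_wide = np.cov(Y_wide, rowvar=False)             # p x p dispersion matrix
rank = np.linalg.matrix_rank(S_wide)              # expected: n - 1
eig = np.linalg.eigvalsh(S_wide)
n_null = int(np.sum(eig < 1e-10 * eig.max()))     # expected: p - (n - 1)

# (2) Same shape, different size: cloud B is cloud A stretched by sqrt(2).
A = np.array([[5, 1], [3, 2], [8, 3], [6, 4]], dtype=float)   # Table 4.1
B = np.sqrt(2) * A
cov_A = np.cov(A, rowvar=False)[0, 1]
cov_B = np.cov(B, rowvar=False)[0, 1]             # twice cov_A
r_A = np.corrcoef(A, rowvar=False)[0, 1]
r_B = np.corrcoef(B, rowvar=False)[0, 1]          # identical to r_A

print(rank, n_null, cov_A, cov_B, r_A, r_B)
```

The tolerance used to call an eigenvalue "null" is a pragmatic choice for floating-point arithmetic, not a value taken from the text.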
As already explained in Section 1.5, it often happens that matrices of ecological data contain descriptors with no common scale, e.g. when some species are more abundant than others by orders of magnitude, or when the descriptors have different physical dimensions (Chapter 3). Calculating covariances on such variables obviously does not make sense, unless the descriptors are first reduced to a common scale. The procedure consists in centring all descriptors on a zero mean and reducing them to unit standard deviation (eq. 1.12). By using standardized descriptors, it is possible to calculate meaningful covariances, because the new variables have the same scale (i.e. unit standard deviation) and are dimensionless (see Chapter 3).

Figure 4.3 Several observations (objects), with descriptors $y_j$ and $y_k$, were made under two different sets of conditions (A and B). The two ellipses delineate clouds of point-objects corresponding to A and B, respectively. The covariance of $y_j$ and $y_k$ is twice as large for B as it is for A (larger ellipse), but the correlation between the two descriptors is the same in these two cases (i.e. the ellipses have the same shape). [Plot with axes $y_j$ and $y_k$.]

Linear correlation. The covariance of two standardized descriptors is called the coefficient of linear correlation (Pearson r). This statistic was proposed by the statistician Karl Pearson and is named after him. Given two standardized descriptors (eq. 1.12)

$$z_{ij} = \frac{y_{ij} - \bar{y}_j}{s_j} \quad \text{and} \quad z_{ik} = \frac{y_{ik} - \bar{y}_k}{s_k}$$

calculating their covariance (eq. 4.4) gives

$$s(z_j, z_k) = \frac{1}{n-1}\sum_{i=1}^{n} z_{ij}\,z_{ik} \quad \text{because } \bar{z}_j = \bar{z}_k = 0$$

$$s(z_j, z_k) = \frac{1}{n-1}\sum_{i=1}^{n} \left(\frac{y_{ij} - \bar{y}_j}{s_j}\right)\left(\frac{y_{ik} - \bar{y}_k}{s_k}\right)$$

$$s(z_j, z_k) = \frac{1}{s_j s_k}\left[\frac{1}{n-1}\sum_{i=1}^{n} (y_{ij} - \bar{y}_j)(y_{ik} - \bar{y}_k)\right]$$

$$s(z_j, z_k) = \frac{s_{jk}}{s_j s_k} = r_{jk}, \text{ the coefficient of linear correlation between } y_j \text{ and } y_k.$$

The developed formula is:

$$r_{jk} = \frac{s_{jk}}{s_j s_k} = \frac{\sum_{i=1}^{n}(y_{ij}-\bar{y}_j)(y_{ik}-\bar{y}_k)}{\sqrt{\sum_{i=1}^{n}(y_{ij}-\bar{y}_j)^2\,\sum_{i=1}^{n}(y_{ik}-\bar{y}_k)^2}} \qquad (4.7)$$

Correlation matrix. As in the case of dispersion (Section 4.1), it is possible to construct the correlation matrix of Y, i.e. the $\mathbf{P}$ (rho) matrix, whose elements are the coefficients of linear correlation $\rho_{jk}$:

$$\mathbf{P} = \begin{bmatrix} 1 & \rho_{12} & \cdots & \rho_{1p} \\ \rho_{21} & 1 & \cdots & \rho_{2p} \\ \vdots & \vdots & & \vdots \\ \rho_{p1} & \rho_{p2} & \cdots & 1 \end{bmatrix} \qquad (4.8)$$

The correlation matrix is the dispersion matrix of the standardized variables. This concept will play a fundamental role in principal component analysis (Section 9.1). It should be noted that the diagonal elements of $\mathbf{P}$ are all equal to 1. This is because the comparison of any descriptor with itself is a case of complete dependence, which leads to a correlation $\rho_{jj} = 1$. When $y_j$ and $y_k$ are independent of each other, $\rho_{jk} = 0$. However, a correlation equal to zero does not necessarily imply that $y_j$ and $y_k$ are independent of each other, as shown by the following numerical example. A correlation $\rho_{jk} = -1$ is indicative of a complete, but inverse, dependence of the two variables.

Numerical example. Using the values in Table 4.3, matrix R can easily be computed. Each element $r_{jk}$ combines, according to eq. 4.7, the covariance $s_{jk}$ with variances $s_j$ and $s_k$:

$$\mathbf{R} = \begin{bmatrix} 1 & 0 & 0.8 & -0.8 \\ 0 & 1 & 0 & 0 \\ 0.8 & 0 & 1 & -1 \\ -0.8 & 0 & -1 & 1 \end{bmatrix}$$

Matrix R is symmetric, as was matrix S. The correlation r = −1 between species 3 and 4 means that these species are fully, but inversely, dependent (Fig. 4.4a). Correlations r = 0.8 and −0.8 are interpreted as indications of strong dependence between species 1 and 3 (direct) and species 1 and 4 (inverse), respectively. The zero correlation between species 2 and the other three species must be interpreted with caution. Figure 4.4d clearly shows that species 1 and 2 are completely dependent on each other since they are related by the equation $y_2 = 1 + (3 - y_1)^2$; the zero correlation is, in this case, a consequence of the linear model underlying statistic r. Therefore, only those correlations which are significantly different from zero should be considered, since a null correlation has no unique interpretation.

Figure 4.4 Numerical example. Relationships between species (a) 3 and 4, (b) 2 and 4, (c) 2 and 3, and (d) 2 and 1. [Four scatter plots, panels a to d.]

Since the correlation matrix is the dispersion matrix of standardized variables, it is possible, as in the case of matrix S (eq. 4.6), to compute R directly by multiplying the matrix of standardized data with its transpose:

$$\mathbf{R} = \frac{1}{n-1}\left[\frac{\mathbf{y} - \bar{\mathbf{y}}}{s_y}\right]'\left[\frac{\mathbf{y} - \bar{\mathbf{y}}}{s_y}\right] = \frac{1}{n-1}\,\mathbf{Z}'\mathbf{Z} \qquad (4.9)$$

Table 4.4 shows how to calculate correlations $r_{jk}$ of the example as in Table 4.3, using this time the standardized data. The mean of each standardized variable is zero and its standard deviation is equal to unity. The dispersion matrix of Z is identical to the correlation matrix of Y, which was calculated above using the covariances and variances.
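Eq. 4.9 and the caution about null correlations can be illustrated together. In the sketch below, the function name `correlation_matrix` is hypothetical, and the values 1 to 5 taken by $y_1$ are an assumption made for the illustration (the entries of Table 4.3 are not used); only the relationship $y_2 = 1 + (3 - y_1)^2$ comes from the text.

```python
import numpy as np

def correlation_matrix(Y):
    """R = Z'Z / (n - 1), where Z holds the standardized columns of Y (eq. 4.9)."""
    Y = np.asarray(Y, dtype=float)
    n = Y.shape[0]
    Z = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)   # zero mean, unit s.d.
    return Z.T @ Z / (n - 1)

# Two completely dependent variables (assumed y1 values; relationship from the text).
y1 = np.arange(1.0, 6.0)          # 1, 2, 3, 4, 5
y2 = 1 + (3 - y1) ** 2            # 5, 2, 1, 2, 5

R = correlation_matrix(np.column_stack([y1, y2]))
print(np.round(R, 12))
```

The off-diagonal element is zero (up to rounding) although $y_2$ is an exact function of $y_1$: the parabola is symmetric about the mean of $y_1$, so there is no linear trend for r to detect.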

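Table 4.4 Numerical example. Calculation of standardized data and correlations. [Table layout: for the five sites, the original data Y and the standardized data Z; below them, the rows of means $\bar{\mathbf{y}}'$ and $\bar{\mathbf{z}}'$ and the matrix $\mathbf{R}_y = \mathbf{S}_z = \frac{1}{n-1}\mathbf{Z}'\mathbf{Z}$ with n − 1 = 4. Numerical entries not reproduced.]

Matrices $\boldsymbol{\Sigma}$ and $\mathbf{P}$ are related to each other by the diagonal matrix of standard deviations of Y. This new matrix, which is specifically designed for relating $\boldsymbol{\Sigma}$ and $\mathbf{P}$, is symbolized by $\mathbf{D}(\sigma)$ and its inverse by $\mathbf{D}(\sigma)^{-1}$:

$$\mathbf{D}(\sigma) = \begin{bmatrix} \sigma_1 & 0 & \cdots & 0 \\ 0 & \sigma_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \sigma_p \end{bmatrix} \quad \text{and} \quad \mathbf{D}(\sigma)^{-1} = \begin{bmatrix} 1/\sigma_1 & 0 & \cdots & 0 \\ 0 & 1/\sigma_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1/\sigma_p \end{bmatrix}$$

Using these two matrices, one can write:

$$\mathbf{P} = \mathbf{D}(\sigma^2)^{-1/2}\,\boldsymbol{\Sigma}\,\mathbf{D}(\sigma^2)^{-1/2} = \mathbf{D}(\sigma)^{-1}\,\boldsymbol{\Sigma}\,\mathbf{D}(\sigma)^{-1} \qquad (4.10)$$

where $\mathbf{D}(\sigma^2)$ is the matrix of the diagonal elements of $\boldsymbol{\Sigma}$. It follows from eq. 4.10 that:

$$\boldsymbol{\Sigma} = \mathbf{D}(\sigma)\,\mathbf{P}\,\mathbf{D}(\sigma) \qquad (4.11)$$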
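Eqs. 4.10 and 4.11 are written for the parameters $\boldsymbol{\Sigma}$ and $\mathbf{P}$; the same algebra holds for their estimates S and R. The following sketch, which assumes the Table 4.1 data and uses illustrative variable names, converts S to R and back.

```python
import numpy as np

# Table 4.1 data and its dispersion matrix S (sample counterpart of Sigma).
Y = np.array([[5, 1], [3, 2], [8, 3], [6, 4]], dtype=float)
S = np.cov(Y, rowvar=False)

D = np.diag(np.sqrt(np.diag(S)))     # D(s): standard deviations on the diagonal
D_inv = np.diag(1 / np.diag(D))      # inverse of a diagonal matrix

R = D_inv @ S @ D_inv                # sample analogue of eq. 4.10
S_back = D @ R @ D                   # sample analogue of eq. 4.11

print(R)
```

For these data $r_{12} = (4/3)/\sqrt{(13/3)(5/3)} = 4/\sqrt{65}$, and multiplying back by D recovers S exactly, up to rounding.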

Significance of r. The theory underlying tests of significance is discussed in Section 1.2. In the case of r, inference about the statistical population is in most instances through the null hypothesis H0: ρ = 0. H0 may also state that ρ has some other value than zero, which would be derived from ecological hypotheses. The general formula for testing correlation coefficients is given in Section 4.5 (eq. 4.39). The Pearson correlation coefficient $r_{jk}$ involves two descriptors (i.e. $y_j$ and $y_k$, hence m = 2 when testing a coefficient of simple linear correlation using eq. 4.39), so that $\nu_1 = m - 1 = 1$ and $\nu_2 = n - m = n - 2 = \nu$. The general formula then becomes:

$$F = \frac{r_{jk}^2 / 1}{(1 - r_{jk}^2)/\nu} = \nu\,\frac{r_{jk}^2}{1 - r_{jk}^2} \qquad (4.12)$$

where ν = n − 2. Statistic F is tested against $F_{\alpha[1,\nu]}$. Since the square root of a statistic $F_{[\nu_1,\nu_2]}$ is a statistic $t_{[\nu = \nu_2]}$ when $\nu_1 = 1$, r may also be tested using:

$$t = \frac{r_{jk}\sqrt{\nu}}{\sqrt{1 - r_{jk}^2}} \qquad (4.13)$$

The t statistic is tested against the value $t_{\alpha[\nu]}$. In other words, H0 is tested by comparing the F (or t) statistic to the value found in a table of critical values of F (or t). Results of tests with eqs. 4.12 and 4.13 are identical. The number of degrees of freedom is ν = (n − 2) because calculating a correlation coefficient requires prior estimation of two parameters, i.e. the means of the two populations (eq. 4.7). H0 is rejected when the probability corresponding to F (or t) is smaller than a predetermined level of significance (α for a two-tailed test, and α/2 for a one-tailed test; the difference between the two types of tests is explained in Section 1.2).

In principle, this test requires that the sample of observations be drawn from a population with a bivariate normal distribution (Section 4.3). Testing for normality and multinormality is discussed in Section 4.7, and normalizing transformations in Section 1.5. When the data do not satisfy the condition of normality, t can be tested by randomization, as shown in Section 1.2.

Test of independence of variables. It is also possible to test the independence of all variables in a data matrix by considering the set of all correlation coefficients found in matrix R. The null hypothesis here is that the p(p − 1)/2 coefficients are all equal to zero, H0: R = I (unit matrix).
According to Bartlett (1954), the determinant of R, |R|, can be transformed into an X² (chi-square) test statistic:

$$X^2 = -\left[n - (2p + 5)/6\right] \ln|\mathbf{R}| \qquad (4.14)$$

where ln|R| is the natural logarithm of the determinant of R. This statistic is approximately distributed as χ² with ν = p(p − 1)/2 degrees of freedom. When the probability associated with X² is significantly low, the null hypothesis of complete independence of the p descriptors is rejected. In principle, this test requires the observations to be drawn from a population with a multivariate normal distribution (Section 4.3). If the null hypothesis of independence of all variables is rejected, the p(p − 1)/2 correlation coefficients in matrix R may be tested individually; see Box 1.3 about multiple testing.

Other correlation coefficients are described in Section 4.5 and Chapter 5. Wherever the coefficient of linear correlation must be distinguished from other coefficients, it is referred to as Pearson's r. In other instances, r is simply called the coefficient of linear correlation or correlation coefficient. Table 4.5 summarizes the main properties of this coefficient.

Table 4.5 Main properties of the coefficient of linear correlation. Some of these properties are discussed in later sections.

Properties                                                                      Sections
1. The coefficient of linear correlation measures the intensity of the
   linear relationship between two random variables.                           4.2
2. The coefficient of linear correlation between two variables can be
   calculated using their respective variances and their covariance.           4.2
3. The correlation matrix is the dispersion matrix of standardized variables.  4.2
4. The square of the coefficient of linear correlation is the coefficient
   of determination. It measures how much of the variance of each variable
   is explained by the other.                                                   10.3
5. The coefficient of linear correlation is a parameter of a multinormal
   distribution.                                                                4.3
6. The coefficient of linear correlation is the geometric mean of the
   coefficients of linear regression of each variable on the other.            10.3
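The test statistics of eqs. 4.12 to 4.14 can be sketched as follows. The function names are hypothetical, and the Bartlett multiplier is taken exactly as written in eq. 4.14 above. The sketch stops at the statistics and their degrees of freedom; the associated probabilities would then be read from tables of critical values (or obtained from a statistics library, which is not assumed here).

```python
import math
import numpy as np

def f_and_t_for_r(r, n):
    """Return (F, t, nu) for testing H0: rho = 0 (eqs. 4.12 and 4.13)."""
    nu = n - 2                                  # degrees of freedom
    f_stat = nu * r**2 / (1 - r**2)             # tested against F[1, nu]
    t_stat = r * math.sqrt(nu) / math.sqrt(1 - r**2)   # tested against t[nu]
    return f_stat, t_stat, nu

def bartlett_statistic(R, n):
    """Return (X2, df) for testing H0: R = I, following eq. 4.14."""
    R = np.asarray(R, dtype=float)
    p = R.shape[0]
    _, log_det = np.linalg.slogdet(R)           # ln|R|, numerically stable
    x2 = -(n - (2 * p + 5) / 6) * log_det
    df = p * (p - 1) // 2
    return float(x2), df

# Illustration: r = 0.8 computed from n = 5 objects.
f_stat, t_stat, nu = f_and_t_for_r(0.8, 5)

# Illustration: a 2 x 2 correlation matrix with r = 0.5 and an assumed n = 10.
x2, df = bartlett_statistic([[1.0, 0.5], [0.5, 1.0]], 10)

print(f_stat, t_stat, nu, x2, df)
```

Squaring t gives F, which is why the two tests of r lead to identical conclusions. When R is the identity matrix, ln|R| = 0 and the Bartlett statistic is zero, as expected under complete independence.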
