Statistical Analysis of Compositional Data


1 Statistical Analysis of Compositional Data

Carles Barceló Vidal, J. Antoni Martín Fernández, Santiago Thió Fdez-Henestrosa
Dept. d'Informàtica i Matemàtica Aplicada, Universitat de Girona, Campus de Montilivi, E-17071 Girona, Catalunya, Spain

2 What is compositional data?

Traditionally: a composition is a positive vector $x = (x_1, \dots, x_D)$ whose components are subject to a constant-sum restriction: $x_1 + \cdots + x_D = \text{constant}$. Compositional data = closed data.

3 What is compositional data?

A positive vector $w = (w_1, \dots, w_D)$ is compositional when our interest lies in the relative magnitudes $w_j / w_k$ of its parts and not in their absolute values.

Scale-invariance property: if a positive vector $w = (w_1, \dots, w_D)$ is compositional, the vectors $w$ and $kw$, with $k > 0$, give us the same information.

4 USA presidential election

State      Bush    Gore    Others   Total
Alabama    56.6%   41.8%   1.6%     100%
Alaska     59.0%   27.9%   13.1%    100%
…          …       …       …        …
Wisconsin  47.7%   47.9%   4.4%     100%
Wyoming    69.2%   28.3%   2.5%     100%

5 Activity patterns of a statistician

Daily time (hours) devoted by an academic statistician to different activities over 20 days: te = teaching; co = consultation; ad = administration; re = research; ot = other wakeful activities; sl = sleep.

Day   te    co    ad    re    ot    sl
1     3.5   2.0   4.5   2.5   6.5   5.0
2     3.0   2.0   2.5   3.0   6.5   6.0
…
19    2.5   2.5   3.0   2.0   5.0   8.5
20    2.5   2.0   3.0   3.0   4.0   9.0

Day   te      co      ad      re      ot      sl      Total
1     14.6%   8.3%    18.8%   10.4%   27.1%   20.8%   100%
2     16.7%   8.3%    10.4%   12.5%   27.1%   25.0%   100%
…
19    10.4%   10.4%   12.5%   8.3%    20.8%   35.4%   100%
20    10.5%   8.3%    12.5%   12.5%   16.7%   37.5%   100%

6 Arctic lake

Sand, silt, clay composition (% by weight) of 39 sediment samples from an Arctic lake:

Sample   Sand   Silt   Clay   Total
S1       …      …      …      100%
…
S39      …      …      …      100%

7 Volcano H

Percentage of Cl, K₂O, P₂O₅, TiO₂ and SiO₂ in 46 samples of volcanic rocks from a volcano H:

Sample   Cl   K₂O   P₂O₅   TiO₂   SiO₂   Total
1        …    …     …      …      …      100%
…
46       …    …     …      …      …      100%

8 Halimba boreholes

Percentages of Al₂O₃, SiO₂, Fe₂O₃, TiO₂, H₂O, CaO and MgO in samples from different boreholes in the Halimba region (Hungary):

Al₂O₃   SiO₂   Fe₂O₃   TiO₂   H₂O    CaO   MgO   Total
52.5    6.7    23.6    2.6    12.0   0.2   0.1   97.7%
47.7    4.6    32.1    2.3    12.0   2.0   0.0   100.7%
50.6    8.9    25.4    2.5    11.9   1.1   0.0   100.4%

9 The space of compositions

Any $D \times 1$ real vector $w = (w_1, \dots, w_D)$ with positive components $w_1, \dots, w_D$ will be called a D-observational vector. Therefore, the set of these vectors will be $\mathbb{R}^D_+$, the positive orthant of $\mathbb{R}^D$.

Definition. Two D-observational vectors $w$ and $w'$ are compositionally equivalent, $w \sim w'$, when there exists a positive proportionality constant $k$ such that $w' = kw$. This relation partitions the vectors of $\mathbb{R}^D_+$ into equivalence classes, called D-compositions. The composition generated by an observational vector $w$ will be symbolized by $\bar{w}$, i.e., $\bar{w} = \{kw : k \in \mathbb{R}_+\}$.

10 Scale invariance

Definition. A function $f$ defined on $\mathbb{R}^D_+$ is said to be scale invariant if $f(kw) = f(w)$ for every $w \in \mathbb{R}^D_+$ and $k \in \mathbb{R}_+$, or, equivalently, $f(w) = f(w')$ whenever $w \sim w'$.

Property. Any scale-invariant function $f(w)$ defined on $\mathbb{R}^D_+$ can be expressed in terms of ratios of the components $w_1, \dots, w_D$ of $w$, such as $w_1/w_D, \dots, w_{D-1}/w_D$, or $w_1/g(w), \dots, w_D/g(w)$, where $g(w) = (w_1 w_2 \cdots w_D)^{1/D}$ is the geometric mean of the components of $w$.

Property. Any function defined on the compositional space $\mathcal{C}^{D-1}$ arises from a scale-invariant function defined on the positive real space $\mathbb{R}^D_+$.

11 The space of compositions

A D-part composition can be geometrically interpreted as a ray from the origin in the positive orthant of $\mathbb{R}^D$.

[Figure: a composition $\bar{w}$ depicted as a ray through $w$ in the positive octant with axes $W_1$, $W_2$, $W_3$]

The set $\mathcal{C}^{D-1}$ of all D-compositions will be called the (D-1)-dimensional compositional space. The compositional closure mapping from $\mathbb{R}^D_+$ to $\mathcal{C}^{D-1}$, denoted by ccl, is defined by $\mathrm{ccl}\,w = \bar{w}$ ($w \in \mathbb{R}^D_+$).

12 Representation of a composition: linear criterion

Definition. The linear criterion selects from each D-composition $\bar{w}$ the D-observational vector $w^*$ with components $w^*_1, \dots, w^*_D$ whose sum is equal to 1. If this vector is symbolized by $\mathrm{ccl}_L\,w$ or by $\mathcal{C}w$, then
$$\mathrm{ccl}_L\,w = \mathcal{C}w = w \Big/ \sum_{j=1}^D w_j \qquad (w \in \mathbb{R}^D_+).$$
The set of all the vectors $x = \mathrm{ccl}_L\,w$ ($w \in \mathbb{R}^D_+$) is the well-known (D-1)-dimensional simplex $\mathcal{S}^D$.
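A minimal numerical sketch of the closure operation (the function name `closure` and the use of NumPy are our own choices, not part of the slides):

```python
import numpy as np

def closure(w):
    """Linear criterion: rescale a positive vector so that its parts sum to 1."""
    w = np.asarray(w, dtype=float)
    return w / w.sum()

# Compositionally equivalent vectors are mapped to the same point of the simplex.
print(closure([1.0, 2.0, 2.0]))     # [0.2 0.4 0.4]
print(closure([10.0, 20.0, 20.0]))  # [0.2 0.4 0.4]
```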

13 Linear criterion

[Figure: the ray $\bar{w}$ in $\mathbb{R}^3_+$ and its representative $\mathrm{ccl}_L\,w$ on the ternary diagram with vertices 1, 2, 3]

14 Representation of a composition: other criteria

Spherical criterion: [Figure: the ray $\bar{w}$ intersected with the unit sphere, giving $\mathrm{ccl}_E\,w$]

Hyperbolic criterion: [Figure: the ray $\bar{w}$ intersected with a hyperbola, giving $\mathrm{ccl}_H\,w$]

15 Subcompositions

Sometimes, given a composition $\bar{w}$ in $\mathcal{C}^{D-1}$, we may wish to focus attention on the relative magnitudes of a subset of its components.

Definition. If S is any subset of the indices $1, \dots, D$ of a given D-composition $\bar{w} \in \mathcal{C}^{D-1}$, and $w_S$ is the subvector formed from the corresponding components of $w$, then $\bar{w}_S = \mathrm{ccl}\,w_S$ is termed a subcomposition. If the subset S is formed by C indices, with $2 \le C < D$, the subcomposition $\bar{w}_S$ belongs to the compositional space $\mathcal{C}^{C-1}$.

Definition. The formation of a C-subcomposition $\bar{w}_S$ from a D-composition $\bar{w}$ may be considered as the mapping $\mathrm{sub}_S$ from $\mathcal{C}^{D-1}$ to $\mathcal{C}^{C-1}$: $\mathrm{sub}_S\,\bar{w} = \bar{w}_S$ ($\bar{w} \in \mathcal{C}^{D-1}$).

16 Subcompositions

[Figure: the composition $\bar{w}$ in $\mathbb{R}^3_+$, its subvector $w_{12}$, and the representatives $\mathrm{ccl}_L\,w$ and $\mathrm{ccl}_L\,w_{12}$ on the ternary diagram]

17 Compositional problems (1)

Percentage of Cl, K₂O, P₂O₅, TiO₂ and SiO₂ in 46 samples of volcanic rocks from a volcano H:

Num   Cl   K₂O   P₂O₅   TiO₂   SiO₂
1     …    …     …      …      …
…
46    …    …     …      …      …

18 Compositional problems

1a. Is it possible to describe the pattern of variability of these volcanic rocks and to define a covariance or correlation structure?
1b. Is it possible to define a measure of the total variability of this set of volcanic rocks?
1c. For a new volcanic rock specimen with known composition (Cl, K₂O, P₂O₅, TiO₂, SiO₂) and claimed to be from the same volcano, can we say whether it is fairly typical of this volcano? If not, can we place some measure on its atypicality?
1d. To what extent, if any, does the subcomposition (Cl, K₂O, P₂O₅) explain the pattern of variability of the full composition?

19 Compositional problems

1e. From this ternary diagram it seems that the pattern of (K₂O, P₂O₅, TiO₂) can be well fitted by a curve. How can we confirm this?

20 Compositional problems (2)

Percentage of Cl, K₂O, P₂O₅, TiO₂ and SiO₂: 65 samples of volcanic rocks from a volcano A, and 19 samples from another volcano D:

Num   Cl   K₂O   P₂O₅   TiO₂   SiO₂
1A    …    …     …      …      …
…
65A   …    …     …      …      …
1D    …    …     …      …      …
…
19D   …    …     …      …      …

21 Compositional problems

2a. Can we detect any differences between the compositional pattern of volcano A and volcano D? If so, how can we choose a 3-part subcomposition which somehow captures the essence of the two patterns individually and yet emphasizes the differences between the patterns?
2b. Is it possible to establish a classification rule for discriminating between volcanoes A and D?

22 Compositional problems (3)

Sand, silt, clay composition (% by weight) of 39 sediment samples at different water depths in an Arctic lake:

Num   Sand   Silt   Clay   Depth (m)
S1    …      …      …      …
…
S39   …      …      …      …

3a. Is sediment composition dependent on water depth?
3b. If so, how can we quantify the extent of the dependence?

23 How to analyze closed raw data? Spurious correlations

Pearson (1897): "If u = f(x, y) and v = g(z, y) be two functions of three variables x, y, z, and these variables be selected at random so that there exists no correlation between x and y, y and z, or z and x, there will still be found to exist correlation between u and v." This is likely to occur when u and v are indices with the same denominator.

Consequence. The standard covariance matrix $[s_{ij}]$ of a closed data set from $\mathcal{S}^D$ is always singular because $\sum_{j=1}^D s_{ij} = 0$, for $i = 1, \dots, D$.

24 How to analyze closed raw data? Subcompositional incoherence

Example.

Scientist A: full compositions from $\mathcal{S}^4$    Scientist B: subcompositions from $\mathcal{S}^3$
$(x_1, x_2, x_3, x_4)$                                  $(s_1, s_2, s_3)$
(0.1, 0.2, 0.1, 0.6)                                    (0.250, 0.500, 0.250)
(0.2, 0.1, 0.1, 0.6)                                    (0.500, 0.250, 0.250)
(0.3, 0.3, 0.2, 0.2)                                    (0.375, 0.375, 0.250)
corr{x₍₁₎, x₍₂₎} = 0.5                                  corr{s₍₁₎, s₍₂₎} = -1

Any statement that scientists A and B make about the common parts 1, 2 and 3 must agree.
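The example can be checked numerically; a small sketch (the variable names are ours):

```python
import numpy as np

X = np.array([[0.1, 0.2, 0.1, 0.6],
              [0.2, 0.1, 0.1, 0.6],
              [0.3, 0.3, 0.2, 0.2]])

# Scientist B sees only parts 1-3, re-closed so that each row sums to 1.
S = X[:, :3] / X[:, :3].sum(axis=1, keepdims=True)

print(np.corrcoef(X[:, 0], X[:, 1])[0, 1])  # 0.5
print(np.corrcoef(S[:, 0], S[:, 1])[0, 1])  # -1.0
```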

25 Statistics in $\mathbb{R}^D$

Translation. In $\mathbb{R}^D$ the inner operation is translation. If $t \in \mathbb{R}^D$, the translation $t$ moves a random vector $X$ in $\mathbb{R}^D$ to the random vector $X + t$, in such a way that $E\{X + t\} = E\{X\} + t$ and $\Sigma\{X + t\} = \Sigma\{X\}$.

Scalar product. For any random vector $X$ on $\mathbb{R}^D$ and for any $\lambda \in \mathbb{R}$, $E\{\lambda X\} = \lambda E\{X\}$ and $\Sigma\{\lambda X\} = \lambda^2 \Sigma\{X\}$.

26 Perturbations on $\mathcal{C}^{D-1}$

Scale invariance is the property which characterizes compositional data. Therefore, any operation involving compositions must be compatible with this property.

Definition. We define an inner operation $\oplus$ in $\mathcal{C}^{D-1}$ as
$$\bar{w} \oplus \bar{w}' = \mathrm{ccl}(w_1 w'_1, \dots, w_D w'_D).$$
$(\mathcal{C}^{D-1}, \oplus)$ is a commutative group:
- The composition $\bar{1}_D = \mathrm{ccl}(1, \dots, 1)$ is the neutral element.
- The inverse composition $\bar{w}^{-1}$ of $\bar{w} = \mathrm{ccl}(w_1, \dots, w_D)$ is the composition $\bar{w}^{-1} = \mathrm{ccl}(1/w_1, \dots, 1/w_D)$.

27 The group of perturbations in $\mathcal{C}^{D-1}$

Definition. Given a composition $\bar{p} \in \mathcal{C}^{D-1}$, the perturbation associated to $\bar{p}$ is the transformation from $\mathcal{C}^{D-1}$ to $\mathcal{C}^{D-1}$ defined by $\bar{c} \mapsto \bar{p} \oplus \bar{c}$ ($\bar{c} \in \mathcal{C}^{D-1}$). Then we say that $\bar{p} \oplus \bar{c}$ is the composition which results when the perturbation $\bar{p}$ is applied to the composition $\bar{c}$.

Moreover, given two compositions of $\mathcal{C}^{D-1}$, $\bar{w} = \mathrm{ccl}(w_1, \dots, w_D)$ and $\bar{w}' = \mathrm{ccl}(w'_1, \dots, w'_D)$, there exists a unique perturbation $\bar{p}$ which transforms $\bar{w}$ into $\bar{w}'$:
$$\bar{p} = \bar{w}' \oplus \bar{w}^{-1} = \mathrm{ccl}\left(\frac{w'_1}{w_1}, \dots, \frac{w'_D}{w_D}\right).$$
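A sketch of the perturbation operation and of the unique perturbation carrying one composition onto another (function names are our own illustration):

```python
import numpy as np

def closure(w):
    w = np.asarray(w, dtype=float)
    return w / w.sum()

def perturb(w1, w2):
    """Perturbation: component-wise product followed by closure."""
    return closure(np.asarray(w1, dtype=float) * np.asarray(w2, dtype=float))

def inverse(w):
    """Inverse composition under perturbation."""
    return closure(1.0 / np.asarray(w, dtype=float))

w      = closure([1.0, 2.0, 3.0])
w_star = closure([3.0, 2.0, 1.0])
p = perturb(w_star, inverse(w))            # the unique p with p (+) w = w*
print(np.allclose(perturb(p, w), w_star))  # True
```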

28 The group of perturbations in $\mathcal{C}^{D-1}$

[Figure: two ternary diagrams, one showing the perturbation $\bar{p} \oplus \bar{x}$ of a composition $\bar{x}$ and the neutral element $e$, the other showing the perturbation $\bar{x}^* \oplus \bar{x}^{-1}$ carrying $\bar{x}$ onto $\bar{x}^*$]

29 The group of perturbations in $\mathcal{C}^{D-1}$

Perturbation plays the same role in compositional space as translation plays in real space. The set of all perturbations in $\mathcal{C}^{D-1}$ is a commutative group isomorphic to $(\mathcal{C}^{D-1}, \oplus)$. For this reason, we will also call the inner operation $\oplus$ defined on $\mathcal{C}^{D-1}$ a perturbation.

The assumption that the group of perturbations is the operating group on the compositional space is the keystone of the methodology introduced by Aitchison (1986). In fact, it implies accepting that the difference between two compositions $\bar{w} = \mathrm{ccl}(w_1, \dots, w_D)$ and $\bar{w}' = \mathrm{ccl}(w'_1, \dots, w'_D)$ will be based on the ratios $w'_j / w_j$ between parts instead of on the arithmetic differences $w'_j - w_j$.

30 Perturbations on $\mathcal{C}^{D-1}$: interpretation

Some natural processes can be interpreted as a succession of changes from an initial composition $\bar{w}_0$ to a final composition $\bar{w}_n$ through the application of successive perturbations:
$$\bar{p}_1 \oplus \bar{w}_0 = \bar{w}_1, \quad \bar{p}_2 \oplus \bar{w}_1 = \bar{w}_2, \quad \dots, \quad \bar{p}_n \oplus \bar{w}_{n-1} = \bar{w}_n.$$
In this manner, $\bar{w}_n = (\bar{p}_n \oplus \bar{p}_{n-1} \oplus \cdots \oplus \bar{p}_1) \oplus \bar{w}_0$.

31 Genesis of the normal distribution

Particles fall from a funnel onto the tips of triangles, where they are deviated to the left or to the right with equal probability (0.5), and finally fall into receptacles. If the tip of a triangle is at distance x from the left edge of the board, the triangle tips to the right and to the left below it are placed at x + k and x - k (k constant).

32 Genesis of the lognormal distribution

Particles fall from a funnel onto the tips of triangles, where they are deviated to the left or to the right with equal probability (0.5), and finally fall into receptacles. If the tip of a triangle is at distance x from the left edge of the board, the triangle tips to the right and to the left below it are placed at x/k and xk (k constant).

33 Perturbations on $\mathcal{C}^{D-1}$: interpretation

If $\bar{w} = \mathrm{ccl}(w_{\mathrm{SiO_2}}, \dots, w_{\mathrm{P_2O_5}})$ expresses the percentage composition in major oxides of a rock, its molecular composition will be $\bar{w}' = \mathrm{ccl}(w_{\mathrm{SiO_2}}/m_{\mathrm{SiO_2}}, \dots, w_{\mathrm{P_2O_5}}/m_{\mathrm{P_2O_5}})$, where $m_j$ symbolizes the molecular weight of oxide j. Therefore, composition $\bar{w}'$ can be obtained by applying the perturbation $\bar{m}^{-1} = (\mathrm{ccl}(m_{\mathrm{SiO_2}}, \dots, m_{\mathrm{P_2O_5}}))^{-1}$ to composition $\bar{w}$: $\bar{w}' = \bar{m}^{-1} \oplus \bar{w}$.

34 The vector space $(\mathcal{C}^{D-1}, \oplus, \odot)$

Definition. The external operation $\odot$ in $\mathcal{C}^{D-1}$ is defined as
$$\lambda \odot \bar{w} = \mathrm{ccl}(w_1^\lambda, \dots, w_D^\lambda),$$
for each $\lambda \in \mathbb{R}$ and each $\bar{w} \in \mathcal{C}^{D-1}$. $(\mathcal{C}^{D-1}, \oplus, \odot)$ is a vector space of dimension D - 1.

[Figure: ternary diagram showing $\bar{x}$, $2 \odot \bar{x}$, $(-2) \odot \bar{x}$ and the neutral element e]
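A sketch of the power transformation, reusing the `closure` and `perturb` helpers from the earlier sketches; the identity $2 \odot \bar{x} = \bar{x} \oplus \bar{x}$ illustrates the vector-space structure:

```python
import numpy as np

def closure(w):
    w = np.asarray(w, dtype=float)
    return w / w.sum()

def perturb(w1, w2):
    return closure(np.asarray(w1, dtype=float) * np.asarray(w2, dtype=float))

def power(lam, w):
    """External operation: raise every part to the power lam and re-close."""
    return closure(np.asarray(w, dtype=float) ** lam)

x = closure([1.0, 4.0, 9.0])
print(np.allclose(power(2.0, x), perturb(x, x)))  # True: 2 (.) x = x (+) x
```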

35 The log and exp transformations between $\mathbb{R}^D_+$ and $\mathbb{R}^D$

The logarithmic transformation on $\mathbb{R}^D_+$ transforms the rays from the origin, which represent the compositions of the space $\mathcal{C}^{D-1}$, into straight lines of $\mathbb{R}^D$ parallel to the vector $1_D = (1, \dots, 1)$. Inversely, the exponential transformation on $\mathbb{R}^D$ transforms these straight lines of $\mathbb{R}^D$ parallel to the vector $1_D$ into rays from the origin of $\mathbb{R}^D_+$.

36 [Figure: the ray $\bar{w}$ in $\mathbb{R}^2_+$ (axes $W_1$, $W_2$) is mapped by the log transformation onto the line $z + U$ of $\mathbb{R}^2$ (axes $Z_1$, $Z_2$) parallel to $1_2$, with the subspace V and the representatives $\mathrm{ccl}\,w$ and z marked]

37 Centered logratio transformation

Definition. The centered logratio transformation, denoted by clr, is the one-to-one function from the compositional space $\mathcal{C}^{D-1}$ to the subspace $V = \{z = (z_1, \dots, z_D) \in \mathbb{R}^D : z_1 + \cdots + z_D = 0\}$ of $\mathbb{R}^D$, defined by
$$\mathrm{clr}\,\bar{w} = \log \frac{w}{g(w)} \qquad (\bar{w} \in \mathcal{C}^{D-1}).$$
The inverse transformation, from V to $\mathcal{C}^{D-1}$, is given by $\mathrm{clr}^{-1} z = \mathrm{ccl}(\exp z)$ ($z \in V$).

The logarithmic and the exponential transformations establish a one-to-one correspondence between the simplex $\mathcal{S}^D$ and the hyperplane V in $\mathbb{R}^D$.
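A sketch of clr and its inverse (ours, not the authors' code):

```python
import numpy as np

def closure(w):
    w = np.asarray(w, dtype=float)
    return w / w.sum()

def clr(w):
    """Centered logratio: log of the parts relative to their geometric mean."""
    lw = np.log(np.asarray(w, dtype=float))
    return lw - lw.mean()

def clr_inv(z):
    """Inverse clr: exponentiate and close."""
    return closure(np.exp(np.asarray(z, dtype=float)))

x = closure([1.0, 2.0, 7.0])
print(np.isclose(clr(x).sum(), 0.0))    # True: clr images lie in the hyperplane V
print(np.allclose(clr_inv(clr(x)), x))  # True: clr is one-to-one
```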

38 Centered logratio transformation

[Figure: the ray $\bar{w}$ in $\mathbb{R}^2_+$ and its clr image on the subspace V of $\mathbb{R}^2$, with the line $z + U$ parallel to $1_2$]

39 Centered logratio transformation

Property. The centered logratio transformation is an isomorphism between the vector space $(\mathcal{C}^{D-1}, \oplus, \odot)$ and the vector subspace $V = \{z = (z_1, \dots, z_D) \in \mathbb{R}^D : z_1 + \cdots + z_D = 0\}$ of $(\mathbb{R}^D, +, \cdot)$. Therefore,
$$\mathrm{clr}(\bar{w} \oplus \bar{w}') = \mathrm{clr}\,\bar{w} + \mathrm{clr}\,\bar{w}'; \qquad \mathrm{clr}(\lambda \odot \bar{w}) = \lambda\,\mathrm{clr}\,\bar{w},$$
where $\bar{w}, \bar{w}' \in \mathcal{C}^{D-1}$ and $\lambda \in \mathbb{R}$. Equally,
$$\mathrm{clr}^{-1}(z + z') = \mathrm{clr}^{-1} z \oplus \mathrm{clr}^{-1} z'; \qquad \mathrm{clr}^{-1}(\lambda z) = \lambda \odot \mathrm{clr}^{-1} z,$$
where $z, z' \in V$ and $\lambda \in \mathbb{R}$.

40 Isometric logratio transformation

Let $\mathcal{V} = \{v_1, \dots, v_{D-1}\}$ be an orthonormal basis of the subspace $V = \{z = (z_1, \dots, z_D) \in \mathbb{R}^D : z_1 + \cdots + z_D = 0\}$. Then, since $\mathrm{clr}\,\bar{w} \in V$, it is always possible to write
$$\mathrm{clr}\,\bar{w} = u_1 v_1 + \cdots + u_{D-1} v_{D-1},$$
for any $\bar{w} \in \mathcal{C}^{D-1}$.

Definition. The isometric logratio transformation, denoted by $\mathrm{ilr}_{\mathcal{V}}$, is the one-to-one function from the compositional space $\mathcal{C}^{D-1}$ to $\mathbb{R}^{D-1}$ defined by $\mathrm{ilr}_{\mathcal{V}}\,\bar{w} = (u_1, \dots, u_{D-1})$ ($\bar{w} \in \mathcal{C}^{D-1}$). Like clr, the transformation $\mathrm{ilr}_{\mathcal{V}}$ is an isomorphism between the vector spaces $(\mathcal{C}^{D-1}, \oplus, \odot)$ and $(\mathbb{R}^{D-1}, +, \cdot)$.
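A sketch of ilr for D = 3, using the orthonormal basis quoted on the next slide (the matrix `V` stacks $v_1$ and $v_2$ as rows; the code is our illustration):

```python
import numpy as np

def closure(w):
    w = np.asarray(w, dtype=float)
    return w / w.sum()

def clr(w):
    lw = np.log(np.asarray(w, dtype=float))
    return lw - lw.mean()

# Orthonormal basis of V for D = 3 (rows are v1 and v2).
V = np.array([[1/np.sqrt(2), -1/np.sqrt(2),  0.0],
              [1/np.sqrt(6),  1/np.sqrt(6), -2/np.sqrt(6)]])

def ilr(w):
    """Coordinates of clr(w) with respect to the orthonormal basis V."""
    return V @ clr(w)

def ilr_inv(u):
    return closure(np.exp(V.T @ np.asarray(u, dtype=float)))

x = closure([52.0, 42.0, 6.0])
print(ilr(x))                           # ~[0.1510, 1.6760], as for S1 on the next slide
print(np.allclose(ilr_inv(ilr(x)), x))  # True
```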

41 Skye lavas

Sample   Na₂O+K₂O   Fe₂O₃   MgO
S1       52.0       42.0    6.0
S2       52.0       44.0    4.0
S3       47.0       48.0    5.0

         clr(Na₂O+K₂O)   clr(Fe₂O₃)   clr(MgO)
S1       0.7910          0.5775       -1.3685
S2       0.9107          0.7436       -1.6543
S3       0.7399          0.7609       -1.5008

         u₁        u₂
S1       0.1510    1.6760
S2       0.1181    2.0261
S3       -0.0149   1.8381

Hint. The orthonormal basis $\mathcal{V}$ of the subspace $V \subset \mathbb{R}^3$ linked to the ilr coordinates is $v_1 = (1/\sqrt{2}, -1/\sqrt{2}, 0)$, $v_2 = (1/\sqrt{6}, 1/\sqrt{6}, -2/\sqrt{6})$.

42 Skye lavas

[Figures: the Skye lavas data plotted in the ternary diagram, in clr coordinates and in ilr coordinates]

43 $\mathcal{C}^{D-1}$ as a Euclidean space

The clr transformation between $\mathcal{C}^{D-1}$ and the subspace V of $\mathbb{R}^D$ allows us to carry over to $\mathcal{C}^{D-1}$ the real Euclidean structure defined on V:
$$\langle \bar{w}, \bar{w}' \rangle_C = (\mathrm{clr}\,\bar{w})'\,\mathrm{clr}\,\bar{w}' = (\log w)'\,H_D\,\log w',$$
$$\|\bar{w}\|_C = \|\mathrm{clr}\,\bar{w}\| = \left[(\log w)'\,H_D\,\log w\right]^{1/2},$$
$$d_C(\bar{w}, \bar{w}') = d_{Euc}(\mathrm{clr}\,\bar{w}, \mathrm{clr}\,\bar{w}') = \left[(\log w' - \log w)'\,H_D\,(\log w' - \log w)\right]^{1/2},$$
where $H_D$ is the $(D \times D)$ centering matrix. This matrix is equal to $I_D - D^{-1} J_D$, where $I_D$ is the identity matrix and $J_D = 1_D 1_D'$. Therefore, by construction, the transformations clr and clr⁻¹, and also ilr and ilr⁻¹, preserve the distances defined on $\mathcal{C}^{D-1}$ and $\mathbb{R}^{D-1}$.
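A sketch of the compositional (Aitchison) distance, which reproduces the comparison on the next slide (names are ours):

```python
import numpy as np

def clr(w):
    lw = np.log(np.asarray(w, dtype=float))
    return lw - lw.mean()

def adist(w1, w2):
    """Aitchison distance: Euclidean distance between the clr images."""
    return np.linalg.norm(clr(w1) - clr(w2))

w1, w2 = [1.000, 49.500, 39.500], [0.010, 49.995, 39.995]
w3, w4 = [25.0, 50.0, 25.0], [35.0, 30.0, 35.0]
print(adist(w1, w2))  # ~3.77, although w1 and w2 are close in raw Euclidean terms
print(adist(w3, w4))  # ~0.69, although w3 and w4 are far apart in raw Euclidean terms
```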

44 Compositional geometry in $\mathcal{C}^{D-1}$

We cannot analyze the simplex $\mathcal{S}^D$ as we analyze the Euclidean real space. Let
$$\bar{w}_1 = \mathrm{ccl}(1.000, 49.500, 39.500), \quad \bar{w}_2 = \mathrm{ccl}(0.010, 49.995, 39.995),$$
$$\bar{w}_3 = \mathrm{ccl}(25.0, 50.0, 25.0), \quad \bar{w}_4 = \mathrm{ccl}(35.0, 30.0, 35.0)$$
be four compositions from $\mathcal{S}^3$. Then
$$d_{Euc}(w_1, w_2) \approx 1.21 < 24.49 \approx d_{Euc}(w_3, w_4),$$
whereas
$$d_C(\bar{w}_1, \bar{w}_2) \approx 3.77 > 0.69 \approx d_C(\bar{w}_3, \bar{w}_4).$$

45 Compositional geometry in $\mathcal{C}^{D-1}$

Any linear variety in $\mathcal{C}^{D-1}$ (straight lines, planes, etc.) can always be implicitly expressed by a system of linear equations in $\log w_1, \dots, \log w_D$ of the form
$$a_{11} \log w_1 + \cdots + a_{1D} \log w_D = b_1, \quad \dots, \quad a_{m1} \log w_1 + \cdots + a_{mD} \log w_D = b_m,$$
with $a_{i1} + \cdots + a_{iD} = 0$ for each $i = 1, \dots, m$. In particular, the parametric equation, varying $t \in \mathbb{R}$, of a straight line in $\mathcal{C}^{D-1}$ is given by
$$\bar{w}(t) = \mathrm{ccl}(\exp(\alpha_1 + \lambda_1 t), \dots, \exp(\alpha_D + \lambda_D t)),$$
where $\sum_{j=1}^D \alpha_j = 0$ and $\sum_{j=1}^D \lambda_j = 0$. Similarly to real space, the concepts of parallelism and orthogonality can be introduced in $\mathcal{C}^{D-1}$.

46 Parallelism in $\mathcal{C}^2$

[Figures: two families of parallel compositional lines in the ternary diagram: $\log w_2 - \log w_3 = k$ for $k > 0$, $k = 0$, $k < 0$; and $\log w_1 - 2\log w_2 + \log w_3 = k$ for $k = 4, 2, 0, -2, -4$]

47 Orthogonality in $\mathcal{C}^2$

[Figures: pairs of orthogonal compositional lines in the ternary diagram: $\log w_2 - \log w_3 = 0$ orthogonal to $-2\log w_1 + \log w_2 + \log w_3 = 0$; and $\log w_1 - 3\log w_2 + 2\log w_3 = 0$ orthogonal to $5\log w_1 - \log w_2 - 4\log w_3 = 0$]

48 Circles in $\mathcal{C}^2$

[Figures: compositional circles drawn in the simplex $\mathcal{S}^3$ and their images in clr space]

49 The alr transformation

Definition. The additive logratio transformation of index j (j = 1, …, D), denoted by $\mathrm{alr}_j$, is the one-to-one transformation from $\mathcal{C}^{D-1}$ to $\mathbb{R}^{D-1}$ defined by
$$y = \mathrm{alr}_j\,\bar{w} = \log \frac{w_{-j}}{w_j},$$
where $w_{-j} = (w_1, w_2, \dots, w_{j-1}, w_{j+1}, \dots, w_D)$. The inverse transformation of $\mathrm{alr}_D$, from $\mathbb{R}^{D-1}$ to $\mathcal{C}^{D-1}$, is given by
$$\mathrm{alr}_D^{-1}\,y = \mathrm{ccl}(\exp y_1, \dots, \exp y_{D-1}, 1) \qquad (y \in \mathbb{R}^{D-1}).$$
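A sketch of $\mathrm{alr}_D$ and its inverse (our illustration):

```python
import numpy as np

def closure(w):
    w = np.asarray(w, dtype=float)
    return w / w.sum()

def alr(w):
    """Additive logratio with the last part as divisor (alr_D)."""
    lw = np.log(np.asarray(w, dtype=float))
    return lw[:-1] - lw[-1]

def alr_inv(y):
    """Inverse of alr_D: append a unit part, exponentiate, and close."""
    y = np.asarray(y, dtype=float)
    return closure(np.exp(np.append(y, 0.0)))

x = closure([1.0, 2.0, 7.0])
print(np.allclose(alr_inv(alr(x)), x))  # True
```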

50 The alr transformation

Property. The $\mathrm{alr}_j$ transformations (j = 1, …, D) are isomorphisms between the vector spaces $(\mathcal{C}^{D-1}, \oplus, \odot)$ and $(\mathbb{R}^{D-1}, +, \cdot)$, i.e.,
$$\mathrm{alr}_j(\bar{w} \oplus \bar{w}') = \mathrm{alr}_j\,\bar{w} + \mathrm{alr}_j\,\bar{w}'; \qquad \mathrm{alr}_j(\lambda \odot \bar{w}) = \lambda\,\mathrm{alr}_j\,\bar{w},$$
$$\mathrm{alr}_j^{-1}(y + y') = \mathrm{alr}_j^{-1} y \oplus \mathrm{alr}_j^{-1} y'; \qquad \mathrm{alr}_j^{-1}(\lambda y) = \lambda \odot \mathrm{alr}_j^{-1} y,$$
where $\bar{w}, \bar{w}' \in \mathcal{C}^{D-1}$, $y, y' \in \mathbb{R}^{D-1}$ and $\lambda \in \mathbb{R}$.

Property. The $\mathrm{alr}_j$ transformations (j = 1, …, D) do not preserve the distances defined on the metric spaces $\mathcal{C}^{D-1}$ and $\mathbb{R}^{D-1}$, i.e.,
$$d_C(\bar{w}, \bar{w}') \ne d_{Euc}(\mathrm{alr}_j\,\bar{w}, \mathrm{alr}_j\,\bar{w}'); \qquad d_{Euc}(y, y') \ne d_C(\mathrm{alr}_j^{-1} y, \mathrm{alr}_j^{-1} y').$$

51 Determination of a composition

A composition $\bar{w} \in \mathcal{C}^{D-1}$ can be determined in several forms:
(i) Giving any D-observational vector belonging to $\bar{w}$. Usually we will choose the vector $x = \mathcal{C}w = \mathrm{ccl}_L\,w$ belonging to $\mathcal{S}^D$.
(ii) Giving the components $z = (z_1, \dots, z_D)$ of the centered logratio transformed vector $\mathrm{clr}\,\bar{w}$. Since z belongs to the subspace V of $\mathbb{R}^D$, its components are related by the equality $z_1 + \cdots + z_D = 0$.
(iii) Giving the components $y = (y_1, \dots, y_{D-1})$ of the additive logratio transformed vector $\mathrm{alr}_D\,\bar{w}$. If needed, we can choose the components of any other logratio $\mathrm{alr}_j\,\bar{w}$ ($j \ne D$).
(iv) Giving the components $u = (u_1, \dots, u_{D-1})$ of the isometric logratio transformed vector $\mathrm{ilr}_{\mathcal{V}}\,\bar{w}$, where $\mathcal{V}$ is a known orthonormal basis of the subspace V of $\mathbb{R}^D$.

52 Determination of a composition

Skye lavas: A = Na₂O + K₂O, F = Fe₂O₃, M = MgO.

Sample   A      F      M
S1       52.0   42.0   6.0
S2       52.0   44.0   4.0

         clr(A)   clr(F)   clr(M)
S1       0.7910   0.5775   -1.3685
S2       0.9107   0.7436   -1.6543

         u₁       u₂
S1       0.1510   1.6760
S2       0.1181   2.0261

         log(A/M)   log(F/M)
S1       2.1595     1.9460
S2       2.5650     2.3979

         log(F/A)   log(M/A)
S1       -0.2135    -2.1595
S2       -0.1671    -2.5650

Hint. The orthonormal basis $\mathcal{V}$ of the subspace $V \subset \mathbb{R}^3$ linked to the ilr coordinates is $v_1 = (1/\sqrt{2}, -1/\sqrt{2}, 0)$, $v_2 = (1/\sqrt{6}, 1/\sqrt{6}, -2/\sqrt{6})$.


54 Compositional data set

Raw data matrix: $W = [w_{ij} : i = 1, \dots, n;\ j = 1, \dots, D]$, or $X = [x_{ij} : i = 1, \dots, n;\ j = 1, \dots, D]$, where $x_i = (x_{i1}, \dots, x_{iD}) \in \mathcal{S}^D$.

Example. AFM composition of 23 aphyric Skye lavas [A = Na₂O + K₂O, F = Fe₂O₃, M = MgO]:

Obs   A%     F%     M%
S1    52.0   42.0   6.0
S2    52.0   44.0   4.0
…

55 Compositional data set

Centered logratio (clr) data matrix: $Z = [z_{ij} : i = 1, \dots, n;\ j = 1, \dots, D]$, where $z_{ij} = \log(w_{ij}/g(w_i))$, with $g(w_i) = (\prod_{k=1}^D w_{ik})^{1/D}$.

Example. AFM composition of 23 aphyric Skye lavas [A = Na₂O + K₂O, F = Fe₂O₃, M = MgO]:

Obs   clr A    clr F    clr M
S1    0.7910   0.5775   -1.3685
S2    0.9107   0.7436   -1.6543
…

56 Compositional data set

Additive logratio (alr) data matrix: $Y = [y_{ij} : i = 1, \dots, n;\ j = 1, \dots, d]$, where $y_{ij} = \log(w_{ij}/w_{iD})$ and $d = D - 1$.

Example. AFM composition of 23 aphyric Skye lavas [A = Na₂O + K₂O, F = Fe₂O₃, M = MgO]:

Obs   log(A/M)   log(F/M)
S1    2.1595     1.9460
S2    2.5650     2.3979
…

57 Center of a compositional data set

The center of a set W of n compositions $\bar{w}_1, \dots, \bar{w}_n$ of $\mathcal{C}^{D-1}$ is the composition $\bar{g}$ defined by
$$\mathrm{cen}\,W = \bar{g} = \left(\tfrac{1}{n} \odot \bar{w}_1\right) \oplus \cdots \oplus \left(\tfrac{1}{n} \odot \bar{w}_n\right).$$
This center is equal to
$$\bar{g} = \mathrm{ccl}\left(\Big(\prod_{i=1}^n w_{i1}\Big)^{1/n}, \dots, \Big(\prod_{i=1}^n w_{iD}\Big)^{1/n}\right).$$
It verifies that
$$\mathrm{clr}\,\bar{g} = \bar{z} = \frac{1}{n}\sum_{i=1}^n z_i = \frac{1}{n}\sum_{i=1}^n \mathrm{clr}\,\bar{w}_i \quad \text{and} \quad \mathrm{alr}_D\,\bar{g} = \bar{y} = \frac{1}{n}\sum_{i=1}^n y_i = \frac{1}{n}\sum_{i=1}^n \mathrm{alr}_D\,\bar{w}_i.$$
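A sketch of the compositional center of a data matrix whose rows are compositions (the example rows are the three Skye-lava samples of slide 41; the function name is ours):

```python
import numpy as np

def closure(w):
    w = np.asarray(w, dtype=float)
    return w / w.sum()

def center(W):
    """Compositional center: closure of the column-wise geometric means."""
    LW = np.log(np.asarray(W, dtype=float))
    return closure(np.exp(LW.mean(axis=0)))

W = np.array([[52.0, 42.0, 6.0],
              [52.0, 44.0, 4.0],
              [47.0, 48.0, 5.0]])
print(center(W))
```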

58 Center of a compositional data set: properties

- $\mathrm{cen}\,W = \mathrm{ccl}\left((\prod_{i=1}^n w_{i1})^{1/n}, \dots, (\prod_{i=1}^n w_{iD})^{1/n}\right)$
- $\mathrm{cen}\,W = \operatorname*{arg\,min}_{\xi \in \mathcal{C}^{D-1}} \frac{1}{n}\left(d_C^2(\bar{w}_1, \xi) + \cdots + d_C^2(\bar{w}_n, \xi)\right)$
- $\mathrm{cen}\{\bar{p} \oplus W\} = \bar{p} \oplus \mathrm{cen}\,W$, where $\bar{p} \in \mathcal{C}^{D-1}$
- $\mathrm{cen}\{t \odot W\} = t \odot \mathrm{cen}\,W$, where $t \in \mathbb{R}$
- $\mathrm{cen}\{W \oplus W'\} = \mathrm{cen}\,W \oplus \mathrm{cen}\,W'$

59 Center of a compositional data set

Example. AFM composition of 23 aphyric Skye lavas.
Compositional ("geometric") center: $\bar{g} = \mathrm{ccl}(25.85, 56.65, 17.50)$.
Arithmetic center: $\bar{a} = \mathrm{ccl}(26.83, 53.74, 19.43)$.

60 Centering

To center a compositional data set $\bar{w}_1, \dots, \bar{w}_n$ with center $\bar{g}$, it suffices to consider the new data set $\bar{w}'_1 = \bar{g}^{-1} \oplus \bar{w}_1, \dots, \bar{w}'_n = \bar{g}^{-1} \oplus \bar{w}_n$. Obviously, the center of the new centered data set $\bar{w}'_1, \dots, \bar{w}'_n$ is $\mathrm{ccl}(1/D, \dots, 1/D)$.

61 Compositional covariance structure

Variation matrix: $T = [\tau_{jk}] = \left[\mathrm{var}\left\{\log \frac{w_{(j)}}{w_{(k)}}\right\}\right]$.

$\tau_{jk} = 0$ means a perfect relationship between $w_{(j)}$ and $w_{(k)}$, in the sense that the ratio $w_{(j)}/w_{(k)}$ is constant. The larger the value of $\tau_{jk}$, the greater the departure from proportionality between $w_{(j)}$ and $w_{(k)}$.

A measure of the degree of proportionality between two parts j and k is given by $\exp(-\tau_{jk})$. In this way, $\exp(-\tau_{jk}) = 0$ means zero proportionality, and $\exp(-\tau_{jk}) = 1$ means perfect proportionality.

The variation matrix of any subcomposition is obtained simply by picking out of T all the logratio variances $\tau_{jk}$ associated with the parts j and k of the subcomposition.

62 Compositional covariance structure

Logratio covariance matrix:
$$\Sigma = [\sigma_{jk}] = [\mathrm{cov}\{y_{(j)}, y_{(k)}\}] = \left[\mathrm{cov}\left\{\log\frac{w_{(j)}}{w_{(D)}}, \log\frac{w_{(k)}}{w_{(D)}}\right\}\right],$$
where $y_{(j)} = \left(\log\frac{w_{1j}}{w_{1D}}, \dots, \log\frac{w_{nj}}{w_{nD}}\right)$, for $j = 1, \dots, D-1$.

63 Compositional covariance structure

Centered covariance matrix: $\Gamma = [\gamma_{jk}] = [\mathrm{cov}\{z_{(j)}, z_{(k)}\}]$, where $z_{(j)} = (\log(w_{1j}/g(w_1)), \dots, \log(w_{nj}/g(w_n)))$, for $j = 1, \dots, D$.

Hint. The correlation $\mathrm{corr}\{z_{(j)}, z_{(k)}\}$ is not a measure of the relationship between parts j and k because it is subcompositionally incoherent.

Total (relative) variability:
$$\mathrm{totvar}_C\{W\} = \frac{1}{n}\sum_{i=1}^n d_C^2(\bar{w}_i, \bar{g}) = \mathrm{trace}\{\Gamma\} = \frac{1}{2D}\,1_D'\,T\,1_D = \frac{1}{D}\sum_{i<j} \tau_{ij}.$$
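A sketch of the variation matrix and the total variability, following the formulas above (note that `np.var` uses the 1/n convention, matching the definition of totvar; names are ours):

```python
import numpy as np

def variation_matrix(W):
    """T[j, k] = var(log(w_j / w_k)) across the rows of W."""
    LW = np.log(np.asarray(W, dtype=float))
    D = LW.shape[1]
    return np.array([[np.var(LW[:, j] - LW[:, k]) for k in range(D)]
                     for j in range(D)])

def totvar(W):
    """Total variability: (1/2D) times the sum of all entries of T."""
    T = variation_matrix(W)
    return T.sum() / (2 * T.shape[0])

W = np.array([[52.0, 42.0, 6.0],
              [52.0, 44.0, 4.0],
              [47.0, 48.0, 5.0]])
print(totvar(W))
```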

64 Compositional covariance structure

- The centered covariance matrix $\Gamma = [\gamma_{jk}]$ is singular because $\sum_{k=1}^D \gamma_{jk} = 0$ ($j = 1, \dots, D$).
- The relationships between the three covariance matrices T, Σ and Γ are linear.
- The dimensionality of the covariance structure of a compositional raw data matrix from $\mathcal{C}^{D-1}$ is equal to $\frac{1}{2}D(D-1)$.
- The covariance matrix T (and also Σ and Γ) is coherent with the algebraic structure of $(\mathcal{C}^{D-1}, \oplus, \odot)$, i.e., $T\{\bar{p} \oplus W\} = T\{W\}$ and $T\{\lambda \odot W\} = \lambda^2 T\{W\}$, where W is a compositional raw data matrix from $\mathcal{C}^{D-1}$, $\bar{p} \in \mathcal{C}^{D-1}$ and $\lambda \in \mathbb{R}$. Therefore, $\mathrm{totvar}_C\{\bar{p} \oplus W\} = \mathrm{totvar}_C\{W\}$ and $\mathrm{totvar}_C\{\lambda \odot W\} = \lambda^2\,\mathrm{totvar}_C\{W\}$.

65 Compositional covariance structure

Example. AFM composition of 23 Skye lavas.

Variation matrix: T = …
Logratio covariance matrices: $\Sigma_A$ = …, $\Sigma_F$ = …, $\Sigma_M$ = …
Centered covariance matrix: Γ = …
Total variability: totvar_C = 0.582

66 Biplots

In general, a biplot is a simultaneous representation of the rows (observations) and columns (variables) of an n × p matrix X by means of a rank-2 approximation. Usually, biplot analysis starts by performing some transformation on X, depending on the nature of the data, to obtain a transformed matrix Z, which is the one that is actually displayed.

67 Biplots

The singular value decomposition (SVD) of Z provides a decomposition of this matrix:
$$Z = [u_1 : \cdots : u_r]\,\mathrm{diag}\{\lambda_1, \dots, \lambda_r\}\,[v_1 : \cdots : v_r]',$$
where r is the rank of Z; $u_1, \dots, u_r$ are the standardized eigenvectors of $ZZ'$; $v_1, \dots, v_r$ are the standardized eigenvectors of $Z'Z$; and $\lambda_1, \dots, \lambda_r$ are the corresponding positive singular values, in decreasing order.

From this SVD of Z, and using only the first two eigenvectors, a rank-2 approximation $\hat{Z}$ is obtained:
$$\hat{Z} = [u_1 : u_2]\,\mathrm{diag}\{\lambda_1, \lambda_2\}\,[v_1 : v_2]'.$$

68 Biplots

Then $\hat{Z}$ decomposes as
$$\hat{Z} = \underbrace{[\lambda_1^{\alpha} u_1 : \lambda_2^{\alpha} u_2]}_{F}\;\underbrace{[\lambda_1^{1-\alpha} v_1 : \lambda_2^{1-\alpha} v_2]'}_{G'},$$
where α is an arbitrary constant. The biplot represents simultaneously in $\mathbb{R}^2$ the rows of F, which provide the coordinates of n points (in correspondence with the n rows/observations of Z), and the rows of G, which provide the coordinates of p points (in correspondence with the columns/variables of Z). Conventionally, the biplot depicts the variables by rays and the observations by points. Depending on the constant α, the biplot favours the display of rows (observations) or columns (variables). For α = 0, the biplot is called a covariance biplot; in this case, the display of variables is favoured.

69 Biplots

With the singular value decomposition $Z = [u_1 : \cdots : u_r]\,\mathrm{diag}\{\lambda_1, \dots, \lambda_r\}\,[v_1 : \cdots : v_r]'$ and the rank-2 approximation $\hat{Z} = FG'$ above, the ratio
$$\frac{\lambda_1^2 + \lambda_2^2}{\lambda_1^2 + \cdots + \lambda_r^2}$$
is a measure of the proportion of the variability of Z captured by the biplot.

70 Relative variation diagrams

Definition. The relative variation diagram of a compositional data set $\bar{w}_1, \dots, \bar{w}_n$ of $\mathcal{C}^{D-1}$ is the covariance biplot of the matrix $Z_c$ obtained after centering the D columns of the centered logratio matrix Z.

Elements:
- Origin, labeled O.
- Vertices, one for each of the D parts (variables/columns) of the compositions, labeled 1, …, j, …, D.
- Case markers, one for each of the n observations (rows), labeled 1, …, i, …, n.
- Ray: the join Oj of the origin O to a vertex j.
- Link: the join jk of two vertices j and k.

71 Relative variation diagrams

[Figure: a relative variation diagram, with origin, vertices, case markers, rays and links]

72 Relative variation diagrams

The vertices and case markers are both centered at the origin O. Rays and inter-ray angles represent the centered covariance matrix Γ:
$$|Oj|^2 = \hat{\gamma}_{jj} = \text{estimate of } \mathrm{var}\{z_{(j)}\},$$
$$Oj \cdot Ok = \hat{\gamma}_{jk} = \text{estimate of } \mathrm{cov}\{z_{(j)}, z_{(k)}\},$$
so that $\cos(\widehat{jOk})$ = estimate of $\mathrm{corr}\{z_{(j)}, z_{(k)}\}$.

Hint. Remember that the correlation $\mathrm{corr}\{z_{(j)}, z_{(k)}\}$ is not a measure of the relationship between parts j and k because it is subcompositionally incoherent.

73 Relative variation diagrams

The squared lengths of the links represent the set of estimated relative variances:
$$|jk|^2 = \hat{\tau}_{jk} = \text{estimate of } \mathrm{var}\left\{\log\frac{w_{(j)}}{w_{(k)}}\right\}.$$
Therefore, if two vertices j and k coincide, or are close together, then the components $w_{(j)}$ and $w_{(k)}$ are in constant proportion, or nearly so.

[Figure: two nearly coincident vertices j and k in a relative variation diagram]

74 Relative variation diagrams

Links jl and kl, with a common vertex l, represent the estimated logratio covariance matrix $\Sigma_l$:
$$jl \cdot kl \approx \mathrm{cov}\left\{\log\frac{w_{(j)}}{w_{(l)}}, \log\frac{w_{(k)}}{w_{(l)}}\right\},$$
so that
$$\cos(\widehat{jlk}) \approx \mathrm{corr}\left\{\log\frac{w_{(j)}}{w_{(l)}}, \log\frac{w_{(k)}}{w_{(l)}}\right\}.$$

[Figure: links jl and kl meeting at the common vertex l]

75 Relative variation diagrams

If the links jk and lm intersect at R, then
$$\cos(\widehat{jRm}) \approx \mathrm{corr}\left\{\log\frac{w_{(j)}}{w_{(k)}}, \log\frac{w_{(l)}}{w_{(m)}}\right\}.$$
Therefore, if two links jk and lm intersect at right angles, then the logratios $\log(w_{(j)}/w_{(k)})$ and $\log(w_{(l)}/w_{(m)})$ will be uncorrelated and, within the context of logistic normality, independent; i.e., the subcompositions (j, k) and (l, m) are independent.

[Figure: links jk and lm intersecting at right angles at R]

76 Relative variation diagrams

The relative variation diagram for any subcomposition S is simply the subdiagram formed by selecting the vertices corresponding to the parts of the subcomposition and taking the centroid $O_S$ of these vertices as the center of the subcompositional biplot. Therefore, if a subset, say 1, …, C, of vertices is approximately collinear, then the associated subcomposition has a one-dimensional compositional structure.

[Figure: a subcompositional biplot with centroid $O_S$ of the selected vertices j, k, l]

77 Volcano H

Parts: 1 = Cl; 2 = K₂O; 3 = P₂O₅; 4 = TiO₂; 5 = SiO₂.

Variation matrix T:
  0      2.784  4.134  3.970  2.966
  2.784  0      0.647  0.645  0.146
  4.134  0.647  0      0.071  0.304
  3.970  0.645  0.071  0      0.249
  2.966  0.146  0.304  0.249  0

Centered covariance matrix Γ:
   2.134  -0.221  -0.803  -0.743  -0.368
  -0.221   0.208  -0.022  -0.043   0.079
  -0.803  -0.022   0.394   0.337   0.094
  -0.743  -0.043   0.337   0.350   0.099
  -0.368   0.079   0.094   0.099   0.096

78 Volcano H

[Figure: relative variation diagram of the Volcano H data]

79 Volcano H

[Figure: relative variation diagram of the Volcano H data]

80 Dimension-reducing techniques: compositional PCA

Given a set of compositions $\bar{w}_1, \dots, \bar{w}_n$ of $\mathcal{C}^{D-1}$ with center $\bar{g}$, PCA starts by looking for a direction, determined by a C-unitary composition $\bar{c}_1$, such that the total variability of the C-orthogonal projections of $\bar{w}_1, \dots, \bar{w}_n$ onto the compositional straight line through $\bar{g}$ with direction $\bar{c}_1$ is maximal; and so on.

Property. The compositional principal components of a set of compositions $\bar{w}_1, \dots, \bar{w}_n$ of $\mathcal{C}^{D-1}$ can be determined from the standard principal components of the clr-transformed observations $\mathrm{clr}\,\bar{w}_1, \dots, \mathrm{clr}\,\bar{w}_n$.

81 Compositional PCA

In this manner, the positive eigenvalues $\lambda_1 \ge \cdots \ge \lambda_{D-1}$ of the centered logratio covariance matrix Γ give the decomposition of totvar_C, and the corresponding unitary eigenvectors $z_1, \dots, z_{D-1}$ determine the corresponding directions $\mathrm{clr}^{-1} z_1, \dots, \mathrm{clr}^{-1} z_{D-1}$ of the principal axes.
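A sketch of compositional PCA via the SVD of the centered clr matrix, which is equivalent to the eigendecomposition of Γ (function names and the example rows are ours):

```python
import numpy as np

def clr_matrix(W):
    LW = np.log(np.asarray(W, dtype=float))
    return LW - LW.mean(axis=1, keepdims=True)

def compositional_pca(W):
    """Eigenvalues of Gamma and clr eigenvectors, via SVD of the centered clr data."""
    Z = clr_matrix(W)
    Zc = Z - Z.mean(axis=0)                 # center the clr columns
    U, s, Vt = np.linalg.svd(Zc, full_matrices=False)
    lam = s**2 / Z.shape[0]                 # eigenvalues of Gamma, decomposing totvar
    return lam, Vt                          # principal axes: clr_inv of the rows of Vt

W = np.array([[52.0, 42.0, 6.0],
              [52.0, 44.0, 4.0],
              [47.0, 48.0, 5.0]])
lam, Vt = compositional_pca(W)
print(lam / lam.sum())  # proportion of total variability per principal component
```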

82 Skye lavas

PC1: … log A + … log F + … log M   (0.7849)

83 Volcano H (Cl, K₂O, P₂O₅, TiO₂, SiO₂)

PC1: (0.4246, 0.1656, 0.1264, 0.1296, 0.1538)   (87.8%)
PC2: (0.1469, 0.3836, 0.1230, 0.1182, 0.2293)   (cum. 98.1%)

84 Dimension-reducing techniques: subcomposition analysis

Let $\bar{w}_1, \dots, \bar{w}_n$ be a compositional data set of $\mathcal{C}^{D-1}$, and let $\mathrm{sub}_S\,\bar{w}_1, \dots, \mathrm{sub}_S\,\bar{w}_n$ be the set of corresponding subcompositions of $\mathcal{C}^{C-1}$ associated to a subset S of the parts 1, …, D. Then the ratio
$$\frac{\mathrm{totvar}_C\{\mathrm{sub}_S\,\bar{w}_1, \dots, \mathrm{sub}_S\,\bar{w}_n\}}{\mathrm{totvar}_C\{\bar{w}_1, \dots, \bar{w}_n\}}$$
gives the proportion of total variability retained by the subcompositions. If the purpose of subcompositional analysis is to retain as much variability as possible for a given number C of parts, then we have to search for the subcompositions of this size which maximize this ratio.

85 Volcano H

[Figure: relative variation diagram of the Volcano H data, subcompositional view]

86 Subcomposition analysis

Example. Percentage of Cl, K₂O, P₂O₅, TiO₂ and SiO₂ in 46 samples of volcanic rocks from an anonymous volcano H.

Total variability: totvar_C = 3.18.

Total variability retained by the 3-part subcompositions:

Subcomposition        Percentage
P₂O₅, TiO₂, SiO₂      6.53%
K₂O, TiO₂, SiO₂       10.89%
K₂O, P₂O₅, SiO₂       11.49%
K₂O, P₂O₅, TiO₂       14.27%
Cl, K₂O, SiO₂         61.74%
Cl, TiO₂, SiO₂        75.24%
Cl, K₂O, TiO₂         77.48%
Cl, P₂O₅, SiO₂        77.53%
Cl, K₂O, P₂O₅         79.22%
Cl, P₂O₅, TiO₂        85.61%

87 Zeros in compositional data

Logratio methodology is incompatible with compositions having zeros in one or more parts. Two kinds of zeros:
- Essential zeros: the part is completely absent.
- Rounded zeros: no quantifiable proportion has been recorded.

Treatment of essential zeros:
- Is it suitable to amalgamate some parts?
- Pre-classification: create initial groups according to the number and location of the zeros, and analyze each group separately.

Treatment of rounded zeros:
- Consider the zero values as missing values.
- Imputation: replace zero values by a small amount using non-parametric or parametric techniques.
- Apply logratio methodology to the resulting data set of replaced observations.

88 Rounded zeros: multiplicative replacement

Let $\bar{w} = \mathrm{ccl}(w_1, \dots, w_D) \in \mathcal{C}^{D-1}$ be any composition with some $w_j = 0$ (rounded zeros). The multiplicative replacement replaces $\bar{w}$ by the composition $\bar{w}^{(r)} = \mathrm{ccl}(w^{(r)}_1, \dots, w^{(r)}_D)$ defined by
$$w^{(r)}_j = \begin{cases} \delta_j & \text{if } w_j = 0, \\[4pt] w_j \left(1 - \sum_{l:\,w_l = 0} \delta_l\right) & \text{if } w_j \ne 0, \end{cases}$$
where the $\delta_j$ are the small values replacing the zero parts.
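A sketch of the multiplicative replacement for a composition closed to 1, using a common replacement value δ (a simplification of the per-part δⱼ in the formula above; the function name is ours):

```python
import numpy as np

def multiplicative_replacement(x, delta=0.001):
    """Replace zeros by delta; shrink the non-zero parts so the total stays 1."""
    x = np.asarray(x, dtype=float)
    zero = (x == 0)
    return np.where(zero, delta, x * (1.0 - delta * zero.sum()))

x = np.array([0.0, 0.5, 0.3, 0.2])
xr = multiplicative_replacement(x, delta=0.01)
print(xr, xr.sum())                # zeros replaced, still sums to 1
print(xr[1] / xr[2], x[1] / x[2])  # ratios of non-zero parts are preserved
```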

89 Rounded zeros: multiplicative replacement

- It is a natural replacement: the ratio between two non-zero parts is preserved.
- It is compatible with subcompositions, perturbation, and the power transformation.
- The covariance structure of subcompositions with no zeros is preserved.

90 Modeling compositional data

In practice, many of the probability density functions (pdf) on the compositional space $\mathcal{C}^{D-1}$ will be defined from a pdf on the real space $\mathbb{R}^{D-1}$; the $\mathrm{alr}_j^{-1}$ transformations then allow us to induce the corresponding pdf on the simplex $\mathcal{S}^D$. The most important pdfs on $\mathcal{C}^{D-1}$ are:
- the Dirichlet class;
- the (additive) logistic normal class;
- the (additive) logistic skew-normal class.

Definition. A random composition $\bar{w}$ on $\mathcal{C}^{D-1}$ is said to have an additive logistic normal (aln) distribution with parameters μ and Σ, written $\bar{w} \sim \mathcal{L}^{D-1}(\mu, \Sigma)$, if the random vector $y = \mathrm{alr}_D\,\bar{w} = \log(w_{-D}/w_D)$ has a $N^{D-1}(\mu, \Sigma)$ distribution on $\mathbb{R}^{D-1}$.

91 Logistic normal distributions on $\mathcal{C}^{D-1}$

Property. Let $\bar{w}$ be a random composition on $\mathcal{C}^{D-1}$. If $\mathrm{alr}_D\,\bar{w} \sim N^{D-1}(\mu, \Sigma)$, then all the other logratio random vectors $\mathrm{alr}_j\,\bar{w}$ (j = 1, …, d) are normally distributed.

Property. Let $\bar{w}$ be a random composition on $\mathcal{C}^{D-1}$, and let $\bar{w}_S$ be the random subcomposition on $\mathcal{C}^{C-1}$ corresponding to a subset S of C parts of $\bar{w}$. If $\bar{w} \sim \mathcal{L}^{D-1}(\mu, \Sigma)$, then $\bar{w}_S \sim \mathcal{L}^{C-1}(\mu_S, \Sigma_S)$, where $\mu_S$ and $\Sigma_S$ can easily be calculated from μ and Σ.

Property. Let $\bar{w}$ be a random composition on $\mathcal{C}^{D-1}$ with $\bar{w} \sim \mathcal{L}^{D-1}(\mu, \Sigma)$. If we perturb $\bar{w}$ by a constant composition $\bar{p} \in \mathcal{C}^{D-1}$, then the perturbed random composition $\bar{p} \oplus \bar{w} \sim \mathcal{L}^{D-1}(\mu + \mathrm{alr}_D\,\bar{p}, \Sigma)$.

92 Logistic normal distributions on $\mathcal{C}^{D-1}$: estimation of parameters

To estimate the parameters μ and Σ of a random composition $\bar{w} \sim \mathcal{L}^{D-1}(\mu, \Sigma)$ from a random sample $\bar{w}_1, \dots, \bar{w}_n$ of $\bar{w}$, we estimate by standard procedures the mean vector and the covariance matrix of a multivariate normal distribution from the alr_D-transformed sample $y_1 = \mathrm{alr}_D\,\bar{w}_1, \dots, y_n = \mathrm{alr}_D\,\bar{w}_n$. The maximum likelihood estimates of μ and Σ are given by
$$\hat{\mu}_j = \frac{1}{n}\sum_{i=1}^n y_{ij}, \qquad \hat{\sigma}_{jk} = \frac{1}{n}\sum_{i=1}^n (y_{ij} - \hat{\mu}_j)(y_{ik} - \hat{\mu}_k),$$
for $j, k = 1, \dots, D-1$.

93 Predictive regions

Definition. Let $\bar{w}$ be an aln-distributed random composition on $\mathcal{C}^{D-1}$. If $\hat{\mu}$ and $\hat{\Sigma}$ are the estimates of the unknown parameters of $\bar{w}$ from a random sample of size n, the 1 - α predictive region is defined as
$$\left\{\bar{w} \in \mathcal{C}^{D-1} : (\mathrm{alr}_D\,\bar{w} - \hat{\mu})'\,\hat{\Sigma}^{-1}\,(\mathrm{alr}_D\,\bar{w} - \hat{\mu}) \le r^2\right\},$$
where $r^2$ is a real number such that
$$\mathrm{Prob}\left[F_{D-1,\,n-(D-1)} \le \frac{n(n-(D-1))}{(n^2-1)(D-1)}\,r^2\right] = 1 - \alpha.$$

[Figure: 99% predictive region for the Skye lavas data]

94 Atypicality index

Definition. If a random composition $\bar{w}$ on $\mathcal{C}^{D-1}$ is $\mathcal{L}^{D-1}(\mu, \Sigma)$ distributed, the atypicality index of a composition $\bar{w}^* \in \mathcal{C}^{D-1}$ in relation to the random composition $\bar{w}$ is defined as
$$\mathrm{Prob}\left[\chi^2_{D-1} \le (\mathrm{alr}_D\,\bar{w}^* - \mu)'\,\Sigma^{-1}\,(\mathrm{alr}_D\,\bar{w}^* - \mu)\right].$$

Definition. Let $\bar{w}$ be an aln-distributed random composition on $\mathcal{C}^{D-1}$. If $\hat{\mu}$ and $\hat{\Sigma}$ are the estimates of the unknown parameters of $\bar{w}$ from a random sample $\bar{w}_1, \dots, \bar{w}_n$ of size n, the atypicality index of a composition $\bar{w}^* \in \mathcal{C}^{D-1}$ in relation to the compositional data set $\bar{w}_1, \dots, \bar{w}_n$ is defined as
$$\mathrm{Prob}\left[F_{D-1,\,n-(D-1)} \le k\,(\mathrm{alr}_D\,\bar{w}^* - \hat{\mu})'\,\hat{\Sigma}^{-1}\,(\mathrm{alr}_D\,\bar{w}^* - \hat{\mu})\right], \qquad \text{where } k = \frac{n(n-(D-1))}{(n^2-1)(D-1)}.$$
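A sketch of the sample-based atypicality index, using SciPy's F distribution (function names are ours, not from the slides):

```python
import numpy as np
from scipy import stats

def alr(w):
    lw = np.log(np.asarray(w, dtype=float))
    return lw[:-1] - lw[-1]

def atypicality_index(w_new, W):
    """Atypicality of w_new relative to the data set W under the aln model."""
    Y = np.array([alr(row) for row in np.asarray(W, dtype=float)])
    n, d = Y.shape                              # d = D - 1
    mu = Y.mean(axis=0)
    Sigma = np.cov(Y, rowvar=False, bias=True)  # maximum likelihood estimate
    r = alr(w_new) - mu
    q = r @ np.linalg.solve(Sigma, r)
    k = n * (n - d) / ((n**2 - 1) * d)
    return stats.f.cdf(k * q, d, n - d)
```

Values of the index close to 1 flag the new composition as atypical of the sample.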

95 Compositional regression: Arctic lake

Sand, silt, clay composition of 39 sediment samples at different water depths in an Arctic lake:

Num   Sand   Silt   Clay   Depth (m)
S1    …      …      …      …
…
S39   …      …      …      …

a. Is sediment composition dependent on water depth?
b. If so, how can we quantify the extent of the dependence?

96 Compositional regression

Compositions $\bar{w}_i \in \mathcal{C}^{D-1}$ regressing on a real concomitant $t_i$ (i = 1, …, n):
$$\bar{w}_i = \bar{\beta}_0 \oplus (t_i \odot \bar{\beta}_1) \oplus \bar{\varepsilon}_i \qquad (i = 1, \dots, n),$$
where $\bar{\beta}_0$ is a constant, $\bar{\beta}_1$ is the regression coefficient, and $\bar{\varepsilon}_i$ (i = 1, …, n) are the errors.

alr version of the regression model:
$$\mathrm{alr}\,\bar{w}_i = \mathrm{alr}\,\bar{\beta}_0 + t_i\,\mathrm{alr}\,\bar{\beta}_1 + \mathrm{alr}\,\bar{\varepsilon}_i \qquad (i = 1, \dots, n),$$
which can be reparametrized as
$$\mathrm{alr}\,\bar{w}_i = \alpha_0 + t_i\,\alpha_1 + \epsilon_i \qquad (i = 1, \dots, n).$$

97 Compositional regression

$$\mathrm{alr}\,\bar{w}_i = \alpha_0 + t_i\,\alpha_1 + \epsilon_i \qquad (i = 1, \dots, n).$$
The estimates $\hat{\alpha}_0$ and $\hat{\alpha}_1$ are obtained by applying the least squares method. Then $\hat{\beta}_0 = \mathrm{alr}^{-1}\,\hat{\alpha}_0$ and $\hat{\beta}_1 = \mathrm{alr}^{-1}\,\hat{\alpha}_1$. The error (residual) of $\bar{w}_i$ (i = 1, …, n) will be $\bar{e}_i = \bar{w}_i \oplus \hat{\bar{w}}_i^{-1}$, where $\hat{\bar{w}}_i = \hat{\beta}_0 \oplus (t_i \odot \hat{\beta}_1)$.

Sum of squares of errors:
$$\mathrm{SSError} = \sum_{i=1}^n \|\bar{e}_i\|_C^2 = \sum_{i=1}^n \big(d_C(\bar{w}_i, \hat{\bar{w}}_i)\big)^2.$$
Proportion of variability explained by the fitted linear regression model:
$$1 - \frac{\mathrm{SSError}}{n \cdot \mathrm{totvar}_C\{\bar{w}_1, \dots, \bar{w}_n\}}.$$
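A sketch of the fitting procedure: least squares in alr coordinates, then back-transformation of the intercept and slope into the simplex (names are ours):

```python
import numpy as np

def closure(w):
    w = np.asarray(w, dtype=float)
    return w / w.sum()

def alr(w):
    lw = np.log(np.asarray(w, dtype=float))
    return lw[:-1] - lw[-1]

def alr_inv(y):
    return closure(np.exp(np.append(np.asarray(y, dtype=float), 0.0)))

def comp_regression(W, t):
    """Fit alr(w_i) = a0 + t_i * a1 by least squares; return beta0, beta1 in the simplex."""
    Y = np.array([alr(row) for row in np.asarray(W, dtype=float)])
    t = np.asarray(t, dtype=float)
    X = np.column_stack([np.ones_like(t), t])
    A, *_ = np.linalg.lstsq(X, Y, rcond=None)  # rows: alpha_0 and alpha_1
    return alr_inv(A[0]), alr_inv(A[1])
```

The fitted composition at a given t is then `alr_inv(A[0] + t * A[1])`; for the Arctic lake data of the next slide, t would be log(depth).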

98 Arctic lake

Num   Sand   Silt   Clay   Depth (m)
S1    …      …      …      …
…

alr fitted simple linear regression model:
log(sand/clay) = … + … log(depth) + ε₁;
log(silt/clay) = … + … log(depth) + ε₂.

Fitted regression model in $\mathcal{S}^3$:
$$\mathrm{ccl}_L(\text{sand}, \text{silt}, \text{clay}) = (\dots, \dots, \dots) \oplus \log(\text{depth}) \odot (0.04604, \dots, \dots).$$

Proportion of variability explained by the fitted simple linear regression model: …%

99 Arctic lake

[Figure: the Arctic lake sediment compositions and the fitted compositional regression model]


More information

Lecture 2: Linear Algebra Review

Lecture 2: Linear Algebra Review EE 227A: Convex Optimization and Applications January 19 Lecture 2: Linear Algebra Review Lecturer: Mert Pilanci Reading assignment: Appendix C of BV. Sections 2-6 of the web textbook 1 2.1 Vectors 2.1.1

More information

Singular Value Decomposition. 1 Singular Value Decomposition and the Four Fundamental Subspaces

Singular Value Decomposition. 1 Singular Value Decomposition and the Four Fundamental Subspaces Singular Value Decomposition This handout is a review of some basic concepts in linear algebra For a detailed introduction, consult a linear algebra text Linear lgebra and its pplications by Gilbert Strang

More information

PRINCIPAL COMPONENTS ANALYSIS

PRINCIPAL COMPONENTS ANALYSIS 121 CHAPTER 11 PRINCIPAL COMPONENTS ANALYSIS We now have the tools necessary to discuss one of the most important concepts in mathematical statistics: Principal Components Analysis (PCA). PCA involves

More information

Biplots in Practice MICHAEL GREENACRE. Professor of Statistics at the Pompeu Fabra University. Chapter 6 Offprint

Biplots in Practice MICHAEL GREENACRE. Professor of Statistics at the Pompeu Fabra University. Chapter 6 Offprint Biplots in Practice MICHAEL GREENACRE Proessor o Statistics at the Pompeu Fabra University Chapter 6 Oprint Principal Component Analysis Biplots First published: September 010 ISBN: 978-84-93846-8-6 Supporting

More information

Vectors and Matrices Statistics with Vectors and Matrices

Vectors and Matrices Statistics with Vectors and Matrices Vectors and Matrices Statistics with Vectors and Matrices Lecture 3 September 7, 005 Analysis Lecture #3-9/7/005 Slide 1 of 55 Today s Lecture Vectors and Matrices (Supplement A - augmented with SAS proc

More information

Dimensionality Reduction and Principal Components

Dimensionality Reduction and Principal Components Dimensionality Reduction and Principal Components Nuno Vasconcelos (Ken Kreutz-Delgado) UCSD Motivation Recall, in Bayesian decision theory we have: World: States Y in {1,..., M} and observations of X

More information

Preprocessing & dimensionality reduction

Preprocessing & dimensionality reduction Introduction to Data Mining Preprocessing & dimensionality reduction CPSC/AMTH 445a/545a Guy Wolf guy.wolf@yale.edu Yale University Fall 2016 CPSC 445 (Guy Wolf) Dimensionality reduction Yale - Fall 2016

More information

A Tutorial on Data Reduction. Principal Component Analysis Theoretical Discussion. By Shireen Elhabian and Aly Farag

A Tutorial on Data Reduction. Principal Component Analysis Theoretical Discussion. By Shireen Elhabian and Aly Farag A Tutorial on Data Reduction Principal Component Analysis Theoretical Discussion By Shireen Elhabian and Aly Farag University of Louisville, CVIP Lab November 2008 PCA PCA is A backbone of modern data

More information

Multivariate Distributions

Multivariate Distributions IEOR E4602: Quantitative Risk Management Spring 2016 c 2016 by Martin Haugh Multivariate Distributions We will study multivariate distributions in these notes, focusing 1 in particular on multivariate

More information

Unconstrained Ordination

Unconstrained Ordination Unconstrained Ordination Sites Species A Species B Species C Species D Species E 1 0 (1) 5 (1) 1 (1) 10 (4) 10 (4) 2 2 (3) 8 (3) 4 (3) 12 (6) 20 (6) 3 8 (6) 20 (6) 10 (6) 1 (2) 3 (2) 4 4 (5) 11 (5) 8 (5)

More information

Lecture 5: Ecological distance metrics; Principal Coordinates Analysis. Univariate testing vs. community analysis

Lecture 5: Ecological distance metrics; Principal Coordinates Analysis. Univariate testing vs. community analysis Lecture 5: Ecological distance metrics; Principal Coordinates Analysis Univariate testing vs. community analysis Univariate testing deals with hypotheses concerning individual taxa Is this taxon differentially

More information

Lecture: Face Recognition and Feature Reduction

Lecture: Face Recognition and Feature Reduction Lecture: Face Recognition and Feature Reduction Juan Carlos Niebles and Ranjay Krishna Stanford Vision and Learning Lab Lecture 11-1 Recap - Curse of dimensionality Assume 5000 points uniformly distributed

More information

Methodological Concepts for Source Apportionment

Methodological Concepts for Source Apportionment Methodological Concepts for Source Apportionment Peter Filzmoser Institute of Statistics and Mathematical Methods in Economics Vienna University of Technology UBA Berlin, Germany November 18, 2016 in collaboration

More information

Principal component analysis (PCA) for clustering gene expression data

Principal component analysis (PCA) for clustering gene expression data Principal component analysis (PCA) for clustering gene expression data Ka Yee Yeung Walter L. Ruzzo Bioinformatics, v17 #9 (2001) pp 763-774 1 Outline of talk Background and motivation Design of our empirical

More information

NATIONAL BOARD FOR HIGHER MATHEMATICS. M. A. and M.Sc. Scholarship Test. September 24, Time Allowed: 150 Minutes Maximum Marks: 30

NATIONAL BOARD FOR HIGHER MATHEMATICS. M. A. and M.Sc. Scholarship Test. September 24, Time Allowed: 150 Minutes Maximum Marks: 30 NATIONAL BOARD FOR HIGHER MATHEMATICS M. A. and M.Sc. Scholarship Test September 24, 2011 Time Allowed: 150 Minutes Maximum Marks: 30 Please read, carefully, the instructions on the following page 1 INSTRUCTIONS

More information

PCA and admixture models

PCA and admixture models PCA and admixture models CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar, Alkes Price PCA and admixture models 1 / 57 Announcements HW1

More information

Structure in Data. A major objective in data analysis is to identify interesting features or structure in the data.

Structure in Data. A major objective in data analysis is to identify interesting features or structure in the data. Structure in Data A major objective in data analysis is to identify interesting features or structure in the data. The graphical methods are very useful in discovering structure. There are basically two

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

Linear Classification. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

Linear Classification. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington Linear Classification CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Example of Linear Classification Red points: patterns belonging

More information

Discriminant analysis for compositional data and robust parameter estimation

Discriminant analysis for compositional data and robust parameter estimation Noname manuscript No. (will be inserted by the editor) Discriminant analysis for compositional data and robust parameter estimation Peter Filzmoser Karel Hron Matthias Templ Received: date / Accepted:

More information

Principal Components Analysis (PCA)

Principal Components Analysis (PCA) Principal Components Analysis (PCA) Principal Components Analysis (PCA) a technique for finding patterns in data of high dimension Outline:. Eigenvectors and eigenvalues. PCA: a) Getting the data b) Centering

More information

The following definition is fundamental.

The following definition is fundamental. 1. Some Basics from Linear Algebra With these notes, I will try and clarify certain topics that I only quickly mention in class. First and foremost, I will assume that you are familiar with many basic

More information

Statistical methods for the analysis of microbiome compositional data in HIV studies

Statistical methods for the analysis of microbiome compositional data in HIV studies 1/ 56 Statistical methods for the analysis of microbiome compositional data in HIV studies Javier Rivera Pinto November 30, 2018 Outline 1 Introduction 2 Compositional data and microbiome analysis 3 Kernel

More information

Principal component analysis

Principal component analysis Principal component analysis Angela Montanari 1 Introduction Principal component analysis (PCA) is one of the most popular multivariate statistical methods. It was first introduced by Pearson (1901) and

More information

Multivariate Gaussian Analysis

Multivariate Gaussian Analysis BS2 Statistical Inference, Lecture 7, Hilary Term 2009 February 13, 2009 Marginal and conditional distributions For a positive definite covariance matrix Σ, the multivariate Gaussian distribution has density

More information

Basic Concepts in Matrix Algebra

Basic Concepts in Matrix Algebra Basic Concepts in Matrix Algebra An column array of p elements is called a vector of dimension p and is written as x p 1 = x 1 x 2. x p. The transpose of the column vector x p 1 is row vector x = [x 1

More information

Applied Linear Algebra in Geoscience Using MATLAB

Applied Linear Algebra in Geoscience Using MATLAB Applied Linear Algebra in Geoscience Using MATLAB Contents Getting Started Creating Arrays Mathematical Operations with Arrays Using Script Files and Managing Data Two-Dimensional Plots Programming in

More information

GI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis. Massimiliano Pontil

GI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis. Massimiliano Pontil GI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis Massimiliano Pontil 1 Today s plan SVD and principal component analysis (PCA) Connection

More information

Linear Algebra. Preliminary Lecture Notes

Linear Algebra. Preliminary Lecture Notes Linear Algebra Preliminary Lecture Notes Adolfo J. Rumbos c Draft date May 9, 29 2 Contents 1 Motivation for the course 5 2 Euclidean n dimensional Space 7 2.1 Definition of n Dimensional Euclidean Space...........

More information

L3: Review of linear algebra and MATLAB

L3: Review of linear algebra and MATLAB L3: Review of linear algebra and MATLAB Vector and matrix notation Vectors Matrices Vector spaces Linear transformations Eigenvalues and eigenvectors MATLAB primer CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

Lecture: Face Recognition and Feature Reduction

Lecture: Face Recognition and Feature Reduction Lecture: Face Recognition and Feature Reduction Juan Carlos Niebles and Ranjay Krishna Stanford Vision and Learning Lab 1 Recap - Curse of dimensionality Assume 5000 points uniformly distributed in the

More information

PCA, Kernel PCA, ICA

PCA, Kernel PCA, ICA PCA, Kernel PCA, ICA Learning Representations. Dimensionality Reduction. Maria-Florina Balcan 04/08/2015 Big & High-Dimensional Data High-Dimensions = Lot of Features Document classification Features per

More information

Introduction to Machine Learning

Introduction to Machine Learning 10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what

More information

LECTURE NOTE #11 PROF. ALAN YUILLE

LECTURE NOTE #11 PROF. ALAN YUILLE LECTURE NOTE #11 PROF. ALAN YUILLE 1. NonLinear Dimension Reduction Spectral Methods. The basic idea is to assume that the data lies on a manifold/surface in D-dimensional space, see figure (1) Perform

More information

4.2. ORTHOGONALITY 161

4.2. ORTHOGONALITY 161 4.2. ORTHOGONALITY 161 Definition 4.2.9 An affine space (E, E ) is a Euclidean affine space iff its underlying vector space E is a Euclidean vector space. Given any two points a, b E, we define the distance

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis CS5240 Theoretical Foundations in Multimedia Leow Wee Kheng Department of Computer Science School of Computing National University of Singapore Leow Wee Kheng (NUS) Principal

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis Yingyu Liang yliang@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison [based on slides from Nina Balcan] slide 1 Goals for the lecture you should understand

More information