Regression with compositional response having unobserved components or below detection limit values


Karl Gerald van den Boogaart 1,2, Raimon Tolosana-Delgado 1,2, and Matthias Templ 3

1 Department of Modelling and Valuation, Helmholtz Institute Freiberg for Resources Technology, Germany
2 Institute for Stochastics, Technical University Bergakademie Freiberg, Germany
3 Department of Statistics and Probability Theory, Vienna University of Technology, Austria

Address for correspondence: Karl Gerald van den Boogaart, Department of Modelling and Valuation, Helmholtz Institute Freiberg for Resources Technology, Halsbrücker Straße 34, Freiberg, Germany. E-mail: k.van-den-boogaart@hzdr.de. Phone: (+490). Fax: (+1).

Abstract: The typical way to deal with zeros and missing values in compositional data sets is to impute them with a reasonable value, and then estimate the desired statistical model, e.g. a regression model, with the imputed data set. This contribution presents alternative approaches to this problem within the framework of Bayesian regression with a compositional response. In a first step, a compositional data set with missing data is considered to follow a normal distribution on the simplex, whose mean value is given as an Aitchison affine linear combination of some fully observed explanatory variables. Both the coefficients of this linear combination and the missing values can be estimated with standard Gibbs sampling techniques. In a second step, a normally distributed additive error is considered superimposed on the compositional response, and values are taken as below the detection limit (BDL) if they are too small in comparison with the additive standard deviation of each variable. Within this framework, the regression parameters and all missing values (including BDLs) can be estimated with a Metropolis-Hastings algorithm. Both methods estimate the regression coefficients without the need of any preliminary imputation step, and adequately propagate the uncertainty derived from the fact that the missing values and BDLs are not actually observed, something imputation methods cannot achieve.

Key words: MCMC Bayesian regression; compositional regression; missing values; nondetects

1 Introduction

Rounded zeroes and missing values in compositional data are often treated as sick observations that must be cured (i.e., replaced) before one can apply the log-ratio methodology. This replacement or imputation strategy is often reasonable when the number of such irregular data is small and the analyst is seeking a quick and dirty answer. However, when the number of irregularities increases, or one is interested in closely monitoring the uncertainty of the fitted models, imputation may severely underestimate that uncertainty. This problem is especially critical, for instance, when testing the significance of some regression parameters.

The regression of a compositional variable against a set of covariables is an important tool to analyse the dependence of compositions on external influences. It can essentially be treated with the principle of working in coordinates (take log-ratios, analyse the scores, back-transform the coefficients), by using a multivariate regression model for the additive or isometric log-ratio transformed response variables, using available standard software: R (R Core Team, 2013) and the specific packages compositions (van den Boogaart et al., 2013) and robCompositions (Templ et al., 2011). Actually, this can be seen as a maximum likelihood (ML) estimation problem, where the compositional response is assumed to follow an additive logistic normal distribution (Aitchison, 1982), whose conditional expectation is given by a linear combination of the explanatory variables. But when the compositional response data have several missing values, the log-ratio approach fails, because missing components do not always correspond to log-ratios missed in the same sense. This renders compositional regression unavailable for many existing datasets, e.g. those containing incomplete analyses (classically known as missing values) or very low concentrations in individual components (rounded zeroes or values below the detection limit).

Within this framework, this contribution proposes an integrated approach, where each incomplete observation contributes with an incompletely determined likelihood to the estimation process. For missing values, this implies that the likelihood of that observation is defined on a subspace only (the subspace associated to the observed subcomposition). Observations below the detection limit (BDL) are more complicated. If one assumes that a BDL is still an outcome of a normal distribution on the simplex, it has a likelihood that can be computed quasi-analytically from the normal cumulative distribution function (Palarea-Albaladejo et al., 2007; Martín-Fernández et al., 2012). However, for several kinds of data, the meaning of a BDL is more complex, involving some additive error on top of the compositional spread (van den Boogaart et al., 2011). This problem can also be solved by making use of the concept of latent variables. Beyond modelling BDLs and missing values, this model is able to consider additive errors even on the observed subcomposition, and can thus be of more general use in case some samples have been contaminated. On the downside, it requires extensive Markov chain Monte Carlo computing in a Bayesian approach.

Combining both strategies as needed, we can fit compositional regression models efficiently in the presence of missing data. The approach also provides a complete imputed dataset, distributed according to the true conditional distribution (given all the data) of the unobserved true compositions, filtering the measurement error out.

Section 2 introduces the geometric and statistical concepts necessary to work with compositional data. Section 3 presents two adaptations of classical Bayesian multivariate regression to deal with a compositional response, with and without additive error. Section 4 briefly reviews the kinds of missing values, and gives the necessary modifications to the regression models to work with them. A small simulation exercise showing the model capabilities is presented in Section 5, while a real case study is explained in Section 6. Some conclusions close this contribution in Section 7.

2 Notation and basic concepts

A D-component vector $z = [z_1, z_2, \ldots, z_D]$ is considered a composition if its components show the relative importance of D parts forming a total. The key property of a compositional vector is its scale invariance (Aitchison, 1986): because the only relevant information of a composition is relative, scaling it by any constant $\lambda > 0$ does not change its meaning, i.e. $\lambda z \equiv z$. The sample space of compositional data is the simplex, $S^D$, defined as the positive orthant of the D-dimensional real space $\mathbb{R}^D_+$ equipped with the scale invariance equivalence (Barceló-Vidal et al., 2001). For practical reasons, however, this definition is simplified to the sample space of representatives, i.e. the set of vectors with non-negative components and total sum equal to 1,

$$S^D = \left\{ z \;\middle|\; z_i \geq 0,\ \sum_{i=1}^{D} z_i = 1 \right\}.$$

If a composition does not satisfy the constant sum constraint, it can be forced to do so by the application of the closure operation

$$z = \mathcal{C}[z] = \frac{1}{\sum_{i=1}^{D} z_i}\, z,$$

without loss of any relevant information. Pawlowsky-Glahn and Egozcue (2001) showed that the simplex can be given a Euclidean vector space structure by the operations of perturbation (closed component-wise product of two compositions), powering (closed component-wise powering of a composition by a scalar) and the Aitchison scalar product (proportional to the scalar product of the vectors formed by all possible pairwise log-ratios of the two compositions). A subcomposition is a composition built from another composition by selecting a subset of its parts and considering it scale invariant, i.e. applying the closure operation to the subset of chosen components. Within the Euclidean geometry of the simplex, the sample space of a subcomposition is a vector subspace of the larger D-part simplex.
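These operations are one-liners in practice. A minimal base-R sketch follows; the helper names are ours (the compositions package offers equivalent functionality):

```r
# Closure and the Aitchison vector-space operations on the simplex.
clo      <- function(z) z / sum(z)        # closure C[z]
perturb  <- function(x, y) clo(x * y)     # perturbation (closed product)
powering <- function(x, a) clo(x^a)       # powering (closed componentwise power)

z <- c(2, 3, 5)
clo(z)                            # 0.2 0.3 0.5
all.equal(clo(7 * z), clo(z))     # TRUE: scale invariance, lambda*z == z
```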

The scale invariance condition implies that: (1) the total sum of all components is in general an irrelevant quantity, and (2) all relevant information is conveyed by the component ratios, e.g. $z_i/z_j$. The Euclidean structure of the simplex allows one to identify any composition with its vector of $D-1$ coordinates in any arbitrary reference basis. Actually, these coordinates are always balanced log-ratios or log-contrasts, linear combinations of log-transformed components whose coefficients add up to zero,

$$\zeta = v^t \ln z, \qquad v^t 1_D = 0,$$

with the logarithm applied component-wise. The vector $1_D$ denotes a vector of D ones. If $D-1$ linearly independent vectors are considered, then a basis is obtained,

$$\zeta = V^t \ln z, \qquad V^t 1_D = 0_{D-1}.$$

For practical reasons, one can also assume an orthogonal reference basis, i.e. where $V^t V = I_{D-1}$, the $(D-1)$-identity matrix, and where the superindex $t$ denotes transposition. In this case, one defines the so-called isometric log-ratio transformation, linking compositions and coordinates through

$$\mathrm{ilr}(z) := V^t \ln z = \zeta; \qquad \mathrm{ilr}^{-1}(\zeta) := \mathcal{C}[\exp(V \zeta)] = z. \qquad (2.1)$$

Non-orthogonal bases can also be used, but then the inverse transformation involves a generalized inverse of $V^t$, and is not so straightforward. An alternative log-ratio representation is provided by the centered log-ratio transformation (clr; Aitchison, 1986), a one-to-one transformation defined as

$$\mathrm{clr}(z) := \ln \frac{z}{\sqrt[D]{z_1 z_2 \cdots z_D}} = \zeta^*, \qquad \mathrm{clr}^{-1}(\zeta^*) := \mathcal{C}[\exp(\zeta^*)] = z.$$

Between the clr and ilr transformations there is also a one-to-one equivalence,

$$\mathrm{ilr}(z) = V^t\, \mathrm{clr}(z), \qquad \mathrm{clr}(z) = V\, \mathrm{ilr}(z). \qquad (2.2)$$

A set of coordinates can be built to capture a certain split of a composition into two subcompositions. Assuming for simplicity that the split takes the first S of the D components in one group, the vector of coordinates should have:

1. first, $S-1$ log-contrasts between the S components of the first subcomposition,
2. second, $D-S-1$ log-contrasts between the remaining $D-S$ components of the second subcomposition,
3. finally, a balance between the geometric means of the components of each subcomposition,

$$\zeta_{D-1} = \sqrt{\frac{S(D-S)}{D}}\, \ln \frac{\sqrt[S]{\prod_{i=1}^{S} z_i}}{\sqrt[D-S]{\prod_{i=S+1}^{D} z_i}}.$$
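To make the transformations concrete, here is a base-R sketch that builds one orthonormal log-contrast basis V from Helmert contrasts (one arbitrary choice among many) and checks Eqs. (2.1) and (2.2); all function names are ours:

```r
# Orthonormal log-contrast basis V (D x (D-1)): columns sum to zero
# and satisfy V^t V = I. Normalized Helmert contrasts are one such basis.
ilr_basis <- function(D) {
  H <- contr.helmert(D)                          # columns orthogonal, sum to 0
  apply(H, 2, function(v) v / sqrt(sum(v^2)))    # normalize each column
}
clr    <- function(z) log(z) - mean(log(z))
ilr    <- function(z, V) as.vector(t(V) %*% log(z))
ilrInv <- function(zeta, V) { z <- exp(V %*% zeta); as.vector(z / sum(z)) }

D <- 4; V <- ilr_basis(D)
z <- c(0.1, 0.2, 0.3, 0.4)
all.equal(ilr(z, V), as.vector(t(V) %*% clr(z)))  # Eq. (2.2): TRUE
all.equal(ilrInv(ilr(z, V), V), z)                # Eq. (2.1): TRUE
```

Because the columns of V sum to zero, $V^t \ln z = V^t \mathrm{clr}(z)$, so the raw logarithms can be used directly in ilr() without centering.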

For the goals of this contribution, the sets of log-contrasts within each of the two subcompositions are irrelevant. Egozcue and Pawlowsky-Glahn (2005) give expressions and rules to build these sub-bases. In this contribution, the goal is to split the observed from the missing components. We denote by $V_o$ the columns of the matrix V linked to the observed subcomposition, and by $V_m$ those columns linked to the non-observed subcomposition and their balance. Thus, with S missing components, $V_o$ has $D-S-1$ columns and $V_m$ has S columns.

A D-component random composition will be denoted as Z. It will be said to have an additive logistic normal distribution, or alternatively a normal distribution on the simplex (Mateu-Figueras et al., 2003), if any of its vectors of log-ratio coordinates has a joint $(D-1)$-variate normal distribution. These two models are equivalent in terms of their description of the randomness, and we use the second one for simplicity. The normal distribution on the simplex has the density function

$$f_Z(z \mid \mu_V, \Sigma_V) = \frac{1}{\sqrt{(2\pi)^{D-1} |\Sigma_V|}} \exp\left[ -\frac{1}{2} \left( V^t \ln z - \mu_V \right)^t \Sigma_V^{-1} \left( V^t \ln z - \mu_V \right) \right]. \qquad (2.3)$$

As happens with the lognormal distribution, the parameters of the normal distribution on the simplex are identified with those of the (real vector-valued) normal distribution of the transformed scores, i.e. the mean vector $\mu_V$ and covariance matrix $\Sigma_V$ of the ilr-transformed composition, using the matrix V in Eq. (2.1). Alternatively, we can also use an expression based on the clr-transformed scores, thanks to Eq. (2.2), and an equivalent expression for the covariance $\Sigma$ of clr-transformed data,

$$\Sigma = V\, \Sigma_V\, V^t, \qquad \Sigma_V = V^{-}\, \Sigma\, V^{-t}, \qquad (2.4)$$

where $V^{-}$ is the Moore-Penrose generalized inverse of V, and $V^{-t}$ the transposed generalized inverse. Inverting $\Sigma$ is tricky, because this matrix is always singular and therefore, strictly speaking, has no inverse and its determinant is zero. Both problems are bypassed by considering again the Moore-Penrose generalized inverse, which in this case gives

$$\Sigma^{-} = V\, \Sigma_V^{-1}\, V^t, \qquad (2.5)$$

which happens to be the same whichever basis is used for the calculations. Note that both matrices $\Sigma$ and $\Sigma^{-}$ sum up to zero by rows and by columns. Note as well that, if V is an orthogonal matrix, then $V^{-} = V^t$ and $V^{-t} = V$, which simplifies the expressions a bit. For the sake of simplicity, we assume such an orthogonal matrix from this point on. In the same way as an inverse can be generalized, the determinant of $\Sigma$ can be defined as the determinant of any of its ilr representations through Eq. (2.4), i.e. $|\Sigma| := |\Sigma_V|$, as this is an invariant function. Moreover, $|\Sigma^{-}| = 1/|\Sigma_V| = 1/|\Sigma|$. With these definitions,

$$f_Z(z \mid \mu, \Sigma) = \sqrt{\frac{|\Sigma^{-}|}{(2\pi)^{D-1}}} \exp\left[ -\frac{1}{2} \left( \mathrm{clr}(z) - \mu \right)^t \Sigma^{-} \left( \mathrm{clr}(z) - \mu \right) \right]. \qquad (2.6)$$
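A quick numerical check of Eqs. (2.4) and (2.5), and of the claimed basis invariance of $\Sigma^{-}$, reusing ilr_basis() from the previous sketch:

```r
# Some SPD ilr covariance, its clr representation, and its generalized inverse.
D <- 4; V <- ilr_basis(D)
A <- matrix(rnorm((D - 1)^2), D - 1)
Sigma_V <- crossprod(A)                         # SPD (D-1) x (D-1) matrix
Sigma   <- V %*% Sigma_V %*% t(V)               # Eq. (2.4): clr covariance
Sigma_g <- V %*% solve(Sigma_V) %*% t(V)        # Eq. (2.5): Sigma^-
round(rowSums(Sigma), 12)                       # zero row (and column) sums
round(rowSums(Sigma_g), 12)

# Invariance: any other orthonormal basis V2 = V Q gives the same Sigma^-.
Q  <- qr.Q(qr(matrix(rnorm((D - 1)^2), D - 1))) # random rotation matrix
V2 <- V %*% Q
all.equal(V2 %*% solve(t(Q) %*% Sigma_V %*% Q) %*% t(V2), Sigma_g)  # TRUE
```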

3 Bayesian regression with compositional response

3.1 Maximum likelihood compositional regression

In a conventional multivariate regression model, a multivariate response Y is considered a function of some explanatory variables x (usually including a constant predictor 1 that will later account for the intercept), in such a way that a linear combination of the explanatory variables gives the expected value of the response, $E(Y \mid B, x) = B\,x = \mu$, where B is a matrix containing the regression coefficients. This response is then considered jointly normally distributed with an unknown covariance matrix Σ. This can be adapted straightforwardly to a compositional response using some ilr coordinates, as

$$\mathrm{ilr}(z) \mid x, \Sigma_V \sim \mathcal{N}_{D-1}(B_V\, x, \Sigma_V), \qquad (3.1)$$

which has the slight inconvenience of apparently depending on the ilr basis chosen. It is more practical to describe the regression parameters in relation to the clr transformation,

$$B_V\, x = E[\mathrm{ilr}(z)] = E[V^t \ln z] = V^t E[\ln z] = V^t E[\mathrm{clr}(z)],$$

which gives a basis-independent representation of the coefficients as $B = V B_V$, with inverse relation $B_V = V^t B$. These expressions do not depend on the ilr basis used. Note that the columns of B must sum up to zero.

Given an observed pair $(x_n, z_n)$, the likelihood of the regression coefficients and residual (generalized inverse) covariance $\Sigma^{-}$ is

$$L(B, \Sigma^{-} \mid x_n, z_n) \propto f_Z(z_n \mid B\, x_n, \Sigma),$$

or, taking logs and using expression (2.6),

$$l(B, \Sigma^{-} \mid x_n, z_n) = \kappa - \frac{1}{2} \ln|\Sigma| - \frac{1}{2} \left( \mathrm{ilr}(z_n) - V^t B\, x_n \right)^t \Sigma_V^{-1} \left( \mathrm{ilr}(z_n) - V^t B\, x_n \right) \qquad (3.2)$$
$$= \kappa - \frac{1}{2} \ln|\Sigma| - \frac{1}{2} \left( \mathrm{clr}(z_n) - B\, x_n \right)^t \Sigma^{-} \left( \mathrm{clr}(z_n) - B\, x_n \right). \qquad (3.3)$$

If a sample of pairs is available, $(x_n, z_n),\ n = 1, 2, \ldots, N$, then the joint log-likelihood of the sample is

$$l(B, \Sigma^{-} \mid x_n, z_n;\ n = 1, 2, \ldots, N) = \sum_{n=1}^{N} l(B, \Sigma^{-} \mid x_n, z_n) \qquad (3.4)$$
$$= \kappa - \frac{N}{2} \ln|\Sigma| - \frac{1}{2} \sum_{n=1}^{N} \left( \mathrm{clr}(z_n) - B\, x_n \right)^t \Sigma^{-} \left( \mathrm{clr}(z_n) - B\, x_n \right).$$

If the pairs of explanatory-explained variables are ordered column-wise in matrices X and Z (the latter having a clr-transformed composition in each column), then the maximum likelihood estimates of the parameters are provided by

$$\hat{B} = (Z X^t)(X X^t)^{-1}, \qquad \hat{\Sigma} = \frac{1}{N} (Z - \hat{B} X)(Z - \hat{B} X)^t.$$
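These estimators are ordinary least squares applied to the clr scores, which can be verified numerically with simulated data (nothing here is specific to the paper's data; the coefficients below are illustrative):

```r
set.seed(1)
N <- 100; D <- 3
X <- rbind(1, rnorm(N))                        # 2 x N design: intercept + slope
B <- cbind(c(0.5, -0.3, -0.2), c(1, -1, 0))    # true D x P clr coefficients
E <- matrix(rnorm(D * N, sd = 0.2), D, N)
E <- sweep(E, 2, colMeans(E))                  # clr-valued noise: columns sum to 0
Zclr <- B %*% X + E                            # clr scores of the responses

Bhat   <- (Zclr %*% t(X)) %*% solve(X %*% t(X))  # ML coefficients
SigHat <- tcrossprod(Zclr - Bhat %*% X) / N      # ML residual clr covariance
round(colSums(Bhat), 10)                         # columns sum to zero, as required
```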

3.2 Bayesian compositional regression

A classical Bayesian approach to regression assumes certain prior, independent distributions for the unknown parameters B and $\Sigma^{-}$ (see, e.g., Gelman et al., 1995). Each of these prior distributions is specified making use of some models, which might depend on hyperparameters. Often, a sort of weakly informative normal prior distribution is considered for the regression coefficients B, and an inverse-Wishart distribution for the covariance matrix Σ. These models and their hyperparameters are detailed below. Figure 1 shows the relation between observable variables, parameters and hyperparameters. This diagram also points at the possibility that Z is considered a latent, unobservable variable underlying the observed $\tilde{Z}$, a model that will be useful when we consider missing values and BDLs (see Section 4.3).

Figure 1: Scheme of relation between parameters, variables and observables (within a frame box) for a compositional Bayesian regression model. Dotted lines represent random relations, solid lines deterministic ones. See Section 3 for details. The observable $\tilde{Z}$ will appear in the case of missing values, and in that case Z will be considered a latent, unobservable variable (see Section 4.3).

To study the most likely values of the parameters B and $\Sigma^{-}$ given the data, we must combine our prior assumptions about their distribution with their likelihood (Eq. 3.4) through Bayes' theorem. This gives the joint posterior distribution of the parameters,

$$\pi[B, \Sigma^{-} \mid X, Z] \propto \pi_0[B]\, \pi_0[\Sigma^{-}]\, \exp\!\left( l(B, \Sigma^{-} \mid x_n, z_n;\ n = 1, 2, \ldots, N) \right).$$

For practical reasons, it is common to choose these prior distributions $\pi_0[\cdot]$ as the conjugate priors of the normal distribution. B is assumed to follow a (degenerate) normal distribution with mean value $B_0$ and precision matrix Q; equivalently, this means that the vectorized $B_V$ follows a normal distribution with mean value $c(V^t B_0)$ and precision matrix $Q_V = (I_P \otimes V^t)\, Q\, (I_P \otimes V^t)^t$ (with matrices converted to vectors by stacking them column-wise), with P the number of predictor variables (eventually including the constant). This choice of precision matrix implies assuming a covariance matrix of $Q_V^{-1}/2$. The prior of $\Sigma^{-}$ is specified through the prior for one arbitrary ilr representation of it, denoted as $\Sigma_V^{-1}$. This inverse ilr covariance matrix is assumed to follow a Wishart distribution, with size $N_0$ and variance parameter $S_V$; this can then be expressed free of any ilr basis as $S = V\, S_V\, V^t$, after Eq. (2.4).

These prior specifications give rise to the following posterior conditional distributions,

$$\Sigma^{-} \mid X, Z, B \sim \mathcal{W}\!\left( N + N_0,\ (S + \hat{\Sigma})^{-} \right), \qquad (3.5)$$

$$c(B) \mid X, Z, \Sigma^{-} \sim \mathcal{N}_{P(D-1)}\!\left( (Q + R)^{-1} \left( Q\, c(B_0) + R\, c(\hat{B}) \right);\ Q + R \right), \qquad (3.6)$$

with $R = (X X^t) \otimes \hat{\Sigma}^{-}$, denoting by $c(A)$ the column-wise stacking of matrix A into a vector, and specifying the normal distribution in terms of its precision matrix instead of its covariance. Note that other prior distributions may be chosen when such a normality assumption is contradicted by prior experience (see, e.g., Ferreira and Steel, 2007; Liu, 1996).

Given that the posterior distribution is specified through its conditional distributions, it is suitable to explore it with a Gibbs sampling scheme (see, e.g., Casella and George, 1992). With the following algorithm, a sample of the posterior distribution is obtained (a runnable sketch of these steps, combined with the Metropolis-Hastings update of the next subsection, is given at the end of Section 3.3):

1. Fix the matrix of regression coefficients to a suitable value $B^{(0)}$, e.g. intercept equal to a reasonable mean of the compositional response, slopes all equal to 0;
2. simulate a random value $(\Sigma^{-})^{(k+1)}$ out of the distribution (Eq. 3.5) of $\Sigma^{-} \mid X, Z, B^{(k)}$;
3. simulate a random value $B^{(k+1)}$ out of the distribution (Eq. 3.6) of $B \mid X, Z, (\Sigma^{-})^{(k+1)}$;
4. return to step 2 until the number of simulations is large enough.

This Gibbs sampler reaches its stable distribution quite fast, thus a small burn-in period is often considered sufficient.

3.3 Considering additive error

In chemical analysis, it is common to observe a composition with both a relative and an additive error, respectively derived from the common practice of comparing samples with standards and from the need to subtract a noise background. In the preceding model, the relative error can be captured by Σ. The additive error E is added in the model of Figure 2. Here, the composition itself is considered a latent parameter, which conditions the observations.

Figure 2: Scheme of relation between parameters, variables and observables (within a frame box) for a compositional Bayesian regression model with additive error. Dotted lines represent random relations, solid lines deterministic ones. For details, see Section 3.3.

This gives

$$E \mid P \sim \mathcal{N}_D(0, P), \qquad Y \mid Z, P \sim \mathcal{N}_D(Z, P), \qquad (3.7)$$

where both normal distributions are specified in terms of the precision matrix P. Note that Y is no longer a composition in this model. Some information must be given with regard to P, the precision matrix of the additive errors E. Here, a full Bayesian framework would require a prior, which could again be considered an inverse Wishart distribution with size $M_0$ and variance $\Psi$, as specified in Figure 2. However, for the sake of simplicity we consider it fixed, and calculated from the analytical precision of the measuring machine.

No analytical expression exists for the conditional distribution of $Z \mid X, Y, B, \Sigma^{-}, P$. However, thanks to the conditional probability property $[A \mid B, C] \propto [A \mid B]\,[C \mid A, B]$, where $[\cdot] = \pi[\cdot]$ denotes the probability distribution, we obtain

$$[Z \mid X, Y, B, \Sigma^{-}, P] \propto [Z \mid X, B, \Sigma^{-}, P]\,[Y \mid Z, X, B, \Sigma^{-}, P] = [Z \mid X, B, \Sigma^{-}]\,[Y \mid Z, P].$$

The second step also requires the conditional independence properties implied in the scheme of Figure 2, i.e. that, given Z, the blocks (X, B, $\Sigma^{-}$) and (Y, P) are conditionally independent. This expression is not closed, but it might still be suitable for a Metropolis-Hastings sampling scheme (Hastings, 1970), as $[Z \mid X, B, \Sigma^{-}]$ follows a normal distribution on the simplex (Eq. 3.1), and $[Y \mid Z, P] = [E \mid P]$ is a normal distribution, according to Eq. (3.7). Thus, the following hybrid Gibbs/Metropolis-Hastings sampling scheme can be implemented:

1. Fix the matrix of regression coefficients to a suitable value $B^{(0)}$, e.g. intercept equal to a reasonable mean of the compositional response, slopes all equal to 0; fix suitable latent compositions $Z^{(0)}$, e.g. as the data Y;
2. simulate a random value $(\Sigma^{-})^{(k+1)}$ out of the distribution $\Sigma^{-} \mid X, Z^{(k)}, B^{(k)}$ of Eq. (3.5);
3. simulate a random value $B^{(k+1)}$ out of the distribution $B \mid X, Z^{(k)}, (\Sigma^{-})^{(k+1)}$ of Eq. (3.6);
4. simulate $Z^{(k+1)}$ with a Metropolis-Hastings algorithm:
   (a) simulate a candidate $Z^*$ out of the distribution $[Z \mid Y, P]$,
   (b) using Eq. (3.1), compute the transition probability
   $$p = \frac{[Z^* \mid X, B^{(k+1)}, (\Sigma^{-})^{(k+1)}]}{[Z^{(k)} \mid X, B^{(k+1)}, (\Sigma^{-})^{(k+1)}]},$$
   (c) take $Z^{(k+1)} = Z^*$ with probability $\min(1, p)$, otherwise take $Z^{(k+1)} = Z^{(k)}$; note that this kind of scheme is an independence chain, in the sense of Tierney (1994), which implies that the thinning typically necessary in the Metropolis-Hastings algorithm to ensure independent samples can be reduced to a minimum;
5. return to step 2 until the number of simulations is large enough.

In the Metropolis-Hastings step, we propose to simulate from $[Z \mid Y, P]$ and accept/reject with $[Z \mid X, B, \Sigma^{-}]$ because the former will typically be much narrower than the latter.
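The hybrid scheme can be condensed into a short R sketch. This is a simplified stand-in of our own, not a transcription of the paper's exact conditionals: it assumes a flat prior on B (so its conditional is centred on the ML estimate), uses a simple conjugate Wishart update in place of Eq. (3.5), takes a diagonal error precision for the proposal, treats out-of-range proposals as rejections, and assumes strictly positive readings Y at initialization. It reuses ilr_basis(), ilr() and ilrInv() from Section 2, plus rWishart() from base R.

```r
rmatnorm <- function(M, Srow, Scol)        # matrix-normal draw: rows ~ Srow, cols ~ Scol
  M + t(chol(Srow)) %*% matrix(rnorm(length(M)), nrow(M)) %*% chol(Scol)
ldens <- function(zeta, m, Prec)           # log-kernel of N(m, Prec^-1) at zeta
  -0.5 * sum((zeta - m) * (Prec %*% (zeta - m)))

sampler <- function(Y, X, Pmat, V, n_iter = 1000, N0 = 3, SV = diag(ncol(V))) {
  D <- nrow(Y); N <- ncol(Y)
  XXinv <- solve(X %*% t(X))
  sdev  <- sqrt(1 / diag(Pmat))            # proposal sd (diagonal Pmat assumed)
  Z     <- apply(Y, 2, function(y) y / sum(y))   # latent compositions start at data
  Zeta  <- apply(Z, 2, ilr, V = V)
  B     <- matrix(0, D - 1, nrow(X))
  keep  <- vector("list", n_iter)
  for (k in seq_len(n_iter)) {
    ## step 2: Wishart draw for the ilr residual precision (conjugate form)
    R    <- Zeta - B %*% X
    Prec <- rWishart(1, N + N0, solve(SV + R %*% t(R)))[, , 1]
    SigV <- solve(Prec)
    ## step 3: draw B around the ML estimate (flat prior assumed)
    Bhat <- (Zeta %*% t(X)) %*% XXinv
    B    <- rmatnorm(Bhat, SigV, XXinv)
    ## step 4: independence MH, proposing from [Z | Y, P] for each datum
    for (n in seq_len(N)) {
      cand <- Y[, n] + rnorm(D, 0, sdev)
      if (any(cand <= 0)) next             # crude: reject proposals off the simplex
      zc   <- ilr(cand / sum(cand), V)
      mu   <- as.vector(B %*% X[, n])
      logp <- ldens(zc, mu, Prec) - ldens(Zeta[, n], mu, Prec)  # step 4(b)
      if (log(runif(1)) < logp) { Zeta[, n] <- zc; Z[, n] <- ilrInv(zc, V) }
    }
    keep[[k]] <- list(B = B, SigV = SigV)
  }
  keep
}
```

A call like sampler(Y, rbind(1, x), diag(400, nrow(Y)), ilr_basis(nrow(Y))) would correspond to readings with additive standard deviation 0.05. The acceptance ratio in the inner loop mirrors step 4(b): the symmetric Gaussian proposal cancels against the factor $[Y \mid Z, P]$ of the target, up to change-of-variable subtleties that a full implementation would track.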

4 Dealing with missing values

4.1 Values missing at random

To cope with the information derived from a missing value, it is important to know why the value was lost. A value can be lost due to a certain defined process, or just randomly. Roughly, the following classes of missingness are often considered (for details see Little and Rubin, 2002):

missing not at random (MNAR): the value is lost with a probability that might depend on the value itself and on other variables; this can only be solved with a model of the probability that each datum is missed/observed; values below the detection limit are an example of this;

missing at random (MAR): the value is lost with a probability that depends only on other variables; if this probability can be modeled, then one could consider estimating this missing-probability model at the same time that the main regression problem is tackled;

missing completely at random (MCAR): the value is lost with a probability independent of every variable.

Modelling general MNAR and MAR values is beyond the scope of this contribution, as they require specifying the mechanism of missingness. MCAR values can be easily considered. Assume that the n-th datum has S missing variables. Then the likelihood $l(B, \Sigma^{-} \mid x_n, z_n)$ of that datum can only be specified on a subspace of dimension $D-S-1$. In particular, following the steps for the definition of an ad-hoc ilr basis (Eq. 2.1) in Section 2, we can choose an ilr that allows us to compute these $D-S-1$ coordinates, corresponding to the observed subcomposition. The rest (those within the S-component missing subcomposition and the balance) are missing values, which do not contribute to the likelihood in Eq. (3.2),

$$-2\, l(B, \Sigma^{-} \mid x_n, z_n) + \kappa = \ln|\Sigma_{V_o}| + \left( \ln z_n - B\, x_n \right)^t V_o\, \Sigma_{V_o}^{-1}\, V_o^t \left( \ln z_n - B\, x_n \right),$$

where the basis matrix $V_o$ depends on which components of observation $z_n$ are missing. Note as well that the determinant $|\Sigma_{V_o}|$ is computed as the product of the $D-S-1$ eigenvalues of $\Sigma_{V_o} = V_o^t\, \Sigma\, V_o$.

If the model used is the one with additive error, then the problem of dealing with missing values is actually moved from Z to Y, but it remains of the same type: the contribution of that observation to the likelihood is reduced. In whichever layer it occurs, the result is that the conjugacy property is strictly lost (Gross and Torres-Quevedo, 1995). However, the latent variable approach still allows sampling the posterior distribution of the whole set of parameters and latent variables, by means of the same Markov chain Monte Carlo schemes mentioned before (Gross, 2000).
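The reduced likelihood contribution is easy to evaluate once a basis for the observed parts is built. A sketch with our own helper obs_loglik(), where obs flags the observed components of z, Sigma is the clr covariance, and ilr_basis() comes from Section 2 (at least two parts must be observed); because any log-contrast among the observed parts involves only those parts, the missing components never enter the computation:

```r
# Log-likelihood contribution (up to kappa) of one datum with missing parts.
obs_loglik <- function(z, x, B, Sigma, obs) {
  Vs <- ilr_basis(sum(obs))                  # contrasts among the observed parts
  So <- t(Vs) %*% Sigma[obs, obs] %*% Vs     # Sigma_{V_o} = V_o^t Sigma V_o
  r  <- t(Vs) %*% (log(z[obs]) - (B %*% x)[obs])   # observed ilr residual
  -0.5 * (c(determinant(So)$modulus) + c(t(r) %*% solve(So, r)))
}
```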

4.2 Values below the detection limit

In general, a value below the detection limit or a rounded zero is a small value which cannot be distinguished from zero, given the accuracy of the measurement. If the measuring method is considered error-free, this concept can easily be accommodated in a Bayesian estimation of the scheme in Figure 1, by considering that the composition is actually a partly observed, partly latent variable. A Bayesian estimation of this model has already been implemented via a Markov chain Monte Carlo (MCMC) method (Palarea-Albaladejo et al., 2007), albeit based on an Expectation-Maximization algorithm applied to a non-orthogonal family of compositional coordinates, the additive log-ratio transformation (Aitchison, 1986). Later, Martín-Fernández et al. (2012) adapted a full MCMC method to solve this problem, using an isometric log-ratio transformation like this contribution. The key idea of both works is to consider a latent variable for each BDL value, which is replaced through the Gibbs chain by a value strictly below the detection limit.

If the measuring method generates measurement errors (according to Figure 2), then the model underlying the concept of BDL stated before is not adequate. This is the most common case in chemical compositions, where chemical components are often determined by comparing the readings obtained analyzing the sample (signal, s) with those of a blank (background, b) and of a reference (r). Roughly, if the true concentration of the reference is known to be $c_r$, then the estimated concentration of the sample is

$$c_s = c_r\, \frac{s - b}{r - b}.$$

This structure induces a multiplicative error in the sample measurements (because of the scaling by $c_r$), but also an additive error (because of the background subtraction). In this context, a value is reported below the detection limit if it cannot be statistically distinguished from the background (van den Boogaart et al., 2011), not necessarily meaning that it really is below the detection limit. Considering this kind of definition, van den Boogaart et al. (2011) show that one has to replace the values below the detection limit in the data by the detection limit itself, and let the model with additive error run. The replacing values might well be slightly above the limit, but this is allowed by this concept of BDL.

4.3 Modified algorithms to deal with missing values

Missing values and values below the detection limit are considered in the same way for a model without additive error (Figure 1). In this context, Z is considered a latent variable, and the observation is placed as an additional layer $\tilde{Z}$. Latent variables and observations are considered equal whenever the composition is fully observed, $\tilde{z}_n = z_n$. But if a datum has any kind of missing values, then the $D-S$ observed components are considered the same for both vectors, $\tilde{z}_{in} = z_{in},\ i = S+1, \ldots, D$, but different for the non-observed set (note that we assume here that the missing components are the first S). The simulation chain is then modified as follows (a sketch of the conditional draw of step 4 follows the list):

1. Fix the matrix of regression coefficients to a suitable value $B^{(0)}$, e.g. intercept equal to a reasonable mean of the compositional response, slopes all equal to 0; fix the latent variable $Z^{(0)}$ to reasonable values, e.g. replace values below the detection limit by 2/3 of it, and values missing at random by the compositional mean (i.e., the closed geometric mean; Aitchison, 1986);
2. simulate a random value $(\Sigma^{-})^{(k+1)}$ out of the distribution (Eq. 3.5) of $\Sigma^{-} \mid X, Z^{(k)}, B^{(k)}$;
3. simulate a random value $B^{(k+1)}$ out of the distribution (Eq. 3.6) of $B \mid X, Z^{(k)}, (\Sigma^{-})^{(k+1)}$;
4. for each datum with missing values, simulate a random value $z_n^{(k+1)}$ out of the distribution of $z_n \mid x_n, \tilde{z}_n, B^{(k+1)}, (\Sigma^{-})^{(k+1)}$. Palarea-Albaladejo et al. (2007) showed that this distribution is a normal distribution on the simplex,

$$z_n \sim \mathcal{N}_S^D\!\left( B^{(k+1)} x_n,\ (\Sigma^{-})^{(k+1)} \right),$$

which must be conditioned on the fact that $z_n$ is partly observed through $\tilde{z}_n$. Assuming $D_m$ missing components and $D_o = D - D_m$ observed components, an ilr basis is chosen following the steps of Section 2 to split them. Then the distribution of the coordinates within the subcomposition of the missing parts, as well as the balance between the two groups, is the usual conditional of a multivariate normal,

$$V_m^t \ln z_n \;\Big|\; V_o^t \ln z_n \sim \mathcal{N}_{D_m}\!\left( V_m^t B^{(k+1)} x_n + \left( \Sigma_{mo} \Sigma_{oo}^{-1} \right)^{(k+1)} V_o^t \left( \ln z_n - B^{(k+1)} x_n \right),\ \left( \Sigma_{mm} - \Sigma_{mo} \Sigma_{oo}^{-1} \Sigma_{om} \right)^{(k+1)} \right).$$

Variance-covariance or basis elements based on the missing parts are denoted with the index m, those with index o are related to the fully observed parts, and those parameters with index mo contain the covariances between both parts. The distribution of the BDLs is the same, but further conditioned to correspond to a value below the detection limit (Palarea-Albaladejo et al., 2007).

5. Return to step 2 until the number of simulations is large enough.
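Step 4 is a standard conditional multivariate normal draw once the split basis $(V_o, V_m)$ of Section 2 is built. A sketch with hypothetical helper names, where z carries NAs in its missing parts, $V_o$ is D x ($D_o$-1) with zero rows at the missing components, and $V_m$ is D x $D_m$; for BDLs one would additionally truncate the draw below the detection limit, which is omitted here:

```r
# One draw of the missing ilr coordinates of datum z, given the clr
# covariance Sigma, clr coefficients B and covariates x.
draw_missing <- function(z, x, B, Sigma, Vo, Vm) {
  lz  <- log(ifelse(is.na(z), 1, z))      # NAs masked; Vo has zero rows there
  mu  <- as.vector(B %*% x)               # clr-scale mean of the datum
  Soo <- t(Vo) %*% Sigma %*% Vo
  Smo <- t(Vm) %*% Sigma %*% Vo
  Smm <- t(Vm) %*% Sigma %*% Vm
  zo  <- as.vector(t(Vo) %*% lz)          # observed coordinates
  cm  <- t(Vm) %*% mu + Smo %*% solve(Soo, zo - t(Vo) %*% mu)  # cond. mean
  cS  <- Smm - Smo %*% solve(Soo, t(Smo))                      # cond. covariance
  zm  <- as.vector(cm + t(chol(cS)) %*% rnorm(length(cm)))     # one draw
  znew <- exp(Vo %*% zo + Vm %*% zm)      # recombine and map back to simplex
  as.vector(znew / sum(znew))
}
```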

In the model with additive error, exactly the same strategies apply to deal with the difference between the observed readings $\tilde{Y}$ and the latent ones Y. Latent variables and observations are considered equal whenever the readings are fully observed, $\tilde{y}_n = y_n$, and are considered different in those components where a MAR value or a BDL occurs. Thus, a replacement step must be included in the algorithm that accounts for these differences:

1. Fix the matrix of regression coefficients to a suitable value $B^{(0)}$, e.g. intercept equal to a reasonable mean of the compositional response, slopes all equal to 0; fix suitable latent compositions $Z^{(0)}$, e.g. as the data Y with irregular observations replaced in a sensible way;
2. simulate a random value $(\Sigma^{-})^{(k+1)}$ out of the distribution $\Sigma^{-} \mid X, Z^{(k)}, B^{(k)}$ of Eq. (3.5);
3. simulate a random value $B^{(k+1)}$ out of the distribution $B \mid X, Z^{(k)}, (\Sigma^{-})^{(k+1)}$ of Eq. (3.6);
4. for those n with missing values, simulate the latent readings $y_n^{(k)} \mid \tilde{y}_n, z_n^{(k)}, P$ from their associated latent errors $e_n^{(k)}$, conditioned on their observed components $e_{on}^{(k)}$, which must follow a multivariate normal distribution in a scheme similar to step 4 of the preceding algorithm; this gives

$$e_{mn} \sim \mathcal{N}_{D_m}\!\left( \Psi_{mo} \Psi_{oo}^{-1}\, e_{on}^{(k)},\ \Psi_{mm} - \Psi_{mo} \Psi_{oo}^{-1} \Psi_{om} \right),$$

where $\Psi = P^{-1}/2$, and the blocks $(\Psi_{mm}, \Psi_{mo}; \Psi_{om}, \Psi_{oo})$ represent respectively the covariances of the missing-missing, missing-observed, observed-missing and observed-observed parts;
5. simulate $Z^{(k+1)}$ with the same Metropolis-Hastings algorithm as in Section 3.3;
6. return to step 2 until the number of simulations is large enough.

5 Simulation study

To evaluate the potential of the proposed methodology, a simulation exercise has been carried out. A three-part composition $Z = [Z_1, Z_2, Z_3]$ has been regressed against an external variable x. One hundred values of the external variable were generated following a standard normal distribution. For the compositional part, a bivariate normal distribution was assumed for the ilr coordinates,

$$\mathrm{ilr}(Z(x)) \sim \mathcal{N}_2(b_0 + x\, b_1, \Sigma).$$

The following prior was taken: regression intercept and slope following a normal distribution with zero mean vector $B_0 = [b_0, b_1] = 0$ and precision matrix $Q = I_4$; regression residual covariance matrix Σ following a Wishart distribution with size $N_0 = 3$ and mean variance matrix $S = \frac{1}{3} I_2$. Figure 3 displays 11 realizations of this prior distribution (regression model plus residuals), illustrating its flexibility. The last of these realizations was taken as the model generating the data, to which a random additive error of standard deviation 5% was added. Figure 4 shows the resulting Y and Z, respectively with and without additive error, as well as the proportions of missing values generated by the additive error.

The proposed Bayesian methodology was applied, using the prior specified before, Gibbs sampling for updating the regression coefficients and residual covariance, and a Metropolis-Hastings scheme for updating the latent composition without additive error. A thinning of 1:20 was applied and 1000 realizations were extracted. The Metropolis-Hastings step of sampling the error-free composition showed a rejection rate of 32.5%, a value usually considered to indicate good convergence.

Figure 3: Alternative realizations of the prior model, showing that it is not very informative. A black dot shows the model intercept, a black line the linear model, and grey squares the realizations of the model (i.e. with normal error on the simplex).

Figure 4: Model realizations without (left) and with (right) additive error. The data without additive error show the regression model as well. The data with additive error show a significant amount of missing values, displayed as a barplot in the upper part (NM = not missed values; TMV = totally missed values, observations where 2 or 3 parts were missing; v1 = first variable missing; v2 = second variable missing; v3 = third variable missing). Tickmarks show the position of those observations with one missing component.

Convergence of the Gibbs steps was also acceptable, given that the trace plots of the simulation chains oscillate randomly around the real value of each parameter (Fig. 5 shows these for the regression coefficient matrix; similar results were obtained for the residual covariance matrix, but are not shown to save space). Figures 6 and 7 show the fit of the several models to the data, the first as compositions and the second as ilr coordinates. Figure 6 displays the estimated regression curves, together with the predicted composition for the integer values of the external variable from -2 to +2. The same figure also shows the spread expected for the regression residuals, visible as 95%-probability ellipses linked to the estimated Σ. Figure 7 shows scatterplots of the ilr coordinates against the explanatory variable, with the several fitted models. Both figures report the true model, a maximum likelihood model fitted to the data without additive error Z, a maximum likelihood model fitted to the data after a naive constant imputation of the missing values in Y, and the model obtained with the method proposed here (also based on Y).

These results clearly show that the imputation performance is quite bad, especially in the regression slope (note how the symbols are misplaced with regard to their position in the true model in Fig. 6, and the slopes of their corresponding model lines in Fig. 7). The estimation of the residual covariance matrix with the imputation procedure is also not satisfactory, resulting in a much less informative covariance structure (note the more circular shape of its associated ellipse in Fig. 6). Finally, it is also visible that the proposed methodology performs comparably to the classical least-squares regression applied to the data without additive error or missing values, an impressive result if we take into account that only one fourth of the data were fully observed (as shown in the barplot of Figure 4).
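For concreteness, the data-generating step of this exercise can be reproduced along these lines. The paper draws the true model from the prior; the seed, the chosen realization and the exact censoring rule are not reported, so the ones below are illustrative only (reusing ilr_basis() and ilrInv() from Section 2):

```r
set.seed(42)
N <- 100; x <- rnorm(N)                            # external variable
V <- ilr_basis(3)                                  # basis for a 3-part composition
B <- matrix(rnorm(4), 2, 2)                        # one prior draw of [b0, b1], Q = I_4
SigV <- solve(rWishart(1, 3, diag(2) / 3)[, , 1])  # one prior draw of Sigma (W(3, I/3))
zeta <- B %*% rbind(1, x) + t(chol(SigV)) %*% matrix(rnorm(2 * N), 2, N)
Z <- apply(zeta, 2, ilrInv, V = V)                 # error-free compositions, 3 x N
Y <- Z + matrix(rnorm(3 * N, sd = 0.05), 3, N)     # 5% additive error
Y[Y < 0.05] <- NA                                  # flag readings below a nominal
                                                   # detection limit (our choice)
```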

Figure 5: Trace plots of a series of 1000 simulated values of the regression parameters $B_V$, using the same ilr as in Fig. 3. The dashed reference line shows the maximum likelihood estimate obtained with the whole data set, while the solid reference line reports the true value of the parameter.

Figure 6: Comparison of results as compositions: regression predictions (symbols, five values reported in the legend) and residual covariance matrix (as a 95% probability ellipse). From left to right: (a) true model; (b) model fitted to the data without additive error (included in the other panels as a grey reference line); (c) model with missing values naively imputed; and (d) the proposed Bayesian model.

Figure 7: Comparison of results in ilr coordinates. Left plots show the full data, compared with the true model and the classical maximum likelihood regression. Right plots show the data with BDLs, and the models fitted to them, using imputation of all BDLs by the detection limit, and using the proposed algorithm. In both cases, circles show fully observed data, while crosses represent data with missing components.

6 Real case study: worldwide grainsize-petrology of sediments

For illustration, the second model (with additive error, Section 3.3) is applied to analyse a synthetic model of the composition of sediments as a function of grain size at the world scale, published in the form of a conceptual diagram by Blatt et al. (1972, p. 301). The authors state that this diagram represents the probable relationship between grain size and detrital fragment composition, based on the limited data currently available. The diagram depicts how the composition of sediments depends on grain size (in φ scale, a typical logarithmic representation of grain size where higher values represent smaller grains), considering the following 5 components:

- Feldspar grains (F), formed by grains containing one or more feldspar crystals;
- micas, clays and other sheet silicates (M), formed by mono- or polycrystalline aggregates of sheet silicates, of primary origin or the result of alteration;
- monocrystalline quartz (Qm), formed by grains containing one single quartz crystal;
- polycrystalline quartz (Qp), formed by grains containing several quartz crystals;
- rock fragments (Rf), formed by grains containing more than one mineral (and fitting none of the preceding types).

Since this diagram was only available in printed form, a digitization process was necessary: at 14 values of φ, a manual reading of the curve levels was taken. Table 1 shows the numbers obtained, and Figure 8 the resulting reconstruction of the Blatt et al. (1972) diagram. This data set was presented and analysed by Tolosana-Delgado and von Eynatten (2009) with an alternative method, which takes into account neither all the information provided by the values below the detection limit nor the fact that the digitization introduces an additive error on all readings, even those above the detection limit. Their results are included here in some of the following figures for the sake of comparison. The geological context and interpretation of results can be found in that contribution, and will not be repeated here.

Given that the data were actually read from a diagram, we consider that all values are affected by the same additive error (the visual error of reading a value from a diagram). Two values, 1% and 5% error, were considered, with equivalent results. To save space, only the second case is shown here, i.e. with a fixed additive variance matrix $P = 0.05\, I_5$. The prior information for the regression coefficients was taken as a normal distribution, with mean values of zero slope and zero intercept and a large prior variance without covariance between the coefficients, i.e. $B_0 = 0$ and $Q = 10^{-1} I_8$. The same strategy of a vague prior was taken for the inverse covariance matrix of the regression residuals, specified as a Wishart distribution with size $N_0 = 1$ (one virtual datum) and covariance $S = 10\, I_4$. With these prior specifications, the posterior was characterized with a simulation of size 100, with Gibbs sampling for the regression parameters and Metropolis-Hastings sampling for the latent compositions without additive error (with a rejection rate of 46%). An intensive thinning was applied, discarding 50 simulations for each simulation stored.

Table 1: Proportions of five grain sediment classes (F = feldspar, M = micas, clays and other sheet silicates, Qm = monocrystalline quartz, Qp = polycrystalline quartz, Rf = rock fragments) as a function of grain size (represented as φ, where larger values are related to finer sediments). Values under the detection limit are marked, meaning that the digitizer did not consider the curve to be different from zero. The values are given with 2 digits, and their sum to 1 is not ensured.

φ | F | M | Qm | Qp | Rf

Figure 8: Data available for the five-part composition (F = feldspar, M = micas, clays and other sheet silicates, Qm = monocrystalline quartz, Qp = polycrystalline quartz, Rf = rock fragments), after a diagram by Blatt et al. (1972, p. 301).

Figure 9: Estimated proportions of the model results for φ from -3 to 10. The regression model provided by the average coefficients is visualized by thick black lines. The swarm of model lines of the 100 simulated regression models is in grey. Boxplots show the quartiles of the latent compositions (i.e. without additive error). The upper plot shows the composition (ordinate axis) in raw scale, the lower plot in logarithmic scale.

Figure 9 summarizes some aspects of this posterior sample: each simulation of the regression coefficients, $B^{(k)}$, is plotted as lines, one panel for each grain sediment class; the regression model specified by the averaged coefficients $\sum_k B^{(k)}/100$ is represented as a solid thick line; and boxplots show the spread of the latent compositions without additive error. The lower part of this figure (log vertical scale) clearly shows that the latent values are centered around the average regression model. However, notable deviations can be seen in the upper part (raw vertical scale). This is so because we consider here that the data are prone to quite an important amount of additive error. Note that notable deviations can also be detected when these results are compared with those of Tolosana-Delgado and von Eynatten (2009) (see also Figure 10).

Figure 10: Density estimates of the slope coefficient posterior sample, expressed in clr scale. The model of Tolosana-Delgado and von Eynatten (2009) is shown as horizontal reference lines.

Finally, kernel density estimates of the slope coefficients (represented as clr) are displayed in Figure 10. These show a general consistency with the model of Tolosana-Delgado and von Eynatten (2009), though the Bayesian framework adopted here also allows characterizing the uncertainty affecting these parameters. As a result, one can conclude from this diagram that there is no significantly different behavior between Qp and Rf, or between Qm and F, i.e. there is no evidence that increasing φ (reducing grain size) should modify the ratios Qp/Rf or Qm/F. On the contrary, it is obvious that towards finer grain sizes, sediments get enriched in M and depleted in Qp and Rf (with respect to all other parts).

7 Conclusions

Some sorts of missing values (missing completely at random and values below the detection limit) in compositional data sets can be dealt with without the need of a preliminary imputation. The key idea is to build a hierarchical model where these missing values are considered latent variables, to be estimated together with the model parameters. This contribution illustrates this concept with the problem of regression with a compositional response. The proposed methodology requires a Bayesian framework and computationally intensive estimation techniques, through Markov chain Monte Carlo (MCMC) methods.

Two models have been discussed. The first model is the standard one in compositional regression, where the response is considered to follow an additive logistic normal distribution, with constant but unknown covariance and a mean built as an Aitchison affine linear combination of the explanatory variables. This is readily amenable to standard Bayesian multivariate regression techniques, as all parameters (regression coefficients, residual covariance and latent variables) have known conditional distributions (conjugate multivariate normal and Wishart models), thus

amenable to Gibbs sampling techniques, with their conventional strengths and limitations. The second model builds upon the preceding one, including a layer of additive noise on the compositional response, i.e. assuming that observations are actually convolutions of a latent target composition with a multivariate normal distribution. All observations are considered here to be different from their target composition, not just the missing ones. As with the first model, the parameters (regression coefficients, residual covariances, additive noise covariances and latent compositions for all observations) can be obtained with standard MCMC techniques. In this case, the latent compositions can be sampled through an independence Metropolis-Hastings chain, while sampling all other parameters is possible using Gibbs sampling.

This second model corresponds to the typical situation of chemical measurements and captures the effects of data below the detection limit and of the omnipresent additive errors in chemical measurements more accurately. It is thus proposed for chemical compositions. It allows the estimation of regression coefficients and variances in the presence of missing values without a prior imputation, even in situations with missing values in large portions of the dataset and non-informative priors. Conditional distributions of the imputed values and the precision of these estimates can be accurately captured through the conditional simulation in the Markov chain.

References

Aitchison J (1982) The statistical analysis of compositional data (with discussion). Journal of the Royal Statistical Society, Series B (Statistical Methodology), 44.

Aitchison J (1986) The Statistical Analysis of Compositional Data. London: Chapman & Hall Ltd (reprinted in 2003 with additional material by The Blackburn Press).

Barceló-Vidal C, Martín-Fernández J-A and Pawlowsky-Glahn V (2001) Mathematical foundations of compositional data analysis. In Ross G, ed., Proceedings of IAMG'01, The VII Annual Conference of the International Association for Mathematical Geology. Cancun: IAMG.

Blatt H, Middleton G and Murray R (1972) Origin of Sedimentary Rocks. Englewood Cliffs: Prentice-Hall.

Casella G and George EI (1992) Explaining the Gibbs sampler. American Statistician, 46.

Egozcue J-J and Pawlowsky-Glahn V (2005) Groups of parts and their balances in compositional data analysis. Mathematical Geology, 37.

Egozcue J-J, Pawlowsky-Glahn V, Mateu-Figueras G and Barceló-Vidal C (2003) Isometric logratio transformations for compositional data analysis. Mathematical Geology, 35.

Ferreira J-T-A-S and Steel F-J (2007) A new class of skewed multivariate distributions with applications to regression analysis. Statistica Sinica, 17.

Gelman A, Carlin J-B, Stern H-S and Rubin D-B (1995) Bayesian Data Analysis. New York: Chapman & Hall.

Gross A-L (2000) Bayesian interval estimation of multiple correlations with missing data: a Gibbs sampling approach. Multivariate Behavioral Research, 35.

Gross A-L and Torres-Quevedo R (1995) Estimating correlations with missing data, a Bayesian approach. Psychometrika, 60.

Hastings W-K (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57.

Little R-J-A and Rubin D-B (2002) Statistical Analysis with Missing Data. New York: Wiley.

Liu C (1996) Bayesian robust multivariate linear regression with incomplete data. Journal of the American Statistical Association, 91.

Martín-Fernández J-A, Hron K, Templ M, Filzmoser P and Palarea-Albaladejo J (2012) Model-based replacement of rounded zeros in compositional data: classical and robust approaches. Computational Statistics and Data Analysis, 56.

Mateu-Figueras G, Pawlowsky-Glahn V and Barceló-Vidal C (2003) Distributions on the simplex. In Thió-Henestrosa S and Martín-Fernández J-A, eds., Proceedings of the 1st International Workshop on Compositional Data Analysis. Girona: Universitat de Girona.

Palarea-Albaladejo J, Martín-Fernández J-A and Gómez-García J-A (2007) A parametric approach for dealing with compositional rounded zeros. Mathematical Geology, 39.

Pawlowsky-Glahn V and Egozcue J-J (2001) Geometric approach to statistical analysis on the simplex. Stochastic Environmental Research and Risk Assessment, 15.

R Core Team (2013) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Templ M, Hron K and Filzmoser P (2011) robCompositions: an R package for robust statistical analysis of compositional data. John Wiley and Sons.

Tierney L (1994) Markov chains for exploring posterior distributions. Annals of Statistics, 22.

Tolosana-Delgado R and von Eynatten H (2009) Grain-size control on petrographic composition of sediments: compositional regression and rounded zeroes. Mathematical Geosciences, 41.


More information

Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method

Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method Madeleine B. Thompson Radford M. Neal Abstract The shrinking rank method is a variation of slice sampling that is efficient at

More information

BAYESIAN ESTIMATION IN DAM MONITORING NETWORKS

BAYESIAN ESTIMATION IN DAM MONITORING NETWORKS BAYESIAN ESIMAION IN DAM MONIORING NEWORKS João CASACA, Pedro MAEUS, and João COELHO National Laboratory for Civil Engineering, Portugal Abstract: A Bayesian estimator with informative prior distributions

More information

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix Labor-Supply Shifts and Economic Fluctuations Technical Appendix Yongsung Chang Department of Economics University of Pennsylvania Frank Schorfheide Department of Economics University of Pennsylvania January

More information

Part 8: GLMs and Hierarchical LMs and GLMs

Part 8: GLMs and Hierarchical LMs and GLMs Part 8: GLMs and Hierarchical LMs and GLMs 1 Example: Song sparrow reproductive success Arcese et al., (1992) provide data on a sample from a population of 52 female song sparrows studied over the course

More information

Estimating and modeling variograms of compositional data with occasional missing variables in R

Estimating and modeling variograms of compositional data with occasional missing variables in R Estimating and modeling variograms of compositional data with occasional missing variables in R R. Tolosana-Delgado 1, K.G. van den Boogaart 2, V. Pawlowsky-Glahn 3 1 Maritime Engineering Laboratory (LIM),

More information

Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements

Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements Jeffrey N. Rouder Francis Tuerlinckx Paul L. Speckman Jun Lu & Pablo Gomez May 4 008 1 The Weibull regression model

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

MULTILEVEL IMPUTATION 1

MULTILEVEL IMPUTATION 1 MULTILEVEL IMPUTATION 1 Supplement B: MCMC Sampling Steps and Distributions for Two-Level Imputation This document gives technical details of the full conditional distributions used to draw regression

More information

A Program for Data Transformations and Kernel Density Estimation

A Program for Data Transformations and Kernel Density Estimation A Program for Data Transformations and Kernel Density Estimation John G. Manchuk and Clayton V. Deutsch Modeling applications in geostatistics often involve multiple variables that are not multivariate

More information

arxiv: v1 [math.st] 11 Jun 2018

arxiv: v1 [math.st] 11 Jun 2018 Robust test statistics for the two-way MANOVA based on the minimum covariance determinant estimator Bernhard Spangl a, arxiv:1806.04106v1 [math.st] 11 Jun 2018 a Institute of Applied Statistics and Computing,

More information

Time Series of Proportions: A Compositional Approach

Time Series of Proportions: A Compositional Approach Time Series of Proportions: A Compositional Approach C. Barceló-Vidal 1 and L. Aguilar 2 1 Dept. Informàtica i Matemàtica Aplicada, Campus de Montilivi, Univ. de Girona, E-17071 Girona, Spain carles.barcelo@udg.edu

More information

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model UNIVERSITY OF TEXAS AT SAN ANTONIO Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model Liang Jing April 2010 1 1 ABSTRACT In this paper, common MCMC algorithms are introduced

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

Online appendix to On the stability of the excess sensitivity of aggregate consumption growth in the US

Online appendix to On the stability of the excess sensitivity of aggregate consumption growth in the US Online appendix to On the stability of the excess sensitivity of aggregate consumption growth in the US Gerdie Everaert 1, Lorenzo Pozzi 2, and Ruben Schoonackers 3 1 Ghent University & SHERPPA 2 Erasmus

More information

ASA Section on Survey Research Methods

ASA Section on Survey Research Methods REGRESSION-BASED STATISTICAL MATCHING: RECENT DEVELOPMENTS Chris Moriarity, Fritz Scheuren Chris Moriarity, U.S. Government Accountability Office, 411 G Street NW, Washington, DC 20548 KEY WORDS: data

More information

Regression analysis with compositional data containing zero values

Regression analysis with compositional data containing zero values Chilean Journal of Statistics Vol. 6, No. 2, September 2015, 47 57 General Nonlinear Regression Research Paper Regression analysis with compositional data containing zero values Michail Tsagris Department

More information

Markov Chain Monte Carlo (MCMC) and Model Evaluation. August 15, 2017

Markov Chain Monte Carlo (MCMC) and Model Evaluation. August 15, 2017 Markov Chain Monte Carlo (MCMC) and Model Evaluation August 15, 2017 Frequentist Linking Frequentist and Bayesian Statistics How can we estimate model parameters and what does it imply? Want to find the

More information

Bayesian methods for missing data: part 1. Key Concepts. Nicky Best and Alexina Mason. Imperial College London

Bayesian methods for missing data: part 1. Key Concepts. Nicky Best and Alexina Mason. Imperial College London Bayesian methods for missing data: part 1 Key Concepts Nicky Best and Alexina Mason Imperial College London BAYES 2013, May 21-23, Erasmus University Rotterdam Missing Data: Part 1 BAYES2013 1 / 68 Outline

More information

Parameter Estimation. William H. Jefferys University of Texas at Austin Parameter Estimation 7/26/05 1

Parameter Estimation. William H. Jefferys University of Texas at Austin Parameter Estimation 7/26/05 1 Parameter Estimation William H. Jefferys University of Texas at Austin bill@bayesrules.net Parameter Estimation 7/26/05 1 Elements of Inference Inference problems contain two indispensable elements: Data

More information

arxiv: v1 [stat.ap] 27 Mar 2015

arxiv: v1 [stat.ap] 27 Mar 2015 Submitted to the Annals of Applied Statistics A NOTE ON THE SPECIFIC SOURCE IDENTIFICATION PROBLEM IN FORENSIC SCIENCE IN THE PRESENCE OF UNCERTAINTY ABOUT THE BACKGROUND POPULATION By Danica M. Ommen,

More information

EXPLORATION OF GEOLOGICAL VARIABILITY AND POSSIBLE PROCESSES THROUGH THE USE OF COMPOSITIONAL DATA ANALYSIS: AN EXAMPLE USING SCOTTISH METAMORPHOSED

EXPLORATION OF GEOLOGICAL VARIABILITY AND POSSIBLE PROCESSES THROUGH THE USE OF COMPOSITIONAL DATA ANALYSIS: AN EXAMPLE USING SCOTTISH METAMORPHOSED 1 EXPLORATION OF GEOLOGICAL VARIABILITY AN POSSIBLE PROCESSES THROUGH THE USE OF COMPOSITIONAL ATA ANALYSIS: AN EXAMPLE USING SCOTTISH METAMORPHOSE C. W. Thomas J. Aitchison British Geological Survey epartment

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

Bayesian Nonparametric Regression for Diabetes Deaths

Bayesian Nonparametric Regression for Diabetes Deaths Bayesian Nonparametric Regression for Diabetes Deaths Brian M. Hartman PhD Student, 2010 Texas A&M University College Station, TX, USA David B. Dahl Assistant Professor Texas A&M University College Station,

More information

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 EPSY 905: Intro to Bayesian and MCMC Today s Class An

More information

MCMC algorithms for fitting Bayesian models

MCMC algorithms for fitting Bayesian models MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models

More information

The STS Surgeon Composite Technical Appendix

The STS Surgeon Composite Technical Appendix The STS Surgeon Composite Technical Appendix Overview Surgeon-specific risk-adjusted operative operative mortality and major complication rates were estimated using a bivariate random-effects logistic

More information

7 Sensitivity Analysis

7 Sensitivity Analysis 7 Sensitivity Analysis A recurrent theme underlying methodology for analysis in the presence of missing data is the need to make assumptions that cannot be verified based on the observed data. If the assumption

More information

Principal balances.

Principal balances. Principal balances V. PAWLOWSKY-GLAHN 1, J. J. EGOZCUE 2 and R. TOLOSANA-DELGADO 3 1 Dept. Informàtica i Matemàtica Aplicada, U. de Girona, Spain (vera.pawlowsky@udg.edu) 2 Dept. Matemàtica Aplicada III,

More information

The lmm Package. May 9, Description Some improved procedures for linear mixed models

The lmm Package. May 9, Description Some improved procedures for linear mixed models The lmm Package May 9, 2005 Version 0.3-4 Date 2005-5-9 Title Linear mixed models Author Original by Joseph L. Schafer . Maintainer Jing hua Zhao Description Some improved

More information

Computational statistics

Computational statistics Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated

More information

A quick introduction to Markov chains and Markov chain Monte Carlo (revised version)

A quick introduction to Markov chains and Markov chain Monte Carlo (revised version) A quick introduction to Markov chains and Markov chain Monte Carlo (revised version) Rasmus Waagepetersen Institute of Mathematical Sciences Aalborg University 1 Introduction These notes are intended to

More information

A latent Gaussian model for compositional data with zeroes

A latent Gaussian model for compositional data with zeroes A latent Gaussian model for compositional data with zeroes Adam Butler and Chris Glasbey Biomathematics & Statistics Scotland, Edinburgh, UK Summary. Compositional data record the relative proportions

More information

Downloaded from:

Downloaded from: Camacho, A; Kucharski, AJ; Funk, S; Breman, J; Piot, P; Edmunds, WJ (2014) Potential for large outbreaks of Ebola virus disease. Epidemics, 9. pp. 70-8. ISSN 1755-4365 DOI: https://doi.org/10.1016/j.epidem.2014.09.003

More information

The Bayesian Approach to Multi-equation Econometric Model Estimation

The Bayesian Approach to Multi-equation Econometric Model Estimation Journal of Statistical and Econometric Methods, vol.3, no.1, 2014, 85-96 ISSN: 2241-0384 (print), 2241-0376 (online) Scienpress Ltd, 2014 The Bayesian Approach to Multi-equation Econometric Model Estimation

More information

DAG models and Markov Chain Monte Carlo methods a short overview

DAG models and Markov Chain Monte Carlo methods a short overview DAG models and Markov Chain Monte Carlo methods a short overview Søren Højsgaard Institute of Genetics and Biotechnology University of Aarhus August 18, 2008 Printed: August 18, 2008 File: DAGMC-Lecture.tex

More information

F denotes cumulative density. denotes probability density function; (.)

F denotes cumulative density. denotes probability density function; (.) BAYESIAN ANALYSIS: FOREWORDS Notation. System means the real thing and a model is an assumed mathematical form for the system.. he probability model class M contains the set of the all admissible models

More information

Physics 403. Segev BenZvi. Numerical Methods, Maximum Likelihood, and Least Squares. Department of Physics and Astronomy University of Rochester

Physics 403. Segev BenZvi. Numerical Methods, Maximum Likelihood, and Least Squares. Department of Physics and Astronomy University of Rochester Physics 403 Numerical Methods, Maximum Likelihood, and Least Squares Segev BenZvi Department of Physics and Astronomy University of Rochester Table of Contents 1 Review of Last Class Quadratic Approximation

More information

Part 6: Multivariate Normal and Linear Models

Part 6: Multivariate Normal and Linear Models Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of

More information

Package lmm. R topics documented: March 19, Version 0.4. Date Title Linear mixed models. Author Joseph L. Schafer

Package lmm. R topics documented: March 19, Version 0.4. Date Title Linear mixed models. Author Joseph L. Schafer Package lmm March 19, 2012 Version 0.4 Date 2012-3-19 Title Linear mixed models Author Joseph L. Schafer Maintainer Jing hua Zhao Depends R (>= 2.0.0) Description Some

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

MCMC Methods: Gibbs and Metropolis

MCMC Methods: Gibbs and Metropolis MCMC Methods: Gibbs and Metropolis Patrick Breheny February 28 Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 1/30 Introduction As we have seen, the ability to sample from the posterior distribution

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

Markov Chain Monte Carlo Methods

Markov Chain Monte Carlo Methods Markov Chain Monte Carlo Methods John Geweke University of Iowa, USA 2005 Institute on Computational Economics University of Chicago - Argonne National Laboaratories July 22, 2005 The problem p (θ, ω I)

More information

Exploring Compositional Data with the CoDa-Dendrogram

Exploring Compositional Data with the CoDa-Dendrogram AUSTRIAN JOURNAL OF STATISTICS Volume 40 (2011), Number 1 & 2, 103-113 Exploring Compositional Data with the CoDa-Dendrogram Vera Pawlowsky-Glahn 1 and Juan Jose Egozcue 2 1 University of Girona, Spain

More information

6 Pattern Mixture Models

6 Pattern Mixture Models 6 Pattern Mixture Models A common theme underlying the methods we have discussed so far is that interest focuses on making inference on parameters in a parametric or semiparametric model for the full data

More information

Exercises Tutorial at ICASSP 2016 Learning Nonlinear Dynamical Models Using Particle Filters

Exercises Tutorial at ICASSP 2016 Learning Nonlinear Dynamical Models Using Particle Filters Exercises Tutorial at ICASSP 216 Learning Nonlinear Dynamical Models Using Particle Filters Andreas Svensson, Johan Dahlin and Thomas B. Schön March 18, 216 Good luck! 1 [Bootstrap particle filter for

More information

Don t be Fancy. Impute Your Dependent Variables!

Don t be Fancy. Impute Your Dependent Variables! Don t be Fancy. Impute Your Dependent Variables! Kyle M. Lang, Todd D. Little Institute for Measurement, Methodology, Analysis & Policy Texas Tech University Lubbock, TX May 24, 2016 Presented at the 6th

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Bayesian inference for factor scores

Bayesian inference for factor scores Bayesian inference for factor scores Murray Aitkin and Irit Aitkin School of Mathematics and Statistics University of Newcastle UK October, 3 Abstract Bayesian inference for the parameters of the factor

More information

The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations

The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations John R. Michael, Significance, Inc. and William R. Schucany, Southern Methodist University The mixture

More information

Modified Kolmogorov-Smirnov Test of Goodness of Fit. Catalonia-BarcelonaTECH, Spain

Modified Kolmogorov-Smirnov Test of Goodness of Fit. Catalonia-BarcelonaTECH, Spain 152/304 CoDaWork 2017 Abbadia San Salvatore (IT) Modified Kolmogorov-Smirnov Test of Goodness of Fit G.S. Monti 1, G. Mateu-Figueras 2, M. I. Ortego 3, V. Pawlowsky-Glahn 2 and J. J. Egozcue 3 1 Department

More information

Markov Chain Monte Carlo

Markov Chain Monte Carlo Markov Chain Monte Carlo Recall: To compute the expectation E ( h(y ) ) we use the approximation E(h(Y )) 1 n n h(y ) t=1 with Y (1),..., Y (n) h(y). Thus our aim is to sample Y (1),..., Y (n) from f(y).

More information

Assessment of the South African anchovy resource using data from : posterior distributions for the two base case hypotheses

Assessment of the South African anchovy resource using data from : posterior distributions for the two base case hypotheses FISHERIES/11/SWG-PEL/75 MRM IWS/DEC11/OMP/P3 ssessment of the South frican anchovy resource using data from 1984 1: posterior distributions for the two base case hypotheses C.L. de Moor and D.S. Butterworth

More information

An Introduction to Matrix Algebra

An Introduction to Matrix Algebra An Introduction to Matrix Algebra EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #8 EPSY 905: Matrix Algebra In This Lecture An introduction to matrix algebra Ø Scalars, vectors, and matrices

More information

University of Bristol - Explore Bristol Research. Peer reviewed version. Link to published version (if available): /rssa.

University of Bristol - Explore Bristol Research. Peer reviewed version. Link to published version (if available): /rssa. Goldstein, H., Carpenter, J. R., & Browne, W. J. (2014). Fitting multilevel multivariate models with missing data in responses and covariates that may include interactions and non-linear terms. Journal

More information

Dealing With Zeros and Missing Values in Compositional Data Sets Using Nonparametric Imputation 1

Dealing With Zeros and Missing Values in Compositional Data Sets Using Nonparametric Imputation 1 Mathematical Geology, Vol. 35, No. 3, April 2003 ( C 2003) Dealing With Zeros and Missing Values in Compositional Data Sets Using Nonparametric Imputation 1 J. A. Martín-Fernández, 2 C. Barceló-Vidal,

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters

More information

Local Likelihood Bayesian Cluster Modeling for small area health data. Andrew Lawson Arnold School of Public Health University of South Carolina

Local Likelihood Bayesian Cluster Modeling for small area health data. Andrew Lawson Arnold School of Public Health University of South Carolina Local Likelihood Bayesian Cluster Modeling for small area health data Andrew Lawson Arnold School of Public Health University of South Carolina Local Likelihood Bayesian Cluster Modelling for Small Area

More information

Regression. Oscar García

Regression. Oscar García Regression Oscar García Regression methods are fundamental in Forest Mensuration For a more concise and general presentation, we shall first review some matrix concepts 1 Matrices An order n m matrix is

More information

Bayesian linear regression

Bayesian linear regression Bayesian linear regression Linear regression is the basis of most statistical modeling. The model is Y i = X T i β + ε i, where Y i is the continuous response X i = (X i1,..., X ip ) T is the corresponding

More information

Statistical Inference for Stochastic Epidemic Models

Statistical Inference for Stochastic Epidemic Models Statistical Inference for Stochastic Epidemic Models George Streftaris 1 and Gavin J. Gibson 1 1 Department of Actuarial Mathematics & Statistics, Heriot-Watt University, Riccarton, Edinburgh EH14 4AS,

More information

A Bayesian Approach to Establishing a Reference Particle Size Distribution in the Presence of Outliers

A Bayesian Approach to Establishing a Reference Particle Size Distribution in the Presence of Outliers A Bayesian Approach to Establishing a Reference Particle Size Distribution in the Presence of Outliers Garritt L. Page Pontificia Universidad Católica de Chile Departmento de Estadística Stephen B. Vardeman

More information

Robustness of Principal Components

Robustness of Principal Components PCA for Clustering An objective of principal components analysis is to identify linear combinations of the original variables that are useful in accounting for the variation in those original variables.

More information

Supplementary Note on Bayesian analysis

Supplementary Note on Bayesian analysis Supplementary Note on Bayesian analysis Structured variability of muscle activations supports the minimal intervention principle of motor control Francisco J. Valero-Cuevas 1,2,3, Madhusudhan Venkadesan

More information

Basics of Modern Missing Data Analysis

Basics of Modern Missing Data Analysis Basics of Modern Missing Data Analysis Kyle M. Lang Center for Research Methods and Data Analysis University of Kansas March 8, 2013 Topics to be Covered An introduction to the missing data problem Missing

More information

MACHINE LEARNING. Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA

MACHINE LEARNING. Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA 1 MACHINE LEARNING Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA 2 Practicals Next Week Next Week, Practical Session on Computer Takes Place in Room GR

More information

STA 294: Stochastic Processes & Bayesian Nonparametrics

STA 294: Stochastic Processes & Bayesian Nonparametrics MARKOV CHAINS AND CONVERGENCE CONCEPTS Markov chains are among the simplest stochastic processes, just one step beyond iid sequences of random variables. Traditionally they ve been used in modelling a

More information

i=1 P ( m v) P ( v), (2) P ( m)

i=1 P ( m v) P ( v), (2) P ( m) 1 Introduction In the field of reservoir engineering as conducted at Shell Exploration and Production, the petrophysicist has to figure out the composition of underground formations based on tool readings.

More information

Univariate Normal Distribution; GLM with the Univariate Normal; Least Squares Estimation

Univariate Normal Distribution; GLM with the Univariate Normal; Least Squares Estimation Univariate Normal Distribution; GLM with the Univariate Normal; Least Squares Estimation PRE 905: Multivariate Analysis Spring 2014 Lecture 4 Today s Class The building blocks: The basics of mathematical

More information

ST 740: Markov Chain Monte Carlo

ST 740: Markov Chain Monte Carlo ST 740: Markov Chain Monte Carlo Alyson Wilson Department of Statistics North Carolina State University October 14, 2012 A. Wilson (NCSU Stsatistics) MCMC October 14, 2012 1 / 20 Convergence Diagnostics:

More information

Markov chain Monte Carlo

Markov chain Monte Carlo Markov chain Monte Carlo Karl Oskar Ekvall Galin L. Jones University of Minnesota March 12, 2019 Abstract Practically relevant statistical models often give rise to probability distributions that are analytically

More information

Marginal Specifications and a Gaussian Copula Estimation

Marginal Specifications and a Gaussian Copula Estimation Marginal Specifications and a Gaussian Copula Estimation Kazim Azam Abstract Multivariate analysis involving random variables of different type like count, continuous or mixture of both is frequently required

More information