A latent Gaussian model for compositional data with structural zeroes


Adam Butler and Chris Glasbey
Biomathematics & Statistics Scotland, Edinburgh, UK

David Allcroft
Edinburgh, UK

Sarah Wanless
Centre for Ecology and Hydrology, Banchory, UK

Summary. Compositional data record the relative proportions of different components within a mixture, and arise frequently in fields such as geology, psychology and ecology. Standard statistical techniques for the analysis of compositional data assume that the data do not contain any proportions which are genuinely zero, but real datasets, such as a dataset on seabird diet that we will consider here, often do contain such structural zeroes. We propose a latent multivariate Gaussian model for the analysis of compositional data which contain zero values, and propose an iterative algorithm to simulate values from this model. Evaluation of the likelihood involves the calculation of intractable integrals, so inferences about the parameters of the model are instead based upon a recently proposed sequential Monte Carlo algorithm for approximate Bayesian computation.

Keywords: Latent Gaussian model; Compositional data; Unit-sum constraint; Zero proportion; Approximate Bayesian computation; Sequential Monte Carlo; Intractable likelihood; Dietary composition; Seabirds; Rissa tridactyla

1. Introduction

The analysis of seabird diet can yield crucial insights into the ecology of both seabirds and their prey species, and can thereby play a significant role in contributing to the conservation of marine ecosystems. The data which motivate this paper record the relative frequencies of three different prey types - lesser sandeel Ammodytes marinus aged less than one year (SE0), mature lesser sandeel (SE1), and other species - within samples of regurgitated material collected from four island colonies of the Black-legged Kittiwake Rissa tridactyla on the east coast of Britain during the study period (Bull et al., 2004). The data are plotted in Figure 1.

The Kittiwake data are compositional, in the sense that they record information about the relative frequencies associated with different components of a system - in this case the proportions associated with different prey types. Compositional data routinely arise in disciplines such as geology, economics and ecology, and it has long been recognised (Pearson, 1897) that they should not be analysed using standard statistical methods because of the intrinsic constraint that the proportions associated with the various components must sum to one. Aitchison (1986) demonstrated that this difficulty could effectively be overcome by analysing the log-ratios of the proportions rather than the proportions themselves, and proposed a suite of statistical methods based upon assuming a multivariate Gaussian distribution for these log-ratios.

Fig. 1. Kittiwake diet data, plotted using a ternary diagram. The ternary diagram displays the relative frequency associated with the three different prey types, with the vertices representing those individuals which consume only a single type and the edges representing those individuals that consume two of the three types. Data points are represented by red circles whose area is proportional to the number of observations associated with that point - the area of the black circle represents the contribution of a single observation. [Vertices of the triangle: SE0, SE1 and Other.]
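As an aside on how such a figure can be drawn, the short Python sketch below maps compositions onto the barycentric coordinates of a ternary diagram; the function name, plotting choices and example compositions are illustrative assumptions of ours, not material from the paper.

```python
import numpy as np
import matplotlib.pyplot as plt

def ternary_xy(props):
    """Map compositions (SE0, SE1, Other) on the unit simplex to 2-D coordinates,
    with the vertices of an equilateral triangle representing pure diets:
    SE0 at (0, 0), SE1 at (1, 0) and Other at (0.5, sqrt(3)/2)."""
    props = np.asarray(props, dtype=float)
    props = props / props.sum(axis=1, keepdims=True)   # enforce the unit-sum constraint
    se1, other = props[:, 1], props[:, 2]
    return se1 + 0.5 * other, (np.sqrt(3) / 2) * other

# Hypothetical example compositions (not the real Kittiwake data).
comps = np.array([[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.2, 0.3, 0.5]])
x, y = ternary_xy(comps)
plt.plot([0, 1, 0.5, 0], [0, 0, np.sqrt(3) / 2, 0], color="black")   # triangle outline
plt.scatter(x, y, facecolors="none", edgecolors="red")
for label, (lx, ly) in zip(["SE0", "SE1", "Other"], [(0, 0), (1, 0), (0.5, np.sqrt(3) / 2)]):
    plt.annotate(label, (lx, ly))
plt.axis("equal")
plt.axis("off")
plt.show()
```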

There are excellent theoretical and practical arguments for the use of the Aitchisonian approach when the compositional data lie entirely on the interior of the unit simplex, but the approach breaks down when the proportions associated with some or all of the components may be zero. Rigorous methods for dealing with the situation in which zero values arise solely through the rounding off of small values have been developed (Fry et al., 2000; Martín-Fernández et al., 2000, 2003), but there is as yet no established methodology for dealing with the possibility that zero proportions in the data may correspond to genuine absences of the component concerned ('structural zeroes', also known as 'essential zeroes'). The Kittiwake diet data contain a large number of zero proportions, and at least some of these data are likely to arise from instances in which an individual bird has simply failed to consume some of the available prey types. Hierarchical approaches which explicitly model the occurrence of structural zeroes are conceptually attractive (Aitchison and Kay, 2004), but because such models are relatively highly parameterised they are unlikely to be appropriate for situations in which - as here - the amount of data is fairly small and the proportion of zero values is fairly large.

Latent Gaussian models have been successfully used to deal with the presence of zero values in data on rainfall (Durban and Glasbey, 2001; Allcroft and Glasbey, 2003b), agriculture (Allcroft and Glasbey, 2003a) and nutritional science (Allcroft et al., 2006), and in this paper we argue that an analogous approach can be used for the analysis of compositional data that contain structural zeroes. More specifically, we propose a model in which compositional data X are assumed to arise from the Euclidean projection of a latent multivariate Gaussian variable Y onto the unit simplex, the geometric region within which compositional data must lie. The only unknown parameters within our proposed model are the mean vector µ and covariance matrix Σ of the latent variable. Evaluation of the likelihood involves the calculation of intractable integrals unless the number of components D is very small, so parameter estimates are instead derived using methods of approximate Bayesian computation (ABC; Beaumont et al., 2002; Marjoram et al., 2003; Plagnol and Tavaré, 2004), in which inferences are based upon repeated simulation from the model rather than upon explicit evaluation of the likelihood function. More specifically, we use a variant of the sequential Monte Carlo algorithm for ABC that was proposed by Sisson et al. (2006).

We introduce our proposed model in Section 2, and present an iterative algorithm for simulating realisations from this model. In Section 3 we outline a methodology for drawing inferences about the parameters of our model through the use of a sequential Monte Carlo algorithm for approximate Bayesian computation. We apply the model to simulated data in Section 4, and to the Kittiwake diet data in Section 5. We close the paper with a brief discussion.

2. Proposed latent Gaussian model

Consider a D-dimensional random variable Y. We will assume throughout this paper that Y is compositional, in the sense that Y ∈ [0, 1]^D and Y^T 1 = 1. These constraints imply that the D elements of Y each lie between zero and one and that the elements will always sum to one. They are clearly equivalent to an assumption that Y lies on the unit simplex
$$S_{D-1} = \{ y \in \mathbb{R}^D : y^T \mathbf{1} = 1,\; y \in [0, 1]^D \},$$
which is itself a subset of the hyperplane
$$H_{D-1} = \{ y \in \mathbb{R}^D : y^T \mathbf{1} = 1 \}.$$

We propose to model Y as a known transformation of a latent random variable whose support is the hyperplane H_{D-1}, with the latent random variable assumed to be multivariate Gaussian. More specifically, our proposed model is based upon assuming that
$$Y = g(Z),$$
where:

(a) g is the deterministic function which performs a Euclidean projection of Z ∈ H_{D-1} onto the unit simplex S_{D-1}, so that
$$g(z) = \{ y \in S_{D-1} : \| z - y \| \le \| z - y' \| \ \text{for all } y' \in S_{D-1} \};$$

(b) Z has a multivariate Gaussian distribution with mean vector µ and covariance matrix Σ, where µ^T 1 = 1 and Σ1 = 0.

Note that the constraints on µ and Σ are necessary and sufficient to ensure that the latent variable Z will always lie on the hyperplane H_{D-1}. The constraints imply that there are (D + 2)(D − 1)/2 free parameters within our model, corresponding to the number of parameters within a standard (D − 1)-dimensional multivariate Gaussian model.

2.1. Simulation from the model

The simulation of values from the proposed model relies upon (a) generating values z from a multivariate Gaussian distribution, which can easily be achieved using standard algorithms, and (b) calculating the Euclidean projection y = g(z) of z onto the unit simplex. We propose an iterative algorithm that will evaluate g(z) within D − 1 steps.

Let z_j denote the jth component of z. Without loss of generality we can assume, for notational convenience, that the values within z are arranged in ascending order, so that z_1 and z_D respectively represent the smallest and largest values within this vector. If z_1 > 0 then we must already have z ∈ S_{D-1}, in which case we set y = z. Otherwise, we loop through the values k = 1, ..., D − 1 and calculate
$$r_k = \sum_{l=1}^{k} z_l / (D - k),$$
until we reach a value of k for which z_{k+1} + r_k ≥ 0. We then stop and set y = (y_1, ..., y_D) to be
$$y_j = \begin{cases} 0 & \text{for } j \le k; \\ z_j + r_k & \text{for } j > k. \end{cases}$$
It is clear that this algorithm will always generate a point y that lies within the unit simplex S_{D-1}, so that y^T 1 = 1 and y ≥ 0.
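The simulation scheme of Section 2.1 can be written down compactly; the following Python sketch is our own reading of the projection algorithm and of steps (a) and (b) above, with hypothetical function names, and assumes that the input point already satisfies the unit-sum constraint.

```python
import numpy as np

def project_to_simplex(z):
    """Euclidean projection of a point z with sum(z) = 1 onto the unit simplex,
    following the iterative scheme described in Section 2.1."""
    z = np.asarray(z, dtype=float)
    order = np.argsort(z)                 # work with the components in ascending order
    zs = z[order]
    D = len(zs)
    if zs[0] > 0:                         # already on the simplex
        return z
    for k in range(1, D):                 # zero out the k smallest components
        r_k = zs[:k].sum() / (D - k)
        if zs[k] + r_k >= 0:              # stopping rule: remaining components stay non-negative
            y_sorted = np.concatenate([np.zeros(k), zs[k:] + r_k])
            break
    y = np.empty(D)
    y[order] = y_sorted                   # undo the sorting
    return y

def simulate_compositions(mu, Sigma, n, rng=None):
    """Draw n compositions Y = g(Z) with Z ~ N(mu, Sigma), where mu'1 = 1 and Sigma 1 = 0."""
    rng = np.random.default_rng() if rng is None else rng
    Z = rng.multivariate_normal(mu, Sigma, size=n)
    return np.array([project_to_simplex(z) for z in Z])
```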

If we let λ = r_k and γ = (γ_1, ..., γ_D), where
$$\gamma_j = \begin{cases} -z_j - r_k & \text{for } j \le k; \\ 0 & \text{for } j > k, \end{cases}$$
then it follows immediately that y = z + 1λ + γ, γ ≥ 0 and y^T γ = 0, and it therefore follows by the theory of Lagrangian multipliers (see, for example, Fletcher, 1981) that
$$\| z - y \| \le \| z - y' \| \ \text{for all } y' \in S_{D-1},$$
and so, by definition, that g(z) = y.

3. Inference

Let y = (y_1, ..., y_n) denote n i.i.d. data that are assumed to be realisations from a random variable Y whose distribution is given by the proposed latent Gaussian model with unknown parameters θ = (µ, Σ). We are interested in using the data y to draw statistical inferences about the values of the p elements of θ. The log-likelihood function is of the form
$$\log f(y \mid \theta) := \sum_{i=1}^{n} \log P(Y = y_i \mid \theta) = \sum_{i=1}^{n} \log P(Z \in h(y_i) \mid \theta) = \sum_{i=1}^{n} \log \int_{h(y_i)} \phi_D(z; \theta)\, dz,$$
where φ_D(z; θ) is the probability density function of a D-dimensional multivariate normal random variable with parameter vector θ and where h(y) = {z ∈ H_{D-1} : g(z) = y}. Note that g is not invertible and that h typically defines a subset of the hyperplane H_{D-1} rather than a single point. If y lies on the interior of the simplex S_{D-1}, however, then h(y) = {y}, so that h does indeed define a single point. When D = 2 or D = 3 it is possible to derive formulae which express the log-likelihood in terms of the standard Gaussian distribution function Φ, but for general D no such simplification is possible and the likelihood function is consequently intractable. The intractability arises from the fact that the geometry of the regions defined by the function h rapidly becomes complicated as D increases to even moderately large values.

3.1. Approximate Bayesian Computation

Methods of approximate Bayesian computation were developed to deal with situations such as this, in which the likelihood function for a model is intractable but it is relatively straightforward to generate values from the model via simulation. ABC algorithms are designed to select those parameter values which simulate data that have properties similar to those of the actual data. The degree of similarity between data y' simulated from the model and the actual data y is quantified using d(S(y'), S(y)), where d is a distance metric and where S(y) denotes a set of summary statistics calculated from y. Algorithm A (Fu and Li, 1997; Weiss and von Haeseler, 1998; Pritchard et al., 1999) gives a mechanism for generating N independent realisations θ^(1), ..., θ^(N) from the distribution of θ | d(S(y'), S(y)) < ε, where the threshold ε > 0 is fixed a priori.

Algorithm A:

A1 Set i = 1.
A2 Generate θ' ~ π(θ), where π(θ) is the prior distribution for θ.
A3 If π(θ') > 0 then generate y' ~ f(y | θ'); else go to A1.
A4 If d(S(y'), S(y)) < ε then set θ^(i) = θ'; else go to A1.
A5 If i < N then set i = i + 1 and go to A2.

If S(y) is a sufficient statistic for y then the distribution of θ | d(S(y'), S(y)) < ε will converge to the posterior distribution of θ | y as ε → 0. For all but the simplest models it will not be possible to derive a set of sufficient statistics for y, and we must instead select S based upon heuristic considerations. It is clearly essential that almost all of the information about θ which is contained within y should also be contained in S(y), since the limiting distribution of θ | d(S(y'), S(y)) < ε as ε → 0 may otherwise not be equal to the target distribution.

When D = 2 we take the summary statistics to be (1) the mean of y_1, (2) two times the variance of y_1, (3) the proportion of zeroes in y_1 and (4) the proportion of ones in y_1. When D = 3 we take the summary statistics to be the mean of y_1; the variances of y_1, y_2 and y_3, multiplied by two; the means of (y_1 − y_2)/2, (y_1 − y_3)/2 and (y_2 − y_3)/2; the proportions of zeroes in y_1, y_2 and y_3; and the proportions of ones in y_1, y_2 and y_3. These statistics were selected largely on the basis of trial and error, but appear to yield reasonable performance in practice. We have selected the elements of S such that each is constrained to lie on the interval [0, 1], thus avoiding the necessity for attributing different weights to the different elements of S within the distance metric d. We take d(S(y'), S(y)) to be equal to the mean of the absolute values of the elements of S(y') − S(y). The choice of ε depends largely upon computational considerations, since the acceptance rate of the algorithm decreases rapidly as we decrease the value of ε, and we discuss this in Sections 4 and 5.

π(θ) denotes the prior distribution of θ. In this paper we find it convenient to adopt a Bayesian approach, but we continue to regard this as an approximation to maximum likelihood inference and so adopt a uniform prior of the form
$$\pi(\theta) \propto \begin{cases} 1 & \text{if } \theta \in \Theta, \\ 0 & \text{otherwise,} \end{cases}$$
where Θ ⊂ R^p.
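The following Python sketch shows one way that the D = 2 summary statistics, the distance metric and the rejection sampler of Algorithm A could be implemented; the function names are ours, the projection for D = 2 reduces to clipping the latent value to [0, 1], and the prior ranges simply mirror those quoted later in Section 4.1.

```python
import numpy as np

def summary_stats(y1):
    """Two-component summary statistics: mean, twice the variance,
    proportion of zeroes and proportion of ones."""
    y1 = np.asarray(y1, dtype=float)
    return np.array([y1.mean(), 2.0 * y1.var(), np.mean(y1 == 0.0), np.mean(y1 == 1.0)])

def distance(s_sim, s_obs):
    """Mean absolute difference between the two summary vectors."""
    return np.mean(np.abs(s_sim - s_obs))

def simulate_y1(mu, sigma, n, rng):
    """Two-component model: the latent value is N(mu, sigma^2), and the Euclidean
    projection onto the simplex censors the first component at 0 and 1."""
    return np.clip(rng.normal(mu, sigma, size=n), 0.0, 1.0)

def abc_reject(y_obs, n_particles, eps, rng=None):
    """Rejection ABC in the spirit of Algorithm A, with uniform priors on mu and log sigma."""
    rng = np.random.default_rng() if rng is None else rng
    s_obs = summary_stats(y_obs)
    accepted = []
    while len(accepted) < n_particles:
        mu = rng.uniform(-10.0, 11.0)              # prior ranges as quoted in Section 4.1
        log_sigma = rng.uniform(-10.0, 10.0)
        y_sim = simulate_y1(mu, np.exp(log_sigma), len(y_obs), rng)
        if distance(summary_stats(y_sim), s_obs) < eps:
            accepted.append((mu, log_sigma))
    return np.array(accepted)
```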

3.2. A sequential algorithm for ABC

The justification for the basic ABC algorithm (Algorithm A) relies upon the threshold ε being sufficiently small that the distribution of θ | d(S(y'), S(y)) < ε is approximately equal to the distribution of θ | y. If ε is small and the prior distribution π(θ) relatively uninformative, however, then the algorithm will tend to have an extremely low acceptance rate, especially if the dimensionality p of the parameter space Θ is large. Marjoram et al. (2003) propose overcoming this by embedding the ABC criterion d(S(y'), S(y)) < ε within a Markov chain Monte Carlo (MCMC) framework, but Sisson et al. (2006) note that the resulting algorithm will often exhibit very poor mixing owing to the fact that excursions into the tails of the distribution are associated with a severe reduction in the acceptance rate. Preliminary analyses of the seabird diet data using the ABC-MCMC algorithm suggest that the effects of poor mixing are particularly acute in the context of our latent Gaussian model.

An alternative approach involves applying Algorithm A sequentially using a monotonically decreasing set of thresholds {e_0, e_1, ..., e_T}, where e_T = ε. Sequential Monte Carlo algorithms provide a powerful, and potentially highly efficient, set of methods for drawing inferences about an arbitrary target distribution, and the development of new sequential algorithms is an active area of statistical research (Robert, 2004). The algorithm that we use in this paper (Algorithm B) is a special case of the sequential Monte Carlo algorithm for ABC that was introduced by Sisson et al. (2006): specifically, it is the algorithm that is obtained by taking all particle weights to be equal to 1/N within the ABC-PRC algorithm that they propose. If q is symmetric and π(θ) ∝ 1 for θ ∈ Θ then it follows from the arguments in Sisson et al. (2006) that Algorithm B will generate independent realisations from the target distribution θ | d(S(y'), S(y)) < ε. [Note: the Sisson et al. preprint has now been revised, and no longer seems to include steps B8 and B9 in the algorithm below - we need to verify that the proof in that paper continues to hold.]

Algorithm B:

B1 Set i = 1.
B2 Generate θ' ~ π(θ).
B3 Generate y' ~ f(y | θ').
B4 If d(S(y'), S(y)) < e_0 then set θ_0^(i) = θ'; else return to B2.
B5 If i < N then set i = i + 1 and return to B2.
B6 Set t = 1 and i = 1.
B7 Sample θ' at random from the sequence {θ_{t−1}^(1), ..., θ_{t−1}^(N)}.
B8 Generate y' ~ f(y | θ').
B9 If d(S(y'), S(y)) ≥ e_t then return to B7.
B10 Generate θ'' ~ q(θ | θ').
B11 If π(θ'') > 0 then generate y'' ~ f(y | θ''); else return to B10.
B12 If d(S(y''), S(y)) < e_t then set θ_t^(i) = θ''; else return to B10.
B13 If i < N then set i = i + 1 and return to B7.
B14 If t < T then set t = t + 1, set i = 1 and return to B7.
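To make the control flow of Algorithm B concrete, the Python sketch below implements the equal-weights scheme for the two-component model. It is our own illustrative reading of the algorithm, not the authors' code: the helper functions repeat those of the earlier sketch, the prior ranges mirror those quoted in Section 4.1, and the threshold sequence and proposal scale are supplied by the caller.

```python
import numpy as np

# Helper functions for the two-component model (same definitions as in the earlier sketch).
def summary_stats(y1):
    y1 = np.asarray(y1, dtype=float)
    return np.array([y1.mean(), 2.0 * y1.var(), np.mean(y1 == 0.0), np.mean(y1 == 1.0)])

def distance(s_sim, s_obs):
    return np.mean(np.abs(s_sim - s_obs))

def simulate_y1(mu, sigma, n, rng):
    return np.clip(rng.normal(mu, sigma, size=n), 0.0, 1.0)

def abc_smc(y_obs, thresholds, n_particles, tau, rng=None):
    """Equal-weight sequential ABC in the spirit of Algorithm B, for the two-component
    model with theta = (mu, log sigma) and uniform priors on both parameters."""
    rng = np.random.default_rng() if rng is None else rng
    s_obs = summary_stats(y_obs)
    n = len(y_obs)

    def dist_for(theta):
        y_sim = simulate_y1(theta[0], np.exp(theta[1]), n, rng)
        return distance(summary_stats(y_sim), s_obs)

    def in_prior(theta):
        return -10.0 < theta[0] < 11.0 and -10.0 < theta[1] < 10.0

    # Steps B1-B5: build the initial population from the prior at threshold e_0.
    particles = []
    while len(particles) < n_particles:
        theta = np.array([rng.uniform(-10.0, 11.0), rng.uniform(-10.0, 10.0)])
        if dist_for(theta) < thresholds[0]:
            particles.append(theta)

    # Steps B6-B14: refresh the population at each successively smaller threshold e_t.
    for e_t in thresholds[1:]:
        new_particles = []
        while len(new_particles) < n_particles:
            theta_star = particles[rng.integers(len(particles))]     # B7: resample a particle
            if dist_for(theta_star) >= e_t:                           # B8-B9: reject poor particles
                continue
            while True:                                               # B10-B12: perturb until accepted
                theta_new = theta_star + rng.normal(0.0, tau, size=2)
                if in_prior(theta_new) and dist_for(theta_new) < e_t:
                    new_particles.append(theta_new)
                    break
        particles = new_particles
    return np.array(particles)
```

For example, with the threshold schedule discussed below and illustrative tuning values, a call might look like `abc_smc(y_obs, [2 * 0.1 / (t + 2) for t in range(99)], 1000, 0.05)`.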

The accuracy with which Algorithm B is able to provide an approximation to the target distribution θ | y will, as for Algorithm A, depend upon the threshold ε, the set of summary statistics S, the distance metric d, the number of simulations N and the prior distribution π. Algorithm B additionally requires us to specify a proposal distribution q(θ | θ') and a sequence of intermediate thresholds {e_0, e_1, ..., e_{T−1}}: these two choices have an impact upon the efficiency of the algorithm, but will not affect the accuracy with which the final sequence {θ_T^(1), ..., θ_T^(N)} provides an approximation to the posterior distribution of θ | y. In this paper we take q to be Gaussian, so that q(θ'' | θ') = φ_p(θ''; θ', τ²I), where the proposal standard deviation τ controls the rate at which we explore the parameter space. We take the intermediate thresholds to be of the form e_t = 2e_0/(t + 2), given a particular value for ε and for the initial threshold e_0.

4. Simulation study

We use a simulation study to explore the performance of the sequential Monte Carlo ABC algorithm in the two and three component cases, for which the likelihood is tractable and it is therefore possible to compare estimates obtained using ABC against those obtained using standard maximum likelihood.

4.1. Two components

When D = 2 we can simplify notation by restricting attention to a single component y, with latent variable z ~ N(µ, σ²), the second component then being equal to 1 − y. We simulate five datasets from our model, each of size n = 200 and each generated using the same seed for random number generation. The datasets are generated using a range of different parameter values:

2a: µ = 0.1, σ = 0.1;
2b: µ = 0, σ = 0.1;
2c: µ = −0.1, σ = 0.1;
2d: µ = 0.5, σ = 0.5;
2e: µ = 0.5, σ = 1.

The first three sets of parameters are associated with increasingly large probabilities of obtaining a zero proportion in component one (0.159 for 2a, 0.5 for 2b and 0.841 for 2c) but have a negligible probability (less than 0.001) of obtaining a zero proportion in component two. The last two sets of parameters are associated with non-negligible probabilities of obtaining zero proportions in either of the components (0.159 in each component for 2d, 0.309 in each component for 2e). We fit the model to each of the datasets by numerical maximum likelihood and using Algorithm B with ε = 1/500, N = 1000, e_0 = 1/10 and a fixed value of τ. The log-likelihood for the two component case is
$$\sum_{i=1}^{n} \log\left( I\{y_i = 0\}\,\Phi\left(\frac{-\mu}{\sigma}\right) + I\{y_i = 1\}\left[1 - \Phi\left(\frac{1-\mu}{\sigma}\right)\right] + I\{0 < y_i < 1\}\,\frac{1}{\sigma}\,\phi\left(\frac{y_i - \mu}{\sigma}\right)\right),$$
where φ and Φ respectively denote the density and distribution functions for a standard Gaussian random variable.
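A direct Python sketch of this two-component log-likelihood, suitable for numerical maximisation, is given below; the function name, the use of scipy and the illustrative dataset are assumptions on our part, not material from the paper.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def loglik_two_component(params, y):
    """Log-likelihood for the D = 2 latent Gaussian model: observations equal to
    0 or 1 are censored values of the latent N(mu, sigma^2) variable."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    y = np.asarray(y, dtype=float)
    ll = np.sum(y == 0.0) * norm.logcdf(-mu / sigma)              # mass at zero
    ll += np.sum(y == 1.0) * norm.logsf((1.0 - mu) / sigma)       # mass at one
    interior = y[(y > 0.0) & (y < 1.0)]
    ll += np.sum(norm.logpdf(interior, loc=mu, scale=sigma))      # density on (0, 1)
    return ll

# Hypothetical usage: simulate a dataset and maximise over (mu, log sigma).
rng = np.random.default_rng(1)
y = np.clip(rng.normal(0.1, 0.1, size=200), 0.0, 1.0)
fit = minimize(lambda p: -loglik_two_component(p, y), x0=np.array([0.2, np.log(0.2)]))
print(fit.x)   # estimates of (mu, log sigma)
```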

Fig. 2. Results from fitting the proposed latent Gaussian model to three simulated datasets with D = 2 components and n = 200 observations each, using maximum likelihood (grey) and using Algorithm B with a sequence of thresholds from e_0 = 1/10 to ε = 1/500 (black). We show 2.5% (solid), 25% (dotted), 50% (thick solid), 75% (dotted) and 97.5% (solid) quantiles for the parameters µ and log σ. [Panels show datasets 2a, 2c and 2d, with one column for µ and one for log σ.]

Table 1. Number of evaluations per particle to fit the latent Gaussian model to simulated data with D = 2 components using Algorithm B. *: for 2e it was necessary to terminate the algorithm at ε = 1/250 because the number of evaluations becomes prohibitively large for smaller values of ε. [Table body: datasets 2a-2e against thresholds ε = 1/125, 1/250 and 1/500; entries not reproduced.]

The prior distributions for the ABC algorithm are taken to be µ ~ U(−10, 11) and log σ ~ U(−10, 10), which we regard as relatively uninformative. In Figure 2 results of the maximum likelihood and ABC analyses are shown for three of the datasets; results for 2b are qualitatively similar to those for 2a, and results for 2e are similar to those for 2d. The plots for all of the parameters and datasets suggest that the medians of the ABC samples are converging towards the maximum likelihood estimate as e_t tends towards zero, and that convergence is relatively rapid (with results changing minimally for e_t smaller than 1/300). The more extreme quantiles of the ABC samples (2.5%, 25%, 75%, 97.5%) become increasingly close to the corresponding quantiles of the distribution of the MLE as e_t becomes small, but for datasets 2d and 2e, and to a lesser extent 2a, they continue to systematically underestimate the uncertainty within the estimator even when ε is very small.

For dataset 2a we investigated the impact of changing the prior distributions - from π(µ) = U(−5, 5) to either π(µ) = U(−10, 10) or π(µ) = U(−2, 2), and from π(log σ) = U(−10, 2) to π(log σ) = U(−5, 1) - of changing the seed used for random number generation, and of reducing the number of particles N (from 1000 down to 200), but we found that these modifications all had a negligible impact upon the results obtained. We also attempted to run a standard MCMC algorithm using the analytic form of the likelihood but with the same prior distribution as for the ABC algorithm, and found that the results were very similar to those obtained via numerical maximum likelihood. We conclude that the underestimation of uncertainty by the ABC algorithm for the simulated datasets is a robust result, which probably results from inadequacies in the selection of test statistics S - although it is not apparent what the nature of this inadequacy might be.

We might expect that estimation would become increasingly difficult as the probability of obtaining a zero proportion becomes higher, since zero proportions are regarded as censored data within the context of our model, so it is somewhat surprising that in Figure 2 convergence appears to occur most slowly for the dataset (2a) in which the proportion of zero values is smallest. This counter-intuitive result is at least partly explained by the fact that the acceptance rate of the sequential ABC algorithm varies enormously between the different datasets - Algorithm B is fairly efficient for datasets 2a, 2b and 2c (in terms of requiring a low number of evaluations per particle: Table 1), but highly inefficient for datasets 2d and 2e. For the purposes of comparison, we also attempted to apply Algorithm A to dataset 2a using the same priors and test statistics as for Algorithm B. For a (relatively large) threshold of 1/125 we found that Algorithm A required 20 million evaluations in order to generate N = 995 accepted parameter values - i.e. an acceptance rate of 0.005%, or roughly 20,000 evaluations per particle - illustrating the extreme inefficiency of this algorithm.

Table 2. Number of evaluations per particle to fit the latent Gaussian model to dataset 2a using Algorithm B, for different values of the proposal standard deviation τ. [Table body: rows correspond to values of τ; columns give the number of evaluations at ε = 1/125 and at a second, smaller threshold; entries not reproduced.]

Sensitivity analyses using dataset 2a suggest that the efficiency - i.e. the acceptance rate - of Algorithm B is largely unaffected by the seed used for random number generation, by the number of particles N and by the choice of prior distribution (so long as this remains relatively uninformative). The efficiency is strongly dependent, however, upon the threshold ε and upon the standard deviation τ of the proposal distribution (Table 2). The acceptance rate is highest when the proposal standard deviation is taken to be very small - this reflects the fact that the acceptance rate declines very rapidly with e_t if the proposal standard deviation is taken to be even moderately large, since as e_t becomes small the set of plausible parameter values also becomes small. The risk of using a very small value for τ, however, is that we may fail to adequately explore the parameter space whilst e_t is relatively large, and may consequently become trapped in an area of the space that actually has low posterior probability when e_t becomes small. It would probably be most efficient to allow τ to depend upon t, so that τ_0, τ_1, ..., τ_{T−1} is a monotonically decreasing sequence, but we have not attempted to implement such a strategy here.

4.2. Three components

When D = 3 our model contains five unknown parameters - the mean µ_1 and standard deviation σ_1 of the first component, the mean µ_2 and standard deviation σ_2 of the second component, and the correlation ρ between the first and second components (note that the model can clearly also be parameterised in other ways). We simulate three datasets from our model, each of size n = 200 and each generated using the same seed for random number generation. We take µ_1 = 1/3, µ_2 = 1/3, ρ = 1/2, and

3a: σ_1 = σ_2 = 0.1;
3b: σ_1 = σ_2 = 0.5;
3c: σ_1 = σ_2 = 1;

so that the proportion of zero values is higher for dataset 3c than for 3b, and higher for 3b than for 3a. The log-likelihood function can be expressed in terms of the density and distribution functions of bivariate and univariate normal random variables (calculations not shown), and so is again straightforward to calculate. In Figure 3 we compare results obtained using maximum likelihood against those obtained using Algorithm B with N = 1000, τ = 0.05, e_0 = 1/2 and ε = 1/125. Results are shown for the parameters µ_1 and ρ, but qualitatively similar results are obtained for the remaining three parameters.
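As a concrete illustration of this parameterisation, the sketch below (our own construction, with a hypothetical function name) builds the constrained mean vector and covariance matrix of Section 2 from (µ_1, µ_2, σ_1, σ_2, ρ), using the fact that the third latent component equals one minus the sum of the first two; the example values echo dataset 3a.

```python
import numpy as np

def constrained_mvn_params(mu1, mu2, sigma1, sigma2, rho):
    """Build (mu, Sigma) for the D = 3 latent Gaussian model from the five free
    parameters, so that mu'1 = 1 and Sigma 1 = 0 hold by construction
    (the third latent component is Z3 = 1 - Z1 - Z2)."""
    mu = np.array([mu1, mu2, 1.0 - mu1 - mu2])
    c12 = rho * sigma1 * sigma2
    Sigma = np.array([
        [sigma1**2,         c12,               -sigma1**2 - c12],
        [c12,               sigma2**2,         -sigma2**2 - c12],
        [-sigma1**2 - c12,  -sigma2**2 - c12,  sigma1**2 + sigma2**2 + 2 * c12],
    ])
    return mu, Sigma

# Quick check with dataset-3a-style values: rows of Sigma sum to zero, mu sums to one.
mu, Sigma = constrained_mvn_params(1/3, 1/3, 0.1, 0.1, 0.5)
assert np.allclose(Sigma.sum(axis=1), 0.0) and np.isclose(mu.sum(), 1.0)
```

This (mu, Sigma) pair can be passed directly to the simulation sketch of Section 2.1 to generate three-component compositions.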

Fig. 3. Results from fitting the proposed latent Gaussian model to three simulated datasets with D = 3 components and n = 200 observations each, using maximum likelihood (grey) and using Algorithm B with a sequence of thresholds from e_0 = 1/2 to ε = 1/125 (black). We show 2.5% (solid), 25% (dotted), 50% (thick solid), 75% (dotted) and 97.5% (solid) quantiles for the parameters µ_1 and ρ. [Panels show datasets 3a, 3b and 3c, with one column for µ_1 and one for ρ.]

Table 3. Number of evaluations per particle to fit the latent Gaussian model to simulated data with D = 3 components using Algorithm B. [Table body: datasets 3a-3c against thresholds ε = 1/50, 1/100 and 1/125; entries not reproduced.]

We see that the median values of the ABC samples generally tend to converge towards the MLE in a reasonably smooth way as e_t becomes small, but that there are some parameters (e.g. ρ for dataset 3a) for which the ABC samples appear to be converging towards a value that is slightly different from this. More noticeably, the ABC procedure again tends to markedly underestimate the level of uncertainty within the MLEs, especially for those datasets (3b and 3c) in which the data exhibit a relatively high degree of variability. Note that we use a substantially larger value for ε here than in the two component case, essentially for computational reasons - the acceptance rate of the algorithm is much lower when D = 3 than when D = 2 (Table 3), and the rate drops very sharply between ε = 1/125 and ε = 1/150 (not shown). The lower efficiency is related to the higher dimensionality of the parameter space (p = 5 rather than p = 2).

5. Application to seabird diet data

We use numerical maximum likelihood to fit our model to the Kittiwake diet data, and in Figure 4 show the contours of the density associated with the maximum likelihood estimate. We see that the mean of the fitted density lies outside the unit simplex. Such behaviour is not precluded by the specification of our model, and serves only to indicate (1) that the prevalences associated with consumption differ quite substantially between the three diet types and (2) that the proportion of zero values within our data is relatively large.

We also fit our model to the data using ABC (Algorithm B), and in Figure 5 compare the results against those obtained via maximum likelihood. We see that for all five of the parameters the properties of the ABC samples have converged fairly well to those of the maximum likelihood estimates by the time that we reach a threshold of ε = 1/100, although convergence occurs much more slowly than for the simulated data of Section 4. It is not clear whether the ABC samples are underestimating the uncertainty of the MLEs in this case, partly because the quantiles of the ABC samples decay much less smoothly with e_t than they do for the simulated data. The sudden jumps in the properties of the ABC samples arise because of the discretised nature of the diet data - proportions are almost always rounded to the nearest 5%.

We have focused here upon using our model to provide a description of the full dataset, pooled across years and colonies, but ecologists are predominantly interested in comparing the effects of colony, year and date-within-year upon dietary composition (Bull et al., 2004). Our methodology could, in principle, easily be applied to subgroups of the data and extended to account for the effects of covariates, but preliminary results suggest that there are likely to be practical difficulties in doing this. We attempted to fit the model separately to data for the two groups of colonies - marine and estuarine - that are identified by Bull et al. (2004) as being associated with quite distinct patterns of feeding behaviour, but find (a) that the level of uncertainty in the MLEs is very large for the estuarine group and (b) that the ABC algorithms are unable to provide any kind of useful approximation to the distribution of the MLEs for either group.

Fig. 4. Contours of the probability density function associated with the maximum likelihood estimates obtained by fitting the model to Kittiwake diet data for all colonies. Raw data are also shown. [Ternary diagram with vertices SE0, SE1 and Other.]

Fig. 5. Results from fitting the proposed latent Gaussian model to the Kittiwake diet data. The model is fitted using maximum likelihood (grey) and using Algorithm B with a sequence of thresholds from e_0 = 1/2 to ε = 1/100 (black). We show 2.5% (solid), 25% (dotted), 50% (thick solid), 75% (dotted) and 97.5% (solid) quantiles for the five parameters of the model. We also plot the number of evaluations per particle as a function of e_t. [Panels: µ_1, µ_2, σ_1, σ_2, ρ, and evaluations per particle against ε.]

These problems probably result from the lack of data on the interior of the simplex, which could - at least in the limiting case in which there were no data on the interior - make the parameters of the model unidentifiable.

For this paper we have aggregated the diet data into three groups, whereas the raw data actually record composition in terms of seven groups - the 'other' category is broken down in terms of clupeids, small gadidae, planktonic crustacea, polychaetes and an unknown category. The overall approach to modelling and inference that we have presented is applicable for any number of components D. As D increases, however, the proportion of data points that lie on the interior of S_{D-1} will decrease - the Kittiwake diet data contain 22 observations on the interior of the simplex when D = 3 but only 3 observations if we create D = 4 components by regarding clupeids as a distinct class - again creating problems with estimation. The efficiency of the ABC algorithm will also decrease, because of an increase in the dimensionality of the parameter space.

6. Concluding remarks

In this paper we have outlined a novel methodology for analysing compositional data that contain structural zeroes. Our approach is specifically designed to deal with those situations in which the proportion of zero values within the data is reasonably large, since conventional approaches for the analysis of compositional data will not be appropriate in such circumstances, and we have successfully fitted the model to simulated and real datasets in which there are two or three components.

Our proposed model effectively regards those data points which contain zeroes as being partially censored, and estimation problems will consequently arise if the data contain a very high frequency of zero values - as for the estuarine group in the Kittiwake diet data - because of a lack of identifiability. Conversely, if the data contain no zero values then the model reduces to a multivariate normal distribution that is subject to sum constraints on the mean vector and covariance matrix. This latter remark indicates that great care should be taken when interpreting the parameters of our model, since it is well known (e.g. Aitchison, 1986) that covariances between variables which are subject to a unit-sum constraint do not have any natural interpretation in terms of the (in)dependence of those variables. It is also essential to verify that the fitted model does indeed provide a reasonable description of the observed data, and the development of appropriate diagnostic tools to assess model fit for latent Gaussian models is an area of ongoing research for us.

The proposed model presents a challenging problem for statistical inference since it involves the calculation of integrals over regions which, at least for general D, cannot explicitly be defined. We have adopted an approximate approach that is based upon simulation.
ABC methods were originally developed for use with specific forms of highly dependent, unreplicated data that arise in genetics and evolutionary biology (Beaumont et al., 2002; Leman et al., 2005; Thornton and Andolfatto, 2006), but in this paper we have demonstrated that such approximate methods can also successfully be used for analysing the kinds of replicated data that are frequently encountered in ecology and environmental science. The accuracy of ABC methods in providing a good approximation to the posterior distribution of θ | y depends upon making appropriate choices for the summary statistics S and the distance measure d - we have selected these in a somewhat arbitrary fashion, based largely upon trial and error, and the performance of the ABC algorithms could potentially be improved through the use of alternative choices for S and d.

The number of parameters within our model is quite large (e.g. p = 5 when D = 3) relative to previous situations in which ABC methods have been applied, and standard ABC algorithms based upon direct Monte Carlo simulation or Markov chain Monte Carlo are consequently subject to very low acceptance rates and poor mixing respectively. We have shown that a sequential Monte Carlo approach (Sisson et al., 2006) can be used to ensure that the ABC approach retains a reasonable level of efficiency even when the threshold ε is very small. The computational costs of running this algorithm could be reduced, possibly substantially, by more systematic selection of the proposal standard deviation τ and the sequence of intermediate thresholds {e_0, e_1, ..., e_{T−1}}. Finally, Algorithm B assigns equal weights to each particle in the sequence {θ_t^(1), ..., θ_t^(N)}, but we could potentially obtain improved inferences by allowing the weights to vary according to the level of support which each particle receives from the data and the prior, as in the general ABC-PRC algorithm presented by Sisson et al. (2006).

Acknowledgements

Funding for this work was provided by the Scottish Executive Environment and Rural Affairs Department. Ken McKinnon (Edinburgh University) provided helpful comments on the algorithm in Section 2.1. The data were collected as part of the UK Seabird Monitoring Programme, and were kindly provided to us by the Centre for Ecology and Hydrology.

References

Aitchison, J. (1986). The Statistical Analysis of Compositional Data. London: Chapman and Hall.

Aitchison, J. and J. W. Kay (2004). Possible solutions of some essential zero problems in compositional data analysis. In Compositional Data Analysis Workshop, October 2004, Girona, Spain.

Allcroft, D. J. and C. A. Glasbey (2003a). Analysis of crop lodging using a latent variable model. Journal of Agricultural Science 140.

Allcroft, D. J. and C. A. Glasbey (2003b). A latent Gaussian Markov random field model for spatio-temporal rainfall disaggregation. Applied Statistics 52.

Allcroft, D. J., C. A. Glasbey, and M. J. Paulo (2006). A latent Gaussian model for multivariate consumption data. To appear in Food Quality and Preference.

Beaumont, M. A., W. Zhang, and D. J. Balding (2002). Approximate Bayesian computation in population genetics. Genetics 162.

Bull, K., S. Wanless, D. A. Elston, F. Daunt, S. Lewis, and M. P. Harris (2004). Local-scale variability in the diet of Black-legged Kittiwakes Rissa tridactyla. Ardea 92(1).

Durban, M. and C. A. Glasbey (2001). Weather modelling using a multivariate latent Gaussian model. Agricultural and Forest Meteorology 109.

Fletcher, R. (1981). Practical Methods of Optimization. Chichester: Wiley and Sons.

Fry, J., T. R. L. Fry, and K. R. McLaren (2000). Compositional data analysis and zeros in micro data. Applied Economics 32(8).

Fu, Y. X. and W. H. Li (1997). Estimating the age of the common ancestor of a sample of DNA sequences. Molecular Biology and Evolution 14.

Leman, S. C., Y. Chen, J. E. Stajich, M. A. F. Noor, and M. K. Uyenoyama (2005). Likelihoods from summary statistics: recent divergence between species. Genetics 171(3).

Marjoram, P., J. Molitor, V. Plagnol, and S. Tavaré (2003). Markov chain Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences 100(26).

Martín-Fernández, J. A., C. B. Barceló-Vidal, and V. Pawlowsky-Glahn (2000). Zero replacement in compositional data. In H. A. L. Kiers, J. P. Rasson, P. J. F. Groenen, and M. Schader (Eds.), Advances in Data Science and Classification: Proceedings of the 7th Conference of the International Federation of Classification Societies, University of Namur (Belgium). Berlin: Springer-Verlag.

Martín-Fernández, J. A., C. B. Barceló-Vidal, and V. Pawlowsky-Glahn (2003). Dealing with zeros and missing values in compositional data sets using nonparametric imputation. Mathematical Geology 35(3).

Pearson, K. (1897). Mathematical contributions to the theory of evolution: on a form of spurious correlation which may arise when indices are used in the measurement of organs. Proceedings of the Royal Society 60.

Plagnol, V. and S. Tavaré (2004). Approximate Bayesian computation and MCMC. In H. Niederreiter (Ed.), Monte Carlo and Quasi-Monte Carlo Methods 2002. Springer-Verlag.

Pritchard, J. K., M. T. Seielstad, A. Perez-Lezaun, and M. W. Feldman (1999). Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Molecular Biology and Evolution 16.

Robert, C. P. (2004). Monte Carlo Statistical Methods. Springer Texts in Statistics. New York: Springer-Verlag.

Sisson, S. A., Y. Fan, and M. M. Tanaka (2006). Sequential Monte Carlo without likelihoods. Submitted.

Thornton, K. and P. Andolfatto (2006). Approximate Bayesian inference reveals evidence for a recent, severe bottleneck in a Netherlands population of Drosophila melanogaster. Genetics 172(3).

Weiss, G. and A. von Haeseler (1998). Inference of population history using a likelihood approach. Genetics 149.


More information

Label Switching and Its Simple Solutions for Frequentist Mixture Models

Label Switching and Its Simple Solutions for Frequentist Mixture Models Label Switching and Its Simple Solutions for Frequentist Mixture Models Weixin Yao Department of Statistics, Kansas State University, Manhattan, Kansas 66506, U.S.A. wxyao@ksu.edu Abstract The label switching

More information

A Note on Bayesian Inference After Multiple Imputation

A Note on Bayesian Inference After Multiple Imputation A Note on Bayesian Inference After Multiple Imputation Xiang Zhou and Jerome P. Reiter Abstract This article is aimed at practitioners who plan to use Bayesian inference on multiplyimputed datasets in

More information

An introduction to Approximate Bayesian Computation methods

An introduction to Approximate Bayesian Computation methods An introduction to Approximate Bayesian Computation methods M.E. Castellanos maria.castellanos@urjc.es (from several works with S. Cabras, E. Ruli and O. Ratmann) Valencia, January 28, 2015 Valencia Bayesian

More information

ASA Section on Survey Research Methods

ASA Section on Survey Research Methods REGRESSION-BASED STATISTICAL MATCHING: RECENT DEVELOPMENTS Chris Moriarity, Fritz Scheuren Chris Moriarity, U.S. Government Accountability Office, 411 G Street NW, Washington, DC 20548 KEY WORDS: data

More information

Bayesian Hierarchical Models

Bayesian Hierarchical Models Bayesian Hierarchical Models Gavin Shaddick, Millie Green, Matthew Thomas University of Bath 6 th - 9 th December 2016 1/ 34 APPLICATIONS OF BAYESIAN HIERARCHICAL MODELS 2/ 34 OUTLINE Spatial epidemiology

More information

Extreme Value Analysis and Spatial Extremes

Extreme Value Analysis and Spatial Extremes Extreme Value Analysis and Department of Statistics Purdue University 11/07/2013 Outline Motivation 1 Motivation 2 Extreme Value Theorem and 3 Bayesian Hierarchical Models Copula Models Max-stable Models

More information

A short introduction to INLA and R-INLA

A short introduction to INLA and R-INLA A short introduction to INLA and R-INLA Integrated Nested Laplace Approximation Thomas Opitz, BioSP, INRA Avignon Workshop: Theory and practice of INLA and SPDE November 7, 2018 2/21 Plan for this talk

More information

Parameter Estimation. William H. Jefferys University of Texas at Austin Parameter Estimation 7/26/05 1

Parameter Estimation. William H. Jefferys University of Texas at Austin Parameter Estimation 7/26/05 1 Parameter Estimation William H. Jefferys University of Texas at Austin bill@bayesrules.net Parameter Estimation 7/26/05 1 Elements of Inference Inference problems contain two indispensable elements: Data

More information

Approximate Bayesian Computation for the Stellar Initial Mass Function

Approximate Bayesian Computation for the Stellar Initial Mass Function Approximate Bayesian Computation for the Stellar Initial Mass Function Jessi Cisewski Department of Statistics Yale University SCMA6 Collaborators: Grant Weller (Savvysherpa), Chad Schafer (Carnegie Mellon),

More information

Markov Chain Monte Carlo

Markov Chain Monte Carlo Markov Chain Monte Carlo Recall: To compute the expectation E ( h(y ) ) we use the approximation E(h(Y )) 1 n n h(y ) t=1 with Y (1),..., Y (n) h(y). Thus our aim is to sample Y (1),..., Y (n) from f(y).

More information

Modelling trends in the ocean wave climate for dimensioning of ships

Modelling trends in the ocean wave climate for dimensioning of ships Modelling trends in the ocean wave climate for dimensioning of ships STK1100 lecture, University of Oslo Erik Vanem Motivation and background 2 Ocean waves and maritime safety Ships and other marine structures

More information

Hmms with variable dimension structures and extensions

Hmms with variable dimension structures and extensions Hmm days/enst/january 21, 2002 1 Hmms with variable dimension structures and extensions Christian P. Robert Université Paris Dauphine www.ceremade.dauphine.fr/ xian Hmm days/enst/january 21, 2002 2 1 Estimating

More information

Discussion of Maximization by Parts in Likelihood Inference

Discussion of Maximization by Parts in Likelihood Inference Discussion of Maximization by Parts in Likelihood Inference David Ruppert School of Operations Research & Industrial Engineering, 225 Rhodes Hall, Cornell University, Ithaca, NY 4853 email: dr24@cornell.edu

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

Mobile Robot Localization

Mobile Robot Localization Mobile Robot Localization 1 The Problem of Robot Localization Given a map of the environment, how can a robot determine its pose (planar coordinates + orientation)? Two sources of uncertainty: - observations

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

Metropolis-Hastings Algorithm

Metropolis-Hastings Algorithm Strength of the Gibbs sampler Metropolis-Hastings Algorithm Easy algorithm to think about. Exploits the factorization properties of the joint probability distribution. No difficult choices to be made to

More information

Pattern Recognition and Machine Learning. Bishop Chapter 11: Sampling Methods

Pattern Recognition and Machine Learning. Bishop Chapter 11: Sampling Methods Pattern Recognition and Machine Learning Chapter 11: Sampling Methods Elise Arnaud Jakob Verbeek May 22, 2008 Outline of the chapter 11.1 Basic Sampling Algorithms 11.2 Markov Chain Monte Carlo 11.3 Gibbs

More information

On the Optimal Scaling of the Modified Metropolis-Hastings algorithm

On the Optimal Scaling of the Modified Metropolis-Hastings algorithm On the Optimal Scaling of the Modified Metropolis-Hastings algorithm K. M. Zuev & J. L. Beck Division of Engineering and Applied Science California Institute of Technology, MC 4-44, Pasadena, CA 925, USA

More information

Testing Simple Hypotheses R.L. Wolpert Institute of Statistics and Decision Sciences Duke University, Box Durham, NC 27708, USA

Testing Simple Hypotheses R.L. Wolpert Institute of Statistics and Decision Sciences Duke University, Box Durham, NC 27708, USA Testing Simple Hypotheses R.L. Wolpert Institute of Statistics and Decision Sciences Duke University, Box 90251 Durham, NC 27708, USA Summary: Pre-experimental Frequentist error probabilities do not summarize

More information

Spatial point processes in the modern world an

Spatial point processes in the modern world an Spatial point processes in the modern world an interdisciplinary dialogue Janine Illian University of St Andrews, UK and NTNU Trondheim, Norway Bristol, October 2015 context statistical software past to

More information

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference 1 The views expressed in this paper are those of the authors and do not necessarily reflect the views of the Federal Reserve Board of Governors or the Federal Reserve System. Bayesian Estimation of DSGE

More information

Forecasting Wind Ramps

Forecasting Wind Ramps Forecasting Wind Ramps Erin Summers and Anand Subramanian Jan 5, 20 Introduction The recent increase in the number of wind power producers has necessitated changes in the methods power system operators

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee and Andrew O. Finley 2 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. 2 Department of Forestry & Department

More information

Computational statistics

Computational statistics Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated

More information

Reduction of Random Variables in Structural Reliability Analysis

Reduction of Random Variables in Structural Reliability Analysis Reduction of Random Variables in Structural Reliability Analysis S. Adhikari and R. S. Langley Department of Engineering University of Cambridge Trumpington Street Cambridge CB2 1PZ (U.K.) February 21,

More information

Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information #

Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information # Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Details of PRF Methodology In the Poisson Random Field PRF) model, it is assumed that non-synonymous mutations at a given gene are either

More information

Bayesian Inference for DSGE Models. Lawrence J. Christiano

Bayesian Inference for DSGE Models. Lawrence J. Christiano Bayesian Inference for DSGE Models Lawrence J. Christiano Outline State space-observer form. convenient for model estimation and many other things. Bayesian inference Bayes rule. Monte Carlo integation.

More information

Gaussian Process Approximations of Stochastic Differential Equations

Gaussian Process Approximations of Stochastic Differential Equations Gaussian Process Approximations of Stochastic Differential Equations Cédric Archambeau Centre for Computational Statistics and Machine Learning University College London c.archambeau@cs.ucl.ac.uk CSML

More information

Monte Carlo Integration using Importance Sampling and Gibbs Sampling

Monte Carlo Integration using Importance Sampling and Gibbs Sampling Monte Carlo Integration using Importance Sampling and Gibbs Sampling Wolfgang Hörmann and Josef Leydold Department of Statistics University of Economics and Business Administration Vienna Austria hormannw@boun.edu.tr

More information

Bayesian inference J. Daunizeau

Bayesian inference J. Daunizeau Bayesian inference J. Daunizeau Brain and Spine Institute, Paris, France Wellcome Trust Centre for Neuroimaging, London, UK Overview of the talk 1 Probabilistic modelling and representation of uncertainty

More information

Monte Carlo in Bayesian Statistics

Monte Carlo in Bayesian Statistics Monte Carlo in Bayesian Statistics Matthew Thomas SAMBa - University of Bath m.l.thomas@bath.ac.uk December 4, 2014 Matthew Thomas (SAMBa) Monte Carlo in Bayesian Statistics December 4, 2014 1 / 16 Overview

More information

Bayesian model selection for computer model validation via mixture model estimation

Bayesian model selection for computer model validation via mixture model estimation Bayesian model selection for computer model validation via mixture model estimation Kaniav Kamary ATER, CNAM Joint work with É. Parent, P. Barbillon, M. Keller and N. Bousquet Outline Computer model validation

More information

Bayesian Quadrature: Model-based Approximate Integration. David Duvenaud University of Cambridge

Bayesian Quadrature: Model-based Approximate Integration. David Duvenaud University of Cambridge Bayesian Quadrature: Model-based Approimate Integration David Duvenaud University of Cambridge The Quadrature Problem ˆ We want to estimate an integral Z = f ()p()d ˆ Most computational problems in inference

More information

MCMC Sampling for Bayesian Inference using L1-type Priors

MCMC Sampling for Bayesian Inference using L1-type Priors MÜNSTER MCMC Sampling for Bayesian Inference using L1-type Priors (what I do whenever the ill-posedness of EEG/MEG is just not frustrating enough!) AG Imaging Seminar Felix Lucka 26.06.2012 , MÜNSTER Sampling

More information

ABC random forest for parameter estimation. Jean-Michel Marin

ABC random forest for parameter estimation. Jean-Michel Marin ABC random forest for parameter estimation Jean-Michel Marin Université de Montpellier Institut Montpelliérain Alexander Grothendieck (IMAG) Institut de Biologie Computationnelle (IBC) Labex Numev! joint

More information

MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS. Maya Gupta, Luca Cazzanti, and Santosh Srivastava

MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS. Maya Gupta, Luca Cazzanti, and Santosh Srivastava MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS Maya Gupta, Luca Cazzanti, and Santosh Srivastava University of Washington Dept. of Electrical Engineering Seattle,

More information

Bayesian Inference for DSGE Models. Lawrence J. Christiano

Bayesian Inference for DSGE Models. Lawrence J. Christiano Bayesian Inference for DSGE Models Lawrence J. Christiano Outline State space-observer form. convenient for model estimation and many other things. Preliminaries. Probabilities. Maximum Likelihood. Bayesian

More information

Estimation of Operational Risk Capital Charge under Parameter Uncertainty

Estimation of Operational Risk Capital Charge under Parameter Uncertainty Estimation of Operational Risk Capital Charge under Parameter Uncertainty Pavel V. Shevchenko Principal Research Scientist, CSIRO Mathematical and Information Sciences, Sydney, Locked Bag 17, North Ryde,

More information

Approximate Bayesian computation for the parameters of PRISM programs

Approximate Bayesian computation for the parameters of PRISM programs Approximate Bayesian computation for the parameters of PRISM programs James Cussens Department of Computer Science & York Centre for Complex Systems Analysis University of York Heslington, York, YO10 5DD,

More information

On Ranked Set Sampling for Multiple Characteristics. M.S. Ridout

On Ranked Set Sampling for Multiple Characteristics. M.S. Ridout On Ranked Set Sampling for Multiple Characteristics M.S. Ridout Institute of Mathematics and Statistics University of Kent, Canterbury Kent CT2 7NF U.K. Abstract We consider the selection of samples in

More information

Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process

Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process Applied Mathematical Sciences, Vol. 4, 2010, no. 62, 3083-3093 Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process Julia Bondarenko Helmut-Schmidt University Hamburg University

More information