On the informativeness of dominant and co-dominant genetic markers for Bayesian supervised clustering



Gilles Guillot and Alexandra Carpentier-Skandalis

Department of Informatics and Mathematical Modelling, Technical University of Denmark, 2800, Lyngby, Copenhagen, Denmark
Centre for Ecological and Evolutionary Synthesis, Department of Biology, University of Oslo, P.O. Box 1066 Blindern, 0316 Oslo, Norway

September 24, 2010

Abstract

We study the accuracy of a Bayesian supervised method used to cluster individuals into genetically homogeneous groups on the basis of dominant or codominant molecular markers. We provide a formula relating an error criterion, the number of loci used and the number of clusters. This formula is exact and holds for an arbitrary number of clusters and markers. Our work suggests that dominant marker studies can achieve an accuracy similar to that of codominant marker studies if the number of markers used in the former is about 1.7 times larger than in the latter.

1 Background

A common problem in population genetics consists in assigning an individual to one of K populations on the basis of its genotype and information about the distribution of the various alleles in the K populations. This question has received considerable attention in the population genetics and molecular ecology literature [1, 2, 3, 4] as it can provide important insight about gene flow patterns and migration rates. It is for example widely used in epidemiology to detect the origin of pathogens or of their hosts (see e.g. [5, 6, 7] for examples) or in conservation biology and population management to detect illegal translocation or poaching [8]. See [9] for a review of related methods. In a statistical phrasing, assigning an individual to some known clusters is a supervised clustering problem. This requires observing the genotype of the individual to be assigned and those of some individuals in the various clusters. For diploid organisms (i.e. organisms harbouring

two copies of each chromosome), certain lab techniques allow one to retrieve the exact genotype of each individual. In contrast, for some markers it is only possible to say whether a certain allele A (referred to hereafter as the dominant allele) is present or not at a locus. In this case, one cannot distinguish the heterozygous genotype Aa from the homozygous genotype AA for the dominant allele. The former type of markers are said to be codominant while the latter are said to be dominant. It is clear that the second genotyping method incurs a loss of information. The consequence of this loss of information has been studied from an empirical point of view [10] but it has never been studied on a theoretical basis. The choice to use one type of markers for empirical studies is therefore often motivated mostly by practical considerations rather than by an objective rationale [11, 12]. The objective of the present article is to compare the accuracy achieved with dominant and codominant markers when they are used to perform supervised clustering and to derive some recommendations about the number of markers required to achieve a certain accuracy. Dominant markers are essentially bi-allelic in the sense that they record the presence or the absence of a certain allele. We are not concerned here with the relation between informativeness and the level of polymorphism (cf. [13, 14] for references on this aspect). We therefore focus on bi-allelic dominant and co-dominant markers. Hence our study is representative of Amplified Fragment Length Polymorphism (AFLP) and Single Nucleotide Polymorphism (SNP) markers, which are some of the most employed markers in genetics.

2 Informativeness of dominant and co-dominant markers

2.1 Cluster model

We will consider here the case of diploid organisms at $L$ bi-allelic loci. We denote by $z = (z_l)_{l=1,\dots,L}$ the genotype of an individual. We denote by $f_{kl}$ the frequency of allele A in cluster $k$ at locus $l$. We assume that each cluster is at Hardy-Weinberg equilibrium (HWE) at each locus.
HWE is defined as the conditions under which the allele carried at a locus on one chromosome is independent of the allele carried at the same locus on the homologous chromosome. This situation is observed at neutral loci when individuals mate at random in a cluster. Denoting by $z_l$ the number of copies of allele A carried by an individual, for co-dominant markers

this can be expressed as

$$p(z_l = 2 \mid f_{kl}) = f_{kl}^2 \tag{1}$$
$$p(z_l = 1 \mid f_{kl}) = 2 f_{kl}(1 - f_{kl}) \tag{2}$$
$$p(z_l = 0 \mid f_{kl}) = (1 - f_{kl})^2 \tag{3}$$

For dominant markers, $z_l$ is equal to 0 or 1 depending on whether a copy of allele A is present in the genotype of the individual. Under HWE we have:

$$p(z_l = 1 \mid f_{kl}) = f_{kl}^2 + 2 f_{kl}(1 - f_{kl}) \tag{4}$$
$$p(z_l = 0 \mid f_{kl}) = (1 - f_{kl})^2 \tag{5}$$

In addition to HWE, we also assume that the various loci are at linkage equilibrium (henceforth HWLE), i.e. that the probability of a multilocus genotype is equal to the product of probabilities of single-locus genotypes: $p(z_1, \dots, z_L) = \prod_l p(z_l)$. We assume that the individual to be classified has origin in one of the K clusters (no admixture).

2.2 Sampling model

We will measure the accuracy of a classifying rule for a given type of markers by the probability of correctly assigning an individual of unknown origin. We are interested in deriving results that do not depend (i) on the particular origin $c$ of the individual to be classified, (ii) on the genotype $z$ of this individual and (iii) on the allele frequencies $f$ in the various clusters. We will therefore derive results that are conditional on $c$, $z$ and $f$ and then compute Bayesian averages under suitable prior distributions. The mechanism assumed in the sequel is as follows:

1. The individual has origin in one of the K clusters. This origin is unknown and all origins are equally likely. We therefore assume a uniform prior for $c$ on $\{1, \dots, K\}$.
2. In each cluster, for each locus, the allele frequencies follow a Dirichlet(1,1) distribution, with independence across clusters and loci.
3. Conditionally on $c$ and $f$, the probability of the genotype of the individual is given by

equations (1-3) or (4-5), i.e. we assume that the individual has been sampled at random among all individuals in its cluster of origin.

2.3 Accuracy of assignments under a maximum likelihood principle

We consider an individual of unknown origin $c$ with known genotype $z$ and potential origin in K clusters with known allele frequencies $f$. Following a maximum likelihood principle, it is natural to estimate $c$ as the cluster label for which the probability of observing this particular genotype is maximal. Formally: $c^* = \operatorname{Argmax}_k\, p(z \mid c = k, f)$. This assignment rule is deterministic, but whether the individual is correctly assigned will depend on its genotype and on the cluster allele frequencies. Randomising these quantities and averaging over all possible values, we can derive a generic formula for the probability of correct assignment $p^{MLA}$ as

$$p^{MLA} = \int_{\varphi} \sum_{\zeta} \max_k\, p(c = k, z = \zeta \mid f = \varphi)\, dp(\varphi) \tag{6}$$

See section A in the appendix for details. This formula is of little practical use, and deriving a more explicit expression for arbitrary values of K and L seems to be out of reach. However, for K = 2 and L = 1, under the assumptions that the individual has a priori equally likely ancestry in each cluster and that each $f_{kl}$ has a Dirichlet distribution with parameter (1, ..., 1) (flat), we get

$$p_c^{MLA}(K = 2, L = 1) = 17/24 \quad \text{for codominant markers} \tag{7}$$

and

$$p_d^{MLA}(K = 2, L = 1) = 16/24 \quad \text{for dominant markers.} \tag{8}$$

Because of the lack of practical usefulness of eq. (6), we now define an alternative rule for assignment that is similar in spirit to maximum likelihood but also leads to more tractable equations.
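The two values in equations (7) and (8) can be checked numerically. The following is a minimal Monte Carlo sketch of the sampling model of section 2.2 for one locus, assuming a uniform prior on the origin and flat (uniform) allele frequencies; the function names `genotype_probs` and `simulate_ml_accuracy` are illustrative and not part of the original study.

```python
import random


def genotype_probs(f, dominant):
    """Single-locus genotype probabilities under HWE for allele frequency f."""
    if dominant:
        # z = 0: allele A absent, z = 1: allele A present (eqs. 4-5)
        return [(1 - f) ** 2, f ** 2 + 2 * f * (1 - f)]
    # z = number of copies of allele A (eqs. 1-3)
    return [(1 - f) ** 2, 2 * f * (1 - f), f ** 2]


def simulate_ml_accuracy(n_rep, dominant, k=2, seed=0):
    """Estimate the probability of correct maximum likelihood assignment (L = 1)."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_rep):
        # Dirichlet(1,1) on a bi-allelic locus is Uniform(0,1) for the frequency of A
        freqs = [rng.random() for _ in range(k)]
        probs = [genotype_probs(f, dominant) for f in freqs]
        c = rng.randrange(k)  # true origin, uniform prior
        z = rng.choices(range(len(probs[c])), weights=probs[c])[0]
        # maximum likelihood assignment (ties occur with probability zero)
        c_star = max(range(k), key=lambda i: probs[i][z])
        correct += (c_star == c)
    return correct / n_rep


if __name__ == "__main__":
    print("codominant:", simulate_ml_accuracy(200_000, dominant=False), "vs 17/24 =", 17 / 24)
    print("dominant:  ", simulate_ml_accuracy(200_000, dominant=True), "vs 16/24 =", 16 / 24)
```

With a few hundred thousand replicates the estimates should fall close to 17/24 and 16/24, up to Monte Carlo error.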

2.4 Accuracy of assignments under a stochastic rule

Considering the collection of likelihood values $p(z \mid c = k, f)$ for $k = 1, \dots, K$, following [15], we define a stochastic assignment (SA) rule by assigning the individual to a group at random with probabilities proportional to $p(z \mid c = k, f)$. In words, an individual with genotype $z$ is randomly assigned to cluster $k$ with a probability proportional to the probability of observing this genotype in cluster $k$. The rationale behind this rule is that high values of $p(z \mid c = k, f)$ indicate strong evidence of ancestry in group $k$ but do not guarantee against mis-assignments. To derive the probability of correct assignment, we first consider that the allele frequencies are known, and then account for the uncertainty about these frequencies by Bayesian integration. The use of a Bayesian framework is motivated by the fact that (i) there is genuine uncertainty on allele frequencies which cannot be overlooked, and (ii) under some fairly mild assumptions, allele frequencies are known to be Dirichlet distributed (possibly with a degree of approximation, see e.g. [16, 17]). Refer to [18] for further discussion of the Bayesian paradigm in population genetics. We now give our main results regarding this clustering rule. For bi-allelic loci and denoting by $p_c^{SA}$ the probability of correct assignment using codominant markers we have:

$$p_c^{SA}(K, L) = \frac{1}{1 + (K - 1)(5/8)^L} \tag{9}$$

For bi-allelic loci and denoting by $p_d^{SA}$ the probability of correct assignment using dominant markers we have:

$$p_d^{SA}(K, L) = \frac{1}{1 + (K - 1)(25/33)^L} \tag{10}$$

3 Implications

Our investigations considered bi-allelic loci and are therefore representative of AFLP and SNP markers, which are some of the most employed markers in genetics.
In this context, for supervised clustering, our main conclusions are that (i) codominant markers are more accurate than dominant markers, (ii) the difference in accuracy decreases toward 0 as the number of markers $L$ increases, (iii) $L_d$ dominant markers can achieve an accuracy even higher than that of $L_c$ codominant markers as long as the numbers of loci used satisfy $L_d \geq \lambda L_c$ where $\lambda = \ln(5/8)/\ln(25/33) \approx 1.7$.
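To make conclusions (i) to (iii) concrete, the closed forms for the stochastic assignment rule, $1/(1 + (K-1)(5/8)^L)$ for codominant markers and $1/(1 + (K-1)(25/33)^L)$ for dominant markers, can be evaluated directly. The sketch below does so and computes the ratio $\lambda$; the function names and the helper `dominant_loci_needed` are illustrative assumptions, not part of the original study.

```python
import math


def p_sa_codominant(k, n_loci):
    """Probability of correct stochastic assignment with codominant markers."""
    return 1.0 / (1.0 + (k - 1) * (5.0 / 8.0) ** n_loci)


def p_sa_dominant(k, n_loci):
    """Probability of correct stochastic assignment with dominant markers."""
    return 1.0 / (1.0 + (k - 1) * (25.0 / 33.0) ** n_loci)


# Ratio of numbers of loci giving equal accuracy: (25/33)^(LAMBDA * L) = (5/8)^L
LAMBDA = math.log(5 / 8) / math.log(25 / 33)


def dominant_loci_needed(n_codominant):
    """Smallest integer number of dominant loci matching n_codominant codominant loci."""
    return math.ceil(LAMBDA * n_codominant)


if __name__ == "__main__":
    print("lambda =", round(LAMBDA, 4))
    for n_c in (5, 10, 20):
        n_d = dominant_loci_needed(n_c)
        print(n_c, "codominant loci:", p_sa_codominant(3, n_c),
              "|", n_d, "dominant loci:", p_sa_dominant(3, n_d))
```

Because $\lambda$ does not depend on K, the same ratio applies whatever the number of clusters.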

The figures reported have to be taken with a grain of salt as they may depend on some specific aspects of the models considered. For example, the model considered here assumes independence of allele frequencies across clusters. This assumption is relevant in the case of populations displaying low migration rates and a low amount of shared ancestry. When one of these assumptions is violated, an alternative parametric model based on the Dirichlet distribution that accounts for correlation of allele frequencies across populations is often used (see [16] and references therein). It is expected that the accuracy obtained with both markers would be lower under this model. Besides, the present study does not account for ascertainment bias [19, 20, 21, 22], an aspect that might affect the results but is notoriously difficult to deal with. However, it is important to note that the conditions considered in the present study were the same for dominant and codominant markers, so that results should not be biased toward one type of marker. Our global result about the relative informativeness of dominant and co-dominant markers contrasts with the common belief that dominant markers are an expedient one would resort to when co-dominant markers are not available (see [12] for discussions). A comparison of dominant and codominant markers for unsupervised clustering has been carried out [23]. This study, based on simulations, suggests that the loss of accuracy incurred by dominant markers in unsupervised clustering is much larger than for supervised clustering. This is presumably explained by the fact that in the case of HWLE clusters, supervised clustering seeks to optimise a criterion based on allele frequencies only. This contrasts with unsupervised clustering, which seeks to optimise a criterion based on allele frequencies and HWLE. A theoretical analysis of unsupervised clustering algorithms similar to the present study would be valuable, but we anticipate that it would present more difficulties.

Acknowledgement

This work has been supported by Agence Nationale de la Recherche grant ANR-09-BLAN

A Supervised clustering with a maximum likelihood principle

We consider the setting where the unknown ancestry $c$ of an individual with genotype $z$ is estimated by $c^* = \operatorname{Argmax}_c\, p(z \mid c, f)$. As this estimator is a deterministic function of $z$ we denote it by $c^*_z$ for clarity in the sequel. Consider for now that the allele frequencies are known to be equal to some $\varphi$. Under this setting, randomness comes from the sampling of $c$ and then from the sampling of $z \mid (c, f)$. We are concerned with the event $E$ defined as $E$ = {the individual is correctly assigned}. Applying the total probability formula, we can write

$$p(E \mid f = \varphi) = \sum_k \sum_{\zeta} p(E, c = k, z = \zeta \mid f = \varphi) \tag{11}$$

In the sum over $k$, only one term is not equal to 0, namely the term for $k = c^*_\zeta$, hence

$$p(E \mid f = \varphi) = \sum_{\zeta} p(E, c = c^*_\zeta, z = \zeta \mid f = \varphi) \tag{12}$$
$$= \sum_{\zeta} p(c = c^*_\zeta, z = \zeta \mid f = \varphi) \tag{13}$$
$$= \sum_{\zeta} p(c = c^*_\zeta \mid f = \varphi)\, p(z = \zeta \mid c = c^*_\zeta, f = \varphi) \tag{14}$$
$$= \sum_{\zeta} p(c = c^*_\zeta)\, p(z = \zeta \mid c = c^*_\zeta, f = \varphi) \tag{15}$$

Assuming that the individual has a priori equally likely ancestry in each cluster, i.e. assuming a uniform distribution for the class variable $c$, we get

$$p(E \mid f = \varphi) = K^{-1} \sum_{\zeta} p(z = \zeta \mid c = c^*_\zeta, f = \varphi) \tag{16}$$
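For known allele frequencies and a small number of loci, equation (16) can be evaluated by brute force, enumerating every multilocus genotype $\zeta$ and keeping the largest likelihood across clusters (which is the likelihood in the cluster $c^*_\zeta$ selected by the maximum likelihood rule). The sketch below assumes HWLE and a uniform prior on the origin; `locus_probs` and `ml_accuracy_given_freqs` are illustrative names introduced here.

```python
from itertools import product


def locus_probs(f, dominant=False):
    """Single-locus genotype probabilities under HWE for allele frequency f."""
    if dominant:
        return [(1 - f) ** 2, 1 - (1 - f) ** 2]
    return [(1 - f) ** 2, 2 * f * (1 - f), f ** 2]


def ml_accuracy_given_freqs(freqs, dominant=False):
    """Evaluate eq. (16): p(E | f) = (1/K) * sum over genotypes of the max likelihood.

    freqs[k][l] is the frequency of allele A in cluster k at locus l.
    The enumeration has 3**L (codominant) or 2**L (dominant) terms.
    """
    k = len(freqs)
    n_loci = len(freqs[0])
    tables = [[locus_probs(f, dominant) for f in row] for row in freqs]
    n_states = 2 if dominant else 3
    total = 0.0
    for zeta in product(range(n_states), repeat=n_loci):
        likelihoods = []
        for c in range(k):
            p = 1.0
            for l, z_l in enumerate(zeta):
                p *= tables[c][l][z_l]  # product over loci (linkage equilibrium)
            likelihoods.append(p)
        total += max(likelihoods)
    return total / k


if __name__ == "__main__":
    freqs = [[0.2, 0.6, 0.9], [0.8, 0.5, 0.3]]
    print(ml_accuracy_given_freqs(freqs, dominant=False))
    print(ml_accuracy_given_freqs(freqs, dominant=True))
```

Averaging this quantity over frequencies drawn from the prior gives a numerical approximation of the integrated accuracy, at a cost that grows exponentially with L.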

By definition, $c^*_z$ satisfies $p(z \mid c^*_z, f) = \max_k p(z \mid c = k, f)$, hence

$$p(E \mid f = \varphi) = K^{-1} \sum_{\zeta} \max_k\, p(z = \zeta \mid c = k, f = \varphi) \tag{17}$$
$$= \sum_{\zeta} \max_k\, p(c = k)\, p(z = \zeta \mid c = k, f = \varphi) \tag{18}$$
$$= \sum_{\zeta} \max_k\, p(c = k, z = \zeta \mid f = \varphi) \tag{19}$$

We seek an expression of the probability of correct assignment that does not depend on particular values of allele frequencies. This can be obtained by integrating over allele frequencies, namely

$$p(E) = \int_{\varphi} p(E \mid f = \varphi)\, dp(\varphi) \tag{20}$$
$$= \int_{\varphi} \sum_{\zeta} \max_k\, p(c = k, z = \zeta \mid f = \varphi)\, dp(\varphi) \tag{21}$$

Note that identity (21) holds for any number of clusters K, any number of loci L and any type of markers (dominant vs. codominant).

We now consider a two-cluster problem in the case where the genotype of an individual has been recorded at a single bi-allelic locus. We denote by $f_1$ (resp. $f_2$) the frequency of allele A in cluster 1 (resp. cluster 2).

A.1 Codominant markers:

There are only three genotypes: AA, Aa and aa. Denoting by $f_k$ the frequency of allele A in cluster $k$ and conditionally on $f_k$, these three genotypes occur in cluster $k$ with probabilities $f_k^2$, $2 f_k(1 - f_k)$ and $(1 - f_k)^2$, and equation (21) can be simplified as

$$p(E) = \int_{\varphi} p(c) \left[ \max_k f_k^2 + 2 \max_k f_k(1 - f_k) + \max_k (1 - f_k)^2 \right] dp(\varphi) \tag{22}$$

We need to derive the distribution of $\max_k f_k^2$ and of $\max_k f_k(1 - f_k)$. Assuming a flat Dirichlet distribution for $f_k$, elementary computations give:

$$p(\max_k f_k^2 < x) = x \tag{23}$$

i.e. $\max_k f_k^2$ follows a uniform distribution on [0, 1], so that

$$E(\max_k f_k^2) = 1/2 \tag{24}$$

Besides, we also get

$$p(\max_k f_k(1 - f_k) < x) = \left(1 - \sqrt{1 - 4x}\right)^2 \tag{25}$$

and differentiating

$$\frac{dp}{dx}(\max_k f_k(1 - f_k) < x) = \frac{4\left(1 - \sqrt{1 - 4x}\right)}{\sqrt{1 - 4x}} \tag{26}$$

Integrating by parts, we get

$$E(\max_k f_k(1 - f_k)) = \int_0^{1/4} 4x\, \frac{1 - \sqrt{1 - 4x}}{\sqrt{1 - 4x}}\, dx = 5/24 \tag{27}$$

Eventually

$$p(E) = 17/24 \tag{28}$$

which proves equation (7).

A.2 Dominant markers:

For a single locus, there are two genotypes A and a. Conditionally on $f_k$, these two genotypes are observed in cluster $k$ with probabilities $1 - f_k^2$ and $f_k^2$. Equation (21) can be simplified here as

$$p(E) = \int_{\varphi} p(c) \left[ \max_k f_k^2 + \max_k (1 - f_k^2) \right] dp(\varphi) \tag{29}$$

We now need the density of $\max_k (1 - f_k^2)$:

$$p(\max_k (1 - f_k^2) < x) = \left(1 - \sqrt{1 - x}\right)^2 \tag{30}$$

and

$$\frac{dp}{dx}(\max_k (1 - f_k^2) < x) = \frac{1}{\sqrt{1 - x}} - 1 \tag{31}$$

Eventually we get

$$E(\max_k (1 - f_k^2)) = \int_0^1 x \left( \frac{1}{\sqrt{1 - x}} - 1 \right) dx = 5/6 \tag{32}$$

$$p(E) = 16/24 \tag{33}$$

which proves equation (8).

B Stochastic assignment rule

The maximum likelihood assignment rule considered above is not tractable for arbitrary values of K and L (cf. eq. (21)). In particular, a difficulty arises from the maximisation involved. We consider here an assignment rule that does not involve maximisation. The unknown ancestry $c$ of an individual with genotype $z$ is predicted by a random variable $c^*$ with values in $\{1, \dots, K\}$ and such that $p(c^* = k \mid z, f) \propto p(z \mid c = k, f)$. As in the previous sections, we first consider that the allele frequencies are known; however, we skip this dependence in the notation at the beginning for clarity. We will account for the uncertainty about these frequencies later by Bayesian integration. In this setting, the structure of conditional probability dependence can be represented by a directed acyclic graph as on the left-hand side of figure 1. We are concerned with evaluating the probability of the event $E$ defined as $E$ = {the individual is correctly assigned}, i.e. $E = \{c = c^*\}$. We denote by $p_a$ (resp. $p_b$) probabilities under the two conditional dependence

structures of figure 1.

[Figure 1: Directed acyclic graph for our stochastic assignment rule (panel (a), left) and for an alternative scheme (panel (b), right). All downward arrows represent the same conditional dependence given by our likelihood model. The upward arrow represents the reverse probability dependence.]

Some elementary computations show that $p(E)$ can be expressed in terms of a probability in the model of the right-hand side of the DAG in figure 1, namely:

$$p_a(c = c^*) = p_b(c = c^* \mid z = z^*) \tag{34}$$

The right-hand side of this expression can be written as

$$p_b(c = c^* \mid z = z^*) = p_b(c = c^*, z = z^*) / p_b(z = z^*) \tag{35}$$

It is more convenient to manipulate this expression than $p_b(c = c^*)$. We will use it to evaluate $p_a(E)$.

B.1 Codominant markers:

We assume that the individual has a priori equally likely ancestry in each cluster. We slightly change the notation, denoting by $z_l$ the count of allele A at locus $l$ for the individual to be assigned. Then, making the dependence on $f$ explicit in the notation, we have

$$p_b(c = c^*, z = z^* \mid f) = \sum_z \sum_k p_b^2(k, z \mid f) \tag{36}$$
$$= \sum_z \sum_k \left[ \frac{1}{K} \prod_l f_{kl}^{z_l} (1 - f_{kl})^{2 - z_l}\, 2^{\delta^1_{z_l}} \right]^2 \tag{37}$$

where $\delta^1_{z_l}$ denotes the Kronecker symbol that equals 1 if $z_l = 1$ and 0 otherwise.
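The codominant likelihood inside the bracket of equation (37) and the stochastic assignment rule itself can be sketched as follows. This is a minimal illustration assuming known allele frequencies and HWLE; the names `genotype_likelihood` and `stochastic_assignment` are introduced here for illustration only.

```python
import random


def genotype_likelihood(z, freqs_k):
    """p(z | c = k, f) for codominant markers under HWLE.

    z[l] is the number of copies of allele A at locus l (0, 1 or 2) and
    freqs_k[l] the frequency of allele A at locus l in cluster k.
    The factor 2 for heterozygotes plays the role of the Kronecker term.
    """
    p = 1.0
    for z_l, f in zip(z, freqs_k):
        p *= f ** z_l * (1 - f) ** (2 - z_l) * (2 if z_l == 1 else 1)
    return p


def stochastic_assignment(z, freqs, rng):
    """Draw a cluster label with probability proportional to p(z | c = k, f).

    freqs[k][l] is the frequency of allele A in cluster k at locus l.
    random.choices raises ValueError if the genotype has zero likelihood
    in every cluster.
    """
    likelihoods = [genotype_likelihood(z, row) for row in freqs]
    return rng.choices(range(len(freqs)), weights=likelihoods)[0]


if __name__ == "__main__":
    rng = random.Random(42)
    freqs = [[0.1, 0.7], [0.9, 0.4]]  # two clusters, two loci (made-up values)
    z = (2, 1)
    draws = [stochastic_assignment(z, freqs, rng) for _ in range(10)]
    print(draws)
```

Unlike the maximum likelihood rule, repeated calls on the same genotype may return different labels, which is what makes the averaged accuracy tractable.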

Accounting for uncertainty about $f$ by integration, we get

$$p_b(c = c^*, z = z^*) = \int p_b(c = c^*, z = z^* \mid f)\, df \tag{38}$$
$$= \int \sum_z \sum_k \left[ \frac{1}{K} \prod_l f_{kl}^{z_l} (1 - f_{kl})^{2 - z_l}\, 2^{\delta^1_{z_l}} \right]^2 df \tag{39}$$

Among the terms enumerated in the sum over $z$ above, let us consider a generic genotype $z$ having exactly $h$ heterozygous loci. The term corresponding to such a genotype $z$ in the sum above can be written

$$\frac{1}{K^2}\, 2^{2h} \left[ \int f^2 (1 - f)^2\, df \right]^h \left[ \int f^4\, df \right]^{L - h} \tag{40}$$

Denoting by $C_L^h$ the binomial coefficient, there are $C_L^h 2^{L - h}$ such terms. Equation (39) becomes

$$p_b(c = c^*, z = z^*) = \sum_k \sum_h C_L^h\, 2^{L - h}\, \frac{1}{K^2}\, 2^{2h} \left[ \int f^2 (1 - f)^2\, df \right]^h \left[ \int f^4\, df \right]^{L - h} \tag{41}$$

Assuming a flat Dirichlet distribution for the allele frequencies, we get

$$p_b(c = c^*, z = z^*) = \frac{1}{K} \left( \frac{8}{15} \right)^L \tag{42}$$

We now need to evaluate $p_b(z = z^*)$, but since

$$p_b(z = z^* \mid f) = \sum_z p_b^2(z \mid f) = \sum_z \Big( \sum_k p_b(k, z \mid f) \Big)^2, \tag{43}$$

$$p_b(z = z^*) = \int \sum_z p_b^2(z \mid f)\, df = \int \sum_z \Big( \sum_k p_b(k, z \mid f) \Big)^2 df \tag{44}$$
$$= \int \sum_z \sum_k p_b(k, z \mid f)^2\, df + \int \sum_z \sum_{k \neq k'} p_b(k, z \mid f)\, p_b(k', z \mid f)\, df \tag{45}$$

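$$= p_b(c = c^*, z = z^*) + \sum_{k \neq k'} \sum_h C_L^h\, 2^{L - h}\, \frac{1}{K^2}\, 2^{2h} \left[ \int f (1 - f)\, df \right]^{2h} \left[ \int f^2\, df \right]^{2(L - h)} \tag{46}$$
$$= \frac{1}{K} \left( \frac{8}{15} \right)^L + \frac{K - 1}{K} \left( \frac{1}{3} \right)^L \tag{47}$$

Eventually,

$$p(E) = \frac{1}{1 + (K - 1) \left( \frac{5}{8} \right)^L} \tag{48}$$

which proves equation (9).

B.2 Dominant markers:

We still have

$$p_b(c = c^*, z = z^*) = \int \sum_z \sum_k p_b^2(k, z \mid f)\, df \tag{49}$$

where the single-locus genotype probabilities are now those of dominant markers. For a generic genotype $z$ in the sum above, let us denote by $r$ the number of loci carrying exactly one copy of the recessive allele, then

$$p_b(c = c^*, z = z^*) = \sum_k \sum_r C_L^r\, \frac{1}{K^2} \left[ \int f^4\, df \right]^r \left[ \int (1 - f^2)^2\, df \right]^{L - r} \tag{50}$$
$$= \sum_k \sum_r C_L^r\, \frac{1}{K^2} \left( \frac{1}{5} \right)^L \left( \frac{8}{3} \right)^{L - r} \tag{51}$$
$$= \frac{1}{K} \left( \frac{11}{15} \right)^L \tag{52}$$

Moreover, by arguments similar to those used for codominant markers, we get

$$p_b(z = z^*) = \frac{1}{K} \left( \frac{11}{15} \right)^L + \frac{K - 1}{K} \left( \frac{5}{9} \right)^L \tag{53}$$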
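The Beta integrals and binomial sums used in this appendix can be checked with exact rational arithmetic. The sketch below recomputes the ratio of the joint term to the marginal term for both marker types, under the flat prior, and can be compared with $1/(1 + (K-1)(5/8)^L)$ and $1/(1 + (K-1)(25/33)^L)$. The function names are illustrative assumptions; the integrands follow the expressions written above.

```python
from fractions import Fraction
from math import comb, factorial


def beta_integral(a, b):
    """Exact integral of f**a * (1 - f)**b over [0, 1] for integers a, b >= 0."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))


def sa_accuracy_codominant(k, n_loci):
    same = Fraction(0)   # sum over genotypes of squared terms, one cluster
    cross = Fraction(0)  # sum over genotypes of cross terms, two distinct clusters
    for h in range(n_loci + 1):  # h = number of heterozygous loci
        n_terms = comb(n_loci, h) * 2 ** (n_loci - h)
        same += n_terms * 4 ** h * beta_integral(2, 2) ** h * beta_integral(4, 0) ** (n_loci - h)
        cross += n_terms * 4 ** h * beta_integral(1, 1) ** (2 * h) * beta_integral(2, 0) ** (2 * (n_loci - h))
    joint = same / k
    marginal = joint + Fraction(k - 1, k) * cross
    return joint / marginal


def sa_accuracy_dominant(k, n_loci):
    # integral of (1 - f^2)^2 = 1 - 2 f^2 + f^4, and integral of (1 - f^2)
    int_dom_sq = beta_integral(0, 0) - 2 * beta_integral(2, 0) + beta_integral(4, 0)
    int_dom = beta_integral(0, 0) - beta_integral(2, 0)
    same = Fraction(0)
    cross = Fraction(0)
    for r in range(n_loci + 1):  # r = number of loci with the recessive genotype
        same += comb(n_loci, r) * beta_integral(4, 0) ** r * int_dom_sq ** (n_loci - r)
        cross += comb(n_loci, r) * beta_integral(2, 0) ** (2 * r) * int_dom ** (2 * (n_loci - r))
    joint = same / k
    marginal = joint + Fraction(k - 1, k) * cross
    return joint / marginal


if __name__ == "__main__":
    for n_loci in (1, 5, 10):
        print(n_loci, sa_accuracy_codominant(3, n_loci), sa_accuracy_dominant(3, n_loci))
```

Working with `Fraction` avoids rounding error, so any mismatch with the closed forms would point to an algebraic slip rather than to numerical noise.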

And we get

$$p_b(z = z^*) = \frac{1}{K} \left( \frac{11}{15} \right)^L + \frac{K - 1}{K} \left( \frac{5}{9} \right)^L \tag{54}$$

Eventually,

$$p(E) = \frac{1}{1 + (K - 1) \left( \frac{25}{33} \right)^L} \tag{55}$$

which proves equation (10).

References

[1] B. Rannala and J. Mountain, Detecting immigration by using multilocus genotypes, Proceedings of the National Academy of Sciences USA, vol. 94.
[2] J. Cornuet, S. Piry, G. Luikart, A. Estoup, and M. Solignac, New methods employing multilocus genotypes to select or exclude populations as origins of individuals, Genetics, vol. 153.
[3] D. Paetkau, R. Slade, M. Burdens, and A. Estoup, Genetic assignment methods for the direct, real-time estimation of migration rate: a simulation-based exploration of accuracy and power, Molecular Ecology, vol. 15.
[4] A. Piry, S. Alapetite, J. Cornuet, D. Paetkau, L. Baudoin, and A. Estoup, Geneclass2: A software for genetic assignment and first-generation migrant detection, Journal of Heredity, vol. 95, no. 6.
[5] P. Gladieux, X. Zhang, D. Afoufa-Bastien, R. V. Sanhueza, M. Sbaghi, and B. L. Cam, On the origin and spread of the scab disease of apple: Out of central Asia, PLoS One, vol. 3, no. 1, p. e1455.

[6] A. P. de Rosas, E. Segura, L. Fichera, and B. García, Macrogeographic and microgeographic genetic structure of the Chagas disease vector Triatoma infestans (Hemiptera: Reduviidae) from Catamarca, Argentina, Genetica, vol. 133, no. 3.
[7] A. Bataille, A. A. Cunningham, V. Cedeño, L. Patiño, A. Constantinou, L. Kramer, and S. J. Goodman, Natural colonization and adaptation of a mosquito species in Galápagos and its implications for disease threats to endemic wildlife, Proceedings of the National Academy of Sciences, vol. 106, no. 25.
[8] S. Manel, P. Berthier, and G. Luikart, Detecting wildlife poaching: identifying the origin of individuals with Bayesian assignment test and multilocus genotypes, Conservation Biology, vol. 13, no. 3.
[9] S. Manel, O. Gaggiotti, and R. Waples, Assignment methods: matching biological questions with appropriate techniques, Trends in Ecology and Evolution, vol. 20.
[10] D. Campbell, P. Duchesne, and L. Bernatchez, AFLP utility for population assignment studies: analytical investigation and empirical comparison with microsatellites, Molecular Ecology, vol. 12.
[11] C. Schlötterer, The evolution of molecular markers - just a matter of fashion? Nature Review Genetics, vol. 5.
[12] A. Bonin, D. Ehrich, and S. Manel, Statistical analysis of amplified fragment length polymorphism data: a toolbox for molecular ecologists and evolutionists, Molecular Ecology, vol. 16, no. 18.
[13] N. Rosenberg, T. Burke, K. Elo, M. Feldman, P. Friedlin, M. Groenen, J. Hillel, A. Mäki-Tanila, M. Tixier-Boichard, A. Vignal, K. Wimmers, and S. Weigend, Empirical evaluation of genetic clustering methods using multilocus genotypes from 20 chicken breeds, Genetics, vol. 159.
[14] S. T. Kalinowski, Do polymorphic loci require larger sample sizes to estimate genetic distances? Heredity, vol. 94.

[15] N. Rosenberg, L. Li, R. Ward, and J. K. Pritchard, Informativeness of genetic markers for inference of ancestry, American Journal of Human Genetics, vol. 73.
[16] G. Guillot, Inference of structure in subdivided populations at low levels of genetic differentiation. The correlated allele frequencies model revisited, Bioinformatics, vol. 24.
[17] O. Gaggiotti and M. Foll, Quantifying population structure using the F-model, Molecular Ecology Resources, vol. 10, no. 5.
[18] M. A. Beaumont and B. Rannala, The Bayesian revolution in genetics, Nature Review Genetics, vol. 5.
[19] R. Nielsen and J. Signorovitch, Correcting for ascertainment biases when analyzing SNP data: applications to the estimation of linkage disequilibrium, Theoretical Population Biology, vol. 63.
[20] R. Nielsen, M. Hubisz, and A. Clark, Reconstituting the frequency spectrum of ascertained single-nucleotide polymorphism data, Genetics, vol. 168.
[21] M. Foll, M. Beaumont, and O. Gaggiotti, An approximate Bayesian computation approach to overcome biases that arise when using AFLP markers to study population structure, Genetics, vol. 179.
[22] G. Guillot and M. Foll, Accounting for the ascertainment bias in Markov chain Monte Carlo inferences of population structure, Bioinformatics, vol. 25, no. 4.
[23] G. Guillot and F. Santos, Using AFLP markers and the Geneland program for the inference of population genetic structure, Molecular Ecology Resources, 2010, to appear.
