On the informativeness of dominant and co-dominant genetic markers for Bayesian supervised clustering


Gilles Guillot and Alexandra Carpentier-Skandalis

Department of Informatics and Mathematical Modelling, Technical University of Denmark, 2800, Lyngby, Copenhagen, Denmark
Centre for Ecological and Evolutionary Synthesis, Department of Biology, University of Oslo, P.O. Box 1066 Blindern, 0316 Oslo, Norway

September 24, 2010

Abstract

We study the accuracy of a Bayesian supervised method used to cluster individuals into genetically homogeneous groups on the basis of dominant or codominant molecular markers. We provide a formula relating an error criterion to the number of loci used and the number of clusters. This formula is exact and holds for an arbitrary number of clusters and markers. Our work suggests that dominant marker studies can achieve an accuracy similar to that of codominant marker studies if the number of markers used in the former is about 1.7 times larger than in the latter.

1 Background

A common problem in population genetics consists in assigning an individual to one of $K$ populations on the basis of its genotype and information about the distribution of the various alleles in the $K$ populations. This question has received considerable attention in the population genetics and molecular ecology literature [1, 2, 3, 4] as it can provide important insight about gene flow patterns and migration rates. It is for example widely used in epidemiology to detect the origin of pathogens or of their hosts (see e.g. [5, 6, 7] for examples) or in conservation biology and population management to detect illegal translocation or poaching [8]. See [9] for a review of related methods.

In a statistical phrasing, assigning an individual to some known clusters is a supervised clustering problem. This requires observing the genotype of the individual to be assigned and those of some individuals in the various clusters. For diploid organisms (i.e. organisms harbouring two copies of each chromosome), certain lab techniques allow one to retrieve the exact genotype of each individual. In contrast, for some markers it is only possible to say whether a certain allele A (referred to hereafter as the dominant allele) is present or not at a locus. In this case, one cannot distinguish the heterozygous genotype Aa from the genotype AA that is homozygous for the dominant allele. The former type of markers are said to be codominant while the latter are said to be dominant. It is clear that the second genotyping method incurs a loss of information. The consequence of this loss of information has been studied from an empirical point of view [10] but it has never been studied on a theoretical basis. The choice to use one type of markers for empirical studies is therefore often motivated mostly by practical considerations rather than by an objective rationale [11, 12].

The objective of the present article is to compare the accuracy achieved with dominant and codominant markers when they are used to perform supervised clustering and to derive some recommendations about the number of markers required to achieve a certain accuracy. Dominant markers are essentially bi-allelic in the sense that they record the presence or the absence of a certain allele. We are not concerned here with the relation between informativeness and the level of polymorphism (cf. [13, 14] for references on this aspect). We therefore focus on bi-allelic dominant and co-dominant markers. Hence our study is representative of Amplified Fragment Length Polymorphism (AFLP) and Single Nucleotide Polymorphism (SNP) markers, which are some of the most employed markers in genetics.

2 Informativeness of dominant and co-dominant markers

2.1 Cluster model

We will consider here the case of diploid organisms at $L$ bi-allelic loci. We denote by $z = (z_l)_{l=1,\dots,L}$ the genotype of an individual.
We denote by $f_{kl}$ the frequency of allele A in cluster $k$ at locus $l$. We assume that each cluster is at Hardy-Weinberg equilibrium (HWE) at each locus. HWE is defined as the conditions under which the allele carried at a locus on one chromosome is independent of the allele carried at the same locus on the homologous chromosome. This situation is observed at neutral loci when individuals mate at random in a cluster. Denoting by $z_l$ the number of copies of allele A carried by an individual in cluster $k$, for co-dominant markers this can be expressed as

$$p(z_l = 2 \mid f) = f_{kl}^2 \tag{1}$$
$$p(z_l = 1 \mid f) = 2 f_{kl} (1 - f_{kl}) \tag{2}$$
$$p(z_l = 0 \mid f) = (1 - f_{kl})^2 \tag{3}$$

For dominant markers, $z_l$ is equal to 0 or 1 depending on whether a copy of allele A is present in the genotype of the individual. Under HWE we have:

$$p(z_l = 1 \mid f) = f_{kl}^2 + 2 f_{kl} (1 - f_{kl}) \tag{4}$$
$$p(z_l = 0 \mid f) = (1 - f_{kl})^2 \tag{5}$$

In addition to HWE, we also assume that the various loci are at linkage equilibrium (henceforth HWLE), i.e. that the probability of a multilocus genotype is equal to the product of the probabilities of the single-locus genotypes: $p(z_1, \dots, z_L) = \prod_l p(z_l)$. We assume that the individual to be classified has origin in one of the $K$ clusters (no admixture).

2.2 Sampling model

We will measure the accuracy of a classifying rule for a given type of markers by the probability of correctly assigning an individual of unknown origin. We are interested in deriving results that are independent of (i) the particular origin $c$ of the individual to be classified, (ii) the genotype $z$ of this individual and (iii) the allele frequencies $f$ in the various clusters. We will therefore derive results that are conditional on $c$, $z$ and $f$ and then compute Bayesian averages under suitable prior distributions. The mechanism assumed in the sequel is as follows:

1. The individual has origin in one of the $K$ clusters. This origin is unknown and all origins are equally likely. We therefore assume a uniform prior for $c$ on $\{1, \dots, K\}$.
2. In each cluster, for each locus the allele frequencies follow a Dirichlet(1,1) distribution, with independence across clusters and loci.
3. Conditionally on $c$ and $f$, the probability of the genotype of the individual is given by equations (1-3) or (4-5), i.e. we assume that the individual has been sampled at random among all individuals in its cluster of origin.

2.3 Accuracy of assignments under a maximum likelihood principle

We consider an individual of unknown origin $c$ with known genotype $z$ and with potential origin in $K$ clusters with known allele frequencies $f$. Following a maximum likelihood principle, it is natural to estimate $c$ as the cluster label for which the probability of observing this particular genotype is maximal. Formally: $c^* = \mathrm{Argmax}_k \, p(z \mid c = k, f)$. This assignment rule is deterministic, but whether the individual is correctly assigned will depend on its genotype and on cluster allele frequencies. Randomising these quantities and averaging over all possible values, we can derive a generic formula for the probability of correct assignment $p^{MLA}$ as

$$p^{MLA} = \int_{\phi} \sum_{\zeta} \max_k \, p(c = k, z = \zeta \mid f = \phi) \, dp(\phi) \tag{6}$$

See section A in the appendix for details. This formula is of little practical use, and deriving a more explicit expression for arbitrary values of $K$ and $L$ seems to be out of reach. However, for $K = 2$ and $L = 1$, under the assumptions that the individual has a priori equally likely ancestry in each cluster and that each $f_{kl}$ has a flat Dirichlet distribution with parameter $(1, \dots, 1)$, we get

$$p^{MLA}_c(K = 2, L = 1) = 17/24 \quad \text{for codominant markers} \tag{7}$$

and

$$p^{MLA}_d(K = 2, L = 1) = 16/24 \quad \text{for dominant markers.} \tag{8}$$

Because of the lack of practical usefulness of eq. (6), we now define an alternative rule for assignment that is similar in spirit to maximum likelihood but also leads to more tractable equations.
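The two single-locus values in equations (7) and (8) can be checked numerically. The sketch below is not part of the original derivation: it approximates the integral of eq. (6) for K = 2 and L = 1 with a midpoint rule on a grid of allele frequencies, assuming flat priors and a uniform prior on the origin. The function names and the grid size `n` are illustrative choices.

```python
def codominant_probs(f):
    """Genotype probabilities (AA, Aa, aa) under HWE for allele frequency f."""
    return (f * f, 2.0 * f * (1.0 - f), (1.0 - f) ** 2)


def dominant_probs(f):
    """Probabilities (allele A present, allele A absent) under HWE."""
    return (1.0 - (1.0 - f) ** 2, (1.0 - f) ** 2)


def ml_accuracy_two_clusters(genotype_probs, n=400):
    """Midpoint-rule approximation of eq. (6) for K = 2 clusters and L = 1 locus.

    Both allele frequencies are integrated over [0, 1] with a flat prior;
    the factor 0.5 is the uniform prior p(c = k) = 1/K.
    """
    h = 1.0 / n
    table = [genotype_probs((i + 0.5) * h) for i in range(n)]
    total = 0.0
    for p1 in table:
        for p2 in table:
            # Sum over genotypes of the larger of the two cluster likelihoods.
            total += sum(max(a, b) for a, b in zip(p1, p2))
    return 0.5 * total * h * h


p_c = ml_accuracy_two_clusters(codominant_probs)
p_d = ml_accuracy_two_clusters(dominant_probs)
print("codominant:", p_c, "reference 17/24 =", 17 / 24)
print("dominant:  ", p_d, "reference 16/24 =", 16 / 24)
```

The same brute-force approach becomes impractical as K and L grow, which is consistent with the remark that eq. (6) is of little practical use in general.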

2.4 Accuracy of assignments under a stochastic rule

Considering the collection of likelihood values $p(z \mid c = k, f)$ for $k = 1, \dots, K$, following [15], we define a stochastic assignment (SA) rule by assigning the individual to a group at random with probabilities proportional to $p(z \mid c = k, f)$. In words, an individual with genotype $z$ is randomly assigned to cluster $k$ with a probability proportional to the probability of observing this genotype in cluster $k$. The rationale behind this rule is that high values of $p(z \mid c = k, f)$ indicate strong evidence of ancestry in group $k$ but do not guarantee against mis-assignments. To derive the probability of correct assignment, we first consider that the allele frequencies are known, and then account for the uncertainty about these frequencies by Bayesian integration. The use of a Bayesian framework is motivated by the fact that (i) there is genuine uncertainty on allele frequencies which cannot be overlooked, and (ii) under some fairly mild assumptions, allele frequencies are known to be Dirichlet distributed (possibly with a degree of approximation, see e.g. [16, 17]). Refer to [18] for further discussion of the Bayesian paradigm in population genetics.

We now give our main results regarding this clustering rule. For bi-allelic loci, and denoting by $p^{SA}_c$ the probability of correct assignment using codominant markers, we have:

$$p^{SA}_c(K, L) = \frac{1}{1 + (K - 1)(5/8)^L} \tag{9}$$

For bi-allelic loci, and denoting by $p^{SA}_d$ the probability of correct assignment using dominant markers, we have:

$$p^{SA}_d(K, L) = \frac{1}{1 + (K - 1)(25/33)^L} \tag{10}$$

3 Implications

Our investigations considered bi-allelic loci and are therefore representative of AFLP and SNP markers, which are some of the most employed markers in genetics. In this context, for supervised clustering, our main conclusions are that (i) codominant markers are more accurate than dominant markers, (ii) the difference in accuracy decreases toward 0 as the number of markers $L$ increases, (iii) $L_d$ dominant markers can achieve an accuracy even higher than that of $L_c$ codominant markers as long as the numbers of loci used satisfy $L_d \geq \lambda L_c$ where $\lambda = \ln(5/8) / \ln(25/33) \approx 1.69$.

The figures reported have to be taken with a grain of salt as they may depend on some specific aspects of the models considered. For example, the model considered here assumes independence of allele frequencies across clusters. This assumption is relevant in the case of populations displaying low migration rates and a low amount of shared ancestry. When one of these assumptions is violated, an alternative parametric model based on the Dirichlet distribution that accounts for correlation of allele frequencies across populations is often used (see [16] and references therein). It is expected that the accuracy obtained with both types of markers would be lower under this model. Besides, the present study does not account for ascertainment bias [19, 20, 21, 22], an aspect that might affect the results but is notoriously difficult to deal with. However, it is important to note that the conditions considered in the present study were the same for dominant and codominant markers, so that results should not be biased toward one type of marker.

Our global result about the relative informativeness of dominant and co-dominant markers contrasts with the common belief that dominant markers are an expedient one would resort to when co-dominant markers are not available (see [12] for discussions). A comparison of dominant and codominant markers for unsupervised clustering has been carried out [23]. This study, based on simulations, suggests that the loss of accuracy incurred by dominant markers in unsupervised clustering is much larger than for supervised clustering. This is presumably explained by the fact that, in the case of HWLE clusters, supervised clustering seeks to optimise a criterion based on allele frequencies only. This contrasts with unsupervised clustering, which seeks to optimise a criterion based on allele frequencies and HWLE. A theoretical analysis of unsupervised clustering algorithms similar to the present study would be valuable, but we anticipate that it would present more difficulties.

Acknowledgement

This work has been supported by Agence Nationale de la Recherche grant ANR-09-BLAN-0145-01.

A Supervised clustering with a maximum likelihood principle

We consider the setting where the unknown ancestry $c$ of an individual with genotype $z$ is estimated by $c^* = \mathrm{Argmax}_c \, p(z \mid c, f)$. As this estimator is a deterministic function of $z$ we denote it by $c^*_z$ for clarity in the sequel. Consider for now that the allele frequencies are known to be equal to some $\phi$. Under this setting, randomness comes from the sampling of $c$ and then from the sampling of $z \mid (c, f)$. We are concerned with the event $E$ defined as $E$ = {the individual is correctly assigned}. Applying the total probability formula, we can write

$$p(E \mid f = \phi) = \sum_{k} \sum_{\zeta} p(E, c = k, z = \zeta \mid f = \phi) \tag{11}$$

In the sum over $k$, only one term is not equal to 0: this is the term for $k = c^*_\zeta$, hence

$$p(E \mid f = \phi) = \sum_{\zeta} p(E, c = c^*_\zeta, z = \zeta \mid f = \phi) \tag{12}$$
$$= \sum_{\zeta} p(c = c^*_\zeta, z = \zeta \mid f = \phi) \tag{13}$$
$$= \sum_{\zeta} p(c = c^*_\zeta \mid f = \phi) \, p(z = \zeta \mid c = c^*_\zeta, f = \phi) \tag{14}$$
$$= \sum_{\zeta} p(c = c^*_\zeta) \, p(z = \zeta \mid c = c^*_\zeta, f = \phi) \tag{15}$$

Assuming that the individual has a priori equally likely ancestry in each cluster, i.e. assuming a uniform distribution for the class variable $c$, we get

$$p(E \mid f = \phi) = K^{-1} \sum_{\zeta} p(z = \zeta \mid c = c^*_\zeta, f = \phi) \tag{16}$$
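For fixed allele frequencies, eq. (16) can be evaluated by brute force, which may help to see what the sum over genotypes is doing. The sketch below is an illustration under the HWLE model for codominant markers: it enumerates every multilocus genotype, picks the maximum likelihood cluster for each one, and adds up the corresponding likelihoods. The function names and the `freqs[k][l]` layout are assumptions of this example.

```python
from itertools import product


def codominant_locus(f):
    """P(z_l = 0), P(z_l = 1), P(z_l = 2) copies of allele A under HWE."""
    return ((1.0 - f) ** 2, 2.0 * f * (1.0 - f), f * f)


def ml_accuracy_given_freqs(freqs, locus_probs=codominant_locus):
    """Evaluate eq. (16) for known frequencies freqs[k][l] (cluster k, locus l)."""
    K, L = len(freqs), len(freqs[0])
    tables = [[locus_probs(f) for f in row] for row in freqs]
    n_states = len(tables[0][0])
    total = 0.0
    for zeta in product(range(n_states), repeat=L):
        # Likelihood of genotype zeta in each cluster (product over loci, HWLE).
        likes = []
        for k in range(K):
            p = 1.0
            for l, z in enumerate(zeta):
                p *= tables[k][l][z]
            likes.append(p)
        # Maximum likelihood label for this genotype.
        k_star = max(range(K), key=likes.__getitem__)
        total += likes[k_star]
    return total / K  # uniform prior on the origin


print(ml_accuracy_given_freqs([[0.0], [1.0]]))            # fully diagnostic locus
print(ml_accuracy_given_freqs([[0.3, 0.6], [0.3, 0.6]]))  # identical clusters
print(ml_accuracy_given_freqs([[0.2, 0.7], [0.6, 0.4]]))
```

The enumeration visits 3^L genotypes, so this is only meant for small L.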

By definition, $c^*_z$ satisfies $p(z \mid c^*_z, f) = \max_k p(z \mid c = k, f)$, hence

$$p(E \mid f = \phi) = K^{-1} \sum_{\zeta} \max_k \, p(z = \zeta \mid c = k, f = \phi) \tag{17}$$
$$= \sum_{\zeta} \max_k \, p(c = k) \, p(z = \zeta \mid c = k, f = \phi) \tag{18}$$
$$= \sum_{\zeta} \max_k \, p(c = k, z = \zeta \mid f = \phi) \tag{19}$$

We seek an expression of the probability of correct assignment that does not depend on particular values of allele frequencies. This can be obtained by integrating over allele frequencies, namely

$$p(E) = \int_{\phi} p(E \mid f = \phi) \, dp(\phi) \tag{20}$$
$$= \int_{\phi} \sum_{\zeta} \max_k \, p(c = k, z = \zeta \mid f = \phi) \, dp(\phi) \tag{21}$$

Note that identity (21) holds for any number of clusters $K$, any number of loci $L$ and any type of markers (dominant vs. codominant).

We now consider a two-cluster problem in the case where the genotype of an individual has been recorded at a single bi-allelic locus. We denote by $f_1$ (resp. $f_2$) the frequency of allele A in cluster 1 (resp. cluster 2).

A.1 Codominant markers:

There are only three genotypes: AA, Aa and aa. Denoting by $f_k$ the frequency of allele A in cluster $k$ and conditionally on $f_k$, these three genotypes occur in cluster $k$ with probabilities $f_k^2$, $2 f_k (1 - f_k)$ and $(1 - f_k)^2$, and equation (21) can be simplified, with $f = \phi$, as

$$p(E) = \int_{\phi} p(c) \left[ \max_k f_k^2 + 2 \max_k f_k (1 - f_k) + \max_k (1 - f_k)^2 \right] dp(\phi) \tag{22}$$

We need to derive the distribution of $\max_k f_k^2$ and of $\max_k f_k (1 - f_k)$. Assuming a flat Dirichlet distribution for $f_k$, elementary computations give:

$$p(\max_k f_k^2 < x) = x \tag{23}$$

i.e. $\max_k f_k^2$ follows a uniform distribution on $[0, 1]$, so that

$$E(\max_k f_k^2) = 1/2 \tag{24}$$

Besides, we also get

$$p(\max_k f_k (1 - f_k) < x) = \left(1 - \sqrt{1 - 4x}\right)^2 \tag{25}$$

and, differentiating,

$$\frac{dp}{dx}(\max_k f_k (1 - f_k) < x) = 4 \, \frac{1 - \sqrt{1 - 4x}}{\sqrt{1 - 4x}} \tag{26}$$

Integrating by parts, we get

$$E(\max_k f_k (1 - f_k)) = \int_0^{1/4} 4x \, \frac{1 - \sqrt{1 - 4x}}{\sqrt{1 - 4x}} \, dx = 5/24 \tag{27}$$

Eventually

$$p(E) = 17/24 \tag{28}$$

which proves equation (7).

A.2 Dominant markers:

For a single locus, there are two genotypes A and a. Conditionally on $f_k$, these two genotypes are observed in cluster $k$ with probabilities $1 - f_k^2$ and $f_k^2$. Equation (21) can be simplified here, with $f = \phi$, as

$$p(E) = \int_{\phi} p(c) \left[ \max_k f_k^2 + \max_k (1 - f_k^2) \right] dp(\phi) \tag{29}$$

We now need the density of $\max_k (1 - f_k^2)$:

$$p(\max_k (1 - f_k^2) < x) = \left(1 - \sqrt{1 - x}\right)^2 \tag{30}$$

and

$$\frac{dp}{dx}(\max_k (1 - f_k^2) < x) = \frac{1}{\sqrt{1 - x}} - 1 \tag{31}$$

Eventually we get

$$E(\max_k (1 - f_k^2)) = \int_0^1 x \left( \frac{1}{\sqrt{1 - x}} - 1 \right) dx = 5/6 \tag{32}$$

$$p(E) = 16/24 \tag{33}$$

which proves equation (8).

B Stochastic assignment rule

The maximum likelihood assignment rule considered above is not tractable for arbitrary values of $K$ and $L$ (cf. eq. (21)). In particular, a difficulty arises from the maximisation involved. We consider here an assignment rule that does not involve maximisation. The unknown ancestry $c$ of an individual with genotype $z$ is predicted by a random variable $c^*$ with values in $\{1, \dots, K\}$ and such that $p(c^* = k \mid z, f) \propto p(z \mid c = k, f)$. As in the previous sections, we first consider that the allele frequencies are known; however, we skip this dependence in the notation at the beginning for clarity. We will account for the uncertainty about these frequencies later by Bayesian integration. In this setting, the structure of conditional probability dependence can be represented by a directed acyclic graph as on the left-hand side of figure 1. We are concerned with evaluating the probability of the event $E$ defined as $E$ = {the individual is correctly assigned}, i.e. $E = \{c = c^*\}$. We denote by $p_a$ (resp. $p_b$) probabilities under the two conditional dependence structures of figure 1.

Figure 1: Directed acyclic graph for our stochastic assignment rule (left, panel (a)) and for an alternative scheme (right, panel (b)). All downward arrows represent the same conditional dependence given by our likelihood model. The upward arrow represents the reverse probability dependence.

Some elementary computations show that $p(E)$ can be expressed in terms of a probability in the model of the right-hand side of the DAG in figure 1, namely:

$$p_a(c = c^*) = p_b(c = c^* \mid z = z^*) \tag{34}$$

The right-hand side of this expression can be written as

$$p_b(c = c^* \mid z = z^*) = p_b(c = c^*, z = z^*) / p_b(z = z^*) \tag{35}$$

It is more convenient to manipulate this expression than $p_b(c = c^*)$. We will use it to evaluate $p_a(E)$.

B.1 Codominant markers:

We assume that the individual has a priori equally likely ancestry in each cluster. We slightly change the notation, denoting by $z_l$ the count of allele A at locus $l$ for the individual to be assigned. Then, making the dependence on $f$ explicit in the notation, we have

$$p_b(c = c^*, z = z^* \mid f) = \sum_{k} \sum_{z} p_b^2(k, z \mid f) \tag{36}$$
$$= \sum_{k} \sum_{z} \left[ \frac{1}{K} \prod_l f_{kl}^{z_l} (1 - f_{kl})^{2 - z_l} \, 2^{\delta^1_{z_l}} \right]^2 \tag{37}$$

where $\delta^1_{z_l}$ denotes the Kronecker symbol that equals 1 if $z_l = 1$ and 0 otherwise.

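Accounting for uncertainty about $f$ by integration, we get

$$p_b(c = c^*, z = z^*) = \int p_b(c = c^*, z = z^* \mid f) \, df \tag{38}$$
$$= \int \sum_{k} \sum_{z} \left[ \frac{1}{K} \prod_l f_{kl}^{z_l} (1 - f_{kl})^{2 - z_l} \, 2^{\delta^1_{z_l}} \right]^2 df \tag{39}$$

Among the terms enumerated in the sum over $z$ above, let us consider a generic term $z$ with exactly $h$ heterozygous loci. For a given cluster $k$, the term corresponding to such a genotype $z$ in the sum above can be written

$$\frac{1}{K^2} \, 2^{2h} \left[ \int f^2 (1 - f)^2 \, df \right]^h \left[ \int f^4 \, df \right]^{L - h} \tag{40}$$

Denoting by $C_L^h$ the binomial coefficient, there are $C_L^h 2^{L - h}$ such terms. Equation (39) becomes

$$p_b(c = c^*, z = z^*) = \sum_{k} \sum_{h} C_L^h \, 2^{L - h} \, \frac{1}{K^2} \, 2^{2h} \left[ \int f^2 (1 - f)^2 \, df \right]^h \left[ \int f^4 \, df \right]^{L - h} \tag{41}$$

Assuming a flat Dirichlet distribution for the allele frequencies, we get

$$p_b(c = c^*, z = z^*) = \frac{1}{K} \left( \frac{8}{15} \right)^L \tag{42}$$

We now need to evaluate $p_b(z = z^*)$. Since

$$p_b(z = z^* \mid f) = \sum_{z} p_b^2(z \mid f) = \sum_{z} \Big( \sum_{k} p_b(k, z \mid f) \Big)^2, \tag{43}$$

we have

$$p_b(z = z^* \mid f) = \sum_{z} p_b^2(z \mid f) = \sum_{z} \Big( \sum_{k} p_b(k, z \mid f) \Big)^2 \tag{44}$$
$$= \sum_{z} \sum_{k} p_b(k, z \mid f)^2 + \sum_{z} \sum_{k} \sum_{k' \neq k} p_b(k, z \mid f) \, p_b(k', z \mid f) \tag{45}$$

and, integrating over $f$,

$$p_b(z = z^*) = p_b(c = c^*, z = z^*) + \sum_{k} \sum_{k' \neq k} \sum_{h} C_L^h \, 2^{L - h} \, \frac{1}{K^2} \, 2^{2h} \left[ \int f (1 - f) \, df \right]^{2h} \left[ \int f^2 \, df \right]^{2(L - h)} \tag{46}$$
$$= \frac{1}{K} \left( \frac{8}{15} \right)^L + \frac{K - 1}{K} \left( \frac{1}{3} \right)^L \tag{47}$$

Eventually,

$$p(E) = \frac{1}{1 + (K - 1) \left( \frac{5}{8} \right)^L} \tag{48}$$

which proves equation (9).

B.2 Dominant markers:

We still have

$$p_b(c = c^*, z = z^*) = \int \sum_{k} \sum_{z} \left[ \frac{1}{K} \prod_l f_{kl}^{z_l} (1 - f_{kl})^{2 - z_l} \, 2^{\delta^1_{z_l}} \right]^2 df \tag{49}$$

For a generic genotype $z$ in the sum above, let us denote by $r$ the number of loci carrying exactly one copy of the recessive allele, then

$$p_b(c = c^*, z = z^*) = \sum_{k} \sum_{r} C_L^r \, \frac{1}{K^2} \left[ \int f^4 \, df \right]^r \left[ \int (1 - f^2)^2 \, df \right]^{L - r} \tag{50}$$
$$= \frac{1}{K} \sum_{r} C_L^r \left( \frac{1}{5} \right)^r \left( \frac{8}{15} \right)^{L - r} \tag{51}$$
$$= \frac{1}{K} \left( \frac{11}{15} \right)^L \tag{52}$$

Moreover, by arguments similar to those used for codominant markers, we get

$$p_b(z = z^*) = \frac{1}{K} \left( \frac{11}{15} \right)^L + \frac{K - 1}{K} \left( \frac{5}{9} \right)^L \tag{53}$$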
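The per-locus constants 8/15 and 1/3 of equation (47), and 11/15 and 5/9 of equations (52) and (53), can be re-derived with exact rational arithmetic. The sketch below is an illustrative check, not the authors' computation: it represents each single-locus genotype probability as a polynomial in f and (1 - f) and integrates it against the flat prior using the Beta integral. The polynomial encoding as `(coefficient, a, b)` triples and all function names are assumptions of this example.

```python
from fractions import Fraction
from math import factorial


def beta_integral(a, b):
    """Exact integral of f**a * (1 - f)**b over [0, 1]."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))


def expect(poly):
    """Expectation under a flat prior of a sum of terms coef * f**a * (1 - f)**b."""
    return sum(coef * beta_integral(a, b) for coef, a, b in poly)


def multiply(p, q):
    """Product of two polynomials given as lists of (coef, a, b) terms."""
    return [(c1 * c2, a1 + a2, b1 + b2) for c1, a1, b1 in p for c2, a2, b2 in q]


# Codominant genotypes AA, Aa, aa: f^2, 2 f (1 - f), (1 - f)^2.
CODOMINANT = [[(1, 2, 0)], [(2, 1, 1)], [(1, 0, 2)]]
# Dominant phenotypes: allele present 1 - (1 - f)^2, allele absent (1 - f)^2.
DOMINANT = [[(1, 0, 0), (-1, 0, 2)], [(1, 0, 2)]]


def per_locus_constants(genotypes):
    """Return (same-cluster term, different-cluster term, their ratio) for one locus."""
    same = sum(expect(multiply(g, g)) for g in genotypes)  # E[p(z | f)^2] summed over z
    diff = sum(expect(g) ** 2 for g in genotypes)          # E[p(z | f)]^2 summed over z
    return same, diff, diff / same


print(per_locus_constants(CODOMINANT))
print(per_locus_constants(DOMINANT))
```

The ratio returned in each case is the base of the power of L in the corresponding probability of correct assignment.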

And we get

$$p_b(z = z^*) = \frac{1}{K} \left( \frac{11}{15} \right)^L + \frac{K - 1}{K} \left( \frac{5}{9} \right)^L \tag{54}$$

Eventually,

$$p(E) = \frac{1}{1 + (K - 1) \left( \frac{25}{33} \right)^L} \tag{55}$$

which proves equation (10).

References

[1] B. Rannala and J. Mountain, Detecting immigration by using multilocus genotypes, Proceedings of the National Academy of Sciences USA, vol. 94, pp. 9197-9201, 1997.

[2] J. Cornuet, S. Piry, G. Luikart, A. Estoup, and M. Solignac, New methods employing multilocus genotypes to select or exclude populations as origins of individuals, Genetics, vol. 153, pp. 1989-2000, 1999.

[3] D. Paetkau, R. Slade, M. Burdens, and A. Estoup, Genetic assignment methods for the direct, real-time estimation of migration rate: a simulation-based exploration of accuracy and power, Molecular Ecology, vol. 15, pp. 55-65, 2004.

[4] A. Piry, S. Alapetite, J. Cornuet, D. Paetkau, L. Baudoin, and A. Estoup, Geneclass2: A software for genetic assignment and first-generation migrant detection, Journal of Heredity, vol. 95, no. 6, pp. 536-539, 2004.

[5] P. Gladieux, X. Zhang, D. Afoufa-Bastien, R. V. Sanhueza, M. Sbaghi, and B. L. Cam, On the origin and spread of the scab disease of apple: Out of central Asia, PLoS One, vol. 3, no. 1, p. e1455, 2008.

[6] A. P. de Rosas, E. Segura, L. Fichera, and B. García, Macrogeographic and microgeographic genetic structure of the Chagas disease vector Triatoma infestans (Hemiptera: Reduviidae) from Catamarca, Argentina, Genetica, vol. 133, no. 3, pp. 247-260, 2008.

[7] A. Bataille, A. A. Cunningham, V. Cedeño, L. Patiño, A. Constantinou, L. Kramer, and S. J. Goodman, Natural colonization and adaptation of a mosquito species in Galápagos and its implications for disease threats to endemic wildlife, Proceedings of the National Academy of Sciences, vol. 106, no. 25, pp. 10230-10235, 2009.

[8] S. Manel, P. Berthier, and G. Luikart, Detecting wildlife poaching: identifying the origin of individuals with Bayesian assignment test and multilocus genotypes, Conservation Biology, vol. 13, no. 3, pp. 650-659, 2002.

[9] S. Manel, O. Gaggiotti, and R. Waples, Assignment methods: matching biological questions with appropriate techniques, Trends in Ecology and Evolution, vol. 20, pp. 136-142, 2005.

[10] D. Campbell, P. Duchesne, and L. Bernatchez, AFLP utility for population assignment studies: analytical investigation and empirical comparison with microsatellites, Molecular Ecology, vol. 12, pp. 1979-1991, 2003.

[11] C. Schlötterer, The evolution of molecular markers - just a matter of fashion? Nature Review Genetics, vol. 5, pp. 63-69, 2004.

[12] A. Bonin, D. Ehrich, and S. Manel, Statistical analysis of amplified fragment length polymorphism data: a toolbox for molecular ecologists and evolutionists, Molecular Ecology, vol. 16, no. 18, pp. 3737-3758, 2007.

[13] N. Rosenberg, T. Burke, K. Elo, M. Feldman, P. Friedlin, M. Groenen, J. Hillel, A. Mäki-Tanila, M. Tixier-Boichard, A. Vignal, K. Wimmers, and S. Weigend, Empirical evaluation of genetic clustering methods using multilocus genotypes from 20 chicken breeds, Genetics, vol. 159, pp. 699-713, 2001.

[14] S. T. Kalinowski, Do polymorphic loci require larger sample sizes to estimate genetic distances? Heredity, vol. 94, pp. 33-36, 2005.

[15] N. Rosenberg, L. Li, R. Ward, and J. K. Pritchard, Informativeness of genetic markers for inference of ancestry, American Journal of Human Genetics, vol. 73, pp. 1402-1422, 2003.

[16] G. Guillot, Inference of structure in subdivided populations at low levels of genetic differentiation. The correlated allele frequencies model revisited, Bioinformatics, vol. 24, pp. 2222-2228, 2008.

[17] O. Gaggiotti and M. Foll, Quantifying population structure using the F-model, Molecular Ecology Resources, vol. 10, no. 5, pp. 821-830, 2010.

[18] M. A. Beaumont and B. Rannala, The Bayesian revolution in genetics, Nature Review Genetics, vol. 5, pp. 251-261, 2005.

[19] R. Nielsen and J. Signorovitch, Correcting for ascertainment biases when analyzing SNP data: applications to the estimation of linkage disequilibrium, Theoretical Population Biology, vol. 63, pp. 245-255, 2003.

[20] R. Nielsen, M. Hubisz, and A. Clark, Reconstituting the frequency spectrum of ascertained single-nucleotide polymorphism data, Genetics, vol. 168, pp. 2373-2382, 2004.

[21] M. Foll, M. Beaumont, and O. Gaggiotti, An approximate Bayesian computation approach to overcome biases that arise when using AFLP markers to study population structure, Genetics, vol. 179, pp. 927-939, 2008.

[22] G. Guillot and M. Foll, Accounting for the ascertainment bias in Markov chain Monte Carlo inferences of population structure, Bioinformatics, vol. 25, no. 4, pp. 552-554, 2009.

[23] G. Guillot and F. Santos, Using AFLP markers and the Geneland program for the inference of population genetic structure, Molecular Ecology Resources, 2010, to appear.