Genetics: Early Online, published on February 26, 2016 as /genetics Admixture, Population Structure and F-statistics

Size: px
Start display at page:

Download "Genetics: Early Online, published on February 26, 2016 as /genetics Admixture, Population Structure and F-statistics"

Transcription

1 Genetics: Early Online, published on February 26, 2016 as /genetics GENETICS INVESTIGATION Admixture, Population Structure and F-statistics Benjamin M Peter 1 1 Department of Human Genetics, University of Chicago, Chicago IL USA ABSTRACT Many questions about human genetic history can be addressed by examining the patterns of shared genetic variation between sets of populations. A useful methodological framework for this purpose are F-statistics, that measure shared genetic drift between sets of two, three and four populations, and can be used to test simple and complex hypotheses about admixture between populations. This paper provides context from phylogenetic and population genetic theory. I review how F-statistics can be interpreted as branch lengths or paths, and derive new interpretations using coalescent theory. I further show that the admixture tests can be interpreted as testing general properties of phylogenies, allowing extension of some ideas applications to arbitrary phylogenetic trees. The new results are used to investigate the behavior of the statistics under different models of population structure, and show how population substructure complicates inference. The results lead to simplified estimators in many cases, and I recommend to replace F 3 with the average number of pairwise differences for estimating population divergence. KEYWORDS admixture; gene flow; phylogenetics; population genetics; phylogenetic network For humans, whole-genome genotype data is now available for 23 individuals from hundreds of populations Lazaridis et al ; Yunusbayev et al. 2015, opening up the possibility to ask 5 more detailed and complex questions about our history Pickrell 6 and Reich 2014; Schraiber and Akey 2015, and stimulating the 7 development of new tools for the analysis of the joint history of 8 these populations Reich et al. 2009; Patterson et al. 2012; Pickrell 9 and Pritchard 2012; Lipson et al. 2013; Ralph and Coop 2013; 10 Hellenthal et al A simple and intuitive framework for this 11 purpose that has quickly gained in popularity are the F-statistics, 12 introduced by Reich et al. 2009, and summarized in Patterson 13 et al In that framework, inference is based on shared 14 genetic drift between sets of populations, under the premise 15 that shared drift implies a shared evolutionary history. Tools 16 based on this framework have quickly become widely used in 17 the study of human genetic history, both for ancient and modern 18 DNA Green et al. 2010; Reich et al. 2012; Lazaridis et al. 2014; 19 Haak et al. 2015; Allentoft et al Some care is required with terminology, as the F-statistics 21 sensu Reich et al are distinct, but closely related to 22 Wright s fixation indices Wright 1931; Reich et al. 2009, which 23 are also often referred to as F-statistics. Furthermore, it is nec essary to distinguish between statistics quantities calculated 25 from data and the underlying parameters which are part of the 26 modelweir and Cockerham In this paper, I will mostly discuss model parameters, and I 28 will therefore refer to them as drift indices. The term F-statistics 29 will be used when referring to the general framework introduced 30 by Reich et al. 2009, and Wright s statistics will be referred to 31 as F ST or f. 32 Most applications of the F-statistic-framework can be phrased 33 in terms of the following six questions: Treeness test: Are populations related in a tree-like fash- 35 ion? Reich et al Admixture test: Is a particular population descended from 37 multiple ancestral populations? Reich et al Admixture proportions: What are the contributions from 39 different populations to a focal population Green et al. 2010; 40 Haak et al Number of founders: How many founder populations are 42 there for a certain region? Reich et al. 2012; Lazaridis et al Copyright 2016 by the Genetics Society of America doi: /genetics.XXX.XXXXXX Manuscript compiled: Friday 26 th February, 2016% 1 bpeter@uchicago.edu 5. Complex demography: How can mixtures and splits of 45 population explain demography? Patterson et al. 2012; 46 Lipson et al Copyright Genetics, Vol. XXX, XXXX XXXX February

2 A F 3 P X ; P 1, P 2 represents the length of the external branch 70 from P X to the unique internal vertex connecting all three 71 populations. Thus, the first parameter of F 3 has an unique 72 role, whereas the other two can be switched arbitrarily. 73 F B 4 P 1, P 2 ; P 3, P 4 represents the internal branch from the 74 internal vertex connecting P 1 and P 2 to the vertex connect- 75 ing P 3 and P 4 blue. 76 B Figure 1 Population phylogeny and admixture graph. A: A population phylogeny with branches corresponding to F 2 green, F 3 yellow and F B 4 blue. B: An admixture graph extends population phylogenies by allowing gene flow red, full line and admixture events red, dotted. 6. Closest relative: What is the closest relative to a contem- 48 porary or ancient population Raghavan et al The demographic models under which these questions are 50 addressed, and that motivated the drift indices, are called popula- 51 tion phylogenies and admixture graphs. The population phylogeny 52 or population tree, is a model where populations are related in 53 a tree-like fashion Figure 1A, and it frequently serves as the null 54 model for admixture tests. The branch lengths in the population 55 phylogeny correspond to how much genetic drift occurred, so 56 that a branch that is subtended by two different populations can 57 be interpreted as the shared genetic drift between these pop- 58 ulations. The alternative model is an admixture graph Figure 59 1B, which extends the population phylogeny by allowing edges 60 that represent population mergers or a significant exchange of 61 migrants. 62 Under a population phylogeny, the three F-statistics pro- 63 posed by Reich et al. 2009, labelled F 2, F 3 and F 4 have interpre- 64 tations as branch lengths Fig. 1A between two, three and four 65 taxa, respectively: Assume populations are labelled as P 1, P 2,..., 66 then 67 F 2 P 1, P 2 corresponds to the path on the phylogeny from 68 P 1 to P If the arguments are permuted, some F-statistics will have no 77 corresponding internal branch. In particular, it can be shown 78 that in a population phylogeny, one F 4 -index will be zero, imply- 79 ing that the corresponding internal branch is missing. This is the 80 property that is used in the admixture test. For clarity, I add the 81 superscript F B 4 if I need to emphasize the interpretation of F 4 82 as a branch length, and F T 4 to emphasize the interpretation as a 83 test-statistic. For details, see the subsection on F 4 in the Methods 84 & Results section. 85 In an admixture graph, there is no longer a single branch 86 length corresponding to each F-statistic, and interpretations are 87 more complex. However, F-statistics can still be thought of 88 as the proportion of genetic drift shared between populations 89 Reich et al The basic idea exploited in addressing all 90 six questions outlined above is that under a tree model, branch 91 lengths, and thus the drift indices, must satisfy some constraints 92 Buneman 1971; Semple and Steel 2003; Reich et al The 93 two most relevant constraints are that i in a tree, all branches 94 have positive lengths tested using the F 3 -admixture test and ii 95 in a tree with four leaves, there is at most one internal branch 96 tested using the F 4 -admixture test. 97 The goal of this paper is to give a broad overview on the 98 theory, ideas and applications of F-statistics. Our starting point 99 is a brief review on how genetic drift is quantified in general, 100 and how it is measured using F 2. I then propose an alterna- 101 tive definition of F 2 that allows us to simplify applications, and 102 study them under a wide range of population structure models. 103 I then review some basic properties of distance-based phylo- 104 gentic trees, show how the admixture tests are interpreted in 105 this context, and evaluate their behavior. Many of the results 106 that are highlighted here are implicit in classical Wahlund 1928; 107 Wright 1931; Cavalli-Sforza and Edwards 1967; Felsenstein 1973; 108 Cavalli-Sforza and Piazza 1975; Felsenstein 1981; Slatkin 1991; 109 Excoffier et al and more recent work Patterson et al. 2012; 110 Pickrell and Pritchard 2012; Lipson et al. 2013, but often not 111 explicitly stated, or given in a different context. 112 Methods & Results 113 The next sections will discuss the F-statistics, introducing dif- 114 ferent interpretations and giving derivations for some useful 115 expressions. A graphical summary of the three interpretations 116 of the statistics is given in Figure 2, and the main formulas are 117 summarized in Table Throughout this paper, populations are labelled as 119 P 1, P 2,..., P i,.... Often, P X will denote an admixed population. 120 The allele frequency p i is defined as the proportion of individ- 121 uals in P i that carry a particular allele at a biallelic locus, and 122 throughout this paper I assume that all individuals are haploid. 123 However, all results hold if instead of haploid individuals, an 124 arbitrary allele of a diploid individual is used. I focus on genetic 125 drift only, and ignore the effects of mutation, selection and other 126 evolutionary forces Benjamin M Peter et al

3 F 2 P 1, P 2 F 3 P X ; P 1, P 2 F 4 P 1, P 2, P 3, P 4 Definition E[p 1 p 2 2 ] Ep X p 1 p X p 2 Ep 1 p 2 p 3 p 4 1 F 2-2 F2 P 1, P X + F 2 P 2, P X F 2 P 1, P F2 P 1, P 4 + F 2 P 2, P 3 F 2 P 1, P 3 F 2 P 2, P 4 Coalescence times Variance Varp 1 p 2 2ET 12 ET 11 ET 22 ET 1X + ET 2X ET 12 ET XX ET 14 + ET 23 ET 13 ET 24 Varp X + Covp 1, p 2 Covp 1, p X Covp 2, p X Covp 1 p 2, p 3 p 4 Branch length 2B c B d 2B c B d B c B d or as admixture test: B d B d Table 1 Summary of equations. A constant of proportionality is omitted for coalescence times and branch lengths. Derivations for F 2 are given in the main text, F 3 and F 4 are a simple result of combining Equation 16 with 20b and 24b. B c and B d correspond to the average length of the internal branch in a gene genealogy concordant and discordant with the population assignment respectively see section Gene tree branch lengths. Figure 2 Interpretation of F-statistics. F-statistics can be interpreted i as branch lengths in a population phylogeny Panels A,E,I,M, the overlap of paths in an admixture graph Panels B,F,J,N, see also Figure S1, and in terms of the internal branches of gene-genealogies see Figures 4, S2 and S3. For gene trees consistent with the population tree, the internal branch contributes positively Panels C,G,K, and for discordant branches, internal branches contribute negatively Panels D,H or zero Panel L. F 4 has two possible interpretations; depending on how the arguments are permuted relative to the tree topology, it may either reflect the length of the internal branch third row, F B 4, or a test statistic that is zero under a population phylogeny last row, F T 4. For the admixture test, the two possible gene trees contribute to the statistic with different sign, highlighting the similarity to the D-statistic Green et al and its expectation of zero in a symmetric model. Admixture, Population Structure and F-statistics 3

4 Measuring genetic drift F The purpose of F 2 is simply to measure how much genetic drift 129 occurred between two populations, i.e. to measure genetic dis- 130 similarity. For populations P 1 and P 2, F 2 is defined as Reich et al F 2 P 1, P 2 = F 2 p 1, p 2 = Ep 1 p The expectation is with respect to the evolutionary process, 133 but in practice F 2 is estimated from hundreds of thousands of 134 loci across the genome Patterson et al. 2012, which are assumed 135 to be non-independent replicates of the evolutionary history of 136 the populations. 137 Why is F 2 a useful measure of genetic drift? As it is infeasible 138 to observe changes in allele frequency directly, the effect of drift 139 is assessed indirectly, through its impact on genetic diversity. 140 Most commonly, genetic drift is quantified in terms of i the 141 variance in allele frequency, ii heterozygosity, iii probability 142 of identity by descent iv correlation or covariance between 143 individuals and v the probability of coalescence two lineages 144 having a common ancestor. In the next sections I show how 145 F 2 relates to these quantities in the the cases of a single popula- 146 tion changing through time and a pair of populations that are 147 partially isolated. 148 they had n 0 ancestral lineages, who all carry x with probability 169 p 0. Therefore, 170 PE = n 0 p0 i P n 0 = i p 0, n t = E [ p n 0 0 p ] 0, n t. 5 i=1 Equating 4 and 5 yields Equation In the present case, the only relevant cases are n t = 1, 2, since: [ ] E p 1 t p 0 ; n t = 1 = p 0 [ ] E p 2 t p 0 ; n t = 2 = p 0 f + p f, where f is the probability that two lineage sampled at time t 172 coalesce before time t This yields an expression for F 2 by conditioning on the allele frequency p 0, ] E [p 0 p t 2 p 0 [ ] ] = E p 2 0 p 0 E [2p t p 0 p 0 ] + E [p 2 t p 0 = p 2 0 2p2 0 + p 0 f + p f = f p 0 1 p 0 = 1 2 f H 0. Single population I assume a single population, measured at two time points t 0 and t, and label the two samples P 0 and P t. Then F 2 P 0, P t can be interpreted in terms of the variances of allele frequencies: Where H 0 = 2p 0 1 p 0 is the heterozygosity. Integrating over 174 Pp 0 yields: 175 F 2 P 0, P t = 1 2 f EH 0 6 F 2 P t, P 0 = E[p t p 0 2 ] = Varp t p 0 + E[p t p 0 ] 2 = Varp t + Varp 0 2Covp 0, p t = Varp t + Varp 0 2Covp 0, p 0 + p t p 0 = Varp t Varp 0. 2 Here, I used E[p t p 0 ] = Covp 0, p t p 0 = 0 to obtain 149 lines three and five. It is worth noting that this result holds for 150 any model of genetic drift where the expected allele frequency 151 is the current allele frequency and increments are independent. 152 For example, this interpretation of F 2 holds also if genetic drift 153 is modeled as a Brownian motion Cavalli-Sforza and Edwards An elegant way to introduce the use of F 2 in terms of expected 156 heterozygosities H t Figure 3B and identity by descent Figure 157 3C is the duality 158 E pt [ p n t t p 0, n t ] = En0 [ p n 0 0 p 0, n t ]. 3 This equation is due to Tavaré 1984, who also provided the 159 following intuition: Given n t individuals are sampled at time 160 t. Let E denote the event that all individuals carry allele x, 161 conditional on allele x having frequency p 0 at time t 0. There 162 are two components to this: First, the frequency will change 163 between t 0 and t, and then all n t sampled individuals need to 164 carry x. 165 In a diffusion framework, 166 PE = 1 0 y n t Pp t = y p 0, n t dy = E[p n t t p 0, n t ]. 4 On the other hand, one may argue using the coalescent: For 167 E to occur, all n t samples need to carry the x allele. At time t 0, 168 and it can be seen that F 2 increases as a function of f Figure 176 3C. This equation can also be interpreted in terms of prob- 177 abilities of identity by descent: f is the probability that two 178 individuals are identical by descent in P t given their ancestors 179 were not identical by descent in P 0 Wright 1931, and EH 0 is 180 the probability two individuals are not identical in P Furthermore, EH t = 1 f EH 0 Equation 3.4 in Wakeley and therefore 183 EH 0 EH t = EH f = 2F2 P t, P 0. 7 which shows that F 2 measures the decay of heterozygosity Fig- 184 ure 3A. A similar argument was used by Lipson et al to 185 estimate ancestral heterozygosities and to linearize F These equations can be rearranged to make the connection between other measures of genetic drift and F 2 more explicit: EH t = EH 0 2F 2 P 0, P t 8a = EH 0 2Varp t Varp 0 8b = EH 0 1 f 8c Pairs of populations Equations 8b and 8c describing the decay 187 of heterozygosity are of course well known by population 188 geneticists, having been established by Wright In struc- 189 tured populations, very similar relationships exist when the 190 number of heterozygotes expected from the overall allele fre- 191 quency, H obs is compared with the number of heterozygotes 192 present due to differences in allele frequencies between popula- 193 tions H exp Wahlund 1928; Wright In fact, already Wahlund showed by considering the genotypes of all possible matings in two subpopulations Table 3 in Wahlund 1928 that for a population made up of two subpopulations with equal proportions, the proportion of heterozygotes is reduced by H obs = H exp 2p 1 p Benjamin M Peter et al

5 A B of the established inbreeding coefficient f and fixation index 205 F ST? Recall that Wright motivated f and F ST as correlation co- 206 efficients between alleles Wright 1921, Correlation coef- 207 ficients have the advantage that they are easy to interpret, as, 208 e.g. F ST = 0 implies panmixia and F ST = 1 implies complete 209 divergence between subpopulations. In contrast, F 2 depends 210 on allele frequencies and is highest for intermediate frequency 211 alleles. However, F 2 has an interpretation as a covariance, making 212 it simpler and mathematically more convenient to work with. In 213 particular, variances and covariances are frequently partitioned 214 into components due to different effects using techniques such 215 as analysis of variance and analysis of covariance e.g. Excoffier 216 et al C Figure 3 Measures of genetic drift in a single population. Interpretations of F 2 in terms of A the increase in allele frequency variance, B the decrease in heterozygosity and C f, which can be interpreted as probability of coalescence of two lineages, or the probability that they are identical by descent. from which follows that 195 F 2 P 1, P 2 = EH exp EH obs. 9 2 Furthermore, Varp 1 p 2 = Ep 1 p 2 2 [ Ep 1 p 2 ] 2, 196 but Ep 1 p 2 = 0 and therefore 197 F 2 P 1, P 2 = Varp 1 p 2 10 Lastly, the original definition of F 2 was as the numerator of 198 F ST Reich et al. 2009, but F ST can be written as F ST = 2p 1 p 2 2 EH exp. 199 form which follows 200 F 2 P 1, P 2 = 1 2 F STEH exp. 11 Covariance interpretation To see how F 2 can be interpreted as a covariance, define X i and X j as indicator variables that two individuals from the same population sample have the A allele, which has frequency p 1 in one, and p 2 in the other population. If individuals are equally likely to be sampled from either population, EX i = EX j = 1 2 p p 2 EX i X j = 1 2 p p2 2 CovX i, X j =EX i X j EX i EX j = 1 4 p 1 p 2 2 = 1 4 F 2P 1, P 2 Justification for F 2 The preceding arguments show how the 201 usage of F 2 for both single and structured populations can be 202 justified by the similar effects of F 2 on different measures of 203 genetic drift. However, what is the benefit of using F 2 instead 204 F 2 as branch length Reich et al and Patterson et al proposed to partition drift as previously established, mea- 219 sured by covariance, allele frequency variance, or decrease in 220 heterozygosity between different populations into contribu- 221 tion on the different branches of a population phylogeny. This 222 model has been studied by Cavalli-Sforza and Edwards and Felsenstein 1973 in the context of a Brownian motion pro- 224 cess. In this model, drift on independent branches is assumed 225 to be independent, meaning that the variances can simply be 226 added. This is what is referred to as the additivity property of F Patterson et al To illustrate the additivity property, consider two populations P 1 and P 2 that split recently from a common ancestral population P 0 Figure 2A. In this case, p 1 and p 2 are assumed to be independent conditional on p 0, and therefore Covp 1, p 2 = Varp 0. Then, using 2 and 10, F 2 P 1, P 2 = Varp 1 p 2 = Varp 1 + Varp 2 2Covp 1, p 2 = Varp 1 + Varp 2 2Varp 0 = F 2 P 1, P 0 + F 2 P 2, P 0. Alternative proofs of this statement and more detailed reasoning 229 behind the additivity assumption can be found in Cavalli-Sforza 230 and Edwards 1967; Felsenstein 1973; Reich et al or 231 Patterson et al Lineages are not independent in an admixture graph, and 233 so this approach cannot be used. Reich et al approached 234 this by conditioning on the possible population trees that are 235 consistent with an admixture scenario. In particular, they pro- 236 posed a framework of counting the possible paths through the 237 graph Reich et al. 2009; Patterson et al An example of 238 this representation for F 2 in a simple admixture graph is given 239 in Figure S1, with the result summarized in Figure 2B. Detailed 240 motivation behind this visualization approach is given in Ap- 241 pendix 2 of Patterson et al In brief, the reasoning is as 242 follows: Recall that F 2 P 1, P 2 = Ep 1 p 2 p 1 p 2, and inter- 243 pret the two terms in parentheses as two paths between P 1 and 244 P 2, and F 2 as the overlap of these two paths. In a population 245 phylogeny, there is only one possible path, and the two paths 246 are always the same, therefore F 2 is the sum of the length of all 247 the branches connecting the two populations. However, if there 248 is admixture, as in Figure 2B, both paths choose independently 249 which admixture edge they follow. With probability α they will 250 go left, and with probability β = 1 α they go right. Thus, 251 F 2 can be interpreted by enumerating all possible choices for 252 the two paths, resulting in three possible combinations of paths 253 on the trees Figure S1, and the branches included will differ 254 depending on which path is chosen, so that the final F 2 is made 255 up average of the path overlap in the topologies, weighted by 256 the probabilities of the topologies. 257 Admixture, Population Structure and F-statistics 5

6 However, one drawback of this approach is that it scales 258 where the second sum is over all pairs in P 1. This equation is 297 quadratically with the number of admixture events, making cal- 259 equivalent to Equation 22 in Felsenstein culations cumbersome when the number of admixture events is 260 As F 2 ˆp 1, ˆp 2 = F 2 ˆp 2, ˆp 1, I can switch the labels, and obtain 299 large. More importantly, this approach is restricted to panmictic 261 the same expression for a second population P 2 = {I 2,i, i = 300 subpopulations, and cannot be used when the population model 262 0,..., n 2 } Taking the average over all I 2,j yields 301 cannot be represented as a weighted average of trees. 263 Gene tree interpretation For this reason, I propose to redefine 264 F 2 using coalescent theory Wakeley Instead of allele 265 frequencies on a fixed admixture graph, coalescent theory tracks 266 the ancestors of a sample of individuals, tracing their history 267 back to their most recent common ancestor. The resulting tree is 268 called a gene tree or coalescent tree. Gene trees vary between 269 loci, and will often have a different topology from the population 270 phylogeny, but they are nevertheless highly informative about 271 a population s history. Moreover, expected coalescence times 272 and expected branch lengths are easily calculated under a wide 273 array of neutral demographic models. 274 In a seminal paper, Slatkin 1991 showed how F ST can be interpreted in terms of the expected coalescence times of gene trees: F ST = ET B ET W, ET B where ET B and ET W are the expected coalescence times of two 275 lineages sampled in two different and the same population, 276 respectively. 277 Unsurprisingly, given the close relationship between F 2 and 278 F ST, an analogous expression exists for F 2 P 1, P 2 : The deriva- 279 tion starts by considering F 2 for two samples of size one. I then 280 express F 2 for arbitrary sample sizes in terms of individual-level 281 F 2, and obtain a sample-size independent expression by letting 282 the sample size n go to infinity. 283 In this framework, I assume that mutation is rare such that 284 there is at most one mutation at any locus. In a sample of size 285 two, let I i be an indicator random variable that individual i has 286 a particular allele. For two individuals, F 2 I 1, I 2 = 1 implies 287 I 1 = I 2, whereas F 2 I 1, I 2 = 0 implies I 1 = I 2. Thus, F 2 I 1, I is another indicator random variable with parameter equal to 289 the probability that a mutation happened on the tree branch 290 between I 1 and I Now, instead of a single individual I 1, consider a sample of n 1 individuals: P 1 = {I 1,1 I 1,2,..., I 1,n1 }. The sample allele frequency is ˆp 1 = n1 1 i I 1,i. And the sample-f 2 is 1 1 F 2 ˆp 1, I 2 =F 2 n 1 [ 1 =E [ 1 =E [ + E n 1 I 1,i, I 2 = E i=1 n 1 n 2 1 I 1,i I 2 i=1 n 2 1 I 2 1,i + 2 n 2 1 I 1,i I 1,j 2 n 1 I 1,i I 2 + I 2 2 n 1 I 2 1,i 2 n 1 I 1,i I 2 + n 1 2 n 2 1 I 1,i I 1,j n 1 1 n 2 1 ] I2 2 n 1 ] I 2 1,i The first three terms can be grouped into n 1 terms of the form 292 F 2 I 1,i, I 2, and the last two terms can be grouped into n 1 2 terms 293 of the form F 2 I 1,i, I 1,j, one for each possible pair of samples in 294 P Therefore, 296 F 2 ˆp 1, I 2 = 1 n 1 i F 2 I 1,i, I 2 1 n 2 F 2 I 1,i, I 1,j 12 1 i<j ] F 2 ˆp 1, ˆp 2 = 1 n 1 F 2 I 1,i, I 2,j i 1 n 2 F 2 I 1,i, I 1,j 1 1 i<j n 2 F 2 I 2,i, I 2,j. 2 i<j 13 Thus, I can write F 2 between the two populations as the average 302 number of differences between individuals from different pop- 303 ulations, minus some terms including differences within each 304 sample. 305 Equation 13 is quite general, making no assumptions on 306 where samples are placed on a tree. In a coalescence framework, 307 it is useful to make the assumptions that all individuals from the 308 same population have the same branch length distribution, i.e. 309 F 2 I x1,i, I y1,j = F 2 I x2,i, I y2,j = for all pairs of samples x 1, x and y 1, y 2 from populations P i and P j. Secondly, I assume that 311 all samples correspond to the leaves of the tree, so that I can esti- 312 mate branch lengths in terms of the time to a common ancestor 313 T ij. Finally, I assume that mutations occur at a constant rate of 314 θ/2 on each branch. Taken together, these assumptions imply 315 that F 2 I i,k, I j,l = θet ij for all individuals from populations 316 P i, P j. this simplifies 13 to 317 F 2 ˆp 1, ˆp 2 = θ ET n1 ET n2 ET which, for the cases of n = 1, 2 was also derived by Petkova 318 et al In some applications, F 2 might be calculated only 319 for segregating site in a large sample. As the expected number 320 of segregating sites is 2 θ T tot with T tot denoting the total tree 321 length, taking the limit where θ 0 is meaningful Slatkin ; Petkova et al. 2014: 323 F 2 ˆp 1, ˆp 2 = 2 ET T tot 1 1n1 ET n2 ET In either of these equations, Ttot or θ act as a constant of pro- 324 portionality that is the same for all statistics calculated from the 325 same data. Since interest is focused on the relative magnitude 326 of F 2, or whether a sum of F 2 -values is different from zero, this 327 constant has no impact on inference. 328 Furthermore, a population-level quantity is obtained by tak- 329 ing the limit when the number of individuals n 1 and n 2 go to 330 infinity: 331 F 2 P 1, P 2 = lim 2 ˆp 1, ˆp 2 = θ ET 12 ET 11 + ET 12. n 1,n Unlike F ST, the mutation parameter θ does not cancel. How- 332 ever, for most applications, the absolute magnitude of F 2 is of 333 little interest, since only the sign of the statistics are used for 334 most tests. In other applications F-statistics with presumably 335 the same θ Reich et al are compared. In these cases, θ 336 can be regarded as a constant of proportionality, and will not Benjamin M Peter et al

7 change the theoretical properties of the F-statistics. It will, how- 338 ever, influence statistical properties, as a larger θ implies more 339 mutations and hence more data. 340 Estimator for F 2 An estimator for F 2 can be derived using the 341 average number of pairwise differences π ij as an estimator for 342 θt ij Tajima Thus, a natural estimator for F 2 is 343 A. Equation B. Concordant genealogy ˆF 2 P 1, P 2 = π 12 π 11 + π C. Discordant genealogy Strikingly, the estimator in Eq. 17 is equivalent to that given 344 by Reich et al in terms of the sample allele frequency ˆp i 345 and sample size n i : 346 F 2 P 1, P 2 =π 12 π 11 /2 π 22 /2 = [ ˆp 1 1 ˆp 2 + ˆp 2 1 ˆp 1 ] n ˆp 1 1 ˆp 1 1 n 1 1 ˆp n 21 ˆp 2 2 n 2 1 = ˆp ˆp n ˆp 1 n 1 ˆp ˆp ˆp n 1 1 n 2 1 = ˆp 1 ˆp 2 2 ˆp 11 ˆp 1 n 1 1 ˆp 21 q2. n 2 1 Figure 4 Schematic explanation how F 2 behaves conditioned on gene tree. Blue terms and branches correspond to positive contributions, whereas red branches and terms are subtracted. Labels represent individuals randomly sampled from that population. External branches cancel out, so only the internal branches have non-zero contribution to F 2. In the concordant genealogy Panel B, the contribution is positive with weight 2, and in the discordant genealogy Panel C, it is negative with weight 1. The mutation rate as constant of proportionality is omitted. The last line is the same as Equation 10 in the Appendix of Reich 347 et al However, while the estimators are identical, the underlying 349 modelling assumptions are different: The original definition 350 only considered loci that were segregating in an ancestral popu- 351 lation; loci not segregating there were discarded. Since ancestral 352 populations are usually unsampled, this is often replaced by 353 ascertainment in an outgroup Patterson et al. 2012; Lipson et al In contrast, Equation 17 assumes that all markers are used, 355 which is the more natural interpretation for sequence data. 356 Gene tree branch lengths An important feature of Equation is that it only depends on the expected coalescence times be- 358 tween pairs of lineages. Thus, the behavior of F 2 can be fully 359 characterized by considering a sample of size four, with two 360 random individuals taken from each population. This is all 361 that is needed to study the joint distribution of T 12, T 11 and T 22, 362 and hence F 2. By linearity of expectation, larger samples can be 363 accommodated by summing the expectations over all possible 364 quartets. 365 For a sample of size four with two pairs, there are only two 366 possible unrooted tree topologies. One, where the lineages 367 from the same population are more closely related to each other 368 called concordant topology, T 2 c and one where lineages from 369 different populations coalesce first which I will refer to as discor- 370 danat topology T 2 d. The superscripts refers to the topologies 371 being for F 2, and I will discard them in cases where no ambiguity 372 arises. 373 Conditioning on the topology yields F 2 P 1, P 2 =E [ F 2 P 1, P 2 T ] =PT c E[F 2 P 1, P 2 T c ] + PT d E[F 2 P 1, P 2 T d ]. Figure 4 contains graphical representations for 374 E[F 2 P 1, P 2 T c ] Panel B and E[F 2 P 1, P 2 T d ] Panel C, 375 respectively. 376 In this representation, T 12 corresponds to a path from a ran- 377 dom individual from P 1 to a random individual from P 2, and 378 T 11 represents the path between the two samples from P For T c the internal branch is always included in T 12, but never 380 in T 11 or T 22. External branches, on the other hand, are included 381 with 50% probability in T 12 on any path through the tree. T and T 22, on the other hand, only consist of external branches, 383 and the lengths of the external branches cancel. 384 On the other hand, for T d, the internal branch is always in- 385 cluded in T 11 and T 22, but only half the time in T 12. Thus, they 386 contribute negatively to F 2, but only with half the magnitude of 387 T c. As for T c, each T contains exactly two external branches, and 388 so they will cancel out. 389 An interesting way to represent F 2 is therefore in terms of the 390 internal branches over all possible gene genealogies. Denote the 391 unconditional average length of the internal branch of T c as B c, 392 and the average length of the internal branch in T d as B d. Then, 393 F 2 can be written in terms of these branch lengths as 394 F 2 P 1, P 2 = θ 2B c 1B d, 18 resulting in the representation given in Figure 2C-D. 395 As a brief sanity check, consider the case of a population 396 without structure. In this case, the branch length is independent 397 of the topology and T d is twice as likely as T c and hence B d = 398 2B c, from which it follows that F 2 will be zero, as expected in a 399 randomly mating population 400 This argument can be transformed from branch lengths to ob- 401 served mutations by recalling that mutations occur on a branch 402 at a rate proportional to its length. F 2 is increased by doubletons 403 that support the assignment of populations i.e. the two lineages 404 from the same population have the same allele, but reduced by 405 doubletons shared by individuals from different populations. 406 All other mutations have a contribution of zero. 407 Admixture, Population Structure and F-statistics 7

8 Testing treeness 408 Many applications consider tens or even hundreds of popula- 409 tions simultaneously Patterson et al. 2012; Pickrell and Pritchard ; Haak et al. 2015; Yunusbayev et al. 2015, with the goal to in- 411 fer where and between which populations admixture occurred. 412 Using F-statistics, the approach is to interpret F 2 P 1, P 2 as a 413 measure of dissimilarity between P 1 and P 2, as a large F 2 -value 414 implies that populations are highly diverged. Thus, the strat- 415 egy is to calculate all pairwise F 2 indices between populations, 416 combine them into a dissimilarity matrix, and ask if that matrix is 417 consistent with a tree. 418 One way to approach this question is by using phylogenetic 419 theory: Many classical algorithms have been proposed that use 420 a measure of dissimilarity to generate a tree Fitch et al. 1967; 421 Saitou and Nei 1987; Semple and Steel 2003; Felsenstein 2004, 422 and what properties a general dissimilarity matrix needs to have 423 in order to be consistent with a tree Buneman 1971; Cavalli- 424 Sforza and Piazza 1975, in which case the matrix is also called a 425 tree metric Semple and Steel Thus, testing for admixture 426 can be thought of as testing treeness. 427 For a dissimilarity matrix to be consistent with a tree, there 428 are two central properties it needs to satisfy: First, the length of 429 all branches has to be positive. This is strictly not necessary for 430 phylogenetic trees, and some algorithms may return negative 431 branch lengths e.g. Saitou and Nei 1987; however, since in our 432 case branches have an interpretation of genetic drift, negative 433 genetic drift is biologically nonsensical, and therefore negative 434 branches should be interpreted as a violation of the modeling 435 assumptions, and hence treeness. 436 The second property of a tree metric important in the present context is a bit more involved: A dissimilarity matrix written in terms of F 2 is consistent with a tree if for any four populations P i, P j, P k and P l, F 2 P i, P j + F 2 P k, P l max F 2 P i, P k + F 2 P j,p l, F 2 P i, P l + F 2 P j, P k 19 that is, if the sums of pairs of distances are compared, two of 437 these sums will be the same, and no smaller than the third. 438 This theorem, due to Buneman 1971, 1974 is called the four- 439 point condition or sometimes, more modestly, the fundamental 440 theorem of phylogenetics. A proof can be found in chapter 7 of 441 Semple and Steel Informally, this statement can be understood by noticing that 443 on a tree, two of the pairs of distances will include the internal 444 branch, whereas the third one will not, and therefore be shorter. 445 Thus, the four-point condition can be colloquially rephrased as 446 any four taxa tree has at most one internal branch. 447 Why are these properties useful? It turns out that the ad- 448 mixture tests based on F-statistics can be interpreted as tests 449 of these properties: The F 3 -test can be interpreted as a test for 450 the positivity of a branch; and the F 4 as a test of the four-point 451 condition. Thus, the working of the two test statistics can be 452 interpreted in terms of fundamental properties of phylogenetic 453 trees, with the immediate consequence that they may be applied 454 as treeness-tests for arbitrary dissimilarity matrices. 455 An early test of treeness, based on a likelihood ratio, was 456 proposed by Cavalli-Sforza and Piazza 1975: They compared 457 the likelihood of the observed F 2 -matrix matrix to that induced 458 by the best fitting tree assuming Brownian motion, rejecting 459 the null hypothesis if the tree-likelihood is much lower than 460 that of the empirical matrix. In practice, however, finding the 461 best-fitting tree is a challenging problem, especially for large 462 trees Felsenstein 2004 and so the likelihood test proved to be 463 difficult to apply. From that perspective, the F 3 and F 4 -tests 464 provide a convenient alternative: Since treeness implies that all 465 subsets of taxa are also trees, the ingenious idea of Reich et al was that rejection of treeness for subtrees of size three 467 for F 3 and four for F 4 is sufficient to reject treeness for the 468 entire tree. Furthermore, tests on these subsets also pinpoint the 469 populations involved in the non-tree-like history. 470 F In the previous section, I showed how F 2 can be interpreted 472 as a branch length, as an overlap of paths, or in terms of gene 473 trees Figure 2. Furthermore, I gave expressions in terms of 474 coalescence times, allele frequency variances and internal branch 475 lengths of gene trees. In this section, I give analogous results for 476 F Reich et al defined F 3 as: 478 F 3 P X ; P 1, P 2 = F 3 p X ; p 1, p 2 = Ep X p 1 p X p 2 20a with the goal to test whether P X is admixed. Recalling the path 479 interpretation detailed in Patterson et al. 2012, F 3 can be inter- 480 preted as the shared portion of the paths from P X to P 1 with the 481 path from P X to P 1. In a population phylogeny Figure 2E this 482 corresponds to the branch between P X and the internal node. 483 Equivalently, F 3 can also be written in terms of F 2 Reich et al : 485 F 3 P X ; P 1, P 2 = 1 2 F 2 P X, P 1 + F 2 P X, P 2 F 2 P 1, P 2. 20b If F 2 in Equation 20b is generalized to an arbitrary tree metric, 486 Equation 20b is known as the Gromov product Semple and 487 Steel 2003 in phylogenetics. The Gromov product is a com- 488 monly used operation in classical phylogenetic algorithms to 489 calculate the length of the portion of a branch shared between P and P 2 Fitch et al. 1967; Felsenstein 1973; Saitou and Nei 1987; 491 consistent with the notion that F 3 is the length of an external 492 branch in a population phylogeny. 493 In an admixture graph, there is no longer a single external 494 branch; instead all possible trees have to be considered, and F is the weighted average of paths through the admixture graph 496 Figure 2F. 497 Combining Equations 16 and 20b gives an expression of F 3 in 498 terms of expected coalescence times: 499 F 3 P X ; P 1, P 2 = θ 2 ET 1X + ET 2X ET 12 ET XX. 20c Similarly, an expression in terms of variances is obtained by 500 combining Equation 2 with 20b: 501 F 3 P X ; P 1, P 2 = Varp X + Covp 1, p 2 Covp 1, p X Covp 2, p X. 20d which was also noted by Pickrell and Pritchard Outgroup-F 3 statistics A simple application of the interpreta- 503 tion of F 3 as a shared branch length are the outgroup -F statistics proposed by Raghavan et al For an unknown 505 population P U, they wanted to find the most closely related pop- 506 ulation from a panel of k extant populations {P i, i = 1, 2,... k} Benjamin M Peter et al

9 They did this by calculating F 3 P O ; P U, P i, where P O is an out- 508 group population that was assumed widely diverged from P U 509 and all populations in the panel. This measures the shared drift 510 or shared branch of P U with the populations from the panel, 511 and high F 3 -values imply close relatedness. 512 However, using Equation 20c, the outgroup-f 3 -statistic can 513 be written as 514 F 3 P O ; P U, P i ET UO + ET io ET Ui ET OO. 21 Out of these four terms, ET UO and ET OO do not depend on P i. 515 Furthermore, if P O is truly an outgroup, then all ET io should 516 be the same, as pairs of individuals from the panel population 517 and the outgroup can only coalesce once they are in the joint an- 518 cestral population. Therefore, only the term ET Ui is expected to 519 vary between different panel populations, suggesting that using 520 the number of pairwise differences, π Ui, is largely equivalent to 521 using F 3 P O ; P U, P i. I confirm this in Figure 5A by calculating 522 outgroup-f 3 and π iu for a set of increasingly divergent popu- 523 lations, with each population having its own size, sample size 524 and sequencing error probability. Linear regression confirms 525 the visual picture that π iu has a higher correlation with diver- 526 gence time R 2 = 0.90 than F 3 R 2 = Hence, the number 527 of pairwise differences may be a better metric for population 528 divergence than F F 3 admixture test However, F 3 is motivated and primarily used 530 as an admixture test Reich et al In this context, the null 531 hypothesis is that F 3 is non-negative, i.e. the null hypothesis 532 is that the data is generated from a phylogenetic tree that has 533 positive edge lengths. If this is not the case, the null hypothesis 534 is reject in favor of the more complex admixture graph. From 535 Figure 2F it may be seen that drift on the path on the internal 536 branches red contributes negatively to F 3. If these branches are 537 long enough compared to the branch after the admixture event 538 blue, then F 3 will be negative. For the simplest scenario where 539 P X is admixed between P 1 and P 2, Reich et al provided 540 a condition when this is the case Equation 20 in Supplement of Reich et al However, since this condition involves 542 F-statistics with internal, unobserved populations, it cannot 543 be used in practical applications. A more useful condition is 544 obtained using Equation 20c. 545 In the simplest admixture model, an ancestral population 546 splits into P 1 and P 2 at time t r. At time t 1, the populations mix to 547 form P X, such that with probability α, individuals in P X descend 548 from individuals from P 1, and with probability 1 α, they 549 descend from P 2 See Fig. 7 for an illustration. In this case, 550 F 3 P X ; P 1, P 2 is negative if t 1 < 2α1 α, 22 1 c x t r where c x is the probability two individuals sampled in P X have a common ancestor before t 1. For a randomly mating population with changing size Nt, t1 c x = 1 exp 0 1 Ns ds. Thus, the power F 3 to detect admixture is large if the admixture proportion is close to fifty percent if the ratio between the times of the original split to the time 554 of secondary contact is large, and if the probability of coalescence before the admixture event 556 in P X is small, i.e. the size of P X is large. 557 A more general condition for negativity of F 3 is obtained 558 by considering the internal branches of the possible gene tree 559 topologies, analogously to that given for F 2 in the section Gene 560 tree branch lengths. Since Equation 20c includes ET XX, only two 561 individuals from P X are needed, and one each from P 1 and P to study the joint distribution of all terms in 20c. The minimal 563 case therefore contains again just four samples Figure S Furthermore, P 1 and P 2 are exchangeable, and thus there are 565 again just two distinct gene genealogies, a concordant one T 3 c 566 where the two lineages from P X are most closely related, and a 567 discordant genealogy T 3 d where the lineages from P X merge 568 first with the other two lineages. A similar argument as that 569 for F 2 shows presented in Figure S2 that F 3 can be written as a 570 function of just the internal branches in the topologies: 571 F 3 P X ; P 1, P 2 = θ 2B c B d, 23 where B c and B d are the lengths of the internal branches in T 3 c 572 and T 3 d, respectively, and similar to F 2, concordant branches 573 have twice the weight of discordant ones. Again, the case of 574 all individuals coming from a single populations serves as a 575 sanity check: In this case T d is twice as likely as T c, and all 576 branches are expected to have the same length, resulting in F being zero. However, for F 3 to be negative, notice that B d needs 578 to be more than two times longer than B c. Since mutations 579 are proportional to B d and B c, F 3 can be interpreted as a test 580 whether mutations that agree with the population tree are more 581 than twice as common than mutations that disagree with it. 582 I performed a small simulation study to test the accuracy of 583 Equation 22. Parameters were chosen such that F 3 has a negative 584 expectation for α > 0.05, and I find that the predicted F 3 fit very 585 well with the simulations Figure 5B. 586 F The second admixture statistic, F 4, is defined as Reich et al : 589 F 4 P 1, P 2 ; P 3, P 4 = F 4 p 1, p 2 ; p 3, p 4 = E[p 1 p 2 p 3 p 4 ]. 24a Similarly to F 3, F 4 can be written as a linear combination of F 2 : 590 F 4 P 1, P 2 ; P 3, P 4 = 1 24b F 2 2 P 1, P 4 + F 2 P 2, P 3 F 2 P 1, P 3 F 2 P 2, P 4. which leads to 591 ET 14 + ET 23 ET 13 BET c F 4 P 1, P 2 ; P 3, P 4 = θ 2 As four populations are involved, there are 4! = 24 possible ways of arranging the populations in Equation 24a. However, there are four possible permutations of arguments that will lead to identical values, leaving only six unique F 4 -values for any four populations. Furthermore, these six values come in pairs that have the same absolute value, and a different sign, i. e. F 4 P 1, P 2 ; P 3, P 4 = F 4 P 1, P 2 ; P 4, P 3, leaving only three unique absolute values, which correspond to the tree possible tree topologies. Out of these three, one F 4 can be written as the sum of the other two, leaving just two independent possibilities: F 4 P 1, P 2 ; P 3, P 4 + F 4 P 1, P 3 ; P 2, P 4 = F 4 P 1, P 4 ; P 2, P 3. Admixture, Population Structure and F-statistics 9

10 Figure 5 Simulation results. A: Outgroup-F 3 statistics yellow and π iu white for a panel of populations with linearly increasing divergence time. Both statistics are scaled to have the same range, with the first divergence between the most closely related populations set to zero. F 3 is inverted, so that it increases with distance. B: Simulated boxplots and predictedblue F 3 -statistics under a simple admixture model. C: Comparison of F 4 -ratio yellow triangles, Equation 29 and ratio of differencesblack circles, Equation 31. As for F 3, Equation 24b can be generalized by replacing F with an arbitrary tree metric. In this case, Equation 24b is known 593 as a tree split Buneman 1971, as it measures the length of the 594 overlap of the branch lengths between the two pairs. As there 595 are two independent F 4 -indices for a fixed tree, there are two 596 different interpretations for the F 4 -indices. Consider the tree 597 from Fig. 1A: F 4 P 1, P 2 ; P 3, P 4 can be interpreted as the overlap 598 between the paths from P 1 to P 2 and from P 3 to P 4. However, 599 these paths do not overlap in Fig. 1A, and therefore F 4 = This is how F 4 is used as a test statistic. On the other hand, 601 F 4 P 1, P 3 ; P 2, P 4 measures the overlap between the paths from 602 P 1 to P 3 and from P 2 to P 4, which is the internal branch in Fig A, and will be positive. 604 It is cumbersome that the interpretation of F 4 depends on the ordering of its arguments. In order to make our intention clear, instead of switching the arguments around for the two interpretations, I introduce the superscripts T for test and B for branch length: F T 4 P 1, P 2 ; P 3, P 4 = F 4 P 1, P 2 ; P 3, P 4 25a F B 4 P 1, P 2 ; P 3, P 4 = F 4 P 1, P 3 ; P 2, P 4 25b Four-point-condition and F 4 Tree splits, and hence F 4, are closely 605 related to the four-point condition Buneman 1971, 1974, which, 606 informally, states that a sub-tree with four populations will 607 have at most one internal branch. Thus, if data is consistent with 608 a tree, F B 4 will be the length of that branch, and F T 4 will be zero. 609 In Figure 2, the third row Panels I-L correspond to the internal 610 branch, and the last row Panels M-P to the zero -branch. 611 Thus, in the context of testing for admixture, testing that F is zero is equivalent to checking whether there is in fact only 613 a single internal branch. If that is not the case, the population 614 phylogeny is rejected. This statement can be generalized to 615 arbitrary tree metrics: the four-point condition Buneman can be written as 617 F 2 P 1, P 2 + F 2 P 3, P 4 [ ] min F 2 P 1, P 3 + F 2 P 2, P 4, F 2 P 1, P 4 + F 2 P 2, P 3 26 for any permutations of the samples. This implies that two of 618 the sums need to be the same, and larger than the third. The 619 claim is that if the four-point condition holds, at least one of 620 the F 4 -values will be zero, and the others will have the same 621 absolute value. 622 Without loss of generality, assume that F 2 P 1, P 2 + F 2 P 3, P 4 F 2 P 1, P 3 + F 2 P 2, P 4 F 2 P 1, P 3 + F 2 P 2, P 4 = F 2 P 1, P 4 + F 2 P 2, P 3 Simply plugging this into the three possible F 4 -equations yields F 4 P 1, P 2 ; P 3, P 4 = 0 F 4 P 1, P 3 ; P 2, P 4 = k F 4 P 1, P 4 ; P 2, P 3 = k where k = F 2 P 1, P 3 + F 2 P 2, P 4 F 2 P 1, P 2 + F 2 P 3, P It is worth noting that the converse is false. If F 2 P 1, P 2 + F 2 P 3, P 4 > F 2 P 1, P 3 + F 2 P 2, P 4 F 2 P 1, P 3 + F 2 P 2, P 4 = F 2 P 1, P 4 + F 2 P 2, P 3 the four-point condition is violated, but F 4 P 1, P 2 ; P 3, P 4 is still 624 zero, and the other two F 4 -values have the same magnitude Benjamin M Peter et al

Mathematical models in population genetics II

Mathematical models in population genetics II Mathematical models in population genetics II Anand Bhaskar Evolutionary Biology and Theory of Computing Bootcamp January 1, 014 Quick recap Large discrete-time randomly mating Wright-Fisher population

More information

Population Structure

Population Structure Ch 4: Population Subdivision Population Structure v most natural populations exist across a landscape (or seascape) that is more or less divided into areas of suitable habitat v to the extent that populations

More information

Q1) Explain how background selection and genetic hitchhiking could explain the positive correlation between genetic diversity and recombination rate.

Q1) Explain how background selection and genetic hitchhiking could explain the positive correlation between genetic diversity and recombination rate. OEB 242 Exam Practice Problems Answer Key Q1) Explain how background selection and genetic hitchhiking could explain the positive correlation between genetic diversity and recombination rate. First, recall

More information

Supplemental Information Likelihood-based inference in isolation-by-distance models using the spatial distribution of low-frequency alleles

Supplemental Information Likelihood-based inference in isolation-by-distance models using the spatial distribution of low-frequency alleles Supplemental Information Likelihood-based inference in isolation-by-distance models using the spatial distribution of low-frequency alleles John Novembre and Montgomery Slatkin Supplementary Methods To

More information

Solutions to Even-Numbered Exercises to accompany An Introduction to Population Genetics: Theory and Applications Rasmus Nielsen Montgomery Slatkin

Solutions to Even-Numbered Exercises to accompany An Introduction to Population Genetics: Theory and Applications Rasmus Nielsen Montgomery Slatkin Solutions to Even-Numbered Exercises to accompany An Introduction to Population Genetics: Theory and Applications Rasmus Nielsen Montgomery Slatkin CHAPTER 1 1.2 The expected homozygosity, given allele

More information

Stochastic Demography, Coalescents, and Effective Population Size

Stochastic Demography, Coalescents, and Effective Population Size Demography Stochastic Demography, Coalescents, and Effective Population Size Steve Krone University of Idaho Department of Mathematics & IBEST Demographic effects bottlenecks, expansion, fluctuating population

More information

The Wright-Fisher Model and Genetic Drift

The Wright-Fisher Model and Genetic Drift The Wright-Fisher Model and Genetic Drift January 22, 2015 1 1 Hardy-Weinberg Equilibrium Our goal is to understand the dynamics of allele and genotype frequencies in an infinite, randomlymating population

More information

Supplementary Materials: Efficient moment-based inference of admixture parameters and sources of gene flow

Supplementary Materials: Efficient moment-based inference of admixture parameters and sources of gene flow Supplementary Materials: Efficient moment-based inference of admixture parameters and sources of gene flow Mark Lipson, Po-Ru Loh, Alex Levin, David Reich, Nick Patterson, and Bonnie Berger 41 Surui Karitiana

More information

Coalescent based demographic inference. Daniel Wegmann University of Fribourg

Coalescent based demographic inference. Daniel Wegmann University of Fribourg Coalescent based demographic inference Daniel Wegmann University of Fribourg Introduction The current genetic diversity is the outcome of past evolutionary processes. Hence, we can use genetic diversity

More information

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics - in deriving a phylogeny our goal is simply to reconstruct the historical relationships between a group of taxa. - before we review the

More information

Demography April 10, 2015

Demography April 10, 2015 Demography April 0, 205 Effective Population Size The Wright-Fisher model makes a number of strong assumptions which are clearly violated in many populations. For example, it is unlikely that any population

More information

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic analysis Phylogenetic Basics: Biological

More information

Populations in statistical genetics

Populations in statistical genetics Populations in statistical genetics What are they, and how can we infer them from whole genome data? Daniel Lawson Heilbronn Institute, University of Bristol www.paintmychromosomes.com Work with: January

More information

I of a gene sampled from a randomly mating popdation,

I of a gene sampled from a randomly mating popdation, Copyright 0 1987 by the Genetics Society of America Average Number of Nucleotide Differences in a From a Single Subpopulation: A Test for Population Subdivision Curtis Strobeck Department of Zoology, University

More information

Bootstrap confidence levels for phylogenetic trees B. Efron, E. Halloran, and S. Holmes, 1996

Bootstrap confidence levels for phylogenetic trees B. Efron, E. Halloran, and S. Holmes, 1996 Bootstrap confidence levels for phylogenetic trees B. Efron, E. Halloran, and S. Holmes, 1996 Following Confidence limits on phylogenies: an approach using the bootstrap, J. Felsenstein, 1985 1 I. Short

More information

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees Constructing Evolutionary/Phylogenetic Trees 2 broad categories: istance-based methods Ultrametric Additive: UPGMA Transformed istance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

More information

RECOVERING NORMAL NETWORKS FROM SHORTEST INTER-TAXA DISTANCE INFORMATION

RECOVERING NORMAL NETWORKS FROM SHORTEST INTER-TAXA DISTANCE INFORMATION RECOVERING NORMAL NETWORKS FROM SHORTEST INTER-TAXA DISTANCE INFORMATION MAGNUS BORDEWICH, KATHARINA T. HUBER, VINCENT MOULTON, AND CHARLES SEMPLE Abstract. Phylogenetic networks are a type of leaf-labelled,

More information

Lecture 1 Hardy-Weinberg equilibrium and key forces affecting gene frequency

Lecture 1 Hardy-Weinberg equilibrium and key forces affecting gene frequency Lecture 1 Hardy-Weinberg equilibrium and key forces affecting gene frequency Bruce Walsh lecture notes Introduction to Quantitative Genetics SISG, Seattle 16 18 July 2018 1 Outline Genetics of complex

More information

Notes on Population Genetics

Notes on Population Genetics Notes on Population Genetics Graham Coop 1 1 Department of Evolution and Ecology & Center for Population Biology, University of California, Davis. To whom correspondence should be addressed: gmcoop@ucdavis.edu

More information

Dr. Amira A. AL-Hosary

Dr. Amira A. AL-Hosary Phylogenetic analysis Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic Basics: Biological

More information

Major questions of evolutionary genetics. Experimental tools of evolutionary genetics. Theoretical population genetics.

Major questions of evolutionary genetics. Experimental tools of evolutionary genetics. Theoretical population genetics. Evolutionary Genetics (for Encyclopedia of Biodiversity) Sergey Gavrilets Departments of Ecology and Evolutionary Biology and Mathematics, University of Tennessee, Knoxville, TN 37996-6 USA Evolutionary

More information

Population Genetics I. Bio

Population Genetics I. Bio Population Genetics I. Bio5488-2018 Don Conrad dconrad@genetics.wustl.edu Why study population genetics? Functional Inference Demographic inference: History of mankind is written in our DNA. We can learn

More information

Homework Assignment, Evolutionary Systems Biology, Spring Homework Part I: Phylogenetics:

Homework Assignment, Evolutionary Systems Biology, Spring Homework Part I: Phylogenetics: Homework Assignment, Evolutionary Systems Biology, Spring 2009. Homework Part I: Phylogenetics: Introduction. The objective of this assignment is to understand the basics of phylogenetic relationships

More information

Challenges when applying stochastic models to reconstruct the demographic history of populations.

Challenges when applying stochastic models to reconstruct the demographic history of populations. Challenges when applying stochastic models to reconstruct the demographic history of populations. Willy Rodríguez Institut de Mathématiques de Toulouse October 11, 2017 Outline 1 Introduction 2 Inverse

More information

Evolutionary Models. Evolutionary Models

Evolutionary Models. Evolutionary Models Edit Operators In standard pairwise alignment, what are the allowed edit operators that transform one sequence into the other? Describe how each of these edit operations are represented on a sequence alignment

More information

X X (2) X Pr(X = x θ) (3)

X X (2) X Pr(X = x θ) (3) Notes for 848 lecture 6: A ML basis for compatibility and parsimony Notation θ Θ (1) Θ is the space of all possible trees (and model parameters) θ is a point in the parameter space = a particular tree

More information

Population genetics snippets for genepop

Population genetics snippets for genepop Population genetics snippets for genepop Peter Beerli August 0, 205 Contents 0.Basics 0.2Exact test 2 0.Fixation indices 4 0.4Isolation by Distance 5 0.5Further Reading 8 0.6References 8 0.7Disclaimer

More information

Contrasts for a within-species comparative method

Contrasts for a within-species comparative method Contrasts for a within-species comparative method Joseph Felsenstein, Department of Genetics, University of Washington, Box 357360, Seattle, Washington 98195-7360, USA email address: joe@genetics.washington.edu

More information

How robust are the predictions of the W-F Model?

How robust are the predictions of the W-F Model? How robust are the predictions of the W-F Model? As simplistic as the Wright-Fisher model may be, it accurately describes the behavior of many other models incorporating additional complexity. Many population

More information

Big Idea #1: The process of evolution drives the diversity and unity of life

Big Idea #1: The process of evolution drives the diversity and unity of life BIG IDEA! Big Idea #1: The process of evolution drives the diversity and unity of life Key Terms for this section: emigration phenotype adaptation evolution phylogenetic tree adaptive radiation fertility

More information

Introduction to population genetics & evolution

Introduction to population genetics & evolution Introduction to population genetics & evolution Course Organization Exam dates: Feb 19 March 1st Has everybody registered? Did you get the email with the exam schedule Summer seminar: Hot topics in Bioinformatics

More information

C3020 Molecular Evolution. Exercises #3: Phylogenetics

C3020 Molecular Evolution. Exercises #3: Phylogenetics C3020 Molecular Evolution Exercises #3: Phylogenetics Consider the following sequences for five taxa 1-5 and the known outgroup O, which has the ancestral states (note that sequence 3 has changed from

More information

Math 239: Discrete Mathematics for the Life Sciences Spring Lecture 14 March 11. Scribe/ Editor: Maria Angelica Cueto/ C.E.

Math 239: Discrete Mathematics for the Life Sciences Spring Lecture 14 March 11. Scribe/ Editor: Maria Angelica Cueto/ C.E. Math 239: Discrete Mathematics for the Life Sciences Spring 2008 Lecture 14 March 11 Lecturer: Lior Pachter Scribe/ Editor: Maria Angelica Cueto/ C.E. Csar 14.1 Introduction The goal of today s lecture

More information

Admixture and ancestry inference from ancient and modern samples through measures of population genetic drift

Admixture and ancestry inference from ancient and modern samples through measures of population genetic drift Wayne State University Human Biology Open Access Pre-Prints WSU Press 7-1-2016 Admixture and ancestry inference from ancient and modern samples through measures of population genetic drift Alexandre M.

More information

Genetic Variation in Finite Populations

Genetic Variation in Finite Populations Genetic Variation in Finite Populations The amount of genetic variation found in a population is influenced by two opposing forces: mutation and genetic drift. 1 Mutation tends to increase variation. 2

More information

7. Tests for selection

7. Tests for selection Sequence analysis and genomics 7. Tests for selection Dr. Katja Nowick Group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute for Brain Research www. nowicklab.info

More information

Detecting selection from differentiation between populations: the FLK and hapflk approach.

Detecting selection from differentiation between populations: the FLK and hapflk approach. Detecting selection from differentiation between populations: the FLK and hapflk approach. Bertrand Servin bservin@toulouse.inra.fr Maria-Ines Fariello, Simon Boitard, Claude Chevalet, Magali SanCristobal,

More information

Consistency Index (CI)

Consistency Index (CI) Consistency Index (CI) minimum number of changes divided by the number required on the tree. CI=1 if there is no homoplasy negatively correlated with the number of species sampled Retention Index (RI)

More information

Estimating effective population size from samples of sequences: inefficiency of pairwise and segregating sites as compared to phylogenetic estimates

Estimating effective population size from samples of sequences: inefficiency of pairwise and segregating sites as compared to phylogenetic estimates Estimating effective population size from samples of sequences: inefficiency of pairwise and segregating sites as compared to phylogenetic estimates JOSEPH FELSENSTEIN Department of Genetics SK-50, University

More information

Frequency Spectra and Inference in Population Genetics

Frequency Spectra and Inference in Population Genetics Frequency Spectra and Inference in Population Genetics Although coalescent models have come to play a central role in population genetics, there are some situations where genealogies may not lead to efficient

More information

Surfing genes. On the fate of neutral mutations in a spreading population

Surfing genes. On the fate of neutral mutations in a spreading population Surfing genes On the fate of neutral mutations in a spreading population Oskar Hallatschek David Nelson Harvard University ohallats@physics.harvard.edu Genetic impact of range expansions Population expansions

More information

122 9 NEUTRALITY TESTS

122 9 NEUTRALITY TESTS 122 9 NEUTRALITY TESTS 9 Neutrality Tests Up to now, we calculated different things from various models and compared our findings with data. But to be able to state, with some quantifiable certainty, that

More information

Lecture 9. QTL Mapping 2: Outbred Populations

Lecture 9. QTL Mapping 2: Outbred Populations Lecture 9 QTL Mapping 2: Outbred Populations Bruce Walsh. Aug 2004. Royal Veterinary and Agricultural University, Denmark The major difference between QTL analysis using inbred-line crosses vs. outbred

More information

A Phylogenetic Network Construction due to Constrained Recombination

A Phylogenetic Network Construction due to Constrained Recombination A Phylogenetic Network Construction due to Constrained Recombination Mohd. Abdul Hai Zahid Research Scholar Research Supervisors: Dr. R.C. Joshi Dr. Ankush Mittal Department of Electronics and Computer

More information

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks! Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks! Paul has many great tools for teaching phylogenetics at his web site: http://hydrodictyon.eeb.uconn.edu/people/plewis

More information

Endowed with an Extra Sense : Mathematics and Evolution

Endowed with an Extra Sense : Mathematics and Evolution Endowed with an Extra Sense : Mathematics and Evolution Todd Parsons Laboratoire de Probabilités et Modèles Aléatoires - Université Pierre et Marie Curie Center for Interdisciplinary Research in Biology

More information

Microsatellite data analysis. Tomáš Fér & Filip Kolář

Microsatellite data analysis. Tomáš Fér & Filip Kolář Microsatellite data analysis Tomáš Fér & Filip Kolář Multilocus data dominant heterozygotes and homozygotes cannot be distinguished binary biallelic data (fragments) presence (dominant allele/heterozygote)

More information

Robust demographic inference from genomic and SNP data

Robust demographic inference from genomic and SNP data Robust demographic inference from genomic and SNP data Laurent Excoffier Isabelle Duperret, Emilia Huerta-Sanchez, Matthieu Foll, Vitor Sousa, Isabel Alves Computational and Molecular Population Genetics

More information

Processes of Evolution

Processes of Evolution 15 Processes of Evolution Forces of Evolution Concept 15.4 Selection Can Be Stabilizing, Directional, or Disruptive Natural selection can act on quantitative traits in three ways: Stabilizing selection

More information

6 Introduction to Population Genetics

6 Introduction to Population Genetics 70 Grundlagen der Bioinformatik, SoSe 11, D. Huson, May 19, 2011 6 Introduction to Population Genetics This chapter is based on: J. Hein, M.H. Schierup and C. Wuif, Gene genealogies, variation and evolution,

More information

Phylogenies & Classifying species (AKA Cladistics & Taxonomy) What are phylogenies & cladograms? How do we read them? How do we estimate them?

Phylogenies & Classifying species (AKA Cladistics & Taxonomy) What are phylogenies & cladograms? How do we read them? How do we estimate them? Phylogenies & Classifying species (AKA Cladistics & Taxonomy) What are phylogenies & cladograms? How do we read them? How do we estimate them? Carolus Linneaus:Systema Naturae (1735) Swedish botanist &

More information

Evolutionary Tree Analysis. Overview

Evolutionary Tree Analysis. Overview CSI/BINF 5330 Evolutionary Tree Analysis Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Backgrounds Distance-Based Evolutionary Tree Reconstruction Character-Based

More information

Quantitative trait evolution with mutations of large effect

Quantitative trait evolution with mutations of large effect Quantitative trait evolution with mutations of large effect May 1, 2014 Quantitative traits Traits that vary continuously in populations - Mass - Height - Bristle number (approx) Adaption - Low oxygen

More information

Computational statistics

Computational statistics Computational statistics Combinatorial optimization Thierry Denœux February 2017 Thierry Denœux Computational statistics February 2017 1 / 37 Combinatorial optimization Assume we seek the maximum of f

More information

How should we organize the diversity of animal life?

How should we organize the diversity of animal life? How should we organize the diversity of animal life? The difference between Taxonomy Linneaus, and Cladistics Darwin What are phylogenies? How do we read them? How do we estimate them? Classification (Taxonomy)

More information

Gene Genealogies Coalescence Theory. Annabelle Haudry Glasgow, July 2009

Gene Genealogies Coalescence Theory. Annabelle Haudry Glasgow, July 2009 Gene Genealogies Coalescence Theory Annabelle Haudry Glasgow, July 2009 What could tell a gene genealogy? How much diversity in the population? Has the demographic size of the population changed? How?

More information

Population Genetics. with implications for Linkage Disequilibrium. Chiara Sabatti, Human Genetics 6357a Gonda

Population Genetics. with implications for Linkage Disequilibrium. Chiara Sabatti, Human Genetics 6357a Gonda 1 Population Genetics with implications for Linkage Disequilibrium Chiara Sabatti, Human Genetics 6357a Gonda csabatti@mednet.ucla.edu 2 Hardy-Weinberg Hypotheses: infinite populations; no inbreeding;

More information

A (short) introduction to phylogenetics

A (short) introduction to phylogenetics A (short) introduction to phylogenetics Thibaut Jombart, Marie-Pauline Beugin MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis with PR Statistics, Millport Field

More information

6 Introduction to Population Genetics

6 Introduction to Population Genetics Grundlagen der Bioinformatik, SoSe 14, D. Huson, May 18, 2014 67 6 Introduction to Population Genetics This chapter is based on: J. Hein, M.H. Schierup and C. Wuif, Gene genealogies, variation and evolution,

More information

Reading for Lecture 13 Release v10

Reading for Lecture 13 Release v10 Reading for Lecture 13 Release v10 Christopher Lee November 15, 2011 Contents 1 Evolutionary Trees i 1.1 Evolution as a Markov Process...................................... ii 1.2 Rooted vs. Unrooted Trees........................................

More information

BTRY 4830/6830: Quantitative Genomics and Genetics Fall 2014

BTRY 4830/6830: Quantitative Genomics and Genetics Fall 2014 BTRY 4830/6830: Quantitative Genomics and Genetics Fall 2014 Homework 4 (version 3) - posted October 3 Assigned October 2; Due 11:59PM October 9 Problem 1 (Easy) a. For the genetic regression model: Y

More information

Notes 20 : Tests of neutrality

Notes 20 : Tests of neutrality Notes 0 : Tests of neutrality MATH 833 - Fall 01 Lecturer: Sebastien Roch References: [Dur08, Chapter ]. Recall: THM 0.1 (Watterson s estimator The estimator is unbiased for θ. Its variance is which converges

More information

Intraspecific gene genealogies: trees grafting into networks

Intraspecific gene genealogies: trees grafting into networks Intraspecific gene genealogies: trees grafting into networks by David Posada & Keith A. Crandall Kessy Abarenkov Tartu, 2004 Article describes: Population genetics principles Intraspecific genetic variation

More information

Computational Systems Biology: Biology X

Computational Systems Biology: Biology X Bud Mishra Room 1002, 715 Broadway, Courant Institute, NYU, New York, USA Human Population Genomics Outline 1 2 Damn the Human Genomes. Small initial populations; genes too distant; pestered with transposons;

More information

Genetic Drift in Human Evolution

Genetic Drift in Human Evolution Genetic Drift in Human Evolution (Part 2 of 2) 1 Ecology and Evolutionary Biology Center for Computational Molecular Biology Brown University Outline Introduction to genetic drift Modeling genetic drift

More information

Theory of Evolution Charles Darwin

Theory of Evolution Charles Darwin Theory of Evolution Charles arwin 858-59: Origin of Species 5 year voyage of H.M.S. eagle (83-36) Populations have variations. Natural Selection & Survival of the fittest: nature selects best adapted varieties

More information

Statistical population genetics

Statistical population genetics Statistical population genetics Lecture 2: Wright-Fisher model Xavier Didelot Dept of Statistics, Univ of Oxford didelot@stats.ox.ac.uk Slide 21 of 161 Heterozygosity One measure of the diversity of a

More information

Workshop III: Evolutionary Genomics

Workshop III: Evolutionary Genomics Identifying Species Trees from Gene Trees Elizabeth S. Allman University of Alaska IPAM Los Angeles, CA November 17, 2011 Workshop III: Evolutionary Genomics Collaborators The work in today s talk is joint

More information

Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information #

Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information # Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Details of PRF Methodology In the Poisson Random Field PRF) model, it is assumed that non-synonymous mutations at a given gene are either

More information

Drift Inflates Variance among Populations. Geographic Population Structure. Variance among groups increases across generations (Buri 1956)

Drift Inflates Variance among Populations. Geographic Population Structure. Variance among groups increases across generations (Buri 1956) Geographic Population Structure Alan R Rogers December 4, 205 / 26 Drift Inflates Variance among Populations 2 / 26 Variance among groups increases across generations Buri 956) 3 / 26 Gene flow migration)

More information

(Write your name on every page. One point will be deducted for every page without your name!)

(Write your name on every page. One point will be deducted for every page without your name!) POPULATION GENETICS AND MICROEVOLUTIONARY THEORY FINAL EXAMINATION (Write your name on every page. One point will be deducted for every page without your name!) 1. Briefly define (5 points each): a) Average

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information

8/23/2014. Phylogeny and the Tree of Life

8/23/2014. Phylogeny and the Tree of Life Phylogeny and the Tree of Life Chapter 26 Objectives Explain the following characteristics of the Linnaean system of classification: a. binomial nomenclature b. hierarchical classification List the major

More information

The neutral theory of molecular evolution

The neutral theory of molecular evolution The neutral theory of molecular evolution Introduction I didn t make a big deal of it in what we just went over, but in deriving the Jukes-Cantor equation I used the phrase substitution rate instead of

More information

URN MODELS: the Ewens Sampling Lemma

URN MODELS: the Ewens Sampling Lemma Department of Computer Science Brown University, Providence sorin@cs.brown.edu October 3, 2014 1 2 3 4 Mutation Mutation: typical values for parameters Equilibrium Probability of fixation 5 6 Ewens Sampling

More information

Haplotyping as Perfect Phylogeny: A direct approach

Haplotyping as Perfect Phylogeny: A direct approach Haplotyping as Perfect Phylogeny: A direct approach Vineet Bafna Dan Gusfield Giuseppe Lancia Shibu Yooseph February 7, 2003 Abstract A full Haplotype Map of the human genome will prove extremely valuable

More information

Using phylogenetics to estimate species divergence times... Basics and basic issues for Bayesian inference of divergence times (plus some digression)

Using phylogenetics to estimate species divergence times... Basics and basic issues for Bayesian inference of divergence times (plus some digression) Using phylogenetics to estimate species divergence times... More accurately... Basics and basic issues for Bayesian inference of divergence times (plus some digression) "A comparison of the structures

More information

Phylogenetic Analysis and Intraspeci c Variation : Performance of Parsimony, Likelihood, and Distance Methods

Phylogenetic Analysis and Intraspeci c Variation : Performance of Parsimony, Likelihood, and Distance Methods Syst. Biol. 47(2): 228± 23, 1998 Phylogenetic Analysis and Intraspeci c Variation : Performance of Parsimony, Likelihood, and Distance Methods JOHN J. WIENS1 AND MARIA R. SERVEDIO2 1Section of Amphibians

More information

Exercise 3 Exploring Fitness and Population Change under Selection

Exercise 3 Exploring Fitness and Population Change under Selection Exercise 3 Exploring Fitness and Population Change under Selection Avidians descended from ancestors with different adaptations are competing in a selective environment. Can we predict how natural selection

More information

Population Genetics: a tutorial

Population Genetics: a tutorial : a tutorial Institute for Science and Technology Austria ThRaSh 2014 provides the basic mathematical foundation of evolutionary theory allows a better understanding of experiments allows the development

More information

CONGEN Population structure and evolutionary histories

CONGEN Population structure and evolutionary histories CONGEN Population structure and evolutionary histories The table below shows allele counts at a microsatellite locus genotyped in 12 populations of Atlantic salmon. Review the table and prepare to discuss

More information

MOLECULAR PHYLOGENY AND GENETIC DIVERSITY ANALYSIS. Masatoshi Nei"

MOLECULAR PHYLOGENY AND GENETIC DIVERSITY ANALYSIS. Masatoshi Nei MOLECULAR PHYLOGENY AND GENETIC DIVERSITY ANALYSIS Masatoshi Nei" Abstract: Phylogenetic trees: Recent advances in statistical methods for phylogenetic reconstruction and genetic diversity analysis were

More information

Reconstructing Trees from Subtree Weights

Reconstructing Trees from Subtree Weights Reconstructing Trees from Subtree Weights Lior Pachter David E Speyer October 7, 2003 Abstract The tree-metric theorem provides a necessary and sufficient condition for a dissimilarity matrix to be a tree

More information

Concepts and Methods in Molecular Divergence Time Estimation

Concepts and Methods in Molecular Divergence Time Estimation Concepts and Methods in Molecular Divergence Time Estimation 26 November 2012 Prashant P. Sharma American Museum of Natural History Overview 1. Why do we date trees? 2. The molecular clock 3. Local clocks

More information

Michael Yaffe Lecture #5 (((A,B)C)D) Database Searching & Molecular Phylogenetics A B C D B C D

Michael Yaffe Lecture #5 (((A,B)C)D) Database Searching & Molecular Phylogenetics A B C D B C D 7.91 Lecture #5 Database Searching & Molecular Phylogenetics Michael Yaffe B C D B C D (((,B)C)D) Outline Distance Matrix Methods Neighbor-Joining Method and Related Neighbor Methods Maximum Likelihood

More information

Diffusion Models in Population Genetics

Diffusion Models in Population Genetics Diffusion Models in Population Genetics Laura Kubatko kubatko.2@osu.edu MBI Workshop on Spatially-varying stochastic differential equations, with application to the biological sciences July 10, 2015 Laura

More information

Supporting Information Text S1

Supporting Information Text S1 Supporting Information Text S1 List of Supplementary Figures S1 The fraction of SNPs s where there is an excess of Neandertal derived alleles n over Denisova derived alleles d as a function of the derived

More information

arxiv: v1 [math.st] 22 Jun 2018

arxiv: v1 [math.st] 22 Jun 2018 Hypothesis testing near singularities and boundaries arxiv:1806.08458v1 [math.st] Jun 018 Jonathan D. Mitchell, Elizabeth S. Allman, and John A. Rhodes Department of Mathematics & Statistics University

More information

Testing quantitative genetic hypotheses about the evolutionary rate matrix for continuous characters

Testing quantitative genetic hypotheses about the evolutionary rate matrix for continuous characters Evolutionary Ecology Research, 2008, 10: 311 331 Testing quantitative genetic hypotheses about the evolutionary rate matrix for continuous characters Liam J. Revell 1 * and Luke J. Harmon 2 1 Department

More information

Estimating Evolutionary Trees. Phylogenetic Methods

Estimating Evolutionary Trees. Phylogenetic Methods Estimating Evolutionary Trees v if the data are consistent with infinite sites then all methods should yield the same tree v it gets more complicated when there is homoplasy, i.e., parallel or convergent

More information

GENETICS - CLUTCH CH.22 EVOLUTIONARY GENETICS.

GENETICS - CLUTCH CH.22 EVOLUTIONARY GENETICS. !! www.clutchprep.com CONCEPT: OVERVIEW OF EVOLUTION Evolution is a process through which variation in individuals makes it more likely for them to survive and reproduce There are principles to the theory

More information

Calculation of IBD probabilities

Calculation of IBD probabilities Calculation of IBD probabilities David Evans and Stacey Cherny University of Oxford Wellcome Trust Centre for Human Genetics This Session IBD vs IBS Why is IBD important? Calculating IBD probabilities

More information

Phylogenetic Tree Reconstruction

Phylogenetic Tree Reconstruction I519 Introduction to Bioinformatics, 2011 Phylogenetic Tree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Evolution theory Speciation Evolution of new organisms is driven

More information

Likelihood Ratio Tests for Detecting Positive Selection and Application to Primate Lysozyme Evolution

Likelihood Ratio Tests for Detecting Positive Selection and Application to Primate Lysozyme Evolution Likelihood Ratio Tests for Detecting Positive Selection and Application to Primate Lysozyme Evolution Ziheng Yang Department of Biology, University College, London An excess of nonsynonymous substitutions

More information

Phylogenetic trees 07/10/13

Phylogenetic trees 07/10/13 Phylogenetic trees 07/10/13 A tree is the only figure to occur in On the Origin of Species by Charles Darwin. It is a graphical representation of the evolutionary relationships among entities that share

More information

Counting All Possible Ancestral Configurations of Sample Sequences in Population Genetics

Counting All Possible Ancestral Configurations of Sample Sequences in Population Genetics 1 Counting All Possible Ancestral Configurations of Sample Sequences in Population Genetics Yun S. Song, Rune Lyngsø, and Jotun Hein Abstract Given a set D of input sequences, a genealogy for D can be

More information

Gaussian Quiz. Preamble to The Humble Gaussian Distribution. David MacKay 1

Gaussian Quiz. Preamble to The Humble Gaussian Distribution. David MacKay 1 Preamble to The Humble Gaussian Distribution. David MacKay Gaussian Quiz H y y y 3. Assuming that the variables y, y, y 3 in this belief network have a joint Gaussian distribution, which of the following

More information

EVOLUTIONARY DISTANCES

EVOLUTIONARY DISTANCES EVOLUTIONARY DISTANCES FROM STRINGS TO TREES Luca Bortolussi 1 1 Dipartimento di Matematica ed Informatica Università degli studi di Trieste luca@dmi.units.it Trieste, 14 th November 2007 OUTLINE 1 STRINGS:

More information

A representation for the semigroup of a two-level Fleming Viot process in terms of the Kingman nested coalescent

A representation for the semigroup of a two-level Fleming Viot process in terms of the Kingman nested coalescent A representation for the semigroup of a two-level Fleming Viot process in terms of the Kingman nested coalescent Airam Blancas June 11, 017 Abstract Simple nested coalescent were introduced in [1] to model

More information

Theoretical Population Biology

Theoretical Population Biology Theoretical Population Biology 87 (013) 6 74 Contents lists available at SciVerse ScienceDirect Theoretical Population Biology journal homepage: www.elsevier.com/locate/tpb Genotype imputation in a coalescent

More information