Genetics: Early Online, published on February 26, 2016 as /genetics Admixture, Population Structure and F-statistics

Genetics: Early Online, published on February 26, 2016 as 10.1534/genetics.115.183913 GENETICS INVESTIGATION Admixture, Population Structure and F-statistics Benjamin M Peter 1 1 Department of Human Genetics, University of Chicago, Chicago IL USA ABSTRACT Many questions about human genetic history can be addressed by examining the patterns of shared genetic variation between sets of populations. A useful methodological framework for this purpose are F-statistics, that measure shared genetic drift between sets of two, three and four populations, and can be used to test simple and complex hypotheses about admixture between populations. This paper provides context from phylogenetic and population genetic theory. I review how F-statistics can be interpreted as branch lengths or paths, and derive new interpretations using coalescent theory. I further show that the admixture tests can be interpreted as testing general properties of phylogenies, allowing extension of some ideas applications to arbitrary phylogenetic trees. The new results are used to investigate the behavior of the statistics under different models of population structure, and show how population substructure complicates inference. The results lead to simplified estimators in many cases, and I recommend to replace F 3 with the average number of pairwise differences for estimating population divergence. KEYWORDS admixture; gene flow; phylogenetics; population genetics; phylogenetic network For humans, whole-genome genotype data is now available for 23 individuals from hundreds of populations Lazaridis et al. 4 2014; Yunusbayev et al. 2015, opening up the possibility to ask 5 more detailed and complex questions about our history Pickrell 6 and Reich 2014; Schraiber and Akey 2015, and stimulating the 7 development of new tools for the analysis of the joint history of 8 these populations Reich et al. 2009; Patterson et al. 2012; Pickrell 9 and Pritchard 2012; Lipson et al. 2013; Ralph and Coop 2013; 10 Hellenthal et al. 2014. A simple and intuitive framework for this 11 purpose that has quickly gained in popularity are the F-statistics, 12 introduced by Reich et al. 2009, and summarized in Patterson 13 et al. 2012. In that framework, inference is based on shared 14 genetic drift between sets of populations, under the premise 15 that shared drift implies a shared evolutionary history. Tools 16 based on this framework have quickly become widely used in 17 the study of human genetic history, both for ancient and modern 18 DNA Green et al. 2010; Reich et al. 2012; Lazaridis et al. 2014; 19 Haak et al. 2015; Allentoft et al. 2015. 20 Some care is required with terminology, as the F-statistics 21 sensu Reich et al. 2009 are distinct, but closely related to 22 Wright s fixation indices Wright 1931; Reich et al. 2009, which 23 are also often referred to as F-statistics. Furthermore, it is nec- 24 1 essary to distinguish between statistics quantities calculated 25 from data and the underlying parameters which are part of the 26 modelweir and Cockerham 1984. 27 In this paper, I will mostly discuss model parameters, and I 28 will therefore refer to them as drift indices. The term F-statistics 29 will be used when referring to the general framework introduced 30 by Reich et al. 2009, and Wright s statistics will be referred to 31 as F ST or f. 32 Most applications of the F-statistic-framework can be phrased 33 in terms of the following six questions: 34 1. Treeness test: Are populations related in a tree-like fash- 35 ion? Reich et al. 2009 36 2. Admixture test: Is a particular population descended from 37 multiple ancestral populations? Reich et al. 2009 38 3. Admixture proportions: What are the contributions from 39 different populations to a focal population Green et al. 2010; 40 Haak et al. 2015. 41 4. Number of founders: How many founder populations are 42 there for a certain region? Reich et al. 2012; Lazaridis et al. 43 2014 44 Copyright 2016 by the Genetics Society of America doi: 10.1534/genetics.XXX.XXXXXX Manuscript compiled: Friday 26 th February, 2016% 1 bpeter@uchicago.edu 5. Complex demography: How can mixtures and splits of 45 population explain demography? Patterson et al. 2012; 46 Lipson et al. 2013 47 Copyright 2016. Genetics, Vol. XXX, XXXX XXXX February 2016 1

A F 3 P X ; P 1, P 2 represents the length of the external branch 70 from P X to the unique internal vertex connecting all three 71 populations. Thus, the first parameter of F 3 has an unique 72 role, whereas the other two can be switched arbitrarily. 73 F B 4 P 1, P 2 ; P 3, P 4 represents the internal branch from the 74 internal vertex connecting P 1 and P 2 to the vertex connect- 75 ing P 3 and P 4 blue. 76 B Figure 1 Population phylogeny and admixture graph. A: A population phylogeny with branches corresponding to F 2 green, F 3 yellow and F B 4 blue. B: An admixture graph extends population phylogenies by allowing gene flow red, full line and admixture events red, dotted. 6. Closest relative: What is the closest relative to a contem- 48 porary or ancient population Raghavan et al. 2014 49 The demographic models under which these questions are 50 addressed, and that motivated the drift indices, are called popula- 51 tion phylogenies and admixture graphs. The population phylogeny 52 or population tree, is a model where populations are related in 53 a tree-like fashion Figure 1A, and it frequently serves as the null 54 model for admixture tests. The branch lengths in the population 55 phylogeny correspond to how much genetic drift occurred, so 56 that a branch that is subtended by two different populations can 57 be interpreted as the shared genetic drift between these pop- 58 ulations. The alternative model is an admixture graph Figure 59 1B, which extends the population phylogeny by allowing edges 60 that represent population mergers or a significant exchange of 61 migrants. 62 Under a population phylogeny, the three F-statistics pro- 63 posed by Reich et al. 2009, labelled F 2, F 3 and F 4 have interpre- 64 tations as branch lengths Fig. 1A between two, three and four 65 taxa, respectively: Assume populations are labelled as P 1, P 2,..., 66 then 67 F 2 P 1, P 2 corresponds to the path on the phylogeny from 68 P 1 to P 2. 69 If the arguments are permuted, some F-statistics will have no 77 corresponding internal branch. In particular, it can be shown 78 that in a population phylogeny, one F 4 -index will be zero, imply- 79 ing that the corresponding internal branch is missing. This is the 80 property that is used in the admixture test. For clarity, I add the 81 superscript F B 4 if I need to emphasize the interpretation of F 4 82 as a branch length, and F T 4 to emphasize the interpretation as a 83 test-statistic. For details, see the subsection on F 4 in the Methods 84 & Results section. 85 In an admixture graph, there is no longer a single branch 86 length corresponding to each F-statistic, and interpretations are 87 more complex. However, F-statistics can still be thought of 88 as the proportion of genetic drift shared between populations 89 Reich et al. 2009. The basic idea exploited in addressing all 90 six questions outlined above is that under a tree model, branch 91 lengths, and thus the drift indices, must satisfy some constraints 92 Buneman 1971; Semple and Steel 2003; Reich et al. 2009. The 93 two most relevant constraints are that i in a tree, all branches 94 have positive lengths tested using the F 3 -admixture test and ii 95 in a tree with four leaves, there is at most one internal branch 96 tested using the F 4 -admixture test. 97 The goal of this paper is to give a broad overview on the 98 theory, ideas and applications of F-statistics. Our starting point 99 is a brief review on how genetic drift is quantified in general, 100 and how it is measured using F 2. I then propose an alterna- 101 tive definition of F 2 that allows us to simplify applications, and 102 study them under a wide range of population structure models. 103 I then review some basic properties of distance-based phylo- 104 gentic trees, show how the admixture tests are interpreted in 105 this context, and evaluate their behavior. Many of the results 106 that are highlighted here are implicit in classical Wahlund 1928; 107 Wright 1931; Cavalli-Sforza and Edwards 1967; Felsenstein 1973; 108 Cavalli-Sforza and Piazza 1975; Felsenstein 1981; Slatkin 1991; 109 Excoffier et al. 1992 and more recent work Patterson et al. 2012; 110 Pickrell and Pritchard 2012; Lipson et al. 2013, but often not 111 explicitly stated, or given in a different context. 112 Methods & Results 113 The next sections will discuss the F-statistics, introducing dif- 114 ferent interpretations and giving derivations for some useful 115 expressions. A graphical summary of the three interpretations 116 of the statistics is given in Figure 2, and the main formulas are 117 summarized in Table 1. 118 Throughout this paper, populations are labelled as 119 P 1, P 2,..., P i,.... Often, P X will denote an admixed population. 120 The allele frequency p i is defined as the proportion of individ- 121 uals in P i that carry a particular allele at a biallelic locus, and 122 throughout this paper I assume that all individuals are haploid. 123 However, all results hold if instead of haploid individuals, an 124 arbitrary allele of a diploid individual is used. I focus on genetic 125 drift only, and ignore the effects of mutation, selection and other 126 evolutionary forces. 127 2 Benjamin M Peter et al

F 2 P 1, P 2 F 3 P X ; P 1, P 2 F 4 P 1, P 2, P 3, P 4 Definition E[p 1 p 2 2 ] Ep X p 1 p X p 2 Ep 1 p 2 p 3 p 4 1 F 2-2 F2 P 1, P X + F 2 P 2, P X F 2 P 1, P 2 1 2 F2 P 1, P 4 + F 2 P 2, P 3 F 2 P 1, P 3 F 2 P 2, P 4 Coalescence times Variance Varp 1 p 2 2ET 12 ET 11 ET 22 ET 1X + ET 2X ET 12 ET XX ET 14 + ET 23 ET 13 ET 24 Varp X + Covp 1, p 2 Covp 1, p X Covp 2, p X Covp 1 p 2, p 3 p 4 Branch length 2B c B d 2B c B d B c B d or as admixture test: B d B d Table 1 Summary of equations. A constant of proportionality is omitted for coalescence times and branch lengths. Derivations for F 2 are given in the main text, F 3 and F 4 are a simple result of combining Equation 16 with 20b and 24b. B c and B d correspond to the average length of the internal branch in a gene genealogy concordant and discordant with the population assignment respectively see section Gene tree branch lengths. Figure 2 Interpretation of F-statistics. F-statistics can be interpreted i as branch lengths in a population phylogeny Panels A,E,I,M, the overlap of paths in an admixture graph Panels B,F,J,N, see also Figure S1, and in terms of the internal branches of gene-genealogies see Figures 4, S2 and S3. For gene trees consistent with the population tree, the internal branch contributes positively Panels C,G,K, and for discordant branches, internal branches contribute negatively Panels D,H or zero Panel L. F 4 has two possible interpretations; depending on how the arguments are permuted relative to the tree topology, it may either reflect the length of the internal branch third row, F B 4, or a test statistic that is zero under a population phylogeny last row, F T 4. For the admixture test, the two possible gene trees contribute to the statistic with different sign, highlighting the similarity to the D-statistic Green et al. 2010 and its expectation of zero in a symmetric model. Admixture, Population Structure and F-statistics 3

Measuring genetic drift F 2 128 The purpose of F 2 is simply to measure how much genetic drift 129 occurred between two populations, i.e. to measure genetic dis- 130 similarity. For populations P 1 and P 2, F 2 is defined as Reich et al. 131 2009 132 F 2 P 1, P 2 = F 2 p 1, p 2 = Ep 1 p 2 2. 1 The expectation is with respect to the evolutionary process, 133 but in practice F 2 is estimated from hundreds of thousands of 134 loci across the genome Patterson et al. 2012, which are assumed 135 to be non-independent replicates of the evolutionary history of 136 the populations. 137 Why is F 2 a useful measure of genetic drift? As it is infeasible 138 to observe changes in allele frequency directly, the effect of drift 139 is assessed indirectly, through its impact on genetic diversity. 140 Most commonly, genetic drift is quantified in terms of i the 141 variance in allele frequency, ii heterozygosity, iii probability 142 of identity by descent iv correlation or covariance between 143 individuals and v the probability of coalescence two lineages 144 having a common ancestor. In the next sections I show how 145 F 2 relates to these quantities in the the cases of a single popula- 146 tion changing through time and a pair of populations that are 147 partially isolated. 148 they had n 0 ancestral lineages, who all carry x with probability 169 p 0. Therefore, 170 PE = n 0 p0 i P n 0 = i p 0, n t = E [ p n 0 0 p ] 0, n t. 5 i=1 Equating 4 and 5 yields Equation 3. 171 In the present case, the only relevant cases are n t = 1, 2, since: [ ] E p 1 t p 0 ; n t = 1 = p 0 [ ] E p 2 t p 0 ; n t = 2 = p 0 f + p 2 0 1 f, where f is the probability that two lineage sampled at time t 172 coalesce before time t 0. 173 This yields an expression for F 2 by conditioning on the allele frequency p 0, ] E [p 0 p t 2 p 0 [ ] ] = E p 2 0 p 0 E [2p t p 0 p 0 ] + E [p 2 t p 0 = p 2 0 2p2 0 + p 0 f + p 2 0 1 f = f p 0 1 p 0 = 1 2 f H 0. Single population I assume a single population, measured at two time points t 0 and t, and label the two samples P 0 and P t. Then F 2 P 0, P t can be interpreted in terms of the variances of allele frequencies: Where H 0 = 2p 0 1 p 0 is the heterozygosity. Integrating over 174 Pp 0 yields: 175 F 2 P 0, P t = 1 2 f EH 0 6 F 2 P t, P 0 = E[p t p 0 2 ] = Varp t p 0 + E[p t p 0 ] 2 = Varp t + Varp 0 2Covp 0, p t = Varp t + Varp 0 2Covp 0, p 0 + p t p 0 = Varp t Varp 0. 2 Here, I used E[p t p 0 ] = Covp 0, p t p 0 = 0 to obtain 149 lines three and five. It is worth noting that this result holds for 150 any model of genetic drift where the expected allele frequency 151 is the current allele frequency and increments are independent. 152 For example, this interpretation of F 2 holds also if genetic drift 153 is modeled as a Brownian motion Cavalli-Sforza and Edwards 154 1967. 155 An elegant way to introduce the use of F 2 in terms of expected 156 heterozygosities H t Figure 3B and identity by descent Figure 157 3C is the duality 158 E pt [ p n t t p 0, n t ] = En0 [ p n 0 0 p 0, n t ]. 3 This equation is due to Tavaré 1984, who also provided the 159 following intuition: Given n t individuals are sampled at time 160 t. Let E denote the event that all individuals carry allele x, 161 conditional on allele x having frequency p 0 at time t 0. There 162 are two components to this: First, the frequency will change 163 between t 0 and t, and then all n t sampled individuals need to 164 carry x. 165 In a diffusion framework, 166 PE = 1 0 y n t Pp t = y p 0, n t dy = E[p n t t p 0, n t ]. 4 On the other hand, one may argue using the coalescent: For 167 E to occur, all n t samples need to carry the x allele. At time t 0, 168 and it can be seen that F 2 increases as a function of f Figure 176 3C. This equation can also be interpreted in terms of prob- 177 abilities of identity by descent: f is the probability that two 178 individuals are identical by descent in P t given their ancestors 179 were not identical by descent in P 0 Wright 1931, and EH 0 is 180 the probability two individuals are not identical in P 0. 181 Furthermore, EH t = 1 f EH 0 Equation 3.4 in Wakeley 182 2009 and therefore 183 EH 0 EH t = EH 0 1 1 f = 2F2 P t, P 0. 7 which shows that F 2 measures the decay of heterozygosity Fig- 184 ure 3A. A similar argument was used by Lipson et al. 2013 to 185 estimate ancestral heterozygosities and to linearize F 2. 186 These equations can be rearranged to make the connection between other measures of genetic drift and F 2 more explicit: EH t = EH 0 2F 2 P 0, P t 8a = EH 0 2Varp t Varp 0 8b = EH 0 1 f 8c Pairs of populations Equations 8b and 8c describing the decay 187 of heterozygosity are of course well known by population 188 geneticists, having been established by Wright 1931. In struc- 189 tured populations, very similar relationships exist when the 190 number of heterozygotes expected from the overall allele fre- 191 quency, H obs is compared with the number of heterozygotes 192 present due to differences in allele frequencies between popula- 193 tions H exp Wahlund 1928; Wright 1931. 194 In fact, already Wahlund showed by considering the genotypes of all possible matings in two subpopulations Table 3 in Wahlund 1928 that for a population made up of two subpopulations with equal proportions, the proportion of heterozygotes is reduced by H obs = H exp 2p 1 p 2 2 4 Benjamin M Peter et al

A B of the established inbreeding coefficient f and fixation index 205 F ST? Recall that Wright motivated f and F ST as correlation co- 206 efficients between alleles Wright 1921, 1931. Correlation coef- 207 ficients have the advantage that they are easy to interpret, as, 208 e.g. F ST = 0 implies panmixia and F ST = 1 implies complete 209 divergence between subpopulations. In contrast, F 2 depends 210 on allele frequencies and is highest for intermediate frequency 211 alleles. However, F 2 has an interpretation as a covariance, making 212 it simpler and mathematically more convenient to work with. In 213 particular, variances and covariances are frequently partitioned 214 into components due to different effects using techniques such 215 as analysis of variance and analysis of covariance e.g. Excoffier 216 et al. 1992. 217 C Figure 3 Measures of genetic drift in a single population. Interpretations of F 2 in terms of A the increase in allele frequency variance, B the decrease in heterozygosity and C f, which can be interpreted as probability of coalescence of two lineages, or the probability that they are identical by descent. from which follows that 195 F 2 P 1, P 2 = EH exp EH obs. 9 2 Furthermore, Varp 1 p 2 = Ep 1 p 2 2 [ Ep 1 p 2 ] 2, 196 but Ep 1 p 2 = 0 and therefore 197 F 2 P 1, P 2 = Varp 1 p 2 10 Lastly, the original definition of F 2 was as the numerator of 198 F ST Reich et al. 2009, but F ST can be written as F ST = 2p 1 p 2 2 EH exp. 199 form which follows 200 F 2 P 1, P 2 = 1 2 F STEH exp. 11 Covariance interpretation To see how F 2 can be interpreted as a covariance, define X i and X j as indicator variables that two individuals from the same population sample have the A allele, which has frequency p 1 in one, and p 2 in the other population. If individuals are equally likely to be sampled from either population, EX i = EX j = 1 2 p 1 + 1 2 p 2 EX i X j = 1 2 p2 1 + 1 2 p2 2 CovX i, X j =EX i X j EX i EX j = 1 4 p 1 p 2 2 = 1 4 F 2P 1, P 2 Justification for F 2 The preceding arguments show how the 201 usage of F 2 for both single and structured populations can be 202 justified by the similar effects of F 2 on different measures of 203 genetic drift. However, what is the benefit of using F 2 instead 204 F 2 as branch length Reich et al. 2009 and Patterson et al. 2012 218 proposed to partition drift as previously established, mea- 219 sured by covariance, allele frequency variance, or decrease in 220 heterozygosity between different populations into contribu- 221 tion on the different branches of a population phylogeny. This 222 model has been studied by Cavalli-Sforza and Edwards 1967 223 and Felsenstein 1973 in the context of a Brownian motion pro- 224 cess. In this model, drift on independent branches is assumed 225 to be independent, meaning that the variances can simply be 226 added. This is what is referred to as the additivity property of F 2 227 Patterson et al. 2012. 228 To illustrate the additivity property, consider two populations P 1 and P 2 that split recently from a common ancestral population P 0 Figure 2A. In this case, p 1 and p 2 are assumed to be independent conditional on p 0, and therefore Covp 1, p 2 = Varp 0. Then, using 2 and 10, F 2 P 1, P 2 = Varp 1 p 2 = Varp 1 + Varp 2 2Covp 1, p 2 = Varp 1 + Varp 2 2Varp 0 = F 2 P 1, P 0 + F 2 P 2, P 0. Alternative proofs of this statement and more detailed reasoning 229 behind the additivity assumption can be found in Cavalli-Sforza 230 and Edwards 1967; Felsenstein 1973; Reich et al. 2009 or 231 Patterson et al. 2012. 232 Lineages are not independent in an admixture graph, and 233 so this approach cannot be used. Reich et al. 2009 approached 234 this by conditioning on the possible population trees that are 235 consistent with an admixture scenario. In particular, they pro- 236 posed a framework of counting the possible paths through the 237 graph Reich et al. 2009; Patterson et al. 2012. An example of 238 this representation for F 2 in a simple admixture graph is given 239 in Figure S1, with the result summarized in Figure 2B. Detailed 240 motivation behind this visualization approach is given in Ap- 241 pendix 2 of Patterson et al. 2012. In brief, the reasoning is as 242 follows: Recall that F 2 P 1, P 2 = Ep 1 p 2 p 1 p 2, and inter- 243 pret the two terms in parentheses as two paths between P 1 and 244 P 2, and F 2 as the overlap of these two paths. In a population 245 phylogeny, there is only one possible path, and the two paths 246 are always the same, therefore F 2 is the sum of the length of all 247 the branches connecting the two populations. However, if there 248 is admixture, as in Figure 2B, both paths choose independently 249 which admixture edge they follow. With probability α they will 250 go left, and with probability β = 1 α they go right. Thus, 251 F 2 can be interpreted by enumerating all possible choices for 252 the two paths, resulting in three possible combinations of paths 253 on the trees Figure S1, and the branches included will differ 254 depending on which path is chosen, so that the final F 2 is made 255 up average of the path overlap in the topologies, weighted by 256 the probabilities of the topologies. 257 Admixture, Population Structure and F-statistics 5

However, one drawback of this approach is that it scales 258 where the second sum is over all pairs in P 1. This equation is 297 quadratically with the number of admixture events, making cal- 259 equivalent to Equation 22 in Felsenstein 1973. 298 culations cumbersome when the number of admixture events is 260 As F 2 ˆp 1, ˆp 2 = F 2 ˆp 2, ˆp 1, I can switch the labels, and obtain 299 large. More importantly, this approach is restricted to panmictic 261 the same expression for a second population P 2 = {I 2,i, i = 300 subpopulations, and cannot be used when the population model 262 0,..., n 2 } Taking the average over all I 2,j yields 301 cannot be represented as a weighted average of trees. 263 Gene tree interpretation For this reason, I propose to redefine 264 F 2 using coalescent theory Wakeley 2009. Instead of allele 265 frequencies on a fixed admixture graph, coalescent theory tracks 266 the ancestors of a sample of individuals, tracing their history 267 back to their most recent common ancestor. The resulting tree is 268 called a gene tree or coalescent tree. Gene trees vary between 269 loci, and will often have a different topology from the population 270 phylogeny, but they are nevertheless highly informative about 271 a population s history. Moreover, expected coalescence times 272 and expected branch lengths are easily calculated under a wide 273 array of neutral demographic models. 274 In a seminal paper, Slatkin 1991 showed how F ST can be interpreted in terms of the expected coalescence times of gene trees: F ST = ET B ET W, ET B where ET B and ET W are the expected coalescence times of two 275 lineages sampled in two different and the same population, 276 respectively. 277 Unsurprisingly, given the close relationship between F 2 and 278 F ST, an analogous expression exists for F 2 P 1, P 2 : The deriva- 279 tion starts by considering F 2 for two samples of size one. I then 280 express F 2 for arbitrary sample sizes in terms of individual-level 281 F 2, and obtain a sample-size independent expression by letting 282 the sample size n go to infinity. 283 In this framework, I assume that mutation is rare such that 284 there is at most one mutation at any locus. In a sample of size 285 two, let I i be an indicator random variable that individual i has 286 a particular allele. For two individuals, F 2 I 1, I 2 = 1 implies 287 I 1 = I 2, whereas F 2 I 1, I 2 = 0 implies I 1 = I 2. Thus, F 2 I 1, I 2 288 is another indicator random variable with parameter equal to 289 the probability that a mutation happened on the tree branch 290 between I 1 and I 2. 291 Now, instead of a single individual I 1, consider a sample of n 1 individuals: P 1 = {I 1,1 I 1,2,..., I 1,n1 }. The sample allele frequency is ˆp 1 = n1 1 i I 1,i. And the sample-f 2 is 1 1 F 2 ˆp 1, I 2 =F 2 n 1 [ 1 =E [ 1 =E [ + E n 1 I 1,i, I 2 = E i=1 n 1 n 2 1 I 1,i I 2 i=1 n 2 1 I 2 1,i + 2 n 2 1 I 1,i I 1,j 2 n 1 I 1,i I 2 + I 2 2 n 1 I 2 1,i 2 n 1 I 1,i I 2 + n 1 2 n 2 1 I 1,i I 1,j n 1 1 n 2 1 ] I2 2 n 1 ] I 2 1,i The first three terms can be grouped into n 1 terms of the form 292 F 2 I 1,i, I 2, and the last two terms can be grouped into n 1 2 terms 293 of the form F 2 I 1,i, I 1,j, one for each possible pair of samples in 294 P 1. 295 Therefore, 296 F 2 ˆp 1, I 2 = 1 n 1 i F 2 I 1,i, I 2 1 n 2 F 2 I 1,i, I 1,j 12 1 i<j ] F 2 ˆp 1, ˆp 2 = 1 n 1 F 2 I 1,i, I 2,j i 1 n 2 F 2 I 1,i, I 1,j 1 1 i<j n 2 F 2 I 2,i, I 2,j. 2 i<j 13 Thus, I can write F 2 between the two populations as the average 302 number of differences between individuals from different pop- 303 ulations, minus some terms including differences within each 304 sample. 305 Equation 13 is quite general, making no assumptions on 306 where samples are placed on a tree. In a coalescence framework, 307 it is useful to make the assumptions that all individuals from the 308 same population have the same branch length distribution, i.e. 309 F 2 I x1,i, I y1,j = F 2 I x2,i, I y2,j = for all pairs of samples x 1, x 2 310 and y 1, y 2 from populations P i and P j. Secondly, I assume that 311 all samples correspond to the leaves of the tree, so that I can esti- 312 mate branch lengths in terms of the time to a common ancestor 313 T ij. Finally, I assume that mutations occur at a constant rate of 314 θ/2 on each branch. Taken together, these assumptions imply 315 that F 2 I i,k, I j,l = θet ij for all individuals from populations 316 P i, P j. this simplifies 13 to 317 F 2 ˆp 1, ˆp 2 = θ ET 12 1 1 1n1 ET 2 11 1 1 1n2 ET 2 22 14 which, for the cases of n = 1, 2 was also derived by Petkova 318 et al. 2014. In some applications, F 2 might be calculated only 319 for segregating site in a large sample. As the expected number 320 of segregating sites is 2 θ T tot with T tot denoting the total tree 321 length, taking the limit where θ 0 is meaningful Slatkin 322 1991; Petkova et al. 2014: 323 F 2 ˆp 1, ˆp 2 = 2 ET 12 1 2 T tot 1 1n1 ET 11 1 2 1 1n2 ET 22. 15 2 In either of these equations, Ttot or θ act as a constant of pro- 324 portionality that is the same for all statistics calculated from the 325 same data. Since interest is focused on the relative magnitude 326 of F 2, or whether a sum of F 2 -values is different from zero, this 327 constant has no impact on inference. 328 Furthermore, a population-level quantity is obtained by tak- 329 ing the limit when the number of individuals n 1 and n 2 go to 330 infinity: 331 F 2 P 1, P 2 = lim 2 ˆp 1, ˆp 2 = θ ET 12 ET 11 + ET 12. n 1,n 2 2 16 Unlike F ST, the mutation parameter θ does not cancel. How- 332 ever, for most applications, the absolute magnitude of F 2 is of 333 little interest, since only the sign of the statistics are used for 334 most tests. In other applications F-statistics with presumably 335 the same θ Reich et al. 2009 are compared. In these cases, θ 336 can be regarded as a constant of proportionality, and will not 337 6 Benjamin M Peter et al

change the theoretical properties of the F-statistics. It will, how- 338 ever, influence statistical properties, as a larger θ implies more 339 mutations and hence more data. 340 Estimator for F 2 An estimator for F 2 can be derived using the 341 average number of pairwise differences π ij as an estimator for 342 θt ij Tajima 1983. Thus, a natural estimator for F 2 is 343 A. Equation B. Concordant genealogy ˆF 2 P 1, P 2 = π 12 π 11 + π 22. 17 2 C. Discordant genealogy Strikingly, the estimator in Eq. 17 is equivalent to that given 344 by Reich et al. 2009 in terms of the sample allele frequency ˆp i 345 and sample size n i : 346 F 2 P 1, P 2 =π 12 π 11 /2 π 22 /2 = [ ˆp 1 1 ˆp 2 + ˆp 2 1 ˆp 1 ] n ˆp 1 1 ˆp 1 1 n 1 1 ˆp n 21 ˆp 2 2 n 2 1 = ˆp 1 1 1 1 + ˆp n 2 1 1 1 2 ˆp 1 n 1 ˆp 2 2 + ˆp 2 1 1 1 ˆp 2 2 1 1 n 1 1 n 2 1 = ˆp 1 ˆp 2 2 ˆp 11 ˆp 1 n 1 1 ˆp 21 q2. n 2 1 Figure 4 Schematic explanation how F 2 behaves conditioned on gene tree. Blue terms and branches correspond to positive contributions, whereas red branches and terms are subtracted. Labels represent individuals randomly sampled from that population. External branches cancel out, so only the internal branches have non-zero contribution to F 2. In the concordant genealogy Panel B, the contribution is positive with weight 2, and in the discordant genealogy Panel C, it is negative with weight 1. The mutation rate as constant of proportionality is omitted. The last line is the same as Equation 10 in the Appendix of Reich 347 et al. 2009. 348 However, while the estimators are identical, the underlying 349 modelling assumptions are different: The original definition 350 only considered loci that were segregating in an ancestral popu- 351 lation; loci not segregating there were discarded. Since ancestral 352 populations are usually unsampled, this is often replaced by 353 ascertainment in an outgroup Patterson et al. 2012; Lipson et al. 354 2013. In contrast, Equation 17 assumes that all markers are used, 355 which is the more natural interpretation for sequence data. 356 Gene tree branch lengths An important feature of Equation 16 357 is that it only depends on the expected coalescence times be- 358 tween pairs of lineages. Thus, the behavior of F 2 can be fully 359 characterized by considering a sample of size four, with two 360 random individuals taken from each population. This is all 361 that is needed to study the joint distribution of T 12, T 11 and T 22, 362 and hence F 2. By linearity of expectation, larger samples can be 363 accommodated by summing the expectations over all possible 364 quartets. 365 For a sample of size four with two pairs, there are only two 366 possible unrooted tree topologies. One, where the lineages 367 from the same population are more closely related to each other 368 called concordant topology, T 2 c and one where lineages from 369 different populations coalesce first which I will refer to as discor- 370 danat topology T 2 d. The superscripts refers to the topologies 371 being for F 2, and I will discard them in cases where no ambiguity 372 arises. 373 Conditioning on the topology yields F 2 P 1, P 2 =E [ F 2 P 1, P 2 T ] =PT c E[F 2 P 1, P 2 T c ] + PT d E[F 2 P 1, P 2 T d ]. Figure 4 contains graphical representations for 374 E[F 2 P 1, P 2 T c ] Panel B and E[F 2 P 1, P 2 T d ] Panel C, 375 respectively. 376 In this representation, T 12 corresponds to a path from a ran- 377 dom individual from P 1 to a random individual from P 2, and 378 T 11 represents the path between the two samples from P 1. 379 For T c the internal branch is always included in T 12, but never 380 in T 11 or T 22. External branches, on the other hand, are included 381 with 50% probability in T 12 on any path through the tree. T 11 382 and T 22, on the other hand, only consist of external branches, 383 and the lengths of the external branches cancel. 384 On the other hand, for T d, the internal branch is always in- 385 cluded in T 11 and T 22, but only half the time in T 12. Thus, they 386 contribute negatively to F 2, but only with half the magnitude of 387 T c. As for T c, each T contains exactly two external branches, and 388 so they will cancel out. 389 An interesting way to represent F 2 is therefore in terms of the 390 internal branches over all possible gene genealogies. Denote the 391 unconditional average length of the internal branch of T c as B c, 392 and the average length of the internal branch in T d as B d. Then, 393 F 2 can be written in terms of these branch lengths as 394 F 2 P 1, P 2 = θ 2B c 1B d, 18 resulting in the representation given in Figure 2C-D. 395 As a brief sanity check, consider the case of a population 396 without structure. In this case, the branch length is independent 397 of the topology and T d is twice as likely as T c and hence B d = 398 2B c, from which it follows that F 2 will be zero, as expected in a 399 randomly mating population 400 This argument can be transformed from branch lengths to ob- 401 served mutations by recalling that mutations occur on a branch 402 at a rate proportional to its length. F 2 is increased by doubletons 403 that support the assignment of populations i.e. the two lineages 404 from the same population have the same allele, but reduced by 405 doubletons shared by individuals from different populations. 406 All other mutations have a contribution of zero. 407 Admixture, Population Structure and F-statistics 7

Testing treeness 408 Many applications consider tens or even hundreds of popula- 409 tions simultaneously Patterson et al. 2012; Pickrell and Pritchard 410 2012; Haak et al. 2015; Yunusbayev et al. 2015, with the goal to in- 411 fer where and between which populations admixture occurred. 412 Using F-statistics, the approach is to interpret F 2 P 1, P 2 as a 413 measure of dissimilarity between P 1 and P 2, as a large F 2 -value 414 implies that populations are highly diverged. Thus, the strat- 415 egy is to calculate all pairwise F 2 indices between populations, 416 combine them into a dissimilarity matrix, and ask if that matrix is 417 consistent with a tree. 418 One way to approach this question is by using phylogenetic 419 theory: Many classical algorithms have been proposed that use 420 a measure of dissimilarity to generate a tree Fitch et al. 1967; 421 Saitou and Nei 1987; Semple and Steel 2003; Felsenstein 2004, 422 and what properties a general dissimilarity matrix needs to have 423 in order to be consistent with a tree Buneman 1971; Cavalli- 424 Sforza and Piazza 1975, in which case the matrix is also called a 425 tree metric Semple and Steel 2003. Thus, testing for admixture 426 can be thought of as testing treeness. 427 For a dissimilarity matrix to be consistent with a tree, there 428 are two central properties it needs to satisfy: First, the length of 429 all branches has to be positive. This is strictly not necessary for 430 phylogenetic trees, and some algorithms may return negative 431 branch lengths e.g. Saitou and Nei 1987; however, since in our 432 case branches have an interpretation of genetic drift, negative 433 genetic drift is biologically nonsensical, and therefore negative 434 branches should be interpreted as a violation of the modeling 435 assumptions, and hence treeness. 436 The second property of a tree metric important in the present context is a bit more involved: A dissimilarity matrix written in terms of F 2 is consistent with a tree if for any four populations P i, P j, P k and P l, F 2 P i, P j + F 2 P k, P l max F 2 P i, P k + F 2 P j,p l, F 2 P i, P l + F 2 P j, P k 19 that is, if the sums of pairs of distances are compared, two of 437 these sums will be the same, and no smaller than the third. 438 This theorem, due to Buneman 1971, 1974 is called the four- 439 point condition or sometimes, more modestly, the fundamental 440 theorem of phylogenetics. A proof can be found in chapter 7 of 441 Semple and Steel 2003. 442 Informally, this statement can be understood by noticing that 443 on a tree, two of the pairs of distances will include the internal 444 branch, whereas the third one will not, and therefore be shorter. 445 Thus, the four-point condition can be colloquially rephrased as 446 any four taxa tree has at most one internal branch. 447 Why are these properties useful? It turns out that the ad- 448 mixture tests based on F-statistics can be interpreted as tests 449 of these properties: The F 3 -test can be interpreted as a test for 450 the positivity of a branch; and the F 4 as a test of the four-point 451 condition. Thus, the working of the two test statistics can be 452 interpreted in terms of fundamental properties of phylogenetic 453 trees, with the immediate consequence that they may be applied 454 as treeness-tests for arbitrary dissimilarity matrices. 455 An early test of treeness, based on a likelihood ratio, was 456 proposed by Cavalli-Sforza and Piazza 1975: They compared 457 the likelihood of the observed F 2 -matrix matrix to that induced 458 by the best fitting tree assuming Brownian motion, rejecting 459 the null hypothesis if the tree-likelihood is much lower than 460 that of the empirical matrix. In practice, however, finding the 461 best-fitting tree is a challenging problem, especially for large 462 trees Felsenstein 2004 and so the likelihood test proved to be 463 difficult to apply. From that perspective, the F 3 and F 4 -tests 464 provide a convenient alternative: Since treeness implies that all 465 subsets of taxa are also trees, the ingenious idea of Reich et al. 466 2009 was that rejection of treeness for subtrees of size three 467 for F 3 and four for F 4 is sufficient to reject treeness for the 468 entire tree. Furthermore, tests on these subsets also pinpoint the 469 populations involved in the non-tree-like history. 470 F 3 471 In the previous section, I showed how F 2 can be interpreted 472 as a branch length, as an overlap of paths, or in terms of gene 473 trees Figure 2. Furthermore, I gave expressions in terms of 474 coalescence times, allele frequency variances and internal branch 475 lengths of gene trees. In this section, I give analogous results for 476 F 3. 477 Reich et al. 2009 defined F 3 as: 478 F 3 P X ; P 1, P 2 = F 3 p X ; p 1, p 2 = Ep X p 1 p X p 2 20a with the goal to test whether P X is admixed. Recalling the path 479 interpretation detailed in Patterson et al. 2012, F 3 can be inter- 480 preted as the shared portion of the paths from P X to P 1 with the 481 path from P X to P 1. In a population phylogeny Figure 2E this 482 corresponds to the branch between P X and the internal node. 483 Equivalently, F 3 can also be written in terms of F 2 Reich et al. 484 2009: 485 F 3 P X ; P 1, P 2 = 1 2 F 2 P X, P 1 + F 2 P X, P 2 F 2 P 1, P 2. 20b If F 2 in Equation 20b is generalized to an arbitrary tree metric, 486 Equation 20b is known as the Gromov product Semple and 487 Steel 2003 in phylogenetics. The Gromov product is a com- 488 monly used operation in classical phylogenetic algorithms to 489 calculate the length of the portion of a branch shared between P 1 490 and P 2 Fitch et al. 1967; Felsenstein 1973; Saitou and Nei 1987; 491 consistent with the notion that F 3 is the length of an external 492 branch in a population phylogeny. 493 In an admixture graph, there is no longer a single external 494 branch; instead all possible trees have to be considered, and F 3 495 is the weighted average of paths through the admixture graph 496 Figure 2F. 497 Combining Equations 16 and 20b gives an expression of F 3 in 498 terms of expected coalescence times: 499 F 3 P X ; P 1, P 2 = θ 2 ET 1X + ET 2X ET 12 ET XX. 20c Similarly, an expression in terms of variances is obtained by 500 combining Equation 2 with 20b: 501 F 3 P X ; P 1, P 2 = Varp X + Covp 1, p 2 Covp 1, p X Covp 2, p X. 20d which was also noted by Pickrell and Pritchard 2012. 502 Outgroup-F 3 statistics A simple application of the interpreta- 503 tion of F 3 as a shared branch length are the outgroup -F 3-504 statistics proposed by Raghavan et al. 2014. For an unknown 505 population P U, they wanted to find the most closely related pop- 506 ulation from a panel of k extant populations {P i, i = 1, 2,... k}. 507 8 Benjamin M Peter et al

They did this by calculating F 3 P O ; P U, P i, where P O is an out- 508 group population that was assumed widely diverged from P U 509 and all populations in the panel. This measures the shared drift 510 or shared branch of P U with the populations from the panel, 511 and high F 3 -values imply close relatedness. 512 However, using Equation 20c, the outgroup-f 3 -statistic can 513 be written as 514 F 3 P O ; P U, P i ET UO + ET io ET Ui ET OO. 21 Out of these four terms, ET UO and ET OO do not depend on P i. 515 Furthermore, if P O is truly an outgroup, then all ET io should 516 be the same, as pairs of individuals from the panel population 517 and the outgroup can only coalesce once they are in the joint an- 518 cestral population. Therefore, only the term ET Ui is expected to 519 vary between different panel populations, suggesting that using 520 the number of pairwise differences, π Ui, is largely equivalent to 521 using F 3 P O ; P U, P i. I confirm this in Figure 5A by calculating 522 outgroup-f 3 and π iu for a set of increasingly divergent popu- 523 lations, with each population having its own size, sample size 524 and sequencing error probability. Linear regression confirms 525 the visual picture that π iu has a higher correlation with diver- 526 gence time R 2 = 0.90 than F 3 R 2 = 0.73. Hence, the number 527 of pairwise differences may be a better metric for population 528 divergence than F 3. 529 F 3 admixture test However, F 3 is motivated and primarily used 530 as an admixture test Reich et al. 2009. In this context, the null 531 hypothesis is that F 3 is non-negative, i.e. the null hypothesis 532 is that the data is generated from a phylogenetic tree that has 533 positive edge lengths. If this is not the case, the null hypothesis 534 is reject in favor of the more complex admixture graph. From 535 Figure 2F it may be seen that drift on the path on the internal 536 branches red contributes negatively to F 3. If these branches are 537 long enough compared to the branch after the admixture event 538 blue, then F 3 will be negative. For the simplest scenario where 539 P X is admixed between P 1 and P 2, Reich et al. 2009 provided 540 a condition when this is the case Equation 20 in Supplement 541 2 of Reich et al. 2009. However, since this condition involves 542 F-statistics with internal, unobserved populations, it cannot 543 be used in practical applications. A more useful condition is 544 obtained using Equation 20c. 545 In the simplest admixture model, an ancestral population 546 splits into P 1 and P 2 at time t r. At time t 1, the populations mix to 547 form P X, such that with probability α, individuals in P X descend 548 from individuals from P 1, and with probability 1 α, they 549 descend from P 2 See Fig. 7 for an illustration. In this case, 550 F 3 P X ; P 1, P 2 is negative if 551 1 t 1 < 2α1 α, 22 1 c x t r where c x is the probability two individuals sampled in P X have a common ancestor before t 1. For a randomly mating population with changing size Nt, t1 c x = 1 exp 0 1 Ns ds. Thus, the power F 3 to detect admixture is large 552 1. if the admixture proportion is close to fifty percent 553 2. if the ratio between the times of the original split to the time 554 of secondary contact is large, and 555 3. if the probability of coalescence before the admixture event 556 in P X is small, i.e. the size of P X is large. 557 A more general condition for negativity of F 3 is obtained 558 by considering the internal branches of the possible gene tree 559 topologies, analogously to that given for F 2 in the section Gene 560 tree branch lengths. Since Equation 20c includes ET XX, only two 561 individuals from P X are needed, and one each from P 1 and P 2 562 to study the joint distribution of all terms in 20c. The minimal 563 case therefore contains again just four samples Figure S2. 564 Furthermore, P 1 and P 2 are exchangeable, and thus there are 565 again just two distinct gene genealogies, a concordant one T 3 c 566 where the two lineages from P X are most closely related, and a 567 discordant genealogy T 3 d where the lineages from P X merge 568 first with the other two lineages. A similar argument as that 569 for F 2 shows presented in Figure S2 that F 3 can be written as a 570 function of just the internal branches in the topologies: 571 F 3 P X ; P 1, P 2 = θ 2B c B d, 23 where B c and B d are the lengths of the internal branches in T 3 c 572 and T 3 d, respectively, and similar to F 2, concordant branches 573 have twice the weight of discordant ones. Again, the case of 574 all individuals coming from a single populations serves as a 575 sanity check: In this case T d is twice as likely as T c, and all 576 branches are expected to have the same length, resulting in F 3 577 being zero. However, for F 3 to be negative, notice that B d needs 578 to be more than two times longer than B c. Since mutations 579 are proportional to B d and B c, F 3 can be interpreted as a test 580 whether mutations that agree with the population tree are more 581 than twice as common than mutations that disagree with it. 582 I performed a small simulation study to test the accuracy of 583 Equation 22. Parameters were chosen such that F 3 has a negative 584 expectation for α > 0.05, and I find that the predicted F 3 fit very 585 well with the simulations Figure 5B. 586 F 4 587 The second admixture statistic, F 4, is defined as Reich et al. 588 2009: 589 F 4 P 1, P 2 ; P 3, P 4 = F 4 p 1, p 2 ; p 3, p 4 = E[p 1 p 2 p 3 p 4 ]. 24a Similarly to F 3, F 4 can be written as a linear combination of F 2 : 590 F 4 P 1, P 2 ; P 3, P 4 = 1 24b F 2 2 P 1, P 4 + F 2 P 2, P 3 F 2 P 1, P 3 F 2 P 2, P 4. which leads to 591 ET 14 + ET 23 ET 13 BET 24. 24c F 4 P 1, P 2 ; P 3, P 4 = θ 2 As four populations are involved, there are 4! = 24 possible ways of arranging the populations in Equation 24a. However, there are four possible permutations of arguments that will lead to identical values, leaving only six unique F 4 -values for any four populations. Furthermore, these six values come in pairs that have the same absolute value, and a different sign, i. e. F 4 P 1, P 2 ; P 3, P 4 = F 4 P 1, P 2 ; P 4, P 3, leaving only three unique absolute values, which correspond to the tree possible tree topologies. Out of these three, one F 4 can be written as the sum of the other two, leaving just two independent possibilities: F 4 P 1, P 2 ; P 3, P 4 + F 4 P 1, P 3 ; P 2, P 4 = F 4 P 1, P 4 ; P 2, P 3. Admixture, Population Structure and F-statistics 9

0.002 0.000 0.002 0.004 0.006 0.008 Figure 5 Simulation results. A: Outgroup-F 3 statistics yellow and π iu white for a panel of populations with linearly increasing divergence time. Both statistics are scaled to have the same range, with the first divergence between the most closely related populations set to zero. F 3 is inverted, so that it increases with distance. B: Simulated boxplots and predictedblue F 3 -statistics under a simple admixture model. C: Comparison of F 4 -ratio yellow triangles, Equation 29 and ratio of differencesblack circles, Equation 31. As for F 3, Equation 24b can be generalized by replacing F 2 592 with an arbitrary tree metric. In this case, Equation 24b is known 593 as a tree split Buneman 1971, as it measures the length of the 594 overlap of the branch lengths between the two pairs. As there 595 are two independent F 4 -indices for a fixed tree, there are two 596 different interpretations for the F 4 -indices. Consider the tree 597 from Fig. 1A: F 4 P 1, P 2 ; P 3, P 4 can be interpreted as the overlap 598 between the paths from P 1 to P 2 and from P 3 to P 4. However, 599 these paths do not overlap in Fig. 1A, and therefore F 4 = 0. 600 This is how F 4 is used as a test statistic. On the other hand, 601 F 4 P 1, P 3 ; P 2, P 4 measures the overlap between the paths from 602 P 1 to P 3 and from P 2 to P 4, which is the internal branch in Fig. 603 1A, and will be positive. 604 It is cumbersome that the interpretation of F 4 depends on the ordering of its arguments. In order to make our intention clear, instead of switching the arguments around for the two interpretations, I introduce the superscripts T for test and B for branch length: F T 4 P 1, P 2 ; P 3, P 4 = F 4 P 1, P 2 ; P 3, P 4 25a F B 4 P 1, P 2 ; P 3, P 4 = F 4 P 1, P 3 ; P 2, P 4 25b Four-point-condition and F 4 Tree splits, and hence F 4, are closely 605 related to the four-point condition Buneman 1971, 1974, which, 606 informally, states that a sub-tree with four populations will 607 have at most one internal branch. Thus, if data is consistent with 608 a tree, F B 4 will be the length of that branch, and F T 4 will be zero. 609 In Figure 2, the third row Panels I-L correspond to the internal 610 branch, and the last row Panels M-P to the zero -branch. 611 Thus, in the context of testing for admixture, testing that F 4 612 is zero is equivalent to checking whether there is in fact only 613 a single internal branch. If that is not the case, the population 614 phylogeny is rejected. This statement can be generalized to 615 arbitrary tree metrics: the four-point condition Buneman 1971 616 can be written as 617 F 2 P 1, P 2 + F 2 P 3, P 4 [ ] min F 2 P 1, P 3 + F 2 P 2, P 4, F 2 P 1, P 4 + F 2 P 2, P 3 26 for any permutations of the samples. This implies that two of 618 the sums need to be the same, and larger than the third. The 619 claim is that if the four-point condition holds, at least one of 620 the F 4 -values will be zero, and the others will have the same 621 absolute value. 622 Without loss of generality, assume that F 2 P 1, P 2 + F 2 P 3, P 4 F 2 P 1, P 3 + F 2 P 2, P 4 F 2 P 1, P 3 + F 2 P 2, P 4 = F 2 P 1, P 4 + F 2 P 2, P 3 Simply plugging this into the three possible F 4 -equations yields F 4 P 1, P 2 ; P 3, P 4 = 0 F 4 P 1, P 3 ; P 2, P 4 = k F 4 P 1, P 4 ; P 2, P 3 = k where k = F 2 P 1, P 3 + F 2 P 2, P 4 F 2 P 1, P 2 + F 2 P 3, P 4. 623 It is worth noting that the converse is false. If F 2 P 1, P 2 + F 2 P 3, P 4 > F 2 P 1, P 3 + F 2 P 2, P 4 F 2 P 1, P 3 + F 2 P 2, P 4 = F 2 P 1, P 4 + F 2 P 2, P 3 the four-point condition is violated, but F 4 P 1, P 2 ; P 3, P 4 is still 624 zero, and the other two F 4 -values have the same magnitude. 625 10 Benjamin M Peter et al