Populations in statistical genetics

What are they, and how can we infer them from whole genome data?
Daniel Lawson
Heilbronn Institute, University of Bristol
www.paintmychromosomes.com
Work with:
January 2014, Cambridge

Overview
- Some views of population
- A statistical definition
- Generative approaches
- Inference from sparse weakly linked data (STRUCTURE)
- Inference from dense linked data (FineSTRUCTURE)
- Selected results

What is a population in general?
- Like species, populations are elusive objects.
- Definitions are a function of knowledge and application.
- Some definitions need not be biologically meaningful:
  - Example: individuals found on the same island
  - Example: sample location due to a clustered sampling procedure
  - Example: disease status for case/control studies
- In statistics, we generalize from samples to a population.
- What these share: individuals are equivalent under some measure.
- But do populations really exist? Does it matter, if they are useful?

Diversion: some fuzzy definitions of species
Are definitions of species analogous to definitions of populations?
- Typological species: organisms that share the same set of phenotypes
- Ecological species: organisms that compete for the same environment
- Phylogenetic species: organisms with a single common ancestor
- Biological species: organisms that can share DNA via sexual reproduction
[Figure: (Western/Eastern) Meadow Lark, from http://evolution.berkeley.edu]

What is a population? Motivation from genetics
- We are interested in non-subjective definitions of populations.
- Individuals have equivalent genetic ancestry.
- This depends on the data available... and the model used.
- Knowledge driven definition.
- Requires defining "equivalent"!

The birds and the bees (vs the plants and bacteria)
- Sex dramatically affects genetic transmission, so a different population concept is required.
- Sexual species:
  - Generalise the Biological Species concept
  - Random mating within a population
  - Try to detect deviations from random mating to identify populations
- Asexual species:
  - Generalise the Ecological Species concept
  - Neutral competition within a population
  - Try to detect deviations from neutrality to identify populations
  - Not examined further in this talk

Top down approaches
Start from a theoretical generative model of reproduction. For example:
- Individuals move between populations via migration
- Discrete generations (for simplicity)
- Each generation randomly chooses parent(s) within a population
- Population structure: individuals migrate between populations but mate randomly within each
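
The generative model above can be sketched as a small simulation. This is a minimal illustration, not the model used in the talk: it assumes haploid gene copies, a single SNP, equal deme sizes, and symmetric "island" migration. The function name and all parameters (`simulate_demes`, `migration`, `n_per_deme`) are hypothetical.

```python
import numpy as np

def simulate_demes(n_demes=3, n_per_deme=100, migration=0.01,
                   generations=200, p0=0.5, seed=0):
    """Track one SNP's allele frequency in each deme through discrete generations."""
    rng = np.random.default_rng(seed)
    freqs = np.full(n_demes, p0)
    history = [freqs.copy()]
    for _ in range(generations):
        # Migration (assumed island model): a parent is local with probability
        # 1 - migration, otherwise drawn from a uniformly chosen deme.
        parent_pool = (1 - migration) * freqs + migration * freqs.mean()
        # Random mating within a deme: each gene copy picks a parent at random.
        counts = rng.binomial(n_per_deme, parent_pool)
        freqs = counts / n_per_deme
        history.append(freqs.copy())
    return np.array(history)  # shape (generations + 1, n_demes)

history = simulate_demes()
print(history[-1])  # final allele frequency in each deme
```

With `migration=0` the demes drift independently; larger values keep their frequencies closer together, which is the sense in which migration and random mating jointly define population structure here.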

Ancestry Process
[Figure: ancestry process, with a time axis running to the present]

Populations as Demes
[Figure: demes connected by migration]
Ammerman and Cavalli-Sforza, 1984. The Neolithic Transition and the Genetics of Populations in Europe.

Diversion: Principal Components Analysis
- PCA is widely used in genetics, and helpful for visualisation.
- It does not make any explicit modelling assumptions... BUT to interpret the output, you do!
- It is just rotating the data:
  - Similar individuals tend to be close
  - Differences shared by many individuals tend to appear first
  - Each component describes a different direction of variation
  - If we are lucky, these correspond to real shared drift events
- There is no unique interpretation of PCA (see McVean 2009).
- See Lawson & Falush 2012 for details.
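
To make "just rotating the data" concrete, here is a minimal sketch of PCA on a genotype matrix using a singular value decomposition. It assumes a complete (no missing data) individuals x SNPs matrix and only centres each SNP; the per-SNP scaling used by some genetics packages is deliberately left out. The function name `genotype_pca` is hypothetical.

```python
import numpy as np

def genotype_pca(G, n_components=2):
    """PCA of an (individuals x SNPs) genotype matrix via SVD."""
    G = np.asarray(G, dtype=float)
    centred = G - G.mean(axis=0)                 # centre each SNP
    U, S, Vt = np.linalg.svd(centred, full_matrices=False)
    coords = U[:, :n_components] * S[:n_components]  # individual positions on each PC
    explained = S**2 / np.sum(S**2)              # share of variance per component
    return coords, explained[:n_components]

# Two groups of three individuals that differ at every SNP.
G = [[0, 0, 0, 0]] * 3 + [[1, 1, 1, 1]] * 3
coords, explained = genotype_pca(G)
print(coords[:, 0])
```

In this toy input the first component separates the two groups, but the sign of each component is arbitrary and nothing in the computation says why the groups differ; that interpretation has to come from a model.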

Example
[Figure, four panels. A) Historical scenario: Pop A, Pop B1 and Pop B2 from time in the past to the present, with drift d1 and d2. B) Spatial scenario: populations connected by migration across space. C) PCA of A, annotated with f(d1) and f(d2). D) PCA of B. PCA axes: PC1, PC2.]

Bottom up approaches
Start from a definition of population as equivalent individuals.
- Within a population, individuals are randomly mating
- Small samples of large populations: individuals are approximately independent
- Smaller populations: relationships must be accounted for
- What does random mating mean for the data?
- How are populations related?

Top down vs Bottom up
[Figure, three panels plotted against genetic position: A) Generative Model, B) Statistical Model, C) Full Statistical Model. Vertical axes: Time or Density.]
- A) Generative approaches are best in theory, if we can make the model match reality. But they are hard to use in practice: how do we do inference?
- B) Bottom up approaches are approximate and might lose power.
- C) But they can be refined until they are close to the generating process.

Ancestry Process - Ancestral Recombination Graph
[Figure: ancestry process, with a time axis running to the present]
- Take the limit $N \to \infty$ with $N/T$ constant
- Recombination rate $\rho$ (for tract length)
- Ignore unbranching ancestors

STRUCTURE model
Pritchard, Stephens & Donnelly 2000.
- Populations are large and well mixed.
- SNPs $D_{il}$ are unlinked*.
- Loci have some ancestral frequency $p_{0l} \sim P(\cdot)$.
- Population k has frequency $p_{kl}$ drifted from the ancestral $p_{0l}$**: $p_{kl} \sim \mathrm{Dirichlet}(p_{0l})$.
- Individual i is in population k if $Q_{ik} = 1$, assigned by $\prod_l P(D_{il} \mid p_{Q_{ik}, l})$.
- Individuals in the same population are exchangeable with respect to the SNP frequencies.
* Solutions to this have been explored (computationally inconvenient).
** Valid approximation, originally derived by Wright.
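
The assignment rule can be illustrated with a small sketch. It is not the STRUCTURE algorithm, which infers frequencies and assignments jointly by MCMC: here the population frequencies are treated as known, the data are haploid 0/1 alleles, and the drift is drawn from a Beta distribution (the two-allele case of the Dirichlet) with an assumed drift parameter `F`. All function names are hypothetical.

```python
import numpy as np

def drift_frequencies(p0, F, K, rng):
    """Draw p_kl around ancestral p_0l; Beta with mean p_0l and variance F * p_0l * (1 - p_0l)."""
    a = p0 * (1 - F) / F
    b = (1 - p0) * (1 - F) / F
    return rng.beta(a, b, size=(K, p0.size))

def assign(D, p):
    """Most likely population for each individual, given known frequencies p_kl."""
    p = np.clip(p, 1e-9, 1 - 1e-9)
    # log prod_l P(D_il | p_kl), treating SNPs as unlinked
    loglik = D @ np.log(p).T + (1 - D) @ np.log(1 - p).T
    return loglik.argmax(axis=1)

rng = np.random.default_rng(1)
p0 = rng.uniform(0.1, 0.9, size=500)          # ancestral frequencies (assumed prior)
p = drift_frequencies(p0, F=0.1, K=2, rng=rng)
truth = rng.integers(0, 2, size=20)           # true population of each individual
D = rng.binomial(1, p[truth])                 # haploid genotypes
print((assign(D, p) == truth).mean())         # fraction assigned correctly
```

Because the likelihood is a product over SNPs of terms that depend only on the population, any two individuals in the same population can be swapped without changing it, which is the exchangeability noted on the slide.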

Single SNP with populations
[Figure: a mutation in the ancestral population gives some SNP distribution; the present populations arise by 'independent drift' of that SNP distribution]

Scaling STRUCTURE
- The STRUCTURE approach has a parameter for every SNP.
- But, assuming that drift is weak, $p_{0l} = E(D_l)$ and:
  $p(p_{kl}) \approx N(p_{0l},\, p_{0l}(1 - p_{0l}))$
- Probability SNP l is shared not by chance:
  $X_{ijl} = D_{il} D_{jl} / p_{0l} + (1 - D_{il})(1 - D_{jl}) / (1 - p_{0l})$
- Invoke the Central Limit Theorem:
  $X_{ij} = \sum_{l=1}^{L} X_{ijl} \sim N(\mu_{ij}, \sigma_{ij}^2)$
- This is the Coancestry Matrix.
- It is a sufficient statistic for $p(D \mid Q)$.
- $\mu$ and $\sigma$ are known and the same for all individuals in a population: exchangeability again!
- Now let's think about linkage...
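
The sum over SNPs is a pair of matrix products, as the sketch below shows. It assumes haploid 0/1 data, estimates $p_{0l}$ as the sample mean at each SNP, and drops monomorphic SNPs to avoid dividing by zero; the function name `coancestry_unlinked` is hypothetical.

```python
import numpy as np

def coancestry_unlinked(D):
    """X_ij = sum_l [ D_il D_jl / p_0l + (1 - D_il)(1 - D_jl) / (1 - p_0l) ]."""
    D = np.asarray(D, dtype=float)
    p0 = D.mean(axis=0)                  # p_0l = E(D_l), estimated from the sample
    keep = (p0 > 0) & (p0 < 1)           # monomorphic SNPs would divide by zero
    D, p0 = D[:, keep], p0[keep]
    shared_ones = (D / p0) @ D.T                     # both carry allele 1
    shared_zeros = ((1 - D) / (1 - p0)) @ (1 - D).T  # both carry allele 0
    return shared_ones + shared_zeros

D = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
print(coancestry_unlinked(D))
```

If two individuals' alleles were independent draws at frequency $p_{0l}$, each $X_{ijl}$ would have expectation 1, so pairs with $X_{ij}$ well above the number of SNPs used share more than chance predicts.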

Ancestral Recombination Graph
[Figure: ancestral recombination graph, with time running towards the present]
Hein, Schierup and Wiuf, 'Gene Genealogies, Variation and Evolution', OUP 2005.

ChromoPainter

ChromoPainter
[Figure: Hidden Markov Model. Rows labelled Reference Panel, Reconstruction and Data.]
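
As a rough illustration of reconstructing a haplotype from a reference panel with a Hidden Markov Model, the sketch below finds the single most likely sequence of donors (a Viterbi path) under a simplified copying model. This is not ChromoPainter's implementation: the constant per-site `switch` probability, the `mismatch` rate and the function name `paint` are all assumptions made for the example, and Viterbi is used only for brevity.

```python
import numpy as np

def paint(target, panel, switch=0.05, mismatch=0.01):
    """Most likely donor (row of `panel`) copied at each site of `target`."""
    panel, target = np.asarray(panel), np.asarray(target)
    n, L = panel.shape
    # Emission: the copied allele matches the donor unless a mismatch occurs.
    log_emit = np.where(panel == target, np.log(1 - mismatch), np.log(mismatch))
    # Transition: with probability `switch`, pick a new donor uniformly at random.
    log_stay = np.log(1 - switch + switch / n)
    log_jump = np.log(switch / n)
    score = np.log(1.0 / n) + log_emit[:, 0]
    back = np.zeros((L, n), dtype=int)
    for l in range(1, L):
        best_prev = int(np.argmax(score))
        jump = score[best_prev] + log_jump
        stay = score + log_stay
        use_stay = stay >= jump
        back[l] = np.where(use_stay, np.arange(n), best_prev)
        score = np.where(use_stay, stay, jump) + log_emit[:, l]
    # Trace the best path back from the last site.
    path = np.empty(L, dtype=int)
    path[-1] = int(np.argmax(score))
    for l in range(L - 1, 0, -1):
        path[l - 1] = back[l, path[l]]
    return path

panel = np.array([[0] * 10, [1] * 10])
target = np.array([0] * 5 + [1] * 5)
print(paint(target, panel))
```

In this toy input the path copies donor 0 for the first five sites and donor 1 for the rest, giving two chunks; counting such chunks per donor is the kind of summary a coancestry matrix for linked data is built from.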

FineSTRUCTURE model
- Each population has a characteristic rate $P_{ab}$ of sharing chunks with each other population.
- Individuals are again exchangeable within populations.
- Each recombination event has probability $\hat{P}_{ab} = P_{ab} / \hat{n}_b$ when coming from an individual in population b into population a.
- Dirichlet Process prior on the parameters $P_a$.
- We integrate out $P$, leaving no population level parameters.
- We can put a meaningful prior on the variation of $P$ between populations, and on the number of populations.
- In practice these details don't matter much.
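
A rough sketch of how a proposed partition could be scored from a matrix of chunk counts follows. It departs from the model on the slide in two labelled ways: $P_{ab}$ is replaced by a plug-in estimate instead of being integrated out under a prior, and $\hat{n}_b$ is assumed to be the number of possible donors in population b (excluding the recipient itself when a = b). The function name `partition_loglik` is hypothetical, and the chunk-count matrix is assumed to have a zero diagonal.

```python
import numpy as np

def partition_loglik(X, labels):
    """Plug-in log-likelihood of chunk counts X[i, j] (i receives from j) under a partition."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    K = labels.max() + 1
    Z = np.eye(K)[labels]                    # one-hot membership, shape (n, K)
    C = Z.T @ X @ Z                          # chunks population a receives from population b
    P = C / C.sum(axis=1, keepdims=True)     # plug-in estimate of P_ab (assumption)
    sizes = Z.sum(axis=0)
    n_hat = sizes[None, :] - np.eye(K)       # donors in b, minus self when a == b (assumption)
    used = C > 0                             # empty cells contribute nothing
    return float(np.sum(C[used] * np.log(P[used] / n_hat[used])))

X = np.array([[0, 10, 1, 1],
              [10, 0, 1, 1],
              [1, 1, 0, 10],
              [1, 1, 10, 0]])
print(partition_loglik(X, [0, 0, 1, 1]), partition_loglik(X, [0, 1, 0, 1]))
```

In this toy matrix the partition that groups the heavily sharing pairs scores higher than the one that splits them. Only population-level counts enter the score, so relabelling individuals within a population leaves it unchanged.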

Example: Africa HGDP
[Figure: HGDP Africa samples labelled BiakaPygmy, MbutiPygmy, San, BantuSouthAfrica, BantuKenya, Yoruba and Mandenka; scale 100 to 1000; annotated values between 0.84 and 0.99]

Ancient Genomes: fennoscandia.blogspot.co.uk
- Ajv70: Gotland hunter gatherer
- 444k SNPs
- Source: fennoscandia.blogspot.co.uk/2013/11/ajv70-and-modern-european-variation-ii.html

History of populations
- We have assumed that we observe individuals from real populations.
- Populations differ by genetic drift.
- This works, even if there is historical migration, provided that the mixture fractions are equal (exchangeability).
- Real individuals are related by a combination of drift and admixture.
[Figure: ancestral population graph through time, showing splits, migrations and a bottleneck between populations]

Admixture
[Figure: A) Discrete Admixture and B) Continuous Admixture, each showing Pop A, Pop B1, Pop B2 and the admixed Pop AB1 from time in the past to the present, with the PCA of each scenario (axes PC1, PC2)]
Admixture describes mixture without drift.

Admixture
[Figure, four panels of density against genetic location: A) Non admixed populations, B) Mixing parameter 0.1, C) Mixing parameter 1, D) Mixing parameter 5]
Continuous admixture is a problem.

Admixture models
- STRUCTURE can infer pure populations from admixed populations.
- Population assignment $Q_{ik}$ sums to 1 over k, without requiring a single non-zero element.
- Interpretation: observed individuals are mixtures of pure populations, without drift.
- How can we tell apart:
  - Large drift, mixed by admixture?
  - Small drift without admixture?
- Solution: SNPs have fixed in some populations, $p_{kl} = 0$ (or 1).
- FineSTRUCTURE cannot use this description, as we've integrated out these details.
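
The mixture interpretation can be written down directly: if each allele of an individual comes from population k with probability $Q_{ik}$, the chance of observing allele 1 at SNP l is $\sum_k Q_{ik} p_{kl}$. The sketch below evaluates that likelihood for one individual and grid-searches the mixture share for two populations. It assumes haploid 0/1 data and known population frequencies, and it is not how STRUCTURE itself estimates $Q$; both function names are hypothetical.

```python
import numpy as np

def admixture_loglik(d, q, p):
    """log P(d | q, p) for one individual with mixture proportions q over populations."""
    mix = np.clip(q @ p, 1e-9, 1 - 1e-9)     # mixture frequency sum_k q_k p_kl
    return float(np.sum(d * np.log(mix) + (1 - d) * np.log(1 - mix)))

def fit_two_way(d, p, grid=101):
    """Grid-search the share of population 0 when there are two populations."""
    shares = np.linspace(0, 1, grid)
    ll = [admixture_loglik(d, np.array([s, 1 - s]), p) for s in shares]
    return shares[int(np.argmax(ll))]

p = np.array([[0.9, 0.9, 0.1, 0.1],      # population 0 frequencies (assumed known)
              [0.1, 0.1, 0.9, 0.9]])     # population 1 frequencies
print(fit_two_way(np.array([1, 1, 0, 0]), p))
print(fit_two_way(np.array([1, 0, 1, 0]), p))
```

The first individual looks like pure population 0 and the second like an even mixture. Note that the likelihood depends on the data only through the mixture frequency, so an intermediate frequency from admixture is hard to distinguish from one produced by drift unless some SNPs are fixed in a source population.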

Drift model
- See each drift event as independent.
- Population assignment $Q_{ik}$ now takes any value: the amount of each drift event retained by individual i.
- Reconstructs the coancestry matrix.
- Requires a strong prior to obtain a unique solution.
[Figure: drift components for Pop AB1]

Projects
- Siberian cold adaptation (Alexia & Toomas, in review)
- GlobeTrotter (Hellenthal et al: very accurate admixture dating using ChromoPainter, to appear in Science)
- Peopling of the British Isles (in review)
- UK10K, ALSPAC (4K whole genomes, use in genome-wide association studies)
- Highly recombining asexuals (fungus, bacteria)
- Model improvements:
  - Relatedness
  - Admixture/history of populations
- Complete recoding for usability
- Efficient computation (fastfinestructure)

See www.paintmychromosomes.com
- Lawson, Hellenthal, Myers & Falush, 2012. Inference of population structure using dense haplotype data. PLoS Genetics.
- Lawson & Falush, 2012. Similarity matrices and clustering algorithms for population identification using genetic data. ARHG.
- Lawson, 2014. Populations in statistical genetics modelling and inference, in Populations in the Human Sciences, Eds. Kreager, Capelli, Ulijaszek & Winney.

Relatedness
What if individuals within a population are related?

Relatedness
[Figure: pedigree of grandparents, parents and cousins, illustrating relatedness and IBD]
- Samples are cousins.
- This is very easy to tell from their tract length distribution.
- Excluding these tracts, we sample from the population distribution of chunks.
- Multiple ordering model in progress.

Choice of measure
[Figure: six pairwise matrices over populations A, B1, B2, C1 and C2, one per similarity measure (IBS, COV, ESU, FHS, IBD, CPL), each titled with its measure range]

Choice of measure
[Figure: between / within population distance (vertical axis, 0.5 to 3.5) against number of regions (horizontal axis, 0 to 100), one curve per measure: CPL, IBD, ESU, FHS, COV, IBS]