Detecting selection from differentiation between populations: the FLK and hapflk approach
Bertrand Servin (bservin@toulouse.inra.fr)
Maria-Ines Fariello, Simon Boitard, Claude Chevalet, Magali SanCristobal, Maxime Bonhomme
INRA Animal Genetics, Toulouse, France
June 18, 2013
Introduction
- We consider a set of populations that have differentiated through the effect of drift.
- Selection modifies allele frequencies within a population, so it amplifies differentiation between populations at selected loci.
- Differentiation-based tests for selection are characterized by:
  - their model for background differentiation (the neutral demographic model);
  - how they capture outliers or model selective effects.
- FLK (and hapflk):
  - Pure drift model with population splits (tree demography, phylogeny).
  - Outlier approach: look for genome regions where the neutral (null) model does not fit well (goodness-of-fit statistic).
Outline
1. Theoretical Background
   - Neutral model for SNP data in multiple populations
   - Single SNP statistics for detecting selection: FLK
   - Incorporating haplotype information: hapflk
2. Genome Scanning with hapflk
Single Population: evolution of allele frequency under drift
- Consider a biallelic locus (SNP) in a population evolving under pure drift, starting at frequency $p_0$.
- Let $F_t$ be the fixation index of the population after $t$ generations: $F_t = 1 - \left(1 - \frac{1}{2N}\right)^t \approx \frac{t}{2N}$, where $N$ is the population size.
- Provided $F_t$ is small, we can model $p(t) \sim N\left(p_0, F_t\, p_0 (1 - p_0)\right)$. See Nicholson et al. (2002).
Single Population: evolution of allele frequency under drift
[Figure: simulated trajectories and the normal approximation.]
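The normal approximation can be checked with a small simulation. The sketch below is illustrative only: it assumes a Wright-Fisher population (binomial resampling of $2N$ gene copies per generation), and the values of `N`, `t_gen`, `p0` and `n_rep` are arbitrary choices, not taken from the slides.

```r
# Wright-Fisher drift: binomial resampling of 2N gene copies each generation.
# All parameter values below are illustrative assumptions.
set.seed(1)
N <- 100        # diploid population size
t_gen <- 20     # number of generations
p0 <- 0.3       # starting allele frequency
n_rep <- 20000  # independent replicate loci

p <- rep(p0, n_rep)
for (g in seq_len(t_gen)) {
  p <- rbinom(n_rep, size = 2 * N, prob = p) / (2 * N)
}

F_t <- 1 - (1 - 1 / (2 * N))^t_gen     # fixation index after t_gen generations
F_approx <- t_gen / (2 * N)            # small-F approximation t / 2N
var_expected <- F_t * p0 * (1 - p0)    # variance under the normal model

cat("F_t =", F_t, " t/2N =", F_approx, "\n")
cat("empirical var =", var(p), " expected var =", var_expected, "\n")
```

The empirical variance of the simulated frequencies should be close to $F_t\, p_0 (1 - p_0)$, and $t/2N$ should be close to $F_t$ as long as $F_t$ stays small.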
Multiple populations: star-like evolution
- Consider an ancestral population that splits at time $t_0$ into multiple populations evolving in parallel, i.e. a star-like population tree.
- Assume no mutation after the split ($F_t$ is small). Then, for each population $i$: $p_i \sim N\left(p_0, F_i\, p_0 (1 - p_0)\right)$.
- NB: if we were to assume the same $F_i$ for each population, then $F_i = F_{ST}$.
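Multiple populations: tree-like evolution
- Under the star-like model, conditional on $p_0$, all populations are independent: $\mathrm{Cov}(p_i, p_j) = 0$.
- If we allow $\mathrm{Cov}(p_i, p_j) \neq 0$, we obtain a population tree. Example with three populations, where populations 1 and 2 share an ancestral branch:
  $F_3 = 1 - \left(1 - \frac{1}{2N_3}\right)^{t}$, $f_{12} = 1 - \left(1 - \frac{1}{2N_{12}}\right)^{t_{12}}$
- $\mathrm{Var}(p_i) = F_i\, p_0 (1 - p_0)$ and $\mathrm{Cov}(p_i, p_j) = f_{ij}\, p_0 (1 - p_0)$.
- Kinship matrix: $\mathbf{F} = \begin{pmatrix} F_1 & f_{12} & 0 \\ f_{12} & F_2 & 0 \\ 0 & 0 & F_3 \end{pmatrix}$, so that $\mathrm{Var}(\mathbf{p}) = \mathbf{F}\, p_0 (1 - p_0)$.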
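As an illustration, the kinship matrix of this three-population tree can be assembled in R. This is a sketch under stated assumptions: the population sizes and branch durations are hypothetical, and $F_1$ and $F_2$ are obtained by compounding drift along the shared and the private branch, $1 - F_i = (1 - f_{12})(1 - F_i^{own})$, which the slide does not spell out.

```r
# Fixation index accumulated over t generations at constant diploid size N
fix_index <- function(N, t) 1 - (1 - 1 / (2 * N))^t

# Hypothetical demography (all values are illustrative assumptions)
t_total <- 40   # generations since the root split
t_12 <- 15      # generations on the branch shared by populations 1 and 2
N_12 <- 200; N_1 <- 150; N_2 <- 300; N_3 <- 250

f_12 <- fix_index(N_12, t_12)     # drift shared by populations 1 and 2
F_3 <- fix_index(N_3, t_total)    # population 3 drifts independently
# Drift compounds along a lineage: 1 - F = (1 - f_12) * (1 - F_own)
F_1 <- 1 - (1 - f_12) * (1 - fix_index(N_1, t_total - t_12))
F_2 <- 1 - (1 - f_12) * (1 - fix_index(N_2, t_total - t_12))

Fmat <- matrix(c(F_1,  f_12, 0,
                 f_12, F_2,  0,
                 0,    0,    F_3),
               nrow = 3, byrow = TRUE,
               dimnames = list(paste0("pop", 1:3), paste0("pop", 1:3)))
print(round(Fmat, 4))
```

The off-diagonal term is nonzero only for the pair of populations that share part of their history, which is what distinguishes the tree model from the star-like model.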
Estimation of the neutral model
- This evolutionary model (population tree, pure drift) has two parameters: $p_0$ and $\mathbf{F}$.
- Suppose $\mathbf{F}$ is known. A natural estimator of $p_0$ is then the generalized least squares estimator: $\hat{p}_0 = \frac{\mathbf{1}^T \mathbf{F}^{-1} \mathbf{p}}{\mathbf{1}^T \mathbf{F}^{-1} \mathbf{1}}$.
- Estimating $\mathbf{F}$ means reconstructing the population tree, with branch lengths expressed in units of fixation index.
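A minimal sketch of the generalized least squares estimator, assuming a known kinship matrix. The function name `gls_p0`, the matrix `Fmat` and the frequencies `p` are hypothetical and chosen for illustration only.

```r
# GLS estimator of the ancestral frequency:
# p0_hat = (1' F^{-1} p) / (1' F^{-1} 1)
gls_p0 <- function(p, Fmat) {
  ones <- rep(1, length(p))
  w <- solve(Fmat, ones)      # F^{-1} 1 (F is symmetric, so these are the GLS weights)
  sum(w * p) / sum(w)
}

# Hypothetical kinship matrix and observed population frequencies
Fmat <- matrix(c(0.10, 0.04, 0,
                 0.04, 0.12, 0,
                 0,    0,    0.08), nrow = 3, byrow = TRUE)
p <- c(0.2, 0.3, 0.6)

cat("GLS estimate:", gls_p0(p, Fmat), "\n")
cat("plain mean  :", mean(p), "\n")
```

With correlated or unequally drifted populations the GLS estimate differs from the plain mean, because populations 1 and 2 carry partly redundant information. With $\mathbf{F}$ proportional to the identity, the two coincide.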
Estimation of the population kinship matrix
- Branch lengths of the tree are measured in units of drift ($\approx t/2N$).
- For each pair of populations $i$ and $j$, the Reynolds genetic distance $D_{ij}$ (Reynolds, Weir and Cockerham, 1983) has expectation $E(D_{ij}) = \frac{F_i + F_j}{2}$; see Laval et al. (2002).
- The population tree can be built by applying the neighbour-joining algorithm to the matrix of Reynolds distances, computed over many (on the order of $10^4$) SNPs. This assumes that the majority of them are neutral.
- Rooting the tree requires an outgroup. If none is available, midpoint rooting is used.
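The expectation of the Reynolds distance can be illustrated by simulation. The sketch below makes several assumptions: frequencies are drawn from the star-like normal model with hypothetical drift values `F_true`, and `reynolds()` is the simple biallelic form of the distance without sample-size correction, which may differ from what hapflk computes internally.

```r
set.seed(2)
n_snp <- 10000
F_true <- c(A = 0.02, B = 0.05, C = 0.08)   # hypothetical drift per population
p0 <- runif(n_snp, 0.3, 0.7)                # ancestral allele frequencies

# Star-like neutral model (normal approximation), clipped to [0, 1]
freqs <- sapply(F_true, function(Fi) {
  p <- rnorm(n_snp, mean = p0, sd = sqrt(Fi * p0 * (1 - p0)))
  pmin(pmax(p, 0), 1)
})

# Reynolds distance for biallelic loci, pooled over SNPs (no sample-size correction)
reynolds <- function(pa, pb) {
  sum((pa - pb)^2) / sum(pa + pb - 2 * pa * pb)
}

pops <- colnames(freqs)
D <- matrix(0, length(pops), length(pops), dimnames = list(pops, pops))
for (i in seq_along(pops)) for (j in seq_along(pops)) {
  if (i != j) D[i, j] <- reynolds(freqs[, i], freqs[, j])
}

expected <- outer(F_true, F_true, "+") / 2   # (F_i + F_j) / 2
diag(expected) <- 0
print(round(D, 3))
print(round(expected, 3))
```

The estimated distances should be close to $(F_i + F_j)/2$. A distance matrix of this kind is what the neighbour-joining step takes as input.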
Conclusions on the neutral model
- We have described a neutral model for population allele frequencies at a SNP.
- We can estimate the model parameters:
  - $p_0$: ancestral allele frequency, locus specific;
  - $\mathbf{F}$: population kinship matrix, constant across loci.
- Other procedures could be used to estimate $\mathbf{F}$: the hapflk software accepts any kinship matrix.
- In our context, detecting selection means identifying loci for which the neutral model is not a good fit.
The FLK statistic: goodness-of-fit of the neutral model
- We can think of our neutral model as a linear model: $\mathbf{p} = \mathbf{1} p_0 + \mathbf{r}$, with $\mathbf{r} \sim N(\mathbf{0}, \mathbf{V})$ and $\mathbf{V} = \mathbf{F}\, p_0 (1 - p_0)$.
- A goodness-of-fit statistic for this model, estimated at a particular locus, is the deviance $(\mathbf{p} - \mathbf{1}\hat{p}_0)^T \mathbf{V}^{-1} (\mathbf{p} - \mathbf{1}\hat{p}_0)$, named the FLK statistic (Bonhomme et al., 2010).
- Under $H_0$ (neutral model), for $n$ populations, FLK follows a $\chi^2(n-1)$ distribution.
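The statistic can be sketched in a few lines of R. This is an illustrative implementation, not the hapflk code: `flk_stat` is a hypothetical helper, $\mathbf{V}$ is evaluated at $\hat{p}_0$ (a plug-in assumption), and the kinship matrix and simulation settings are arbitrary.

```r
# FLK for one locus: deviance of the neutral linear model, with V evaluated at p0_hat
flk_stat <- function(p, Fmat) {
  ones <- rep(1, length(p))
  w <- solve(Fmat, ones)
  p0_hat <- sum(w * p) / sum(w)              # GLS estimate of p0
  resid <- p - p0_hat
  V <- Fmat * p0_hat * (1 - p0_hat)
  stat <- drop(t(resid) %*% solve(V, resid))
  c(p0_hat = p0_hat, flk = stat,
    pvalue = pchisq(stat, df = length(p) - 1, lower.tail = FALSE))
}

# Null simulation under a hypothetical 3-population tree
set.seed(4)
Fmat <- matrix(c(0.05, 0.02, 0,
                 0.02, 0.06, 0,
                 0,    0,    0.04), nrow = 3, byrow = TRUE)
p0 <- 0.5
L <- t(chol(Fmat))                           # Fmat = L %*% t(L)
sims <- replicate(5000, {
  p <- p0 + sqrt(p0 * (1 - p0)) * drop(L %*% rnorm(3))
  flk_stat(p, Fmat)["flk"]
})
cat("mean FLK under the null:", mean(sims),
    "(chi-square mean would be", nrow(Fmat) - 1, ")\n")
```

Under the null the simulated values should average close to $n - 1$, the mean of a $\chi^2(n-1)$ distribution.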
Relationship with other statistics
- If we were to assume a star-like population tree with equal branch lengths:
  - $\mathbf{F} = \mathbf{I}_n \bar{F}_{ST}$, where $\bar{F}_{ST}$ is the mean $F_{ST}$ over loci (genome-wide $F_{ST}$);
  - $\hat{p}_0 = \bar{p}$;
  - $\mathrm{FLK} = (n-1) \frac{F_{ST}}{\bar{F}_{ST}}$.
- The Lewontin and Krakauer (1973) statistic (LK): $\mathrm{LK} = (n-1) \frac{F_{ST}}{\bar{F}_{ST}}$.
- The LK statistic gives the same ranking as $F_{ST}$.
- LK (or $F_{ST}$) scans for selection assume a very particular evolutionary model for the populations.
- Outliers of LK (or $F_{ST}$) indicate a bad fit of this model. They might not be due to selection, but to a wrong evolutionary model under $H_0$.
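The reduction of FLK to LK can be written out step by step. The derivation below is a sketch that assumes the usual moment estimator $F_{ST} = s_p^2 / (\bar{p}(1-\bar{p}))$, with $s_p^2 = \frac{1}{n-1}\sum_i (p_i - \bar{p})^2$, and that $\mathbf{V}$ is evaluated at $\hat{p}_0$; neither definition is stated explicitly on the slide.

```latex
\begin{align*}
\mathbf{F} = \bar{F}_{ST}\,\mathbf{I}_n
  &\;\Rightarrow\; \hat{p}_0 = \frac{\mathbf{1}^T \mathbf{p}}{\mathbf{1}^T \mathbf{1}} = \bar{p},
  \qquad \mathbf{V}^{-1} = \frac{\mathbf{I}_n}{\bar{F}_{ST}\,\bar{p}(1-\bar{p})} \\
\mathrm{FLK}
  &= (\mathbf{p} - \mathbf{1}\bar{p})^T \mathbf{V}^{-1} (\mathbf{p} - \mathbf{1}\bar{p})
   = \frac{\sum_{i=1}^{n} (p_i - \bar{p})^2}{\bar{F}_{ST}\,\bar{p}(1-\bar{p})} \\
  &= (n-1)\,\frac{s_p^2}{\bar{p}(1-\bar{p})}\cdot\frac{1}{\bar{F}_{ST}}
   = (n-1)\,\frac{F_{ST}}{\bar{F}_{ST}} = \mathrm{LK}
\end{align*}
```

So LK is the special case of FLK in which all populations are assumed independent and equally drifted.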
Principle
1. Incorporate a haplotype diversity model within the FLK framework.
2. Considering haplotypes as multi-allelic markers, use a multiallelic version of FLK (Bonhomme et al. 2010).
- However, haplotypes are not ancestral alleles (recombination happens).
- Modified multiallelic version: the hapflk statistic (Fariello et al. 2013). Its distribution is unknown.
The Scheet and Stephens (a.k.a. fastPHASE) model
- Models the local similarity between haplotypes via dimension reduction: local clustering of haplotypes.
- The underlying clusters can be considered as local haplotypes. Their definition changes along the chromosome.
- The model is a Hidden Markov Model whose hidden states are the clusters.
- As for all mixture models, the number of components $K$ must be specified.
Using LD models for FLK
- Transform SNP genotypes into multiallelic genotypes, based on the posterior probability $P(Z_{il} = k \mid G_i)$, where:
  - $Z_{il}$: underlying cluster for individual $i$ at SNP $l$;
  - $G_i$: observed (multilocus) SNP genotype of individual $i$.
- Consider haplotype clusters as alleles. The frequency of cluster $k$ at SNP $l$ within a population of $N$ individuals is $p_{kl} = \frac{1}{N} \sum_i P(Z_{il} = k \mid G_i)$.
- Advantages:
  - No need for sliding windows.
  - The model can be estimated on unphased genotype data.
  - Missing data can be incorporated (e.g. a mixture of dense and sparse data...).
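The cluster frequency formula amounts to averaging posterior probabilities within each population. The sketch below uses a made-up posterior matrix `post` for a single SNP (one row per individual, one column per cluster) and hypothetical population labels. It ignores the fact that a diploid individual carries two haplotypes, so it only illustrates the averaging step.

```r
set.seed(3)
n_ind <- 6
K <- 3
# Hypothetical posteriors at one SNP l: row i holds P(Z_il = k | G_i), k = 1..K
post <- matrix(rgamma(n_ind * K, shape = 1), n_ind, K)
post <- post / rowSums(post)            # each row sums to 1
pop <- c("A", "A", "A", "B", "B", "B")  # population label of each individual

# p_kl per population: mean posterior over the individuals of that population
p_kl <- rowsum(post, group = pop) / as.vector(table(pop))
print(round(p_kl, 3))
```

Each row of `p_kl` is a vector of cluster frequencies that sums to one, so clusters can be treated as the alleles of a multiallelic marker.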
Example data: sheep from Northern Europe
- Kijas et al. (2012), PLoS Biology.
- 6 populations + outgroup (Soay), 388 individuals, 49K SNPs.
- Available at http://www.sheephapmap.org
Before diving in...
- Remember the assumptions underlying the neutral model:
  - Population tree.
  - Pure drift model (no mutation, no admixture).
  - Small $F_i$ (say < 0.2).
- This means:
  - Discard strongly bottlenecked or admixed populations.
  - Consider that low-frequency variants are more likely to have appeared after the population split.
- Perform a diversity analysis beforehand:
  - Population structure (STRUCTURE, PCA, treemix...).
  - Within-population kinship between individuals, to identify a set of unrelated individuals.
Get the software :)
- https://forge-dga.jouy.inra.fr/projects/hapflk
- Available for 64-bit Linux and Mac OS X 10.6+.
- Estimating the kinship matrix requires R with the ape and phangorn packages.
Run single SNP analysis
- hapflk reads PLINK files (ped/map or bed/bim/fam). The first column (FID) must give the population name.

    hapflk --bfile NorthernSheep --outgroup Soay

    1. [ 00:00:00 ] Reading Input Files
    2. [ 00:00:58 ] Computing Allele Frequencies NewZealandRomney
    3. [ 00:02:21 ] Computing Reynolds distances
    4. [ 00:02:21 ] Computing Kinship Matrix
    Loading required package: ape
    5. [ 00:02:21 ] Computing FLK tests
    6. [ 00:02:32 ] Writing down results
    7. [ 00:02:36 ] The End

- NB: single SNP analysis is fast.
Output files
- hapflk_reynolds.txt: Reynolds distance matrix
- hapflk_poptree.pdf: population tree figure
- hapflk_kinship.r: R code for estimating the kinship matrix
- hapflk_fij.txt: kinship matrix
- hapflk.frq: allele frequencies
- hapflk-snp-reynolds.txt: Reynolds distances in the region (more later)
- hapflk.flk: FLK results
Population Tree
[Figure: population tree with tips IrishSuffolk, NewZealandRomney, Galway, GermanTexel, ScottishTexel and NewZealandTexel.]
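Fit of the $\chi^2$ distribution (1)

    flk = read.table("hapflk.flk", header = TRUE)
    # keep SNPs with an intermediate estimated ancestral frequency
    mysnps = flk$pzero > 0.05 & flk$pzero < 0.95
    hist(flk$flk[mysnps], breaks = 50, freq = FALSE, xlab = "FLK", main = "")
    # xx is not defined on the slide; a grid over the plotted range is assumed here
    xx = seq(0, 35, by = 0.1)
    lines(xx, dchisq(xx, df = 5), lwd = 2)

[Figure: histogram of FLK (x-axis: FLK, y-axis: Density) with the $\chi^2(5)$ density overlaid.]
- Good overall fit.
- Slightly fewer high values than expected under a $\chi^2(5)$.
- Relatively high drift ($F_i \approx 0.16$).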
Fit of the $\chi^2$ distribution (2)

    hist(flk$flk[!mysnps], breaks = 50, freq = FALSE, xlab = "FLK", main = "")

[Figure: histogram of FLK for the excluded SNPs (x-axis: FLK, y-axis: Density).]
- No fit: for these SNPs, our neutral model is clearly wrong (it assumes no mutation in the tree).
- Proceed with caution for SNPs with low or high $\hat{p}_0$.
FLK Manhattan plot
[Figure: Manhattan plot of the FLK p-values (y-axis: log10(p) scale, 0 to 5).]
- Large drift affects the power of single SNP tests.
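A Manhattan plot of this kind can be produced from FLK values and the $\chi^2(n-1)$ null distribution. The sketch below uses a simulated toy data frame rather than the hapflk output, since the exact column layout of `hapflk.flk` is not shown here; the column names `chr`, `pos` and `flk` and the planted outliers are assumptions.

```r
set.seed(5)
n_pop <- 6   # six populations, hence a chi-square with 5 degrees of freedom
toy <- data.frame(
  chr = rep(1:3, each = 200),
  pos = rep(seq(1e5, by = 1e5, length.out = 200), times = 3),
  flk = rchisq(600, df = n_pop - 1)
)
toy$flk[c(150, 151, 152)] <- c(38, 45, 40)   # planted outlying region

# Upper-tail p-value under the neutral model, then -log10 scale
toy$pvalue <- pchisq(toy$flk, df = n_pop - 1, lower.tail = FALSE)
toy$mlog10p <- -log10(toy$pvalue)

plot(seq_len(nrow(toy)), toy$mlog10p, col = toy$chr, pch = 20,
     xlab = "SNP index (ordered by chromosome and position)",
     ylab = "-log10(p)")
```

Using `lower.tail = FALSE` rather than `1 - pchisq(...)` keeps precision for very small p-values, which is where the interesting SNPs are.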
Let's go the haplotype way
- Wait... what K? Use the fastPHASE cross-validation routine on your favorite chromosome (the big one). Hint:

    plink --chr 1 --recode-fastphase...

- Then run hapflk, with one run for each chromosome:

    hapflk2 --bfile NorthernSheep --outgroup Soay --kinship hapflk_fij.txt --chr 1 -K 40 --ncpu 7 -p OAR1

- Note: if the data are phased, or come from inbred lines, this can be specified with --phased or --inbred. It makes fitting the LD model much faster.
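Since one run is needed per chromosome, the command lines can be generated programmatically. The sketch below only builds the command strings in R, reusing the executable name, flags and file names exactly as they appear on the slide. The chromosome list `1:26` is an assumption (the sheep autosomes) and should be adapted to the data.

```r
# Flags and file names are copied from the slide; the chromosome list is an assumption
chromosomes <- 1:26   # sheep autosomes, adjust to your data
cmds <- sprintf(
  paste("hapflk2 --bfile NorthernSheep --outgroup Soay",
        "--kinship hapflk_fij.txt --chr %d -K 40 --ncpu 7 -p OAR%d"),
  chromosomes, chromosomes
)
writeLines(head(cmds, 2))

# To launch the runs one after the other (commented out: no side effects in this sketch):
# for (cmd in cmds) system(cmd)
```

Giving each run its own prefix (`-p OAR1`, `-p OAR2`, ...) keeps the per-chromosome output files from overwriting each other.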
hapflk output files
- hapflk.hapflk: hapflk results
- hapflk.kfrq.fit_{n}.bz2: haplotype cluster frequencies
- The fastPHASE model is estimated several (T) times (20 by default), and the hapflk statistic is averaged over these T fits. The haplotype cluster frequencies are given for each fit.
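Distribution of the hapflk statistic
- In this particular case, the distribution is close to normal, plus outliers.
- Robust estimation of the parameters of the normal distribution:

    require(MASS)             # rlm() lives in MASS (package names are case sensitive)
    mod = rlm(hapflk ~ 1)     # robust fit of the mean of the hapflk values
    mu = mod$coefficients[1]
    ss = mod$s                # robust scale estimate
    pvalue = 1 - pnorm(hapflk, mean = mu, sd = ss)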
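The robust standardization can be tried on simulated values before applying it to real output. In the sketch below, `hapflk` is a made-up vector (standard normal background plus a few shifted outliers), not actual hapflk statistics, so the numbers only illustrate the behaviour of `rlm()`.

```r
library(MASS)   # provides rlm()
set.seed(6)

# Simulated stand-in for the hapflk column: normal background plus 50 outliers
hapflk <- c(rnorm(5000), rnorm(50, mean = 6, sd = 0.5))

mod <- rlm(hapflk ~ 1)               # robust location (intercept-only model)
mu <- unname(mod$coefficients[1])
ss <- mod$s                          # robust scale estimate
pvalue <- pnorm(hapflk, mean = mu, sd = ss, lower.tail = FALSE)

cat("robust mean:", mu, " robust sd:", ss, "\n")
cat("naive mean :", mean(hapflk), " naive sd :", sd(hapflk), "\n")
```

The robust estimates stay close to the parameters of the background distribution, whereas the naive mean and standard deviation are pulled upwards by the outliers, which would make the outlying regions look less significant.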
Manhattan plot hapflk
[Figure: Manhattan plot of the hapflk statistic.]
- hapflk reveals clear outlying regions.
Looking at a particular region
- Once outlying regions are found, we want to know which population(s) have experienced a selection event.
- Look at local allele frequencies.
- Build local population trees to find which branch(es) have been affected (eigen decomposition of hapflk).
Local SNP and Cluster frequencies
- chr 2: selection in Texel breeds, GDF8 (MSTN) mutation.
- An R script for haplotype cluster plots is provided on the hapflk webpage.
Local SNP and Cluster frequencies
- chr 14: less obvious.
Local population trees
- Script for making these trees to be released...
Example on the 1000 Bull genomes data
- Differentiation-based tests can find causal mutations: example of coat color mutations (MC1R) in the 1000 Bull genomes dataset.
[Figure: test signal plotted against Position (Kbp), from 14200 to 15000.]
References
- Nicholson et al. 2002. J. Roy. Stat. Soc. B, 64(4), 695-715.
- Laval et al. 2002. Genetics Selection Evolution, 34(4), 481-508.
- Bonhomme et al. 2010. Genetics, 186(1), 241-262.
- Fariello et al. 2013. Genetics, 193(3), 929-941.