Detecting selection from differentiation between populations: the FLK and hapflk approach.


Bertrand Servin (bservin@toulouse.inra.fr), Maria-Ines Fariello, Simon Boitard, Claude Chevalet, Magali SanCristobal, Maxime Bonhomme
INRA Animal Genetics, Toulouse, France
June 18, 2013

Introduction

We consider a set of populations differentiated through the effect of drift. As selection modifies allele frequencies within a population, it amplifies differentiation between populations at selected loci.

Differentiation-based tests for selection are characterized by:
- their model for background differentiation (the "neutral demographic model");
- how they capture outliers or model selective effects.

FLK (and hapflk):
- Pure drift model with population splits (tree demography, "phylogeny").
- Outlier approach: look for genome regions where the neutral (null) model does not fit well (goodness-of-fit statistic).

Outline

1. Theoretical Background
   - Neutral model for SNP data in multiple populations
   - Single SNP statistics for detecting selection: FLK
   - Incorporating haplotype information: hapflk
2. Genome Scanning with hapflk


Single population: evolution of allele frequency under drift

Consider a biallelic locus (SNP) in a population evolving under pure drift, starting at frequency $p_0$. Let $F_t$ be the fixation index of the population after $t$ generations:

$F_t = 1 - \left(1 - \frac{1}{2N}\right)^t \approx \frac{t}{2N}$

Provided $F_t$ is small, we can model (see Nicholson et al., 2002):

$p(t) \sim \mathcal{N}\left(p_0,\; F_t\, p_0 (1 - p_0)\right)$

Single population: evolution of allele frequency under drift

[Figure: simulated allele frequency trajectories under drift, compared with the normal approximation.]
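The variance claimed by the normal approximation can be checked by simulation. The following is a minimal Python sketch, not part of the original slides: it assumes a Wright-Fisher population with binomial resampling of 2N gene copies per generation, and the values of `p0`, `two_n`, `generations` and `replicates` are arbitrary illustrative choices.

```python
import random


def simulate_drift(p0, two_n, generations, rng):
    """Return the allele frequency after `generations` of Wright-Fisher drift."""
    p = p0
    for _ in range(generations):
        # Binomial resampling of 2N gene copies from the current frequency.
        copies = sum(1 for _ in range(two_n) if rng.random() < p)
        p = copies / two_n
    return p


def fixation_index(two_n, generations):
    """F_t = 1 - (1 - 1/(2N))^t."""
    return 1.0 - (1.0 - 1.0 / two_n) ** generations


rng = random.Random(1)  # fixed seed so the run is reproducible
p0, two_n, generations, replicates = 0.5, 100, 10, 2000
finals = [simulate_drift(p0, two_n, generations, rng) for _ in range(replicates)]

mean = sum(finals) / replicates
empirical_var = sum((p - mean) ** 2 for p in finals) / (replicates - 1)
expected_var = fixation_index(two_n, generations) * p0 * (1 - p0)

print(f"F_t = {fixation_index(two_n, generations):.4f}")
print(f"empirical variance = {empirical_var:.4f}")
print(f"F_t * p0 * (1 - p0) = {expected_var:.4f}")
```

The two variances should agree closely; the normal shape itself is only an approximation and degrades as $F_t$ grows or as $p_0$ approaches 0 or 1.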

Multiple populations: star-like evolution

Consider an ancestral population that splits at time $t_0$ into multiple populations evolving in parallel, i.e. a star-like population tree. Assuming no mutation after the split ($F_t$ is small), for each population $i$:

$p_i \sim \mathcal{N}\left(p_0,\; F_i\, p_0 (1 - p_0)\right)$

NB: if we were to assume the same $F_i$ for each population, then $F_i = F_{ST}$.

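Multiple populations: tree-like evolution

Under the star-like model, conditional on $p_0$, all populations are independent: $\mathrm{Cov}(p_i, p_j) = 0$. If we allow $\mathrm{Cov}(p_i, p_j) \neq 0$, we obtain a population tree. In the three-population example, populations 1 and 2 share an ancestral branch and population 3 evolves on its own branch:

$F_3 = 1 - \left(1 - \frac{1}{2N_3}\right)^{t}, \qquad f_{12} = 1 - \left(1 - \frac{1}{2N_{12}}\right)^{t_{12}}$

$\mathrm{Var}(p_i) = F_i\, p_0 (1 - p_0), \qquad \mathrm{Cov}(p_i, p_j) = f_{ij}\, p_0 (1 - p_0)$

Kinship matrix:

$\mathbf{F} = \begin{pmatrix} F_1 & f_{12} & 0 \\ f_{12} & F_2 & 0 \\ 0 & 0 & F_3 \end{pmatrix}, \qquad \mathrm{Var}(\mathbf{p}) = \mathbf{F}\, p_0 (1 - p_0)$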
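To make the link between a tree and its kinship matrix concrete, here is a small Python sketch. It rests on an assumption not stated on the slide: that branch lengths are expressed in drift units and add up along a path, which is only approximately true when the fixation indices are small. The tree, its node names and its branch lengths are hypothetical.

```python
def path_to_root(node, parent):
    """Return the branches (named by their child node) from `node` up to the root."""
    path = []
    while parent[node] is not None:
        path.append(node)
        node = parent[node]
    return path


def kinship_from_tree(parent, length, populations):
    """Kinship matrix: entry (i, j) is the drift shared by populations i and j.

    Assumes drift units are additive along branches (small-F approximation).
    """
    paths = {pop: set(path_to_root(pop, parent)) for pop in populations}
    return [
        [sum(length[b] for b in paths[i] & paths[j]) for j in populations]
        for i in populations
    ]


# Hypothetical tree ((pop1, pop2)anc12, pop3)root with made-up branch lengths.
parent = {"root": None, "anc12": "root", "pop1": "anc12", "pop2": "anc12", "pop3": "root"}
length = {"anc12": 0.03, "pop1": 0.05, "pop2": 0.04, "pop3": 0.10}

F = kinship_from_tree(parent, length, ["pop1", "pop2", "pop3"])
for row in F:
    print(["%.2f" % value for value in row])
```

The diagonal holds the root-to-tip drift of each population ($F_i$), and the off-diagonal entry for populations 1 and 2 is the length of their shared ancestral branch ($f_{12}$).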

Estimation of the neutral model

This evolutionary model (population tree, pure drift) has two parameters: $p_0$ and $\mathbf{F}$. Suppose $\mathbf{F}$ is known; then a natural estimator of $p_0$ is the generalized least squares estimator:

$\hat{p}_0 = \dfrac{\mathbf{1}^T \mathbf{F}^{-1} \mathbf{p}}{\mathbf{1}^T \mathbf{F}^{-1} \mathbf{1}}$

Estimating $\mathbf{F}$ means reconstructing the population tree, with branch lengths expressed in units of fixation indices.

Estimation of the population kinship matrix

Branch lengths of the tree are measured in units of drift ($\approx t/2N$). For each pair of populations $i$ and $j$, the Reynolds genetic distance $D$ (Reynolds, Weir and Cockerham, 1983) has expectation (see Laval et al., 2002):

$E(D_{ij}) = \dfrac{F_i + F_j}{2}$

The population tree can be built with the neighbour-joining algorithm applied to the matrix of Reynolds distances, computed over many (on the order of $10^4$) SNPs. This assumes that the majority of them are neutral. Rooting the tree requires an outgroup; if none is available, midpoint rooting is used.
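As an illustration of the distance step, the sketch below computes a Reynolds-type distance between two populations from their allele frequencies at biallelic SNPs. It uses the simple ratio form (sum over loci of squared allele frequency differences, divided by twice the sum over loci of one minus the cross-product of frequencies) without any sample-size correction; this is an assumption for illustration and may differ from what the hapflk software computes. The frequencies are made up.

```python
def reynolds_distance(freqs_i, freqs_j):
    """Reynolds-type distance between two populations (uncorrected ratio form).

    freqs_i, freqs_j: reference-allele frequencies, one value per biallelic SNP.
    """
    numerator = 0.0
    denominator = 0.0
    for p, q in zip(freqs_i, freqs_j):
        # Squared differences summed over both alleles: 2 * (p - q)^2.
        numerator += 2.0 * (p - q) ** 2
        # One minus the probability of drawing the same allele from each population.
        denominator += 1.0 - (p * q + (1.0 - p) * (1.0 - q))
    return numerator / (2.0 * denominator)


# Made-up allele frequencies for three populations at four SNPs.
frequencies = {
    "popA": [0.50, 0.20, 0.70, 0.40],
    "popB": [0.55, 0.25, 0.60, 0.45],
    "popC": [0.20, 0.60, 0.30, 0.80],
}
names = sorted(frequencies)
distances = [
    [reynolds_distance(frequencies[a], frequencies[b]) for b in names] for a in names
]
for name, row in zip(names, distances):
    print(name, ["%.3f" % d for d in row])
```

A matrix like `distances` is the kind of input a neighbour-joining routine would take to build the population tree.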

Conclusions on the neutral model

We have described a neutral model for population allele frequencies at a SNP. We can estimate the model parameters:
- $p_0$: ancestral allele frequency, locus specific;
- $\mathbf{F}$: population kinship matrix, constant across loci.

Note that other procedures could be used to estimate $\mathbf{F}$: the hapflk software accepts any kinship matrix.

In our context, detecting selection means identifying loci for which the neutral model is not a good fit.


The FLK statistic: goodness-of-fit of the neutral model

We can think of our neutral model as a linear model:

$\mathbf{p} = \mathbf{1} p_0 + \mathbf{r}, \qquad \mathbf{r} \sim \mathcal{N}(\mathbf{0}, \mathbf{V}), \qquad \mathbf{V} = \mathbf{F}\, p_0 (1 - p_0)$

A goodness-of-fit statistic for this model, estimated at a particular locus, is the deviance

$(\mathbf{p} - \mathbf{1}\hat{p}_0)^T\, \mathbf{V}^{-1}\, (\mathbf{p} - \mathbf{1}\hat{p}_0)$

named the FLK statistic (Bonhomme et al., 2010). Under $H_0$ (neutral model) for $n$ populations, FLK follows a $\chi^2(n-1)$ distribution.
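The computation of FLK at one SNP can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the hapflk implementation: the kinship matrix and the allele frequencies are made up, $\mathbf{V}$ is evaluated at the estimated $\hat{p}_0$, and a small Gaussian elimination routine stands in for a linear algebra library.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    a = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (a[r][n] - sum(a[r][c] * x[c] for c in range(r + 1, n))) / a[r][r]
    return x


def flk(p, kinship):
    """Return (p0_hat, FLK) for population frequencies p and kinship matrix F."""
    weights = solve(kinship, [1.0] * len(p))  # F^{-1} 1 (F is symmetric)
    # GLS estimate of the ancestral frequency: 1' F^{-1} p / 1' F^{-1} 1.
    p0_hat = sum(w * pi for w, pi in zip(weights, p)) / sum(weights)
    residuals = [pi - p0_hat for pi in p]
    scaled = solve(kinship, residuals)  # F^{-1} (p - 1 p0_hat)
    # Deviance with V = F * p0_hat * (1 - p0_hat).
    statistic = sum(r * s for r, s in zip(residuals, scaled)) / (p0_hat * (1.0 - p0_hat))
    return p0_hat, statistic


# Made-up kinship matrix (populations 1 and 2 share ancestry) and frequencies.
F = [[0.08, 0.03, 0.00],
     [0.03, 0.07, 0.00],
     [0.00, 0.00, 0.10]]
p = [0.30, 0.35, 0.60]

p0_hat, statistic = flk(p, F)
print(f"p0_hat = {p0_hat:.3f}, FLK = {statistic:.3f}")
```

Under the neutral model the resulting value would be compared with a $\chi^2$ distribution with $n - 1 = 2$ degrees of freedom for these three populations.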


Relationship with other statistics

If we were to assume a star-like population tree with equal branch lengths:

$\mathbf{F} = \mathbf{I}_n \bar{F}_{ST}$, where $\bar{F}_{ST}$ is the mean $F_{ST}$ over loci (genome-wide $F_{ST}$)

$\hat{p}_0 = \bar{p}$

$\mathrm{FLK} = (n - 1)\, \dfrac{F_{ST}}{\bar{F}_{ST}}$

This is the Lewontin and Krakauer (1973) statistic (LK):

$\mathrm{LK} = (n - 1)\, \dfrac{F_{ST}}{\bar{F}_{ST}}$

- The LK statistic gives the same ranking as $F_{ST}$.
- LK (or $F_{ST}$) scans for selection assume a very particular evolutionary model for the populations.
- Outliers of LK (or $F_{ST}$) indicate a bad fit of this model. This might not be due to selection, but to a wrong evolutionary model under $H_0$.
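The reduction of FLK to LK can be written out in a few steps. The derivation below is a sketch that assumes the locus-specific $F_{ST}$ is defined as the classical variance ratio $s_p^2 / (\bar{p}(1 - \bar{p}))$, with $s_p^2$ the sample variance of the population frequencies; that definition is an assumption made here for illustration.

```latex
\begin{align*}
\mathbf{F} = \bar{F}_{ST}\,\mathbf{I}_n
  \;&\Rightarrow\;
  \hat{p}_0 = \frac{\mathbf{1}^T \mathbf{F}^{-1} \mathbf{p}}{\mathbf{1}^T \mathbf{F}^{-1} \mathbf{1}}
            = \frac{1}{n} \sum_{i=1}^{n} p_i = \bar{p} \\
\mathrm{FLK}
  &= (\mathbf{p} - \mathbf{1}\bar{p})^T
     \left[ \bar{F}_{ST}\, \bar{p}(1 - \bar{p})\, \mathbf{I}_n \right]^{-1}
     (\mathbf{p} - \mathbf{1}\bar{p})
   = \frac{\sum_{i=1}^{n} (p_i - \bar{p})^2}{\bar{F}_{ST}\, \bar{p}(1 - \bar{p})} \\
  &= (n - 1)\, \frac{s_p^2}{\bar{p}(1 - \bar{p})} \cdot \frac{1}{\bar{F}_{ST}}
   = (n - 1)\, \frac{F_{ST}}{\bar{F}_{ST}},
  \qquad s_p^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (p_i - \bar{p})^2
\end{align*}
```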


Principle

1. Incorporate a haplotype diversity model within the FLK framework.
2. Considering haplotypes as multi-allelic markers, use a multiallelic version of FLK (Bonhomme et al., 2010).

However, haplotypes are not ancestral alleles (recombination happens). This leads to a modified multiallelic version: the hapflk statistic (Fariello et al., 2013). Its distribution is unknown.

The Scheet and Stephens (a.k.a. fastPHASE) model

- Models the local similarity between haplotypes via a reduction of dimension: local clustering of haplotypes.
- The underlying clusters can be considered as local haplotypes. Their definition changes along the chromosome.
- The model is a Hidden Markov Model whose hidden states are the clusters.
- As for all mixture models, the number of components K must be specified.

Using LD models for FLK

Transform SNP genotypes into multiallelic genotypes, based on the posterior probability $P(Z_{il} = k \mid G_i)$, where:
- $Z_{il}$: underlying cluster for individual $i$ at SNP $l$;
- $G_i$: observed (multilocus) SNP genotype of individual $i$.

Consider haplotype clusters as alleles. The frequency of cluster $k$ at SNP $l$ within a population of $N$ individuals is:

$p_{kl} = \dfrac{1}{N} \sum_{i} P(Z_{il} = k \mid G_i)$

Advantages:
- No need for sliding windows.
- The model can be estimated on unphased genotype data.
- Missing data can be incorporated (e.g. a mixture of dense and sparse data).
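The cluster frequency formula is a simple average of posterior probabilities over the individuals of a population. A minimal Python sketch, with made-up posterior probabilities standing in for the output of a fitted LD model:

```python
def cluster_frequencies(posteriors):
    """Average posterior cluster memberships over individuals at one SNP.

    posteriors[i][k] is P(Z_il = k | G_i) for individual i and cluster k.
    """
    n_individuals = len(posteriors)
    n_clusters = len(posteriors[0])
    return [
        sum(row[k] for row in posteriors) / n_individuals for k in range(n_clusters)
    ]


# Made-up posteriors for 4 individuals of one population and K = 3 clusters.
posteriors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
]
freqs = cluster_frequencies(posteriors)
print(["%.2f" % f for f in freqs])
```

Repeating this for every population and every SNP yields the multiallelic frequencies that the haplotype-based statistic works on.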


Example data: sheep from Northern Europe

Kijas et al. (2012), PLoS Biology.
6 populations + outgroup (Soay), 388 individuals, 49K SNPs.
Available at http://www.sheephapmap.org

Before diving in...

Remember the assumptions underlying the neutral model:
- Population tree
- Pure drift model (no mutations, no admixture)
- Small $F_i$ (say < 0.2)

This means:
- Discard strongly bottlenecked or admixed populations.
- Consider that low-frequency variants are more likely to have appeared after the population split.

Perform a diversity analysis beforehand:
- Population structure (STRUCTURE, PCA, treemix...)
- Within-population kinship between individuals, to identify a set of unrelated individuals

Get the software :)

https://forge-dga.jouy.inra.fr/projects/hapflk

Available for 64-bit Linux and Mac OS X 10.6+. Estimation of the kinship matrix requires R with the ape and phangorn packages.

Run single SNP analysis

hapflk reads PLINK files (ped/map or bed/bim/fam); the first column (FID) must give the population name.

    hapflk --bfile NorthernSheep --outgroup Soay

    [ 00:00:00 ] Reading Input Files
    [ 00:00:58 ] Computing Allele Frequencies
    NewZealandRomney
    [ 00:02:21 ] Computing Reynolds distances
    [ 00:02:21 ] Computing Kinship Matrix
    Loading required package: ape
    [ 00:02:21 ] Computing FLK tests
    [ 00:02:32 ] Writing down results
    [ 00:02:36 ] The End

NB: single SNP analysis is fast.

Output files

- hapflk_reynolds.txt: Reynolds distance matrix
- hapflk_poptree.pdf: population tree figure
- hapflk_kinship.r: R code for estimating the kinship matrix
- hapflk_fij.txt: kinship matrix
- hapflk.frq: allele frequencies
- hapflk-snp-reynolds.txt: Reynolds distances in the region (more later)
- hapflk.flk: FLK results

Population tree

[Figure: population tree with tips IrishSuffolk, NewZealandRomney, Galway, GermanTexel, ScottishTexel and NewZealandTexel.]

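Fit of the χ² distribution (1)

    # Read the single-SNP FLK results written by hapflk
    flk <- read.table("hapflk.flk", header = TRUE)
    # Keep SNPs with an intermediate estimated ancestral frequency
    mysnps <- flk$pzero > 0.05 & flk$pzero < 0.95
    hist(flk$flk[mysnps], n = 50, freq = FALSE, xlab = "FLK", main = "")
    xx <- seq(0, 35, by = 0.1)  # grid for the chi-square density curve
    lines(xx, dchisq(xx, df = 5), lwd = 2)

[Figure: histogram (density) of FLK for these SNPs, with the χ²(5) density overlaid.]

Good overall fit, with slightly fewer high values than expected under a χ²(5). Drift is relatively high ($F_i \approx 0.16$).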

Fit of the χ² distribution (2)

    hist(flk$flk[!mysnps], n = 50, freq = FALSE, xlab = "FLK", main = "")

[Figure: histogram (density) of FLK for SNPs with low or high estimated ancestral frequency.]

No fit. For these SNPs, our neutral model (no mutation in the tree) is clearly wrong. Proceed with caution for SNPs with low or high $\hat{p}_0$.

FLK Manhattan plot

[Figure: Manhattan plot of FLK p-values; y-axis: -log10(p).]

Large drift affects the power of single SNP tests.

Let's go the haplotype way

Wait... what K? Use the fastPHASE cross-validation routine on your favorite chromosome (the big one). Hint:

    plink --chr 1 --recode-fastphase...

Then run hapflk, with one run for each chromosome:

    hapflk --bfile NorthernSheep --outgroup Soay --kinship hapflk_fij.txt --chr 1 -K 40 --ncpu 7 -p OAR1

Note: if the data are phased or come from inbred lines, this can be specified with --phased or --inbred, which makes fitting the LD model much faster.

hapflk output files

- hapflk.hapflk: hapflk results
- hapflk.kfrq.fit_{n}.bz2: haplotype cluster frequencies

The fastPHASE model is estimated several (T) times (20 by default), and the hapflk statistic is averaged over these T fits. The haplotype cluster frequencies are given for each fit.

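Distribution of the hapflk statistic

In this particular case, the distribution is close to normal, plus outliers. Robust estimation of the normal distribution parameters:

    # `hapflk` is assumed to hold the vector of hapflk statistics
    require(MASS)
    mod <- rlm(hapflk ~ 1)        # robust fit of the mean
    mu <- mod$coefficients[1]
    ss <- mod$s                   # robust scale estimate
    pvalue <- pnorm(hapflk, mean = mu, sd = ss, lower.tail = FALSE)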
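The same idea, fitting a normal null to the bulk of the values while ignoring outliers, can be sketched in Python. Note the swapped-in technique: this sketch uses the median and the scaled median absolute deviation (MAD) rather than the M-estimator behind `rlm`, so the estimates will not match the R code exactly. The input values are made up.

```python
import statistics


def robust_normal_pvalues(values):
    """Upper-tail p-values under a normal null fitted with median and scaled MAD."""
    mu = statistics.median(values)
    mad = statistics.median(abs(v - mu) for v in values)
    sigma = 1.4826 * mad  # makes the MAD consistent with the normal standard deviation
    null = statistics.NormalDist(mu, sigma)
    return mu, sigma, [1.0 - null.cdf(v) for v in values]


# Made-up statistics: a roughly symmetric bulk plus one clear outlier.
values = [-1.2, -0.5, -0.1, 0.0, 0.2, 0.4, 0.9, 1.1, 8.0]
mu, sigma, pvalues = robust_normal_pvalues(values)
print(f"mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"p-value of the largest statistic: {pvalues[-1]:.2e}")
```

Because the location and scale are driven by the bulk of the distribution, the outlier does not inflate the variance estimate and therefore keeps a very small p-value.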

Manhattan plot: hapflk

[Figure: hapflk Manhattan plot.]

hapflk reveals clear outlying regions.

Looking at a particular region

Once outlying regions are found, we want to know which population(s) have experienced a selection event:
- Look at local allele frequencies.
- Build local population trees to find which branch(es) have been affected.
- (Eigen decomposition of hapflk)

Local SNP and cluster frequencies

chr 2: selection in Texel breeds, GDF8 (MSTN) mutation.

An R script for haplotype cluster plots is provided on the hapflk webpage.

Local SNP and cluster frequencies

chr 14: less obvious.

Local population trees

Script for making these trees to be released...

Example on the 1000 Bull genomes data

Differentiation-based tests can find causal mutations: example of coat color mutations (MC1R) in the 1000 Bull genomes dataset.

[Figure: test signal along the region; x-axis: position (Kbp), from 14200 to 15000.]

References

- Nicholson et al. (2002). J. Roy. Stat. Soc. B, 64(4), 695-715.
- Laval et al. (2002). Genetics Selection Evolution, 34(4), 481-508.
- Bonhomme et al. (2010). Genetics, 186(1), 241-262.
- Fariello et al. (2013). Genetics, 193(3), 929-941.