Robust demographic inference from genomic and SNP data

Similar documents
Estimating Evolutionary Trees. Phylogenetic Methods

Figure S1: The model underlying our inference of the age of ancient genomes

Genetic Drift in Human Evolution

Supporting Information

Supporting Information Text S1

Learning ancestral genetic processes using nonparametric Bayesian models

Gene Genealogies Coalescence Theory. Annabelle Haudry Glasgow, July 2009

Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information #

Supporting information for Demographic history and rare allele sharing among human populations.

Coalescent based demographic inference. Daniel Wegmann University of Fribourg

Inferring the age of a fixed beneficial allele

7. Tests for selection

Introduction to Advanced Population Genetics

Distinguishing between population bottleneck and population subdivision by a Bayesian model choice procedure

Diffusion Models in Population Genetics

Solutions to Even-Numbered Exercises to accompany An Introduction to Population Genetics: Theory and Applications Rasmus Nielsen Montgomery Slatkin

Populations in statistical genetics

Rapid speciation following recent host shift in the plant pathogenic fungus Rhynchosporium

Mathematical models in population genetics II

Taming the Beast Workshop

Supplementary Materials: Efficient moment-based inference of admixture parameters and sources of gene flow

1. Understand the methods for analyzing population structure in genomes

Q1) Explain how background selection and genetic hitchhiking could explain the positive correlation between genetic diversity and recombination rate.

Supplementary Material to Full Likelihood Inference from the Site Frequency Spectrum based on the Optimal Tree Resolution

The problem Lineage model Examples. The lineage model

Demographic inference reveals African and European. admixture in the North American Drosophila. melanogaster population

Intraspecific gene genealogies: trees grafting into networks

Microsatellite data analysis. Tomáš Fér & Filip Kolář

Anatomy of a species tree

Stochastic Demography, Coalescents, and Effective Population Size

Discrete & continuous characters: The threshold model

Challenges when applying stochastic models to reconstruct the demographic history of populations.

CSci 8980: Advanced Topics in Graphical Models Analysis of Genetic Variation

Full Likelihood Inference from the Site Frequency Spectrum based on the Optimal Tree Resolution

The scaling of genetic diversity in a changing and fragmented world

Supplementary Materials for

Population Genetics I. Bio

the role of standing genetic variation

arxiv: v1 [stat.co] 2 May 2011

Inferring Admixture Histories of Human Populations Using. Linkage Disequilibrium

POPULATION STRUCTURE 82

Genetic diversity of beech in Greece

The Impact of Sampling Schemes on the Site Frequency Spectrum in Nonequilibrium Subdivided Populations

Analysis of Chimpanzee History Based on Genome Sequence Alignments

SWEEPFINDER2: Increased sensitivity, robustness, and flexibility

Frequency Spectra and Inference in Population Genetics

How should we organize the diversity of animal life?

Diversity and Human Evolution. Homo neanderthalensis. Homo neanderthalensis. Homo neanderthalensis. Homo neanderthalensis. Part II

Markov chain Monte-Carlo to estimate speciation and extinction rates: making use of the forest hidden behind the (phylogenetic) tree

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

31/10/2012. Human Evolution. Cytochrome c DNA tree

Demography April 10, 2015

CREATING PHYLOGENETIC TREES FROM DNA SEQUENCES

Dr. Amira A. AL-Hosary

Fei Lu. Post doctoral Associate Cornell University

Organizing Life s Diversity

Detecting selection from differentiation between populations: the FLK and hapflk approach.

Learning Ancestral Genetic Processes using Nonparametric Bayesian Models

Today's project. Test input data Six alignments (from six independent markers) of Curcuma species

STEM-hy: Species Tree Estimation using Maximum likelihood (with hybridization)

Calculation of IBD probabilities

Supplementary Information

Genetic Variation in Finite Populations

Closed-form sampling formulas for the coalescent with recombination

Figure S2. The distribution of the sizes (in bp) of syntenic regions of humans and chimpanzees on human chromosome 21.

GENETICS OF MODERN HUMAN ORIGINS AND DIVERSITY

Phylogenetic inference: from sequences to trees

6 Introduction to Population Genetics

Haplotyping as Perfect Phylogeny: A direct approach

Adaptation in the Human Genome. HapMap. The HapMap is a Resource for Population Genetic Studies. Single Nucleotide Polymorphism (SNP)

p(d g A,g B )p(g B ), g B

A (short) introduction to phylogenetics


Evolution and pathogenicity in the deadly chytrid pathogen of amphibians. Erica Bree Rosenblum UC Berkeley

A novel method with improved power to detect recombination hotspots from polymorphism data reveals multiple hotspots in human genes

Bayesian species identification under the multispecies coalescent provides significant improvements to DNA barcoding analyses

THE genetic patterns observed today in most species

6 Introduction to Population Genetics

Selection against A 2 (upper row, s = 0.004) and for A 2 (lower row, s = ) N = 25 N = 250 N = 2500

Multiple Sequence Alignment. Sequences

Genetics. Metapopulations. Dept. of Forest & Wildlife Ecology, UW Madison

MtDNA profiles and associated haplogroups

Effect of Genetic Divergence in Identifying Ancestral Origin using HAPAA

Maximum likelihood inference of population size contractions from microsatellite data

Genetics: Early Online, published on December 29, 2015 as /genetics

Quartet Inference from SNP Data Under the Coalescent Model

How robust are the predictions of the W-F Model?

TO date, several studies have confirmed that Drosophila

Casey Leonard. Multiregional model vs. Out of Africa theory SLCC

Genetics: Early Online, published on February 26, 2016 as /genetics Admixture, Population Structure and F-statistics

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Demographic Inference with Coalescent Hidden Markov Model

Concepts and Methods in Molecular Divergence Time Estimation

Supplemental Information Likelihood-based inference in isolation-by-distance models using the spatial distribution of low-frequency alleles

When is selection effective?

Lecture 18 - Selection and Tests of Neutrality. Gibson and Muse, chapter 5 Nei and Kumar, chapter 12.6 p Hartl, chapter 3, p.

Species Tree Inference using SVDquartets

CISC 889 Bioinformatics (Spring 2004) Hidden Markov Models (II)

A Phylogenetic Network Construction due to Constrained Recombination

Neutral genomic regions refine models of recent rapid human population growth

Transcription:

Robust demographic inference from genomic and SNP data Laurent Excoffier Isabelle Duperret, Emilia Huerta-Sanchez, Matthieu Foll, Vitor Sousa, Isabel Alves Computational and Molecular Population Genetics Lab (CMPG) Institute of Ecology and Evolution University of Berne Swiss Institute of Bioinformatics

Past demography affect genetic diversity Past Stationary population Recent expansion Recent contraction Present Mixture of rare and frequent mutations Few and mostly rare mutations Very deep lineages separating little differentiated clades

Site Frequency Spectrum (SFS) depends on past demography

Joint SFS (2D-SFS) N A m N 12 1 N 2 TDIV Model of Isolation with migration (IM)

Problems with estimation of demographic parameters from SFS

Estimation of demographic parameters from SFS with dadi 2009 Program Inference : Diffusion Approximation for Demographic http://code.google.com/p/dadi/ a i dadi estimates the site frequency spectrum based on a diffusion approximation

Advantages of SFS for parameter inference Accuracy of estimates increases with data size, but computing time does not Can be used to study complex scenarios (e.g. as complex as ABC) Very fast estimations (as compared to ABC, or full likelihoods)

Potential problems Maximization of the CL is not trivial (precision of the approximation and convergence problems) Ignores (assumes no) LD Need to repeat estimations to find maximum CL Needs genomic data (several Mb) difficult to have gene-specific estimates Next-generation sequencing data must have high coverage to correctly estimate SFS (likely to miss singletons or show errors). SFS needs to be estimated from the NGS reads (ML methods: Nielsen et al. 2013, Keightley and Halligan, 2011)

Estimating the SFS with coalescent simulations The probability of a SFS entry i can be estimated under a specific model θ from its expected coalescent tree as (Nielsen 2000) a ratio of expected branch lengths p = Et ( θ)/ ET ( θ) i i t i : total length of all branches directly leading to i terminal nodes T : total tree length. This probability can then be estimated on the basis of Z simulations as where b kj is the length of the k-th compatible branch in simulation j. Z pˆ b T = i kj j j k Φi j Z b 1 b 6 b 4 b 2 b 2 b 2 b 1 b 1 b 1

Likelihood The (composite) likelihood of a model θ is obtained as a multinomial sampling of sites (Adams and Hudson, 2004) n 1 M S = obs θ 0 0 i= 1 mi CL Pr( SFS ) P (1 P ) pˆ M : number of monomorphic sites S : number of polymorphic sites P 0 : probability of no mutation on the tree p i : probability of the i-th SFS entry m i : number of sites with derived frequency i i This can be generalized for the joint SFS of two or more populations

fastsimcoal2 program Uses coalescent simulations to estimate the SFS and approximate the likelihood Large number of simulations per point (>50000) Uses a conditional expectation maximization (CEM) algorithm to find maxcl parameters Relatively fast and can explore wide and unbounded parameter ranges Can handle an arbitrary number of populations For more than 4 populations, we use a composite compositelikelihood CL 1234 = CL 12 CL 13 CL 14 CL 23

Approximation of the SFS Chen (2012) TPB Coalescent approach to infer the expected joint SFS numerically Divergence model 5000 5K 500 T DIV T DIV =10 T DIV =100

Bottleneck model fastsimcoal2 N A N BOT N CUR a i T BOT Simulation of 20 Mb data 10 cases, 50 runs/case a i 9/10

IM model N A m 12 N 1 N 2 T DIV m 21 a i 8/10

Pseudo human evolution model N A N BOT N OUT T BOT N 1 m 10 6 10 6 T DIV a i 8/10

Herarchical island model 12 populations in two continent-island models Migration rates over 3 orders of magnitude are well recovered!!!

Application: Complete genomics data Four sampled human populations: 4 Luhya from Kenya (LWK) 9 Europeans (CEU) 9 Yoruba (YRI) 5 African Americans (ASW) Data: (sequenced at 51-89x per genome) Multidimensional SFS estimated from : 239, 120 SNPs in non-coding and non CpG regions Each SNP more than 5 Kb away from the other

Model of admixture in African Americans West-African metapopulation European metapopulation Luhya (Kenya) Ghost (East-African) meta-population Afr. Am. Yoruba (Nigeria) Northern Europeans

Model of admixture in African Americans

Models of African population divergence Two models with different degrees of realism and complexity IM model 2 continent-island model 3 populations 5 populations The estimation of each model were performed separately for the San (109,020 SNPs) and the Yoruba (81,383 SNPs) SNP panels

Models of African population divergence IM model Good agreement between panels

Models of African population divergence 2 continent-island model Akaike s weigths of evidence in favor of model B are close to 1 for both panels

Models of African population divergence 1,475 y 1,925 y 4,250 y 7,450 y 138,250 y 258,250 y

Inference of archaic admixture in modern humans Simple model (proof of concept) N ANH Altai Neandertal N AN N H T DIV T DN N ALT 2,000 N N admix N BOT N CH TBOT Complete genomics CHB or TSI samples (4 inds / pop) Other unsampled Neandertal Data set: Non coding DNA and non CpG sites. Altai Neandertal (Prüfer et al. 2013), unfiltered vcf 271,994 regions of 100 bp in non-coding DNA Ancestral state deduced by 1000G for 26,466,040 bp (26.5Mb) All regions are at least 5 Kb apart from each other

Inference of archaic admixture in modern humans Very preliminary results Admixture level CHB: 1.2% (0.94-1.43) TSI: 1.3% (0.85-1.45) Recent admixture!! TSI: 875 gen (790-1030) CHB: 950 gen ( 810-1200) <25,000 y (assuming u=2e-8)

Possible extensions Multiprocessor version of fsc MCMC (Beaumont 2004, Garrigan 2009) Multilocus SFS Coalescent simulations through pedigrees

Thanks to: Isabelle Duperret Emilia Huerta-Sanchez Isabel Alves Vitor Sousa Matthieu Foll David Reich Nick Patterson Rasmus Nielsen CMPG lab