Visualizing Population Genetics

Similar documents
Evaluating the performance of a multilocus Bayesian method for the estimation of migration rates

Microsatellite data analysis. Tomáš Fér & Filip Kolář

Microsatellites as genetic tools for monitoring escapes and introgression

Multiple paternity and hybridization in two smooth-hound sharks

Learning ancestral genetic processes using nonparametric Bayesian models

Bayesian Methods with Monte Carlo Markov Chains II

Package LBLGXE. R topics documented: July 20, Type Package

Multivariate analysis of genetic data an introduction

Supporting Information

Comparing latent inequality with ordinal health data

Methods for Cryptic Structure. Methods for Cryptic Structure

Model-Free Knockoffs: High-Dimensional Variable Selection that Controls the False Discovery Rate

McGill University. Faculty of Science MATH 204 PRINCIPLES OF STATISTICS II. Final Examination

Similarity Measures and Clustering In Genetics

p(d g A,g B )p(g B ), g B

mstruct: Inference of Population Structure in Light of both Genetic Admixing and Allele Mutations

COMBI - Combining high-dimensional classification and multiple hypotheses testing for the analysis of big data in genetics

The E-M Algorithm in Genetics. Biostatistics 666 Lecture 8

Population Structure

How to analyze many contingency tables simultaneously?

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

THE genetic patterns observed today in most species

Nonparametric Tests for Multi-parameter M-estimators

Supplementary Materials for Molecular QTL Discovery Incorporating Genomic Annotations using Bayesian False Discovery Rate Control

Problem Set 3: Bootstrap, Quantile Regression and MCMC Methods. MIT , Fall Due: Wednesday, 07 November 2007, 5:00 PM

Genetic Association Studies in the Presence of Population Structure and Admixture

Assessment Schedule 2014 Geography: Demonstrate understanding of how interacting natural processes shape a New Zealand geographic environment (91426)

CSci 8980: Advanced Topics in Graphical Models Analysis of Genetic Variation

BIOINFORMATICS. Gilles Guillot

Binomial Mixture Model-based Association Tests under Genetic Heterogeneity

Genetic Variation in Finite Populations

Expected complete data log-likelihood and EM

Question: If mating occurs at random in the population, what will the frequencies of A 1 and A 2 be in the next generation?

On the informativeness of dominant and co-dominant genetic markers for Bayesian supervised clustering

The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations

Bayesian Inference of Interactions and Associations

Appendix 3. Jaspers et al. Appendix 3

Lecture 13: Population Structure. October 8, 2012

Bayesian clustering algorithms ascertaining spatial population structure: a new computer program and a comparison study

Major questions of evolutionary genetics. Experimental tools of evolutionary genetics. Theoretical population genetics.

CHICAGO: A Fast and Accurate Method for Portfolio Risk Calculation

Production type of Slovak Pinzgau cattle in respect of related breeds

Tutorial Session 2. MCMC for the analysis of genetic data on pedigrees:

1.5.1 ESTIMATION OF HAPLOTYPE FREQUENCIES:

Package sdprisk. December 31, 2016

Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies

Essential Statistics Chapter 6

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS

A Saddlepoint Approximation Based Simulation Method for Uncertainty Analysis

Association studies and regression

Populations in statistical genetics

Maximum likelihood inference of population size contractions from microsatellite data

Principles of QTL Mapping. M.Imtiaz

Multiple Choice Open-Ended Gridded Response Look at the coordinate plane below. In which quadrant is Point A located?

Lecture WS Evolutionary Genetics Part I 1

The Union and Intersection for Different Configurations of Two Events Mutually Exclusive vs Independency of Events

BTRY 4830/6830: Quantitative Genomics and Genetics Fall 2014

Frequency Analysis & Probability Plots

Modeling conditional distributions with mixture models: Applications in finance and financial decision-making

IN applications of population genetics, it is often use- populations based on these subjective criteria represents

Probability and Estimation. Alan Moses

A paradigm shift in DNA interpretation John Buckleton

Comparing latent inequality with ordinal health data

BIOINFORMATICS. StructHDP: Automatic inference of number of clusters and population structure from admixed genotype data

Chapter 2. Continuous random variables

Network models: dynamical growth and small world

Phenotype to Genotype Matching and Epigenetics in Evolutionary Algorithms

Taming the Beast Workshop

New Bayesian methods for model comparison

Bayes methods for categorical data. April 25, 2017

Approximate Bayesian Computation

A spatial Dirichlet process mixture model for clustering population genetics data

CHICAGO: A Fast and Accurate Method for Portfolio Risk Calculation

2. Map genetic distance between markers

A spatial statistical model for landscape genetics

Figure S2. The distribution of the sizes (in bp) of syntenic regions of humans and chimpanzees on human chromosome 21.

Curriculum Map. Biology, Quarter 1 Big Ideas: From Molecules to Organisms: Structures and Processes (BIO1.LS1)

On inferring and interpreting genetic population structure - applications to conservation, and the estimation of pairwise genetic relatedness

Semiparametric Generalized Linear Models

Bayesian Modeling of Conditional Distributions

Nonparametric Bayes Modeling

A spatial Dirichlet process mixture model for clustering population genetics data

A moment estimation of the haplotypes distribution using phenotypes data 1

LEA: An R package for Landscape and Ecological Association Studies. Olivier Francois Ecole GENOMENV AgroParisTech, Paris, 2016

Statistical Ranking Problem

Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics

Modelling Genetic Variations with Fragmentation-Coagulation Processes

Detecting selection from differentiation between populations: the FLK and hapflk approach.

SELESTIM: Detec-ng and Measuring Selec-on from Gene Frequency Data

Shared Segmentation of Natural Scenes. Dependent Pitman-Yor Processes

Adventures in Forensic DNA: Cold Hits, Familial Searches, and Mixtures

Monte Carlo Studies. The response in a Monte Carlo study is a random variable.

Chapter 6 Linkage Disequilibrium & Gene Mapping (Recombination)

Variable selection and machine learning methods in causal inference

The Normal Distribuions

Statistical Methods for studying Genetic Variation in Populations

Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals. Regression Output. Conditions for inference.

A partial regression coefficient analysis framework to infer candidate genes potentially causal to traits in recombinant inbred lines

Interpreting Mixed Membership Models: Implications. Erosheva s Representation Theorem...

Conservation genetics of the Ozark pocket gopher

Transcription:

Visualizing Population Genetics Supervisors: Rachel Fewster, James Russell, Paul Murrell Louise McMillan 1 December 2015 Louise McMillan Visualizing Population Genetics 1 December 2015 1 / 29

Outline 1 Background and Motivation 2 Characterising Genetic Distribution of Population 3 Saddlepoint Approximation within GenePlot 4 Further Work Louise McMillan Visualizing Population Genetics 1 December 2015 1 / 29

Assignment Uses genetic data (microsatellites or Single Nucleotide Polymorphisms (SNPs)) Compare individuals to populations Infer likely source populations Louise McMillan Visualizing Population Genetics 1 December 2015 2 / 29

Post-eradication Assignment Louise McMillan Visualizing Population Genetics 1 December 2015 3 / 29

Post-eradication Assignment Louise McMillan Visualizing Population Genetics 1 December 2015 4 / 29

Post-eradication Assignment Louise McMillan Visualizing Population Genetics 1 December 2015 5 / 29

Assignment Process Baseline samples from each possible source population Louise McMillan Visualizing Population Genetics 1 December 2015 6 / 29

Assignment Process Estimate microsatellite allele frequencies of candidate source populations using baseline sample Louise McMillan Visualizing Population Genetics 1 December 2015 7 / 29

Assignment Process Dirichlet prior for allele frequencies p Dirichlet(τ, τ,..., τ) Combine Dirichlet prior and multinomial data to get Dirichlet posterior for baseline allele frequencies p Dirichlet(x 1 + τ, x 2 + τ,..., x k + τ) Louise McMillan Visualizing Population Genetics 1 December 2015 8 / 29

Assignment Process Compare alleles from new sample with allele frequencies for candidate source population Calculate Log Genotype Probability (LGP) using Dirichlet Compound Multinomial (DCM) distribution Probability of obtaining individual s genotype from multinomial distribution with estimated allele proportions as probabilities P(a) = (x r + τ)(x r + τ + 1) (n + 1)(n + 2) 2(x r + τ)(x s + τ) (n + 1)(n + 2) a r = 2, a j = 0 for j r a r = a s = 1, a j = 0 for j r, s Louise McMillan Visualizing Population Genetics 1 December 2015 9 / 29

Assignment Process Combine over all loci to get overall Log Genotype Probability (LGP) Calculate for each candidate source population Assign individual to population with highest LGP OR only assign if individual has assignment score above threshold for at least one population Louise McMillan Visualizing Population Genetics 1 December 2015 10 / 29

Assignment Process Starred points indicate individuals with missing data Kaikoura Island vs. Aotea 1% 100% Broken Islands vs. Aotea 1% 100% Log[10] genotype probability for Aotea 22 20 18 16 14 12 10 8 Kaikoura Island Main Aotea 1% Log[10] genotype probability for Aotea 35 30 25 20 15 10 5 Broken Islands Main Aotea 100% 1% 22 20 18 16 14 12 10 8 Log[10] genotype probability for Kaikoura Island 35 30 25 20 15 10 5 Log[10] genotype probability for Broken Islands Louise McMillan Visualizing Population Genetics 1 December 2015 11 / 29

Alternative Assignment Methods GENECLASS2 (Piry et al. 2004) and GenePlot use the above method STRUCTURE (Pritchard et al. 2000) uses same Dirichlet posterior for allele frequencies and same LGP calculations But STRUCTURE performs clustering, using MCMC to sample from distribution of population membership combinations Louise McMillan Visualizing Population Genetics 1 December 2015 12 / 29

GenePlot Method for Missing Data Not all DNA samples replicate correctly Individual may have missing data at one or more loci Calculate LGP for non-missing loci But LGP will be on different scale than LGP for individuals with all loci Louise McMillan Visualizing Population Genetics 1 December 2015 13 / 29

Characterising Genetic Distribution of Population Calculate LGP for non-missing loci Find corresponding quantile in full-loci distribution Requires characterisation of full-loci and reduced-loci distributions for all candidate source populations Distribution is sum of distributions from each locus, difficult to characterise Louise McMillan Visualizing Population Genetics 1 December 2015 14 / 29

Characterising Genetic Distribution of Population Possible to simulate individuals from each population using allele frequencies Distribution of genotype probabilities can be estimated from simulated data Slow to calculate, non-repeatable results Unreliable in long lower tail of distribution Louise McMillan Visualizing Population Genetics 1 December 2015 15 / 29

Characterising Genetic Distribution of Population Alternative method is to approximate population genetic distribution analytically Saddlepoint approximation method achieves this with high accuracy Louise McMillan Visualizing Population Genetics 1 December 2015 16 / 29

Saddlepoint Approximation Initial motivation for the saddlepoint approximation: Set of random variables X 1, X 2,..., X n with known distributions What is the distribution of their sum, Xtot? Louise McMillan Visualizing Population Genetics 1 December 2015 17 / 29

Saddlepoint Approximation PDF approximation derived by Daniels in 1954 CDF approximation derived by Lugannani & Rice in 1980 Can approximate any distribution, not just sums Approximations rely on the cumulant generating function (CGF) Overall CGF is sum of component CGFs Louise McMillan Visualizing Population Genetics 1 December 2015 18 / 29

Multilocus Genetic Distribution Distribution of genotype probabilities in a population MGF at a single locus is given by: M(t) = P(Y = y)e ty = p i e t log p i = p t+1 i p i are the probabilities of the possible genotypes at that locus. Louise McMillan Visualizing Population Genetics 1 December 2015 19 / 29

Multilocus Genetic Distribution Derivatives of the MGF are given by: M (r) (t) = (log p i ) r p t+1 i Overall CGF and derivatives are sums of those at each locus Louise McMillan Visualizing Population Genetics 1 December 2015 20 / 29

Multilocus Genetic Distribution Saddlepoint CDF formula: where { } Φ(ŵ) + φ(ŵ)(1/ŵ 1/û) if x µ ˆF (x) = 1 2 + K (0) 6 if x = µ 2πK (0) 3/2 K (ŝ) = x ŵ = sign(ŝ) 2(ŝx K(ŝ)) û = ŝ K (ŝ) Louise McMillan Visualizing Population Genetics 1 December 2015 21 / 29

Test against Generated Samples Generate 100,000 samples from multilocus distribution Calculate empirical CDF Compare with saddlepoint approximation Louise McMillan Visualizing Population Genetics 1 December 2015 22 / 29

Test against Generated Samples Louise McMillan Visualizing Population Genetics 1 December 2015 23 / 29

Test for Simulated Populations Calculate sum of squared differences over 1000 points for different numbers of loci Louise McMillan Visualizing Population Genetics 1 December 2015 24 / 29

Characterising Genetic Distribution of Population Calculate LGP for non-missing loci Find corresponding quantile in full-loci distribution Requires characterisation of full-loci and reduced-loci distributions for all candidate source populations Distribution is sum of distributions from each locus, difficult to characterise Louise McMillan Visualizing Population Genetics 1 December 2015 25 / 29

GenePlots LGP results for each individual, with respect to all source populations, can be plotted on a graph Kaikoura Island vs. Aotea 1% 100% Broken Islands vs. Aotea 1% 100% Log[10] genotype probability for Aotea 22 20 18 16 14 12 10 8 Kaikoura Island Main Aotea 1% Log[10] genotype probability for Aotea 35 30 25 20 15 10 5 Broken Islands Main Aotea 100% 1% 22 20 18 16 14 12 10 8 Log[10] genotype probability for Kaikoura Island 35 30 25 20 15 10 5 Log[10] genotype probability for Broken Islands Louise McMillan Visualizing Population Genetics 1 December 2015 26 / 29

Further Work Simulate scenarios to test for bias in assignment results Investigate relatedness within populations and effects on assignment accuracy Online version of GenePlot: https://lmcm177.shinyapps.io/geneplot-on-the-web Louise McMillan Visualizing Population Genetics 1 December 2015 27 / 29

References B. Rannala and J. L. Mountain (1997) Detecting immigration by using multilocus genotypes Proceedings of the National Academy of Sciences USA, Vol. 94. S. Piry et al (2004) GENECLASS2: A Software for Genetic Assignment and First-Generation Migrant Detection. Journal of Heredity, Vol. 95 (Issue 6). J. K. Pritchard, M. Stephens and P. Donnelly (2000) Inference of Population Structure Using Multilocus Genotype Data. Genetics, Vol. 155. H. E. Daniels (1954) Saddlepoint Approximations in Statistics. The Annals of Mathematical Statistics, Vol. 25. R. Lugannani and S. Rice (1980) Saddle point approximations for the distribution of the sum of independent random variables. Advances in Applied Probability, Vol. 12. C. Goutis and G. Casella (1999) Explaining the Saddlepoint Approximation. The American Statistician, Vol. 53. Louise McMillan Visualizing Population Genetics 1 December 2015 28 / 29

Acknowledgements Many thanks to my supervisors, particularly Rachel Fewster Louise McMillan Visualizing Population Genetics 1 December 2015 29 / 29