
Statistical Inference for Population Genetics Models

Paul Joyce, University of Idaho

A large body of mathematical population genetics was developed by the three main speakers in this symposium. As a tribute to the substantial contributions of Ewens, Griffiths and Tavaré, I will present an overview of some of my work, which builds upon their ideas. The focus will be on issues in the realm of Mathematical Statistics. The likelihood functions are based on the stationary distributions, under both infinite-alleles and K-alleles models, involving mutation, selection and genetic drift. The theoretical portion of the talk considers limiting results that determine under what conditions the models can be distinguished based on allele frequency data at a single locus. The computational portion of the talk focuses on new, computationally efficient approaches to analyzing data under these models.

A brief history of the problem

In the late 1990s John Gillespie challenged Tom Kurtz, Steve Krone and myself to come up with a rigorous proof of his conjecture that the heterozygote advantage model converges to the neutral model in the limit as both θ and σ go to infinity at the same rate. (Recall that θ = 4Nu and σ = 4Ns.) In 2002 and 2003 Steve Krone, Tom Kurtz and I published two papers in the Annals of Applied Probability addressing the problem posed by Gillespie. For purely mathematical reasons I decided to consider the homozygote advantage model and developed an analogous result, with some help from my colleague Frank Gao.

[Photos: Gillespie; Joyce, Krone, and Kurtz]

Heterozygote Advantage Model: Notation and Vocabulary Review

What does a sample from a neutral population look like? (My version of Warren's slide.)

Unscaled parameters: N = effective population size; fitness of heterozygote = 1; fitness of homozygote = w, with w < 1; u = per-individual mutation rate.

Scaled parameters: w = 1 − σ/(4N), equivalently σ = (1 − w)4N, and θ = 4Nu.

The Effects of Selection

The probability that two individuals chosen at random are of the same type is the homozygosity

F = Σ_{i=1}^N X_i².

The heterozygote advantage model penalizes homozygotes, thus decreasing F. Recall from calculus that the minimum value of F = Σ_{i=1}^N X_i², subject to the constraint Σ_{i=1}^N X_i = 1, occurs when X_i = 1/N. Selection thus tends to make the allele frequencies more evenly distributed; it is sometimes referred to as balancing selection.

Since σ = 4N(1 − w) and θ = 4Nu, both the mutation parameter and the selection intensity become large as the population size increases. An increase in the mutation parameter θ tends to increase the number of alleles and decrease the homozygosity. An increased selection intensity also decreases the homozygosity. Can high mutation mask selection when the population is large?

[Figures: simulated allele frequency distributions for various values of σ and θ.]

Stationary Distribution under Neutrality

Let V_1, V_2, … be i.i.d. with beta density f(x) = θ(1 − x)^{θ−1}. The joint distribution of the population proportions X = (X_1, X_2, …) under neutrality is given by the stick-breaking construction

X_1 = V_1,  X_i = (1 − V_1)(1 − V_2)⋯(1 − V_{i−1})V_i.  (1)

[Figure: a population X ∼ μ simulated under neutrality.]
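As a concrete illustration (my addition, not part of the original slides), the following minimal Python sketch simulates the stick-breaking construction (1) and the homozygosity F; the truncation level n_sticks is an arbitrary choice.

```python
import numpy as np

def gem_frequencies(theta, n_sticks=10_000, rng=None):
    """Stick-breaking construction (1): V_i i.i.d. Beta(1, theta),
    X_1 = V_1, X_i = (1 - V_1)...(1 - V_{i-1}) V_i."""
    rng = np.random.default_rng() if rng is None else rng
    v = rng.beta(1.0, theta, size=n_sticks)       # density theta*(1-x)^(theta-1)
    stick = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))  # unbroken stick left
    return stick * v

x = gem_frequencies(theta=4.0, rng=np.random.default_rng(0))
F = np.sum(x**2)    # homozygosity; E(F) = 1/(1 + theta) = 0.2 here
print(F)
```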

Stationary Distribution under Selection

The stationary distribution under selection depends on the population homozygosity F = Σ_i X_i². The form of the stationary distribution μ_σ follows as a special case of

μ_σ(A) = ∫_A e^{−σF} / E(e^{−σF}) μ(dx),  i.e.  dμ_σ/dμ = e^{−σF} / E[e^{−σF}].  (2)

Samples versus Populations

Let A_n be a random partition structure of a sample of size n. Then

P_σ(A_n = a_n) / P(A_n = a_n) = E( dμ_σ/dμ (X) | A_n = a_n )

and

lim_{n→∞} P_σ(A_n = a_n) / P(A_n = a_n) = dμ_σ/dμ (X),

where P(A_n = a_n) is the Ewens Sampling Formula. See Joyce (1994) for more details.

Gillespie's conjecture (Theorem 4.4 in Ethier and Kurtz 1994): if σ = cθ, then

lim_{θ→∞} dμ_σ/dμ (X) = lim_{θ→∞} exp{−σ Σ X_i²} / E[exp{−σ Σ X_i²}] = 1.  (3)

Joyce, Krone and Kurtz (2003)

Theorem 1. Suppose X = (X_1, X_2, …) ∼ μ and Y = (Y_1, Y_2, …) ∼ μ_σ, where σ = cθ^{3/2+γ} and c > 0 is a constant, and let F(X) = Σ X_i² and F(Y) = Σ Y_i². Then, as θ → ∞,

dμ_σ/dμ (X) = e^{−σF(X)} / E(e^{−σF(X)}) →
  1, if γ < 0,
  exp(−cZ − c²), if γ = 0,
  0, if γ > 0,

where Z ∼ N(0, 2).

Outline of the proof of Theorem 1

Define

Z_θ = √θ (θ Σ X_i² − 1).  (4)

For σ = cθ^{3/2}, rewrite

dμ_σ/dμ (X_θ) = e^{−σ Σ X_i²} / E(e^{−σ Σ X_i²}) = exp{−cZ_θ} / E(exp{−cZ_θ}),  (5)

since the constant factor e^{−c√θ} cancels in the ratio. For γ = 0 we need to show that, as θ → ∞:

1. Z_θ → Z in distribution, so that exp{−cZ_θ} / E(exp{−cZ_θ}) converges to exp{−cZ} / E(exp{−cZ});
2. E(exp{−cZ_θ}) → E(exp{−cZ}).
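A quick Monte Carlo check of the Gaussian limit behind (4) (my addition, not in the slides), reusing the stick-breaking sketch above; the value of θ, the truncation level and the replicate count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, reps = 200.0, 2000
z = np.empty(reps)
for r in range(reps):
    v = rng.beta(1.0, theta, size=20_000)                      # stick breaking
    x = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1])) * v
    z[r] = np.sqrt(theta) * (theta * np.sum(x**2) - 1.0)       # Z_theta, eq. (4)

# For large theta, Z_theta should look approximately N(0, 2);
# convergence is slow on the right because of the heavy right tail.
print(z.mean(), z.var())
```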

Z_θ has a heavy right tail

E(exp{−cZ_θ}) → E(exp{−cZ}) as θ → ∞, but E(exp{cZ_θ}) → ∞ as θ → ∞.

[Figures: distribution of Z_θ; distribution of Z.]

Homozygote Advantage Model

Unscaled parameters: N = effective population size; fitness of heterozygote = 1; fitness of homozygote = w, with w > 1; u = per-individual mutation rate.

Scaled parameters: w = 1 + σ/(4N), equivalently σ = (w − 1)4N, and θ = 4Nu.

Joyce and Gao (2006), Homozygote Advantage Theorem

Let c* be the solution to

((1 − √(1 − 2/c)) / 2) exp{ c ((1 + √(1 − 2/c)) / 2)² } = 1,  (6)

so that c* ≈ 2.45541. Suppose X = (X_1, X_2, …) ∼ μ and Y = (Y_1, Y_2, …) ∼ μ_σ, and let σ(θ) = cθ. As θ → ∞,

dμ_σ/dμ (X) = e^{cθ Σ X_i²} / E(e^{cθ Σ X_i²}) → 1 if c < c*, and → 0 if c ≥ c*;

dμ_σ/dμ (Y) = e^{cθ Σ Y_i²} / E(e^{cθ Σ X_i²}) → 1 if c < c*, and → ∞ if c ≥ c*.

What is c? Recall that θ = 4Nu and σ = 4N(w − 1), where w > 1. If σ = cθ, then c = σ/θ = (w − 1)/u.

The theorem in words: at a highly polymorphic locus (θ large), the homozygote advantage model is readily distinguishable from the neutral model provided the selection coefficient w − 1 is at least 2.45541 times larger than the per-individual mutation rate u. However, if the selective advantage is below 2.45541 times the mutation rate, then the models are indistinguishable in the limit.

Proof

Let V_1, V_2, … be i.i.d. with beta density θ(1 − x)^{θ−1}, and define the population proportions X = (X_1, X_2, …) by

X_1 = V_1,  X_i = (1 − V_1)(1 − V_2)⋯(1 − V_{i−1})V_i.

If F = Σ X_i², then

F = V_1² + (1 − V_1)² F*,

where F* has the same distribution as F and is independent of V_1. Hence, with σ = cθ,

E(e^{σF}) = E(e^{σV²} e^{(1−V)²σF*}) ≥ E(e^{σV²}) = E(e^{cθV²}),

and if V has beta density θ(1 − x)^{θ−1}, then

E(e^{cθV²}) = ∫₀¹ e^{cθx²} θ(1 − x)^{θ−1} dx ≥ θ ∫₀¹ (e^{cx²}(1 − x))^θ dx = θ ∫₀¹ (f_c(x))^θ dx.

Finding the critical c*

The function f_c(x) = e^{cx²}(1 − x) has a local minimum at x_0 = (1 − √(1 − 2/c))/2 and a local maximum at x_1 = (1 + √(1 − 2/c))/2, provided c is larger than 2.

[Figures: f_c(x) for small c, for large c, and for the critical c*.]
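To make the critical value concrete, here is a small Python sketch (my addition) that solves f_c(x_1) = 1 for c by root finding; the bracketing interval is an assumption. It also evaluates sup_{0≤x≤1} log f_c(x), which is the quantity appearing in the large deviation argument below.

```python
import numpy as np
from scipy.optimize import brentq

def f_c_at_local_max(c):
    # f_c(x) = exp(c x^2) (1 - x) evaluated at its local maximum
    # x1 = (1 + sqrt(1 - 2/c)) / 2, which exists for c > 2
    x1 = (1.0 + np.sqrt(1.0 - 2.0 / c)) / 2.0
    return np.exp(c * x1**2) * (1.0 - x1)

# c* solves f_c(x1) = 1; the bracket (2.1, 3.0) was chosen by inspection
c_star = brentq(lambda c: f_c_at_local_max(c) - 1.0, 2.1, 3.0)
print(c_star)                        # approximately 2.45541

# sup over [0, 1] of log f_c(x): about 0 for c <= c*, positive for c > c*
x = np.linspace(0.0, 1.0 - 1e-12, 200_001)
for c in (2.3, c_star, 2.6):
    print(c, np.max(c * x**2 + np.log1p(-x)))
```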

c* is the constant that makes f_c(x_1) = 1, where x_1 = (1 + √(1 − 2/c))/2 is the local maximum:

((1 − √(1 − 2/c)) / 2) exp{ c ((1 + √(1 − 2/c)) / 2)² } = 1,  c* ≈ 2.45541.  (7)

If c > c*, then

E(e^{cθF}) ≥ θ ∫₀¹ (f_c(x))^θ dx → ∞ as θ → ∞.

Large Deviations and the Homozygote Advantage (Feng and Dawson 2005)

Theorem (Varadhan). Assume that {Q_ε : ε > 0} satisfies the large deviation principle with speed 1/ε and rate function I(·), and let C_b(E) denote the set of bounded continuous functions on E. Then for any φ(x) in C_b(E),

Λ_φ = lim_{ε→0} ε log E^{Q_ε}(e^{φ(x)/ε}) = sup_{x∈E} {φ(x) − I(x)}.

For our case,

E = {(x_1, x_2, …) : x_i ≥ 0 and Σ x_i = 1},  ε = 1/θ,  φ(x) = c Σ x_i²,

so that

lim_{θ→∞} (1/θ) log E(e^{cθ Σ x_i²}) = sup_{x∈E} {φ(x) − I(x)} = sup_{0≤x≤1} log f_c(x),

which is strictly positive when c > c*.

Conclusion

The models (selection versus neutrality) separate when the selection intensity is large relative to the mutation rate. For the heterozygote advantage model, σ must be much larger than θ (σ ≈ cθ^{3/2+γ}) before the models separate. For the homozygote advantage model, σ need only be moderately larger than θ before the models separate (σ ≈ c*θ, where c* ≈ 2.45541). The large deviation result provides a rate of convergence when c > c*.

Introduction

Any assessment of the forces that generate and maintain genetic diversity must include the possibility of selection. Computationally intensive methods for approximating likelihood functions and generating samples for a class of nonneutral models were proposed by Donnelly, Nordborg, and Joyce (DNJ) (2001).

Benefit: the new methods make likelihood analysis practicable for a wider set of parameters. In particular, the DNJ (2001) methods become increasingly inefficient as the selection intensity grows relative to the mutation rate, yet this is exactly the case where one has the best hope of drawing meaningful (more precise) inferences. We develop algorithms for likelihood analysis that are substantially more efficient than those in DNJ (2001).

Calculating the constant of integration: Law of Large Numbers

Simulate many population frequency vectors X_1, X_2, …, X_M under neutrality and average. That is,

E_N(e^{XᵀΣX}) ≈ Σ_{j=1}^M e^{X_jᵀΣX_j} / M.  (8)

See DNJ (2001). This works fine if the selective influences are relatively small. However, when selection is small there is very little power to distinguish selection from neutrality, and likelihood analysis gives little to no information about the parameters of interest. When selection is large enough to be detected, the above method is extremely inefficient.

Simulating data under selection: rejection method

1. Simulate X from the neutral model.
2. Simulate U, an independent uniform random variable on [0, 1].
3. If U ≤ e^{σ(X) − σ_max}, where σ(x) = xᵀΣx and σ_max = sup_x σ(x), report X as a population frequency vector from the nonneutral model. Otherwise return to step 1.

See DNJ (2001). For σ = 100, on the order of 10⁹ rejections are needed before a sample is accepted.

Importance sampling and the rejection method

The rejection method involves generating random variables under a proposal distribution and then applying a rule for rejecting or accepting each simulated value, so that the accepted values are distributed according to the target distribution. Importance sampling also involves generating random variables under the proposal distribution, but instead forms a weighted average, such that the weighted average represents the expectation, under the target distribution, of a random quantity of interest.
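A minimal Python sketch of both ideas (my addition), assuming the diagonal homozygote-advantage form σ(x) = σ Σ_i x_i², for which σ_max = σ (attained by a monomorphic population); this is an illustration, not the DNJ implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

def neutral_freqs(theta, n_sticks=10_000):
    # stick-breaking draw of neutral population frequencies, as in (1)
    v = rng.beta(1.0, theta, size=n_sticks)
    return np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1])) * v

def mc_constant(theta, sigma, M=10_000):
    # law-of-large-numbers estimate of E_N(e^{sigma F}), as in (8)
    F = np.array([np.sum(neutral_freqs(theta)**2) for _ in range(M)])
    return np.mean(np.exp(sigma * F))

def rejection_sample(theta, sigma):
    # accept with probability e^{sigma(x) - sigma_max} = e^{sigma (F - 1)};
    # this is tiny when sigma is large, which is exactly the inefficiency
    # discussed above
    while True:
        x = neutral_freqs(theta)
        if rng.uniform() <= np.exp(sigma * (np.sum(x**2) - 1.0)):
            return x

print(mc_constant(theta=5.0, sigma=2.0))
y = rejection_sample(theta=5.0, sigma=2.0)   # feasible only for small sigma
```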

A good proposal distribution

A good proposal distribution should have the following two properties:
1. It should be easy to simulate data and to calculate probabilities of interest under the proposal distribution.
2. The proposal distribution should be, in some sense, close to the target distribution.

In DNJ (2001) the neutral model is the proposal distribution and the model with selection is the target distribution. While the neutral model has property 1, it does not have property 2; it is a bad proposal distribution.

Computation of the Normalization Constant when Σ is Diagonal

We consider the special case where Σ is a diagonal matrix, with diagonal entries Σ = (σ_1, σ_2, …, σ_K). The normalization constant for the distribution can then be calculated by a series of recursive integrals. Define α_i = θν_i − 1. Then

c(σ, θν) = ∫₀¹ x_1^{α_1} e^{σ_1x_1²} ∫₀^{1−x_1} x_2^{α_2} e^{σ_2x_2²} ⋯ ∫₀^{1−x_1−⋯−x_{K−2}} x_{K−1}^{α_{K−1}} e^{σ_{K−1}x_{K−1}²} g_K(1 − x_1 − ⋯ − x_{K−1}) dx_{K−1} ⋯ dx_1,

where g_K(y) = y^{α_K} e^{σ_K y²}. Setting y = 1 − x_1 − ⋯ − x_{K−2} and t = x_{K−1}, the innermost integral is

∫₀^y t^{α_{K−1}} (y − t)^{α_K} e^{σ_{K−1}t²} e^{σ_K(y−t)²} dt = ∫₀^y t^{α_{K−1}} e^{σ_{K−1}t²} g_K(y − t) dt ≜ g_{K−1}(y).

The integral is iteratively defined

c(σ, θν) = ∫₀¹ x_1^{α_1} e^{σ_1x_1²} ∫₀^{1−x_1} x_2^{α_2} e^{σ_2x_2²} ⋯ ∫₀^{1−x_1−⋯−x_{K−3}} x_{K−2}^{α_{K−2}} e^{σ_{K−2}x_{K−2}²} g_{K−1}(1 − x_1 − ⋯ − x_{K−2}) dx_{K−2} ⋯ dx_1.

Now let y = 1 − x_1 − ⋯ − x_{i−1} and t = x_i, with α_i = θν_i − 1. The successive integrals are defined by

g_i(y) = ∫₀^y t^{α_i} e^{σ_i t²} g_{i+1}(y − t) dt,  (10)

for i = K−1, K−2, …, 1. The required c(σ, θν) is given by g_1(1).

Lyme disease sample

The following data were collected by Qiu et al. (1997, Hereditas 127: 203-216) on B. burgdorferi (the cause of Lyme disease) from eastern Long Island, New York.

relative frequency    frequency
1.1                   46
.37                   166
3.6                   11
4.7                   1

The maximum likelihood estimate is θ̂ = 5 and σ̂ = 36. A total of 10⁶ repetitions per θ were used in DNJ (2001).

Constant of Integration

s       m       c_m(36, 5ν)/10⁷
.19     16      3.485588
.4      32      3.49353
.19     64      3.4937998
.1      128     3.49498
.53     256     3.494147
.7      512     3.4941576
.85     1024    3.4941595

Approximations c_m(36, 5(1, 1, 1, 1)/4), scaled by 10⁷. The time complexity for computing the m values g_i(y) is O(m log m).

[Figure: likelihood surface for the Lyme disease data.]

Simulated Data

A simulated data set from Xu, where K = 20, θ = 15 and σ = 65. The relative allele frequencies are

x = (.9, .814, .146, .87, .45, .46, .131, .185, .578, .59, .139, .167, .169, .183, .34, .91, .159, .1376, .869, .6).  (11)

The original simulation from Xu was performed using the DNJ (2001) rejection method; 10⁹ simulations were required before the data set was accepted.
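The recursion (10) is a sequence of convolutions, which is where the O(m log m) cost per level comes from: tabulated on an m-point grid, each g_i is an FFT convolution of a fixed kernel with g_{i+1}. The Python sketch below is my own rough illustration of this idea (the midpoint grid and Riemann-sum quadrature are assumptions, not the authors' implementation), evaluated at the Lyme-data grid point σ = 36, θν = 5(1, 1, 1, 1)/4 from the table above.

```python
import numpy as np
from scipy.signal import fftconvolve

def normalization_constant(sigma, theta_nu, m=1024):
    """Approximate c(sigma, theta*nu) via the convolution recursion (10)
    on an m-point grid over [0, 1]: O(m log m) work per level."""
    K = len(sigma)
    alpha = theta_nu - 1.0
    dt = 1.0 / m
    t = (np.arange(m) + 0.5) * dt                      # midpoints avoid t = 0
    g = t**alpha[K-1] * np.exp(sigma[K-1] * t**2)      # g_K(y) = y^a e^{s y^2}
    for i in range(K - 2, -1, -1):
        h = t**alpha[i] * np.exp(sigma[i] * t**2)      # kernel t^{a_i} e^{s_i t^2}
        g = fftconvolve(h, g)[:m] * dt                 # g_i = kernel * g_{i+1}
    return g[-1]                                       # g_1(1) = c(sigma, theta*nu)

sigma = np.full(4, 36.0)
theta_nu = np.full(4, 5.0 / 4.0)
for m in (16, 32, 64, 128, 256, 512, 1024):            # doubling m as in the table
    print(m, normalization_constant(sigma, theta_nu, m))
```

As in the table above, the approximations stabilize as m doubles.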

Likelihood surface: simulated data

s       m       c_m(67.5, 1.65ν)/10⁹
.11     16      3.187147
.115    32      4.3311995
.1      64      4.8144
.134    128     5.8376
.335    256     5.99513
.451    512     5.197355
.544    1024    5.1415851

Approximations c_m(67.5, 1.65(1, 1, …, 1)/20), scaled by 10⁹.

New method for simulating samples under selection

Define the following cumulative distribution functions with parameter z, F_i(·; z) for i = 1, 2, …, K, where

F_i(y; z) = ( ∫₀^y t^{α_i} exp(σ_i t²) g_{i+1}(z − t) dt ) / g_i(z) for 0 ≤ y ≤ z, and F_i(y; z) = 1 for y > z,

and g_i(y) is defined by (10).

Generating allele frequencies under selection:
1. Generate U_i ∼ UNIF[0, 1].
2. Define X_i = F_i^{−1}(U_i; 1 − X_1 − X_2 − ⋯ − X_{i−1}).

Note that P(X_i ≤ y | X_{i−1}, …, X_1) = F_i(y; 1 − X_1 − ⋯ − X_{i−1}), so this is the inverse-CDF method applied coordinate by coordinate.

Parametric Bootstrap

Lyme disease data
        Mean    Standard deviation
θ̂       5.
σ̂       36.8    1.1

Simulated data
        Mean    Standard deviation
θ̂       14.     4.3
σ̂       66.5    15.8

The two tables give estimates of the mean and standard deviation of the maximum likelihood estimates θ̂ and σ̂, based on the parametric bootstrap procedure.

Conclusions

Importance sampling and rejection methods are powerful tools for modern likelihood-based statistical analysis. DNJ (2001) use this approach for the analysis of a class of nonneutral population genetics models. The efficiency of these procedures depends critically on the choice of the proposal distribution. Our method generates data directly under the model with selection, and so is much more efficient than the methods described in DNJ (2001).
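To close the loop, here is a rough Python sketch (my addition) of the inverse-CDF sampler just described, reusing grid-tabulated g_i as in the convolution sketch above; the grid inversion via searchsorted and all parameter values are my own assumptions, not the authors' implementation.

```python
import numpy as np

def sample_under_selection(sigma, theta_nu, m=2048, rng=None):
    """Draw (X_1, ..., X_K) via X_i = F_i^{-1}(U_i; 1 - X_1 - ... - X_{i-1})
    with U_i ~ UNIF[0, 1], using g_i tabulated from the recursion (10)."""
    rng = np.random.default_rng() if rng is None else rng
    K = len(sigma)
    alpha = theta_nu - 1.0
    dt = 1.0 / m
    t = (np.arange(m) + 0.5) * dt

    # g[i] tabulates g_{i+1} in the slides' 1-based notation
    g = [None] * K
    g[K-1] = t**alpha[K-1] * np.exp(sigma[K-1] * t**2)
    for i in range(K - 2, -1, -1):
        h = t**alpha[i] * np.exp(sigma[i] * t**2)
        g[i] = np.convolve(h, g[i+1])[:m] * dt

    x = np.zeros(K)
    z = 1.0                                       # remaining frequency mass
    for i in range(K - 1):
        k = max(int(z / dt), 1)                   # grid points covering [0, z]
        dens = t[:k]**alpha[i] * np.exp(sigma[i] * t[:k]**2) * g[i+1][k-1::-1]
        cdf = np.cumsum(dens)
        cdf /= cdf[-1]                            # F_i(. ; z) on the grid
        x[i] = t[np.searchsorted(cdf, rng.uniform())]
        z -= x[i]
    x[K-1] = z                                    # last frequency is the remainder
    return x

print(sample_under_selection(np.full(4, 36.0), np.full(4, 1.25),
                             rng=np.random.default_rng(3)))
```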