Joyce, Krone, and Kurtz


1 Statistical Inference for Population Genetics Models

Paul Joyce, University of Idaho

A large body of mathematical population genetics was developed by the three main speakers in this symposium. As a tribute to the substantial contributions of Ewens, Griffiths and Tavaré, I will present an overview of some of my work, which builds upon their ideas. The focus will be on issues in the realm of mathematical statistics. The likelihood functions are based on the stationary distributions, under both infinite- and K-alleles models, involving mutation, selection and genetic drift. The theoretical portion of the talk will consider limiting results that determine under what conditions models can be distinguished based on allele frequency data at a single locus. The computational portion of the talk will focus on new, computationally efficient approaches to analyzing data under these models.

A Brief History of the Problem

In the late 1990s John Gillespie challenged Tom Kurtz, Steve Krone and myself to come up with a rigorous proof of his conjecture that the heterozygote advantage model converges to the neutral model in the limit as both θ and σ go to infinity at the same rate. Recall that θ = 4Nu and σ = 4Ns. In 2002 and 2003 Steve Krone, Tom Kurtz and I published two papers in the Annals of Applied Probability addressing the problem posed by Gillespie. For purely mathematical reasons I then decided to consider the homozygote advantage model and developed an analogous result, with some help from my colleague Frank Gao.

Heterozygote Advantage Model (Gillespie; Joyce, Krone, and Kurtz): Notation and Vocabulary Review

What does a sample from a neutral population look like? (My version of Warren's slide.)

Unscaled parameters:
- N = effective population size
- fitness of heterozygote = 1, fitness of homozygote = w, with w < 1
- u = per-individual mutation rate

Scaled parameters:
- w = 1 − σ/(4N), equivalently σ = (1 − w)·4N
- θ = 4Nu

2 The Effects of Selection

The probability that two individuals chosen at random are of the same type is the homozygosity F = Σ_{i=1}^N X_i². The heterozygote advantage model penalizes homozygotes, thus decreasing F. Recall from calculus that the minimum value of F = Σ_{i=1}^N X_i², subject to the constraint Σ_{i=1}^N X_i = 1, occurs when X_i = 1/N. Selection therefore tends to make the allele frequencies more evenly distributed; it is sometimes referred to as balancing selection.

With σ = 4N(1 − w) and θ = 4Nu, both the scaled mutation rate and the scaled selection intensity become large as the population size increases. An increase in the mutation parameter θ tends to increase the number of alleles and decrease the homozygosity. An increased selection intensity also decreases the homozygosity. Can high mutation mask selection when the population is large?

[Figures: simulated allele frequency configurations for several (σ, θ) settings; panel labels garbled in transcription.]

Stationary Distribution under Neutrality

Let V_1, V_2, ... be i.i.d. with beta density f(x) = θ(1 − x)^{θ−1}. The joint distribution of the population proportions X = (X_1, X_2, ...) under neutrality is given by the stick-breaking construction

X_1 = V_1,  X_i = (1 − V_1)(1 − V_2)···(1 − V_{i−1}) V_i,  (1)

and we write X ~ µ.
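The stick-breaking construction above is straightforward to simulate. A minimal Python sketch (the truncation length, the θ value, the replicate count and the Monte Carlo check against the standard neutral identity E[F] = 1/(1 + θ) are my illustrative choices, not from the talk):

```python
import random

def stick_breaking(theta, n=400, rng=random):
    """Simulate the first n population proportions X_1, ..., X_n under
    neutrality: X_i = (1-V_1)...(1-V_{i-1}) V_i with V_i i.i.d.
    Beta(1, theta), i.e. density theta*(1-x)^(theta-1)."""
    x, remaining = [], 1.0
    for _ in range(n):
        v = rng.betavariate(1.0, theta)
        x.append(remaining * v)
        remaining *= 1.0 - v
    return x

def homozygosity(x):
    """F = sum_i X_i^2: probability two random individuals match."""
    return sum(xi * xi for xi in x)

random.seed(1)
theta = 10.0
reps = [homozygosity(stick_breaking(theta)) for _ in range(4000)]
mean_F = sum(reps) / len(reps)
# Under neutrality E[F] = 1/(1+theta), here 1/11.
```

The truncation at n sticks is harmless because the leftover mass decays geometrically, like (θ/(θ+1))^n.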

3 Stationary Distribution under Selection

The stationary distribution under selection depends on the population homozygosity, F = Σ X_i². The form of the stationary distribution µ_σ follows as a special case of Ethier and Kurtz (1994):

µ_σ(A) = ∫_A e^{−σF} / E(e^{−σF}) µ(dx),  (2)

that is, dµ_σ/dµ = e^{−σF} / E[e^{−σF}].

Samples versus Populations

Let A_n be a random partition structure of a sample of size n. Then

P_σ(A_n = a_n) / P(A_n = a_n) = E( dµ_σ/dµ (X) | A_n = a_n )

and

lim_{n→∞} P_σ(A_n = a_n) / P(A_n = a_n) = dµ_σ/dµ (X),

where P(A_n = a_n) is the Ewens Sampling Formula. See Joyce (1994) for more details.

Theorem 4.4 in Ethier and Kurtz (1994) (Gillespie's conjecture): if σ = cθ, then

lim_{θ→∞} dµ_σ/dµ (X) = lim_{θ→∞} exp{−σ Σ X_i²} / E[exp{−σ Σ X_i²}] = 1.  (3)

Joyce, Krone and Kurtz (2003), Theorem 1. Suppose X = (X_1, X_2, ...) ~ µ and Y = (Y_1, Y_2, ...) ~ µ_σ, where σ = cθ^{3/2+γ} and c > 0 is a constant, and let F(X) = Σ X_i². Then, as θ → ∞,

dµ_σ/dµ (X) = e^{−σF(X)} / E(e^{−σF(X)}) →
  1, if γ < 0;
  exp{−cZ − c²}, if γ = 0;
  0, if γ > 0,

where Z ~ N(0, 2).

Outline of the proof of Theorem 1

Define

Z_θ = √θ (θ Σ X_i² − 1).  (4)

For σ = cθ^{3/2} (the case γ = 0), rewrite

dµ_σ/dµ (X_θ) = e^{−σ Σ X_i²} / E(e^{−σ Σ X_i²}) = exp{−cZ_θ} / E(exp{−cZ_θ}).  (5)

We need to show that, as θ → ∞:
1. Z_θ ⇒ Z, so that exp{−cZ_θ} ⇒ exp{−cZ};
2. E(exp{−cZ_θ}) → E(exp{−cZ}).
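The centered and scaled homozygosity Z_θ = √θ(θ Σ X_i² − 1) from the proof outline can be examined by direct simulation. A rough Monte Carlo sketch (θ = 30, the truncation depth and the replicate count are my choices, not from the talk), checking that Z_θ is roughly centered with variance near the limiting value 2:

```python
import random

def homozygosity_pd(theta, n_sticks=600, rng=random):
    """Sample F = sum X_i^2 under the neutral model via
    stick-breaking with V_i ~ Beta(1, theta)."""
    f, remaining = 0.0, 1.0
    for _ in range(n_sticks):
        v = rng.betavariate(1.0, theta)
        xi = remaining * v
        f += xi * xi
        remaining *= 1.0 - v
    return f

random.seed(2)
theta = 30.0
M = 2000
z = [theta ** 0.5 * (theta * homozygosity_pd(theta) - 1.0) for _ in range(M)]
mean_z = sum(z) / M
var_z = sum((zi - mean_z) ** 2 for zi in z) / (M - 1)
# Z_theta => Z ~ N(0, 2); for finite theta the mean is slightly
# negative (E[theta*F] = theta/(theta+1)) and the right tail is heavy.
```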

4 Z_θ Has a Heavy Right Tail

We do have E(exp{−cZ_θ}) → E(exp{−cZ}) as θ → ∞, but E(exp{+cZ_θ}) → ∞ as θ → ∞: the distribution of Z_θ has a heavy right tail, so convergence of the moment generating function holds only on the negative side.

[Figures: distribution of Z_θ versus the limiting distribution of Z.]

Homozygote Advantage Model

Unscaled parameters:
- N = effective population size
- fitness of heterozygote = 1, fitness of homozygote = w, with w > 1
- u = per-individual mutation rate

Scaled parameters:
- w = 1 + σ/(4N), equivalently σ = (w − 1)·4N
- θ = 4Nu

Joyce and Gao (2006), Homozygote Advantage Theorem

Let c* be the solution to

((1 − √(1 − 2/c))/2) · exp{ c ((1 + √(1 − 2/c))/2)² } = 1.  (6)

Suppose X = (X_1, X_2, ...) ~ µ and Y = (Y_1, Y_2, ...) ~ µ_σ, and let σ(θ) = cθ. As θ → ∞,

dµ_σ/dµ (X) = e^{cθ Σ X_i²} / E(e^{cθ Σ X_i²}) →
  1, if c < c*;
  0, if c ≥ c*;

dµ_σ/dµ (Y) = e^{cθ Σ Y_i²} / E(e^{cθ Σ Y_i²}) →
  1, if c < c*;
  ∞, if c ≥ c*.

What is c*? Recall that θ = 4Nu and σ = 4N(w − 1), where w > 1. If σ = cθ, then c = σ/θ = (w − 1)/u.

Theorem in words: at a highly polymorphic locus (θ large) the homozygote advantage model is readily distinguishable from the neutral model provided the selection coefficient w − 1 is at least c* ≈ 2.4554 times bigger than the per-individual mutation rate u. However, if the selective advantage is below 2.4554 times the mutation rate, then the models are indistinguishable in the limit.
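The critical-value equation is easy to solve numerically: with f_c(x) = e^{cx²}(1 − x) and its local maximum at x₁ = (1 + √(1 − 2/c))/2, the critical c* makes f_c(x₁) = 1. A bisection sketch (the bracket and iteration count are my choices) that recovers c* ≈ 2.45541:

```python
import math

def f_crit(c):
    """Value of f_c at its local maximum x1 = (1 + sqrt(1 - 2/c))/2,
    where f_c(x) = exp(c*x^2) * (1 - x); requires c > 2."""
    x1 = (1.0 + math.sqrt(1.0 - 2.0 / c)) / 2.0
    return math.exp(c * x1 * x1) * (1.0 - x1)

# Bisection for the root of f_crit(c) = 1 on (2, 10]:
# f_crit rises through 1 as c grows past the critical value.
lo, hi = 2.000001, 10.0
for _ in range(80):
    mid = 0.5 * (lo + hi)
    if f_crit(mid) < 1.0:
        lo = mid
    else:
        hi = mid
c_star = 0.5 * (lo + hi)
# c_star ~ 2.45541: the models separate when (w - 1)/u exceeds it.
```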

5 Proof

Let V_1, V_2, ... be i.i.d. with beta density θ(1 − x)^{θ−1}. The joint distribution of the population proportions X = (X_1, X_2, ...) is defined by the stick-breaking construction

X_1 = V_1,  X_i = (1 − V_1)(1 − V_2)···(1 − V_{i−1}) V_i.

If F = Σ X_i², then

F = V_1² + (1 − V_1)² F′,

where F′ has the same distribution as F and is independent of V_1. If V has beta density θ(1 − x)^{θ−1}, then

E(e^{σF}) = E( e^{σV²} e^{σ(1−V)²F′} ) ≥ E(e^{σV²}) = E(e^{cθV²}),

and

E(e^{cθV²}) = ∫₀¹ e^{cθx²} θ(1 − x)^{θ−1} dx ≈ θ ∫₀¹ (e^{cx²}(1 − x))^θ dx = θ ∫₀¹ (f_c(x))^θ dx.

Finding the Critical c*

f_c(x) = e^{cx²}(1 − x) has a local minimum at x₀ = (1 − √(1 − 2/c))/2 and a local maximum at x₁ = (1 + √(1 − 2/c))/2, provided c is larger than 2.

[Figures: f_c(x) for small c, for large c, and for the critical c*.]
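The Laplace-type bound above can be checked numerically: E(e^{cθV²}) = ∫₀¹ e^{cθx²} θ(1 − x)^{θ−1} dx stays bounded when c < c* ≈ 2.455 and grows geometrically in θ when c > c*. A quadrature sketch (the grid size, the θ values and the probe constants c = 2.3 and c = 2.6 on either side of c* are my choices):

```python
import math

def mgf_beta(c, theta, m=20000):
    """Trapezoid approximation of E[exp(c*theta*V^2)] for
    V ~ Beta(1, theta): the integral over [0, 1] of
    exp(c*theta*x^2) * theta * (1-x)^(theta-1).
    Works with the log-integrand; the integrand vanishes at
    x = 1 for theta > 1, so that endpoint is skipped."""
    h = 1.0 / m
    total = 0.0
    for j in range(m):
        x = j * h
        log_f = c * theta * x * x + (theta - 1.0) * math.log(1.0 - x)
        total += (0.5 if j == 0 else 1.0) * math.exp(log_f)
    return theta * h * total

# Below c* the expectation settles down as theta grows; above c*
# the interior maximum of f_c dominates and the integral blows up.
sub = mgf_beta(2.3, 200.0) / mgf_beta(2.3, 100.0)
sup = mgf_beta(2.6, 200.0) / mgf_beta(2.6, 100.0)
```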

6 c* is the constant that makes f_c(x₁) = 1, where x₁ = (1 + √(1 − 2/c))/2 is the local maximum:

((1 − √(1 − 2/c))/2) · exp{ c ((1 + √(1 − 2/c))/2)² } = 1.  (7)

If c > c*, then

E(e^{cθF}) ≥ θ ∫₀¹ (f_c(x))^θ dx → ∞ as θ → ∞.

Large Deviations and the Homozygote Advantage (Feng and Dawson, 2005)

Theorem (Varadhan). Assume that {Q_ε : ε > 0} satisfies the Large Deviation Principle with speed 1/ε and rate function I(·). Let C_b(E) denote the set of bounded continuous functions on E. Then for any φ(x) in C_b(E), one has

Λ_φ = lim_{ε→0} ε log E_{Q_ε}( e^{φ(x)/ε} ) = sup_{x∈E} {φ(x) − I(x)}.

For our case,

E = {(x_1, x_2, ...) : x_i ≥ 0 and Σ x_i = 1},  ε = 1/θ,  φ(x) = c Σ x_i²,

and

lim_{θ→∞} (1/θ) log E( e^{cθ Σ x_i²} ) = sup_{x∈E} {φ(x) − I(x)} = sup_{x₁} log(f_c(x₁)),

which is > 0 when c > c*.

Conclusion

The models (selection versus neutrality) separate when the selection intensity is large relative to the mutation rate. For the heterozygote advantage model, σ must be much larger than θ (σ ≈ cθ^{3/2+γ}) before the models separate. For the homozygote advantage model, σ need only be moderately larger than θ before the models separate: σ ≥ c*θ, where c* ≈ 2.45541. The large deviation result provides a rate of convergence when c > c*.
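The variational formula above reduces, as the slides indicate, to sup_{x₁} log f_c(x₁) with f_c(x) = e^{cx²}(1 − x). This supremum can be evaluated directly on a grid: it equals 0 for c below c* (attained at x₁ = 0) and is strictly positive above it. A sketch (the grid resolution and the probe values c = 2.3 and c = 2.6 are my choices):

```python
import math

def sup_log_fc(c, m=100000):
    """Grid approximation of sup over x in [0, 1) of
    log f_c(x) = c*x^2 + log(1-x), the large-deviation growth
    rate of (1/theta) * log E[exp(c*theta*sum x_i^2)]."""
    best = float("-inf")
    for j in range(m):
        x = j / m
        best = max(best, c * x * x + math.log(1.0 - x))
    return best

rate_sub = sup_log_fc(2.3)  # c < c*: supremum is 0, at x = 0
rate_sup = sup_log_fc(2.6)  # c > c*: supremum strictly positive
```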

7 Introduction

Any assessment of the forces that generate and maintain genetic diversity must include the possibility of selection. Computationally intensive methods for approximating likelihood functions and generating samples for a class of nonneutral models were proposed by Donnelly, Nordborg, and Joyce (DNJ) (2001).

Benefit

The new methods make likelihood analysis practicable for a wider set of parameters. In particular, if the selection intensity is much greater than the mutation rate, then the DNJ (2001) methods become increasingly inefficient; yet this is precisely the case where one has the best hope of drawing meaningful (more precise) inferences. We develop algorithms for likelihood analysis that are substantially more efficient than those in DNJ (2001).

Calculating the Constant of Integration (Law of Large Numbers)

Simulate many population frequencies X_1, X_2, ..., X_M under neutrality and average. That is,

E_N( e^{X′ΣX} ) ≈ Σ_{i=1}^M e^{X_i′ΣX_i} / M.  (8)

See DNJ (2001). This works fine if the selective influences are relatively small. However, when selection is small there is very little power to detect selection from neutrality, and likelihood analysis gives little to no information about the parameters of interest. When selection is large enough to be detected, the above method is extremely inefficient.

Simulating Data under Selection (Rejection Method)

1. Simulate X from the neutral model.
2. Simulate U, an independent uniform random variable on [0, 1].
3. If U ≤ e^{σ(X) − σ_max}, where σ(X) = X′ΣX and σ_max is its maximum value, report X as a population frequency from the nonneutral model. Otherwise return to step 1.

See DNJ (2001). For large σ it can take on the order of 10⁹ rejections before a sample is accepted.
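The rejection scheme can be sketched for a K-allele model, specializing σ(x) to a diagonal quadratic form Σ σ_i x_i² as in the diagonal-Σ case considered later in the talk (the Dirichlet proposal via gamma variables, the parameter values, nonnegative σ_i, and the two selection strengths are my illustrative choices):

```python
import math
import random

def dirichlet(alphas, rng=random):
    """Sample from Dirichlet(alphas) via normalized gamma variables."""
    g = [rng.gammavariate(a, 1.0) for a in alphas]
    s = sum(g)
    return [v / s for v in g]

def rejection_sample(theta_nu, sigma, n_trials, rng=random):
    """DNJ-style rejection: propose from the neutral Dirichlet(theta*nu)
    model, accept with probability exp(sigma(x) - sigma_max), where
    sigma(x) = sum_i sigma_i * x_i^2 (diagonal selection matrix,
    sigma_i >= 0, so sigma(x) <= max_i sigma_i on the simplex).
    Returns (accepted samples, acceptance rate)."""
    smax = max(sigma)
    accepted = []
    for _ in range(n_trials):
        x = dirichlet(theta_nu, rng)
        sx = sum(s * xi * xi for s, xi in zip(sigma, x))
        if rng.random() <= math.exp(sx - smax):
            accepted.append(x)
    return accepted, len(accepted) / n_trials

random.seed(3)
theta_nu = [1.0, 1.0, 1.0, 1.0]
acc_mild, rate_mild = rejection_sample(theta_nu, [5.0, 0.0, 0.0, 0.0], 20000)
acc_strong, rate_strong = rejection_sample(theta_nu, [20.0, 0.0, 0.0, 0.0], 20000)
# The acceptance rate collapses as selection strengthens: exactly
# the inefficiency that motivates a better proposal distribution.
```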
Importance Sampling and the Rejection Method

The rejection method involves generating random variables under the proposal distribution and then applying a rule for accepting or rejecting each simulated value, so that the accepted random variables are distributed according to the target distribution. Importance sampling also involves generating random variables under the proposal distribution, but instead forms a weighted average, such that the weighted average represents the expectation, under the target distribution, of a random quantity of interest.
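The importance-sampling alternative keeps every neutral draw and reweights it. A sketch for the same diagonal-selection target (the self-normalized estimator, the quantity being estimated and all parameter values are my illustrative choices, not from the talk):

```python
import math
import random

def dirichlet(alphas, rng=random):
    """Sample from Dirichlet(alphas) via normalized gamma variables."""
    g = [rng.gammavariate(a, 1.0) for a in alphas]
    s = sum(g)
    return [v / s for v in g]

def is_estimate(h, theta_nu, sigma, n, rng=random):
    """Self-normalized importance sampling: proposal = neutral
    Dirichlet(theta*nu), target proportional to
    proposal * exp(sum_i sigma_i x_i^2), so the weight is
    w(x) = exp(sum_i sigma_i x_i^2).  Estimates E_target[h(X)]
    as sum(w*h) / sum(w)."""
    num = den = 0.0
    for _ in range(n):
        x = dirichlet(theta_nu, rng)
        w = math.exp(sum(s * xi * xi for s, xi in zip(sigma, x)))
        num += w * h(x)
        den += w
    return num / den

random.seed(4)
# Selection favoring allele 1 shifts its expected frequency well
# above the neutral value E[X_1] = 1/4.
est = is_estimate(lambda x: x[0], [1.0] * 4, [5.0, 0.0, 0.0, 0.0], 40000)
```

Unlike rejection, no draw is wasted, but the estimator degrades when the weights become highly variable, i.e. when the proposal is far from the target.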

8 A Good Proposal Distribution

A good proposal distribution should have the following two properties:

1. It should be easy to simulate data and to calculate probabilities of interest with respect to the proposal distribution.
2. The proposal distribution should be, in some sense, close to the target distribution.

In DNJ (2001) the neutral model is the proposal distribution and the model with selection is the target distribution. While the neutral model has property 1, it does not have property 2: in that sense it is a bad proposal distribution.

Computation of the Normalization Constant when Σ is Diagonal

We consider the special case where Σ is a diagonal matrix, and denote the entries of the diagonal by Σ = (σ_1, σ_2, ..., σ_K). The normalization constant for the distribution can then be calculated by a series of recursive integrals. Define α_i = θν_i − 1 and g_K(y) = y^{α_K} e^{σ_K y²}. Then

c(σ, θν) = ∫₀¹ x_1^{α_1} e^{σ_1 x_1²} ∫₀^{1−x_1} x_2^{α_2} e^{σ_2 x_2²} ··· ∫₀^{1 − x_1 − ··· − x_{K−2}} x_{K−1}^{α_{K−1}} e^{σ_{K−1} x_{K−1}²} g_K(1 − x_1 − ··· − x_{K−1}) dx_{K−1} ··· dx_1.

Setting y = 1 − x_1 − ··· − x_{K−2} and t = x_{K−1}, the innermost integral is

∫₀^y t^{α_{K−1}} e^{σ_{K−1} t²} g_K(y − t) dt =: g_{K−1}(y),

and the same collapsing step can be repeated for each remaining variable.

Lyme Disease Sample

The following data were collected by Qiu et al. (1997), Hereditas 127: 203–216, on B. burgdorferi (the cause of Lyme disease) from eastern Long Island, New York.

[Table: observed allele counts and relative frequencies; entries garbled in transcription.]

The maximum likelihood estimates are θ̂ = 5 and σ̂ = 36. A total of 10⁶ repetitions per θ were used with the DNJ (2001) method.

Constant of Integration

[Table: approximations (scaled by 10⁷) of c_m(36, 5ν) / c_m(36, 5(1, 1, 1, 1)/4) for several grid sizes m; entries garbled in transcription.] The time complexity for computing the m values g_i(y) is O(m log(m)).

[Figure: likelihood surface for the Lyme disease data.]

Simulated Data

A simulated data set from Xu, with K = 20, θ = 15 and σ = 65. [The list of 20 relative allele frequencies is garbled in transcription.] The original simulation was performed using the DNJ (2001) rejection method; a very large number of rejections were required before the data set was accepted.
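The recursion g_i(y) = ∫₀^y t^{α_i} e^{σ_i t²} g_{i+1}(y − t) dt can be sketched numerically with plain trapezoid quadrature (the grid size, the K = 3 examples, and the two cross-checks are my choices; the talk's implementation achieves O(m log m) per level, presumably via fast convolution, whereas this direct version is O(m²)):

```python
import math
import random

def normalization_constant(sigma, alpha, m=400):
    """Approximate c(sigma, theta*nu): the integral over the simplex
    of prod_i x_i^(alpha_i) * exp(sigma_i * x_i^2), via
        g_K(y) = y^a_K exp(s_K y^2),
        g_i(y) = int_0^y t^a_i exp(s_i t^2) g_{i+1}(y - t) dt,
    returning g_1(1).  Trapezoid rule on a grid of m intervals;
    assumes alpha_i >= 0 so the integrands are bounded."""
    K = len(sigma)
    h = 1.0 / m
    grid = [j * h for j in range(m + 1)]
    g = [y ** alpha[K - 1] * math.exp(sigma[K - 1] * y * y) for y in grid]
    for i in range(K - 2, -1, -1):
        kern = [grid[k] ** alpha[i] * math.exp(sigma[i] * grid[k] ** 2)
                for k in range(m + 1)]
        new_g = [0.0] * (m + 1)   # g_i(0) = 0 for i < K
        for j in range(1, m + 1):
            vals = [kern[k] * g[j - k] for k in range(j + 1)]
            new_g[j] = h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
        g = new_g
    return g[-1]

# Check 1: sigma = 0, alpha = 0 (theta*nu_i = 1, K = 3) gives the
# area of the 2-simplex, 1/2.
c_flat = normalization_constant([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])

# Check 2 (Monte Carlo): the uniform density on the K = 3 simplex
# is 2, so c = E_uniform[exp(sum sigma_i x_i^2)] / 2.
random.seed(5)
sigma = [2.0, 1.0, 0.0]
tot, n = 0.0, 200000
for _ in range(n):
    e = [random.expovariate(1.0) for _ in range(3)]
    s = sum(e)
    x = [v / s for v in e]
    tot += math.exp(sum(si * xi * xi for si, xi in zip(sigma, x)))
c_rec = normalization_constant(sigma, [0.0, 0.0, 0.0])
c_mc = tot / n / 2.0
```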

9 The Integral Is Iteratively Defined

c(σ, θν) = ∫₀¹ x_1^{α_1} e^{σ_1 x_1²} ∫₀^{1−x_1} x_2^{α_2} e^{σ_2 x_2²} ··· ∫₀^{1 − x_1 − ··· − x_{K−2}} x_{K−1}^{α_{K−1}} e^{σ_{K−1} x_{K−1}²} g_K(1 − x_1 − ··· − x_{K−1}) dx_{K−1} ··· dx_1.

Now let y = 1 − x_1 − ··· − x_{i−1} and t = x_i, with α_i = θν_i − 1. The successive integrals are defined by

g_i(y) = ∫₀^y t^{α_i} e^{σ_i t²} g_{i+1}(y − t) dt,  (10)

for i = K−1, K−2, ..., 1. The required c(σ, θν) is given by g_1(1).
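The same grid machinery yields a direct sampler by inverting the conditional distribution functions F_i. A sketch for the simplest case K = 2, where a single inversion determines X_1 and X_2 = 1 − X_1 (the grid size, parameter values and the grid-resolution inverse are my illustrative choices, not the talk's implementation):

```python
import bisect
import math
import random

def sample_two_allele(sigma, alpha, n, m=2000, rng=random):
    """Inverse-CDF sampling for a K = 2 model with density
    proportional to x^a1 (1-x)^a2 exp(s1 x^2 + s2 (1-x)^2) in
    x = X_1 (so X_2 = 1 - X_1).  F_1(y; 1) is the normalized
    integral of t^a1 exp(s1 t^2) g_2(1 - t), with
    g_2(y) = y^a2 exp(s2 y^2), built as a trapezoid-rule CDF
    on a grid of m intervals."""
    s1, s2 = sigma
    a1, a2 = alpha
    h = 1.0 / m
    dens = [(j * h) ** a1 * math.exp(s1 * (j * h) ** 2)
            * (1 - j * h) ** a2 * math.exp(s2 * (1 - j * h) ** 2)
            for j in range(m + 1)]
    cdf = [0.0]
    for j in range(m):
        cdf.append(cdf[-1] + 0.5 * h * (dens[j] + dens[j + 1]))
    total = cdf[-1]            # this is g_1(1), the normalization
    cdf = [v / total for v in cdf]
    out = []
    for _ in range(n):
        u = rng.random()
        j = bisect.bisect_left(cdf, u)   # grid cell containing u
        out.append(j * h)                # grid-resolution inverse CDF
    return out

random.seed(6)
# Selection favoring allele 1 (s1 > s2) pushes X_1 above the
# symmetric neutral mean of 1/2 here (alpha symmetric).
xs = sample_two_allele((3.0, 0.0), (0.0, 0.0), 20000)
mean_x1 = sum(xs) / len(xs)
```

Every draw is accepted, in contrast to the rejection method; refining the inverse by interpolating within a grid cell is straightforward if more accuracy is needed.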

10 Likelihood Surface, Simulated Data

[Table: approximations (scaled by 10⁹) of c_m(67.5, 1.65ν) / c_m(67.5, 1.65(1, ..., 1)/20); numbers partially garbled in transcription.]

New Method for Simulating Samples under Selection

Define the following cumulative distribution functions with parameter z, F_i(·; z) for i = 1, 2, ..., K:

F_i(y; z) = ( ∫₀^y t^{α_i} e^{σ_i t²} g_{i+1}(z − t) dt ) / g_i(z) for 0 ≤ y ≤ z, and F_i(y; z) = 1 for y > z,

where g_i(y) is defined by (10).

Generating allele frequencies under selection:

1. Generate U_i ~ UNIF[0, 1].
2. Define X_i = F_i^{−1}(U_i; 1 − X_1 − X_2 − ··· − X_{i−1}).

Note that P(X_i ≤ y | X_{i−1}, ..., X_1) = F_i(y; 1 − X_1 − ··· − X_{i−1}).

Parametric Bootstrap

[Tables: parametric bootstrap estimates of the mean and standard deviation of the maximum likelihood estimators θ̂ and σ̂, for the Lyme disease data and for the simulated data; entries garbled in transcription.] The two tables represent estimates of the mean and standard deviation of the maximum likelihood estimates θ̂ and σ̂ based on the parametric bootstrap procedure.

Conclusions

Importance sampling and the rejection method are powerful tools for modern likelihood-based statistical analysis. DNJ (2001) use this approach for the analysis of a class of nonneutral population genetics models. The efficiency of these procedures depends critically on the choice of the proposal distribution. Our method generates data directly under the model with selection, and so is much more efficient than the methods described in DNJ (2001).


Problem 1 (20) Log-normal. f(x) Cauchy ORF 245. Rigollet Date: 11/21/2008 Problem 1 (20) f(x) f(x) 0.0 0.1 0.2 0.3 0.4 0.0 0.2 0.4 0.6 0.8 4 2 0 2 4 Normal (with mean -1) 4 2 0 2 4 Negative-exponential x x f(x) f(x) 0.0 0.1 0.2 0.3 0.4 0.5

More information

36. Multisample U-statistics and jointly distributed U-statistics Lehmann 6.1

36. Multisample U-statistics and jointly distributed U-statistics Lehmann 6.1 36. Multisample U-statistics jointly distributed U-statistics Lehmann 6.1 In this topic, we generalize the idea of U-statistics in two different directions. First, we consider single U-statistics for situations

More information

Math 152. Rumbos Fall Solutions to Assignment #12

Math 152. Rumbos Fall Solutions to Assignment #12 Math 52. umbos Fall 2009 Solutions to Assignment #2. Suppose that you observe n iid Bernoulli(p) random variables, denoted by X, X 2,..., X n. Find the LT rejection region for the test of H o : p p o versus

More information

LAN property for ergodic jump-diffusion processes with discrete observations

LAN property for ergodic jump-diffusion processes with discrete observations LAN property for ergodic jump-diffusion processes with discrete observations Eulalia Nualart (Universitat Pompeu Fabra, Barcelona) joint work with Arturo Kohatsu-Higa (Ritsumeikan University, Japan) &

More information

1 Probability theory. 2 Random variables and probability theory.

1 Probability theory. 2 Random variables and probability theory. Probability theory Here we summarize some of the probability theory we need. If this is totally unfamiliar to you, you should look at one of the sources given in the readings. In essence, for the major

More information

Approximate Bayesian Computation

Approximate Bayesian Computation Approximate Bayesian Computation Michael Gutmann https://sites.google.com/site/michaelgutmann University of Helsinki and Aalto University 1st December 2015 Content Two parts: 1. The basics of approximate

More information

Evolution in a spatial continuum

Evolution in a spatial continuum Evolution in a spatial continuum Drift, draft and structure Alison Etheridge University of Oxford Joint work with Nick Barton (Edinburgh) and Tom Kurtz (Wisconsin) New York, Sept. 2007 p.1 Kingman s Coalescent

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

Asymptotical distribution free test for parameter change in a diffusion model (joint work with Y. Nishiyama) Ilia Negri

Asymptotical distribution free test for parameter change in a diffusion model (joint work with Y. Nishiyama) Ilia Negri Asymptotical distribution free test for parameter change in a diffusion model (joint work with Y. Nishiyama) Ilia Negri University of Bergamo (Italy) ilia.negri@unibg.it SAPS VIII, Le Mans 21-24 March,

More information

On detection of unit roots generalizing the classic Dickey-Fuller approach

On detection of unit roots generalizing the classic Dickey-Fuller approach On detection of unit roots generalizing the classic Dickey-Fuller approach A. Steland Ruhr-Universität Bochum Fakultät für Mathematik Building NA 3/71 D-4478 Bochum, Germany February 18, 25 1 Abstract

More information

General Theory of Large Deviations

General Theory of Large Deviations Chapter 30 General Theory of Large Deviations A family of random variables follows the large deviations principle if the probability of the variables falling into bad sets, representing large deviations

More information

Infinitely divisible distributions and the Lévy-Khintchine formula

Infinitely divisible distributions and the Lévy-Khintchine formula Infinitely divisible distributions and the Cornell University May 1, 2015 Some definitions Let X be a real-valued random variable with law µ X. Recall that X is said to be infinitely divisible if for every

More information

Spring 2012 Math 541A Exam 1. X i, S 2 = 1 n. n 1. X i I(X i < c), T n =

Spring 2012 Math 541A Exam 1. X i, S 2 = 1 n. n 1. X i I(X i < c), T n = Spring 2012 Math 541A Exam 1 1. (a) Let Z i be independent N(0, 1), i = 1, 2,, n. Are Z = 1 n n Z i and S 2 Z = 1 n 1 n (Z i Z) 2 independent? Prove your claim. (b) Let X 1, X 2,, X n be independent identically

More information

Hypothesis testing: theory and methods

Hypothesis testing: theory and methods Statistical Methods Warsaw School of Economics November 3, 2017 Statistical hypothesis is the name of any conjecture about unknown parameters of a population distribution. The hypothesis should be verifiable

More information

Probability Theory and Statistics. Peter Jochumzen

Probability Theory and Statistics. Peter Jochumzen Probability Theory and Statistics Peter Jochumzen April 18, 2016 Contents 1 Probability Theory And Statistics 3 1.1 Experiment, Outcome and Event................................ 3 1.2 Probability............................................

More information

Brief Review on Estimation Theory

Brief Review on Estimation Theory Brief Review on Estimation Theory K. Abed-Meraim ENST PARIS, Signal and Image Processing Dept. abed@tsi.enst.fr This presentation is essentially based on the course BASTA by E. Moulines Brief review on

More information

By Paul A. Jenkins and Yun S. Song, University of California, Berkeley July 26, 2010

By Paul A. Jenkins and Yun S. Song, University of California, Berkeley July 26, 2010 PADÉ APPROXIMANTS AND EXACT TWO-LOCUS SAMPLING DISTRIBUTIONS By Paul A. Jenkins and Yun S. Song, University of California, Berkeley July 26, 2010 For population genetics models with recombination, obtaining

More information

MS 3011 Exercises. December 11, 2013

MS 3011 Exercises. December 11, 2013 MS 3011 Exercises December 11, 2013 The exercises are divided into (A) easy (B) medium and (C) hard. If you are particularly interested I also have some projects at the end which will deepen your understanding

More information

Minimum Hellinger Distance Estimation in a. Semiparametric Mixture Model

Minimum Hellinger Distance Estimation in a. Semiparametric Mixture Model Minimum Hellinger Distance Estimation in a Semiparametric Mixture Model Sijia Xiang 1, Weixin Yao 1, and Jingjing Wu 2 1 Department of Statistics, Kansas State University, Manhattan, Kansas, USA 66506-0802.

More information

1 A simple example. A short introduction to Bayesian statistics, part I Math 217 Probability and Statistics Prof. D.

1 A simple example. A short introduction to Bayesian statistics, part I Math 217 Probability and Statistics Prof. D. probabilities, we ll use Bayes formula. We can easily compute the reverse probabilities A short introduction to Bayesian statistics, part I Math 17 Probability and Statistics Prof. D. Joyce, Fall 014 I

More information

Tail bound inequalities and empirical likelihood for the mean

Tail bound inequalities and empirical likelihood for the mean Tail bound inequalities and empirical likelihood for the mean Sandra Vucane 1 1 University of Latvia, Riga 29 th of September, 2011 Sandra Vucane (LU) Tail bound inequalities and EL for the mean 29.09.2011

More information

STAT 512 sp 2018 Summary Sheet

STAT 512 sp 2018 Summary Sheet STAT 5 sp 08 Summary Sheet Karl B. Gregory Spring 08. Transformations of a random variable Let X be a rv with support X and let g be a function mapping X to Y with inverse mapping g (A = {x X : g(x A}

More information

Maximum Smoothed Likelihood for Multivariate Nonparametric Mixtures

Maximum Smoothed Likelihood for Multivariate Nonparametric Mixtures Maximum Smoothed Likelihood for Multivariate Nonparametric Mixtures David Hunter Pennsylvania State University, USA Joint work with: Tom Hettmansperger, Hoben Thomas, Didier Chauveau, Pierre Vandekerkhove,

More information

Lecture 5: Likelihood ratio tests, Neyman-Pearson detectors, ROC curves, and sufficient statistics. 1 Executive summary

Lecture 5: Likelihood ratio tests, Neyman-Pearson detectors, ROC curves, and sufficient statistics. 1 Executive summary ECE 830 Spring 207 Instructor: R. Willett Lecture 5: Likelihood ratio tests, Neyman-Pearson detectors, ROC curves, and sufficient statistics Executive summary In the last lecture we saw that the likelihood

More information

The mathematical challenge. Evolution in a spatial continuum. The mathematical challenge. Other recruits... The mathematical challenge

The mathematical challenge. Evolution in a spatial continuum. The mathematical challenge. Other recruits... The mathematical challenge The mathematical challenge What is the relative importance of mutation, selection, random drift and population subdivision for standing genetic variation? Evolution in a spatial continuum Al lison Etheridge

More information

Statistical Inference of Covariate-Adjusted Randomized Experiments

Statistical Inference of Covariate-Adjusted Randomized Experiments 1 Statistical Inference of Covariate-Adjusted Randomized Experiments Feifang Hu Department of Statistics George Washington University Joint research with Wei Ma, Yichen Qin and Yang Li Email: feifang@gwu.edu

More information

Lecture 2. (See Exercise 7.22, 7.23, 7.24 in Casella & Berger)

Lecture 2. (See Exercise 7.22, 7.23, 7.24 in Casella & Berger) 8 HENRIK HULT Lecture 2 3. Some common distributions in classical and Bayesian statistics 3.1. Conjugate prior distributions. In the Bayesian setting it is important to compute posterior distributions.

More information

SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions

SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu

More information

f (1 0.5)/n Z =

f (1 0.5)/n Z = Math 466/566 - Homework 4. We want to test a hypothesis involving a population proportion. The unknown population proportion is p. The null hypothesis is p = / and the alternative hypothesis is p > /.

More information

Asymptotics for posterior hazards

Asymptotics for posterior hazards Asymptotics for posterior hazards Pierpaolo De Blasi University of Turin 10th August 2007, BNR Workshop, Isaac Newton Intitute, Cambridge, UK Joint work with Giovanni Peccati (Université Paris VI) and

More information

Final Examination Statistics 200C. T. Ferguson June 11, 2009

Final Examination Statistics 200C. T. Ferguson June 11, 2009 Final Examination Statistics 00C T. Ferguson June, 009. (a) Define: X n converges in probability to X. (b) Define: X m converges in quadratic mean to X. (c) Show that if X n converges in quadratic mean

More information

S6880 #7. Generate Non-uniform Random Number #1

S6880 #7. Generate Non-uniform Random Number #1 S6880 #7 Generate Non-uniform Random Number #1 Outline 1 Inversion Method Inversion Method Examples Application to Discrete Distributions Using Inversion Method 2 Composition Method Composition Method

More information

BTRY 4830/6830: Quantitative Genomics and Genetics

BTRY 4830/6830: Quantitative Genomics and Genetics BTRY 4830/6830: Quantitative Genomics and Genetics Lecture 23: Alternative tests in GWAS / (Brief) Introduction to Bayesian Inference Jason Mezey jgm45@cornell.edu Nov. 13, 2014 (Th) 8:40-9:55 Announcements

More information

Multivariate Statistics

Multivariate Statistics Multivariate Statistics Chapter 2: Multivariate distributions and inference Pedro Galeano Departamento de Estadística Universidad Carlos III de Madrid pedro.galeano@uc3m.es Course 2016/2017 Master in Mathematical

More information

Lecture 4: Probabilistic Learning

Lecture 4: Probabilistic Learning DD2431 Autumn, 2015 1 Maximum Likelihood Methods Maximum A Posteriori Methods Bayesian methods 2 Classification vs Clustering Heuristic Example: K-means Expectation Maximization 3 Maximum Likelihood Methods

More information

Topic 12 Overview of Estimation

Topic 12 Overview of Estimation Topic 12 Overview of Estimation Classical Statistics 1 / 9 Outline Introduction Parameter Estimation Classical Statistics Densities and Likelihoods 2 / 9 Introduction In the simplest possible terms, the

More information

Statistical Inference

Statistical Inference Statistical Inference Robert L. Wolpert Institute of Statistics and Decision Sciences Duke University, Durham, NC, USA Spring, 2006 1. DeGroot 1973 In (DeGroot 1973), Morrie DeGroot considers testing the

More information

Statistical population genetics

Statistical population genetics Statistical population genetics Lecture 7: Infinite alleles model Xavier Didelot Dept of Statistics, Univ of Oxford didelot@stats.ox.ac.uk Slide 111 of 161 Infinite alleles model We now discuss the effect

More information

Nonparametric Drift Estimation for Stochastic Differential Equations

Nonparametric Drift Estimation for Stochastic Differential Equations Nonparametric Drift Estimation for Stochastic Differential Equations Gareth Roberts 1 Department of Statistics University of Warwick Brazilian Bayesian meeting, March 2010 Joint work with O. Papaspiliopoulos,

More information