ESTIMATING THE FREQUENCY OF THE OLDEST ALLELE - A BAYESIAN APPROACH


ESTIMATING THE FREQUENCY OF THE OLDEST ALLELE - A BAYESIAN APPROACH

by Paul Joyce

TECHNICAL REPORT No. 171, July 1989

Department of Statistics, GN-22, University of Washington, Seattle, Washington, USA

Abstract

Consider an age-ordered population $Z_1, Z_2, \ldots$, where $Z_i$ is the frequency of the $i$th oldest allele and $\sum_i Z_i = 1$. From this population consider an age-ordered sample of size $n$ with $l$ alleles and the frequencies in age-order denoted by $M = (l; m_1, m_2, \ldots, m_l)$. We calculate the posterior distribution and posterior moments for the population frequency of the oldest allele, $Z_1$, given the sample $M$, assuming that the population is at stationarity and follows the neutral infinite alleles model. We also calculate the posterior distribution of $Z_1$ given a partition of $n$ genes with no age information in the sample. These results are used to determine Bayes estimators for the population frequency of the oldest type, and the analysis is extended to include the posterior distribution of $Z_1, Z_2, \ldots, Z_k$ given $M$ for any $k$.

Key words: [...] G.E.M. distribution, infinite [...]

[...] NSF BSR [...]

1 Introduction

Consider a sample of $n$ genes drawn at a locus from a large population which has evolved according to the assumptions of the neutral infinite alleles model. The population is viewed as a realization of a random sequence of allele frequencies listed in age-order. It is this particular realization that we wish to learn more about. In particular, we use the information obtained in the sample to estimate the proportion of the population that is of the oldest type. However, we do not wish to base our inference entirely on information obtained from the sample. The infinite-alleles assumption is prior knowledge that should be used together with the sample to make our estimate. So the problem at hand naturally lends itself to a Bayesian approach.

The above problem is in some sense a follow-up to a question asked by Watterson and Guess (1977). The question is: is the most frequent allele the oldest? The answer is that, under the assumptions of the infinite alleles model, the oldest allele is the most frequent with probability equal to its expected frequency. So it seems natural to ask: what is the frequency of the oldest allele [...] a sample?

[...] authors have studied the ages of alleles. Watterson (1976a) was the [...] used to answer questions about [...] reversibility arguments to [...]. Donnelly [...] were [...] to discover

[...] according to [...] simpler and more informative.

The culmination of the theory that has been developed in the last decade concerning age-ordering of alleles can be summed up by the following distribution, called the G.E.M. Let $V_1, V_2, \ldots$ be i.i.d. Beta$(1, \theta)$, and let

$$Z_1 = V_1, \qquad Z_i = (1 - V_1)(1 - V_2) \cdots (1 - V_{i-1})\, V_i, \quad i > 1. \tag{1.1}$$

$Z_i$ represents the allele frequency of the $i$th oldest type taken from a stationary infinite alleles diffusion model. The distribution (1.1) (in the context of population genetics) was first discovered by Griffiths (1982) (unpublished). Donnelly and Tavaré (1986) showed that (1.1) arises as the limiting distribution of an age-ordered sample of size $n$ taken from a coalescent process with ages, as $n \to \infty$. An infinite coalescent with ages was constructed by Donnelly and Tavaré (1987) which has equilibrium distribution given by (1.1). Hoppe (1987) uses the G.E.M. to derive the Ewens sampling formula, giving an alternative proof of a theorem of Watterson (1976b). Donnelly and Joyce (1990a) showed that (1.1) [...] limit of [...] population of age-ordered alleles from a class of stationary [...] as [...] $\to \infty$

[...] to (1.1). A summary of recent [...] about [...]

Since we are interested in Bayesian inference, we view (1.1) as a prior distribution for our population frequencies. Fortunately, distributions like (1.1) are often used as priors in Bayesian statistics, and some known results are listed in the appendix. Using Bayesian [...] prior to sampling. $Z_1$ represents the frequency of the oldest allele. We aim to calculate the posterior distribution for $Z_1$ given the data. The posterior mean $E(Z_1 \mid \text{data})$ is our Bayes estimator for the population frequency of the oldest type. We use the posterior distribution to assess our error for that estimation.
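The stick-breaking construction in (1.1) can be illustrated with a short Python sketch. This is not code from the report: the function names `sample_gem` and `mean_oldest_frequency`, the truncation to the first `k` frequencies, and the fixed seed are assumptions made here for illustration only.

```python
import random


def sample_gem(theta, k, rng=None):
    """Return the first k age-ordered frequencies Z_1..Z_k of one G.E.M.(theta) draw."""
    rng = rng or random.Random()
    remaining = 1.0                      # mass not yet assigned to older alleles
    z = []
    for _ in range(k):
        v = rng.betavariate(1.0, theta)  # V_i ~ Beta(1, theta)
        z.append(remaining * v)          # Z_i = (1 - V_1)...(1 - V_{i-1}) V_i
        remaining *= 1.0 - v
    return z


def mean_oldest_frequency(theta, draws=20000, seed=0):
    """Monte Carlo estimate of the prior mean E(Z_1), which equals 1 / (1 + theta)."""
    rng = random.Random(seed)
    return sum(sample_gem(theta, 1, rng)[0] for _ in range(draws)) / draws
```

Under the prior, $Z_1 = V_1$ is Beta$(1, \theta)$, so the simulated mean should sit close to $1/(1+\theta)$; the posterior calculations that follow describe how a sample moves the estimate away from this prior value.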

2 Data

As was mentioned in the introduction, distributions like (1.1) are often used as priors in Bayesian inference problems. What sets our population problem apart from other problems is the type of information we have available in our sample. The point is best illustrated by an example.

Example. Consider a population consisting of an infinite number of types, listed in age order. Let $x_i$ be the proportion of the $i$th type. Suppose a sample of size 5, $X_1, X_2, \ldots, X_5$, taken from the population yields the following:

$$X_1 = 2, \quad X_2 = 3, \quad X_3 = 2, \quad X_4 = 5, \quad X_5 = 3.$$

Let

$$T_j = \sum_{i=1}^{5} I\{X_i = j\},$$

and let $T = (T_1, T_2, \ldots)$; then $T = (0, 2, 2, 0, 1, 0, 0, \ldots)$. The $X$'s tell you the label of each of the individuals sampled. $T$ tells you which labels have been sampled and which have not. The distribution of $T$ is, of course, multinomial. In the context of age ordering, $T$ tells you (in the above example) that the second oldest type in the population has 2 representatives in the sample, the third oldest also 2, and the 5th oldest has 1 representative in the sample. However, a geneticist could not hope to be privy to that much information. At best the geneticist would be able to list the types in age-order relative to [...] problem his sample would look like [...] 2 representatives, [...] 1 [...] It is even more

[...] no [...] $a_i$ represent [...] alleles in [the] sample with $i$ representatives; for the above example $a_1 = 1$, $a_2 = 2$, [...] integer 5 ([...] Kingman (1978a,b)). The data for our purposes will come in the form of a partition, or will be relatively age-ordered. Our inference must be based on data of this form.

The above example should serve to motivate the following definitions. Let $X = (X_1, X_2, \ldots, X_n)$ be an i.i.d. sample of size $n$ taken from a population described by (1.1). Define $T_j$ to be

$$T_j = \sum_{i=1}^{n} I\{X_i = j\}, \tag{2.1}$$

and $T = (T_1, T_2, \ldots)$. Let $i_1 = \min\{j : T_j \neq 0\}$, and $i_k = \min\{j : j > i_{k-1},\ T_j \neq 0\}$. Let

$$L = \sum_{j=1}^{\infty} I\{T_j \neq 0\}.$$

Define $M_k = T_{i_k}$ for $k = 1, 2, \ldots, L$. Let

$$M = (L; M_1, M_2, \ldots, M_L). \tag{2.2}$$

Thus $L$ is the number of alleles in the sample and $M$ is the age-ordered sample. Define

$$A_i = \sum_{j=1}^{\infty} I\{T_j = i\}, \qquad A = (A_1, A_2, \ldots, A_n).$$

$A$ is a partition of the integer $n$, that is, a vector with $\sum_{j=1}^{n} j A_j = n$.
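The three summaries of a sample defined above (the fully age-ordered counts $T$, the relatively age-ordered sample $M$ of (2.2), and the partition $A$) can be computed for the worked example as follows. This is an illustrative sketch only; the function names are hypothetical and the labels are assumed to be positive integers giving each gene's age rank in the population.

```python
from collections import Counter


def age_ordered_counts(labels):
    """T_j = number of sampled genes whose population age rank is j (1-based)."""
    counts = Counter(labels)
    return [counts.get(j, 0) for j in range(1, max(labels) + 1)]


def relative_age_order(t):
    """M = (l; m_1, ..., m_l): the nonzero entries of T, kept in age order."""
    m = [c for c in t if c > 0]
    return len(m), m


def partition(t, n):
    """A = (a_1, ..., a_n), where a_i counts the alleles with i representatives."""
    multiplicity = Counter(c for c in t if c > 0)
    return [multiplicity.get(i, 0) for i in range(1, n + 1)]


x = [2, 3, 2, 5, 3]                  # the sample from the example
t = age_ordered_counts(x)            # [0, 2, 2, 0, 1]
l, m = relative_age_order(t)         # 3 alleles with counts [2, 2, 1]
a = partition(t, len(x))             # [1, 2, 0, 0, 0]
```

Each step discards information: $M$ forgets which population ranks were observed, and $A$ additionally forgets the relative age order within the sample.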

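Let $a = (a_1, a_2, \ldots, a_n)$, $a_i \in \mathbb{N}$, be such that $\sum_{j=1}^{n} j a_j = n$. Let

$$\mathcal{A}_a = \Big\{(t_1, t_2, \ldots) : t_i \in \mathbb{N},\ \sum_{i=1}^{\infty} t_i = n,\ \sum_{j=1}^{\infty} I\{t_j = i\} = a_i\Big\}.$$

Let $m = (l; m_1, m_2, \ldots, m_l)$ be such that $\sum_{i=1}^{l} m_i = n$. Define

$$\mathcal{M}_m = \Big\{(t_1, t_2, \ldots) : t_i \in \mathbb{N},\ \sum_{i=1}^{\infty} t_i = n,\ t_{i_j} = m_j,\ j = 1, 2, \ldots, l\Big\}.$$

Note that $Z = (Z_1, Z_2, \ldots)$ defined by (1.1) is a random vector on the infinite simplex $\Delta$ defined by

$$\Delta = \Big\{(x_1, x_2, \ldots) : x_i \geq 0,\ \sum_{i=1}^{\infty} x_i = 1\Big\}.$$

Let $\mu$ be the distribution of $Z$ given by (1.1); then by definition

$$P[A = a] = \int_{\Delta} \sum_{t \in \mathcal{A}_a} \frac{n!}{t_1!\, t_2! \cdots}\, x_1^{t_1} x_2^{t_2} \cdots \, d\mu(x)$$

and

$$P[L = l; M_1 = m_1, \ldots, M_l = m_l] = \int_{\Delta} \sum_{t \in \mathcal{M}_m} \frac{n!}{t_1!\, t_2! \cdots}\, x_1^{t_1} x_2^{t_2} \cdots \, d\mu(x). \tag{2.4}$$

Computing the above infinite dimensional integrals seems like an extremely [...]; for the G.E.M. distribution $\mu$, [the] answers are as [follows]:

$$\int_{\Delta} \sum_{t \in \mathcal{A}_a} \frac{n!}{t_1!\, t_2! \cdots}\, x_1^{t_1} x_2^{t_2} \cdots \, d\mu(x) = \frac{n!}{\theta_{(n)}} \prod_{j=1}^{n} \frac{\theta^{a_j}}{j^{a_j}\, a_j!} \tag{2.5}$$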

$$\int_{\Delta} \sum_{t \in \mathcal{M}_m} \frac{n!}{t_1!\, t_2! \cdots}\, x_1^{t_1} x_2^{t_2} \cdots \, d\mu(x) = \frac{\theta^{l}\, n!}{\theta_{(n)}\; m_l\,(m_l + m_{l-1}) \cdots (m_l + m_{l-1} + \cdots + m_1)} \tag{2.6}$$

where $\theta_{(n)} = \theta(\theta + 1) \cdots (\theta + n - 1)$.

The right side of Equation (2.5) is the famous Ewens Sampling Formula (see Ewens (1972)), and the right side of (2.6) is an age-ordered version (see Donnelly and Tavaré (1986), Donnelly and Joyce (1990a,b), Ethier (1989)). Note that the function being integrated in (2.5) is symmetric. So the integral in (2.5) would be the same if integration were done with respect to the joint distribution of the order statistics. In the case of the G.E.M. (1.1), the distribution of the order statistics is the well-known Poisson-Dirichlet distribution (see Kingman (1975)). Watterson (1976) was first to relate the Poisson-Dirichlet distribution to the Ewens sampling formula and so can be credited with the first proof of (2.5). Actually, Watterson's calculation uses finite dimensional Dirichlets rather than the Poisson-Dirichlet directly. A theorem of Kingman (1977) shows that this is equivalent to showing (2.5) directly. In Theorem 3 of Hoppe (1987) a proof of (2.5) is given by using the G.E.M. distribution. In Theorem 10 of Donnelly and Joyce (1990a) (2.6) is [...] (2.6) by summing [...]
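The relationship between the two sampling formulas can be checked numerically with a small Python sketch. It rests on an assumption: the right side of (2.6) is taken to be the standard closed form $\theta^{l} n! / [\theta_{(n)}\, m_l (m_l + m_{l-1}) \cdots (m_l + \cdots + m_1)]$, since the printed equation is not fully legible in this copy. The function names are hypothetical. Summing the age-ordered probabilities over every distinct age ordering of the same allele counts should then return the Ewens Sampling Formula (2.5).

```python
from itertools import permutations
from math import factorial, prod


def rising(theta, n):
    """theta_(n) = theta (theta + 1) ... (theta + n - 1)."""
    return prod(theta + i for i in range(n))


def esf(a, theta):
    """Ewens Sampling Formula (2.5) for a = (a_1, ..., a_n) with sum_j j * a_j = n."""
    n = sum(j * aj for j, aj in enumerate(a, start=1))
    p = factorial(n) / rising(theta, n)
    for j, aj in enumerate(a, start=1):
        p *= theta ** aj / (j ** aj * factorial(aj))
    return p


def age_ordered(m, theta):
    """Assumed closed form of (2.6) for counts m = (m_1, ..., m_l), oldest first."""
    n = sum(m)
    tail, denom = 0, 1
    for mi in reversed(m):               # m_l, then m_l + m_{l-1}, and so on
        tail += mi
        denom *= tail
    return theta ** len(m) * factorial(n) / (rising(theta, n) * denom)


def esf_by_summing(a, theta):
    """Sum the age-ordered formula over all distinct orderings of the counts in a."""
    counts = [j for j, aj in enumerate(a, start=1) for _ in range(aj)]
    return sum(age_ordered(m, theta) for m in set(permutations(counts)))
```

For the example partition $a_1 = 1$, $a_2 = 2$ with $n = 5$, the three orderings $(1,2,2)$, $(2,1,2)$ and $(2,2,1)$ contribute unequal probabilities, with orderings that place larger counts earlier (older) receiving more weight.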

3 Posterior distribution

[We] now [calculate the posterior] distribution of $Z_1$ (the frequency of the oldest allele) [given] an age-ordered sample $M$. This distribution depends on [$M$] only through the frequency of the oldest allele in [the sample]. The prior distribution of $Z_1$ is Beta$(1, \theta)$. It follows from Theorem A.2 of the appendix that the distribution of $Z_1$ given a sample $X$ ($X$ lists the population label for each member of the sample) is also a beta distribution. However, as was pointed out in Example 1 of the last section, we condition on the relatively age-ordered sample $M$ defined by (2.2).

There are two cases to be considered. Either the oldest allele in the sample is also the oldest allele in the population, or the oldest allele in the population has no representatives in the sample. By viewing $M$ we cannot determine which of these two situations holds. So we condition on each possibility. Thus the posterior distribution of the frequency of the oldest allele in the population, $Z_1$, given $M$ is a mixture of two Beta distributions. The theorem below formalizes this.

Theorem 3.1 Let $Z_1, Z_2, \ldots$ be the population frequencies in age-order of a population described by (1.1). Let $M$ be an age-ordered sample of size $n$. Then the posterior density of $Z_1$ given $M = m$ is

$$f_{Z_1 \mid M}(s \mid m) = \frac{\theta}{n + \theta}\, \frac{(1-s)^{n+\theta-1}}{\mathrm{Beta}(1,\, n+\theta)} + \frac{n}{n + \theta}\, \frac{s^{m_1}(1-s)^{n-m_1+\theta-1}}{\mathrm{Beta}(m_1+1,\, n-m_1+\theta)}.$$

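Proof. Let $X = (X_1, \ldots, X_n)$ be a sample of size $n$ from the population given by (1.1), where $X$ lists the population labels for each allele in the sample. Recall from (2.1) that $T_1$ is the number of genes in the sample which are of the oldest type in the population. It follows from Corollary A.2 of the appendix that the posterior density of $Z_1$ given $T_1 = t_1$ is

$$f_{Z_1 \mid T_1}(s \mid t_1) = \frac{s^{t_1}(1-s)^{n-t_1+\theta-1}}{\mathrm{Beta}(t_1+1,\, n-t_1+\theta)}. \tag{3.2}$$

So it follows that

$$f_{Z_1 \mid M}(s \mid m) = \sum_{j=0}^{n} f_{Z_1 \mid T_1}(s \mid j)\, P(T_1 = j \mid M = m).$$

However, the oldest allele in the population is either the oldest allele in the sample, or it is not represented in the sample. This implies

$$P(T_1 = 0 \mid M = m) + P(T_1 = m_1 \mid M = m) = 1. \tag{3.3}$$

Thus

$$f_{Z_1 \mid M}(s \mid m) = f_{Z_1 \mid T_1}(s \mid 0)\, P(T_1 = 0 \mid M = m) + f_{Z_1 \mid T_1}(s \mid m_1)\, P(T_1 = m_1 \mid M = m). \tag{3.4}$$

We need only show that

$$P(T_1 = 0 \mid M = m) = \frac{\theta}{n + \theta}.$$

[...]

$$P(T = t) = \frac{n!}{t_1!\, t_2! \cdots} \prod_{j=1}^{\infty} \theta\, \mathrm{Beta}(t_j + 1,\, y_j + \theta), \qquad y_j = \sum_{i > j} t_i. \tag{3.5}$$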

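[...] by [...] we see

$$P(M = m) = \sum_{t \in \mathcal{M}_m} \frac{n!}{t_1!\, t_2! \cdots} \prod_{j=1}^{\infty} \theta\, \mathrm{Beta}(t_j + 1,\, y_j + \theta). \tag{3.6}$$

Thus

$$P(T_1 = 0 \mid M = m) = \frac{\theta\, \mathrm{Beta}(1,\, n + \theta) \displaystyle\sum_{\{t \in \mathcal{M}_m :\, t_1 = 0\}} \frac{n!}{t_2!\, t_3! \cdots} \prod_{j=2}^{\infty} \theta\, \mathrm{Beta}(t_j + 1,\, y_j + \theta)}{P(M = m)} = \frac{\theta\, \mathrm{Beta}(1,\, n + \theta)\, P(M = m)}{P(M = m)} = \frac{\theta}{n + \theta}, \tag{3.7}$$

where the second equality is by (3.6): when $t_1 = 0$, the shifted vector $(t_2, t_3, \ldots)$ ranges over the same set of count vectors, so the remaining sum is again $P(M = m)$.

Note that

$$P(T_1 = m_1 \mid M = m) = 1 - P(T_1 = 0 \mid M = m) = \frac{n}{n + \theta}. \tag{3.8}$$

So [...]

It is interesting to see [that $P(T_1 = 0 \mid M = m)$] reduces to $\dfrac{\theta}{n + \theta}$ [...]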

[...] is (1.1),

$$P(T_1 = 0) = E\big[(1 - Z_1)^n\big] = \frac{\theta}{\theta + n}. \tag{3.9}$$

Thus the event $\{T_1 = 0\}$ is independent of the sample $M$. We mentioned earlier that by viewing the sample $M$ it is impossible to tell whether or not the oldest allele in the population appears in the sample. In fact, viewing $M$ doesn't even give us a hint. The information in $M$ is independent of whether or not the oldest allele appears in the sample. This fact serves as a reminder that there is a lot less information in a relatively age-ordered sample $M$ than in a totally age-ordered sample $T$.

There are alternative ways to prove (3.9). The central result of Watterson and Guess (1977) and Kelly (1979) is that the probability that a particular allele is the oldest is equal to its proportion of the population. Thus, the chance that the oldest allele does not appear in the sample, $P(T_1 = 0)$, is the chance that a randomly selected individual is of a type that does not appear in the sample.

Corollary 3.1 If $Z_1$ is the frequency of the oldest allele in the population whose distribution is given by (1.1), and $M$ is an age-ordered sample of size $n$, then

$$E(Z_1 \mid M) = \frac{n}{n + \theta} \cdot \frac{M_1 + 1}{n + \theta + 1} + \frac{\theta}{n + \theta} \cdot \frac{1}{n + \theta + 1}$$

[...] an age-ordered sample.

[...] minimizes one's [...] squared error loss. [...] $Z_1$. So it follows [...] $n$ [...] our estimator is consistent, [...] $E(Z_1 \mid M)$ [...]
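The two-component mixture described in this section can be evaluated directly. The following Python sketch is illustrative rather than taken from the report: the function names are hypothetical, and it assumes the mixture weights $\theta/(n+\theta)$ (oldest allele absent from the sample) and $n/(n+\theta)$ (oldest allele present) together with the Beta components given by (3.2) at $t_1 = 0$ and $t_1 = m_1$.

```python
from math import exp, lgamma, log


def beta_pdf(s, a, b):
    """Density of the Beta(a, b) distribution at 0 < s < 1."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1) * log(s) + (b - 1) * log(1 - s))


def posterior_density_given_m(s, m1, n, theta):
    """Posterior density of Z_1 given an age-ordered sample whose oldest count is m1."""
    w_absent = theta / (n + theta)       # P(T_1 = 0 | M = m)
    return (w_absent * beta_pdf(s, 1, n + theta)
            + (1 - w_absent) * beta_pdf(s, m1 + 1, n - m1 + theta))


def bayes_estimate_given_m(m1, n, theta):
    """Posterior mean E(Z_1 | M): the Bayes estimator under squared error loss."""
    w_absent = theta / (n + theta)
    # Beta(1, n + theta) has mean 1 / (n + theta + 1);
    # Beta(m1 + 1, n - m1 + theta) has mean (m1 + 1) / (n + theta + 1).
    return (w_absent * 1 + (1 - w_absent) * (m1 + 1)) / (n + theta + 1)
```

For instance, with $n = 10$, $m_1 = 4$ and $\theta = 2$ the estimate works out to $1/3$, somewhat below the sample frequency $0.4$ of the oldest sampled allele because of the weight placed on the oldest population allele being unobserved.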

4 Conditioning on a Partition

[...] our sample of $n$ [...] contains no information about [...]. As we mentioned in Example 1, $A_i$ is the number of alleles in [the sample that] have $i$ representatives. Recall from (2.1) that $T_j$ is the number of genes in the sample that are of the $j$th oldest type in the population, and that

$$A_i = \sum_{j=1}^{\infty} I\{T_j = i\}, \qquad A = (A_1, A_2, \ldots, A_n).$$

Recall that $\sum_{i=1}^{n} i A_i = n$, and that $\sum_{i=1}^{n} A_i$ is the number of alleles present in the sample. We wish to calculate the posterior distribution of $Z_1$ given $A = a$.

Suppose we are given a sample with $k$ alleles present. If no age information is available, then it is possible that any given one of the alleles present in the sample could be of the oldest type in the population, or that the oldest type in the population is not represented in the sample. It is these $k + 1$ possibilities that we must condition on. While the age-ordered sample gave a mixture of two Betas as posterior for [...], the sample without [...] a posterior distribution [that] is a mixture of $k + 1$ Betas. [...] us consider a sample of [...] use a particular [...] is [...] to

In an [...] oldest in the population, we calculate [the probability that] an individual chosen at random from [the] population is of a given [allelic type with] $i$ representatives in the sample. With probability $n/M$ the randomly chosen individual will belong to the sample, and if this happens the probability that it is of the given allelic type is $i/n$. With probability $(M - n)/M$ the randomly chosen individual will be outside the sample, and if this happens the probability that it is of the given allelic type is $i/(n + \theta)$ (see Kelly (1979), Watterson and Guess (1977)). Thus the probability we are seeking is

$$\frac{n}{M} \cdot \frac{i}{n} + \frac{M - n}{M} \cdot \frac{i}{n + \theta} = \frac{i(\theta + M)}{M(\theta + n)}.$$

This is the argument used in Theorem 7.6 of Kelly (1979). Now letting $M \to \infty$ we see that the probability that a particular allele with $i$ representatives in the sample is the oldest in the population is

$$\frac{i}{n + \theta}.$$

So the probability that any allele with $i$ representatives in the sample is the oldest in the population is

$$\frac{i\, a_i}{n + \theta}, \quad i > 0. \tag{4.1}$$

[...]

[...] is a statement [...] (1.1), but [the] argument we [...] to [...] never mentions [the] G.E.M. (1.1). [...] book (1979) predates [the] use of [the] G.E.M. (1.1) [in] population genetics. The argument works because we now know that the G.E.M. is the limiting distribution, as population size [...], of the [...] alleles model. (See Donnelly and Tavaré (1986), (1987), Donnelly and Joyce (1990), Ethier (1990).) However, it is interesting to see that (4.1) can be arrived at by direct calculation using the G.E.M. explicitly. For this reason we give an alternative proof of (4.1).

Lemma 4.1 Consider a sample of size $n$ taken from a population described by (1.1). Let $A$ be the partition associated with the sample, and let $T_1$ be the number of genes in the sample that are of the oldest type in the population. Then for $j > 0$,

$$P(T_1 = j \mid A = a) = \frac{j\, a_j}{n + \theta}.$$

Proof. It follows by (3.5) that

$$P(T_1 = j \mid A = a) = [\ldots] \tag{4.2}$$

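[...] $t_i$ is [...] $j$th oldest [...] number of [...]. Note [that] it follows by summing equation (3.5) [that the] Ewens sampling formula (2.5), which we will now denote by $P_n$ (to make explicit the dependence on [sample] size), can be written as

$$P_n(a_1, a_2, \ldots, a_n) := P(A = a) = \sum_{t \in \mathcal{A}_a} P(T = t). \tag{4.3}$$

We now rewrite (4.2) to be

$$P(T_1 = j \mid A = a) = \binom{n}{j}\, \theta\, \mathrm{Beta}(j + 1,\, n - j + \theta)\; \frac{\displaystyle\sum_{\{t \in \mathcal{A}_a :\, t_1 = j\}} \frac{(n - j)!}{t_2!\, t_3! \cdots} \prod_{k=2}^{\infty} \theta\, \mathrm{Beta}(t_k + 1,\, y_k + \theta)}{P_n(a_1, a_2, \ldots, a_n)}.$$

We now use the Ewens Sampling Formula given by (2.5) to reduce the above equation to

$$\frac{j\, a_j}{n + \theta}.$$

[...] an urn model. Hoppe ([...]) [...] distribution is [...]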

[...] contexts [...] Hoppe['s] urn is [...] in Donnelly (1987).

Theorem 4.1 Consider a sample of size $n$ taken from a population described by (1.1). Let $A$ be the partition associated with the sample. Let $Z_1$ be the frequency of the oldest allele in the population. Then the posterior density of $Z_1$ given $A = a$ is

$$f_{Z_1 \mid A}(s \mid a) = \theta (1 - s)^{n + \theta - 1} + \sum_{j=1}^{n} \frac{j\, a_j}{n + \theta} \cdot \frac{s^{j} (1 - s)^{n - j + \theta - 1}}{\mathrm{Beta}(j + 1,\, n - j + \theta)}. \tag{4.4}$$

Proof. Note that

$$f_{Z_1 \mid A}(s \mid a) = \sum_{j=0}^{n} f_{Z_1 \mid T_1}(s \mid j)\, P(T_1 = j \mid A = a).$$

From Lemma 4.1 we note that

$$P(T_1 = 0 \mid A = a) = 1 - \sum_{j=1}^{n} \frac{j\, a_j}{n + \theta} = \frac{\theta}{n + \theta}.$$

So the result follows from Lemma 4.1 and equation (3.2). □

Corollary 4.1 [...] of [...] allele in a sample, $A$, [...] we can

[...] 1 of Donnelly and [...] it was shown that as $n$ [tends] to infinity, [...] for all $k$. So it follows from the above and (4.5) that

$$E(Z_1 \mid A) \to \sum_{i=1}^{\infty} Z_i^2 \quad \text{as } n \to \infty, \tag{4.6}$$

where $Z_i$ has distribution given by (1.1). So we see that the estimator in Corollary 4.1 is not consistent. Since the above estimator is invariant under different labelings of the sample, it is not surprising that its limit is a symmetric function of the population.

The poor asymptotic property of the estimator in Corollary 4.1 is very significant, and can be used to argue that one has no business estimating the frequency of the oldest allele from a sample with no age information. However, the posterior distribution (4.4) has a multi-modal density; that is, the graph of (4.4) can have many local maxima. So the posterior mean, $E(Z_1 \mid A)$, is not a good summary of (4.4). In fact a point estimate is not a very [...] thing to be looking at. Yet the posterior density still [tells] you more about [...] you [...] before sampling. It [tells] you [...] is a near [...]

[...] sample. The posterior distribution quantifies this statement.
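The mixture weights of Lemma 4.1 and the resulting posterior mean can be tabulated with a few lines of Python. This is a sketch under stated assumptions, not the report's own computation: the function names are hypothetical, the weights are taken to be $P(T_1 = j \mid A = a) = j a_j/(n+\theta)$ for $j > 0$ and $\theta/(n+\theta)$ for $j = 0$, and each component is the Beta$(j+1,\, n-j+\theta)$ density of (3.2) with mean $(j+1)/(n+\theta+1)$.

```python
def oldest_weights_given_partition(a, theta):
    """Return [P(T_1 = j | A = a) for j = 0, 1, ..., len(a)]."""
    n = sum(j * aj for j, aj in enumerate(a, start=1))
    weights = [theta / (n + theta)]                  # oldest allele not in the sample
    weights += [j * aj / (n + theta) for j, aj in enumerate(a, start=1)]
    return weights


def bayes_estimate_given_partition(a, theta):
    """Posterior mean E(Z_1 | A = a) of the Beta mixture."""
    n = sum(j * aj for j, aj in enumerate(a, start=1))
    weights = oldest_weights_given_partition(a, theta)
    # Component j is Beta(j + 1, n - j + theta), whose mean is (j + 1) / (n + theta + 1).
    return sum(w * (j + 1) for j, w in enumerate(weights)) / (n + theta + 1)
```

With the partition $a_1 = 1$, $a_2 = 2$ from the earlier example and $\theta = 1$, the weights are $1/6$ on "not sampled", $1/6$ on the singleton allele and $4/6$ on the two doubleton alleles combined, which gives a posterior mean of $5/14$. Because the weights depend only on counts, the posterior spreads over several Beta components and can have more than one mode.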

5 The Posterior G.E.M.

We now extend our analysis to include [the] joint distribution of the population frequencies of [the] $k$ oldest alleles $Z_1, Z_2, \ldots, Z_k$, given an age-ordered sample $M$. Let us first concentrate on the case $k = 2$. The posterior distribution for $(Z_1, Z_2)$ given a sample $X$, which lists the population labels for each member [of the] sample, depends only on $T_1$, the number of genes in the sample that are of the oldest type in the population, and $T_2$, the number of genes in the sample that are of the second oldest type in the population. The joint distribution of $(Z_1, Z_2)$ given $X$ will be a generalized Dirichlet with density

$$\frac{\theta_{(n+1)}}{\theta_{(n - t_1 - t_2)}} \cdot \frac{\theta + n - t_1}{t_1!\, t_2!} \cdot \frac{s_1^{t_1}\, s_2^{t_2}\, (1 - s_1 - s_2)^{n - t_1 - t_2 + \theta - 1}}{1 - s_1}. \tag{5.1}$$

(This follows from Theorem A.3 of the appendix.) However, if we view a relatively age-ordered sample $M$ we do not know $T_1$ or $T_2$. There are four possibilities for $(T_1, T_2)$ given $M$. They are

$$(T_1 = m_1,\ T_2 = m_2), \quad (T_1 = m_1,\ T_2 = 0), \quad (T_1 = 0,\ T_2 = m_1), \quad (T_1 = 0,\ T_2 = 0).$$

[...] unique structure of [the] G.E.M. [...] is [...] in [the] sample is [...] in [the] population and [the] second oldest in the sample is [the] second oldest in the population, conditioned on knowing the sample. Yet for the G.E.M. distribution age-ordering is equivalent to size-biasing. (For a proof of [this] see [...] and [...] (1977); [for] a complete description of size-biasing see Donnelly and Joyce (1989).) Thus [...] is just the chance that 1) a randomly selected individual is of a type that appears in the sample, and 2) after deleting that type from the population, another randomly chosen individual is of a type that appears in the remaining sample. [...] (1.1) [...] remain[ing] population is G.E.M. (1.1). This is theorem 1 [...] is $\dfrac{n}{n + \theta}$ [...]

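[...] statement (Hoppe (1986)) that the probability of [...] is $\dfrac{n - M_1}{n - M_1 + \theta}$. Thus

$$P(T_1 = m_1,\ T_2 = m_2 \mid M) = \frac{n}{n + \theta} \cdot \frac{n - m_1}{n - m_1 + \theta}.$$

Similarly

$$P(T_1 = 0,\ T_2 = m_1 \mid M) = \frac{\theta}{n + \theta} \cdot \frac{n}{n + \theta},$$

$$P(T_1 = m_1,\ T_2 = 0 \mid M) = \frac{n}{n + \theta} \cdot \frac{\theta}{n - m_1 + \theta}, \tag{5.2}$$

$$P(T_1 = 0,\ T_2 = 0 \mid M) = \frac{\theta^2}{(n + \theta)(n + \theta)}.$$

The posterior distribution of $(Z_1, Z_2)$ given $M$ has density [...] (5.2) [...] problem is completed. [...] now outlined [the] procedure to calculate [...] $= m$ [...] case $k =$ [...]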

[...] same procedure [...] the [...] we must [...] some notation. [...] defined by

[...] if $t_i = 0$, and $n - t_0 - \cdots - t_i$ [...] if $t_i \neq 0$.

[For] a fixed $m = (l; m_1, \ldots, m_l)$, where $\sum_{i=1}^{l} m_i = n$ and $m_i > 0$ for $i > 0$, define [...]

Theorem 5.1 The joint posterior distribution for the first $k$ oldest alleles in the population, $Z_k = (Z_1, Z_2, \ldots, Z_k)$, given an age-ordered sample $M = m$ has the following density:

$$\sum_{t \in [\ldots]} \prod_{i=1}^{k} \frac{[\ldots]}{n - \sum_{j=0}^{i-1} t_j\, [\ldots]}\; [\ldots] \tag{5.3}$$

where [...] $= 1 - s_1 - \cdots$ [...]

Proof. [...]

[...] expression

$$\prod_{i=1}^{k} \frac{[\ldots]}{n - [\ldots]} \tag{5.5}$$

is one of the appropriate probabilities for which we outlined the procedure for calculating in the case $k = 2$; the general case (5.5) follows by induction. □

Let us now define a sequence of random variables $Z'_1, Z'_2, \ldots$ where the joint distribution of $(Z'_1, Z'_2, \ldots, Z'_k)$ is given by (5.3). The Kolmogorov existence theorem (theorem 3.1 Billingsley (1977)) guarantees that $Z'_1, Z'_2, \ldots$ defines a probability measure on $\Delta$, which we denote by $\mu'$. So $\mu'$ is the posterior G.E.M., which gives us the age-ordered frequencies of a population after viewing an age-ordered sample.
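For the case $k = 2$ discussed in this section, the four possibilities for $(T_1, T_2)$ given $M$ and their posterior probabilities can be tabulated as follows. This Python sketch is an illustration only; the function name and dictionary keys are hypothetical, and the probabilities are assumed to be the products stated around (5.2): an allele is "found in the sample" with probability (genes remaining in the sample) / (genes remaining $+\ \theta$).

```python
def first_two_cases(m, theta):
    """Posterior probabilities of the four cases for (T_1, T_2) given M = m.

    m lists the sample counts in age order (oldest first), so m[0] is m_1.
    """
    n, m1 = sum(m), m[0]
    oldest_in = n / (n + theta)                      # oldest population allele is sampled
    second_in_after = (n - m1) / (n - m1 + theta)    # second oldest also sampled
    second_in_fresh = n / (n + theta)                # oldest missed, second oldest sampled
    return {
        "T1=m1, T2=m2": oldest_in * second_in_after,
        "T1=m1, T2=0": oldest_in * (1 - second_in_after),
        "T1=0, T2=m1": (1 - oldest_in) * second_in_fresh,
        "T1=0, T2=0": (1 - oldest_in) * (1 - second_in_fresh),
    }
```

Each case selects one generalized Dirichlet density of the form (5.1), so the posterior of $(Z_1, Z_2)$ given $M$ is a four-component mixture with these weights.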

A Appendix

Connor and Mosimann (1969) defined a generalized Dirichlet distribution for which [the] [...] distributions of [the] G.E.M. are a special example. Like the Dirichlet, [the] posterior [of] a generalized Dirichlet is again generalized Dirichlet with a change in the parameters.

Theorem A.1 Let $\{U_i\}$ be a sequence of independent random variables. Let $U_i$ have distribution Beta$(a_i, b_i)$. Define $Q_1 = U_1$ and

$$Q_i = (1 - U_1)(1 - U_2) \cdots (1 - U_{i-1})\, U_i. \tag{A.1}$$

The joint distribution of $(Q_1, Q_2, \ldots, Q_{m-1})$ is called a generalized Dirichlet and has density given by

$$[\ldots] \tag{A.2}$$

Proof. The proof follows immediately from transformation of variable. □

Note that in [the] special case $a_i = b_{i-1} - b_i$ the generalized Dirichlet [reduces to the Dirichlet distribution].

Theorem A.2 [...] $Q$ [...] $q$ [...] $Q$ is [...]

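[...] are independent with posterior distribution Beta$(T_i + a_i,\ Y_i + b_i)$, where

$$T_i = \sum_{j=1}^{n} I\{X_j = i\} \quad \text{and} \quad Y_i = \sum_{j \geq i+1} T_j.$$

Proof.

$$P\Big(X_1 = x_1, \ldots, X_n = x_n,\ \bigcap_{l} \{U_l \leq u_l\}\Big) = E\Big\{ P\big(X_1 = x_1, \ldots, X_n = x_n \mid U_i,\ i = 1, 2, \ldots\big) \prod_{l} I\{U_l \leq u_l\} \Big\} \tag{A.3}$$

$$= E\Big\{ \prod_{j} (1 - U_j)^{\sum_{i=1}^{n} I\{j < x_i\}}\; U_j^{\sum_{i=1}^{n} I\{j = x_i\}} \prod_{l} I\{U_l \leq u_l\} \Big\}.$$

Let $t_j = \sum_{i=1}^{n} I\{x_i = j\}$ and let $y_j = \sum_{i=1}^{n} I\{x_i > j\} = \sum_{i > j} t_i$. Now we use the fact that the $\{U_i\}$ are independent and [...] Beta$(a_i, b_i)$ to rewrite (A.3) as

$$\prod_{j} \int_{0}^{u_j} \frac{s^{t_j + a_j - 1} (1 - s)^{y_j + b_j - 1}}{\mathrm{Beta}(a_j, b_j)}\, ds$$

[...]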

So by [...] □

Corollary A.1 Let $T_i$ [be as] defined in Theorem A.2. Let $T = (T_1, T_2, \ldots)$. Then

$$P(T = t) = \frac{n!}{t_1!\, t_2! \cdots} \prod_{j} \frac{\mathrm{Beta}(t_j + a_j,\ y_j + b_j)}{\mathrm{Beta}(a_j, b_j)}.$$

Proof. The result follows immediately from equation (A.5). □

Corollary A.2 The posterior distribution of $Q_1$ given $T_1$ has Beta density given by

$$f_{Q_1 \mid T_1}(s \mid t) = \frac{s^{t + a_1 - 1} (1 - s)^{n - t + b_1 - 1}}{\mathrm{Beta}(t + a_1,\ n - t + b_1)}.$$

Proof. It is a special case of Theorem A.2. We state it separately because of its importance in the paper. □

Theorem A.3 Let $T_i$, $Q_i$, $X_i$ be as defined in Theorem A.2. The joint posterior distribution of [...] has a generalized Dirichlet distribution where $a_i$ is replaced by $a_i + t_i$ [...] $b_i$ [...]

Proof. An immediate consequence [...]
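The conjugate update stated in Theorem A.2 amounts to simple bookkeeping on the sample labels, which the following Python sketch illustrates. The function name is hypothetical, and the update rule is assumed to be exactly the one stated there: the $i$th stick variable $U_i$ moves from Beta$(a_i, b_i)$ to Beta$(T_i + a_i,\ Y_i + b_i)$, where $T_i$ counts labels equal to $i$ and $Y_i$ counts labels greater than $i$.

```python
def stick_posterior(labels, a, b):
    """Posterior Beta parameters (T_i + a_i, Y_i + b_i) for sticks i = 1..len(a)."""
    params = []
    for i, (ai, bi) in enumerate(zip(a, b), start=1):
        t_i = sum(1 for x in labels if x == i)   # genes of the i-th oldest type
        y_i = sum(1 for x in labels if x > i)    # genes of strictly younger types
        params.append((t_i + ai, y_i + bi))
    return params


theta = 1.5
labels = [2, 3, 2, 5, 3]                         # the sample from the Section 2 example
# G.E.M. prior: every stick is Beta(1, theta)
posterior = stick_posterior(labels, [1, 1, 1], [theta, theta, theta])
```

For the G.E.M. prior the first entry is $(1,\ n + \theta)$ whenever the oldest type is unobserved, which matches the $t_1 = 0$ case of the Beta posterior used for $Z_1$ in Section 3.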

References

Billingsley, P. (1977), Probability and Measure, Wiley, New York.

Connor, R.J. and Mosimann, J.E. (1969), Concepts of independence for proportions with a generalization of the Dirichlet distribution. J. Am. Statist. Assoc. 64, [...]

Donnelly, P. (1986), Partition structures, Polya urns, the Ewens sampling formula, and the ages of alleles. Theoret. Population Biol. 30, [...]

Donnelly, P. and Joyce, P. (1989), Continuity and weak convergence of ranked and size-biased permutations on the infinite simplex. Stochastic Processes Appl. 31, [...]

Donnelly, P. and Joyce, P. (1990a), Consistent ordered sampling distributions: characterization and convergence. Adv. in Appl. Probab. (to appear).

Donnelly, P. and Joyce, P. (1990b), Weak convergence of Population [...] Process to [...] with Ages (sub[mitted]).

Donnelly, P. and Tavaré, S. (1986), The ages of alleles and a coalescent. Adv. in Appl. Probab. 18, [...]

Donnelly, P. and Tavaré, S. [...], [...] of [the] infinitely-many neutral alleles model, J. Math. Biol. 25, [...]

Ethier, S. (1989), The infinitely-many-neutral-alleles diffusion model [...] Adv. Appl. Prob. 22, to appear.

Ethier, S. (1990), The distribution of the frequencies of age-ordered alleles in a diffusion model (submitted).

Ewens, W.J. (1972), The sampling theory of selectively neutral alleles, Theoret. Population Biol. 3, [...]

Ewens, W.J. (1989), Population genetics theory - the past and the future. Mathematical and Statistical Problems in Evolution, S. Lessard, ed. University of Montreal Press, Montreal, to appear.

Griffiths, R.C. (c. 1982), Unpublished [...]

Hoppe, F.M. (1984), Polya-like urns and the Ewens sampling formula, J. Math. Biol. 20, [...]

Hoppe, F.M. [...], Size-biased filtering of Poisson-Dirichlet sam[ples with] application to partition structures [...] genetics, J. [...] 23, [...]

Hoppe, F.M. [...], [The] sampling theory of neutral alleles [and] an urn model [in] population genetics, J. Math. Biol. 25, [...]

Joyce, P. [...], Age-ordered distributions associated with some neutral population genetics models. Unpublished Ph.D. [...], the University of Utah.

Kelly, F.P. (1977), Exact results for the Moran neutral allele model, Adv. Appl. Prob. 9, [...]

Kelly, F.P. (1979), Reversibility and Stochastic Networks, Wiley, New York.

Kingman, J.F.C. (1977), The population structure associated with the Ewens sampling formula, Theoret. Population Biol. 11, [...]

Kingman, J.F.C. (1978a), Random partitions in population genetics, Proc. Roy. Soc. London Ser. A 361, [...]

Kingman, J.F.C. (1978b), The representation of partition struc[tures], J. Lond. Math. Soc. 18, [...]

[...] C. ([...]), [Di]versity as a concept and [...]

Watterson, G.A. (1976a), [...] age of an [allele] I. Moran's infinitely-many neutral alleles model, Theoret. Population Biol. 10, [...]

Watterson, G.A. (1976b), The stationary distribution of the infinitely-many neutral alleles diffusion model, J. Appl. Probab. 13, [...]

Watterson, G.A. and Guess, H.A. (1977), Is the most frequent allele the oldest? Theoret. Population Biol. 11, [...]
