DISTRIBUTION OF NUCLEOTIDE DIFFERENCES BETWEEN TWO RANDOMLY CHOSEN CISTRONS IN A FINITE POPULATION¹

WEN-HSIUNG LI

Center for Demographic and Population Genetics, University of Texas Health Science Center, Houston, Texas 77030

Manuscript received October 2, 1975
Revised copy received August 16, 1975

ABSTRACT

WATTERSON'S (1975) formula for the steady-state distribution of the number of nucleotide differences between two randomly chosen cistrons in a finite population has been extended to transient states. The rate at which the mean of this distribution approaches its equilibrium value is 1/2N and is independent of the mutation rate, but the rate for the variance depends on the mutation rate, where N denotes the effective population size. Numerical computations show that if the heterozygosity (i.e., the probability that two cistrons are different) is low, say of the order of 0.1 or less, the probability that two cistrons differ at two or more nucleotide sites is less than 10 percent of the heterozygosity, whereas this probability may be as high as 50 percent of the heterozygosity if the heterozygosity is 0.5. A simple estimate for the mean number ($\bar{d}$) of site differences between cistrons is $\bar{d} = h/(1 - h)$, where h is the heterozygosity. At equilibrium, the probability that two cistrons differ by more than one site is equal to $h^2$, the square of the heterozygosity.

IN a pioneering work, KIMURA (1969) studied the number of heterozygous nucleotide sites per individual in a randomly mating population, assuming that sites are independent. Note that, under random mating, the number of heterozygous sites at a locus in a randomly chosen individual is equivalent to the number of nucleotide differences between two randomly chosen cistrons. Recently, EWENS (1974) found that, for independent sites with Poisson mutations, this number would be exactly Poisson distributed. On the other hand, WATTERSON (1975) has shown that this number follows approximately a geometric distribution if there is no recombination between sites. All these authors were concerned only with the steady state, and no one seems to have studied this number in transient states.

The main purpose of this communication is to study the distribution of this number in transient states under the assumption of no recombination between sites. This assumption is more reasonable than that of independent sites, since my main interest is the nucleotide differences between cistrons or amino acid differences between proteins. I shall follow the method used by WEHRHAHN (1975) and LI (1976). One alternative is the method used by WATTERSON (1975). No selection will be considered in this study.

¹ This study was supported by Public Health Service Grant GM 20293.

Genetics 85: 331-337 February, 1977

BASIC THEORY

Consider a randomly mating population of effective size N. I assume that, in each generation, a cistron either mutates with probability v at one of the nonsegregating sites or remains unchanged with probability 1 - v. I use the model of infinite sites (KIMURA 1971), in which I assume that no two mutations ever occur at the same site (even in different cistrons).

Suppose that two cistrons A and A′ at present were derived from the replication of a cistron s generations ago, and let $P^*_k(s) = \Pr\{j_s = k\}$ be the probability that the number (j) of nucleotide differences between A and A′ at present is k. To compute $P^*_k(s)$ we note that in passing from the previous generation to the present generation three possibilities can occur: (1) A and A′ differed at k sites in the previous generation and no mutation occurred to them in coming to the present generation, (2) they differed at k - 1 sites and either of them gained one mutation, and (3) they differed at k - 2 sites and each of them gained one mutation. Since the first event occurs with probability $(1 - v)^2$, the second with probability $2v(1 - v)$, and the third with probability $v^2$,

$$P^*_k(s) = (1 - v)^2 P^*_k(s-1) + 2v(1 - v) P^*_{k-1}(s-1) + v^2 P^*_{k-2}(s-1) \approx (1 - 2v) P^*_k(s-1) + 2v P^*_{k-1}(s-1),$$

neglecting terms of order $v^2$. It follows that the probability generating function (pgf) for the $P^*_k$ values is:

$$H(Z,s) = \sum_{k=0}^{\infty} P^*_k(s) Z^k \approx \sum_{k=0}^{\infty} \left[(1 - 2v) P^*_k(s-1) + 2v P^*_{k-1}(s-1)\right] Z^k = (1 - 2v + 2vZ) H(Z,s-1) = (1 - 2v + 2vZ)^s \approx e^{2vs(Z-1)}, \qquad (1)$$

where the last equality uses H(Z,0) = 1. Equation (1) holds exactly if the number of mutations per cistron per generation follows the Poisson law with mean v. Note also that equation (1) is equivalent to approximating the discrete time model by a continuous time model. For mathematical ease, I shall consider a continuous time model instead of the discrete time model.
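The Poisson approximation in equation (1) can be checked numerically. The following is a minimal sketch, not part of the original derivation: it iterates the full three-term recursion for $P^*_k(s)$ (including the $v^2$ term) and compares the result with a Poisson distribution of mean 2vs. The function names and the values v = 0.0001 and s = 1000 are hypothetical, chosen only for illustration.

```python
import math

def exact_distribution(v, s, kmax):
    """Iterate the three-term recursion for P*_k(s), k = 0..kmax.

    Starts from P*_0(0) = 1 (two copies of one cistron are identical).
    v is the per-cistron mutation probability per generation.
    """
    p = [1.0] + [0.0] * kmax
    for _ in range(s):
        new = [0.0] * (kmax + 1)
        for k in range(kmax + 1):
            new[k] = (1 - v) ** 2 * p[k]              # no mutation in either cistron
            if k >= 1:
                new[k] += 2 * v * (1 - v) * p[k - 1]  # one of the two cistrons mutated
            if k >= 2:
                new[k] += v ** 2 * p[k - 2]           # both cistrons mutated
        p = new
    return p

def poisson_distribution(v, s, kmax):
    """Poisson probabilities with mean 2vs, the distribution implied by exp(2vs(Z-1))."""
    mean = 2 * v * s
    return [math.exp(-mean) * mean ** k / math.factorial(k) for k in range(kmax + 1)]

# Hypothetical illustration values (not taken from the text).
exact = exact_distribution(1e-4, 1000, 6)
approx = poisson_distribution(1e-4, 1000, 6)
max_gap = max(abs(a - b) for a, b in zip(exact, approx))
```

For small v the two distributions should agree closely, which is the sense in which terms of order $v^2$ are neglected.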
Let $P_k(t) = \Pr\{d_t = k\}$ denote the probability that in generation t the number (d) of site differences between two randomly chosen cistrons is k, and let the pgf for the $P_k$ values be:

$$G(Z,t) = \sum_{k=0}^{\infty} P_k(t) Z^k.$$

The probability that two cistrons chosen randomly at time t were not derived from the replication of a single cistron at any time since generation 0 is given by $f(t) = (1 - 1/2N)^t$ for a discrete time model, or $f(t) = e^{-t/2N}$ for a continuous time model. In this case, the pgf of the distribution of d is $H(Z,t)G(Z,0)$. This follows from the fact that the number of site differences between the two cistrons is the sum of their initial differences and new mutations, and from the fact that the pgf of the sum of two independent random variables is the product of their pgf's (cf. FELLER 1968). On the other hand, the probability that two cistrons chosen randomly at present were derived from the replication of a cistron s generations ago is given by $F'(s) = dF(s)/ds = e^{-s/2N}/2N$, since $f(t) + \int_0^t F'(s)\,ds = 1$. (Note that $F(t) = 1 - f(t)$ is Wright's inbreeding coefficient in the absence of mutation.) In this case the pgf of the distribution of d is given by equation (1). Thus,

$$G(Z,t) = \int_0^t F'(s) H(Z,s)\,ds + f(t) H(Z,t) G(Z,0) = \frac{e^{a(Z)t} - 1}{2N a(Z)} + e^{a(Z)t} G(Z,0), \qquad (2)$$

where $a(Z) = -\lambda + 2vZ$ and $\lambda = 1/2N + 2v$. Note that

$$G(Z,\infty) = \frac{-1}{2N a(Z)} = \frac{1}{1 + \theta - \theta Z}, \qquad (3)$$

where $\theta = 4Nv$. That is, at steady state d follows a geometric distribution and

$$P_k(\infty) = \frac{1}{1 + \theta} \left(\frac{\theta}{1 + \theta}\right)^k. \qquad (4)$$

Formulas (3) and (4) are identical with formulas (2.14) and (1.8) of WATTERSON (1975). It is also easy to see that

$$G(Z,t) = G(Z,\infty) + e^{a(Z)t} \left[G(Z,0) - G(Z,\infty)\right].$$

Therefore,

$$P_k(t) = P_k(\infty) + e^{-\lambda t} \sum_{j=0}^{k} \frac{(2vt)^j}{j!} \left[P_{k-j}(0) - P_{k-j}(\infty)\right]. \qquad (5)$$

In particular, the homozygosity is given by

$$P_0(t) = P_0(\infty) + e^{-\lambda t} \left[P_0(0) - P_0(\infty)\right],$$

which is standard (MALECOT 1948). The mean and variance of $d_t$ are given by

$$\bar{d}_t = \left. \frac{\partial G(Z,t)}{\partial Z} \right|_{Z=1} = \theta + (\bar{d}_0 - \theta) e^{-t/2N}, \qquad (6)$$

$$\mathrm{Var}(d_t) = \left. \frac{\partial}{\partial Z} \left[ Z \frac{\partial G(Z,t)}{\partial Z} \right] \right|_{Z=1} - \bar{d}_t^{\,2} = \bar{d}_t - \bar{d}_t^{\,2} + 2\theta^2 + e^{-t/2N} \left[ \mathrm{Var}(d_0) + \bar{d}_0^{\,2} - \bar{d}_0 - 2\theta^2 + 4vt(\bar{d}_0 - \theta) \right]. \qquad (7)$$

It is interesting to note that the rate for the mean of d to approach its equilibrium value is 1/2N and independent of mutation rate, while that for the variance is retarded by mutation. At steady state

$$\bar{d}_\infty = \theta, \qquad (6a)$$

$$\mathrm{Var}(d_\infty) = \theta + \theta^2, \qquad (7a)$$

which agree with WATTERSON'S (1975) formula (1.10). For a comparison of (6a) and (7a) with the results of KIMURA (1969) and EWENS (1974), readers may refer to WATTERSON (1975).

DISCUSSION

In the derivation of equation (2), I reasoned from time 0 to t. A simpler derivation is to reason from time t - 1 to t. The argument is briefly as follows. At generation t, the pgf for the number of site differences between two randomly chosen cistrons is given by $g(Z) = H(Z,1)$ if they came from a cistron in generation t - 1, but it is $G(Z,t-1)g(Z)$ if they came from two cistrons in generation t - 1. Thus,

$$G(Z,t) = \left[ \frac{1}{2N} + c\,G(Z,t-1) \right] g(Z),$$

where $c = 1 - 1/2N$ (see (2.3) of WATTERSON 1975). The solution of the above equation is

$$G(Z,t) = \frac{g(Z) \left\{ [c\,g(Z)]^t - 1 \right\}}{2N [c\,g(Z) - 1]} + [c\,g(Z)]^t G(Z,0). \qquad (8)$$

Using the following two approximations, $g(Z) \approx 1$ and $c\,g(Z) - 1 \approx a(Z)$ (so that $[c\,g(Z)]^t \approx e^{a(Z)t}$), equation (8) reduces to equation (2). Therefore, these two approaches lead to the same result.

Table 1 shows the probability that two randomly chosen cistrons are different, i.e., the heterozygosity (h), and its decomposition into the probabilities that they differ by one nucleotide and by more than one, respectively, assuming that the initial population is completely homozygous.

TABLE 1
Distribution of site differences

               t = 0    0.04N      0.4N      4N       40N or ∞
4Nv = 0.1
  k ≥ 1        0        0.001978   0.01795   0.0808   0.0909
  k = 1        0        0.001976   0.01778   0.0755   0.0826
  k ≥ 2        0        0.000002   0.00017   0.0053   0.0083
4Nv = 1
  k ≥ 1        0        0.0109     0.0987    0.4446   0.5
  k = 1        0        0.00642    0.0574    0.2334   0.25
  k ≥ 2        0        0.00446    0.0413    0.2112   0.25

The probability for k ≥ 1 is equivalent to heterozygosity. N denotes effective population size.

It is seen that if 4Nv is 0.1, the probability that two cistrons differ at two or more nucleotide sites is small, being less than 10 percent of the heterozygosity even at equilibrium. On the other hand, if 4Nv is 1, this probability is larger than 40 percent of the heterozygosity even as early as t = 0.04N and constitutes 50 percent of the heterozygosity at equilibrium. Therefore, for a population with 4Nv of the order of 1 or larger, the actual genic variation may be considerably larger than that revealed by heterozygosity.

However, the mean number ($\bar{d}$) of site differences between cistrons may be estimated from heterozygosity (h). NEI (1975) used $\bar{d} = -\log_e(1 - h)$. Theoretically, $\bar{d} = 4Nv$ at equilibrium [see formula (6a)] for all models studied (KIMURA 1969, EWENS 1974, WATTERSON 1975), while $-\log_e(1 - h) = \log_e(1 + 4Nv)$ since $h = 4Nv/(1 + 4Nv)$ (KIMURA 1968). If 4Nv = 0.1, $-\log_e(1 - h) = 0.095$, which gives only a 5 percent underestimate, but if 4Nv = 1, $-\log_e(1 - h) = 0.693$, which gives a 30 percent underestimate. Although in practice NEI'S formula holds approximately, since 4Nv is usually of the order of 0.15 or less (NEI 1975), theoretically one should use $\bar{d} = h/(1 - h)$, since $h/(1 - h) = 4Nv$. Note that $h/(1 - h)$ is the ratio of heterozygosity to homozygosity.
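The quantities discussed above can be evaluated with a short sketch. It assumes the transient solution takes the form $P_k(t) = P_k(\infty) + e^{-\lambda t} \sum_{j=0}^{k} [(2vt)^j / j!]\,[P_{k-j}(0) - P_{k-j}(\infty)]$, which is the coefficient of $Z^k$ in the pgf relation $G(Z,t) = G(Z,\infty) + e^{a(Z)t}[G(Z,0) - G(Z,\infty)]$, with the geometric steady state $P_k(\infty) = \theta^k/(1+\theta)^{k+1}$ and $\theta = 4Nv$. The function names and the rescaled time `tau = t/(2N)` are introduced here for illustration only.

```python
import math

def transient_distribution(theta, tau, kmax, p0=None):
    """Return [P_0(t), ..., P_kmax(t)] for theta = 4Nv and tau = t / (2N).

    p0 is the initial distribution P_k(0); by default the initial
    population is completely homozygous (P_0(0) = 1).
    """
    if p0 is None:
        p0 = [1.0] + [0.0] * kmax
    # Geometric steady-state distribution.
    p_inf = [(1 / (1 + theta)) * (theta / (1 + theta)) ** k for k in range(kmax + 1)]
    lam_t = tau * (1 + theta)   # lambda * t = (1/2N + 2v) * t
    two_vt = theta * tau        # 2 * v * t
    decay = math.exp(-lam_t)
    result = []
    for k in range(kmax + 1):
        conv = sum(two_vt ** j / math.factorial(j) * (p0[k - j] - p_inf[k - j])
                   for j in range(k + 1))
        result.append(p_inf[k] + decay * conv)
    return result

def nei_estimate(h):
    """Mean number of site differences estimated as -log_e(1 - h)."""
    return -math.log(1 - h)

def ratio_estimate(h):
    """Mean number of site differences estimated as h / (1 - h)."""
    return h / (1 - h)

# Heterozygosity and single-site difference at t = 0.04N for 4Nv = 0.1.
p = transient_distribution(0.1, 0.02, 5)
h_early, p1_early = 1 - p[0], p[1]
```

With theta = 0.1 and tau = 0.02 (t = 0.04N), the sketch gives a heterozygosity of about 0.001978 and $P_1$ of about 0.001976, in line with the first block of Table 1. At equilibrium with theta = 1 (h = 0.5), `nei_estimate` returns about 0.693 while `ratio_estimate` returns 1, which is the 30 percent underestimate noted above.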
Note also that at equilibrium the probability that two cistrons differ by more than one site is equal to $\theta^2/(1 + \theta)^2$, which equals $h^2$ (see formula (4)). It also follows that the expected number of site differences between two cistrons is $\theta/h = 1 + \theta$, given the condition that they differ at least by one site.

The above argument is based on the assumption that allelic variants are identified at the nucleotide or codon level (in the latter case, site refers to codon instead of nucleotide). In practice, however, genetic variation is mostly studied by electrophoresis. The model of stepwise change of electrophoretic mobility of protein has recently been studied fairly extensively (e.g., OHTA and KIMURA 1973, KING 1973, NEI and CHAKRABORTY 1973, OHTA and KIMURA 1974, WEHRHAHN 1975, KIMURA and OHTA 1975, LI 1976). It should be of interest to compare the results of the infinite site model with those of stepwise mutation models. For simplicity I shall consider only the steady-state values. I consider two electrophoretic models: (1) the one-step model, in which only one-step mutations can occur (cf. OHTA and KIMURA 1973), and (2) the two-step model, in which two-step mutations as well as one-step mutations can occur (cf. WEHRHAHN 1975, LI 1976). In the electrophoretic models the mutation rate is assumed to be one-third of that of the infinite site model. In the two-step model, the proportion of two-step mutations is assumed to be 10 percent of all the mutations involving electrophoretic charge changes, though NEI and CHAKRABORTY'S (1973) results indicate that it is somewhat less than 10 percent.

TABLE 2
Equilibrium distributions of state differences for various models

                          h         k = 1    k = 2    k ≥ 3
4Nv = 0.1
  Infinite site model     0.09091   0.0826   0.0075   0.00075
  One-step model          0.03175   0.0312   0.0005   0.00001
  Two-step model          0.03184   0.0282   0.0035   0.0001
4Nv = 1
  Infinite site model     0.5       0.25     0.125    0.125
  One-step model          0.225     0.197    0.025    0.0036
  Two-step model          0.229     0.180    0.040    0.0086

The mutation rate (and therefore 4Nv) in the one-step and two-step models is assumed to be one-third of that in the infinite site model. For details, see text.

The state differences in Table 2 refer to amino acid differences between proteins in the case of the infinite site model, while they refer to charge differences between proteins in the case of the one-step and two-step models. Note that if 4Nv = 0.1, the underestimate of the heterozygosity at the amino acid level by using electrophoretic data is mainly due to the underestimate occurring at the first class, i.e., k = 1. On the other hand, if 4Nv = 1, the underestimate occurs mainly at the second class, i.e., k = 2, and the higher classes, i.e., k ≥ 3. The two-step model gives only a slight improvement over the one-step model. Although the assumption of stepwise change of electrophoretic mobility may not be very realistic (cf. JOHNSON 1974), Table 2 gives us some rough estimates of the detectability of electrophoresis.

The present results may be extended to study gene differentiation between populations.
A simple situation is as follows: A population splits into two populations at t = 0 and thereafter there is no migration between them. Let $D_k(t)$ be the probability that at generation t two randomly chosen cistrons, one from each population, differ at k sites. Then the pgf $D(Z,t) = \sum_{k=0}^{\infty} D_k(t) Z^k$ is given by

$$D(Z,t) = D(Z,0) H(Z,t), \qquad (9)$$

where $D(Z,0) = \sum_k D_k(0) Z^k$ is the pgf for the ancestral population at the moment of splitting. It follows that

$$D_k(t) = e^{-2vt} \sum_{j=0}^{k} \frac{(2vt)^j}{j!} D_{k-j}(0), \qquad (10)$$

$$\bar{d}_t = \bar{d}_0 + 2vt, \qquad (11)$$

$$\mathrm{Var}(d_t) = \mathrm{Var}(d_0) + 2vt. \qquad (12)$$
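Equation (9) says that the between-population distribution is the ancestral distribution convolved with a Poisson distribution of mean 2vt. A minimal sketch of that convolution follows. It assumes, purely for illustration, that the ancestral population was at steady state, so that $D_k(0)$ is the geometric distribution with parameter $\theta = 4Nv$; the text does not require this, and the function names and numerical values are hypothetical.

```python
import math

def divergence_distribution(theta, two_vt, kmax):
    """Return [D_0(t), ..., D_kmax(t)] for two cistrons, one from each population.

    Assumption (not from the text): the ancestral population was at
    equilibrium, so D_k(0) = theta**k / (1 + theta)**(k + 1).
    two_vt is 2*v*t, the expected number of mutations since the split.
    """
    ancestral = [(1 / (1 + theta)) * (theta / (1 + theta)) ** k
                 for k in range(kmax + 1)]
    new_mutations = [math.exp(-two_vt) * two_vt ** j / math.factorial(j)
                     for j in range(kmax + 1)]
    # Product of pgf's corresponds to convolution of the two distributions.
    return [sum(ancestral[k - j] * new_mutations[j] for j in range(k + 1))
            for k in range(kmax + 1)]

def mean_and_variance(dist):
    """Mean and variance of a distribution given as a list of probabilities."""
    mean = sum(k * p for k, p in enumerate(dist))
    second = sum(k * k * p for k, p in enumerate(dist))
    return mean, second - mean ** 2

# Hypothetical values: theta = 0.5 in the ancestor, 2vt = 0.3 since the split.
dist = divergence_distribution(0.5, 0.3, 80)
mean, var = mean_and_variance(dist)
```

Under these assumptions the mean should equal $\theta + 2vt$ and the variance $\theta + \theta^2 + 2vt$, that is, the ancestral mean and variance each increased by 2vt.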

Note that the mean number of site differences between two cistrons, one from each population, is equal to the mean number ($\bar{d}_0$) of site differences in the ancestral population plus 2vt, the amount of differentiation after separation. The latter component agrees with the results of NEI (1972).

I am greatly indebted to DR. M. NEI for valuable suggestions and discussions. Thanks also are due to DR. R. CHAKRABORTY for discussions. I thank a reviewer for valuable suggestions.

LITERATURE CITED

EWENS, W. J., 1974 A note on the sampling theory for infinite alleles and infinite sites models. Theor. Pop. Biol. 6: 143-148.
FELLER, W., 1968 An Introduction to Probability Theory and Its Applications, 3rd ed. John Wiley and Sons, New York.
JOHNSON, G. B., 1974 On the estimation of effective number of alleles from electrophoretic data. Genetics 78: 771-776.
KIMURA, M., 1968 Genetic variability maintained in a finite population due to mutational production of neutral and nearly neutral isoalleles. Genet. Res. 11: 247-269.
KIMURA, M., 1969 The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutation. Genetics 61: 893-903.
KIMURA, M., 1971 Theoretical foundation of population genetics at the molecular level. Theor. Pop. Biol. 2: 174-208.
KIMURA, M. and T. OHTA, 1975 Distribution of allelic frequencies in a finite population under stepwise production of neutral alleles. Proc. Nat. Acad. Sci. U.S. 72: 2761-2764.
KING, J. L., 1973 The probability of electrophoretic identity of proteins as a function of amino acid divergence. J. Molec. Evol. 2: 317-322.
LI, W-H., 1976 Electrophoretic identity of proteins in a finite population and genetic distance between taxa. Genet. Res. 28: 113-127.
MALECOT, G., 1948 Les Mathématiques de l'Hérédité. Masson et Cie, Paris.
NEI, M., 1972 Genetic distance between populations. Am. Nat. 106: 283-292.
NEI, M., 1975 Molecular Population Genetics and Evolution. North-Holland Publishing Company, Amsterdam.
NEI, M. and R. CHAKRABORTY, 1973 Genetic distance and electrophoretic identity of proteins between taxa. J. Molec. Evol. 2: 323-328.
OHTA, T. and M. KIMURA, 1973 A model of mutation appropriate to estimate the number of electrophoretically detectable alleles in a finite population. Genet. Res. 22: 201-204.
OHTA, T. and M. KIMURA, 1974 Simulation studies on electrophoretically detectable genetic variability in a finite population. Genetics 76: 615-624.
WATTERSON, G. A., 1975 On the number of segregating sites in genetical models without recombination. Theor. Pop. Biol. 7: 256-276.
WEHRHAHN, C. F., 1975 The evolution of selectively similar electrophoretically detectable alleles in finite natural populations. Genetics 80: 375-394.

Corresponding editor: J. F. CROW