OEB 242 Exam Practice Problems Answer Key

Q1) Explain how background selection and genetic hitchhiking could explain the positive correlation between genetic diversity and recombination rate.

First, recall that by genetic diversity (or polymorphism, or heterozygosity) we typically mean Π, the average number of pairwise differences. Thus, it matters not only how many segregating sites we have, but also whether the alleles at those sites are more often at intermediate frequencies (which inflate the number of pairwise differences) or at rare/high frequencies (recall that if one allele is rare the other must be common, assuming a biallelic locus).

Background selection means that deleterious mutations and nearby linked variants are removed from the population as they arise. A high rate of recombination more easily allows linked neutral variants to detach themselves from deleterious variants and so persist in the population, whereas a low rate of recombination ensures that neutral variants are often removed along with the targets of background selection proper. More formally, Charlesworth showed that we can calculate the effective population size by scaling N by a factor of $e^{-\mu/r}$, where µ and r are the mutation and recombination rates, respectively. Thus recombination rate is positively correlated with $N_e$, which in turn is positively correlated with per-site diversity (recall that the infinite sites model predicts that $E(\Pi) = \theta = 4N_e\mu$). In other words, big r → big $N_e$ → big θ → big Π.

Genetic hitchhiking means that beneficial mutations and nearby linked variants proliferate through a population together, causing a local reduction in diversity surrounding the adaptive allele. This process is counteracted by recombination, which breaks up haplotypes and restores diversity to the population. When the recombination rate is high, diversity is restored quickly, whereas when it is low, haplotypes can persist for long stretches of time, preserving a low value of Π.
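
The background selection argument can be checked numerically. The sketch below is only an illustration under the scaling quoted above, $N_e = N e^{-\mu/r}$, combined with $E(\Pi) = 4N_e\mu$. The function names and the values of N, µ, and r are hypothetical choices made for this example, not values from the problem.

```python
import math


def effective_size(n, mu, r):
    """Scale the census size n by exp(-mu / r), the factor quoted in the answer."""
    return n * math.exp(-mu / r)


def expected_pi(n, mu, r):
    """Infinite-sites expectation E(Pi) = theta = 4 * Ne * mu."""
    return 4 * effective_size(n, mu, r) * mu


# Hypothetical illustrative values (assumptions, not from the problem).
N = 10_000
MU = 1e-8

for r in (1e-9, 1e-8, 1e-7, 1e-6):
    ne = effective_size(N, MU, r)
    pi = expected_pi(N, MU, r)
    print(f"r = {r:.0e}  Ne = {ne:10.2f}  E(Pi) = {pi:.3e}")
```

As r grows, the exponent −µ/r approaches zero, so $N_e$ approaches N and the expected diversity rises toward 4Nµ.
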
Q2) What is the inbreeding coefficient of individual I in the pedigree below, assuming that the only nonzero values of F among the ancestors of I are $F_A = 1/16$ and $F_D = 1/8$? You should leave your answer in the form of an unreduced expression.

Calculate F by tracing all possible paths through common ancestors. The common ancestor is the middle individual of each path:

KGCADJL (through A, 7 individuals)
KHDACGL (through A, 7 individuals)
KHEBFJL (through B, 7 individuals)
KHDJL (through D, 5 individuals)
KGL (through G, 3 individuals)

$$F_I = 2\left(\tfrac{1}{2}\right)^7 (1 + F_A) + \left(\tfrac{1}{2}\right)^7 (1 + F_B) + \left(\tfrac{1}{2}\right)^5 (1 + F_D) + \left(\tfrac{1}{2}\right)^3 (1 + F_G)$$

$$= 2\left(\tfrac{1}{2}\right)^7 \left(1 + \tfrac{1}{16}\right) + \left(\tfrac{1}{2}\right)^7 + \left(\tfrac{1}{2}\right)^5 \left(1 + \tfrac{1}{8}\right) + \left(\tfrac{1}{2}\right)^3 = 0.185$$

Q3) A population at steady state in the infinite-alleles neutral model has a homozygosity equal to 10%. What value of θ can you infer? With random mating, how many equally frequent alleles would be required to produce the same level of homozygosity?

At steady state, $F = 1/(1 + \theta)$, so $\theta = (1 - F)/F = (1 - 0.1)/0.1 = 9$. With n equally frequent alleles, each has frequency $p = 1/n$, giving us homozygosity $\sum_{i=1}^{n} p_i^2 = n(1/n)^2 = 1/n$ (think of, for example, an n × n Punnett square: each square is equally probable with probability $(1/n)^2$, and n of the squares are homozygotes). Hence, n = 10.

Q3) An agronomist is studying grain yield in an outbred variety of maize. The variety has a mean yield of 200 bushels per acre. The table below shows estimates of the additive genetic variance $V_A$, the dominance variance $V_D$, the environmental variance $V_E$, and the total phenotypic variance $V_P$, each expressed either as its estimated value in [bu/acre]^2 or as its estimated value as a fraction of the total phenotypic variance $V_P$.

Variance component    Estimated value      Value as fraction of V_P
V_A                   300 [bu/acre]^2
V_D                                        14.3%
V_E                   300 [bu/acre]^2
V_P                                        100.0%

A. Complete the missing entries in the table. [You may round the value of each variance component to the nearest 100.]

Let $x = V_P$. Then the values given imply that $300 + 0.143x + 300 = x$, hence $x = 600/0.857 \approx 700$ [bu/acre]^2. With this as the value of $V_P$, the rest of the entries in the table are as follows. (The fact that the numbers in the percent column add to 99.9%, not 100.0%, is due to round-off error.)

Variance component    Estimated value      Value as fraction of V_P
V_A                   300 [bu/acre]^2      42.8%
V_D                   100 [bu/acre]^2      14.3%
V_E                   300 [bu/acre]^2      42.8%
V_P                   700 [bu/acre]^2      100.0%

B. What are the values of the narrow-sense heritability $h^2$ and of the broad-sense heritability $H^2$? [In estimating the broad-sense heritability, ignore any possible effects of interaction between different genes.]

Recall that $h^2 = V_A/V_P$ and that $H^2 = V_G/V_P$, where $V_G = V_A + V_D$ here:

$$h^2 = \frac{300}{700} = 0.428 \quad \text{and} \quad H^2 = \frac{300 + 100}{700} = 0.571$$

C. As noted, the mean yield of the variety is 200 bu/acre. If the top-yielding 20% of the plants are selected for breeding, and mated randomly among themselves, this is equivalent to a selection differential of S = 37 bu/acre. What is the expected mean yield of the progeny of the selected parents?

Use the breeder's equation, $R = h^2 S$, where $h^2 = 0.428$ and S = 37, hence R = 15.84 bu/acre. Because $R = M' - M$, where $M'$ = progeny mean and $M$ = population mean, we can infer $M' = 200 + 15.84 = 215.84$ bu/acre.

Q4) Consider the two following phylogenetic topologies. If you were to calculate Tajima's D for each of them, what do you expect your results would be, and how would you interpret that? What if you were to use a McDonald-Kreitman test? When would it be appropriate to apply one or the other?

The left tree has relatively deep/ancient coalescent times, whereas the tree on the right has relatively shallow/recent coalescent times.

We would expect Tajima's D to be negative in the former case. Most mutations that we sprinkle onto this tree will happen on private branches, and so will be rare. Rare alleles contribute less to per-site heterozygosity than do intermediate-frequency alleles (consider how many pairwise differences each sample has: AAAC gives 3, whereas AACC gives 4), so we will deflate $\theta_\Pi$ relative to $\theta_S$ for an overall negative statistic. This might suggest directional selection or population growth, for example. (Negative selection against deleterious alleles will reduce frequencies, and positive selection can also lead to a surplus of rare alleles as mutations appear on the homogeneous background produced by a selective sweep. To see why this topology is consistent with population growth, recall the relationship between population size and coalescent times predicted by the Kingman coalescent.)

We would expect Tajima's D to be positive in the latter case. Most mutations that we sprinkle onto this tree will be shared, and so will be common. By the above reasoning, this will inflate $\theta_\Pi$ relative to $\theta_S$ for an overall positive statistic. This might suggest balancing selection or admixture, for example. (Balancing selection will preserve polymorphisms at intermediate frequencies against the effects of drift, and admixture will have an overall averaging effect on allele frequencies between the two populations.)

Tajima's D is often applied broadly, as it assumes only the infinite sites model, and people are generally willing to make this assumption across a broad range of time scales. However, this model arguably begins to lose validity when our tree spans long evolutionary times (e.g. spanning speciation events), at which point multiple substitutions at a site become feasible; hence, not every mutation happens at a new site.

The McDonald-Kreitman test (MKT), on the other hand, assumes that we can partition our data into polymorphism and divergence, where the former refers to variation within a population of some species and the latter generally refers to variation between two species. Thus, we would generally only want to use this test if the root of the trees pictured above represents a speciation event. Moreover, we could only use this test if we are examining coding regions, because we need to be able to compare synonymous and non-synonymous changes, whereas Tajima's D can be applied to any genomic region.

If we applied the MKT to a topology like the one on the left, we would expect to find that polymorphism exceeds divergence (again, mutations sprinkled on the tree will create differences among individuals on the left branch).
This could be indicative of purifying selection between the species (keeps divergence low) or balancing selection within the population (keeps polymorphism high). If we applied the MKT to a topology like the one on the right, we would expect to find that divergence exceeds polymorphism. This could be indicative of positive selection between the species (accelerates the accumulation of differences between them).

Q5) A geneticist is studying the hierarchical population structure of a species of ground squirrel in an area where there is a confluence of two wide streams to form a river. To determine whether the watercourses are significant barriers to gene flow, the researcher estimates allele frequencies of a biallelic gene from large samples of individuals from three subpopulations in each region. A diagram of the area and the allele frequencies in the subpopulations are shown below.

A. Estimate $H_S$, $H_R$, and $H_T$ for the subpopulations, regional populations, and total area.

Recall that when we look at the different levels of structure (subpops, regions, total) we are changing the granularity at which we define our allele frequencies, which we then use to calculate heterozygosity according to Hardy-Weinberg. In the case of subpops or regions, we then take an average. (In the case of the total population, doing so would be trivial.)

$$H_S = \frac{1}{9}\sum_{i=1}^{9} 2(0.1i)(1 - 0.1i) = 0.36667$$

$$H_R = \frac{2(0.4)(0.6) + 2(0.5)(0.5) + 2(0.6)(0.4)}{3} = 0.4867$$

$$H_T = 2(0.5)(0.5) = 0.5$$

B. Estimate $F_{SR}$, $F_{RT}$, and $F_{ST}$ for these populations.

Recall that $F_{XY} = (H_Y - H_X)/H_Y$. In other words, $F_{XY}$ is the reduction in heterozygosity relative to Y, due to structure at the X level.

$F_{SR} = (0.4867 - 0.36667)/0.4867 = 0.247$
$F_{RT} = (0.5 - 0.4867)/0.5 = 0.027$
$F_{ST} = (0.5 - 0.36667)/0.5 = 0.267$

C. Based on these estimates, do the watercourses appear to be a significant impediment to gene flow? (Please answer with either "Yes" or "No.")

No, because $F_{RT}$ is smaller than $F_{SR}$. Thus, the reduction in heterozygosity due to population structure at the level of regions is not as great as the reduction in heterozygosity due to structure at the level of subpops within those regions. In other words, most of the population structure appears at the subpop level, rather than the regional (watercourse-defined) level.
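
The hierarchical calculation in parts A and B can be reproduced with a few lines of Python. This is a minimal sketch that takes the allele frequencies implied by the sums in the answer (subpopulations at 0.1 through 0.9, regions at 0.4, 0.5, and 0.6, total at 0.5); the function and variable names are illustrative choices.

```python
def heterozygosity(p):
    """Hardy-Weinberg heterozygosity 2p(1 - p) for a biallelic locus."""
    return 2 * p * (1 - p)


def mean_heterozygosity(freqs):
    """Average the Hardy-Weinberg heterozygosity over a list of allele frequencies."""
    return sum(heterozygosity(p) for p in freqs) / len(freqs)


def f_statistic(h_lower, h_upper):
    """F_XY = (H_Y - H_X) / H_Y, with X the lower level and Y the upper level."""
    return (h_upper - h_lower) / h_upper


# Allele frequencies as used in the answer's sums.
subpop_freqs = [0.1 * i for i in range(1, 10)]
region_freqs = [0.4, 0.5, 0.6]
total_freq = 0.5

h_s = mean_heterozygosity(subpop_freqs)
h_r = mean_heterozygosity(region_freqs)
h_t = heterozygosity(total_freq)

f_sr = f_statistic(h_s, h_r)
f_rt = f_statistic(h_r, h_t)
f_st = f_statistic(h_s, h_t)

print(f"H_S = {h_s:.4f}, H_R = {h_r:.4f}, H_T = {h_t:.4f}")
print(f"F_SR = {f_sr:.4f}, F_RT = {f_rt:.4f}, F_ST = {f_st:.4f}")
```

Comparing `f_rt` with `f_sr` in the output gives the basis for the answer to part C.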

Q6) The equation $d(t) = \frac{19}{20}\left(1 - e^{-40\alpha t}\right)$ gives the Jukes-Cantor-corrected proportion of amino acid differences between two aligned protein sequences from different species that diverged from a common ancestral species that existed t years ago. The rate of amino acid replacement in each lineage is given by 20α. Orthologous protein molecules were compared in two pairs of species. One species pair had diverged twice as long ago as the other species pair. In the more divergent species pair, the observed percentage of amino acid differences in the protein was 91.1%, whereas in the more recently diverged species pair the observed percentage of amino acid differences in the protein was 52.3%.

A. Are these data consistent with a molecular clock?

Letting t = τ equal the time of divergence of the less divergent species pair, the question states that the time of divergence of the more divergent species pair is t = 2τ. The equation for d(t) implies that $\ln[1 - 20d(t)/19] = -40\alpha t$. Hence $40\alpha\tau = -\ln[1 - (20)(0.523)/19] = 0.800$, or $20\alpha\tau = 0.400$, in the less divergent species pair. In the more divergent species pair, $40\alpha(2\tau) = -\ln[1 - (20)(0.911)/19] = 3.193$, or $20\alpha\tau = 0.798 \approx 0.80$. The rates of amino acid replacement (20α) are therefore about 0.40/τ and 0.80/τ in the two comparisons, which is not consistent with a molecular clock.

B. From these data, can one estimate the absolute rate of amino acid replacement in each lineage?

No, the percent differences depend on the product ατ, and since neither is known, neither can be specified. A faster rate would result in the same percent differences in a shorter time, and a slower rate would result in the same percent differences in a longer time.

C. From these data, can one estimate the relative rate of amino acid replacement in each lineage?

Yes, from these data we can say that the rate of amino acid replacement (20α) in the more divergent species pair, relative to that in the less divergent species pair, is greater by a factor of (0.80/τ)/(0.40/τ) = 2.

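
The inversion of d(t) used in part A can be sketched in Python as follows. The code solves $d = \frac{19}{20}(1 - e^{-40\alpha t})$ for the product 40αt and then expresses the per-lineage rate 20α in units of 1/τ for each species pair. The function and variable names are illustrative assumptions; only the observed differences (52.3% and 91.1%) and the 1:2 ratio of divergence times come from the problem.

```python
import math


def scaled_divergence(d):
    """Invert d = (19/20) * (1 - exp(-40*alpha*t)) to recover 40*alpha*t."""
    return -math.log(1 - 20 * d / 19)


# Observed proportions of amino acid differences from the problem.
d_recent = 0.523   # pair that diverged tau years ago
d_ancient = 0.911  # pair that diverged 2 * tau years ago

# 40*alpha*t for each pair, with t = tau and t = 2*tau respectively.
x_recent = scaled_divergence(d_recent)
x_ancient = scaled_divergence(d_ancient)

# Rate 20*alpha in units of 1/tau: divide 40*alpha*t by 2 * (t / tau).
rate_recent = x_recent / (2 * 1)
rate_ancient = x_ancient / (2 * 2)

print(f"20*alpha*tau (recent pair):  {rate_recent:.3f}")
print(f"20*alpha*tau (ancient pair): {rate_ancient:.3f}")
print(f"relative rate (ancient / recent): {rate_ancient / rate_recent:.2f}")
```

Under a strict molecular clock the two rates would be equal, so a ratio that differs clearly from 1 indicates rate variation between the comparisons.
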
Q7) You are examining a species of flower that is normally blue. Occasionally plants with red flowers are observed in wild populations. You determine that flower color is controlled at a single locus, with the red allele completely recessive to the blue allele. You conduct a survey in a field and find 3000 blue flowers and 500 red flowers. You then look at the mean number of seed pods produced by the flowers, and find that the blue plants on average produce 20 pods whereas the red plants on average produce 15. Assuming that the alleles are currently in HWE, but that selection is operating, predict the genotype frequencies after another generation. Assume that seed pod count is a perfect proxy for fitness (e.g. all seeds produced successfully take root, etc.). If the blue allele mutates to a red allele at the rate of $10^{-5}$ per generation, what will the equilibrium frequency of the red allele be at mutation-selection balance?

We first need to calculate the relative fitness of each genotype. We can let B represent the dominant (blue) allele and b represent the recessive (red) allele. In this case, our relative fitnesses are as follows:

$w_{BB} = 1$; $w_{Bb} = 1$; $w_{bb} = 15/20 = 0.75$

We can next calculate mean fitness by assuming HWE. The frequency of the bb genotype is $q^2 = 500/3500 = 0.1429$, suggesting that q = 0.378 and p = 0.622. Our mean fitness is

$$\bar{w} = p^2 w_{BB} + 2pq\,w_{Bb} + q^2 w_{bb} = (0.387)(1) + (0.4702)(1) + (0.1429)(0.75) = 0.964$$

We can now divide each term in the above sum by $\bar{w}$ to get the predictions for genotype frequencies after selection:

BB = 0.401; Bb = 0.488; bb = 0.111

To find the equilibrium frequency, we can use the formula $\hat{q} = \sqrt{\mu/s}$, which holds when the harmful allele is a complete recessive (h = 0). We now need to find s. Since $w_{bb} = 1 - s = 0.75$, we can infer that s = 0.25. Our equilibrium frequency is

$$\hat{q} = \sqrt{\frac{10^{-5}}{0.25}} = 0.006$$
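
The arithmetic in this answer can be reproduced with the short Python sketch below. It follows the same steps as the worked solution: estimate q from the recessive phenotype count under HWE, weight the genotype frequencies by relative fitness, normalize by mean fitness, and apply $\hat{q} = \sqrt{\mu/s}$ for a complete recessive. The variable names are illustrative; the counts, pod numbers, and mutation rate are those given in the question.

```python
import math

# Survey counts and pod numbers from the question.
n_blue, n_red = 3000, 500
pods_blue, pods_red = 20, 15
mu = 1e-5  # mutation rate from blue allele to red allele, per generation

# Under HWE the red (bb) phenotype frequency equals q^2.
q = math.sqrt(n_red / (n_blue + n_red))
p = 1 - q

# Relative fitnesses, scaled so that the blue phenotype has fitness 1.
fitness = {"BB": 1.0, "Bb": 1.0, "bb": pods_red / pods_blue}

# Genotype frequencies before selection (Hardy-Weinberg proportions).
before = {"BB": p * p, "Bb": 2 * p * q, "bb": q * q}

# Mean fitness and genotype frequencies after selection.
w_bar = sum(before[g] * fitness[g] for g in before)
after = {g: before[g] * fitness[g] / w_bar for g in before}

# Mutation-selection balance for a completely recessive deleterious allele.
s = 1 - fitness["bb"]
q_hat = math.sqrt(mu / s)

print(f"q = {q:.3f}, p = {p:.3f}, mean fitness = {w_bar:.3f}")
for genotype, freq in after.items():
    print(f"{genotype}: {freq:.3f}")
print(f"s = {s:.2f}, equilibrium q = {q_hat:.4f}")
```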