
Hierarchical structure, F-statistics, Wahlund effect, Inbreeding, Inbreeding coefficient

Genetic difference: the difference in allele frequencies among the subpopulations.

Hierarchical population structure

Reduction in heterozygosity: a reduction in the average proportion of heterozygous genotypes relative to that expected under random mating.

Paradoxical result: there is a deficiency of heterozygotes in the total population even though random mating takes place within each subpopulation. This is a consequence of the difference in allele frequency among the subpopulations. (If the allele frequencies in both populations were the same, what would happen?)

Isolation by distance (Wright)
The relation of population substructure to inbreeding can be understood by interpreting each subpopulation as a sort of extended family or set of interconnected pedigrees. Mating between organisms in the same subpopulation will often be a mating between relatives. The larger the subpopulation, and the more recently it has been isolated, the smaller this inbreeding effect.

Wright's F statistics
Quantification of the inbreeding effect of subpopulation structure.
Fixation index: the reduction in heterozygosity expected with random mating at any level of a population hierarchy.

Hierarchical F-statistics
To quantify the inbreeding effect of population substructure, Wright defined the fixation index. The index equals the reduction in heterozygosity expected with random mating at any one level of a population hierarchy relative to another. The fixation index is a useful index of genetic differentiation. The average heterozygosity of the subpopulations S is denoted $H_S$; within a subpopulation it is $2pq$, where $p$ and $q$ are the estimated frequencies of the alleles. Writing $H_R$ and $H_T$ for the corresponding heterozygosities at the higher levels R and T of the hierarchy, $F_{SR}$ is the fixation index of level S relative to level R:

$$F_{SR} = \frac{H_R - H_S}{H_R}$$

Defining $F_{RT}$ and $F_{ST}$ in the same way, $1 - F_{SR} = H_S/H_R$ and $1 - F_{RT} = H_R/H_T$, so we easily find the relation

$$(1 - F_{SR})(1 - F_{RT}) = 1 - F_{ST}.$$
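The product relation between the hierarchical F-statistics can be checked numerically. The sketch below is only an illustration: the three heterozygosity values are hypothetical numbers chosen for the example, and the helper name `fixation_index` is not from the notes.

```python
def fixation_index(h_lower, h_upper):
    """Fixation index of a lower level relative to an upper level:
    (H_upper - H_lower) / H_upper."""
    return (h_upper - h_lower) / h_upper

# Hypothetical heterozygosities at the three levels of the hierarchy
# (S = subpopulations, R = intermediate level, T = total population).
H_S, H_R, H_T = 0.30, 0.40, 0.50

F_SR = fixation_index(H_S, H_R)
F_RT = fixation_index(H_R, H_T)
F_ST = fixation_index(H_S, H_T)

# Both sides of (1 - F_SR)(1 - F_RT) = 1 - F_ST
lhs = (1 - F_SR) * (1 - F_RT)
rhs = 1 - F_ST
print(F_SR, F_RT, F_ST)
print(lhs, rhs)
```

Because each factor $1 - F$ is just a ratio of heterozygosities, the intermediate level cancels in the product, whatever values are chosen.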
Isolate breaking (Wahlund principle)

1 Wahlund effect

When there is population subdivision, there is almost inevitably some genetic differentiation among the subpopulations. Here genetic differentiation means the acquisition of allele frequencies that differ among the subpopulations. One of the most important consequences of population substructure is a reduction in the average proportion of heterozygous genotypes relative to that expected under random mating. Or equivalently, a subdivided set of populations has a higher proportion of homozygotes than an equivalent fused population: this is the Wahlund effect.

In practice, a species may consist of a number of separate populations; that is, a species may have population subdivision. The concepts of hierarchical population structure and the various levels of heterozygosity were developed by Sewall Wright to quantify genetic differences among subgroups at the various levels. To quantify the inbreeding effect of population substructure, he defined what has come to be called the fixation index. The index equals the reduction in heterozygosity expected with random mating at any one level of a population hierarchy relative to another.

Let us examine what effect population subdivision has on the Hardy-Weinberg principle.

                Frequency of A    Frequency of a
Population 1         0.3               0.7
Population 2         0.7               0.3

Genotype             AA                  Aa                       aa
Population 1    (0.3)^2 = 0.09     2(0.3)(0.7) = 0.42      (0.7)^2 = 0.49
Population 2    (0.7)^2 = 0.49     2(0.7)(0.3) = 0.42      (0.3)^2 = 0.09
Average         0.58/2 = 0.29      0.84/2 = 0.42           0.58/2 = 0.29

Now suppose that the two populations are fused together. The gene frequencies of A and a in the combined population are (0.3 + 0.7)/2 = 0.5, and the Hardy-Weinberg genotype frequencies are as follows:

Genotype             AA      Aa      aa
Frequency           0.25    0.5     0.25

Let's generalize the above example. The Wahlund effect for an allele a in two subpopulations is as follows.
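The numerical example above can be reproduced in a few lines. This is a minimal sketch that assumes two equally sized subpopulations, each in Hardy-Weinberg proportions; the function and variable names are illustrative only.

```python
def hardy_weinberg(p):
    """Return (AA, Aa, aa) frequencies for allele frequency p of A."""
    q = 1.0 - p
    return (p * p, 2 * p * q, q * q)

# Frequency of allele A in population 1 and population 2
p_subpops = [0.3, 0.7]

# Genotype frequencies within each subpopulation, then their plain average
per_pop = [hardy_weinberg(p) for p in p_subpops]
average_separate = tuple(sum(col) / len(per_pop) for col in zip(*per_pop))

# Genotype frequencies after fusion and one round of random mating
p_fused = sum(p_subpops) / len(p_subpops)
fused = hardy_weinberg(p_fused)

# Heterozygote deficiency of the subdivided set relative to the fused population
het_deficit = fused[1] - average_separate[1]

print(average_separate)
print(fused)
print(het_deficit)
```

The averages come out at 0.29, 0.42 and 0.29 against 0.25, 0.5 and 0.25 after fusion, so the subdivided set is short of heterozygotes by 0.08.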
Population 1 has allele frequency $q_1$ and aa genotype frequency $q_1^2$; population 2 has allele frequency $q_2$ and aa genotype frequency $q_2^2$. Then we have the following table.

Genotype              AA                              Aa                                 aa
Population 1          $(1-q_1)^2$                     $2(1-q_1)q_1$                      $q_1^2$
Population 2          $(1-q_2)^2$                     $2(1-q_2)q_2$                      $q_2^2$
Average (separate)    $[(1-q_1)^2 + (1-q_2)^2]/2$     $[2(1-q_1)q_1 + 2(1-q_2)q_2]/2$    $[q_1^2 + q_2^2]/2$
Average (fused)       $[((1-q_1) + (1-q_2))/2]^2$     $2[(q_1+q_2)/2][(2-q_1-q_2)/2]$    $[(q_1+q_2)/2]^2$

Writing $Q$ for the average frequency of aa and $q^* = (q_1 + q_2)/2$, fusion reduces the average frequency of aa by

$$Q_{\text{separate}} - Q_{\text{fused}} = \frac{q_1^2 + q_2^2}{2} - \left(\frac{q_1 + q_2}{2}\right)^2 = \frac{q_1^2 + q_2^2}{2} - q^{*2} = \frac{(q_1 - q^*)^2}{2} + \frac{(q_2 - q^*)^2}{2} = \sigma_q^2.$$

Here $\sigma_q^2$ is the variance in allele frequency among the original subpopulations and is always nonnegative. Similarly, for the AA homozygotes, $P_{\text{separate}} - P_{\text{fused}} = \sigma_p^2$ (with $p = 1 - q$). Since $\sigma_q^2 = \sigma_p^2$, we write either of them as $\sigma^2$; the total reduction in homozygosity from the Wahlund effect upon population fusion is then $2\sigma^2$.

Wahlund effect and the fixation index

Reduction in total homozygosity $= 2\sigma^2$. On the other hand, the reduction in total homozygosity with population fusion must also equal the increase in heterozygosity. Hence

$$F_{ST} = \frac{H_T - H_S}{H_T} = \frac{2\sigma^2}{H_T}.$$

Using the average allele frequencies, $H_T = 2p^*q^*$, so $F_{ST} = \sigma^2/(p^*q^*)$, where $p^*$ and $q^*$ are the allele frequencies in the combined population. The average genotype frequencies across the subpopulations can then be written in terms of $F_{ST}$, which shows how the genotype frequencies change when fusion happens:

AA: $p^{*2} + p^*q^*F_{ST}$
Aa: $2p^*q^* - 2p^*q^*F_{ST}$
aa: $q^{*2} + p^*q^*F_{ST}$

From these expressions it is clear that the value of $F_{ST}$ determines the degree of departure from Hardy-Weinberg equilibrium. If $F_{ST} = 0$, the second term in each expression vanishes and the genotype frequencies reduce to the Hardy-Weinberg proportions. $F_{ST} = 0$ means that there is no variation in allele frequency among the subpopulations for the gene in question. The opposite case is $F_{ST} = 1$, which happens when two subpopulations are fixed for alternative alleles. In this case, the average allele frequencies are 1/2 for each allele and the average genotype frequencies of AA, Aa, and aa are 1/2, 0, and 1/2. The large, fused population contains fewer homozygotes than are present, on average, in the set of subdivided populations. This is a mathematical result.
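The link between the variance in allele frequency and $F_{ST}$ can be illustrated numerically. The sketch below assumes equally sized subpopulations, each in Hardy-Weinberg proportions; `wahlund_summary` is a hypothetical helper name, not something defined in the notes.

```python
def wahlund_summary(q_subpops):
    """Summarise the Wahlund effect for allele a across equal-sized subpopulations."""
    n = len(q_subpops)
    q_bar = sum(q_subpops) / n                     # q* in the text
    p_bar = 1.0 - q_bar                            # p* in the text
    variance = sum((q - q_bar) ** 2 for q in q_subpops) / n
    h_s = sum(2 * q * (1 - q) for q in q_subpops) / n   # average heterozygosity
    h_t = 2 * p_bar * q_bar                        # heterozygosity after fusion
    f_st = (h_t - h_s) / h_t
    return {
        "variance": variance,
        "H_S": h_s,
        "H_T": h_t,
        "F_ST": f_st,
        "F_ST_from_variance": variance / (p_bar * q_bar),
        "aa_separate": q_bar ** 2 + p_bar * q_bar * f_st,
    }

example = wahlund_summary([0.7, 0.3])   # the two populations of the worked example
fixed = wahlund_summary([1.0, 0.0])     # subpopulations fixed for alternative alleles
print(example)
print(fixed["F_ST"])
```

For the worked example the variance is 0.04 and both routes give $F_{ST} = 0.16$; for subpopulations fixed for alternative alleles $F_{ST}$ is 1, matching the limiting case described above.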
The presence of variation among the subgroups leads to an increase in homozygosity and a loss of heterozygosity compared with the random-mating frequencies that would result if the entire population were combined into one population. The Wahlund effect depends on the existence of geographical variation among the subpopulations.

Inbreeding: the pattern of mating taking place between relatives

The relationship between population substructure and inbreeding
The main effect of inbreeding is to produce organisms with decreased heterozygosity. There is a deficiency of heterozygous genotypes and an excess of homozygous genotypes. However, the allele frequency of A remains constant.

Inbreeding coefficient F
F measures the fractional reduction in heterozygosity of an inbred subpopulation relative to a random-mating subpopulation with the same allele frequencies. In a subpopulation of organisms with inbreeding coefficient F, the genotype frequencies are expected in the proportions:

AA: $p^2(1 - F) + pF = p^2 + pqF$
Aa: $2pq(1 - F) = 2pq - 2pqF$
aa: $q^2(1 - F) + qF = q^2 + pqF$

If a gene has multiple alleles $A_1, A_2, \ldots, A_n$ at respective frequencies $p_1, p_2, \ldots, p_n$ (with $p_1 + p_2 + \cdots + p_n = 1$), then in a population with inbreeding coefficient F, the frequencies of $A_iA_i$ homozygotes and $A_iA_j$ heterozygotes are:

$A_iA_i$: $p_i^2(1 - F) + p_iF$
$A_iA_j$: $2p_ip_j(1 - F)$

Relation between the inbreeding coefficient and the F-statistics
There is an intimate relation between the inbreeding coefficient F and the hierarchical F statistics. Each of the hierarchical F statistics is also a type of inbreeding coefficient that measures the reduction in heterozygosity at any level of a population hierarchy, relative to a higher level. The connection between the inbreeding coefficient and the F statistics is indicated by the formal similarity between the genotype frequencies above and those obtained for $F_{ST}$.
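The two-allele genotype proportions under inbreeding can be sketched as a small function. The values of p and F used here are arbitrary illustrations, and the function name is hypothetical.

```python
def inbred_genotype_frequencies(p, F):
    """Expected genotype frequencies for allele frequency p of A
    and inbreeding coefficient F (two alleles)."""
    q = 1.0 - p
    return {
        "AA": p * p + p * q * F,
        "Aa": 2 * p * q * (1 - F),
        "aa": q * q + p * q * F,
    }

p, F = 0.6, 0.25                      # illustrative values only
freqs = inbred_genotype_frequencies(p, F)

# Allele frequency recovered from the genotypes: AA + Aa/2
p_after = freqs["AA"] + freqs["Aa"] / 2

# Fractional reduction in heterozygosity relative to random mating
het_reduction = 1 - freqs["Aa"] / (2 * p * (1 - p))

print(freqs)
print(p_after, het_reduction)
```

The check on `p_after` illustrates the point made above that inbreeding redistributes genotypes without changing the allele frequency, and `het_reduction` recovers F.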
If $F_{IS}$ is the inbreeding coefficient of a group of inbred organisms relative to the subpopulation to which they belong, the value of $F_{IS}$ is the reduction in heterozygosity of the inbred organisms. Writing $H_I$ for the heterozygosity of the inbred organisms, we have

$$F_{IS} = \frac{H_S - H_I}{H_S}, \qquad F_{IT} = \frac{H_T - H_I}{H_T}, \qquad (1 - F_{IS})(1 - F_{ST}) = 1 - F_{IT}.$$

The inbreeding coefficient as a probability
The inbreeding coefficient $F_{IS}$ has an interpretation in terms of probability in addition to its interpretation in terms of heterozygosity. The probability interpretation of the inbreeding coefficient is that F is the probability that the two alleles of a gene in an inbred organism are identical by descent (autozygous). F measures the probability of autozygosity relative to some ancestral subpopulation.

Summary
Species are usually divided into subpopulations. Matings between organisms within the same subpopulation are more likely than matings between organisms in different subpopulations. Geographical subdivision of a population is called population substructure. When the allele frequencies differ, the average heterozygosity among the subpopulations is smaller than that expected with random mating in the total population. The F statistics are a quantitative measure of the reduction in heterozygosity at various levels in a population hierarchy.

When subpopulations undergo fusion and random mating, the deficiency of heterozygosity is eliminated. This effect of population fusion is called the Wahlund principle. The principle implies that population fusion and random mating will cause a reduction in the frequency of any homozygous genotype by an amount equal to the variance in allele frequency among the original subpopulations.

Inbreeding means mating between relatives. The most important effect of inbreeding is that replicas of a single allele in a common ancestor may be transmitted down both sides of the pedigree and come together in fertilization to produce the inbred organism. In such a case, the inbred organism is called autozygous, and the alleles are identical by descent. Otherwise the inbred organism is allozygous. The inbreeding coefficient F is the probability that the two homologous genes in an inbred organism are autozygous.

Similar examples

1 Simpson's paradox

Let's consider first the sex bias in admissions. Consider the following tables on the number of admissions to an MBA program and a Law program, cross-classified by sex. For each table compute the % admitted for each gender (row percentage).
              Business School               Law School
              Admit        Deny             Admit        Deny
Male          480 = 80%    120 = 20%        10 = 10%     90 = 90%
Female        180 = 90%    20 = 10%         100 = 33%    200 = 67%

              Business and Law Schools
              Admit        Deny
Male          490 = 70%    210 = 30%
Female        280 = 56%    220 = 44%

Within each school females are admitted at a higher rate than males, yet in the combined table females seem to be admitted at a lower rate than males! What has happened? This is known as Simpson's paradox, which refers to the reversal of results when several groups are combined together to form a single group. This is caused by the different percentages in admission in the two tables; they really shouldn't be combined. It is not caused by different sample sizes.

Our second example is about twenty-year survival and smoking status. In 1972-1974 a one-in-six survey of the electoral roll, largely concerned with thyroid disease and heart disease, was carried out in Whickham, a mixed urban and rural district near Newcastle upon Tyne, in the UK. Twenty years later, a follow-up study was conducted. Here are the results for two age groups of females. Each table shows the twenty-year survival status for smokers and non-smokers.

              Age 55-64                     Age 65-74
              Dead         Alive            Dead         Alive
Smokers       51 = 44%     64 = 56%         29 = 80%     7 = 20%
Non-smokers   40 = 33%     81 = 67%         101 = 78%    28 = 22%

It appears that smokers die off more than non-smokers in each table. And what happens when the tables are combined?

              Ages 55-74 combined
              Dead         Alive
Smokers       80 = 53%     71 = 47%
Non-smokers   141 = 56%    109 = 44%

Now smokers seem to have a lower death rate! What has happened? Most of the smokers have died off before reaching the older age classes, and so the higher number of deaths (in absolute numbers) for the non-smokers in the older age classes has obscured the result. Simpson's paradox is an example of the dangers of lurking variables.

2 Stein paradox

Here are the results of a trial of two drugs A and B on 2000 patients:

Drug              A        B
Total treated     900      1100
Recoveries        405      297
Recovery rate     45%      27%

Clearly drug A is considerably more effective than drug B.
However, consider the division of the total treatment group into two classes based on some characteristic X (for example, X = male, or X = born on an odd date, or X = mother's first name starts with a vowel, etc.). Suppose the numbers come out like this:

                  X                    not X
Drug              A        B           A        B
Total treated     100      900         800      200
Recoveries        5        101         400      196
Recovery rate     5%       11%         50%      98%
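The reversal in the two drug tables can be verified directly from the counts. This is a small sketch using only the numbers given above; the dictionary layout and variable names are illustrative choices.

```python
# (recoveries, total treated) for each drug within each class
trial = {
    "X":     {"A": (5, 100),   "B": (101, 900)},
    "not X": {"A": (400, 800), "B": (196, 200)},
}

# Recovery rate of each drug within each class
per_group = {
    group: {drug: rec / treated for drug, (rec, treated) in drugs.items()}
    for group, drugs in trial.items()
}

# Recovery rate of each drug after pooling the two classes
pooled = {}
for drug in ("A", "B"):
    recovered = sum(trial[group][drug][0] for group in trial)
    treated = sum(trial[group][drug][1] for group in trial)
    pooled[drug] = recovered / treated

for group, rates in per_group.items():
    print(group, rates)
print("pooled", pooled)
```

Drug B has the higher rate inside both classes while drug A has the higher pooled rate, because most of drug B's patients fall in the class with poor recovery overall.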

Now it appears that for both X and not X, drug B is more effective than drug A. Yet, X union not X is the whole treatment group! We have reached the opposite conclusion from the same data, merely by considering an apparently irrelevant characteristic. It looks suspiciously like Simpson's paradox: what's good for the population can be bad for every subgroup.

Personally, I fail to see the paradox. What I do see is that:
1) The characteristic X is not apparently irrelevant, but appears to be strongly negatively correlated with recovery rate, regardless of the treatment.
2) With respect to the characteristic X, the trial groups A and B were not randomly chosen. Since X occurs in 50% of the population, the probability is extremely small that with such a large sample size, 90% of those with X would end up in group B.

The moral of the story:
1) Divide the classes randomly to make your conclusions valid.
2) Use a large enough sample to make such statistical aberrations improbable.

This apparent refutation unfortunately fails badly. Even more unfortunately, it is often the case that the relevant X is not even known, or if it is, its value is not available. There may even be many crucial X's.

3 Gibbs paradox

Let's consider the problem of calculating the entropy of a gas of N identical particles in a box of volume V. The entropy is defined as $k\ln(G)$, where G is a measure of the number of accessible states of the system. We specify the state of a particle classically by giving its position and momentum. Let's say the box is at some fixed temperature, and this results in each particle having a certain average momentum and energy. The momentum will not always be exactly this average value but will vary according to some probability distribution. The width of this distribution is part of the uncertainty of which entropy is a measure. Let's quantify the width of the distribution by the standard deviation of the momentum, $\Delta p$.
It is also reasonable to assume that the value of $\Delta p$ will depend only on the temperature and not, say, on the volume or geometry of the box, and neither should it depend on the number of particles. Now consider that in specifying the system exactly we would have to give three coordinates for position and three coordinates for momentum for each of the N particles. We thus have 3N + 3N variables, which form a 6N-dimensional phase space of possible system configurations. The 3N momentum coordinates each have a standard deviation of $\Delta p$, and there being 3N of them, the hyper-volume in phase space in which we expect to find the system will be proportional to $(\Delta p)^{3N}$. In addition there are the N triples of position coordinates, each triple residing in the volume V of the box, so the overall region in phase space will have measure $G \sim V^N (\Delta p)^{3N}$. For convenience let's assume a distance unit $\Delta x = V^{1/3}$ so that we may write $G \sim (\Delta x\,\Delta p)^{3N}$.

Part of the reason for using distance units is that there is a fundamental unit called action, which is in units of energy × time = distance × momentum. We then see that G will, as it now stands, be in units of (action)$^{3N}$. There is a problem here, however, because we wish to take the natural logarithm of G to calculate the entropy S. We should only apply abstract mathematical functions to unitless quantities. Looking at it another way, we don't want our entropy to depend on our choice of distance and momentum units. Consider that $S = k\ln(G)$ is equal to zero when $G = 1$, so we need a minimal unit for G which will define the minimum entropy state. The simplest thing to do is assume an arbitrary minimal unit of action, H. Then we can define G to be the unitless quantity $G = (\Delta x\,\Delta p/H)^{3N}$. And then we have a formula for entropy:

$$S = k\ln(G) = 3Nk\ln(\Delta x\,\Delta p/H).$$

We can also write this in terms of the box volume V: $S = Nk\ln(VW)$, where $W = (\Delta p/H)^3$.
Aside from calculating how W depends on temperature and the like, this is the standard (naive) calculation of the classical entropy. There is, however, a slight problem, and that is the manner in which this formula for S depends on volume. Suppose you have two boxes with identical volume V, each containing the same number N of gas particles. You then have a total entropy of $S = S_1 + S_2 = 2Nk\ln(VW)$. Now suppose the boxes are placed side by side and the partition between them is removed. We now have a new box with volume 2V and with 2N particles, so that the entropy of this system is $S' = 2Nk\ln(2VW)$. The entropy has changed, and in fact has increased:

$$\Delta S = S' - S = 2Nk\ln(2VW) - 2Nk\ln(VW) = 2Nk\ln(2).$$

If each particle has a label so that they are distinguishable, then this change reflects the entropy of mixing. After opening the partition the particles mix, and afterwards, by replacing the partition, we no longer have the same particles on the same sides of the partition. This, however, is not realistic. If the two boxes contained identical particles, then removing and replacing the partition should be a reversible process. Hence the change in entropy should be zero. This is Gibbs paradox.

Put another way, entropy should be an extensive quantity. One should be able to partition the volume arbitrarily, and the total entropy should be the sum of the entropies over the partition. This would then imply that the volume dependence should occur only in the form of a density. Instead of V we should see V/N, at least for large N.

Resolving Gibbs paradox. Particle statistics.
The problem in Gibbs paradox arises due to the indistinguishability of the particles. What has happened is that we have over-counted the possible configurations of the system. Consider the configuration in which particle 1 is at position 1 in phase space, particle 2 is at position 2, etc. This configuration is not just equivalent but exactly equal to the configuration with particle 28 at position 1, particle 13 at position 2, particle 19332 at position 3, etc. Given N particles we have over-counted by the number of permutations of these N particles, which is to say a factor of $N! = 1 \cdot 2 \cdot 3 \cdots (N-1) \cdot N$. The resolution is to replace $G = (VW)^N$ with $G' = G/N!$. Hence

$$S = Nk\ln(VW) - k\ln(N!).$$

There is an approximation formula for N!, valid for large N, called Stirling's formula:

$$N! \approx \sqrt{2\pi}\, N^{N+1/2} e^{-N}, \quad\text{so}\quad \ln(N!) \approx (N + 1/2)\ln(N) - N + \tfrac{1}{2}\ln(2\pi).$$

As N is very large this will be approximately $N\ln(N)$, as the other terms do not grow as fast with increasing N. Hence we get

$$S = Nk\ln(VW) - Nk\ln(N) = Nk\ln(VW/N).$$

Now the entropy depends (in the large particle number limit) on the quantity V/N, which is one over the particle density. We now have entropy as an extensive quantity. Double the particle number and the volume at the same time and we get:

$$S = 2Nk\ln(2VW/2N) = 2Nk\ln(VW/N).$$

Sure enough, the entropy has doubled. There is then no difference between two boxes with volumes V and particle counts N and a single box with volume 2V and particle count 2N, at least in the large-scale limit.
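The effect of the $1/N!$ correction can be seen numerically. The sketch below works in units where $k = 1$ and $W = 1$, uses arbitrary values for N and V, and evaluates $\ln(N!)$ exactly with `math.lgamma(N + 1)` rather than with the leading Stirling term; all of these choices are assumptions made for illustration.

```python
import math

def entropy_naive(N, V, W=1.0, k=1.0):
    """Naive classical entropy S = N k ln(V W)."""
    return N * k * math.log(V * W)

def entropy_corrected(N, V, W=1.0, k=1.0):
    """Entropy with the 1/N! correction: S = k [N ln(V W) - ln(N!)]."""
    return k * (N * math.log(V * W) - math.lgamma(N + 1))

N, V = 10**6, 2.0e6   # illustrative particle number and volume

# Entropy change on merging two identical boxes (N, V) into one box (2N, 2V)
naive_delta = entropy_naive(2 * N, 2 * V) - 2 * entropy_naive(N, V)
corrected_delta = entropy_corrected(2 * N, 2 * V) - 2 * entropy_corrected(N, V)

print(naive_delta, 2 * N * math.log(2))
print(corrected_delta)
```

The naive formula gives the spurious mixing entropy $2Nk\ln 2$, while the corrected formula leaves only a small residual that grows logarithmically with N and is negligible next to terms of order N.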
5-7

4 Two-locus and multi-locus population genetics

7.1 Mimicry in Papilio is controlled by more than one genetic locus
7.2 The genotypes at different loci in Papilio memnon are coadapted
7.3 Mimicry in Heliconius is controlled by more than one gene, but not by a supergene
7.4 Two-locus genetics is concerned with haplotype frequencies
7.5 The frequencies of haplotypes may or may not be in linkage equilibrium
7.6 The human HLA genes are a multi-locus gene system
7.7 Linkage disequilibrium can exist for several reasons
7.8 Two-locus models of natural selection can be built
7.9 Hitchhiking occurs in two-locus selection models
7.10 Linkage disequilibrium can be advantageous, neutral, or disadvantageous
7.11 Why does the genome not congeal?
7.12 Wright invented the influential concept of an adaptive topography
7.13 The shifting balance theory of evolution

(two-locus and multi-locus population)
Population genetics for two or more loci is concerned with changes in the frequencies of haplotypes, which are the multi-locus equivalent of alleles. Recombination tends, in the absence of other factors, to make the alleles of different loci appear in random proportions in haplotypes. An allele A1 at one locus will then be found with alleles B1 and B2 at another locus in the same proportions as B1 and B2 are found in the population as a whole. This condition is called linkage equilibrium. A deviation from the random combinatorial proportions of haplotypes is called linkage disequilibrium. The theory of population genetics for a single locus works well for populations in linkage equilibrium. Linkage disequilibrium can arise because of non-random mating, random sampling, and natural selection. For selection to generate linkage disequilibrium, the fitness interactions must be epistatic. That is, the effect on fitness of a genotype (such as A1/A2) must vary according to the genotype with which it is associated at other loci.
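The idea of a deviation from random haplotype proportions can be made concrete with the usual disequilibrium coefficient, D = freq(A1B1) - freq(A1) × freq(B1). This coefficient is not defined in the summary itself, and the haplotype frequencies below are invented for illustration.

```python
def linkage_disequilibrium(haplotypes):
    """Return D = freq(A1B1) - freq(A1) * freq(B1) for a dict of
    haplotype frequencies keyed 'A1B1', 'A1B2', 'A2B1', 'A2B2'."""
    p_A1 = haplotypes["A1B1"] + haplotypes["A1B2"]
    p_B1 = haplotypes["A1B1"] + haplotypes["A2B1"]
    return haplotypes["A1B1"] - p_A1 * p_B1

# Hypothetical population in linkage equilibrium:
# freq(A1) = 0.6, freq(B1) = 0.3, haplotypes are products of allele frequencies
in_equilibrium = {"A1B1": 0.18, "A1B2": 0.42, "A2B1": 0.12, "A2B2": 0.28}

# Hypothetical population where A1 is found with B1 more often than expected
in_disequilibrium = {"A1B1": 0.40, "A1B2": 0.10, "A2B1": 0.10, "A2B2": 0.40}

D_eq = linkage_disequilibrium(in_equilibrium)
D_dis = linkage_disequilibrium(in_disequilibrium)
print(D_eq, D_dis)
```

D is zero when each haplotype occurs at the product of its allele frequencies and nonzero otherwise; in the second population it is 0.15.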
Pairs of alleles at different loci that cooperate in their effects on fitness are called coadapted. Selection acts to reduce the amount of recombination between coadapted genotypes. When selection works on one locus, it will influence gene frequencies at linked loci. The effect is called hitchhiking. Recombination is selectively disadvantageous in so far as it breaks down favorable gene combinations. The mean fitness of a population can be drawn graphically for two loci; the graph is called a fitness surface or an adaptive topography. Wright suggested that real adaptive topographies will have many separate hills with valleys between them. Natural selection enables populations to climb the hills in the adaptive topography, but not to cross valleys. A population could become trapped at a local optimum. Random drift could supplement natural selection by enabling populations to explore the valley bottoms of adaptive topographies. It is questionable whether real adaptive topographies have multiple peaks and valleys. They might have a single peak, with a continuous hill leading up to it. Natural selection could then take the population to the peak without any random drift. Two-locus population genetics uses a number of concepts not found in single-locus genetics. The most important of these ideas are haplotype frequency, recombination, linkage disequilibrium, epistatic fitness interaction, hitchhiking, and multiple-peaked fitness surfaces.

Quantitative genetics

8.1 Climatic changes have driven the evolution of beak size in one of Darwin's finches
8.2 Quantitative genetics is concerned with characters controlled by large numbers of genes
8.3 Variation is first divided into genetic and environmental effects
8.4 The variance of a character is divided into genetic and environmental effects
8.5 Relatives have similar genotypes, producing the correlation between relatives
8.6 Heritability is the proportion of phenotypic variance that is additive
8.7 A character's heritability determines its response to artificial selection
8.8 The relation between genotype and phenotype may be nonlinear, producing remarkable responses to selection
8.9 Selection reduces the genetic variability of a character
8.10 Characters in natural populations subject to stabilizing selection show genetic variation
8.11 Selection-mutation balance is one possible explanation, but there are two models for it
8.12 The rate of slightly deleterious mutations can be observed in experiments in which selection against them is minimized

(quantitative genetics)
Quantitative genetics, which is concerned with characters controlled by many genes, considers the change in phenotypic and genotypic frequency distributions between generations, rather than following the fate of individual genes. The phenotypic variance of a character in a population can be divided into components due to genetic differences and to environmental differences between individuals. Some of the genetic effects on an individual's phenotype are inherited by its offspring; others are not. The former are called additive genetic effects; the latter are due to such factors as dominance and epistatic interaction between genes. The heritability of a character is the proportion of its total phenotypic variance in a population that is additive. The heritability of a character determines its evolutionary response to selection. The additive genetic variance can be measured by the correlation between relatives, or by artificial selection experiments.
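How heritability governs the response to selection can be sketched with the standard breeder's equation, R = h² × S, where S is the selection differential. The equation is not written out in the summary, and the variance components and selection differential below are hypothetical numbers.

```python
def narrow_sense_heritability(v_additive, v_phenotypic):
    """Heritability as the additive fraction of the phenotypic variance."""
    return v_additive / v_phenotypic

def response_to_selection(h2, selection_differential):
    """Breeder's equation: expected shift in the offspring mean, R = h2 * S."""
    return h2 * selection_differential

# Hypothetical variance components for some measured character
V_A = 2.0          # additive genetic variance
V_P = 5.0          # total phenotypic variance (genetic + environmental)

h2 = narrow_sense_heritability(V_A, V_P)

# Hypothetical selection differential: selected parents exceed the
# population mean by 1.5 units of the character
S = 1.5
R = response_to_selection(h2, S)
print(h2, R)
```

With these numbers the heritability is 0.4, so only 40% of the parents' advantage (0.6 units) is expected to appear in the offspring mean.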
The response of a population to artificial selection depends on the amount of additive genetic variability and on the relation between genotype and phenotype. If the relation is non-linear, strange bimodal responses can arise. Stabilizing selection acts to reduce the amount of genetic variability in a population. However, polygenic characters show non-trivial values for heritability. The level of genetic variation may be a balance between an input of new deleterious mutations and their removal by selection. Experiments in which the effect of selection is held to a minimum suggest the power of deleterious mutation. The rate of deleterious mutation may not be high enough to explain the observed levels of genetic variation, and some other factors, such as a form of selection that maintains variation, may be needed to explain the observations.

Genome evolution

9.1 Non-Mendelian processes must be added to classical population genetics to explain the evolution of the whole genome
9.2 Genes are arranged in gene clusters
9.3 Gene clusters probably originated by gene duplication
9.4 The genes in a gene family often evolve in concert
9.5 Not all DNA codes for genes
9.6 Repetitive DNA other than in gene clusters may be selfish DNA
9.7 Minisatellites are sequences of short repeats, found scattered through the genome
9.8 Scattered repeats may originate by transposition
9.9 Selfish DNA may explain the C-factor paradox

(genome evolution)
Genes are usually arranged in clusters (called gene clusters) of related genes on the chromosome. The cluster may consist of tandem repeats, like the ribosomal RNA genes, or a linked group of related genes, like the globin genes. Much of the non-coding DNA consists of repeated sequences. Several different kinds of repetitive DNA have been recognized. Gene families originate by gene duplication, which itself takes place by unequal crossing over or polyploidy. The different genes in a gene family often show concerted evolution.
That is, the genes at separate loci within a species are much more similar than the homologous copies of a gene in different species. Concerted evolution among a large number of genes is practically impossible to explain if mutations arise independently at each locus; it requires some mechanism for concerted mutations at all loci. Concerted mutation can occur by unequal crossing over or gene conversion. Concerted evolution happens when the more homogeneous variants produced by these processes are fixed by selection or drift. Gene conversion may be biased in favor of some sequences rather than others; the favored sequences would then proliferate by a form of lateral mutation pressure. Gene conversion can occur between genes of similar sequence. When two genes have diverged more than a certain amount, concerted evolution will become unlikely.

Much of the genome does not consist of coding genes. It consists of various classes of repetitive DNA. Repetitive DNA is classified as being either tandem or scattered repeat sequences. Tandem repeats, in turn, are distinguished by the size and number of the repeats, as being microsatellites, minisatellites, or satellite DNA. The large quantities of repetitive, non-coding DNA may be selfish DNA. Such DNA may be non-transcribed, have no function for the organism, and be replicated from generation to generation like a passive parasite. It would change in frequency in the population mainly by random drift. Tandem repeats probably originate by unequal crossing over and slippage (especially for short repeats). In contrast, scattered repeats probably originate by transposition. Minisatellites are sequences consisting of a variable number of repeats of a characteristic short sequence; they are probably examples of selfish DNA. They have mutation rates as high as $10^{-2}$ per generation, as minisatellites with new numbers of repeats arise (probably by unequal crossing over) at high frequency.
Selfish DNA may explain the C-factor paradox, that is, the paradox that many eukaryotic organisms contain more DNA than appears to be necessary.