AEC 550 Conservation Genetics
Lecture #2: Probability, Random Mating, HW Expectations, & Genetic Diversity

Today:
- Review probability in population genetics
- Review basic statistics
- Population definition
- Random mating and non-overlapping generations models
- Hardy-Weinberg model
- Look at measures of genetic diversity, following Tuesday's talk

Note: at times a question is left blank. Make sure you can answer it after lecture; these are often concepts that are important for a deeper understanding and for your mid-term.
Probability Theory in Population Genetics

The PROBABILITY (P) of an event is the number of times the event will occur (a) divided by the total number of possible events (n):

    $P = a/n$

Multiplicative (Product) Rule: If the events A and B are independent, then the probability that they both occur is

    $P(A \text{ and } B) = P(A) \times P(B)$

That is, the probability of 2 or more independent events occurring simultaneously is equal to the product of their individual probabilities. For example, the probability of a progeny having the genotype AA at a locus is the frequency of the A allele in the population (denoted p) times the frequency of the A allele in the population, or $p^2$.

Sum Rule: The probability that one or another of 2 or more mutually exclusive events occurs is equal to the sum of their individual probabilities:

    $P(A \text{ or } B) = P(A) + P(B)$

Using the example above, the probability of one particular way of forming the heterozygous genotype Aa is the product of the two allele frequencies, $pq$, where q is the frequency of the a allele. However, there are two ways to get a heterozygote: a p from the mom and a q from the dad, or a q from the mom and a p from the dad. These two ways are mutually exclusive, so we add them: $pq + qp = 2pq$.
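The two rules can be sketched in a few lines of Python. This is a minimal illustration, assuming the two gametes that form a progeny are drawn independently; the function name `genotype_probabilities` and the value p = 0.7 are made up for the example.

```python
def genotype_probabilities(p):
    """Return (P_AA, P_Aa, P_aa) for frequency p of allele A.

    Assumes the maternal and paternal gametes are independent draws.
    """
    q = 1.0 - p               # frequency of the a allele
    p_AA = p * p              # product rule: A from mom AND A from dad
    p_Aa = p * q + q * p      # sum rule over the two mutually exclusive orders
    p_aa = q * q              # product rule: a from mom AND a from dad
    return p_AA, p_Aa, p_aa


# Hypothetical allele frequency, chosen only for illustration.
p_AA, p_Aa, p_aa = genotype_probabilities(0.7)
print(p_AA, p_Aa, p_aa, p_AA + p_Aa + p_aa)
```

The three probabilities sum to 1 because every progeny must have exactly one of the three genotypes.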
Conditional probability: the probability of one event given that the other event has occurred.

    $P(A \mid B) = \dfrac{P(A \text{ and } B)}{P(B)}$

If A and B are independent, this becomes $P(A \mid B) = \dfrac{P(A)\,P(B)}{P(B)} = P(A)$.

BASIC STATISTICS:

Basic Terms:
Population = group of things we are interested in (population of inference)
Sample = subset of the population; typically it is not possible to sample the total population
Random sample = each member has an equal and independent chance of being in the sample
Variable = an attribute common to all members of the population that varies in its realization; these realizations are called variates
Random variable = a variable measured on the random sample
Continuous variable = metric variable, continuous scale, e.g., height
Discrete variable = meristic variable, countable, e.g., # of leaves, # of digits, integers
Categorical variable = grouped and discrete but not ordered
    Example: Categories: AA, Aa, aa. Discrete: number of A alleles
Parameter = numerical summary or constant that describes the entire population of inference
    Example: $\sigma^2$ is the population variance and $\mu$ is the population mean for a certain trait x
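A small numeric sketch of conditional probability follows. The scenario is an assumption made only for illustration: allele A is taken to be dominant, its frequency is set to p = 0.6, and genotype probabilities are taken as $p^2$, $2pq$, $q^2$ from the product and sum rules. The helper name `conditional` is hypothetical.

```python
def conditional(p_a_and_b, p_b):
    """P(A | B) = P(A and B) / P(B); requires P(B) > 0."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive")
    return p_a_and_b / p_b


# Hypothetical values: A is dominant, frequency of A is 0.6.
p = 0.6
q = 1 - p
p_AA = p * p                      # genotype AA
p_dominant = p * p + 2 * p * q    # AA or Aa show the dominant phenotype

# Every AA individual shows the dominant phenotype,
# so P(AA and dominant) is simply P(AA).
p_AA_given_dominant = conditional(p_AA, p_dominant)
print(p_AA_given_dominant)

# With independent events the condition changes nothing: P(A | B) = P(A).
print(conditional(0.2 * 0.5, 0.5))
```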
Statistic = value of this numerical constant calculated on the sample and used to estimate the parameter.
    Example: $s^2$ is the sample variance and $\bar{x}$ is the sample mean

Summary statistics allow us to compare populations and estimate the parameters.

Statistics are divided into 5 categories:
- Descriptive
- Tests of difference
- Tests of relationship
- Multivariate exploratory methods
- Estimators of population parameters

Central Tendency: Arithmetic Mean

    Sample mean:      $\bar{x} = \sum_{i=1}^{n} x_i / n$
    Population mean:  $\mu = \sum_{i=1}^{N} X_i / N$

Calculate the average fitness of a population. From your sample of the population, categorize individuals into groups:

    #     Genotype   Fitness
    25    AA         0.7
    50    Aa         0.5
    25    aa         0.4

    $\sum$ (freq. of category)(value of category)
    (0.25)(0.7) + (0.5)(0.5) + (0.25)(0.4) = average fitness
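The average-fitness calculation is a frequency-weighted mean, which can be sketched as below. The counts and fitness values are the ones in the table above; the function name `weighted_mean` is an illustrative choice.

```python
# Counts and fitness values from the genotype table above.
counts = {"AA": 25, "Aa": 50, "aa": 25}
fitness = {"AA": 0.7, "Aa": 0.5, "aa": 0.4}


def weighted_mean(counts, values):
    """Sum of (frequency of category) x (value of category)."""
    total = sum(counts.values())
    return sum(counts[g] / total * values[g] for g in counts)


mean_fitness = weighted_mean(counts, fitness)
print(mean_fitness)
```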
The measure of variability or dispersion of points around the mean is the variance.

    Population variance:  $\sigma^2 = \sum (X - \mu)^2 / N$
    Sample variance:      $s^2 = \sum (x - \bar{x})^2 / (n - 1)$

The standard deviation (SD) is the square root of $s^2$. Remember that for a normal distribution about 68% of the values lie within 1 SD of the mean and about 95% lie within 2 SD.

Do not confuse SE with SD. The SD measures the dispersion of the underlying raw data, and the SE (standard error) measures the dispersion of a sample statistic. For example: the SE describes the sampling distribution of the sample mean heterozygosity, while the SD describes the distribution of the raw heterozygosity values.

Geometric mean: the nth root of the product of n numbers; used in growth rate estimates
Harmonic mean: the reciprocal of the mean of the reciprocals, so it is weighted toward the smallest values; used in calculating the effective population size
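These quantities are all available in Python's standard `statistics` module (Python 3.8 or later for `geometric_mean`). The sketch below uses made-up data: the heterozygosity values, growth rates, and population sizes are hypothetical and serve only to show which function matches which formula.

```python
import math
import statistics

# Hypothetical heterozygosity values measured at six loci.
h = [0.48, 0.32, 0.00, 0.50, 0.42, 0.18]

mean_h = statistics.mean(h)
var_s2 = statistics.variance(h)        # sample variance, divides by n - 1
var_sigma2 = statistics.pvariance(h)   # population variance, divides by N
sd = statistics.stdev(h)               # square root of the sample variance
se = sd / math.sqrt(len(h))            # standard error of the sample mean

# Hypothetical per-generation growth rates and population sizes.
growth = [1.2, 0.8, 1.1]
sizes = [100, 10, 100]
geo = statistics.geometric_mean(growth)   # nth root of the product
har = statistics.harmonic_mean(sizes)     # reciprocal of the mean reciprocal

print(mean_h, var_s2, var_sigma2, sd, se)
print(geo, har)
```

Note how the single small population size (10) pulls the harmonic mean far below the arithmetic mean of 70, which is why it matters for effective population size.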
POPULATIONS:
Group of organisms of the same species living within a sufficiently restricted geographic area with random mating
- Local interbreeding population
- Local populations or demes (Mendelian populations or subpopulations)
THE MODEL OF RANDOM MATING:
[Figure: parent population genotype frequencies P(AA), P(Aa), P(aa) -> allele pool of A and a gametes -> new population genotype frequencies P'(AA), P'(Aa), P'(aa)]
NON-OVERLAPPING GENERATIONS
Mostly insects and plants. While simple, the model works for a lot of organisms with complex life histories:

    generation t-1  ->  generation t  ->  generation t+1

HARDY-WEINBERG MODEL
GH Hardy & W Weinberg 1908 (independently)
WE Castle (1903, Harvard geneticist)

Assumptions of the HW Principle
1. Diploid population (2N)
2. Sexual reproduction, no selfing
3. Non-overlapping generations
4. Locus with 2 alleles
5. Allele frequencies are equal in males and females
6. Random mating
7. Infinite population size
8. Mutation ignored
9. Natural selection doesn't affect the alleles considered
Model with Theoretical Predictions

[Figure: Gen 1 -> Gen 2 along a Time axis]

p = frequency of A allele
q = frequency of a allele
p + q = 1

Independent trials: $(pA + qa) \times (pA + qa) = 1$ (all genotypes)

(1) So $p^2 + 2pq + q^2 = 1$
(2) Equilibrium allele frequencies: after one round of random mating, $p'$ is equal to $p$ and $p'^2$ is equal to $p^2$

What about random union of gametes?
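A short worked step for statement (2), assuming the genotype proportions in (1) and counting each Aa heterozygote as carrying one A allele out of two. The prime on p marks the next generation; that notation is an assumption made here for clarity.

```latex
\begin{align*}
p' &= P(AA) + \tfrac{1}{2}\,P(Aa) \\
   &= p^2 + \tfrac{1}{2}\,(2pq) \\
   &= p\,(p + q) \\
   &= p \qquad \text{since } p + q = 1
\end{align*}
```

Because the allele frequency is unchanged, the next generation's genotype proportions are again $p^2$, $2pq$, and $q^2$.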
EXAMPLE:
If we have a single locus with two alleles, A1 and A2
Let: p = frequency of A1 allele
     q = frequency of A2 allele

What are the three possible genotypes?

The allele frequencies can be estimated from the genotype frequencies:

Now if there is random mating, what is the frequency of genotypes in the next generation? What are the progeny genotypes given the adult genotypes and random mating?

P, Q, and R are the frequencies of the genotypes A1A1, A1A2, and A2A2, with P + Q + R = 1.

                                Frequency of zygotes (progeny)
    Mating          Frequency   A1A1    A1A2    A2A2
    A1A1 x A1A1     P^2         1       0       0
    A1A1 x A1A2     2PQ         1/2     1/2     0
    A1A1 x A2A2     2PR         0       1       0
    A1A2 x A1A2     Q^2         1/4     1/2     1/4
    A1A2 x A2A2     2QR         0       1/2     1/2
    A2A2 x A2A2     R^2         0       0       1

New genotype frequencies (P' + Q' + R' = 1):

    A1A1:  $P' = P^2 + \frac{2PQ}{2} + \frac{Q^2}{4}$ = ______ = $p^2$
    A1A2:  $Q' = \frac{2PQ}{2} + 2PR + \frac{Q^2}{2} + \frac{2QR}{2}$ = ______ = $2pq$
    A2A2:  $R' = R^2 + \frac{2QR}{2} + \frac{Q^2}{4}$ = ______ = $q^2$

For extra credit on your homework this week, can you prove the connection of the equation for P' to $p^2$, Q' to $2pq$, and R' to $q^2$?
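The mating table can be checked numerically before attempting the algebra. The sketch below is a numeric check, not a proof. It assumes the mating-type frequencies and offspring proportions shown in the table, uses hypothetical parental genotype frequencies (0.5, 0.2, 0.3), and computes allele frequencies by allele counting (each heterozygote contributes half of its alleles to each type). The names `OFFSPRING` and `next_generation` are illustrative.

```python
# Offspring proportions (A1A1, A1A2, A2A2) for each mating type in the table.
OFFSPRING = {
    ("A1A1", "A1A1"): (1.0, 0.0, 0.0),
    ("A1A1", "A1A2"): (0.5, 0.5, 0.0),
    ("A1A1", "A2A2"): (0.0, 1.0, 0.0),
    ("A1A2", "A1A2"): (0.25, 0.5, 0.25),
    ("A1A2", "A2A2"): (0.0, 0.5, 0.5),
    ("A2A2", "A2A2"): (0.0, 0.0, 1.0),
}


def next_generation(P, Q, R):
    """Genotype frequencies after one round of random mating."""
    freq = {"A1A1": P, "A1A2": Q, "A2A2": R}
    new = [0.0, 0.0, 0.0]
    for (g1, g2), offspring in OFFSPRING.items():
        # Unlike genotypes can pair in two orders, hence the factor of 2.
        mating_freq = freq[g1] * freq[g2] * (1 if g1 == g2 else 2)
        for k in range(3):
            new[k] += mating_freq * offspring[k]
    return tuple(new)


# Hypothetical parental genotype frequencies, deliberately not in HW proportions.
P, Q, R = 0.5, 0.2, 0.3
p = P + Q / 2    # allele counting: frequency of A1
q = R + Q / 2    # allele counting: frequency of A2

P1, Q1, R1 = next_generation(P, Q, R)
print((P1, Q1, R1))
print((p * p, 2 * p * q, q * q))
```

Trying several starting values of P, Q, and R is a quick way to see that the match does not depend on the parents being in HW proportions.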
EXAMPLE
Measures of Genetic Diversity - Allozyme Data

There are two standard measures of allozyme diversity:

(1) P, the proportion of loci sampled that are polymorphic

    $P = x/m$

    x is the number of polymorphic loci in a sample of m loci

    Note: Often you'll see this used as a measure of diversity for allozyme loci, but because of sampling (with low sample numbers, loci may appear monomorphic that prove polymorphic once more individuals are sampled; see below), this is not a good measure for highly polymorphic loci.

(2) H, mean heterozygosity

    Sample a locus with two alleles at frequencies of 0.4 and 0.6
    Let $p_1 = 0.4$ and $p_2 = 0.6$
    Homozygotes: $p_1^2 = 0.16$; $p_2^2 = 0.36$
    Therefore $1 - (0.16 + 0.36) = 0.48$ (48% heterozygotes)

    Average over all loci, including monomorphic ones!

    General equation (Nei 1987):  $H = 1 - \sum_{i=1}^{n} p_i^2$
    Unbiased estimate:            $\hat{H} = \dfrac{2N}{2N - 1}\left(1 - \sum_{i=1}^{n} p_i^2\right)$, where N is the number of individuals sampled

Note: The general equation for expected heterozygosity is often referred to as a measure of diversity. We use this equation for more than just allozymes, and it's fundamental to understand for measuring divergence among populations (F-statistics). I like to think of the measure as the probability of an individual being heterozygous at a given locus. Many human microsatellite loci have heterozygosity > 0.85, which means you have a >85% chance of being heterozygous at such a locus. I'll break down the equation here and we will talk about it more in class.

In the equation above, $p_i$ is the frequency of the ith allele of n alleles at a locus. For example $p_1$, $p_2$, $p_3$ could correspond to p, q, r.
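A minimal sketch of the heterozygosity calculation, assuming expected heterozygosity at a locus is one minus the sum of squared allele frequencies, as in the worked 0.4 / 0.6 example. The function names and the three-locus data set are hypothetical.

```python
def expected_heterozygosity(freqs):
    """1 - sum(p_i^2) for the allele frequencies at one locus."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in freqs)


def mean_heterozygosity(loci):
    """Average over all loci, monomorphic loci included."""
    return sum(expected_heterozygosity(f) for f in loci) / len(loci)


# The two-allele example from the text.
h = expected_heterozygosity([0.4, 0.6])

# Hypothetical data set: the last locus is monomorphic and still counts.
H = mean_heterozygosity([[0.4, 0.6], [0.5, 0.3, 0.2], [1.0]])
print(h, H)
```

Leaving the monomorphic locus out of the average would inflate H, which is why the text insists on including it.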
Remember the HW proportions equation $p^2 + 2pq + q^2 = 1$; then this follows:

    Rearrange the above equation:  $p^2 + q^2 + 2pq = 1$
    Solve for heterozygotes:       $2pq = 1 - (p^2 + q^2)$

If you think about a situation, which could be true for many loci, in which there are 4 or more alleles, it becomes much easier to take the sum of the homozygous rather than the heterozygous genotype combinations. For example, if you have 6 alleles (A = 6), there are 21 possible genotypes:

    $\dfrac{A(A + 1)}{2} = \dfrac{6 \times 7}{2} = 21$

Of these 21 possible genotypes there are only 6 kinds of homozygous genotypes (A1A1, A2A2, A3A3, etc.) but there are 15 different heterozygous genotypes. As the number of alleles increases, it is easier to square the allele frequencies to get the expected homozygote frequencies, sum them, and subtract the sum from 1 to calculate the heterozygosity:

    Heterozygosity = 1 - (sum of all the homozygous frequencies)
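The genotype counts can be confirmed by enumeration. The sketch below lists every unordered pair of alleles and also checks, for hypothetical equal allele frequencies, that summing the heterozygote terms gives the same answer as one minus the homozygote terms. The function name `genotype_counts` is illustrative.

```python
from itertools import combinations, combinations_with_replacement


def genotype_counts(num_alleles):
    """Return (total, homozygous, heterozygous) genotype counts."""
    alleles = [f"A{i}" for i in range(1, num_alleles + 1)]
    genotypes = list(combinations_with_replacement(alleles, 2))
    homozygous = [g for g in genotypes if g[0] == g[1]]
    heterozygous = [g for g in genotypes if g[0] != g[1]]
    return len(genotypes), len(homozygous), len(heterozygous)


total, homo, het = genotype_counts(6)
print(total, homo, het)

# Hypothetical equal frequencies for 6 alleles.
freqs = [1 / 6] * 6
het_direct = sum(2 * pi * pj for pi, pj in combinations(freqs, 2))  # 15 terms
het_shortcut = 1 - sum(p * p for p in freqs)                        # 6 terms
print(het_direct, het_shortcut)
```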
Measures of Genetic Diversity - Microsatellite Data

There are 4 standard measures of microsatellite diversity:

(1) P, the proportion of loci sampled that are polymorphic
    $P = x/m$
    x is the number of polymorphic loci in a sample of m loci

(2) $H_E$, expected heterozygosity (Nei 1987): general measure of genetic diversity
    Problem: high diversity because of high mutation rate

    [Figure: average number of alleles captured (all loci combined) versus sample size (2N), plotted for SM, BC, MB, and FB]

(3) A, allele number: more sensitive to loss of genetic variation
    # of alleles per locus in each population

(4) $R_g$, allelic richness
    Samples alleles at individual loci at the same sample size among populations, using a rarefaction method to estimate allelic richness. The subscript g is the number of genes sampled.
Locus m.2: allele counts by repeat number

Location                  11  12  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  Total
Big Creek Adults           0   0   0   0   0   8   2   1  12   2  19   5   6   8   6   5   0   2   0     76
Monterey Bay Adults        0   0   0   2   0   4   1   0   5   3   7   5   2   2   5   0   2   0   0     38
Fort Bragg Adults          1   0   1   0   0  14   3   0  14   3  24   6  14  14  11   7   2   0   0    114
San Miguel Is. Adults      1   0   4   0   1  18   7   1  15   3  15   9  20  11   7   9   1   1   1    124
Fort Ross Juveniles        0   1   2   0   1  19   5   0  61  25  31  10   8  25  18   4   3   2   0    215
Monterey Bay Juveniles     0   0  32   2   4  73  68   6 107  15  57  45 103  74  33  18   3   4   8    652
Carmel Bay Juveniles       0   0   4   0   0  11   8   2  14   3   6   4  12  10   3   2   0   0   1     80
Total                      2   1  43   4   6 147  94  10 228  54 159  84 165 144  83  45  11   9  10   1299

Location                 Unique alleles   # of alleles
Big Creek Adults                0              12
Monterey Bay Adults             1              11
Fort Bragg Adults               0              13
San Miguel Is. Adults           2              17
Fort Ross Juveniles             1              16
Monterey Bay Juveniles          1              17
Carmel Bay Juveniles            0              13
Total                           5              99

Allele number (A) = # of alleles in the population sample
    Big Creek Adults, locus m.2: A = 12
    Monterey Bay Juveniles, locus m.2: A = 17
    Big difference in sample size (76 versus 652)!

Allelic richness ($R_g$) measures the # of alleles using a sample the size of the smallest population sample for all loci (N = 38)
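A sketch of how rarefaction could be applied to the counts above. The notes do not give the formula, so this assumes the commonly used rarefaction expression: the expected number of distinct alleles in a random subsample of g gene copies is the sum over alleles of $1 - \binom{N - N_i}{g} / \binom{N}{g}$, where $N_i$ is the count of allele i and N is the total count. The function name `allelic_richness` is hypothetical, and g = 38 is taken from the smallest sample in the table.

```python
from math import comb


def allelic_richness(counts, g):
    """Expected number of distinct alleles in a random subsample of g gene copies."""
    n = sum(counts)
    if g > n:
        raise ValueError("g cannot exceed the number of gene copies sampled")
    # comb(n - c, g) is 0 when fewer than g copies remain without allele c,
    # meaning that allele is certain to appear in the subsample.
    return sum(1 - comb(n - c, g) / comb(n, g) for c in counts if c > 0)


# Nonzero allele counts at locus m.2, read from the table above.
big_creek_adults = [8, 2, 1, 12, 2, 19, 5, 6, 8, 6, 5, 2]                 # total 76
monterey_bay_adults = [2, 4, 1, 5, 3, 7, 5, 2, 2, 5, 2]                   # total 38
monterey_bay_juveniles = [32, 2, 4, 73, 68, 6, 107, 15, 57, 45, 103,
                          74, 33, 18, 3, 4, 8]                            # total 652

g = 38  # smallest sample in the table
for name, counts in [("Big Creek Adults", big_creek_adults),
                     ("Monterey Bay Adults", monterey_bay_adults),
                     ("Monterey Bay Juveniles", monterey_bay_juveniles)]:
    print(name, len(counts), round(allelic_richness(counts, g), 2))
```

When g equals the full sample size, as for the Monterey Bay adults, the rarefied value equals the observed allele number; for larger samples it is smaller, which removes the advantage a big sample has in raw allele counts.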
Measures of Genetic Variation Using Sequence Data

1. Nucleotide diversity, π

    $\pi = \dfrac{n}{n-1} \sum_{i}\sum_{j} x_i x_j \pi_{ij}$

    $x_i$ = frequency of the ith haplotype (its count divided by the total number of sequences sampled)
    $n/(n-1)$ = sampling error term, where n is the number of sequences (alleles of the gene) sampled
    $\pi_{ij}$ = proportion of nucleotides that differ between haplotype i and haplotype j

2. The number of segregating sites, θ (theta)

    Infinite-sites model: $\theta = 4N_e\mu$
    $S = n_p/n_t$, the number of polymorphic sites over the total number of sites

    Here is how we estimate θ:
        $S = a_1\theta$, where $a_1 = \sum_{i=1}^{n-1} 1/i$
    which we can rearrange to be
        $\theta = S/a_1$

At steady state in the infinite-sites model, $\pi = \theta$
Estimating π and θ from DNA Sequence Data

An Example
-We collected a sample of 5 banana slugs from the woods outside of the UC Santa Cruz campus in California
-We sequenced a 500 bp region of the mitochondrial COI gene and observed 5 segregating sites in four distinct haplotypes

                       Nucleotide site in gene
                 N     4     45    345   398   456
    Haplotype 1  2     T     G     T     C     T
    Haplotype 2  1     T     A     T     T     A
    Haplotype 3  1     C     G     T     C     T
    Haplotype 4  1     C     G     G     C     T

1. Proportion of polymorphic sites (referred to as P or S)

2. Nucleotide diversity, π

    $\pi = \dfrac{n}{n-1} \sum_{i}\sum_{j} x_i x_j \pi_{ij}$

    n = 5, the number of sequences sampled, therefore n/(n-1) = 5/4

    Frequency
    Hap1   0.4  (note that there are 2 copies of Haplotype 1)
    Hap2   0.2
    Hap3   0.2
    Hap4   0.2

    Pairwise differences (proportion of the 500 sites that differ)
    Hap1 & Hap2   0.006  (3 pairwise differences out of 500 possible)
    Hap1 & Hap3   0.002
    Hap1 & Hap4   0.004
    Hap2 & Hap3   0.008
    Hap2 & Hap4   0.01
    Hap3 & Hap4   0.002
Make a matrix to sum:

    Hap (i)   Hap (j)   x_i    x_j    π_ij     x_i x_j π_ij
    1         1         0.4    0.4    0        0
    1         2         0.4    0.2    0.006    0.00048
    1         3         0.4    0.2    0.002    0.00016
    1         4         0.4    0.2    0.004    0.00032
    2         1         0.2    0.4    0.006    0.00048
    2         2         0.2    0.2    0        0
    2         3         0.2    0.2    0.008    0.00032
    2         4         0.2    0.2    0.01     0.0004
    3         1         0.2    0.4    0.002    0.00016
    3         2         0.2    0.2    0.008    0.00032
    3         3         0.2    0.2    0        0
    3         4         0.2    0.2    0.002    0.00008
    4         1         0.2    0.4    0.004    0.00032
    4         2         0.2    0.2    0.01     0.0004
    4         3         0.2    0.2    0.002    0.00008
    4         4         0.2    0.2    0        0
                                      Σ        0.00352

    $\pi = \dfrac{n}{n-1} \sum_{i}\sum_{j} x_i x_j \pi_{ij}$
    $\pi = 5/4 \times 0.00352 = 0.0044$
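The matrix sum can be reproduced directly from the haplotype sequences. This is a sketch using the five variable sites and haplotype counts from the banana slug example, with the other 495 sites assumed identical across haplotypes; the function name `nucleotide_diversity` is illustrative.

```python
# Variable sites (positions 4, 45, 345, 398, 456) and counts for each haplotype.
haplotypes = {
    "Hap1": ("TGTCT", 2),
    "Hap2": ("TATTA", 1),
    "Hap3": ("CGTCT", 1),
    "Hap4": ("CGGCT", 1),
}
SEQ_LENGTH = 500  # total sites sequenced; the non-variable sites are identical


def nucleotide_diversity(haplotypes, seq_length):
    """(n / (n - 1)) * sum over all i, j of x_i * x_j * pi_ij."""
    n = sum(count for _, count in haplotypes.values())
    total = 0.0
    for seq_i, count_i in haplotypes.values():
        for seq_j, count_j in haplotypes.values():
            # Proportion of sites that differ between haplotypes i and j.
            diffs = sum(a != b for a, b in zip(seq_i, seq_j))
            total += (count_i / n) * (count_j / n) * (diffs / seq_length)
    return n / (n - 1) * total


pi = nucleotide_diversity(haplotypes, SEQ_LENGTH)
print(round(pi, 4))
```

Looping over every ordered pair (i, j), including i = j, mirrors the 16-row matrix; the diagonal terms contribute zero because a haplotype does not differ from itself.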
3. Segregating sites, θ

Using the same banana slug sample (n = 5 sequences, 500 bp region, 5 segregating sites):

    $S = n_p/n_t$ = # segregating sites / total number of sites analyzed
    $S = 5/500 = 0.01$

    $a_1 = 1/1 + 1/2 + \dots + 1/(n-1) = 1/1 + 1/2 + 1/3 + 1/4 = 2.083$

    Note: n is the number of sequences (alleles) sampled, not the number of segregating sites; in this example both happen to equal 5. To calculate $a_1$, sum 1/i for i from 1 to n-1.

    $\theta = S/a_1 = 0.010/2.083 = 0.0048$

Notice that both estimates of nucleotide diversity are similar ($\pi = 0.0044$ and $\theta = 0.0048$), so $\pi \approx \theta$, which indicates steady state.
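The θ calculation can be sketched as a small function. It assumes S is expressed per site, as in the example, and that n is the number of sequences sampled; the function name `theta_from_segregating_sites` is hypothetical.

```python
def theta_from_segregating_sites(num_segregating, total_sites, n_sequences):
    """theta = S / a1, with S per site and a1 = 1/1 + 1/2 + ... + 1/(n - 1)."""
    if n_sequences < 2:
        raise ValueError("need at least two sequences")
    s = num_segregating / total_sites
    a1 = sum(1 / i for i in range(1, n_sequences))
    return s / a1


# Banana slug example: 5 segregating sites, 500 bp, 5 sequences.
theta = theta_from_segregating_sites(5, 500, 5)
print(round(theta, 4))
```

Because $a_1$ grows with n, the same number of segregating sites observed in a larger sample gives a smaller θ.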