STAT 536: Genetic Statistics Frequency Estimation Karin S. Dorman Department of Statistics Iowa State University August 28, 2006
Fundamental rules of genetics
Law of Segregation: a diploid parent is equally likely to pass along either of its two alleles, P(pass copy 1) = P(pass copy 2) = 1/2.
Law of Random Union: gametes unite in a random fashion, so allele A1 is no more likely to unite with allele A1 than with A2, for example:
P(offspring is A1A1) = P(father passes A1) P(mother passes A1)
P(offspring is A1A2) = P(father passes A1) P(mother passes A2) + P(mother passes A1) P(father passes A2)
Segregation & Random Union (F1): [diagram] A backcross AA x aa: after meiosis I and II, each AA parent produces gametes carrying A with P(A) = 1, and each aa parent produces gametes carrying a with P(a) = 1; fertilization therefore gives an F1 generation that is entirely Aa, P(Aa) = 1.
Segregation & Random Union (F2): [diagram] Crossing F1 individuals, Aa x Aa: each parent's meiosis yields gametes with P(A) = P(a) = 0.5, so fertilization gives F2 genotype probabilities P(AA) = 0.25, P(Aa) = 0.5, P(aa) = 0.25.
Ways that alleles can differ
Identical by origin: alleles isolated from the same chromosome are IBO.
Identical by state: two nucleotide sequences that are the same at all sites are IBS.
Identical by descent: alleles that share a common ancestral allele are IBD.
Identical by origin implies identical by state and by descent. Identical by descent (but NOT identical by origin) may imply identical by state.
Questions about alleles Is the blue cone photoreceptor allele you got from your mother identical in origin to the one received from your father? Are they identical by descent? Are two protein alleles different in state if their underlying nucleotide sequence differs by a single synonymous mutation? What about two nucleotide alleles with a synonymous change? Are either of your blue cone photoreceptor alleles identical by descent with that of your brother or sister? Are the four blue cone photoreceptor alleles of identical twins identical by descent? Identical in state?
Population summaries
Diallelic locus: imagine a locus A with two possible alleles, A1 and A2.
Multiallelic locus: a locus B with alleles B_k for k = 1, 2, ..., K.
Parameters: properties of the population that can never actually be observed.
Population size: N
Population frequency of genotype at a locus: P_A1A1, P_A1A2, P_B1B5, etc., or P_11, P_12, etc. when the locus is assumed.
Population frequency of allele at a locus: p_A1, p_A2, p_Bk, etc., or p_1, p_2, p_k when the locus is assumed.
Note the relationship between genotype and allele frequencies:
p_u = P_uu + (1/2) sum_{v != u} P_uv
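The genotype-to-allele relationship above is easy to compute. A minimal sketch (the genotype frequencies below are assumed example values, not from the slides):

```python
# Recover allele frequencies from genotype frequencies via
# p_u = P_uu + (1/2) * sum over v != u of P_uv.

def allele_freqs(P):
    """P maps unordered genotype tuples (u, v), u <= v, to population
    genotype frequencies; returns a dict of allele frequencies."""
    p = {}
    for (u, v), f in P.items():
        if u == v:
            p[u] = p.get(u, 0.0) + f        # homozygote contributes fully
        else:
            p[u] = p.get(u, 0.0) + 0.5 * f  # heterozygote contributes half
            p[v] = p.get(v, 0.0) + 0.5 * f
    return p

# Hypothetical diallelic example: P_11 = 0.25, P_12 = 0.50, P_22 = 0.25
p = allele_freqs({(1, 1): 0.25, (1, 2): 0.50, (2, 2): 0.25})
print(p)  # {1: 0.5, 2: 0.5}
```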
HWE - History G. H. Hardy, a mathematician, who wanted to counter the suggestion that any dominant trait should rise to a proportion of 75%. Under what circumstances would you expect 75% dominant? Starting from a backcross, the F2 generation has 75% = 50% + 25% with dominant trait. W. Weinberg, an obstetrician, wanted to know if bearing twins was a Mendelian trait. Evolution: Change in the allele frequencies in a population over time. Under what conditions does evolution not occur?
Haploid population — Characterizing a Population
Suppose a population consists of two types of individuals (e.g. green, yellow). Suppose all individuals in the population reproduce simultaneously. Let N_1(t) and N_2(t) be the counts of each type of individual at generation t. Let p_1(t) = N_1(t) / (N_1(t) + N_2(t)) be the population allele frequency of allele 1. Assume each individual in generation t has exactly W_t offspring. (Note: even with environmental fluctuation, the law of large numbers implies that an average of W_t offspring will be produced per individual per generation and the result is the same.) What does the population look like in generation t + 1?
Change in allele frequency in one generation
A linear recurrence equation for counts across generations:
N_1(t+1) = W_t N_1(t)
N_2(t+1) = W_t N_2(t)
To see if the allele frequency is changing (evolution), consider the allele frequency in the next generation:
p_1(t+1) = N_1(t+1) / (N_1(t+1) + N_2(t+1)) = W_t N_1(t) / (W_t N_1(t) + W_t N_2(t)) = N_1(t) / (N_1(t) + N_2(t)) = p_1(t)
The result generalizes to populations consisting of k different types of individuals. The fundamental assumptions have been: all individual types produce the same number W_t of offspring in the tth generation; the population is large enough that environmental fluctuations average out; there is no mutation during offspring production.
Linear recurrence relation
A linear recurrence relation on a sequence of numbers N(1), N(2), ..., N(t), ... expresses N(t) as a first-degree polynomial of the N(k) with k < t:
N(t) = A N(t-1) + B N(t-2) + C N(t-3) + ...
A first-order linear recurrence relation involves only the preceding number in the sequence:
N(t) = A N(t-1) + B
Given an initial condition N(0) = N_0 and A != 1, there is a unique solution to the first-order linear recurrence relation:
N(t) = (N_0 + B/(A-1)) A^t - B/(A-1)
Proof of linear recurrence relation solution
By induction. First show that it is true for t = 1:
N(1) = (N_0 + B/(A-1)) A - B/(A-1) = N_0 A + AB/(A-1) - B/(A-1) = N_0 A + B.
Then suppose the solution is N(t) = (N_0 + B/(A-1)) A^t - B/(A-1), and show that N(t+1) satisfies the desired equation:
N(t+1) = A N(t) + B
       = A [(N_0 + B/(A-1)) A^t - B/(A-1)] + B
       = (N_0 + B/(A-1)) A^{t+1} - AB/(A-1) + B
       = (N_0 + B/(A-1)) A^{t+1} - B/(A-1).
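The closed-form solution can also be checked numerically against direct iteration; the values of N_0, A, and B below are arbitrary:

```python
# Check N(t) = (N_0 + B/(A-1)) * A**t - B/(A-1)
# against direct iteration of N(t) = A*N(t-1) + B.

def iterate(N0, A, B, t):
    N = N0
    for _ in range(t):
        N = A * N + B
    return N

def closed_form(N0, A, B, t):
    c = B / (A - 1)
    return (N0 + c) * A**t - c

N0, A, B = 100.0, 2.0, 3.0
for t in range(6):
    assert abs(iterate(N0, A, B, t) - closed_form(N0, A, B, t)) < 1e-9
print(closed_form(N0, A, B, 3))  # 821.0, since 100 -> 203 -> 409 -> 821
```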
Haploid population with sexual reproduction Suppose the population consists of two genotypes A 1 and A 2 (Note: these are alleles and also genotypes in a haploid population). Let p(t) be the proportion of genotype A 1 in the tth generation, again assuming synchronous reproduction.
Following a sexually reproducing haploid population through one generation
Assuming mating is random, the mate probabilities are:

Parent 1   Parent 2   Probability            Offspring
A1         A1         p(t) p(t)              A1 A1
A2         A2         (1 - p(t))(1 - p(t))   A2 A2
A1         A2         p(t)(1 - p(t))         A1 A2
A2         A1         (1 - p(t)) p(t)        A2 A1

But we'll never be able to tell apart the last two, so the diploid genotype proportions are:
A1A1: p^2(t)    A1A2: 2p(t)(1 - p(t))    A2A2: (1 - p(t))^2
(cont.) Characterizing a Population
Assuming all diploid genotypes are equally likely to proceed through meiosis, what does the result of meiosis look like?

Diploid Genotype   Probability          P(A1 gamete)   P(A2 gamete)
A1A1               p^2(t)               1              0
A2A2               (1 - p(t))^2         0              1
A1A2               2p(t)(1 - p(t))      0.5            0.5

And therefore, the next generation makeup is:
p(t+1) = 1 * p^2(t) + 0.5 * 2p(t)(1 - p(t)) = p(t)
1 - p(t+1) = 1 * (1 - p(t))^2 + 0.5 * 2p(t)(1 - p(t)) = 1 - p(t)
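The one-generation update above can be sketched in a few lines; the starting frequency is arbitrary:

```python
# One generation of random mating in the haploid sexual population:
# the A1 frequency is unchanged, p(t+1) = 1*p^2 + 0.5*2p(1-p) = p.

def next_gen(p):
    P_AA = p * p              # A1A1 diploids formed at fertilization
    P_Aa = 2 * p * (1 - p)    # A1A2 diploids
    # products of meiosis: A1A1 yields A1 w.p. 1, A1A2 yields A1 w.p. 0.5
    return 1.0 * P_AA + 0.5 * P_Aa

p = 0.37
assert abs(next_gen(p) - p) < 1e-12
print(next_gen(p))  # equals p up to floating-point rounding
```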
Hardy Weinberg Assumptions
Consider a single locus where there are two alleles segregating in a diploid population. Make the Hardy-Weinberg (HW) assumptions:
No difference in genotype proportions between the sexes.
Synchronous reproduction at discrete points in time (discrete generations).
Infinite population size (so that small variabilities are erased in the average).
No mutation.
No migration (precisely: no immigration and balanced emigration).
No selection (precisely: no differences in fertility and viability).
Random mating.
Let the genotype frequencies at generation t be P_11(t), P_12(t), and P_22(t).
Following the population through one generation...
Using the assumptions of no mutation, no selection (all diploids equally likely to proceed through meiosis), and infinite population size, the allele frequencies in the gametes (haploid products of meiosis) are:
p_1(t) = 1 * P_11(t) + 0.5 * P_12(t)
p_2(t) = 0.5 * P_12(t) + 1 * P_22(t)
Notice, these are also the equations for the population allele frequencies p_A1 and p_A2, because producing gametes under these assumptions is like randomly selecting alleles from random individuals in the population.
...still following...
Using the assumptions of random mating (individuals randomly select their mates from the population), infinite population size, and no difference in genotype proportions between the sexes, we already know what to expect: diploid genotype probabilities in the next generation will be
P_11(t+1) = p_1^2(t)
P_12(t+1) = 2 p_1(t) p_2(t)
P_22(t+1) = p_2^2(t).
And they will produce gametes (in the next generation) with proportions
p_1(t+1) = p_1(t)
p_2(t+1) = p_2(t).
HWE Theorem
Theorem (1908): Given all the assumptions mentioned three slides ago, the allele and genotype frequencies are at Hardy-Weinberg equilibrium (HWE), unchanging from generation to generation. If the frequencies are perturbed, they will return to equilibrium (not necessarily the same equilibrium) in a single generation.
Proof: The above proof starts with allele frequencies in one generation and shows they are equivalent to the allele frequencies in the next generation. One can also achieve the proof by starting from genotype frequencies in one generation and showing they are equivalent to the genotype frequencies in the following generation. This proof requires considering all the mating types and their probabilities, e.g. A1A2 x A1A2 has probability P_A1A2 P_A1A2 while A1A1 x A1A2 has probability 2 P_A1A1 P_A1A2.
Consider this population: [figure: a population of 39 individuals, each labeled by its genotype (11, 12, or 22)]
Population genotype frequencies
A count of genotypes leads to the population counts:
N_11 = N_12 = 12, N_22 = 15, N = 39
implying the population genotype and allele frequencies:
P_11 = 0.31, P_12 = 0.31, P_22 = 0.38 (total = 1)
p_1 = (2 * 12 + 12)/78 = 36/78 = 0.46
p_2 = (12 + 2 * 15)/78 = 42/78 = 0.54
Next generation population genotype frequencies
In the next generation, when these alleles unite randomly, the genotype frequencies will be:
P_11(1) = p_1^2 = 0.46^2 = 0.21
P_12(1) = 2 p_1 p_2 = 2 * 0.46 * 0.54 = 0.50
P_22(1) = p_2^2 = 0.54^2 = 0.29
(total = 1). And these of course will produce gametes with proportions p_1 and p_2 again.
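A short script reproduces this worked example, using genotype counts consistent with the frequencies above (N_11 = N_12 = 12, N_22 = 15, inferred from the stated totals):

```python
# Allele frequencies from genotype counts, then the HWE genotype
# frequencies expected after one round of random union.

n11, n12, n22 = 12, 12, 15
n = n11 + n12 + n22                # 39 individuals, 78 alleles
p1 = (2 * n11 + n12) / (2 * n)     # 36/78, about 0.46
p2 = (n12 + 2 * n22) / (2 * n)     # 42/78, about 0.54

P11, P12, P22 = p1**2, 2 * p1 * p2, p2**2
print(round(P11, 2), round(P12, 2), round(P22, 2))  # 0.21 0.5 0.29
```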
Implications of HWE
Under the appropriate conditions, genotype frequencies can be predicted from allele frequencies. Therefore, we need only track the allele frequencies when analyzing populations satisfying the assumptions.
Mendelian reproduction does not favor one allele over another, hence there will be no loss of genetic variability from generation to generation.
The dominant phenotype will not always make up 75% of the population. Indeed, its frequency p_A1^2 + 2 p_A1 (1 - p_A1) = 1 - (1 - p_A1)^2 equals 0.75 only when p_A1 = 0.5.
Generalization to multiple alleles
Suppose there are k > 2 different alleles A_1, A_2, ..., A_k with population frequencies p_1, p_2, ..., p_k. Then, upon random union, the diploid genotype frequencies are:
P_ii = p_i^2 for i = 1, 2, ..., k
P_ij = p_i p_j for i = 1, 2, ..., k and j = 1, 2, ..., k with i != j. (Here we have distinguished the order ij vs. ji.)
The allele frequencies are
p_i = (1/2) sum_{j=1}^{k} (P_ji + P_ij)
If the previous generation was a product of random mating, then P_ij = P_ji = p_i p_j, so
p_i = (1/2) sum_{j=1}^{k} 2 p_i p_j = p_i sum_{j=1}^{k} p_j = p_i
Synchronous reproduction
We have made the assumption of synchronous reproduction. What happens when this assumption is violated? If you assume individuals live an exponentially distributed lifetime and then reproduce, then HWE will be achieved when the last individual from the founding population dies. It could take a very long time for this goal to be achieved. Exponentially distributed lifetimes are not usually applicable to biological populations. More complex models are difficult mathematically.
Sample summaries, i.e. statistics
Statistics: functions of an observed sample of data collected from a population.
Sample size: n
Sample counts of alleles (n_A1, n_A2) and genotypes (n_A1A1, n_A1A2, ...):
n_A1 = n_A1A2 + 2 n_A1A1
n_A2 = n_A1A2 + 2 n_A2A2
Sample frequencies (denoted by tilde):
p~_A1 = n_A1 / (2n)    P~_A1A2 = n_A1A2 / n
We shall denote parameter estimates with carets, e.g. p^_A1 or P^_A1A2.
Statistical estimation
estimator: a function of the data that is used to estimate a parameter of the population.
estimate: identified by the caret, these are the values calculated for a given dataset.
consistent: an estimator is consistent if it is more and more accurate as n increases.
unbiased: E(p^) = p.
estimator variance: E[(p^ - E(p^))^2].
efficient: an estimator whose variance achieves the minimum possible variance.
sufficient: a statistic is sufficient for a parameter if it contains all the information in a sample about that parameter.
Result: There is an efficient estimator only if there is a sufficient statistic.
The randomness of population genetics
Statistical: We are studying a population of N individuals. We take a sample of size n << N. Different samples will lead to different inferences. Sampling distribution: informs on the size and type of variation in inferences due to the randomness of sampling.
Genetic: life is a stochastic process. Reproduction and genetic transmission are random processes following precise, but nevertheless stochastic, probability rules. The population we study arose as a realization of this random process. The variation resulting from this genetic sampling is important when: predicting the genetic future of the population, and studying the processes that gave rise to this population, and others like it.
Population sampling: [figure: individuals are drawn one at a time from the population of 39; the running genotype counts (11, 12, 22) update with each draw]
Application of sampling and frequency estimation
Walter E. Nance and Michael J. Kearsey (2004) Relevance of Connexin Deafness (DFNB1) to Human Evolution. Am. J. Hum. Genet. 74:1081-1087.
Mutations at over 100 loci (loci is the plural of locus) can cause deafness. The authors hypothesize that less severe selection against, and assortative mating on, deafness can increase the incidence of the most common deafness allele in the population. They speculate that the incidence of deafness has increased since the introduction of sign language for this reason.
A statistical model of genotype sampling
Given population frequencies P_11, P_12, P_22, we could model the statistical sampling process with the multinomial distribution, if the population size is large enough that sampling does not change the population frequencies.
Multinomial distribution: Mult(n, Q_1, Q_2, ..., Q_k)
Pr(n_1, n_2, ..., n_k) = n! / (prod_{i=1}^{k} n_i!) * prod_{i=1}^{k} Q_i^{n_i}
Binomial distribution: Bin(n, Q) applies when there are two categories
Pr(n_1, n - n_1) = n! / (n_1! (n - n_1)!) * Q^{n_1} (1 - Q)^{n - n_1}
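A small simulation of this sampling model, drawing n genotypes with multinomial probabilities (the frequencies and sample size here are illustrative, not from the slides):

```python
# Draw a sample of n genotypes from a large population with
# genotype frequencies Q, i.e. one multinomial draw.
import random
from collections import Counter

random.seed(1)
Q = {"A1A1": 0.21, "A1A2": 0.50, "A2A2": 0.29}
n = 50
sample = random.choices(list(Q), weights=list(Q.values()), k=n)
counts = Counter(sample)   # the observed multinomial counts (n_11, n_12, n_22)
print(counts)
```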
Facts about expectations and variances
E[aX + bY] = a E[X] + b E[Y] for two random variables X and Y and constants a and b.
Var(X) = E[X^2] - (E[X])^2
Cov(X, Y) = E[XY] - E[X] E[Y]
Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X, Y), where the covariance term is zero for independent X and Y.
Estimating multinomial probabilities
Mean counts: E(n_i) = n Q_i. The sample proportion is an unbiased estimate of the population frequency:
E(Q~_i) = E(n_i / n) = (1/n) E(n_i) = Q_i
Variance in counts: Var(n_i) = n Q_i (1 - Q_i). Population frequency estimator variance:
Var(Q~_i) = Var(n_i / n) = (1/n^2) Var(n_i) = (1/n) Q_i (1 - Q_i)
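These two facts can be checked by Monte Carlo; the category probability, sample size, and replicate count below are toy values:

```python
# Check that the sample proportion n_i/n has mean Q_i and
# variance Q_i(1 - Q_i)/n under binomial sampling.
import random

random.seed(2)
Q, n, reps = 0.3, 100, 20000
props = []
for _ in range(reps):
    ni = sum(random.random() < Q for _ in range(n))  # one binomial count
    props.append(ni / n)

mean = sum(props) / reps
var = sum((p - mean) ** 2 for p in props) / reps
print(mean, var)  # close to 0.3 and 0.3*0.7/100 = 0.0021
```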
Estimating covariances and correlations
E(n_i n_j) = sum_{r=0}^{n} sum_{s=0}^{n} r s P(n_i = r, n_j = s) = n(n - 1) Q_i Q_j
E(Q~_i Q~_j) = ((n - 1)/n) Q_i Q_j
Cov(n_i, n_j) = -n Q_i Q_j
Cov(Q~_i, Q~_j) = -(1/n) Q_i Q_j
Corr(n_i, n_j) = Cov(n_i, n_j) / sqrt(Var(n_i) Var(n_j)) = Corr(Q~_i, Q~_j)
Obtaining allele counts
Allele counts are obtained from genotype counts:
n_u = 2 n_uu + sum_{v != u} n_uv
Expected allele counts:
E(n_u) = E(2 n_uu + sum_{v != u} n_uv) = 2 E(n_uu) + sum_{v != u} E(n_uv) = 2 n P_uu + sum_{v != u} n P_uv = 2 n p_u
The sample allele frequency is unbiased for the population allele frequency:
E(p~_u) = E(n_u / (2n)) = 2 n p_u / (2n) = p_u
Variance of allele estimators
Var(n_u) = Var(2 n_uu + sum_{v != u} n_uv)
Apply the formula for the variance of sums of random variables:
Var(n_u) = 2n (p_u + P_uu - 2 p_u^2)
Var(p~_u) = Var(n_u / (2n)) = (1/(2n)) (p_u + P_uu - 2 p_u^2)
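The variance formula is straightforward to apply once sample values are plugged in; the counts below are illustrative (they match the earlier worked example):

```python
# Estimated variance of the sample allele frequency,
# Var(p~_u) = (p_u + P_uu - 2 p_u^2) / (2n),
# with sample proportions substituted for the parameters.

def allele_freq_var(n_uu, n_u, n):
    """n_uu: homozygote count, n_u: allele count, n: sample size."""
    p_u = n_u / (2 * n)
    P_uu = n_uu / n
    return (p_u + P_uu - 2 * p_u**2) / (2 * n)

n11, n12, n22 = 12, 12, 15
n = n11 + n12 + n22
n1 = 2 * n11 + n12
print(allele_freq_var(n11, n1, n))  # a small positive number
```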
Variance estimation
To actually use the variance (covariance, etc.) formulas requires knowledge of the population parameters, which of course we don't have. Substitute the sample proportions p~_u and P~_uu into the variance/covariance formulas to obtain the estimates Var^(p~_u) and Var^(P~_uu). If the sample size is large enough (n >= 30), then confidence intervals for the estimates can be obtained: the population parameter phi has approximately 100(1 - alpha)% chance of falling in the interval
phi^ +/- z_{1 - alpha/2} sqrt(Var^(phi^)).
Confidence interval: [figure: confidence intervals from repeated samples, plotted on the (0, 1) scale]
Importance of variance estimates Computing the variance of estimates tells us how estimates and therefore inferences will differ among samples. Approaches to computing variances Reducing expression to a function of multinomial variances. Using indicator variables. Delta method. Approximate computational methods.
Estimating covariance of allele frequency estimates
Let x_ij be an indicator variable that is 1 if the jth allele in the ith individual is A_1 and 0 otherwise. Let y_ij be an indicator variable that is 1 if the jth allele in the ith individual is A_2 and 0 otherwise. Given these definitions,
p~_1 = (1/(2n)) sum_{i=1}^{n} sum_{j=1}^{2} x_ij
p~_2 = (1/(2n)) sum_{i=1}^{n} sum_{j=1}^{2} y_ij
so we can compute
E(p~_1 p~_2) = (1/(4n^2)) E( sum_{i,j} x_ij * sum_{i',j'} y_i'j' )
(cont.)
Taking expectations of indicator variables is very easy:
E(x_ij) = 1 * P(x_ij = 1) + 0 * P(x_ij = 0) = P(x_ij = 1) = p_1
We conclude (after algebra) that
E(p~_1 p~_2) = p_1 p_2 + (1/(4n)) (P_12 - 4 p_1 p_2)
The covariance is then
Cov(p~_1, p~_2) = E(p~_1 p~_2) - p_1 p_2 = (1/(4n)) (P_12 - 4 p_1 p_2)
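The covariance formula can be verified by simulation; the genotype frequencies, sample size, and replicate count below are toy values:

```python
# Monte Carlo check of Cov(p~_1, p~_2) = (P_12 - 4 p_1 p_2) / (4n).
import random

random.seed(3)
P11, P12, P22 = 0.2, 0.5, 0.3
p1, p2 = P11 + P12 / 2, P22 + P12 / 2
n, reps = 50, 20000

pairs = []
for _ in range(reps):
    c = {"11": 0, "12": 0, "22": 0}
    for g in random.choices(["11", "12", "22"], weights=[P11, P12, P22], k=n):
        c[g] += 1
    pairs.append(((2 * c["11"] + c["12"]) / (2 * n),    # p~_1
                  (c["12"] + 2 * c["22"]) / (2 * n)))   # p~_2

m1 = sum(a for a, _ in pairs) / reps
m2 = sum(b for _, b in pairs) / reps
cov = sum((a - m1) * (b - m2) for a, b in pairs) / reps
print(cov, (P12 - 4 * p1 * p2) / (4 * n))  # both close to -0.00245
```

Note the covariance is negative: an excess of A_1 alleles in a sample necessarily comes at the expense of A_2.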
Delta Method
Let T be a function of the data, specifically the counts n_i: T(n_1, n_2, ...). By Taylor series:
Var(T) ~ sum_i (dT/dn_i)^2 Var(n_i) + sum_i sum_{j != i} (dT/dn_i)(dT/dn_j) Cov(n_i, n_j)
and replace n_i in the derivatives with E(n_i) = n Q_i for multinomial counts. In addition, using the equations for variances and covariances of multinomial counts,
Var(n_i) = n Q_i (1 - Q_i)    Cov(n_i, n_j) = -n Q_i Q_j
we have
Var(T) ~ n sum_i (dT/dn_i)^2 Q_i - n ( sum_i (dT/dn_i) Q_i )^2
Fisher's approximate variance formula
Var(T) ~ n [ sum_i (dT/dn_i)^2 Q_i - (dT/dn)^2 ]
where the second term is needed only when T explicitly involves the sample size n. In addition, terms with higher powers of n in the denominator (e.g. 1/n^2) are ignored in the derivative functions. The above approximation works when T is a ratio of functions of the same order in the counts n_i, or when the counts n_i in T only appear divided by the total sample size n.
Example application of Fisher's approximation
Take T = P~_12 = n_12 / n. Then
dT/dn_12 = 1/n    dT/dn = -n_12/n^2 = -P~_12/n
Var(P~_12) ~ n [ (1/n)^2 P_12 - (P_12/n)^2 ] = (1/n) P_12 (1 - P_12)
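For this example the approximation can be checked against the exact binomial variance of a sample proportion, which it reproduces; the value of P_12 is arbitrary:

```python
# Fisher's formula Var(T) ~ n * [ sum_i (dT/dn_i)^2 Q_i - (dT/dn)^2 ]
# applied to T = n_12/n, compared with the exact binomial variance.

def fisher_var(P12, n):
    # dT/dn_12 = 1/n and dT/dn = -P_12/n, evaluated at expected counts
    return n * ((1 / n) ** 2 * P12 - (P12 / n) ** 2)

def exact_var(P12, n):
    return P12 * (1 - P12) / n

assert abs(fisher_var(0.4, 50) - exact_var(0.4, 50)) < 1e-12
print(round(fisher_var(0.4, 50), 6))  # 0.0048
```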
Other Methods for Confidence Intervals What can one do when the sample size is small (n < 30) or when no formula for the variance can be obtained? Jackknife Bootstrap
Jackknife
You begin with a sample of observations X_1, X_2, ..., X_n of size n. You use these data to calculate an estimate phi^.
Compute n new estimates phi^_(i), where the ith estimate is calculated using all the data minus the ith data point, i.e. X_1, ..., X_{i-1}, X_{i+1}, ..., X_n.
Compute their average phi^_(.) = (1/n) sum_i phi^_(i)
Obtain a less biased estimate:
phi^_J = n phi^ - (n - 1) phi^_(.)
Calculate an estimate of the variance of phi^:
Var^_J(phi^) = ((n - 1)/n) sum_i (phi^_(i) - phi^_(.))^2
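The steps above can be sketched for an arbitrary estimator; here the estimator is the sample mean of a hypothetical dataset (for the mean, the jackknife reproduces the usual s^2/n variance estimate):

```python
# Jackknife: leave-one-out estimates, bias-reduced estimate, and
# jackknife variance, following the formulas above.

def jackknife(data, estimator):
    n = len(data)
    phi = estimator(data)
    leave_one_out = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    phi_dot = sum(leave_one_out) / n
    phi_J = n * phi - (n - 1) * phi_dot             # bias-reduced estimate
    var_J = (n - 1) / n * sum((x - phi_dot) ** 2 for x in leave_one_out)
    return phi_J, var_J

data = [0.2, 0.4, 0.1, 0.5, 0.3]
mean = lambda xs: sum(xs) / len(xs)
phi_J, var_J = jackknife(data, mean)
print(phi_J, var_J)  # mean 0.3 and s^2/n = 0.005, up to rounding
```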
Bootstrap
Obtain M samples by sampling with replacement from the original data. Compute the bootstrap estimate phi^_(i) for each bootstrap dataset. Plot a histogram of the phi^_(i) for i = 1, ..., M to obtain an approximation to the sampling distribution.
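A minimal bootstrap sketch; the data, estimator, and the percentile-interval construction are illustrative choices, not prescribed by the slides:

```python
# Bootstrap: resample with replacement M times, recompute the estimate
# each time, and use the spread as an approximate sampling distribution.
import random

random.seed(4)

def bootstrap(data, estimator, M=1000):
    n = len(data)
    return [estimator(random.choices(data, k=n)) for _ in range(M)]

# e.g. estimating the frequency of heterozygotes in a genotype sample
sample = ["Aa"] * 6 + ["AA"] * 3 + ["aa"] * 3
prop_Aa = lambda xs: xs.count("Aa") / len(xs)
boots = bootstrap(sample, prop_Aa)
boots.sort()
ci = (boots[24], boots[974])   # rough 95% percentile interval for M = 1000
print(ci)
```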
Bootstrap Sampling Distribution: [figure: histogram of bootstrap estimates; x-axis: proportion of Aa (0 to 0.5), y-axis: bootstrap frequency]
Genetic Sampling Variance
In general, for the cases where between-population variance should be considered, we need to do more work with variances. We have only computed among-sample, within-population variances so far. The section Total Variance of Allele Frequencies covers this partially, and we will address it in more detail later.
Method of maximum likelihood Another estimation procedure produces the most likely value of a parameter. It is applicable when the the sampling distribution for the random variable (e.g. genotype or allele counts) is known. What s our distribution? Multinomial
Maximum Likelihood
Suppose the expected proportions Q_i from the multinomial distribution are functions of other population parameters. For example, under HWE
P_11 = p_1^2    P_12 = 2 p_1 p_2 = 2 p_1 (1 - p_1)    P_22 = (1 - p_1)^2.
Suppose we observe counts n_11, n_12, n_22; then the likelihood of the data can be written in terms of the allele frequencies:
L(p_1) = n! / (n_11! n_12! n_22!) * (P_11)^{n_11} (P_12)^{n_12} (P_22)^{n_22}
       = n! / (n_11! n_12! n_22!) * p_1^{2 n_11} [2 p_1 (1 - p_1)]^{n_12} (1 - p_1)^{2 n_22}
Supports and Scores It is usually more convenient to work with ln L, called the support. The derivatives of the support with respect to the parameters are called the scores: S p1 = ln L p 1 The maximum likelihood estimates are those values of the parameters (e.g. p 1 ) that maximize the likelihood. They are found by setting the scores equal to 0 and simultaneously solving the resulting system of equations.
Maximum likelihood estimate of p_1
L(p_1) = n! / (n_11! n_12! n_22!) * p_1^{2 n_11} [2 p_1 (1 - p_1)]^{n_12} (1 - p_1)^{2 n_22}
ln L(p_1) = ln( n! / (n_11! n_12! n_22!) ) + n_12 ln 2 + (2 n_11 + n_12) ln(p_1) + (n_12 + 2 n_22) ln(1 - p_1)
Solve
S_p1 = (2 n_11 + n_12)/p_1 - (n_12 + 2 n_22)/(1 - p_1) = 0
to obtain the maximum likelihood estimate
p^_1 = (2 n_11 + n_12) / (2n).
We know this is the maximum because
dS_p1/dp_1 = -(2 n_11 + n_12)/p_1^2 - (n_12 + 2 n_22)/(1 - p_1)^2 <= 0 for all p_1
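The MLE and its large-sample variance (the inverse expected information, which under HWE works out to p_1(1 - p_1)/(2n)) can be computed directly; the counts are illustrative:

```python
# MLE of p_1 under HWE, p^_1 = (2 n_11 + n_12) / (2n), with its
# large-sample variance 1/E[I] = p_1 (1 - p_1) / (2n).

def hwe_mle(n11, n12, n22):
    n = n11 + n12 + n22
    p1 = (2 * n11 + n12) / (2 * n)
    var = p1 * (1 - p1) / (2 * n)   # inverse of the expected information
    return p1, var

p1, var = hwe_mle(12, 12, 15)
print(p1, var)  # p^_1 = 36/78, about 0.46
```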
Statistics Refresher: Properties of MLEs
Do not attempt to estimate two or more parameters that are functions of each other; for example, P_11 = 1 - P_12 - P_22 when there are only two alleles.
The MLE of a function of parameters is the function of the MLEs; for example, the MLE of p_i^2 is (p^_i)^2.
The MLE may be biased. MLEs are consistent estimators under general conditions, so for very large samples the bias disappears.
The information for a parameter is the negative second derivative of the support, e.g.
I_p1 = - d^2 ln L(p_1) / dp_1^2
Properties of MLEs (cont)
For large samples, the variance of the MLE is the inverse of the expected information:
Var(p^_1) = 1 / E[I_p1]
When the likelihood is a function of multiple independent parameters, e.g. P_11, P_12, the information is a matrix; the variance is obtained as the inverse of this matrix.
For large samples, the MLE is approximately normally distributed (and parameter vectors are multivariate normal). For example,
p^_1 ~ N( p_1, {E[I_p1]}^{-1} )