STAT 536: Genetic Statistics
Frequency Estimation
Karin S. Dorman
Department of Statistics, Iowa State University
August 28, 2006

Fundamental rules of genetics

Law of Segregation: a diploid parent is equally likely to pass along either of its two alleles,
P(pass copy 1) = P(pass copy 2) = 1/2.

Law of Random Union: gametes unite in a random fashion, so allele A_1 is no more likely to unite with allele A_1 than with A_2. For example,
P(offspring is A_1A_1) = P(father passes A_1) × P(mother passes A_1)
P(offspring is A_1A_2) = P(father passes A_1) × P(mother passes A_2) + P(mother passes A_1) × P(father passes A_2)

Segregation & Random Union (F1)

(Diagram of a backcross A A × a a: each parent passes one allele through meiosis I and II, so the A A parent produces A gametes with P(A) = 1 and the a a parent produces a gametes with P(a) = 1. Random fertilization gives F1 offspring of genotype A a with P(Aa) = 1.)

Segregation & Random Union (F2)

(Diagram of the F1 × F1 cross A a × A a: meiosis I and II in each parent produce gametes A and a with P(A) = P(a) = 0.5. Random fertilization gives the F2 genotypes with P(Aa) = 0.5, P(AA) = 0.25, and P(aa) = 0.25.)

Ways that alleles can differ

Identical by origin (IBO): alleles isolated from the same chromosome are IBO.
Identical by state (IBS): two nucleotide sequences that are the same at all sites are IBS.
Identical by descent (IBD): alleles that share a common ancestral allele are IBD.

Identical by origin implies identical by state and by descent. Identical by descent but NOT identical by origin may imply identical by state.

Questions about alleles Is the blue cone photoreceptor allele you got from your mother identical in origin to the one received from your father? Are they identical by descent? Are two protein alleles different in state if their underlying nucleotide sequence differs by a single synonymous mutation? What about two nucleotide alleles with a synonymous change? Are either of your blue cone photoreceptor alleles identical by descent with that of your brother or sister? Are the four blue cone photoreceptor alleles of identical twins identical by descent? Identical in state?

Population summaries

Diallelic locus: imagine a locus A with two possible alleles, A_1 and A_2.
Multiallelic locus: a locus B with alleles B_k for k = 1, 2, ..., K.
Parameters: properties of the population that can never actually be observed.
Population size: N.
Population frequency of a genotype at a locus: P_A1A1, P_A1A2, P_B1B5, etc., or P_11, P_12, etc. when the locus is assumed.
Population frequency of an allele at a locus: p_A1, p_A2, p_Bk, etc., or p_1, p_2, p_k when the locus is assumed.
Note the relationship between genotype and allele frequencies:
p_u = P_uu + (1/2) Σ_{v ≠ u} P_uv
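
As a quick illustration of this genotype-to-allele relationship, here is a minimal Python sketch (the function name and the dictionary layout are illustrative, not from the lecture):

```python
def allele_freqs(genotype_freqs):
    """Compute allele frequencies p_u = P_uu + 0.5 * sum_{v != u} P_uv.

    genotype_freqs maps an unordered genotype, e.g. ("A1", "A2"),
    to its population frequency P_uv.
    """
    p = {}
    for (u, v), P in genotype_freqs.items():
        if u == v:
            p[u] = p.get(u, 0.0) + P          # homozygote contributes fully
        else:
            p[u] = p.get(u, 0.0) + 0.5 * P    # heterozygote contributes half to each allele
            p[v] = p.get(v, 0.0) + 0.5 * P
    return p

# Example: P_11 = 0.31, P_12 = 0.31, P_22 = 0.38 (the population used later)
print(allele_freqs({("A1", "A1"): 0.31, ("A1", "A2"): 0.31, ("A2", "A2"): 0.38}))
# approximately {'A1': 0.465, 'A2': 0.535}
```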

HWE - History

G. H. Hardy, a mathematician, wanted to counter the suggestion that any dominant trait should rise to a proportion of 75%. Under what circumstances would you expect 75% dominant? Starting from a backcross, the F2 generation has 75% = 50% + 25% with the dominant trait.
W. Weinberg, an obstetrician, wanted to know if bearing twins was a Mendelian trait.
Evolution: change in the allele frequencies in a population over time. Under what conditions does evolution not occur?

Haploid population

Suppose a population consists of two types of individuals (e.g. green, yellow).
Suppose all individuals in the population reproduce simultaneously.
Let N_1(t) and N_2(t) be the counts of each type of individual at generation t.
Let p_1(t) = N_1(t) / (N_1(t) + N_2(t)) be the population allele frequency of allele 1.
Assume each individual in generation t has exactly W_t offspring. (Note: even with environmental fluctuation, the law of large numbers implies that an average of W_t offspring will be produced per individual each generation and the result is the same.)
What does the population look like in generation t + 1?

Change in allele frequency in one generation

A linear recurrence equation for counts across generations:
N_1(t + 1) = W_t N_1(t)
N_2(t + 1) = W_t N_2(t)
To see if the allele frequency is changing (evolution), consider the allele frequency in the next generation:
p_1(t + 1) = N_1(t + 1) / (N_1(t + 1) + N_2(t + 1)) = W_t N_1(t) / (W_t N_1(t) + W_t N_2(t)) = N_1(t) / (N_1(t) + N_2(t)) = p_1(t)
The result can be generalized to populations consisting of k different types of individuals. The fundamental assumptions have been:
All individual types produce the same number W_t of offspring at the tth generation.
The population is large enough that environmental fluctuations average out.
There is no mutation during offspring production.

Linear recurrence relation

A linear recurrence relation on a sequence of numbers N(1), N(2), ..., N(t), ... expresses N(t) as a first-degree polynomial in earlier terms N(k), k < t:
N(t) = A N(t-1) + B N(t-2) + C N(t-3) + ...
A first-order linear recurrence relation involves only the preceding number in the sequence:
N(t) = A N(t-1) + B
Given an initial condition N(0) = N_0 and A ≠ 1, there is a unique solution to the first-order linear recurrence relation:
N(t) = (N_0 + B/(A-1)) A^t - B/(A-1)
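
A small numerical check of this closed-form solution against direct iteration (a sketch; the constants N_0, A, B below are arbitrary):

```python
def recurrence(N0, A, B, t):
    """Iterate N(t) = A*N(t-1) + B directly from N(0) = N0."""
    N = N0
    for _ in range(t):
        N = A * N + B
    return N

def closed_form(N0, A, B, t):
    """Closed form N(t) = (N0 + B/(A-1)) * A**t - B/(A-1), valid for A != 1."""
    c = B / (A - 1)
    return (N0 + c) * A**t - c

N0, A, B = 100.0, 1.5, 10.0
for t in range(6):
    assert abs(recurrence(N0, A, B, t) - closed_form(N0, A, B, t)) < 1e-9
```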

Proof of linear recurrence relation solution

By induction. First check t = 1:
N(1) = (N_0 + B/(A-1)) A - B/(A-1) = N_0 A + AB/(A-1) - B/(A-1) = N_0 A + B(A-1)/(A-1) = N_0 A + B.
Now suppose the solution holds at t, i.e. N(t) = (N_0 + B/(A-1)) A^t - B/(A-1), and show that N(t+1) satisfies the desired equation:
N(t + 1) = A N(t) + B
         = A [ (N_0 + B/(A-1)) A^t - B/(A-1) ] + B
         = (N_0 + B/(A-1)) A^{t+1} - AB/(A-1) + B(A-1)/(A-1)
         = (N_0 + B/(A-1)) A^{t+1} - B/(A-1).

Haploid population with sexual reproduction

Suppose the population consists of two genotypes, A_1 and A_2 (note: these are alleles and also genotypes in a haploid population). Let p(t) be the proportion of genotype A_1 in the tth generation, again assuming synchronous reproduction.

Following a sexually reproducing haploid population through one generation

Assuming mating is random, the mate probabilities are:

Parent 1   Parent 2   Probability             Offspring
A_1        A_1        p(t) p(t)               A_1 A_1
A_2        A_2        (1 - p(t))(1 - p(t))    A_2 A_2
A_1        A_2        p(t)(1 - p(t))          A_1 A_2
A_2        A_1        (1 - p(t)) p(t)         A_2 A_1

But we'll never be able to tell apart the last two, so the diploid genotype proportions are:
A_1 A_1: p^2(t)
A_1 A_2: 2 p(t)(1 - p(t))
A_2 A_2: (1 - p(t))^2

(cont.)

Assuming all diploid genotypes are equally likely to proceed through meiosis, what does the result of meiosis look like?

Diploid Genotype   Probability           P(A_1 gamete)   P(A_2 gamete)
A_1 A_1            p^2(t)                1               0
A_2 A_2            (1 - p(t))^2          0               1
A_1 A_2            2 p(t)(1 - p(t))      0.5             0.5

And therefore the next generation's makeup is:
p(t + 1) = 1 × p^2(t) + 0.5 × 2 p(t)(1 - p(t)) = p(t)
1 - p(t + 1) = 1 × (1 - p(t))^2 + 0.5 × 2 p(t)(1 - p(t)) = 1 - p(t)

Hardy-Weinberg Assumptions

Consider a single locus where there are two alleles segregating in a diploid population. Make the Hardy-Weinberg (HW) assumptions:
No difference in genotype proportions between the sexes.
Synchronous reproduction at discrete points in time (discrete generations).
Infinite population size (so that small variabilities are erased in the average).
No mutation.
No migration (precisely: no immigration and balanced emigration).
No selection (precisely: no differences in fertility and viability).
Random mating.
Let the genotype frequencies at generation t be P_11(t), P_12(t), and P_22(t).

Following the population through one generation...

Using the assumptions of no mutation, no selection (all diploids equally likely to proceed through meiosis), and infinite population size, the allele frequencies in the gametes (the haploid products of meiosis) are:
p_1(t) = 1 × P_11(t) + 0.5 × P_12(t)
p_2(t) = 0.5 × P_12(t) + 1 × P_22(t)
Notice that these are also the equations for the population allele frequencies p_A1 and p_A2, because producing gametes under these assumptions is like randomly selecting alleles from random individuals in the population.

...still following...

Using the assumptions of random mating (individuals randomly select their mates from the population), infinite population size, and no difference in genotype proportions between the sexes, we already know what to expect. Diploid genotype probabilities in the next generation will be
P_11(t + 1) = p_1^2(t)
P_12(t + 1) = 2 p_1(t) p_2(t)
P_22(t + 1) = p_2^2(t),
and they will produce gametes (in the next generation) with proportions
p_1(t + 1) = p_1(t)
p_2(t + 1) = p_2(t).

HWE Theorem

Theorem (1908): Given all the assumptions mentioned three slides ago, the allele and genotype frequencies are at Hardy-Weinberg equilibrium (HWE), i.e. unchanging from generation to generation. If the frequencies are perturbed, they will return to equilibrium (not necessarily the same equilibrium) in a single generation.

Proof: The derivation above starts with allele frequencies in one generation and shows they equal the allele frequencies in the next generation. One can also prove the theorem by starting from genotype frequencies in one generation and showing they equal the genotype frequencies in the following generation. That proof requires considering all the mating types and their probabilities, e.g. the mating A_1A_2 × A_1A_2 has probability P_A1A2 × P_A1A2, while A_1A_1 × A_1A_2 has probability 2 P_A1A1 × P_A1A2.
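
A minimal sketch of the theorem's content: starting from perturbed genotype frequencies, one round of random mating under the HW assumptions restores Hardy-Weinberg proportions, after which the frequencies stop changing (the function name is illustrative):

```python
def next_generation(P11, P12, P22):
    """One generation under the HW assumptions: gametes at frequencies
    p1 = P11 + 0.5*P12 and p2 = 0.5*P12 + P22 unite at random."""
    p1 = P11 + 0.5 * P12
    p2 = 0.5 * P12 + P22
    return p1 * p1, 2 * p1 * p2, p2 * p2

# Perturbed starting frequencies (not in HW proportions)
gen = (0.50, 0.10, 0.40)
for t in range(3):
    gen = next_generation(*gen)
    print(t + 1, [round(P, 4) for P in gen])
# After generation 1 the genotype frequencies are (0.3025, 0.495, 0.2025)
# and they remain there in every later generation.
```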

Consider this population

(Figure: a grid of N = 39 individuals showing each diploid genotype; 12 are A_1A_1, 12 are A_1A_2, and 15 are A_2A_2.)

Population genotype frequencies

A count of genotypes leads to the population counts:
N_11 = 12, N_12 = 12, N_22 = 15, N = 39,
implying the population genotype and allele frequencies:
P_11 = 0.31, P_12 = 0.31, P_22 = 0.38 (total = 1)
p_1 = (2 × 12 + 12) / 78 = 36/78 ≈ 0.46
p_2 = (2 × 15 + 12) / 78 = 42/78 ≈ 0.54

Next generation population genotype frequencies

In the next generation, when these alleles unite randomly, the genotype frequencies will be:
P_11(1) = p_1^2 = 0.46^2 ≈ 0.21
P_12(1) = 2 p_1 p_2 = 2 × 0.46 × 0.54 ≈ 0.50
P_22(1) = p_2^2 = 0.54^2 ≈ 0.29
(total = 1)
And these of course will produce gametes with proportions p_1 and p_2 again.

Implications of HWE

Under the appropriate conditions, genotype frequencies can be predicted from allele frequencies. Therefore, we need only track the allele frequencies when analyzing populations satisfying the assumptions.
Mendelian reproduction does not favor one allele over another, hence there will be no loss of genetic variability from generation to generation.
The dominant phenotype will not always make up 75% of the population; indeed, it does so only when p_A1 = 0.5.

Generalization to multiple alleles

Suppose there are k > 2 different alleles A_1, A_2, ..., A_k with population frequencies p_1, p_2, ..., p_k. Then, upon random union, the diploid genotype frequencies are:
P_ii = p_i^2 for i = 1, 2, ..., k
P_ij = p_i p_j for i = 1, 2, ..., k and j = 1, 2, ..., k with i ≠ j (here we distinguish the order ij vs. ji).
The allele frequencies are
p_i = (1/2) Σ_{j=1}^{k} (P_ji + P_ij).
If the previous generation was a product of random mating, then P_ij = P_ji = p_i p_j, so
p_i = (1/2) Σ_{j=1}^{k} 2 p_i p_j = p_i Σ_{j=1}^{k} p_j = p_i.

Synchronous reproduction

We have made the assumption of synchronous reproduction. What happens when this assumption is violated?
If you assume individuals live an exponentially distributed lifetime and then reproduce, then HWE will be achieved when the last individual from the founding population dies. It could take a very long time for this to happen.
Exponentially distributed lifetimes are not usually applicable to biological populations.
More complex models are difficult mathematically.

Sample summaries, i.e. statistics

Statistics: functions of an observed sample of data collected from a population.
Sample size: n.
Sample counts of alleles (n_A1, n_A2) and genotypes (n_A1A1, n_A1A2, ...):
n_A1 = n_A1A2 + 2 n_A1A1
n_A2 = n_A1A2 + 2 n_A2A2
Sample frequencies (denoted by a tilde):
p̃_A1 = n_A1 / (2n)
P̃_A1A2 = n_A1A2 / n
We shall denote parameter estimates with carets, e.g. p̂_A1 or P̂_A1A2.
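
A small sketch translating these sample summaries into code (the function name and return layout are illustrative):

```python
def sample_summaries(n11, n12, n22):
    """Allele counts and sample frequencies from genotype counts at a diallelic locus."""
    n = n11 + n12 + n22          # sample size (individuals)
    nA1 = 2 * n11 + n12          # count of A1 alleles
    nA2 = 2 * n22 + n12          # count of A2 alleles
    p1_tilde = nA1 / (2 * n)     # sample frequency of allele A1
    P12_tilde = n12 / n          # sample frequency of genotype A1A2
    return nA1, nA2, p1_tilde, P12_tilde

print(sample_summaries(12, 12, 15))   # (36, 42, ~0.4615, ~0.3077)
```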

Statistical estimation

estimator: a function of the data that is used to estimate a parameter of the population.
estimate: identified by the caret, the value the estimator takes for a given dataset.
consistent: an estimator is consistent if it becomes more and more accurate as n increases.
unbiased: E(p̂) = p.
estimator variance: E[(p̂ - E(p̂))^2].
efficient: an estimator whose variance achieves the minimum possible variance.
sufficient: a statistic is sufficient for a parameter if it contains all the information in the sample about that parameter.
Result: an efficient estimator exists only if there is a sufficient statistic.

The randomness of population genetics

Statistical: we are studying a population of N individuals and take a sample of size n << N. Different samples will lead to different inferences. The sampling distribution informs on the size and type of variation in inferences due to the randomness of sampling.
Genetic: life is a stochastic process. Reproduction and genetic transmission are random processes following precise, but nevertheless stochastic, probability rules. The population we study arose as a realization of this random process. The variation resulting from this genetic sampling is important when predicting the genetic future of the population, and when studying the processes that gave rise to this population and others like it.

Population sampling

(Figures: individuals are drawn one at a time from the population grid, with running sample counts of genotypes A_1A_1, A_1A_2, and A_2A_2 updated after each draw.)

Application of sampling and frequency estimation

Walter E. Nance and Michael J. Kearsey (2004) Relevance of Connexin Deafness (DFNB1) to Human Evolution. Am. J. Hum. Genet. 74:1081-1087.
Mutations at over 100 loci (plural of locus) can cause deafness.
They hypothesize that less severe selection and assortative mating on deafness can increase the incidence of the most common deafness allele in the population.
They speculate that the incidence of deafness has increased since the introduction of sign language for this reason.

A statistical model of genotype sampling

Given population frequencies P_11, P_12, P_22, we can model the statistical sampling process with the multinomial distribution, provided the population size is large enough that sampling does not change the population frequencies.
Multinomial distribution: Mult(n, Q_1, Q_2, ..., Q_k)
Pr(n_1, n_2, ..., n_k) = [n! / (n_1! n_2! ... n_k!)] Q_1^{n_1} Q_2^{n_2} ... Q_k^{n_k}
Binomial distribution: Bin(n, Q) applies when there are two categories
Pr(n_1, n - n_1) = [n! / (n_1! (n - n_1)!)] Q^{n_1} (1 - Q)^{n - n_1}
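
A sketch of this statistical sampling model using numpy's multinomial sampler; the genotype frequencies are taken from the worked example earlier, and the sample size is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)

# Population genotype frequencies (A1A1, A1A2, A2A2)
Q = np.array([0.31, 0.31, 0.38])
n = 50                                      # sample size (individuals)

counts = rng.multinomial(n, Q)              # one sample of genotype counts
Q_tilde = counts / n                        # sample genotype frequencies
print(counts, Q_tilde)

# Repeating the draw shows the sampling variability of the estimates
samples = rng.multinomial(n, Q, size=10000) / n
print(samples.mean(axis=0))                 # close to Q (unbiasedness, next slide)
print(samples.var(axis=0))                  # close to Q*(1 - Q)/n (variance, next slide)
```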

Facts about expectations and variances

E[aX + bY] = a E[X] + b E[Y] for two random variables X and Y and constants a and b.
Var(X) = E[X^2] - (E[X])^2
Cov(X, Y) = E[XY] - E[X] E[Y]
Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X, Y), where the covariance term is zero for independent X and Y.

Estimating multinomial probabilities

Mean counts: E(n_i) = n Q_i.
The sample proportion is an unbiased estimate of the population frequency:
E(Q̃_i) = E(n_i / n) = (1/n) E(n_i) = Q_i
Variance in counts: Var(n_i) = n Q_i (1 - Q_i).
Population frequency estimator variance:
Var(Q̃_i) = Var(n_i / n) = (1/n^2) Var(n_i) = Q_i (1 - Q_i) / n

Estimating covariances and correlations

E(n_i n_j) = Σ_{r=0}^{n} Σ_{s=0}^{n} r s Pr(n_i = r, n_j = s) = n(n - 1) Q_i Q_j
E(Q̃_i Q̃_j) = [(n - 1)/n] Q_i Q_j
Cov(n_i, n_j) = -n Q_i Q_j
Cov(Q̃_i, Q̃_j) = -(1/n) Q_i Q_j
Corr(n_i, n_j) = Cov(n_i, n_j) / √(Var(n_i) Var(n_j)) = Corr(Q̃_i, Q̃_j)

Obtaining allele counts

Allele counts are obtained from genotype counts:
n_u = 2 n_uu + Σ_{v ≠ u} n_uv
Expected allele counts:
E(n_u) = E(2 n_uu + Σ_{v ≠ u} n_uv) = 2 E(n_uu) + Σ_{v ≠ u} E(n_uv) = 2n P_uu + Σ_{v ≠ u} n P_uv = 2n p_u
The sample allele frequency is unbiased for the population allele frequency:
E(p̃_u) = E(n_u / (2n)) = 2n p_u / (2n) = p_u

Variance of allele frequency estimators

Var(n_u) = Var(2 n_uu + Σ_{v ≠ u} n_uv)
Applying the formula for the variance of a sum of random variables,
Var(n_u) = 2n (p_u + P_uu - 2 p_u^2)
Var(p̃_u) = Var(n_u / (2n)) = (p_u + P_uu - 2 p_u^2) / (2n)

Variance estimation

To actually use the variance (and covariance, etc.) formulas requires knowledge of the population parameters, which of course we don't have.
Substitute the sample proportions p̃_u and P̃_uu into the variance/covariance formulas to obtain the estimates Var̂(p̃_u) and Var̂(P̃_uu).
If the sample size is large enough (n ≥ 30), confidence intervals can be obtained: the population parameter φ has approximately a 100(1 - α)% chance of falling in the interval
φ̂ ± z_{1-α/2} √(Var̂(φ̂)).
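
A sketch combining the allele-frequency variance formula with this plug-in confidence interval (Wald-type, normal approximation; the function name is illustrative and the counts are from the worked example):

```python
from math import sqrt

def allele_ci(n11, n12, n22, z=1.96):
    """Estimate p1, its sampling variance, and an approximate 95% confidence interval."""
    n = n11 + n12 + n22
    p1 = (2 * n11 + n12) / (2 * n)              # sample allele frequency
    P11 = n11 / n                               # sample homozygote frequency
    var = (p1 + P11 - 2 * p1 ** 2) / (2 * n)    # plug-in variance estimate
    half = z * sqrt(var)
    return p1, var, (p1 - half, p1 + half)

print(allele_ci(12, 12, 15))
# p1 ~ 0.462, variance ~ 0.0044, CI roughly (0.33, 0.59)
```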

Confidence interval

(Figures: confidence intervals for a proportion, computed from repeated samples, plotted on the interval from 0 to 1.)

Importance of variance estimates

Computing the variance of estimates tells us how estimates, and therefore inferences, will differ among samples.
Approaches to computing variances:
Reducing the expression to a function of multinomial variances.
Using indicator variables.
The delta method.
Approximate computational methods.

Estimating the covariance of allele frequency estimates

Let x_ij be an indicator variable that is 1 if the jth allele in the ith individual is A_1, and 0 otherwise.
Let y_ij be an indicator variable that is 1 if the jth allele in the ith individual is A_2, and 0 otherwise.
With these definitions,
p̃_1 = (1/2n) Σ_{i=1}^{n} Σ_{j=1}^{2} x_ij
p̃_2 = (1/2n) Σ_{i=1}^{n} Σ_{j=1}^{2} y_ij
so we can compute
E(p̃_1 p̃_2) = (1/4n^2) E( Σ_{i,j} x_ij × Σ_{i',j'} y_i'j' ).

(cont.)

Taking expectations of indicator variables is very easy:
E(x_ij) = 1 × P(x_ij = 1) + 0 × P(x_ij = 0) = P(x_ij = 1) = p_1
We conclude (after algebra) that
E(p̃_1 p̃_2) = p_1 p_2 + (1/4n)(P_12 - 4 p_1 p_2).
The covariance is then
Cov(p̃_1, p̃_2) = E(p̃_1 p̃_2) - p_1 p_2 = (1/4n)(P_12 - 4 p_1 p_2).
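
A Monte Carlo check of this covariance formula (a sketch: genotype counts are drawn from a multinomial with the worked-example frequencies; HWE is not assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
P11, P12, P22 = 0.31, 0.31, 0.38              # population genotype frequencies
p1, p2 = P11 + 0.5 * P12, P22 + 0.5 * P12     # population allele frequencies
n, reps = 50, 200000

counts = rng.multinomial(n, [P11, P12, P22], size=reps)
p1_hat = (2 * counts[:, 0] + counts[:, 1]) / (2 * n)
p2_hat = (2 * counts[:, 2] + counts[:, 1]) / (2 * n)

print(np.cov(p1_hat, p2_hat)[0, 1])           # simulated covariance
print((P12 - 4 * p1 * p2) / (4 * n))          # formula: Cov = (P12 - 4*p1*p2)/(4n)
```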

Delta Method

Let T be a function of the data, specifically of the counts n_i: T(n_1, n_2, ...). By a Taylor series expansion,
Var(T) ≈ Σ_i (∂T/∂n_i)^2 Var(n_i) + Σ_i Σ_{j ≠ i} (∂T/∂n_i)(∂T/∂n_j) Cov(n_i, n_j),
where we replace n_i in the derivatives with E(n_i) = n Q_i for multinomial counts. Using the equations for variances and covariances of multinomial counts,
Var(n_i) = n Q_i (1 - Q_i), Cov(n_i, n_j) = -n Q_i Q_j,
we have
Var(T) ≈ n Σ_i Q_i (∂T/∂n_i)^2 - n ( Σ_i Q_i ∂T/∂n_i )^2.

Fisher's approximate variance formula

Var(T) ≈ n Σ_i Q_i (∂T/∂n_i)^2 - n (∂T/∂n)^2,
where the second term is needed only when T explicitly involves the sample size n. In addition, terms with higher powers of 1/n in the denominator (e.g. 1/n^2) are ignored in the derivative functions. The approximation works when T is a ratio of functions of the same order in the counts n_i, or when the counts n_i appear in T only divided by the total sample size n.

Example application of Fisher's approximation

T = P̃_12 = n_12 / n
∂T/∂n_12 = 1/n
∂T/∂n = -n_12 / n^2 = -P̃_12 / n
Var(P̃_12) ≈ n P_12 (1/n)^2 - n (P_12 / n)^2 = (1/n) P_12 (1 - P_12)
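
As a further sketch, the delta-method formula can be applied with numerical derivatives; here it is applied to the allele frequency estimator p̂_1 = (2 n_11 + n_12)/(2n), and it reproduces the variance (p_1 + P_11 - 2 p_1^2)/(2n) derived earlier (the helper names are illustrative):

```python
import numpy as np

def delta_var(T, Q, n, eps=1e-6):
    """Var(T) ~ n * [ sum_i Q_i d_i^2 - (sum_i Q_i d_i)^2 ],
    where d_i = dT/dn_i is evaluated at the expected counts n_i = n*Q_i."""
    Q = np.asarray(Q, dtype=float)
    counts = n * Q
    d = np.zeros_like(Q)
    for i in range(len(Q)):
        hi, lo = counts.copy(), counts.copy()
        hi[i] += eps
        lo[i] -= eps
        d[i] = (T(hi) - T(lo)) / (2 * eps)     # numerical partial derivative
    return n * (np.sum(Q * d**2) - np.sum(Q * d) ** 2)

p1_hat = lambda c: (2 * c[0] + c[1]) / (2 * c.sum())   # T(n11, n12, n22)

P11, P12, P22 = 0.31, 0.31, 0.38
p1 = P11 + 0.5 * P12
n = 50
print(delta_var(p1_hat, [P11, P12, P22], n))           # delta-method approximation
print((p1 + P11 - 2 * p1**2) / (2 * n))                # exact formula from earlier
```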

Other Methods for Confidence Intervals What can one do when the sample size is small (n < 30) or when no formula for the variance can be obtained? Jackknife Bootstrap

Jackknife

Begin with a sample of observations X_1, X_2, ..., X_n of size n, and use these data to calculate an estimate φ̂.
Compute n new estimates φ̂_(i), where the ith estimate is calculated using all the data minus the ith data point, i.e. X_1, ..., X_{i-1}, X_{i+1}, ..., X_n.
Compute their average φ̂_(·) = (1/n) Σ_i φ̂_(i).
Obtain a less biased estimate: φ̂_J = n φ̂ - (n - 1) φ̂_(·).
Calculate an estimate of the variance of φ̂:
Var̂_J(φ̂) = [(n - 1)/n] Σ_i (φ̂_(i) - φ̂_(·))^2
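
A sketch of the jackknife applied, for illustration, to the sample heterozygote proportion (the data coding and function names are illustrative):

```python
import numpy as np

def jackknife(data, estimator):
    """Return the bias-corrected jackknife estimate and its variance estimate."""
    data = np.asarray(data)
    n = len(data)
    full = estimator(data)
    leave_one_out = np.array([estimator(np.delete(data, i)) for i in range(n)])
    mean_loo = leave_one_out.mean()
    est_jack = n * full - (n - 1) * mean_loo                       # phi_J
    var_jack = (n - 1) / n * np.sum((leave_one_out - mean_loo) ** 2)
    return est_jack, var_jack

# genotypes coded as the number of A1 alleles per individual: 2 = A1A1, 1 = A1A2, 0 = A2A2
genotypes = np.array([2] * 12 + [1] * 12 + [0] * 15)
print(jackknife(genotypes, lambda g: np.mean(g == 1)))   # heterozygote proportion
```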

Bootstrap

Obtain M samples by sampling with replacement from the original data.
Compute the bootstrap estimate φ̂_(i) for each bootstrap dataset.
Plot a histogram of the φ̂_(i) for i = 1, ..., M to obtain an approximation to the sampling distribution.

Bootstrap Sampling Distribution

(Figure: histogram of the bootstrap estimates of the proportion of Aa, with bootstrap frequency on the vertical axis and the proportion of Aa, ranging from 0 to 0.5, on the horizontal axis.)
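
A sketch of the bootstrap for the same heterozygote proportion; M, the estimator, and the data coding are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
genotypes = np.array([2] * 12 + [1] * 12 + [0] * 15)   # 2 = A1A1, 1 = A1A2, 0 = A2A2
M = 2000

boot = np.empty(M)
for i in range(M):
    resample = rng.choice(genotypes, size=len(genotypes), replace=True)
    boot[i] = np.mean(resample == 1)                   # bootstrap estimate of P(A1A2)

print(boot.mean(), boot.std())    # center and spread of the bootstrap distribution
# A histogram of `boot` approximates the sampling distribution shown in the figure.
```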

Genetic Sampling Variance

In general, for cases where between-population variance should be considered, we need to do more work with variances; so far we have only computed among-sample, within-population variances. The section Total Variance of Allele Frequencies covers this partially, and we will address it in more detail later.

Method of maximum likelihood

Another estimation procedure produces the most likely value of a parameter. It is applicable when the sampling distribution for the random variable (e.g. genotype or allele counts) is known.
What's our distribution? The multinomial.

Maximum Likelihood

Suppose the expected proportions Q_i of the multinomial distribution are functions of other population parameters. For example, under HWE
P_11 = p_1^2, P_12 = 2 p_1 p_2 = 2 p_1 (1 - p_1), P_22 = (1 - p_1)^2.
Suppose we observe counts n_11, n_12, n_22. Then the likelihood of the data can be written in terms of the allele frequency:
L(p_1) = [n! / (n_11! n_12! n_22!)] (P_11)^{n_11} (P_12)^{n_12} (P_22)^{n_22}
       = [n! / (n_11! n_12! n_22!)] p_1^{2 n_11} [2 p_1 (1 - p_1)]^{n_12} (1 - p_1)^{2 n_22}

Supports and Scores

It is usually more convenient to work with ln L, called the support. The derivatives of the support with respect to the parameters are called the scores:
S_{p_1} = ∂ ln L / ∂ p_1
The maximum likelihood estimates are those values of the parameters (e.g. p_1) that maximize the likelihood. They are found by setting the scores equal to 0 and simultaneously solving the resulting system of equations.

Maximum likelihood estimate of p_1

L(p_1) = [n! / (n_11! n_12! n_22!)] p_1^{2 n_11} [2 p_1 (1 - p_1)]^{n_12} (1 - p_1)^{2 n_22}
ln L(p_1) = const + (2 n_11 + n_12) ln(p_1) + (n_12 + 2 n_22) ln(1 - p_1),
where const collects the terms that do not involve p_1. Solving
S_{p_1} = (2 n_11 + n_12)/p_1 - (n_12 + 2 n_22)/(1 - p_1) = 0
gives the maximum likelihood estimate
p̂_1 = (2 n_11 + n_12) / (2n).
We know this is a maximum because
∂S_{p_1}/∂p_1 = -(2 n_11 + n_12)/p_1^2 - (n_12 + 2 n_22)/(1 - p_1)^2 ≤ 0 for all p_1.
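
A sketch checking this closed-form MLE against a numerical (grid) maximization of the support; constant terms are dropped since they do not affect the maximizer:

```python
import numpy as np

n11, n12, n22 = 12, 12, 15
n = n11 + n12 + n22

def support(p1):
    """ln L(p1) up to an additive constant, under HWE."""
    return (2 * n11 + n12) * np.log(p1) + (n12 + 2 * n22) * np.log(1 - p1)

p_grid = np.linspace(0.001, 0.999, 9999)
p_numeric = p_grid[np.argmax(support(p_grid))]
p_closed = (2 * n11 + n12) / (2 * n)      # MLE: (2*n11 + n12)/(2n)

print(p_numeric, p_closed)                # both approximately 0.4615
```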

Statistics Refresher: Properties of MLEs

Do not attempt to estimate two or more parameters that are functions of each other; for example, P_11 = 1 - P_12 - P_22 when there are only two alleles.
The MLE of a function of parameters is the function of the MLEs; for example, the MLE of p_i^2 is p̂_i^2.
The MLE may be biased. MLEs are consistent estimators under general conditions, so for very large samples the bias disappears.
The information about a parameter is the negative second derivative of the support, e.g.
I_{p_1} = -∂^2 ln L(p_1) / ∂p_1^2

Properties of MLEs (cont.)

For large samples, the variance of the MLE is the inverse of the expected information:
Var(p̂_1) ≈ 1 / E[I_{p_1}]
When the likelihood is a function of multiple independent parameters, e.g. P_11 and P_12, the information is a matrix, and the variance is obtained as the inverse of this matrix.
For large samples, the MLE is approximately normally distributed (and parameter vectors are multivariate normal). For example,
p̂_1 ~ N(p_1, {E[I(p_1)]}^{-1})
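
Under HWE the expected information works out to E[I(p_1)] = 2n / (p_1 (1 - p_1)), so the large-sample variance of the MLE is p_1(1 - p_1)/(2n), in agreement with the earlier variance formula when P_11 = p_1^2. A small simulation sketch comparing the two (multinomial sampling of genotypes under HWE is assumed):

```python
import numpy as np

rng = np.random.default_rng(4)
p1, n, reps = 0.46, 50, 100000
Q = [p1**2, 2 * p1 * (1 - p1), (1 - p1) ** 2]           # HWE genotype frequencies

counts = rng.multinomial(n, Q, size=reps)
mle = (2 * counts[:, 0] + counts[:, 1]) / (2 * n)       # closed-form MLE per sample

print(mle.var())                    # simulated variance of the MLE
print(p1 * (1 - p1) / (2 * n))      # 1 / E[I(p1)] = p1*(1 - p1)/(2n) under HWE
```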