
STAT 536: Genetic Statistics
Tests for Hardy-Weinberg Equilibrium
Karin S. Dorman, Department of Statistics, Iowa State University
September 7, 2006

Statistical Hypothesis Testing

Identify a hypothesis, an idea you want to test for its applicability to your data set. For example:
- Hardy-Weinberg equilibrium applies to this data set.
- The two loci I am studying are independent of each other.

Identify and calculate a test statistic. Ideally, the statistic should:
- summarize and accentuate any deviations of the data from what is expected under the hypothesis,
- have a known sampling distribution under the null hypothesis.

Compute or estimate the probability of the observed test statistic under the assumption of the hypothesis. Reject the hypothesis if this probability is small.

Statistical Hypothesis Testing - Terminology

- Null hypothesis (H0): the hypothesis you wish to test.
- Alternative hypothesis (HA): when you reject the null hypothesis, you conclude the alternative hypothesis or hypotheses.
- Type I error: the test statistic causes you to (erroneously) reject the hypothesis when it is true.
- Size, or significance level (α): the probability of a type I error.
- Type II error: you accept the hypothesis when it is false.
- β: the probability of a type II error.
- Power (1 − β): the probability that you reject the hypothesis when it is false.

Procedure: classically, one decides on the size of the test before collecting the data, then selects the most powerful test for the desired size.

                    Hypothesis Accepted   Hypothesis Rejected
Hypothesis True     1 − α                 Type I (size, α)
Hypothesis False    Type II (β)           power (1 − β)

Hardy-Weinberg Disequilibrium

Let u and v be alleles at a single locus. Then HWE implies
$$P_{uu} = p_u^2 \qquad P_{uv} = 2 p_u p_v \text{ whenever } u \ne v,$$
where $P_{uv}$ is the population genotype frequency and the $p_u$ are the population allele frequencies. Recall that if two random variables X and Y are independent, then P(X and Y) = P(X)P(Y). In English, knowing X tells you nothing about Y and vice versa. The HWE equations imply independence (no association) among the alleles at a locus.

Hardy-Weinberg Disequilibrium (cont)

These equations may not be satisfied in a population where any one (or more) of the Hardy-Weinberg assumptions is violated. When the equations are not satisfied, Hardy-Weinberg disequilibrium applies:
$$P_{uu} \ne p_u^2 \qquad P_{uv} \ne 2 p_u p_v \text{ whenever } u \ne v.$$
One could quantify this disequilibrium mathematically in multiple ways. We will consider the following for now:
- Suppose there is covariation among alleles. Write the disequilibrium in terms of this covariation.
- Consider subtractive (additive) disequilibrium: $P_{uu} - p_u^2$ and $P_{uv} - 2 p_u p_v$.
- Consider multiplicative disequilibrium once again and convert to a log-linear model.

Covariation Between Alleles

Let $x_j$ be an indicator variable indicating whether the jth allele of a random individual is allele 1. Recall the model
$$P_{11} = p_1^2 + p_1(1-p_1)f \qquad P_{12} = 2p_1(1-p_1)(1-f).$$
What is the meaning of f?
$$\mathrm{Var}(x_j) = p_1(1-p_1)$$
$$\mathrm{Cov}(x_i, x_j) = E(x_i x_j) - E(x_i)E(x_j) = P_{11} - p_1^2$$
$$\mathrm{Corr}(x_i, x_j) = \frac{\mathrm{Cov}(x_i, x_j)}{\sqrt{\mathrm{Var}(x_i)\,\mathrm{Var}(x_j)}} = \frac{P_{11} - p_1^2}{p_1(1-p_1)} = \frac{p_1^2 + p_1(1-p_1)f - p_1^2}{p_1(1-p_1)} = f.$$
So f is the correlation between the two alleles carried by a random individual.
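To make the algebra concrete, here is a minimal R sketch of the natural plug-in estimate of f; the genotype counts are hypothetical, not from the lecture:

# hypothetical genotype counts for genotypes 11, 12, 22
n11 <- 30; n12 <- 40; n22 <- 30
n <- n11 + n12 + n22
p1 <- (2 * n11 + n12) / (2 * n)        # sample allele-1 frequency
P11 <- n11 / n                         # sample genotype frequency
f.hat <- (P11 - p1^2) / (p1 * (1 - p1))  # plug-in correlation between alleles
f.hat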

Covariation Between >2 Alleles

What do you do with f when there are more than 2 alleles? Say, for example, there are 3 alleles u, v, and w. There can be correlation between u and v, or u and w, etc. We can subscript f: let $f_{uv}$ be the correlation between alleles u and v, where u can equal v. Then,
$$P_{uu} = p_u^2 + p_u(1-p_u) f_{uu} \qquad P_{uv} = 2 p_u p_v (1 - f_{uv}).$$
But we are not done; there are relationships among these parameters. If there are n different alleles, there are $n-1$ free allele frequencies $p_u$ and $\frac{n(n+1)}{2}$ correlation coefficients $f_{uv}$, for a total of $\frac{n^2+3n-2}{2}$ parameters. However, there are only $\frac{n(n+1)}{2} - 1$ free genotype counts in the data. How many relationships are there among the parameters?

Covariation Between >2 Alleles (cont)

There are
$$\text{d.f. in parameters} - \text{d.f. in data} = \frac{n^2+3n-2}{2} - \frac{n(n+1)}{2} + 1 = n$$
relationships among the model parameters. To find the relationships among the parameters, recall that
$$p_u = P_{uu} + \frac{1}{2}\sum_{v \ne u} P_{uv}.$$
There are n such relationships, and they will consume all the extra degrees of freedom. Substituting in the expressions for the genotype frequencies, we observe
$$f_{uu} = \frac{1}{1-p_u} \sum_{v \ne u} p_v f_{uv}.$$

Covariation Between >2 Alleles (2nd approach)

Suppose that associations between alleles are not a consequence of specific interactions among alleles. Suppose instead that there is a general association between alleles regardless of allele identity. (We will discuss how this can arise later.) Then there is one correlation f that applies to all pairs of alleles u and v, and the applicable equations are again
$$P_{uu} = p_u^2 + p_u(1-p_u) f \qquad P_{uv} = 2 p_u p_v (1-f)$$
where $u \ne v$. There are $n-1$ free allele frequencies $p_u$ and 1 free correlation f, leading to n free parameters. There are plenty of degrees of freedom, $\frac{n(n+1)}{2} - 1$, in the data to cover these parameters.

Problems with Correlations $f_{uv}$

But all is not perfect. Notice that f appears as a multiplier of allele frequencies: we model disequilibrium as a multiplicative factor that expands/shrinks genotype frequencies away from their HWE expectation. We derived some estimates for the correlation f in previous lecture(s). In general, estimation of multiplicative factors involves ratios of statistics, and ratios are notoriously difficult statistically. An easier approach, statistically, is to look for additive disequilibrium. Define
$$D_{uu} = P_{uu} - p_u^2 \qquad \tilde{D}_{uv} = P_{uv} - 2 p_u p_v.$$
Letting $D_{uv} = -\tilde{D}_{uv}/2$, we can write more conveniently
$$P_{uu} = p_u^2 + D_{uu} \qquad P_{uv} = 2 p_u p_v - 2 D_{uv} \text{ whenever } u \ne v.$$

d.f. for Additive Disequilibrium

Again, there must be n relationships among the parameters to account for the difference in degrees of freedom:
$$p_u = P_{uu} + \frac{1}{2}\sum_{v \ne u} P_{uv} = p_u^2 + D_{uu} + \sum_{v \ne u} (p_u p_v - D_{uv}) = p_u \sum_v p_v + D_{uu} - \sum_{v \ne u} D_{uv} = p_u + D_{uu} - \sum_{v \ne u} D_{uv},$$
leading to
$$D_{uu} = \sum_{v \ne u} D_{uv}.$$

Range of Additive Disequilibrium

Because $0 \le P_{uu}, P_{uv} \le 1$, we also recognize that the additive disequilibrium can't be just anything. In fact, the full list of constraints is
$$-p_u^2 \le D_{uu} \le 1 - p_u^2$$
$$p_u p_v - \tfrac{1}{2} \le D_{uv} \le p_u p_v$$
$$D_{uu} = \sum_{v \ne u} D_{uv}.$$
In the case of just two alleles 1 and 2 at a locus, $D_{11} = D_{12} = D_{22} \equiv D_1$, so
$$P_{11} = p_1^2 + D_1 \qquad P_{12} = 2 p_1 p_2 - 2 D_1 \qquad P_{22} = p_2^2 + D_1$$
and
$$\max_{u \in \{1,2\}} \{-p_u^2\} \le D_1 \le p_1 p_2.$$
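As a quick illustration, the feasible range of $D_1$ can be computed directly; this is a sketch with an assumed allele frequency:

p1 <- 0.3; p2 <- 1 - p1      # assumed allele frequency for illustration
lower <- max(-p1^2, -p2^2)   # binding non-negativity constraint on P11, P22
upper <- p1 * p2             # from P12 >= 0
c(lower, upper)              # D1 must lie in this interval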

Testing D_1 = 0 (HWE)

Testing for HWE is equivalent to testing the null hypothesis $H_0: D_1 = 0$. Here, we have restricted attention to the two-allele case. We need two things:
- An estimate $\hat{D}_1$. Is it close to 0?
- A sampling distribution for the estimate $\hat{D}_1$, to determine whether it is farther from 0 than we would expect by chance.

Estimating D_1

Two free parameters ($p_1$, $D_1$) and two free pieces of data ($n_{11}$, $n_{12}$) suggest Bailey's method:
$$n_{11} = n\left(p_1^2 + D_1\right) \qquad n_{12} = n\left(2 p_1 p_2 - 2 D_1\right)$$
can be solved to produce
$$\hat{p}_1 = \frac{2 n_{11} + n_{12}}{2n} = \tilde{p}_1 \qquad \hat{D}_1 = \frac{n_{11}}{n} - \hat{p}_1^2 = \tilde{P}_{11} - \tilde{p}_1^2.$$
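In code, Bailey's estimates are one-liners; a minimal R sketch with hypothetical counts:

# hypothetical genotype counts for genotypes 11, 12, 22
n11 <- 25; n12 <- 50; n22 <- 25
n <- n11 + n12 + n22
p1.hat <- (2 * n11 + n12) / (2 * n)   # MLE of the allele frequency
D1.hat <- n11 / n - p1.hat^2          # MLE of the additive disequilibrium
c(p1.hat, D1.hat)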

$\hat{D}_1$ Bias

Is our estimate $\hat{D}_1$ unbiased?
$$E(\hat{D}_1) = E(\tilde{P}_{11}) - E(\tilde{p}_1^2) = P_{11} - p_1^2 - \frac{1}{2n}\left(p_1 + P_{11} - 2p_1^2\right) = D_1 - \frac{1}{2n}\left[p_1(1-p_1) + D_1\right].$$
Since $E(\hat{D}_1) \ne D_1$, we conclude that the estimator is biased. However, we are encouraged to note that as $n \to \infty$, the bias goes to 0.

$\hat{D}_1$ Sampling Variance

Using Fisher's approximation (it applies because $\hat{D}_1$ is a function of proportions), we obtain an approximate sampling variance for $\hat{D}_1$:
$$\mathrm{Var}(\hat{D}_1) \approx \frac{1}{n}\left[p_1^2(1-p_1)^2 + (1-2p_1)^2 D_1 - D_1^2\right].$$
To estimate the sampling variance, we substitute in our estimates for $p_1$ and $D_1$:
$$\widehat{\mathrm{Var}}(\hat{D}_1) = \frac{1}{n}\left[\hat{p}_1^2(1-\hat{p}_1)^2 + (1-2\hat{p}_1)^2 \hat{D}_1 - \hat{D}_1^2\right].$$
Since $\hat{D}_1$ is the MLE, we have for large samples that
$$\hat{D}_1 \sim N\left(E(\hat{D}_1), \mathrm{Var}(\hat{D}_1)\right).$$

Testing H_0: D_1 = 0 Using z-values

Compute the standard normal variate
$$z = \frac{\hat{D}_1 - E(\hat{D}_1)}{\sqrt{\mathrm{Var}(\hat{D}_1)}}.$$
Under the null hypothesis $H_0$, z approximately follows the standard normal distribution, so compare z against the standard normal distribution. The key is that if $\hat{D}_1$ is very positive or very negative, then z will tend to be far from 0, and your statistic will fall in the tails of the sampling distribution, where it is not expected to fall if the null hypothesis of HWE is true.

Computing z

Under the null hypothesis, we know
$$E(\hat{D}_1) \approx 0 \qquad \mathrm{Var}(\hat{D}_1) \approx \frac{1}{n}\left[\hat{p}_1^2(1-\hat{p}_1)^2\right],$$
so
$$z = \frac{\sqrt{n}\,\hat{D}_1}{\hat{p}_1(1-\hat{p}_1)}.$$
Note, we have assumed n is sufficiently large that the bias term is negligible.
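Putting the pieces together, a minimal R sketch of the z statistic (hypothetical counts again; the two-sided p-value anticipates the next slide):

# hypothetical genotype counts for genotypes 11, 12, 22
n11 <- 30; n12 <- 40; n22 <- 30
n <- n11 + n12 + n22
p1 <- (2 * n11 + n12) / (2 * n)
D1 <- n11 / n - p1^2
z <- sqrt(n) * D1 / (p1 * (1 - p1))   # large-sample z statistic under H0
2 * pnorm(-abs(z))                    # two-sided p-value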

Relevant Alternative Hypotheses

$$P_{11} = p_1^2 + D_1 \qquad P_{12} = 2 p_1 p_2 - 2 D_1 \qquad P_{22} = p_2^2 + D_1$$

Depending on your purpose, there are different alternative hypotheses you might consider. Suppose z = 1.25.

$H_A: D_1 > 0$ or $D_1 < 0$ (i.e., $D_1 \ne 0$). You have no a priori feeling for whether heterozygotes will be over- or under-represented. Use a two-tailed test.

> 2*pnorm(q=-1.25)
[1] 0.2112995
> 2*(1-pnorm(q=1.25))
[1] 0.2112995

Relevant Alternative Hypotheses (cont)

One-sided hypotheses are appropriate when you suspected heterozygotes would be either under- or over-represented before you collected the data.

$H_A: D_1 > 0$. You suspect that heterozygotes will be under-represented. Use a one-tailed (right tail) test.

> (1-pnorm(q=1.25))
[1] 0.1056498

$H_A: D_1 < 0$. You suspect that heterozygotes will be over-represented. Use a one-tailed (left tail) test.

> pnorm(q=1.25)
[1] 0.8943502

Chi-Square Test for HWE

An equivalent test is the chi-square test for HWE. It compares $z^2$ against its sampling distribution, which under the null is a chi-square distribution with 1 degree of freedom:
$$z^2 = X_1^2 = \frac{n \hat{D}_1^2}{\hat{p}_1^2 (1-\hat{p}_1)^2}.$$
However, note that both positive and negative values of z give the same $z^2$ statistic, so it is not easy to consider one-sided alternative hypotheses. Since the tests are equivalent, use the z statistic for one-sided tests.

Test assumption: the sample size is large, so that normality (or the chi-square approximation) applies and the bias can be ignored.

Chi-Square Goodness-of-Fit

Genotype               11                    12                          22
Observed (O)           $n_{11}$              $n_{12}$                    $n_{22}$
Expected (E)           $n\hat{p}_1^2$        $2n\hat{p}_1(1-\hat{p}_1)$  $n(1-\hat{p}_1)^2$
Observed − Expected    $n\hat{D}_1$          $-2n\hat{D}_1$              $n\hat{D}_1$

Here we have made the assumption that n is sufficiently large that the bias terms are 0. The goodness-of-fit chi-square statistic is defined as
$$X_1^2 = \sum_{\text{genotypes}} \frac{(O - E)^2}{E} = \frac{(n\hat{D}_1)^2}{n\hat{p}_1^2} + \frac{(2n\hat{D}_1)^2}{2n\hat{p}_1(1-\hat{p}_1)} + \frac{(n\hat{D}_1)^2}{n(1-\hat{p}_1)^2}.$$
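A minimal R sketch of this goodness-of-fit computation, with hypothetical counts:

obs <- c(30, 40, 30)          # hypothetical counts: genotypes 11, 12, 22
n <- sum(obs)
p1 <- (2 * obs[1] + obs[2]) / (2 * n)
expected <- n * c(p1^2, 2 * p1 * (1 - p1), (1 - p1)^2)  # HWE expected counts
X2 <- sum((obs - expected)^2 / expected)                # goodness-of-fit statistic
pchisq(X2, df = 1, lower.tail = FALSE)                  # p-value on 1 d.f.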

Standard Cautions About Chi-Square Tests

- Apply only when the expected counts are large, say $E \ge 5$. Because the expected counts E appear in the denominator, small variation when they are small produces huge changes in $X_1^2$.
- Apply the Yates correction to account for the discrete nature of the data. Because the observed data are discrete but the sampling distribution (normal or chi-square) is continuous, the Yates continuity correction is recommended (see the sketch below):
$$X_1^2 = \sum_{\text{genotypes}} \frac{(|O - E| - 0.5)^2}{E}.$$
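The continuity correction changes a single line of the previous sketch; repeated here self-contained with the same hypothetical counts:

obs <- c(30, 40, 30); n <- sum(obs)       # hypothetical counts: 11, 12, 22
p1 <- (2 * obs[1] + obs[2]) / (2 * n)
expected <- n * c(p1^2, 2 * p1 * (1 - p1), (1 - p1)^2)
X2.yates <- sum((abs(obs - expected) - 0.5)^2 / expected)  # Yates-corrected
pchisq(X2.yates, df = 1, lower.tail = FALSE)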

Likelihood Ratio Tests

Suppose your null hypothesis is $H_0: \phi = \phi_0$ for some parameter $\phi$. Let the maximized likelihood under $H_0$ be $L_0$ and the maximized likelihood without the restriction on $\phi$ be $L_1$. Then $L_0$ can be no larger than $L_1$, since $\phi_0$ may not be the maximum likelihood value of $\phi$. However, if the null is true, $\hat{\phi}$ should be very close to $\phi_0$ and $L_0$ will be very close to $L_1$. Define the likelihood ratio as
$$\lambda = \frac{L_0}{L_1}.$$
When the null hypothesis is true and the dimension of $\phi$ is s, then
$$-2 \ln \lambda \sim \chi^2_{(s)}.$$

Likelihood Ratio Test for HWE

Under the unconstrained model (the alternative hypothesis), the parameters are $p_{11}, p_{12}, p_{22}$ and the data are $n_{11}, n_{12}, n_{22}$. There are two degrees of freedom in the model and in the data, so Bailey's method applies to yield
$$\hat{p}_{11} = \frac{n_{11}}{n} \qquad \hat{p}_{12} = \frac{n_{12}}{n}.$$
The maximized likelihood under the unconstrained model is
$$L_1 = \frac{n!}{n_{11}!\, n_{12}!\, n_{22}!} \left(\frac{n_{11}}{n}\right)^{n_{11}} \left(\frac{n_{12}}{n}\right)^{n_{12}} \left(\frac{n_{22}}{n}\right)^{n_{22}}.$$
Under the constrained model,
$$\hat{P}_{11} = \hat{p}_1^2 \qquad \hat{P}_{12} = 2\hat{p}_1(1 - \hat{p}_1).$$

Likelihood Ratio Test for HWE (cont)

The maximized likelihood under the constrained model is
$$L_0 = \frac{n!}{n_{11}!\, n_{12}!\, n_{22}!} \left(\frac{n_1}{2n}\right)^{2n_{11}} \left(\frac{2 n_1 n_2}{(2n)^2}\right)^{n_{12}} \left(\frac{n_2}{2n}\right)^{2n_{22}}.$$
The test statistic is therefore
$$-2\ln\lambda = -2\ln\left[\frac{n^n\, n_1^{2n_{11}}\, (2 n_1 n_2)^{n_{12}}\, n_2^{2n_{22}}}{(2n)^{2n}\, n_{11}^{n_{11}}\, n_{12}^{n_{12}}\, n_{22}^{n_{22}}}\right].$$
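A small R sketch of this statistic with hypothetical counts; it works on the log scale via the multinomial log-likelihoods (the multinomial coefficient cancels in the ratio), and assumes all three counts are positive:

# hypothetical genotype counts for genotypes 11, 12, 22 (all positive)
n11 <- 30; n12 <- 40; n22 <- 30
n <- n11 + n12 + n22
p1 <- (2 * n11 + n12) / (2 * n)
ll1 <- n11 * log(n11 / n) + n12 * log(n12 / n) + n22 * log(n22 / n)   # unconstrained
ll0 <- n11 * log(p1^2) + n12 * log(2 * p1 * (1 - p1)) + n22 * log((1 - p1)^2)  # HWE
G2 <- -2 * (ll0 - ll1)                 # -2 ln lambda
pchisq(G2, df = 1, lower.tail = FALSE)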

A Multiplicative Model That Works

Let us consider another multiplicative model:
$$P_{11} = M M_1^2 M_{11} \qquad P_{12} = 2 M M_1 M_2 M_{12} \qquad P_{22} = M M_2^2 M_{22}.$$
Here M is the mean effect, $M_{11}, M_{12}, M_{22}$ represent associations between alleles, and $M_1, M_2$ represent the allele frequency contributions. Taking logarithms puts this model back in an additive space:
$$\ln P_{11} = \ln M + 2\ln M_1 + \ln M_{11}$$
$$\ln P_{12} = \ln M + \ln 2 + \ln M_1 + \ln M_2 + \ln M_{12}$$
$$\ln P_{22} = \ln M + 2\ln M_2 + \ln M_{22}.$$

Handling Overparameterization

There are, as usual, more parameters than observations, and this time there are multiple ways to deal with the overparameterization. One way is to set
$$M_2 M_{12} = 1 \qquad M_2^2 M_{22} = 1;$$
then
$$P_{11} = M M_1^2 M_{11} \qquad P_{12} = 2 M M_1 \qquad P_{22} = M.$$
There is still an extra degree of freedom, but summing all three equations yields
$$1 = M\left(M_1^2 M_{11} + 2 M_1 + 1\right) \quad\text{or}\quad M = \frac{1}{1 + 2M_1 + M_1^2 M_{11}}.$$

Estimating Parameters M_1 and M_11

Again, Bailey's method applies and the maximum likelihood estimates are
$$\hat{M}_1 = \frac{n_{12}}{2 n_{22}} \qquad \hat{M}_{11} = \frac{4 n_{11} n_{22}}{n_{12}^2},$$
with the tag-along $\hat{M} = \frac{n_{22}}{n}$. Substituting these MLEs back into the original multiplicative equations produces the same likelihood under $H_A$:
$$\hat{P}_{11} = \frac{n_{11}}{n} \qquad \hat{P}_{12} = \frac{n_{12}}{n} \qquad \hat{P}_{22} = \frac{n_{22}}{n}$$
$$L_1 = \frac{n!}{n_{11}!\, n_{12}!\, n_{22}!} \left(\frac{n_{11}}{n}\right)^{n_{11}} \left(\frac{n_{12}}{n}\right)^{n_{12}} \left(\frac{n_{22}}{n}\right)^{n_{22}}.$$
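As a quick numerical check with hypothetical counts, substituting the MLEs back into the multiplicative model recovers the observed genotype proportions:

# hypothetical genotype counts for genotypes 11, 12, 22
n11 <- 30; n12 <- 40; n22 <- 30
n <- n11 + n12 + n22
M1 <- n12 / (2 * n22); M11 <- 4 * n11 * n22 / n12^2; M <- n22 / n
c(M * M1^2 * M11, 2 * M * M1, M)   # should equal the observed proportions below
c(n11, n12, n22) / n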

Log-Likelihood Test for the Multiplicative Model

HWE implies no interaction term, i.e. $M_{11} = 1$. Under this constraint, we again apply Bailey's method to find
$$\hat{P}_{11} = \left(\frac{n_1}{2n}\right)^2 \qquad \hat{P}_{12} = \frac{n_1 n_2}{2 n^2} \qquad \hat{P}_{22} = \left(\frac{n_2}{2n}\right)^2.$$
It turns out that $\lambda$ has the same form under this log-linear model as under the additive disequilibrium model, so the log-linear model for testing HWE is equivalent to the additive model.

Summary of Tests for HWE

- The normal approximation for MLEs uses the z statistic.
- The chi-square test uses the $X_1^2 = z^2$ statistic and is equivalent to the above test.
- The chi-square goodness-of-fit test is identical to the above chi-square test, but highlights the need for substantial data in each category.
- The likelihood ratio test is widely applicable and flexible when a likelihood function is available.
- The log-linear model uses a multiplicative model and leads to a test equivalent to the likelihood ratio test.
- The exact test is useful when the data set is small, and particularly when counts in some categories are small.

Tests for Multiple Alleles

When there are more than two alleles at a locus, the general equations apply:
$$P_{uu} = p_u^2 + D_{uu} \qquad P_{uv} = 2 p_u p_v - 2 D_{uv} \text{ whenever } u < v,$$
with the relationship
$$D_{uu} = \sum_{v > u} D_{uv} + \sum_{v < u} D_{vu},$$
and MLEs obtained by Bailey's method (verify this; it is not hard):
$$\hat{p}_u = \tilde{p}_u \qquad \hat{D}_{uv} = \tilde{p}_u \tilde{p}_v - \frac{1}{2}\tilde{P}_{uv}.$$
Under complete HWE, $D_{uv} = 0$ for all $u \ne v$; that is, there are $\frac{k(k-1)}{2}$ constraints applied (one for each heterozygote class) when there are k alleles.

Testing Complete HWE

Therefore, $-2\ln\lambda$ for $H_0: D_{uv} = 0$ for all $u \ne v$ approximately follows a chi-square distribution with $\frac{k(k-1)}{2}$ degrees of freedom. If you are more comfortable with a goodness-of-fit test, the statistic
$$X_T^2 = \sum_{u \le v} \frac{\left(|n_{uv} - E(n_{uv})| - 0.5\right)^2}{E(n_{uv})}$$
with
$$E(n_{uu}) = n \tilde{p}_u^2 \qquad E(n_{uv}) = 2 n \tilde{p}_u \tilde{p}_v$$
follows the same sampling distribution.
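A sketch of this k-allele goodness-of-fit test in R, for a hypothetical 3-allele locus with genotype counts stored in the upper triangle of a matrix:

# hypothetical genotype counts n_uv (u <= v) for a 3-allele locus
nuv <- matrix(0, 3, 3)
nuv[upper.tri(nuv, diag = TRUE)] <- c(20, 15, 12, 10, 8, 5)
# column-wise fill: n11=20, n12=15, n22=12, n13=10, n23=8, n33=5
n <- sum(nuv)
k <- nrow(nuv)
p <- (rowSums(nuv) + colSums(nuv)) / (2 * n)   # allele counts: 2*n_uu + sum of n_uv
expected <- 2 * n * outer(p, p)                # 2 n p_u p_v off the diagonal
diag(expected) <- n * p^2                      # n p_u^2 on the diagonal
obs <- nuv[upper.tri(nuv, diag = TRUE)]
exp.uv <- expected[upper.tri(expected, diag = TRUE)]
X2 <- sum((abs(obs - exp.uv) - 0.5)^2 / exp.uv)   # Yates-corrected X_T^2
pchisq(X2, df = k * (k - 1) / 2, lower.tail = FALSE)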

Testing Partial HWE

If you want to test only certain combinations of alleles, the tests are more complicated.

Test a single $D_{uv}$. Example: if $H_0: D_{12} = 0$, the likelihood ratio statistic $-2\ln\lambda$ follows a chi-square with 1 degree of freedom. But Bailey's method does not apply unless k = 2, so $L_0$ is difficult to compute; iterative methods are required. Alternatively, one could apply the z-test or the chi-square test; details for these tests follow on the next slide.

Test multiple, but not all, $D_{uv}$. Example: if k = 4 and $H_0: D_{12} = D_{34} = 0$, then $-2\ln\lambda$ follows a chi-square with 2 degrees of freedom. Iterative methods are still required, and this time there are no easy alternatives: complex hypotheses cause difficulties for the z-test and the related chi-square test, while the likelihood ratio test handles them quite naturally.

z-test for Partial HWE

Here, you only need the MLEs under the full alternative model (where Bailey's method applies). For large samples, the MLE $\hat{D}_{uv}$ is approximately normally distributed under $H_0$ with mean 0 and variance $\mathrm{Var}(\hat{D}_{uv})$, so
$$z_{uv} = \frac{\hat{D}_{uv}}{\sqrt{\mathrm{Var}(\hat{D}_{uv})}}.$$
We can again use Fisher's approximation to compute the variance:
$$\mathrm{Var}(\hat{D}_{uv}) = \frac{1}{2n}\Big\{ p_u p_v\left[(1-p_u)(1-p_v) + p_u p_v\right] + \left[(1-p_u-p_v)^2 - 2(p_u-p_v)^2\right] D_{uv} + \sum_{w \ne u,v}\left(p_u^2 D_{vw} + p_v^2 D_{uw}\right) - D_{uv}^2 \Big\}.$$
Under $H_0$, the variance is obtained by assuming $D_{uv} = 0$.
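If every disequilibrium term is set to zero under the null, the variance above reduces to $p_u p_v\left[(1-p_u)(1-p_v) + p_u p_v\right]/(2n)$. A minimal sketch assuming that simplification, with hypothetical sample quantities:

# hypothetical sample size, allele frequencies, and estimate for alleles u, v
n <- 100; pu <- 0.4; pv <- 0.3; Duv.hat <- 0.02
var0 <- pu * pv * ((1 - pu) * (1 - pv) + pu * pv) / (2 * n)  # null variance
z.uv <- Duv.hat / sqrt(var0)
2 * pnorm(-abs(z.uv))    # two-sided p-value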

Exact Tests

If the probability of the observed sample, plus that of all less likely samples, is small under the null hypothesis, then the data are unlikely to have arisen under the null hypothesis. If one can compute the probability of all possible samples, then obtaining an exact probability (no approximation) is possible. Exact tests are useful when all the possible observed data can be enumerated practically. This occurs generally when the expected counts are small in some categories (i.e., when the previous tests fail). The probability of the observed data $n_{11}, n_{12}, n_{22}$ is given by the multinomial distribution
$$P(n_{11}, n_{12}, n_{22}) = \frac{n!}{n_{11}!\, n_{12}!\, n_{22}!}\, p_{11}^{n_{11}}\, p_{12}^{n_{12}}\, p_{22}^{n_{22}}.$$

Exact Tests for HWE

When HWE applies,
$$P(n_{11}, n_{12}, n_{22}) = \frac{n!}{n_{11}!\, n_{12}!\, n_{22}!}\, p_1^{2n_{11}} (2 p_1 p_2)^{n_{12}} p_2^{2n_{22}}.$$
In addition, the allele counts $n_1$ and $n_2$ are binomially distributed:
$$P(n_1, n_2) = \frac{(2n)!}{n_1!\, n_2!}\, p_1^{n_1} p_2^{n_2}.$$
The conditional probability, where we condition on the observed allele counts, is
$$P(n_{11}, n_{12}, n_{22} \mid n_1, n_2) = \frac{P(n_{11}, n_{12}, n_{22}, n_1, n_2)}{P(n_1, n_2)} = \frac{P(n_{11}, n_{12}, n_{22})}{P(n_1, n_2)} = \frac{n!\, n_1!\, n_2!\, 2^{n_{12}}}{n_{11}!\, n_{12}!\, n_{22}!\, (2n)!}.$$

Exact Test for HWE (Example)

Suppose we observe $n_{11} = 10$, $n_{12} = 1$, $n_{22} = 2$, so that $n_1 = 21$ and $n_2 = 5$. Use an exact test to calculate a p-value for rejecting the null hypothesis of HWE. Enumerating all genotype configurations compatible with these allele counts:

n_11   n_12   n_22   Probability   Cumul. Prob.
10     1      2      0.026         0.026
9      3      1      0.348         0.374
8      5      0      0.626         1.000

The p-value is p = 0.026, since no compatible dataset is less likely than the observed one.
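The conditional probabilities above can be checked with a few lines of R; this sketch implements the conditional formula from the previous slide on the log scale:

# conditional probability of a genotype table given its allele counts
hwe.prob <- function(n11, n12, n22) {
  n <- n11 + n12 + n22
  n1 <- 2 * n11 + n12
  n2 <- 2 * n22 + n12
  exp(lfactorial(n) + lfactorial(n1) + lfactorial(n2) + n12 * log(2) -
      lfactorial(n11) - lfactorial(n12) - lfactorial(n22) - lfactorial(2 * n))
}
p <- c(hwe.prob(10, 1, 2), hwe.prob(9, 3, 1), hwe.prob(8, 5, 0))
p                      # ~0.026, 0.348, 0.626
sum(p[p <= p[1]])      # exact p-value: ~0.026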

Exact Tests for Multiple Alleles

The exact test generalizes to multiple alleles. The formula is
$$P(\{n_{uv}\} \mid \{n_u\}) = \frac{n!\, 2^H \prod_u n_u!}{(2n)! \prod_{u \le v} n_{uv}!},$$
where H is the total number of heterozygous individuals. Let's consider the following genotype data for a locus with 4 alleles (GenePOP example):

        A1    A2    A3    A4
A1       2    16    13     5
A2       -    15    21    12
A3       -     -     9     7
A4       -     -     -     0

Exact tests: computational difficulties

However, the computations quickly become overwhelming. Approximate exact tests were developed for this situation:

Guo, S. and Thompson, E. 1992. Performing the exact test of Hardy-Weinberg proportion for multiple alleles. Biometrics 48:361-372.

Lazzeroni, L. C. and Lange, K. 1997. Markov chains for Monte Carlo tests of genetic equilibrium in multidimensional contingency tables. Ann. Statist. 25:138-168.

Approximate Exact Tests

When it is impossible to enumerate all the possible datasets with the same allele frequencies, approximate methods are needed. One of the simplest uses Monte Carlo:

1. Calculate $F = P(\{n_{uv}\} \mid \{n_u\})$ for the observed data. Set S = 0.
2. Put all your genotypes (13, 66, 31, 16, ...) in a big vector of length 2n: 13663116.
3. Permute all the alleles and clump successive alleles into genotypes: (36)(16)(13)(16).
4. Compute $F' = P(\{n_{uv}\} \mid \{n_u\})$ for the permuted dataset. If $F' \le F$, increment S by 1.
5. Repeat steps 3-4 M times. Estimate the p-value as S/M.

A sketch of this procedure in R appears below.
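This is a minimal R sketch of the Monte Carlo procedure; the genotype vector is the small illustrative one from the slide, and log.cond.prob is a hypothetical helper implementing the multi-allele conditional probability on the log scale:

# log conditional probability of genotype counts given allele counts
log.cond.prob <- function(geno) {          # geno: 2-column matrix of alleles
  alleles <- as.vector(geno)
  H <- sum(geno[, 1] != geno[, 2])         # number of heterozygotes
  nu <- table(alleles)                     # allele counts
  key <- paste(pmin(geno[, 1], geno[, 2]), pmax(geno[, 1], geno[, 2]))
  nuv <- table(key)                        # genotype counts (unordered pairs)
  lfactorial(nrow(geno)) + sum(lfactorial(nu)) + H * log(2) -
    lfactorial(length(alleles)) - sum(lfactorial(nuv))
}

# genotypes 13, 66, 31, 16 from the slide
geno <- matrix(c(1, 3, 6, 6, 3, 1, 1, 6), ncol = 2, byrow = TRUE)
F.obs <- log.cond.prob(geno)
M <- 10000; S <- 0
for (i in 1:M) {
  perm <- matrix(sample(as.vector(geno)), ncol = 2)  # permute alleles, re-pair
  if (log.cond.prob(perm) <= F.obs) S <- S + 1
}
S / M   # Monte Carlo estimate of the p-value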

Example

Consider the following table of genotype counts for a locus with four alleles:

        A1    A2    A3    A4
A1       0
A2       3     1
A3       5    18     1
A4       3     7     5     2

Method                         Estimate
Exact                          0.01744
χ²                             0.02337
MC integration (M = 1700)      0.01706

Power Calculations

Recall that the power of a statistical test is the probability that you reject the null hypothesis when it is false. Before you begin collecting data, you may wish to estimate whether you will be able to detect a disequilibrium of a given size, say $D_1 = 0.10$; to detect it means you reject the null hypothesis that $D_1 = 0$. The test statistic
$$X_1^2 = \frac{n \hat{D}_1^2}{\hat{p}_1^2 (1-\hat{p}_1)^2}$$
follows a different distribution depending on whether $H_0$ is true or not:
$$H_0 \text{ true: } X_1^2 \sim \chi^2_{(1)} \qquad H_0 \text{ false: } X_1^2 \sim \chi^2_{(1,\nu)},$$
where $\chi^2_{(1,\nu)}$ is the noncentral chi-square distribution with noncentrality parameter $\nu$.

The Noncentrality Parameter

The noncentral chi-square distribution is an approximate distribution under $H_A$. The noncentrality parameter is given by
$$\nu = \frac{n D_1^2}{p_1^2 (1-p_1)^2}.$$
$\nu$ is bigger when $D_1$ is farther from 0. But note, the approximation is only valid when $\nu/n$ is small, say of order $n^{-1}$. If we take the standard significance level $\alpha = 0.05$, then we will reject $H_0$ if $X_1^2 > 3.84$. With this knowledge one can ask a couple of questions:
- How big does $D_1$ need to be in order to have a 90% chance of getting $X_1^2 > 3.84$, and therefore rejecting $H_0$ and detecting the disequilibrium?
- How big does my sample n need to be in order to detect a disequilibrium $D_1 = 0.1$ with 90% probability?

Size of D_1 or n

Rearranging the noncentrality parameter gives
$$D_1 = p_1(1-p_1)\sqrt{\frac{\nu}{n}}.$$

> pchisq( qchisq(0.05, 1, lower.tail=F), 1, ncp=10.5, lower.tail=F )
[1] 0.899799

When ν = 10.5, $X_1^2$ will exceed 3.84 with 90% probability. (These kinds of results are tabulated or available in statistics packages.) So, for alternative sample sizes, you can see how large a disequilibrium $D_1$ you will be likely (with 90% probability) to detect. Further rearrangement yields
$$n = \frac{\nu\, p_1^2 (1-p_1)^2}{D_1^2},$$
which can be used to compute the size of the sample we'll need to detect a specified disequilibrium $D_1$. Note, for both of these applications, you need to know $p_1$ to use the formulas.
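For example, a short R sketch (with an assumed p1 and a target D1) that computes the required sample size for roughly 90% power at α = 0.05:

p1 <- 0.5; D1 <- 0.1                 # assumed allele frequency, target disequilibrium
nu <- 10.5                           # noncentrality giving ~90% power at alpha = 0.05
n <- nu * p1^2 * (1 - p1)^2 / D1^2   # required sample size
ceiling(n)
# check the power this n delivers
pchisq(qchisq(0.05, 1, lower.tail = FALSE), 1,
       ncp = n * D1^2 / (p1^2 * (1 - p1)^2), lower.tail = FALSE)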

Power Calculations for Exact Tests

Recall, you need to calculate
$$P(\{n_{uv}\} \mid \{n_u\}) = \frac{P(\{n_{uv}\})}{P(\{n_u\})}.$$
Under the alternative hypothesis, the genotype frequencies are given by the formulas
$$P_{uu} = p_u^2 + D_{uu} \qquad P_{uv} = 2 p_u p_v - 2 D_{uv} \text{ whenever } u \ne v,$$
so you can compute the numerator
$$P(\{n_{uv}\}) = \frac{n!}{\prod_{u \le v} n_{uv}!} \prod_{u \le v} P_{uv}^{n_{uv}},$$
but you can no longer assume the binomial distribution for the allele counts.

Power Calculations for Exact Tests (cont) But, of course, n u = n uu + 1 n uv. 2 so if you know P ({n uv }), you can compute P ({n u }) by summing over the former. u<v