A consideration of the chi-square test of Hardy-Weinberg equilibrium in a non-multinomial situation

Similar documents
STAT 536: Genetic Statistics

8. Genetic Diversity

LECTURE # How does one test whether a population is in the HW equilibrium? (i) try the following example: Genotype Observed AA 50 Aa 0 aa 50

Goodness of Fit Goodness of fit - 2 classes

19. Genetic Drift. The biological context. There are four basic consequences of genetic drift:

Question: If mating occurs at random in the population, what will the frequencies of A 1 and A 2 be in the next generation?

Bayesian analysis of the Hardy-Weinberg equilibrium model

Case-Control Association Testing. Case-Control Association Testing

T.I.H.E. IT 233 Statistics and Probability: Sem. 1: 2013 ESTIMATION AND HYPOTHESIS TESTING OF TWO POPULATIONS

Mechanisms of Evolution

Major Genes, Polygenes, and

Problems for 3505 (2011)

EDWARD D. ROTHMAN Departments of Statistics and Human Genetics CHARLES F. SING Department of Human Genetics AND

STAT 536: Migration. Karin S. Dorman. October 3, Department of Statistics Iowa State University

Chapter 16. Table of Contents. Section 1 Genetic Equilibrium. Section 2 Disruption of Genetic Equilibrium. Section 3 Formation of Species

Microevolution Changing Allele Frequencies

Single Sample Means. SOCY601 Alan Neustadtl

Outline of lectures 3-6

Population Structure

Outline of lectures 3-6

Sociology 6Z03 Review II

Chapter 2. Review of basic Statistical methods 1 Distribution, conditional distribution and moments

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007)

Computational Systems Biology: Biology X

Normal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification,

Population Genetics I. Bio

Solutions to Even-Numbered Exercises to accompany An Introduction to Population Genetics: Theory and Applications Rasmus Nielsen Montgomery Slatkin

I of a gene sampled from a randomly mating popdation,

Case Studies in Ecology and Evolution

Selection Page 1 sur 11. Atlas of Genetics and Cytogenetics in Oncology and Haematology SELECTION

MATH4427 Notebook 2 Fall Semester 2017/2018

Chapter 9 Inferences from Two Samples

Segregation versus mitotic recombination APPENDIX

1.3 Forward Kolmogorov equation

Hypothesis Tests and Estimation for Population Variances. Copyright 2014 Pearson Education, Inc.

Population genetics snippets for genepop

Statistical Genetics I: STAT/BIOST 550 Spring Quarter, 2014

CIVL /8904 T R A F F I C F L O W T H E O R Y L E C T U R E - 8

Statistical inference (estimation, hypothesis tests, confidence intervals) Oct 2018

STAT 536: Genetic Statistics

The number of distributions used in this book is small, basically the binomial and Poisson distributions, and some variations on them.

The E-M Algorithm in Genetics. Biostatistics 666 Lecture 8

A Robust Test for Two-Stage Design in Genome-Wide Association Studies

The statistical evaluation of DNA crime stains in R

A maximum likelihood method to correct for allelic dropout in microsatellite data with no replicate genotypes

The goodness-of-fit test Having discussed how to make comparisons between two proportions, we now consider comparisons of multiple proportions.

The purpose of this section is to derive the asymptotic distribution of the Pearson chi-square statistic. k (n j np j ) 2. np j.

Lecture 1 Hardy-Weinberg equilibrium and key forces affecting gene frequency

Lecture 1: Case-Control Association Testing. Summer Institute in Statistical Genetics 2015

Evolution Module. 6.2 Selection (Revised) Bob Gardner and Lev Yampolski

You can compute the maximum likelihood estimate for the correlation

Educational Items Section

Objective Bayesian Hypothesis Testing

Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information #

Neutral Theory of Molecular Evolution

Processes of Evolution

1. Understand the methods for analyzing population structure in genomes

Just Enough Likelihood

Logistic Regression Analysis

Summary of Chapters 7-9

Microevolution 2 mutation & migration

Section VII. Chi-square test for comparing proportions and frequencies. F test for means

EXERCISES FOR CHAPTER 3. Exercise 3.2. Why is the random mating theorem so important?

Breeding Values and Inbreeding. Breeding Values and Inbreeding

Darwinian Selection. Chapter 6 Natural Selection Basics 3/25/13. v evolution vs. natural selection? v evolution. v natural selection

Statistics Handbook. All statistical tables were computed by the author.

Study of similarities and differences in body plans of major groups Puzzling patterns:

Lecture 10: Generalized likelihood ratio test

AEC 550 Conservation Genetics Lecture #2 Probability, Random mating, HW Expectations, & Genetic Diversity,

Functional divergence 1: FFTNS and Shifting balance theory

Lecture 2: Introduction to Quantitative Genetics

Chapter 10. Chapter 10. Multinomial Experiments and. Multinomial Experiments and Contingency Tables. Contingency Tables.

Life Cycles, Meiosis and Genetic Variability24/02/2015 2:26 PM

Dr. Junchao Xia Center of Biophysics and Computational Biology. Fall /8/2016 1/38

Institute of Actuaries of India

Example. χ 2 = Continued on the next page. All cells

LOOKING FOR RELATIONSHIPS

Testing Independence

Outline of lectures 3-6

POLI 443 Applied Political Research

11-2 Multinomial Experiment

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

BIOL Biometry LAB 6 - SINGLE FACTOR ANOVA and MULTIPLE COMPARISON PROCEDURES

Partitioning Genetic Variance

2. Map genetic distance between markers

0 0'0 2S ~~ Employment category

- point mutations in most non-coding DNA sites likely are likely neutral in their phenotypic effects.

1.5.1 ESTIMATION OF HAPLOTYPE FREQUENCIES:

Review: General Approach to Hypothesis Testing. 1. Define the research question and formulate the appropriate null and alternative hypotheses.

Statistical Analysis for QBIC Genetics Adapted by Ellen G. Dow 2017

Two-Sample Inferential Statistics

Space Time Population Genetics

Review of One-way Tables and SAS

THE MAYFIELD METHOD OF ESTIMATING NESTING SUCCESS: A MODEL, ESTIMATORS AND SIMULATION RESULTS

TUTORIAL 8 SOLUTIONS #

Exam 2 (KEY) July 20, 2009

Psychology 282 Lecture #4 Outline Inferences in SLR

Journal of Educational and Behavioral Statistics

Introduction to Linkage Disequilibrium

Math 152. Rumbos Fall Solutions to Exam #2

Transcription:

Ann. Hum. Genet., Lond. (1975), 39, 141 Printed in Great Britain 141 A consideration of the chi-square test of Hardy-Weinberg equilibrium in a non-multinomial situation BY CHARLES F. SING AND EDWARD D. ROTHMAN Departments of Human Genetics and Statistics, University of Michigan, Ann Arbor, Michigan 48104, U.S.A. Deviation from the Hardy-Weinberg proportion of each genotype frequency may be attributable to a number of forces (selection, inbreeding, mutation, migration, subdivision) operating in the population sampled. Smith (1970) has derived estimators of these deviations and discussed the difficulty of sorting out the effects of inbreeding, population subdivision, and silent alleles from the effects selection may have on the frequencies of genotypes. An additional condition which may contribute to deviations from Hardy-Weinberg that has not received adequate attention is non-independent sampling of individuals collected to detect the operation of such forces in a population. For example, the detection of an increase in the frequency of homozygotes and a decrease of heterozygotes in small populations of humans due to inbreeding is an obvious first consideration of the geneticist who is interested in studying the effects of selection on genotype frequencies. However, the correct statement of probability of observing a test statistic computed from the sample usually depends on the assumption that members of the sample are chosen independently of one another. In this report we investigate the effect of non-independence of observations, characterized by the multinomial Dirichlet, on the level of an a = 0.05 size (significance level) x2 goodness-of-fit test of Hardy-Weinberg. A modified test is presented which gives the expected type I error when the null hypothesis of Hardy-Weinberg is true and there is sampling correlation. We begin by defining the relative frequencies of genotypes for a locus with two codominant alleles in a very large reference population of diploid individuals. Letting p be the frequency of one allele A and q (= 1 -p) be the frequency of the other, a, the expected relative frequencies in this population of the AA, Aa and aa genotypes are p2, 2pq, and q2, respectively. The test most often utilized on a sample of size N to detect inbreeding in this reference population is based on the value of X2: x2 = (Tl-P2N)2 +v2-2pw2 +(T3-g2"2 P2N 2PQN q2n ' where T,, T2 and T3 are the observed number of AA, Aa and aa, respectively, and T,+T2+ T3 = N individuals randomly drawn from the population. (Significantly large values of X2 are taken to indicate inbreeding.) The statistic, X2, is distributed as a x2 with one degree of freedom when p is estimated f3 = (T,+$T2)/N, the population is large compared to the sample, the expected number of each genotype is at least 5, and the individuals in the sample are unrelated, i.e. stochastically independent. Ward & Sing (1970) considered the statistical power of this goodness-of-fit test to detect inbreeding. It was shown that very large samples are required to be confident that x2 will reject F = 0 when in fact F in the population being sampled takes on a value which is reasonable for humans (i.e. F < 0.10). For example, a sample of 105,000 is required to be 90% confident of detecting F = 0.01 when a 5% test (1)

142 C. F. SINU AND E. D. ROTHMAN criterion is chosen (10,500,000 when P = 0.001). Such samples are unreasonably large in view of the above conditions stating the relationship of population size to sample size. PROBABILITY MODEL Consider a parameter p defined as the average correlation between the genotypes of any two individuals sampled from a population with Hardy-Weinberg genotype frequencies. This correlation could be attributable to co-ancestry of the individuals forming the sample, finiteness of the population being sampled, or any other bias in the sampling procedure which leads to non-independence of the observations to be used for analysis. In a previous paper (Rothman, Sing & Templeton, 1974) we suggested the multinomial Dirichlet (MD) as a simple model for the analysis of measures of correlation between pairs of alleles within and between individuals in a population divided into subpopulations. Here we apply the multinomial Dirichlet distribution to describe the sampling process whereby there is a correlation between the genotypes of any pair of individuals in a sample drawn from a single unsubdivided population. Even though it may be argued that the underlying set of biological assumptions necessary to derive such a model are not met, experience has shown that in practice a useful distribution need only accurately describe the measurable outcome of the process being modelled. The multinomial Dirichlet can be derived by mixing the multinomial with a Dirichlet distribution (see Johnson & Kotz, 1969). Briefly, if (Tl, T,, T3) conditional on (el, 6,, 6,) is multinomial (N; 01, O,, 0,) and (el, O,, 6,) is Dirichlet ($Pl, $P2, $I?,), then the unconditional distribution of (Tl, T,, T3) is multinomial Dirichlet (N; p, G, P,, P3), where $ = (1-p)/p. In forming a sample, when the Hardy-Weinberg law is true and p # 0.0, on the first draw the probability of drawing each genotype is Pl = p2, P, = 2pq and P3 = q2. However on the jth succeeding draw these probabilities conditional on the past are and P3* = 1 - Pl* - P2*. Alternatively the probability mass function of the multinomial Dirichlet may be written where q5 = (1 -p)/p. For this parameterization of the sampling scheme as $ -+ 00 (p + 0) equation (2) reduces to the multinomia.1 (N; Pl, P,, P3): We note that the expected Ti, i = 1, 2, 3 is the same under either model while the variance of Ti when the MD is true is (1 +p(n- 1)) multiplied by the variance of Ti when the multi-

x2 test of Hardy- Weinberg equilibrium 143 nomial is true. The latter result forms the basis of the modified x2 test given below. We will use the multinomial Dirichlet distribution to evaluate the effect of correlation between observations forming a sample on the type I error of the x2 goodness-of-fit test given by equation (1). CHI-SQUARE RESULTS The type I error for the x2 goodness-of-fit test of the Hardy-Weinberg hypothesis when sampling follows the multinomial Dirichlet distribution was computed for each combination of three allele frequencies (0.05, 0.25 and 0.50) with four values of p (0.0, 0.001, 0.01 and 0.1) and two sample sizes (50 and 100). The exact level of the test based on large values of X2 when samples are drawn from the multinomial Dirichlet with (Pl, P2, P3) = (p2, 2pq, q2) is computed as L = c MD (N; P, P2, 2p4, q2), 6, where 3 (Tl, T2, TJ: 2 Ti = N, Ti > 0, X2 > c i=l and c = the 95th percentile of a x:~.~.) distribution. In addition, rather than computing the values of X2 using known values of allele frequency, in which case the expected X2 has 2 degrees of freedom and the critical value will be 5.991, we considered the X2 statistic based on the estimate of p = (Tl+&T2)/N. In the latter case X2 is compared to the 95th percentile of a x2 with 1 D.F., namely 3.84. In cases where expected numbers in any class did not exceed 5, a Yates correction was applied. The solid lines in Fig. 1 give the probability of rejecting the Hardy-Weinberg expectation given the values of allele frequency, N, and p considered. For the range of allele frequencies considered, the type I error increases as p increases. There is a substantial deviation from the expected a level of the test due to sample correlation. As expected, the larger sample size rejects the hypothesis of Hardy-Weinberg when p is not 0 more often than for the smaller sample size. The type I error when p = 0 is a measure of the error of an a = 0.05 level test when sampling is multinomial. Again, as expected, this error increases as sample size decreases and allele frequency deviates from 0.5. The peculiar result for p = 0.05 based on samples of size 50 reflects the fact that the x2 dist,ribution does not approximate the distribution of X2 even when p = 0. The x2 goodness-of-fit for Hardy-Weinberg equilibrium is very sensitive to non-independent sampling. These results are not unexpected since, if we consider the testing problem H,: multinomial (N; p2, 2pq, q2) versus HA: MD (N; p, p2, 2pq, q2), the locally (p--f 0) most powerful test is based on large values of X2! This result obtains from the expansion of the likelihood ratio test about p = 0. Since HA is the null hypothesis in our investigation, the observed level of the x2 test as a function of p is in fact the power of this test in the problem H, versus HA. I MODIFICATION OF THE X2 TEST Consideration of the expectation of the X2 statistic when there is correlation among individuals implies that E (X2)p) = D.F.//?, I0 HGE 39

0~100 144 C. F. SING AND E. D. ROTHMAN P known 0.50 045-0.40 P= 0.05-0.25 - al I I 1 N=50 O*OOO 0~001 0.010 0.100 0.80 - P - 0.75 N=lOO 0.70 P= 0.25 0.65 0.60 - L 0.55-0.50.! 045 - ; 0.40 - I- 0.35 0.30-0.25-0.20-045 - n 0.50 r 0.45 0.40 0.35 0.30 0.25 0.20 0.1 5 0.1 0 L 0.05 I- P estimated 0.000 0.001 0.010 ~ 0.80 - P 0.75-0.70 - P=0.25 0.65-0.60-0.55-0.50-045 - 040 0.35-0.30 0.25 0.20-0.15 O.Oo0 0.001 0.010 0*1w O~OOO 0.001 0.01 0 0.1 00 0.80 P N-50 N 4 00 N=50 0.65 L L O 0.55 Oe60 I u - 0.50 u a. 045 h I- 0.40-0.35 0.30-0.25-0.20-045 - P=0*50 / N=lOO N-50 --- -------- N-50 I I I N=lOO. 0.000 0.001 0.010 0.100 O.Oo0 o*w1 0.010 0.100 P P Fig. 1. A comparison of the goodness-of-fit X2 (solid line) with bx2 (dashed line) for two sample sizes, three allele frequencies and four values of p when sampling is according to the multinomial Dirichlet.

x2 test of Hardy- Weinberg equilibrium 145 where p = 1/[1 +p(n- l)]. This result does not require the assumption of the multinomial Dirichlet. For, if we set where and N _. q = 2 Sij (i = 1, 2, 3), j=1 Sij = 1 if jth observation is in cell i, 0 otherwise Pr (Sij = 1, Sij. = 1) = (1 -p)pi2+ppp, (j # j'), the expected X2 is as stated when c; are known. The type I error for the test statistic px2 was computed for the same combinations of p, p, and N in the manner described above. The results are also given in Fig. 1 (dashed line). Essentially, the corrected X2 gives an approximate 5 % test when H-W is true regardless of the level of p considered here (on the order of l/n). We note that, in practice, choice of a larger value of p than is correct for a population will lead to a conservative test. SUMMARY The multinomial Dirichlet distribution was used to study the effect of correlation between observations in a sample on the frequency of rejection of the Hardy-Weinberg Law when in fact it was true in the population being sampled. It was shown that the usual x2 goodness-of-fit test of Hardy-Weinberg is very sensitive to non-multinomial sampling. In view of the lack of statistical power of the test to detect deviations due to inbreeding, it is likely that whenever H-W is rejected using samples of size 100 or less, the underlying causation is sample correlation rather than failure of the H-W law to be true. Related to these findings is, of course, the effect of pooling heterogeneous frequencies or, in the case of contingency tables, Simpson's paradox (see Simpson, 1951). This work was supported by U.S. Atomic Energy Commission Contract AT( 11-1)-1552. The authors wish to thank Professor C. A. B. Smith for his kind advice during the preparation of this manuscript. REFERENCES JOHNSON, N. L. & KOTZ, S. (1969). Discrete Distributions. Somerset, N.J.: Wiley. ROTHMAN, E. D., SING, C. F. & TEMPLETON, A. R. (1974). A model for analysis of population structure. Genetics 78, 943-60. SIMPSON, E. H. (1951). The interpretation of interaction in contingency tables. Journal of the RoyaZStatisticaZ Society B 13 (2), 238-41. SMITH, C. A. B. (1970). A note on testing the Hardy-Weinberg law. Annals of Human Genetics 33, 377-83. WARD, R. H. & SING, C. F. (1970). A consideration of the power of the x2 test to detect inbreeding effects in natural populations. American Naturalist 104, 355-63. 10-2