Statistical Hypothesis Testing. STAT 536: Genetic Statistics. Statistical Hypothesis Testing - Terminology. Hardy-Weinberg Disequilibrium

Statistical Hypothesis Testing

STAT 536: Genetic Statistics
Karin S. Dorman
Department of Statistics, Iowa State University
September 7, 2006

- Identify a hypothesis, an idea you want to test for its applicability to your data set. For example:
  - Hardy-Weinberg equilibrium applies to this data set.
  - The two loci I am studying are independent of each other.
- Identify and calculate a test statistic. Ideally, the statistic should
  - summarize and accentuate any deviations of the data from what is expected under the hypothesis,
  - have a known sampling distribution under the null hypothesis.
- Compute or estimate the probability of the observed test statistic under the assumption of the hypothesis.
- Reject the hypothesis if this probability is small.

Statistical Hypothesis Testing - Terminology

- Null hypothesis ($H_0$): the hypothesis you wish to test.
- Alternative hypothesis ($H_a$): when you reject the null hypothesis, you conclude the alternative hypothesis or hypotheses.
- Type I error: the test statistic causes you to (erroneously) reject the hypothesis when it is true.
- Size, or significance level ($\alpha$): the probability of a type I error.
- Type II error: you accept the hypothesis when it is false.
- $\beta$: the probability of a type II error.
- Power ($1 - \beta$): the probability that you reject the hypothesis when it is false.
- Procedure: classically, one decides on the size of the test before collecting the data, then selects the most powerful test for the desired size.

                    Hypothesis Accepted    Hypothesis Rejected
Hypothesis True     $1 - \alpha$           Type I (size, $\alpha$)
Hypothesis False    Type II ($\beta$)      power ($1 - \beta$)

Hardy-Weinberg Disequilibrium

Let $u$ and $v$ be alleles at a single locus. Then, HWE implies

$$P_{uu} = p_u^2, \qquad P_{uv} = 2 p_u p_v \quad \text{whenever } u \ne v,$$

where $P_{uv}$ is the population genotype frequency and $p_u$ are the population allele frequencies.

Recall that if two random variables $X$ and $Y$ are independent, then $P(X \text{ and } Y) = P(X)P(Y)$. In English, knowing $X$ tells you nothing about $Y$ and vice versa. The HWE equations imply independence (or no association) among the alleles at a locus.

Hardy-Weinberg Disequilibrium (cont)

These equations may not be satisfied in a population where any one (or more) of the Hardy-Weinberg assumptions is (are) violated. When the equations are not satisfied, a Hardy-Weinberg disequilibrium applies:

$$P_{uu} \ne p_u^2, \qquad P_{uv} \ne 2 p_u p_v \quad \text{whenever } u \ne v.$$

One could mathematically quantitate this disequilibrium in multiple ways. We will consider two for now:

- Suppose there is covariation among alleles. Write the disequilibrium in terms of this covariation.
- Consider subtractive disequilibrium: $P_{uu} - p_u^2$ and $P_{uv} - 2 p_u p_v$.
- Consider multiplicative disequilibrium once again and convert to a log-linear model.

Covariation Between Alleles

Let $x_j$ be an indicator variable indicating whether the $j$th allele of a random individual is allele 1. Recall the model

$$P_{11} = p_1^2 + p_1(1 - p_1) f, \qquad P_{12} = 2 p_1 (1 - p_1)(1 - f).$$

What is the meaning of $f$?

$$\mathrm{Var}(x_j) = p_1 (1 - p_1)$$
$$\mathrm{Cov}(x_i, x_j) = E(x_i x_j) - E(x_i) E(x_j) = P_{11} - p_1^2$$
$$\mathrm{Corr}(x_i, x_j) = \frac{\mathrm{Cov}(x_i, x_j)}{\sqrt{\mathrm{Var}(x_i)\,\mathrm{Var}(x_j)}} = \frac{P_{11} - p_1^2}{p_1(1 - p_1)} = \frac{p_1^2 + p_1(1 - p_1) f - p_1^2}{p_1(1 - p_1)} = f.$$

Covariation Between > 2 Alleles

What do you do with $f$ when there are more than 2 alleles? Say, for example, there are 3 alleles $u$, $v$, and $w$. There can be correlation between $u$ and $v$ or $u$ and $w$, etc. We can subscript $f$. Let $f_{uv}$ be the correlation between alleles $u$ and $v$, where $u$ can equal $v$. Then,

$$P_{uu} = p_u^2 + p_u (1 - p_u) f_{uu}, \qquad P_{uv} = 2 p_u p_v (1 - f_{uv}).$$

But, we are not done. There are relationships among these parameters. If there are $k$ different alleles, there are $k - 1$ free allele frequencies $p_u$ and $k(k+1)/2$ correlation coefficients $f_{uv}$, for a total of $(k^2 + 3k - 2)/2$ parameters. However, there are only $k(k+1)/2 - 1$ free genotype counts in the data. How many relationships are there among the parameters?

Covariation Between > 2 Alleles (cont)

There are

$$\text{d.f. in parameters} - \text{d.f. in data} = \frac{k^2 + 3k - 2}{2} - \frac{k(k+1)}{2} + 1 = k$$

relationships among the model parameters. To find the relationships among the parameters, recall that

$$p_u = P_{uu} + \frac{1}{2} \sum_{v \ne u} P_{uv}.$$

There are $k$ such relationships, and they will consume all the extra degrees of freedom. Substitute in the expressions for the genotype frequencies to observe

$$f_{uu} = \frac{1}{1 - p_u} \sum_{v \ne u} p_v f_{uv}.$$

Covariation Between > 2 Alleles (2nd approach)

Suppose that associations between alleles are not a consequence of specific interactions among alleles. Suppose instead that there is a general association between alleles regardless of allele identity. (We will discuss how this can arise later.) Then, there is one correlation that applies to all pairs of alleles $u$ and $v$, and the applicable equations are again

$$P_{uu} = p_u^2 + p_u (1 - p_u) f, \qquad P_{uv} = 2 p_u p_v (1 - f) \quad \text{where } u \ne v.$$

There are $k - 1$ free allele frequencies $p_u$ and 1 free correlation $f$, leading to $k$ free parameters. There are plenty of degrees of freedom, $k(k+1)/2 - 1$, in the data to cover these parameters.

Problems with Correlations $f_{uv}$

But all is not perfect. Notice that $f$ appears as a multiplier of allele frequencies. We model disequilibrium as a multiplicative factor that expands or shrinks genotype counts away from the HWE expectation. We derived some estimates for the correlation $f$ in previous lecture(s). In general, estimation of multiplicative factors involves ratios of statistics, and ratios are notoriously difficult statistically.

An easier approach, statistically, is to look for additive disequilibrium. Define $D_{uu} = P_{uu} - p_u^2$ and write the heterozygote deviation $P_{uv} - 2 p_u p_v$ as $-2 D_{uv}$, so that, more conveniently,

$$P_{uu} = p_u^2 + D_{uu}, \qquad P_{uv} = 2 p_u p_v - 2 D_{uv} \quad \text{whenever } u \ne v.$$

d.f. for Additive Disequilibrium

Again, there must be relationships among the parameters to account for the difference in degrees of freedom.

$$p_u = P_{uu} + \frac{1}{2} \sum_{v \ne u} P_{uv} = p_u^2 + D_{uu} + \sum_{v \ne u} (p_u p_v - D_{uv}) = p_u \sum_{v} p_v + D_{uu} - \sum_{v \ne u} D_{uv} = p_u + D_{uu} - \sum_{v \ne u} D_{uv},$$

leading to

$$D_{uu} = \sum_{v \ne u} D_{uv}.$$

Range of Additive Disequilibrium

Because $0 \le P_{uu}, P_{uv} \le 1$, we also recognize that the additive disequilibrium can't be just anything. In fact, the total list of constraints is

$$-p_u^2 \le D_{uu} \le 1 - p_u^2$$
$$p_u p_v - \tfrac{1}{2} \le D_{uv} \le p_u p_v$$
$$D_{uu} = \sum_{v \ne u} D_{uv}$$

In the case of just two alleles at a locus, 1 and 2, $D_{11} = D_{22} = D_{12}$, so

$$P_{11} = p_1^2 + D_{12}, \qquad P_{12} = 2 p_1 p_2 - 2 D_{12}, \qquad P_{22} = p_2^2 + D_{12}$$

and

$$\max_{u \in \{1,2\}} \left(-p_u^2\right) \le D_{12} \le p_1 p_2.$$
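As a small illustration of the two-allele constraints, the sketch below computes the admissible range of $D_{12}$ for a given $p_1$ and the genotype frequencies implied by a chosen $D_{12}$. The function names `d12_bounds` and `genotype_freqs` are hypothetical and are not part of the lecture.

```python
def d12_bounds(p1):
    """Return (lower, upper) bounds on D_12 when there are two alleles."""
    p2 = 1.0 - p1
    lower = max(-p1 ** 2, -p2 ** 2)  # keeps P_11 and P_22 non-negative
    upper = p1 * p2                  # keeps P_12 non-negative
    return lower, upper


def genotype_freqs(p1, d12):
    """Return (P_11, P_12, P_22) under the additive disequilibrium model."""
    lower, upper = d12_bounds(p1)
    if not lower <= d12 <= upper:
        raise ValueError("D_12 is outside its admissible range")
    p2 = 1.0 - p1
    return p1 ** 2 + d12, 2 * p1 * p2 - 2 * d12, p2 ** 2 + d12


if __name__ == "__main__":
    print(d12_bounds(0.3))
    print(genotype_freqs(0.3, 0.05))
```

Whatever value of $D_{12}$ is chosen inside the range, the three genotype frequencies sum to one and the allele frequency $P_{11} + P_{12}/2$ stays equal to $p_1$.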

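Testing $D_{12} = 0$ (HWE)

Testing for HWE is equivalent to testing the null hypothesis

$$H_0 : D_{12} = 0.$$

Here, we have restricted to the two-allele case. We need two things:

- An estimate $\hat D_{12}$. Is it close to 0?
- A sampling distribution for the estimate $\hat D_{12}$ to determine whether it is farther from 0 than we would expect by chance.

Estimating $D_{12}$

Two free parameters $p_1, D_{12}$ and two free pieces of data $n_{11}, n_{12}$ suggest Bailey's method:

$$n_{11} = n\left(p_1^2 + D_{12}\right), \qquad n_{12} = 2n\left(p_1 p_2 - D_{12}\right)$$

can be solved to produce

$$\hat p_1 = \frac{2 n_{11} + n_{12}}{2n} = \tilde p_1, \qquad \hat D_{12} = \frac{n_{11}}{n} - \tilde p_1^2 = \tilde P_{11} - \tilde p_1^2,$$

where $\tilde p_1$ and $\tilde P_{11}$ are the observed sample frequencies.

$\hat D_{12}$ Bias

Is our estimate $\hat D_{12}$ unbiased?

$$E(\hat D_{12}) = E(\tilde P_{11}) - E(\tilde p_1^2) = P_{11} - p_1^2 - \frac{1}{2n}\left(p_1 + P_{11} - 2 p_1^2\right) = D_{12} - \frac{1}{2n}\left[p_1 (1 - p_1) + D_{12}\right].$$

Since $E(\hat D_{12}) \ne D_{12}$, we conclude that the estimator is biased. However, we are encouraged to note that as $n \to \infty$, the bias goes to 0.

$\hat D_{12}$ Sampling Variance

Using Fisher's approximation (it applies because $\hat D_{12}$ is a function of proportions), we obtain an approximate sampling variance for $\hat D_{12}$:

$$\mathrm{Var}(\hat D_{12}) \approx \frac{1}{n}\left[p_1^2 (1 - p_1)^2 + (1 - 2 p_1)^2 D_{12} - D_{12}^2\right].$$

To estimate the sampling variance, we substitute in our estimates for $p_1$ and $D_{12}$:

$$\widehat{\mathrm{Var}}(\hat D_{12}) = \frac{1}{n}\left[\hat p_1^2 (1 - \hat p_1)^2 + (1 - 2 \hat p_1)^2 \hat D_{12} - \hat D_{12}^2\right].$$

Since $\hat D_{12}$ is the MLE, we have for large samples that

$$\hat D_{12} \sim N\left[E(\hat D_{12}),\ \mathrm{Var}(\hat D_{12})\right].$$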
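The estimates above are simple functions of the three genotype counts. The following is a minimal sketch, assuming the counts are passed as plain integers; the function name `estimate_d12` and the example counts are illustrative, not taken from the lecture.

```python
def estimate_d12(n11, n12, n22):
    """Return (p1_hat, d12_hat, var_hat) from two-allele genotype counts."""
    n = n11 + n12 + n22
    p1 = (2 * n11 + n12) / (2 * n)   # sample frequency of allele 1
    d12 = n11 / n - p1 ** 2          # observed P_11 minus p_1 squared
    # Plug-in estimate of the approximate (Fisher) sampling variance.
    var = (p1 ** 2 * (1 - p1) ** 2
           + (1 - 2 * p1) ** 2 * d12
           - d12 ** 2) / n
    return p1, d12, var


if __name__ == "__main__":
    # Hypothetical counts: 30 of genotype 11, 40 of 12, 30 of 22.
    print(estimate_d12(30, 40, 30))
```

With these hypothetical counts the sample shows a heterozygote deficit, so the estimate of $D_{12}$ is positive.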

Testing $H_0 : D_{12} = 0$ Using z-values

Compute the standard normal variate

$$z = \frac{\hat D_{12} - E(\hat D_{12})}{\sqrt{\mathrm{Var}(\hat D_{12})}}.$$

Under the null hypothesis $H_0$, $z$ approximately follows the standard normal distribution. Compare $z$ against the standard normal distribution. The key is that if $\hat D_{12}$ is very positive or very negative, then $z$ will tend to be far from 0 and your statistic will fall in the tails of the sampling distribution, where it is not expected to fall if the null hypothesis of HWE is true.

Computing z

Under the null hypothesis, we know

$$E(\hat D_{12}) \approx 0, \qquad \mathrm{Var}(\hat D_{12}) \approx \frac{1}{n}\left[\hat p_1^2 (1 - \hat p_1)^2\right],$$

so

$$z = \frac{\sqrt{n}\,\hat D_{12}}{\hat p_1 (1 - \hat p_1)}.$$

Note, we have assumed $n$ is sufficiently large that the bias term is negligible.

Relevant Alternative Hypotheses

$$P_{11} = p_1^2 + D_{12}, \qquad P_{12} = 2 p_1 p_2 - 2 D_{12}, \qquad P_{22} = p_2^2 + D_{12}$$

Depending on your purpose, there may be different alternative hypotheses you consider. Suppose $z = 1.25$, then:

- $H_A : D_{12} > 0$ or $D_{12} < 0$. You have no a priori feeling for whether heterozygotes will be over- or under-represented. Use a two-tailed test.

    > 2*pnorm(q=-1.25)
    [1] 0.2112995
    > 2*(1-pnorm(q=1.25))
    [1] 0.2112995

Relevant Alternative Hypotheses (cont)

One-sided hypotheses are appropriate when you suspected heterozygotes would either be under- or over-represented before you collected the data.

- $H_A : D_{12} > 0$. You suspect that heterozygotes will be under-represented. Use a one-tailed (right tail) test.

    > (1-pnorm(q=1.25))
    [1] 0.1056498

- $H_A : D_{12} < 0$. You suspect that heterozygotes will be over-represented. Use a one-tailed (left tail) test.

    > pnorm(q=1.25)
    [1] 0.8943502
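The same calculations can be sketched without R. The block below is a minimal illustration that builds the standard normal CDF from `math.erf`; the helper names `normal_cdf`, `hwe_z`, and `p_values` are hypothetical, and the large-sample assumption (negligible bias) is carried over from the text.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def hwe_z(n11, n12, n22):
    """z statistic for H0: D_12 = 0; requires 0 < p1_hat < 1."""
    n = n11 + n12 + n22
    p1 = (2 * n11 + n12) / (2 * n)
    d12 = n11 / n - p1 ** 2
    return math.sqrt(n) * d12 / (p1 * (1 - p1))


def p_values(z):
    """Two-sided, right-tail (D_12 > 0), and left-tail (D_12 < 0) p-values."""
    return {
        "two_sided": 2 * normal_cdf(-abs(z)),
        "right_tail": 1 - normal_cdf(z),
        "left_tail": normal_cdf(z),
    }


if __name__ == "__main__":
    print(p_values(1.25))
    print(hwe_z(30, 40, 30))  # hypothetical counts
```

For $z = 1.25$ the three values agree with the `pnorm` results quoted in the text.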

Chi-Square

An equivalent test is the chi-square test for HWE. It depends on comparing $z^2$ against its sampling distribution, which, under the null, is a chi-square distribution with 1 degree of freedom.

$$z^2 = X_1^2 = \frac{n \hat D_{12}^2}{\hat p_1^2 (1 - \hat p_1)^2}.$$

However, note that both positive and negative values of $z$ give the same $z^2$ statistic. It is not so easy to consider one-sided alternative hypotheses. Since the tests are equivalent, use the $z$ statistic for one-sided tests.

Test assumption: the sample size is large, so that normality (or chi-square) applies and the bias can be ignored.

Chi-Square Goodness-of-Fit

Genotype               11                 12                            22
Observed (O)           $n_{11}$           $n_{12}$                      $n_{22}$
Expected (E)           $n \hat p_1^2$     $2n \hat p_1 (1 - \hat p_1)$  $n (1 - \hat p_1)^2$
Observed - Expected    $n \hat D_{12}$    $-2n \hat D_{12}$             $n \hat D_{12}$

Here, we have made the assumption that $n$ is sufficiently large that the bias terms are 0. The goodness-of-fit chi-square statistic is defined as

$$X_1^2 = \sum_{\text{genotypes}} \frac{(O - E)^2}{E} = \frac{(n \hat D_{12})^2}{n \hat p_1^2} + \frac{(-2n \hat D_{12})^2}{2n \hat p_1 (1 - \hat p_1)} + \frac{(n \hat D_{12})^2}{n (1 - \hat p_1)^2}.$$

Standard Cautions About Chi-Square Tests

- Apply only when expected counts $E \ge 5$. Because the expected counts $E$ appear in the denominator, small variation when they are small results in huge changes in $X_1^2$.
- Apply Yates' correction to account for the discrete nature of the data. Because the observed data are discrete, but the sampling distribution (normal or chi-square) is continuous, the Yates correction is recommended:

$$X_1^2 = \sum_{\text{genotypes}} \frac{\left(|O - E| - 0.5\right)^2}{E}.$$

Likelihood Ratio

Suppose your null hypothesis is

$$H_0 : \phi = \phi_0$$

for some parameter $\phi$. Let the maximum likelihood value under $H_0$ be $L_0$ and the maximum likelihood value without the restriction on $\phi$ be $L_1$. Then, $L_0$ will never be larger than $L_1$, since $\phi_0$ may not be the maximum likelihood value of $\phi$. However, if the null is true, $\hat\phi$ should be very close to $\phi_0$ and $L_0$ will be very close to $L_1$. Define the likelihood ratio as

$$\lambda = \frac{L_0}{L_1}.$$

When the null hypothesis is true and the dimension of $\phi$ is $s$, then

$$-2 \ln \lambda \sim \chi^2_{(s)}.$$
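The goodness-of-fit statistic and its Yates-corrected variant from this section can be sketched as follows. The function name `hwe_chi_square` and the example counts are hypothetical; the p-value uses the identity that the upper tail of a 1-d.f. chi-square at $x$ equals `erfc(sqrt(x / 2))`.

```python
import math


def hwe_chi_square(n11, n12, n22, yates=False):
    """Goodness-of-fit chi-square (1 d.f.) for HWE with two alleles.

    Returns (statistic, p_value). All expected counts must be positive.
    """
    n = n11 + n12 + n22
    p1 = (2 * n11 + n12) / (2 * n)
    observed = (n11, n12, n22)
    expected = (n * p1 ** 2, 2 * n * p1 * (1 - p1), n * (1 - p1) ** 2)
    correction = 0.5 if yates else 0.0
    stat = sum((abs(o - e) - correction) ** 2 / e
               for o, e in zip(observed, expected))
    # Upper tail probability of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


if __name__ == "__main__":
    print(hwe_chi_square(30, 40, 30))              # hypothetical counts
    print(hwe_chi_square(30, 40, 30, yates=True))
```

Without the correction, the statistic for these counts equals the square of the $z$ statistic computed from the same data, which illustrates the equivalence of the two tests.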

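Likelihood Ratio Test for HWE

Under the unconstrained model (the alternative hypothesis), the parameters are $P_{11}, P_{12}, P_{22}$ and the data are $n_{11}, n_{12}, n_{22}$. There are two degrees of freedom in the model and the data, so Bailey's method applies to yield

$$\hat P_{11} = \frac{n_{11}}{n}, \qquad \hat P_{12} = \frac{n_{12}}{n}.$$

The maximum likelihood under the unconstrained model is

$$L_1 = \frac{n!}{n_{11}!\, n_{12}!\, n_{22}!} \left(\frac{n_{11}}{n}\right)^{n_{11}} \left(\frac{n_{12}}{n}\right)^{n_{12}} \left(\frac{n_{22}}{n}\right)^{n_{22}}.$$

Under the constrained model,

$$\hat P_{11} = \hat p_1^2, \qquad \hat P_{12} = 2 \hat p_1 (1 - \hat p_1) \qquad \text{with } \hat p_1 = \frac{n_1}{2n},$$

where $n_1 = 2 n_{11} + n_{12}$ and $n_2 = 2 n_{22} + n_{12}$ are the allele counts.

Likelihood Ratio Test for HWE (cont)

The maximum likelihood under the constrained model is

$$L_0 = \frac{n!}{n_{11}!\, n_{12}!\, n_{22}!} \left(\frac{n_1}{2n}\right)^{2 n_{11}} \left(\frac{2 n_1 n_2}{(2n)^2}\right)^{n_{12}} \left(\frac{n_2}{2n}\right)^{2 n_{22}}.$$

The test statistic is therefore $-2 \ln \lambda$, where

$$\ln \lambda = \ln \left[\frac{n^n\, 2^{n_{12}}\, n_1^{n_1}\, n_2^{n_2}}{(2n)^{2n}\, n_{11}^{n_{11}}\, n_{12}^{n_{12}}\, n_{22}^{n_{22}}}\right].$$

A Multiplicative Model That Works

Let us consider another multiplicative model:

$$P_{11} = M M_1^2 M_{11}, \qquad P_{12} = 2 M M_1 M_2 M_{12}, \qquad P_{22} = M M_2^2 M_{22}.$$

Here $M$ is the mean effect, $M_{11}, M_{12}, M_{22}$ represent associations between alleles, and $M_1, M_2$ represent the allele frequency contributions. Taking logarithms puts this model back in the additive space:

$$\ln P_{11} = \ln M + 2 \ln M_1 + \ln M_{11}$$
$$\ln P_{12} = \ln M + \ln 2 + \ln M_1 + \ln M_2 + \ln M_{12}$$
$$\ln P_{22} = \ln M + 2 \ln M_2 + \ln M_{22}$$

Handling Overparameterization

There are, as usual, more parameters than observations. And this time there are multiple ways to deal with the overparameterization. One way is to set

$$M_2 M_{12} = 1, \qquad M_2^2 M_{22} = 1,$$

so that

$$P_{11} = M M_1^2 M_{11}, \qquad P_{12} = 2 M M_1, \qquad P_{22} = M.$$

There is still an extra degree of freedom, but summing all three equations yields

$$1 = M \left(M_1^2 M_{11} + 2 M_1 + 1\right) \quad \text{or} \quad M = \frac{1}{1 + 2 M_1 + M_1^2 M_{11}}.$$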
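The likelihood ratio statistic for HWE can be evaluated directly from the counts. The sketch below uses the algebraically equivalent form $-2 \ln \lambda = 2 \sum O \ln(O/E)$, where $E$ are the expected counts under the constrained (HWE) fit; the function name `hwe_lrt` and the example counts are hypothetical.

```python
import math


def hwe_lrt(n11, n12, n22):
    """Likelihood ratio test of HWE for two alleles.

    Returns (-2 ln lambda, p_value from a chi-square with 1 d.f.).
    """
    n = n11 + n12 + n22
    p1 = (2 * n11 + n12) / (2 * n)   # constrained MLE of p_1
    observed = (n11, n12, n22)
    expected = (n * p1 ** 2, 2 * n * p1 * (1 - p1), n * (1 - p1) ** 2)
    # ln(lambda) = sum of O * ln(E / O); empty cells contribute nothing.
    log_lambda = sum(o * math.log(e / o)
                     for o, e in zip(observed, expected) if o > 0)
    stat = max(0.0, -2.0 * log_lambda)  # guard against rounding below zero
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


if __name__ == "__main__":
    print(hwe_lrt(30, 40, 30))  # hypothetical counts
```

The multinomial coefficient cancels in the ratio $L_0 / L_1$, which is why it does not appear in the code.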

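Estimating Parameters $M_1$ and $M_{11}$

Again, Bailey's method applies and the maximum likelihood estimates are

$$\hat M_1 = \frac{n_{12}}{2 n_{22}}, \qquad \hat M_{11} = \frac{4 n_{11} n_{22}}{n_{12}^2},$$

with the tag-along $\hat M = n_{22}/n$. Substituting these MLEs back into the original multiplicative equations produces the same likelihood under $H_A$:

$$\hat P_{11} = \frac{n_{11}}{n}, \qquad \hat P_{12} = \frac{n_{12}}{n}, \qquad \hat P_{22} = \frac{n_{22}}{n}$$

$$L_1 = \frac{n!}{n_{11}!\, n_{12}!\, n_{22}!} \cdot \frac{n_{11}^{n_{11}}\, n_{12}^{n_{12}}\, n_{22}^{n_{22}}}{n^n}.$$

Log Likelihood Test for Multiplicative Model

HWE implies no interaction term, i.e. $M_{11} = 1$. Under this constraint, we again apply Bailey's method to find

$$\hat P_{11} = \left(\frac{n_1}{2n}\right)^2, \qquad \hat P_{12} = \frac{2 n_1 n_2}{(2n)^2}, \qquad \hat P_{22} = \left(\frac{n_2}{2n}\right)^2.$$

It turns out that $\lambda$ has the same form under this log-linear model as under the additive disequilibrium model. So the log-linear model for testing HWE is equivalent to the additive model.

Exact Tests

If the probability of the observed sample under the null hypothesis, together with all less likely samples, is small, then the evidence suggests the data are unlikely to have arisen under the null hypothesis. If one can compute the probability of all possible samples, then obtaining an exact probability (no approximation) is possible.

Exact tests are useful when all the possible observed data can be enumerated practically. This occurs generally when the expected counts are small in some categories (i.e. when the previous tests fail).

Exact Tests for HWE

The probability of the observed data $n_{11}, n_{12}, n_{22}$ is given by the multinomial distribution

$$P(n_{11}, n_{12}, n_{22}) = \frac{n!}{n_{11}!\, n_{12}!\, n_{22}!}\, P_{11}^{n_{11}} P_{12}^{n_{12}} P_{22}^{n_{22}}.$$

When HWE applies, then

$$P(n_{11}, n_{12}, n_{22}) = \frac{n!}{n_{11}!\, n_{12}!\, n_{22}!}\, p_1^{2 n_{11}} (2 p_1 p_2)^{n_{12}} p_2^{2 n_{22}}.$$

In addition, the allele counts $n_1$ and $n_2$ are binomially distributed:

$$P(n_1, n_2) = \frac{(2n)!}{n_1!\, n_2!}\, p_1^{n_1} p_2^{n_2}.$$

The conditional probability, where we condition on the observed allele counts, is

$$P(n_{11}, n_{12}, n_{22} \mid n_1, n_2) = \frac{P(n_{11}, n_{12}, n_{22}, n_1, n_2)}{P(n_1, n_2)} = \frac{P(n_{11}, n_{12}, n_{22})}{P(n_1, n_2)} = \frac{n!\, n_1!\, n_2!\, 2^{n_{12}}}{n_{11}!\, n_{12}!\, n_{22}!\, (2n)!}.$$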
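Because the conditional probability depends only on the counts, the exact test can be carried out by enumerating every genotype configuration that shares the observed allele counts. The following is a minimal sketch for two alleles; the function names `conditional_prob` and `exact_hwe_p_value` are hypothetical.

```python
from math import comb, factorial


def conditional_prob(n11, n12, n22):
    """P(n11, n12, n22 | n1, n2) under HWE."""
    n = n11 + n12 + n22
    n1 = 2 * n11 + n12
    multinomial = factorial(n) // (
        factorial(n11) * factorial(n12) * factorial(n22))
    # (2n)! / (n1! n2!) is the binomial coefficient C(2n, n1).
    return multinomial * 2 ** n12 / comb(2 * n, n1)


def exact_hwe_p_value(n11, n12, n22):
    """Sum the probabilities of all samples no more likely than the observed."""
    n1 = 2 * n11 + n12
    n2 = 2 * n22 + n12
    observed = conditional_prob(n11, n12, n22)
    p_value = 0.0
    # The heterozygote count has the same parity as n1 and is at most min(n1, n2).
    for het in range(n1 % 2, min(n1, n2) + 1, 2):
        prob = conditional_prob((n1 - het) // 2, het, (n2 - het) // 2)
        if prob <= observed * (1 + 1e-9):  # tolerance for floating-point ties
            p_value += prob
    return p_value


if __name__ == "__main__":
    print(exact_hwe_p_value(10, 1, 2))
```

For a sample with $n_{11} = 10$, $n_{12} = 1$, $n_{22} = 2$, only three configurations share the allele counts $n_1 = 21$, $n_2 = 5$, and their conditional probabilities sum to one.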

Exact Test for HWE (Example)

Suppose we observe $n_{11} = 10$, $n_{12} = 1$, $n_{22} = 2$. Use an exact test to calculate a p-value for rejecting the null hypothesis of HWE. The possible samples with the same allele counts ($n_1 = 21$, $n_2 = 5$), ordered from least to most probable, are:

$n_{11}$   $n_{12}$   $n_{22}$   Probability   Cumul. Prob.
10         1          2          0.026         0.026
9          3          1          0.35          0.37
8          5          0          0.63          1.00

The p-value is $p = 0.026$, since there is no dataset more extreme than the observed.

Summary of Tests for HWE

- The normal approximation for MLEs uses the $z$ statistic.
- The chi-square test uses the $X_1^2 = z^2$ statistic and is equivalent to the above test.
- The chi-square goodness-of-fit test is identical to the above chi-square test, but highlights the need for substantial data in each category.
- The likelihood ratio test is widely applicable and flexible when a likelihood function is available.
- The log-linear model uses a multiplicative model and leads to a test equivalent to the likelihood ratio test.
- The exact test is useful when the data set is small, and particularly when counts in some categories are small.

Tests for Multiple Alleles

When there are more than two alleles at a locus, the general equations

$$P_{uu} = p_u^2 + D_{uu}, \qquad P_{uv} = 2 p_u p_v - 2 D_{uv} \quad \text{whenever } u \ne v$$

apply, with relationship

$$D_{uu} = \sum_{v \ne u} D_{uv}$$

and MLEs obtained by Bailey's method (verify this; it is not hard) with

$$\hat p_u = \tilde p_u, \qquad \hat D_{uv} = \tilde p_u \tilde p_v - \frac{1}{2} \tilde P_{uv}.$$

Under complete HWE, $D_{uv} = 0$ for all $u \ne v$; that is, there are $k(k-1)/2$ constraints applied (one for each heterozygote) when there are $k$ alleles.

Testing Complete HWE

Therefore $-2 \ln \lambda$ for $H_0 : D_{uv} = 0$ for all $u \ne v$ approximately follows a chi-square distribution with $k(k-1)/2$ degrees of freedom. If you are more comfortable with a goodness-of-fit test, that statistic,

$$X_T^2 = \sum_{u \le v} \frac{\left(|n_{uv} - E(n_{uv})| - 0.5\right)^2}{E(n_{uv})}, \qquad E(n_{uu}) = n \tilde p_u^2, \quad E(n_{uv}) = 2 n \tilde p_u \tilde p_v,$$

follows the same sampling distribution.
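The multi-allele estimates can be checked numerically. The sketch below computes the sample allele frequencies and the disequilibrium estimates from a table of genotype counts, which makes it easy to confirm the relationship $D_{uu} = \sum_{v \ne u} D_{uv}$. The function name `additive_disequilibria`, the dictionary layout, and the three-allele counts are all hypothetical.

```python
from itertools import combinations


def additive_disequilibria(counts, alleles):
    """Estimate allele frequencies and additive disequilibria.

    counts maps a genotype (u, v), with u listed before v in `alleles`,
    to its observed count. Returns (freq, d), where d[(u, u)] estimates
    D_uu and d[(u, v)] estimates D_uv.
    """
    n = sum(counts.values())
    freq = {}
    for u in alleles:
        copies = sum(c * ((g[0] == u) + (g[1] == u))
                     for g, c in counts.items())
        freq[u] = copies / (2 * n)
    d = {}
    for u in alleles:
        d[(u, u)] = counts.get((u, u), 0) / n - freq[u] ** 2
    for u, v in combinations(alleles, 2):
        d[(u, v)] = freq[u] * freq[v] - 0.5 * counts.get((u, v), 0) / n
    return freq, d


if __name__ == "__main__":
    # Hypothetical counts for a locus with three alleles.
    counts = {(1, 1): 10, (1, 2): 20, (2, 2): 15,
              (1, 3): 10, (2, 3): 30, (3, 3): 15}
    freq, d = additive_disequilibria(counts, [1, 2, 3])
    print(freq)
    print(d)
```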

Testing Partial HWE

If you want to test only certain combinations of alleles, the tests are more complicated.

- Test a single $D_{uv}$. Example: if $H_0 : D_{12} = 0$, the likelihood ratio statistic $-2 \ln \lambda$ follows a chi-square with 1 degree of freedom. But Bailey's method does not apply unless $k = 2$, so $L_0$ is difficult to compute. Iterative methods are required. Or, one could apply the z-test or the chi-square test. Details for these tests follow on the next slide.
- Test multiple, but not all, $D_{uv}$. Example: if $k = 4$ and $H_0 : D_{12}, D_{34} = 0$, $-2 \ln \lambda$ follows a chi-square with 2 degrees of freedom. Iterative methods are still required. This time, there are no easy alternatives.
- Complex hypotheses cause difficulties for the z-test and the related chi-square. The likelihood ratio test handles complex hypotheses quite naturally.

z-test for Partial HWE

Here, you only need the MLEs under the full alternative model (where Bailey's method applies). For large samples, the MLE $\hat D_{uv}$ is approximately normally distributed under $H_0$ with mean 0 and variance $\mathrm{Var}(\hat D_{uv})$:

$$z_{uv} = \frac{\hat D_{uv}}{\sqrt{\mathrm{Var}(\hat D_{uv})}}.$$

We can again use Fisher's approximation to compute the variance:

$$\mathrm{Var}(\hat D_{uv}) = \frac{1}{2n} \Big\{ p_u p_v \left[(1 - p_u)(1 - p_v) + p_u p_v\right] - \left[(1 - p_u - p_v)^2 - 2 (p_u - p_v)^2\right] D_{uv} + \sum_{w \ne u,v} \left(p_u^2 D_{vw} + p_v^2 D_{uw}\right) - 2 D_{uv}^2 \Big\}.$$

Under $H_0$, the variance is obtained by assuming $D_{uv} = 0$.

Exact Tests for Multiple Alleles

The exact test generalizes to multiple alleles. The formula is

$$P(\{n_{uv}\} \mid \{n_u\}) = \frac{n!\, 2^H \prod_u n_u!}{(2n)!\, \prod_{u \le v} n_{uv}!},$$

where $H$ is the total number of heterozygotes.

GenePOP input:

    Test
    loc1
    Pop
    a1, 0101
    a2, 0101
    a3, 0101
    a4, 0202
    a5, 0203

GUO, S. and THOMPSON, E. 1992. Performing the exact test of Hardy-Weinberg proportion for multiple alleles. Biometrics 48, pp. 361-372.

LAZZERONI, L. C. and LANGE, K. 1997. Markov chains for Monte Carlo tests of genetic equilibrium in multidimensional contingency tables. Ann. Statist. 25, pp. 138-168.

Approximate Exact Tests

When it is impossible to enumerate all the possible datasets with the same allele frequencies, approximate methods are needed. One of the simplest uses Monte Carlo.
1. Calculate $F = P(\{n_{uv}\} \mid \{n_u\})$ for the observed data.
2. Set $S = 0$ and put all your genotypes (13, 66, 31, 16, ...) in a big vector of length $2n$: 13663116...
3. Permute all alleles and clump successive alleles into genotypes: (36)(16)(36)(13)...
4. Compute $F^* = P(\{n_{uv}\} \mid \{n_u\})$ for the permuted dataset.
5. If $F^* \le F$, increment $S$ by 1.
6. Repeat steps 3 to 5 a total of $M$ times.
7. Estimate the p-value as $S / M$.
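The permutation procedure above can be sketched as follows. This is an illustrative implementation rather than the one used by any particular package: the function names `log_conditional_prob` and `monte_carlo_hwe`, the default number of permutations, and the fixed random seed are all assumptions. Probabilities are compared on the log scale to avoid overflow in the factorials.

```python
import random
from collections import Counter
from math import lgamma, log


def log_conditional_prob(genotypes):
    """log P({n_uv} | {n_u}) for a list of genotypes given as allele pairs."""
    n = len(genotypes)
    geno_counts = Counter(tuple(sorted(g)) for g in genotypes)
    allele_counts = Counter(a for g in genotypes for a in g)
    hets = sum(c for (u, v), c in geno_counts.items() if u != v)
    # lgamma(k + 1) equals log(k!).
    value = lgamma(n + 1) + hets * log(2) - lgamma(2 * n + 1)
    value += sum(lgamma(c + 1) for c in allele_counts.values())
    value -= sum(lgamma(c + 1) for c in geno_counts.values())
    return value


def monte_carlo_hwe(genotypes, m=10000, seed=0):
    """Estimate the exact-test p-value by permuting alleles m times."""
    rng = random.Random(seed)
    observed = log_conditional_prob(genotypes)
    alleles = [a for g in genotypes for a in g]  # vector of length 2n
    s = 0
    for _ in range(m):
        rng.shuffle(alleles)
        # Clump successive alleles into genotypes.
        permuted = list(zip(alleles[0::2], alleles[1::2]))
        if log_conditional_prob(permuted) <= observed + 1e-9:
            s += 1
    return s / m


if __name__ == "__main__":
    sample = [(1, 1)] * 10 + [(1, 2)] + [(2, 2)] * 2
    print(monte_carlo_hwe(sample, m=20000))
```

On a small two-allele sample the estimate can be compared against full enumeration, which gives a convenient check before applying the method to loci with many alleles.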

Example

Consider the following table of genotype counts for a locus with four alleles.

         $A_1$   $A_2$   $A_3$   $A_4$
$A_1$    0
$A_2$    3       1
$A_3$    5       18      1
$A_4$    3       7       5       2

Method                       Estimate
Exact                        0.01744
$\chi^2$                     0.0337
MC integration (M = 1700)    0.01706

Power Calculations

Recall that the power of a statistical test is the probability that you reject the null hypothesis when it is false. Before you begin collecting data, you may wish to estimate whether you will be able to detect a disequilibrium of a given size, say $D_{12} = 0.10$. To detect it means you reject the null that $D_{12} = 0$. The test statistic

$$X_1^2 = \frac{n \hat D_{12}^2}{\hat p_1^2 (1 - \hat p_1)^2}$$

follows a different distribution depending on whether $H_0$ is true or not:

$$H_0 \text{ true}: \ X_1^2 \sim \chi^2_{(1)} \qquad\qquad H_0 \text{ false}: \ X_1^2 \sim \chi^2_{(1,\nu)}$$

where $\chi^2_{(1,\nu)}$ is the noncentral chi-square distribution with noncentrality parameter $\nu$.

The Noncentrality Parameter

The noncentral chi-square distribution is an approximate distribution under $H_A$. The noncentrality parameter is given by

$$\nu = \frac{n D_{12}^2}{p_1^2 (1 - p_1)^2}.$$

$\nu$ is bigger when $D_{12}$ is farther from 0. But note, the approximation is only valid when $\nu$ is small, say of order 1.

If we take the standard significance level $\alpha = 0.05$, then we will reject $H_0$ if $X_1^2 > 3.84$. With this knowledge one can ask a couple of questions:

- How big does $D_{12}$ need to be in order to have a 90% chance of getting $X_1^2 > 3.84$, and therefore rejecting $H_0$ and detecting the disequilibrium?
- How big does my sample need to be in order to detect a disequilibrium $D_{12} = 0.1$ with 90% probability?

Size of $D_{12}$ or $n$

$$D_{12} = \sqrt{\frac{\nu}{n}}\; p_1 (1 - p_1)$$

When $\nu = 10.5$, $X_1^2$ will exceed 3.84 with 90% probability. (These kinds of results are tabulated or available in statistics packages.) So, for alternative sample sizes you can see how large a disequilibrium $D_{12}$ you will be likely (90% probability) to detect.

Also, further rearrangement yields

$$n = \frac{\nu\, p_1^2 (1 - p_1)^2}{D_{12}^2},$$

which can be used to compute the size of the sample we'll need to detect a specified disequilibrium $D_{12}$.

Note, for both of these applications, you need to know $p_1$ to use the formulas.
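The two rearranged formulas, and the claim that $\nu = 10.5$ gives roughly 90% power, can be checked with a short sketch. It relies on the fact that a noncentral chi-square with 1 degree of freedom is the distribution of $(Z + \sqrt{\nu})^2$ for standard normal $Z$, so the tail probability reduces to two normal CDF evaluations. The function names `power_1df`, `detectable_d12`, and `required_n` are hypothetical.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def power_1df(nu, critical=3.84):
    """P(X^2 > critical) when X^2 is noncentral chi-square(1, nu)."""
    root_c, root_nu = math.sqrt(critical), math.sqrt(nu)
    return 1.0 - normal_cdf(root_c - root_nu) + normal_cdf(-root_c - root_nu)


def detectable_d12(n, p1, nu=10.5):
    """Disequilibrium detectable at the power implied by nu with sample size n."""
    return math.sqrt(nu / n) * p1 * (1 - p1)


def required_n(d12, p1, nu=10.5):
    """Sample size needed to detect d12 at the power implied by nu."""
    return nu * p1 ** 2 * (1 - p1) ** 2 / d12 ** 2


if __name__ == "__main__":
    print(power_1df(10.5))          # close to 0.90
    print(required_n(0.1, 0.5))     # sample size for D_12 = 0.1, p_1 = 0.5
    print(detectable_d12(100, 0.5))
```

Setting `nu=0` in `power_1df` recovers the size of the test, which is a quick sanity check on the critical value 3.84.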

Power Calculations for Exact Tests

Recall, you need to calculate

$$P(\{n_{uv}\} \mid \{n_u\}) = \frac{P(\{n_{uv}\})}{P(\{n_u\})}.$$

Under the alternative hypothesis, the genotype frequencies are given by the formulas

$$P_{uu} = p_u^2 + D_{uu}, \qquad P_{uv} = 2 p_u p_v - 2 D_{uv} \quad \text{whenever } u \ne v,$$

so you can compute the numerator

$$P(\{n_{uv}\}) = \frac{n!}{\prod_{u \le v} n_{uv}!} \prod_{u \le v} P_{uv}^{n_{uv}},$$

but you can no longer assume the binomial distribution for allele frequencies.

Power Calculations for Exact Tests (cont)

But, of course,

$$n_u = 2 n_{uu} + \sum_{v \ne u} n_{uv},$$

so if you know $P(\{n_{uv}\})$, you can compute $P(\{n_u\})$ by summing over the former.