Data Analysis and Statistical Methods
Statistics 651
http://www.stat.tamu.edu/~suhasini/teaching.html
Suhasini Subba Rao

Example

The nitrogen content of three different clover plants is given below.

| 3DOK1 | 3DOK5 | 3DOK7 |
|-------|-------|-------|
| 19.4  | 18.2  | 20.7  |
| 32.6  | 24.6  | 21.0  |
| 27.0  | 25.5  | 20.5  |
| 32.1  | 19.4  | 18.8  |
| 33.0  | 21.7  | 18.6  |
|       | 20.8  | 20.1  |
|       |       | 21.3  |

Use ANOVA and the F distribution to test whether all three clovers have the same population mean, $H_0: \mu_1 = \mu_2 = \mu_3$, against the alternative that at least one of the means is different.

Solution

| 3DOK1 | 3DOK1 $- \bar X_1$ | 3DOK5 | 3DOK5 $- \bar X_2$ | 3DOK7 | 3DOK7 $- \bar X_3$ |
|-------|--------------------|-------|--------------------|-------|--------------------|
| 19.4 | 19.4 - 28.82 = -9.42 | 18.2 | 18.2 - 21.7 = -3.5 | 20.7 | 20.7 - 20.14 = 0.557 |
| 32.6 | 32.6 - 28.82 = 3.78 | 24.6 | 24.6 - 21.7 = 2.9 | 21.0 | 21.0 - 20.14 = 0.857 |
| 27.0 | 27.0 - 28.82 = -1.82 | 25.5 | 25.5 - 21.7 = 3.8 | 20.5 | 20.5 - 20.14 = 0.357 |
| 32.1 | 32.1 - 28.82 = 3.28 | 19.4 | 19.4 - 21.7 = -2.3 | 18.8 | 18.8 - 20.14 = -1.34 |
| 33.0 | 33.0 - 28.82 = 4.18 | 21.7 | 21.7 - 21.7 = 0.0 | 18.6 | 18.6 - 20.14 = -1.54 |
|      |                     | 20.8 | 20.8 - 21.7 = -0.9 | 20.1 | 20.1 - 20.14 = -0.04 |
|      |                     |      |                    | 21.3 | 21.3 - 20.14 = 1.16 |

|             | 3DOK1 | 3DOK5 | 3DOK7 |
|-------------|-------|-------|-------|
| $\bar X_i$  | 28.82 | 21.7  | 20.14 |
| $s_i^2$     | 33.6  | 8.24  | 1.12  |

The total sample mean is

$$\bar X = \frac{1}{18}\left(5 \times 28.82 + 6 \times 21.7 + 7 \times 20.14\right) = 23.07.$$

The pooled variance is calculated using either the residuals or the sample variances $s_i^2$. The SSW is

$$SSW = \sum_{i=1}^{3} \sum_{j=1}^{n_i} (X_{ij} - \bar X_i)^2 = 4 s_1^2 + 5 s_2^2 + 6 s_3^2 = 4 \times 33.6 + 5 \times 8.24 + 6 \times 1.12 \approx 182.3.$$

Hence $SSW/(N-k) = 182.3/(18-3) \approx 12.2$. Notice that the variation of sample 3DOK7 is small compared to the others (the residuals are much smaller!); this is what pulls the value of SSW down.

The SSB is

$$SSB = n_1(\bar X_1 - \bar X)^2 + n_2(\bar X_2 - \bar X)^2 + n_3(\bar X_3 - \bar X)^2 = 5(28.82 - 23.07)^2 + 6(21.7 - 23.07)^2 + 7(20.14 - 23.07)^2 \approx 236.7.$$

|                | Sum of Squares | df          | Mean square        | F $\sim F_{k-1,N-k}$ | Sig   |
|----------------|----------------|-------------|--------------------|----------------------|-------|
| Between Groups | 236.7          | k - 1 = 2   | 236.7/2 = 118.3    | 118.3/12.2 = 9.7     | 0.002 |
| Within Groups  | 182.3          | N - k = 15  | 182.3/15 = 12.2    |                      |       |
| Total          |                | N - 1 = 17  |                    |                      |       |

The F-statistic under the null $H_0: \mu_1 = \mu_2 = \mu_3$ has an F-distribution with (2, 15) degrees of freedom. The p-value is $P(F_{2,15} > 9.7) = 0.002$. 0.002 is quite small (smaller than $\alpha = 0.05$), so there is enough evidence to reject the null.

Equivalently, looking up the tables with $\alpha = 0.05$ we have $F_{2,15}(0.05) = 3.68$. The rejection region is any number bigger than 3.68. Since 3.68 < 9.7, there is enough evidence to reject the null.

QQplot of residuals from the plant data

The residuals are calculated (look at the numbers in the second table) and a QQplot is made using these numbers.

[Figure: Normal Q-Q plot of the residuals. Horizontal axis: Theoretical Quantiles. Vertical axis: Sample Quantiles.]

QQplot of residuals. There is not much deviation from normality, hence the assumptions of ANOVA (normality) are satisfied and it is okay to use this test.

Categorical data: The Binomial revisited

In several situations we have to deal with categorical data. For example, a company may want to know whether people are aware of a new product. They would do a survey by asking people on the street. For each person surveyed the answer could be yes or no. We can formally write this as: let $X_i$ be the answer of the ith person interviewed, where

$$X_i = \begin{cases} 1 & \text{if yes} \\ 0 & \text{if no.} \end{cases}$$

Suppose the probability $P(X_i = 1) = P(\text{ith person says yes}) = \pi$. This means the probability that a randomly selected person has heard of the product is $\pi$.

Do you remember how to calculate the probability that out of 5 people selected on the street 3 have heard of the product? We recall that the number of people out of 5 who said yes can be written as $Y_5 = X_1 + X_2 + X_3 + X_4 + X_5$, that is, $Y_5$ = number of people out of 5 who say yes. $Y_5$ is a random variable that can take the values 0, 1, 2, 3, 4, 5. So what we want to do is calculate $P(Y_5 = 3)$; do you remember this distribution?
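Returning to the clover ANOVA example from the start of this lecture, the sums of squares and the F-statistic can be recomputed directly from the raw data. The following is a minimal sketch using only the Python standard library; the function name `one_way_anova` is an illustrative choice, and the p-value line relies on a closed-form tail probability that holds only when the between-groups degrees of freedom equal 2 (as they do here).

```python
from statistics import mean

# Nitrogen content for the three clover groups, copied from the data table.
groups = {
    "3DOK1": [19.4, 32.6, 27.0, 32.1, 33.0],
    "3DOK5": [18.2, 24.6, 25.5, 19.4, 21.7, 20.8],
    "3DOK7": [20.7, 21.0, 20.5, 18.8, 18.6, 20.1, 21.3],
}


def one_way_anova(samples):
    """Return (SSB, SSW, F, df_between, df_within) for a list of samples."""
    all_values = [x for s in samples for x in s]
    grand_mean = mean(all_values)
    k, n_total = len(samples), len(all_values)
    # Between-groups sum of squares: group size times squared mean offset.
    ssb = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in samples)
    # Within-groups sum of squares: squared residuals about each group mean.
    ssw = sum((x - mean(s)) ** 2 for s in samples for x in s)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ssb / df_between) / (ssw / df_within)
    return ssb, ssw, f_stat, df_between, df_within


ssb, ssw, f_stat, df_b, df_w = one_way_anova(list(groups.values()))

# For an F distribution with 2 numerator degrees of freedom,
# P(F > f) = (1 + 2 f / df_within) ** (-df_within / 2).
if df_b != 2:
    raise ValueError("closed-form p-value below assumes df_between == 2")
p_value = (1 + 2 * f_stat / df_w) ** (-df_w / 2)

print(f"SSB = {ssb:.1f}, SSW = {ssw:.1f}, F = {f_stat:.2f}, p = {p_value:.4f}")
```

Because the code works with unrounded group means, its output can differ in the last digit from a hand calculation that rounds intermediate values.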
It's the binomial distribution and

$$P(Y_5 = 3) = \frac{5!}{3!\,2!}\,\pi^3 (1-\pi)^2$$

(you don't have to remember this formula!).

Now suppose the company does not know what $\pi$ is, but they interview 100 people and 30 say they have heard of the product. How would you estimate $\pi$? The most logical estimator of $\pi$ would be $\hat\pi = 30/100$. In general, if $Y_n$ people out of $n$ people said yes, then the best estimator of $\pi$ is $\hat\pi = Y_n/n$.

What can we say about the estimator $\hat\pi$? Do we know the distribution of $\hat\pi$? Can we construct confidence intervals for $\pi$? Can we make a hypothesis test?

The normal approximation to the binomial distribution

Condition for this to work: if $Y_n > 5$ and $(n - Y_n) > 5$, that is, the number who said yes (number of successes) and the number who said no (number of failures) are both greater than 5, then we can approximate the binomial distribution in terms of the normal distribution.

Remember the mean of $Y_n$ is $n\pi$ and the variance is $n\pi(1-\pi)$. Since $Y_n = \sum_{i=1}^{n} X_i$, the central limit theorem says that the average $Y_n/n = \frac{1}{n}\sum_{i=1}^{n} X_i$ has approximately the distribution

$$\hat\pi = \frac{Y_n}{n} \sim N\!\left(\pi, \frac{\pi(1-\pi)}{n}\right).$$

In other words $Y_n/n$ is approximately normal with mean $\pi$ and variance $\pi(1-\pi)/n$ (see Lecture 7).

The standard error of $\hat\pi$ is $\sqrt{\pi(1-\pi)/n}$. As always, the all important value is the standard error. It will tell us how close the proportion estimator $\hat\pi$ is to the true proportion $\pi$. Furthermore, as the sample size grows, the standard error gets smaller. The estimator improves.

Since $\hat\pi = Y_n/n$ has the mean $\pi$ it is an estimator of $\pi$. Secondly, we can use the normality result to construct confidence intervals and hypothesis tests.
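The binomial probability, the estimator and its standard error discussed above can be evaluated numerically. This is a small sketch using only the standard library; the value `pi_assumed = 0.3` is a hypothetical awareness probability chosen purely for illustration, while the 30-out-of-100 survey is the one from the text.

```python
from math import comb, sqrt


def binomial_pmf(k, n, pi):
    """P(Y_n = k) when Y_n counts successes in n independent yes/no trials."""
    return comb(n, k) * pi**k * (1 - pi) ** (n - k)


# Hypothetical awareness probability, chosen only for illustration.
pi_assumed = 0.3
p_three_of_five = binomial_pmf(3, 5, pi_assumed)  # 5!/(3!2!) * pi^3 * (1-pi)^2

# Survey from the text: 30 of 100 people said yes.
y, n = 30, 100
pi_hat = y / n
# Condition for the normal approximation: more than 5 successes and failures.
normal_ok = y > 5 and (n - y) > 5
# Estimated standard error, with pi replaced by its estimator pi_hat.
se_hat = sqrt(pi_hat * (1 - pi_hat) / n)

print(f"P(Y_5 = 3) with pi = {pi_assumed}: {p_three_of_five:.4f}")
print(f"pi_hat = {pi_hat}, normal approximation ok: {normal_ok}, SE = {se_hat:.4f}")
```

Increasing `n` while keeping the same proportion of yes answers shrinks `se_hat` at the rate $1/\sqrt{n}$, which is the sense in which the estimator improves.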
Confidence intervals and testing for proportions

If $Y_n$ and $(n - Y_n)$ are greater than five, then we can use the normal approximation to construct confidence intervals for $\pi$. We use the same ideas as before. Since $\frac{Y_n}{n} \sim N\!\left(\pi, \frac{\pi(1-\pi)}{n}\right)$, the $100(1-\alpha)\%$ confidence interval is

$$\left[\frac{Y_n}{n} - z_{\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}},\ \frac{Y_n}{n} + z_{\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}}\right],$$

where $z_{\alpha/2}$ is the upper $\alpha/2$ point of the standard normal distribution (recall $z_{0.025} = 1.96$).

Of course $\pi$ is unknown, which is why we are constructing a CI for it, so the standard error $\sqrt{\pi(1-\pi)/n}$ is unknown. Instead we replace $\pi$ with its estimator $\hat\pi = Y_n/n$. We still get approximately $\hat\pi = \frac{Y_n}{n} \sim N\!\left(\pi, \frac{\hat\pi(1-\hat\pi)}{n}\right)$ and pretty much the same interval as before:

$$\left[\frac{Y_n}{n} - z_{\alpha/2}\sqrt{\frac{\hat\pi(1-\hat\pi)}{n}},\ \frac{Y_n}{n} + z_{\alpha/2}\sqrt{\frac{\hat\pi(1-\hat\pi)}{n}}\right].$$

We do not use the t distribution.

We can also do a hypothesis test in the same way. We test $H_0: \pi = \pi_0$ against the alternative $H_A: \pi \neq \pi_0$. To do this we construct the non-rejection region, which is

$$\left[\pi_0 - z_{\alpha/2}\sqrt{\frac{\hat\pi(1-\hat\pi)}{n}},\ \pi_0 + z_{\alpha/2}\sqrt{\frac{\hat\pi(1-\hat\pi)}{n}}\right].$$

If $\hat\pi$ lies in this region there is not enough evidence to reject the null. If $\hat\pi$ does not lie in this region, there is enough evidence to reject the null and accept the alternative.

Alternatively, we can do a z-transform and calculate the p-value. That is, do the transformation

$$z = \frac{\hat\pi - \pi_0}{\sqrt{\hat\pi(1-\hat\pi)/n}} = \frac{Y_n/n - \pi_0}{\sqrt{\hat\pi(1-\hat\pi)/n}}$$

and look up the normal tables for $P(|Z| > |z|)$. If this p-value is smaller than $\alpha$ there is enough evidence to reject the null. It may be easier for you to use the p-value approach.

Example I

4 years ago 40 percent of a population were thought to have voted for candidate X. This year X is up for re-election. Suppose 150 people were interviewed in a poll, and 35% opted for X.

(i) Test the hypothesis that the proportion of people this time round that will vote for X will be less than 40% (use $\alpha = 0.05$).
(ii) Construct a 95% confidence interval for the proportion of people who will vote for X this time round.

(i) The hypothesis test: We want to test the hypothesis $H_0: \pi \geq 0.4$ against the alternative $H_A: \pi < 0.4$. The number of people polled who said they would vote for X is $0.35 \times 150 \approx 52$ and the number of people who said they would not vote for X is about $150 - 52 = 98$.
Both these numbers are greater than 5 so we can use the normal approximation.
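The general recipe described above (check the counts, plug $\hat\pi$ into the standard error, then form the interval or the z-statistic) can be sketched in code, here applied to the poll numbers of Example I. The function name `one_sample_proportion` and its return layout are illustrative assumptions, not part of the course material; the left-tail p-value matches the one-sided alternative $H_A: \pi < \pi_0$.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_proportion(pi_hat, n, pi_0, alpha=0.05):
    """Wald interval and left-tailed z-test for a single proportion."""
    successes, failures = pi_hat * n, (1 - pi_hat) * n
    if min(successes, failures) <= 5:
        raise ValueError("normal approximation needs more than 5 of each outcome")
    std_normal = NormalDist()
    # Estimated standard error: pi replaced by pi_hat.
    se = sqrt(pi_hat * (1 - pi_hat) / n)
    # Two-sided confidence interval with z_{alpha/2}.
    z_half = std_normal.inv_cdf(1 - alpha / 2)
    ci = (pi_hat - z_half * se, pi_hat + z_half * se)
    # z-transform and left-tail p-value for H_A: pi < pi_0.
    z = (pi_hat - pi_0) / se
    p_left = std_normal.cdf(z)
    return {"se": se, "ci": ci, "z": z, "p_left": p_left}


# Example I: 150 people polled, 35% for X, null value 0.4.
result = one_sample_proportion(pi_hat=0.35, n=150, pi_0=0.4, alpha=0.05)
print(result)
```

The code keeps the standard error unrounded, so its z-value and p-value can differ slightly in the second or third decimal from a calculation that rounds the variance to 0.0015 first.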
Calculating the standard error

Since $\hat\pi = 0.35$, replace in $\sqrt{\pi(1-\pi)/n}$ all the $\pi$s with $\hat\pi = 0.35$ and $n = 150$, to give the standard error $\sqrt{\hat\pi(1-\hat\pi)/n} = \sqrt{0.0015} \approx 0.039$. There are two equivalent ways to do the test.

Testing using the p-value method

We now calculate the p-value by making a z-transform:

$$z = \frac{\hat\pi - 0.4}{\sqrt{\hat\pi(1-\hat\pi)/n}} = \frac{0.35 - 0.4}{\sqrt{0.35 \times (1-0.35)/150}} = \frac{-0.05}{\sqrt{0.0015}} = -1.29.$$

Looking up the tables, $P(Z \leq -1.29) = 0.098$. Since 0.098 > 0.05 there is not enough evidence to reject the null.

Testing using rejection regions

We see from the alternative that the rejection region is on the left. We reject the null if $\hat\pi$ is less than

$$0.4 - 1.64\sqrt{\frac{0.35 \times 0.65}{150}} = 0.4 - 1.64\sqrt{0.0015} = 0.336.$$

Since $\hat\pi = 0.35 > 0.336$, there is not enough evidence to reject the null.

(ii) Constructing the confidence interval: Since $\alpha = 0.05$ we have $z_{0.05/2} = 1.96$, hence the CI is (with $\hat\pi = Y_n/n$)

$$\left[\frac{Y_n}{n} - z_{\alpha/2}\sqrt{\frac{\hat\pi(1-\hat\pi)}{n}},\ \frac{Y_n}{n} + z_{\alpha/2}\sqrt{\frac{\hat\pi(1-\hat\pi)}{n}}\right] = \left[0.35 - 1.96\sqrt{0.0015},\ 0.35 + 1.96\sqrt{0.0015}\right] = [0.274, 0.426].$$

Hence with confidence of 95% the proportion of voters who will vote for X lies in the interval [0.274, 0.426].

Remember CIs and tests are different. Even if we do one-sided tests, when we construct CIs they are always two-sided. Usually the CI will be symmetric about the sample mean (though not always).

Example II

Elections in the United Burrows of Aardvarks (UBA) are due to take place next week. There are two candidates, their names are Ernie and Bert. There has been some speculation that Ernie may win the elections. To test this hypothesis a poll of 1000 aardvarks was taken and the data is summarised below.

|        | Ernie | Bert |
|--------|-------|------|
| number | 600   | 400  |

(a) What test would you use to test the hypothesis that Ernie will win?
(b) Based on this sample, is there evidence to suggest that Ernie will win? Remember to state clearly the null and alternative (do the test at the 5% level).
(c) Construct a 95% CI for the proportion of the population who will vote for Ernie.

Solution II

(a) We observe that there is one population (all aardvarks), and we are interested in the proportion of aardvarks who will vote for Ernie. This means we should use a one-sample test for proportions.

(b) Suppose $\pi$ is the proportion of aardvarks who will vote for Ernie. Ernie will win if $\pi > 0.5$. Hence we want to test the hypothesis $H_0: \pi \leq 0.5$ against $H_A: \pi > 0.5$. Our estimate of $\pi$ is $\hat\pi = 600/1000 = 0.6$. Now we want to see whether $\hat\pi = 0.6$ is significantly larger than 0.5 (to see whether we will reject the null and accept the alternative). The test is a one-sided test and the rejection region is on the right. We will reject the null if $\hat\pi$ is larger than

$$0.5 + 1.64\sqrt{\frac{0.6 \times 0.4}{1000}} = 0.5 + 1.64 \times 0.0155 = 0.525.$$

Since $\hat\pi = 0.6 > 0.525$, there is evidence to reject the null.

(c) We now construct a 95% CI for the true proportion $\pi$. It is

$$\left[0.6 - 1.96\sqrt{\frac{0.6 \times 0.4}{1000}},\ 0.6 + 1.96\sqrt{\frac{0.6 \times 0.4}{1000}}\right] = [0.57, 0.63].$$

Recall that this means with 95% confidence the true proportion who will vote for Ernie lies in this interval. Look for elections around the world and look for such CIs.

Choosing the sample size

As before we can choose the sample size which gives a confidence interval with a required length. Remember the length of a CI is the difference between the endpoints of the interval:

$$\left(\frac{Y_n}{n} + z_{\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}}\right) - \left(\frac{Y_n}{n} - z_{\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}}\right) = 2 z_{\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}}.$$

If we want the interval to have length $2E$ (for example the proportion error to be about 5% either side of the estimator), then

$$2E = 2 z_{\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}}.$$
Solving this we have

$$n = \frac{z_{\alpha/2}^2\,\pi(1-\pi)}{E^2}.$$

Of course $\pi$ is unknown. So we replace $\pi$ with 0.5 or a pre-estimator of $\pi$ (using 0.5 is being very cautious since it leads to the largest $n$).

Example III: Comparing two population proportions

Suppose we want to compare the proportion of voters who would vote Democrat or Republican in town A and town B. The actual election will be in a few months, but we want to see what the mood is now. We cannot poll all people in all towns but we can take two samples (one from each town) and make a comparison.

500 people in town A and 600 people in town B were interviewed. 35% of the people interviewed in town A said they would vote Republican, whereas 38% of the people in town B said they would vote Republican. We want to know whether these differences are significant enough to suggest that there is a difference between voting attitudes in both towns.

What we seem to have are two different populations, each of which has been sampled, with sample sizes $n = 500$ and $m = 600$ respectively.

The outcome in town A is $A_i$, which takes value one if the ith person says they will vote Republican and zero if not. The number of sampled people in town A who said they would vote Republican is $Y_{500} = \sum_{i=1}^{500} A_i$.

Similarly the outcome in town B is $B_i$, which takes value one if the ith person says they will vote Republican and zero if not. The number of sampled people in town B who said they would vote Republican is $U_{600} = \sum_{i=1}^{600} B_i$.

We suppose that $P(A_i = 1) = \pi_1$ (the probability a person in town A will vote Republican is $\pi_1$). We suppose that $P(B_i = 1) = \pi_2$ (the probability a person in town B will vote Republican is $\pi_2$).

How can we formalise the question that the voting behaviour in both towns is the same?

Answer: We want to test whether the proportions are equal, $H_0: \pi_1 - \pi_2 = 0$ against the alternative $H_A: \pi_1 - \pi_2 \neq 0$.

Assumptions: Each person's answer is independent of the others and the two samples are also independent.
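Before moving on to the general two-sample problem, the sample-size formula $n = z_{\alpha/2}^2\,\pi(1-\pi)/E^2$ from the "Choosing the sample size" discussion can be turned into a short calculation. This is a sketch under stated assumptions: the function name `sample_size_for_proportion` is illustrative, the result is rounded up to the next whole person, and the pre-estimate 0.35 in the second call is a hypothetical value used only to contrast with the cautious choice 0.5.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_proportion(half_width, alpha=0.05, pi_guess=0.5):
    """Smallest n whose CI half-length is at most half_width, given pi_guess."""
    z_half = NormalDist().inv_cdf(1 - alpha / 2)
    # n = z^2 * pi * (1 - pi) / E^2, rounded up to a whole number of people.
    return ceil(z_half**2 * pi_guess * (1 - pi_guess) / half_width**2)


# Cautious choice pi = 0.5, error of 5% either side, 95% confidence.
n_cautious = sample_size_for_proportion(0.05, alpha=0.05, pi_guess=0.5)

# Hypothetical pre-estimate of pi (an assumption for illustration only).
n_informed = sample_size_for_proportion(0.05, alpha=0.05, pi_guess=0.35)

print(n_cautious, n_informed)
```

Since $\pi(1-\pi)$ is largest at $\pi = 0.5$, any other pre-estimate gives a sample size no larger than the cautious one.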
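The general comparing proportions problem

In the general problem two independent samples of size $n$ and $m$ are drawn from two populations. $A_1, \ldots, A_n$ are observed in sample one and $B_1, \ldots, B_m$ are observed in sample two. $A_i$ and $B_i$ can take the value one or zero (e.g. question: have you heard of the product? answer: yes or no). We assume all outcomes $A_i$ and $B_i$ are independent, with $P(A_i = 1) = \pi_1$ and $P(B_i = 1) = \pi_2$.

What is of interest is the number out of $n$ and $m$ who say yes; this is $Y_n = \sum_{i=1}^{n} A_i$ and $U_m = \sum_{i=1}^{m} B_i$. Then we can say $Y_n$ and $U_m$ are binomially distributed.

We want to test $H_0: \pi_1 - \pi_2 = 0$ against $H_A: \pi_1 - \pi_2 \neq 0$ at level $\alpha$.

We recall that $\hat\pi_1 = Y_n/n$ and $\hat\pi_2 = U_m/m$ can be used as estimators of $\pi_1$ and $\pi_2$. Under the null that $\pi_1 - \pi_2 = 0$ we have

$$\hat\pi_1 - \hat\pi_2 = \frac{Y_n}{n} - \frac{U_m}{m} \sim N\!\left(0,\ \frac{\pi_1(1-\pi_1)}{n} + \frac{\pi_2(1-\pi_2)}{m}\right),$$

hence it is normal with mean zero and variance $\frac{\pi_1(1-\pi_1)}{n} + \frac{\pi_2(1-\pi_2)}{m}$, or equivalently the estimator $\frac{Y_n}{n} - \frac{U_m}{m}$ has standard error

$$\sqrt{\frac{\pi_1(1-\pi_1)}{n} + \frac{\pi_2(1-\pi_2)}{m}}.$$

Hypothesis testing and confidence intervals

Since $\pi_1$ and $\pi_2$ are unknown we replace the variance with its estimator $\frac{\hat\pi_1(1-\hat\pi_1)}{n} + \frac{\hat\pi_2(1-\hat\pi_2)}{m}$, and

$$\hat\pi_1 - \hat\pi_2 = \frac{Y_n}{n} - \frac{U_m}{m} \sim N\!\left(0,\ \frac{\hat\pi_1(1-\hat\pi_1)}{n} + \frac{\hat\pi_2(1-\hat\pi_2)}{m}\right).$$

We use this to construct rejection regions. Just as in the estimation we have done before, construct the non-rejection region:

$$\left[-z_{\alpha/2}\sqrt{\frac{\hat\pi_1(1-\hat\pi_1)}{n} + \frac{\hat\pi_2(1-\hat\pi_2)}{m}},\ \ z_{\alpha/2}\sqrt{\frac{\hat\pi_1(1-\hat\pi_1)}{n} + \frac{\hat\pi_2(1-\hat\pi_2)}{m}}\right].$$

If $\frac{Y_n}{n} - \frac{U_m}{m}$ lies in this interval we do not have enough evidence to reject the null; if it does not lie in this interval we have enough evidence to reject the null.

To construct confidence intervals for the difference in proportions we use

$$\left[\frac{Y_n}{n} - \frac{U_m}{m} - z_{\alpha/2}\sqrt{\frac{\hat\pi_1(1-\hat\pi_1)}{n} + \frac{\hat\pi_2(1-\hat\pi_2)}{m}},\ \ \frac{Y_n}{n} - \frac{U_m}{m} + z_{\alpha/2}\sqrt{\frac{\hat\pi_1(1-\hat\pi_1)}{n} + \frac{\hat\pi_2(1-\hat\pi_2)}{m}}\right].$$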
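The variance of the difference used above follows from the independence of the two samples. A short derivation, offered as a sketch under the stated assumptions (all outcomes independent, with $n$ and $m$ denoting the two sample sizes and each single-sample variance taken from the one-sample case):

```latex
\begin{align*}
\operatorname{E}(\hat\pi_1 - \hat\pi_2)
  &= \pi_1 - \pi_2 , \\
\operatorname{Var}(\hat\pi_1 - \hat\pi_2)
  &= \operatorname{Var}(\hat\pi_1) + \operatorname{Var}(\hat\pi_2)
     - 2\operatorname{Cov}(\hat\pi_1, \hat\pi_2) \\
  &= \frac{\pi_1(1-\pi_1)}{n} + \frac{\pi_2(1-\pi_2)}{m} ,
\end{align*}
```

where the covariance term vanishes because the two samples are independent. Under the null $\pi_1 - \pi_2 = 0$ the mean is zero, which is why the non-rejection region is centred at zero.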
Example III: The voting problem for two towns

Recall 500 people in town A and 600 people in town B were interviewed. 35% of the sample in town A said they would vote Republican, whereas 38% of the sample in town B would vote Republican. This means that $\hat\pi_1 = Y_{500}/500 = 0.35$ and $\hat\pi_2 = U_{600}/600 = 0.38$.

Test $H_0: \pi_1 - \pi_2 = 0$ against the alternative $H_A: \pi_1 - \pi_2 \neq 0$. We set $\alpha = 0.1$, hence $z_{0.05} = 1.64$.

Recipe for doing the hypothesis test: First calculate the standard error

$$\sqrt{\frac{\hat\pi_1(1-\hat\pi_1)}{n} + \frac{\hat\pi_2(1-\hat\pi_2)}{m}} = \sqrt{\frac{0.35 \times 0.65}{500} + \frac{0.38 \times 0.62}{600}} = \sqrt{0.000847} \approx 0.0291.$$

Then construct the non-rejection region

$$\left[-z_{\alpha/2}\sqrt{\frac{\hat\pi_1(1-\hat\pi_1)}{n} + \frac{\hat\pi_2(1-\hat\pi_2)}{m}},\ \ z_{\alpha/2}\sqrt{\frac{\hat\pi_1(1-\hat\pi_1)}{n} + \frac{\hat\pi_2(1-\hat\pi_2)}{m}}\right] = \left[-1.64\sqrt{0.000847},\ 1.64\sqrt{0.000847}\right] = [-0.0477, 0.0477].$$

We see that the difference is $\hat\pi_1 - \hat\pi_2 = -0.03$. Since this difference lies in the interval $[-0.0477, 0.0477]$, there is not enough evidence to reject the null.

Observe: there is no t-distribution when we test proportions, we stick to the normal distribution!

Example IV

The FDA approved the drug Minoxidil as a remedy for male pattern baldness. They did a study and this is what they found:

|           | Sample size | % with new hair growth |
|-----------|-------------|------------------------|
| Minoxidil | 310         | 32                     |
| Placebo   | 309         | 20                     |

Let $\pi_M$ be the probability a person has new hair growth and uses Minoxidil, and $\pi_P$ be the probability a person has new hair growth and doesn't use Minoxidil. Use this data to test $H_0: \pi_M - \pi_P \leq 0$ against the alternative $H_A: \pi_M - \pi_P > 0$ with $\alpha = 0.05$.

Example V

Recall the data on the HIV vaccine we considered in class:

|             | Sample size | Number of people infected |
|-------------|-------------|---------------------------|
| HIV vaccine | 8000        | 51                        |
| Placebo     | 8000        | 74                        |

We want to test the hypothesis that the vaccine gave some protection. This would mean that the proportion of people in the entire population who take the vaccine and go on to develop HIV is less than the proportion of the entire population who do not take the vaccine and go on to develop HIV. Hence we want to test $H_0: \pi_V - \pi_P \geq 0$ against $H_A: \pi_V - \pi_P < 0$ (i.e. the people who took the vaccine are at less risk).
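The two-sample recipe used in Example III can be written as a small reusable function, and the same function can then be pointed at the data of Examples IV and V. This is a hedged sketch: the name `two_proportion_z`, the `alternative` argument and the returned dictionary are illustrative assumptions, and the standard error is the unpooled one used in these notes (other texts pool the two samples under the null).

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(p1, n, p2, m, alpha=0.05, alternative="two-sided"):
    """z-test for pi_1 - pi_2 using the unpooled standard error."""
    std_normal = NormalDist()
    diff = p1 - p2
    se = sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / m)
    z = diff / se
    if alternative == "two-sided":
        # Non-rejection region is [-margin, margin].
        margin = std_normal.inv_cdf(1 - alpha / 2) * se
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
        reject = abs(diff) > margin
    elif alternative == "greater":
        # Reject when diff exceeds +margin (H_A: pi_1 - pi_2 > 0).
        margin = std_normal.inv_cdf(1 - alpha) * se
        p_value = 1 - std_normal.cdf(z)
        reject = diff > margin
    elif alternative == "less":
        # Reject when diff falls below -margin (H_A: pi_1 - pi_2 < 0).
        margin = std_normal.inv_cdf(1 - alpha) * se
        p_value = std_normal.cdf(z)
        reject = diff < -margin
    else:
        raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")
    return {"diff": diff, "se": se, "z": z, "margin": margin,
            "p_value": p_value, "reject": reject}


# Example III: town A (n = 500, 35%) versus town B (m = 600, 38%), alpha = 0.1.
towns = two_proportion_z(0.35, 500, 0.38, 600, alpha=0.1)

# Example IV: Minoxidil (310, 32%) versus placebo (309, 20%), one-sided.
hair = two_proportion_z(0.32, 310, 0.20, 309, alpha=0.05, alternative="greater")

# Example V: vaccine (51 of 8000) versus placebo (74 of 8000), one-sided.
vaccine = two_proportion_z(51 / 8000, 8000, 74 / 8000, 8000,
                           alpha=0.05, alternative="less")

for name, res in [("towns", towns), ("hair", hair), ("vaccine", vaccine)]:
    print(name, {k: round(v, 4) if isinstance(v, float) else v
                 for k, v in res.items()})
```

The function uses the exact normal quantile (about 1.645 rather than the table value 1.64), so the margin for the towns comes out marginally wider than a hand calculation with 1.64.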