
STAT-UB.003 NOTES for Wednesday 0.MAY.0

We will use the file JulieApartment.mtw. We'll give the regression of Price on SqFt, show the residual versus fitted plot, and save the residuals and fitted values. Give the plot of (Resid, Price); that's residuals versus observed. Strong pattern! What does this mean? The problem is that this always happens. We just don't think about it, and when it pops up, we seem puzzled.

Here's a quick run through the formulas.

$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
$$\hat{Y}_i = b_0 + b_1 x_i$$
$$e_i = Y_i - \hat{Y}_i$$

A string of not-so-obvious statements about correlation in the data. (This refers to the mechanical calculation of correlation in the data.)

1. $\mathrm{Corr}(e_i, \hat{Y}_i) = 0$. Of course. There is no net slope on the residual versus fitted plot.

2. $\mathrm{Corr}(Y_i, x_i) = \mathrm{Corr}(Y_i, \hat{Y}_i)$. Of course. Once you've computed $b_0$ and $b_1$ as numbers, then $\hat{Y}$ is a simple linear function of $x$. (The two agree exactly when $b_1 > 0$; with a negative slope they differ in sign.) This equal correlation will not happen for multiple regression.

3. $\mathrm{Corr}(e_i, x_i) = 0$. This is a consequence of the computation. In a multiple regression, this will happen for each independent variable.

4. $\mathrm{Corr}(e_i, Y_i) > 0$. Yes, the residuals are positively correlated with the dependent variable. Here's some intuition. With $n$ not too small, it will happen that $\hat{Y}_i \approx \beta_0 + \beta_1 x_i$. Then $e_i = Y_i - \hat{Y}_i \approx \varepsilon_i$. Thus

$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \approx \beta_0 + \beta_1 x_i + e_i = (\text{non-random}) + e_i.$$

This will happen even with small $n$, and this can be formalized with a real proof.

As a prelude to what's going to happen next, we have a simple distributional result. Suppose that $Z$ is a standard normal random variable. Define a new random variable as $H = Z^2$. This $H$ is always positive. Its distribution is called the chi-squared with one degree of freedom. In fact, you will usually find that it's written $\chi^2_1$. The superscript (2) is the square part, and the subscript is the degrees of freedom identifier.

In general, if $Z_1, Z_2, \ldots, Z_k$ are independent standard normal random variables, then the distribution of $Z_1^2 + Z_2^2 + \cdots + Z_k^2$ is the chi-squared distribution with $k$ degrees of freedom. There are tables of the distribution, but we will be using only the one degree of freedom version.

We let $\chi^2_1(\alpha)$ = upper $\alpha$ point for the chi-squared distribution with one degree of freedom. The most commonly used value is $\chi^2_1(0.05) = 3.84$. Note that $1.96^2 = 3.8416 \approx 3.84$.

Here's another common situation. Suppose that you have two independent binomial random variables, $X$ and $Y$, corresponding to groups identified as 1 and 2. Contexts will of course change. Sometimes the groups will be identified as A and B, sometimes as X and Y, sometimes as Coke and Pepsi. Clerical vigilance is critical! (Clerical vigilance is always critical.)

Let's say that $X \sim \mathrm{Bin}(m, p_1)$ and $Y \sim \mathrm{Bin}(n, p_2)$. The ~ here is read "is distributed as." We want to test $H_0: p_1 = p_2$ versus alternative $H_1: p_1 \ne p_2$. This simple little problem can be put into this two-by-two display:

Group   Successes   Failures   Total
1       X           m - X      m
2       Y           n - Y      n

There are many, many statistical tests that have been proposed for this. The most common one operates with the assumption that the sample sizes $m$ and $n$ are large enough to allow normal approximations for each of the two binomials.

In what follows, there will be a lot of algebra. It's not very difficult, but it is quite annoying. You can certainly read through quickly to see what's going on.

Since we need to set $\alpha = P[\text{Type I error} \mid H_0]$ under the null hypothesis $H_0$, let's assume for now that $p_1 = p_2$ and (for now) let's just use the common symbol $p$.
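The link between the 3.84 cutoff and the familiar 1.96 can be sketched in a few lines. Since $H = Z^2$ exceeds a value $c$ exactly when $|Z|$ exceeds $\sqrt{c}$, the upper $\alpha$ point of the one degree of freedom chi-squared is the square of the upper $\alpha/2$ point of the standard normal. The helper name `chi2_1_upper` and the simulation size below are assumptions for illustration only.

```python
import random
from statistics import NormalDist


def chi2_1_upper(alpha):
    """Upper-alpha point of chi-squared with 1 df, via the normal quantile."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # upper alpha/2 point of N(0, 1)
    return z * z


crit = chi2_1_upper(0.05)  # close to 3.84

# Simulation check: the share of squared standard normals above the cutoff
# should be near 0.05.
rng = random.Random(0)
draws = [rng.gauss(0, 1) ** 2 for _ in range(100_000)]
frac_above = sum(h > crit for h in draws) / len(draws)

print("cutoff:", crit)
print("simulated tail fraction:", frac_above)
```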

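Just to be emphatic: the assumption that $p_1 = p_2$ is made only to derive the test and to control the type I error probability at $\alpha$.

NOTE: If there is a common value of $p$, then $X + Y$ would be binomial with $m + n$ trials and success probability $p$. However, if we just compute $X + Y$, then we would have no information about the possibility that $H_0$ would be wrong.

Intuitively, we'll have to base a statistic on either the difference $\frac{X}{m} - \frac{Y}{n}$ or the ratio of the two sample proportions, and then derive its distribution under $H_0$.

In setting up the test statistic, assume hypothetically that $H_0$ is true. It would seem that $\frac{X}{m}$ is approximately normal with mean $p$ and with standard deviation $\sqrt{\frac{p(1-p)}{m}}$. Independently, the distribution of $\frac{Y}{n}$ is approximately normal with mean $p$ and with standard deviation $\sqrt{\frac{p(1-p)}{n}}$. It follows then that $\frac{X}{m} - \frac{Y}{n}$ is approximately normal with mean 0 and with standard deviation $\sqrt{p(1-p)\left(\frac{1}{m} + \frac{1}{n}\right)}$.

Where did that come from? The thing that combines cleanly is the variance, the square of the standard deviation. We use these steps:

1. $\mathrm{Var}\left(\frac{X}{m}\right) = \frac{p(1-p)}{m}$ and $\mathrm{Var}\left(\frac{Y}{n}\right) = \frac{p(1-p)}{n}$

2. $\mathrm{Var}\left(\frac{X}{m} - \frac{Y}{n}\right) = \mathrm{Var}\left(\frac{X}{m}\right) + \mathrm{Var}\left(\frac{Y}{n}\right) = \frac{p(1-p)}{m} + \frac{p(1-p)}{n} = p(1-p)\left\{\frac{1}{m} + \frac{1}{n}\right\}$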

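$$\mathrm{SD}\left(\frac{X}{m} - \frac{Y}{n}\right) = \sqrt{p(1-p)\left(\frac{1}{m} + \frac{1}{n}\right)}$$

We have a plausible estimate of $p$ when $H_0$ is true. This is $\hat{p} = \frac{X + Y}{m + n}$. It follows then that we have an estimate of SD. This estimate will be called the standard error (SE) and it is

$$\mathrm{SE}\left(\frac{X}{m} - \frac{Y}{n}\right) = \sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{m} + \frac{1}{n}\right)}.$$

We can make an approximately normal statistic through

$$\frac{\frac{X}{m} - \frac{Y}{n}}{\mathrm{SE}\left(\frac{X}{m} - \frac{Y}{n}\right)}.$$

If the null hypothesis is true, this follows closely the normal distribution with mean 0 and with standard deviation 1. The conventional rule is that we'll reject $H_0$ at the 5% level if this is either $\le -1.96$ or $\ge 1.96$. More simply, we reject $H_0$ if the square of this exceeds 3.84. (Note that $3.84 \approx 1.96^2$.)

At this point, it helps to overlay additional notation:

Group   Successes   Failures      Total
1       X = A       m - X = B     m = A+B
2       Y = C       n - Y = D     n = C+D
Total   A+C         B+D           N = A+B+C+D

It can now be shown that the square of this, namely

$$\frac{\left(\frac{X}{m} - \frac{Y}{n}\right)^2}{\hat{p}(1-\hat{p})\left(\frac{1}{m} + \frac{1}{n}\right)},$$

is equal to

$$\frac{N\,(AD - BC)^2}{(A+B)(C+D)(A+C)(B+D)}.$$

NOTE: A slightly different derivation, one that estimates $p_1$ and $p_2$ separately by $\hat{p}_1 = \frac{X}{m}$ and $\hat{p}_2 = \frac{Y}{n}$ instead of pooling, will produce this test statistic as

$$\frac{\frac{X}{m} - \frac{Y}{n}}{\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{m} + \frac{\hat{p}_2(1-\hat{p}_2)}{n}}}.$$

The two forms are numerically very close when $H_0$ is true (and are usually very close even when $H_0$ is not true). In any case, the operations of the hypothesis test depend primarily on how the test statistic behaves when $H_0$ is true. Thus, either form will work.

Between the *** lines is this derivation. It's not critical to read this, but it's good to have it here.

*****************************************************************

The item under the square root sign, meaning $\hat{p}(1-\hat{p})\left(\frac{1}{m} + \frac{1}{n}\right)$, can be rearranged:

$$\hat{p}(1-\hat{p})\left(\frac{1}{m} + \frac{1}{n}\right) = \frac{X+Y}{m+n}\left(1 - \frac{X+Y}{m+n}\right)\frac{m+n}{mn} \qquad [1]$$

Item [1] is now

$$\frac{A+C}{N} \cdot \frac{B+D}{N} \cdot \frac{N}{(A+B)(C+D)} = \frac{(A+C)(B+D)}{N\,(A+B)(C+D)}.$$

The numerator can be revised as well:

$$\frac{X}{m} - \frac{Y}{n} = \frac{A}{A+B} - \frac{C}{C+D} = \frac{A(C+D) - C(A+B)}{(A+B)(C+D)} = \frac{AD - BC}{(A+B)(C+D)}.$$

Recall that our test statistic was $\dfrac{\frac{X}{m} - \frac{Y}{n}}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{m} + \frac{1}{n}\right)}}$, and that we were interested in its square. That squared statistic is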

$$\left[\frac{AD - BC}{(A+B)(C+D)}\right]^2 \cdot \frac{N\,(A+B)(C+D)}{(A+C)(B+D)} = \frac{N\,(AD - BC)^2}{(A+B)(C+D)(A+C)(B+D)}$$

*****************************************************************

In this form, $\dfrac{N\,(AD - BC)^2}{(A+B)(C+D)(A+C)(B+D)}$, it's not even too hard to remember. The numerator has the (squared) determinant of the two-by-two table (thought of as a matrix), and the denominator is the product of the four totals around the margin of the table.

This is well known; it's the chi-squared test for the two-by-two table of counts, and it's usually identified as $\chi^2$. (Sometimes this symbol is written $\chi^2_1$, to emphasize that it has one degree of freedom.)

The procedure, at level $\alpha$, is to reject $H_0$ if $\chi^2 \ge \chi^2_1(\alpha)$ = upper $\alpha$ point for the chi-squared distribution with one degree of freedom. The most commonly used value is $\chi^2_1(0.05) = 3.84$. Note that $1.96^2 = 3.8416 \approx 3.84$.

There are tables of the chi-squared distribution, but for this two-by-two table situation, we are using the 3.84 value nearly all the time. More complicated uses of the chi-squared distribution are done with software, and the software generally provides the p-value.

Here's a quick example. Suppose that you get these values:

            Cure   No cure   Total
Product A   28     16        44
Product B   40     18        58
Total       68     34        102

The hypothesis under test is $H_0: p_A = p_B$, that the two products have equal cure rates. The test statistic is

$$\chi^2 = \frac{102\,(28 \times 18 - 16 \times 40)^2}{44 \times 58 \times 68 \times 34} = 0.3197$$

The cutoff point is 3.84. We are nowhere close to statistical significance. We'd have to conclude that the two products are not significantly different. This isn't surprising, since the product success rates were $\hat{p}_A = \frac{28}{44} \approx 64\%$ and $\hat{p}_B = \frac{40}{58} \approx 69\%$.

Now let's suppose that we have two possible medications for stomach ache. We have two groups of subjects; 45 of them get medication H and 42 of them get medication M. The data on relief are these:

               Success   Fail   Total
Medication H   36        9      45
Medication M   22        20     42
Total          58        29     87

Is there a significant difference between the products? Well, we're testing $H_0: p_H = p_M$ against $H_1: p_H \ne p_M$. Let's agree to use the 0.05 level of significance. The test statistic is $\chi^2$, and we'll reject $H_0$ if $\chi^2 \ge 3.84$. Find

$$\chi^2 = \frac{87\,(36 \times 20 - 9 \times 22)^2}{45 \times 42 \times 58 \times 29} = \frac{87 \times 522^2}{45 \times 42 \times 58 \times 29} = 7.457$$

This difference would be judged statistically significant. You might observe that $\hat{p}_H = \frac{36}{45} = 0.80$ and $\hat{p}_M = \frac{22}{42} = 0.5238$. These are separated cleanly.
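Both worked examples can be reproduced with a short sketch. The function names below (`chi2_2x2`, `z_pooled`, `p_value_1df`) are hypothetical helpers written for these notes, not part of any statistics package. The sketch computes the shortcut formula, confirms that it equals the square of the pooled two-proportion statistic, and obtains a p-value from the normal distribution, using the fact that a one degree of freedom chi-squared is a squared standard normal.

```python
from math import sqrt
from statistics import NormalDist


def chi2_2x2(a, b, c, d):
    """Shortcut chi-squared for the table [[a, b], [c, d]] of counts."""
    total = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    return total * (a * d - b * c) ** 2 / margins


def z_pooled(a, b, c, d):
    """Pooled two-proportion statistic; rows are groups, column 1 is successes."""
    size1, size2 = a + b, c + d
    p_hat = (a + c) / (size1 + size2)             # pooled estimate under H0
    se = sqrt(p_hat * (1 - p_hat) * (1 / size1 + 1 / size2))
    return (a / size1 - c / size2) / se


def p_value_1df(chi2):
    """Upper tail of chi-squared(1): P(Z^2 >= chi2) = 2 * (1 - Phi(sqrt(chi2)))."""
    return 2 * (1 - NormalDist().cdf(sqrt(chi2)))


products = (28, 16, 40, 18)      # Product A and Product B: cure, no cure
medications = (36, 9, 22, 20)    # Medication H and Medication M: success, fail

for label, table in (("products", products), ("medications", medications)):
    stat = chi2_2x2(*table)
    print(label, "chi-squared:", round(stat, 4),
          "z squared:", round(z_pooled(*table) ** 2, 4),
          "reject at 0.05:", stat >= 3.84,
          "p-value:", round(p_value_1df(stat), 4))
```

The p-value route gives the same decisions as the comparison against 3.84, and it is what software would normally report.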