Chapter 6, Part 5. Confidence Intervals: the t distribution and the chi square distribution. October 23, 2008


There will be no help session on Monday, October 27.

Goal: To clearly understand the link between probability and confidence intervals.

Skills: Be able to calculate a 100(1 - α)% confidence interval for a sample mean, both for the case that the population variance is known and for the case that it is not known. Be able to explain what a confidence interval does and does not mean.

Contents:
Confidence interval using the t distribution
Definition of the chi square distribution

Stata commands: ci, tden(df, t), ttail(df, t), invttail(df, p)

Confidence interval for the mean where the random variable X has unknown mean μ and unknown standard deviation σ

If σ is not known, then we will use the sample standard deviation s to estimate σ (i.e. s = σ̂). So instead of having Z = (X̄ - μ)/(σ/√n), where Z ~ N(0,1), we have

t = (X̄ - μ)/(S/√n)

(where S is the random variable representing the sample standard deviation). This is the t distribution (also called Student's t distribution). Like the normal distribution, the t distribution is a probability density function. Probability will be related to area. We will come back to confidence intervals, but first we need to know something about the t distribution in order to know how to calculate the confidence intervals.

The t distribution, like the normal distribution, is actually a family of distributions. However, unlike the normal, we cannot transform each t distribution into a single t equivalent of the N(0,1) distribution (i.e. the standard normal distribution). It is, however, the case that each member of the family of t distributions is symmetric about zero. So every t distribution is centered at zero. Therefore, the only thing that can change from t distribution to t distribution is the shape of the curve. With the normal distribution we can change both the location (the mean) and the shape (the variance or standard deviation). The t distribution is a single-parameter distribution. The parameter is called degrees of freedom and is usually denoted by df. This is in contrast to the normal distribution, which is a two-parameter distribution with parameters μ and σ.

If the t distribution is based on a sample of size n, then it is said to have n - 1 degrees of freedom. So sample size is what distinguishes the various t distributions. Sample size determines the shape of the distribution.

In order to calculate t we not only have to estimate the population mean μ by the sample mean x̄, but we also have to estimate the population variance σ² by s² (the sample variance), or the population standard deviation σ by s (the sample standard deviation). This means that we should expect more variability with the t distribution than with the normal distribution (i.e. we should expect that the t distribution will have larger [or fatter] tails than the normal distribution).

Below we have a t distribution with 3 degrees of freedom. The graph shows that the point 3.182 cuts off the upper 2.5 percent of the area under the curve, or the lower 97.5 percent of the curve. For the normal distribution with mean = 0 and SD = 1, the point 1.96 would cut off these same percentages (i.e. you have to go further out on the t distribution than on the normal distribution to cut off only 2.5%).

[Figure: the density function for a t distribution with 3 degrees of freedom, plotted for t from -5 to 5, with the point (3.182, 0.019) marked; tden(3, 3.182) = 0.019.]

I am going to introduce three Stata functions that are useful in working with the t distribution. The curve above is the density for the t distribution with 3 degrees of

freedom. Notice in the graph above that the elements on the x-axis are represented by t and the elements on the y-axis are represented by f(t). So the points on the curve are (t, f(t)).

Stata function tden(df, t): This function allows us to graph the t distribution. We can use it to find the points (t, f(t)), where t plays the role of x and f(t) plays the role of y. Let t = 3.182. Then to find f(3.182) we will use the Stata function tden(df, t). The degrees of freedom (df) indicate which t distribution we are using, and t (here 3.182) gives us the point on the x-axis. So above we have the t distribution associated with df = 3, or sample size n = 4 (i.e. df = n - 1). Notice that Stata documents the function as tden(n, t). Stata's n is not the sample size but the degrees of freedom. I will try to consistently use df to avoid confusion.

. di tden(3,3.182)
.01920239

The point (3.182, 0.019) is on the curve. How do we get a graph of the t density function using tden(df, t)?

clear
*the dofile is tdist3df.do
*We know that the t distribution is centered at zero,
*but we will need some idea of what maximum and minimum
*values we should use for t.
*I checked Table 5 on page 831 of Rosner. The table
*indicates that for the t distribution with 3 df,
*99% of the area under the curve is to the left of 4.5
*and 99.5% of the area under the curve is to the left of 5.8.
*So if we graph from -5 to 5 we should be in good shape.
*
*In deciding how to create the values for t that we will need
*to get a graph, we have to use small increments so that we'll have
*a smooth curve. I have decided to use 0.25. To get from -5 to -4
*will take 4 of the 0.25's. So to get from -5 to 5 we will need 40
*observations, and we need to take one more to get the endpoint.
*
set obs 41
*
gen t = sum(0.25) - 5.25
gen foft = tden(3,t)
twoway (connected foft t, msymbol(none)), ytitle(f(t)) /*
*/ xtitle(t) xlabel(-5(1)5) /*
*/ title(The density function for a t distribution, size(medium)) /*
*/ subtitle(with 3 degrees of freedom, size(medium))

But what is more useful to us is to be able to take a particular value of t and find the

area that goes with that cutoff. Or, given an area, to be able to find the cutoff. This is because area is synonymous with probability.

Stata function ttail(df, t): This function will allow us to find the area to the right of the cutoff, given that we know the degrees of freedom and the value of the cutoff. Let df = 3 and t = 3.182.

. di ttail(3,3.182)
.02500857

So 2.5% is the area to the right of 3.182. Notice that this means the area to the left of 3.182 is 0.975. I have been working with only the upper tail for the t distribution. But since the t distribution is symmetric, we know that the area to the left of -3.182 is 0.025. What happens if I put in 3 degrees of freedom and the point 1.96?

. di ttail(3,1.96)
.0724261

This indicates we have more area to the right of 1.96 using the t distribution with 3 degrees of freedom than we would have with the N(0,1) distribution, where the area would be 0.025. The graph below makes this same point.

[Figure: graph showing that t[n-1, 1-α/2] > z[1-α/2]: an N(0,1) curve overlaid on a t distribution with 3 df, both centered at 0, with the point 1.96 marked.]
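If you want a cross-check of tden and ttail outside Stata, here is a minimal Python sketch. The function names t_density and t_tail are mine, and the tail area is obtained by brute-force numerical integration of the t density, not by Stata's algorithm; this is only an illustration of what the Stata functions compute.

```python
import math

def t_density(df, t):
    # f(t) = Gamma((df+1)/2) / (sqrt(df*pi) * Gamma(df/2)) * (1 + t^2/df)^(-(df+1)/2)
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def t_tail(df, t, upper=200.0, steps=100000):
    # Area under the density to the right of t: trapezoid rule on [t, upper].
    # For df = 3 the density falls off like t^-4, so the area beyond `upper`
    # is negligible.
    h = (upper - t) / steps
    area = 0.5 * (t_density(df, t) + t_density(df, upper))
    for i in range(1, steps):
        area += t_density(df, t + i * h)
    return area * h

print(t_density(3, 3.182))  # about 0.0192, matching di tden(3,3.182)
print(t_tail(3, 3.182))     # about 0.0250, matching di ttail(3,3.182)
print(t_tail(3, 1.96))      # about 0.0724, matching di ttail(3,1.96)
```

In practice you would of course just use Stata's built-ins; the point of the sketch is only that "area to the right of the cutoff" really is an integral of the density.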

Stata function invttail(df, p): invttail(df, p) provides the inverse of ttail(). If ttail(df, t) = p, then invttail(df, p) = t, where t is the cutoff and p is the area to the right of t. If I use invttail and enter the degrees of freedom (3) and the area in the upper tail (0.025), I get back the value that cuts off that area (3.182).

. di invttail(3,0.025)
3.1824463

Now consider the graph below. The dotted line represents an N(0,1) curve. Notice that its tails fit under the tails of all of the t distributions. This means that for any particular point t₀ in the tail, the area to the right of t₀ on the normal curve will be less than the area to the right of t₀ on the t distributions. Notice that the smaller the degrees of freedom, the fatter the tails. Also notice that as the number of degrees of freedom gets larger, the t distribution begins to look more like the normal distribution.
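The inverse can be sketched the same way: since the tail area decreases as the cutoff grows, a bisection search on the cutoff recovers the behavior of invttail. This is my own illustrative implementation, not Stata's algorithm:

```python
import math

def t_density(df, t):
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def t_tail(df, t, upper=200.0, steps=10000):
    # Trapezoid-rule tail area, as in the earlier sketch (coarser grid for speed).
    h = (upper - t) / steps
    area = 0.5 * (t_density(df, t) + t_density(df, upper))
    for i in range(1, steps):
        area += t_density(df, t + i * h)
    return area * h

def inv_t_tail(df, p, lo=0.0, hi=50.0):
    # Bisection: shrink [lo, hi] until the cutoff with tail area p is pinned down.
    for _ in range(40):
        mid = 0.5 * (lo + hi)
        if t_tail(df, mid) > p:
            lo = mid        # tail still too big: move the cutoff to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(inv_t_tail(3, 0.025))  # about 3.182, matching di invttail(3,0.025)
```

Bisection works here precisely because ttail is monotone in t, which is also why Stata can define invttail as a clean inverse.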

Using the function ttail, I can give you the idea that as the sample size gets very large, the t distribution approaches the standard normal distribution (i.e. the standard normal distribution is the limit for the t distribution as n gets large). For the standard normal distribution [i.e. N(0,1)], the number 1.96 cuts off 2.5% of the upper tail of the distribution.

. di ttail(10, 1.96)        (sample size n = 11)
.03921812

. di ttail(20, 1.96)        (sample size n = 21)
.03203913

. di ttail(100, 1.96)       (sample size n = 101)
.02638945

. di ttail(200, 1.96)       (sample size n = 201)
.02569242

. di ttail(1000, 1.96)      (sample size n = 1001)
.02513659

. di ttail(10000, 1.96)     (sample size n = 10,001)
.02501176

So it is clear that the larger the sample size (and hence the degrees of freedom), the closer the area comes to 0.025.

Let me remind you that the whole reason we care about the t distribution (at least at this point) is that we need it to get the confidence interval for the mean of the population when neither μ nor σ is known. I should also point out that it is much more common for us not to know σ than it is for us to know σ.

When σ is known and X ~ N(μ, σ²), then, where n is the sample size, the 1 - α confidence interval about the sample mean is:

( x̄ - z[1-α/2]·σ/√n , x̄ + z[1-α/2]·σ/√n )
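The convergence experiment above, ttail(df, 1.96) marching toward 0.025 as df grows, can be reproduced with the same numerical-integration idea. The only wrinkle is using math.lgamma so the density constant does not overflow at large df; again this is my own illustrative code, not Stata's:

```python
import math

def t_density(df, t):
    # log-gamma form of the t density constant, safe for large df
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c) * (1 + t * t / df) ** (-(df + 1) / 2)

def t_tail(df, t, upper=60.0, steps=100000):
    # Trapezoid-rule area to the right of t; the density is negligible past `upper`.
    h = (upper - t) / steps
    area = 0.5 * (t_density(df, t) + t_density(df, upper))
    for i in range(1, steps):
        area += t_density(df, t + i * h)
    return area * h

for df in (10, 100, 1000):
    print(df, t_tail(df, 1.96))  # shrinks toward 0.025 as df grows
```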

Remember that we were able to derive this confidence interval because we knew that X ~ N(μ, σ²) implies that Z = (X̄ - μ)/(σ/√n) is distributed N(0,1), so that

-z[1-α/2] < (X̄ - μ)/(σ/√n) < z[1-α/2]

with probability 1 - α. But when σ is not known we have

t = (X̄ - μ)/(S/√n)

which follows the t distribution with n - 1 degrees of freedom, where n is the sample size. So we should expect the confidence interval to look something like

( x̄ - t[something]·s/√n , x̄ + t[something]·s/√n )

Remember we can't convert all of the t distributions to a t equivalent of N(0,1). So we have to indicate not only the α-level (or area), but we also have to specify which member of the family of t distributions we are using. If the sample size is n, then the degrees of freedom for the t distribution are n - 1, so the t equivalent to z[1-α/2] is t[n-1, 1-α/2].

So the general form for the confidence interval using the t distribution is

( x̄ - t[n-1, 1-α/2]·s/√n , x̄ + t[n-1, 1-α/2]·s/√n )

So for a 95% confidence interval, α = 0.05 and α/2 = 0.025, which implies 1 - α/2 = 0.975. So for n = 16 (i.e. n - 1 = 15) and α = 0.05, the cutoff t[15, 0.975] for t[n-1, 1-α/2] is 2.1314. The value 2.1314 was obtained using the Stata function invttail(15,0.025).

So the confidence interval for the baseline heart rate in the cardiology example we used earlier (infed.dta) is

( 76.81 - 2.1314 · 17.95/√16 , 76.81 + 2.1314 · 17.95/√16 ) = (67.25, 86.37)

since x̄ = 76.81 and s = 17.95.

. sum heartlv0 if trtgrp == 1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
    heartlv0 |        16     76.8125    17.94889         54        105

We used the sample standard deviation 17.95 because the population standard deviation was unknown. But this is a lot of work. Surely Stata can do this for us. We can get the confidence interval using the Stata ci command.

. ci heartlv0 if trtgrp == 1

    Variable |       Obs        Mean    Std. Err.     [95% Conf. Interval]
-------------+-------------------------------------------------------------
    heartlv0 |        16     76.8125    4.487221      67.24821    86.37679

This is called the normal confidence interval by Stata. This means that the random variable X is approximately normal, not that Stata is using z[1-α/2] in the formula (it is, in fact, using the t distribution). As I've said before, in real life you almost never know the value of σ, so Stata doesn't even bother to provide you with that confidence interval. The confidence intervals derived using the t distribution are longer than those using the normal distribution, even if σ = s, because t[n-1, 1-α/2] > z[1-α/2].
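The arithmetic of that interval is easy to reproduce by hand; here is a quick Python check of the computation. The numbers are taken from the text above, and only the variable names are mine:

```python
import math

n = 16
xbar = 76.8125       # sample mean of heartlv0 for trtgrp == 1
s = 17.94889         # sample standard deviation
t_cut = 2.1314       # invttail(15, 0.025), i.e. t[15, 0.975]

se = s / math.sqrt(n)        # standard error of the mean, about 4.4872
lo = xbar - t_cut * se
hi = xbar + t_cut * se
print(se, lo, hi)            # endpoints close to Stata's (67.24821, 86.37679)
```

The tiny discrepancy from Stata's output comes only from rounding the t cutoff to 2.1314 rather than carrying 3.1824463-style full precision.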

All of the 95% confidence intervals using the normal distribution have the same length (given that the sample size stays fixed), although the locations vary. The confidence intervals based on the t distribution will have different lengths as well as different locations, because not only will the sample means vary from sample to sample, but the sample standard deviations will also vary from sample to sample, even when the sample size remains the same. But it is still the case, even with the t distribution, that we have confidence that 95% of the 95% confidence intervals will cover the true mean of the population.

Now we have been behaving like the only type of confidence interval that exists is the 95% confidence interval, but that is not the case. It is true that the default for the confidence interval in Stata is the 95% CI (i.e. if you do not specify a level, Stata assumes 95%), but you can request other levels. Below I have first requested the 95% confidence interval so that you can see it is exactly what we got above when we didn't specify a level.

. ci heartlv0 if trtgrp == 1, level(95)

    Variable |       Obs        Mean    Std. Err.     [95% Conf. Interval]
-------------+-------------------------------------------------------------
    heartlv0 |        16     76.8125    4.487221      67.24821    86.37679

If we request a 90% confidence interval, this means that α = 0.10, so that α/2 = 0.05. Notice that the 90% confidence interval below is centered about the mean 76.81 and that it fits inside the 95% CI.

. ci heartlv0 if trtgrp == 1, level(90)

    Variable |       Obs        Mean    Std. Err.     [90% Conf. Interval]
-------------+-------------------------------------------------------------
    heartlv0 |        16     76.8125    4.487221      68.94617    84.67883

[Figure: the 95% CI (67.25, 86.38) and the 90% CI (68.95, 84.68), both centered at 76.81, plotted on a scale of baseline heart rate in beats/min.]

The more confident you are, the longer the confidence interval.

. ci heartlv0 if trtgrp == 1, level(99)

    Variable |       Obs        Mean    Std. Err.     [99% Conf. Interval]
-------------+-------------------------------------------------------------
    heartlv0 |        16     76.8125    4.487221      63.58995    90.03505

Before we go on to the chi square distribution, let me give you a little information about the man who came up with the t distribution. W. S. Gosset was a famous chemist and statistician. He worked for the Guinness Brewery in Dublin during the early 1900's. He invented the t-test to handle small samples for quality control in brewing. Gosset discovered the form of the t distribution by a combination of mathematical and empirical work with random numbers, an early application of the Monte Carlo method. It is said that he published under the name Student because he didn't want other breweries to learn of the potential application of his work for a brewery. This is taken from the web and from the book Biostatistics by Forthofer and Lee, published in 2007.

Brief Overview

To date our confidence intervals have looked like

( x̄ - z[1-α/2]·σ/√n , x̄ + z[1-α/2]·σ/√n )          Confidence Interval 1

or

( x̄ - t[n-1, 1-α/2]·s/√n , x̄ + t[n-1, 1-α/2]·s/√n )    Confidence Interval 2

Both of these confidence intervals are intervals about the mean of the population of interest.

The major difference between them is the distribution that we use to obtain the confidence level. For Confidence Interval 1 the distribution used is the normal distribution. For Confidence Interval 2 the distribution is the t distribution. So in order to obtain a confidence interval, we need a known probability distribution that is associated with the sampling distribution of the parameter (of the original population) of interest. The parameter for each of the two confidence intervals above is the mean of the population. We use Confidence Interval 1 when we know the standard deviation of the population and Confidence Interval 2 when we know only the sample estimate (s) for the population standard deviation.

For quality control purposes the parameter that tends to be of interest is the variance or standard deviation rather than the mean. It happens that neither the normal nor the t distribution will work here. We will have to find a distribution that is associated with the sampling distribution of the variance (i.e. the distribution that you get when you take repeated random samples from the original population and compute the variance, rather than the mean, for each sample). It turns out that the chi square distribution is the one that works.

Remember that the random variable associated with the sampling distribution for both Confidence Intervals 1 and 2 above was X̄. This was because we were looking for a confidence interval about the population mean, and consequently the sampling distribution was the distribution of the means of all samples of size n from the population. We will find that the pattern for the confidence interval for the variance is different from Confidence Intervals 1 and 2 above. This is because we will be looking for a confidence interval about the variance (or standard deviation), and the sampling distribution will be the distribution of variances of all samples of size n from the original population.

We called the random variable associated with the sampling distribution of means X̄. X̄ seemed a reasonable choice since the notation X̄ is similar to the notation of the sample mean x̄. Note that we used an upper case letter for the random variable and a lower case for the sample mean. Following the same pattern, we will use an upper case S² for the random variable associated with the distribution of the variances of the random samples and a lower case s² for the sample variance. Notice that I have also used different fonts for the random variable and the sample estimate so they will be easier to distinguish.

In addition, we know that the distribution of (X̄ - μ)/(σ/√n), for n large enough, is approximately N(0,1), so we denoted the ratio as Z. In a similar fashion, we know that (X̄ - μ)/(S/√n) follows the t distribution with n - 1 degrees of freedom, so we denoted the ratio as t. We will use the fact that (n - 1)S²/σ² is distributed chi square with n - 1 degrees of freedom. Rosner's example 6.39 on page 196 is a nice example of a problem involving the variance rather than the mean of the population.

Confidence interval for the variance and standard deviation

We have already discovered that the variance is useful in describing data, and now we'll consider the use of the variance as a way of describing how reliable a process is.

Example: For many years the Centers for Disease Control has run an interlaboratory program to maintain control over measurements of lead in blood. Approximately 100 laboratories participate each month. Each laboratory is to ascertain the blood lead level in a sample that has been provided by CDC (all of the samples are supposed to have the same blood lead level). The histogram below gives the results for April, 1980. We can see that there is considerable variation from lab to lab. It is this variation that is of concern.

[Histogram: individual measurements of blood lead concentrations by approximately 100 different laboratories; frequency (0 to 20) versus blood lead concentration in μg/dL (30 to 60).]

The data in the histogram above is my guess at reading the values from a graph presented in a 1980 issue of Science. In order to get the confidence interval for σ or σ², we will need to introduce a new distribution, namely the chi square distribution (for those of you who have studied such things, the chi square is a gamma distribution).

Properties of the chi square distribution

1) The chi square distribution is a continuous distribution, like both the normal and t distributions.

2) Unlike the normal and t distributions, the chi square distribution is not symmetric about its mean. The chi square distribution is skewed with the tail to the right.

3) Like the t distribution, the chi square distribution is a family of distributions whose shapes depend on degrees of freedom. That is, you can select a unique member of the family by specifying just the degrees of freedom. So the chi square distribution is a single-parameter distribution.

4) Because the chi square distribution is continuous, probability will be associated with area, as it is with the normal and t distributions.

Definition of the chi square distribution: If G = X1² + X2² + ... + Xn², where Xi ~ N(0,1) for each i and the Xi's are independent (i.e. the Xi's are independent and identically distributed, which is abbreviated as iid), then G is said to follow a chi-square distribution with n degrees of freedom (df). So G is the sum of the squares of identically distributed independent N(0,1) random variables. The distribution is denoted by χ²_n (χ is the Greek letter chi). The mean of the chi square distribution with n degrees of freedom is n and the variance is 2n.

If n = 1, so that we have only one random variable, say X1, and that random variable is distributed N(0,1), then X1² is distributed chi square with one degree of freedom. The graphs below are an attempt to show you how this works.

A lack of normality in the distributions forming G has a greater impact on the validity of the confidence intervals derived here than it does for the confidence intervals using the t and z distributions to get a confidence interval about the mean. So we need to be careful when we use the confidence interval for the variance.
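The definition can be checked by simulation: if G is the sum of df squared independent N(0,1) draws, then across many repetitions the sample mean of the G's should be close to df and the sample variance close to 2·df. A minimal Monte Carlo sketch (my own, not from the text):

```python
import random

random.seed(1)    # fixed seed so the run is reproducible
df = 3

# Each draw of G is a sum of df squared standard normal draws.
draws = [sum(random.gauss(0, 1) ** 2 for _ in range(df)) for _ in range(200000)]

mean = sum(draws) / len(draws)
var = sum((g - mean) ** 2 for g in draws) / len(draws)
print(mean, var)  # close to df = 3 and 2*df = 6
```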

[Figure: top panel, the standard normal distribution N(0,1), with area approximately 0.025 in each tail beyond ±2; bottom panel, the chi square distribution with 1 degree of freedom, with area approximately 0.05 to the right of 4, where 0.05 = 2 × 0.025.]

Note that the actual cutoff for 0.025 for the N(0,1) distribution is ±1.96 and that 1.96² = 3.84. I have used 2 in place of 1.96 and 4 in place of 3.84 to make the graphs easier to follow.

[Figure: top panel, the standard normal distribution N(0,1), with the middle 68% shown as two sections of approximately 0.34 each; bottom panel, the chi square distribution with 1 degree of freedom, with area approximately 0.68 (= 2 × 0.34) to the left of 1.]

[Figure: chi square distributions with various degrees of freedom (df = 1, 2, 5, 10, 15), with the densities g(x) plotted for x from 0 to 25.]

What happens if the random variables are independent and identically normally distributed, but not N(0,1)? If X1, X2, X3, ..., Xn ~ N(μ, σ²) and the X's are independent, then

Zi = (Xi - μ)/σ ~ N(0, 1) for i = 1, 2, ..., n

and since the Xi are independent, the Zi are also independent. [Notice it is Xi, not X̄, so we divide by σ, not σ/√n.] This means that

Z1² + Z2² + ... + Zn² ~ χ²_n

because it is the sum of the squares of independent normal random variables with mean = 0 and variance = 1.
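The standardization step can be simulated too: draws from N(μ, σ²) are converted to N(0,1) by z = (x - μ)/σ, and the sum of n squared z's then behaves like a chi square with n df (mean about n). The particular μ, σ, and n below are arbitrary choices of mine for the illustration:

```python
import random

random.seed(2)
mu, sigma, n = 5.0, 2.0, 4

sums = []
for _ in range(100000):
    zs = [(random.gauss(mu, sigma) - mu) / sigma for _ in range(n)]  # each z ~ N(0,1)
    sums.append(sum(z * z for z in zs))                              # ~ chi square, n df

mean = sum(sums) / len(sums)
print(mean)  # close to n = 4
```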