A Method for Analyzing Unreplicated Experiments Using Information on the Intraclass Correlation Coefficient

Journal of Modern Appled Statstcal Methods Volume 5 Issue Artcle 7 --5 A Method for Analyzng Unreplcated Experments Usng Informaton on the Intraclass Correlaton Coeffcent Jams J. Perrett Unversty of Northern Colorado, jams.perrett@unco.edu Follow ths and addtonal works at: http://dgtalcommons.wayne.edu/jmasm Part of the Appled Statstcs Commons, Socal and Behavoral Scences Commons, and the Statstcal Theory Commons Recommended Ctaton Perrett, Jams J. (5) "A Method for Analyzng Unreplcated Experments Usng Informaton on the Intraclass Correlaton Coeffcent," Journal of Modern Appled Statstcal Methods: Vol. 5 : Iss., Artcle 7. DOI:.37/jmasm/635456 Avalable at: http://dgtalcommons.wayne.edu/jmasm/vol5/ss/7 Ths Regular Artcle s brought to you for free and open access by the Open Access Journals at DgtalCommons@WayneState. It has been accepted for ncluson n Journal of Modern Appled Statstcal Methods by an authorzed edtor of DgtalCommons@WayneState.

Journal of Modern Appled Statstcal Methods Copyrght 6 JMASM, Inc. November, 6, Vol. 5, No., 43-44 538 947/6/$95. A Method for Analyzng Unreplcated Experments Usng Informaton on the Intraclass Correlaton Coeffcent Jams J. Perrett Department of Appled Statstcs and Research Methods Unversty of Northern Colorado Many studes are performed on unts that cannot be replcated; however, there s often an abundance of subsamplng. By placng a reasonable upper bound on the ntraclass correlaton coeffcent (ICC), t s possble to carry out classcal tests of sgnfcance that have conservatve levels of sgnfcance. Key words: Intraclass correlaton coeffcent, unreplcated experment, subsamples Introducton A researcher, wshng to compare two dfferent teachng methods, teaches two classes: one wth method and one wth method. The grade of each student n the two classes s recorded wth the purpose of comparng the average grade for the students taught by method to the average grade for the students taught by method. The wthn class varaton s the varablty from student to student. The between class varaton s due to such factors as tme of day, dfference n classroom settng, etc. One would expect the varaton from class to class to be small relatve to the wthn class varaton, regardless of whether the students are beng taught mathematcs, creatve wrtng, etc. The majorty of the total varablty wll be explaned by the dfference n performance of the students wthn a class, and that should be farly smlar for one Jams Perrett s Assstant professor of Appled Statstcs at the Unversty of Northern Colorado. Hs research nterests nclude the analyss of unreplcated experments, computatonal statstcs, and repeated measures analyss. E-mal: jams.perrett@unco.edu subject as t s another. Thus, the ntraclass correlaton coeffcent (ICC) should be consstent n studes of smlar types, and t wll tend to be small (In an educaton example such as ths, t would not be unusual for the ICC to be less than.) n many stuatons. An unreplcated experment s one n whch a treatment of nterest s appled to only one unt. Some experments logstcally cannot be replcated. Crcumstances that mght prevent replcaton are cost n tme or money or both, scarcty of expermental unts, destructve expermentaton, among other thngs. Some farmers just don t have an extra plot of land to experment on. Consderaton s gven for what can be done n such cases. The Model Let y j be the measurement taken on the j th student gven the th treatment, μ s the fxed effect of treatment, s the random effect of class, and j s the random effect of student j gven treatment, =,,,t; j =,,, n. Let ~ n(, σ ), where σ represents the between class varablty; let j ~ n(, σ ), where σ represents the between student wthn class varablty. It s assumed that and j are ndependent. A model s wrtten for the experment as follows: 43

JAMIS J. PERRETT 433 yj = μ + + () Ths type of model s a sngle factor completely randomzed desgn wth subsamplng, where classes are the expermental unts for each treatment level, and the students wthn each class are the subsamples, or observatonal unts. A researcher, who uses the students as the expermental unts, gnores the varablty that can exst between dfferent classes recevng the same treatment. Such an assumpton s to clam that j σ =. If the researcher correctly uses classes as the expermental unts, there s only one unt per treatment level and zero degrees of freedom avalable for testng the dfference between these treatment means (Barckowsk, 98, Blar, 986). The Intraclass Correlaton Coeffcent and the Independence Assumpton The ntraclass correlaton coeffcent (ICC) s defned as the correlaton between y j and y j (two subsample unts wthn one expermental unt). In ths study, refers to the true value of the ICC and refers to a best guess value, chosen by the researcher, to plug nto formulas n place of the ICC n the analyss. The ICC for the model n Equaton can be obtaned usng the followng formula: = var ( yj, yj ) σ = ( y ) var( y ) σ + σ cov j j () Thus, f σ =, the ICC s also zero. The result s ndependent samples assumng normalty of error terms. To llustrate the deas, consder the two-treatment case n whch H : μ = μ versus H A : μ μ s tested. The varance of the dfference between the two sample means s gven by var n + n ( y y ) = σ + σ nn n + n = σ + (3) nn usng the substtuton σ = σ. Z = y y (4) + n n σ + nn where Z has a standard normal dstrbuton. If s ncorrectly assumed to be zero, then Z = y σ y n + n nn. (5) A test statstc for a hypothess test based on the ncorrect assumpton that observatons are ndependent wll be too large and consequently nflate the assocated Type error. Boundng the ICC In practce, σ may be estmated from the pooled varance of observatonal unts wthn the expermental unt. Wth many subsamples, σ can be estmated qute accurately. However, n an unreplcated experment there s no way to estmate σ and consequently. Although t may not be possble to know the value of the ICC, n many cases t may be reasonable to make assumptons about ts upper bound. To do so, one must consder the relatve sze of the between unt varablty to the wthn unt varablty. In the example consdered n ths study, the classes were smlar, so t s reasonable to assume the component of the varance due to classes s relatvely small. On the other hand, the component of varance due to dfferences among students wthn a class tends to be relatvely large due to the nherent dfferences n students: maturty, study habts, ntal understandng, ntellgence, etc. Thus, t

434 A METHOD FOR ANALYZING UNREPLICATED EXPERIMENTS would seem to be reasonable to place a bound on the ICC that s less than.5 and possbly qute a bt smaller than ths. Data dscussed n ths study ndcate that a bound of <. 5 s reasonable for ths example. Other examples of ths also are common n agrculture. Consder for nstance feedng treatments appled to pens wth measurements made on ndvdual anmals wthn pens. For many measurements such as weght gan, body condton scores, and varous blood parameters, the greatest source of varablty s among the anmals wthn the pens. The component of varance due to pens, whle not neglgble, s often just of fracton of the component of varance due to the anmals. In such cases t s qute reasonable to assume that s small. The upper bound wll be denoted as max. The mportance of a small value of can be seen n Equaton 3 whch shows that the varance of the dfference of sample means gets smaller as gets smaller. In the lmt, the varance s that of the dfference of means of ndependent observatons. Intutvely, the closer s to zero, the more the observatons behave as f they were ndependent. Moreover, the analyss usng a known ICC becomes more powerful as gets closer to zero wth the lmtng power beng that obtaned when the observatons are ndependent. Methodology The Testng Strategy Although s not known, f pror experence allows for an upper bound to be placed on t, then t s possble to carry out statstcal tests for unreplcated experments. Consder a test of the hypotheses H : μ = μ vs. H A : μ μ. Let denote a value of that the researcher assumes to be reasonable based on pror experence. Let μ μ represent the hypotheszed dfference of the mean for treatment and the mean for treatment respectvely. The test statstc s T = ˆ ( y y ) ( μ μ ) n + n ˆ σ + nn (6) where σ represents a pooled estmate of the wthn class varance. Let s = n k = ( y y ) k n, =, (7) be the varance of the measurements under treatment. Then ( n ) s + ( n ) s ˆ σ =. (8) n + n Let α denote the nomnal level of the test and α the true level of sgnfcance. Dependng on assumptons, α may or may not equal α. Balancng smplcty and desrable propertes, t s suggested that tests be based on p-values obtaned when = max. It s also recommended that confdence ntervals and multple comparsons be carred out usng =. max Propertes of the test statstc For smplcty, assume the number of subsamples per class s the same for all classes,.e. n = n = n. The sample sze s one one class per treatment. Let μ μ be the true value of the dfference between the treatment means to be compared n the hypothess test. Let υ =(n-), the degrees of freedom for the test. Let t.5, υ denote the upper tal.5 value of the t-dstrbuton wth υ degrees of freedom. The probablty of rejectng H for an upper-tal test at α =. 5 can be determned usng the followng steps:

JAMIS J. PERRETT 435 ( y y ) ( μ μ ) P > t.5, υ (9) ˆ σ + n If the followng s defned as and Z = λ = ( y y ) ( μ μ ) σ + n ( μ μ ) ( μ μ ) σ + n then Equaton 9 can be expressed as Z + λ P > U υ ( t ).5, () () + n υ () + n where Z~n(,), λ s a constant, and U~ χ υ. To the left of the nequalty s a random varable wth a non-central t-dstrbuton wth noncentralty parameter λ and degrees of freedom, υ. Evaluaton The plug-n method nvolves choosng a value to plug nto Equaton 6. The method s evaluated by determnng sgnfcance levels and power curves for the tests, for dfferent values of and. In partcular, the method s consdered useful f t can mantan a Type error level close to the nomnal level ( α ) whle provdng power to detect dfferences n treatment means for a reasonable range of near. Equaton may be used to evaluate the probablty of rejecton for tests of the hypothess that the two means are equal. These probabltes depend on the values for,, n, and the non-centralty parameter, λ. In order to measure the devaton between the two means, a standardzed dfference s defned as StDff = μ μ. Probabltes wll depend on StDff σ through λ. Usng Equaton, probabltes were generated usng the followng values as ndces: StDff =, n = = through.99 n ncrements of. = through.99 n ncrements of. Fgure s a result of the generated probabltes. The data used for ths plot were created by evaluatng Equaton. For the plot, let n resultng n lm = beng used n n n Equaton. The nomnal sgnfcance level α =.5 s used for the plot. Also, nether the power nor the sgnfcance level s defned at = or =. Ths plot depcts the case wth a theoretcally nfnte number of students per class. A smlar graph results from usng students per class (omtted). Ths power plot s overlad wth the correspondng plot of two-tal sgnfcance levels. The red, green, and blue areas denote power n the ranges.,. to.5, and.5 to.9 respectvely. The lnes represent the two-tal probabltes of rejectng the null hypothess. If =,

436 A METHOD FOR ANALYZING UNREPLICATED EXPERIMENTS Fgure. Large (sub)sample power of detectng a dfference of sze between two treatment means wth no replcaton, overlad wth large sample sgnfcance (alpha) levels. then α=.5. If the plug-n value s less than the true value ( ), the Type error s nflated ( α.5 ). For example, f a value of =. s used when =.4, alpha wll be.53. If the plug-n value s greater than the true value ( ), the probablty of a Type error s smaller than α =.5. Because the p-value ncreases as ncreases, the p-value obtaned when = max s greater than or equal to the true p-value, so tests usng ths methodology are conservatve. On the other hand, f max s set too small by mstake, Type error wll be nflated. In terms of power and length of confdence ntervals, a smaller max s better than a larger one. Ths study does not suggest that problems wth lack of replcaton magcally dsappear wth ths methodology. Even f s known, > presents problems. In the two sample case, for nstance, the varance of the dfference between two means when the number of subsamples n and n approaches nfnty becomes = σ (3) Ths would be the formula for the varance of the dfference between two sample means f

JAMIS J. PERRETT 437 each treatment had n = replcatons. Thus, can be thought of as the number of pseudo replcatons. For nstance f =., the number of pseudo replcatons s 4. Ths study smply suggests the proposed methodology gves researchers a way to analyze data when conventonal analyss of varance could not be carred out due to lack of replcaton. It s also useful to examne the maxmum power attanable usng a plug-n value for n hypothess testng. Fgure demonstrates the maxmum attanable power under the most deal condtons: namely = and n =. As seen n Fgure, f the true value s.5, the power s low. For smaller values of the power ncreases consderably. For nstance, for a standardzed dfference of. and =., the power s.564. Thus, usng a plug-n value for s only gong to be effectve for smaller values of. What the Researcher May Know About It s possble for the researcher to obtan nformaton about based on pror experments of a smlar nature, or from knowledge about the behavor of for a current experment. Dstrbutonal Informaton If a large amount of dstrbutonal nformaton s avalable from pror studes of a nature smlar to that of the current study, the researcher may be able to put a pror dstrbuton or emprcal dstrbuton on (not consdered n ths study). Fgure. Maxmum attanable power for testng the dfference between two treatment means usng the plug-n method wth dfferent values of = and nfnte subsamplng.

438 A METHOD FOR ANALYZING UNREPLICATED EXPERIMENTS Pont or Interval Informaton The researcher may not have extensve dstrbutonal nformaton about, but may have an ndcaton of a reasonable mnmum or maxmum value of. The Plug-n Value In the case of no replcaton, zero degrees of freedom exst for conductng a hypothess test comparng means. So, the test cannot be performed usng tradtonal methods. Wth ths procedure a value, chosen by the researcher, s used as f t were the true value,. Ths value,, called a plug-n value, can be used n hypothess testng and n producng confdence ntervals of dfferences of treatment means. The strateges for choosng a value for gven n ths secton are proposed to researchers who have an unreplcated experment and have a reasonable dea of the actual value,. Other methods for dealng wth unreplcated experments have been nvestgated wth smlar results to the plug-n methods. Proposed Strateges Two strateges that make use of pror nformaton about to test hypotheses about treatment means n an unreplcated, twotreatment experment wll be presented n ths study. Strategy : Plot of the Condtonal P-value Gven Descrpton For a gven set of data, the p-value may be obtaned for varous assumed values of. It s computed as the probablty that a t-dstrbuted random varable wth n + n degrees of freedom s more extreme than the observed value of the statstc defned by Equaton. A condtonal p-value plot plots p-values for testng a certan hypothess over the range of possble values of. Propertes The three stuatons a researcher may encounter wth a condtonal p-value plot are ) all p-values are above α for reasonable values of, ) some p-values are below α for reasonable values of and some are above α for reasonable values of, and 3) all p-values are below α for reasonable values of. When the frst stuaton occurs, the result of the test s to fal to reject the null hypothess at level α. When the thrd stuaton occurs, the result of the test s to reject the null hypothess at level α. When the second stuaton occurs, t s less obvous what the results of the test of hypothess should be. The researcher s advsed to refran from makng a decson about the acceptance of the null hypothess f the second stuaton s observed. The accuracy of the results wll depend on the accuracy of the choce of the lkely range of. Implementaton The researcher determnes the range of reasonable values for the ICC based on nformaton obtaned from pror studes. Then the researcher creates the condtonal p-value plot gven the possble values of. The decson to reject or fal to reject the null hypothess, or abstan from judgment on the acceptance of the null hypothess s made based on the observed p-values wthn the range of reasonable values of. Strategy : Maxmum Rho Descrpton The Maxmum Rho procedure smply nvolves choosng = max. That value,, s then ncorporated nto the test statstc (Equaton 6). The test rejects the null hypothess f the p-value s less than α.

JAMIS J. PERRETT 439 Propertes The Maxmum Rho procedure assures the true sgnfcance level, α, s less than or equal to the nomnal value α, wth equalty when =. The closer s to, the hgher the power of the test. Implementaton To mplement the Maxmum Rho procedure, the researcher smply computes a p- value for the test based on usng = max n Equaton 6. Example The class data conssts of fnal course grades for two classes of ntroductory statstcs taught by two dfferent methods. The sample means for the two classes were.83 and 3.37 wth sample standard devatons of.4 and.84 respectvely. A researcher would lke to see f there s a dfference between the average grade receved by students taught by the two dfferent methods. Only one class was observed for each of the two methods makng ths an unreplcated experment wth class as the unt of study, student as the subsamplng unt. Let H : μ = μ and H A : μ μ. Let α =.5. It s assumed that replcaton would have yelded small class-to-class varablty relatve to the total varablty. To determne f that could n fact be the case, a study was conducted on pror undergraduate courses taught to multple sectons over multple semester. College Course Grades The value was estmated from a varety of courses offered at Kansas State Unversty usng the SAS MIXED procedure. The components of varance consst of varablty of scores due to secton, σ ; and the varablty of scores due to students wthn sectons, σ. Fourteen dfferent undergraduate courses were selected (CHM,, 3; CIS ; ENGL, 5; MATH, ; MUSIC 5, 55; PSYCH,, 35; SPAN 6) each wth multple sectons, coverng both Fall and Sprng semesters over the years -3, for a total of 43 course-semester combnatons. These values are graphed n a frequency hstogram n Fgure 3. All 43 estmated ICC values are at or below.33. The majorty, 95%, s at or below. 9% are below.5. Only one value s at.33. That value s for a honors Englsh course (ENGL 5). Other courses nclude undergraduate courses n chemstry, Englsh, musc, CIS, math, psychology, and Spansh. Based on ths, t would be reasonable to use the range of from. to.5, wth an assumed maxmum at = max =.5 for the plug-n methods nvolvng grades for ths study.

44 A METHOD FOR ANALYZING UNREPLICATED EXPERIMENTS Estmated ICC Values for KSU Grades 43 Courses wth Multple Sectons (-3) Frequency 5..4.8..6. Estmated ICC.4.8.3 Fgure 3. Estmated ICC values for KSU grades for 43 courses wth multple sectons (-3). Results Plot of the Condtonal P-value Gven Fgure 4 plots the p-value as a functon of for the class data. It can be seen that p- values are only sgnfcant f s less than.3. Because the lkely value of s greater than.3, the result of the test s to fal to reject the null hypothess at α =. 5. Maxmum Rho Usng =.5 and Equaton 6, the test statstc s found to be t=-.894 wth a p-value of.374. So the result of the test s to fal to reject the null hypothess n favor of the alternatve (at α =.5 ). Based on both the condtonal p-value plot and the results of the Maxmum Rho procedure, the concluson s that the dfference between the mean grades for the two classes s not sgnfcant. Assumng, the probablty of a Type error for the test n ths example s at most.5. If, n fact, = =.5, the power for detectng a dfference of one grade pont s approx..3373. However, f <.5, the power wll be less. A sgnfcant dfference would have been detected usng a classcal t-test to test the same hypothess under the assumpton of ndependent samples. However, the probablty of a type error would be nflated under that assumpton.

JAMIS J. PERRETT 44 Fgure 4. Condtonal P-value Plot. P-value curve for dfferent values of usng the Class data. Concluson If replcaton s feasble, a replcated experment s always preferred over an unreplcated experment. However, many studes are performed on unts that cannot be replcated. The method descrbed n ths paper makes t possble to accurately analyze unreplcated experments n whch the ntraclass correlaton coeffcent (ICC) s small and relatvely stable. By placng a reasonable upper bound on the ICC, t s possble to carry out classcal tests of sgnfcance that have conservatve levels of sgnfcance. Ths methodology has wde applcablty for analyzng unreplcated experments n many felds of research and ts smple computatons wll surely appeal to the appled researcher. References Barckowsk, R. S. (98). Statstcal power wth group mean as the unt of analyss. Journal of Educatonal Statstcs, 6, 67-85. Blar, R C., et. al. (983). An nvestgaton of the robustness of the t-test to unt of analyss volatons. Educatonal and Psychologcal Measurement, 43, 69-8. Blar, R. C., & Hggns, J. J. (986). Comment on statstcal power wth group mean as the unt of analyss. Journal of Educatonal Statstcs Summer,, 6-69. Casella, G., & Berger, R. L. (). Statstcal nference (nd ed.). Pacfc Grove, CA: Duxbury. Graybll, F. A. (976). Theory and applcaton of the lnear model. Pacfc Grove, CA: Wadsworth. Lttle, R. C., et. al. (996). SAS system for mxed models. Cary, NC: SAS Insttute Inc.

44 A METHOD FOR ANALYZING UNREPLICATED EXPERIMENTS Mllken, G. A., & Johnson, D. E. (99). Analyss of messy data, Volume : Expermental desgn. London, UK: Chapman & Hall. SAS Insttute Inc. (999). SAS/STAT user s gude, verson 8. Cary, NC: Author. SAS s a regstered trademark of SAS Insttute Inc., Cary, N.C.