Mixed Models for Assessing Correlation in the Presence of Replication


Journal of the Air & Waste Management Association
ISSN: 1096-2247 (Print). Journal homepage: http://www.tandfonline.com/loi/uawm20

Mixed Models for Assessing Correlation in the Presence of Replication
Anthony Hamlett, Louise Ryan, Paulina Serrano-Trespalacios & Russ Wolfinger

To cite this article: Anthony Hamlett, Louise Ryan, Paulina Serrano-Trespalacios & Russ Wolfinger (2003) Mixed Models for Assessing Correlation in the Presence of Replication, Journal of the Air & Waste Management Association, 53:4, 442-450, DOI: 10.1080/10473289.2003.10466174

Published online: Feb 2012.
Download by: [ ] Date: 06 December 2017, At: 13:59

TECHNICAL PAPER
J. Air & Waste Manage. Assoc. 53:442-450. Copyright 2003 Air & Waste Management Association.
Journal of the Air & Waste Management Association, Volume 53, April 2003

Anthony Hamlett and Louise Ryan, Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts
Paulina Serrano-Trespalacios, Department of Environmental Health, Harvard School of Public Health, Boston, Massachusetts
Russ Wolfinger, SAS Institute, Inc., Cary, North Carolina

ABSTRACT
The need to assess correlation in settings where multiple measurements are available on each of the variables of interest often arises in environmental science. However, this topic is not covered in introductory statistics texts. Although several ad hoc approaches can be used, they can easily lead to invalid conclusions and to a difficult choice of an appropriate measure of the correlation. Lam et al. approached this problem by using maximum likelihood estimation in cases where the replicate measurements are linked over time, but the method requires specialized software. We reanalyze the data of Lam et al. using PROC MIXED in SAS and show how to obtain the parameter estimates of interest with just a few lines of code. We then extend Lam et al.'s method to settings where the replicate measurements are not linked. Analysis of the unlinked case is illustrated with data from a study designed to assess correlations between indoor and outdoor measurements of benzene concentration in the air.

IMPLICATIONS
While replicate measurements are commonly taken in environmental science research settings, it is unclear how to use these replicates to assess correlations. When the number of replicates varies by subject, use of ad hoc approaches to correlation results in an efficiency loss and, hence, in unreliable correlation estimates. Formulating the problem as a mixed model leads to results that are more reliable and that overcome the problems of the ad hoc approaches. In addition, the SAS approach is very user-friendly and lends itself to extensions for more complex settings.

INTRODUCTION
An important first step in any environmental science research project is assessing the accuracy and reliability of the measurement tools to be used. For example, researchers may wish to compare the results of air pollution measurements based on the use of two different types of sampling or analytical techniques. Correlation analysis is the primary statistical tool used in this context. As discussed in almost every introductory statistics book, the Pearson correlation coefficient is the appropriate measure of association between two variables when these two variables are jointly normally distributed. The Spearman correlation coefficient provides a nonparametric alternative based on ranks. As discussed by Rosner,¹ the more specialized intraclass correlation coefficient is appropriate in settings where the two variables of interest are expected to have the same means and variances. Pearson and Spearman correlations are easily and directly obtained from most statistical packages. While some packages (e.g., Stata) also provide commands to compute intraclass correlations directly, this quantity is easily obtained in packages such as SAS by formulating the problem as a mixed model (see the SAS manual²).

In this paper, we address a question not covered in introductory texts, namely, the assessment of correlation in settings where multiple measurements are available on each of the variables of interest. Two settings, linked and unlinked, are considered. In the linked setting, the repeated measurements are linked together in some way; for example, repeats are taken together on different days. Table 1 (taken from Bland and Altman³) shows a linked data set of repeated measurements of intramural pH and PaCO2 in a study designed to assess within-subject correlations of clinical information gained from blood gas analysis and from gastric pH of critically ill patients. Each pair of measurements was taken on a different day. In the unlinked setting, repeated measures are not linked together.

Table 1. Repeated measures of intramural pH (X) and PaCO2 (Y) for eight critically ill patients (PATI). Columns: PATI, pH, PaCO2.

Table 2 shows such a data set, corresponding to replicate measurements of benzene concentration in indoor and outdoor air, measured on 35 Mexican families. Note that while some families have a single replicate measurement on indoor and outdoor air (e.g., family 1), others (e.g., family 6) have two replicate measurements on each.

Several ad hoc approaches can be taken to compute correlations in the presence of replication. One naive approach is to ignore the repeated measurements, treat the data as if they were a simple random sample, and compute the standard Pearson correlation coefficient. Another choice is to compute the mean response for each variable for each subject, and then compute the standard Pearson correlation coefficient using the subject-specific averages for each variable. Yet another choice is to compute a weighted correlation coefficient,⁴ using the subject-specific averages for each variable and the number of repeated measurements for each subject as weights.

There are inherent problems with each of these approaches. The simple correlation coefficient ignores the number of subjects as the (correct) sample size and uses instead the total number of observations as the (incorrect) sample size, thereby erroneously increasing the degrees of freedom, which can lead to overly frequent rejection of the null hypothesis when in fact it is true (i.e., an inflated type I error rate⁵). The simple correlation coefficient based on subject means avoids this problem but does not take into account the different number of replicate measurements per subject. In addition, it tends to underestimate the true between-subject correlation.⁶ The weighted correlation coefficient does take into account the different number of replicate measurements per subject; however, the number of replicate measurements per variable for a given subject must be the same. In addition, because the subject means are used in the computation, it too tends to underestimate the true between-subject correlation.

Several authors have proposed more technical solutions to the problem of measuring the correlation between two variables in the presence of replication. For example, Bland and Altman³ proposed using a partial correlation coefficient, which requires removing differences between subjects. The partial correlation coefficient is useful if we want to know whether an increase (decrease) in one variable within a subject is associated with an increase (decrease) in the other variable. However, if there are many subjects, there is a loss in power, caused by the increased number of parameters that are to be estimated. Chinchilli et al.⁷ proposed the use of a weighted correlation coefficient, using the sample variances and covariances to compute the weights. While they considered both the unlinked and the linked case, their method is complicated. Furthermore, it is empirically based and does not naturally arise from an underlying statistical model. Lam et al.⁶ used maximum likelihood (ML) estimation to estimate the true correlation between the variables when the repeated measurements are linked over time. They derived their estimates through formulation of the problem as a mixed-effects model. Unfortunately, their approach is rather technical and requires the use of specialized software.

An important purpose of this paper is to show how to reproduce the parameter estimates of Lam et al.,⁶ using PROC MIXED in SAS. We show that the analysis can be easily achieved with just a few lines of code. We then extend Lam et al.'s approach to the nonlinked setting and also obtain parameter estimates using PROC MIXED in SAS. In addition to reanalyzing the data of Lam et al., the analysis of data in the nonlinked setting is illustrated with data from a study designed to assess the correlation between indoor and outdoor measurements of benzene concentration in the air.

STATISTICAL MODELS
Linked Repeated Measurements of Two Variates
To proceed, some notation must be introduced. We begin with the linked case, in which the repeated observations on the variables of interest, X and Y, are linked, for example, by being taken at the same point in time.

Table 2. Repeated measures of benzene concentration (µg/m³) in indoor (X) and outdoor (Y) air taken at the homes of 35 Mexican families. Columns: Family, Benzene, Location (In/Out).

Let $(X_{ij}, Y_{ij})$ be the jth repeated observation ($j = 1, \ldots, n_i$) of the X, Y variables taken on the ith subject ($i = 1, \ldots, n$), in a sample of n individuals, and define N to be the total number of observations. Suppose that the pair $(X_{ij}, Y_{ij})$ has a bivariate normal distribution with mean $(\mu_X, \mu_Y)$ and variance-covariance matrix $\Sigma$. The parameters $\mu_X$ and $\mu_Y$ represent the overall mean values of the variables of interest. Assumptions about $\Sigma$ are important, because this is where the correlations of interest are defined. Following Lam et al., it is assumed that

$$\Sigma = \begin{pmatrix} \sigma_X^2 & \rho_{XY}\sigma_X\sigma_Y \\ \rho_{XY}\sigma_X\sigma_Y & \sigma_Y^2 \end{pmatrix} \tag{1}$$

where $\sigma_X^2$ and $\sigma_Y^2$ are the variances of X and Y, respectively, and $\rho_{XY}$ is their correlation. Note that $\rho_{XY}$ is our main parameter of interest. For notational convenience later, we define the covariance $\sigma_{XY} = \sigma_X \sigma_Y \rho_{XY}$.

Full specification of the model also requires assumptions regarding the relationships between X's and Y's measured at different times. Like Lam et al., we assume that correlations between measurements taken at two different times, $j$ and $j'$, $j \neq j'$, are given by

$$\mathrm{Corr}(X_{ij}, X_{ij'}) = \rho_X, \qquad \mathrm{Corr}(Y_{ij}, Y_{ij'}) = \rho_Y, \qquad \mathrm{Corr}(X_{ij}, Y_{ij'}) = \delta\rho_{XY} \tag{2}$$

Heuristically, we would expect the term $\delta$ to generally be less than 1, indicating that correlations between variables measured at different times are lower in magnitude than those taken at the same time. The assumed correlation structure is depicted in Figure 1.

To better visualize the covariance structure, it is helpful to write out the full covariance matrix for the entire set of $n_i$ repeated measurements for the ith subject:

$$C_i = \mathrm{Cov}\begin{pmatrix} X_{i1} \\ Y_{i1} \\ \vdots \\ X_{in_i} \\ Y_{in_i} \end{pmatrix} = \begin{pmatrix}
\sigma_X^2 & \sigma_{XY} & \rho_X\sigma_X^2 & \delta\sigma_{XY} & \cdots & \rho_X\sigma_X^2 & \delta\sigma_{XY} \\
\sigma_{XY} & \sigma_Y^2 & \delta\sigma_{XY} & \rho_Y\sigma_Y^2 & \cdots & \delta\sigma_{XY} & \rho_Y\sigma_Y^2 \\
\rho_X\sigma_X^2 & \delta\sigma_{XY} & \sigma_X^2 & \sigma_{XY} & \cdots & \rho_X\sigma_X^2 & \delta\sigma_{XY} \\
\delta\sigma_{XY} & \rho_Y\sigma_Y^2 & \sigma_{XY} & \sigma_Y^2 & \cdots & \delta\sigma_{XY} & \rho_Y\sigma_Y^2 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
\rho_X\sigma_X^2 & \delta\sigma_{XY} & \rho_X\sigma_X^2 & \delta\sigma_{XY} & \cdots & \sigma_X^2 & \sigma_{XY} \\
\delta\sigma_{XY} & \rho_Y\sigma_Y^2 & \delta\sigma_{XY} & \rho_Y\sigma_Y^2 & \cdots & \sigma_{XY} & \sigma_Y^2
\end{pmatrix} \tag{3}$$

Note that to allow for a more parsimonious expression, we are using the covariance term $\sigma_{XY}$. Note also the block structure of this matrix, with submatrices corresponding to $\Sigma$ down the main diagonal. The covariance matrix will have the same structure for each subject, except that the dimension will vary. For example, Table 1 shows that person 1 has eight observations, four on each variable; hence, the covariance matrix for person 1 has eight rows and eight columns (8 × 8), with the structure given above.
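The block pattern of the linked covariance matrix can also be summarized compactly with Kronecker products. The following is only a restatement for illustration: the symbols $\Gamma$, $I_{n_i}$ (identity matrix), and $J_{n_i}$ (matrix of ones) are notation assumed here, not taken from the original text, and $\delta$ denotes the multiplier applied to $\rho_{XY}$ for X, Y measurements taken at different times.

```latex
\[
\Sigma = \begin{pmatrix} \sigma_X^2 & \sigma_{XY} \\ \sigma_{XY} & \sigma_Y^2 \end{pmatrix},
\qquad
\Gamma = \begin{pmatrix} \rho_X\sigma_X^2 & \delta\sigma_{XY} \\ \delta\sigma_{XY} & \rho_Y\sigma_Y^2 \end{pmatrix},
\]
\[
C_i = I_{n_i} \otimes (\Sigma - \Gamma) + J_{n_i} \otimes \Gamma .
\]
```

Each diagonal 2 × 2 block equals $(\Sigma - \Gamma) + \Gamma = \Sigma$ and each off-diagonal block equals $\Gamma$, so a subject with $n_i = 4$ linked pairs has an 8 × 8 matrix with four copies of $\Sigma$ on the diagonal.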

Figure 1. Correlation structure.

Further insight into the structure of $C_i$ is gained if the data are reordered. Instead of setting up the covariance matrix in terms of successive X, Y pairs (i.e., $X_{i1}, Y_{i1}, X_{i2}, Y_{i2}, \ldots, X_{in_i}, Y_{in_i}$), suppose the $n_i$ X values are written first, followed by the $n_i$ Y values. With this rearrangement, the covariance matrix becomes

$$\mathrm{Cov}\begin{pmatrix} X_{i1} \\ \vdots \\ X_{in_i} \\ Y_{i1} \\ \vdots \\ Y_{in_i} \end{pmatrix} = \left(\begin{array}{ccc|ccc}
\sigma_X^2 & \cdots & \rho_X\sigma_X^2 & \sigma_{XY} & \cdots & \delta\sigma_{XY} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
\rho_X\sigma_X^2 & \cdots & \sigma_X^2 & \delta\sigma_{XY} & \cdots & \sigma_{XY} \\ \hline
\sigma_{XY} & \cdots & \delta\sigma_{XY} & \sigma_Y^2 & \cdots & \rho_Y\sigma_Y^2 \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
\delta\sigma_{XY} & \cdots & \sigma_{XY} & \rho_Y\sigma_Y^2 & \cdots & \sigma_Y^2
\end{array}\right) \tag{4}$$

The covariance matrix can now be seen to fall into four distinct blocks. The upper left block shows a constant covariance $\rho_X\sigma_X^2$ between the $n_i$ repeated X values taken on the ith subject. Similarly, the lower right block shows a constant covariance $\rho_Y\sigma_Y^2$ between the $n_i$ repeated Y values taken on the ith subject. The off-diagonal blocks show a compound symmetric covariance structure between the $n_i$ X and Y values, with $\sigma_{XY}$ on the main diagonal and $\delta\sigma_{XY}$ on the off-diagonal.

Unlinked Repeated Measurements of Two Variates
An appropriate model for the unlinked repeated measures design is easily obtained by a simple alteration to the model corresponding to the linked repeated measures design. The fundamental difference between the linked and the unlinked settings is that the $X_{ij}$ and $Y_{ij}$ are no longer linked together. That is, there is no time effect in the problem, and hence, the correlation between any two X and Y measurements should be the same, regardless of when they are taken. The correlation structure thus becomes

$$\mathrm{Corr}(X_{ij}, X_{ij'}) = \rho_X, \qquad \mathrm{Corr}(Y_{ij}, Y_{ij'}) = \rho_Y, \qquad \mathrm{Corr}(X_{ij}, Y_{ij}) = \rho_{XY} = \mathrm{Corr}(X_{ij}, Y_{ij'}) \tag{5}$$

It follows that one can think of the unlinked case as a special case of the linked setting, with $\delta$ set equal to 1. The covariance matrix for the ith subject in this setting is given by

$$C_i^{*} = \mathrm{Cov}\begin{pmatrix} X_{i1} \\ Y_{i1} \\ \vdots \\ X_{in_i} \\ Y_{in_i} \end{pmatrix} = \begin{pmatrix}
\sigma_X^2 & \sigma_{XY} & \rho_X\sigma_X^2 & \sigma_{XY} & \cdots & \rho_X\sigma_X^2 & \sigma_{XY} \\
\sigma_{XY} & \sigma_Y^2 & \sigma_{XY} & \rho_Y\sigma_Y^2 & \cdots & \sigma_{XY} & \rho_Y\sigma_Y^2 \\
\rho_X\sigma_X^2 & \sigma_{XY} & \sigma_X^2 & \sigma_{XY} & \cdots & \rho_X\sigma_X^2 & \sigma_{XY} \\
\sigma_{XY} & \rho_Y\sigma_Y^2 & \sigma_{XY} & \sigma_Y^2 & \cdots & \sigma_{XY} & \rho_Y\sigma_Y^2 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
\rho_X\sigma_X^2 & \sigma_{XY} & \rho_X\sigma_X^2 & \sigma_{XY} & \cdots & \sigma_X^2 & \sigma_{XY} \\
\sigma_{XY} & \rho_Y\sigma_Y^2 & \sigma_{XY} & \rho_Y\sigma_Y^2 & \cdots & \sigma_{XY} & \sigma_Y^2
\end{pmatrix} \tag{6}$$

$C_i^{*}$ can also be written with the $n_i$ X values first, followed by the $n_i$ Y values. Note that in this unlinked version of the covariance matrix, the difference between $C_i$ and $C_i^{*}$ is that there are no terms in $C_i^{*}$ involving $\delta$ and, hence, the blocks on the off-diagonal are now constant.

MODEL FITTING IN SAS
Models for both the linked and unlinked settings can be easily fit using PROC MIXED in SAS. To use PROC MIXED, the data must be entered in univariate form; that is, each row of data must correspond to a different measurement. A variable needs to be defined that indicates whether each line of data corresponds to an X or Y observation. This variable is called Vtype. A Replicate variable is used to keep track of the repeated measurements within subjects. Note that the Replicate variable will be nested within subjects. The appropriate SAS data format is illustrated below by Example 1, for the data in Table 1, where pH is chosen as Vtype = 1 and PaCO2 is chosen as Vtype = 2. Response is the value of Vtype = 1 or Vtype = 2, and Persnum is the subject number. It is of no significance that pH is chosen as Vtype = 1 and PaCO2 as Vtype = 2, because the coding scheme is arbitrary.

Example 1

    input Persnum Vtype Response Replicate;
    cards;
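When the measurements are stored with one row per subject and replicate, the univariate layout described above can be produced with a short DATA step. This is a minimal sketch under assumed names: the input data set `wide` and its columns `ph` and `paco2` are hypothetical and do not come from the paper; only `persnum`, `vtype`, `response`, and `replicate` follow the text.

```sas
/* Hypothetical input: data set WIDE with one row per subject and replicate,
   holding columns persnum, replicate, ph, paco2 (names assumed for illustration). */
data long;
  set wide;
  vtype = 1; response = ph;    output;  /* one row for the X measurement (pH)    */
  vtype = 2; response = paco2; output;  /* one row for the Y measurement (PaCO2) */
  keep persnum vtype response replicate;
run;

/* Sorting is not required by PROC MIXED but makes the layout easy to inspect. */
proc sort data=long;
  by persnum replicate vtype;
run;
```

Each input row yields two output rows, so a subject with four linked pairs contributes eight rows, matching the 8 × 8 covariance block for that subject.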

The appropriate formulation of the PROC MIXED code, however, is not immediately obvious, because of the relative complexity of the covariance matrices $C_i$ and $C_i^{*}$. As described in the SAS documentation,² PROC MIXED allows the fitting of regression models where the covariance of the response involves the sum of two components: a matrix G involving the random effects in the model, specified through the RANDOM statement, and a matrix R corresponding to the error term in the model, specified through the REPEATED statement. While most familiar mixed models use either the RANDOM or the REPEATED statement, the models described in the previous sections require the use of both.

We begin with the linked case. To see how the SAS code should be written, it is useful to note that for each subject, the covariance matrix $C_i$ can be written as the sum of two matrices: one a matrix of constants whose values depend on whether the corresponding pair is two X's, two Y's, or an X, Y pair, and the other a block diagonal matrix, with blocks corresponding to X, Y pairs measured at the same time. Hence, $C_i$ can be written as

$$C_i = \begin{pmatrix}
\rho_X\sigma_X^2 & \delta\sigma_{XY} & \cdots & \rho_X\sigma_X^2 & \delta\sigma_{XY} \\
\delta\sigma_{XY} & \rho_Y\sigma_Y^2 & \cdots & \delta\sigma_{XY} & \rho_Y\sigma_Y^2 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
\rho_X\sigma_X^2 & \delta\sigma_{XY} & \cdots & \rho_X\sigma_X^2 & \delta\sigma_{XY} \\
\delta\sigma_{XY} & \rho_Y\sigma_Y^2 & \cdots & \delta\sigma_{XY} & \rho_Y\sigma_Y^2
\end{pmatrix} + \begin{pmatrix}
\tilde\sigma_X^2 & \tilde\sigma_{XY} & & & \\
\tilde\sigma_{XY} & \tilde\sigma_Y^2 & & & \\
 & & \ddots & & \\
 & & & \tilde\sigma_X^2 & \tilde\sigma_{XY} \\
 & & & \tilde\sigma_{XY} & \tilde\sigma_Y^2
\end{pmatrix} \tag{7}$$

where $\tilde\sigma_X^2 = \sigma_X^2(1 - \rho_X)$, $\tilde\sigma_Y^2 = \sigma_Y^2(1 - \rho_Y)$, and $\tilde\sigma_{XY} = \sigma_{XY}(1 - \delta)$. These two matrices can be set up through judicious use of the RANDOM and REPEATED statements in PROC MIXED. Consider first the left matrix in the sum. Careful scrutiny indicates that this matrix can be constructed by assigning X- and Y-specific random effects to individual i, and allowing these random effects to be correlated. This can be achieved by declaring the variable Vtype (i.e., the indicator of whether a particular observation is an X or a Y) to be random across individual subjects. Covariance between the X- and Y-specific random effects can be achieved by specifying an unstructured covariance matrix. Now consider the right matrix in the sum. This structure is relatively straightforward and can be achieved by declaring the variable Vtype to be repeated within each individual-specific replicate (i.e., declaring the subject to be replicate nested within individual) and using an unstructured covariance. In the case of linked repeated measurements, the SAS code to obtain the parameter estimates is given by

    data dataname;
      input persnum vtype response replicate;
      datalines;
    ;
    proc mixed;
      class persnum vtype replicate;
      model response = vtype / solution ddfm=kr;
      random vtype / type=un subject=persnum g gcorr v vcorr;
      repeated vtype / type=un subject=replicate(persnum) r rcorr;
    run;

where Persnum corresponds to subject number; Vtype refers to the two variables, which are coded as 1 and 2; Response corresponds to the values of the two variables; and Replicate indexes the repeated measurements for each subject, the number of which need not be the same across subjects. The CLASS statement specifies Persnum, Vtype, and Replicate as classification (categorical) effects, and the MODEL statement specifies the mean (regression) model for the data. SOLUTION requests that the estimates of the fixed effects (specified on the right side of the equal sign in the MODEL statement, before the slash) be printed, and DDFM=KR specifies the Kenward-Roger⁸ method for computing the denominator degrees of freedom for the fixed effects. Note that while this latter option is not necessary, it tends to yield more reliable results in general (see the SAS manual² for more details).

As indicated earlier, the RANDOM and REPEATED statements are used to set up the structure of the G and R matrices. Declaring SUBJECT=Persnum after the specification of Vtype as random instructs PROC MIXED to make the N × N variance-covariance matrix for the entire data vector block diagonal, with blocks corresponding to subjects. The size of the blocks depends on the number of measurements each subject has. These subject blocks are in themselves block diagonal of size 2 × 2, with structure specified by the TYPE= option.

For example, from the data in Table 1, the first person has a total of eight measurements; hence, the size of the block for the first person is 8 × 8, while the third person has 16 measurements and, thus, the size of the block for the third person is 16 × 16. TYPE=UN specifies a general variance-covariance matrix and makes the subject-specific X and Y random effects correlated. On the REPEATED statement line, SUBJECT=Replicate(Persnum) instructs PROC MIXED to make the N × N variance-covariance matrix for the data vector a diagonal matrix of blocks. Each of these blocks has the structure specified by the TYPE= option. In this case, TYPE=UN specifies a general variance-covariance matrix. G and GCORR request that the estimated random-effect variance-covariance and correlation matrices be printed, respectively. V and VCORR request that the estimated response variance-covariance and correlation matrices be printed, respectively. R and RCORR request that the variance-covariance and correlation matrices between the within-subject replicate X, Y pairs be printed, respectively. The V matrix is a combination of the G and R matrices. By default, for R, RCORR, V, and VCORR, the first block, determined by the SUBJECT= effect, is printed. However, the default can be changed by specifying a specific value for R, RCORR, V, and VCORR (see the SAS manual²).

In the PROC MIXED statement, a METHOD= option can be given to specify the method of estimation for the covariance parameters. If no METHOD= option is given in the PROC MIXED statement, the covariance parameters are estimated using restricted maximum likelihood (REML) estimation, the default option. Similarly, in the MODEL statement, the method of computation for the denominator degrees of freedom can be specified by using the DDFM= option. If no DDFM= option is given in the MODEL statement, for the SAS code given here, the CONTAINMENT option is used. For further details on the METHOD= and DDFM= options, see the SAS manual.²

For the unlinked case, the code is the same as that described previously, except that the REPEATED statement is replaced by

    repeated vtype / type=un(1) subject=replicate(persnum) r rcorr;

where TYPE=UN(1) specifies a variance-covariance matrix whose off-diagonal element is zero. Equivalently, one can use the following code for the REPEATED statement:

    repeated / group=vtype r rcorr;

where GROUP=vtype specifies heterogeneity of variances between observations with vtype = 1 and vtype = 2 (i.e., for X and Y).

EXAMPLES
Linked Data
Table 1, provided by Bland and Altman³ and reproduced in Lam et al.,⁶ shows linked repeated measurements of intramural pH and PaCO2 for eight subjects. Table 3 gives the simple Pearson correlation, the simple Pearson correlation based on subject means, the weighted correlation (Bland and Altman⁴), and the 95% bootstrapped confidence interval (CI) for this data set. It is important to note here that bootstrapping was accomplished by resampling individuals, thus maintaining the appropriate correlation structure of the data. Inspection of Table 3 reveals that these correlation measures are of different magnitudes and signs. Thus, one is faced with the dilemma of choosing one of these measures as the appropriate measure of the true correlation. The values presented here differ from those of Bland and Altman⁴ because of rounding. Of the three correlation measures, the naive Pearson correlation measure has the shortest interval.

Lam et al.⁶ obtained parameter estimates (Table 4) for the data, using an ML estimation program. These results can be reproduced using the SAS code, by specifying METHOD=ML in the PROC MIXED statement. The main difference between ML and REML (the default option) is that ML estimates of the covariance parameters are biased because they do not account for the degrees of freedom used in estimating the fixed effects, whereas REML corrects for this. For comparison with the naive estimates reported in Table 3, we provide bootstrap confidence intervals for the correlation parameter estimates obtained using SAS's PROC MIXED.

Table 3. Simple correlations between pH (X) and PaCO2 (Y) and 95% bootstrap confidence interval for the data in Table 1.
    Correlation                            Value    95% CI
    Naive Pearson correlation              …        (…, …)
    Pearson correlation based on means     …        (…, …)
    Weighted correlation                   …        (…, 0813)

Table 4. Parameter estimates from Lam et al.⁶ for the pH (X)-PaCO2 (Y) data in Table 1 with 95% bootstrap confidence interval for ρ_XY.
    Parameter    Estimate    95% CI
    μ_X          …
    μ_Y          5008
    σ_X²         …
    σ_Y²         …
    ρ_X          …
    ρ_Y          0654
    ρ_XY         …           (…, …)
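The model parameters can be recovered from the fitted G and R matrices because the marginal covariance is the sum of the random-effects part and the within-replicate part. The relations below are a sketch of that bookkeeping; the element names $g_{11}, g_{12}, g_{22}$ and $r_{11}, r_{12}, r_{22}$ (rows and columns ordered X, then Y) are notation assumed here for illustration, and $\delta$ denotes the multiplier on $\rho_{XY}$ for X, Y measurements taken at different times.

```latex
\begin{align*}
\sigma_X^2 &= g_{11} + r_{11}, &
\sigma_Y^2 &= g_{22} + r_{22}, &
\sigma_{XY} &= g_{12} + r_{12}, \\
\rho_X &= \frac{g_{11}}{g_{11} + r_{11}}, &
\rho_Y &= \frac{g_{22}}{g_{22} + r_{22}}, &
\rho_{XY} &= \frac{g_{12} + r_{12}}{\sqrt{(g_{11} + r_{11})(g_{22} + r_{22})}}, \\
\delta &= \frac{g_{12}}{g_{12} + r_{12}}.
\end{align*}
```

In the unlinked case, TYPE=UN(1) forces $r_{12} = 0$, so $\delta = 1$ and the X, Y covariance comes entirely from G.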

Selected portions of the SAS output are given in Tables 5-8, where labels have been added for clarity. Table 5 gives the results obtained from the SAS code for the R and G matrices. From Table 5, … and 0547 are the estimated variances ($\sigma_{XR}^2$; $\sigma_{YR}^2$) of X and Y, respectively, obtained from the R matrix. Similarly, … and 045 are the estimated variances ($\sigma_{XG}^2$; $\sigma_{YG}^2$) of X and Y, respectively, obtained from the G matrix. The respective covariances from the R and G matrices are … and 005. The correlations derived from the R ($\rho_R$) and G ($\rho_G$) matrices are 0509 and 01416, respectively.

The results in Table 4 are obtained from Tables 6, 7, and 8. Table 6 gives the results obtained from the SAS code for the V matrix, and Table 7 gives the corresponding correlations associated with the V matrix. From Table 6, … and … are the overall estimated variances ($\sigma_X^2$; $\sigma_Y^2$) of X and Y, respectively. Note that $\sigma_X^2 = \sigma_{XR}^2 + \sigma_{XG}^2$ and $\sigma_Y^2 = \sigma_{YR}^2 + \sigma_{YG}^2$. Note also that the elements in G appear in V. From Table 7, the estimated correlation between X and Y ($\rho_{XY}$) is …. For $j \neq j'$, the estimated correlation between $X_{ij}$ and $X_{ij'}$ ($\rho_X$) is … and the estimated correlation between $Y_{ij}$ and $Y_{ij'}$ ($\rho_Y$) is 0654. The estimated correlation between $X_{ij}$ and $Y_{ij'}$ ($\delta\rho_{XY}$) is 0104 and, thus, the estimate of $\delta$ is … (0104/000995). The means $\mu_X$ and $\mu_Y$ are obtained from Table 8. In SAS, when the variables are categorical, the highest value is taken as the point of reference, which in this case is the variable labeled as 2 [PaCO2 (Y)]. The estimate for $\mu_X$ is 71151, which is the sum of the estimated values for the intercept and the pH (X) effect. The estimate for $\mu_Y$ is 5008, the intercept value.

Table 5. Estimated R and G matrices obtained for the data in Table 1 using SAS PROC MIXED procedure. Rows: pH, PaCO2. Columns: R matrix (pH, PaCO2), G matrix (pH, PaCO2).

Table 6. Estimated variance-covariance matrix for the pH (X)-PaCO2 (Y) data in Table 1, for PATI 1. Rows and columns: X1, Y1, X2, Y2, X3, Y3, X4, Y4.

Table 7. Estimated correlation matrix between pH (X) and PaCO2 (Y) data in Table 1, for PATI 1. Rows and columns: X1, Y1, X2, Y2, X3, Y3, X4, Y4.

Unlinked Data
The data in Table 2 are from an environmental study that focused on measuring the benzene concentration (in µg/m³) in the air inside and outside the homes of several Mexican families. The data are entered into SAS in the same way as for Example 1. Table 9 gives the simple correlation coefficients along with the 95% bootstrapped confidence intervals. Of the two correlation measures, the naive Pearson correlation measure has the shorter confidence interval. Inspection of Table 9 indicates that it is much more difficult to choose an appropriate measure of the true correlation in this setting because the number of observations is not the same in the two cases. In addition, not all of the data (95 observations) are used in computing these correlations. The reason for the discrepancy in the number of observations is that, for some subjects, there are measurements missing. Consequently, because of missing measurements, a weighted correlation would be difficult to compute. These problems do not occur if the SAS code is used to obtain the correlation.

Table 10 gives the results obtained from the SAS code for the R and G matrices. From Table 10, 8 and … are the estimated variances ($\sigma_{XR}^2$; $\sigma_{YR}^2$) of X and Y, respectively, obtained from the R matrix. Similarly, 1196 and … are the estimated variances ($\sigma_{XG}^2$; $\sigma_{YG}^2$) of X and Y, respectively, obtained from the G matrix. The covariance from the G matrix is … and the correlation ($\rho_G$) is 0655. Note here that for the R matrix there is no covariance and, hence, no correlation, because of the TYPE=UN(1) specified in the REPEATED statement.
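For the unlinked benzene analysis, the pieces described in the model-fitting section might be assembled into a single program along the following lines. This is a sketch only: the data set name `benzene` and the subject variable `family` are hypothetical stand-ins for a long-format data set built as in Example 1, and the ODS OUTPUT line is an optional addition for saving the covariance parameter estimates.

```sas
/* Hypothetical long-format data set BENZENE with columns
   family, vtype (1 = indoor X, 2 = outdoor Y), response, replicate. */
proc mixed data=benzene;
  class family vtype replicate;
  model response = vtype / solution ddfm=kr;
  /* Correlated X- and Y-specific random effects per family (G matrix). */
  random vtype / type=un subject=family g gcorr v vcorr;
  /* UN(1): separate X and Y residual variances with zero covariance (R matrix). */
  repeated vtype / type=un(1) subject=replicate(family) r rcorr;
  /* Optional: save the covariance parameter estimates to a data set. */
  ods output CovParms=covparms;
run;
```

Because no METHOD= option is given, this sketch uses the REML default; families with missing indoor or outdoor measurements simply contribute fewer rows.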

Table 8. Regression results for the pH (X)-PaCO2 (Y) data in Table 1. Columns: Effect, Estimate, SE, DF, t Value, Pr > |t|. Rows: Intercept, pH, PaCO2 (estimate 0).

Table 9. Simple correlations between indoor (X) and outdoor (Y) air, and 95% bootstrap confidence interval for the benzene data in Table 2. Columns: Correlation, # of Obs, Value, 95% CI. Rows: Naive Pearson correlation, Pearson correlation based on means.

Table 10. Estimated R and G matrices obtained for the data in Table 2 using SAS PROC MIXED procedure. Rows: Indoor, Outdoor. Columns: R matrix (Indoor, Outdoor), G matrix (Indoor, Outdoor).

The results in Tables 11, 12, and 13 are used to obtain the results in Table 14. Table 11 gives the results obtained from the SAS code for the V matrix, and Table 12 gives the corresponding correlations associated with the V matrix. From Table 11, … and 4493 are the overall estimated variances ($\sigma_X^2$; $\sigma_Y^2$) of X and Y, respectively. Note that $\sigma_X^2 = \sigma_{XR}^2 + \sigma_{XG}^2$ and $\sigma_Y^2 = \sigma_{YR}^2 + \sigma_{YG}^2$. Note also that the elements in G appear in V. From Table 12, the estimated correlation between X and Y ($\rho_{XY}$) is …. For $j \neq j'$, the estimated correlation between $X_{ij}$ and $X_{ij'}$ ($\rho_X$) is 088 and the estimated correlation between $Y_{ij}$ and $Y_{ij'}$ ($\rho_Y$) is …. From Table 13, $\mu_X$ is … and $\mu_Y$ is …. Table 14 also provides a bootstrap confidence interval for the parameter of interest, $\rho_{XY}$.

DISCUSSION
In this paper, we investigated methods to assess the correlation between two variates, X and Y, in the presence of repeated measures or replicates. Both linked and unlinked settings were considered, in both cases under the assumption that the two variates follow a multivariate normal distribution. Ad hoc approaches as well as PROC MIXED in SAS were used to estimate the correlation for two examples. Of the ad hoc approaches, the bootstrapped confidence interval was shortest for the naive Pearson approach. Bootstrapped confidence intervals for the mixed-model formulation were approximately equal to or shorter than the bootstrapped confidence intervals for the ad hoc approaches. This suggests that the mixed-model approach uses the data more efficiently. The mixed-model formulation overcomes some of the inherent problems with the ad hoc approaches and is very easy to apply using PROC MIXED in SAS.

Although not of direct relevance to the topic of this paper, our data examples revealed some interesting features in relation to the effects of outliers. For both the ad hoc approaches and the SAS PROC MIXED approach, the estimates were sensitive to the exclusion of an extreme observation.

Table 11. Estimated variance-covariance matrix for the indoor (X)-outdoor (Y) benzene data in Table 2 for family 6. Rows and columns: X1, Y1, X2, Y2.

Table 12. Estimated correlation matrix between indoor (X) and outdoor (Y) air for the benzene data in Table 2 for family 6. Rows and columns: X1, Y1, X2, Y2.

Table 13. Regression results for the indoor (X)-outdoor (Y) benzene data in Table 2. Columns: Effect, Estimate, SE, DF, t Value, Pr > |t|. Rows: Intercept, Indoor, Outdoor (estimate 0).

Table 14. Parameter estimates for the indoor (X)-outdoor (Y) benzene data in Table 2 with 95% bootstrap confidence interval for ρ_XY.
    Parameter    Estimate    95% CI
    μ_X          …
    μ_Y          …
    σ_X²         …
    σ_Y²         …
    ρ_X          088
    ρ_Y          …
    ρ_XY         …           (…, …)

For example, in the benzene analysis, the correlation coefficient was reduced by 69% when an influential observation was removed. This finding suggests that users should check that the results are not driven by extreme values before interpreting them. If the data are skewed, thereby violating the normality assumption, a transformation, such as the log, might be appropriate before applying the SAS PROC MIXED procedure. On the other hand, one can compute Spearman's correlation⁹ in the simple case (no repeats). However, it is not clear how one would generalize our method to compute a Spearman correlation in the presence of replication.

We have treated X and Y as being distinct variables, each having its own mean and variance. However, in many instances, one may not be able to distinguish between X and Y. For example, consider two different devices used to measure the lung capacity of a subject. In this situation, one is more interested in the agreement of measurement of the two devices. A measure of this agreement is the concordance correlation.¹,⁷ On the other hand, interest may focus on the degree to which a single measure of an event describes the mean of repeated measurements of that event. In this case, an intraclass correlation¹ can be computed. For both of the data sets presented here, intraclass correlations can be computed. Finally, one can compute a correlation for each subject and then use the subject correlations in the computation of an overall correlation.⁷ This procedure would work well if there were several repeated measurements per subject per variable.

ACKNOWLEDGMENTS
This work was supported by NIH grants ES0000, ES0714, and ES05947.

REFERENCES
1. Rosner, B. Fundamentals of Biostatistics, 5th ed.; Duxbury: Pacific Grove, CA, 2000.
2. SAS Institute Inc. SAS/STAT User's Guide: Version 8, Volume 2; SAS Institute, Inc.: Cary, NC.
3. Bland, J.M.; Altman, D.G. Calculating Correlation Coefficients with Repeated Observations: Part 1. Correlation within Subjects; Brit. Med. J. 1995, 310.
4. Bland, J.M.; Altman, D.G. Calculating Correlation Coefficients with Repeated Observations: Part 2. Correlation between Subjects; Brit. Med. J. 1995, 310.
5. Bland, J.M.; Altman, D.G. Correlation, Regression and Repeated Data; Brit. Med. J. 1994, 308.
6. Lam, M.; Webb, C.A.; O'Donnell, D.E. Correlation between Two Variables in Repeated Measures. In American Statistical Association, Proceedings of the Biometric Section; American Statistical Association: Alexandria, VA, 1999.
7. Chinchilli, V.M.; Martel, J.K.; Kumanyika, S.; Lloyd, T. A Weighted Concordance Correlation Coefficient for Repeated Measures Designs; Biometrics 1996, 52.
8. Kenward, M.G.; Roger, J.H. Small Sample Inference for Fixed Effects from Restricted Maximum Likelihood; Biometrics 1997, 53.
9. Zar, J.H. Biostatistical Analysis, 4th ed.; Prentice Hall: Upper Saddle River, NJ, 1999.

About the Authors
Anthony Hamlett is a research fellow and Louise Ryan is a professor of biostatistics in the Department of Biostatistics, Harvard School of Public Health, 655 Huntington Avenue, Boston, MA 02115. Paulina Serrano-Trespalacios is a doctoral student in the Department of Environmental Science, Harvard School of Public Health, 655 Huntington Avenue, Boston, MA 02115. Russ Wolfinger is the director of genomics at SAS Institute Inc., SAS Campus Drive, Cary, NC.
