Applications of GEE Methodology Using the SAS System

Gordon Johnston
Maura Stokes
SAS Institute Inc., Cary, NC

Abstract

The analysis of correlated data arising from repeated measurements when the measurements are assumed to be multivariate normal has been studied extensively. In many practical problems, however, the normality assumption is not reasonable. When the responses are discrete and correlated, for example, different methodology must be used in the analysis of the data. Generalized Estimating Equations (GEEs) provide a practical method with reasonable statistical efficiency to analyze such data. This paper provides an overview of the use of GEEs in the analysis of correlated data using the SAS System. Emphasis is placed on discrete correlated data, since this is an area of great practical interest.

Introduction

GEEs were introduced by Liang and Zeger (1986) as a method of dealing with correlated data when, except for the correlation among responses, the data can be modeled as a generalized linear model. For example, correlated binary and count data often can be modeled in this way.

You can solve GEEs with the GENMOD procedure in SAS/STAT software, beginning with Release 6.12 of the SAS System. In addition, the Alternating Logistic Regression algorithm for fitting log odds ratios with binary data will be implemented in a future release. This paper provides an overview of the GEE methodology that is implemented in the GENMOD procedure. Refer to Diggle, Liang, and Zeger (1994) and the other references at the end of this paper for more details on this method.

Correlated data can arise from situations such as

- longitudinal studies, in which multiple measurements are taken on the same subject at different points in time
- clustering, where measurements are taken on subjects that share a common category or characteristic that leads to correlation. For example, incidence of pulmonary disease among family members may be correlated because of hereditary factors.

The correlation must be accounted for by analysis methods appropriate to the data. Possible consequences of analyzing correlated data as
if it were independent are

- incorrect inferences concerning regression parameters due to underestimated standard errors
- inefficient estimators, that is, more mean square error in regression parameter estimators than necessary

Example of Longitudinal Data

The following data, from Thall and Vail (1990), are concerned with the treatment of epileptic seizure episodes. These data were also analyzed in Diggle, Liang, and Zeger (1994). The data consist of the number of epileptic seizures in an eight-week baseline period, before any treatment, and in each of four two-week treatment periods, in which patients received either a placebo or the drug Progabide in addition to other therapy. A portion of the data is shown in Table 1.

Table 1  Epileptic Seizure Data

Patient ID  Treatment  Baseline  Visit1  Visit2  Visit3  Visit4
104         Placebo          11       5       3       3       3
106         Placebo          11       3       5       3       3
107         Placebo           6       2       4       0       5
101         Progabide        76      11      14       9       8
102         Progabide        38       8       7       9       4
103         Progabide        19       0       4       3       0

Within-subject measurements are likely to be correlated, whereas between-subject measurements are likely to be independent. The raw correlations among the counts between visits are shown in Figure 1. They indicate strong correlation in the number of seizures between the visits. Accounting for this correlation is an important aspect of the analysis strategy. The
seizures data will be analyzed in later sections as count data with a specified correlation structure.

Figure 1  Raw Correlations

          Visit 1  Visit 2  Visit 3  Visit 4
Visit 1      1.00     0.69     0.54     0.72
Visit 2               1.00     0.67     0.76
Visit 3                        1.00     0.71
Visit 4                                 1.00

Generalized Linear Models for Independent Data

Let $Y_i$, $i = 1, \ldots, n$ be independent measurements. Generalized linear models for independent data are characterized by

- a systematic component

  $$g(E(Y_i)) = g(\mu_i) = x_i'\beta$$

  where $\mu_i = E(Y_i)$, $g$ is a link function that relates the means of the responses to the linear predictor $x_i'\beta$, $x_i$ is a vector of independent variables for the $i$th observation, and $\beta$ is a vector of regression parameters to be estimated
- a random component: $Y_i$, $i = 1, \ldots, n$ are independent and have a probability distribution from an exponential family (binomial, Poisson, normal, gamma, inverse Gaussian)

The exponential family assumption implies that the variance of $Y_i$ is given by $V_i = \phi v(\mu_i)$, where $v$ is a variance function that is determined by the specific probability distribution and $\phi$ is a dispersion parameter that may be known or may be estimated from the data, depending on the specific model. The variance functions for the binomial and Poisson distributions are given by

- binomial: $v(\mu) = \mu(1 - \mu)$
- Poisson: $v(\mu) = \mu$

The maximum likelihood estimator of the $p \times 1$ parameter vector $\beta$ is obtained by solving the estimating equations

$$\sum_{i=1}^{n} \frac{\partial \mu_i}{\partial \beta} v_i^{-1} (y_i - \mu_i(\beta)) = 0$$

for $\beta$, with $v_i = v(\mu_i)$. This is a nonlinear system of equations for $\beta$, and it can be solved iteratively by the Fisher scoring or Newton-Raphson algorithm.

Modeling Correlation: Generalized Estimating Equations

Let $Y_{ij}$, $j = 1, \ldots, n_i$, $i = 1, \ldots, K$ represent the $j$th measurement on the $i$th subject. There are $n_i$ measurements on subject $i$ and $\sum_{i=1}^{K} n_i$ total measurements.

Correlated data are modeled using the same link function and linear predictor setup (systematic component) as the independence case. The random component is described by the same variance functions as in the independence case, but the covariance structure of the correlated measurements must also be modeled. Let the vector of measurements on the $i$th subject be $Y_i = [Y_{i1}, \ldots, Y_{in_i}]'$ with corresponding vector of means $\mu_i = [\mu_{i1}, \ldots, \mu_{in_i}]'$ and let $V_i$ be an
estimate of the covariance matrix of $Y_i$. The Generalized Estimating Equation for estimating $\beta$ is an extension of the independence estimating equation to correlated data and is given by

$$\sum_{i=1}^{K} \frac{\partial \mu_i'}{\partial \beta} V_i^{-1} (Y_i - \mu_i(\beta)) = 0$$

Working Correlations

Let $R_i(\alpha)$ be an $n_i \times n_i$ "working" correlation matrix that is fully specified by the vector of parameters $\alpha$. The covariance matrix of $Y_i$ is modeled as

$$V_i = \phi A_i^{1/2} R_i(\alpha) A_i^{1/2}$$

where $A_i$ is an $n_i \times n_i$ diagonal matrix with $v(\mu_{ij})$ as the $j$th diagonal element. If $R_i(\alpha)$ is the true correlation matrix of $Y_i$, then $V_i$ is the true covariance matrix of $Y_i$.

The working correlation matrix is not usually known and must be estimated. It is estimated in the iterative fitting process using the current value of the parameter vector $\beta$ to compute appropriate functions of the Pearson residual

$$r_{ij} = \frac{y_{ij} - \mu_{ij}}{\sqrt{v(\mu_{ij})}}$$

There are several specific choices of the form of working correlation matrix $R_i(\alpha)$ commonly used to model the correlation matrix of $Y_i$. A few of the choices are shown below. Refer to Liang and Zeger (1986) for additional choices. The dimension of the
vector $\alpha$, which is treated as a nuisance parameter, and the form of the estimator of $\alpha$ are different for each choice. Some typical choices are:

- $R_i(\alpha) = R_0$, a fixed correlation matrix. For $R_0 = I$, the identity matrix, the GEE reduces to the independence estimating equation.
- m-dependent: $\mathrm{Corr}(Y_{ij}, Y_{i,j+t}) = \alpha_t$ for $t = 1, 2, \ldots, m$, and $0$ for $t > m$
- exchangeable: $\mathrm{Corr}(Y_{ij}, Y_{ik}) = \alpha$, $j \ne k$
- unstructured: $\mathrm{Corr}(Y_{ij}, Y_{ik}) = \alpha_{jk}$

Fitting Algorithm

The following is an algorithm for fitting the specified model using GEEs.

1. Compute an initial estimate of $\beta$, for example with an ordinary generalized linear model assuming independence.
2. Compute the working correlations $R(\alpha)$.
3. Compute an estimate of the covariance:

   $$V_i = \phi A_i^{1/2} \hat{R}(\alpha) A_i^{1/2}$$

4. Update $\beta$:

   $$\beta_{r+1} = \beta_r + \left[ \sum_{i=1}^{K} \frac{\partial \mu_i'}{\partial \beta} V_i^{-1} \frac{\partial \mu_i}{\partial \beta} \right]^{-1} \left[ \sum_{i=1}^{K} \frac{\partial \mu_i'}{\partial \beta} V_i^{-1} (Y_i - \mu_i) \right]$$

5. Iterate until convergence.

Properties of GEEs

The GEE method has some desirable statistical properties that make it an attractive method for dealing with correlated data.

- GEEs reduce to independence estimating equations for $n_i = 1$.
- GEEs are the maximum likelihood score equation for multivariate Gaussian data.
- $\sqrt{K}(\hat{\beta} - \beta) \to N(0, M(\beta))$ if the mean model is correct even if $V_i$ is incorrectly specified, where
  - $M(\beta) = I_0^{-1} I_1 I_0^{-1}$
  - $I_0 = \sum_{i=1}^{K} \frac{\partial \mu_i'}{\partial \beta} V_i^{-1} \frac{\partial \mu_i}{\partial \beta}$
  - $I_1 = \sum_{i=1}^{K} \frac{\partial \mu_i'}{\partial \beta} V_i^{-1} \mathrm{Cov}(Y_i) V_i^{-1} \frac{\partial \mu_i}{\partial \beta}$

The third property listed above means that you don't have to specify the working correlation matrix correctly in order to have a consistent estimator of the regression parameters. Choosing the working correlation closer to the true correlation increases the statistical efficiency of the regression parameter estimator, so you should specify the working correlation as accurately as possible based on knowledge of the measurement process.

Estimating the Covariance of $\hat{\beta}$

The model-based estimator of $\mathrm{Cov}(\hat{\beta})$ is given by

$$\mathrm{Cov}_M(\hat{\beta}) = I_0^{-1}$$

This is the GEE equivalent of the inverse of the Fisher information matrix that is often used in generalized linear models as an estimator of the covariance of the maximum likelihood estimator of $\beta$. It is a consistent estimator of the covariance matrix of $\hat{\beta}$ if the mean model and the working correlation matrix are correctly specified.

The estimator $M = I_0^{-1} I_1 I_0^{-1}$ is called the empirical, or robust, estimator of the covariance
matrix of $\hat{\beta}$. It has the property of being a consistent estimator of the covariance matrix of $\hat{\beta}$, even if the working correlation matrix is misspecified, that is, if $\mathrm{Cov}(Y_i) \ne V_i$. In computing $M$, $\beta$ and $\phi$ are replaced by estimates, and $\mathrm{Cov}(Y_i)$ is replaced by an estimate, such as

$$(Y_i - \mu_i(\hat{\beta}))(Y_i - \mu_i(\hat{\beta}))'$$

Progabide Example

GEE is an appropriate strategy for analyzing the epileptic seizure data. You can employ a log-linear model with $v(\mu) = \mu$ (the Poisson variance function) and

$$\log(E(Y_{ij})) = \beta_0 + x_{i1}\beta_1 + x_{i2}\beta_2 + x_{i1}x_{i2}\beta_3 + \log(t_{ij})$$

where

- $Y_{ij}$: number of epileptic seizures in interval $j$
- $t_{ij}$: length of interval $j$
- $x_{i1}$: 1 for weeks 8-16, 0 for weeks 0-8
- $x_{i2}$: 1 for the progabide group, 0 for the placebo group

The correlations between the counts are modeled as $r_{ij} = \alpha$, $i \ne j$ (exchangeable correlations). For comparison, the correlations are also modeled as independent (identity correlation matrix).

In this model, the regression parameters have the interpretation in terms of the log seizure rate shown in Figure 2.

Figure 2  Interpretation of Regression Parameters

Treatment   Visit      $\log(E(Y_{ij})/t_{ij})$
Placebo     Baseline   $\beta_0$
            1-4        $\beta_0 + \beta_1$
Progabide   Baseline   $\beta_0 + \beta_2$
            1-4        $\beta_0 + \beta_1 + \beta_2 + \beta_3$

As indicated schematically in Figure 3, the difference between the log seizure rates in the pretreatment (baseline) period and the treatment periods is $\beta_1$ for the placebo group and $\beta_1 + \beta_3$ for the Progabide group. A value of $\beta_3 < 0$ would indicate an effective reduction in the seizure rate.

Figure 3  Interpretation of Model (schematic plot of $\log(E(Y_{ij})/t_{ij})$ at Baseline and at Visits 1-4; the change from baseline is $\beta_1$ for the Placebo group and $\beta_1 + \beta_3$ for the Treatment group)

You can now fit this model in the SAS System by using the GENMOD procedure, which has been enhanced to provide Generalized Estimating Equations methodology. The following statements input the data, which are arranged as one visit per observation (only the first few records are listed here):

    data thall;
       input id y visit trt bline age;
       intercpt=1;
       cards;
    104  5 1 0 11 31
    104  3 2 0 11 31
    104  3 3 0 11 31
    104  3 4 0 11 31
    106  3 1 0 11 30
    106  5 2 0 11 30
    106  3 3 0 11 30
    106  3 4 0 11 30
    107  2 1 0  6 25
    107  4 2 0  6 25
    107  0 3 0  6 25
    107  5 4 0  6 25
    114  4 1 0  8 36
    114  4 2 0  8 36
    ;
    run;

Some further data manipulations create an observation for the baseline measures, create an interval variable, and create an indicator variable for whether the observation is for a baseline measurement or a visit measurement.

    /* add one baseline record (visit=0) per patient */
    data new;
       set thall;
       output;
       if visit=1 then do;
          y=bline;
          visit=0;
          output;
       end;
    run;

    /* drop patient 207; set the period indicator and the log interval length */
    data new2;
       set new;
       if id ne 207;
       if visit=0 then do;
          x1=0;
          ltime=log(8);
       end;
       else do;
          x1=1;
          ltime=log(2);
       end;
       x1trt=x1*trt;
    run;

The GEE solution is requested by using the REPEATED statement in the GENMOD procedure. The option SUBJECT=ID specifies that the ID variable describes the observations for a single cluster, and the CORRW option prints the working correlation matrix. The TYPE= option specifies the correlation structure; the value EXCH indicates the exchangeable structure. Other structures now supported include the unstructured, AR(1), independent, and user-specified.

    proc genmod data=new2;
       class id;
       /* x1 | trt expands to x1, trt, and the x1*trt interaction */
       model y=x1 | trt / d=poisson offset=ltime itprint;
       repeated subject=id / corrw type=exch;
    run;

These statements first produce the usual output for fitting a generalized linear model to these data; the GLM parameter estimates are used as the initial parameter estimates for the GEE solution.

Information about the GEE model is displayed in Figure 4. The results of fitting the model are shown in Figure 5. Compare these with the model of independence displayed in Figure 6. The parameter estimates are nearly identical, but the standard errors for the independence case are underestimated. The coefficient of the interaction term, $\beta_3$, is highly significant under the independence model and marginally significant with the exchangeable correlations model.
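
As a rough check on the comparison above, the Wald statistic for the interaction can be recomputed by hand from the X1*TRT estimate and the two standard errors reported in Figures 5 and 6. The Python sketch below is illustrative only: `wald` is a hypothetical helper (it is not part of SAS), and the p-values rely on the usual normal approximation.

```python
import math

def wald(estimate, std_err):
    """Return (z, two-sided p-value) for H0: parameter = 0, normal approximation."""
    z = estimate / std_err
    # Two-sided tail area: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# Values taken from Figures 5 and 6 of this paper (X1*TRT row).
beta3 = -0.3016      # interaction estimate, identical under both fits
se_exch = 0.1712     # empirical standard error, exchangeable working correlation
se_indep = 0.0697    # standard error from the independence fit

z_exch, p_exch = wald(beta3, se_exch)
z_indep, p_indep = wald(beta3, se_indep)

# exp(beta3) is the ratio of the Progabide rate change to the placebo rate change.
rate_ratio = math.exp(beta3)

print(f"exchangeable: z = {z_exch:.3f}, p = {p_exch:.4f}")
print(f"independence: z = {z_indep:.3f}, p = {p_indep:.2e}")
print(f"exp(beta3)   = {rate_ratio:.3f}")
```

With the exchangeable structure the statistic is about -1.76 (p about 0.078), while the independence standard error gives about -4.33, with a p-value on the order of $10^{-5}$. The point estimate is the same in both cases; only the standard error, and therefore the strength of the conclusion, changes.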

Figure 4  GEE Model Information

GEE Model Information

Description              Value
Correlation Structure    Exchangeable
Subject Effect           ID
Number of Clusters       58
Maximum Cluster Size     5
Minimum Cluster Size     5

Figure 5  GEE Parameter Estimates

                          Empirical 95% Confidence Limits
Parameter   Estimate  Std Err    Lower    Upper        Z   Pr>|Z|
INTERCEPT     1.3476   0.1574   1.0392   1.6560   8.5640   0.0000
X1            0.1108   0.1161  -0.1168   0.3383   0.9543   0.3399
TRT          -0.1080   0.1937  -0.4876   0.2716  -0.5578   0.5770
X1*TRT       -0.3016   0.1712  -0.6371   0.0339  -1.7620   0.0781
Scale         3.2245

Figure 6  Independence Model

Analysis Of Parameter Estimates

Parameter   DF  Estimate  Std Err  ChiSquare  Pr>Chi
INTERCEPT    1    1.3476   0.0341  1565.4356  0.0001
X1           1    0.1108   0.0469     5.5839  0.0181
TRT          1   -0.1080   0.0486     4.9316  0.0264
X1*TRT       1   -0.3016   0.0697    18.6987  0.0001
SCALE        0    1.0000   0.0000

Table 2  Results of Model Fitting

Variable       Correlation Structure   Coef   Std Error   Coef/SE
Intercept      Exchangeable            1.35        0.16      8.56
               Independent             1.35        0.03     39.52
Visit ($x_1$)  Exchangeable            0.11        0.12      0.95
               Independent             0.11        0.05      2.36
Treat ($x_2$)  Exchangeable           -0.11        0.19     -0.56
               Independent            -0.11        0.05     -2.22
$x_1 x_2$      Exchangeable           -0.30        0.17     -1.76
               Independent            -0.30        0.07     -4.32

The working correlation is printed out with the CORRW option. The fitted exchangeable correlation matrix is shown in Figure 7.

Figure 7  Working Correlation Matrix

Working Correlation Matrix

          COL1     COL2     COL3     COL4     COL5
ROW1    1.0000   0.5983   0.5983   0.5983   0.5983
ROW2    0.5983   1.0000   0.5983   0.5983   0.5983
ROW3    0.5983   0.5983   1.0000   0.5983   0.5983
ROW4    0.5983   0.5983   0.5983   1.0000   0.5983
ROW5    0.5983   0.5983   0.5983   0.5983   1.0000

If you specify the COVB option, you produce both the model-based (naive) and the empirical (robust) covariance matrices. Figure 8 contains these estimates.

Figure 8  Covariance Matrices

Covariance Matrix (Model-Based)
Covariances are Above the Diagonal and Correlations are Below

Parameter
Number          PRM1        PRM2        PRM3        PRM4
PRM1         0.01206    0.001594    -0.01206   -0.001594
PRM2         0.11876     0.01493   -0.001594    -0.01493
PRM3        -0.70017    -0.08316     0.02460    0.005562
PRM4        -0.07557    -0.63627     0.18466     0.03687

Covariance Matrix (Empirical)
Covariances are Above the Diagonal and Correlations are Below

Parameter
Number          PRM1        PRM2        PRM3        PRM4
PRM1         0.02476   -0.001152    -0.02476    0.001152
PRM2        -0.06305     0.01348    0.001152    -0.01348
PRM3        -0.81249     0.05122     0.03751   -0.002999
PRM4         0.04276    -0.67815    -0.09045     0.02931

The two covariance estimates are similar, indicating an adequate correlation model.

Modeling Odds Ratios for Binary Data

Diggle, Liang, and Zeger (1994) point out that modeling association among binary responses with correlation has a disadvantage, and they propose using the odds ratio instead. For binary data, the correlation between the $j$th and $k$th response is, by definition,

$$\mathrm{Corr}(Y_{ij}, Y_{ik}) = \frac{\Pr(Y_{ij} = 1, Y_{ik} = 1) - \mu_{ij}\mu_{ik}}{\sqrt{\mu_{ij}(1 - \mu_{ij})\mu_{ik}(1 - \mu_{ik})}}$$

The joint probability in the numerator satisfies the following bounds, by elementary properties of probability, since $\mu_{ij} = \Pr(Y_{ij} = 1)$:

$$\max(0, \mu_{ij} + \mu_{ik} - 1) \le \Pr(Y_{ij} = 1, Y_{ik} = 1) \le \min(\mu_{ij}, \mu_{ik})$$

The correlation, therefore, is constrained to be within limits that depend in a complicated way on the means of the data. The odds ratio, defined as

$$\mathrm{OR}(Y_{ij}, Y_{ik}) = \frac{\Pr(Y_{ij} = 1, Y_{ik} = 1)\Pr(Y_{ij} = 0, Y_{ik} = 0)}{\Pr(Y_{ij} = 1, Y_{ik} = 0)\Pr(Y_{ij} = 0, Y_{ik} = 1)}$$

is not constrained by the means and is preferred by many workers to correlations for binary data.
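
To make the constraint concrete, the following Python sketch evaluates the correlation limits implied by the bounds above for one pair of means, and computes the odds ratio for a given joint probability. It is a minimal illustration: the function names and the example means (0.2 and 0.7) are assumptions chosen for this sketch, not values from the seizure data.

```python
import math

def corr_bounds(mu_j, mu_k):
    """Smallest and largest Corr(Y_j, Y_k) allowed for binary responses with these means."""
    denom = math.sqrt(mu_j * (1.0 - mu_j) * mu_k * (1.0 - mu_k))
    p11_low = max(0.0, mu_j + mu_k - 1.0)   # lower bound on Pr(Y_j=1, Y_k=1)
    p11_high = min(mu_j, mu_k)              # upper bound on Pr(Y_j=1, Y_k=1)
    return ((p11_low - mu_j * mu_k) / denom,
            (p11_high - mu_j * mu_k) / denom)

def odds_ratio(p11, mu_j, mu_k):
    """Odds ratio from the joint probability p11 and the two marginal means."""
    p10 = mu_j - p11                 # Pr(Y_j=1, Y_k=0)
    p01 = mu_k - p11                 # Pr(Y_j=0, Y_k=1)
    p00 = 1.0 - mu_j - mu_k + p11    # Pr(Y_j=0, Y_k=0)
    return (p11 * p00) / (p10 * p01)

# Hypothetical means for illustration.
low, high = corr_bounds(0.2, 0.7)
print(f"correlation must lie in [{low:.3f}, {high:.3f}]")

# Under independence p11 = mu_j * mu_k, and the odds ratio equals 1.
print(f"odds ratio at independence: {odds_ratio(0.2 * 0.7, 0.2, 0.7):.3f}")

# Moving p11 toward either bound sends the odds ratio toward 0 or infinity.
for p11 in (0.01, 0.10, 0.19):
    print(f"p11 = {p11:.2f}  ->  OR = {odds_ratio(p11, 0.2, 0.7):.3f}")
```

For these means the correlation cannot exceed roughly 0.33 or fall below roughly -0.76, whereas the odds ratio can take any positive value as the joint probability moves between its bounds.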

Carey, Zeger, and Diggle (1993) propose an algorithm for fitting the log odds ratio as

$$\log(\mathrm{OR}(Y_{ij}, Y_{ik})) = z_{ijk}'\alpha$$

where $z_{ijk}$ is a vector of covariates and $\alpha$ is a vector of association parameters to be estimated. The mean is modeled with a regression model just as it is when you use correlations to model association.

This implementation of GEE is called alternating logistic regression (ALR). It uses a GEE similar to the one used to model correlations to estimate the mean regression parameters, alternating with a logistic regression to estimate the association parameters. The previous method treated correlation as a nuisance parameter, which must be taken into account but is not of scientific interest. The ALR method is useful if the association is a scientific focus of the analysis, since a detailed model for the association is fitted.

Conclusion

Generalized Estimating Equations provide a practical method with good statistical properties to model data that exhibit association but cannot be modeled as multivariate normal.

References

Carey, V., Zeger, S.L., and Diggle, P. (1993), "Modelling Multivariate Binary Data with Alternating Logistic Regressions," Biometrika, 517-526.

Diggle, P.J., Liang, K.-Y., and Zeger, S.L. (1994), Analysis of Longitudinal Data, Oxford: Oxford Science.

Liang, K.-Y. and Zeger, S.L. (1986), "Longitudinal Data Analysis Using Generalized Linear Models," Biometrika, 13-22.

Thall, P.F. and Vail, S.C. (1990), "Some Covariance Models for Longitudinal Count Data with Overdispersion," Biometrics, 657-671.

Zeger, S.L. and Liang, K.-Y. (1986), "Longitudinal Data Analysis for Discrete and Continuous Outcomes," Biometrics, 121-130.

SAS and SAS/STAT are registered trademarks of SAS Institute Inc. in the USA and in other countries. ® indicates USA registration.

Workshop Outline

- Correlated response settings for GEE application
  - repeated measurements
  - clustered data
- Overview of Generalized Linear Models
  - review of methodology
  - basic examples
- Extending the GLM to Generalized Estimating Equations
  - Methodology
  - REPEATED Statement in PROC GENMOD
- GEE Analyses
  - Analysis objectives
  - PROC GENMOD set-up
  - Results and interpretation