Applications of GEE Methodology Using the SAS System

Gordon Johnston, Maura Stokes
SAS Institute Inc., Cary, NC

Abstract

The analysis of correlated data arising from repeated measurements when the measurements are assumed to be multivariate normal has been studied extensively. In many practical problems, however, the normality assumption is not reasonable. When the responses are discrete and correlated, for example, different methodology must be used in the analysis of the data. Generalized Estimating Equations (GEEs) provide a practical method with reasonable statistical efficiency to analyze such data. This paper provides an overview of the use of GEEs in the analysis of correlated data using the SAS System. Emphasis is placed on discrete correlated data, since this is an area of great practical interest.

Introduction

GEEs were introduced by Liang and Zeger (1986) as a method of dealing with correlated data when, except for the correlation among responses, the data can be modeled as a generalized linear model. For example, correlated binary and count data often can be modeled in this way. You can solve GEEs with the GENMOD procedure in SAS/STAT software, beginning with Release 6.12 of the SAS System. In addition, the Alternating Logistic Regression algorithm for fitting log odds ratios with binary data will be implemented in a future release. This paper provides an overview of the GEE methodology that is implemented in the GENMOD procedure. Refer to Diggle, Liang, and Zeger (1994) and the other references at the end of this paper for more details on this method.

Correlated data can arise from situations such as

- longitudinal studies, in which multiple measurements are taken on the same subject at different points in time
- clustering, where measurements are taken on subjects that share a common category or characteristic that leads to correlation. For example, incidence of pulmonary disease among family members may be correlated because of hereditary factors.

The correlation must be accounted for by analysis methods appropriate to the data. Possible consequences of analyzing correlated data as if it were independent are

- incorrect inferences concerning regression parameters due to underestimated standard errors
- inefficient estimators, that is, more mean square error in regression parameter estimators than necessary

Example of Longitudinal Data

The following data, from Thall and Vail (1990), are concerned with the treatment of epileptic seizure episodes. These data were also analyzed in Diggle, Liang, and Zeger (1994). The data consist of the number of epileptic seizures in an eight-week baseline period, before any treatment, and in each of four two-week treatment periods, in which patients received either a placebo or the drug Progabide in addition to other therapy. A portion of the data is shown in Table 1.

Table 1 Epileptic Seizure Data

Patient ID  Treatment  Baseline  Visit1  Visit2  Visit3  Visit4
104         Placebo    11        5       3       3       3
106         Placebo    11        3       5       3       3
107         Placebo    6         2       4       0       5
101         Progabide  76        11      14      9       8
102         Progabide  38        8       7       9       4
103         Progabide  19        0       4       3       0

Within-subject measurements are likely to be correlated, whereas between-subject measurements are likely to be independent. The raw correlations among the counts between visits are shown in Figure 1. They indicate strong correlation in the number of seizures between the visits. Accounting for this correlation is an important aspect of the analysis strategy. The seizure data will be analyzed in later sections as count data with a specified correlation structure.

Figure 1 Raw Correlations

         Visit 1  Visit 2  Visit 3  Visit 4
Visit 1  1.00     0.69     0.54     0.72
Visit 2           1.00     0.67     0.76
Visit 3                    1.00     0.71
Visit 4                             1.00

Generalized Linear Models for Independent Data

Let $Y_i$, $i = 1, \ldots, n$, be independent measurements. Generalized linear models for independent data are characterized by

- a systematic component
  $$g(E(Y_i)) = g(\mu_i) = x_i'\beta$$
  where $\mu_i = E(Y_i)$, $g$ is a link function that relates the means of the responses to the linear predictor $x_i'\beta$, $x_i$ is a vector of independent variables for the $i$th observation, and $\beta$ is a vector of regression parameters to be estimated
- a random component: $Y_i$, $i = 1, \ldots, n$, are independent and have a probability distribution from an exponential family (binomial, Poisson, normal, gamma, inverse Gaussian)

The exponential family assumption implies that the variance of $Y_i$ is given by $V_i = \phi v(\mu_i)$, where $v$ is a variance function that is determined by the specific probability distribution and $\phi$ is a dispersion parameter that may be known or may be estimated from the data, depending on the specific model. The variance functions for the binomial and Poisson distributions are given by

binomial: $v(\mu) = \mu(1 - \mu)$
Poisson: $v(\mu) = \mu$

The maximum likelihood estimator of the $p \times 1$ parameter vector $\beta$ is obtained by solving the estimating equations
$$\sum_{i=1}^{n} \frac{\partial \mu_i}{\partial \beta} v_i^{-1} (y_i - \mu_i(\beta)) = 0$$
for $\beta$. This is a nonlinear system of equations for $\beta$, and it can be solved iteratively by the Fisher scoring or Newton-Raphson algorithm.

Modeling Correlation: Generalized Estimating Equations

Let $Y_{ij}$, $j = 1, \ldots, n_i$, $i = 1, \ldots, K$, represent the $j$th measurement on the $i$th subject. There are $n_i$ measurements on subject $i$ and $\sum_{i=1}^{K} n_i$ total measurements.

Correlated data are modeled using the same link function and linear predictor setup (systematic component) as the independence case. The random component is described by the same variance functions as in the independence case, but the covariance structure of the correlated measurements must also be modeled. Let the vector of measurements on the $i$th subject be $Y_i = [Y_{i1}, \ldots, Y_{in_i}]'$ with corresponding vector of means $\mu_i = [\mu_{i1}, \ldots, \mu_{in_i}]'$, and let $V_i$ be an estimate of the covariance matrix of $Y_i$. The Generalized Estimating Equation for estimating $\beta$ is an extension of the independence estimating equation to correlated data and is given by
$$\sum_{i=1}^{K} \frac{\partial \mu_i'}{\partial \beta} V_i^{-1} (Y_i - \mu_i(\beta)) = 0$$

Working Correlations

Let $R_i(\alpha)$ be an $n_i \times n_i$ "working" correlation matrix that is fully specified by the vector of parameters $\alpha$. The covariance matrix of $Y_i$ is modeled as
$$V_i = \phi A_i^{1/2} R_i(\alpha) A_i^{1/2}$$
where $A_i$ is an $n_i \times n_i$ diagonal matrix with $v(\mu_{ij})$ as the $j$th diagonal element. If $R_i(\alpha)$ is the true correlation matrix of $Y_i$, then $V_i$ is the true covariance matrix of $Y_i$.

The working correlation matrix is not usually known and must be estimated. It is estimated in the iterative fitting process using the current value of the parameter vector $\beta$ to compute appropriate functions of the Pearson residual
$$r_{ij} = \frac{y_{ij} - \mu_{ij}}{\sqrt{v(\mu_{ij})}}$$

There are several specific choices of the form of working correlation matrix $R_i(\alpha)$ commonly used to model the correlation matrix of $Y_i$. A few of the choices are shown below. Refer to Liang and Zeger (1986) for additional choices. The dimension of the vector $\alpha$, which is treated as a nuisance parameter, and the form of the estimator of $\alpha$ are different for each choice. Some typical choices are:

- $R_i(\alpha) = R_0$, a fixed correlation matrix. For $R_0 = I$, the identity matrix, the GEE reduces to the independence estimating equation.
- m-dependent: $\mathrm{Corr}(Y_{ij}, Y_{i,j+t}) = \alpha_t$ for $t = 1, 2, \ldots, m$, and $0$ for $t > m$
- exchangeable: $\mathrm{Corr}(Y_{ij}, Y_{ik}) = \alpha$, $j \ne k$
- unstructured: $\mathrm{Corr}(Y_{ij}, Y_{ik}) = \alpha_{jk}$

Fitting Algorithm

The following is an algorithm for fitting the specified model using GEEs:

1. Compute an initial estimate of $\beta$, for example with an ordinary generalized linear model assuming independence.
2. Compute the working correlations $R(\alpha)$.
3. Compute an estimate of the covariance:
   $$V_i = \phi A_i^{1/2} \hat{R}(\alpha) A_i^{1/2}$$
4. Update $\beta$:
   $$\beta_{r+1} = \beta_r + \left[ \sum_{i=1}^{K} \frac{\partial \mu_i'}{\partial \beta} V_i^{-1} \frac{\partial \mu_i}{\partial \beta} \right]^{-1} \left[ \sum_{i=1}^{K} \frac{\partial \mu_i'}{\partial \beta} V_i^{-1} (Y_i - \mu_i) \right]$$
5. Iterate until convergence.

Properties of GEEs

The GEE method has some desirable statistical properties that make it an attractive method for dealing with correlated data.

- GEEs reduce to independence estimating equations for $n_i = 1$.
- GEEs are the maximum likelihood score equation for multivariate Gaussian data.
- $\sqrt{K}(\hat{\beta} - \beta) \to N(0, M(\beta))$ if the mean model is correct even if $V_i$ is incorrectly specified, where
  -- $M(\beta) = I_0^{-1} I_1 I_0^{-1}$
  -- $I_0 = \sum_{i=1}^{K} \frac{\partial \mu_i'}{\partial \beta} V_i^{-1} \frac{\partial \mu_i}{\partial \beta}$
  -- $I_1 = \sum_{i=1}^{K} \frac{\partial \mu_i'}{\partial \beta} V_i^{-1} \mathrm{Cov}(Y_i) V_i^{-1} \frac{\partial \mu_i}{\partial \beta}$

The third property listed above means that you don't have to specify the working correlation matrix correctly in order to have a consistent estimator of the regression parameters. Choosing the working correlation closer to the true correlation increases the statistical efficiency of the regression parameter estimator, so you should specify the working correlation as accurately as possible based on knowledge of the measurement process.

Estimating the Covariance of $\hat{\beta}$

The model-based estimator of $\mathrm{Cov}(\hat{\beta})$ is given by
$$\mathrm{Cov}_M(\hat{\beta}) = I_0^{-1}$$
This is the GEE equivalent of the inverse of the Fisher information matrix that is often used in generalized linear models as an estimator of the covariance of the maximum likelihood estimator of $\beta$. It is a consistent estimator of the covariance matrix of $\hat{\beta}$ if the mean model and the working correlation matrix are correctly specified.

The estimator
$$M = I_0^{-1} I_1 I_0^{-1}$$
is called the empirical, or robust, estimator of the covariance matrix of $\hat{\beta}$. It has the property of being a consistent estimator of the covariance matrix of $\hat{\beta}$, even if the working correlation matrix is misspecified, that is, if $\mathrm{Cov}(Y_i) \ne V_i$. In computing $M$, $\beta$ and $\phi$ are replaced by estimates, and $\mathrm{Cov}(Y_i)$ is replaced by an estimate, such as
$$(Y_i - \mu_i(\hat{\beta}))(Y_i - \mu_i(\hat{\beta}))'$$

Progabide Example

GEE is an appropriate strategy for analyzing the epileptic seizure data. You can employ a log-linear model with $v(\mu) = \mu$ (the Poisson variance function) and
$$\log(E(Y_{ij})) = \beta_0 + x_{i1}\beta_1 + x_{i2}\beta_2 + x_{i1}x_{i2}\beta_3 + \log(t_{ij})$$
where

$Y_{ij}$: number of epileptic seizures in interval $j$
$t_{ij}$: length of interval $j$
$x_{i1}$: 1 for weeks 8-16, 0 for weeks 0-8
$x_{i2}$: 1 for the progabide group, 0 for the placebo group

The correlations between the counts are modeled as $r_{ij} = \alpha$, $i \ne j$ (exchangeable correlations). For comparison, the correlations are also modeled as independent (identity correlation matrix). In this model, the regression parameters have the interpretation in terms of the log seizure rate shown in Figure 2.

Figure 2 Interpretation of Regression Parameters

Treatment  Visit     log(E(Y_ij)/t_ij)
Placebo    Baseline  β0
Placebo    1-4       β0 + β1
Progabide  Baseline  β0 + β2
Progabide  1-4       β0 + β1 + β2 + β3

As indicated schematically in Figure 3, the difference between the log seizure rates in the pretreatment (baseline) period and the treatment periods is $\beta_1$ for the placebo group and $\beta_1 + \beta_3$ for the Progabide group. A value of $\beta_3 < 0$ would indicate an effective reduction in the seizure rate.

Figure 3 Interpretation of Model
[Schematic plot of $\log(E(Y_{ij})/t_{ij})$ at baseline and at visits 1-4 for the placebo and treatment groups; the change from baseline is $\beta_1$ for placebo and $\beta_1 + \beta_3$ for treatment.]

You can now fit this model in the SAS System by using the GENMOD procedure, which has been enhanced to provide Generalized Estimating Equations methodology. The following statements input the data, which are arranged as one visit per observation (only the first few records are shown):

   data thall;
      input id y visit trt bline age;
      intercpt=1;
      cards;
   104 5 1 0 11 31
   104 3 2 0 11 31
   104 3 3 0 11 31
   104 3 4 0 11 31
   106 3 1 0 11 30
   106 5 2 0 11 30
   106 3 3 0 11 30
   106 3 4 0 11 30
   107 2 1 0 6 25
   107 4 2 0 6 25
   107 0 3 0 6 25
   107 5 4 0 6 25
   114 4 1 0 8 36
   114 4 2 0 8 36
   run;

Some further data manipulations create an observation for the baseline measures, create an interval variable, and create an indicator variable for whether the observation is for a baseline measurement or a visit measurement:

   /* add one baseline record (visit=0) per subject */
   data new;
      set thall;
      output;
      if visit=1 then do;
         y=bline;
         visit=0;
         output;
      end;
   run;

   /* drop subject 207; x1 flags treatment visits, ltime is the log interval length */
   data new2;
      set new;
      if id ne 207;
      if visit=0 then do;
         x1=0;
         ltime=log(8);
      end;
      else do;
         x1=1;
         ltime=log(2);
      end;
      x1trt=x1*trt;
   run;

The GEE solution is requested by using the REPEATED statement in the GENMOD procedure. The option SUBJECT=ID specifies that the ID variable describes the observations for a single cluster, and the CORRW option prints the working correlation matrix. The TYPE= option specifies the correlation structure; the value EXCH indicates the exchangeable structure. Other structures now supported include the unstructured, AR(1), independent, and user-specified.

   proc genmod data=new2;
      class id;
      model y=x1 | trt / d=poisson offset=ltime itprint;
      repeated subject=id / corrw type=exch;
   run;

These statements first produce the usual output for fitting a generalized linear model to these data; the GLM parameter estimates are used as the initial parameter estimates for the GEE solution. Information about the GEE model is displayed in Figure 4. The results of fitting the model are shown in Figure 5. Compare these with the model of independence displayed in Figure 6. The parameter estimates are nearly identical, but the standard errors for the independence case are underestimated. The coefficient of the interaction term, $\beta_3$, is highly significant under the independence model and marginally significant with the exchangeable correlations model.
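The contrast between the two analyses can be checked by hand from the reported estimates. The following Python sketch is not part of the original SAS program; it recomputes the Wald statistic and two-sided normal p-value for the interaction coefficient $\beta_3$ under each model. The estimate and standard errors are the values read from Figures 5 and 6, and the function name `wald_test` is hypothetical.

```python
import math

def wald_test(estimate, std_err):
    """Return the Wald z statistic and its two-sided normal p-value."""
    z = estimate / std_err
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# Interaction coefficient (X1*TRT) as reported in Figures 5 and 6.
beta3 = -0.3016
se_exchangeable = 0.1712   # empirical standard error, exchangeable model
se_independent = 0.0697    # standard error under independence

z_exch, p_exch = wald_test(beta3, se_exchangeable)
z_ind, p_ind = wald_test(beta3, se_independent)

# By Figure 2, exp(beta3) is the Progabide rate ratio (visits vs. baseline)
# divided by the placebo rate ratio.
rate_ratio = math.exp(beta3)

print(f"exchangeable: z = {z_exch:.3f}, p = {p_exch:.4f}")
print(f"independent:  z = {z_ind:.3f}, p = {p_ind:.2e}")
print(f"exp(beta3) = {rate_ratio:.3f}")
```

The same point estimate gives a z statistic of roughly -1.76 with the empirical standard error and roughly -4.33 under independence, which is the difference between "marginally" and "highly" significant described above.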

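As a small illustration of the exchangeable structure requested with TYPE=EXCH, the sketch below builds $R(\alpha)$ and the working covariance $V_i = \phi A_i^{1/2} R(\alpha) A_i^{1/2}$ for one cluster, using the Poisson variance function $v(\mu) = \mu$. It is written in plain Python rather than SAS; the function names, the cluster means, and the default $\phi = 1$ are assumptions made for illustration, while $\alpha = 0.5983$ is the fitted value reported in Figure 7.

```python
import math

def exchangeable_corr(n, alpha):
    """n x n correlation matrix with 1 on the diagonal and alpha elsewhere."""
    return [[1.0 if j == k else alpha for k in range(n)] for j in range(n)]

def working_covariance(means, alpha, phi=1.0):
    """V = phi * A^(1/2) R(alpha) A^(1/2) with A = diag(v(mu)), v(mu) = mu."""
    n = len(means)
    r = exchangeable_corr(n, alpha)
    sd = [math.sqrt(m) for m in means]  # square roots of the Poisson variances
    return [[phi * sd[j] * r[j][k] * sd[k] for k in range(n)] for j in range(n)]

# Hypothetical cluster means for the baseline interval and four visits.
mu = [30.0, 8.0, 8.0, 8.0, 8.0]
V = working_covariance(mu, alpha=0.5983)

for row in V:
    print(" ".join(f"{x:8.3f}" for x in row))
```

The diagonal of `V` reproduces the Poisson variances, and every off-diagonal entry is $\alpha$ times the product of the two standard deviations, so a single parameter describes all within-subject association.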
Figure 4 GEE Model Information

Description            Value
Correlation Structure  Exchangeable
Subject Effect         ID
Number of Clusters     58
Maximum Cluster Size   5
Minimum Cluster Size   5

Figure 5 GEE Parameter Estimates

                      Empirical  95% Confidence Limits
Parameter   Estimate  Std Err    Lower     Upper     Z        Pr>|Z|
INTERCEPT    1.3476   0.1574      1.0392   1.6560    8.5640   0.0000
X1           0.1108   0.1161     -0.1168   0.3383    0.9543   0.3399
TRT         -0.1080   0.1937     -0.4876   0.2716   -0.5578   0.5770
X1*TRT      -0.3016   0.1712     -0.6371   0.0339   -1.7620   0.0781
Scale        3.2245

Figure 6 Independence Model

Analysis Of Parameter Estimates

Parameter   DF  Estimate  Std Err  ChiSquare  Pr>Chi
INTERCEPT   1    1.3476   0.0341   1565.4356  0.0001
X1          1    0.1108   0.0469      5.5839  0.0181
TRT         1   -0.1080   0.0486      4.9316  0.0264
X1*TRT      1   -0.3016   0.0697     18.6987  0.0001
SCALE       0    1.0000   0.0000

Table 2 Results of Model Fitting

Variable      Correlation Structure  Coef   Std Error  Coef/SE
Intercept     Exchangeable            1.35  0.16        8.56
              Independent             1.35  0.03       39.52
Visit (x1)    Exchangeable            0.11  0.12        0.95
              Independent             0.11  0.05        2.36
Treat (x2)    Exchangeable           -0.11  0.19       -0.56
              Independent            -0.11  0.05       -2.22
x1 x2         Exchangeable           -0.30  0.17       -1.76
              Independent            -0.30  0.07       -4.32

The working correlation is printed out with the CORRW option. The fitted exchangeable correlation matrix is shown in Figure 7.

Figure 7 Working Correlation Matrix

       COL1    COL2    COL3    COL4    COL5
ROW1   1.0000  0.5983  0.5983  0.5983  0.5983
ROW2   0.5983  1.0000  0.5983  0.5983  0.5983
ROW3   0.5983  0.5983  1.0000  0.5983  0.5983
ROW4   0.5983  0.5983  0.5983  1.0000  0.5983
ROW5   0.5983  0.5983  0.5983  0.5983  1.0000

If you specify the COVB option, you produce both the model-based (naive) and the empirical (robust) covariance matrices. Figure 8 contains these estimates. The two covariance estimates are similar, indicating an adequate correlation model.

Figure 8 Covariance Matrices

Covariance Matrix (Model-Based)
Covariances are Above the Diagonal and Correlations are Below

Parameter Number  PRM1       PRM2       PRM3       PRM4
PRM1               0.01206    0.001594  -0.01206   -0.001594
PRM2               0.11876    0.01493   -0.001594  -0.01493
PRM3              -0.70017   -0.08316    0.02460    0.005562
PRM4              -0.07557   -0.63627    0.18466    0.03687

Covariance Matrix (Empirical)
Covariances are Above the Diagonal and Correlations are Below

Parameter Number  PRM1       PRM2       PRM3       PRM4
PRM1               0.02476   -0.001152  -0.02476    0.001152
PRM2              -0.06305    0.01348    0.001152  -0.01348
PRM3              -0.81249    0.05122    0.03751   -0.02999
PRM4               0.04276   -0.67815   -0.90450    0.02931

Modeling Odds Ratios for Binary Data

Diggle, Liang, and Zeger (1994) point out that modeling association among binary responses with correlation has a disadvantage, and they propose using the odds ratio instead. For binary data, the correlation between the $j$th and $k$th response is, by definition,
$$\mathrm{Corr}(Y_{ij}, Y_{ik}) = \frac{\Pr(Y_{ij} = 1, Y_{ik} = 1) - \mu_{ij}\mu_{ik}}{\sqrt{\mu_{ij}(1 - \mu_{ij})\mu_{ik}(1 - \mu_{ik})}}$$
The joint probability in the numerator satisfies the following bounds, by elementary properties of probability, since $\mu_{ij} = \Pr(Y_{ij} = 1)$:
$$\max(0, \mu_{ij} + \mu_{ik} - 1) \le \Pr(Y_{ij} = 1, Y_{ik} = 1) \le \min(\mu_{ij}, \mu_{ik})$$
The correlation, therefore, is constrained to be within limits that depend in a complicated way on the means of the data. The odds ratio, defined as
$$\mathrm{OR}(Y_{ij}, Y_{ik}) = \frac{\Pr(Y_{ij} = 1, Y_{ik} = 1)\Pr(Y_{ij} = 0, Y_{ik} = 0)}{\Pr(Y_{ij} = 1, Y_{ik} = 0)\Pr(Y_{ij} = 0, Y_{ik} = 1)}$$
is not constrained by the means and is preferred by many workers to correlations for binary data.
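To make the constraint concrete, the following Python sketch (an illustration added here, with hypothetical function names and arbitrarily chosen example means) evaluates the bounds on $\Pr(Y_{ij} = 1, Y_{ik} = 1)$, the resulting range of the correlation, and the odds ratio for a given joint probability.

```python
import math

def joint_bounds(mu_j, mu_k):
    """Bounds on Pr(Y_j = 1, Y_k = 1) implied by the two marginal means."""
    return max(0.0, mu_j + mu_k - 1.0), min(mu_j, mu_k)

def correlation(mu_j, mu_k, p11):
    """Correlation of two binary responses with joint probability p11."""
    denom = math.sqrt(mu_j * (1.0 - mu_j) * mu_k * (1.0 - mu_k))
    return (p11 - mu_j * mu_k) / denom

def odds_ratio(mu_j, mu_k, p11):
    """Odds ratio from the 2 x 2 table of joint probabilities."""
    p10 = mu_j - p11               # Pr(Y_j = 1, Y_k = 0)
    p01 = mu_k - p11               # Pr(Y_j = 0, Y_k = 1)
    p00 = 1.0 - mu_j - mu_k + p11  # Pr(Y_j = 0, Y_k = 0)
    return (p11 * p00) / (p10 * p01)

mu_j, mu_k = 0.2, 0.7              # example means, chosen arbitrarily
lo, hi = joint_bounds(mu_j, mu_k)
print("joint probability range:", lo, hi)
print("correlation range:",
      correlation(mu_j, mu_k, lo), correlation(mu_j, mu_k, hi))
print("odds ratio at independence:", odds_ratio(mu_j, mu_k, mu_j * mu_k))
```

With means 0.2 and 0.7 the correlation can only range from about -0.76 to 0.33, whereas the odds ratio takes any positive value as the joint probability moves between its bounds (it tends to 0 at the lower bound and grows without limit toward the upper bound).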

Carey, Zeger, and Diggle (1993) propose an algorithm for fitting the log odds ratio as
$$\log(\mathrm{OR}(Y_{ij}, Y_{ik})) = z_{ijk}'\alpha$$
where $z_{ijk}$ is a vector of covariates and $\alpha$ is a vector of association parameters to be estimated. The mean is modeled with a regression model just as it is when you use correlations to model association. This implementation of GEE is called alternating logistic regression (ALR). It uses a GEE similar to the one used to model correlations to estimate the mean regression parameters, alternating with a logistic regression to estimate the association parameters.

The previous method treated correlation as a nuisance parameter, which must be taken into account but is not of scientific interest. The ALR method is useful if the association is a scientific focus of the analysis, since a detailed model for the association is fitted.

Conclusion

Generalized Estimating Equations provide a practical method with good statistical properties to model data that exhibit association but cannot be modeled as multivariate normal.

References

Carey, V., Zeger, S.L., and Diggle, P. (1993), "Modelling Multivariate Binary Data with Alternating Logistic Regressions," Biometrika, 80, 517-526.

Diggle, P.J., Liang, K.-Y., and Zeger, S.L. (1994), Analysis of Longitudinal Data, Oxford: Oxford Science.

Liang, K.-Y. and Zeger, S.L. (1986), "Longitudinal Data Analysis Using Generalized Linear Models," Biometrika, 73, 13-22.

Thall, P.F. and Vail, S.C. (1990), "Some Covariance Models for Longitudinal Count Data with Overdispersion," Biometrics, 46, 657-671.

Zeger, S.L. and Liang, K.-Y. (1986), "Longitudinal Data Analysis for Discrete and Continuous Outcomes," Biometrics, 42, 121-130.

SAS and SAS/STAT are registered trademarks of SAS Institute Inc. in the USA and in other countries. ® indicates USA registration.

Workshop Outline

- Correlated response settings for GEE application
  - repeated measurements
  - clustered data
- Overview of Generalized Linear Models
  - review of methodology
  - basic examples
- Extending the GLM to Generalized Estimating Equations
  - Methodology
  - REPEATED Statement in PROC GENMOD
- GEE Analyses
  - Analysis objectives
  - PROC GENMOD set-up
  - Results and interpretation