STK4900/ Lecture 10. Program

Size: px

Start display at page:

Download "STK4900/ Lecture 10. Program"

Augustine Reginald Ball
5 years ago
Views:

1 STK4900/ Lecture 10 Program 1. Repeated measures and longitudinal data 2. Simple analysis approaches 3. Random effects models 4. Generalized estimating equations (GEE) 5. GEE for binary data (and GLMs) 6. Time series data Sections 7.1, 7.2, 7.3, 7.4 (except 7.4.5), 7.5 1

Example: Fecal fat Lack of digestive enzymes in the intestine can cause bowel absorption problems, which will be indicated by excess fat in the feces.

2 Example: Fecal fat Lack of digestive enzymes in the intestine can cause bowel absorption problems, which will be indicated by excess fat in the feces. Pancreatic enzyme supplements can reduce the problem. The data are from a study to determine if the form of the supplement makes a difference This is an example with repeated measurements (more than one observation per subject) 2

3 Fecal fat (g/day) none tablet capsule coated The plot shows that some patients tend to have high values for all pill types, while other patients tend to have low values The values for a patient are not independent, and this has to be taken into account when we analyze the data 3

4 Example: Birth weight and birth order We have recorded the weights of the babies of 200 mothers who all have five children. We are interested in studying the effect of birth order and the age of the mother on the birth weight Box plots of birth weights for all 200 mothers Birth weights for a sample of 30 mothers with fitted line (based on all) Birthweight (g) Birth Order This is an example of longitudinal data (repeated measures taken over time) The birth weights for a mother are not independent, and this has to be taken into account when we analyze the data 4

5 Simple approaches to analyzing repeated measures and longitudinal data 1. Standard analysis ignoring the dependence Fecal fat ex: One-way ANOVA on pill type Birth weight ex: Linear regression on birth order However: Ignoring dependence is WRONG! Not pursued further. 2. Looking at parts of the data for an individual to avoid the dependence problem Fecal fat ex: one option is to compare two pill types at a time using the paired t-test Birth weight example one option is to look at the difference in weight between the fifth and the first child. This is not wrong, but ignores part of the data (birth weight) or gives many comparisons (fecal fat) See slides

6 Approaches, contd. 3. Including dependence as fixed factor variable Fecal fat ex: Two-way ANOVA on pill type and individual Birth weight ex: Linear regression on birth order with mother as factor variable However: Generally not interested in factors individual/ mother. Also in birth weight ex many factor levels. 4. Including dependence as random factor: Random effects model Fecal fat ex: Two-way ANOVA on pill type and individual as random factor Birth weight ex: Linear regression on birth order with mother as random factor variable Pro: Individual/mother modeled as random variable. 6

7 Fecal fat example: R commands (comparing pill types 1(none) and 2(tablet)): fecfat=read.table(" header=t) x=fecfat$fecfat[fecfat$pilltype==1] y=fecfat$fecfat[fecfat$pilltype==2] t.test(x,y,paired=t) R output : Paired t-test t = 3.109, df = 5, p-value = percent confidence interval: mean of the differences: Estimated difference / P-value for the six comparisons of two pill types at a time (column minus row) None Tablet Capsule Tablet 21.6 / 2.7 % * * Capsule 20.7 / 3.7 % / 58.9 % * Coated 7.0 / 23.9 % / 7.8 % / 8.0 % 7

8 Birth weight example: R commands: babies=read.table(" header=t) first=babies$bweight[babies$birthord==1] fifth= babies$bweight[babies$birthord==5] diff=fifth-first t.test(diff) R output : One Sample t-test data: diff t = 4.211, df = 199, p-value = 3.849e percent confidence interval: mean of x On average the fifth child weights grams more than the first A 95 % confidence interval is from grams to grams If we divide by four we get the average increase per child 8

9 Approach 3: Individual as fixed factor Fecal fat example: Two way ANOVA R commands and output: > anova(lm(fecfat~factor(pilltype)+factor(subject),data=fecfat)) Response: fecfat Df Sum Sq Mean Sq F value Pr(>F) factor(pilltype) ** factor(subject) *** Residuals Birth weight example: Two way ANOVA R commands and output: > babiesanova=lm(bweight~birthord+initage+factor(momid),data=babies) > anova(babiesanova) Response: bweight Df Sum Sq Mean Sq F value Pr(>F) birthord e-06 *** initage e-09 *** factor(momid) < 2.2e-16 *** Residuals

10 Approach 4: Random effects model A drawback of Approach 3 is that an effect is estimated for every individual. The interest, however, lies in how such effects will vary over a population. A useful approach for analysing repeated measures is to consider a random effects model We will describe the random effects model using the fecal fat example Here we consider the model: ij j i ij where Y = µ + α + B + ε Y is the fecal fat for patient i when using pill type j ij α j is the effect of pill type j (relative to type 1) 2 the Bi are the effects of patients, assumed independent N(0, σ subj) 2 the εij are random errors, assumed independent N(0, σε ) To fit a random effects model, we use the "nlme" library 10

11 R commands: library(nlme) fit.fecfat=lme(fecfat~factor(pilltype), random=~1 subject, data=fecfat) summary(fit.fecfat) anova(fit.fecfat) R output (edited): Linear mixed-effects model fit by fit by REML Random effects: s subj Formula: ~1 subject (Intercept) Residual StdDev: Fixed effects: fecfat ~ factor(pilltype) Value Std.Error DF t-value p-value (Intercept) factor(pilltype) factor(pilltype) factor(pilltype) s ε numdf dendf F-value p-value (Intercept) factor(pilltype)

12 Correlation within subjects Covariance for two measurements from the same patient ( ): Cov( Y, Y ) ij ik = Cov( µ + α + B + ε, µ + α + B + ε ) j i ij k i ik = Cov( B + ε, B + ε ) = Cov( B, B) = Var( ) i ij i ik Variance for a measurement: Var( Yij ) = Var( µ + α j + Bi + εij ) Correlation for two measurements from the same patient: i i B i j k 2 = σ subj = Var( Bi + εij ) 2 2 =Var( B ) + Var( ε ) = σ + σ i ij subj ε corr( Y, Y ) ij ik = Cov( Y, Y ) ij ik Var( Y ) Var( Y ) ij ik = σ σ 2 subj 2 2 subj + σε Estimate of correlation: =

13 We will then analyze the birth weight example using a random effects model We here consider the model: where Y = µ + β x + β x + B + ε ij 1 1ij 2 2ij i ij Y is the birth weight for the j-th baby of the i-th mother ij x = j is the birth order (parity) of the j-th baby of the i-th mother 1ij β 1 is the effect of increasing the birth order by one x2ij is the age of the i-th mother when she had her first baby β 2 is the effect of one year's incerase in the age of the mother 2 Bi N σ subj the are the effects of mothers, assumed independent (0, ) 2 the εij are random errors, assumed independent N(0, σε ) 13

14 R commands: fit.babies=lme(bweight~birthord+initage, random=~1 momid, data=babies) summary(fit.babies) R output (edited): Linear mixed-effects model fit by fit by REML Random effects: Formula: ~1 momid (Intercept) Residual StdDev: Fixed effects: bweight ~ birthord + initage Value Std.Error* DF t-value* p-value (Intercept) birthord initage Estimate of correlation for two babies by the same mother: = 0.39 *) Standard errors and t-values differ slightly form those on page 283 in the text book, since we here use REML estimation. 14

15 Longitudinal data and correlation structures A random effects model for longitudinal data assumes that the correlation between any two observations for the same individual is the same E.g. for the birth weight example a random effects model assumes the same correlation between the birth weights of the first and second child as for the first and fifth child In general for longitudinal data, where observations are taken consecutively over time, it may be the case that observations that are close to each other in time are more correlated than those further apart In order to fit a model for longitudinal data, we need to take into account the type of correlation between the observations 15

16 Assume that the observations for the i-th subject are Y, Y,..., Y i1 i2 im Common assumptions on the correlation structure of the are ( ): Y ij j k Exchangeable: corr( Y, Y ) ij ik = ρ Autoregressive: corr( Y, Y ) = Unstructured: corr( Y, Y ) = ρ Independence: corr( Y, Y ) = 0 ij ij ik ik k j ρ ij ik jk Note that a random effects model implies an exchangeable correlation structure We may use the "gee" library to fit models with these correlation structures (using a method called generalized estimating equations) 16

17 R commands: library(gee) summary(gee(bweight~birthord+initage,id=momid,data=babies,corstr="exchangeable")) summary(gee(bweight~birthord+initage,id=momid,data=babies,corstr="ar-m")) summary(gee(bweight~birthord+initage,id=momid,data=babies,corstr= unstructured")) summary(gee(bweight~birthord+initage,id=momid,data=babies,corstr="independence")) R output (edited): Estimate Naive S.E. Naive z Robust S.E. Robust z birthord initage birthord initage birthord initage birthord initage The naive SE and z are valid if the assumed correlation structure is true Inference should be based on the robust SE and z, since these are valid also if the assumed "working correlation" does not hold 17

18 For the example with birth weight and birth order we have the following relation between the birth weights first second The weight of the first baby is less correlated with the others. Otherwise the weights have about the same correlation. An exchangeable correlation structure is a reasonable "working assumption" third fourth fifth Correlations: first second third fourth fiah first second third fourth fiah

19 In conclusion the birth weight data may be analyzed using generalized estimating equations with an exchangeable correlation structure (slide 14 ) or by using a random effects model (slide 11) The two models give comparable results in this example However, if we extend the generalized estimating (GEE) approach and the random effects model to generalized linear models (like logistic regression), the results need not longer agree. The GEE approach can be used for all glms (distributional families, link functions). We will only consider extension of logistic regression to dependent data. 19

20 GEE and binary data (logistic model) Example: Birth weight data, but with an indicator of low birth weight (lowbrth), i.e. < 3000g. geefit<- gee(lowbrth~birthord+initage,id=momid,family=binomial,data=babies, corstr="unstructured") R output (edited): > summary(geefit) Model: Link: Logit Variance to Mean Relation: Binomial Correlation Structure: Unstructured Coefficients: Estimate Naive S.E. Naive z Robust S.E. Robust z (Intercept) birthord initage Negative coefficients for birth order and initial age indicate that low birth weigh is less likely with increasing order and age. 20

21 GEE and Birth weight data, contd. Even the homemade expcoef function from Lecture 7 works on gee-objects More R output (edited): > expcoef(geefit) expcoef lower upper (Intercept) birthord initage

22 More on dependent data: Time series data (not in Vittinghoff et al.) In this lecture we have so far considered data with many small independent groups of subjects (measurements) dependence within each group Time series data is a different dependent data structure with Y, Y, Y,... long sequences of correlated data only one (or maybe a few) such sequences Examples Temperature on consecutive days (weeks) Stock prices on consecutive days (weeks) Sunspot activity years

23 Autocorrelation function (ACF) The correlation between observation at time t, time t-k,, is given as The Yt k ˆ( ρ k) = corr( Y, Y ) t t k ˆ( ρ k), k = 0,1,2,... Y t, and its lag at is referred to as the autocorrelation function. For the sunspot data we have ACF with high positive correlations at 11 year cycles 23

24 Uncertainty limits for the ACF The dashed horizontal lines in the ACF plots lie at values ±1.96 / n In particular for the sunspot data with n=289 this becomes ±0.118 These limits correspond to the test in Lecture 2 for using test statistic t = r n 2 1 r 2 H : ρ = 0 0 Then correlations within the uncertainty limits are not significantly different from zero. 24

25 Example: Luteinizing hormone in blood samples at 10 min. intervals from a human female, 48 samples. The sunspot example is a bit unusual in the 11-year correlation pattern. The luteinizing hormone example may be more typical of a time-series in the sense only the first (few) correlations are significantly different from 0. 25

26 Autoregressive models AR(p) Another approach for analyzing the dependence in a time series is through autoregressive models where the current value Y t is regressed on previous p values Yt 1, Yt 2, Yt 3,..., Yt p of the series, i.e. through a model Y = ay + a Y + a Y + ε t 1 t 1 2 t 2 p t p t where the ak are regression coefficients and the εt independent error terms. This is called an AR(p) model. For the special case AR(1) we have Y = ay + ε t 1 t 1 t and the current value only depends on the previous value. This is referred to as a Markov property. 26

27 Simulated Tme series from an AR(1) model Yt = ay 1 t 1+ εt with a 1 = 0.1, 0.5, 0.9 and y_t y_t time time y_t y_t time time 27

28 R commands: par(mfrow=c(2,2)) Tme =seq(1,100,1) a = c(0.1, 0.5, 0.9,- 0.9) for(i in 1:4) { y_t<- arima.sim(n,model=list(ar=a[i]),sd=1) plot(tme, y_t, type="l", main=a[i]) } AR(p) is a special case of the more general ARIMA(p,d,q) models for Tme series. AR- I- MA stands for AutoRegressive Integrated Moving Average, and p is the order of the autoregressive process, d is the order of a difference operator (helps against nonstatonarity) and q is the order of the moving average process (where the Tme series relates to the q past values of the noise). The parameters in such models can be estmated via maximum likelihood estmaton! R command: arima(y,order=c(p,d,q)) # in its most simple version 28

29 true a1 = 0.5 true a1 = 0.9 y1_t y2_t time > arima(y1_t, order=c(1,0,0)) Call: arima(x = y1_t, order = c(1, 0, 0)) Coefficients: ar1 intercept s.e sigma^2 estmated as 1.101: log likelihood = , aic = time > arima(y2_t, order=c(1,0,0)) Call: arima(x = y2_t, order = c(1, 0, 0)) Coefficients: ar1 intercept s.e sigma^2 estmated as 1.105: log likelihood = , aic =

30 STK Time series STK Survival and Event History Analysis STK StaTsTcal Learning: Advanced Regression and ClassificaTon STK Machine learning and statstcal methods for predicton and classificaton STK Applied Bayesian Analysis and Numerical Methods STK StaTsTcal Model SelecTon 30

THE END Mandatory number 2: To be handed in Thursday April 6 th 2017 before 14.

31 THE END Mandatory number 2: To be handed in Thursday April 6 th 2017 before Feedback Tuesday 18 th and when actual, a second try to be handed in by Tuesday April 25 th 2017 before Writen 4 hours exam Friday May 12 th

Workshop 9.3a: Randomized block designs

Workshop 9.3a: Randomized block designs -1- Workshop 93a: Randomized block designs Murray Logan November 23, 16 Table of contents 1 Randomized Block (RCB) designs 1 2 Worked Examples 12 1 Randomized Block (RCB) designs 11 RCB design Simple Randomized