Longitudinal Modeling with Logistic Regression

Similar documents
Simple logistic regression

Review. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis

Trends in Human Development Index of European Union

Three-Way Contingency Tables

GEE for Longitudinal Data - Chapter 8

Ronald Heck Week 14 1 EDEP 768E: Seminar in Categorical Data Modeling (F2012) Nov. 17, 2012

Multinomial Logistic Regression Models

Class Notes. Examining Repeated Measures Data on Individuals

SAS Analysis Examples Replication C8. * SAS Analysis Examples Replication for ASDA 2nd Edition * Berglund April 2017 * Chapter 8 ;

Binary Logistic Regression

Contrasting Marginal and Mixed Effects Models Recall: two approaches to handling dependence in Generalized Linear Models:

ST3241 Categorical Data Analysis I Multicategory Logit Models. Logit Models For Nominal Responses

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data

Statistical Distribution Assumptions of General Linear Models

WU Weiterbildung. Linear Mixed Models

More Accurately Analyze Complex Relationships

Advanced Quantitative Data Analysis

EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7

Generalized Linear. Mixed Models. Methods and Applications. Modern Concepts, Walter W. Stroup. Texts in Statistical Science.

Hierarchical Generalized Linear Models. ERSH 8990 REMS Seminar on HLM Last Lecture!

Testing Independence

Stat 642, Lecture notes for 04/12/05 96

Cohen s s Kappa and Log-linear Models

Generalized Linear Models for Non-Normal Data

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model

Introducing Generalized Linear Models: Logistic Regression

You can specify the response in the form of a single variable or in the form of a ratio of two variables denoted events/trials.

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model

Model Estimation Example

Turning a research question into a statistical question.

Survival Regression Models

Repeated ordinal measurements: a generalised estimating equation approach

multilevel modeling: concepts, applications and interpretations

Ron Heck, Fall Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October 20, 2011)

Utilizing Hierarchical Linear Modeling in Evaluation: Concepts and Applications

Lecture 25: Models for Matched Pairs

7. Assumes that there is little or no multicollinearity (however, SPSS will not assess this in the [binary] Logistic Regression procedure).

COMPLEMENTARY LOG-LOG MODEL

Formula for the t-test

Designing Multilevel Models Using SPSS 11.5 Mixed Model. John Painter, Ph.D.

Model Based Statistics in Biology. Part V. The Generalized Linear Model. Chapter 18.1 Logistic Regression (Dose - Response)

36-309/749 Experimental Design for Behavioral and Social Sciences. Dec 1, 2015 Lecture 11: Mixed Models (HLMs)

Semiparametric Generalized Linear Models

Correlation and regression

Lecture 3.1 Basic Logistic LDA

Section 9c. Propensity scores. Controlling for bias & confounding in observational studies

CHAPTER 9 EXAMPLES: MULTILEVEL MODELING WITH COMPLEX SURVEY DATA

STA 303 H1S / 1002 HS Winter 2011 Test March 7, ab 1cde 2abcde 2fghij 3

LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R. Liang (Sally) Shan Nov. 4, 2014

UNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator

Review of the General Linear Model

The impact of covariance misspecification in multivariate Gaussian mixtures on estimation and inference

Hypothesis Testing for Var-Cov Components

Longitudinal Data Analysis Using SAS Paul D. Allison, Ph.D. Upcoming Seminar: October 13-14, 2017, Boston, Massachusetts

UNIVERSITY OF MASSACHUSETTS Department of Mathematics and Statistics Applied Statistics Friday, January 15, 2016

Review of CLDP 944: Multilevel Models for Longitudinal Data

Logistic Regression. Interpretation of linear regression. Other types of outcomes. 0-1 response variable: Wound infection. Usual linear regression

Overdispersion Workshop in generalized linear models Uppsala, June 11-12, Outline. Overdispersion

" M A #M B. Standard deviation of the population (Greek lowercase letter sigma) σ 2

Lecture 12: Effect modification, and confounding in logistic regression

Lab 11. Multilevel Models. Description of Data

Mixed Models for Longitudinal Ordinal and Nominal Outcomes

Chapter 11: Analysis of matched pairs

Multilevel Models in Matrix Form. Lecture 7 July 27, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2

STA6938-Logistic Regression Model

Random Coefficient Model (a.k.a. multilevel model) (Adapted from UCLA Statistical Computing Seminars)

Generalized Models: Part 1

Chapter 11: Models for Matched Pairs

Longitudinal Data Analysis. Michael L. Berbaum Institute for Health Research and Policy University of Illinois at Chicago

CHL 5225 H Crossover Trials. CHL 5225 H Crossover Trials

A Re-Introduction to General Linear Models (GLM)

Modeling the Mean: Response Profiles v. Parametric Curves

Count data page 1. Count data. 1. Estimating, testing proportions

MIXED MODELS FOR REPEATED (LONGITUDINAL) DATA PART 2 DAVID C. HOWELL 4/1/2010

An Introduction to Multilevel Models. PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 25: December 7, 2012

Multilevel Statistical Models: 3 rd edition, 2003 Contents

Multilevel Analysis of Grouped and Longitudinal Data

STAT 705 Generalized linear mixed models

LOGISTIC REGRESSION Joseph M. Hilbe

Hypothesis testing, part 2. With some material from Howard Seltman, Blase Ur, Bilge Mutlu, Vibha Sazawal

Introduction to Generalized Models

Computationally Efficient Estimation of Multilevel High-Dimensional Latent Variable Models

Ron Heck, Fall Week 3: Notes Building a Two-Level Model

Robust covariance estimator for small-sample adjustment in the generalized estimating equations: A simulation study

Reader s Guide ANALYSIS OF VARIANCE. Quantitative and Qualitative Research, Debate About Secondary Analysis of Qualitative Data BASIC STATISTICS

H-LIKELIHOOD ESTIMATION METHOOD FOR VARYING CLUSTERED BINARY MIXED EFFECTS MODEL

Models for binary data

Generalized linear models

Mixed Models for Longitudinal Binary Outcomes. Don Hedeker Department of Public Health Sciences University of Chicago.

Time-Invariant Predictors in Longitudinal Models

Model Assumptions; Predicting Heterogeneity of Variance

Section IX. Introduction to Logistic Regression for binary outcomes. Poisson regression

Calculating Effect-Sizes. David B. Wilson, PhD George Mason University

Q30b Moyale Observed counts. The FREQ Procedure. Table 1 of type by response. Controlling for site=moyale. Improved (1+2) Same (3) Group only

ADVANCED STATISTICAL ANALYSIS OF EPIDEMIOLOGICAL STUDIES. Cox s regression analysis Time dependent explanatory variables

Simple, Marginal, and Interaction Effects in General Linear Models: Part 1

Research Design: Topic 18 Hierarchical Linear Modeling (Measures within Persons) 2010 R.C. Gardner, Ph.d.

Sample Size and Power Considerations for Longitudinal Studies

Review: what is a linear model. Y = β 0 + β 1 X 1 + β 2 X 2 + A model of the following form:

Analysis of Categorical Data. Nick Jackson University of Southern California Department of Psychology 10/11/2013

Transcription:

Newsom 1 Longitudinal Modeling with Logistic Regression Longitudinal designs involve repeated measurements of the same individuals over time There are two general classes of analyses that correspond to two definitions of change, the absolute level definition and the relative level definition Absolute Level Definition of Stability and Change One type of question involves whether the scores on a variable increase or decrease over time To answer this question, we can subtract one score from another, usually Y 2-1 = Y 2 Y 1, where the score at the first time point is subtracted from the score at the second time point Referred to as difference scores, gain scores, or sometimes changes scores The mean of the difference score indicates an average increase if it is positive and an average decrease if it is negative A variable is considered stable on average to the extent the average difference is 0, and the degree of change is measured by the extent to which the average difference is greater or less than 0 The average of the differences equals the differences of the averages, E( Y2 Y1) = Y2 Y1 This is true whether variables are continuous or binary, but with binary variables, the difference score has only three possible values, -1, 0, and +1 The downside to this definition of change is that regression toward the mean might be mistaken for changes over time and the change scores are sensitive to any random or temporary fluctuations in the values For a continuous variable, the statistical test for whether there is a significant difference on average, we would use a dependent-samples (paired) t-test or within-subjects ANOVA The within-subjects ANOVA extends the difference test to three or more time points For binary variables, where the mean is a proportion when using 0,1 coding the McNemar s (or Cochran s Q for more than two time points) test whether there is a significant difference between P(Y 2 = 1) P(Y 1 = 1) is greater than zero As we saw in earlier, this is the same as determining whether p 1+ = p +1 (and, equivalently that p 12 = p 21) 1 A loglinear analysis could also be used for this comparison For any analysis that is based on a log transformation, however, the concept of absolute change is further complicated When a log transformation is involved, the average difference between two binary variables and the difference in the averages (marginal proportions) are no longer equivalent Mathematically, this is because the difference between two logged averages does not equal the average of the log of the differences between two scores ( Y2 Y1) ( Y2) ( Y1) ln ln ln The difference on the left-hand side of the inequality, can be referred to as a conditional or subjectspecific value, whereas the difference on the right-hand side of the inequality above can be referred to as a marginal, unconditional, or population-averaged value 2 Relative Level Definition of Stability and Change An alternative way to think about stability and change is in terms of the correlation of a variable with itself over time Stability is defined in this way by the size of the correlation between Y 2 and Y 1, with r 12 = 10 indicating a perfectly stable variable There is some change if the correlation is less than perfect This definition is distinct from the absolute definition of change, because the correlation between the two variables can be perfect even if there is a large increase of decrease from Time 1 to Time 2 Adding 5, 10, or even 100 points to all of the Y 2 scores would not change the correlation This is because the correlation coefficient is more a function of the correspondence in the relative position of the scores in the sample than a metric of how well the scores match in absolute terms The focus under this definition 1 Refer to the Matched Pairs Analysis handout for more on this equivalence 2 2 Recall that from the quotient rule of logs, that ln ( Y2) ln ( Y1) ln Y =, instead Y1

Newsom 2 of stability and change has to do with the strength of the relationship over time, and it does not provide information about whether the scores increase or decrease over time Longitudinal analyses involve a correlation between Y 1 and Y 2 or autoregression with Y 1 predicting Y 2 The residual in an autoregression represents the degree to which the variable is not stable over time or changing in the relative level sense Modeling Absolute Level Change: Conditional Logistic Regression and GEE In addition to the McNemar test change in a binary variable over two or more time points can be assessed with a conditional logistic regression model The conditional logistic model (Cox, 1958) models examine change over time using time as a predictor of Y t, measured repeatedly at multiple time points ( ) logit PYit = 1 = αi + βxt This is a typical logistic regression, except for two things The dependent variable, Y it, has values for each person and each time point, indicated by the subscript ij, and the intercept α i has a different value for each person x t is some arbitrary time code, usually beginning 0,1, for however many time points are available The term conditional logistic is used because the regression is conditioned on subject To conduct the analysis, data must be transformed into a long (or person time) format If there are 100 cases measured at two time points, the data set will have 200 records i t Y ij 1 1 0 1 2 1 2 1 1 2 2 1 99 1 1 99 2 0 100 1 0 100 2 1 In general, the data set has n T records A standard logistic regression procedure can be used to estimate the conditional logistic model if the analysis can be stratified by subject (eg, Proc Logistic in SAS) or survival analysis procedures can be used The regression coefficient represents the change in the logit for each unit change in x t, with positive values indicating an increase over time and negative values indicating a decrease over time The odds ratio is the odds increase or decrease with each unit increase in x t For two time points, it s the odds that Y = 1 increase from the first time point to the second Or, β ln n n 12 21 = And notice that these are the same two cells, the discordant cells, that are compared in McNemar s test The distinction is that the conditional logistic represents a test of the subject-specific effect and the McNemar s chi-squared represents a test of the marginal or population-averaged effect The intercept α i gives the logit value of Y when X equals 0, or the first time point The logistic transformation can be used to obtain the estimate of the probability that Y = 1 at the first time point, where αi e = = 1 + e ( 1) PY it αi Generalized estimating equations (GEE; Liang & Zeger, 1986) approach to change over time works much the same way The GEE model accounts for correlated observations, and, for longitudinal data, it takes into account the dependence that occurs with multiple observations per person The analysis uses quasi-likelihood estimation rather than the usual Newton-Raphson maximum likelihood estimation of

Newsom 3 logistic regression And it couples the estimation with robust standard errors (Huber-White or Sandwich estimator; Huber, 1967; White, 1980) to account for the correlated residuals The GEE approach is a population-averaged model, however, where the model is not conditional on subject ( ) logit PYit = 1 = α + βxt The GEE modeling approach is very popular and performs well statistically with longitudinal or clustered observations and can be used with missing data The GEE model is not as flexible as a multilevel regression approach (eg, Raudenbush & Bryk, 2002), also known as hierarchical linear modeling, random coefficient models, or growth curve analysis when applied to longitudinal data A variant of growth curve models when estimated with structural equation modeling software are known as latent growth curve models Growth curve models have several advantages over GEE models, including estimation of intercept and slope random effects that provide information about individual variation in change over time For binary outcomes, multilevel models can provide population-averaged as well as subject-specific estimates Modeling Relative Level Change: The Lagged Regression Approach The methods discussed above conditional logistic, GEE, and multilevel models estimate level changes over time That is, they answer the questions about whether the observed variable Y it increases from one time-point to the next And when covariates are used to account for this change, the model describes who increase or decreases over time Such models do not address questions about whether X may be a cause of Y or Y a cause of X When experimental data are not available to investigate this question, longitudinal studies have advantages over cross-sectional studies for addressing this question The lagged regression model uses an independent variable measured at Time 1 to predict values at Time 2 controlling for the dependent variable measured at Time 1 Y 1 Y 2 X 1 The analysis involves the relative level definition of change, because the path from Y 1 to Y 2 represents stability in the correlational sense The remaining variance in Y 2 not accounted for by Y 1 is what changes (again in the relative correlation sense) The model can be interpreted in two ways, as prediction of Y 2 while controlling for any preexisting differences in the dependent variable at Time 1 or and a prediction of change in Y 2 A special case of the model in ordinary least squares regression in which X 1 is binary, is an analysis of covariance (ANCOVA), and so the lagged regression model is sometimes referred to as the ANCOVA approach This modeling approach has the advantage of accounting for regression toward the mean The results from the conditional logistic and the lagged regression model will not always lead to the same conclusion (known as Lord s paradox), but that is because they address different questions about change Software Examples Below I use SAS to illustrate several simple longitudinal models using the depression data from the LLSSE (a new data, dep2sav, set with Time 1 predictors was created) The outcome is the binary depression variable from the 9-item CES-D using the recommended clinical cutoff The lagged regression example is easily replicated in SPSS (logistic regression) or R (glm) Lagged logistic regression example with Time 1 negative social exchanges predicting depression depression at Time 2, controlling for Time 1 depression: proc logistic data=one ; model w2dep = w1dep w1neg;

Newsom 4 Testing Global Null Hypothesis: BETA=0 Test Chi-Square DF Pr > ChiSq Likelihood Ratio 949888 2 <0001 Score 910140 2 <0001 Wald 814675 2 <0001 The LOGISTIC Procedure Analysis of Maximum Likelihood Estimates Standard Wald Parameter DF Estimate Error Chi-Square Pr > ChiSq Intercept 1-00665 01303 02607 06096 w1dep 1 16401 02005 668942 <0001 w1neg 1 04519 02014 50333 00249 Odds Ratio Estimates Point 95% Wald Effect Estimate Confidence Limits w1dep 5156 3480 7638 w1neg 1571 1059 2332 Sample Write-Up Clinical depression measured at the six-month follow-up was regressed on negative social exchanges and clinical depression measured at baseline Both baseline predictors significantly predicted clinical depression at follow-up As expected, those who were depressed at baseline were more likely to be depressed at follow-up, b = 164, SE = 13, OR = 516, p < 001 Of greater interest was that those who reported more frequent negative social exchanges were more likely to be depressed at follow-up after controlling for initial depression, b = 45, SE = 20, OR = 157, p < 05, suggesting that negative social exchanges at baseline were predictive of an increase in clinical depression over the six-month interval Questions about absolute level change (difference of binary variable with no predictor) can be tested with the McNemar s test and conditional logistic Conditional logistic requires configuration of the data to a long (person period) data set In SAS, ID is used as a strata to allow the intercept to vary by case, α i A survival analysis approach is an alternative method for obtaining a conditional regression, which would be needed in SPSS (coxreg) and R (clogit in the survival package) I only illustrate the conditional logistic here for didactic reasons, so don t give code for either of these models I recommend growth curve models (or possibly GEE if there are only two time points) instead of conditional logistic for this general approach /* mcnemar's test */ proc freq; tables w1dep*w2dep /agree; w1dep(time 1 depression) w2dep(time 2 depression) Frequency Percent Row Pct Col Pct not depr depresse Total essed d not depressed 146 155 301 2243 2381 4624 4850 5150 7565 3384 depressed 47 303 350 722 4654 5376 1343 8657 2435 6616 Total 193 458 651 2965 7035 10000

Newsom 5 McNemar's Test ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ Statistic (S) 577426 DF 1 Pr > S <0001 /*reconfigure data set to be long format */ data one; set one; id = _N_; DATA long ; SET one ; t = 0 ; dep = w1dep ; OUTPUT ; t = 1 ; dep = w2dep; OUTPUT ; keep id t dep ; RUN; /*conditional logistic regression */ proc logistic data=long order=data descending; strata id; MODEL dep=t ; The LOGISTIC Procedure Conditional Analysis Testing Global Null Hypothesis: BETA=0 Test Chi-Square DF Pr > ChiSq Likelihood Ratio 608670 1 <0001 Score 577426 1 <0001 Wald 513524 1 <0001 Analysis of Conditional Maximum Likelihood Estimates Standard Wald Parameter DF Estimate Error Chi-Square Pr > ChiSq t 1-11933 01665 513524 <0001 Odds Ratio Estimates Point 95% Wald Effect Estimate Confidence Limits t 0303 0219 0420 Notice that the score test from the conditional logistic matches the McNemar s test Both are a test of the population-averaged (marginal) difference The Wald test from the conditional logistic does not match either, because it is a test of the subject-specific (conditional) effect References and Further Readings Cox, D R (1958) Two further applications of a model for binary regression Biometrika, 45(3/4), 562-565Hu, F B, Goldberg, J, Hedeker, D, Flay, B R, & Pentz, M A (1998) Comparison of population-averaged and subject-specific approaches for analyzing repeated binary outcomes American Journal of Epidemiology, 147, 694-703 Hanley, J A, Negassa, A, & Forrester, J E (2003) Statistical analysis of correlated data using generalized estimating equations: an orientation American journal of epidemiology, 157(4), 364-375 Huber, Peter J (1967) The behavior of maximum likelihood estimates under nonstandard conditions Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 221 233 Liang, K Y, & Zeger, S L (1986) Longitudinal data analysis using generalized linear models Biometrika, 73(1), 13-22 Newsom (2012) Basic Longitudinal Analysis Approaches for Continuous and Categorical Variables In Newsom, JT, Jones, RN, & Hofer, SM (Eds) (2012) Longitudinal Data Analysis: A Practical Guide for Researchers in Aging, Health, and Social Science New York: Routledge (pp 168-170) Raudenbush, S W, & Bryk, A S (2002) Hierarchical linear models: Applications and data analysis methods (Vol 1) Thousand Oaks, CA: Sage White, H (1980) A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity Econometrica, 48, 817 838