Subgroup analysis using regression modeling: multiple regression
Aeilko H Zwinderman
Who has an unusually large response? Is such an occurrence associated with subgroups of patients?
* such a question is hypothesis-generating: to refine patient- or dose-selection
* subgroup analyses are, by nature, almost surely underpowered => use a regression model
regression modeling may
* increase efficiency
* correct for confounding
* investigate interaction / synergism
* be used for prediction
regression models: many possibilities
* quantitative data: linear/nonlinear regression models
* discrete data: (probit) logistic regression
* censored data: Cox regression
general form:
$E[Y_i \mid X_i] = g^{-1}(\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki})$
$\mathrm{Var}[Y_i \mid X_i] = \sigma_e^2$
* Y is the dependent variable (primary efficacy variable)
* X is a covariate, predictor or independent variable
* g is the link function
* $\beta$ is a regression parameter (which must be estimated from the data)
linear model:
Y = quantitative variable
X = quantitative or discrete variable
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + e_i$
$\beta$ is a direct effect: difference in mean of Y if X changes 1 unit
assumptions:
a. linearity of the relation between Y and X
b. normality: Y is normally distributed for any given value of X
c. homogeneity: Y has the same variance for any given value of X
logistic model:
Y = binary variable (i.e. 1 or 0)
X = quantitative or discrete variable
$P(Y_i = 1 \mid X_i) = \dfrac{\exp(\beta_0 + \beta_1 X_{1i} + \dots + \beta_k X_{ki})}{1 + \exp(\beta_0 + \beta_1 X_{1i} + \dots + \beta_k X_{ki})}$
$\beta$ is a log odds-ratio: change in the log(odds) that Y=1 if X changes 1 unit
assumptions:
a. linearity of the relation between the log odds (of Y=1) and X
b. the link function $g^{-1}$ has the logistic form
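The log odds-ratio interpretation can be checked numerically. The sketch below assumes a logistic model with a single covariate; the values of `beta0` and `beta1` are hypothetical and chosen only for illustration.

```python
import math

def probability(beta0, beta1, x):
    """P(Y=1 | X=x) under a logistic model with one covariate."""
    eta = beta0 + beta1 * x
    return math.exp(eta) / (1.0 + math.exp(eta))

def log_odds(p):
    """Log odds that Y=1, given the probability p."""
    return math.log(p / (1.0 - p))

# Hypothetical parameter values (not taken from any data set).
beta0, beta1 = -1.0, 0.5

p_at_2 = probability(beta0, beta1, 2.0)
p_at_3 = probability(beta0, beta1, 3.0)

# Raising X by one unit changes the log odds by exactly beta1.
change_in_log_odds = log_odds(p_at_3) - log_odds(p_at_2)
odds_ratio = math.exp(change_in_log_odds)

print(f"P(Y=1|X=2) = {p_at_2:.3f}, P(Y=1|X=3) = {p_at_3:.3f}")
print(f"change in log odds = {change_in_log_odds:.3f}, odds ratio = {odds_ratio:.3f}")
```

The change in probability depends on where X starts, but the change in log odds does not: that is assumption a. above.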
Cox proportional hazards regression model:
Y(t) = binary status variable (i.e. 1 or 0) occurring at time t
X = quantitative or discrete variable
$S(t_i \mid X_i) = S_0(t_i)^{\exp(\beta_1 X_{1i} + \dots + \beta_k X_{ki})}$
$\beta$ is a log relative risk: change in the log(hazard(t)) if X changes 1 unit
assumptions:
a. linearity of the relation between the log hazard(t) and X
b. the relative risk is constant over time
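A minimal sketch of how the survival formula behaves, assuming one covariate. The baseline survival `s0_t` and the coefficient `beta` are hypothetical values picked for illustration; only the formula itself comes from the slide.

```python
import math

def survival(s0_t, betas, xs):
    """S(t | X) = S0(t) ** exp(beta_1*X_1 + ... + beta_k*X_k)."""
    linear_predictor = sum(b * x for b, x in zip(betas, xs))
    return s0_t ** math.exp(linear_predictor)

s0_t = 0.80    # hypothetical baseline survival S0(t) at one fixed time t
beta = [0.01]  # hypothetical log relative risk per unit of X

for x in (0, 20, 80):
    print(f"X={x:2d}: S(t|X) = {survival(s0_t, beta, [x]):.3f}")

# The relative risk (hazard ratio) for X=80 versus X=0 does not depend on t.
hazard_ratio_80_vs_0 = math.exp(beta[0] * 80)

# Equivalent check: the cumulative hazard -log S(t|X) is multiplied by that ratio.
cumulative_hazard_ratio = math.log(survival(s0_t, beta, [80])) / math.log(s0_t)
print(f"hazard ratio = {hazard_ratio_80_vs_0:.3f}, "
      f"cumulative hazard ratio = {cumulative_hazard_ratio:.3f}")
```

Whatever value of `s0_t` is plugged in, the two ratios agree, which is what "the relative risk is constant over time" means in practice.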
[Figure: four panels showing cumulative hazard, survival, log(hazard) and hazard against time (0 to 24), each for X=0, X=20 and X=80]
[Figure: survival and hazard against time for X=0, X=40 and X=80]
regression modeling to increase precision
* placebo (n=434) or pravastatin (n=438)
* two years treatment
* average LDL-decrease:
  # pravastatin: 1.23 (SD 0.68, se = $0.68/\sqrt{438}$)
  # placebo: -0.04 (SD 0.59, se = $0.59/\sqrt{434}$)
* efficacy: 1.23 - (-0.04) = 1.27, standard error = 0.043
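The standard error of 0.043 follows from the group summaries quoted above. A short check, assuming the two groups are independent so that the variances of the two means add:

```python
import math

# Summary statistics quoted on the slide (LDL decrease after two years).
n_prava, mean_prava, sd_prava = 438, 1.23, 0.68
n_plac, mean_plac, sd_plac = 434, -0.04, 0.59

# Standard error of each group mean: SD / sqrt(n).
se_prava = sd_prava / math.sqrt(n_prava)
se_plac = sd_plac / math.sqrt(n_plac)

# Efficacy is the difference in means; for independent groups
# the squared standard errors add.
efficacy = mean_prava - mean_plac
se_efficacy = math.sqrt(se_prava ** 2 + se_plac ** 2)

print(f"efficacy = {efficacy:.2f}, SE = {se_efficacy:.3f}")
```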
LDL-reduction: $Y_i = \beta_0 + \beta_1 X_{1i} + e_i$
$X_1$ = 1 if a patient receives pravastatin and 0 if he/she receives placebo
=> $\beta_1$ is the efficacy: 1.27 (SE = 0.043, which is a function of $\sigma_e^2$)
Suppose there is a covariate $X_2$ which is related to Y, but not to $X_1$:
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + e_i$
$\beta_1$ remains the same but $\sigma_e^2$ will be (much) smaller => SE($\beta_1$) will be smaller => increased precision
An example of a variable that might be related to Y but not to treatment is baseline LDL
* it is not related to treatment (randomized trial)
  placebo: 4.32 (SD 0.78), pravastatin: 4.29 (SD 0.78), p=0.60
* it is (almost surely) related to LDL-decrease: $\beta_2$ = 0.41 (SE 0.024, p<0.0001)
=> efficacy: $\beta_1$ = 1.27 (SE 0.037, was 0.043: 15% gain in efficiency)
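The gain in precision can be reproduced on simulated data. This is a sketch, not the trial analysis: the data are generated from the coefficients quoted above, the residual SD of 0.55 is an assumption, and the adjusted fit uses the Frisch-Waugh approach (remove the covariate from both treatment and outcome, then regress residual on residual), which gives the same treatment coefficient as the two-covariate regression.

```python
import math
import random

def residualize(x, y):
    """Residuals of y after a least-squares fit on one predictor x (with intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    return [yi - mean_y - slope * (xi - mean_x) for xi, yi in zip(x, y)]

def treatment_effect(treatment, outcome, covariate=None):
    """Treatment coefficient and its SE, optionally adjusted for one covariate."""
    n = len(outcome)
    if covariate is None:
        mean_t, mean_y = sum(treatment) / n, sum(outcome) / n
        t_res = [t - mean_t for t in treatment]
        y_res = [v - mean_y for v in outcome]
        df = n - 2  # intercept + treatment
    else:
        t_res = residualize(covariate, treatment)
        y_res = residualize(covariate, outcome)
        df = n - 3  # intercept + treatment + covariate
    stt = sum(t * t for t in t_res)
    beta1 = sum(t * v for t, v in zip(t_res, y_res)) / stt
    rss = sum((v - beta1 * t) ** 2 for t, v in zip(t_res, y_res))
    return beta1, math.sqrt(rss / df / stt)

# Simulated trial: group sizes and coefficients follow the slide,
# the residual SD (0.55) is an assumed value.
random.seed(438)
treatment = [1.0] * 438 + [0.0] * 434
baseline = [random.gauss(4.3, 0.78) for _ in treatment]
outcome = [-0.04 + 1.27 * t + 0.41 * (b - 4.3) + random.gauss(0, 0.55)
           for t, b in zip(treatment, baseline)]

beta_unadj, se_unadj = treatment_effect(treatment, outcome)
beta_adj, se_adj = treatment_effect(treatment, outcome, covariate=baseline)

print(f"unadjusted: beta1 = {beta_unadj:.3f} (SE {se_unadj:.3f})")
print(f"adjusted:   beta1 = {beta_adj:.3f} (SE {se_adj:.3f})")
```

Because baseline LDL is generated independently of treatment, the two estimates of the efficacy should be close, while the adjusted standard error is smaller.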
[Figure: LDL decrease (-3 to 4) plotted against baseline LDL (2 to 7)]
* usually there are many candidate covariates to consider: specify in the protocol which ones will be used
* in non-linear regression models $\beta_1$ always changes when covariates are included, so its interpretation changes (often not by much, but it can be greatly inflated)
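Why the treatment coefficient changes in a non-linear model, even when the covariate is perfectly balanced, can be shown with a small logistic calculation. The numbers are hypothetical: two equally large covariate strata with control-group risks of 20% and 80%, and a treatment log odds-ratio of 1 within each stratum.

```python
import math

def expit(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical set-up: the covariate splits patients into two equally
# large strata, balanced over treatment arms (as in a randomized trial).
conditional_log_or = 1.0  # treatment effect within each stratum
baseline_logits = [math.log(0.2 / 0.8), math.log(0.8 / 0.2)]

# Risk in each arm when the covariate is ignored: average over the strata.
p_control = sum(expit(b) for b in baseline_logits) / 2
p_treated = sum(expit(b + conditional_log_or) for b in baseline_logits) / 2

marginal_log_or = (math.log(p_treated / (1 - p_treated))
                   - math.log(p_control / (1 - p_control)))

print(f"log odds-ratio adjusted for the covariate:   {conditional_log_or:.3f}")
print(f"log odds-ratio ignoring the covariate:       {marginal_log_or:.3f}")
```

The covariate is not a confounder here, yet including it moves the coefficient away from the unadjusted value. Both numbers are valid, but they answer different questions (effect within a stratum versus effect averaged over the population).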
regression modeling to correct for confounding
* a confounder is a covariate Z that is associated with both Y and $X_1$
* it distorts the interpretation of the efficacy estimate $\beta_1$
* what is thought to be efficacy may just reflect the imbalance of Z between treatment groups
* will not happen often in randomized trials
* will happen almost always in non-randomized research
* when it happens, adjustment of $\beta_1$ is required:
$Y_i = \beta_0 + \beta^*_1 X_{1i} + \beta_2 Z_i + e_i$
if $r_{xz} > 0$ and $r_{yz} > 0$ then $\beta^*_1 < \beta_1$
if $r_{xz} > 0$ and $r_{yz} < 0$ then $\beta^*_1 > \beta_1$
if $r_{xz} < 0$ and $r_{yz} > 0$ then $\beta^*_1 > \beta_1$
if $r_{xz} < 0$ and $r_{yz} < 0$ then $\beta^*_1 < \beta_1$
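The first of the four cases (both correlations positive) can be illustrated with a simulated non-randomized study. Everything in this sketch is an assumption made for illustration: the true treatment effect of 1.0, the confounder effect of 0.8, and the rule by which patients with high Z are more likely to be treated. The adjusted coefficient is computed from the closed-form least-squares solution for two predictors.

```python
import random

def cov(a, b):
    """Sample covariance of two equally long lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)

random.seed(1)
n = 5000
z = [random.gauss(0, 1) for _ in range(n)]
# Non-randomized assignment: high Z makes treatment more likely (r_xz > 0).
x = [1.0 if zi + random.gauss(0, 1) > 0 else 0.0 for zi in z]
# Outcome: assumed true treatment effect 1.0, and Z raises Y (r_yz > 0).
y = [1.0 * xi + 0.8 * zi + random.gauss(0, 1) for xi, zi in zip(x, z)]

# Unadjusted coefficient: simple regression of Y on X.
beta1_unadjusted = cov(x, y) / cov(x, x)

# Adjusted coefficient: least-squares solution with X and Z in the model.
beta1_adjusted = ((cov(x, y) * cov(z, z) - cov(z, y) * cov(x, z))
                  / (cov(x, x) * cov(z, z) - cov(x, z) ** 2))

print(f"cov(x,z) = {cov(x, z):.3f}, cov(y,z) = {cov(y, z):.3f}")
print(f"unadjusted beta1 = {beta1_unadjusted:.3f}, adjusted beta1* = {beta1_adjusted:.3f}")
```

With both associations positive the unadjusted estimate overstates the effect, and adjustment pulls it back towards the value used to generate the data.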
X -> Y: direct effect of treatment X on outcome Y: no need for regression modeling
X <- Z -> Y: the effect of treatment X on outcome Y is confounded by Z: a regression model may correct for this
X -> Z -> Y: the effect of treatment X on outcome Y is partly through Z: Z is an intermediate, not a confounder. Do not use regression modeling: in the regression model the effect of X is split between a direct and an indirect effect.
* check only the necessary (known) confounders
* beware of multiple testing
interaction/synergism: looking for subgroups with different efficacy
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i} + e_i$
Suppose $X_2$ = 0 or 1:
$X_2 = 1$: $Y_i = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) X_{1i} + e_i$
$X_2 = 0$: $Y_i = \beta_0 + \beta_1 X_{1i} + e_i$
Primary question: $H_0: \beta_3 = 0$
Example: is there interaction between statins and CCBs?
Y = change of diameter of coronary vessels during statin/placebo treatment

           no CCB          CCB
placebo    0.097 (0.20)    0.130 (0.22)
statin     0.088 (0.19)    0.035 (0.19)
[Figure: diameter decrease (0 to 0.2) for placebo and statin, in patients without and with CCB]
Efficacy:
no CCB: $\beta_1$ = 0.011 (the rounded means give 0.097 - 0.088 = 0.009)
CCB: $\beta_1 + \beta_3$ = 0.130 - 0.035 = 0.095
$\beta_3$ = 0.095 - 0.011 = 0.084, p=0.011
Thus, statins are significantly more effective in patients who were also prescribed CCBs.
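With two binary factors the interaction model has one parameter per group mean, so the efficacy in each subgroup and the interaction term can be read directly from the four means quoted above. A sketch of that arithmetic follows, with efficacy defined as placebo minus statin (less diameter decrease is better). Computed from the rounded means, the results (0.009 and 0.086) differ slightly from the reported 0.011 and 0.084, presumably because the reported values were derived from unrounded data; that explanation is an assumption.

```python
# Mean diameter decrease per group, as quoted in the example.
mean_decrease = {
    ("placebo", "no CCB"): 0.097,
    ("placebo", "CCB"): 0.130,
    ("statin", "no CCB"): 0.088,
    ("statin", "CCB"): 0.035,
}

# Efficacy within each subgroup: placebo minus statin.
efficacy_no_ccb = mean_decrease[("placebo", "no CCB")] - mean_decrease[("statin", "no CCB")]
efficacy_ccb = mean_decrease[("placebo", "CCB")] - mean_decrease[("statin", "CCB")]

# Interaction: how much the efficacy differs between the two subgroups.
interaction = efficacy_ccb - efficacy_no_ccb

print(f"efficacy without CCB: {efficacy_no_ccb:.3f}")
print(f"efficacy with CCB:    {efficacy_ccb:.3f}")
print(f"interaction:          {interaction:.3f}")
```

The p-value for the interaction cannot be recovered from the means alone; it needs the patient-level data or at least the group sizes, which are not given here.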
Fellstrom et al. Rosuvastatin and cardiovascular events in patients undergoing hemodialysis. NEJM, 2009.
* be careful when investigating interactions: multiple testing problem
* do not enter too many covariates in a regression model: (k < n/10)
good models
* check assumptions
* use selection algorithms sparingly (instead use penalized methods, which shrink the regression weights)
* caution against optimistic results: (cross-)validation
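The optimism that cross-validation guards against can be seen in a very small example. The sketch below uses simulated data (sample size, slope and noise level are all assumed values) and leave-one-out cross-validation of a simple linear regression: the error measured on the data used for fitting (the apparent error) is compared with the error on patients left out of the fit.

```python
import random

def fit_line(x, y):
    """Least-squares intercept and slope of y on a single predictor x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Simulated data with a weak true relation (all values are assumptions).
random.seed(2009)
n = 30
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.3 * xi + random.gauss(0, 1) for xi in x]

# Apparent error: model fitted and evaluated on the same patients.
intercept, slope = fit_line(x, y)
apparent_mse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y)) / n

# Leave-one-out cross-validation: each patient is predicted by a model
# that was fitted without that patient.
squared_errors = []
for i in range(n):
    a, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    squared_errors.append((y[i] - a - b * x[i]) ** 2)
cv_mse = sum(squared_errors) / n

print(f"apparent MSE = {apparent_mse:.3f}, cross-validated MSE = {cv_mse:.3f}")
```

For least-squares fits the leave-one-out error is never smaller than the apparent error, and the gap widens as more covariates are added relative to the sample size.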