Statistical Questions: Classification Modeling Complex Dose-Response


Biost 536 / Epi 536: Categorical Data Analysis in Epidemiology
Scott S. Emerson, M.D., Ph.D.
Professor of Biostatistics, University of Washington
Lecture 7: Modeling of Nonlinear Associations
October 23, 2014
Categorical Data Analysis, AUT

Lecture Outline
- Modeling nonlinear associations (complex "dose-response")
- Flexible methods

Modeling Complex Dose-response: Statistical Questions

Statistical Questions: Classification
1. Clustering of observations
   - Perhaps into groups that might be different diseases
2. Clustering of variables
   - Perhaps into groups representing biochemical pathways
3. Quantification of distributions
   - Perhaps reporting mean life expectancy after diagnosis
4. Comparing distributions
   - Perhaps investigating associations between variables
5. Prediction of individual observations
   - Perhaps diagnosing disease or estimating kidney function

4. Investigating Associations
Our scientific questions can be at many different levels of detail:
1. Is there an association?
2. What is the general (first order) trend in Y with higher X?
3. Is there a nonlinear trend in the association?
4. Is the general trend a particular shape?
   - Increasing exponentially? Increasing to a threshold? Constant then decreasing? U-shaped? S-shaped?
5. What is the association at particular levels of X?
   - E.g., what is the difference in odds of mortality between subjects with LDL of 160 and 161 mg/dl?
Any of these questions can be about associations independent of other mechanisms (i.e., adjusted for potential confounding).

Transformation of Predictor of Interest
We choose the exact form of the modeled predictor of interest (POI) in order to answer our scientific question accurately and precisely.
Important issues:
- Ensuring that the question is appropriately reflected by some regression parameter(s)
- Striving to have the greatest precision to make inference about those parameter(s)
- Trying to avoid overly influential observations
- Ensuring that we do not allow multiple comparisons to inflate the type 1 error

Invalid Reasons for Transformations
A commonly stated reason for transforming a predictor is: "We transformed the predictor in order that it would appear more normally distributed."
This reasoning is WRONG.
- There is nothing in our statistical theory that ever demands that a predictor in a regression model be normally distributed
- With just a little thought, this should be obvious to all:
  - The most straightforward of all statistical investigations of associations is the two-sample problem
  - In such a problem, our POI is far from normal: it is binary (discrete)

Valid Reasons for Transformation 1
Modeling our question
- We ensure that our full model is flexible enough to reflect our alternative hypothesis
- We ensure that our null hypothesis can be represented by constraining one or more of our parameters to some set value
  - Most often constrained to 0 (equivalent to removing the term)
- Examples:
  - We use a linear (untransformed) predictor to detect a first order trend
  - We include both a linear and a squared term to detect nonlinearity
    - A linear relationship would not need the squared term
  - We fit linear splines to detect U-shaped trends
    - A U-shaped trend would require that the slopes for lower values of X are of opposite sign to the slopes for higher values of X

Valid Reasons for Transformation 2
Increasing precision
- The more accurately we can model the true data generating mechanism, the greater precision we will have
  - This avoids mixing pure error (the random variation in the distribution) and systematic error (the difference between the fitted model and the true distribution)
- Examples:
  - Anticipating that either too high or too low Vitamin A is harmful, we fit a model that can capture a U-shaped trend
    - However, this does make it difficult to test for direction of trend
  - Anticipating that risk of death from prostate cancer is more closely related to multiplicative increases, we model log(PSA)
    - Equal increase in risk with every doubling of serum PSA

Loss of Precision from Lack of Fit

Valid Reasons for Transformation 3
Avoiding overly influential observations
- Any observation having a predictor value that is distant from the rest of the data has the potential to
  - Greatly influence the estimated parameters ("influential")
  - Greatly influence the statistical significance ("highly leveraged")
- It is not uncommon that this criterion is especially important when
  - The true effect of the predictor is based on multiplicative differences (log transformations of the predictor)
  - Measurement error is greatest on the largest measurements (perhaps use the variance stabilizing transformation log(x))

Methodologic Approach
- We must avoid allowing multiple comparisons to inflate the type 1 error
  - Generally, our null hypothesis is common to all transformations that we might consider
  - Looking for the best transformation is thus testing the null hypothesis repeatedly
- We must prespecify the model we will fit
  - Luckily, we should know our question
  - If we do not know the data generation mechanism, we should make a reasonable guess at a suitably flexible model
- Science is incremental: answer first questions first
  - There is always room for additional exploratory descriptive analyses that might generate the next hypothesis

Linear Predictors
- The most commonly used regression models use linear predictors
  - "Linear" refers to linear in the parameters
- The modeled predictors can be transformations of the scientific measurements
- Examples:
  $g(\theta_{Y|X_i}) = \beta_0 + \beta_1 \log(X_i)$
  $g(\theta_{Y|X_i}) = \beta_0 + \beta_1 X_i + \beta_2 X_i^2$

Modeling Complex Dose-response: Transformations

Transformations of Predictors
- We transform predictors to answer scientific questions aimed at detecting nonlinear relationships
  - E.g., is the association between all cause mortality and LDL in elderly adults nonlinear?
  - E.g., is the association between all cause mortality and LDL in elderly adults U-shaped?
- We transform predictors to provide a more flexible description of complex associations between the response and some scientific measure (especially confounders, but also precision variables and the POI)
  - Threshold effects
  - Exponentially increasing effects
  - U-shaped functions
  - S-shaped functions
  - etc.

General Applicability
- The issues related to transformations of predictors are similar across all types of regression with linear predictors
  - Linear regression
  - Logistic regression
  - Poisson regression
  - Proportional hazards regression
  - Accelerated failure time regression
- However, it is easiest to use descriptive statistics to illustrate the issues in linear regression
- In other forms of regression we can display differences between fitted values, but display of the original data is more difficult
  - Binary data
  - Censored data
  - Models that use a log link

Ex: Cubic Relationship
FEV vs Height in Children
[Figure: FEV (l/sec) vs. Height (in.)]

Ex: Threshold Effect of Dose?
RCT of beta carotene supplementation: 4 doses plus placebo
[Figures: Plasma Beta-carotene at 3 months by Dose; Plasma Beta-carotene at 9 months by Dose]

Ex: U-shaped Trend?
Inflammatory marker vs cholesterol (lowess smoother, bandwidth = .8)
[Figure: C-reactive protein vs. Cholesterol (mg/dl)]

Ex: S-shaped Trend
In vitro cytotoxic effect of Doxorubicin with chemosensitizers
[Figure: Cell Count vs. Concentration of Doxorubicin; curves: DOX only, DOX + Verapamil, DOX + Cyclosporine A]

Lecture 7: Modeling of Nonlinear Associations (October 23, 2014)

1:1 Transformations
- Sometimes we transform 1 scientific measurement into 1 modeled predictor
  - Ex: log transformation will sometimes address apparent threshold effects
  - Ex: cubing height produces a more linear association with FEV
  - Ex: dichotomization of dose to detect efficacy in the presence of a strong threshold effect against placebo

Log Transformations
Simulated data where every doubling of X has the same difference in mean of Y
[Figures: Untransformed (Y vs. X); Log Transformed (Y vs. log X)]

Cubic Transformation: FEV vs Height

Transforming Predictors: Interpretation
- When the modeled predictor is a transformation of the measured variable, we try to use the same interpretation of slopes
  - Additive models: difference in $\theta_{Y|X}$ per 1 unit difference in the modeled predictor
  - Multiplicative models: ratio of $\theta_{Y|X}$ per 1 unit difference in the modeled predictor
- Such interpretations are generally easy for
  - Dichotomization of a measured variable
  - Logarithmic transformation of a measured variable
- Other univariate transformations are generally difficult to interpret
  - I tend not to use other transformations when interpretability of the estimate of effect is key (and I think it always is)
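The interpretation of a log-transformed predictor can be checked numerically. The following is a minimal Python sketch, not part of the original lecture. The data are hypothetical and noise-free, constructed so that the mean of Y rises by 3 with every doubling of X, and `ols_fit` is an illustrative helper rather than a library function.

```python
import math

def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical noise-free data: mean of Y rises by 3 for every doubling of X.
x = [1, 2, 4, 8, 16, 32]
y = [10 + 3 * math.log2(xi) for xi in x]

# Modeled predictor log2(X): the slope is the difference in Y per doubling of X.
intercept_log, slope_log = ols_fit([math.log2(xi) for xi in x], y)

# Untransformed X: a straight line cannot follow the curve, so lack of fit remains.
intercept_raw, slope_raw = ols_fit(x, y)
max_resid_raw = max(
    abs(yi - (intercept_raw + slope_raw * xi)) for xi, yi in zip(x, y)
)

print("slope per doubling of X:", round(slope_log, 6))
print("largest residual with untransformed X:", round(max_resid_raw, 3))
```

Using base 2 for the logarithm is a choice made here so that a 1 unit difference in the modeled predictor corresponds to a doubling of X. With natural logs the fit is identical, but the slope refers to an e-fold difference.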

Diagnostics
- It is natural to wonder whether univariate transformations of some measured covariate are appropriate
- We can illustrate methods for investigating the appropriateness of a transformation using one of the more common flexible methods of modeling covariate associations
  - I consider polynomial regression to investigate whether some of the transformations we have talked about make statistical sense
- I am not suggesting that we do model building by routinely investigating many different models
  - I think questions about linearity vs nonlinearity of associations are interesting scientific questions in their own right and should be placed in a hierarchy of investigation
  - I revisit this later

Effect of Link Function: RD, RR, OR
- With binary data, we cannot easily look at scatterplots
  - Instead we can look at fitted values compared to extremely flexible models (e.g., linear splines)
- But first we need to recognize that when looking at fitted values, we usually look at fitted probabilities
- Examples: 5 year survival versus LDL
    regress deadin5 ldl
    predict RDfitlin
    poisson deadin5 ldl
    predict RRfitlin
    logistic deadin5 ldl
    predict ORfitlin

Linear Trend in Mortality by LDL
[Figure: Fitted Values: Linear LDL; x-axis: ldl; curves: RD linear, RR linear, OR linear]

1:Many Transformations
- Sometimes we transform 1 scientific measurement into several modeled predictors
  - Ex: polynomial regression
  - Ex: dummy variables ("factored variables")
  - Ex: piecewise linear
  - Ex: splines
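The three fits on the "Effect of Link Function" slide summarize the same comparison of probabilities on three scales. A minimal Python sketch, using hypothetical fitted probabilities rather than the lecture data, shows why a trend that is exactly linear on the risk difference scale is not linear on the risk ratio or odds ratio scale.

```python
def risk_difference(p1, p0):
    """Difference in probabilities (identity link summary)."""
    return p1 - p0

def risk_ratio(p1, p0):
    """Ratio of probabilities (log link summary)."""
    return p1 / p0

def odds_ratio(p1, p0):
    """Ratio of odds (logit link summary)."""
    return (p1 / (1 - p1)) / (p0 / (1 - p0))

# Hypothetical fitted probabilities rising by a constant 0.05 per step in the predictor.
probs = [0.10, 0.15, 0.20, 0.25]
pairs = list(zip(probs[1:], probs[:-1]))

rds = [round(risk_difference(p1, p0), 4) for p1, p0 in pairs]
rrs = [round(risk_ratio(p1, p0), 4) for p1, p0 in pairs]
ors = [round(odds_ratio(p1, p0), 4) for p1, p0 in pairs]

print("RD per step:", rds)  # constant by construction
print("RR per step:", rrs)  # shrinks as the baseline probability grows
print("OR per step:", ors)  # also not constant
```

This is one reason the fitted curves from the three models need not coincide even when each model uses the same untransformed predictor.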

Polynomial Regression
- Fit linear term plus higher order terms (squared, cubic, ...)
- Can fit arbitrarily complex functions
  - An n-th order polynomial can fit n+1 points exactly
- Generally very difficult to interpret parameters
  - I usually graph the function when I want an interpretation
- Special uses
  - 2nd order (quadratic) model to look for a U-shaped trend
  - Test for linearity by testing that all higher order terms have parameters equal to zero

Ex: Mortality - LDL Assoc Linear?
- We can try to assess whether any association between 5 year mortality and LDL follows a straight line association
  - I am presuming this was a prespecified scientific question
  - (We should not pre-test our statistical models)
- I fit a 2nd order polynomial to the data
    g ldlsqr = ldl^2
    regress deadin4 ldl ldlsqr
    predict RDfitquad
    poisson deadin4 ldl ldlsqr
    predict RRfitquad
    logistic deadin4 ldl ldlsqr
    predict ORfitquad

Mortality - LDL Assoc Linear?: OR
- No statistically significant evidence of a nonlinear trend based on this analysis
    . logistic deadin4 ldl ldlsqr
    Logistic regression    Number of obs = 725
                           LR chi2(2)    = 9.42
    [Prob > chi2, log likelihood, pseudo R2, and the odds ratio table for ldl and ldlsqr were not recovered.]

Linear vs Quadratic Fitted Values
[Figure: Fitted Values: Linear LDL; x-axis: ldl; curves: RD linear, RR linear, OR linear, RD quadratic, RR quadratic, OR quadratic]
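The claim that an n-th order polynomial can fit n+1 points exactly can be illustrated with Lagrange interpolation. This is a minimal Python sketch with hypothetical LDL values and outcome proportions, not the lecture data, and `lagrange_eval` is an illustrative helper.

```python
def lagrange_eval(xs, ys, x):
    """Evaluate the unique polynomial of degree len(xs) - 1 through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Four hypothetical points, so a 3rd order (cubic) polynomial passes through all of them.
xs = [100, 130, 160, 190]       # hypothetical LDL values (mg/dl)
ys = [0.20, 0.12, 0.10, 0.15]   # hypothetical outcome proportions

fitted = [lagrange_eval(xs, ys, x) for x in xs]
between = lagrange_eval(xs, ys, 145)  # the curve is unconstrained between the points

print("fitted at the data points:", [round(f, 6) for f in fitted])
print("value at LDL 145:", round(between, 4))
```

An exact fit at the observed points says nothing about behavior between or beyond them, which is part of why the individual coefficients are hard to interpret and graphing the fitted function is the more useful summary.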

Mortality - LDL Associated?: OR
- Need to test both covariates
  - When these are the only predictors: overall LR test
  - Otherwise use post-estimation test or testparm
    . logistic deadin4 ldl ldlsqr
    Logistic regression    Number of obs = 725
                           LR chi2(2)    = 9.42
    [Prob > chi2, log likelihood, pseudo R2, and the odds ratio table for ldl and ldlsqr were not recovered.]

Dummy Variables
- Indicator variables for all but one group
- This is the only appropriate way to model nominal (unordered) variables
  - E.g., for marital status, indicator variables for
    - married (married = 1, everything else = 0)
    - widowed (widowed = 1, everything else = 0)
    - divorced (divorced = 1, everything else = 0)
    - (single would then be the reference group, represented by the intercept)
- Often used for other settings as well
  - Equivalent to Analysis of Variance (ANOVA)

Categorized Continuous
- We can use dummy variables with categorized continuous random variables to explore dose-response
- In Stata, we can quickly make categorized variables using egen
- Examples:
    egen ldlctg = cut(ldl), at(0,70,100,130,160,250)
    egen ldlctgq = cut(ldl), group(5)
    regress deadin4 i.ldlctg
    predict RDfitstep
    poisson deadin4 i.ldlctg
    predict RRfitstep
    logistic deadin4 i.ldlctg
    predict ORfitstep

Fitted Values: Linear, Dummy
[Figure: Fitted Values: Linear LDL; x-axis: ldl; curves: RD linear, RR linear, OR linear, RD step, RR step, OR step]
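The categorize-then-indicator approach can be sketched in Python. This is a minimal illustration with hypothetical data. `cut_at` only mimics my understanding of how `egen ... cut(), at()` labels each observation by the left end of its interval, and `indicators` is an illustrative helper. With an identity link and only these indicators in the model, the fitted step function equals the observed proportion in each category.

```python
import bisect
from collections import defaultdict

def cut_at(value, breaks):
    """Left end of the interval [breaks[i], breaks[i+1]) containing value, else None."""
    if value < breaks[0] or value >= breaks[-1]:
        return None
    return breaks[bisect.bisect_right(breaks, value) - 1]

def indicators(category, levels):
    """Indicator (dummy) variables for every level except the first (reference) level."""
    return [1 if category == lev else 0 for lev in levels[1:]]

breaks = [0, 70, 100, 130, 160, 250]
levels = breaks[:-1]                    # category labels: 0, 70, 100, 130, 160

ldl = [65, 70, 99, 128, 131, 200]       # hypothetical LDL values
dead = [1, 0, 1, 0, 0, 1]               # hypothetical binary outcomes

cats = [cut_at(v, breaks) for v in ldl]
design = [indicators(c, levels) for c in cats]

# Step-function fit: the mean outcome within each category.
groups = defaultdict(list)
for c, d in zip(cats, dead):
    groups[c].append(d)
step_fit = {c: sum(v) / len(v) for c, v in groups.items()}

print(cats)
print(design[1])   # indicators for the observation with LDL 70
print(step_fit)
```

The reference category (here the lowest interval) gets all zeros, so its mean is carried by the intercept, matching the role of "single" in the marital status example.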

Flexible Modeling of Predictors

Flexible Methods
- We do have methods that can fit a wide variety of curve shapes
- Polynomials
  - If high degree: allows many patterns of curvature
  - Fractional polynomial: allows raising to a fractional power, often searching for best fit
    - (I will not be a party to the propagation of these methods)
- Dummy variables
  - A step function with tiny steps
  - Flat lines over each interval
- Piecewise linear or piecewise polynomial
  - Define intervals over which the curve is a line or polynomial
- Splines
  - Piecewise linear or piecewise polynomial, but joined at "knots"

Linear Splines
- Draw straight lines between pre-specified knots
- Model an intercept and m+1 variables when using m knots
- Suppose the knots are $k_1, \ldots, k_m$ for variable X
- Define variables Spline0, ..., SplineM
  - Spline0 equals
    - $X$ for $X < k_1$
    - $k_1$ for $k_1 < X$
  - Then, for $J = 1, \ldots, m$, SplineJ equals (define $k_0 = 0$, $k_{m+1} = \infty$)
    - $0$ for $X < k_J$
    - $X - k_J$ for $k_J < X < k_{J+1}$
    - $k_{J+1} - k_J$ for $k_{J+1} < X$

Stata: Linear Splines
- Stata will make variables that will fit piecewise linear curves
    mkspline new0 #k1 new1 #k2 new2 ... #kp newp = oldvar
- Regression on new0 ... newp
  - Straight lines between min and k1; k1 and k2; etc.
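The spline variable definitions above can be written out directly. The following Python sketch is an illustration of those definitions, not Stata's implementation. `linear_spline_basis` is a hypothetical helper name, and the knots 6, 9, 12, 15 are chosen here only as an example.

```python
def linear_spline_basis(x, knots):
    """Return the m+1 linear spline variables for value x given m ordered knots.

    Spline0 = min(x, k_1); for J >= 1, SplineJ is the part of x that lies
    between k_J and k_{J+1}, with k_{m+1} taken to be infinity.
    """
    basis = [min(x, knots[0])]
    uppers = list(knots[1:]) + [float("inf")]
    for lo, hi in zip(knots, uppers):
        basis.append(min(max(x - lo, 0.0), hi - lo))
    return basis

knots = [6, 9, 12, 15]  # example knots (assumed for illustration)

for x in (4, 10, 17):
    pieces = linear_spline_basis(x, knots)
    # The pieces always add back up to x, so nothing about x is lost.
    print(x, pieces, sum(pieces))
```

Because the variables sum to X, a model giving every spline variable the same coefficient is just a straight line in X; differing coefficients let the slope change at each knot.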

Regression with Linear Splines: FEV, Age
    . mkspline age3 6 age6 9 age9 12 age12 15 age15 = age
    . list age age3 age6 age9 age12 age15 in 1/15
    [Listing of age, age3, age6, age9, age12, age15 for the first 15 observations was not recovered.]
    . regress fev age3 age6 age9 age12 age15, robust
    Linear regression    Number of obs = 654
    [The F(5, 648) statistic, Prob > F, R-squared, Root MSE, and the table of robust coefficient estimates for age3, age6, age9, age12, age15, and _cons were not recovered.]
    . predict splinefit
    (option xb assumed; fitted values)

Fitted Values with Linear Splines
    . tabstat splinefit, by(age) stat(n mean sd min max)
    [Table with columns age, N, mean, sd, min, max was not recovered.]
    . tabstat splinefit, by(age)
    [Table with columns age, N, mean, sd, min, max, Difference was not recovered.]

Linear Splines: Parameter Interpretation
- With identity link
  - Intercept $\beta_0$: $\theta_{Y|X}$ when X = 0
  - Slope parameters $\beta_j$: estimated difference in $\theta_{Y|X}$ between two groups both between the same knots but differing by 1 unit in X
- With log link
  - Exponentiated intercept $\exp(\beta_0)$: $\theta_{Y|X}$ when X = 0
  - Exponentiated slope parameters $\exp(\beta_j)$: estimated ratio of $\theta_{Y|X}$ between two groups both between the same knots but differing by 1 unit in X

Fitted Values
Lowess (largely hidden), linear, dummy variables, linear splines
[Figure: FEV by Age (stratified by sex); x-axis: age; curves for Males and Females: Lowess, Linear fit, Dummy fit, Spline fit]

Testing Linearity
- A straight line is a special case of linear splines
  - All the slope coefficients would have to be equal
- Can use Stata's test
    . test age3 = age6 = age9 = age12 = age15
     ( 1)  age3 - age6 = 0
     ( 2)  age3 - age9 = 0
     ( 3)  age3 - age12 = 0
     ( 4)  age3 - age15 = 0
           F(  4,   648) =    6.89
    [Prob > F was not recovered.]

Linear Splines
- In Stata, we can quickly make linear splines using mkspline
- Examples:
    mkspline sldl0 70 sldl70 100 sldl100 130 sldl130 160 sldl160 = ldl
    regress deadin4 sldl*
    predict RDfitspline
    poisson deadin4 sldl*
    predict RRfitspline
    logistic deadin4 sldl*
    predict ORfitspline
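Both the slope interpretation and the linearity test can be seen in a small numerical example. This Python sketch assumes an identity link, knots at ages 6, 9, 12, 15, and made-up coefficients. None of the numbers are estimates from the FEV data, and `spline_linear_predictor` is an illustrative helper.

```python
def spline_linear_predictor(x, knots, intercept, slopes):
    """Piecewise-linear predictor: each slope applies only to the part of x in its interval."""
    pieces = [min(x, knots[0])]
    uppers = list(knots[1:]) + [float("inf")]
    for lo, hi in zip(knots, uppers):
        pieces.append(min(max(x - lo, 0.0), hi - lo))
    return intercept + sum(b * p for b, p in zip(slopes, pieces))

knots = [6, 9, 12, 15]
slopes = [0.10, 0.25, 0.30, 0.20, 0.05]  # hypothetical coefficients, one per interval

# Two groups between the same knots (9 and 12), differing by 1 unit in X:
# the difference in the linear predictor is the slope for that interval.
diff_same_interval = (
    spline_linear_predictor(11, knots, 1.0, slopes)
    - spline_linear_predictor(10, knots, 1.0, slopes)
)

# Null hypothesis of linearity: all slopes equal, which collapses to a straight line.
equal_slopes = [0.2] * 5
straight = [spline_linear_predictor(x, knots, 1.0, equal_slopes) for x in (4, 10, 17)]

print(round(diff_same_interval, 6))
print([round(v, 6) for v in straight])  # matches 1.0 + 0.2 * x at each x
```

With m knots there are m+1 slopes, so equality of all slopes imposes m constraints, which is why the test above for four knots has 4 numerator degrees of freedom.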

Fitted Values
[Figure: Fitted Values: Linear LDL; x-axis: ldl; legend as extracted: RD linear, RR linear, OR linear, RD step, RR step, OR step]

Flexible Methods: Comments

Flexible Modeling of Predictors
- Commonly used flexible models include
  - Polynomials
  - Dummy variables
  - Linear splines
- Possibilities are limitless, but some you may encounter
  - Cubic splines
    - Makes curves smooth at knots
    - But for the ways I use splines, I cannot be bothered
  - Fractional polynomial: allows raising to a fractional power
    - Often searching for best fit over a grid of values
    - I will not be a party to the propagation of these methods

Uses of Flexible Modeling of Predictors
- For predictor of interest
  - When there is strong suspicion of a complex nonlinear fit
    - May provide greater precision due to better fit
    - Can test for linearity by including a linear term, then testing all the other terms
  - When the fit is fairly well approximated by a straight line of the untransformed predictor, or a straight line with a univariate transformation of the predictor, splines may result in loss of precision due to loss of df
    - "Keep an open mind, but not so open that your brains fall out" - Virginia Gildersleeve
- For confounders, ensures a more accurately modeled effect of covariates
  - But, again, not wise to go overboard
- For precision variables, often not worth the effort
