Statistics in medicine
Lecture 4: Correlation and multivariable regression
Fatma Shebl, MD, MS, MPH, PhD
Assistant Professor, Chronic Disease Epidemiology Department
Yale School of Public Health
Fatma.shebl@yale.edu

Outline
- Correlation
- Regression
  - Linear
  - Logistic
  - Cox's proportional hazard model

Correlation and prediction methods
- Correlation
  - Simple: Pearson, Spearman, other
  - Partial
- Regression and prediction
  - Linear
  - Logistic
  - Other

Correlation
- A measure of the strength and direction of the association between two variables measured on a numerical scale
- Measured by the correlation coefficient r

Correlation
- The sign of r indicates the direction of the association
  - + sign: positive correlation. High values of one variable are associated with high values of the second variable
  - - sign: negative correlation. High values of one variable are associated with low values of the second variable
- r ranges between +1 and -1
  - +1: perfect positive correlation
  - -1: perfect negative correlation
  - 0: no correlation

Correlation
- r is immune to a change in the position of x and y
- r is immune to linear transformation
- r close to 0 does not mean lack of a relationship, i.e. a strong non-linear relationship might exist
- Correlation does NOT indicate causation

Correlation: Visualization
- Scatterplot: a two-dimensional graph displaying the relationship between two numerical characteristics
- Also called a joint distribution graph
- Plot (x, y) to assess the pattern of the relationship

Correlation: Scatterplot
[Figure: Scatterplots and correlations. A: r = +1.0; B: r = 0.7; C: r = -0.9; D: r = -0.4; E: r = 0.0; F: r = 0.0.]
Legend (patterns): A. Perfect positive; B. Positive; C. Negative; D. Weak negative; E. Nonexistent; F. Nonlinear
From: Chapter 8, Research Questions About Relationships among Variables. Basic & Clinical Biostatistics, 4e, 2004. Copyright 2016 McGraw-Hill Education.

Correlation: Types
- Pearson product-moment: interval or ratio scale
- Spearman rank-order: ordinal scale
- Other

Pearson product-moment
- Used for two numerical, normally distributed variables
- Assumptions
  - Linear relationship
  - Normal distribution
  - No outliers
  - Large sample size (> 30)

Pearson product-moment: test of significance
1. Calculate r (correlation coefficient)
$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}$$
2. Calculate the degrees of freedom
$$df = n - 2$$
3. Calculate the test statistic t
$$t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}$$
4. Find the critical value of t for the chosen significance level
5. Draw a conclusion
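
The five-step Pearson test above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original lecture: the data values and the function names `pearson_r` and `t_statistic` are hypothetical, and the sample is far smaller than the n > 30 the slide asks for.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """Test statistic t = r * sqrt(n - 2) / sqrt(1 - r^2), with df = n - 2."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Hypothetical toy data (assumption, for illustration only)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)          # step 1
df = len(x) - 2              # step 2
t = t_statistic(r, len(x))   # step 3
print(f"r = {r:.3f}, df = {df}, t = {t:.3f}")  # r = 0.775, df = 3, t = 2.121
```

For steps 4 and 5, the two-sided 5% critical value of t with 3 degrees of freedom is 3.182, so with this toy sample the correlation would not be declared significant even though r is fairly large.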

Spearman's rho
- Can be used for
  - Variables that are NOT normally distributed
  - Normally distributed variables
- Based on ranks

Spearman's rho: test of significance
1. Calculate r_s (correlation coefficient)
$$r_s = \frac{\sum (R_X - \bar{R}_X)(R_Y - \bar{R}_Y)}{\sqrt{\sum (R_X - \bar{R}_X)^2 \sum (R_Y - \bar{R}_Y)^2}}$$
2. Calculate the degrees of freedom
$$df = n - 2$$
3. Calculate the test statistic t
$$t = \frac{r_s \sqrt{n - 2}}{\sqrt{1 - r_s^2}}$$
4. Find the critical value of t for the chosen significance level
5. Draw a conclusion

Interpretation of the size of r: rule of thumb
- 0 to 0.10: no to trivial correlation
- 0.10 to 0.30: very low correlation
- 0.30 to 0.50: low correlation
- 0.50 to 0.70: moderate correlation
- 0.70 to 0.90: high correlation
- 0.90 to 1: very high correlation
- 1: perfect correlation

Interpretation of the size of r
- r is affected by sample size: with a large sample, even a small r can give significant results
- Better interpretation using r^2 (known as the coefficient of determination)
  - The proportion of the variance in one variable that is accounted for by the other variable
  - A measure of the strength of the relationship

Correlation: Example
A study was conducted to examine whether serum calcium and serum triglycerides are correlated. If the correlation coefficient is 20%, interpret the r coefficient. What is the coefficient of determination, and how is it interpreted? Can you infer causation?

Answer
- The correlation between serum calcium and serum triglycerides is very low (r = 0.20 falls in the 0.10 to 0.30 band of the rule of thumb)
- r^2 = 0.2 x 0.2 = 0.04 = 4%
- Interpretation of r^2: 4% of the variation in serum calcium is accounted for by serum triglycerides (and vice versa)
- No, causation cannot be inferred
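
Because Spearman's rho is the Pearson formula applied to ranks, it can be sketched by ranking each variable first. The helper names and data below are hypothetical, and tied values are given the mean of their ranks, which is a common convention rather than something stated on the slide.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical data: monotone but strongly non-linear, with one extreme value
x = [10, 20, 30, 40, 50]
y = [1, 4, 9, 16, 1000]

rho = spearman_rho(x, y)
r = pearson_r(x, y)
print(f"Spearman rho = {rho:.3f}, Pearson r = {r:.3f}")
```

In this toy sample the ranks agree perfectly, so rho is 1, while the extreme value pulls Pearson's r well below 1. That is the practical reason to prefer ranks when the data are not normally distributed.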

Partial correlation
- A measure of the strength and direction of the association between two variables, controlling for one or more other variables
- r ranges between +1 and -1
- Assumptions
  - Linear relationship between all pairs of variables
  - Normal distribution
  - No outliers

Multivariable regression
- Definition: statistical models that have one dependent (outcome) variable but include more than one independent variable

Rationale of the regression equations
- Example: suppose we hypothesize that cholesterol level is predicted by age, gender, and diabetic status, and we would like to find the line (as in the scatter diagram) that best fits this relationship. We can write this as a straight-line equation: Y = a + bx

Rationale of the straight-line equation in regression
- Cholesterol = age + gender + diabetes
- But not all of these predictors are equally important, so we give each predictor a weight (coefficient) relative to its importance:
  Cholesterol = (W1)age + (W2)gender + (W3)diabetes

Rationale of the regression equations
- However, we need a starting point for the calculation, so we add it to the equation:
  Cholesterol = starting point + (W1)age + (W2)gender + (W3)diabetes
- Because the prediction of the outcome is usually not perfect, we add an error term:
  Cholesterol = starting point + (W1)age + (W2)gender + (W3)diabetes + error term
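
For the partial correlation slide above, one standard way to control for a single third variable z is the first-order partial correlation formula, r_xy.z = (r_xy - r_xz r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)). The lecture does not give this formula, so treat the sketch below as a supplementary illustration; the three input correlations are hypothetical.

```python
import math

def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return numerator / denominator

# Hypothetical pairwise correlations (assumption, for illustration only)
r_xy, r_xz, r_yz = 0.5, 0.6, 0.7

print(f"simple r = {r_xy:.2f}, partial r = {partial_r(r_xy, r_xz, r_yz):.2f}")
# simple r = 0.50, partial r = 0.14
```

Here most of the apparent x-y association is shared with z, so the partial correlation is much smaller than the simple one. If z is uncorrelated with both x and y, the partial and simple correlations coincide.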

Rationale of the regression equations
- The final formula can be expressed as
$$y = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + e$$
- Also written as
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$$
- This equation is commonly referred to as the general linear model
- Interpretation of the symbols
  - a (β_0): intercept, i.e. where the line crosses the y-axis
  - b (β_1 ... β_k): regression coefficients (slopes), i.e. the amount y changes each time x changes by 1 unit
  - e (ε): error term (residual), i.e. the distance by which the actual value of y departs from the regression line

Rationale of the regression equations: estimation of the best estimates (least-squares method)
- Observed y and x are known, therefore e has to be calculated
- Use different values of a and b to calculate the predicted y (y hat):
$$\hat{y} = a + b_1 x_1 + b_2 x_2 + b_3 x_3$$
- e is then calculated as y - ŷ
- The best estimate is the one with the least error, i.e. the one that minimizes the sum of the squared error terms:
$$\sum e^2 = \sum (y - \hat{y})^2$$

[Figure: Geometric interpretation of a regression line. Least squares regression line.]
From: Chapter 8, Research Questions About Relationships among Variables. Basic & Clinical Biostatistics, 4e, 2004. Copyright 2016 McGraw-Hill Education.
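
The least-squares idea above can be sketched for the simplest case of one predictor, where the minimizing a and b have a closed form. This is a simplification of the three-predictor equation in the slides; the data and the names `fit_line` and `sum_squared_error` are hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for y = a + b*x (one predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def sum_squared_error(x, y, a, b):
    """Sum of squared residuals e = y - y_hat for a candidate line."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical toy data (assumption, for illustration only)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_line(x, y)
best = sum_squared_error(x, y, a, b)
other = sum_squared_error(x, y, 2.0, 0.7)  # an arbitrary competing line
print(f"a = {a:.2f}, b = {b:.2f}, SSE = {best:.2f}, competing SSE = {other:.2f}")
# a = 2.20, b = 0.60, SSE = 2.40, competing SSE = 2.55
```

Any other choice of a and b, such as the competing line shown, gives a larger sum of squared errors than the fitted line. With several predictors the same criterion is minimized, usually by matrix methods in statistical software.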

Multivariable regression: applications
- Test for interaction
- Adjust for confounding
- Predict future values of y given x

Multivariable regression
- The types of models described by the previous equation are referred to as general linear models
  - General, because they can accommodate different types of y and/or x
  - Linear, because the model is a linear combination of the x terms
- Commonly used methods: linear regression, logistic regression, survival analysis

Readings and resources
- Dawson, B. and Trapp, R. G. (2004). Basic and Clinical Biostatistics (4th edition). New York: McGraw-Hill. Chapter 8, p190-220.
- Dawson, B. and Trapp, R. G. (2004). Basic and Clinical Biostatistics (4th edition). New York: McGraw-Hill. Chapter 9, p221-244.
- Dawson, B. and Trapp, R. G. (2004). Basic and Clinical Biostatistics (4th edition). New York: McGraw-Hill. Chapter 10, p245-263.
- Katz, D. L. et al. Jekel's Epidemiology, Biostatistics, Preventive Medicine, and Public Health (4th edition). Chapter 11, p147-151.
- Katz, D. L. et al. Jekel's Epidemiology, Biostatistics, Preventive Medicine, and Public Health (4th edition). Chapter 13, p163-170.

Statistics in medicine
Lecture 4, part 2: Correlation and multiple regression
Fatma Shebl, MD, MS, MPH, PhD

Linear regression
- A general linear model in which the dependent variable is a continuous variable
- Types
  - Single continuous predictor: simple linear regression
  - Multiple continuous predictors: multiple linear regression
  - Single categorical predictor: one-way ANOVA
  - Multiple categorical predictors: N-way ANOVA
  - Some categorical and some continuous predictors: analysis of covariance (ANCOVA)

Linear regression: assumptions
- Linearity: the relation between each independent variable and the dependent variable is linear
- Independence: the values of Y are independent
- Homogeneity: the variance of Y is equal across the range of X

Linear regression: steps
- Build up the model
- Assess model fit
- Interpret the regression coefficients

Build up the model
- The most common method is stepwise (it is automated in most programs)
  - Start with one variable in the model (the main predictor, if one is hypothesized)
  - Add another variable
  - Keep adding variables to the list of variables already in the model
  - Use a stopping criterion, such as an increase in R^2 of less than 0.01

Build up the model: R^2
- A measure of how much of the variation in the outcome is accounted for by the explanatory variables
- Range 0 to 1
  - 0: no variance accounted for
  - 1: all of the variance (100%) accounted for

Assess model fit
- Residuals (the part of Y that is not explained by X) can be used to assess the model fit
- Plot the residuals (on the Y axis) versus X
- The mean of the residuals is zero; therefore, if the model fits the data, the residuals and X should not be correlated
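
A minimal sketch of the two fit checks described above, for a one-predictor model: R^2 as the share of outcome variation explained, and residuals that average zero and are uncorrelated with X. The data and helper names are hypothetical, and R^2 is computed as 1 - SSE/SST, the usual definition, which the slides describe only in words.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for y = a + b*x (one predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b

# Hypothetical toy data (assumption, for illustration only)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_line(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

mean_y = sum(y) / len(y)
sse = sum(e ** 2 for e in residuals)          # unexplained variation
sst = sum((yi - mean_y) ** 2 for yi in y)     # total variation
r_squared = 1 - sse / sst

mean_x = sum(x) / len(x)
mean_residual = sum(residuals) / len(residuals)
# Cross-product of residuals with centered X; zero means no linear correlation
residual_x_crossproduct = sum(e * (xi - mean_x) for e, xi in zip(residuals, x))

print(f"R^2 = {r_squared:.2f}")                      # R^2 = 0.60
print(f"mean residual = {mean_residual:.2e}")
print(f"residual-X cross-product = {residual_x_crossproduct:.2e}")
```

For a least-squares line the mean residual and the residual-X cross-product are zero by construction (up to rounding), so a residual plot is still needed to spot curvature or unequal spread.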

Assess model fit
[Figure: Illustration of analysis of residuals. A: Linear relationship between X and Y. B: Residuals versus values of X for the relation in part A. C: Curvilinear relationship between X and Y. D: Residuals versus values of X for the relation in part C.]
Legend: good fit means the residuals form a random scatter around the zero line
From: Chapter 8, Research Questions About Relationships among Variables. Basic & Clinical Biostatistics, 4e, 2004. Copyright 2016 McGraw-Hill Education.

Interpret the regression coefficient: the intercept
- It is the expected value of Y when all X = zero
- If X cannot be zero, the intercept is not meaningful
- If the interest is in the relationship between X and Y, the intercept is not of interest and will not affect the conclusion
- If the interest is in the prediction of Y from X, then X has to be re-scaled so that the intercept is the expected value of Y at the chosen X value

Interpret the regression coefficient: the slope
- If X is continuous: it is the change in Y for a one-unit increase in X, holding the other variables (other X's) constant
- If X is categorical: it is the mean difference in Y between one category and the reference category of X, holding the other variables (other X's) constant

Interpret the regression coefficient: the p value
- Compare the test statistic (t) with the critical value for the relevant df
- Null hypothesis: the coefficients (intercept and slopes) = zero
- If p < the predetermined significance level, reject the null
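
The re-scaling point about the intercept can be illustrated by centering X at a chosen value before fitting. In this hypothetical sketch (toy data, helper name `fit_line`, and the chosen value 3 are all assumptions), the slope is unchanged and the intercept becomes the expected Y at X = 3.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for y = a + b*x (one predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b

# Hypothetical toy data (assumption, for illustration only)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

chosen_x = 3  # hypothetical reference value at which the intercept should be meaningful
a_raw, b_raw = fit_line(x, y)
a_centered, b_centered = fit_line([xi - chosen_x for xi in x], y)

print(f"raw:      intercept = {a_raw:.2f}, slope = {b_raw:.2f}")
print(f"centered: intercept = {a_centered:.2f}, slope = {b_centered:.2f}")
# raw:      intercept = 2.20, slope = 0.60
# centered: intercept = 4.00, slope = 0.60
```

The centered intercept equals the raw prediction at the chosen value, a_raw + b_raw * 3, so the conclusions about the X-Y relationship are unaffected.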

Linear regression: an example and interpretation
A study was conducted to examine the association between insulin sensitivity scores (outcome) and BMI. The resultant regression equation was Y = 1.5817 - 0.0433X. Interpret the terms in the equation.
- Y: predicted insulin sensitivity
- 1.5817: the intercept, i.e. patients with zero BMI (unrealistic) have a predicted insulin sensitivity of 1.5817
- X: observed BMI value
- -0.0433: the slope, i.e. when BMI increases by 1 unit, predicted insulin sensitivity decreases by 0.0433
From: Basic & Clinical Biostatistics, 4e, 2004

Logistic regression
- A general linear model in which the dependent variable is a nominal/categorical variable
- Types
  - Commonly used for a dichotomous dependent variable
  - Can be used for a multinomial dependent variable

Logistic regression
- Uses a mathematical function (logit) to transform the regression data so that y is limited to (0, 1)
$$\operatorname{logit}\big(p(y = 1 \mid x\text{'s})\big) = \log\frac{p}{1 - p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k$$
- Translated into the probability of the dependent variable as an exponential function of the independent variables:
$$p(y = 1) = \frac{1}{1 + \exp[-(b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k)]}$$
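
The logit and its inverse, as given in the logistic regression formulas above, can be sketched as follows. The coefficients and predictor values are hypothetical; they are not estimates from any study in the lecture.

```python
import math

def logit(p):
    """Log odds: log(p / (1 - p))."""
    return math.log(p / (1 - p))

def predicted_probability(b0, coefs, xs):
    """p(y = 1) = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk)))."""
    linear_predictor = b0 + sum(b * x for b, x in zip(coefs, xs))
    return 1 / (1 + math.exp(-linear_predictor))

# Hypothetical coefficients and predictor values (assumption, for illustration)
b0 = -1.0
coefs = [0.5, 0.2]
xs = [2, 1]

p = predicted_probability(b0, coefs, xs)
print(f"p(y = 1) = {p:.3f}")        # p(y = 1) = 0.550
print(f"logit(p) = {logit(p):.3f}")  # logit(p) = 0.200
```

Applying the logit to the predicted probability recovers the linear predictor (here -1.0 + 0.5*2 + 0.2*1 = 0.2), and whatever the linear predictor's value, the probability stays between 0 and 1.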

Logistic regression: steps
- Build up the model: similar to linear regression
- Assess model fit: Hosmer and Lemeshow's goodness-of-fit test, where a p value > 0.05 indicates an acceptable fit
- Interpret the regression coefficients

Logistic regression: interpret the regression coefficient
- exp(β_0): the odds that y = 1, given x = 0
- exp(β_1):
  - If x is categorical: the odds ratio of y = 1 in one category of x compared to the reference category of x, holding the other variables (other x's) constant
  - If x is continuous: the multiplicative change in the odds of y = 1 (the odds ratio) for a one-unit increase in x, holding the other variables (other x's) constant

Logistic regression: practical coding issues
- Interpretation of the results depends on how you code your data, therefore:
  - It is important to check how the outcome is coded, i.e. what level is coded as 1 and what level is coded as 2 (or 0)
  - It is important to check how the predictors are coded
  - Common practice is to code binary predictors as 0, 1
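
As a small illustration of the exp(β) interpretation above, the sketch below converts a logistic regression coefficient into an odds ratio. The coefficient value 0.28 and the function name `odds_ratio` are hypothetical.

```python
import math

def odds_ratio(beta, units=1):
    """Odds ratio for a `units`-unit increase in x: exp(units * beta)."""
    return math.exp(units * beta)

# Hypothetical coefficient for a predictor (assumption, for illustration only)
beta_1 = 0.28

print(f"OR per 1-unit increase: {odds_ratio(beta_1):.2f}")     # 1.32
print(f"OR per 2-unit increase: {odds_ratio(beta_1, 2):.2f}")  # 1.75
```

A coefficient of zero gives an odds ratio of 1 (no association), and the odds ratio for a two-unit increase is the one-unit odds ratio squared, because effects are additive on the log-odds scale and multiplicative on the odds scale.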

Logistic regression: an example and interpretation
Blood alcohol concentration (BAC) > 50 mg/dL was examined among men with unintentional injury who were admitted to the emergency room. Predictors of BAC were daytime, weekday, being Caucasian, and age of 40. The reported ORs (95% CI) for Caucasian and age of 40 were 1.32 (1.06-1.65) and 0.89 (0.72-1.10). Interpret the results.
From: Basic & Clinical Biostatistics, 4e, 2004

Answer
- Caucasians were significantly more likely to have elevated BAC than other races (the 95% CI, 1.06-1.65, excludes 1).
- Age did not significantly predict elevated BAC (the 95% CI, 0.72-1.10, includes 1).

Survival analysis
- Definition: the statistical methods for analyzing survival data when there are censored observations
- Censored observation: an observation whose value is unknown, generally because the subject has not been in the study long enough for the outcome of interest, such as death, to occur

Survival analysis: common methods of summarization and presentation
- Person-time
- Life tables
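
The reading of the odds ratios in the blood alcohol example can be written as a simple rule: an odds ratio is significant at the 5% level when its 95% confidence interval excludes 1. The function name below is hypothetical; the numbers are the ones reported in the example.

```python
def interpret_odds_ratio(ci_lower, ci_upper):
    """Classify an odds ratio by whether its 95% CI excludes the null value 1."""
    if ci_lower > 1.0:
        return "significantly higher odds"
    if ci_upper < 1.0:
        return "significantly lower odds"
    return "not significant"

# Values reported in the BAC example: OR (95% CI)
caucasian = interpret_odds_ratio(1.06, 1.65)  # OR 1.32
age = interpret_odds_ratio(0.72, 1.10)        # OR 0.89

print(f"Caucasian: {caucasian}")  # Caucasian: significantly higher odds
print(f"Age: {age}")              # Age: not significant
```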

Survival analysis: person-time
- Person-time is the length of follow-up
  - Example: if two subjects were followed, one for 2 years and the second for 1 year, then the total person-time is 3 person-years
- Can be used to calculate incidence density
  - Incidence density is the number of events divided by the total person-time
- A useful method if the event can be recurrent

Survival analysis: life tables (covered in the epi course)
- Two methods
  - Kaplan-Meier method
  - Actuarial method
- Requirements
  - Date of entry
  - Date of withdrawal
  - Cause of withdrawal
    - Death
    - Loss to follow-up

Survival analysis: common methods to test significance
- Bi-variate
  - Logrank test
  - Mantel-Haenszel chi-square test
- Cox's proportional hazard model

Cox proportional hazard model
- A regression model for use when there is censored outcome data
- Data are said to be censored in case of
  - Loss to follow-up
  - End of the study
- The dependent variable is the survival time (time to event)
$$h(t, X_1, X_2, \dots, X_k) = h_0(t)\, e^{b_1 x_1 + b_2 x_2 + \dots + b_k x_k}$$
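
The person-time example above (one subject followed for 2 years, another for 1 year) can be sketched as follows. The event count is a hypothetical addition, since the slide gives only the follow-up times.

```python
def incidence_density(events, follow_up_times):
    """Number of events divided by total person-time."""
    return events / sum(follow_up_times)

# Follow-up times from the slide's example, in years
follow_up_years = [2, 1]
total_person_years = sum(follow_up_years)

# Hypothetical: one event observed during follow-up (assumption)
events = 1
density = incidence_density(events, follow_up_years)

print(f"total person-time = {total_person_years} person-years")
print(f"incidence density = {density:.3f} events per person-year")
# total person-time = 3 person-years
# incidence density = 0.333 events per person-year
```

Because the numerator counts events rather than people, the same calculation works when one subject has more than one event, which is why the method suits recurrent events.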

Cox proportional hazard model
- Answers the question: what is the likelihood of dying in the next interval, given survival up to this time and given a set of independent variables?

Cox proportional hazard model
- Allows estimating the relative risk (also called the hazard ratio)
  - In other words, it answers the question: what is the risk of an event (such as death) at a given time, given that it has not occurred until that time?
  - The hazard ratio is the ratio of the risk of the event at a given time in the exposed to the risk in the unexposed
- Assessing the assumption of proportional hazards is beyond this class
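
Under the Cox model form h(t, x) = h0(t) exp(b1 x1 + ... + bk xk), the hazard ratio for an exposed versus an unexposed subject reduces to exp(b), because the baseline hazard cancels. The sketch below checks this numerically; the baseline hazard values and the coefficient are hypothetical.

```python
import math

def hazard(baseline_hazard, coefs, xs):
    """Cox model hazard: h0(t) * exp(b1*x1 + ... + bk*xk)."""
    return baseline_hazard * math.exp(sum(b * x for b, x in zip(coefs, xs)))

# Hypothetical coefficient for a binary exposure (1 = exposed, 0 = unexposed)
b_exposure = 0.7

# Hypothetical baseline hazards at three different time points
for h0 in (0.01, 0.05, 0.20):
    ratio = hazard(h0, [b_exposure], [1]) / hazard(h0, [b_exposure], [0])
    print(f"h0 = {h0:.2f}: hazard ratio = {ratio:.3f}")

print(f"exp(b) = {math.exp(b_exposure):.3f}")
```

The ratio is the same at every time point even though the baseline hazard changes, which is what the "proportional hazards" assumption means.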