Statistics in medicine


Lecture 4: Correlation and multivariable regression
Fatma Shebl, MD, MS, MPH, PhD
Assistant Professor, Chronic Disease Epidemiology Department
Yale School of Public Health
Fatma.shebl@yale.edu

Outline
  Correlation
  Regression
    Linear
    Logistic
    Cox's proportional hazard model

Correlation and prediction methods
  Correlation
    Simple: Pearson, Spearman, other
    Partial
  Regression and prediction
    Linear
    Logistic
    Other

Correlation
  A measure of the strength and direction of the association between two variables measured on a numerical scale.
  Measured by the correlation coefficient r.

Correlation coefficient r
  The sign of r indicates the direction of the association.
    + sign: positive correlation. High values of one variable are associated with high values of the second variable.
    - sign: negative correlation. High values of one variable are associated with low values of the second variable.
  r ranges between +1 and -1.
    +1: perfect positive correlation
    -1: perfect negative correlation
    0: no correlation

Correlation
  r is unchanged when x and y are interchanged.
  r is unchanged by a linear transformation of either variable, such as a change of origin or units (a negative scale factor reverses only the sign).
  r close to 0 does not mean lack of relationship, i.e. a strong non-linear relationship might exist.
  Correlation does NOT indicate causation.

Correlation: Visualization
  Scatterplot: a two-dimensional graph displaying the relationship between two numerical characteristics.
  Also called a joint distribution graph.
  Plot (x, y) to assess the pattern of the relationship.
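The properties of r listed above can be checked numerically. The following is a minimal sketch, assuming the first property means that interchanging x and y leaves r unchanged; the function name `pearson_r` and all data values are made up for illustration and are not part of the lecture.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [-3, -2, -1, 0, 1, 2, 3]           # made-up values
y_linear = [2 * v + 1 for v in x]      # exact linear relationship
y_curved = [v ** 2 for v in x]         # exact but non-linear relationship (parabola)

r_linear = pearson_r(x, y_linear)                           # perfect positive correlation
r_swapped = pearson_r(y_linear, x)                          # x and y interchanged
r_rescaled = pearson_r([10 * v + 5 for v in x], y_linear)   # x linearly transformed
r_curved = pearson_r(x, y_curved)                           # strong relationship, yet r is 0

print(r_linear, r_swapped, r_rescaled, r_curved)
```

In this sketch the parabola is a perfectly deterministic relationship, yet its r is zero, which is why a scatterplot is worth inspecting before interpreting r.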

Correlation: Scatterplot
  [Figure: Scatterplots and correlations. Panels: A, perfect positive (r = +1.0); B, positive (r = 0.7); C, negative (r = -0.9); D, weak negative (r = -0.4); E, nonexistent (r = 0.0); F, nonlinear (r = 0.0).]
  From: Chapter 8. Research Questions About Relationships among Variables, Basic & Clinical Biostatistics, 4e, 2004. Copyright 2016 McGraw-Hill Education. All rights reserved.

Correlation: Types
  Pearson product-moment: interval or ratio scale
  Spearman rank-order: ordinal scale
  Other

Pearson product-moment correlation
  Used for two numerical, normally distributed variables.
  Assumptions
    Linear relationship
    Normal distribution
    No outliers
    Large sample size (>30)
  Test of significance
    1. Calculate r (the correlation coefficient):
       $$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}$$
    2. Calculate the degrees of freedom:
       $$df = n - 2$$
    3. Calculate the test statistic t:
       $$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$$
    4. Find the critical value of t for the chosen significance level.
    5. Draw a conclusion.
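Steps 2 and 3 of the significance test can be sketched in a few lines of Python. This is only an illustration: the function name `correlation_t_test`, the value r = 0.5, and the sample size n = 27 are hypothetical, and the critical value still has to be looked up in a t table or obtained from statistical software.

```python
import math

def correlation_t_test(r, n):
    """Return (t, df) for testing whether a correlation r from n pairs differs from zero."""
    if n <= 2 or abs(r) >= 1:
        raise ValueError("need n > 2 and -1 < r < 1")
    df = n - 2                                      # degrees of freedom
    t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)   # test statistic
    return t, df

t, df = correlation_t_test(r=0.5, n=27)   # hypothetical r and sample size
print(f"t = {t:.3f}, df = {df}")          # t = 2.887, df = 25
```

With 25 degrees of freedom the two-sided 5% critical value from a t table is about 2.06, so in this hypothetical case the null hypothesis of no correlation would be rejected.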

Spearman's rho
  Can be used for variables that are NOT normally distributed, as well as for normally distributed variables.
  Based on ranks.
  Test of significance
    1. Calculate r_s (the correlation coefficient), where R_X and R_Y are the ranks of X and Y:
       $$r_s = \frac{\sum (R_X - \bar{R}_X)(R_Y - \bar{R}_Y)}{\sqrt{\sum (R_X - \bar{R}_X)^2 \sum (R_Y - \bar{R}_Y)^2}}$$
    2. Calculate the degrees of freedom:
       $$df = n - 2$$
    3. Calculate the test statistic t:
       $$t = \frac{r_s\sqrt{n - 2}}{\sqrt{1 - r_s^2}}$$
    4. Find the critical value of t for the chosen significance level.
    5. Draw a conclusion.

Interpretation of the size of r
  Rule of thumb
    Size of r      Interpretation
    0 - .10        no to trivial correlation
    .10 - .30      very low correlation
    .30 - .50      low correlation
    .50 - .70      moderate correlation
    .70 - .90      high correlation
    .90 - 1        very high correlation
    1              perfect correlation

Interpretation of the size of r
  The statistical significance of r is affected by sample size: with a large sample, even a small r gives significant results.
  A better interpretation uses r^2 (known as the coefficient of determination).
    It is the proportion of the variance in one variable that is accounted for by the other variable.
    It is a measure of the strength of the relationship.

Correlation: Example
  A study was conducted to examine whether serum calcium and serum triglycerides are correlated. If the correlation coefficient is 20%, interpret r, state the coefficient of determination and its interpretation, and say whether you can infer causation.
  Answer
    There is a very low correlation between serum calcium and serum triglycerides (r = .20 falls in the .10 - .30 band of the rule of thumb).
    r^2 = .2 x .2 = .04 = 4%
    Interpretation of r^2: 4% of the variation in serum calcium is accounted for by serum triglycerides (and vice versa).
    Causation cannot be inferred.
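The rule of thumb and the coefficient of determination can be sketched as two small Python helpers. The function names are hypothetical, and because the bands in the rule of thumb share their endpoints, this sketch assumes that each boundary value belongs to the higher band.

```python
def describe_r(r):
    """Label the size of a correlation coefficient using the rule-of-thumb bands."""
    size = abs(r)                      # the sign gives direction, not strength
    if size == 1:
        return "perfect"
    bands = [
        (0.90, "very high"),
        (0.70, "high"),
        (0.50, "moderate"),
        (0.30, "low"),
        (0.10, "very low"),
    ]
    for lower_bound, label in bands:   # assumption: boundaries belong to the higher band
        if size >= lower_bound:
            return label
    return "none to trivial"

def coefficient_of_determination(r):
    """Proportion of variance in one variable accounted for by the other."""
    return r ** 2

r = 0.20                               # the serum calcium / triglycerides example
r_squared = coefficient_of_determination(r)
print(describe_r(r), f"{r_squared:.0%}")
```

For r = 0.20 this reports the "very low" band and a coefficient of determination of 4%.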

Partial correlation
  A measure of the strength and direction of the association between two variables, controlling for one or more other variables.
  r ranges between +1 and -1.
  Assumptions
    Linear relationship between all pairs of variables
    Normal distribution
    No outliers

Multivariable regression
  Definition: statistical models that have one dependent (outcome) variable but include more than one independent variable.

Rationale of the regression equations
  Example: if we hypothesize that cholesterol level is predicted by age, gender, and diabetic status, and we would like to find the line (as in the scatter diagram) that best fits this relationship, we can write it as a straight-line equation:
    Y = a + bx
  In words:
    Cholesterol = age + gender + diabetes
  Not all of these predictors are equally important, so we give each predictor a weight (coefficient) relative to its importance:
    Cholesterol = (W1)age + (W2)gender + (W3)diabetes
  We also need a starting point for the calculation, so we add it to the equation:
    Cholesterol = starting point + (W1)age + (W2)gender + (W3)diabetes
  Because the prediction of the outcome is usually not perfect, we add an error term:
    Cholesterol = starting point + (W1)age + (W2)gender + (W3)diabetes + error term

Rationale of the regression equations
  The final formula can be expressed as
    $y = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + e$
  also written as
    $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$
  This equation is commonly referred to as the general linear model.
  Interpretation of the symbols
    a ($\beta_0$): intercept, i.e. where the line crosses the y-axis
    b ($\beta_1, \dots, \beta_k$): regression coefficients (slopes), i.e. the amount y changes each time x changes by 1 unit
    e ($\varepsilon$): error term (residual), i.e. the distance by which the actual value of y departs from the regression line

Rationale of the regression equations
  Estimation of the best estimates (least-squares method)
    The observed y and x are known, therefore e has to be calculated.
    Use different values of a and b to calculate the predicted y (y hat):
      $\hat{y} = a + b_1 x_1 + b_2 x_2 + b_3 x_3$
    e is then calculated as $y - \hat{y}$.
    The best estimate is the one with the least error, i.e. the one that minimizes the sum of the squared error terms:
      $\sum e^2 = \sum (y - \hat{y})^2$

  [Figure: Geometric interpretation of a regression line. Least squares regression line.]
  From: Chapter 8. Research Questions About Relationships among Variables, Basic & Clinical Biostatistics, 4e, 2004. Copyright 2016 McGraw-Hill Education. All rights reserved.
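The least-squares idea can be illustrated with a minimal Python sketch. To keep it short it assumes a single predictor (y = a + bx) rather than the three predictors in the formula above, uses the standard closed-form solution for that case, and works on made-up data; the function names are hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for y = a + b*x (one predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx                 # slope
    a = mean_y - b * mean_x       # intercept
    return a, b

def sum_squared_errors(a, b, x, y):
    """Sum of squared differences between observed y and predicted y hat."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

x = [1, 2, 3, 4, 5]               # made-up predictor values
y = [2.1, 3.9, 6.2, 7.8, 10.1]    # made-up outcome values

a, b = fit_line(x, y)
best = sum_squared_errors(a, b, x, y)

# Any other line gives a larger sum of squared errors than the fitted one.
worse_intercept = sum_squared_errors(a + 0.1, b, x, y)
worse_slope = sum_squared_errors(a, b + 0.1, x, y)
print(round(a, 2), round(b, 2), best < worse_intercept, best < worse_slope)
```

Statistical software applies the same criterion to several predictors at once; the single-predictor case is used here only because its solution fits in a few lines.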

Applications
  Test for interaction
  Adjust for confounding
  Predict future values of y given x

General linear models
  The types of models described by the previous equation are referred to as general linear models.
    General, because they can accommodate different types of y and/or x.
    Linear, because the model is a linear combination of the x terms.
  Commonly used methods
    Linear regression
    Logistic regression
    Survival analysis

Readings and resources
  Dawson, B. and Trapp, R. G. (2004). Basic and Clinical Biostatistics (4th edition). New York: McGraw-Hill. Chapter 8, p. 190-220; Chapter 9, p. 221-244; Chapter 10, p. 245-263.
  Katz, D. L. et al. Jekel's Epidemiology, Biostatistics, Preventive Medicine, and Public Health (4th edition). Chapter 11, p. 147-151; Chapter 13, p. 163-170.

Statistics in medicine
Lecture 4, part 2: Correlation and multiple regression
Fatma Shebl, MD, MS, MPH, PhD
Assistant Professor, Chronic Disease Epidemiology Department
Yale School of Public Health
Fatma.shebl@yale.edu

Linear regression
  A general linear model in which the dependent variable is a continuous variable.
  Types
    Single continuous predictor: simple linear regression
    Multiple continuous predictors: multiple linear regression
    Single categorical predictor: one-way ANOVA
    Multiple categorical predictors: N-way ANOVA
    Some categorical and some continuous predictors: analysis of covariance (ANCOVA)
  Assumptions
    Linearity: the relation between each independent variable and the dependent variable is linear
    Independence: the values of Y are independent
    Homogeneity: the variance of Y is equal across the range of X

Linear regression: Steps
  Build up the model
  Assess model fit
  Interpret the regression coefficient

Build up the model
  The most common method is stepwise (it is automated in most programs).
    Start with one variable in the model (the main predictor, if one is hypothesized).
    Add another variable.
    Keep adding variables to the list of variables already in the model.
    Use a stopping criterion such as: the increase in R^2 is < .01.

Build up the model: R^2
  A measure of how much of the variation of the outcome is accounted for by the explanatory variables.
  Range 0-1
    0: no variance accounted for
    1: all the variance (100%) accounted for

Assess model fit
  Residuals, the part of Y that is not explained by X, can be used to assess the model fit.
  Plot the residuals (on the Y axis) versus X.
  The mean of the residuals is zero; therefore, if the model fits the data, the residuals and X should not be correlated.
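The two fit measures described above, R^2 and the residuals, can be sketched in Python as follows. The data are made up, and the intercept 0.05 and slope 1.99 are assumed to be the least-squares estimates for these particular data (they are not taken from the lecture).

```python
x = [1, 2, 3, 4, 5]                 # made-up predictor values
y = [2.1, 3.9, 6.2, 7.8, 10.1]      # made-up outcome values
a, b = 0.05, 1.99                   # assumed least-squares intercept and slope for these data

predicted = [a + b * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]   # part of y not explained by x

mean_y = sum(y) / len(y)
mean_x = sum(x) / len(x)
sse = sum(e ** 2 for e in residuals)                    # unexplained variation
sst = sum((yi - mean_y) ** 2 for yi in y)               # total variation
r_squared = 1 - sse / sst                               # proportion of variation explained

mean_residual = sum(residuals) / len(residuals)
# Cross-product of residuals with x; zero means the residuals are uncorrelated with x.
residual_x_cross = sum((xi - mean_x) * e for xi, e in zip(x, residuals))

print(round(r_squared, 4), round(mean_residual, 10), round(residual_x_cross, 10))
```

In practice the residuals would also be plotted against X, since a curved pattern can be present even when the mean residual and the linear correlation with X are both zero.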

Assess model fit
  [Figure: Illustration of analysis of residuals. A: Linear relationship between X and Y. B: Residuals versus values of X for the relation in part A. C: Curvilinear relationship between X and Y. D: Residuals versus values of X for the relation in part C.]
  Legend: Good fit: the residuals form a random scatter around the zero line.
  From: Chapter 8. Research Questions About Relationships among Variables, Basic & Clinical Biostatistics, 4e, 2004. Copyright 2016 McGraw-Hill Education. All rights reserved.

Interpret the regression coefficient
  The intercept: the expected value of Y when all X = zero.
    If X cannot be zero, the intercept is not meaningful.
    If the interest is in the relationship between X and Y, the intercept is not of interest and will not affect the conclusion.
    If the interest is in the prediction of Y from X, then X has to be re-scaled so that the intercept is the expected value of Y at the chosen X value.

Interpret the regression coefficient
  The slope:
    If X is continuous: the change in Y for a one-unit increase in X, holding the other variables (other X's) constant.
    If X is categorical: the mean difference in Y between one category and the reference category of X, holding the other variables (other X's) constant.

Interpret the regression coefficient
  Compare the test statistic (t) with the critical value of t for the relevant df.
  The p value:
    Null hypothesis: the coefficient (intercept or slope) = zero.
    If p < the predetermined significance level, reject the null hypothesis.

Linear regression: an example and interpretation
  A study was conducted to examine the association between insulin sensitivity scores (outcome) and BMI. The resulting regression equation was Y = 1.5817 - 0.0433X. Interpret the terms in the equation.
    Y: predicted insulin sensitivity
    1.5817: the intercept, i.e. patients with zero BMI (unrealistic) have an insulin sensitivity of 1.5817
    X: observed BMI value
    -0.0433: the slope, i.e. when BMI increases by 1 unit, predicted insulin sensitivity decreases by 0.0433
  From: Basic & Clinical Biostatistics, 4e, 2004

Logistic regression
  A general linear model in which the dependent variable is a nominal/categorical variable.
  Types
    Commonly used for a dichotomous dependent variable
    Can be used for a multinomial dependent variable

Logistic regression
  Uses a mathematical function (the logit) to transform the regression so that the predicted probability of y is limited to (0,1):
    $$\operatorname{logit}\big(p(y = 1 \mid x_1, \dots, x_k)\big) = \log\frac{p}{1 - p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k$$
  This translates into the probability of the dependent variable as an exponential function of the independent variables:
    $$p(y = 1) = \frac{1}{1 + \exp[-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k)]}$$
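The two logistic formulas above are inverses of each other, which a short Python sketch can make concrete. The function names and the coefficients b0 = -4.0 and b1 = 0.1 are hypothetical and chosen only for illustration.

```python
import math

def inverse_logit(linear_predictor):
    """Probability that y = 1 for a given value of b0 + b1*x1 + ... + bk*xk."""
    return 1.0 / (1.0 + math.exp(-linear_predictor))

def logit(p):
    """Log odds of a probability p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

b0, b1 = -4.0, 0.1                  # hypothetical intercept and slope
for x in (0, 40, 100):
    p = inverse_logit(b0 + b1 * x)  # always strictly between 0 and 1
    print(x, round(p, 3))
```

However large or small the linear predictor becomes, the resulting probability stays between 0 and 1, which is the reason for using the logit rather than modelling the probability directly.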

Logistic regression: Steps
  Build up the model
    Similar to linear regression
  Assess model fit
    Hosmer and Lemeshow's goodness-of-fit test: a p value > .05 indicates an acceptable fit
  Interpret the regression coefficient

Interpret the regression coefficient
  exp($\beta_0$): the odds that y = 1, given x = 0
  exp($\beta_1$):
    If x is categorical: the odds ratio of y = 1 in one category of x compared to the reference category of x, holding the other variables (other x's) constant.
    If x is continuous: the multiplicative change in the odds of y = 1 (the odds ratio) for a one-unit increase in x, holding the other variables (other x's) constant.

Practical coding issues
  Interpretation of the results depends on how you code your data, therefore:
    It is important to check how the outcome is coded, i.e. which level is coded as 1 and which level is coded as 2 (or 0).
    It is important to check how the predictors are coded.
    Common practice is to code binary predictors as 0, 1.
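The interpretation of exp(β0) and exp(β1) given above can be verified with a small Python sketch. The coefficient values are hypothetical (β1 is set to log 2 so that the odds ratio comes out as 2), and the binary predictor is assumed to be coded 0/1 as recommended.

```python
import math

def odds(b0, b1, x):
    """Odds that y = 1 under a one-predictor logistic model with log odds b0 + b1*x."""
    return math.exp(b0 + b1 * x)

b0 = -2.0                 # hypothetical intercept (log odds when x = 0)
b1 = math.log(2.0)        # hypothetical slope, chosen so that the odds ratio is 2

baseline_odds = math.exp(b0)                      # odds that y = 1 when x = 0
odds_ratio = math.exp(b1)                         # odds ratio for x = 1 versus x = 0
ratio_check = odds(b0, b1, 1) / odds(b0, b1, 0)   # same quantity computed from the odds

print(round(baseline_odds, 4), round(odds_ratio, 4), round(ratio_check, 4))
```

If the outcome or the predictor were coded the other way round, the sign of the coefficient would flip and the odds ratio would become its reciprocal, which is why the coding has to be checked before interpreting the output.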

Logistic regression: an example and interpretation
  Blood alcohol concentration (BAC) > 50 mg/dL was examined among men with unintentional injury who were admitted to the emergency room. Predictors of BAC were daytime, weekday, being Caucasian, and age of 40. The reported ORs (95% CI) for Caucasian and age of 40 were 1.32 (1.06-1.65) and 0.89 (0.72-1.10). Interpret the results.
  Answer
    Caucasians were significantly more likely to have elevated BAC than other races (the 95% CI for the OR excludes 1).
    Age did not significantly predict elevated BAC (the 95% CI for the OR includes 1).
  From: Basic & Clinical Biostatistics, 4e, 2004

Survival analysis
  Definition: the statistical methods for analyzing survival data when there are censored observations.
  Censored observation: an observation whose value is unknown, generally because the subject has not been in the study long enough for the outcome of interest, such as death, to occur.

Survival analysis
  Common methods of summarization and presentation
    Person-time
    Life-tables

Survival analysis: Person-time
  Person-time is the length of follow-up.
    Example: if two subjects were followed, one for 2 years and the second for 1 year, then the total person-time is 3 person-years.
  Can be used to calculate incidence density.
    Incidence density is the number of events divided by the total person-time.
  A useful method if the event can be recurrent.

Survival analysis: Life-tables (covered in the epi course)
  Two methods
    Kaplan-Meier method
    Actuarial method
  Requirements
    Date of entry
    Date of withdrawal
    Cause of withdrawal
      Death
      Loss to follow-up

Survival analysis
  Common methods to test significance in survival analysis
    Bi-variate
      Logrank test
      Mantel-Haenszel chi-square test
    Cox's proportional hazard model

Cox proportional hazard model
  A regression model for censored outcome data.
  Data are said to be censored in case of:
    Loss to follow-up
    End of the study
  The dependent variable is the survival time (time to event):
    $$h(t, x_1, x_2, \dots, x_k) = h_0(t)\, e^{b_1 x_1 + b_2 x_2 + \dots + b_k x_k}$$
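The hazard formula above can be evaluated directly. The following Python sketch uses hypothetical coefficients, covariate values, and baseline hazards (none of them come from the lecture) to show that the ratio of two hazards does not depend on the baseline hazard h0(t), which is what "proportional hazards" refers to.

```python
import math

def hazard(baseline_hazard, coefficients, covariates):
    """Hazard h(t, x) = h0(t) * exp(b1*x1 + ... + bk*xk) at one time point."""
    linear_predictor = sum(b * x for b, x in zip(coefficients, covariates))
    return baseline_hazard * math.exp(linear_predictor)

coefficients = [math.log(2.0), 0.03]   # hypothetical: exposure (0/1) and age in years
exposed = [1, 60]                      # exposed subject aged 60
unexposed = [0, 60]                    # unexposed subject of the same age

hazard_ratios = []
for h0 in (0.01, 0.05):                # hypothetical baseline hazards at two time points
    ratio = hazard(h0, coefficients, exposed) / hazard(h0, coefficients, unexposed)
    hazard_ratios.append(ratio)

print([round(hr, 4) for hr in hazard_ratios])
```

The baseline hazard cancels in the ratio, so the hazard ratio for exposure equals exp(b1) at every time point, here 2 by construction.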

Cox proportional hazard model
  Answers the question: what is the likelihood of the event at a particular time (i.e. dying in the next interval), given survival up to that time and given a set of independent variables?
  In other words, it answers the question of what the risk of an event (such as death) is at a given time, given that it has not occurred until that time.

Cox proportional hazard model
  Allows estimating the relative risk (also called the hazard ratio):
    The ratio of the risk of the event at a given time in the exposed to the risk in the unexposed.
  Assessing the assumption of proportional hazards is beyond this class.