

INTRODUCTION TO CLINICAL RESEARCH
Introduction to Linear Regression
Karen Bandeen-Roche, Ph.D.
July 17, 2012

Acknowledgements
  Marie Diener-West
  Rick Thompson
  ICTR Leadership / Team
JHU Intro to Clinical Research

Outline
1. Regression: Studying association between (health) outcomes and (health) determinants
2. Correlation
3. Linear regression: Characterizing relationships
4. Linear regression: Prediction
5. Future topics: multiple linear regression, assumptions, complex relationships

Introduction
30,000-foot purpose: Study association of continuously measured health outcomes and health determinants
Continuously measured: no gaps
  Total lung capacity (l) and height (m)
  Birthweight (g) and gestational age (mos)
  Systolic BP (mm Hg) and salt intake (g)
  Systolic BP (mm Hg) and drug (trt, placebo)

Introduction
Association
  Connection
  Determinant predicts the outcome
A query: Do taller people tend to have a larger total lung capacity (l)?

Example: Association of total lung capacity with height
Study: 32 heart lung transplant recipients aged 11-59 years

    . list tlc height age in 1/10

         +---------------------+
         |  tlc   height   age |
         |---------------------|
      1. | 3.41      138    11 |
      2. |  3.4      149    35 |
      3. | 8.05      162    20 |
      4. | 5.73      160    23 |
      5. |  4.1      157    16 |
         |---------------------|
      6. | 5.44      166    40 |
      7. |  7.2      177    39 |
      8. |    6      173    29 |
      9. | 4.55      152    16 |
     10. | 4.83      177    35 |
         +---------------------+

Introduction
Two analyses to study association of continuously measured health outcomes and health determinants
  Correlation analysis: Concerned with measuring the strength and direction of the association between variables. The correlation of X and Y (Y and X).
  Linear regression: Concerned with predicting the value of one variable based on (given) the value of the other variable. The regression of Y on X.

Correlation Analysis
Some specific names for "correlation" in one's data:
  r
  Sample correlation coefficient
  Pearson correlation coefficient
  Product moment correlation coefficient

Correlation Analysis
Characterizes the extent of linear relationship between two variables, and the direction
  How closely does a scatterplot of the two variables produce a non-flat straight line?
    Exactly: r = 1 or -1
    Not at all (e.g. flat relationship): r = 0
    $-1 \le r \le 1$
  Does one variable tend to increase as the other increases (r > 0), or decrease as the other increases (r < 0)?
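As a rough illustration of the sample (Pearson) correlation coefficient named above, the sketch below computes r from its standard product-moment definition. It uses only the first 10 of the 32 patients (the rows listed on the earlier slide), so the value it prints will not match the full-sample correlation reported later in the lecture. The function name `pearson_r` is made up for this example.

```python
import math

# First 10 of the 32 patients, copied from the "list tlc height age in 1/10"
# output. Assumption: this subset is for illustration only; its r differs
# from the full-sample value quoted in the lecture.
height = [138, 149, 162, 160, 157, 166, 177, 173, 152, 177]
tlc = [3.41, 3.4, 8.05, 5.73, 4.1, 5.44, 7.2, 6.0, 4.55, 4.83]

def pearson_r(x, y):
    """Sample correlation: sum of cross-products over the root of the
    product of the two sums of squares (all taken about the means)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(height, tlc)
print(f"r (first 10 patients only) = {r:.3f}")
```

The function is symmetric in its arguments, which is why the slide speaks of the correlation of X and Y (or Y and X) without distinguishing an outcome from a predictor.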

Types of Correlation
[Figure: types of correlation]

Examples of Relationships and Correlations
[Figure: examples of relationships and correlations]

Correlation: Lung Capacity Example
r = .865
[Figure: Scatterplot of TLC by Height; x-axis: height]

FYI: Sample Correlation Formula
Heuristic: If I draw a straight line through the middle of a scatterplot of y versus x, r divides the standard deviation of the heights of points on the line by the sd of the heights of the original points

Correlation: Closing Remarks
  The value of r is independent of the units used to measure the variables
  The value of r can be substantially influenced by a small fraction of outliers
  The value of r considered "large" varies over science disciplines
    Physics: r = 0.9
    Biology: r = 0.5
    Sociology: r = 0.2
  r is a "guess" at a population analog

Linear regression
Aims to predict the value of a health outcome, Y, based on the value of an explanatory variable, X.
  What is the relationship between average Y and X?
    The analysis "models" this as a line
    We care about "slope": size, direction
    Slope = 0 corresponds to "no association"
  How precisely can we predict a given person's Y with his/her X?
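The formula on the "FYI: Sample Correlation Formula" slide did not survive transcription. For reference, the standard product-moment definition, together with the reason the heuristic works, can be written as follows. The symbols $\bar{x}$, $\bar{y}$, $s_x$, $s_y$ (sample means and standard deviations) and $\hat{y}_i$ (height of the least squares line at $x_i$) are notation assumed for this sketch.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}}

% Heuristic: the least squares line has slope b_1 = r \, s_y / s_x, so the
% points on the line, \hat{y}_i = b_0 + b_1 x_i, have standard deviation
\operatorname{sd}(\hat{y}) = |b_1| \, s_x = |r| \, s_y
\quad\Longrightarrow\quad
|r| = \frac{\operatorname{sd}(\hat{y})}{\operatorname{sd}(y)}
```

The ratio gives the magnitude of r; the sign of r is the sign of the slope.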

Linear regression: Terminology
  Health outcome, Y
    Dependent variable
    Response variable
  Explanatory variable (predictor), X
    Independent variable
    Covariate

Linear regression - Relationship
[Figure: line of mean Y against X, marking an observed $y_i$, the mean of Y at $x_i$ ($\hat{y}_i$), the error $\varepsilon_i$, the slope $\beta_1 = \Delta y / \Delta x$, and the intercept $\beta_0$]
Model: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

Linear regression - Relationship
In words:
  Intercept $\beta_0$ is mean Y at X = 0
    mean lung capacity among persons with 0 height
    Recommendation: "Center"
      Create new X* = (X - 165), regress Y on X*
      Then: $\beta_0$ is mean lung capacity among persons 165 cm
  Slope $\beta_1$ is change in mean Y per 1 unit difference in X
    difference in mean lung capacity comparing persons who differ by 1 cm in height
    irrespective of centering
    Measures association (no association if slope = 0)
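The centering recommendation can be checked with one line of algebra. In the sketch below, $\beta_0^{*}$ is a label introduced here for the intercept of the centered model; it does not appear on the slides.

```latex
\beta_0 + \beta_1 X
  = \beta_0 + \beta_1 \cdot 165 + \beta_1 (X - 165)
  = \underbrace{(\beta_0 + 165\,\beta_1)}_{\beta_0^{*}} + \beta_1 X^{*},
\qquad X^{*} = X - 165
```

So centering replaces the intercept by the mean of Y at X = 165 and leaves the slope $\beta_1$ unchanged, which is why the slope interpretation holds irrespective of centering.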

Linear regression: Sample inference
We develop best guesses at $\beta_0$, $\beta_1$ using our data
  Step 1: Find the "least squares" line
    Tracks through the middle of the data as best possible
    Has intercept $b_0$ and slope $b_1$ that make the sum of $[Y_i - (b_0 + b_1 X_i)]^2$ smallest
  Step 2: Use the slope and intercept of the least squares line as best guesses
    Can develop hypothesis tests involving $\beta_1$, $\beta_0$, using $b_1$, $b_0$
    Can develop confidence intervals for $\beta_1$, $\beta_0$, using $b_1$, $b_0$

Linear regression: Lung capacity data
[Figure: Scatterplot of TLC by Height; x-axis: height]

Linear regression: Lung capacity data
In STATA - "regress" command. Syntax: regress yvar xvar

    . regress tlc height

          Source |       SS       df       MS              Number of obs =      32
    -------------+------------------------------           F(  1,    30) =   89.12
           Model |  93.7825029     1  93.7825029           Prob > F      =  0.0000
        Residual |  31.5694921    30   1.0523164           R-squared     =  0.7482
    -------------+------------------------------           Adj R-squared =  0.7398
           Total |  125.351995    31  4.04361274           Root MSE      =  1.0258

    ------------------------------------------------------------------------------
             tlc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          height |   .1417377    .015014     9.44   0.000     .1110749    .1724004
           _cons |  -17.10484   2.516234    -6.80   0.000    -22.24367     -11.966
    ------------------------------------------------------------------------------

$b_0$ (_cons): TLC of -17.1 liters among persons of height = 0
If centered at 165 cm: TLC of 6.3 liters among persons of height = 165
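Step 1 above can be sketched in a few lines. The code below uses the textbook closed-form solution for the least squares intercept and slope and then confirms that nudging either value away from the solution makes the sum of squared residuals larger. It runs on the first 10 listed patients only, so its estimates will not reproduce the full-sample Stata coefficients. The function names `least_squares` and `sse` are made up for this example.

```python
# First 10 of the 32 patients from the slide listing. Assumption: a subset
# for illustration, so b0 and b1 here differ from the Stata output (n = 32).
height = [138, 149, 162, 160, 157, 166, 177, 173, 152, 177]
tlc = [3.41, 3.4, 8.05, 5.73, 4.1, 5.44, 7.2, 6.0, 4.55, 4.83]

def least_squares(x, y):
    """Return (b0, b1) minimizing the sum of [y_i - (b0 + b1 * x_i)]^2."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return b0, b1

def sse(b0, b1, x, y):
    """Sum of squared vertical distances from the points to the line."""
    return sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))

b0, b1 = least_squares(height, tlc)
best = sse(b0, b1, height, tlc)

# Compare against a few nearby lines; each should fit worse.
nearby = [(b0 + 0.5, b1), (b0 - 0.5, b1), (b0, b1 + 0.01), (b0, b1 - 0.01)]
worse = [sse(c0, c1, height, tlc) for c0, c1 in nearby]

print(f"b0 = {b0:.3f}, b1 = {b1:.4f}, SSE = {best:.3f}")
print("every nearby line has larger SSE:", all(w > best for w in worse))
```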

Linear regression: Lung capacity data
In STATA - "regress" command. Syntax: regress yvar xvar
(Same "regress tlc height" output as on the preceding slide; the coefficient table is repeated here.)

    ------------------------------------------------------------------------------
             tlc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          height |   .1417377    .015014     9.44   0.000     .1110749    .1724004
           _cons |  -17.10484   2.516234    -6.80   0.000    -22.24367     -11.966
    ------------------------------------------------------------------------------

$b_1$ (height): On average, TLC increases by 0.142 liters per cm increase in height.

Linear regression: Lung capacity data
Inference: the p-value tests the null hypothesis that the coefficient = 0.
  The p-value for the slope is very small
  We reject the null hypothesis of 0 slope (no linear relationship).
  The data support a tendency for TLC to increase with height.

Linear regression: Lung capacity data
Inference: Confidence intervals for the coefficients; these both exclude 0.
  We are 95% confident that the interval (0.111, 0.172) includes the true slope.
  Data are consistent with an average per-cm of height increase in TLC ranging between 0.111 and 0.172.
  The data support a tendency for TLC to increase with height.
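The numbers in the regression output are tied together by simple arithmetic, which the sketch below re-derives from the reported values. The inputs are copied from the Stata output; the critical value 2.0423 (the 97.5th percentile of the t distribution with 30 degrees of freedom) is taken from standard t tables rather than from the slides, and the variable names are made up for this example.

```python
# Values copied from the "regress tlc height" output.
coef_height = 0.1417377
se_height = 0.015014
ms_model = 93.7825029
ms_residual = 1.0523164

# Assumption: critical value from standard t tables (df = 30, two-sided 95%).
T_CRIT_30 = 2.0423

# t statistic for H0: slope = 0 is the estimate divided by its standard error.
t_stat = coef_height / se_height

# 95% confidence interval: estimate +/- critical value * standard error.
ci_low = coef_height - T_CRIT_30 * se_height
ci_high = coef_height + T_CRIT_30 * se_height

# Overall F statistic; with a single predictor it equals t squared
# (up to rounding in the reported inputs).
f_stat = ms_model / ms_residual

print(f"t = {t_stat:.2f}")
print(f"95% CI for slope = ({ci_low:.3f}, {ci_high:.3f})")
print(f"F = {f_stat:.2f}, t^2 = {t_stat ** 2:.2f}")
```

The interval stays well away from 0, which is the same conclusion as the very small p-value: the two-sided test at the 5% level rejects exactly when the 95% interval excludes 0.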

Linear regression
Aims to predict the value of a health outcome, Y, based on the value of an explanatory variable, X.
  What is the relationship between average Y and X?
    The analysis "models" this as a line
    We care about "slope": size, direction
    Slope = 0 corresponds to "no association"
  How precisely can we predict a given person's Y with his/her X?

Linear regression - Prediction
What is the linear regression prediction of a given person's Y with his/her X?
  Plug X into the regression equation
  The prediction: $\hat{Y} = b_0 + b_1 X$

Linear regression - Prediction
[Figure: fitted line against X, marking an observed $y_i$, its prediction $\hat{y}_i$, the residual $\varepsilon_i$, the slope $b_1 = \Delta y / \Delta x$, and the intercept $b_0$]
Data model: $Y_i = b_0 + b_1 X_i + \varepsilon_i$
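"Plug X into the regression equation" can be made concrete with the coefficients reported in the Stata output. The sketch below predicts TLC for two heights and computes one residual, using patient 7 from the data listing (height 177 cm, observed TLC 7.2 l). The function name `predict_tlc` is made up for this example.

```python
# Coefficients copied from the "regress tlc height" output.
B0 = -17.10484   # _cons (intercept)
B1 = 0.1417377   # height (slope, liters per cm)

def predict_tlc(height_cm):
    """Predicted total lung capacity (liters) at a given height (cm)."""
    return B0 + B1 * height_cm

# Prediction at 165 cm; this is also the intercept of the centered model.
pred_165 = predict_tlc(165)

# Patient 7 in the listing: height 177 cm, observed TLC 7.2 liters.
pred_177 = predict_tlc(177)
residual_7 = 7.2 - pred_177   # residual = data - prediction

print(f"predicted TLC at 165 cm: {pred_165:.2f} l")
print(f"predicted TLC at 177 cm: {pred_177:.2f} l")
print(f"residual for patient 7:  {residual_7:.2f} l")
```

The prediction at 165 cm agrees with the centered intercept of about 6.3 liters quoted with the regression output.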

Linear regression - Prediction
What is the linear regression prediction of a given person's Y with his/her X?
  Plug X into the regression equation
  The prediction: $\hat{Y} = b_0 + b_1 X$
  The "residual": $\varepsilon$ = data - prediction = $Y - \hat{Y}$
  Least squares minimizes the sum of squared residuals, i.e. makes predicted Y's as close to observed Y's as possible

Linear regression - Prediction
How precisely does $\hat{Y}$ predict Y?
  Conventional measure: R-squared
    = Variance of $\hat{Y}$ / Variance of Y
    = Proportion of Y variance "explained" by regression
    = squared sample correlation between Y and $\hat{Y}$
  In examples so far:
    = squared sample correlation between Y and X

Linear prediction: Lung capacity data
(Same "regress tlc height" output as shown earlier: Model SS = 93.7825029, Residual SS = 31.5694921, Total SS = 125.351995, R-squared = 0.7482, Adj R-squared = 0.7398, Root MSE = 1.0258.)
  R-squared = 0.748: 74.8% of variation in TLC is characterized by the regression of TLC on height.
  This corresponds to a correlation of sqrt(0.748) = .865 between predictions and actual TLCs.
  This is a precise prediction.
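The fit summaries in the Stata header can be recomputed from the sums of squares in the ANOVA table, as the sketch below shows. All inputs are copied from the output; the adjusted R-squared line uses the usual definition (one minus the ratio of residual to total mean squares), which is not spelled out on the slides.

```python
import math

# Sums of squares and degrees of freedom copied from the regress output.
ss_model, df_model = 93.7825029, 1
ss_residual, df_residual = 31.5694921, 30
ss_total, df_total = 125.351995, 31

# R-squared: share of the total variation in TLC captured by the fitted line.
r_squared = ss_model / ss_total

# With one predictor, the square root is the absolute sample correlation.
r = math.sqrt(r_squared)

# Root MSE: typical size of a residual, in liters.
root_mse = math.sqrt(ss_residual / df_residual)

# Adjusted R-squared (standard definition, assumed here).
adj_r_squared = 1 - (ss_residual / df_residual) / (ss_total / df_total)

print(f"R-squared     = {r_squared:.4f}")
print(f"correlation   = {r:.3f}")
print(f"Root MSE      = {root_mse:.4f}")
print(f"Adj R-squared = {adj_r_squared:.4f}")
```

The recovered correlation matches the r = .865 reported on the lung capacity scatterplot slide, as expected when there is a single predictor.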

Linear regression - Prediction
Cautionary comment: In real life you'd want to evaluate the precision of your predictions in a sample different from the one with which you built your prediction model
  "Cross-validation"

Another way to think of SLR: t-test generalization
To study how mean TLC varies with height:
  Could dichotomize height at the median and compare TLC between the two height groups using a two-sample t-test
  [Could also multi-categorize patients into four height groups (height quartiles), then do ANOVA to test for differences in TLC]

Lung capacity example: two height groups
[Figure: Total Lung Capacity by Height; TLC (Total Lung Capacity) by height category: Median or Below, Above Median]
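The link between the two-group comparison and simple linear regression can be seen numerically: regressing Y on a 0/1 indicator gives a slope equal to the difference in group means and an intercept equal to the mean of the X = 0 group. The sketch below illustrates this on the first 10 listed patients only, with the median taken within that subset, so the group means are not those of the full study. The function name `least_squares` is made up for this example.

```python
from statistics import mean, median

# First 10 of the 32 patients from the slide listing. Assumption: a subset
# for illustration; the median and group means differ from the full data.
height = [138, 149, 162, 160, 157, 166, 177, 173, 152, 177]
tlc = [3.41, 3.4, 8.05, 5.73, 4.1, 5.44, 7.2, 6.0, 4.55, 4.83]

# Dichotomize height at the (subset) median: X = 1 if above, else 0.
cut = median(height)
x = [1 if h > cut else 0 for h in height]

mean_above = mean([t for t, g in zip(tlc, x) if g == 1])
mean_below = mean([t for t, g in zip(tlc, x) if g == 0])

def least_squares(xs, ys):
    """Closed-form simple linear regression; returns (b0, b1)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    b1 = sxy / sxx
    return my - b1 * mx, b1

b0, b1 = least_squares(x, tlc)

print(f"difference in group means = {mean_above - mean_below:.3f}")
print(f"regression slope on X     = {b1:.3f}")
print(f"mean of X=0 group = {mean_below:.3f}, intercept = {b0:.3f}")
```

Testing whether that slope is 0 is then the regression counterpart of the two-sample t-test (the equal-variance version), which is the sense in which regression generalizes the t-test.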

Lung capacity example: two height groups
Could ~replicate this analysis with SLR of TLC on X = 1 if height > median and X = 0 otherwise

Regression with more than one predictor
"Multiple" linear regression
  More than one X variable (ex.: height, age)
  With only 1 X we have "simple" linear regression
$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \varepsilon_i$
  Intercept $\beta_0$ is mean Y for persons with all Xs = 0
  Slope $\beta_k$ is change in mean Y per 1 unit difference in $X_k$ among persons identical on all other Xs

Regression with more than one predictor
Slope $\beta_k$ is change in mean Y per 1 unit difference in $X_k$ among persons identical on all other Xs
  i.e. holding all other Xs constant
  i.e. "controlling for" all other Xs
Fitted slopes for a given predictor in a simple linear regression and a multiple linear regression controlling for other predictors do NOT have to be the same
  We'll learn why in the lecture on confounding
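To make the multiple regression model concrete, the sketch below fits TLC on height alone and then on height and age together by solving the least squares normal equations, so the two height slopes can be compared. It uses only the first 10 listed patients, so the estimates are illustrative and are not results from the lecture. The helper names `solve` and `fit` are made up for this example, and the small Gaussian elimination routine stands in for what a statistics package would do internally.

```python
# First 10 of the 32 patients from the slide listing. Assumption: a subset
# for illustration only; these are not the lecture's estimates.
height = [138, 149, 162, 160, 157, 166, 177, 173, 152, 177]
age = [11, 35, 20, 23, 16, 40, 39, 29, 16, 35]
tlc = [3.41, 3.4, 8.05, 5.73, 4.1, 5.44, 7.2, 6.0, 4.55, 4.83]

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit(columns, y):
    """Least squares coefficients [b0, b1, ...] for y on an intercept
    plus the given predictor columns, via the normal equations."""
    rows = [[1.0] + [float(col[i]) for col in columns] for i in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(rows, y)) for j in range(p)]
    return solve(xtx, xty)

simple = fit([height], tlc)          # [b0, b_height]
multiple = fit([height, age], tlc)   # [b0, b_height, b_age]

print(f"height slope, simple regression:  {simple[1]:.4f}")
print(f"height slope, adjusting for age:  {multiple[1]:.4f}")
print(f"age slope, adjusting for height:  {multiple[2]:.4f}")
```

The second height slope compares people who differ by 1 cm in height but have the same age, which is a different comparison from the first, so the two values need not agree.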

Assumptions
Most published regression analyses make statistical assumptions
  Why this matters: p-values and confidence intervals may be wrong, and coefficient interpretation may be obscure, if assumptions aren't approximately true
  Good research reports on analyses to check whether assumptions are met ("diagnostics", "residual analysis", "model checking/fit", etc.)

Linear Regression Assumptions
  Units are sampled independently (no connections)
  Posited model for Y-X relationship is correct
  Normally (Gaussian; bell-shaped) distributed responses for each X
  Variability of responses the same for all X

Linear Regression Assumptions
Assumptions well met:
[Figure: y and fitted values versus age; residuals versus age]

Linear Regression Assumptions
Non-normal responses per X
[Figure: y2 and fitted values versus age; residuals versus age]

Linear Regression Assumptions
Non-constant variability of responses per X
[Figure: y3 and fitted values versus age; residuals versus age]

Linear Regression Assumptions
Lung capacity example
[Figure: tlc and fitted values versus height; residuals versus height]

Types of relationships that can be studied
  ANOVA (multiple group differences)
  ANCOVA (different slopes per groups)
    Effect modification: lecture to come
  Curves (polynomials, broken arrows, more)
  Etc.

Main topics once again
1. Regression: Studying association between (health) outcomes and (health) determinants
2. Correlation
3. Linear regression: Characterizing relationships
4. Linear regression: Prediction
5. Future topics: assumptions, model checking, complex relationships