Correlation and simple linear regression S5

Basic medical statistics for clinical and experimental research: Correlation and simple linear regression S5
Katarzyna Jóźwiak, k.jozwiak@nki.nl
November 15, 2017
1/41

Introduction
Example: Brain size and body weight

Subject id   Brain size (MRI total pixel count per 10,000)   Weight (pounds)
1            81.69                                           118
2            96.54                                           172
3            92.88                                           146
4            90.49                                           134
5            95.55                                           172
6            83.39                                           118
7            92.41                                           155
...          ...                                             ...
24           79.06                                           122
25           98.00                                           190
2/41

Introduction
Example: Brain size and body weight
[Scatter plot of brain size versus body weight; Subjects 4 and 24 are highlighted.]
3/41

Relationship between two numerical variables
If a linear relationship between x and y appears to be reasonable from the scatter plot, we can take the next step and
1. Calculate Pearson's product moment correlation coefficient between x and y
   Measures how closely the data points on the scatter plot resemble a straight line
2. Perform a simple linear regression analysis
   Finds the equation of the line that best describes the relationship between the variables seen in a scatter plot
4/41

Correlation
The sample Pearson's product moment correlation coefficient (or correlation coefficient) between variables x and y is calculated as

r(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)

where:
{(x_i, y_i) : i = 1, ..., n} is a random sample of n observations on x and y,
x̄ and ȳ are the sample means of respectively x and y,
s_x and s_y are the sample standard deviations of respectively x and y.
5/41
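
As an illustration (not part of the original slides), here is a minimal Python sketch of this formula, using the first five subjects from the brain size and body weight table; the hand-rolled value can be checked against NumPy's built-in np.corrcoef.

```python
# A hand-rolled Pearson r, checked against NumPy's built-in.
import numpy as np

# First five subjects from the brain size / body weight table (slide 2)
brain  = np.array([81.69, 96.54, 92.88, 90.49, 95.55])
weight = np.array([118.0, 172.0, 146.0, 134.0, 172.0])

n = len(brain)
# Standardize each variable (ddof=1 gives the n-1 divisor for s_x, s_y)
zb = (brain - brain.mean()) / brain.std(ddof=1)
zw = (weight - weight.mean()) / weight.std(ddof=1)

# r(x, y) = 1/(n-1) * sum_i ((x_i - x_bar)/s_x) * ((y_i - y_bar)/s_y)
r = np.sum(zb * zw) / (n - 1)

print(r)
print(np.corrcoef(brain, weight)[0, 1])  # same value
```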

Correlation
Properties of r:
r estimates the true population correlation coefficient ρ.
r takes on any value between -1 and 1.
The magnitude of r indicates the strength of a linear relationship between x and y:
  r = -1 or 1 means perfect linear association.
  The closer r is to -1 or 1, the stronger the linear association (e.g. r = -0.1, weak association, vs r = 0.85, strong association).
  r = 0 indicates no linear association (but the association can be e.g. non-linear).
The sign of r indicates the direction of association:
  r > 0 implies a positive relationship, i.e. the two variables tend to move in the same direction.
  r < 0 implies a negative relationship, i.e. the two variables tend to move in opposite directions.
6/41

Correlation
Properties of r (cont'd):
r(ax + b, cy + d) = r(x, y), where a > 0, c > 0, and b and d are constants.
r(x, y) = r(y, x).
r ≠ 0 does not imply causation! Just because two variables are correlated does not necessarily mean that one causes the other!
r² is called the coefficient of determination:
  0 ≤ r² ≤ 1.
  It represents the proportion of total variation in one variable that is explained by the other.
  For example: a coefficient of determination between body weight and age of 0.60 means that 60% of the total variation in body weight is explained by age alone and the remaining 40% is explained by other factors.
7/41

Correlation
[Panel of scatter plots illustrating r = -1, r = 1, r = 0.8, r = -0.8, r = 0 (two patterns), 0 < r < 1 and -1 < r < 0.]
Don't interpret r without looking at the scatter plot!
8/41

Correlation
Hypothesis test for the population correlation coefficient ρ:
H0: ρ = 0 (there is no linear relationship between y and x)
H1: ρ ≠ 0 (there is a linear relationship between y and x)
Under H0, the test statistic

T = r \sqrt{\frac{n-2}{1-r^2}}

follows a Student-t distribution with n - 2 degrees of freedom.
This test assumes that the variables x and y are normally distributed.
9/41
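
A minimal sketch of this test in Python, plugging in r = 0.826 and n = 25 from the brain size example on the next slides; given the raw data, scipy.stats.pearsonr would report the same two-sided p-value.

```python
# t-test for H0: rho = 0, using the statistic from the slide.
import numpy as np
from scipy import stats

r, n = 0.826, 25  # from the brain size / body weight example

T = r * np.sqrt((n - 2) / (1 - r**2))  # test statistic (about 7.03 here)
p = 2 * stats.t.sf(abs(T), df=n - 2)   # two-sided p-value

print(T, p)
```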

Correlation
Example: Brain size and body weight
What is the magnitude and sign of the correlation coefficient between brain size and weight?
10/41

Correlation
Example: Brain size and body weight

Correlations
                             Weight    Brain
Weight   Pearson Correlation 1         .826**
         Sig. (2-tailed)               .000
         N                   25        25
Brain    Pearson Correlation .826**    1
         Sig. (2-tailed)     .000
         N                   25        25
**. Correlation is significant at the 0.01 level (2-tailed).
11/41

Pearson's product moment correlation coefficient measures the strength and direction of a linear association between x and y.
Simple linear regression finds an equation (mathematical model) that describes the relationship between the two variables, so that we can predict values of one variable using values of the other variable.
Unlike correlation, regression requires
a dependent variable y (outcome/response variable): the variable being predicted (always on the vertical or y-axis),
an independent variable x (explanatory/predictor variable): the variable used for prediction (always on the horizontal or x-axis).
12/41

Simple linear regression postulates that in the population

y = α + βx + ε,

where:
y is the dependent variable,
x is the independent variable,
α and β are parameters called the population regression coefficients:
  α is called the intercept or constant term,
  β is called the slope,
ε is the random error term.
13/41

[Scatter plot of y against x, for x values 1 to 5.]
14/41

[Plot of the population regression function E(y_x) = α + βx, with the mean of y marked at each x.]
E(y_{x_i}) is the mean value of y when x = x_i.
E(y_x) = α + βx is the population regression function.
15/41

[Plot of the line E(y_x) = α + βx, with the intercept α and slope β indicated; a 3-unit increase in x corresponds to a 3β change in the mean of y.]
α is the y-intercept of the population regression function, i.e. the mean value of y when x equals 0.
β is the slope of the population regression function, i.e. the mean (or expected) change in y associated with a 1-unit increase in the value of x.
cβ is the mean change in y for a c-unit increase in the value of x.
α and β are estimated from the sample data, usually using the least squares method.
16/41

[Plot of the fitted line ŷ = a + bx, showing the residual e_i = y_i - ŷ_i as the vertical distance between the observed y_i and the fitted ŷ_i.]
The least squares method chooses a and b (estimates for α and β) to minimize the sum of the squares of the residuals:

\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} [y_i - (a + b x_i)]^2

17/41

The least squares estimates for β and α are:

b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \quad \text{and} \quad a = \bar{y} - b\bar{x},

where x̄ and ȳ are the respective sample means of x and y.
Note that:

b = r(x, y) \frac{s_y}{s_x},

where r(x, y) is the sample product moment correlation between x and y, and s_x and s_y are the sample standard deviations of x and y.
18/41
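
A minimal sketch of these estimates in Python, using an illustrative seven-subject subset of the brain size data (weight as x, brain size as y); the identity b = r(x, y) s_y/s_x is verified at the end.

```python
# Least-squares slope and intercept from the formulas above.
import numpy as np

x = np.array([118.0, 172.0, 146.0, 134.0, 172.0, 118.0, 155.0])  # weight
y = np.array([81.69, 96.54, 92.88, 90.49, 95.55, 83.39, 92.41])  # brain size

# b = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2); a = y_bar - b*x_bar
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

# Equivalent identity: b = r(x, y) * s_y / s_x
r = np.corrcoef(x, y)[0, 1]
b_alt = r * y.std(ddof=1) / x.std(ddof=1)

print(a, b, b_alt)  # b and b_alt agree
```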

Test of H0: β = 0 versus H1: β ≠ 0
1. t-test:
   Test statistic: T = b / SE(b), where SE(b) is the standard error of b calculated from the data.
   Under H0, T follows a Student-t distribution with n - 2 degrees of freedom.
2. F-test:
   Test statistic: F = (b / SE(b))^2 = T^2, where SE(b) and T are as above.
   Under H0, F follows an F distribution with 1 and n - 2 degrees of freedom.
The t-test and the F-test lead to the same outcome.
The test of zero intercept α is of less interest, unless x = 0 is meaningful.
19/41
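
A minimal sketch of the slope test, assuming the same small illustrative arrays as above; scipy.stats.linregress reports the slope, its standard error, and the two-sided p-value directly, from which T and F follow.

```python
# Slope test via scipy.stats.linregress (illustrative 7-subject subset).
import numpy as np
from scipy import stats

weight = np.array([118.0, 172.0, 146.0, 134.0, 172.0, 118.0, 155.0])
brain  = np.array([81.69, 96.54, 92.88, 90.49, 95.55, 83.39, 92.41])

res = stats.linregress(weight, brain)  # fits brain = intercept + slope * weight
T = res.slope / res.stderr             # t statistic for H0: beta = 0
F = T**2                               # equivalent F statistic (1, n-2 df)

print(res.slope, res.stderr, T, F, res.pvalue)
```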

Example: Brain size (MRI total pixel count per 10,000) and body weight (pounds)

Coefficients(a)
Model 1       Unstandardized Coefficients    Standardized Coefficients
              B        Std. Error            Beta      t        Sig.
(Constant)    62.334   3.845                           16.212   .000
Weight        .176     .025                  .826      7.030    .000
a. Dependent Variable: Brain

ANOVA(a)
Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   507.387          1    507.387       49.416   .000(b)
Residual     236.157          23   10.268
Total        743.544          24
a. Dependent Variable: Brain
b. Predictors: (Constant), Weight

Brain = 62.33 + 0.18 × Weight
20/41
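
As a quick worked use of the fitted equation (a hypothetical prediction, not from the slides): a subject weighing 150 pounds would have a predicted brain size of 62.33 + 0.18 × 150 = 89.33.

```python
# Prediction from the fitted line Brain = 62.33 + 0.18 * Weight.
def predict_brain(weight_lb: float) -> float:
    """Predicted brain size (MRI total pixel count per 10,000)."""
    return 62.33 + 0.18 * weight_lb

print(predict_brain(150))  # 89.33
```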

Example: Brain size (MRI total pixel count per 10,000) and body weight (pounds)
[Scatter plot of Brain versus Weight with the fitted line y = 62.33 + 0.18x.]
21/41

Example: Blood pressure (mmHg) and body weight (kg) in 20 patients with hypertension (1)
[Scatter plot of BP versus Weight.]
1. Daniel, W.W. and Cross, C.L. (2013). Biostatistics: A Foundation for Analysis in the Health Sciences, 10th edition.
22/41

Example: Blood pressure (mmHg) and body weight (kg) in 20 patients with hypertension

Coefficients(a)
Model 1       Unstandardized Coefficients
              B       Std. Error   Beta    t        Sig.
(Constant)    2.205   8.663                .255     .802
Weight        1.201   .093         .950    12.917   .000
a. Dependent Variable: BP

ANOVA(a)
Model 1      Sum of Squares   df   Mean Square   F         Sig.
Regression   505.472          1    505.472       166.859   .000(b)
Residual     54.528           18   3.029
Total        560.000          19
a. Dependent Variable: BP
b. Predictors: (Constant), Weight

BP = 2.21 + 1.20 × Weight
23/41

Example: Blood pressure (mmHg) and body weight (kg) in 20 patients with hypertension
[Scatter plot of BP versus Weight with the fitted line y = 2.21 + 1.2x.]
24/41

Standardized coefficients
Obtained by standardizing both y and x (i.e. converting them into z-scores) and re-running the regression.
The standardized intercept equals zero and the standardized slope for x equals the sample correlation coefficient.
[SPSS output repeated from slides 11 and 20: the Pearson correlation between Weight and Brain is .826, and the standardized coefficient (Beta) for Weight in the regression of Brain on Weight is also .826.]
25/41

Standardized coefficients
Obtained by standardizing both y and x (i.e. converting them into z-scores) and re-running the regression.
The standardized intercept equals zero and the standardized slope for x equals the sample correlation coefficient.
Standardized coefficients are of greater concern in multiple linear regression, where the predictors are expressed in different units.
Standardization removes the dependence of the regression coefficients on the units of measurement of y and the x's, so they can be meaningfully compared.
The larger the standardized coefficient (in absolute value), the greater the contribution of the respective variable to the prediction of y.
26/41
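
A minimal sketch of this standardization in Python (illustrative subset, not the full 25-subject sample): converting both variables to z-scores and re-fitting yields an intercept of essentially zero and a slope equal to the sample correlation.

```python
# Standardize both variables, re-fit: slope equals r, intercept ~ 0.
import numpy as np
from scipy import stats

weight = np.array([118.0, 172.0, 146.0, 134.0, 172.0, 118.0, 155.0])
brain  = np.array([81.69, 96.54, 92.88, 90.49, 95.55, 83.39, 92.41])

def z(v):
    """Convert to z-scores using the sample standard deviation."""
    return (v - v.mean()) / v.std(ddof=1)

res = stats.linregress(z(weight), z(brain))
print(res.intercept)                     # ~0
print(res.slope)                         # equals the correlation...
print(np.corrcoef(weight, brain)[0, 1])  # ...computed directly
```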

Linear regression is only appropriate when the following assumptions are satisfied:
1. Independence: the observations are independent, i.e. there is only one pair of observations per subject.
2. Linearity: the relationship between x and y is linear.
3. Constant variance: the variance of y is constant for all values of x.
4. Normality: y has a normal distribution.
27/41

Simple linear regression
Checking the linearity assumption:
1. Make a scatter plot of y versus x: the points should generally form a straight line.
2. Plot the residuals against the explanatory variable x: the points should present a random scatter around zero; there should be no systematic pattern.
[Residual plots illustrating linearity (random scatter around 0) versus lack of linearity (systematic pattern).]
28/41

Simple linear regression
Checking the constant variance assumption:
Make a residual plot, i.e. plot the residuals against the fitted values of y (ŷ_i = a + b x_i): the points should present a random scatter.
[Residual plots illustrating constant variance versus non-constant variance.]
29/41
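A minimal sketch of such a residual-versus-fitted plot in Python, under the same illustrative-data assumption as the earlier sketches.

```python
# Residuals against fitted values; look for a random scatter around 0.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

weight = np.array([118.0, 172.0, 146.0, 134.0, 172.0, 118.0, 155.0])
brain  = np.array([81.69, 96.54, 92.88, 90.49, 95.55, 83.39, 92.41])

res = stats.linregress(weight, brain)
fitted = res.intercept + res.slope * weight
residuals = brain - fitted

plt.scatter(fitted, residuals)
plt.axhline(0, linestyle="--")  # residuals should hover around this line
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```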

Example: Blood pressure and body weight
[Figure not captured in the transcription.]
30/41

Checking the normality assumption:
1. Draw a histogram of y or of the residuals and eyeball the result.
2. Make a normal probability plot (P-P plot) of the residuals, i.e. plot the expected cumulative probability of a normal distribution versus the observed cumulative probability at each value of the residual: the points should form a straight diagonal line.
31/41
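
A minimal hand-rolled sketch of a normal P-P plot of the residuals (statistical packages such as SPSS produce this directly), assuming `residuals` computed as in the previous sketch.

```python
# Normal P-P plot: observed vs. expected cumulative probabilities.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

r_sorted = np.sort(residuals)  # residuals from the previous sketch
n = len(r_sorted)
observed = (np.arange(1, n + 1) - 0.5) / n  # empirical cumulative probabilities
expected = stats.norm.cdf((r_sorted - r_sorted.mean()) / r_sorted.std(ddof=1))

plt.scatter(expected, observed)
plt.plot([0, 1], [0, 1])  # under normality the points hug this diagonal
plt.xlabel("Expected cumulative probability")
plt.ylabel("Observed cumulative probability")
plt.show()
```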

Example: Blood pressure and body weight
[Figure not captured in the transcription.]
32/41

Simple linear regression
Assessing goodness of fit
The estimated regression line is the best one available (in the least-squares sense).
Yet, it can still be a very poor fit to the observed data.
[Two scatter plots with fitted lines illustrating a good fit and a bad fit.]
33/41

To assess the goodness of fit of a regression line (i.e. how well the line fits the data) we can:
1. Calculate the correlation coefficient R between the predicted and observed values of y.
   A higher absolute value of R indicates better fit (predicted and observed values of y are closer to each other).
2. Calculate R² (R Square in SPSS).
   0 ≤ R² ≤ 1.
   A higher value of R² indicates better fit.
   R² = 1 indicates perfect fit (i.e. ŷ_i = y_i for each i).
   R² = 0 indicates very poor fit.
34/41

Alternatively, R² can be calculated as

R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = \frac{\text{variation in } y \text{ explained by } x}{\text{total variation in } y}

R² is interpreted as the proportion of total variability in y explained by the explanatory variable x:
R² = 1: x explains all variability in y.
R² = 0: x does not explain any variability in y.
R² is usually expressed as a percentage; e.g., R² = 0.93 indicates that 93% of the total variation in y can be explained by x.
35/41
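
A minimal sketch of this computation, assuming `brain` (observed y) and `fitted` (ŷ) from the earlier sketches; the result matches res.rvalue ** 2 from scipy.stats.linregress.

```python
# R^2 from the sum-of-squares decomposition on the slide.
import numpy as np

ss_explained = np.sum((fitted - brain.mean()) ** 2)  # variation explained by x
ss_total = np.sum((brain - brain.mean()) ** 2)       # total variation in y

print(ss_explained / ss_total)  # equals res.rvalue ** 2 from linregress
```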

Example: Blood pressure and body weight

Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .950(a)   .903       .897                1.74050
a. Predictors: (Constant), Weight
36/41

Prediction: interpolation versus extrapolation
[Plot showing the range of the actual data (interpolation) and the regions beyond it (extrapolation), with several possible patterns of additional data diverging outside the observed range.]
Extrapolation beyond the range of the data is risky!
37/41

Categorical explanatory variable
So far we assumed that the predictor variable x is numerical.
We can also study an association between y and a categorical x, e.g. between blood pressure and gender or between brain size and ethnicity.
Categorical variables can be incorporated through dummy variables that take on the values 0 and 1; to include a categorical variable with p categories, p - 1 dummy variables are required.
38/41

Categorical explanatory variable
Example: variable blood group with 4 categories: A, B, AB, 0.
1. Define dummy variables for all categories:
   x_A = 1 if blood group is A, 0 otherwise
   x_B = 1 if blood group is B, 0 otherwise
   x_AB = 1 if blood group is AB, 0 otherwise
   x_0 = 1 if blood group is 0, 0 otherwise
2. Choose one category as the reference category: a category that results in useful comparisons (e.g. exposed versus non-exposed, experimental versus standard treatment) or a category with a large number of subjects.
3. In the model, include all dummies except the one corresponding to the reference category.
39/41

Categorical explanatory variable
The model with blood group 0 as the reference category is

y = α + β_A x_A + β_B x_B + β_AB x_AB + ε

and its estimated counterpart is

ŷ = a + b_A x_A + b_B x_B + b_AB x_AB

Estimation of the model parameters requires running a multiple linear regression, unless the explanatory variable has only two categories (e.g. gender).
Given that y represents IQ score, the estimated coefficients are interpreted as follows:
a is the mean IQ for subjects with blood group 0, i.e. the reference category.
Each b represents the mean difference in IQ between subjects with the blood group represented by the respective dummy variable and subjects with blood group 0 (the reference category).
40/41
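
A minimal sketch of this dummy coding in Python with hypothetical blood-group and IQ data (none of these numbers come from the lecture): the intercept recovers the mean IQ of the reference group, and each coefficient recovers the mean difference from it.

```python
# Dummy coding a 4-level factor with group "0" as the reference category.
import numpy as np

groups = np.array(["A", "B", "AB", "0", "A", "0", "B", "AB", "0", "A"])
iq = np.array([102.0, 98.0, 110.0, 100.0, 105.0, 99.0, 97.0, 108.0, 101.0, 103.0])

# Design matrix: intercept + dummies for A, B, AB (category "0" omitted)
X = np.column_stack([
    np.ones(len(groups)),
    (groups == "A").astype(float),
    (groups == "B").astype(float),
    (groups == "AB").astype(float),
])

coef, *_ = np.linalg.lstsq(X, iq, rcond=None)
a, b_A, b_B, b_AB = coef
print(a)               # mean IQ in the reference group (blood group 0)
print(b_A, b_B, b_AB)  # mean IQ differences from the reference group
```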

Categorical explanatory variable
Specifically:
b_A is the mean difference in IQ between subjects with blood group A and subjects with blood group 0,
b_B is the mean difference in IQ between subjects with blood group B and subjects with blood group 0,
b_AB is the mean difference in IQ between subjects with blood group AB and subjects with blood group 0.
A test for the significance of a categorical explanatory variable with p levels involves the hypothesis that the coefficients of all p - 1 dummy variables are zero. For that purpose, we need to use an overall F-test (next lecture) and not a t-test. The t-test can be used only when the variable is binary.
41/41