Correlation and simple linear regression S5

Basic medical statistics for clinical and experimental research: Correlation and simple linear regression S5
Katarzyna Jóźwiak, k.jozwiak@nki.nl
November 15, 2017
1/41

Introduction
Example: Brain size and body weight

Subject id   Brain size (MRI total pixel count per 10,000)   Weight (pounds)
1            81.69                                           118
2            96.54                                           172
3            92.88                                           146
4            90.49                                           134
5            95.55                                           172
6            83.39                                           118
7            92.41                                           155
...          ...                                             ...
24           79.06                                           122
25           98.00                                           190
2/41

Introduction
Example: Brain size and body weight
[Scatter plot of brain size versus body weight; Subjects 4 and 24 are highlighted.]
3/41

Relationship between two numerical variables
If a linear relationship between x and y appears to be reasonable from the scatter plot, we can take the next step and
1. Calculate Pearson's product moment correlation coefficient between x and y
   Measures how closely the data points on the scatter plot resemble a straight line
2. Perform a simple linear regression analysis
   Finds the equation of the line that best describes the relationship between the variables seen in a scatter plot
4/41

Correlation
The sample Pearson's product moment correlation coefficient (or correlation coefficient) between variables x and y is calculated as

r(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)

where:
{(x_i, y_i) : i = 1, ..., n} is a random sample of n observations on x and y,
x̄ and ȳ are the sample means of respectively x and y,
s_x and s_y are the sample standard deviations of respectively x and y.
5/41
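
As an illustration (not part of the original slides), here is a minimal Python sketch of this formula, using the first five subjects from the brain size and body weight table; the hand-rolled value can be checked against NumPy's built-in np.corrcoef.

```python
# A hand-rolled Pearson r, checked against NumPy's built-in.
import numpy as np

# First five subjects from the brain size / body weight table (slide 2)
brain  = np.array([81.69, 96.54, 92.88, 90.49, 95.55])
weight = np.array([118.0, 172.0, 146.0, 134.0, 172.0])

n = len(brain)
# Standardize each variable (ddof=1 gives the n-1 divisor for s_x, s_y)
zb = (brain - brain.mean()) / brain.std(ddof=1)
zw = (weight - weight.mean()) / weight.std(ddof=1)

# r(x, y) = 1/(n-1) * sum_i ((x_i - x_bar)/s_x) * ((y_i - y_bar)/s_y)
r = np.sum(zb * zw) / (n - 1)

print(r)
print(np.corrcoef(brain, weight)[0, 1])  # same value
```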

Correlation
Properties of r:
r estimates the true population correlation coefficient ρ.
r takes on any value between -1 and 1.
The magnitude of r indicates the strength of a linear relationship between x and y:
  r = -1 or 1 means perfect linear association.
  The closer r is to -1 or 1, the stronger the linear association (e.g. r = -0.1, weak association, vs r = 0.85, strong association).
  r = 0 indicates no linear association (but the association can be e.g. non-linear).
The sign of r indicates the direction of association:
  r > 0 implies a positive relationship, i.e. the two variables tend to move in the same direction.
  r < 0 implies a negative relationship, i.e. the two variables tend to move in opposite directions.
6/41

Correlation
Properties of r (cont'd):
r(ax + b, cy + d) = r(x, y), where a > 0, c > 0, and b and d are constants.
r(x, y) = r(y, x).
r ≠ 0 does not imply causation! Just because two variables are correlated does not necessarily mean that one causes the other!
r² is called the coefficient of determination:
  0 ≤ r² ≤ 1.
  It represents the proportion of total variation in one variable that is explained by the other.
  For example: a coefficient of determination between body weight and age of 0.60 means that 60% of the total variation in body weight is explained by age alone and the remaining 40% is explained by other factors.
7/41

Correlation
[Panel of scatter plots illustrating r = -1, r = 1, r = 0.8, r = -0.8, r = 0 (two patterns), 0 < r < 1 and -1 < r < 0.]
Don't interpret r without looking at the scatter plot!
8/41

Correlation
Hypothesis test for the population correlation coefficient ρ:
H0: ρ = 0 (there is no linear relationship between y and x)
H1: ρ ≠ 0 (there is a linear relationship between y and x)
Under H0, the test statistic

T = r \sqrt{\frac{n-2}{1-r^2}}

follows a Student-t distribution with n - 2 degrees of freedom.
This test assumes that the variables x and y are normally distributed.
9/41
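
A minimal sketch of this test in Python, plugging in r = 0.826 and n = 25 from the brain size example on the next slides; given the raw data, scipy.stats.pearsonr would report the same two-sided p-value.

```python
# t-test for H0: rho = 0, using the statistic from the slide.
import numpy as np
from scipy import stats

r, n = 0.826, 25  # from the brain size / body weight example

T = r * np.sqrt((n - 2) / (1 - r**2))  # test statistic (about 7.03 here)
p = 2 * stats.t.sf(abs(T), df=n - 2)   # two-sided p-value

print(T, p)
```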

Correlation
Example: Brain size and body weight
What is the magnitude and sign of the correlation coefficient between brain size and weight?
10/41

Correlation
Example: Brain size and body weight

Correlations
                             Weight    Brain
Weight   Pearson Correlation 1         .826**
         Sig. (2-tailed)               .000
         N                   25        25
Brain    Pearson Correlation .826**    1
         Sig. (2-tailed)     .000
         N                   25        25
**. Correlation is significant at the 0.01 level (2-tailed).
11/41

Pearson's product moment correlation coefficient measures the strength and direction of a linear association between x and y.
Simple linear regression finds an equation (mathematical model) that describes the relationship between the two variables, so that we can predict values of one variable using values of the other variable.
Unlike correlation, regression requires
a dependent variable y (outcome/response variable): the variable being predicted (always on the vertical or y-axis),
an independent variable x (explanatory/predictor variable): the variable used for prediction (always on the horizontal or x-axis).
12/41

Simple linear regression postulates that in the population

y = α + βx + ε,

where:
y is the dependent variable,
x is the independent variable,
α and β are parameters called the population regression coefficients:
  α is called the intercept or constant term,
  β is called the slope,
ε is the random error term.
13/41

[Scatter plot of y against x, for x values 1 to 5.]
14/41

[Plot of the population regression function E(y_x) = α + βx, with the mean of y marked at each x.]
E(y_{x_i}) is the mean value of y when x = x_i.
E(y_x) = α + βx is the population regression function.
15/41

[Plot of the line E(y_x) = α + βx, with the intercept α and slope β indicated; a 3-unit increase in x corresponds to a 3β change in the mean of y.]
α is the y-intercept of the population regression function, i.e. the mean value of y when x equals 0.
β is the slope of the population regression function, i.e. the mean (or expected) change in y associated with a 1-unit increase in the value of x.
cβ is the mean change in y for a c-unit increase in the value of x.
α and β are estimated from the sample data, usually using the least squares method.
16/41

[Plot of the fitted line ŷ = a + bx, showing the residual e_i = y_i - ŷ_i as the vertical distance between the observed y_i and the fitted ŷ_i.]
The least squares method chooses a and b (estimates for α and β) to minimize the sum of the squares of the residuals:

\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} [y_i - (a + b x_i)]^2

17/41

The least squares estimates for β and α are:

b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \quad \text{and} \quad a = \bar{y} - b\bar{x},

where x̄ and ȳ are the respective sample means of x and y.
Note that:

b = r(x, y) \frac{s_y}{s_x},

where r(x, y) is the sample product moment correlation between x and y, and s_x and s_y are the sample standard deviations of x and y.
18/41
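
A minimal sketch of these estimates in Python, using an illustrative seven-subject subset of the brain size data (weight as x, brain size as y); the identity b = r(x, y) s_y/s_x is verified at the end.

```python
# Least-squares slope and intercept from the formulas above.
import numpy as np

x = np.array([118.0, 172.0, 146.0, 134.0, 172.0, 118.0, 155.0])  # weight
y = np.array([81.69, 96.54, 92.88, 90.49, 95.55, 83.39, 92.41])  # brain size

# b = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2); a = y_bar - b*x_bar
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

# Equivalent identity: b = r(x, y) * s_y / s_x
r = np.corrcoef(x, y)[0, 1]
b_alt = r * y.std(ddof=1) / x.std(ddof=1)

print(a, b, b_alt)  # b and b_alt agree
```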

Test of H0: β = 0 versus H1: β ≠ 0
1. t-test:
   Test statistic: T = b / SE(b), where SE(b) is the standard error of b calculated from the data.
   Under H0, T follows a Student-t distribution with n - 2 degrees of freedom.
2. F-test:
   Test statistic: F = (b / SE(b))^2 = T^2, where SE(b) and T are as above.
   Under H0, F follows an F distribution with 1 and n - 2 degrees of freedom.
The t-test and the F-test lead to the same outcome.
The test of zero intercept α is of less interest, unless x = 0 is meaningful.
19/41
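
A minimal sketch of the slope test, assuming the same small illustrative arrays as above; scipy.stats.linregress reports the slope, its standard error, and the two-sided p-value directly, from which T and F follow.

```python
# Slope test via scipy.stats.linregress (illustrative 7-subject subset).
import numpy as np
from scipy import stats

weight = np.array([118.0, 172.0, 146.0, 134.0, 172.0, 118.0, 155.0])
brain  = np.array([81.69, 96.54, 92.88, 90.49, 95.55, 83.39, 92.41])

res = stats.linregress(weight, brain)  # fits brain = intercept + slope * weight
T = res.slope / res.stderr             # t statistic for H0: beta = 0
F = T**2                               # equivalent F statistic (1, n-2 df)

print(res.slope, res.stderr, T, F, res.pvalue)
```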

Example: Brain size (MRI total pixel count per 10,000) and body weight (pounds)

Coefficients(a)
Model 1       Unstandardized Coefficients    Standardized Coefficients
              B        Std. Error            Beta      t        Sig.
(Constant)    62.334   3.845                           16.212   .000
Weight        .176     .025                  .826      7.030    .000
a. Dependent Variable: Brain

ANOVA(a)
Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   507.387          1    507.387       49.416   .000(b)
Residual     236.157          23   10.268
Total        743.544          24
a. Dependent Variable: Brain
b. Predictors: (Constant), Weight

Brain = 62.33 + 0.18 × Weight
20/41
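
As a quick worked use of the fitted equation (a hypothetical prediction, not from the slides): a subject weighing 150 pounds would have a predicted brain size of 62.33 + 0.18 × 150 = 89.33.

```python
# Prediction from the fitted line Brain = 62.33 + 0.18 * Weight.
def predict_brain(weight_lb: float) -> float:
    """Predicted brain size (MRI total pixel count per 10,000)."""
    return 62.33 + 0.18 * weight_lb

print(predict_brain(150))  # 89.33
```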

Example: Brain size (MRI total pixel count per 10,000) and body weight (pounds)
[Scatter plot of Brain versus Weight with the fitted line y = 62.33 + 0.18x.]
21/41

Example: Blood pressure (mmHg) and body weight (kg) in 20 patients with hypertension (1)
[Scatter plot of BP versus Weight.]
1. Daniel, W.W. and Cross, C.L. (2013). Biostatistics: A Foundation for Analysis in the Health Sciences, 10th edition.
22/41

Example: Blood pressure (mmHg) and body weight (kg) in 20 patients with hypertension

Coefficients(a)
Model 1       Unstandardized Coefficients
              B       Std. Error   Beta    t        Sig.
(Constant)    2.205   8.663                .255     .802
Weight        1.201   .093         .950    12.917   .000
a. Dependent Variable: BP

ANOVA(a)
Model 1      Sum of Squares   df   Mean Square   F         Sig.
Regression   505.472          1    505.472       166.859   .000(b)
Residual     54.528           18   3.029
Total        560.000          19
a. Dependent Variable: BP
b. Predictors: (Constant), Weight

BP = 2.21 + 1.20 × Weight
23/41

Example: Blood pressure (mmHg) and body weight (kg) in 20 patients with hypertension
[Scatter plot of BP versus Weight with the fitted line y = 2.21 + 1.2x.]
24/41

Standardized coefficients
Obtained by standardizing both y and x (i.e. converting them into z-scores) and re-running the regression.
The standardized intercept equals zero and the standardized slope for x equals the sample correlation coefficient.
[SPSS output repeated from slides 11 and 20: the Pearson correlation between Weight and Brain is .826, and the standardized coefficient (Beta) for Weight in the regression of Brain on Weight is also .826.]
25/41

Standardized coefficients
Obtained by standardizing both y and x (i.e. converting them into z-scores) and re-running the regression.
The standardized intercept equals zero and the standardized slope for x equals the sample correlation coefficient.
Standardized coefficients are of greater concern in multiple linear regression, where the predictors are expressed in different units.
Standardization removes the dependence of the regression coefficients on the units of measurement of y and the x's, so they can be meaningfully compared.
The larger the standardized coefficient (in absolute value), the greater the contribution of the respective variable to the prediction of y.
26/41
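
A minimal sketch of this standardization in Python (illustrative subset, not the full 25-subject sample): converting both variables to z-scores and re-fitting yields an intercept of essentially zero and a slope equal to the sample correlation.

```python
# Standardize both variables, re-fit: slope equals r, intercept ~ 0.
import numpy as np
from scipy import stats

weight = np.array([118.0, 172.0, 146.0, 134.0, 172.0, 118.0, 155.0])
brain  = np.array([81.69, 96.54, 92.88, 90.49, 95.55, 83.39, 92.41])

def z(v):
    """Convert to z-scores using the sample standard deviation."""
    return (v - v.mean()) / v.std(ddof=1)

res = stats.linregress(z(weight), z(brain))
print(res.intercept)                     # ~0
print(res.slope)                         # equals the correlation...
print(np.corrcoef(weight, brain)[0, 1])  # ...computed directly
```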

Linear regression is only appropriate when the following assumptions are satisfied:
1. Independence: the observations are independent, i.e. there is only one pair of observations per subject.
2. Linearity: the relationship between x and y is linear.
3. Constant variance: the variance of y is constant for all values of x.
4. Normality: y has a normal distribution.
27/41

Simple linear regression
Checking the linearity assumption:
1. Make a scatter plot of y versus x: the points should generally form a straight line.
2. Plot the residuals against the explanatory variable x: the points should present a random scatter around zero; there should be no systematic pattern.
[Residual plots illustrating linearity (random scatter around 0) versus lack of linearity (systematic pattern).]
28/41

Simple linear regression
Checking the constant variance assumption:
Make a residual plot, i.e. plot the residuals against the fitted values of y (ŷ_i = a + b x_i): the points should present a random scatter.
[Residual plots illustrating constant variance versus non-constant variance.]
29/41
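A minimal sketch of such a residual-versus-fitted plot in Python, under the same illustrative-data assumption as the earlier sketches.

```python
# Residuals against fitted values; look for a random scatter around 0.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

weight = np.array([118.0, 172.0, 146.0, 134.0, 172.0, 118.0, 155.0])
brain  = np.array([81.69, 96.54, 92.88, 90.49, 95.55, 83.39, 92.41])

res = stats.linregress(weight, brain)
fitted = res.intercept + res.slope * weight
residuals = brain - fitted

plt.scatter(fitted, residuals)
plt.axhline(0, linestyle="--")  # residuals should hover around this line
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```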

Example: Blood pressure and body weight
[Figure not captured in the transcription.]
30/41

Checking the normality assumption:
1. Draw a histogram of y or of the residuals and eyeball the result.
2. Make a normal probability plot (P-P plot) of the residuals, i.e. plot the expected cumulative probability of a normal distribution versus the observed cumulative probability at each value of the residual: the points should form a straight diagonal line.
31/41
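
A minimal hand-rolled sketch of a normal P-P plot of the residuals (statistical packages such as SPSS produce this directly), assuming `residuals` computed as in the previous sketch.

```python
# Normal P-P plot: observed vs. expected cumulative probabilities.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

r_sorted = np.sort(residuals)  # residuals from the previous sketch
n = len(r_sorted)
observed = (np.arange(1, n + 1) - 0.5) / n  # empirical cumulative probabilities
expected = stats.norm.cdf((r_sorted - r_sorted.mean()) / r_sorted.std(ddof=1))

plt.scatter(expected, observed)
plt.plot([0, 1], [0, 1])  # under normality the points hug this diagonal
plt.xlabel("Expected cumulative probability")
plt.ylabel("Observed cumulative probability")
plt.show()
```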

Example: Blood pressure and body weight
[Figure not captured in the transcription.]
32/41

Simple linear regression
Assessing goodness of fit
The estimated regression line is the best one available (in the least-squares sense).
Yet, it can still be a very poor fit to the observed data.
[Two scatter plots with fitted lines illustrating a good fit and a bad fit.]
33/41

To assess the goodness of fit of a regression line (i.e. how well the line fits the data) we can:
1. Calculate the correlation coefficient R between the predicted and observed values of y.
   A higher absolute value of R indicates better fit (predicted and observed values of y are closer to each other).
2. Calculate R² (R Square in SPSS).
   0 ≤ R² ≤ 1.
   A higher value of R² indicates better fit.
   R² = 1 indicates perfect fit (i.e. ŷ_i = y_i for each i).
   R² = 0 indicates very poor fit.
34/41

Alternatively, R² can be calculated as

R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = \frac{\text{variation in } y \text{ explained by } x}{\text{total variation in } y}

R² is interpreted as the proportion of total variability in y explained by the explanatory variable x:
R² = 1: x explains all variability in y.
R² = 0: x does not explain any variability in y.
R² is usually expressed as a percentage; e.g., R² = 0.93 indicates that 93% of the total variation in y can be explained by x.
35/41
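
A minimal sketch of this computation, assuming `brain` (observed y) and `fitted` (ŷ) from the earlier sketches; the result matches res.rvalue ** 2 from scipy.stats.linregress.

```python
# R^2 from the sum-of-squares decomposition on the slide.
import numpy as np

ss_explained = np.sum((fitted - brain.mean()) ** 2)  # variation explained by x
ss_total = np.sum((brain - brain.mean()) ** 2)       # total variation in y

print(ss_explained / ss_total)  # equals res.rvalue ** 2 from linregress
```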

Example: Blood pressure and body weight

Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .950(a)   .903       .897                1.74050
a. Predictors: (Constant), Weight
36/41

Prediction: interpolation versus extrapolation
[Plot showing the range of the actual data (interpolation) and the regions beyond it (extrapolation), with several possible patterns of additional data diverging outside the observed range.]
Extrapolation beyond the range of the data is risky!
37/41

Categorical explanatory variable
So far we assumed that the predictor variable x is numerical.
We can also study an association between y and a categorical x, e.g. between blood pressure and gender or between brain size and ethnicity.
Categorical variables can be incorporated through dummy variables that take on the values 0 and 1; to include a categorical variable with p categories, p - 1 dummy variables are required.
38/41

Categorical explanatory variable
Example: variable blood group with 4 categories: A, B, AB, 0.
1. Define dummy variables for all categories:
   x_A = 1 if blood group is A, 0 otherwise
   x_B = 1 if blood group is B, 0 otherwise
   x_AB = 1 if blood group is AB, 0 otherwise
   x_0 = 1 if blood group is 0, 0 otherwise
2. Choose one category as the reference category: a category that results in useful comparisons (e.g. exposed versus non-exposed, experimental versus standard treatment) or a category with a large number of subjects.
3. In the model, include all dummies except the one corresponding to the reference category.
39/41

Categorical explanatory variable
The model with blood group 0 as the reference category is

y = α + β_A x_A + β_B x_B + β_AB x_AB + ε

and its estimated counterpart is

ŷ = a + b_A x_A + b_B x_B + b_AB x_AB

Estimation of the model parameters requires running a multiple linear regression, unless the explanatory variable has only two categories (e.g. gender).
Given that y represents IQ score, the estimated coefficients are interpreted as follows:
a is the mean IQ for subjects with blood group 0, i.e. the reference category.
Each b represents the mean difference in IQ between subjects with the blood group represented by the respective dummy variable and subjects with blood group 0 (the reference category).
40/41
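
A minimal sketch of this dummy coding in Python with hypothetical blood-group and IQ data (none of these numbers come from the lecture): the intercept recovers the mean IQ of the reference group, and each coefficient recovers the mean difference from it.

```python
# Dummy coding a 4-level factor with group "0" as the reference category.
import numpy as np

groups = np.array(["A", "B", "AB", "0", "A", "0", "B", "AB", "0", "A"])
iq = np.array([102.0, 98.0, 110.0, 100.0, 105.0, 99.0, 97.0, 108.0, 101.0, 103.0])

# Design matrix: intercept + dummies for A, B, AB (category "0" omitted)
X = np.column_stack([
    np.ones(len(groups)),
    (groups == "A").astype(float),
    (groups == "B").astype(float),
    (groups == "AB").astype(float),
])

coef, *_ = np.linalg.lstsq(X, iq, rcond=None)
a, b_A, b_B, b_AB = coef
print(a)               # mean IQ in the reference group (blood group 0)
print(b_A, b_B, b_AB)  # mean IQ differences from the reference group
```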

Categorical explanatory variable
Specifically:
b_A is the mean difference in IQ between subjects with blood group A and subjects with blood group 0,
b_B is the mean difference in IQ between subjects with blood group B and subjects with blood group 0,
b_AB is the mean difference in IQ between subjects with blood group AB and subjects with blood group 0.
A test for the significance of a categorical explanatory variable with p levels involves the hypothesis that the coefficients of all p - 1 dummy variables are zero. For that purpose, we need to use an overall F-test (next lecture) and not a t-test. The t-test can be used only when the variable is binary.
41/41