Faculty of Health Sciences Linear regression and correlation Statistics for experimental medical researchers 2018 Julie Forman, Christian Pipper & Claus Ekstrøm Department of Biostatistics, University of Copenhagen
Associations between quantitative variables How do we find evidence of association? How can we describe/quantify the association? Method 1: Linear regression. How do we determine the best fitting line so that we can predict one variable from the other? How do we quantify the statistical uncertainty? We assume a linear association to simplify; is that fair? Method 2: Correlation. Simple symmetrical measures of association.
Outline Linear regression Model assumptions and prediction About linear models in R Correlations
Case study: A dose-response experiment Does cell concentration affect cell size of tetrahymena in cell cultivation? Is there an association? How can we describe it? How well can we predict diameter when we know concentration?
Same picture with fitted regression line The regression line doesn't fit all measurements spot on, but overall the association looks reasonably linear.
The linear model y = α + βx + ε. y is the response (or outcome), in this case: log(diameter). x is the explanatory variable (or predictor), in this case: log(concentration). β is the regression coefficient (or slope): it tells us how much y increases when x is increased by one unit. α is the intercept: where the line intersects the y-axis. Note that formally α is "the expected value of y when x = 0", but this interpretation is often extrapolated and not really meaningful in practice. ε is an individual error term, assumed normally distributed with zero mean and standard deviation σ_ε.
Graphical interpretation of the regression parameters [Figure: the line y = α + βx; the slope β is the increase in y per one-unit increase in x, and the intercept α is the height at which the line crosses the y-axis.]
How do we find the "best fitting line"? Answer: by least squares (explicit formulae or R). Find α̂, β̂ which minimize Σ_{i=1}^n (y_i − (α + βx_i))². The deviations r_i = y_i − (α̂ + β̂x_i) are called the residuals. Finding the best fitting line is the same as minimizing the residual variance s² = 1/(n − 2) Σ_{i=1}^n r_i². We estimate the residual standard deviation by s = √(s²).
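The least-squares criterion can be checked numerically in R. This is a sketch with made-up toy data (not the tetrahymena data): the coefficients returned by lm() do minimize the residual sum of squares, and the residual standard deviation uses n − 2 degrees of freedom.

```r
# Hypothetical toy data, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

fit <- lm(y ~ x)
a <- unname(coef(fit)[1])  # alpha-hat
b <- unname(coef(fit)[2])  # beta-hat

# Residual sum of squares as a function of the coefficients
rss <- function(alpha, beta) sum((y - (alpha + beta * x))^2)

# Nudging either coefficient away from the least-squares solution
# can only increase the residual sum of squares:
rss(a, b) <= rss(a + 0.1, b)  # TRUE
rss(a, b) <= rss(a, b - 0.1)  # TRUE

# Residual standard deviation with n - 2 degrees of freedom
# matches the "Residual standard error" reported by summary():
n <- length(x)
s <- sqrt(rss(a, b) / (n - 2))
isTRUE(all.equal(s, summary(fit)$sigma))  # TRUE
```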
The residuals Definition: r_i = y_i − α̂ − β̂x_i [Figure: scatterplot of vol against wt with the fitted line; the residuals are the vertical distances from the points to the line.]
Quantification and test of association If no association exists between x and y, then we expect the regression line to be horizontal, that is β = 0, so that y = α + ε regardless of x. We can test the null hypothesis H: β = 0 using: t = β̂ / s.e.(β̂), which has a t-distribution with n − 2 degrees of freedom in case the null hypothesis is true. (standard errors later) And we can obtain a confidence interval for β from: β̂ ± t(n − 2) · s.e.(β̂). Inference for α is rarely of interest, but otherwise similar.
Case study: estimates and inference Estimates etc. are easily obtained from R: Estimate Std. Error t value Pr(>|t|) (Intercept) 5.405816 0.068406 79.03 <2e-16 *** log2conc -0.054515 0.004178 -13.05 <2e-16 *** Residual standard error: 0.05514 on 49 degrees of freedom The rows in the table describe first α, the intercept, and secondly β, the slope corresponding to the effect of the log2-concentration on the log2-diameter. What can we conclude?
Interpretation There is a significant association between concentration and cell diameter (P < 0.0001). We estimate that log2(diameter) decreases by 0.0545 every time log2(concentration) increases by one unit, that is each time the concentration is doubled. The 95% confidence interval is -0.0629 to -0.0461. Another way of expressing this is that the diameter decreases exponentially with log2(concentration): specifically, by an estimated factor 2^(-0.0545) ≈ 0.9629, or -3.71%, every time the concentration is doubled, with 95% CI -4.27% to -3.15%. The intercept 5.41 might be interpreted as the expected log2(diameter) when log2(concentration) = 0 (that is, when concentration = 1), but this is highly extrapolated.
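The back-transformation above is a one-liner in R. Using the estimate and confidence limits quoted on this slide (the raw data are not reproduced here):

```r
# Slope and 95% CI on the log2 scale, taken from the slide
beta <- -0.054515
ci   <- c(-0.0629, -0.0461)

# Multiplicative change in diameter per doubling of the concentration
factor_per_doubling <- 2^beta              # approx. 0.9629
pct_change          <- 100 * (2^beta - 1)  # approx. -3.71 percent
pct_ci              <- 100 * (2^ci - 1)    # approx. -4.27 to -3.15 percent
```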
Regression coefficients by formulae The "best fitting line" can be solved explicitly: The estimated slope is given by: β̂ = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^n (x_i − x̄)². And the intercept can be computed by α̂ = ȳ − β̂x̄, where β̂ from the previous formula is inserted. Note that the fitted line always passes through the point (x̄, ȳ).
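As a sketch on simulated toy data, the closed-form estimates agree with lm(), and the fitted line does pass through (x̄, ȳ):

```r
# Hypothetical toy data to verify the closed-form estimates
set.seed(1)
x <- runif(20, 0, 10)
y <- 3 + 0.5 * x + rnorm(20, sd = 1)

# Closed-form least-squares estimates
beta_hat  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
alpha_hat <- mean(y) - beta_hat * mean(x)

fit <- lm(y ~ x)
isTRUE(all.equal(unname(coef(fit)), c(alpha_hat, beta_hat)))  # TRUE

# The fitted line passes through the point of means (x-bar, y-bar):
isTRUE(all.equal(alpha_hat + beta_hat * mean(x), mean(y)))    # TRUE
```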
Standard error formulae The standard errors for α̂ and β̂ are given by: s.e.(α̂) = s √(1/n + x̄² / Σ(x − x̄)²) and s.e.(β̂) = s / √(Σ(x − x̄)²), where s is the residual standard deviation. A bigger sample size n will of course give rise to smaller standard errors, but the specific values of the x's also have an impact. s.e.(β̂) is larger if x doesn't vary much. s.e.(α̂) is larger if x doesn't vary much and/or if x̄ is far away from 0. Both are larger if the residual variance is large.
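These formulae can likewise be verified against the "Std. Error" column of summary() — again a sketch on simulated data:

```r
# Hypothetical toy data
set.seed(2)
x <- runif(30, 0, 10)
y <- 1 + 2 * x + rnorm(30, sd = 0.5)

fit <- lm(y ~ x)
n   <- length(x)
s   <- summary(fit)$sigma        # residual standard deviation
Sxx <- sum((x - mean(x))^2)      # sum of squares of x around its mean

se_alpha <- s * sqrt(1/n + mean(x)^2 / Sxx)
se_beta  <- s / sqrt(Sxx)

# Matches the Std. Error column from summary(fit):
isTRUE(all.equal(unname(summary(fit)$coefficients[, "Std. Error"]),
                 c(se_alpha, se_beta)))  # TRUE
```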
Outline Linear regression Model assumptions and prediction About linear models in R Correlations
Model assumptions The statistical model assumed by the linear regression analysis is: y_i = α + βx_i + ε_i, where the error terms ε_i describe the individual deviations from the regression line, assumed to be random, normally distributed with mean 0 and standard deviation σ_ε. There are four model assumptions we need to consider: 1. Observations are mutually independent (no clustering). 2. The true association is linear. 3. The error terms, ε's, are normally distributed. 4. The error terms, ε's, have the same standard deviation, regardless of the value of x.
What do we need to check? 1. Independence should be ensured by the study design. 2. Linearity is checked in a residual plot, i.e. a scatterplot of the residuals against the fitted (predicted) values in the data. 3. Normal distribution is checked by making a QQ plot of the standardized residuals. 4. Homogeneity of variance is assessed from the residual plot. Note: It is not strictly necessary that the error terms are normally distributed if the sample size is large. Confidence intervals and tests are valid as long as the other model assumptions hold. Prediction intervals, however, can only be trusted if the error terms are truly normal.
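In R, checks 2–4 can be produced in a few lines. A sketch, using simulated data in place of the tetrahymena data set:

```r
# Simulated stand-in data (hypothetical)
set.seed(3)
x <- runif(50, 0, 10)
y <- 2 + 0.3 * x + rnorm(50, sd = 0.4)
fit <- lm(y ~ x)

# Residual plot: checks linearity and homogeneity of variance
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# QQ plot of the standardized residuals: checks normality
qqnorm(rstandard(fit))
qqline(rstandard(fit))

# Alternatively, plot(fit) produces a full set of diagnostic plots
```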
Case study: residual and QQ plots Note: The points in the residual plot should be randomly and symmetrically scattered around the zero line, with the same variability at any point in the interval. Any systematic deviation from this pattern indicates a violation of the model assumptions.
Prediction To predict the expected y for a new value of x, we plug into the equation of the estimated regression: ŷ(x₀) = α̂ + β̂x₀. Example: When the concentration is 250000, then x₀ = log2(250000) = 17.9316, and we would expect a log2-diameter of 5.4058 − 0.0545 · 17.9316 = 4.4283, i.e. a diameter around 2^4.4283 = 21.53. Note that this involves interpolating between the doses tested in the experiment. Extrapolating should be avoided.
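The hand calculation above can be reproduced directly, using the coefficients quoted on the slide:

```r
# Coefficients as rounded on the slide
alpha_hat <- 5.4058
beta_hat  <- -0.0545

x0   <- log2(250000)                 # approx. 17.9316
yhat <- alpha_hat + beta_hat * x0    # approx. 4.4283 on the log2 scale
diam <- 2^yhat                       # approx. 21.53, back on the original scale
```

In practice one would use predict(fit, data.frame(log2conc = log2(250000))) rather than typing in rounded coefficients.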
Uncertainty in prediction I Not all responses lie on the average. There are two sources of uncertainty we need to consider when making predictions: 1. Natural variation in responses (estimated by the residual standard deviation s). 2. The statistical uncertainty in our estimates (standard errors). The standard error of the expected value at x₀ is: s.e.(ŷ(x₀)) = s √(1/n + (x₀ − x̄)² / Σ(x − x̄)²). This is the uncertainty related to estimating the average response at x₀.
Uncertainty in prediction II Not all responses lie on the average. There are two sources of uncertainty we need to consider when making predictions: 1. Natural variation in responses (estimated by the residual standard deviation s). 2. The statistical uncertainty in our estimates (standard errors). If we want to predict individual responses to x = x₀ with 95% certainty, then we need: s.d.(y_new(x₀) − ŷ(x₀)) = s √(1 + 1/n + (x₀ − x̄)² / Σ(x − x̄)²), where the residual standard deviation has been added to the estimation uncertainty.
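In R, both kinds of interval come from predict() via its interval argument. A sketch with simulated data (the real analysis would use the fitted tetrahymena model):

```r
# Simulated stand-in data (hypothetical)
set.seed(4)
x <- runif(40, 0, 10)
y <- 1 + 0.5 * x + rnorm(40, sd = 0.3)
fit <- lm(y ~ x)
new <- data.frame(x = 5)

ci_int <- predict(fit, new, interval = "confidence")  # for the mean response
pi_int <- predict(fit, new, interval = "prediction")  # for a new observation

# The prediction interval is always the wider of the two, because it
# adds the natural variation (the "1 +" term) to the estimation uncertainty:
(pi_int[, "upr"] - pi_int[, "lwr"]) > (ci_int[, "upr"] - ci_int[, "lwr"])  # TRUE
```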
Confidence vs. prediction intervals Which is the prediction interval and which is the confidence interval? What happens when the sample size increases?
A nicer picture Obtained by back-transforming with 2^x before plotting.
Outline Linear regression Model assumptions and prediction About linear models in R Correlations
Linear models in R We use the lm function to do linear regression (and a lot more: ANOVA, multiple regression, ...). The model must be specified by a model formula, e.g.: fit <- lm(log2diam ~ log2conc, data=dr) where ~ should be read as "potentially depends on" or "is potentially predicted by". The response goes on the left and the predictor on the right. lm returns a so-called model object of the class "lm"; you don't have to understand all of its contents to use it.
Extractor functions R has a bunch of functions that can be used to extract information from model objects, e.g.: summary(fit): table of estimates, tests, and more. confint(fit): confidence intervals. abline(fit): add the fitted line to an existing plot. residuals(fit): vector containing the residuals. predict(fit, frame): predict y's for supplied x values. plot(fit): diagnostic plots (e.g. model assumptions).
R demo: doseresponse.r
load(file.choose()) # choose doseresponse.rda
dr <- transform(dr, log2conc=log2(concentration), log2diam=log2(diameter))
# fit the linear model
fit <- lm(log2diam~log2conc, data=dr)
summary(fit)
cbind(coef(fit), confint(fit))
plot(dr$log2conc, dr$log2diam)
abline(fit, col="blue")
# NOTE: predictions and more in the program file
Exercise: Run the demo!
Outline Linear regression Model assumptions and prediction About linear models in R Correlations
Regression vs correlation In linear regression, we model a directed relationship, either: A causal relation: We assume that x has an effect on y, not the other way around. A prediction problem: We know x and want to predict y. It matters in which order the two variables are supplied to the formula in R. But sometimes we just want to know: Are two different outcomes associated? In this case a correlation coefficient gives a crude measure of the strength of the association.
Pearson's correlation r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² · Σ(y − ȳ)²) measures the degree of linear association between two outcomes. r is symmetric in x and y. r is always between −1 and +1. r has the same sign as the regression coefficient β (no matter whether you regress y on x or the other way around). Crude rule of thumb: |r| < 0.3: weak correlation; 0.3 < |r| < 0.5: moderate correlation; |r| > 0.5: strong correlation.
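The formula and the properties listed above can be checked against R's built-in cor() — a sketch on simulated toy data:

```r
# Hypothetical toy data
set.seed(5)
x <- rnorm(25)
y <- 0.6 * x + rnorm(25)

# Pearson's correlation by the formula
r <- sum((x - mean(x)) * (y - mean(y))) /
     sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
isTRUE(all.equal(r, cor(x, y)))          # TRUE

# Symmetry in x and y, and same sign as the regression slope:
isTRUE(all.equal(cor(x, y), cor(y, x)))  # TRUE
sign(r) == sign(unname(coef(lm(y ~ x))[2]))  # TRUE
```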
Interpretation of Pearson's correlation coefficient r = 0, no correlation: no (linear) association; occurs when x and y are mutually independent. r > 0, positive correlation: larger/smaller values of x and y tend to coincide. r < 0, negative correlation: larger values of x tend to coincide with smaller values of y, and vice versa. r = ±1: perfect linear association.
Be careful! The correlation assumes that both x and y are random. It doesn't make sense to report a correlation coefficient if the values of x were dictated by the study protocol. The strength of the correlation depends on the study population. E.g. height and weight are more strongly correlated in pups than in adults. Interpretation should depend on the study aims. A 90% correlation may be poor if we are comparing two laboratory weights supposed to measure the same thing! Association is not the same as agreement. A device that hasn't been properly calibrated may correlate almost perfectly with one that has, but still the measurements may show a large systematic deviation.
Analyzing correlation in R CKD data from course day 2: > cor.test(ckd$pwv0, ckd$aix0) Pearson's product-moment correlation data: ckd$pwv0 and ckd$aix0 t = 2.4151, df = 48, p-value = 0.01959 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.05594193 0.55652213 sample estimates: cor 0.3291641 BUT: are the outcomes normally distributed?
Model assumptions for Pearson correlation Pearson's correlation is valid under the assumption that the two variables have a two-dimensional joint normal distribution. Source: Wikipedia.
The 2D normal distribution as we see it Source: Wikipedia.
Anscombe's quartet Four datasets sharing a Pearson correlation of 0.816. BUT: In which case is the Pearson correlation appropriate?
Non-normally distributed data Use Spearman's rank correlation instead. No assumptions save that observations are independent. The formula is the same as for Pearson's correlation: r = Σ(rank(x) − mean(rank(x)))(rank(y) − mean(rank(y))) / √(Σ(rank(x) − mean(rank(x)))² · Σ(rank(y) − mean(rank(y)))²), only the original data have been replaced by their ranks. The rank of an observation is its number on the list when all data have been ordered from the smallest value to the largest.
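This "Pearson on the ranks" characterization is easy to verify in R — a sketch on simulated skewed data:

```r
# Hypothetical skewed, non-normal toy data
set.seed(6)
x <- rexp(30)
y <- x^2 + rexp(30)

# Spearman's correlation is just Pearson's correlation of the ranks:
isTRUE(all.equal(cor(x, y, method = "spearman"),
                 cor(rank(x), rank(y))))  # TRUE
```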
Spearman's correlation in R Test the hypothesis H: ρ_S = 0. > cor.test(ckd$pwv0, ckd$aix0, method="spearman") Spearman's rank correlation rho data: ckd$pwv0 and ckd$aix0 S = 13982, p-value = 0.01981 alternative hypothesis: true rho is not equal to 0 sample estimates: rho 0.3285996 Note: You don't get a confidence interval!