Linear regression and correlation


Faculty of Health Sciences
Linear regression and correlation
Statistics for experimental medical researchers, 2018
Julie Forman, Christian Pipper & Claus Ekstrøm
Department of Biostatistics, University of Copenhagen

Associations between quantitative variables
How do we find evidence of association? How can we describe/quantify the association?
Method 1: Linear regression. How do we determine the best fitting line so that we can predict one variable from the other? How do we quantify the statistical uncertainty? We assume a linear association to simplify; is that fair?
Method 2: Correlation. Simple symmetrical measures of association.

Outline
- Linear regression
- Model assumptions and prediction
- About linear models in R
- Correlations

Case study: A dose-response experiment
Does cell concentration affect cell size of tetrahymena in cell cultivation? Is there an association? How can we describe it? How well can we predict diameter when we know concentration?

Same picture with fitted regression line
The regression line doesn't fit all measurements spot on, but overall the association looks reasonably linear.

The linear model
y = α + βx + ε
- y is the response (or outcome), in this case log2(diameter).
- x is the explanatory variable (or predictor), in this case log2(concentration).
- β is the regression coefficient (or slope): it tells us how much y increases when x is increased by one unit.
- α is the intercept: where the line intersects the y-axis. Formally, α is "the expected value of y when x = 0", but this interpretation is often extrapolated and not really meaningful in practice.
- ε is an individual error term, assumed normally distributed with zero mean and standard deviation σ_ε.
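To make the model concrete, here is a minimal R sketch (my own illustration, not lecture code) that simulates data from the model with made-up parameter values and checks that lm recovers them:

# Simulate from y = alpha + beta*x + epsilon with alpha=2, beta=0.5, sigma=1
# (illustrative values, not the tetrahymena data)
set.seed(1)
n <- 50
x <- runif(n, 0, 10)
y <- 2 + 0.5*x + rnorm(n, sd=1)
coef(lm(y ~ x))  # estimates should be close to 2 and 0.5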

Graphical interpretation of the regression parameters
[Figure: the line y = α + βx; α is the height where the line crosses the y-axis, and β is the increase in y per one-unit increase in x.]

How do we find the "best fitting line"?
Answer: by least squares (explicit formulae or R). Find $\hat\alpha, \hat\beta$ which minimize

$\sum_{i=1}^{n} (y_i - (\alpha + \beta x_i))^2$

The deviations $r_i = y_i - (\hat\alpha + \hat\beta x_i)$ are called the residuals. Finding the best fitting line is the same as minimizing the residual variance

$s^2 = \frac{1}{n-2} \sum_{i=1}^{n} r_i^2$

We estimate the residual standard deviation by $s = \sqrt{s^2}$.
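As a sanity check on the least-squares idea, the sum of squared deviations can also be minimized numerically; a sketch with simulated data (my own, not part of the lecture code):

# Minimize sum((y - (alpha + beta*x))^2) numerically and compare with lm
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5*x + rnorm(50)
rss <- function(par) sum((y - (par[1] + par[2]*x))^2)
optim(c(0, 0), rss)$par  # numerical least-squares estimates (intercept, slope)
coef(lm(y ~ x))          # same values, up to optimizer tolerance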

The residuals
Definition: $r_i = y_i - \hat\alpha - \hat\beta x_i$
[Figure: scatterplot of vol against wt with the fitted line; the residuals are the vertical deviations from the points to the line.]

Quantification and test of association
If no association exists between x and y, then we expect the regression line to be horizontal, that is β = 0 so that y = α + ε regardless of x. We can test the null hypothesis H: β = 0 using

$t = \hat\beta / \mathrm{s.e.}(\hat\beta)$

which has a t-distribution with n − 2 degrees of freedom in case the null hypothesis is true. (More on the standard error later.) And we can obtain a confidence interval for β from

$\hat\beta \pm t(n-2) \cdot \mathrm{s.e.}(\hat\beta)$

Inference for α is rarely of interest, but otherwise similar.
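The t statistic and confidence interval can be reproduced by hand from the lm output; a sketch with simulated data (my own illustration, hypothetical variable names):

# Hand computation of the slope test and 95% CI
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5*x + rnorm(50)
fit <- lm(y ~ x)
est <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]
t   <- est/se                                   # the t statistic
2*pt(-abs(t), df=df.residual(fit))              # two-sided p-value
est + c(-1, 1)*qt(0.975, df.residual(fit))*se   # matches confint(fit)["x", ]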

Case study: estimates and inference
Estimates etc. are easily obtained from R:

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  5.405816  0.068406     79.03   <2e-16 ***
log2conc    -0.054515  0.004178    -13.05   <2e-16 ***
Residual standard error: 0.05514 on 49 degrees of freedom

The rows in the table describe first α, the intercept, and second β, the slope corresponding to the effect of the log2-concentration on the log2-diameter. What can we conclude?

Interpretation
There is a significant association between concentration and cell diameter (P < 0.0001). We estimate that log2(diameter) changes by −0.0545 every time log2(concentration) increases by one unit, that is, each time the concentration is doubled. The 95% confidence interval is −0.0629 to −0.0461.
Another way of expressing this is that the diameter decreases exponentially with log2(concentration): specifically, by an estimated factor $2^{-0.0545} \approx 0.9629$, or −3.71%, every time the concentration is doubled, with 95% CI −4.27% to −3.15%.
The intercept 5.41 might be interpreted as the expected log2(diameter) when log2(concentration) = 0 (that is, when concentration = 1), but this is highly extrapolated.
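The back-transformed effect above can be computed directly from the model object; a sketch, assuming the fitted object fit from the R demo later in this lecture:

# Multiplicative effect per doubling of concentration
# (assumes fit <- lm(log2diam ~ log2conc, data=dr) as in doseresponse.r)
2^coef(fit)["log2conc"]                 # estimated factor, about 0.963
100*(2^confint(fit)["log2conc", ] - 1)  # percent change with 95% CI, about -4.27% to -3.15%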

Regression coefficients by formulae
The "best fitting line" can be found explicitly. The estimated slope is given by

$\hat\beta = \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n} (x_i - \bar x)^2}$

and the intercept can be computed as

$\hat\alpha = \bar y - \hat\beta \bar x$

where $\hat\beta$ from the previous formula is inserted. Note that the fitted line always passes through the point $(\bar x, \bar y)$.
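These formulas translate directly into R; a sketch with simulated data (my own illustration), checked against lm:

# Slope and intercept by the explicit formulas
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5*x + rnorm(50)
beta.hat  <- sum((x - mean(x))*(y - mean(y)))/sum((x - mean(x))^2)
alpha.hat <- mean(y) - beta.hat*mean(x)
c(alpha.hat, beta.hat)  # identical to coef(lm(y ~ x))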

Standard error formulae
The standard errors of $\hat\alpha$ and $\hat\beta$ are given by

$\mathrm{s.e.}(\hat\alpha) = s \sqrt{\frac{1}{n} + \frac{\bar x^2}{\sum_i (x_i - \bar x)^2}}, \qquad \mathrm{s.e.}(\hat\beta) = \frac{s}{\sqrt{\sum_i (x_i - \bar x)^2}}$

where $s$ is the residual standard deviation. A bigger sample size n will of course give rise to smaller standard errors, but the specific values of the x's also have an impact:
- s.e.($\hat\beta$) is larger if x doesn't vary much.
- s.e.($\hat\alpha$) is larger if x doesn't vary much and/or if $\bar x$ is far away from 0.
- Both are larger if the residual variance is large.
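These standard-error formulas can likewise be verified against R's output; a sketch with simulated data (my own illustration):

# Standard errors of intercept and slope by the formulas above
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5*x + rnorm(50)
fit <- lm(y ~ x)
s <- summary(fit)$sigma  # residual standard deviation
se.beta  <- s/sqrt(sum((x - mean(x))^2))
se.alpha <- s*sqrt(1/length(x) + mean(x)^2/sum((x - mean(x))^2))
c(se.alpha, se.beta)     # matches coef(summary(fit))[, "Std. Error"]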

Outline
- Linear regression
- Model assumptions and prediction
- About linear models in R
- Correlations

Model assumptions
The statistical model assumed by the linear regression analysis is

$y_i = \alpha + \beta x_i + \varepsilon_i$

where the error terms $\varepsilon_i$ describe the individual deviations from the regression line, assumed to be random, normally distributed with mean 0 and standard deviation $\sigma_\varepsilon$. There are four model assumptions we need to consider:
1. Observations are mutually independent (no clustering).
2. The true association is linear.
3. The error terms ε are normally distributed.
4. The error terms ε have the same standard deviation, regardless of the value of x.

What do we need to check?
1. Independence should be ensured by the study design.
2. Linearity is checked in a residual plot, i.e. a scatterplot of the residuals against the fitted (predicted) values in the data.
3. Normality is checked by making a QQ plot of the standardized residuals.
4. Homogeneity of variance is assessed from the residual plot.
A sketch of these checks in R follows below.
Note: it is not strictly necessary that the error terms are normally distributed if the sample size is large; confidence intervals and tests are valid as long as the other model assumptions hold. Prediction intervals, however, can only be trusted if the error terms are truly normal.
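A minimal sketch of checks 2-4 in R, assuming a fitted lm object fit (e.g. from the demo later in this lecture):

# Residual plot: residuals vs fitted values (checks linearity and variance homogeneity)
plot(fitted(fit), residuals(fit)); abline(h=0, lty=2)
# QQ plot of standardized residuals (checks normality)
qqnorm(rstandard(fit)); qqline(rstandard(fit))
# or simply plot(fit) for R's built-in diagnostic plots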

Case study: residual plot and QQ plot
Note: the points in the residual plot should be randomly and symmetrically scattered around the zero line, with the same variability at any point in the interval. Do we see any systematic deviations from this?

Prediction
To predict the expected y for a new value of x, we plug into the equation of the estimated regression line:

$\hat y(x_0) = \hat\alpha + \hat\beta x_0$

Example: when the concentration is 250000, then x_0 = log2(250000) = 17.9316, and we would expect a log2-diameter of 5.4058 − 0.0545 · 17.9316 = 4.4283, i.e. a diameter around 2^4.4283 = 21.53. Note that this involves interpolating between the doses tested in the experiment; extrapolating should be avoided.
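In practice the prediction is done with predict(); a sketch, assuming fit and dr from the R demo later in this lecture:

# Predict the expected log2-diameter at concentration 250000
new <- data.frame(log2conc=log2(250000))
predict(fit, newdata=new)    # about 4.43 on the log2 scale
2^predict(fit, newdata=new)  # about 21.5 on the diameter scale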

Uncertainty in prediction I
Not all responses are on the average. There are two sources of uncertainty we need to consider when making predictions:
1. Natural variation in responses (estimated by the residual standard deviation s).
2. The statistical uncertainty in our estimates (standard errors).
The standard error of the expected value at x_0 is

$\mathrm{s.e.}(\hat y(x_0)) = s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar x)^2}{\sum_i (x_i - \bar x)^2}}$

This is the uncertainty related to estimating the average response at x_0.

Uncertainty in prediction II
If we want to predict individual responses at x = x_0 with 95% certainty, then we need

$\mathrm{s.d.}(y_{\mathrm{new}}(x_0) - \hat y(x_0)) = s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{\sum_i (x_i - \bar x)^2}}$

where the residual standard deviation has been added to the estimation uncertainty.

Confidence vs. prediction interval
Which is the prediction interval and which is the confidence interval? What happens when the sample size increases? (See the R sketch below.)
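Both intervals come from predict(); a sketch continuing the prediction example above (assumes fit and new from the demo):

# Confidence interval for the mean response vs prediction interval for a new observation
predict(fit, newdata=new, interval="confidence")  # narrower; shrinks as n grows
predict(fit, newdata=new, interval="prediction")  # wider; does not shrink to zero as n grows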

A nicer picture
Obtained by back-transforming with 2^x before plotting.

Outline
- Linear regression
- Model assumptions and prediction
- About linear models in R
- Correlations

Linear models in R
We use the lm function to do linear regression (and a lot more: ANOVA, multiple regression, ...). The model must be specified by a model formula, e.g.:
fit <- lm(log2diam ~ log2conc, data=dr)
where ~ should be read as "potentially depends on" or "is potentially predicted by". The response goes on the left and the predictor on the right. lm returns a so-called model object of class "lm"; you don't have to understand all of its contents to use it.

Extractor functions
R has a bunch of functions that can be used to extract information from model objects, e.g.:
- summary(fit): table of estimates, tests, and more.
- confint(fit): confidence intervals.
- abline(fit): add the fitted line to an existing plot.
- residuals(fit): vector containing the residuals.
- predict(fit, frame): predict y's for supplied x values.
- plot(fit): diagnostic plots (e.g. for checking model assumptions).

R demo: doseresponse.r

load(file.choose())  # choose doseresponse.rda
dr <- transform(dr, log2conc=log2(concentration), log2diam=log2(diameter))
# fit the linear model
fit <- lm(log2diam ~ log2conc, data=dr)
summary(fit)
cbind(coef(fit), confint(fit))
plot(dr$log2conc, dr$log2diam)
abline(fit, col="blue")
# NOTE: predictions and more in the program file

Exercise: Run the demo!

Outline
- Linear regression
- Model assumptions and prediction
- About linear models in R
- Correlations

Regression vs correlation
In linear regression, we model a directed relationship, either:
- A causal relation: we assume that x has an effect on y, not the other way around.
- A prediction problem: we know x and want to predict y.
It matters in which order the two variables are supplied to the formula in R. But sometimes we just want to know: are two different outcomes associated? In this case a correlation coefficient gives a crude measure of the strength of the association.

Pearson's correlation

$r = \frac{\sum (x - \bar x)(y - \bar y)}{\sqrt{\sum (x - \bar x)^2 \sum (y - \bar y)^2}}$

measures the degree of linear association between two outcomes.
- r is symmetrical in x and y.
- r is always between −1 and +1.
- r has the same sign as the regression coefficient β (no matter whether you regress y on x or the other way around).
Crude rule of thumb: |r| < 0.3: weak correlation; 0.3 < |r| < 0.5: moderate correlation; |r| > 0.5: strong correlation.
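The formula maps directly to R and agrees with the built-in cor(); a sketch with simulated data (my own illustration):

# Pearson's r by the formula vs cor()
set.seed(1)
x <- rnorm(40)
y <- 0.5*x + rnorm(40)
r <- sum((x - mean(x))*(y - mean(y))) /
     sqrt(sum((x - mean(x))^2)*sum((y - mean(y))^2))
c(r, cor(x, y))  # identical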

Interpretation of Pearson's correlation coefficient
- r = 0: no correlation, i.e. no (linear) association; occurs when x and y are mutually independent.
- r > 0: positive correlation; larger/smaller values of x and y tend to coincide.
- r < 0: negative correlation; larger values of x tend to coincide with smaller values of y and vice versa.
- r = ±1: perfect linear association.

Be careful!
- The correlation assumes that both x and y are random. It doesn't make sense to report a correlation coefficient if the values of x were dictated by the study protocol.
- The strength of the correlation depends on the study population. E.g. height and weight are more strongly correlated in pups than in adults.
- Interpretation should depend on the study aims. A 90% correlation may be poor if we are comparing two laboratory scales supposed to measure the same thing!
- Association is not the same as agreement. A device that hasn't been properly calibrated may correlate almost perfectly with one that has, but the measurements may still show a large systematic deviation. The simulation below illustrates this.
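The last point is easy to demonstrate by simulation; a sketch (my own, with made-up calibration numbers):

# Correlation is not agreement: a biased device correlates almost perfectly
set.seed(1)
truth  <- rnorm(30, mean=100, sd=10)  # measurements from a calibrated device
device <- 1.2*truth + 5 + rnorm(30, sd=1)  # a systematically biased device
cor(truth, device)    # close to 1
mean(device - truth)  # yet a large systematic deviation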

Analyzing correlation in R
CKD data from course day 2:

> cor.test(ckd$pwv0, ckd$aix0)
        Pearson's product-moment correlation
data:  ckd$pwv0 and ckd$aix0
t = 2.4151, df = 48, p-value = 0.01959
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.05594193 0.55652213
sample estimates:
      cor
0.3291641

BUT: are the outcomes normally distributed?

Model assumptions for Pearson correlation
Pearson's correlation is valid under the assumption that the two variables have a two-dimensional (bivariate) joint normal distribution. [Figure source: Wikipedia.]

The 2D normal distribution as we see it
[Figure source: Wikipedia.]

Anscombe's quartet
Four datasets sharing a Pearson correlation of 0.816. BUT: in which case is the Pearson correlation appropriate?

Non-normally distributed data
Use Spearman's rank correlation instead. It makes no assumptions save that observations are independent. The formula is the same as for Pearson's correlation,

$r_S = \frac{\sum (\mathrm{rank}(x) - \overline{\mathrm{rank}(x)})(\mathrm{rank}(y) - \overline{\mathrm{rank}(y)})}{\sqrt{\sum (\mathrm{rank}(x) - \overline{\mathrm{rank}(x)})^2 \sum (\mathrm{rank}(y) - \overline{\mathrm{rank}(y)})^2}}$

only the original data have been replaced by their ranks. The rank of an observation is its number on the list when all data have been ordered from the smallest value to the largest.
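Equivalently, Spearman's correlation is just Pearson's correlation computed on the ranks; a sketch with simulated skewed data (my own illustration):

# Spearman's rho equals Pearson's r on the ranks
set.seed(1)
x <- rexp(40)
y <- x^2 + rexp(40)  # skewed, non-normal data
cor(rank(x), rank(y))         # Pearson on ranks
cor(x, y, method="spearman")  # identical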

Spearman's correlation in R
Test the hypothesis H: ρ_S = 0.

> cor.test(ckd$pwv0, ckd$aix0, method="spearman")
        Spearman's rank correlation rho
data:  ckd$pwv0 and ckd$aix0
S = 13982, p-value = 0.01981
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho
0.3285996

Note: you don't get a confidence interval!