Statistics for exp. medical researchers Regression and Correlation


Faculty of Health Sciences Regression analysis Statistics for exp. medical researchers Regression and Correlation Lene Theil Skovgaard Sept. 28, 2015 Linear regression: estimation and testing Confidence and prediction limits Model checks and diagnostics The correlation coefficient Transformation Comparison of regression lines Home pages: *: indicates that this page may be skipped without serious consequences The nature of the explanatory variable So far, we have been looking at multi-factor designs (ANOVA) and variance component models, where the explanatory variables have been factors: Condition/Treatment (A, B, C, Ctrl), Laboratory (1, 2, 3, 4), Splenectomy, Altitude. Now we turn to quantitative explanatory variables: concentration, temperature, age. Regression: establish a (parsimonious) smooth relation between x and y, making prediction possible. Examples of regression-type problems: dose-response relations, calibration curves.

Quantitative explanatory variable Relation between two quantitative variables: the explanatory variable (x-variable) and the outcome variable (y-variable; also called response variable or dependent variable). Linearity: every smooth curve can be approximated by a straight line, at least locally. Linearity is easy to deal with, and sometimes it takes a transformation to make the relation linear; we choose the scale to suit the linearity assumption (...or perhaps it is not linear on any scale). Example: Cell concentration of tetrahymena The unicellular organism tetrahymena is grown in two different media, with and without glucose. Research question: how does cell concentration x (number of cells in 1 ml of the growth medium) affect the cell size y (average cell diameter, measured in µm)? Quantitative covariate: concentration x. Quantitative outcome: diameter y. Here, we need a log-transformation (more later on, p. 71 ff).

Example (Book, p. 226) Calibration curve for measuring the concentration of selenium. Mathematical formulation: y = α + βx. 6 known concentrations: 0, 20, 40, 80, 120 and 160. Triplicate measurements for each concentration, so close that they cannot be seen individually. Does this look like a straight line? Yes. Parameters of the straight line Intercept α: the expected outcome when the explanatory variable x is zero; units identical to y-units. Slope β: the expected difference in y corresponding to a one-unit difference in x; units in y-units per x-unit. Model for selenium measurements y_ci: the i'th measurement at the c'th concentration. x_c: the corresponding known concentration of selenium. Model: E(y_ci) = α + β x_c. We call this a simple linear regression: simple, because there is only one explanatory variable (concentration); linear, because the explanatory variable has a linear effect. But: we have some issues regarding correlation of triplicate measurements... (p. 50)

Average over triplicates, to avoid the correlation issue: y_c is the average measurement at the c'th concentration, x_c the corresponding known concentration of selenium. Model for linear regression The mean value of the outcome depends linearly on the explanatory variable, and the variance σ², or σ²_{y|x} (the variance of the residuals, i.e. the vertical distances from the regression line), is assumed constant: Y_c = α + β x_c + ε_c, ε_c ~ N(0, σ²), independent. The regression model specifies the conditional distributions of Y, given x, to be Normal, with identical variances and with mean values that depend linearly on x. Method of Least Squares Derived from the general likelihood principle: minimize the residual sum of squares SS_res = Σ_{c=1}^n (y_c − ŷ_c)² = Σ_{c=1}^n (y_c − α − β x_c)², the residuals here being the vertical distances from the observations y_c to the line (ŷ_c = α̂ + β̂ x_c), i.e. r_c = y_c − (α̂ + β̂ x_c).

Technicalities: Estimation of slope β̂ = s_xy / s_x², where s_xy = 1/(n−1) · Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) is the covariance between x and y, and s_x² = 1/(n−1) · Σ_{i=1}^n (x_i − x̄)² is the variance of the covariate. Estimation with SAS proc means nway N mean data=a1; class part Concentration; var Selenium; output out=av mean=average_selenium; run; ods graphics on; proc reg plots=all data=av; model Average_Selenium=Concentration / clb; run; ods graphics off; Results for Selenium averages The REG Procedure, Dependent Variable: Average_Selenium, Number of Observations Used 6. Analysis of Variance: Model Pr > F <.0001 [numeric table values missing]. Parameter Estimates: Intercept; Concentration, Pr > |t| <.0001; with 95% confidence limits for both. Results for Selenium averages, II Taken from output: α̂ = 0.943 (0.814), β̂ = 0.751 (0.009), where s_{y|x} denotes the estimate of the residual standard deviation σ_{y|x} = √(σ²), called "Root MSE" in the SAS output, and estimated as s²_{y|x} = SS_res/(n−2). α̂ and s_{y|x} are measured in the units of the outcome variable; β̂ is measured in outcome-units per x-unit.
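The slope formula above (covariance divided by the variance of x) can be checked with a short Python sketch; the numbers below are invented for illustration and are not the selenium data:

```python
import numpy as np

# Hypothetical calibration-style data (x = known concentration, y = measured value).
x = np.array([0.0, 20.0, 40.0, 80.0, 120.0, 160.0])
y = np.array([1.1, 16.0, 31.2, 61.5, 91.0, 121.3])

n = len(x)
s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)  # sample covariance
s_xx = np.sum((x - x.mean()) ** 2) / (n - 1)              # sample variance of x

beta_hat = s_xy / s_xx                       # slope = covariance / variance of x
alpha_hat = y.mean() - beta_hat * x.mean()   # the LS line passes through (x-bar, y-bar)

print(alpha_hat, beta_hat)
```

The same numbers come out of any standard least-squares routine, e.g. `numpy.polyfit(x, y, 1)`.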

Uncertainty of estimated slope SE(β̂) = σ_{y|x} / (s_x √(n−1)). Good precision when: the residual variation σ_{y|x} is small; the sample (n) is large; the variation in the explanatory variable (s_x) is large (i.e. when concentrations vary a lot). Confidence interval for parameters With 95% coverage, here shown for the slope: β̂ ± t-quantile × SE(β̂). Here, n = 6, so df = 6 − 2 = 4, and the corresponding t-quantile is t_{0.975}(4) = 2.776. Therefore, the interval becomes 0.751 ± 2.776 × 0.009 = (0.726, 0.776). Test of zero slope Cut from the output: Intercept; Concentration, Pr > |t| <.0001. T = 0.751/0.009 ≈ 83 ~ t(4), P < 0.0001: strong evidence of a relationship between the actual concentration and the measured response. Is this at all interesting? Maybe test α = 0? We want to know more... to be continued. Use of regression results Prediction: determine y from x, the ordinary way, i.e. we predict observations for a given known concentration of selenium: ŷ_i = α̂ + β̂ x_i. Calibration: determine x from y, the reverse way: we estimate unknown concentrations of selenium from one or several measurements taken. We shall look briefly at that later.
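The standard error and confidence interval can be computed directly from the formulas above; this is a sketch with invented data standing in for the six concentration averages:

```python
import numpy as np
from scipy.stats import t

# Invented data; not the selenium averages themselves.
x = np.array([0.0, 20.0, 40.0, 80.0, 120.0, 160.0])
y = np.array([1.0, 16.1, 31.0, 61.3, 91.2, 121.0])

n = len(x)
Sxx = np.sum((x - x.mean()) ** 2)
beta = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
alpha = y.mean() - beta * x.mean()

resid = y - (alpha + beta * x)
s_yx = np.sqrt(np.sum(resid ** 2) / (n - 2))   # residual SD ("Root MSE"), df = n - 2
s_x = np.std(x, ddof=1)

se_beta = s_yx / (s_x * np.sqrt(n - 1))        # SE(beta-hat) = s_yx / (s_x * sqrt(n-1))
tq = t.ppf(0.975, df=n - 2)                    # two-sided 95% t-quantile, df = 4
ci = (beta - tq * se_beta, beta + tq * se_beta)
print(beta, se_beta, ci)
```

Note that s_x √(n−1) = √Sxx, so this is the familiar SE(β̂) = s_{y|x}/√Sxx in disguise.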

Confidence and prediction limits Confidence limits show the uncertainty in the estimated regression line: they tell us where the line may also be, and they become narrower when the sample size is increased. Here they have almost collapsed onto the line itself. Prediction limits show the (future) variation in the outcome for a given covariate (reference regions): they tell us where future subjects will lie, and they have approximately the same width no matter the sample size. In the plot, the confidence limits can hardly be seen since they are so narrow. Check of model assumptions Look for possible flaws: Linearity: plot residuals vs. the explanatory variable; curves? (p. 28). Test whether a second-order polynomial (a parabola) is better than a straight line (p. 29). Variance homogeneity: plot residuals against predicted values; trumpet shape? (p. 30). Normality: histogram, skewness? Quantile plot, hammock shape? (p. 31). We do not have enough information to reasonably perform these checks here. Residual plot, for check of linearity: curves?
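The distinction can be sketched numerically: at a given x0, the confidence limit for the line uses the standard error of the fitted mean, while the prediction limit adds the residual variance of a single new observation (data invented for illustration):

```python
import numpy as np
from scipy.stats import t

x = np.array([0.0, 20.0, 40.0, 80.0, 120.0, 160.0])
y = np.array([1.0, 16.1, 31.0, 61.3, 91.2, 121.0])
n = len(x)
Sxx = np.sum((x - x.mean()) ** 2)
beta = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
alpha = y.mean() - beta * x.mean()
s_yx = np.sqrt(np.sum((y - alpha - beta * x) ** 2) / (n - 2))
tq = t.ppf(0.975, df=n - 2)

x0 = 100.0
fit = alpha + beta * x0
se_mean = s_yx * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / Sxx)      # uncertainty of the line
se_pred = s_yx * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / Sxx)  # spread of a new obs.
conf = (fit - tq * se_mean, fit + tq * se_mean)
pred = (fit - tq * se_pred, fit + tq * se_pred)
print(conf, pred)
```

The extra "1 +" under the square root is why prediction limits stay wide no matter how large n gets.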

Numerical check of linearity Include a second-order term, Concentration2=(Concentration-75)**2; proc reg data=av; model Average_Selenium=Concentration Concentration2; run; which yields the output Parameter Estimates: Intercept; Concentration, Pr > |t| <.0001; Concentration2. No significant deviation from linearity, since the second-order term has a non-significant P-value. Residual plot, for check of variance homogeneity: trumpet shape? Model checks at a glance If assumptions fail: Linearity: transform, or do non-linear regression. Variance homogeneity: transform. Normality: transform. Linearity is the most important assumption, unless the task is to construct prediction intervals! More on transformations later... (p. 71 ff)

Diagnostics Assess the influence of single observations by leaving out one observation at a time: omit the i'th observation from the analysis, obtain new estimates α̂_(−i) and β̂_(−i), and compute the deletion diagnostics dev(α)_i = α̂ − α̂_(−i) and dev(β)_i = β̂ − β̂_(−i), both normalized by the standard error of the estimate. The squared deletion diagnostics are combined into a single diagnostic, Cook's distance Cook(α, β)_i. Deletion diagnostics for selenium averages Influence option (in the MODEL statement of PROC REG): proc reg data=av; model Average_Selenium=Concentration / r clb influence; run; Output: Output Statistics (Dependent Variable, Predicted Value, Std Error Mean Predict, Residual, Std Error Residual, Student Residual, for each observation). Deletion diagnostics for selenium averages, II Output, continued: Cook's D, RStudent, Hat Diag H, Cov Ratio, DFFITS, and DFBETAS for Intercept and Concentration, for each observation. Note the large DFBETAS values for Obs=6. Cook's distance Note the large value of COOKD for Obs=6.
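The leave-one-out idea behind these diagnostics can be sketched directly; the data below are invented, with the last point made deliberately extreme so that it dominates the deletion diagnostics:

```python
import numpy as np

def fit_line(x, y):
    """Least-squares intercept and slope."""
    beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    return y.mean() - beta * x.mean(), beta

x = np.array([0.0, 20.0, 40.0, 80.0, 120.0, 160.0])
y = np.array([1.0, 16.1, 31.0, 61.3, 91.2, 135.0])  # last point deliberately off the line

alpha, beta = fit_line(x, y)
slope_shifts = []
for i in range(len(x)):
    a_i, b_i = fit_line(np.delete(x, i), np.delete(y, i))
    # Raw deletion effect on the slope; PROC REG's DFBETAS also divide by the SE.
    slope_shifts.append(abs(beta - b_i))

print(slope_shifts)
```

Refitting without the extreme high-leverage point moves the slope far more than deleting any of the others, which is exactly what Cook's distance and DFBETAS quantify.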

The correlation coefficient A numerical quantity describing the degree of (linear) relationship between two variables. Pearson correlation, assuming normality of both variables: r = r_xy = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / √( Σ_{i=1}^n (x_i − x̄)² · Σ_{i=1}^n (y_i − ȳ)² ). Spearman correlation: based on ranks. Both of them take on values between −1 and 1 (0 corresponding to independence); +1 and −1 correspond to perfect (linear) relationships, positive and negative respectively. Bivariate Normal distribution, ρ = 0: all vertical slices yield normal distributions with identical mean values and identical variances. Bivariate Normal distribution, ρ = 0.9: all vertical slices yield normal distributions with different mean values but identical variances. Contour curves Contour curves of a Normal distribution are ellipses (or circles in case of ρ = 0). Scatter plots should resemble ellipses.
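A quick sketch of the difference between the two coefficients, with invented data: for a monotone but non-linear relation, the Spearman correlation is exactly 1 while the Pearson correlation stays below 1.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = x ** 3   # perfectly monotone, but clearly not linear

r_p, p_p = pearsonr(x, y)    # linear (Pearson) correlation
r_s, p_s = spearmanr(x, y)   # rank (Spearman) correlation
print(r_p, r_s)
```

Because Spearman only looks at ranks, any strictly increasing transformation of y leaves it unchanged.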

Regression vs. correlation The regression model Y_c = α + β x_c + ε_c, ε_c ~ N(0, σ²), independent, specifies the conditional distributions of Y, given x, to be Normal, with identical variances and with mean values that depend linearly on x. The assumptions for interpreting a correlation are stronger than for interpreting a slope (they involve normality of both variables). Interpretation of a correlation coefficient is often misleading... and almost always non-informative: the correlation has no units and gives no quantification of the relation. Tests of zero slope and zero correlation are identical, and do not assume anything regarding the distribution of x. Problems with the correlation Test of zero slope or zero correlation is the same thing: the two estimates (for correlation and slope) resemble one another (in formulae), and they become 0 simultaneously: β̂ = S_xy / S_xx, r_xy = S_xy / √(S_xx S_yy), β̂ = r_xy √(S_yy / S_xx), r_xy = β̂ √(S_xx / S_yy). The test for β = 0 is identical to the test for ρ_xy = 0. Formula manipulation yields the equality 1 − r²_xy = s² / (s² + β̂² S_xx / (n − 2)). Fix the sample size n, the slope β and the residual variation s_{y|x}, but increase the variation s_x in the covariate x. What happens? The correlation approaches either 1 or −1!!
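The "fix everything but s_x" thought experiment is easy to simulate: same slope, same residual SD, only the spread of the covariate changes (a sketch with simulated data):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_r(x_spread):
    # Same slope (beta = 1) and residual SD (sigma = 1); only the x-spread varies.
    x = np.linspace(-x_spread, x_spread, 50)
    y = x + rng.normal(0.0, 1.0, size=len(x))
    return np.corrcoef(x, y)[0, 1]

r_narrow = sample_r(1.0)    # little variation in the covariate
r_wide = sample_r(100.0)    # large variation in the covariate
print(r_narrow, r_wide)
```

The underlying relationship is identical in the two cases; only the design changed, yet the correlation is pushed arbitrarily close to 1. This is why a correlation, unlike a slope, says as much about how the x-values were chosen as about the relation itself.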

Two imaginary investigations Slopes are equal; correlations are not! Can we increase the correlation even further, without obtaining more observations? Yes! When can we use the correlation? When we only want to test whether or not there is a relation between two variables (only a P-value needed); in this case, consider the non-parametric Spearman correlation to avoid the Normality (linearity) assumption. To rank the relatedness of many variables, measured on the same units (e.g. concentrations of different compounds in the same solution). The correlation has no units... and is therefore hard to interpret! Correlation for Selenium Correlation between the average selenium measurements and the corresponding known concentrations: proc corr pearson spearman data=av; var Average_Selenium Concentration; run; with output: Simple Statistics (N, Mean, Std Dev, Median, Minimum, Maximum for Average_Selenium and Concentration), and Pearson Correlation Coefficients, N = 6, Prob > |r| under H0: Rho=0, both <.0001. The very high correlation indicates a close-to-linear relationship. And so what?

Correlation for Selenium, III Spearman Correlation Coefficients, N = 6, Prob > |r| under H0: Rho=0: the Spearman correlation between Average_Selenium and Concentration is 1 (P < .0001), indicating a perfect monotone relationship, not necessarily linear. And so what? Analysis of all Selenium measurements Originally, we had 18 measurements, triplicates for each known concentration. Could we use all of these 18 measurements to obtain better estimates and narrower confidence intervals? No, probably not, since triplicates are not independent. We expect several sources of variation in this investigation: error in the known concentration, observer variation, temperature effects... and measurement error. If triplicates are measured on the same solution, an error in the known concentration will affect all three measurements equally, i.e. they would all be too large or too small; they would be correlated. Analysis of all Selenium measurements, II In case of correlated measurements, using the same simple regression model applied to all individual measurements would be wrong, and would give too small P-values for anything and too narrow confidence intervals. Instead, we could build a mixed model, and the result would be identical to the analysis of averages, unless the design is unbalanced, e.g. due to missing observations. Naive (= wrong) analysis of all 18 observations The REG Procedure, Dependent Variable: Selenium, Number of Observations Used 18. Analysis of Variance: Model Pr > F <.0001. Parameter Estimates: Intercept; Concentration, Pr > |t| <.0001; with 95% confidence limits. (Book, p. 251)

Residual plot from naive analysis Note the correlation between residuals from the same concentration. Comparison of the two analyses The naive one (wrong), with all 18 measurements, vs. the analysis of averages: Naive: estimated slope with SE 0.005, test for zero slope P < .0001, n = 18. Averages: estimated slope with SE 0.009, test for zero slope P < .0001, n = 6. The discrepancy cannot be seen in the P-values, because the slope is so strongly significant in both analyses. Consequences of naive model The naive model based on all 18 observations ignores the correlation between triplicates: we think we have too much information, the standard errors become too small, the confidence intervals become too narrow, and the conclusions become exaggerated. Mixed model We are dealing with two variance components: the variation of the means around the regression line (ω²) and the measurement error (the variation between triplicates, σ²), formulated as Y_ci = α + β x_c + A_c + ε_ci, A_c ~ N(0, ω²), ε_ci ~ N(0, σ²), Corr(Y_ci, Y_cj) = ρ = ω² / (ω² + σ²).

Mixed model in SAS We have to specify that all observations regarding the same concentration (i.e. the triplicates) are correlated. This can be done in two ways: 1. directly specifying the triplicates to have a CS (Compound Symmetry) structured correlation (see p. 58); 2. specifying two variance components, one between concentrations (ω²) and one within concentration (the triplicate variation, σ²), see p. 60. Both of these structures require a copy of the covariate (cconcentration=concentration) specified as a factor, i.e. a CLASS variable. Specification I: proc mixed cl data=a1; class cconcentration; model Selenium=Concentration / s cl ddfm=satterth; repeated / subject=cconcentration type=cs r rcorr; run; Here, we directly specify type=cs, i.e. a correlation structure with 1 on the diagonal and the common correlation ρ in all off-diagonal positions, and get the output shown on the next page. Output from specification I: Estimated R Matrix, Estimated R Correlation Matrix, Covariance Parameter Estimates (CS cconcentration; Residual), and Solution for Fixed Effects (Intercept; Concentration, Pr > |t| <.0001) with confidence limits. Specification II: proc mixed plots=all cl data=a1; class cconcentration; model Selenium=Concentration / s cl ddfm=satterth; random intercept / subject=cconcentration; run; gives more or less the same output: Covariance Parameter Estimates (Intercept cconcentration; Residual) and Solution for Fixed Effects (Intercept; Concentration, Pr > |t| <.0001) with confidence limits.

Model check Comment on Mixed model in SAS Note: the results are identical to those for the analysis of averages, but we get extra information here: the estimated correlation ρ̂ (p. 59). This clearly violates the independence assumption, which is why the naive approach using all 18 measurements will provide wrong results. Effect of correlation on the number of independent pieces of information For n different doses (here n = 6) and k repetitions for each dose (here k = 3): how much do we gain by taking duplicates, triplicates etc. instead of just taking a single measurement? That depends on the correlation ρ. We ought to have n·k pieces of information, but due to the correlation, we only have m < n·k, with m = n·k / (1 + ρ(k − 1)).
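The formula for m can be sketched as a small function, showing the two extremes:

```python
def effective_n(n_doses: int, k_reps: int, rho: float) -> float:
    """Effective number of independent observations: m = n*k / (1 + rho*(k - 1))."""
    return n_doses * k_reps / (1 + rho * (k_reps - 1))

# rho = 0: all n*k measurements count; rho = 1: replicates add nothing beyond n.
print(effective_n(6, 3, 0.0))   # 18.0
print(effective_n(6, 3, 1.0))   # 6.0
print(effective_n(6, 3, 0.5))   # 9.0
```

With the design above (n = 6, k = 3), a within-triplicate correlation of 0.5 already halves the information from 18 to 9 independent pieces.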

Reminder: prediction What about the other way around? Calibration involves prediction/estimation the reverse way, i.e. estimation of an unknown x-value based on y-observations: Take a soil sample with an unknown concentration (c_0, say) of selenium. Measure with some instrument a couple of times (e.g. 3, as here), and get observations Y_01, Y_02, Y_03, with average Ȳ_0. Make a qualified guess of the unknown concentration, with confidence interval, based on the average measurement Ȳ_0. Since E(y_0) = α + β x_0, we must estimate ĉ_0 = (Ȳ_0 − α̂) / β̂. But what is the uncertainty in this expression? *Calibration uncertainty Based on k measurements (Y_0i, i = 1,..., k) of an unknown concentration (c_0), the standard error is (σ_{y|x} / β̂) · √( 1/k + 1/n + (ȳ_0 − ȳ)² / (β̂² Σ_{i=1}^n (x_i − x̄)²) ). How to do this in SAS? Not so easy, unfortunately... Example: Cell concentration of tetrahymena The unicellular organism tetrahymena is grown in two different media, with and without glucose. Research question: how does cell concentration x (number of cells in 1 ml of the growth medium) affect the cell size y (average cell diameter, measured in µm)? Quantitative covariate: concentration x. Quantitative outcome: diameter y.
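The calibration estimate and its uncertainty can be sketched in Python. The fitted line (α̂ = 0.943, β̂ = 0.751) is taken from the slides, while the residual SD and the three new readings are invented for illustration:

```python
import numpy as np

alpha_hat, beta_hat = 0.943, 0.751     # fitted calibration line (from the slides)
s_yx = 1.0                             # residual SD ("Root MSE"); value assumed here
x = np.array([0.0, 20.0, 40.0, 80.0, 120.0, 160.0])   # known concentrations
n = len(x)
# Mean outcome over the design; equals the observed y-mean, since the LS line
# passes through (x-bar, y-bar).
y_bar = alpha_hat + beta_hat * x.mean()

y_new = np.array([61.0, 62.1, 60.4])   # k = 3 invented readings on an unknown sample
k = len(y_new)
y0 = y_new.mean()

c0_hat = (y0 - alpha_hat) / beta_hat   # invert y = alpha + beta * x

Sxx = np.sum((x - x.mean()) ** 2)
se_c0 = (s_yx / beta_hat) * np.sqrt(1 / k + 1 / n
                                    + (y0 - y_bar) ** 2 / (beta_hat ** 2 * Sxx))
print(c0_hat, se_c0)
```

The three terms under the square root mirror the formula above: averaging over k readings, uncertainty in the line from n calibration points, and extra penalty when ȳ_0 lies far from the calibration mean.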

Scatter plot For the no glucose medium: the relation is clearly not linear. Residual plot for naive linear regression Note the curved shape, indicating that linearity between cell diameter and concentration is not appropriate. Power relationship Suggested relationship between diameter (y) and concentration (x): y = α x^β. Interpretation of the parameters: α denotes the cell size for a concentration of x = 1, an extrapolation to the extreme lower end of the concentration range, as seen from the scatter plot. β determines the effect of concentration: when the concentration x is doubled, the diameter changes by a factor 2^β. Logarithmic transformation Transforming the diameter (y) with a logarithm yields the theoretical relationship log10(y) = log10(α) + β log10(x), or in terms of observations: E(y*_i) = α* + β x*_i, where y*_i = log10(y_i), x*_i = log10(x_i), and α* = log10(α) is the intercept.

Scatter plot on double logarithmic scale: looks pretty linear. Regression on double logarithmic scale ods graphics on; proc reg plots=(diagnostics(unpack) residuals(smooth)) data=a1; where glucose="no"; model logdiameter = logconcentration / clb; run; ods graphics off; The REG Procedure, Dependent Variable: logdiameter, Number of Observations Used 19. Parameter Estimates: Intercept, Pr > |t| <.0001; logconcentration, Pr > |t| <.0001; with 95% confidence limits. Model check for logarithmic analysis: looks much better. Estimates for the multiplicative model, taken from the output on p. 74: α̂* = 1.635 (0.0202); β̂ with SE 0.0041. Back-transforming The effect of a doubling of the concentration is estimated to 2^β̂ = 0.959, a 4.1% reduction of diameter. Confidence limits: (0.954, 0.965), i.e. between a 3.5% and a 4.6% reduction.
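The back-transformation is simple arithmetic; here is a sketch, where the slope and the media difference (which appears later, in the analysis of covariance) are taken as assumed round values consistent with the back-transformed results on the slides:

```python
import math

beta = -0.0603        # slope on the double-log10 scale (assumed value, chosen so
                      # that 2**beta matches the slide's 0.959)
doubling_factor = 2 ** beta            # effect of doubling the concentration
pct_change = (doubling_factor - 1) * 100
print(doubling_factor, pct_change)     # about 0.959, i.e. roughly a 4.1% reduction

diff_log10 = 0.0282   # media difference on the log10 scale (assumed value, chosen
                      # to match the slide's factor 1.067)
media_factor = 10 ** diff_log10        # glucose vs. no glucose
print(media_factor)                    # about 1.067, i.e. roughly 6.7% larger diameters
```

The rule of thumb: a slope on a log10 outcome scale is multiplicative after back-transformation, via 2^β per doubling of x (or 10^δ for a group difference δ).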

Two media: with and without glucose Two parallel regression lines: an effect of concentration, assumed to be the same for both media, and a difference between the two media, assumed to be the same for all concentrations. This is called multiple regression with two covariates (explanatory variables), or analysis of covariance. It is an additive model, with no interaction. Analysis of covariance in SAS proc glm plots=all data=a1; class glucose; model logdiameter = logconcentration glucose /solution clparm; run; with output The GLM Procedure, Class Level Information: glucose, 2 levels (no, yes); Number of Observations Read 51; Dependent Variable: logdiameter; Model Pr > F <.0001. Type III tests: logconcentration <.0001; glucose <.0001. Parameter estimates: Intercept, logconcentration, glucose no (all Pr > |t| <.0001), glucose yes (reference level), with 95% confidence limits. The slope is the estimated effect of log10(Concentration) on log10(diam), so we must back-transform for interpretation:

Interpretation of output, slope The effect of concentration after back-transforming: the effect of a doubling of the concentration is estimated to 2^β̂ = 0.962, a 3.8% reduction of diameter. Confidence limits: (0.959, 0.965), i.e. between a 3.5% and a 4.1% reduction. Note that this is almost the same as when we considered one medium alone. Interpretation of output, difference between media The intercept is an estimate of α* (see p. 72), and since α* = log10(α), we have α = 10^{α*}. Therefore, the difference between the two media (glucose vs. no glucose) is a factor 1.067, i.e. a 6.7% higher cell diameter when glucose is added. Confidence limits: (1.054, 1.080), i.e. between a 5.4% and an 8.0% increase. Interaction? If the effect of one explanatory variable (X1) depends on the value of another (X2), we say that there is an interaction between X1 and X2: if the effect of concentration depends on the medium, or equivalently if the difference between the two media varies with concentration, we have interaction between concentration and media. Interaction means the two regression lines are not parallel; they have different slopes. Model with interaction in SAS proc glm plots=all data=tetrahymena; class glucose; model logdiam=logconc glucose logconc*glucose/solution clparm; estimate 'slope, glucose=0' logconc 1 logconc*glucose 1 0; estimate 'slope, glucose=1' logconc 1 logconc*glucose 0 1; output out=check r=res p=pred; run; The GLM Procedure, Class Level Information: glucose, 2 levels; Number of Observations Used 51; Dependent Variable: logdiam; Model Pr > F <.0001.

Interaction?, II Output, continued: Type III tests: logconc <.0001; glucose; logconc*glucose. Parameter estimates for the two slopes (slope, glucose=0 and slope, glucose=1), both Pr > |t| <.0001, with 95% confidence limits. Interaction?, III Output, continued: parameter estimates for Intercept, logconc, glucose and logconc*glucose, with 95% confidence limits. The test for no interaction gives P = 0.19, i.e. no significance. Interpretation of output, slopes We now have two different estimates of slope, depending on the presence of glucose. We back-transform to the effect of a doubling of the concentration: No glucose: 2^β̂ = 0.959, a 4.1% reduction of diameter (CI: 0.954, 0.965). Glucose: 2^β̂ = 0.964, a 3.6% reduction of diameter (CI: 0.960, 0.968). Note that for no glucose, we get the same results as on p. 76. Interpretation of output, media The difference between the two media (glucose vs. no glucose) now depends upon the concentration of cells! The estimate shown in the output (p. 86) refers to the difference between media when the explanatory variable is zero. Since our explanatory variable is the logarithm of the cell concentration, this corresponds to a cell concentration of 1, and only this particular value. This is way out of range.

Model fit, with interaction Interpretation of marginal effects Important: as long as an interaction is present (in the model), do not try to interpret the marginal effects of either explanatory variable, even if the interaction is seen to be insignificant. Instead, leave the interaction out of the model and run it again. This will make us return to the earlier additive analysis (the analysis of covariance).


More information

Notes 6. Basic Stats Procedures part II

Notes 6. Basic Stats Procedures part II Statistics 5106, Fall 2007 Notes 6 Basic Stats Procedures part II Testing for Correlation between Two Variables You have probably all heard about correlation. When two variables are correlated, they are

More information

9 Correlation and Regression

9 Correlation and Regression 9 Correlation and Regression SW, Chapter 12. Suppose we select n = 10 persons from the population of college seniors who plan to take the MCAT exam. Each takes the test, is coached, and then retakes the

More information

1 A Review of Correlation and Regression

1 A Review of Correlation and Regression 1 A Review of Correlation and Regression SW, Chapter 12 Suppose we select n = 10 persons from the population of college seniors who plan to take the MCAT exam. Each takes the test, is coached, and then

More information

STAT 350: Summer Semester Midterm 1: Solutions

STAT 350: Summer Semester Midterm 1: Solutions Name: Student Number: STAT 350: Summer Semester 2008 Midterm 1: Solutions 9 June 2008 Instructor: Richard Lockhart Instructions: This is an open book test. You may use notes, text, other books and a calculator.

More information

Topic 18: Model Selection and Diagnostics

Topic 18: Model Selection and Diagnostics Topic 18: Model Selection and Diagnostics Variable Selection We want to choose a best model that is a subset of the available explanatory variables Two separate problems 1. How many explanatory variables

More information

Regression models. Categorical covariate, Quantitative outcome. Examples of categorical covariates. Group characteristics. Faculty of Health Sciences

Regression models. Categorical covariate, Quantitative outcome. Examples of categorical covariates. Group characteristics. Faculty of Health Sciences Faculty of Health Sciences Categorical covariate, Quantitative outcome Regression models Categorical covariate, Quantitative outcome Lene Theil Skovgaard April 29, 2013 PKA & LTS, Sect. 3.2, 3.2.1 ANOVA

More information

Correlated data. Longitudinal data. Typical set-up for repeated measurements. Examples from literature, I. Faculty of Health Sciences

Correlated data. Longitudinal data. Typical set-up for repeated measurements. Examples from literature, I. Faculty of Health Sciences Faculty of Health Sciences Longitudinal data Correlated data Longitudinal measurements Outline Designs Models for the mean Covariance patterns Lene Theil Skovgaard November 27, 2015 Random regression Baseline

More information

6. Multiple regression - PROC GLM

6. Multiple regression - PROC GLM Use of SAS - November 2016 6. Multiple regression - PROC GLM Karl Bang Christensen Department of Biostatistics, University of Copenhagen. http://biostat.ku.dk/~kach/sas2016/ kach@biostat.ku.dk, tel: 35327491

More information

STATISTICS 479 Exam II (100 points)

STATISTICS 479 Exam II (100 points) Name STATISTICS 79 Exam II (1 points) 1. A SAS data set was created using the following input statement: Answer parts(a) to (e) below. input State $ City $ Pop199 Income Housing Electric; (a) () Give the

More information

Chapter 1 Linear Regression with One Predictor

Chapter 1 Linear Regression with One Predictor STAT 525 FALL 2018 Chapter 1 Linear Regression with One Predictor Professor Min Zhang Goals of Regression Analysis Serve three purposes Describes an association between X and Y In some applications, the

More information

Analysis of variance and regression. November 22, 2007

Analysis of variance and regression. November 22, 2007 Analysis of variance and regression November 22, 2007 Parametrisations: Choice of parameters Comparison of models Test for linearity Linear splines Lene Theil Skovgaard, Dept. of Biostatistics, Institute

More information

Models for Clustered Data

Models for Clustered Data Models for Clustered Data Edps/Psych/Soc 589 Carolyn J Anderson Department of Educational Psychology c Board of Trustees, University of Illinois Spring 2019 Outline Notation NELS88 data Fixed Effects ANOVA

More information

unadjusted model for baseline cholesterol 22:31 Monday, April 19,

unadjusted model for baseline cholesterol 22:31 Monday, April 19, unadjusted model for baseline cholesterol 22:31 Monday, April 19, 2004 1 Class Level Information Class Levels Values TRETGRP 3 3 4 5 SEX 2 0 1 Number of observations 916 unadjusted model for baseline cholesterol

More information

Models for longitudinal data

Models for longitudinal data Faculty of Health Sciences Contents Models for longitudinal data Analysis of repeated measurements, NFA 016 Julie Lyng Forman & Lene Theil Skovgaard Department of Biostatistics, University of Copenhagen

More information

Models for Clustered Data

Models for Clustered Data Models for Clustered Data Edps/Psych/Stat 587 Carolyn J Anderson Department of Educational Psychology c Board of Trustees, University of Illinois Fall 2017 Outline Notation NELS88 data Fixed Effects ANOVA

More information

ST Correlation and Regression

ST Correlation and Regression Chapter 5 ST 370 - Correlation and Regression Readings: Chapter 11.1-11.4, 11.7.2-11.8, Chapter 12.1-12.2 Recap: So far we ve learned: Why we want a random sample and how to achieve it (Sampling Scheme)

More information

Multicollinearity Exercise

Multicollinearity Exercise Multicollinearity Exercise Use the attached SAS output to answer the questions. [OPTIONAL: Copy the SAS program below into the SAS editor window and run it.] You do not need to submit any output, so there

More information

Faculty of Health Sciences. Correlated data. Variance component models. Lene Theil Skovgaard & Julie Lyng Forman.

Faculty of Health Sciences. Correlated data. Variance component models. Lene Theil Skovgaard & Julie Lyng Forman. Faculty of Health Sciences Correlated data Variance component models Lene Theil Skovgaard & Julie Lyng Forman November 28, 2017 1 / 96 Overview One-way anova with random variation The rabbit example Hierarchical

More information

STAT 3A03 Applied Regression Analysis With SAS Fall 2017

STAT 3A03 Applied Regression Analysis With SAS Fall 2017 STAT 3A03 Applied Regression Analysis With SAS Fall 2017 Assignment 5 Solution Set Q. 1 a The code that I used and the output is as follows PROC GLM DataS3A3.Wool plotsnone; Class Amp Len Load; Model CyclesAmp

More information

Correlated data. Overview. Variance component models. Terminology for correlated measurements. Faculty of Health Sciences. Variance component models

Correlated data. Overview. Variance component models. Terminology for correlated measurements. Faculty of Health Sciences. Variance component models Faculty of Health Sciences Overview Correlated data Variance component models Lene Theil Skovgaard & Julie Lyng Forman November 28, 2017 One-way anova with random variation The rabbit example Hierarchical

More information

Regression and correlation

Regression and correlation 6 Regression and correlation The main object of this chapter is to show how to perform basic regression analyses, including plots for model checking and display of confidence and prediction intervals.

More information

Correlation and Simple Linear Regression

Correlation and Simple Linear Regression Correlation and Simple Linear Regression Sasivimol Rattanasiri, Ph.D Section for Clinical Epidemiology and Biostatistics Ramathibodi Hospital, Mahidol University E-mail: sasivimol.rat@mahidol.ac.th 1 Outline

More information

Lecture 3: Inference in SLR

Lecture 3: Inference in SLR Lecture 3: Inference in SLR STAT 51 Spring 011 Background Reading KNNL:.1.6 3-1 Topic Overview This topic will cover: Review of hypothesis testing Inference about 1 Inference about 0 Confidence Intervals

More information

Faculty of Health Sciences. Correlated data. Variance component models. Lene Theil Skovgaard & Julie Lyng Forman.

Faculty of Health Sciences. Correlated data. Variance component models. Lene Theil Skovgaard & Julie Lyng Forman. Faculty of Health Sciences Correlated data Variance component models Lene Theil Skovgaard & Julie Lyng Forman November 27, 2018 1 / 84 Overview One-way anova with random variation The rabbit example Hierarchical

More information

Correlated data. Overview. Example: Swelling due to vaccine. Variance component models. Faculty of Health Sciences. Variance component models

Correlated data. Overview. Example: Swelling due to vaccine. Variance component models. Faculty of Health Sciences. Variance component models Faculty of Health Sciences Overview Correlated data Variance component models One-way anova with random variation The rabbit example Hierarchical models with several levels Random regression Lene Theil

More information

sociology sociology Scatterplots Quantitative Research Methods: Introduction to correlation and regression Age vs Income

sociology sociology Scatterplots Quantitative Research Methods: Introduction to correlation and regression Age vs Income Scatterplots Quantitative Research Methods: Introduction to correlation and regression Scatterplots can be considered as interval/ratio analogue of cross-tabs: arbitrarily many values mapped out in -dimensions

More information

Lecture notes on Regression & SAS example demonstration

Lecture notes on Regression & SAS example demonstration Regression & Correlation (p. 215) When two variables are measured on a single experimental unit, the resulting data are called bivariate data. You can describe each variable individually, and you can also

More information

3 Variables: Cyberloafing Conscientiousness Age

3 Variables: Cyberloafing Conscientiousness Age title 'Cyberloafing, Mike Sage'; run; PROC CORR data=sage; var Cyberloafing Conscientiousness Age; run; quit; The CORR Procedure 3 Variables: Cyberloafing Conscientiousness Age Simple Statistics Variable

More information

Lecture 1 Linear Regression with One Predictor Variable.p2

Lecture 1 Linear Regression with One Predictor Variable.p2 Lecture Linear Regression with One Predictor Variablep - Basics - Meaning of regression parameters p - β - the slope of the regression line -it indicates the change in mean of the probability distn of

More information

Booklet of Code and Output for STAC32 Final Exam

Booklet of Code and Output for STAC32 Final Exam Booklet of Code and Output for STAC32 Final Exam December 8, 2014 List of Figures in this document by page: List of Figures 1 Popcorn data............................. 2 2 MDs by city, with normal quantile

More information

Measuring relationships among multiple responses

Measuring relationships among multiple responses Measuring relationships among multiple responses Linear association (correlation, relatedness, shared information) between pair-wise responses is an important property used in almost all multivariate analyses.

More information

REVIEW 8/2/2017 陈芳华东师大英语系

REVIEW 8/2/2017 陈芳华东师大英语系 REVIEW Hypothesis testing starts with a null hypothesis and a null distribution. We compare what we have to the null distribution, if the result is too extreme to belong to the null distribution (p

More information

Chapter 2 Inferences in Simple Linear Regression

Chapter 2 Inferences in Simple Linear Regression STAT 525 SPRING 2018 Chapter 2 Inferences in Simple Linear Regression Professor Min Zhang Testing for Linear Relationship Term β 1 X i defines linear relationship Will then test H 0 : β 1 = 0 Test requires

More information

9. Linear Regression and Correlation

9. Linear Regression and Correlation 9. Linear Regression and Correlation Data: y a quantitative response variable x a quantitative explanatory variable (Chap. 8: Recall that both variables were categorical) For example, y = annual income,

More information

Ch 2: Simple Linear Regression

Ch 2: Simple Linear Regression Ch 2: Simple Linear Regression 1. Simple Linear Regression Model A simple regression model with a single regressor x is y = β 0 + β 1 x + ɛ, where we assume that the error ɛ is independent random component

More information

Lecture 11 Multiple Linear Regression

Lecture 11 Multiple Linear Regression Lecture 11 Multiple Linear Regression STAT 512 Spring 2011 Background Reading KNNL: 6.1-6.5 11-1 Topic Overview Review: Multiple Linear Regression (MLR) Computer Science Case Study 11-2 Multiple Regression

More information

IES 612/STA 4-573/STA Winter 2008 Week 1--IES 612-STA STA doc

IES 612/STA 4-573/STA Winter 2008 Week 1--IES 612-STA STA doc IES 612/STA 4-573/STA 4-576 Winter 2008 Week 1--IES 612-STA 4-573-STA 4-576.doc Review Notes: [OL] = Ott & Longnecker Statistical Methods and Data Analysis, 5 th edition. [Handouts based on notes prepared

More information

13 Simple Linear Regression

13 Simple Linear Regression B.Sc./Cert./M.Sc. Qualif. - Statistics: Theory and Practice 3 Simple Linear Regression 3. An industrial example A study was undertaken to determine the effect of stirring rate on the amount of impurity

More information

STAT 3A03 Applied Regression With SAS Fall 2017

STAT 3A03 Applied Regression With SAS Fall 2017 STAT 3A03 Applied Regression With SAS Fall 2017 Assignment 2 Solution Set Q. 1 I will add subscripts relating to the question part to the parameters and their estimates as well as the errors and residuals.

More information

Estimating σ 2. We can do simple prediction of Y and estimation of the mean of Y at any value of X.

Estimating σ 2. We can do simple prediction of Y and estimation of the mean of Y at any value of X. Estimating σ 2 We can do simple prediction of Y and estimation of the mean of Y at any value of X. To perform inferences about our regression line, we must estimate σ 2, the variance of the error term.

More information

Week 3: Simple Linear Regression

Week 3: Simple Linear Regression Week 3: Simple Linear Regression Marcelo Coca Perraillon University of Colorado Anschutz Medical Campus Health Services Research Methods I HSMP 7607 2017 c 2017 PERRAILLON ALL RIGHTS RESERVED 1 Outline

More information

ST505/S697R: Fall Homework 2 Solution.

ST505/S697R: Fall Homework 2 Solution. ST505/S69R: Fall 2012. Homework 2 Solution. 1. 1a; problem 1.22 Below is the summary information (edited) from the regression (using R output); code at end of solution as is code and output for SAS. a)

More information

Simple Linear Regression

Simple Linear Regression Chapter 2 Simple Linear Regression Linear Regression with One Independent Variable 2.1 Introduction In Chapter 1 we introduced the linear model as an alternative for making inferences on means of one or

More information

Analysing data: regression and correlation S6 and S7

Analysing data: regression and correlation S6 and S7 Basic medical statistics for clinical and experimental research Analysing data: regression and correlation S6 and S7 K. Jozwiak k.jozwiak@nki.nl 2 / 49 Correlation So far we have looked at the association

More information

Homework 2: Simple Linear Regression

Homework 2: Simple Linear Regression STAT 4385 Applied Regression Analysis Homework : Simple Linear Regression (Simple Linear Regression) Thirty (n = 30) College graduates who have recently entered the job market. For each student, the CGPA

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

LINEAR REGRESSION. Copyright 2013, SAS Institute Inc. All rights reserved.

LINEAR REGRESSION. Copyright 2013, SAS Institute Inc. All rights reserved. LINEAR REGRESSION LINEAR REGRESSION REGRESSION AND OTHER MODELS Type of Response Type of Predictors Categorical Continuous Continuous and Categorical Continuous Analysis of Variance (ANOVA) Ordinary Least

More information

Analysis of variance and regression. December 4, 2007

Analysis of variance and regression. December 4, 2007 Analysis of variance and regression December 4, 2007 Variance component models Variance components One-way anova with random variation estimation interpretations Two-way anova with random variation Crossed

More information

Regression: Main Ideas Setting: Quantitative outcome with a quantitative explanatory variable. Example, cont.

Regression: Main Ideas Setting: Quantitative outcome with a quantitative explanatory variable. Example, cont. TCELL 9/4/205 36-309/749 Experimental Design for Behavioral and Social Sciences Simple Regression Example Male black wheatear birds carry stones to the nest as a form of sexual display. Soler et al. wanted

More information

5.3 Three-Stage Nested Design Example

5.3 Three-Stage Nested Design Example 5.3 Three-Stage Nested Design Example A researcher designs an experiment to study the of a metal alloy. A three-stage nested design was conducted that included Two alloy chemistry compositions. Three ovens

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression Simple linear regression tries to fit a simple line between two variables Y and X. If X is linearly related to Y this explains some of the variability in Y. In most cases, there

More information

EXST7015: Estimating tree weights from other morphometric variables Raw data print

EXST7015: Estimating tree weights from other morphometric variables Raw data print Simple Linear Regression SAS example Page 1 1 ********************************************; 2 *** Data from Freund & Wilson (1993) ***; 3 *** TABLE 8.24 : ESTIMATING TREE WEIGHTS ***; 4 ********************************************;

More information

Failure Time of System due to the Hot Electron Effect

Failure Time of System due to the Hot Electron Effect of System due to the Hot Electron Effect 1 * exresist; 2 option ls=120 ps=75 nocenter nodate; 3 title of System due to the Hot Electron Effect ; 4 * TIME = failure time (hours) of a system due to drift

More information

y response variable x 1, x 2,, x k -- a set of explanatory variables

y response variable x 1, x 2,, x k -- a set of explanatory variables 11. Multiple Regression and Correlation y response variable x 1, x 2,, x k -- a set of explanatory variables In this chapter, all variables are assumed to be quantitative. Chapters 12-14 show how to incorporate

More information

STOR 455 STATISTICAL METHODS I

STOR 455 STATISTICAL METHODS I STOR 455 STATISTICAL METHODS I Jan Hannig Mul9variate Regression Y=X β + ε X is a regression matrix, β is a vector of parameters and ε are independent N(0,σ) Es9mated parameters b=(x X) - 1 X Y Predicted

More information

Introduction to SAS proc mixed

Introduction to SAS proc mixed Faculty of Health Sciences Introduction to SAS proc mixed Analysis of repeated measurements, 2017 Julie Forman Department of Biostatistics, University of Copenhagen 2 / 28 Preparing data for analysis The

More information

Handout 1: Predicting GPA from SAT

Handout 1: Predicting GPA from SAT Handout 1: Predicting GPA from SAT appsrv01.srv.cquest.utoronto.ca> appsrv01.srv.cquest.utoronto.ca> ls Desktop grades.data grades.sas oldstuff sasuser.800 appsrv01.srv.cquest.utoronto.ca> cat grades.data

More information

Topic 25 - One-Way Random Effects Models. Outline. Random Effects vs Fixed Effects. Data for One-way Random Effects Model. One-way Random effects

Topic 25 - One-Way Random Effects Models. Outline. Random Effects vs Fixed Effects. Data for One-way Random Effects Model. One-way Random effects Topic 5 - One-Way Random Effects Models One-way Random effects Outline Model Variance component estimation - Fall 013 Confidence intervals Topic 5 Random Effects vs Fixed Effects Consider factor with numerous

More information

STATISTICS 174: APPLIED STATISTICS FINAL EXAM DECEMBER 10, 2002

STATISTICS 174: APPLIED STATISTICS FINAL EXAM DECEMBER 10, 2002 Time allowed: 3 HOURS. STATISTICS 174: APPLIED STATISTICS FINAL EXAM DECEMBER 10, 2002 This is an open book exam: all course notes and the text are allowed, and you are expected to use your own calculator.

More information

Correlated data. Variance component models. Example: Evaluate vaccine. Traditional assumption so far. Faculty of Health Sciences

Correlated data. Variance component models. Example: Evaluate vaccine. Traditional assumption so far. Faculty of Health Sciences Faculty of Health Sciences Variance component models Definitions and motivation Correlated data Variance component models, I Lene Theil Skovgaard November 29, 2013 One-way anova with random variation The

More information

36-309/749 Experimental Design for Behavioral and Social Sciences. Sep. 22, 2015 Lecture 4: Linear Regression

36-309/749 Experimental Design for Behavioral and Social Sciences. Sep. 22, 2015 Lecture 4: Linear Regression 36-309/749 Experimental Design for Behavioral and Social Sciences Sep. 22, 2015 Lecture 4: Linear Regression TCELL Simple Regression Example Male black wheatear birds carry stones to the nest as a form

More information

Interactions. Interactions. Lectures 1 & 2. Linear Relationships. y = a + bx. Slope. Intercept

Interactions. Interactions. Lectures 1 & 2. Linear Relationships. y = a + bx. Slope. Intercept Interactions Lectures 1 & Regression Sometimes two variables appear related: > smoking and lung cancers > height and weight > years of education and income > engine size and gas mileage > GMAT scores and

More information

Detecting and Assessing Data Outliers and Leverage Points

Detecting and Assessing Data Outliers and Leverage Points Chapter 9 Detecting and Assessing Data Outliers and Leverage Points Section 9.1 Background Background Because OLS estimators arise due to the minimization of the sum of squared errors, large residuals

More information

ECO220Y Simple Regression: Testing the Slope

ECO220Y Simple Regression: Testing the Slope ECO220Y Simple Regression: Testing the Slope Readings: Chapter 18 (Sections 18.3-18.5) Winter 2012 Lecture 19 (Winter 2012) Simple Regression Lecture 19 1 / 32 Simple Regression Model y i = β 0 + β 1 x

More information

Paper: ST-161. Techniques for Evidence-Based Decision Making Using SAS Ian Stockwell, The Hilltop UMBC, Baltimore, MD

Paper: ST-161. Techniques for Evidence-Based Decision Making Using SAS Ian Stockwell, The Hilltop UMBC, Baltimore, MD Paper: ST-161 Techniques for Evidence-Based Decision Making Using SAS Ian Stockwell, The Hilltop Institute @ UMBC, Baltimore, MD ABSTRACT SAS has many tools that can be used for data analysis. From Freqs

More information

Statistical Modelling in Stata 5: Linear Models

Statistical Modelling in Stata 5: Linear Models Statistical Modelling in Stata 5: Linear Models Mark Lunt Arthritis Research UK Epidemiology Unit University of Manchester 07/11/2017 Structure This Week What is a linear model? How good is my model? Does

More information

ANALYSIS OF VARIANCE OF BALANCED DAIRY SCIENCE DATA USING SAS

ANALYSIS OF VARIANCE OF BALANCED DAIRY SCIENCE DATA USING SAS ANALYSIS OF VARIANCE OF BALANCED DAIRY SCIENCE DATA USING SAS Ravinder Malhotra and Vipul Sharma National Dairy Research Institute, Karnal-132001 The most common use of statistics in dairy science is testing

More information

Scatter plot of data from the study. Linear Regression

Scatter plot of data from the study. Linear Regression 1 2 Linear Regression Scatter plot of data from the study. Consider a study to relate birthweight to the estriol level of pregnant women. The data is below. i Weight (g / 100) i Weight (g / 100) 1 7 25

More information

Introduction to SAS proc mixed

Introduction to SAS proc mixed Faculty of Health Sciences Introduction to SAS proc mixed Analysis of repeated measurements, 2017 Julie Forman Department of Biostatistics, University of Copenhagen Outline Data in wide and long format

More information

ssh tap sas913, sas

ssh tap sas913, sas B. Kedem, STAT 430 SAS Examples SAS8 ===================== ssh xyz@glue.umd.edu, tap sas913, sas https://www.statlab.umd.edu/sasdoc/sashtml/onldoc.htm Multiple Regression ====================== 0. Show

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression In simple linear regression we are concerned about the relationship between two variables, X and Y. There are two components to such a relationship. 1. The strength of the relationship.

More information

Biostatistics. Correlation and linear regression. Burkhardt Seifert & Alois Tschopp. Biostatistics Unit University of Zurich

Biostatistics. Correlation and linear regression. Burkhardt Seifert & Alois Tschopp. Biostatistics Unit University of Zurich Biostatistics Correlation and linear regression Burkhardt Seifert & Alois Tschopp Biostatistics Unit University of Zurich Master of Science in Medical Biology 1 Correlation and linear regression Analysis

More information

Circle a single answer for each multiple choice question. Your choice should be made clearly.

Circle a single answer for each multiple choice question. Your choice should be made clearly. TEST #1 STA 4853 March 4, 215 Name: Please read the following directions. DO NOT TURN THE PAGE UNTIL INSTRUCTED TO DO SO Directions This exam is closed book and closed notes. There are 31 questions. Circle

More information

R 2 and F -Tests and ANOVA

R 2 and F -Tests and ANOVA R 2 and F -Tests and ANOVA December 6, 2018 1 Partition of Sums of Squares The distance from any point y i in a collection of data, to the mean of the data ȳ, is the deviation, written as y i ȳ. Definition.

More information

Topic 17 - Single Factor Analysis of Variance. Outline. One-way ANOVA. The Data / Notation. One way ANOVA Cell means model Factor effects model

Topic 17 - Single Factor Analysis of Variance. Outline. One-way ANOVA. The Data / Notation. One way ANOVA Cell means model Factor effects model Topic 17 - Single Factor Analysis of Variance - Fall 2013 One way ANOVA Cell means model Factor effects model Outline Topic 17 2 One-way ANOVA Response variable Y is continuous Explanatory variable is

More information

Correlation and Linear Regression

Correlation and Linear Regression Correlation and Linear Regression Correlation: Relationships between Variables So far, nearly all of our discussion of inferential statistics has focused on testing for differences between group means

More information

The General Linear Model. April 22, 2008

The General Linear Model. April 22, 2008 The General Linear Model. April 22, 2008 Multiple regression Data: The Faroese Mercury Study Simple linear regression Confounding The multiple linear regression model Interpretation of parameters Model

More information

The General Linear Model. November 20, 2007

The General Linear Model. November 20, 2007 The General Linear Model. November 20, 2007 Multiple regression Data: The Faroese Mercury Study Simple linear regression Confounding The multiple linear regression model Interpretation of parameters Model

More information

Chapter 8 Quantitative and Qualitative Predictors

Chapter 8 Quantitative and Qualitative Predictors STAT 525 FALL 2017 Chapter 8 Quantitative and Qualitative Predictors Professor Dabao Zhang Polynomial Regression Multiple regression using X 2 i, X3 i, etc as additional predictors Generates quadratic,

More information