Statistics for exp. medical researchers: Regression and Correlation


Faculty of Health Sciences
Regression analysis

Statistics for exp. medical researchers
Regression and Correlation
Lene Theil Skovgaard
Sept. 28, 2015

Linear regression, estimation and testing
Confidence and prediction limits
Model checks and diagnostics
The correlation coefficient
Transformation
Comparison of regression lines

Home page: http://staff.pubhealth.ku.dk/~jufo/basicstatisticsx2015
E-mail: ltsk@sund.ku.dk
*: indicates that this page may be skipped without serious consequences
1 / 90  2 / 90

The nature of the explanatory variable
So far, we have been looking at
- multi-factor designs (ANOVA)
- variance component models
The explanatory variables have been factors: Condition/Treatment (A, B, C, Ctrl), Laboratory (1, 2, 3, 4), Splenectomy, Altitude.
3 / 90

The nature of the explanatory variable, II
Now we turn to quantitative explanatory variables: concentration, temperature, age.
Regression: establish a (parsimonious) smooth relation between x and y, making prediction possible.
Examples of regression-type problems: dose-response relations, calibration curves.
4 / 90

Quantitative explanatory variable
Relation between two quantitative variables:
- explanatory variable, x-variable
- outcome variable, y-variable (response variable, dependent variable)
5 / 90

Linear regression
Linearity: every smooth curve can be approximated by a straight line, at least locally. Linearity is easy to deal with. Sometimes it takes a transformation to make the relation linear.
6 / 90

Example: Cell concentration of tetrahymena
The unicellular organism tetrahymena is grown in two different media, with and without glucose.
Research question: How does cell concentration x (number of cells in 1 ml of the growth medium) affect the cell size y (average cell diameter, measured in µm)?
Quantitative covariate: concentration x
Quantitative outcome: diameter y
7 / 90

Choice of scale
Check the linearity assumption (...or perhaps it is not linear on any scale).
Here, we need a log-transformation (more later on, p. 71 ff).
8 / 90

Example (Book, p. 226)
Calibration curve for measuring the concentration of Selenium:
- 6 known concentrations: 0, 20, 40, 80, 120 and 160
- triplicate measurements for each concentration, so close that they cannot be seen individually
Does this look like a straight line? Yes
9 / 90

The straight line
Mathematical formulation: y = α + βx
10 / 90

Parameters of the straight line
Interpretation:
- Intercept α: the expected outcome when the explanatory variable x is zero; units identical to y-units
- Slope β: the expected difference in y corresponding to a one-unit difference in x; units in y-units per x-unit
11 / 90

Model for selenium measurements
y_ci: the i-th measurement at the c-th concentration
x_c: the corresponding known concentration of selenium
Model:
E(y_ci) = α + β x_c
We call this a simple linear regression:
- simple, because there is only one explanatory variable (concentration)
- linear, because the explanatory variable has a linear effect
But: we have some issues regarding correlation of triplicate measurements... (p. 50)
12 / 90

Average over triplicates
To avoid the correlation issue:
ȳ_c: average measurement at the c-th concentration
x_c: the corresponding known concentration of selenium
13 / 90

Model for linear regression
The mean value of the outcome depends linearly on the explanatory variable, and the variance σ², or σ²_{y|x} (the variance of the residuals, i.e. the distances in the vertical direction from the regression line), is assumed constant:
Y_c = α + β x_c + ε_c,  ε_c ~ N(0, σ²)
14 / 90

Regression vs correlation
The regression model
Y_c = α + β x_c + ε_c,  ε_c ~ N(0, σ²), independent
specifies the conditional distributions of Y, given X, to be Normal, with identical variances and with mean values that depend linearly on x.
15 / 90

Method of least squares
Derived from the general likelihood principle: minimize the residual sum of squares
SS_res = Σ_{c=1}^n (y_c − ŷ_c)² = Σ_{c=1}^n (y_c − α − β x_c)²,
residuals here being the vertical distance from the observation y_c to the line (ŷ_c = α̂ + β̂ x_c), i.e.
r_c = y_c − (α̂ + β̂ x_c)
16 / 90

Technicalities: Estimation of slope
β̂ = s_xy / s_x²,
where
s_xy = (1/(n−1)) Σ_{i=1}^n (x_i − x̄)(y_i − ȳ)
is the covariance between x and y, and
s_x² = (1/(n−1)) Σ_{i=1}^n (x_i − x̄)²
is the variance of the covariate.
17 / 90

Estimation with SAS
proc means nway N mean data=a1;
  class part Concentration;
  var Selenium;
  output out=av mean=average_selenium;
run;

ods graphics on;
proc reg plots=all data=av;
  model Average_Selenium=Concentration / clb;
run;
ods graphics off;
18 / 90

Results for Selenium averages
The REG Procedure
Dependent Variable: Average_Selenium
Number of Observations Used: 6

Analysis of Variance
Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             1       10713         10713     6864.47  <.0001
Error             4       6.24263       1.56066
Corrected Total   5       10719

Root MSE        1.24926   R-Square  0.9994
Dependent Mean 53.50544   Adj R-Sq  0.9993
Coeff Var       2.33483

Parameter Estimates
Variable       DF  Estimate  Standard Error  t Value  Pr > |t|
Intercept       1   0.94262       0.81400      1.16    0.3113
Concentration   1   0.75090       0.00906     82.85    <.0001

Variable       DF  95% Confidence Limits
Intercept       1  -1.31741  3.20264
Concentration   1   0.72573  0.77606
19 / 90

Results for Selenium averages, II
Taken from the output:
α̂ = 0.943 (0.814)
β̂ = 0.751 (0.009)
s_{y|x} = 1.249
where s_{y|x} denotes the estimate of the residual standard deviation σ_{y|x} = √σ², called "Root MSE" in the SAS output, and estimated as
s²_{y|x} = SS_res / (n − 2)
α̂ and s_{y|x} are measured in the units of the outcome variable; β̂ is measured in outcome-units per x-unit.
20 / 90
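The slope formula β̂ = s_xy / s_x² can be checked by hand. A minimal Python sketch, using the six concentration averages read off the SAS output above (the slide itself uses PROC REG):

```python
# Least-squares slope and intercept for the selenium averages,
# computed directly as beta = s_xy / s_x^2 (illustrative sketch;
# the y-values are the triplicate means from the SAS output).
conc = [0, 20, 40, 80, 120, 160]                            # known x_c
avg  = [-0.00072, 17.1, 30.0, 61.8333, 92.1333, 119.9667]   # averages y_c

n = len(conc)
xbar = sum(conc) / n
ybar = sum(avg) / n
s_xy = sum((x - xbar) * (y - ybar) for x, y in zip(conc, avg)) / (n - 1)
s_xx = sum((x - xbar) ** 2 for x in conc) / (n - 1)

beta = s_xy / s_xx           # slope estimate
alpha = ybar - beta * xbar   # intercept estimate

print(round(beta, 4), round(alpha, 4))  # ≈ 0.7509 and 0.9426, as in PROC REG
```

This reproduces the estimates α̂ = 0.943 and β̂ = 0.751 quoted from the SAS output.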

Uncertainty of estimated slope
SE(β̂) = s_{y|x} / (s_x √(n−1))
Good precision when
- the residual variation s_{y|x} is small
- the sample size n is large
- the variation in the explanatory variable (s_x) is large (i.e. when concentrations vary a lot)
21 / 90

Confidence interval for parameters
With 95% coverage, here shown for the slope:
β̂ ± t-quantile × SE(β̂)
Here, n = 6, so df = 6 − 2 = 4, and the corresponding t-quantile is 2.776. Therefore, the interval becomes
0.7509 ± 2.776 × 0.0091 = (0.726, 0.776)
22 / 90

Test of zero slope
Cut from the output:
Variable       DF  Estimate  Standard Error  t Value  Pr > |t|
Intercept       1   0.94262       0.81400      1.16    0.3113
Concentration   1   0.75090       0.00906     82.85    <.0001
T = 0.751/0.009 = 82.85 ~ t(4), P < 0.0001
Strong evidence of a relationship between the actual concentration and the measured response. Is this at all interesting? Maybe test α = 0? We want to know more... to be continued.
23 / 90

Use of regression results
Prediction: determine y from x, the ordinary way. We predict the outcome for a given value of the explanatory variable, i.e. for a given known concentration of selenium:
ŷ_i = α̂ + β̂ x_i
Calibration: determine x from y, the reverse way. We estimate unknown concentrations of selenium from one or several measurements taken. We shall look briefly at that on p. 66-67.
24 / 90
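The standard error and confidence interval for the slope (p. 21-22) can be sketched numerically, using the quantities quoted from the SAS output (n = 6, df = 4, t-quantile 2.776 as stated on the slide):

```python
import math

# Sketch of the slide's confidence-interval computation for the slope:
# SE(beta) = s_{y|x} / (s_x * sqrt(n-1)) = Root MSE / sqrt(S_xx).
beta, root_mse = 0.75090, 1.24926
conc = [0, 20, 40, 80, 120, 160]
xbar = sum(conc) / len(conc)
s_xx = sum((x - xbar) ** 2 for x in conc)   # = (n-1) * s_x^2

se_beta = root_mse / math.sqrt(s_xx)
t_quantile = 2.776                          # 97.5% t-quantile on 4 df, from the slide
ci = (beta - t_quantile * se_beta, beta + t_quantile * se_beta)

print(round(se_beta, 5), tuple(round(c, 3) for c in ci))
# ≈ 0.00906 and (0.726, 0.776), matching the output
```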

Confidence and prediction limits
Confidence limits show the uncertainty in the estimated regression line (here almost collapsed onto the line itself):
- tell us where the line may also be
- become narrower when the sample size is increased
Prediction limits show the (future) variation in the outcome, for a given covariate (reference regions):
- tell us where future subjects will lie
- have approximately the same width no matter the sample size
25 / 90

Confidence and prediction limits for the line
Confidence limits can hardly be seen, since they are so narrow.
26 / 90

Check of model assumptions
Look for possible flaws:
- Linearity: plot residuals vs. the explanatory variable; curves? (p. 28). Test whether a second-order polynomial (a parabola) fits better than a straight line (p. 29)
- Variance homogeneity: plot residuals against predicted values; trumpet shape? (p. 30)
- Normality: histogram, skewness? (p. 31); quantile plot, hammock shape? (p. 31)
We do not have enough information to reasonably perform these checks.
27 / 90

Residual plot, for check of linearity
Curves?
28 / 90

Numerical check of linearity
Include a second-order term, Concentration2=(Concentration-75)**2;
proc reg data=av;
  model Average_Selenium=Concentration Concentration2;
run;
which yields the output
Parameter Estimates
Variable        DF  Estimate     Standard Error  t Value  Pr > |t|
Intercept        1   1.58004        0.90266        1.75    0.1783
Concentration    1   0.75312        0.00858       87.81    <.0001
Concentration2   1  -0.00024847     0.00019313    -1.29    0.2885
No significant deviation from linearity, since the second-order term has a P-value of P=0.29.
29 / 90

Residual plot, for check of variance homogeneity
Trumpet shape?
30 / 90

Model checks at a glance
31 / 90

If assumptions fail
- Linearity: transform or do non-linear regression
- Variance homogeneity: transform
- Normality: transform
Linearity is the most important assumption, unless the task is to construct prediction intervals!
More on transformations later... p. 71 ff
32 / 90

Diagnostics
Assess the influence of single observations by leaving out one observation at a time:
- omit the i-th observation from the analysis
- obtain new estimates α̂_(−i) and β̂_(−i)
- compute deletion diagnostics
  dev(α)_i = α̂ − α̂_(−i)
  dev(β)_i = β̂ − β̂_(−i)
  both normalized by the standard error of the estimate
- combine the squared deletion diagnostics into a single diagnostic, Cook's distance Cook(α, β)_i
33 / 90

Deletion diagnostics for selenium averages
INFLUENCE option (in the MODEL statement of PROC REG):
proc reg data=av;
  model Average_Selenium=Concentration / r clb influence;
run;
Output:
Output Statistics
Obs  Dependent   Predicted  Std Error      Residual  Std Error  Student
     Variable    Value      Mean Predict             Residual   Residual
1     -0.000720    0.9426     0.8140       -0.9433     0.948     -0.995
2     17.1000     15.9606     0.6822        1.1394     1.047      1.089
3     30.0000     30.9785     0.5780       -0.9785     1.108     -0.884
4     61.8333     61.0144     0.5180        0.8189     1.137      0.720
5     92.1333     91.0503     0.6822        1.0830     1.047      1.035
6    119.9667    121.0862     0.9620       -1.1195     0.797     -1.405
34 / 90

Deletion diagnostics for selenium averages, II
Output, continued:
Obs  Cook's D  RStudent  Hat Diag H  Cov Ratio   DFFITS
1     0.366    -0.9939     0.4246      1.7484    -0.8537
2     0.252     1.1241     0.2982      1.2543     0.7328
3     0.106    -0.8529     0.2140      1.4652    -0.4451
4     0.054     0.6687     0.1719      1.6260     0.3047
5     0.228     1.0474     0.2982      1.3584     0.6828
6     1.437    -1.7089     0.5930      1.1215    -2.0627

DFBETAS
Obs  Intercept  Concentration
1     -0.8537      0.6654
2      0.7226     -0.4867
3     -0.4093      0.2094
4      0.1464      0.0533
5     -0.0337      0.4535
6      0.6780     -1.7490
Note the large values for Obs=6.
35 / 90

Cook's distance
Note the large value of COOKD for Obs=6.
36 / 90
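The leave-one-out idea behind the deletion diagnostics can be sketched directly: refit the line without each observation in turn and record the change in the slope. (PROC REG's DFBETAS additionally normalizes each change by a standard error; here we only rank the raw changes.)

```python
# Leave-one-out sketch of the deletion diagnostics for the slope,
# using the selenium averages from the SAS output.
conc = [0, 20, 40, 80, 120, 160]
avg  = [-0.00072, 17.1, 30.0, 61.8333, 92.1333, 119.9667]

def slope(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxy / sxx

full = slope(conc, avg)
changes = []
for i in range(len(conc)):
    xs = conc[:i] + conc[i + 1:]   # drop observation i
    ys = avg[:i] + avg[i + 1:]
    changes.append(full - slope(xs, ys))

most_influential = max(range(len(conc)), key=lambda i: abs(changes[i]))
print(most_influential + 1)  # observation 6, as flagged by DFBETAS and Cook's D
```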

The correlation coefficient
A numerical quantity describing the degree of (linear) relationship between two variables:
- Pearson correlation, assuming normality of both variables:
  r = r_xy = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / √( Σ_{i=1}^n (x_i − x̄)² · Σ_{i=1}^n (y_i − ȳ)² )
- Spearman correlation: based on ranks
Both of them take values between −1 and 1 (0 corresponding to independence); +1 and −1 correspond to perfect (linear) relationships, positive respectively negative.
37 / 90

Bivariate Normal distribution, ρ = 0
(Figure: two-dimensional normal density with correlation 0.)
All vertical slices yield normal distributions with identical mean values and identical variances.
38 / 90

Bivariate Normal distribution, ρ = 0.9
(Figure: two-dimensional normal density with correlation 0.9.)
All vertical slices yield normal distributions with different mean values but identical variances.
39 / 90

Contour curves
Contour curves from a Normal distribution become ellipses (or circles in case of ρ = 0). Scatter plots should resemble ellipses.
40 / 90
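Both correlation formulas are easy to evaluate by hand. A small Python sketch for the selenium averages (data read off the earlier SAS output), where the Spearman coefficient is simply the Pearson coefficient of the ranks:

```python
import math

# Pearson and Spearman correlation for the selenium averages.
conc = [0, 20, 40, 80, 120, 160]
avg  = [-0.00072, 17.1, 30.0, 61.8333, 92.1333, 119.9667]

def pearson(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ranks(xs):
    # simple ranks; fine here since there are no ties
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

r_pearson = pearson(conc, avg)                 # ≈ 0.9997: close to linear
r_spearman = pearson(ranks(conc), ranks(avg))  # = 1.0: perfectly monotone
print(round(r_pearson, 4), r_spearman)
```

This reproduces the 0.99971 (Pearson) and 1.00000 (Spearman) values shown later for the selenium data.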

Regression vs correlation
The regression model
Y_c = α + β x_c + ε_c,  ε_c ~ N(0, σ²), independent
specifies the conditional distributions of Y, given X, to be Normal, with identical variances and with mean values that depend linearly on x. Tests of zero slope and zero correlation are identical, and do not assume anything regarding the distribution of x.
41 / 90

Regression vs. correlation, II
The assumptions for interpreting a correlation are stronger than for interpreting a slope (they involve normality of both variables). Interpretation of a correlation coefficient is often misleading... and almost always non-informative: the correlation has no units and gives no quantification of the relation.
42 / 90

Regression vs. correlation, III
Test of zero slope or zero correlation is the same thing. The two estimates (for correlation and slope) resemble one another (in formulae), and they become 0 simultaneously:
β̂ = S_xy / S_xx,   r_xy = S_xy / √(S_xx S_yy)
β̂ = r_xy √(S_yy / S_xx),   r_xy = β̂ √(S_xx / S_yy)
Test for β = 0 is identical to test for ρ_xy = 0.
43 / 90

Problems with the correlation
Formula manipulation yields the equality
1 − r²_xy = s² / ( s² + β̂² S_xx / (n − 2) )
Fix the sample size n, the slope β and the residual variation s_{y|x}, but increase the variation s_x in the covariate x. What happens? The correlation approaches either 1 or −1!!
44 / 90
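The last point can be checked numerically. An illustrative simulation (not from the slides): with the slope and the residual SD held fixed, widening the spread of the covariate pushes the correlation toward 1.

```python
import math
import random

# Illustrative simulation: same slope, same residual SD, but a wider
# x-range gives a correlation much closer to 1.
random.seed(1)

def corr_for_spread(spread, n=200, beta=0.75, sd=5.0):
    xs = [spread * i / (n - 1) for i in range(n)]           # x on [0, spread]
    ys = [beta * x + random.gauss(0, sd) for x in xs]       # fixed slope and noise
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

narrow = corr_for_spread(20)    # x in [0, 20]
wide = corr_for_spread(200)     # same model, x in [0, 200]
print(round(narrow, 2), round(wide, 2))  # the wide design gives r much closer to 1
```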

Two imaginary investigations
Slopes are equal; correlations are not! Can we increase the correlation even further, without obtaining more observations? Yes...
45 / 90

When can we use the correlation?
- When we only want to test whether or not there is a relation between two variables (only a P-value needed). In this case, consider the non-parametric Spearman correlation to avoid the Normality (linearity) assumption.
- To rank the relatedness of many variables, measured on the same units (e.g. concentrations of different compounds in the same solution).
The correlation has no units... and is therefore hard to interpret!
46 / 90

Correlation for Selenium
Correlation between average selenium measurements and the corresponding known concentrations:
proc corr pearson spearman data=av;
  var Average_Selenium Concentration;
run;
with output
Simple Statistics
Variable          N  Mean      Std Dev   Median    Minimum     Maximum
Average_Selenium  6  53.50544  46.30191  45.91667  -0.0007200  119.96667
Concentration     6  70.00000  61.64414  60.00000   0          160.00000
...continued next page
47 / 90

Correlation for Selenium, II
Pearson Correlation Coefficients, N = 6
Prob > |r| under H0: Rho=0
                  Average_Selenium  Concentration
Average_Selenium  1.00000           0.99971
                                    <.0001
Concentration     0.99971           1.00000
                  <.0001
The very high correlation of 0.99971 indicates a close-to-linear relationship. And so what?
48 / 90

Correlation for Selenium, III
Spearman Correlation Coefficients, N = 6
Prob > |r| under H0: Rho=0
                  Average_Selenium  Concentration
Average_Selenium  1.00000           1.00000
                                    <.0001
Concentration     1.00000           1.00000
                  <.0001
The correlation is 1, indicating a perfect monotone relationship, not necessarily linear. And so what?
49 / 90

Analysis of all Selenium measurements
Originally, we had 18 measurements: triplicates for each known concentration. Could we use all 18 measurements to obtain better estimates and narrower confidence intervals?
No, probably not, since triplicates are not independent. We expect several sources of variation in this investigation: error in the known concentration, observer variation, temperature effects... and measurement error. If triplicates are measured on the same solution, an error in the known concentration will affect all three measurements equally, i.e. they would all be too large or too small: they would be correlated.
50 / 90

Analysis of all Selenium measurements, II
In case of correlated measurements, using the same simple regression model applied to all individual measurements would be wrong, and would give
- too small P-values for anything
- too narrow confidence intervals
Instead, we could build a mixed model, and the result would be identical to the analysis of averages, unless the design is unbalanced, e.g. due to missing observations.
51 / 90

Naive (= wrong) analysis of all 18 observations (Book, p. 251)
The REG Procedure
Dependent Variable: Selenium
Number of Observations Used: 18
Analysis of Variance
Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             1      32139           32139    22292.0  <.0001
Error            16      23.06789        1.44174
Corrected Total  17      32162
Root MSE        1.20073   R-Square  0.9993
Dependent Mean 53.50544   Adj R-Sq  0.9992
Coeff Var       2.24412
Parameter Estimates
Variable       DF  Estimate  Standard Error  t Value  Pr > |t|
Intercept       1   0.94262       0.45170      2.09    0.0533
Concentration   1   0.75090       0.00503    149.30    <.0001
Variable       DF  95% Confidence Limits
Intercept       1  -0.01495  1.90019
Concentration   1   0.74024  0.76156
52 / 90

Residual plot from the naive analysis
Note the correlation between residuals from the same concentration.
53 / 90

Comparison of the two analyses
The naive one (wrong), with all 18 measurements, vs. the analysis of averages:
Method    Estimate of slope  t-value and P-value for zero slope
Naive     0.751 (0.005)      149.3 (<0.0001), n = 18
Averages  0.751 (0.009)      82.85 (<0.0001), n = 6
The discrepancy cannot be seen in the P-values, because the slope is so strongly significant.
54 / 90

Consequences of the naive model
The naive model based on all 18 observations ignores the correlation between triplicates:
- we think we have too much information
- the standard errors become too small
- the confidence intervals become too narrow
- the conclusions become exaggerated
55 / 90

Mixed model
We are dealing with two variance components:
- the variation of means around the regression line (ω²)
- the measurement error (variation between triplicates, σ²)
formulated as
Y_ci = α + β x_c + A_c + ε_ci
A_c ~ N(0, ω²),  ε_ci ~ N(0, σ²)
Corr(Y_ci, Y_cj) = ρ = ω² / (ω² + σ²)
56 / 90

Mixed model in SAS
We have to specify that all observations regarding the same concentration (i.e. the triplicates) are correlated. This can be done in two ways:
1. directly specifying the triplicates to have a CS (Compound Symmetry) structured correlation (see p. 58)
2. specifying two variance components, one between concentrations (ω²) and one within concentrations (triplicate variation, σ²), see p. 60
Both of these structures require a copy of Concentration, cconcentration=concentration, specified as a factor, i.e. a CLASS variable.
57 / 90

Mixed model in SAS, specification I
proc mixed cl data=a1;
  class cconcentration;
  model Selenium=Concentration / s cl ddfm=satterth;
  repeated / subject=cconcentration type=cs r rcorr;
run;
Here, we specify directly type=cs, i.e. the correlation structure
1 ρ ρ ρ ρ
ρ 1 ρ ρ ρ
ρ ρ 1 ρ ρ
ρ ρ ρ 1 ρ
ρ ρ ρ ρ 1
and get the output shown on the next page.
58 / 90

Output from specification I
Estimated R Matrix
Row  Col1    Col2    Col3
1    1.8018  1.4401  1.4401
2    1.4401  1.8018  1.4401
3    1.4401  1.4401  1.8018
Estimated R Correlation Matrix
Row  Col1    Col2    Col3
1    1.0000  0.7993  0.7993
2    0.7993  1.0000  0.7993
3    0.7993  0.7993  1.0000
Covariance Parameter Estimates
Cov Parm  Subject         Estimate  Alpha  Lower    Upper
CS        cconcentration  1.4401    0.05   -0.7250  3.6052
Residual                  0.3617    0.05    0.1860  0.9855
Solution for Fixed Effects
Effect         Estimate  Standard Error  DF  t Value  Pr > |t|  Alpha  Lower    Upper
Intercept      0.9426    0.8140           4    1.16    0.3113   0.05   -1.3174  3.2026
Concentration  0.7509    0.009063         4   82.85    <.0001   0.05    0.7257  0.7761
59 / 90

Mixed model in SAS, specification II
proc mixed plots=all cl data=a1;
  class cconcentration;
  model Selenium=Concentration / s cl ddfm=satterth;
  random intercept / subject=cconcentration;
run;
gives more or less the same output:
Covariance Parameter Estimates
Cov Parm   Subject         Estimate  Alpha  Lower   Upper
Intercept  cconcentration  1.4401    0.05   0.4856  15.6829
Residual                   0.3617    0.05   0.1860   0.9855
Solution for Fixed Effects
Effect         Estimate  Standard Error  DF  t Value  Pr > |t|  Alpha  Lower    Upper
Intercept      0.9426    0.8140           4    1.16    0.3113   0.05   -1.3174  3.2026
Concentration  0.7509    0.009063         4   82.85    <.0001   0.05    0.7257  0.7761
60 / 90

Model check
61 / 90

Comment on the mixed model in SAS
Note: the results are identical to those for the analysis of averages, but we get extra information here: the estimated correlation ρ̂ = 0.799 (p. 59). This clearly violates the independence assumption, which is why the naive approach using all 18 measurements will provide wrong results.
62 / 90

Recalculate correlation
63 / 90

Effect of correlation on the number of independent pieces of information
For n different doses (here n = 6) and k repetitions for each dose (here k = 3): How much do we gain by taking duplicates, triplicates etc. instead of just a single measurement? That depends on the correlation ρ. We ought to have n·k pieces of information, but due to the correlation, we only have m < n·k, with
m = n·k / (1 + ρ(k − 1))
64 / 90
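The effective-sample-size formula can be evaluated for the selenium design, using the estimated correlation ρ̂ = 0.799 from the mixed-model output:

```python
# Effective number of independent observations under compound symmetry:
# m = n*k / (1 + rho*(k-1)), here for n = 6 concentrations, k = 3
# replicates, and the estimated correlation rho = 0.799.
def effective_n(n, k, rho):
    return n * k / (1 + rho * (k - 1))

m = effective_n(6, 3, 0.799)
print(round(m, 1))  # ≈ 6.9: far from the nominal 18, close to the 6 averages
```

With this strong a correlation, the 18 measurements carry barely more information than the 6 averages, which is why the two analyses give nearly identical results.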

Reminder: prediction
Prediction/estimation the ordinary way: determine y from x. What about the other way around? Calibration.
65 / 90

Calibration
Calibration involves prediction/estimation the other way around, i.e. estimation of an unknown x-value, based on y-observations:
- Take a soil sample with an unknown concentration (c_0, say) of selenium
- Measure with some instrument a couple of times (e.g. 3, as here), and get observations Y_01, Y_02, Y_03, with average Ȳ_0
- Make a qualified guess of the unknown concentration, with confidence interval, based on the average measurement Ȳ_0
Since E(y_0) = α + β x_0, we must estimate
ĉ_0 = (Ȳ_0 − α̂) / β̂
But what is the uncertainty in this expression?
66 / 90

*Calibration uncertainty
Based on k measurements (Y_0i, i = 1, ..., k) of an unknown concentration (c_0):
(σ_{y|x} / β̂) √( 1/k + 1/n + (ȳ_0 − ȳ)² / (β̂² Σ_{i=1}^n (x_i − x̄)²) )
How to do this in SAS? Not so easy, unfortunately...
67 / 90

Example: Cell concentration of tetrahymena
The unicellular organism tetrahymena is grown in two different media, with and without glucose.
Research question: How does cell concentration x (number of cells in 1 ml of the growth medium) affect the cell size y (average cell diameter, measured in µm)?
Quantitative covariate: concentration x
Quantitative outcome: diameter y
68 / 90
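The calibration computation on p. 66-67 is straightforward to carry out by hand. A sketch, using the fitted selenium line; the triplicate readings for the "unknown" sample below are made up purely for illustration:

```python
import math

# Calibration ("reverse prediction"): estimate an unknown selenium
# concentration from the average of k new measurements, with the
# slide's uncertainty formula. Fitted quantities from the SAS output;
# the new readings y_new are hypothetical.
alpha, beta, s_yx = 0.94262, 0.75090, 1.24926
conc = [0, 20, 40, 80, 120, 160]
avg  = [-0.00072, 17.1, 30.0, 61.8333, 92.1333, 119.9667]

y_new = [45.1, 44.3, 45.8]          # hypothetical triplicate readings
k, n = len(y_new), len(conc)
ybar0 = sum(y_new) / k
c0_hat = (ybar0 - alpha) / beta     # point estimate of the concentration

xbar, ybar = sum(conc) / n, sum(avg) / n
sxx = sum((x - xbar) ** 2 for x in conc)
se_c0 = (s_yx / beta) * math.sqrt(
    1 / k + 1 / n + (ybar0 - ybar) ** 2 / (beta ** 2 * sxx))

print(round(c0_hat, 1), round(se_c0, 2))  # ≈ 58.8 with SE ≈ 1.18
```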

Scatter plot
For the no-glucose medium: the relation is clearly not linear.
69 / 90

Residual plot for naive linear regression
Note the curved shape, indicating that linearity between cell diameter and concentration is not appropriate.
70 / 90

Power relationship
Suggested relationship between diameter (y) and concentration (x):
y = α x^β
Interpretation of the parameters:
- α denotes the cell size for a concentration of x = 1, an extrapolation to the extreme lower end of the concentration range as seen from the scatter plot
- β: when the concentration x is doubled, the diameter changes by a factor 2^β
71 / 90

Logarithmic transformation
Transforming the diameter (y) with a logarithm yields the theoretical relationship
log10(y) = log10(α) + β log10(x),
or in terms of observations:
E(y*_i) = α* + β x*_i
where y*_i = log10(y_i), x*_i = log10(x_i), and α* = log10(α) is the intercept.
72 / 90

Scatter plot on double logarithmic scale
Looks pretty linear.
73 / 90

Regression on double logarithmic scale
ods graphics on;
proc reg plots=(diagnostics(unpack) residuals(smooth)) data=a1;
  where glucose="no";
  model logdiameter = logconcentration / clb;
run;
ods graphics off;

The REG Procedure
Dependent Variable: logdiameter
Number of Observations Read: 19
Number of Observations Used: 19
Parameter Estimates
Variable          DF  Estimate  Standard Error  t Value  Pr > |t|
Intercept          1   1.63476       0.02021      80.89    <.0001
logconcentration   1  -0.05968       0.00412     -14.47    <.0001
Variable          DF  95% Confidence Limits
Intercept          1   1.59212   1.67740
logconcentration   1  -0.06838  -0.05097
74 / 90

Model check for logarithmic analysis
Looks much better.
75 / 90

Estimates for the multiplicative model
Taken from the output on p. 74:
α̂* = 1.635 (0.0202), CI = (1.5921, 1.6774)
β̂ = −0.0597 (0.0041), CI = (−0.0684, −0.0510)
Back-transforming: the effect of a doubling of the concentration is estimated as 2^β̂ = 2^(−0.0597) = 0.959, a 4.1% reduction of the diameter.
Confidence limits: (2^(−0.0684), 2^(−0.0510)) = (0.954, 0.965), i.e. between a 3.5% and a 4.6% reduction.
76 / 90
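The back-transformation on p. 76 can be sketched in a few lines: a slope β on the log10-log10 scale means a doubling of the concentration changes the diameter by the factor 2^β.

```python
# Back-transformation for the double-log model, using the estimate and
# confidence limits quoted on the slide.
beta, lo, hi = -0.0597, -0.0684, -0.0510

factor = 2 ** beta                  # multiplicative effect per doubling of x
ci = (2 ** lo, 2 ** hi)
pct_reduction = (1 - factor) * 100

print(round(factor, 3), tuple(round(c, 3) for c in ci), round(pct_reduction, 1))
# ≈ 0.959, (0.954, 0.965), 4.1% reduction per doubling
```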

Two media: with and without glucose
- An effect of concentration, assumed to be the same for both media
- A difference between the two media, assumed to be the same for all concentrations
This is called:
- multiple regression with two covariates (explanatory variables)
- analysis of covariance
It is an additive model, with no interaction.
77 / 90

Two parallel regression lines
78 / 90

Analysis of covariance in SAS
proc glm plots=all data=a1;
  class glucose;
  model logdiameter = logconcentration glucose / solution clparm;
run;
with output
The GLM Procedure
Class Level Information
Class    Levels  Values
glucose  2       no yes
Number of Observations Read: 51
Dependent Variable: logdiameter
Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             2  0.05638431      0.02819216   337.89   <.0001
Error            48  0.00400497      0.00008344
Corrected Total  50  0.06038929
R-Square  Coeff Var  Root MSE  logdiameter Mean
0.933681  0.671460   0.009134  1.360375
79 / 90

Analysis of covariance, II
Source            DF  Type III SS  Mean Square  F Value  Pr > F
logconcentration   1  0.04835122   0.04835122   579.49   <.0001
glucose            1  0.00949383   0.00949383   113.78   <.0001
Parameter         Estimate        Standard Error  t Value  Pr > |t|
Intercept          1.642132414 B  0.01141747      143.83   <.0001
logconcentration  -0.055392698    0.00230106      -24.07   <.0001
glucose no        -0.028237875 B  0.00264722      -10.67   <.0001
glucose yes        0.000000000 B  .                 .      .
Parameter         95% Confidence Limits
Intercept          1.619176048   1.665088780
logconcentration  -0.060019289  -0.050766107
glucose no        -0.033560473  -0.022915277
glucose yes        .             .
The slope (−0.0554) is the estimated effect of log10(concentration) on log10(diameter), so we must back-transform for interpretation:
80 / 90

Interpretation of output, slope
The effect of concentration after back-transforming: the effect of a doubling of the concentration is estimated as 2^β̂ = 2^(−0.0554) = 0.962, a 3.8% reduction of the diameter.
Confidence limits: (2^(−0.0600), 2^(−0.0508)) = (0.959, 0.965), i.e. between a 3.5% and a 4.1% reduction.
Note that this is almost the same as when we considered one medium alone.
81 / 90

Interpretation of output, difference between media
The intercept is an estimate of α* (see p. 72), and α* = log10(α), so α = 10^α*. Therefore, the difference between the two media (glucose vs. no glucose) is a factor 10^0.0282 = 1.067, i.e. a 6.7% higher cell diameter when glucose is added.
Confidence limits: (10^0.0229, 10^0.0336) = (1.054, 1.080), i.e. between a 5.4% and an 8.0% higher diameter.
82 / 90

Interaction?
If the effect of one explanatory variable (X_1) depends on the value of another (X_2), we say that there is an interaction between X_1 and X_2:
- If the effect of concentration depends on the medium, we have interaction between concentration and medium.
- If the difference between the two media varies with concentration, we have interaction between concentration and medium.
Interaction: the two regression lines are not parallel, they have different slopes.
83 / 90

Model with interaction in SAS
proc glm plots=all data=tetrahymena;
  class glucose;
  model logdiam=logconc glucose logconc*glucose / solution clparm;
  estimate 'slope, glucose=0' logconc 1 logconc*glucose 1 0;
  estimate 'slope, glucose=1' logconc 1 logconc*glucose 0 1;
  output out=check r=res p=pred;
run;
The GLM Procedure
Class Level Information
Class    Levels  Values
glucose  2       0 1
Number of Observations Used: 51
Dependent Variable: logdiam
Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             3  0.05653259      0.01884420   229.65   <.0001
Error            47  0.00385670      0.00008206
Corrected Total  50  0.06038929
R-Square  Coeff Var  Root MSE  logdiam Mean
0.936136  0.665886   0.009059  1.360375
84 / 90
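The back-transformation of the glucose effect works the same way as for the slope, only with base 10, since the media difference is a shift on the log10 scale. A sketch, using the estimate and confidence limits quoted on the slide:

```python
# Back-transforming the glucose effect from the ANCOVA: a difference d
# on the log10 scale corresponds to a multiplicative factor 10**d on
# the original diameter scale.
d, lo, hi = 0.0282, 0.0229, 0.0336

factor = 10 ** d
ci = (10 ** lo, 10 ** hi)

print(round(factor, 3), tuple(round(c, 3) for c in ci))
# ≈ 1.067 and (1.054, 1.080): 5.4% to 8.0% larger diameter with glucose
```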

Interaction?, II
Output, continued:
Source           DF  Type III SS  Mean Square  F Value  Pr > F
logconc           1  0.04498230   0.04498230   548.18   <.0001
glucose           1  0.00000171   0.00000171     0.02   0.8859
logconc*glucose   1  0.00014828   0.00014828     1.81   0.1853
Parameter         Estimate     Standard Error  t Value  Pr > |t|
slope, glucose=0  -0.05967671  0.00391968      -15.22   <.0001
slope, glucose=1  -0.05319626  0.00280663      -18.95   <.0001
Parameter         95% Confidence Limits
slope, glucose=0  -0.06756209  -0.05179134
slope, glucose=1  -0.05884246  -0.04755005
Test for no interaction gives P=0.19, i.e. no significance.
85 / 90

Interaction?, III
Output, continued:
Parameter          Estimate        Standard Error  t Value  Pr > |t|
Intercept           1.631343588 B  0.01387873      117.54   <.0001
logconc            -0.053196257 B  0.00280663      -18.95   <.0001
glucose 0           0.003417551 B  0.02369477        0.14   0.8859
glucose 1           0.000000000 B  .                 .      .
logconc*glucose 0  -0.006480457 B  0.00482090       -1.34   0.1853
logconc*glucose 1   0.000000000 B  .                 .      .
Parameter          95% Confidence Limits
Intercept           1.603423184   1.659263992
logconc            -0.058842464  -0.047550050
glucose 0          -0.044250175   0.051085278
glucose 1           .             .
logconc*glucose 0  -0.016178851   0.003217937
logconc*glucose 1   .             .
86 / 90

Interpretation of output, slopes
We now have two different estimates of slope, depending on the presence of glucose. We back-transform to the effect of a doubling of the concentration:
- No glucose: 2^β̂ = 2^(−0.0597) = 0.959, a 4.1% reduction of diameter (CI: 0.954, 0.965)
- Glucose: 2^β̂ = 2^(−0.0532) = 0.964, a 3.6% reduction of diameter (CI: 0.960, 0.968)
Note that for no glucose, we get the same as the results on p. 76.
87 / 90

Interpretation of output, media
The difference between the two media (glucose vs. no glucose) now depends upon the concentration of cells! The estimate 0.0034 shown in the output (p. 86) refers to the difference between media when the explanatory variable is zero. Since our explanatory variable is the logarithm of cell concentration, this corresponds to a cell concentration of 1, and only this particular value. This is way out of range.
88 / 90

Model fit, with interaction
89 / 90

Interpretation of marginal effects
Important: As long as an interaction is present (in the model), do not try to interpret the marginal effects of either explanatory variable, even if the interaction is seen to be insignificant. Instead, leave out the interaction from the model and run it again. This will make us return to the analysis on p. 80.
90 / 90