Biostatistics for physicists, fall 2015: Correlation, linear regression, analysis of variance
Correlation. Example: antibody levels in 38 newborns and their mothers. There is a positive correlation in antibody level between a mother and her child.
Correlation: Pearson's r. Pearson's correlation coefficient, denoted r, is a measure of correlation:

r = Σᵢ(x_i − x̄)(y_i − ȳ) / √( Σᵢ(x_i − x̄)² · Σᵢ(y_i − ȳ)² ),   −1 ≤ r ≤ 1

Positive correlation <==> r > 0; negative correlation <==> r < 0. It is a measure of the linear relationship, and is preferably used for normally distributed data.
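The slides compute r in R; as an illustrative sketch (not part of the original course material), the same formula can be evaluated directly in Python:

```python
import math

def pearson_r(x, y):
    """Pearson's r: sum of cross-deviations scaled by both sums of squares."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]   # exactly linear in x
print(pearson_r(x, y))     # 1.0: a perfect positive linear relationship
```

An exact linear relationship gives r = ±1; anything weaker lands strictly between the bounds.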
Correlation: Pearson's r. [Figure slide.]
Pearson's r is not always a suitable measure of a relationship. [Figure: two scatter plots, one with r = 0.07 and one with r = 0.71.]
Spearman's rank correlation coefficient (Spearman's rho). A non-parametric correlation coefficient, suitable when data are far from normally distributed; it can also be used on ordinal data. It is calculated in the same way as Pearson's r, but on the rank values instead of the original data, and it is not affected by outliers.
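As a sketch of the rank idea (again in Python rather than the R used on the slides), Spearman's rho for untied data can be computed from the rank differences with the classical shortcut formula; the outlier below changes Pearson's r but leaves the ranks, and hence rho, untouched:

```python
def spearman_rho(x, y):
    """Spearman's rho via 1 - 6*sum(d^2)/(n*(n^2-1)); assumes no tied values."""
    n = len(x)
    rank_x = [0] * n
    rank_y = [0] * n
    for r, i in enumerate(sorted(range(n), key=lambda i: x[i]), start=1):
        rank_x[i] = r
    for r, i in enumerate(sorted(range(n), key=lambda i: y[i]), start=1):
        rank_y[i] = r
    d2 = sum((rank_x[i] - rank_y[i]) ** 2 for i in range(n))
    return 1 - 6 * d2 / (n * (n * n - 1))

x = [1.0, 2.0, 3.0, 4.0, 100.0]   # 100.0 is a gross outlier
y = [2.0, 3.0, 5.0, 7.0, 11.0]
print(spearman_rho(x, y))          # 1.0: the rank orderings agree perfectly
```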
Spearman's rank correlation coefficient (Spearman's rho). For the same data: Pearson's r = 0.71, Spearman's rho = 0.08. [Figure: scatter plot.]
Pearson's r in R. Example: antibody levels in 38 newborns and their mothers.

> cor(mother,child)
[1] 0.7846048
> cor.test(mother,child)

        Pearson's product-moment correlation

data:  mother and child
t = 7.593, df = 36, p-value = 5.562e-09
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6205780 0.8828478
sample estimates:
      cor
0.7846048
Spearman's rho in R.

> cor.test(mother,child,method="spearman")

        Spearman's rank correlation rho

data:  mother and child
S = 1618, p-value = 3.770e-08
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho
0.8229566
Correlation measures association, not causation. Example: among elementary school children, shoe size is strongly positively correlated with reading skills. Obviously there is no causal relationship; there is a third factor involved: age (age is a confounding factor).
Statistical models. A deterministic model describes a relationship between variables without randomness, e.g. pV = nRT or A = πr². Real measurements on a variable introduce randomness, which can be modelled by a stochastic model, Y = β₀ + β₁x + e, where e is a random variable and y = β₀ + β₁x is the deterministic part. A statistical model is a stochastic model that contains parameters, which are unknown and need to be estimated based on observed data and assumptions about the randomness.
Statistical models. A simple statistical model: Y_i = μ + e_i, i = 1, 2, …, n, where the e_i are independent normally distributed random variables with mean 0 and variance σ². This is the statistical model behind a one-sample t-test.
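Under this model, the one-sample t statistic is t = (ȳ − μ₀) / (s/√n), where s is the sample standard deviation. A minimal Python sketch (illustrative, with made-up data; the slides themselves use R):

```python
import math
from statistics import mean, stdev

def one_sample_t(y, mu0):
    """t-statistic for H0: mu = mu0 under the model Y_i = mu + e_i."""
    n = len(y)
    return (mean(y) - mu0) / (stdev(y) / math.sqrt(n))

y = [10.0, 12.0, 11.0, 13.0, 14.0]   # hypothetical measurements
print(one_sample_t(y, 10.0))          # about 2.83
```

The statistic is then compared against a t-distribution with n − 1 degrees of freedom.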
Estimation: maximum likelihood. How do we estimate parameters in a statistical model? There are several methods; the most important is maximum likelihood (ML). The ML estimate of a parameter is the most likely value of the parameter for a given set of observations. More formally (for a continuous variable): with probability density function f(x) and data x₁, x₂, …, x_n, the likelihood function is L(θ; x₁, x₂, …, x_n) = ∏ᵢ₌₁ⁿ f(x_i; θ). The ML estimate of θ is the value of θ that maximizes L(θ).
Estimation: maximum likelihood, example. Model: Y_i = μ + e_i, i = 1, 2, …, n, where the e_i are independent normally distributed random variables with mean 0 and variance σ². The ML estimate of μ is ȳ. The ML estimate of σ² is (1/n) Σᵢ(y_i − ȳ)². That estimate is biased; an unbiased estimate is (1/(n−1)) Σᵢ(y_i − ȳ)².
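The two variance estimates differ only in the divisor, as this Python sketch (with a toy sample) makes concrete:

```python
from statistics import mean

def var_ml(y):
    """ML estimate of sigma^2: divide the sum of squares by n (biased downward)."""
    m = mean(y)
    return sum((v - m) ** 2 for v in y) / len(y)

def var_unbiased(y):
    """Unbiased estimate of sigma^2: divide by n - 1 instead."""
    m = mean(y)
    return sum((v - m) ** 2 for v in y) / (len(y) - 1)

y = [2.0, 4.0, 6.0]
print(var_ml(y), var_unbiased(y))   # 8/3 versus 4.0
```

The unbiased version is simply the ML version rescaled by n/(n−1), so the gap shrinks as n grows.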
Linear regression. A linear regression model seeks to establish a relationship between a continuous response variable y and one or more continuous explanatory variables x₁, x₂, …, x_n. Model: Y = β₀ + β₁x₁ + β₂x₂ + … + β_n x_n + e, where e is independent normally distributed with mean 0 and variance σ², so that E(Y) = β₀ + β₁x₁ + β₂x₂ + … + β_n x_n. This is a multiple linear regression model.
Simple linear regression: one explanatory variable. Example: a calibration curve, with fluorescence intensity measured for known concentrations.
Simple linear regression. Model: Y_i = β₀ + β₁x_i + e_i, where the e_i are independent normally distributed random variables with mean 0 and variance σ². The ML estimates of β₀ and β₁ are equal to the least-squares estimates: the values of β₀ and β₁ that minimize the sum of squared residuals Σᵢ₌₁ⁿ (y_i − (β₀ + β₁x_i))². Predicted value: β̂₀ + β̂₁x_i. Residual: y_i − (β̂₀ + β̂₁x_i).
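Minimizing that sum of squares has a closed-form solution: β̂₁ = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)² and β̂₀ = ȳ − β̂₁x̄. A Python sketch on toy data (the slides do this with R's lm):

```python
def least_squares(x, y):
    """Closed-form least-squares estimates for the model y = b0 + b1*x + e."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return b0, b1

x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]     # exactly y = 1 + 2x
b0, b1 = least_squares(x, y)
print(b0, b1)                # 1.0 2.0
```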
Simple linear regression using R.

> lm(int~conc)

Call:
lm(formula = int ~ conc)

Coefficients:
(Intercept)         conc
      3.558        1.386
Simple linear regression using R.

> model<-lm(int~conc)
> summary(model)

Call:
lm(formula = int ~ conc)

Residuals:
     Min       1Q   Median       3Q      Max
-4.18727 -1.90848  0.02697  1.69864  4.72727

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.5582     1.6642   2.138    0.065 .
conc          1.3858     0.1559   8.891 2.03e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.831 on 8 degrees of freedom
Multiple R-squared:  0.9081,    Adjusted R-squared:  0.8966
F-statistic: 79.05 on 1 and 8 DF,  p-value: 2.027e-05
Simple linear regression: interpretation of R output.

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.5582     1.6642   2.138    0.065 .
conc          1.3858     0.1559   8.891 2.03e-05 ***

Each p-value refers to H0: β₀ = 0 and H0: β₁ = 0, respectively.

Residual standard error: 2.831 on 8 degrees of freedom
The estimate of σ is 2.831 (df = 8).

Multiple R-squared: 0.9081, Adjusted R-squared: 0.8966
R² = 0.9081 (the coefficient of determination) is the proportion of the variability in the data that the model explains.

F-statistic: 79.05 on 1 and 8 DF, p-value: 2.027e-05
This p-value refers to H0: β₁ = β₂ = … = β_n = 0 (only β₁ = 0 here).
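R² can be computed directly from the residuals as 1 − SSE/SST. A Python sketch on toy data (the coefficients below are the least-squares values for these three points, worked out by hand; this is not the calibration data from the slides):

```python
def r_squared(x, y, b0, b1):
    """R^2 = 1 - SSE/SST: the share of the variability in y explained by the line."""
    my = sum(y) / len(y)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - my) ** 2 for yi in y)
    return 1 - sse / sst

x = [0.0, 1.0, 2.0]
y = [0.0, 1.0, 4.0]
# b0 = -1/3, b1 = 2 minimize the sum of squared residuals for these points.
print(r_squared(x, y, -1.0 / 3.0, 2.0))   # 12/13, about 0.923
```

A perfect fit gives R² = 1; a model no better than the mean of y gives R² = 0.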
Simple linear regression: model check. Model assumption: the e_i are independent normally distributed random variables. Check the residuals:

> res<-residuals(model)
> plot(conc,res)
> lines(c(min(conc),max(conc)),c(0,0))
> qqnorm(res)
> qqline(res)
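The idea behind qqnorm is to pair the sorted residuals with standard normal quantiles; if the points fall roughly on a line, the normality assumption is plausible. A text-mode Python sketch of the same construction (with hypothetical residuals):

```python
from statistics import NormalDist

def qq_points(res):
    """Pair sorted residuals with standard-normal quantiles (a text-mode Q-Q plot)."""
    n = len(res)
    z = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(z, sorted(res)))

res = [-1.2, 0.3, -0.1, 0.9, 0.1]   # hypothetical residuals
for z, r in qq_points(res):
    print(f"{z:6.2f}  {r:6.2f}")    # roughly linear if the residuals are normal
```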
Linear regression: warnings. Be careful when there are outliers in the data; they may have a strong influence on the estimated model. Be careful when extrapolating from the model; it may lead to completely wrong conclusions. There are no true models, only useful approximations: "All models are wrong but some are useful" (G.E.P. Box).
Analysis of variance ANOVA Analysis of variance is a collection of statistical models, and their associated procedures, in which the observed variance [in data] is partitioned into components due to different sources of variation Wikipedia
Analysis of variance: example. Measurements of a particular variable for several groups of individuals: 4 treatments, 6 individuals per treatment. [Figure: y (roughly 11 to 15) plotted against treatment 1 to 4.]
Analysis of variance: example. Are the four treatments equivalent? Statistical model: Y_ij = μ_i + e_ij, i = 1, 2, 3, 4, j = 1, 2, …, 6, where the e_ij are independent normally distributed random errors with mean 0 and variance σ². The ML estimate of μ_i is ȳ_i. Null hypothesis: the four treatments are equivalent, H0: μ₁ = μ₂ = μ₃ = μ₄.
Analysis of variance: example. [Figure: y plotted against treatment 1 to 4.] The total variation can be divided into variation within treatments and variation between treatments.
Analysis of variance: partitioning the sum of squares. Let y_ij be the observed value for individual j under treatment i, ȳ_i. the average for treatment i, and ȳ.. the total average. Then

SST = Σᵢ Σⱼ (y_ij − ȳ..)²
SSA = Σᵢ n_i (ȳ_i. − ȳ..)²
SSE = Σᵢ Σⱼ (y_ij − ȳ_i.)²

and SST = SSA + SSE.
Analysis of variance: the ANOVA table. k = number of treatments (groups), N = total number of observations. Test statistic: F-ratio = (SSA/(k−1)) / (SSE/(N−k)). Under H0, the F-ratio is F-distributed with k−1 and N−k degrees of freedom.
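The partition SST = SSA + SSE and the F-ratio can be checked numerically. A Python sketch on a tiny two-group example (not the four-treatment data from the slides):

```python
def anova_oneway(groups):
    """Partition SST into SSA (between groups) + SSE (within) and form the F-ratio."""
    all_y = [v for g in groups for v in g]
    n_tot = len(all_y)
    grand = sum(all_y) / n_tot
    sst = sum((v - grand) ** 2 for v in all_y)
    ssa = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    sse = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    k = len(groups)
    f = (ssa / (k - 1)) / (sse / (n_tot - k))
    return sst, ssa, sse, f

groups = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
print(anova_oneway(groups))   # SST = 17.5, SSA = 13.5, SSE = 4.0, F = 13.5
```

Most of the variation here lies between the two groups, so the F-ratio is large.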
Analysis of variance: example. H0: μ₁ = μ₂ = μ₃ = μ₄, significance level 5%. p-value = 0.001657 < 0.05 ==> H0 is rejected: there are significant differences in means between the treatments. To find out which means are significantly different from one another, pairwise comparisons must be performed (a post hoc test), for example Tukey's test.
Analysis of variance using R.

> y
 [1] 11.03391 10.78788 10.60427 11.82818 11.97202 11.17095 12.22507 11.23059
 [9] 11.62308 11.22159 12.39842 11.81493 11.66640 13.15860 11.21381 12.44268
[17] 13.11546 11.73569 13.85875 12.45239 13.74456 12.78989 11.92614 15.20695
> x
 [1] 1 1 1 1 1 1 3 3 3 3 3 3 2 2 2 2 2 2 4 4 4 4 4 4
> x<-as.factor(x)
> anova(lm(y~x))

Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value   Pr(>F)
x          3 14.372  4.7908  7.3446 0.001657 **
Residuals 20 13.046  0.6523
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analysis of variance using R.

> TukeyHSD(aov(y~x))
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = y ~ x)

$x
          diff        lwr       upr     p adj
2-1  0.9892383 -0.3158831 2.2943598 0.1804179
3-1  0.5194117 -0.7857098 1.8245331 0.6853882
4-1  2.0969117  0.7917902 3.4020331 0.0011642
3-2 -0.4698267 -1.7749481 0.8352948 0.7468743
4-2  1.1076733 -0.1974481 2.4127948 0.1145178
4-3  1.5775000  0.2723786 2.8826214 0.0144137

> tapply(y,x,mean)
       1        2        3        4
11.23287 12.22211 11.75228 13.32978

Treatment 4 has a significantly higher mean than treatment 1 and treatment 3.
Analysis of variance: model check. Model assumption: the e_ij are independent normally distributed random variables. Check the residuals:

> model<-lm(y~x)
> res<-residuals(model)
> plot(x,res)
> abline(h=0)
> qqnorm(res)
> qqline(res)
Analysis of variance: more than one factor. The example shown is a one-factor (or one-way) ANOVA with 4 levels; the factor is treatment. If we have another factor, for example three different concentrations, the analysis can be performed with a two-factor ANOVA. Statistical model: Y_ijk = μ + α_i + β_j + e_ijk, i = 1, 2, 3, 4, j = 1, 2, 3, k = 1, 2, …, 6, where the e_ijk are independent normally distributed random errors with mean 0 and variance σ². There is much more to say about this, which would require some additional lectures.
General linear models. The response variable is continuous and is a linear function of the explanatory variables; the errors are independent normally distributed with mean 0 and variance σ². Explanatory variables: continuous ==> regression; categorical ==> ANOVA; continuous + categorical ==> ANCOVA.
Generalized linear models. The response variable is continuous or categorical and is modelled as a function of a linear combination of explanatory variables; the error components are usually not normally distributed. A few examples, by response variable: proportion ==> logistic regression; count ==> log-linear model, Poisson regression; count with overdispersion ==> negative binomial regression; time to failure (continuous) ==> survival models (e.g. Cox regression).