Biostatistics for physicists, fall 2015
Correlation, linear regression, analysis of variance

Correlation
Example: antibody levels measured in 38 newborns and their mothers. There is a positive correlation in antibody level between a mother and her child.

Correlation: Pearson's r
Pearson's correlation coefficient, denoted r, is a measure of the linear relationship between two variables:

r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}, \qquad -1 \le r \le 1

Positive correlation <==> r > 0; negative correlation <==> r < 0. Preferably used for normally distributed data.
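To illustrate the formula, here is a minimal R sketch (with simulated mother/child values standing in for the antibody data, which are not reproduced in the transcript) computing r from the definition and comparing it with the built-in cor():

# Simulated stand-in for the antibody example (hypothetical values)
set.seed(1)
mother <- rnorm(38, mean = 5, sd = 1)
child  <- 0.8 * mother + rnorm(38, sd = 0.5)

# Pearson's r directly from the definition
num <- sum((mother - mean(mother)) * (child - mean(child)))
den <- sqrt(sum((mother - mean(mother))^2) * sum((child - mean(child))^2))
num / den
cor(mother, child)   # agrees with the hand computation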

Correlation: Pearson's r
[figure]

Pearson's r is not always a suitable measure of a relationship.
[figure: two scatter plots, one with r = 0.07 and one with r = 0.71, where r is misleading]

Spearman's rank correlation coefficient (Spearman's rho)
A non-parametric correlation coefficient, suitable when the data are far from normally distributed; it can also be used on ordinal data. It is calculated in the same way as Pearson's r, but on the rank values instead of the original data, and it is far less affected by outliers.
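Since Spearman's rho is simply Pearson's r computed on the ranks, the identity is easy to verify in R; a small sketch with made-up skewed vectors x and y:

set.seed(2)
x <- rexp(20)          # made-up, skewed data
y <- x + rexp(20)

# Spearman's rho is Pearson's r applied to the ranks
cor(rank(x), rank(y))
cor(x, y, method = "spearman")   # identical result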

Spearman's rank correlation coefficient (Spearman's rho)
[figure: scatter plot for which Pearson's r = 0.71 but Spearman's rho = 0.08]

Pearson's r in R
Ex. Antibody levels in 38 newborns and their mothers:

> cor(mother, child)
[1] 0.7846048
> cor.test(mother, child)

        Pearson's product-moment correlation

data:  mother and child
t = 7.593, df = 36, p-value = 5.562e-09
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6205780 0.8828478
sample estimates:
      cor
0.7846048

Spearman's rho in R

> cor.test(mother, child, method = "spearman")

        Spearman's rank correlation rho

data:  mother and child
S = 1618, p-value = 3.770e-08
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho
0.8229566

Correlation measures association, not causation
Ex. Among elementary school children, shoe size is strongly positively correlated with reading skills. Obviously there is no causal relationship; there is a third factor involved, age (age is a confounding factor).

Statistical models
A deterministic model describes a relationship between variables without randomness, e.g. pV = nRT or A = \pi r^2. Real measurements on a variable introduce randomness, which can be modelled by a stochastic model

Y = \beta_0 + \beta_1 x + e

where e is a random variable and y = \beta_0 + \beta_1 x is the deterministic part. A statistical model is a stochastic model that contains parameters, which are unknown and need to be estimated based on observed data and assumptions about the randomness.

Statistical models
A simple statistical model:

Y_i = \mu + e_i, \quad i = 1, 2, \ldots, n

where the e_i are independent normally distributed random variables with mean 0 and variance \sigma^2. This is the statistical model behind a one-sample t-test.
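To make the t-test connection concrete, a minimal sketch (simulated observations; mu = 10 is a hypothetical null value, not from the slides):

set.seed(3)
y <- rnorm(25, mean = 10.4, sd = 2)   # simulated data from the model Y_i = mu + e_i

# One-sample t-test of H0: mu = 10
t.test(y, mu = 10)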

Estimation: Maximum Likelihood
How do we estimate the parameters in a statistical model? There are several methods; the most important is Maximum Likelihood (ML). The ML estimate of a parameter is the most likely value of the parameter for a given set of observations. More formally (for a continuous variable), given a probability density function f(x) and data x_1, x_2, \ldots, x_n, the likelihood function is

L(\theta \mid x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f(x_i \mid \theta)

The ML estimate of \theta is the value of \theta that maximizes L(\theta).

Estimation: Maximum Likelihood, example
Model: Y_i = \mu + e_i, i = 1, 2, \ldots, n, where the e_i are independent normally distributed random variables with mean 0 and variance \sigma^2.

The ML estimate of \mu is \bar{y}. The ML estimate of \sigma^2 is

\frac{1}{n} \sum_i (y_i - \bar{y})^2

That estimate is biased; an unbiased estimate is

\frac{1}{n-1} \sum_i (y_i - \bar{y})^2
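A numerical check in R (a sketch with simulated data, not from the slides): maximizing the normal log-likelihood recovers \bar{y} and the biased variance estimate, while var() uses the unbiased n - 1 divisor.

set.seed(4)
y <- rnorm(30, mean = 5, sd = 2)

# Negative log-likelihood of Y_i = mu + e_i with e_i ~ N(0, sigma^2);
# par[1] = mu, par[2] = log(sigma) to keep sigma positive
negloglik <- function(par) {
  -sum(dnorm(y, mean = par[1], sd = exp(par[2]), log = TRUE))
}
fit <- optim(c(0, 0), negloglik)

fit$par[1]              # ML estimate of mu, close to mean(y)
exp(fit$par[2])^2       # ML estimate of sigma^2 (biased, divisor n)
mean((y - mean(y))^2)   # the same biased estimate in closed form
var(y)                  # unbiased estimate (divisor n - 1)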

Linear regression
A linear regression model seeks to establish a relationship between a continuous response variable y and one or more continuous explanatory variables x_1, x_2, \ldots, x_n. Model:

Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n + e

where e is independent normally distributed with mean 0 and variance \sigma^2, so that

E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n

This is a multiple linear regression model.

Simple linear regression
One explanatory variable. Example: a calibration curve, where fluorescence intensity is measured for known concentrations.

Simple linear regression
Model:

Y_i = \beta_0 + \beta_1 x_i + e_i

where the e_i are independent normally distributed random variables with mean 0 and variance \sigma^2. The ML estimates of \beta_0 and \beta_1 are equal to the least-squares estimates: the values of \beta_0 and \beta_1 that minimize the sum of squared residuals

\sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1 x_i))^2

Predicted value: \hat{\beta}_0 + \hat{\beta}_1 x_i. Residual: y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i).
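The least-squares estimates have a closed form for one explanatory variable. A sketch with simulated calibration data (the slide's actual conc/int values are not reproduced in the transcript):

set.seed(5)
conc <- 1:10
int  <- 3.5 + 1.4 * conc + rnorm(10, sd = 2.8)   # simulated calibration data

# Closed-form least-squares estimates of the slope and intercept
b1 <- sum((conc - mean(conc)) * (int - mean(int))) / sum((conc - mean(conc))^2)
b0 <- mean(int) - b1 * mean(conc)
c(b0, b1)

coef(lm(int ~ conc))   # matches the hand computation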

Simple linear regression using R

> lm(int ~ conc)

Call:
lm(formula = int ~ conc)

Coefficients:
(Intercept)         conc
      3.558        1.386

Simple linear regression using R

> model <- lm(int ~ conc)
> summary(model)

Call:
lm(formula = int ~ conc)

Residuals:
     Min       1Q   Median       3Q      Max
-4.18727 -1.90848  0.02697  1.69864  4.72727

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.5582     1.6642   2.138    0.065 .
conc          1.3858     0.1559   8.891 2.03e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.831 on 8 degrees of freedom
Multiple R-squared:  0.9081,    Adjusted R-squared:  0.8966
F-statistic: 79.05 on 1 and 8 DF,  p-value: 2.027e-05

Simple linear regression: interpretation of R output

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.5582     1.6642   2.138    0.065 .
conc          1.3858     0.1559   8.891 2.03e-05 ***

The p-values refer to H0: \beta_0 = 0 and H0: \beta_1 = 0, respectively.

Residual standard error: 2.831 on 8 degrees of freedom
The estimate of \sigma is 2.831 (df = 8).

Multiple R-squared: 0.9081,  Adjusted R-squared: 0.8966
R^2 = 0.9081 (the coefficient of determination) is the proportion of variability in the data that the model explains.

F-statistic: 79.05 on 1 and 8 DF,  p-value: 2.027e-05
The p-value refers to H0: \beta_1 = \beta_2 = \ldots = \beta_n = 0 (only \beta_1 = 0 here).
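To see where these numbers come from, a brief sketch (it assumes a fitted object model and response int as on the previous slides, e.g. from the simulated calibration data above):

res <- residuals(model)
SSE <- sum(res^2)                  # residual sum of squares
SST <- sum((int - mean(int))^2)    # total sum of squares

1 - SSE / SST                      # R-squared
sqrt(SSE / df.residual(model))     # residual standard error, the estimate of sigma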

Simple linear regression: model check
Model assumption: the e_i are independent normally distributed random variables. Check the residuals:

> res <- residuals(model)
> plot(conc, res)
> lines(c(min(conc), max(conc)), c(0, 0))
> qqnorm(res)
> qqline(res)

Linear regression: warnings
Be careful when there are outliers in the data: they may have a strong influence on the estimated model. Be careful when extrapolating from the model: it may lead to completely wrong conclusions. There are no true models, only useful approximations. "All models are wrong but some are useful" (G.E.P. Box).

Analysis of variance (ANOVA)
"Analysis of variance is a collection of statistical models, and their associated procedures, in which the observed variance [in data] is partitioned into components due to different sources of variation." (Wikipedia)

Analysis of variance: example
Measurements of a particular variable for several groups of individuals: 4 treatments, 6 individuals per treatment.
[figure: plot of y (about 11 to 15) against treatment 1-4]

Analysis of variance: example
Are the four treatments equivalent? Statistical model:

Y_{ij} = \mu_i + e_{ij}, \quad i = 1, 2, 3, 4; \; j = 1, 2, \ldots, 6

where the e_{ij} are independent normally distributed random errors with mean 0 and variance \sigma^2. The ML estimate of \mu_i is \bar{y}_i. Null hypothesis: the four treatments are equivalent,

H0: \mu_1 = \mu_2 = \mu_3 = \mu_4

Analysis of variance: example
[figure: the same plot of y against treatment]
The total variation can be divided into variation within treatments and variation between treatments.

Analysis of variance: partitioning the sum of squares
y_{ij} = observed value for individual j in treatment i; \bar{y}_{i.} = average for treatment i; \bar{y}_{..} = total average.

SST = \sum_i \sum_j (y_{ij} - \bar{y}_{..})^2
SSA = \sum_i n_i (\bar{y}_{i.} - \bar{y}_{..})^2
SSE = \sum_i \sum_j (y_{ij} - \bar{y}_{i.})^2

SST = SSA + SSE

Analysis of variance: the ANOVA table
k = number of treatments (groups); N = number of observations in total.

Test statistic: F-ratio = (SSA / (k - 1)) / (SSE / (N - k))

Under H0 the F-ratio is F-distributed with k - 1 and N - k degrees of freedom.
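The decomposition and the F-test can be checked by hand in R. A sketch with simulated group data mirroring the example (hypothetical names g and yobs; 4 groups of 6):

set.seed(6)
g    <- factor(rep(1:4, each = 6))   # 4 treatments, 6 individuals each
yobs <- rnorm(24, mean = rep(c(11, 12, 12, 13), each = 6), sd = 0.8)

grand <- mean(yobs)
means <- tapply(yobs, g, mean)

SST <- sum((yobs - grand)^2)
SSA <- sum(6 * (means - grand)^2)    # n_i = 6 for every group
SSE <- sum((yobs - means[g])^2)
c(SST, SSA + SSE)                    # verifies SST = SSA + SSE

k <- 4; N <- 24
Fratio <- (SSA / (k - 1)) / (SSE / (N - k))
pf(Fratio, k - 1, N - k, lower.tail = FALSE)   # p-value; matches anova(lm(yobs ~ g))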

Analysis of variance: example
H0: \mu_1 = \mu_2 = \mu_3 = \mu_4, significance level 5%. p-value = 0.001657 < 0.05 ==> H0 is rejected: there are significant differences in means between the treatments. In order to find out which means are significantly different from one another, pairwise comparisons must be performed (a post hoc test), for example Tukey's test.

Analysis of variance using R

> y
 [1] 11.03391 10.78788 10.60427 11.82818 11.97202 11.17095 12.22507 11.23059
 [9] 11.62308 11.22159 12.39842 11.81493 11.66640 13.15860 11.21381 12.44268
[17] 13.11546 11.73569 13.85875 12.45239 13.74456 12.78989 11.92614 15.20695
> x
 [1] 1 1 1 1 1 1 3 3 3 3 3 3 2 2 2 2 2 2 4 4 4 4 4 4
> x <- as.factor(x)
> anova(lm(y ~ x))
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value   Pr(>F)
x          3 14.372  4.7908  7.3446 0.001657 **
Residuals 20 13.046  0.6523
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Analysis of variance using R

> TukeyHSD(aov(y ~ x))
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = y ~ x)

$x
          diff        lwr       upr     p adj
2-1  0.9892383 -0.3158831 2.2943598 0.1804179
3-1  0.5194117 -0.7857098 1.8245331 0.6853882
4-1  2.0969117  0.7917902 3.4020331 0.0011642
3-2 -0.4698267 -1.7749481 0.8352948 0.7468743
4-2  1.1076733 -0.1974481 2.4127948 0.1145178
4-3  1.5775000  0.2723786 2.8826214 0.0144137

> tapply(y, x, mean)
       1        2        3        4
11.23287 12.22211 11.75228 13.32978

Treatment 4 has a significantly higher mean than treatments 1 and 3.

Analysis of variance: model check
Model assumption: the e_{ij} are independent normally distributed random variables. Check the residuals:

> model <- lm(y ~ x)
> res <- residuals(model)
> plot(x, res)    # x is a factor, so this draws boxplots of residuals per group
> abline(h = 0)   # the slide's lines(c(min(x), max(x)), c(0, 0)) fails for a factor x
> qqnorm(res)
> qqline(res)

Analysis of variance: more than one factor
The example shown is a one-factor (or one-way) ANOVA with 4 levels; the factor is treatment. If we have another factor, for example three different concentrations, the analysis can be performed with a two-factor ANOVA. Statistical model:

Y_{ijk} = \mu + \alpha_i + \beta_j + e_{ijk}, \quad i = 1, 2, 3, 4; \; j = 1, 2, 3; \; k = 1, 2, \ldots, 6

where the e_{ijk} are independent normally distributed random errors with mean 0 and variance \sigma^2. There is much more to say about this, which would require some additional lectures.
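A minimal sketch of how such a two-factor analysis could be run in R (simulated data; treatment and conc are hypothetical factor names, not from the slides):

set.seed(7)
treatment <- factor(rep(1:4, each = 18))                 # 4 treatment levels
conc      <- factor(rep(rep(1:3, each = 6), times = 4))  # 3 concentrations, 6 replicates
y <- 12 + 0.5 * as.numeric(treatment) + 0.3 * as.numeric(conc) + rnorm(72, sd = 0.8)

anova(lm(y ~ treatment + conc))   # two-factor ANOVA, main effects only
anova(lm(y ~ treatment * conc))   # including the treatment-by-concentration interaction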

General linear models
The response variable is continuous and is a linear function of the explanatory variables. The errors are independent normally distributed with mean 0 and variance \sigma^2.

Explanatory variables:
Continuous ==> regression
Categorical ==> ANOVA
Continuous + categorical ==> ANCOVA
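In R all three cases are handled by the same lm() call; only the types of the explanatory variables differ. A brief sketch (simulated data; xcont and group are hypothetical names):

set.seed(8)
xcont <- runif(30, 0, 10)                    # continuous explanatory variable
group <- factor(rep(c("A", "B", "C"), 10))   # categorical explanatory variable
y <- 2 + 0.5 * xcont + 1.5 * (group == "B") + rnorm(30)

lm(y ~ xcont)           # regression
lm(y ~ group)           # one-way ANOVA
lm(y ~ xcont + group)   # ANCOVA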

Generalized linear models
The response variable is continuous or categorical, and (through a link function) its mean is a linear function of the explanatory variables. The error components are usually not normally distributed. A few examples:

Response variable:
Proportion ==> logistic regression
Count ==> log-linear model, Poisson regression
Count with overdispersion ==> negative binomial regression
Time to failure (continuous) ==> survival models (e.g. Cox regression)
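As a closing sketch, both logistic and Poisson regression are fit with glm() in R; only the family argument changes (simulated data with hypothetical names):

set.seed(9)
x <- runif(50, 0, 5)

# Proportion response: logistic regression on successes out of 10 trials
p    <- plogis(-1 + 0.8 * x)
succ <- rbinom(50, size = 10, prob = p)
glm(cbind(succ, 10 - succ) ~ x, family = binomial)

# Count response: Poisson regression
counts <- rpois(50, lambda = exp(0.2 + 0.3 * x))
glm(counts ~ x, family = poisson)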