MSc / PhD Course Advanced Biostatistics. dr. P. Nazarov


L4. Linear models
dr. P. Nazarov, petr.nazarov@crp-sante.lu, 04-1-013
edu.sablab.net/abs013

Outline
- ANOVA (L3.4): 1-factor ANOVA, multifactor ANOVA, experimental design
- Linear regression (L3.5)

L4.1 ANOVA: Why ANOVA?

Means for more than 2 populations. We have measurements for 5 conditions. Are the means for these conditions equal? If we used pairwise comparisons, what would be the probability of getting an error?
Number of comparisons: C(5,2) = 5!/(2!·3!) = 10
Probability of at least one error: 1 - (0.95)^10 ≈ 0.40

Validation of the effects. We assume that several factors affect our data. Which factors are most significant? Which can be neglected?

ANOVA example from Partek: http://easylink.playstream.com/affymetrix/ambsymposium/partek_08.wvx
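The multiple-comparison arithmetic above can be checked directly. A quick Python sketch (the course itself uses R and Excel), under the simplifying assumption that the 10 pairwise tests are independent:

```python
from math import comb

# Number of pairwise comparisons among k = 5 group means
k = 5
n_comparisons = comb(k, 2)           # 5! / (2! * 3!) = 10

# Familywise probability of at least one false positive at alpha = 0.05,
# assuming the 10 tests are independent (a simplifying assumption)
alpha = 0.05
p_any_error = 1 - (1 - alpha) ** n_comparisons

print(n_comparisons)           # 10
print(round(p_any_error, 2))   # 0.4
```

This is exactly why ANOVA tests all k means jointly with a single F test instead of running C(k,2) pairwise t-tests.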

L4.1 ANOVA: Example

As part of a long-term study of individuals 65 years of age or older, sociologists and physicians at the Wentworth Medical Center in upstate New York investigated the relationship between geographic location and depression. A sample of 60 individuals, all in reasonably good health, was selected; 20 individuals were residents of Florida, 20 were residents of New York, and 20 were residents of North Carolina. Each of the individuals sampled was given a standardized test to measure depression. Higher test scores indicate higher levels of depression. The collected scores (20 per location) are in depression.txt, set 1: good-health respondents.

Q: Is the depression level the same in all 3 locations?
H0: μ1 = μ2 = μ3
Ha: not all 3 means are equal

L4.1 ANOVA: Meaning

[Figure: depression scores of individual respondents, grouped by location (FL, NY, NC), with the group means m1, m2, m3 marked.]

H0: μ1 = μ2 = μ3
Ha: not all 3 means are equal

L4.1 ANOVA: Assumptions for Analysis of Variance

1. For each population, the response variable is normally distributed.
2. The variance of the response variable, denoted σ², is the same for all of the populations.
3. The observations must be independent.

L4.1 ANOVA: Some Calculations

Parameter   Florida   New York   N. Carolina
m           5.55      8.35       7.05          (overall mean = 6.98333)
var         4.5763    4.7658     8.0500

Let's estimate the variance of the sampling distribution of the mean. If H0 is true, then all m_i belong to the same distribution:

    s_m² = Σ_i (m_i - m)² / (k - 1)
         = [(5.55 - 6.98)² + (8.35 - 6.98)² + (7.05 - 6.98)²] / (3 - 1) = 1.96

    estimated σ² = n · s_m² = 20 × 1.96 = 39.27

This is called the between-treatment estimate; it works only under H0.

At the same time, we can estimate the variance simply by averaging the variances of each population:

    estimated σ² = Σ_i s_i² / k = (4.58 + 4.77 + 8.05) / 3 = 5.80

This is called the within-treatment estimate. Do the between-treatment and within-treatment estimates give variances of the same population?

L4.1 ANOVA: Theory

H0: μ1 = μ2 = ... = μk
Ha: not all k means are equal

Means of treatments:      m_j = (1/n_j) Σ_i x_ij
Variances of treatments:  s_j² = Σ_i (x_ij - m_j)² / (n_j - 1)
Total mean:               m = (1/n_T) Σ_j Σ_i x_ij,   where n_T = n_1 + ... + n_k

Due to treatment:
    Sum of squares:          SSTR = Σ_j n_j (m_j - m)²
    Mean squares, between:   MSTR = SSTR / (k - 1)

Due to error:
    Sum of squares:          SSE = Σ_j (n_j - 1) s_j²
    Mean squares, within:    MSE = SSE / (n_T - k)

Test of variance equality:  F = MSTR / MSE
p-value for the treatment effect: from the F distribution with (k - 1, n_T - k) degrees of freedom.
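These formulas can be verified numerically from per-group summaries alone. A Python sketch (not part of the original course code) using the group means and variances from the depression example:

```python
# Group summaries for the good-health depression data: k = 3 locations,
# n = 20 respondents per location (values taken from the slides)
means = [5.55, 8.35, 7.05]            # FL, NY, NC
variances = [4.5763, 4.7658, 8.0500]
n, k = 20, 3
n_T = n * k

m = sum(means) / k                               # overall mean (equal group sizes)
SSTR = sum(n * (mj - m) ** 2 for mj in means)    # between-treatment sum of squares
SSE = sum((n - 1) * s2 for s2 in variances)      # within-treatment sum of squares

MSTR = SSTR / (k - 1)
MSE = SSE / (n_T - k)
F = MSTR / MSE

print(round(SSTR, 1), round(SSE, 1), round(F, 2))   # 78.5 330.4 6.77
```

The output reproduces the ANOVA table shown later for this dataset (Sum Sq 78.5 and 330.4, F = 6.77).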

L4.1 ANOVA: The Main Equation

Total sum of squares:  SST = Σ_j Σ_i (x_ij - m)²

    SST = SSTR + SSE

Total variability of the data includes variability due to treatment and variability due to error:
    SS due to treatment:  SSTR = Σ_j n_j (m_j - m)²
    SS due to error:      SSE = Σ_j (n_j - 1) s_j²

Degrees of freedom:  d.f.(SST) = d.f.(SSTR) + d.f.(SSE),  i.e.  n_T - 1 = (k - 1) + (n_T - k)

Partitioning: the process of allocating the total sum of squares and degrees of freedom to the various components.

L4.1 ANOVA: Example

[Figure: depression scores by respondent and location (FL, NY, NC) with group means marked, illustrating the decomposition SST = SSTR + SSE.]

L4.1 ANOVA: Example

ANOVA table: a table used to summarize the analysis of variance computations and results. It contains columns showing the source of variation, the sum of squares, the degrees of freedom, the mean square, and the F value(s).

Let's perform it for dataset 1 (good health) of depression.txt.
In Excel use: Data > Data Analysis > ANOVA: Single Factor
In R use: aov(...)

            Df  Sum Sq  Mean Sq  F value  Pr(>F)
Location     2   78.5    39.27    6.773   0.003 **
Residuals   57  330.4     5.80
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

L4.1 ANOVA: Some Definitions

Factor: another word for the independent variable of interest.
Treatments: different levels of a factor.
Factorial experiment: an experimental design that allows statistical conclusions about two or more factors.

depression.txt: good health / bad health
Factor 1: Health
Factor 2: Location (Florida, New York, North Carolina)

Model:  Depression = μ + Health + Location + Health×Location + ε

Interaction: the effect produced when the levels of one factor interact with the levels of another factor in influencing the response variable.

L4.1 ANOVA: ANOVA Table

Replications: the number of times each experimental condition is repeated in an experiment.

[Figure: layout of the ANOVA table.]

L4.1 ANOVA: Example

depression.txt. Factor 1: Health; Factor 2: Location.

res = aov(depression ~ Location + Health + Location*Health, Data)
summary(res)
source("http://sablab.net/scripts/drawanova.r")
drawanova(res)

ANOVA model:  Depression = m + Location + Health + Location*Health + e

Factor            Df   Sum Sq    Mean Sq    F value    p-value
Location           2     73.85     36.93      4.29     1.5981e-02  *
Health             1   1748.03   1748.03    203.094    4.3961e-27  ***
Location:Health    2     26.12     13.06      1.517    2.373e-01
error            114    981.2       8.61

[Figure: bar chart of F-statistics for Location, Health, and Location*Health.]

L4.1 ANOVA: Experimental Design

Be aware of batch effects! When designing your experiment, always remember the various factors which can affect your data: batch effect, personal effect, lab effect...

[Figure: samples processed on Day 1 vs. a later day, at T = +30C vs. T = +10C.]

L4.1 ANOVA: Experimental Design

Completely randomized design: an experimental design in which the treatments are randomly assigned to the experimental units. With this design we can nicely randomize out the day effect and the batch effect.

L4.1 ANOVA: Experimental Design

Blocking: the process of using the same or similar experimental units for all treatments. The purpose of blocking is to remove a source of variation from the error term and hence provide a more powerful test for a difference in population or treatment means.

[Figure: treatments arranged in blocks by day.]

L4.1 ANOVA: A good suggestion

Block what you can block, randomize what you cannot, and try to avoid unnecessary factors.

L4.1 ANOVA: R script

################################################################################
# L4.1 ANOVA
################################################################################
## clear memory
rm(list = ls())

## load data
Data = read.table("http://edu.sablab.net/data/txt/depression.txt", header=T, sep="\t")
str(Data)
DataGH = Data[Data$Health == "good",]

## build 1-factor ANOVA model
res1 = aov(depression ~ Location, DataGH)
summary(res1)

## build the 2-factor ANOVA model
res = aov(depression ~ Location + Health + Location*Health, Data)

## show the ANOVA table
summary(res)

## load function and draw the results
source("http://sablab.net/scripts/drawanova.r")
x11()
drawanova(res)

L3.5 Linear Regression: Dependent and Independent Variables

mice.xls
[Figure: two scatter plots. Left: ending weight vs. starting weight. Right: blood pH vs. starting weight.]

L3.5 Linear Regression: Dependent and Independent Variables

L3.5 Linear Regression: Example

Cells are grown under different temperature conditions from 20 to 40 degrees. A researcher would like to find a dependency between T and cell number. The measurements (temperature vs. cell number) are in cells.txt; the scatter plot shows the cell number rising roughly linearly across this temperature range.

Dependent variable: the variable that is being predicted or explained. It is denoted by y.
Independent variable: the variable that is doing the predicting or explaining. It is denoted by x.

L3.5 Linear Regression: Regression Model and Regression Line

Simple linear regression: regression analysis involving one independent variable and one dependent variable, in which the relationship between the variables is approximated by a straight line.

Building a regression means finding and tuning the model to explain the behaviour of the data.

[Scatter plot: number of cells vs. temperature.]

L3.5 Linear Regression: Regression Model and Regression Line

Regression model: the equation describing how y is related to x and an error term; in simple linear regression the regression model is

    y = β0 + β1·x + ε

Regression equation: the equation that describes how the mean or expected value of the dependent variable is related to the independent variable; in simple linear regression

    E(y) = β0 + β1·x

Model for a simple linear regression:  y(x) = β0 + β1·x + ε

L3.5 Linear Regression: Regression Model and Regression Line

y(x) = β0 + β1·x + ε

L3.5 Linear Regression: Estimation

Estimated regression equation: the estimate of the regression equation developed from sample data by using the least squares method. For simple linear regression, the estimated regression equation is

    ŷ(x) = b0 + b1·x

i.e. E(y) = β0 + β1·x is estimated by ŷ(x) = b0 + b1·x.

cells.xls
1. Make a scatter plot for the data.
2. Right-click the points and select Add Trendline; show the equation.
Result: y = 15.339x - 191.01

L3.5 Linear Regression: Overview

L3.5 Linear Regression: Slope and Intercept

Least squares method: a procedure used to develop the estimated regression equation. The objective is to minimize

    Σ_i (y_i - ŷ_i)²

Slope:      b1 = Σ_i (x_i - m_x)(y_i - m_y) / Σ_i (x_i - m_x)²
Intercept:  b0 = m_y - b1·m_x
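The two formulas can be applied by hand on a small made-up data set (the x and y values below are illustrative, not the lecture's cells data); a Python sketch:

```python
# Toy data (illustrative values, not from the lecture's files)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(x)
mx = sum(x) / n
my = sum(y) / n

# Slope: sum of cross-deviations over sum of squared x-deviations
b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
      / sum((xi - mx) ** 2 for xi in x))
# Intercept: the regression line passes through the point (mx, my)
b0 = my - b1 * mx

print(b1, b0)   # 1.95 and 0.15 (up to floating-point rounding)
```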

L3.5 Linear Regression: Coefficient of Determination

Sum of squares due to error:       SSE = Σ_i (y_i - ŷ_i)²
Total sum of squares:              SST = Σ_i (y_i - m_y)²
Sum of squares due to regression:  SSR = Σ_i (ŷ_i - m_y)²

The Main Equation:  SST = SSR + SSE

[Scatter plot of the cells data with the fitted line y = 15.339x - 191.01.]

L3.5 Linear Regression: ANOVA and Regression

[Figures: the depression data with group means (ANOVA) next to the cells data with the fitted line (regression).]

ANOVA:       SST = SSTR + SSE
Regression:  SST = SSR + SSE

L3.5 Linear Regression: Coefficient of Determination

SST = SSR + SSE, where SSE = Σ_i (y_i - ŷ_i)², SST = Σ_i (y_i - m_y)², SSR = Σ_i (ŷ_i - m_y)².

Coefficient of determination: a measure of the goodness of fit of the estimated regression equation. It can be interpreted as the proportion of the variability in the dependent variable y that is explained by the estimated regression equation:

    R² = SSR / SST

For the cells data: y = 15.339x - 191.01, R² = 0.9041.

Correlation coefficient: a measure of the strength of the linear relationship between two variables (previously discussed in Lecture 1):

    r = sign(b1) · √R²,   -1 ≤ r ≤ 1
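The decomposition SST = SSR + SSE and the resulting R² can be verified numerically on a small illustrative data set with its least-squares fit b0 = 0.15, b1 = 1.95 (toy values, not the lecture's cells data); a Python sketch:

```python
# Toy data and its least-squares fit yhat = 0.15 + 1.95 * x
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
b0, b1 = 0.15, 1.95

my = sum(y) / len(y)
yhat = [b0 + b1 * xi for xi in x]

SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))   # error sum of squares
SSR = sum((yh - my) ** 2 for yh in yhat)               # regression sum of squares
SST = sum((yi - my) ** 2 for yi in y)                  # total sum of squares

R2 = SSR / SST
print(round(SST, 3), round(SSR + SSE, 3))   # both 38.1, confirming SST = SSR + SSE
print(round(R2, 3))                          # 0.998
```

Note that the identity SST = SSR + SSE holds exactly only when ŷ is the least-squares fit.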

L3.5 Linear Regression: Assumptions

Model: y(x) = β0 + β1·x + ε

Assumptions for simple linear regression:
1. The error term ε is a random variable with mean 0, i.e. E[ε] = 0.
2. The variance of ε, denoted by σ², is the same for all values of x.
3. The values of ε are independent.
4. ε is a normally distributed random variable.

L3.5 Linear Regression: Estimation of σ²

i-th residual: the difference between the observed value of the dependent variable and the value predicted using the estimated regression equation; for the i-th observation the residual is

    e_i = y_i - ŷ_i

Mean square error: the unbiased estimate of the variance σ² of the error term, denoted by MSE or s²:

    s² = MSE = SSE / (n - 2)

Standard error of the estimate: the square root of the mean square error, denoted by s. It is the estimate of σ, the standard deviation of the error term:

    s = √MSE = √(SSE / (n - 2))

L3.5 Linear Regression: Sampling Distribution for b1

Model y(x) = β0 + β1·x + ε, estimated by ŷ(x) = b0 + b1·x. If the assumptions for ε are fulfilled, then the sampling distribution of b1 is as follows:

Expected value:   E[b1] = β1
Standard error:   s_b1 = s / √(Σ_i (x_i - m_x)²)
Distribution: normal

Interval estimation for β1:

    b1 ± t_{α/2, n-2} · s_b1

L3.5 Linear Regression: Test for Significance

H0: β1 = 0 (the slope is insignificant)
Ha: β1 ≠ 0

Option 1: F-test.
1. Build the F statistic:  F = MSR / MSE
2. Calculate the p-value from the F distribution.

Option 2: t-test.
1. Build the t statistic:  t = b1 / s_b1,  where  s_b1 = s / √(Σ_i (x_i - m_x)²)
2. Calculate the p-value for t with n - 2 degrees of freedom.
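The t statistic for the slope can be computed step by step on a small illustrative data set with least-squares fit b0 = 0.15, b1 = 1.95 (toy values, not from cells.txt); a Python sketch:

```python
from math import sqrt

# Toy data and its least-squares fit
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
n = len(x)
b0, b1 = 0.15, 1.95
mx = sum(x) / n

# s: standard error of the estimate, s^2 = SSE / (n - 2)
SSE = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = sqrt(SSE / (n - 2))

# Standard error of the slope and the t statistic for H0: beta1 = 0
se_b1 = s / sqrt(sum((xi - mx) ** 2 for xi in x))
t = b1 / se_b1

print(round(s, 4), round(se_b1, 4), round(t, 1))   # 0.1581 0.05 39.0
```

The resulting t would then be compared against the t distribution with n - 2 = 3 degrees of freedom to get the p-value.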

L3.5 Linear Regression: Example

cells.xls

1. Calculate b1 and b0 manually.
   Intercept b0 = -191.008119   (in Excel: =INTERCEPT(y,x))
   Slope b1 = 15.338573         (in Excel: =SLOPE(y,x))

2. Let's do it automatically: Data > Data Analysis > Regression

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.95084308
R Square             0.904101095
Adjusted R Square    0.899053784
Standard Error       31.80180903
Observations        21

ANOVA
             df   SS            MS            F          Significance F
Regression    1   181159.2853   181159.2853   179.1253   4.01609E-11
Residual     19    19215.7461     1011.355
Total        20   200375.0314

              Coefficients    Standard Error   t Stat      P-value     Lower 95%      Upper 95%
Intercept     -191.0081194    35.0751066       -5.445689   2.97E-05    -264.4211603   -117.5950784
X Variable 1    15.3385726     1.146057646     13.38377    4.02E-11      12.93984605    17.73729848

L3.5 Linear Regression: Confidence and Prediction

Confidence interval: the interval estimate of the mean value of y for a given value of x.
Prediction interval: the interval estimate of an individual value of y for a given value of x.

L3.5 Linear Regression: Example

cells.txt

x = data$temperature
y = data$cell.number
res = lm(y ~ x)
res
summary(res)

# draw the data
x11()
plot(x, y)

# draw the regression and its confidence interval (95%)
lines(x, predict(res, int = "confidence")[,1], col=4, lwd=2)
lines(x, predict(res, int = "confidence")[,2], col=4)
lines(x, predict(res, int = "confidence")[,3], col=4)

# draw the prediction interval for the values (95%)
lines(x, predict(res, int = "pred")[,2], col=2)
lines(x, predict(res, int = "pred")[,3], col=2)

[Figure: scatter plot with regression line, confidence band, and prediction band.]

L3.5 Linear Regression: Residuals

L3.5 Linear Regression: Example

rana.txt
A biology student wishes to determine the relationship between temperature and heart rate in the leopard frog, Rana pipiens. He manipulates the temperature in increments ranging from 2 to 18C and records the heart rate at each interval. His data are presented in rana.txt.

1) Build the model and provide the p-value for linear dependency.
2) Provide an interval estimation for the slope of the dependency.
3) Estimate the 95% prediction interval for the heart rate at 15C.

L3.5 Linear Regression: Multiple Regression

L3.5 Linear Regression: Non-Linear Regression

Example: logistic regression, used when y is binary:

    E(y) = P(y = 1 | x1, ..., xp) = exp(β0 + β1·x1 + ... + βp·xp) / (1 + exp(β0 + β1·x1 + ... + βp·xp))
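A logistic link of this form maps any value of the linear predictor into the interval (0, 1). A small Python sketch with one predictor; the coefficients β0 = -2 and β1 = 0.8 are arbitrary illustrative numbers, not values from the lecture:

```python
from math import exp

def logistic(x, b0=-2.0, b1=0.8):
    """P(y = 1 | x) under a one-predictor logistic model (illustrative betas)."""
    z = b0 + b1 * x
    return exp(z) / (1 + exp(z))

# At x = 2.5 the linear predictor z is 0, so the probability is exactly 0.5
print(logistic(2.5))                            # 0.5
# The output always stays strictly between 0 and 1
print(logistic(-10.0) < 0.5 < logistic(10.0))   # True
```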

Questions?
Thank you for your attention. To be continued...