Biostatistics. Chapter 11 Simple Linear Correlation and Regression. Jing Li


Biostatistics Chapter 11 Simple Linear Correlation and Regression Jing Li jing.li@sjtu.edu.cn http://cbb.sjtu.edu.cn/~jingli/courses/2018fall/bi372/ Dept of Bioinformatics & Biostatistics, SJTU

Review Questions (5 min) Describe the assumptions of simple linear regression.

Linear regression assumes that:
- The relationship between X and Y is linear
- Y is normally distributed at each value of X
- The variance of Y is the same at every value of X (homogeneity of variances)
- The observations are independent

Example: Arm Circumference and Height Estimated mean arm circumference for children 60 cm in height

Is the best fit good enough?

Assessing the Model The least squares method will always produce a straight line, even if there is no relationship between the variables, or if the relationship is something other than linear. Hence, in addition to determining the coefficients of the least squares line, we need to assess it to see how well it fits the data. The evaluation methods that follow are based on what is called the sum of squares for error (SSE).

Sum of Squares for Error (SSE, another quantity to calculate) The sum of squares for error is calculated as SSE = Σ(yi − ŷi)², and is used in the calculation of the standard error of estimate: s_e = √(SSE / (n − 2)). If SSE is zero, all the points fall on the regression line.
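As a minimal sketch of these formulas (the data below are made-up illustration values, not from the slides), SSE and the standard error of estimate can be computed directly:

```python
# Sketch: SSE = sum of squared residuals, s_e = sqrt(SSE / (n - 2)).
# Toy data only; not from the lecture example.
def fit_line(x, y):
    """Least-squares slope and intercept for a simple linear regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    ss_xx = sum((xi - mx) ** 2 for xi in x)
    b1 = ss_xy / ss_xx
    b0 = my - b1 * mx
    return b0, b1

def sse_and_se(x, y):
    """Return (SSE, standard error of estimate) for the fitted line."""
    b0, b1 = fit_line(x, y)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = (sse / (len(x) - 2)) ** 0.5
    return sse, s_e

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # points close to the line y = 2x
sse, s_e = sse_and_se(x, y)      # both small, since the fit is nearly perfect
```

Because the toy points lie almost exactly on a line, SSE and s_e come out close to zero, matching the slide's remark that SSE = 0 means all points fall on the regression line.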

Standard Error If s_e is small, the fit is excellent and the linear model should be used for forecasting. If s_e is large, the model is poor. But what is the cutoff of s_e for a good model?

Standard Error Judge the value of s_e by comparing it to the sample mean of the dependent variable (ȳ). For example, with s_e = 0.3265 and ȳ = 14.841, s_e is (relatively speaking) small, hence our linear regression model of car price as a function of odometer reading is good.

Testing the Slope If no linear relationship exists between the two variables, we would expect the regression line to be horizontal, that is, to have a slope of zero. We want to see if there is a linear relationship, i.e. we want to see if the slope (β1) is something other than zero. Our research hypothesis becomes H1: β1 ≠ 0, and the null hypothesis becomes H0: β1 = 0.

Testing the Slope We can use the test statistic t = (b1 − β1) / s_b1 to test H0: β1 = 0, where s_b1 is the standard deviation (standard error) of b1, defined as s_b1 = s_e / √SS_x. If the error variable is normally distributed, the test statistic has a Student t-distribution with n − 2 degrees of freedom. The rejection region depends on whether we are doing a one- or two-tail test (a two-tail test is most typical).
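The slope test above can be sketched in a few lines; the data are illustrative only, and in practice one would compare |t| against a Student t critical value with n − 2 degrees of freedom:

```python
# Sketch of the slope test: t = (b1 - 0) / s_b1, with s_b1 = s_e / sqrt(SS_x).
# Illustrative data, not from the lecture example.
def slope_t_stat(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    ss_xx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / ss_xx
    b0 = my - b1 * mx
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = (sse / (n - 2)) ** 0.5       # standard error of estimate
    s_b1 = s_e / ss_xx ** 0.5          # standard error of the slope b1
    return b1 / s_b1                   # t statistic under H0: beta1 = 0

x = [1, 2, 3, 4, 5, 6]
y = [1.2, 2.1, 2.9, 4.2, 4.8, 6.1]
t = slope_t_stat(x, y)                 # large |t| -> reject H0: beta1 = 0
```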

Relationship with correlation: r̂ = b̂ · (SD_x / SD_y). In correlation, the two variables are treated as equals. In regression, one variable is considered the independent (predictor) variable (X) and the other the dependent (outcome) variable (Y).

Power of the model: Coefficient of Determination Tests thus far have shown whether a linear relationship exists; it is also useful to measure the strength of the relationship. This is done by calculating the coefficient of determination, R². The coefficient of determination is the square of the coefficient of correlation (r), hence R² = r². Here R² has a value of 0.6483: 64.83% of the variation in y is explained by the regression model, and the remaining 35.17% is unexplained, i.e. due to error.
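A short sketch of R² as the square of the correlation coefficient (toy data, not the slide's example):

```python
# Sketch: R^2 = r^2, the fraction of variation in y explained by the model.
# Toy data only.
def r_squared(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    ss_xx = sum((xi - mx) ** 2 for xi in x)
    ss_yy = sum((yi - my) ** 2 for yi in y)
    r = ss_xy / (ss_xx * ss_yy) ** 0.5   # correlation coefficient
    return r * r

x = [1, 2, 3, 4]
y = [1.0, 2.0, 2.9, 4.1]                 # nearly linear in x
r2 = r_squared(x, y)                     # close to 1: almost all variation explained
```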

Linear regression assumes that:
- The relationship between X and Y is linear
- Y is normally distributed at each value of X
- The variance of Y is the same at every value of X (homogeneity of variances)
- The observations are independent

Regression Diagnostics How can we diagnose violations of these conditions? → Residual analysis: examine the differences between the actual data points and those predicted by the linear equation. Three conditions about the error are required in order to perform a regression analysis:
- The error variable must be normally distributed
- The error variable must have a constant variance
- The errors must be independent of each other

Example: relationship between cognitive function and vitamin D Cross-sectional study of 100 middle-aged and older European men. Cognitive function is measured by the Digit Symbol Substitution Test (DSST). Lee DM, Tajar A, Ulubaev A, et al. Association between 25-hydroxyvitamin D levels and cognitive performance in middle-aged and older European men. J Neurol Neurosurg Psychiatry. 2009 Jul;80(7):722-9.

SD_x = 33 nmol/L, SD_y = 10 points, Cov(X,Y) = 163 points·nmol/L
b̂ = SS_xy / SS_xx, so beta = 163 / 33² = 0.15 points per nmol/L = 1.5 points per 10 nmol/L
r = 163 / (10 × 33) = 0.49

Significance testing: Slope Distribution of the slope: b̂ ~ T_{n−2}(β, s.e.(b̂))
H0: β1 = 0 (no linear relationship)
H1: β1 ≠ 0 (linear relationship does exist)
T_{n−2} = (b̂ − 0) / s.e.(b̂)

Formula for the standard error of beta: with ŷi = α̂ + β̂xi and SS_x = Σi (xi − x̄)²,
s² = Σi (yi − ŷi)² / (n − 2) and s.e.(β̂) = s / √SS_x

Example: cognitive function and vitamin D Standard error (beta) = 0.03. T_98 = 0.15 / 0.03 = 5, p < .0001. 95% confidence interval: 0.09 to 0.21.

Residual Analysis: check assumptions The residual for observation i, ei = yi − ŷi, is the difference between its observed and predicted value. Check the assumptions of regression by examining the residuals:
- Examine for the linearity assumption
- Examine for constant variance at all levels of X (homoscedasticity)
- Evaluate the normal distribution assumption
- Evaluate the independence assumption
Graphical analysis of residuals: plot the residuals ei vs. X.
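Computing the residuals themselves is one line; these are the quantities plotted against X in the diagnostic plots that follow (toy data and an assumed fitted line, purely for illustration):

```python
# Sketch: residuals e_i = y_i - yhat_i for a fitted line.
def residuals(x, y, b0, b1):
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

x = [1, 2, 3, 4]
y = [2.2, 3.9, 6.1, 7.8]
e = residuals(x, y, b0=0.0, b1=2.0)   # residuals from the assumed line y = 2x
# A small, pattern-free scatter of e against x supports the linearity and
# constant-variance assumptions; residuals sum to ~0 for a least-squares fit.
```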

Predicted values ŷi = 20 + 1.5xi. For vitamin D = 95 nmol/L (i.e., 9.5 in units of 10 nmol/L): ŷi = 20 + 1.5 × 9.5 ≈ 34

Residual = observed − predicted. At X = 95 nmol/L: yi = 48, ŷi = 34, so yi − ŷi = 48 − 34 = 14

(1) Residual Analysis for Linearity [Paired plots of Y vs. x and residuals vs. x: a curved pattern in the residuals indicates Not Linear; a patternless band indicates Linear.] Slide from: Statistics for Managers Using Microsoft Excel, 4th Edition, 2004 Prentice-Hall

(2) Residual Analysis for Homoscedasticity [Paired plots of Y vs. x and residuals vs. x: a fanning spread of residuals indicates non-constant variance; an even band indicates constant variance.] Slide from: Statistics for Managers Using Microsoft Excel, 4th Edition, 2004 Prentice-Hall

(3) Residual Analysis for Independence [Plots of residuals vs. X: a systematic pattern indicates Not Independent; a random scatter indicates Independent.] Slide from: Statistics for Managers Using Microsoft Excel, 4th Edition, 2004 Prentice-Hall

Residual plot

Confounder Confounding variables ("third variables") are variables that the researcher failed to control or eliminate, damaging the internal validity of an experiment.

Multiple linear regression What if age is a confounder here? Older men have lower vitamin D; older men have poorer cognition. Adjust for age by putting age in the model: DSST score = intercept + slope1 × vitamin D + slope2 × age
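Fitting such a two-predictor model amounts to solving the normal equations X'X b = X'y. A minimal self-contained sketch (made-up data; in practice one would use a statistics library such as statsmodels):

```python
# Sketch: OLS with two predictors, y = b0 + b1*x1 + b2*x2, via the normal
# equations. Toy data constructed so the exact coefficients are recoverable.
def solve3(a, b):
    """Gauss-Jordan elimination with partial pivoting for a 3x3 system."""
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]

def ols2(x1, x2, y):
    """Coefficients (b0, b1, b2) from the normal equations X'X b = X'y."""
    rows = [[1.0, a, b] for a, b in zip(x1, x2)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    return solve3(xtx, xty)

x1 = [1, 2, 3, 4, 5, 6]                       # e.g. vitamin D (toy units)
x2 = [2, 1, 4, 3, 6, 5]                       # e.g. age (toy units)
y = [1 + 2 * a - 1 * b for a, b in zip(x1, x2)]  # exact linear relationship
b0, b1, b2 = ols2(x1, x2, y)                  # recovers 1, 2, -1
```

Here b1 is the effect of x1 with x2 held constant, which is exactly the sense in which putting age in the model "adjusts" the vitamin D slope.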

2 predictors: age and vit D

Different 3D view

Fit a plane rather than a line On the plane, the slope for vitamin D is the same at every age; thus, the slope for vitamin D represents the effect of vitamin D when age is held constant.

Equation of the best-fit plane DSST score = 53 + 0.0039 × vitamin D (in 10 nmol/L) − 0.46 × age (in years). P-value for vitamin D >> .05; P-value for age < .0001. Thus, the relationship with vitamin D was due to confounding by age!

Multiple Linear Regression More than one predictor: E(y) = a + b1·X + b2·W + b3·Z. Each regression coefficient is the amount of change in the outcome variable that would be expected per one-unit change in that predictor, if all other variables in the model were held constant.

Functions of multivariate analysis:
- Control for confounders
- Test for interactions between predictors (effect modification)
- Improve predictions