Simple linear regression


Simple linear regression
Prof. Giuseppe Verlato, Unit of Epidemiology & Medical Statistics, Dept. of Diagnostics & Public Health, University of Verona

Statistics with two variables:
- two nominal variables (e.g. blood group vs gastric cancer): chi-square test
- a nominal variable and a quantitative variable (e.g. sex vs systolic arterial pressure): Student's t-test, ANOVA
- two quantitative variables (e.g. weight vs glycaemia): correlation, regression

There is a PERFECT correlation between variable X (electric potential difference, V) and variable Y (current, I): Ohm's first law, where conductance = ΔI/ΔV. There are no perfect relations in medicine, however: a variable Y is affected not only by a variable X but also by several other variables, mostly unknown (the so-called biological variability).

Measures of variability

Univariate statistics, sum of squares:
  heuristic formula Σ(x − x̄)², computational formula Σx² − (Σx)²/n; always ≥ 0.
Bivariate statistics, sum of products:
  heuristic formula Σ(x − x̄)(y − ȳ), computational formula Σxy − (Σx)(Σy)/n; can be < 0, = 0 or > 0.

The sign of each term (x − x̄)(y − ȳ) depends on where the point falls in the scatter plot: (+)·(+) = + in the upper-right quadrant, (−)·(−) = + in the lower-left quadrant, and (−)·(+) or (+)·(−) = − in the other two quadrants.

Univariate statistics: variance = SSq / (n − 1). Bivariate statistics: covariance = sum of products / (n − 1).

COV[X, Y] = Σ(xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

In a random sample of n paired observations (xᵢ, yᵢ), points in quadrant I (above ȳ, right of x̄) and quadrant III (below ȳ, left of x̄) contribute positive products, while points in quadrants II and IV contribute negative products. Hence:
- COV[X, Y] ≈ 0: independence
- COV[X, Y] > 0: positive correlation
- COV[X, Y] < 0: negative correlation

Example (n = 7 subjects):

height x (cm)  weight y (kg)    xy     x − x̄   y − ȳ   (x − x̄)(y − ȳ)
172            63             10836    −1.3    −0.1      0.2
178            73             12994     4.7     9.9     46.5
175            67             11725     1.7     3.9      6.6
175            55              9625     1.7    −8.1    −14.0
176            66             11616     2.7     2.9      7.8
169            63             10647    −4.3    −0.1      0.6
168            55              9240    −5.3    −8.1     43.0
Total: Σx = 1213, Σy = 442, Σxy = 76683, Σ(x − x̄)(y − ȳ) = 90.7
Means: x̄ = 173.3, ȳ = 63.1

Sum of products = Σ(x − x̄)(y − ȳ) = 90.7; equivalently, Σxy − (Σx)(Σy)/n = 76683 − 1213·442/7 = 76683 − 76592.3 = 90.7.
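The sum of products above can be checked in a few lines of Python; this is a sketch using the seven height/weight pairs from the table, computing the heuristic and the computational formula side by side.

```python
# Sum of products for the 7 height/weight pairs in the text, computed with
# both the heuristic and the computational formula.
heights = [172, 178, 175, 175, 176, 169, 168]   # x, cm
weights = [63, 73, 67, 55, 66, 63, 55]          # y, kg

n = len(heights)
mx = sum(heights) / n                           # x-bar = 173.3
my = sum(weights) / n                           # y-bar = 63.1

# heuristic formula: sum of (x - x-bar)(y - y-bar)
sp_heuristic = sum((x - mx) * (y - my) for x, y in zip(heights, weights))

# computational formula: sum(xy) - sum(x)*sum(y)/n
sp_computational = (sum(x * y for x, y in zip(heights, weights))
                    - sum(heights) * sum(weights) / n)

cov = sp_heuristic / (n - 1)                    # sample covariance

print(round(sp_heuristic, 1), round(sp_computational, 1))  # 90.7 90.7
```

Both formulas give the same number, as the algebra guarantees; the computational form is just easier by hand.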

Regression describes an asymmetric relation: a random variable (Y) depends on a fixed variable (X). Correlation describes a symmetric relation: the two variables are on the same footing. In correlation, both X and Y are RANDOM variables.

In the regression model, only Y is a RANDOM variable.

Model of simple linear regression (1)

y = β0 + β1·x + ε

- y: response variable (dependent)
- β0 + β1·x: systematic part of the model; β0 is the intercept and β1 the linear regression coefficient. They are the unknown parameters of the model, estimated from the available data.
- x: explanatory variable (predictive, independent)
- ε: error term

Model of simple linear regression (2)

y = β0 + β1·x + ε describes a line in the (X, Y) plane.

Model of simple linear regression (3)

In y = β0 + β1·x + ε, the term β0 + β1·x is the linear predictor, the deterministic part of the model, without random variability; ε is the error term, the probabilistic part. The error terms, and hence the response variable, are NORMALLY distributed.

Model of simple linear regression (4)

Weight (Y) depends on height (X1):

E(y) = β0 + β1·x1

E(y) is the expected value (mean) of weight among individuals with that particular height.

y = β0 + β1·x1 + ε

y is the weight of a given subject, which depends on height (the systematic part of the model) but also on other individual characteristics, known and unknown (ε, the probabilistic part).

Model of simple linear regression (5)

Unknown real model: y = β0 + β1·x + ε
Estimated linear regression: ŷ = b0 + b1·x, i.e. y = b0 + b1·x + e

DECOMPOSING the total sum of squares in simple linear regression (1)

For each observation, the deviation from the mean splits into two parts:

(y − ȳ) = (ŷ − ȳ) + (y − ŷ)

DECOMPOSING the total sum of squares (SSq) in simple linear regression (2)

Total variability = variability explained by the regression + residual variability. It can be demonstrated that:

Σ(y − ȳ)² = Σ(ŷ − ȳ)² + Σ(y − ŷ)²

i.e. total SSq (SST) = regression-explained SSq (SSR) + residual (error, unexplained) SSq (SSE).
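The decomposition SST = SSR + SSE can be verified numerically; this sketch fits the least-squares line to the height/weight data used earlier in the text and computes all three sums of squares directly.

```python
# Numerical check of SST = SSR + SSE for a least-squares fit
# (height/weight data from the text).
x = [172, 178, 175, 175, 176, 169, 168]
y = [63, 73, 67, 55, 66, 63, 55]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

sp  = sum((a - mx) * (b - my) for a, b in zip(x, y))   # sum of products
ssx = sum((a - mx) ** 2 for a in x)                    # SSq of x

b1 = sp / ssx                     # least-squares slope
b0 = my - b1 * mx                 # least-squares intercept
yhat = [b0 + b1 * a for a in x]   # fitted values

sst = sum((b - my) ** 2 for b in y)                # total SSq
ssr = sum((h - my) ** 2 for h in yhat)             # explained SSq
sse = sum((b - h) ** 2 for b, h in zip(y, yhat))   # residual SSq

print(round(sst, 3), round(ssr + sse, 3))  # the two numbers coincide
```

The identity holds only for the least-squares line; for any other line the cross-term does not vanish.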

Correlation

The correlation coefficient (r) is a dimensionless number ranging from −1 to +1:
- r = −1: the points are perfectly aligned along a decreasing line
- r = 0: the points are randomly scattered, with no increasing or decreasing trend
- r = +1: the points are perfectly aligned along an increasing line

r = (sum of products of X and Y) / √(SSq_x · SSq_y)

r² = (SSq explained by the regression, SSR) / (total SSq, SST) = Σ(ŷ − ȳ)² / Σ(y − ȳ)²

[Figure: scatter plots with s_x > s_y, s_x = s_y and s_x < s_y, illustrating r = 1, 0 < r < 1 and r = 0.]
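As a quick check, r computed from the sum-of-products formula should satisfy r² = SSR/SST exactly. This sketch verifies it on the height/weight data from the text.

```python
# r from the sum-of-products formula, and a check that r**2 = SSR/SST
# (height/weight data from the text).
import math

x = [172, 178, 175, 175, 176, 169, 168]
y = [63, 73, 67, 55, 66, 63, 55]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

sp  = sum((a - mx) * (b - my) for a, b in zip(x, y))
ssx = sum((a - mx) ** 2 for a in x)
ssy = sum((b - my) ** 2 for b in y)

r = sp / math.sqrt(ssx * ssy)        # correlation coefficient

# r**2 must equal the share of variability explained by the regression
b1 = sp / ssx
b0 = my - b1 * mx
yhat = [b0 + b1 * a for a in x]
ssr = sum((h - my) ** 2 for h in yhat)

print(round(r, 3), round(r * r, 3), round(ssr / ssy, 3))
```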

[Figure: scatter plots with s_x > s_y, s_x = s_y and s_x < s_y, illustrating r = 0, −1 < r < 0 and r = −1.]

Most correlations between biological variables are rather weak: the coefficient of determination (r²) ranges between 0 and 0.5. Going from r² to r by extracting the square root amplifies the scale in this range of weak correlation (e.g. r² = 0.25 corresponds to r = 0.5).

Simple linear regression

One should identify the line that best fits the scatter of points, i.e. that runs roughly through the middle of them.

LEAST SQUARES METHOD: the line that minimizes the residual (error) sum of squares, SSE = Σ(y − ŷ)², is selected.

b1 = linear regression coefficient (slope):

b1 = (sum of products of X and Y) / SSq_x

b1 ranges from −∞ to +∞. Its measurement unit is the ratio of the Y unit to the X unit; hence the absolute value of b1 depends on the measurement units adopted.

b0 = intercept:

b0 = ȳ − b1·x̄
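A small sketch can make the "minimizes SSE" claim concrete: compute b1 and b0 with the formulas above (height/weight data from the text), then check that any slightly perturbed line has a larger residual sum of squares.

```python
# The least-squares line has a smaller SSE than any perturbed line
# (height/weight data from the text).
x = [172, 178, 175, 175, 176, 169, 168]
y = [63, 73, 67, 55, 66, 63, 55]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

sp  = sum((a - mx) * (b - my) for a, b in zip(x, y))
ssx = sum((a - mx) ** 2 for a in x)
b1 = sp / ssx          # slope: sum of products / SSq_x
b0 = my - b1 * mx      # intercept: y-bar - b1 * x-bar

def sse(intercept, slope):
    """Residual sum of squares for a candidate line."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

best = sse(b0, b1)
# any perturbation of the fitted line increases SSE
for db0, db1 in [(0.5, 0), (-0.5, 0), (0, 0.05), (0, -0.05)]:
    assert sse(b0 + db0, b1 + db1) > best

print(round(b1, 3), round(b0, 1), round(best, 2))
```

The perturbation check works because least squares has a unique global minimum in (b0, b1).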

When the regression line fits the scatter of points well, the explained SSq exceeds the residual SSq; when it fits poorly, the explained SSq is smaller than the residual SSq. [Figure: scatter plots contrasting a slope b close to zero with a large b.]

Correlation

Correlation coefficient r = 0.393, so r² = 0.155.

r² = (SSq explained by the regression, SSR) / (total SSq, SST) = Σ(ŷ − ȳ)² / Σ(y − ȳ)²

15.5% of the total variability (SSq) in admission test scores is explained by variability in the high-school final mark.

Simple linear regression: y = β0 + β1·x + ε

ASSUMPTIONS
1) The expected value of the errors, E(ε), must be ZERO.
2) HOMOSCEDASTICITY: the variance of the errors remains constant.
3) INDEPENDENCE of the errors: for instance, if test tubes ("provette" in Italian) are not adequately washed between subsequent tests, each test is affected by the previous one.
4) The errors are NORMALLY distributed.

Under assumptions 1 and 4, the distribution of the response variable (Y) predicted by the model is normal and centred on β0 + β1·x: for example, errors with mean 0 kg and σ = 3 kg around a predicted weight of 30 kg give weights normally distributed around 30 kg with σ = 3 kg. [Figure: probability density of the errors and of the weight (kg).]
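Assumption 1 has a sample counterpart worth knowing: for a least-squares fit with an intercept, the residuals e = y − ŷ always sum to (numerically) zero, mirroring E(ε) = 0. A sketch on the height/weight data from the text:

```python
# For a least-squares fit with intercept, the residuals sum to zero
# (height/weight data from the text; a sketch, not a full diagnostic).
x = [172, 178, 175, 175, 176, 169, 168]
y = [63, 73, 67, 55, 66, 63, 55]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

sp  = sum((a - mx) * (b - my) for a, b in zip(x, y))
ssx = sum((a - mx) ** 2 for a in x)
b1 = sp / ssx
b0 = my - b1 * mx

residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]
print(abs(sum(residuals)) < 1e-8)  # True: residuals cancel out
```

This is a property of the fit, not a test of the model: the residuals sum to zero by construction, whatever the data.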

[Figure: an approximately normal distribution.]

EXAMPLE: Is a son's height associated with his father's height? In other words, do shorter fathers have shorter sons, and taller fathers taller sons?

Father   Son
167 cm   168 cm
175 cm   180 cm
183 cm   178 cm
170 cm   178 cm
181 cm   184 cm
172 cm   170 cm
177 cm   177 cm
179 cm   180 cm

STEP 1: graphic representation (scatter plot) of the observed values.

STEP 2: a statistical model is assumed, potentially useful to interpret the data. The following linear model is assumed:

y = β0 + β1·x + ε
(son's height) = β0 + β1·(father's height) + ε

(Sons sharing the same father have similar, but not exactly equal, heights, even when they also have the same mother.)

STEP 3: univariate and bivariate descriptive statistics (n = 8, Σxy = 248483)

        Σx      Σx²      mean
Father  1404    246618   175.5
Son     1415    250477   176.875

Sum of products = Σxy − (Σx)(Σy)/n = 248483 − 1404·1415/8 = 150.5

        SSq      Variance  Standard dev.
Father  216      30.86     5.55
Son     198.875  28.41     5.33

STEP 4: estimation of the model parameters by the method of least squares

b1 = (sum of products) / SSq_x = 150.5 / 216 = 0.697 cm/cm
b0 = ȳ − b1·x̄ = 176.875 − 0.697·175.5 = 54.59 cm

Regression line: son's height (cm) = 54.6 cm + 0.697 cm/cm · father's height (cm)

When the father's height increases by 1 cm, the son's height increases on average by 7 mm.
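Steps 3 and 4 can be reproduced in code; this sketch recomputes the descriptive sums and the least-squares estimates for the father/son data above.

```python
# Steps 3-4 in code: descriptive sums and least-squares estimates for the
# father/son data (reproducing the hand calculation in the text).
father = [167, 175, 183, 170, 181, 172, 177, 179]
son    = [168, 180, 178, 178, 184, 170, 177, 180]
n = len(father)

sx, sy = sum(father), sum(son)                      # 1404, 1415
sxy = sum(f * s for f, s in zip(father, son))       # 248483
mx, my = sx / n, sy / n                             # 175.5, 176.875

sp  = sxy - sx * sy / n                             # sum of products = 150.5
ssx = sum(f * f for f in father) - sx ** 2 / n      # SSq of fathers = 216

b1 = sp / ssx                                       # slope, cm/cm
b0 = my - b1 * mx                                   # intercept, cm

print(round(sp, 1), round(ssx, 1), round(b1, 3), round(b0, 2))
# prints: 150.5 216.0 0.697 54.59
```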

STEP 5: computing the correlation coefficient

r = (sum of products) / √(SSq_x · SSq_y) = 150.5 / √(216 · 198.9) = 0.726
r² = 0.726² = 0.527

52.7% of the total variability in sons' heights is explained by variability in fathers' heights.

STEP 6: inference on the model parameters. Is the theoretical model supported by the observed data?

Student's t-test based on b1 (the estimate of β1):
H0: β1 = 0; H1: β1 ≠ 0 (two-tailed test)
Significance level = 5%; degrees of freedom = n − 2 = 8 − 2 = 6; critical threshold = t(6, 0.025) = 2.447

SSq_res = SSq_y − (sum of products)²/SSq_x = 198.9 − 150.5²/216 = 94.01
variance_res = SSq_res / (n − 2) = 94.01 / 6 = 15.67

t = (b1 − 0) / SE_b1 = b1 / √(variance_res / SSq_x) = 0.697 / √(15.67 / 216) = 2.588
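The t statistic for b1 is easy to recompute; this sketch follows the formulas of Step 6 on the father/son data.

```python
# t statistic for b1 (Step 6), recomputed from the father/son data.
import math

father = [167, 175, 183, 170, 181, 172, 177, 179]
son    = [168, 180, 178, 178, 184, 170, 177, 180]
n = len(father)
sx, sy = sum(father), sum(son)

sp  = sum(f * s for f, s in zip(father, son)) - sx * sy / n  # 150.5
ssx = sum(f * f for f in father) - sx ** 2 / n               # 216
ssy = sum(s * s for s in son) - sy ** 2 / n                  # 198.875

b1 = sp / ssx
ssq_res = ssy - sp ** 2 / ssx          # residual SSq, about 94.01
var_res = ssq_res / (n - 2)            # residual variance, about 15.67
se_b1 = math.sqrt(var_res / ssx)       # standard error of b1
t = b1 / se_b1

print(round(t, 3))  # about 2.59, above the critical t(6, 0.025) = 2.447
```

The small discrepancy with the text's 2.588 is only due to rounding intermediate values by hand.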

STEP 6 (alternative): Student's t-test based on r (the estimate of ρ)

H0: ρ = 0; H1: ρ ≠ 0 (two-tailed test)
Significance level = 5%; degrees of freedom = n − 2 = 8 − 2 = 6; critical threshold = t(6, 0.025) = 2.447

t = (r − 0) / SE_r = r / √((1 − r²) / (n − 2)) = 0.726 / √((1 − 0.527) / 6) = 2.587

Since the observed t (2.588) exceeds the critical t value (2.447), H0 is rejected: the observed relation between sons' heights and fathers' heights is not due to chance but represents a real phenomenon (P = 0.041).
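The P-value quoted in the text can be approximated without statistical libraries; this sketch computes t from r and then integrates the Student t density with 6 degrees of freedom numerically (Simpson's rule), since the Python standard library has no t distribution.

```python
# t from r, plus an approximate two-tailed P-value for 6 df obtained by
# numerically integrating the t density (a sketch; no external libraries).
import math

r, n = 0.726, 8          # values from the text
df = n - 2
t = r / math.sqrt((1 - r ** 2) / df)   # about 2.587

def t_pdf(x, v):
    """Density of Student's t with v degrees of freedom."""
    c = math.gamma((v + 1) / 2) / (math.sqrt(v * math.pi) * math.gamma(v / 2))
    return c * (1 + x * x / v) ** (-(v + 1) / 2)

def simpson(f, a, b, steps=2000):
    """Composite Simpson's rule on [a, b]."""
    h = (b - a) / steps
    s = f(a) + f(b)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * f(a + i * h)
    return s * h / 3

# two-tailed P = 2 * (0.5 - integral of the density from 0 to t)
p = 2 * (0.5 - simpson(lambda x: t_pdf(x, df), 0.0, t))
print(round(t, 3), round(p, 3))  # P close to 0.041, as in the text
```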

STEP 7: prediction of Y values

What is the expected value of Y for x = 185 cm?

Regression line: ŷ = 54.6 cm + 0.697 cm/cm · 185 cm = 183.49 cm

SE_ŷ = √(variance_res · [1/n + (x − x̄)² / SSq_x]) = √(15.67 · [1/8 + (185 − 175.5)² / 216]) = 2.916

95% CI = ŷ ± t(n−2, α/2) · SE_ŷ = 183.49 ± 2.447 · 2.916 = [176.36, 190.63]

Regression line and its 95% confidence interval: uncertainty in the estimate of the regression line increases as X moves away from the mean of X (175.5 cm).
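Step 7 in code: this sketch computes the point prediction at x = 185 cm and its 95% confidence interval from the father/son data, following the formulas above (the critical t is taken from tables, as in the text).

```python
# Step 7: point prediction at x0 = 185 cm and its 95% confidence interval
# (father/son data from the text).
import math

father = [167, 175, 183, 170, 181, 172, 177, 179]
son    = [168, 180, 178, 178, 184, 170, 177, 180]
n = len(father)
sx, sy = sum(father), sum(son)
mx, my = sx / n, sy / n

sp  = sum(f * s for f, s in zip(father, son)) - sx * sy / n
ssx = sum(f * f for f in father) - sx ** 2 / n
ssy = sum(s * s for s in son) - sy ** 2 / n

b1 = sp / ssx
b0 = my - b1 * mx
var_res = (ssy - sp ** 2 / ssx) / (n - 2)     # residual variance

x0 = 185.0
yhat = b0 + b1 * x0                                        # predicted height
se = math.sqrt(var_res * (1 / n + (x0 - mx) ** 2 / ssx))   # SE of the mean
t_crit = 2.447                                # t(6, 0.025), from tables
lo, hi = yhat - t_crit * se, yhat + t_crit * se

print(round(yhat, 2), round(se, 3), round(lo, 2), round(hi, 2))
# prints: 183.49 2.916 176.36 190.63
```

Note the (x0 − x̄)² term in the SE: it is what makes the interval widen as x0 moves away from the mean of X.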

Regression line, 95% confidence interval of the regression line, and prediction interval containing 95% of single observations: uncertainty is greater when predicting single observations than when estimating expected values (the regression line).