Statistical Techniques II EXST7015 Simple Linear Regression



The objective: given points plotted on two coordinates, Y (the dependent variable, vertical axis) and X (the independent variable, horizontal axis), find the best line to fit the data. [Figure: scatterplot of Y against X.]

The concept. Data consist of paired observations with a presumed potential for some underlying relationship. We wish to determine the nature of, and quantify, the relationship if it exists. Note that we cannot prove that the relationship exists by using regression (e.g., we cannot prove cause and effect). Regression can only show whether a "correlation" exists, and provide an equation for the relationship.

The concept (continued). Given a data set consisting of paired, quantitative variables, and recognizing that there is variation in the data set, we will define the POPULATION MODEL (SLR): Yi = β0 + β1Xi + εi. This is the line we will fit.

The concept (continued). We must estimate the population equation for a straight line. The population parameters estimated are: µY.X = the true population mean of Y at each value of X; β0 = the true value of the Y intercept; β1 = the true value of the slope, the change in Y per unit of X. The population line is µY.X = β0 + β1Xi.

Terminology. Dependent variable: the variable to be predicted. Y = dependent variable (all variation occurs in Y). Independent variable: the predictor or regressor variable. X = independent variable (X is measured without error).

Terminology (continued). Intercept: the value of Y when X = 0, the point where the regression line passes through the Y axis. NOTE: units are "Y" units. Slope: the change in Y for each unit increase in X. NOTE: units are "Y" units per "X" unit.

Terminology (continued). Deviation: the distance from an observed point to the regression line, also called a residual. Least squares regression line: the line that minimizes the sum of squared distances from the line to the individual observations.

[Figure: regression line through the scatterplot of Y against X, with vertical deviations drawn from each point to the line.]

Regression calculations. All calculations for simple linear regression start with the same values: ΣXi, ΣXi², ΣYi, ΣYi², ΣXiYi, and n. The calculations are first adjusted for the mean; the results are called "corrected values". They are corrected for the MEAN by subtracting a "correction factor".

Reg calculations (continued). As a result, all simple linear regressions are adjusted for the means of X and Y, and the fitted line passes through the point (X̄, Ȳ). [Figure: regression line passing through the point (X̄, Ȳ).]

Reg calculations (continued). The original sums and sums of squares of Y are distances and squared distances from zero. [Figure: deviations of Y measured from zero.]

Reg calculations (continued). The corrected sums are zero (the negative and positive deviations cancel), and the corrected sums of squares of Y are squared distances from the mean of Y. [Figure: deviations of Y measured from Ȳ.]

Reg calculations (continued). Once the corrected sums of squares and crossproducts are obtained (SXX, SYY, SXY), the calculations are: Slope = b1 = SXY / SXX; Intercept = b0 = Ȳ - b1X̄. We have fitted the sample equation Yi = b0 + b1Xi + ei, with predicted values Ŷi = b0 + b1Xi.
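The slope and intercept calculations above can be sketched in a few lines of Python. The data values here are hypothetical, chosen only to make the arithmetic easy to follow:

```python
# Corrected sums of squares and crossproducts, then slope and intercept.
# The data set is hypothetical, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xx = sum(xi * xi for xi in x)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# Correct for the mean by subtracting the "correction factor"
S_xx = sum_xx - sum_x ** 2 / n
S_xy = sum_xy - sum_x * sum_y / n

b1 = S_xy / S_xx                    # slope = Sxy / Sxx
b0 = sum_y / n - b1 * (sum_x / n)   # intercept = ybar - b1 * xbar

print(round(b1, 6), round(b0, 6))   # 0.6 2.2
```

For these data SXX = 10 and SXY = 6, so the fitted line is Ŷi = 2.2 + 0.6Xi.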

Variance Estimates. Once the regression line is fitted, variance calculations are based on the deviations from the regression. [Figure: vertical deviations from the fitted line.]

Variance Estimates (continued). From the regression model Yi = b0 + b1Xi + ei we derive the formula for the deviations: ei = Yi - (b0 + b1Xi) = Yi - Ŷi.

Variance Estimates (continued). As with other calculations of variance, we calculate a sum of squares (corrected for the mean). This is simplified by the fact that the deviations, or residuals, already have a mean of zero. SSResiduals = Σei² (summed over i = 1 to n) = SSError.

Variance Estimates (continued). The degrees of freedom (d.f.) for the variance calculation are n-2, since two parameters (b0 and b1) are estimated prior to the variance. The variance estimate is called the MSE (mean square error); it is the SSError divided by the d.f.: MSE = SSError / (n-2).
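A short sketch of the residual and MSE computations, continuing the same hypothetical data set with the fitted values b0 = 2.2 and b1 = 0.6:

```python
# Residuals, SSError, and MSE for the hypothetical fit b0 = 2.2, b1 = 0.6.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6
n = len(x)

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]  # ei = Yi - Yhat_i
sse = sum(e * e for e in residuals)  # SSError
mse = sse / (n - 2)                  # two parameters estimated, so n-2 d.f.

print(round(sse, 6), round(mse, 6))  # 2.4 0.8
```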

Variance Estimates (continued). The variances for the two parameter estimates and the predicted values are all different, but all are based on the MSE, and all have n-2 d.f. (t tests) or n-2 denominator d.f. (F tests). Variance of the slope = MSE/SXX. Variance of the intercept = MSE[(1/n) + X̄²/SXX]. Variance of a predicted value at Xi = MSE[(1/n) + (Xi - X̄)²/SXX].

Variance Estimates (continued). Any of these variances can be used for a t test of an estimate against a hypothesized value for a parameter. Another common expression of regression results is an ANOVA table. Given the SSError (the sum of squared deviations from the regression) and the initial total sum of squares SYY (the sum of squares of Y adjusted for the mean), we can construct an ANOVA table.

ANOVA table for simple linear regression:

Source       d.f.   Sum of Squares   Mean Square   F
Regression   1      SSRegression     MSReg         MSReg/MSError
Error        n-2    SSError          MSError
Total        n-1    SYY = SSTotal

ANOVA table (continued). In the ANOVA table, the SSRegression and SSError sum to the SSTotal, so given the total (SYY) and one of the two terms, we can get the other. The easiest to calculate is usually the SSRegression, since we usually already have the necessary intermediate values: SSRegression = (SXY)² / SXX.
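The ANOVA decomposition can be sketched from the corrected sums alone; the values of SXX, SXY, and SYY below are the hypothetical ones from the running example:

```python
# Build the ANOVA quantities from corrected sums (hypothetical values).
S_xx, S_xy, S_yy = 10.0, 6.0, 6.0
n = 5

ss_reg = S_xy ** 2 / S_xx     # SSRegression, 1 d.f.
ss_error = S_yy - ss_reg      # SSError = SSTotal - SSRegression, n-2 d.f.
ms_reg = ss_reg / 1
mse = ss_error / (n - 2)
F = ms_reg / mse

print(round(ss_reg, 6), round(ss_error, 6), round(F, 6))  # 3.6 2.4 4.5
```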

ANOVA table (continued). The SSRegression is a measure of the "improvement" in the fit due to the regression line. The deviations start at SYY and are reduced to SSError; the difference is the improvement, and it equals the SSRegression. This gives another statistic, R²: the portion of the total SS (SYY) accounted for by the regression. R² = SSRegression / SSTotal.

ANOVA table (continued). The degrees of freedom are: n-1 for the total, one lost for the correction for the mean (which also fits the intercept); n-2 for the error, since two parameters are estimated to get the regression line; and 1 d.f. for the regression, which is the d.f. for the slope.

ANOVA table (continued). The F test is constructed as MSRegression / MSError, with 1 and (n-2) d.f. This is EXACTLY the same test as the t test of the slope against zero. To test the slope against a hypothesized value (say zero) with n-2 d.f., calculate t = (b1 - b1,hypothesized) / sb1, where sb1 = √(MSE / SXX).
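The claimed equivalence (the F test with 1 and n-2 d.f. equals the square of the slope's t statistic) is easy to check numerically; the inputs are the hypothetical values from the running example:

```python
import math

# t test of the slope against zero versus the ANOVA F test (hypothetical values).
b1, mse, S_xx = 0.6, 0.8, 10.0
ms_reg = 3.6  # SSRegression / 1 d.f.

s_b1 = math.sqrt(mse / S_xx)   # standard error of the slope
t = (b1 - 0.0) / s_b1          # test against a hypothesized slope of zero
F = ms_reg / mse

print(round(t ** 2, 6), round(F, 6))  # t squared equals F: 4.5 4.5
```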

Assumptions for the Regression. We will recognize 4 assumptions. 1) Normality: we take the deviations from regression and pool them all together into one estimate of variance. Some of the tests we use require the assumption of normality, so these deviations should be normally distributed.

Assumptions for the Regression (continued). For each value of X there is a population of values for the variable Y (normally distributed). [Figure: normal distributions of Y at each value of X along the regression line.]

Assumptions for the Regression (continued). 2) Homogeneity of variance: when we pool these deviations (variances), we also assume that the variances are the same at each value of Xi. In some cases this is not true, particularly when the variance increases as X increases. 3) X is measured without error! Since deviations are measured vertically only, all variance is in Y; no provision is made for variance in X.

Assumptions for the Regression (continued). 4) Independence. This enters in several places. First, the observations should be independent of each other (i.e., the value of ei should be independent of ej). Also, in the equation for the line, Yi = b0 + b1Xi + ei, we assume that the term ei is independent of the rest of the model. We will talk more of this when we get to multiple regression.

Assumptions for the Regression (continued). So the four assumptions are: normality, homogeneity of variance, independence, and X measured without error. These are explicit assumptions, and we will often test them.

Assumptions for the Regression (continued). There are some other assumptions that I consider implicit. We will not state these, but in some cases they can be tested. For example: there is order in the universe; the underlying fundamental relationship that I just fitted a straight line to really is a straight line. This one can be examined statistically.

Characteristics of a Regression Line. The line will pass through the point (X̄, Ȳ) (and also the point (0, b0)). The sum of the deviations will be zero (Σei = 0). The sum of squared deviations, measured vertically, Σei² = Σ(Yi - (b0 + b1Xi))², will be a minimum. Values on the line can be described by the equation Ŷi = b0 + b1Xi.
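These characteristics are easy to verify numerically on the hypothetical fit from earlier (b0 = 2.2, b1 = 0.6):

```python
# Verify: the line passes through (xbar, ybar), and the residuals sum to zero.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

on_line_at_mean = b0 + b1 * x_bar  # the fitted value at xbar equals ybar
residual_sum = sum(yi - (b0 + b1 * xi) for xi, yi in zip(x, y))

print(round(on_line_at_mean, 6), y_bar)   # 4.0 4.0
print(abs(residual_sum) < 1e-9)           # residuals sum to (numerically) zero
```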

Characteristics of a Regression (continued). The line has some desirable properties (if the assumptions are met): E(b0) = β0, E(b1) = β1, E(Ŷ|X) = µY.X. Therefore, the parameter estimates and predicted values are unbiased estimates. Note that linear regression is considered statistically robust; that is, the analysis tends to give good results as long as the assumptions are not violated to a great extent.

About crossproducts and correlation. Crossproducts are used in a number of related calculations: a crossproduct = XiYi; sum of crossproducts = ΣXiYi; covariance = sXY = SXY / (n-1); slope = b1 = SXY / SXX; SSRegression = SXY² / SXX; correlation = r = SXY / √(SXX SYY); R² = r² = SXY² / (SXX SYY) = SSRegression / SSTotal.
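The algebraic relationships in this list can be checked in a few lines; the corrected sums are the hypothetical values used throughout:

```python
import math

# Covariance, correlation, and the two routes to R-squared (hypothetical sums).
S_xy, S_xx, S_yy, n = 6.0, 10.0, 6.0, 5

cov_xy = S_xy / (n - 1)               # sample covariance
r = S_xy / math.sqrt(S_xx * S_yy)     # correlation coefficient
ss_reg = S_xy ** 2 / S_xx             # SSRegression
r2_from_r = r ** 2                    # R-squared as the squared correlation
r2_from_ss = ss_reg / S_yy            # R-squared as SSRegression / SSTotal

print(round(r, 6), round(r2_from_r, 6), round(r2_from_ss, 6))  # 0.774597 0.6 0.6
```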

Summary. See the simple linear regression notes from EXST7005 for additional information, including the derivation of the equations for the slope and intercept. You are not responsible for these derivations. Know the terminology, characteristics, and properties of a regression line, the assumptions, and the components of the ANOVA table.

Summary (continued). You will not be fitting regressions by hand, but I will expect you to understand where the values on SAS output come from and what they mean. Particular emphasis will be placed on working with, and interpreting, numerical regression analyses.