SIMPLE REGRESSION ANALYSIS Business Statistics
CONTENTS
- Ordinary least squares (recap for some)
- Statistical formulation of the regression model
- Assessing the regression model
- Testing the regression coefficients
- The ANOVA table
- Old exam question
- Further study
ORDINARY LEAST SQUARES
- Idea of curve fitting in a scatterplot
- linear fit: y = a + bx (x = floor area of house, y = price of house)
ORDINARY LEAST SQUARES
- You find the best line by minimizing the misfit (e_i) between the observed value (y_i) and the modelled/estimated value (ŷ_i = a + b x_i): e_i = y_i − ŷ_i
- in fact by minimizing the sum of squared misfits Σ_{i=1}^n e_i^2: OLS regression
- The hat (^) is our symbol for the estimate
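The minimization above has a closed-form solution: b_1 = S_xy / S_xx and a = ȳ − b x̄. A minimal sketch in Python, using made-up data (not the lecture's house prices):

```python
# Closed-form OLS for a line y = b0 + b1*x, minimizing sum of squared misfits.
# Toy data, chosen to lie roughly on y = 2x; not the lecture's house-price data.
def ols_fit(x, y):
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))  # S_xy
    sxx = sum((xi - mx) ** 2 for xi in x)                     # S_xx
    b1 = sxy / sxx        # slope
    b0 = my - b1 * mx     # intercept
    return b0, b1

x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.1, 8.0]
b0, b1 = ols_fit(x, y)
print(round(b0, 3), round(b1, 3))
```

Any other line through the same data gives a larger sum of squared residuals than this (b0, b1) pair.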
STATISTICAL FORMULATION OF THE REGRESSION MODEL
- Rephrasing the model y = a + bx as a statistical model
- Assumptions and notation
  - we assume a linear relation of the form of the population regression model Y_i = β_0 + β_1 X_i + ε_i, or Y = β_0 + β_1 X + ε
  - where β_0 is the intercept or constant, and β_1 the slope or slope coefficient
  - we prefer to use β_0 instead of a for the constant, and β_1 instead of b for the slope
  - the random variable ε_i is the error or residual, the unexplained part
STATISTICAL FORMULATION OF THE REGRESSION MODEL
- Estimation of the model coefficients
  - we assume that ε_i ~ N(0, σ^2)
  - based on a sample of n paired data points (x_i, y_i), i = 1, …, n
  - use OLS to estimate the best line through the estimated regression model Ŷ = b_0 + b_1 X, or ŷ_i = b_0 + b_1 x_i
  - the estimated coefficients (b_0 for β_0 and b_1 for β_1) and the estimated error (e_i for ε_i) correspond to y_i = b_0 + b_1 x_i + e_i
STATISTICAL FORMULATION OF THE REGRESSION MODEL
[Figure: scatterplot with fitted line Ŷ = b_0 + b_1 X, showing a data point (x_i, y_i), its fitted value (x_i, ŷ_i), the residual e_i, the intercept b_0 and the slope b_1]
STATISTICAL FORMULATION OF THE REGRESSION MODEL
So:
- b_0 is the estimated value of β_0: the intercept or constant of the regression line
- b_1 is the estimated value of β_1: the slope or slope coefficient of the regression line
- e_i is the estimated residual or error for observation i: the misfit
EXERCISE 1
Look back at the house prices, where we found the line ŷ = 264700 + 6152x
a. Give the theoretical model
b. Give the estimated model
ASSESSING THE REGRESSION MODEL
- OLS will always give an estimate for β_0 and β_1: the line of best fit
- But is "best" also good enough to make good predictions?
  - can we do a statistical test on the quality of the model?
- We have minimized the sum of squares (SS) of the error: SSE = Σ_{i=1}^n e_i^2 = Σ_{i=1}^n (y_i − ŷ_i)^2
- We would like to compare this with:
  - the total sum of squares SST
  - the explained sum of squares SSR ("R" stands for regression)
ASSESSING THE REGRESSION MODEL
- Total sum of squares: SST = Σ_{i=1}^n (y_i − ȳ)^2
- So SST is the total variation around the mean ȳ
ASSESSING THE REGRESSION MODEL
- Regression sum of squares: SSR = Σ_{i=1}^n (ŷ_i − ȳ)^2
- So SSR is the variation around the mean ȳ that is explained by the model
- So, the data has a total variability SST
  - the regression model explains a variability SSR
  - and the residual variability is SSE
  - and SST = SSR + SSE
- Coefficient of determination ("R-square"): R^2 = SSR/SST = 1 − SSE/SST
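The decomposition SST = SSR + SSE and the resulting R^2 can be checked numerically. A sketch with made-up data (a tiny n = 4 sample, with its OLS estimates hardcoded; not the lecture's house prices):

```python
# Decompose the variation of y into an explained part (SSR) and a residual
# part (SSE), and verify SST = SSR + SSE. Toy data; b0, b1 are the OLS
# estimates for exactly these four points.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.1, 8.0]
b0, b1 = 0.05, 1.99

yhat = [b0 + b1 * xi for xi in x]          # fitted values
ybar = sum(y) / len(y)                     # mean of y

sst = sum((yi - ybar) ** 2 for yi in y)                   # total variation
ssr = sum((yh - ybar) ** 2 for yh in yhat)                # explained
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))      # residual
r2 = ssr / sst                             # coefficient of determination

print(round(sst - (ssr + sse), 6), round(r2, 4))
```

The identity SST = SSR + SSE holds only for the OLS fit; for any other line the cross term does not vanish.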
ASSESSING THE REGRESSION MODEL
- R^2 is a measure of the usefulness of the model: R^2 = 1 − SSE/SST
- Properties
  - 0 ≤ R^2 ≤ 1
  - R^2 = 0 means the model doesn't explain anything
  - R^2 = 1 means the model explains everything
  - in between, the model explains R^2 × 100% of the variance of Y
ASSESSING THE REGRESSION MODEL
- If R^2 > 0, the regression model explains something
  - but in a random sample, R^2 may be non-zero due to chance
  - when is R^2 significantly different from 0?
- Finding a test statistic
  - look at the variances associated with SSR and SSE
  - so define the mean sums of squares (MS) (variances!): MST = SST/(n−1); MSR = SSR/1; MSE = SSE/(n−2)
  - use MSR/MSE = (SSR/1)/(SSE/(n−2)) as a ratio of two variances
ASSESSING THE REGRESSION MODEL
- Statistical test: H_0: the independent variable (X) does not explain the variation in the dependent variable (Y)
  - i.e., H_0: β_1 = 0 versus H_1: β_1 ≠ 0
- Sample statistic: F = MSR/MSE; reject for large values
- Under H_0: F ~ F_{1,n−2}; assumptions: see model
- Compare F_calc = MSR/MSE with F_crit = F_{1,n−2;α}
  - or compute the p-value as the probability of obtaining F_calc or more extreme if H_0 is true
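The F test above can be sketched numerically. The sums of squares below are illustrative values for a tiny made-up dataset (n = 4), and the critical value is taken from an F table rather than computed:

```python
# Model F test for simple regression: F_calc = MSR / MSE, compared with the
# critical value F_{1, n-2; alpha}. Illustrative values for a made-up n = 4
# sample; not the lecture's house-price data.
n = 4
ssr = 19.8005        # explained sum of squares (illustrative)
sse = 0.0270         # residual sum of squares (illustrative)

msr = ssr / 1        # 1 df: one explanatory variable
mse = sse / (n - 2)  # n - 2 df for the residuals
f_calc = msr / mse

f_crit = 18.51       # F_{1,2;0.05} from an F table (assumption, not computed)
print(round(f_calc, 1), f_calc > f_crit)
```

Since F_calc far exceeds F_crit, H_0: β_1 = 0 would be rejected for this toy sample.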
ASSESSING THE REGRESSION MODEL
Using SPSS, three types of output:
- Model summary: R^2
- Variance decomposition (the ANOVA table): SSR, SSE, SST; MSR, MSE; F_calc; p-value
- Regression coefficients: b_0 and b_1
ASSESSING THE REGRESSION MODEL
- The model is Y = β_0 + β_1 X + ε
- OLS extracts estimates from the data: b_0 and b_1
- But how accurate are these estimates?
- We can also find the distribution of B_0 and B_1
  - so, we can find confidence intervals and perform hypothesis tests
- B_0 and B_1 are t-distributed: (B_0 − β_0)/S_{B_0} ~ t_{n−2} and (B_1 − β_1)/S_{B_1} ~ t_{n−2}
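Because (B_1 − β_1)/S_{B_1} ~ t_{n−2}, a 95% confidence interval for the slope is b_1 ± t_{0.025;n−2} · s_{B_1}. A sketch with the slope estimates shown later in these slides (b_1 = 6151.670, s_{B_1} = 347.578, df = 69); the t value is taken from a t table, not computed:

```python
# 95% confidence interval for the regression slope.
# b1 and s_b1 are from the lecture's SPSS output; t_crit ~ 1.995 is the
# two-sided critical value t_{0.025;69} read from a t table (assumption).
b1 = 6151.670
s_b1 = 347.578
t_crit = 1.995

lower = b1 - t_crit * s_b1
upper = b1 + t_crit * s_b1
print(round(lower, 1), round(upper, 1))
```

The interval does not contain 0, which already suggests the slope test below will reject H_0: β_1 = 0.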
ASSESSING THE REGRESSION MODEL
Mind the notation, like before:
- mean: population value μ_X; sample estimate x̄; sampling distribution of random variable X̄
- regression slope: population value β_1; sample estimate b_1; sampling distribution of random variable B_1
When you're careless with this, it all gets mixed up in one big abracadabra trickery!
EXERCISE 2 a. Is the model significant? b. Has the model practical relevance?
TESTING THE REGRESSION COEFFICIENTS
- Testing β_0 is usually not interesting, but testing β_1 is!
  - in particular, the hypothesis β_1 = 0 is often interesting
  - i.e., the hypothesis that there is no relation between X and Y
  - or: that knowledge of X doesn't tell you anything about Y
- This test requires the standard deviation of B_1
  - it is calculated from the data; see computer output
  - here s_{B_1} = 347.578
TESTING THE REGRESSION COEFFICIENTS
- So: t_calc = (b_1 − β_1)/s_{B_1} = (6151.670 − 0)/347.578 = 17.699
- which has to be compared to t_crit = ±t_{0.025;69}
- reject H_0: β_1 = 0, because |t_calc| > t_crit
- or with the p-value: p = 0.000 ≤ 0.05
- and conclude that the slope differs significantly from zero
  - post-hoc conclusion: it is larger than zero
TESTING THE REGRESSION COEFFICIENTS
- Testing the regression model on the basis of MSR/MSE ~ F_{1,n−2}
- Testing the regression coefficient b_1 on the basis of (B_1 − 0)/S_{B_1} ~ t_{n−2}
- The two approaches are equivalent
  - they have the same null hypothesis: H_0: β_1 = 0
  - they lead to the same conclusion (rejection or no rejection)
  - they lead to the same p-value (indeed, F_calc = t_calc^2)
  - when we do multiple regression with several explanatory variables this is not the case! See later.
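The equivalence can be checked numerically: in simple regression, F_calc = MSR/MSE equals t_calc^2 for the slope test. A sketch with illustrative values for a tiny made-up dataset (n = 4; not the lecture's data), using s_{B_1} = sqrt(MSE/S_xx):

```python
import math

# Check the F = t^2 identity for simple regression.
# Illustrative values for a made-up n = 4 sample (assumption):
b1 = 1.99          # estimated slope
sxx = 5.0          # sum of (x_i - xbar)^2
mse = 0.0135       # SSE / (n - 2)
msr = 19.8005      # SSR / 1

s_b1 = math.sqrt(mse / sxx)   # standard error of the slope
t_calc = b1 / s_b1            # slope t statistic (test value 0)
f_calc = msr / mse            # model F statistic

print(round(t_calc ** 2, 4), round(f_calc, 4))
```

The two printed numbers agree (up to floating-point rounding), which is why the model F test and the slope t test give the same p-value in simple regression.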
TESTING THE REGRESSION COEFFICIENTS
We can also perform other tests than H_0: β_1 = 0
- Case 1: different test values for β_1
  - for example H_0: β_1 = 2
  - t_calc = (b_1 − 2)/s_{B_1}
  - not in SPSS, but easily calculated using s_{B_1}
- Case 2: one-sided tests
  - for example H_0: β_1 ≤ 0
  - t_calc as before, but now tested with a different t_crit
  - not in SPSS, but also easily calculated using the 2-sided p-value
- Case 3: combination of case 1 and case 2
  - for example H_0: β_1 ≤ 2
- Try all! (see tutorials)
TESTING THE REGRESSION COEFFICIENTS
Example of case 3: is there evidence that the price per square meter is larger than 5500?
- H_0: β_1 ≤ 5500; H_1: β_1 > 5500; α = 0.05
- t_calc = (6151.670 − 5500)/347.578 = 1.875 > t_crit ≈ 1.7
  - one-sided critical value, with α, not α/2
- reject H_0
- conclude that the price per m^2 is significantly larger than 5500
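The one-sided test above can be sketched directly from the slide's numbers; the one-sided critical value t_{0.05;69} ≈ 1.667 is taken from a t table, not computed:

```python
# One-sided test H0: beta1 <= 5500 vs H1: beta1 > 5500, alpha = 0.05,
# using the lecture's output b1 = 6151.670, s_b1 = 347.578, df = 69.
b1 = 6151.670
s_b1 = 347.578
beta1_0 = 5500.0       # test value under H0
t_crit = 1.667         # t_{0.05;69}, one-sided, from a t table (assumption)

t_calc = (b1 - beta1_0) / s_b1
print(round(t_calc, 3), t_calc > t_crit)
```

Note that the same t_calc tested against the two-sided critical value ±1.995 would not reject, which is why using α instead of α/2 matters here.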
TESTING THE REGRESSION COEFFICIENTS
- One may also test β_0 in exactly the same way
  - however, this is hardly ever useful
- The overall significance (F test) only depends on B_1/S_{B_1}, not on B_0/S_{B_0}
  - that is because the slope explains variation, while the intercept is only a vertical shift
THE ANOVA TABLE
- One of the regression results is the ANOVA table
- ANOVA = analysis of variance
[Screenshots: the ANOVA table as produced by Excel and by SPSS]
THE ANOVA TABLE
- What was ANOVA?
  - ANOVA: Y numerical; X categorical
  - regression: Y numerical; X numerical
- So ANOVA is really different from regression
  - why then an ANOVA table in regression?
- Because ANOVA and regression both decompose the total variance (SST) into
  - an explained part: SSA in ANOVA (factor "A"); SSR in regression ("Regression")
  - an unexplained part: SSW in ANOVA ("Within"); SSE in regression ("Error")
THE ANOVA TABLE
The ANOVA table for regression:
- MS = SS/df, for the Regression (R) and Error (E) rows
- F_calc = MSR/MSE and associated p-value
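The layout of the regression ANOVA table can be sketched from the sums of squares alone. The SS values below are illustrative, for a tiny made-up dataset (n = 4); the layout loosely mimics the SPSS/Excel output, not their exact formatting:

```python
# Assemble a regression ANOVA table from illustrative sums of squares
# (made-up n = 4 sample, not the lecture's data).
n = 4
ssr, sse = 19.8005, 0.0270
sst = ssr + sse

rows = [
    ("Regression", ssr, 1,     ssr / 1),        # df = 1 (one X variable)
    ("Error",      sse, n - 2, sse / (n - 2)),  # df = n - 2
    ("Total",      sst, n - 1, None),           # no MS for the Total row
]
f_calc = rows[0][3] / rows[1][3]                # MSR / MSE

print(f"{'Source':<12}{'SS':>10}{'df':>4}{'MS':>10}")
for name, ss, df, ms in rows:
    ms_txt = f"{ms:10.4f}" if ms is not None else " " * 10
    print(f"{name:<12}{ss:>10.4f}{df:>4}{ms_txt}")
print("F =", round(f_calc, 1))
```

The Total row has no MS entry, mirroring the software output: only MSR and MSE enter the F statistic.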
OLD EXAM QUESTION 21 May 2015, Q2c
FURTHER STUDY
- Doane & Seward 5/E, sections 12.1-12.6
- Tutorial exercises week 4: regression analysis