Regression Analysis II
Measures of Goodness of Fit
Two measures of goodness of fit:
Standard error of the estimate: a measure of the absolute fit of the sample points to the sample regression line.
Coefficient of determination: an index of the relative goodness of fit of the sample regression line.
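Standard error of the estimate
$$SEE = \sqrt{\frac{\sum e_i^2}{n - 2}}, \qquad \sum e_i^2 = \sum (y_i - \hat{y}_i)^2$$
where SEE denotes the standard error of estimate and $\hat{y}_i$ is the value on the sample regression line (written $y_c$ on the slides that follow). SEE is the standard deviation of the errors about the sample regression line.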
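A minimal sketch of the SEE calculation, assuming the fitted values are already available. The function name and the five data points are hypothetical and are not taken from the slides.

```python
import math

def standard_error_of_estimate(observed, fitted):
    """SEE = sqrt(sum of squared residuals / (n - 2)) for a simple regression."""
    if len(observed) != len(fitted):
        raise ValueError("observed and fitted must have the same length")
    n = len(observed)
    if n <= 2:
        raise ValueError("need more than two observations")
    sse = sum((y - y_c) ** 2 for y, y_c in zip(observed, fitted))
    return math.sqrt(sse / (n - 2))

# Hypothetical data: observed y, and fitted values from the line
# y_c = 2.2 + 0.6 x evaluated at x = 1, 2, 3, 4, 5.
y_obs = [2, 4, 5, 4, 5]
y_fit = [2.8, 3.4, 4.0, 4.6, 5.2]

see = standard_error_of_estimate(y_obs, y_fit)
print(round(see, 4))  # 0.8944
```

The divisor is n - 2 rather than n because two parameters (intercept and slope) are estimated from the sample.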
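[Figure: scatter of y against x with the sample regression line and the mean line $\bar{y} = 3.10$. For one sample point, the total deviation $y_i - \bar{y}$ is split into the unexplained deviation $y_i - y_c$ and the explained deviation $y_c - \bar{y}$.]
Total deviation = Unexplained deviation + Explained deviation
$$(y_i - \bar{y}) = (y_i - y_c) + (y_c - \bar{y})$$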
Total variation = Unexplained variation + Explained variation
$$\sum (y_i - \bar{y})^2 = \sum (y_i - y_c)^2 + \sum (y_c - \bar{y})^2$$
SST = SSE + SSR
Computational formula for the correlation coefficient r
$$r = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n \sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n \sum y_i^2 - \left(\sum y_i\right)^2}}$$
The Coefficient of Determination
The relative amount of the variation that has been explained by the sample regression line.
$$r^2 = \frac{SSR}{SST} = \frac{\text{variation explained}}{\text{total variation}} = \frac{\sum (y_c - \bar{y})^2}{\sum (y_i - \bar{y})^2}$$
$r^2$ is the proportion of total variation that is explained by the regression line.
If the regression line fitted all the sample points perfectly, all residuals would be zero. Then SSE = 0, SSR = SST and $r^2 = SSR / SST = 1.0$.
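The decomposition SST = SSE + SSR and the resulting $r^2$ can be checked numerically. The sketch below fits a least-squares line and computes the three sums of squares; the function names and the five data points are hypothetical and are not part of the slides.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for y_c = a + b x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

def decompose(x, y):
    """Return (SST, SSE, SSR) for the least-squares line through (x, y)."""
    a, b = fit_line(x, y)
    y_bar = sum(y) / len(y)
    y_c = [a + b * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
    sse = sum((yi - yci) ** 2 for yi, yci in zip(y, y_c))  # unexplained
    ssr = sum((yci - y_bar) ** 2 for yci in y_c)           # explained
    return sst, sse, ssr

# Hypothetical data, not taken from the slides.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

sst, sse, ssr = decompose(x, y)
r_squared = ssr / sst
print(round(sst, 6), round(sse, 6), round(ssr, 6), round(r_squared, 6))
```

For these points the fitted line is $y_c = 2.2 + 0.6x$, so SST = 6, SSE = 2.4, SSR = 3.6 and $r^2 = 0.6$.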
[Figure: two scatter plots of y against x.]
Perfect fit: $r^2 = 1$, $y_c = y$, $e_i = 0$.
No systematic relation between y and x: $r^2 = 0$, $y_c = \bar{y}$, $b = 0$.
Using the information given in the previous example:
a) What is the degree of correlation r between fuel sales and temperature?
b) Test the statistical significance of this value of r at the 5 per cent level of significance.
c) What proportion of the variation in fuel oil sales is explained by variations in temperature?
a) The appropriate formula to compute r:
$$r = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n \sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n \sum y_i^2 - \left(\sum y_i\right)^2}}$$
$r = -0.927$
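$X_i$ = sales amount of fuel, $Y_i$ = temperature

X_i    Y_i    X_i Y_i    X_i^2    Y_i^2
  4     26      104        16      676
 10     17      170       100      289
 14      7       98       196       49
 12     12      144       144      144
  4     30      120        16      900
  5     40      200        25     1600
  8     20      160        64      400
 11     15      165       121      225
 13     10      130       169      100
 15      5       75       225       25
Sums:
 96    182     1366      1076     4408

$n = 10$, $\sum x_i y_i = 1366$, $\sum x_i = 96$, $\sum y_i = 182$, $\sum x_i^2 = 1076$, $\sum y_i^2 = 4408$
$\left(\sum x_i\right)^2 = 96^2 = 9216$, $\left(\sum y_i\right)^2 = 182^2 = 33124$
$$r = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n \sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n \sum y_i^2 - \left(\sum y_i\right)^2}} = \frac{10(1366) - (96)(182)}{\sqrt{10(1076) - 9216}\,\sqrt{10(4408) - 33124}} = -0.927$$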
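The arithmetic in the table can be reproduced with a short script. This is a sketch of the computational formula applied to the ten observations above; the function name `correlation` is an illustrative choice.

```python
import math

# The ten (X_i, Y_i) pairs from the table.
x = [4, 10, 14, 12, 4, 5, 8, 11, 13, 15]
y = [26, 17, 7, 12, 30, 40, 20, 15, 10, 5]

def correlation(x, y):
    """Correlation coefficient r via the computational formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    sum_y2 = sum(yi * yi for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

r = correlation(x, y)
print(round(r, 3))       # -0.927
print(round(r ** 2, 3))  # 0.859
```

Because the formula is symmetric in x and y, the result does not depend on which of the two variables is treated as dependent.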
b) Test the statistical significance of this value of r at the 5 per cent level of significance.
Hypotheses: $H_0: \rho = 0$, $H_1: \rho < 0$ (one-tailed test)
Significance level: $\alpha = 0.05$
Standard error: $s_r = \sqrt{\dfrac{1 - r^2}{n - 2}} = \sqrt{\dfrac{1 - (-0.927)^2}{10 - 2}} = 0.1326$
Test statistics:
critical $t = t_\alpha$ with $n - 2 = 8$ df $= -1.86$
actual $t = \dfrac{r - \rho}{s_r} = \dfrac{-0.927 - 0}{0.1326} = -6.99$
Conclusion: $t < t_\alpha$, i.e. $-6.99 < -1.86$. We reject $H_0$ and conclude that $\rho < 0$, i.e. that there is a significant negative correlation between fuel sales and temperature.
Note that, in a sample regression involving only one dependent variable y and one explanatory variable x, a significance test on r is equivalent to a significance test on the regression coefficient b.
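The test can be traced step by step in code. This sketch uses the rounded value r = -0.927 and takes the critical value -1.86 from the slides rather than computing it; the function name is hypothetical.

```python
import math

def t_statistic_for_r(r, n, rho_0=0.0):
    """Return (t, s_r) for testing H0: rho = rho_0 with n - 2 df."""
    s_r = math.sqrt((1 - r ** 2) / (n - 2))
    return (r - rho_0) / s_r, s_r

r, n = -0.927, 10
t_actual, s_r = t_statistic_for_r(r, n)

# Critical value quoted in the slides: lower tail, alpha = 0.05, 8 df.
t_critical = -1.86
reject_h0 = t_actual < t_critical

print(round(s_r, 4), round(t_actual, 2), reject_h0)  # 0.1326 -6.99 True
```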
c) What proportion of the variation in fuel oil sales is explained by variations in temperature?
The proportion of the variation in fuel oil sales that is explained by variations in temperature is given by the coefficient of determination $r^2$.
$r^2 = (-0.927)^2 = 0.859$
That is, 85.9 per cent of the variation in fuel sales is explained by variations in temperature.
1. The Correlation Coefficient: A single summary number that tells you whether a relationship exists between two variables, how strong that relationship is and whether the relationship is positive or negative.
2. The Coefficient of Determination: A single summary number that tells you how much variation in one variable is directly related to variation in another variable.
3. Linear Regression: A process that allows you to make predictions about variable Y based on knowledge you have about variable X.
4. The Standard Error of Estimate: A single summary number that allows you to tell how accurate your predictions are likely to be when you perform Linear Regression.
Multiple Regression Analysis
Simple regression analysis: one independent variable (X) is used to predict the value of a dependent variable (Y).
Multiple regression analysis: several independent variables are used to predict the value of a dependent variable.
Multiple regression is a statistical tool that allows you to examine how multiple independent variables are related to a dependent variable; the process is called multiple regression analysis.
Multiple Regression Analysis
Population multiple regression model
$$y_i = \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \dots + \beta_m X_{im} + \varepsilon_i$$
Population multiple linear regression equation
$$\mu_{y \cdot x_1 x_2 \dots x_m} = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_m X_m$$
$\beta_1, \beta_2, \beta_3, \dots, \beta_m$: partial regression coefficients
Sample multiple regression equation
$$\hat{y}_i = a + b_1 X_{i1} + b_2 X_{i2} + b_3 X_{i3} + \dots + b_m X_{im}$$
$\hat{y}$ = estimate of $\mu_{y \cdot x_1 x_2 \dots x_m}$
$a$ = estimate of the intercept $\alpha$ (Y intercept)
$b_1, b_2, b_3, \dots, b_m$ = estimates of the partial regression coefficients $\beta_1, \beta_2, \beta_3, \dots, \beta_m$
$b_i$ = slope of Y with variable $X_i$, holding the other variables constant
e.g. $b_1$ = slope of Y with variable $X_1$, holding variables $X_2, X_3, \dots, X_m$ constant
Multiple regression model with two independent variables
$$y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$$
$\alpha$ = Y intercept
$\beta_1$ = slope of Y with variable $X_1$, holding variable $X_2$ constant
$\beta_2$ = slope of Y with variable $X_2$, holding variable $X_1$ constant
$\varepsilon_i$ = random error in Y for observation i
Multiple Regression Analysis
Interpretation of the individual regression coefficients
Statistical significance of the regression coefficients
Overall explanatory power of the estimated equation
Statistical significance of the overall explanatory power
Interpretation of the individual regression coefficients
$a$ (intercept term) = the estimated value of Y ($\hat{y}$) when the values of all independent variables are zero.
$\hat{y} = a$ when $X_1 = X_2 = \dots = X_m = 0$
Interpretation of any $b_i$ coefficient: $b_i$ represents the change in $\hat{y}$ corresponding to a unit change in $x_i$, when all other independent variables are held constant.
Statistical significance of the regression coefficients
Set up the null hypothesis, which states that the variable associated with $b_1$ ($X_1$) does not influence the dependent variable.
$H_0: B_1 = 0$, $H_1: B_1 \neq 0$ (a two-tailed test)
or
$H_0: B_1 = 0$, $H_1: B_1 < 0$ (a one-tailed test)
or
$H_0: B_1 = 0$, $H_1: B_1 > 0$ (a one-tailed test)
The hypothesis may be tested using a t test.
Test statistic: $t = \dfrac{b_i - B_i}{S_{b_i}}$
$S_{b_i}$ = standard error of $b_i$
Under the null hypothesis: $t = \dfrac{b_i - 0}{S_{b_i}} = \dfrac{b_i}{S_{b_i}}$
Degrees of freedom: $n - k - 1$
$n$ = number of observations
$k$ = number of independent variables
Overall explanatory power of the estimated regression equation
Multiple coefficient of determination ($R^2$)
$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$
Coefficient of multiple correlation ($R$)
$R$ is a measure of the degree of association between Y and all the explanatory variables jointly.
Adjusted multiple coefficient of determination ($\bar{R}^2$)
$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}$$
Statistical significance of overall explanatory power
F statistic
$$F = \frac{MSR}{MSE} = \frac{SSR / k}{SSE / (n - k - 1)}; \qquad SST = SSR + SSE$$

Sum of squares    Degrees of freedom
SSR               k
SSE               n - k - 1
SST               n - 1
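The overall-fit measures all follow from SSR, SSE, n and k. The sketch below bundles them in one hypothetical helper, `overall_fit`; the sums of squares passed to it are made-up round numbers chosen only to keep the arithmetic easy to follow.

```python
def overall_fit(ssr, sse, n, k):
    """R^2, adjusted R^2 and F from the sums of squares.

    n = number of observations, k = number of independent variables.
    """
    sst = ssr + sse
    r2 = ssr / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    msr = ssr / k                # mean square due to regression
    mse = sse / (n - k - 1)      # mean square error
    return {"r2": r2, "adj_r2": adj_r2, "f": msr / mse}

# Hypothetical sums of squares (not from the slides).
fit = overall_fit(ssr=90.0, sse=10.0, n=12, k=3)
print(round(fit["r2"], 4), round(fit["adj_r2"], 4), round(fit["f"], 4))
# 0.9 0.8625 24.0
```

The adjusted value is lower than $R^2$ because it penalizes each additional explanatory variable through the degrees of freedom.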
Example. The data below show the monthly sales of heating fuel for a firm over the past 12 months, together with the average price charged per unit in each month, the advertising expenditure (Rs 000s) per month and the mean daily temperature (°C) recorded during each month. Using these data, compute the regression equation which can be used to estimate the influence of the three explanatory variables (price, advertising, temperature) on heating fuel sales. Comment on the results.
Month        Sales       Price                  Advertising              Temperature
             (liters)    (Rs 00s per liter)     expenditure (Rs 000s)    (°C)
January      450         0.60                   25                       27.50
February     380         1.20                   17                       25.00
March        298         1.80                   14                       29.00
April        350         1.50                   18                       30.00
May          201         3.00                   10                       31.00
June         215         2.70                   11                       35.00
July         220         2.70                   12                       32.00
August       240         2.10                   11                       30.00
September    192         3.00                    7                       29.00
October      201         2.40                    7                       24.00
November     202         2.40                    8                       22.00
December     235         2.30                   10                       20.00
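The slides obtain the regression from computer output. As a rough sketch of what such software does, the block below fits the equation by ordinary least squares using the normal equations $(X'X)b = X'y$, in plain Python. The helper names `solve` and `ols` are hypothetical, and this is one possible way to do the fit, not necessarily the package used for the slides. The printed coefficients can be compared with the computer output quoted in the slides.

```python
# Rows: sales (liters), price (Rs 00s per liter),
#       advertising (Rs 000s), temperature (deg C), from the table above.
data = [
    (450, 0.60, 25, 27.5), (380, 1.20, 17, 25.0), (298, 1.80, 14, 29.0),
    (350, 1.50, 18, 30.0), (201, 3.00, 10, 31.0), (215, 2.70, 11, 35.0),
    (220, 2.70, 12, 32.0), (240, 2.10, 11, 30.0), (192, 3.00, 7, 29.0),
    (201, 2.40, 7, 24.0), (202, 2.40, 8, 22.0), (235, 2.30, 10, 20.0),
]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def ols(rows):
    """Return (coefficients, residuals); coefficients start with the intercept."""
    y = [row[0] for row in rows]
    X = [[1.0] + list(row[1:]) for row in rows]  # leading 1 for the intercept
    p = len(X[0])
    xtx = [[sum(xr[i] * xr[j] for xr in X) for j in range(p)] for i in range(p)]
    xty = [sum(xr[i] * yi for xr, yi in zip(X, y)) for i in range(p)]
    coef = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, xr)) for xr in X]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    return coef, residuals

coef, residuals = ols(data)
a, b_price, b_adv, b_temp = coef

sales = [row[0] for row in data]
y_bar = sum(sales) / len(sales)
sse = sum(e ** 2 for e in residuals)
sst = sum((s - y_bar) ** 2 for s in sales)
r_squared = 1 - sse / sst

print([round(c, 3) for c in coef], round(r_squared, 3))
```

Solving the normal equations directly is adequate for a small, well-behaved data set like this one; statistical packages typically use more numerically robust decompositions.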
Regression equation: Sales = a + b_1 (Price) + b_2 (Advertising expenditure) + b_3 (Temperature)
Regression equation: Sales = 281.357 - 53.344 (Price) + 8.853 (Advertising expenditure) - 0.446 (Temperature)
Interpretation (from the computer output)
1. Size and sign of coefficients
The estimated coefficients show sales to be inversely related to price.
They show sales to be positively related to advertising expenditure.
They show sales rising as temperature falls.
Interpretation (continued)
Regression equation: Sales = 281.357 - 53.344 (Price) + 8.853 (Adexp) - 0.446 (Temp)
Sales fall by 53.344 liters for a unit increase (Rs 100) in price, rise by 8.853 liters for a unit increase (Rs 1000) in advertising expenditure and fall by 0.446 liters for a unit rise (1 °C) in temperature.
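To see how the coefficients translate into predictions, the estimated equation can be written as a small function. The function name is hypothetical; the coefficients are those reported in the slides, and the inputs are the January values from the data table.

```python
def predicted_sales(price, advertising, temperature):
    """Estimated sales (liters) from the regression equation in the slides."""
    return 281.357 - 53.344 * price + 8.853 * advertising - 0.446 * temperature

# Fitted value for January (price 0.60, advertising 25, temperature 27.5).
january = predicted_sales(price=0.60, advertising=25, temperature=27.5)
print(round(january, 2))  # 458.41, against actual sales of 450

# Raising price by one unit (Rs 100) with the other variables held constant
# changes predicted sales by exactly the price coefficient.
delta = predicted_sales(1.60, 25, 27.5) - january
print(round(delta, 3))  # -53.344
```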
Statistical significance of the coefficients
The t ratio for price is -2.842 with a probability value of 0.022, indicating that the coefficient $b_1$ is different from zero at the 0.05 level of significance. The coefficient on price is statistically significant.
The t ratio for advertising expenditure is 3.410 with a probability value of 0.009, indicating that the coefficient $b_2$ is different from zero at the 0.05 level of significance. The coefficient on advertising expenditure is statistically significant.
The coefficient on temperature, with a t ratio of -0.307 and a probability value of 0.767, is not significant at the 5 per cent level.
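The same conclusions follow from comparing each t ratio with a critical value. The sketch below uses the reported coefficients and t ratios, backs out the standard errors they imply (since $t = b / S_b$ under the null hypothesis), and applies the two-tailed 5 per cent critical value for 8 degrees of freedom, 2.306, which is a standard t-table value and is not quoted in the slides.

```python
# (coefficient, t ratio) as reported in the computer output.
coefficients = {
    "price": (-53.344, -2.842),
    "advertising": (8.853, 3.410),
    "temperature": (-0.446, -0.307),
}

n, k = 12, 3
df = n - k - 1        # 12 - 3 - 1 = 8
t_critical = 2.306    # two-tailed, alpha = 0.05, 8 df (standard table value)

results = {}
for name, (b, t) in coefficients.items():
    implied_se = b / t                  # from t = b / S_b under H0: B = 0
    significant = abs(t) > t_critical
    results[name] = (round(implied_se, 3), significant)

for name, (se, significant) in results.items():
    print(name, se, significant)
```

Price and advertising exceed the critical value in absolute terms and temperature does not, which agrees with the probability values reported above.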
Statistical significance of the regression as a whole
The F statistic (131.704) has a probability value of 0.000, indicating that the regression as a whole is very highly significant.
Overall explanatory power
The $R^2$ value indicates that 98 per cent of the variation in sales is explained by the regression as a whole (i.e. by the joint variation in the three independent variables).
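The reported F statistic and $R^2$ can be checked against each other. Dividing the numerator and denominator of $F = \frac{SSR/k}{SSE/(n-k-1)}$ by SST gives $F = \frac{R^2/k}{(1-R^2)/(n-k-1)}$, so $R^2 = \frac{kF}{kF + (n-k-1)}$. The function name below is hypothetical.

```python
def r_squared_from_f(f, n, k):
    """Recover R^2 from the overall F statistic, n observations, k regressors."""
    return k * f / (k * f + (n - k - 1))

r2 = r_squared_from_f(131.704, n=12, k=3)
print(round(r2, 3))  # 0.98
```

With F = 131.704, n = 12 and k = 3 this gives $R^2 \approx 0.98$, consistent with the 98 per cent quoted above.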