Linear Regression

The simple linear regression model describes the relationship between one dependent variable (y) and one independent variable (x). A dependent variable is a random variable whose variation is predicted by an independent variable. The relationship is specified as:

y = b0 + b1x

where b0 and b1 are fixed numbers. The value of b0 determines the point at which the straight line crosses the y-axis; the y-intercept.
Regression Analysis: Simple

The value of b1 determines the slope of the line: the amount by which y changes for every unit change in x.
If b1 < 0 the slope is negative.
If b1 > 0 the slope is positive.
If b1 = 0 the slope is zero; the line is horizontal (parallel to the x-axis).
Notation Used

Sxx = Σx² - (Σx)²/n = Σ(x - x̄)²
Sxy = Σxy - (Σx)(Σy)/n = Σ(x - x̄)(y - ȳ)
Syy = Σy² - (Σy)²/n = Σ(y - ȳ)²

b1 = Sxy / Sxx
b0 = (1/n)(Σy - b1Σx)

The Least-Squares Criterion
The straight line that best fits a set of data points is the one for which the sum of squared errors is smallest.

Regression Line
The straight line that fits a set of data points best according to the least-squares criterion is the regression line.

The Regression Equation
The equation of the regression line is stated as: y = b0 + b1x

Estimation & Prediction
Estimation or prediction is simply using the regression equation on the sample x-values to provide an estimate of the corresponding y-value. That is, ŷ = b0 + b1x
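As a sketch of these formulas in code (the x and y values below are made up for illustration, not taken from the notes):

```python
def regression_coefficients(x, y):
    """Least-squares estimates for the line y = b0 + b1*x,
    using b1 = Sxy / Sxx and b0 = (1/n)(sum(y) - b1*sum(x))."""
    n = len(x)
    sxx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    b1 = sxy / sxx
    b0 = (sum(y) - b1 * sum(x)) / n
    return b0, b1

# Illustrative data lying roughly on the line y = 2x
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = regression_coefficients(x, y)
print(b0, b1)  # approximately 0.05 and 1.99 for these values
```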
The Total Sum of Squares
The total squared variation in y is called the total sum of squares:
SST = Syy = Σ(y - ȳ)²

Error Sum of Squares
The total squared error is called the error sum of squares:
SSE = Syy - (Sxy)²/Sxx = Σ(y - ŷ)²

Sum of Squares Due to Regression
The total amount of variation explained by the regression line is called the regression sum of squares:
SSR = (Sxy)²/Sxx = Σ(ŷ - ȳ)²

The Regression Identity
SST = SSR + SSE
The Coefficient of Determination
The r-square, or coefficient of determination, is the percentage reduction obtained in the total squared error by using the regression equation instead of the sample mean to predict the observed y-values. In other words, it is the amount of variation in the dependent variable that is explained by the independent variable.

r² = SSR / SST
OR r² = 1 - (SSE / SST)
OR r² = (Sxy)² / (Sxx Syy)
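The three sums of squares, the regression identity, and r² can be sketched as follows (illustrative data again, not from the notes):

```python
def sums_of_squares(x, y):
    """Return (SST, SSR, SSE) from Sxx, Syy, Sxy as defined above."""
    n = len(x)
    sxx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
    syy = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    sst = syy
    ssr = sxy ** 2 / sxx
    sse = syy - sxy ** 2 / sxx
    return sst, ssr, sse

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
sst, ssr, sse = sums_of_squares(x, y)
assert abs(sst - (ssr + sse)) < 1e-9  # the regression identity holds
r2 = ssr / sst                        # coefficient of determination
print(r2)
```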
Linear Correlation
The square root of r² gives the magnitude of the linear correlation (r) between x and y; r takes the sign of the slope b1. r ranges from -1 to +1. Positive values indicate a positive correlation; negative values indicate a negative correlation. If x and y are correlated it does not mean that one causes the other.

Other Terms

Extrapolation
Using the regression equation to make predictions for x-values outside the range of x-values in the sample data.

Outliers & Influential Observations
Recall: An outlier is an observation that lies outside the overall pattern of the data. In a regression context an outlier is a data point that lies far from the regression line. An influential observation is a data point whose removal causes the regression equation to change considerably.

Scatter Plots
A plot of x and y to visualize the pattern of the sample data. If the plot shows a non-linear relationship between x (predictor variable) and y (response variable), DO NOT use linear regression methods.
Four Data Sets Having Same Value of Summary Statistics (Source: Anscombe, 1973)

       Data Set 1    Data Set 2    Data Set 3    Data Set 4
       x1     y1     x2     y2     x3     y3     x4     y4
        4   4.26      4   3.10      4   5.39      8   6.58
        5   5.68      5   4.74      5   5.73      8   5.76
        6   7.24      6   6.13      6   6.04      8   7.71
        7   4.82      7   7.26      7   6.42      8   8.84
        8   6.95      8   8.14      8   6.77      8   8.47
        9   8.81      9   8.77      9   7.11      8   7.04
       10   8.04     10   9.14     10   7.46      8   5.25
       11   8.33     11   9.26     11   7.81      8   5.56
       12  10.84     12   9.13     12   8.15      8   7.91
       13   7.58     13   8.74     13  12.74      8   6.89
       14   9.96     14   8.10     14   8.84     19  12.50
Mean 9.00   7.50   9.00   7.50   9.00   7.50   9.00   7.50
STD. 3.32   2.03   3.32   2.03   3.32   2.03   3.32   2.03
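A quick check of the table (only Data Sets 1 and 4 are typed in here, to keep the sketch short): their summary statistics agree even though their scatter patterns differ sharply.

```python
import statistics

# Data Sets 1 and 4 from the table above
x1 = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
y1 = [4.26, 5.68, 7.24, 4.82, 6.95, 8.81, 8.04, 8.33, 10.84, 7.58, 9.96]
x4 = [8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 19]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 5.56, 7.91, 6.89, 12.5]

for x, y in [(x1, y1), (x4, y4)]:
    # Both rows agree: means 9 and 7.5, sample std. devs. 3.32 and 2.03
    print(round(statistics.mean(x), 2), round(statistics.mean(y), 2),
          round(statistics.stdev(x), 2), round(statistics.stdev(y), 2))
```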
[Scatter plots of Data Sets 1-4: the four data sets share identical summary statistics but show very different patterns.]

Inferential Methods

Assumptions For Regression Inferences
1. Population Regression Line (Assumption I): There is a straight line, y = β0 + β1x, such that for each x-value, the mean of the corresponding population of y-values lies on that straight line.
2. Equal Standard Deviations (Assumption II): The standard deviation, σ, of the population of y-values corresponding to a particular x-value is the same, regardless of the x-value.
3. Normality (Assumption III): For each x-value, the corresponding population of y-values is normally distributed.
Standard Error of the Estimate
The standard error of the estimate is defined by:
se = Sqrt(SSE / (n - 2))
It provides us with an estimate of the common population standard deviation, σ. The se indicates how far the observed y-values are from the predicted y-values, on average.

Residual Analysis for the Regression Model
If the assumptions for regression inferences are met, then the following two conditions should hold.
1. A plot of the residuals against the x-values should fall roughly in a horizontal band centered and symmetric about the x-axis. If a pattern is indicated, you probably need to use some other analytical method than simple linear regression. For example, the following graph shows that a quadratic relationship exists in the residuals.
[Residual plot (Linearity Test, Dependent Variable: Min): residuals plotted against Units show a quadratic pattern.]

2. A normal probability plot of the residuals should be roughly linear. For example, the following graph shows that the normality assumption is violated.

[Normal probability plot (Dependent Variable: Min): sorted residuals plotted against expected residuals depart from a straight line.]
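A minimal sketch of computing residuals for such diagnostic plots, and the standard error of the estimate defined above (illustrative data, not the Min/Units data from the graphs):

```python
import math

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

# Least-squares fit: b1 = Sxy/Sxx, b0 = (1/n)(sum(y) - b1*sum(x))
sxx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
b1 = sxy / sxx
b0 = (sum(y) - b1 * sum(x)) / n

# Residuals (observed minus predicted) are what both diagnostic plots use
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(r ** 2 for r in residuals)
se = math.sqrt(sse / (n - 2))  # standard error of the estimate
print(se)
```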
Hypothesis Tests for the Slope
H0: β1 = 0
Ha: β1 ≠ 0
The test statistic has a t-distribution with df = n - 2:
t = (b1 - β1) / (se / Sqrt(Sxx))
If the value of the test statistic falls in the rejection region, then reject the null; otherwise do not reject the null.

Confidence Intervals for the Slope
The endpoints of the confidence interval for β1 are:
b1 ± t(α/2) · se / Sqrt(Sxx),  with df = n - 2

Confidence Intervals for Means in Regression
ŷp ± t(α/2) · se · Sqrt(1/n + (xp - x̄)² / Sxx),  with df = n - 2

Confidence Intervals for a Population y-value given an x-value
ŷp ± t(α/2) · se · Sqrt(1 + 1/n + (xp - x̄)² / Sxx),  with df = n - 2
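A sketch of the slope test and confidence interval (illustrative data; 3.182 is the two-tailed t critical value for df = 3 at the 95% level):

```python
import math

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
sxx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
b1 = sxy / sxx
b0 = (sum(y) - b1 * sum(x)) / n
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
se = math.sqrt(sse / (n - 2))

# Test H0: beta1 = 0 against Ha: beta1 != 0
t = b1 / (se / math.sqrt(sxx))
t_crit = 3.182  # t_{alpha/2} with df = n - 2 = 3, alpha = 0.05

# 95% confidence interval for the slope
half_width = t_crit * se / math.sqrt(sxx)
ci = (b1 - half_width, b1 + half_width)
print(t, ci)
```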
Aside: The F Distribution
Not symmetric.
Does not have zero at its center.

Using the F-Table
Need a significance level, numerator degrees of freedom (n1 - 1), and denominator degrees of freedom (n2 - 1).
To find the left-tailed critical value we can use the reciprocal of the right-tailed value with the numbers of degrees of freedom reversed. For example: if n1 = 10 and n2 = 7 with alpha = 0.05 split two-tailed (0.025 in each tail), the right-tail value is 5.5234 (from the table, with 9 and 6 degrees of freedom). The left-tail value is 1/4.317 = 0.2316, where 4.317 is the right-tail critical value for 6 and 9 degrees of freedom.
Regression Example << Regrsam.xls >>

Age (x)  Price (y)     xy   Sqr(x)  Sqr(y)
     6        125     750       36   15625
     6        115     690       36   13225
     6        130     780       36   16900
     2        260     520        4   67600
     2        219     438        4   47961
     5        150     750       25   22500
     4        190     760       16   36100
     5        163     815       25   26569
     1        260     260        1   67600
     4        160     640       16   25600
TOTALS 41    1772    6403      199  339680

Syy = 339680 - Sqr(1772) / 10 = 25681.60
Sxx = 199 - Sqr(41) / 10 = 30.90
Sxy = 6403 - (41)(1772) / 10 = -862.20

b1 = Sxy / Sxx = -27.90
b0 = (1/10)(1772 - b1(41)) = 291.60

SST = Syy = 25681.600
SSR = (Sxy · Sxy) / Sxx = 24057.891
SSE = SST - SSR = 1623.709
se = Sqrt(SSE / (n - 2)) = 14.247
r-square = (1 - (SSE / SST)) · 100 = 93.68%
r = -Sqrt(r-square) = -0.97 (negative, since the slope b1 is negative)
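The calculations in this example can be checked with a short script, using the Age and Price data from the table above:

```python
import math

age = [6, 6, 6, 2, 2, 5, 4, 5, 1, 4]
price = [125, 115, 130, 260, 219, 150, 190, 163, 260, 160]
n = len(age)

sxx = sum(x ** 2 for x in age) - sum(age) ** 2 / n              # 30.90
syy = sum(y ** 2 for y in price) - sum(price) ** 2 / n          # 25681.60
sxy = (sum(x * y for x, y in zip(age, price))
       - sum(age) * sum(price) / n)                             # -862.20

b1 = sxy / sxx                         # about -27.90
b0 = (sum(price) - b1 * sum(age)) / n  # about 291.60
sst, ssr = syy, sxy ** 2 / sxx
sse = sst - ssr
se = math.sqrt(sse / (n - 2))
r2 = ssr / sst                         # about 0.9368
print(b1, b0, se, r2)
```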
Hypothesis Testing
H0: β1 = 0
Ha: β1 ≠ 0
t = b1 / (se / Sqrt(Sxx)) = -10.887
Critical t at 95% with df = 8: ±2.306
Since the calculated value falls in the rejection region, we reject the null hypothesis. There is enough evidence to conclude that the age of Corvettes is useful for predicting the price of Corvettes.

Confidence Interval at 95%
-27.90 + 2.306 · (14.247 / Sqrt(30.90)) = -21.99
-27.90 - 2.306 · (14.247 / Sqrt(30.90)) = -33.81
CI = (-33.81, -21.99)
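The test statistic and interval endpoints above can be reproduced from the summary numbers of the example:

```python
import math

b1, se, sxx = -27.90, 14.247, 30.90  # summary values from the example
t_crit = 2.306                       # two-tailed t, df = 8, 95% level

t = b1 / (se / math.sqrt(sxx))       # near -10.89 with these rounded inputs
half_width = t_crit * se / math.sqrt(sxx)
ci = (b1 - half_width, b1 + half_width)
print(t, ci)  # CI endpoints round to (-33.81, -21.99)
```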