Least Squares Regression
Sections 5.3 & 5.4
Cathy Poliak, Ph.D. (cathy@math.uh.edu), Office in Fleming 11c
Department of Mathematics, University of Houston
Lecture 14 - 2311
Outline
1. Least-Squares Regression
2. Prediction
3. Coefficient of Determination
4. Residuals
5. Residual Plots
Popper Set Up
Fill in all of the proper bubbles. Make sure your ID number is correct. Make sure the filled-in circles are very dark. This is popper number 07.
Scatterplots and Correlation
We want to know the relationship between two random variables. To take a quick look at the relationship, we use the scatterplot and correlation to determine:
- Direction (positive or negative)
- Strength (weak, moderately weak, strong)
- Form (linear or non-linear)
Popper 07 Questions
We are looking at the relationship between how many items are on a shelf (shelf space) and the number of items sold in a week. The correlation coefficient is r = 0.827.
1. What is the direction of the relationship between shelf space and weekly sales?
a) Positive b) Negative c) No direction
2. What is the strength of the relationship between shelf space and weekly sales?
a) Strong b) Moderate c) Weak
3. What form appears in the relationship between shelf space and weekly sales?
a) Linear b) Non-linear
Popper 07 Questions
Use the following scatterplot to answer the questions below.
[Scatterplot of y versus x]
4. What is the correlation coefficient for this set of data?
a. 1 b. -0.94 c. 0.94 d. 0
5. If both X and Y are multiplied by 2, what would happen to the correlation coefficient?
a. It will stay the same. b. It will be divided by 2. c. It will be multiplied by 2. d. It will be zero.
Example
The marketing manager of a supermarket chain would like to use shelf space to predict the sales of coffee. A random sample of 12 stores is selected, with the following results.

Store   Shelf Space (ft)   Weekly Sales (# sold)
  1          5                  160
  2          5                  220
  3          5                  140
  4         10                  190
  5         10                  240
  6         10                  260
  7         15                  230
  8         15                  270
  9         15                  280
 10         20                  260
 11         20                  290
 12         20                  310
Scatterplot
[Scatterplot of Number Sold versus Shelf Space (feet)]
Examining relationships
Correlation measures the direction and strength of the straight-line relationship between two quantitative variables.
If a scatterplot shows a linear relationship, we would like to summarize this overall pattern by drawing a line on the scatterplot.
A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. This equation is used when one of the variables helps explain or predict the other.
Least-Squares regression
The least-squares regression line (LSRL) of Y on X is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
Least-Squares
[Scatterplot of Number Sold versus Shelf Space (feet)]
Equation of the least-squares regression line
Let x be the explanatory variable and y be the response variable for n individuals. From the data calculate the means x̄ and ȳ and the standard deviations s_x and s_y of the two variables, and their correlation r. The least-squares line is the equation ŷ = a + bx.
Equation of the least-squares regression line
The least-squares line is the equation ŷ = a + bx. In the example of the supermarket sales, let Y = weekly sales and X = shelf space. The least-squares regression equation to predict weekly sales based on shelf space is ŷ = 145 + 7.4x.
Calculating the least-squares regression equation by hand
Let X be the explanatory variable and Y be the response variable for n individuals.
1. From the data calculate the means x̄ and ȳ, the standard deviations s_x and s_y of the two variables, and their correlation r.
2. Calculate the slope: b = r(s_y / s_x)
3. Calculate the y-intercept: a = ȳ - b x̄
4. Then the equation is ŷ = a + bx, where you input the slope into b and the y-intercept into a and leave y and x alone (do not put any numbers in for y and x).
Finding the Equation for Coffee Sales
Given these values, determine the least-squares regression line (LSRL) equation for predicting number sold based on shelf space.
Explanatory variable: Shelf space (X), x̄ = 12.5 feet, s_x = 5.83874 feet.
Response variable: Sales (Y), ȳ = 237.5 units sold, s_y = 52.2451 units sold.
Correlation: r = 0.827
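Working through the hand-calculation steps with these values:

b = r(s_y / s_x) = 0.827 × (52.2451 / 5.83874) ≈ 7.40
a = ȳ - b x̄ = 237.5 - (7.40)(12.5) = 237.5 - 92.5 = 145
ŷ = 145 + 7.4x

This matches the equation found with R and the TI-83(84).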
Example of LSRL for Cost of Car
The following are descriptive statistics for estimating the cost of a car (Y) from the age of the car (X), with r = -0.8224 (cost decreases as age increases).
Estimated Cost (Y): ȳ = 10360.93, s_y = 5482.3372
Age (X): x̄ = 5.214, s_x = 2.940
Determine the least-squares regression line (LSRL) equation for these data.
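As a worked sketch of the hand calculation (note that cost falls as age increases, so the correlation carries a negative sign, r = -0.8224):

b = r(s_y / s_x) = -0.8224 × (5482.3372 / 2.940) ≈ -1534
a = ȳ - b x̄ = 10360.93 - (-1534)(5.214) ≈ 18358
ŷ = 18358 - 1534x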
Least-Squares Regression Line Using R
Use the command lm(yvariable ~ xvariable). For the example of the coffee sales:

> shelf=c(5,5,5,10,10,10,15,15,15,20,20,20)
> sales=c(160,220,140,190,240,260,230,270,280,260,290,310)
> lm(sales~shelf)

Call:
lm(formula = sales ~ shelf)

Coefficients:
(Intercept)        shelf
      145.0          7.4

The Intercept value is a and the other value is b, the slope. Thus the equation in this example is ŷ = 145 + 7.4x.
Least-Squares Regression Line Using TI-83(84)
1. Make sure the diagnostics are turned on by pressing 2ND CATALOG and scrolling down to Diagnostics.
2. Choose STAT, then CALC, then 8:LinReg(a+bx).
3. Make sure your Xlist is L1 and Ylist is L2, and select Calculate.
4. a is your y-intercept and b is your slope.
Prediction
The equation of the regression line makes prediction easy: substitute an x-value into the equation.
Predict the weekly sales of coffee with shelf space of 12 feet.
ŷ = 145 + 7.4(12) = 233.8
Thus 233.8 is the predicted weekly number of units of coffee sold for a store that has 12 feet of shelf space.
Popper Questions
The least-squares regression line equation to predict the cost of a car (Y) from the age of the car (X) is ŷ = 18358 - 1534x.
6. Predict the cost of an automobile for a 5 year old car.
a) $18,358 b) $1,534 c) $10,688 d) $11.96
7. Predict the cost of an automobile for a 20 year old car.
a) $18,358 b) -$1,534 c) -$12,322 d) $49,038
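The predictions follow by substituting each age into the equation:

ŷ = 18358 - 1534(5) = 18358 - 7670 = 10688
ŷ = 18358 - 1534(20) = 18358 - 30680 = -12322

A negative predicted cost for the 20 year old car is a reminder that predicting far outside the range of the observed data (extrapolation) can give unreasonable results.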
Facts about least-squares regression
Fact 1: A change of one standard deviation in x corresponds to a change of r standard deviations in y. (This is where the slope b comes from.)
Fact 2: The least-squares regression line always passes through the point (x̄, ȳ). That is how we can get the y-intercept a.
Fact 3: The distinction between explanatory and response variables is essential in regression.
R²
The square of the correlation, R², describes the strength of a straight-line relationship. Its formal name is the coefficient of determination. This is the percent of the variation in Y that is explained by the regression equation. R² is a measure of how successful the regression was in explaining the response.
R² for coffee sales
The correlation is r = 0.827.
The coefficient of determination is R² = 0.827² ≈ 0.684.
This means that 68.4% of the variation in coffee sales can be explained by the equation.
Example
The following is an output of a least-squares regression equation in the TI-84.
[TI-84 LinReg output]
What percent of the variation in the y-variable can be explained by this regression equation?
a) 67% b) 6.7845% c) 77% d) 88%
Is the regression good at predicting the response?
The coefficient of determination, R², is the percent (fraction) of variability in the response variable (Y) that is explained by the least-squares regression on the explanatory variable. This is a measure of how successful the regression equation was in predicting the response variable. The closer R² is to one (100%), the better our equation is at predicting the response variable. From the previous question we can say that 84.33% of the variation in the monthly gas bills can be explained by the least-squares regression line equation.
Is the LSRL good at predicting the response?
A residual is the difference between an observed value of the response variable and the value predicted by the regression line:
residual = observed y - predicted y
We can determine residuals for each observation. The closer the residuals are to zero, the better we are at predicting the response variable. We can plot the residuals for each observation; this plot is called a residual plot.
Example of residuals
The regression equation to determine the number of units of coffee sold (y) from shelf space (x) is ŷ = 145 + 7.4x. A store that has 10 feet of space sold 260 units of coffee. Determine the residual for this store.
1. Determine the predicted units sold for x = 10.
2. The observed units sold is the given value, 260.
3. The residual is the difference between the observed y and the predicted y.
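Carrying out the three steps:

1. Predicted: ŷ = 145 + 7.4(10) = 219 units.
2. Observed: y = 260 units.
3. Residual: y - ŷ = 260 - 219 = 41 units.

The store sold 41 more units than the regression line predicts.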
Residuals of coffee sales
The regression equation is Weekly Sales = 145 + 7.40 Shelf Space

Shelf Space   Observed Weekly Sales   Predicted Weekly Sales   Residual
     5                160                     182                 -22
     5                220                     182                  38
     5                140                     182                 -42
    10                190                     219                 -29
    10                240                     219                  21
    10                260                     219                  41
    15                230                     256                 -26
    15                270                     256                  14
    15                280                     256                  24
    20                260                     293                 -33
    20                290                     293                  -3
    20                310                     293                  17
Residual Plots
The plot of the residual values against the x values can tell us about our LSRL model. This is a scatterplot where the x-values are on the horizontal axis and the residuals are on the vertical axis.
In R: plot(x, resid(lm(y ~ x)))
If the equation is a good way of predicting the y-variable (our model is a good fit), the residual plot will be a bunch of scattered points with no pattern, centered about zero on the vertical axis.
Residual Plot
[Residual plot: residuals versus shelf space]
R code
> shelf=c(5,5,5,10,10,10,15,15,15,20,20,20)
> sales=c(160,220,140,190,240,260,230,270,280,260,290,310)
> plot(shelf,resid(lm(sales~shelf)),ylab="residuals")
> abline(0,0)
Examining a residual plot
- A curved pattern shows that the relationship is not linear.
- Increasing spread about the zero line as x increases indicates that the prediction of y will be less accurate for larger x.
- Decreasing spread about the zero line as x increases indicates that the prediction of y will be more accurate for larger x.
- Individual points with large residuals are considered outliers in the vertical (y) direction.
- Individual points that are extreme in the x direction are considered outliers for the x-variable.