Chapter 4: Regression Models

1 Chapter 4: Regression Models Textbook: pp. 129-164 [Title slide scatter plot: money spent on advertising (X) versus sales volume of company (Y)]

2 Learning Objectives After completing this chapter, students will be able to: Identify variables, visualise them in a scatter diagram, and use them in a regression model. Develop simple linear regression equations from sample data and interpret the slope and intercept. Calculate the coefficient of determination and the coefficient of correlation and interpret their meanings. List the assumptions used in regression and use residual plots to identify problems. Interpret the F test in a linear regression model.

3 Learning Objectives After completing this chapter, students will be able to: Use computer software for regression analysis. Develop a multiple regression model and use it for prediction purposes. Use dummy variables to model categorical data. Determine which variables should be included in a multiple regression model. Transform a nonlinear function into a linear one for use in regression. Understand and avoid mistakes commonly made in the use of regression analysis.

4 Introduction (1 of 2) Regression analysis is a very valuable tool for a manager o Understand the relationship between variables o Predict the value of one variable based on another variable Simple linear regression models have only two variables Multiple regression models have more than one independent variable

5 Introduction (2 of 2) Variable to be predicted is called the dependent variable or response variable o Value depends on the value of the independent variable(s) [Explanatory or predictor variable]

6 Scatter Diagram / Scatter Plot Scatter diagram or scatter plot often used to investigate the relationship between variables o Independent variable normally plotted on X axis o Dependent variable normally plotted on Y axis

7 Triple A Construction (1 of 6) Triple A Construction renovates old homes The dollar volume of renovation work is dependent on the area payroll Triple A Construction Company Sales and Local Payroll [data table not reproduced] Economists have predicted the local area payroll to be $600 million next year, and Triple A wants to plan accordingly!

8 Triple A Construction (2 of 6) Scatter Diagram of Triple A Construction Company Data:

9 Simple Linear Regression (1 of 2) In any regression model, there is an implicit assumption (which can be tested) that a relationship exists between the variables! There is also some random error that cannot be predicted! The model is

Y = β0 + β1X + ε

where Y = dependent variable (response), X = independent variable (predictor or explanatory), β0 = intercept (value of Y when X = 0), β1 = slope of the regression line, ε = random error.

10 Simple Linear Regression (2 of 2) True values for the slope and intercept are not known, so they are estimated using sample data:

Ŷ = b0 + b1X

where Ŷ = predicted value of Y, b0 = estimate of β0 based on sample results, b1 = estimate of β1 based on sample results.

11 Triple A Construction (3 of 6) Predict sales based on area payroll: Y = sales, X = area payroll. The line in our scatter diagram minimises the errors: Error = (actual value) − (predicted value), that is, e = Y − Ŷ. Regression analysis minimises the sum of squared errors: least-squares regression.

12 Triple A Construction (4 of 6) Formulas for the simple linear regression slope and intercept:

b1 = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)²
b0 = Ȳ − b1X̄

13 Triple A Construction (5 of 6) Regression Calculations for Triple A Construction:

14 Triple A Construction (6 of 6) Ŷ = b0 + b1X. The regression calculations give b0 = 2 and b1 = 1.25, therefore sales = 2 + 1.25(payroll). If the payroll next year is $600 million: Ŷ = 2 + 1.25(6) = 9.5, or $950,000.
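
The slide's calculation can be reproduced in a few lines of Python. A minimal sketch: the six (payroll, sales) pairs below are reconstructed to be consistent with the slide's reported results (b0 = 2, b1 = 1.25), so treat them as illustrative rather than as the textbook's official table.

```python
# Least-squares fit for the Triple A Construction example.
# Data reconstructed to match the reported results -- illustrative only.
X = [3, 4, 6, 4, 2, 5]       # area payroll ($100,000,000s)
Y = [6, 8, 9, 5, 4.5, 9.5]   # sales ($100,000s)

n = len(X)
x_bar = sum(X) / n           # 4.0
y_bar = sum(Y) / n           # 7.0

# b1 = sum((X - X_bar)(Y - Y_bar)) / sum((X - X_bar)^2)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) \
     / sum((x - x_bar) ** 2 for x in X)   # 1.25
b0 = y_bar - b1 * x_bar                   # 2.0

print(f"sales = {b0} + {b1} * payroll")
print(f"predicted sales at $600M payroll: {b0 + b1 * 6}")  # 9.5 -> $950,000
```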

15 Measuring the Fit of the Regression Model (1 of 4) Regression models can be developed for any variables X and Y. How helpful is the model in predicting Y? The average error is not useful because positive and negative errors cancel each other out. Three measures of variability: o SST total variability about the mean o SSE variability about the regression line o SSR variability that is explained by the model

16 Measuring the Fit of the Regression Model (2 of 4) Sum of squares total: SST = Σ(Y − Ȳ)². Sum of squares error: SSE = Σe² = Σ(Y − Ŷ)². Sum of squares regression: SSR = Σ(Ŷ − Ȳ)². An important relationship: SST = SSR + SSE.

17 Measuring the Fit of the Regression Model (3 of 4) Sums of squares for Triple A Construction: SST = Σ(Y − Ȳ)² = 22.5 (the mean is compared to each value); SSE = Σe² = Σ(Y − Ŷ)² = 6.875 (the prediction for each observation is compared to the actual value); SSR = Σ(Ŷ − Ȳ)² = 15.625 (the total variability in Y explained by the regression model). The SSE is much lower than SST: using the regression line has reduced the variability in the sum of squares by 22.5 − 6.875 = 15.625 (= SSR). This indicates how much of the total variability in Y is explained by the regression model!
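
Continuing the sketch above (same illustrative data and fitted coefficients), the three sums of squares and the identity SST = SSR + SSE can be checked directly:

```python
# Sums of squares for the fitted line Y-hat = 2 + 1.25 X.
Y_hat = [b0 + b1 * x for x in X]

SST = sum((y - y_bar) ** 2 for y in Y)                 # 22.5
SSE = sum((y - yh) ** 2 for y, yh in zip(Y, Y_hat))    # 6.875
SSR = sum((yh - y_bar) ** 2 for yh in Y_hat)           # 15.625

assert abs(SST - (SSR + SSE)) < 1e-9   # SST = SSR + SSE holds
```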

18 Measuring the Fit of the Regression Model (4 of 4) Deviations from the Regression Line and from the Mean:

19 Coefficient of Determination r² The SSR is the explained variability in Y; SSE is the unexplained variability in Y. The coefficient of determination r² is the proportion of the variability in Y explained by the regression equation: r² = SSR/SST = 1 − SSE/SST. For Triple A Construction: r² = 15.625/22.5 = 0.6944, so about 69% of the variability in Y is explained by the equation based on payroll (X). If every point in the sample were on the regression line (meaning all errors are 0), then 100% of the variability in Y could be explained by the regression equation, so r² = 1 and SSE = 0. The lowest possible value of r² is 0, indicating that X explains 0% of the variability in Y. Thus r² can range from 0 to 1. In developing regression equations, a good model will have an r² value close to 1.

20 Correlation Coefficient r An expression of the strength of the linear relationship o Always between −1 and +1 o The correlation coefficient is r = ±√r², taking the sign of the slope o For Triple A Construction: r = √0.6944 = +0.8333
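
In code (continuing the same running sketch), r² and r follow directly from the sums of squares; the sign of r is taken from the slope:

```python
import math

r_squared = SSR / SST                          # 0.6944...
r = math.copysign(math.sqrt(r_squared), b1)    # +0.8333... (sign of the slope)
```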

21 Four Values of the Correlation Coefficient

22 Assumptions of the Regression Model With certain assumptions about the errors, statistical tests can be performed to determine the model's usefulness: 1. Errors are independent 2. Errors are normally distributed 3. Errors have a mean of zero 4. Errors have a constant variance A plot of the residuals (errors) often highlights glaring violations of assumptions

When the errors (residuals) are plotted against the independent variable, the pattern should appear random. 23 Residual Plots (1 of 3) Pattern of Errors Indicating Randomness: Errors seem random and no discernible pattern is present.

24 Residual Plots (2 of 3) Nonconstant Error Variance: Error pattern in which the errors increase as X increases! Violation of constant variance assumption!

Patterns in the plot of the errors indicate problems with the assumptions or the model specification! 25 Residual Plots (3 of 3) Pattern of Errors Indicating Relationship Is Not Linear: Errors consistently increase at first, and then consistently decrease not a linear model!

26 Estimating the Variance (1 of 2) Errors are assumed to have a constant variance (σ²), usually unknown o Estimated using the mean squared error (MSE), s²: s² = MSE = SSE / (n − k − 1), the sum of squares due to error divided by the degrees of freedom, where n = number of observations in the sample and k = number of independent variables.

27 Estimating the Variance (2 of 2) For Triple A Construction: s² = MSE = SSE/(n − k − 1) = 6.8750/(6 − 1 − 1) = 6.8750/4 = 1.7188 o Estimate the standard deviation, s, the standard error of the estimate or the standard deviation of the regression: s = √MSE = √1.7188 = 1.31
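
The same quantities in code (continuing the running sketch; one independent variable, so k = 1):

```python
k = 1                        # number of independent variables
MSE = SSE / (n - k - 1)      # 6.875 / 4 = 1.71875
s = MSE ** 0.5               # standard error of the estimate, about 1.31
```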

28 Testing the Model for Significance (1 of 4) When the sample size is too small, you can get good values for MSE (mean squared error) and r 2 even if there is no relationship between the variables o Testing the model for significance helps determine if the values are meaningful o Performing a statistical hypothesis test

29 Testing the Model for Significance (2 of 4) We start with the general linear model: Y = β0 + β1X + ε. The null hypothesis is that there is no linear relationship between X and Y (β1 = 0). The alternate hypothesis is that there is a linear relationship (β1 ≠ 0). If the null hypothesis can be rejected, we have evidence of a relationship. We use the F statistic for this test.

30 Testing the Model for Significance (3 of 4) The F statistic is based on the MSE (mean squared error) and MSR (mean square regression): MSR = SSR/k, where k = number of independent variables in the model. The F statistic is F = MSR/MSE. This describes an F distribution with: o degrees of freedom for the numerator = df1 = k o degrees of freedom for the denominator = df2 = n − k − 1

31 Testing the Model for Significance (4 of 4) If there is very little error, MSE would be small and the F statistic would be large, indicating the model is useful. If the F statistic is large, the significance level (p-value) will be low: the result is unlikely to have occurred by chance. When the F value is large, we can reject the null hypothesis and accept that there is a linear relationship between X and Y, and the values of the MSE and r² are meaningful.

32 Steps in a Hypothesis Test (1 of 2) 1. Specify null and alternative hypotheses: H0: β1 = 0, H1: β1 ≠ 0. 2. Select the level of significance (α); common values are 0.01 and 0.05. 3. Calculate the value of the test statistic: F = MSR/MSE.

33 Steps in a Hypothesis Test (2 of 2) 4. Make a decision using one of the following methods: a) Reject the null hypothesis if the test statistic is greater than the F value from the table in Appendix D; otherwise, do not reject the null hypothesis. Reject if Fcalculated > Fα,df1,df2, where df1 = k and df2 = n − k − 1. b) Reject the null hypothesis if the observed significance level, or p-value, is less than the level of significance (α); otherwise, do not reject the null hypothesis. p-value = P(F > calculated test statistic); reject if p-value < α.

34 Triple A Construction (1 of 3) Step 1: H0: β1 = 0 (no linear relationship between X and Y); H1: β1 ≠ 0 (linear relationship exists between X and Y). Step 2: Select α = 0.05. Step 3: Calculate the value of the test statistic: MSR = SSR/k = 15.6250/1 = 15.6250; F = MSR/MSE = 15.6250/1.7188 = 9.09.

35 Triple A Construction (2 of 3) Step 4: Reject the null hypothesis if the test statistic is greater than the F value in Appendix D (page 576). df1 = k = 1, df2 = n − k − 1 = 6 − 1 − 1 = 4. The value of F associated with a 5% level of significance and with degrees of freedom 1 and 4 is F0.05,1,4 = 7.71. Fcalculated = 9.09. Reject H0 because 9.09 > 7.71.
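
The full F test can be scripted as well; this sketch continues the running example and assumes SciPy is available for the F distribution's critical value and p-value:

```python
from scipy import stats

MSR = SSR / k                # 15.625
F = MSR / MSE                # about 9.09

F_crit = stats.f.ppf(0.95, k, n - k - 1)   # F(0.05, 1, 4) = 7.71
p_value = stats.f.sf(F, k, n - k - 1)      # about 0.039

reject_H0 = F > F_crit       # True: evidence of a linear relationship
```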

36 Triple A Construction (3 of 3) F-Distribution for Triple A Construction Test for Significance: We can conclude there is a statistically significant relationship between X and Y The r 2 value of 0.69 means about 69% of the variability in sales (Y) is explained by local payroll (X)

37 Analysis of Variance (ANOVA) Table With software models, an ANOVA table is typically created that shows the observed significance level (p-value) for the calculated F value o This can be compared to the level of significance (α) to make a decision Analysis of Variance Table for Regression:

38 ANOVA for Triple A Construction Excel 2016 Output for Triple A Construction Example: P(F > 9.0909) = 0.0394. Because this probability is less than 0.05, we reject the null hypothesis of no linear relationship and conclude there is a linear relationship between X and Y.

49 Multiple Regression Analysis (1 of 2) Extensions to the simple linear model Models with more than one independent variable Y = β 0 + β 1 X 1 + β 2 X 2 + + β k X k + ε Where Y = dependent variable (response variable) X i = i th independent variable (predictor or explanatory variable) β 0 = intercept (value of Y when all X i = 0) β i = coefficient of the i th independent variable k = number of independent variables ε = random error

50 Multiple Regression Analysis (2 of 2) To estimate these values, a sample is taken and the following equation developed: Ŷ = b0 + b1X1 + b2X2 + … + bkXk, where Ŷ = predicted value of Y, b0 = sample intercept (an estimate of β0), bi = sample coefficient of the ith variable (an estimate of βi).
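
A minimal multiple-regression sketch with hypothetical data (the numbers below are made up for illustration): the coefficients b = (b0, b1, ..., bk) can be obtained by least squares on a design matrix whose first column is all 1s.

```python
import numpy as np

# Hypothetical sample: one response y, two predictors x1 and x2.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([3.1, 4.2, 7.0, 7.9, 11.1, 11.8])

# Design matrix: a column of 1s (for b0) followed by the predictors.
A = np.column_stack([np.ones_like(x1), x1, x2])

b, *_ = np.linalg.lstsq(A, y, rcond=None)   # b = (b0, b1, b2)
y_hat = A @ b                               # predicted values
```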

51 Jenny Wilson Realty (1 of 9) Develop a model to determine the suggested listing price for houses based on the size and age of the house: Ŷ = b0 + b1X1 + b2X2, where Ŷ = predicted value of the dependent variable (selling price), b0 = Y intercept, X1 and X2 = values of the two independent variables (square footage and age) respectively, b1 and b2 = slopes for X1 and X2 respectively. o Select a sample of houses that have sold recently and record the data

52 Jenny Wilson Real Estate Data

53 PROGRAMME 4.4A Jenny Wilson Realty (2 of 9) Input Screen for Jenny Wilson Realty Multiple Regression in Excel 2016:

54 PROGRAMME 4.4B Jenny Wilson Realty (3 of 9) Excel 2016 Output Screen for Jenny Wilson Realty Multiple Regression Example: Ŷ = b0 + b1X1 + b2X2 = 146,630.89 + 43.82X1 − 2,898.69X2

55 Evaluating the Multiple Regression Model (1 of 2) Similar to simple linear regression models: the p-value for the F test and r² are interpreted the same way. The hypothesis is different because there is more than one independent variable: the F test investigates whether all the coefficients are equal to 0 at the same time.

56 Evaluating the Multiple Regression Model (2 of 2) To determine which independent variables are significant, tests are performed for each variable: H0: βi = 0, H1: βi ≠ 0. The test statistic is calculated, and if the p-value is lower than the level of significance (α), the null hypothesis is rejected.

57 Jenny Wilson Realty (4 of 9) The full model is statistically significant and useful in predicting selling price: p-value for the F test = 0.002, r² = 0.6719. Are both variables significant? Testing H0: βi = 0 against H1: βi ≠ 0 for each: For X1 (square footage), with α = 0.05, p-value = 0.0013, so the null hypothesis is rejected. For X2 (age), with α = 0.05, p-value = 0.0039, so the null hypothesis is rejected.

58 Jenny Wilson Realty (5 of 9) Both square footage and age are helpful in predicting the selling price!

59 Binary or Dummy Variables Binary (or dummy or indicator) variables are special variables created for qualitative data A dummy variable is assigned a value of 1 if a particular condition is met and a value of 0 otherwise The number of dummy variables must equal one less than the number of categories of the qualitative variable

60 Jenny Wilson Realty (6 of 9) A better model can be developed if information about the condition of the property is included: X3 = 1 if house is in excellent condition, = 0 otherwise; X4 = 1 if house is in mint condition, = 0 otherwise. Two dummy variables are used to describe the three categories of condition. No variable is needed for good condition: if both X3 and X4 = 0, the house must be in good condition.
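
A small sketch of this dummy encoding (the function and category names are illustrative, not from the textbook):

```python
def condition_dummies(condition: str) -> tuple[int, int]:
    """Return (X3, X4) for a three-category condition variable.

    X3 = 1 only for 'excellent'; X4 = 1 only for 'mint'.
    'good' is the baseline: both dummies are 0.
    """
    return (1 if condition == "excellent" else 0,
            1 if condition == "mint" else 0)

print(condition_dummies("excellent"))  # (1, 0)
print(condition_dummies("mint"))       # (0, 1)
print(condition_dummies("good"))       # (0, 0)
```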

61 PROGRAMME 4.5A Jenny Wilson Realty (7 of 9) Input Screen for Jenny Wilson Realty Example:

62 PROGRAMME 4.5B Jenny Wilson Realty (8 of 9) Output Screen for Jenny Wilson Realty Example with Dummy Variables in Excel 2016: Ŷ = 121,658 + 56.43X1 − 3,962X2 + 33,162X3 + 47,369X4

63 PROGRAMME 4.5B Jenny Wilson Realty (9 of 9) Output Screen for Jenny Wilson Realty Example with Dummy Variables in Excel 2016: coefficient of determination r² = 0.898; Ŷ = 121,658 + 56.43X1 − 3,962X2 + 33,162X3 + 47,369X4

64 Model Building (1 of 5) The best model is a statistically significant model with a high r 2 and few variables As more variables are added to the model, the r 2 value increases For this reason, the adjusted r 2 value is often used to determine the usefulness of an additional variable The adjusted r 2 takes into account the number of independent variables in the model

65 Model Building (2 of 5) The formula for r²: r² = SSR/SST = 1 − SSE/SST. The formula for adjusted r²: adjusted r² = 1 − [SSE/(n − k − 1)] / [SST/(n − 1)]. As the number of variables increases, the adjusted r² gets smaller unless the increase due to the new variable is large enough to offset the change in k.
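
The adjustment as a small helper (a direct sketch of the formula above, applied to the Triple A numbers for illustration):

```python
def adjusted_r2(sse: float, sst: float, n: int, k: int) -> float:
    """Adjusted r^2 = 1 - [SSE / (n - k - 1)] / [SST / (n - 1)]."""
    return 1.0 - (sse / (n - k - 1)) / (sst / (n - 1))

# For the Triple A data: plain r^2 vs adjusted r^2 with n = 6, k = 1.
print(1 - 6.875 / 22.5)                # 0.6944...
print(adjusted_r2(6.875, 22.5, 6, 1))  # 0.6180... (penalised for k)
```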

66 Model Building (3 of 5) In general, if a new variable increases the adjusted r 2, it should probably be included in the model!

67 Model Building (4 of 5) Stepwise regression systematically adds or deletes independent variables. A forward stepwise procedure puts the most significant variable in first, then adds the next variable that will improve the model the most. Backward stepwise regression begins with all the independent variables and deletes the least helpful.

68 Model Building (5 of 5) In some cases variables contain duplicate information When two independent variables are correlated, they are said to be collinear When more than two independent variables are correlated, multicollinearity exists When multicollinearity is present, hypothesis tests for the individual coefficients are not valid but the model may still be useful

69 Nonlinear Regression In some situations, variables are not linear Transformations may be used to turn a nonlinear model into a linear model

70 Colonel Motors (1 of 6) Use regression analysis to improve fuel efficiency o Study the impact of weight on miles per gallon (MPG) Automobile Weight Versus MPG:

71 Colonel Motors (2 of 6) Linear Model for MPG Data: Ŷ = b0 + b1X1

72 PROGRAMME 4.6 Colonel Motors (3 of 6) Excel 2016 Output with Linear Regression Model for MPG Data: a useful model, with a small significance level for the F test and a good r² value. Ŷ = 47.6 − 8.2X1, or MPG = 47.6 − 8.2(weight in 1,000 lb.)

73 Colonel Motors (4 of 6) Nonlinear Model for MPG Data:

74 Colonel Motors (5 of 6) The nonlinear model is a quadratic model. The easiest approach is to develop a new variable: X2 = (weight)². The new model: Ŷ = b0 + b1X1 + b2X2
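
A sketch of the quadratic fit (the weight and MPG values below are hypothetical): adding the squared column keeps the model linear in the coefficients, so ordinary least squares still applies.

```python
import numpy as np

weight = np.array([1.9, 2.4, 3.0, 3.4, 3.8, 4.6])      # 1,000 lb (hypothetical)
mpg    = np.array([38.0, 32.0, 26.0, 22.0, 20.0, 15.0])

# X1 = weight, X2 = weight squared: still a linear model in b0, b1, b2.
A = np.column_stack([np.ones_like(weight), weight, weight ** 2])
b, *_ = np.linalg.lstsq(A, mpg, rcond=None)
# Y-hat = b[0] + b[1] * weight + b[2] * weight**2
```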

75 PROGRAMME 4.7 Colonel Motors (6 of 6) Excel 2016 Output with Nonlinear Regression Model for MPG Data: Ŷ = 79.8 − 30.2X1 + 3.4X2. An improved model: small significance level for the F test, and both the adjusted r² and r² increased.

76 Cautions and Pitfalls (1 of 2) If the assumptions are not met, the statistical test may not be valid Correlation does not necessarily mean causation Multicollinearity makes interpreting coefficients problematic, but the model may still be good Using a regression model beyond the range of X is questionable, as the relationship may not hold outside the sample data

77 Cautions and Pitfalls (2 of 2) A t-test for the intercept (b 0 ) may be ignored as this point is often outside the range of the model A linear relationship may not be the best relationship, even if the F test returns an acceptable value A nonlinear relationship can exist even if a linear relationship does not Even though a relationship is statistically significant it may not have any practical value

78 Homework --- Chapter 4 Please read Chapter 5 (pp. 165-202)!

79 Parameters and Statistics

80 Least Squares Method The objective of the scatter diagram is to measure the strength and direction of the linear relationship. Both can be more easily judged by drawing a straight line through the data. Which line best describes the relationship between X and Y?

81 Least Squares Method We need an objective method of producing a straight line. The best line will be one that is closest to the points on the scatterplot The best line is one that minimises the total distance between itself and all the observed data points. Since we oftentimes use regression to predict values of Y from observed values of X, we choose to measure the distance vertically.

82 Least Squares Method We want to find the line that minimises the vertical distance between itself and the observed points on the scatterplot. Here we have 2 different lines that may describe the relationship between X and Y. To determine which one is best, we can find the vertical distances from each point to the line. On that basis, the line on the right is better than the line on the left in describing the relationship between X and Y. (There are an infinite number of possible lines.)

83 Least Squares Method Recall, the slope-intercept equation for a line is expressed in these terms: y = mx + b Where: m is the slope of the line b is the y-intercept. If we have determined there is a linear relationship between two variables with covariance and the coefficient of correlation, can we determine a linear function of the relationship?

84 Least Squares Method We typically rewrite this line as: ŷ = b0 + b1x (read as "y-hat": the fitted regression line!), where the slope b1 = sxy/sx² and the intercept b0 = ȳ − b1x̄ (b0 is read as "b naught").
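
Equivalently in code: the slope is the sample covariance of x and y divided by the sample variance of x (the data below is illustrative).

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])   # illustrative data
y = np.array([3.0, 7.0, 5.0, 9.0])

cov = np.cov(x, y, ddof=1)           # 2x2 sample covariance matrix
b1 = cov[0, 1] / cov[0, 0]           # s_xy / s_x^2
b0 = y.mean() - b1 * x.mean()        # b0 = y_bar - b1 * x_bar
```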

85 Interpretation of b0 and b1

86 Best line: the least-squares, or regression, line. Some of the errors will be positive and some will be negative; the problem is that when we add positive and negative values, they tend to cancel each other out. We define the error to be the difference between an observed coordinate and the prediction line: Error = distance from one point to the line = coordinate − prediction. The coordinate of one point: (xi, yi). Predicted value for a given xi: ŷi = b0 + b1xi. The best line minimises Σ(yi − ŷi)², the sum of the squared errors.

87 Some of the errors will be positive and some will be negative! The problem is that when we add positive and negative values, they tend to cancel each other out. Best line: least-squares, or regression line When we square those error lines, we are literally making squares from those lines. We can visualise this as... So we want to find the regression line that minimises the sum of the areas of these error squares. For this regression line, the sum of the areas of the squares would look like this...

88 Least Squares Method Let's determine the best-fitted line for the following data. (Slides 88-105 step through the calculation; the worked tables are not reproduced here.) The formulas applied throughout are: sx² = Σ(xi − x̄)²/(n − 1), b1 = sxy/sx², b0 = ȳ − b1x̄. Lines of best fit will pivot around the point (x̄, ȳ), which represents the mean of the X and the mean of the Y variables!

106 Line of Best Fit Only for medium to strong correlations...

107 Line of Best Fit

108 Line of Best Fit

109 r = Sample Coefficient of Correlation What line? r measures the closeness of the data to the best line. How best? In terms of least squared error.

110 Interpretation of b0 and b1: ŷi = 9.95 + 2.25xi In a fixed and variable costs model: b0 = 9.95? Intercept: the predicted value of y when x = 0. b1 = 2.25? Slope: the predicted change in y when x increases by 1.

111 Interpretation of b0 and b1: a simple example of a linear equation. A company has fixed costs of $7,000 for plant and equipment and variable costs of $600 for each unit of output. What is total cost at varying levels of output? Let x = units of output and C = total cost. C = fixed cost plus variable cost = 7,000 + 600x.
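
As a one-line function (a direct translation of the cost equation on this slide):

```python
def total_cost(units: int, fixed: float = 7_000.0, variable: float = 600.0) -> float:
    """C = fixed cost + variable cost per unit * units."""
    return fixed + variable * units

print(total_cost(0))    # 7000.0 -- the intercept: cost at zero output
print(total_cost(10))   # 13000.0
```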

112 Interpretation of b0 and b1: ŷi = 9.95 + 2.25xi The slope b1 always has the same sign as r, the correlation coefficient, but they measure different things! The sum of the errors (or residuals), Σ(yi − ŷi), is always 0 (zero). The line always passes through the point (x̄, ȳ).

113 Coefficient of Determination When we introduced the coefficient of correlation we pointed out that, except for −1, 0, and +1, we cannot precisely interpret its meaning. We can judge the coefficient of correlation only in relation to its proximity to −1, 0, and +1. Fortunately, we have another measure that can be precisely interpreted: the coefficient of determination, which is calculated by squaring the coefficient of correlation. For this reason we denote it R². The coefficient of determination measures the amount of variation in the dependent variable that is explained by the variation in the independent variable.

114 Coefficient of Determination The coefficient of determination measures the amount of variation in the dependent variable that is explained by the variation in the independent variable. The coefficient of determination is R 2 = 0.758 This tells us that 75.8% of the variation in electrical costs is explained by the number of tools. The remaining 24.2% is unexplained.

115 Least Squares Method: R²