
Chapter 5: Ordinary Least Squares Estimation Procedure: The Mechanics

Chapter 5 Outline

Best Fitting Line
Clint's Assignment
Simple Regression Model
  o Parameters of the Model
  o Error Term and Random Influences
  o Best Fitting Line
  o Needed: A Systematic Procedure to Determine the Best Fitting Line
Ordinary Least Squares (OLS) Estimation Procedure
  o Sum of Squared Residuals Criterion
  o Finding the Best Fitting Line
Importance of the Error Term
  o Absence of Random Influences: A "What If" Question
  o Presence of Random Influences: Back to Reality
Error Terms and Random Influences: A Closer Look
Standard Ordinary Least Squares (OLS) Premises
Clint's Assignment: The Two Parts

Chapter 5 Prep Questions

1. The following table reports the (disposable) income earned by Americans and their total savings between 1950 and 1975 in billions of dollars:

   Year   Income  Savings    Year   Income  Savings    Year   Income  Savings
   1950    210.1     17.9    1959    350.5     32.9    1968    625.0     67.0
   1951    231.0     22.5    1960    365.4     33.7    1969    674.0     68.8
   1952    243.4     23.9    1961    381.8     39.7    1970    735.7     87.2
   1953    258.6     25.5    1962    405.1     41.8    1971    801.8     99.9
   1954    264.3     24.3    1963    425.1     42.4    1972    869.1     98.5
   1955    283.3     24.5    1964    462.5     51.1    1973    978.3    125.9
   1956    303.0     31.3    1965    498.1     54.3    1974   1071.6    138.2
   1957    319.8     32.9    1966    537.5     56.6    1975   1187.4    153.0
   1958    330.5     34.3    1967    575.3     67.5

   a. Construct a scatter diagram for income and savings. Place income on the horizontal axis and savings on the vertical axis.
   b. Economic theory teaches that savings increases with income. Do these data tend to support this theory?

   c. Using a ruler, draw a straight line through these points to estimate the relationship between savings and income. What equation describes this line?
   d. Using the equation, estimate by how much savings will increase if income increases by $1 billion.

2. Three students are enrolled in Professor Jeff Lord's 8:30 am class. Every week, he gives a short quiz. After returning the quiz, Professor Lord asks his students to report the number of minutes they studied; the students always respond honestly. The minutes studied and the quiz scores for the first quiz appear in the table below:

   Student   Minutes Studied (x)   Quiz Score (y)
   1                  5                  66
   2                 15                  87
   3                 25                  90

   a. Construct a scatter diagram for minutes studied and quiz scores. Place minutes on the horizontal axis and score on the vertical axis.
   b. Ever since first grade, what have your parents and teachers been telling you about the relationship between studying and grades? For the most part, do these data tend to support this theory?
   c. Using a ruler, draw a straight line through these points to estimate the relationship between minutes studied and quiz scores. What equation describes this line?
   d. Using the equation, estimate by how much a student's quiz score would increase if that student studies one additional minute.

3. Recall that the presence of a random variable brings forth both bad news and good news.
   a. What is the bad news?
   b. What is the good news?

4. What is the relative frequency interpretation of probability?

5. Calculus problem: Consider the following equation:

   \[ SSR = (y_1 - b_{Const} - b_x x_1)^2 + (y_2 - b_{Const} - b_x x_2)^2 + (y_3 - b_{Const} - b_x x_3)^2 \]

   Differentiate SSR with respect to \(b_{Const}\) and set the derivative equal to 0:

   \[ \frac{dSSR}{db_{Const}} = 0 \]

   Solve for \(b_{Const}\), and show that

   \[ b_{Const} = \bar{y} - b_x \bar{x} \qquad \text{where } \bar{y} = \frac{y_1 + y_2 + y_3}{3} \text{ and } \bar{x} = \frac{x_1 + x_2 + x_3}{3} \]

6. Again, consider the following equation:

   \[ SSR = (y_1 - b_{Const} - b_x x_1)^2 + (y_2 - b_{Const} - b_x x_2)^2 + (y_3 - b_{Const} - b_x x_3)^2 \]

   Let \(b_{Const} = \bar{y} - b_x \bar{x}\). Substitute the expression for \(b_{Const}\) into the equation for SSR. Show that after the substitution:

   \[ SSR = [(y_1 - \bar{y}) - b_x (x_1 - \bar{x})]^2 + [(y_2 - \bar{y}) - b_x (x_2 - \bar{x})]^2 + [(y_3 - \bar{y}) - b_x (x_3 - \bar{x})]^2 \]

Best Fitting Line

Recall the income and savings data we introduced in the chapter prep questions:

Income and Savings Data: Annual time series data of U.S. disposable income and savings from 1950 to 1975 (income and savings in billions of dollars).

   Year   Income  Savings    Year   Income  Savings    Year   Income  Savings
   1950    210.1     17.9    1959    350.5     32.9    1968    625.0     67.0
   1951    231.0     22.5    1960    365.4     33.7    1969    674.0     68.8
   1952    243.4     23.9    1961    381.8     39.7    1970    735.7     87.2
   1953    258.6     25.5    1962    405.1     41.8    1971    801.8     99.9
   1954    264.3     24.3    1963    425.1     42.4    1972    869.1     98.5
   1955    283.3     24.5    1964    462.5     51.1    1973    978.3    125.9
   1956    303.0     31.3    1965    498.1     54.3    1974   1071.6    138.2
   1957    319.8     32.9    1966    537.5     56.6    1975   1187.4    153.0
   1958    330.5     34.3    1967    575.3     67.5

Table 5.1: U.S. Annual Income and Savings Data, 1950 to 1975

Economic theory suggests that as American households earn more income, they will save more:

Theory: Additional income increases savings.

Project: Assess the effect of income on savings.

Question: How can we use our data to test this theory? That is, how can we assess the effect of income on savings?

Answer: We begin by drawing a scatter diagram of the income-savings data. Each point represents income and savings of a single year. The lower left point represents income and savings for 1950: (210.1, 17.9). The upper right point represents income and savings for 1975: (1187.4, 153.0). Each of the other points represents one of the other years.

Figure 5.1: Income and Savings Scatter Diagram

The data appear to support the theory: as income increases, savings generally increase.

Question: How can we estimate the relationship between income and savings?

Answer: Draw a line through the points that best fits the data; then, use the equation for the best fitting line to estimate the relationship.

Figure 5.2: Income and Savings Scatter Diagram with Best Fitting Line

By choosing two points on this line, we can solve for the equation of the best fitting line. It looks like the points (200, 15) and (1200, 155) are more or less on the line. Let us use these two points to estimate the slope:

\[ \text{Slope} = \frac{\text{Rise}}{\text{Run}} = \frac{155 - 15}{1{,}200 - 200} = \frac{140}{1{,}000} = .14 \]

A little algebra allows us to derive the equation for this line:

\[ \frac{y - 15}{x - 200} = .14 \]
\[ y - 15 = .14x - 28 \]
\[ y = .14x - 13 \]

This equation suggests that if Americans earn an additional $1 of income, savings will rise by an estimated $.14; or equivalently, we estimate that a $1,000 increase in income causes a $140 increase in savings. Since the slope is positive, the data appear to support our theory; additional income appears to increase savings.

Clint's Assignment

Next, consider a second example. Three students are enrolled in Professor Jeff Lord's 8:30 am class. Every week, he gives a short quiz. After returning the quiz, Professor Lord asks his students to report the number of minutes they studied; the students always respond honestly. The minutes studied and the quiz scores for the first quiz appear in Table 5.2:

Student   Minutes Studied (x)   Quiz Score (y)
1                  5                  66
2                 15                  87
3                 25                  90

Table 5.2: First Quiz Results

The theory suggests that a student's score on the quiz depends on the number of minutes he/she studied:

Theory: Additional studying increases quiz scores.

Also, it is generally believed that Professor Lord, a very generous soul, awards students some points just for showing up for a quiz so early in the morning. Our friend Clint has been assigned the problem of assessing the theory. Clint's assignment is to use the data from Professor Lord's first quiz to assess the theory:

Project: Use data from Professor Lord's first quiz to assess the effect of studying on quiz scores.

Simple Regression Model

The following equation allows us to use the simple regression model to assess the theory:

\[ y_t = \beta_{Const} + \beta_x x_t + e_t \]

where
  \(y_t\) = Quiz score received by student t
  \(x_t\) = Number of minutes studied by student t
  \(e_t\) = Error term for student t: random influences
  t = 1, 2, and 3, denoting the three students: Student 1, Student 2, and Student 3

\(y_t\), quiz score, is called the dependent variable and \(x_t\), minutes studied, the explanatory variable. The value of the dependent variable depends on the value of the explanatory variable. Or putting it differently, the value of the explanatory variable explains the value of the dependent variable.

Parameters of the Model

\(\beta_{Const}\) and \(\beta_x\), the constant and coefficient of the equation, are called the parameters of the model. To interpret the parameters, recall that:
  o it is generally believed that Professor Lord gives students some points just for showing up for the quiz;
  o the theory postulates that studying more will improve a student's score.

Using these observations, we can interpret the parameters, \(\beta_{Const}\) and \(\beta_x\):
  o \(\beta_{Const}\) represents the number of points Professor Lord gives students just for showing up.

  o \(\beta_x\) represents the number of additional points earned for an additional minute of studying.

Error Term and Random Influences

\(e_t\) is the error term. The error term reflects all the random influences on student t's quiz score, \(y_t\). For example, if Professor Lord were in an unusually bad humor when he graded one student's quiz, that student's quiz score might be unusually low; this would be reflected by a negative error term. On the other hand, if Professor Lord were in an unusually good humor, the student's score might be unusually high and a positive error term would result. Professor Lord's disposition is not the only source of randomness. For example, one particular student could have just lucked out by correctly anticipating the questions Professor Lord asked. In this case, the student's score would be unusually high and his/her error term would be positive. All these random influences are accounted for by the error term. The error term accounts for all the factors that cannot be determined or anticipated beforehand.

What Is Simple about the Simple Regression Model?

The word "simple" is used to describe the model because the model includes only a single explanatory variable. Obviously, many other factors influence a student's quiz score; the number of minutes studied is only one such factor. However, we must start somewhere. We will begin with the simple regression model. Later we shall move on and introduce multiple regression models to analyze more realistic scenarios in which two or more explanatory variables are used to explain the dependent variable.

Best Fitting Line

Question: How can Clint use the data to assess the effect of studying on quiz scores?

Answer: He begins by drawing a scatter diagram using the data appearing in Table 5.2.

Figure 5.3: Minutes and Scores Scatter Diagram (quiz score, y, on the vertical axis; minutes studied, x, on the horizontal axis; one point for each of the three students)

The data appear to confirm the theory. As minutes studied increase, quiz scores tend to increase.

Question: How can Clint estimate the relationship between minutes studied and the quiz score more precisely?

Answer: Draw a line through the points that best fits the data; then, use the best fitting line's equation to estimate the relationship.
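Turning an eyeballed line into an equation takes only two points on it. As a minimal sketch (the helper name `line_through` is hypothetical, not from the text), the Python below repeats the rise-over-run arithmetic used for the income-savings line, with the points (200, 15) and (1200, 155):

```python
def line_through(p1, p2):
    """Return (intercept, slope) of the straight line through two (x, y) points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("a vertical line has no finite slope")
    slope = (y2 - y1) / (x2 - x1)   # rise over run
    intercept = y1 - slope * x1     # solve y1 = intercept + slope * x1
    return intercept, slope


# Points read off the eyeballed income-savings line earlier in the chapter.
intercept, slope = line_through((200, 15), (1200, 155))
print(f"y = {intercept:.2f} + {slope:.2f}x")
```

The same helper applies to any pair of points read off a hand-drawn line; the result is only as good as the eyeballing that produced the points.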

Figure 5.4: Minutes and Scores Scatter Diagram with Clint's Eyeballed Best Fitting Line

Clint's effort to eyeball the best fitting line appears in Figure 5.4. By choosing two points on this line, Clint can solve for the equation of his best fitting line. It looks like the points (0, 60) and (20, 90) are more or less on the line. He can use these two points to estimate the slope:

\[ \text{Slope} = \frac{\text{Rise}}{\text{Run}} = \frac{90 - 60}{20 - 0} = \frac{30}{20} = 1.5 \]

Next, Clint can use a little algebra to derive the equation for the line:

\[ \frac{y - 60}{x - 0} = 1.5 \]
\[ y - 60 = 1.5x \]
\[ y = 60 + 1.5x \]

This equation suggests that an additional minute of studying increases a student's score by 1.5 points.

Needed: A Systematic Procedure to Determine the Best Fitting Line

Let us compare the two examples we introduced. In the income-savings case, the points were clustered tightly around our best fitting line. Two different individuals might not eyeball the identical best fitting line, but the difference would be slight. In the minutes-scores case, however, the points are not clustered nearly so tightly. Two individuals could eyeball the best fitting line very differently; therefore, two individuals could derive substantially different equations for the best fitting line and would then report very different estimates of the effect that studying has on quiz scores. Consequently, we need a systematic procedure to determine the best fitting line. Furthermore, once we determine the best fitting line, we need to decide how confident we should be in the theory. We will now address two issues:
  o What systematic procedure should we use to determine the best fitting line for the data?
  o In view of the best fitting line, how much confidence should we have in the theory's validity?

Ordinary Least Squares (OLS) Estimation Procedure

The ordinary least squares (OLS) estimation procedure is the most widely used estimation procedure to determine the equation for the line that best fits the data. Its popularity results from two factors. The procedure:
  o is computationally straightforward; it provides us (and computer software) with a relatively easy way to estimate the regression model's parameters, the constant and slope of the best fitting line.
  o possesses several desirable properties when the error term meets certain conditions.

This chapter focuses on the computational aspects of the ordinary least squares (OLS) estimation procedure. In Chapter 6 we turn to the properties of the estimation procedure.

We begin our study of the ordinary least squares (OLS) estimation procedure by introducing a little notation. We must distinguish between the actual values of the parameters and the estimates of the parameters. We have used the Greek letter beta, \(\beta\), to denote the actual values. Recall the original model:

\[ y_t = \beta_{Const} + \beta_x x_t + e_t \]

\(\beta_{Const}\) denotes the actual constant and \(\beta_x\) the actual coefficient.

We shall use Roman italicized b's to denote the estimates. \(b_{Const}\) denotes the estimate of the constant for the best fitting line and \(b_x\) denotes the estimate of the coefficient for the best fitting line. That is, the equation for the best fitting line is:

\[ y = b_{Const} + b_x x \]

The constant and slope of the best fitting line, \(b_{Const}\) and \(b_x\), estimate the values of \(\beta_{Const}\) and \(\beta_x\).

Sum of Squared Residuals Criterion

The ordinary least squares (OLS) estimation procedure chooses \(b_{Const}\) and \(b_x\) so as to minimize the sum of the squared residuals. We shall now use our example to illustrate precisely what this means. We begin by introducing an equation for each student's estimated score: \(Esty_1\), \(Esty_2\), and \(Esty_3\).

\[ Esty_1 = b_{Const} + b_x x_1 \qquad Esty_2 = b_{Const} + b_x x_2 \qquad Esty_3 = b_{Const} + b_x x_3 \]

\(Esty_1\), \(Esty_2\), and \(Esty_3\) estimate the score received by students 1, 2, and 3 based on the estimated constant, \(b_{Const}\), the estimated coefficient, \(b_x\), and the number of minutes each student studies, \(x_1\), \(x_2\), and \(x_3\).

The difference between a student's actual score, \(y_t\), and his/her estimated score, \(Esty_t\), is called the residual, \(Res_t\):

\[ Res_1 = y_1 - Esty_1 \qquad Res_2 = y_2 - Esty_2 \qquad Res_3 = y_3 - Esty_3 \]

Substituting for each student's estimated score:

\[ Res_1 = y_1 - b_{Const} - b_x x_1 \qquad Res_2 = y_2 - b_{Const} - b_x x_2 \qquad Res_3 = y_3 - b_{Const} - b_x x_3 \]

Next, we square each residual and add them together to compute the sum of squared residuals, SSR:

\[ SSR = Res_1^2 + Res_2^2 + Res_3^2 = (y_1 - b_{Const} - b_x x_1)^2 + (y_2 - b_{Const} - b_x x_2)^2 + (y_3 - b_{Const} - b_x x_3)^2 \]

We can generalize the sum of squared residuals by considering a sample size of T:

\[ SSR = \sum_{t=1}^{T} Res_t^2 = \sum_{t=1}^{T} (y_t - b_{Const} - b_x x_t)^2 \qquad \text{where } T = \text{sample size} \]

\(b_{Const}\) and \(b_x\) are chosen to minimize the sum of squared residuals. The following equations for \(b_{Const}\) and \(b_x\) accomplish this:

\[ b_{Const} = \bar{y} - b_x \bar{x} \qquad b_x = \frac{\sum_{t=1}^{T} (y_t - \bar{y})(x_t - \bar{x})}{\sum_{t=1}^{T} (x_t - \bar{x})^2} \]

To justify the equations, consider a sample size of 3:

\[ SSR = (y_1 - b_{Const} - b_x x_1)^2 + (y_2 - b_{Const} - b_x x_2)^2 + (y_3 - b_{Const} - b_x x_3)^2 \]

Finding the Best Fitting Line

First, focus on \(b_{Const}\). Differentiate the sum of squared residuals, SSR, with respect to \(b_{Const}\) and set the derivative equal to 0:

\[ \frac{dSSR}{db_{Const}} = -2(y_1 - b_{Const} - b_x x_1) - 2(y_2 - b_{Const} - b_x x_2) - 2(y_3 - b_{Const} - b_x x_3) = 0 \]

Dividing by -2:

\[ (y_1 - b_{Const} - b_x x_1) + (y_2 - b_{Const} - b_x x_2) + (y_3 - b_{Const} - b_x x_3) = 0 \]

Collecting like terms:

\[ (y_1 + y_2 + y_3) + (-b_{Const} - b_{Const} - b_{Const}) + (-b_x x_1 - b_x x_2 - b_x x_3) = 0 \]

Simplifying:

\[ (y_1 + y_2 + y_3) - 3 b_{Const} - b_x (x_1 + x_2 + x_3) = 0 \]

Dividing by 3:

\[ \frac{y_1 + y_2 + y_3}{3} - b_{Const} - b_x \frac{x_1 + x_2 + x_3}{3} = 0 \]

Since \(\frac{y_1 + y_2 + y_3}{3}\) equals the mean of y, \(\bar{y}\), and \(\frac{x_1 + x_2 + x_3}{3}\) equals the mean of x, \(\bar{x}\):

\[ \bar{y} - b_{Const} - b_x \bar{x} = 0 \]
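The first-order condition just derived can also be checked numerically. The sketch below is an illustration only (the function names `mean` and `dssr_dbconst` are assumptions of this example); it uses the quiz data from Table 5.2 and tries several arbitrary slopes, setting the constant to the mean of y minus the slope times the mean of x each time:

```python
x = [5, 15, 25]    # minutes studied (Table 5.2)
y = [66, 87, 90]   # quiz scores (Table 5.2)


def mean(values):
    return sum(values) / len(values)


def dssr_dbconst(b_const, b_x):
    """Derivative of SSR with respect to b_const: -2 times the sum of the residuals."""
    return -2 * sum(yt - b_const - b_x * xt for xt, yt in zip(x, y))


# Whatever slope is tried, choosing b_const = mean(y) - b_x * mean(x)
# should drive the derivative to zero (up to floating-point rounding).
for b_x in (0.0, 1.0, 1.5, 2.0):
    b_const = mean(y) - b_x * mean(x)
    print(b_x, b_const, dssr_dbconst(b_const, b_x))
```

Note that the condition holds for any slope; it pins down the constant given the slope, not the slope itself.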

Figure 5.5: Minutes and Scores Scatter Diagram with OLS Best Fitting Line (the line \(y = b_{Const} + b_x x\) passes through the point of means, \((\bar{x}, \bar{y}) = (15, 81)\))

Our first equation, our equation for \(b_{Const}\), is now justified. To minimize the sum of squared residuals, the following relationship must be met:

\[ \bar{y} = b_{Const} + b_x \bar{x} \qquad \text{or} \qquad b_{Const} = \bar{y} - b_x \bar{x} \]

As illustrated in Figure 5.5, this equation simply says that the best fitting line must pass through the point \((\bar{x}, \bar{y})\), the point representing the mean of x, minutes studied, and the mean of y, the quiz scores. It is easy to calculate the means:

\[ \bar{x} = \frac{x_1 + x_2 + x_3}{3} = \frac{5 + 15 + 25}{3} = \frac{45}{3} = 15 \]
\[ \bar{y} = \frac{y_1 + y_2 + y_3}{3} = \frac{66 + 87 + 90}{3} = \frac{243}{3} = 81 \]

The best fitting line passes through the point (15, 81).

Next, we shall justify the equation for \(b_x\). Reconsider the equation for the sum of squared residuals and substitute \(\bar{y} - b_x \bar{x}\) for \(b_{Const}\):

\[ SSR = (y_1 - b_{Const} - b_x x_1)^2 + (y_2 - b_{Const} - b_x x_2)^2 + (y_3 - b_{Const} - b_x x_3)^2 \]

Substituting \(\bar{y} - b_x \bar{x}\) for \(b_{Const}\):

\[ = [y_1 - (\bar{y} - b_x \bar{x}) - b_x x_1]^2 + [y_2 - (\bar{y} - b_x \bar{x}) - b_x x_2]^2 + [y_3 - (\bar{y} - b_x \bar{x}) - b_x x_3]^2 \]

Simplifying each of the three terms:

\[ = [y_1 - \bar{y} + b_x \bar{x} - b_x x_1]^2 + [y_2 - \bar{y} + b_x \bar{x} - b_x x_2]^2 + [y_3 - \bar{y} + b_x \bar{x} - b_x x_3]^2 \]

Switching the order of the \(b_x\) terms within each of the three squared terms:

\[ = [y_1 - \bar{y} - b_x x_1 + b_x \bar{x}]^2 + [y_2 - \bar{y} - b_x x_2 + b_x \bar{x}]^2 + [y_3 - \bar{y} - b_x x_3 + b_x \bar{x}]^2 \]

Factoring out \(b_x\) within each of the three squared terms:

\[ = [(y_1 - \bar{y}) - b_x (x_1 - \bar{x})]^2 + [(y_2 - \bar{y}) - b_x (x_2 - \bar{x})]^2 + [(y_3 - \bar{y}) - b_x (x_3 - \bar{x})]^2 \]

To minimize the sum of squared residuals, differentiate SSR with respect to \(b_x\) and set the derivative equal to 0:

\[ \frac{dSSR}{db_x} = -2[(y_1 - \bar{y}) - b_x (x_1 - \bar{x})](x_1 - \bar{x}) - 2[(y_2 - \bar{y}) - b_x (x_2 - \bar{x})](x_2 - \bar{x}) - 2[(y_3 - \bar{y}) - b_x (x_3 - \bar{x})](x_3 - \bar{x}) = 0 \]

Dividing by -2:

\[ [(y_1 - \bar{y}) - b_x (x_1 - \bar{x})](x_1 - \bar{x}) + [(y_2 - \bar{y}) - b_x (x_2 - \bar{x})](x_2 - \bar{x}) + [(y_3 - \bar{y}) - b_x (x_3 - \bar{x})](x_3 - \bar{x}) = 0 \]

Simplifying the expression:

\[ (y_1 - \bar{y})(x_1 - \bar{x}) - b_x (x_1 - \bar{x})^2 + (y_2 - \bar{y})(x_2 - \bar{x}) - b_x (x_2 - \bar{x})^2 + (y_3 - \bar{y})(x_3 - \bar{x}) - b_x (x_3 - \bar{x})^2 = 0 \]

Moving all terms containing \(b_x\) to the right side:

\[ (y_1 - \bar{y})(x_1 - \bar{x}) + (y_2 - \bar{y})(x_2 - \bar{x}) + (y_3 - \bar{y})(x_3 - \bar{x}) = b_x (x_1 - \bar{x})^2 + b_x (x_2 - \bar{x})^2 + b_x (x_3 - \bar{x})^2 \]

Factoring out \(b_x\) from the right side terms:

\[ (y_1 - \bar{y})(x_1 - \bar{x}) + (y_2 - \bar{y})(x_2 - \bar{x}) + (y_3 - \bar{y})(x_3 - \bar{x}) = b_x [(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + (x_3 - \bar{x})^2] \]

Solving for \(b_x\):

\[ b_x = \frac{(y_1 - \bar{y})(x_1 - \bar{x}) + (y_2 - \bar{y})(x_2 - \bar{x}) + (y_3 - \bar{y})(x_3 - \bar{x})}{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + (x_3 - \bar{x})^2} \]

Now, let us generalize this to a sample size of T:

\[ b_x = \frac{\sum_{t=1}^{T} (y_t - \bar{y})(x_t - \bar{x})}{\sum_{t=1}^{T} (x_t - \bar{x})^2} \]

Therefore, we have justified our second equation.

Let us return to Professor Lord's first quiz to calculate the constant and slope, \(b_{Const}\) and \(b_x\), of the ordinary least squares (OLS) best fitting line for the first quiz's data. We have already computed the means for the quiz scores and minutes studied:

\[ \bar{x} = \frac{x_1 + x_2 + x_3}{3} = \frac{5 + 15 + 25}{3} = \frac{45}{3} = 15 \]
\[ \bar{y} = \frac{y_1 + y_2 + y_3}{3} = \frac{66 + 87 + 90}{3} = \frac{243}{3} = 81 \]

Now, for each student calculate the deviation of y from its mean and the deviation of x from its mean:

Student   \(y_t\)   \(\bar{y}\)   \(y_t - \bar{y}\)   \(x_t\)   \(\bar{x}\)   \(x_t - \bar{x}\)
1         66        81            -15                 5         15            -10
2         87        81              6                 15        15              0
3         90        81              9                 25        15             10

Next, for each student calculate the product of the y and x deviations and the squared x deviation:

Student   \((y_t - \bar{y})(x_t - \bar{x})\)   \((x_t - \bar{x})^2\)
1         (-15)(-10) = 150                      (-10)^2 = 100
2         (6)(0) = 0                            (0)^2 = 0
3         (9)(10) = 90                          (10)^2 = 100
          Sum = 240                             Sum = 200

\(b_x\) equals the sum of the products of the y and x deviations divided by the sum of the squared x deviations:

\[ b_x = \frac{\sum_{t=1}^{T} (y_t - \bar{y})(x_t - \bar{x})}{\sum_{t=1}^{T} (x_t - \bar{x})^2} = \frac{240}{200} = \frac{6}{5} = 1.2 \]
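The deviation arithmetic in the tables above translates directly into code. The following is a minimal sketch rather than a definitive implementation (the function name `ols_simple` is an assumption of this example); it computes the coefficient estimate from the two sums and then the constant from \(b_{Const} = \bar{y} - b_x \bar{x}\):

```python
def ols_simple(x, y):
    """Return (b_const, b_x) for the least squares line y = b_const + b_x * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of the products of the y and x deviations (240 for the quiz data).
    sum_products = sum((yt - y_bar) * (xt - x_bar) for xt, yt in zip(x, y))
    # Sum of the squared x deviations (200 for the quiz data).
    sum_squares = sum((xt - x_bar) ** 2 for xt in x)
    if sum_squares == 0:
        raise ValueError("x must vary across observations")
    b_x = sum_products / sum_squares
    b_const = y_bar - b_x * x_bar   # the line passes through (x_bar, y_bar)
    return b_const, b_x


minutes = [5, 15, 25]
scores = [66, 87, 90]
b_const, b_x = ols_simple(minutes, scores)
print(b_const, b_x)
```

The guard against a zero denominator reflects that the formula for \(b_x\) is undefined when every observation has the same x value.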

Figure 5.6: Minutes and Scores Scatter Diagram with OLS Best Fitting Line (OLS estimate: y = 63 + 1.2x, passing through the point of means, \((\bar{x}, \bar{y}) = (15, 81)\))

To calculate \(b_{Const}\), recall that the best fitting line passes through the point representing the average value of x and y, \((\bar{x}, \bar{y})\):

\[ \bar{y} = b_{Const} + b_x \bar{x} \]

Solving for \(b_{Const}\):

\[ b_{Const} = \bar{y} - b_x \bar{x} \]

We just learned that \(b_x\) equals 6/5. The average of the x's, \(\bar{x}\), equals 15 and the average of the y's, \(\bar{y}\), equals 81. Substituting:

\[ b_{Const} = 81 - \frac{6}{5} \times 15 = 81 - 18 = 63 \]

Using the ordinary least squares (OLS) estimation procedure, the best fitting line for Professor Lord's first quiz is:

\[ y = 63 + \frac{6}{5} x = 63 + 1.2x \]

Consequently, the least squares estimates for \(\beta_{Const}\) and \(\beta_x\) are 63 and 1.2. These estimates suggest that Professor Lord gives each student 63 points just for showing up; each minute studied earns the student 1.2 additional points. Based on the regression we estimate that:
  o 1 additional minute studied increases the quiz score by 1.2 points.
  o 2 additional minutes studied increase the quiz score by 2.4 points.
  o etc.

Let us now quickly calculate the sum of squared residuals for the best fitting line:

Student   x_t   y_t   Esty_t = 63 + (6/5)x_t = 63 + 1.2x_t             Res_t = y_t - Esty_t   Res_t^2
1           5    66   63 + (6/5)(5)  = 63 + 6          = 69            66 - 69 = -3             9
2          15    87   63 + (6/5)(15) = 63 + 6(3) = 63 + 18 = 81        87 - 81 =  6            36
3          25    90   63 + (6/5)(25) = 63 + 6(5) = 63 + 30 = 93        90 - 93 = -3             9
                                                                       SSR = 54

The sum of squared residuals for the best fitting line is 54.

Econometrics Lab 5.1: Finding the Ordinary Least Squares (OLS) Estimates

We can use our Econometrics Lab to emphasize how the ordinary least squares (OLS) estimation procedure determines the best fitting line by accessing the Best Fit simulation.

[Link to MIT-Lab 5.1 goes here.]

By default the data from Professor Lord's first quiz are specified: the values of x and y for the first student are 5 and 66, for the second student 15 and 87, and for the third student 25 and 90:

Figure 5.7: Best Fitting Line Simulation Data

Now, click Go. A new screen appears as shown in Figure 5.8 with two slider bars, one for the constant and one for the coefficient.

Figure 5.8: Best Fitting Line Simulation Parameter Estimates

By default the constant and coefficient values are 63 and 1.2, the ordinary least squares (OLS) estimates. Also, the arithmetic used to calculate the sum of squared residuals is displayed. When the constant equals 63 and the coefficient equals 1.2, the sum of squared residuals equals 54.00; this is just the value that we calculated. Next, experiment with different values for the constant and coefficient by moving the two sliders. Convince yourself that the equations we used to calculate the estimates for the constant and coefficient indeed minimize the sum of squared residuals.

Software and the Ordinary Least Squares (OLS) Estimation Procedure

Fortunately, we do not have to trudge through the laborious arithmetic to compute the ordinary least squares (OLS) estimates. Statistical software can do the work for us.

Professor Lord's First Quiz Data: Cross section data of minutes studied and quiz scores in the first quiz for the 3 students enrolled in Professor Lord's class.

Student   Minutes Studied (x)   Quiz Score (y)
1                  5                  66
2                 15                  87
3                 25                  90

Table 5.3: First Quiz Results

[Link to MIT-Quiz1.wf1 goes here.]

Getting Started in EViews

We can use the statistical package EViews to perform the calculations. After opening the workfile in EViews:
  o In the Workfile window: Click on the dependent variable, y, first; and then, click on the explanatory variable, x, while depressing the <Ctrl> key.
  o In the Workfile window: Double click on a highlighted variable.
  o In the Workfile window: Click Open Equation.
  o In the Equation Specification window: Click OK. (This window previews the regression that will be run; note that the dependent variable, y, is the first variable listed, followed by two expressions representing the explanatory variable, x, and the constant, c.)

Do not forget to close the workfile.
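The arithmetic behind the simulation's sliders can also be reproduced outside the lab and EViews. The Python sketch below is illustrative only (the function name `ssr` is an assumption of this example); it evaluates the sum of squared residuals for the OLS line and for Clint's eyeballed line, y = 60 + 1.5x, and then nudges the constant and coefficient away from the OLS values the way the sliders do:

```python
def ssr(b_const, b_x, x, y):
    """Sum of squared residuals for the line y = b_const + b_x * x."""
    return sum((yt - (b_const + b_x * xt)) ** 2 for xt, yt in zip(x, y))


minutes = [5, 15, 25]
scores = [66, 87, 90]

ols = ssr(63, 1.2, minutes, scores)         # the OLS line
eyeballed = ssr(60, 1.5, minutes, scores)   # Clint's eyeballed line
print(ols, eyeballed)

# Nudge the constant and coefficient, as the sliders do; none of these
# nearby lines should produce a smaller SSR than the OLS line.
for d_const in (-1, 0, 1):
    for d_x in (-0.1, 0, 0.1):
        candidate = ssr(63 + d_const, 1.2 + d_x, minutes, scores)
        assert candidate >= ols - 1e-9
```

A grid of nudges is only a spot check, not a proof; the calculus argument earlier in the chapter is what establishes that the OLS estimates minimize the sum of squared residuals.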

Ordinary Least Squares (OLS)
Dependent Variable: y

Explanatory Variable(s):   Estimate    SE         t-Statistic   Prob
x                          1.200000    0.519615   2.309401      0.2601
Const                      63.00000    8.874120   7.099296      0.0891

Number of Observations     3
Sum Squared Residuals      54.00000

Estimated Equation: Esty = 63 + 1.2x

Interpretation of Estimates:
  \(b_{Const}\) = 63: Students receive 63 points for showing up.
  \(b_x\) = 1.2: Students receive 1.2 additional points for each additional minute studied.

Critical Result: The coefficient estimate equals 1.2. The positive sign of the coefficient estimate suggests that additional studying increases quiz scores. This evidence lends support to our theory.

Table 5.4: OLS First Quiz Regression Results

Table 5.4 reports the values of the coefficient and constant for the best fitting line. Note that the sum of squared residuals for the best fitting line is also included.

Importance of the Error Term

Recall the regression model:

\[ y_t = \beta_{Const} + \beta_x x_t + e_t \]

where
  \(y_t\) = Quiz score of student t
  \(x_t\) = Minutes studied by student t
  \(e_t\) = Error term for student t

The parameters of the model, the constant, \(\beta_{Const}\), and the coefficient, \(\beta_x\), represent the actual:
  o number of points Professor Lord gives students just for showing up, \(\beta_{Const}\);
  o additional points earned for each minute of study, \(\beta_x\).

Obviously, the parameters of the model play an important role, but what about the error term, \(e_t\)? To illustrate the importance of the error term, suppose that somehow we know the values of \(\beta_{Const}\) and \(\beta_x\). For the moment, suppose that \(\beta_{Const}\), the actual constant, equals 50 and \(\beta_x\), the actual coefficient, equals 2. In words, this means that Professor Lord gives each student 50 points for showing up; furthermore, each minute of study provides the student with 2 additional points. Consequently, the regression model is:

\[ y_t = 50 + 2x_t + e_t \]

NB: In the real world, we never know the actual values of the constant and coefficient. We are assuming that we do here, just to illustrate the importance of the error term.

The error term reflects all the factors that cannot be anticipated or determined before the quiz is given; that is, the error term represents all random influences. In the absence of random influences, the error terms would equal 0.

Absence of Random Influences: A "What If" Question

[Figure 5.9: Best Fitting Line with No Error Term. Scatter diagram of Score (y) against Minutes (x); Students 1, 2, and 3 all lie on the line Actual: y = 50 + 2x.]

Assume, only for the moment, that there are no random influences; consequently, each error term would equal 0. While this assumption is unrealistic, it allows us to appreciate the important role played by the error term. Focus on the first student taking Professor Lord's first quiz. The first student studies for 5 minutes. In the absence of random influences (that is, if $e_1$ equaled 0), what score would the first student receive on the quiz? The answer is 60.
$y_1 = 50 + 2 \times 5 + 0 = 50 + 10 = 60$
Next, consider the second student. The second student studies for 15 minutes. In the absence of random influences, the second student would receive an 80 on the quiz:

$y_2 = 50 + 2 \times 15 + 0 = 50 + 30 = 80$
The third student would receive a 100:
$y_3 = 50 + 2 \times 25 + 0 = 50 + 50 = 100$
We summarize this in Table 5.5:

Absence of Random Influences
Student   Minutes (x)   Score (y)
1         5             60
2         15            80
3         25            100
Table 5.5: Quiz Results with No Random Influences (No Error Term)

In the absence of random influences, the intercept and slope of the best fitting line would equal the actual constant and the actual coefficient, $\beta_{Const}$ and $\beta_x$:
$y = \beta_{Const} + \beta_x x = 50 + 2x$

Summary: In the absence of random influences, the error term of each student equals 0 and the best fitting line fits the data perfectly. The slope of this line equals 2, the actual coefficient, and the vertical intercept of the line equals 50, the actual constant. Without random influences, it is easy to determine the actual constant and coefficient.

We shall now use a simulation to emphasize this point.

Econometrics Lab 5.2: Coefficient Estimates When Random Influences Are Absent

[Figure 5.10: Coefficient Estimate Simulation]

The simulation allows us to do something we cannot do in the real world. It allows us to specify the actual values of the constant and coefficient in the model; that is, we can select $\beta_{Const}$ and $\beta_x$. We can specify the
• number of points Professor Lord gives students just for showing up, $\beta_{Const}$; by default, $\beta_{Const}$ is set at 50.
• additional points earned for an additional minute of study, $\beta_x$; by default, $\beta_x$ is set at 2.
Consequently, the regression model is:
$y_t = 50 + 2x_t + e_t$

Each repetition of the simulation represents a quiz from a single week. In each repetition, the simulation:
• Calculates the score for each student based on the actual constant ($\beta_{Const}$), the actual coefficient ($\beta_x$), and the number of minutes the student studied; then, to be realistic, the simulation can add a random influence in the form of the error term, $e_t$. An error term is included whenever the Err Term checkbox is checked.
• Applies the ordinary least squares (OLS) estimation procedure to estimate the coefficient.
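The two steps of a repetition can be sketched in Python. This is a rough stand-in for the lab, not its actual implementation: the function names are illustrative, and the error terms are assumed here to be normally distributed with mean 0 and variance 500, which the description of the simulation does not specify.

```python
import random

BETA_CONST = 50.0        # actual constant (simulation default in the text)
BETA_X = 2.0             # actual coefficient (simulation default in the text)
MINUTES = [5, 15, 25]    # minutes studied by the three students


def estimate_coefficient(xs, ys):
    """OLS slope estimate for a simple regression of ys on xs."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    return s_xy / s_xx


def run_repetition(rng, err_term, err_variance=500.0):
    """Simulate one quiz and return the coefficient estimate.

    err_term mirrors the Err Term checkbox. The normal distribution and
    the variance of 500 are assumptions of this sketch.
    """
    sd = err_variance ** 0.5
    scores = []
    for x in MINUTES:
        e = rng.gauss(0.0, sd) if err_term else 0.0
        scores.append(BETA_CONST + BETA_X * x + e)
    return estimate_coefficient(MINUTES, scores)


rng = random.Random(0)  # fixed seed so the run is repeatable
no_error = [run_repetition(rng, err_term=False) for _ in range(4)]
with_error = [run_repetition(rng, err_term=True) for _ in range(4)]
print("No error term:  ", [round(b, 1) for b in no_error])
print("With error term:", [round(b, 1) for b in with_error])
```

Without the error term, every repetition returns a coefficient estimate of exactly 2.0; with the error term, the estimates vary from one repetition to the next.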

When the Pause box is checked, the simulation stops after each repetition; when it is cleared, quizzes are simulated repeatedly until the Stop button is clicked.

[Link to MIT-Lab 5.2 goes here.]

We can eliminate random influences by clearing the Err Term box. After doing so, click Start and then click Continue a few times. We discover that in the absence of random influences the estimate of the coefficient value always equals the actual value, 2:

Repetition   Coefficient Estimate: No Error Term
1            2.0
2            2.0
3            2.0
4            2.0
Table 5.6: Simulation Results with No Random Influences (No Error Term)

This is precisely what we concluded earlier from the scatter diagram. In the absence of random influences, the best fitting line fits the data perfectly. The best fitting line's slope equals the actual value of the coefficient.

Presence of Random Influences: Back to Reality

The real world is not that simple, however; random influences play an important role. In the real world, random influences are inevitably present:

Inclusion of Random Influences
Student   Minutes (x)   Score (y)
1         5             66
2         15            87
3         25            90
Table 5.7: Quiz Results with Random Influences (with Error Term)

In Figure 5.11, the actual scores on the first quiz have been added to the scatter diagram. As a consequence of the random influences, Students 1 and 2 overperform while Student 3 underperforms.
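If the actual constant and coefficient really were 50 and 2, as the text supposes for illustration, each student's error term could be backed out of the regression model by subtracting the score the student would have received with no random influences. A short Python sketch of that subtraction, with illustrative variable names:

```python
BETA_CONST = 50.0   # hypothetical actual constant assumed in the text
BETA_X = 2.0        # hypothetical actual coefficient assumed in the text

minutes = [5, 15, 25]   # x (Table 5.7)
scores = [66, 87, 90]   # y (Table 5.7)

# Rearranging y_t = beta_Const + beta_x * x_t + e_t gives the error term.
error_terms = [y - (BETA_CONST + BETA_X * x) for x, y in zip(minutes, scores)]

for student, e in enumerate(error_terms, start=1):
    direction = "above" if e > 0 else "below"
    print(f"Student {student}: e = {e:+.0f} ({direction} the actual line)")
```

The error terms come out as 6, 7, and -10: the first two students score above the line y = 50 + 2x and the third scores below it. In the real world this calculation is not available, because the actual constant and coefficient are never known.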

[Figure 5.11: Scatter Diagram with Error Term. Score (y) against Minutes (x); the observed scores of Students 1, 2, and 3 are plotted together with the line Actual: y = 50 + 2x.]

As illustrated in Figure 5.12, when random influences are present, we cannot expect the intercept and slope of the best fitting line to equal the actual constant and the actual coefficient. The intercept and slope of the best fitting line, $b_{Const}$ and $b_x$, are affected by the random influences.

[Figure 5.12: OLS Best Fitting Line with Error Term. Score (y) against Minutes (x); the observed scores of Students 1, 2, and 3 are plotted together with the lines OLS Estimate: y = 63 + 1.2x and Actual: y = 50 + 2x.]

Consequently, the intercept and slope of the best fitting line, $b_{Const}$ and $b_x$, are themselves random variables. Even if we knew the actual constant and slope, that is, if we knew the actual values of $\beta_{Const}$ and $\beta_x$, we could not predict the values of the constant and slope of the best fitting line, $b_{Const}$ and $b_x$, with certainty before the quiz was given.

Econometrics Lab 5.3: Coefficient Estimates When Random Influences Are Present

We shall now use the Coefficient Estimate simulation to emphasize this point. We shall show that in the presence of random influences, the coefficient of the best fitting line is a random variable.

[Link to MIT-Lab 5.3 goes here.]

Note that the error term checkbox is now checked to include the error term. Be certain that the Pause checkbox is checked and then click Start. When the simulation computes the best fitting line, the estimated value of the coefficient typically is not 2 despite the fact that the actual value of the coefficient is 2. Click the Continue button a few more times to simulate each successive week's quiz.

What do you observe? We simply cannot expect the coefficient estimate to equal the actual value of the coefficient. In fact, when random influences are present, the coefficient estimate almost never equals the actual value of the coefficient. Sometimes the estimate is less than the actual value, 2, and sometimes it is greater than the actual value. When random influences are present, the coefficient estimates are random variables:

Repetition   Coefficient Estimate: With Error Term
1            1.8
2            1.6
3            3.2
4            1.9
Table 5.8: Simulation Results with Random Influences (with Error Term)

While your coefficient estimates will no doubt differ from the ones in Table 5.8, one thing is clear. Even if we know the actual value of the coefficient, as we do in the simulation, we cannot predict with certainty the value of the estimate from one repetition. Our last two simulations illustrate a critical point: The coefficient estimate is a random variable as a consequence of the random influences introduced by each student's error term.

Error Terms and Random Influences: A Closer Look

We shall now use a simulation to gain insights into random influences and error terms. As we know, random influences are those factors that cannot be anticipated or determined beforehand. Sometimes random influences lead to a higher quiz score and other times they lead to a lower score. The error terms embody these random influences:
• Sometimes the error term is positive, indicating that the score is "higher than usual";
• Other times the error term is negative, indicating that the score is "lower than usual."

If the random influences are indeed random, they should be a "wash" after many, many quizzes. That is, random influences should not systematically lead to higher or lower quiz scores. In other words, if the error terms truly reflect random influences, they should average out to 0 in the long run.
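The "wash" idea can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: one student's error term is drawn from a normal distribution with mean 0 and variance 500 (the distribution is assumed for illustration), and 100,000 draws stand in for "many, many quizzes."

```python
import random

rng = random.Random(1)          # fixed seed so the run is repeatable
ERR_SD = 500.0 ** 0.5           # assumed: variance 500, normal distribution
N_QUIZZES = 100_000             # stands in for "many, many quizzes"

# One error term per simulated quiz for a single student.
errors = [rng.gauss(0.0, ERR_SD) for _ in range(N_QUIZZES)]

mean_error = sum(errors) / N_QUIZZES
share_positive = sum(e > 0 for e in errors) / N_QUIZZES
print(f"mean error term: {mean_error:.3f}")
print(f"share positive:  {share_positive:.3f}")
```

Any single draw can be far from 0, but the average over many draws should be close to 0 and roughly half of the draws should be positive.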
Econometrics Lab 5.4: Error Terms When Random Influences Are Present Let us now check to be certain that the simulations are capturing random influences properly by accessing the Random Influence Error Terms simulation. [Link to MIT-Lab 5.4 goes here.]

[Figure 5.13: Error Term Simulation]

Initially, the Pause checkbox is checked and the error term variance is 500. Now, click Start and observe that the simulation reports the numerical value of the error term for each of the three students. Record these three values. Also, note that the simulation constructs a histogram for each student's error term and also reports the mean and variance. Click Continue again to observe the numerical values of the error terms for the second quiz. Confirm that the simulation is calculating the mean and variance of each student's error terms correctly. Click Continue a few more times. Note that the error terms are indeed random variables. Before the quiz is given, we cannot predict the numerical value of a student's error term. Each student's histogram shows that sometimes the error term for that student is positive and sometimes it is negative. Next, clear the Pause checkbox and click Continue. After many, many repetitions, click Stop.

            Student 1   Student 2   Student 3
Mean        0.0         0.0         0.0
Variance    500         500         500
Figure 5.14: Error Term Simulation Results

After many, many repetitions, the mean (average) of each student's error terms equals about 0. Consequently, each student's error term truly represents a random influence; it does not systematically influence the student's quiz score. It is also instructive to focus on each student's histogram. For each student, the numerical value of the error term is positive about half the time and negative about half the time after many, many repetitions.

Summary: The error terms represent random influences; consequently, the error terms have no systematic effect on quiz scores, the dependent variable:
• Sometimes the error term is positive, indicating that the score is "higher than usual";
• Other times the error term is negative, indicating that the score is "lower than usual."

What can we say about the students' error terms beforehand, before the next quiz? We can describe their probability distribution. The chance that a student's error term will be positive is the same as the chance that it will be negative. For any one quiz, the mean of each student's error term's probability distribution equals 0:
• $\text{Mean}[e_1] = 0$: $e_1$ has no systematic effect on Student 1's score; $e_1$ represents a random influence.
• $\text{Mean}[e_2] = 0$: $e_2$ has no systematic effect on Student 2's score; $e_2$ represents a random influence.
• $\text{Mean}[e_3] = 0$: $e_3$ has no systematic effect on Student 3's score; $e_3$ represents a random influence.

Standard Ordinary Least Squares (OLS) Premises

Initially, we shall make some strong assumptions regarding the explanatory variables and the error terms:
• Error Term Equal Variance Premise: The variance of the error term's probability distribution for each observation is the same; all the variances equal Var[e]:
  $\text{Var}[e_1] = \text{Var}[e_2] = \cdots = \text{Var}[e_T] = \text{Var}[e]$
• Error Term/Error Term Independence Premise: The error terms are independent: $\text{Cov}[e_i, e_j] = 0$ for $i \neq j$. Knowing the value of the error term from one observation does not help us predict the value of the error term for any other observation.
• Explanatory Variable/Error Term Independence Premise: The explanatory variables, the $x_t$'s, and the error terms, the $e_t$'s, are not correlated. Knowing the value of an observation's explanatory variable does not help us predict the value of that observation's error term.

We call these premises the standard ordinary least squares (OLS) premises. They make the analysis as straightforward as possible. In Part Four of this textbook, we relax these premises to study more general cases. Our strategy is to start with the most straightforward case and then move on to more complex ones.
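To make the first two premises concrete, the following Python sketch draws error terms for two students in a way that satisfies them by construction, then checks the sample variances and the sample covariance. The normal distribution, the common variance of 500, and the helper names are assumptions of this illustration; it shows what the premises look like in a large sample and is not a test of real data. The third premise is not covered here.

```python
import random

rng = random.Random(2)       # fixed seed so the run is repeatable
ERR_SD = 500.0 ** 0.5        # assumed common variance of 500
N = 100_000                  # number of simulated quizzes

# Independent draws: one error term per student per quiz.
e1 = [rng.gauss(0.0, ERR_SD) for _ in range(N)]
e2 = [rng.gauss(0.0, ERR_SD) for _ in range(N)]


def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    """Sample covariance (dividing by N, adequate for a large sample)."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / len(a)


var_e1 = cov(e1, e1)         # the variance is the covariance with itself
var_e2 = cov(e2, e2)
cov_e1_e2 = cov(e1, e2)
print(f"Var[e1] = {var_e1:.1f}, Var[e2] = {var_e2:.1f}")
print(f"Cov[e1, e2] = {cov_e1_e2:.2f}")
```

Both sample variances should be close to 500 (equal variance), and the sample covariance should be small relative to the variance (independence).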
While we only briefly cite the premises here, we shall return to them in the fourth part of the textbook to study their implications.

Clint's Assignment: The Two Parts

Recall Clint's assignment. He must assess the effect of studying on quiz scores by using Professor Lord's first quiz as evidence. Clint can apply the ordinary least squares (OLS) estimation procedure; the OLS estimate for the value of the coefficient is 1.2. But we now know that the estimate is a random variable. We cannot expect the coefficient estimate from the one quiz, 1.2, to equal the actual value of the coefficient, the actual impact that studying has on a student's quiz score. We shall proceed by dividing Clint's assignment into two related parts:
• Reliability of the Coefficient Estimate: How reliable is the coefficient estimate calculated from the results of the first quiz? That is, how confident should Clint be that the coefficient estimate, 1.2, will be close to the actual value?
• Assessment of the Theory: In view of the fact that Clint's estimate of the coefficient equals 1.2, how confident should Clint be that the theory is correct, that additional studying increases quiz scores?
In the next few chapters, we shall address these issues.

1 NB: These data are not real. Instead, they were constructed to illustrate important pedagogical points.
2 There is another convention that is often used to denote the parameter estimates, the "beta-hat" convention. The estimate of the constant is denoted by $\hat{\beta}_{Const}$ and the coefficient by $\hat{\beta}_x$. While the Roman italicized b's estimation convention will be used throughout this textbook, be aware that you will come across textbooks and articles that use the beta-hat convention. The b's and $\hat{\beta}$'s denote the same thing; they are interchangeable.