Introduction to Simple Linear Regression

1. Regression Equation

A simple linear regression (also known as a bivariate regression) is a linear equation describing the relationship between an explanatory variable and an outcome variable. The procedure is used to measure how the explanatory variable (also known as the independent variable) possibly influences the outcome variable (also known as the dependent variable).

Example: Let y_i denote the income of some individual in your sample indexed by i, where i ∈ {1, 2, ..., n}; let x_i denote the number of years of education of the same individual; and let n denote the sample size. A simple linear regression equation of these two variables in the sample takes the form,

y_i = b_0 + b_1 x_i + e_i,

where b_1 is the sample estimate of the slope of the regression line with respect to years of education and b_0 is the sample estimate of the vertical intercept of the regression line. The term e_i is the residual, that is, the error term of the regression. Since we would not expect education to exactly predict income, not all data points in a sample will line up exactly on the regression line. For some individual i ∈ {1, 2, ..., n} in the sample, e_i is the difference between his or her actual income and the predicted level of income on the regression line given his or her actual educational attainment, x_i.

The point on the regression line is the predicted value for y_i given some value for x_i. The predicted value from an estimated regression is given by,

ŷ_i = b_0 + b_1 x_i.

Since some actual values for y_i will be above the regression line and some will be below, some e_i will be positive and others will be negative. The best fitting regression line is one such that the positive values exactly offset the negative values, so that the mean of the residuals equals zero:

(1/n) Σ_{i=1}^{n} e_i = 0.

To minimize the error that the regression line makes, the coefficients of the best fitting regression line are chosen to minimize the sum of the squared residuals:

{b_0, b_1} = argmin Σ_{i=1}^{n} e_i² = argmin Σ_{i=1}^{n} (y_i − ŷ_i)² = argmin Σ_{i=1}^{n} (y_i − b_0 − b_1 x_i)².

This standard method for estimating regression coefficients by minimizing the sum of squared residuals is called the ordinary least squares (OLS) method.
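
This minimization problem has the familiar closed-form solution b_1 = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{n} (x_i − x̄)² and b_0 = ȳ − b_1 x̄. As a minimal sketch of these formulas at work (the vectors x and y below are made-up illustrative values, not the dataset used later), the estimates can be computed directly in R:

# Hypothetical example data, for illustration only
x <- c(8, 10, 12, 14, 16)
y <- c(900, 1100, 1250, 1600, 1800)

# Closed-form OLS estimates of the slope and intercept
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Residuals from the fitted line; their mean is zero up to rounding error
e <- y - (b0 + b1 * x)
mean(e)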

Since b_1 is the slope, it measures how much the y-variable changes when the x-variable increases by one unit. In this case, b_1 is the estimate of how much additional income one earns, on average, for each additional year of education. Depending on the application, the vertical intercept sometimes has an intuitive meaning: it measures what value to expect for the y-variable when the x-variable is equal to zero. In this case, b_0 measures the average income to be expected for individuals with zero years of education. If your data does not include any observation with a zero value for education, or if a zero value for the x-variable is unrealistic, then this coefficient has little meaning.

The regression line in the equation above is a sample estimate of the population regression line. The population regression line is the best fitting line for all possible elements in the population, and its coefficients are generally unknown. The population regression equation is given by,

y_i = β_0 + β_1 x_i + ε_i,

where b_0 above is a sample estimate of the population coefficient β_0 and b_1 above is a sample estimate of the population coefficient β_1.

2. Download and Visualize the Dataset

The code below downloads a CSV file that includes data for 935 individuals on variables including their total monthly earnings (MonthlyEarnings) and a number of variables that could influence income, including years of education (YearsEdu).

download.file(url="http://murraylax.org/datasets/wage2.csv", dest="wage2.csv")
wages <- read.csv("wage2.csv")
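
Before plotting, it is worth a quick check that the file loaded as expected. Here is a small sketch (assuming the variable names MonthlyEarnings and YearsEdu given above):

# Number of observations (should be 935) and summaries of the two variables
nrow(wages)
summary(wages$MonthlyEarnings)
summary(wages$YearsEdu)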

Let us begin by plotting the data to visually examine the relationship between years of schooling and monthly earnings. The code below produces a scatterplot with YearsEdu on the horizontal axis and MonthlyEarnings on the vertical axis.

plot(x=wages$YearsEdu, y=wages$MonthlyEarnings,
     main="Monthly Earnings versus Years of Education",
     xlab="Years of Education", ylab="Monthly Earnings")

[Figure: scatterplot of Monthly Earnings versus Years of Education]

The parameters main, xlab, and ylab set the main title, the label for the horizontal axis, and the label for the vertical axis, respectively.

The scatterplot looks a little peculiar because the explanatory variable, years of education, consists only of a finite number of integers. When so many observations have the exact same value, they line up vertically and heavily overlap, making it difficult to see the relationship in the scatterplot. R includes a function called jitter() to help address this visual problem. The function accepts a numerical vector as a parameter and adds a small amount of noise to it; that is, it accepts a vector of values and adds to each a small random number that takes negative and positive values.
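
To see what jitter() does before applying it to the plot, here is a small illustrative call (the noise is random, so exact values will differ from run to run):

# Three identical values get spread out slightly; amount sets the maximum noise,
# so each result lies within 0.5 of 12
jitter(c(12, 12, 12), amount=0.5)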

Let us call the plot function again, but this time add a jitter to years of education. Because the distance between possible values for years of education on our plot is always equal to 1.0, we will add a jitter with a maximum magnitude of 0.5, or half of this distance. The following call to plot includes a call to jitter to accomplish this.

plot(x=jitter(wages$YearsEdu, amount=0.5), y=wages$MonthlyEarnings,
     main="Monthly Earnings versus Years of Education",
     xlab="Years of Education", ylab="Monthly Earnings")

[Figure: jittered scatterplot of Monthly Earnings versus Years of Education]

3. Estimating the Regression Equation

We estimate the regression line with the R function lm, which stands for linear model. We estimate the regression line in the code that follows and assign the output to a variable we call edulm.

edulm <- lm(wages$MonthlyEarnings ~ wages$YearsEdu)

We passed to lm a single parameter, the formula,

wages$MonthlyEarnings ~ wages$YearsEdu

which told lm to estimate a linear model for how monthly earnings depends on years of education. The object edulm is a list of many objects that include summary statistics, hypothesis tests, and confidence intervals regarding the equation of the best fitting line. For the moment, let us examine the estimates of the coefficients. We can view these by accessing the coefficients object within edulm:

edulm$coefficients

## (Intercept) wages$YearsEdu
##   146.95244       60.21428

These results imply that the equation for the best fitting line is given by,

y_i = 146.95 + 60.21 x_i + e_i,

and the predicted value for monthly earnings for person i with x_i years of education is given by,

ŷ_i = 146.95 + 60.21 x_i.
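
For example, the predicted monthly earnings for a person with 16 years of education can be computed directly from the stored coefficients. A quick sketch (coef() extracts the same coefficient vector printed above):

# Predicted earnings at 16 years of education: b0 + b1 * 16
b <- coef(edulm)
unname(b[1] + b[2] * 16)
## 146.95 + 60.21 * 16, approximately 1110.38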

We can add the regression line to our scatterplot with a call to abline(). In the code below, we regenerate the jittered scatterplot and call abline to draw the regression line:

plot(x=jitter(wages$YearsEdu, amount=0.5), y=wages$MonthlyEarnings,
     main="Monthly Earnings versus Years of Education",
     xlab="Years of Education", ylab="Monthly Earnings")
abline(edulm)

[Figure: jittered scatterplot of Monthly Earnings versus Years of Education with the fitted regression line]
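
Section 1 claimed that the best fitting line makes the residuals average to zero. As a closing check (residuals() is the standard extractor for the residuals of an lm fit), we can verify this for our estimated model:

# Mean of the residuals from the estimated model; zero up to floating-point error
mean(residuals(edulm))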