FAQ: Linear and Multiple Regression Analysis: Coefficients

Question 1: How do I calculate a least squares regression line?

Answer 1: Regression analysis is a statistical tool that uses the relation between two or more quantitative variables so that one variable (the dependent variable) can be predicted from the others (the independent variables). For example, if one knows the relationship between corporate research and development expenditures and future sales, one may be able to predict future sales. The linear regression model is typically represented as y = a + bx + e. To construct a regression model, information on both x and y must be obtained from a sample of objects or individuals; the relationship between the two can then be estimated.

Assume we are estimating the relation between the number of operating hours of a machine and its annual repair and maintenance costs. In this example, the number of operating hours is the independent variable and the repair cost is the dependent variable. The parameters a and b can take on any of an infinite number of real values. The goal of the regression procedure is to create a model that predicts the value of y with good accuracy. More exactly, we estimate the regression so as to minimize the sum of the squared deviations between predicted y and actual y (the residuals). Fortunately, statisticians have developed equations to estimate a and b so that the best-fitting regression line is obtained (the least squares equations). So, let's use the following information on machine hours and repair costs to calculate a least squares regression line.

First, collect data to estimate the relation. We limit the sample to 5 observations for simplicity only; in practice, at least 30 observations would typically be required to fit any type of line reliably.

Hours     Costs
500       $1,000
800       $2,000
1,500     $2,500
2,500     $7,000
6,000     $9,000
Mean 2,260    Mean $4,300

Second, estimate the regression coefficient b, the average amount repair costs increase when operating hours increase by one unit (with any other independent variables held constant). b is estimated as

b = Σ(x − mean of x)(y − mean of y) / Σ(x − mean of x)²

This is the covariance of x and y divided by the variance of x. In our example, the estimate is

b = [(500 − 2,260)(1,000 − 4,300) + (800 − 2,260)(2,000 − 4,300) + …] / [(500 − 2,260)² + (800 − 2,260)² + …] = 28,760,000 / 19,852,000 ≈ 1.449

Third, the estimate for a is computed as the mean of y − (b × mean of x). In our example, this estimate is

a = 4,300 − (1.449 × 2,260) ≈ 1,026

The final estimated line is thus y = 1,026 + 1.449x + e. This means we can predict repair costs. For x equal to 3,000 hours, we estimate repair costs as 1,026 + 1.449 × 3,000 = $5,373.
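The same calculation can be reproduced programmatically. Below is a minimal Python sketch, assuming numpy and the hours/costs figures from the example above, that estimates b and a from the least squares formulas and predicts repair costs at 3,000 hours.

```python
import numpy as np

# Machine operating hours (x) and annual repair costs in dollars (y)
hours = np.array([500, 800, 1500, 2500, 6000], dtype=float)
costs = np.array([1000, 2000, 2500, 7000, 9000], dtype=float)

# Slope b: covariance of x and y divided by the variance of x
x_dev = hours - hours.mean()
y_dev = costs - costs.mean()
b = np.sum(x_dev * y_dev) / np.sum(x_dev ** 2)

# Intercept a: mean of y minus b times mean of x
a = costs.mean() - b * hours.mean()

print(f"b = {b:.3f}, a = {a:.0f}")  # b = 1.449, a = 1026
print(f"Predicted cost at 3,000 hours: ${a + b * 3000:,.0f}")  # about $5,372 (rounding differs slightly from the hand calculation)
```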

Note that for b coefficients of dummy variables that have been binary coded (1 or 0) from categories, b is interpreted relative to the reference category (the category left out). Thus, for a set of dummy variables for "Region," assuming "North" is the reference category and income is the dependent variable, a b of -1.5 for the dummy "South" means that the expected income in the South is 1.5 units less than that of "North" respondents. Also, note that t-tests are used to assess the significance of individual b coefficients, specifically testing the null hypothesis that the regression coefficient is zero. A common rule of thumb is to drop from the equation all variables not significant at the .05 level or better.

Note that there is always error in the regression procedure. Error may be incorporated into the information collected to estimate the line. Regression models are a source of information about the world, but they must be used wisely.

Question 2: How do I calculate the coefficient of correlation?

Answer 2: The correlation coefficient measures the strength of the relationship between two variables. A perfectly positive correlation coefficient is +1.0. This means that for every change in one variable, the other variable changes in the same direction and in the same proportion. For example, if every $1,000 increase in a company's reported net income is associated with a 5% increase in its stock price, then reported net income and stock price are perfectly positively correlated (the correlation coefficient equals +1.0). Note that this correlation implies nothing about cause and effect. That is, the increase in the stock price may not be caused by the increase in net income; it may be that another, unidentified variable is affecting both. Perfectly negative correlations also exist, such as the relationship between age and purchases of CDs. When no correlation exists, the coefficient is zero, as in the relation between air temperature and birth rates. Like any statistic, we can test the significance of the measured correlation using a t-statistic.

The correlation coefficient is defined as the covariance of x and y divided by the product of the standard deviations of x and y. Correlation coefficients are particularly useful because they are symmetric: they apply equally well when y is considered the dependent variable and x the independent variable, and when x is considered the dependent variable and y the independent variable. So how do we calculate the coefficient if we do not use the built-in function in Excel?

Consider the following example of 5 observations of the relation between the number of years of schooling past high school and the dollars spent on new cars over the 10-year period following high school.

Years    $
0        30,000
2        50,000
4        90,000
6        95,000
8        80,000

First, compute the mean of each variable.

Mean of years = 4.0 years
Mean of $ = $69,000

Next, compute the difference of each variable from its mean.

Years    Deviation    $         Deviation
0        -4.0         30,000    -39K
2        -2.0         50,000    -19K
4         0.0         90,000    +21K
6        +2.0         95,000    +26K
8        +4.0         80,000    +11K

Third, for each pair of observations, sum the product of the deviations. The sum is

(-4.0 × -39K) + (-2.0 × -19K) + (0.0 × 21K) + (2.0 × 26K) + (4.0 × 11K) = 290,000

Next, compute the product of the sums of the squared deviations.

For Years: (-4.0)² + (-2.0)² + (0.0)² + (2.0)² + (4.0)² = 40
For $: (-39K)² + (-19K)² + (21K)² + (26K)² + (11K)² = 3,120,000,000

The product of the two sums is then 124,800,000,000.

Finally, divide the sum of the products of the deviations by the square root of the product of the sums of the squared deviations:

290,000 / SQRT(124,800,000,000) = 290,000 / 353,270 ≈ 0.82

This is r, the correlation coefficient. In this case the two variables are positively, but not perfectly, correlated. We could use a t-statistic to test the significance of this estimate.
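The same result can be checked in code. Here is a minimal Python sketch, assuming numpy and the years/spending data above, that computes r from the deviation sums and cross-checks it against numpy's built-in correlation function.

```python
import numpy as np

# Years of schooling past high school (x) and dollars spent on new cars (y)
years = np.array([0, 2, 4, 6, 8], dtype=float)
dollars = np.array([30_000, 50_000, 90_000, 95_000, 80_000], dtype=float)

x_dev = years - years.mean()
y_dev = dollars - dollars.mean()

# r = sum of products of deviations / sqrt(product of the sums of squared deviations)
r = np.sum(x_dev * y_dev) / np.sqrt(np.sum(x_dev ** 2) * np.sum(y_dev ** 2))
print(f"r = {r:.2f}")                     # r = 0.82

# Cross-check with numpy's correlation matrix
print(np.corrcoef(years, dollars)[0, 1])  # same value
```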

Question 3: How do I calculate the coefficient of determination?

Answer 3: The coefficient of determination, R², is the percent of the variance in the dependent variable explained by the independent variables. R-squared can also be interpreted as the proportionate reduction in error when the independents, rather than the dependent variable's mean alone, are used to estimate the dependent. That is, R² compares the errors made when using the regression model to predict the dependent with the total errors made when using only the dependent variable's mean as the estimate for all cases.

Mathematically, R² = 1 − (SSE/SST), where SSE is the error sum of squares and SST is the total sum of squares. The error sum of squares is the sum of the squared residuals (actual y less predicted y), and the total sum of squares is the sum of the squared deviations between actual y and the mean value of y. SSE is the error not explained by the model, and SST is the total error, whether explained or not.

In our example of machine hours and costs, we can compute R². First, using the estimated regression (y = 1,026 + 1.449x), compute the predicted y for each x.

Hours     Actual Costs    Predicted Costs
500       $1,000          $1,750
800       $2,000          $2,185
1,500     $2,500          $3,200
2,500     $7,000          $4,649
6,000     $9,000          $9,720
Mean 2,260    Mean $4,300    Mean $4,301

Second, compute SSE from the actual and predicted costs:

SSE = (1,000 − 1,750)² + (2,000 − 2,185)² + (2,500 − 3,200)² + (7,000 − 4,649)² + (9,000 − 9,720)² = approx. 7,130,000

Third, compute SST from the actual costs and their mean:

SST = (1,000 − 4,300)² + (2,000 − 4,300)² + (2,500 − 4,300)² + (7,000 − 4,300)² + (9,000 − 4,300)² = 48,800,000

Thus, R² = 1 − (7,130,000 / 48,800,000) ≈ 85%.

While R² can be increased by adding variables, doing so is inappropriate unless the variables are added to the equation for sound theoretical reasons. At an extreme, when n − 1 variables are added to a regression equation, R² will be 1, but this result is meaningless. Adjusted R² is used as a conservative reduction to R² that penalizes the addition of variables. Adjusted R-squared adjusts for the fact that, with a large number of independent variables, R² can become artificially high simply because some independents' chance variations "explain" small parts of the variance of the dependent. At the extreme, when there are as many independent variables as cases in the sample, R² will always be 1.0. The adjustment to the formula lowers R² as p, the number of independent variables, increases. Also note that R² typically should not be compared between samples, because one sample may simply have more variation to explain; more variance might allow more of the variance to be explained.
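The computation is easy to verify in code. The following minimal Python sketch, again assuming numpy and the hours/costs data above, fits the line and then computes SSE, SST, and R².

```python
import numpy as np

hours = np.array([500, 800, 1500, 2500, 6000], dtype=float)
costs = np.array([1000, 2000, 2500, 7000, 9000], dtype=float)

# Least squares fit (same as in Question 1)
b = np.sum((hours - hours.mean()) * (costs - costs.mean())) / np.sum((hours - hours.mean()) ** 2)
a = costs.mean() - b * hours.mean()
predicted = a + b * hours

sse = np.sum((costs - predicted) ** 2)     # error sum of squares
sst = np.sum((costs - costs.mean()) ** 2)  # total sum of squares
r_squared = 1 - sse / sst

print(f"SSE = {sse:,.0f}, SST = {sst:,.0f}")  # SSE about 7,130,000, SST = 48,800,000
print(f"R-squared = {r_squared:.3f}")         # about 0.854
```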

Question 4: How do I calculate the standard error of the estimate?

Answer 4: The residual variance is a measure of the variation of the y values about the regression line; it is based on the deviations between actual y and predicted y. The square root of the residual variance is the standard error of the estimate. If the standard error is too large, the model may not be useful.

To compute the standard error, first determine the predicted y values. Second, compute the differences between predicted y and actual y and square them; the sum of these squared differences is the error sum of squares (SSE), the error that cannot be explained by the estimated regression model. Finally, divide the error sum of squares by the sample size less the number of independent variables less one. The square root of the result is the standard error of the estimate; it measures the dispersion of the actual values of y around the fitted regression line. The standard error is also called the standard deviation of the regression model. Even if the coefficient of determination is high, the standard error may still be too high to provide adequate confidence in the model; for example, prediction intervals of roughly two standard errors around the fitted line might be too wide to be of practical use. A better model, with different or additional variables, is then warranted.
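Continuing the machine-hours example, here is a minimal Python sketch of this calculation, assuming the same data and a single independent variable (so the divisor is n − 2).

```python
import numpy as np

hours = np.array([500, 800, 1500, 2500, 6000], dtype=float)
costs = np.array([1000, 2000, 2500, 7000, 9000], dtype=float)

# Fit the least squares line and compute residuals
b = np.sum((hours - hours.mean()) * (costs - costs.mean())) / np.sum((hours - hours.mean()) ** 2)
a = costs.mean() - b * hours.mean()
residuals = costs - (a + b * hours)

n = len(costs)  # sample size
k = 1           # number of independent variables

# Standard error of the estimate: sqrt(SSE / (n - k - 1))
sse = np.sum(residuals ** 2)
std_error = np.sqrt(sse / (n - k - 1))
print(f"Standard error of the estimate: {std_error:,.0f}")  # about 1,542
```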

Question 5: What are the similarities between simple linear regression analysis and multiple regression analysis?

Answer 5: The multiple regression equation takes the form y = a + b1x1 + b2x2 + ... + bnxn + e. In multiple regression, more than one independent variable explains the one dependent variable, which is expected to reduce the standard error. The b's are the regression coefficients, each representing the amount the dependent variable y changes when that independent variable changes by 1 unit and the others are held constant. The a is the constant, where the regression line intercepts the y axis; it represents the value the dependent variable y takes when all the independent variables are 0. Associated with multiple regression is R², the squared multiple correlation, which is the percent of variance in the dependent variable explained collectively by all of the independent variables. While for simple linear regression R² equals r², the square of the simple correlation coefficient, for multiple regression the same equality with any single bivariate correlation does not hold.

Like simple linear regression, multiple regression shares all the assumptions of correlation: linearity of relationships, the same level of relationship throughout the range of the independent variable ("homoscedasticity"), interval or near-interval data, and data whose range is not truncated. In addition, it is important that the model being tested is correctly specified. The exclusion of important causal variables or the inclusion of extraneous variables can markedly change the estimated coefficients and thus the interpretation of the importance of the independent variables.
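As an illustration, the sketch below (Python with numpy; the two-predictor data are made up for this example and do not come from the FAQ) estimates a multiple regression y = a + b1x1 + b2x2 + e by ordinary least squares and reports the coefficients and R².

```python
import numpy as np

# Hypothetical data: y with two predictors x1 and x2 (illustrative values only)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([3.1, 4.9, 8.2, 9.8, 13.9, 15.1])

# Design matrix with a column of ones for the intercept a
X = np.column_stack([np.ones_like(x1), x1, x2])

# Ordinary least squares: solve for [a, b1, b2]
coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coef

# R-squared: 1 - SSE/SST
predicted = X @ coef
r_squared = 1 - np.sum((y - predicted) ** 2) / np.sum((y - y.mean()) ** 2)

print(f"a = {a:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}, R-squared = {r_squared:.3f}")
```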

Question 6: What are the basic assumptions of regression analysis?

Answer 6: There are four general assumptions of regression analysis.

1. First, model errors are normally distributed.
2. Second, the mean of the model error terms is zero.
3. Third, the model error terms have a constant variance for all values and combinations of values of the independent variables ("homoscedasticity").
4. Finally, the error for each x is independent of the error for every other x.

Error, represented by the residuals, should be normally distributed for each set of values of the independents. A histogram of standardized residuals should show a roughly normal curve. An alternative for the same purpose is the normal probability plot, with the observed cumulative probabilities of occurrence of the standardized residuals on the Y axis and the expected normal probabilities of occurrence on the X axis, such that a 45-degree line appears when the observed distribution conforms to the normal expectation; a short code sketch of these checks follows below. The central limit theorem implies that even when the error is not normally distributed, the sampling distribution of the b coefficients will still be approximately normal when the sample size is large. Therefore, violations of this assumption usually have little or no impact on substantive conclusions for large samples, but when the sample size is small, tests of normality are important.

The mean error should also be independent of the x independent variables. This is a critical regression assumption which, when violated, may lead to substantive misinterpretation of output. The (population) error term, which is the difference between the actual values of the dependent and those estimated by the population regression equation, should be uncorrelated with each of the independent variables. Since the population regression line is not known for sample data, the assumption must be assessed by theory: one must be confident that the dependent is not also a cause of one or more of the independents, and that the variables not included in the equation are not causes of Y that are correlated with the variables which are included. Either circumstance would violate the assumption of uncorrelated error. One common type of correlated error occurs due to selection bias with regard to membership in the independent variable "group" (representing membership in a treatment vs. a comparison group): measured factors such as gender, race, education, etc., may cause differential selection into the two groups and can also be correlated with the dependent variable. When there is correlated error, conventional computations of standard deviations, t-tests, and significance are biased and cannot be used validly.
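Here is a minimal Python sketch of the residual checks mentioned above (a histogram of standardized residuals and a normal probability plot). It assumes matplotlib and scipy are available and reuses the hours/costs data; with only 5 observations the plots are purely illustrative, and this is just one convenient way to perform the checks.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

hours = np.array([500, 800, 1500, 2500, 6000], dtype=float)
costs = np.array([1000, 2000, 2500, 7000, 9000], dtype=float)

# Fit the line and form standardized residuals
b = np.sum((hours - hours.mean()) * (costs - costs.mean())) / np.sum((hours - hours.mean()) ** 2)
a = costs.mean() - b * hours.mean()
residuals = costs - (a + b * hours)
standardized = residuals / residuals.std(ddof=2)  # scale by the residual standard deviation

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram of standardized residuals: should look roughly normal
ax1.hist(standardized, bins=5)
ax1.set_title("Standardized residuals")

# Normal probability plot: points near the 45-degree line indicate normality
stats.probplot(standardized, dist="norm", plot=ax2)

plt.tight_layout()
plt.show()
```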

Other assumptions are also part of regression; some of the major ones are discussed here.

Model specification is critical: relevant variables must not be omitted. If relevant variables are omitted from the model, the common variance they share with included variables may be wrongly attributed to those included variables, and the error term is inflated. Similarly, if causally irrelevant variables are included in the model, the common variance they share with included variables may be wrongly attributed to the irrelevant variables. Omission and irrelevancy can both substantially affect the size of the b coefficients. This is one reason why it is better to use regression to compare the relative fit of two models rather than to seek to establish the validity of a single model specification.

Continuous data (interval or ratio) are required, though it is common to use ordinal data. Dummy variables form a special case and are allowed in regression as independents. Dichotomies may be used as independents but not as the dependent variable. Use of a dichotomous dependent in regression violates the assumptions of normality and homoscedasticity, as a normal distribution is impossible with only two values. Also, when the values can only be 0 or 1, residuals will be low for the portions of the regression line near Y = 0 and Y = 1 but high in the middle; hence the error term will violate the assumption of homoscedasticity (equal variances) when a dichotomy is used as a dependent.

Unbounded data are an assumption. That is, the regression line produced can be extrapolated in both directions but is meaningful only within the upper and lower natural bounds of the dependent. Data are assumed not to be censored, sample-selected, or truncated, and there are as many observations of the independents as of the dependent.

Absence of perfect multicollinearity is another assumption. When there is perfect multicollinearity, there is no unique regression solution. Perfect multicollinearity occurs when independents are linear functions of each other (e.g., age and year of birth), when the researcher creates dummy variables for all values of a categorical variable rather than leaving one out, or when there are fewer observations than variables.

Regression analysis is a linear procedure. To the extent nonlinear relationships are present, conventional regression analysis will underestimate the relationship. Nonlinear transformation of selected variables may be used as a pre-processing step, but this is not common because it runs the danger of overfitting the model to what are, in fact, chance variations in the data. When nonlinearity is present, there may be a need for exponential or interaction terms.

The same underlying distribution is assumed for all variables. To the extent that an independent variable has a different underlying distribution than the dependent (bimodal vs. normal, for instance), a unit increase in the independent will have nonlinear impacts on the dependent. Even when independent/dependent data pairs are ordered perfectly, unit increases in the independent cannot be associated with fixed linear changes in the dependent. For instance, perfect ordering of a bimodal independent with a normal dependent will generate an s-shaped scatter plot not amenable to a linear solution. Linear regression will underestimate the correlation of the independent and the dependent when they come from different underlying distributions.

Variable measurement is assumed to be reliable and valid. To the extent there is systematic error in the measurement of the variables, the regression coefficients will simply be wrong.

Independent observations (absence of autocorrelation) are required so that the error terms are uncorrelated. Current values should not be correlated with previous values in a data series. This is often a problem with time series data, where many variables tend to increase over time, so that knowing the value of the current observation helps one estimate the value of the previous observation. That is, each observation should be independent of every other observation if the error terms are not to be correlated; correlated error terms would in turn lead to biased estimates of standard deviations and significance.
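As one practical illustration of the multicollinearity assumption discussed above, the sketch below (Python with numpy; the predictor data are made up for this example) computes variance inflation factors (VIFs) by regressing each independent variable on the others. VIFs are not mentioned in the FAQ itself, but they are a common way to screen for near-collinearity among regressors.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (one column per predictor)."""
    vifs = []
    for j in range(X.shape[1]):
        y = X[:, j]                       # treat predictor j as the response
        others = np.delete(X, j, axis=1)  # all remaining predictors
        Z = np.column_stack([np.ones(len(y)), others])
        coef, _, _, _ = np.linalg.lstsq(Z, y, rcond=None)
        r2 = 1 - np.sum((y - Z @ coef) ** 2) / np.sum((y - y.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))     # VIF_j = 1 / (1 - R²_j)
    return vifs

# Hypothetical predictors: x2 is nearly a linear function of x1, so both of their VIFs are large
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = 2 * x1 + np.array([0.1, -0.1, 0.05, -0.05, 0.1, -0.1])
x3 = np.array([5.0, 3.0, 6.0, 2.0, 7.0, 4.0])

print(vif(np.column_stack([x1, x2, x3])))
```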