
SOS3003 Applied data analysis for social science
Lecture note 05-2010
Erling Berge, Department of sociology and political science, NTNU
Spring 2010

Literature

Regression criticism I: Hamilton, Ch 4, pp. 109-123

Analyses of models are based on assumptions

OLS is a simple technique of analysis with very good theoretical properties. But the good properties rest on certain assumptions: if the assumptions do not hold, the good properties evaporate. Investigating the degree to which the assumptions hold is the most important part of a regression analysis.

OLS-REGRESSION: assumptions

I SPECIFICATION REQUIREMENT: the model is correctly specified
II GAUSS-MARKOV REQUIREMENTS: ensure that the estimates are BLUE
III NORMALLY DISTRIBUTED ERROR TERM: ensures that the tests are valid

I SPECIFICATION REQUIREMENT

The model is correctly specified if
- The expected value of y, given the values of the independent variables, is a linear function of the parameters of the x-variables
- All included x-variables have an impact on the expected y-value
- No other variable has an impact on the expected y-value while at the same time correlating with the included x-variables

II GAUSS-MARKOV REQUIREMENTS (i)

(1) x is known, without stochastic variation
(2) Errors have an expected value of 0 for all i: E(ε_i) = 0 for all i

Given (1) and (2), ε_i will be independent of x_k for all k, and OLS provides unbiased estimates of β.

II GAUSS-MARKOV REQUIREMENTS (ii)

(3) Errors have a constant variance for all i: Var(ε_i) = σ² for all i. This is called homoscedasticity.
(4) Errors are uncorrelated with each other: Cov(ε_i, ε_j) = 0 for all i ≠ j. This is called no autocorrelation.

II GAUSS-MARKOV REQUIREMENTS (iii)

Given (3) and (4) in addition to (1) and (2):
a. Estimates of the standard errors of the regression coefficients are unbiased, and
b. The Gauss-Markov theorem holds: OLS estimates have less variance than any other linear unbiased estimate (including ML estimates). OLS gives BLUE (Best Linear Unbiased Estimate).

II GAUSS-MARKOV REQUIREMENTS (iv)

(1)-(4) are called the GAUSS-MARKOV requirements. Given (2)-(4), with the additional requirement that the errors are uncorrelated with the x-variables, cov(x_ik, ε_i) = 0 for all i, k, the coefficients and standard errors are consistent (converging in probability to the true population value as sample size increases).

Footnote 1: Unbiased estimators

Unbiased means that E[b_k] = β_k: in the long run we are bound to find the population value β_k if we draw sufficiently many samples, calculate b_k in each, and average these.
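Unbiasedness in this long-run sense is easy to demonstrate in a small simulation. The following is a minimal Python sketch, not part of the lecture (which uses SPSS); the parameter values and sample design are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1 = 2.0, 0.5          # assumed "true" population parameters
x = rng.uniform(0, 10, size=50)  # fixed x (Gauss-Markov requirement 1)

slopes = []
for _ in range(5000):                 # draw many samples, re-estimate each time
    eps = rng.normal(0, 1, size=50)   # errors with E(eps) = 0 (requirement 2)
    y = beta0 + beta1 * x + eps
    b1, b0 = np.polyfit(x, y, deg=1)  # OLS fit of a straight line
    slopes.append(b1)

# The average of the slope estimates should be close to beta1 = 0.5
print(np.mean(slopes))
```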

Footnote 2: Consistent estimators

An estimator is consistent if, as the sample size n grows towards infinity, b approaches β and s_b (or SE_b) approaches σ_β. Formally, b_k is a consistent estimator of β_k if for any small value of c

$$\lim_{n \to \infty} \Pr\{\, |b_k - \beta_k| < c \,\} = 1$$

Footnote 3: On "Best" in BLUE

Best means minimal variance estimator. A minimal variance, or efficient, estimator means that var(b_k) < var(a_k) for all estimators a different from b. Equivalently: E[(b_k - β_k)²] < E[(a_k - β_k)²] for all estimators a different from b.
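Consistency can likewise be seen in a small simulation; a minimal sketch with arbitrary sample sizes and invented parameter values:

```python
import numpy as np

rng = np.random.default_rng(8)
beta1 = 0.5

for n in (20, 200, 2000, 20000):
    x = rng.uniform(0, 10, n)
    y = 2.0 + beta1 * x + rng.normal(0, 1, n)
    b1, _ = np.polyfit(x, y, 1)
    # b1 should crowd ever closer to beta1 = 0.5 as n grows
    print(n, round(b1, 4))
```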

Footnote 4: Biased estimators

Even if the requirements ensuring that our estimates are BLUE hold, one may at times find biased estimators with less variance, such as in Ridge Regression.

Footnote 5: Non-linear estimators

There may be non-linear estimators that are unbiased and have less variance than BLUE estimators.

III NORMALLY DISTRIBUTED ERROR TERM

(5) If all errors are normally distributed with expectation 0 and variance σ², that is, if ε_i ~ N(0, σ²) for all i, then we can test hypotheses about β and σ, and OLS estimates will have less variance than estimates from all other unbiased estimators: OLS results are BUE (Best Unbiased Estimate).

Problems in regression analysis that cannot be tested

- Whether all relevant variables are included
- Whether x-variables have measurement errors
- Whether the expected value of the error is 0

This means that we are unable to check whether the correlation between the error term and the x-variables actually is 0: OLS constructs the residuals so that cov(x_ik, e_i) = 0. This is in reality saying the same as the first point, that we are unable to test whether all relevant variables are included.

Problems in regression analysis that can be tested (1)

- Non-linear relationships
- Inclusion of an irrelevant variable
- Non-constant variance of the error term (heteroscedasticity)
- Autocorrelation of the error term (correlations among error terms)
- Non-normal error terms
- Multicollinearity

Consequences of problems (Hamilton, p113)
(yes = the estimates suffer this problem; - = they do not)

Problem                        Requirement violated   Biased b   Biased SE_b   Invalid t & F tests   High var[b]
Non-linear relationship        Specification          yes        yes           yes                   -
Excluded relevant variable     Specification          yes        yes           yes                   -
Included irrelevant variable   Specification          -          -             -                     yes
x with measurement error       Gauss-Markov           yes        yes           yes                   -
Heteroscedasticity             Gauss-Markov           -          yes           yes                   yes
Autocorrelation                Gauss-Markov           -          yes           yes                   yes
x correlated with ε            Gauss-Markov           yes        yes           yes                   -
ε not normally distributed     Normal distribution    -          -             yes                   -
Multicollinearity              (no requirement)       -          -             -                     yes

Problems in regression analysis that can be discovered (2)

- Outliers (extreme y-values)
- Influence (cases with large influence: unusual combinations of y- and x-values)
- Leverage (potential for influence)

Tools for discovering problems

Studies of
- One-variable distributions (frequency distributions and histograms)
- Two-variable co-variation (correlations and scatter plots)
- Residuals (their distribution and their covariation with the predicted values)
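The lecture demonstrates these tools in SPSS; purely as an illustration, a minimal Python sketch of the same three checks on made-up data (the heteroscedastic design is invented for the example):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 1.0 + 0.8 * x + rng.normal(0, 1 + 0.3 * x)  # built-in heteroscedasticity

# One-variable distribution: histogram of y
plt.hist(y, bins=20)
plt.title("Histogram of y")
plt.show()

# Two-variable co-variation: correlation and scatter plot
print("corr(x, y) =", np.corrcoef(x, y)[0, 1])
plt.scatter(x, y); plt.xlabel("x"); plt.ylabel("y"); plt.show()

# Residuals against predicted values
b1, b0 = np.polyfit(x, y, 1)
yhat = b0 + b1 * x
plt.scatter(yhat, y - yhat)
plt.xlabel("predicted y"); plt.ylabel("residual"); plt.show()
```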

Correlation and scatter plot

Data from 122 countries. Pearson correlations, with N in parentheses:

                                ENERGY CONSUMPTION   MEAN ANNUAL        FERTILIZER USE   CRUDE BIRTH
                                PER PERSON           POPULATION GROWTH  PER HECTARE      RATE
ENERGY CONSUMPTION PER PERSON   1 (125)              -.505 (122)        .533 (125)       -.689 (122)
MEAN ANNUAL POPULATION GROWTH   -.505 (122)          1 (125)            -.469 (125)      .829 (125)
FERTILIZER USE PER HECTARE      .533 (125)           -.469 (125)        1 (128)          -.589 (125)
CRUDE BIRTH RATE                -.689 (122)          .829 (125)         -.589 (125)      1 (125)

[Figure: scatter plot matrix of ENERGY CONSUMPTION PER PERSON, FERTILIZER USE PER HECTARE, MEAN ANNUAL POPULATION GROWTH, and CRUDE BIRTH RATE]

Heteroscedasticity (non-constant variance of the error term) can arise from:
- Measurement error (e.g. y measured more accurately the larger x is)
- Outliers
- ε_i containing an important variable that varies with both x and y (specification error)

Specification error is the same as a wrong model and may cause heteroscedasticity. An important diagnostic tool is a plot of the residuals against the predicted values (ŷ).

Example: Hamilton table 3.2
Dependent Variable: Summer 1981 Water Use

Unstandardized coefficients:
                             B         Std. Error   t        Sig.
(Constant)                   242.220   206.864      1.171    .242
Income in Thousands          20.967    3.464        6.053    .000
Summer 1980 Water Use        .492      .026         18.671   .000
Education in Years           -41.866   13.220       -3.167   .002
Head of house retired?       189.184   95.021       1.991    .047
# of People Resident 1981    248.197   28.725       8.641    .000
Increase in # of People      96.454    80.519       1.198    .232

[Figure: unstandardized residuals plotted against unstandardized predicted values, from the regression reported in table 3.2 in Hamilton]

Footnote for the previous figure

There is heteroscedasticity if the variation of the residuals (variation around a typical value) varies systematically with the value of one or more x-variables. The figure shows that the variation of the residuals increases with increasing predicted y (ŷ). Predicted y is in this case an index of high average x-values. When the variation of the residuals varies systematically with the values of the x-variables like this, we conclude that there is heteroscedasticity.

Box-plot of the residuals shows
- Heavy tails
- Many outliers
- A weakly positively skewed distribution

Will any of the outliers affect the regression?

[Figure: box plot of the unstandardized residuals, with numbered outlier cases above and below the box]

The distribution seen from another angle

[Figure: the same residual distribution shown from another angle]

Band-regression

Homoscedasticity means that the median (and the average) of the absolute value of the residuals, i.e. median{|e_i|}, should be about the same for all values of the predicted y_i. If we find that the median of |e_i| for given predicted values of y_i changes systematically with the value of predicted y_i (ŷ_i), it indicates heteroscedasticity. Such analyses can easily be done in SPSS.

[Figure: absolute value of e_i plotted against the unstandardized predicted values, based on the regression in table 3.2 in Hamilton]

[Figure: approximate band regression, box plots of the absolute residuals within bands 1-11 of the predicted value (cf. figure 4.4 in Hamilton)]

Band regression in SPSS

- Start by saving the residuals and the predicted y from the regression
- Compute a new variable by taking the absolute value of the residuals (use Compute under the Transform menu)
- Then partition the predicted y into bands using the procedure Visual Bander under the Transform menu
- Then use Box plot under Graphs, with the absolute value of the residuals as the variable and the band variable as the category axis

The same idea can be sketched outside SPSS, as shown below.
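A minimal Python sketch of band regression on made-up data (the choice of 10 bands is arbitrary; SPSS's Visual Bander plays the role of the quantile cut here):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 500)
y = 2 + 3 * x + rng.normal(0, 0.5 + 0.4 * x)  # heteroscedastic errors

b1, b0 = np.polyfit(x, y, 1)
yhat = b0 + b1 * x
abs_resid = np.abs(y - yhat)

# Partition predicted y into 10 equal-count bands and compare median |e|
edges = np.quantile(yhat, np.linspace(0, 1, 11))
bands = np.digitize(yhat, edges[1:-1])        # band index 0..9 per case
for band in range(10):
    med = np.median(abs_resid[bands == band])
    print(f"band {band + 1}: median |e| = {med:.2f}")
# A systematic trend in median |e| across bands indicates heteroscedasticity
```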

Footnote to Eikemo and Clausen 2007

Page 121 describes White's test of heteroscedasticity. The description is wrong: they say to replace y with e² in the regression on all the x-variables. That is not sufficient. The x-variables have to be replaced by all unique cross products of x with x (including x²), i.e. the unique elements of the Kronecker product of x with x (where x is the vector of x-variables).

Autocorrelation (1)

- Correlation among values of the same variable across different cases (e.g. between ε_i and ε_{i-1})
- Autocorrelation leads to larger variance and biased estimates of the standard errors, similar to heteroscedasticity
- In a simple random sample from a population, autocorrelation is improbable
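As an illustration of the corrected recipe, a minimal Python sketch of White's test on made-up data (my own implementation, not Eikemo and Clausen's; the LM form n·R² ~ χ² is the standard version of the test):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 300
x1, x2 = rng.uniform(1, 5, n), rng.uniform(1, 5, n)
y = 1 + 2 * x1 - x2 + rng.normal(0, x1)      # error variance grows with x1

# OLS residuals from the original regression
X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
e2 = (y - X @ b) ** 2

# Auxiliary regression: e^2 on the x's plus all unique cross products
Z = np.column_stack([np.ones(n), x1, x2, x1**2, x2**2, x1 * x2])
g, *_ = np.linalg.lstsq(Z, e2, rcond=None)
resid = e2 - Z @ g
r2 = 1 - resid @ resid / ((e2 - e2.mean()) @ (e2 - e2.mean()))

lm = n * r2                                  # LM statistic
p = stats.chi2.sf(lm, df=Z.shape[1] - 1)     # df = regressors excl. constant
print(f"White LM = {lm:.1f}, p = {p:.4f}")
```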

Autocorrelation (2)

- Autocorrelation is the result of a wrongly specified model: a variable is missing
- Typically it is found in time series and in geographically ordered cases
- Tests (e.g. Durbin-Watson) are based on the sorting of the cases. Hence a hypothesis about autocorrelation needs to specify the sorting order of the cases

Durbin-Watson test (1)

$$d = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}$$

The test should not be used for autoregressive models, i.e. models where the y-variable also appears as an x-variable (as in table 3.2).
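The statistic itself is a one-liner to compute; a minimal sketch, assuming the residuals are already sorted in the order the hypothesis specifies:

```python
import numpy as np

def durbin_watson(e: np.ndarray) -> float:
    """Durbin-Watson d for residuals e, ordered as the hypothesis requires."""
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Positively autocorrelated residuals should give d < 2
rng = np.random.default_rng(4)
e = np.cumsum(rng.normal(0, 1, 137)) * 0.3   # a crude random-walk series
print(durbin_watson(e))
```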

Durbin-Watson test (2)

The sampling distribution of the d-statistic is known and tabled as d_L and d_U (table A4.4 in Hamilton); the degrees of freedom are based on n and K-1.

Test rule:
- Reject if d < d_L
- Do not reject if d > d_U
- If d_L < d < d_U the test is inconclusive

d = 2 means uncorrelated residuals. Positive autocorrelation results in d < 2; negative autocorrelation results in d > 2.

Example: daily water use, average per month

[Figure: AVERAGE DAILY WATER USE plotted against INDEX NUMBER 1-137 (months)]

Ordinary OLS-regression where the case is month
Dependent Variable: AVERAGE DAILY WATER USE

Unstandardized coefficients:
                              B       Std. Error   t        Sig.
(Constant)                    3.828   .101         38.035   .000
AVERAGE MONTHLY TEMPERATURE   .013    .002         7.574    .000
PRECIPITATION IN INCHES       -.047   .021         -2.234   .027
CONSERVATION CAMPAIGN DUMMY   -.247   .113         -2.176   .031

Test of autocorrelation
Dependent Variable: AVERAGE DAILY WATER USE
Predictors: (Constant), CONSERVATION CAMPAIGN DUMMY, AVERAGE MONTHLY TEMPERATURE, PRECIPITATION IN INCHES

R      R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
.572   .327       .312                .36045                       .535

N = 137, K-1 = 3. Find the limits for rejection/acceptance of the null hypothesis of no autocorrelation at significance level 0.05. Tip: look up table A4.4 in Hamilton, p355.

Autocorrelation coefficient

The m-th order autocorrelation coefficient is

$$r_m = \frac{\sum_{t=1}^{T-m} (e_t - \bar{e})(e_{t+m} - \bar{e})}{\sum_{t=1}^{T} (e_t - \bar{e})^2}$$

Residuals, daily water use by month

[Figure: mean unstandardized residuals plotted against INDEX NUMBER 1-137]
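A direct transcription of the formula into Python; a minimal sketch where e is any ordered residual series:

```python
import numpy as np

def autocorr(e: np.ndarray, m: int) -> float:
    """m-th order autocorrelation coefficient r_m of residuals e."""
    d = e - e.mean()
    # numerator sums over t = 1..T-m; denominator over all T cases
    return np.sum(d[:-m] * d[m:]) / np.sum(d ** 2)

rng = np.random.default_rng(5)
e = np.cumsum(rng.normal(0, 1, 137)) * 0.3
print([round(autocorr(e, m), 2) for m in (1, 2, 3)])
```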

Smoothing with 3 points

Sliding average:
$$e^*_t = \frac{e_{t-1} + e_t + e_{t+1}}{3}$$

Hanning:
$$e^*_t = \frac{e_{t-1}}{4} + \frac{e_t}{2} + \frac{e_{t+1}}{4}$$

Sliding median:
$$e^*_t = \operatorname{median}\{e_{t-1}, e_t, e_{t+1}\}$$

Residuals, smoothed once

[Figure: residuals after one pass of 3-point smoothing, plotted against INDEX NUMBER 1-137]
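The three smoothers as a minimal Python sketch (endpoints are simply dropped here; the slides do not say how the SPSS moving-average handled endpoints):

```python
import numpy as np

def sliding_average(e: np.ndarray) -> np.ndarray:
    """3-point sliding average; output is 2 points shorter than input."""
    return np.convolve(e, [1/3, 1/3, 1/3], mode="valid")

def hanning(e: np.ndarray) -> np.ndarray:
    """3-point Hanning smoother with weights 1/4, 1/2, 1/4."""
    return np.convolve(e, [0.25, 0.5, 0.25], mode="valid")

def sliding_median(e: np.ndarray) -> np.ndarray:
    """3-point sliding median."""
    return np.median(np.column_stack([e[:-2], e[1:-1], e[2:]]), axis=1)

rng = np.random.default_rng(6)
e = rng.normal(0, 1, 20)
smoothed_twice = hanning(hanning(e))   # repeated smoothing, as in the slides
```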

Residuals, smoothed twice

[Figure: residuals after two passes of 3-point smoothing, plotted against INDEX NUMBER 1-137]

Residuals, smoothed five times

[Figure: residuals after five passes of 3-point smoothing, plotted against INDEX NUMBER 1-137]

Consequences of autocorrelation

Tests of hypotheses and confidence intervals are unreliable. The regression may nevertheless provide a good description of the sample, and the parameter estimates are unbiased. Remedies:

- Special programs can estimate the standard errors consistently
- Include in the model variables affecting neighbouring cases
- Use techniques developed for time series analysis (e.g. analyse the difference between two points in time, Δy)
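The first and last remedies can be sketched briefly; for illustration only, a Python example using statsmodels' HAC (Newey-West) covariance as one instance of such "special programs" (the maxlags value and the data are invented for the example):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 137
x = rng.uniform(0, 30, n)
e = np.zeros(n)
for t in range(1, n):                 # AR(1) errors: autocorrelation by design
    e[t] = 0.7 * e[t - 1] + rng.normal(0, 0.3)
y = 3.8 + 0.013 * x + e

X = sm.add_constant(x)
naive = sm.OLS(y, X).fit()            # conventional, unreliable SEs
hac = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 12})
print(naive.bse, hac.bse)             # same coefficients, different SEs

# Time-series remedy: difference the series and re-run OLS
dy, dx = np.diff(y), np.diff(x)
print(sm.OLS(dy, sm.add_constant(dx)).fit().params)
```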