Simple Regression Model (Assumptions) Lecture 18 Reading: Sections 18.1, 18., Logarithms in Regression Analysis with Asiaphoria, 19.6 19.8 (Optional: Normal probability plot pp. 607-8) 1 Height son, inches 80 75 70 65 60 Remember Regression? son_hat = 33.887 + 0.514*father n = 1078, R = 0.51, s_e =.437 60 65 70 75 Height father, inches (s.d. of residuals).437 inches: measures scatter about OLS line 0.51: 5.1% of variation in sons heights explained by variation in their fathers heights OLS intercept 33.887: No interpretation b/c father cannot be 0 inches tall OLS slope 0.514: For every extra 1 inch of father s height, son is on average about ½ inch taller (y-hat): Predicted y, given x; E.g. son of a 7 inch tall father predicted to be 70.895 inches (= 33.887 + 0.514*7) (residual): ; E.g. if is 70.895 but is 68.531, then residual is -.364 inches Descriptive & Inferential Statistics Chap. 6: Scatterplots, Association, and Correlation Chap. 7: Introduction to Linear Regression / Simple Reg.: Chaps. 18 & 19 (Inference for Regression & Understanding Regression Residuals) Multiple Reg.: Chaps. 0 & 1 (Multiple Regression & Building Multiple Regression Models) BUT, multiple regression is also a new way to describe data: descriptive statistics 3 Lecture 18, Page 1 of 9
Questions and Data: Still Important Which kind of question? Research question: What is causal effect of a change in X (e.g. match) on Y (e.g. amount given) Descriptive question: what are patterns in data (e.g. how does household spending on food vary with income?) Which kind of data? Observational or experimental Correlation causation is a cliché Instead, apply understanding of data and specific context to fully interpret quantitative results Cross-sectional, time series, or panel 4 Which kind of data are these? On this slide, what is review of Fall-term material and what is new material? Unsurprisingly, we find that higher university density is associated with higher GDP per capita levels. p. 9 Why associated instead of correlated? Also, does quote imply causality? Source: The Economic Impact of Universities: Evidence from Across the Globe a 016 NBER Working Paper http://www.nber.org/papers/w501 5 It is interesting that countries with more universities in 1960 generally had higher growth rates over 1960-000. p. 9 But is it a strong association? 6 Lecture 18, Page of 9
X-variable is defined as Log(1 + universities per million people) Logs can straighten curved scatter plot Plus one addresses countries with 0 universities Example 1: x-value of 1 is a country with 1.7 universities per million: ln(1 + 1.7) 1 E.g. 10 universities w/ pop. 5.8 million: 1.710/5.8 Example : x-value of 3 is a country with 19.09 universities per million: ln(1 + 19.09) 3 E.g. 5 universities w/ pop. 1.31 million: 19.095/1.31 University density is over 11 times bigger in Example, but x-value only 3 times as big (diminishing returns) 7 How to interpret b=.64? On average, countries with a university density (universities per million people) that is 10% higher have average years of schooling that is 0.64 years higher. 8 Frozen Pizza (p. 67) How does the volume of sales depend on the price of frozen pizza? What is the economic name of this relationship? Weekly data on price and quantity for each of four cities (1994 1996); 156 weeks Raw data: ch18_mcsp_frozen_pizza.csv Are these data observational? Cross-sectional, time series, or panel? 9 Lecture 18, Page 3 of 9
Observational Data Price Quantity Demanded Demand Shifters What are some demand shifters for frozen pizza? Why do these demand shifters affect the price the firm chooses to set? 10 Frozen Pizza: OLS 0.7697 0.594 18.1 5.8 Interpret the line? Is the OLS line an estimate of the demand equation? Quantity, 10,000s 10 8 6 4 Denver, 1994-96 n = 156 weeks..4.6.8 3 Price, $1s 11 Simple Linear Regression: One x-variable Model: : dependent var., regressand, y-var., LHS-var. : independent var., regressor, explanatory var., x- var., RHS-var. (i.e. right-hand side variable) : observation index (often or cross-sectional data; time series data; or panel data) : intercept (constant) parameter : slope parameter : error term, residual, disturbance 1 Lecture 18, Page 4 of 9
Error term in Line is expected value: E Error explains deviations from expectations includes all other factors that affect aside from Impossible to collect data on everything: some variables unobserved to the researcher It reflects reality: model cannot control for everything In the above graph is positive or negative? 13 Assumptions Tame Elusive Epsilon We cannot observe but we can observe Notice how many of the six assumptions are about the unobservable Some assumptions can be checked by analyzing (the statistic tied to the parameter, but some cannot In general, models make assumptions about unknowns For example, a model could assume the outcome of the role of a die follows a discrete Uniform distribution: i.e. it s fair with a 1/6 probability of each outcome {1,, 3, 4, 5, 6} 14 Six Assumptions Six assumptions of the linear regression model Your book gives only four assumptions: I suspect it leaves one out because it is fairly obvious I am sure the other one is left out because it is only required if you wish to make a causal interpretation To minimize confusion, number the extra two as 5 & 6 Econometrics addresses violations in assumptions ECO374H Applied Econometrics (Commerce) ECO375H & ECO475H Applied Econometrics I & II 15 Lecture 18, Page 5 of 9
Assumption #1 Regression equation is linear in the error and parameters; the variables (in boxes) are linearly related to each other Not assuming that what is in boxes is linear (so long as no nonlinear functions of parameters or nonlinear functions of the error) Example of a linear regression: Example of a linear regression: ln 16 Frozen Pizza (Chapter 18) Weekly Q, 10,000s 10 8 6 4 Frozen Pizza, Denver, 1994-96 Q-hat = 18.1 + -5.80*P n = 156, R = 0.59, s.e.(b) = 0.353..4.6.8 3 Weekly price, $1s e (residuals) 3 1 0-1 - 3 4 5 6 7 Q_hat Which violations can we see? 17 Natural Log Transformations ln(weekly Q, 10,000s) Frozen Pizza, Denver, 1994-96 ln(q)-hat = 4.095 + -.773*ln(P) n = 156, R = 0.631, s.e.(b) = 0.171.5 1.5 1.7.8.9 1 1.1 ln(weekly price, $1s) e (residuals).4. 0 -. -.4 3 4 5 6 7 ln(q)_hat 18 Lecture 18, Page 6 of 9
Assumption # No autocorrelation / no serial correlation:, 0if Common problem in time-series data E.g. higher than expected inflation today, likely high tomorrow Errors assumed not systematically related across observations residual (e) -10-5 0 5 10 residual (e) -3--1 0 1 Assumption # Holds No Autocorrelation 0 10 0 30 t Assumption # Violated Positive Autocorrelation 0 10 0 30 t 19 Assumption #3 Homoscedasticity:, 1,, Equal variance assumption Error is just as noisy for all values of x Violation is called heteroscedasticity Common problem in cross-sectional data Y1-1 0 1 3 Y 0 1 3 Homoscedasticity 0 4 6 8 10 X1 Heteroscedasticity 0 4 6 8 10 X 0 ln(weekly Q, 10,000s) Fix Assumption #1 issues before checking Assumption #3 Frozen Pizza, Denver, 1994-96 ln(q)-hat = 4.095 + -.773*ln(P) n = 156, R = 0.631, s.e.(b) = 0.171.5 1.5 1.7.8.9 1 1.1 ln(weekly price, $1s) e (residuals).4. 0 -. -.4 3 4 5 6 7 ln(q)_hat Heteroscedasticity unequal variance of the residuals is often a byproduct of a violation of the linearity assumption Is Denver pizza regression an example? Remember that Chapter 18 advises you to check the assumptions in order: start with the linearity assumption 1 Lecture 18, Page 7 of 9
Assumptions #4 & #5 Galton s data (Lec. 5) Assumptions 1-3 hold? Normality: is Normal is unobserved so check Error has mean zero: 0,1,, Constant term (i.e. or ) picks up any constant effects, not the error Density 75 70 65 60 Son, inches80 Galton's Heights, n = 1078..15.1.05 60 65 70 75 Father, inches n = 1078 0-10 -5 0 5 10 residuals (e) Graphical Summary Would the elements in the population (not shown) lie on the line?,,, Is a reflection of sampling error? Assumptions #3, #4, and #5 combined: ~ 0, 3 017 ON Public Sector Disclosure of 016 salaries for University of Waterloo employees Sex n Mean S.d. F 416 $139.74K $33.74K M 941 $155.36K $36.96K OLS Results: Salary-hat = 139.74 + 15.6*Male R = 0.0385, n = 1,357, = 36.006 Do Assumptions #1 - #5 hold? Salary (1,0000's CAN$) Salary and Salary-hat 400 300 00 100 400 300 00 100 Waterloo, 016 Salaries Male=0 Male=1 Waterloo, 016 Salaries 0..4.6.8 1 Male 4 Lecture 18, Page 8 of 9
Assumption #6 x uncorrelated w/ error:, 0 Exogeneity: x variable(s) unrelated with error Dosage is exogenous: Experimental data can est. causal effect: Endogeneity: x variable(s) related with error With observational data, lurking/unobserved/omitted/ confounding variables mean x and error are related Price of pizza is endogenous: Endogeneity bias means: 5 Short-Hand Assumptions 1) Linear relationship between variables (possibly non-linearly transformed) ) No correlation amongst errors (no autocorrelation for time-series data) 3) Homoscedasticity (single variance) of errors 4) Normally distributed errors 5) Constant included (error has mean 0) 6) No relationship between x and error 6 Lecture 18, Page 9 of 9