Econ 300/QAC 201: Quantitative Methods in Economics/Applied Data Analysis. 12th Class 6/23/10

"In God we trust; all others must use data." --W. Edwards Deming

hand out review sheet and answers; point to old test and answers

Now that we've done Ch. 9, for the QAC 201 folks, spend a little time on chi-squared tests (see Ch. 17 for reference). These let us consider the whole frequency distribution instead of just one part of it:
e.g. look at the polling numbers for all candidates instead of just the winning candidate's percentage
e.g. look at the color distribution for all M&Ms instead of just blue

Compare the observed frequencies (O) to the expected frequencies (E) and test the null hypothesis by creating the test statistic

χ² = Σ (O - E)² / E

This is distributed according to the chi-square distribution (see Table VII), similar to the t-distribution in that it is a family of distributions indexed by the d.f. (equal to the number of categories minus one) that converges to the normal distribution as the degrees of freedom increase.

H0: the sample comes from the expected distribution
Ha: the sample doesn't come from it

If the sample value exceeds the critical value in Table VII, reject the null (the deviations are too great).

Note that you can also do this if the data are sorted by more than one variable. Births example: by season, or by season and part of the country (in the two-variable case, d.f. = (rows - 1)(columns - 1)). In the Swedish case covered in the chapter, note that the four seasons are not of equal length (winter is five months, summer and fall only two each), so different than in the northeastern U.S.

Let's do our class: out of 12 people, expect 3 in each season if births are distributed equally (the null hypothesis). The observed counts are 3, 4, 1, 4, so the sample value is χ² = (0 + 1 + 4 + 1)/3 = 2. Compared to the critical value of 7.81 (3 d.f., 5% significance level), we can't reject the null for our class.
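A quick machine check of the class example (a minimal sketch, assuming SciPy is available; the counts are the ones above):

# Chi-squared goodness-of-fit test for the class birth-season example
from scipy.stats import chi2, chisquare

observed = [3, 4, 1, 4]     # births by season in our class
expected = [3, 3, 3, 3]     # equal births across seasons under H0

stat, p_value = chisquare(observed, f_exp=expected)  # stat = sum((O-E)^2 / E)
critical = chi2.ppf(0.95, df=len(observed) - 1)      # 5% critical value, 3 d.f.

print(f"chi-squared = {stat:.2f}, p-value = {p_value:.3f}")
print(f"critical value = {critical:.2f}")            # 7.81
# 2.00 < 7.81, so we cannot reject the null of equal births by season.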

Ch. 11 and 12: the first part of Part III of the course: relating two or more variables.

Fitting a line to a scatter plot (Y is the dependent variable, X is the independent variable; so we are making a causality assumption) [draw up picture]: fit by eye, or use a criterion. The commonly used criterion (but not the only one): minimize the sum of squared deviations of the Ys from the line.

Short history of the Method of Least Squares [all facts as related in Stephen M. Stigler, The History of Statistics: The Measurement of Uncertainty before 1900, Cambridge, MA: Belknap Press, 1986]: in 1805, Adrien-Marie Legendre developed the Method of Least Squares as a method of reconciling a set of measurements; it was rapidly accepted and diffused geographically, and was used in astronomy and geodesy.

Write the fitted line and the deviations from it as

Ŷ = a + bX
d = Y - Ŷ

and minimize Σd² = Σ(Y - Ŷ)² with respect to the choices of a and b. The solution (using calculus) gives the optimal a (intercept) and b (slope):

b = Σ(X - X̄)(Y - Ȳ) / Σ(X - X̄)²  and  a = Ȳ - bX̄

note also that

b = Cov(X,Y) / Var(X)

(actually the MSD, the mean squared deviation, is in the denominator, but close enough for large n)

Go through and derive this (set S, for the sum of squares, equal to Σd²):

min S = Σ[Y - (a + bX)]²
      = Σ[Y² - 2(a + bX)Y + (a + bX)²]
      = ΣY² - 2aΣY - 2bΣXY + Σ(a² + 2abX + b²X²)
      = ΣY² - 2aΣY - 2bΣXY + na² + 2abΣX + b²ΣX²

∂S/∂a = -2ΣY + 2na + 2bΣX = 0
∂S/∂b = -2ΣXY + 2aΣX + 2bΣX² = 0

Get rid of the 2s throughout and reorder terms:

na - ΣY + bΣX = 0
aΣX - ΣXY + bΣX² = 0

From the first equation:

na = ΣY - bΣX
a = ΣY/n - bΣX/n
a = Ȳ - bX̄

Substitute this into the second equation (using ΣX = nX̄):

(Ȳ - bX̄)ΣX - ΣXY + bΣX² = 0
ȲΣX - bX̄ΣX - ΣXY + bΣX² = 0
b(ΣX² - X̄ΣX) = ΣXY - ȲΣX

so b = [ΣXY - nX̄Ȳ] / [ΣX² - nX̄²] = nCov(X,Y) / nMSD(X), which is the same as

b = Σ(X - X̄)(Y - Ȳ) / Σ(X - X̄)²

Another way to write this is to set up a notation for the deviations:

x = X - X̄
y = Y - Ȳ

(notation to remind us these deviations are generally smaller values than the original values X and Y)

Now we can write b as:

b = Σxy / Σx²

These are known as the ordinary least squares (OLS) estimates of a and b. Why do we tend to use this method of line-fitting in preference to other methods?
--simple formulas for a and b
--the line goes through the means of X and Y
--efficient
--unbiased

Still, there are many modifications of this basic methodology which have been developed to adjust for problems with the data, e.g. WLS (weighted least squares). A minimal coded version of the OLS formulas follows below.

Examples from economics of these simple linear functions:
consumption function in macroeconomics: if GNP = national income, then C = a + b(GNP), where b = MPC (think about how you would want to expand this to include other variables, e.g., wealth: C = a + b(GNP) + c(wealth))
investment function in macroeconomics: I = f(r); expand to I = f(r, GNP)
demand curve in microeconomics: Qd = a + bP (where b < 0)
supply curve in microeconomics: Qs = c + dP (where d > 0)
(think again about how you would want to expand these functions to include other variables)
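To make the OLS formulas concrete, a minimal sketch (the data are hypothetical, made up for illustration; only NumPy is assumed):

# OLS by the deviation formulas: b = sum(xy) / sum(x^2), a = Ybar - b*Xbar
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # hypothetical independent variable
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])   # hypothetical dependent variable

x = X - X.mean()   # deviations of X from its mean
y = Y - Y.mean()   # deviations of Y from its mean

b = (x * y).sum() / (x ** 2).sum()   # slope
a = Y.mean() - b * X.mean()          # intercept: line goes through (Xbar, Ybar)

print(f"fitted line: Yhat = {a:.3f} + {b:.3f}X")
print(np.polyfit(X, Y, deg=1))       # cross-check: returns [slope, intercept]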

on to Ch. 12

Ch. 11 shows how to summarize data by fitting a line to it. But in order to make inferences about the population from which the sample scatter of observations was drawn, we must build a statistical model (i.e., state our assumptions about the nature of the population and the sample); then we will be able to construct confidence intervals and test hypotheses.

1809: Carl Friedrich Gauss couched the problem of planetary orbit determination in explicitly probabilistic theory.

[draw up a scatter diagram] Note that the Ys are each drawn from different (conditional) probability distributions P(Yi | Xi), as they have different Xs associated with them.

Simplifying assumptions about the Ys (about their distributions):
1) homogeneous variance: all Ys have the same variance σ²
2) the mean of each Y is E(Yi | Xi) = α + βXi, where our goal is to estimate α and β; we shorthand E(Yi | Xi) = E(Yi), but remember it is a conditional mean
3) the Ys are independent of each other

The simple regression model can be written as:

Yi = α + βXi + ei

where the ei are independent error terms with E(ei) = 0 and Var(ei) = σ² for each ei. "Error" here signifies wandering, and the inclusion of an error term distinguishes statistical models from other models [give examples of other models]. This was a big insight in economics (from the 1940s): economic models did not have to be exact, and error need not signify measurement error; rather, measurement error is one component of the error, but the other component is inherent variability (consider whether inclusion of more and more controls in the experiment could reduce this to zero; same as asking whether more and more variables in a multiple regression could reduce this to zero).

Since E(ei) = 0, E(Yi) = α + βXi. Rather than carry along all the i subscripts, we can write the regression model in shorthand as:

E(Y) = α + βX
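A small simulation makes the model and its assumptions concrete, and previews the unbiasedness result E(b) = β stated below (a minimal sketch; the values of α, β, and σ are made up, and only NumPy is assumed):

# Simulate the simple regression model Y_i = alpha + beta*X_i + e_i, where the
# e_i are independent, have mean zero, and share the same variance sigma^2
# (the three simplifying assumptions above).
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, sigma = 2.0, 0.5, 1.0   # hypothetical population parameters
X = np.linspace(0, 10, 25)           # the Xs are taken as fixed
x = X - X.mean()

# Each simulated sample draws every Y_i from its own conditional distribution
# P(Y_i | X_i): mean alpha + beta*X_i, common variance sigma^2, independent.
slopes = []
for _ in range(2000):
    Y = alpha + beta * X + rng.normal(0.0, sigma, size=X.size)
    slopes.append((x * (Y - Y.mean())).sum() / (x ** 2).sum())

# The average fitted slope is close to the true beta, illustrating E(b) = beta.
print(f"average simulated slope: {np.mean(slopes):.3f} (true beta = {beta})")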

E(Y) = α + βx so our notation for the true regression is: E(Y) = α + βx and for the fitted, or estimated regression, based on the available sample of paired Xs and Ys: ˆ Y = a + bx we can describe the closeness of the slope of the estimated line to the slope of the true population line: E(b) = β SE(b) = σ where x = (X - X ) 2 x let s decompose the standard error of b: SE(b) = σ n x 2 n = σ n 1 s x (for large n, since n is MSD, not variance) so standard error can be reduced in three ways: 1) reduce σ, the inherent variability of the Y observations (not generally feasible) 2) increase n (generally possible, but may be expensive) 3) increase s x, the spread of the X values; s x is called the leverage of the X values on b [illustrate with figure like Figure 12-4] of course, we cannot observe σ in general, so must estimate the standard error of b, using the residual variance s 2 to estimate the variance σ 2, where s 2 : s 2 1 n - 2 (Y - Y ˆ ) 2 and SE = s 12-6

Note that fitting a regression line leaves only n - 2 degrees of freedom, because you must have 2 points in order to fit a line (which would then be a perfect fit, in that both points would lie on the line); hence the denominator in s².

Now we can create a confidence interval around b:

β = b ± t.025 SE(b) = b ± t.025 s / √(Σx²), where the d.f. for t are n - 2

and we can test the null hypothesis that β = 0 (same as saying that X and Y are unrelated, i.e., the regression line is horizontal--draw it) by seeing if 0 is in the confidence interval, or by setting up a t-ratio for b:

t = b / SE(b)

We can also set up a confidence interval for the mean of Y0, given X0:

µ0 = (a + bX0) ± t.025 s √(1/n + (X0 - X̄)²/Σx²)

and for an individual Y0, given X0:

Y0 = (a + bX0) ± t.025 s √(1/n + (X0 - X̄)²/Σx² + 1)

(motivate why this interval is wider, and why both intervals widen as X0 moves away from X̄)

interpolation vs. extrapolation -- the latter is riskier [illustrate] [show extrapolation cartoon]
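And the interval and test for the slope in code (a minimal sketch; same hypothetical data, with SciPy supplying the t quantile):

# 95% confidence interval for beta and t-ratio for H0: beta = 0
import numpy as np
from scipy.stats import t as t_dist

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # hypothetical sample
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

n = X.size
x = X - X.mean()
b = (x * (Y - Y.mean())).sum() / (x ** 2).sum()
a = Y.mean() - b * X.mean()
s = np.sqrt(((Y - (a + b * X)) ** 2).sum() / (n - 2))
se_b = s / np.sqrt((x ** 2).sum())

t_crit = t_dist.ppf(0.975, df=n - 2)       # t.025 with n - 2 d.f.
print(f"95% CI for beta: {b - t_crit * se_b:.3f} to {b + t_crit * se_b:.3f}")
print(f"t-ratio: {b / se_b:.2f}")
# Reject H0: beta = 0 at the 5% level if 0 lies outside the interval,
# equivalently if |t-ratio| > t_crit.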

Do problem 12-14. Estimate the OLS line as Ŷ = 16.25 + 0.625X, so our estimate when X (the problem's X2) = 55 is Ŷ (the problem's X1) = 50.625. For the 95% interval for an individual Y0 (here s² = 180, n = 150, X̄ = 70, and Σx² = 24000, so t.025 with 148 d.f. is 1.98):

Y0 = 50.625 ± 1.98 · √180 · √[(1/150) + (55 - 70)²/24000 + 1] = 50.625 ± 26.777

or about 24 to 77. (A quick numerical check of this arithmetic appears at the end of these notes.)

Note the underlying assumptions we made, namely that these 150 students were regarded as a random sample from a conceptual population of many students, where the expected values of Y given X form a straight line, the variance of Y given X is constant along the line (think about this in context), and the student can be regarded as randomly drawn from this population (think about this in context).

Ch. 13: multiple regression. We go to a multiple regression model for several reasons:
1) we are simultaneously interested in the effects of several independent variables on a dependent variable (often because we are testing a theory which involves several variables, or fitting a relationship that involves several variables, e.g. a production function)
2) we want to explain as much of the variation in Y as possible (subject to Occam's Razor)
3) we are only interested in the effect of one variable, but want to measure its effect accurately:
a. need to eliminate bias in estimating the direct effect of this variable by including confounding factors
b. want to distinguish between direct and indirect effects of a variable (path analysis)

go on to Ch. 13 tomorrow
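The promised numerical check of the problem 12-14 interval (a minimal sketch; every input below is a number given in the problem or read off the computation above):

# Numerical check of the problem 12-14 prediction interval for an individual Y0
import numpy as np

a, b = 16.25, 0.625              # fitted OLS line from problem 12-14
n, s2 = 150, 180.0               # sample size and residual variance
x_bar, sum_x2 = 70.0, 24000.0    # mean of X and sum of squared x deviations
x0, t_crit = 55.0, 1.98          # point of prediction; t.025 with 148 d.f.

y_hat = a + b * x0
half_width = t_crit * np.sqrt(s2) * np.sqrt(1/n + (x0 - x_bar)**2 / sum_x2 + 1)

print(f"Yhat = {y_hat:.3f}")     # 50.625
print(f"95% interval: {y_hat - half_width:.1f} to {y_hat + half_width:.1f}")
# Prints roughly 23.8 to 77.4, i.e., about 24 to 77 as in the notes.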