
Business Statistics Tommaso Proietti DEF - Università di Roma 'Tor Vergata' Linear Regression

Specification

Let Y be a univariate quantitative response variable. We model Y as

Y = f(X) + ε,

where f(X) is the systematic part (that can be predicted using the inputs X) and ε is a disturbance (error) term, accounting for all the sources of variation other than X.

Assumptions:

Linearity. The regression function is linear in the inputs: f(X) = β_0 + Σ_{j=1}^{p} β_j X_j.

Weak exogeneity. E(ε | X) = 0, i.e. X and ε are independent in the mean.

Homoscedasticity. E(ε² | X) = Var(ε | X) = σ².

The input set X includes quantitative variables, nonlinear transformations and basis expansions (to be defined later), as well as dummy variables coding the levels of qualitative inputs. Under these assumptions the regression function is interpreted as the conditional mean of Y given X, f(X) = E(Y | X). This is the optimal predictor of Y under squared loss (the minimum mean square error predictor of Y).

Estimation

Set of training data {(y_i, x_i), i = 1, ..., N}, with x_i = (x_i1, x_i2, ..., x_ip)'. We assume that the training sample is a sample of size N, drawn independently (with replacement) from the underlying population, so that for the i-th individual we can write

y_i = β_0 + Σ_{j=1}^{p} β_j x_ij + ε_i,   i = 1, ..., N,

with E(ε | X) = 0 and Var(ε | X) = σ²I. Writing the N equations in matrix form:

y = Xβ + ε,

where y = (y_1, y_2, ..., y_N)' and ε = (ε_1, ε_2, ..., ε_N)' are N × 1 vectors, β = (β_0, β_1, ..., β_p)' collects the p + 1 coefficients, and X is the N × (p + 1) design matrix whose i-th row is (1, x_i1, x_i2, ..., x_ip).

We aim at estimating the unknown parameters β_0, ..., β_p and σ². An important estimation method is Ordinary Least Squares (OLS). OLS obtains an estimate of β, denoted β̂, as the minimizer of the residual sum of squares

RSS(β) = (y − Xβ)'(y − Xβ).

There is a closed-form solution to this problem. Computing the partial derivatives of RSS(β) with respect to each β_j and equating them to zero yields a linear system of p + 1 equations in p + 1 unknowns (the normal equations),

X'Xβ = X'y,

which is solved with respect to β. The solution is unique provided that the columns of X are not collinear. In matrix notation the solution is

β̂ = (X'X)^{-1} X'y.
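A minimal R sketch of this closed-form solution, on simulated data (the data, seed and variable names below are illustrative and not part of the lecture's examples):

set.seed(123)
N <- 200; p <- 2
X <- cbind(1, matrix(rnorm(N * p), N, p))   # N x (p+1) design matrix, intercept in column 1
beta <- c(2, 1, -0.5)
y <- X %*% beta + rnorm(N)

beta_hat <- solve(crossprod(X), crossprod(X, y))   # solves the normal equations X'X b = X'y
coef(lm(y ~ X[, -1]))                              # same estimates via lm(), which adds its own intercept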

Example: p = 1 (simple linear regression). The normal equations are

N β_0 + β_1 Σ_i x_i = Σ_i y_i
β_0 Σ_i x_i + β_1 Σ_i x_i² = Σ_i x_i y_i

We solve this system by substitution: solving the first equation with respect to β_0 yields

β̂_0 = ȳ − β̂_1 x̄.

Replacing into the second and solving with respect to β̂_1 gives:

β̂_1 = (N⁻¹ Σ_i x_i y_i − x̄ȳ) / (N⁻¹ Σ_i x_i² − x̄²) = Cov(x, y) / Var(x) = Σ_i (x_i − x̄)(y_i − ȳ) / Σ_i (x_i − x̄)².
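A quick R check of these formulas on illustrative simulated data (not the lecture's data):

set.seed(1)
x <- rnorm(50)
y <- 1 + 0.5 * x + rnorm(50)
b1 <- cov(x, y) / var(x)        # slope: Cov(x, y) / Var(x)
b0 <- mean(y) - b1 * mean(x)    # intercept: ybar - b1 * xbar
c(b0, b1)
coef(lm(y ~ x))                 # same values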

Predicted values and residuals

Predicted (fitted) values:

ŷ = Xβ̂ = X(X'X)^{-1}X'y = Hy,

where H = X(X'X)^{-1}X' is the hat matrix, projecting y orthogonally onto the vector space generated by the columns of X.

Residuals. The OLS residuals are

e = y − ŷ = (I − H)y = My,

where M = I − H.

Properties of ŷ and e

The fitted values and the LS residuals have properties that are simple to show and good to know.

Orthogonality: X'e = 0. The residuals are uncorrelated with each of the variables in X. If the model includes the intercept, the residuals have zero mean.

ŷ'e = 0.

RSS(β̂) = e'e = Σ_i e_i² (henceforth RSS).

From y = ŷ + e and the previous properties, ||y||² = ||ŷ||² + ||e||².

ȳ = N⁻¹ Σ_i y_i = N⁻¹ Σ_i ŷ_i: the average of the predicted values equals the average of y.
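The following R sketch computes H, ŷ and e explicitly and verifies these properties numerically (illustrative simulated data; in practice one would use lm(), fitted() and residuals()):

set.seed(42)
N <- 100
X <- cbind(1, rnorm(N), rnorm(N))            # design matrix with intercept
y <- X %*% c(1, 2, -1) + rnorm(N)

H    <- X %*% solve(crossprod(X)) %*% t(X)   # hat matrix
yhat <- H %*% y                              # fitted values
e    <- y - yhat                             # residuals, = (I - H) y = M y

crossprod(X, e)           # X'e = 0 (up to rounding error)
crossprod(yhat, e)        # yhat'e = 0
mean(e)                   # zero mean, since the model includes the intercept
c(mean(y), mean(yhat))    # average of y equals average of the fitted values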

Estimation of σ²

For estimating σ² (the variance of the error term) we could use the variance of the residuals corrected for the number of degrees of freedom:

σ̂² = Σ_i e_i² / (N − p − 1).

N − p − 1 is known as the number of degrees of freedom. σ̂ is the standard error of the regression (SER).
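In R, σ̂ is what summary() reports as the residual standard error; a short sketch of the computation on an illustrative built-in dataset (mtcars, not the lecture's data):

fit <- lm(mpg ~ wt + hp, data = mtcars)   # illustrative model
e   <- residuals(fit)
N   <- nobs(fit)
p   <- length(coef(fit)) - 1              # number of slopes, intercept excluded
sigma2_hat <- sum(e^2) / (N - p - 1)      # RSS / degrees of freedom
sqrt(sigma2_hat)                          # standard error of the regression (SER)
sigma(fit)                                # same value, as used by summary(fit)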

Goodness of fit

Define

TSS = Σ_i (y_i − ȳ)²  (total sum of squares: deviance of y_i)
ESS = Σ_i (ŷ_i − ȳ)²  (explained sum of squares: deviance of ŷ_i)
RSS = Σ_i e_i²  (residual sum of squares: deviance of e_i)

It is easy to prove that

TSS = ESS + RSS.

A relative measure of goodness of fit is

R² = ESS / TSS = 1 − RSS / TSS.

This measure suffers from a serious drawback when used for model selection: the inclusion of (possibly irrelevant) additional regressors always produces an increase in R².

Adjusted R²:

R̄² = 1 − [RSS/(N − p − 1)] / [TSS/(N − 1)] = 1 − [(N − 1)/(N − p − 1)] (1 − R²).
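An R sketch computing R² and adjusted R² from the sums of squares, again on the illustrative mtcars model (summary() returns the same numbers):

fit  <- lm(mpg ~ wt + hp, data = mtcars)
y    <- mtcars$mpg
yhat <- fitted(fit)
TSS  <- sum((y - mean(y))^2)
ESS  <- sum((yhat - mean(y))^2)
RSS  <- sum(residuals(fit)^2)
c(TSS, ESS + RSS)                                       # TSS = ESS + RSS
N <- nobs(fit); p <- length(coef(fit)) - 1
R2    <- 1 - RSS / TSS
adjR2 <- 1 - (N - 1) / (N - p - 1) * (1 - R2)
c(R2, adjR2)
c(summary(fit)$r.squared, summary(fit)$adj.r.squared)   # same values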

Properties of the OLS estimators

We are going to look at the statistical properties of the OLS estimators (and other related quantities, such as the fitted values and the residuals), hypothesizing that we can draw repeated training samples from the same population, keeping the X's fixed and drawing different Y's. A distribution of outcomes will arise.

Under the stated assumptions,

E(β̂ | X) = β,   Var(β̂ | X) = σ² (X'X)^{-1}.

Also, E(σ̂²) = σ². Moreover, the OLS estimators are Best Linear Unbiased (Gauss-Markov theorem): E(β̂_j) = β_j for all j, and Var(β̂_j) = MSE(β̂_j) is the smallest among the class of all linear unbiased estimators.

Example: p = 1 (simple regression model)

Var(β̂_0) = σ² [1/N + x̄² / Σ_i (x_i − x̄)²],   Var(β̂_1) = σ² / Σ_i (x_i − x̄)².

If we further assume ε_i | x_i ~ NID(0, σ²), then y_i | x_i ~ N(β_0 + β_1 x_i, σ²) and the OLS estimators have a normal distribution:

β̂ ~ N(β, σ² (X'X)^{-1}).

β̂ is the maximum likelihood estimator of the coefficients (we will say more about this).

Letting s.e.(β̂_j) denote the estimated standard error of the OLS estimator, i.e. the square root of the j-th diagonal element of Var̂(β̂) = σ̂² (X'X)^{-1},

t_j = (β̂_j − β_j) / s.e.(β̂_j) ~ t_{N−p−1},

a Student's t random variable with N − p − 1 degrees of freedom. This result is used to test hypotheses on a single coefficient and to produce interval estimates.
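A sketch of how the entries of the summary() coefficient table and the interval estimates can be reproduced by hand, on the illustrative mtcars model:

fit  <- lm(mpg ~ wt + hp, data = mtcars)
X    <- model.matrix(fit)
V    <- sigma(fit)^2 * solve(crossprod(X))    # estimated Var(beta_hat) = sigma2_hat (X'X)^{-1}
se   <- sqrt(diag(V))                         # standard errors
tval <- coef(fit) / se                        # t statistics for H0: beta_j = 0
pval <- 2 * pt(abs(tval), df = df.residual(fit), lower.tail = FALSE)
cbind(estimate = coef(fit), se, tval, pval)   # matches summary(fit)$coefficients
confint(fit, level = 0.95)                    # interval estimates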

Suppose we wish to test for the significance of a subset of the coefficients. The relevant statistic for the null hypothesis that J coefficients are all equal to zero is

F = [(RSS_0 − RSS)/J] / [RSS/(N − p − 1)] ~ F_{J, N−p−1},

where RSS_0 is the residual sum of squares of the restricted model (i.e. the model with J fewer explanatory variables), and F_{J, N−p−1} is Fisher's F distribution with J and N − p − 1 degrees of freedom.

As a special case, the test for H_0: β_1 = β_2 = ... = β_p = 0 (all the regression coefficients are zero, except for the intercept) is:

F = [ESS/p] / [RSS/(N − p − 1)] = [R²/p] / [(1 − R²)/(N − p − 1)].
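In R this amounts to comparing nested fits with anova(); a minimal sketch on the illustrative mtcars model, testing whether the J = 2 coefficients on hp and qsec are jointly zero:

full       <- lm(mpg ~ wt + hp + qsec, data = mtcars)
restricted <- lm(mpg ~ wt, data = mtcars)    # J = 2 fewer explanatory variables
anova(restricted, full)                      # F = [(RSS0 - RSS)/J] / [RSS/(N - p - 1)]
summary(full)$fstatistic                     # overall F: H0 that all slopes are zero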

Example: clothing data. Regression of total sales (logs) on hours worked and store size (logs).

Call:
lm(formula = y ~ x1 + x2, data = clothing)

Residuals:
     Min       1Q   Median       3Q      Max
-1.89464 -0.21835  0.01462  0.29581  1.63896

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.73152    0.21850  35.385  < 2e-16 ***
x1           0.88949    0.05791  15.359  < 2e-16 ***
x2           0.31488    0.04308   7.309 1.49e-12 ***
---
Residual standard error: 0.4311 on 397 degrees of freedom
Multiple R-squared: 0.6342, Adjusted R-squared: 0.6324
F-statistic: 344.2 on 2 and 397 DF, p-value: < 2.2e-16

Figure: Fitted values (3-D surface of log(tsales) plotted against log(hoursw) and log(ssize)).

Example: house price regression in R

lm(formula = price ~ sqft + Age + Pool + Bedrooms + Fireplace + Waterfront + DOM)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   4336.851  12084.856   0.359    0.720
sqft            97.862      3.485  28.082  < 2e-16 ***
Age           -694.671    141.292  -4.917 1.02e-06 ***
Pool           815.028   8926.466   0.091    0.927
Bedrooms    -20923.878   4567.168  -4.581 5.16e-06 ***
Fireplace      -97.130   5152.184  -0.019    0.985
Waterfront   63376.186   9389.874   6.749 2.43e-11 ***
DOM            -20.988     24.841  -0.845    0.398
---
Residual standard error: 76390 on 1072 degrees of freedom
Multiple R-squared: 0.6162, Adjusted R-squared: 0.6137
F-statistic: 245.9 on 7 and 1072 DF, p-value: < 2.2e-16

Figure: Histogram of the residuals e.

Properties of the OLS residuals

The i-th residual e_i has zero expectation (under the model's assumptions) and Var(e_i) = σ²(1 − h_i), where h_i = x_i'(X'X)^{-1}x_i is the leverage of observation i (the i-th diagonal element of H).

h_i is an important diagnostic quantity: it measures how remote the i-th individual is in the space of the inputs. The barplot of h_i versus i illustrates the influence of the i-th observation on the fit. It is possible to show that Σ_i h_i = p + 1, so that the average leverage is (p + 1)/N.

The standardized residual is defined as

r_i = e_i / (σ̂ √(1 − h_i))

and is used in regression diagnostics (normal probability plot, check for homoscedasticity).
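In R the leverages and the standardized residuals are returned by hatvalues() and rstandard(); a short diagnostic sketch on the illustrative mtcars model:

fit <- lm(mpg ~ wt + hp, data = mtcars)
h   <- hatvalues(fit)                      # leverages h_i, diagonal of H
sum(h)                                     # equals p + 1
barplot(h, xlab = "i", main = "Leverage")  # barplot of h_i versus i
r   <- rstandard(fit)                      # e_i / (sigma_hat * sqrt(1 - h_i))
qqnorm(r); qqline(r)                       # normal probability plot of the standardized residuals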

Figure: Anscombe quartet: four datasets giving the same set of fitted values.

Figure: Anscombe quartet: leverage plot (h_i versus i).

Specification issues: omitted variable bias, collinearity, etc.

Consider the regression model y = Xβ + γw + ε, where w is a vector of N measurements on an added variable W. We can show that the OLS estimator of γ is

γ̂ = (w'My) / (w'Mw),   with M = I − X(X'X)^{-1}X',

and

Var(γ̂) = σ² / (w'Mw).

These expressions have a nice interpretation (see the sketch below): γ̂ results from the following steps.

1. Regress Y on X and obtain the residuals e_{Y|X} = My.
2. Regress W on X and obtain the residuals e_{W|X} = Mw.
3. Regress e_{Y|X} on e_{W|X} to obtain the LS estimate γ̂ (the partial regression coefficient).
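A sketch of this partial-regression (Frisch-Waugh) interpretation in R, on the illustrative mtcars model, where hp plays the role of the added variable W:

y <- mtcars$mpg
X <- model.matrix(~ wt + qsec, data = mtcars)   # X, including the intercept column
w <- mtcars$hp                                  # added variable W

e_y_x <- residuals(lm(y ~ X - 1))    # residuals of Y on X  ( = My )
e_w_x <- residuals(lm(w ~ X - 1))    # residuals of W on X  ( = Mw )
coef(lm(e_y_x ~ e_w_x - 1))          # partial regression coefficient gamma_hat

coef(lm(mpg ~ wt + qsec + hp, data = mtcars))["hp"]   # same value in the full regression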

Multicollinearity

Mw contains the residuals of the regression of W on X, and w'Mw is the residual sum of squares from that regression. If X explains most of the variation in W, this quantity is very small and Var(γ̂) is very large. We can conclude that the effect that W exerts on Y is poorly (i.e. very imprecisely) estimated.
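A small simulation illustrating this effect (an assumed setup, not from the lecture): when W is almost a linear function of X, the standard error of γ̂ inflates sharply:

set.seed(7)
N <- 200
x <- rnorm(N)
w_indep <- rnorm(N)               # W unrelated to X
w_coll  <- x + 0.05 * rnorm(N)    # W almost collinear with X
y1 <- 1 + x + 0.5 * w_indep + rnorm(N)
y2 <- 1 + x + 0.5 * w_coll  + rnorm(N)

summary(lm(y1 ~ x + w_indep))$coefficients["w_indep", "Std. Error"]
summary(lm(y2 ~ x + w_coll))$coefficients["w_coll", "Std. Error"]    # much larger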

Omission of relevant variables

If W is omitted from the regression, the OLS estimator β̂ is biased, but also more precise:

E(β̂) = β + (X'X)^{-1}X'w γ,   Var(β̂) = σ²(X'X)^{-1} ≤ σ²(X'M_W X)^{-1},   M_W = I − w(w'w)^{-1}w',

where σ²(X'M_W X)^{-1} is the variance of the estimator of β when W is included in the model. If W is irrelevant (the true γ = 0), there is only a cost associated with including it in the model. In any case, if W is orthogonal to X, the distribution of β̂ is not affected by W.
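A minimal simulation of the omitted variable bias formula (an assumed setup for illustration): with w correlated with x and γ = 0.5, the slope of w on x is roughly 0.8, so the bias of the coefficient on x is approximately 0.8 × 0.5 = 0.4:

set.seed(8)
N <- 500
x <- rnorm(N)
w <- 0.8 * x + rnorm(N)                  # W correlated with X
y <- 1 + 1.0 * x + 0.5 * w + rnorm(N)    # true model includes W (gamma = 0.5)

coef(lm(y ~ x + w))["x"]   # close to 1.0: unbiased when W is included
coef(lm(y ~ x))["x"]       # close to 1.4: biased upward by (X'X)^{-1}X'w * gamma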

Prediction of a new sample observation

We are interested in predicting Y for a new unit with values of X in the vector x_o = (1, x_o1, ..., x_op)'. The optimal predictor under squared loss is

ŷ_o = f̂(x_o) = β̂_0 + β̂_1 x_o1 + ... + β̂_p x_op = x_o'β̂.

Under the stated assumptions, the predictor is unbiased, i.e. the prediction error has zero mean, E(Y_o − ŷ_o) = 0, and

Var(Y_o − ŷ_o) = σ² (1 + x_o'(X'X)^{-1} x_o).

Note: the stated assumptions concern the selection of the set of predictors in X and the properties of ε.
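In R, point and interval predictions for a new unit are obtained with predict(); a minimal sketch on the illustrative mtcars model, with a made-up new profile x_o:

fit   <- lm(mpg ~ wt + hp, data = mtcars)
x_new <- data.frame(wt = 3.0, hp = 150)                 # hypothetical new unit
predict(fit, newdata = x_new)                           # point prediction x_o' beta_hat
predict(fit, newdata = x_new, interval = "prediction")  # uses Var(Y_o - yhat_o) = sigma^2 (1 + x_o'(X'X)^{-1} x_o)

X  <- model.matrix(fit)
xo <- c(1, 3.0, 150)
drop(t(xo) %*% solve(crossprod(X)) %*% xo)              # the quadratic form x_o'(X'X)^{-1} x_o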

Predictive accuracy depends on:

1. The ability to select all the relevant variables. Failure to do so will result in an inflated σ² and E(ε | X) ≠ 0, so that β̂ is no longer optimal.
2. How far the unit profile x_o is from the rest: x_o'(X'X)^{-1}x_o measures the (Mahalanobis) distance of the values of X_1, ..., X_p for the o-th unit in the space of the X's.

If the true model is y = Xβ + γw + ε, with E(ε | X, w) = 0 and Var(ε | X, w) = σ_ε² I, the above predictor is biased and its mean square error is

E[(Y_o − f̂(x_o))² | X, w, x_o, w_o] = σ_ε² (1 + x_o'(X'X)^{-1} x_o) + Bias²,

with

Bias = (β − E(β̂))'x_o + γ w_o = (w_o − w'X(X'X)^{-1} x_o) γ.

The term in parentheses is the error made in predicting w_o using the information in x_o.