Estadística II Chapter 5. Regression analysis (second part)


Chapter 5. Regression analysis (second part)

Contents
- Diagnostics: residual analysis
- The ANOVA (ANalysis Of VAriance) decomposition
- Nonlinear relationships and linearizing transformations
- The simple linear regression model in matrix notation
- Introduction to multiple linear regression

Lesson 5. Regression analysis (second part)

Bibliography references
- Newbold, P., Estadística para los negocios y la economía (1997), Chapters 12, 13 and 14
- Peña, D., Regresión y Análisis de Experimentos (2002), Chapters 5 and 6

Regression diagnostics

Diagnostics
The theoretical assumptions for the simple linear regression model, with one response variable y and one explanatory variable x, are:
- Linearity: y_i = β_0 + β_1 x_i + u_i, for i = 1, ..., n
- Homogeneity: E[u_i] = 0, for i = 1, ..., n
- Homoscedasticity: Var[u_i] = σ², for i = 1, ..., n
- Independence: u_i and u_j are independent for i ≠ j
- Normality: u_i ∼ N(0, σ²), for i = 1, ..., n
We study how to apply diagnostic procedures to test whether these assumptions are appropriate for the available data (x_i, y_i). These procedures are based on the analysis of the residuals e_i = y_i − ŷ_i.

Diagnostics: scatterplots

Scatterplots
The simplest diagnostic procedure is based on the visual examination of the scatterplot of (x_i, y_i). Often this simple but powerful method reveals patterns suggesting whether the theoretical model might be appropriate or not. We illustrate its application on a classical example: consider the following four datasets.

Diagnostics: scatterplots

The Anscombe datasets (Table 3-10: Four Data Sets)

Dataset 1        Dataset 2        Dataset 3        Dataset 4
 X      Y         X      Y         X      Y         X      Y
10.0   8.04      10.0   9.14      10.0   7.46       8.0   6.58
 8.0   6.95       8.0   8.14       8.0   6.77       8.0   5.76
13.0   7.58      13.0   8.74      13.0  12.74       8.0   7.71
 9.0   8.81       9.0   8.77       9.0   7.11       8.0   8.84
11.0   8.33      11.0   9.26      11.0   7.81       8.0   8.47
14.0   9.96      14.0   8.10      14.0   8.84       8.0   7.04
 6.0   7.24       6.0   6.13       6.0   6.08       8.0   5.25
 4.0   4.26       4.0   3.10       4.0   5.39      19.0  12.50
12.0  10.84      12.0   9.13      12.0   8.15       8.0   5.56
 7.0   4.82       7.0   7.26       7.0   6.42       8.0   7.91
 5.0   5.68       5.0   4.74       5.0   5.73       8.0   6.89

SOURCE: F. J. Anscombe, op. cit.

Diagnostics: scatterplots

The Anscombe example
The estimated regression model is the same for each of the four previous datasets:
ŷ_i = 3.0 + 0.5 x_i, with n = 11, x̄ = 9.0, ȳ = 7.5, r_xy = 0.817
The estimated standard error of the estimator β̂_1, √(s_R² / ((n − 1) s_x²)), takes the value 0.118 in all four cases, and the corresponding T statistic takes the value T = 0.5/0.118 = 4.237. But the corresponding scatterplots show that the four datasets are quite different. Which conclusions could we reach from these diagrams?
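
As a quick numerical check, here is a minimal sketch (Python with numpy assumed; the values are dataset 1 from the table above) that reproduces the common fitted line and correlation. Repeating it with the other three datasets gives essentially the same numbers.

```python
import numpy as np

# Anscombe dataset 1, taken from the table above
x = np.array([10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0])
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

b1, b0 = np.polyfit(x, y, 1)      # least-squares slope and intercept
r = np.corrcoef(x, y)[0, 1]       # sample correlation r_xy

print(b0, b1, r)                  # approx. 3.0, 0.5, 0.817
```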

Diagnostics: scatterplots

Anscombe data scatterplots
[FIGURE 3-29: Scatterplots for the four data sets of Table 3-10. SOURCE: F. J. Anscombe, op. cit.]

Residual analysis

Further analysis of the residuals
If the observation of the scatterplot is not sufficient to reject the model, a further step is to use diagnostic methods based on the analysis of the residuals e_i = y_i − ŷ_i. This analysis starts by standardizing the residuals, that is, dividing them by the residual (quasi-)standard deviation s_R. The resulting quantities are known as standardized residuals: e_i / s_R. Under the assumptions of the linear regression model, the standardized residuals are approximately independent standard normal random variables, so a plot of them should show no clear pattern.

Residual analysis

Residual plots
Several types of residual plots can be constructed. The most common ones are:
- plot of standardized residuals vs. x
- plot of standardized residuals vs. ŷ (the predicted responses)
Deviations from the model hypotheses result in patterns on these plots, which should be visually recognizable.
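
As an illustration, a minimal sketch (numpy and matplotlib assumed, with x and y the data arrays) of how these two plots could be produced:

```python
import numpy as np
import matplotlib.pyplot as plt

# Fit the simple linear regression and compute the residuals
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
e = y - y_hat

# Standardize with the residual (quasi-)standard deviation s_R
n = len(y)
s_R = np.sqrt(np.sum(e**2) / (n - 2))
e_std = e / s_R

# Standardized residuals vs. x and vs. the predicted responses
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.scatter(x, e_std)
ax1.set_xlabel("x")
ax2.scatter(y_hat, e_std)
ax2.set_xlabel("predicted response")
for ax in (ax1, ax2):
    ax.axhline(0, color="grey")
    ax.set_ylabel("standardized residual")
plt.show()
```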

Residual plots examples: Consistency with the theoretical model (Example 5.1)

Residual plots examples: Nonlinearity (Example 5.1)

Residual plots examples: Heteroscedasticity (Example 5.1)

Residual analysis

Outliers
In a plot of the regression line we may observe outlier data, that is, data that show significant deviations from the regression line (or from the remaining data). The parameter estimators for the regression model, β̂_0 and β̂_1, are very sensitive to these outliers. It is important to identify the outliers and make sure that they really are valid data.

Residual analysis

Normality of the errors
One of the theoretical assumptions of the linear regression model is that the errors follow a normal distribution. This assumption can be checked visually from the residuals e_i, using different approaches:
- by inspection of the frequency histogram of the residuals
- by inspection of the normal probability plot of the residuals (significant deviations of the data from the straight line in the plot correspond to significant departures from the normality assumption)
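
A short sketch (matplotlib and scipy assumed, with e the residual vector from the fit, as in the earlier sketch) of both visual checks:

```python
import matplotlib.pyplot as plt
from scipy import stats

fig, (ax1, ax2) = plt.subplots(1, 2)

# Frequency histogram of the residuals
ax1.hist(e, bins=10)
ax1.set_xlabel("residual")

# Normal probability plot: marked departures from the straight line
# suggest a departure from the normality assumption
stats.probplot(e, dist="norm", plot=ax2)
plt.show()
```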

The ANOVA decomposition

Introduction
ANOVA: ANalysis Of VAriance. When fitting the simple linear regression model ŷ_i = β̂_0 + β̂_1 x_i to a data set (x_i, y_i), i = 1, ..., n, we may identify three sources of variability in the responses:
- variability associated with the model: SSM = Σ_{i=1}^n (ŷ_i − ȳ)², where the initials SS denote "sum of squares" and M refers to the model
- variability of the residuals: SSR = Σ_{i=1}^n (y_i − ŷ_i)² = Σ_{i=1}^n e_i²
- total variability: SST = Σ_{i=1}^n (y_i − ȳ)²
The ANOVA decomposition states that SST = SSM + SSR.

The ANOVA decomposition

The coefficient of determination R²
The ANOVA decomposition states that SST = SSM + SSR; note that y_i − ȳ = (y_i − ŷ_i) + (ŷ_i − ȳ). SSM = Σ_{i=1}^n (ŷ_i − ȳ)² measures the variation in the responses due to the regression model (explained by the predicted values ŷ_i). Thus, the ratio SSR/SST is the proportion of the variation in the responses that is not explained by the regression model. The ratio R² = SSM/SST = 1 − SSR/SST is the proportion of the variation in the responses that is explained by the regression model; it is known as the coefficient of determination. Its value satisfies R² = r_xy² (the squared correlation coefficient). For example, if R² = 0.85 the variable x explains 85% of the variation in the response variable y.

The ANOVA decomposition

ANOVA table

Source of variability    SS     DF       Mean                    F ratio
Model                    SSM    1        SSM/1                   SSM/s_R²
Residuals/errors         SSR    n − 2    SSR/(n − 2) = s_R²
Total                    SST    n − 1

The ANOVA decomposition

ANOVA hypothesis testing
Hypothesis test: H_0: β_1 = 0 vs. H_1: β_1 ≠ 0
Consider the ratio F = (SSM/1) / (SSR/(n − 2)) = SSM/s_R²
Under H_0, F follows an F_{1,n−2} distribution
Test at a significance level α: reject H_0 if F > F_{1,n−2;α}
How does this result relate to the test based on the Student-t statistic we saw in Lesson 4? They are equivalent: F is the square of that T statistic, so both tests reject for exactly the same samples.
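
The following sketch (numpy and scipy assumed, with x and y the data arrays) computes the decomposition, the F statistic and its p-value, and checks numerically that F equals the square of the T statistic from Lesson 4:

```python
import numpy as np
from scipy import stats

n = len(y)
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

SSM = np.sum((y_hat - y.mean())**2)   # variability explained by the model
SSR = np.sum((y - y_hat)**2)          # residual variability
SST = np.sum((y - y.mean())**2)       # total variability, SST = SSM + SSR
s2_R = SSR / (n - 2)                  # residual (quasi-)variance

F = (SSM / 1) / s2_R
p_value = stats.f.sf(F, 1, n - 2)     # reject H0 if p_value < alpha

# Equivalence with the Student-t test: F = T**2
se_b1 = np.sqrt(s2_R / ((n - 1) * x.var(ddof=1)))
T = b1 / se_b1
print(F, T**2)                        # identical up to rounding
```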

The ANOVA decomposition Excel output

Nonlinear relationships and linearizing transformations

Introduction
Consider the case when the deterministic part f(x_i; a, b) of the response in the model y_i = f(x_i; a, b) + u_i, i = 1, ..., n, is a nonlinear function of x that depends on two parameters a and b (for example, f(x; a, b) = a b^x). In some cases we may apply transformations to the data to linearize them; we are then able to apply the linear regression procedure. From the original data (x_i, y_i) we obtain the transformed data (x'_i, y'_i). The parameters β_0 and β_1 corresponding to the linear relation between x'_i and y'_i are transformations of the parameters a and b.

Nonlinear relationships and linearizing transformations

Linearizing transformations
Examples of linearizing transformations (a worked sketch follows the list):
- If y = f(x; a, b) = a x^b then log y = log a + b log x. We take y' = log y, x' = log x, β_0 = log a, β_1 = b
- If y = f(x; a, b) = a b^x then log y = log a + (log b) x. We take y' = log y, x' = x, β_0 = log a, β_1 = log b
- If y = f(x; a, b) = 1/(a + bx) then 1/y = a + bx. We take y' = 1/y, x' = x, β_0 = a, β_1 = b
- If y = f(x; a, b) = log(a x^b) then y = log a + b (log x). We take y' = y, x' = log x, β_0 = log a, β_1 = b
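
For example, a sketch (numpy assumed, with made-up positive data used purely for illustration) of fitting the power model y = a x^b through the log-log transformation of the first bullet:

```python
import numpy as np

# Hypothetical positive data assumed to follow y ~ a * x**b
x = np.array([1.0, 2.0, 3.0, 5.0, 8.0, 13.0])
y = np.array([2.1, 3.9, 6.2, 9.8, 15.7, 24.5])

# Transformed data: y' = log y, x' = log x, so that y' = beta0 + beta1 * x'
xp, yp = np.log(x), np.log(y)
beta1, beta0 = np.polyfit(xp, yp, 1)

# Recover the original parameters: a = exp(beta0), b = beta1
a, b = np.exp(beta0), beta1
print(a, b)
```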

A matrix treatment of linear regression

Introduction
Remember the simple linear regression model: y_i = β_0 + β_1 x_i + u_i, i = 1, ..., n. If we write one equation for each one of the observations, we have
y_1 = β_0 + β_1 x_1 + u_1
y_2 = β_0 + β_1 x_2 + u_2
...
y_n = β_0 + β_1 x_n + u_n

A matrix treatment of linear regression

The model in matrix form
We can write the preceding equations in matrix form as
(y_1, y_2, ..., y_n)ᵀ = (β_0 + β_1 x_1, β_0 + β_1 x_2, ..., β_0 + β_1 x_n)ᵀ + (u_1, u_2, ..., u_n)ᵀ
and, splitting the parameters β from the variables x_i,
(y_1, y_2, ..., y_n)ᵀ = X (β_0, β_1)ᵀ + (u_1, u_2, ..., u_n)ᵀ,
where X is the n × 2 matrix whose i-th row is (1, x_i).

A matrix treatment of linear regression

The regression model in matrix form
We can write the preceding matrix relationship compactly as
y = Xβ + u
where y is the response vector, X is the explanatory variables matrix (or experimental design matrix) with rows (1, x_i), β = (β_0, β_1)ᵀ is the vector of parameters, and u is the error vector.

The regression model in matrix form

Covariance matrix for the errors
We denote by Cov(u) the n × n matrix of covariances of the errors. Its (i, j)-th element is cov(u_i, u_j) = 0 if i ≠ j, and cov(u_i, u_i) = Var(u_i) = σ². Thus Cov(u) is a diagonal matrix with σ² on the diagonal and 0 elsewhere, that is, the identity matrix I multiplied by σ²: Cov(u) = σ² I.

The regression model in matrix form

Least-squares estimation
The least-squares estimate β̂ of the parameter vector is the unique solution of the 2 × 2 matrix equation (check the dimensions)
(XᵀX) β̂ = Xᵀ y, that is, β̂ = (XᵀX)⁻¹ Xᵀ y.
The vector ŷ = (ŷ_i) of response estimates is given by ŷ = X β̂, and the residual vector is defined as e = y − ŷ.
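
A minimal sketch (numpy assumed, with x and y the data arrays) of these matrix computations:

```python
import numpy as np

# Design matrix: a column of ones followed by the explanatory variable
X = np.column_stack([np.ones(len(x)), x])

# Solve the normal equations (X^T X) beta_hat = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ beta_hat      # vector of response estimates
e = y - y_hat             # residual vector
```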

The multiple linear regression model

Introduction
Use of the simple linear regression model: predict the value of a response y from the value of an explanatory variable x. In many applications we wish to predict the response y from the values of several explanatory variables x_1, ..., x_k. For example:
- forecast the value of a house as a function of its size, location, layout and number of bathrooms
- forecast the size of a parliament as a function of the population, its rate of growth, the number of political parties with parliamentary representation, etc.

The multiple linear regression model

The model
Use of the multiple linear regression model: predict a response y from several explanatory variables x_1, ..., x_k. If we have n observations, then for i = 1, ..., n,
y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ... + β_k x_ik + u_i
We assume that the error variables u_i are independent random variables following a N(0, σ²) distribution.

The multiple linear regression model

The least-squares fit
We have n observations, and for i = 1, ..., n,
y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ... + β_k x_ik + u_i
We wish to fit to the data (x_i1, x_i2, ..., x_ik, y_i) a hyperplane of the form
ŷ_i = β̂_0 + β̂_1 x_i1 + β̂_2 x_i2 + ... + β̂_k x_ik
The residual for observation i is defined as e_i = y_i − ŷ_i. We obtain the parameter estimates β̂_j as the values that minimize the sum of the squares of the residuals.

The multiple linear regression model

The model in matrix form
We can write the model as a matrix relationship in which the i-th row of the matrix X is (1, x_i1, x_i2, ..., x_ik), the parameter vector is β = (β_0, β_1, ..., β_k)ᵀ and the error vector is u = (u_1, ..., u_n)ᵀ; in compact form,
y = Xβ + u
where y is the response vector, X is the explanatory variables matrix (or experimental design matrix), β is the vector of parameters, and u is the error vector.

The multiple linear regression model

Least-squares estimation
The least-squares estimate β̂ of the parameter vector is the unique solution of the (k + 1) × (k + 1) matrix equation (check the dimensions)
(XᵀX) β̂ = Xᵀ y,
and as in the k = 1 case (simple linear regression) we have β̂ = (XᵀX)⁻¹ Xᵀ y.
The vector ŷ = (ŷ_i) of response estimates is given by ŷ = X β̂, and the residual vector is defined as e = y − ŷ.

The multiple linear regression model

Variance estimation
For the multiple linear regression model, an estimator of the error variance σ² is the residual (quasi-)variance
s_R² = Σ_{i=1}^n e_i² / (n − k − 1),
and this estimator is unbiased. Note that for the simple linear regression case we had n − 2 in the denominator.

The multiple linear regression model

The sampling distribution of β̂
Under the model assumptions, the least-squares estimator β̂ of the parameter vector β follows a multivariate normal distribution:
- E(β̂) = β (it is an unbiased estimator)
- the covariance matrix of β̂ is Cov(β̂) = σ² (XᵀX)⁻¹
- we estimate Cov(β̂) using s_R² (XᵀX)⁻¹
- the estimate of Cov(β̂) provides estimates s²(β̂_j) of the variances Var(β̂_j); s(β̂_j) is the standard error of the estimator β̂_j
- if we standardize β̂_j we have (β̂_j − β_j)/s(β̂_j) ∼ t_{n−k−1} (the Student-t distribution)

The multiple linear regression model

Inference on the parameters β_j
Confidence interval for the partial coefficient β_j at confidence level 1 − α:
β̂_j ± t_{n−k−1;α/2} s(β̂_j)
Hypothesis test of H_0: β_j = 0 vs. H_1: β_j ≠ 0 at significance level α:
reject H_0 if |T| > t_{n−k−1;α/2}, where the test statistic is T = β̂_j / s(β̂_j)
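
A sketch of these computations (numpy and scipy assumed; X is the n x (k+1) design matrix with a leading column of ones and y is the response vector):

```python
import numpy as np
from scipy import stats

n, p = X.shape                               # p = k + 1 parameters
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

e = y - X @ beta_hat
s2_R = np.sum(e**2) / (n - p)                # residual (quasi-)variance, df = n - k - 1

cov_beta = s2_R * np.linalg.inv(X.T @ X)     # estimated Cov(beta_hat)
se = np.sqrt(np.diag(cov_beta))              # standard errors s(beta_hat_j)

t_stats = beta_hat / se                      # T statistics for H0: beta_j = 0
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - p)

# Confidence intervals at level 1 - alpha, e.g. alpha = 0.05
t_crit = stats.t.ppf(0.975, df=n - p)
ci = np.column_stack([beta_hat - t_crit * se, beta_hat + t_crit * se])
```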

The ANOVA decomposition

The multivariate case
ANOVA: ANalysis Of VAriance. When fitting the multiple linear regression model ŷ_i = β̂_0 + β̂_1 x_i1 + ... + β̂_k x_ik to a data set (x_i1, ..., x_ik, y_i), i = 1, ..., n, we may identify three sources of variability in the responses:
- variability associated with the model: SSM = Σ_{i=1}^n (ŷ_i − ȳ)², where the initials SS denote "sum of squares" and M refers to the model
- variability of the residuals: SSR = Σ_{i=1}^n (y_i − ŷ_i)² = Σ_{i=1}^n e_i²
- total variability: SST = Σ_{i=1}^n (y_i − ȳ)²
The ANOVA decomposition states that SST = SSM + SSR.

The ANOVA decomposition

The coefficient of determination R²
The ANOVA decomposition states that SST = SSM + SSR; note that y_i − ȳ = (y_i − ŷ_i) + (ŷ_i − ȳ). SSM = Σ_{i=1}^n (ŷ_i − ȳ)² measures the variation in the responses due to the regression model (explained by the predicted values ŷ_i). Thus, the ratio SSR/SST is the proportion of the variation in the responses that is not explained by the regression model. The ratio R² = SSM/SST = 1 − SSR/SST is the proportion of the variation in the responses that is explained by the explanatory variables; it is known as the coefficient of multiple determination. For example, if R² = 0.85 the variables x_1, ..., x_k explain 85% of the variation in the response variable y.

The ANOVA decomposition

ANOVA table

Source of variability    SS     DF           Mean                         F ratio
Model                    SSM    k            SSM/k                        (SSM/k)/s_R²
Residuals/errors         SSR    n − k − 1    SSR/(n − k − 1) = s_R²
Total                    SST    n − 1

The ANOVA decomposition

ANOVA hypothesis testing
Hypothesis test: H_0: β_1 = β_2 = ... = β_k = 0 vs. H_1: β_j ≠ 0 for some j = 1, ..., k
- H_0: the response does not depend on any x_j
- H_1: the response depends on at least one x_j
- Interpretation of β̂_j: keeping everything else the same, a one-unit increase in x_j will produce, on average, a change of β̂_j in y
Consider the ratio F = (SSM/k) / (SSR/(n − k − 1)) = (SSM/k)/s_R²
Under H_0, F follows an F_{k,n−k−1} distribution
Test at a significance level α: reject H_0 if F > F_{k,n−k−1;α}
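
Continuing the previous sketch, the overall F statistic and its p-value could be computed as follows:

```python
# Reuses X, y, beta_hat, e, n and p from the previous sketch
k = p - 1                                  # number of explanatory variables
y_hat = X @ beta_hat

SSM = np.sum((y_hat - y.mean())**2)
SSR = np.sum(e**2)

F = (SSM / k) / (SSR / (n - k - 1))
p_value = stats.f.sf(F, k, n - k - 1)      # reject H0 if p_value < alpha
```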