Estadística II Chapter 4: Simple linear regression


Chapter 4. Simple linear regression

Contents
- Objectives of the analysis
- Model specification
- Least squares estimators (LSE): construction and properties
- Statistical inference: for the slope, for the intercept, for the variance
- Prediction for a new observation (the actual value or the average value)

Chapter 4. Simple linear regression

Learning objectives
- Ability to construct a model to describe the influence of X on Y
- Ability to find estimates
- Ability to construct confidence intervals and carry out tests of hypotheses
- Ability to estimate the average value of Y for a given x (point estimate and confidence intervals)
- Ability to estimate the individual value of Y for a given x (point estimate and confidence intervals)

Chapter 4. Simple linear regression

Bibliography
- Newbold, P., Statistics for Business and Economics (2013), Ch. 10
- Ross, S., Introductory Statistics (2005), Ch. 12

Introduction

A regression model is a model that allows us to describe the effect of a variable X on a variable Y.
- X: independent, explanatory, or exogenous variable
- Y: dependent, response, or endogenous variable
The objective is to obtain reasonable estimates of Y for X based on a sample of n bivariate observations (x_1, y_1), ..., (x_n, y_n).

Introduction. Examples

- Study how the father's height influences the son's height.
- Estimate the price of an apartment depending on its size.
- Predict the unemployment rate for a given age group.
- Approximate the final grade in Estadística II based on the weekly number of study hours.
- Predict computing time as a function of processor speed.

Introduction. Types of relationships

Deterministic: Given a value of X, the value of Y can be perfectly identified: y = f(x).
Example: The relationship between temperature in degrees Celsius (X) and degrees Fahrenheit (Y) is y = 1.8x + 32.
[Figure: plot of degrees Fahrenheit vs. degrees Celsius, a straight line]

Introduction. Types of relationships

Nondeterministic (random/stochastic): Given a value of X, the value of Y cannot be perfectly known: y = f(x) + u, where u is an unknown random perturbation (a random variable).
Example: Production (X) and price (Y).
[Figure: scatterplot of costs (Costos) vs. volume (Volumen)]
There is a linear pattern, but it is not perfect.

Introduction. Types of relationships

Linear: When the function f(x) is linear, f(x) = β_0 + β_1 x.
- If β_1 > 0, there is a positive linear relationship.
- If β_1 < 0, there is a negative linear relationship.
[Figure: two scatterplots, a positive linear relationship (Relación lineal positiva) and a negative linear relationship (Relación lineal negativa)]
The scatterplot is (American) football-shaped.

Introduction. Types of relationships

Nonlinear: When f(x) is nonlinear. For example, f(x) = log(x), f(x) = x² + 3, ...
[Figure: scatterplot of a nonlinear relationship (Relación no lineal)]
The scatterplot is not (American) football-shaped.

Introduction. Types of relationships

Lack of relationship: When f(x) = 0.
[Figure: scatterplot with no relationship between X and Y (Ausencia de relación)]

Measures of linear dependence. Covariance

The covariance is defined as

cov(x, y) = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / (n − 1) = (Σ_{i=1}^n x_i y_i − n x̄ ȳ) / (n − 1)

- If there is a positive linear relationship, cov > 0.
- If there is a negative linear relationship, cov < 0.
- If there is no relationship or the relationship is nonlinear, cov ≈ 0.
Problem: The covariance depends on the units of X and Y.

Measures of linear dependence. Correlation coefficient

The correlation coefficient (unitless) is defined as

r_(x,y) = cor(x, y) = cov(x, y) / (s_x s_y)

where

s_x² = Σ_{i=1}^n (x_i − x̄)² / (n − 1) and s_y² = Σ_{i=1}^n (y_i − ȳ)² / (n − 1)

Properties:
- −1 ≤ cor(x, y) ≤ 1
- cor(x, y) = cor(y, x)
- cor(ax + b, cy + d) = sign(a) sign(c) cor(x, y) for arbitrary numbers a, b, c, d
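As an illustration, here is a minimal numpy sketch (variable names are our own) that evaluates both formulas on the wheat data used later in Example 4.1:

```python
import numpy as np

# Production (x) and price (y) from Example 4.1 later in the chapter
x = np.array([30, 28, 32, 25, 25, 25, 22, 24, 35, 40], dtype=float)
y = np.array([25, 30, 27, 40, 42, 40, 50, 45, 30, 25], dtype=float)
n = len(x)

# Sample covariance with the n - 1 denominator, as defined above
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# Unit-free correlation: covariance scaled by both standard deviations
r_xy = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

print(cov_xy)  # negative, approx. -43.4: higher production, lower price
print(r_xy)    # approx. -0.85, necessarily between -1 and 1
```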

Simple linear regression model

The simple linear regression model assumes that

Y_i = β_0 + β_1 x_i + u_i

where
- Y_i is the value of the dependent variable Y when the random variable X takes the specific value x_i
- x_i is the specific value of the random variable X
- u_i is an error, a random variable assumed to be normal with mean 0 and unknown variance σ²: u_i ∼ N(0, σ²)
- β_0 and β_1 are the population coefficients: β_0 is the population intercept and β_1 is the population slope
The (population) parameters that we need to estimate are β_0, β_1, and σ².

Simple linear regression model

Our objective is to find the estimators/estimates β̂_0, β̂_1 of β_0, β_1 in order to obtain the regression line

ŷ = β̂_0 + β̂_1 x

which is the best fit to data with a linear pattern.
Example: Say the regression line for the previous production/price example is

Price = −15.65 + 1.29 · Production

[Figure: scatterplot of costs (Costos) vs. volume (Volumen) with the fitted line]
Based on the regression line, we can estimate the price when production is 25 million: Price = −15.65 + 1.29(25) = 16.6.

Simple linear regression model

The difference between the observed value y_i of the response variable and its estimate ŷ_i is called a residual:

e_i = y_i − ŷ_i

[Figure: observed data point (y) and estimated regression line (recta de regresión estimada)]
Example (cont.): Clearly, if for a given year the production is 25 million, the price will not be exactly 16.6 thousand euros. That small difference, the residual, would in that case be e_i = 18 − 16.6 = 1.4.

Simple linear regression model: model assumptions

- Linearity: The underlying relationship between X and Y is linear, f(x) = β_0 + β_1 x
- Homogeneity: The errors have mean zero, E[u_i] = 0
- Homoscedasticity: The variance of the errors is constant, Var(u_i) = σ²
- Independence: The errors are independent, E[u_i u_j] = 0 for i ≠ j
- Normality: The errors follow a normal distribution, u_i ∼ N(0, σ²)

Simple linear regression model: model assumptions

Linearity: The scatterplot should have an (American) football shape, i.e., it should show scatter around a straight line.
[Figure: scatterplot of costs (Costos) vs. volume (Volumen) with a well-fitting line]
If not, the regression line is not an adequate model for the data.
[Figure: scatterplot where a straight line is a poor fit]

Simple linear regression model: model assumptions

Homoscedasticity: The vertical spread around the line should remain roughly constant.
[Figure: scatterplot of costs (Costos) vs. volume (Volumen) with constant vertical spread]
If that is not the case, heteroscedasticity is present.

Simple linear regression model: model assumptions

Independence: The observations should be independent; one observation doesn't carry any information about another. In general, time series fail this assumption.

Normality: A priori, we assume that the observations are normal:

y_i = β_0 + β_1 x_i + u_i, with u_i ∼ N(0, σ²)

where β_0, β_1 and σ² are unknown parameters.

(Ordinary) least squares estimators: LSE

In 1809 Gauss proposed the least squares method to obtain the estimators β̂_0 and β̂_1 that provide the best fit

ŷ_i = β̂_0 + β̂_1 x_i

The method minimizes the sum of squared residuals (SSR), that is, the sum of squared vertical distances between the observed values y_i and the predicted values ŷ_i:

Σ_{i=1}^n e_i² = Σ_{i=1}^n (y_i − ŷ_i)² = Σ_{i=1}^n (y_i − (β̂_0 + β̂_1 x_i))²

[Figure: observed value (valor observado), predicted value (valor previsto), and residual e_i at x_i]

Least squares estimators

The resulting estimators are

β̂_1 = cov(x, y) / s_x² = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^n (x_i − x̄)²

β̂_0 = ȳ − β̂_1 x̄

and the fitted regression line is ŷ = β̂_0 + β̂_1 x.
[Figure: regression line ŷ = β̂_0 + β̂_1 x, with intercept β̂_0 and slope (pendiente) β̂_1]

Fitting the regression line

Example 4.1. For the Spanish wheat production data from the 80's, with production (X) and price per kilo in pesetas (Y), we have the following table:

production: 30 28 32 25 25 25 22 24 35 40
price: 25 30 27 40 42 40 50 45 30 25

Fit a least squares regression line to the data.

β̂_1 = (Σ_{i=1}^{10} x_i y_i − n x̄ ȳ) / (Σ_{i=1}^{10} x_i² − n x̄²) = (9734 − 10 · 28.6 · 35.4) / (8468 − 10 · 28.6²) = −1.3537

β̂_0 = ȳ − β̂_1 x̄ = 35.4 + 1.3537 · 28.6 = 74.116

The regression line is ŷ = 74.116 − 1.3537 x.

Fitting the regression line in software

[Software output]
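As a rough equivalent of the software output, a minimal numpy sketch (our own variable names) that reproduces the coefficients of Example 4.1:

```python
import numpy as np

# Wheat data from Example 4.1: production (x) and price in pesetas (y)
x = np.array([30, 28, 32, 25, 25, 25, 22, 24, 35, 40], dtype=float)
y = np.array([25, 30, 27, 40, 42, 40, 50, 45, 30, 25], dtype=float)

# Least squares estimates: slope = cov(x, y) / s_x^2, intercept = ybar - slope * xbar
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(b0, b1)  # approx. 74.116 and -1.3537, matching the hand computation
```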

Estimating the error variance

To estimate the error variance σ², we could simply take the uncorrected sample variance of the residuals,

σ̂² = Σ_{i=1}^n e_i² / n

which is the so-called maximum likelihood estimator of σ². However, this estimator is biased. The unbiased estimator of σ², called the residual variance, is

s_R² = Σ_{i=1}^n e_i² / (n − 2) = SSR / (n − 2)

Estimating the error variance

Exercise 4.2. Find the residual variance for Exercise 4.1.
First, we find the residuals e_i using the regression line ŷ_i = 74.116 − 1.3537 x_i:

x_i: 30 28 32 25 25 25 22 24 35 40
y_i: 25 30 27 40 42 40 50 45 30 25
ŷ_i = 74.116 − 1.3537 x_i: 33.50 36.21 30.79 40.27 40.27 40.27 44.33 41.62 26.73 19.96
e_i = y_i − ŷ_i: −8.50 −6.21 −3.79 −0.27 1.72 −0.27 5.66 3.37 3.26 5.03

The residual variance is then

s_R² = Σ_{i=1}^n e_i² / (n − 2) = 207.92 / 8 = 25.99

Estimating the error variance in software

[Software output]
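A short self-contained sketch (same assumed setup as before) that computes the residual variance of Exercise 4.2:

```python
import numpy as np

x = np.array([30, 28, 32, 25, 25, 25, 22, 24, 35, 40], dtype=float)
y = np.array([25, 30, 27, 40, 42, 40, 50, 45, 30, 25], dtype=float)
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Residuals and the unbiased residual variance s_R^2 = SSR / (n - 2)
e = y - (b0 + b1 * x)
s2_R = np.sum(e ** 2) / (n - 2)

print(np.sum(e ** 2), s2_R)  # approx. 207.92 and 25.99, as in Exercise 4.2
```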

Statistical inference in the simple linear regression model

Up to this point we have only discussed point estimation. With confidence intervals for the model parameters we can quantify the estimation error, and tests of hypothesis will help us decide whether a given parameter is statistically significant. In statistical inference, we begin with the distributions of the estimators.

Statistical inference for the slope

The estimator β̂_1 follows a normal distribution because it is a linear combination of normally distributed random variables:

β̂_1 = Σ_{i=1}^n [(x_i − x̄) / ((n − 1) s_X²)] Y_i = Σ_{i=1}^n w_i Y_i

where Y_i = β_0 + β_1 x_i + u_i satisfies Y_i ∼ N(β_0 + β_1 x_i, σ²).
In addition, β̂_1 is an unbiased estimator of β_1:

E[β̂_1] = Σ_{i=1}^n [(x_i − x̄) / ((n − 1) s_X²)] E[Y_i] = β_1

and its variance is

Var[β̂_1] = Σ_{i=1}^n [(x_i − x̄) / ((n − 1) s_X²)]² Var[Y_i] = σ² / ((n − 1) s_X²)

Thus,

β̂_1 ∼ N(β_1, σ² / ((n − 1) s_X²))
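These sampling properties can be checked numerically. Below is a small simulation sketch; the population values are hypothetical, chosen only for illustration, and the design points are the x values of Example 4.1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed design points and hypothetical population parameters
# (these parameter values are assumptions for the simulation only)
x = np.array([30, 28, 32, 25, 25, 25, 22, 24, 35, 40], dtype=float)
beta0, beta1, sigma = 74.0, -1.35, 5.0
n = len(x)

b1 = np.empty(20_000)
for k in range(b1.size):
    yk = beta0 + beta1 * x + rng.normal(0.0, sigma, n)  # Y_i = b0 + b1 x_i + u_i
    b1[k] = np.sum((x - x.mean()) * (yk - yk.mean())) / np.sum((x - x.mean()) ** 2)

print(b1.mean())  # close to beta1 = -1.35 (unbiasedness)
print(b1.var(), sigma**2 / ((n - 1) * np.var(x, ddof=1)))  # both approx. 0.087
```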

Confidence interval for the slope

We wish to obtain a (1 − α) confidence interval for β_1. Since σ² is unknown, we estimate it by s_R². The corresponding theoretical result when the error variance is unknown is

(β̂_1 − β_1) / √(s_R² / ((n − 1) s_X²)) ∼ t_{n−2}

from which we obtain the (1 − α) confidence interval for β_1:

β̂_1 ± t_{n−2,α/2} √(s_R² / ((n − 1) s_X²))

The length of the interval decreases if:
- the sample size increases,
- the variance of the x_i increases,
- the residual variance decreases.

Hypothesis testing for the slope

In a similar manner, we can construct a hypothesis test for β_1. In particular, if the true value of β_1 is zero, the variable Y does not depend on X in a linear fashion. Thus, we are mainly interested in the two-sided test

H_0: β_1 = 0 vs. H_1: β_1 ≠ 0

The rejection region is

RR_α = { t : |t| = |β̂_1| / √(s_R² / ((n − 1) s_X²)) > t_{n−2,α/2} }

Equivalently, if 0 falls outside the (1 − α) confidence interval for β_1, we reject the null hypothesis at significance level α. The p-value is

p-value = 2 Pr(T_{n−2} > |β̂_1| / √(s_R² / ((n − 1) s_X²)))

Inference for the slope

Exercise 4.3
1. Find a 95% CI for the slope of the (population) regression model from Example 4.1.
2. Test the hypothesis that the price of wheat depends linearly on the production, at the 0.05 significance level.

1. Since t_{n−2,α/2} = t_{8,0.025} = 2.306:

−1.3537 − 2.306 √(25.99 / (9 · 32.04)) ≤ β_1 ≤ −1.3537 + 2.306 √(25.99 / (9 · 32.04))
−2.046 ≤ β_1 ≤ −0.661

2. Since the interval (for the same α) doesn't contain 0, we reject the null hypothesis β_1 = 0 at the 0.05 level. Also, the observed test statistic is

t = β̂_1 / √(s_R² / ((n − 1) s_X²)) = −1.3537 / √(25.99 / (9 · 32.04)) = −4.509

Thus, the p-value is 2 Pr(T_8 > |−4.509|) = 0.002.

Inference for β_1 in software

[Software output]
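A sketch of the same computations with numpy and scipy (assuming scipy is available; names are our own):

```python
import numpy as np
from scipy import stats

x = np.array([30, 28, 32, 25, 25, 25, 22, 24, 35, 40], dtype=float)
y = np.array([25, 30, 27, 40, 42, 40, 50, 45, 30, 25], dtype=float)
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
s2_R = np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2)
s2_x = np.var(x, ddof=1)

# Standard error of the slope and the 95% confidence interval
se_b1 = np.sqrt(s2_R / ((n - 1) * s2_x))
t_crit = stats.t.ppf(0.975, df=n - 2)            # t_{8, 0.025} = 2.306
print(b1 - t_crit * se_b1, b1 + t_crit * se_b1)  # approx. (-2.046, -0.661)

# Two-sided test of H0: beta_1 = 0
t_obs = b1 / se_b1                               # approx. -4.509
print(2 * stats.t.sf(abs(t_obs), df=n - 2))      # p-value, approx. 0.002
```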

Statistical inference for the intercept

The estimator β̂_0 follows a normal distribution because it is a linear combination of normal random variables:

β̂_0 = Σ_{i=1}^n (1/n − x̄ w_i) Y_i

where w_i = (x_i − x̄) / ((n − 1) s_X²) and Y_i = β_0 + β_1 x_i + u_i satisfies Y_i ∼ N(β_0 + β_1 x_i, σ²).
Additionally, β̂_0 is an unbiased estimator of β_0:

E[β̂_0] = Σ_{i=1}^n (1/n − x̄ w_i) E[Y_i] = β_0

and its variance is

Var[β̂_0] = Σ_{i=1}^n (1/n − x̄ w_i)² Var[Y_i] = σ² (1/n + x̄² / ((n − 1) s_X²))

Thus,

β̂_0 ∼ N(β_0, σ² (1/n + x̄² / ((n − 1) s_X²)))

Confidence interval for the intercept

We wish to find a (1 − α) confidence interval for β_0. Since σ² is unknown, we estimate it by s_R² as before. We obtain

(β̂_0 − β_0) / √(s_R² (1/n + x̄² / ((n − 1) s_X²))) ∼ t_{n−2}

which yields the following confidence interval for β_0:

β̂_0 ± t_{n−2,α/2} √(s_R² (1/n + x̄² / ((n − 1) s_X²)))

The length of the interval decreases if:
- the sample size increases,
- the variance of the x_i increases,
- the residual variance decreases,
- the absolute value of the mean of the x_i decreases.

Hypothesis test for the intercept

Based on the distribution of the estimator, we can carry out a test of hypothesis. In particular, if the true value of β_0 is 0, the population regression line goes through the origin. In that case we test

H_0: β_0 = 0 vs. H_1: β_0 ≠ 0

The rejection region is

RR_α = { t : |t| = |β̂_0| / √(s_R² (1/n + x̄² / ((n − 1) s_X²))) > t_{n−2,α/2} }

Equivalently, if 0 falls outside the (1 − α) confidence interval for β_0, we reject the null hypothesis. The p-value is

p-value = 2 Pr(T_{n−2} > |β̂_0| / √(s_R² (1/n + x̄² / ((n − 1) s_X²))))

Inference for the intercept

Exercise 4.4
1. Find a 95% CI for the intercept of the population regression line of Exercise 4.1.
2. Test the hypothesis that the population regression line passes through the origin, at the 0.05 significance level.

1. The quantile is t_{n−2,α/2} = t_{8,0.025} = 2.306, so

74.116 − 2.306 √(25.99 (1/10 + 28.6² / (9 · 32.04))) ≤ β_0 ≤ 74.116 + 2.306 √(25.99 (1/10 + 28.6² / (9 · 32.04)))
53.969 ≤ β_0 ≤ 94.261

2. Since the interval (for the same α) doesn't contain 0, we reject the null hypothesis β_0 = 0. Also, the observed test statistic is

t = β̂_0 / √(s_R² (1/n + x̄² / ((n − 1) s_X²))) = 74.116 / √(25.99 (1/10 + 28.6² / (9 · 32.04))) = 8.484

Thus, the p-value is 2 Pr(T_8 > 8.484) ≈ 0.000.

Inference for the intercept in software

[Software output]
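The analogous sketch for the intercept (same assumed setup):

```python
import numpy as np
from scipy import stats

x = np.array([30, 28, 32, 25, 25, 25, 22, 24, 35, 40], dtype=float)
y = np.array([25, 30, 27, 40, 42, 40, 50, 45, 30, 25], dtype=float)
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
s2_R = np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2)
s2_x = np.var(x, ddof=1)

# Standard error of the intercept: sqrt(s_R^2 (1/n + xbar^2 / ((n-1) s_x^2)))
se_b0 = np.sqrt(s2_R * (1 / n + x.mean() ** 2 / ((n - 1) * s2_x)))
t_crit = stats.t.ppf(0.975, df=n - 2)

print(b0 - t_crit * se_b0, b0 + t_crit * se_b0)  # approx. (53.97, 94.26)
print(b0 / se_b0)                                # t statistic, approx. 8.48
```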

Inference for the error variance

We have

(n − 2) s_R² / σ² ∼ χ²_{n−2}

which means that the (1 − α) confidence interval for σ² is

(n − 2) s_R² / χ²_{n−2,α/2} ≤ σ² ≤ (n − 2) s_R² / χ²_{n−2,1−α/2}

This can also be used to carry out the test

H_0: σ² = σ_0² vs. H_1: σ² ≠ σ_0²
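A sketch of the χ² interval with scipy, using the Example 4.1 values; here χ²_{n−2,α/2} denotes the upper-tail quantile, matching the formula above:

```python
import numpy as np
from scipy import stats

n, s2_R, alpha = 10, 25.99, 0.05  # values from Example 4.1 / Exercise 4.2

# (1 - alpha) CI for sigma^2 from (n - 2) s_R^2 / sigma^2 ~ chi^2_{n-2}
lower = (n - 2) * s2_R / stats.chi2.ppf(1 - alpha / 2, df=n - 2)  # upper-tail alpha/2 point
upper = (n - 2) * s2_R / stats.chi2.ppf(alpha / 2, df=n - 2)      # upper-tail 1-alpha/2 point

print(lower, upper)  # approx. (11.9, 95.4)
```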

Average and individual predictions

We consider two situations:
1. We wish to estimate/predict the average value of Y for a given X = x_0.
2. We wish to estimate/predict the actual value of Y for a given X = x_0.
For example, in Example 4.1:
1. What would be the average wheat price over all years in which the production was 30?
2. If in a given year the production was 30, what would be the corresponding price of wheat?
In both cases the point prediction is the same,

ŷ_0 = β̂_0 + β̂_1 x_0 = ȳ + β̂_1 (x_0 − x̄)

but the estimation errors are different.

Estimating/predicting the average value

Remember that

Var(Ŷ_0) = Var(Ȳ) + (x_0 − x̄)² Var(β̂_1) = σ² (1/n + (x_0 − x̄)² / ((n − 1) s_X²))

The confidence interval for the mean prediction E[Y_0 | X = x_0] is therefore

Ŷ_0 ± t_{n−2,α/2} √(s_R² (1/n + (x_0 − x̄)² / ((n − 1) s_X²)))

Estimating/predicting the actual value

The variance for the prediction of the actual value is the mean squared error

E[(Y_0 − Ŷ_0)²] = Var(Y_0) + Var(Ŷ_0) = σ² (1 + 1/n + (x_0 − x̄)² / ((n − 1) s_X²))

Thus the confidence interval for the actual value Y_0 is

Ŷ_0 ± t_{n−2,α/2} √(s_R² (1 + 1/n + (x_0 − x̄)² / ((n − 1) s_X²)))

This interval is wider than the one for the average prediction.
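Both intervals can be sketched at, say, x_0 = 30 (the production value from the questions above); note the extra "1 +" term for the actual value:

```python
import numpy as np
from scipy import stats

x = np.array([30, 28, 32, 25, 25, 25, 22, 24, 35, 40], dtype=float)
y = np.array([25, 30, 27, 40, 42, 40, 50, 45, 30, 25], dtype=float)
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
s2_R = np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2)
s2_x = np.var(x, ddof=1)

x0 = 30.0
y0_hat = b0 + b1 * x0                      # same point prediction in both cases
t_crit = stats.t.ppf(0.975, df=n - 2)

# Half-widths: mean response vs. one new observation (extra "1 +" below)
h_mean = t_crit * np.sqrt(s2_R * (1 / n + (x0 - x.mean()) ** 2 / ((n - 1) * s2_x)))
h_new = t_crit * np.sqrt(s2_R * (1 + 1 / n + (x0 - x.mean()) ** 2 / ((n - 1) * s2_x)))

print(y0_hat - h_mean, y0_hat + h_mean)    # narrower: average value
print(y0_hat - h_new, y0_hat + h_new)      # wider: actual value
```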

Estimating/predicting the average and actual values

In red: confidence intervals for the prediction of the average value. In pink: confidence intervals for the prediction of the actual value.
[Figure: fitted model with both confidence bands; price in pesetas (Precio en ptas.) vs. production in kg (Produccion en kg.)]

Regression line: R-squared and variability decomposition

The coefficient of determination, R-squared, is used to assess the goodness of fit of the model. It is defined as

R² = r²_(x,y) ∈ [0, 1]

R² tells us what percentage of the sample variability of the y variable is explained by the model, that is, by its linear dependence on x. Values close to 100% indicate that the regression model fits the data well (below 60%, not so well).
Variability decomposition and R²: the total sum of squares Σ_i (y_i − ȳ)² can be decomposed into the residual sum of squares Σ_i (y_i − ŷ_i)² plus the model sum of squares Σ_i (ŷ_i − ȳ)²:

SST = SSR + SSM

and we have

R² = 1 − SSR/SST = SSM/SST
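A quick numerical check of the decomposition and of R² = r², on the Example 4.1 data:

```python
import numpy as np

x = np.array([30, 28, 32, 25, 25, 25, 22, 24, 35, 40], dtype=float)
y = np.array([25, 30, 27, 40, 42, 40, 50, 45, 30, 25], dtype=float)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
ssr = np.sum((y - y_hat) ** 2)         # residual sum of squares
ssm = np.sum((y_hat - y.mean()) ** 2)  # model sum of squares

print(np.isclose(sst, ssr + ssm))      # True: SST = SSR + SSM
r2 = 1 - ssr / sst                     # approx. 0.72
print(r2, np.corrcoef(x, y)[0, 1] ** 2)  # R^2 equals r(x, y)^2
```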

Regression line: R-squared and variability decomposition

[Figure from Wikipedia illustrating the variability decomposition]

ANOVA table

ANOVA (analysis of variance) table for the simple linear regression model:

Source of variability   SS    DF     Mean               F ratio
Model                   SSM   1      SSM/1              SSM/s_R²
Residuals/errors        SSR   n−2    SSR/(n−2) = s_R²
Total                   SST   n−1

Note that the value of the F statistic is the square of the t statistic in the simple regression significance test.
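Finally, a sketch of the ANOVA F ratio for Example 4.1, confirming F = t²:

```python
import numpy as np
from scipy import stats

x = np.array([30, 28, 32, 25, 25, 25, 22, 24, 35, 40], dtype=float)
y = np.array([25, 30, 27, 40, 42, 40, 50, 45, 30, 25], dtype=float)
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

ssm = np.sum((y_hat - y.mean()) ** 2)
ssr = np.sum((y - y_hat) ** 2)
s2_R = ssr / (n - 2)

F = (ssm / 1) / s2_R            # F ratio from the ANOVA table, approx. 20.3
print(F, (-4.509) ** 2)         # F equals the square of the slope t statistic
print(stats.f.sf(F, 1, n - 2))  # same p-value as the two-sided t test, ~0.002
```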