Estadística II Chapter 4: Simple linear regression
Chapter 4. Simple linear regression

Contents: Objectives of the analysis. Model specification. Least Squares Estimators (LSE): construction and properties. Statistical inference: for the slope, for the intercept, for the variance. Prediction for a new observation (the actual value or the average value).
Chapter 4. Simple linear regression

Learning objectives:
- Ability to construct a model to describe the influence of X on Y
- Ability to find estimates
- Ability to construct confidence intervals and carry out tests of hypothesis
- Ability to estimate the average value of Y for a given x (point estimate and confidence intervals)
- Ability to estimate the individual value of Y for a given x (point estimate and confidence intervals)
Chapter 4. Simple Linear Regression Bibliography Newbold, P. Statistics for Business and Economics (2013) Ch. 10 Ross, S. Introductory Statistics (2005) Ch. 12
Introduction

A regression model is a model that allows us to describe the effect of a variable X on a variable Y. X: independent, explanatory or exogenous variable. Y: dependent, response or endogenous variable. The objective is to obtain reasonable estimates of Y given X, based on a sample of n bivariate observations (x₁, y₁), …, (xₙ, yₙ).
Introduction

Examples: Study how the father's height influences the son's height. Estimate the price of an apartment depending on its size. Predict an unemployment rate for a given age group. Approximate the final grade in Estadística II based on the weekly number of study hours. Predict the computing time as a function of the processor speed.
Introduction. Types of relationships

Deterministic: Given a value of X, the value of Y can be perfectly identified: y = f(x). Example: the relationship between the temperature in Celsius (X) and Fahrenheit (Y) is y = 1.8x + 32. [Figure: plot of degrees Fahrenheit vs degrees Celsius]
Introduction. Types of relationships

Nondeterministic (random/stochastic): Given a value of X, the value of Y cannot be perfectly known: y = f(x) + u, where u is an unknown random perturbation (a random variable). Example: production (X) and price (Y). [Figure: scatterplot of cost vs volume] There is a linear pattern, but not a perfect one.
Introduction. Types of relationships

Linear: When the function f(x) is linear, f(x) = β₀ + β₁x. If β₁ > 0 there is a positive linear relationship; if β₁ < 0, a negative linear relationship. [Figure: scatterplots of a positive and a negative linear relationship] The scatterplot is (American) football-shaped.
Introduction. Types of relationships

Nonlinear: When f(x) is nonlinear. For example, f(x) = log(x), f(x) = x² + 3, … [Figure: scatterplot of a nonlinear relationship] The scatterplot is not (American) football-shaped.
Introduction. Types of relationships

Lack of relationship: When f(x) = 0. [Figure: scatterplot showing no relationship between X and Y]
Measures of linear dependence. Covariance

The covariance is defined as

cov(x, y) = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1) = (Σᵢ₌₁ⁿ xᵢyᵢ − n x̄ ȳ) / (n − 1)

If there is a positive linear relationship, cov > 0. If there is a negative linear relationship, cov < 0. If there is no relationship, or the relationship is nonlinear, cov ≈ 0. Problem: the covariance depends on the units of X and Y.
Measures of linear dependence. Correlation coefficient

The correlation coefficient (unitless) is defined as

r(x, y) = cor(x, y) = cov(x, y) / (sₓ s_y)

where sₓ² = Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1) and s_y² = Σᵢ₌₁ⁿ (yᵢ − ȳ)² / (n − 1).

Properties: −1 ≤ cor(x, y) ≤ 1; cor(x, y) = cor(y, x); cor(ax + b, cy + d) = sign(a)·sign(c)·cor(x, y) for arbitrary numbers a, b, c, d.
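The covariance and correlation formulas above can be computed directly. The following is a minimal numpy sketch (not part of the original slides); the data arrays are the wheat production/price values that appear later in Example 4.1, and the function names are my own:

```python
import numpy as np

# Wheat data from Example 4.1: production (x) and price (y)
x = np.array([30, 28, 32, 25, 25, 25, 22, 24, 35, 40], dtype=float)
y = np.array([25, 30, 27, 40, 42, 40, 50, 45, 30, 25], dtype=float)

def sample_cov(x, y):
    # cov(x, y) = sum((x_i - xbar)(y_i - ybar)) / (n - 1)
    n = len(x)
    return np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

def sample_cor(x, y):
    # r = cov(x, y) / (s_x * s_y): unitless, always in [-1, 1]
    return sample_cov(x, y) / (x.std(ddof=1) * y.std(ddof=1))

r = sample_cor(x, y)  # negative here: higher production goes with lower price
```

Rescaling the data (cor(ax + b, cy + d) = sign(a)·sign(c)·cor(x, y)) leaves the correlation unchanged up to sign, which is exactly why it is preferred to the unit-dependent covariance.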
Simple linear regression model

The simple linear regression model assumes that

Yᵢ = β₀ + β₁xᵢ + uᵢ

where:
- Yᵢ is the value of the dependent variable Y when the random variable X takes the specific value xᵢ
- xᵢ is the specific value of the random variable X
- uᵢ is an error, a random variable assumed to be normal with mean 0 and unknown variance σ²: uᵢ ∼ N(0, σ²)
- β₀ and β₁ are the population coefficients: β₀ is the population intercept and β₁ the population slope

The (population) parameters that we need to estimate are β₀, β₁ and σ².
Simple linear regression model

Our objective is to find the estimators/estimates β̂₀, β̂₁ of β₀, β₁ in order to obtain the regression line

ŷ = β̂₀ + β̂₁x

which is the best fit to data with a linear pattern. Example: say the regression line for the previous example is

Price = −15.65 + 1.29 · Production

[Figure: scatterplot of cost vs volume with fitted line]

Based on the regression line, we can estimate the price when production is 25 million: Price = −15.65 + 1.29(25) = 16.6
Simple linear regression model

The difference between the observed value of the response variable yᵢ and its estimate ŷᵢ is called a residual: eᵢ = yᵢ − ŷᵢ. [Figure: observed data point vs estimated regression line] Example (cont.): clearly, if for a given year the production is 25 million, the price will not be exactly 16.6 thousand euros. That small difference, the residual, would in that case be eᵢ = 18 − 16.6 = 1.4
Simple linear regression model: model assumptions

- Linearity: The underlying relationship between X and Y is linear, f(x) = β₀ + β₁x
- Homogeneity: The errors have mean zero, E[uᵢ] = 0
- Homoscedasticity: The variance of the errors is constant, Var(uᵢ) = σ²
- Independence: The errors are independent, E[uᵢuⱼ] = 0
- Normality: The errors follow a normal distribution, uᵢ ∼ N(0, σ²)
Simple linear regression model: model assumptions. Linearity

The scatterplot should have an (American) football shape, i.e., it should show scatter around a straight line. [Figure: scatterplot well described by a straight line] If not, the regression line is not an adequate model for the data. [Figure: scatterplot with a curved pattern poorly fit by a straight line]
Simple linear regression model: model assumptions. Homoscedasticity

The vertical spread around the line should remain roughly constant. [Figure: scatterplot of cost vs volume with constant vertical spread] If that is not the case, heteroscedasticity is present.
Simple linear regression model: model assumptions. Independence

The observations should be independent: one observation doesn't carry any information about another. In general, time series fail this assumption.

Normality: A priori, we assume that the observations are normal, uᵢ ∼ N(0, σ²).
(Ordinary) Least Squares Estimators: LSE

In 1809 Gauss proposed the least squares method to obtain the estimators β̂₀ and β̂₁ that provide the best fit ŷᵢ = β̂₀ + β̂₁xᵢ. The method is based on a criterion in which we minimize the sum of squares of the residuals, SSR, that is, the sum of squared vertical distances between the observed yᵢ and predicted ŷᵢ values:

Σᵢ₌₁ⁿ eᵢ² = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = Σᵢ₌₁ⁿ (yᵢ − (β̂₀ + β̂₁xᵢ))²
Least Squares Estimators

The resulting estimators are:

β̂₁ = cov(x, y) / sₓ² = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ₌₁ⁿ (xᵢ − x̄)²

β̂₀ = ȳ − β̂₁x̄

The fitted regression line is ŷ = β̂₀ + β̂₁x, with slope β̂₁ and intercept β̂₀.
Fitting the regression line

Example 4.1. For the Spanish wheat production data from the 80's, with production (X) and price per kilo in pesetas (Y), we have the following table:

production | 30 28 32 25 25 25 22 24 35 40
price      | 25 30 27 40 42 40 50 45 30 25

Fit a least squares regression line to the data.

β̂₁ = (Σᵢ₌₁¹⁰ xᵢyᵢ − n x̄ ȳ) / (Σᵢ₌₁¹⁰ xᵢ² − n x̄²) = (9734 − 10 · 28.6 · 35.4) / (8468 − 10 · 28.6²) = −1.3537

β̂₀ = ȳ − β̂₁x̄ = 35.4 + 1.3537 · 28.6 = 74.116

The regression line is ŷ = 74.116 − 1.3537x
Fitting the regression line in software
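The original slide presumably shows output from a statistics package. As an illustrative sketch (assuming numpy is available), the same fit can be reproduced with the formulas above on the Example 4.1 data:

```python
import numpy as np

# Example 4.1 data: wheat production (x) and price in pesetas (y)
x = np.array([30, 28, 32, 25, 25, 25, 22, 24, 35, 40], dtype=float)
y = np.array([25, 30, 27, 40, 42, 40, 50, 45, 30, 25], dtype=float)
n = len(x)

# b1 = (sum x_i y_i - n xbar ybar) / (sum x_i^2 - n xbar^2)
b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x ** 2) - n * x.mean() ** 2)
# b0 = ybar - b1 xbar
b0 = y.mean() - b1 * x.mean()

print(f"regression line: y-hat = {b0:.3f} + ({b1:.4f}) x")
```

np.polyfit(x, y, 1) returns the same slope/intercept pair, which makes a convenient cross-check.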
Estimating the error variance

To estimate the error variance, σ², we can simply take the uncorrected sample variance,

σ̂² = Σᵢ₌₁ⁿ eᵢ² / n

which is the so-called maximum likelihood estimator of σ². However, this estimator is biased. The unbiased estimator of σ² is called the residual variance,

sR² = Σᵢ₌₁ⁿ eᵢ² / (n − 2) = SSR / (n − 2)
Estimating the error variance

Exercise 4.2. Find the residual variance for Exercise 4.1. First, we find the residuals, eᵢ, using the regression line ŷᵢ = 74.116 − 1.3537xᵢ:

xᵢ                      | 30    28    32    25    25    25    22    24    35    40
yᵢ                      | 25    30    27    40    42    40    50    45    30    25
ŷᵢ = 74.116 − 1.3537xᵢ  | 33.5  36.21 30.79 40.27 40.27 40.27 44.33 41.62 26.73 19.96
eᵢ = yᵢ − ŷᵢ            | −8.50 −6.21 −3.79 −0.27 1.72  −0.27 5.66  3.37  3.26  5.03

The residual variance is then sR² = Σᵢ₌₁ⁿ eᵢ² / (n − 2) = 207.92 / 8 = 25.99
Estimating the error variance in software
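A quick numeric check of Exercise 4.2 (a sketch, not the package shown on the slide; it recomputes the fit at full precision rather than reusing the rounded table values):

```python
import numpy as np

x = np.array([30, 28, 32, 25, 25, 25, 22, 24, 35, 40], dtype=float)
y = np.array([25, 30, 27, 40, 42, 40, 50, 45, 30, 25], dtype=float)
n = len(x)

# Least squares fit (as in Example 4.1)
b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x ** 2) - n * x.mean() ** 2)
b0 = y.mean() - b1 * x.mean()

e = y - (b0 + b1 * x)    # residuals e_i = y_i - yhat_i
SSR = np.sum(e ** 2)     # sum of squared residuals, ~207.92
s2_R = SSR / (n - 2)     # unbiased residual variance, ~25.99
```

A useful sanity check: for a least squares fit with an intercept, the residuals always sum to zero.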
Statistical inference in the simple linear regression model

Up to this point we have only talked about point estimation. With confidence intervals for the model parameters, we can quantify the estimation error, and hypothesis tests will help us decide whether a given parameter is statistically significant. In statistical inference, we begin with the distribution of the estimators.
Statistical inference for the slope

The estimator β̂₁ follows a normal distribution because it is a linear combination of normally distributed random variables:

β̂₁ = Σᵢ₌₁ⁿ [(xᵢ − x̄) / ((n − 1)sX²)] Yᵢ = Σᵢ₌₁ⁿ wᵢYᵢ

where Yᵢ = β₀ + β₁xᵢ + uᵢ, which satisfies Yᵢ ∼ N(β₀ + β₁xᵢ, σ²). In addition, β̂₁ is an unbiased estimator of β₁:

E[β̂₁] = Σᵢ₌₁ⁿ wᵢ E[Yᵢ] = β₁

and its variance is

Var[β̂₁] = Σᵢ₌₁ⁿ wᵢ² Var[Yᵢ] = σ² / ((n − 1)sX²)

Thus, β̂₁ ∼ N(β₁, σ² / ((n − 1)sX²))
Confidence interval for the slope

We wish to obtain a (1 − α) confidence interval for β₁. Since σ² is unknown, we estimate it using sR². The corresponding theoretical result, when the error variance is unknown, is

(β̂₁ − β₁) / √(sR² / ((n − 1)sX²)) ∼ t_{n−2}

from which we obtain the (1 − α) confidence interval for β₁:

β̂₁ ± t_{n−2,α/2} √(sR² / ((n − 1)sX²))

The length of the interval decreases if: the sample size increases; the variance of the xᵢ increases; the residual variance decreases.
Hypothesis testing for the slope

In a similar manner, we construct a hypothesis test for β₁. In particular, if the true value of β₁ is zero, the variable Y does not depend on X in a linear fashion. Thus, we are mainly interested in the two-sided test:

H₀: β₁ = 0 vs. H₁: β₁ ≠ 0

The rejection region is

RR_α = { t : |t| > t_{n−2,α/2} },  where t = β̂₁ / √(sR² / ((n − 1)sX²))

Equivalently, if 0 lies outside a (1 − α) confidence interval for β₁, we reject the null at significance level α. The p-value is

p-value = 2 Pr(T_{n−2} > |t|)
Inference for the slope

Exercise 4.3
1. Find a 95% CI for the slope of the (population) regression model from Example 4.1.
2. Test the hypothesis that the price of wheat depends linearly on the production, at the 0.05 significance level.

1. Since t_{n−2,α/2} = t_{8,0.025} = 2.306:

−1.3537 − 2.306 √(25.99 / (9 · 32.04)) ≤ β₁ ≤ −1.3537 + 2.306 √(25.99 / (9 · 32.04))
−2.046 ≤ β₁ ≤ −0.661

2. Since the interval (with the same α) doesn't contain 0, we reject the null β₁ = 0 at the 0.05 level. Also, the observed test statistic is

t = β̂₁ / √(sR² / ((n − 1)sX²)) = −1.3537 / √(25.99 / (9 · 32.04)) = −4.509

Thus, p-value = 2 Pr(T₈ > |−4.509|) = 0.002
Inference for β 1 in software
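The slope CI and test of Exercise 4.3 can be reproduced in Python. A sketch (assuming scipy is available; scipy.stats.linregress is used only as a cross-check of the hand-computed standard error):

```python
import numpy as np
from scipy import stats

x = np.array([30, 28, 32, 25, 25, 25, 22, 24, 35, 40], dtype=float)
y = np.array([25, 30, 27, 40, 42, 40, 50, 45, 30, 25], dtype=float)
n = len(x)

b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x ** 2) - n * x.mean() ** 2)
b0 = y.mean() - b1 * x.mean()
s2_R = np.sum((y - b0 - b1 * x) ** 2) / (n - 2)

se_b1 = np.sqrt(s2_R / ((n - 1) * np.var(x, ddof=1)))  # standard error of b1
tcrit = stats.t.ppf(0.975, n - 2)                      # t_{8, 0.025} = 2.306
ci = (b1 - tcrit * se_b1, b1 + tcrit * se_b1)          # 95% CI for beta_1
t_obs = b1 / se_b1                                     # test statistic for H0: beta_1 = 0
p_val = 2 * stats.t.sf(abs(t_obs), n - 2)              # two-sided p-value
```

Since 0 is far outside the interval and the p-value is well below 0.05, the linear dependence of price on production is significant, matching the slide.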
Statistical inference for the intercept

The estimator β̂₀ follows a normal distribution because it is a linear combination of normal random variables:

β̂₀ = Σᵢ₌₁ⁿ (1/n − x̄wᵢ) Yᵢ

where wᵢ = (xᵢ − x̄) / ((n − 1)sX²) and Yᵢ = β₀ + β₁xᵢ + uᵢ, which satisfies Yᵢ ∼ N(β₀ + β₁xᵢ, σ²). Additionally, β̂₀ is an unbiased estimator of β₀:

E[β̂₀] = Σᵢ₌₁ⁿ (1/n − x̄wᵢ) E[Yᵢ] = β₀

and its variance is

Var[β̂₀] = Σᵢ₌₁ⁿ (1/n − x̄wᵢ)² Var[Yᵢ] = σ² (1/n + x̄² / ((n − 1)sX²))

Thus, β̂₀ ∼ N(β₀, σ² (1/n + x̄² / ((n − 1)sX²)))
Confidence interval for the intercept

We wish to find a (1 − α) confidence interval for β₀. Since σ² is unknown, we estimate it with sR² as before. We obtain

(β̂₀ − β₀) / √(sR² (1/n + x̄² / ((n − 1)sX²))) ∼ t_{n−2}

which yields the following (1 − α) confidence interval for β₀:

β̂₀ ± t_{n−2,α/2} √(sR² (1/n + x̄² / ((n − 1)sX²)))

The length of the interval decreases if: the sample size increases; the variance of the xᵢ increases; the residual variance decreases; the absolute value of the mean of the xᵢ decreases.
Hypothesis test for the intercept

Based on the distribution of the estimator, we can carry out a hypothesis test. In particular, if the true value of β₀ is 0, the population regression line goes through the origin. For this case we would test:

H₀: β₀ = 0 vs. H₁: β₀ ≠ 0

The rejection region is

RR_α = { t : |t| > t_{n−2,α/2} },  where t = β̂₀ / √(sR² (1/n + x̄² / ((n − 1)sX²)))

Equivalently, if 0 lies outside the (1 − α) confidence interval for β₀, we reject the null. The p-value is

p-value = 2 Pr(T_{n−2} > |t|)
Inference for the intercept

Exercise 4.4
1. Find a 95% CI for the intercept of the population regression line of Exercise 4.1.
2. Test the hypothesis that the population regression line goes through the origin, at the 0.05 significance level.

1. The quantile is t_{n−2,α/2} = t_{8,0.025} = 2.306, so

74.1151 − 2.306 √(25.99 (1/10 + 28.6² / (9 · 32.04))) ≤ β₀ ≤ 74.1151 + 2.306 √(25.99 (1/10 + 28.6² / (9 · 32.04)))
53.969 ≤ β₀ ≤ 94.261

2. Since the interval (with the same α) doesn't contain 0, we reject the null hypothesis that β₀ = 0. Also, the observed test statistic is

t = β̂₀ / √(sR² (1/n + x̄² / ((n − 1)sX²))) = 74.1151 / √(25.99 (1/10 + 28.6² / (9 · 32.04))) = 8.484

Thus, p-value = 2 Pr(T₈ > 8.484) = 0.000
Inference for the intercept in software
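Exercise 4.4 can be checked numerically the same way. A sketch of the intercept CI and test statistic (variable names are my own; scipy assumed available):

```python
import numpy as np
from scipy import stats

x = np.array([30, 28, 32, 25, 25, 25, 22, 24, 35, 40], dtype=float)
y = np.array([25, 30, 27, 40, 42, 40, 50, 45, 30, 25], dtype=float)
n = len(x)

b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x ** 2) - n * x.mean() ** 2)
b0 = y.mean() - b1 * x.mean()
s2_R = np.sum((y - b0 - b1 * x) ** 2) / (n - 2)

# Var(b0) / sigma^2 = 1/n + xbar^2 / ((n-1) s_X^2)
var_term = 1 / n + x.mean() ** 2 / ((n - 1) * np.var(x, ddof=1))
se_b0 = np.sqrt(s2_R * var_term)              # standard error of b0
tcrit = stats.t.ppf(0.975, n - 2)
ci = (b0 - tcrit * se_b0, b0 + tcrit * se_b0) # 95% CI for beta_0
t_obs = b0 / se_b0                            # test statistic for H0: beta_0 = 0
```

The interval (≈ 53.97 to 94.26) does not contain 0, so the line does not pass through the origin, in agreement with the slide.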
Inference for the error variance

We have

(n − 2) sR² / σ² ∼ χ²_{n−2}

which means that the (1 − α) confidence interval for σ² is

(n − 2) sR² / χ²_{n−2,α/2} ≤ σ² ≤ (n − 2) sR² / χ²_{n−2,1−α/2}

This interval can also be used to solve the test H₀: σ² = σ₀² vs. H₁: σ² ≠ σ₀².
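A numeric sketch of the χ² interval for σ² on the Example 4.1 data (α = 0.05 is my assumption; note that the slide's χ²_{n−2,α/2} is the upper-α/2 point, while scipy's ppf takes the lower-tail probability):

```python
import numpy as np
from scipy import stats

x = np.array([30, 28, 32, 25, 25, 25, 22, 24, 35, 40], dtype=float)
y = np.array([25, 30, 27, 40, 42, 40, 50, 45, 30, 25], dtype=float)
n = len(x)

b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x ** 2) - n * x.mean() ** 2)
b0 = y.mean() - b1 * x.mean()
ssr = np.sum((y - b0 - b1 * x) ** 2)   # (n - 2) * s2_R
s2_R = ssr / (n - 2)

alpha = 0.05
# Upper chi-square quantile in the denominator gives the lower CI bound, and vice versa
lower = ssr / stats.chi2.ppf(1 - alpha / 2, n - 2)
upper = ssr / stats.chi2.ppf(alpha / 2, n - 2)
```

The interval is strongly asymmetric around sR² ≈ 25.99, reflecting the skewness of the χ² distribution.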
Average and individual predictions

We consider two situations:
1. We wish to estimate/predict the average value of Y for a given X = x₀.
2. We wish to estimate/predict the actual value of Y for a given X = x₀.

For example, in Exercise 4.1:
1. What would be the average wheat price over all years in which the production was 30?
2. If in a given year the production was 30, what would be the corresponding price of wheat?

In both cases the point prediction is the same:

ŷ₀ = β̂₀ + β̂₁x₀ = ȳ + β̂₁(x₀ − x̄)

but the estimation errors are different.
Estimating/predicting the average value

Remember that

Var(Ŷ₀) = Var(Ȳ) + (x₀ − x̄)² Var(β̂₁) = σ² (1/n + (x₀ − x̄)² / ((n − 1)sX²))

The confidence interval for the mean prediction E[Y₀ | X = x₀] is

Ŷ₀ ± t_{n−2,α/2} √(sR² (1/n + (x₀ − x̄)² / ((n − 1)sX²)))
Estimating/predicting the actual value

The variance for the prediction of the actual value is the mean squared error

E[(Y₀ − Ŷ₀)²] = Var(Y₀) + Var(Ŷ₀) = σ² (1 + 1/n + (x₀ − x̄)² / ((n − 1)sX²))

and thus the confidence interval for the actual value Y₀ is

Ŷ₀ ± t_{n−2,α/2} √(sR² (1 + 1/n + (x₀ − x̄)² / ((n − 1)sX²)))

This interval is wider than the one for the average prediction.
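The two interval formulas can be compared numerically at x₀ = 30, the value used in the example questions. A sketch (variable names are my own):

```python
import numpy as np
from scipy import stats

x = np.array([30, 28, 32, 25, 25, 25, 22, 24, 35, 40], dtype=float)
y = np.array([25, 30, 27, 40, 42, 40, 50, 45, 30, 25], dtype=float)
n = len(x)

b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x ** 2) - n * x.mean() ** 2)
b0 = y.mean() - b1 * x.mean()
s2_R = np.sum((y - b0 - b1 * x) ** 2) / (n - 2)

x0 = 30.0
y0_hat = b0 + b1 * x0   # same point prediction in both cases
# h = 1/n + (x0 - xbar)^2 / ((n-1) s_X^2): leverage-type term shared by both formulas
h = 1 / n + (x0 - x.mean()) ** 2 / ((n - 1) * np.var(x, ddof=1))
tcrit = stats.t.ppf(0.975, n - 2)

half_mean = tcrit * np.sqrt(s2_R * h)        # half-width, interval for the average value
half_ind  = tcrit * np.sqrt(s2_R * (1 + h))  # half-width, interval for the actual value
```

The individual-value interval only adds the extra "1 +" term, but it dominates: here the half-width roughly triples.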
Estimating/predicting the average and actual values

[Figure: fitted regression line for the wheat data with two confidence bands. In red: confidence intervals for the prediction of the average value. In pink: wider confidence intervals for the prediction of the actual value.]
Regression line: R-squared and variability decomposition

The coefficient of determination, R-squared, is used to assess the goodness of fit of the model. It is defined as

R² = r²(x, y) ∈ [0, 1]

R² tells us what percentage of the sample variability of the y variable is explained by the model, that is, by its linear dependence on x. Values close to 100% indicate that the regression model is a good fit to the data (less than 60%, not so good).

Variability decomposition and R²: the Total Sum of Squares Σᵢ(yᵢ − ȳ)² can be decomposed into the Residual Sum of Squares Σᵢ(yᵢ − ŷᵢ)² plus the Model Sum of Squares Σᵢ(ŷᵢ − ȳ)²:

SST = SSR + SSM, and we have R² = 1 − SSR/SST = SSM/SST
Regression line: R-squared and variability decomposition

[Figure from Wikipedia illustrating R² values for different scatterplots]
ANOVA table

ANOVA (Analysis of Variance) table for the simple linear regression model:

Source of variability | SS  | DF    | Mean              | F ratio
Model                 | SSM | 1     | SSM/1             | SSM/sR²
Residuals/errors      | SSR | n − 2 | SSR/(n − 2) = sR² |
Total                 | SST | n − 1 |                   |

Note that the value of the F statistic is the square of the t statistic from the significance test for the slope in simple regression.
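The decomposition SST = SSR + SSM, the two expressions for R², and the F = t² identity can all be verified numerically. A sketch on the Example 4.1 data:

```python
import numpy as np

x = np.array([30, 28, 32, 25, 25, 25, 22, 24, 35, 40], dtype=float)
y = np.array([25, 30, 27, 40, 42, 40, 50, 45, 30, 25], dtype=float)
n = len(x)

b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x ** 2) - n * x.mean() ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

SST = np.sum((y - y.mean()) ** 2)      # total sum of squares
SSR = np.sum((y - y_hat) ** 2)         # residual sum of squares
SSM = np.sum((y_hat - y.mean()) ** 2)  # model sum of squares
s2_R = SSR / (n - 2)

R2 = 1 - SSR / SST                     # coefficient of determination
F = (SSM / 1) / s2_R                   # ANOVA F ratio
t = b1 / np.sqrt(s2_R / np.sum((x - x.mean()) ** 2))  # slope t statistic
```

R² also equals the squared correlation coefficient r²(x, y), so np.corrcoef gives yet another route to the same number.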