Linear Regression Model Badr Missaoui
Introduction
What is this course about? It is a course on applied statistics. It comprises two hours of lectures and a one-hour lab session/tutorial each week. We will focus mainly on "regression models"; in the second half of the course we will focus on "time series" and "classification and discrimination". Hands-on: we use R, an open-source statistical software environment that can interface with Excel, C++, ...
Introduction
Assessment. Evaluation: 4 assignments (60%) and 1 take-home final exam (40%). Assignments are to be done in groups of two. For each exercise, the report should contain:
- a description of the data, including summary tables and plots;
- a description of the method, with the assumptions clearly stated;
- the results of the analysis of the data set at hand using the methods presented in the previous section. Here you should provide any relevant estimates, confidence intervals, levels of significance, goodness of fit, etc., along with their interpretation.
Introduction
Textbooks:
- Modern Applied Statistics with S, W. N. Venables and B. D. Ripley, Springer.
- Generalized Linear Models, Second Edition, P. McCullagh and J. A. Nelder, CRC Press.
- Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models, Julian Faraway, CRC Press.
A regression model is a model of the relationship between a predictor variable $X = (x_1, \ldots, x_n)$ and a response variable $Y = (y_1, \ldots, y_n)$. The regression model is
$$Y = f(X) + \varepsilon,$$
where $f$ is an unknown function and $\varepsilon \sim N(0, \sigma^2)$. The goal is to recover $f$ from the noisy data $Y$; $f$ could be parametric or non-parametric.
Example: the heart and body weights of samples of male and female cats used for digitalis experiments. Heart weight (Hwt, in g) is the outcome; body weight (Bwt, in kg) is the predictor.
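The data can be loaded and inspected in R; a minimal sketch (the cats data frame ships with the MASS package):

library(MASS)                  # provides the cats data set
data(cats)
str(cats)                      # Sex, Bwt (body weight, kg), Hwt (heart weight, g)
plot(Hwt ~ Bwt, data = cats)   # scatterplot of outcome against predictor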
A regression model is a model of the average outcome given the predictor, i.e. of the conditional expectation $E(\mathrm{Hwt} \mid \mathrm{Bwt})$, which is a function of Bwt.
Linear regression model:
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,$$
where $\beta_0$ and $\beta_1$ are respectively the intercept and the slope of the regression line.
We fit the linear regression model to the cats data:
$$\hat{y}_i = -0.3567 + 4.0341\, x_i.$$
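This fit can be reproduced with lm; a minimal sketch (the object name cats.lm is our own choice), whose summary output follows:

cats.lm <- lm(Hwt ~ Bwt, data = cats)   # simple linear regression of Hwt on Bwt
summary(cats.lm)                        # coefficient table, R-squared, F statistic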
Call:
lm(formula = Hwt ~ Bwt, data = cats)

Residuals:
    Min      1Q  Median      3Q     Max
-3.5694 -0.9634 -0.0921  1.0426  5.1238

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.3567     0.6923  -0.515    0.607
Bwt           4.0341     0.2503  16.119   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.452 on 142 degrees of freedom
Multiple R-squared: 0.6466,  Adjusted R-squared: 0.6441
F-statistic: 259.8 on 1 and 142 DF,  p-value: < 2.2e-16
The assumptions corresponding to this model are:
1. Normality: the error $\varepsilon_i$ has a normal distribution and is independent of $X$.
2. Homoscedasticity: $E(\varepsilon_i) = 0$ and $\mathrm{Var}(\varepsilon_i) = \sigma^2$ for all $i$.
3. $\beta_0$ and $\beta_1$ are constants.
QQ-plot for the cats data
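The QQ-plot can be reproduced from the fitted model; a sketch assuming the cats.lm object created above:

qqnorm(resid(cats.lm))   # sample quantiles of residuals vs. normal quantiles
qqline(resid(cats.lm))   # reference line through the first and third quartiles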
Parameter estimation: we will use the popular least squares method, minimizing
$$S(\beta_0, \beta_1) = \|Y - \beta_0 - \beta_1 X\|^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2.$$
The least squares estimates of the regression parameters are
$$\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.$$
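These estimators can be computed directly from the formulas above and checked against lm; a sketch on the cats data (the variable names are our own):

x <- cats$Bwt
y <- cats$Hwt
beta1.hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0.hat <- mean(y) - beta1.hat * mean(x)
c(beta0.hat, beta1.hat)   # should match coef(lm(Hwt ~ Bwt, data = cats))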
The variances of $\hat{\beta}_0$ and $\hat{\beta}_1$ are
$$\mathrm{Var}(\hat{\beta}_0) = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2} \right], \qquad \mathrm{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum (x_i - \bar{x})^2}.$$
An unbiased estimate of $\sigma^2$ is
$$\hat{\sigma}^2 = \frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2,$$
and $\hat{\sigma}^2 / \sigma^2 \sim \chi^2_{n-2} / (n-2)$.
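Continuing the sketch above, the unbiased variance estimate and the estimated variances of the coefficients can be computed by hand:

n <- length(y)
sigma2.hat <- sum((y - beta0.hat - beta1.hat * x)^2) / (n - 2)
sqrt(sigma2.hat)                                        # compare with the residual standard error, 1.452
sigma2.hat * (1/n + mean(x)^2 / sum((x - mean(x))^2))   # estimated Var of beta0.hat
sigma2.hat / sum((x - mean(x))^2)                       # estimated Var of beta1.hat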
Using the least squares estimates, one can develop statistical inference procedures: confidence intervals, hypothesis tests, and goodness-of-fit tests. Under the assumption that $\varepsilon \sim N(0, \sigma^2)$,
$$\hat{\beta}_0 \sim N\left(\beta_0,\ \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2} \right)\right), \qquad \hat{\beta}_1 \sim N\left(\beta_1,\ \frac{\sigma^2}{\sum (x_i - \bar{x})^2}\right).$$
Student random variables. Start with $Z \sim N(0, 1)$ standard normal and $G \sim \chi^2_\nu$ independent of $Z$. Compute
$$T = \frac{Z}{\sqrt{G/\nu}}.$$
Then $T$ has a Student distribution $t_\nu$ with $\nu$ degrees of freedom.
F random variables. Start with independent variables $G_1 \sim \chi^2_{\nu_1}$ and $G_2 \sim \chi^2_{\nu_2}$. Compute
$$F = \frac{G_1/\nu_1}{G_2/\nu_2}.$$
Then $F$ has an F-distribution $F_{\nu_1, \nu_2}$ with $\nu_1$ and $\nu_2$ degrees of freedom. Note that if $T \sim t_\nu$, then $T^2 \sim F_{1,\nu}$.
If the residuals are normal, then an exact level $1-\alpha$ confidence interval for $\beta_0$ is given by
$$\hat{\beta}_0 \pm t_{n-2,\alpha/2}\, SE(\hat{\beta}_0),$$
where $t_{n-2,\alpha/2}$ is the upper $\alpha/2$ critical value of the $t_{n-2}$ distribution and
$$SE(\hat{\beta}_0) = \hat{\sigma} \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}}.$$
Similarly, a level $1-\alpha$ confidence interval for $\beta_1$ is given by
$$\hat{\beta}_1 \pm t_{n-2,\alpha/2}\, SE(\hat{\beta}_1), \qquad SE(\hat{\beta}_1) = \frac{\hat{\sigma}}{\sqrt{\sum (x_i - \bar{x})^2}}.$$
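In R these intervals come directly out of confint; a sketch assuming the cats.lm fit:

confint(cats.lm, level = 0.95)   # 95% confidence intervals for beta0 and beta1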
We are now in a position to perform statistical analysis concerning the usefulness of $X$ as a predictor of $Y$. Suppose we want to test whether $\beta_1$ equals some pre-specified value $\beta^*$. To test the hypothesis $H_0 : \beta_1 = \beta^*$, we use the test statistic
$$T = \frac{\hat{\beta}_1 - \beta^*}{SE(\hat{\beta}_1)} \sim t_{n-2} \quad \text{under } H_0.$$
Reject $H_0 : \beta_1 = \beta^*$ if $|T| > t_{n-2,\alpha/2}$.
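A sketch of this test in R for the null value $\beta^* = 0$, reusing cats.lm (the helper names are our own):

beta.star <- 0
est <- coef(summary(cats.lm))["Bwt", ]                               # estimate and standard error for the slope
t.stat <- (est["Estimate"] - beta.star) / est["Std. Error"]
2 * pt(abs(t.stat), df = df.residual(cats.lm), lower.tail = FALSE)   # two-sided p-value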
Goodness of fit:
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2,$$
$$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{\beta}_0 + \hat{\beta}_1 x_i - \bar{y})^2,$$
$$SST = SSE + SSR = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad R^2 = \frac{SSR}{SST}.$$
If $R^2$ is large, most of the total variability in the response $Y$ is accounted for by the predictor variable $X$.
$R^2$ is the amount of variability in $Y$ explained by $X$. Also, $R^2 = r^2$, where
$$r = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})}{\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2} \sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$$
is the sample correlation. This is the estimate of the correlation
$$\rho = \frac{E\big((X - \mu_X)(Y - \mu_Y)\big)}{\sigma_X \sigma_Y}.$$
Note that $-1 \le \rho \le 1$. The correlation is a very useful quantity for measuring the direction and the strength of the relationship between $X$ and $Y$.
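A quick check of this identity on the cats data; a sketch:

r <- cor(cats$Bwt, cats$Hwt)   # sample correlation
r^2                            # equals the Multiple R-squared from summary(cats.lm), about 0.6466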
To test $H_0 : \beta_1 = 0$, we use the F statistic:
$$F = \frac{MSR(\text{Regression})}{MSE(\text{Errors})} = \frac{SSR / df(\text{Regression})}{SSE / df(\text{Errors})} = \left( \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} \right)^2 = T^2.$$
Under $H_0 : \beta_1 = 0$, $F \sim F_{1,n-2}$, where $F_{1,n-2}$ is an F distribution with 1 and $n-2$ degrees of freedom.
F-test in simple linear regression.
Full model (FM): $Y = \beta_0 + \beta_1 X + \varepsilon$.
Reduced model (RM): $Y = \beta_0 + \varepsilon$.
The F statistic is
$$F = \frac{(SSE(RM) - SSE(FM)) / (df_{RM} - df_{FM})}{SSE(FM) / df_{FM}}.$$
Reject $H_0$: RM is correct, if $F > F_{1-\alpha, 1, n-2}$.
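In R this model comparison is carried out by anova; a sketch assuming the cats.lm fit from earlier:

reduced.lm <- lm(Hwt ~ 1, data = cats)   # intercept-only reduced model
anova(reduced.lm, cats.lm)               # F statistic and p-value for H0: RM is correct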
Forecasting interval:
1. Suppose that we want an interval which contains the new observation
$$Y_{new} = \hat{\beta}_0 + \hat{\beta}_1 X_{new} + \varepsilon_{new}$$
with a certain probability.
2. The standard error of the prediction is
$$SE(\hat{Y}_{new}) = \hat{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(X_{new} - \bar{x})^2}{\sum (x_i - \bar{x})^2}}.$$
3. The prediction interval is
$$\hat{\beta}_0 + \hat{\beta}_1 X_{new} \pm t_{n-2,\alpha/2}\, SE(\hat{Y}_{new}).$$
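In R, predict produces this interval directly; a sketch with a hypothetical new body weight of 3.0 kg:

new.cat <- data.frame(Bwt = 3.0)   # hypothetical new observation
predict(cats.lm, newdata = new.cat, interval = "prediction", level = 0.95)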
Let us return to the cats example.
What if the assumptions are not satisfied?
The regression function may be a higher-order polynomial, i.e.
$$y_j = \beta_0 + \beta_1 x_j + \ldots + \beta_{p-1} x_j^{p-1} + \varepsilon_j.$$
The errors may not be normally distributed, or may not have the same variance; qqnorm can help diagnose this.
How to fix this? Sometimes things can be transformed to a linear model: suppose
$$y_i = \beta_0 e^{\beta_1 x_i} \varepsilon_i.$$
Then
$$\log y_i = \log \beta_0 + \beta_1 x_i + \log \varepsilon_i$$
is a linear model, and if the $\varepsilon_i$ are independent lognormal variables, the transformed model satisfies the standard linear model assumptions. Box-Cox transformations will help us choose a transformation that linearizes the model.
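MASS provides boxcox to carry out this search; a sketch on the cats data (the grid of lambda values is our own choice):

bc <- boxcox(Hwt ~ Bwt, data = cats, lambda = seq(-2, 2, 0.1))   # plots the profile log-likelihood
bc$x[which.max(bc$y)]   # lambda value maximizing the profile log-likelihood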
Polynomial regression:
$$y_j = \beta_0 + \beta_1 x_j + \ldots + \beta_{p-1} x_j^{p-1} + \varepsilon_j.$$
In matrix notation,
$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_1 & \cdots & x_1^{p-1} \\ 1 & x_2 & \cdots & x_2^{p-1} \\ \vdots & \vdots & & \vdots \\ 1 & x_n & \cdots & x_n^{p-1} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix},$$
which can be written $Y = X\beta + \varepsilon$.
We add a quadratic term to the model using the function poly, which generates a polynomial basis of a given degree:

quadratic.lm <- lm(y2 ~ poly(x2, 2), data = datatest)
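As a concrete (hypothetical) illustration on the cats data:

cats.quad <- lm(Hwt ~ poly(Bwt, 2), data = cats)   # quadratic trend in Bwt
summary(cats.quad)   # is the quadratic coefficient significant?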
Other regression models: splines are piecewise polynomial functions, i.e. on each interval between knots $(t_i, t_{i+1})$ the spline $f(x)$ is a polynomial, but the coefficients change from one interval to the next. Example: cubic splines,
$$f(x) = \sum_{j=0}^{3} \beta_j x^j + \sum_{i=1}^{h} \gamma_i (x - t_i)_+^3.$$
Other bases one might use: Fourier series, wavelet series, ...
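A cubic spline fit can be obtained in R with bs from the splines package; a sketch (the choice df = 5 is our own):

library(splines)
spline.lm <- lm(Hwt ~ bs(Bwt, df = 5), data = cats)   # cubic B-spline basis in Bwt
summary(spline.lm)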