Linear Regression Model. Badr Missaoui


Introduction. What is this course about? It is a course on applied statistics, comprising two hours of lectures and a one-hour lab session/tutorial each week. We will focus mainly on regression models; in the second half of the course, we turn to time series and to classification and discrimination. The course is hands-on: we use R, an open-source statistical software environment that can interface with Excel, C++, and other tools.

Introduction. Assessment and evaluation: 4 assignments (60%) and 1 take-home final exam (40%). Assignments are to be done in groups of 2. For each exercise, the report should contain: a description of the data, including summary tables and plots; a clear statement of the method and its assumptions; and the results of analysing the data set at hand using the methods presented in the corresponding section, providing any relevant estimates, confidence intervals, significance levels, goodness-of-fit measures, etc., along with their interpretation.

Introduction. Textbooks: Modern Applied Statistics with S, W. N. Venables and B. D. Ripley, Springer. Generalized Linear Models, Second Edition, P. McCullagh and J. A. Nelder, CRC Press. Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models, Julian Faraway, CRC Press.

A regression model is a model of the relationship between a predictor variable $X = (x_1, \ldots, x_n)$ and a response variable $Y = (y_1, \ldots, y_n)$:
$$Y = f(X) + \varepsilon,$$
where $f$ is an unknown function and $\varepsilon \sim N(0, \sigma^2)$. The goal is to recover $f$ from the noisy data $Y$; $f$ can be parametric or non-parametric.

Example: the heart and body weights of samples of male and female cats used for digitalis experiments. Heart weight (Hwt) is the outcome; body weight (Bwt) is the predictor.
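These data ship with the MASS package as the cats data frame; a minimal sketch in R for loading and plotting them:

library(MASS)   # provides the cats data: 144 cats, Bwt (kg) and Hwt (g)
data(cats)
str(cats)

# Scatterplot of the outcome against the predictor
plot(Hwt ~ Bwt, data = cats,
     xlab = "Body weight (kg)", ylab = "Heart weight (g)")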

A regression model is a model of the average outcome given the predictor: it models the conditional expectation $E(\mathrm{Hwt} \mid \mathrm{Bwt})$, which is a function of Bwt.

Linear regression model: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where $\beta_0$ and $\beta_1$ are respectively the intercept and the slope of the regression line.

We fit the linear regression model to the cats data and obtain the fitted line $\hat{y}_i = -0.3567 + 4.0341\, x_i$.
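This fit can be reproduced in R as follows (a minimal sketch; the summary output shown next is what this code prints):

fit <- lm(Hwt ~ Bwt, data = cats)  # simple linear regression of heart on body weight
summary(fit)                       # coefficients, standard errors, R-squared, F-test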

Call:
lm(formula = Hwt ~ Bwt, data = cats)

Residuals:
    Min      1Q  Median      3Q     Max
-3.5694 -0.9634 -0.0921  1.0426  5.1238

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.3567     0.6923  -0.515    0.607
Bwt           4.0341     0.2503  16.119   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.452 on 142 degrees of freedom
Multiple R-squared:  0.6466,    Adjusted R-squared:  0.6441
F-statistic: 259.8 on 1 and 142 DF,  p-value: < 2.2e-16

The assumptions underlying this model are: 1. Normality: the errors $\varepsilon_i$ are normally distributed and independent of $X$. 2. Homoscedasticity: $E(\varepsilon_i) = 0$ and $\mathrm{Var}(\varepsilon_i) = \sigma^2$ for all $i$. 3. $\beta_0$ and $\beta_1$ are constants.

QQ-plot for the cats data
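A QQ-plot of the residuals can be produced as follows (a sketch, reusing the fit object from above):

qqnorm(residuals(fit))   # sample quantiles of residuals vs normal quantiles
qqline(residuals(fit))   # reference line through the quartiles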

Parameter estimation: we use the popular least squares method, minimizing
$$S(\beta_0, \beta_1) = \|Y - \beta_0 - \beta_1 X\|^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2.$$
The resulting estimates of the regression parameters are
$$\hat\beta_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}.$$
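These closed-form estimates can be checked directly against lm (a sketch, reusing the cats data):

x <- cats$Bwt
y <- cats$Hwt
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept
c(b0, b1)   # matches coef(fit)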

The variances of $\hat\beta_0$ and $\hat\beta_1$ are
$$\mathrm{Var}(\hat\beta_0) = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_i (x_i - \bar{x})^2} \right], \qquad \mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{\sum_i (x_i - \bar{x})^2}.$$
An unbiased estimate of $\sigma^2$ is
$$\hat\sigma^2 = \frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2, \qquad \text{and} \qquad \frac{\hat\sigma^2}{\sigma^2} \sim \frac{\chi^2_{n-2}}{n-2}.$$
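Reusing x, y, b0 and b1 from the sketch above, the unbiased variance estimate is two lines of R:

s2 <- sum((y - b0 - b1 * x)^2) / (length(y) - 2)  # unbiased estimate of sigma^2
sqrt(s2)   # matches the residual standard error 1.452 in the summary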

Using the least squares estimates, one can develop statistical inference procedures: confidence intervals, hypothesis tests, and goodness-of-fit tests. Under the assumption that $\varepsilon \sim N(0, \sigma^2)$,
$$\hat\beta_0 \sim N\!\left(\beta_0,\; \sigma^2 \left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_i (x_i - \bar{x})^2}\right]\right), \qquad \hat\beta_1 \sim N\!\left(\beta_1,\; \frac{\sigma^2}{\sum_i (x_i - \bar{x})^2}\right).$$

Student random variables. Start with $Z \sim N(0,1)$ standard normal and $G \sim \chi^2_\nu$ independent of $Z$, and compute
$$T = \frac{Z}{\sqrt{G/\nu}}.$$
Then $T$ has a Student distribution $t_\nu$ with $\nu$ degrees of freedom.

F-random variables. Start with independent variables $G_1 \sim \chi^2_{\nu_1}$ and $G_2 \sim \chi^2_{\nu_2}$, and compute
$$F = \frac{G_1/\nu_1}{G_2/\nu_2}.$$
Then $F$ has an F-distribution $F_{\nu_1,\nu_2}$ with $\nu_1$ and $\nu_2$ degrees of freedom. Note that if $T \sim t_\nu$, then $T^2 \sim F_{1,\nu}$.
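The relation between $t$ and $F$ quantiles can be verified numerically in R (a quick sketch):

# The squared upper-alpha/2 t critical value equals the upper-alpha F critical value
qt(0.975, df = 10)^2          # 4.964603
qf(0.95, df1 = 1, df2 = 10)   # 4.964603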

If the residuals are normal, an exact level $1-\alpha$ confidence interval for $\beta_0$ is given by
$$\hat\beta_0 \pm t_{n-2,\alpha/2}\, SE(\hat\beta_0),$$
where $t_{n-2,\alpha/2}$ is the upper $\alpha/2$ critical value of the $t_{n-2}$ distribution and
$$SE(\hat\beta_0) = \hat\sigma \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_i (x_i - \bar{x})^2}}.$$
Likewise, a level $1-\alpha$ confidence interval for $\beta_1$ is given by
$$\hat\beta_1 \pm t_{n-2,\alpha/2}\, SE(\hat\beta_1), \qquad SE(\hat\beta_1) = \frac{\hat\sigma}{\sqrt{\sum_i (x_i - \bar{x})^2}}.$$
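In R these intervals are returned by confint (a sketch, reusing the fit object from above):

confint(fit, level = 0.95)   # 95% confidence intervals for (Intercept) and Bwt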

We are now in a position to perform statistical analysis concerning the usefulness of $X$ as a predictor of $Y$. Suppose we want to test whether $\beta_1$ equals some pre-specified value $\beta^*$. To test the hypothesis $H_0 : \beta_1 = \beta^*$, we use the test statistic
$$T = \frac{\hat\beta_1 - \beta^*}{SE(\hat\beta_1)} \sim t_{n-2} \quad \text{under } H_0,$$
and reject $H_0 : \beta_1 = \beta^*$ if $|T| > t_{n-2,\alpha/2}$.
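For the common case $\beta^* = 0$, this is exactly the t value and p-value reported in the summary output; a sketch computing them by hand, reusing x, b1 and s2 from the earlier sketches:

se_b1 <- sqrt(s2 / sum((x - mean(x))^2))   # standard error of the slope
T <- (b1 - 0) / se_b1                      # test statistic for H0: beta1 = 0
2 * pt(-abs(T), df = length(x) - 2)        # two-sided p-value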

Goodness of fit. Define
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2,$$
$$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat\beta_0 + \hat\beta_1 x_i - \bar{y})^2,$$
$$SST = SSE + SSR = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad R^2 = \frac{SSR}{SST}.$$
If $R^2$ is large, most of the total variability in the response $Y$ is accounted for by the predictor variable $X$.
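A sketch of this decomposition on the cats fit:

yhat <- fitted(fit)
SSE <- sum((y - yhat)^2)
SSR <- sum((yhat - mean(y))^2)
SST <- sum((y - mean(y))^2)   # equals SSE + SSR
SSR / SST                     # R-squared, 0.6466 as in the summary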

$R^2$ is the amount of variability in $Y$ explained by $X$. Also, $R^2 = r^2$, where
$$r = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})}{\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}\, \sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$$
is the sample correlation. This is the estimate of the correlation
$$\rho = \frac{E\big((X - \mu_X)(Y - \mu_Y)\big)}{\sigma_X \sigma_Y}.$$
Note that $-1 \le \rho \le 1$. The correlation is a very useful quantity for measuring the direction and the strength of the relationship between $X$ and $Y$.
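This identity is easy to verify in R (a one-line sketch):

cor(cats$Bwt, cats$Hwt)^2   # squared sample correlation, equals R-squared 0.6466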

To test $H_0 : \beta_1 = 0$, we can also use the F statistic:
$$F = \frac{MSR(\text{Regression})}{MSE(\text{Errors})} = \frac{SSR/df(\text{Regression})}{SSE/df(\text{Errors})} = \left(\frac{\hat\beta_1}{SE(\hat\beta_1)}\right)^2 = T^2.$$
Under $H_0 : \beta_1 = 0$, $F \sim F_{1,n-2}$, where $F_{1,n-2}$ is an F distribution with 1 and $n-2$ degrees of freedom.

F-test in simple linear regression. Full model (FM): $Y = \beta_0 + \beta_1 X + \varepsilon$. Reduced model (RM): $Y = \beta_0 + \varepsilon$. The F statistic is
$$F = \frac{\big(SSE(RM) - SSE(FM)\big)/(df_{RM} - df_{FM})}{SSE(FM)/df_{FM}}.$$
Reject $H_0$: "RM is correct" if $F > F_{1-\alpha,1,n-2}$.
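In R this model comparison is carried out by anova (a sketch, reusing the fit object):

rm.fit <- lm(Hwt ~ 1, data = cats)   # reduced model: intercept only
anova(rm.fit, fit)                   # F = 259.8 on 1 and 142 df, as in the summary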

Forecasting interval: suppose we want an interval that contains a new observation
$$Y_{new} = \beta_0 + \beta_1 X_{new} + \varepsilon_{new}$$
with a certain probability. Predicting it by $\hat{Y}_{new} = \hat\beta_0 + \hat\beta_1 X_{new}$, the relevant standard error is
$$SE(\hat{Y}_{new}) = \hat\sigma \sqrt{1 + \frac{1}{n} + \frac{(X_{new} - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}},$$
and the prediction interval is
$$\hat\beta_0 + \hat\beta_1 X_{new} \pm t_{n-2,\alpha/2}\, SE(\hat{Y}_{new}).$$
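In R, predict gives this interval directly (a sketch; the 3.0 kg body weight is just an illustrative value):

new <- data.frame(Bwt = 3.0)   # hypothetical new cat
predict(fit, newdata = new, interval = "prediction", level = 0.95)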

Let us return to the cat example

What if the assumptions are not satisfied? The regression function may be a higher-order polynomial or otherwise nonlinear, i.e.
$$y_j = \beta_0 + \beta_1 x_j + \ldots + \beta_{p-1} x_j^{p-1} + \varepsilon_j.$$
The errors may not be normally distributed or may not have the same variance; qqnorm can help detect this.

How to fix this? Sometimes things can be transformed to a linear model: suppose $y_i = \beta_0 e^{\beta_1 x_i} \varepsilon_i$. Then $\log y_i = \log \beta_0 + \beta_1 x_i + \log \varepsilon_i$ is a linear model, and if the $\varepsilon_i$ are independent lognormal variables, the transformed model satisfies the standard linear-model assumptions (the log-errors are normal). Box-Cox transformations help us choose a transformation that linearizes the model.
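The Box-Cox procedure is implemented in the MASS package (a sketch, reusing the cats fit):

library(MASS)
boxcox(fit, lambda = seq(-2, 2, 0.1))   # profile log-likelihood over power transforms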

Polynomial regression:
$$y_j = \beta_0 + \beta_1 x_j + \ldots + \beta_{p-1} x_j^{p-1} + \varepsilon_j.$$
In matrix notation,
$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_1 & \cdots & x_1^{p-1} \\ 1 & x_2 & \cdots & x_2^{p-1} \\ \vdots & \vdots & & \vdots \\ 1 & x_n & \cdots & x_n^{p-1} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix},$$
which can be written $Y = X\beta + \varepsilon$.

We add a quadratic term to the model using the function poly, which adds a polynomial trend of a given degree to a model:

quadratic.lm <- lm(y2 ~ poly(x2, 2), data = datatest)
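A self-contained sketch on simulated data (the data frame and variable names here are illustrative, not from the lecture):

set.seed(1)
datatest <- data.frame(x2 = runif(100, 0, 3))
datatest$y2 <- 1 + 2 * datatest$x2 - 0.5 * datatest$x2^2 + rnorm(100, sd = 0.3)
quadratic.lm <- lm(y2 ~ poly(x2, 2), data = datatest)   # orthogonal quadratic fit
summary(quadratic.lm)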

Other regression models: splines are piecewise polynomial functions, i.e. on each interval between knots $(t_i, t_{i+1})$ the spline $f(x)$ is a polynomial, but the coefficients change from interval to interval. Example: cubic splines,
$$f(x) = \sum_{j=0}^{3} \beta_{0j}\, x^j + \sum_{i=1}^{h} \beta_i (x - t_i)_+^3.$$
Other bases one might use: Fourier series, wavelet series, ...
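In R, regression splines can be fitted with the bs basis from the splines package (a sketch on the cats data; the degrees of freedom are an arbitrary illustrative choice):

library(splines)
spline.fit <- lm(Hwt ~ bs(Bwt, df = 5), data = cats)   # cubic B-spline basis
summary(spline.fit)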