MATH 829: Introduction to Data Mining and Analysis Linear Regression: statistical tests


Dominique Guillot
Department of Mathematical Sciences, University of Delaware
February 17, 2016

Statistical hypothesis testing

Suppose we have a linear model for some data:

    Y = β₀ + β₁X₁ + ⋯ + βₚXₚ + ɛ.

An important problem is to identify which variables are really useful in predicting Y. We want to decide if βᵢ = 0 or not with some level of confidence. We also want to test if groups of coefficients {β_{i_k} : k = 1, …, l} are zero.

Statistical hypothesis testing (cont.)

Recall: to carry out a statistical test:

1. State a null hypothesis H₀ and an alternative hypothesis H₁.
2. Construct an appropriate test statistic.
3. Derive the distribution of the test statistic under the null hypothesis.
4. Select a significance level α (typically 5% or 1%).
5. Compute the test statistic and decide if the null hypothesis is rejected at the given significance level.

Example

Example: Suppose X ~ N(µ, 1) with µ unknown. We want to test:

    H₀: µ = 0    vs.    H₁: µ ≠ 0.

We have an iid sample X₁, …, Xₙ ~ N(µ, 1).

Recall: if X ~ N(µ₁, σ₁²) and Y ~ N(µ₂, σ₂²) are independent, then X + Y ~ N(µ₁ + µ₂, σ₁² + σ₂²).

Therefore, under H₀, we have

    µ̂ = (1/n) Σᵢ₌₁ⁿ Xᵢ ~ N(0, 1/n).

Test statistic: √n µ̂ ~ N(0, 1).

Suppose we observed √n µ̂ = k. We compute:

    P(−z_α ≤ √n µ̂ ≤ z_α) = P(−z_α ≤ N(0, 1) ≤ z_α) = 1 − α.

Reject the null hypothesis if k ∉ [−z_α, z_α].

Example (cont.)

For example, suppose α = 0.05 and √n µ̂ = 2.2. If Z ~ N(0, 1), then

    P(−1.96 ≤ Z ≤ 1.96) = 0.95.

Therefore, it is very unlikely to observe √n µ̂ = 2.2 if µ = 0. We reject the null hypothesis. In fact, P(−2.2 ≤ Z ≤ 2.2) ≈ 0.972, so our p-value is ≈ 0.028.

Type I error: H₀ is true but rejected (a false positive; controlled by the significance level α).

Type II error: H₀ is false but not rejected (a false negative; related to the power of the test).
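
As a quick illustration, here is a minimal NumPy/SciPy sketch of this z-test on simulated data (the sample size, true mean, and seed below are made up for the example):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100
x = rng.normal(loc=0.3, scale=1.0, size=n)  # true mean 0.3; sigma = 1 is known

z = np.sqrt(n) * x.mean()                   # test statistic sqrt(n)*mu_hat ~ N(0,1) under H0
p_value = 2 * (1 - stats.norm.cdf(abs(z)))  # two-sided p-value
print(z, p_value)                           # reject H0 at level alpha if p_value <= alpha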

Testing if coefficients are zero

In practice, we often include unnecessary variables in linear models. These variables inflate the variance of our estimator and can lead to poor predictive performance. We need ways of identifying a good set of predictors. We now discuss a classical approach that uses statistical tests.

Before, we tested whether the mean of an N(µ, 1) is zero:

    H₀: µ = 0    vs.    H₁: µ ≠ 0,

assuming σ² = 1 is known. What if the variance is unknown?

Sample variance:

    s² = 1/(n−1) Σᵢ₌₁ⁿ (xᵢ − x̄)²,    x̄ = (1/n) Σᵢ₌₁ⁿ xᵢ.

In general, suppose X ~ N(µ, σ²) with σ² known and we want to test

    H₀: µ = µ₀    vs.    H₁: µ ≠ µ₀.

Under H₀, we have

    X̄ := (1/n) Σᵢ₌₁ⁿ Xᵢ ~ N(µ₀, σ²/n).

Therefore, we use the test statistic

    Z = (X̄ − µ₀)/(σ/√n) ~ N(0, 1).

If the variance is unknown, we replace σ by its sample version s:

    T = (X̄ − µ₀)/(s/√n) ~ tₙ₋₁.
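
As a sketch, the same test with unknown variance can be computed by hand with NumPy/SciPy, or directly with scipy.stats.ttest_1samp (the data below are simulated):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=0.5, scale=2.0, size=30)  # simulated sample, sigma unknown
mu0 = 0.0

T = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(len(x)))  # T ~ t_{n-1} under H0
p_value = 2 * stats.t.sf(abs(T), df=len(x) - 1)           # two-sided p-value
print(T, p_value)

print(stats.ttest_1samp(x, mu0))  # same statistic and p-value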

Review: the Student distribution

The Student t distribution with ν degrees of freedom has density

    f_ν(t) = Γ((ν+1)/2) / (√(νπ) Γ(ν/2)) · (1 + t²/ν)^(−(ν+1)/2),

where Γ is the Gamma function. When X₁, …, Xₙ are iid N(µ, σ²), then

    (X̄ − µ)/(s/√n) ~ tₙ₋₁.
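
One can check the density formula numerically against scipy.stats.t (a small sketch; the values of ν and t are arbitrary):

import numpy as np
from scipy import stats
from scipy.special import gamma

nu, t = 5, 1.7
f = (gamma((nu + 1) / 2) / (np.sqrt(nu * np.pi) * gamma(nu / 2))
     * (1 + t**2 / nu) ** (-(nu + 1) / 2))
print(f, stats.t.pdf(t, df=nu))  # the two values agree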

Back to testing regression coefficients: suppose

    yᵢ = xᵢ,₁β₁ + xᵢ,₂β₂ + ⋯ + xᵢ,ₚβₚ + ɛᵢ,

where (xᵢⱼ) is a fixed matrix, and the ɛᵢ are iid N(0, σ²). We saw that this implies

    β̂ ~ N(β, (XᵀX)⁻¹σ²).

In particular, β̂ᵢ ~ N(βᵢ, vᵢσ²), where vᵢ is the i-th diagonal element of (XᵀX)⁻¹.

We want to test:

    H₀: βᵢ = 0    vs.    H₁: βᵢ ≠ 0.

Note: vᵢ is known, but σ is unknown.

Recall:

    yᵢ = xᵢ,₁β₁ + xᵢ,₂β₂ + ⋯ + xᵢ,ₚβₚ + ɛᵢ.

Problem: How do we estimate σ?

Let ɛ̂ᵢ = yᵢ − (xᵢ,₁β̂₁ + xᵢ,₂β̂₂ + ⋯ + xᵢ,ₚβ̂ₚ). What about:

    σ̂² = (1/n) Σᵢ₌₁ⁿ ɛ̂ᵢ² ?

What is E(σ̂²)?

We have y = Xβ + ɛ and β̂ = (XᵀX)⁻¹Xᵀy. Thus,

    ɛ̂ = y − Xβ̂
       = y − X(XᵀX)⁻¹Xᵀy
       = (I − X(XᵀX)⁻¹Xᵀ)y
       = (I − X(XᵀX)⁻¹Xᵀ)(Xβ + ɛ)
       = (I − X(XᵀX)⁻¹Xᵀ)ɛ
       = Mɛ.

We showed ɛ̂ = Mɛ, where M := I − X(XᵀX)⁻¹Xᵀ. Now

    Σᵢ₌₁ⁿ ɛ̂ᵢ² = ɛ̂ᵀɛ̂ = ɛᵀMᵀMɛ.

Note: Mᵀ = M and

    MᵀM = M² = (I − X(XᵀX)⁻¹Xᵀ)(I − X(XᵀX)⁻¹Xᵀ) = I − X(XᵀX)⁻¹Xᵀ = M.

(In other words, M is idempotent.) Therefore,

    Σᵢ₌₁ⁿ ɛ̂ᵢ² = ɛᵀMɛ.

Now,

    E(ɛᵀMɛ) = E(tr(Mɛɛᵀ)) = tr E(Mɛɛᵀ) = tr(M E(ɛɛᵀ)) = tr(M σ²I) = σ² tr M.

We proved:

    E(Σᵢ₌₁ⁿ ɛ̂ᵢ²) = σ² tr M,

where M = I − X(XᵀX)⁻¹Xᵀ. (Here I = Iₙ, the n × n identity matrix.) What is tr M?

Recall tr(AB) = tr(BA). Thus,

    tr M = tr(I − X(XᵀX)⁻¹Xᵀ)
         = n − tr(X(XᵀX)⁻¹Xᵀ)
         = n − tr(XᵀX(XᵀX)⁻¹)
         = n − tr(Iₚ)
         = n − p.

Therefore,

    E( 1/(n−p) Σᵢ₌₁ⁿ ɛ̂ᵢ² ) = σ².
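
A small numerical check of both facts (tr M = n − p, and unbiasedness of the resulting variance estimator) for a random design matrix; the dimensions and σ² below are arbitrary:

import numpy as np

rng = np.random.default_rng(2)
n, p, sigma2 = 50, 4, 2.0
X = rng.normal(size=(n, p))

M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)  # M = I - X(X^T X)^{-1} X^T
print(np.trace(M), n - p)                          # both equal 46 (up to rounding)

# Monte Carlo: E(eps_hat^T eps_hat / (n - p)) = sigma^2
eps = rng.normal(scale=np.sqrt(sigma2), size=(n, 10000))
s2 = np.sum((M @ eps) ** 2, axis=0) / (n - p)      # eps_hat = M eps for each column
print(s2.mean())                                   # close to sigma2 = 2.0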

As a result of the previous calculation, our estimator of the variance σ² in the regression model will be

    s² = 1/(n−p) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²,

where ŷᵢ := xᵢ,₁β̂₁ + xᵢ,₂β̂₂ + ⋯ + xᵢ,ₚβ̂ₚ is our prediction of yᵢ.

Our test statistic is

    T = β̂ᵢ / (s √vᵢ),    vᵢ = ((XᵀX)⁻¹)ᵢᵢ.

Under the null hypothesis H₀: βᵢ = 0, one can show that the above T statistic has a Student distribution with n − p degrees of freedom: T ~ tₙ₋ₚ.

Thus, to test if βᵢ = 0, we compute the value of the T statistic, say T = T̂, and reject the null hypothesis (at the α = 5% level) if

    P(|tₙ₋ₚ| ≥ |T̂|) ≤ 0.05.

Important: This procedure cannot be iterated to remove multiple coefficients. We will see how this is done later.
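
A minimal sketch of this test on simulated data (design, coefficients, and noise level are made up; the third coefficient is truly zero):

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept + 2 predictors
beta = np.array([1.0, 2.0, 0.0])
y = X @ beta + rng.normal(scale=1.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # least squares estimate
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)                    # unbiased estimate of sigma^2
v = np.diag(np.linalg.inv(X.T @ X))             # v_i = ((X^T X)^{-1})_{ii}
T = beta_hat / np.sqrt(s2 * v)                  # one T statistic per coefficient
p_values = 2 * stats.t.sf(np.abs(T), df=n - p)  # two-sided p-values
print(T, p_values)                              # large p-value for the zero coefficient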

Confidence intervals for the regression coefficients

Recall that β̂ᵢ ~ N(βᵢ, vᵢσ²). Using our estimate s² for σ², we can construct a 1 − 2α confidence interval for βᵢ:

    (β̂ᵢ − z^(1−α) √vᵢ s,  β̂ᵢ + z^(1−α) √vᵢ s).

Here z^(1−α) is the (1 − α)-th percentile of the N(0, 1) distribution, i.e., P(Z ≤ z^(1−α)) = 1 − α.
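
Continuing the previous sketch, the corresponding intervals with 1 − 2α = 0.95 (so α = 0.025 and z^(1−α) ≈ 1.96) can be computed as:

from scipy import stats

z = stats.norm.ppf(0.975)          # z^(1-alpha) with alpha = 0.025
half_width = z * np.sqrt(s2 * v)   # z^(1-alpha) * sqrt(v_i) * s
print(np.column_stack([beta_hat - half_width, beta_hat + half_width]))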

Python

Unfortunately, scikit-learn doesn't compute t-statistics and confidence intervals. However, the statsmodels module provides exactly what we need.

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Load data
dat = sm.datasets.get_rdataset("Guerry", "HistData").data

# Fit regression model (using the natural log
# of one of the regressors)
results = smf.ols('Lottery ~ Literacy + np.log(Pop1831)',
                  data=dat).fit()

# Inspect the results
print(results.summary())

Python (cont.)

[The slide displays the regression output produced by results.summary().]
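
For reference, the coef, std err, t, and P>|t| columns of the summary report β̂ᵢ, s√vᵢ, the T statistic, and its two-sided p-value, and the [0.025 0.975] columns give 95% confidence intervals (statsmodels builds them from percentiles of tₙ₋ₚ rather than of N(0, 1)).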