MATH 829: Introduction to Data Mining and Analysis Linear Regression: statistical tests


Dominique Guillot
Department of Mathematical Sciences, University of Delaware
February 17, 2016

Statistical hypothesis testing

Suppose we have a linear model for some data:

    Y = β₀ + β₁X₁ + ⋯ + βₚXₚ + ɛ.

An important problem is to identify which variables are really useful in predicting Y. We want to decide if βᵢ = 0 or not with some level of confidence. We also want to test if groups of coefficients {β_{i_k} : k = 1, …, l} are zero.

Statistical hypothesis testing (cont.)

Recall: to carry out a statistical test:

1. State a null hypothesis H₀ and an alternative hypothesis H₁.
2. Construct an appropriate test statistic.
3. Derive the distribution of the test statistic under the null hypothesis.
4. Select a significance level α (typically 5% or 1%).
5. Compute the test statistic and decide if the null hypothesis is rejected at the given significance level.

Example

Example: Suppose X ~ N(µ, 1) with µ unknown. We want to test:

    H₀: µ = 0    vs.    H₁: µ ≠ 0.

We have an iid sample X₁, …, Xₙ ~ N(µ, 1).

Recall: if X ~ N(µ₁, σ₁²) and Y ~ N(µ₂, σ₂²) are independent, then X + Y ~ N(µ₁ + µ₂, σ₁² + σ₂²).

Therefore, under H₀, we have

    µ̂ = (1/n) Σᵢ₌₁ⁿ Xᵢ ~ N(0, 1/n).

Test statistic: √n µ̂ ~ N(0, 1).

Suppose we observed √n µ̂ = k. We compute:

    P(−z_α ≤ √n µ̂ ≤ z_α) = P(−z_α ≤ N(0, 1) ≤ z_α) = 1 − α.

Reject the null hypothesis if k ∉ [−z_α, z_α].

Example (cont.)

For example, suppose α = 0.05 and √n µ̂ = 2.2. If Z ~ N(0, 1), then

    P(−1.96 ≤ Z ≤ 1.96) = 0.95.

Therefore, it is very unlikely to observe √n µ̂ = 2.2 if µ = 0. We reject the null hypothesis. In fact, P(−2.2 ≤ Z ≤ 2.2) ≈ 0.972, so our p-value is ≈ 0.028.

Type I error: H₀ is true but rejected (a false positive; controlled by the significance level α).

Type II error: H₀ is false but not rejected (a false negative; related to the power of the test).
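
As a quick illustration, here is a minimal NumPy/SciPy sketch of this z-test on simulated data (the sample size, true mean, and seed below are made up for the example):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100
x = rng.normal(loc=0.3, scale=1.0, size=n)  # true mean 0.3; sigma = 1 is known

z = np.sqrt(n) * x.mean()                   # test statistic sqrt(n)*mu_hat ~ N(0,1) under H0
p_value = 2 * (1 - stats.norm.cdf(abs(z)))  # two-sided p-value
print(z, p_value)                           # reject H0 at level alpha if p_value <= alpha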

Testing if coefficients are zero

In practice, we often include unnecessary variables in linear models. These variables inflate the variance of our estimator and can lead to poor predictive performance. We need ways of identifying a good set of predictors. We now discuss a classical approach that uses statistical tests.

Before, we tested whether the mean of an N(µ, 1) is zero:

    H₀: µ = 0    vs.    H₁: µ ≠ 0,

assuming σ² = 1 is known. What if the variance is unknown?

Sample variance:

    s² = 1/(n−1) Σᵢ₌₁ⁿ (xᵢ − x̄)²,    x̄ = (1/n) Σᵢ₌₁ⁿ xᵢ.

In general, suppose X ~ N(µ, σ²) with σ² known and we want to test

    H₀: µ = µ₀    vs.    H₁: µ ≠ µ₀.

Under H₀, we have

    X̄ := (1/n) Σᵢ₌₁ⁿ Xᵢ ~ N(µ₀, σ²/n).

Therefore, we use the test statistic

    Z = (X̄ − µ₀)/(σ/√n) ~ N(0, 1).

If the variance is unknown, we replace σ by its sample version s:

    T = (X̄ − µ₀)/(s/√n) ~ tₙ₋₁.
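
As a sketch, the same test with unknown variance can be computed by hand with NumPy/SciPy, or directly with scipy.stats.ttest_1samp (the data below are simulated):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=0.5, scale=2.0, size=30)  # simulated sample, sigma unknown
mu0 = 0.0

T = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(len(x)))  # T ~ t_{n-1} under H0
p_value = 2 * stats.t.sf(abs(T), df=len(x) - 1)           # two-sided p-value
print(T, p_value)

print(stats.ttest_1samp(x, mu0))  # same statistic and p-value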

Review: the Student distribution

The Student t distribution with ν degrees of freedom has density

    f_ν(t) = Γ((ν+1)/2) / (√(νπ) Γ(ν/2)) · (1 + t²/ν)^(−(ν+1)/2),

where Γ is the Gamma function. When X₁, …, Xₙ are iid N(µ, σ²), then

    (X̄ − µ)/(s/√n) ~ tₙ₋₁.
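
One can check the density formula numerically against scipy.stats.t (a small sketch; the values of ν and t are arbitrary):

import numpy as np
from scipy import stats
from scipy.special import gamma

nu, t = 5, 1.7
f = (gamma((nu + 1) / 2) / (np.sqrt(nu * np.pi) * gamma(nu / 2))
     * (1 + t**2 / nu) ** (-(nu + 1) / 2))
print(f, stats.t.pdf(t, df=nu))  # the two values agree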

Back to testing regression coefficients: suppose

    yᵢ = xᵢ,₁β₁ + xᵢ,₂β₂ + ⋯ + xᵢ,ₚβₚ + ɛᵢ,

where (xᵢⱼ) is a fixed matrix, and the ɛᵢ are iid N(0, σ²). We saw that this implies

    β̂ ~ N(β, (XᵀX)⁻¹σ²).

In particular, β̂ᵢ ~ N(βᵢ, vᵢσ²), where vᵢ is the i-th diagonal element of (XᵀX)⁻¹.

We want to test:

    H₀: βᵢ = 0    vs.    H₁: βᵢ ≠ 0.

Note: vᵢ is known, but σ is unknown.

Recall:

    yᵢ = xᵢ,₁β₁ + xᵢ,₂β₂ + ⋯ + xᵢ,ₚβₚ + ɛᵢ.

Problem: How do we estimate σ?

Let ɛ̂ᵢ = yᵢ − (xᵢ,₁β̂₁ + xᵢ,₂β̂₂ + ⋯ + xᵢ,ₚβ̂ₚ). What about:

    σ̂² = (1/n) Σᵢ₌₁ⁿ ɛ̂ᵢ² ?

What is E(σ̂²)?

We have y = Xβ + ɛ and β̂ = (XᵀX)⁻¹Xᵀy. Thus,

    ɛ̂ = y − Xβ̂
       = y − X(XᵀX)⁻¹Xᵀy
       = (I − X(XᵀX)⁻¹Xᵀ)y
       = (I − X(XᵀX)⁻¹Xᵀ)(Xβ + ɛ)
       = (I − X(XᵀX)⁻¹Xᵀ)ɛ
       = Mɛ.

We showed ɛ̂ = Mɛ, where M := I − X(XᵀX)⁻¹Xᵀ. Now

    Σᵢ₌₁ⁿ ɛ̂ᵢ² = ɛ̂ᵀɛ̂ = ɛᵀMᵀMɛ.

Note: Mᵀ = M and

    MᵀM = M² = (I − X(XᵀX)⁻¹Xᵀ)(I − X(XᵀX)⁻¹Xᵀ) = I − X(XᵀX)⁻¹Xᵀ = M.

(In other words, M is idempotent.) Therefore,

    Σᵢ₌₁ⁿ ɛ̂ᵢ² = ɛᵀMɛ.

Now,

    E(ɛᵀMɛ) = E(tr(Mɛɛᵀ)) = tr E(Mɛɛᵀ) = tr(M E(ɛɛᵀ)) = tr(M σ²I) = σ² tr M.

We proved:

    E(Σᵢ₌₁ⁿ ɛ̂ᵢ²) = σ² tr M,

where M = I − X(XᵀX)⁻¹Xᵀ. (Here I = Iₙ, the n × n identity matrix.) What is tr M?

Recall tr(AB) = tr(BA). Thus,

    tr M = tr(I − X(XᵀX)⁻¹Xᵀ)
         = n − tr(X(XᵀX)⁻¹Xᵀ)
         = n − tr(XᵀX(XᵀX)⁻¹)
         = n − tr(Iₚ)
         = n − p.

Therefore,

    E( 1/(n−p) Σᵢ₌₁ⁿ ɛ̂ᵢ² ) = σ².
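
A small numerical check of both facts (tr M = n − p, and unbiasedness of the resulting variance estimator) for a random design matrix; the dimensions and σ² below are arbitrary:

import numpy as np

rng = np.random.default_rng(2)
n, p, sigma2 = 50, 4, 2.0
X = rng.normal(size=(n, p))

M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)  # M = I - X(X^T X)^{-1} X^T
print(np.trace(M), n - p)                          # both equal 46 (up to rounding)

# Monte Carlo: E(eps_hat^T eps_hat / (n - p)) = sigma^2
eps = rng.normal(scale=np.sqrt(sigma2), size=(n, 10000))
s2 = np.sum((M @ eps) ** 2, axis=0) / (n - p)      # eps_hat = M eps for each column
print(s2.mean())                                   # close to sigma2 = 2.0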

As a result of the previous calculation, our estimator of the variance σ² in the regression model will be

    s² = 1/(n−p) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²,

where ŷᵢ := xᵢ,₁β̂₁ + xᵢ,₂β̂₂ + ⋯ + xᵢ,ₚβ̂ₚ is our prediction of yᵢ.

Our test statistic is

    T = β̂ᵢ / (s √vᵢ),    vᵢ = ((XᵀX)⁻¹)ᵢᵢ.

Under the null hypothesis H₀: βᵢ = 0, one can show that the above T statistic has a Student distribution with n − p degrees of freedom: T ~ tₙ₋ₚ.

Thus, to test if βᵢ = 0, we compute the value of the T statistic, say T = T̂, and reject the null hypothesis (at the α = 5% level) if

    P(|tₙ₋ₚ| ≥ |T̂|) ≤ 0.05.

Important: This procedure cannot be iterated to remove multiple coefficients. We will see how this is done later.
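
A minimal sketch of this test on simulated data (design, coefficients, and noise level are made up; the third coefficient is truly zero):

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept + 2 predictors
beta = np.array([1.0, 2.0, 0.0])
y = X @ beta + rng.normal(scale=1.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # least squares estimate
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)                    # unbiased estimate of sigma^2
v = np.diag(np.linalg.inv(X.T @ X))             # v_i = ((X^T X)^{-1})_{ii}
T = beta_hat / np.sqrt(s2 * v)                  # one T statistic per coefficient
p_values = 2 * stats.t.sf(np.abs(T), df=n - p)  # two-sided p-values
print(T, p_values)                              # large p-value for the zero coefficient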

Confidence intervals for the regression coefficients

Recall that β̂ᵢ ~ N(βᵢ, vᵢσ²). Using our estimate s² for σ², we can construct a 1 − 2α confidence interval for βᵢ:

    (β̂ᵢ − z^(1−α) √vᵢ s,  β̂ᵢ + z^(1−α) √vᵢ s).

Here z^(1−α) is the (1 − α)-th percentile of the N(0, 1) distribution, i.e., P(Z ≤ z^(1−α)) = 1 − α.
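
Continuing the previous sketch, the corresponding intervals with 1 − 2α = 0.95 (so α = 0.025 and z^(1−α) ≈ 1.96) can be computed as:

from scipy import stats

z = stats.norm.ppf(0.975)          # z^(1-alpha) with alpha = 0.025
half_width = z * np.sqrt(s2 * v)   # z^(1-alpha) * sqrt(v_i) * s
print(np.column_stack([beta_hat - half_width, beta_hat + half_width]))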

Python

Unfortunately, scikit-learn doesn't compute t-statistics and confidence intervals. However, the statsmodels module provides exactly what we need.

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Load data
dat = sm.datasets.get_rdataset("Guerry", "HistData").data

# Fit regression model (using the natural log
# of one of the regressors)
results = smf.ols('Lottery ~ Literacy + np.log(Pop1831)',
                  data=dat).fit()

# Inspect the results
print(results.summary())

Python (cont.)

[The slide displays the regression output produced by results.summary().]
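
For reference, the coef, std err, t, and P>|t| columns of the summary report β̂ᵢ, s√vᵢ, the T statistic, and its two-sided p-value, and the [0.025 0.975] columns give 95% confidence intervals (statsmodels builds them from percentiles of tₙ₋ₚ rather than of N(0, 1)).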