Math 583 Exam 3 is on Wednesday, Dec. 3. You are allowed 10 sheets of notes and a calculator. CHECK FORMULAS!

85) If the MLR (multiple linear regression) model contains a constant, then SSTO = SSE + SSR, where the total sum of squares (corrected for the mean) SSTO = Σ_{i=1}^n (Y_i − Ȳ)², the regression sum of squares SSR = Σ_{i=1}^n (Ŷ_i − Ȳ)², and the error (or residual) sum of squares SSE = Σ_{i=1}^n (Y_i − Ŷ_i)² = Σ_{i=1}^n e_i².

86) If the MLR model contains a constant, then the coefficient of multiple determination R² = [corr(Y_i, Ŷ_i)]² = SSR/SSTO = 1 − SSE/SSTO.

87) MLR output for the Anova F test of H_0: β_1 = ... = β_{p−1} = 0 has the form shown below, where under Source, Model often replaces Regression and Error often replaces Residual. Here MS = SS/df.

Source      df    SS    MS    F             p-value
Regression  p-1   SSR   MSR   Fo=MSR/MSE    for Ho:
Residual    n-p   SSE   MSE                 β_1 = ... = β_{p-1} = 0

88) If the MLR model has a constant β_0, then F_o = MSR/MSE = [R²/(p − 1)] / [(1 − R²)/(n − p)].

89) R² does not decrease and usually increases as predictors are added to the linear model. Want n ≥ 10p.

90) If θ̂ ≈ AN_p(θ, Σ/n), then a^T θ̂ ≈ AN_1(a^T θ, a^T Σ a/n), and a large sample 100(1 − δ)% confidence interval (CI) for a^T θ is a^T θ̂ ± t_{d_n, 1−δ/2} SE(a^T θ̂) = a^T θ̂ ± t_{d_n, 1−δ/2} sqrt(a^T Σ̂ a/n), where d_n → ∞ as n → ∞.

91) For the full rank OLS model, β̂ ≈ AN_p(β, MSE (X^T X)^{−1}). So a^T β̂ ≈ AN_1(a^T β, MSE a^T (X^T X)^{−1} a), and a large sample 100(1 − δ)% confidence interval (CI) for a^T β is a^T β̂ ± t_{n−p, 1−δ/2} sqrt(MSE a^T (X^T X)^{−1} a), where P(T ≤ t_{n−p, 1−δ}) = 1 − δ if T ~ t_{n−p}.

92) A large sample 100(1 − δ)% CI for β_i uses a^T = (0, ..., 0, 1, 0, ..., 0) where the 1 is in the (i + 1)th position, for i = 0, ..., p − 1. Then a^T (X^T X)^{−1} a is the (i + 1)th diagonal element d_ii of (X^T X)^{−1}. So the CI is β̂_i ± t_{n−p, 1−δ/2} sqrt(MSE d_ii) = β̂_i ± t_{n−p, 1−δ/2} SE(β̂_i).

93) Suppose there are k 100(1 − δ_S)% CIs where 1 − δ_S is the confidence for a single confidence interval. Suppose we want the overall familywise confidence 1 − δ_T (the probability, before gathering the data and making the k CIs, that all k CIs contain their θ_i). Hence δ_T = P(at least one of the k CIs does not contain its θ_i), before gathering data. Then δ_T is called the familywise error rate. Can get procedures where 1 − δ_T ≥ 1 − δ.

94) Assume ɛ ~ N_n(0, σ² I). i) The Bonferroni t intervals use δ_S = δ/k. Then 1 − δ_T ≥ 1 − δ. ii) Scheffé's CIs for a^T β have the form a^T β̂ ± sqrt(p F_{p, n−p, 1−δ}) sqrt(MSE a^T (X^T X)^{−1} a), and have 1 − δ_T ≥ 1 − δ.

95) Scheffé's CIs are longer than the corresponding Bonferroni CIs. Scheffé's CIs allow data snooping: you can decide on the a^T β to use after getting the data. Usually you need to decide on the a^T β to use before gathering data for valid inference.
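The decomposition in 85) and the CI of 92) are easy to check numerically. Below is a minimal R sketch (not from the notes; the simulated data and variable names are made up) that verifies SSTO = SSE + SSR and reproduces the t interval for a slope from an lm() fit.

# Minimal sketch: check SSTO = SSE + SSR (item 85) and the CI of item 92.
set.seed(1)
n <- 100; p <- 3
x <- matrix(rnorm(n * (p - 1)), n, p - 1)
y <- 1 + x %*% c(2, -1) + rnorm(n)
fit <- lm(y ~ x)                               # MLR model with a constant
ssto <- sum((y - mean(y))^2)
sse  <- sum(resid(fit)^2)
ssr  <- sum((fitted(fit) - mean(y))^2)
c(SSTO = ssto, SSE.plus.SSR = sse + ssr)       # equal up to rounding
mse <- sse / (n - p)
dii <- summary(fit)$cov.unscaled[2, 2]         # diagonal element of (X^T X)^{-1}
delta <- 0.05
coef(fit)[2] + c(-1, 1) * qt(1 - delta / 2, n - p) * sqrt(mse * dii)
confint(fit, level = 1 - delta)[2, ]           # lm's built-in CI agrees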

96) If the normality in 94) does not hold, then the large sample familywise confidence 1 − δ_{T,n} → 1 − δ_T ≥ 1 − δ as n → ∞ for a large class of zero mean iid error distributions.

97) A large sample 100(1 − δ)% confidence region for θ is a set C_n such that P(θ ∈ C_n) → 1 − δ as n → ∞. For the full rank OLS model, a large sample 100(1 − δ)% confidence region for β is

C_n = {β : (β − β̂)^T X^T X (β − β̂)/MSE ≤ p F_{p, n−p, 1−δ}} = {β : D²_β(β̂, (X^T X)^{−1} MSE) ≤ p F_{p, n−p, 1−δ}},

a hyperellipsoid for β centered at β̂.

98) For the full rank OLS model Y = x^T β + ɛ, a large sample 100(1 − δ)% CI for E(Y) = E(Y|x) = x^T β is x^T β̂ ± t_{n−p, 1−δ/2} sqrt(MSE x^T (X^T X)^{−1} x) = Ŷ ± t_{n−p, 1−δ/2} sqrt(MSE h_x). Want h_x ≤ max(h_1, ..., h_n), where h_i = x_i^T (X^T X)^{−1} x_i, to avoid extrapolation.

99) Consider predicting a future test value Y_f given x_f and training data (x_1, Y_1), ..., (x_n, Y_n) where Y_i = x_i^T β + ɛ_i and Y_f = x_f^T β + ɛ_f. For the full rank OLS model, ɛ_1, ..., ɛ_n, ɛ_f are iid and Y_f is independent of Y_1, ..., Y_n. Hence Y_f is independent of Ŷ_f = x_f^T β̂ since β̂ is computed using the training data. Then E(Ŷ_f − Y_f) = 0 and V(Ŷ_f − Y_f) = σ²(1 + h_f) where h_f = x_f^T (X^T X)^{−1} x_f is the leverage of x_f. Want h_f ≤ max(h_1, ..., h_n) to avoid extrapolation.

100) A large sample 100(1 − δ)% prediction interval (PI) has the form (L̂_n, Û_n) where P(L̂_n < Y_f < Û_n) → 1 − δ as the sample size n → ∞. If the highest density region is an interval, then a PI is asymptotically optimal if it has the shortest asymptotic length that gives the desired asymptotic coverage.

101) The length of a large sample CI goes to 0 while the length of a good PI goes to U − L as n → ∞, where P(Y_f ∈ (L, U) | x_f) = 1 − δ.

102) Know: Let Z_1, ..., Z_n be random variables, let Z_(1), ..., Z_(n) be the order statistics, and let c be a positive integer. Compute Z_(c) − Z_(1), Z_(c+1) − Z_(2), ..., Z_(n) − Z_(n−c+1). Let shorth(c) = (Z_(d), Z_(d+c−1)) = (ξ̃_{δ1}, ξ̃_{1−δ2}) correspond to the interval with the smallest distance.

103) Let k_n = ⌈n(1 − δ)⌉ where ⌈x⌉ is the smallest integer ≥ x, e.g., ⌈7.7⌉ = 8. Let a_n = sqrt[(n/(n − p))(1 + h_f)]. Apply the shorth(c = k_n) estimator to the residuals e_1, ..., e_n: shorth(c) = (e_(d), e_(d+c−1)) = (ξ̃_{δ1}, ξ̃_{1−δ2}). Then a large sample 100(1 − δ)% PI for Y_f is (Ŷ_f + a_n ξ̃_{δ1}, Ŷ_f + a_n ξ̃_{1−δ2}). For the full rank OLS model, this PI is asymptotically optimal if the x_i are bounded in probability and the iid ɛ_i come from a large class of zero mean unimodal distributions.

104) For the full rank OLS model, the 100(1 − δ)% classical PI for Y_f is Ŷ_f ± t_{n−p, 1−δ/2} sqrt(MSE (1 + h_f)) where P(T ≤ t_{n−p, δ}) = δ if T has a t distribution with n − p degrees of freedom. Asymptotically, this PI estimates (E(Y_f|x_f) − σ z_{1−δ/2}, E(Y_f|x_f) + σ z_{1−δ/2}), the interval between two quantiles of a N(E(Y_f|x_f), σ²) distribution where P(Z ≤ z_α) = α if Z ~ N(0, 1). This PI may not perform well if the ɛ_i are not iid N(0, σ²) since the normal quantiles are not the correct quantiles for other error distributions.
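A rough R sketch of the shorth PI of 103) follows. It is not from the notes: the helper name shorth_pi is made up, and the correction factor a_n is taken as sqrt((n/(n − p))(1 + h_f)), matching the reconstruction above.

# Sketch of the shorth(c) prediction interval of item 103 (hypothetical helper).
shorth_pi <- function(fit, xf, delta = 0.05) {
  e <- resid(fit); n <- length(e); p <- length(coef(fit))
  cc <- ceiling(n * (1 - delta))                   # k_n
  es <- sort(e)
  widths <- es[cc:n] - es[1:(n - cc + 1)]          # Z_(c)-Z_(1), ..., Z_(n)-Z_(n-c+1)
  d <- which.min(widths)
  xi <- c(es[d], es[d + cc - 1])                   # (xi_delta1, xi_1-delta2)
  X <- model.matrix(fit)
  hf <- drop(t(xf) %*% solve(crossprod(X), xf))    # leverage of x_f (xf includes the 1)
  an <- sqrt((n / (n - p)) * (1 + hf))             # assumed form of a_n
  drop(crossprod(xf, coef(fit))) + an * xi         # (lower, upper) PI for Y_f
}
# Example: shorth_pi(lm(y ~ x), xf = c(1, 0.5, -0.2)) for the fit from the sketch above.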

105) One of the best ways to check the linear model is to make a residual plot of Ŷ versus e and a response plot of Ŷ versus Y, with the identity line that has unit slope and zero intercept added as a visual aid. For multiple linear regression (MLR), assume the zero mean errors are iid from a unimodal distribution that is not highly skewed. If the iid constant variance MLR model is useful, then i) the plotted points in the response plot should scatter about the identity line with no other pattern, and ii) the plotted points in the residual plot should scatter about the e = 0 line with no other pattern. If either i) or ii) is violated, then the iid constant variance MLR model is not sustained. In other words, if the plotted points in the residual plot show some type of dependency, e.g. increasing variance or a curved pattern, then the MLR model may be inadequate.

106) Omitting important predictors, known as underfitting, can be a serious problem. Then for the multiple linear regression model, β̂ tends to be a biased estimator of β, and leaving out important predictors could destroy the linearity of the model and could result in a model that has a nonconstant variance function. Let

x = (x_1^T, x_{p−1})^T and β = (β_1^T, β_{p−1})^T,

and assume that Y = x^T β + ɛ is a good OLS model. Hence E(Y|x) = β^T x = β_1^T x_1 + β_{p−1} x_{p−1} and V(Y|x) = σ². If x_{p−1} is omitted from the model, then E(Y|x_1) = β_1^T x_1 + β_{p−1} E(x_{p−1}|x_1) and V(Y|x_1) = σ² + β²_{p−1} V(x_{p−1}|x_1). Note that linearity is destroyed if E(x_{p−1}|x_1) is nonlinear, and the model has a nonconstant variance function if V(x_{p−1}|x_1) is not constant and so depends on x_1. On the other hand, if E(x_{p−1}|x_1) = θ^T x_1 and V(x_{p−1}|x_1) = τ², then E(Y|x_1) = η^T x_1 is linear and V(Y|x_1) = σ² + β²_{p−1} τ² = γ² is constant, where η = β_1 + β_{p−1} θ.

107) Suppose x_1 ≡ 1 and (Y, x_2, ..., x_{p−1})^T = (Y, w^T)^T ~ N_p((µ_Y, µ_w^T)^T, Σ) where Σ has blocks σ²_Y, Σ_{Y,w}, Σ_{w,Y} and Σ_w. Then Y | x_{i1}, ..., x_{ik} follows a linear model with constant variance: Y_i = β_{0k} + β_{1k} x_{i1} + ... + β_{kk} x_{ik} + ɛ_{ik} where V(ɛ_{ik}) = σ²_k. Models with lower σ²_k are better.

108) Can also get linear models with underfitting if the columns of X are orthogonal: predictors can be omitted without changing the β̂_i of the predictors that are in the model.

109) Having too many predictors, known as overfitting, is much less serious than omitting important predictors. The β̂_i for unneeded x_i tend to have β̂_i → 0 in probability. Suppose Y = X_1 β_1 + ɛ is the appropriate OLS model where X_1 is an n × k matrix, X = (X_1 X_2) is n × p, and β = (β_1^T, β_2^T)^T = (β_1^T, 0^T)^T since β_2 = 0. Consider overfitting by fitting the OLS model using X instead of X_1. Then large sample inference is correct using X, but not as precise as the model that omits predictors with β_i = 0 (the model that uses X_1). For the overfitted model, R² is too high, and CIs for β_i are longer using X than using X_1 for i = 1, ..., k. Also want n ≥ 10p for the overfitted model and n ≥ 10k for the model using X_1.
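A short R sketch of the two plots in 105) follows; it is illustrative only, and the helper name is made up.

# Sketch: response and residual plots of item 105 for an lm() fit.
plot_mlr_checks <- function(fit) {
  op <- par(mfrow = c(1, 2)); on.exit(par(op))
  yhat <- fitted(fit); y <- yhat + resid(fit)
  plot(yhat, y, xlab = "YHAT", ylab = "Y", main = "Response Plot")
  abline(0, 1)                        # identity line: unit slope, zero intercept
  plot(yhat, resid(fit), xlab = "YHAT", ylab = "RES", main = "Residual Plot")
  abline(h = 0)                       # e = 0 line
}
# plot_mlr_checks(lm(y ~ x))          # points should scatter about the added lines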

110) If Y = Xβ + ɛ with E(ɛ) = 0 but Cov(ɛ) = σ² V instead of σ² I, then under regularity conditions β̂ → β in probability, but typically i) Cov(β̂) ≠ σ² (X^T X)^{−1} and ii) E(MSE) ≠ σ². GLS can be used if V is known. A sandwich estimator can also be used to get a consistent estimator of Cov(β̂).

111) Outliers can often be found using the response and residual plots. The OLS fitted values (so the identity line) will often go right through a cluster of gross outliers. Look for a gap separating the outliers from the bulk of the data. Fit OLS to the bulk of the data, producing OLS estimator b. Then make the response and residual plots for all of the data using Ŷ = x^T b. If the identity line still goes through the far away cluster, then it may be a cluster of good leverage cases; otherwise the cases are likely outliers. Robust estimators attempt to automatically fit the bulk of the data well.

112) For variable selection, the model Y = x^T β + ɛ that uses all of the predictors is called the full model. A model Y = x_I^T β_I + ɛ that only uses a subset x_I of the predictors is called a submodel. The full model is always a submodel.

113) Let I_min correspond to the submodel with the smallest C_p. Find the submodel I_I with the fewest number of predictors such that C_p(I_I) ≤ C_p(I_min) + 1. Then I_I is the initial submodel that should be examined. It is possible that I_I = I_min or that I_I is the full model. Models I with fewer predictors than I_I such that C_p(I) ≤ C_p(I_min) + 4 are interesting and should also be examined. Models I with k predictors, including a constant, and with fewer predictors than I_I such that C_p(I_min) + 4 < C_p(I) ≤ min(2k, p) should be checked. Be able to find model I_I from computer output.

114) Forward selection. Step 1) k = 1: Start with a constant w_1 = x_1. Step 2) k = 2: Compute C_p for all models with k = 2 containing a constant and a single predictor x_i. Keep the predictor w_2 = x_j, say, that minimizes C_p. Step 3) k = 3: Fit all models with k = 3 that contain w_1 and w_2. Keep the predictor w_3 that minimizes C_p. ... Step j) k = j: Fit all models with k = j that contain w_1, w_2, ..., w_{j−1}. Keep the predictor w_j that minimizes C_p. ... Step p) Fit the full model.

Backward elimination: All models contain a constant = u_1. Step 0) k = p: Start with the full model that contains x_1, ..., x_p. We will also say that the full model contains u_1, ..., u_p where u_1 = x_1 but u_i need not equal x_i for i > 1. Step 1) k = p − 1: Fit each model with k = p − 1 predictors including a constant. Delete the predictor u_p, say, that corresponds to the model with the smallest C_p. Keep u_1, ..., u_{p−1}. Step 2) k = p − 2: Fit each model with p − 2 predictors including a constant. Delete the predictor u_{p−1} corresponding to the smallest C_p. Keep u_1, ..., u_{p−2}. ... Step j) k = p − j: Fit each model with p − j predictors including a constant. Delete the predictor u_{p−j+1} corresponding to the smallest C_p. Keep u_1, ..., u_{p−j}. ... Step p − 2) k = 2: The current model contains u_1, u_2 and u_3. Fit the model u_1, u_2 and the model u_1, u_3. Assume that model u_1, u_2 minimizes C_p. Then delete u_3 and keep u_1 and u_2.

115) Can do all subsets variable selection for up to about 30 predictors. Criteria other than C_p, such as MSE(I) or the adjusted R², R²_A(I) = 1 − [1 − R²(I)](n − 1)/(n − k) where model I contains k predictors, including a constant, can be used. C_p needs a good full model with n ≥ 10p or 5p.
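The search in 114) is easy to prototype with ordinary lm() fits. The sketch below is not from the notes: the helper name forward_cp and the simulated data are made up, and C_p(I) is computed as SSE(I)/MSE(full) + 2k − n.

# Sketch: forward selection by Cp (item 114) using lm(); assumes the columns of
# the predictor matrix have syntactically valid names.
forward_cp <- function(y, x) {
  dat <- data.frame(y = y, x)
  n <- length(y)
  full <- lm(y ~ ., data = dat)
  mse_full <- sum(resid(full)^2) / (n - length(coef(full)))
  cp <- function(f) sum(resid(f)^2) / mse_full + 2 * length(coef(f)) - n
  chosen <- character(0); remaining <- colnames(x); out <- NULL
  while (length(remaining) > 0) {
    cps <- sapply(remaining, function(v)
      cp(lm(reformulate(c(chosen, v), response = "y"), data = dat)))
    best <- names(which.min(cps))
    chosen <- c(chosen, best); remaining <- setdiff(remaining, best)
    out <- rbind(out, data.frame(model = paste(c("1", chosen), collapse = "+"),
                                 Cp = min(cps)))
  }
  out
}
# set.seed(1); x <- matrix(rnorm(300), 100, 3); colnames(x) <- c("x2","x3","x4")
# forward_cp(1 + 2*x[,1] + rnorm(100), x)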

116) Collinearity occurs when at least one column of X = [v_0, v_1, ..., v_{p−1}] is highly correlated with a linear combination of the other columns. Regress v_j on v_0, v_1, ..., v_{j−1}, v_{j+1}, ..., v_{p−1}. Let R²_j be the coefficient of determination (squared multiple correlation coefficient) from the regression. The variance inflation factor is VIF_j = 1/(1 − R²_j). Let d_jj be given in 92). Then SE(β̂_j) = sqrt(MSE d_jj) = sqrt(MSE) sqrt(VIF_j)/[SD(v_j) sqrt(n − 1)] where SD(v_j) is the sample standard deviation of the n elements of v_j. Collinearity does not affect prediction much, provided that the software does not fail because X^T X is nearly singular.

117) The ith residual e_i = ɛ_i + N_i where N_i = x_i^T(β − β̂) ≈ AN_1(0, MSE h_i) → 0 in probability if max(h_1, ..., h_n) → 0 as n → ∞.

118) In experimental design models or design of experiments (DOE), the entries of X are coded, often as −1, 0 or 1. Often X is not a full rank matrix.

119) Some DOE models have one Y_i per x_i and lots of x_i's. Then the response and residual plots are used like those for MLR.

120) Some DOE models have n_i Y_i's per x_i, and only a few distinct values of x_i. Then the response and residual plots no longer look like those for MLR.

121) A dot plot of Z_1, ..., Z_m consists of an axis and m points each corresponding to the value of Z_i.

122) Let f_Z(z) be the pdf of Z. Then the family of pdfs f_Y(y) = f_Z(y − µ) indexed by the location parameter µ, −∞ < µ < ∞, is the location family for the random variable Y = µ + Z with standard pdf f_Z(y).

123) A one way fixed effects ANOVA model has a single qualitative predictor variable W with p categories a_1, ..., a_p. There are p different distributions for Y, one for each category a_i. The distribution of Y | (W = a_i) has pdf f_Z(y − µ_i) where the location family has second moments. Hence all p distributions come from the same location family with different location parameter µ_i and the same variance σ². The one way fixed effects normal ANOVA model is the special case where Y | (W = a_i) ~ N(µ_i, σ²).

The response plot is a plot of Ŷ versus Y. For the one way Anova model, the response plot is a plot of Ŷ_ij = µ̂_i versus Y_ij. Often the identity line with unit slope and zero intercept is added as a visual aid. Vertical deviations from the identity line are the residuals e_ij = Y_ij − Ŷ_ij = Y_ij − µ̂_i. The plot will consist of p dot plots that scatter about the identity line with similar shape and spread if the fixed effects one way ANOVA model is appropriate. The ith dot plot is a dot plot of Y_{i,1}, ..., Y_{i,n_i}. Assume that each n_i ≥ 10. If the response plot looks like the residual plot, then a horizontal line fits the p dot plots about as well as the identity line, and there is not much difference in the µ_i. If the identity line is clearly superior to any horizontal line, then at least some of the means differ.

The residual plot is a plot of Ŷ versus e where the residual e = Y − Ŷ. The plot will consist of p dot plots that scatter about the e = 0 line with similar shape and spread if the fixed effects one way ANOVA model is appropriate. The ith dot plot is a dot plot of e_{i,1}, ..., e_{i,n_i}. Assume that each n_i ≥ 10. Under the assumption that the Y_ij are from the same location scale family with different parameters µ_i, each of the p dot plots should have roughly the same shape and spread. This assumption is easier to judge with the residual plot than with the response plot.
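Item 116) above reduces VIF_j to 1/(1 − R²_j) from p − 1 auxiliary regressions, so it is a few lines of R. A minimal sketch (made-up helper name; nontrivial predictor columns only):

# Sketch: variance inflation factors of item 116 from auxiliary OLS fits.
vif_ols <- function(x) {             # x: n x (p-1) matrix of nontrivial predictors
  sapply(seq_len(ncol(x)), function(j) {
    r2j <- summary(lm(x[, j] ~ x[, -j]))$r.squared   # regress v_j on the others
    1 / (1 - r2j)
  })
}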

124) Rule of thumb: Let R_i be the range of the ith dot plot = max(Y_{i1}, ..., Y_{i,n_i}) − min(Y_{i1}, ..., Y_{i,n_i}). If the n_i ≈ n/p and if max(R_1, ..., R_p) ≤ 2 min(R_1, ..., R_p), then the one way ANOVA F test results will be approximately correct if the response and residual plots suggest that the remaining one way ANOVA model assumptions are reasonable.

125) Let Y_{i0} = Σ_{j=1}^{n_i} Y_ij and let µ̂_i = Ȳ_{i0} = Y_{i0}/n_i = (1/n_i) Σ_{j=1}^{n_i} Y_ij. Hence the dot notation means sum over the subscript corresponding to the 0, e.g. j. Similarly, Y_{00} = Σ_{i=1}^p Σ_{j=1}^{n_i} Y_ij is the sum of all of the Y_ij. Be able to find µ̂_i from data.

126) The cell means model for the fixed effects one way Anova is Y_ij = µ_i + ɛ_ij where Y_ij is the value of the response variable for the jth trial of the ith factor level, for i = 1, ..., p and j = 1, ..., n_i. The µ_i are the unknown means and E(Y_ij) = µ_i. The ɛ_ij are iid from the location family with pdf f_Z(z), zero mean and unknown variance σ² = V(Y_ij) = V(ɛ_ij). For the normal cell means model, the ɛ_ij are iid N(0, σ²). The estimator µ̂_i = Ȳ_{i0} = Σ_{j=1}^{n_i} Y_ij/n_i = Ŷ_ij. The ith residual is e_ij = Y_ij − Ȳ_{i0}, and Ȳ_{00} is the sample mean of all of the Y_ij and n = Σ_{i=1}^p n_i. The total sum of squares SSTO = Σ_{i=1}^p Σ_{j=1}^{n_i} (Y_ij − Ȳ_{00})², the treatment sum of squares SSTR = Σ_{i=1}^p n_i (Ȳ_{i0} − Ȳ_{00})², and the error sum of squares SSE = RSS = Σ_{i=1}^p Σ_{j=1}^{n_i} (Y_ij − Ȳ_{i0})². The MSE is an estimator of σ². The Anova table is the same as that for multiple linear regression, except that SSTR replaces the regression sum of squares and that SSTO, SSTR and SSE have n − 1, p − 1 and n − p degrees of freedom.

Summary Analysis of Variance Table
Source     df    SS     MS     F             p-value
Treatment  p-1   SSTR   MSTR   Fo=MSTR/MSE   for Ho:
Error      n-p   SSE    MSE                  µ_1 = ... = µ_p

127) Shown above is a one way ANOVA table given in symbols. Sometimes Treatment is replaced by Between treatments, Between Groups, Model, Factor or Groups. Sometimes Error is replaced by Residual, or Within Groups. Sometimes p-value is replaced by P, Pr(> F) or PR > F. SSE is often replaced by RSS = residual sum of squares.
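A small R check of 125)-127) on simulated data (the group sizes and means are made up); the hand computation of SSTR, SSE and F_o agrees with aov().

# Sketch: cell means estimates and the one way Anova F test (items 125-127).
set.seed(2)
y <- c(rnorm(15, 10), rnorm(12, 12), rnorm(18, 10))
grp <- factor(rep(1:3, times = c(15, 12, 18)))
mu_hat <- tapply(y, grp, mean)                     # muhat_i = Ybar_i0
n <- length(y); p <- nlevels(grp)
sstr <- sum(tabulate(grp) * (mu_hat - mean(y))^2)  # treatment sum of squares
sse  <- sum((y - mu_hat[grp])^2)                   # error sum of squares
Fo   <- (sstr / (p - 1)) / (sse / (n - p))
c(Fo = Fo, pvalue = pf(Fo, p - 1, n - p, lower.tail = FALSE))
summary(aov(y ~ grp))                              # same Anova table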

128) In matrix form, the cell means model is the linear model without an intercept (although 1 ∈ C(X)), where µ = β = (µ_1, ..., µ_p)^T and Y = Xµ + ɛ:

(Y_11, ..., Y_{1,n_1}, Y_21, ..., Y_{2,n_2}, ..., Y_{p,1}, ..., Y_{p,n_p})^T = X (µ_1, µ_2, ..., µ_p)^T + (ɛ_11, ..., ɛ_{1,n_1}, ɛ_21, ..., ɛ_{2,n_2}, ..., ɛ_{p,1}, ..., ɛ_{p,n_p})^T,

where the ith column of X is an indicator vector with 1's in the n_i rows corresponding to the ith treatment and 0's elsewhere.

129) For the cell means model, X^T X = diag(n_1, ..., n_p), (X^T X)^{−1} = diag(1/n_1, ..., 1/n_p), and X^T Y = (Y_{10}, ..., Y_{p0})^T. So β̂ = µ̂ = (X^T X)^{−1} X^T Y = (Ȳ_{10}, ..., Ȳ_{p0})^T. Then Ŷ = X(X^T X)^{−1} X^T Y = X µ̂, and Ŷ_ij = Ȳ_{i0}. Hence the ijth residual e_ij = Y_ij − Ŷ_ij = Y_ij − Ȳ_{i0} for i = 1, ..., p and j = 1, ..., n_i.

130) In the response plot, the dot plot for the jth treatment crosses the identity line at Ȳ_{j0}.

131) The one way Anova F test has hypotheses H_0: µ_1 = ... = µ_p and H_A: not H_0 (not all of the p population means are equal). The one way Anova table for this test is given in 127) above. Let RSS = SSE. The test statistic F = MSTR/MSE = {[RSS(H) − RSS]/(p − 1)}/MSE ~ F_{p−1, n−p} if the ɛ_ij are iid N(0, σ²). If H_0 is true, then Y_ij = µ + ɛ_ij and µ̂ = Ȳ_{00}. Hence RSS(H) = SSTO = Σ_{i=1}^p Σ_{j=1}^{n_i} (Y_ij − Ȳ_{00})². Since SSTO = SSE + SSTR, the quantity SSTR = RSS(H) − RSS, and MSTR = SSTR/(p − 1).

132) The one way Anova F test is a large sample test if the ɛ_ij are iid with mean 0 and variance σ². Then the Y_ij come from the same location family with the same variance σ_i² = σ² and different mean µ_i for i = 1, ..., p. Thus the p treatments (groups, populations) have the same variance σ_i² = σ². The V(ɛ_ij) ≡ σ² assumption (which implies that σ_i² = σ² for i = 1, ..., p) is a much stronger assumption for the one way Anova model than for MLR, but the test has some resistance to the assumption that σ_i² = σ² by 124).

133) Other design matrices X can be used for the full model. One design matrix adds a column of ones to the cell means design matrix. This model is no longer a full rank model.

134) A full rank one way Anova model with an intercept adds a constant but deletes the last column of the X for the cell means model. Then Y = Xβ + ɛ where Y and ɛ are as in the cell means model, and β = (β_0, β_1, ..., β_{p−1})^T = (µ_p, µ_1 − µ_p, µ_2 − µ_p, ..., µ_{p−1} − µ_p)^T. So β_0 = µ_p and β_i = µ_i − µ_p for i = 1, ..., p − 1. It can be shown that the OLS estimators are β̂_0 = Ȳ_{p0} = µ̂_p, and β̂_i = Ȳ_{i0} − Ȳ_{p0} = µ̂_i − µ̂_p for i = 1, ..., p − 1. (The cell means model has β̂_i = µ̂_i = Ȳ_{i0}.) In matrix form the model is

Y = Xβ + ɛ, where Y = (Y_11, ..., Y_{1,n_1}, Y_21, ..., Y_{2,n_2}, ..., Y_{p,1}, ..., Y_{p,n_p})^T and ɛ = (ɛ_11, ..., ɛ_{1,n_1}, ɛ_21, ..., ɛ_{2,n_2}, ..., ɛ_{p,1}, ..., ɛ_{p,n_p})^T are as in the cell means model, β = (β_0, β_1, ..., β_{p−1})^T, the first column of X is 1, and for i = 1, ..., p − 1 the (i + 1)th column of X is the indicator for the ith treatment group. Then X^T Y = (Y_{00}, Y_{10}, Y_{20}, ..., Y_{p−1,0})^T and

X^T X =
[ n        n_1   n_2   ...  n_{p-1} ]
[ n_1      n_1   0     ...  0       ]
[ n_2      0     n_2   ...  0       ]
[ ...                               ]
[ n_{p-1}  0     0     ...  n_{p-1} ].

Hence

(X^T X)^{−1} =
[ 1/n_p          −(1/n_p) 1^T                                 ]
[ −(1/n_p) 1     (1/n_p) 1 1^T + diag(1/n_1, ..., 1/n_{p-1})  ],

where 1 is the (p − 1) × 1 vector of ones. This model is interesting since the one way Anova F test of H_0: µ_1 = ... = µ_p versus H_A: not H_0 corresponds to the MLR Anova F test of H_0: β_1 = ... = β_{p−1} = 0 versus H_A: not H_0.

135) A contrast is θ = Σ_{i=1}^p c_i µ_i where Σ_{i=1}^p c_i = 0. The estimated contrast is θ̂ = Σ_{i=1}^p c_i Ȳ_{i0}. Then SE(θ̂) = sqrt(MSE Σ_{i=1}^p c_i²/n_i), and a 100(1 − δ)% CI for θ is θ̂ ± t_{n−p, 1−δ/2} SE(θ̂). CIs for one way Anova are less robust to the assumption that σ_i² ≡ σ² than the one way Anova F test.
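Item 135)'s contrast CI is a one-line computation once MSE and the group means are in hand. A sketch on made-up data (the contrast coefficients and simulated groups are illustrative only):

# Sketch: 100(1 - delta)% CI for a contrast (item 135), here theta = mu_1 - (mu_2 + mu_3)/2.
set.seed(3)
y <- c(rnorm(15, 10), rnorm(12, 12), rnorm(18, 11))
grp <- factor(rep(1:3, times = c(15, 12, 18)))
ci_contrast <- function(y, grp, cvec, delta = 0.05) {
  ni <- tabulate(grp); p <- nlevels(grp); n <- length(y)
  mu_hat <- tapply(y, grp, mean)
  mse <- sum((y - mu_hat[grp])^2) / (n - p)
  theta_hat <- sum(cvec * mu_hat)
  se <- sqrt(mse * sum(cvec^2 / ni))
  theta_hat + c(-1, 1) * qt(1 - delta / 2, n - p) * se
}
ci_contrast(y, grp, c(1, -0.5, -0.5))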

136) Two important families of contrasts are the family of all possible contrasts and the family of pairwise differences θ_ij = µ_i − µ_j where i ≠ j. The Scheffé multiple comparisons procedure has a δ_F for the family of all possible contrasts while the Tukey multiple comparisons procedure has a δ_F for the family of all (p choose 2) pairwise contrasts.

137) The multivariate linear regression model y_i = B^T x_i + ɛ_i for i = 1, ..., n has m ≥ 2 response variables Y_1, ..., Y_m and p predictor variables x_1, x_2, ..., x_p where x_1 ≡ 1 is the trivial predictor. The ith case is (x_i^T, y_i^T) = (1, x_i2, ..., x_ip, Y_i1, ..., Y_im) where the 1 could be omitted. The model is written in matrix form as Z = XB + E where the matrices are defined below. The model has E(ɛ_k) = 0 and Cov(ɛ_k) = Σ_ɛ = (σ_ij) for k = 1, ..., n. Also E(e_i) = 0 while Cov(e_i, e_j) = σ_ij I_n for i, j = 1, ..., m, where I_n is the n × n identity matrix and e_i is defined below. Then the p × m coefficient matrix B = [β_1 β_2 ... β_m] and the m × m covariance matrix Σ_ɛ are to be estimated, and E(Z) = XB while E(Y_ij) = x_i^T β_j. The ɛ_i are assumed to be iid. The data matrix W = [X Y] except usually the first column 1 of X is omitted.

The n × m matrix Z = (Y_{ij}) = [Y_1 Y_2 ... Y_m] has jth column Y_j = (Y_{1,j}, ..., Y_{n,j})^T and ith row y_i^T. The n × p design matrix of predictor variables X = (x_{ij}) = [v_1 v_2 ... v_p] has jth column v_j and ith row x_i^T, where v_1 = 1. The p × m matrix B = (β_{ij}) = [β_1 β_2 ... β_m]. The n × m matrix E = (ɛ_{ij}) = [e_1 e_2 ... e_m] has jth column e_j and ith row ɛ_i^T. Considering the ith row of Z, X and E shows that y_i^T = x_i^T B + ɛ_i^T.

138) We have changed notation for multiple linear regression, using e and ɛ for errors and ɛ̂, r, and ɛ̂_ij for residuals. For the multiple linear regression model, m = 1 and

Y_i = x_{i,1} β_1 + x_{i,2} β_2 + ... + x_{i,p} β_p + e_i = x_i^T β + e_i    (1)

for i = 1, ..., n. In matrix notation, these n equations become Y = Xβ + e, where Y is an n × 1 vector of response variables, X is an n × p matrix of predictors, β is a p × 1 vector of unknown coefficients, and e is an n × 1 vector of unknown errors.

139) Each response variable in a multivariate linear regression model follows a multiple linear regression model Y_j = Xβ_j + e_j for j = 1, ..., m where it is assumed that E(e_j) = 0 and Cov(e_j) = σ_jj I_n. Hence the errors corresponding to the jth response are uncorrelated with variance σ_j² = σ_jj. Notice that the same design matrix X of predictors is used for each of the m models, but the jth response variable vector Y_j, coefficient vector β_j and error vector e_j change and thus depend on j. Now consider the ith case (x_i^T, y_i^T), which corresponds to the ith row of Z and the ith row of X. Then

Y_i1 = β_11 x_i1 + ... + β_p1 x_ip + ɛ_i1 = x_i^T β_1 + ɛ_i1
Y_i2 = β_12 x_i1 + ... + β_p2 x_ip + ɛ_i2 = x_i^T β_2 + ɛ_i2
...
Y_im = β_1m x_i1 + ... + β_pm x_ip + ɛ_im = x_i^T β_m + ɛ_im

or, suppressing the conditioning on x_i, we have y_i = µ_{x_i} + ɛ_i = E(y_i) + ɛ_i where E(y_i) = µ_{x_i} = B^T x_i = (x_i^T β_1, x_i^T β_2, ..., x_i^T β_m)^T.

140) The least squares estimators are B̂ = (X^T X)^{−1} X^T Z = [β̂_1 β̂_2 ... β̂_m]. The predicted values or fitted values are Ẑ = X B̂ = [Ŷ_1 Ŷ_2 ... Ŷ_m] = (Ŷ_{ij}). The residuals are Ê = Z − Ẑ = Z − X B̂ = [r_1 r_2 ... r_m] = (ɛ̂_{ij}) with rows ɛ̂_1^T, ..., ɛ̂_n^T. These quantities can be found from the m multiple linear regressions of Y_j on the predictors: β̂_j = (X^T X)^{−1} X^T Y_j, Ŷ_j = X β̂_j and r_j = Y_j − Ŷ_j for j = 1, ..., m. Hence ɛ̂_{i,j} = Y_{i,j} − Ŷ_{i,j} where Ŷ_j = (Ŷ_{1,j}, ..., Ŷ_{n,j})^T. Finally,

Σ̂_{ɛ,d} = (Z − Ẑ)^T (Z − Ẑ)/(n − d) = (Z − X B̂)^T (Z − X B̂)/(n − d) = Ê^T Ê/(n − d) = (1/(n − d)) Σ_{i=1}^n ɛ̂_i ɛ̂_i^T.

The choices d = 0 and d = p are common. If d = 1, then Σ̂_{ɛ,d=1} = S_r, the sample covariance matrix of the residual vectors ɛ̂_i, since the sample mean of the ɛ̂_i is 0. Let Σ̂_ɛ = Σ̂_{ɛ,p} be the unbiased estimator of Σ_ɛ. Also,

Σ̂_{ɛ,d} = (n − d)^{−1} Z^T [I − X(X^T X)^{−1} X^T]Z, and Ê = [I − X(X^T X)^{−1} X^T]Z.
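A quick R sketch of 140) on simulated data (the dimensions and coefficient values are made up): the matrix formula for B̂ matches a per-response lm() fit, and the d = 1 covariance estimator matches the sample covariance of the residual vectors.

# Sketch: least squares estimators for multivariate linear regression (item 140).
set.seed(4)
n <- 50; p <- 3; m <- 2
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
B <- matrix(c(1, 2, -1, 0, 1, 3), nrow = p, ncol = m)   # true p x m coefficient matrix
Z <- X %*% B + matrix(rnorm(n * m), n, m)
Bhat <- solve(crossprod(X), crossprod(X, Z))            # (X^T X)^{-1} X^T Z
Ehat <- Z - X %*% Bhat                                  # residual matrix
cbind(Bhat[, 1], coef(lm(Z[, 1] ~ X[, -1])))            # same as the m = 1 fit for Y_1
SigmaHat <- crossprod(Ehat) / (n - p)                   # unbiased estimator (d = p)
Sr <- crossprod(Ehat) / (n - 1)                         # d = 1: sample cov of residuals
all.equal(Sr, cov(Ehat), check.attributes = FALSE)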

141) Theorem: Suppose X has full rank p < n and the covariance structure of 137) holds. Then E(B̂) = B so E(β̂_j) = β_j, and Cov(β̂_j, β̂_k) = σ_jk (X^T X)^{−1} for j, k = 1, ..., m. Also Ê and B̂ are uncorrelated, E(Ê) = 0, and

E(Σ̂_ɛ) = E(Ê^T Ê/(n − p)) = Σ_ɛ.

Also, Σ̂_ɛ is a √n consistent estimator of Σ_ɛ under mild regularity conditions.

142) A response plot for the jth response variable is a plot of the fitted values Ŷ_ij versus the response Y_ij. The identity line with slope one and zero intercept is added to the plot as a visual aid. A residual plot corresponding to the jth response variable is a plot of Ŷ_ij versus r_ij where i = 1, ..., n. Make the m response and residual plots for any multivariate linear regression. For each response variable Y_j, the response and residual plots behave just as they do for MLR.

143) Let the observed multivariate data w_i for i = 1, ..., n be collected in an n × p matrix W with n rows w_1^T, ..., w_n^T. Let the p × 1 column vector T = T(W) be a multivariate location estimator, and let the p × p symmetric positive definite matrix C = C(W) be a dispersion estimator such as the sample covariance matrix. The ith squared Mahalanobis distance is

D_i² = D_i²(T, C) = D²_{w_i}(T, C) = (w_i − T)^T C^{−1} (w_i − T)

for each point w_i. Notice that D_i² is a random variable (scalar valued). Notice that the population squared Mahalanobis distance is D²_w(µ, Σ) = (w − µ)^T Σ^{−1} (w − µ) and that the term Σ^{−1/2}(w − µ) is the p-dimensional analog of the Z-score used to transform a univariate N(µ, σ²) random variable into a N(0, 1) random variable. Hence the sample Mahalanobis distance D_i = sqrt(D_i²) is an analog of the absolute value |Z_i| of

the sample Z-score Z_i = (W_i − W̄)/σ̂. Also notice that the Euclidean distance of w_i from the estimate of center T is D_i(T, I_p) where I_p is the p × p identity matrix.

144) The classical Mahalanobis distance corresponds to the sample mean and sample covariance matrix

T = w̄ = (1/n) Σ_{i=1}^n w_i and C = S = (1/(n − 1)) Σ_{i=1}^n (w_i − w̄)(w_i − w̄)^T,

and will be denoted by MD_i. When T and C are estimators other than the sample mean and covariance matrix, D_i = sqrt(D_i²) will sometimes be denoted by RD_i. Then the DD plot is a plot of the classical Mahalanobis distances MD_i versus the robust Mahalanobis distances RD_i. The identity line is added as a visual aid. If n is large and the w_i are iid N_p(µ, Σ), then the data will cluster tightly about the identity line. Tight clustering about a line through the origin that is not the identity line suggests that the w_i are iid from an elliptically contoured distribution that is not multivariate normal.

145) A large sample (1 − δ)100% prediction region is a set A_n such that P(y_f ∈ A_n) → 1 − δ as n → ∞, and is asymptotically optimal if the volume of the region converges in probability to the volume of the population minimum volume covering region.

146) For multivariate linear regression, the classical large sample 100(1 − δ)% prediction region for a future value y_f given x_f and past data (x_1, y_1), ..., (x_n, y_n) is {y : D²_y(ŷ_f, Σ̂_ɛ) ≤ χ²_{m, 1−δ}}, and it does not work well unless the ɛ_i are iid N_m(0, Σ_ɛ).

147) Let q_n = min(1 − δ + 0.05, 1 − δ + m/n) for δ > 0.1, and q_n = min(1 − δ/2, 1 − δ + 10δm/n) otherwise. If q_n < 1 − δ + 0.001, set q_n = 1 − δ. Let ẑ_i = ŷ_f + ɛ̂_i for i = 1, ..., n. The large sample nonparametric prediction region that works for a large class of error vector distributions is {y : D²_y(ŷ_f, S_r) ≤ D²_(U_n)} where D_(U_n) is the q_n-th sample quantile of the D_i = D_{ẑ_i}(ŷ_f, S_r) = D_{ɛ̂_i}(0, S_r). In the DD plot, the cases to the left of the vertical line MD = D_(U_n) correspond to y_i that are in their nonparametric prediction region when x_f = x_i. Hence 100 q_n% of the training data are in their prediction region, and 100 q_n% → 100(1 − δ)% as n → ∞.

148) Consider testing a linear hypothesis H_0: LB = 0 versus H_1: LB ≠ 0 where L is a full rank r × p matrix. Let H = B̂^T L^T [L(X^T X)^{−1} L^T]^{−1} L B̂. The error or residual sum of squares and cross products matrix is

W_e = (Z − Ẑ)^T (Z − Ẑ) = Z^T Z − Z^T X B̂ = Z^T [I_n − X(X^T X)^{−1} X^T]Z.

Note that W_e = Ê^T Ê and W_e/(n − p) = Σ̂_ɛ.

149) Let λ_1 ≥ λ_2 ≥ ... ≥ λ_m be the ordered eigenvalues of W_e^{−1} H. Then there are four commonly used test statistics. Roy's maximum root statistic is λ_max(L) = λ_1. The Wilks' Λ statistic is Λ(L) = |(H + W_e)^{−1} W_e| = |W_e^{−1} H + I|^{−1} = Π_{i=1}^m (1 + λ_i)^{−1}. Pillai's trace statistic is V(L) = tr[(H + W_e)^{−1} H] = Σ_{i=1}^m λ_i/(1 + λ_i). The Hotelling-Lawley trace statistic is U(L) = tr[W_e^{−1} H] = Σ_{i=1}^m λ_i.
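Items 143)-144) are easy to picture in R. The sketch below (illustrative only) uses the classical estimator for MD_i and MASS::cov.rob with the MCD option as one possible robust choice of (T, C) for RD_i; any robust location and dispersion estimator could be substituted.

# Sketch: classical vs. robust Mahalanobis distances and the DD plot (item 144).
library(MASS)
set.seed(5)
w <- matrix(rnorm(200 * 3), 200, 3)                  # iid N_3(0, I) data
md <- sqrt(mahalanobis(w, colMeans(w), cov(w)))      # classical MD_i
rob <- cov.rob(w, method = "mcd")                    # one robust (T, C) choice
rd <- sqrt(mahalanobis(w, rob$center, rob$cov))      # robust RD_i
plot(md, rd, xlab = "MD", ylab = "RD", main = "DD Plot")
abline(0, 1)                                         # identity line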

150) Let matrix A = [a_1 a_2 ... a_p]. Then the vec operator stacks the columns of A on top of one another, so

vec(A) = (a_1^T, a_2^T, ..., a_p^T)^T.

Let A = (a_ij) be an m × n matrix and B a p × q matrix. Then the Kronecker product of A and B is the mp × nq matrix

A ⊗ B =
[ a_11 B  a_12 B  ...  a_1n B ]
[ a_21 B  a_22 B  ...  a_2n B ]
[ ...                         ]
[ a_m1 B  a_m2 B  ...  a_mn B ].

An important fact is that if A and B are nonsingular square matrices, then [A ⊗ B]^{−1} = A^{−1} ⊗ B^{−1}. The following assumption is important.

151) Theorem: The Hotelling-Lawley trace statistic

U(L) = [1/(n − p)] [vec(L B̂)]^T [Σ̂_ɛ^{−1} ⊗ (L(X^T X)^{−1} L^T)^{−1}] [vec(L B̂)].

152) Assumption D1: Let h_i be the ith diagonal element of X(X^T X)^{−1} X^T. Assume max(h_1, ..., h_n) → 0 as n → ∞, assume that the zero mean iid error vectors have finite fourth moments, and assume that (1/n) X^T X → W^{−1} in probability.

153) Multivariate Least Squares Central Limit Theorem (MLS CLT): For the least squares estimator, if assumption D1 holds, then Σ̂_ɛ is a √n consistent estimator of Σ_ɛ, and √n vec(B̂ − B) →_D N_pm(0, Σ_ɛ ⊗ W).

154) Theorem: If assumption D1 holds and if H_0 is true, then (n − p)U(L) →_D χ²_{rm}.
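The two expressions for U(L) in 149) and 151) can be checked numerically. The R sketch below (simulated data and dimensions are made up) computes U(L) both from tr(W_e^{-1} H) and from the vec/Kronecker form, for L = [0 I_{p-1}].

# Sketch: Hotelling-Lawley trace statistic U(L) two ways (items 149 and 151).
set.seed(6)
n <- 60; p <- 3; m <- 2
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
Z <- X %*% matrix(c(1, 2, 0, 0, 1, 3), p, m) + matrix(rnorm(n * m), n, m)
Bhat <- solve(crossprod(X), crossprod(X, Z))
We <- crossprod(Z - X %*% Bhat)                          # error SSCP matrix
SigmaHat <- We / (n - p)                                 # unbiased Sigma_eps estimator
L <- cbind(0, diag(p - 1)); r <- nrow(L)
M <- L %*% solve(crossprod(X)) %*% t(L)                  # L (X^T X)^{-1} L^T
H <- t(L %*% Bhat) %*% solve(M, L %*% Bhat)              # hypothesis SSCP matrix
U1 <- sum(diag(solve(We, H)))                            # tr(We^{-1} H), item 149
v <- as.vector(L %*% Bhat)                               # vec(L Bhat)
K <- solve(SigmaHat) %x% solve(M)                        # Kronecker product, item 150
U2 <- drop(t(v) %*% K %*% v) / (n - p)                   # item 151
c(U1 = U1, U2 = U2)                                      # identical; (n-p)*U approx chi^2_rm under H0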

155) Consider testing a linear hypothesis H_0: LB = 0 versus H_1: LB ≠ 0 where L is a full rank r × p matrix. For now assume the error distribution is multivariate normal N_m(0, Σ_ɛ). Then

vec(B̂ − B) = ((β̂_1 − β_1)^T, (β̂_2 − β_2)^T, ..., (β̂_m − β_m)^T)^T ~ N_pm(0, Σ_ɛ ⊗ (X^T X)^{−1})

where C = Σ_ɛ ⊗ (X^T X)^{−1} is the pm × pm matrix with (j, k)th block σ_jk (X^T X)^{−1}:

C =
[ σ_11 (X^T X)^{-1}  σ_12 (X^T X)^{-1}  ...  σ_1m (X^T X)^{-1} ]
[ σ_21 (X^T X)^{-1}  σ_22 (X^T X)^{-1}  ...  σ_2m (X^T X)^{-1} ]
[ ...                                                          ]
[ σ_m1 (X^T X)^{-1}  σ_m2 (X^T X)^{-1}  ...  σ_mm (X^T X)^{-1} ].

Now let A be an rm × pm block diagonal matrix: A = diag(L, ..., L). Then

A vec(B̂ − B) = vec(L(B̂ − B)) = ((L(β̂_1 − β_1))^T, (L(β̂_2 − β_2))^T, ..., (L(β̂_m − β_m))^T)^T ~ N_rm(0, Σ_ɛ ⊗ L(X^T X)^{−1} L^T)

where D = Σ_ɛ ⊗ L(X^T X)^{−1} L^T = A C A^T is the rm × rm matrix with (j, k)th block σ_jk L(X^T X)^{−1} L^T.

Under H_0, vec(LB) = A vec(B) = 0, and

vec(L B̂) = ((L β̂_1)^T, (L β̂_2)^T, ..., (L β̂_m)^T)^T ~ N_rm(0, Σ_ɛ ⊗ L(X^T X)^{−1} L^T).

Hence under H_0,

[vec(L B̂)]^T [Σ_ɛ^{−1} ⊗ (L(X^T X)^{−1} L^T)^{−1}] [vec(L B̂)] ~ χ²_{rm}, and

T = [vec(L B̂)]^T [Σ̂_ɛ^{−1} ⊗ (L(X^T X)^{−1} L^T)^{−1}] [vec(L B̂)] →_D χ²_{rm}.    (2)

A large sample level α test will reject H_0 if pval < α where

pval = P(T/(rm) < F_{rm, n−mp}).    (3)

Since least squares estimators are asymptotically normal, if the ɛ_i are iid for a large class of distributions,

√n vec(B̂ − B) = √n ((β̂_1 − β_1)^T, (β̂_2 − β_2)^T, ..., (β̂_m − β_m)^T)^T →_D N_pm(0, Σ_ɛ ⊗ W)

where (X^T X)/n → W^{−1}.

Then under H_0,

√n vec(L B̂) = √n ((L β̂_1)^T, (L β̂_2)^T, ..., (L β̂_m)^T)^T →_D N_rm(0, Σ_ɛ ⊗ LWL^T),

and

n [vec(L B̂)]^T [Σ_ɛ^{−1} ⊗ (LWL^T)^{−1}] [vec(L B̂)] →_D χ²_{rm}.

Hence (2) holds, and (3) gives a large sample level δ test if the least squares estimators are asymptotically normal.

156) Theorems 151 and 154 are useful for relating multivariate tests with the partial F test for multiple linear regression that tests whether a reduced model that omits some of the predictors can be used instead of the full model that uses all p predictors. The partial F test statistic is

F_R = { [SSE(R) − SSE(F)] / (df_R − df_F) } / MSE(F)

where the residual sums of squares SSE(F) and SSE(R) and degrees of freedom df_F and df_R are for the full and reduced model, while the mean square error MSE(F) is for the full model. Let the null hypothesis for the partial F test be H_0: Lβ = 0 where L sets the coefficients of the predictors in the full model but not in the reduced model to 0. Seber and Lee (2003, p. 100) show that

F_R = [Lβ̂]^T (L(X^T X)^{−1} L^T)^{−1} [Lβ̂] / (r σ̂²)

is distributed as F_{r, n−p} if H_0 is true and the errors are iid N(0, σ²). Note that for multiple linear regression with m = 1, F_R = (n − p)U(L)/r since Σ̂_ɛ^{−1} = 1/σ̂². Hence the scaled Hotelling Lawley test statistic is the partial F test statistic extended to m > 1 response variables by Theorem 151.
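For m = 1, item 156)'s F_R is the usual partial F test, which R computes with anova() on nested fits. A small sketch on made-up data:

# Sketch: partial F test of item 156 for m = 1, by hand and via anova().
set.seed(7)
n <- 80
x2 <- rnorm(n); x3 <- rnorm(n); x4 <- rnorm(n)
y <- 1 + 2 * x2 + rnorm(n)                      # x3, x4 are unneeded predictors
full <- lm(y ~ x2 + x3 + x4); red <- lm(y ~ x2)
sseF <- sum(resid(full)^2); sseR <- sum(resid(red)^2)
dfF <- full$df.residual; dfR <- red$df.residual
FR <- ((sseR - sseF) / (dfR - dfF)) / (sseF / dfF)
c(FR = FR, pval = pf(FR, dfR - dfF, dfF, lower.tail = FALSE))
anova(red, full)                                # same F statistic and p-value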

157) By Theorem 154, for example, r F_R →_D χ²_r for a large class of nonnormal error distributions. If Z_n ~ F_{k, d_n}, then Z_n →_D χ²_k/k as d_n → ∞. Hence using the F_{r, n−p} approximation gives a large sample test with correct asymptotic level, and the partial F test is robust to nonnormality. Similarly, using an F_{rm, n−pm} approximation for the following test statistics gives large sample tests with correct asymptotic level and similar power for large n. The large sample test will have correct asymptotic level as long as the denominator degrees of freedom d_n → ∞ as n → ∞, and d_n = n − pm reduces to the partial F test if m = 1 and U(L) is used. Then the three test statistics are

[n − p − 0.5(m − r + 3)]/(rm) [−log(Λ(L))],   [(n − p)/(rm)] V(L),   and   [(n − p)/(rm)] U(L).

It can be shown that V(L) ≤ −log(Λ(L)) ≤ U(L). Hence the Hotelling Lawley test will have the most power and Pillai's test will have the least power.

158) Under regularity conditions,

−[n − p − 0.5(m − r + 3)] log(Λ(L)) →_D χ²_{rm},   (n − p)V(L) →_D χ²_{rm},   and   (n − p)U(L) →_D χ²_{rm}.

These statistics are robust against nonnormality. For the Wilks' Lambda test, pval = P( {−[n − p − 0.5(m − r + 3)]/(rm)} log(Λ(L)) < F_{rm, n−rm} ). For the Pillai's trace test, pval = P( [(n − p)/(rm)] V(L) < F_{rm, n−rm} ). For the Hotelling Lawley trace test, pval = P( [(n − p)/(rm)] U(L) < F_{rm, n−rm} ). The above three tests are large sample tests, P(reject H_0 | H_0 is true) → α as n → ∞, under regularity conditions.

159) The 4 step MANOVA F test of hypotheses uses L = [0 I_{p−1}]:
i) State the hypotheses H_0: the nontrivial predictors are not needed in the mreg model, H_1: at least one of the nontrivial predictors is needed.
ii) Find the test statistic F_0 from output.
iii) Find the pval from output.
iv) If pval < α, reject H_0. If pval ≥ α, fail to reject H_0. If H_0 is rejected, conclude that there is a mreg relationship between the response variables Y_1, ..., Y_m and the predictors X_2, ..., X_p. If you fail to reject H_0, conclude that there is not a mreg relationship between Y_1, ..., Y_m and the predictors X_2, ..., X_p. (Get the variable names from the story problem.)

160) The 4 step F_j test of hypotheses uses L_j = [0, ..., 0, 1, 0, ..., 0] where the 1 is in the jth position. Let b_j^T be the jth row of B. The hypotheses are equivalent to H_0: b_j^T = 0, H_1: b_j^T ≠ 0. This test is a test for whether x_j is needed in the model.
i) State the hypotheses H_0: x_j is not needed in the model, H_1: x_j is needed in the model.
ii) Find the test statistic F_j from output.
iii) Find pval from output.
iv) If pval < α, reject H_0. If pval ≥ α, fail to reject H_0. Give a nontechnical sentence restating your conclusion in terms of the story problem. If H_0 is rejected, then conclude that x_j is needed in the mreg model for Y_1, ..., Y_m. If you fail to reject H_0, then conclude that x_j is not needed in the mreg model for Y_1, ..., Y_m given that the other predictors are in the model.

The statistic

F_j = (1/d_j) B̂_j^T Σ̂_ɛ^{−1} B̂_j = (1/d_j) (β̂_{j1}, β̂_{j2}, ..., β̂_{jm}) Σ̂_ɛ^{−1} (β̂_{j1}, β̂_{j2}, ..., β̂_{jm})^T

where B̂_j^T is the jth row of B̂ and d_j = (X^T X)^{−1}_{jj}, the jth diagonal entry of (X^T X)^{−1}.

The statistic F_j could be used for forward selection and backward elimination in variable selection.

161) The 4 step MANOVA partial F test of hypotheses has a full model using all of the variables and a reduced model where r of the variables are deleted. The ith row of L has a 1 in the position corresponding to the ith variable to be deleted. Omitting the jth variable corresponds to the F_j test, while omitting variables X_2, ..., X_p corresponds to the MANOVA F test.
i) State the hypotheses H_0: the reduced model is good, H_1: use the full model.
ii) Find the test statistic F_R from output.
iii) Find the pval from output.
iv) If pval < α, reject H_0 and conclude that the full model should be used. If pval ≥ α, fail to reject H_0 and conclude that the reduced model is good.

162) The 4 step MANOVA F test should reject H_0 if the response and residual plots look good, n is large enough, and at least one response plot does not look like the corresponding residual plot. A response plot for Y_j will look like a residual plot if the identity line appears almost horizontal, hence the range of Ŷ_j is small.

The lregpack function mltreg produces the m response and residual plots, gives B̂, Σ̂_ɛ, the MANOVA partial F test statistic and pval corresponding to the reduced model that leaves out the variables given by indices (so x_2 and x_4 in the output below, with F = 0.77 and pval = 0.614), F_j and the pval for the F_j test for variables 1, 2, ..., p (where p = 4 in the output below, so F_2 = 1.51 with pval = 0.284), and F_0 and pval for the MANOVA F test (in the output below F_0 = 3.15 and pval = 0.06). Right click Stop on the plots m times to advance the plots and to get the cursor back on the command line in R. The command out <- mltreg(x,y,indices=c(2)) would produce a MANOVA partial F test corresponding to the F_2 test, while the command out <- mltreg(x,y,indices=c(2,3,4)) would produce a MANOVA partial F test corresponding to the MANOVA F test for a data set with p = 4 predictor variables. The Hotelling Lawley trace statistic is used in the tests.

out <- mltreg(x,y,indices=c(2,4))
$Bhat
     [,1] [,2] [,3]
[1,]
[2,]
[3,]
[4,]
$Covhat
     [,1] [,2] [,3]
[1,]
[2,]
[3,]
$partial
     partialF Pval

[1,]
$Ftable
     Fj pvals
[1,]
[2,]
[3,]
[4,]
$MANOVA
     MANOVAF pval
[1,]

#Output for Example 12.2
y<-marry[,c(2,3)]; x<-marry[,-c(2,3)]; mltreg(x,y,indices=c(3,4))
$partial
     partialF Pval
[1,]
$Ftable
     Fj pvals
[1,]
[2,]
[3,]
[4,]
$MANOVA
     MANOVAF pval
[1,]         e-16

Example 12.2. The above output is for the Hebbler (1847) data from the 1843 Prussia census. Sometimes if the wife or husband was not at the household, then s/he would not be counted. Y_1 = number of married civilian men in the district, Y_2 = number of women married to civilians in the district, x_2 = population of the district in 1843, x_3 = number of married military men in the district, x_4 = number of women married to military men in the district. The reduced model deletes x_3 and x_4.
a) Do the MANOVA F test.
b) Do the F_2 test.
c) Do the F_4 test.
d) Do an appropriate 4 step test for the reduced model that deletes x_3 and x_4.
e) The output for the reduced model that deletes x_1 and x_2 is shown below. Do an appropriate 4 step test.

$partial
     partialF Pval
[1,]

Solution:
a) i) H_0: the nontrivial predictors are not needed in the mreg model, H_1: at least one of the nontrivial predictors is needed. ii) F_0 = . iii) pval = 0. iv) Reject H_0, the nontrivial predictors are needed in the mreg model.
b) i) H_0: x_2 is not needed in the model, H_1: x_2 is needed. ii) F_2 = . iii) pval = 0. iv) Reject H_0, population of the district is needed in the model.
c) i) H_0: x_4 is not needed in the model, H_1: x_4 is needed. ii) F_4 = 0.065. iii) pval = 0.937. iv) Fail to reject H_0, number of women married to military men is not needed in the model given that the other predictors are in the model.
d) i) H_0: the reduced model is good, H_1: use the full model. ii) F_R = 0.200. iii) pval = 0.935. iv) Fail to reject H_0, so the reduced model is good.
e) i) H_0: the reduced model is good, H_1: use the full model. ii) F_R = 5.696. iii) pval = 0.00. iv) Reject H_0, so use the full model.

163) Recall the population OLS coefficients and the second way to compute β̂ from 74) and 75). Similar results will hold for multivariate linear regression. Let y = (Y_1, ..., Y_m)^T, let w = (x_2, ..., x_p)^T, and let β̂_j = (α̂_j, η̂_j^T)^T where α̂_j = Ȳ_j − η̂_j^T w̄ and η̂_j = Σ̂_w^{−1} Σ̂_{w,Y_j}. Let Σ̂_{wy} = [1/(n − 1)] Σ_{i=1}^n (w_i − w̄)(y_i − ȳ)^T, which has jth column Σ̂_{w,Y_j} for j = 1, ..., m. Let u = (y^T, w^T)^T,

E(u) = µ_u = (E(y)^T, E(w)^T)^T = (µ_y^T, µ_w^T)^T, and Cov(u) = Σ_u = [ Σ_yy  Σ_yw ; Σ_wy  Σ_ww ].

Let the vector of constants be α^T = (α_1, ..., α_m) and the matrix of slope vectors B_S = [η_1 η_2 ... η_m]. Then the population least squares coefficient matrix is

B = [ α^T ; B_S ]

(the first row of B is α^T and the lower (p − 1) × m block is B_S), where α = µ_y − B_S^T µ_w and B_S = Σ_w^{−1} Σ_wy with Σ_w = Σ_ww.

If the u_i are iid with nonsingular covariance matrix Cov(u), the least squares estimator is

B̂ = [ α̂^T ; B̂_S ]

where α̂ = ȳ − B̂_S^T w̄ and B̂_S = Σ̂_w^{−1} Σ̂_wy. The least squares multivariate linear regression estimator can be calculated by computing the classical estimator (ū, S_u) = (ū, Σ̂_u) of multivariate location and dispersion on the u_i, and then plugging the results into the formulas for α̂ and B̂_S.
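Item 163)'s plug-in route to B̂ is easy to verify in R. The sketch below (simulated data and made-up coefficients) builds B̂_S and α̂ from the classical covariance and means of u = (y^T, w^T)^T and checks them against (X^T X)^{-1} X^T Z.

# Sketch: the "second way" to compute Bhat (item 163) vs. the matrix formula.
set.seed(8)
n <- 100
w <- matrix(rnorm(n * 2), n, 2)                          # nontrivial predictors x_2, x_3
y <- cbind(1 + w %*% c(2, -1), 3 + w %*% c(0, 1)) + matrix(rnorm(n * 2), n, 2)
Sw  <- cov(w); Swy <- cov(w, y)                          # Sigmahat_w and Sigmahat_wy
BS  <- solve(Sw, Swy)                                    # slope matrix Bhat_S
alpha <- colMeans(y) - t(BS) %*% colMeans(w)             # intercept vector alphahat
rbind(t(alpha), BS)                                      # plug-in Bhat
Z <- y; X <- cbind(1, w)
solve(crossprod(X), crossprod(X, Z))                     # least squares Bhat: identical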


More information

Diagnostics and Remedial Measures: An Overview

Diagnostics and Remedial Measures: An Overview Diagnostics and Remedial Measures: An Overview Residuals Model diagnostics Graphical techniques Hypothesis testing Remedial measures Transformation Later: more about all this for multiple regression W.

More information

Research Methods II MICHAEL BERNSTEIN CS 376

Research Methods II MICHAEL BERNSTEIN CS 376 Research Methods II MICHAEL BERNSTEIN CS 376 Goal Understand and use statistical techniques common to HCI research 2 Last time How to plan an evaluation What is a statistical test? Chi-square t-test Paired

More information

Group comparison test for independent samples

Group comparison test for independent samples Group comparison test for independent samples The purpose of the Analysis of Variance (ANOVA) is to test for significant differences between means. Supposing that: samples come from normal populations

More information

Multivariate Statistical Analysis

Multivariate Statistical Analysis Multivariate Statistical Analysis Fall 2011 C. L. Williams, Ph.D. Lecture 9 for Applied Multivariate Analysis Outline Addressing ourliers 1 Addressing ourliers 2 Outliers in Multivariate samples (1) For

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression ST 430/514 Recall: A regression model describes how a dependent variable (or response) Y is affected, on average, by one or more independent variables (or factors, or covariates)

More information

Regression diagnostics

Regression diagnostics Regression diagnostics Kerby Shedden Department of Statistics, University of Michigan November 5, 018 1 / 6 Motivation When working with a linear model with design matrix X, the conventional linear model

More information

STATISTICS 174: APPLIED STATISTICS FINAL EXAM DECEMBER 10, 2002

STATISTICS 174: APPLIED STATISTICS FINAL EXAM DECEMBER 10, 2002 Time allowed: 3 HOURS. STATISTICS 174: APPLIED STATISTICS FINAL EXAM DECEMBER 10, 2002 This is an open book exam: all course notes and the text are allowed, and you are expected to use your own calculator.

More information

Lecture 5: Hypothesis tests for more than one sample

Lecture 5: Hypothesis tests for more than one sample 1/23 Lecture 5: Hypothesis tests for more than one sample Måns Thulin Department of Mathematics, Uppsala University thulin@math.uu.se Multivariate Methods 8/4 2011 2/23 Outline Paired comparisons Repeated

More information

STAT420 Midterm Exam. University of Illinois Urbana-Champaign October 19 (Friday), :00 4:15p. SOLUTIONS (Yellow)

STAT420 Midterm Exam. University of Illinois Urbana-Champaign October 19 (Friday), :00 4:15p. SOLUTIONS (Yellow) STAT40 Midterm Exam University of Illinois Urbana-Champaign October 19 (Friday), 018 3:00 4:15p SOLUTIONS (Yellow) Question 1 (15 points) (10 points) 3 (50 points) extra ( points) Total (77 points) Points

More information

INFERENCE AFTER VARIABLE SELECTION. Lasanthi C. R. Pelawa Watagoda. M.S., Southern Illinois University Carbondale, 2013

INFERENCE AFTER VARIABLE SELECTION. Lasanthi C. R. Pelawa Watagoda. M.S., Southern Illinois University Carbondale, 2013 INFERENCE AFTER VARIABLE SELECTION by Lasanthi C. R. Pelawa Watagoda M.S., Southern Illinois University Carbondale, 2013 A Dissertation Submitted in Partial Fulfillment of the Requirements for the Doctor

More information

Estimating σ 2. We can do simple prediction of Y and estimation of the mean of Y at any value of X.

Estimating σ 2. We can do simple prediction of Y and estimation of the mean of Y at any value of X. Estimating σ 2 We can do simple prediction of Y and estimation of the mean of Y at any value of X. To perform inferences about our regression line, we must estimate σ 2, the variance of the error term.

More information

Inference for Regression

Inference for Regression Inference for Regression Section 9.4 Cathy Poliak, Ph.D. cathy@math.uh.edu Office in Fleming 11c Department of Mathematics University of Houston Lecture 13b - 3339 Cathy Poliak, Ph.D. cathy@math.uh.edu

More information

18.S096 Problem Set 3 Fall 2013 Regression Analysis Due Date: 10/8/2013

18.S096 Problem Set 3 Fall 2013 Regression Analysis Due Date: 10/8/2013 18.S096 Problem Set 3 Fall 013 Regression Analysis Due Date: 10/8/013 he Projection( Hat ) Matrix and Case Influence/Leverage Recall the setup for a linear regression model y = Xβ + ɛ where y and ɛ are

More information

Probability and Statistics Notes

Probability and Statistics Notes Probability and Statistics Notes Chapter Seven Jesse Crawford Department of Mathematics Tarleton State University Spring 2011 (Tarleton State University) Chapter Seven Notes Spring 2011 1 / 42 Outline

More information

Matrices and vectors A matrix is a rectangular array of numbers. Here s an example: A =

Matrices and vectors A matrix is a rectangular array of numbers. Here s an example: A = Matrices and vectors A matrix is a rectangular array of numbers Here s an example: 23 14 17 A = 225 0 2 This matrix has dimensions 2 3 The number of rows is first, then the number of columns We can write

More information

PLOTS FOR THE DESIGN AND ANALYSIS OF EXPERIMENTS. Jenna Christina Haenggi

PLOTS FOR THE DESIGN AND ANALYSIS OF EXPERIMENTS. Jenna Christina Haenggi PLOTS FOR THE DESIGN AND ANALYSIS OF EXPERIMENTS by Jenna Christina Haenggi Bachelor of Science in Mathematics, Southern Illinois University, 2007 A Research Paper Submitted in Partial Fulfillment of the

More information

1 The classical linear regression model

1 The classical linear regression model THE UNIVERSITY OF CHICAGO Booth School of Business Business 41912, Spring Quarter 2012, Mr Ruey S Tsay Lecture 4: Multivariate Linear Regression Linear regression analysis is one of the most widely used

More information

22s:152 Applied Linear Regression. There are a couple commonly used models for a one-way ANOVA with m groups. Chapter 8: ANOVA

22s:152 Applied Linear Regression. There are a couple commonly used models for a one-way ANOVA with m groups. Chapter 8: ANOVA 22s:152 Applied Linear Regression Chapter 8: ANOVA NOTE: We will meet in the lab on Monday October 10. One-way ANOVA Focuses on testing for differences among group means. Take random samples from each

More information

The Statistical Property of Ordinary Least Squares

The Statistical Property of Ordinary Least Squares The Statistical Property of Ordinary Least Squares The linear equation, on which we apply the OLS is y t = X t β + u t Then, as we have derived, the OLS estimator is ˆβ = [ X T X] 1 X T y Then, substituting

More information

Multivariate Statistical Analysis

Multivariate Statistical Analysis Multivariate Statistical Analysis Fall 2011 C. L. Williams, Ph.D. Lecture 17 for Applied Multivariate Analysis Outline Multivariate Analysis of Variance 1 Multivariate Analysis of Variance The hypotheses:

More information

Applied Multivariate and Longitudinal Data Analysis

Applied Multivariate and Longitudinal Data Analysis Applied Multivariate and Longitudinal Data Analysis Chapter 2: Inference about the mean vector(s) II Ana-Maria Staicu SAS Hall 5220; 919-515-0644; astaicu@ncsu.edu 1 1 Compare Means from More Than Two

More information

Heteroskedasticity. Part VII. Heteroskedasticity

Heteroskedasticity. Part VII. Heteroskedasticity Part VII Heteroskedasticity As of Oct 15, 2015 1 Heteroskedasticity Consequences Heteroskedasticity-robust inference Testing for Heteroskedasticity Weighted Least Squares (WLS) Feasible generalized Least

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Mixed Effects Estimation, Residuals Diagnostics Week 11, Lecture 1

MA 575 Linear Models: Cedric E. Ginestet, Boston University Mixed Effects Estimation, Residuals Diagnostics Week 11, Lecture 1 MA 575 Linear Models: Cedric E Ginestet, Boston University Mixed Effects Estimation, Residuals Diagnostics Week 11, Lecture 1 1 Within-group Correlation Let us recall the simple two-level hierarchical

More information

MULTIVARIATE ANALYSIS OF VARIANCE

MULTIVARIATE ANALYSIS OF VARIANCE MULTIVARIATE ANALYSIS OF VARIANCE RAJENDER PARSAD AND L.M. BHAR Indian Agricultural Statistics Research Institute Library Avenue, New Delhi - 0 0 lmb@iasri.res.in. Introduction In many agricultural experiments,

More information

Homoskedasticity. Var (u X) = σ 2. (23)

Homoskedasticity. Var (u X) = σ 2. (23) Homoskedasticity How big is the difference between the OLS estimator and the true parameter? To answer this question, we make an additional assumption called homoskedasticity: Var (u X) = σ 2. (23) This

More information

Multivariate Linear Models

Multivariate Linear Models Multivariate Linear Models Stanley Sawyer Washington University November 7, 2001 1. Introduction. Suppose that we have n observations, each of which has d components. For example, we may have d measurements

More information

Linear Regression. September 27, Chapter 3. Chapter 3 September 27, / 77

Linear Regression. September 27, Chapter 3. Chapter 3 September 27, / 77 Linear Regression Chapter 3 September 27, 2016 Chapter 3 September 27, 2016 1 / 77 1 3.1. Simple linear regression 2 3.2 Multiple linear regression 3 3.3. The least squares estimation 4 3.4. The statistical

More information

Regression, Ridge Regression, Lasso

Regression, Ridge Regression, Lasso Regression, Ridge Regression, Lasso Fabio G. Cozman - fgcozman@usp.br October 2, 2018 A general definition Regression studies the relationship between a response variable Y and covariates X 1,..., X n.

More information

Chapter 12: Multiple Linear Regression

Chapter 12: Multiple Linear Regression Chapter 12: Multiple Linear Regression Seungchul Baek Department of Statistics, University of South Carolina STAT 509: Statistics for Engineers 1 / 55 Introduction A regression model can be expressed as

More information

Regression and Statistical Inference

Regression and Statistical Inference Regression and Statistical Inference Walid Mnif wmnif@uwo.ca Department of Applied Mathematics The University of Western Ontario, London, Canada 1 Elements of Probability 2 Elements of Probability CDF&PDF

More information

Introductory Econometrics

Introductory Econometrics Based on the textbook by Wooldridge: : A Modern Approach Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna November 23, 2013 Outline Introduction

More information

The Standard Linear Model: Hypothesis Testing

The Standard Linear Model: Hypothesis Testing Department of Mathematics Ma 3/103 KC Border Introduction to Probability and Statistics Winter 2017 Lecture 25: The Standard Linear Model: Hypothesis Testing Relevant textbook passages: Larsen Marx [4]:

More information

Multiple comparisons - subsequent inferences for two-way ANOVA

Multiple comparisons - subsequent inferences for two-way ANOVA 1 Multiple comparisons - subsequent inferences for two-way ANOVA the kinds of inferences to be made after the F tests of a two-way ANOVA depend on the results if none of the F tests lead to rejection of

More information

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN Comparisons of Several Multivariate Populations Edps/Soc 584 and Psych 594 Applied Multivariate Statistics Carolyn J Anderson Department of Educational Psychology I L L I N O I S UNIVERSITY OF ILLINOIS

More information

Example 1 describes the results from analyzing these data for three groups and two variables contained in test file manova1.tf3.

Example 1 describes the results from analyzing these data for three groups and two variables contained in test file manova1.tf3. Simfit Tutorials and worked examples for simulation, curve fitting, statistical analysis, and plotting. http://www.simfit.org.uk MANOVA examples From the main SimFIT menu choose [Statistcs], [Multivariate],

More information

Multivariate Regression

Multivariate Regression Multivariate Regression The so-called supervised learning problem is the following: we want to approximate the random variable Y with an appropriate function of the random variables X 1,..., X p with the

More information

LECTURE 5 HYPOTHESIS TESTING

LECTURE 5 HYPOTHESIS TESTING October 25, 2016 LECTURE 5 HYPOTHESIS TESTING Basic concepts In this lecture we continue to discuss the normal classical linear regression defined by Assumptions A1-A5. Let θ Θ R d be a parameter of interest.

More information

Prediction Intervals For Lasso and Relaxed Lasso Using D Variables

Prediction Intervals For Lasso and Relaxed Lasso Using D Variables Southern Illinois University Carbondale OpenSIUC Research Papers Graduate School 2017 Prediction Intervals For Lasso and Relaxed Lasso Using D Variables Craig J. Bartelsmeyer Southern Illinois University

More information

Data Mining Stat 588

Data Mining Stat 588 Data Mining Stat 588 Lecture 02: Linear Methods for Regression Department of Statistics & Biostatistics Rutgers University September 13 2011 Regression Problem Quantitative generic output variable Y. Generic

More information