22s:152 Applied Linear Regression
Generalized Least Squares

Returning to a continuous response variable Y...

Ordinary Least Squares Estimation

- The classical models we have fit so far with a continuous response Y have been fit using ordinary least squares.

- The model
      Y_i = β_0 + β_1 x_1i + ... + β_k x_ki + ε_i,  with ε_i iid N(0, σ²),
  is fit by minimizing the RSS,
      Σ_{i=1}^n (Y_i - Ŷ_i)² = Σ_{i=1}^n (Y_i - (β̂_0 + β̂_1 x_1i + ... + β̂_k x_ki))².

- In matrix notation, we can write this model as
      Y = Xβ + ε  with ε ~ N_n(0, Σ),
  where Xβ is the mean structure, ε is the error structure, and
      Σ = σ² I_{n×n}
  (σ² down the entire diagonal, 0 everywhere off the diagonal).

- The variance of the vector Y is denoted Σ:
      V(Y) = V(Xβ + ε) = V(ε) = Σ.
  This Σ shows the independence of the observations (the off-diagonals are 0) and the constant variance (σ² down the entire diagonal).

- In matrix notation, the Ordinary Least Squares (OLS) estimate of the regression coefficients β is
      β̂ = (X'X)⁻¹ X'Y,
  where (X'X)⁻¹ is the inverse of X'X when X is of full rank. The estimate of σ² is
      σ̂² = RSS / (n - k - 1).

- But what if the observations are NOT independent, or there is NOT constant variance (the assumptions for OLS)? That is, what if V(Y) ≠ σ² I_{n×n}?

- Then the appropriate estimation method for the regression coefficients may be Generalized Least Squares estimation.
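As a quick illustration of the matrix formulas above, the following R sketch computes the OLS estimates by hand and checks them against lm(). The simulated data (x1, x2, y) are hypothetical and serve only to show the algebra.

## Minimal sketch (hypothetical simulated data): OLS via the matrix formula,
## checked against lm().
set.seed(1)
n.obs <- 30
x1 <- runif(n.obs)
x2 <- runif(n.obs)
y  <- 1 + 2*x1 - 3*x2 + rnorm(n.obs, sd = 0.5)

X        <- cbind(1, x1, x2)                   # design matrix with intercept column
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% y   # (X'X)^{-1} X'Y
rss      <- sum((y - X %*% beta.hat)^2)
sig2.hat <- rss / (n.obs - 2 - 1)              # RSS / (n - k - 1), here k = 2

cbind(matrix.formula = as.vector(beta.hat),
      lm.fit         = coef(lm(y ~ x1 + x2)))  # the two columns should agree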

CASE 1: Non-constant variance, but independence holds

- In this situation we have a similar model,
      Y = Xβ + ε  with ε ~ N_n(0, Σ),
  except that Σ = diag(σ_1², σ_2², ..., σ_n²), i.e. ε_i ~ N(0, σ_i²).
  [Different observations i have different variances σ_i².]

- Suppose we can write the variance of Y_i as a multiple of a common variance σ²,
      V(Y_i) = V(ε_i) = σ_i² = (1/w_i) σ²,
  and we say observation i has weight w_i.

- The weights are inversely proportional to the variances of the errors (w_i = σ²/σ_i²). An observation with a smaller variance has a larger weight.

- Then V(Y) = Σ with
      Σ = σ² diag(1/w_1, 1/w_2, ..., 1/w_n) = σ² W⁻¹,
  where W is an n×n diagonal matrix of weights.

- This special case where Σ is diagonal is very useful, and is known as weighted least squares.

Weighted Least Squares

- A special case of Generalized Least Squares.
- Useful when errors have different variances but are all uncorrelated (independent).
- Assumes that we have some way of knowing the relative variances (or weights) associated with each observation.
- Associates a weight w_i with each observation.
- Chooses β̂ = (β̂_0, β̂_1, ..., β̂_k) to minimize
      Σ_{i=1}^n w_i [Y_i - (β̂_0 + β̂_1 x_1i + ... + β̂_k x_ki)]².
- In matrix form, the Generalized Least Squares estimates are
      β̂ = (X'WX)⁻¹ X'WY   and   σ̂_w² = Σ_{i=1}^n w_i (Y_i - Ŷ_i)² / (n - k - 1).
- Notice the similarity to the OLS form, but now with the W.

Example situations:

1. If the data have been summarized and the ith response is the average of n_i observations each with constant variance σ², then Var(Y_i) = σ²/n_i and w_i = n_i.

2. If the variance is proportional to some predictor x_i, then Var(Y_i) = x_i σ² and w_i = 1/x_i.
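As a sketch of the matrix form above, the R code below computes (X'WX)⁻¹X'WY by hand and checks it against lm() with the weights argument. The data and weights are hypothetical and simulated so that Var(ε_i) = σ²/w_i.

## Minimal sketch (hypothetical simulated data): weighted least squares via the
## matrix formula, checked against lm(..., weights = ).
set.seed(2)
n.obs <- 40
x <- runif(n.obs, 0, 10)
w <- sample(1:5, n.obs, replace = TRUE)            # known relative weights
y <- 2 + 0.5*x + rnorm(n.obs, sd = 1/sqrt(w))      # Var(e_i) = sigma^2 / w_i, sigma = 1

X      <- cbind(1, x)
W      <- diag(w)
beta.w <- solve(t(X) %*% W %*% X) %*% t(X) %*% W %*% y   # (X'WX)^{-1} X'WY
sig2.w <- sum(w * (y - X %*% beta.w)^2) / (n.obs - 1 - 1) # weighted RSS / (n - k - 1)

cbind(matrix.formula = as.vector(beta.w),
      lm.fit         = coef(lm(y ~ x, weights = w)))      # the two columns should agree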

Example: Apple Shoots data

- Using trees planted in 1933 and 1934, Bland took samples of shoots from apple trees every few days throughout the 1971 growing season (about 106 days).
- He counted the number of stem units per shoot. This measurement was thought to help understand the growth of the trees (fruiting and branching).
- We do not know the number of stem units for every shoot, but we know the average number of stem units per shoot for all samples taken on a given day.
- We are interested in modeling the relationship between day of collection (observed) and number of stem units on a sample (not directly observed).

VARIABLES
  day    days from dormancy (day of collection)
  n      number of shoots collected
  y      number of stem units per shoot
  ybar   average number of stem units per shoot for shoots collected on that day (i.e. y/n)

Notice we do not have y, and the number of shoots sampled varies from day to day.

> applelong
   day  n  ybar
1    0  5 10.20
2    3  5 10.40
3    7  5 10.60
4   13  6 12.50
5   18  5 12.00
6   24  4 15.00
7   25  6 15.17
8   32  5 17.00
9   38  7 18.71
10  42  9 19.22
11  44 10 20.00
12  49 19 20.32
13  52 14 22.07
14  55 11 22.64
15  58  9 22.78
16  61 14 23.93
17  69 10 25.50
18  73 12 25.08
19  76  9 26.67
20  88  7 28.00
21 100 10 31.67
22 106  7 32.14

Plot the relationship between ybar and day:

[Figure: scatterplot of ybar (about 10 to 32) versus day (0 to 106), showing a roughly linear increasing trend.]

- If these were individual y observations, we could fit our usual linear model.
- But some of the observations provide more information on the conditional mean (n_i larger), and others provide less information on the conditional mean (n_i smaller).
- If we assume a constant variance σ² for the simple linear regression model of y regressed on day, then the ybar observations have a non-constant variance related to n_i, with Var(ybar_i) = σ²/n_i.
- We can fit this model using Weighted Least Squares estimation with w_i = n_i.
- We'll use our usual lm() function, but include the weights option.

> lmout=lm(ybar ~ day, weights=n)
> summary(lmout)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.973754   0.314272   31.74   <2e-16 ***
day         0.217330   0.005339   40.71   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.929 on 20 degrees of freedom
Multiple R-Squared: 0.9881,  Adjusted R-squared: 0.9875
F-statistic: 1657 on 1 and 20 DF,  p-value: < 2.2e-16

These estimates β̂_0 and β̂_1 coincide with the simple linear regression model, but we've accounted for the non-constant variance in our observations. The common σ̂ = 1.929.

If we plot the absolute value of the raw residual e_i against the number of observations on the day, n_i, we see that the observations with higher n_i tend to have lower variability:

> plot(n, abs(lmout$residuals), ylab="abs(residual)")

[Figure: abs(residual) versus n; larger n tends to go with smaller absolute residuals.]
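A complementary check (a minimal sketch, assuming the applelong data frame and the weighted fit lmout from above are in the workspace): after weighting, the scaled residuals sqrt(w_i)·e_i should no longer show a trend in spread with n_i, even though the raw residuals plotted above do.

## Sketch: weighted residuals sqrt(w_i) * e_i from the WLS fit should look
## roughly homoscedastic across n, unlike the raw residuals plotted above.
## Assumes applelong (day, n, ybar) and lmout = lm(ybar ~ day, weights = n).
wres <- sqrt(applelong$n) * residuals(lmout)
plot(applelong$n, abs(wres),
     xlab = "n (shoots collected that day)",
     ylab = "abs(weighted residual)")       # little or no trend expected here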

CASE 2: Non-independence due to Time Correlation

- When we model the mean structure with ordinary least squares (OLS), the mean structure explains the general trends in the data with respect to our dependent variable and the independent variables.
- The leftover noise, or errors, are assumed to have no pattern (we have diagnostic plots to check this). For one thing, the errors are assumed to be independent.
- Suppose observations have been collected over time, and observations taken closer in time are more alike than observations taken further apart in time.
- This is a time-correlation situation, and we can see the correlation in the errors by plotting the residuals against time.

Example: Time as independent variable

The following scatterplot shows a positive linear trend in Y with respect to time, for Time = 1, 2, 3, ..., 50. Let's look at the ordinary least squares fit.

[Figure: two panels, "OLS fit" (Y versus time with the fitted line) and "Residuals" (OLS residuals versus time).]

There is a pattern in the residuals suggesting residuals near to each other are similar (positively correlated). If a residual is positive, there's a good chance its neighboring residual is also positive.

A lag plot of the residuals gives us information on this: we plot each residual e_i against the previous residual in time, e_{i-1}.
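A lag plot like this can be made directly from the residuals; the sketch below (assuming lmout is an OLS fit to time-ordered data, as in the simulation that follows) also computes the lag-1 correlation, which should be near 0 when the errors are independent.

## Sketch: lag plot of residuals and the lag-1 correlation, as a quick check for
## serial correlation.  Assumes lmout is an OLS fit to time-ordered data.
e <- residuals(lmout)
m <- length(e)
plot(e[-m], e[-1], xlab = "previous residual e_{i-1}", ylab = "residual e_i")
cor(e[-m], e[-1])   # close to 0 if the errors look independent;
                    # large and positive for AR(1)-type data like this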

Plotting each residual against the previous residual:

[Figure: lag plot of e_i versus e_{i-1} for the simulated data, showing a clear positive association.]

- So there is positive correlation in the lagged residuals.
- The assumption of independence (an assumption of OLS) is violated.
- We can instead move away from OLS and incorporate this correlation into our modeling.

Autocorrelation

- Autoregressive model: model a series in terms of its own past behavior.
- The first-order autoregressive model, AR(1):
      Y_t = β_0 + β_1 x_t + ε_t   for t = 1, ..., T,
  with
      ε_t = ρ ε_{t-1} + u_t   and   u_t ~ N(0, σ²).
- |ρ| < 1 is the autocorrelation parameter; it tells how strongly sequential observations are correlated.
- The t-th and (t-j)-th errors are also correlated, but not as strongly: corr(ε_t, ε_{t-j}) = ρ^j.

A simulation of AR(1) data from n = 50 uniformly spaced time points with a positive linear trend (with β_1 = 2) can bring insight into the AR(1) process. The data for the plots on the previous pages were made from this code:

## Generate x-values:
> n=50
> time=1:n

## Assign parameters:
> sigma=3
> rho=0.95
> beta=2

## Get the start point at t=1 for the time series
## and allocate space for the data vectors:
> y=rep(0,n)
> e=rep(0,n)
> e[1]=rnorm(1,0,sigma)
> y[1]=beta*time[1]+e[1]

## Use the AR(1) process to sequentially generate y-values:
> for (i in 2:n){
+   e[i]=rho*e[i-1]+rnorm(1,0,sigma)
+   y[i]=beta*time[i]+e[i]
+ }

There is also a test for time-correlated errors called the Durbin-Watson test. It specifically looks for AR(1) errors, and uses H_0: ρ = 0 vs H_A: ρ ≠ 0. The test statistic is

      d = Σ_{t=2}^n (e_t - e_{t-1})² / Σ_{t=1}^n e_t².

A small d indicates positive autocorrelation, and d near 2 suggests no positive autocorrelation.

Testing the simulated AR(1) data (lmout here is the OLS fit to the simulated data):

> library(car)
> durbinWatsonTest(lmout)
 lag Autocorrelation D-W Statistic p-value
   1        0.852054     0.2674768       0
 Alternative hypothesis: rho != 0

Reject H_0; there is positive autocorrelation.
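As a sketch of the formula above, d can also be computed directly from the residuals (assuming lmout is the fit to the simulated data); values well below 2 point to positive autocorrelation.

## Sketch: the Durbin-Watson statistic computed by hand from the OLS residuals,
## following the formula above.  Assumes lmout is the fit to the simulated data.
e <- residuals(lmout)
d <- sum(diff(e)^2) / sum(e^2)   # sum_{t=2}^n (e_t - e_{t-1})^2 / sum_t e_t^2
d                                # should match the D-W statistic reported by car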

The mean structure in the AR(1) model is the same as in OLS, but we model the errors differently:

      Y = Xβ + ε  with ε ~ N_n(0, Σ),

where the n×n covariance matrix has entries Σ_ts = κ ρ^|t-s|, i.e.

      Σ = κ · [ 1        ρ        ρ²      ...  ρ^(n-1)
                ρ        1        ρ       ...  ρ^(n-2)
                ρ²       ρ        1       ...  ρ^(n-3)
                ...
                ρ^(n-1)  ρ^(n-2)  ρ^(n-3) ...  1       ],

and Var(ε_t) = κ = σ² / (1 - ρ²).

We again have a Generalized Least Squares estimate for β:

      β̂ = (X'Σ⁻¹X)⁻¹ X'Σ⁻¹Y.

Notice the similarity to the OLS form, but now with the Σ⁻¹.
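A minimal sketch of this estimator, reusing y, time, n, rho and sigma from the simulation code above: build the AR(1) covariance matrix Σ and apply the GLS formula directly. Here ρ and σ are treated as known, whereas in practice they must be estimated (as gls() does in the stock example below).

## Sketch: GLS by hand for the simulated AR(1) data, using the known simulation
## parameters.  Assumes y, time, n, rho and sigma from the simulation code above.
kappa <- sigma^2 / (1 - rho^2)
Sigma <- kappa * rho^abs(outer(time, time, "-"))   # Sigma_ts = kappa * rho^|t-s|

X       <- cbind(1, time)                          # intercept + linear time trend
Sig.inv <- solve(Sigma)
beta.gls <- solve(t(X) %*% Sig.inv %*% X) %*% t(X) %*% Sig.inv %*% y
beta.gls                                           # compare with coef(lm(y ~ time))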

> lmout=lm(dayvalues~day) > summary(lmout) Coefficients: Estimate Std Error t value Pr(> t ) (Intercept) 92653292 0235187 39395 <2e-16 *** day 0027864 0001556 1790 <2e-16 *** --- Signif codes: 0 *** 0001 ** 001 * 005 01 1 Residual standard error: 1894 on 259 degrees of freedom Multiple R-Squared: 05531,Adjusted R-squared: 05514 F-statistic: 3206 on 1 and 259 DF, p-value: < 22e-16 dayvalues 90 92 94 96 98 100 102 A plot of the residuals vs fitted also show the time-based correlation > plot(lmout$fittedvalues,lmout$residuals,pch=16) > abline(h=0) lmout$residuals!6!4!2 0 2 4 94 96 98 100 lmout$fittedvalues Residuals that are positive tend to be near other positive residuals, and vice versa for negative residuals 0 50 100 150 200 250 day 25 26 This is more apparent in a lag plot where we plot a residual vs its neighboring residual: e i vs e i 1 > lagplot(lmout$residuals,dolines=false) We can use the Durbin-Watson test to formally test for time dependence (uses the relationship between e i and e i 1 ) > library(car) > durbinwatson(lmout) lmout$residuals!6!4!2 0 2 4 lag Autocorrelation D-W Statistic p-value 1 09025303 01684356 0 Alternative hypothesis: rho!= 0 The test strongly rejects the null of independence (H 0 : ρ = 0) We will fit a first-order autoregressive model to the data, or an AR(1)!6!4!2 0 2 4 lag 1 There is a positive correlation in the lag residuals (residuals tend to be more like their near neighbors) 27 28

We will fit a first-order autoregressive model, AR(1), to the data.

Fitting the AR(1) model

The gls function [generalized least squares] in the nlme library [non-linear mixed effects] fits regression models with a variety of correlated-error and non-constant error-variance structures.

> library(nlme)
## The ~1 below says the data are in time order:
> glsout=gls(dayvalues~day, correlation=corAR1(form = ~1))
> summary(glsout)
Generalized least squares fit by REML
  Model: dayvalues ~ day
  Data: NULL
       AIC      BIC    logLik
   615.644 629.8713 -303.822

Correlation Structure: AR(1)
 Formula: ~1
 Parameter estimate(s):
       Phi
 0.9412842

Coefficients:
               Value Std.Error  t-value p-value
(Intercept) 92.12831 1.4044915 65.59549  0.0000
day          0.02897 0.0090084  3.21585  0.0015

Residual standard error: 2.26733
Degrees of freedom: 261 total; 259 residual

- Day is a significant linear predictor for stock price.
- ρ̂ = 0.9413, so sequential observations are strongly correlated.

Comments:

1. When you have many covariates, you can plot the residuals from the OLS fitted model against time as a time-correlation diagnostic. If there is time-correlation, this plot will show a pattern rather than a random scatter.

2. Including time as a predictor does not necessarily remove time-correlated errors. As in the soccho example, time was a predictor in the OLS model, which meant there was a general trend over time, but there was still correlation in the errors after time was included.
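Following up on comment 2, a quick side-by-side look at the two fits of the soccho data makes the practical consequence concrete: the point estimates for day are similar, but the GLS standard error is several times larger once the serial correlation is accounted for. This is a sketch; it assumes lmout and glsout from above, and that summary.gls stores its coefficient table in the tTable component, as in current versions of nlme.

## Sketch: compare OLS and AR(1)-GLS coefficient tables for the soccho fits.
## Assumes lmout (OLS) and glsout (gls with corAR1) from above.
summary(lmout)$coefficients   # OLS: SE(day) about 0.0016
summary(glsout)$tTable        # GLS: SE(day) about 0.0090 (much larger)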