6.1 Introduction

Regression model: y = Xβ + ε

Assumptions:
1. The relationship between y and the predictors is linear.
2. The noise term ε has zero mean.
3. All ε's have the same variance σ².
4. The ε's are uncorrelated between observations.
5. The ε's are independent of the predictors.
6. The ε's are normally distributed.

Regression diagnostics are used to detect departures from these assumptions. Checking every assumption is not always required.

6.2 Residual Analysis

Residual plots are the most important diagnostics:

Residuals vs. fitted values or predictors
* for detecting changes in variance
* for detecting nonlinearity
* for detecting outliers
* for detecting dependence on a predictor

Partial plots - for checking whether variables enter the model linearly.
Time plot of residuals - for detecting dependence in time: autocorrelation.
Normal Q-Q plot - for assessing normality.
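In R, several of these plots come directly from a fitted lm object. A minimal sketch, using the roller data from the DAAG package (which reappears later in these notes) purely as a stand-in example:

library(DAAG)                          # assumes the DAAG package is installed
roller.lm <- lm(depression ~ weight, data = roller)
par(mfrow=c(2,2))
plot(roller.lm)        # residuals vs. fitted, normal Q-Q, scale-location,
                       # and residuals vs. leverage
par(mfrow=c(1,1))
acf(resid(roller.lm))  # check for serial dependence in the residuals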

Types of Residuals

Raw residuals: e_i = y_i − ŷ_i
(Scale problem: how big is a large residual?)

Standardized residuals: d_i = e_i / √MSE
(variance of d_i ≈ 1, but depends on x_i)

Types of Residuals (cont'd)

Studentized residuals: r_i = e_i / √(MSE (1 − h_ii)),
where h_ii = ith diagonal element of the hat matrix H.
(Var(ẽ) = Var((I − H)y) = (I − H)σ²)

PRESS residuals: e_(i) = y_i − ŷ_(i) = e_i / (1 − h_ii)
(ŷ_(i): delete the ith observation, fit the model, and predict at x_i1, x_i2, ..., x_ik.)
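A minimal sketch of how these four residual types can be computed in R, using the built-in cars data purely as a stand-in. Note that R's rstandard() matches the studentized residuals r_i defined above, while rstudent() additionally deletes the ith observation when estimating σ.

fit <- lm(dist ~ speed, data = cars)  # any fitted lm object will do
e <- resid(fit)                       # raw residuals e_i
mse <- summary(fit)$sigma^2           # MSE
h <- hatvalues(fit)                   # hat diagonal h_ii
d <- e/sqrt(mse)                      # standardized residuals d_i
r <- e/sqrt(mse*(1 - h))              # studentized residuals r_i
press <- e/(1 - h)                    # PRESS residuals e_(i)
all.equal(r, rstandard(fit))          # TRUE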

Example

Data on a collection of paperback books:

> library(DAAG); softbacks
  volume weight
  ...
> softbacks.lm <- lm(weight ~ volume, data = softbacks)
> summary(softbacks.lm)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    ...       ...       ...     ...
volume         ...       ...       ...     ...

Residual standard error: 102 on 6 degrees of freedom

soft.res <- resid(softbacks.lm)   # ordinary residuals
soft.stres <- soft.res/102        # standardized
soft.stud <- soft.res/(102*sqrt(1 - hat(model.matrix(softbacks.lm))))  # studentized
soft.press <- soft.res/(1 - hat(model.matrix(softbacks.lm)))           # PRESS residuals
par(mfrow=c(2,2))
plot(soft.res ~ volume, data=softbacks, ylim=2*range(soft.res))
# similarly for the other 3 types of residuals

Example (cont'd)

[Figure: the four residual types (soft.res, soft.stres, soft.stud, soft.press), each plotted against volume.]

Observation: there is a mild outlier.

Example - Biochemical Oxygen Demand

A study of the capability of subsurface-flow wetland systems to remove biochemical oxygen demand (BOD) and various other chemical constituents yielded 13 observations on BOD mass loading (x) and BOD mass removal (y). Interest centers on how to predict BOD mass removal.

library(Devore5); data(ex12.04); attach(ex12.04)
par(mfrow=c(2,2))
hist(x); hist(log(x)); hist(y); hist(log(y))

Histograms of each variable can be helpful.

Example (cont'd)

[Figure: histograms of x, log(x), y, and log(y).]

A log transformation of each variable is recommended here.

Example (cont'd)

BOD.lm <- lm(log(y) ~ log(x))
plot(resid(BOD.lm) ~ log(x))   # resid vs. predictor

[Figure: resid(BOD.lm) plotted against log(x).]

Observations

A linear relationship is not appropriate.
There is an extreme outlier.
The model is not satisfactory.

What if we use untransformed variables?

BOD.lm1 <- lm(y ~ x)
plot(resid(BOD.lm1) ~ x)

Example (cont'd)

[Figure: resid(BOD.lm1) plotted against x.]

The error variance is not constant.

PRESS - PRedicted Error Sum of Squares

PRESS = Σ_{i=1}^n e_(i)² = Σ_{i=1}^n ( e_i / (1 − h_ii) )²

This gives an idea of how well a regression model can predict new data. Small values of PRESS are desired.
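The PRESS() function used on the next slide is a course helper, not part of base R. A minimal sketch, using the identity e_(i) = e_i/(1 − h_ii) above:

PRESS <- function(model) {
  # predicted residual sum of squares via the hat-diagonal shortcut
  sum((resid(model)/(1 - hatvalues(model)))^2)
}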

litters Example

# regression of brain weight against body weight and litter size:
> litters.lm <- lm(brainwt ~ bodywt + lsize, data = litters)
> PRESS(litters.lm)
[1] ...

# same regression as above, but without the intercept term:
> litters.0 <- lm(brainwt ~ bodywt + lsize - 1, data = litters)
> PRESS(litters.0)
[1] ...

# regression of brain weight against body weight only, with intercept:
> litters.1 <- lm(brainwt ~ bodywt, data = litters)
> PRESS(litters.1)
[1] ...

# regression of brain weight against both variables plus an interaction term:
> litters.2 <- lm(brainwt ~ bodywt + lsize + lsize:bodywt, data = litters)
> PRESS(litters.2)
[1] ...

# best predictor is the 1st model!

Added Variable Plots or Partial Regression Plots

Example: Suppose observations are taken on a response variable y and three other variables x_1, x_2 and x_3.

Linear model: y = β_0 + β_1 x_1 + β_2 x_2 + β_3 x_3 + ε

One should always check plots of the residuals versus the fitted values, and versus each of the predictors. An additional way to check whether each predictor should enter the regression model linearly is to look at partial regression plots for each variable.

Constructing a Partial Regression Plot for x_1

* regress y against x_2 and x_3 (i.e. all variables but x_1)
* regress x_1 against x_2 and x_3
* obtain residuals from both regressions
* plot the y residuals against the x_1 residuals

If x_1 enters the model linearly, you should see points scattered about a straight line of slope β_1. Otherwise, the plot may indicate what kind of transformation to apply to x_1. (A sketch of a plotting helper is given below.)
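The partial.plot() function used in the following examples is a course helper rather than a base R function. A minimal sketch of what it might look like, where X is a matrix or data frame of predictors, y is the response, and j indexes the predictor of interest:

partial.plot <- function(X, y, j) {
  X <- as.data.frame(X)
  others <- X[, -j, drop = FALSE]
  y.res <- resid(lm(y ~ ., data = others))       # y adjusted for the other predictors
  x.res <- resid(lm(X[, j] ~ ., data = others))  # x_j adjusted for the other predictors
  plot(x.res, y.res, xlab = names(X)[j], ylab = "y.res",
       main = "Partial Regression Plot", pch = 16)
  abline(lm(y.res ~ x.res), col = 2)   # this line has slope beta_j
}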

Examples

Some artificial data is in partial.data: the first three columns are x_1, x_2 and x_3; the last column is y.

partial.plot(partial.data[,-4], partial.data[,4], 1)  # partial for x1;
      # the true model is nonlinear:
      # y = .2 exp(x1) + x2 + x3 + e, with sd(e) = .1
partial.plot(partial.data[,-4], partial.data[,4], 2)  # linear term
partial.plot(partial.data[,-4], partial.data[,4], 3)  # linear term

Examples

[Figure: partial regression plots of y.res against x1, x2 and x3; the x1 panel is curved, while the x2 and x3 panels are linear.]

Examples

# litters example
partial.plot(litters[,-3], litters[,3], 1)  # partial for lsize
partial.plot(litters[,-3], litters[,3], 2)  # partial for bodywt

Examples

[Figure: partial regression plot of y.res against lsize.]

Examples

[Figure: partial regression plot of y.res against bodywt.]

Observation: there is mild nonlinearity.

6.2.3 Checking the Normal Assumption

Real data are not likely to be normally distributed. For practical purposes, two questions are important:

* How much departure from normality can we tolerate?
* How can we decide if it is plausible that data are from a normal distribution?

The first question can be difficult. Large departures from normality should be checked for, particularly skewness. Small departures can be ignored. For most moderate-sized samples, only gross departures will be detectable.

What sorts of checks will detect gross departures?

While histograms have their place, the normal Q-Q plot is more effective. The following code plots 4 histograms of independent random samples of 50 values from a normal distribution.

par(mfrow=c(2,2))
set.seed(2733)
for (i in 1:4) hist(rnorm(50))
par(mfrow=c(1,1))

The Normal Q-Q Plot

One sorts the data values. These are then plotted against the corresponding values that one might expect if the data really were from a normal distribution. If the data really are from a normal distribution, the plot should approximate a straight line.

par(mfrow=c(2,2))
set.seed(2733)   # Use the same samples as before
for (i in 1:4) qqnorm(rnorm(50), main="")
par(mfrow=c(1,1))

Simulated plots can help train the eye on what to expect in samples of various sizes.

Example - Simulated Normal Data

[Figure: four normal Q-Q plots (sample quantiles vs. theoretical quantiles) of the simulated samples.]

Exercise: roller data

Obtain a normal Q-Q plot of the residuals:

roller.lm <- lm(depression ~ weight, data = roller)
plot(roller.lm, which=2, pch=16, col=4)
abline(0, 1, lwd=2, col=2)

Exercise: roller data (cont'd)

[Figure: normal Q-Q plot of the standardized residuals from lm(depression ~ weight, data = roller).]

Setting the sample plot alongside plots for random normal data:

par(mfrow=c(2,2))
roller.lm <- lm(depression ~ weight, data = roller)
plot(roller.lm, which=2, pch=16, col=4)
abline(0, 1, lwd=2, col=2)
for (i in 1:3) {
  qqnorm(rnorm(10), pch=16, col=4)
  abline(0, 1, lwd=2, col=2)
}
par(mfrow=c(1,1))

Q-Q Plot for roller Data

[Figure: the Q-Q plot of the roller residuals alongside three normal Q-Q plots of random normal samples of size 10.]

Formal Statistical Testing for Normality

Shapiro-Wilk test.

A difficulty with such tests is that normality is difficult to rule out in small samples, while in large samples the tests will almost inevitably identify departures from normality that are too small to have any practical consequence for standard forms of statistical analysis. (A sketch of the test applied to residuals follows.)
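A minimal sketch of the Shapiro-Wilk test applied to regression residuals (roller.lm as fitted earlier); keep the caveat above in mind when interpreting the p-value:

shapiro.test(resid(roller.lm))   # small p-value suggests non-normality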

6.2.4 Serial Correlation among the Errors

* Time plots: plots of residuals against time.
* Autocorrelation function (ACF): acf(residuals).
* The Durbin-Watson test examines lag-1 autocorrelation only; it is better to look at the ACF or to use a portmanteau test such as the Box-Ljung test:
  Box.test(residuals, type="Ljung-Box", lag=10)
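For the Durbin-Watson test itself, one possibility is dwtest(); this sketch assumes the lmtest package is installed, since base R has no built-in Durbin-Watson test:

library(lmtest)
dwtest(roller.lm)   # tests for lag-1 autocorrelation in the residuals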

Autocorrelation Checking: Example

> library(DAAG)
> log.hills <- log(hills)
> names(log.hills) <- c("log.dist", "log.climb",
+                       "log.time")
> hills.lm <- lm(log.time ~ log.dist + log.climb,
+                data = log.hills[-18,])
> ts.plot(resid(hills.lm))
> acf(resid(hills.lm))
> Box.test(resid(hills.lm), type="Ljung-Box", lag=10)

        Box-Ljung test

data:  resid(hills.lm)
X-squared = 7.5, df = 10, p-value = ...

Autocorrelation Checking: Example

[Figure: ACF of resid(hills.lm) against lag.]

Autocorrelation Checking: Example 2

Winnipeg daily maximum temperatures:

> source("wpgtemp.r")
> temp.lm <- lm(temperature ~ sin(2*pi*day/365.25) + cos(2*pi*day/365.25),
+               data=wpgtemp)
> acf(resid(temp.lm))
> Box.test(resid(temp.lm), lag=10, type="Ljung-Box")

        Box-Ljung test

data:  resid(temp.lm)
X-squared = ..., df = 10, p-value < 2.2e-16

Autocorrelation Checking: Temperature Example

[Figure: ACF of resid(temp.lm) against lag.]

Ch. 6.3 Detection and Treatment of Outliers

An outlier is an extreme observation. If a residual lies more than about 3 standard deviation units away from 0, then the observation should be regarded as an outlier.

Detection: plot residuals vs. fitted values.

Example - hills.lm

> hills.lm <- lm(log.time ~ log.dist + log.climb, data = log.hills)
> summary(hills.lm)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    ...       ...       ...   ...e-06
log.dist       ...       ...       ...   ...e-08
log.climb      ...       ...       ...    ...

Residual standard error: 0.315 on 32 degrees of freedom
...
> plot(hills.lm, pch=16, which=1, col=2)

Example (cont'd)

[Figure: residuals vs. fitted values for lm(log.time ~ log.dist + log.climb, data = log.hills).]

The standardized 18th residual is

resid(hills.lm)[18]/summary(hills.lm)$sigma = 1.46/.315 = 4.63,

an extreme outlier.

Treatment

Outlying observations should be examined closely.

Example - hills (cont'd):

> hills[18,]
   dist climb time
    ...   ...  ...

Compare this with the rest of the data:

> summary(hills)
      dist            climb           time
 Min.   : 2.00   Min.   : 300   Min.   :  ...
 1st Qu.:  ...   1st Qu.: 725   1st Qu.:0.467
 Median : 6.00   Median :1000   Median :0.662
 Mean   : 7.53   Mean   :1815   Mean   :  ...
 3rd Qu.:  ...   3rd Qu.:2200   3rd Qu.:1.144
 Max.   :28.00   Max.   :7500   Max.   :  ...

Handling Outliers - Example (cont'd)

The 18th race seems to have taken a long time, though it was a short climb and a short distance. For example, compare this race with the first observation:

> hills[1,]
   dist climb time
    ...   ...  ...

This race is shorter but with more climbing; the time is much less than for race 18. Observation 21 is also comparable for distance and climb, but not at all for time:

> hills[21,]
   dist climb time
    ...   ...  ...

This leads us to the conclusion that observation 18 might have been misrecorded. One author believes that the time was really .31 hours instead of 1.3 hours.

Handling Outliers - Example (cont'd)

If the outlier has been improperly recorded, it should be corrected or discarded. In this case, we discard it, since we are not sure of the correct time:

> hills.lm <- lm(time ~ climb + dist, data = hills[-18,])
> summary(hills.lm)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.25e-...    ...       ...   ...e-05
climb        1.98e-...    ...       ...   ...e-11
dist         1.06e-...    ...       ...  < 2e-16

Residual standard error: ... on 31 degrees of freedom
Multiple R-Squared: 0.972, Adjusted R-squared: 0.97
F-statistic: 529 on 2 and 31 DF, p-value: 0

Note the effect on the standard error of removing that outlier.

Handling Outliers

If the outlier has not been improperly recorded, then it could mean that the model is not adequate. In such cases:

* Compare the fitted model with and without the observation.
* If the outlier is making a big difference, one strategy is to report the fitted model obtained without the outlier, but to report the outlier as well.
* In addition, any plots of the data should properly identify the outlying observation.

Exercise: plot the residuals vs. fitted values for the hills data without observation 18. Note that there is now a new outlier. It cannot be explained away as easily as observation 18. Are there other variables that could be included in the model? A cluster of outliers sometimes indicates that an important variable has been omitted, e.g. gender.

Example - Softbacks

Earlier, we plotted the residuals for the weight versus volume model for 8 paperback books. Observation 6 appears to be an outlier. It has not been recorded incorrectly; it is just different from the other observations. The pages are of denser material than in the other books; perhaps density should be measured and included in the model. If we choose to omit this observation, we should do it as follows:

softbacks.lm6 <- lm(weight ~ volume, data = softbacks[-6,])
plot(softbacks, pch=16)
abline(softbacks.lm6)
points(softbacks[6,], col=2, pch=16)
text(1050, 930, "omitted observation")

Example (cont'd)

[Figure: weight vs. volume with the refitted line; the omitted observation 6 is highlighted.]

Leverage and Influence Diagnostics

High leverage point: an observation which lies near an extreme of the space of explanatory variables.

[Figure: "Leverage Measurements: Simple Linear Regression" - y vs. x, with the hat diagonal value printed in blue beside each point.]

Influence

Influential observation: a high leverage point which is also an outlier in the response space.

[Figure: "Influential and Ordinary Outliers" - y vs. x with fitted slopes for the full data set (b=1.20), without obs. 20 (b=1.02), and without obs. 10 (b=1.19); obs. 10 is not influential while obs. 20 is. Blue numbers are Cook's distance values.]

Leverage: Theory

Consider y_i = β_0 + β_1 x_i + ε_i. Then

β̂_1 = Σ (x_i − x̄) y_i / S_xx = Σ c_i y_i,

i.e. β̂_1 is a weighted average of the y's. High leverage points are those for which (x_i − x̄)² is large: the weight

c_i = (x_i − x̄) / S_xx

is large in magnitude when x_i >> x̄ and when x_i << x̄. Therefore, the observations that give greatest weight to the determination of the slope are the smallest and largest x's.

Example

Suppose we have n = 5 observations, with

x_1 = 0, x_2 = .25, x_3 = .5, x_4 = .75, x_5 = 1.

Then x̄ = .5, S_xx = .625, and

β̂_1 = −.8 y_1 − .4 y_2 + 0 y_3 + .4 y_4 + .8 y_5.

Note that the middle observation makes no contribution at all, while the two extremes contribute a lot. The correctness of the slope estimate depends heavily on the quality of the 1st and 5th measurements.
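A minimal sketch verifying these weights numerically, using c_i = (x_i − x̄)/S_xx:

x <- c(0, .25, .5, .75, 1)
Sxx <- sum((x - mean(x))^2)          # 0.625
ci <- (x - mean(x))/Sxx              # -0.8 -0.4  0.0  0.4  0.8
y <- rnorm(5)                        # any responses will do
all.equal(sum(ci*y), unname(coef(lm(y ~ x))[2]))  # TRUE: slope = sum of c_i y_i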

Measuring Leverage - the Hat Diagonal

In multiple regression, it is more difficult to identify extreme observations in the multi-dimensional space of the x variables. The norm we use to decide if points have high leverage is based on A = (XᵀX)⁻¹. This is a valid way of defining the norm, since if the columns of X are linearly independent, then (XᵀX)⁻¹ is symmetric and positive definite. Symmetry of (XᵀX)⁻¹ is obvious.

Reminder: a matrix A is positive definite if it is symmetric and, for any nonzero vector a, we have aᵀA a > 0.

Example

Demonstration for the special case where n = 3 and p = 2.

Step 1: XᵀX is positive definite.
* Suppose a = [a_1 a_2]ᵀ ≠ [0 0]ᵀ, and write X = [x_1 x_2].
* If b = X a = 0, then a_1 x_1 + a_2 x_2 = 0. The columns of X are linearly independent, so a_1 = a_2 = 0, a contradiction.
* Therefore, if a = [a_1 a_2]ᵀ ≠ [0 0]ᵀ, then b ≠ 0, and aᵀXᵀX a = bᵀb > 0.

Example (cont'd)

Step 2: If a symmetric matrix A is positive definite, then A⁻¹ is positive definite.
* Suppose a ≠ 0, and set b = A⁻¹ a.
* Then b ≠ 0 (why?) and
  aᵀA⁻¹ a = bᵀA b > 0.
So A⁻¹ is positive definite.

Example (cont'd)

Step 3: By Step 1, XᵀX is positive definite, so by Step 2, (XᵀX)⁻¹ is positive definite.

Leverage

The leverage of an observation at x_i is defined by its (XᵀX)⁻¹ norm:

x_iᵀ (XᵀX)⁻¹ x_i,

but this is the (i,i) element of the hat matrix

H = X(XᵀX)⁻¹Xᵀ,

so the leverage of the ith observation = h_ii.

Example

Simple regression - response plotted against predictor:

[Figure: "Leverage Measurements: Simple Linear Regression" - y vs. x.]

Example

Multiple regression - predictors plotted against each other:

[Figure: "Leverage of Selected Observations in Litters Data" - bodywt vs. lsize.]

To obtain the leverage values in R:

> attach(litters)
> litters.lm <- lm(brainwt ~ bodywt + lsize)
> lm.influence(litters.lm)$hat
 [1] ...
 [6] ...
[11] ...
[16] ...
> detach(litters)

average leverage = n⁻¹ tr(H) = p/n

high leverage: h_ii > 2p/n = .3 for the litters data

Observation 17 (h_17,17 = .4326) is a high leverage point.

Measuring Influence - Cook's D

General method to see if an observation is influential: delete it and see how the estimates change.

β̂ = LS estimates, full data set
β̂_(i) = LS estimates, ith observation deleted

Look at the difference β̂ − β̂_(i):
* If it is big, the ith observation is influential.
* If it is small, the ith observation is not influential.

How can we tell if this difference is big or small?

56 One answer:cook s distance (based on another norm!) gives us an idea of how large this difference is, according to a standard scale: D i = 1 pmse ( β After some algebra, we have β (i))t (X T X)( β h ii β (i)) D i = r2 i p 1 h ii r i = ith studentized residual; influence is related to leverage and whether a point is an outlier: * high leverage outliers are influential * low leverage outliers are less influential * high leverage points that are not outliers are less influential High influence if D i > 1 56
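A minimal sketch checking the algebraic identity above against R's cooks.distance(), using the litters model (refitted here with a data argument so the snippet is self-contained; p = 3 parameters):

litters.lm <- lm(brainwt ~ bodywt + lsize, data = litters)
h <- hatvalues(litters.lm)
r <- rstandard(litters.lm)       # studentized residuals r_i
p <- length(coef(litters.lm))    # p = 3
all.equal(r^2/p * h/(1 - h), cooks.distance(litters.lm))  # TRUE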

In R: cooks.distance(y.lm) or plot(y.lm, which=4)

[Figure: Cook's distance plot, by observation number, for lm(y ~ x); observation 20 stands out.]

Example (cont'd)

[Figure: Cook's distance plot for lm(brainwt ~ bodywt + lsize, data = litters).]

Example (cont'd)

[Figure: residuals vs. fitted values for lm(brainwt ~ bodywt + lsize, data = litters).]

(h_19,19 is moderate, but r_19 is large.)

Example (cont'd)

[Figure: Cook's distance plot for lm(depression ~ weight, data = roller); observation 10 stands out.]

Observation 10 is not an outlier, but it has high leverage:

> lm.influence(roller.lm)$hat[10]
[1] ...
# high leverage

Measuring Effects Due to Influence - DFBETAS

What effect does the ith observation have on the estimate β̂_j? Compare β̂_j with β̂_j(i):

β̂_j − β̂_j(i)

Problem: this difference is not standardized.

Standardizing

Divide by a standard deviation:

DFBETAS_j,(i) = (β̂_j − β̂_j(i)) / √(S²_(i) C_jj),

where S²_(i) = MSE_(i) and C_jj = (XᵀX)⁻¹_jj.

* If this is positive, the effect of the ith observation is to increase the estimate of β_j.
* If this is negative, the effect of the ith observation is to decrease the estimate of β_j.

The ith observation is influential on the jth coefficient estimate if

|DFBETAS_j,(i)| > 1 (small n), or 2/√n (large n).
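A minimal sketch computing DFBETAS "by hand" for a single observation and checking it against R's dfbetas() (litters model again; i = 1 is an arbitrary choice):

i <- 1
fit <- lm(brainwt ~ bodywt + lsize, data = litters)
fit.i <- lm(brainwt ~ bodywt + lsize, data = litters[-i, ])
Cjj <- diag(solve(crossprod(model.matrix(fit))))  # diagonal of (X'X)^{-1}
s2.i <- summary(fit.i)$sigma^2                    # S^2_(i) = MSE_(i)
all.equal((coef(fit) - coef(fit.i))/sqrt(s2.i*Cjj), dfbetas(fit)[i, ])  # TRUE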

Example - hills data

Linear model:

hills.lm1 <- lm(time ~ dist + climb, data = hills[-18,])
plot(hills.lm1, which=1)
plot(dfbetas(hills.lm1)[,2], pch=16)
plot(dfbetas(hills.lm1)[,3], pch=16)

Residual Plot for hills

[Figure: residuals vs. fitted values for lm(time ~ dist + climb, data = hills[-18,]); observation 7 stands out.]

DFBETAS for BETA1: hills

[Figure: dfbetas for the dist coefficient vs. observation number.]

DFBETAS for BETA2: hills

[Figure: dfbetas for the climb coefficient vs. observation number; observation 7 stands out.]

A Nonlinear Model

hills.lm2 <- lm(time ~ dist + I(climb^2.25), data = hills[-18,])
plot(hills.lm2, which=1)
plot(dfbetas(hills.lm2)[,2], pch=16)
plot(dfbetas(hills.lm2)[,3], pch=16)

[Figure: residuals vs. fitted values for lm(time ~ dist + I(climb^2.25), data = hills[-18,]).]

DFBETAS for BETA1: hills nonlinear

[Figure: dfbetas for the dist coefficient vs. observation number; observation 11 stands out.]

DFBETAS for BETA2: hills nonlinear

[Figure: dfbetas for the I(climb^2.25) coefficient vs. observation number.]

How does observation 11 affect the coefficients?

hills.lm3 <- lm(time ~ dist + I(climb^2.25), data = hills[-c(11,18),])
> coef(hills.lm3)
  (Intercept)          dist I(climb^2.25)
   -2.89e-...          ...        ...e-09
> coef(hills.lm2)
  (Intercept)          dist I(climb^2.25)
   -5.13e-...          ...        ...e-09

Measuring Effects Due to Influence (cont'd) - DFFITS

* What effect does the ith observation have on the fitted value ŷ_i? Compare ŷ_i with ŷ_i,(i): ŷ_i − ŷ_i,(i).
* Standardize by dividing by √(S²_(i) h_ii):

DFFITS_(i) = (ŷ_i − ŷ_i,(i)) / √(S²_(i) h_ii) = e_i √h_ii / (S_(i) (1 − h_ii))

* If this is positive (negative), the effect of the ith observation is to increase (decrease) the estimate of y_i.
* The ith observation is influential on the ith fitted value if

|DFFITS_(i)| > 1 (small n), or 2√(p/n) (large n).
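A minimal sketch checking the residual-and-leverage form of DFFITS against R's dffits(), for the litters model fitted earlier:

e <- resid(litters.lm)
h <- hatvalues(litters.lm)
s.i <- lm.influence(litters.lm)$sigma    # S_(i), the leave-one-out sigma
all.equal(e*sqrt(h)/(s.i*(1 - h)), dffits(litters.lm))  # TRUE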

Example - hills

Linear model:

plot(dffits(hills.lm1), pch=16)

[Figure: dffits(hills.lm1) vs. index.]

Example - hills (cont'd)

Nonlinear model:

plot(dffits(hills.lm2), pch=16)

[Figure: dffits(hills.lm2) vs. index.]

Summary

Look for:
* Outliers (in the residual plots) - if such points are influential, the fitted model is not adequate.
* High leverage observations (on the hat diagonal) - such points are potentially influential.
* Influential observations (Cook's D, DFFITS, DFBETAS) - such points may be distorting the fitted model.

Final Thoughts

1. Outliers might not be influential.
2. High leverage observations might not be influential.
3. Influential observations might not be outliers.

Ch. 6.4 Lack of Fit in Simple Regression

Suppose repeated observations of y are taken at at least one level of x.

Example - tomatoes: electrical conductivity measured at different salinity concentrations:

> tomatoes
   salinity electrical.conductivity
   ...

Lack of Fit Example (cont'd)

> plot(tomatoes, pch = 16)
> tomatoes.lm <- lm(electrical.conductivity ~ salinity, data = tomatoes)
> abline(tomatoes.lm)
> summary(tomatoes.lm)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    ...       ...       ...   < 2e-16
salinity       ...       ...       ...   1.21e-06

Residual standard error: 2.83 on 16 degrees of freedom
Multiple R-Squared: 0.78, Adjusted R-squared: ...
F-statistic: 56.7 on 1 and 16 DF, p-value: 1.21e-06

Lack of Fit Example (cont'd)

[Figure: electrical conductivity vs. salinity with the fitted line.]

Lack of Fit Example (cont'd)

Fitted model: ŷ = ... + ... x, with an estimated noise standard error of 2.83. How well does this linear model actually fit the data?

Because of the repeated observations, the model can be written as

y_ij = β_0 + β_1 x_i + ε_ij,  j = 1, 2, ..., n_i,  i = 1, 2, ..., m,  (n = Σ_{i=1}^m n_i),

i.e. there are n_i observations at each x_i, i = 1, 2, ..., m.

Lack of Fit

ȳ_i is the best estimate of E[y | x = x_i]. If we fit a straight line, then ȳ_i − ŷ_i is a measure of lack of fit of the linear relationship. Look at the residuals:

e_ij = y_ij − ŷ_i = (y_ij − ȳ_i) + (ȳ_i − ŷ_i)
                    [pure error]   [lack of fit]

SSE = SSPE + SSLOF,

where

SSPE = Σ_{i=1}^m Σ_{j=1}^{n_i} (y_ij − ȳ_i)²  and  SSLOF = Σ_{i=1}^m n_i (ȳ_i − ŷ_i)².

Lack of Fit

To test for lack of fit, calculate

F_0 = MSLOF / MSPE.

Null hypothesis: E[y_i] = β_0 + β_1 x_i (i.e. the linear model is correct).

Degrees of freedom:
* Error: n − 2
* Pure error: Σ_{i=1}^m (n_i − 1) = n − m
* Lack of fit: n − 2 − (n − m) = m − 2

Therefore, F_0 ~ F_{m−2, n−m} when the null hypothesis is true. Reject the null hypothesis when F_0 > F_{m−2, n−m, α}.

> lof.lm(tomatoes.lm)
Test of Lack of Fit for Simple Linear Regression

Response: electrical.conductivity
            Df Sum Sq Mean Sq F value Pr(>F) prediction ratio
Lack of Fit ...
Pure Error  ...
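The lof.lm() function is a course helper, but the same F statistic can be obtained in base R by comparing the straight-line fit with the saturated model that fits a separate mean at each distinct x (whose residual sum of squares is exactly SSPE). A minimal sketch for a simple regression model:

lof.test <- function(model) {
  mf <- model.frame(model)
  y <- mf[[1]]; x <- mf[[2]]
  # the F test in the second row of this table is the lack-of-fit test
  anova(model, lm(y ~ factor(x)))
}
lof.test(tomatoes.lm)   # assumes the tomatoes data shown above is loaded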

litters Example

brainwt vs. lsize. Here, litter size is replicated, so the test can be applied:

> litters.0 <- lm(brainwt ~ lsize, data=litters)
> lof.lm(litters.0)
Test of Lack of Fit for Simple Linear Regression

Response: brainwt
            Df Sum Sq Mean Sq F value Pr(>F) prediction ratio
Lack of Fit ...
Pure Error  ...

The prediction ratio, which is smaller now than for the tomatoes data, is

prediction ratio = (range of fitted values) / (2√(MSPE/n)).

This gives an idea of how well the model is able to predict. Larger values indicate more predictive power.

Example (cont'd)

[Figure: brainwt vs. lsize with the fitted line.]

Example (cont'd)

brainwt vs. bodywt:

> litters.1 <- lm(brainwt ~ bodywt, data=litters)
> lof.lm(litters.1)
[1] "There are no replicate observations."
[1] "Exact Lack of Fit Test is Not Applicable."

A simple approximation involves averaging neighboring points:

> lof.lm(litters.1, approx=T)
The following results are only approximate!!!
Test of Lack of Fit for Simple Linear Regression

Response: y
            Df Sum Sq Mean Sq F value Pr(>F) prediction ratio
Lack of Fit ...
Pure Error  ...

Example (cont'd)

[Figure: brainwt vs. bodywt with the fitted line; red denotes the approximation.]

geophones Data

Measurements of the thickness of a subsurface layer in a region of Alberta:

> geophones.lm <- lm(thickness ~ distance, data=geophones)
> summary(geophones.lm)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    ...       ...       ...   < 2e-16
distance       ...       ...       ...   ...e-16

Residual standard error: 5.6 on 54 degrees of freedom
Multiple R-Squared: 0.716, Adjusted R-squared: 0.71
F-statistic: 136 on 1 and 54 DF, p-value: 2.22e-16

Fitted model: ŷ = ... + ... x, with a noise standard deviation of 5.6.

> lof.lm(geophones.lm, approx=T)
The following results are only approximate!!!
Test of Lack of Fit for Simple Linear Regression

Response: y
            Df Sum Sq Mean Sq F value Pr(>F) prediction ratio
Lack of Fit ...
Pure Error  ...

Example (cont'd)

[Figure: thickness vs. distance with the fitted line; red denotes the approximation.]

Example - roller data

> lof.lm(roller.lm, approx = T)
The following results are only approximate!!!
Test of Lack of Fit for Simple Linear Regression

Response: y
            Df Sum Sq Mean Sq F value Pr(>F) prediction ratio
Lack of Fit ...
Pure Error  ...

Example (cont'd)

[Figure: depression vs. weight with the fitted line; red denotes the approximation.]

Example (cont'd)

Without the intercept:

> lof.lm(roller.0, approx = T, call.plot=F)
The following results are only approximate!!!
Test of Lack of Fit for Simple Linear Regression

Response: y
            Df Sum Sq Mean Sq F value Pr(>F) prediction ratio
Lack of Fit ...
Pure Error  ...

Lack of Fit Exercises

* Question 1 from Tutorial 8 for 2009
* Question 7 from the Final Examination for December ...

Ch. 6.5: Transformations

1. Variance stabilizing transformations; Box-Cox transformations
2. Transformations to linearize the model

Variance-Stabilizing Transformations

Model assumptions:

E[y | x] = β_0 + β_1 x,  V(y | x) = σ²

Set μ_y = E[y | x]. What if

V(y | x) = σ² f(μ_y),

where f(·) is some non-constant function? Try to find a function g(y) so that

V(g(y) | x) = constant.

Variance-Stabilizing Transformations (cont'd)

Then obtain a Taylor expansion of g(y) about μ_y:

g(y) = g(μ_y) + (y − μ_y) g′(μ_y) + (y − μ_y)² g″(μ_y)/2 + ···

Then

V(g(y)) ≈ V(y) (g′(μ_y))² = σ² f(μ_y) (g′(μ_y))².

V(g(y)) will be constant if

g′(μ_y) = 1/√f(μ_y),  i.e.  g′(z) = 1/√f(z).

Examples

1. f(x) = x (e.g. Poisson data):

1/√f(x) = x^(−1/2)  ⟹  g(y) = √y

[Figure: residuals vs. fitted values for the raw Poisson responses and after the square-root transformation.]

Examples (cont'd)

2. f(x) = x² (e.g. exponential data):

1/√f(x) = 1/x  ⟹  g(y) = log(y)

[Figure: residuals vs. fitted values for exponential responses.]

Examples (cont'd)

3. f(x) = x(1 − x) (e.g. binomial data):

1/√f(x) = 1/√(x(1 − x)),  and since  d/dx sin⁻¹(√x) = 1/(2√(x(1 − x))),

g(y) = arcsin(√y).

(A simulation illustrating case 1 follows.)
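A minimal simulation illustrating case 1: Poisson responses, whose variance grows with the mean, show a funnel-shaped residual plot, which the square-root transformation largely removes.

set.seed(1)
xx <- runif(200, 1, 10)
yy <- rpois(200, lambda = 5*xx)    # Var(y|x) grows with the mean
par(mfrow=c(1,2))
plot(lm(yy ~ xx), which=1)         # funnel shape in the residuals
plot(lm(sqrt(yy) ~ xx), which=1)   # roughly constant spread
par(mfrow=c(1,1))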

6.5.1 Box-Cox Transformations (on the response)

Select the power λ in the transformation g(y) = y^λ by maximum likelihood. This is equivalent to minimizing the SSE with respect to λ (and the other parameters).

Caution: the residual sums of squares are not comparable for different values of λ. We need to ensure that comparisons are made according to the same standard:

y^(λ) = (y^λ − 1) / (λ ẏ^(λ−1)),  λ ≠ 0
y^(λ) = ẏ log y,                  λ = 0

where ẏ = geometric mean of the y's.

Strategy

1. Perform the transformation y_1^(λ), ..., y_n^(λ) for several values of λ.
2. Compute the SSE for each value of λ.
3. Select the λ which gives the minimum value.
4. Fit y^(λ) = Xβ + ε.
5. Approximate confidence intervals for λ can also be obtained.
6. In R, use boxcox(y ~ x, data = dataset) from the MASS package.

A sketch of steps 1-3 done by hand is given below.
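A minimal sketch of steps 1-3, computing the SSE profile by hand for the bacteria data of the next example (p5.3 from the MPV package); MASS::boxcox() automates the equivalent profile-likelihood calculation.

library(MPV)                       # assumes the MPV package is installed
data(p5.3)
y <- p5.3$bact; tm <- p5.3$min
gm <- exp(mean(log(y)))            # geometric mean of y
lambdas <- seq(-1, 1, by = 0.05)
sse <- sapply(lambdas, function(l) {
  # normalized transformation y^(lambda), so SSE values are comparable
  z <- if (abs(l) < 1e-8) gm*log(y) else (y^l - 1)/(l*gm^(l - 1))
  deviance(lm(z ~ tm))             # deviance of an lm fit is its SSE
})
plot(lambdas, sse, type="l")
lambdas[which.min(sse)]            # near 0 here, pointing to a log transform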

Example 1

Bacteria data - the average number of surviving bacteria (y) in a canned food product versus time (t) of exposure to 300°F heat.

Example 1 (cont'd)

> library(MPV)
> data(p5.3)
> bact.lm <- lm(bact ~ min, data=p5.3)
> plot(bact.lm, which=1)
> plot(bact.lm, which=2)
> library(MASS)
> boxcox(bact.lm)
> bactlog.lm <- lm(log(bact) ~ min, data=p5.3)
> plot(bactlog.lm, which=1)
> plot(bactlog.lm, which=2)

Residuals vs. Fitted

[Figure: residuals vs. fitted values for lm(bact ~ min, data = p5.3).]

Q-Q Plot

[Figure: normal Q-Q plot of the standardized residuals for lm(bact ~ min, data = p5.3).]

Box-Cox

[Figure: Box-Cox profile log-likelihood vs. λ, with a 95% confidence interval for λ.]

Residuals vs. Fitted (after log-transforming)

[Figure: residuals vs. fitted values for lm(log(bact) ~ min, data = p5.3).]

Q-Q Plot (after log-transforming)

[Figure: normal Q-Q plot of the standardized residuals for lm(log(bact) ~ min, data = p5.3).]

Example (cont'd)

A model of the form

log(y) = β_0 + β_1 t + ε

is reasonable, especially since β̂_1 is negative (β̂_1 = −.236).

Example 2

trees data: 31 observations on Girth (g), Height (h) and Volume (V).

A simple model: V ≈ g²h/(4π), or

log V = β_0 + β_1 log h + β_2 log g + ε.

Example 2 (cont'd)

> library(DAAG)
> data(trees); attach(trees)
> trees.lm <- lm(log(Volume) ~ log(Girth) + log(Height))
> boxcox(trees.lm)   # (lambda = 1 is OK)
> summary(trees.lm)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    ...       ...       ...   ...e-09
log(Girth)     ...       ...       ...   < 2e-16
log(Height)    ...       ...       ...   ...e-06

Example 2 (cont'd) - Box-Cox after Transforming

[Figure: Box-Cox profile log-likelihood vs. λ, with a 95% confidence interval, for the transformed model.]

The coefficient of log(Height) is not distinguishable from 1, and the coefficient of log(Girth) is not distinguishable from 2.

Exercises

* Question 2 from the December, 2010 Final Examination
* Question 6 from the December, 2009 Final Examination
* Question 5acd from the December, 2008 Final Examination


More information

14 Multiple Linear Regression

14 Multiple Linear Regression B.Sc./Cert./M.Sc. Qualif. - Statistics: Theory and Practice 14 Multiple Linear Regression 14.1 The multiple linear regression model In simple linear regression, the response variable y is expressed in

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Weighted Least Squares

Weighted Least Squares Weighted Least Squares The standard linear model assumes that Var(ε i ) = σ 2 for i = 1,..., n. As we have seen, however, there are instances where Var(Y X = x i ) = Var(ε i ) = σ2 w i. Here w 1,..., w

More information

Exam Applied Statistical Regression. Good Luck!

Exam Applied Statistical Regression. Good Luck! Dr. M. Dettling Summer 2011 Exam Applied Statistical Regression Approved: Tables: Note: Any written material, calculator (without communication facility). Attached. All tests have to be done at the 5%-level.

More information

STATISTICS 174: APPLIED STATISTICS TAKE-HOME FINAL EXAM POSTED ON WEBPAGE: 6:00 pm, DECEMBER 6, 2004 HAND IN BY: 6:00 pm, DECEMBER 7, 2004 This is a

STATISTICS 174: APPLIED STATISTICS TAKE-HOME FINAL EXAM POSTED ON WEBPAGE: 6:00 pm, DECEMBER 6, 2004 HAND IN BY: 6:00 pm, DECEMBER 7, 2004 This is a STATISTICS 174: APPLIED STATISTICS TAKE-HOME FINAL EXAM POSTED ON WEBPAGE: 6:00 pm, DECEMBER 6, 2004 HAND IN BY: 6:00 pm, DECEMBER 7, 2004 This is a take-home exam. You are expected to work on it by yourself

More information

Circle a single answer for each multiple choice question. Your choice should be made clearly.

Circle a single answer for each multiple choice question. Your choice should be made clearly. TEST #1 STA 4853 March 4, 215 Name: Please read the following directions. DO NOT TURN THE PAGE UNTIL INSTRUCTED TO DO SO Directions This exam is closed book and closed notes. There are 31 questions. Circle

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression ST 430/514 Recall: A regression model describes how a dependent variable (or response) Y is affected, on average, by one or more independent variables (or factors, or covariates)

More information

Regression Diagnostics

Regression Diagnostics Diag 1 / 78 Regression Diagnostics Paul E. Johnson 1 2 1 Department of Political Science 2 Center for Research Methods and Data Analysis, University of Kansas 2015 Diag 2 / 78 Outline 1 Introduction 2

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression ST 370 Regression models are used to study the relationship of a response variable and one or more predictors. The response is also called the dependent variable, and the predictors

More information

LAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION

LAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION LAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION In this lab you will first learn how to display the relationship between two quantitative variables with a scatterplot and also how to measure the strength of

More information