6.1 Introduction

Regression model: y = Xβ + ε

Assumptions:
1. The relationship between y and the predictors is linear.
2. The noise term ε has zero mean.
3. All ε's have the same variance σ².
4. The ε's are uncorrelated between observations.
5. The ε's are independent of the predictors.
6. The ε's are normally distributed.

Regression diagnostics are used to detect departures from these assumptions. Checking every assumption is not always required.

6.2 Residual Analysis

Residual plots are the most important diagnostics:

Residuals vs. fitted values or predictors
* for detecting changes in variance
* for detecting nonlinearity
* for detecting outliers
* for detecting dependence on a predictor

Partial plots - for checking whether variables enter the model linearly.
Time plot of residuals - for detecting dependence in time: autocorrelation.
Normal Q-Q plot - for assessing normality.
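In R, several of these plots come directly from a fitted lm object. A minimal sketch, using the roller data from the DAAG package (which reappears later in these notes) purely as a stand-in example:

library(DAAG)                          # assumes the DAAG package is installed
roller.lm <- lm(depression ~ weight, data = roller)
par(mfrow=c(2,2))
plot(roller.lm)        # residuals vs. fitted, normal Q-Q, scale-location,
                       # and residuals vs. leverage
par(mfrow=c(1,1))
acf(resid(roller.lm))  # check for serial dependence in the residuals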

Types of Residuals

Raw residuals: e_i = y_i − ŷ_i
(Scale problem: how big is a large residual?)

Standardized residuals: d_i = e_i / √MSE
(variance of d_i ≈ 1, but depends on x_i)

Types of Residuals (cont'd)

Studentized residuals: r_i = e_i / √(MSE (1 − h_ii)),
where h_ii = ith diagonal element of the hat matrix H.
(Var(ẽ) = Var((I − H)y) = (I − H)σ²)

PRESS residuals: e_(i) = y_i − ŷ_(i) = e_i / (1 − h_ii)
(ŷ_(i): delete the ith observation, fit the model, and predict at x_i1, x_i2, ..., x_ik.)
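A minimal sketch of how these four residual types can be computed in R, using the built-in cars data purely as a stand-in. Note that R's rstandard() matches the studentized residuals r_i defined above, while rstudent() additionally deletes the ith observation when estimating σ.

fit <- lm(dist ~ speed, data = cars)  # any fitted lm object will do
e <- resid(fit)                       # raw residuals e_i
mse <- summary(fit)$sigma^2           # MSE
h <- hatvalues(fit)                   # hat diagonal h_ii
d <- e/sqrt(mse)                      # standardized residuals d_i
r <- e/sqrt(mse*(1 - h))              # studentized residuals r_i
press <- e/(1 - h)                    # PRESS residuals e_(i)
all.equal(r, rstandard(fit))          # TRUE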

Example

Data on a collection of paperback books:

> library(DAAG); softbacks
  volume weight
  ...
> softbacks.lm <- lm(weight ~ volume, data = softbacks)
> summary(softbacks.lm)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    ...       ...       ...     ...
volume         ...       ...       ...     ...

Residual standard error: 102 on 6 degrees of freedom

soft.res <- resid(softbacks.lm)   # ordinary residuals
soft.stres <- soft.res/102        # standardized
soft.stud <- soft.res/(102*sqrt(1 - hat(model.matrix(softbacks.lm))))  # studentized
soft.press <- soft.res/(1 - hat(model.matrix(softbacks.lm)))           # PRESS residuals
par(mfrow=c(2,2))
plot(soft.res ~ volume, data=softbacks, ylim=2*range(soft.res))
# similarly for the other 3 types of residuals

Example (cont'd)

[Figure: the four residual types (soft.res, soft.stres, soft.stud, soft.press), each plotted against volume.]

Observation: there is a mild outlier.

Example - Biochemical Oxygen Demand

A study of the capability of subsurface-flow wetland systems to remove biochemical oxygen demand (BOD) and various other chemical constituents yielded 13 observations on BOD mass loading (x) and BOD mass removal (y). Interest centers on how to predict BOD mass removal.

library(Devore5); data(ex12.04); attach(ex12.04)
par(mfrow=c(2,2))
hist(x); hist(log(x)); hist(y); hist(log(y))

Histograms of each variable can be helpful.

Example (cont'd)

[Figure: histograms of x, log(x), y, and log(y).]

A log transformation of each variable is recommended here.

Example (cont'd)

BOD.lm <- lm(log(y) ~ log(x))
plot(resid(BOD.lm) ~ log(x))   # resid vs. predictor

[Figure: resid(BOD.lm) plotted against log(x).]

Observations

A linear relationship is not appropriate.
There is an extreme outlier.
The model is not satisfactory.

What if we use untransformed variables?

BOD.lm1 <- lm(y ~ x)
plot(resid(BOD.lm1) ~ x)

Example (cont'd)

[Figure: resid(BOD.lm1) plotted against x.]

The error variance is not constant.

PRESS - PRedicted Error Sum of Squares

PRESS = Σ_{i=1}^n e_(i)² = Σ_{i=1}^n ( e_i / (1 − h_ii) )²

This gives an idea of how well a regression model can predict new data. Small values of PRESS are desired.
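The PRESS() function used on the next slide is a course helper, not part of base R. A minimal sketch, using the identity e_(i) = e_i/(1 − h_ii) above:

PRESS <- function(model) {
  # predicted residual sum of squares via the hat-diagonal shortcut
  sum((resid(model)/(1 - hatvalues(model)))^2)
}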

litters Example

# regression of brain weight against body weight and litter size:
> litters.lm <- lm(brainwt ~ bodywt + lsize, data = litters)
> PRESS(litters.lm)
[1] ...

# same regression as above, but without the intercept term:
> litters.0 <- lm(brainwt ~ bodywt + lsize - 1, data = litters)
> PRESS(litters.0)
[1] ...

# regression of brain weight against body weight only, with intercept:
> litters.1 <- lm(brainwt ~ bodywt, data = litters)
> PRESS(litters.1)
[1] ...

# regression of brain weight against both variables plus an interaction term:
> litters.2 <- lm(brainwt ~ bodywt + lsize + lsize:bodywt, data = litters)
> PRESS(litters.2)
[1] ...

# best predictor is the 1st model!

Added Variable Plots or Partial Regression Plots

Example: Suppose observations are taken on a response variable y and three other variables x_1, x_2 and x_3.

Linear model: y = β_0 + β_1 x_1 + β_2 x_2 + β_3 x_3 + ε

One should always check plots of the residuals versus the fitted values, and versus each of the predictors. An additional way to check whether each predictor should enter the regression model linearly is to look at partial regression plots for each variable.

Constructing a Partial Regression Plot for x_1

* regress y against x_2 and x_3 (i.e. all variables but x_1)
* regress x_1 against x_2 and x_3
* obtain residuals from both regressions
* plot the y residuals against the x_1 residuals

If x_1 enters the model linearly, you should see points scattered about a straight line of slope β_1. Otherwise, the plot may indicate what kind of transformation to apply to x_1. (A sketch of a plotting helper is given below.)
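The partial.plot() function used in the following examples is a course helper rather than a base R function. A minimal sketch of what it might look like, where X is a matrix or data frame of predictors, y is the response, and j indexes the predictor of interest:

partial.plot <- function(X, y, j) {
  X <- as.data.frame(X)
  others <- X[, -j, drop = FALSE]
  y.res <- resid(lm(y ~ ., data = others))       # y adjusted for the other predictors
  x.res <- resid(lm(X[, j] ~ ., data = others))  # x_j adjusted for the other predictors
  plot(x.res, y.res, xlab = names(X)[j], ylab = "y.res",
       main = "Partial Regression Plot", pch = 16)
  abline(lm(y.res ~ x.res), col = 2)   # this line has slope beta_j
}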

Examples

Some artificial data is in partial.data: the first three columns are x_1, x_2 and x_3; the last column is y.

partial.plot(partial.data[,-4], partial.data[,4], 1)  # partial for x1;
      # the true model is nonlinear:
      # y = .2 exp(x1) + x2 + x3 + e, with sd(e) = .1
partial.plot(partial.data[,-4], partial.data[,4], 2)  # linear term
partial.plot(partial.data[,-4], partial.data[,4], 3)  # linear term

Examples

[Figure: partial regression plots of y.res against x1, x2 and x3; the x1 panel is curved, while the x2 and x3 panels are linear.]

Examples

# litters example
partial.plot(litters[,-3], litters[,3], 1)  # partial for lsize
partial.plot(litters[,-3], litters[,3], 2)  # partial for bodywt

Examples

[Figure: partial regression plot of y.res against lsize.]

Examples

[Figure: partial regression plot of y.res against bodywt.]

Observation: there is mild nonlinearity.

6.2.3 Checking the Normal Assumption

Real data are not likely to be normally distributed. For practical purposes, two questions are important:

* How much departure from normality can we tolerate?
* How can we decide if it is plausible that data are from a normal distribution?

The first question can be difficult. Large departures from normality should be checked for, particularly skewness. Small departures can be ignored. For most moderate-sized samples, only gross departures will be detectable.

What sorts of checks will detect gross departures?

While histograms have their place, the normal Q-Q plot is more effective. The following code plots 4 histograms of independent random samples of 50 values from a normal distribution.

par(mfrow=c(2,2))
set.seed(2733)
for (i in 1:4) hist(rnorm(50))
par(mfrow=c(1,1))

The Normal Q-Q Plot

One sorts the data values. These are then plotted against the corresponding values that one might expect if the data really were from a normal distribution. If the data really are from a normal distribution, the plot should approximate a straight line.

par(mfrow=c(2,2))
set.seed(2733)   # Use the same samples as before
for (i in 1:4) qqnorm(rnorm(50), main="")
par(mfrow=c(1,1))

Simulated plots can help train the eye on what to expect in samples of various sizes.

Example - Simulated Normal Data

[Figure: four normal Q-Q plots (sample quantiles vs. theoretical quantiles) of the simulated samples.]

Exercise: roller data

Obtain a normal Q-Q plot of the residuals:

roller.lm <- lm(depression ~ weight, data = roller)
plot(roller.lm, which=2, pch=16, col=4)
abline(0, 1, lwd=2, col=2)

Exercise: roller data (cont'd)

[Figure: normal Q-Q plot of the standardized residuals from lm(depression ~ weight, data = roller).]

Setting the sample plot alongside plots for random normal data:

par(mfrow=c(2,2))
roller.lm <- lm(depression ~ weight, data = roller)
plot(roller.lm, which=2, pch=16, col=4)
abline(0, 1, lwd=2, col=2)
for (i in 1:3) {
  qqnorm(rnorm(10), pch=16, col=4)
  abline(0, 1, lwd=2, col=2)
}
par(mfrow=c(1,1))

Q-Q Plot for roller Data

[Figure: the Q-Q plot of the roller residuals alongside three normal Q-Q plots of random normal samples of size 10.]

Formal Statistical Testing for Normality

Shapiro-Wilk test.

A difficulty with such tests is that normality is difficult to rule out in small samples, while in large samples the tests will almost inevitably identify departures from normality that are too small to have any practical consequence for standard forms of statistical analysis. (A sketch of the test applied to residuals follows.)
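A minimal sketch of the Shapiro-Wilk test applied to regression residuals (roller.lm as fitted earlier); keep the caveat above in mind when interpreting the p-value:

shapiro.test(resid(roller.lm))   # small p-value suggests non-normality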

6.2.4 Serial Correlation among the Errors

* Time plots: plots of residuals against time.
* Autocorrelation function (ACF): acf(residuals).
* The Durbin-Watson test examines lag-1 autocorrelation only; it is better to look at the ACF or to use a portmanteau test such as the Box-Ljung test:
  Box.test(residuals, type="Ljung-Box", lag=10)
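For the Durbin-Watson test itself, one possibility is dwtest(); this sketch assumes the lmtest package is installed, since base R has no built-in Durbin-Watson test:

library(lmtest)
dwtest(roller.lm)   # tests for lag-1 autocorrelation in the residuals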

Autocorrelation Checking: Example

> library(DAAG)
> log.hills <- log(hills)
> names(log.hills) <- c("log.dist", "log.climb",
+                       "log.time")
> hills.lm <- lm(log.time ~ log.dist + log.climb,
+                data = log.hills[-18,])
> ts.plot(resid(hills.lm))
> acf(resid(hills.lm))
> Box.test(resid(hills.lm), type="Ljung-Box", lag=10)

        Box-Ljung test

data:  resid(hills.lm)
X-squared = 7.5, df = 10, p-value = ...

Autocorrelation Checking: Example

[Figure: ACF of resid(hills.lm) against lag.]

Autocorrelation Checking: Example 2

Winnipeg daily maximum temperatures:

> source("wpgtemp.r")
> temp.lm <- lm(temperature ~ sin(2*pi*day/365.25) + cos(2*pi*day/365.25),
+               data=wpgtemp)
> acf(resid(temp.lm))
> Box.test(resid(temp.lm), lag=10, type="Ljung-Box")

        Box-Ljung test

data:  resid(temp.lm)
X-squared = ..., df = 10, p-value < 2.2e-16

Autocorrelation Checking: Temperature Example

[Figure: ACF of resid(temp.lm) against lag.]

Ch. 6.3 Detection and Treatment of Outliers

An outlier is an extreme observation. If a residual lies more than about 3 standard deviation units away from 0, then the observation should be regarded as an outlier.

Detection: plot residuals vs. fitted values.

Example - hills.lm

> hills.lm <- lm(log.time ~ log.dist + log.climb, data = log.hills)
> summary(hills.lm)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    ...       ...       ...   ...e-06
log.dist       ...       ...       ...   ...e-08
log.climb      ...       ...       ...    ...

Residual standard error: 0.315 on 32 degrees of freedom
...
> plot(hills.lm, pch=16, which=1, col=2)

Example (cont'd)

[Figure: residuals vs. fitted values for lm(log.time ~ log.dist + log.climb, data = log.hills).]

The standardized 18th residual is

resid(hills.lm)[18]/summary(hills.lm)$sigma = 1.46/.315 = 4.63,

an extreme outlier.

Treatment

Outlying observations should be examined closely.

Example - hills (cont'd):

> hills[18,]
   dist climb time
    ...   ...  ...

Compare this with the rest of the data:

> summary(hills)
      dist            climb           time
 Min.   : 2.00   Min.   : 300   Min.   :  ...
 1st Qu.:  ...   1st Qu.: 725   1st Qu.:0.467
 Median : 6.00   Median :1000   Median :0.662
 Mean   : 7.53   Mean   :1815   Mean   :  ...
 3rd Qu.:  ...   3rd Qu.:2200   3rd Qu.:1.144
 Max.   :28.00   Max.   :7500   Max.   :  ...

Handling Outliers - Example (cont'd)

The 18th race seems to have taken a long time, though it was a short climb and a short distance. For example, compare this race with the first observation:

> hills[1,]
   dist climb time
    ...   ...  ...

This race is shorter but with more climbing; the time is much less than for race 18. Observation 21 is also comparable for distance and climb, but not at all for time:

> hills[21,]
   dist climb time
    ...   ...  ...

This leads us to the conclusion that observation 18 might have been misrecorded. One author believes that the time was really .31 hours instead of 1.3 hours.

Handling Outliers - Example (cont'd)

If the outlier has been improperly recorded, it should be corrected or discarded. In this case, we discard it, since we are not sure of the correct time:

> hills.lm <- lm(time ~ climb + dist, data = hills[-18,])
> summary(hills.lm)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.25e-...    ...       ...   ...e-05
climb        1.98e-...    ...       ...   ...e-11
dist         1.06e-...    ...       ...  < 2e-16

Residual standard error: ... on 31 degrees of freedom
Multiple R-Squared: 0.972, Adjusted R-squared: 0.97
F-statistic: 529 on 2 and 31 DF, p-value: 0

Note the effect on the standard error of removing that outlier.

Handling Outliers

If the outlier has not been improperly recorded, then it could mean that the model is not adequate. In such cases:

* Compare the fitted model with and without the observation.
* If the outlier is making a big difference, one strategy is to report the fitted model obtained without the outlier, but to report the outlier as well.
* In addition, any plots of the data should properly identify the outlying observation.

Exercise: plot the residuals vs. fitted values for the hills data without observation 18. Note that there is now a new outlier. It cannot be explained away as easily as observation 18. Are there other variables that could be included in the model? A cluster of outliers sometimes indicates that an important variable has been omitted, e.g. gender.

Example - Softbacks

Earlier, we plotted the residuals for the weight versus volume model for 8 paperback books. Observation 6 appears to be an outlier. It has not been recorded incorrectly; it is just different from the other observations. The pages are of denser material than in the other books; perhaps density should be measured and included in the model. If we choose to omit this observation, we should do it as follows:

softbacks.lm6 <- lm(weight ~ volume, data = softbacks[-6,])
plot(softbacks, pch=16)
abline(softbacks.lm6)
points(softbacks[6,], col=2, pch=16)
text(1050, 930, "omitted observation")

Example (cont'd)

[Figure: weight vs. volume with the refitted line; the omitted observation 6 is highlighted.]

Leverage and Influence Diagnostics

High leverage point: an observation which lies near an extreme of the space of explanatory variables.

[Figure: "Leverage Measurements: Simple Linear Regression" - y vs. x, with the hat diagonal value printed in blue beside each point.]

Influence

Influential observation: a high leverage point which is also an outlier in the response space.

[Figure: "Influential and Ordinary Outliers" - y vs. x with fitted slopes for the full data set (b=1.20), without obs. 20 (b=1.02), and without obs. 10 (b=1.19); obs. 10 is not influential while obs. 20 is. Blue numbers are Cook's distance values.]

Leverage: Theory

Consider y_i = β_0 + β_1 x_i + ε_i. Then

β̂_1 = Σ (x_i − x̄) y_i / S_xx = Σ c_i y_i,

i.e. β̂_1 is a weighted average of the y's. High leverage points are those for which (x_i − x̄)² is large: the weight

c_i = (x_i − x̄) / S_xx

is large in magnitude when x_i >> x̄ and when x_i << x̄. Therefore, the observations that give greatest weight to the determination of the slope are the smallest and largest x's.

Example

Suppose we have n = 5 observations, with

x_1 = 0, x_2 = .25, x_3 = .5, x_4 = .75, x_5 = 1.

Then x̄ = .5, S_xx = .625, and

β̂_1 = −.8 y_1 − .4 y_2 + 0 y_3 + .4 y_4 + .8 y_5.

Note that the middle observation makes no contribution at all, while the two extremes contribute a lot. The correctness of the slope estimate depends heavily on the quality of the 1st and 5th measurements.
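A minimal sketch verifying these weights numerically, using c_i = (x_i − x̄)/S_xx:

x <- c(0, .25, .5, .75, 1)
Sxx <- sum((x - mean(x))^2)          # 0.625
ci <- (x - mean(x))/Sxx              # -0.8 -0.4  0.0  0.4  0.8
y <- rnorm(5)                        # any responses will do
all.equal(sum(ci*y), unname(coef(lm(y ~ x))[2]))  # TRUE: slope = sum of c_i y_i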

Measuring Leverage - the Hat Diagonal

In multiple regression, it is more difficult to identify extreme observations in the multi-dimensional space of the x variables. The norm we use to decide if points have high leverage is based on A = (XᵀX)⁻¹. This is a valid way of defining the norm, since if the columns of X are linearly independent, then (XᵀX)⁻¹ is symmetric and positive definite. Symmetry of (XᵀX)⁻¹ is obvious.

Reminder: a matrix A is positive definite if it is symmetric and, for any nonzero vector a, we have aᵀA a > 0.

Example

Demonstration for the special case where n = 3 and p = 2.

Step 1: XᵀX is positive definite.
* Suppose a = [a_1 a_2]ᵀ ≠ [0 0]ᵀ, and write X = [x_1 x_2].
* If b = X a = 0, then a_1 x_1 + a_2 x_2 = 0. The columns of X are linearly independent, so a_1 = a_2 = 0, a contradiction.
* Therefore, if a = [a_1 a_2]ᵀ ≠ [0 0]ᵀ, then b ≠ 0, and aᵀXᵀX a = bᵀb > 0.

Example (cont'd)

Step 2: If a symmetric matrix A is positive definite, then A⁻¹ is positive definite.
* Suppose a ≠ 0, and set b = A⁻¹ a.
* Then b ≠ 0 (why?) and
  aᵀA⁻¹ a = bᵀA b > 0.
So A⁻¹ is positive definite.

Example (cont'd)

Step 3: By Step 1, XᵀX is positive definite, so by Step 2, (XᵀX)⁻¹ is positive definite.

Leverage

The leverage of an observation at x_i is defined by its (XᵀX)⁻¹ norm:

x_iᵀ (XᵀX)⁻¹ x_i,

but this is the (i,i) element of the hat matrix

H = X(XᵀX)⁻¹Xᵀ,

so the leverage of the ith observation = h_ii.

Example

Simple regression - response plotted against predictor:

[Figure: "Leverage Measurements: Simple Linear Regression" - y vs. x.]

Example

Multiple regression - predictors plotted against each other:

[Figure: "Leverage of Selected Observations in Litters Data" - bodywt vs. lsize.]

To obtain the leverage values in R:

> attach(litters)
> litters.lm <- lm(brainwt ~ bodywt + lsize)
> lm.influence(litters.lm)$hat
 [1] ...
 [6] ...
[11] ...
[16] ...
> detach(litters)

average leverage = n⁻¹ tr(H) = p/n

high leverage: h_ii > 2p/n = .3 for the litters data

Observation 17 (h_17,17 = .4326) is a high leverage point.

Measuring Influence - Cook's D

General method to see if an observation is influential: delete it and see how the estimates change.

β̂ = LS estimates, full data set
β̂_(i) = LS estimates, ith observation deleted

Look at the difference β̂ − β̂_(i):
* If it is big, the ith observation is influential.
* If it is small, the ith observation is not influential.

How can we tell if this difference is big or small?

56 One answer:cook s distance (based on another norm!) gives us an idea of how large this difference is, according to a standard scale: D i = 1 pmse ( β After some algebra, we have β (i))t (X T X)( β h ii β (i)) D i = r2 i p 1 h ii r i = ith studentized residual; influence is related to leverage and whether a point is an outlier: * high leverage outliers are influential * low leverage outliers are less influential * high leverage points that are not outliers are less influential High influence if D i > 1 56
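A minimal sketch checking the algebraic identity above against R's cooks.distance(), using the litters model (refitted here with a data argument so the snippet is self-contained; p = 3 parameters):

litters.lm <- lm(brainwt ~ bodywt + lsize, data = litters)
h <- hatvalues(litters.lm)
r <- rstandard(litters.lm)       # studentized residuals r_i
p <- length(coef(litters.lm))    # p = 3
all.equal(r^2/p * h/(1 - h), cooks.distance(litters.lm))  # TRUE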

In R: cooks.distance(y.lm) or plot(y.lm, which=4)

[Figure: Cook's distance plot, by observation number, for lm(y ~ x); observation 20 stands out.]

Example (cont'd)

[Figure: Cook's distance plot for lm(brainwt ~ bodywt + lsize, data = litters).]

Example (cont'd)

[Figure: residuals vs. fitted values for lm(brainwt ~ bodywt + lsize, data = litters).]

(h_19,19 is moderate, but r_19 is large.)

Example (cont'd)

[Figure: Cook's distance plot for lm(depression ~ weight, data = roller); observation 10 stands out.]

Observation 10 is not an outlier, but it has high leverage:

> lm.influence(roller.lm)$hat[10]
[1] ...
# high leverage

Measuring Effects Due to Influence - DFBETAS

What effect does the ith observation have on the estimate β̂_j? Compare β̂_j with β̂_j(i):

β̂_j − β̂_j(i)

Problem: this difference is not standardized.

Standardizing

Divide by a standard deviation:

DFBETAS_j,(i) = (β̂_j − β̂_j(i)) / √(S²_(i) C_jj),

where S²_(i) = MSE_(i) and C_jj = (XᵀX)⁻¹_jj.

* If this is positive, the effect of the ith observation is to increase the estimate of β_j.
* If this is negative, the effect of the ith observation is to decrease the estimate of β_j.

The ith observation is influential on the jth coefficient estimate if

|DFBETAS_j,(i)| > 1 (small n), or 2/√n (large n).
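A minimal sketch computing DFBETAS "by hand" for a single observation and checking it against R's dfbetas() (litters model again; i = 1 is an arbitrary choice):

i <- 1
fit <- lm(brainwt ~ bodywt + lsize, data = litters)
fit.i <- lm(brainwt ~ bodywt + lsize, data = litters[-i, ])
Cjj <- diag(solve(crossprod(model.matrix(fit))))  # diagonal of (X'X)^{-1}
s2.i <- summary(fit.i)$sigma^2                    # S^2_(i) = MSE_(i)
all.equal((coef(fit) - coef(fit.i))/sqrt(s2.i*Cjj), dfbetas(fit)[i, ])  # TRUE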

Example - hills data

Linear model:

hills.lm1 <- lm(time ~ dist + climb, data = hills[-18,])
plot(hills.lm1, which=1)
plot(dfbetas(hills.lm1)[,2], pch=16)
plot(dfbetas(hills.lm1)[,3], pch=16)

Residual Plot for hills

[Figure: residuals vs. fitted values for lm(time ~ dist + climb, data = hills[-18,]); observation 7 stands out.]

DFBETAS for BETA1: hills

[Figure: dfbetas for the dist coefficient vs. observation number.]

DFBETAS for BETA2: hills

[Figure: dfbetas for the climb coefficient vs. observation number; observation 7 stands out.]

A Nonlinear Model

hills.lm2 <- lm(time ~ dist + I(climb^2.25), data = hills[-18,])
plot(hills.lm2, which=1)
plot(dfbetas(hills.lm2)[,2], pch=16)
plot(dfbetas(hills.lm2)[,3], pch=16)

[Figure: residuals vs. fitted values for lm(time ~ dist + I(climb^2.25), data = hills[-18,]).]

DFBETAS for BETA1: hills nonlinear

[Figure: dfbetas for the dist coefficient vs. observation number; observation 11 stands out.]

DFBETAS for BETA2: hills nonlinear

[Figure: dfbetas for the I(climb^2.25) coefficient vs. observation number.]

How does observation 11 affect the coefficients?

hills.lm3 <- lm(time ~ dist + I(climb^2.25), data = hills[-c(11,18),])
> coef(hills.lm3)
  (Intercept)          dist I(climb^2.25)
   -2.89e-...          ...        ...e-09
> coef(hills.lm2)
  (Intercept)          dist I(climb^2.25)
   -5.13e-...          ...        ...e-09

Measuring Effects Due to Influence (cont'd) - DFFITS

* What effect does the ith observation have on the fitted value ŷ_i? Compare ŷ_i with ŷ_i,(i): ŷ_i − ŷ_i,(i).
* Standardize by dividing by √(S²_(i) h_ii):

DFFITS_(i) = (ŷ_i − ŷ_i,(i)) / √(S²_(i) h_ii) = e_i √h_ii / (S_(i) (1 − h_ii))

* If this is positive (negative), the effect of the ith observation is to increase (decrease) the estimate of y_i.
* The ith observation is influential on the ith fitted value if

|DFFITS_(i)| > 1 (small n), or 2√(p/n) (large n).
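A minimal sketch checking the residual-and-leverage form of DFFITS against R's dffits(), for the litters model fitted earlier:

e <- resid(litters.lm)
h <- hatvalues(litters.lm)
s.i <- lm.influence(litters.lm)$sigma    # S_(i), the leave-one-out sigma
all.equal(e*sqrt(h)/(s.i*(1 - h)), dffits(litters.lm))  # TRUE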

Example - hills

Linear model:

plot(dffits(hills.lm1), pch=16)

[Figure: dffits(hills.lm1) vs. index.]

Example - hills (cont'd)

Nonlinear model:

plot(dffits(hills.lm2), pch=16)

[Figure: dffits(hills.lm2) vs. index.]

Summary

Look for:
* Outliers (in the residual plots) - if such points are influential, the fitted model is not adequate.
* High leverage observations (on the hat diagonal) - such points are potentially influential.
* Influential observations (Cook's D, DFFITS, DFBETAS) - such points may be distorting the fitted model.

Final Thoughts

1. Outliers might not be influential.
2. High leverage observations might not be influential.
3. Influential observations might not be outliers.

Ch. 6.4 Lack of Fit in Simple Regression

Suppose repeated observations of y are taken at at least one level of x.

Example - tomatoes: electrical conductivity measured at different salinity concentrations:

> tomatoes
   salinity electrical.conductivity
   ...

Lack of Fit Example (cont'd)

> plot(tomatoes, pch = 16)
> tomatoes.lm <- lm(electrical.conductivity ~ salinity, data = tomatoes)
> abline(tomatoes.lm)
> summary(tomatoes.lm)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    ...       ...       ...   < 2e-16
salinity       ...       ...       ...   1.21e-06

Residual standard error: 2.83 on 16 degrees of freedom
Multiple R-Squared: 0.78, Adjusted R-squared: ...
F-statistic: 56.7 on 1 and 16 DF, p-value: 1.21e-06

Lack of Fit Example (cont'd)

[Figure: electrical conductivity vs. salinity with the fitted line.]

Lack of Fit Example (cont'd)

Fitted model: ŷ = ... + ... x, with an estimated noise standard error of 2.83. How well does this linear model actually fit the data?

Because of the repeated observations, the model can be written as

y_ij = β_0 + β_1 x_i + ε_ij,  j = 1, 2, ..., n_i,  i = 1, 2, ..., m,  (n = Σ_{i=1}^m n_i),

i.e. there are n_i observations at each x_i, i = 1, 2, ..., m.

Lack of Fit

ȳ_i is the best estimate of E[y | x = x_i]. If we fit a straight line, then ȳ_i − ŷ_i is a measure of lack of fit of the linear relationship. Look at the residuals:

e_ij = y_ij − ŷ_i = (y_ij − ȳ_i) + (ȳ_i − ŷ_i)
                    [pure error]   [lack of fit]

SSE = SSPE + SSLOF,

where

SSPE = Σ_{i=1}^m Σ_{j=1}^{n_i} (y_ij − ȳ_i)²  and  SSLOF = Σ_{i=1}^m n_i (ȳ_i − ŷ_i)².

Lack of Fit

To test for lack of fit, calculate

F_0 = MSLOF / MSPE.

Null hypothesis: E[y_i] = β_0 + β_1 x_i (i.e. the linear model is correct).

Degrees of freedom:
* Error: n − 2
* Pure error: Σ_{i=1}^m (n_i − 1) = n − m
* Lack of fit: n − 2 − (n − m) = m − 2

Therefore, F_0 ~ F_{m−2, n−m} when the null hypothesis is true. Reject the null hypothesis when F_0 > F_{m−2, n−m, α}.

> lof.lm(tomatoes.lm)
Test of Lack of Fit for Simple Linear Regression

Response: electrical.conductivity
            Df Sum Sq Mean Sq F value Pr(>F) prediction ratio
Lack of Fit ...
Pure Error  ...
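The lof.lm() function is a course helper, but the same F statistic can be obtained in base R by comparing the straight-line fit with the saturated model that fits a separate mean at each distinct x (whose residual sum of squares is exactly SSPE). A minimal sketch for a simple regression model:

lof.test <- function(model) {
  mf <- model.frame(model)
  y <- mf[[1]]; x <- mf[[2]]
  # the F test in the second row of this table is the lack-of-fit test
  anova(model, lm(y ~ factor(x)))
}
lof.test(tomatoes.lm)   # assumes the tomatoes data shown above is loaded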

litters Example

brainwt vs. lsize. Here, litter size is replicated, so the test can be applied:

> litters.0 <- lm(brainwt ~ lsize, data=litters)
> lof.lm(litters.0)
Test of Lack of Fit for Simple Linear Regression

Response: brainwt
            Df Sum Sq Mean Sq F value Pr(>F) prediction ratio
Lack of Fit ...
Pure Error  ...

The prediction ratio, which is smaller now than for the tomatoes data, is

prediction ratio = (range of fitted values) / (2√(MSPE/n)).

This gives an idea of how well the model is able to predict. Larger values indicate more predictive power.

Example (cont'd)

[Figure: brainwt vs. lsize with the fitted line.]

Example (cont'd)

brainwt vs. bodywt:

> litters.1 <- lm(brainwt ~ bodywt, data=litters)
> lof.lm(litters.1)
[1] "There are no replicate observations."
[1] "Exact Lack of Fit Test is Not Applicable."

A simple approximation involves averaging neighboring points:

> lof.lm(litters.1, approx=T)
The following results are only approximate!!!
Test of Lack of Fit for Simple Linear Regression

Response: y
            Df Sum Sq Mean Sq F value Pr(>F) prediction ratio
Lack of Fit ...
Pure Error  ...

Example (cont'd)

[Figure: brainwt vs. bodywt with the fitted line; red denotes the approximation.]

geophones Data

Measurements of the thickness of a subsurface layer in a region of Alberta:

> geophones.lm <- lm(thickness ~ distance, data=geophones)
> summary(geophones.lm)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    ...       ...       ...   < 2e-16
distance       ...       ...       ...   ...e-16

Residual standard error: 5.6 on 54 degrees of freedom
Multiple R-Squared: 0.716, Adjusted R-squared: 0.71
F-statistic: 136 on 1 and 54 DF, p-value: 2.22e-16

Fitted model: ŷ = ... + ... x, with a noise standard deviation of 5.6.

> lof.lm(geophones.lm, approx=T)
The following results are only approximate!!!
Test of Lack of Fit for Simple Linear Regression

Response: y
            Df Sum Sq Mean Sq F value Pr(>F) prediction ratio
Lack of Fit ...
Pure Error  ...

Example (cont'd)

[Figure: thickness vs. distance with the fitted line; red denotes the approximation.]

Example - roller data

> lof.lm(roller.lm, approx = T)
The following results are only approximate!!!
Test of Lack of Fit for Simple Linear Regression

Response: y
            Df Sum Sq Mean Sq F value Pr(>F) prediction ratio
Lack of Fit ...
Pure Error  ...

Example (cont'd)

[Figure: depression vs. weight with the fitted line; red denotes the approximation.]

Example (cont'd)

Without the intercept:

> lof.lm(roller.0, approx = T, call.plot=F)
The following results are only approximate!!!
Test of Lack of Fit for Simple Linear Regression

Response: y
            Df Sum Sq Mean Sq F value Pr(>F) prediction ratio
Lack of Fit ...
Pure Error  ...

Lack of Fit Exercises

* Question 1 from Tutorial 8 for 2009
* Question 7 from the Final Examination for December ...

Ch. 6.5: Transformations

1. Variance stabilizing transformations; Box-Cox transformations
2. Transformations to linearize the model

Variance-Stabilizing Transformations

Model assumptions:

E[y | x] = β_0 + β_1 x,  V(y | x) = σ²

Set μ_y = E[y | x]. What if

V(y | x) = σ² f(μ_y),

where f(·) is some non-constant function? Try to find a function g(y) so that

V(g(y) | x) = constant.

Variance-Stabilizing Transformations (cont'd)

Then obtain a Taylor expansion of g(y) about μ_y:

g(y) = g(μ_y) + (y − μ_y) g′(μ_y) + (y − μ_y)² g″(μ_y)/2 + ···

Then

V(g(y)) ≈ V(y) (g′(μ_y))² = σ² f(μ_y) (g′(μ_y))².

V(g(y)) will be constant if

g′(μ_y) = 1/√f(μ_y),  i.e.  g′(z) = 1/√f(z).

Examples

1. f(x) = x (e.g. Poisson data):

1/√f(x) = x^(−1/2)  ⟹  g(y) = √y

[Figure: residuals vs. fitted values for the raw Poisson responses and after the square-root transformation.]

Examples (cont'd)

2. f(x) = x² (e.g. exponential data):

1/√f(x) = 1/x  ⟹  g(y) = log(y)

[Figure: residuals vs. fitted values for exponential responses.]

Examples (cont'd)

3. f(x) = x(1 − x) (e.g. binomial data):

1/√f(x) = 1/√(x(1 − x)),  and since  d/dx sin⁻¹(√x) = 1/(2√(x(1 − x))),

g(y) = arcsin(√y).

(A simulation illustrating case 1 follows.)
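A minimal simulation illustrating case 1: Poisson responses, whose variance grows with the mean, show a funnel-shaped residual plot, which the square-root transformation largely removes.

set.seed(1)
xx <- runif(200, 1, 10)
yy <- rpois(200, lambda = 5*xx)    # Var(y|x) grows with the mean
par(mfrow=c(1,2))
plot(lm(yy ~ xx), which=1)         # funnel shape in the residuals
plot(lm(sqrt(yy) ~ xx), which=1)   # roughly constant spread
par(mfrow=c(1,1))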

6.5.1 Box-Cox Transformations (on the response)

Select the power λ in the transformation g(y) = y^λ by maximum likelihood. This is equivalent to minimizing the SSE with respect to λ (and the other parameters).

Caution: the residual sums of squares are not comparable for different values of λ. We need to ensure that comparisons are made according to the same standard:

y^(λ) = (y^λ − 1) / (λ ẏ^(λ−1)),  λ ≠ 0
y^(λ) = ẏ log y,                  λ = 0

where ẏ = geometric mean of the y's.

Strategy

1. Perform the transformation y_1^(λ), ..., y_n^(λ) for several values of λ.
2. Compute the SSE for each value of λ.
3. Select the λ which gives the minimum value.
4. Fit y^(λ) = Xβ + ε.
5. Approximate confidence intervals for λ can also be obtained.
6. In R, use boxcox(y ~ x, data = dataset) from the MASS package.

A sketch of steps 1-3 done by hand is given below.
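A minimal sketch of steps 1-3, computing the SSE profile by hand for the bacteria data of the next example (p5.3 from the MPV package); MASS::boxcox() automates the equivalent profile-likelihood calculation.

library(MPV)                       # assumes the MPV package is installed
data(p5.3)
y <- p5.3$bact; tm <- p5.3$min
gm <- exp(mean(log(y)))            # geometric mean of y
lambdas <- seq(-1, 1, by = 0.05)
sse <- sapply(lambdas, function(l) {
  # normalized transformation y^(lambda), so SSE values are comparable
  z <- if (abs(l) < 1e-8) gm*log(y) else (y^l - 1)/(l*gm^(l - 1))
  deviance(lm(z ~ tm))             # deviance of an lm fit is its SSE
})
plot(lambdas, sse, type="l")
lambdas[which.min(sse)]            # near 0 here, pointing to a log transform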

Example 1

Bacteria data - the average number of surviving bacteria (y) in a canned food product versus time (t) of exposure to 300°F heat.

Example 1 (cont'd)

> library(MPV)
> data(p5.3)
> bact.lm <- lm(bact ~ min, data=p5.3)
> plot(bact.lm, which=1)
> plot(bact.lm, which=2)
> library(MASS)
> boxcox(bact.lm)
> bactlog.lm <- lm(log(bact) ~ min, data=p5.3)
> plot(bactlog.lm, which=1)
> plot(bactlog.lm, which=2)

Residuals vs. Fitted

[Figure: residuals vs. fitted values for lm(bact ~ min, data = p5.3).]

Q-Q Plot

[Figure: normal Q-Q plot of the standardized residuals for lm(bact ~ min, data = p5.3).]

Box-Cox

[Figure: Box-Cox profile log-likelihood vs. λ, with a 95% confidence interval for λ.]

Residuals vs. Fitted (after log-transforming)

[Figure: residuals vs. fitted values for lm(log(bact) ~ min, data = p5.3).]

Q-Q Plot (after log-transforming)

[Figure: normal Q-Q plot of the standardized residuals for lm(log(bact) ~ min, data = p5.3).]

Example (cont'd)

A model of the form

log(y) = β_0 + β_1 t + ε

is reasonable, especially since β̂_1 is negative (β̂_1 = −.236).

Example 2

trees data: 31 observations on Girth (g), Height (h) and Volume (V).

A simple model: V ≈ g²h/(4π), or

log V = β_0 + β_1 log h + β_2 log g + ε.

Example 2 (cont'd)

> library(DAAG)
> data(trees); attach(trees)
> trees.lm <- lm(log(Volume) ~ log(Girth) + log(Height))
> boxcox(trees.lm)   # (lambda = 1 is OK)
> summary(trees.lm)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    ...       ...       ...   ...e-09
log(Girth)     ...       ...       ...   < 2e-16
log(Height)    ...       ...       ...   ...e-06

Example 2 (cont'd) - Box-Cox after Transforming

[Figure: Box-Cox profile log-likelihood vs. λ, with a 95% confidence interval, for the transformed model.]

The coefficient of log(Height) is not distinguishable from 1, and the coefficient of log(Girth) is not distinguishable from 2.

Exercises

* Question 2 from the December, 2010 Final Examination
* Question 6 from the December, 2009 Final Examination
* Question 5acd from the December, 2008 Final Examination


More information

14 Multiple Linear Regression

14 Multiple Linear Regression B.Sc./Cert./M.Sc. Qualif. - Statistics: Theory and Practice 14 Multiple Linear Regression 14.1 The multiple linear regression model In simple linear regression, the response variable y is expressed in

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Weighted Least Squares

Weighted Least Squares Weighted Least Squares The standard linear model assumes that Var(ε i ) = σ 2 for i = 1,..., n. As we have seen, however, there are instances where Var(Y X = x i ) = Var(ε i ) = σ2 w i. Here w 1,..., w

More information

Exam Applied Statistical Regression. Good Luck!

Exam Applied Statistical Regression. Good Luck! Dr. M. Dettling Summer 2011 Exam Applied Statistical Regression Approved: Tables: Note: Any written material, calculator (without communication facility). Attached. All tests have to be done at the 5%-level.

More information

STATISTICS 174: APPLIED STATISTICS TAKE-HOME FINAL EXAM POSTED ON WEBPAGE: 6:00 pm, DECEMBER 6, 2004 HAND IN BY: 6:00 pm, DECEMBER 7, 2004 This is a

STATISTICS 174: APPLIED STATISTICS TAKE-HOME FINAL EXAM POSTED ON WEBPAGE: 6:00 pm, DECEMBER 6, 2004 HAND IN BY: 6:00 pm, DECEMBER 7, 2004 This is a STATISTICS 174: APPLIED STATISTICS TAKE-HOME FINAL EXAM POSTED ON WEBPAGE: 6:00 pm, DECEMBER 6, 2004 HAND IN BY: 6:00 pm, DECEMBER 7, 2004 This is a take-home exam. You are expected to work on it by yourself

More information

Circle a single answer for each multiple choice question. Your choice should be made clearly.

Circle a single answer for each multiple choice question. Your choice should be made clearly. TEST #1 STA 4853 March 4, 215 Name: Please read the following directions. DO NOT TURN THE PAGE UNTIL INSTRUCTED TO DO SO Directions This exam is closed book and closed notes. There are 31 questions. Circle

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression ST 430/514 Recall: A regression model describes how a dependent variable (or response) Y is affected, on average, by one or more independent variables (or factors, or covariates)

More information

Regression Diagnostics

Regression Diagnostics Diag 1 / 78 Regression Diagnostics Paul E. Johnson 1 2 1 Department of Political Science 2 Center for Research Methods and Data Analysis, University of Kansas 2015 Diag 2 / 78 Outline 1 Introduction 2

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression ST 370 Regression models are used to study the relationship of a response variable and one or more predictors. The response is also called the dependent variable, and the predictors

More information

LAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION

LAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION LAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION In this lab you will first learn how to display the relationship between two quantitative variables with a scatterplot and also how to measure the strength of

More information