Chapter 8. R-squared, Adjusted R-Squared, the F test, and Multicollinearity


This chapter discusses additional output in the regression analysis, from the context of multiple regression in the classic model. It also discusses multicollinearity, its effects, and remedies.

8.1 The R-squared Statistic

The population R² statistic was introduced in Chapter 6 as Φ = 1 − E{v(X)}/Var(Y), where v(x) is the conditional variance of Y given X = x. This number tells you how well the X variable(s) predict your Y variable. Since the entire focus of this book is on conditional distributions p(y|x), I'd like you to understand the prediction concept in terms of separation of the distributions p(y | X = low) and p(y | X = high).

For example, suppose the true model is Y = 6 + 0.2X + ε, where X ~ N(20, 5²) and Var(ε) = σ². Then Var(Y) = σY² = 1 + σ², and v(x) = σ², implying Φ = 1 − σ²/(1 + σ²) = 1/(1 + σ²). Three cases I'd like you to consider: (i) σ² = 9.0, implying a low Φ = 0.1, (ii) σ² = 1.0, implying a medium value Φ = 0.5, and (iii) σ² = 1/9, implying a high Φ = 0.9.

In all cases, let's say a low value of X is 15.0, one standard deviation below the mean, and a high value of X is 25.0, one standard deviation above the mean. Now, when X = 15, the distribution p(y | X = 15) is the N(6 + 0.2(15) = 9.0, σ²) distribution; and when X = 25, the distribution p(y | X = 25) is the N(6 + 0.2(25) = 11.0, σ²) distribution. Figure 8.1.1 displays these distributions for the three cases above, where the population R² is either 0.1, 0.5, or 0.9 (which happens in this study when σ² is 9.0, 1.0, or 1/9, respectively). Notice that there is greater separation of the distributions p(y|x) when the population R² is higher.
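Here is a minimal R sketch that draws the two conditional densities p(y | X = 15) and p(y | X = 25) for each of the three values of σ² above; the plotting details are illustrative choices, not the code behind Figure 8.1.1.

## Separation of p(y | X = 15) and p(y | X = 25) under the model Y = 6 + 0.2*X + eps
sigma2 = c(9.0, 1.0, 1/9)                    # population R-squared = 0.1, 0.5, 0.9
ylist = seq(4, 16, 0.01)
par(mfrow = c(3, 1), mar = c(4, 4, 2, 1))
for (s2 in sigma2) {
  plot(ylist, dnorm(ylist, mean = 9.0, sd = sqrt(s2)), type = "l", col = "blue",
       xlab = "y", ylab = "density",
       main = paste0("Population R-squared = ", round(1/(1 + s2), 2)))
  points(ylist, dnorm(ylist, mean = 11.0, sd = sqrt(s2)), type = "l", col = "red")
}

The larger the population R², the less the blue (X = 15) and red (X = 25) curves overlap.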

Figure 8.1.1. Separation of distributions p(y | X = low) (left distributions) and p(y | X = high) (right distributions) in cases where the population R² is 0.1 (top panel), 0.5 (middle panel), and 0.9 (bottom panel). In all cases X = low and X = high refer to an X that is either one standard deviation below the mean or one standard deviation above the mean.

In the case of the classic regression model, which is instantiated by Figure 8.1.0, the conditional variance Var(Y | X = x) = v(x) is a constant σ², and does not depend on X = x. Also in the classic regression model, the maximum likelihood estimate of σ² is

σ̂² = SSE/n, where SSE = Σi=1..n (yi − ŷi)²,

the sum of squared vertical deviations from the yi values to the fitted OLS function.

The unconditional variance is Var(Y) = σY², so the population R² statistic in the classic regression model is

Φ = 1 − σ²/σY².

The maximum likelihood estimate of σY² is

σ̂Y² = SST/n, where SST = Σi=1..n (yi − ȳ)²,

the total sum of squared vertical deviations from the yi values to the flat line where y = ȳ. See Figure 8.1.2.

Figure 8.1.2. Scatterplot of n = 4 data points (indicated by X's). The horizontal red line is the y = ȳ line and the diagonal blue line is the least squares line. Vertical deviations from the y = ȳ line are shown in red; SST is the sum of these squared deviations. Vertical deviations from the least squares line are shown in blue; SSE is the sum of these squared deviations. The R² statistic equals 1 − SSE/SST.
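The following small sketch mimics Figure 8.1.2 with four made-up data points: SST comes from the flat line y = ȳ, SSE comes from the least squares line, and R² = 1 − SSE/SST.

## Four hypothetical data points, the y = ybar line, the least squares line, and R-squared
x = c(1, 2, 3, 4); y = c(2.0, 2.5, 4.5, 4.0)
fit = lm(y ~ x)
plot(x, y, pch = 4)                  # points drawn as X's
abline(h = mean(y), col = "red")     # flat line y = ybar (deviations give SST)
abline(fit, col = "blue")            # least squares line (deviations give SSE)
SST = sum((y - mean(y))^2)
SSE = sum(resid(fit)^2)
c(SST = SST, SSE = SSE, R.squared = 1 - SSE/SST)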

Using the maximum likelihood estimates of the conditional and unconditional variance, you get the estimate of the population R-squared statistic,

R² = 1 − (SSE/n)/(SST/n) = 1 − SSE/SST.

Recall Chapter 5, where I compared different transformations of the X variable. The model with the highest maximized log likelihood was the one with the smallest estimated conditional variance SSE/n, hence it was also the model with the smallest SSE, since n is always the same when considering different models for the same data set. Also, SST is always the same when considering different models for the same data set, because SST does not involve the predicted values from the model. Thus, among the different models having different transformed X variables (this discussion refers to X transformations only, not Y transformations), the model with the highest log likelihood corresponds precisely to the model with the highest R² statistic.

While it is mathematically factual that 0 ≤ R² ≤ 1.0, there is no Ugly Rule of Thumb for how large an R² statistic should be to be considered good. Rather, it depends on norms for the given subject area: In finance, any non-zero R² for predicting stock returns is interesting, because the efficient markets hypothesis states that the population R² is zero in this case. In chemical reaction modeling, the outputs are essentially deterministic functions of the inputs, so an R² statistic that is less than 1.0, e.g. 0.99, may not be good enough, because it indicates faulty experimental procedures. With human subjects and models to predict their behavior, the R² statistics are typically less than 0.50 because people are, well, people. We have our own minds, and are not robots that can be pigeon-holed by some regression model.

My advice is to rely less on R², and more on separation of distributions as seen in Figure 8.1.1. When we get to more complex models, the usual R² statistic becomes less interpretable, and in some cases it is non-existent. But you will always have conditional distributions p(y|x), and you can always graph those distributions as shown in Figure 8.1.1 to see how well your X predicts your Y.

8.2 The Adjusted R-Squared Statistic

Recall that, in the classic model, Φ = 1 − σ²/σY², and that the standard R² statistic replaces the two variances with their maximum likelihood estimates. Recall also that maximum likelihood estimates of variance are slightly biased. Replacing the variances with their unbiased estimates gives the adjusted R² statistic:

Ra² = 1 − {SSE/(n − k − 1)}/{SST/(n − 1)}

With a larger number of predictor variables k, the ordinary R² tends to be increasingly biased upward; the adjusted R² statistic is less biased. You can interpret the adjusted R² statistic in the same way as the ordinary one, but note that the adjusted R² statistic can give values less than 0.0, which are clearly bad estimates since the estimand Φ cannot be negative.
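The possibility of a negative adjusted R² is easy to see by simulation. In the sketch below, the response is pure noise, unrelated to three simulated predictors; the ordinary R² is positive by construction, but the adjusted version can easily come out negative (rerun with other seeds to see the variability).

## With pure-noise predictors, adjusted R-squared is often negative
set.seed(1)
n = 20; k = 3
y = rnorm(n)
X1 = rnorm(n); X2 = rnorm(n); X3 = rnorm(n)   # unrelated to y
fit = lm(y ~ X1 + X2 + X3)
SSE = sum(resid(fit)^2); SST = sum((y - mean(y))^2)
c(R.squared = 1 - SSE/SST,
  R.squared.adj = 1 - (SSE/(n - k - 1))/(SST/(n - 1)),
  from.summary = summary(fit)$adj.r.squared)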

The following R code indicates where these statistics are, as well as by-hand calculations of them.

sales = read.table("...")   # file path not shown
attach(sales); Y = NSOLD; X1 = INTRATE^-1; X2 = PPGGAS
n = nrow(sales)
fit = lm(Y ~ X1 + X2); summary(fit)
SST = sum( (Y - mean(Y))^2 )
SSE = sum(fit$residuals^2)
## By-hand calculations of R-squared statistics
R.squared = 1 - SSE/SST
R.squared.adj = 1 - (SSE/(n-3))/(SST/(n-1))
R.squared; R.squared.adj

The summary of the fit shows the following R² and adjusted R² statistics:

Multiple R-squared:  … ,    Adjusted R-squared:  …
F-statistic:  … on 2 and 15 DF,  p-value: 3.50e-08

The by-hand calculations agree:

> R.squared; R.squared.adj
[1] …
[1] …

8.3 The F Test

See the R output a few lines above: underneath the R² statistic is the F-statistic. This statistic is related to the R² statistic in that it is also a function of SST and SSE (see Figure 8.1.2 again). It is given by

F = {(SST − SSE)/k}/{SSE/(n − k − 1)}.

If you add the line ((SST-SSE)/2)/(SSE/(n-3)) to the R code above, you will get the reported F-statistic, although with more decimals. With a little algebra, you can relate this directly to the R² statistic, showing that for fixed k and n, larger R² corresponds to larger F:

F = {(n − k − 1)/k} R²/(1 − R²)

This statistic is used to test the global null hypothesis H0: β1 = β2 = … = βk = 0. In loose words, this hypothesis states that none of the regression variables X1, X2, …, or Xk is related to Y. Under the classic model where H0: β1 = β2 = … = βk = 0 is true, it can be proven mathematically that

F ~ F(k, n − k − 1)
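As a quick check on these formulas (assuming the objects fit, SST, SSE, and n from the sales code above are still in your workspace), the F statistic computed from SST and SSE, the F statistic computed from R², and the F statistic reported by summary() should all agree:

## Three routes to the same F statistic (k = 2 predictors here)
k = 2
F.from.SS = ((SST - SSE)/k) / (SSE/(n - k - 1))
R.squared = 1 - SSE/SST
F.from.R2 = ((n - k - 1)/k) * R.squared/(1 - R.squared)
F.reported = summary(fit)$fstatistic[1]
c(F.from.SS, F.from.R2, F.reported)
1 - pf(F.from.SS, k, n - k - 1)   # the global F test p-value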

In other words, the null distribution of the F statistic is the F distribution with k numerator degrees of freedom and n − k − 1 denominator degrees of freedom. Recall also that the degrees of freedom for error, dfe, was given by dfe = n − k − 1. The numerator degrees of freedom, k, is sometimes called the model degrees of freedom, hence symbolized as dfm, because it represents the flexibility (freedom) of the model. (For example, the quadratic regression model, which has dfm = 2, is more flexible than the linear model, which has dfm = 1.)

When H0: β1 = β2 = … = βk = 0 is true, the theoretical R² statistic is exactly Φ = 0. And when H0 is false you get larger values of R², hence larger F-statistics. Unlike the t-test for testing individual regression coefficients, the p-value for testing H0: β1 = β2 = … = βk = 0 via the F test considers the extreme values of F to be only the large values, not both the large and the small ones: smaller F values are expected under H0.

To understand the F statistic (when it is small and when it is large, its distribution, and the chance-only model where β1 = β2 = … = βk = 0), you should use simulation. (As always!)

Simulation Study to Understand the F Statistic

sales = read.table("...")   # file path not shown
attach(sales); X1 = INTRATE^-1; X2 = PPGGAS
n = nrow(sales)
Y = 5 + 0*X1 + 0*X2 + rnorm(n,0,4)  ## Notice the 0's: The null model is true
fit = lm(Y ~ X1 + X2); summary(fit)

The code above generates data Y that is unrelated to either X1 or X2; in other words, the null hypothesis H0: β1 = β2 = 0 is in fact true. Running the code above gives some particular F statistic (yours will vary by randomness). But to understand the range of possible F values that are explained by chance alone, you need to repeat this simulation many (ideally, infinitely many) times. So let's simulate a bunch of 'em, save the F values, draw their histogram, and overlay the theoretically correct F(dfm, dfe) density.

R Code for Figure 8.3.1

Nsim = 10000
Fsim.null = numeric(Nsim)
Fsim.alt = numeric(Nsim)
for (i in 1:Nsim) {
  Y.null = 5 + 0*X1 + 0*X2 + rnorm(n,0,4)
  Y.alt = 2*X1 + 50*X2 + rnorm(n,0,4)
  fit.null = lm(Y.null ~ X1 + X2)
  fit.alt = lm(Y.alt ~ X1 + X2)
  Fsim.null[i] = summary(fit.null)$fstatistic[1]
  Fsim.alt[i] = summary(fit.alt)$fstatistic[1]
}
par(mfrow=c(3,1))
par(mar=c(4,4,1,1))
hist(Fsim.null, breaks=100, freq=F, main="", xlab="F value")
hist(Fsim.null, breaks=100, freq=F, main="", xlab="F value")
flist = seq(0,15,.01)
fdist = df(flist, 2, 15)
crit = qf(.95, 2, 15)
points(flist, fdist, type="l")
abline(v=crit, col="blue")
hist(Fsim.null, breaks=100, freq=F, main="", xlab="F value")
points(flist, fdist, type="l")
abline(v=crit, col="blue")
hist(Fsim.alt, breaks=100, freq=F, add=T, lty=2, border="red")
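As a follow-up to the simulation (reusing Fsim.null, Fsim.alt, and crit from the code above), you can also check how often the F test rejects at the 0.05 level under each model; under the null model the fraction should be close to 0.05, and under the alternative model it estimates the power of the test.

## Fraction of simulated F statistics exceeding the 0.95 quantile of F(2, 15)
mean(Fsim.null > crit)   # should be near 0.05 under the null model
mean(Fsim.alt > crit)    # estimated power of the global F test under this alternative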

Figure 8.3.1. Top panel: Histogram of 10,000 simulated F statistics under the null model. Middle panel: Same as top panel but with the theoretically correct F(2, 15) distribution overlaid (solid black curve), as well as its 0.95 quantile 3.68 (blue line). Bottom panel: Same as middle panel, but with the histogram of 10,000 simulated F statistics under an alternative model superimposed (red histogram).

As seen in Figure 8.3.1, the observed F statistic from the original data is off the chart, and hence does not appear to be easily explained by the null model where β1 = β2 = 0. The p-value is calculated from the solid F(2, 15) curve shown in the middle and bottom panels

of Figure 8.3.1; it is the area under that curve beyond the observed F statistic, and is calculated in R as 1 - pf(F, 2, 15), where F is the observed statistic, giving 3.50e-08, in agreement with the p-value shown in the lm output above. The conclusion is that the F statistic is not easily explained under the model where β1 = β2 = 0, so it is logical to conclude that (β1 = β2 = 0) is not true; i.e., it is logical to conclude that either β1 ≠ 0, or β2 ≠ 0, or that both β1 and β2 differ from 0.

Be careful, though: the F test is not specific. A significant F test does not tell you that both parameters differ from zero, nor can it identify which parameter differs from zero. It can only tell you that at least one parameter (either β1 or β2) differs from zero.

8.4 Multicollinearity

Multicollinearity (MC) refers to the X variables being collinear to varying degrees. In the case of two X variables, X1 and X2, collinearity means that the two variables are close to linearly related. Perfect multicollinearity means that they are perfectly linearly related. See Figure 8.4.1.

Figure 8.4.1. Left panel: Collinear X variables whose correlation is high, but less than 1.0. Right panel: Perfectly collinear X variables having correlation 1.0.

Often, multicollinearity with just two X variables is called simply "collinearity"; Figure 8.4.1 illustrates the meaning of the term "collinear." With more X variables, it is not so easy to visualize multicollinearity. But if one of the X variables, say Xj, is closely related to all the other X variables via

Xj ≈ a0 + a1X1 + … + aj−1Xj−1 + aj+1Xj+1 + … + akXk,

then there is multicollinearity. And if the "≈" is in fact an "=" in the equation above, then there is a perfect multicollinearity.

A perfect multicollinearity causes the X^T X matrix to be non-invertible, implying that there are no unique least squares estimates. Equations 0 through k shown in Section 7.1 can still be solved for estimates of the β's, but there are infinitely many solutions, so it is unclear what the effects of the individual X variables are.

To understand this infinity of solutions for the estimated β's, consider the case where there is only one X variable. A perfect multicollinearity in this case means that X1 = a0, a constant, so that the X1 column is perfectly related to the intercept column of 1's; i.e., x1 = a0·1. Figure 8.4.2 shows how data might look in this case, where xi = 10 for every i = 1, 2, …, n, and also shows several possible least squares fits, all of which have the same sum of squared errors.

Figure 8.4.2. Non-unique least squares fits, all of which provide the minimum SSE, when the X column of data is perfectly related to the intercept column.

A similar phenomenon happens in the case of two X variables, as shown in the right panel of Figure 8.4.1: there is an infinity of planar functions (review the planar-function figure in Chapter 7) of X1 and X2 that all minimize the SSE.
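You can see the non-invertibility of X^T X directly in R. In the sketch below (the numbers are made up), one column of the design matrix is an exact linear function of another, and solve() refuses to invert X^T X:

## With perfectly collinear columns, t(X) %*% X is singular
x1 = c(1, 2, 3, 4, 5)
x2 = 2*x1 - 1                       # exact linear function of x1
X = cbind(Intercept = 1, x1, x2)    # design matrix
XtX = t(X) %*% X
det(XtX)                            # essentially zero
try(solve(XtX))                     # error: the matrix cannot be inverted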

In R, you will get one of these infinitely many estimated planar functions, but you can't trust the parameter estimates, because, again, they are just one of an infinity of possible estimates. For example, the R code below generates perfectly collinear (X1, X2) data, then generates Y data from these X data that satisfy all regression assumptions.

set.seed(1345)
X1 = rnorm(100)
X2 = 2*X1 - 1
Y = 1 + 2*X1 + 3*X2 + rnorm(100,0,1)
summary(lm(Y ~ X1 + X2))

This code produces the following output:

Call:
lm(formula = Y ~ X1 + X2)

Residuals:
     Min       1Q   Median       3Q      Max
       …        …        …        …        …

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)        …          …       …   <2e-16 ***
X1                 …          …       …   <2e-16 ***
X2                NA         NA      NA       NA
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: … on 98 degrees of freedom
Multiple R-squared:  … ,    Adjusted R-squared:  …
F-statistic: 7888 on 1 and 98 DF,  p-value: < 2.2e-16

Notice the NA for the coefficient of X2. Recognizing that X^T X is not invertible, and hence that there are infinitely many solutions for the estimated β's, R simply assigned β̂2 = 0 and estimated only β1. (While the reported estimate for X1 and the assigned 0 for X2 do not correspond well with the R code, where β1 = 2 and β2 = 3, the estimate actually makes sense when you replace X2 with 2X1 − 1 in the model equation; then you see that the true multiplier of X1 is exactly 2 + 3(2) = 8.) Note also the comment "Coefficients: (1 not defined because of singularities)". In matrix algebra, "singular" means "not invertible"; the comment lets you know that R recognizes that X^T X is not invertible.

To visualize the infinity of solutions for the regression plane in the example above, have a look at the 3-D representation of the data just simulated, in Figure 8.4.3 below. In that graph, there are infinitely many planes that will separate the positive and negative residuals as shown: some are steeper on one side of the vertical sheet where the data lie, and some are steeper on the other side of the sheet.
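A quick check, reusing X1, X2, and Y from the code above, confirms that algebra: substituting X2 = 2X1 − 1 into the true model 1 + 2X1 + 3X2 gives −2 + 8X1, so the one slope that R does estimate should be close to 8 (and the intercept close to −2):

## The lone estimated slope reflects the substituted model Y = -2 + 8*X1 + error
coef(lm(Y ~ X1 + X2))   # X2 coefficient is NA; X1 coefficient is near 8
coef(lm(Y ~ X1))        # the same fit, written with X1 only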

Figure 8.4.3. Three-D scatterplot of data where the X variables are perfectly collinear. The data lie in a vertical sheet above the line of collinearity on the (X1, X2) plane. There are infinitely many planes for which the given blue points are above and the given red points are below.

Intuitively, it makes sense that you cannot estimate the coefficients uniquely when there is perfect multicollinearity. Recall that β2 is the difference between the means of the distributions of potentially observable Y values in two cohorts:

Cohort 1: X1 = x1, X2 = x2
Cohort 2: X1 = x1, X2 = x2 + 1

However, if X2 is perfectly related to X1, it is impossible to increase X2 while leaving X1 fixed.

Hence, with perfectly collinear (X1, X2) variables, it is simply impossible to estimate the effect of larger X2 when X1 is held fixed. See the right panel of Figure 8.4.1: you cannot increase X2 while holding X1 constant.

The intuitive logic that you cannot estimate the effect of increasing X2 while X1 is held constant in the case of perfectly collinear X variables also explains the problem with near-perfect collinearity, as shown in the left panel of Figure 8.4.1. Since the data are so closely related, there is very little variation in X2 when you fix X1, say, by drawing a vertical line over any particular value of X1. Recall also that, to estimate the effect of an X variable on Y, you need variation in that X variable. The relevant variation in the case of multiple regression, where you are estimating the effect of an X variable holding the other variables fixed, is exactly the variation in that X variable when the other variables are fixed. If there is little such variation, as shown in the left panel of Figure 8.4.1, you will get unique estimates of the β's, but they will be relatively imprecise estimates because, again, there is so little relevant variation in the X data.

Therefore, the main problem with multicollinearity is that the β's are (relative to the case where the X variables are unrelated) imprecisely estimated. This imprecision manifests itself in higher standard errors of the estimated β's. There is a simple formula to explain how multicollinearity affects the standard errors of the estimated β's. Recall from Chapter 7, Section 3, that s.e.(β̂j) = σ̂ √cjj, j = 0, 1, 2, …, k. In simple regression, where there is just one X variable, this expression reduces to the form you saw in Chapter 3,

s.e.(β̂1) = σ̂ / {sx √(n − 1)}.    (8.4.1)

Some fairly complicated matrix algebra gives the following representation of the standard errors for the multiple regression case:

s.e.(β̂j) = σ̂ √cjj = [σ̂ / {sxj √(n − 1)}] × {1/(1 − Rj²)}^(1/2).

Here, Rj² is the R-squared statistic that you get by regressing Xj on all the other X variables. Higher Rj² is an indication of more extreme multicollinearity, and of its effect on the precision of the estimate β̂j. Two important special cases are (1) Rj² = 0, in which case the standard error formula for β̂j is exactly as given in the simple regression where there is only one X variable, see equation

(8.4.1) above, and (2) Rj² → 1, in which case the standard error tends to infinity, which is expected because when Xj is increasingly related to the other X variables, there is less and less variation in Xj when all other X variables are held fixed.

The term 1/(1 − Rj²) is called the variance inflation factor because it measures how much larger the variance of β̂j is due to multicollinearity. By the same token, {1/(1 − Rj²)}^(1/2) can be called a standard error inflation factor.

Example: Illustrating the effects of MC in a Simulation Study

## R code to illustrate the effects of MC
## This data set shows what happens with highly MC data. Note that the
## model has a highly significant F statistic (p-value = 2.306e-08),
## but neither X variable is significant via their t statistics. The MC
## between X1 and X2 makes it difficult to assess the effect
## of X1 when X2 is held fixed, and vice versa.
set.seed(1345)
x1 = rep(1:10, each=10)
x2 = x1 + rnorm(100, 0, .05)  # X2 differs from X1 by N(0, 0.05^2) random variation.
## You can see the collinearity in the graph:
plot(x1, x2)
## The true model has beta0 = 7, beta1 = 1, and beta2 = 1,
## with all assumptions satisfied.
y = 7 + x1 + x2 + rnorm(100, 0, 10)
dat.high.mc = data.frame(y, x1, x2)
high.mc = lm(y ~ x1 + x2, data = dat.high.mc)
summary(high.mc)

The output shows:

Call:
lm(formula = y ~ x1 + x2, data = dat.high.mc)

Residuals:
    Min      1Q  Median      3Q     Max
      …       …       …       …       …

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)        …          …       …        …  **
x1                 …          …       …        …
x2                 …          …       …        …
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: … on 97 degrees of freedom
Multiple R-squared:  … ,    Adjusted R-squared:  …
F-statistic:  … on 2 and 97 DF,  p-value: 2.306e-08

In the example above, the Rj² statistics (obtained via summary(lm(x1 ~ x2)) and summary(lm(x2 ~ x1))) are both extremely close to 1.0, implying a very large standard error inflation factor {1/(1 − Rj²)}^(1/2). (In the case of simple regression, the R² statistic is equal to the square of the correlation coefficient, so you get the same Rj² in both regressions. However, with more than two X variables, the Rj² statistics will all be different. Some of the X variables will be more highly related to the others; these are the variables that suffer most from multicollinearity.) Thus, the standard errors of β̂1 and β̂2 are many times larger than they would have been had the variables been uncorrelated. A slight modification of the simulation model that keeps everything else the same (same n, same σ, nearly the same variances of X1 and X2) except with uncorrelated X variables verifies this:

set.seed(1345)
x1 = rep(1:10, each=10)
x2 = rep(1:10, 10)
## You can see the lack of collinearity in the graph:
plot(x1, x2)
## The true model has beta0 = 7, beta1 = 1, and beta2 = 1,
## with all assumptions satisfied.
y = 7 + x1 + x2 + rnorm(100, 0, 10)
dat.no.mc = data.frame(y, x1, x2)
no.mc = lm(y ~ x1 + x2, data = dat.no.mc)
summary(no.mc)

The output shows:

Call:
lm(formula = y ~ x1 + x2, data = dat.no.mc)

Residuals:
    Min      1Q  Median      3Q     Max
      …       …       …       …       …

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)        …          …       …        …  **
x1                 …          …       …        …  **
x2                 …          …       …        …
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: … on 97 degrees of freedom
Multiple R-squared: 0.157,    Adjusted R-squared:  …
F-statistic:  … on 2 and 97 DF,  p-value: …
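To attach numbers to the inflation factors for these two data sets (reusing dat.high.mc and dat.no.mc from the code above), you can compute Rj² and the corresponding variance and standard error inflation factors by hand:

## By-hand variance inflation factors for the two simulated data sets
R2.high = summary(lm(x1 ~ x2, data = dat.high.mc))$r.squared
R2.no = summary(lm(x1 ~ x2, data = dat.no.mc))$r.squared
c(VIF.high = 1/(1 - R2.high), VIF.no = 1/(1 - R2.no))                 # variance inflation factors
c(SEIF.high = sqrt(1/(1 - R2.high)), SEIF.no = sqrt(1/(1 - R2.no)))   # standard error inflation factors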

The Rj² statistics relating X1 to X2 and X2 to X1 are both 0.0 in this second example. Multiplying the standard errors from this no-multicollinearity analysis by the standard error inflation factor from the first example (the one with high multicollinearity) gives values reasonably close to the standard errors in the original analysis with multicollinear variables; the differences are mostly explained by randomness in the estimates σ̂.

Summary of Multicollinearity (MC) and its effects

1. MC exists when the X's are correlated (i.e., almost always). It does not involve the Y's. Existence of MC violates none of the classic model assumptions. (Some books and web documents incorrectly state that there is an assumption of "no MC" in regression analysis.)

2. Greater MC causes larger standard errors of the parameter estimates. This means that your estimates of the parameters tend to be less precise with higher degrees of MC. You will tend to have more insignificant tests and wider confidence intervals in these cases. This happens because when X1 and X2 are closely related, the data cannot isolate the unique effect of X1 on Y, controlling for X2, as precisely as is the case when X1 and X2 are not closely related.

3. The more the MC, the less interpretable are the parameters. In particular, β1 is the effect of varying X1 when the other X's are held fixed. But it becomes difficult to even imagine varying X1 while holding X2 fixed when X1 and X2 are extremely highly correlated.

4. MC almost always exists in observational data, and often exists in experimental data as well. The question is therefore not "is there MC?", but rather "how strong is the MC and what are its effects?" Generally, the higher the correlations among the X's, the greater the degree of MC, and the greater the effects (high parameter standard errors; tenuous parameter interpretation).

5. The extreme case of MC is called perfect MC, and happens when the columns of the X matrix are perfectly linearly dependent, in which case there are no unique least squares estimates. The fact that there are no unique LSEs in this case does not mean you can't proceed; you can still estimate parameters (albeit non-uniquely) and make valid predictions resulting from such estimates. Most computer software allows you to estimate models in this case, but provides a warning message or other unusual output (such as R's NA for some parameter estimates) that you should pay attention to.

6. Regression models that are estimated using MC data can still be useful. There is no absolute requirement that MC be below a certain level. In fact, in some cases it is strongly recommended that highly correlated variables be retained in the model. For

example, in most cases you should include the linear term in a quadratic model, even though the linear and quadratic terms are highly correlated. This is called the "Variable Inclusion Principle"; more on this in the next chapter.

7. It is most important that you simply recognize the effects of multicollinearity, which are (i) high variances of parameter estimates, (ii) tenuous parameter interpretations, and (iii) in the extreme case of perfect multicollinearity, non-existence of unique least squares estimates.

When might MC be a Problem?

It makes no sense to test for MC in the usual hypothesis testing H0 vs. H1 sense. The following are not tests, they are just suggestions, essentially Ugly Rules of Thumb, aimed to help identify when MC might be a problem.

1. When correlations between the X variables are extremely high (e.g., many greater than 0.9) or variance inflation factors are very high (e.g., greater than 9.0, implying a standard error inflation factor greater than 3.0).

2. When variables that are important a priori are insignificant, you might suspect a MC problem (but consider also whether the sample size is simply too small).

What to do about MC?

1. Main Solution: Diagnose the problem/understand its effects. Display the correlation matrix of the X variables and analyze the variance inflation factors. MC always exists to a degree, and need not be removed, especially if MC is not severe; it violates no assumptions. You don't necessarily have to do anything at all about it.

2. In some cases, you can avoid using MC variables. Here are some suggestions. Evaluate them in your particular situation to see if they make sense; every situation is different.

a. Drop less important and/or redundant X variables.

b. Combine X variables into an index. For example, if X1, X2, and X3 are all measuring the same thing, then you might use their sum or average in the model in place of the original three X variables.

c. Use principal components to reduce the dimensionality of the X variables (this is discussed in courses in Multivariate Analysis; maybe also later in this book, I have not yet decided whether to include it).

d. Use common factors (or latent variables) to represent the correlated X variables, and fit a structural equations model relating the response Y to these common factors. This is a somewhat controversial solution because the common factors are unobservable, and therefore cannot be used for prediction. Nevertheless, this model is quite common in behavioral research. It is discussed in courses in Multivariate Analysis, but not here. I will use the related "combine into an index" approach (see 2.b. above) instead, which is very similar to the latent variable-based analysis, in some ways better, and in some ways worse.

e. Use ratios in size-related cases. For example, if you have the two firm-level variables X1 = Total Sales and X2 = Total Assets in your model, they are bound to be highly correlated. So you might use the two variables X1 = (Total Assets)/(Total Sales) and X2 = (Total Sales) (perhaps in log form) in your model instead of the two variables (Total Sales) and (Total Assets).

3. In some cases, you must simply leave multicollinear variables in the model. These cases include:

a. Predictive Multicollinearity: Two variables can be highly correlated, but both are essential for predicting Y. When you leave one or the other out of the model, you get a much poorer model (much lower R²). In the data set Turtles, if you predict a turtle's sex from its length and height, you will find that length and height are highly correlated (R² = 0.97). But you have to include them both in the model, because R²(length, height) = 0.61, whereas R²(length) = 0.31 and R²(height) alone is also much smaller. The scientific conclusion is that turtle sex is more related to turtle shape, a combination of length and height, than it is to either length or height individually. This probably makes sense to a biologist who studies the reproductive biology of turtles.

b. Variable Inclusion Rules: Whenever you include higher order terms in a model, you should also include the implied lower order terms. For example, if you include X² in the model, then you should also include X. But X and X² are highly correlated. Nevertheless, both X and X² should be used in the model, despite the fact that they are highly correlated, for reasons I will give in the next chapter.

c. Research Hypotheses: Your main research hypothesis is to assess the effect of X1, but you recognize that the effect of X1 on Y might be confounded by X2. If this is the case, you are simply stuck with including both X1 and X2 in the model.

4. Other solutions: Redesign the study or collect more data.

a. Selection of levels: If you have the opportunity to select the (X1, X2) values, then you should attempt to do so in a way that makes those variables as uncorrelated as possible. For example, (X1, X2) might refer to two process inputs, each either Low or High, and you should select them in the arrangement (L,L), (L,H), (H,L), (H,H), with equal numbers of runs at each combination, to ensure that X1 and X2 are uncorrelated.

b. Sample size: The main problem resulting from MC is that the standard errors are large. You can always make standard errors smaller by collecting a larger sample size: recall that

s.e.(β̂j) = [σ̂ / {sxj √(n − 1)}] × {1/(1 − Rj²)}^(1/2).

So even if you change nothing else, a larger sample size n will make the standard errors smaller.
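A brief simulation sketch, in the spirit of the earlier examples (the specific numbers here are illustrative), shows the standard error of β̂1 shrinking as n grows, even when the X variables are highly collinear:

## Standard errors shrink with n even under strong multicollinearity
se.of.b1 = function(n) {
  x1 = rnorm(n)
  x2 = x1 + rnorm(n, 0, .05)   # highly collinear with x1
  y = 7 + x1 + x2 + rnorm(n, 0, 10)
  summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]
}
set.seed(1)
sapply(c(100, 1000, 10000), se.of.b1)   # s.e. of beta1-hat decreases as n increases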
