Statistics and Quantitative Analysis U3
Lecture 13: Explaining Variation
Prof. Sharyn O'Halloran

Explaining Variation: Adjusted R² (cont.)

Definition of Adjusted R²
We would like a measure like R², but one that takes into account the fact that adding extra variables always increases your explanatory power. The statistic we use for this is called the adjusted R², and its formula is:

    Adjusted R² = 1 − (1 − R²) · (n − 1) / (n − k)

where n is the number of observations and k is the number of independent variables, including the constant. The adjusted R² can actually fall if the variable you add doesn't explain much of the variance.

Back to the Example
Comparing the adjusted R² across the four models:

    Model 1:  .9
    Model 2:  .31
    Model 3:  .33
    Model 4:  .3

Interpretation
You can see that the adjusted R² rises from equation 1 to equation 2, and from equation 2 to equation 3. But then it falls from equation 3 to equation 4, when we add in the variables for national parks and the zodiac.

Example: The Estimated Equation

    Analysis of Variance
                  DF    Sum of Squares    Mean Square
    Regression           .73              3.39
    Residual      7     1.75              .391
    F = .13       Signif F = .

    Multiple R           .15
    R Square             .3555
    Adjusted R Square    .31
    Standard Error       .5

    Variable      B      SE B     Beta        T        Sig T
    SKOOL         .1     .97                  1.59     .153
    TUBETIME     -.1              -.15777    -3.31     .9
    (Constant)    .1     .15                 13.

We calculate:

    Adjusted R² = 1 − (1 − R²) · (n − 1) / (n − k)
                = 1 − (1 − .3555) · (n − 1) / (n − 3)
                = .31

Copyright Sharyn O'Halloran
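The adjusted-R² formula is easy to compute directly. Below is a minimal Python sketch; the helper name `adjusted_r2` is illustrative, not something from the lecture. Note that k counts the constant, matching the definition above.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared.

    r2 : the ordinary R-squared of the fitted model
    n  : number of observations
    k  : number of independent variables, including the constant
    """
    return 1 - (1 - r2) * (n - 1) / (n - k)

# Adding a variable that explains almost nothing raises k but barely
# raises R-squared, so the adjusted R-squared can fall:
before = adjusted_r2(0.500, 100, 5)
after = adjusted_r2(0.501, 100, 6)   # one more (nearly useless) variable
assert after < before
```

This is exactly the penalty the slide describes: the (n − 1)/(n − k) factor grows with every added variable, so R² has to rise by enough to pay for it.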
Explaining Variation: Adjusted R² (cont.)

Stepwise Regression
One strategy for model building is to add variables only if they increase your adjusted R². This technique is called stepwise regression. However, I don't want to emphasize this approach too strongly. Just as people can fixate on R², they can fixate on adjusted R². If you have a theory that suggests certain variables are important for your analysis, then include them whether or not they increase the adjusted R². Negative findings can be important!

Comparing Models: F-Tests

When to use an F-test?
Say you add a number of variables to a regression model and you want to see whether, as a group, they are significant in explaining variation in your dependent variable Y. The F-test tells you whether a group of variables, or even an entire model, is jointly significant. This is in contrast to a t-test, which tells you whether an individual coefficient is significantly different from zero. In short: does the specified model explain a significant proportion of the total variation?

Equations
To be precise, say our original equation is:

    Model 1: Y = b₀ + b₁X₁ + b₂X₂.

We add two more variables, so the new equation is:

    Model 2: Y = b₀ + b₁X₁ + b₂X₂ + b₃X₃ + b₄X₄.

We want to test the joint hypothesis H₀: β₃ = β₄ = 0, that is, that X₃ and X₄ together are not significant factors in determining Y.

Using Adjusted R² First
There's an easy way to tell if these two variables are not significant. First run the regression without X₃ and X₄ in it, then run the regression with X₃ and X₄. Now look at the adjusted R²'s for the two regressions. If the adjusted R² went down, then X₃ and X₄ are not jointly significant. So the adjusted R² can serve as a quick test for insignificance.
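The stepwise idea and the adjusted-R² screen can be sketched with plain numpy least squares. Everything here is illustrative: `forward_stepwise`, `adj_r2`, and the input format are my assumptions, not part of the lecture, and real stepwise procedures are usually more elaborate.

```python
import numpy as np

def adj_r2(X, y):
    """Fit OLS (X must include a constant column) and return adjusted R^2."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1 - (resid @ resid) / (((y - y.mean()) ** 2).sum())
    return 1 - (1 - r2) * (n - 1) / (n - k)

def forward_stepwise(candidates, y):
    """Greedy forward selection: keep a variable only if it raises adjusted R^2.

    candidates : dict mapping a name to a 1-D array of regressor values
    """
    X = np.ones((len(y), 1))            # start from the constant-only model
    kept = []
    for name, col in candidates.items():
        trial = np.column_stack([X, col])
        if adj_r2(trial, y) > adj_r2(X, y):
            X, kept = trial, kept + [name]
    return kept
```

A variable that duplicates information already in the model cannot lower the SSE, so it only raises k; the screen rejects it, just as the slide's "quick test for insignificance" suggests.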
Calculating an F-Test
If the adjusted R² goes up, then you need to do a more complicated test, the F-test.

Ratio
Let regression 1 be the model without X₃ and X₄, and let regression 2 include X₃ and X₄. The basic idea of the F statistic, then, is to compute the ratio:

    (SSE₁ − SSE₂) / SSE₂

Correction
We have to correct for the number of independent variables we add. So the complete statistic is:

    F = [(SSE₁ − SSE₂) / m] / [SSE₂ / (n − k)]

where m is the number of restrictions (the number of additional variables added to the model) and k is the number of independent variables. Remember: k is the total number of independent variables in the full model, including the ones that you are testing and the constant.

Correction (cont.)
This equation defines an F-statistic with m and n − k degrees of freedom. We write it F(m, n − k). To get critical values for the F statistic, we use a set of tables, just as for the normal and t statistics.

Example
Adding Extra Variables: Is a group of variables jointly significant?
Are the variables YELOWSTN and MYSIGN jointly significant?

    Model 1: TRUSTTV = b₀ + b₁LIKEJPAN + b₂SKOOL + b₃TUBETIME
    Model 2: TRUSTTV = b₀ + b₁LIKEJPAN + b₂SKOOL + b₃TUBETIME + b₄MYSIGN + b₅YELOWSTN
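The corrected ratio can be wrapped in a small helper that also looks up the 5% critical value, using scipy's F distribution instead of a printed table. The function name `nested_f` and its argument layout are illustrative assumptions.

```python
from scipy.stats import f as f_dist

def nested_f(sse1, sse2, m, n, k, alpha=0.05):
    """F test for H0: the m added coefficients are all zero.

    sse1 : SSE of the restricted model (without the added variables)
    sse2 : SSE of the full model
    m    : number of restrictions (variables added)
    n    : number of observations
    k    : independent variables in the full model, incl. the constant
    """
    F = ((sse1 - sse2) / m) / (sse2 / (n - k))
    crit = f_dist.ppf(1 - alpha, m, n - k)   # critical value of F(m, n-k)
    return F, crit, F > crit
```

For example, with made-up sums of squares `nested_f(120.0, 100.0, m=2, n=100, k=5)` gives F = 9.5; since that exceeds the 5% critical value of roughly 3.1, the two added variables would be judged jointly significant.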
Adding Extra Variables (cont.)

State the null hypothesis:

    H₀: β₄ = β₅ = 0

Calculate the F-statistic. Our formula for the F-statistic is:

    F = [(SSE₁ − SSE₂) / m] / [SSE₂ / (n − k)]

What is SSE₁? The sum of squared errors in the first regression. What is SSE₂? The sum of squared errors in the second regression. Here m = 2, N = 7, and k = 6 (five independent variables plus the constant). Plugging the two sums of squared errors into the formula gives:

    F = .319

Reject or fail to reject the null hypothesis?
The critical value at the 5% level, F(m, N − k), from the table is 3. Is the F-statistic greater than the critical value? If yes, then we reject the null hypothesis that the variables are not significantly different from zero; otherwise we fail to reject. Since .319 < 3., we fail to reject the null hypothesis: YELOWSTN and MYSIGN are not jointly significant.

Testing All Variables: Is the Model Significant?
Equation 2: Impact of schooling and TV watched

    Analysis of Variance
                  DF    Sum of Squares    Mean Square
    Regression           .7               3.37
    Residual      7     1.7               .39
    F Statistic = .1    Signif F = .E-

    Multiple R           .15
    R Square             .35
    Adjusted R Square    .31
    Standard Error       .5

    Dependent Variable: Trust TV
    Variable      B      SE B     Beta     T        Sig T
    SKOOL         .1     .11      .9       1.59     .15
    TUBETIME     -.      .1      -.15     -3.31     .1
    (Constant)    .1     .15              13.       .
Hypothesis Testing

State the hypothesis:

    H₀: β₁ = β₂ = 0

Calculate the test statistic. Again, we start with our formula:

    F = [(SSE₁ − SSE₂) / m] / [SSE₂ / (n − k)]

SSE₂ = 1.7; this is the residual sum of squares reported in your printout. SSE₁ is the sum of squared errors when there are no explanatory variables at all. If there are no explanatory variables, then SSR must be 0, and in that case SSE = SST. So we can substitute SST for SSE₁ in our formula:

    SST = SSR + SSE = .73 + 1.7 = 19.5

    F = [(19.5 − 1.7) / 2] / [1.7 / (n − 3)]

which gives the F statistic reported in the printout.

Reject or fail to reject the null hypothesis?
The critical value at the 5% level, F(2, n − 3), from your table is 3. The computed F statistic exceeds it, so this time we can reject the null hypothesis that β₁ = β₂ = 0.

Interpretation? The model explains a significant amount of the total variation in how much people trust what is said on TV.

Comparing Models: Example
A study of 78 seventh-grade students in a mid-western school.

[Path diagram: Gender and the student's test score shown as causes of grade point average.]
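The substitution of SST for SSE₁ makes the overall-significance test a one-liner. A sketch (the name `overall_f` is illustrative): a constant-only model predicts the mean of Y, so its SSE is exactly SST, and the number of restrictions is the k − 1 slopes.

```python
def overall_f(sst, sse, n, k):
    """F statistic for H0: all slope coefficients are zero.

    sst : total sum of squares (the SSE of a constant-only model)
    sse : SSE of the fitted model
    n   : number of observations
    k   : independent variables, including the constant
    """
    m = k - 1                    # restrictions: every slope set to zero
    return ((sst - sse) / m) / (sse / (n - k))
```

Equivalently, F = (R² / m) / ((1 − R²) / (n − k)), since R² = 1 − SSE/SST; both forms give the same number.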
Data
Averages by gender (values as reported):

    Gender         Average of GPA    Average of test score
    0 (female)     7.9537            15.3797
    1 (male)       7.139             11.957
    Grand Total    7.53              1.9379

Variables:
    Test score = student's score on a standardized test
    GPA        = student's grade point average
    Gender     = student's gender (1 for male; 0 for female)

Descriptive Statistics (values as reported):

                       GPA      Test score    Gender
    Mean               7.5      1.9           .
    Standard Error     .        1.9           .
    Mode               9.17     111.          1.
    Sample Variance    .1       173.7         .
    Kurtosis           1.1      .             -1.7
    Minimum            .53      7.            .
    Sum                5.3      9.            7.

[Scatter plots: relation between test score and GPA for the full sample, for women only, and for men only.]

Hypothesis Testing

Hypotheses concerning coefficients:

    H₀: β₁ = 0
    Hₐ: β₁ ≠ 0

We want to know whether the test score and Gender explain a significant amount of the variation in GPA.

Hypotheses concerning models:

    H₀: β₁ = β₂ = 0
    Hₐ: at least one of β₁, β₂ ≠ 0

Estimation: Model I

    GPA = b₀ + b₁ · (test score)

[Scatter plot with fitted trendline: y = .11x − 3.5571]

    SUMMARY OUTPUT — Dependent Variable: GPA

    Regression Statistics
    Multiple R           .3
    R Square             .
    Adjusted R Square    .39
    Standard Error       1.3
    Observations         78

    ANOVA
                 df    SS       MS      F       Significance F
    Regression   1     13.3     13.3    51.1    .7373E-1
    Residual     7     3.11     .7
    Total        77    339.3

                  Coefficients    Standard Error    t Stat    P-value
    Intercept     -3.5            1.55              -.9       .59
    Test score    .1              .1                7.1       .7373E-1

    Estimated equation: GPA = −3.5571 + .11 · (test score)
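Model I is a simple regression of GPA on the test score. The sketch below refits a model of the same shape on synthetic stand-in data: the classroom data are not reproduced here, the generating coefficients merely echo the slide's estimates, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 78                                          # students in the study
score = rng.normal(100.0, 13.0, size=n)         # synthetic test scores
gpa = -3.5 + 0.11 * score + rng.normal(0.0, 1.8, size=n)

# Model I: GPA = b0 + b1 * score, fit by ordinary least squares
slope, intercept = np.polyfit(score, gpa, 1)    # highest power first
fitted = intercept + slope * score
resid = gpa - fitted
r2 = 1 - (resid @ resid) / (((gpa - gpa.mean()) ** 2).sum())
print(f"GPA = {intercept:.2f} + {slope:.3f} * score,  R^2 = {r2:.2f}")
```

The estimated slope recovers something close to the .11 used to generate the data; the R² is the share of GPA variation the score explains in this synthetic sample.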
Model II

    GPA = b₀ + b₁ · (test score) + b₂ · Gender

[Scatter plot with the Model I trendline: y = .11x − 3.5571]

    SUMMARY OUTPUT — Dependent Variable: GPA

    Regression Statistics
    Multiple R           .7
    R Square             .5
    Adjusted R Square    .
    Standard Error       1.5
    Observations         78

    ANOVA
                 df    SS       MS      F      Significance F
    Regression   2     153.1    7.5     3.     .
    Residual     75    1.7      .
    Total        77    339.3

                  Coefficients    Standard Error    t Stat    P-value
    Intercept     -3.73           1.5               -.9       .1
    Test score    .11             .1                7.77      .
    Gender        -.97            .37               -.        .1

    Estimated equation: GPA = −3.73 + .11 · (test score) − .97 · Gender

Is Model II an improvement over Model I?
F-test:

    F = [(SSE₁ − SSE₂) / m] / [SSE₂ / (n − k)]

with SSE₁ = 3.11 (the Model I residual sum of squares), SSE₂ = 1.7 (the Model II residual sum of squares), m = 1, and n − k = 75. The computed F statistic exceeds the 5% critical value, so yes: adding Gender significantly improves the fit.

Interactive Terms

    Regression Statistics
    Multiple R           .75
    R Square             .5
    Adjusted R Square    .33
    Standard Error       1.55
    Observations         77.

    ANOVA
                 df    SS       MS        F      Significance F
    Regression   3     153.5    51.175    .33    .
    Residual     73    13.5     .513
    Total        7     33.9

                           Coefficients    Standard Error    t Stat    P-value
    Intercept              -.19            .19               -1.13     .315
    Test score             .93             .                 .53       .
    Gender                 -3.99           3.55              -1.7      .
    Test score × Gender    .7              .                 .97       .333

The interactive term is not statistically significant: a high or low test score has the same effect on GPA independent of gender.

Interpretation

Coefficients: Both the test score and Gender matter. GPA increases by .11 points for each additional point on the test, holding Gender constant. GPA decreases by .97 points for males relative to females, holding the test score constant.

Models: The F-statistic shows that the model that includes Gender performs significantly better in explaining variation than does the model with the test score alone. We are therefore able to reject the null hypothesis that Model I and Model II fit equally well at the 5% significance level.
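The Model I versus Model II comparison can be replayed end to end in numpy. Again the data are synthetic stand-ins (names and generating coefficients are illustrative, loosely echoing the slide's estimates); the point is the mechanics of the nested F test with m = 1.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid

rng = np.random.default_rng(7)
n = 78
score = rng.normal(100.0, 13.0, size=n)
gender = rng.integers(0, 2, size=n).astype(float)     # 1 = male, 0 = female
gpa = -3.7 + 0.11 * score - 0.97 * gender + rng.normal(0.0, 1.6, size=n)

ones = np.ones(n)
X1 = np.column_stack([ones, score])                   # Model I
X2 = np.column_stack([ones, score, gender])           # Model II
sse1, sse2 = sse(X1, gpa), sse(X2, gpa)

m, k = 1, 3                      # one restriction; Model II has 3 parameters
F = ((sse1 - sse2) / m) / (sse2 / (n - k))
print(f"SSE1 = {sse1:.1f}, SSE2 = {sse2:.1f}, F(1, {n - k}) = {F:.2f}")
```

Because Model I is nested in Model II, SSE₂ can never exceed SSE₁; the F statistic asks whether the drop in SSE is larger than chance alone would produce.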
Final Paper

- Clearly state your hypothesis. Use a path diagram to present the causal relations, and use the correlations to help you determine what causes what. State the alternative hypothesis.
- Present descriptive statistics. This includes a correlation matrix and a histogram or scatter plot.
- Estimate your model. You can do simple regression, include interactive terms, do path analysis, or use dummy variables; whatever is appropriate to your hypothesis.
- Present your results.
- Interpret your results. Draw out the policy implications of your analysis.

The paper should begin with a brief abstract which states the basic project and your main findings.