ISQS 5349 Spring 2013 Final Exam



Name:

General Instructions: Closed books, notes, no electronic devices. Points (out of 200) are in parentheses. Put written answers on separate paper; multiple choices can be circled on this exam sheet.

Special Instructions: Do not discuss this exam with anyone, even in the most general terms, until the solutions have been posted. Hand in this exam when you are done.

1. (10) Give an example where $E(Y \mid X = x)$ is a curved, rather than a linear, function of $x$. Use an example from class or of your own choosing. Explain, from a subject matter perspective, why the relationship is curved. Don't answer "there is curvature because the quadratic term is significant." Don't answer "there is curvature because the LOESS estimate is curved." And don't answer using any other similarly data-centric answer. Make your answer specific to your specific Y variable and your specific X variable, and give the subject matter explanation for the curvature.
Solution: The case in class where Y = monthly car sales and X = interest rates is good. When interest rates increase, fewer people will buy cars because the cost of the loan is too high. But as interest rates continue to increase, the sales must level off (flatten) because people simply won't take loans; they will pay with cash. So interest rates will have less of an effect on sales when they are very high. In addition, sales can never be negative, so that also explains the flattening of the curve.

2. (20) Name three regression models that require use of maximum likelihood estimation, and attempt to show how the likelihood functions look in each of these three cases. If you forget the specific formulas, that's OK; just write as much down as you can in terms of formulas, then describe in words what you are trying to remember.
Solution:
Logistic regression: $L = \prod_{\text{Successes}} \frac{\exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)} \times \prod_{\text{Failures}} \frac{1}{1 + \exp(\beta_0 + \beta_1 x_i)}$
Poisson regression: $L = \prod_i e^{-\lambda_i} \lambda_i^{y_i} / y_i!$, where $\lambda_i = \exp(\beta_0 + \beta_1 x_i)$
Tobit regression: $L = \prod_{y_i > 0} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-0.5\{y_i - (\beta_0 + \beta_1 x_i)\}^2 / \sigma^2\right] \times \prod_{y_i = 0} \Phi\left(-(\beta_0 + \beta_1 x_i)/\sigma\right)$, where $\Phi$ is the N(0,1) cdf.
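The three likelihoods in the solution to question 2 can be sketched in code. The following is a minimal illustration, not the course's own implementation: it assumes a single predictor with an intercept, works with log-likelihoods rather than likelihoods, and the function and parameter names (`b0`, `b1`, `sigma`) are hypothetical.

```python
import math

def norm_cdf(z):
    """Standard normal cdf, Phi(z), computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic_loglik(b0, b1, x, y):
    """Log-likelihood for logistic regression; y holds 0/1 outcomes."""
    ll = 0.0
    for xi, yi in zip(x, y):
        eta = b0 + b1 * xi
        # Successes contribute log[exp(eta)/(1+exp(eta))],
        # failures contribute log[1/(1+exp(eta))].
        ll += yi * eta - math.log(1.0 + math.exp(eta))
    return ll

def poisson_loglik(b0, b1, x, y):
    """Log-likelihood for Poisson regression with log link; y holds counts."""
    ll = 0.0
    for xi, yi in zip(x, y):
        eta = b0 + b1 * xi
        lam = math.exp(eta)  # lambda_i = exp(b0 + b1 * x_i)
        ll += -lam + yi * eta - math.lgamma(yi + 1)  # lgamma(y+1) = log(y!)
    return ll

def tobit_loglik(b0, b1, sigma, x, y):
    """Log-likelihood for a Tobit model censored from below at zero."""
    ll = 0.0
    for xi, yi in zip(x, y):
        mu = b0 + b1 * xi
        if yi > 0:
            # Uncensored observation: normal density of y_i.
            z = (yi - mu) / sigma
            ll += -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * z * z
        else:
            # Censored observation: probability that the latent Y* is <= 0.
            ll += math.log(norm_cdf(-mu / sigma))
    return ll
```

Maximum likelihood estimation would then search for the parameter values that maximize one of these functions for the observed data; the optimizer itself is omitted here.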

3. (10) In the regression model $Y = \beta_0 + \beta_1 X + \varepsilon$, the parameter $\beta_1$ can be interpreted as a difference between means. Explain why, using the theory of the regression model. Don't give any example here.
Solution: The regression model assumes that $E(\varepsilon) = 0$. Thus, the model states that when $X = x$, the Y data produced by the model come from a distribution whose mean is $\beta_0 + \beta_1 x$. Also, when $X = x + 1$, the Y data produced by the model come from a distribution whose mean is $\beta_0 + \beta_1 (x + 1) = \beta_0 + \beta_1 x + \beta_1$. Thus, the difference between the mean of the distribution of Y when $X = x + 1$ and the mean of the distribution of Y when $X = x$ is exactly $\beta_1$ when the model is true.

4. (10) In the regression model $Y = \beta_0 + \beta_1 X + \varepsilon$, the parameter $\beta_1$ usually cannot be interpreted as a causal effect of X on Y. Explain why not, using an example, either one of your own choosing or one discussed in class. Be specific: Name your Y variable and your X variable first before you answer.
Solution: The example where Y = computer speed and X = RAM is good. The causal effect of X on Y is the change in computer speed you get by changing RAM, holding all else constant. In the example in class, students had computers with different RAM, but the students' machines with higher RAM were generally better in many ways than the students' machines with lower RAM; in particular, students' machines with higher RAM also tended to have higher GHz. Thus, while the $\beta_1$ coefficient in the regression model is correctly interpreted as the difference between mean speed of computers with higher versus lower RAM, this difference could as well be attributed to the difference in the machines' GHz as it could be to differences in RAM. In other words, it is possible that RAM has no effect whatsoever, while the coefficient $\beta_1$ is positive and is still correctly interpreted as a difference between mean speeds for two conceptual subpopulations.

5. (10) Define overfitting, and explain how the AIC statistic helps you to avoid it.
Solution: Overfitting is what happens when you include too many variables in your model. You get a great fit to the existing data, but not to the data-generating process, because the overfit model follows the random wiggles and squiggles of the data that are simply noise, and not necessarily the structural elements of the process being studied. For each variable you include there is an additional parameter to estimate (or more if you include quadratics and interactions). The AIC statistic is -2LL + 2k, where k is the number of parameters. Lower AIC is good: you want higher likelihood (it's called maximum likelihood), which means you want smaller values of -2LL. So by adding 2k to -2LL, you are penalizing the model fit for adding too many variables to the model. Thus, if two models have nearly identical log likelihood LL, but one model has more parameters than the other, the AIC criterion will choose the simpler model.
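The AIC comparison described in this solution can be illustrated with a few lines of code. This is only a sketch: the log-likelihood values and parameter counts below are made up for illustration, and the helper names `aic` and `pick_by_aic` are hypothetical.

```python
def aic(loglik, k):
    """AIC = -2*LL + 2*k, where k is the number of estimated parameters."""
    return -2.0 * loglik + 2.0 * k

def pick_by_aic(candidates):
    """Return the name of the candidate with the smallest AIC.

    candidates maps a model name to a (log-likelihood, parameter count) pair.
    """
    return min(candidates, key=lambda name: aic(*candidates[name]))

# Hypothetical models with nearly identical log-likelihoods:
# the complex one fits slightly better but uses five more parameters.
models = {
    "simple": (-100.0, 3),   # AIC = 200 + 6
    "complex": (-99.5, 8),   # AIC = 199 + 16
}
best = pick_by_aic(models)
```

In this made-up case the small gain in log-likelihood does not offset the 2k penalty, so the simpler model has the lower AIC.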

6. (10) Draw a single graph with a horizontal numerical axis and a vertical numerical axis that illustrates the concept of a moderating variable. No words are needed if the graph is clear enough, with appropriate labeling, but feel free to use words to help your answer. Mainly, I'll look at the graph though.
Solution: [Figure: "Idealism Moderates Effect of Misanthropy on Animal Rights Support" (Wuensch). Vertical axis: AR Support. Horizontal axis: Misanthropy. Two lines: "AR, Low Id" and "AR, High Id".]

7. (10) Explain how the effect inclusion principle applies to the model $Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$, and give the reason for using the effect inclusion principle in this case. In other words, what goes wrong if you disobey the effect inclusion principle?
Solution: The EIP states that you should include all lower order polynomial terms when higher order terms are in the model. Here, it means that, as long as $x^2$ is in the model, you should include both the x variable ($x^1$) and the intercept variable ($x^0$). What goes wrong here? Suppose there is really no curvature; i.e., the function is truly linear. But suppose also that you decide to fit the model $Y = \beta_0 + \beta_2 x^2 + \varepsilon$. It is likely that you will find a significant $\beta_2$, because the information that was in $X$ to relate Y to X is now subsumed into $X^2$. The resulting function is a quadratic, and therefore curved. So you would incorrectly conclude that the relationship between Y and X is curved if you disobey the principle.

8. (10) Give an example where quantile regression is interesting; either one of your own choosing, or one discussed in class. DO NOT refer to the outlier issue; there are other, subject-matter specific reasons why quantile regression is interesting. DO NOT choose the 0.5 quantile (THE MEDIAN) in your answer, either. Choose a different quantile or quantiles of interest and explain why it is interesting. Again, be specific: Name your Y variable and your X variable first before you answer.
Solution: The example in class where Y = weekly salary in the US (current dollars) and X = year ( ) was interesting. It showed that the relationship between the 0.90 quantile of the distribution of Y and X has a much larger slope than does the relationship between the 0.10 quantile of the distribution of Y and X. It is interesting because it shows the income gap is widening. (Make of it what you will; this is a statement of facts, not a political comment.)

9. (15) An author of a paper wrote the sentence, "We used multi-level regression model, with a compound symmetric within-cluster covariance matrix." Explain the terms "multi-level," "within-cluster covariance matrix," and "compound symmetric."
Solution:
Multi-level: Data collected within different, nested groupings are called multi-level. For example, data on public schools with many districts, many schools within districts, and children within schools are multi-level data.
Within-cluster covariance matrix: A particular value for one of the levels is also called a cluster. For example, a particular school defines a cluster of students within that school. Observations within a cluster are assumed to be correlated; the within-cluster covariance matrix defines the covariance structure for all the observations within the cluster. The covariance matrix identifies variances of the observations on the diagonal, and covariances between all pairs of observations on the off-diagonals.
Compound symmetric: This is a type of covariance matrix where all variances on the diagonal are the same number, and all covariances on the off-diagonal are also the same number (but a different number from the variance).

10. (5) How are the Tobit model and the Cox proportional hazards model similar? Be brief; don't define everything about each model. Just identify similarities briefly.

Solution: Similarities: Both allow censored data, whose value is known to be above or below a threshold but is otherwise unknown. Both use a type of maximum likelihood. Both are models for how data Y are produced, depending on an X or X's (i.e., both are regression models).

11. (5) How are the Tobit model and the Cox proportional hazards model different? Be brief; don't define everything about each model. Just identify differences briefly.
Solution: Differences: The Cox model typically has upper-censored data, Tobit lower-censored. Cox does not assume a particular distribution for Y; Tobit usually assumes normal. Tobit uses ordinary likelihood; Cox uses a funny kind of partial likelihood. Tobit models the data Y* (latent or observed) in terms of a linear function of the X's; Cox models the log of the hazard function of Y as a linear function of the X's.

12. Define the following terms very briefly. (4 points each)

A. Logit function
Solution: It's the "log odds": $\text{logit}(\pi) = \ln(\pi/(1 - \pi))$, where $\pi$ is a probability of success.

B. Link function
Solution: It's the function $g(\cdot)$ that transforms the mean of Y into a linear function of X: $g(E(Y \mid X)) = \beta_0 + \beta_1 X$. For example, in logistic regression the link function is the logit function; in Poisson regression the link function is the natural logarithm.

C. Hazard function
Solution: The instantaneous rate of failure in the next increment of time, given survival to this particular time; or $h(t) = p(t)/S(t)$, where $p$ and $S$ are the pdf and survival functions for the random variable $T$.

D. Latent variable
Solution: An unobserved variable assumed to exist or used as a device to create a realistic model. For example, the Tobit model assumes a latent Y* that can be less than zero; this is used to model the case where Y is 0 by producing Y = 0 whenever Y* < 0.

E. Shrinkage estimate
Solution: An estimate that is shrunken towards an overall mean, depending on the sample size. The smaller the sample size, the more shrinkage. BLUPs are shrinkage estimates.

F. Heteroscedasticity
Solution: When the variances of the distributions of $Y_i \mid X_i = x_i$ differ for different $i$, then there is heteroscedasticity. When these variances are the same for all $i$ ($i = 1, 2, \ldots, n$), then there is homoscedasticity.

G. Serial correlation
Solution: This is another term for correlation between residuals $\varepsilon_i$, usually used in the context of data collected sequentially over time. If today's residual is correlated with yesterday's residual, then there is serial correlation.

H. Multicollinearity
Solution: When the X variables are correlated there is multicollinearity. For example, there is multicollinearity between RAM and GHz in the computer speed example. MC is not a yes/no situation; it is a question of degree. (The MC is not too strong in that computer example.)

I. Interaction
Solution: When the effect of $X_1$ on Y depends on the value of $X_2$, then $X_1$ and $X_2$ interact; or it can be said that there is interaction between $X_1$ and $X_2$.

J. Standard error
Solution: The standard error is a measure of accuracy of the estimate. Typically you can assume that the true parameter value is within two standard errors of the estimate, as long as the model you specify is reasonably close to the true data-generating process.

Multiple choice questions (3 points each)

13. The Gauss-Markov theorem states that the ordinary least squares estimates are
A. the best possible estimators   B. the best possible linear unbiased estimators

14. The ordinary least squares estimates are given by
A. $(X'X)X'Y$   B. $(X'V^{-1}X)^{-1}X'V^{-1}Y$   C. $(X'X)^{-1}X'Y$

15. PRESS, AIC, k-fold cross-validation, and stepwise regression are methods for
A. parameter estimation   B. variable selection

16. When does the variance portion of the variance-bias tradeoff get larger?
A. when you estimate more parameters   B. when you estimate fewer parameters

17. Multinomial logistic regression assumes
A. Y is a nominal variable   B. Y is a normally distributed variable

18. Which is the best representation of a regression model?
A. $E(Y \mid X = x) = \beta_0 + \beta_1 x$   B. $Y = \beta_0 + \beta_1 x + \varepsilon$   C. $Y \mid X = x \sim p(y \mid X = x)$

19. When you use a proper instrument, the instrumental variable estimator is
A. unbiased   B. BLUE   C. consistent

20. Model averaging is an alternative method for
A. variable selection   B. obtaining unbiased estimates

21. Bagging and boosting are types of ____ techniques
A. data mining   B. instrumental variable

22. A neural network is a type of
A. linear regression   B. nonlinear regression

23. Generalized additive models assume
A. no interaction   B. linear link functions

24. The Newey-West procedure estimates the
A. regression coefficients   B. standard errors

25. Winsorizing is used to solve what problem?
A. Outliers   B. Multicollinearity   C. Heteroscedasticity

26. Switching regressions are used to estimate models when
A. there are outliers   B. there is multicollinearity   C. there are different regimes

27. Optimal design seeks to minimize the
A. error sum of squares   B. variances of parameter estimators   C. variance-bias trade-off
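As a worked illustration of the matrix expressions in question 14, the standard OLS formula $(X'X)^{-1}X'Y$ can be written out by hand for the one-predictor case, where $X$ has a column of ones and a column of $x$ values. This is a minimal sketch under that assumption; the function name `ols_simple` is hypothetical, and the data in the usage lines are made up.

```python
def ols_simple(x, y):
    """OLS intercept and slope from the normal equations (X'X) b = X'Y.

    With X = [1, x], X'X = [[n, sum(x)], [sum(x), sum(x^2)]] and
    X'Y = [sum(y), sum(x*y)], so the 2x2 inverse can be written directly.
    """
    n = len(x)
    sx = sum(x)
    sxx = sum(xi * xi for xi in x)
    sy = sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    det = n * sxx - sx * sx  # determinant of X'X
    if det == 0:
        raise ValueError("X'X is singular: x has no variation")
    b0 = (sxx * sy - sx * sxy) / det
    b1 = (n * sxy - sx * sy) / det
    return b0, b1

# Made-up data lying exactly on the line y = 1 + 2x.
b0, b1 = ols_simple([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

With several predictors the same formula applies, but a numerical linear algebra routine would normally be used instead of an explicit inverse.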

