Instructions: Closed book, notes, and no electronic devices. Points (out of 200) in parentheses


ISQS 5349 Final, Spring 2011

1. (10) What is the definition of a regression model that we have used throughout the class? Relate this definition to the case where Y = driving speed of a car on Interstate 20 (the speed limit is 70 MPH) and X = age of the car. (If the word "distribution" does not appear in your answer, then you will lose most of the possible points.)

Solution: The regression model is a model for the conditional distribution of a dependent variable Y given the value (or values, in the multiple regression case) of predictor variables X. In the case study, we would postulate a model for how the distribution of speeds observed on the freeway depends on the age of a car. Driving speeds differ depending on the driving style of the owner, so there is a distribution of possible driving speeds for any age-specific cohort of cars. For example, there is a distribution of speeds for X = 5, another distribution of speeds for X = 10, etc. The regression model postulates how these distributions look; for example, one might postulate that they are all normal distributions with common variance, whose mean values lie precisely on a line E(Speed) = β0 + β1X for some values β0, β1. But one might also postulate that these distributions are unspecified, of the generic form p(y | X = x), with mean function also unspecified, of generic form f(x). Both are examples of regression models; the former is more specific but less believable, the latter less specific but more believable.

2. (10) Suppose you will collect actual data (x1, y1), (x2, y2), ..., (xn, yn), where you have assumed a model Y | X = x ~ N(β0 + β1x, σ²). If the model is true, how will the scatterplot of the actual data appear? Explain in words, and then draw a prototypical scatterplot for this case.

Solution: The scatter of the Y data will appear to increase (or decrease) steadily, with no indication of curvature.
The range of the Y data for any given X = x (or in a small neighborhood of x such as x − δ < X < x + δ) will appear roughly constant for all x. The distribution of the Y data for any given X = x (or in such a small neighborhood) will appear roughly symmetric on either side of the center, with no outliers. Further, there will be no obvious signs of discreteness in the Y data. For example:
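A prototypical dataset of this kind can also be simulated directly from the assumed model. This is a minimal sketch; the parameter values b0, b1, and sigma below are illustrative, not taken from the exam:

```python
import random
import statistics

random.seed(1)

# Simulate from the assumed model Y | X = x ~ N(b0 + b1*x, sigma^2).
# The parameter values are arbitrary, for illustration only.
b0, b1, sigma = 10.0, 2.0, 3.0
xs = [random.uniform(0, 20) for _ in range(500)]
ys = [b0 + b1 * x + random.gauss(0.0, sigma) for x in xs]

# Under the model the vertical spread of the scatter is roughly the
# same everywhere: compare the Y spread in a low-X and a high-X band.
low_band = [y for x, y in zip(xs, ys) if x < 5]
high_band = [y for x, y in zip(xs, ys) if x > 15]
ratio = statistics.stdev(high_band) / statistics.stdev(low_band)
print(round(ratio, 2))  # near 1.0 when homoscedasticity holds
```

Plotting xs against ys would reproduce the kind of scatterplot described above: a steady linear trend with constant, symmetric vertical scatter.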

[Figure: "Scatterplot of Data Produced By a Model Where All Assumptions Are Satisfied" — y plotted against x.]

3. (10) Use the subpopulation (or "cohort") framework to interpret the parameter β2 in the quantile regression model Income(0.9) = β0 + β1(Education) + β2(Year). Here, Income(0.9) is the 0.9 quantile of the income distribution, Education is the level of education of a person (say, coded as 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10), and Year is the year (say, 1960, 1961, ..., or 2011). Be specific.

Solution: Consider two cohorts:
Cohort 1: People with Education level = 5 in 1990.
Cohort 2: People with Education level = 5 in 1991.
Imagine that there are many people in Cohort 1. Then the 0.90 quantile (or 90th percentile) of the incomes in this group should be approximately equal to the true quantile of the conceptual distribution, which is assumed to be β0 + β1(5) + β2(1990), according to the model. Imagine also that there are many people in Cohort 2. Then the 0.90 quantile (or 90th percentile) of the incomes in this group should be approximately equal to the true quantile of the conceptual distribution, which is assumed to be β0 + β1(5) + β2(1991), according to the model. Thus, we can interpret β2 as being approximately equal to the 90th percentile of income in Cohort 2, minus the 90th percentile of income in Cohort 1.
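The cohort interpretation can be checked by simulation. In the sketch below, the coefficients b0, b1, b2 and the uniform income noise are invented purely for illustration; the point is that the year-over-year shift in the empirical 0.9 quantile recovers β2:

```python
import random
import statistics

random.seed(3)

# Hypothetical coefficients for Income(0.9) = b0 + b1*Education + b2*Year.
# These values are invented for illustration, not estimated from data.
b0, b1, b2 = -985000.0, 2000.0, 500.0

def simulate_cohort(edu, year, n=200000):
    # Each income is the cohort's true 0.9 quantile plus noise whose
    # own 0.9 quantile is zero (uniform on [-9000, 1000]), so the
    # cohort's income 0.9 quantile equals b0 + b1*edu + b2*year.
    q = b0 + b1 * edu + b2 * year
    return [q + random.uniform(-9000.0, 1000.0) for _ in range(n)]

def q90(data):
    return statistics.quantiles(data, n=10)[8]  # empirical 0.9 quantile

cohort_1 = simulate_cohort(5, 1990)  # education 5, year 1990
cohort_2 = simulate_cohort(5, 1991)  # education 5, year 1991

# The year-over-year shift in the 0.9 quantile recovers b2.
print(round(q90(cohort_2) - q90(cohort_1)))  # approximately b2 = 500
```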

4. (10) An estimated Probit model is P(Success) = Φ( X), where Φ(t) is the cumulative standard normal distribution function. (Recall that the standard normal distribution is the one that has mean zero and variance 1, so, for example, Φ(−1.96) = 0.025.) Draw a graph of the estimated probability of success as a function of X. Label the axes and put numbers on the axes.

Solution: Plugging in some numbers and using the Φ rule gives a table pairing values of X with P(Success) = Φ(−2), Φ(−1), Φ(0), Φ(1), and Φ(2), which trace out an S-shaped curve. Here is the graph:

[Figure: "Estimated Probit Model" — P(Success) on the vertical axis from 0 to 1, X on the horizontal axis.]

5. (10) The model Y = Xβ + ε is assumed. Suppose there are only n = 4 observable data points, and they are (Y1, X1 = 4), (Y2, X2 = 6), (Y3, X3 = 1), and (Y4, X4 = 5). Suppose also that the {Yi} are independent random variables with Var(Yi | Xi = x) = x²σ². Write down the entire covariance matrix of Y.

Solution: Since the Y data are independent, their covariances are 0, so we have a diagonal matrix. And according to the model, Var(Y1 | X1 = 4) = 4²σ², Var(Y2 | X2 = 6) = 6²σ², Var(Y3 | X3 = 1) = 1²σ², and Var(Y4 | X4 = 5) = 5²σ². Hence we have

             [ 16σ²    0      0      0   ]
    Cov(Y) = [  0     36σ²    0      0   ]
             [  0      0      σ²     0   ]
             [  0      0      0     25σ² ]

6. (10) When do you use generalized least squares (GLS)? First, answer the question in general terms. Second, give an example of a real study that you might do where GLS, rather than ordinary least squares (OLS), is appropriate.

Solution: When the covariance matrix of Y has non-zero off-diagonal elements, i.e., when the error terms are correlated, then the usual OLS estimates are inefficient and their usual standard errors are incorrect. We use GLS to get more efficient estimates with correct standard errors. For example, I might perform a repeated measures study where one of the predictor variables is drug (active or placebo) and other predictor variables are age, sex, initial health, etc. There are multiple observations per subject because the subject's health is evaluated at repeat visits to the clinic. These repeated measures must be assumed to be correlated because they share subject-specific commonality. If instead we assume they are independent and perform the usual OLS analysis, then the estimated parameters will tend to be farther from the true process parameters than when we incorporate covariance information and use GLS. Further, the OLS standard errors will simply be wrong, perhaps leading us to conclude significance incorrectly, or leading us to an insignificant conclusion when significance is warranted, whereas the GLS standard errors will incorporate the covariance information appropriately.

7. Define the following terms briefly using one sentence. If you use formulas, make sure that they are incorporated into the sentence using proper English phrasing. (4 points each)

7.A. Latent variable: This is a variable that you cannot observe directly, like "satisfaction with boss."
7.B. Pseudo R-squared: This is a likelihood-based measure of the goodness of a model, one that reduces to the usual form in the classical regression case.
7.C. LOESS estimate: This is an estimate of the mean of the distribution of Y as a function of X, one that makes no assumption of linearity or other specific functional form.
7.D. Nominal variable: This is a variable whose values are unordered categories, such as a choice of ART, MATH, or SCIENCE.
7.E. Moderator variable: This is a variable that affects another variable's effect on a response.
7.F. Censored data: Data whose values are known only to lie above (or below) some known threshold are called censored data.
7.G. Dummy variable: A variable whose values are coded as 0 or 1, depending upon the value of some other variable, is called a dummy variable.
7.H. Likelihood ratio test: This test is used to compare full and reduced models that are estimated using maximum likelihood, and the result is a chi-square statistic.

8. (8) A benefit of using a random effects model instead of a fixed effects model is that random effects models provide shrinkage estimators (BLUPs). Explain why this is a benefit in the context of an example, either one of your own choosing, or one discussed in class (such as the faculty rating example). Be sure to explain why there is a benefit, i.e., why the alternative fixed effects model is worse.

Solution: In the class rating example, the BLUP estimates of major effects were shrunk toward the overall mean when the sample size within the major was smaller. This gave us a better ranking of the majors than the fixed-effects model, which estimates the mean of a major using the simple mean without shrinkage. A problem was that a mean of 5.0 based on one observation should not be rated more highly than a mean of 4.7 based on 30 observations, as would be the case using fixed-effects estimates.

9. (10) We discussed three types of outliers. One of these three types was the worst type of outlier, in the sense that such an outlier has the most potential influence on ordinary least squares estimates. Show how this worst type of outlier appears in a scatterplot.

Solution: The worst type was the outlier in both X-space and Y-space. Here is a picture:

[Figure: "Example 3: Outlier in both Y-space and X-space" — a scatterplot with fitted line and R² shown, and one point far from the others in both the X and Y directions.]

10. (10) Why must we use graphs in addition to tests when assessing the validity of assumptions? Be specific to the case of testing whether the regression function is a line rather than a curve. How do you test it? What graph do you draw? Why are both the test and the graph needed?

Solution: We can test for linearity by adding a quadratic term and seeing whether it is significant; if significant, we reject linearity in favor of curvature. The problem, though, is that with large sample sizes, even small deviations from linearity result in statistically significant results. The graph allows you to assess whether the degree of curvature is worth worrying about. One graph to draw would show the fitted linear and fitted quadratic functions on the same axes, like this:
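The linear-versus-quadratic comparison can also be sketched numerically. The data below are simulated with a deliberately slight curvature; every number is illustrative, not from a real study:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data with a deliberately slight curvature; all numbers
# here are illustrative, not from a real study.
x = rng.uniform(0, 10, 300)
y = 5 + 2 * x + 0.02 * x**2 + rng.normal(0, 1.0, 300)

# Fit the competing models by least squares.
lin = np.polyfit(x, y, 1)    # straight line
quad = np.polyfit(x, y, 2)   # quadratic

# The largest gap between the two fitted curves over the observed X
# range measures whether the curvature matters practically,
# regardless of whether it is statistically significant.
grid = np.linspace(x.min(), x.max(), 100)
gap = float(np.max(np.abs(np.polyval(quad, grid) - np.polyval(lin, grid))))
print(round(gap, 2))  # a small gap: practically unimportant curvature
```

Plotting both fitted curves over the same grid gives exactly the graph the solution describes.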

[Figure: fitted linear and fitted quadratic functions plotted on the same axes; the two curves nearly coincide.]

The graph shows a slight difference, which might be statistically significant, but which is also practically unimportant.

Multiple choice. Each question is worth 2 points.

11. Suppose the number of injuries on a construction site in a day is usually 0, sometimes 1, less often 2, etc., conceptually without any upper limit. Pick the most appropriate regression model:
11.A. Normal regression (no: the data are discrete)
11.B. Ordinal logistic regression (no: there is no upper bound)
11.C. Tobit regression (no: the nonzero observations are not continuous)
11.D. Poisson regression (best choice)

12. In the case of predicting GPA (on the 0-4 scale) as a function of GMAT score, give the most plausible value for ROOT MSE:
A. 0.7 (mean ± 2(0.7) is the most plausible 95% range for GPA)
B. 2.0
C.
D. 14,000
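The count data described in question 11 can be simulated to see why Poisson regression is the natural choice. The daily rate of 0.3 injuries below is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)

# Daily injury counts: usually 0, sometimes 1, rarely more, with no
# upper bound -- exactly the shape a Poisson model is built for.
# The rate of 0.3 injuries per day is an illustrative assumption.
counts = rng.poisson(lam=0.3, size=5000)

# Poisson counts are non-negative integers, and the distribution's
# mean equals its variance (both 0.3 here, up to sampling noise).
print(int(counts.min()), round(float(counts.mean()), 2),
      round(float(counts.var()), 2))
```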

13. Maximum likelihood estimates and least squares estimates are identical when
A. the distribution of Y is assumed to be normal. (yes, math)
B. the variance of Y is a linear function of X. (WLS?)
C. the Gauss-Markov assumptions hold true. (non-normality allowed)
D. the model is correct. (if non-normal?)

14. In simple least squares regression, what is the relationship between the correlation coefficient r and the R² statistic?
A. r = R²
B. r² = R² (righto)
C. 1 − r = R²
D. 1 − r² = R²

15. In ordinary least squares regression, what is the relationship between the sum of squared errors (SSE), the corrected total sum of squares (SST), and the R² statistic?
A. SSE/SST = R²
B. SST − SSE = R²
C. SSE − SST = R²
D. 1 − SSE/SST = R² (righto)

16. All else fixed, what is the effect of increasing the variance of X1?
A. The mean squared error (MSE) will be larger.
B. The mean squared error (MSE) will be smaller.
C. The standard error of β̂1 will be larger.
D. The standard error of β̂1 will be smaller. (righto)

17. Why is β̂1 a random variable? Because
A. the process could have produced other data. ( )
B. the linearity assumption is never true.
C. the true value of β1 is unknown.
D. the data set has more than 2 observations.

18. The true variance of the error terms, Var(ε), is also denoted by
A. MSE (this is an estimate)
B. ROOT MSE
C. Var(Y | X = x) (this is true)
D. Cov(β̂)

19. The linearity assumption means, by definition,
A. When X is larger, then Y is larger. (it's not about data, it's about process)
B. When X is larger, then E(Y) is larger. (allows curvature)
C. The points (X, Y) fall exactly on a straight line. (it's not about data, it's about process)

D. The points (X, E(Y)) fall exactly on a straight line. (bingo)

20. Which one of the following models obeys the variable inclusion principle?
A. Y = β1X + β2X² + ε
B. Y = β0 + β1X + ε (the one and only)
C. Y = β0 + β1X1 + β2X1X2 + ε
D. Y = β0 + β1X + β2Z² + ε

21. Which hypothesis test is used for testing normality?
A. Breusch-Pagan
B. Shapiro-Wilk (yes)
C. q-q plot (this is a graph, not a hypothesis test)
D. histogram ( )

22. Which of the following terms require the randomness assumption for their definition?
A. β1
B. Var(Y | X = x)
C. The p-value for testing H0: β1 = 0
D. All of the above (yes; see westfall/images/5349/practiceproblems_discussion.doc)

23. The Gauss-Markov theorem refers to estimators that are linear functions of the data. Give an example of a non-linear estimator.
A. The sample average of the Y data. (linear; see westfall/images/5349/sp2011_midterm1solution.pdf)
B. β̂ = (X'X)⁻¹X'Y. (G-M applies to OLS, so this must be linear)
C. The model Y = β0 + β1X + β2X² + ε. (This is a model, not an estimator)
D. The estimator of β1 when using quantile regression. (Yes; that's why the bootstrap is needed to find their s.e.'s)

24. In the model Y = β0 + β1(1/X) + ε, the intercept β0 is equal to
A. E(Y | X = 0)
B. E(Y | X = 1)
C. E(Y | X = ∞) (right)
D. E(Y | X = 2) − E(Y | X = 1)

25. Suppose Y depends on X1 and X2. Then the model Y = β0 + β1x1 + ε
A. is simply wrong. (no, we discussed this in class)
B. is simply a model for the conditional distribution of Y given X1 = x1. (right; it answers different questions)

C. is less biased than the model Y = β0 + β1x1 + β2x2 + ε. (If anything, it would be more biased, but only with respect to the model for the conditional distribution given both X's.)
D. violates the uncorrelated errors assumption. (irrelevant)

26. When are the absolute values of the residuals most useful?
A. When checking the linearity assumption.
B. When checking the homoscedasticity assumption. (this one)
C. When checking the uncorrelated errors assumption.
D. When checking the normality assumption.

27. Causal arguments are strengthened by (pick the best answer)
A. finding a higher R² statistic. (correlation does not imply causation)
B. including more X variables in the model and still finding significance of the effect of interest. (right: more control of confounders)
C. showing that the model assumptions are satisfied. (good, but not related; there is no assumption of causality in the model)
D. using a larger sample size. (always a good idea, but it won't help directly; correlation is not causation)

28. How is σᵢ² = Var(Yᵢ | Xᵢ = xᵢ) estimated when using heteroscedasticity-consistent standard errors?
A. σ̂ᵢ² = eᵢ² (yes)
B. σ̂ᵢ² = εᵢ² (unavailable)
C. σ̂ᵢ² = xᵢ²(MSE) (HC makes no assumption about form)
D. σ̂ᵢ² = exp(γ̂0 + γ̂1xᵢ) (HC makes no assumption about form)

29. In mathematics, linearly independent columns of the X matrix means that
A. the correlation matrix of the X data is an identity matrix.
B. the columns of the X matrix are independent variables.
C. the columns of the X matrix have correlations less than 0.9 (in absolute value).
D. no column of X is a linear function of the other columns. (yes)

30. The hat matrix is H = X(X'X)⁻¹X'. Select the true statement.
A. β̂ = HY.
B. The trace of H is equal to n − p.
C. H is idempotent. (right)
D. The standard errors of the β̂j are the diagonal elements of H.

31. There is a full model and a restricted model. Everything with an F subscript is from the full model, and everything with an R subscript is from the restricted model. Also, SSE refers to the sum of squares for error and SST stands for the (corrected) total sum of squares. Then
A. SSE_F ≥ SSE_R

B. SSE_F ≤ SSE_R (yes; adding variables reduces SSE)
C. SST_F < SST_R (SSTs are equal)
D. SST_F > SST_R ( )

32. When is the Model F statistic equal to {β̂1/s.e.(β̂1)}²?
A. When there is one X variable in the model. (Yes, F = t²)
B. When there is more than one X variable in the model.
C. When the null hypothesis is rejected.
D. When the t statistic is normally distributed.

33. Which value of c minimizes Σᵢ₌₁ⁿ |yᵢ − c|?
A. E(Y) (This question refers to data, not process)
B. The median of the distribution of Y (This question refers to data, not process)
C. (1/n) Σᵢ₌₁ⁿ yᵢ
D. The median of {y1, y2, ..., yn} (yes)

34. What is the bootstrap used for?
A. To minimize the sum of absolute deviations.
B. To extrapolate to X data outside the range of the observed data.
C. To estimate standard errors. (Among the choices, this is best; it's used for other things as well.)
D. To compute posterior distributions of the regression parameters.

35. When using weighted least squares, you must assume that
A. the errors are normally distributed. (no; WLS estimates are BLUE even when normality is violated)
B. Var(Yᵢ | Xᵢ = xᵢ) is known for each i = 1, ..., n. (not absolutely necessary; see C)
C. Var(Yᵢ | Xᵢ = xᵢ) = cᵢσ², where cᵢ is known for each i = 1, ..., n. (yes)
D. Var(Yᵢ | Xᵢ = xᵢ) = exp(γ0 + γ1xᵢ), where γ0 and γ1 are known. (you could use WLS here, but this is wrong for the same reason that B is wrong)

36. When the weights in weighted least squares (WLS) are all 1.0, then the WLS estimates are
A. quantile regression estimates with q = 0.5.
B. quantile regression estimates with q = 1.0.
C. generalized least squares estimates with a block-diagonal homoscedastic covariance structure.
D. ordinary least squares estimates. (yes)

37. The compound symmetry covariance structure assumes

A. observations that are farther apart in time have smaller covariance.
B. observations that are farther apart in time have smaller variance.
C. observations on different people have different variances.
D. observations within a person are equally correlated. (yes)

38. The AR(1) covariance structure assumes
A. observations that are farther apart in time have smaller covariance. (yes)
B. observations that are farther apart in time have smaller variance.
C. observations on different people have different variances.
D. observations within a person are equally correlated.

39. In PROC MIXED of SAS/STAT, the covariance matrix of Y is estimated as ZGZ' + R. Select the true statement.
A. The R matrix is a correlation matrix. (no, it's a covariance matrix)
B. The RANDOM statement defines the R matrix.
C. The G matrix is a correlation matrix. (see A)
D. The RANDOM statement defines the Z matrix. (yes)

40. In panel data where companies are followed over time, there is cross-sectional correlation. This means that
A. observations in the same year are correlated. (yes)
B. observations on the same company are correlated. (that's time series correlation)
C. observations that are two years apart are more highly correlated than observations that are ten years apart. (see B)
D. there is a high degree of multicollinearity. (irrelevant)

41. The multivariate normal distribution function is used in mixed models mainly to
A. define the levels of the multilevel analysis. (?)
B. estimate the parameters of the model via maximum likelihood. (yes)
C. compute the R² statistic. (could be done via a pseudo R², but we never did it that way and it's not in SAS)
D. assure that the predictions of the random effects are best linear unbiased predictions (BLUPs). (normality is needed, as in the case of BLUE)

42. Select the incorrect answer. Generalized least squares estimates
A. minimize the sum of squared errors (or SSE). (no, that's OLS)
B. are best linear unbiased estimates when Φ is known. (right)
C. are maximum likelihood estimates when the distribution of Y is multivariate normal with mean Xβ and known covariance matrix Φ. (yes)
D. are given by the formula (X'Φ⁻¹X)⁻¹X'Φ⁻¹Y. (yes)
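The GLS formula in option D is easy to verify numerically. This is a minimal sketch with a simulated design and illustrative coefficients, showing that when Φ = I (uncorrelated, homoscedastic errors) the GLS formula reduces to OLS:

```python
import numpy as np

rng = np.random.default_rng(2)

# A small simulated design: intercept plus one predictor; the
# coefficient values (1, 2) are illustrative.
n = 50
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 1.0, n)

def gls(X, y, Phi):
    """Generalized least squares: (X' Phi^-1 X)^-1 X' Phi^-1 y."""
    Pinv = np.linalg.inv(Phi)
    return np.linalg.solve(X.T @ Pinv @ X, X.T @ Pinv @ y)

# With Phi = I, GLS reduces to ordinary least squares.
ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(gls(X, y, np.eye(n)), ols))  # True
```

With a non-identity Φ (for example, a compound symmetry or AR(1) matrix from questions 37-38), the same function gives the more efficient correlated-errors estimate.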

43. When the errors are correlated, but all other assumptions are satisfied, then the ordinary least squares estimators for the βj are
A. still unbiased. (yes)
B. non-normally distributed. (doesn't follow logically)
C. uncorrelated. ( )
D. efficient. (no)

44. If your estimate has the form θ̂ = A − B (A minus B), where A and B are positively correlated, then if you use a model that assumes independence of A and B, the reported standard error of θ̂ will be
A. too large. (yes: Var(A − B) = Var(A) + Var(B) − 2Cov(A, B); if you assume uncorrelated you get Var(A − B) = Var(A) + Var(B), which is too large)
B. too small.
C. sometimes too large, sometimes too small.
D. unaffected.

45. There are several reasons for not using classic linear regression analysis with binary responses. Which of the following is not one of those reasons?
A. Because the probabilities do not lie on a straight line.
B. Because the variance is nonconstant.
C. Because the distribution is non-normal.
D. Because there are outliers in the Y variable. (0/1 data do not usually have outliers)

46. What is the likelihood of a single binary (0 or 1) observation y?
A. (1/√(2πθ2)) exp(−(y − θ1)²/(2θ2))
B. θ^y (1 − θ)^(1−y) (this one: you get θ when y = 1 and 1 − θ when y = 0)
C. θe^(−θy)
D. 1/θ

47. When is the variance of binary (0 or 1) data largest?
A. When the proportion of 1's is
B. When the proportion of 1's is 0.5. (Yes: Var(Y) = π(1 − π))
C. When the binary data are normally distributed.
D. When the sample size is small.
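The Bernoulli likelihood in question 46.B and the variance fact in question 47 can be checked directly; θ = 0.25 is an arbitrary example value:

```python
# Direct checks of the Bernoulli likelihood theta**y * (1-theta)**(1-y)
# and of the variance pi*(1-pi); theta = 0.25 is an arbitrary example.

def bernoulli_likelihood(y, theta):
    return theta ** y * (1.0 - theta) ** (1 - y)

def bernoulli_variance(pi):
    return pi * (1.0 - pi)

# The likelihood reduces to theta when y = 1 and to 1 - theta when y = 0.
print(bernoulli_likelihood(1, 0.25))  # 0.25
print(bernoulli_likelihood(0, 0.25))  # 0.75

# Over a grid of proportions, the variance peaks at 0.5.
pis = [i / 100 for i in range(1, 100)]
print(max(pis, key=bernoulli_variance))  # 0.5
```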

48. If the estimated probability of success when X = 4 is 0.3, then which is the most logical prediction?
A. That 30 out of 100 people having X = 4 will be successful. (this one)
B. That 50 out of 100 people having X = 4 will be successful.
C. That 30 out of 100 people will be successful, regardless of their X.
D. That 50 out of 100 people will be successful, regardless of their X.

49. If the probability of success is 0.75, then the odds of success is
A. also 0.75.
B.
C. exp(0.75).
D. 3. (0.75/(1 − 0.75) = 3)

50. The logarithm of the odds is also called the
A. normit.
B. probit.
C. tobit.
D. logit. (yep)
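The odds and logit relationships in questions 49 and 50 can be sketched as:

```python
import math

# Odds and logit for the exam's example probability p = 0.75.

def odds(p):
    return p / (1.0 - p)

def logit(p):
    # The logarithm of the odds is the logit.
    return math.log(odds(p))

def logistic(t):
    # The logistic function inverts the logit.
    return 1.0 / (1.0 + math.exp(-t))

print(odds(0.75))                       # 3.0
print(round(logistic(logit(0.75)), 6))  # 0.75, recovering the probability
```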


4 Multiple Linear Regression 4 Multiple Linear Regression 4. The Model Definition 4.. random variable Y fits a Multiple Linear Regression Model, iff there exist β, β,..., β k R so that for all (x, x 2,..., x k ) R k where ε N (, σ

More information

Exam details. Final Review Session. Things to Review

Exam details. Final Review Session. Things to Review Exam details Final Review Session Short answer, similar to book problems Formulae and tables will be given You CAN use a calculator Date and Time: Dec. 7, 006, 1-1:30 pm Location: Osborne Centre, Unit

More information

ECON The Simple Regression Model

ECON The Simple Regression Model ECON 351 - The Simple Regression Model Maggie Jones 1 / 41 The Simple Regression Model Our starting point will be the simple regression model where we look at the relationship between two variables In

More information

Making sense of Econometrics: Basics

Making sense of Econometrics: Basics Making sense of Econometrics: Basics Lecture 2: Simple Regression Egypt Scholars Economic Society Happy Eid Eid present! enter classroom at http://b.socrative.com/login/student/ room name c28efb78 Outline

More information

CS 5014: Research Methods in Computer Science

CS 5014: Research Methods in Computer Science Computer Science Clifford A. Shaffer Department of Computer Science Virginia Tech Blacksburg, Virginia Fall 2010 Copyright c 2010 by Clifford A. Shaffer Computer Science Fall 2010 1 / 207 Correlation and

More information

Introduction to Regression

Introduction to Regression Introduction to Regression ιατµηµατικό Πρόγραµµα Μεταπτυχιακών Σπουδών Τεχνο-Οικονοµικά Συστήµατα ηµήτρης Φουσκάκης Introduction Basic idea: Use data to identify relationships among variables and use these

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

K. Model Diagnostics. residuals ˆɛ ij = Y ij ˆµ i N = Y ij Ȳ i semi-studentized residuals ω ij = ˆɛ ij. studentized deleted residuals ɛ ij =

K. Model Diagnostics. residuals ˆɛ ij = Y ij ˆµ i N = Y ij Ȳ i semi-studentized residuals ω ij = ˆɛ ij. studentized deleted residuals ɛ ij = K. Model Diagnostics We ve already seen how to check model assumptions prior to fitting a one-way ANOVA. Diagnostics carried out after model fitting by using residuals are more informative for assessing

More information

Exam Applied Statistical Regression. Good Luck!

Exam Applied Statistical Regression. Good Luck! Dr. M. Dettling Summer 2011 Exam Applied Statistical Regression Approved: Tables: Note: Any written material, calculator (without communication facility). Attached. All tests have to be done at the 5%-level.

More information

Applied Statistics and Econometrics

Applied Statistics and Econometrics Applied Statistics and Econometrics Lecture 6 Saul Lach September 2017 Saul Lach () Applied Statistics and Econometrics September 2017 1 / 53 Outline of Lecture 6 1 Omitted variable bias (SW 6.1) 2 Multiple

More information

1 Least Squares Estimation - multiple regression.

1 Least Squares Estimation - multiple regression. Introduction to multiple regression. Fall 2010 1 Least Squares Estimation - multiple regression. Let y = {y 1,, y n } be a n 1 vector of dependent variable observations. Let β = {β 0, β 1 } be the 2 1

More information

Simple Linear Regression. (Chs 12.1, 12.2, 12.4, 12.5)

Simple Linear Regression. (Chs 12.1, 12.2, 12.4, 12.5) 10 Simple Linear Regression (Chs 12.1, 12.2, 12.4, 12.5) Simple Linear Regression Rating 20 40 60 80 0 5 10 15 Sugar 2 Simple Linear Regression Rating 20 40 60 80 0 5 10 15 Sugar 3 Simple Linear Regression

More information

Multiple Regression and Model Building Lecture 20 1 May 2006 R. Ryznar

Multiple Regression and Model Building Lecture 20 1 May 2006 R. Ryznar Multiple Regression and Model Building 11.220 Lecture 20 1 May 2006 R. Ryznar Building Models: Making Sure the Assumptions Hold 1. There is a linear relationship between the explanatory (independent) variable(s)

More information

Chapter 8. R-squared, Adjusted R-Squared, the F test, and Multicollinearity

Chapter 8. R-squared, Adjusted R-Squared, the F test, and Multicollinearity Chapter 8. R-squared, Adusted R-Squared, the F test, and Multicollinearity This chapter discusses additional output in the regression analysis, from the context of multiple regression in the classic model.

More information

This model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that

This model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that Linear Regression For (X, Y ) a pair of random variables with values in R p R we assume that E(Y X) = β 0 + with β R p+1. p X j β j = (1, X T )β j=1 This model of the conditional expectation is linear

More information

Fall 07 ISQS 6348 Midterm Solutions

Fall 07 ISQS 6348 Midterm Solutions Fall 07 ISQS 648 Midterm Solutions Instructions: Open notes, no books. Points out of 00 in parentheses. 1. A random vector X = 4 X 1 X X has the following mean vector and covariance matrix: E(X) = 4 1

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression ST 430/514 Recall: A regression model describes how a dependent variable (or response) Y is affected, on average, by one or more independent variables (or factors, or covariates)

More information

Econometrics Lecture 5: Limited Dependent Variable Models: Logit and Probit

Econometrics Lecture 5: Limited Dependent Variable Models: Logit and Probit Econometrics Lecture 5: Limited Dependent Variable Models: Logit and Probit R. G. Pierse 1 Introduction In lecture 5 of last semester s course, we looked at the reasons for including dichotomous variables

More information

Chapter 1. Linear Regression with One Predictor Variable

Chapter 1. Linear Regression with One Predictor Variable Chapter 1. Linear Regression with One Predictor Variable 1.1 Statistical Relation Between Two Variables To motivate statistical relationships, let us consider a mathematical relation between two mathematical

More information

Rockefeller College University at Albany

Rockefeller College University at Albany Rockefeller College University at Albany PAD 705 Handout: Suggested Review Problems from Pindyck & Rubinfeld Original prepared by Professor Suzanne Cooper John F. Kennedy School of Government, Harvard

More information

Estimating σ 2. We can do simple prediction of Y and estimation of the mean of Y at any value of X.

Estimating σ 2. We can do simple prediction of Y and estimation of the mean of Y at any value of X. Estimating σ 2 We can do simple prediction of Y and estimation of the mean of Y at any value of X. To perform inferences about our regression line, we must estimate σ 2, the variance of the error term.

More information

Lecture 4 Multiple linear regression

Lecture 4 Multiple linear regression Lecture 4 Multiple linear regression BIOST 515 January 15, 2004 Outline 1 Motivation for the multiple regression model Multiple regression in matrix notation Least squares estimation of model parameters

More information

McGill University. Faculty of Science. Department of Mathematics and Statistics. Statistics Part A Comprehensive Exam Methodology Paper

McGill University. Faculty of Science. Department of Mathematics and Statistics. Statistics Part A Comprehensive Exam Methodology Paper Student Name: ID: McGill University Faculty of Science Department of Mathematics and Statistics Statistics Part A Comprehensive Exam Methodology Paper Date: Friday, May 13, 2016 Time: 13:00 17:00 Instructions

More information

Multilevel Models in Matrix Form. Lecture 7 July 27, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2

Multilevel Models in Matrix Form. Lecture 7 July 27, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Multilevel Models in Matrix Form Lecture 7 July 27, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Today s Lecture Linear models from a matrix perspective An example of how to do

More information

Introduction to Econometrics. Heteroskedasticity

Introduction to Econometrics. Heteroskedasticity Introduction to Econometrics Introduction Heteroskedasticity When the variance of the errors changes across segments of the population, where the segments are determined by different values for the explanatory

More information

Correlation and regression

Correlation and regression NST 1B Experimental Psychology Statistics practical 1 Correlation and regression Rudolf Cardinal & Mike Aitken 11 / 12 November 2003 Department of Experimental Psychology University of Cambridge Handouts:

More information

Circle the single best answer for each multiple choice question. Your choice should be made clearly.

Circle the single best answer for each multiple choice question. Your choice should be made clearly. TEST #1 STA 4853 March 6, 2017 Name: Please read the following directions. DO NOT TURN THE PAGE UNTIL INSTRUCTED TO DO SO Directions This exam is closed book and closed notes. There are 32 multiple choice

More information

The Simple Linear Regression Model

The Simple Linear Regression Model The Simple Linear Regression Model Lesson 3 Ryan Safner 1 1 Department of Economics Hood College ECON 480 - Econometrics Fall 2017 Ryan Safner (Hood College) ECON 480 - Lesson 3 Fall 2017 1 / 77 Bivariate

More information

Association studies and regression

Association studies and regression Association studies and regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Association studies and regression 1 / 104 Administration

More information

1 Correlation between an independent variable and the error

1 Correlation between an independent variable and the error Chapter 7 outline, Econometrics Instrumental variables and model estimation 1 Correlation between an independent variable and the error Recall that one of the assumptions that we make when proving the

More information

Lecture 10 Multiple Linear Regression

Lecture 10 Multiple Linear Regression Lecture 10 Multiple Linear Regression STAT 512 Spring 2011 Background Reading KNNL: 6.1-6.5 10-1 Topic Overview Multiple Linear Regression Model 10-2 Data for Multiple Regression Y i is the response variable

More information

Immigration attitudes (opposes immigration or supports it) it may seriously misestimate the magnitude of the effects of IVs

Immigration attitudes (opposes immigration or supports it) it may seriously misestimate the magnitude of the effects of IVs Logistic Regression, Part I: Problems with the Linear Probability Model (LPM) Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised February 22, 2015 This handout steals

More information

Lecture 2 Linear Regression: A Model for the Mean. Sharyn O Halloran

Lecture 2 Linear Regression: A Model for the Mean. Sharyn O Halloran Lecture 2 Linear Regression: A Model for the Mean Sharyn O Halloran Closer Look at: Linear Regression Model Least squares procedure Inferential tools Confidence and Prediction Intervals Assumptions Robustness

More information

Categorical Predictor Variables

Categorical Predictor Variables Categorical Predictor Variables We often wish to use categorical (or qualitative) variables as covariates in a regression model. For binary variables (taking on only 2 values, e.g. sex), it is relatively

More information

STAT Chapter 11: Regression

STAT Chapter 11: Regression STAT 515 -- Chapter 11: Regression Mostly we have studied the behavior of a single random variable. Often, however, we gather data on two random variables. We wish to determine: Is there a relationship

More information

Lectures on Simple Linear Regression Stat 431, Summer 2012

Lectures on Simple Linear Regression Stat 431, Summer 2012 Lectures on Simple Linear Regression Stat 43, Summer 0 Hyunseung Kang July 6-8, 0 Last Updated: July 8, 0 :59PM Introduction Previously, we have been investigating various properties of the population

More information

Review of Statistics

Review of Statistics Review of Statistics Topics Descriptive Statistics Mean, Variance Probability Union event, joint event Random Variables Discrete and Continuous Distributions, Moments Two Random Variables Covariance and

More information

Simple Linear Regression. Material from Devore s book (Ed 8), and Cengagebrain.com

Simple Linear Regression. Material from Devore s book (Ed 8), and Cengagebrain.com 12 Simple Linear Regression Material from Devore s book (Ed 8), and Cengagebrain.com The Simple Linear Regression Model The simplest deterministic mathematical relationship between two variables x and

More information

Sociology 593 Exam 2 Answer Key March 28, 2002

Sociology 593 Exam 2 Answer Key March 28, 2002 Sociology 59 Exam Answer Key March 8, 00 I. True-False. (0 points) Indicate whether the following statements are true or false. If false, briefly explain why.. A variable is called CATHOLIC. This probably

More information

Final Exam. Question 1 (20 points) 2 (25 points) 3 (30 points) 4 (25 points) 5 (10 points) 6 (40 points) Total (150 points) Bonus question (10)

Final Exam. Question 1 (20 points) 2 (25 points) 3 (30 points) 4 (25 points) 5 (10 points) 6 (40 points) Total (150 points) Bonus question (10) Name Economics 170 Spring 2004 Honor pledge: I have neither given nor received aid on this exam including the preparation of my one page formula list and the preparation of the Stata assignment for the

More information

Business Statistics. Tommaso Proietti. Linear Regression. DEF - Università di Roma 'Tor Vergata'

Business Statistics. Tommaso Proietti. Linear Regression. DEF - Università di Roma 'Tor Vergata' Business Statistics Tommaso Proietti DEF - Università di Roma 'Tor Vergata' Linear Regression Specication Let Y be a univariate quantitative response variable. We model Y as follows: Y = f(x) + ε where

More information

Review of Econometrics

Review of Econometrics Review of Econometrics Zheng Tian June 5th, 2017 1 The Essence of the OLS Estimation Multiple regression model involves the models as follows Y i = β 0 + β 1 X 1i + β 2 X 2i + + β k X ki + u i, i = 1,...,

More information

Intermediate Econometrics

Intermediate Econometrics Intermediate Econometrics Heteroskedasticity Text: Wooldridge, 8 July 17, 2011 Heteroskedasticity Assumption of homoskedasticity, Var(u i x i1,..., x ik ) = E(u 2 i x i1,..., x ik ) = σ 2. That is, the

More information

Open book, but no loose leaf notes and no electronic devices. Points (out of 200) are in parentheses. Put all answers on the paper provided to you.

Open book, but no loose leaf notes and no electronic devices. Points (out of 200) are in parentheses. Put all answers on the paper provided to you. ISQS 5347 Final Exam Spring 2017 Open book, but no loose leaf notes and no electronic devices. Points (out of 200) are in parentheses. Put all answers on the paper provided to you. 1. Recall the commute

More information

Inference and Regression

Inference and Regression Name Inference and Regression Final Examination, 2015 Department of IOMS This course and this examination are governed by the Stern Honor Code. Instructions Please write your name at the top of this page.

More information

Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals

Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals (SW Chapter 5) Outline. The standard error of ˆ. Hypothesis tests concerning β 3. Confidence intervals for β 4. Regression

More information

WU Weiterbildung. Linear Mixed Models

WU Weiterbildung. Linear Mixed Models Linear Mixed Effects Models WU Weiterbildung SLIDE 1 Outline 1 Estimation: ML vs. REML 2 Special Models On Two Levels Mixed ANOVA Or Random ANOVA Random Intercept Model Random Coefficients Model Intercept-and-Slopes-as-Outcomes

More information

Correlation and Linear Regression

Correlation and Linear Regression Correlation and Linear Regression Correlation: Relationships between Variables So far, nearly all of our discussion of inferential statistics has focused on testing for differences between group means

More information

Review of Multiple Regression

Review of Multiple Regression Ronald H. Heck 1 Let s begin with a little review of multiple regression this week. Linear models [e.g., correlation, t-tests, analysis of variance (ANOVA), multiple regression, path analysis, multivariate

More information

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept,

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, Linear Regression In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, y = Xβ + ɛ, where y t = (y 1,..., y n ) is the column vector of target values,

More information

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 Lecture 2: Linear Models Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 1 Quick Review of the Major Points The general linear model can be written as y = X! + e y = vector

More information

Survey on Population Mean

Survey on Population Mean MATH 203 Survey on Population Mean Dr. Neal, Spring 2009 The first part of this project is on the analysis of a population mean. You will obtain data on a specific measurement X by performing a random

More information

Multicollinearity occurs when two or more predictors in the model are correlated and provide redundant information about the response.

Multicollinearity occurs when two or more predictors in the model are correlated and provide redundant information about the response. Multicollinearity Read Section 7.5 in textbook. Multicollinearity occurs when two or more predictors in the model are correlated and provide redundant information about the response. Example of multicollinear

More information

Math 5305 Notes. Diagnostics and Remedial Measures. Jesse Crawford. Department of Mathematics Tarleton State University

Math 5305 Notes. Diagnostics and Remedial Measures. Jesse Crawford. Department of Mathematics Tarleton State University Math 5305 Notes Diagnostics and Remedial Measures Jesse Crawford Department of Mathematics Tarleton State University (Tarleton State University) Diagnostics and Remedial Measures 1 / 44 Model Assumptions

More information

An overview of applied econometrics

An overview of applied econometrics An overview of applied econometrics Jo Thori Lind September 4, 2011 1 Introduction This note is intended as a brief overview of what is necessary to read and understand journal articles with empirical

More information

Problem Set #6: OLS. Economics 835: Econometrics. Fall 2012

Problem Set #6: OLS. Economics 835: Econometrics. Fall 2012 Problem Set #6: OLS Economics 835: Econometrics Fall 202 A preliminary result Suppose we have a random sample of size n on the scalar random variables (x, y) with finite means, variances, and covariance.

More information

Unit 10: Simple Linear Regression and Correlation

Unit 10: Simple Linear Regression and Correlation Unit 10: Simple Linear Regression and Correlation Statistics 571: Statistical Methods Ramón V. León 6/28/2004 Unit 10 - Stat 571 - Ramón V. León 1 Introductory Remarks Regression analysis is a method for

More information

Unit 6 - Simple linear regression

Unit 6 - Simple linear regression Sta 101: Data Analysis and Statistical Inference Dr. Çetinkaya-Rundel Unit 6 - Simple linear regression LO 1. Define the explanatory variable as the independent variable (predictor), and the response variable

More information

STA 2101/442 Assignment Four 1

STA 2101/442 Assignment Four 1 STA 2101/442 Assignment Four 1 One version of the general linear model with fixed effects is y = Xβ + ɛ, where X is an n p matrix of known constants with n > p and the columns of X linearly independent.

More information

LINEAR REGRESSION ANALYSIS. MODULE XVI Lecture Exercises

LINEAR REGRESSION ANALYSIS. MODULE XVI Lecture Exercises LINEAR REGRESSION ANALYSIS MODULE XVI Lecture - 44 Exercises Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur Exercise 1 The following data has been obtained on

More information

Bias Variance Trade-off

Bias Variance Trade-off Bias Variance Trade-off The mean squared error of an estimator MSE(ˆθ) = E([ˆθ θ] 2 ) Can be re-expressed MSE(ˆθ) = Var(ˆθ) + (B(ˆθ) 2 ) MSE = VAR + BIAS 2 Proof MSE(ˆθ) = E((ˆθ θ) 2 ) = E(([ˆθ E(ˆθ)]

More information

Lecture 2: Linear and Mixed Models

Lecture 2: Linear and Mixed Models Lecture 2: Linear and Mixed Models Bruce Walsh lecture notes Introduction to Mixed Models SISG, Seattle 18 20 July 2018 1 Quick Review of the Major Points The general linear model can be written as y =

More information

Linear Regression With Special Variables

Linear Regression With Special Variables Linear Regression With Special Variables Junhui Qian December 21, 2014 Outline Standardized Scores Quadratic Terms Interaction Terms Binary Explanatory Variables Binary Choice Models Standardized Scores:

More information

Chapter 6. Logistic Regression. 6.1 A linear model for the log odds

Chapter 6. Logistic Regression. 6.1 A linear model for the log odds Chapter 6 Logistic Regression In logistic regression, there is a categorical response variables, often coded 1=Yes and 0=No. Many important phenomena fit this framework. The patient survives the operation,

More information