Fall 07 ISQS 6348 Midterm Solutions

Kristina Nicholson

Instructions: Open notes, no books. Points out of 200 in parentheses.

1. A random vector $X = (X_1, X_2, X_3)'$ has the following mean vector and covariance matrix: $E(X) = \ldots$; $\mathrm{Cov}(X) = \ldots$

1.A.(10) Find the correlation between $X_1$ and $X_2$.

Solution: $\rho_{12} = \sigma_{12} / (\sqrt{\sigma_{11}} \sqrt{\sigma_{22}}) = 1 / (\sqrt{1} \sqrt{4}) = 0.5$.

1.B.(20) Sketch the likely appearance of the scatterplot of the $(X_1, X_2)$ data. Label axes carefully.

Solution: The likely range of $X_1$ is $1 \pm 3(1)$, or $-2$ to $4$; the likely range of $X_2$ is $10 \pm 3(2)$, or $4$ to $16$. So the graph should show a scatter of data points with those ranges on the respective $X_1$ and $X_2$ axes, with a moderate upward tilt to reflect the positive correlation, and with an ellipse that is not too tight, to reflect the fact that the correlation is not extremely close to 1.0.

1.C.(20) Explain how $X = (X_1, X_2, X_3)'$ appears in your data set (spreadsheet or SAS file).

Solution: The data vector, transposed, is a generic row in the spreadsheet. Specifically, $X' = (X_1, X_2, X_3)$ might be a random row $i$ in your data set, which looks like this:

Obs   X1    X2    X3
1     X11   X12   X13
2     X21   X22   X23
...
i     Xi1   Xi2   Xi3
...
n     Xn1   Xn2   Xn3

1.D.(10) Suppose $Y = \ldots$ Find $C$ so that $Y = CX$.

Solution: $C = \ldots$

2. Likert scale data are on a 1-5 scale, where 1 = "Bad" and 5 = "Good". The graph shows a 95% confidence ellipsoid for the parameter vector $\mu' = (\mu_1, \mu_2)$, using the Likert scale data set from the homework.

2.A.(20) Based on this ellipse, is it plausible that $\mu_1 = \mu_2$? Explain.

Solution: Yes, the ellipse admits values where $\mu_1 = \mu_2$, including $(\mu_1, \mu_2) = (4, 4)$, $(\mu_1, \mu_2) = (4.01, 4.01)$, $(\mu_1, \mu_2) = (4.02, 4.02)$, and others. So it is indeed plausible that $\mu_1 = \mu_2$.

2.B.(20) Based on the ellipse, can we say that approximately 95% of the survey respondents answered "4" for both questions 1 and 2? Explain.

Solution: No, that would be the interpretation of the prediction ellipse, which is much larger than the confidence ellipse. In this example, the prediction ellipse should cover most of the 1-5 range in both directions to capture 95% of the actual survey responses. The actual probability that a survey is answered "4" on both questions is likely to be much smaller than 95%.

2.C.(10) Use the ellipse to identify a confidence interval for $\mu_2$.

Solution: The range on the vertical axis consistent with the ellipse shows approximately $3.96 < \mu_2 < 4.06$.

3.A.(20) What is the purpose of considering "distance" in statistical analysis? (Not necessarily Mahalanobis distance in particular; just explain why the notion of "distance" is important in statistics.)

Solution: Distance is used for comparison. How your quiz score compares to another's quiz score is measured by the distance between your score and the other's score. How one treatment compares to another treatment is measured by the distance between outcomes for the two different treatments. Whether a point is an outlier is determined by its distance from the mean. How well a prediction model works is determined by the distance from the predictions to the actuals. Whether a research theory (or hypothesis) is tenable is determined by how distant the data are from what you would expect when the theory (or hypothesis) is true. It's hard to think of anything in statistics that does not use distance in some way.

3.B.(20) Why, in particular, is Mahalanobis distance needed?

Solution: Mahalanobis distance incorporates variance and correlation info. Variance info is needed to properly scale the variables, so that a distance of 1 (= 1 standard deviation, $\sigma_1$) in the $X_1$ direction is comparable to a distance of 1 (= 1 standard deviation, $\sigma_2$) in the $X_2$ direction. Correlation info is needed to identify distant points relative to the data scatter: it might happen that the standard Euclidean distance from a point to the mean is small, but the point lies well outside the scatter. The errors data provides a nice example, where the red highlighted point is only an outlier when you consider correlation information.
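The role of correlation in 3.B can be illustrated numerically. The following is a minimal sketch, not the errors data from class: the mean vector, the covariance matrix (unit variances, correlation 0.9), and the two test points are all hypothetical values chosen so that both points have the same Euclidean distance from the mean.

```python
import numpy as np

# Hypothetical bivariate setup: unit variances, correlation 0.9.
mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])
cov_inv = np.linalg.inv(cov)


def mahalanobis(x, mean, cov_inv):
    """Return sqrt((x - mean)' * cov_inv * (x - mean))."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))


# Two hypothetical points at the same Euclidean distance from the mean.
along = np.array([1.0, 1.0])     # lies along the upward tilt of the scatter
against = np.array([1.0, -1.0])  # lies across the tilt, outside the scatter

euclid_along = float(np.linalg.norm(along - mean))
euclid_against = float(np.linalg.norm(against - mean))
maha_along = mahalanobis(along, mean, cov_inv)      # about 1.03
maha_against = mahalanobis(against, mean, cov_inv)  # about 4.47

print(euclid_along, euclid_against)
print(maha_along, maha_against)
```

Both points are equally far from the mean in the Euclidean sense, yet only the second is flagged as distant once the correlation is taken into account.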

[Figure: scatterplot of the errors data, E2 versus E1, with the outlying point highlighted in red.]

Mahalanobis distance is also the basis for the multivariate normal distribution: when the data vector is distributed as MVN, points with equal Mahalanobis distance from the mean vector have equal likelihood.

4. Answer True or False. (5 points apiece)

4.A. Standardized Euclidean distance incorporates correlation information.

Solution: False. It involves the standard deviations, but not the correlations.

4.B. If there is a negative number in a matrix, then the matrix cannot be a covariance matrix.

Solution: False. See class notes; the matrix $\begin{bmatrix} 1 & -.99 \\ -.99 & 1 \end{bmatrix}$ is a covariance matrix.
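The claim in 4.B can be checked directly: a matrix is a valid covariance matrix when it is symmetric and positive semidefinite, regardless of the signs of its off-diagonal entries. A minimal sketch, assuming the off-diagonal entries of the matrix from the solution are -0.99:

```python
import numpy as np

# The 2x2 matrix from the solution to 4.B (off-diagonals assumed to be -0.99).
sigma = np.array([[1.0, -0.99],
                  [-0.99, 1.0]])

# A covariance matrix must be symmetric with no negative eigenvalues.
is_symmetric = bool(np.allclose(sigma, sigma.T))
eigenvalues = np.linalg.eigvalsh(sigma)  # ascending order: 0.01 and 1.99
is_valid_covariance = bool(is_symmetric and np.all(eigenvalues >= 0))

print(eigenvalues, is_valid_covariance)
```

The negative entry only says that the two variables are negatively correlated (here, correlation -0.99).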


4.C. You might assume variables are independent, or you might assume they are uncorrelated. In the former case you are more likely to be wrong than in the latter case.

Solution: True. Look at the following picture. The box indicates every possible joint distribution of the two variables. The light blue shows joint distributions where the two variables are independent. The dark blue shows joint distributions where the two variables are uncorrelated. Since we know that independence implies uncorrelatedness, the picture is correct. It also shows that you have a better chance of being wrong if you assume independence, since there are fewer joint pdfs exhibiting independence than there are that exhibit uncorrelatedness.

4.D. If an ordinary correlation is positive, yet the partial correlation is negative, this is an example of Simpson's paradox.

Solution: True, as described in class with the behavioral finance example, except that the directions were reversed in that case.

4.E. When there are just two variables, the square of the correlation coefficient is equal to the R-square statistic.

Solution: True, as discussed in class with the regression to the mean example.

4.F. If a random vector X has a "spherical" normal distribution, then Cov(X) = I.

Solution: True. Consider an ellipse of constant density, defined by X values with constant Mahalanobis distance from the mean. It is a sphere in this case since the variances are the same for all variables and since the variables are uncorrelated.

4.G. If $Z_1, \ldots, Z_p \sim$ iid $N(0, 1)$, then $\sum_{i=1}^{p} Z_i^2 \sim \chi^2_p$.

Solution: True, by definition.

4.H. If the chi-square q-q plot looks approximately like the 45 degree line, then we can conclude that the data come from a multivariate normal distribution.

Solution: False. We never conclude that data come from any type of normal distribution unless we simulate the data ourselves. Also recall that MVN implies the expected appearance is a straight line. But this statement does not admit the converse "the expected appearance is a straight line implies MVN." Recall that truth of "A implies B" does not allow us to conclude truth of "B implies A"; recall the cow/mammal example.

4.I. The familywise error rate cannot be smaller than the comparisonwise error rate.

Solution: I originally conceived of this as a "True" answer. The comparisonwise error rate (CER) is the probability of an error on one test. You have a higher chance of making a mistake with more than one test. Analogy: Play Russian Roulette one time. The comparisonwise error rate is the probability of death. Play the game ten times. The familywise error rate (FWER) is the probability of death, and is higher.

That answer presupposes that you use a single testing strategy, such as two-sample t tests, and compare CER and FWER using that same test. In that case, CER = .05 and FWER is much higher than .05. In the example from class, we calculated FWER $= 1 - .95^{50} = .92$.

However, one might interpret the question to mean that the CER was calculated on two-sample t tests, giving CER = .05, but the FWER was calculated using Bonferroni-adjusted two-sample t tests, in which case the FWER is .05, and the answer is "False." Since the question is not clear as to which scenario is taking place, either "True" or "False" is acceptable.

4.J. An adjusted p-value cannot be smaller than an ordinary p-value.

Solution: True. The formulas show that the adjusted p-values are obtained by multiplying the ordinary p-values by numbers that are greater than
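The arithmetic behind 4.I and 4.J can be reproduced in a few lines. This is a minimal sketch under stated assumptions: the 50 tests are treated as independent (which is what the formula 1 - .95^50 presumes), the adjustment shown is the plain Bonferroni rule (multiply each p-value by the number of tests and cap at 1), and the p-values are hypothetical.

```python
alpha = 0.05
n_tests = 50

# FWER when each of 50 independent tests is run at level alpha.
fwer_unadjusted = 1 - (1 - alpha) ** n_tests

# FWER when each test is run at the Bonferroni level alpha / n_tests.
fwer_bonferroni = 1 - (1 - alpha / n_tests) ** n_tests

# Hypothetical ordinary p-values and their Bonferroni-adjusted versions.
p_values = [0.0004, 0.003, 0.02, 0.4]
adjusted = [min(1.0, p * len(p_values)) for p in p_values]

print(round(fwer_unadjusted, 2))  # 0.92
print(fwer_bonferroni <= alpha)   # True
print(adjusted)
```

Under these assumptions the unadjusted FWER is far above the CER of .05, the Bonferroni-adjusted FWER stays at or below .05, and no adjusted p-value is smaller than its ordinary counterpart.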
