ISQS 6348 Final Exam Solutions

Name: ______________________

Open book and notes, but no electronic devices. Answer short answer questions on separate blank paper. Answer multiple choice on this exam sheet. Put your name on everything you hand in. Points out of 100 are indicated in parentheses.

Short Answer Questions: Answer on your separate blank paper. Points are given in parentheses.

1. (3) Calculate: [matrix expression lost in transcription]. Solution: [lost in transcription]

2.A. (3) Give the following system of equations in matrix/vector form: [the two linear equations in x and y were lost in transcription]. Solution: Write the system as Av = b, where A is the 2x2 coefficient matrix, v = (x, y)', and b is the vector of right-hand-side constants.

2.B. (3) Give the matrix form of the solution to 2.A. (Don't calculate the final solution.) Solution: v = A^(-1) b.

3. Here is a covariance matrix:

   Σ = Cov(X1, X2) = [ 9  -1 ]
                     [-1   1 ]

3.A. (2) Find the standard deviation of X1. Solution: sqrt(9) = 3

3.B. (2) Find the standard deviation of X2. Solution: sqrt(1) = 1

3.C. (3) Find the correlation between X1 and X2. Solution: -1/(3*1) = -1/3
3.D. (3) Assume the means of X1 and X2 are zero. Using your answers to A., B., and C., draw a scatterplot showing the likely appearance of the (X1, X2) data. Solution: [Figure: a cloud of points with a weak negative trend (r = -1/3), spread about three times wider in the X1 direction than in the X2 direction.]

3.E. (5) Again assume the means of X1 and X2 are zero. Find the regression equation to predict X1 as a function of X2. Solution: The slope is Cov(X1, X2)/Var(X2) = -1/1 = -1, so X1 = -X2 + u.
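The arithmetic in 3.A through 3.E can be checked with a few lines of code. The course's examples are in R, but the same computations in Python are:

```python
import math

# Covariance matrix from question 3: Var(X1) = 9, Var(X2) = 1, Cov(X1, X2) = -1.
var_x1, var_x2, cov_12 = 9.0, 1.0, -1.0

sd_x1 = math.sqrt(var_x1)        # 3.A: sd of X1 = 3
sd_x2 = math.sqrt(var_x2)        # 3.B: sd of X2 = 1
corr = cov_12 / (sd_x1 * sd_x2)  # 3.C: correlation = -1/3

# 3.E: regressing X1 on X2 (both means zero), the slope is Cov(X1, X2) / Var(X2).
slope = cov_12 / var_x2          # -1.0, giving X1 = -X2 + u

print(sd_x1, sd_x2, corr, slope)
```

The only step that is easy to get backwards is the regression slope: to predict X1 from X2, divide the covariance by the variance of the predictor, X2.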
3.F. (3) Draw the path diagram that represents the regression equation in E. Assume both variables are manifest (not latent). Solution: [Figure: a path diagram with an arrow from X2 to X1, both drawn as squares since they are manifest, and an error term u pointing into X1.]

4.A. (3) Draw a scatterplot showing an observation whose Euclidean distance from the centroid is relatively small, but whose Mahalanobis distance from the centroid is relatively large. Solution: [Figure: a narrow, elongated point cloud. The observation is labelled P and the centroid is labelled M; P sits close to M but off the main axis of the cloud.]
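The contrast in 4.A can be made concrete numerically. This sketch uses a hypothetical covariance matrix (not from the exam) with correlation 0.9, so the density contours are narrow ellipses along the diagonal: a point off the main axis is Euclidean-close to the centroid yet Mahalanobis-far.

```python
import math

# Hypothetical covariance matrix: unit variances, correlation 0.9, centroid at origin.
s11, s12, s22 = 1.0, 0.9, 1.0
det = s11 * s22 - s12 * s12
inv = [[s22 / det, -s12 / det], [-s12 / det, s11 / det]]  # 2x2 matrix inverse

def euclidean(p):
    return math.sqrt(p[0] ** 2 + p[1] ** 2)

def mahalanobis(p):
    # sqrt of the quadratic form p' Sigma^{-1} p
    x, y = p
    q = inv[0][0] * x * x + 2 * inv[0][1] * x * y + inv[1][1] * y * y
    return math.sqrt(q)

P = (0.5, -0.5)  # off the main axis of the ellipse (like point P in the answer)
Q = (1.0, 1.0)   # along the main axis

print(euclidean(P), mahalanobis(P))  # ~0.707, ~2.236
print(euclidean(Q), mahalanobis(Q))  # ~1.414, ~1.026
```

P is closer to the centroid than Q in Euclidean terms but more than twice as far in Mahalanobis terms, because Mahalanobis distance discounts spread along the cloud's main axis.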
4.B. (3) Draw a contour plot of a kernel-based bivariate density estimate that corresponds to your scatterplot of A. Solution: Draw a graph with concentric ellipses that encompass the data. For my scatterplot above, these ellipses will be very narrow.

5. (5) Throughout the course, the effect of sample size on statistical estimates has been emphasized. What usually happens to statistical estimates when there is a larger sample size? Solution: With more data, the estimated parameters tend to be closer to the true values of the parameters.

6. Here is a contingency table showing job tasks taken up by husbands and wives. For example, 4 of the 1744 couples surveyed jointly do the laundry.

               Wife  Alternating  Husband  Jointly  Total
   Laundry      156           14        2        4    176
   Main_meal    124           20        5        4    153
   Dinner        77           11        7       13    108
   Breakfeast    82           36       15        7    140
   Tidying       53           11        1       57    122
   Dishes        32           24        4       53    113
   Shopping      33           23        9       55    120
   Official      12           46       23       15     96
   Driving       10           51       75        3    139
   Finances      13           13       21       66    113
   Insurance      8            1       53       77    139
   Repairs        0            3      160        2    165
   Holidays       0            1        6      153    160
   Total        600          254      381      509   1744

Here is a correspondence analysis plot from these data: [Figure: correspondence analysis plot not reproduced here.]
6.A. (4) The Insurance and Finances points are relatively close. Refer to the data to explain why. Solution: It means that their conditional distributions (row profiles) are relatively similar compared to the other rows' conditional distributions. Here are those distributions:

   Finances    13/113   13/113   21/113   66/113
   Insurance    8/139    1/139   53/139   77/139

6.B. (4) Husband and Repairs have similar directions. Refer to the data to explain why. Solution: It means that repairs are far more concentrated on husbands than tasks are overall. Here, Pr(Husband | Repairs) = 160/165 ≈ 0.97, while the marginal Pr(Husband) = 381/1744 ≈ 0.22.

7. (3) A principal component is given as follows:

   PC1 = -0.55 X1 - 0.46 X2 + 0.48 X3 + 0.5 X4

Here X1 through X4 are standardized measurements of a person: X1 = Height, X2 = Arm length, X3 = Weight, X4 = Percentage body fat. Use the "line the people against the wall from smallest to largest value of PC1" idea. What can you say about people with large values of PC1? What can you say about people with small values of PC1? Solution: People at the high end have low X1 and X2 and simultaneously high X3 and X4. These are short people who are overweight. People at the low end are the opposite: tall people who are underweight.
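The row profiles in 6.A and the probabilities in 6.B come straight from the table in question 6, and can be recomputed in a few lines (Python here; the course uses R):

```python
# Counts from the question 6 table (columns: Wife, Alternating, Husband, Jointly).
finances  = [13, 13, 21, 66]   # row total 113
insurance = [8, 1, 53, 77]     # row total 139
repairs   = [0, 3, 160, 2]     # row total 165

# 6.A: row profiles (conditional distributions given the task)
profile_fin = [x / sum(finances) for x in finances]
profile_ins = [x / sum(insurance) for x in insurance]

# 6.B: repairs are overwhelmingly a husband task
pr_husband_given_repairs = repairs[2] / sum(repairs)  # 160/165, about 0.97
pr_husband = 381 / 1744                               # marginal share, about 0.22

print(profile_fin)
print(profile_ins)
print(pr_husband_given_repairs, pr_husband)
```

Both Finances and Insurance put most of their mass on the Husband and Jointly columns, which is why correspondence analysis places them near each other.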
8. We discussed principal components analysis (PCA) and canonical correlation analysis (CCA).

8.A. (2) Briefly state two things that are similar about PCA and CCA. Solution: (i) linear combinations are derived; (ii) the linear combinations are chosen optimally ("best").

8.B. (2) Briefly state two things that are different about PCA and CCA. Solution: (i) for PCA "best" means maximum variance explained, while for CCA it means maximum correlation; (ii) PCA considers all variables as one group, while CCA requires two groups of variables.

9. We discussed model-based clustering (mclust) and k-means clustering (kmeans).

9.A. (2) Briefly state two things that are similar about mclust and kmeans. Solution: (i) both methods are used to assign data to clusters; (ii) both methods work well with spherical clusters.

9.B. (2) Briefly state two things that are different about mclust and kmeans. Solution: (i) kmeans has no model, while mclust uses a model; (ii) kmeans has no objective criterion for choosing the number of clusters, while mclust uses the objective BIC criterion.

10. We discussed exploratory factor analysis (EFA) and confirmatory factor analysis (CFA).

10.A. (2) Briefly state two things that are similar about EFA and CFA. Solution: (i) both are latent variable methods; (ii) both assume the manifest variables are functions of the latent variables.

10.B. (2) Briefly state two things that are different about EFA and CFA. Solution: (i) EFA typically assumes no correlation between factors, while CFA typically allows correlated factors; (ii) EFA allows all loadings to be non-zero, while CFA constrains some loadings to be zero.
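To make the "kmeans has no model" contrast concrete, here is a minimal sketch of the kmeans (Lloyd) iteration on hypothetical 1-D toy data; mclust would instead fit a Gaussian mixture by maximum likelihood and compare BIC values across numbers of clusters.

```python
# Minimal k-means (Lloyd's algorithm) sketch, k = 2, toy 1-D data.
data = [1.0, 1.2, 0.8, 10.0, 10.3, 9.7]
centers = [data[0], data[3]]  # naive initialization: one point from each group

for _ in range(10):
    # assignment step: each point joins its nearest center
    groups = [[], []]
    for x in data:
        j = min(range(2), key=lambda j: abs(x - centers[j]))
        groups[j].append(x)
    # update step: each center moves to the mean of its group
    centers = [sum(g) / len(g) for g in groups]

print(centers)  # approximately [1.0, 10.0]
```

Note that nothing here is probabilistic: there is no likelihood, so there is no BIC, which is exactly the difference cited in 9.B.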
11. Here is a factor analysis model with true loadings given:

   X1 = 0.5 f + u1
   X2 = 0.5 f + u2
   X3 = 0.5 f + u3

11.A. (5) Assume the X's all have variance 1.0. (Hence, each ui has variance 0.75.) Give the covariance matrix of (X1, X2, X3) that is implied by the model. Solution:

   Σ_implied = ΛΛ' + Ψ

             = [0.25 0.25 0.25]   [0.75 0    0   ]   [1.00 0.25 0.25]
               [0.25 0.25 0.25] + [0    0.75 0   ] = [0.25 1.00 0.25]
               [0.25 0.25 0.25]   [0    0    0.75]   [0.25 0.25 1.00]

11.B. (3) Using your answer to A., give the null hypothesis that is tested in the usual χ² test for model adequacy. Solution:

   H0: Σ_true = [1.00 0.25 0.25]
                [0.25 1.00 0.25]
                [0.25 0.25 1.00]

12. (3) In terms of the data generating process (not the data), what does it mean for a structural equation model to be misspecified? Do not answer in terms of any statistics like χ², SRMR, GFI, etc., because those are from data. Answer only in terms of the true data generating process and your model for it. Solution: If you specify a model to fit that is not truly Nature's model, then your model is misspecified. Such misspecification can easily occur if there are additional paths or correlations in Nature, but you fail to model them.
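The implied covariance matrix in 11.A is just Σ = ΛΛ' + Ψ with Λ = (0.5, 0.5, 0.5)' and Ψ = diag(0.75), which a short script can verify:

```python
# Implied covariance for the one-factor model in question 11: Sigma = Lambda Lambda' + Psi.
loadings = [0.5, 0.5, 0.5]     # Lambda
uniq = [0.75, 0.75, 0.75]      # Var(u_i), chosen so that Var(X_i) = 1.0

n = len(loadings)
sigma = [[loadings[i] * loadings[j] + (uniq[i] if i == j else 0.0)
          for j in range(n)] for i in range(n)]

for row in sigma:
    print(row)
# [1.0, 0.25, 0.25]
# [0.25, 1.0, 0.25]
# [0.25, 0.25, 1.0]
```

Each off-diagonal entry is a product of two loadings (0.5 × 0.5 = 0.25), and each diagonal entry adds the uniqueness (0.25 + 0.75 = 1.0), matching the assumed unit variances.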
Multiple Choice Questions: Circle Answers on Exam Sheet. One point each.

13. An eigenvector of a covariance matrix tells you what?
A. The direction of variability in multivariate space
B. The magnitude of variability in multivariate space
C. Whether the data point is an outlier
D. Whether the data point is normally distributed

14. Suppose (X, Y) has a bivariate normal distribution. Select the true statement.
A. X and Y are independent
B. X and Y are independent if the covariance between X and Y is equal to 0
C. X is a linear function of Y
D. Y is a linear function of X

15. What does the na.omit function do to a data frame?
A. Deletes a column if all observations in it are missing
B. Deletes a column if at least one observation in it is missing
C. Deletes a row if all observations in it are missing
D. Deletes a row if at least one observation in it is missing

16. What is the preferred method of estimating correlations in the presence of missing values?
A. Use listwise deletion on data.frame, then use cor(data.frame)
B. Use pairwise deletion on data.frame, then use cor(data.frame)
C. Use pairwise deletion on data.frame, then use maximum likelihood
D. Do not delete any missing values. Instead, use maximum likelihood on all available data

17. Suppose data.frame has two columns. What does the R command rug(data.frame, side) do?
A. Draws a density plot of the data
B. Draws a histogram of the data
C. Places data values along the horizontal axis
D. Places data values along the vertical axis

18. What does the scale function do?
A. Converts data to percentages
B. Converts data to the natural logarithm scale
C. Subtracts the row mean from each observation, and then divides the result by the row standard deviation
D. Subtracts the column mean from each observation, and then divides the result by the column standard deviation
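Question 18 turns on whether standardization is row-wise or column-wise: R's scale works on columns. A sketch of that behavior (Python here, toy numbers) is:

```python
import math

# Column-wise standardization, mimicking R's scale(): subtract each column's
# mean, then divide by that column's sample standard deviation (n - 1 divisor).
data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]

ncol = len(data[0])
means = [sum(row[j] for row in data) / len(data) for j in range(ncol)]
sds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in data) / (len(data) - 1))
       for j in range(ncol)]
scaled = [[(row[j] - means[j]) / sds[j] for j in range(ncol)] for row in data]

print(scaled)  # each column now has mean 0 and standard deviation 1
```

After scaling, both columns are identical even though the second was ten times larger, which is why standardization matters before distance-based methods and PCA on mixed-scale variables.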
19. As used most commonly in this course, what is a latent variable?
A. A column of data in your data frame
B. An unobserved column of data in your data frame
C. A row of data in your data frame
D. An unobserved row of data in your data frame

20. Which plot can display three of the variables in your data frame simultaneously?
A. Ellipsoidal plot
B. Bivariate density plot
C. Bubble plot
D. Contour plot

21. When does the Mahalanobis distance (generalized distance) from data X to the mean µ have a chi-squared distribution?
A. When the estimated covariance matrix is used to define distance
B. When the estimated mean vector is used instead of µ
C. When the distribution of X is the multivariate normal distribution
D. When the distribution of X is the chi-squared distribution

22. The kernel density estimate is the sum of the kernel "bumps" centered at the data points. How should you choose the bandwidth, h?
A. Choose it to be around 0.5
B. Choose it to be around 1.0
C. Choose it to be around 0.5 standard deviations
D. Choose it to be around 1.0 standard deviations

23. In principal component analysis, a smaller eigenvalue indicates that
A. A given variable in the original data set, say Xj, is more important
B. A given variable in the original data set, say Xj, is less important
C. A given principal component, say Yj, is more important
D. A given principal component, say Yj, is less important

24. Why do we often pick just the first two principal components?
A. Because we can graph them in a scatterplot
B. Because they explain most of the variance
C. Because they are uncorrelated
D. Because of the Kaiser criterion
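Questions 23 and 24 both rest on the fact that each eigenvalue of the covariance matrix is the variance of the corresponding principal component. A closed-form 2x2 check (hypothetical numbers, not from the exam):

```python
import math

# Eigenvalues of a hypothetical 2x2 covariance matrix Sigma = [[a, b], [b, d]].
a, b, d = 4.0, 1.5, 1.0

tr, det = a + d, a * d - b * b
disc = math.sqrt(tr * tr / 4 - det)
lam1, lam2 = tr / 2 + disc, tr / 2 - disc  # lam1 >= lam2

# Each eigenvalue is the variance of its PC, so PC1's share of total variance is:
share = lam1 / (lam1 + lam2)
print(lam1, lam2, share)  # share is roughly 0.92 for these numbers
```

A smaller eigenvalue means its component carries less variance, hence is less important; and when the first couple of eigenvalues dominate, a scatterplot of the first two PCs shows most of what is going on.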
25. Which function retrieves the coefficients ai of the PC score Y1 = a1 x1 + ... + aq xq?
A. princomp(data.frame)$loadings
B. princomp(data.frame)$scores
C. princomp(data.frame)$coefs
D. princomp(data.frame)$estimates

26. Convert a correlation matrix Rho into a distance matrix.
A. rho^2
B. sqrt(rho)
C. sqrt(1 - rho^2)
D. exp(rho)

27. Pick the correct form of the exploratory factor analysis model.
A. X = Λ f + u
B. Σ = Λ f + u
C. X = Λ Λ' + Ψ
D. Σ = Λ f + Ψ

28. What does varimax factor rotation do?
A. gives a simple structure of the loading matrix
B. makes the distribution of manifest variables closer to multivariate normal
C. makes the factors have maximum variances
D. explains a greater proportion of variance of the manifest variables

29. In an exploratory factor analysis, there is a test of H0: Σ = ΛΛ' + Ψ, where Λ has k columns (one for each latent variable). When is the model with k factors acceptable?
A. When the p-value for the test is > .05
B. When the p-value for the test is = .05
C. When the p-value for the test is < .05
D. When the p-value for the test is 0

30. When are both principal components and factor analysis pointless?
A. When the test of H0: Σ = ΛΛ' + Ψ gives p < .05
B. When the test of H0: Σ = ΛΛ' + Ψ gives p > .05
C. When the test of H0: Σ = (a diagonal matrix) gives p < .05
D. When the test of H0: Σ = (a diagonal matrix) gives p > .05
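Returning to question 26: the formula in option C, sqrt(1 - rho^2), maps correlations of ±1 to distance 0 and a correlation of 0 to distance 1, so strongly related variables end up close together. A tiny illustration:

```python
import math

def corr_to_dist(r):
    # Correlation-based distance: 0 for perfectly correlated, 1 for uncorrelated.
    return math.sqrt(1 - r ** 2)

print(corr_to_dist(0.99))  # ~0.141: strongly related variables are "close"
print(corr_to_dist(0.0))   # 1.0: unrelated variables are "far"
```

Applied entrywise to a correlation matrix, this yields a distance matrix suitable for input to hierarchical clustering of variables.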
31. How many clusters are there to end with in the agglomerative hierarchical clustering algorithm?
A. 1
B. q
C. n
D. As many as you want

32. What is wrong with the "knee" (or "elbow") criterion for selecting the number of clusters?
A. It usually picks too many clusters
B. It usually picks too few clusters
C. It is hard to find the knee (or elbow) when clusters are well separated
D. It is hard to find the knee (or elbow) when clusters are poorly separated

33. Which statistic measures row-column correspondence when comparing assigned clusters to an external grouping variable?
A. F statistic
B. chi-square statistic
C. affinity statistic
D. root mean squared error

34. In model-based clustering, when do observations come from the same true cluster?
A. When they come from the same distribution
B. When they have the highest posterior probability of belonging to the same cluster
C. When they are close to each other in terms of Mahalanobis distance
D. When they are close to each other in terms of Euclidean distance

35. When you apply the R command plot to an object obtained via hierarchical clustering, as in
   h.obj <- hclust(dmat, ...)
   plot(h.obj)
then you get a
A. dendrogram
B. scree plot
C. scatterplot
D. density plot

36. Suppose you have specified the correct model in your confirmatory factor analysis. What happens to the p-value for the test of model fit as the sample size gets larger?
A. it tends toward 0.0
B. it tends toward 1.0
C. it is random, but usually above 0.05
D. it is random, but usually below 0.05
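Question 31 is definitional: agglomerative clustering merges the closest pair of clusters at each step, so n points always collapse to a single cluster after n - 1 merges. A naive single-linkage sketch on toy 1-D data shows the bookkeeping:

```python
# Naive agglomerative (single-linkage) clustering: start from n singleton
# clusters, repeatedly merge the closest pair, and stop when one cluster remains.
points = [1.0, 1.1, 5.0, 5.2, 9.0]
clusters = [[p] for p in points]
n_steps = 0

while len(clusters) > 1:
    # find the pair of clusters with the smallest single-linkage distance
    i, j = min(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda ij: min(abs(a - b)
                                  for a in clusters[ij[0]] for b in clusters[ij[1]]))
    clusters[i] += clusters.pop(j)
    n_steps += 1

print(len(clusters), n_steps)  # 1 cluster after n - 1 = 4 merges
```

The sequence of merges (and the distances at which they happen) is exactly what the dendrogram in question 35 displays.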
37. How can you improve the fit of your structural equations model?
A. By removing a manifest variable from the model
B. By including an additional manifest variable in the model
C. By forcing certain error terms to be uncorrelated
D. By allowing certain error terms to be correlated

38. Using lavaan, specify the following model:
   y1 = λ1 f + u1
   y2 = λ2 f + u2
A. f ~ y1 + y2
B. f ~~ y1 + y2
C. f =~ y1 + y2
D. f <~ y1 + y2

39. Suppose a data set in wide format has 10 columns, four of which are repeated measures on a particular variable. There are 100 rows. How many rows are there in long format? Answer: 400

40. Which kind of missing values are the worst?
A. Non-normal
B. Non-ignorable
C. Missing completely at random
D. Missing at random
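For question 39, the count works because each of the 100 wide rows contributes one long row per repeated measure. A pure-Python sketch of the wide-to-long reshape (toy values; the exam gives only the dimensions):

```python
# Wide format: 100 subjects, each with 4 repeated measures t1..t4 (toy zeros).
wide = [{"id": i, "t1": 0, "t2": 0, "t3": 0, "t4": 0} for i in range(100)]

# Long format: one row per subject per repeated measure.
long = [{"id": row["id"], "time": t, "value": row[t]}
        for row in wide
        for t in ("t1", "t2", "t3", "t4")]

print(len(long))  # 400 = 100 subjects x 4 repeated measures
```

The non-repeated columns would simply be copied onto each of a subject's four long rows; they do not change the row count.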