University of California, Berkeley, Statistics 131A: Statistical Inference for the Social and Life Sciences
Michael Lugo, Spring 2012
Solutions to Exam 1
Friday, March 2, 2012

1. [5: 2+2+1] Consider the stemplot below.

[Stemplot; extracted stems and leaves: 0 579 236 2 48 3 3 4 5 5 6 0 7 8]

(a) What is the median of the data represented in this stemplot?

There are 13 data points; the median is the (13 + 1)/2th largest, or 7th largest, which is 16.

(b) The mean of the data represented in this stemplot is (circle one): much smaller than / about equal to / much larger than the median. Explain your answer without any explicit computations.

Since the distribution is right-skewed, the mean is much larger than the median.

(c) One of the images below is a boxplot for the data given in the stemplot. Circle that boxplot. No explanation is necessary.

The bottom of the three boxplots is the correct one. There is a right outlier (corresponding to 78 in the data) and the boxplot is otherwise typical of a right-skewed distribution.
2. [6: 3+3] Below is a histogram for a data set containing nine numbers.

[Figure: "Histogram of x"; horizontal axis x, from 0 to 8; vertical axis Frequency, from 0.0 to 2.0.]

(a) Can you determine the median of this data set exactly? If you can, do so. If not, explain why not and give the best possible bounds on the median. (For example, the median is clearly between 1 and 8. But you can do better.)

There are nine data points, so the median is the fifth smallest. One is between 1 and 2, one between 2 and 3, and two between 3 and 4. The fifth smallest data point is between 4 and 5, but we can't say what it is more precisely than that.

(b) Can you determine the mean of this data set exactly? If you can, do so. If not, explain why not and give the best possible bounds on the mean.

The histogram only tells us which bin each point falls in, so the mean cannot be determined exactly. The smallest data point is between 1 and 2; the second smallest is between 2 and 3; and so on. So the mean is at least (1 + 2 + 3 + 3 + 4 + 4 + 5 + 6 + 7)/9 = 35/9 ≈ 3.89 and at most 1 more than this, or 44/9 ≈ 4.89.
3. [8: 2+2+2+2] Consider the data set of four points given below:

x | 1 2 2 3
y | 2 2 4 4

(a) Find the standard deviation $s_x$.

The mean is (1 + 2 + 2 + 3)/4 = 2; the standard deviation is

$s_x = \sqrt{\frac{1}{4-1}\left((1-2)^2 + (2-2)^2 + (2-2)^2 + (3-2)^2\right)} = \sqrt{2/3}$.

(b) Find the standard deviation $s_y$.

The mean is (2 + 2 + 4 + 4)/4 = 3; the standard deviation is

$s_y = \sqrt{\frac{1}{4-1}\left((3-2)^2 + (3-2)^2 + (3-4)^2 + (3-4)^2\right)} = \sqrt{4/3}$.

(c) Find the coefficient of correlation r.

We have the formula

$r = \frac{1}{n-1} \sum_{i=1}^{n} \frac{x_i - \bar{x}}{s_x} \cdot \frac{y_i - \bar{y}}{s_y}$

and so

$r = \frac{1}{(4-1)\sqrt{2/3}\sqrt{4/3}} \sum_{i=1}^{n} (x_i - 2)(y_i - 3)$.

Now, plugging in values,

$r = \frac{1}{\sqrt{8}}\left[(1-2)(2-3) + (2-2)(2-3) + (2-2)(4-3) + (3-2)(4-3)\right] = \frac{2}{\sqrt{8}} = \frac{1}{\sqrt{2}}$.

(d) What is the equation of the regression line for predicting y from x?

The regression line passes through $(\bar{x}, \bar{y}) = (2, 3)$ and has slope $r s_y / s_x = 1$. Thus its equation is y = x + 1.
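The hand computations in Problem 3 can be checked numerically. The following is a minimal sketch, assuming the four data points are x = 1, 2, 2, 3 and y = 2, 2, 4, 4 and that standard deviations use the n - 1 divisor, as in the solution. The helper names (`mean`, `sample_sd`, `correlation`) are illustrative and are not part of the exam.

```python
import math

# Data as read from Problem 3 (assumed values: x = 1, 2, 2, 3; y = 2, 2, 4, 4).
x = [1, 2, 2, 3]
y = [2, 2, 4, 4]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def sample_sd(values):
    """Standard deviation with the n - 1 divisor, as used in the solution."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    """Average product of standardized values, again with the n - 1 divisor."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = sample_sd(xs), sample_sd(ys)
    return sum((a - mx) / sx * (b - my) / sy for a, b in zip(xs, ys)) / (n - 1)


s_x = sample_sd(x)
s_y = sample_sd(y)
r = correlation(x, y)

# Regression line for predicting y from x: slope r * s_y / s_x,
# passing through the point of means.
slope = r * s_y / s_x
intercept = mean(y) - slope * mean(x)

print(f"s_x = {s_x:.4f}, s_y = {s_y:.4f}, r = {r:.4f}")
print(f"regression line: y = {slope:.4f} x + {intercept:.4f}")
```

Up to floating-point rounding, the printed values should agree with the hand results above: the standard deviations are the square roots of 2/3 and 4/3, and the line has slope 1 and intercept 1.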
4. [5: 2+2+1] Scores on the math section of the SAT are normally distributed with mean 500 and standard deviation 100.

(a) What proportion of math SAT scores are between 610 and 680?

Standardizing gives z = 1.1, z = 1.8. So we want $\Phi(1.8) - \Phi(1.1) = 0.9641 - 0.8643 = 0.0998$.

(b) What score is at the 80th percentile of math SAT scores?

From the normal table, $\Phi^{-1}(0.8) = 0.84$. Unstandardizing gives 500 + (100)(0.84) = 584.

(c) The proportion of students scoring less than 350 is (circle one):
greater than the number scoring at least 630
between the number scoring at least 630 and the number scoring at least 670
less than the number scoring at least 670

The normal distribution is symmetric around its mean, so the number scoring less than 350 (= 500 - 150) is the same as the number scoring greater than 650 (= 500 + 150). Since 630 < 650 < 670, this is between the number scoring at least 630 and the number scoring at least 670.
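As a cross-check of Problem 4, the same quantities can be computed with Python's standard-library `statistics.NormalDist`. This is a sketch that assumes mean 500, standard deviation 100, and the cutoffs 610 and 680 (z = 1.1 and z = 1.8); the variable names are illustrative. The exact values differ slightly from the table-based answers, which round z-scores and probabilities.

```python
from statistics import NormalDist

# Assumed model from Problem 4: SAT math scores ~ Normal(mean 500, SD 100).
sat = NormalDist(mu=500, sigma=100)

# (a) Proportion of scores between 610 and 680.
p_between = sat.cdf(680) - sat.cdf(610)

# (b) Score at the 80th percentile.
score_80 = sat.inv_cdf(0.80)

# (c) Compare the lower tail below 350 with the upper tails at 630 and 670.
# For a continuous distribution, "at least" and "greater than" give the
# same probability.
p_below_350 = sat.cdf(350)
p_at_least_630 = 1 - sat.cdf(630)
p_at_least_670 = 1 - sat.cdf(670)

print(f"(a) P(610 < X < 680) = {p_between:.4f}")
print(f"(b) 80th percentile  = {score_80:.1f}")
print(f"(c) P(X < 350) = {p_below_350:.4f}, "
      f"P(X >= 630) = {p_at_least_630:.4f}, "
      f"P(X >= 670) = {p_at_least_670:.4f}")
```

The ordering of the three tail probabilities in part (c) reflects the symmetry argument: 350 is as far below the mean as 650 is above it, and 650 lies between 630 and 670.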
5. [3] Let $r_M$ be the coefficient of correlation between the heights and weights of adult men. Let $r_A$ be the coefficient of correlation between the heights and weights of all adults. Which of the following is true? Circle one.

$r_M < r_A$     $r_M = r_A$     $r_M > r_A$

Explain your answer, using a clearly labeled diagram and/or a few sentences of text.

This was intended to be a problem about the restricted range effect. If we know someone's height, then knowing their gender doesn't give us much additional information for predicting their weight, so the residuals in the men-only case and in the all-adults case are similar. But the variance of the weights of all adults is much larger than the variance of the weights of men. We recall that $1 - r^2$ is the variance of the residuals divided by the variance of the response variable. This quotient has a larger denominator for all adults, so $1 - r_M^2 > 1 - r_A^2$; rearranging (and assuming correlations are positive) gives $r_M < r_A$.

However, it turns out that there is significant overlap between the distribution of heights and weights of men and that of women, so this doesn't really happen. In fact, from actual data, $r_M > r_A$. If you put this, see us and we'll give you back a point.

6. [3] In one study, it was necessary to draw a representative sample of Japanese-Americans resident in San Francisco. The procedure was as follows. After consultation with representative figures in the Japanese community, the four most representative blocks in the Japanese area of the city were chosen. All persons resident in those four blocks were taken for the sample. However, a comparison with Census data shows that the sample did not include a high enough proportion of Japanese with college degrees. How can this be explained?

People living within the Japanese community are likely to be less well assimilated to American culture (and perhaps more likely to not be fluent in English). As a result, a sample which overrepresents this community will have a lower proportion of people with college degrees.
7. [6: 3 + 1 + 2] The figure below shows a scatter diagram of the high temperatures in San Francisco (SFO) and Los Angeles (LAX) for each day in 2011.

[Figure: scatter diagram "LAX temps vs. SFO temps"; horizontal axis: high temperature at SFO; vertical axis: high temperature at LAX; both axes run from 50 to 90. Three lines labeled Q, R, and S are drawn through the points.]

(a) Three lines are drawn, and are labeled Q, R, and S. For each description circle the letter of the line it corresponds to.
(i) Estimated average high at LAX, for a given high at SFO: Q R S
(ii) Estimated average high at SFO, for a given high at LAX: Q R S
(iii) Nearly equal percentile ranks in both data sets: Q R S

(b) The coefficient of correlation for these 365 points is closest to (circle one): -1, -0.5, 0, 0.5, 1

(c) The average of the 31 high temperatures for January 2011 at SFO was 56.7 degrees; the average of the 31 high temperatures for January 2011 at LAX was 68.7 degrees. This gives us the point (56.7, 68.7). We could compute similar points for the other eleven months of the year, and compute the correlation coefficient of these twelve points. The coefficient of correlation of the twelve monthly averages is (circle one) less than / equal to / greater than the coefficient of correlation of the original 365 daily data points. Briefly explain your answer.

Greater than. This is an ecological correlation; the averaging removes the day-to-day fluctuation and just leaves the seasonal trend, namely that both places are cool in the winter and warm in the summer.
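The effect described in part (c) can be illustrated with a small simulation. The sketch below uses synthetic data only, not the actual SFO and LAX temperatures: the baselines (60 and 70 degrees), the seasonal amplitude, the noise level, and the 30-day months are all made-up assumptions chosen so that two series share a seasonal trend but have independent day-to-day fluctuations.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


# Synthetic data (hypothetical parameters, not real temperatures).
rng = random.Random(0)   # fixed seed so the run is reproducible
DAYS_PER_MONTH = 30      # simplification: every month has 30 days

daily_a, daily_b = [], []
monthly_a, monthly_b = [], []
for month in range(12):
    # Shared seasonal trend: coolest in month 0, warmest in month 6.
    seasonal = -10 * math.cos(2 * math.pi * month / 12)
    # Independent day-to-day noise for each city.
    a = [60 + seasonal + rng.gauss(0, 8) for _ in range(DAYS_PER_MONTH)]
    b = [70 + seasonal + rng.gauss(0, 8) for _ in range(DAYS_PER_MONTH)]
    daily_a += a
    daily_b += b
    monthly_a.append(sum(a) / DAYS_PER_MONTH)
    monthly_b.append(sum(b) / DAYS_PER_MONTH)

r_daily = pearson_r(daily_a, daily_b)
r_monthly = pearson_r(monthly_a, monthly_b)

print(f"correlation of daily values:     {r_daily:.3f}")
print(f"correlation of monthly averages: {r_monthly:.3f}")
```

Under these assumptions the monthly averages are much more strongly correlated than the daily values, because averaging over a month shrinks the independent daily noise while leaving the shared seasonal component intact.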