Stat 139 Homework 2 Solutions, Spring 2015

Evangeline Lambert

Problem 1. A pharmaceutical company is screening 50 different targeted compounds to try to determine whether any of them may be useful in treating migraine headaches. From previous experiments like this, they believe that each compound independently has a 1/100 chance of truly being effective, and a 99/100 chance of having zero effect. For each potential compound, they perform a hypothesis test at $\alpha = 0.05$ to determine whether it is effective. An effective drug will be statistically significant based on this hypothesis test 80% of the time (which is called statistical power).

(a) What is the expected number of compounds that will be shown to be statistically significant based on these fifty separate hypothesis tests?

Let $E_i$ be the event that the $i$th compound is truly effective, and let $R_i$ be the event that its test rejects the null hypothesis (i.e., is statistically significant). For any one compound, the probability of being statistically significant can be calculated from the Law of Total Probability:

$$P(R_i) = P(R_i \cap E_i) + P(R_i \cap E_i^C) = P(R_i \mid E_i)P(E_i) + P(R_i \mid E_i^C)P(E_i^C) = (0.80)(0.01) + (0.05)(0.99) = 0.0575$$

Let $X_i$ be the indicator r.v. for whether the $i$th compound is statistically significant, and let $T$ be the total number of compounds that are shown to be statistically significant. By linearity of expectation, the expected number is:

$$E(T) = E(X_1 + X_2 + \cdots + X_{50}) = E(X_1) + E(X_2) + \cdots + E(X_{50}) = 50(0.0575) = 2.875$$

(b) Given a compound is flagged as statistically significant, what is the probability that it is actually effective in treating migraine headaches?

Here we want $P(E_i \mid R_i)$, which we can calculate with Bayes' Rule (since we are flipping which event is conditioned on):

$$P(E_i \mid R_i) = \frac{P(R_i \mid E_i)P(E_i)}{P(R_i \mid E_i)P(E_i) + P(R_i \mid E_i^C)P(E_i^C)} = \frac{(0.80)(0.01)}{(0.80)(0.01) + (0.05)(0.99)} = \frac{0.008}{0.0575} \approx 0.139$$

(c) After testing the 50 potential compounds, the company has exactly 1 compound that was deemed to be statistically significant based on the tests.
Let $\pi$ be the probability that it is actually effective in treating migraine headaches. How does $\pi$ compare to your result in part (b)? Explain briefly.

It depends on whether you treat the characteristics of the testing and the compounds as fixed at the values given in the problem statement. Intuitively speaking, if you see fewer significant compounds than expected, then there is a good chance that the assumed 1/100 rate of truly effective compounds is an overestimate, so one could argue that $\pi$ should be even lower than 0.139. If you think the 1/100 probability is truly a known and fixed value, then this probability should be the same as in part (b). The key is that the characteristics of the test and the compounds have not changed, so for any selected compound that is flagged as significant, the probability that it is truly effective should still be 0.139, no matter how many statistically significant compounds you find among the 50 tested.

Problem 2. ACT scores of high school seniors. The scores of high school seniors on the ACT college entrance examination in a recent year had mean $\mu = 19.2$ and standard deviation $\sigma = 5.1$. The distribution of individual scores is only roughly Normal.
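The Normal model stated in Problem 2 can be evaluated numerically with R's `pnorm`. The following is only a minimal sketch under the stated assumptions ($\mu = 19.2$, $\sigma = 5.1$, threshold 23, and a sample of $n = 25$); the variable names are illustrative and are not taken from the course code.

```r
# Sketch: upper-tail Normal probabilities for the ACT setup.
# mu and sigma come from the problem statement; names are illustrative.
mu    <- 19.2
sigma <- 5.1

# P(X >= 23) for one student, treating X as Normal(mu, sigma)
p_single <- pnorm(23, mean = mu, sd = sigma, lower.tail = FALSE)

# Standard deviation of the sample mean for n = 25 students
n  <- 25
se <- sigma / sqrt(n)

# P(Xbar >= 23), treating Xbar as Normal(mu, se)
p_mean <- pnorm(23, mean = mu, sd = se, lower.tail = FALSE)

print(c(p_single = p_single, se = se, p_mean = p_mean))
```

Using `lower.tail = FALSE` returns the upper-tail probability directly, which avoids the loss of precision in `1 - pnorm(...)` when the tail is very small.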

(a) What is the approximate probability that a single student randomly chosen from all those taking the test scores 23 or higher?

Let $X$ be the test score of a single student. Then:

$$P(X \ge 23) \approx P\left(Z > \frac{23 - 19.2}{5.1}\right) = P(Z > 0.745) \approx 0.228$$

(b) Now take an SRS of 25 students who took the test. What would be the mean and standard deviation of the sampling distribution of the sample mean score, $\bar{X}$, for $n = 25$ students?

By linearity of expectation and the independence of the observations, $E(\bar{X}) = \mu$ and $Var(\bar{X}) = \sigma^2/n$, where $\mu$ and $\sigma$ are the mean and standard deviation for a single student. Then $E(\bar{X}) = 19.2$ and $SD(\bar{X}) = \sqrt{Var(\bar{X})} = 5.1/\sqrt{25} = 1.02$ in this problem.

(c) What is the approximate probability that the mean score of these 25 students is 23 or higher?

By the Central Limit Theorem, $\bar{X}$ is approximately Normal with the mean and standard deviation found in part (b). Then:

$$P(\bar{X} \ge 23) \approx P\left(Z > \frac{23 - 19.2}{1.02}\right) = P(Z > 3.73) < 0.001$$

(d) Which of your two Normal probability calculations in parts (a) and (c) is more accurate? Why?

The distribution of single-student scores is only roughly Normal (it is very discretized, since individual ACT scores can only be whole numbers), whereas the sampling distribution of $\bar{X}$ is closer to Normal by the CLT (although still approximate) and takes finer-grained values (multiples of 1/25). So we believe that the calculation in part (c) is more accurate.

Problem 3. The sum of squares of a sample of data is minimized when the sample mean, $\bar{X} = \sum X_i / n$, is used as the basis of the calculation. Define $g(c)$ as a function of $c$:

$$g(c) = \sum (X_i - c)^2$$

Show that this function is minimized at the value $c = \bar{X}$.

To minimize a function, we take the first derivative (w.r.t. $c$) and set it to zero. Then we take the second derivative and make sure it is positive at $\bar{x}$ (concave up):

$$g'(c) = -2\sum (x_i - c) = 0 \implies \sum c = \sum x_i \implies nc = \sum x_i \implies c = \sum x_i / n = \bar{x}$$

$$g''(c) = 2\sum 1 = 2n > 0$$

Problem 4. Let $X_1, \ldots, X_i, \ldots, X_n$ be independent random variables drawn from a population with mean $\mu$ and variance $\sigma^2$. Let $\bar{X}$ be the sample average.
Recall that $\sigma^2$ can be estimated by $S^2$, the usual sample variance, defined as:

$$S^2 = \frac{1}{n-1}\sum (X_i - \bar{X})^2 = \frac{1}{n-1}\left(\sum X_i^2 - n\bar{X}^2\right)$$
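The two forms of $S^2$ above can be checked numerically. This is only an illustrative sketch on made-up data (the vector `x` is simulated here and has nothing to do with the homework data); it compares the definition, the shortcut form, and R's built-in `var`.

```r
# Sketch: the definitional and shortcut forms of the sample variance agree.
# x is arbitrary simulated data used only for illustration.
set.seed(1)
x <- rnorm(10, mean = 5, sd = 2)
n <- length(x)

# Definition: sum of squared deviations from the sample mean, over n - 1
s2_def <- sum((x - mean(x))^2) / (n - 1)

# Shortcut: (sum of squares - n * xbar^2) over n - 1
s2_short <- (sum(x^2) - n * mean(x)^2) / (n - 1)

# Both should match each other and var(x) up to floating-point error
print(all.equal(s2_def, s2_short))
print(all.equal(s2_def, var(x)))
```

The shortcut form is convenient for the expectation argument in this problem, though in floating-point arithmetic the definitional form is generally the safer one to compute when the mean is large relative to the spread.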

(a) Show that $E(X_i^2) = \sigma^2 + \mu^2$, using the fact that $\sigma^2 = E\left((X_i - \mu)^2\right)$.

$$E(X_i^2) = E(X_i^2 - 2\mu^2 + 2\mu^2) = E(X_i^2 - 2\mu X_i + \mu^2) + E(\mu^2) = E\left((X_i - \mu)^2\right) + E(\mu^2) = \sigma^2 + \mu^2$$

Note: $E(\mu^2) = \mu E(\mu) = \mu E(X_i) = E(\mu X_i)$, which is what allows $-2\mu^2$ to be replaced by $-2\mu X_i$ inside the expectation.

(b) Show that $E(S^2) = \sigma^2$, i.e., $S^2$ is an unbiased estimator of the population variance.

$$E(S^2) = E\left[\frac{1}{n-1}\left(\sum X_i^2 - n\bar{X}^2\right)\right] = \frac{1}{n-1}\left(\sum E(X_i^2) - nE(\bar{X}^2)\right) = \frac{1}{n-1}\left(n(\sigma^2 + \mu^2) - n(\sigma^2/n + \mu^2)\right) = \frac{(n-1)\sigma^2}{n-1} = \sigma^2$$

Note: $E(\bar{X}^2) = \sigma^2_{\bar{X}} + \mu^2_{\bar{X}} = \sigma^2/n + \mu^2$, which follows from applying part (a) to $\bar{X}$, whose mean is $\mu$ and whose variance is $\sigma^2/n$.

Problem 5. Let $X_1, X_2, \ldots, X_{25}$ be i.i.d. Normal r.v.s with mean $\mu = 1$ and variance $\sigma^2 = 3^2 = 9$. Let $S^2$ be the usual variance estimate, $S^2 = \sum (X_i - \bar{X})^2/(n-1)$, and let $\hat{\sigma}^2$ be the estimate using $\mu$ in the calculation instead: $\hat{\sigma}^2 = \sum (X_i - \mu)^2/n$. Write a simulation in R, using a for-loop based on at least 10,000 iterations, to determine the following (be sure to include the relevant R code and output):

(a) That both estimators ($S^2$ and $\hat{\sigma}^2$) are unbiased.

Based on 10,000 iterations, the observed means of both estimators were within 0.01 units of the true variance of 9. We could formally test whether each mean is significantly different from 9 (based on n = 10,000 realizations), but that is overkill. Here is the relevant R code:

    > nsims=10000
    > mu=1
    > sigma=3
    > n=25
    > sigma2.hat=s2=rep(NA,nsims)
    >
    > for(i in 1:nsims){
    +   sample=rnorm(n,mean=mu,sd=sigma)
    +   xbar=mean(sample)
    +   sigma2.hat[i]=sum((sample-mu)^2)/n
    +   s2[i]=var(sample)
    + }
    > mean(sigma2.hat)
    [1]
    > mean(s2)
    [1]

(b) Provide a separate histogram for each of the two sampling distributions. Which has lower spread?

Based on the R output below, $\hat{\sigma}^2$ has slightly smaller spread than $S^2$ (about 3% lower standard deviation).

    > sd(sigma2.hat)
    [1]
    > sd(s2)
    [1]

[Figure: two histograms, "Histogram of sigma2.hat" and "Histogram of s2"; x-axes sigma2.hat and s2, y-axis Frequency]

(c) Which estimator is closer to the true value more often?

Based on the R output below, $\hat{\sigma}^2$ is as close or closer than $S^2$ about 52.4% of the time.

    > mean(abs(sigma2.hat-sigma^2)>abs(s2-sigma^2))
    [1]

(d) Are you sure of your answers above? What could you do to be more certain?

No, I am not certain of the answers above, since they are based on random simulations. We could be more certain if we based this study on more iterations, or if we performed a formal test to see whether the results above were statistically significant.

Problem 6. The BOSsnowfall.csv data set on the course website has weather measurements made at Logan Airport. There are two variables in this data set measured annually from winter until winter :

totalsnow: the total amount of snowfall for a winter season, in inches
avgmaxtemp: the average daily high temperature for the previous calendar year, in degrees F

(a) Calculate the following summary statistics for both the totalsnow and avgmaxtemp variables: sample mean, sample SD, min, median, max, 1st and 3rd quartiles.

    > summary(snow)
         season      totalsnow         avgmaxtemp
            : 1    Min.   : 9.00    Min.   :
            : 1    1st Qu.:         1st Qu.:
            : 1    Median :         Median :
            : 1    Mean   :         Mean   :
            : 1    3rd Qu.:         3rd Qu.:59.63
            : 1    Max.   :         Max.   :61.27
     (Other):85
    > sd(snow$totalsnow)
    [1]
    > sd(snow$avgmaxtemp)
    [1]

(b) Split the observations into two groups: the winters with avgmaxtemp at or below the 3rd quartile, as calculated in part (a), vs. the winters above the 3rd quartile. Plot side-by-side boxplots of totalsnow for the two groups and describe the shapes of their distributions. Are there any visible differences?

[Figure: side-by-side boxplots of totalsnow for the High and Low groups (left); "Histogram of meandiff.sim", x-axis meandiff.sim, y-axis Frequency (right)]

The boxplot to the left above shows the annual snowfall for years when the average maximum temperature is in the top quartile vs. the bottom 75%. Both boxplots appear to be right-skewed. When the temperature is cooler, there appears to be more snowfall, on average. There also seems to be more spread in the cooler group, but this may just be because there are more observations in that group (3-to-1). More details below.

(c) Comment on whether you think the group means are very different or not (without conducting any formal tests).

Based on the side-by-side boxplots above, it appears that the High group (when the average temperature for the year is above 59.63, the 3rd quartile) typically has lower amounts of snowfall. Compared to the Low group, the median (the line inside the box) is lower, the middle 50% of the distribution (the box) is shifted down, and the highest values are lower as well.

(d) Perform a permutation test based on 10,000 iterations to determine whether totalsnow differs between winters where the temperature was at or below the 3rd quartile vs. above the 3rd quartile. Please refer to the Unit 2 lecture notes for useful R code. Be sure to state the hypotheses, calculate the test statistic, produce a histogram of the reference distribution, calculate the p-value based on this distribution, and state the conclusion of the procedure (be sure to mention the scope of the inference).
Here is some relevant R output (see HW 2 Solutions R Code.R for the remaining R commands used).

    > meandiff.obs
    [1]
    > mean(meandiff.sim)
    [1]
    > sd(meandiff.sim)
    [1]
    > #two-sided p-value
    > mean( abs(meandiff.sim) >= abs(meandiff.obs) )
    [1]

Based on the R output and the histogram above (the reference distribution for the test statistic), we can perform the following hypothesis test (a permutation test) at the $\alpha = 0.05$ level, where $Y_{high} = Y_{low} + \delta$:

$$H_0: \delta = 0 \quad \text{vs.} \quad H_A: \delta \ne 0$$

The test statistic is $T = \bar{Y}_{high} - \bar{Y}_{low}$ (meandiff.obs in the output), and the p-value is the proportion of permuted differences at least as extreme in absolute value as the observed one.

Since our estimated two-sided p-value is less than $\alpha = 0.05$, we have just enough evidence to conclude that the average snowfall in Boston is different in the two groups; in fact, snowfall tends to be lower in years with high temperature. This is certainly not a causal relationship (there is no way to randomly assign temperature to years), and this is not a random sample of years, so the trend does not necessarily generalize outside the years studied or to other locations.
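The permutation procedure described in Problem 6(d) can be sketched as follows. This is not the course's R script: the data are synthetic stand-ins (the BOSsnowfall.csv values are not reproduced here), the group sizes of 68 and 23 are an assumption chosen to mimic a roughly 3-to-1 split, and `perm_test` is a hypothetical helper name.

```r
# Sketch of a two-sided permutation test for a difference in group means.
# The data below are SYNTHETIC and only stand in for totalsnow by group.
set.seed(139)
totalsnow <- c(rgamma(68, shape = 4, scale = 11),  # assumed "Low" group
               rgamma(23, shape = 4, scale = 9))   # assumed "High" group
group <- rep(c("Low", "High"), times = c(68, 23))

# Hypothetical helper: shuffles the group labels nsims times and
# recomputes the High-minus-Low difference in means each time.
perm_test <- function(y, g, nsims = 10000) {
  obs  <- mean(y[g == "High"]) - mean(y[g == "Low"])
  sims <- numeric(nsims)
  for (i in seq_len(nsims)) {
    g_perm  <- sample(g)  # random relabeling under H0: delta = 0
    sims[i] <- mean(y[g_perm == "High"]) - mean(y[g_perm == "Low"])
  }
  # Two-sided p-value: share of permuted differences at least as extreme
  list(obs = obs, sims = sims, p_value = mean(abs(sims) >= abs(obs)))
}

res <- perm_test(totalsnow, group)
print(res$obs)
print(res$p_value)
# hist(res$sims) would display the reference distribution
```

Shuffling the labels while keeping the responses fixed preserves the group sizes, so each permuted difference is a draw from the reference distribution under the null hypothesis that the labels are unrelated to snowfall.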
