1. Let $P$ be a probability measure on a collection of sets $\mathcal{A}$.
(a) For each $n \in \mathbb{N}$, let $H_n$ be a set in $\mathcal{A}$ such that $H_n \subset H_{n+1}$. Show that $P(H_n)$ monotonically converges to $P(\cup_{k=1}^\infty H_k)$ as $n \to \infty$.
(b) For each $n \in \mathbb{N}$, let $G_n$ be a set in $\mathcal{A}$ such that $G_{n+1} \subset G_n$. Show that $P(G_n)$ monotonically converges to $P(\cap_{k=1}^\infty G_k)$ as $n \to \infty$.
If you can do this problem, you will be OK in a measure theory class.

2. Show formally that for any CDF defined by $F(y) = \Pr(Y \le y)$, we have $\lim_{y \to -\infty} F(y) = 0$, $\lim_{y \to \infty} F(y) = 1$, and that $F$ is right-continuous.

3. Show how you can use the CDF $F$ of a random variable $Y$ to compute
(a) $\Pr(Y \in (a, b])$;
(b) $\Pr(Y \in (a, b))$;
(c) $\Pr(Y \in [a, b])$.

4. Let $Z \sim N(0, 1)$. Derive the density of $X$ where
(a) $X = e^Z$;
(b) $X = Z^2$.

5. Let $Y$ be a random variable with a continuous, strictly increasing CDF $F$, so in particular $F^{-1}$ exists and $F^{-1}(F(y)) = y$.
(a) Find the CDF of $U$, where $U = F(Y)$.
(b) Find the CDF of $X$, where $X = F^{-1}(U)$.
(c) Explain how these results can be used to simulate a normal distribution in R using only the runif command and the qnorm command. Write out your computer code.
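Problem 5(c) asks for R code using runif and qnorm. The same inverse-CDF idea can be sketched in Python using only the standard library, with NormalDist.inv_cdf playing the role of qnorm; the function name and parameters below are illustrative, not part of the assignment.

```python
# Minimal sketch of inverse-CDF sampling (cf. Problem 5): if U ~ uniform(0,1)
# and F is a continuous, strictly increasing CDF, then F^{-1}(U) has CDF F.
import random
from statistics import NormalDist

def rnorm_via_inverse_cdf(n, mu=0.0, sigma=1.0, seed=None):
    rng = random.Random(seed)            # analogue of runif's RNG
    dist = NormalDist(mu, sigma)
    # dist.inv_cdf is the normal quantile function, i.e. the analogue of qnorm
    return [dist.inv_cdf(rng.random()) for _ in range(n)]

draws = rnorm_via_inverse_cdf(10_000, seed=1)
print(sum(draws) / len(draws))  # sample mean; should be near 0
```

In R the whole construction collapses to one line, qnorm(runif(n)), which is exactly the composition $F^{-1}(U)$ studied in parts (a) and (b).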
6. Let $F$ be a discrete CDF with jumps at $Y \in \{0, 1, 2, 3, 4\}$, with $(F(0), F(1), F(2), F(3), F(4)) = (.1, .15, .3, .6, 1.0)$. Describe how to simulate from this distribution using a random variable $U \sim$ uniform$(0, 1)$.

7. Let $Y \mid \theta \sim$ binomial$(n, \theta)$ and let $\theta \sim$ beta$(a, b)$. Derive the marginal density of $Y$ and the conditional density of $\theta$ given $Y$.

8. Let $Y \mid \theta \sim N(\theta, \sigma^2)$ and let $\theta \sim N(\mu, \tau^2)$. Derive the marginal density of $Y$ and the conditional density of $\theta$ given $Y$.

9. Normal distribution properties via change of variables:
(a) Let $X \sim N(\theta, \tau^2)$. Using the univariate change of variables formula, obtain the distribution of $Y = \mu + \sigma X$.
(b) Let $W \sim N(\theta_1, \tau_1^2)$ and $X \sim N(\theta_2, \tau_2^2)$ be independent. Find the distribution of $W + X$ using the multivariate change of variables method.
(c) Let $Y_1, \ldots, Y_n \sim$ i.i.d. $N(\mu, \sigma^2)$. Use the above two results to obtain the distribution of $\bar{Y}$.
(d) Let $Y_1, \ldots, Y_n$ be independent with $Y_i \sim N(\mu_i, \sigma_i^2)$. Use the first two results to obtain the distribution of $\bar{Y}$.

10. Let $Y_1, \ldots, Y_n \sim$ i.i.d. $N(\mu, \sigma^2)$. Using the multivariate change of variables formula, show that $\bar{Y} = \sum Y_i / n$ is independent of $S^2 = \sum (Y_i - \bar{Y})^2 / (n - 1)$.

11. Show the following:
(a) $E[a + bY] = a + bE[Y]$;
(b) $V[a + bY] = b^2 V[Y]$.

12. Let $Y$ be a positive random variable. Use Jensen's inequality to relate
(a) $E[Y^p]^{1/p}$ to $E[Y^q]^{1/q}$ for $p > q \ge 1$;
(b) $E[1/Y]$ to $1/E[Y]$;
(c) $\log E[Y]$ to $E[\log Y]$.

13. Let $\{Y_t : t \in \mathbb{N}\}$ be i.i.d. random variables on $\mathcal{Y} = \{-1, 0, +1\}$, with $\Pr(Y_t = -1) = p_-$ and $\Pr(Y_t = +1) = p_+$. These random variables represent jumps of a particle along a one-dimensional grid. Let $S_T = \sum_{t=1}^T Y_t$ be the position of the particle at time $T$. Compute the mean and variance of $S_T$ as a function of $p_-$ and $p_+$. Describe qualitatively the behavior of the particle as $T$ increases, as a function of $p_-$ and $p_+$.

14. Let $(w_1, w_2, w_3) \sim$ Dirichlet$(\alpha_1, \alpha_2, \alpha_3)$.
(a) Compute the expected value and variance of $w_j$ for $j \in \{1, 2, 3\}$.
(b) Compute the variance of $w_1 + w_2 + w_3$.
(c) Compute the covariance of $w_1$ and $w_2$, and explain intuitively the sign of the result.
(d) Obtain the distribution of $\theta = w_1 + w_2$.

15. Let $X$ and $Y$ be random variables. Show that $E[f(X) g(X, Y) \mid X] = f(X)\, E[g(X, Y) \mid X]$.

16. Highly skewed data $(y_1, \ldots, y_n)$ are often analyzed on a log scale, i.e. we analyze $(x_1, \ldots, x_n) = (\ln y_1, \ldots, \ln y_n)$.
(a) Show that $e^{\bar{x}} \le \bar{y}$.
(*) Bonus question: Compare $f^{-1}\left(\sum f(y_i)/n\right)$ for $f(x) = 1/x$, $f(x) = \ln x$, and $f(x) = x$.

17. Variance of correlated sums:
(a) Derive the variance of $aX + bY$ for possibly correlated random variables $X$ and $Y$.
(b) Let $Y = (Y_1, \ldots, Y_n)^T$ be a vector of real-valued random variables. Compute the variance of $\sum Y_i / n$ when $\text{Var}[Y_i] = \sigma^2$ for all $i$ and
i. $\text{Cor}[Y_i, Y_j] = 0$ for all $i \ne j$;
ii. $\text{Cor}[Y_i, Y_j] = \rho$ for all $i \ne j$;
iii. $\text{Cor}[Y_i, Y_j] = \rho$ if $|i - j| = 1$ and is zero if $|i - j| > 1$.
Describe in words how correlation affects the variance of the sample mean.

18. Suppose $E[Y] = \mu$ and $\text{Var}[Y] = \sigma^2$. Consider the estimator $\hat{\mu} = (1 - w)\mu_0 + wY$.
(a) Find the expectation and variance of $\hat{\mu}$.
(b) Find the bias and MSE of $\hat{\mu}$ (as functions of $\mu$).
(c) For what values of $\mu$ does $\hat{\mu}$ have lower MSE than $Y$?

19. Let $C_k(Y_1, \ldots, Y_n) = (\bar{Y} - a_k \sigma/\sqrt{n},\ \bar{Y} + a_k \sigma/\sqrt{n})$ for $k \in \{1, 2\}$, where $a_1 = z_{.975}$ and $a_2 = \sqrt{1/.05}$. Via simulation, find the coverage rates of $C_1$ and $C_2$ for $n \in \{1, 10\}$ when
(a) $Y_1, \ldots, Y_n \sim$ i.i.d. $N(\mu, \sigma^2)$;
(b) $Y_1, \ldots, Y_n \sim$ i.i.d. double exponential with variance 2;
(c) $Y_1, \ldots, Y_n \sim$ i.i.d. beta$(.1, .5)$.
Include your code as an appendix to your homework. Discuss your results, and your thoughts on the robustness of the z-interval when the data are not normal (importantly, in this exercise, we are using the true variance of the population instead of an estimate).

20. Interval for a proportion: Let $Y \sim$ binomial$(n, \theta)$.
(a) Find the approximate (large $n$) distribution of $\hat{\theta} = Y/n$. Find a function of $\hat{\theta}$ that is approximately standard normal.
(b) Based on the normal approximation, obtain the form of an approximate $1 - \alpha$ CI for $\theta$. Roughly how wide do you expect this to be for a given value of $\theta$?
(c) Obtain a CI for $\theta$ using Hoeffding's inequality. Compare the width of this CI to the approximate normal CI.

21. Convergence of correlated sums: Let $\{Y_i : i \in \mathbb{N}\}$ be a vector of real-valued random variables with $E[Y_i] = \mu$ and $\text{Var}[Y_i] = \sigma^2$. Try to obtain a WLLN for $\bar{Y} = \sum Y_i / n$ in the cases
(a) $\text{Cor}[Y_i, Y_j] = \rho$ for all $i \ne j$;
(b) $\text{Cor}[Y_i, Y_j] = \rho$ if $|i - j| = 1$ and is zero if $|i - j| > 1$.
Discuss how correlation affects the asymptotic concentration of $\bar{Y}$ around $\mu$.

22. Weighted estimates: Sometimes our measurements of a quantity of interest have differing levels of precision. Let $\{Y_i : i \in \mathbb{N}\}$ be a vector of independent real-valued random variables with $E[Y_i] = \mu$ and $\text{Var}[Y_i] = \sigma_i^2$.
(a) Find the mean and variance of $\bar{Y}_w = \sum_{i=1}^n w_i Y_i$, where the $w_i$'s are constants that sum to one.
(b) Find the values of the $w_i$'s that minimize the variance of $\bar{Y}_w$.
(c) Obtain a WLLN for $\bar{Y}_w$.

23. Moment generating functions:
(a) Obtain the MGFs for the Poisson, exponential, and gamma distributions.
(b) Find the distributions of $\sum_{i=1}^n Y_i$, where $Y_1, \ldots, Y_n$ are i.i.d. Poisson, exponential, or gamma random variables.
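As a numerical companion to Problem 22, the identity $\text{Var}[\bar{Y}_w] = \sum w_i^2 \sigma_i^2$ (for independent $Y_i$ with weights summing to one) can be checked directly. This is only an illustration, not the derivation the problem asks for; the variances in sigma2 below are arbitrary made-up values, and the inverse-variance weights shown are a well-known candidate for the minimizer in part (b).

```python
# Sketch: variance of a weighted mean under two weighting schemes.
sigma2 = [1.0, 4.0, 0.25, 9.0]  # hypothetical measurement variances

def var_of_weighted_mean(weights, variances):
    # Var[sum w_i Y_i] = sum w_i^2 sigma_i^2 for independent Y_i
    assert abs(sum(weights) - 1.0) < 1e-12
    return sum(w * w * v for w, v in zip(weights, variances))

n = len(sigma2)
equal = [1.0 / n] * n                      # naive equal weights
prec = [1.0 / v for v in sigma2]           # precisions 1/sigma_i^2
inv_var = [p / sum(prec) for p in prec]    # inverse-variance weights

print(var_of_weighted_mean(equal, sigma2))    # equal-weight variance
print(var_of_weighted_mean(inv_var, sigma2))  # strictly smaller
```

The inverse-variance scheme yields variance $1/\sum_i \sigma_i^{-2}$, which the numerical comparison above confirms is below the equal-weight variance for these values.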
24. Normal tail behavior: Show that if $Z \sim N(0, 1)$, then $\Pr(Z > t) \le \phi(t)/t$ for $t > 0$. (Hint: Recall from the proof of Markov's inequality that $\int_t^\infty z\, p(z)\, dz \ge \int_t^\infty t\, p(z)\, dz$.)

25. Sketch a proof of the multivariate delta method.

26. Let $Y_1, \ldots, Y_n$ be a sample from a bivariate population $P$ where $E[Y_i] = (\mu_A, \mu_B)$, $\text{Var}[Y_{i,A}] = \text{Var}[Y_{i,B}] = 1$ and $\text{Cor}[Y_{i,A}, Y_{i,B}] = \rho$. The purpose of this problem is to derive an estimator and standard error for $\rho$.
(a) For this problem, what moments of $P$ does $\rho$ depend on?
(b) Find CAN estimators of the moments from (a).
(c) Find a CAN estimator $\hat{\rho}$ of $\rho$ and give its limiting distribution.
(d) Find a large-$n$ estimate of the standard deviation of $\hat{\rho}$, that is, its standard error.

27. Suppose you obtain a random sample from a population and the numerical values of the sample are $(y_1, \ldots, y_n)$, where each $y_i$ is a real-valued number. Let $\hat{F}$ be the empirical CDF based on these numbers; that is, $\hat{F}(y)$ equals the fraction of $y_i$'s at or below the value $y$.
(a) $\hat{F}$ is discrete: where are the jumps?
(b) How big are the jumps if there are no ties in the sample? What if there are ties?
(c) $\hat{F}$ is a valid CDF, and so it corresponds to a probability distribution, say $\hat{P}$, called the empirical distribution. Describe this distribution; that is, describe $\hat{P}((a, b])$.

28. Confidence intervals and bands for CDFs.
(a) Write down the formula for a 95% pointwise confidence interval for $F(y)$, using a plug-in estimate of $F(y)$ (hint: this is just the usual
normal interval for a binomial proportion, with $F(y)$ replacing $p$, the population proportion). Also write down the formula for the 95% interval based on Hoeffding/DKW.
(b) Simulate at least $S = 10{,}000$ datasets consisting of samples of size $n = 10$ from the uniform distribution on $[0, 1]$. For each simulated dataset, check to see if the two confidence intervals cover the true value of $F(y)$ for $y \in \{.1, .2, \ldots, .8, .9\}$. For example, for each simulation $s = 1, \ldots, S$ you might make a vector $c^N_s$, where $c^N_s[k]$ indicates whether or not the normal interval covers the true value of $F(y)$ at the value $y = k/10$.
(c) For both types of intervals, use the results of the simulation to evaluate the pointwise coverage rates at $y \in \{.1, .2, \ldots, .8, .9\}$, and the global coverage rate, i.e., for what fraction of datasets did the intervals cover the true values of $F$ at all $y \in \{.1, .2, \ldots, .8, .9\}$?
(d) Comment on the relative widths of the intervals, and summarize your findings about coverage rates.

29. Let $Y_1, \ldots, Y_n \sim$ i.i.d. from a distribution $P$ with CDF $F$. Let $\hat{F}$ be the empirical CDF of $Y_1, \ldots, Y_n$. For two points $x, y$ with $x < y$, calculate $E[\hat{F}(y)]$, $E[\hat{F}(x)\hat{F}(y)]$ and $\text{Cov}[\hat{F}(x), \hat{F}(y)]$.

30. Suppose we wish to simulate a bootstrap dataset $Y^* = (Y_1^*, \ldots, Y_n^*)$ from the empirical distribution of the observed sample values $y = (y_1, \ldots, y_n)$. Explain mathematically why this can be done with the R command Ystar <- sample(y,replace=TRUE).

31. Suppose you observe an outcome $y$ and a predictor $x$ for a random sample of $n = 10$ objects, with $y$-values (2.38, 2.72, -0.13, 2.66, 3.72, 0.48, 2.86, 4.27, 3.86, 2.04) and $x$-values (-0.63, 0.18, -0.84, 1.60, 0.33, -0.82, 0.49, 0.74, 0.58, -0.31). In other words, you sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ i.i.d. from some bivariate distribution, and these are the numerical results you get. Consider the normal linear regression model $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, with $\epsilon_1, \ldots, \epsilon_n \sim$ i.i.d. $N(0, \sigma^2)$.
(a) Obtain the usual normal-theory standard error, 95% CI and p-value for the OLS estimate of $\beta_1$ (use the lm command in R).
(b) Obtain a bootstrap estimate of the standard deviation of $\hat{\beta}_1$ (this is the bootstrap standard error), obtain a normal-theory CI for $\beta_1$ using the bootstrap standard error, and compare to the results in (a).
(c) Obtain the bootstrap distribution of the p-value and display it graphically, including the observed p-value as a reference. How might you describe the evidence that $\beta_1 \ne 0$? How stable is this evidence?

32. Let $Y_1, \ldots, Y_n \sim$ i.i.d. from a continuous distribution $P$. Given $Y_1, \ldots, Y_n$, let $Y_1^*, \ldots, Y_n^* \sim$ i.i.d. from $\hat{P}$, the empirical distribution of $Y_1, \ldots, Y_n$. Let $\bar{Y}^* = \sum Y_i^* / n$.
(a) Compute $E[\bar{Y}^* \mid \hat{P}]$ and $\text{Var}[\bar{Y}^* \mid \hat{P}]$.
(b) Now compute $E[\bar{Y}^*]$ and $\text{Var}[\bar{Y}^*]$, the unconditional expectation and variance of $\bar{Y}^*$, marginal over i.i.d. samples $Y_1, \ldots, Y_n$ from $P$.

33. Suppose $Y_1, \ldots, Y_n \sim$ i.i.d. $N(\mu, \sigma^2)$.
(a) What is the distribution of $\sqrt{n}(\bar{Y} - \mu)/\sigma$, and why?
(b) What is the distribution of $(n - 1)s^2/\sigma^2$, and why?
(c) What is the distribution of $\sqrt{n}(\bar{Y} - \mu)/s$, and why?
(d) For $w \in [0, 1]$, find the coverage rate for the set $C_w(Y) = \left(\bar{Y} + \frac{s}{\sqrt{n}} t_{\alpha(1-w)},\ \bar{Y} + \frac{s}{\sqrt{n}} t_{1-\alpha w}\right)$.

34. Let $Y \sim N(\mu, 1)$ and consider testing the hypothesis $H : \mu = 0$. Consider an acceptance region of the form $A_0 = \{Y : z_{\alpha(1-w)} < Y < z_{1-\alpha w}\}$.
(a) Show that the type I error rate of such a test is $\alpha$ for $w \in [0, 1]$.
(b) Obtain the power of the test, that is, $\Pr(Y \notin A_0 \mid \mu)$ as a function of $\mu$. Make a plot of this power function for $w = 1/2$ and $w = 1/4$. When would you use $w = 1/2$? When would you use $w = 1/4$?

35. Suppose treatment A is assigned to a random selection of 5 experimental units, and the 5 remaining experimental units are assigned treatment B. The observed treatment assignments and measured responses are $(X_1, \ldots, X_{10}) = (B, A, A, B, A, B, A, B, B, A)$ and $(Y_1, \ldots, Y_{10}) = (7.5, 1.2, 5.5, 2.2, 9.1, 8.7, 3.2, 5.1, 6.2, 1.7)$.
(a) Assuming the A and B outcomes are random samples from $N(\mu_A, \sigma^2)$ and $N(\mu_B, \sigma^2)$ populations, compute the appropriate t-statistic for testing $H : \mu_A = \mu_B$, state the distribution of the statistic under $H$, and compute the p-value.
(b) Using the same test statistic, do a permutation test of $H$: no treatment effect. Specifically, obtain the permutation null distribution, and compute the corresponding p-value.
(c) Graphically compare the two null distributions, and compare the p-values. Describe the differences in assumptions that the two testing procedures make.
(d) Obtain the permutation null distributions and p-values for the test statistics $\bar{Y}_A - \bar{Y}_B$ and $\bar{Y}_A / \bar{Y}_B$.

36. Let $Y$ be a random variable and $t(Y)$ be a test statistic. The p-value is $p(y) = \Pr(t(Y) > t(y))$, where the distribution of $Y$ is the null distribution $P_0$, with CDF $F_0$. Show that the distribution of $p(Y)$ under $Y \sim P_0$ is uniform on $[0, 1]$. (Hint: Find the CDF of the p-value in terms of $F_0$.)

37. Let $Y_1, \ldots, Y_n \sim$ i.i.d. $P_{\theta_0} \in \mathcal{P}$. Find the log-likelihood function and the form of the MLE in the cases where $\mathcal{P}$ is the set of
(a) Poisson distributions with mean $\theta \in \mathbb{R}^+$;
(b) multinomial distributions with probabilities $(\theta_1, \ldots, \theta_p)$;
(c) uniform distributions on $(\theta_1 - \theta_2/2,\ \theta_1 + \theta_2/2)$, with $\theta_1 \in \mathbb{R}$ and $\theta_2 \in \mathbb{R}^+$.

38. Let $Y_i = e^{X_i}$ where $X_1, \ldots, X_n \sim$ i.i.d. $N(\mu, \sigma^2)$. Let $\phi = E[Y_i]$.
(a) Find the expectation and variance of $Y_i$, and the expectation and variance of $\bar{Y}$, in terms of $\mu$ and $\sigma^2$.
(b) Find the MLE $\hat{\phi}$ of $\phi$ based on $Y_1, \ldots, Y_n$, and find an approximation to the variance of $\hat{\phi}$. Discuss the magnitude of $\text{Var}[\hat{\phi}]$ relative to $\text{Var}[\bar{Y}]$.
(c) Perform a simulation study where you compare $\hat{\phi}$ and $\bar{Y}$ in terms of bias, variance and MSE.

39. Let $f$ and $g$ be discrete pdfs on $\{0, 1, 2, \ldots\}$ and define $D(f, g) = \sum_y \log(f(y)/g(y))\, f(y)$.
(a) Show that $D(f, g) > 0$ if $f \ne g$ and $D(f, f) = 0$.
(b) Let $g_\theta$ be the Poisson pdf with mean $\theta$. Find the value of $\theta$ that minimizes $D(f, g_\theta)$, in terms of moments of $f$.

40. Consider a one-parameter exponential family model, $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$, with densities $f(y \mid \theta) = c(y) \exp(\theta t(y) - A(\theta))$ for $\theta \in \Theta \subset \mathbb{R}$. Here, $t(y)$ is a scalar-valued function of the data point $y$.
(a) For a sample of size $n$, write out the log-likelihood function and simplify as much as possible.
(b) Find the likelihood equation, with the data on one side of the equation and items involving the parameter on the other.
(c) Now take the derivative of $p(y \mid \theta)$ with respect to $\theta$ and integrate to obtain a formula for the expectation of $t(Y)$. Compare this equation to the one obtained in (b), and comment.
41. Let (X i, Y i ) i.i.d. with Y i X i binary(e β 0+β 1 X i /(1 + e β 0+β 1 X i )) and X i P X. Our goal is to infer θ = (β 0, β 1 ). (a) Find a formula for the log-likelihood and the score function, and obtain equations that determine the MLE (i.e., the likelihood equations ). (b) Write down the observed information for θ, and compute the Fisher information. (c) Find the asymptotic distribution of ˆθ MLE as a function of the Fisher information. Is this usable for inference if P X is unknown? (d) Find another asymptotic approximation to the distribution of ˆθ MLE that can be used if P X is unknown. Describe how this approximation can be used to provide a hypothesis test of H : β 1 = 0. 42. Let Y 1,..., Y n i.i.d. gamma(a, b), parameterized so that E[Y i ] = a/b. (a) Write down the log-likelihood and obtain the likelihood equations. (b) Compute the Fisher information and use this to obtain a joint asymptotic distribution for (â MLE, ˆb MLE ). (c) Let µ = a/b. Obtain the asymptotic distribution of ˆµ MLE = â MLE /ˆb MLE. 43. Suppose Y 1,..., Y n i.i.d. gamma(a, b) as in the previous problem, but the statistician thinks that Y 1,..., Y n i.i.d. N(µ, σ 2 ) for some unknown values of µ, σ 2. (a) What values of (µ, σ 2 ) will maximize the expected log likelihood, E[log p(y µ, σ 2 )]? Here, the expectation is with respect to the true gamma distribution for Y, and your answer should depend on (a, b). (b) Make an argument that (ˆµ MLE, ˆσ MLE 2 ) converges in probability to something, say what that something is and explain your reasoning. 11
(c) What is the standard error of $\hat{\mu}_{MLE}$ for the statistician who assumes normality? How does this compare to the standard error of a statistician who correctly assumes the gamma model (as in the previous problem)?
(d) Discuss the consequences of model misspecification in this case.

44. Information inequalities:
(a) Adapt the derivation of the Cramér-Rao information inequality to obtain a lower bound on the variance of a biased estimator.
(b) For the model $Y_1, \ldots, Y_n \sim$ i.i.d. $N(\mu, \sigma^2)$, the posterior mean estimator $\hat{\mu}$ of $\mu$ under the prior $\mu \sim N(0, \tau^2)$ is $\hat{\mu} = \bar{y} \cdot \frac{n/\sigma^2}{n/\sigma^2 + 1/\tau^2}$. Use (a) to obtain a lower bound on the variance of $\hat{\mu}$ and compare to the actual variance, $\text{Var}[\hat{\mu} \mid \mu]$.

45. Let $X_1, \ldots, X_n \sim$ i.i.d. gamma$(a_x, b_x)$ and $Y_1, \ldots, Y_n \sim$ i.i.d. gamma$(a_y, b_y)$.
(a) Compute the ($-2 \log$) likelihood ratio statistic for testing $H : a_x = a_y,\ b_x = b_y$, and state the asymptotic null distribution.
(b) Simulate the actual null distribution of the statistic in the case that $a_x = a_y = b_x = b_y = 1$ for the sample sizes $n = 5, 10, 20, 40$, and compare to the asymptotic null distribution.

46. Let $X_1, \ldots, X_m \sim$ i.i.d. $N(\mu_x, \sigma^2)$ and $Y_1, \ldots, Y_n \sim$ i.i.d. $N(\mu_y, \sigma^2)$.
(a) For the case that $\sigma^2$ is known, compute and compare the AIC and BIC for the two models corresponding to $\mu_x = \mu_y$ and $\mu_x \ne \mu_y$. For each model selection criterion, give the decision rule for choosing $\mu_x \ne \mu_y$ over $\mu_x = \mu_y$. Also compare these decision rules to deciding based on a level-$\alpha$ z-test.
(b) Repeat for the case that $\sigma^2$ is unknown, but now compare AIC and BIC to deciding based on a level-$\alpha$ t-test.
(c) Now compute and compare the AIC and BIC decision rules for the case that the variances of the two populations are not necessarily equal.

47. Let $Y_1, \ldots, Y_n \sim$ i.i.d. $N(\theta, 1)$. Using level-$\alpha_n$ z-tests where $\alpha_n$ depends on $n$, develop a consistent model selection procedure for choosing between $\theta = 0$ and $\theta \ne 0$.

48. Let $p_1, \ldots, p_m \sim$ i.i.d. $(1 - \gamma)P_0 + \gamma P_1$, where $P_0$ is the uniform distribution on $[0, 1]$ and $P_1$ is some other distribution, with CDF $F_1$.
(a) Write out the probability that $p_1 < \alpha/m$ in terms of $\alpha$, $m$, $F_1$, $\gamma$.
(b) Write out the probability that the Bonferroni procedure rejects the global null hypothesis $H_0 : \gamma = 0$ at level $\alpha$, that is, the probability that the smallest p-value is less than $\alpha/m$.
(c) Approximate the above probability using the approximation $\log(1 - x) \approx -x$ for small $x$.
(d) Based on this approximation, evaluate whether the probability of rejection is increasing or decreasing in $\alpha$ and in $\gamma$. Explain why your answers make sense.
(e) What are conditions on $F_1$ that suggest (based on the approximation) that the Bonferroni procedure will have good power as $m \to \infty$?

49. Let $Y_i \mid \theta_i \sim N(\theta_i, 1)$ independently for $i = 1, \ldots, m$ with $m = 100$. Using a Monte Carlo approximation, compute the probability of rejecting the global null $H_0 : \theta_1 = \cdots = \theta_m = 0$ at level $\alpha = .05$ using the Bonferroni procedure, Fisher's procedure, and a test based on the statistic $\sum Y_i^2$, under the following scenarios:
(a) $\theta_1, \ldots, \theta_m \sim$ i.i.d. $N(0, K/100)$ for $K \in \{1, 2, 4, 8, 16, 32\}$;
(b) $\theta_1 = K$ and $\theta_2 = \cdots = \theta_m = 0$, where $K \in \{1, 2, 3, 4, 5, 6\}$.
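To give the flavor of the Monte Carlo computation in Problem 49, here is a minimal Python sketch restricted to the Bonferroni procedure under the global null (the full problem asks for Fisher's procedure and the $\sum Y_i^2$ statistic as well, under the non-null scenarios). The simulation size nsim is an arbitrary illustrative choice.

```python
# Monte Carlo estimate of the Bonferroni rejection rate when all theta_i = 0.
import random
from statistics import NormalDist

def bonferroni_rejects(y, alpha):
    """Reject the global null iff the smallest two-sided p-value is < alpha/m."""
    m = len(y)
    std = NormalDist()
    pvals = [2 * (1 - std.cdf(abs(yi))) for yi in y]
    return min(pvals) < alpha / m

rng = random.Random(0)
m, alpha, nsim = 100, 0.05, 2000
rate = sum(
    bonferroni_rejects([rng.gauss(0.0, 1.0) for _ in range(m)], alpha)
    for _ in range(nsim)
) / nsim
print(rate)  # under the global null this should be close to alpha
```

Swapping the inner data-generating line for draws with means $\theta_i$ from scenarios (a) or (b) turns the same loop into a power estimate.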
50. Consider a model for $m$ p-values, $p_1, \ldots, p_m \sim$ i.i.d. from a mixture distribution $P = (1 - \gamma)P_0 + \gamma P_1$, where $P_0$ is uniform on $[0, 1]$ and $P_1$ is a beta$(1, b)$ distribution.
(a) Propose a modified Benjamini-Hochberg procedure to control the FDR at level $\alpha$, in the case that $\gamma$ and $b$ are known.
(b) Compute the mean and variance of $p_1$ in terms of $\gamma$ and $b$. Using these calculations, propose moment-based estimators of $\gamma$ and $b$ using the observed values of $p_1, \ldots, p_m$. Based on this, propose a modified BH procedure that can be used if $\gamma$ and $b$ are not known.
(*) Compare the FDR and the number of discoveries made by the BH and modified BH procedures in a simulation study, for the case that $b \in \{1, 2, 4, 8\}$ and some interesting values of $\alpha$ and $\gamma$.
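For reference when working on Problem 50, a plain implementation of the standard (unmodified) Benjamini-Hochberg step-up rule is sketched below; this is the baseline procedure the problem asks you to modify, not a solution to parts (a) or (b).

```python
# Standard Benjamini-Hochberg step-up procedure.
# Returns the (sorted) indices of the rejected hypotheses.
def benjamini_hochberg(pvals, alpha):
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank k with p_(k) <= alpha * k / m
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k = rank
    # step-up: reject the k hypotheses with the smallest p-values
    return sorted(order[:k])

print(benjamini_hochberg([0.001, 0.02, 0.04, 0.8], 0.05))  # → [0, 1]
```

A "modified" BH procedure in the sense of the problem would adjust the threshold $\alpha k / m$ using the mixture parameters $\gamma$ and $b$ (or their estimates).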