STAT 461/561 - Assignments, Year 2015

This is the second set of assignment problems. When you hand in a problem, include the problem statement and its number. PDF submissions are welcome; if you submit one, use a large font and double spacing, and break lines wherever it helps. Every Friday, hand in however many questions you have completed. I am happy to give you feedback, or to ask the TA to give you feedback.

1. Example 9.2 (in its current form) uses a theorem to show that the t-test for the two-sided hypothesis is UMPU. I was somewhat careless in checking two important conditions.
(1) $V$ is independent of $T$ at $\xi = 0$. This is stated in the notes. Verify this condition.
(2) $V$ is linear in $U$ given $T$. This is not true. However, $V$ is a monotone function of a linear function of $U$ (given $T$). Find this linear function and show that the conclusion of the example still holds.
(3) For the two-sided test, the UMPU test contains two constants $k_1$ and $k_2$ to be determined. Demonstrate why $k_1 = -k_2$ is the solution in this problem.

2. Suppose $X_n$ has a binomial distribution with $n$ trials and probability of success $\theta$. Consider the problem of constructing a confidence interval for $\theta$ based on the following methods.
(i) Direct use of Wald's method: use the asymptotic normality of the MLE $\hat\theta$ and replace the asymptotic variance by a sensible estimator.
(ii) Direct use of Wald's method: use the asymptotic normality of the MLE $\hat\theta$ and regard the asymptotic variance as a function of $\theta$.
(iii) Find a variance-stabilizing transformation $g(\theta)$ such that $g(\hat\theta)$ is asymptotically normal with constant limiting variance.
(iv) Work out the likelihood interval without invoking an asymptotic distribution.
(v) Work out the likelihood interval based on Wilks' theorem.

3. (i) Suppose $n = 200$ and $\theta_0 = 0.4$. The observed value of $X_n$ is 73. Obtain all the intervals from the previous problem.
(ii) Suppose $n = 20$ and $\theta_0 = 0.1$. The observed value of $X_n$ is 1. Obtain all the intervals from the previous problem.

4. Let $X_1, \ldots, X_n$ be a sample from $N(\xi, \sigma^2)$.
(a) Show that the power of Student's t-test is an increasing function of $\xi/\sigma$ for testing $H_0: \xi < 0$ versus $H_1: \xi > 0$ (one-sided test).
(b) Show that the power of Student's t-test is an increasing function of $|\xi|/\sigma$ for testing $H_0: \xi = 0$ versus $H_1: \xi \neq 0$ (two-sided test).

The next two questions are too hard for today's students. Try one of them. They are waived for undergraduate students.

5. Suppose that $X_i = \beta_0 + \beta_1 t_i + \epsilon_i$, where the $t_i$ are fixed constants that are not all the same, the $\epsilon_i$ are iid from $N(0, \sigma^2)$, and $\beta_0$, $\beta_1$ and $\sigma^2$ are unknown parameters. Derive a UMPU test of size $\alpha$ for testing
(a) $H_0: \beta_0 \le \theta_0$ versus $H_1: \beta_0 > \theta_0$;
(b) $H_0: \beta_0 = \theta_0$ versus $H_1: \beta_0 \neq \theta_0$.

6. Suppose $(X_i, Y_i)$, $i = 1, 2, \ldots, n$, are a sample from a bivariate normal distribution with density function

$$f(x, y; \xi, \eta, \sigma, \tau) = \left\{2\pi\sigma\tau\sqrt{1-\rho^2}\right\}^{-n} \exp\left\{-\frac{1}{2(1-\rho^2)}\left(\frac{1}{\sigma^2}\sum (x_i-\xi)^2 - \frac{2\rho}{\sigma\tau}\sum (x_i-\xi)(y_i-\eta) + \frac{1}{\tau^2}\sum (y_i-\eta)^2\right)\right\}.$$

(a) Determine the form of the UMPU test for $H_0: \rho \le 0$ versus $H_1: \rho > 0$.
(b) Determine the rejection region of the test of size $\alpha$ in terms of a quantile of a well-known distribution (the t-distribution).
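
As a starting point for methods (i) and (ii) of Problem 2, the two Wald-type intervals can be sketched as follows. This is only an illustration under stated assumptions, not a model solution: it is written in Python rather than R, the function names `wald_interval` and `score_interval` are made up for this sketch, and a 95% level is assumed. Method (ii) is implemented by solving the quadratic inequality $n(\hat\theta - \theta)^2 \le z^2\theta(1-\theta)$ in closed form.

```python
import math

Z975 = 1.959963984540054  # 97.5% quantile of N(0, 1); assumes a 95% interval


def wald_interval(x, n, z=Z975):
    """Method (i): plug theta_hat into the asymptotic variance theta(1 - theta)/n."""
    theta_hat = x / n
    half = z * math.sqrt(theta_hat * (1 - theta_hat) / n)
    return theta_hat - half, theta_hat + half


def score_interval(x, n, z=Z975):
    """Method (ii): keep the variance as a function of theta and solve
    n * (theta_hat - theta)**2 <= z**2 * theta * (1 - theta) for theta."""
    theta_hat = x / n
    shrink = 1 + z * z / n
    centre = (theta_hat + z * z / (2 * n)) / shrink
    half = (z / shrink) * math.sqrt(
        theta_hat * (1 - theta_hat) / n + z * z / (4 * n * n)
    )
    return centre - half, centre + half


# The two data sets of Problem 3.
for x, n in [(73, 200), (1, 20)]:
    print(n, x, wald_interval(x, n), score_interval(x, n))
```

For the small-sample case ($n = 20$, $X_n = 1$) the lower endpoint from method (i) falls below zero, while the interval from method (ii) stays inside $(0, 1)$; this is one reason to compare the methods.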

7. Carry out two permutation tests on the Precambrian iron formation data. Consider the hypothesis that the first two types have the same mean ($H_0$) versus the hypothesis that the first two types have unequal means.
(a) Use permutation methods (based on the mean, and on the Wilcoxon test) to get the p-values.
(b) Use the t-test and the CLT to obtain approximate p-values.

An article on the origin of Precambrian iron formation reported the following data on percentage iron for 4 types of iron formation (1 = carbonate, 2 = silicate, 3 = magnetite, 4 = hematite):

group   observations
1:      20.5 28.1 27.8 27.0 28.0 25.2 25.3 27.1 20.5 31.3
2:      26.3 24.0 26.2 20.2 23.7 34.0 17.1 26.8 23.7 24.9

Decide for yourself whether to use two-sided or one-sided tests. However, declare your choice before you perform the analysis.

8. (Graduate students only.) Let $F_n(x)$ be the empirical distribution function based on an iid sample from a continuous distribution $F$. Let $D_n(F)$ be the Kolmogorov-Smirnov test statistic.
(a) Show that $D_n(F) \to 0$ almost surely.
(b) Show that the distribution of $D_n(F)$ for any continuous $F$ is the same as that of $D_n(F_0)$ when $F_0$ is the uniform distribution on $[0, 1]$.

9. The following values are iid observations from a binomial distribution with $m = 10$ and probability of success $\theta$.

4 3 3 3 2 3 4 3 2 1 3 7 5 2 2 2 2 1 3 4

(1) Obtain the 95% confidence interval for $\theta$ based on the likelihood method.
(2) Let

$$T_n(\bar x, \theta) = \frac{\sqrt{20}\,(\bar x - 10\theta)}{\sqrt{10\theta(1-\theta)}}$$

be used as a test statistic for $H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$. Note that the sample size is $n = 20$. By the CLT, $T_n$ is asymptotically $N(0, 1)$. Thus, we reject $H_0$ at the 5% level when $|T_n(\bar x, \theta_0)| > 1.96$. Numerically find all values of $\theta$ that are not rejected by this test. Your outcome is a confidence interval.

10. The following are 5 iid observations of a random vector: (1.96, 1.93), (0.42, .46), (1.12, 0.27), (0.20, 0.39), (1.16, 0.12). Use an R function to draw the asymptotic empirical likelihood 95% confidence region for the mean.

11. (Stat graduate students only.) Let $\mathcal{F}$ be the family of all one-dimensional distributions with finite first moment. Let $\theta = T(F)$ be the first moment of $F$. Define $R_n(\theta)$ as the empirical likelihood ratio function based on an iid sample of size $n$ from $F$, which was given as

$$R_n(\theta) = \sup\left\{ \prod_{i=1}^n (n p_i) : p_i > 0;\ \sum_{i=1}^n p_i = 1,\ \sum_{i=1}^n p_i x_i = \theta \right\},$$

where $x_1, \ldots, x_n$ is a set of i.i.d. observations from $F$. Show that, if we change the definition slightly into

$$R_n(\theta) = \sup_F \left\{ \prod_{i=1}^n \big(n F(\{x_i\})\big) : \int x \, dF(x) = \theta \right\},$$

where the supremum is taken over $F \in \mathcal{F}$ and $F(\{x_i\}) = F(x_i) - F(x_i^-)$ is the probability mass that the distribution $F$ puts on $x_i$, then the region $\{\theta : R_n(\theta) \ge r_0\}$ contains all real values of $\theta$ for any choice of $1 > r_0 > 0$.
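
The numerical step in Problem 9(2) amounts to inverting the test: find the values of $\theta$ where $T_n(\bar x, \theta)$ equals $\pm 1.96$. A minimal sketch follows, written in Python rather than R as an illustrative choice. It assumes that $T_n$ is strictly decreasing in $\theta$ on $(0, 1)$, so that the accepted set is a single interval with one root on each side of $\bar x / 10$; the helper `bisect_root` is a hypothetical name introduced here.

```python
import math

data = [4, 3, 3, 3, 2, 3, 4, 3, 2, 1, 3, 7, 5, 2, 2, 2, 2, 1, 3, 4]
m = 10               # binomial size of each observation
n = len(data)        # number of observations (20)
xbar = sum(data) / n


def t_stat(theta):
    """T_n(xbar, theta) = sqrt(n) (xbar - m theta) / sqrt(m theta (1 - theta))."""
    return math.sqrt(n) * (xbar - m * theta) / math.sqrt(m * theta * (1 - theta))


def bisect_root(f, a, b, tol=1e-10):
    """Root of f in [a, b] by bisection; assumes f(a) and f(b) differ in sign."""
    fa = f(a)
    while b - a > tol:
        mid = (a + b) / 2
        fm = f(mid)
        if (fa > 0) == (fm > 0):
            a, fa = mid, fm   # root lies in the right half
        else:
            b = mid           # root lies in the left half
    return (a + b) / 2


crit = 1.96
theta_hat = xbar / m
# T_n is positive below theta_hat and negative above it.
lower = bisect_root(lambda t: t_stat(t) - crit, 1e-9, theta_hat)
upper = bisect_root(lambda t: t_stat(t) + crit, theta_hat, 1 - 1e-9)
print(lower, upper)
```

A grid search over $\theta$ would serve equally well and does not rely on the monotonicity assumption; bisection is used here only because it gives the endpoints to a stated tolerance.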

12. Prove that the univariate empirical likelihood confidence region for $\theta = E(X_1)$ is an interval. Hint: show that some function is concave in $\theta$.

13. Let $X_1, \ldots, X_n$ be a random sample from the exponential distribution with density function $f(x; \theta) = \theta^{-1} \exp(-\theta^{-1} x)$. Consider the case $n = 201$ and $\theta = 1$.
(a) Determine the median of this distribution theoretically.
(b) Generate 1000 data sets with $n = 201$ to estimate the bias and variance of the sample median as an estimator of the population median.
(c) Bootstrap the first sample in (b) to obtain estimates of the bias and variance of the sample median as an estimator of the population median.
Remark: Use set.seed(2015561) so that we all get at least the same first sample.

14. Generate 1000 sets of two-sample data of size 30 + 30 from the normal distribution with mean 0 and variance 1. Randomly select 20 of these two-sample data sets. For each selected set, add a common random value generated from Uniform(0, 3) to every observation in the control group. This value is the same for all observations in the same control group, but differs from one selected set to another. Use the Benjamini and Hochberg procedure to identify a set of differentially expressed genes based on the two-sample t-test (two-sided). Choose $q = 0.05$ and $0.10$. Compute the positive identification rate (PIR) and the false discovery rate (FDR).
Positive identification rate: the percentage of the false null hypotheses (20 of them) that are rejected.
Repeat the above procedure 2000 times to get the averages and standard deviations of the PIR and FDR.
Remark: write flexible code so that you can simulate data from other distributions and with different effect sizes.

15. Repeat the above simulation experiment with data generated from the standard Gamma distribution with 2 degrees of freedom.

16. The book by Wu and Hamada contains a data set from a girder experiment that studies 10 methods. Analyze the full experiment, including the ANOVA table and multiple comparisons based on the Bonferroni and Tukey methods.

17. Let us try out the LASSO. Generate a data set, $i = 1, 2, \ldots, n$, according to the model

$$y_i = x_i^\tau(s)\beta(s) + \epsilon_i,$$

where the $\epsilon_i$ are i.i.d. $N(0, 1)$. Let each $x_i$ be a vector of length $P = 1000$ whose first entry is 1 and whose remaining 999 entries are generated from $N(0, 1)$. Let $s$ be a random subset of $\{1, 2, \ldots, P\}$ of size 5. If $s = \{3, 6, 8, 20, 21\}$, then $x_i^\tau(s)$ is the subvector of $x_i$ made of its 3rd, 6th, 8th, 20th and 21st entries. Let $\beta(s) = (0.7, 0.9, 0.4, 0.3, 1.0)$.
Now create a data set with $n = 200$ according to the linear model. Run the glmpath function in R to find the first 10 covariates selected by the LASSO. Compare them to $s$, the covariates that should ideally be selected.
Repeat the computer experiment 5 times, and put the outcomes in a table with 5 rows, each row having two segments: the first segment contains the 5 truly active covariates; the second segment contains the 10 covariates selected by the LASSO. Label your table clearly.
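
For the simulations in Problems 14 and 15, the Benjamini and Hochberg step-up rule and the two summary rates can be sketched as below. This is a hedged illustration in Python rather than R: it assumes the two-sided t-test p-values have already been computed (that step is not shown), and the function names `benjamini_hochberg` and `pir_and_fdp` are invented for this sketch. The second function returns the false discovery proportion of a single run; averaging it over the 2000 repetitions estimates the FDR.

```python
def benjamini_hochberg(pvalues, q):
    """Indices rejected by the BH step-up rule at level q.

    Sort the p-values, find the largest rank k with p_(k) <= k * q / m,
    and reject the k hypotheses with the smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * q / m:
            k = rank  # keep the largest rank that passes
    return sorted(order[:k])


def pir_and_fdp(rejected, false_nulls):
    """Positive identification rate and false discovery proportion of one run."""
    rejected, false_nulls = set(rejected), set(false_nulls)
    true_pos = len(rejected & false_nulls)
    pir = true_pos / len(false_nulls)
    # Convention assumed here: the proportion is 0 when nothing is rejected.
    fdp = (len(rejected) - true_pos) / len(rejected) if rejected else 0.0
    return pir, fdp


# Toy p-values, made up for illustration only.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(pvals, q=0.05)
print(rejected, pir_and_fdp(rejected, false_nulls=[0, 1, 2]))
```

Keeping the p-value computation outside these functions is one way to follow the remark about flexible code: the data-generating distribution and the effect sizes can change without touching the selection step.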

18. If BIC or EBIC (with $\gamma = 0.5$) is used to decide which variables are selected, which variables would be selected in the 5 runs of the previous question? Stop the selection at 10; that is, select at most 10 covariates.

19. No more questions.