MA20226: STATISTICS 2A 2011/12 Assessed Coursework Sheet One. Set: Lecture 10 12:15-13:05, Thursday 3rd November, 2011 EB1.1


Preamble

Set: Lecture 10, 12:15-13:05, Thursday 3rd November 2011, EB1.1.

Due: Please hand the work in to the coursework drop-box on level 1 of 4W by 14:30 on Thursday 17th November. The work should be accompanied by a completed Coursework Cover Sheet, available in lectures, from the Departmental Office, or [...]. If the work is submitted after the deadline, without an agreed extension or mitigating circumstances, it will be assessed at a maximum mark of the pass mark (40%). If you submit work more than five days after the submission date, you will normally receive a mark of 0, unless you have been granted an extension or a panel has agreed that there are Individual Mitigating Circumstances (IMCs).

Time: The average student should spend around three hours on this assignment.

Conditions: The assignment should be your own work. You should attempt all questions. You may also consult me, or your tutor, for general advice. It should be completed during the practical session in Week 6 and in your own time.

Value: This assignment carries 50% of the total marks for the coursework. Hence, it carries [...]% of the total marks for the course. This assignment contains nine questions and there are a possible 25 marks available.

Aim: The objective of this coursework is to enable you to construct confidence intervals using the statistical package R and to gain an insight into the properties of such intervals. In Section 1 we state how you can use R to find quantiles and probabilities from the normal and $\chi^2$-distributions. In Section 2 we look at how to construct confidence intervals, while in Section 3 we investigate the properties of confidence intervals using a sampling experiment.

Contact details
Simon Shaw
Room: 4W [...]
s.shaw@bath.ac.uk
Webpage: [...]

1 Using R to find quantiles and probabilities

Rather than using tables, we may use R to find quantiles and probabilities.

Normal distribution: If $Z \sim N(0, 1)$, let $P(Z \le z_p) = p$. For a given $p$, $z_p$ may be found using the command qnorm(p). For a given $z_p$, $p$ may be found using the command pnorm(z_p). For example,

> qnorm(0.975)           # finds z such that P(Z <= z) = 0.975
[1] 1.959964
> pnorm(1.96)            # P(Z <= 1.96)
[1] 0.9750021
> pnorm(qnorm(0.975))
[1] 0.975

If you want to compute an upper tail probability, say $P(Z > z_p)$, then you can either use the result that $P(Z > z_p) = 1 - P(Z \le z_p)$ or change the default in pnorm to calculate the upper tail. The same approach can be used with qnorm.

> 1-pnorm(1.96)                     # P(Z > 1.96) = 1 - P(Z <= 1.96)
[1] 0.0249979
> pnorm(1.96,lower.tail=FALSE)      # calculates using the upper tail, P(Z > 1.96)
[1] 0.0249979
> qnorm(0.975)                      # finds z such that P(Z <= z) = 0.975
[1] 1.959964
> qnorm(0.025)                      # finds z such that P(Z <= z) = 0.025
[1] -1.959964
> qnorm(0.025,lower.tail=FALSE)     # calculates using the upper tail, so finds z such that P(Z > z) = 0.025
[1] 1.959964

$\chi^2$-distribution: If $\chi^2_\nu$ has a $\chi^2$-distribution with $\nu$ degrees of freedom, let $P(\chi^2_\nu \le \chi^2_{\nu,p}) = p$. For a given $p$, $\chi^2_{\nu,p}$ may be found using the command qchisq(p, ν). For a given $\chi^2_{\nu,p}$, $p$ may be found using the command pchisq(χ²_{ν,p}, ν). If you want to work with the upper tail, say $P(\chi^2_\nu > \chi^2_{\nu,p})$, then you can either use $P(\chi^2_\nu > \chi^2_{\nu,p}) = 1 - P(\chi^2_\nu \le \chi^2_{\nu,p})$ or change the default in pchisq to calculate the upper tail. The same approach can be used with qchisq.
For example,

> qchisq(0.95, 10)                       # finds y such that P(Y <= y) = 0.95 when Y is a chi-square with 10 degrees of freedom
[1] 18.30704
> qchisq(0.05, 10)                       # finds y such that P(Y <= y) = 0.05
[1] 3.940299
> qchisq(0.05, 10, lower.tail=FALSE)     # calculates using the upper tail, finds y such that P(Y > y) = 0.05
[1] 18.30704
> qchisq(0.95, 10, lower.tail=FALSE)     # calculates using the upper tail, finds y such that P(Y > y) = 0.95
[1] 3.940299
> 1-pchisq(18.307, 10)                   # P(Y > 18.307) = 1 - P(Y <= 18.307)
[1] 0.05000059
> pchisq(18.307, 10, lower.tail=FALSE)   # calculates using the upper tail, P(Y > 18.307)
[1] 0.05000059

2 Constructing confidence intervals

The following function calculates a 95% confidence interval for $\mu$ when observations $X_1, \ldots, X_n$ are assumed to be iid $N(\mu, \sigma^2)$ with $\sigma^2$ assumed known.
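For reference, a brief sketch of where this interval comes from, using only the assumptions stated above (iid normal observations, known $\sigma^2$). Under those assumptions $\bar{X} \sim N(\mu, \sigma^2/n)$, so

```latex
\[
P\!\left(-z_{0.975} \le \frac{\bar{X}-\mu}{\sqrt{\sigma^2/n}} \le z_{0.975}\right) = 0.95
\quad\Longrightarrow\quad
\left(\bar{x} - z_{0.975}\sqrt{\sigma^2/n},\;\; \bar{x} + z_{0.975}\sqrt{\sigma^2/n}\right)
\]
```

is a 95% confidence interval for $\mu$, where $z_{0.975}$ is the value returned by qnorm(0.975). The two endpoints correspond to lo and hi in the function CIfun.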

> CIfun <- function(i, sigmasq) {      # i is a vector containing the observations
+                                      # sigmasq is the known variance
+   n <- length(i)                     # find the number of elements in i
+   smean <- mean(i)                   # calculate the mean of i
+   z <- qnorm(0.975)                  # appropriate z-value
+   lo <- smean - z*sqrt(sigmasq/n)    # lower bound
+   hi <- smean + z*sqrt(sigmasq/n)    # upper bound
+   c(lo, hi)                          # return confidence interval as a vector
+ }

The code may be downloaded from [...].

The following data, which are assumed to come from a normal distribution with mean $\mu$, representing the passage time of light, and variance $\sigma^2$, may be regarded as Newcomb's measurements of the passage time of light.

[...]

The data set may be scanned into R from a text file using the command scan.

> newcomb <- scan("[...]")
Read 20 items

Assuming $\sigma^2 = 40$, we can find a 95% confidence interval for the true passage time of light.

> CIfun(newcomb,40)
[1] [...]

1. Write a function, which you should call gcifun, that allows you to construct a $100(1-\alpha)\%$ confidence interval for $\mu$ when $\sigma^2$ is assumed known. Your function should have three arguments: the first, i, corresponding to the data vector; the second, sigmasq, corresponding to the assumed variance; and a third corresponding to the value $\alpha$. Thus, for example, your answers from gcifun(newcomb,40,0.05) and CIfun(newcomb,40) should be the same. Hand in a copy of your code. [2]

2. Use your function, gcifun, to calculate a 91.8% confidence interval for $\mu$, the true passage time of light. [1]

3. What assumptions have you made in calculating the confidence interval, and to what extent do they seem justified here? [2]

4. Suppose that we now assume $\sigma^2$ to be unknown.
(a) Write a function, chigcifun, that allows you to construct a $100(1-\alpha)\%$ confidence interval for $\sigma^2$. Hand in a copy of your code. [3]
(b) Use your function to calculate a 97.2% confidence interval for $\sigma^2$ for Newcomb's measurements of the passage time of light.
Comment upon whether your interval does or does not support the previously assumed value of $\sigma^2 = 40$. [3]

You often need to extract elements from vectors that satisfy certain criteria. This can be done by using a relational expression instead of the index. For example, newcomb[newcomb > 0] will produce the vector whose elements are those contained in newcomb which are positive.

5. Omitting the two smallest data values in Newcomb's measurements, recalculate the 91.8% confidence interval for $\mu$ (assuming $\sigma^2 = 40$) and the 97.2% confidence interval for $\sigma^2$ (now assumed unknown). Explain carefully any differences in the results compared with those using all 20 data values. [6]
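The relational indexing described before Question 5 can be tried out on a small made-up vector first. The sketch below is only an illustration: the vector x and the names pos, mid, keep and n_pos are hypothetical and have nothing to do with the Newcomb data.

```r
# Hypothetical data vector, NOT the Newcomb measurements
x <- c(12, -3, 7, 0, -8, 5)

pos   <- x[x > 0]             # elements that are strictly positive
mid   <- x[x > -5 & x < 10]   # elements satisfying two conditions at once
keep  <- x > 0                # the relational expression is itself a logical vector
n_pos <- sum(keep)            # TRUE counts as 1, so this counts the positive elements

print(pos)
print(mid)
print(n_pos)
```

The original vector x is left unchanged by these commands; each subset is a new vector, so it can be passed to a function in exactly the same way as the full data vector.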

3 Confidence intervals for multiple samples

R provides a simple way to execute a loop where each iteration of the loop returns a value, perhaps from the application of the same function. First, we recall that in R we may use a colon to create a sequence of integers. For example,

> 1:10
[1]  1  2  3  4  5  6  7  8  9 10
> 16:9
[1] 16 15 14 13 12 11 10  9

The function sapply is used to perform our loop. It takes two arguments: a sequence, e.g. of integers, and a function with at least one argument. The function is called once for each value in the sequence.

> sapply(1:10, function(i){i})
[1]  1  2  3  4  5  6  7  8  9 10
> sapply(16:9,function(i){sqrt(i)})
[1] 4.000000 3.872983 3.741657 3.605551 3.464102 3.316625 3.162278 3.000000
> sapply(1:5,function(i){rnorm(2)})
     [,1] [,2] [,3] [,4] [,5]
[1,] [...]
[2,] [...]

The final example is interesting on two counts. Firstly, although i appears in the parentheses for the function, it does not appear in the curly brackets. sapply always requires that the function definition has at least one argument and we have used a dummy one here. Secondly, the output is a matrix. At each stage of the loop, we take a sample of two observations from the standard normal distribution: the ith iteration produces the ith column.

> normsample <- sapply(1:100,function(i){rnorm(25,6,3)})

The matrix normsample contains 100 random samples of size 25 from a normal distribution with mean $\mu = 6$ and standard deviation $\sigma = 3$, so that the variance $\sigma^2 = 9$. Note that

> gcifun(normsample[,47],9,0.14)
[1] [...]

produces an 86% confidence interval for $\mu$ for the 47th sample of size 25.

6. Use the sapply function to create a matrix, which you should call meanconf. The ith column should contain the 86% confidence interval for $\mu$ for the ith sample of size 25, with the lower bounds on the first row and the upper bounds on the second row. Thus, the result of meanconf[,47] and gcifun(normsample[,47],9,0.14) should be identical. What R command did you use to create meanconf?
[1]

The function ciplot can be used to plot n confidence intervals, for a parameter $\theta$, contained in a $2 \times n$ matrix whose first row contains the lower bounds and second row the upper bounds. The true value of the parameter $\theta$ is also specified to ciplot and this is drawn on the plot. The intervals that contain the parameter are coloured red; the others blue.

ciplot <- function(confint, true) {
  n <- length(confint[1,])                     # find number of confidence intervals
  x <- matrix(c(1:n,1:n),nrow=n,ncol=2)
  y <- c(t(x))                                 # produces vector with y[2i-1] = y[2i] = i
  z <- c(confint)                              # vector with z[2i-1] lower bound, z[2i] upper bound of ith ci
  plot(z, y, type="n", ylab="sample number")   # set up axes covering the end points of the ci (nothing is drawn yet)
  abline(v=true)                               # draw vertical line at true value of parameter
  for (i in 1:n) {
    a <- 2*i-1
    b <- 2*i
    if (z[a] <= true & z[b] >= true) {         # interval contains true value of parameter
      lines(z[a:b],y[a:b],col=2)               # join endpoints of ci with red line
    } else {
      lines(z[a:b],y[a:b],col=4)               # join endpoints of ci with blue line
    }
  }
}

The code may be downloaded from [...].

7. How many of the intervals in meanconf do you expect to contain $\mu = 6$, and why? How many actually do? (The following R commands achieve this.

> mulo <- meanconf[1,]
> muhi <- meanconf[2,]
> sum(mulo <= 6 & muhi >= 6)

Note that inside the parentheses we have a logical vector which is TRUE if and only if the designated interval contains the value 6. The sum function then totals up how many TRUE values occur.) Use the function ciplot to plot your 100 confidence intervals. Hand in a copy of your plot. [2]

8. Use the sapply function to create a matrix, which you should call varconf. The ith column should contain the 92% confidence interval for $\sigma^2$ for the ith sample of size 25. The lower bounds should be on the first row and the upper bounds on the second row. How many of these intervals do you expect to contain $\sigma^2 = 9$, and why? How many actually do? Use ciplot to plot your 100 confidence intervals. Hand in a copy of your plot. [2]

9. Briefly comment upon the two plots you have obtained, in each case making reference to the typical location of the actual parameter in the confidence interval and highlighting any differences in the two plots. [3]
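The two idioms used throughout this section (sapply returning a matrix with one column per iteration, and sum applied to a logical vector) can be practised on a statistic other than a confidence interval. The sketch below is a hypothetical illustration only: it uses the sample range (minimum and maximum) rather than any interval asked for in the questions, and the seed, sample sizes and the names samples, ranges, lo, hi, inside and count are all assumptions made for the example.

```r
set.seed(1)   # hypothetical seed, used only so that the sketch is repeatable

# 50 hypothetical samples of size 10 from N(0, 1), one sample per column
samples <- sapply(1:50, function(i) { rnorm(10) })

# For each column return c(min, max); sapply binds these length-2 results
# into a 2 x 50 matrix: smallest values on row 1, largest values on row 2
ranges <- sapply(1:50, function(i) { range(samples[, i]) })

lo <- ranges[1, ]
hi <- ranges[2, ]
inside <- lo <= 0 & hi >= 0   # logical vector, TRUE when 0 lies within the sample range
count  <- sum(inside)         # TRUE counts as 1, so this totals the TRUE entries

print(dim(ranges))
print(count)
```

Because ranges has the same 2 x n layout that ciplot expects (lower values on the first row, upper values on the second), a matrix built this way can be inspected row by row in the same manner as meanconf.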
