EXAMINERS' REPORT & SOLUTIONS, STATISTICS 1 (MATH 11400), May-June 2009
Examiners' Report

A1. Most plots were well done. Some candidates muddled hinges and quartiles and gave the wrong one. Generally candidates correctly identified F⁻¹(u), but some failed to find F⁻¹(k/(n + 1)) or got the coordinates the wrong way round.

A2. Generally answered correctly.

A3. Parts (i) and (ii) were generally answered correctly. Some candidates incorrectly took P(X = 20) = 0.2 rather than 0.1, and some computed E(X²) rather than Var(X). Part (iii) was less well done and some candidates seemed totally confused. Many candidates correctly identified the distribution of U − T as Normal, but some took it to have variance σ²_U − σ²_T rather than σ²_U + σ²_T.

A4. One common error was to transpose S and σ, and incorrectly write √n(X̄ − µ)/S ∼ N(0, 1) or Σ(Xᵢ − X̄)²/S² ∼ χ²_{n−1} in place of the correct statements √n(X̄ − µ)/σ ∼ N(0, 1) and Σ(Xᵢ − X̄)²/σ² ∼ χ²_{n−1}. Otherwise both parts were generally answered correctly.

A5. In part (i) some candidates lost marks by failing to state the distribution of √n(X̄ − µ)/σ₀. Most candidates correctly computed the end points of the confidence interval, though one or two incorrectly used a t-distribution or even a chi-square distribution. In part (ii), most candidates answered (a) and (b) correctly, but not many got (c) completely correct.

B1. Most candidates found parts (a) and (b) straightforward, though some candidates incorrectly defined the bias as E(θ − θ̂) (it should be E(θ̂ − θ)) and some took mse(θ̂) = Var(θ̂) + bias(θ̂)² as the definition of the mse (it should be mse(θ̂) = E[(θ̂ − θ)²]). A small number of candidates forgot that θ here is the fixed constant value of the parameter, and tried to compute E(θ). Candidates seemed to find part (c) harder; some got as far as τ̂ = exp{log(2)/θ̂} − 1 without realising that exp{log(2)/θ̂} is just 2^(1/θ̂).
Most candidates answered part (d), but some did not make the step from the median of each distribution (shown in the boxplot) to the mean (required for the bias and variance).

B2. Part (a) was attempted by almost all candidates, and generally answered correctly. Most candidates attempted part (b), usually correctly identifying the appropriate test (the paired t-test) and the correct hypotheses, finding the correct form of test statistic, and using the correct approach to computing the p-value. However, candidates did not always identify the correct model assumptions (i.e. that d₁, ..., dₙ are the observed values of a random sample from the Normal distribution with unknown mean and variance); a surprising number made numerical slips in computing the variance estimate s² = (Σdᵢ² − n d̄²)/(n − 1); some failed to state the distribution of the test statistic under H₀; and some failed to correctly interpret their p-value or draw appropriate conclusions. Candidates generally found part (c) harder, and many failed to relate the occurrence of, say, a Type II error to a range of D̄ values for which the probability could then easily be computed under H₁.

B3. This question was significantly less popular than B1 or B2. In part (a), most students correctly computed the estimates, but very few were able to derive the equations satisfied by the least squares estimates. Candidates attempting part (b) generally identified the correct formulae for the end points of the confidence intervals, but many made numerical slips in the calculation, often computing an incorrect value for s_α̂ even having found the correct value for σ̂². Again, candidates found part (c) testing, and only a small number worked their way through to the end.
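The variance shortcut mentioned above, s² = (Σdᵢ² − n d̄²)/(n − 1), is algebraically identical to the definitional form Σ(dᵢ − d̄)²/(n − 1), which a quick numerical check confirms. The dᵢ values below are made up for illustration; they are not the exam data:

```python
# Check that the shortcut form of the sample variance,
# s^2 = (sum(d_i^2) - n*dbar^2)/(n - 1), agrees with the definitional form.
# The d_i below are made-up illustrative values, not the exam data.
d = [1.2, 2.5, 0.8, 3.1, 1.9, 2.2, 0.5, 2.8, 1.4, 2.6]
n = len(d)
dbar = sum(d) / n

s2_def = sum((di - dbar) ** 2 for di in d) / (n - 1)             # definition
s2_short = (sum(di * di for di in d) - n * dbar ** 2) / (n - 1)  # shortcut

print(s2_def, s2_short)
```

The shortcut form is convenient with tabulated sums, but it is also where the numerical slips noted above tend to occur, since it subtracts two quantities of similar size.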
Solutions

A1. (i) The ordered observations are: 0.50, 0.66, 0.71, 0.92, 0.94, 1.02, 1.06, 1.12, 1.20, 1.23, 1.47. The values used to construct the boxplot are: the smallest observation x(1) = 0.50; the lower hinge = median of {data values ≤ median} = (0.71 + 0.92)/2 = 0.815; the median x(6) = 1.02; the upper hinge = median of {data values ≥ median} = (1.12 + 1.20)/2 = 1.16; and the largest observation x(11) = 1.47.

(ii) F(x) = x/2 has inverse F⁻¹(u) = 2u, so the fitted quantiles are F⁻¹(k/(n + 1)) = 2k/(n + 1) for k = 1, ..., n. The smallest observation is x(1) = 0.50 and the corresponding fitted quantile (as n = 11) is F⁻¹(1/12) = 0.167, so the coordinates of the point are (0.167, 0.50). The upper tail values are smaller than expected (shorter upper tails) and the lower tail values are larger than expected (shorter lower tails), so overall the data are more concentrated about their centre than expected under the fitted distribution.

A2. (i) For two unknown parameters we use the equations involving the two lowest moments of the distribution. Here E(X²; µ, σ²) = Var(X; µ, σ²) + [E(X; µ, σ²)]² = σ² + µ². Let m₁ = (x₁ + ... + xₙ)/n and m₂ = (x₁² + ... + xₙ²)/n. Then the method of moments estimators satisfy E(X; µ̂, σ̂²) = m₁ and E(X²; µ̂, σ̂²) = m₂, i.e. µ̂ = m₁ and σ̂² + µ̂² = m₂.

(ii) Solving the first equation in (i) gives µ̂ = m₁ = x̄. Substituting this into the second equation gives σ̂² + m₁² = m₂, so σ̂² = m₂ − m₁² = Σxᵢ²/n − x̄² = Σ(xᵢ − x̄)²/n.

A3. (i) Let X denote the value of a randomly chosen card. P(X = 5) = 0.6, P(X = 10) = 0.3 and P(X = 20) = 0.1, so µ = 5 × 0.6 + 10 × 0.3 + 20 × 0.1 = 8, E(X²) = 25 × 0.6 + 100 × 0.3 + 400 × 0.1 = 85, and σ² = E(X²) − µ² = 85 − 64 = 21.

(ii) The customer collects n = 50 cards, so from the central limit theorem T = ΣXᵢ ≈ N(nµ, nσ²) = N(400, 1050).

(iii) Exactly the same argument gives U ≈ N(400, 1050). Since T and U are independent, U − T ∼ N(µ_U − µ_T, σ²_U + σ²_T) = N(0, 2100) and (U − T)/√2100 ∼ N(0, 1). Thus P(U − T > 20) = P((U − T)/√2100 > 20/√2100) = P(Z > 0.44) = 1 − P(Z < 0.44) = 1 − 0.6700 = 0.3300.

A4. (i) Let U ∼ N(0, 1) and let V ∼ χ²_r, with U and V independent.
Then from notes W = U/√(V/r) has the t distribution with r degrees of freedom, and we write W ∼ t_r.

(ii) From notes, X̄ ∼ N(µ, σ²/n), so that √n(X̄ − µ)/σ ∼ N(0, 1). Also from notes, (n − 1)S²/σ² = Σ(Xᵢ − X̄)²/σ² ∼ χ²_{n−1}. But √n(X̄ − µ)/S = [√n(X̄ − µ)/σ]/√[(n − 1)S²/((n − 1)σ²)], so √n(X̄ − µ)/S = U/√(V/(n − 1)), where U ∼ N(0, 1) and V ∼ χ²_{n−1}. Hence, from the result in (i), √n(X̄ − µ)/S ∼ t_{n−1}.
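The construction W = U/√(V/r) can be sanity-checked by Monte Carlo simulation: such samples should have the variance r/(r − 2) of a t_r variable. This is a minimal sketch; the choice r = 6 and the number of replicates are arbitrary illustrative values:

```python
import math
import random

random.seed(0)
r = 6            # degrees of freedom (an arbitrary illustrative choice)
N = 100_000      # number of simulated values

samples = []
for _ in range(N):
    u = random.gauss(0.0, 1.0)                              # U ~ N(0, 1)
    v = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(r))  # V ~ chi^2_r
    samples.append(u / math.sqrt(v / r))                    # W = U / sqrt(V/r)

# A t_r variable has mean 0 and variance r/(r - 2) for r > 2.
var = sum(w * w for w in samples) / N
print(var)  # should be close to r/(r - 2) = 1.5
```

Here the χ²_r draw is built directly from its definition as a sum of r squared independent standard normals, mirroring the definitions of U and V in part (i).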
A5. (i) When σ is known to take the value σ₀, from notes √n(X̄ − µ)/σ₀ ∼ N(0, 1). Thus the end points of a 100(1 − α)% confidence interval are c_L = x̄ − z_{α/2} σ₀/√n and c_U = x̄ + z_{α/2} σ₀/√n, where z_α is such that P(Z > z_α) = α when Z ∼ N(0, 1). Here n = 10, α/2 = 0.05, z₀.₀₅ = 1.645, x̄ = 5 and σ₀ = 3, so c_L = 5 − 1.645 × 3/√10 = 5 − 1.56 = 3.44 and c_U = 5 + 1.645 × 3/√10 = 5 + 1.56 = 6.56.

(ii) (a) The interval would become shorter, since 1/√n decreases as n increases; intuitively, more observations give more information, so a smaller interval is needed for the same confidence level. (b) It would become larger, since z_{α/2} increases as α decreases; intuitively, more confidence requires a larger interval. (c) It might increase, since t_{α/2;n−1} > z_{α/2}; intuitively, more uncertainty over σ means less information, so a larger interval is needed. However, in this case the end points are also affected by the value of the estimate of σ: if this were smaller than σ₀, it would tend to decrease the interval length.
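The interval arithmetic in A5(i) is easy to reproduce; this sketch simply re-does the calculation with the values quoted in the solution (n = 10, x̄ = 5, σ₀ = 3, z₀.₀₅ = 1.645):

```python
import math

# Values from A5(i): n = 10, xbar = 5, sigma_0 = 3, alpha/2 = 0.05.
n, xbar, sigma0, z = 10, 5.0, 3.0, 1.645

half_width = z * sigma0 / math.sqrt(n)   # z_{alpha/2} * sigma_0 / sqrt(n)
c_L = xbar - half_width
c_U = xbar + half_width
print(round(c_L, 2), round(c_U, 2))  # 3.44 6.56
```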
B1. (a) For observations from the given Pareto distribution, f(x; θ) = θ/(1 + x)^(θ+1), so log f(x; θ) = log(θ) − (θ + 1) log(1 + x) and (∂/∂θ) log f(x; θ) = 1/θ − log(1 + x). Hence (∂/∂θ) Σ log f(xᵢ; θ) = [1/θ − log(1 + x₁)] + ... + [1/θ − log(1 + xₙ)] = n/θ − Σ log(1 + xᵢ), so the likelihood equation satisfied by the maximum likelihood estimate θ̂ is n/θ̂ − Σ log(1 + xᵢ) = 0, giving θ̂ = n/Σ log(1 + xᵢ).

(b) bias(θ̂) = E(θ̂ − θ), mse(θ̂) = E[(θ̂ − θ)²], and (from notes) mse(θ̂) = Var(θ̂) + bias(θ̂)². Here E(θ̂) = nθ/(n − 1), so bias(θ̂) = E(θ̂) − θ = nθ/(n − 1) − θ = θ/(n − 1).

(c) You are given that (1 + τ)^θ = 2, so τ = τ(θ) = 2^(1/θ) − 1. The maximum likelihood estimate has the invariance property τ̂ = τ(θ̂). Thus τ̂ = 2^(1/θ̂) − 1 = 2^(Σ log(1+xᵢ)/n) − 1.

(d) Despite the outliers in the boxplot for the sample median, both boxplots appear roughly symmetric about the median, possibly with some positive skew. This would imply that, in the distribution of each estimator, the median is approximately equal to, or slightly less than, the mean. The plot indicates that for each distribution the median is close to 1, so for each distribution the mean is also approximately 1, and hence both estimators are approximately unbiased for τ = 1 (though the plot seems to indicate that the median/mean of the distribution of τ̂_mle is slightly less than 1, indicating a slight negative bias). However, the plot indicates that estimates based on the sample median have substantially greater variability about their median than the maximum likelihood estimates, with the hinges and whisker ends in the sample median boxplot noticeably further away from the median than in the mle plot. Since the median and the mean are not that dissimilar here, this indicates that the sample median has greater variability about its mean. Since both estimators are approximately unbiased, this indicates that the sample median has greater variability about τ, and hence greater mean square error than the mle, as an estimator of τ.
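A short numerical sketch of the B1 estimates, using a made-up sample (the exam data are not reproduced in this report): it computes θ̂ = n/Σ log(1 + xᵢ) and, by invariance, τ̂ = 2^(1/θ̂) − 1:

```python
import math

# Made-up sample for illustration; not the exam data.
x = [0.3, 1.1, 0.7, 2.4, 0.2, 1.8, 0.5, 0.9]
n = len(x)

s = sum(math.log(1.0 + xi) for xi in x)
theta_hat = n / s                          # MLE: theta_hat = n / sum(log(1 + x_i))
tau_hat = 2.0 ** (1.0 / theta_hat) - 1.0   # invariance: tau_hat = 2^(1/theta_hat) - 1

def loglik(theta):
    # log-likelihood: n*log(theta) - (theta + 1) * sum(log(1 + x_i))
    return n * math.log(theta) - (theta + 1.0) * s

print(theta_hat, tau_hat)
```

A quick check that loglik(theta_hat) beats the log-likelihood at a few other θ values confirms that the stationary point really is the maximiser.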
B2. (a) (i) A Type I error occurs if we reject H₀ when it is in fact true; (ii) a Type II error occurs if we accept H₀ when it is in fact false; (iii) the significance level of the test is P(Type I error) = P(Reject H₀ | H₀ true); (iv) the power is 1 − P(Type II error) = P(Reject H₀ | H₁ true) = 1 − P(Accept H₀ | H₁ true).

(b) A student's body fat percentage both before and after the course is likely to depend on the student as well as on the effect of the course: some students may have naturally high levels of body fat, others naturally lower levels. However, the difference in the percentage of body fat for each student may well be independent of the differences for the other subjects. Thus it is sensible to use a paired t-test based on the differences dᵢ = xᵢ − yᵢ.

Model assumptions: Let Xᵢ denote the percentage of body fat before the course for the ith student, let Yᵢ denote the percentage of body fat after the course for the same student, and let Dᵢ = Xᵢ − Yᵢ. Assume D₁, ..., D₁₀ are a simple random sample from the N(µ, σ²) distribution, where µ and σ² are unknown.

Hypotheses: H₀: µ = 0 versus H_A: µ > 0. H₀ corresponds to a zero mean difference between the before and after body fat percentages; H_A corresponds to the percentages after the course being systematically smaller than before the course, so that the mean difference is positive.

Test statistic: Let T = √n D̄/σ̂_D, where here n = 10 and σ̂²_D = S²_D = Σ(Dᵢ − D̄)²/(10 − 1), so T has the t₉ distribution when H₀ is true. For the given data, the dᵢ values have mean d̄ = 1.7 and variance s²_D, giving the observed test statistic t_obs = √10 d̄/s_D.

p-value: For H_A: µ > 0, the values of T which are at least as extreme as t_obs are the set {T ≥ t_obs}. Thus the p-value = P(T ≥ t_obs | H₀ true) = P(t₉ ≥ t_obs). The values of P(t₉ ≥ 2.7) and P(t₉ ≥ 2.8) can be read from tables, and linear interpolation between them gives the p-value.

[Not required, but the following may replace the computation of the p-value.]
Critical region: The critical region of values for which an α-level test would reject H₀ has the form C = {T ≥ c₁}, where c₁ is defined by α = P(Reject H₀ | H₀ true) = P(T ≥ c₁ | H₀ true) = P(t₉ ≥ c₁). Thus, for α = 0.05, c₁ = t₉;₀.₀₅ = 1.833, giving C = {T ≥ 1.833}.

Conclusions: The p-value is very small, so there is very strong evidence that H₀ is not true. [Or: the observed test statistic value t_obs falls well within the critical region of the 0.05-level test.] Thus we would reject H₀ in favour of H_A, and conclude that the mean body fat percentages after taking the course are indeed smaller than those before taking the course.

(c) Under H₁, the power of the test is 1 − P(Type II error) = 1 − P(Accept H₀ | H₁ true) = P(Reject H₀ | H₁ true) = P(T ≥ 1.645 | H₁ true) = 1 − P(T < 1.645 | H₁ true). But when H₁: µ = 1 is true, Dᵢ ∼ N(1, 4), so D̄ ∼ N(1, 4/10), so √10(D̄ − 1)/2 ∼ N(0, 1), and T < 1.645 ⟺ √10 D̄/2 < 1.645 ⟺ D̄ < 1.0404 ⟺ √10(D̄ − 1)/2 < 0.0638. So Power = 1 − P(√10(D̄ − 1)/2 < 0.0638 | H₁ true) = 1 − P(Z < 0.0638) [where Z ∼ N(0, 1)] = 1 − Φ(0.0638) = 1 − 0.5254 = 0.4746.
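The power calculation in B2(c) can be replayed numerically with the standard normal CDF; the numbers (σ = 2 and µ = 1 under H₁, one-sided critical value 1.645) are those used in the solution:

```python
import math

def Phi(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n, sigma, mu1 = 10, 2.0, 1.0   # under H1: D_i ~ N(1, 4), so sigma = 2
crit = 1.645                   # one-sided 0.05-level critical value

# Reject H0 when sqrt(n)*Dbar/sigma >= crit, i.e. Dbar >= threshold.
threshold = crit * sigma / math.sqrt(n)
z = math.sqrt(n) * (threshold - mu1) / sigma   # standardise Dbar under H1
power = 1.0 - Phi(z)
print(round(power, 3))  # about 0.475
```

The key step, as in the solution, is rewriting the rejection event {T ≥ 1.645} as an event about D̄ and then standardising D̄ under H₁ rather than under H₀.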
B3. (a) The least squares estimates minimise the sum of squares S(α, β) = Σ(yᵢ − α − βxᵢ)², so they satisfy the equations obtained by setting ∂S/∂α = 0 and ∂S/∂β = 0. The first equation gives −2Σ(yᵢ − α̂ − β̂xᵢ) = 0, i.e. ȳ − α̂ − β̂x̄ = 0, so α̂ = ȳ − β̂x̄. The second equation gives −2Σ(yᵢxᵢ − α̂xᵢ − β̂xᵢ²) = 0, i.e. Σyᵢxᵢ − α̂nx̄ − β̂Σxᵢ² = 0. Substituting in for α̂ gives Σyᵢxᵢ − (ȳ − β̂x̄)nx̄ − β̂Σxᵢ² = 0, and rearranging gives β̂(Σxᵢ² − nx̄²) = Σyᵢxᵢ − nȳx̄, so β̂ = (Σyᵢxᵢ − nȳx̄)/(Σxᵢ² − nx̄²), or equivalently β̂ = (Σyᵢxᵢ − ΣyᵢΣxᵢ/n)/(Σxᵢ² − (Σxᵢ)²/n).

Here Σxᵢ = 30, Σyᵢ = 73.62, Σxᵢ² = 110 and Σyᵢxᵢ = 254.30, giving β̂ = (254.30 − 30 × 73.62/10)/(110 − 30²/10) = 33.44/20 = 1.672 and α̂ = Σyᵢ/n − β̂ Σxᵢ/n = 7.362 − 1.672 × 3 = 2.346.

The definitions of xᵢ and yᵢ were inadvertently swapped at some point when the question was being revised, so times became distances and vice versa. The intended answer was that the estimated time y to reach an address a distance x from the base is y = α̂ + β̂x, so the estimated time taken to reach an address 2.5 miles from the base is α̂ + 2.5β̂ = 6.526. However, as printed, x denoted the time and y = 2.5 the distance, so the estimated time became x = (y − α̂)/β̂ = 0.092. Both answers were awarded full marks.

(b) From the R output we can read off the estimate of σ and the corresponding value of s_α̂ = 0.367. Alternatively, σ̂² = [Σyᵢ² − nȳ² − (Σxᵢyᵢ − nx̄ȳ)²/(Σxᵢ² − nx̄²)]/(n − 2), giving s²_α̂ = σ̂²(1/n + x̄²/Σ(xᵢ − x̄)²) and s_α̂ = 0.367. You are given that (α̂ − α)/s_α̂ has the t-distribution with n − 2 degrees of freedom, so the end points (c_L, c_U) of a 100(1 − γ)% confidence interval for α are given by c_L = α̂ − t_{n−2;γ/2} s_α̂ and c_U = α̂ + t_{n−2;γ/2} s_α̂. For a 95% confidence interval, γ/2 = 0.025 and from tables t₈;₀.₀₂₅ = 2.306. Thus the 95% confidence interval for α has end points c_L = 2.346 − 2.306 × 0.367 = 1.4996 and c_U = 2.346 + 2.306 × 0.367 = 3.1924.

(c) For fixed x, you are given that α̂ + β̂x can be expressed in the form α̂ + β̂x = Σᵢ₌₁ⁿ Yᵢ(1/n + bᵢ(x − x̄)), where bᵢ = (xᵢ − x̄)/s_xx and s_xx = Σ(xᵢ − x̄)². Now
Σbᵢ = Σ(xᵢ − x̄)/s_xx = 0/s_xx = 0 and Σbᵢ² = Σ(xᵢ − x̄)²/s_xx² = s_xx/s_xx² = 1/s_xx. Since the Yᵢ are independent and each has the same variance σ², we have Var(α̂ + β̂x) = Var(ΣYᵢ(1/n + bᵢ(x − x̄))) = ΣVar(Yᵢ)(1/n + bᵢ(x − x̄))² = σ² Σ(1/n + bᵢ(x − x̄))² = σ² Σ(1/n² + 2bᵢ(x − x̄)/n + bᵢ²(x − x̄)²) = σ²[n/n² + 2(x − x̄)(Σbᵢ)/n + (x − x̄)² Σbᵢ²] = σ²[1/n + (x − x̄)²/s_xx], since Σbᵢ = 0 and Σbᵢ² = 1/s_xx.
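The B3(a) estimates can be recomputed from the summary sums. One caveat: the value Σxᵢyᵢ = 254.30 below is back-calculated from the quoted estimate β̂ = 1.672 rather than read directly from the paper, so treat it as an assumption:

```python
# Summary sums from B3(a). Note: sum_xy = 254.30 is back-calculated from the
# quoted estimate beta_hat = 1.672, not read from the paper, so it is an
# assumption.
n = 10
sum_x, sum_y, sum_x2, sum_xy = 30.0, 73.62, 110.0, 254.30

xbar, ybar = sum_x / n, sum_y / n
beta_hat = (sum_xy - n * xbar * ybar) / (sum_x2 - n * xbar ** 2)
alpha_hat = ybar - beta_hat * xbar
print(round(beta_hat, 3), round(alpha_hat, 3))  # 1.672 2.346
```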
7.1 Basic Properties of Confidence Intervals What s Missing in a Point Just a single estimate What we need: how reliable it is Estimate? No idea how reliable this estimate is some measure of the variability
More informationLecture 10: Generalized likelihood ratio test
Stat 200: Introduction to Statistical Inference Autumn 2018/19 Lecture 10: Generalized likelihood ratio test Lecturer: Art B. Owen October 25 Disclaimer: These notes have not been subjected to the usual
More informationSTAT420 Midterm Exam. University of Illinois Urbana-Champaign October 19 (Friday), :00 4:15p. SOLUTIONS (Yellow)
STAT40 Midterm Exam University of Illinois Urbana-Champaign October 19 (Friday), 018 3:00 4:15p SOLUTIONS (Yellow) Question 1 (15 points) (10 points) 3 (50 points) extra ( points) Total (77 points) Points
More informationWeek 9 The Central Limit Theorem and Estimation Concepts
Week 9 and Estimation Concepts Week 9 and Estimation Concepts Week 9 Objectives 1 The Law of Large Numbers and the concept of consistency of averages are introduced. The condition of existence of the population
More informationLectures on Simple Linear Regression Stat 431, Summer 2012
Lectures on Simple Linear Regression Stat 43, Summer 0 Hyunseung Kang July 6-8, 0 Last Updated: July 8, 0 :59PM Introduction Previously, we have been investigating various properties of the population
More informationMark Scheme 4766 June 2005 Statistics (4766) Qn Answer Mk Comment (i) Mean = 657/20 = 32.85 cao 2 (i) 657 2 Variance = (22839 - ) = 66.3 9 20 Standard deviation = 8.3 32.85 + 2(8.3) = 49. none of the 3
More informationRecall that in order to prove Theorem 8.8, we argued that under certain regularity conditions, the following facts are true under H 0 : 1 n
Chapter 9 Hypothesis Testing 9.1 Wald, Rao, and Likelihood Ratio Tests Suppose we wish to test H 0 : θ = θ 0 against H 1 : θ θ 0. The likelihood-based results of Chapter 8 give rise to several possible
More information1 Hypothesis testing for a single mean
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this
More informationApplied Regression. Applied Regression. Chapter 2 Simple Linear Regression. Hongcheng Li. April, 6, 2013
Applied Regression Chapter 2 Simple Linear Regression Hongcheng Li April, 6, 2013 Outline 1 Introduction of simple linear regression 2 Scatter plot 3 Simple linear regression model 4 Test of Hypothesis
More informationA Simulation Comparison Study for Estimating the Process Capability Index C pm with Asymmetric Tolerances
Available online at ijims.ms.tku.edu.tw/list.asp International Journal of Information and Management Sciences 20 (2009), 243-253 A Simulation Comparison Study for Estimating the Process Capability Index
More informationOne-Sample Numerical Data
One-Sample Numerical Data quantiles, boxplot, histogram, bootstrap confidence intervals, goodness-of-fit tests University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html
More informationMeasures of center. The mean The mean of a distribution is the arithmetic average of the observations:
Measures of center The mean The mean of a distribution is the arithmetic average of the observations: x = x 1 + + x n n n = 1 x i n i=1 The median The median is the midpoint of a distribution: the number
More informationStatistics Ph.D. Qualifying Exam: Part II November 3, 2001
Statistics Ph.D. Qualifying Exam: Part II November 3, 2001 Student Name: 1. Answer 8 out of 12 problems. Mark the problems you selected in the following table. 1 2 3 4 5 6 7 8 9 10 11 12 2. Write your
More informationUQ, Semester 1, 2017, Companion to STAT2201/CIVL2530 Exam Formulae and Tables
UQ, Semester 1, 2017, Companion to STAT2201/CIVL2530 Exam Formulae and Tables To be provided to students with STAT2201 or CIVIL-2530 (Probability and Statistics) Exam Main exam date: Tuesday, 20 June 1
More informationMATH c UNIVERSITY OF LEEDS Examination for the Module MATH1725 (May-June 2009) INTRODUCTION TO STATISTICS. Time allowed: 2 hours
01 This question paper consists of 11 printed pages, each of which is identified by the reference. Only approved basic scientific calculators may be used. Statistical tables are provided at the end of
More informationA booklet Mathematical Formulae and Statistical Tables might be needed for some questions.
Paper Reference(s) 6663/0 Edexcel GCE Core Mathematics C Advanced Subsidiary Inequalities Calculators may NOT be used for these questions. Information for Candidates A booklet Mathematical Formulae and
More information0.1 The plug-in principle for finding estimators
The Bootstrap.1 The plug-in principle for finding estimators Under a parametric model P = {P θ ;θ Θ} (or a non-parametric P = {P F ;F F}), any real-valued characteristic τ of a particular member P θ (or
More informationNormal (Gaussian) distribution The normal distribution is often relevant because of the Central Limit Theorem (CLT):
Lecture Three Normal theory null distributions Normal (Gaussian) distribution The normal distribution is often relevant because of the Central Limit Theorem (CLT): A random variable which is a sum of many
More informationTesting Hypothesis. Maura Mezzetti. Department of Economics and Finance Università Tor Vergata
Maura Department of Economics and Finance Università Tor Vergata Hypothesis Testing Outline It is a mistake to confound strangeness with mystery Sherlock Holmes A Study in Scarlet Outline 1 The Power Function
More informationINTRODUCING LINEAR REGRESSION MODELS Response or Dependent variable y
INTRODUCING LINEAR REGRESSION MODELS Response or Dependent variable y Predictor or Independent variable x Model with error: for i = 1,..., n, y i = α + βx i + ε i ε i : independent errors (sampling, measurement,
More informationTwo hours. To be supplied by the Examinations Office: Mathematical Formula Tables THE UNIVERSITY OF MANCHESTER. 21 June :45 11:45
Two hours MATH20802 To be supplied by the Examinations Office: Mathematical Formula Tables THE UNIVERSITY OF MANCHESTER STATISTICAL METHODS 21 June 2010 9:45 11:45 Answer any FOUR of the questions. University-approved
More informationM(t) = 1 t. (1 t), 6 M (0) = 20 P (95. X i 110) i=1
Math 66/566 - Midterm Solutions NOTE: These solutions are for both the 66 and 566 exam. The problems are the same until questions and 5. 1. The moment generating function of a random variable X is M(t)
More informationInterval estimation. October 3, Basic ideas CLT and CI CI for a population mean CI for a population proportion CI for a Normal mean
Interval estimation October 3, 2018 STAT 151 Class 7 Slide 1 Pandemic data Treatment outcome, X, from n = 100 patients in a pandemic: 1 = recovered and 0 = not recovered 1 1 1 0 0 0 1 1 1 0 0 1 0 1 0 0
More informationRegression #3: Properties of OLS Estimator
Regression #3: Properties of OLS Estimator Econ 671 Purdue University Justin L. Tobias (Purdue) Regression #3 1 / 20 Introduction In this lecture, we establish some desirable properties associated with
More informationBIO5312 Biostatistics Lecture 13: Maximum Likelihood Estimation
BIO5312 Biostatistics Lecture 13: Maximum Likelihood Estimation Yujin Chung November 29th, 2016 Fall 2016 Yujin Chung Lec13: MLE Fall 2016 1/24 Previous Parametric tests Mean comparisons (normality assumption)
More informationIntroduction to Statistics
MTH4106 Introduction to Statistics Notes 15 Spring 2013 Testing hypotheses about the mean Earlier, we saw how to test hypotheses about a proportion, using properties of the Binomial distribution It is
More informationSTATS 200: Introduction to Statistical Inference. Lecture 29: Course review
STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout
More informationTopic 12 Overview of Estimation
Topic 12 Overview of Estimation Classical Statistics 1 / 9 Outline Introduction Parameter Estimation Classical Statistics Densities and Likelihoods 2 / 9 Introduction In the simplest possible terms, the
More information