Tobias Cunningham
5 years ago

Solutions to Exam in December 2012

Exercise  I.1  I.2  I.3  I.4  II.1  II.2  III.1  III.2  III.3  IV.1
Question  (1)  (2)  (3)  (4)  (5)   (6)   (7)    (8)    (9)    (10)
Answer

Exercise  IV.2  IV.3  IV.4  V.1   V.2   V.3   VI.1  VI.2  VII.1  VII.2
Question  (11)  (12)  (13)  (14)  (15)  (16)  (17)  (18)  (19)   (20)
Answer

Exercise  VII.3  VII.4  VIII.1  VIII.2  IX.1  IX.2  IX.3  X.1   X.2   X.3
Question  (21)   (22)   (23)    (24)    (25)  (26)  (27)  (28)  (29)  (30)
Answer

Exercise I

On a large fully automated production plant, items are pushed to a side band at random time points, from which they are automatically fed to a control unit. The production plant is set up in such a way that the number of items sent to the control unit is on average 1.6 items per minute. Let the random variable X denote the number of items pushed to the side band in 1 minute. It is assumed that X follows a Poisson distribution.

Question I.1 (1)
The probability that more than 5 items arrive at the control unit in a given minute is:

Answer
With λ = 1.6, we find that

$P(X > 5) = 1 - P(X \le 5) = 1 - 0.9940 = 0.0060$

where the 0.9940 can be found in the Poisson table (Table 2), OR we can find the value in R:

    1-ppois(5,1.6)

Answer option 3: Approximately 0.6%

Question I.2 (2)
The probability that no more than 8 items arrive at the control unit within a 5-minute period is:

Answer
With $\lambda_{5\,minutes} = 5 \cdot 1.6 = 8$, we find that

$P(X \le 8) = 0.5925$

where the 0.5925 can be found in the Poisson table (Table 2), OR we can find the value in R:

    ppois(8,8)

Answer option 1: Approximately 59.3%

Question I.3 (3)
The operators responsible for the control unit believe that the number of items arriving for control is lower than desired. Hence, a count of the number of items arriving in periods of 10 minutes is carried out. Eight random periods of 10 minutes are registered. The following data is found:

    16  12  10  15  11  14  9  15

It can now be assumed that a normal distribution, $N(\mu, \sigma^2)$, can be used as a valid approximation of the distribution of the number of items for control during 10 minutes. We want to test the hypothesis (on level α = 0.05)

$H_0: \mu = 16$ ("Correct level")
$H_1: \mu < 16$ ("Too low level")

The result of the study becomes: (Both the conclusion and the argument must be correct)

Answer
The mean and sample standard deviation become $\bar{x} = 12.75$ and $s = 2.605$, so the t-test statistic becomes:

$t = \frac{12.75 - 16}{2.605/\sqrt{8}} = -3.53$

And the P-value becomes (as it is a left-one-tailed alternative) $P(t < -3.53)$, where t follows a t(7)-distribution. From Table 4 we can conclude that the P-value is below 0.005, as the 3.499 point (the 99.5% percentile) of the t(7)-distribution is smaller than 3.53. From R, everything, including the exact P-value, can be found as:

    x=c(16,12,10,15,11,14,9,15)
    mean(x)
    sd(x)
    t_obs=(mean(x)-16)/(sd(x)/sqrt(8))
    t_obs
    pt(t_obs,7)

Or more easily:

    x=c(16,12,10,15,11,14,9,15)
    t.test(x,mu=16,alt="less")

Answer option 5: There is a documented too low level, since the relevant P-value is clearly below 0.05

Question I.4 (4)
The management made a similar investigation but based it on 10 periods of 5 minutes, and got the following counts:

    8  7  5  10  8  7  7  8  9  8

They wish to obtain a 90% confidence interval for µ, the mean of the number of items in 5 minutes, but WITHOUT using the assumption that the normal distribution is valid, and run the following in R:

    x=c(8,7,5,10,8,7,7,8,9,8)
    k = ...    # number of bootstrap samples
    my_bootstrap_samples = replicate(k, sample(x, replace = TRUE))
    my_bootstrap_means = apply(my_bootstrap_samples, 2, mean)
    quantile(my_bootstrap_means,c(0.001,0.01,0.025,0.05,0.1,0.9,0.95,0.975,0.99,0.995))

From the last line of code they obtain the following percentiles of the bootstrap distribution:

    0.1%   1%   2.5%   5%   10%   90%   95%   97.5%   99%   99.5%
    ...

The wanted confidence interval based on this becomes:

Answer
As stated in the note on simulation based statistics/bootstrapping, the method simply amounts to reading off the relevant percentiles, which for a 90% confidence interval are the 5% and 95% percentiles.

Answer option 2: [7.0; 8.3]

Exercise II

A machine for checking computer chips uses on average 65 milliseconds per check with a standard deviation of 4 milliseconds. A newer machine, potentially to be bought, uses on average 54 milliseconds per check with a standard deviation of 3 milliseconds. The check times can be assumed to be normally distributed and independent.

Question II.1 (5)
The probability that the time saving per check using the new machine is less than 10 milliseconds is:

Answer
Let $X_{old} \sim N(65, 4^2)$ and $X_{new} \sim N(54, 3^2)$. If we let U denote the time saving per check, we have that $U = X_{old} - X_{new}$, with $E(U) = 65 - 54 = 11$ and $Var(U) = 4^2 + 3^2 = 25$. We are asked to find:

$P(U < 10) = P\left(Z < \frac{10 - E(U)}{\sqrt{Var(U)}}\right) = P\left(Z < \frac{10 - (65 - 54)}{\sqrt{4^2 + 3^2}}\right) = P\left(Z < \frac{-1}{5}\right) = P(Z < -0.2) = 0.4207$

where the latter can be found from Table 3 OR in R:

    z=(10-11)/5
    z
    pnorm(z)

Alternatively in R:

    pnorm(10,mean=65-54,sd=sqrt(16+9))

Answer option 5: Approximately 42%
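As a cross-check of the calculation for Question II.1, the probability can also be approximated by simulation. This is only a minimal sketch and not part of the original solution; the seed and the number of simulated checks (`n_sim`) are arbitrary choices made for illustration.

```r
# Simulation check of P(U < 10), where U = X_old - X_new.
# Assumptions for illustration only: the seed and n_sim are arbitrary.
set.seed(2012)
n_sim <- 100000

x_old <- rnorm(n_sim, mean = 65, sd = 4)  # check times on the old machine
x_new <- rnorm(n_sim, mean = 54, sd = 3)  # check times on the new machine
u <- x_old - x_new                        # time saving per check

p_sim   <- mean(u < 10)                                     # simulated probability
p_exact <- pnorm(10, mean = 65 - 54, sd = sqrt(4^2 + 3^2))  # exact normal probability

print(c(simulated = p_sim, exact = p_exact))
```

The two numbers should agree to about two decimals; a larger `n_sim` tightens the agreement.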

Question II.2 (6)
The mean (µ) and standard deviation (σ) of the total time used for checking 100 chips on the new machine are:

Answer
Let U be the total time used for checking 100 chips on the new machine, that is:

$U = \sum_{i=1}^{100} X_i$, where $X_i \sim N(54, 3^2)$.

So we find, using basic mean and variance calculus rules (and the independence of the check times), that:

$\mu = E(U) = \sum_{i=1}^{100} E(X_i) = \sum_{i=1}^{100} 54 = 100 \cdot 54 = 5400$

and

$\sigma^2 = Var(U) = \sum_{i=1}^{100} Var(X_i) = \sum_{i=1}^{100} 9 = 100 \cdot 9 = 900$

Answer option 2: $\mu = 100 \cdot 54 = 5400$ ms and $\sigma = \sqrt{900} = 30$ ms

Exercise III

A supermarket has just opened a delicacy department and wants to make its own homemade remoulade (a Danish delicacy consisting of a certain mixture of pickles and dressing). In order to find the best recipe, a taste test was conducted. 4 different kinds of dressing and 3 different types of pickles were used in the test. Taste evaluation of the individual remoulade versions was carried out on a continuous scale from 0 to 5. The following measurement data were found:

    Pickles type      Dressing type: A    B    C    D     Row average
    I
    II
    III
    Column average

In an R-run for two-way ANOVA:

    anova(lm(taste~pickles+dressing))

the following output is obtained: (however some of the values have been substituted by the symbols A, B, C, D, E and F)

    > anova(lm(taste~pickles+dressing))
    Analysis of Variance Table

    Response: Taste
              Df  Sum Sq  Mean Sq  F value  Pr(>F)
    Pickles   A   ...     ...      E        ...
    Dressing  B   ...     ...      F        ...
    Residuals C   D       ...

Question III.1 (7)
The values of A, B, and C are:

Answer
As is clear from the general definition of the two-way ANOVA table, the degrees of freedom are $r - 1$, $c - 1$ and $(r - 1)(c - 1)$, where r = 3 is the number of rows (pickles types) and c = 4 is the number of columns (dressing types).

Answer option 3: A = 2, B = 3 and C = 6

Question III.2 (8)
The values of D, E, and F are:

Answer
E and F are the F-statistics, which are:

$F_{Pickles} = \frac{MS_{Pickles}}{MSE} = 1.55$ and $F_{Dressing} = \frac{MS_{Dressing}}{MSE}$

Actually, only one answer option has these two values. The D = SSE could be found from the total sum of squares:

$SS(total) = \sum_{i=1}^{3} \sum_{j=1}^{4} (y_{ij} - 2.23)^2$

and then:

$D = SSE = SS(total) - SS(Pickles) - SS(Dressing)$

OR more easily by using that $DF_E = (r - 1)(c - 1) = 6$, so that:

$D = SSE = 6 \cdot MSE = 0.633$

In any case, the answer is: D = 0.633, E = 1.55 and F = ...

Question III.3 (9)
With a test level of α = 5% the conclusion of the analysis becomes:

Answer
We look at the P-values in the ANOVA table, and observe that the Dressing P-value is BELOW 0.05 and the Pickles P-value is ABOVE 0.05, and hence the answer is:

Answer option 1: Only the choice of the dressing type has a significant influence on the taste

Exercise IV

For the production of brass valves, raw material (brass bars) is received from 2 different suppliers. Samples are taken from the deliveries from each of the two suppliers. The tensile strength of the items is determined, and the following results are found:

    Supplier 1: n_1 = 15, x̄_1 = 223.5 N/mm², s_1 = 7.23 N/mm²
    Supplier 2: n_2 = 20, x̄_2 = 220.4 N/mm², s_2 = 4.49 N/mm²

As a potential help, the following four R-commands:

    round(qf(c(0.001,0.01,0.025,0.05,0.1,0.9,0.95,0.975,0.99,0.995),14,19),3)
    round(qf(c(0.001,0.01,0.025,0.05,0.1,0.9,0.95,0.975,0.99,0.995),15,20),3)
    round(qf(c(0.001,0.01,0.025,0.05,0.1,0.9,0.95,0.975,0.99,0.995),19,14),3)
    round(qf(c(0.001,0.01,0.025,0.05,0.1,0.9,0.95,0.975,0.99,0.995),20,15),3)

give a number of percentiles (rounded to 3 decimals) for four different F-distributions. The results of these are shown in the R output window as follows:

    > round(qf(c(0.001,0.01,0.025,0.05,0.1,0.9,0.95,0.975,0.99,0.995),14,19),3)
    [1] ...
    > round(qf(c(0.001,0.01,0.025,0.05,0.1,0.9,0.95,0.975,0.99,0.995),15,20),3)
    [1] ...
    > round(qf(c(0.001,0.01,0.025,0.05,0.1,0.9,0.95,0.975,0.99,0.995),19,14),3)
    [1] ...
    > round(qf(c(0.001,0.01,0.025,0.05,0.1,0.9,0.95,0.975,0.99,0.995),20,15),3)
    [1] ...

Question IV.1 (10)
With a significance level of α = 5% we cannot conclude any difference between the two variances for the two suppliers, since:

Answer
The proper test statistic for comparing two variances is the larger variance divided by the smaller one:

$F = \frac{s_1^2}{s_2^2} = \frac{7.23^2}{4.49^2} = 2.59$

The proper distribution to use for the test is the F(14, 19)-distribution. As nothing points towards a one-sided test, we should apply a two-sided test, AND the critical value hence becomes

$F_{\alpha/2}(n_1 - 1, n_2 - 1) = F_{0.025}(14, 19)$

(which, by the way, cannot be found in Table 6, BUT we do not need that here). It may be found in R as:

    qf(0.975,14,19)

So the answer is (since no other options use the correct degrees of freedom):

Answer option 3: $F_{0.025}(14, 19) > 2.59$

Question IV.2 (11)
The following hypothesis is to be tested, and it is now assumed that $\sigma_1^2 = \sigma_2^2$:

$H_0: \mu_1 = \mu_2$
$H_1: \mu_1 \neq \mu_2$

With a significance level of α = 5% we cannot conclude any difference between the two means for the two suppliers, since the t statistic and P-value for this test become: (both must be correct)

Answer
The standard pooled t-test for this situation uses the pooled variance estimate:

$s_p^2 = \frac{14 \cdot 7.23^2 + 19 \cdot 4.49^2}{33} = 33.78$

And the t-statistic hence becomes:

$t = \frac{223.5 - 220.4}{\sqrt{33.78 \cdot (1/15 + 1/20)}} = 1.56$

From the t-distribution Table 4 (with ν = 33) we can observe that the P-value is larger than 0.10 (and smaller than 0.20), since 1.56 is between $t_{0.10}$ and $t_{0.05}$ and we are doing a two-sided test (so the tail probability should be multiplied by two to give the proper P-value). In R we could do it by the following:

    s1=7.23
    s2=4.49
    sp=sqrt((14*s1^2+19*s2^2)/33)
    t=(223.5-220.4)/(sp*sqrt(1/15 + 1/20))
    t
    2*(1-pt(t,33))

which gives the exact P-value. In both cases the P-value is found to be larger than 0.05, so the answer is:

Answer option 2: t = 1.56 and a P-value above 0.05
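The two supplier comparisons above (variance ratio, then pooled t-test) only need the summary statistics, so they can be wrapped in a small reusable function. The following is a sketch under that assumption; `pooled_t_from_summary` is a hypothetical helper name introduced here, not a base R function.

```r
# Pooled two-sample t-test computed from summary statistics only.
# pooled_t_from_summary is a hypothetical helper, not part of base R.
pooled_t_from_summary <- function(n1, xbar1, s1, n2, xbar2, s2) {
  df <- n1 + n2 - 2
  sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / df)  # pooled standard deviation
  t_obs <- (xbar1 - xbar2) / (sp * sqrt(1 / n1 + 1 / n2))
  p_two_sided <- 2 * pt(-abs(t_obs), df)                # two-sided P-value
  list(df = df, sp = sp, t = t_obs, p.value = p_two_sided)
}

# Supplier data from Exercise IV
res <- pooled_t_from_summary(15, 223.5, 7.23, 20, 220.4, 4.49)

# Variance-ratio check from Question IV.1 (larger variance on top)
f_obs  <- 7.23^2 / 4.49^2
f_crit <- qf(0.975, 14, 19)

print(round(c(t = res$t, p.value = res$p.value, f_obs = f_obs, f_crit = f_crit), 3))
```

Using `2 * pt(-abs(t_obs), df)` instead of `2*(1-pt(t,33))` gives the same two-sided P-value here but also works when the observed statistic is negative.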

Question IV.3 (12)
As there is no difference in the mean level for the two suppliers, the data can be joined together and a 99% confidence interval for the mean can be found as: (Some summary statistics for the joint data set: n = 35, x̄ = 221.7, s = 5.93)

Answer
We use the standard procedure for the one-sample confidence interval:

$\bar{x} \pm t_{\alpha/2}(n - 1) \cdot \frac{s}{\sqrt{n}}$

which is, since α = 0.01:

$221.7 \pm t_{0.005}(34) \cdot \frac{5.93}{\sqrt{35}}$

Question IV.4 (13)
A study of a new supplier is planned. It is expected that the standard deviation for this supplier will be approximately 6, that is σ = 6 N/mm². A 99% confidence interval for the mean in this new study is required to have a width of ±1 N/mm². How many items must be sampled to achieve this?

Answer
We use the one-sample confidence interval sample size formula:

$n = \left(\frac{z_{\alpha/2} \cdot \sigma}{E}\right)^2 = \left(\frac{2.576 \cdot 6}{1}\right)^2$

The 2.576 is found in Table 3 or in R as:

    qnorm(0.995)

Answer option 4: $\left(\frac{2.576 \cdot 6}{1}\right)^2 \approx 239$

Exercise V

When brass is used in a production, the modulus of elasticity, E, of the material is often important for the functionality. The modulus of elasticity is measured for 6 different brass alloys. 5 samples from each alloy are tested. The results are shown in the table below, where the measured modulus of elasticity is given in GPa:

    Brass alloys:  M1   M2   M3   M4   M5   M6

In an R-run for one-way analysis of variance:

    anova(lm(elasmodul~alloy))

the following output is obtained: (however some of the values have been substituted by the symbols A, B, and C)

    > anova(lm(elasmodul~alloy))
    Analysis of Variance Table

    Response: ElasModul
              Df  Sum Sq  Mean Sq  F value  Pr(>F)
    Alloy     A   ...     ...      ...      ...e-05
    Residuals B   C       ...

Question V.1 (14)
The values of A, B, and C are:

Answer
A and B are the degrees of freedom, which in the one-way ANOVA are $k - 1$ and $N - k$, where k = 6 is the number of groups and N = 30 is the number of observations. This is all that is needed to answer the question, but C could be found as:

$C = SSE = MSE \cdot DF_E$

The answer is: A = 5, B = 24 and C = ...

Question V.2 (15)
The assumptions for using the one-way analysis of variance are: (Choose the answer that most correctly lists all the assumptions and does NOT list any unnecessary assumptions)

Answer
It is difficult to give a detailed argument here, other than to emphasize that only answer 1 gives all the assumptions without adding any unnecessary ones.

Answer option 1: The data must be normally distributed within each group, independent, and the variances within each group should not differ significantly from each other

Question V.3 (16)
A 95% confidence interval for the difference between brass alloy 1 and 2 becomes:

Answer
A post-hoc 95% confidence interval between two groups in a one-way ANOVA is:

$\bar{y}_1 - \bar{y}_2 \pm t_{\alpha/2} \sqrt{MSE \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$

So we have to compute the means of the M1 and M2 groups, $\bar{y}_1 = 84.62$ and $\bar{y}_2$ (or accept that the difference can only be 3.48), and then plug in the MSE from the ANOVA table and $n_1 = n_2 = 5$. So the answer is:

$3.48 \pm t_{0.025}(24) \sqrt{MSE \cdot \frac{2}{5}}$

Exercise VI

It is a common conjecture that a student's perception of the quality of teaching in a particular discipline is related to the student's level in the subject. To investigate whether this is true, the following data were collected:

    Grade group   Course Evaluation
                  GOOD     MIDDLE   BAD
    HIGH          22.4%    7.2%     4%
    MIDDLE        18.4%    8.8%     11.2%
    LOW           11.2%    5.6%     11.2%

There are 125 students in the table above.

Question VI.1 (17)
To investigate whether the conjecture is valid, the following statistic should be calculated:

Answer
The actual 3-by-3 contingency table comes from multiplying the percentages by 125:

    o_ij          Course Evaluation
                  GOOD   MIDDLE   BAD
    HIGH          28     9        5
    MIDDLE        23     11       14
    LOW           14     7        14

Now one could compute all the nine expected values for this 3-by-3 table, for instance,

$e_{11} = \frac{(28 + 9 + 5)(28 + 23 + 14)}{125} = \frac{42 \cdot 65}{125} = 21.84$

In R these could e.g. be found as:

    X=t(matrix(125*c(22.4,7.2,4, 18.4,8.8,11.2, 11.2,5.6,11.2)/100,ncol=3))
    round(chisq.test(X)$expected,2)

But having found just one of them makes it clear that only answer 1 can provide the correct answer, as the χ²-statistic has the form:

$\chi^2 = \sum_{i=1}^{3} \sum_{j=1}^{3} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$

Answer option 1:

$\frac{(28 - 21.84)^2}{21.84} + \frac{(9 - 9.07)^2}{9.07} + \frac{(5 - 11.09)^2}{11.09} + \frac{(23 - 24.96)^2}{24.96} + \frac{(11 - 10.37)^2}{10.37} + \frac{(14 - 12.67)^2}{12.67} + \frac{(14 - 18.20)^2}{18.20} + \frac{(7 - 7.56)^2}{7.56} + \frac{(14 - 9.24)^2}{9.24}$

Question VI.2 (18)
The number of degrees of freedom (DF) and the critical value (Q) of the relevant test on a 5% level are:

Answer
This is a χ²-test for an r-by-c table where the degrees of freedom are $(r - 1)(c - 1) = 2 \cdot 2 = 4$, and $\chi^2_{0.05}(4) = 9.488$ (to be found in Table 5 with ν = 4) or in R as:

    qchisq(0.95,4)

Answer option 4: DF = 4 and Q = 9.49

Exercise VII

At a specific education it was decided to introduce a project, running through the course period, as a part of the grade point evaluation. In order to assess whether it has changed the percentage of students passing the course, the following data was collected:

                                    Before introduction   After introduction
                                    of project            of project
    Number of students evaluated    50                    24
    Number of students failed       13                    3
    Average grade point x̄          ...                   ...
    Sample standard deviation s     ...                   ...

Let $p_{Before}$ be the proportion failing the course before the introduction of the project and $p_{After}$ the corresponding proportion after the introduction of the project.

Question VII.1 (19)
If the following hypothesis is tested:

$H_0: p_{Before} = p_{After}$
$H_1: p_{Before} > p_{After}$

a valid test statistic u, the corresponding P-value and a valid conclusion become: (both values and the conclusion must be correct)

Answer
For this particular test we have been given two different (but similar) options. One would be the χ²-test for a 2-by-2 frequency/contingency table, the other a z-test giving a direct comparison of the two proportions:

$u = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})(1/50 + 1/24)}}$

where $\hat{p}_1 = 13/50$, $\hat{p}_2 = 3/24$ and $\hat{p} = 16/74$, so:

$u = \frac{13/50 - 3/24}{\sqrt{(16/74)(58/74)(1/50 + 1/24)}} = 1.32$

And the (one-tailed) P-value becomes $P(Z > 1.32) = 0.0934$ (from Table 3) or from R as:

    1-pnorm(1.32)

And this means that we must accept the null hypothesis, that is, we cannot reject it on e.g. a 5% level.

Answer option 3: u = 1.32 and P-value = 0.093. On a 5% level a drop in failing percentage cannot be documented

Question VII.2 (20)
As it is assumed that the grades are approximately normally distributed in each group, and that the variances in the two groups do not differ significantly from each other, the following hypothesis is tested:

$H_0: \mu_{Before} = \mu_{After}$
$H_1: \mu_{Before} < \mu_{After}$

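The test statistic, the P-value and the conclusion for this test become: (both values and the conclusion must be correct)

Answer
The standard pooled t-test for this situation uses the pooled variance estimate:

$s_p^2 = \frac{(50 - 1) s_{Before}^2 + (24 - 1) s_{After}^2}{50 + 24 - 2} = 2.088^2$

And the t-statistic hence becomes:

$t = \frac{\bar{x}_{Before} - \bar{x}_{After}}{2.088 \sqrt{1/50 + 1/24}} = -1.842$

From the t-distribution Table 4 (with ν = 72) we can observe that the (one-tailed) P-value is between 0.025 and 0.05, since 1.842 is between $t_{0.05}$ and $t_{0.025}$. In R we could do it by the following:

    t=(xbar_before-xbar_after)/(2.088*sqrt(1/50+1/24))   # the two average grade points from the table
    t
    pt(t,72)

Giving a P-value of 0.035.

The answer is: $t = \frac{\bar{x}_{Before} - \bar{x}_{After}}{2.088 \sqrt{1/50 + 1/24}} = -1.842$, P-value = 0.035: On a 5% level an increase in grade point average can be documented

Question VII.3 (21)
A 95% confidence interval for the grade point standard deviation after the introduction of the project becomes:

Answer
The confidence interval formula for a sample variance is used WITH the square root applied to everything:

$\sqrt{\frac{(n - 1) s^2}{\chi^2_{0.025}}} < \sigma < \sqrt{\frac{(n - 1) s^2}{\chi^2_{0.975}}}$

which, with n = 24 and s the sample standard deviation after the introduction of the project, gives:

$\sqrt{\frac{23 \cdot s_{After}^2}{\chi^2_{0.025}(23)}} < \sigma_{After} < \sqrt{\frac{23 \cdot s_{After}^2}{\chi^2_{0.975}(23)}}$

Question VII.4 (22)
The critical value for the following hypothesis test for the grade point variance before the introduction of the project, $\sigma^2_{Before}$, with null hypothesis $H_0: \sigma^2_{Before} = 2^2$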

and alternative $H_1: \sigma^2_{Before} > 2^2$, becomes on level 1% (α = 0.01):

Answer
The test for comparing a variance with a specific value is a χ²-test with ν = DF = n - 1 = 49. So the critical value for this ONE-sided hypothesis is $\chi^2_{0.01}(49)$. In R this can be found as:

    qchisq(0.99,49)

giving the value 74.92. Without using R we can use Table 5. In Table 5 the $\chi^2_{0.01}(50)$ value can be read off to be 76.154 and the $\chi^2_{0.01}(40)$ value can be read off to be 63.691, so by linear interpolation the only possible answer is: 74.92

Exercise VIII

A company manufactures an electronic device to be used in a very wide temperature range. The company knows that increased temperature shortens the life time of the device, and a study is therefore performed in which the life time is determined as a function of temperature. The following data is found:

    Temperature in Celsius (t)   10    20    30    40    50    60    70   80   90
    Life time in hours (y)       420   365   285   220   176   117   69   34   5

The following is run in R:

    t=c(10,20,30,40,50,60,70,80,90)
    y=c(420,365,285,220,176,117,69,34,5)
    summary(lm(y~t))

with the results:

    Call:
    lm(formula = y ~ t)

    Residuals:
    Min  1Q  Median  3Q  Max
    ...

    Coefficients:

                Estimate  Std. Error  t value  Pr(>|t|)
    (Intercept) ...       ...         ...      ...e-09 ***
    t           ...       ...         ...      ...e-07 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: ... on 7 degrees of freedom
    Multiple R-squared: 0.984, Adjusted R-squared: ...
    F-statistic: ... on 1 and 7 DF, p-value: 1.505e-07

Question VIII.1 (23)
A 95% confidence interval for the slope in the regression model underlying the R-run above, which expresses the life time as a linear function of the temperature, becomes:

Answer
Either one could do all the regression computations to find b = -5.313 and then subsequently use the formula for the confidence interval for β:

$b \pm t_{\alpha/2} \cdot s_e \sqrt{\frac{1}{S_{xx}}}$

Or one could use the information in the R-output, where what is known as the standard error for the slope can be read off directly as:

$s_e \sqrt{\frac{1}{S_{xx}}} = 0.2558$

And $t_{0.025}(7) = 2.365$ from Table 4 or in R:

    qt(.975,7)

The answer is: $-5.313 \pm 2.365 \cdot 0.2558$

Question VIII.2 (24)
Can a relation between temperature and life time be documented on level 5%? (Both the conclusion and the argument must be correct)

Answer
We look at the P-value in the slope row of the output: 1.51e-07 = 0.000000151.

The answer is: Yes, as the relevant P-value is 0.000000151, which is clearly smaller than 0.05
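The slope interval from Question VIII.1 can be reproduced directly from the data. The sketch below computes it both from the formula and with `confint()`; the variable names `temp` and `life` are chosen here for illustration (the R-run above uses `t` and `y`).

```r
# Life time data from Exercise VIII
temp <- c(10, 20, 30, 40, 50, 60, 70, 80, 90)
life <- c(420, 365, 285, 220, 176, 117, 69, 34, 5)

fit <- lm(life ~ temp)

# Slope estimate and its standard error, read from the coefficient table
b    <- unname(coef(fit)["temp"])
se_b <- summary(fit)$coefficients["temp", "Std. Error"]

# Manual interval: b +/- t_{0.025}(n - 2) * SE(b)
t_crit    <- qt(0.975, df.residual(fit))
ci_manual <- b + c(-1, 1) * t_crit * se_b

# Same interval from the built-in helper
ci_builtin <- confint(fit, "temp", level = 0.95)

print(ci_manual)
print(ci_builtin)
```

Both approaches should print the same two limits, which gives a quick consistency check on the hand calculation.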

Exercise IX

In a consumer study, it is shown that in many supermarkets there are discrepancies between the receipt and the price on the shelf. The manager of a supermarket wants to keep track of the error percentage, and therefore introduces the following check: During the day, 40 different items are sampled at random, and for these items it is checked whether the receipt and the price on the shelf match. The manager defines the situation as "in control" if no more than 1 mislabeled item is found among the 40 items. The probability that the situation is found to be under control, if the real percentage of mislabeled items is 1%, is called A. The probability that the situation is found to be under control, if the real percentage of mislabeled items is 10%, is called B.

Question IX.1 (25)
The values of A and B are:

Answer
For A we use the binomial distribution with p = 0.01 and n = 40:

$A = P(X \le 1), \quad X \sim b(x; 40, 0.01)$

For this p = 0.01 we cannot use Table 1, so we have to use hand calculation of the two binomial probabilities (or R):

$A = P(X = 0) + P(X = 1) = 0.99^{40} + 40 \cdot 0.01 \cdot 0.99^{39} = 0.6690 + 0.2703 = 0.9393$

For B we use the binomial distribution with p = 0.10 and n = 40:

$B = P(X \le 1), \quad X \sim b(x; 40, 0.1)$

For this combination of p = 0.1 and n = 40 we cannot use Table 1 either, so we again use hand calculation of the two binomial probabilities (or R):

$B = P(X = 0) + P(X = 1) = 0.90^{40} + 40 \cdot 0.1 \cdot 0.90^{39} = 0.0148 + 0.0657 = 0.0805$

In R, A could be found by any of the following three computations:

    pbinom(1,40,0.01)
    dbinom(0,40,0.01)+dbinom(1,40,0.01)
    0.99^40+40*0.01*0.99^39

In R, B could be found by any of the following three computations:

    pbinom(1,40,0.10)
    dbinom(0,40,0.10)+dbinom(1,40,0.10)
    0.90^40+40*0.1*0.90^39

Answer option 1: A = 0.939 and B = 0.080

Question IX.2 (26)
As an additional check, on a given day altogether 120 different items were checked, and out of these 15 mislabeled items were observed. A 95% confidence interval for the proportion of mislabeled items becomes:

Answer
The standard one-proportion confidence interval is used (and this is OK as both np and n(1 - p) are at least 15):

$\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$

which becomes:

$\frac{15}{120} \pm 1.96 \sqrt{\frac{(15/120)(105/120)}{120}}$

Question IX.3 (27)
We now wish to determine the proportion of mislabeled items in a particular store with a precision such that a 90% confidence interval will be of the width plus/minus 0.02. It is expected the proportion will be in the order of ... How many items should approximately be checked in order to achieve such precision?

Answer
We use the sample size formula for the proportion confidence interval using a guess p of the true value:

$n = p(1 - p) \left(\frac{z_{\alpha/2}}{E}\right)^2 = p(1 - p) \cdot (1.645/0.02)^2$

as α = 0.10 and $z_{0.05} = 1.645$ from Table 3 or:

    qnorm(0.95)

The answer is: $p(1 - p) \cdot (1.645/0.02)^2$ items, with p the expected proportion
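The interval in Question IX.2 and the sample size formula in Question IX.3 can be evaluated as in the following sketch. `sample_size_prop` is a hypothetical helper name introduced here, and since the expected proportion is not legible above, the example call uses `p_guess = 0.5` (the most conservative guess) purely as an illustrative assumption.

```r
# One-proportion 95% confidence interval (Question IX.2)
x <- 15
n <- 120
p_hat <- x / n
z <- qnorm(0.975)
ci <- p_hat + c(-1, 1) * z * sqrt(p_hat * (1 - p_hat) / n)
print(round(ci, 4))

# Sample size for a given margin of error (Question IX.3).
# sample_size_prop is a hypothetical helper, not a base R function.
sample_size_prop <- function(p_guess, margin, conf_level) {
  z <- qnorm(1 - (1 - conf_level) / 2)
  ceiling(p_guess * (1 - p_guess) * (z / margin)^2)  # round up to whole items
}

# Assumption for illustration only: p_guess = 0.5 (worst case), not the exam's value
n_needed <- sample_size_prop(p_guess = 0.5, margin = 0.02, conf_level = 0.90)
print(n_needed)
```

Because p(1 - p) is largest at p = 0.5, any other guess of the proportion gives a smaller required sample size than this call.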

Exercise X

The yield Y of a chemical process is a random variable whose value is considered to be a linear function of the temperature X. The following data of corresponding values of x and y is found:

    Temperature in Celsius (x)   0    25   50   75   100
    Yield in grams (y)           14   38   54   76   95

The average and standard deviation of Temperature (x) and Yield (y) are: $\bar{x} = 50$, $s_x = 39.53$, $\bar{y} = 55.4$, $s_y = 31.67$, and further it can be used that $S_{xy} = 5000$. In the exercise the usual linear regression model is used:

$Y_i = \alpha + \beta x_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2), \quad i = 1, \ldots, 5$

Question X.1 (28)
Can a significant relationship between yield and temperature be documented? (Both the conclusion and the argument must be correct)

Answer
It could most easily be solved by running the regression in R as:

    x=c(0,25,50,75,100)
    y=c(14,38,54,76,95)
    summary(lm(y~x))

with the results:

    Call:
    lm(formula = y ~ x)

    Residuals:
    ...

    Coefficients:
                Estimate  Std. Error  t value  Pr(>|t|)
    (Intercept) ...       ...         ...      ... **
    x           ...       ...         ...      ...e-05 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: ... on 3 degrees of freedom
    Multiple R-squared: ..., Adjusted R-squared: ...
    F-statistic: 1071 on 1 and 3 DF, p-value: 6.267e-05

Alternatively one could use hand calculations and the formula for the t-test of the hypothesis $H_0: \beta = 0$. The relevant test statistic and P-value can be read off in the R-output as 32.73 and 6.27e-05.

The answer is: Yes, as the relevant test statistic and P-value are 32.73 and 6.27e-05, respectively

Question X.2 (29)
Give the 95% confidence interval of the expected yield at a temperature of $x_0 = 80$ degrees Celsius:

Answer
We use the formula for the confidence limits of a point on the line:

$(a + b x_0) \pm t_{\alpha/2} \cdot s_e \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$

And we have to compute a, b and $s_e$ either by hand OR by R as above: a = 15.4, b = 0.8, $s_e = 1.932$. So the confidence interval becomes

$(15.4 + 0.8 \cdot 80) \pm 3.182 \cdot 1.932 \sqrt{\frac{1}{5} + \frac{(80 - 50)^2}{6250}}$

since $S_{xx} = (n - 1) s_x^2 = 4 \cdot 39.53^2 = 6250$ and $t_{0.025}(3) = 3.182$.

The answer is: $79.4 \pm 3.61$

Question X.3 (30)
The five residuals become: -1.4, 2.6, -1.4, 0.6 and -0.4. What is the upper quartile of the residuals?

Answer
We use the basic definition of finding a percentile (from Chapter 2) with n = 5 and p = 0.75, so np = 3.75. As this is not an integer, we round up, so the upper quartile is the 4th observation in the ordered sequence -1.4, -1.4, -0.4, 0.6, 2.6, that is 0.6.
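The answers to Questions X.2 and X.3 can be checked in R with `predict()` and `quantile()`. This is a sketch rather than part of the original solution; the choice `type = 2` is an assumption made here so that `quantile()` follows the "np" rule used above (R's default is `type = 7`).

```r
# Yield data from Exercise X
x <- c(0, 25, 50, 75, 100)
y <- c(14, 38, 54, 76, 95)

fit <- lm(y ~ x)

# 95% confidence interval for the expected yield at x0 = 80
ci_mean <- predict(fit, newdata = data.frame(x = 80),
                   interval = "confidence", level = 0.95)
print(ci_mean)

# Residuals and their upper quartile.
# Assumption: type = 2 corresponds to the percentile definition used in the text.
res <- residuals(fit)
q3  <- quantile(res, probs = 0.75, type = 2)
print(round(res, 1))
print(q3)
```

The columns `lwr` and `upr` of `ci_mean` are the interval limits, so `upr - fit` is the half-width that appears after the ± sign in the answer.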
