
Solutions to Exam in 02402, December 2012

    Exercise  I.1  I.2  I.3  I.4  II.1  II.2  III.1  III.2  III.3  IV.1
    Question  (1)  (2)  (3)  (4)  (5)   (6)   (7)    (8)    (9)    (10)
    Answer    3    1    5    2    5     2     3      5      1      3

    Exercise  IV.2  IV.3  IV.4  V.1   V.2   V.3   VI.1  VI.2  VII.1  VII.2
    Question  (11)  (12)  (13)  (14)  (15)  (16)  (17)  (18)  (19)   (20)
    Answer    2     4     4     5     1     2     1     4     3      5

    Exercise  VII.3  VII.4  VIII.1  VIII.2  IX.1  IX.2  IX.3  X.1   X.2   X.3
    Question  (21)   (22)   (23)    (24)    (25)  (26)  (27)  (28)  (29)  (30)
    Answer    1      4      3       5       1     2     1     4     4     5

Exercise I

In a large, fully automated production plant, items are pushed to a side band at random time points, from which they are automatically fed to a control unit. The production plant is set up in such a way that, on average, 1.6 items per minute are sent to the control unit. Let the random variable X denote the number of items pushed to the side band in 1 minute. It is assumed that X follows a Poisson distribution.

Question I.1 (1)
The probability that more than 5 items arrive at the control unit in a given minute is:

Answer
With $\lambda = 1.6$, we find that

$$P(X > 5) = 1 - P(X \le 5) = 1 - 0.994 = 0.006$$

where the 0.994 can be found in the Poisson table (Table 2), or we can find the value in R:

    1-ppois(5,1.6)

Answer option 3: Approximately 0.6%

Question I.2 (2)
The probability that no more than 8 items arrive at the control unit within a 5-minute period is:

Answer
With $\lambda_{5\,\text{minutes}} = 5 \cdot 1.6 = 8$, we find that

$$P(X \le 8) = 0.593$$

where the 0.593 can be found in the Poisson table (Table 2), or we can find the value in R:

    ppois(8,8)

Answer option 1: Approximately 59.3%

Question I.3 (3)
The operators responsible for the control unit believe that the number of items arriving for control is lower than desired. Hence, the number of items arriving in periods of 10 minutes is counted. Eight random periods of 10 minutes are registered, and the following data are found:

    16  12  10  15  11  14  9  15

It can now be assumed that a normal distribution, $N(\mu, \sigma^2)$, is a valid approximation of the distribution of the number of items for control during 10 minutes. We want to test the hypothesis (on level $\alpha = 0.05$)

    $H_0: \mu = 16$  (Correct level)
    $H_1: \mu < 16$  (Too low level)

The result of the study becomes: (Both the conclusion and the argument must be correct)

Answer
The mean and sample standard deviation become $\bar{x} = 12.75$ and $s = 2.60494$, so the t-test statistic becomes:

$$t = \frac{12.75 - 16}{2.60494/\sqrt{8}} = -3.53$$

And the P-value becomes (as it is a left one-tailed alternative):

$$P(t < -3.53)$$

where t follows a t(7)-distribution. From Table 4 we can conclude that the P-value is below 0.005, as the 0.005-point (= 99.5% percentile) of the t(7)-distribution is 3.499. In R, everything, including the exact P-value, can be found as:

    x=c(16,12,10,15,11,14,9,15)
    mean(x)
    sd(x)
    t_obs=(12.75-16)/(sd(x)/sqrt(8))
    t_obs
    pt(t_obs,7)

Or more easily:

    x=c(16,12,10,15,11,14,9,15)
    t.test(x,mu=16,alt="less")

Answer option 5: There is a documented too low level, since the relevant P-value is clearly below 0.05.

Question I.4 (4)
The management made a similar investigation but based it on 10 periods of 5 minutes, and got the following counts:

    8  7  5  10  8  7  7  8  9  8

They wish to obtain a 90% confidence interval for $\mu$, the mean number of items in 5 minutes, but WITHOUT using the assumption that the normal distribution is valid, and run the following in R:

    x=c(8,7,5,10,8,7,7,8,9,8)
    k = 10000
    my_bootstrap_samples = replicate(k, sample(x, replace = TRUE))
    my_bootstrap_means = apply(my_bootstrap_samples, 2, mean)
    quantile(my_bootstrap_means,c(0.001,0.01,0.025,0.05,0.1,0.9,0.95,0.975,0.99,0.995))

From the last line of code they obtain the following percentiles of the bootstrap distribution:

    0.1%    1%  2.5%    5%   10%   90%   95%  97.5%   99%  99.5%
     6.4   6.7   6.9   7.0   7.2   8.2   8.3    8.5   8.6    8.7

The wanted confidence interval based on this becomes:

Answer
As stated in the note on simulation-based statistics/bootstrapping, the method simply amounts to reading off the relevant percentiles, which for a 90% confidence interval are the 5% and 95% percentiles.

Answer option 2: [7.0; 8.3]

Exercise II

A machine for checking computer chips uses on average 65 milliseconds per check with a standard deviation of 4 milliseconds. A newer machine, potentially to be bought, uses on average 54 milliseconds per check with a standard deviation of 3 milliseconds. The check times can be assumed to be normally distributed and independent.

Question II.1 (5)
The probability that the time saving per check using the new machine is less than 10 milliseconds is:

Answer
Let $X_{old} \sim N(65, 4^2)$ and $X_{new} \sim N(54, 3^2)$. If we let U denote the time saving per check, we have that $U = X_{old} - X_{new}$. We are asked to find:

$$P(U < 10) = P\left(Z < \frac{10 - E(U)}{\sqrt{Var(U)}}\right) = P\left(Z < \frac{10 - (65 - 54)}{\sqrt{3^2 + 4^2}}\right) = P\left(Z < \frac{-1}{5}\right) = P(Z < -0.2) = 0.4207$$

where the latter can be found from Table 3 or in R:

    z=(10-11)/5
    z
    pnorm(z)

Alternatively in R:

    pnorm(10,mean=65-54,sd=sqrt(16+9))

Answer option 5: Approximately 42%
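The exact normal calculation in Question II.1 can also be checked by simulation, in the same spirit as the bootstrap in Question I.4. The following is only an illustrative sketch and not part of the original solution: the seed, the number of simulated checks `k`, and all variable names are assumptions made here.

```r
# Simulation check of P(U < 10) for U = X_old - X_new (illustrative sketch).
set.seed(1)                            # arbitrary seed, assumed for reproducibility
k <- 1e6                               # number of simulated checks (assumed)
x_old <- rnorm(k, mean = 65, sd = 4)   # check times on the old machine
x_new <- rnorm(k, mean = 54, sd = 3)   # check times on the new machine
u <- x_old - x_new                     # time saving per check
p_sim <- mean(u < 10)                  # simulated probability
p_exact <- pnorm(10, mean = 65 - 54, sd = sqrt(16 + 9))  # exact value used in the solution
c(simulated = p_sim, exact = p_exact)
```

The simulated proportion should agree with the exact value 0.4207 up to Monte Carlo error, which shrinks as `k` grows.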

Question II.2 (6)
The mean ($\mu$) and standard deviation ($\sigma$) of the total time used for checking 100 chips on the new machine are:

Answer
Let U be the total time used for checking 100 chips on the new machine, that is:

$$U = \sum_{i=1}^{100} X_i$$

where $X_i \sim N(54, 3^2)$. So we find, using the basic rules for means and variances, that:

$$\mu = E(U) = \sum_{i=1}^{100} E(X_i) = \sum_{i=1}^{100} 54 = 100 \cdot 54 = 5400$$

and

$$\sigma^2 = Var(U) = \sum_{i=1}^{100} Var(X_i) = \sum_{i=1}^{100} 9 = 100 \cdot 9$$

Answer option 2: $\mu = 100 \cdot 54 = 5400$ ms and $\sigma = 3 \cdot \sqrt{100} = 30$ ms

Exercise III

A supermarket has just opened a delicacy department and wants to make its own homemade remoulade (a Danish delicacy consisting of a certain mixture of pickles and dressing). In order to find the best recipe, a taste test was conducted. 4 different kinds of dressing and 3 different types of pickles were used in the test. Taste evaluation of the individual remoulade versions was carried out on a continuous scale from 0 to 5. The following measurement data were found:

                      Dressing type               Row
    Pickles type      A      B      C      D      average
    I                 4.0    3.0    3.8    2.4    3.30
    II                4.3    3.1    3.3    1.9    3.15
    III               3.9    2.3    3.0    2.4    2.90
    Column average    4.06   2.80   3.36   2.23

In an R run for two-way ANOVA:

    anova(lm(taste~pickles+dressing))

the following output is obtained (however, some of the values have been substituted by the symbols A, B, C, D, E and F):

    > anova(lm(taste~pickles+dressing))
    Analysis of Variance Table

    Response: Taste
              Df Sum Sq Mean Sq F value   Pr(>F)
    Pickles    A 0.3267 0.16333       E 0.287133
    Dressing   B 5.5367 1.84556       F 0.002273
    Residuals  C      D 0.10556

Question III.1 (7)
The values of A, B, and C are:

Answer
As is clear from the general definition of the two-way ANOVA table, the degrees of freedom are $r - 1$, $c - 1$ and $(r - 1)(c - 1)$, where $r = 3$ is the number of rows and $c = 4$ is the number of columns.

Answer option 3: A = 2, B = 3 and C = 6

Question III.2 (8)
The values of D, E, and F are:

Answer
E and F are the F-statistics, which are:

$$F_{Pickles} = \frac{MS_{Pickles}}{MSE} = \frac{0.16333}{0.10556} = 1.547$$

$$F_{Dressing} = \frac{MS_{Dressing}}{MSE} = \frac{1.84556}{0.10556} = 17.48$$

Actually, only one answer option has these two values. D = SSE could be found from the total sum of squares:

$$SS(total) = \sum_{i=1}^{3} \sum_{j=1}^{4} (y_{ij} - \bar{y})^2$$

where $\bar{y} \approx 3.12$ is the overall average of the 12 observations, and then:

$$D = SSE = SS(total) - 0.3267 - 5.5367$$

or more easily by using that $DF_E = (r - 1)(c - 1) = 6$ and then:

$$D = SSE = 6 \cdot MSE = 6 \cdot 0.10556 = 0.633$$

In any case, the answer is:

Answer option 5: D = 0.633, E = 1.55 and F = 17.48

Question III.3 (9)
With a test level of $\alpha = 5\%$ the conclusion of the analysis becomes:

Answer
We look at the P-values in the ANOVA table and observe that the Dressing P-value is BELOW 0.05 and the Pickles P-value is ABOVE 0.05, and hence the answer is:

Answer option 1: Only the choice of the dressing type has a significant influence on the taste

Exercise IV

For the production of brass valves, raw material (brass bars) is received from 2 different suppliers. Samples are taken from the deliveries from each of the two suppliers. The tensile strength of the items is determined, and the following results are found:

    Supplier 1: $n_1 = 15$, $\bar{x}_1 = 223.5$ N/mm², $s_1 = 7.23$ N/mm²
    Supplier 2: $n_2 = 20$, $\bar{x}_2 = 220.4$ N/mm², $s_2 = 4.49$ N/mm²

As a potential help, the following four R commands:

    round(qf(c(0.001,0.01,0.025,0.05,0.1,0.9,0.95,0.975,0.99,0.995),14,19),3)
    round(qf(c(0.001,0.01,0.025,0.05,0.1,0.9,0.95,0.975,0.99,0.995),15,20),3)
    round(qf(c(0.001,0.01,0.025,0.05,0.1,0.9,0.95,0.975,0.99,0.995),19,14),3)
    round(qf(c(0.001,0.01,0.025,0.05,0.1,0.9,0.95,0.975,0.99,0.995),20,15),3)

give a number of percentiles (rounded to 3 decimals) for four different F-distributions. The results are shown in the R output window as follows:

    > round(qf(c(0.001,0.01,0.025,0.05,0.1,0.9,0.95,0.975,0.99,0.995),14,19),3)
     [1] 0.178 0.283 0.350 0.417 0.508 1.878 2.256 2.647 3.195 3.638
    > round(qf(c(0.001,0.01,0.025,0.05,0.1,0.9,0.95,0.975,0.99,0.995),15,20),3)
     [1] 0.191 0.297 0.363 0.430 0.520 1.845 2.203 2.573 3.088 3.502
    > round(qf(c(0.001,0.01,0.025,0.05,0.1,0.9,0.95,0.975,0.99,0.995),19,14),3)
     [1] 0.209 0.313 0.378 0.443 0.532 1.970 2.400 2.861 3.529 4.089
    > round(qf(c(0.001,0.01,0.025,0.05,0.1,0.9,0.95,0.975,0.99,0.995),20,15),3)
     [1] 0.219 0.324 0.389 0.454 0.542 1.924 2.328 2.756 3.372 3.883

Question IV.1 (10)
With a significance level of $\alpha = 5\%$ we cannot conclude any difference between the two variances for the two suppliers, since:

Answer
The proper test statistic for comparing two variances is the larger variance divided by the smaller one:

$$\frac{7.23^2}{4.49^2} = 2.593$$

The proper distribution to use for the test is the F(14, 19)-distribution. As nothing points towards a one-sided test, we should apply a two-sided test, and the critical value hence becomes

$$F_{\alpha/2}(n_1 - 1, n_2 - 1) = F_{0.025}(14, 19)$$

(This value cannot be found in Table 6, but we do not need that here.) It may be found in R as:

    qf(0.975,14,19)

So the answer is (since no other options use the correct degrees of freedom):

Answer option 3: $F_{0.025}(14, 19) = 2.647 > \frac{7.23^2}{4.49^2}$

Question IV.2 (11)
The following hypothesis is to be tested, and it is now assumed that $\sigma_1^2 = \sigma_2^2$:

    $H_0: \mu_1 = \mu_2$
    $H_1: \mu_1 \ne \mu_2$

With a significance level of $\alpha = 5\%$ we cannot conclude any difference between the two means for the two suppliers, since the t-statistic and P-value for this test become: (both must be correct)

Answer
The standard pooled t-test for this situation uses the pooled variance estimate:

$$s_p^2 = \frac{14 \cdot 7.23^2 + 19 \cdot 4.49^2}{15 + 20 - 2} = 5.8124^2$$

And the t-statistic hence becomes:

$$t = \frac{223.5 - 220.4}{5.8124 \sqrt{1/15 + 1/20}} = 1.561$$

From the t-distribution Table 4 (with $\nu = 33$) we can observe that the P-value is larger than 0.10 (and smaller than 0.20), since 1.561 is between $t_{0.10}$ and $t_{0.05}$ and we are doing a two-sided test (so the tail probability should be multiplied by two to give the proper P-value). In R we could do it as follows:

    s1=7.23
    s2=4.49
    sp=sqrt((14*s1^2+19*s2^2)/33)
    t=(223.5-220.4)/(sp*sqrt(1/15 + 1/20))
    t
    2*(1-pt(t,33))

giving a P-value of 0.128. In both cases the P-value is found to be larger than 0.05, so the answer is:

Answer option 2: t = 1.561 and P-value > 0.05
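The variance comparison in Question IV.1 can be checked numerically from the summary statistics alone. The sketch below is an illustration added here, with assumed variable names; the two-sided P-value is taken as twice the upper tail probability, which is appropriate because the larger variance is placed in the numerator.

```r
# F-test for equal variances from summary statistics (Exercise IV); sketch only.
n1 <- 15; s1 <- 7.23   # supplier 1
n2 <- 20; s2 <- 4.49   # supplier 2

f_obs  <- s1^2 / s2^2                  # larger variance divided by the smaller one
f_crit <- qf(0.975, n1 - 1, n2 - 1)    # upper 2.5% point of F(14, 19)
p_two_sided <- 2 * (1 - pf(f_obs, n1 - 1, n2 - 1))

round(c(f_obs = f_obs, f_crit = f_crit, p_value = p_two_sided), 3)
```

Since `f_obs` is below `f_crit`, the two-sided P-value exceeds 0.05, in line with the conclusion of Question IV.1.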

Question IV.3 (12)
As there is no difference in the mean level for the two suppliers, the data can be joined together and a 99% confidence interval for the mean can be found as: (Some summary statistics for the joint data set: $n = 35$, $\bar{x} = 221.7$, $s = 5.93$)

Answer
We use the standard procedure for the one-sample confidence interval:

$$\bar{x} \pm t_{\alpha/2}(n - 1) \frac{s}{\sqrt{n}}$$

which is, since $\alpha = 0.01$:

$$221.7 \pm t_{0.005} \frac{5.93}{\sqrt{35}}$$

Answer option 4: $221.7 \pm t_{0.005} \frac{5.93}{\sqrt{35}}$

Question IV.4 (13)
A study of a new supplier is planned. It is expected that the standard deviation for this supplier will be approximately 6, that is $\sigma = 6$ N/mm². A 99% confidence interval for the mean in this new study is required to have a width of $\pm 1$ N/mm². How many items must be sampled to achieve this?

Answer
We use the one-sample confidence interval sample size formula:

$$n = \left(\frac{z_{\alpha/2} \cdot \sigma}{E}\right)^2 = \left(\frac{2.576 \cdot 6}{1}\right)^2$$

The 2.576 is found in Table 3 or in R as:

    qnorm(0.995)

Answer option 4: $(2.576 \cdot 6)^2 \approx 239$

Exercise V

When brass is used in production, the modulus of elasticity, E, of the material is often important for the functionality. The modulus of elasticity of 6 different brass alloys is measured. 5 samples from each alloy are tested. The results are shown in the table below, where the measured modulus of elasticity is given in GPa:

    Brass alloys
    M1      M2      M3      M4      M5      M6
    82.5    82.7    92.2    96.5    88.9    75.6
    83.7    81.9    106.8   93.8    89.2    78.1
    80.9    78.9    104.6   92.1    94.2    92.2
    95.2    83.6    94.5    87.4    91.4    87.3
    80.8    78.6    100.7   89.6    90.1    83.8

In an R run for one-way analysis of variance:

    anova(lm(elasmodul~alloy))

the following output is obtained (however, some of the values have been substituted by the symbols A, B, and C):

    > anova(lm(elasmodul~alloy))
    Analysis of Variance Table

    Response: ElasModul
              Df  Sum Sq Mean Sq F value    Pr(>F)
    Alloy      A 1192.51 238.501  9.9446 3.007e-05
    Residuals  B       C  23.983

Question V.1 (14)
The values of A, B, and C are:

Answer
A and B are the degrees of freedom, which in the one-way ANOVA are $k - 1$ and $N - k$, where $k = 6$ is the number of groups and $N = 30$ is the number of observations. This is all that is needed to answer the question, but C could be found as:

$$C = SSE = MSE \cdot DF_E = 23.983 \cdot 24 = 575.59$$

Answer option 5: A = 5, B = 24 and C = 575.59

Question V.2 (15)
The assumptions for using the one-way analysis of variance are: (Choose the answer that most correctly lists all the assumptions and does NOT list any unnecessary assumptions)

Answer
There is little to argue here beyond emphasizing that only answer 1 gives all the assumptions without adding any unnecessary ones.

Answer option 1: The data must be normally distributed within each group, independent, and the variances within each group should not differ significantly from each other
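The ANOVA table in Exercise V can be reproduced from the data table, which also recovers the values hidden behind A, B and C. This is a sketch added for illustration: the variable names `elasmodul` and `alloy` follow the R call in the exercise, while the way the data are entered (column by column, with alloy labels built by `rep`) is an assumption made here.

```r
# One-way ANOVA for the brass alloy data (Exercise V); illustrative sketch.
elasmodul <- c(82.5, 83.7, 80.9, 95.2, 80.8,      # M1
               82.7, 81.9, 78.9, 83.6, 78.6,      # M2
               92.2, 106.8, 104.6, 94.5, 100.7,   # M3
               96.5, 93.8, 92.1, 87.4, 89.6,      # M4
               88.9, 89.2, 94.2, 91.4, 90.1,      # M5
               75.6, 78.1, 92.2, 87.3, 83.8)      # M6
alloy <- factor(rep(paste0("M", 1:6), each = 5))  # group label for each measurement

tab <- anova(lm(elasmodul ~ alloy))
tab                                   # Df column gives A and B, residual Sum Sq gives C

group_means <- tapply(elasmodul, alloy, mean)
group_means[c("M1", "M2")]            # the two means needed in Question V.3
```

The same `group_means` are what Question V.3 uses for the difference between alloys 1 and 2.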

Question V.3 (16)
A 95% confidence interval for the difference between brass alloy 1 and 2 becomes:

Answer
A post-hoc 95% confidence interval between two groups in a one-way ANOVA is:

$$\bar{y}_1 - \bar{y}_2 \pm t_{0.025} \sqrt{MSE \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

So we have to compute the means of the M1 and M2 groups: $\bar{y}_1 = 84.62$, $\bar{y}_2 = 81.14$ (or accept that the difference can only be 3.48) and then plug in $MSE = 23.983$ and $n_1 = n_2 = 5$. So the answer is:

Answer option 2: $3.48 \pm t_{0.025} \sqrt{23.983 \left(\frac{2}{5}\right)}$

Exercise VI

It is a common conjecture that a student's perception of the quality of teaching in a particular discipline is related to the student's level in the subject. To investigate whether this is true, the following data were collected:

    Grade group   Course evaluation
                  GOOD     MIDDLE   BAD
    HIGH          22.4%    7.2%     4%
    MIDDLE        18.4%    8.8%     11.2%
    LOW           11.2%    5.6%     11.2%

There are 125 students in the table above.

Question VI.1 (17)
To investigate whether the conjecture is valid, the following statistic should be calculated:

Answer
The actual 3-by-3 contingency table comes from multiplying the percentages by 125:

    $o_{ij}$      Course evaluation
                  GOOD   MIDDLE   BAD
    HIGH          28     9        5
    MIDDLE        23     11       14
    LOW           14     7        14

Now one could compute all nine expected values for this 3-by-3 table, for instance:

$$e_{11} = \frac{(28 + 9 + 5)(28 + 23 + 14)}{125} = 21.84$$

In R these could e.g. be found as:

    X=t(matrix(125*c(22.4,7.2,4, 18.4,8.8,11.2, 11.2,5.6,11.2)/100,ncol=3))
    round(chisq.test(X)$expected,2)

But having found just one of them makes it clear that only answer 1 can provide the correct answer, as the $\chi^2$-statistic has the form:

$$\chi^2 = \sum_{i=1}^{3} \sum_{j=1}^{3} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$

Answer option 1:

$$\frac{(28 - 21.84)^2}{21.84} + \frac{(9 - 9.07)^2}{9.07} + \frac{(5 - 11.09)^2}{11.09} + \frac{(23 - 24.96)^2}{24.96} + \frac{(11 - 10.37)^2}{10.37} + \frac{(14 - 12.67)^2}{12.67} + \frac{(14 - 18.20)^2}{18.20} + \frac{(7 - 7.56)^2}{7.56} + \frac{(14 - 9.24)^2}{9.24}$$

Question VI.2 (18)
The number of degrees of freedom (DF) and the critical value (Q) of the relevant test on a 5% level are:

Answer
This is a $\chi^2$-test for an r-by-c table where the degrees of freedom are $(r - 1)(c - 1) = 2 \cdot 2 = 4$, and $\chi^2_{0.05}(4) = 9.488$ (to be found in Table 5 with $\nu = 4$) or in R as:

    qchisq(0.95,4)

Answer option 4: DF = 4 and Q = 9.49

Exercise VII

In a specific education programme it was decided to introduce a project, running through the course period, as part of the grade point evaluation. In order to assess whether this has changed the percentage of students passing the course, the following data were collected:

                                        Before introduction   After introduction
                                        of project            of project
    Number of students evaluated        50                    24
    Number of students failed           13                    3
    Average grade point $\bar{x}$       6.420                 7.375
    Sample standard deviation $s$       2.205                 1.813

Let $p_{Before}$ be the proportion failing the course before the introduction of the project and $p_{After}$ the corresponding proportion after the introduction of the project.

Question VII.1 (19)
If the following hypothesis is tested:

    $H_0: p_{Before} = p_{After}$
    $H_1: p_{Before} > p_{After}$

a valid test statistic u, the corresponding P-value and a valid conclusion become: (both values and the conclusion must be correct)

Answer
For this particular test we have been given two different (but similar) options. One would be the $\chi^2$-test for a 2-by-2 frequency/contingency table; the other is a z-test giving a direct comparison of the two proportions:

$$u = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})(1/50 + 1/24)}}$$

where $\hat{p}_1 = 13/50$, $\hat{p}_2 = 3/24$ and $\hat{p} = 16/74$, so:

$$u = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})(1/50 + 1/24)}} = 1.32$$

And the (one-tailed) P-value becomes (from Table 3):

$$P(Z > 1.32) = 0.093$$

or from R as:

    1-pnorm(1.32)

This means that we must accept the null hypothesis, that is, we cannot reject it on e.g. the 5% level.

Answer option 3: u = 1.32 and P-value = 0.09. On a 5% level a drop in failing percentage cannot be documented

Question VII.2 (20)
As it is assumed that the grades are approximately normally distributed in each group, and that the variances in the two groups do not differ significantly from each other, the following hypothesis is tested:

    $H_0: \mu_{Before} = \mu_{After}$
    $H_1: \mu_{Before} < \mu_{After}$

The test statistic, the P-value and the conclusion for this test become: (both values and the conclusion must be correct)

Answer
The standard pooled t-test for this situation uses the pooled variance estimate:

$$s_p^2 = \frac{49 \cdot 2.205^2 + 23 \cdot 1.813^2}{49 + 23} = 2.088^2$$

And the t-statistic hence becomes:

$$t = \frac{6.42 - 7.375}{2.088 \sqrt{1/50 + 1/24}} = -1.842$$

From the t-distribution Table 4 (with $\nu = 72$) we can observe that the (one-tailed) P-value is between 0.025 and 0.05, since 1.842 is between $t_{0.05}$ and $t_{0.025}$. In R we could do it as follows:

    t=(6.42-7.375)/(2.088*sqrt(1/50+1/24))
    t
    pt(t,72)

giving a P-value of 0.0348.

Answer option 5: $t = \frac{6.42 - 7.375}{2.088 \sqrt{\frac{1}{50} + \frac{1}{24}}} = -1.842$, P-value = 0.035: On a 5% level an increase in grade point average can be documented

Question VII.3 (21)
A 95% confidence interval for the grade point standard deviation after the introduction of the project becomes:

Answer
The confidence interval formula for a variance is used WITH the square root applied to everything:

$$\sqrt{\frac{(n - 1) s^2}{\chi^2_{0.975}}} < \sigma < \sqrt{\frac{(n - 1) s^2}{\chi^2_{0.025}}}$$

where, in this formula, $\chi^2_{0.975} = 38.076$ and $\chi^2_{0.025} = 11.689$ denote the 97.5% and 2.5% percentiles of the $\chi^2$-distribution with $n - 1 = 23$ degrees of freedom.

Answer option 1: $\sqrt{\frac{23 \cdot 1.813^2}{38.076}} < \sigma_{After} < \sqrt{\frac{23 \cdot 1.813^2}{11.689}}$

Question VII.4 (22)
The critical value for the following hypothesis test for the grade point variance before the introduction of the project, $\sigma^2_{Before}$:

    $H_0: \sigma^2_{Before} = 2^2$

    $H_1: \sigma^2_{Before} > 2^2$

becomes on level 1% ($\alpha = 0.01$):

Answer
The test for comparing a variance with a specific value is a $\chi^2$-test with $\nu = DF = n - 1 = 49$. So the critical value for this ONE-sided hypothesis is:

$$\chi^2_{0.01}(49)$$

In R this can be found as:

    qchisq(0.99,49)

giving the value 74.91947. Without using R we can use Table 5. In Table 5 the $\chi^2_{0.01}(50)$ value can be read off to be 76.154 and the $\chi^2_{0.01}(40)$ value can be read off to be 63.691, so by linear interpolation the only possible answer is:

Answer option 4: 74.919

Exercise VIII

A company manufactures an electronic device to be used in a very wide temperature range. The company knows that increased temperature shortens the life time of the device, and a study is therefore performed in which the life time is determined as a function of temperature. The following data are found:

    Temperature in Celsius (t)   10    20    30    40    50    60    70   80   90
    Life time in hours (y)       420   365   285   220   176   117   69   34   5

The following is run in R:

    t=c(10,20,30,40,50,60,70,80,90)
    y=c(420,365,285,220,176,117,69,34,5)
    summary(lm(y~t))

with the results:

    Call:
    lm(formula = y ~ t)

    Residuals:
        Min      1Q  Median      3Q     Max
    -21.022 -12.622  -9.156  17.711  29.644

    Coefficients:

                Estimate Std. Error t value Pr(>|t|)
    (Intercept) 453.5556    14.3936   31.51 8.38e-09 ***
    t            -5.3133     0.2558  -20.77 1.51e-07 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 19.81 on 7 degrees of freedom
    Multiple R-squared: 0.984,  Adjusted R-squared: 0.9818
    F-statistic: 431.5 on 1 and 7 DF,  p-value: 1.505e-07

Question VIII.1 (23)
A 95% confidence interval for the slope in the regression model underlying the R run above, which expresses the life time as a linear function of the temperature, becomes:

Answer
Either one could do all the regression computations to find $b = -5.3133$ and then subsequently use the formula for the confidence interval for $\beta$:

$$b \pm t_{\alpha/2} \, s_e \sqrt{\frac{1}{S_{xx}}}$$

Or one could use the information in the R output, where what is known as the standard error of the slope can be read off directly as:

$$s_e \sqrt{\frac{1}{S_{xx}}} = 0.2558$$

And $t_{0.025}(7) = 2.365$, from Table 4 or in R:

    qt(.975,7)

Answer option 3: $-5.31 \pm 2.365 \cdot 0.2558$

Question VIII.2 (24)
Can a relation between temperature and life time be documented on level 5%? (Both the conclusion and the argument must be correct)

Answer
We look at the P-value in the slope row of the output: $1.51 \cdot 10^{-7} = 0.00000015$

Answer option 5: Yes, as the relevant P-value is 0.00000015, which is clearly smaller than 0.05
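The interval in Question VIII.1 can be evaluated numerically, both from the formula and with R's built-in `confint`. This is a sketch added for illustration; the names `fit`, `b`, `se`, `ci_manual` and `ci_builtin` are assumptions made here, while `t` and `y` are the data vectors from the exercise.

```r
# 95% confidence interval for the slope (Exercise VIII); illustrative sketch.
t <- c(10, 20, 30, 40, 50, 60, 70, 80, 90)
y <- c(420, 365, 285, 220, 176, 117, 69, 34, 5)
fit <- lm(y ~ t)

b  <- coef(fit)[["t"]]                               # slope estimate
se <- summary(fit)$coefficients["t", "Std. Error"]   # standard error of the slope

ci_manual  <- b + c(-1, 1) * qt(0.975, df = 7) * se  # b +/- t_{0.025}(7) * SE
ci_builtin <- confint(fit, "t", level = 0.95)        # same interval from confint()

ci_manual
ci_builtin
```

Both computations give the same interval; it lies entirely below zero, which matches the conclusion in Question VIII.2.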

Exercise IX

A consumer study has shown that in many supermarkets there are discrepancies between the receipt and the price on the shelf. The manager of a supermarket wants to keep track of the error percentage, and therefore introduces the following check: During the day, 40 different items are sampled at random, and for these items it is checked whether the receipt and the price on the shelf match. The manager defines the situation as in control if no more than 1 mislabeled item is found among the 40 items. The probability that the situation is found to be in control, if the real percentage of mislabeled items is 1%, is called A. The probability that the situation is found to be in control, if the real percentage of mislabeled items is 10%, is called B.

Question IX.1 (25)
The values of A and B are:

Answer
For A we use the binomial distribution with $p = 0.01$ and $n = 40$:

$$A = P(X \le 1), \quad X \sim b(x; 40, 0.01)$$

For this $p = 0.01$ we cannot use Table 1, so we have to use either R or a hand calculation of the two binomial probabilities:

$$A = P(X = 0) + P(X = 1) = 0.99^{40} + 40 \cdot 0.01 \cdot 0.99^{39} = 0.939$$

For B we use the binomial distribution with $p = 0.10$ and $n = 40$:

$$B = P(X \le 1), \quad X \sim b(x; 40, 0.1)$$

For this combination of $p = 0.1$ and $n = 40$ we cannot use Table 1 either, so again we use either R or a hand calculation of the two binomial probabilities:

$$B = P(X = 0) + P(X = 1) = 0.9^{40} + 40 \cdot 0.1 \cdot 0.9^{39} = 0.08047$$

In R, A could be found by any of the following three computations:

    pbinom(1,40,0.01)
    dbinom(0,40,0.01)+dbinom(1,40,0.01)
    0.99^40+40*0.01*0.99^39

In R, B could be found by any of the following three computations:

    pbinom(1,40,0.10)
    dbinom(0,40,0.10)+dbinom(1,40,0.10)
    0.90^40+40*0.1*0.90^39

Answer option 1: A = 0.939 and B = 0.080

Question IX.2 (26)
As an additional check, on a given day 120 different items in total were checked, and 15 of these were found to be mislabeled. A 95% confidence interval for the proportion of mislabeled items becomes:

Answer
The standard one-proportion confidence interval is used (and this is OK, as both $np$ and $n(1 - p)$ are at least 15):

$$\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$

which becomes:

$$\frac{15}{120} \pm 1.96 \sqrt{\frac{(15/120)(105/120)}{120}}$$

Answer option 2: $0.125 \pm 1.96 \sqrt{0.125 \cdot 0.875/120}$

Question IX.3 (27)
We now wish to determine the proportion of mislabeled items in a particular store with a precision such that a 90% confidence interval will have a width of plus/minus 0.02. The proportion is expected to be in the order of 0.10. Approximately how many items should be checked in order to achieve such precision?

Answer
We use the sample size formula for the proportion confidence interval, using a guess of the true value:

$$n = p(1 - p) \left(\frac{z_{\alpha/2}}{E}\right)^2 = 0.10 \cdot 0.90 \cdot (1.645/0.02)^2$$

as $\alpha = 0.10$, with 1.645 from Table 3 or:

    qnorm(0.95)

Answer option 1: $0.10 \cdot 0.90 \cdot (1.645/0.02)^2 \approx 609$ items
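The two formulas used in Questions IX.2 and IX.3 can be evaluated directly. The sketch below is an illustration added here; all variable names are assumptions, and the sample size is rounded up to the next whole item.

```r
# Proportion confidence interval and sample size (Exercise IX); illustrative sketch.

# Question IX.2: 15 mislabeled items out of 120
x <- 15; n <- 120
p_hat <- x / n
ci <- p_hat + c(-1, 1) * qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / n)
round(ci, 3)

# Question IX.3: 90% interval with margin E = 0.02 and a guessed proportion of 0.10
p_guess <- 0.10
E <- 0.02
n_needed <- ceiling(p_guess * (1 - p_guess) * (qnorm(0.95) / E)^2)
n_needed
```

Using `qnorm` instead of the rounded table values 1.96 and 1.645 changes the results only in the later decimals.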

Exercise X

The yield Y of a chemical process is a random variable whose value is considered to be a linear function of the temperature X. The following data of corresponding values of x and y are found:

    Temperature in Celsius (x)   0    25   50   75   100
    Yield in grams (y)           14   38   54   76   95

The average and standard deviation of temperature (x) and yield (y) are: $\bar{x} = 50$, $s_x = 39.52847$, $\bar{y} = 55.4$, $s_y = 31.66702$, and further it can be used that $S_{xy} = 5000$. In the exercise the usual linear regression model is used:

$$Y_i = \alpha + \beta x_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2), \quad i = 1, \ldots, 5$$

Question X.1 (28)
Can a significant relationship between yield and temperature be documented? (Both the conclusion and the argument must be correct)

Answer
It could most easily be solved by running the regression in R as:

    x=c(0,25,50,75,100)
    y=c(14,38,54,76,95)
    summary(lm(y~x))

with the results:

    Call:
    lm(formula = y ~ x)

    Residuals:
       1    2    3    4    5
    -1.4  2.6 -1.4  0.6 -0.4

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) 15.40000    1.49666   10.29  0.00196 **
    x            0.80000    0.02444   32.73 6.27e-05 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 1.932 on 3 degrees of freedom
    Multiple R-squared: 0.9972,  Adjusted R-squared: 0.9963
    F-statistic: 1071 on 1 and 3 DF,  p-value: 6.267e-05

Alternatively one could use hand calculations and the formula for the t-test of the hypothesis $H_0: \beta = 0$. The relevant test statistic and P-value can be read off in the R output as 32.73 and 0.0000627.

Answer option 4: Yes, as the relevant test statistic and P-value are 32.73 and 0.00006, respectively

Question X.2 (29)
Give the 95% confidence interval for the expected yield at a temperature of $x_0 = 80$ degrees Celsius:

Answer
We use the formula for the confidence limits of a point on the line:

$$(a + b x_0) \pm t_{\alpha/2} \, s_e \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$

And we have to compute a, b and $s_e$ either by hand or in R as above:

$$a = 15.4, \quad b = 0.8, \quad s_e = 1.932$$

So the confidence interval becomes

$$(15.4 + 0.8 \cdot 80) \pm 3.182 \cdot 1.932 \sqrt{\frac{1}{5} + \frac{(80 - 50)^2}{6250}}$$

since $S_{xx} = 4 \cdot 39.52847^2 = 6250$.

Answer option 4: $79.40 \pm 3.61$

Question X.3 (30)
The five residuals become: -1.4, 2.6, -1.4, 0.6 and -0.4. What is the upper quartile of the residuals?

Answer
We use the basic definition of finding a percentile (from Chapter 2) with $n = 5$ and $p = 0.75$, so:

$$np = 3.75$$

Since np is not an integer, it is rounded up to 4, so the upper quartile is the 4th observation in the ordered sequence: -1.4, -1.4, -0.4, 0.6, 2.6

Answer option 5: 0.6
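The interval in Question X.2 and the quartile in Question X.3 can be checked in R. This is a sketch added for illustration with assumed variable names; `predict` with `interval = "confidence"` gives the interval for the expected yield, and the quartile is picked as the 4th ordered residual, following the textbook definition used above rather than R's default `quantile` method.

```r
# Confidence interval for the line at x0 = 80 and upper quartile of residuals
# (Exercise X); illustrative sketch.
x <- c(0, 25, 50, 75, 100)
y <- c(14, 38, 54, 76, 95)
fit <- lm(y ~ x)

# 95% confidence interval for the expected yield at 80 degrees Celsius
ci <- predict(fit, newdata = data.frame(x = 80),
              interval = "confidence", level = 0.95)
half_width <- unname(ci[, "upr"] - ci[, "fit"])
ci
half_width

# Upper quartile: np = 5 * 0.75 = 3.75, rounded up to the 4th ordered residual
q3 <- unname(sort(residuals(fit))[ceiling(5 * 0.75)])
q3
```

The fitted value and half-width correspond to the 79.40 and 3.61 in the answer to Question X.2.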