Regression Analysis and Analysis of Variance

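Prof. Dr. P. L. Davies
Technische Universiteit Eindhoven

Regression Analysis and Analysis of Variance

Examination

Question 1.1

(a) Let $X_1, \ldots, X_n$ be i.i.d. random variables with the common distribution function $F(x, \lambda)$ with $\lambda > 0$ and

$$F(x, \lambda) = \begin{cases} 0, & x < -\log(\lambda), \\ 1 - \exp(-x - \log(\lambda)), & x \ge -\log(\lambda). \end{cases}$$

Show that

$$\lim_{n \to \infty} P\Big(\max_{1 \le i \le n} X_i \le \log(n) + x\Big) = \exp\Big(-\frac{1}{\lambda} \exp(-x)\Big).$$

(b) Calculate the Fisher information $I(\lambda)$ for the distribution $G(x, \lambda) = \exp(-\exp(-x)/\lambda)$.

(c) Let $Y_1, \ldots, Y_n$ be i.i.d. random variables with the distribution function $G(x, \lambda)$. Calculate the maximum likelihood estimator $\hat{\lambda}_n$ of $\lambda$. Show that it is unbiased and efficient.

Solution:

(a) By independence, and for $n$ so large that $\log(n) + x \ge -\log(\lambda)$,

$$\begin{aligned}
\lim_{n \to \infty} P\Big(\max_{1 \le i \le n} X_i \le \log(n) + x\Big)
&= \lim_{n \to \infty} \prod_{i=1}^{n} P(X_i \le \log(n) + x) \\
&= \lim_{n \to \infty} \big(1 - \exp(-x - \log(n) - \log(\lambda))\big)^n \\
&= \lim_{n \to \infty} \Big(1 - \frac{1}{n} \exp(-x - \log(\lambda))\Big)^n \\
&= \exp(-\exp(-x - \log(\lambda))) = \exp(-\exp(-x)/\lambda).
\end{aligned}$$

(b) The density function is

$$g(x, \lambda) = \exp(-x - \exp(-x)/\lambda)/\lambda$$

and hence

$$\frac{\partial \log g(x, \lambda)}{\partial \lambda} = -\frac{1}{\lambda} + \frac{\exp(-x)}{\lambda^2}, \qquad \frac{\partial^2 \log g(x, \lambda)}{\partial \lambda^2} = \frac{1}{\lambda^2} - \frac{2 \exp(-x)}{\lambda^3}.$$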

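The Fisher information is

$$\begin{aligned}
I(\lambda) = -\int \frac{\partial^2 \log g(x, \lambda)}{\partial \lambda^2}\, g(x, \lambda)\, dx
&= \int \Big(-\frac{1}{\lambda^2} + \frac{2 \exp(-x)}{\lambda^3}\Big) g(x, \lambda)\, dx \\
&= -\frac{1}{\lambda^2} + \frac{2}{\lambda^3} \int \exp(-x) \exp(-x - \exp(-x)/\lambda)/\lambda \, dx \\
&= -\frac{1}{\lambda^2} + \frac{2}{\lambda^2} \int_0^\infty y \exp(-y)\, dy \qquad \text{(substitution } y = \exp(-x)/\lambda) \\
&= 1/\lambda^2.
\end{aligned}$$

(c) The log likelihood is

$$\sum_{i=1}^{n} \big(-Y_i - \log(\lambda) - \exp(-Y_i)/\lambda\big)$$

and on differentiating with respect to $\lambda$ and setting the derivative to zero we obtain

$$-n/\lambda + \sum_{i=1}^{n} \exp(-Y_i)/\lambda^2 = 0, \qquad \hat{\lambda}_n = \frac{1}{n} \sum_{i=1}^{n} \exp(-Y_i).$$

We have

$$E(\exp(-Y_i)) = \int \exp(-x) \exp(-x - \exp(-x)/\lambda)/\lambda \, dx = \lambda \int_0^\infty y \exp(-y)\, dy = \lambda \qquad \text{(substitution } y = \exp(-x)/\lambda)$$

so that $\hat{\lambda}_n$ is unbiased. Analogously we have $E(\exp(-2Y_i)) = 2\lambda^2$, so that $V(\exp(-Y_i)) = 2\lambda^2 - \lambda^2 = \lambda^2$ and hence $V(\hat{\lambda}_n) = \lambda^2/n$. This equals the Cramér-Rao lower bound $1/(n I(\lambda)) = \lambda^2/n$ given by the Fisher information, so the estimator is efficient.

Question 1.2

Consider the two following rules for identifying outliers in a sample $\mathbf{x}_n = (x_1, \ldots, x_n)$.

Rule 1: $x_i$ is an outlier if $|x_i - \mathrm{mean}(\mathbf{x}_n)| > c_1(n, \alpha)\, \mathrm{sd}(\mathbf{x}_n)$

Rule 2: $x_i$ is an outlier if $|x_i - \mathrm{median}(\mathbf{x}_n)| > c_2(n, \alpha)\, \mathrm{mad}(\mathbf{x}_n)$

where $c_1(n, \alpha)$ and $c_2(n, \alpha)$ are chosen so that the probability of detecting an outlier in an i.i.d. normal sample of size $n$ is $\alpha$. The values of $c_1(n, \alpha)$ and $c_2(n, \alpha)$ for $n = 11, \ldots, 20$ and $\alpha = 0.05$ are given below.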

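     n    c_1(n, α)    c_2(n, α)
    11      2.36         4.75
    12      2.42         4.42
    13      2.46         4.43
    14      2.50         4.18
    15      2.54         4.30
    16      2.59         4.08
    17      2.63         4.20
    18      2.65         4.01
    19      2.68         4.10
    20      2.71         3.97

(a) Take the data in sample1.dat and add one additional point $x_{11} = x$ to it. For each rule determine numerically the smallest absolute value of $x$ so that it is detected as an outlier.

    qu21 <- function() {
      # critical values c_1(n, alpha), c_2(n, alpha) for n = 11, ..., 20
      fc1 <- c(2.36, 2.42, 2.46, 2.50, 2.54, 2.59, 2.63, 2.65, 2.68, 2.71)
      fc2 <- c(4.75, 4.42, 4.43, 4.18, 4.30, 4.08, 4.20, 4.01, 4.10, 3.97)
      source("sample1.dat")  # expected to define tmpx with the 10 data points

      # Rule 1: scan x = +i/100 and x = -i/100, stop at the first detection
      i <- 1
      while (i <= 1000) {
        tmpx[11] <- i / 100
        if (abs(tmpx[11] - mean(tmpx)) > sd(tmpx) * fc1[1]) {
          ax1 <- tmpx[11]
          break
        }
        tmpx[11] <- -i / 100
        if (abs(tmpx[11] - mean(tmpx)) > sd(tmpx) * fc1[1]) {
          ax1 <- tmpx[11]
          break
        }
        i <- i + 1
      }
      print(abs(ax1))

      # Rule 2: the same scan with median and MAD
      i <- 1
      while (i <= 1000) {
        tmpx[11] <- i / 100
        if (abs(tmpx[11] - median(tmpx)) > mad(tmpx) * fc2[1]) {
          ax2 <- tmpx[11]
          break
        }
        tmpx[11] <- -i / 100
        if (abs(tmpx[11] - median(tmpx)) > mad(tmpx) * fc2[1]) {
          ax2 <- tmpx[11]
          break
        }
        i <- i + 1
      }
      print(abs(ax2))
    }

Results: 5.27 (Rule 1), 2.19 (Rule 2)

(b) Add $k$ additional points $x_{11} = \ldots = x_{10+k} = x$ to the data sample1.dat. For each rule determine the smallest value of $k$ so that $x$ is arbitrarily large but no point is identified as an outlier.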

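    qu22 <- function() {
      # critical values c_1(n, alpha), c_2(n, alpha) for n = 11, ..., 20
      fc1 <- c(2.36, 2.42, 2.46, 2.50, 2.54, 2.59, 2.63, 2.65, 2.68, 2.71)
      fc2 <- c(4.75, 4.42, 4.43, 4.18, 4.30, 4.08, 4.20, 4.01, 4.10, 3.97)
      source("sample1.dat")  # expected to define tmpx with the 10 data points
      x0 <- tmpx[1:10]
      # ASSUMPTION: the value assigned in the original listing did not survive;
      # any very large number stands in for "arbitrarily large"
      xlarge <- 10^10

      # Rule 1: smallest k for which the large point is no longer flagged
      k <- 1
      while (k <= 10) {
        tmpx <- c(x0, rep(xlarge, k))
        if (abs(tmpx[11] - mean(tmpx)) < sd(tmpx) * fc1[k]) {
          kmin <- k
          break
        }
        k <- k + 1
      }
      print(kmin)

      # Rule 2: the same search with median and MAD
      k <- 1
      while (k <= 10) {
        tmpx <- c(x0, rep(xlarge, k))
        if (abs(tmpx[11] - median(tmpx)) < mad(tmpx) * fc2[k]) {
          kmin <- k
          break
        }
        k <- k + 1
      }
      print(kmin)
    }

Results: 2 (Rule 1) and 10 (Rule 2).

Question 1.3

Analyse the two-way table below (a) by least squares and (b) by Tukey's median polish, calculating the residuals in each case. Standardize the residuals by dividing by their standard deviation in (a) and by their MAD in (b). Include interactions in the two cells with the largest residuals of the median polish analysis and then perform least squares to determine the size of these interactions/outliers. Show that the interaction pattern is unconditionally identifiable. What are your conclusions about the status of the interactions/outliers? The data are stored in sample2.dat.

    data <- read.table("sample2.dat", header = TRUE)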

    attach(data)
    model <- lm(y ~ A + B)
    anova(model)
    summary(model)
    resid(model)

    # Tukey's median polish
    # first get data in correct shape
    y2 <- y
    dim(y2) <- c(3, 4)
    yp <- medpolish(y2, eps = 0.001, maxiter = 10)
    yp$residuals

    # the largest residuals are in cells (1, 1) and (3, 3)
    # so we will place interactions there.
    interaction <- c(1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0)
    model2 <- lm(y ~ A + B + interaction)
    anova(model2)
    summary(model2)

    dim(interaction) <- c(3, 4)
    # testinteractionpattern is not a base R function; presumably it is
    # supplied with the course material
    testinteractionpattern(interaction)
    # The interaction pattern is unconditionally identifiable.

Question 1.4

Consider a sample $x_1, \ldots, x_n$ consisting of $n$ different points. For each interval $I$ which contains exactly $k$ data points let $\mathrm{var}_I$ denote the variance of the data points in the interval. Define $\hat{\mu}_k$ to be the mean of the data points in the interval $I$ for which $\mathrm{var}_I$ is smallest. For which value of $k$ is the finite sample breakdown point of $\hat{\mu}_k$ highest? For this value of $k$ calculate $\hat{\mu}_k$ for the sample in sample3.dat.

The value of $k$ is clearly $\lfloor (n + 1)/2 \rfloor$. If we change $i < \lfloor (n + 1)/2 \rfloor$ observations so that the mean of the interval with the smallest variance tends to infinity, then the interval contains at least one original point and one point which is arbitrarily large. The variance is then also arbitrarily large. However, there are at least $k$ points which have not been altered, and as their variance remains bounded we have a contradiction. Thus the breakdown point is $\lfloor (n + 1)/2 \rfloor / n$. As the procedure is affine equivariant this is the highest possible breakdown point.

    qu4 <- function() {
      source("sample3.dat")  # expected to define the data vector qu4x
      n <- length(qu4x)

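      k <- floor((n + 1) / 2)
      tmpx <- sort(qu4x)
      sdmin <- 10^10
      # slide a window of k consecutive order statistics over the sorted data
      i <- 1
      while (i <= n - k + 1) {
        tmp <- tmpx[i:(i + k - 1)]
        if (sd(tmp) < sdmin) {
          ii <- i
          sdmin <- sd(tmp)
        }
        i <- i + 1
      }
      # mean of the window with the smallest standard deviation
      print(mean(tmpx[ii:(ii + k - 1)]))
    }

Result:

Question 1.5

The data in sample4.dat consist of 10 different samples, $\mathrm{samp}_i$, $i = 1, \ldots, 10$. Test the null hypothesis $H_0: \mu_1 < \ldots < \mu_{10}$ at size $\alpha = 0.05$ using:

(a) the Bonferroni-Holm method and the one-sided two-sample t-tests

$$\frac{\mathrm{mean}(\mathrm{samp}_{i+1}) - \mathrm{mean}(\mathrm{samp}_i)}{s_{i,i+1} \sqrt{1/n_i + 1/n_{i+1}}}$$

where

$$s_{i,i+1}^2 = \frac{n_{i+1}\, \mathrm{sd}(\mathrm{samp}_{i+1})^2 + n_i\, \mathrm{sd}(\mathrm{samp}_i)^2}{n_{i+1} + n_i - 2}$$

to test the hypotheses $\mu_i < \mu_{i+1}$, $i = 1, \ldots, 9$.

(b) the method based on M-estimators and confidence intervals for the individual samples described in the lecture notes. Check whether you can find values $\gamma_i \in I_i$, where $I_i$ is the confidence interval for the $i$th sample, with $\gamma_1 < \ldots < \gamma_{10}$.

    qu5 <- function() {
      source("sample4.dat")  # expected to define samp1, ..., samp10
      tmpx <- list(samp1, samp2, samp3, samp4, samp5,
                   samp6, samp7, samp8, samp9, samp10)
      empp <- double(9)
      # one-sided p-values for the nine neighbouring comparisons
      i <- 1
      while (i <= 9) {
        n1 <- length(tmpx[[i]])
        n2 <- length(tmpx[[i + 1]])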

        sd1 <- sd(tmpx[[i]])
        sd2 <- sd(tmpx[[i + 1]])
        ss <- sqrt((n1 * sd1^2 + n2 * sd2^2) / (n1 + n2 - 2))
        tst <- (mean(tmpx[[i + 1]]) - mean(tmpx[[i]])) / (ss * sqrt(1 / n1 + 1 / n2))
        empp[i] <- pt(tst, n1 + n2 - 2)
        i <- i + 1
      }
      # tmp1[j] is the index of the comparison with the j-th smallest p-value
      tmpi <- rank(empp)
      tmp1 <- 1:9
      tmp1[tmpi] <- tmp1
      # test in order of increasing p-value at levels 0.05/9, 0.05/8, ...
      i <- 1
      while (i <= 9) {
        n1 <- length(tmpx[[tmp1[i]]])
        n2 <- length(tmpx[[tmp1[i] + 1]])
        sd1 <- sd(tmpx[[tmp1[i]]])
        sd2 <- sd(tmpx[[tmp1[i] + 1]])
        ss <- sqrt((n1 * sd1^2 + n2 * sd2^2) / (n1 + n2 - 2))
        tst <- (mean(tmpx[[tmp1[i] + 1]]) - mean(tmpx[[tmp1[i]]])) /
          (ss * sqrt(1 / n1 + 1 / n2))
        if (tst < qt(0.05 / (10 - i), n1 + n2 - 2)) {
          print(c(tmp1[i], "reject"))
          break
        }
        i <- i + 1
      }
    }

Empirical p values: 9.77e… , … , …e-07

Test 6 against 7, 8 against 9 etc. at levels 0.05/9, 0.05/8 etc. The first test results in rejection. Thus there is no increasing set of means.

(b)

    # f1way is not a base R function; presumably it is the routine that
    # accompanies the lecture notes
    tmp <- f1way(list(samp1, samp2, samp3, samp4, samp5,
                      samp6, samp7, samp8, samp9, samp10))
    print(tmp$lb, digits = 2)
    print(tmp$ub, digits = 3)

Lower bounds:

Upper bounds:

The upper bound of interval 7 lies below the lower bound of interval 6, so there is no non-decreasing sequence.
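The step-down logic of part (a) can be sketched in a self-contained way. The code below is only an illustration under stated assumptions: the helper names `pooled_p` and `holm_reject` are hypothetical, and the five simulated samples are a stand-in for sample4.dat, not the examination data. It uses the pooled statistic exactly as written in the question (with the weights $n_i$ rather than $n_i - 1$) and compares the $j$-th smallest p-value with $\alpha/(m - j + 1)$.

```r
# Hypothetical helper: one-sided p-value for the comparison of a and b,
# using the pooled statistic from the question (weights n, not n - 1).
pooled_p <- function(a, b) {
  na <- length(a)
  nb <- length(b)
  s <- sqrt((na * sd(a)^2 + nb * sd(b)^2) / (na + nb - 2))
  tst <- (mean(b) - mean(a)) / (s * sqrt(1 / na + 1 / nb))
  pt(tst, na + nb - 2)  # small when mean(b) lies well below mean(a)
}

# Hypothetical helper: Holm step-down. The j-th smallest p-value is compared
# with alpha / (m - j + 1); testing stops at the first comparison that fails.
holm_reject <- function(p, alpha = 0.05) {
  m <- length(p)
  ord <- order(p)
  reject <- logical(m)
  for (j in seq_len(m)) {
    if (p[ord[j]] > alpha / (m - j + 1)) break
    reject[ord[j]] <- TRUE
  }
  reject
}

set.seed(1)
# ASSUMPTION: simulated stand-in data with increasing means except for a
# drop between samples 3 and 4
mu <- c(0, 1, 2, -3, 4)
samples <- lapply(mu, function(m) rnorm(20, mean = m))
p <- sapply(1:4, function(i) pooled_p(samples[[i]], samples[[i + 1]]))
rej <- holm_reject(p)
print(which(rej))  # indices i for which mu_i < mu_(i+1) is rejected
```

The decisions can be cross-checked against base R with `p.adjust(p, method = "holm") <= 0.05`. Since $H_0$ asserts that all nine orderings hold, a single rejection is enough to reject $H_0$, which is why the solution code stops at the first rejection.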
