Multiple comparisons of slopes of regression lines. Jolanta Wojnar, Wojciech Zieliński


Institute of Statistics and Econometrics, University of Rzeszów, ul. Ćwiklińskiej 2, Rzeszów
Department of Econometrics and Computer Sciences, Agricultural University, ul. Nowoursynowska 166, Warszawa

Abstract

The problem of comparing $K$ simple regression lines is considered. A statistical procedure for finding groups of parallel lines is proposed.

Key words: multiple comparisons, simultaneous inference, regression analysis

AMS subject classification: 62J15, 62J10

1 Introduction

Consider $K$ regression lines

$$Y = \alpha_k + \beta_k x + \varepsilon, \quad k = 1, \ldots, K,$$

and assume that the $\varepsilon$'s are i.i.d. random variables distributed as $N(0, \sigma^2)$. The problem is to divide the set of regression coefficients $\{\beta_1, \ldots, \beta_K\}$ into homogeneous groups. A subset $\{\beta_{i_1}, \ldots, \beta_{i_m}\}$ is called a homogeneous group if $\beta_{i_1} = \cdots = \beta_{i_m}$ and any other $\beta \in \{\beta_1, \ldots, \beta_K\}$ is not equal to $\beta_{i_1}$.

This problem is similar to the problem of extracting homogeneous groups of means in ANOVA. Some classical multiple comparison procedures (cf. Miller, 1982; Hochberg and Tamhane, 1988), such as those of Tukey, Scheffé and Bonferroni, can be adapted to the above problem. In what follows, the W procedure of multiple comparisons proposed by Zieliński (1992) is used. The probability of the correct decision is taken as the criterion of the quality of the procedure.

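2 Statistical model

Assume that for each regression function we have $n_k$ observations $(x_{ki}, Y_{ki})$. Then the overall number of observations is $N = \sum_k n_k$. Hence we have the following joint model:

$$Y_{ki} = \alpha_k + \beta_k x_{ki} + \varepsilon_{ki}, \quad i = 1, \ldots, n_k, \quad k = 1, \ldots, K, \tag{1}$$

where the $\alpha$'s and $\beta$'s are unknown regression coefficients and the $\varepsilon$'s are independent normally distributed random variables with mean zero and variance $\sigma^2$.

In matrix notation the model may be written as $y = X\beta + \varepsilon$, where the vector $\varepsilon$ is distributed as $N_N(0, \sigma^2 I)$, $y = (Y_{11}, \ldots, Y_{1n_1}, \ldots, Y_{K1}, \ldots, Y_{Kn_K})'$ and

$$X\beta =
\begin{pmatrix}
1 & 0 & \cdots & 0 & x_{11} & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\
1 & 0 & \cdots & 0 & x_{1n_1} & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 & 0 & x_{21} & \cdots & 0 \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\
0 & 1 & \cdots & 0 & 0 & x_{2n_2} & \cdots & 0 \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\
0 & 0 & \cdots & 1 & 0 & 0 & \cdots & x_{K1} \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\
0 & 0 & \cdots & 1 & 0 & 0 & \cdots & x_{Kn_K}
\end{pmatrix}
\begin{pmatrix} \alpha_1 \\ \vdots \\ \alpha_K \\ \beta_1 \\ \vdots \\ \beta_K \end{pmatrix}.$$

Let $A$ be a given $q \times 2K$ matrix of rank $r$ and $c$ be a given $q \times 1$ vector. On the basis of the general theory of linear models we obtain the following test statistic for the hypothesis $H: A\beta = c$:

$$F = \frac{(A\hat\beta - c)'\,[A(X'X)^{-}A']^{-}\,(A\hat\beta - c)}{y'(I - X(X'X)^{-}X')y} \cdot \frac{N - \operatorname{rank}(X)}{r}, \tag{2}$$

where $\hat\beta$ is the LSE of $\beta$. Its null distribution is $F$ with $(r, N - \operatorname{rank}(X))$ degrees of freedom, and the hypothesis is rejected at significance level $\alpha$ if $F > F^{\alpha}_{r;\,N - \operatorname{rank}(X)}$, where $F^{\alpha}_{r;\,N - \operatorname{rank}(X)}$ is the appropriate critical value.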
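As a small illustration of the matrix notation (an example constructed here, not taken from the paper), take $K = 2$ lines with $n_1 = n_2 = 3$ observations each, so that $N = 6$. The hypothesis $H: \beta_1 = \beta_2$ then corresponds to a single-row matrix $A$ with $q = r = 1$ and $c = 0$, and, provided $\operatorname{rank}(X) = 4$, the statistic $F$ has $(1, 2)$ degrees of freedom:

```latex
\[
X\beta =
\begin{pmatrix}
1 & 0 & x_{11} & 0 \\
1 & 0 & x_{12} & 0 \\
1 & 0 & x_{13} & 0 \\
0 & 1 & 0 & x_{21} \\
0 & 1 & 0 & x_{22} \\
0 & 1 & 0 & x_{23}
\end{pmatrix}
\begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \beta_1 \\ \beta_2 \end{pmatrix},
\qquad
A = \begin{pmatrix} 0 & 0 & 1 & -1 \end{pmatrix},
\qquad
c = 0 .
\]
```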

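If for each $k = 1, \ldots, K$ there exist at least two different $x_{ki}$'s, then $\operatorname{rank}(X) = 2K$ and there exists $(X'X)^{-1}$ of the form

$$(X'X)^{-1} = \begin{bmatrix} W_1 & W_2 \\ W_2 & W_3 \end{bmatrix},$$

where

$$W_1 = \operatorname{diag}\left\{ \frac{\sum_i x_{1i}^2}{n_1 SS_1}, \ldots, \frac{\sum_i x_{Ki}^2}{n_K SS_K} \right\}, \quad
W_2 = \operatorname{diag}\left\{ -\frac{\sum_i x_{1i}}{n_1 SS_1}, \ldots, -\frac{\sum_i x_{Ki}}{n_K SS_K} \right\}, \quad
W_3 = \operatorname{diag}\left\{ \frac{1}{SS_1}, \ldots, \frac{1}{SS_K} \right\}.$$

Here $\operatorname{diag}\{a_1, \ldots, a_K\}$ denotes the diagonal matrix with diagonal elements $a_1, \ldots, a_K$, and $SS_k = \sum_{i=1}^{n_k} (x_{ki} - \bar x_k)^2$. Note that $\frac{1}{N - 2K}\, y'(I - X(X'X)^{-1}X')y$ is the least squares unbiased estimator of the variance $\sigma^2$.

3 Procedure

The procedure of comparison of regression coefficients is based on the statistic (2) with $c = 0$ and is stepwise. In the first step, it is verified whether $\beta_1 = \cdots = \beta_K$. The matrix $A$ is then of the form

$$A = \left[\, 0_{K \times K} \;\; I_K - \tfrac{1}{K} 1_K 1_K' \,\right],$$

where $0_{K \times K}$ is the $K \times K$ zero matrix, $I_K$ denotes the identity matrix of order $K$ and $1_K$ denotes the $K \times 1$ vector of ones. The explicit form of the numerator of (2) is

$$\sum_{k=1}^{K} (\hat\beta_k - \hat\beta)^2,$$

where

$$\hat\beta_k = \frac{\sum_{i=1}^{n_k} (Y_{ki} - \bar Y_k)(x_{ki} - \bar x_k)}{\sum_{i=1}^{n_k} (x_{ki} - \bar x_k)^2}, \qquad
\hat\beta = \frac{\sum_{k=1}^{K} \sum_{i=1}^{n_k} (Y_{ki} - \bar Y_k)(x_{ki} - \bar x_k)}{\sum_{k=1}^{K} \sum_{i=1}^{n_k} (x_{ki} - \bar x_k)^2}.$$

Note that $\hat\beta_k$ is the LSE of $\beta_k$ for the $k$-th regression line and $\hat\beta$ is the LSE of the regression coefficient under the assumption $\beta_1 = \cdots = \beta_K = \beta$. If the value of statistic (2) is less than $F^{\alpha}_{K-1;\,N-2K}$, the procedure stops and the regression coefficients are considered equal. Otherwise we go to the second step.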
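The first step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: instead of evaluating the matrices in (2), it uses the standard equivalent form of the F test as a scaled difference of residual sums of squares (common slope with separate intercepts versus separate slopes). The function names `line_sums` and `equal_slopes_f` are hypothetical, and the sketch assumes every line has at least two distinct x values and a nonzero residual sum of squares.

```python
def line_sums(xs, ys):
    """Return (n, SS, Sxy, Syy) for one regression line."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    ss = sum((x - xbar) ** 2 for x in xs)                       # SS_k
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))  # cross-products
    syy = sum((y - ybar) ** 2 for y in ys)
    return n, ss, sxy, syy


def equal_slopes_f(lines):
    """F statistic for H: beta_1 = ... = beta_K and its degrees of freedom.

    `lines` is a list of (xs, ys) pairs, one pair per regression line.
    """
    sums = [line_sums(xs, ys) for xs, ys in lines]
    K = len(sums)
    N = sum(s[0] for s in sums)
    # Residual sum of squares with a separate slope for every line.
    rss_full = sum(syy - sxy ** 2 / ss for _, ss, sxy, syy in sums)
    # Residual sum of squares with one common slope and separate intercepts.
    rss_null = sum(s[3] for s in sums) - sum(s[2] for s in sums) ** 2 / sum(s[1] for s in sums)
    df1, df2 = K - 1, N - 2 * K
    f = ((rss_null - rss_full) / df1) / (rss_full / df2)
    return f, df1, df2


# Two toy lines observed at the same three points (made-up data).
lines = [([-1, 0, 1], [-1, 1, 0]), ([-1, 0, 1], [1, 1, -2])]
f_value, df1, df2 = equal_slopes_f(lines)
```

The returned value is then compared with the critical value $F^{\alpha}_{K-1;\,N-2K}$; the sketch leaves that lookup to the caller.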

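At the $p$-th step we consider divisions of the set of regression coefficients into $p$ disjoint homogeneous groups. Let $I_1, \ldots, I_p$ be a division of $\{1, \ldots, K\}$ into $p$ disjoint subsets and let $J^{(p)}$ denote this division. The corresponding matrix $A$ (after an appropriate permutation of the regression coefficients) takes the form

$$A_{J^{(p)}} =
\begin{pmatrix}
0_{m_1 \times K} & I_{m_1} - \frac{1}{m_1} 1_{m_1} 1_{m_1}' & 0_{m_1 \times m_2} & \cdots & 0_{m_1 \times m_p} \\
0_{m_2 \times K} & 0_{m_2 \times m_1} & I_{m_2} - \frac{1}{m_2} 1_{m_2} 1_{m_2}' & \cdots & 0_{m_2 \times m_p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0_{m_p \times K} & 0_{m_p \times m_1} & 0_{m_p \times m_2} & \cdots & I_{m_p} - \frac{1}{m_p} 1_{m_p} 1_{m_p}'
\end{pmatrix},$$

where $m_i$ is the cardinality of the subset $I_i$. Let $F_{J^{(p)}}$ denote the statistic (2) with the matrix $A_{J^{(p)}}$. The numerator of (2) equals

$$\sum_{j=1}^{p} \sum_{k \in I_j} (\hat\beta_k - \hat\beta_{I_j})^2,$$

where

$$\hat\beta_k = \frac{\sum_{i=1}^{n_k} (Y_{ki} - \bar Y_k)(x_{ki} - \bar x_k)}{\sum_{i=1}^{n_k} (x_{ki} - \bar x_k)^2}, \qquad
\hat\beta_{I_j} = \frac{\sum_{k \in I_j} \sum_{i=1}^{n_k} (Y_{ki} - \bar Y_{I_j})(x_{ki} - \bar x_{I_j})}{\sum_{k \in I_j} \sum_{i=1}^{n_k} (x_{ki} - \bar x_{I_j})^2},$$

$$\bar Y_{I_j} = \frac{\sum_{k \in I_j} \sum_{i=1}^{n_k} Y_{ki}}{\sum_{k \in I_j} n_k}, \qquad
\bar x_{I_j} = \frac{\sum_{k \in I_j} \sum_{i=1}^{n_k} x_{ki}}{\sum_{k \in I_j} n_k}.$$

The estimator $\hat\beta_{I_j}$ is the LSE of the regression coefficient under the assumption that all $\beta_k$ for $k \in I_j$ are equal.

Let $J_*^{(p)}$ be a division into $p$ groups such that $F_{J_*^{(p)}} = \min_{J^{(p)}} F_{J^{(p)}}$. If $F_{J_*^{(p)}} < F^{\alpha}_{K-p;\,N-2K}$, then we stop the procedure and accept the division $J_*^{(p)}$. Otherwise we consider divisions into $p + 1$ groups. If $p = K - 1$ and $F_{J_*^{(p)}} > F^{\alpha}_{K-p;\,N-2K}$ holds, we decide that we have $K$ groups, i.e. all coefficients are distinct.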
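The stepwise search described above can be sketched as follows. This is an illustrative sketch under assumptions, not the authors' code: the F statistic for a division is computed through the standard residual-sum-of-squares form of the test (common slope within each group, separate intercepts for every line), all divisions into p groups are enumerated by brute force, and the critical value is supplied by the caller through the hypothetical parameter `critical_value(df1, df2)`. All function names are made up for this example.

```python
def line_sums(xs, ys):
    """Return (n, SS, Sxy, Syy) for one regression line."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    syy = sum((y - ybar) ** 2 for y in ys)
    return n, ss, sxy, syy


def divisions(items, p):
    """Yield every division of `items` into exactly p non-empty groups."""
    if not items:
        if p == 0:
            yield []
        return
    first, rest = items[0], items[1:]
    # `first` joins one of the groups of a p-group division of the rest ...
    for part in divisions(rest, p):
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
    # ... or forms a group of its own.
    if p >= 1:
        for part in divisions(rest, p - 1):
            yield [[first]] + part


def division_f(sums, division):
    """F statistic for the hypothesis of equal slopes within each group."""
    K = len(sums)
    N = sum(s[0] for s in sums)
    rss_full = sum(syy - sxy ** 2 / ss for _, ss, sxy, syy in sums)
    rss_null = 0.0
    for group in division:
        ss = sum(sums[k][1] for k in group)
        sxy = sum(sums[k][2] for k in group)
        syy = sum(sums[k][3] for k in group)
        rss_null += syy - sxy ** 2 / ss
    r = K - len(division)  # rank of the hypothesis matrix
    return ((rss_null - rss_full) / r) / (rss_full / (N - 2 * K))


def stepwise_division(lines, critical_value):
    """Return the accepted division of line indices 0..K-1 into groups."""
    sums = [line_sums(xs, ys) for xs, ys in lines]
    K = len(sums)
    N = sum(s[0] for s in sums)
    for p in range(1, K):
        best = min(divisions(list(range(K)), p), key=lambda d: division_f(sums, d))
        if division_f(sums, best) < critical_value(K - p, N - 2 * K):
            return best
    return [[k] for k in range(K)]  # all coefficients distinct


# Made-up data: lines 0 and 1 are parallel, line 2 has a very different slope.
x = [-1, 0, 1]
lines = [(x, [-1, 1, 0]), (x, [0, 2, 1]), (x, [10, 1, -11])]
# Placeholder critical value; a real run would use the upper-alpha F quantile.
result = stepwise_division(lines, lambda df1, df2: 10.0)
```

In practice the placeholder lambda would be replaced by a proper F quantile, for example `scipy.stats.f.ppf(1 - alpha, df1, df2)` if SciPy is available. The number of divisions grows quickly with K, so brute-force enumeration is only reasonable for a small number of lines.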

4 Criterion

Let $\Theta = \{\theta_1, \theta_2, \ldots\}$ denote the set of all possible divisions of the set of regression coefficients into homogeneous groups. Elements of the set $\Theta$ are disjoint subsets of $R^K$, and for every $(\beta_1, \ldots, \beta_K) \in R^K$ there exists only one $\theta \in \Theta$ such that $(\beta_1, \ldots, \beta_K) \in \theta$. Note that $\Theta$ is a finite set. The elements of the set $\Theta$ are commonly called states of nature.

For example, consider $K = 3$. The set $\Theta$ consists of the following elements:

$\theta_1 = \{(\beta_1, \beta_2, \beta_3) \in R^3 : \beta_1 = \beta_2 = \beta_3\}$
$\theta_2 = \{(\beta_1, \beta_2, \beta_3) \in R^3 : \beta_1 = \beta_2,\ \beta_3 \neq \beta_1\}$
$\theta_3 = \{(\beta_1, \beta_2, \beta_3) \in R^3 : \beta_1 = \beta_3,\ \beta_2 \neq \beta_1\}$
$\theta_4 = \{(\beta_1, \beta_2, \beta_3) \in R^3 : \beta_2 = \beta_3,\ \beta_1 \neq \beta_2\}$
$\theta_5 = \{(\beta_1, \beta_2, \beta_3) \in R^3 : \beta_1 \neq \beta_2,\ \beta_1 \neq \beta_3,\ \beta_2 \neq \beta_3\}$

The aim of any multiple comparison procedure is to detect the true state of nature.

Let $D$ be the set of all decisions which can be made on the basis of observations. We assume that $D \subseteq \Theta$. We define the loss function in the following manner:

$$L(d, \theta) = \begin{cases} 0, & \text{if } d = \theta, \\ 1, & \text{if } d \neq \theta, \end{cases}$$

for $d \in D$ and $\theta \in \Theta$. This loss function gives a penalty of one when our decision is not correct.

If we denote by $\mathcal{X}$ the space of all observations, then the function $\delta : \mathcal{X} \ni x \mapsto d \in D$ is called a decision rule. The considered procedure of multiple comparisons may be described as a decision rule. A decision rule $\delta$ is characterized by its risk function, i.e. the average loss. Let $(\beta_1, \ldots, \beta_K) \in \theta$. Then the risk function of the rule $\delta$ equals

$$R_\delta(\beta_1, \ldots, \beta_K) = E_{(\beta_1, \ldots, \beta_K)} L(\delta(x), \theta) = P_{(\beta_1, \ldots, \beta_K)}\{\delta(x) \neq \theta\}.$$

Note that in general the risk depends on the differences between the values of the parameters $(\beta_1, \ldots, \beta_K)$. For example, if we assume $K = 3$ and $\sigma^2 = 1$, then it is easier to misclassify for $\beta_1 = \beta_2 = 1$, $\beta_3 = 11$ than for $\beta_1 = \beta_2 = 1$, $\beta_3 = 5$, although both belong to the same state of nature. Only in the case $\beta_1 = \cdots = \beta_K = \beta$ does the risk not depend on the value of $\beta$.
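The decision-theoretic notions above translate directly into code. The following sketch is only an illustration with hypothetical names (`state_of_nature`, `loss`, `estimated_risk`): a state of nature is represented as a set of index groups, the loss is the 0-1 loss defined above, and the risk is estimated by the average loss over a list of decisions. The parameter values and decisions are made up.

```python
def state_of_nature(betas):
    """Division of the indices 0..K-1 into groups of equal coefficients."""
    groups = {}
    for k, b in enumerate(betas):
        groups.setdefault(b, set()).add(k)
    return frozenset(frozenset(g) for g in groups.values())


def loss(decision, theta):
    """0-1 loss: zero for a correct decision, one otherwise."""
    return 0 if decision == theta else 1


def estimated_risk(decisions, theta):
    """Average loss, i.e. the fraction of incorrect decisions."""
    return sum(loss(d, theta) for d in decisions) / len(decisions)


# Made-up parameter vector with beta_1 = beta_2 and a different beta_3.
theta = state_of_nature([2.0, 2.0, 3.0])
all_equal = state_of_nature([2.0, 2.0, 2.0])
# Four hypothetical decisions, one of them wrong.
risk = estimated_risk([theta, all_equal, theta, theta], theta)
```

One minus the estimated risk is the estimated probability of the correct decision.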

The risk of the rule $\delta$ is the probability of a false decision. This probability should be as small as possible. In our investigation we are interested in the probability of the correct decision, which is equal to $1 - R_\delta$.

The most common approach to the problem under consideration is via the theory of multiple hypothesis testing. In that framework different criteria of goodness are considered, such as the Familywise Error Rate or the Per Comparison Error Rate, connected with controlling the risk of committing an error of type I (see Gather et al., 1996). Those criteria may be considered generalizations of the notion of the significance level in the Neyman-Pearson theory of testing hypotheses. In that terminology, the probability of the correct decision is a criterion which simultaneously takes into account the possibility of committing errors of type I and type II, as Wald's decision theory does.

Note that the imposed criterion, as opposed to the theory of multiple hypothesis testing, does not keep the Familywise Error Rate at a fixed level. Thus it is advisable to consider rules $\delta$ with $R_\delta$ for $\beta_1 = \cdots = \beta_K$ equal to the significance level of the hypothesis $H: \beta_1 = \cdots = \beta_K$. As a consequence, the results obtained by the rule cannot contradict those obtained for the above hypothesis.

The weak point of the presented approach is that there is no possibility of obtaining a uniformly best procedure. On the other hand, we avoid the obvious disadvantage that the constructed procedures will be too conservative (i.e. giving too large homogeneous groups).

5 Experiment

The probability of the correct decision was estimated on the basis of a simulation experiment. In the experiment we chose $K = 5$ regression functions, and each of them was observed 20 times. Random errors were normal with $\sigma = 1$. Parameters $\alpha_1, \ldots, \alpha_5$ were zero. The regression functions were considered on the interval $[-1, 1]$.

For five regression functions there are 67 states of nature, but it is enough to consider seven of them. The considered states are given below. The notation $\{\beta_1 = \beta_2 = \beta_3, \beta_4, \beta_5\}$ means the following subset:

$$\{(\beta_1, \beta_2, \beta_3, \beta_4, \beta_5) : \beta_1 = \beta_2 = \beta_3,\ \beta_4 \neq \beta_1,\ \beta_5 \neq \beta_1,\ \beta_4 \neq \beta_5\} \subset R^5.$$

Number of groups    State of nature                                                    Notation
1                   $\{\beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5\}$              {5}
2                   $\{\beta_1 = \beta_2 = \beta_3,\ \beta_4 = \beta_5\}$              {2, 3}
2                   $\{\beta_1 = \beta_2 = \beta_3 = \beta_4,\ \beta_5\}$              {1, 4}
3                   $\{\beta_1 = \beta_2 = \beta_3,\ \beta_4,\ \beta_5\}$              {1, 1, 3}
3                   $\{\beta_1 = \beta_2,\ \beta_3 = \beta_4,\ \beta_5\}$              {1, 2, 2}
4                   $\{\beta_1 = \beta_2,\ \beta_3,\ \beta_4,\ \beta_5\}$              {1, 1, 1, 2}
5                   $\{\beta_1,\ \beta_2,\ \beta_3,\ \beta_4,\ \beta_5\}$              {1, 1, 1, 1, 1}

For each state of nature, regression coefficients were generated from the interval (5, 15) according to the uniform distribution. For example, for the state {2, 3} two numbers $x, y$ were generated from the distribution $U(5, 15)$ and it was set $\beta_1 = \beta_2 = \beta_3 = x$ and $\beta_4 = \beta_5 = y$. Such generation was repeated one hundred times. For each generated set of regression coefficients $(\beta_1, \ldots, \beta_5)$, 1 draws of samples $(x_{ki}, Y_{ki})$ for $k = 1, \ldots, 5$ and $i = 1, \ldots, 20$ were made, such that $Y_{ki} = \beta_k x_{ki} + \varepsilon_{ki}$. The described procedure was applied to each sample and it was noted whether the obtained division of regression coefficients was consistent with the state of nature. The probability of the correct decision was estimated by the fraction of divisions consistent with the state of nature.

It is obvious that the probability of the correct decision depends on the plan of the experiment, i.e. on the choice of the values of the regressors. Three plans were considered. In the first case (random plan), twenty values of $x$ were chosen randomly from the interval $[-1, 1]$ (according to the uniform distribution) for every regression function separately. The second plan was a naive one, i.e. the values of the regressor were $-1 + 2i/19$ for $i = 0, 1, \ldots, 19$. The third plan was the G-optimal plan, in which $x = -1$ or $x = 1$ and ten observations were taken at each $x$. The second and the third plans were common to all regression functions.

6 Results

The results are presented graphically. On the y axis is the estimated probability (multiplied by 1) of the correct decision, while on the x axis is the minimal distance between groups. The solid line represents the probability of the correct decision for the G-optimal plan, the dashed line for the naive plan, and the dotted line for the random plan. On the basis of the simulations we may formulate the following conclusions.

1. The proposed procedure for detecting the division of regression coefficients that corresponds to the true state of nature is more precise when the number of groups of coefficients is small. As the number of groups of coefficients increases, the probability of detecting differences between them decreases.

2. For each division of the regression coefficients, the best plan of experiment was the G-optimal plan. The probability of taking the correct decision is very high even for small differences between the sets of regression coefficients.

The above conclusions hold for five regression functions, but it may be expected that similar conclusions can be formulated for a higher number of functions.

7 Literature

Gather U., Pawlitschko J., Pigeot I. (1996) Unbiasedness of multiple tests, Scandinavian Journal of Statistics 23.
Hochberg Y., Tamhane A. C. (1988) Multiple Comparison Procedures, John Wiley & Sons.
Miller Jr. R. G. (1982) Simultaneous Statistical Inference, 2nd ed., Springer-Verlag.
Zieliński W. (1992) Monte Carlo comparison of multiple comparison procedures, Biometrical Journal 34.

Summary (translated from the Polish: Porównania wielokrotne współczynników kierunkowych prostych regresji)

The paper considers the problem of comparing $K$ regression lines. A procedure for finding groups of parallel lines is proposed.

Key words: multiple comparisons, simultaneous inference, regression analysis

AMS subject classification: 62J15, 62J10

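8 Figures

[Figures: estimated probability of the correct decision (y axis) against the minimal distance between groups (x axis), one panel for each of the states {1, 4}, {2, 3}, {1, 1, 3}, {1, 2, 2}, {1, 1, 1, 2} and {1, 1, 1, 1, 1}.]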
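The three plans of Section 5 can be compared through the quantity $SS_k$. Since $W_3 = \operatorname{diag}\{1/SS_1, \ldots, 1/SS_K\}$, the variance of each slope estimator is $\sigma^2 / SS_k$, so a plan with a larger $SS_k$ estimates the slope more precisely. The sketch below is an illustration only (the function names are hypothetical, and the random plan uses Python's `random` module with an arbitrary seed rather than whatever generator the authors used); it builds the three plans for twenty observations on $[-1, 1]$ and computes $SS_k$ for each.

```python
import random


def sum_of_squares(xs):
    """SS = sum of squared deviations of the regressor values from their mean."""
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** 2 for x in xs)


def naive_plan(n=20):
    """Equally spaced points -1 + 2i/(n-1), i = 0, ..., n-1."""
    return [-1 + 2 * i / (n - 1) for i in range(n)]


def g_optimal_plan(n=20):
    """Half of the observations at x = -1 and half at x = 1."""
    return [-1.0] * (n // 2) + [1.0] * (n // 2)


def random_plan(n=20, rng=None):
    """Points drawn uniformly from [-1, 1]."""
    rng = rng or random.Random()
    return [rng.uniform(-1, 1) for _ in range(n)]


ss_naive = sum_of_squares(naive_plan())
ss_g = sum_of_squares(g_optimal_plan())
ss_random = sum_of_squares(random_plan(rng=random.Random(0)))  # arbitrary seed
```

For twenty observations the G-optimal plan gives $SS_k = 20$, the largest value attainable on $[-1, 1]$, while the naive plan gives $2660/361 \approx 7.37$. This is in line with the conclusion that the G-optimal plan performed best in the simulations.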
