UNIVERSIDADE DE SANTIAGO DE COMPOSTELA DEPARTAMENTO DE ESTATÍSTICA E INVESTIGACIÓN OPERATIVA


GOODNESS OF FIT TEST FOR LINEAR REGRESSION MODELS WITH MISSING RESPONSE DATA

González Manteiga, W. and Pérez González, A.

Reports in Statistics and Operations Research

Goodness of fit test for linear regression models with missing response data

González Manteiga, W. and Pérez González, A.

Key words and phrases: Bootstrap; Goodness-of-fit test; Missing at random; Multivariate local linear smoother; Nonparametric regression.

Abstract: In this paper we propose tests to check the hypothesis of a linear regression model when there are missing data in the response variable. The test statistics are based on the $L_2$ distance between nonparametric regression estimators and root-$n$-consistent estimators of the regression function under the parametric model. We obtain the limit distribution of each statistic and prove the validity of its bootstrap version. Finally, in a simulation study, we compare the level and the power of the tests computed from incomplete samples with those obtained from the complete samples.

1. INTRODUCTION

Let $m(x) = E[Y \mid X = x]$, $x \in \mathbb{R}^d$, be the regression function associated with a random vector $(X, Y) \in \mathbb{R}^{d+1}$. In a parametric regression context, $m$ is assumed to belong to a certain family $M_\Theta = \{m_\theta(\cdot),\ \theta \in \Theta\}$ depending on some $p$-dimensional parameter $\theta$; important examples are linear models (Seber (1977)), among others. For many years numerous efforts have been directed towards inference on these parameters. However, it is clear that before any conclusion can be drawn about them, it is necessary to study whether or not the assumed parametric model is correct. This can be done by testing

$$H_0: m \in M_\Theta = \{m_\theta(\cdot),\ \theta \in \Theta\},\ \Theta \subset \mathbb{R}^p \quad \text{versus} \quad H_1: m \notin M_\Theta.$$

In contrast to parametric estimation, over the last three decades several nonparametric estimators of $m$ have been developed, which avoid the problem of assuming a parametric form for the regression function (kernel-type estimators, local polynomial smoothers, splines, etc.; see for instance Fan and Gijbels (1996)). On certain occasions these estimators have been used as pilot estimators in tests that validate a parametric model (Härdle and Mammen (1993), Alcalá, Cristóbal and González-Manteiga (1999), among others). All these test statistics have been developed for the context of complete

samples. In practice, however, sample observations are often unavailable, owing to errors in the measuring apparatus, flaws in the experimental design, refusals to answer certain questions in surveys, and so on. In the field of regression estimation with missing data, both the parametric context (Little (1992), Wang C. Y. et al. (2002), etc.) and the nonparametric context (Chu and Cheng (1995), Wang C. Y. et al. (1998), González Manteiga and Pérez González (2003), etc.) have been studied.

The aim of this paper is to consider a goodness of fit test for a linear regression model adapted to the case in which the response variable $Y$ has missing observations and the covariate $X$ is completely observed. As a test statistic, we shall use the distance (based on a weighted $L_2$ norm) between the nonparametric and parametric estimates of the regression function under the null hypothesis of linearity. When estimating the regression function in this situation there are two alternatives: either to consider only the complete observations, or to first impute the incomplete ones and subsequently carry out the estimation with the completed sample. Bearing this in mind, two possible nonparametric estimators are considered, based on the Multivariate Local Linear Smoother (Ruppert and Wand (1994)): the Simplified Multivariate Local Linear Smoother and the Imputed Multivariate Local Linear Smoother. Both estimators have recently been studied by González Manteiga, W. and Pérez González, A. (2003). The two test statistics studied in this paper are based on these two estimators.

As in the case of complete data, the convergence of the distribution of the test statistics to their normal-type asymptotic distribution is, in general, slow, so a bootstrap procedure is proposed for the approximation of the critical values of the test. In this paper we design a bootstrap resampling mechanism adapted to the absence of data in the response variable, and we show the asymptotic validity of the bootstrap versions of the proposed test statistics.

Our interest lies not only in obtaining a goodness of fit test, which up until now has been nonexistent, for a linear regression model with missing data in the response variable, but also in ascertaining which of the two proposed statistics behaves better. The asymptotic results obtained, as well as the simulations carried out, reveal that with an adequate choice of the smoothing parameters, the test based on the imputed estimator behaves better than the one based on the simplified estimator, as can be seen in the simulation study of Section 5. Moreover, the asymptotic results we obtain extend existing studies for complete data to this context of incompleteness, such as those of Härdle and Mammen (1993), who proposed a goodness of fit test for a parametric model using the Nadaraya-Watson estimator (1964) as the nonparametric smoother, or those of Alcalá, Cristóbal and González-Manteiga (1999) for polynomial parametric models using the local polynomial smoother.

The rest of the article is organized as follows. In the next section we

present the regression model with missing data, as well as the nonparametric estimators used. In Section 3 we derive the asymptotic distribution of both statistics. Section 4 focuses on the bootstrap approximation of the test statistics. In Section 5 we present some of the simulation results, obtained through the aforementioned resampling mechanism, which compare the behavior of both tests, and of the test based on the complete sample, with respect to power, sample size, etc. Section 6 collects the conclusions and, finally, Section 7 contains the proofs of the results appearing in the previous sections.

2. THE REGRESSION MODEL WITH MISSING DATA AND THE LOCAL LINEAR SMOOTHERS

We shall consider the general heteroscedastic regression model:

$$Y = m(X) + \sigma(X)\,\varepsilon = m(X) + \eta,$$

where $\varepsilon$ is the error term, assumed to have mean zero and unit variance, and $\sigma^2(x) = \mathrm{Var}[Y \mid X = x]$. In the case of no missing observations, we have a sample $\{(X_i, Y_i)\}_{i=1}^n$ of i.i.d. (independent and identically distributed) copies of the random vector $(X, Y) \in \mathbb{R}^{d+1}$. In our case, $Y_i$ may not be observed for some indices $i$, which means that we observe $(X_i, Y_i) \in \mathbb{R}^{d+1}$ if $Y_i$ is available, and only $X_i \in \mathbb{R}^d$ otherwise. To record whether or not an observation is complete, a new variable $\delta$, an indicator of the missing observations, is introduced in the model. Thus, for each index $i$, $\delta_i = 1$ if $Y_i$ is observed, and $\delta_i = 0$ if $Y_i$ is missing.
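As an illustration of this setup, the following Python sketch draws a sample $\{(X_i, Y_i, \delta_i)\}$ from a heteroscedastic model with responses missing at random. It is only a sketch of our own: the function names and the particular choices of $m$, $\sigma$ and $p$ below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_sample(n, m, sigma, p):
    """Draw (X_i, Y_i, delta_i) from Y = m(X) + sigma(X)*eps, with the
    response missing at random: P(delta = 1 | X, Y) = p(X)."""
    x = rng.uniform(0.0, 1.0, n)              # covariate, always observed
    eps = rng.standard_normal(n)              # error: mean zero, unit variance
    y = m(x) + sigma(x) * eps
    delta = rng.binomial(1, p(x))             # 1 = response observed, 0 = missing
    y_obs = np.where(delta == 1, y, np.nan)   # missing responses stored as NaN
    return x, y_obs, delta

# illustrative choices (ours, not prescribed by the model above):
x, y, delta = generate_sample(
    200,
    m=lambda t: 5 * t,
    sigma=lambda t: 0.5 * np.ones_like(t),
    p=lambda t: 1 - 0.4 * np.exp(-5 * (t - 0.4) ** 2),
)
```

Note that $p(\cdot)$ depends only on the covariate, which is exactly the MAR mechanism formalized in (1) below.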

Following the guidelines laid down in the literature (see Little and Rubin (2002), among others), it is necessary to establish whether or not the loss of a datum is independent of the values of the observed and/or missing data. In this paper we model the aforementioned loss by assuming that the data are missing at random (MAR), i.e.:

$$P(\delta = 1 \mid Y, X) = P(\delta = 1 \mid X) = p(X), \quad X \in \mathbb{R}^d. \quad (1)$$

This model has been previously used by various authors, such as Cheng (1994), Chu and Cheng (1995) and Wang and Rao (2001, 2002), among others. In this case, the variable in which the missing data appear is not the cause of the loss.

When there are no missing observations, one possible nonparametric estimator of the multidimensional regression function is the Multivariate Local Linear Smoother (MLLS) studied by Ruppert and Wand (1994), among others. This estimator results from minimizing:

$$\min_{\alpha, \beta} \sum_{i=1}^{n} \left\{ Y_i - \alpha - \beta^t (X_i - x) \right\}^2 K_H(X_i - x),$$

where $K_H(u) = |H|^{-1/2} K(H^{-1/2} u)$, $H$ is a symmetric positive definite $d \times d$ matrix, and $K$ is a $d$-dimensional kernel function, that is, $K \geq 0$ and

$\int K(u)\,du = 1$. The explicit expression for this estimator is:

$$\widehat{m}_H(x) = e_1^t \left( X_x^t W_{x,H} X_x \right)^{-1} X_x^t W_{x,H} Y, \quad (2)$$

where

$$X_x = \begin{pmatrix} 1 & (X_1 - x)^t \\ \vdots & \vdots \\ 1 & (X_n - x)^t \end{pmatrix}, \quad W_{x,H} = \mathrm{diag}\left\{ (K_H(X_i - x))_{i=1}^n \right\},$$

$Y = (Y_1, \ldots, Y_n)^t$ and $e_1$ is the $(d+1) \times 1$ vector with 1 in the first coordinate and zeros elsewhere.

A very simple way of estimating the regression function with missing data in the response variable is the Simplified Multivariate Local Linear Smoother (SMLLS), which uses only the complete observations, that is, those with $\delta_i = 1$. The minimization problem becomes:

$$\min_{\alpha, \beta} \sum_{i=1}^{n} \left\{ Y_i - \alpha - \beta^t (X_i - x) \right\}^2 K_H(X_i - x)\,\delta_i,$$

from which the explicit expression of the estimator is deduced as:

$$\widehat{m}_{S,H}(x) = \widehat{\alpha} = e_1^t \left( X_x^t W_{x,H}^\delta X_x \right)^{-1} X_x^t W_{x,H}^\delta Y, \quad (3)$$

where $X_x$ has the same expression as for complete data and $W_{x,H}^\delta = \mathrm{diag}\{(K_H(X_i - x)\,\delta_i)_{i=1}^n\}$.

Another possibility is the Imputed Multivariate Local Linear Smoother (IMLLS), constructed in two stages. At the first stage the missing observations are estimated by means of the SMLLS, and the sample is completed. In this way a completed sample of the form $(X_i^t, \widehat{Y}_i) \in \mathbb{R}^{d+1}$, $i = 1, \ldots, n$, is obtained, where $\widehat{Y}_i = \delta_i Y_i + (1 - \delta_i)\,\widehat{m}_{S,G,(L)}(X_i)$, with $\widehat{m}_{S,G,(L)}(X_i)$ the estimate of $m$ at the point $X_i$ given by the SMLLS (3), using a bandwidth matrix $G$ and a kernel function $L$. Once the sample has been completed, the Multivariate Local Linear Smoother is applied to the data $\{(X_i^t, \widehat{Y}_i)\}_{i=1}^n$, from which the estimator is deduced as:

$$\widehat{m}_{I,H,G}(x) = \widehat{\alpha} = e_1^t \left( X_x^t W_{x,H} X_x \right)^{-1} X_x^t W_{x,H} \widehat{Y}, \quad (4)$$

where $\widehat{Y} = (\widehat{Y}_1, \ldots, \widehat{Y}_n)^t$ is the imputed response vector.

3. ASYMPTOTIC RESULTS

The aim of this paper is to test whether or not the regression function $m$ is linear, that is:

$$H_0: m \in M_\Theta \quad \text{versus} \quad H_1: m \notin M_\Theta,$$

with $M_\Theta = \{m_\theta(x) = \theta_0 + \theta_1^t x,\ \theta = (\theta_0, \theta_1)^t \in \Theta\}$, $\Theta \subset \mathbb{R}^{d+1}$, in the case of the response variable having missing observations. In order to do this, we shall compare the parametric estimation under the null hypothesis and the nonparametric one in both of the previously considered

situations, that is, considering only the complete observations, or imputing the missing observations. To measure the distance between the estimates we shall use the weighted $L_2$ norm, previously used by Härdle and Mammen (1993) and Alcalá, Cristóbal and González Manteiga (1999), among others. As a nonparametric estimator the former authors used the Nadaraya-Watson estimator (1964), which is biased even under the null hypothesis of linearity, so that this bias had to be corrected by smoothing the parametric residuals. This did not occur in the work of Alcalá, Cristóbal and González Manteiga (1999), which, in spite of testing a polynomial model, guaranteed the unbiasedness of the nonparametric estimator by using a local polynomial smoother of sufficient order. Since the objective of this paper is to test a linear model, the use of the simplified and imputed estimators guarantees unbiasedness under our null hypothesis (due to their construction based on the local linear smoother), thus avoiding the need to smooth the parametric estimate.

For the SMLLS estimator (3) we consider the following test statistic:

$$T_{n,S} = n\,|H|^{1/4} \int \left( \widehat{m}_{S,H}(x) - m_{\widehat{\theta}_n}(x) \right)^2 w(x)\,dx.$$

In the same way, we consider the following statistic for the IMLLS estimator (4):

$$T_{n,I} = n\,|H|^{1/4} \int \left( \widehat{m}_{I,H,G}(x) - m_{\widehat{\theta}_n}(x) \right)^2 w(x)\,dx.$$

In general, the asymptotic distribution of the test statistics is analyzed

under a local alternative, which in this case we shall take to be:

$$m(x) = m_{\theta_0}(x) + c_n s(x),$$

with $c_n = \left( n\,|H|^{1/4} \right)^{-1/2}$ and $s(x)$ belonging to the class of functions orthogonal to $M_\Theta$ with respect to the inner product $\langle s, t \rangle = \int s(x)\,t(x)\,w(x)\,dx$.

Next, some considerations regarding the notation used throughout this paper. We denote by $\stackrel{L}{\longrightarrow}$ the convergence in distribution, the symbol $*$ means convolution, and $K^{(j)}(a)$ represents the $j$-th convolution of the function $K$ at the point $a$. Furthermore, we define

$$A(t) = \int K(u)\,L\!\left( G^{-1/2} H^{1/2} (t - u) \right) du,$$

with $A_H(t) = |H|^{-1/2} A(H^{-1/2} t)$, and $q(x) = 1 - p(x)$.

The hypotheses needed to obtain the asymptotic distribution of the statistics are:

(A.1) The variable $X$, with density function $f$, lies in a compact set with probability one.

(A.2) The functions $m$, $f$ and $p$ are twice continuously differentiable. The function $w$ is positive and continuously differentiable.

(A.3) The functions $f$ and $p$ are bounded away from zero and from infinity.

(A.4) The conditional variance function $\mathrm{Var}(Y \mid X = x) = \sigma^2(x)$ is bounded away from $0$ and from $\infty$, and is continuous.

(A.5) The kernel functions $K$ and $L$ are symmetric continuous densities with compact support, and such that $\int K(u)\,u\,du = 0$ and $\int K(u)\,u u^t\,du = \mu_2(K)\,I$, where $\mu_2(K)$ is a scalar and $I$ is the $d$-dimensional identity matrix. We will denote $R[K] = \int K^2(u)\,du$.

(A.6) The orthogonal component $s(x)$ is bounded uniformly in $x$.

(A.7) $E[\varepsilon^4]$ exists.

(A.8) The difference $\widehat{m}_{\theta_n}(x) - m_{\theta_0}(x) = O_P\!\left( n^{-1/2} \right)$ uniformly in $x$.

(A.9) The matrix $H$ is symmetric and positive definite, with each of its elements tending towards zero, and $|H|^{1/2}\, n^{d/(d+4)} \to \infty$ as $n \to \infty$.

(A.10) (Only applicable for the imputed estimator.) Apart from being symmetric, positive definite and having all its elements tending towards zero, the imputation matrix $G$ should verify that $n\,|H|\,|G|^2 = O(1)$ and $n^{3/2}\,|H|^{1/2}\,|G| \to \infty$ as $n \to \infty$.

(A.11) (Only applicable for the imputed estimator.) The kernel function $L$ is Lipschitz continuous.

(A.12) (Only applicable for the imputed estimator.) The modulus of continuity of the function $p(x)^{-1}$ is uniformly bounded (see the definition in Billingsley (1999), p. 80, for example).
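To fix ideas before stating the theorems, the simplified (3) and imputed (4) smoothers can be sketched numerically in one dimension. The following Python sketch is our own illustration: function names are ours, and we use a Gaussian kernel for numerical stability even though (A.5) asks for a compactly supported one.

```python
import numpy as np

def local_linear(x0, x, y, weights, h):
    """Weighted local linear fit at x0: the intercept of the kernel-weighted
    least squares line, as in the multivariate formula (2), with d = 1."""
    k = np.exp(-0.5 * ((x - x0) / h) ** 2)   # Gaussian kernel (our choice)
    w = k * weights
    X = np.column_stack([np.ones_like(x), x - x0])
    # solve (X' W X) beta = X' W y; beta[0] is the fitted value at x0
    beta = np.linalg.solve((X.T * w) @ X, (X.T * w) @ y)
    return beta[0]

def m_simplified(x0, x, y, delta, h):
    """SMLLS (3): only complete cases enter, weight K_h(X_i - x0) * delta_i."""
    return local_linear(x0, x, np.where(delta == 1, y, 0.0),
                        delta.astype(float), h)

def m_imputed(x0, x, y, delta, h, g):
    """IMLLS (4): first impute each missing Y_i by the SMLLS with bandwidth g,
    then apply the ordinary local linear smoother to the completed sample."""
    y_hat = np.array([y[i] if delta[i] == 1
                      else m_simplified(x[i], x, y, delta, g)
                      for i in range(x.size)])
    return local_linear(x0, x, y_hat, np.ones_like(x), h)
```

Because the local linear fit reproduces straight lines exactly, both smoothers are unbiased under the null hypothesis of linearity, which is the property exploited by the test statistics above.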

Theorem 1 Under hypotheses (A.1)-(A.9), the asymptotic distribution of the test statistic based on the Simplified estimator is:

$$T_{n,S} \stackrel{L}{\longrightarrow} N(b_S, V_S),$$

where

$$b_S = \int \left[ (K_H * s)(x) \right]^2 w(x)\,dx + |H|^{-1/4} \int \frac{w(x)\,\sigma^2(x)}{f(x)\,p(x)}\,R[K]\,dx,$$

and

$$V_S = 2 K^{(4)}(0) \int \left( \frac{w(x)\,\sigma^2(x)}{f(x)\,p(x)} \right)^2 dx.$$

Theorem 2 Under hypotheses (A.1)-(A.12), the asymptotic distribution of the test statistic based on the Imputed estimator is obtained. We distinguish three cases according to the asymptotic behaviour of the smoothing parameter $G$:

a) For the case in which $|G|^{1/2} = O(1)\,|H|^{1/2}$, that is, $|G|^{1/2} = \alpha\,|H|^{1/2}$ for some scalar $\alpha > 0$:

$$T_{n,I} \stackrel{L}{\longrightarrow} N(b_{I,1}, V_{I,1}),$$

where the asymptotic bias is

$$b_{I,1} = \int \left[ p(x)(K_H * s)(x) + q(x)\,\alpha^{-1}(A_H * s)(x) \right]^2 w(x)\,dx + |H|^{-1/4} \int \frac{w(x)\,\sigma^2(x)}{f(x)}\,v_1(x)\,dx,$$

and the asymptotic variance is

$$V_{I,1} = 2 \int \left( \frac{w(x)\,\sigma^2(x)}{f(x)\,p(x)} \right)^2 c_1(x)\,dx,$$

where

$$v_1(x) = p(x)\,R[K] + \alpha^{-2}\,\frac{q^2(x)}{p(x)} \int A^2(u)\,du + 2\,\alpha^{-1}\,q(x) \int K(u)\,A(u)\,du, \quad (5)$$

and

$$c_1(x) = K^{(4)}(0) + 4\,\frac{|H|^{1/2}}{|G|^{1/2}}\,\frac{q(x)}{p(x)}\,[K * K * K * A](0) + 6\,\frac{|H|}{|G|}\,\frac{q^2(x)}{p^2(x)}\,[K * K * A * A](0) + 4\,\frac{|H|^{3/2}}{|G|^{3/2}}\,\frac{q^3(x)}{p^3(x)}\,[A * A * A * K](0) + \frac{|H|^2}{|G|^2}\,\frac{q^4(x)}{p^4(x)}\,A^{(4)}(0).$$

b) If $|G|^{1/2} / |H|^{1/2} \to 0$, that is, if $G$ tends to zero faster than $H$:

$$T_{n,I} \stackrel{L}{\longrightarrow} N(b_{I,2}, V_{I,2}),$$

where the asymptotic bias is

$$b_{I,2} = \int \left[ (K_H * s)(x) \right]^2 w(x)\,dx + |H|^{-1/4} \int \frac{w(x)\,\sigma^2(x)}{f(x)}\,v_2(x)\,dx,$$

and the asymptotic variance is

$$V_{I,2} = 2 K^{(4)}(0) \int \left( \frac{w(x)\,\sigma^2(x)}{f(x)\,p(x)} \right)^2 dx,$$

with $v_2(x) = R[K]/p(x)$.

c) Finally, when $|H|^{1/2} / |G|^{1/2} \to 0$, that is, if $H$ tends to zero faster than $G$:

$$T_{n,I} \stackrel{L}{\longrightarrow} N(b_{I,3}, V_{I,3}),$$

where the asymptotic bias is

$$b_{I,3} = \int \left[ p(x)(K_H * s)(x) + q(x)(L_G * s)(x) \right]^2 w(x)\,dx + |H|^{-1/4} \int \frac{w(x)\,\sigma^2(x)}{f(x)}\,v_3(x)\,dx,$$

and the asymptotic variance is

$$V_{I,3} = 2 \int \left( \frac{w(x)\,\sigma^2(x)}{f(x)\,p(x)} \right)^2 c_3(x)\,dx,$$

with

$$v_3(x) = p(x) \int K^2(u)\,du + \frac{q^2(x)}{p(x)}\,\frac{|H|^{1/2}}{|G|^{1/2}} \int L^2(u)\,du + 2\,q(x)\,\frac{|H|^{1/2}}{|G|^{1/2}}\,L(0),$$

and

$$c_3(x) = K^{(4)}(0) + \frac{|H|^{1/2}}{|G|^{1/2}} \left[ 4\,\frac{q(x)}{p(x)}\,L(0) + 6\,\frac{q^2(x)}{p^2(x)}\,L^{(2)}(0) + 4\,\frac{q^3(x)}{p^3(x)}\,L^{(3)}(0) + \frac{q^4(x)}{p^4(x)}\,L^{(4)}(0) \right].$$

The following remarks are of interest.

Remark 1 The asymptotic distribution of the tests is generally obtained under an alternative hypothesis which converges asymptotically to the null hypothesis. It can be observed that the convergence rate we use here, $c_n = (n\,|H|^{1/4})^{-1/2}$, extends that used by Härdle and Mammen (1993) with a scalar smoothing matrix ($H = h^2 I$, $c_n = (n h^{d/2})^{-1/2}$), or by Alcalá et al. (1999) for the one-dimensional case with complete samples ($c_n = (n h^{1/2})^{-1/2}$).

Regarding the hypotheses needed to obtain these results, hypotheses A.1-A.7 are similar to those used for complete data, with the only difference that here we also need some conditions on the missing data model $p$ in (1). Hypothesis A.8 is analogous to that used for complete data (see Alcalá et al. (1999)), where $\widehat{m}_{\theta_n}$ is a parametric linear estimator of the regression function with missing data. It is easy to see that, by applying the least squares method, the estimator of $\theta$ obtained from using only the complete observations, $\widehat{\theta}_n = (\widehat{\theta}_0, \widehat{\theta}_1^t)^t$, coincides with that obtained from the imputed sample when the imputations are made under the null hypothesis, $\widehat{Y}_i = \widehat{\theta}_0 + \widehat{\theta}_1^t X_i$ (see Little and Rubin (2002) for more details). This implies that the parametric least squares estimator $\widehat{\theta}_n$ is the same for $T_{n,S}$ and $T_{n,I}$. If we have the whole sample, then the rate of convergence of the parametric estimator is known to be $n^{-1/2}$. However, if we have missing observations and we use only the complete subsample, then the rate of convergence is $(n_1)^{-1/2}$, where $n_1$ is the size of the complete subsample; but $n_1$ is of order $n\,E[p(X)]$ in probability, so under the hypotheses bounding $p$ away from zero, our estimator can be considered root-$n$-consistent. Hypothesis A.9 extends that used for complete data (see for example Alcalá et al. (1999)). Finally, hypotheses A.10-A.12 are used to obtain the asymptotic representation of the weights of the imputed estimator (4).

Remark 2 Note that the terms $v_1(x)$, $v_2(x)$ and $v_3(x)$ are also the expressions that appear in the variance of the Imputed estimator, IMLLS (4), for the cases $|G|^{1/2} = \alpha\,|H|^{1/2}$, $|G|^{1/2}/|H|^{1/2} \to 0$ and $|H|^{1/2}/|G|^{1/2} \to 0$, respectively (see González Manteiga, W. and Pérez González, A. (2003) for more details).

Remark 3 It is important to point out that the asymptotic distributions obtained extend the existing results for complete samples to the case of absence of data in the response variable. In the particular case of no missing observations, it is immediately obvious that both test statistics $T_{n,S}$ and $T_{n,I}$ coincide, and therefore so do their asymptotic distributions. If we denote the statistic based on the complete sample by $T_{n,C}$, we get that:

$$T_{n,C} \stackrel{L}{\longrightarrow} N(b, V),$$

with

$$b = \int \left[ (K_H * s)(x) \right]^2 w(x)\,dx + |H|^{-1/4} \int \frac{w(x)\,\sigma^2(x)}{f(x)}\,R[K]\,dx,$$

and

$$V = 2 K^{(4)}(0) \int \left( \frac{w(x)\,\sigma^2(x)}{f(x)} \right)^2 dx.$$

This asymptotic distribution coincides (taking $H = h^2 I$) with the one obtained by Härdle and Mammen (1993) with a statistic based on the smoothing of the parametric residuals. It is also clear from their work that if the smoothing is not carried out then, for the reasons mentioned above, a term appears in the bias of the asymptotic distribution depending on the first-order derivatives of the function $m$. On the other hand, the result obtained by Alcalá, Cristóbal and González-Manteiga (1999) for the local linear smoother is extended here to the multidimensional case.

Remark 4 A general characteristic of the previous distributions worth pointing out is that the data observation probability (1) affects the asymptotic distributions considerably. It can immediately be seen that the bias and variance terms increase as we lose data (as the value of $p(\cdot)$ decreases).

Observing the asymptotic distributions obtained for $T_{n,S}$ and $T_{n,I}$ in the case $|G|^{1/2}/|H|^{1/2} \to 0$, it can be seen that both expressions are asymptotically equivalent, which in turn indicates that carrying out the imputation with a bandwidth parameter $G$ of order of convergence lower than that of the estimation parameter $H$ does not bring about any improvement in the test performance; rather, it merely implies an increase in computational time. From this it follows that imputation is not recommended in this case. This situation also arose when comparing the asymptotic mean squared errors of both estimators for this case in the work of González Manteiga, W. and Pérez González, A. (2003).

The behaviour of the test in the other two cases is more complex. Bearing

in mind the considerations given in the aforementioned work, the case in which $|H|^{1/2}/|G|^{1/2} \to 0$ implies a degree of oversmoothing in the imputation which may considerably worsen the behaviour of the Imputed estimator. However, since our aim is to test a linear model and we are using a local linear estimator, the degree of oversmoothing in the parameter $G$ can lead to good behaviour of the Imputed test under the null hypothesis. This choice of imputation parameter provokes more conservative behaviour compared to the Simplified test. In the case $|G|^{1/2} = \alpha\,|H|^{1/2}$, on the other hand, it was observed in the estimation context that, with an appropriate choice of the imputation parameter, the imputed estimator is considerably better than the simplified one. Analogous behaviour could be expected for the tests; but, due to the complexity of the asymptotic distribution of the statistic $T_{n,I}$ in this case, we opted for a simulation study in order to carry out the aforementioned comparisons. It can be seen in the simulation study of Section 5 that the test based on the Imputed estimator is slightly better than that based on the Simplified one when the selection of the imputation parameter $G$ is appropriate.

4. BOOTSTRAP RESAMPLING

Since the speed of convergence of the distribution of the statistic to its Normal asymptotic distribution is generally quite slow, obtaining the critical points from this asymptotic distribution is in general not recommended. One method to approximate these critical values is a bootstrap resampling mechanism. In this paper we propose a method

based on the Wild Bootstrap, which Härdle and Mammen (1993) used previously, demonstrating its validity in the case of complete data. In our case we have to design a mechanism adapted to the situation in which there are missing observations in the response variable. We now describe the method for the Imputed estimator; the method for the Simplified estimator is obtained through the appropriate modifications.

Starting from a random sample $\{(X_i, Y_i)\}_{i=1}^n$, where $Y_i$ may not be observed for some indices $i$, we follow these steps:

1) Construction of the residuals. At the first stage the residuals are constructed:

$$\widehat{\eta}_i = Y_i - \widehat{m}_{\theta_n}(X_i), \quad \text{if } \delta_i = 1,$$

where $\widehat{m}_{\theta_n}$ is the least squares linear estimator with missing observations.

2) Construction of the bootstrap errors. Subsequently, the resampling of the available residuals is performed following the Wild Bootstrap methodology, obtaining $\{\eta_i^*\}_{i \in J}$ such that:

$$E^*[\eta_i^*] = 0, \quad E^*[(\eta_i^*)^2] = (\widehat{\eta}_i)^2, \quad E^*[(\eta_i^*)^3] = (\widehat{\eta}_i)^3, \quad i \in J.$$

Here $J$ is the set of indices such that $\delta_i = 1$.

3) Construction of the bootstrap sample. If $\delta_i = 1$, then $Y_i^* = \eta_i^* + \widehat{m}_{\theta_n}(X_i)$, where $\widehat{m}_{\theta_n}(X_i)$ is the parametric estimate at the point $X_i$; if $\delta_i = 0$, $Y_i^*$ is missing. In this way the bootstrap sample ends up as $(X_i, Y_i^*, \delta_i)$, $i = 1, 2, \ldots, n$. The process is repeated as many times as bootstrap samples we wish to construct.

The next theorem proves the validity of the bootstrap.

Theorem 3 Assume that hypotheses A.1-A.12 hold. Let $T_{n,S}^*$ and $T_{n,I}^*$ be the test statistics for the Simplified (3) and Imputed (4) estimators, respectively, computed over the bootstrap sample $\{(X_i, Y_i^*, \delta_i)\}_{i=1}^n$. Then

$$T_{n,S}^* \stackrel{L}{\longrightarrow} N(b_S, V_S) \quad \text{in probability};$$

furthermore, for the three cases considered in Theorem 2,

$$T_{n,I}^* \stackrel{L}{\longrightarrow} N(b_{I,j}, V_{I,j}) \quad \text{in probability, with } j = 1, 2, 3.$$

5. A SIMULATION STUDY

In this section we describe a simulation study designed to compare the performance of the simplified and imputed tests. To do this, we observed their behaviour using the test which results from having all the data in the sample ($T_{n,C}$) as a reference. We have used the Wild Bootstrap method, described in the previous section, in order to approximate the critical value of each test,

so that $H_0$ is rejected if $T_{n,C}$ ($T_{n,S}$ or $T_{n,I}$, respectively) is greater than the $1 - \alpha$ ($\alpha = 0.05$) quantile of the bootstrap distribution of the statistic, which is approximated by Monte Carlo.

We have considered the following one-dimensional regression model:

$$Y_i = 5 X_i + a X_i^2 + \sigma(X_i)\,\varepsilon_i, \quad 1 \leq i \leq n, \quad (6)$$

where the $X_i$ were generated from the uniform distribution on the unit interval $[0, 1]$ and $\varepsilon_i \sim N(0, 1)$. The Wild Bootstrap resampling was performed 500 times for each sample.

In the first place we considered the sample sizes $n = 50$ and $100$, two possible choices of $\sigma$ (0.5 and 1) and $a = 0, 1, 5$; furthermore, we took the missing data model as $p(x) = 1 - 0.4\exp(-5(x - 0.4)^2)$ (see Figure 1). The statistics $T_{n,C}$, $T_{n,S}$ and $T_{n,I}$ were calculated for various selections of the one-dimensional bandwidth parameter $h$ (or $g$ in the imputed case). For each combination of factors, the experiment was repeated 1000 times, and the percentage of rejections was calculated. The results appear in Table 1.

Figure 1: Model of missing data: $p(x) = 1 - 0.4\exp(-5(x - 0.4)^2)$.
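The resampling of Section 4 can be sketched in Python as follows. This is our own minimal illustration (the function names are ours), using the classical two-point wild bootstrap distribution, which satisfies the three moment conditions of step 2:

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_fit(x, y, delta):
    """Least squares fit of m_theta(x) = theta_0 + theta_1 * x
    using only the complete cases (delta_i = 1)."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X[delta == 1], y[delta == 1], rcond=None)
    return coef

def wild_errors(residuals):
    """Step 2: eta*_i = V_i * residual_i, where V takes the values
    (1 - sqrt5)/2 and (1 + sqrt5)/2 with probabilities (5 + sqrt5)/10
    and (5 - sqrt5)/10, so that E[eta*] = 0, E[eta*^2] = residual^2
    and E[eta*^3] = residual^3."""
    a, b = (1 - 5 ** 0.5) / 2, (1 + 5 ** 0.5) / 2
    v = np.where(rng.random(residuals.size) < (5 + 5 ** 0.5) / 10, a, b)
    return v * residuals

def bootstrap_sample(x, y, delta):
    """Steps 1-3: residuals of the parametric fit on the complete cases,
    wild bootstrap errors, and Y*_i = m_thetahat(X_i) + eta*_i when
    delta_i = 1; missing responses stay missing."""
    t0, t1 = linear_fit(x, y, delta)
    fitted = t0 + t1 * x
    y_star = np.full(x.size, np.nan)
    obs = delta == 1
    y_star[obs] = fitted[obs] + wild_errors(y[obs] - fitted[obs])
    return y_star
```

Repeating `bootstrap_sample` and recomputing the statistic on each bootstrap sample gives the Monte Carlo approximation of the $1 - \alpha$ quantile used above.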

Table 1: Percentage of times that the null hypothesis was rejected, $\alpha = 0.05$. (The table reports, for $n = 50$ and $n = 100$, $a = 0, 1, 5$ and $\sigma = 0.5, 1$, the rejection percentages of the complete-data test and the simplified test for the bandwidths $h = 0.1, 0.25, 0.4$, and of the imputed test for several choices of $g$; the numerical entries are not reproducible here.)

It is evident that the results are better in the case of the complete sample, especially for $n = 100$; furthermore, as the sample size increases or the variance decreases, the performance of the three tests improves considerably; the effect of the variance is seen mainly in the power of the tests. As expected under the null hypothesis ($a = 0$), the percentage of rejections approaches the nominal level (5%), except for very small bandwidth parameters, perhaps due to the small amount of data available for the estimate. As we move away from the null hypothesis ($a = 1$ and $5$), the power grows. We can see how, with an appropriate choice of the imputation parameter $g$, the behaviour of the imputed test can be improved, on most occasions beyond that of the simplified one. In this table it is difficult to appreciate the degree

of oversmoothing which we mentioned in Remark 4, so we have created a graph (Figure 2) in which we present the rejection percentage under the null hypothesis for the imputed estimator test with various choices of the parameters $h$ and $g$. In addition, we have compared this with the best level obtained for the simplified test with those values of $h$, which was 5.1% for $h = 0.15$. In the graph we can see that the test for the imputed estimator becomes more conservative as $g$ increases. Furthermore, the optimum choice of $g$ is located in a neighborhood close to $h$, which coincides with our comments in Remark 4.

Figure 2: Percentage of rejections under the null hypothesis for the Imputed test. The black dashed line represents the best level obtained for the simplified test ($h = 0.15$, 5.1%), the black continuous line the imputed test with $h = 0.15$, the gray continuous line the imputed test with $h = 0.1$, and the gray dashed line the imputed test with $h = 0.05$.

We shall now present some graphs (Figures 3, 4 and 5) where we can see

the behaviour of the test (for fixed values of $g$) for various values of $a$ and of the bandwidth parameter $h$, comparing these with the behaviour in the complete data case. Here we can observe more clearly what we previously remarked: even fixing the imputation parameter, over the larger part of the variation range of $h$ the imputed test approaches the complete data case far more closely than the simplified test does.

Figure 3: Estimated probability of rejection of $H_0$ under the null hypothesis ($a = 0$). Percentages are for complete data (black continuous line), for the simplified test (dashed line) and for the imputed test with $g = 0.05$ (gray line); $n = 100$ and $\sigma^2 = 0.25$.

Figure 4: Estimated probability of rejection of $H_0$ with $a = 1$. Percentages are for complete data (black continuous line), for the simplified test (dashed line) and for the imputed test with $g = 0.25$ (gray line); $n = 100$ and $\sigma^2 = 0.25$.

Figure 5: Estimated probability of rejection of $H_0$ with $a = 5$. Percentages are for complete data (black continuous line), for the simplified test (dashed line) and for the imputed test with $g = 0.15$ (gray line); $n = 100$ and $\sigma^2 = 0.25$.

The previous study may seem restrictive, since a particular model of missing data (1) is assumed, so we shall now study the behaviour of the tests with respect to it. In order to do this, we have assumed that the missing data model (1) is constant, and we have taken various choices of it (1, 0.9, 0.8, 0.75, 0.6 and 0.5); evidently, $p = 1$ reflects the complete data case. In the study, the percentages of rejections under the null hypothesis ($a = 0$) and the power with $a = 1$ are calculated for several values of the bandwidth parameter; samples of size 100 from model (6) were generated, and the standard deviation used was 0.5.

Table 2: Percentage of times that $H_0$ was rejected at level 0.05 for several values of the missing data model (1); $n = 100$ and $\sigma = 0.5$. (The table reports, for $h = 0.1, 0.2, 0.3$ and several choices of $g$, the rejection percentages of the simplified and imputed tests for each value of $p$, with $a = 0$ and $a = 1$; the numerical entries are not reproducible here.)

As expected, the behaviour of both tests worsens as more data are lost (as $p$ decreases): the significance levels are not well attained and the power drops. For values of $p$ near 1, it may be that no improvement is brought about by the imputation, since very few data are lost; furthermore,

greater variance may be introduced in the estimates; but as $p$ decreases, it becomes clear that, through an appropriate selection of $g$, the behaviour of the imputed test is better than that of the simplified one.

Figures 6 and 7 present the empirical approximations of the power function of the tests for complete data, and for the Simplified and Imputed ones, with $\alpha = 0.05$. The curves were drawn by joining the points $(a, P(a))$ with lines, where $P(a)$ denotes the estimated probability of rejection. We see that the empirical powers of the Simplified and Imputed tests are similar.

Figure 6: Estimated power function with $p = 0.7$. Approximations are for complete data (black continuous line), for the simplified test (dashed line) and for the imputed test (gray line); $n = 100$, $\sigma^2 = 0.25$, $h = 0.15$ and $g = 0.15$. Computations are based on 100 samples.

Figure 7: Estimated power function with variable $p$. Approximations are for complete data (black continuous line), for the simplified test (dashed line) and for the imputed test (gray line); $n = 100$, $\sigma^2 = 0.25$, $h = 0.15$, $g = 0.15$ and $p(x) = 1 - 0.4\exp(-5(x - 0.4)^2)$. Computations are based on 100 samples.

In the first figure (Figure 6) it is assumed that $p = 0.7$; that is, the data are Missing Completely At Random (MCAR) and the incomplete sample is a random subsample of the original sample; this is the most unfavourable case for the Imputed estimator. Even so, in the figure the Imputed test shows a slight improvement over the Simplified one for any choice of $a$. In Figure 7 we have taken $p(x) = 1 - 0.4\exp(-5(x - 0.4)^2)$, observing a greater gain of the Imputed test for any value of $a$ (the tests display similar behaviour under the null hypothesis).

6. CONCLUSIONS

In this paper we have focused on the goodness of fit test of a linear regression model when there are missing observations in the response variable. To do this we have proposed two test statistics ($T_{n,S}$ and $T_{n,I}$). The analysis of the asymptotic distribution of each of them, and the results obtained through the simulation study, allow us to reach several conclusions.

In the first place, it is necessary to point out that both tests perform quite well in the presence of missing data; their behaviour evidently depends on the choice of the smoothing parameters of the nonparametric estimators. If we compare the asymptotic distributions obtained for both tests, varying the parameter $G$ as a function of $H$, we can conclude that, under the null hypothesis, the best choice of $G$ is such that $|G|^{-1/2}|H|^{1/2} \to 0$, providing less bias and variance in the distribution of the imputed estimator. On the other hand, under the alternative hypothesis this choice would provoke a degree of oversmoothing which would lead to the Imputed test being too conservative; consequently, the appropriate choice is $|H|^{1/2}|G|^{-1/2} = O(1)$, since in the case $|H|^{-1/2}|G|^{1/2} \to 0$ both statistics have the same asymptotic distribution and their comparison lacks interest. It is difficult to prove this analytically, due to the complexity of the variance term of the distribution of $T_{n,I}$ in the case $|H|^{1/2}|G|^{-1/2} = O(1)$. However, this fact is reflected in the simulation study.

In short, for both the level and the power analysis, it was observed that an appropriate choice of the smoothing parameter implies a clear advantage

for the imputed test over the simplified one.

7. SKETCHES OF THE PROOFS

Proof of Theorem 1. The proof is similar to the complete-data case of Alcalá et al. (1999), taking into account the missing-data model (1) and the multidimensional calculus.

Proof of Theorem 2. The proof begins by obtaining a more manageable representation of the imputed estimator (4); to do this we rely on the following development:

$$\hat m_{I,H,G}(x) = \sum_{i=1}^{n} l_i^{\delta}(x; H, G)\, Y_i,$$

where

$$l_i^{\delta}(x; H, G) = w_i(x, H)\,\delta_i + \sum_{j=1}^{n} w_j(x, H)(1-\delta_j)\, w_i^{\delta}(X_j, G),$$

with

$$w_i(x, H) = e_1^{t}\,(X_x^{t} W_{x,H} X_x)^{-1}\,[1, (X_i - x)^{t}]^{t}\, K_H(X_i - x)$$

and

$$w_i^{\delta}(x, H) = e_1^{t}\,(X_x^{t} W_{x,H}^{\delta} X_x)^{-1}\,[1, (X_i - x)^{t}]^{t}\, K_H(X_i - x)\,\delta_i.$$

We now use the asymptotically equivalent representation of a local linear smoother in terms of a kernel estimator (see Fan and Gijbels (1996) for more

details):

$$l_i^{\delta}(x; H, G) \approx n^{-1} f(x)^{-1} K_H(X_i - x)\,\delta_i + f(x)^{-1}\,\delta_i\, n^{-2} \sum_{j=1}^{n} K_H(X_j - x)(1-\delta_j)\, f(X_j)^{-1} p(X_j)^{-1} L_G(X_i - X_j),$$

where $\approx$ means "asymptotically equivalent".

Our next goal is to obtain an asymptotic expression (uniformly in $X_i$) for the second term of the above expression.

Lemma 1. Denote by $\tilde l_i^{\delta}(x; H, G)$ the asymptotic representation of $l_i^{\delta}(x; H, G)$. For the case $H^{1/2}G^{-1/2} = O(1)$, its expression is

$$\tilde l_i^{\delta}(x; H, G) = n^{-1} f(x)^{-1} K_H(X_i - x)\,\delta_i + n^{-1} f(x)^{-1}\,\delta_i\, q(x)\, p(x)^{-1}\, \alpha^{-1} A_H(X_i - x);$$

when $G^{1/2}H^{-1/2} \to 0$ as $n \to \infty$:

$$\tilde l_i^{\delta}(x; H, G) = n^{-1} f(x)^{-1} K_H(X_i - x)\,\delta_i + n^{-1} f(x)^{-1}\,\delta_i\, q(x)\, p(x)^{-1} L_G(X_i - x);$$

and finally, when $H^{1/2}G^{-1/2} \to 0$ as $n \to \infty$:

$$\tilde l_i^{\delta}(x; H, G) = n^{-1} f(x)^{-1} p(X_i)^{-1} K_H(X_i - x)\,\delta_i.$$
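As an illustrative sketch only (one covariate, Gaussian kernel, hypothetical function names `loclin` and `imputed_fit`), the two-stage construction behind the imputed estimator can be coded directly from the weights above: a pilot complete-case fit with bandwidth $g$ fills in the missing responses, and the completed sample is then smoothed with bandwidth $h$:

```python
import numpy as np

def loclin(x0, X, Y, w, h):
    """Local linear fit at x0; w carries the observation indicators (delta)."""
    K = np.exp(-0.5 * ((X - x0) / h) ** 2) * w      # Gaussian kernel times weights
    Xc = X - x0
    S0, S1, S2 = K.sum(), (K * Xc).sum(), (K * Xc ** 2).sum()
    T0, T1 = (K * Y).sum(), (K * Xc * Y).sum()
    return (S2 * T0 - S1 * T1) / (S0 * S2 - S1 ** 2)  # intercept = fit at x0

def imputed_fit(x0, X, Y, delta, h, g):
    """Two-stage imputed estimator: missing Y_j are replaced by a pilot
    complete-case fit with bandwidth g, then the completed sample is
    smoothed with bandwidth h."""
    Y0 = np.nan_to_num(Y)                           # missing entries get weight 0 anyway
    Yc = np.where(delta == 1, Y0,
                  [loclin(xj, X, Y0, delta, g) for xj in X])
    return loclin(x0, X, Yc, np.ones_like(X), h)
```

Since a local linear fit reproduces straight lines exactly, both the simplified (complete-case) and the imputed fits recover a linear regression function up to rounding error, which is the situation under the null hypothesis.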

Proof of Lemma 1. In order to prove the lemma, we need the two following lemmas. Let

$$R(u) = n^{-2} \sum_{i=1}^{n} K_H(X_i - x)(1-\delta_i)\, f(X_i)^{-1} p(X_i)^{-1} L_G(u - X_i) = n^{-2} \sum_{i=1}^{n} Z_i(u).$$

Lemma 2. Under hypotheses A.1-A.3, A.5 and A.9-A.11, $\sup_u |R(u) - E[R(u)]| \to 0$ as $n \to \infty$.

Lemma 3. Furthermore, $\sup_u |E[R(u)] - R_0(u)| \to 0$ as $n \to \infty$, where:

if $H^{1/2}G^{-1/2} = O(1)$, then

$$R_0(u) = n^{-1}\, \frac{q(x)}{p(x)}\, |G|^{-1/2} \int K(v)\, L\big(G^{-1/2}(x + H^{1/2}v - u)\big)\, dv = n^{-1}\, \frac{q(x)}{p(x)}\, \alpha^{-1} A_H(u - x);$$

if $G^{1/2}H^{-1/2} \to 0$ as $n \to \infty$, then $R_0(u) = n^{-1}\, \frac{q(x)}{p(x)}\, L_G(u - x)$; and if $H^{1/2}G^{-1/2} \to 0$ as $n \to \infty$, then $R_0(u) = n^{-1}\, \frac{q(u)}{p(u)}\, K_H(u - x)$.

Proof of Lemma 2. We have

$$\sup_u |R(u) - E[R(u)]| = n^{-1} \sup_u \Big| \sum_{i=1}^{n} \big( n^{-1} Z_i(u) - E[n^{-1} Z_i(u)] \big) \Big|.$$

Since the support $D$ of the variable $X$ is a compact set, we can carry out a partition into $L_n$ cubes $I_k$, such that $D = \bigcup_{k=1}^{L_n} I_k$ and $I_k \cap I_j = \emptyset$ if $k \neq j$. For simplicity of notation we assume that the support is cubic; if it is not, since $D$ is compact, it can be covered by a cubic set. From this we obtain the following decomposition of $n^{-1} \sup_u \big| \sum_{i=1}^{n} \big( n^{-1} Z_i(u) - E[n^{-1} Z_i(u)] \big) \big|$:

$$n^{-1} \max_{1\le k\le L_n}\, \sup_{u \in I_k \cap D} \Big| \sum_{i=1}^{n} \big( n^{-1} Z_i(u) - E[n^{-1} Z_i(u)] \big) \Big|$$
$$\le n^{-1} \max_{1\le k\le L_n}\, \sup_{u \in I_k \cap D} \Big| \sum_{i=1}^{n} n^{-1} \big( Z_i(u) - Z_i(u_k) \big) \Big| + n^{-1} \max_{1\le k\le L_n} \Big| \sum_{i=1}^{n} \big( n^{-1} Z_i(u_k) - E[n^{-1} Z_i(u_k)] \big) \Big|$$
$$\quad + n^{-1} \max_{1\le k\le L_n}\, \sup_{u \in I_k \cap D} \Big| \sum_{i=1}^{n} n^{-1} \big( E[Z_i(u_k)] - E[Z_i(u)] \big) \Big| = Q_1 + Q_2 + Q_3,$$

where $u_k$ ($k = 1, \ldots, L_n$) are the centers of the cubes.

Now, let us consider the first term:

$$Q_1 = \max_{1\le k\le L_n}\, \sup_{u \in D \cap I_k} \Big| n^{-2} \sum_{i=1}^{n} K_H(X_i - x)(1-\delta_i)\, f(X_i)^{-1} p(X_i)^{-1} \big( L_G(u - X_i) - L_G(u_k - X_i) \big) \Big|.$$

From the hypotheses on the functions $f$, $p$, $K$ and $L$, we can bound the previous expression by

$$\frac{C_0}{n\, |H|^{1/2} |G|}\, \|u - u_k\|,$$

where $\|\cdot\|$ denotes a norm and $C_0$ is a positive constant.

Let $L_n$ be the number of $d$-dimensional cubes, and let $l_n$ be the side length (area, volume) of each of these cubes. Clearly $l_n = Cte\, (L_n)^{-1/d}$, where $Cte$ is a positive constant. Taking $L_n = (n/\log n)^{d/2}$ gives $\|u - u_k\| \le O(l_n) = O\big( (\log n / n)^{1/2} \big)$, and then

$$\frac{C_0}{n\, |H|^{1/2} |G|}\, \|u - u_k\| \le C_1 \Big( \frac{\log n}{n} \Big)^{1/2} \frac{1}{n\, |H|^{1/2} |G|} \to 0,$$

with $C_1$ another positive constant; this proves that $Q_1 \to 0$ a.s.

The proof that $Q_3 \to 0$ a.s. is straightforward from $Q_1 \to 0$ a.s., using that $E[Q_3] \le E[|Q_3|]$.

Finally, we consider the term $Q_2$. Applying Bernstein's inequality, we obtain

$$P\{|Q_2| > \zeta\} = P\Big\{ \max_{1\le k\le L_n} n^{-1} \Big| \sum_{i=1}^{n} \big( n^{-1} Z_i(u_k) - E[n^{-1} Z_i(u_k)] \big) \Big| > \zeta \Big\}$$
$$\le L_n \max_{1\le k\le L_n} P\Big\{ n^{-1} \Big| \sum_{i=1}^{n} \big( n^{-1} Z_i(u_k) - E[n^{-1} Z_i(u_k)] \big) \Big| > \zeta \Big\}$$
$$\le O(L_n)\, 2 \exp\Big\{ - \frac{n^2 \zeta^2}{2 C_2\, n^{-1} |H|^{-1} |G|^{-1} + \frac{2C_3}{3} |H|^{-1/2} |G|^{-1/2}\, \zeta} \Big\}$$
$$= O(L_n)\, 2 \exp\Big\{ - \frac{n^2 \zeta^2\, |H|^{1/2} |G|^{1/2}}{2 C_2\, n^{-1} |H|^{-1/2} |G|^{-1/2} + \frac{2C_3}{3}\, \zeta} \Big\}.$$

Under the hypothesis on the bandwidth matrices (A.10), and letting

$$\zeta = C \Big( \frac{\log n}{n^2\, |H|^{1/2} |G|^{1/2}} \Big)^{1/2},$$

we can bound the previous expression by

$$O(L_n)\, 2 \exp\Big\{ - \frac{(C')^2 \log n}{C_0'} \Big\} = O(L_n)\, 2\, n^{-(C')^2 / C_0'},$$

where $C_2$, $C_3$, $C_0'$ and $C'$ are positive constants. Hence

$$P\{|Q_2| > \zeta\} \le O(L_n)\, 2\, n^{-(C')^2 / C_0'} = O\Big( \Big( \frac{n}{\log n} \Big)^{d/2} 2\, n^{-(C')^2 / C_0'} \Big) \to 0;$$

taking $C$ sufficiently large and applying the Borel-Cantelli lemma, we obtain $Q_2 = o(1)$ a.s.

Proof of Lemma 3. We only give the proof for the first case; the other cases are analogous.

$$\sup_u \big| E[R(u)] - R_0(u) \big| = \sup_u \Big| n^{-1} |G|^{-1/2} \int K(v)\, L\big( G^{-1/2}(x + H^{1/2}v - u) \big)\, \frac{q(x + H^{1/2}v)}{p(x + H^{1/2}v)}\, dv - R_0(u) \Big|$$
$$\le n^{-1} |G|^{-1/2} \sup_u \int K(v)\, L\big( G^{-1/2}(x + H^{1/2}v - u) \big)\, \Big| \frac{q(x + H^{1/2}v)}{p(x + H^{1/2}v)} - \frac{q(x)}{p(x)} \Big|\, dv$$
$$\le n^{-1} |G|^{-1/2} \sup_u \int K(v)\, L\big( G^{-1/2}(x + H^{1/2}v - u) \big)\, W_{q/p}\big( \|H^{1/2} v\| \big)\, dv = o\big( n^{-1} |G|^{-1/2} \big),$$

where $W_{q/p}(\|H^{1/2}v\|)$ denotes the modulus of continuity of $q/p$.

Combining the results of Lemmas 2 and 3, we prove Lemma 1.

Next, we give the proof of the asymptotic distribution of $T_{n,I}$ in the case $G^{1/2} = \alpha H^{1/2}$. Through the construction of the imputed estimator, we have the following decomposition of $T_{n,I}$:

$$T_{n,I} = n |H|^{1/4} \int \big( \hat m_{I,H,G}(x) - m_{\theta_n}(x) \big)^2 w(x)\, dx = I_1 + I_2 + 2 I_{12},$$

where

$$I_1 = n |H|^{1/4} \int \Big( \sum_{i=1}^{n} l_i^{\delta}(x; H, G)\, (Y_i - m_{\theta_0}(X_i)) \Big)^2 w(x)\, dx,$$
$$I_2 = n |H|^{1/4} \int \big( m_{\theta_0}(x) - m_{\theta_n}(x) \big)^2 w(x)\, dx,$$
$$I_{12} = n |H|^{1/4} \int \sum_{i=1}^{n} l_i^{\delta}(x; H, G)\, (Y_i - m_{\theta_0}(X_i))\, \big( m_{\theta_0}(x) - m_{\theta_n}(x) \big)\, w(x)\, dx.$$

Under hypothesis (A.8) and the conditions on $s$, $f$ and $p$, it is easy to check that $I_2 = o_P(1)$. Furthermore, considerations similar to the complete-data case yield $I_{12} = o_P(1)$.
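A numerical version of this $L_2$ distance is easy to sketch; the following toy implementation (hypothetical names, one covariate so that $|H|^{1/4} = h^{1/2}$, and a Nadaraya-Watson pilot in place of the paper's local linear smoother) computes $n|H|^{1/4}\int(\hat m - m_{\theta_n})^2 w\,dx$ by the trapezoid rule with $w = 1_{[0.1,\,0.9]}$:

```python
import numpy as np

def nw(x0, X, Y, delta, h):
    """Complete-case Nadaraya-Watson smoother (stand-in for the local linear fit)."""
    K = np.exp(-0.5 * ((X - x0) / h) ** 2) * delta
    return (K * Y).sum() / K.sum()

def T_stat(X, Y, delta, h, a=0.1, b=0.9, n_grid=201):
    """n * h^{1/2} times the integrated squared distance between the
    nonparametric fit and the least squares straight line (trapezoid rule)."""
    n = len(X)
    Yc = np.where(delta == 1, Y, 0.0)           # missing responses get kernel weight 0
    D = np.column_stack([np.ones(n), X])[delta == 1]
    beta, *_ = np.linalg.lstsq(D, Yc[delta == 1], rcond=None)
    grid = np.linspace(a, b, n_grid)
    d2 = np.array([(nw(g, X, Yc, delta, h) - (beta[0] + beta[1] * g)) ** 2
                   for g in grid])
    return n * np.sqrt(h) * np.sum((d2[1:] + d2[:-1]) / 2.0 * np.diff(grid))
```

Under an exactly linear regression function the statistic reduces to smoothing bias, while a clearly nonlinear alternative inflates it; the Bootstrap calibration of the paper would supply the critical values.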

We have seen that the weight function $l_i^{\delta}(x; H, G)$ can be approximated by $\tilde l_i^{\delta}(x; H, G)$. Hence, the asymptotic distribution of $T_{n,I}$ is determined by the approximation $\tilde I_1$ of $I_1$, where

$$\tilde I_1 = n |H|^{1/4} \int \Big( \sum_{i=1}^{n} \tilde l_i^{\delta}(x; H, G)\, (c_n s(X_i) + \eta_i) \Big)^2 w(x)\, dx.$$

Developing $\tilde I_1$ we obtain the decomposition $\tilde I_1 = \Delta_1 + \Delta_2 + \Delta_3$, where

$$\Delta_1 = n |H|^{1/4} \int \Big( \sum_{i=1}^{n} \tilde l_i^{\delta}(x; H, G)\, \eta_i \Big)^2 w(x)\, dx, \qquad \Delta_2 = n |H|^{1/4} \int \Big( \sum_{i=1}^{n} \tilde l_i^{\delta}(x; H, G)\, c_n s(X_i) \Big)^2 w(x)\, dx,$$

and $\Delta_3$ is the cross product.

By a simple application of Markov's inequality, we have

$$\Delta_2 = \int \big( p(x)\, (K_H * s)(x) + q(x)\, \alpha^{-1} (A_H * s)(x) \big)^2 w(x)\, dx + o_P(1).$$

By straightforward calculations one gets $E[\Delta_3] = 0$ and $E[(\Delta_3)^2] = o(1)$, hence $\Delta_3 = o_P(1)$.

Let us now study the term $\Delta_1 = \Delta_{11} + \Delta_{12}$, where

$$\Delta_{11} = n |H|^{1/4} \int \sum_{i=1}^{n} \big( \tilde l_i^{\delta}(x; H, G)\, \eta_i \big)^2 w(x)\, dx$$

and

$$\Delta_{12} = n |H|^{1/4} \int \sum_{i \neq j} \big( \tilde l_i^{\delta}(x; H, G)\, \eta_i \big)\, \big( \tilde l_j^{\delta}(x; H, G)\, \eta_j \big)\, w(x)\, dx.$$
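The off-diagonal term just defined is a degenerate quadratic form in the errors, the object handled by de Jong's (1987) central limit theorem in this proof. A toy simulation (arbitrary symmetric coefficient matrix with zero diagonal, standard normal errors; not from the paper) illustrates the normal limit and the variance formula $2\sum_{i\neq j} a_{ij}^2$:

```python
import numpy as np

# Degenerate quadratic form sum_{i != j} a_ij * eta_i * eta_j with a_ii = 0:
# mean 0 and variance 2 * sum_{i != j} a_ij^2 for iid N(0,1) errors eta.
rng = np.random.default_rng(2)
n, reps = 50, 2000
A = rng.normal(size=(n, n))
A = (A + A.T) / 2.0                 # symmetric coefficients a_ij = a_ji
np.fill_diagonal(A, 0.0)            # zero diagonal: only the i != j terms remain
var_theory = 2.0 * np.sum(A ** 2)   # Var = 2 * tr(A^2) for symmetric, zero-diagonal A

vals = np.empty(reps)
for r in range(reps):
    eta = rng.normal(size=n)        # iid N(0,1) errors
    vals[r] = eta @ A @ eta         # the quadratic form (diagonal contributes 0)
z = vals / np.sqrt(var_theory)      # standardized: approximately N(0, 1)
```

The standardized values have mean close to 0 and variance close to 1, matching the normal limit that de Jong's theorem guarantees once its conditions hold.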

Now, expanding $\Delta_{11}$, it is clear that

$$\Delta_{11} = n |H|^{1/4} \int \sum_{j=1}^{n} \Big( \frac{\delta_j \eta_j}{n f(x)} \Big)^2 \Big( K_H(x - X_j) + \frac{|H|^{1/2}}{|G|^{1/2}}\, \frac{q(x)}{p(x)}\, A_H(x - X_j) \Big)^2 w(x)\, dx = \Delta_{111} + \Delta_{112} + \Delta_{113},$$

where

$$\Delta_{111} = n |H|^{1/4} \int \sum_{j=1}^{n} \Big( \frac{\delta_j \eta_j}{n f(x)} \Big)^2 K_H(x - X_j)^2\, w(x)\, dx,$$
$$\Delta_{112} = n |H|^{1/4} \int \sum_{j=1}^{n} \Big( \frac{\delta_j \eta_j}{n f(x)} \Big)^2 \frac{|H|}{|G|}\, \frac{q(x)^2}{p(x)^2}\, A_H(x - X_j)^2\, w(x)\, dx,$$
$$\Delta_{113} = 2 n |H|^{1/4} \int \sum_{j=1}^{n} \Big( \frac{\delta_j \eta_j}{n f(x)} \Big)^2 \frac{|H|^{1/2}}{|G|^{1/2}}\, \frac{q(x)}{p(x)}\, A_H(x - X_j)\, K_H(x - X_j)\, w(x)\, dx.$$

Arguing as in the article of Härdle and Mammen (1993), one sees that

$$\Delta_{111} = |H|^{-1/4} \int \frac{\sigma^2(x)\, p(x)\, w(x)}{f(x)} \Big( \int K^2(u)\, du \Big)\, dx + o_P(1),$$
$$\Delta_{112} = |H|^{-1/4} \int \frac{\sigma^2(x)\, q^2(x)\, w(x)}{f(x)\, p(x)}\, \alpha^{-2} \Big( \int A^2(u)\, du \Big)\, dx + o_P(1),$$
$$\Delta_{113} = 2 |H|^{-1/4} \int \frac{\sigma^2(x)\, q(x)\, w(x)}{f(x)}\, \alpha^{-1} \Big( \int K(u)\, A(u)\, du \Big)\, dx + o_P(1).$$

Hence

$$\Delta_{11} = |H|^{-1/4} \int \frac{\sigma^2(x)\, w(x)}{f(x)}\, v_1(x)\, dx + o_P(1),$$

where $v_1(x)$ has the expression (5). Finally,

$$\Delta_{12} = n |H|^{1/4} \int \sum_{i \neq j} \Big\{ \frac{\delta_i \eta_i}{n f(x)}\, B_H(x - X_i) \Big\} \Big\{ \frac{\delta_j \eta_j}{n f(x)}\, B_H(x - X_j) \Big\}\, w(x)\, dx,$$

with

$$B_H(x - X_j) = K_H(x - X_j) + \frac{|H|^{1/2}}{|G|^{1/2}}\, \frac{q(x)}{p(x)}\, A_H(x - X_j).$$

At this point we make use of Theorem 2.1 of de Jong (1987), applied to the quadratic form $\Delta_{12} = \sum_{i \neq j} k_{ij}$, with

$$k_{ij} = n |H|^{1/4} \int \frac{1}{(n f(x))^2}\, B_H(x - X_i)\, \delta_i \eta_i\, B_H(x - X_j)\, \delta_j \eta_j\, w(x)\, dx \ \text{ if } i \neq j, \qquad k_{ii} = 0.$$

It suffices to check the following conditions:

1) $E[k_{ij} \mid X_i, \delta_i, \eta_i] = 0$;
2) $E[k_{ij} \mid X_j, \delta_j, \eta_j] = 0$;
3) $\max_{1 \le i \le n} \sum_{j=1}^{n} \mathrm{Var}(k_{ij}) / \mathrm{Var}(\Delta_{12}) \to 0$;
4) $E[(\Delta_{12})^4] / \{ \mathrm{Var}(\Delta_{12}) \}^2 \to 3$.

It is obvious that the first two conditions are verified. Also,

$$\mathrm{Var}(\Delta_{12}) = 2 n (n-1)\, E[(k_{ij})^2].$$

But

$$E[(k_{12})^2] = n^2 |H|^{1/2} \iiiint \frac{w(x)}{(n f(x))^2}\, \frac{w(z)}{(n f(z))^2}\, p(x_2) f(x_2) \sigma^2(x_2)\, p(x_1) f(x_1) \sigma^2(x_1)\, B_H(x - x_1) B_H(z - x_1) B_H(x - x_2) B_H(z - x_2)\, dx\, dz\, dx_1\, dx_2;$$

using the continuity and boundedness properties, we obtain

$$E[(k_{12})^2] \le n^{-2} \int \frac{w(x)\, \sigma^2(x)\, p(x)}{f(x)}\, c_1(x)\, dx,$$

and this is sufficient to prove condition 3. It remains to prove condition 4; for this, we can repeat the steps in the proof of Theorem 2 in Härdle and Mammen (1993, p. 1943), taking into account the weights of the imputed estimator. Therefore, we obtain the asymptotic normality of $\Delta_{12}$. Combining the previous results, the proof is concluded.

The proof for the cases $H^{1/2}G^{-1/2} \to 0$ and $G^{1/2}H^{-1/2} \to 0$ as $n \to \infty$ (element by element) is similar; it suffices to consider the asymptotic representation of the weights in each case and to repeat the previous calculations.

Proof of Theorem 3. Under hypotheses A1-A12, we derive the asymptotic distribution of the Bootstrap versions of the statistics $T_{n,S}$ and $T_{n,I}$ following the same lines as Theorems 1 and 2.

ACKNOWLEDGEMENTS

Research supported in part by MCyT Grant BFM (European FEDER support included), by PGIDIT 03 PXIC 20702PN of the Dirección Xeral de Investigación e Desenvolvemento (Xunta de Galicia), and by the Vicerrectorado de Investigación of the Universidad de Vigo.

REFERENCES

J. T. Alcalá, J. A. Cristóbal & W. González-Manteiga (1999). Goodness-of-fit test for linear models based on local polynomials. Statistics & Probability Letters, 42.

P. Billingsley (1999). Convergence of probability measures. Second edition, John Wiley & Sons, Inc., New York.

P. E. Cheng (1994). Nonparametric estimation of mean functionals with data missing at random. Journal of the American Statistical Association, 89, no. 425.

C. K. Chu & P. E. Cheng (1995). Nonparametric regression estimation with missing data. Journal of Statistical Planning and Inference, 48.

J. Fan & I. Gijbels (1996). Local polynomial modelling and its applications. Chapman and Hall, London.

W. González Manteiga & A. Pérez González (2003). Nonparametric mean estimation with missing data. To appear in Communications in Statistics.

W. Härdle & E. Mammen (1993). Comparing nonparametric versus parametric regression fits. The Annals of Statistics, 21.

P. de Jong (1987). A central limit theorem for generalized quadratic forms. Probability Theory and Related Fields, 75, no. 2.

R. J. A. Little (1992). Regression with missing X's: a review. Journal of the American Statistical Association, 87.

R. J. A. Little & D. B. Rubin (2002). Statistical analysis with missing data. 2nd ed., J. Wiley & Sons, New York.

E. A. Nadaraya (1964). On estimating regression. Theory of Probability and its Applications, 10.

D. Ruppert & M. P. Wand (1994). Multivariate locally weighted least squares regression. The Annals of Statistics, 22, no. 3.

G. A. F. Seber (1977). Linear regression analysis. John Wiley and Sons, New York.

Q. Wang & J. N. K. Rao (2001). Empirical likelihood for linear regression models under imputation for missing responses. The Canadian Journal of Statistics, 29, no. 4.

Q. Wang & J. N. K. Rao (2002). Empirical likelihood-based inference in linear models with missing data. Scandinavian Journal of Statistics, 29, no. 3.

C. Y. Wang, J. C. Chen, S. M. Lee & S. T. Ou (2002). Joint conditional likelihood estimator in logistic regression with missing covariate data. Statistica Sinica, 12.

C. Y. Wang, S. Wang, R. J. Carroll & R. G. Gutierrez (1998). Local linear regression for generalized linear models with missing data. The Annals of Statistics, 26.

G. S. Watson (1964). Smooth regression analysis. Sankhya, The Indian Journal of Statistics, Ser. A, 26.

González Manteiga, W.: Departamento de Estadística e Investigación Operativa, Universidad de Santiago de Compostela, Facultad de Matemáticas, Campus Sur, Santiago de Compostela, Spain.

Pérez González, A.: anapg@uvigo.es. Departamento de Estadística e Investigación Operativa, Universidad de Vigo, Escuela Superior de Ingeniería Informática, Campus As Lagoas, Orense, Spain.


Nonparametric confidence intervals. for receiver operating characteristic curves Nonparametric confidence intervals for receiver operating characteristic curves Peter G. Hall 1, Rob J. Hyndman 2, and Yanan Fan 3 5 December 2003 Abstract: We study methods for constructing confidence

More information

ECO Class 6 Nonparametric Econometrics

ECO Class 6 Nonparametric Econometrics ECO 523 - Class 6 Nonparametric Econometrics Carolina Caetano Contents 1 Nonparametric instrumental variable regression 1 2 Nonparametric Estimation of Average Treatment Effects 3 2.1 Asymptotic results................................

More information

Can we do statistical inference in a non-asymptotic way? 1

Can we do statistical inference in a non-asymptotic way? 1 Can we do statistical inference in a non-asymptotic way? 1 Guang Cheng 2 Statistics@Purdue www.science.purdue.edu/bigdata/ ONR Review Meeting@Duke Oct 11, 2017 1 Acknowledge NSF, ONR and Simons Foundation.

More information

Nonparametric Methods

Nonparametric Methods Nonparametric Methods Michael R. Roberts Department of Finance The Wharton School University of Pennsylvania July 28, 2009 Michael R. Roberts Nonparametric Methods 1/42 Overview Great for data analysis

More information

PREWHITENING-BASED ESTIMATION IN PARTIAL LINEAR REGRESSION MODELS: A COMPARATIVE STUDY

PREWHITENING-BASED ESTIMATION IN PARTIAL LINEAR REGRESSION MODELS: A COMPARATIVE STUDY REVSTAT Statistical Journal Volume 7, Number 1, April 2009, 37 54 PREWHITENING-BASED ESTIMATION IN PARTIAL LINEAR REGRESSION MODELS: A COMPARATIVE STUDY Authors: Germán Aneiros-Pérez Departamento de Matemáticas,

More information

A CONDITION TO OBTAIN THE SAME DECISION IN THE HOMOGENEITY TEST- ING PROBLEM FROM THE FREQUENTIST AND BAYESIAN POINT OF VIEW

A CONDITION TO OBTAIN THE SAME DECISION IN THE HOMOGENEITY TEST- ING PROBLEM FROM THE FREQUENTIST AND BAYESIAN POINT OF VIEW A CONDITION TO OBTAIN THE SAME DECISION IN THE HOMOGENEITY TEST- ING PROBLEM FROM THE FREQUENTIST AND BAYESIAN POINT OF VIEW Miguel A Gómez-Villegas and Beatriz González-Pérez Departamento de Estadística

More information

UNIVERSIDADE DE SANTIAGO DE COMPOSTELA DEPARTAMENTO DE ESTATÍSTICA E INVESTIGACIÓN OPERATIVA

UNIVERSIDADE DE SANTIAGO DE COMPOSTELA DEPARTAMENTO DE ESTATÍSTICA E INVESTIGACIÓN OPERATIVA UNIVERSIDADE DE SANTIAGO DE COMPOSTELA DEPARTAMENTO DE ESTATÍSTICA E INVESTIGACIÓN OPERATIVA BOOSTING FOR REAL AND FUNCTIONAL SAMPLES. AN APPLICATION TO AN ENVIRONMENTAL PROBLEM B. M. Fernández de Castro

More information

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Jae-Kwang Kim Department of Statistics, Iowa State University Outline 1 Introduction 2 Observed likelihood 3 Mean Score

More information

Confidence intervals for kernel density estimation

Confidence intervals for kernel density estimation Stata User Group - 9th UK meeting - 19/20 May 2003 Confidence intervals for kernel density estimation Carlo Fiorio c.fiorio@lse.ac.uk London School of Economics and STICERD Stata User Group - 9th UK meeting

More information

NONPARAMETRIC DENSITY ESTIMATION WITH RESPECT TO THE LINEX LOSS FUNCTION

NONPARAMETRIC DENSITY ESTIMATION WITH RESPECT TO THE LINEX LOSS FUNCTION NONPARAMETRIC DENSITY ESTIMATION WITH RESPECT TO THE LINEX LOSS FUNCTION R. HASHEMI, S. REZAEI AND L. AMIRI Department of Statistics, Faculty of Science, Razi University, 67149, Kermanshah, Iran. ABSTRACT

More information

Heteroskedasticity-Robust Inference in Finite Samples

Heteroskedasticity-Robust Inference in Finite Samples Heteroskedasticity-Robust Inference in Finite Samples Jerry Hausman and Christopher Palmer Massachusetts Institute of Technology December 011 Abstract Since the advent of heteroskedasticity-robust standard

More information

Multivariate Statistics

Multivariate Statistics Multivariate Statistics Chapter 2: Multivariate distributions and inference Pedro Galeano Departamento de Estadística Universidad Carlos III de Madrid pedro.galeano@uc3m.es Course 2016/2017 Master in Mathematical

More information

Fractional Hot Deck Imputation for Robust Inference Under Item Nonresponse in Survey Sampling

Fractional Hot Deck Imputation for Robust Inference Under Item Nonresponse in Survey Sampling Fractional Hot Deck Imputation for Robust Inference Under Item Nonresponse in Survey Sampling Jae-Kwang Kim 1 Iowa State University June 26, 2013 1 Joint work with Shu Yang Introduction 1 Introduction

More information

Generalized Linear Model under the Extended Negative Multinomial Model and Cancer Incidence

Generalized Linear Model under the Extended Negative Multinomial Model and Cancer Incidence Generalized Linear Model under the Extended Negative Multinomial Model and Cancer Incidence Sunil Kumar Dhar Center for Applied Mathematics and Statistics, Department of Mathematical Sciences, New Jersey

More information

Likelihood Ratio Tests. that Certain Variance Components Are Zero. Ciprian M. Crainiceanu. Department of Statistical Science

Likelihood Ratio Tests. that Certain Variance Components Are Zero. Ciprian M. Crainiceanu. Department of Statistical Science 1 Likelihood Ratio Tests that Certain Variance Components Are Zero Ciprian M. Crainiceanu Department of Statistical Science www.people.cornell.edu/pages/cmc59 Work done jointly with David Ruppert, School

More information

Introduction An approximated EM algorithm Simulation studies Discussion

Introduction An approximated EM algorithm Simulation studies Discussion 1 / 33 An Approximated Expectation-Maximization Algorithm for Analysis of Data with Missing Values Gong Tang Department of Biostatistics, GSPH University of Pittsburgh NISS Workshop on Nonignorable Nonresponse

More information

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the

More information

ESTIMATION OF NONLINEAR BERKSON-TYPE MEASUREMENT ERROR MODELS

ESTIMATION OF NONLINEAR BERKSON-TYPE MEASUREMENT ERROR MODELS Statistica Sinica 13(2003), 1201-1210 ESTIMATION OF NONLINEAR BERKSON-TYPE MEASUREMENT ERROR MODELS Liqun Wang University of Manitoba Abstract: This paper studies a minimum distance moment estimator for

More information

O Combining cross-validation and plug-in methods - for kernel density bandwidth selection O

O Combining cross-validation and plug-in methods - for kernel density bandwidth selection O O Combining cross-validation and plug-in methods - for kernel density selection O Carlos Tenreiro CMUC and DMUC, University of Coimbra PhD Program UC UP February 18, 2011 1 Overview The nonparametric problem

More information

Time Series and Forecasting Lecture 4 NonLinear Time Series

Time Series and Forecasting Lecture 4 NonLinear Time Series Time Series and Forecasting Lecture 4 NonLinear Time Series Bruce E. Hansen Summer School in Economics and Econometrics University of Crete July 23-27, 2012 Bruce Hansen (University of Wisconsin) Foundations

More information

Shu Yang and Jae Kwang Kim. Harvard University and Iowa State University

Shu Yang and Jae Kwang Kim. Harvard University and Iowa State University Statistica Sinica 27 (2017), 000-000 doi:https://doi.org/10.5705/ss.202016.0155 DISCUSSION: DISSECTING MULTIPLE IMPUTATION FROM A MULTI-PHASE INFERENCE PERSPECTIVE: WHAT HAPPENS WHEN GOD S, IMPUTER S AND

More information

Preliminaries The bootstrap Bias reduction Hypothesis tests Regression Confidence intervals Time series Final remark. Bootstrap inference

Preliminaries The bootstrap Bias reduction Hypothesis tests Regression Confidence intervals Time series Final remark. Bootstrap inference 1 / 171 Bootstrap inference Francisco Cribari-Neto Departamento de Estatística Universidade Federal de Pernambuco Recife / PE, Brazil email: cribari@gmail.com October 2013 2 / 171 Unpaid advertisement

More information

Transformation and Smoothing in Sample Survey Data

Transformation and Smoothing in Sample Survey Data Scandinavian Journal of Statistics, Vol. 37: 496 513, 2010 doi: 10.1111/j.1467-9469.2010.00691.x Published by Blackwell Publishing Ltd. Transformation and Smoothing in Sample Survey Data YANYUAN MA Department

More information

Test for Discontinuities in Nonparametric Regression

Test for Discontinuities in Nonparametric Regression Communications of the Korean Statistical Society Vol. 15, No. 5, 2008, pp. 709 717 Test for Discontinuities in Nonparametric Regression Dongryeon Park 1) Abstract The difference of two one-sided kernel

More information

The EM Algorithm for the Finite Mixture of Exponential Distribution Models

The EM Algorithm for the Finite Mixture of Exponential Distribution Models Int. J. Contemp. Math. Sciences, Vol. 9, 2014, no. 2, 57-64 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ijcms.2014.312133 The EM Algorithm for the Finite Mixture of Exponential Distribution

More information

BOOTSTRAPPING SAMPLE QUANTILES BASED ON COMPLEX SURVEY DATA UNDER HOT DECK IMPUTATION

BOOTSTRAPPING SAMPLE QUANTILES BASED ON COMPLEX SURVEY DATA UNDER HOT DECK IMPUTATION Statistica Sinica 8(998), 07-085 BOOTSTRAPPING SAMPLE QUANTILES BASED ON COMPLEX SURVEY DATA UNDER HOT DECK IMPUTATION Jun Shao and Yinzhong Chen University of Wisconsin-Madison Abstract: The bootstrap

More information

. Find E(V ) and var(v ).

. Find E(V ) and var(v ). Math 6382/6383: Probability Models and Mathematical Statistics Sample Preliminary Exam Questions 1. A person tosses a fair coin until she obtains 2 heads in a row. She then tosses a fair die the same number

More information

Bootstrap, Jackknife and other resampling methods

Bootstrap, Jackknife and other resampling methods Bootstrap, Jackknife and other resampling methods Part III: Parametric Bootstrap Rozenn Dahyot Room 128, Department of Statistics Trinity College Dublin, Ireland dahyot@mee.tcd.ie 2005 R. Dahyot (TCD)

More information

A note on multiple imputation for general purpose estimation

A note on multiple imputation for general purpose estimation A note on multiple imputation for general purpose estimation Shu Yang Jae Kwang Kim SSC meeting June 16, 2015 Shu Yang, Jae Kwang Kim Multiple Imputation June 16, 2015 1 / 32 Introduction Basic Setup Assume

More information

Large Sample Properties of Estimators in the Classical Linear Regression Model

Large Sample Properties of Estimators in the Classical Linear Regression Model Large Sample Properties of Estimators in the Classical Linear Regression Model 7 October 004 A. Statement of the classical linear regression model The classical linear regression model can be written in

More information

A comparison of different nonparametric methods for inference on additive models

A comparison of different nonparametric methods for inference on additive models A comparison of different nonparametric methods for inference on additive models Holger Dette Ruhr-Universität Bochum Fakultät für Mathematik D - 44780 Bochum, Germany Carsten von Lieres und Wilkau Ruhr-Universität

More information

Variance Function Estimation in Multivariate Nonparametric Regression

Variance Function Estimation in Multivariate Nonparametric Regression Variance Function Estimation in Multivariate Nonparametric Regression T. Tony Cai 1, Michael Levine Lie Wang 1 Abstract Variance function estimation in multivariate nonparametric regression is considered

More information

Bahadur representations for bootstrap quantiles 1

Bahadur representations for bootstrap quantiles 1 Bahadur representations for bootstrap quantiles 1 Yijun Zuo Department of Statistics and Probability, Michigan State University East Lansing, MI 48824, USA zuo@msu.edu 1 Research partially supported by

More information

Statistical inference on Lévy processes

Statistical inference on Lévy processes Alberto Coca Cabrero University of Cambridge - CCA Supervisors: Dr. Richard Nickl and Professor L.C.G.Rogers Funded by Fundación Mutua Madrileña and EPSRC MASDOC/CCA student workshop 2013 26th March Outline

More information

MIVQUE and Maximum Likelihood Estimation for Multivariate Linear Models with Incomplete Observations

MIVQUE and Maximum Likelihood Estimation for Multivariate Linear Models with Incomplete Observations Sankhyā : The Indian Journal of Statistics 2006, Volume 68, Part 3, pp. 409-435 c 2006, Indian Statistical Institute MIVQUE and Maximum Likelihood Estimation for Multivariate Linear Models with Incomplete

More information

The exact bootstrap method shown on the example of the mean and variance estimation

The exact bootstrap method shown on the example of the mean and variance estimation Comput Stat (2013) 28:1061 1077 DOI 10.1007/s00180-012-0350-0 ORIGINAL PAPER The exact bootstrap method shown on the example of the mean and variance estimation Joanna Kisielinska Received: 21 May 2011

More information

Generalized Multivariate Rank Type Test Statistics via Spatial U-Quantiles

Generalized Multivariate Rank Type Test Statistics via Spatial U-Quantiles Generalized Multivariate Rank Type Test Statistics via Spatial U-Quantiles Weihua Zhou 1 University of North Carolina at Charlotte and Robert Serfling 2 University of Texas at Dallas Final revision for

More information

Asymptotic normality of conditional distribution estimation in the single index model

Asymptotic normality of conditional distribution estimation in the single index model Acta Univ. Sapientiae, Mathematica, 9, 207 62 75 DOI: 0.55/ausm-207-000 Asymptotic normality of conditional distribution estimation in the single index model Diaa Eddine Hamdaoui Laboratory of Stochastic

More information

Log-Density Estimation with Application to Approximate Likelihood Inference

Log-Density Estimation with Application to Approximate Likelihood Inference Log-Density Estimation with Application to Approximate Likelihood Inference Martin Hazelton 1 Institute of Fundamental Sciences Massey University 19 November 2015 1 Email: m.hazelton@massey.ac.nz WWPMS,

More information

A NON-PARAMETRIC TEST FOR NON-INDEPENDENT NOISES AGAINST A BILINEAR DEPENDENCE

A NON-PARAMETRIC TEST FOR NON-INDEPENDENT NOISES AGAINST A BILINEAR DEPENDENCE REVSTAT Statistical Journal Volume 3, Number, November 5, 155 17 A NON-PARAMETRIC TEST FOR NON-INDEPENDENT NOISES AGAINST A BILINEAR DEPENDENCE Authors: E. Gonçalves Departamento de Matemática, Universidade

More information