
A kernel density estimate for interval censored data

Thierry Duchesne* and James E. Stafford†

Abstract

In this paper we propose a kernel density estimate for interval-censored data. It retains the simplicity and intuitive appeal of the usual kernel density estimate and is easy to compute. The estimate results from an algorithm where conditional expectations of a kernel are computed at each iteration. These conditional expectations are computed with respect to the density estimate from the previous iteration, allowing the estimator to extract more information from the data at each step. The estimator is applied to HIV data where interval censoring is common. In terms of the cumulative distribution function the algorithm is shown to coincide with those of Efron (1967), Turnbull (1976), and Li et al. (1997) as the window size of the kernel shrinks to zero. Viewing the iterative scheme as a generalized EM algorithm permits a natural interpretation of the estimator as being close to the ideal kernel density estimate where the data are not censored in any way. Simulation results support the conjecture that kernel smoothing at every iteration does not affect convergence. In addition, comparison to the standard kernel density estimate, based on smoothing Turnbull's estimator, reflects favourably on the estimator for all criteria considered. Use of the estimator for scatterplot smoothing is considered in a final example.

Keywords: Cross-validation, EM algorithm, HIV, importance sampling, interval censoring, kernel smoothing, Kullback-Leibler, mean squared error, Monte Carlo integration, nonparametric maximum likelihood, scatterplot smoothing, self-consistency

1 Introduction

We propose a kernel density estimate to be used in the presence of interval censored data, i.e. data that are observed to lie within an interval but whose exact value is unknown. The

* Department of Statistics, University of Toronto
† Department of Public Health Sciences, University of Toronto

estimate results from a recursive scheme that generalizes the algorithms of Efron (1967), Turnbull (1976) and Li et al. (1997) by kernel smoothing the data at each iteration:

  f̂_j(x) = (1/nh) Σ_{i=1}^n E_{j−1}[ K((x − X_i)/h) | X_i ∈ I_i ].  (1)

Here expectation is with respect to the previous iterate conditional on the observed interval. Convergence of the algorithm implies that f̂_j approaches some density for which the application of (1) has no effect. Efron (1967) called such a fixed point a self-consistent estimator.

The estimator retains the simplicity and intuitive appeal of a kernel density estimate. In fact, this simplicity avoids some of the awkward aspects associated with kernel smoothing Turnbull's estimator, F̂_t, of the cumulative distribution function (cdf),

  f̂_t(x) = ∫ (1/h) K((x − u)/h) dF̂_t(u),

which is a standard technique. Turnbull's F̂_t is a non-parametric maximum likelihood estimator (NPMLE) that is not uniquely defined over the whole real line but only up to an equivalence class of distributions that may differ over gaps called "innermost" intervals. Associated with these gaps are probability masses whose distribution over the gap is left unspecified, and this proves to be troublesome when computing f̂_t. Pan (2000) suggests arbitrarily assuming that jumps occur at the right-hand points of the gaps, which may be appropriate if the censoring proportion and the length of the censoring intervals are small. However, if most observations are interval censored with interval lengths that can be large, as is often the case with HIV/AIDS data, then assuming that the jumps occur at the right-hand point of the interval may cause considerable bias in the estimator. This complication never arises when computing (1) because we smooth the data directly at every iteration rather than smoothing a NPMLE once. Moreover, this smoothing process distributes probability mass over each observed interval using a conditional density determined by the previous iterate. This process is data driven rather than arbitrary.

Figure 1 depicts use of the estimator as applied to a group of heavily treated hemophiliacs (De Gruttola and Lagakos, 1989) whose time of infection with the HIV virus was interval censored. The upper plot gives the original data ordered by the left end point. Time is measured in six month intervals and right censored observations are denoted by dotted lines. The lower plot gives f̂_t and our estimator f̂_4 based on four iterations of the algorithm. The choice of j = 4 is based on both simulations and visual inspection of the estimator for several values of j > 4. The latter can be made common practice as successive iterates are based on an importance sampling scheme where the time to compute an iterate does not increase with the number of iterations. The estimator f̂_t was computed assuming jumps occur at the

center of innermost intervals rather than the right-hand point, which causes the estimate to be shifted to the right. Window sizes were chosen using a method of cross-validation discussed in §5. What is evident from the plot is that f̂_4 does a better job of smoothing what appears to be a sampling anomaly on the left side of the plot without eroding the peak on the right. It eliminates sampling artifacts in the estimate without degrading the estimate itself, and to some extent overcomes the fact that smoothing the NPMLE does not recover the information lost by the non-parametric estimation (Pan, 2000). By smoothing at every iteration it does a better job of borrowing information from neighbouring data points in the smoothing process. This is borne out in simulations of mean squared error. This example is used throughout the paper to illustrate other aspects of the estimator, and a separate example concerning HIV infection and infant mortality is given in §7.

Innermost intervals, whose concept is not entirely straightforward, never explicitly enter into the calculation, resulting in the advantage that our estimator fills in the gaps of Turnbull's F̂_t. This idea of filling in the gaps is not new, as Li et al. (1997) embed Turnbull's NPMLE in an EM algorithm designed specifically for this purpose. They obtain an estimator that will converge to the NPMLE where the NPMLE is uniquely defined, and to some cdf that depends on the starting point of the algorithm where the NPMLE has gaps. In §3 we show that as the window size, h, shrinks to zero our algorithm coincides with that of Li et al. (1997), and hence with the algorithms of Efron (1967) and Turnbull (1976) as well.

The remainder of this paper is organized as follows. In §2 the estimator is proposed as a natural extension of the usual kernel density estimate in the complete data case (no censoring). It is formally defined through a generalized EM algorithm where the "M" step is characterised by optimizing an "MSE" criterion. This criterion is quite natural as it involves the complete data kernel density estimate, f̂_c, allowing the estimator to be interpreted as minimizing the distance between itself and the ideal estimate f̂_c. Numerical implementation of the method is discussed in §4 and the choice of the smoothing parameter is considered in §5. The question of convergence of the algorithm is considered in §3. Although the developments are not rigorous, the conjecture is that the use of kernel smoothing at every iteration does not perturb algorithms that are known to converge to such an extent that they no longer converge. The argument is supported by simulation results in §6. Finally, in §7 the method is used to provide kernel weights for scatterplot smoothing. Throughout the paper analogies with the complete data case make developments transparent.
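The standard technique referred to above, kernel smoothing a NPMLE such as Turnbull's F̂_t, amounts to convolving its probability masses with a scaled kernel. A minimal sketch (the Gaussian kernel, jump locations, masses and bandwidth are all illustrative assumptions, not values from the paper):

```python
import numpy as np

def smooth_cdf_estimate(support, mass, h, grid):
    """Kernel smoothing of a discrete cdf estimate such as Turnbull's NPMLE:
    f_t(x) = sum_k p_k (1/h) K((x - u_k)/h); a Gaussian kernel is assumed."""
    u = np.asarray(support)[None, :]      # jump points u_k of the NPMLE
    p = np.asarray(mass)[None, :]         # probability masses p_k at those points
    x = np.asarray(grid)[:, None]
    K = np.exp(-0.5 * ((x - u) / h) ** 2) / np.sqrt(2 * np.pi)
    return (p * K / h).sum(axis=1)        # density values on the grid

# illustrative masses placed, arbitrarily, at interval right end points
grid = np.linspace(-5.0, 10.0, 600)
dens = smooth_cdf_estimate([1.0, 2.5, 4.0], [0.2, 0.5, 0.3], h=0.8, grid=grid)
print(round(float(np.trapz(dens, grid)), 3))  # 1.0
```

Placing the masses at right end points mirrors the arbitrary choice discussed above; the proposed estimator avoids that choice altogether.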

2 Definition of the estimator

In the presence of complete data X_1, ..., X_n the standard kernel density estimate,

  f̂_c(x) = (1/nh) Σ_{i=1}^n K((x − X_i)/h),

may be written as an expectation with respect to the empirical distribution, F_n, of the sample:

  f̂_c(x) = (1/h) E_{F_n}[ K((x − X)/h) ].

When the data are interval censored, so that X_i ∈ I_i for all i and only I_i = (L_i, R_i) is observed, it seems natural to express the kernel density estimate in terms of iterated expectation:

  f̂(x) = (1/h) E_{F_n}{ E[ K((x − X)/h) | X ∈ I ] } = (1/nh) Σ_{i=1}^n E[ K((x − X_i)/h) | X_i ∈ I_i ].

Here conditional expectation is computed with respect to the distribution for the true value of X_i over the interval I_i. Goutis (1997) uses such a strategy for the nonparametric estimation of a mixing density. This conditional distribution is itself unknown and must be estimated. A natural choice is data driven and involves using the kernel density estimate itself to approximate each conditional distribution. This results in an iterative algorithm with the following smooth estimate of the density at the jth step:

  f̂_j(x) = (1/nh) Σ_{i=1}^n E_{j−1}[ K((x − X_i)/h) | X_i ∈ I_i ],

where

  E_k[ g(X) | X ∈ I_i ] = ∫_{L_i}^{R_i} g(t) f̂_i^k(t) dt  if L_i ≠ R_i,  and  g(X_i)  if L_i = R_i = X_i.

The conditional density f̂_i^k(·) over the interval I_i is defined as f̂_i^k(t) = 1_i(t) f̂^k(t) / c_i^k, where 1_i(·) is the indicator function for the interval I_i and c_i^k is its unconditional expectation under f̂^k. At the (k+1)st iterate it is the conditional density f̂_i^k(x) that is used to smoothly distribute a probability mass of 1/n over the interval I_i. Note how this differs from, for example, the product limit estimator, which distributes the mass associated with a right censored observation X_i to only those uncensored observations that exceed X_i and not to the entire interval [X_i, ∞).
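The jth step above can be sketched numerically. The following is a small grid-based version in which each conditional expectation is computed by quadrature against the previous iterate rather than by the Monte Carlo scheme of §4; the intervals, grid, bandwidth and uniform starting density are illustrative assumptions, and exact observations (L_i = R_i) are omitted:

```python
import numpy as np

def gauss_K(z):
    return np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)

def iterate_density(intervals, f_prev, grid, h):
    """One step f_j(x) = (1/nh) sum_i E_{j-1}[K((x - X_i)/h) | X_i in I_i],
    with the conditional expectations evaluated by quadrature on `grid`
    against the previous iterate f_prev (its values on `grid`)."""
    dx = grid[1] - grid[0]
    out = np.zeros_like(grid)
    for (L, R) in intervals:
        ind = ((grid >= L) & (grid <= R)).astype(float)  # indicator of I_i
        c = (f_prev * ind).sum() * dx                    # mass c_i of f_prev on I_i
        cond = f_prev * ind / c                          # conditional density on I_i
        # E_{j-1}[K((x - X_i)/h) | X_i in I_i] for every grid point x at once
        out += (gauss_K((grid[:, None] - grid[None, :]) / h)
                * cond[None, :]).sum(axis=1) * dx
    return out / (len(intervals) * h)

intervals = [(0.5, 2.0), (1.0, 3.5), (0.0, 4.0), (2.5, 5.0)]
grid = np.linspace(-4.0, 9.0, 400)
f = np.ones_like(grid) / (grid[-1] - grid[0])            # uniform starting density
for _ in range(4):
    f = iterate_density(intervals, f, grid, h=0.6)
print(round(float(np.trapz(f, grid)), 2))  # 1.0
```

The Monte Carlo scheme of §4 replaces the quadrature sum with a sample average over draws from each conditional density.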

Given that the estimator weights a data point by computing the average height of the kernel over the observed interval, consider Figure 2, which depicts how the weight depends on the length and proximity of the interval to the location of the kernel. In the figure the weights for two intervals, centered at 0 but with different lengths, are shown for different positions of the kernel. When the kernel is also centered at 0, the method rewards precision by giving the shorter interval a greater weight. However, when the location of the kernel is shifted so it overlaps predominantly with the longer interval, it assigns a larger weight to this interval even though it is less precise. This is due to the longer interval being more "local" than the shorter interval to the point of estimation, or the center of the kernel. The longer interval is local because there is non-zero probability that the true observation is in a region close to "-2" while this is not the case for the shorter interval.

While the above derivation has intuitive appeal, the estimator may be formally defined as minimizing an integrated squared distance between some arbitrary function and the ideal estimator f̂_c. We first present the following result.

Theorem 2.1 Let F be the set of absolutely continuous density functions in L_2. Suppose that X is distributed with density f ∈ F. Let

  C(f) = ∫_{−∞}^{∞} { f̂_c(x) − f(x) }² dx,

where f̂_c(x) = (nh)^{−1} Σ_{i=1}^n K((x − X_i)/h), and assume that h^{−1} K((· − u)/h) ∈ F for any fixed h > 0 and u ∈ ℝ. Then f̂ = E_f[ (nh)^{−1} Σ_{i=1}^n K((x − X_i)/h) | X_i ∈ I_i ∀i ] solves

  f̂ = argmin_{f ∈ F} E_f[ C(f) | X_i ∈ I_i, i = 1, ..., n ].

Proof: Let

  f̂ = argmin_{f ∈ F} E_f[ C(f) | X_i ∈ I_i, i = 1, ..., n ]
     = argmin_{f ∈ F} E_f[ ∫_{−∞}^{∞} { f̂_c(x) − f(x) }² dx | X_i ∈ I_i, i = 1, ..., n ].

Under the assumptions about f and K, this expectation is finite and hence

  E_f[ ∫_{−∞}^{∞} { f̂_c(x) − f(x) }² dx | X_i ∈ I_i, i = 1, ..., n ] = ∫_{−∞}^{∞} E_f[ { f̂_c(x) − f(x) }² | X_i ∈ I_i, i = 1, ..., n ] dx.

By minimizing the positive integrand for every fixed x, we minimize the integral. Thus for a fixed value of x the definition of conditional expectation implies that

  E_f[ { (1/nh) Σ_{i=1}^n K((x − X_i)/h) − f(x) }² | X_i ∈ I_i, i = 1, ..., n ]

is minimized with respect to f(x) at

  f̂(x) = E_f[ (1/nh) Σ_{i=1}^n K((x − X_i)/h) | X_i ∈ I_i, i = 1, ..., n ] = (1/nh) Σ_{i=1}^n E_f[ K((x − X_i)/h) | X_i ∈ I_i ]. ∎
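The pointwise minimization in the proof rests on the standard fact that a conditional mean squared error is minimized by the conditional expectation; for a square-integrable Z and conditioning event A,

```latex
E\left[(Z-c)^2 \mid A\right]
  = E\left[\left(Z - E[Z\mid A]\right)^2 \mid A\right]
    + \left(E[Z\mid A] - c\right)^2,
```

so the minimizer is c = E[Z | A]. Taking Z = (nh)^{−1} Σ_i K((x − X_i)/h) and A = {X_i ∈ I_i, i = 1, ..., n} for each fixed x gives the stated f̂(x).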

Note the criterion is quite restrictive. It explicitly involves f̂_c and hence the form of the optimal estimator is not surprising. Nevertheless, the result is useful due to the interpretation it lends the estimator. In terms of squared distance the estimator gets as close as possible to the ideal kernel density estimate. In addition, since we do not know the true density f, we replace it with any current guess for f, say f̂_{j−1}. Hence the estimator may be regarded as resulting from a generalized EM algorithm with:

E-step: for all i define f̂_{j−1,i}(x) and compute w_i(x) = E_{j−1}[ K((x − X_i)/h) | X_i ∈ I_i ].

M-step: Compute f̂_j(x) = (1/nh) Σ_{i=1}^n w_i(x).

Computational issues concerning the E-step are considered in §4. Figure 3 gives the result of the first four iterations of the algorithm for the hemophiliac data. In this data set, patients who were infected at the time of entry were assigned a left-hand end point at the start of the study, which resulted in a number of lengthy intervals commencing from the beginning of the study. The common practice in the HIV literature of assuming a uniform distribution over each interval is clearly inappropriate from an inspection of the data. For the estimates in Figure 3 we used a uniform distribution as our starting point (right censored observations were given a weight of 0). As one expects, differences between the first two iterates are quite large as the initial assumption of a uniform distribution is adjusted by the density estimate itself, which places more weight on the later period of the study. Convergence is achieved after four iterations.

3 Properties of the estimator

When the complete data kernel density estimate, f̂_c, is used to estimate the cdf as

  F̂_c(x) = ∫_{−∞}^x f̂_c(u) du,

the estimate F̂_c reduces to the NPMLE as h ↓ 0. Here the NPMLE is the empirical distribution function F_n. An analogous development holds for the estimator f̂_j as well. In this section we show the algorithm (1) reduces to that of Efron (1967), Turnbull (1976) and Li et al. (1997) as h ↓ 0. Since each of these converges under broad conditions, we conjecture that the use of kernel smoothing at each iteration does not perturb the algorithm to such an extent as to affect convergence. The simulation results of §6 support this.

Efron (1967) proposed an iterative scheme for approximating the survivor function at a point x, S(x) = P[X > x]:

  S̃_j(x) = (1/n){ N(x) + Σ_{L_i < x, δ_i = 0} S̃_{j−1}(x) / S̃_{j−1}(L_i) },

where N(x) = #{X_i > x}, δ_i = 1 if X_i is observed exactly and δ_i = 0 if X_i is right-censored (R_i = ∞). Efron shows S̃_j converges to a fixed point that coincides with the Kaplan-Meier product limit estimator, that is, the NPMLE. Turnbull (1976) generalized this algorithm to obtain a NPMLE of the distribution function under general censoring and truncation schemes. Li et al. (1997) proposed an estimator that is the fixed point of an EM algorithm. Their estimator coincides with Turnbull's estimator where Turnbull's estimator is uniquely defined, and converges to a value that depends on the starting point of the iterative scheme where Turnbull's estimator is not uniquely defined. The iterative scheme proposed by Li et al. (1997) involves computing the conditional expectation of F_n at each step:

  F̄_j(x) = E_{j−1}[ F_n(x) | X_i ∈ I_i ∀i ].

The following theorem shows that Li et al.'s estimator can be obtained as a limit of our estimator when we let the window width of the kernel shrink to zero at every step.

Theorem 3.1 Let F̂_j(x) be the estimate of the cdf corresponding to the density estimate (1). Assuming both algorithms have the same initial value, then

  lim_{h↓0} F̂_j(x) = F̄_j(x) ∀x, j = 1, 2, ....

Proof: F̄_j may be rewritten as

  F̄_j(x) = E_{j−1}[ F_n(x) | X_i ∈ I_i ∀i ]
         = (1/n) Σ_{i=1}^n E_{j−1}[ I[X_i ≤ x] | X_i ∈ I_i ]
         = (1/n) Σ_{i=1}^n { [F̄_{j−1}(x) − F̄_{j−1}(L_i)] / [F̄_{j−1}(R_i) − F̄_{j−1}(L_i)] · 1_i(x) + I[x ≥ R_i] }, j = 1, 2, ....

Note Li et al. (1997) use the third expression for computation. Defining K_1(u) = ∫_{−∞}^u K(y) dy and using Tonelli's theorem to interchange expectation and integration we may similarly write

  F̂_j(x) = ∫_{−∞}^x f̂_j(u) du
         = (1/nh) Σ_{i=1}^n ∫_{−∞}^x E_{j−1}[ K((u − X_i)/h) | X_i ∈ I_i ] du
         = (1/n) Σ_{i=1}^n E_{j−1}[ K_1((x − X_i)/h) | X_i ∈ I_i ].

Since K_1((x − X_i)/h) ≤ 1 for all h, we can bring the limit inside the expectation. The result obtains since lim_{h↓0} K_1((u − v)/h) = I[v ≤ u] for all u, v ∈ ℝ. ∎

In the case of right-censored data the algorithm (1) reduces to that of Efron (1967) as an immediate consequence of Theorem 3.1.

Corollary 3.1 If R_i = ∞ for all interval censored data points, then

  lim_{h↓0} F̂_j(x) = 1 − S̃_j(x) ∀x, j = 1, 2, ....

Proof: Under right censoring, 1 − S̃_j(x) = F̄_j(x) and hence the result. ∎

The above developments naturally lead to the consideration of convergence of the algorithm to a fixed point. Series expansions (Silverman, 1986), similar to those for the complete data kernel density estimate, f̂_c, show that F̂_j is equivalent to F̄_j to second order. The following theorem does not prove convergence, but it does show that convergence of the algorithm is linked to the convergence of F̄_j and may be inherited from F̄_j. In other words, the convergence of F̄_j is a necessary condition for the convergence of F̂_j. Li et al. (1997) show that F̄_j converges when F̄_0 is a strictly increasing distribution function.

Theorem 3.2 Assume ∫ K(u) du = 1, ∫ u K(u) du = 0 and ∫ u² K(u) du = σ²_K < ∞. Then, assuming both algorithms have the same initial value, we have

  F̂_j(x) = F̄_j(x) + O(h²) ∀x, j = 1, 2, ....

The proof of this theorem can be found in the Appendix. The assumptions of the theorem are typical of most popular kernel functions, including the Gaussian kernel. The effect of the O(h²) term depends on both the properties of the kernel as well as the size of h. In the simulations of §6 the O(h²) term does not disturb convergence.

4 Implementation through importance sampling

Computing an iterate in the recursive scheme (1) requires the computation of a conditional expectation for each interval censored observation. For an interval I this conditional expectation has the form

  μ_I = E_j[ K((x − X)/h) | X ∈ I ] = ∫_L^R K((x − X)/h) f̂_{j,I}(X) dX,

which, except in special cases, will not be computable in closed form. Rather than numerically approximating the integral involved in the expectation, we estimate it by a sample

mean in a Monte Carlo scheme that is fast and easy to implement. Thus the iterative algorithm involves a sampling process which iterates until we are confident that we are sampling from the fixed point of (1). Two sampling schemes are considered, where the second is an approximation of the first.

The first method involves sampling exactly from f̂_{j,I} using an acceptance/rejection method where candidate values Y are generated from the distribution with density f̂_j and accepted if Y ∈ I:

1. generate Y ~ f̂_j;
2. if Y ∈ I set X ← Y, otherwise go to 1.

The first step is accomplished by the following recursive scheme:

1. sample with replacement from {I_1, ..., I_n} to get I*;
2. sample X from f̂_{j−1,I*};
3. sample Y from (1/h) K((· − X)/h),

where the recursion occurs at step 2. Once a sample X_1, ..., X_B is obtained, μ̂_I is computed as

  μ̂_I = (1/B) Σ_{k=1}^B K((x − X_k)/h),

and since the sampling is exact,

  μ̂_I → E_j[ K((x − X)/h) | X ∈ I ] as B → ∞.

Thus we can limit the effect of Monte Carlo error by choosing B to be as large as we want. The difficulty with this exact sampling method is that it punishes precision in the data. When an interval is narrow the acceptance/rejection step will largely reject proposals. Thus obtaining a large sample may take a long time, and given the scheme is recursive the impact can be substantial. To offset this we use an importance sampling scheme where the time to compute an iterate does not increase with the number of iterations as in the exact sampling scheme above. Based on

  E_j[ K((x − X)/h) | X ∈ I ] = E_g[ K((x − X)/h) w(X) ],

where g is some distribution over the interval I that is easy to sample from and w(X) = f̂_{j,I}(X)/g(X) is the importance sampling weight, μ̂_I becomes

  μ̂_I = Σ_{k=1}^B K((x − X_k)/h) w*_k,

where w*_k = w(X_k) / Σ_{l=1}^B w(X_l), and the above acceptance/rejection scheme is replaced by

1. Generate X_k ~ g, k = 1, ..., B.
2. Compute μ̂_I.

The only additional complication is to compute the sampling weights w*_k which, upon inspection, simplify in a convenient way. Ultimately they involve the height of the unconditional kernel density estimate f̂_j, thus avoiding computation of the constants c_{j,I}:

  w*_k = [ f̂_{j,I}(X_k)/g(X_k) ] / Σ_{l=1}^B [ f̂_{j,I}(X_l)/g(X_l) ]
       = [ f̂_j(X_k)/(c_{j,I} g(X_k)) ] / Σ_{l=1}^B [ f̂_j(X_l)/(c_{j,I} g(X_l)) ]
       = [ f̂_j(X_k)/g(X_k) ] / Σ_{l=1}^B [ f̂_j(X_l)/g(X_l) ].

Finally, using the values of the kernel density estimate f̂_j at a sufficiently fine grid, f̂_{j−1}(X_k) can be accurately computed by interpolation. Using the hemophiliac data, Figure 4 gives the result of a simulation study for values of B = 10 and 100 respectively. For each plot the kernel density estimate, based on 4 iterations of the algorithm, was computed 100 times for a fixed value of B. The plot depicts the resulting pointwise mean and 99% percentile interval. The method works quite well for samples of size 100.

5 Choice of the smoothing parameter

A central component of kernel density estimation is the choice of the smoothing parameter. We propose an automatic method for this purpose based on likelihood cross-validation that is analogous to the complete data case (Silverman, 1986). In the presence of complete data X_1, ..., X_n likelihood cross-validation aims to maximize

  CV(h) = Π_{i=1}^n f̂_h^{(−i)}(X_i)

with respect to the smoothing parameter h. The superscript indicates X_i is left out when the estimate f̂_h^{(−i)}(X_i) is computed, and the method works because E[log CV(h)] involves the Kullback-Leibler distance between f and f̂_h:

  E[log CV(h)]/n ≈ −∫ f(t) log{ f(t)/f̂_h(t) } dt + ∫ f(t) log{ f(t) } dt.

In the case of interval censored data it is natural to mimic the above strategy through analogy. In the above, f̂_h^{(−i)}(X_i) is obtained by eliminating a point of support, X_i, from the NPMLE, namely F_n. By eliminating X_i the contribution to CV at that point of support uses only the remaining data. In our case, the support of the NPMLE, F̂_t, is the innermost intervals defined as J_r = (p_r, q_r), r = 1, ..., m, where p_r ∈ {l_i, i = 1, ..., n}, q_r ∈ {r_i, i = 1, ..., n} and J_r ∩ I_i equals J_r or ∅ for all r, i (see Turnbull, 1976, or Li et al., 1997, for a more detailed discussion of innermost sets). For simplicity of exposition, and without loss of

generality, assume all data are interval censored. In this case, the cross-validated likelihood is defined as

  Π_{r=1}^m ∫_{J_r} f̂_h^{(−r)}(t) dt,

where ∫_{J_r} f̂_h^{(−r)}(t) dt is obtained by dropping the innermost interval J_r when estimating the density. Dropping an innermost interval is accomplished by removing all intervals in the original sample that contribute to its presence but not to the presence of any other innermost interval. This conveniently addresses the question of tied observations, which are common for interval censored data. For example, the hemophiliac data contains only 40 distinct intervals in a sample of size 105. In addition, it also addresses the issue of how to handle two observed intervals that are not tied but have a high degree of overlap. If they both overlap completely with the eliminated innermost interval then they are both eliminated when estimating the contribution to the cross-validation process for that interval.

While the scheme is admittedly ad hoc, it worked well in a limited simulation study using 40 samples. A description of how data was generated is given in §6. Table 1 compares average values of our cross-validated likelihood with the Kullback-Leibler distance for both f̂_j and f̂_t. In both cases our cross-validated likelihood is quite accurate when compared to the Kullback-Leibler distance. It obtains its maximum at, or near, the value of the window size that minimizes the Kullback-Leibler distance. In addition, the ideal window size is smaller for the proposed estimator, f̂_j, indicating it uses more information in the data. Note the method contradicts our aim of simplicity as knowledge of the innermost intervals is required for computation. Ultimately, a method that is independent of innermost intervals, like the k-fold cross-validation considered in Pan (2000), may be preferred.

6 A simulation study

For the estimator f̂_j two patterns of behaviour are evident in the following simulation study: convergence and improvement over the standard kernel density estimate, f̂_t. Five criteria are considered, of which three are useful for comparing f̂_j and f̂_t. The remaining two assess the dependence of the estimator on the initial value, f_0, of the algorithm. All criteria assess convergence. Define the squared distance, Δ, between two functions, u and v, as

  Δ(u, v) = Σ_{x ∈ X} { u(x) − v(x) }²,

where X is a fixed grid of equally spaced points spanning the range of the data. This distance is central to all convergence criteria with the exception of one involving the Kullback-Leibler distance.
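The squared distance Δ and the contraction condition of the next section can be checked directly on such a grid; in this sketch the iterates are stand-in normal densities (not estimates from the paper) whose centres move together:

```python
import numpy as np

def delta(u, v):
    """Delta(u, v) = sum over a fixed grid of {u(x) - v(x)}^2,
    with u and v given as arrays of values on the grid."""
    return float(((u - v) ** 2).sum())

grid = np.linspace(-3.0, 3.0, 101)
pdf = lambda m: np.exp(-0.5 * (grid - m) ** 2) / np.sqrt(2 * np.pi)

# stand-in iterates from two different starting values; their centres draw
# together, so the condition Delta(f_j, g_j) < Delta(f_{j-1}, g_{j-1}) holds
f_prev, g_prev = pdf(-1.0), pdf(1.0)
f_curr, g_curr = pdf(-0.4), pdf(0.4)
print(delta(f_curr, g_curr) < delta(f_prev, g_prev))  # True
```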

If the algorithm converges to a fixed point, f̂, that is independent of the initial value, then it is said to be a contraction mapping, Ψ, if for some suitably defined space of densities F, Ψ is such that

  Ψ: F → F,  f̂_j = Ψ(f̂_{j−1}),  f̂ = Ψ(f̂),  Δ(f̂_j, ĝ_j) < Δ(f̂_{j−1}, ĝ_{j−1}), j > 1,

where f̂_j and ĝ_j are the density estimates at the jth step for two arbitrary but different starting points f_0, g_0 ∈ F. The first two columns of Table 2 assess the behaviour of the estimator as a contraction mapping. The "squared distance" column gives the average value of Δ(f̂_j, ĝ_j), j = 1, ..., 10, based on 100 samples,

  Δ̄(f̂_j, ĝ_j) = (1/100) Σ_i Δ_i(f̂_j, ĝ_j),

where Δ_i(·, ·) denotes the value of Δ(·, ·) for the ith sample. The "contraction" column gives the proportion of samples that satisfy the condition

  Δ_i(f̂_j, ĝ_j) < Δ_i(f̂_{j−1}, ĝ_{j−1}), j > 1.

For each of the 100 random samples we generated 20 failure times from a Weibull distribution with shape parameter 0.75 and scale parameter 3. Independently, we then generated "visit times" using a homogeneous Poisson process. Each failure time was interval-censored by the visit times that bracketed it. For each sample, we computed the iterative scheme (1) using a Gaussian kernel with a fixed window size h. We used a sample size of B = 100 for the importance sampling scheme described in §4. The initial values of the density for the iterative scheme are various scale and location shifted beta distributions. Here f_0 and g_0 are based on beta(5, 2) and beta(2, 5) distributions respectively.

The criteria MSE_r, r = 1, 2, use Δ(u, v) to assess the expected value of the squared distance under the Weibull(0.75, 3) distribution. When r = 1, the function u is set to be the true density, f, and v = f̂_j. Thus MSE_1 estimates the actual mean squared error of the estimator. For MSE_2 the true density is replaced by the ideal estimator, f̂_c, and the criterion assesses the closeness of f̂_j to the ideal estimator as discussed in §2. The final column assesses an estimate of the Kullback-Leibler distance

  Δ_KL(f, f̂_j) = E[ log{ f(X)/f̂_j(X) } ].

For each of the 100 samples an estimate, Δ̂_i(f, f̂_j),

is itself based on a sample X ~ Weibull(0.75, 3). As in §4, the values of the density estimate f̂_j(X) are found by interpolation using the values of f̂_j computed at the grid X. The entry given in the table is

  Δ̄_KL(f, f̂_j) = (1/100) Σ_i Δ̂_i(f, f̂_j).

Note the MSE and Kullback-Leibler criteria are evaluated for Turnbull's estimator as well, where f̂_j is simply replaced by f̂_t. All columns in the table give standard errors in brackets.

The results reported in Table 2 show the algorithm tends to reach convergence after 3-6 iterations. The contraction criterion improves until the 6th iteration, after which its behaviour is consistent with what would be expected if the Monte Carlo scheme of §4 involved sampling from the fixed point (once convergence is reached we expect the condition Δ_i(f̂_j, ĝ_j) < Δ_i(f̂_{j−1}, ĝ_{j−1}) to hold about 50% of the time). The squared distance criterion shows the distance between estimators with different initial values gets very small indeed by the 6th iteration. The MSE and Kullback-Leibler criteria reach their minimum after only 3 iterations of the algorithm, after which they remain fairly constant. The large improvement between the first and second iterates, and the smaller improvement between the second and third iterates, show how the estimator continues to extract more information out of the data after the first iteration. It is these improvements that result in the estimator having better properties than f̂_t for these criteria. Finally, as an example, Figure 5 shows four successive iterates for the hemophiliac data based on various initial values of the algorithm.

7 Use as a scatterplot smoother

Kernel weights are useful in regression as well as density estimation. In the regression context we consider the use of kernel weights where a covariate is interval censored. The techniques described here are understood to be applicable to multiple regression problems where an additive or generalized additive model is used. This and the context where the response in a regression is also interval censored are deferred. The purpose of the following example is to exhibit the flexibility of the methods of the paper rather than to perform the ideal data analysis.

The data used is a subset of a larger dataset concerning HIV infection and infant mortality (Hughes and Richardson, 2001). Here we only consider infants with no interval censoring in the response (i.e. infants that died) and that were infected with HIV. Consider the model E[Y] = g(x), where the covariate x may be interval censored. In terms of a scatterplot, interval censoring in a covariate means only the y coordinate is known. Any smoothing process that uses kernel weights whose size is determined by some nearest neighbourhood technique needs

modification, as such neighbourhoods are determined by the covariate. Suppose, for example, that a running mean smoother is used where

  ŷ = ĝ(x) = Σ_{y_j ∈ N_x} v_j y_j,

where N_x is the nearest neighbourhood for x. Typically the weights v_j are given as v_j = K((x − X_j)/h); however the X_j are not observed. In keeping with the spirit of this paper, v_j is replaced by

  μ_{I_j} = E_{f̂_x}[ K((x − X)/h) | X ∈ I_j ],

where expectation is computed with respect to the fixed point f̂_x of (1) restricted to the interval I_j. Note the estimate f̂_x is the density estimate for the covariate X. As in §4, expectation is approximated by an importance sampling algorithm and so the recipe for computing ĝ is:

1. Generate a sample X_1, ..., X_B from the chosen importance sampling distribution for the interval I_j.
2. Using f̂_x compute the sampling weights w*_k as in §4.
3. Approximate μ_{I_j} by μ̂_{I_j} = Σ_{k=1}^B K((x − X_k)/h) w*_k.
4. Compute ĝ(x) = Σ_{y_j ∈ N_x} μ̂_{I_j} y_j.

Figure 6 gives four plots for the infant data. The first of these is a scatterplot of the times of death for the sixty infants used in the fitting process. The covariate is the time of infection with the HIV virus. It is interval censored and hence a scatterplot "point" is actually a line obtained by joining the right and left endpoints of each interval. The second plot gives the fitted density estimate for the covariate based on four iterations of our algorithm. The cross validation technique of §5 was used to pick the "ideal" window size of 375. The data were originally collected in a study of the effect of breast feeding on infection. However, for the infants used here the primary source of infection seems to be in utero. The time point 0 indicates the time of birth of the child. Note intervals for the covariate extend below 0, indicating infection may have taken place before the birth of the child. The remaining two plots use this density estimate to kernel smooth the scatterplot using a running mean, although any linear smoother could be used. Several window sizes were arbitrarily chosen.

The plots raise many interesting issues worthy of further study. For example, the smooths are dominated in an unreasonable fashion by three points on the right, suggesting that either the techniques be made robust or the choice of smoothing parameter be made adaptive, or both! These and other issues, like how to handle an interval censored response as well, will be considered in future work.
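The four-step recipe above can be sketched as follows; the uniform proposal g, the stand-in covariate density f̂_x, and the normalization of the weights μ̂ over the neighbourhood are assumptions of the sketch rather than prescriptions of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_K(z):
    return np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)

def interval_weight(x, interval, f_hat, h, B=200):
    """mu_I = E[K((x - X)/h) | X in I] by importance sampling with g uniform
    on I; f_hat stands in for the fitted covariate density f_x."""
    L, R = interval
    X = rng.uniform(L, R, size=B)   # step 1: sample from g
    w = f_hat(X)                    # step 2: weights; the constant uniform g cancels
    w = w / w.sum()                 # normalized w*_k
    return float((gauss_K((x - X) / h) * w).sum())  # step 3: mu-hat

def running_mean(x, intervals, y, f_hat, h):
    """Step 4: g-hat(x) as a kernel-weighted mean of the responses y_j;
    the weights are normalized here so the smoother averages y."""
    mu = np.array([interval_weight(x, I, f_hat, h) for I in intervals])
    return float((mu * y).sum() / mu.sum())

# illustrative data: interval-censored covariate, exact responses
intervals = [(0.0, 1.0), (0.5, 2.0), (1.5, 3.0), (2.5, 4.0)]
y = np.array([1.0, 1.5, 2.5, 3.0])
f_hat = lambda t: np.exp(-0.5 * ((t - 2.0) / 1.5) ** 2)  # stand-in for f_x
print(running_mean(2.0, intervals, y, f_hat, h=0.8))
```

Because the weights are positive, the fitted value is a convex combination of the responses in the neighbourhood, which is what makes the sensitivity to outlying points noted below possible.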

Acknowledgements

The authors are grateful to Jeff Rosenthal, David Andrews, Rob Tibshirani and Jerry Lawless for inspiring discussions. We particularly appreciate many useful conversations with Paul Corey. They also wish to thank the Natural Sciences and Engineering Research Council of Canada for supporting this research through individual operating grants.

Appendix

Proof of Theorem 3.2: The proof is given for the case where all data are interval censored. Recall

  F̄_j(x) = ∫_{−∞}^x f̄_j(t) dt = (1/n) Σ_{i=1}^n { [F̄_{j−1}(x) − F̄_{j−1}(L_i)] / [F̄_{j−1}(R_i) − F̄_{j−1}(L_i)] · 1_i(x) + I[x ≥ R_i] },

so that

  f̄_j(x) = dF̄_j(x)/dx = (1/n) Σ_{i=1}^n f̄_{j−1}(x) / [F̄_{j−1}(R_i) − F̄_{j−1}(L_i)] · 1_i(x) = (1/n) Σ_{i=1}^n c_{i,j−1}^{−1} f̄_{j−1}(x) 1_i(x).

Now standard calculations like those of Silverman (1986) allow expansion about h = 0 of each conditional expectation, and thus expansion of f̂_j for all j. Writing g(t) = f̄_{j−1}(t) 1_i(t),

  (1/h) ∫_{L_i}^{R_i} K((x − t)/h) f̄_{j−1,i}(t) dt
    = c_{i,j−1}^{−1} (1/h) ∫ K((x − t)/h) f̄_{j−1}(t) 1_i(t) dt
    = c_{i,j−1}^{−1} (1/h) ∫ K((x − t)/h) g(t) dt
    = c_{i,j−1}^{−1} ∫ K(u) g(x − hu) du
    = c_{i,j−1}^{−1} ∫ K(u) { g(x) − hu ġ(x) + (h²/2) u² g̈(x) + ··· } du
    = c_{i,j−1}^{−1} { g(x) ∫ K(u) du − h ġ(x) ∫ u K(u) du + (h²/2) g̈(x) ∫ u² K(u) du + ··· }

    = c_{i,j−1}^{−1} { 1_i(x) f̄_{j−1}(x) + (h²/2) g̈(x) σ²_K + ··· }.

Averaging over i gives f̂_j(x) = f̄_j(x) + O(h²), and therefore

  F̂_j(x) = F̄_j(x) + O(h²), j = 1, 2, .... ∎

References

[1] Efron, B. (1967). "The two sample problem with censored data". Fourth Berkeley Symposium on Mathematical Statistics, University of California Press.

[2] Goutis, C. (1997). "Nonparametric estimation of a mixing density via the kernel method". Journal of the American Statistical Association, 92.

[3] De Gruttola, V. and Lagakos, S. W. (1989). "Analysis of doubly-censored survival data, with applications to AIDS". Biometrics, 45.

[4] Hughes, J. P. and Richardson, R. (2001). "Analysis of a randomized trial to prevent vertical transmission of HIV-1". Journal of the American Statistical Association, in press.

[5] Li, L., Watkins, T. and Yu, Q. (1997). "An EM algorithm for smoothing the self-consistent estimator of survival functions with interval-censored data". Scandinavian Journal of Statistics, 24.

[6] Pan, W. (2000). "Smooth estimation of the survival function for interval censored data". Statistics in Medicine, 19.

[7] Silverman, B. (1986). Density Estimation for Statistics and Data Analysis. London: Chapman-Hall.

[8] Turnbull, B. W. (1976). "The empirical distribution function with arbitrarily grouped, censored and truncated data". Journal of the Royal Statistical Society, Series B, 38.

[9] Wu, C. F. (1983). "On the convergence of the EM algorithm". Annals of Statistics, 11.

17 observation count time of infection density estimate time of infection Figure : Te rst plot gives te original data wit a line joining te left and rigt endpoints of eac interval Te second plot gives two kernel density estimates wit te solid line being tat of te metod proposed, ^f4, after 4 iterations and te dotted line a kernel smooted version of Turnbull's estimator, ^f t 7

18 weigt x weigt x Figure 2: Te plot depicts ow te kernel density estimate works Te diamond saped plotting caracter gives te weigt for te longer interval 8

19 density estimate j= j=2 j=3 j= time of infection Figure 3: Te plot gives te kernel density estimate ^f j for eac of te rst four iterations of te algoritm Convergence appears to ave been reaced by te tird iteration 9

20 density estimate time of infection density estimate time of infection Figure 4: Plots of te pointwise mean and 99% percentile intervals for B wit values of 0 and 00 respectively 20

21 density estimates time of infection density estimates time of infection density estimates time of infection density estimates time of infection Figure 5: Plots of ^f j j = ::: 4 for dierent initial values of te algoritm 2

22 Time of deat Time of infection f Time of infection Time of deat = =5 =2 =25 =3 = Time of infection Time of deat = =5 =2 =25 =3 = Time of infection Figure 6: Clockwise from te upper left are plots of te data, a density estimate for te interval censored covariate and two plots of te tted curve for various window sizes Te rst of tese includes te scatterplot data reported as te midpoint of te interval and te second gives te interval itself 22

23 Table : Cross validation (CV) and Kullbeck-Leibler (KL) distances CV ( ^f 4 ) KL ( ^f 4 ) CV ( ^f t ) KL ( ^f t ) ;

24 Table 2: Te beavior of various criteria j Squared distance Contraction MSE MSE 2 Kullbeck-Leibler 2e-0 (37e-02) (00032) (00054) 079 (00059) 2 2e-02 (94e-03) 000(N/A) 0055 (00037) 0030 ( ) 03 (00049) 3 6e-03 (20e-03) 000(N/A) (000322) 0022 ( ) 025 (00047) 4 24e-04 (44e-04) 000(N/A) (000325) 0022 ( ) 026 (00050) 5 55e-05 (80e-05) 09(00286) (000328) 0023 ( ) 025 (00050) 6 27e-05 (26e-05) 069(00462) (000329) 0024 ( ) 026 (00050) 7 28e-05 (24e-05) 047(00499) 0049 (000328) 0023 (000088) 024 (00052) 8 26e-05 (27e-05) 053(00499) 0049 (000328) 0023 ( ) 025 (00049) 9 28e-05 (30e-05) 046(00498) (00033) 0023 ( ) 025 (00049) 0 25e-05 (25e-05) 053(00499) 0049 (000329) 0023 ( ) 026 (00052) ^f t (000579) (000302) 049 (00067) 24

The Priestley-Chao Estimator

The Priestley-Chao Estimator Te Priestley-Cao Estimator In tis section we will consider te Pristley-Cao estimator of te unknown regression function. It is assumed tat we ave a sample of observations (Y i, x i ), i = 1,..., n wic are

More information

lecture 26: Richardson extrapolation

lecture 26: Richardson extrapolation 43 lecture 26: Ricardson extrapolation 35 Ricardson extrapolation, Romberg integration Trougout numerical analysis, one encounters procedures tat apply some simple approximation (eg, linear interpolation)

More information

Consider a function f we ll specify which assumptions we need to make about it in a minute. Let us reformulate the integral. 1 f(x) dx.

Consider a function f we ll specify which assumptions we need to make about it in a minute. Let us reformulate the integral. 1 f(x) dx. Capter 2 Integrals as sums and derivatives as differences We now switc to te simplest metods for integrating or differentiating a function from its function samples. A careful study of Taylor expansions

More information

Polynomial Interpolation

Polynomial Interpolation Capter 4 Polynomial Interpolation In tis capter, we consider te important problem of approximatinga function fx, wose values at a set of distinct points x, x, x,, x n are known, by a polynomial P x suc

More information

HOMEWORK HELP 2 FOR MATH 151

HOMEWORK HELP 2 FOR MATH 151 HOMEWORK HELP 2 FOR MATH 151 Here we go; te second round of omework elp. If tere are oters you would like to see, let me know! 2.4, 43 and 44 At wat points are te functions f(x) and g(x) = xf(x)continuous,

More information

Polynomial Interpolation

Polynomial Interpolation Capter 4 Polynomial Interpolation In tis capter, we consider te important problem of approximating a function f(x, wose values at a set of distinct points x, x, x 2,,x n are known, by a polynomial P (x

More information

Numerical Differentiation

Numerical Differentiation Numerical Differentiation Finite Difference Formulas for te first derivative (Using Taylor Expansion tecnique) (section 8.3.) Suppose tat f() = g() is a function of te variable, and tat as 0 te function

More information

Order of Accuracy. ũ h u Ch p, (1)

Order of Accuracy. ũ h u Ch p, (1) Order of Accuracy 1 Terminology We consider a numerical approximation of an exact value u. Te approximation depends on a small parameter, wic can be for instance te grid size or time step in a numerical

More information

Lecture 15. Interpolation II. 2 Piecewise polynomial interpolation Hermite splines

Lecture 15. Interpolation II. 2 Piecewise polynomial interpolation Hermite splines Lecture 5 Interpolation II Introduction In te previous lecture we focused primarily on polynomial interpolation of a set of n points. A difficulty we observed is tat wen n is large, our polynomial as to

More information

A = h w (1) Error Analysis Physics 141

A = h w (1) Error Analysis Physics 141 Introduction In all brances of pysical science and engineering one deals constantly wit numbers wic results more or less directly from experimental observations. Experimental observations always ave inaccuracies.

More information

Bootstrap prediction intervals for Markov processes

Bootstrap prediction intervals for Markov processes arxiv: arxiv:0000.0000 Bootstrap prediction intervals for Markov processes Li Pan and Dimitris N. Politis Li Pan Department of Matematics University of California San Diego La Jolla, CA 92093-0112, USA

More information

Kernel Density Estimation

Kernel Density Estimation Kernel Density Estimation Univariate Density Estimation Suppose tat we ave a random sample of data X 1,..., X n from an unknown continuous distribution wit probability density function (pdf) f(x) and cumulative

More information

Chapter 1. Density Estimation

Chapter 1. Density Estimation Capter 1 Density Estimation Let X 1, X,..., X n be observations from a density f X x. Te aim is to use only tis data to obtain an estimate ˆf X x of f X x. Properties of f f X x x, Parametric metods f

More information

Fast Exact Univariate Kernel Density Estimation

Fast Exact Univariate Kernel Density Estimation Fast Exact Univariate Kernel Density Estimation David P. Hofmeyr Department of Statistics and Actuarial Science, Stellenbosc University arxiv:1806.00690v2 [stat.co] 12 Jul 2018 July 13, 2018 Abstract Tis

More information

SECTION 3.2: DERIVATIVE FUNCTIONS and DIFFERENTIABILITY

SECTION 3.2: DERIVATIVE FUNCTIONS and DIFFERENTIABILITY (Section 3.2: Derivative Functions and Differentiability) 3.2.1 SECTION 3.2: DERIVATIVE FUNCTIONS and DIFFERENTIABILITY LEARNING OBJECTIVES Know, understand, and apply te Limit Definition of te Derivative

More information

Te comparison of dierent models M i is based on teir relative probabilities, wic can be expressed, again using Bayes' teorem, in terms of prior probab

Te comparison of dierent models M i is based on teir relative probabilities, wic can be expressed, again using Bayes' teorem, in terms of prior probab To appear in: Advances in Neural Information Processing Systems 9, eds. M. C. Mozer, M. I. Jordan and T. Petsce. MIT Press, 997 Bayesian Model Comparison by Monte Carlo Caining David Barber D.Barber@aston.ac.uk

More information

Regularized Regression

Regularized Regression Regularized Regression David M. Blei Columbia University December 5, 205 Modern regression problems are ig dimensional, wic means tat te number of covariates p is large. In practice statisticians regularize

More information

7 Semiparametric Methods and Partially Linear Regression

7 Semiparametric Methods and Partially Linear Regression 7 Semiparametric Metods and Partially Linear Regression 7. Overview A model is called semiparametric if it is described by and were is nite-dimensional (e.g. parametric) and is in nite-dimensional (nonparametric).

More information

Parameter Fitted Scheme for Singularly Perturbed Delay Differential Equations

Parameter Fitted Scheme for Singularly Perturbed Delay Differential Equations International Journal of Applied Science and Engineering 2013. 11, 4: 361-373 Parameter Fitted Sceme for Singularly Perturbed Delay Differential Equations Awoke Andargiea* and Y. N. Reddyb a b Department

More information

Boosting Kernel Density Estimates: a Bias Reduction. Technique?

Boosting Kernel Density Estimates: a Bias Reduction. Technique? Boosting Kernel Density Estimates: a Bias Reduction Tecnique? Marco Di Marzio Dipartimento di Metodi Quantitativi e Teoria Economica, Università di Cieti-Pescara, Viale Pindaro 42, 65127 Pescara, Italy

More information

Chapter 4: Numerical Methods for Common Mathematical Problems

Chapter 4: Numerical Methods for Common Mathematical Problems 1 Capter 4: Numerical Metods for Common Matematical Problems Interpolation Problem: Suppose we ave data defined at a discrete set of points (x i, y i ), i = 0, 1,..., N. Often it is useful to ave a smoot

More information

Mathematics 5 Worksheet 11 Geometry, Tangency, and the Derivative

Mathematics 5 Worksheet 11 Geometry, Tangency, and the Derivative Matematics 5 Workseet 11 Geometry, Tangency, and te Derivative Problem 1. Find te equation of a line wit slope m tat intersects te point (3, 9). Solution. Te equation for a line passing troug a point (x

More information

Math 102 TEST CHAPTERS 3 & 4 Solutions & Comments Fall 2006

Math 102 TEST CHAPTERS 3 & 4 Solutions & Comments Fall 2006 Mat 102 TEST CHAPTERS 3 & 4 Solutions & Comments Fall 2006 f(x+) f(x) 10 1. For f(x) = x 2 + 2x 5, find ))))))))) and simplify completely. NOTE: **f(x+) is NOT f(x)+! f(x+) f(x) (x+) 2 + 2(x+) 5 ( x 2

More information

Chapter 5 FINITE DIFFERENCE METHOD (FDM)

Chapter 5 FINITE DIFFERENCE METHOD (FDM) MEE7 Computer Modeling Tecniques in Engineering Capter 5 FINITE DIFFERENCE METHOD (FDM) 5. Introduction to FDM Te finite difference tecniques are based upon approximations wic permit replacing differential

More information

THE ROYAL STATISTICAL SOCIETY GRADUATE DIPLOMA EXAMINATION MODULE 5

THE ROYAL STATISTICAL SOCIETY GRADUATE DIPLOMA EXAMINATION MODULE 5 THE ROYAL STATISTICAL SOCIETY GRADUATE DIPLOMA EXAMINATION NEW MODULAR SCHEME introduced from te examinations in 009 MODULE 5 SOLUTIONS FOR SPECIMEN PAPER B THE QUESTIONS ARE CONTAINED IN A SEPARATE FILE

More information

Combining functions: algebraic methods

Combining functions: algebraic methods Combining functions: algebraic metods Functions can be added, subtracted, multiplied, divided, and raised to a power, just like numbers or algebra expressions. If f(x) = x 2 and g(x) = x + 2, clearly f(x)

More information

Volume 29, Issue 3. Existence of competitive equilibrium in economies with multi-member households

Volume 29, Issue 3. Existence of competitive equilibrium in economies with multi-member households Volume 29, Issue 3 Existence of competitive equilibrium in economies wit multi-member ouseolds Noriisa Sato Graduate Scool of Economics, Waseda University Abstract Tis paper focuses on te existence of

More information

MAT 145. Type of Calculator Used TI-89 Titanium 100 points Score 100 possible points

MAT 145. Type of Calculator Used TI-89 Titanium 100 points Score 100 possible points MAT 15 Test #2 Name Solution Guide Type of Calculator Used TI-89 Titanium 100 points Score 100 possible points Use te grap of a function sown ere as you respond to questions 1 to 8. 1. lim f (x) 0 2. lim

More information

IEOR 165 Lecture 10 Distribution Estimation

IEOR 165 Lecture 10 Distribution Estimation IEOR 165 Lecture 10 Distribution Estimation 1 Motivating Problem Consider a situation were we ave iid data x i from some unknown distribution. One problem of interest is estimating te distribution tat

More information

. If lim. x 2 x 1. f(x+h) f(x)

. If lim. x 2 x 1. f(x+h) f(x) Review of Differential Calculus Wen te value of one variable y is uniquely determined by te value of anoter variable x, ten te relationsip between x and y is described by a function f tat assigns a value

More information

Yishay Mansour. AT&T Labs and Tel-Aviv University. design special-purpose planning algorithms that exploit. this structure.

Yishay Mansour. AT&T Labs and Tel-Aviv University. design special-purpose planning algorithms that exploit. this structure. A Sparse Sampling Algoritm for Near-Optimal Planning in Large Markov Decision Processes Micael Kearns AT&T Labs mkearns@researc.att.com Yisay Mansour AT&T Labs and Tel-Aviv University mansour@researc.att.com

More information

A Jump-Preserving Curve Fitting Procedure Based On Local Piecewise-Linear Kernel Estimation

A Jump-Preserving Curve Fitting Procedure Based On Local Piecewise-Linear Kernel Estimation A Jump-Preserving Curve Fitting Procedure Based On Local Piecewise-Linear Kernel Estimation Peiua Qiu Scool of Statistics University of Minnesota 313 Ford Hall 224 Curc St SE Minneapolis, MN 55455 Abstract

More information

The total error in numerical differentiation

The total error in numerical differentiation AMS 147 Computational Metods and Applications Lecture 08 Copyrigt by Hongyun Wang, UCSC Recap: Loss of accuracy due to numerical cancellation A B 3, 3 ~10 16 In calculating te difference between A and

More information

arxiv: v1 [math.pr] 28 Dec 2018

arxiv: v1 [math.pr] 28 Dec 2018 Approximating Sepp s constants for te Slepian process Jack Noonan a, Anatoly Zigljavsky a, a Scool of Matematics, Cardiff University, Cardiff, CF4 4AG, UK arxiv:8.0v [mat.pr] 8 Dec 08 Abstract Slepian

More information

The Complexity of Computing the MCD-Estimator

The Complexity of Computing the MCD-Estimator Te Complexity of Computing te MCD-Estimator Torsten Bernolt Lerstul Informatik 2 Universität Dortmund, Germany torstenbernolt@uni-dortmundde Paul Fiscer IMM, Danisc Tecnical University Kongens Lyngby,

More information

New Distribution Theory for the Estimation of Structural Break Point in Mean

New Distribution Theory for the Estimation of Structural Break Point in Mean New Distribution Teory for te Estimation of Structural Break Point in Mean Liang Jiang Singapore Management University Xiaou Wang Te Cinese University of Hong Kong Jun Yu Singapore Management University

More information

= 0 and states ''hence there is a stationary point'' All aspects of the proof dx must be correct (c)

= 0 and states ''hence there is a stationary point'' All aspects of the proof dx must be correct (c) Paper 1: Pure Matematics 1 Mark Sceme 1(a) (i) (ii) d d y 3 1x 4x x M1 A1 d y dx 1.1b 1.1b 36x 48x A1ft 1.1b Substitutes x = into teir dx (3) 3 1 4 Sows d y 0 and states ''ence tere is a stationary point''

More information

CS522 - Partial Di erential Equations

CS522 - Partial Di erential Equations CS5 - Partial Di erential Equations Tibor Jánosi April 5, 5 Numerical Di erentiation In principle, di erentiation is a simple operation. Indeed, given a function speci ed as a closed-form formula, its

More information

Copyright c 2008 Kevin Long

Copyright c 2008 Kevin Long Lecture 4 Numerical solution of initial value problems Te metods you ve learned so far ave obtained closed-form solutions to initial value problems. A closedform solution is an explicit algebriac formula

More information

1. Questions (a) through (e) refer to the graph of the function f given below. (A) 0 (B) 1 (C) 2 (D) 4 (E) does not exist

1. Questions (a) through (e) refer to the graph of the function f given below. (A) 0 (B) 1 (C) 2 (D) 4 (E) does not exist Mat 1120 Calculus Test 2. October 18, 2001 Your name Te multiple coice problems count 4 points eac. In te multiple coice section, circle te correct coice (or coices). You must sow your work on te oter

More information

1 The concept of limits (p.217 p.229, p.242 p.249, p.255 p.256) 1.1 Limits Consider the function determined by the formula 3. x since at this point

1 The concept of limits (p.217 p.229, p.242 p.249, p.255 p.256) 1.1 Limits Consider the function determined by the formula 3. x since at this point MA00 Capter 6 Calculus and Basic Linear Algebra I Limits, Continuity and Differentiability Te concept of its (p.7 p.9, p.4 p.49, p.55 p.56). Limits Consider te function determined by te formula f Note

More information

MA455 Manifolds Solutions 1 May 2008

MA455 Manifolds Solutions 1 May 2008 MA455 Manifolds Solutions 1 May 2008 1. (i) Given real numbers a < b, find a diffeomorpism (a, b) R. Solution: For example first map (a, b) to (0, π/2) and ten map (0, π/2) diffeomorpically to R using

More information

Investigating Euler s Method and Differential Equations to Approximate π. Lindsay Crowl August 2, 2001

Investigating Euler s Method and Differential Equations to Approximate π. Lindsay Crowl August 2, 2001 Investigating Euler s Metod and Differential Equations to Approximate π Lindsa Crowl August 2, 2001 Tis researc paper focuses on finding a more efficient and accurate wa to approximate π. Suppose tat x

More information

Kernel Density Based Linear Regression Estimate

Kernel Density Based Linear Regression Estimate Kernel Density Based Linear Regression Estimate Weixin Yao and Zibiao Zao Abstract For linear regression models wit non-normally distributed errors, te least squares estimate (LSE will lose some efficiency

More information

ERROR BOUNDS FOR THE METHODS OF GLIMM, GODUNOV AND LEVEQUE BRADLEY J. LUCIER*

ERROR BOUNDS FOR THE METHODS OF GLIMM, GODUNOV AND LEVEQUE BRADLEY J. LUCIER* EO BOUNDS FO THE METHODS OF GLIMM, GODUNOV AND LEVEQUE BADLEY J. LUCIE* Abstract. Te expected error in L ) attimet for Glimm s sceme wen applied to a scalar conservation law is bounded by + 2 ) ) /2 T

More information

Handling Missing Data on Asymmetric Distribution

Handling Missing Data on Asymmetric Distribution International Matematical Forum, Vol. 8, 03, no. 4, 53-65 Handling Missing Data on Asymmetric Distribution Amad M. H. Al-Kazale Department of Matematics, Faculty of Science Al-albayt University, Al-Mafraq-Jordan

More information

Math 1241 Calculus Test 1

Math 1241 Calculus Test 1 February 4, 2004 Name Te first nine problems count 6 points eac and te final seven count as marked. Tere are 120 points available on tis test. Multiple coice section. Circle te correct coice(s). You do

More information

1. Consider the trigonometric function f(t) whose graph is shown below. Write down a possible formula for f(t).

1. Consider the trigonometric function f(t) whose graph is shown below. Write down a possible formula for f(t). . Consider te trigonometric function f(t) wose grap is sown below. Write down a possible formula for f(t). Tis function appears to be an odd, periodic function tat as been sifted upwards, so we will use

More information

EFFICIENT REPLICATION VARIANCE ESTIMATION FOR TWO-PHASE SAMPLING

EFFICIENT REPLICATION VARIANCE ESTIMATION FOR TWO-PHASE SAMPLING Statistica Sinica 13(2003), 641-653 EFFICIENT REPLICATION VARIANCE ESTIMATION FOR TWO-PHASE SAMPLING J. K. Kim and R. R. Sitter Hankuk University of Foreign Studies and Simon Fraser University Abstract:

More information

Bootstrap confidence intervals in nonparametric regression without an additive model

Bootstrap confidence intervals in nonparametric regression without an additive model Bootstrap confidence intervals in nonparametric regression witout an additive model Dimitris N. Politis Abstract Te problem of confidence interval construction in nonparametric regression via te bootstrap

More information

[db]

[db] Blind Source Separation based on Second-Order Statistics wit Asymptotically Optimal Weigting Arie Yeredor Department of EE-Systems, el-aviv University P.O.Box 3900, el-aviv 69978, Israel Abstract Blind

More information

LIMITATIONS OF EULER S METHOD FOR NUMERICAL INTEGRATION

LIMITATIONS OF EULER S METHOD FOR NUMERICAL INTEGRATION LIMITATIONS OF EULER S METHOD FOR NUMERICAL INTEGRATION LAURA EVANS.. Introduction Not all differential equations can be explicitly solved for y. Tis can be problematic if we need to know te value of y

More information

Symmetry Labeling of Molecular Energies

Symmetry Labeling of Molecular Energies Capter 7. Symmetry Labeling of Molecular Energies Notes: Most of te material presented in tis capter is taken from Bunker and Jensen 1998, Cap. 6, and Bunker and Jensen 2005, Cap. 7. 7.1 Hamiltonian Symmetry

More information

64 IX. The Exceptional Lie Algebras

64 IX. The Exceptional Lie Algebras 64 IX. Te Exceptional Lie Algebras IX. Te Exceptional Lie Algebras We ave displayed te four series of classical Lie algebras and teir Dynkin diagrams. How many more simple Lie algebras are tere? Surprisingly,

More information

Differential Calculus (The basics) Prepared by Mr. C. Hull

Differential Calculus (The basics) Prepared by Mr. C. Hull Differential Calculus Te basics) A : Limits In tis work on limits, we will deal only wit functions i.e. tose relationsips in wic an input variable ) defines a unique output variable y). Wen we work wit

More information

Sin, Cos and All That

Sin, Cos and All That Sin, Cos and All Tat James K. Peterson Department of Biological Sciences and Department of Matematical Sciences Clemson University Marc 9, 2017 Outline Sin, Cos and all tat! A New Power Rule Derivatives

More information

CHAPTER (A) When x = 2, y = 6, so f( 2) = 6. (B) When y = 4, x can equal 6, 2, or 4.

CHAPTER (A) When x = 2, y = 6, so f( 2) = 6. (B) When y = 4, x can equal 6, 2, or 4. SECTION 3-1 101 CHAPTER 3 Section 3-1 1. No. A correspondence between two sets is a function only if eactly one element of te second set corresponds to eac element of te first set. 3. Te domain of a function

More information

Modelling evolution in structured populations involving multiplayer interactions

Modelling evolution in structured populations involving multiplayer interactions Modelling evolution in structured populations involving multiplayer interactions Mark Broom City University London Complex Systems: Modelling, Emergence and Control City University London London June 8-9

More information

2.1 THE DEFINITION OF DERIVATIVE

2.1 THE DEFINITION OF DERIVATIVE 2.1 Te Derivative Contemporary Calculus 2.1 THE DEFINITION OF DERIVATIVE 1 Te grapical idea of a slope of a tangent line is very useful, but for some uses we need a more algebraic definition of te derivative

More information

3.4 Worksheet: Proof of the Chain Rule NAME

3.4 Worksheet: Proof of the Chain Rule NAME Mat 1170 3.4 Workseet: Proof of te Cain Rule NAME Te Cain Rule So far we are able to differentiate all types of functions. For example: polynomials, rational, root, and trigonometric functions. We are

More information

Time (hours) Morphine sulfate (mg)

Time (hours) Morphine sulfate (mg) Mat Xa Fall 2002 Review Notes Limits and Definition of Derivative Important Information: 1 According to te most recent information from te Registrar, te Xa final exam will be eld from 9:15 am to 12:15

More information

Lecture XVII. Abstract We introduce the concept of directional derivative of a scalar function and discuss its relation with the gradient operator.

Lecture XVII. Abstract We introduce the concept of directional derivative of a scalar function and discuss its relation with the gradient operator. Lecture XVII Abstract We introduce te concept of directional derivative of a scalar function and discuss its relation wit te gradient operator. Directional derivative and gradient Te directional derivative

More information

LIMITS AND DERIVATIVES CONDITIONS FOR THE EXISTENCE OF A LIMIT

LIMITS AND DERIVATIVES CONDITIONS FOR THE EXISTENCE OF A LIMIT LIMITS AND DERIVATIVES Te limit of a function is defined as te value of y tat te curve approaces, as x approaces a particular value. Te limit of f (x) as x approaces a is written as f (x) approaces, as

More information

INFINITE ORDER CROSS-VALIDATED LOCAL POLYNOMIAL REGRESSION. 1. Introduction

INFINITE ORDER CROSS-VALIDATED LOCAL POLYNOMIAL REGRESSION. 1. Introduction INFINITE ORDER CROSS-VALIDATED LOCAL POLYNOMIAL REGRESSION PETER G. HALL AND JEFFREY S. RACINE Abstract. Many practical problems require nonparametric estimates of regression functions, and local polynomial

More information

LECTURE 14 NUMERICAL INTEGRATION. Find

LECTURE 14 NUMERICAL INTEGRATION. Find LECTURE 14 NUMERCAL NTEGRATON Find b a fxdx or b a vx ux fx ydy dx Often integration is required. However te form of fx may be suc tat analytical integration would be very difficult or impossible. Use

More information

Lecture 21. Numerical differentiation. f ( x+h) f ( x) h h

Lecture 21. Numerical differentiation. f ( x+h) f ( x) h h Lecture Numerical differentiation Introduction We can analytically calculate te derivative of any elementary function, so tere migt seem to be no motivation for calculating derivatives numerically. However

More information

The derivative function

The derivative function Roberto s Notes on Differential Calculus Capter : Definition of derivative Section Te derivative function Wat you need to know already: f is at a point on its grap and ow to compute it. Wat te derivative

More information

ch (for some fixed positive number c) reaching c

ch (for some fixed positive number c) reaching c GSTF Journal of Matematics Statistics and Operations Researc (JMSOR) Vol. No. September 05 DOI 0.60/s4086-05-000-z Nonlinear Piecewise-defined Difference Equations wit Reciprocal and Cubic Terms Ramadan

More information

Basic Nonparametric Estimation Spring 2002

Basic Nonparametric Estimation Spring 2002 Basic Nonparametric Estimation Spring 2002 Te following topics are covered today: Basic Nonparametric Regression. Tere are four books tat you can find reference: Silverman986, Wand and Jones995, Hardle990,

More information

1 Calculus. 1.1 Gradients and the Derivative. Q f(x+h) f(x)

1 Calculus. 1.1 Gradients and the Derivative. Q f(x+h) f(x) Calculus. Gradients and te Derivative Q f(x+) δy P T δx R f(x) 0 x x+ Let P (x, f(x)) and Q(x+, f(x+)) denote two points on te curve of te function y = f(x) and let R denote te point of intersection of

More information

Function Composition and Chain Rules

Function Composition and Chain Rules Function Composition and s James K. Peterson Department of Biological Sciences and Department of Matematical Sciences Clemson University Marc 8, 2017 Outline 1 Function Composition and Continuity 2 Function

More information

A Locally Adaptive Transformation Method of Boundary Correction in Kernel Density Estimation

A Locally Adaptive Transformation Method of Boundary Correction in Kernel Density Estimation A Locally Adaptive Transformation Metod of Boundary Correction in Kernel Density Estimation R.J. Karunamuni a and T. Alberts b a Department of Matematical and Statistical Sciences University of Alberta,

More information

2.8 The Derivative as a Function

2.8 The Derivative as a Function .8 Te Derivative as a Function Typically, we can find te derivative of a function f at many points of its domain: Definition. Suppose tat f is a function wic is differentiable at every point of an open

More information

Quantum Numbers and Rules

Quantum Numbers and Rules OpenStax-CNX module: m42614 1 Quantum Numbers and Rules OpenStax College Tis work is produced by OpenStax-CNX and licensed under te Creative Commons Attribution License 3.0 Abstract Dene quantum number.

More information

Differentiation. Area of study Unit 2 Calculus

Differentiation. Area of study Unit 2 Calculus Differentiation 8VCE VCEco Area of stud Unit Calculus coverage In tis ca 8A 8B 8C 8D 8E 8F capter Introduction to limits Limits of discontinuous, rational and brid functions Differentiation using first

More information

Artificial Neural Network Model Based Estimation of Finite Population Total

Artificial Neural Network Model Based Estimation of Finite Population Total International Journal of Science and Researc (IJSR), India Online ISSN: 2319-7064 Artificial Neural Network Model Based Estimation of Finite Population Total Robert Kasisi 1, Romanus O. Odiambo 2, Antony

More information

New families of estimators and test statistics in log-linear models

New families of estimators and test statistics in log-linear models Journal of Multivariate Analysis 99 008 1590 1609 www.elsevier.com/locate/jmva ew families of estimators and test statistics in log-linear models irian Martín a,, Leandro Pardo b a Department of Statistics

More information

Bob Brown Math 251 Calculus 1 Chapter 3, Section 1 Completed 1 CCBC Dundalk

Bob Brown Math 251 Calculus 1 Chapter 3, Section 1 Completed 1 CCBC Dundalk Bob Brown Mat 251 Calculus 1 Capter 3, Section 1 Completed 1 Te Tangent Line Problem Te idea of a tangent line first arises in geometry in te context of a circle. But before we jump into a discussion of

More information

Click here to see an animation of the derivative

Click here to see an animation of the derivative Differentiation Massoud Malek Derivative Te concept of derivative is at te core of Calculus; It is a very powerful tool for understanding te beavior of matematical functions. It allows us to optimize functions,

More information

One-Sided Position-Dependent Smoothness-Increasing Accuracy-Conserving (SIAC) Filtering Over Uniform and Non-uniform Meshes

One-Sided Position-Dependent Smoothness-Increasing Accuracy-Conserving (SIAC) Filtering Over Uniform and Non-uniform Meshes DOI 10.1007/s10915-014-9946-6 One-Sided Position-Dependent Smootness-Increasing Accuracy-Conserving (SIAC) Filtering Over Uniform and Non-uniform Meses JenniferK.Ryan Xiaozou Li Robert M. Kirby Kees Vuik

More information

WYSE Academic Challenge 2004 Sectional Mathematics Solution Set
