Error distribution function for parametrically truncated and censored data Géraldine LAURENT Jointly with Cédric HEUCHENNE QuantOM, HEC-ULg Management School - University of Liège Friday, 14 September 2012
The doctor Anders Green et al (1981) studied the mortality of diabetics in the county of Fyn (Denmark) between the 1 July 1973 and 1 January 1982. For these data, we have the survival time of diabetic patient but this time can be partially observed, the age at the diabetic diagnosis, the sex for each patient.
Duration between the diagnosis and the study end (in years) 70 60 50 40 30 20 10 Female diabetics Censored Observed 0 0 20 40 60 80 Age at the diagnosis (in years) Duration between the diagnosis and the study end (in years) 70 60 50 40 30 20 10 Male diabetics Censored Observed 0 0 20 40 60 80 Age at the diagnosis (in years)
Estimation Asymptotic properties Data Analysis
Estimation
We consider the nonparametric regression model where Y is the response variable X is the covariate Y = m(x ) + σ(x )ε m( ) = E[Y ] and σ 2 ( ) = Var[Y ] are unknown smooth functions, ε is independent of X, with E[ε] = 0 and Var[ε] = 1.
The particularity of (X, Y ) is that (X, Y ) is obtained from cross-sectional sampling Y is subject to right censoring. We study the variable Y delimited by T Y C where T is the truncation variable C is the censoring variable.
Real World Time We use as notation F for cdf
Real World Truncation Time Time We use as notation F for cdf
Real World Truncation Time Time We use as notation F for cdf
Intermediate Observed World Y1 C2 Y3 Y4 C5 C6 Truncation Time Time We use as notation H for cdf, n the sample size
Observed World Y1 Y3 Y4 Truncation Time Time We use as notation H for cdf
We want to estimate the error distribution F ε (e) = IP(ε e) with (X, Y ) satisfying T Y C where the distribution F T X is a parametric distribution the distribution F C T X is completely unknown
We suppose that the variables Y and T are independent, conditionally on X for each value x, the support of F Y X ( x) is included into the support of F T X ( x) the lower bound of the T support is zero the variables (T, Y ) and C T are independent, conditionally on T Y, X
We have H X,Y (x, y) = IP(X x, Y y T Y C) Z Z = (E[w(X, Y )]) 1 w(r, s)df X,Y (r, s), r x the weight function w(x, y) is defined by w(x, y) = Z t y s y 1 GC T X (y t x) df T X (t x) where G C T X (z x) = IP(C T z X = x, T Y ).
We easily obtain F X,Y (x, y) = E[w(X, Y )] Therefore, we have Y m(x ) F ε (e) = IP σ(x ) = = ZZ (x,y): ZZ (x,y): y m(x) e σ(x) Z y m(x) e σ(x) u x e Z v y df X,Y (x, y) dh X,Y (u, v). w(u, v) E[w(X, Y )] dh X,Y (x, y) w(x, y)
Thus, the estimator is with ˆF ε (e) = Ê[w(X, Y )] M nx i=1 ˆε i = Y i ˆm(X i ), M = ˆσ(X i ) Ê[w(X, Y )] = 1 M nx i=1 I { ˆε i e, i = 1} ŵ(x i, Y i ) nx i ŵ(x i, Y i ) i, i=1! 1 where the functions ˆm( ), ˆσ( ) and ŵ(, ) are nonparametric estimators.
For G C T X (t x), we use the Beran (1981) estimator defined by Y W i (x, h n ) Ĝ C T X (t x) = 1 1 P nj=1 W j (x, h n )I {Z j Z i } where Z i t, i =0 Z i = min(c i T i, Y i T i ) and i = I {Y i C i } W i (x, h n ) = K x Xi hn Š P n j=1 K x Xj K is a kernel function hn Š are the Nadaraya-Watson weights h n is a bandwidth sequence tending to 0 when n => ŵ(x, y) = Z t y 1 ĜC T X (y t x) df T X (t x) Œ
The estimators of m( ) and σ( ) are given by ˆm(x) = P ni=1 W i (x,h n)y i i ŵ(x,y i ) P ni=1 W i (x,h n) i ŵ(x,y i ), ˆσ 2 (x) = P ni=1 W i (x,h n) i (Y i ˆm(x)) 2 ŵ(x,y i ) P ni=1 W i (x,h n) i ŵ(x,y i ), extension of the estimators in de Uña-Alvarez and Iglesias-Pérez (2010), described in Heuchenne-Laurent (2012).
Asymptotic properties
Under some assumptions not mentioned here, we obtain an asymptotic representation ˆF ε (e) F ε (e) = nx i=1 uniformly in < e < where Γ(Z i, Y i, X i, i, e) + o p (n 1 2 ) Γ(Z i, Y i, X i, i, e) = E[w(X, Y )] w(x i, Y i ) (I {ε i e, i = 1} F ε (e)) + f ε(e)f X (X i ) Ψ(Z i, Y i, X i, i, e) σ(x i ) for a given function Ψ.
Intermediate results used for the proof Technical lemma sup <e< n 1 nx i=1 E[w(X, Y )] w(x i, Y i ) I {ˆε i e, i = 1} E[w(X, Y )] w(x i, Y i ) I {ε i e, i = 1} IP (ˆε e χ n ) + IP (ε e) = o p n 1 2 Asymptotic representation of ˆm(x) m(x) Asymptotic representation of ˆσ(x) σ(x)
Under the assumptions of the previous theorem, the process n 1 2 ( ˆFε (e) F ε (e)) where < e < converges weakly to a zero-mean Gaussian process Ω(e) with a complex covariance function.
Data Analysis
Between 1987 and 1997 the Spanish Institute for Statistics studied the unemployment of active people, and more especially the one of married women. For these data, we note that the time of unemployment will not be completely observed, the age of the woman acts on the future job.
200 180 Censored Observed 160 Unemployment duration (in months) 140 120 100 80 60 40 20 0 0 100 200 300 400 500 600 700 800 900 1000 Woman age (in months)
For the real data, we suppose that the number of time periods is equal to 1009 but only 446 aren t censored. the distribution of T is a uniform one (Wang, 1991); the variable C is defined by C = T + τ where τ is a constant equal to 18 months; Because of the definition of the censoring variable, the weight function is simply defined by w(x, y) = Z y 0 y τ df T X (t x) The Bootstrap approximation gives the value of 70 months as the optimal bandwidth.
Representation of ˆF Y X for various ages. 1 0.9 0.8 Cumulative distribution function 0.7 0.6 0.5 0.4 0.3 0.2 0.1 cdf of Y X on 20 years cdf of Y X on 35 years cdf of Y X on 50 years 0 0 20 40 60 80 100 120 140 160 180 Unemployment time (in months)
For the real data, we suppose that the number diabetics is equal to 716 for female patients and 783 for male patients but only 241 and 256 are not censored for the female diabetics and the male diabetics. for each sex, the distribution of T X = x is supposed to be a gamma one (Wang, 1991) with a decreasing linear function for the mean and a constant function for the variance; the variable C T is supposed to be a variable dependant of the covariate for each sex. The optimal bandwidth obtained by bootstrap approximation is equal to 26 years for the female population and 28 years for the male population.
Representation of Ŝ Y X for various ages. 1 0.9 0.8 0.7 Est of F Y X ( X=15) female patient Est of F Y X ( X=30) female patient Est of F Y X ( X=50) female patient Est of F Y X ( X=15) male patient Est of F Y X ( X=30) male patient Est of F Y X ( X=50) male patient Survival function 0.6 0.5 0.4 0.3 0.2 0.1 0 0 10 20 30 40 50 60 70 Time interval between the age at diagnosis and the death (in years)
Thank you for your attention
BERAN, R. (1981): Nonparametric regression with randomly censored survival data. Technical Report, University of California, Berkeley. de UNA-ALVAREZ, J., IGLESIAS-PEREZ, M.C. (2010): Nonparametric estimation of a conditional distribution from length-biased data. Annals of the Institute of Statistical Mathematics 62, 323-341. HEUCHENNE, C., LAURENT, G. (2012): Nonparametric estimation of conditional moments with right-censored election biased data. Submission Journal of Statistical Planning and Inference. WANG, M.-C. (1991): Nonparametric estimation from cross-sectional survival data. Journal of the American Statistical Association 86, 130-143.