Estimation for Modified Data

1. Empirical distribution for complete individual data (section 11.2)

Definition. An observation $X$ is truncated from below (left truncated) at $d$ if, when it is at or below $d$ ($X \le d$), it is not recorded, but when it is above $d$ it is recorded at its observed value $X$.

An observation $X$ is truncated from above (right truncated) at $u$ if, when it is at or above $u$ ($X \ge u$), it is not recorded, but when it is below $u$ it is recorded at its observed value $X$.

An observation $X$ is censored from below (left censored) at $d$ if, when it is at or below $d$ ($X \le d$), it is recorded as being equal to $d$, but when it is above $d$ it is recorded at its observed value $X$.

An observation $X$ is censored from above (right censored) at $u$ if, when it is at or above $u$ ($X \ge u$), it is recorded as being equal to $u$, but when it is below $u$ it is recorded at its observed value $X$.

The most common occurrences are left truncation and right censoring. Left truncation occurs when an ordinary deductible of $d$ is applied. When a policyholder has a loss below $d$, he or she knows no benefits will be paid and so does not inform the insurer. When the loss is above $d$, the amount of the loss will be reported. A policy limit is an example of right censoring. When the amount of the loss exceeds $u$, benefits beyond that value are not paid, and so the exact value is not recorded. However, it is known that a loss of at least $u$ has occurred.

In chapters 11 and 12, truncated always means truncated from below and censored always means censored from above.
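The two common cases above (an ordinary deductible and a policy limit) can be illustrated with a minimal sketch. The function name `record_observation` and the numeric values are hypothetical, chosen only for illustration; the function returns the recorded loss amount, not the benefit paid.

```python
from typing import Optional


def record_observation(loss: float, d: float = 0.0,
                       u: float = float("inf")) -> Optional[float]:
    """Recorded value of a loss under left truncation at d and right censoring at u.

    Hypothetical helper: returns None when the loss is truncated (never reported).
    """
    if loss <= d:
        return None          # at or below the deductible: not recorded at all
    return min(loss, u)      # at or above the limit: recorded as u


# Hypothetical deductible of 100 and limit of 1000.
print(record_observation(50, d=100, u=1000))    # None (truncated)
print(record_observation(500, d=100, u=1000))   # 500 (recorded exactly)
print(record_observation(1500, d=100, u=1000))  # 1000 (censored at the limit)
```

Note the difference in what survives: a truncated loss leaves no trace in the data, while a censored loss still contributes the information that it was at least $u$.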

For the observed values $\{x_1, x_2, \dots, x_n\}$, let $y_1 < y_2 < \cdots < y_k$ be the unique values of the $x_i$'s. With each $j$ we associate a so-called risk set $r_j$ as follows:

For survival/mortality data, the risk set is the number of people that are alive at age $y_j$.

For loss amount data, the risk set is the number of policies with observed loss amounts (either the actual amount or the maximum amount due to a policy limit) larger than or equal to $y_j$, less those with deductibles greater than or equal to $y_j$.

Let $s_j$ be the number of times the uncensored observation $y_j$ appears in the sample.

We also use the following notation for individual data:

$d_j$ denotes the truncation point of the $j$-th observation, with $d_j = 0$ in the absence of truncation. In a mortality table, it is the entry time.

$u_j$ denotes the censoring point of the $j$-th observation. In a mortality table, it is the withdrawal time.

$x_j$ denotes the uncensored value (loss amount). In a mortality table, it is the death time.

Definition. In terms of a mortality study, the risk set at the $j$-th ordered observation $y_j$, denoted by $r_j$, is the number of individuals who are under observation at that age:

$$r_j = \#(x_i \ge y_j) + \#(u_i \ge y_j) - \#(d_i \ge y_j).$$

Since the total number of $d_i$'s equals the sum of the number of $x_i$'s and the number of $u_i$'s, we can also calculate $r_j$ as

$$r_j = \#(d_i < y_j) - \#(x_i < y_j) - \#(u_i < y_j),$$

that is, the number of those who have entered prior to $y_j$ minus the number of those who have left prior to $y_j$.

For data consisting of loss amounts:

Risk set = the number of policies with observed loss amounts (either the actual amount or the maximum amount due to a policy limit) greater than or equal to $y_j$, less those with deductibles greater than or equal to $y_j$.

Example (Finan's notes, Example 51.1). The following mortality table is given:

| Life | Time of entry | Time of exit | Reason of exit |
|------|---------------|--------------|----------------|
| 1    | 0             | 4            | End of Study   |
| 2    | 0             | 0.5          | Death          |
| 3    | 0             | 1            | Leaving Study  |
| 4    | 0             | 4            | End of Study   |
| 5    | 1             | 4            | End of Study   |
| 6    | 1.2           | 2            | Death          |
| 7    | 1.5           | 2            | Death          |
| 8    | 2             | 3            | Leaving Study  |
| 9    | 2.5           | 4            | End of Study   |
| 10   | 3.1           | 3.2          | Death          |

(a) Complete the following table:
(b) Create a table summarizing $y_j$, $s_j$, and $r_j$.

The table to complete has columns $i$, $d_i$, $x_i$, $u_i$, with one row for each life $i = 1, 2, \dots, 10$.

Solution for part (a).

| $i$ | $d_i$ (entry) | $x_i$ (death) | $u_i$ (withdraw) |
|-----|---------------|---------------|------------------|
| 1   | 0             |               | 4                |
| 2   | 0             | 0.5           |                  |
| 3   | 0             |               | 1                |
| 4   | 0             |               | 4                |
| 5   | 1             |               | 4                |
| 6   | 1.2           | 2             |                  |
| 7   | 1.5           | 2             |                  |
| 8   | 2             |               | 3                |
| 9   | 2.5           |               | 4                |
| 10  | 3.1           | 3.2           |                  |

Solution for part (b).

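The unique values of the $x_i$'s are $y_j \in \{0.5, 2, 3.2\}$. Now form the extended table, with an asterisk marking the $y_j$ values:

| all numbers in the table | 0 | 0.5* | 1 | 1.2 | 1.5 | 2* | 2.5 | 3 | 3.1 | 3.2* | 4 |
|--------------------------|---|------|---|-----|-----|----|-----|---|-----|------|---|
| number of $d_i$'s        | 4 |      | 1 | 1   | 1   | 1  | 1   |   | 1   |      |   |
| number of $x_i$'s        |   | 1    |   |     |     | 2  |     |   |     | 1    |   |
| number of $u_i$'s        |   |      | 1 |     |     |    |     | 1 |     |      | 4 |

Then from the formula

$$r_j = \#(x_i \ge y_j) + \#(u_i \ge y_j) - \#(d_i \ge y_j)$$

we can complete the following table:

| $j$ | $y_j$ | $s_j$ | $r_j$         |
|-----|-------|-------|---------------|
| 1   | 0.5   | 1     | 4 + 6 - 6 = 4 |
| 2   | 2     | 2     | 3 + 5 - 3 = 5 |
| 3   | 3.2   | 1     | 1 + 4 - 0 = 5 |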
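The counting formula for $r_j$ can be checked mechanically. The following is a minimal sketch, not part of the original notes: the function name `risk_sets` is hypothetical, and the data are the ten lives of the example as read here (entry time $d_i$, death time $x_i$ or withdrawal time $u_i$, with `None` for the one that does not apply).

```python
def risk_sets(d, x, u):
    """Return [(y_j, s_j, r_j)] using r_j = #(x >= y_j) + #(u >= y_j) - #(d >= y_j)."""
    deaths = [v for v in x if v is not None]
    withdrawals = [v for v in u if v is not None]
    table = []
    for y in sorted(set(deaths)):
        s = deaths.count(y)                      # number of deaths exactly at y
        r = (sum(v >= y for v in deaths)
             + sum(v >= y for v in withdrawals)
             - sum(v >= y for v in d))           # not yet entered at y
        table.append((y, s, r))
    return table


# Example data as read from the mortality table (assumed values).
d = [0, 0, 0, 0, 1, 1.2, 1.5, 2, 2.5, 3.1]
x = [None, 0.5, None, None, None, 2, 2, None, None, 3.2]
u = [4, None, 1, 4, 4, None, None, 3, 4, None]

table = risk_sets(d, x, u)
print(table)  # [(0.5, 1, 4), (2, 2, 5), (3.2, 1, 5)]
```

Note that a life entering exactly at $y_j$ is counted in $\#(d_i \ge y_j)$ and therefore excluded from the risk set at $y_j$, which is how life 8 (entry at time 2) is treated in the table.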

Definition. The Kaplan-Meier (product-limit) estimate for the survival function is given by

$$S_n(t) = \begin{cases} 1, & 0 \le t < y_1, \\[4pt] \prod_{i=1}^{j-1} \left(1 - \dfrac{s_i}{r_i}\right), & y_{j-1} \le t < y_j, \quad j = 2, \dots, k, \\[4pt] \prod_{i=1}^{k} \left(1 - \dfrac{s_i}{r_i}\right) \text{ or } 0, & t \ge y_k, \end{cases}$$

where in the case $s_k = r_k$, which means that nobody survives beyond age $y_k$, we set $S_n(t) = 0$ for $t \ge y_k$.

Note. This estimator is a point estimator, as it gives a value $\hat{S}(t)$ for each time $t$.

Note. The count $s_j = \#(x_i : x_i = y_j)$ gives the number of those who die at age $y_j$, while $r_j = \#(x_i \ge y_j) + \#(u_i \ge y_j) - \#(d_i \ge y_j)$ gives the number of those who are under study at $y_j$ before any event (death or withdrawal) occurs. So the quantity

$$1 - \frac{s_j}{r_j} = \frac{r_j - s_j}{r_j}$$

estimates the probability of surviving past time $y_j$ for someone under observation just before $y_j$.

Example. Find the Kaplan-Meier estimate of the survival function for the data in the previous example.

| $j$ | $y_j$ | $s_j$ | $r_j$         |
|-----|-------|-------|---------------|
| 1   | 0.5   | 1     | 4 + 6 - 6 = 4 |
| 2   | 2     | 2     | 3 + 5 - 3 = 5 |
| 3   | 3.2   | 1     | 1 + 4 - 0 = 5 |

$$S_{10}(t) = \begin{cases} 1, & 0 \le t < 0.5, \\[2pt] 1 - \dfrac{s_1}{r_1} = 0.75, & 0.5 \le t < 2, \\[4pt] (0.75)\left(1 - \dfrac{s_2}{r_2}\right) = 0.45, & 2 \le t < 3.2, \\[4pt] (0.45)\left(1 - \dfrac{s_3}{r_3}\right) = 0.36, & t \ge 3.2. \end{cases}$$

Note. The Kaplan-Meier estimate applied to complete individual data is the same as the empirical survival function defined by

$$S_n(x) = 1 - \frac{\text{number of observations} \le x}{n}.$$

Note. For a survival function $S(x)$ associated with a variable $X$ we have

$$P(X \ge x) = \lim_{t \to x^-} S(t) = S(x^-).$$

This is because

$$P(X \ge x) = 1 - P(X < x) = 1 - F(x^-) = S(x^-).$$

Example. You are studying the length of time attorneys are involved in settling bodily injury lawsuits. $T$ represents the number of months from the time an attorney is assigned such a case to the time the case is settled. Nine cases were observed during the study period, two of which were not settled at the conclusion of the study. For those two cases, the time spent up to the conclusion of the study, 4 months and 6 months, was recorded instead. The observed values of $T$ for the other seven cases are as follows:

1, 3, 3, 5, 8, 8, 9

Estimate $\Pr(3 \le T \le 5)$ using the Product-Limit estimator.

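| $i$ | $d_i$ (entry) | $x_i$ (death) | $u_i$ (withdraw) |
|-----|---------------|---------------|------------------|
| 1   | 0             | 1             |                  |
| 2   | 0             | 3             |                  |
| 3   | 0             | 3             |                  |
| 4   | 0             |               | 4                |
| 5   | 0             | 5             |                  |
| 6   | 0             |               | 6                |
| 7   | 0             | 8             |                  |
| 8   | 0             | 8             |                  |
| 9   | 0             | 9             |                  |

| $j$ | $y_j$ | $s_j$ | $r_j$ |
|-----|-------|-------|-------|
| 1   | 1     | 1     | 9     |
| 2   | 3     | 2     | 8     |
| 3   | 5     | 1     | 5     |
| 4   | 8     | 2     | 3     |
| 5   | 9     | 1     | 1     |

$$\hat{S}(t) = \begin{cases} 1, & 0 \le t < 1, \\[2pt] 1 - \frac{1}{9} = \frac{8}{9}, & 1 \le t < 3, \\[4pt] \frac{8}{9}\left(1 - \frac{2}{8}\right) = \frac{2}{3}, & 3 \le t < 5, \\[4pt] \frac{2}{3}\left(1 - \frac{1}{5}\right) = \frac{8}{15}, & 5 \le t < 8, \\[4pt] \frac{8}{15}\left(1 - \frac{2}{3}\right) = \frac{8}{45}, & 8 \le t < 9, \\[4pt] \frac{8}{45}\left(1 - \frac{1}{1}\right) = 0, & t \ge 9. \end{cases}$$

$$P(3 \le T \le 5) = P(T \ge 3) - P(T > 5) = \hat{S}(3^-) - \hat{S}(5) = \frac{8}{9} - \frac{8}{15} = 0.356$$

Example. The claim payments on a sample of ten policies are:

2, 3, 3, 5, 5+, 6, 7, 7+, 9, 10+

where + indicates that the loss exceeded the policy limit.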

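Using the Product-Limit estimator, calculate the probability that the loss on a policy exceeds 8.

| $i$ | $d_i$ (entry) | $x_i$ (event) | $u_i$ (censored) |
|-----|---------------|---------------|------------------|
| 1   | 0             | 2             |                  |
| 2   | 0             | 3             |                  |
| 3   | 0             | 3             |                  |
| 4   | 0             | 5             |                  |
| 5   | 0             |               | 5                |
| 6   | 0             | 6             |                  |
| 7   | 0             | 7             |                  |
| 8   | 0             |               | 7                |
| 9   | 0             | 9             |                  |
| 10  | 0             |               | 10               |

| $j$ | $y_j$ | $s_j$ | $r_j$ |
|-----|-------|-------|-------|
| 1   | 2     | 1     | 10    |
| 2   | 3     | 2     | 9     |
| 3   | 5     | 1     | 7     |
| 4   | 6     | 1     | 5     |
| 5   | 7     | 1     | 4     |
| 6   | 9     | 1     | 2     |

$$\hat{S}(t) = \begin{cases} 1, & 0 \le t < 2, \\[2pt] 1 - \frac{1}{10} = 0.9, & 2 \le t < 3, \\[4pt] (0.9)\left(1 - \frac{2}{9}\right) = 0.7, & 3 \le t < 5, \\[4pt] (0.7)\left(1 - \frac{1}{7}\right) = 0.6, & 5 \le t < 6, \\[4pt] (0.6)\left(1 - \frac{1}{5}\right) = 0.48, & 6 \le t < 7, \\[4pt] (0.48)\left(1 - \frac{1}{4}\right) = 0.36, & 7 \le t < 9, \\[4pt] (0.36)\left(1 - \frac{1}{2}\right) = 0.18, & t \ge 9. \end{cases}$$

Since there is no event time in $(7, 8]$, $\hat{S}(8) = \hat{S}(7) = 0.36$.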
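The product-limit calculation above can be sketched in a few lines. This is an illustrative sketch for data without truncation only; the function names `kaplan_meier` and `survival_at` are hypothetical, and the claim data are those of the ten-policy example as read here. Exact fractions are used so that the steps match the hand calculation.

```python
from fractions import Fraction


def kaplan_meier(observations):
    """observations: list of (value, censored) pairs, no truncation.

    Returns [(y_j, S(y_j))] for the distinct uncensored values y_j.
    """
    event_values = sorted({v for v, censored in observations if not censored})
    surv = Fraction(1)
    steps = []
    for y in event_values:
        s = sum(1 for v, c in observations if v == y and not c)  # events at y
        r = sum(1 for v, c in observations if v >= y)            # risk set at y
        surv *= 1 - Fraction(s, r)
        steps.append((y, surv))
    return steps


def survival_at(steps, t):
    """Evaluate the right-continuous step function at time t."""
    value = Fraction(1)
    for y, surv in steps:
        if y <= t:
            value = surv
    return value


# (value, censored): True marks a loss that exceeded the policy limit.
claims = [(2, False), (3, False), (3, False), (5, False), (5, True),
          (6, False), (7, False), (7, True), (9, False), (10, True)]

steps = kaplan_meier(claims)
print(float(survival_at(steps, 8)))  # 0.36
```

A censored value tied with an event value (5+ with 5, 7+ with 7) is kept in the risk set at that value, which matches the $r_j$ column of the table.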

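Definition. As in the case of complete data, the Nelson-Aalen estimator of the cumulative hazard rate function is defined by

$$\hat{H}(t) = \begin{cases} 0, & 0 \le t < y_1, \\[4pt] \sum_{i=1}^{j-1} \dfrac{s_i}{r_i}, & y_{j-1} \le t < y_j, \quad j = 2, \dots, k, \\[4pt] \sum_{i=1}^{k} \dfrac{s_i}{r_i}, & t \ge y_k, \end{cases}$$

and we then set $\hat{S}(t) = e^{-\hat{H}(t)}$ as the estimate of the survival function.

Example. Find the Nelson-Aalen estimate of the survival function for the mortality data with $y_j = 0.5, 2, 3.2$, $s_j = 1, 2, 1$ and $r_j = 4, 5, 5$.

$$\hat{H}(t) = \begin{cases} 0, & 0 \le t < 0.5, \\[2pt] \frac{1}{4} = 0.25, & 0.5 \le t < 2, \\[4pt] 0.25 + \frac{2}{5} = 0.65, & 2 \le t < 3.2, \\[4pt] 0.65 + \frac{1}{5} = 0.85, & t \ge 3.2. \end{cases}$$

$$\hat{S}(t) = \begin{cases} 1, & 0 \le t < 0.5, \\[2pt] e^{-0.25} = 0.7788, & 0.5 \le t < 2, \\[2pt] e^{-0.65} = 0.5220, & 2 \le t < 3.2, \\[2pt] e^{-0.85} = 0.4274, & t \ge 3.2. \end{cases}$$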
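The same numbers can be reproduced with a short sketch. The function name `nelson_aalen` is hypothetical, and the input is assumed to be the $(y_j, s_j, r_j)$ table of the mortality example ($y_j = 0.5, 2, 3.2$).

```python
import math


def nelson_aalen(table, t):
    """Cumulative hazard estimate H(t): sum of s_j / r_j over all y_j <= t.

    table: list of (y_j, s_j, r_j) tuples.
    """
    return sum(s / r for y, s, r in table if y <= t)


# (y_j, s_j, r_j) for the mortality example (assumed values).
table = [(0.5, 1, 4), (2, 2, 5), (3.2, 1, 5)]

for t in (0.25, 0.5, 2, 3.2):
    h = nelson_aalen(table, t)
    # Print t, H(t), and the implied survival estimate exp(-H(t)).
    print(t, round(h, 4), round(math.exp(-h), 4))
```

Because $e^{-h} \ge 1 - h$, the survival estimate obtained this way is never smaller than the Kaplan-Meier estimate built from the same table (0.4274 versus 0.36 at the last step here).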

Mean and Variance of Nelson-Aalen Estimator (from section 12.2)

Suppose that events have occurred at times $t_i$. In the Nelson-Aalen estimate

$$\hat{H}(y_j) = \sum_{i=1}^{j} \frac{s_i}{r_i}$$

the value $s_i / r_i$ is an estimate of the hazard rate function $h(t_i)$. If we assume a Poisson distribution for the number of events at time $t_i$, this distribution has mean $r_i h(t_i)$, which is approximated by $r_i \left( \frac{s_i}{r_i} \right) = s_i$. So, if $S_i$ represents the random number of events at time $t_i$, then $\mathrm{Var}(S_i) \approx s_i$, as $S_i$ is assumed to follow a Poisson distribution with mean approximately $s_i$. If we assume independence between the random variables $S_i$, then

$$\mathrm{Var}(\hat{H}(y_j)) = \mathrm{Var}\left( \sum_{i=1}^{j} \frac{S_i}{r_i} \right) = \sum_{i=1}^{j} \mathrm{Var}\left( \frac{S_i}{r_i} \right) = \sum_{i=1}^{j} \frac{\mathrm{Var}(S_i)}{r_i^2} \approx \sum_{i=1}^{j} \frac{s_i}{r_i^2}.$$

Note. This formula is recursive:

$$\mathrm{Var}(\hat{H}(y_j)) = \mathrm{Var}(\hat{H}(y_{j-1})) + \frac{s_j}{r_j^2}.$$

Example (Finan's notes, Example 55.1). For a survival study with censored and truncated data, you are given:

| Time ($t$) | Number at risk at time $t$ | Failures at time $t$ |
|------------|----------------------------|----------------------|
| 1          | 30                         | 5                    |
| 2          | 27                         | 9                    |
| 3          | 32                         | 6                    |
| 4          | 25                         | 5                    |
| 5          | 20                         | 4                    |

Estimate the variance of $\hat{H}(4)$.

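$$\mathrm{Var}(\hat{H}(y_4)) = \sum_{i=1}^{4} \frac{s_i}{r_i^2} = \frac{5}{30^2} + \frac{9}{27^2} + \frac{6}{32^2} + \frac{5}{25^2} = 0.0318$$

Example. Fifteen cancer patients were observed from the time of diagnosis until the earlier of death or 36 months from diagnosis. Deaths occurred during the study as follows:

| Time in months since diagnosis | Number of deaths |
|--------------------------------|------------------|
| 15                             | 2                |
| 20                             | 3                |
| 24                             | 2                |
| 30                             | $d$              |
| 34                             | 2                |
| 36                             | 1                |

The Nelson-Aalen estimate $\hat{H}(35)$ is 1.5641. Calculate the Nelson-Aalen estimate of the variance of $\hat{H}(35)$.

First of all, note that $d \le 8$, because just prior to $t = 30$ only 8 survivors remain in the group. The risk sets at times 15, 20, 24, 30 and 34 are 15, 13, 10, 8 and $8 - d$, so

$$\frac{2}{15} + \frac{3}{13} + \frac{2}{10} + \frac{d}{8} + \frac{2}{8 - d} = 1.5641.$$

Since $\frac{2}{15} + \frac{3}{13} + \frac{2}{10} = 0.5641$, this reduces to

$$\frac{d}{8} + \frac{2}{8 - d} = 1.$$

Multiplying both sides by $8(8 - d)$ gives

$$d(8 - d) + 16 = 8(8 - d) \;\Rightarrow\; d^2 - 16d + 48 = 0 \;\Rightarrow\; d = 4 \text{ or } 12,$$

and since $d \le 8$ we get $d = 4$.

The estimate of the variance is then

$$\frac{2}{15^2} + \frac{3}{13^2} + \frac{2}{10^2} + \frac{4}{8^2} + \frac{2}{4^2} = 0.2341.$$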
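Both variance calculations above use the same sum of $s_i / r_i^2$. A minimal sketch follows; the function name `nelson_aalen_variance` is hypothetical, the first table is the survival study as read here, and the second is the cancer study with $d = 4$ substituted, so the risk sets 15, 13, 10, 8, 4, 2 are derived values rather than given ones.

```python
def nelson_aalen_variance(table, t):
    """Estimated variance of H(t): sum of s_j / r_j**2 over all y_j <= t.

    table: list of (y_j, s_j, r_j) tuples.
    """
    return sum(s / r ** 2 for y, s, r in table if y <= t)


# Survival study: (time, failures, number at risk), assumed as read.
study = [(1, 5, 30), (2, 9, 27), (3, 6, 32), (4, 5, 25), (5, 4, 20)]
var_h4 = nelson_aalen_variance(study, 4)

# Cancer study with d = 4: risk sets follow from 15 patients and no censoring
# before month 36.
cancer = [(15, 2, 15), (20, 3, 13), (24, 2, 10), (30, 4, 8), (34, 2, 4), (36, 1, 2)]
h35 = sum(s / r for y, s, r in cancer if y <= 35)
var_h35 = nelson_aalen_variance(cancer, 35)

print(round(var_h4, 4))   # 0.0318
print(round(h35, 4))      # 1.5641
print(round(var_h35, 4))  # 0.2341
```

Recomputing $\hat{H}(35)$ from the derived table is a quick check that $d = 4$ is the right root.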

Variance of Kaplan-Meier Estimator (from section 12.2)

The variance of the Kaplan-Meier estimator is approximated by Greenwood's formula:

$$\mathrm{Var}[\hat{S}(y_j)] = \hat{S}(y_j)^2 \sum_{i=1}^{j} \frac{s_i}{r_i (r_i - s_i)},$$

and for $t$'s other than the $y_j$'s we have

$$\mathrm{Var}[\hat{S}(t)] = \hat{S}(t)^2 \sum \left\{ \frac{s_i}{r_i (r_i - s_i)} : y_i \le t \right\}.$$

We note that the quantity $\hat{S}(t)$ estimates the probability $P(X > t)$. If one wants to estimate the variance of the estimator of a conditional probability ${}_t p_s = P(X > t + s \mid X > s)$, where $s$ is one of the $y_j$ values, then the formula is

$$\mathrm{Var}[\hat{P}(X > t + s \mid X > s)] = \hat{P}(X > t + s \mid X > s)^2 \sum \left\{ \frac{s_i}{r_i (r_i - s_i)} : s + 1 \le y_i \le s + t \right\}.$$

Note. If the data are complete, this formula reduces to the empirical approximation of the variance:

$$\mathrm{Var}[S_n(x)] = \frac{S_n(x)\,(1 - S_n(x))}{n}.$$

Example. The following payment data are given:

1500, 1700+, 2550, 2800, 2800, 3750+, 5600

Here the plus sign means that the policy limit has been reached (note that in the absence of a deductible, the policy limit is the same as the maximum covered loss). Find Greenwood's approximation for the variance of the Kaplan-Meier estimate of $S(4000)$.

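| all numbers in the table | 1500 | 1700 | 2550 | 2800 | 3750 | 5600 |
|--------------------------|------|------|------|------|------|------|
| number of $x_i$'s        | 1    |      | 1    | 2    |      | 1    |
| number of $u_i$'s        |      | 1    |      |      | 1    |      |

The unique values $y_j$ are 1500, 2550, 2800, 5600. The formula $r_j = \#(x \ge y_j) + \#(u \ge y_j) - \#(d \ge y_j)$ still works, but since there are no $d$-values it reduces to

$$r_j = \#(x \ge y_j) + \#(u \ge y_j).$$

| $j$ | $y_j$ | $s_j$ | $r_j$     |
|-----|-------|-------|-----------|
| 1   | 1500  | 1     | 5 + 2 = 7 |
| 2   | 2550  | 1     | 4 + 1 = 5 |
| 3   | 2800  | 2     | 3 + 1 = 4 |
| 4   | 5600  | 1     | 1         |

$$\hat{S}(4000) = \left(\frac{6}{7}\right)\left(\frac{4}{5}\right)\left(\frac{2}{4}\right) = 0.3429$$

$$\mathrm{Var}\left(\hat{S}(4000)\right) = \hat{S}(4000)^2 \sum_{i=1}^{3} \frac{s_i}{r_i (r_i - s_i)} = 0.3429^2 \left[ \frac{1}{(7)(6)} + \frac{1}{(5)(4)} + \frac{2}{(4)(2)} \right] = 0.0381$$

Example. In a five year mortality study, you are given:

| $y_j$ | $s_j$ | $r_j$ |
|-------|-------|-------|
| 1     | 3     | 15    |
| 2     | 24    | 80    |
| 3     | 5     | 25    |
| 4     | 6     | 60    |
| 5     | 3     | 10    |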

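Calculate Greenwood's approximation of the conditional variance of the product limit estimator of $S(4)$.

$$\hat{S}(4) = \left(\frac{12}{15}\right)\left(\frac{56}{80}\right)\left(\frac{20}{25}\right)\left(\frac{54}{60}\right) = 0.4032$$

$$\mathrm{Var}(\hat{S}(4)) = 0.4032^2 \left[ \frac{3}{(15)(12)} + \frac{24}{(80)(56)} + \frac{5}{(25)(20)} + \frac{6}{(60)(54)} \right] = 0.0055$$

Example. A mortality study is conducted on 50 lives observed from time zero. You are given:

(i)

| Time $t$ | Number of deaths $d_t$ | Number censored $c_t$ |
|----------|------------------------|-----------------------|
| 15       | 2                      | 0                     |
| 17       | 0                      | 3                     |
| 25       | 4                      | 0                     |
| 30       | 0                      | $c_{30}$              |
| 32       | 8                      | 0                     |
| 40       | 2                      | 0                     |

(ii) $\hat{S}(35)$ is the Product-Limit estimate of $S(35)$.

(iii) $\widehat{\mathrm{Var}}(\hat{S}(35))$ is the estimate of the variance of $\hat{S}(35)$ using Greenwood's formula.

(iv) $\dfrac{\widehat{\mathrm{Var}}(\hat{S}(35))}{\left(\hat{S}(35)\right)^2} = 0.011467$

Determine $c_{30}$, the number censored at 30.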

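| all numbers in the table | 0  | 15 | 17 | 25 | 30  | 32 | 40 |
|--------------------------|----|----|----|----|-----|----|----|
| number of $x_i$'s        |    | 2  |    | 4  |     | 8  | 2  |
| number of $u_i$'s        |    |    | 3  |    | $c$ |    |    |
| number of $d_i$'s        | 50 |    |    |    |     |    |    |

Note that all 50 lives enter at the initial time 0. We use the following formula to calculate the $r_j$'s:

$$r_j = \#(d < y_j) - \#(x < y_j) - \#(u < y_j)$$

| $t_j$ | $s_j$ | $r_j$    |
|-------|-------|----------|
| 15    | 2     | 50       |
| 25    | 4     | 45       |
| 32    | 8     | $41 - c$ |
| 40    | 2     | $33 - c$ |

Only the event times up to 35 enter Greenwood's sum:

$$0.011467 = \frac{\widehat{\mathrm{Var}}(\hat{S}(35))}{\left(\hat{S}(35)\right)^2} = \frac{2}{(50)(48)} + \frac{4}{(45)(41)} + \frac{8}{(41 - c)(33 - c)}$$

$$\Rightarrow\; (41 - c)(33 - c) = 945.$$

Solving the quadratic equation gives $c = 6$ or $c = 68$; the second root exceeds the number of lives in the study, so $c = 6$.

Note. For mortality data, the probability $P(T > s + t \mid T > s)$ is denoted by ${}_t p_s$, and the probability of failing at or before time $s + t$ given survival past $s$ is denoted by ${}_t q_s$. The Kaplan-Meier estimate gives the probabilities ${}_t p_0$. If you want a conditional probability ${}_t p_s$, then you must calculate the Kaplan-Meier product starting at time $s + 1$, since

$${}_t p_s = \frac{P(T > s + t)}{P(T > s)} = \text{the product starting at } s + 1.$$

Example. For a survival study with censored and truncated data, you are given: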

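| Time ($t$) | Number at risk at time $t$ | Failures at time $t$ |
|------------|----------------------------|----------------------|
| 1          | 30                         | 5                    |
| 2          | 27                         | 9                    |
| 3          | 32                         | 6                    |
| 4          | 25                         | 5                    |
| 5          | 20                         | 4                    |

The probability of failing at or before Time 4, given survival past Time 1, is ${}_3 q_1$. Calculate Greenwood's approximation of the variance of ${}_3 q_1$.

$${}_3 p_1 = \left(\frac{18}{27}\right)\left(\frac{26}{32}\right)\left(\frac{20}{25}\right) = 0.4333$$

$$\frac{9}{(27)(18)} + \frac{6}{(32)(26)} + \frac{5}{(25)(20)} = 0.03573$$

$$\mathrm{Var}({}_3 q_1) = \mathrm{Var}(1 - {}_3 p_1) = \mathrm{Var}({}_3 p_1) = (0.4333)^2 (0.03573) = 0.006709$$

Definition. If $\hat{S}(t)$ is the Kaplan-Meier estimator for the survival function (the empirical estimate is a special case of this), then the $100(1 - \alpha)\%$ linear confidence interval is defined to be the interval with endpoints

$$\hat{S}(t) \pm z_{(1+p)/2} \sqrt{\mathrm{Var}(\hat{S}(t))},$$

where $p = 1 - \alpha$ is the confidence level and $z_{(1+p)/2}$ is the corresponding standard normal quantile (1.96 for $p = 0.95$). Similarly, if $\hat{H}(t)$ is the Nelson-Aalen estimate of the cumulative hazard function, we have the $100(1 - \alpha)\%$ linear confidence interval

$$\hat{H}(t) \pm z_{(1+p)/2} \sqrt{\mathrm{Var}(\hat{H}(t))}.$$

However, the interval for $H(t)$ may include negative values, and the linear confidence interval for $S(t)$ may include numbers larger than 1. To avoid these problems, the delta method can be used (details omitted) to find the following so-called log-transformed confidence intervals:

$$\left( \hat{S}(t)^{1/U},\; \hat{S}(t)^{U} \right), \qquad U = \exp\left( \frac{z_{(1+p)/2} \sqrt{\mathrm{Var}(\hat{S}(t))}}{\hat{S}(t) \ln \hat{S}(t)} \right),$$

$$\left( \frac{\hat{H}(t)}{U},\; U \hat{H}(t) \right), \qquad U = \exp\left( \frac{z_{(1+p)/2} \sqrt{\mathrm{Var}(\hat{H}(t))}}{\hat{H}(t)} \right).$$

Example. In the previous example, calculate the 95% log-transformed confidence interval for $H(3)$, based on the Nelson-Aalen estimate.

$$\hat{H}(3) = \frac{5}{30} + \frac{9}{27} + \frac{6}{32} = 0.6875$$

$$\mathrm{Var}(\hat{H}(3)) = \frac{5}{30^2} + \frac{9}{27^2} + \frac{6}{32^2} = 0.0238$$

$$U = \exp\left( \frac{1.96 \sqrt{0.0238}}{0.6875} \right) = \exp(0.4395) = 1.5519$$

$$\text{confidence interval} = \left( \frac{\hat{H}}{U},\; \hat{H} U \right) = \left( \frac{0.6875}{1.5519},\; (0.6875)(1.5519) \right) = (0.4430,\; 1.0669)$$

Example. Twelve policyholders were monitored from the starting date of the policy to the time of first claim. The observed data are as follows:

| Time of first claim | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---------------------|---|---|---|---|---|---|---|
| Number of claims    | 2 | 1 | 2 | 2 | 1 | 2 | 2 |

Using the Nelson-Aalen estimator, calculate the 95% linear confidence interval for the cumulative hazard rate function $H(4.5)$.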

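| $y_j$ | $r_j$ | $s_j$ |
|-------|-------|-------|
| 1     | 12    | 2     |
| 2     | 10    | 1     |
| 3     | 9     | 2     |
| 4     | 7     | 2     |

$$\hat{H}(4.5) = \frac{s_1}{r_1} + \frac{s_2}{r_2} + \frac{s_3}{r_3} + \frac{s_4}{r_4} = \frac{2}{12} + \frac{1}{10} + \frac{2}{9} + \frac{2}{7} = 0.7746$$

$$\mathrm{Var}(\hat{H}(4.5)) = \frac{s_1}{r_1^2} + \frac{s_2}{r_2^2} + \frac{s_3}{r_3^2} + \frac{s_4}{r_4^2} = \frac{2}{12^2} + \frac{1}{10^2} + \frac{2}{9^2} + \frac{2}{7^2} = 0.0894$$

$$\text{linear confidence interval: } 0.7746 \pm 1.96 \sqrt{0.0894} = (0.189,\; 1.361)$$

Example. For a survival study, you are given:

(i) The Product-Limit estimator $\hat{S}(t_0)$ is used to construct confidence intervals for $S(t_0)$.

(ii) The 95% log-transformed confidence interval for $S(t_0)$ is $(0.695, 0.843)$.

Determine $\hat{S}(t_0)$.

Writing $S = \hat{S}(t_0)$, the endpoints of the log-transformed interval give

$$S^{1/U} = 0.695, \qquad S^{U} = 0.843,$$

$$\frac{1}{U} \ln S = \ln(0.695), \qquad U \ln S = \ln(0.843).$$

Dividing the second equation by the first gives

$$U^2 = \frac{\ln(0.843)}{\ln(0.695)} \;\Rightarrow\; U = \sqrt{\frac{\ln(0.843)}{\ln(0.695)}} = 0.6851,$$

$$S = \left( S^{1/U} \right)^{U} = 0.695^{0.6851} = 0.7794.$$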
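Greenwood's approximation and the log-transformed interval for $S(t)$ can be tied together in one short sketch. All function names here are hypothetical, the 1.96 quantile corresponds to a 95% interval, and the `payments` table is the $(y_j, s_j, r_j)$ table of the payment-data example as read here. The last function inverts the interval in the same way as the final example.

```python
import math


def km_with_greenwood(table, t):
    """Return (S_hat(t), Greenwood variance) from a list of (y_j, s_j, r_j).

    Assumes r_j > s_j for every y_j <= t, so no division by zero occurs.
    """
    surv, acc = 1.0, 0.0
    for y, s, r in table:
        if y <= t:
            surv *= 1 - s / r
            acc += s / (r * (r - s))
    return surv, surv ** 2 * acc


def log_transformed_ci(surv, var, z=1.96):
    """(S^(1/U), S^U) with U = exp(z * sqrt(var) / (S * ln S)); needs 0 < S < 1."""
    big_u = math.exp(z * math.sqrt(var) / (surv * math.log(surv)))
    return surv ** (1 / big_u), surv ** big_u


def estimate_from_log_ci(lower, upper):
    """Recover S_hat from a log-transformed interval: U^2 = ln(upper) / ln(lower)."""
    big_u = math.sqrt(math.log(upper) / math.log(lower))
    return lower ** big_u


# Payment-data example (assumed values): S(4000) and its Greenwood variance.
payments = [(1500, 1, 7), (2550, 1, 5), (2800, 2, 4), (5600, 1, 1)]
surv, var = km_with_greenwood(payments, 4000)
print(round(surv, 4), round(var, 4))  # 0.3429 0.0381

lower, upper = log_transformed_ci(surv, var)
print(round(estimate_from_log_ci(0.695, 0.843), 4))  # 0.7794
```

Because $\ln \hat{S}(t) < 0$, the exponent in $U$ is negative and $U < 1$, which is why $\hat{S}^{1/U}$ is the lower endpoint and $\hat{S}^{U}$ the upper one.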