Credit risk and survival analysis: Estimation of Conditional Cure Rate


Stochastics and Financial Mathematics
Master Thesis

Credit risk and survival analysis: Estimation of Conditional Cure Rate

Author: Just Bajželj
Examination date: August 30, 2018
Supervisor: dr. A.J. van Es
Daily supervisor: R. Man MSc

Korteweg-de Vries Institute for Mathematics
Rabobank

Abstract

Rabobank currently uses a non-parametric estimator for the computation of the Conditional Cure Rate (CCR), and this method has several shortcomings. The goal of this thesis is to find a better estimator than the currently used one. This master thesis looks into three CCR estimators. The first one is the currently used method. We analyze its performance with the bootstrap and later develop a method with better performance. Since the newly developed and currently used estimators are not theoretically correct with respect to the data, a third method is introduced. However, according to the bootstrap, the latter method exhibits the worst performance. For the modeling and data analysis the programming language Python is used.

Title: Credit risk and survival analysis: Estimation of Conditional Cure Rate
Author: Just Bajželj, just.bajzelj@gmail.com
Supervisor: dhr. dr. A.J. van Es
Second Examiner: dhr. dr. A.V. den Boer
Examination date: August 30, 2018

Korteweg-de Vries Institute for Mathematics
University of Amsterdam
Science Park , 1098 XG Amsterdam

Rabobank
Croeselaan 18, 3521 CB Utrecht

Acknowledgments

I would like to thank my parents, who made it possible for me to finish the two-year Master's in Stochastics and Financial Mathematics in Amsterdam, which helped me become the person I am today. I would also like to thank the people from Rabobank and the department of Risk Analytics, thanks to whom I have written this thesis during my six-month internship, and who showed me that work can be more than enjoyable. In particular, I would like to acknowledge all my mentors: Bert van Es, who always had time to answer all my questions; Viktor Tchistiakov, who gave me challenging questions and ideas, which represent the core of this thesis; and Ramon Man, who always showed me support and made sure that this thesis was done on schedule.

Contents

Introduction
0.1 Background
0.2 Research objective and approach
1 Survival analysis
1.1 Censoring
1.2 Definitions
1.3 Estimators
1.4 Competing risk setting
2 Current CCR Model
2.1 Model implementation
2.1.1 Stratification
2.2 Results
2.3 Performance of the method
2.3.1 The bootstrap
3 Cox proportional hazards model
3.1 Parameter estimation
3.1.1 Estimation of β
3.1.2 Baseline hazard estimation
3.2 Estimators
3.3 Model implementation
3.4 Results
3.5 Performance of the method
3.6 Discussion
4 Proportional hazards model in an interval censored setting
4.1 Theoretical background
4.1.1 Generalized Linear Models
4.2 Model implementation
4.2.1 Binning
4.3 Results
4.4 Performance of the method
4.5 Discussion
Popular summary

Introduction

Under Basel II banks are allowed to build their own internal models for the estimation of risk parameters. This is known as the Internal Ratings-Based approach (IRB). Risk parameters are used by banks in order to calculate their own regulatory capital. In Rabobank, the Loss Given Default (LGD), Probability of Default (PD) and Exposure at Default (EAD) are calculated with IRB. Loss given default describes the loss of a bank in case the client defaults. Default happens when the client is unable to pay the monthly payments for the mortgage for some time, or when one of the other default events, usually connected with the client's financial difficulties, happens. A missed payment is also known as an arrear. After the client defaults, his portfolio is non-performing and two events can happen. The event in which the client's portfolio returns to performing is known as cure. Cure happens if the customer pays his arrears and has no missed payments during a three-month period, or if he pays the arrears after a loan restructuring and has no arrears in a twelve-month period. The event in which the bank needs to sell the client's security in order to cover the loss is called liquidation. Two approaches are available for LGD modeling. The non-structural approach consists of estimating the LGD by observing historical loss and recovery data. Rabobank uses another approach, the so-called structural approach. When the bank deals with LGD in a structural way, different probabilities and outcomes of default are considered. The model is split into several components that are developed separately and later combined in order to produce the final LGD estimation, as can be seen in Figure 0.1. In order to calculate the loss given default we first need to calculate the probability of cure, the loss given liquidation, the loss given cure and the indirect costs. Rabobank assumes that the loss given cure equals zero, since cure is offered to the client only if there is zero loss to the bank, and any indirect costs that are incurred during the cure process are taken into account in the indirect costs component. Therefore, in this thesis the term loss is sometimes used instead of the term liquidation. Indirect costs are usually costs that are made internally by the departments of the bank that are involved in processing defaults, e.g., salaries paid to employees and administrative costs. The equation used for the LGD calculation can be seen in Figure 0.1. The parameter probability of cure (\(P_{Cure}\)) is the probability that the client will cure before his security is liquidated. All model components, including \(P_{Cure}\), depend on covariates of each client. A big proportion of the cases which are used to estimate \(P_{Cure}\) are unresolved cases. A client that defaulted and was not yet liquidated nor cured is called unresolved. The easiest way to picture the unresolved state is that every client needs some time after default in order to cure or sell their property. The non-absorbing state before cure or liquidation is called the unresolved state.

Figure 0.1: \(LGD = P_{Cure} \cdot LGC + IC + (1 - P_{Cure}) \cdot LGL\).

Unresolved cases can be treated in two ways:
Exclusion: the cases can be excluded from the \(P_{Cure}\) estimation.
Inclusion with expected outcome: we can include unresolved cases in the \(P_{Cure}\) estimation by assigning an expected outcome, or Conditional Cure Rate, to them.
If unresolved cases are excluded from the \(P_{Cure}\) estimation, it can happen that the parameter estimator will be biased. In other words, \(P_{Cure}\) will be estimated on a sample where clients are cured after a short time. Consequently, clients who would need more time to be cured would get a smaller value of \(P_{Cure}\) than they deserve. Since \(P_{Cure}\) tells us the probability that a client will be cured after default, and such a probability is not time dependent, treating unresolved cases by exclusion would be wrong. One approach within Rabobank is to treat unresolved cases by assigning them a value called the Conditional Cure Rate. The Conditional Cure Rate (CCR) can be estimated with a non-parametric technique. This thesis is about developing a new model able to eliminate the existing shortcomings of the currently employed model.

0.1 Background

CCR tells us the probability that a client will cure after a certain time point, conditioned on the event that the client is still unresolved at that point. Rabobank's current model estimates CCR with survival analysis, which is a branch of statistics specialized in the distribution of lifetimes. The lifetime is the time to the occurrence of an event. In our case the lifetime is the time between the default of the client and cure or liquidation. The current CCR is a combination of the Kaplan-Meier and Nelson-Aalen estimators, which are two of the most recognized and simple survival distribution estimators. The current CCR model has some shortcomings. The goal of the present research is to develop a new CCR model which is able to outperform the current model.

0.2 Research objective and approach

The objective of this thesis is to find new ways to estimate CCR that improve the CCR estimation. The new techniques will be compared with the currently used CCR estimator. The research goal of this thesis is: develop an alternative CCR model which is better than the current one. In order to reach this research goal, we will need to answer the first research question: what type of model is natural for the problem? To answer this question we first study the currently used model. Second, we get familiar with basic and more advanced concepts of survival analysis. Once this is done, we will be able to derive more advanced and probably better estimators. In particular, we will get some basic ideas about the techniques that can be used for CCR modeling. In order to find a better model than the one which is currently used, we have to know what better means. In other words, we need to be able to answer the second research question: what are the criteria for comparing the models?

1 Survival analysis

Survival analysis is a branch of statistics which is specialized in the distribution of lifetimes. The random variable which denotes a lifetime will be denoted by T. Survival analysis has its roots in medical studies, where one looks into the lifetime of patients or the time to cure of a disease. In our case the lifetime of interest is the time which a client spends in the unresolved state, and the events of interest are cure and liquidation. One of the most attractive features of survival analysis is that it can cope with censored observations. The estimator which is currently used for CCR estimation is a combination of two estimators of two different quantities which are specific to survival analysis. In order to understand and model CCR as it is currently modeled by Rabobank, we first have to know which quantities are modeled by Rabobank, how these quantities are estimated, and how to model them when a client can become resolved due to two reasons. Censoring is presented in Section 1.1. The most basic concepts of survival analysis and its specific quantities are introduced in Section 1.2. In Section 1.3 we look into basic estimators of survival analysis. In Section 1.4 the theory and estimators for the case where two outcomes are possible are presented, since our model assumes that a client can become resolved due to two reasons, cure and liquidation.

1.1 Censoring

When lifetime distributions are modeled with a non-survival-analysis approach, only observations where the event of interest took place are used. For simplicity we look into an example from medical studies. For instance, if a study on 10 patients has been made and only 5 of them died, while the other 5 are still alive, then in order to model the distribution of lifetimes only 5 observations can be used. Survival analysis is able to extract some information about the distribution also from the other patients. Patients which are still alive at the end of the study are called censored observations, and survival analysis is able to cope with censored data. A visualization of censored observations can be seen in Figure 1.1. The time variable on the x-axis represents the month of the study, while on the y-axis the number of each patient can be found. The line on the right side represents the end of the study. The circle represents the month of the study in which the patient got sick, while the arrow represents the death of a patient. It can be seen that the deaths of patients 6 and 4 are not observed, because the study ended before the event of interest happened. The event of interest cannot be seen for patient 2 either, because the patient left the study before the end of the study. Such phenomena are called censored observations. Despite the fact that the event of interest did not happen,

such observations tell us that the event of interest happened after the last time we saw the patient alive, and consequently they have an impact on the estimated distribution. The type of censoring where we know the time when the patient got sick, but we do not know the time of the event of interest, is called right censoring (Lawless 2002). For now we will assume that we have just one type of censoring, right censoring. In our case this means that a client which has defaulted is observed, but the observation period has finished before the client was cured or liquidated. To understand why such a phenomenon happens, it has to be taken into consideration that in Rabobank's data there are observations where a client is cured after 48 months (4 years). Consequently, if clients which have defaulted in the year 2016 are observed, it can happen that some of them are still unresolved, despite the fact that they will be cured or liquidated sometime in the future.

Figure 1.1: Censored observations and uncensored observation

1.2 Definitions

In order to use techniques from survival analysis, some quantities need to be defined which are going to be directly or indirectly modeled. For simplicity it will be assumed that a client can become resolved due to one reason only, cure. Later this assumption will be relaxed. All definitions and theory from this chapter can be found in any book about survival analysis. For a review of survival analysis see Lawless (2002), Kalbfleisch and Prentice (2002) and Klein and Moeschberger (2003).

Definition (Distribution function). Since T is a random variable it has its own distribution function and density function, which are denoted by F and f, with
\[ F(t) = P(T \le t) = \int_0^t f(s)\,ds. \]
The distribution function tells us the probability that the event of interest, cure, will happen before time t. In survival analysis we can also be interested in the probability that the client is still unresolved at time t.

Definition (Survival function). The survival function represents the probability that the event of interest did not happen up to time t and it is denoted by S, with
\[ S(t) = P(T > t) = 1 - F(t). \tag{1.1} \]
The survival function is a non-increasing, right-continuous function of t with S(0) = 1 and \(\lim_{t \to \infty} S(t) = 0\).

Since we can also be interested in the rate of events in a small time step after time t, we define the following quantity.

Definition (Hazard rate function). The hazard rate function tells us the rate of cures in an infinitely small step after t, conditioned on the event that the client is still unresolved at time t. The hazard rate function is denoted by λ, with
\[ \lambda(t) = \lim_{\Delta t \downarrow 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}. \]
It is important to note that the hazard rate function can take values bigger than 1. If \(\lambda(t)\) is much larger than 1, it means that the event will probably happen shortly after time t.

In order to interconnect all the defined quantities we introduce the cumulative hazard function, which is defined as
\[ \Lambda(t) = \int_0^t \lambda(s)\,ds. \]
Now we can look at how the hazard rate function and the survival function are connected.

We know that \(\{t \le T < t + \Delta t\} \subseteq \{T \ge t\}\). Consequently we get
\[
\lambda(t) = \lim_{\Delta t \downarrow 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}
= \lim_{\Delta t \downarrow 0} \frac{P(t \le T < t + \Delta t)}{\Delta t\, P(T \ge t)}
= \frac{f(t)}{S(t)}
= \frac{\frac{d}{dt}(1 - S(t))}{S(t)}
= \frac{-\frac{d}{dt} S(t)}{S(t)}
= -\frac{d}{dt} \log(S(t)). \tag{1.2}
\]
From the last equality it follows that
\[ S(t) = \exp(-\Lambda(t)). \tag{1.3} \]
We can see that the hazard rate function, the survival function, the density function and the distribution function each uniquely define the distribution of T.

1.3 Estimators

Until now we looked into quantities that define the distribution of T. In this chapter simple estimators of these quantities will be presented. Later those estimators will be used for the modeling of CCR. The assumption that a client can become resolved due to one reason only is still valid. In the data there are n defaults. The observed times will be denoted by \(t_i\), \(i \in \{1, 2, \ldots, n\}\). Each of those times can represent the time between default and cure or liquidation, or a censoring time. The variable \(\delta_i\) is called the censoring indicator and it takes value 1 if client i was not censored. Consequently, in cases where there is only one event of interest the data are given as \(\{(t_i, \delta_i) \mid i \in \{1, 2, \ldots, n\}\}\). Once the observed times are obtained, we order them in ascending order, \(t_{(1)} < t_{(2)} < \cdots < t_{(k)}\), \(k \le n\). With D(t) we denote the set of individuals that experienced the event of interest at time t, or \(D(t) = \{j \mid T_j = t,\ j \in \{1, 2, \ldots, n\}\}\). With \(d_i\) the size of \(D(t_{(i)})\) will be denoted. The set of individuals that are at risk at time t is denoted by N(t). The individuals in N(t) are the individuals that experienced the event of interest at time t or are still in our study at time t, or \(N(t) = \{j \mid T_j \ge t,\ j \in \{1, 2, \ldots, n\}\}\). With \(n_i\) we denote the size of \(N(t_{(i)})\). The most basic estimator of the survival function is known as the Kaplan-Meier estimator and it is also used in the current Rabobank model. The Kaplan-Meier estimator is given

by the following formula,
\[ \hat S(t) = \prod_{j:\, t_{(j)} \le t} \Big(1 - \frac{d_j}{n_j}\Big). \tag{1.4} \]
The estimator which is used for the estimation of the cumulative hazard rate in the current CCR model is called the Nelson-Aalen estimator and it is given by
\[ \hat\Lambda(t) = \sum_{j:\, t_{(j)} \le t} \frac{d_j}{n_j}. \tag{1.5} \]
Using the Nelson-Aalen estimator we can also model the hazard rate as a point process which takes value \(d_i/n_i\) at time \(t_{(i)}\).

Let us look into the following example. We have an observation which consists of 10 clients. The censoring indicators and observed times can be seen in Figure 1.2. The times which each client needs in order to be cured after default are observed. In order to compare traditional statistical techniques and survival techniques, the survival curve will be estimated with both the empirical distribution function and the Kaplan-Meier estimator.

Figure 1.2: Example data

The survival curve estimated with the Kaplan-Meier estimator is shown in Figure 1.3. In the figure it can be seen that jumps happen only at the times when an event of interest happens. Observed censoring times are denoted with small vertical lines; they have no influence on the jump times, but rather on the size of the jumps, since censored observations only influence the risk set, which is in the denominator of equation (1.4). The more censored observations occur before an observed event time, the smaller the risk set will be and the bigger the jump in the survival curve will be.

We will now calculate the survival curve with the empirical distribution function.

Figure 1.3: Survival curve estimated with the Kaplan-Meier estimator

The empirical distribution function is formulated as
\[ \hat F_n(t) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{T_i \le t}, \]
and we obtain the survival curve from the identity (1.1). Doing so we obtain an estimator for the survival function which is equal to
\[ \hat S_n(t) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{T_i > t}. \]
It is important to add that if we use the empirical distribution function we can only use observations where the event of interest did happen. In our case we can just take the observations where cure happened, i = 4, 6, 7. The resulting estimate of the survival curve based on the empirical distribution is shown in Figure 1.4. From Figure 1.4 it can be observed that bigger jumps occur in comparison with the Kaplan-Meier estimator. An intuitive explanation for this phenomenon is that censored observations also bring important information to the estimator. For instance, if we want to estimate the survival curve at time t and we only have censored observations which happened after time t, then the probability of an event happening before time t is probably low.

Figure 1.4: Survival curve estimated with the empirical distribution

At this point note that if we use the Kaplan-Meier estimator with non-censored observations only, and the empirical distribution for the estimation of the survival curve, we obtain the same results. If we have times \(t_1 < t_2 < \cdots < t_k\), and at time \(t_i\) exactly \(d_i\) events of interest happened, then the size of the risk set at time \(t_i\) will be \(n - d_1 - \cdots - d_{i-1}\). For \(t \in [t_i, t_{i+1})\) it follows that
\[
\hat S(t) = \Big(1 - \frac{d_1}{n}\Big)\Big(1 - \frac{d_2}{n - d_1}\Big) \cdots \Big(1 - \frac{d_i}{n - d_1 - \cdots - d_{i-1}}\Big)
= \frac{(n - d_1)(n - d_1 - d_2) \cdots (n - d_1 - \cdots - d_i)}{n(n - d_1)(n - d_1 - d_2) \cdots (n - d_1 - \cdots - d_{i-1})}
= \frac{n - d_1 - \cdots - d_i}{n}
= \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{t_i > t}. \tag{1.6}
\]
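The estimators of this section are simple enough to implement directly. The sketch below, written in Python for illustration only, computes the Kaplan-Meier estimate (1.4) and the survival estimate based on the empirical distribution so the two can be compared; the small data set is made up and is not the data of Figure 1.2, and in practice a packaged routine could be used instead.

import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier estimate (1.4): survival value after each observed event time.

    times  : observed times t_i (event or censoring time)
    events : censoring indicators delta_i (1 = event observed, 0 = censored)
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    surv, s = [], 1.0
    for t in np.unique(times[events == 1]):           # jumps only at event times
        n_j = np.sum(times >= t)                      # risk set size n_j
        d_j = np.sum((times == t) & (events == 1))    # number of events d_j
        s *= 1.0 - d_j / n_j
        surv.append((t, s))
    return surv

def ecdf_survival(times, events, t):
    """Survival estimate from the empirical distribution, using only the
    observations where the event of interest actually happened."""
    cured = np.asarray(times, dtype=float)[np.asarray(events) == 1]
    return np.mean(cured > t)

# Illustrative data for 10 clients (not the data of Figure 1.2).
t_obs = [1, 2, 2, 3, 3, 4, 4, 5, 5, 6]
delta = [0, 1, 0, 1, 0, 1, 0, 0, 0, 0]
print(kaplan_meier(t_obs, delta))
print(ecdf_survival(t_obs, delta, 3))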

1.4 Competing risk setting

Up to this point we assumed that clients can become resolved only by being cured, but in reality that is not the case. It is known that a client can become resolved due to two reasons, cure and liquidation. The modeling approach in the CCR model is that the client becomes resolved due to the reason which happens first. In other words, there are two types of events which have an influence on the survival function. If all liquidation events were modeled as censored, then the survival function would be overestimated. Since one of the building blocks of CCR is the survival function, CCR estimates where liquidation events are considered as censored would give us wrong results. Modeling of CCR in the setting where two outcomes are possible is known in survival analysis as the competing risk setting. In this chapter we will look into the relevant quantities in the competing risk setting. For a review of the competing risk model see M.-J. Zhang, X. Zhang, and Scheike (2008).

In order to model CCR(t) we have to introduce more notation. The variable \(e_i\) represents the reason due to which subject i failed. In our case \(e_i = 1\) if subject i was cured. The variable \(e_i\) takes value 2 if subject i was liquidated. Consequently, the data are given as \((T_i, \delta_i, e_i)\), \(i \in \{1, 2, \ldots, n\}\). Since a subject can fail due to two reasons, a cause specific hazard rate needs to be used. By \(D^i(t)\) we denote the set of individuals that failed due to reason i at time t, or \(D^i(t) = \{j \mid T_j = t,\ e_j = i,\ j \in \{1, 2, \ldots, n\}\}\). With \(d^i_j\) we denote the size of the set \(D^i(t_{(j)})\).

Definition (Cause specific hazard rate). The cause specific hazard rate is a function that tells us the rate of the event e = i in an infinitely small step after t, conditioned on the event that the subject did not fail up to time t. The cause specific hazard rate function is denoted by \(\lambda_i(t)\), with
\[ \lambda_i(t) = \lim_{\Delta t \downarrow 0} \frac{P(t \le T < t + \Delta t,\ e = i \mid T \ge t)}{\Delta t}. \]
In order to calculate the survival function in a competing risk setting we need to define the cause specific cumulative hazard function, which is given as \(\Lambda_i(t) = \int_0^t \lambda_i(s)\,ds\). Then the analog of Formula (1.3) in the competing risk setting is
\[ S(t) = \exp(-\Lambda_1(t) - \Lambda_2(t)). \tag{1.7} \]
It is important to add that when a subject can fail due to two reasons, the Kaplan-Meier estimator of the survival function takes the following form
\[ \hat S(t) = \prod_{j:\, t_{(j)} \le t} \Big(1 - \frac{d^1_j + d^2_j}{n_j}\Big). \]
The Nelson-Aalen estimator is the estimator of the cause specific cumulative hazard rate, and it is given as
\[ \hat\Lambda_i(t) = \sum_{j:\, t_{(j)} \le t} \frac{d^i_j}{n_j}. \tag{1.8} \]
The value \(d\hat\Lambda_i(t_j) = d^i_j / n_j\) tells us the probability that a subject will experience event i at time \(t_j\), conditioned on the event that he is still alive just before time \(t_j\). Once we obtain all the cause specific hazard rates we can calculate the probability of failing due to a specific reason.

Definition (Cumulative incidence function). The cumulative incidence function (CIF) tells us the probability of

failing due to cause i before time t, and it is denoted by \(F_i\), or \(F_i(t) = P(T \le t, e = i)\). The cumulative incidence function is expressed mathematically as
\[ F_i(t) = \int_0^t \lambda_i(s) S(s)\,ds = \int_0^t \lambda_i(s) \exp\Big(-\sum_{i=1}^2 \Lambda_i(s)\Big)\,ds. \tag{1.9} \]
Using the Nelson-Aalen and Kaplan-Meier estimators we can obtain a non-parametric estimator for the CIF, which is given as
\[ \hat F_i(t) = \sum_{j:\, t_j \le t} \hat S(t_{j-1})\, d\hat\Lambda_i(t_j). \tag{1.10} \]
An intuitive explanation of Formula (1.10) is that if we want the subject to fail due to reason i, it has to be unresolved up to time \(t_{j-1}\) and then at the next time instance fail due to reason i.

We will continue with the example from Section 1.3, but this time we will assume that some observations which were censored were actually liquidated. We will take all the steps needed in order to calculate \(\hat F_1(t)\) from equation (1.10), which will later be needed in order to estimate CCR.

Figure 1.5: Survival curve estimated with empirical distribution

In order to estimate the survival curve, the following steps need to be taken. It can be seen that no event of interest happens on the interval [0, 1); consequently for \(t \in [0, 1)\) it holds

that \(\hat S(t) = 1\). At time 1 one event of interest happens and the size of the risk set is 10. Since no events of interest happen until time 2, for \(t \in [1, 2)\) it holds that
\[ \hat S(t) = 1 \cdot \Big(1 - \frac{1}{10}\Big) = \frac{9}{10}. \]
At time 2 two events of interest happen and the size of the risk set is 9, so for \(t \in [2, 3)\) it holds
\[ \hat S(t) = 1 \cdot \Big(1 - \frac{1}{10}\Big)\Big(1 - \frac{2}{9}\Big) = \frac{7}{10}. \]
In a similar fashion we obtain the following values for \(\hat S(t)\):
\[
\hat S(t) =
\begin{cases}
\frac{7}{10}\big(1 - \frac{2}{7}\big) = \frac{1}{2} & t \in [3, 4) \\
\frac{1}{2}\big(1 - \frac{2}{5}\big) = \frac{3}{10} & t \in [4, 5) \\
\frac{3}{10}\big(1 - \frac{0}{3}\big) = \frac{3}{10} & t \in [5, 6) \\
\frac{3}{10}\big(1 - \frac{0}{1}\big) = \frac{3}{10} & t \in [6, \infty).
\end{cases}
\]

Figure 1.6: Survival curve in competing risk setting

The Kaplan-Meier curve for the competing risk setting is shown in Figure 1.6. Compared with Figure 1.3 it can be seen that it has more jumps, which makes sense, since we have two events of interest. At the same time we can see that if we consider liquidated observations as censored, the survival curve will overestimate the survival probability.

Consequently, the CCR will be modeled in a competing risk setting. In order to calculate an estimator of \(F_1(t)\) we need to calculate \(d\hat\Lambda_1(t_j)\). The value of \(d\hat\Lambda_1(1)\) is simply the number of individuals that were cured at time 1 divided by the number of all individuals that are at risk at time 1, which here gives \(d\hat\Lambda_1(1) = 0\). In a similar fashion we obtain the other values of \(d\hat\Lambda_1(t_i)\); for instance \(d\hat\Lambda_1(2) = \frac{1}{9}\), and \(d\hat\Lambda_1(t) = 0\) at the times where no cures are observed.

Figure 1.7: Cause specific hazard rate for cure estimated with \(d\hat\Lambda_1\)

Now we can finally calculate \(\hat F_1(t)\). Since we do not have any cures in the interval [0, 1), it holds that \(\hat F_1(t) = 0\) for \(t \in [0, 1)\). For \(t \in [1, 2)\) it holds
\[ \hat F_1(t) = \hat S(0)\, d\hat\Lambda_1(1) = 1 \cdot 0 = 0. \]

For \(t \in [2, 3)\) it holds that
\[ \hat F_1(t) = \hat S(0)\, d\hat\Lambda_1(1) + \hat S(1)\, d\hat\Lambda_1(2) = \hat F_1(1) + \hat S(1)\, d\hat\Lambda_1(2) = 0 + \frac{9}{10} \cdot \frac{1}{9} = \frac{1}{10}. \tag{1.11} \]
In a similar fashion we obtain the values of \(\hat F_1(t)\) for the other t: the estimate stays equal to \(\hat F_1(2)\) on [3, 4), takes the value \(\hat F_1(3) + \hat S(3)\, d\hat\Lambda_1(4)\) on [4, 5), and remains constant at \(\hat F_1(4)\) on [5, 6) and at \(\hat F_1(5)\) on \([6, \infty)\).

Figure 1.8: Estimated cumulative incidence function
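The quantities used in this example, the pooled Kaplan-Meier survival function, the cause specific hazard increments \(d\hat\Lambda_i(t_j)\) of (1.8) and the cumulative incidence estimator (1.10), can be computed together in a few lines of Python. The sketch below is illustrative only; the event coding (0 = censored, 1 = cure, 2 = liquidation) and the small data set are assumptions made for the example, not Rabobank's data.

import numpy as np

def cumulative_incidence(times, events, cause=1):
    """Non-parametric CIF estimate (1.10) for one cause in a competing risk setting."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    grid = np.unique(times[events > 0])           # times at which any event occurs
    surv_prev = 1.0                               # running value of S_hat(t_{j-1})
    cif, out = 0.0, []
    for t in grid:
        n_j = np.sum(times >= t)                              # risk set size
        d_cause = np.sum((times == t) & (events == cause))    # events of the given cause
        d_all = np.sum((times == t) & (events > 0))           # events of any cause
        cif += surv_prev * d_cause / n_j          # S_hat(t_{j-1}) * dLambda_hat_i(t_j)
        surv_prev *= 1.0 - d_all / n_j            # update the pooled survival estimate
        out.append((t, cif))
    return out

# Illustrative data: 10 clients, 0 = censored, 1 = cured, 2 = liquidated.
t_obs = [1, 2, 2, 3, 3, 4, 4, 5, 5, 6]
e_obs = [2, 1, 2, 1, 2, 1, 2, 0, 0, 0]
print(cumulative_incidence(t_obs, e_obs, cause=1))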

2 Current CCR Model

The CCR at time t tells us the probability that a client will be cured after time t, conditioned on the event that he is still unresolved at time t. The unconditional probability of being cured after time t can be expressed as \(F_1(\infty) - F_1(t)\). It is important to understand that some cases will never be cured. Consequently the probability of cure is not necessarily equal to 1 as \(t \to \infty\), i.e. \(F_1(\infty)\) can be smaller than 1. Since CCR is conditioned on the event of being unresolved up to time t, it follows that
\[ CCR(t) = \frac{F_1(\infty) - F_1(t)}{S(t)}. \tag{2.1} \]
Using the estimator (1.10) we can estimate the probability of curing after time \(t \in [t_i, t_{i+1})\) as
\[ \hat F_1(\infty) - \hat F_1(t_i). \tag{2.2} \]
We assume that the probability of cure after the 58th month is equal to zero. Consequently it holds that \(\hat F_1(\infty) = \hat F_1(58)\) and \(\hat{CCR}(58) = 0\). Another assumption is that every case will be resolved as \(t \to \infty\), which corresponds to requiring of the survival function that \(S(\infty) = 0\). Consequently, we will consider every case which has an observed time bigger than 58 as liquidated. Since CCR is conditioned on the event of being unresolved up to time \(t \in [t_i, t_{i+1})\), the estimator of CCR takes the following form
\[ \hat{CCR}(t) = \frac{\hat F_1(58) - \hat F_1(t_i)}{\hat S(t_i)} = \frac{d^1_{i+1}}{n_{i+1}} + \Big(1 - \frac{d^1_{i+1} + d^2_{i+1}}{n_{i+1}}\Big)\,\hat{CCR}(t_{i+1}). \tag{2.3} \]
If we define \(\hat F_1(0) = 0\), since the probability of being cured before time zero is equal to zero, then \(\hat{CCR}(0)\) is equal to \(\hat F_1(58)\). This gives us the probability of being cured, i.e. the value of \(P_{Cure}\) in Figure 0.1.

2.1 Model implementation

In order to estimate CCR for each time point, Rabobank uses data which consist of mortgage defaults from the bank Lokaal Bank Bedrijf (LBB). Each default observation consists of the time the client spent in default and the status after the last month in default, which can be equal to cured, liquidated or unresolved. In the data we can also find the following variables:
High LTV indicator,

NHG indicator,
Bridge loan indicator.

The variable LTV represents the Loan To Value ratio and it is calculated with the following formula,
\[ LTV = \frac{\text{Mortgage amount}}{\text{Appraised value of the property}}. \]
If the LTV is higher than 100%, it means that the value of the mortgage which was not paid back by the client is larger than the value of the security. The indicator High LTV takes value 1 if the LTV is high. NHG is an abbreviation for the National Mortgage Guarantee, or in Dutch Nationale Hypotheek Garantie. If a client with an NHG-backed mortgage cannot pay the mortgage due to specific circumstances, NHG will provide support to the bank and the client. If the client sells the house under the price of the mortgage, NHG will cover the difference, and consequently neither the client nor the bank will suffer the loss. Since both sides get support in case of liquidation, we can expect that the cause specific hazard rate for liquidation will be higher, the cause specific hazard rate for cure lower, and consequently the CCR lower. Bridge loans are short-term loans that last between 2 weeks and 3 years. A client usually uses them until he finds longer-term and larger financing. This kind of loan provides an immediate cash flow to the client at a relatively high interest rate. Such loans are usually riskier for a bank and consequently have higher interest rates.

2.1.1 Stratification

It can be seen that Rabobank has data from different clients with different variables, but it uses an estimator for CCR which is unable to incorporate those variables. Rabobank solves this problem with a method called stratification or segmentation. This method separates the original data frame into smaller data frames based on variables and then calculates CCR on each one of them. For instance, if segmentation is based on the variable called Bridge loan indicator, the original data frame will be separated into two data frames. In the first data frame only clients with bridge loans can be found, and in the other clients with non-bridge loans. Once this step is made, CCR is calculated for each segment and two CCR estimates are obtained, one for clients with and the other for clients without a bridge loan. Rabobank separates the original data frame into four buckets, as can be seen in Figure 2.1. The segment with the most observations is Non-Bridge-Low LTV, which has almost all observations from the original data. It is followed by the segment Non-Bridge-High LTV-non-NHG, which has about ten times fewer observations than the segment with Low LTV. The smallest segments are the segments with Bridge loans and Non-Bridge-High LTV-NHG, which have about 600 observations.
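A minimal sketch of this stratified computation is given below: the portfolio is split into segments and the backward recursion (2.3) is evaluated per segment on a monthly grid, with \(\hat{CCR}(58) = 0\). The column names, segment labels and the small illustrative data frame are assumptions made for the example and do not correspond to the LBB data.

import numpy as np
import pandas as pd

HORIZON = 58  # cures after month 58 are assumed not to occur

def ccr_curve(months, status, horizon=HORIZON):
    """Non-parametric CCR per month via the backward recursion (2.3).

    months : months spent in default (observed time)
    status : 0 = unresolved, 1 = cured, 2 = liquidated
    """
    months = np.asarray(months)
    status = np.asarray(status)
    ccr = np.zeros(horizon + 1)                  # ccr[58] = 0 by assumption
    for i in range(horizon - 1, -1, -1):
        at_risk = np.sum(months >= i + 1)                      # n_{i+1}
        if at_risk == 0:                                       # empty risk set: carry value over
            ccr[i] = ccr[i + 1]
            continue
        cures = np.sum((months == i + 1) & (status == 1))      # d^1_{i+1}
        resolved = np.sum((months == i + 1) & (status > 0))    # d^1_{i+1} + d^2_{i+1}
        ccr[i] = cures / at_risk + (1 - resolved / at_risk) * ccr[i + 1]
    return ccr

# Illustrative portfolio; observed times beyond the horizon would be treated as liquidated.
df = pd.DataFrame({
    "months_in_default": [3, 12, 25, 40, 7, 58, 16, 30],
    "status":            [1,  2,  0,  1, 1,  0,  2,  0],
    "segment": ["Bridge", "Low LTV", "Low LTV", "High LTV-NHG",
                "Low LTV", "Bridge", "High LTV-non-NHG", "Low LTV"],
})
curves = {seg: ccr_curve(g["months_in_default"], g["status"])
          for seg, g in df.groupby("segment")}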

22 Figure 2.1: Segmentation of original data frame 2.2 Results In this chapter the estimation of survival quantities, which are needed in order to model CCR, will be looked into. In order to estimate the nonparametric cumulative incidence function we need the cause specific hazard rate for cure, which follows from equation (1.10). Hazard rates estimated with dˆλ(t) estimator are point processes as can be seen in Figure 2.2. The two biggest segments behave regularly, while segments with less observations have irregular behavior with jumps at the end of the period. Big jumps from value zero to above 0.02 are occurring because a small number of individuals is at risk. For instance, at time 53 in the segment high LTV NHG cause specific hazard rate takes value At that time only one cure happens, but there are 34 individuals at risk. The next step is the estimation of the survival function with the Kaplan-Meier estimator. Results can be found on Figure 2.3. From the figure it is visible that the segment which becomes resolved at the smallest rate is the segment which represents the clients with high LTV and without NHG, while clients with bridge loans become resolved at the highest rate. Once we obtain the survival functions and cause specific hazard rates for cure we can model the cause specific incidence functions for cure, which can be found in Figure 2.4. Results of the nonparametric estimator can be found in Figure 2.5. Curves are the most irregular, have the most jumps, for the segments where we have the least observations. The explanation for this phenomenon can be found in the recursive part of equation (2.3). It is seen that the big jumps happen because the cause specific hazard rates for cure are irregular. From the Figure 2.5 it can be seen that the clients with low LTV have the highest CCR 22

23 Figure 2.2: Cause specific hazard rate for cure estimated with dˆλ(t) 1 Figure 2.3: Survival function estimated with Kaplan-Meier estimator estimates. Clients without NHG have larger CCR estimates than clients with, since bank and clients are more motivated into curing their defaults. From the figure is it not completely clear if the clients with bridge loans or clients with high LTV and without NHG have higher CCR estimates. 2.3 Performance of the method In this chapter we will look into the variance, bias and confidence intervals of the nonparametric CCR estimator. Since the derivation of the asymptotic variance of the CCR estimator is out of scope of this thesis, these quantities will be estimated with the method known as bootstrap. 23

Figure 2.4: Survival functions estimated with Kaplan-Meier estimator
Figure 2.5: CCR estimated with non-parametric estimators

2.3.1 The bootstrap

The bootstrap was introduced by Efron in 1979. The method is used to estimate the bias, variance and confidence intervals of estimators by resampling. In this chapter we will look at how this method is used for the estimation of bias, variance and confidence intervals. Later the method will be applied to the mortgage data in order to estimate the previously mentioned quantities for each segment. For a review of bootstrap estimators and techniques see Fox (2016). In order to bootstrap, data frames need to be sampled from the original data frame. These data frames are created by a selection of random default observations from the original data frame. Let us assume that n random data frames will be simulated. The simulated data frames need to be of the same size as the original data frame, and it is allowed that a simulated data frame contains the same observation more than once. Once the simulated data frames are obtained, segmentation is done as in Figure 2.1. Finally, \(CCR^i(t)\), \(i \in \{1, 2, \ldots, n\}\), estimates are calculated for each segment

and each time point, as can be seen in Figure 2.6.

Figure 2.6: Estimation of quantities with bootstrap.

Once the estimates \(CCR^i(t)\) for each data frame are obtained, \(\overline{CCR}(t)\), \(t \in \{0, 1, \ldots, 58\}\), for each segment can be calculated as
\[ \overline{CCR}(t) = \frac{1}{n} \sum_{i=1}^n CCR^i(t). \]
The bootstrap estimate of the variance of \(\hat{CCR}(t)\) is equal to
\[ \frac{1}{n-1} \sum_{i=1}^n \big(CCR^i(t) - \overline{CCR}(t)\big)^2. \]
An estimator \(\hat\theta\) is called unbiased if we have \(E(\hat\theta) = \theta\), and if it is biased the bias of the estimator is defined as \(B_\theta = E(\hat\theta) - \theta\). Bias of estimators is undesired. With the bootstrap the bias can be estimated by \(\overline{CCR}(t) - \hat{CCR}(t)\), where \(\hat{CCR}(t)\) is the estimate of CCR from the original data frame. For the estimation of confidence intervals a method called the bootstrap percentile confidence interval will be used. In order to obtain a \(100(1-\alpha)\%\) interval for fixed time t, we take \(\hat{CCR}(t)_{\frac{\alpha}{2},L}\), which denotes the value below which a fraction \(\frac{\alpha}{2}\) of the \(CCR^i(t)\) lies. In a similar fashion, \(\hat{CCR}(t)_{\frac{\alpha}{2},R}\) denotes the value such that a fraction \(\frac{\alpha}{2}\) of the \(CCR^i(t)\)

lies above it. For a \(100(1-\alpha)\%\) confidence interval the following interval is taken: \([\hat{CCR}(t)_{\frac{\alpha}{2},L},\ \hat{CCR}(t)_{\frac{\alpha}{2},R}]\). Rabobank decides whether it makes sense to make a segmentation or not based on the size of the confidence intervals. If two curves lie in each other's 95%-confidence intervals, then the CCR curves are pragmatically treated as the same. If we look into the CCR curves for each segment, as can be seen in Figure 2.7, we can see that the segments LBB-Low LTV and LBB-High LTV-non-NHG are treated as significantly different, while LBB-Low LTV and Bridge are not. Furthermore, segments with a bigger population size have narrow confidence intervals, while segments with a small population size have wide confidence intervals. It follows that nonparametric CCR estimators are not a good choice when CCR has to be estimated on a population of small size.

Figure 2.7: Estimation of confidence intervals with the bootstrap.

From Figure 2.8 it can be seen that the biggest variance is obtained by the segments with the smallest population and that the variance grows with time. This happens because of the variability of the term \(d^1_{i+1}/n_{i+1}\) in the recursive part of equation (2.3), as can be seen from Figure 2.2. Since there are no cures at the end of the observation period in the Bridge and LBB-High LTV-NHG segments, \(CCR^i(t)\) always takes the value zero there. The same holds for \(\hat{CCR}(t)\). Consequently, the variance estimated with the bootstrap is equal to zero for the CCR estimates at the end of the observation period. In Figure 2.9 the bias of the nonparametric CCR estimation can be found.

Figure 2.8: Estimation of variance with bootstrap.

If we compare the bias with the size of the CCR estimates from Figure 2.5, it can be concluded that the bias is more than 100 times smaller than the size of \(\hat{CCR}\), and that the estimator is not problematically biased.

Figure 2.9: Estimation of bias with the bootstrap.
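A sketch of how this bootstrap could be coded is given below. It reuses the hypothetical ccr_curve helper from the earlier example, and the number of resamples, the seed and the confidence level are arbitrary choices; the percentile interval corresponds to the method described above.

import numpy as np

def bootstrap_ccr(segment_df, n_boot=1000, alpha=0.05, seed=0):
    """Bootstrap mean, variance, percentile CI and bias of one segment's CCR curve."""
    rng = np.random.default_rng(seed)
    curves = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(segment_df), size=len(segment_df))  # resample with replacement
        sample = segment_df.iloc[idx]
        curves.append(ccr_curve(sample["months_in_default"], sample["status"]))
    curves = np.vstack(curves)                    # shape (n_boot, horizon + 1)
    mean = curves.mean(axis=0)                    # bootstrap mean of CCR(t)
    var = curves.var(axis=0, ddof=1)              # bootstrap variance
    lo, hi = np.percentile(curves, [100 * alpha / 2, 100 * (1 - alpha / 2)], axis=0)
    original = ccr_curve(segment_df["months_in_default"], segment_df["status"])
    bias = mean - original                        # bootstrap bias estimate
    return mean, var, (lo, hi), bias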

3 Cox proportional hazards model

Until now we have been looking only into nonparametric estimators of the survival and hazard rate functions, which are unable to incorporate additional variables. Rabobank uses a method called stratification in order to estimate the CCR curve of individuals with different covariates, and that method has certain shortcomings. Since a big part of the data consists of unresolved (censored) cases, we will look for a suitable regression method from survival analysis. At the beginning of this chapter we review the theory behind the so-called Cox model. In Section 3.1 we will see how to estimate the coefficients of the explanatory variables and the baseline function. In Section 3.2 the formula for the cumulative incidence function will be derived. In Section 3.3 it will be explained how to estimate the coefficients in order to get results comparable with segmentation. In Section 3.4 we will look into the quantities estimated with the Cox model that are needed in order to get CCR estimates. In Section 3.5 the performance of the method will be analyzed with the bootstrap. In the last section of this chapter the method will be compared with the non-parametric estimator.

In order to model CCR(t) we need a model which is able to model the cause-specific hazard rate. Since we expect different behavior from clients with different covariates, we will look into one of the most popular regression models in survival analysis. The Cox proportional hazards model was presented in 1972 by Sir David Cox. For a review of the Cox model see Kalbfleisch and Prentice (2002), Lawless (2002) and Weng (2007). A hazard rate modeled with the Cox model takes the following form
\[ \lambda(t \mid X) = \lambda_0(t) \exp(\beta^T X), \tag{3.1} \]
where \(\beta = (\beta_1, \beta_2, \ldots, \beta_p)\) is a vector of coefficients which represents the influence of the covariates \(X = (X_1, \ldots, X_p)\) on the hazard rate function \(\lambda(t \mid X)\), which depends on X. We denote the covariates of individual i by \(X_i\). The baseline hazard function is denoted by \(\lambda_0(t)\) and it can take any form, Weng (2007). The baseline hazard function can be interpreted as the hazard rate of an individual whose covariate values are equal to zero. In a similar way the baseline survival function can be defined as
\[ S_0(t) = \exp\Big(-\int_0^t \lambda_0(s)\,ds\Big), \tag{3.2} \]
according to equation (1.3). The survival function of individual j then takes the following form (Corrente, Chalita, and Moreira 2003):
\[ S(t \mid X_j) = [S_0(t)]^{\exp(\beta^T X_j)}. \tag{3.3} \]

Since the baseline function can take any form, the Cox model is a semi-parametric estimator of the hazard rate. In a similar way the cause specific hazard rate for cause i,
\[ \lambda_i(t \mid X) = \lambda_i^0(t) \exp(\beta_i^T X), \quad i = 1, 2, \]
can be modeled. Here \(\beta_i\) represents the effect of the covariate vector X on the cause specific hazard rate. The function \(\lambda_i^0(t)\) represents a cause specific baseline function. In the following subsections we will look at how to estimate the parameters in the Cox model.

3.1 Parameter estimation

Since the proportional hazards model is a semi-parametric model, the β coefficients and the baseline hazard function need to be estimated. In Section 3.1.1 we will look into the partial likelihood and how to estimate the coefficients. In Section 3.1.2 we will introduce the Breslow estimator of the baseline function. For a review of baseline and β estimation see Weng (2007).

3.1.1 Estimation of β

Firstly, it will be assumed that the data consist of n individuals and that each individual has a different observation time \(t_i\), i.e. there are no ties in the data. These observation times are ordered in ascending order, so \(t_1 < t_2 < \cdots < t_n\). In 1972 Cox proposed to estimate β using the partial likelihood, Weng (2007). The partial likelihood of individual i, \(L_i\), is simply the hazard rate of individual i divided by the sum of the hazard rates of all individuals that are at risk at time \(t_i\), or, for \(i \in N(t_i)\),
\[ L_i = \frac{\lambda(t_i \mid X_i)}{\sum_{j \in N(t_i)} \lambda(t_i \mid X_j)} \tag{3.4} \]
\[ = \frac{\exp(\beta^T X_i)}{\sum_{j \in N(t_i)} \exp(\beta^T X_j)}. \tag{3.5} \]
The partial likelihood of individuals that were censored is equal to 1. It follows that the partial likelihood function of the data we have is equal to
\[ PL(\beta) = \prod_{i=1}^n L_i = \prod_{i:\,\delta_i = 1} \frac{\exp(\beta^T X_i)}{\sum_{j \in N(t_i)} \exp(\beta^T X_j)}. \tag{3.7} \]
Instead of maximizing PL we maximize \(\log(PL) = pl\),
\[ pl(\beta) = \sum_{i:\,\delta_i = 1} \Big( \beta^T X_i - \ln\Big(\sum_{j \in N(t_i)} \exp(\beta^T X_j)\Big) \Big). \]
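To make the objective concrete, the sketch below maximizes the log partial likelihood pl(β) with a generic optimizer on made-up data with a single covariate and no ties. This is purely illustrative; the thesis itself relies on the Statsmodels implementation mentioned later, and the data and starting value here are assumptions.

import numpy as np
from scipy.optimize import minimize

def neg_log_partial_likelihood(beta, times, events, X):
    """Negative log partial likelihood pl(beta), assuming no tied event times."""
    beta = np.atleast_1d(beta)
    eta = X @ beta                                    # linear predictors beta^T X_j
    pl = 0.0
    for i in np.where(events == 1)[0]:                # sum over uncensored individuals
        risk_set = times >= times[i]                  # N(t_i): individuals still at risk
        pl += eta[i] - np.log(np.sum(np.exp(eta[risk_set])))
    return -pl

# Made-up data: 6 individuals, one binary covariate.
t_obs = np.array([5.0, 8.0, 3.0, 12.0, 7.0, 2.0])
delta = np.array([1, 0, 1, 1, 0, 1])
X = np.array([[1.0], [0.0], [1.0], [0.0], [1.0], [0.0]])

fit = minimize(neg_log_partial_likelihood, x0=np.zeros(1), args=(t_obs, delta, X))
print(fit.x)   # maximum partial likelihood estimate of beta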

30 3.1.2 Baseline hazard estimation Breslow has proposed the following estimator for the baseline hazard function. We still assume that we do not have ties in our data, or t 1 < t 2 <... t n. Breslow proposed an estimator, which is constant on subintervals where no event happened, Weng (2007). Rabobank s data shows if an event of interest happened or not at the end of each month. That means that if the observation has observed time t i, that customer was cured, liquidated or censored in the interval (t i 1, t i ]. In the case of event of interest, the Breslow baseline function will be constant on the interval (t i 1, t i ], where it will take the value λ 0 i As in equation (1.2) and by using the equality f(t) = λ(t)s(t), we get that n L(β, λ 0 (t)) λ(t i X i ) δ i S(t i X i ). Using the equation (1.3) we get that L(β, λ 0 (t)) = i=1 k (λ 0 (t) exp(βx i )) δ i exp( i=1 If we take t i + 1 = t i+1, so equidistant t i, we get that Taking the logarithm of L gives us ti 0 λ 0 (s)ds = i λ 0 j. j=1 ti 0 λ 0 (s) exp(βx i )ds. l(β, λ 0 (t)) = k δ i (ln(λ 0 i ) + βx i ) k i=1 i=1 λ 0 i exp(βx j ). (3.8) j d i Once we obtain ˆβ from partial likelihood, we insert it into l. From the second term in (3.8) we see that only the λ i with δ i = 1 give a positive value to l. Consequently, we take λ 0 i = 0 for i / {t 1 δ i, t 2 δ 2,..., t k δ k }. Going through the steps above we get l(λ 0 (t)) = δ i (ln(λ 0 i ) + ˆβX i ) exp( ˆβX j ). δ i =1 δ i =1 j n i Differentiation with respect to λ 0 i gives us that l(λ0 (t)) for t (t i 1, t i ] is maximized by λ 0 (t i ) = λ 0 t i = 1 j N(t i ) exp( ˆβX j ), Weng (2007). It is known that in continuous time it is impossible to have two individuals with the same observed time, but in reality it will most likely happen that some individuals have the same observed time, since we usually check the state of individuals on 30

31 monthly intervals. Consequently, many individuals will have the same observed times. It follows that a different estimator for a baseline function and a different partial likelihood function has to be used. In 1974 Breslow proposed the following partial likelihood or L(β) = k i=1 exp(βx + i ) ( j N(t i ) exp(βx j)) d i, where X i + = j N(t (i) ) X j. Using the same methodology as when there are no ties in the data we get that the Breslow baseline for ties in the data is equal to λ 0 (t i ) = d i j n i exp( ˆβX j ). (3.9) 3.2 Estimators From Chapter 2 we know that we have to be able to model cause specific hazard rates, survival function and cumulative incidence function in order to estimate CCR. For estimation of the survival function the identity (1.7) will be used. In order to model the cumulative incidence function we have to integrate equation (1.9), which is possible by using the fact that the baseline hazard rate is a step function. For t (t i 1, t i ] we get F 1 (t) = = t 0 t 0 λ 1 (s)s(s)ds λ 1 (s) exp( = F 1 (t i 1 ) + t s 0 (λ 1 (u) + λ 2 (u))du)ds t i 1 λ 1 (s) exp( s 0 (λ 1 (u) + λ 2 (u))du)ds (3.10) Firstly let us look into the integral s 0 (λ 1(u) + λ 2 (u))du. We know that λ i (u) is a step function, which takes value λ i t i on the interval (t i 1, t i ]. For s (t i 1, t i ] it follows s 0 i 1 (λ 1 (u) + λ 2 (u))du = (λ 1 t j + λ 2 t j ) + (s t i 1 )(λ 1 t i + λ 2 t i ) j=1 =Λ 1 (t j 1 ) + Λ 2 (t j 1 ) + (s t i 1 )(λ 1 t i + λ 2 t i ). (3.11) 31

32 From equations (3.10) and (3.11) it follows that F 1 (t) = F 1 (t i 1 ) + λ 1 t i t = F 1 (t i 1 ) + = F 1 (t i 1 ) + t i 1 exp( (Λ 1 (t j 1 ) + Λ 2 (t j 1 ) + (s t i 1 )(λ 1 t i + λ 2 t i ))) λ 1 t i exp(λ 1 (t j 1 ) + Λ 2 (t j 1 )) t t i 1 exp((t i 1 s)(λ 1 t i + λ 2 t i )))ds λ 1 t i exp(λ 1 (t j 1 ) + Λ 2 (t j 1 ))(λ 1 t i + λ 2 t i ) ( exp((t i 1 s)(λ 1 t i + λ 2 t i )) t t i 1 ) = F 1 (t i 1 ) + λ1 t i (1 exp((t i 1 t)(λ 1 t i + λ 2 t i )) exp(λ 1 (t j 1 ) + Λ 2 (t j 1 ))(λ 1 t i + λ 2 t i ). (3.12) Let us continue the example from Figure 1.5 and assume that every individual also has a variable LTV. We will define a new variable HighLTV, which takes value 1, if LTV is high as it is shown in the Figure 3.1. In this example the variable HighLTV will be included in the regression. Figure 3.1: Example data In order to estimate the coefficient β 1, which explains the influence of the variable HighLTV on the cause specific hazard rate for cure, we model cured individuals as individuals who experienced the event of interest. Other individuals are considered as censored. In the same way the coefficient for the cause specific hazard rate for liquidation, β 2, is estimated. For the estimation of the parameters β 1 and β 2 we used the Python Statsmodels package. This gave the following result β 1 =

33 and for liquidation, β 2 = Since no cures occurred in the interval [0, 1) the Breslow estimator for the cause specific baseline function for cure gives us the following value for t (0, 1] λ 0 1(t) = 0 j {1,2,3,4,5,6,7,8,9,10} exp(β 1X i ) = 0. Since one event happened on the interval (1, 2], the cause specific baseline function for t (1, 2] is equal to λ 0 1(t) = 1 j {1,2,4,5,6,7,8,9,10} exp(β 1X i ) = In a similar fashion we get λ 0 1(t) = { = t (2, 3] 0 t (3, ]. For the baseline function for cure we get the following values t (0, 1] t (1, 2] λ 0 2(t) = 0 t (2, 3] t (3, 4] 0 t (4, ). Once the cause specific baseline hazard rates for cure are calculated, we can multiply them by exp(β 1 HighLTV), HighLTV {0, 1} in order to get cause specific hazard rates for cure of individuals with LTV higher and lower than 1.2. For the baseline function for cure we get the following values 0 t (0, 1] t (1, 2] λ 1 (t 1) = t (2, 3] Since exp(β 1 0) = 1 the following identity holds 0 t (3, 4] 0 t (4, ) λ 1 (t 0) = λ 0 1(t). By taking the same steps as above the cause specific hazard rate for loss can be calculated and the following results are obtained: 33

34 c Figure 3.2: Cause specific hazard rate for cure and t (0, 1] t (1, 2] λ 0 2(t 1) = 0 t (2, 3] t (3, 4] 0 t (4, ) λ 1 (t 0) = λ 0 1(t). The hazard rate for liquidation for HighLT V = 1 is almost equal to zero, because of the size of β 2. In order to calculate the survival function equation (1.7) will be used. Since the time steps are of length 1 and the cause specific hazard rate is a step function, the following identity holds for t (t k 1, t k ] Λ i (t X) = Λ i (t k 1 X) + (t t k )λ i (t k X). The identity above gives us the following functions of t for cause specific cumulative hazard functions for cure 0 t (0, 1] (t 1) t (1, 2] Λ 1 (t 0) = (t 2) t (2, 3] t (3, ) 34

35 Figure 3.3: Cause specific cumulative function for cure Figure 3.4: Cause specific cumulative hazard rate function for liquidation and for HighLTV=1, 0 t (0, 1] (t 1) t (1, 2] Λ 1 (t 1) = (t 2) t (2, 3] t (3, ). 35

36 The cause specific cumulative hazard functions for loss are t t (0, 1] and for HighLTV=1, (t 1) t (1, 2] Λ 2 (t 0) = t (2, 3] (t 3) t (3, 4] t (4, ) t t (0, 1] (t 1) t (1, 2] Λ 0 2(t 1) = t (2, 3] (t 3) t (3, 4] t (4, ). The cause specific cumulative hazard functions for cure and loss can be seen in Figures 3.3 and 3.4. After the cause specific cumulative functions for cure and liquidation are determined, the identity (1.7) can be used for the estimation of the survival function. We get exp( t 0.103) t (0, 1] exp( ( (t 1) ( )) t (1, 2] S(t 0) = exp( ( (t 2) 0.268)) t (2, 3] exp( ( (t 3) 0.408) t (3, 4] exp( 1.997) t (4, ) and for HighLTV=1 we get exp( t 0.093) t (0, 1] exp( ( (t 1) ( )) t (1, 2] S(t 1) = exp( ( (t 2) 0.329)) t (2, 3] exp( ( (t 3) 0.366)) t (3, 4] exp( 1.018) t (4, ). The survival curve estimates can be seen in Figure 3.5. After calculation of survival curves and hazard rates are calculated we can estimate cumulative incidence functions using equation (3.12). The calculated cumulative incidence functions can be seen in Figure

37 Figure 3.5: Survival function calculated with the Cox model Figure 3.6: Cumulative incidence function calculated with the Cox proportional model 3.3 Model implementation For the estimation of CCR, using the Cox model, the same data will be used as in the previous chapter. Rabobank is still interested in the behavior of clients, which have Bridge loans, clients with non-bridge loans and low LTV, clients with non-bridge loans, which have high LTV and do not have NHG and clients, which have non-bridge mortgages with high LTV and do have NHG. Firstly it needs to be understood that the coefficient estimates of one risk factor depends on all variables that are included into the regression. For instance, if we compare the parameters, when Cox regression is performed with variable HighLTV only, to ˆβ hltv HighLTV 37

38 ˆβ hltv,br HighLTV, when HighLTV and variable Bridge are included in the regression, it can be ˆβ hltv hltv,br HighLTV ˆβ HighLTV. seen that Secondly, if one is interested in the behavior of clients with bridge loans then regression only with the variable Bridge needs to be made. If we would include variable highltv as well, than we would not be able to calculate a CCR curve for Bridge loans. However we would be able to calculate only CCR curves for clients which have Bridge loans and high LTV or CCR for clients which have bridge loans and low LTV, since X highltv can take only values 0 and 1. It follows that Cox regression needs to be done three times with three different variable combinations as can be seen in Figure 3.7. Once the coefficients are obtained for hazard rate calculation of the desired segment, a Figure 3.7: Input variables and output coefficients matching covariate needs to be used as can be observed from Figure 3.8. Figure 3.8: Output coefficients multiplied with covariates 38
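Putting the pieces of this chapter together, a sketch of the per-segment computation could look as follows: the Breslow baseline hazards (3.9) are computed on a monthly grid for cure and liquidation separately, combined with the fitted coefficients into cause specific hazards for a given covariate profile, integrated into the cumulative incidence function via (3.12), and finally turned into CCR(t) through (2.1). The coefficient vectors beta_cure and beta_liq are assumed to come from the two cause specific Cox fits; all names below are illustrative, not the implementation used by Rabobank.

import numpy as np

def breslow_baseline(months, status, X, beta, cause, horizon=58):
    """Piecewise-constant baseline hazard on (i-1, i], i = 1..horizon, eq. (3.9)."""
    lam0 = np.zeros(horizon + 1)
    risk_score = np.exp(X @ beta)
    for i in range(1, horizon + 1):
        at_risk = months >= i
        d_i = np.sum((months == i) & (status == cause))   # events of this cause in month i
        denom = np.sum(risk_score[at_risk])               # sum of exp(beta X_j) over the risk set
        lam0[i] = d_i / denom if denom > 0 else 0.0
    return lam0

def ccr_from_cox(lam0_cure, lam0_liq, beta_cure, beta_liq, x, horizon=58):
    """CCR curve for covariate profile x from piecewise-constant cause specific hazards."""
    lam1 = lam0_cure * np.exp(x @ beta_cure)      # cause specific hazard for cure
    lam2 = lam0_liq * np.exp(x @ beta_liq)        # cause specific hazard for liquidation
    cum = np.cumsum(lam1 + lam2)                  # Lambda_1(t_i) + Lambda_2(t_i)
    F1 = np.zeros(horizon + 1)
    for i in range(1, horizon + 1):               # cumulative incidence via (3.12)
        tot = lam1[i] + lam2[i]
        if tot > 0:
            F1[i] = F1[i - 1] + lam1[i] * np.exp(-cum[i - 1]) * (1 - np.exp(-tot)) / tot
        else:
            F1[i] = F1[i - 1]
    S = np.exp(-cum)                              # survival function, eq. (1.7)
    return (F1[horizon] - F1) / S                 # CCR(t), eq. (2.1) with F_1(inf) = F_1(58)

# Example usage (hypothetical arrays and coefficients):
# lam0_c = breslow_baseline(months, status, X, beta_cure, cause=1)
# lam0_l = breslow_baseline(months, status, X, beta_liq, cause=2)
# ccr_bridge = ccr_from_cox(lam0_c, lam0_l, beta_cure, beta_liq, x=np.array([1.0]))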

39 3.4 Results In this chapter all quantities that are needed in order to obtain CCR estimates for each segment will be looked into. Since we want to obtain CCR estimates for each segment. The first step is coefficient estimation for all combinations of variables. Coefficient cause specific hazard rates for cure estimated with Cox regression can be found in the table in Figure 3.9 and for loss in Figure Figure 3.9: Output coefficients for cure Figure 3.10: Output coefficients for liquidation From the figures it can bee seen that the indicator variable for bridge loan and the indicator variable for high LTV have a negative effect on the cause specific hazard rates for cure. In other words, individuals which have a LTV of more than 1.2 and a Bridge loan are have a smaller hazard rate and consequently have a smaller probability of being cured. Individuals with NGH will be cured slightly faster than individuals without it. On the other hand, it can be seen that individuals which have high LTV, non Bridge loans and NHG will be liquidated faster than individuals without NGH. Once estimates of coefficients are obtained we are able to model cause specific hazard rates for cure and liquidation, which can be seen in Figures 3.11 and It is seen that the segment which is cured at the fastest rate is the segment with clients which have low LTV, while the other three segments have similar hazard rates, which become almost the same after the 30th month in default. These segments behave differently when cause specific hazard rate for liquidation is modeled. We can conclude that liquidation happens at the lowest rates for clients which have low LTV. This probably happens because clients with low LTV want to keep their properties. The segment with the highest rate of liquidation is the segment with bridge loans. In order to calculate the survival function for segments with Cox regression equation (1.7) 39

40 Figure 3.11: Cause specific hazard rate for cure Figure 3.12: Cause specific hazard rate for liquidation is used. Since the size of cause specific hazard rates and the size of the survival function are negatively correlated, the segment with the biggest survival function is represented by clients which have High LTV and do not have NHG. If we look Figure 3.12 it is visible that this segment has the second smallest cause specific hazard rate for liquidation while in Figure 3.11 it is shown that cause specific hazard rates for cure are almost the same as the smallest hazard rates. With the same reasoning the behaviour of the other groups can also be explained. An intuitive explanation for the shapes of the survival curves is that clients with high cause specific hazard rates will be resolved faster and consequently the probability of being unresolved at time t becomes smaller. Before CCR is modeled we need to look into the estimation of the cumulative incidence function for which the equation (3.12) is needed. From the Figure 3.14 it can be concluded that the highest probabilities of being cured are achieved by individuals with Low LTV. Since other segments have similar estimates for cause specific hazard rates 40

41 Figure 3.13: Survival functions calculated with the Cox model for cure, cumulative incidence functions are ordered in the same way as estimates of the cause specific hazard rates for liquidation. Figure 3.14: Cumulative incidence functions for cure calculated with the Cox model Finally all estimates from above can be used to estimate the CCRs, which can be seen in Figure Performance of the method In Section 2.3 it was seen that the nonparametric CCR estimator has large variance and wide confidence intervals at the end of the observation period. That happens because of the jumps of the hazard rates that are estimated with the Nelson-Aalen estimator. In this chapter the methodology described in Section will be used again in order to estimate confidence intervals, variance and bias of CCR estimated with Cox regression. From Figure 3.16 it can be seen that Cox regression estimators have narrower confidence 41

42 Figure 3.15: CCR estimated with the Cox model intervals than the nonparametric estimator but on the other hand, estimates for each segment are the same at the end of the observation period and consequently not significantly different. In Figure 3.17 it is visible that the variance is almost 10 times smaller than the variance Figure 3.16: Confidence intervals estimated with the bootstrap of the nonparametric estimator and that the small number of observations at the end of the observation period has almost no effect on variance. In Figure 3.18 it is be shown that bias is smaller and that time also has almost no influence on the bias. 3.6 Discussion From Section 3.5 it can be concluded that the Cox model gives us lower variance, less bias and narrower confidence intervals than nonparametric estimators. All these properties are definitely desirable features of an estimator. 42

Figure 3.17: Variance of CCR estimated with the Cox regression
Figure 3.18: Estimation of bias with the bootstrap

At the same time, there exist systematic techniques for variable selection in the Cox model. With these it can be decided much more easily whether a variable should be included in or excluded from a segmentation than with the method based on 95% confidence intervals, which is used in the case of nonparametric estimators. Also, the β's which are output from the regression have some explanatory power. When the coefficient of a boolean variable is negative, it is clear from equation (3.1) that a client with such a property will have a smaller hazard rate than a client without it, and consequently has a higher survival function and a smaller probability of resolution. The opposite phenomenon happens when β is positive. Consequently, the tables which can be found in Figures 3.9 and 3.10 could be a helpful instrument when deciding whether a client should get a mortgage, or when deciding what the size of a client's interest rate should be. Such a tool is not available when we operate with nonparametric estimators.

Understandability is definitely one of the desired properties of an estimator in Risk Management, and here the Cox model falls short. All nonparametric

All the nonparametric estimators which are part of the CCR estimator are intuitively easier to understand than the estimators used in the Cox model, especially if we compare the explanations of the cumulative incidence functions. Hazard rates estimated with the Nelson-Aalen estimator are closely related to empirical probabilities from a simple counting process, while formula (3.1) is closely related to the exponential distribution and makes sense mainly to a person who is better educated in probability. If time and computational power are important factors in deciding which estimator to use, then we would have to choose the nonparametric estimators, since they work faster and use less memory.
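As a purely hypothetical illustration of the interpretability argument above: suppose the fitted coefficient of the boolean variable NHG in the cause specific model for cure were $\beta_{\text{NHG}} = 0.25$ (a made-up value, not an estimate from this thesis). By equation (3.1) a client with NHG would then have a cause specific hazard rate for cure that is $e^{0.25} \approx 1.28$ times as large as that of an otherwise identical client without NHG, and would therefore remain unresolved for a shorter time and have a higher cumulative incidence of cure, provided the hazard rate for liquidation is unchanged.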

4 Proportional hazards model in an interval censored setting

Up until now CCR was modeled assuming that we have continuous time data and that the events of interest happen exactly at the observed times. In reality, however, the events denoted by $t_i$ actually happened in the interval $(t_{i-1}, t_i]$. Consequently, we are dealing with another type of censoring, interval censoring. In order to model CCR in an interval censoring setting we have to use a different approach and a different likelihood function. For a review of the proportional hazards model in an interval censored setting see Corrente, Chalita, and Moreira (2003).

In Section 4.1 we will get familiar with the theory behind survival analysis in the interval censoring setting. In Section 4.1.1 the generalized linear models, which are needed for the modeling of CCR, will be presented. In Section 4.2 we will learn how to use binning in the construction of the time intervals and make the computations with the generalized linear models feasible. In Section 4.3 we review the estimated quantities in the interval censoring setting. In Section 4.4 the performance of the method will be analyzed with the bootstrap. Which of the three estimators is the best will be discussed in Section 4.5.

4.1 Theoretical background

When we have interval censored data, the observation period is divided into smaller subintervals $I_i = [a_{i-1}, a_i)$, where $0 = a_0 < a_1 < \cdots < a_k = \infty$. The set of subjects that defaulted in interval $I_i$ will be denoted by $D_i$, and $N_i$ will denote the set of subjects which are at risk at the beginning of the interval $I_i$. The variable $\delta_{ji}$, $j \in \{1, 2, \ldots, n\}$, $i \in \{1, 2, \ldots, k\}$, is the indicator which takes value 1 if subject $j$ failed in interval $I_i$ and value 0 if subject $j$ was still alive at the end of interval $I_i$ or was censored in interval $I_i$. For instance, if subject $j$ died in the interval $I_3$, the following equality holds: $(\delta_{j1}, \delta_{j2}, \delta_{j3}) = (0, 0, 1)$. The value $p(a_i \mid X_j)$ equals the conditional probability that the subject with covariates $X_j$ has experienced the event by time $a_i$, given that the individual had not experienced the event of interest at $a_{i-1}$.

In order to derive the likelihood function in an interval censoring setting, the following two identities are needed:

$$
\begin{aligned}
P(T_j \in I_i \mid X_j) &= P(a_{i-1} \le T_j < a_i \mid X_j) = S(a_{i-1} \mid X_j) - S(a_i \mid X_j) \\
&= [(1 - p(a_1 \mid X_j)) \cdots (1 - p(a_{i-1} \mid X_j))] - [(1 - p(a_1 \mid X_j)) \cdots (1 - p(a_i \mid X_j))] \\
&= [(1 - p(a_1 \mid X_j)) \cdots (1 - p(a_{i-1} \mid X_j))]\bigl(1 - (1 - p(a_i \mid X_j))\bigr) \\
&= [(1 - p(a_1 \mid X_j)) \cdots (1 - p(a_{i-1} \mid X_j))]\, p(a_i \mid X_j) \qquad (4.1)
\end{aligned}
$$

and

$$
P(T_j > a_i \mid X_j) = S(a_i \mid X_j) = (1 - p(a_1 \mid X_j)) \cdots (1 - p(a_i \mid X_j)). \qquad (4.2)
$$

Combining equations (4.1) and (4.2), the likelihood function for interval censored data becomes

$$
L = \prod_{i=1}^{k} \prod_{j \in N_i} p(a_i \mid X_j)^{\delta_{ji}} \, (1 - p(a_i \mid X_j))^{1 - \delta_{ji}}, \qquad (4.3)
$$

see Corrente, Chalita, and Moreira (2003). Equation (4.3) is a likelihood function for observations with a Bernoulli distribution, where $\delta_{ji}$ is a binary response variable with success probability $p(a_i \mid X_j)$. When a variable has a Bernoulli distribution, generalized linear models can be used for the modeling.

4.1.1 Generalized Linear Models

In the first part of this section generalized linear models will be presented. In the second part we review the modeling of $p(a_i \mid X_j)$ with a GLM. For a review of Generalized Linear Models see Fox (2016).

Generalized Linear Models (GLM) are a tool for the estimation of a number of distinct statistical models for which we would usually need separate regression techniques, for instance logit and probit models. GLM was first presented by John Nelder and R.W.M. Wedderburn in 1972. In order to use a GLM three components are needed. Firstly, we need a random component $Y_i$ for the $i$-th observation, which is conditioned on the explanatory variables of the model. In the original formulation of GLM, $Y_i$ had to be a member of an exponential family. One of the members of the exponential family is the binomial distribution and consequently the Bernoulli distribution. Secondly, a linear predictor is needed, that is, a function of the regressors

$$\eta_i = \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik}.$$

Thirdly, we need a smooth invertible linearizing link function $g(\cdot)$. The link function transforms the expectation of the response variable, $\mu_i = E(Y_i)$, into the linear predictor,

$$g(\mu_i) = \eta_i = \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik}.$$

One of the functions that can be used in combination with the binomial distribution is the complementary log-log function (clog-log), where

$$\eta_i = g(\mu_i) = \ln(-\ln(1 - \mu_i)) \quad \text{or} \quad \mu_i = g^{-1}(\eta_i) = 1 - \exp(-\exp(\eta_i)).$$

From the identity (4.3) it is seen that $\delta_{ji}$ will be the response variable with a Bernoulli distribution with parameter $p(a_i \mid X_j)$ and consequently $E(\delta_{ji}) = p(a_i \mid X_j) = \mu_i$. Once we know $\mu_i$, we have to find the link function $g$ for which $\eta_i = g(p(a_i \mid X_j))$ holds. Using the Cox proportional hazards model in order to model $p(a_i \mid X_j)$, equations (3.3) and (3.2) give us

$$p(a_i \mid X_j) = 1 - \left( \frac{S_0(a_i)}{S_0(a_{i-1})} \right)^{\exp(\beta^T X_j)}. \qquad (4.4)$$

When the complementary log-log transformation is applied to equation (4.4) we get

$$\ln(-\ln(1 - p(a_i \mid X_j))) = \beta^T X_j + \ln\left(-\ln\left(\frac{S_0(a_i)}{S_0(a_{i-1})}\right)\right) = \beta^T X_j + \gamma_i = \eta_i, \qquad (4.5)$$

where $\gamma_i = \ln\left(-\ln\left(\frac{S_0(a_i)}{S_0(a_{i-1})}\right)\right)$. From the above it follows that we can use a GLM with a binomial distribution and complementary log-log link function in order to model $\delta_{ji}$, or in other words, we can fit the Cox proportional hazards model in the interval censoring setting with a GLM. After the values $p(a_i \mid X_j)$ are obtained, the survival function for each time point can be calculated using equation (4.2).

In the competing risk setting a GLM can be used in order to model the probabilities of failing due to reason $k$ in the interval $[a_{i-1}, a_i)$, that is $p_k(a_i \mid X_j)$, $k = 1, 2$. In this setting the indicator $\delta_{ji}^k$ tells us whether individual $j$ failed in interval $i$ due to reason $k = 1, 2$. As soon as the estimates $\hat p_k(a_i \mid X_j)$ are calculated, the following identity can be used in order to estimate the survival function:

$$\hat S(t \mid X_j) = \prod_{i : a_i \le t} \bigl(1 - (\hat p_1(a_i \mid X_j) + \hat p_2(a_i \mid X_j))\bigr). \qquad (4.6)$$

In order to estimate the cumulative incidence function the following estimator will be used:

$$\hat F_1(t \mid X_j) = \sum_{i : a_i \le t} \hat S(a_{i-1} \mid X_j)\, \hat p_1(a_i \mid X_j). \qquad (4.7)$$

Once the cumulative incidence functions are estimated, equation (2.1) is used in order to estimate CCR. As can be seen from equation (4.7), a GLM can be used for the modeling of proportional hazards in the interval censoring setting. Since the indicator variable $\delta_{ji}^k$ is defined for individual $j$ for each time until he is censored or one of the events $k$ occurs to him, we need to input a different type of data frame into the regression than the one used for the nonparametric and Cox estimators. The easiest way to understand how the data needs to be transformed is by continuing the example from Figure 3.1. From the figure it can be seen that we are following individual 1 until time 6, when he is censored. For every time interval we have to define $\delta_{1i}^k$, $k = 1, 2$, $i = 1, 2, \ldots, 6$, which tells us whether individual 1 experienced cure or loss in the interval $(i-1, i]$. In a similar fashion the other rows have to be duplicated, but for observations which are not censored, $\delta_{ji}^1$ takes value 1 if cure happened, as can be seen for individual 4 in Figure 4.1. After we have the data in the right format, we need to make dummy variables out of the variable observed time. How to create dummy variables for individual 1 can be seen in Figure 4.2. Dummies have to be created for each time from 1 to the largest time that can be found in the column observed time.

Once the dummies are created, we can start modeling $p_k(a_i \mid X)$. In the GLM we choose the clog-log link function and the binomial distribution. For the response variable we choose $\delta_{ji}^k$, and as independent variables we have to choose the dummy variables and the variables we want to include in the regression, in our case the variable HighLTV. The coefficients which we get as output for the dummies $i$ represent the values $\gamma_i$, while the coefficient $\beta_{\text{HighLTV}}$ explains the influence of the variable HighLTV on $p_k(a_i \mid \text{HighLTV})$, as can be seen in equation (4.4). The output coefficients for cure are $\beta_{\text{HighLTV}}$ and $\gamma_i^1$, $i = 1, \ldots, 6$, and the output coefficients for liquidation are $\beta_{\text{HighLTV}}$ and $\gamma_i^2$, $i = 1, \ldots, 6$.

Figure 4.1: Transformed data frame for GLM
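The transformation to this long format and the two GLM fits described above can be sketched with pandas and statsmodels. This is a minimal sketch under stated assumptions, not the exact code used in the thesis: the data frame `clients` and its columns `observed_time`, `event` (0 = censored, 1 = cure, 2 = liquidation) and `HighLTV` are made-up stand-ins for the small example of Figure 3.1, and the formula interface is used so that the interval dummies, whose coefficients are the values γ_i, are created automatically.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

def to_person_period(df):
    """One row per client per month until cure, liquidation or censoring."""
    rows = []
    for _, r in df.iterrows():
        for i in range(1, int(r["observed_time"]) + 1):
            last = i == int(r["observed_time"])
            rows.append({
                "interval": i,
                "HighLTV": r["HighLTV"],
                "cure": int(last and r["event"] == 1),   # delta^1_{ji}
                "loss": int(last and r["event"] == 2),   # delta^2_{ji}
            })
    return pd.DataFrame(rows)

long_df = to_person_period(clients)

# Binomial GLM with complementary log-log link; "0 + C(interval)" yields one
# coefficient gamma_i per interval instead of an intercept plus contrasts.
link = sm.families.links.cloglog()        # called CLogLog in recent statsmodels
family = sm.families.Binomial(link=link)
cure_fit = smf.glm("cure ~ 0 + C(interval) + HighLTV", data=long_df, family=family).fit()
loss_fit = smf.glm("loss ~ 0 + C(interval) + HighLTV", data=long_df, family=family).fit()
print(cure_fit.params)    # gamma_1, ..., gamma_6 and beta_HighLTV for cure
```

Fitting one GLM per competing risk on the same expanded data frame mirrors the cause specific treatment of cure and liquidation used throughout the thesis.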

Figure 4.2: Dummies for GLM

In order to estimate $p_k(a_i \mid \text{HighLTV})$ we have to use the inverse of equation (4.5), that is,

$$p_k(a_i \mid \text{HighLTV}) = 1 - \exp\bigl(-\exp(\gamma_i^k + \beta_{\text{HighLTV}} \cdot \text{HighLTV})\bigr). \qquad (4.8)$$

After doing so we get the interval probabilities of cure, $\hat p_1(a_i \mid 1)$ and $\hat p_1(a_i \mid 0)$, for $i = 1, \ldots, 6$; for example, the estimated values of $\hat p_1(a_i \mid 0)$ include 0.105 and 0.411. In the same way we obtain the interval probabilities of liquidation.

For example, the estimated values of $\hat p_2(a_i \mid 0)$ include 0.143 and 0.500, and those of $\hat p_2(a_i \mid 1)$ include 0.009 and 0.365. Applying equation (4.6) to the results above gives the survival functions; the estimated values of $\hat S(a_i \mid 0)$ include 1.000 and 0.301, and those of $\hat S(a_i \mid 1)$ include 0.909 and 0.298.
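Given the fitted coefficients, equations (4.8), (4.6) and (4.7) translate directly into a few lines of numpy. This is a sketch under assumed names: `gammas_cure`, `gammas_loss`, `beta_cure` and `beta_loss` are hypothetical placeholders for the estimates produced by the two GLMs above.

```python
import numpy as np

def interval_prob(gammas, beta, highltv):
    """Equation (4.8): p_k(a_i | HighLTV) = 1 - exp(-exp(gamma_i^k + beta * HighLTV))."""
    return 1.0 - np.exp(-np.exp(np.asarray(gammas) + beta * highltv))

def survival(p_cure, p_loss):
    """Equation (4.6): S(a_i | x) = product over l <= i of (1 - (p_1(a_l|x) + p_2(a_l|x)))."""
    return np.cumprod(1.0 - (p_cure + p_loss))

def cumulative_incidence(p_cure, p_loss):
    """Equation (4.7): F_1(a_i | x) = sum over l <= i of S(a_{l-1}|x) * p_1(a_l|x)."""
    s = survival(p_cure, p_loss)
    s_prev = np.concatenate(([1.0], s[:-1]))   # S(a_0 | x) = 1
    return np.cumsum(s_prev * p_cure)

# probabilities, survival and cumulative incidence for a client with HighLTV = 1
p1 = interval_prob(gammas_cure, beta_cure, highltv=1)
p2 = interval_prob(gammas_loss, beta_loss, highltv=1)
S = survival(p1, p2)
F1 = cumulative_incidence(p1, p2)
```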

Now equation (4.7) can be used for the estimation of the cumulative incidence functions $\hat F_1(a_i \mid 0)$ and $\hat F_1(a_i \mid 1)$; for example, the estimated values include 0.294 and 0.349, respectively.

4.2 Model implementation

Rabobank assumes that the probability of cure is equal to zero after the 58th month and that the data is collected at the end of each month. From Section 4.1 it follows that we would need to make a regression with at least 58 coefficients in order to obtain $\gamma_i$, $i = 1, \ldots, 58$, which are needed for the estimation of $p_k(a_i \mid X)$. Such a regression is computationally too expensive. At the same time we would need to transform the original data set into one with more than 2 million observations, which is expensive as well. In order to avoid these problems a method called binning or bucketing will be used.

4.2.1 Binning

In order to make the method computationally feasible the number of observed intervals will be reduced. Firstly, the cause specific hazard rate of each interval is estimated with the following estimator:

$$\frac{\text{Number of deaths in the interval}}{\text{Number of individuals at the beginning of the interval} \times \text{Size of the interval}}.$$

Once these hazard rates are estimated, the two neighboring intervals with the smallest absolute difference between their hazard rates are joined into a new interval. If we join the intervals $I_i = (a_{i-1}, a_i]$ and $I_{i+1} = (a_i, a_{i+1}]$, the new interval $(a_{i-1}, a_{i+1}]$ is obtained. These two steps are repeated until the desired, small enough number of intervals is reached. The GLM regression will be made with 10 intervals. The results can be seen in Figures 4.3 and 4.4.
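A minimal sketch of this greedy binning step is given below; the interval edges, death counts and at-risk counts are assumed inputs (in the thesis they come from the monthly default data), and the function is an illustration of the described procedure rather than the exact implementation.

```python
import numpy as np

def bin_intervals(edges, deaths, at_risk, target=10):
    """Merge neighbouring intervals until `target` intervals remain.

    edges   : boundaries a_0 < a_1 < ... < a_k of the initial intervals
    deaths  : number of events in each interval (a_{i-1}, a_i]
    at_risk : number of individuals at risk at the beginning of each interval
    """
    edges, deaths, at_risk = list(edges), list(deaths), list(at_risk)
    while len(deaths) > target:
        widths = np.diff(edges)
        hazards = np.array(deaths) / (np.array(at_risk) * widths)
        j = int(np.argmin(np.abs(np.diff(hazards))))   # most similar neighbours
        deaths[j] += deaths[j + 1]                     # merge interval j+1 into j
        del deaths[j + 1]
        del at_risk[j + 1]   # the merged interval keeps the risk set of interval j
        del edges[j + 1]     # drop the boundary between the two joined intervals
    return edges, deaths, at_risk

# e.g. 58 monthly intervals with edges 0, 1, ..., 58 merged into 10 buckets
# edges, deaths, at_risk = bin_intervals(range(59), monthly_deaths, monthly_at_risk)
```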

Figure 4.3: Hazard rates estimated before binning

Figure 4.4: Interval hazard rates estimated after binning

As soon as the binning is performed, we use the output intervals $I_i = (a_{i-1}, a_i]$ in order to model the probabilities $p_k(a_i)$. The values $a_i$ for which we will model the probabilities are $a_0 < a_1 < \cdots < a_{10} < a_{11} = \infty$; concretely, we get $0 < 3 < 4 < 5 < 8 < 12 < 13 < 14 < 20 < 46 < 58 < \infty$. It will still be assumed that the probability of cure after time 58 is equal to 0, and consequently $p_1(a_{11} \mid X) = 0$.

4.3 Results

In this chapter all the quantities which are needed for the estimation of CCR in the interval censoring setting will be modeled and compared with the results from the previous chapters. Since a regression with 58 coefficients is not computationally feasible, the intervals obtained in Section 4.2.1 will be used. All graphs in this chapter should be histograms, since we are operating with discrete time, but the visualization of four segments in one figure would be nearly impossible; consequently the quantities are represented as step functions.

After the intervals are determined we can estimate the parameters. As independent variables we have to use a combination of variables for each segment, as described in Section 3.3, and dummy variables for the intervals. Once the regression is done we get the coefficients for the risk parameters for cure and loss, as can be seen in Figures 4.5 and 4.6.

Figure 4.5: Output coefficients for cure

Figure 4.6: Output coefficients for loss

Once $\gamma_i$, $i = 1, 2, \ldots, 10$, are estimated we can start computing $p_1(a_i \mid X_j)$ and $p_2(a_i \mid X_j)$. The interval probabilities for cure and loss can be seen in Figures 4.7 and 4.8. The interval probabilities cannot be compared with the modeled hazard rates from the previous chapters because they are not normalized. In the figures it is also visible that the interval probabilities are higher for wider intervals, which makes sense: the longer the interval, the higher the probability of cure or liquidation. From Figures 4.7 and 4.8 it is seen that the segments behave in a similar way as the hazard rates estimated with the Cox model, which can be seen in Figures 3.11 and 3.12. Con-
