Credit risk and survival analysis: Estimation of Conditional Cure Rate


Stochastics and Financial Mathematics
Master Thesis

Credit risk and survival analysis: Estimation of Conditional Cure Rate

Author: Just Bajželj
Examination date: August 30, 2018
Supervisor: dr. A.J. van Es
Daily supervisor: R. Man MSc

Korteweg-de Vries Institute for Mathematics
Rabobank

Abstract

Rabobank currently uses a non-parametric estimator for the computation of the Conditional Cure Rate (CCR), and this method has several shortcomings. The goal of this thesis is to find a better estimator than the currently used one. This master thesis looks into three CCR estimators. The first one is the currently used method. We analyze its performance with the bootstrap and later develop a method with better performance. Since the newly developed and currently used estimators are not theoretically correct with respect to the data, a third method is introduced. However, according to the bootstrap, the latter method exhibits the worst performance. For the modeling and data analysis the programming language Python is used.

Title: Credit risk and survival analysis: Estimation of Conditional Cure Rate
Author: Just Bajželj, just.bajzelj@gmail.com
Supervisor: dhr. dr. A.J. van Es
Second Examiner: dhr. dr. A.V. den Boer
Examination date: August 30, 2018

Korteweg-de Vries Institute for Mathematics
University of Amsterdam
Science Park , 1098 XG Amsterdam

Rabobank
Croeselaan 18, 3521 CB Utrecht

Acknowledgments

I would like to thank my parents, who made it possible for me to finish the two-year Master's in Stochastics and Financial Mathematics in Amsterdam, which helped me become the person I am today. I would also like to thank the people from Rabobank and the department of Risk Analytics, thanks to whom I have written this thesis during my six-month internship, and who showed me that work can be more than enjoyable. In particular, I would like to acknowledge all my mentors: Bert van Es, who always had time to answer all my questions; Viktor Tchistiakov, who gave me challenging questions and ideas, which represent the core of this thesis; and Ramon Man, who always showed me support and made sure that this thesis was done on schedule.

Contents

Introduction
0.1 Background
0.2 Research objective and approach
1 Survival analysis
1.1 Censoring
1.2 Definitions
1.3 Estimators
1.4 Competing risk setting
2 Current CCR Model
2.1 Model implementation
2.1.1 Stratification
2.2 Results
2.3 Performance of the method
2.3.1 The bootstrap
3 Cox proportional hazards model
3.1 Parameter estimation
3.1.1 Estimation of β
3.1.2 Baseline hazard estimation
3.2 Estimators
3.3 Model implementation
3.4 Results
3.5 Performance of the method
3.6 Discussion
4 Proportional hazards model in an interval censored setting
4.1 Theoretical background
4.1.1 Generalized Linear Models
4.2 Model implementation
4.2.1 Binning
4.3 Results
4.4 Performance of the method
4.5 Discussion
Popular summary

Introduction

Under Basel II banks are allowed to build their own internal models for the estimation of risk parameters. This is known as the Internal Ratings-Based approach (IRB). Risk parameters are used by banks in order to calculate their own regulatory capital. In Rabobank, the Loss Given Default (LGD), Probability of Default (PD) and Exposure at Default (EAD) are calculated with IRB. Loss given default describes the loss of a bank in case the client defaults. Default happens when the client is unable to pay the monthly payments for the mortgage for some time, or when one of the other default events, usually connected with the client's financial difficulties, happens. A missed payment is also known as an arrear. After the client defaults, his portfolio is non-performing and two events can happen. The event in which the client's portfolio returns to performing is known as cure. Cure happens if the customer pays his arrears and has no missed payments during a three-month period, or if he pays the arrears after a loan restructuring and has no arrears in a twelve-month period. The event in which the bank needs to sell the client's security in order to cover the loss is called liquidation. Two approaches are available for LGD modeling. The non-structural approach consists of estimating the LGD by observing historical loss and recovery data. Rabobank uses another approach, the so-called structural approach. When the bank deals with LGD in a structural way, different probabilities and outcomes of default are considered. The model is split into several components that are developed separately and later combined in order to produce the final LGD estimation, as can be seen in Figure 0.1. In order to calculate the loss given default we first need to calculate the probability of cure, the loss given liquidation, the loss given cure and the indirect costs. Rabobank assumes that the loss given cure equals zero, since cure is offered to the client only if there is zero loss to the bank, and any indirect costs that are incurred during the cure process are taken into account in the indirect costs component. Therefore, in this thesis the term loss is sometimes used instead of the term liquidation. Indirect costs are usually costs that are made internally by the departments of the bank that are involved in processing defaults, e.g., salaries paid to employees and administrative costs. The equation used for the LGD calculation can be seen in Figure 0.1. The parameter probability of cure (\(P_{Cure}\)) is the probability that the client will cure before his security is liquidated. All model components, including \(P_{Cure}\), depend on covariates of each client. A big proportion of the cases which are used to estimate \(P_{Cure}\) are unresolved cases. A client that defaulted and was not yet liquidated nor cured is called unresolved. The easiest way to picture the unresolved state is that every client needs some time after default in order to cure or sell their property. The non-absorbing state before cure or liquidation is called the unresolved state.

Figure 0.1: \(LGD = P_{Cure} \cdot LGC + IC + (1 - P_{Cure}) \cdot LGL\).

Unresolved cases can be treated in two ways:
Exclusion: the cases can be excluded from the \(P_{Cure}\) estimation.
Inclusion with expected outcome: we can include unresolved cases in the \(P_{Cure}\) estimation by assigning an expected outcome, or Conditional Cure Rate, to them.
If unresolved cases are excluded from the \(P_{Cure}\) estimation, it can happen that the parameter estimator will be biased. In other words, \(P_{Cure}\) will be estimated on a sample where clients are cured after a short time. Consequently, clients who would need more time to be cured would get a smaller value of \(P_{Cure}\) than they deserve. Since \(P_{Cure}\) tells us the probability that a client will be cured after default, and such a probability is not time dependent, treating unresolved cases by exclusion would be wrong. One approach within Rabobank is to treat unresolved cases by assigning them a value called the Conditional Cure Rate. The Conditional Cure Rate (CCR) can be estimated with a non-parametric technique. This thesis is about developing a new model able to eliminate the existing shortcomings of the currently employed model.

0.1 Background

CCR tells us the probability that a client will cure after a certain time point, conditioned on the event that the client is still unresolved at that point. Rabobank's current model estimates CCR with survival analysis, which is a branch of statistics specialized in the distribution of lifetimes. The lifetime is the time to the occurrence of an event. In our case the lifetime is the time between the default of the client and cure or liquidation. The current CCR is a combination of the Kaplan-Meier and Nelson-Aalen estimators, which are two of the most recognized and simple survival distribution estimators. The current CCR model has some shortcomings. The goal of the present research is to develop a new CCR model which is able to outperform the current model.

0.2 Research objective and approach

The objective of this thesis is to find new ways to estimate CCR that improve the CCR estimation. The new techniques will be compared with the currently used CCR estimator. The research goal of this thesis is: develop an alternative CCR model which is better than the current one. In order to reach this research goal, we will need to answer the first research question: what type of model is natural for the problem? To answer this question we first study the currently used model. Second, we get familiar with basic and more advanced concepts of survival analysis. Once this is done, we will be able to derive more advanced and probably better estimators. In particular, we will get some basic ideas about the techniques that can be used for CCR modeling. In order to find a better model than the one which is currently used, we have to know what better means. In other words, we need to be able to answer the second research question: what are the criteria for comparing the models?

1 Survival analysis

Survival analysis is a branch of statistics which is specialized in the distribution of lifetimes. The random variable which denotes a lifetime will be denoted by T. Survival analysis has its roots in medical studies, where one looks into the lifetime of patients or the time to cure of a disease. In our case the lifetime of interest is the time which a client spends in the unresolved state, and the events of interest are cure and liquidation. One of the most attractive features of survival analysis is that it can cope with censored observations. The estimator which is currently used for CCR estimation is a combination of two estimators of two different quantities which are specific to survival analysis. In order to understand and model CCR as it is currently modeled by Rabobank, we first have to know which quantities are modeled by Rabobank, how these quantities are estimated, and how to model them when a client can become resolved due to two reasons. Censoring is presented in Section 1.1. The most basic concepts of survival analysis and its specific quantities are introduced in Section 1.2. In Section 1.3 we look into basic estimators of survival analysis. In Section 1.4 the theory and estimators for the case where two outcomes are possible are presented, since our model assumes that a client can become resolved due to two reasons, cure and liquidation.

1.1 Censoring

When lifetime distributions are modeled with a non-survival-analysis approach, only observations where the event of interest took place are used. For simplicity we look into an example from medical studies. For instance, if a study on 10 patients has been made and only 5 of them died, while the other 5 are still alive, then in order to model the distribution of lifetimes only 5 observations can be used. Survival analysis is able to extract some information about the distribution also from the other patients. Patients which are still alive at the end of the study are called censored observations, and survival analysis is able to cope with censored data. A visualization of censored observations can be seen in Figure 1.1. The time variable on the x-axis represents the month of the study, while on the y-axis the number of each patient can be found. The line on the right side represents the end of the study. The circle represents the month of the study in which the patient got sick, while the arrow represents the death of a patient. It can be seen that the deaths of patients 6 and 4 are not observed, because the study ended before the event of interest happened. The event of interest cannot be seen for patient 2 either, because the patient left the study before the end of the study. Such phenomena are called censored observations. Despite the fact that the event of interest did not happen,

such observations tell us that the event of interest happened after the last time we saw the patient alive, and consequently they have an impact on the estimated distribution. The type of censoring where we know the time when the patient got sick, but we do not know the time of the event of interest, is called right censoring (Lawless 2002). For now we will assume that we have just one type of censoring, right censoring. In our case this means that a client which has defaulted is observed, but the observation period has finished before the client was cured or liquidated. To understand why such a phenomenon happens, it has to be taken into consideration that in Rabobank's data there are observations where a client is cured after 48 months (4 years). Consequently, if clients which have defaulted in the year 2016 are observed, it can happen that some of them are still unresolved, despite the fact that they will be cured or liquidated sometime in the future.

Figure 1.1: Censored observations and uncensored observation

1.2 Definitions

In order to use techniques from survival analysis, some quantities need to be defined which are going to be directly or indirectly modeled. For simplicity it will be assumed that a client can become resolved due to one reason only, cure. Later this assumption will be relaxed. All definitions and theory from this chapter can be found in any book about survival analysis. For a review of survival analysis see Lawless (2002), Kalbfleisch and Prentice (2002) and Klein and Moeschberger (2003).

Definition (Distribution function). Since T is a random variable it has its own distribution function and density function, which are denoted by F and f, with
\[ F(t) = P(T \le t) = \int_0^t f(s)\,ds. \]
The distribution function tells us the probability that the event of interest, cure, will happen before time t. In survival analysis we can also be interested in the probability that the client is still unresolved at time t.

Definition (Survival function). The survival function represents the probability that the event of interest did not happen up to time t and it is denoted by S, with
\[ S(t) = P(T > t) = 1 - F(t). \tag{1.1} \]
The survival function is a non-increasing, right-continuous function of t with S(0) = 1 and \(\lim_{t \to \infty} S(t) = 0\).

Since we can also be interested in the rate of events in a small time step after time t, we define the following quantity.

Definition (Hazard rate function). The hazard rate function tells us the rate of cures in an infinitely small step after t, conditioned on the event that the client is still unresolved at time t. The hazard rate function is denoted by λ, with
\[ \lambda(t) = \lim_{\Delta t \downarrow 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}. \]
It is important to note that the hazard rate function can take values bigger than 1. If \(\lambda(t)\) is much larger than 1, it means that the event will probably happen shortly after time t.

In order to interconnect all the defined quantities we introduce the cumulative hazard function, which is defined as
\[ \Lambda(t) = \int_0^t \lambda(s)\,ds. \]
Now we can look at how the hazard rate function and the survival function are connected.

We know that \(\{t \le T < t + \Delta t\} \subseteq \{T \ge t\}\). Consequently we get
\[
\lambda(t) = \lim_{\Delta t \downarrow 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}
= \lim_{\Delta t \downarrow 0} \frac{P(t \le T < t + \Delta t)}{\Delta t\, P(T \ge t)}
= \frac{f(t)}{S(t)}
= \frac{\frac{d}{dt}(1 - S(t))}{S(t)}
= \frac{-\frac{d}{dt} S(t)}{S(t)}
= -\frac{d}{dt} \log(S(t)). \tag{1.2}
\]
From the last equality it follows that
\[ S(t) = \exp(-\Lambda(t)). \tag{1.3} \]
We can see that the hazard rate function, the survival function, the density function and the distribution function each uniquely define the distribution of T.

1.3 Estimators

Until now we looked into quantities that define the distribution of T. In this chapter simple estimators of these quantities will be presented. Later those estimators will be used for the modeling of CCR. The assumption that a client can become resolved due to one reason only is still valid. In the data there are n defaults. The observed times will be denoted by \(t_i\), \(i \in \{1, 2, \ldots, n\}\). Each of those times can represent the time between default and cure or liquidation, or a censoring time. The variable \(\delta_i\) is called the censoring indicator and it takes value 1 if client i was not censored. Consequently, in cases where there is only one event of interest the data are given as \(\{(t_i, \delta_i) \mid i \in \{1, 2, \ldots, n\}\}\). Once the observed times are obtained, we order them in ascending order, \(t_{(1)} < t_{(2)} < \cdots < t_{(k)}\), \(k \le n\). With D(t) we denote the set of individuals that experienced the event of interest at time t, or \(D(t) = \{j \mid T_j = t,\ j \in \{1, 2, \ldots, n\}\}\). With \(d_i\) the size of \(D(t_{(i)})\) will be denoted. The set of individuals that are at risk at time t is denoted by N(t). The individuals in N(t) are the individuals that experienced the event of interest at time t or are still in our study at time t, or \(N(t) = \{j \mid T_j \ge t,\ j \in \{1, 2, \ldots, n\}\}\). With \(n_i\) we denote the size of \(N(t_{(i)})\). The most basic estimator of the survival function is known as the Kaplan-Meier estimator and it is also used in the current Rabobank model. The Kaplan-Meier estimator is given

by the following formula,
\[ \hat S(t) = \prod_{j:\, t_{(j)} \le t} \Big(1 - \frac{d_j}{n_j}\Big). \tag{1.4} \]
The estimator which is used for the estimation of the cumulative hazard rate in the current CCR model is called the Nelson-Aalen estimator and it is given by
\[ \hat\Lambda(t) = \sum_{j:\, t_{(j)} \le t} \frac{d_j}{n_j}. \tag{1.5} \]
Using the Nelson-Aalen estimator we can also model the hazard rate as a point process which takes value \(d_i/n_i\) at time \(t_{(i)}\).

Let us look into the following example. We have an observation which consists of 10 clients. The censoring indicators and observed times can be seen in Figure 1.2. The times which each client needs in order to be cured after default are observed. In order to compare traditional statistical techniques and survival techniques, the survival curve will be estimated with both the empirical distribution function and the Kaplan-Meier estimator.

Figure 1.2: Example data

The survival curve estimated with the Kaplan-Meier estimator is shown in Figure 1.3. In the figure it can be seen that jumps happen only at the times when an event of interest happens. Observed censoring times are denoted with small vertical lines; they have no influence on the jump times, but rather on the size of the jumps, since censored observations only influence the risk set, which is in the denominator of equation (1.4). The more censored observations occur before an observed event time, the smaller the risk set will be and the bigger the jump in the survival curve will be.

We will now calculate the survival curve with the empirical distribution function.

Figure 1.3: Survival curve estimated with the Kaplan-Meier estimator

The empirical distribution function is formulated as
\[ \hat F_n(t) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{T_i \le t}, \]
and we obtain the survival curve from the identity (1.1). Doing so we obtain an estimator for the survival function which is equal to
\[ \hat S_n(t) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{T_i > t}. \]
It is important to add that if we use the empirical distribution function we can only use observations where the event of interest did happen. In our case we can just take the observations where cure happened, i = 4, 6, 7. The resulting estimate of the survival curve based on the empirical distribution is shown in Figure 1.4. From Figure 1.4 it can be observed that bigger jumps occur in comparison with the Kaplan-Meier estimator. An intuitive explanation for this phenomenon is that censored observations also bring important information to the estimator. For instance, if we want to estimate the survival curve at time t and we only have censored observations which happened after time t, then the probability of an event happening before time t is probably low.

Figure 1.4: Survival curve estimated with the empirical distribution

At this point note that if we use the Kaplan-Meier estimator with non-censored observations only, and the empirical distribution for the estimation of the survival curve, we obtain the same results. If we have times \(t_1 < t_2 < \cdots < t_k\), and at time \(t_i\) exactly \(d_i\) events of interest happened, then the size of the risk set at time \(t_i\) will be \(n - d_1 - \cdots - d_{i-1}\). For \(t \in [t_i, t_{i+1})\) it follows that
\[
\hat S(t) = \Big(1 - \frac{d_1}{n}\Big)\Big(1 - \frac{d_2}{n - d_1}\Big) \cdots \Big(1 - \frac{d_i}{n - d_1 - \cdots - d_{i-1}}\Big)
= \frac{(n - d_1)(n - d_1 - d_2) \cdots (n - d_1 - \cdots - d_i)}{n(n - d_1)(n - d_1 - d_2) \cdots (n - d_1 - \cdots - d_{i-1})}
= \frac{n - d_1 - \cdots - d_i}{n}
= \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{t_i > t}. \tag{1.6}
\]
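The estimators of this section are simple enough to implement directly. The sketch below, written in Python for illustration only, computes the Kaplan-Meier estimate (1.4) and the survival estimate based on the empirical distribution so the two can be compared; the small data set is made up and is not the data of Figure 1.2, and in practice a packaged routine could be used instead.

import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier estimate (1.4): survival value after each observed event time.

    times  : observed times t_i (event or censoring time)
    events : censoring indicators delta_i (1 = event observed, 0 = censored)
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    surv, s = [], 1.0
    for t in np.unique(times[events == 1]):           # jumps only at event times
        n_j = np.sum(times >= t)                      # risk set size n_j
        d_j = np.sum((times == t) & (events == 1))    # number of events d_j
        s *= 1.0 - d_j / n_j
        surv.append((t, s))
    return surv

def ecdf_survival(times, events, t):
    """Survival estimate from the empirical distribution, using only the
    observations where the event of interest actually happened."""
    cured = np.asarray(times, dtype=float)[np.asarray(events) == 1]
    return np.mean(cured > t)

# Illustrative data for 10 clients (not the data of Figure 1.2).
t_obs = [1, 2, 2, 3, 3, 4, 4, 5, 5, 6]
delta = [0, 1, 0, 1, 0, 1, 0, 0, 0, 0]
print(kaplan_meier(t_obs, delta))
print(ecdf_survival(t_obs, delta, 3))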

1.4 Competing risk setting

Up to this point we assumed that clients can become resolved only by being cured, but in reality that is not the case. It is known that a client can become resolved due to two reasons, cure and liquidation. The modeling approach in the CCR model is that the client becomes resolved due to the reason which happens first. In other words, there are two types of events which have an influence on the survival function. If all liquidation events were modeled as censored, then the survival function would be overestimated. Since one of the building blocks of CCR is the survival function, CCR estimates where liquidation events are considered as censored would give us wrong results. Modeling of CCR in the setting where two outcomes are possible is known in survival analysis as the competing risk setting. In this chapter we will look into the relevant quantities in the competing risk setting. For a review of the competing risk model see M.-J. Zhang, X. Zhang, and Scheike (2008).

In order to model CCR(t) we have to introduce more notation. The variable \(e_i\) represents the reason due to which subject i failed. In our case \(e_i = 1\) if subject i was cured. The variable \(e_i\) takes value 2 if subject i was liquidated. Consequently, the data are given as \((T_i, \delta_i, e_i)\), \(i \in \{1, 2, \ldots, n\}\). Since a subject can fail due to two reasons, a cause specific hazard rate needs to be used. By \(D^i(t)\) we denote the set of individuals that failed due to reason i at time t, or \(D^i(t) = \{j \mid T_j = t,\ e_j = i,\ j \in \{1, 2, \ldots, n\}\}\). With \(d^i_j\) we denote the size of the set \(D^i(t_{(j)})\).

Definition (Cause specific hazard rate). The cause specific hazard rate is a function that tells us the rate of the event e = i in an infinitely small step after t, conditioned on the event that the subject did not fail up to time t. The cause specific hazard rate function is denoted by \(\lambda_i(t)\), with
\[ \lambda_i(t) = \lim_{\Delta t \downarrow 0} \frac{P(t \le T < t + \Delta t,\ e = i \mid T \ge t)}{\Delta t}. \]
In order to calculate the survival function in a competing risk setting we need to define the cause specific cumulative hazard function, which is given as \(\Lambda_i(t) = \int_0^t \lambda_i(s)\,ds\). Then the analog of Formula (1.3) in the competing risk setting is
\[ S(t) = \exp(-\Lambda_1(t) - \Lambda_2(t)). \tag{1.7} \]
It is important to add that when a subject can fail due to two reasons, the Kaplan-Meier estimator of the survival function takes the following form
\[ \hat S(t) = \prod_{j:\, t_{(j)} \le t} \Big(1 - \frac{d^1_j + d^2_j}{n_j}\Big). \]
The Nelson-Aalen estimator is the estimator of the cause specific cumulative hazard rate, and it is given as
\[ \hat\Lambda_i(t) = \sum_{j:\, t_{(j)} \le t} \frac{d^i_j}{n_j}. \tag{1.8} \]
The value \(d\hat\Lambda_i(t_j) = d^i_j / n_j\) tells us the probability that a subject will experience event i at time \(t_j\), conditioned on the event that he is still alive just before time \(t_j\). Once we obtain all the cause specific hazard rates we can calculate the probability of failing due to a specific reason.

Definition (Cumulative incidence function). The cumulative incidence function (CIF) tells us the probability of

failing due to cause i before time t, and it is denoted by \(F_i\), or \(F_i(t) = P(T \le t, e = i)\). The cumulative incidence function is expressed mathematically as
\[ F_i(t) = \int_0^t \lambda_i(s) S(s)\,ds = \int_0^t \lambda_i(s) \exp\Big(-\sum_{i=1}^2 \Lambda_i(s)\Big)\,ds. \tag{1.9} \]
Using the Nelson-Aalen and Kaplan-Meier estimators we can obtain a non-parametric estimator for the CIF, which is given as
\[ \hat F_i(t) = \sum_{j:\, t_j \le t} \hat S(t_{j-1})\, d\hat\Lambda_i(t_j). \tag{1.10} \]
An intuitive explanation of Formula (1.10) is that if we want the subject to fail due to reason i, it has to be unresolved up to time \(t_{j-1}\) and then at the next time instance fail due to reason i.

We will continue with the example from Section 1.3, but this time we will assume that some observations which were censored were actually liquidated. We will take all the steps needed in order to calculate \(\hat F_1(t)\) from equation (1.10), which will later be needed in order to estimate CCR.

Figure 1.5: Survival curve estimated with empirical distribution

In order to estimate the survival curve, the following steps need to be taken. It can be seen that no event of interest happens on the interval [0, 1); consequently for \(t \in [0, 1)\) it holds

that \(\hat S(t) = 1\). At time 1 one event of interest happens and the size of the risk set is 10. Since no events of interest happen until time 2, for \(t \in [1, 2)\) it holds that
\[ \hat S(t) = 1 \cdot \Big(1 - \frac{1}{10}\Big) = \frac{9}{10}. \]
At time 2 two events of interest happen and the size of the risk set is 9, so for \(t \in [2, 3)\) it holds
\[ \hat S(t) = 1 \cdot \Big(1 - \frac{1}{10}\Big)\Big(1 - \frac{2}{9}\Big) = \frac{7}{10}. \]
In a similar fashion we obtain the following values for \(\hat S(t)\):
\[
\hat S(t) =
\begin{cases}
\frac{7}{10}\big(1 - \frac{2}{7}\big) = \frac{1}{2} & t \in [3, 4) \\
\frac{1}{2}\big(1 - \frac{2}{5}\big) = \frac{3}{10} & t \in [4, 5) \\
\frac{3}{10}\big(1 - \frac{0}{3}\big) = \frac{3}{10} & t \in [5, 6) \\
\frac{3}{10}\big(1 - \frac{0}{1}\big) = \frac{3}{10} & t \in [6, \infty).
\end{cases}
\]

Figure 1.6: Survival curve in competing risk setting

The Kaplan-Meier curve for the competing risk setting is shown in Figure 1.6. Compared with Figure 1.3 it can be seen that it has more jumps, which makes sense, since we have two events of interest. At the same time we can see that if we consider liquidated observations as censored, the survival curve will overestimate the survival probability.

Consequently, the CCR will be modeled in a competing risk setting. In order to calculate an estimator of \(F_1(t)\) we need to calculate \(d\hat\Lambda_1(t_j)\). The value of \(d\hat\Lambda_1(1)\) is simply the number of individuals that were cured at time 1 divided by the number of all individuals that are at risk at time 1, which here gives \(d\hat\Lambda_1(1) = 0\). In a similar fashion we obtain the other values of \(d\hat\Lambda_1(t_i)\); for instance \(d\hat\Lambda_1(2) = \frac{1}{9}\), and \(d\hat\Lambda_1(t) = 0\) at the times where no cures are observed.

Figure 1.7: Cause specific hazard rate for cure estimated with \(d\hat\Lambda_1\)

Now we can finally calculate \(\hat F_1(t)\). Since we do not have any cures in the interval [0, 1), it holds that \(\hat F_1(t) = 0\) for \(t \in [0, 1)\). For \(t \in [1, 2)\) it holds
\[ \hat F_1(t) = \hat S(0)\, d\hat\Lambda_1(1) = 1 \cdot 0 = 0. \]

For \(t \in [2, 3)\) it holds that
\[ \hat F_1(t) = \hat S(0)\, d\hat\Lambda_1(1) + \hat S(1)\, d\hat\Lambda_1(2) = \hat F_1(1) + \hat S(1)\, d\hat\Lambda_1(2) = 0 + \frac{9}{10} \cdot \frac{1}{9} = \frac{1}{10}. \tag{1.11} \]
In a similar fashion we obtain the values of \(\hat F_1(t)\) for the other t: the estimate stays equal to \(\hat F_1(2)\) on [3, 4), takes the value \(\hat F_1(3) + \hat S(3)\, d\hat\Lambda_1(4)\) on [4, 5), and remains constant at \(\hat F_1(4)\) on [5, 6) and at \(\hat F_1(5)\) on \([6, \infty)\).

Figure 1.8: Estimated cumulative incidence function
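The quantities used in this example, the pooled Kaplan-Meier survival function, the cause specific hazard increments \(d\hat\Lambda_i(t_j)\) of (1.8) and the cumulative incidence estimator (1.10), can be computed together in a few lines of Python. The sketch below is illustrative only; the event coding (0 = censored, 1 = cure, 2 = liquidation) and the small data set are assumptions made for the example, not Rabobank's data.

import numpy as np

def cumulative_incidence(times, events, cause=1):
    """Non-parametric CIF estimate (1.10) for one cause in a competing risk setting."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    grid = np.unique(times[events > 0])           # times at which any event occurs
    surv_prev = 1.0                               # running value of S_hat(t_{j-1})
    cif, out = 0.0, []
    for t in grid:
        n_j = np.sum(times >= t)                              # risk set size
        d_cause = np.sum((times == t) & (events == cause))    # events of the given cause
        d_all = np.sum((times == t) & (events > 0))           # events of any cause
        cif += surv_prev * d_cause / n_j          # S_hat(t_{j-1}) * dLambda_hat_i(t_j)
        surv_prev *= 1.0 - d_all / n_j            # update the pooled survival estimate
        out.append((t, cif))
    return out

# Illustrative data: 10 clients, 0 = censored, 1 = cured, 2 = liquidated.
t_obs = [1, 2, 2, 3, 3, 4, 4, 5, 5, 6]
e_obs = [2, 1, 2, 1, 2, 1, 2, 0, 0, 0]
print(cumulative_incidence(t_obs, e_obs, cause=1))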

2 Current CCR Model

The CCR at time t tells us the probability that a client will be cured after time t, conditioned on the event that he is still unresolved at time t. The unconditional probability of being cured after time t can be expressed as \(F_1(\infty) - F_1(t)\). It is important to understand that some cases will never be cured. Consequently the probability of cure is not necessarily equal to 1 as \(t \to \infty\), i.e. \(F_1(\infty)\) can be smaller than 1. Since CCR is conditioned on the event of being unresolved up to time t, it follows that
\[ CCR(t) = \frac{F_1(\infty) - F_1(t)}{S(t)}. \tag{2.1} \]
Using the estimator (1.10) we can estimate the probability of curing after time \(t \in [t_i, t_{i+1})\) as
\[ \hat F_1(\infty) - \hat F_1(t_i). \tag{2.2} \]
We assume that the probability of cure after the 58th month is equal to zero. Consequently it holds that \(\hat F_1(\infty) = \hat F_1(58)\) and \(\hat{CCR}(58) = 0\). Another assumption is that every case will be resolved as \(t \to \infty\), which corresponds to requiring of the survival function that \(S(\infty) = 0\). Consequently, we will consider every case which has an observed time bigger than 58 as liquidated. Since CCR is conditioned on the event of being unresolved up to time \(t \in [t_i, t_{i+1})\), the estimator of CCR takes the following form
\[ \hat{CCR}(t) = \frac{\hat F_1(58) - \hat F_1(t_i)}{\hat S(t_i)} = \frac{d^1_{i+1}}{n_{i+1}} + \Big(1 - \frac{d^1_{i+1} + d^2_{i+1}}{n_{i+1}}\Big)\,\hat{CCR}(t_{i+1}). \tag{2.3} \]
If we define \(\hat F_1(0) = 0\), since the probability of being cured before time zero is equal to zero, then \(\hat{CCR}(0)\) is equal to \(\hat F_1(58)\). This gives us the probability of being cured, i.e. the value of \(P_{Cure}\) in Figure 0.1.

2.1 Model implementation

In order to estimate CCR for each time point, Rabobank uses data which consist of mortgage defaults from the bank Lokaal Bank Bedrijf (LBB). Each default observation consists of the time the client spent in default and the status after the last month in default, which can be equal to cured, liquidated or unresolved. In the data we can also find the following variables:
High LTV indicator,

NHG indicator,
Bridge loan indicator.

The variable LTV represents the Loan To Value ratio and it is calculated with the following formula,
\[ LTV = \frac{\text{Mortgage amount}}{\text{Appraised value of the property}}. \]
If the LTV is higher than 100%, it means that the value of the mortgage which was not paid back by the client is larger than the value of the security. The indicator High LTV takes value 1 if the LTV is high. NHG is an abbreviation for the National Mortgage Guarantee, or in Dutch Nationale Hypotheek Garantie. If a client with an NHG-backed mortgage cannot pay the mortgage due to specific circumstances, NHG will provide support to the bank and the client. If the client sells the house under the price of the mortgage, NHG will cover the difference, and consequently neither the client nor the bank will suffer the loss. Since both sides get support in case of liquidation, we can expect that the cause specific hazard rate for liquidation will be higher, the cause specific hazard rate for cure lower, and consequently the CCR lower. Bridge loans are short-term loans that last between 2 weeks and 3 years. A client usually uses them until he finds longer-term and larger financing. This kind of loan provides an immediate cash flow to the client at a relatively high interest rate. Such loans are usually riskier for a bank and consequently have higher interest rates.

2.1.1 Stratification

It can be seen that Rabobank has data from different clients with different variables, but it uses an estimator for CCR which is unable to incorporate those variables. Rabobank solves this problem with a method called stratification or segmentation. This method separates the original data frame into smaller data frames based on variables and then calculates CCR on each one of them. For instance, if segmentation is based on the variable called Bridge loan indicator, the original data frame will be separated into two data frames. In the first data frame only clients with bridge loans can be found, and in the other clients with non-bridge loans. Once this step is made, CCR is calculated for each segment and two CCR estimates are obtained, one for clients with and the other for clients without a bridge loan. Rabobank separates the original data frame into four buckets, as can be seen in Figure 2.1. The segment with the most observations is Non-Bridge-Low LTV, which has almost all observations from the original data. It is followed by the segment Non-Bridge-High LTV-non-NHG, which has about ten times fewer observations than the segment with Low LTV. The smallest segments are the segments with Bridge loans and Non-Bridge-High LTV-NHG, which have about 600 observations.
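A minimal sketch of this stratified computation is given below: the portfolio is split into segments and the backward recursion (2.3) is evaluated per segment on a monthly grid, with \(\hat{CCR}(58) = 0\). The column names, segment labels and the small illustrative data frame are assumptions made for the example and do not correspond to the LBB data.

import numpy as np
import pandas as pd

HORIZON = 58  # cures after month 58 are assumed not to occur

def ccr_curve(months, status, horizon=HORIZON):
    """Non-parametric CCR per month via the backward recursion (2.3).

    months : months spent in default (observed time)
    status : 0 = unresolved, 1 = cured, 2 = liquidated
    """
    months = np.asarray(months)
    status = np.asarray(status)
    ccr = np.zeros(horizon + 1)                  # ccr[58] = 0 by assumption
    for i in range(horizon - 1, -1, -1):
        at_risk = np.sum(months >= i + 1)                      # n_{i+1}
        if at_risk == 0:                                       # empty risk set: carry value over
            ccr[i] = ccr[i + 1]
            continue
        cures = np.sum((months == i + 1) & (status == 1))      # d^1_{i+1}
        resolved = np.sum((months == i + 1) & (status > 0))    # d^1_{i+1} + d^2_{i+1}
        ccr[i] = cures / at_risk + (1 - resolved / at_risk) * ccr[i + 1]
    return ccr

# Illustrative portfolio; observed times beyond the horizon would be treated as liquidated.
df = pd.DataFrame({
    "months_in_default": [3, 12, 25, 40, 7, 58, 16, 30],
    "status":            [1,  2,  0,  1, 1,  0,  2,  0],
    "segment": ["Bridge", "Low LTV", "Low LTV", "High LTV-NHG",
                "Low LTV", "Bridge", "High LTV-non-NHG", "Low LTV"],
})
curves = {seg: ccr_curve(g["months_in_default"], g["status"])
          for seg, g in df.groupby("segment")}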

22 Figure 2.1: Segmentation of original data frame 2.2 Results In this chapter the estimation of survival quantities, which are needed in order to model CCR, will be looked into. In order to estimate the nonparametric cumulative incidence function we need the cause specific hazard rate for cure, which follows from equation (1.10). Hazard rates estimated with dˆλ(t) estimator are point processes as can be seen in Figure 2.2. The two biggest segments behave regularly, while segments with less observations have irregular behavior with jumps at the end of the period. Big jumps from value zero to above 0.02 are occurring because a small number of individuals is at risk. For instance, at time 53 in the segment high LTV NHG cause specific hazard rate takes value At that time only one cure happens, but there are 34 individuals at risk. The next step is the estimation of the survival function with the Kaplan-Meier estimator. Results can be found on Figure 2.3. From the figure it is visible that the segment which becomes resolved at the smallest rate is the segment which represents the clients with high LTV and without NHG, while clients with bridge loans become resolved at the highest rate. Once we obtain the survival functions and cause specific hazard rates for cure we can model the cause specific incidence functions for cure, which can be found in Figure 2.4. Results of the nonparametric estimator can be found in Figure 2.5. Curves are the most irregular, have the most jumps, for the segments where we have the least observations. The explanation for this phenomenon can be found in the recursive part of equation (2.3). It is seen that the big jumps happen because the cause specific hazard rates for cure are irregular. From the Figure 2.5 it can be seen that the clients with low LTV have the highest CCR 22

23 Figure 2.2: Cause specific hazard rate for cure estimated with dˆλ(t) 1 Figure 2.3: Survival function estimated with Kaplan-Meier estimator estimates. Clients without NHG have larger CCR estimates than clients with, since bank and clients are more motivated into curing their defaults. From the figure is it not completely clear if the clients with bridge loans or clients with high LTV and without NHG have higher CCR estimates. 2.3 Performance of the method In this chapter we will look into the variance, bias and confidence intervals of the nonparametric CCR estimator. Since the derivation of the asymptotic variance of the CCR estimator is out of scope of this thesis, these quantities will be estimated with the method known as bootstrap. 23

Figure 2.4: Survival functions estimated with Kaplan-Meier estimator
Figure 2.5: CCR estimated with non-parametric estimators

2.3.1 The bootstrap

The bootstrap was introduced by Efron in 1979. The method is used to estimate the bias, variance and confidence intervals of estimators by resampling. In this chapter we will look at how this method is used for the estimation of bias, variance and confidence intervals. Later the method will be applied to the mortgage data in order to estimate the previously mentioned quantities for each segment. For a review of bootstrap estimators and techniques see Fox (2016). In order to bootstrap, data frames need to be sampled from the original data frame. These data frames are created by a selection of random default observations from the original data frame. Let us assume that n random data frames will be simulated. The simulated data frames need to be of the same size as the original data frame, and it is allowed that a simulated data frame contains the same observation more than once. Once the simulated data frames are obtained, segmentation is done as in Figure 2.1. Finally, \(CCR^i(t)\), \(i \in \{1, 2, \ldots, n\}\), estimates are calculated for each segment

and each time point, as can be seen in Figure 2.6.

Figure 2.6: Estimation of quantities with bootstrap.

Once the estimates \(CCR^i(t)\) for each data frame are obtained, \(\overline{CCR}(t)\), \(t \in \{0, 1, \ldots, 58\}\), for each segment can be calculated as
\[ \overline{CCR}(t) = \frac{1}{n} \sum_{i=1}^n CCR^i(t). \]
The bootstrap estimate of the variance of \(\hat{CCR}(t)\) is equal to
\[ \frac{1}{n-1} \sum_{i=1}^n \big(CCR^i(t) - \overline{CCR}(t)\big)^2. \]
An estimator \(\hat\theta\) is called unbiased if we have \(E(\hat\theta) = \theta\), and if it is biased the bias of the estimator is defined as \(B_\theta = E(\hat\theta) - \theta\). Bias of estimators is undesired. With the bootstrap the bias can be estimated by \(\overline{CCR}(t) - \hat{CCR}(t)\), where \(\hat{CCR}(t)\) is the estimate of CCR from the original data frame. For the estimation of confidence intervals a method called the bootstrap percentile confidence interval will be used. In order to obtain a \(100(1-\alpha)\%\) interval for fixed time t, we take \(\hat{CCR}(t)_{\frac{\alpha}{2},L}\), which denotes the value below which a fraction \(\frac{\alpha}{2}\) of the \(CCR^i(t)\) lies. In a similar fashion, \(\hat{CCR}(t)_{\frac{\alpha}{2},R}\) denotes the value such that a fraction \(\frac{\alpha}{2}\) of the \(CCR^i(t)\)

lies above it. For a \(100(1-\alpha)\%\) confidence interval the following interval is taken: \([\hat{CCR}(t)_{\frac{\alpha}{2},L},\ \hat{CCR}(t)_{\frac{\alpha}{2},R}]\). Rabobank decides whether it makes sense to make a segmentation or not based on the size of the confidence intervals. If two curves lie in each other's 95%-confidence intervals, then the CCR curves are pragmatically treated as the same. If we look into the CCR curves for each segment, as can be seen in Figure 2.7, we can see that the segments LBB-Low LTV and LBB-High LTV-non-NHG are treated as significantly different, while LBB-Low LTV and Bridge are not. Furthermore, segments with a bigger population size have narrow confidence intervals, while segments with a small population size have wide confidence intervals. It follows that nonparametric CCR estimators are not a good choice when CCR has to be estimated on a population of small size.

Figure 2.7: Estimation of confidence intervals with the bootstrap.

From Figure 2.8 it can be seen that the biggest variance is obtained by the segments with the smallest population and that the variance grows with time. This happens because of the variability of the term \(d^1_{i+1}/n_{i+1}\) in the recursive part of equation (2.3), as can be seen from Figure 2.2. Since there are no cures at the end of the observation period in the Bridge and LBB-High LTV-NHG segments, \(CCR^i(t)\) always takes the value zero there. The same holds for \(\hat{CCR}(t)\). Consequently, the variance estimated with the bootstrap is equal to zero for the CCR estimates at the end of the observation period. In Figure 2.9 the bias of the nonparametric CCR estimation can be found.

Figure 2.8: Estimation of variance with bootstrap.

If we compare the bias with the size of the CCR estimates from Figure 2.5, it can be concluded that the bias is more than 100 times smaller than the size of \(\hat{CCR}\), and that the estimator is not problematically biased.

Figure 2.9: Estimation of bias with the bootstrap.
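A sketch of how this bootstrap could be coded is given below. It reuses the hypothetical ccr_curve helper from the earlier example, and the number of resamples, the seed and the confidence level are arbitrary choices; the percentile interval corresponds to the method described above.

import numpy as np

def bootstrap_ccr(segment_df, n_boot=1000, alpha=0.05, seed=0):
    """Bootstrap mean, variance, percentile CI and bias of one segment's CCR curve."""
    rng = np.random.default_rng(seed)
    curves = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(segment_df), size=len(segment_df))  # resample with replacement
        sample = segment_df.iloc[idx]
        curves.append(ccr_curve(sample["months_in_default"], sample["status"]))
    curves = np.vstack(curves)                    # shape (n_boot, horizon + 1)
    mean = curves.mean(axis=0)                    # bootstrap mean of CCR(t)
    var = curves.var(axis=0, ddof=1)              # bootstrap variance
    lo, hi = np.percentile(curves, [100 * alpha / 2, 100 * (1 - alpha / 2)], axis=0)
    original = ccr_curve(segment_df["months_in_default"], segment_df["status"])
    bias = mean - original                        # bootstrap bias estimate
    return mean, var, (lo, hi), bias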

3 Cox proportional hazards model

Until now we have been looking only into nonparametric estimators of the survival and hazard rate functions, which are unable to incorporate additional variables. Rabobank uses a method called stratification in order to estimate the CCR curve of individuals with different covariates, and that method has certain shortcomings. Since a big part of the data consists of unresolved (censored) cases, we will look for a suitable regression method from survival analysis. At the beginning of this chapter we review the theory behind the so-called Cox model. In Section 3.1 we will see how to estimate the coefficients of the explanatory variables and the baseline function. In Section 3.2 the formula for the cumulative incidence function will be derived. In Section 3.3 it will be explained how to estimate the coefficients in order to get results comparable with segmentation. In Section 3.4 we will look into the quantities estimated with the Cox model that are needed in order to get CCR estimates. In Section 3.5 the performance of the method will be analyzed with the bootstrap. In the last section of this chapter the method will be compared with the non-parametric estimator.

In order to model CCR(t) we need a model which is able to model the cause-specific hazard rate. Since we expect different behavior from clients with different covariates, we will look into one of the most popular regression models in survival analysis. The Cox proportional hazards model was presented in 1972 by Sir David Cox. For a review of the Cox model see Kalbfleisch and Prentice (2002), Lawless (2002) and Weng (2007). A hazard rate modeled with the Cox model takes the following form
\[ \lambda(t \mid X) = \lambda_0(t) \exp(\beta^T X), \tag{3.1} \]
where \(\beta = (\beta_1, \beta_2, \ldots, \beta_p)\) is a vector of coefficients which represents the influence of the covariates \(X = (X_1, \ldots, X_p)\) on the hazard rate function \(\lambda(t \mid X)\), which depends on X. We denote the covariates of individual i by \(X_i\). The baseline hazard function is denoted by \(\lambda_0(t)\) and it can take any form, Weng (2007). The baseline hazard function can be interpreted as the hazard rate of an individual whose covariate values are equal to zero. In a similar way the baseline survival function can be defined as
\[ S_0(t) = \exp\Big(-\int_0^t \lambda_0(s)\,ds\Big), \tag{3.2} \]
according to equation (1.3). The survival function of individual j then takes the following form (Corrente, Chalita, and Moreira 2003):
\[ S(t \mid X_j) = [S_0(t)]^{\exp(\beta^T X_j)}. \tag{3.3} \]

Since the baseline function can take any form, the Cox model is a semi-parametric estimator of the hazard rate. In a similar way the cause specific hazard rate for cause i,
\[ \lambda_i(t \mid X) = \lambda_i^0(t) \exp(\beta_i^T X), \quad i = 1, 2, \]
can be modeled. Here \(\beta_i\) represents the effect of the covariate vector X on the cause specific hazard rate. The function \(\lambda_i^0(t)\) represents a cause specific baseline function. In the following subsections we will look at how to estimate the parameters in the Cox model.

3.1 Parameter estimation

Since the proportional hazards model is a semi-parametric model, the β coefficients and the baseline hazard function need to be estimated. In Section 3.1.1 we will look into the partial likelihood and how to estimate the coefficients. In Section 3.1.2 we will introduce the Breslow estimator of the baseline function. For a review of baseline and β estimation see Weng (2007).

3.1.1 Estimation of β

Firstly, it will be assumed that the data consist of n individuals and that each individual has a different observation time \(t_i\), i.e. there are no ties in the data. These observation times are ordered in ascending order, so \(t_1 < t_2 < \cdots < t_n\). In 1972 Cox proposed to estimate β using the partial likelihood, Weng (2007). The partial likelihood of individual i, \(L_i\), is simply the hazard rate of individual i divided by the sum of the hazard rates of all individuals that are at risk at time \(t_i\), or, for \(i \in N(t_i)\),
\[ L_i = \frac{\lambda(t_i \mid X_i)}{\sum_{j \in N(t_i)} \lambda(t_i \mid X_j)} \tag{3.4} \]
\[ = \frac{\exp(\beta^T X_i)}{\sum_{j \in N(t_i)} \exp(\beta^T X_j)}. \tag{3.5} \]
The partial likelihood of individuals that were censored is equal to 1. It follows that the partial likelihood function of the data we have is equal to
\[ PL(\beta) = \prod_{i=1}^n L_i = \prod_{i:\,\delta_i = 1} \frac{\exp(\beta^T X_i)}{\sum_{j \in N(t_i)} \exp(\beta^T X_j)}. \tag{3.7} \]
Instead of maximizing PL we maximize \(\log(PL) = pl\),
\[ pl(\beta) = \sum_{i:\,\delta_i = 1} \Big( \beta^T X_i - \ln\Big(\sum_{j \in N(t_i)} \exp(\beta^T X_j)\Big) \Big). \]
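To make the objective concrete, the sketch below maximizes the log partial likelihood pl(β) with a generic optimizer on made-up data with a single covariate and no ties. This is purely illustrative; the thesis itself relies on the Statsmodels implementation mentioned later, and the data and starting value here are assumptions.

import numpy as np
from scipy.optimize import minimize

def neg_log_partial_likelihood(beta, times, events, X):
    """Negative log partial likelihood pl(beta), assuming no tied event times."""
    beta = np.atleast_1d(beta)
    eta = X @ beta                                    # linear predictors beta^T X_j
    pl = 0.0
    for i in np.where(events == 1)[0]:                # sum over uncensored individuals
        risk_set = times >= times[i]                  # N(t_i): individuals still at risk
        pl += eta[i] - np.log(np.sum(np.exp(eta[risk_set])))
    return -pl

# Made-up data: 6 individuals, one binary covariate.
t_obs = np.array([5.0, 8.0, 3.0, 12.0, 7.0, 2.0])
delta = np.array([1, 0, 1, 1, 0, 1])
X = np.array([[1.0], [0.0], [1.0], [0.0], [1.0], [0.0]])

fit = minimize(neg_log_partial_likelihood, x0=np.zeros(1), args=(t_obs, delta, X))
print(fit.x)   # maximum partial likelihood estimate of beta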

30 3.1.2 Baseline hazard estimation Breslow has proposed the following estimator for the baseline hazard function. We still assume that we do not have ties in our data, or t 1 < t 2 <... t n. Breslow proposed an estimator, which is constant on subintervals where no event happened, Weng (2007). Rabobank s data shows if an event of interest happened or not at the end of each month. That means that if the observation has observed time t i, that customer was cured, liquidated or censored in the interval (t i 1, t i ]. In the case of event of interest, the Breslow baseline function will be constant on the interval (t i 1, t i ], where it will take the value λ 0 i As in equation (1.2) and by using the equality f(t) = λ(t)s(t), we get that n L(β, λ 0 (t)) λ(t i X i ) δ i S(t i X i ). Using the equation (1.3) we get that L(β, λ 0 (t)) = i=1 k (λ 0 (t) exp(βx i )) δ i exp( i=1 If we take t i + 1 = t i+1, so equidistant t i, we get that Taking the logarithm of L gives us ti 0 λ 0 (s)ds = i λ 0 j. j=1 ti 0 λ 0 (s) exp(βx i )ds. l(β, λ 0 (t)) = k δ i (ln(λ 0 i ) + βx i ) k i=1 i=1 λ 0 i exp(βx j ). (3.8) j d i Once we obtain ˆβ from partial likelihood, we insert it into l. From the second term in (3.8) we see that only the λ i with δ i = 1 give a positive value to l. Consequently, we take λ 0 i = 0 for i / {t 1 δ i, t 2 δ 2,..., t k δ k }. Going through the steps above we get l(λ 0 (t)) = δ i (ln(λ 0 i ) + ˆβX i ) exp( ˆβX j ). δ i =1 δ i =1 j n i Differentiation with respect to λ 0 i gives us that l(λ0 (t)) for t (t i 1, t i ] is maximized by λ 0 (t i ) = λ 0 t i = 1 j N(t i ) exp( ˆβX j ), Weng (2007). It is known that in continuous time it is impossible to have two individuals with the same observed time, but in reality it will most likely happen that some individuals have the same observed time, since we usually check the state of individuals on 30

31 monthly intervals. Consequently, many individuals will have the same observed times. It follows that a different estimator for a baseline function and a different partial likelihood function has to be used. In 1974 Breslow proposed the following partial likelihood or L(β) = k i=1 exp(βx + i ) ( j N(t i ) exp(βx j)) d i, where X i + = j N(t (i) ) X j. Using the same methodology as when there are no ties in the data we get that the Breslow baseline for ties in the data is equal to λ 0 (t i ) = d i j n i exp( ˆβX j ). (3.9) 3.2 Estimators From Chapter 2 we know that we have to be able to model cause specific hazard rates, survival function and cumulative incidence function in order to estimate CCR. For estimation of the survival function the identity (1.7) will be used. In order to model the cumulative incidence function we have to integrate equation (1.9), which is possible by using the fact that the baseline hazard rate is a step function. For t (t i 1, t i ] we get F 1 (t) = = t 0 t 0 λ 1 (s)s(s)ds λ 1 (s) exp( = F 1 (t i 1 ) + t s 0 (λ 1 (u) + λ 2 (u))du)ds t i 1 λ 1 (s) exp( s 0 (λ 1 (u) + λ 2 (u))du)ds (3.10) Firstly let us look into the integral s 0 (λ 1(u) + λ 2 (u))du. We know that λ i (u) is a step function, which takes value λ i t i on the interval (t i 1, t i ]. For s (t i 1, t i ] it follows s 0 i 1 (λ 1 (u) + λ 2 (u))du = (λ 1 t j + λ 2 t j ) + (s t i 1 )(λ 1 t i + λ 2 t i ) j=1 =Λ 1 (t j 1 ) + Λ 2 (t j 1 ) + (s t i 1 )(λ 1 t i + λ 2 t i ). (3.11) 31

32 From equations (3.10) and (3.11) it follows that F 1 (t) = F 1 (t i 1 ) + λ 1 t i t = F 1 (t i 1 ) + = F 1 (t i 1 ) + t i 1 exp( (Λ 1 (t j 1 ) + Λ 2 (t j 1 ) + (s t i 1 )(λ 1 t i + λ 2 t i ))) λ 1 t i exp(λ 1 (t j 1 ) + Λ 2 (t j 1 )) t t i 1 exp((t i 1 s)(λ 1 t i + λ 2 t i )))ds λ 1 t i exp(λ 1 (t j 1 ) + Λ 2 (t j 1 ))(λ 1 t i + λ 2 t i ) ( exp((t i 1 s)(λ 1 t i + λ 2 t i )) t t i 1 ) = F 1 (t i 1 ) + λ1 t i (1 exp((t i 1 t)(λ 1 t i + λ 2 t i )) exp(λ 1 (t j 1 ) + Λ 2 (t j 1 ))(λ 1 t i + λ 2 t i ). (3.12) Let us continue the example from Figure 1.5 and assume that every individual also has a variable LTV. We will define a new variable HighLTV, which takes value 1, if LTV is high as it is shown in the Figure 3.1. In this example the variable HighLTV will be included in the regression. Figure 3.1: Example data In order to estimate the coefficient β 1, which explains the influence of the variable HighLTV on the cause specific hazard rate for cure, we model cured individuals as individuals who experienced the event of interest. Other individuals are considered as censored. In the same way the coefficient for the cause specific hazard rate for liquidation, β 2, is estimated. For the estimation of the parameters β 1 and β 2 we used the Python Statsmodels package. This gave the following result β 1 =

33 and for liquidation, β 2 = Since no cures occurred in the interval [0, 1) the Breslow estimator for the cause specific baseline function for cure gives us the following value for t (0, 1] λ 0 1(t) = 0 j {1,2,3,4,5,6,7,8,9,10} exp(β 1X i ) = 0. Since one event happened on the interval (1, 2], the cause specific baseline function for t (1, 2] is equal to λ 0 1(t) = 1 j {1,2,4,5,6,7,8,9,10} exp(β 1X i ) = In a similar fashion we get λ 0 1(t) = { = t (2, 3] 0 t (3, ]. For the baseline function for cure we get the following values t (0, 1] t (1, 2] λ 0 2(t) = 0 t (2, 3] t (3, 4] 0 t (4, ). Once the cause specific baseline hazard rates for cure are calculated, we can multiply them by exp(β 1 HighLTV), HighLTV {0, 1} in order to get cause specific hazard rates for cure of individuals with LTV higher and lower than 1.2. For the baseline function for cure we get the following values 0 t (0, 1] t (1, 2] λ 1 (t 1) = t (2, 3] Since exp(β 1 0) = 1 the following identity holds 0 t (3, 4] 0 t (4, ) λ 1 (t 0) = λ 0 1(t). By taking the same steps as above the cause specific hazard rate for loss can be calculated and the following results are obtained: 33

34 c Figure 3.2: Cause specific hazard rate for cure and t (0, 1] t (1, 2] λ 0 2(t 1) = 0 t (2, 3] t (3, 4] 0 t (4, ) λ 1 (t 0) = λ 0 1(t). The hazard rate for liquidation for HighLT V = 1 is almost equal to zero, because of the size of β 2. In order to calculate the survival function equation (1.7) will be used. Since the time steps are of length 1 and the cause specific hazard rate is a step function, the following identity holds for t (t k 1, t k ] Λ i (t X) = Λ i (t k 1 X) + (t t k )λ i (t k X). The identity above gives us the following functions of t for cause specific cumulative hazard functions for cure 0 t (0, 1] (t 1) t (1, 2] Λ 1 (t 0) = (t 2) t (2, 3] t (3, ) 34

35 Figure 3.3: Cause specific cumulative function for cure Figure 3.4: Cause specific cumulative hazard rate function for liquidation and for HighLTV=1, 0 t (0, 1] (t 1) t (1, 2] Λ 1 (t 1) = (t 2) t (2, 3] t (3, ). 35

36 The cause specific cumulative hazard functions for loss are t t (0, 1] and for HighLTV=1, (t 1) t (1, 2] Λ 2 (t 0) = t (2, 3] (t 3) t (3, 4] t (4, ) t t (0, 1] (t 1) t (1, 2] Λ 0 2(t 1) = t (2, 3] (t 3) t (3, 4] t (4, ). The cause specific cumulative hazard functions for cure and loss can be seen in Figures 3.3 and 3.4. After the cause specific cumulative functions for cure and liquidation are determined, the identity (1.7) can be used for the estimation of the survival function. We get exp( t 0.103) t (0, 1] exp( ( (t 1) ( )) t (1, 2] S(t 0) = exp( ( (t 2) 0.268)) t (2, 3] exp( ( (t 3) 0.408) t (3, 4] exp( 1.997) t (4, ) and for HighLTV=1 we get exp( t 0.093) t (0, 1] exp( ( (t 1) ( )) t (1, 2] S(t 1) = exp( ( (t 2) 0.329)) t (2, 3] exp( ( (t 3) 0.366)) t (3, 4] exp( 1.018) t (4, ). The survival curve estimates can be seen in Figure 3.5. After calculation of survival curves and hazard rates are calculated we can estimate cumulative incidence functions using equation (3.12). The calculated cumulative incidence functions can be seen in Figure

37 Figure 3.5: Survival function calculated with the Cox model Figure 3.6: Cumulative incidence function calculated with the Cox proportional model 3.3 Model implementation For the estimation of CCR, using the Cox model, the same data will be used as in the previous chapter. Rabobank is still interested in the behavior of clients, which have Bridge loans, clients with non-bridge loans and low LTV, clients with non-bridge loans, which have high LTV and do not have NHG and clients, which have non-bridge mortgages with high LTV and do have NHG. Firstly it needs to be understood that the coefficient estimates of one risk factor depends on all variables that are included into the regression. For instance, if we compare the parameters, when Cox regression is performed with variable HighLTV only, to ˆβ hltv HighLTV 37

38 ˆβ hltv,br HighLTV, when HighLTV and variable Bridge are included in the regression, it can be ˆβ hltv hltv,br HighLTV ˆβ HighLTV. seen that Secondly, if one is interested in the behavior of clients with bridge loans then regression only with the variable Bridge needs to be made. If we would include variable highltv as well, than we would not be able to calculate a CCR curve for Bridge loans. However we would be able to calculate only CCR curves for clients which have Bridge loans and high LTV or CCR for clients which have bridge loans and low LTV, since X highltv can take only values 0 and 1. It follows that Cox regression needs to be done three times with three different variable combinations as can be seen in Figure 3.7. Once the coefficients are obtained for hazard rate calculation of the desired segment, a Figure 3.7: Input variables and output coefficients matching covariate needs to be used as can be observed from Figure 3.8. Figure 3.8: Output coefficients multiplied with covariates 38
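Putting the pieces of this chapter together, a sketch of the per-segment computation could look as follows: the Breslow baseline hazards (3.9) are computed on a monthly grid for cure and liquidation separately, combined with the fitted coefficients into cause specific hazards for a given covariate profile, integrated into the cumulative incidence function via (3.12), and finally turned into CCR(t) through (2.1). The coefficient vectors beta_cure and beta_liq are assumed to come from the two cause specific Cox fits; all names below are illustrative, not the implementation used by Rabobank.

import numpy as np

def breslow_baseline(months, status, X, beta, cause, horizon=58):
    """Piecewise-constant baseline hazard on (i-1, i], i = 1..horizon, eq. (3.9)."""
    lam0 = np.zeros(horizon + 1)
    risk_score = np.exp(X @ beta)
    for i in range(1, horizon + 1):
        at_risk = months >= i
        d_i = np.sum((months == i) & (status == cause))   # events of this cause in month i
        denom = np.sum(risk_score[at_risk])               # sum of exp(beta X_j) over the risk set
        lam0[i] = d_i / denom if denom > 0 else 0.0
    return lam0

def ccr_from_cox(lam0_cure, lam0_liq, beta_cure, beta_liq, x, horizon=58):
    """CCR curve for covariate profile x from piecewise-constant cause specific hazards."""
    lam1 = lam0_cure * np.exp(x @ beta_cure)      # cause specific hazard for cure
    lam2 = lam0_liq * np.exp(x @ beta_liq)        # cause specific hazard for liquidation
    cum = np.cumsum(lam1 + lam2)                  # Lambda_1(t_i) + Lambda_2(t_i)
    F1 = np.zeros(horizon + 1)
    for i in range(1, horizon + 1):               # cumulative incidence via (3.12)
        tot = lam1[i] + lam2[i]
        if tot > 0:
            F1[i] = F1[i - 1] + lam1[i] * np.exp(-cum[i - 1]) * (1 - np.exp(-tot)) / tot
        else:
            F1[i] = F1[i - 1]
    S = np.exp(-cum)                              # survival function, eq. (1.7)
    return (F1[horizon] - F1) / S                 # CCR(t), eq. (2.1) with F_1(inf) = F_1(58)

# Example usage (hypothetical arrays and coefficients):
# lam0_c = breslow_baseline(months, status, X, beta_cure, cause=1)
# lam0_l = breslow_baseline(months, status, X, beta_liq, cause=2)
# ccr_bridge = ccr_from_cox(lam0_c, lam0_l, beta_cure, beta_liq, x=np.array([1.0]))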

39 3.4 Results In this chapter all quantities that are needed in order to obtain CCR estimates for each segment will be looked into. Since we want to obtain CCR estimates for each segment. The first step is coefficient estimation for all combinations of variables. Coefficient cause specific hazard rates for cure estimated with Cox regression can be found in the table in Figure 3.9 and for loss in Figure Figure 3.9: Output coefficients for cure Figure 3.10: Output coefficients for liquidation From the figures it can bee seen that the indicator variable for bridge loan and the indicator variable for high LTV have a negative effect on the cause specific hazard rates for cure. In other words, individuals which have a LTV of more than 1.2 and a Bridge loan are have a smaller hazard rate and consequently have a smaller probability of being cured. Individuals with NGH will be cured slightly faster than individuals without it. On the other hand, it can be seen that individuals which have high LTV, non Bridge loans and NHG will be liquidated faster than individuals without NGH. Once estimates of coefficients are obtained we are able to model cause specific hazard rates for cure and liquidation, which can be seen in Figures 3.11 and It is seen that the segment which is cured at the fastest rate is the segment with clients which have low LTV, while the other three segments have similar hazard rates, which become almost the same after the 30th month in default. These segments behave differently when cause specific hazard rate for liquidation is modeled. We can conclude that liquidation happens at the lowest rates for clients which have low LTV. This probably happens because clients with low LTV want to keep their properties. The segment with the highest rate of liquidation is the segment with bridge loans. In order to calculate the survival function for segments with Cox regression equation (1.7) 39

40 Figure 3.11: Cause specific hazard rate for cure Figure 3.12: Cause specific hazard rate for liquidation is used. Since the size of cause specific hazard rates and the size of the survival function are negatively correlated, the segment with the biggest survival function is represented by clients which have High LTV and do not have NHG. If we look Figure 3.12 it is visible that this segment has the second smallest cause specific hazard rate for liquidation while in Figure 3.11 it is shown that cause specific hazard rates for cure are almost the same as the smallest hazard rates. With the same reasoning the behaviour of the other groups can also be explained. An intuitive explanation for the shapes of the survival curves is that clients with high cause specific hazard rates will be resolved faster and consequently the probability of being unresolved at time t becomes smaller. Before CCR is modeled we need to look into the estimation of the cumulative incidence function for which the equation (3.12) is needed. From the Figure 3.14 it can be concluded that the highest probabilities of being cured are achieved by individuals with Low LTV. Since other segments have similar estimates for cause specific hazard rates 40

41 Figure 3.13: Survival functions calculated with the Cox model for cure, cumulative incidence functions are ordered in the same way as estimates of the cause specific hazard rates for liquidation. Figure 3.14: Cumulative incidence functions for cure calculated with the Cox model Finally all estimates from above can be used to estimate the CCRs, which can be seen in Figure Performance of the method In Section 2.3 it was seen that the nonparametric CCR estimator has large variance and wide confidence intervals at the end of the observation period. That happens because of the jumps of the hazard rates that are estimated with the Nelson-Aalen estimator. In this chapter the methodology described in Section will be used again in order to estimate confidence intervals, variance and bias of CCR estimated with Cox regression. From Figure 3.16 it can be seen that Cox regression estimators have narrower confidence 41

42 Figure 3.15: CCR estimated with the Cox model intervals than the nonparametric estimator but on the other hand, estimates for each segment are the same at the end of the observation period and consequently not significantly different. In Figure 3.17 it is visible that the variance is almost 10 times smaller than the variance Figure 3.16: Confidence intervals estimated with the bootstrap of the nonparametric estimator and that the small number of observations at the end of the observation period has almost no effect on variance. In Figure 3.18 it is be shown that bias is smaller and that time also has almost no influence on the bias. 3.6 Discussion From Section 3.5 it can be concluded that the Cox model gives us lower variance, less bias and narrower confidence intervals than nonparametric estimators. All these properties are definitely desirable features of an estimator. 42

Figure 3.17: Variance of CCR estimated with the Cox regression
Figure 3.18: Estimation of bias with the bootstrap

At the same time, there exist systematic techniques for variable selection in the Cox model. With these it can be decided much more easily whether a variable should be included in or excluded from a segmentation than with the method based on 95% confidence intervals, which is used in the case of nonparametric estimators. Also, the β's which are output from the regression have some explanatory power. When the coefficient of a boolean variable is negative, it is clear from equation (3.1) that a client with such a property will have a smaller hazard rate than a client without it, and consequently has a higher survival function and a smaller probability of resolution. The opposite phenomenon happens when β is positive. Consequently, the tables which can be found in Figures 3.9 and 3.10 could be a helpful instrument when deciding whether a client should get a mortgage, or when deciding what the size of a client's interest rate should be. Such a tool is not available when we operate with nonparametric estimators.

Understandability is definitely one of the desired properties of an estimator in Risk Management, and here the Cox model falls short. All nonparametric

All the nonparametric estimators which are part of the CCR estimator are intuitively easier to understand than the estimators used in the Cox model, especially if we compare the explanations of the cumulative incidence functions. Hazard rates estimated with the Nelson-Aalen estimator are closely related to empirical probabilities from a simple counting process, while formula (3.1) is closely related to the exponential distribution and makes sense mainly to a person who is better educated in probability. If time and computational power are important factors in deciding which estimator to use, then we would have to choose the nonparametric estimators, since they work faster and use less memory.
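As a purely hypothetical illustration of the interpretability argument above: suppose the fitted coefficient of the boolean variable NHG in the cause specific model for cure were $\beta_{\text{NHG}} = 0.25$ (a made-up value, not an estimate from this thesis). By equation (3.1) a client with NHG would then have a cause specific hazard rate for cure that is $e^{0.25} \approx 1.28$ times as large as that of an otherwise identical client without NHG, and would therefore remain unresolved for a shorter time and have a higher cumulative incidence of cure, provided the hazard rate for liquidation is unchanged.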

4 Proportional hazards model in an interval censored setting

Up until now CCR was modeled assuming that we have continuous time data and that the events of interest happen exactly at the observed times. In reality, however, the events denoted by $t_i$ actually happened in the interval $(t_{i-1}, t_i]$. Consequently, we are dealing with another type of censoring, interval censoring. In order to model CCR in an interval censoring setting we have to use a different approach and a different likelihood function. For a review of the proportional hazards model in an interval censored setting see Corrente, Chalita, and Moreira (2003).

In Section 4.1 we will get familiar with the theory behind survival analysis in the interval censoring setting. In Section 4.1.1 the generalized linear models, which are needed for the modeling of CCR, will be presented. In Section 4.2 we will learn how to use binning in the construction of the time intervals and make the computations with the generalized linear models feasible. In Section 4.3 we review the estimated quantities in the interval censoring setting. In Section 4.4 the performance of the method will be analyzed with the bootstrap. Which of the three estimators is the best will be discussed in Section 4.5.

4.1 Theoretical background

When we have interval censored data, the observation period is divided into smaller subintervals $I_i = [a_{i-1}, a_i)$, where $0 = a_0 < a_1 < \cdots < a_k = \infty$. The set of subjects that defaulted in interval $I_i$ will be denoted by $D_i$, and $N_i$ will denote the set of subjects which are at risk at the beginning of the interval $I_i$. The variable $\delta_{ji}$, $j \in \{1, 2, \ldots, n\}$, $i \in \{1, 2, \ldots, k\}$, is the indicator which takes value 1 if subject $j$ failed in interval $I_i$ and value 0 if subject $j$ was still alive at the end of interval $I_i$ or was censored in interval $I_i$. For instance, if subject $j$ died in the interval $I_3$, the following equality holds: $(\delta_{j1}, \delta_{j2}, \delta_{j3}) = (0, 0, 1)$. The value $p(a_i \mid X_j)$ equals the conditional probability that the subject with covariates $X_j$ has experienced the event by time $a_i$, given that the individual had not experienced the event of interest at $a_{i-1}$.

In order to derive the likelihood function in an interval censoring setting, the following two identities are needed:

$$
\begin{aligned}
P(T_j \in I_i \mid X_j) &= P(a_{i-1} \le T_j < a_i \mid X_j) = S(a_{i-1} \mid X_j) - S(a_i \mid X_j) \\
&= [(1 - p(a_1 \mid X_j)) \cdots (1 - p(a_{i-1} \mid X_j))] - [(1 - p(a_1 \mid X_j)) \cdots (1 - p(a_i \mid X_j))] \\
&= [(1 - p(a_1 \mid X_j)) \cdots (1 - p(a_{i-1} \mid X_j))]\bigl(1 - (1 - p(a_i \mid X_j))\bigr) \\
&= [(1 - p(a_1 \mid X_j)) \cdots (1 - p(a_{i-1} \mid X_j))]\, p(a_i \mid X_j) \qquad (4.1)
\end{aligned}
$$

and

$$
P(T_j > a_i \mid X_j) = S(a_i \mid X_j) = (1 - p(a_1 \mid X_j)) \cdots (1 - p(a_i \mid X_j)). \qquad (4.2)
$$

Combining equations (4.1) and (4.2), the likelihood function for interval censored data becomes

$$
L = \prod_{i=1}^{k} \prod_{j \in N_i} p(a_i \mid X_j)^{\delta_{ji}} \, (1 - p(a_i \mid X_j))^{1 - \delta_{ji}}, \qquad (4.3)
$$

see Corrente, Chalita, and Moreira (2003). Equation (4.3) is a likelihood function for observations with a Bernoulli distribution, where $\delta_{ji}$ is a binary response variable with success probability $p(a_i \mid X_j)$. When a variable has a Bernoulli distribution, generalized linear models can be used for the modeling.

4.1.1 Generalized Linear Models

In the first part of this section generalized linear models will be presented. In the second part we review the modeling of $p(a_i \mid X_j)$ with a GLM. For a review of Generalized Linear Models see Fox (2016).

Generalized Linear Models (GLM) are a tool for the estimation of a number of distinct statistical models for which we would usually need separate regression techniques, for instance logit and probit models. GLM was first presented by John Nelder and R.W.M. Wedderburn in 1972. In order to use a GLM three components are needed. Firstly, we need a random component $Y_i$ for the $i$-th observation, which is conditioned on the explanatory variables of the model. In the original formulation of GLM, $Y_i$ had to be a member of an exponential family. One of the members of the exponential family is the binomial distribution and consequently the Bernoulli distribution. Secondly, a linear predictor is needed, that is, a function of the regressors

$$\eta_i = \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik}.$$

Thirdly, we need a smooth invertible linearizing link function $g(\cdot)$. The link function transforms the expectation of the response variable, $\mu_i = E(Y_i)$, into the linear predictor,

$$g(\mu_i) = \eta_i = \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik}.$$

One of the functions that can be used in combination with the binomial distribution is the complementary log-log function (clog-log), where

$$\eta_i = g(\mu_i) = \ln(-\ln(1 - \mu_i)) \quad \text{or} \quad \mu_i = g^{-1}(\eta_i) = 1 - \exp(-\exp(\eta_i)).$$

From the identity (4.3) it is seen that $\delta_{ji}$ will be the response variable with a Bernoulli distribution with parameter $p(a_i \mid X_j)$ and consequently $E(\delta_{ji}) = p(a_i \mid X_j) = \mu_i$. Once we know $\mu_i$, we have to find the link function $g$ for which $\eta_i = g(p(a_i \mid X_j))$ holds. Using the Cox proportional hazards model in order to model $p(a_i \mid X_j)$, equations (3.3) and (3.2) give us

$$p(a_i \mid X_j) = 1 - \left( \frac{S_0(a_i)}{S_0(a_{i-1})} \right)^{\exp(\beta^T X_j)}. \qquad (4.4)$$

When the complementary log-log transformation is applied to equation (4.4) we get

$$\ln(-\ln(1 - p(a_i \mid X_j))) = \beta^T X_j + \ln\left(-\ln\left(\frac{S_0(a_i)}{S_0(a_{i-1})}\right)\right) = \beta^T X_j + \gamma_i = \eta_i, \qquad (4.5)$$

where $\gamma_i = \ln\left(-\ln\left(\frac{S_0(a_i)}{S_0(a_{i-1})}\right)\right)$. From the above it follows that we can use a GLM with a binomial distribution and complementary log-log link function in order to model $\delta_{ji}$, or in other words, we can fit the Cox proportional hazards model in the interval censoring setting with a GLM. After the values $p(a_i \mid X_j)$ are obtained, the survival function for each time point can be calculated using equation (4.2).

In the competing risk setting a GLM can be used in order to model the probabilities of failing due to reason $k$ in the interval $[a_{i-1}, a_i)$, that is $p_k(a_i \mid X_j)$, $k = 1, 2$. In this setting the indicator $\delta_{ji}^k$ tells us whether individual $j$ failed in interval $i$ due to reason $k = 1, 2$. As soon as the estimates $\hat p_k(a_i \mid X_j)$ are calculated, the following identity can be used in order to estimate the survival function:

$$\hat S(t \mid X_j) = \prod_{i : a_i \le t} \bigl(1 - (\hat p_1(a_i \mid X_j) + \hat p_2(a_i \mid X_j))\bigr). \qquad (4.6)$$

In order to estimate the cumulative incidence function the following estimator will be used:

$$\hat F_1(t \mid X_j) = \sum_{i : a_i \le t} \hat S(a_{i-1} \mid X_j)\, \hat p_1(a_i \mid X_j). \qquad (4.7)$$

Once the cumulative incidence functions are estimated, equation (2.1) is used in order to estimate CCR. As can be seen from equation (4.7), a GLM can be used for the modeling of proportional hazards in the interval censoring setting. Since the indicator variable $\delta_{ji}^k$ is defined for individual $j$ for each time until he is censored or one of the events $k$ occurs to him, we need to input a different type of data frame into the regression than the one used for the nonparametric and Cox estimators. The easiest way to understand how the data needs to be transformed is by continuing the example from Figure 3.1. From the figure it can be seen that we are following individual 1 until time 6, when he is censored. For every time interval we have to define $\delta_{1i}^k$, $k = 1, 2$, $i = 1, 2, \ldots, 6$, which tells us whether individual 1 experienced cure or loss in the interval $(i-1, i]$. In a similar fashion the other rows have to be duplicated, but for observations which are not censored, $\delta_{ji}^1$ takes value 1 if cure happened, as can be seen for individual 4 in Figure 4.1. After we have the data in the right format, we need to make dummy variables out of the variable observed time. How to create dummy variables for individual 1 can be seen in Figure 4.2. Dummies have to be created for each time from 1 to the largest time that can be found in the column observed time.

Once the dummies are created, we can start modeling $p_k(a_i \mid X)$. In the GLM we choose the clog-log link function and the binomial distribution. For the response variable we choose $\delta_{ji}^k$, and as independent variables we have to choose the dummy variables and the variables we want to include in the regression, in our case the variable HighLTV. The coefficients which we get as output for the dummies $i$ represent the values $\gamma_i$, while the coefficient $\beta_{\text{HighLTV}}$ explains the influence of the variable HighLTV on $p_k(a_i \mid \text{HighLTV})$, as can be seen in equation (4.4). The output coefficients for cure are $\beta_{\text{HighLTV}}$ and $\gamma_i^1$, $i = 1, \ldots, 6$, and the output coefficients for liquidation are $\beta_{\text{HighLTV}}$ and $\gamma_i^2$, $i = 1, \ldots, 6$.

Figure 4.1: Transformed data frame for GLM
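The transformation to this long format and the two GLM fits described above can be sketched with pandas and statsmodels. This is a minimal sketch under stated assumptions, not the exact code used in the thesis: the data frame `clients` and its columns `observed_time`, `event` (0 = censored, 1 = cure, 2 = liquidation) and `HighLTV` are made-up stand-ins for the small example of Figure 3.1, and the formula interface is used so that the interval dummies, whose coefficients are the values γ_i, are created automatically.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

def to_person_period(df):
    """One row per client per month until cure, liquidation or censoring."""
    rows = []
    for _, r in df.iterrows():
        for i in range(1, int(r["observed_time"]) + 1):
            last = i == int(r["observed_time"])
            rows.append({
                "interval": i,
                "HighLTV": r["HighLTV"],
                "cure": int(last and r["event"] == 1),   # delta^1_{ji}
                "loss": int(last and r["event"] == 2),   # delta^2_{ji}
            })
    return pd.DataFrame(rows)

long_df = to_person_period(clients)

# Binomial GLM with complementary log-log link; "0 + C(interval)" yields one
# coefficient gamma_i per interval instead of an intercept plus contrasts.
link = sm.families.links.cloglog()        # called CLogLog in recent statsmodels
family = sm.families.Binomial(link=link)
cure_fit = smf.glm("cure ~ 0 + C(interval) + HighLTV", data=long_df, family=family).fit()
loss_fit = smf.glm("loss ~ 0 + C(interval) + HighLTV", data=long_df, family=family).fit()
print(cure_fit.params)    # gamma_1, ..., gamma_6 and beta_HighLTV for cure
```

Fitting one GLM per competing risk on the same expanded data frame mirrors the cause specific treatment of cure and liquidation used throughout the thesis.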

Figure 4.2: Dummies for GLM

In order to estimate $p_k(a_i \mid \text{HighLTV})$ we have to use the inverse of equation (4.5), that is,

$$p_k(a_i \mid \text{HighLTV}) = 1 - \exp\bigl(-\exp(\gamma_i^k + \beta_{\text{HighLTV}} \cdot \text{HighLTV})\bigr). \qquad (4.8)$$

After doing so we get the interval probabilities of cure, $\hat p_1(a_i \mid 1)$ and $\hat p_1(a_i \mid 0)$, for $i = 1, \ldots, 6$; for example, the estimated values of $\hat p_1(a_i \mid 0)$ include 0.105 and 0.411. In the same way we obtain the interval probabilities of liquidation.

For example, the estimated values of $\hat p_2(a_i \mid 0)$ include 0.143 and 0.500, and those of $\hat p_2(a_i \mid 1)$ include 0.009 and 0.365. Applying equation (4.6) to the results above gives the survival functions; the estimated values of $\hat S(a_i \mid 0)$ include 1.000 and 0.301, and those of $\hat S(a_i \mid 1)$ include 0.909 and 0.298.
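Given the fitted coefficients, equations (4.8), (4.6) and (4.7) translate directly into a few lines of numpy. This is a sketch under assumed names: `gammas_cure`, `gammas_loss`, `beta_cure` and `beta_loss` are hypothetical placeholders for the estimates produced by the two GLMs above.

```python
import numpy as np

def interval_prob(gammas, beta, highltv):
    """Equation (4.8): p_k(a_i | HighLTV) = 1 - exp(-exp(gamma_i^k + beta * HighLTV))."""
    return 1.0 - np.exp(-np.exp(np.asarray(gammas) + beta * highltv))

def survival(p_cure, p_loss):
    """Equation (4.6): S(a_i | x) = product over l <= i of (1 - (p_1(a_l|x) + p_2(a_l|x)))."""
    return np.cumprod(1.0 - (p_cure + p_loss))

def cumulative_incidence(p_cure, p_loss):
    """Equation (4.7): F_1(a_i | x) = sum over l <= i of S(a_{l-1}|x) * p_1(a_l|x)."""
    s = survival(p_cure, p_loss)
    s_prev = np.concatenate(([1.0], s[:-1]))   # S(a_0 | x) = 1
    return np.cumsum(s_prev * p_cure)

# probabilities, survival and cumulative incidence for a client with HighLTV = 1
p1 = interval_prob(gammas_cure, beta_cure, highltv=1)
p2 = interval_prob(gammas_loss, beta_loss, highltv=1)
S = survival(p1, p2)
F1 = cumulative_incidence(p1, p2)
```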

Now equation (4.7) can be used for the estimation of the cumulative incidence functions $\hat F_1(a_i \mid 0)$ and $\hat F_1(a_i \mid 1)$; for example, the estimated values include 0.294 and 0.349, respectively.

4.2 Model implementation

Rabobank assumes that the probability of cure is equal to zero after the 58th month and that the data is collected at the end of each month. From Section 4.1 it follows that we would need to make a regression with at least 58 coefficients in order to obtain $\gamma_i$, $i = 1, \ldots, 58$, which are needed for the estimation of $p_k(a_i \mid X)$. Such a regression is computationally too expensive. At the same time we would need to transform the original data set into one with more than 2 million observations, which is expensive as well. In order to avoid these problems a method called binning or bucketing will be used.

4.2.1 Binning

In order to make the method computationally feasible the number of observed intervals will be reduced. Firstly, the cause specific hazard rate of each interval is estimated with the following estimator:

$$\frac{\text{Number of deaths in the interval}}{\text{Number of individuals at the beginning of the interval} \times \text{Size of the interval}}.$$

Once these hazard rates are estimated, the two neighboring intervals with the smallest absolute difference between their hazard rates are joined into a new interval. If we join the intervals $I_i = (a_{i-1}, a_i]$ and $I_{i+1} = (a_i, a_{i+1}]$, the new interval $(a_{i-1}, a_{i+1}]$ is obtained. These two steps are repeated until the desired, small enough number of intervals is reached. The GLM regression will be made with 10 intervals. The results can be seen in Figures 4.3 and 4.4.
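A minimal sketch of this greedy binning step is given below; the interval edges, death counts and at-risk counts are assumed inputs (in the thesis they come from the monthly default data), and the function is an illustration of the described procedure rather than the exact implementation.

```python
import numpy as np

def bin_intervals(edges, deaths, at_risk, target=10):
    """Merge neighbouring intervals until `target` intervals remain.

    edges   : boundaries a_0 < a_1 < ... < a_k of the initial intervals
    deaths  : number of events in each interval (a_{i-1}, a_i]
    at_risk : number of individuals at risk at the beginning of each interval
    """
    edges, deaths, at_risk = list(edges), list(deaths), list(at_risk)
    while len(deaths) > target:
        widths = np.diff(edges)
        hazards = np.array(deaths) / (np.array(at_risk) * widths)
        j = int(np.argmin(np.abs(np.diff(hazards))))   # most similar neighbours
        deaths[j] += deaths[j + 1]                     # merge interval j+1 into j
        del deaths[j + 1]
        del at_risk[j + 1]   # the merged interval keeps the risk set of interval j
        del edges[j + 1]     # drop the boundary between the two joined intervals
    return edges, deaths, at_risk

# e.g. 58 monthly intervals with edges 0, 1, ..., 58 merged into 10 buckets
# edges, deaths, at_risk = bin_intervals(range(59), monthly_deaths, monthly_at_risk)
```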

Figure 4.3: Hazard rates estimated before binning

Figure 4.4: Interval hazard rates estimated after binning

As soon as the binning is performed, we use the output intervals $I_i = (a_{i-1}, a_i]$ in order to model the probabilities $p_k(a_i)$. The values $a_i$ for which we will model the probabilities are $a_0 < a_1 < \cdots < a_{10} < a_{11} = \infty$; concretely, we get $0 < 3 < 4 < 5 < 8 < 12 < 13 < 14 < 20 < 46 < 58 < \infty$. It will still be assumed that the probability of cure after time 58 is equal to 0, and consequently $p_1(a_{11} \mid X) = 0$.

4.3 Results

In this chapter all the quantities which are needed for the estimation of CCR in the interval censoring setting will be modeled and compared with the results from the previous chapters. Since a regression with 58 coefficients is not computationally feasible, the intervals obtained in Section 4.2.1 will be used. All graphs in this chapter should be histograms, since we are operating with discrete time, but the visualization of four segments in one figure would be nearly impossible; consequently the quantities are represented as step functions.

After the intervals are determined we can estimate the parameters. As independent variables we have to use a combination of variables for each segment, as described in Section 3.3, and dummy variables for the intervals. Once the regression is done we get the coefficients for the risk parameters for cure and loss, as can be seen in Figures 4.5 and 4.6.

Figure 4.5: Output coefficients for cure

Figure 4.6: Output coefficients for loss

Once $\gamma_i$, $i = 1, 2, \ldots, 10$, are estimated we can start computing $p_1(a_i \mid X_j)$ and $p_2(a_i \mid X_j)$. The interval probabilities for cure and loss can be seen in Figures 4.7 and 4.8. The interval probabilities cannot be compared with the modeled hazard rates from the previous chapters because they are not normalized. In the figures it is also visible that the interval probabilities are higher for wider intervals, which makes sense: the longer the interval, the higher the probability of cure or liquidation. From Figures 4.7 and 4.8 it is seen that the segments behave in a similar way as the hazard rates estimated with the Cox model, which can be seen in Figures 3.11 and 3.12. Con-
