Loss Estimation using Monte Carlo Simulation


1 Loss Estimation using Monte Carlo Simulation Tony Bellotti, Department of Mathematics, Imperial College London Credit Scoring and Credit Control Conference XV Edinburgh, 29 August to 1 September 2017

2 Motivation
Accurate estimation of loss based on underlying models of PD, LGD and EAD.
Use of Monte Carlo simulation (integration) to avoid a complex analytic solution, giving a distribution of possible loss.
Confidence intervals to quantify uncertainty in expected loss estimates.
Applications: internal risk management, regulation (Basel III), accounting rules (IFRS 9, CECL), stress testing, profit estimation.

3 Basic Idea
Simple idea: simulate loss. For a portfolio of loans, with accounts i = 1 to n, compute
$$\text{Loss} = \sum_{i=1}^{n} PD_i \times LGD_i \times EAD_i$$
where $PD_i$ = probability of default, $LGD_i$ = loss given default and $EAD_i$ = exposure at default, across distributions of these risk factors, informed by models.
Devil in the detail: the relationship between these three risk factors.
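As a minimal sketch of the point-estimate version of this sum, the following uses made-up PD, LGD and EAD values for a four-loan portfolio. The array names and numbers are illustrative assumptions, not taken from the presentation. It treats the three factors as fixed values per account, which is exactly the simplification the "devil in the detail" remark warns about.

```python
import numpy as np

# Hypothetical per-account model outputs (illustrative values only).
pd_ = np.array([0.02, 0.10, 0.05, 0.01])           # probability of default
lgd = np.array([0.60, 0.80, 0.50, 0.70])           # loss given default (fraction of exposure)
ead = np.array([1000.0, 2500.0, 400.0, 12000.0])   # exposure at default (currency units)

def expected_loss(pd_, lgd, ead):
    """Portfolio loss as the sum over accounts of PD_i * LGD_i * EAD_i."""
    return float(np.sum(pd_ * lgd * ead))

loss = expected_loss(pd_, lgd, ead)
print(loss)  # 12 + 200 + 10 + 84 = 306
```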

4 Scope of this study
For this study, we considered the simplified problem:
Assume no population change between training and forecast data (i.e. IID data).
Do not consider inclusion of economic conditions just yet.
Show results from both a simulation study and real credit card data.

5 The Maths: Defining Loss
Consider estimating loss on n accounts in a portfolio. For each account $i \in \{1, \dots, n\}$:
Let $x_i$ be a vector of characteristics of mixed data types.
Let $Y_i \in \{0, 1\}$ be the default event for account i; 1 = default, 0 = non-default.
Let $L_i \in \mathbb{R}$ be loss given default (LGD).
Let $E_i > 0$ be exposure at default (EAD).
Then total loss on the portfolio is
$$V = \sum_{i=1}^{n} Y_i L_i E_i.$$
Then expected loss is
$$E(V) = \sum_{i=1}^{n} E(Y_i L_i E_i).$$

6 Introducing the risk models
Suppose we have models $m_1, m_2, m_3$ for probability of default (PD), LGD and log-EAD respectively. Hence,
$$P(Y_i = 1 \mid x_i) = m_1(x_i)$$
$$L_i = m_2(x_i) + \varepsilon_{2,i}$$
$$\log E_i = m_3(x_i) + \varepsilon_{3,i}$$
where $\varepsilon_{2,i}$ and $\varepsilon_{3,i}$ are residual terms.

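7 The Maths: Expected Loss
Then, with a change of variables, expected loss $E(Y_i L_i E_i)$ can be rewritten as
$$m_1(x_i) \iint \left(m_2(x_i) + \varepsilon_{2,i}\right) \exp\left(m_3(x_i) + \varepsilon_{3,i}\right) f(\varepsilon_{2,i}, \varepsilon_{3,i} \mid Y_i = 1, x_i)\, d\varepsilon_{2,i}\, d\varepsilon_{3,i}$$
which can be approximated using Monte Carlo integration by
$$EL \approx \frac{1}{M}\, m_1(x_i) \sum_{m=1}^{M} \left(m_2(x_i) + \varepsilon_{2,i}\right) \exp\left(m_3(x_i) + \varepsilon_{3,i}\right)$$
for random samples $\varepsilon_{2,i}, \varepsilon_{3,i} \sim f$, with a fresh draw for each term of the sum.
Assume independence of residuals from $x_i$, i.e. simulate from the density $f(\varepsilon_{2,i}, \varepsilon_{3,i} \mid Y_i = 1)$.
Estimate this density using either the empirical distribution or kernel density estimation on a training or validation data set.
Note: I will not show the derivation of these formulae, but it is available upon request.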
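The Monte Carlo approximation above can be sketched for a single account as follows. This is an illustration under stated assumptions rather than the author's implementation: the function name `mc_expected_loss`, the synthetic residual covariance and the model outputs are all hypothetical. Residuals are drawn from the empirical distribution by resampling whole (LGD residual, log-EAD residual) pairs, which keeps their dependence intact.

```python
import numpy as np

def mc_expected_loss(m1_x, m2_x, m3_x, resid_pairs, M=5000, rng=None):
    """Monte Carlo estimate of E(Y_i L_i E_i) for a single account.

    m1_x, m2_x, m3_x : model outputs for PD, LGD and log-EAD at x_i.
    resid_pairs      : array of shape (k, 2) holding (eps2, eps3) pairs
                       observed on past defaults (empirical distribution).
    """
    rng = np.random.default_rng() if rng is None else rng
    resid_pairs = np.asarray(resid_pairs, dtype=float)
    # Draw whole rows so the dependence between eps2 and eps3 is preserved.
    idx = rng.integers(0, len(resid_pairs), size=M)
    eps2, eps3 = resid_pairs[idx, 0], resid_pairs[idx, 1]
    return m1_x * np.mean((m2_x + eps2) * np.exp(m3_x + eps3))

# Illustrative inputs (made up): PD = 10%, LGD = 0.5, EAD about 1000.
rng = np.random.default_rng(0)
cov = [[0.04, 0.03], [0.03, 0.25]]   # assumed residual covariance
pairs = rng.multivariate_normal([0.0, 0.0], cov, size=2000)
el_mc = mc_expected_loss(0.10, 0.5, np.log(1000.0), pairs, M=5000, rng=rng)
el_plugin = 0.10 * 0.5 * 1000.0      # ignores the residuals entirely
print(el_mc, el_plugin)
```

With these assumed residuals the Monte Carlo estimate sits above the plug-in product, because the exponential of the log-EAD residual has mean greater than one and the two residuals are positively correlated.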

8 Quantile estimation of Loss
It is valuable to consider the distribution of possible losses and, in particular, to compute quantiles. This allows confidence intervals (CI) on loss estimates.
The qth quantile $v_q$ of $V$ satisfies
$$q = \int_C f(v \mid x_1, \dots, x_n)\, dv$$
where $f$ is the density over $V$, conditional on characteristics, and $C = \{v : v \le v_q\}$.
Note: here $q$ is known and $v_q$ is unknown.
For example, to compute a 95% CI, find $v_q$ for $q = 0.025$ and $q = 0.975$: $(v_{0.025}, v_{0.975})$.

9 Quantile estimation of Loss using Monte Carlo
Using Monte Carlo integration, this integral can be approximated by
$$q \approx \frac{1}{M} \sum_{m=1}^{M} I\left(\sum_{i=1}^{n} v_i \le v_q\right)$$
where $v_i = y_i \left(m_2(x_i) + \varepsilon_{2,i}\right) \exp\left(m_3(x_i) + \varepsilon_{3,i}\right)$ and random samples $y_i, \varepsilon_{2,i}, \varepsilon_{3,i} \sim f$.
The loss quantile $v_q$ is easily estimated by ranking the simulated values $\sum_{i=1}^{n} v_i$ in ascending order and choosing the value at rank $Mq$.
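The ranking step can be sketched as below. The helper name `loss_quantile` is hypothetical, and rounding the rank up to the next integer when Mq is not whole is an assumption, since the slide does not say how ties or fractional ranks are handled.

```python
import numpy as np

def loss_quantile(total_losses, q):
    """Return the simulated portfolio loss at rank ceil(M*q), ascending order."""
    ranked = np.sort(np.asarray(total_losses, dtype=float))
    M = len(ranked)
    # round() guards against floating-point noise such as 25.000000000000004.
    rank = int(np.ceil(round(M * q, 9)))   # 1-based rank
    rank = min(max(rank, 1), M)            # keep the rank inside [1, M]
    return ranked[rank - 1]

# Stand-in for M = 1000 simulated portfolio totals: the values 1, 2, ..., 1000.
sims = np.arange(1, 1001, dtype=float)
lo = loss_quantile(sims, 0.025)
hi = loss_quantile(sims, 0.975)
print(lo, hi)  # 25.0 975.0
```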

10 Quantile estimation: Sampling
We need to sample $y_i, \varepsilon_{2,i}, \varepsilon_{3,i} \sim f$.
1. Notice $f(y_i, \varepsilon_{2,i}, \varepsilon_{3,i} \mid x_i) = f(\varepsilon_{2,i}, \varepsilon_{3,i} \mid y_i, x_i)\, P(y_i \mid x_i)$.
2. Hence, for each account i, simulate $y_i$ = 0 or 1 from $P(y_i = 1 \mid x_i) = m_1(x_i)$.
3. If $y_i = 0$, it does not matter how $\varepsilon_{2,i}, \varepsilon_{3,i}$ are simulated, since $y_i = 0 \Rightarrow v_i = 0$, always.
4. If $y_i = 1$, simulate $\varepsilon_{2,i}, \varepsilon_{3,i}$ from $f(\varepsilon_{2,i}, \varepsilon_{3,i} \mid Y_i = 1)$, assuming that $\varepsilon_{2,i}, \varepsilon_{3,i}$ are independent of $x_i$.
5. The density $f(\varepsilon_{2,i}, \varepsilon_{3,i} \mid Y_i = 1)$ can be estimated based on a validation data set of previous defaults. Either the empirical distribution or a kernel density estimator (KDE) can be used.
Note: it is easy to simulate from a KDE: randomly sample an example from the validation/training data, then add random noise corresponding to the kernel function.
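The sampling steps above can be put together in a short sketch. Everything named here (`simulate_portfolio_losses`, the synthetic model outputs, the residual covariance) is a hypothetical illustration. With `bandwidth=0` the residuals come from the empirical distribution. A positive `bandwidth` adds independent Gaussian noise to each resampled pair, which is one way to draw from a KDE if a Gaussian kernel with a common scalar bandwidth is assumed.

```python
import numpy as np

def simulate_portfolio_losses(pd_hat, lgd_hat, log_ead_hat, resid_pairs,
                              M=2000, bandwidth=0.0, rng=None):
    """Simulate M portfolio loss totals following the sampling steps.

    pd_hat, lgd_hat, log_ead_hat : per-account outputs of m1, m2, m3.
    resid_pairs : (k, 2) array of (eps2, eps3) from past defaults.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(pd_hat)
    totals = np.empty(M)
    for m in range(M):
        # Step 2: simulate the default indicator from the PD model.
        y = rng.random(n) < pd_hat
        k = int(y.sum())
        # Steps 4-5: residuals are only needed for defaulted accounts (step 3).
        idx = rng.integers(0, len(resid_pairs), size=k)
        eps = resid_pairs[idx]
        if bandwidth > 0:
            # KDE draw: resampled point plus kernel noise (Gaussian assumed).
            eps = eps + bandwidth * rng.standard_normal(eps.shape)
        v = (lgd_hat[y] + eps[:, 0]) * np.exp(log_ead_hat[y] + eps[:, 1])
        totals[m] = v.sum()
    return totals

# Synthetic portfolio of 500 accounts (all values are made up).
rng = np.random.default_rng(42)
n = 500
pd_hat = rng.uniform(0.01, 0.10, size=n)
lgd_hat = rng.uniform(0.3, 0.8, size=n)
log_ead_hat = rng.normal(7.0, 0.5, size=n)
resid_pairs = rng.multivariate_normal([0, 0], [[0.04, 0.02], [0.02, 0.25]], size=1000)

totals = simulate_portfolio_losses(pd_hat, lgd_hat, log_ead_hat, resid_pairs, rng=rng)
ci_lo, ci_hi = np.quantile(totals, [0.025, 0.975])
print(totals.mean(), ci_lo, ci_hi)
```

Here `np.quantile` interpolates between ranked values, which differs slightly from picking the value at a single rank but gives the same interval for large M.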

11 Why a simulation study?
Simulate credit accounts with default, LGD and EAD outcomes and correlations controlled by different predictor variables.
Allows us to control the generating distribution for the data.
Allows for testing and debugging of models and the loss estimation technique, since we know the true values.
An endless supply of artificial data allows for repeat experiments and hence samples of results for statistical analysis.

12 Simulation study: Data generation
A credit portfolio was simulated with multiple risk factors driving default events, LGD and EAD.
Risk factors X1 to X5: in the slide's table, Default is marked against three of the factors, LGD against three and EAD against two.
All variables are standard normally distributed.
All variables are expressed as the sum of an observable and an unobservable component; only the observable component can be used in model building, hence simulating uncertainty.
X1 and X2 are common to more than one component, hence inducing a correlation.
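A data generator in this spirit might look like the sketch below. The exact assignment of X1 to X5 to each outcome, the coefficients and the intercepts are not recoverable from the slide, so every loading here is an assumption chosen only so that X1 and X2 are shared between outcomes. The function name `generate_portfolio` and the `obs_weight` parameter are hypothetical.

```python
import numpy as np

def generate_portfolio(n, obs_weight=0.5, rng=None):
    """Simulate n accounts with five standard normal risk factors.

    Each factor is a weighted sum of an observable and an unobservable
    standard normal part, scaled so the factor itself is N(0, 1).
    """
    rng = np.random.default_rng() if rng is None else rng
    obs = rng.standard_normal((n, 5))     # available to the models
    unobs = rng.standard_normal((n, 5))   # hidden from the models
    X = np.sqrt(obs_weight) * obs + np.sqrt(1.0 - obs_weight) * unobs

    # Assumed loadings (illustrative): X1 and X2 are shared across outcomes.
    score = -2.0 + 0.5 * (X[:, 0] + X[:, 1] + X[:, 2])
    p_default = 1.0 / (1.0 + np.exp(-score))
    y = (rng.random(n) < p_default).astype(int)
    lgd = 0.5 + 0.1 * (X[:, 0] + X[:, 1] + X[:, 3])
    log_ead = 7.0 + 0.5 * (X[:, 1] + X[:, 4])
    return {"obs": obs, "y": y, "lgd": lgd, "log_ead": log_ead}

data = generate_portfolio(20000, rng=np.random.default_rng(3))
# The shared factor induces a positive LGD / log-EAD correlation.
print(np.corrcoef(data["lgd"], data["log_ead"])[0, 1])
```

Because only `obs` would be passed to the fitted models, part of each outcome's variation stays unexplained and ends up in the residuals, which is the uncertainty the simulation study sets out to measure.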

13 Simulation study: models and distribution of residuals
LGD model: $R^2 = 0.29$. Log-EAD model: $R^2 = 0.25$.
[Figure: contour map of the density $f(\varepsilon_{2,i}, \varepsilon_{3,i} \mid Y_i = 1)$ using KDE; axis: LGD residual $\varepsilon_{2,i}$.]

14 Simulation study: results
Model details          train  test  EL   95% CI           % below Q2.5%   % above Q97.5%
                                         (-9.5,+9.8)
                                         (-3.1,+3.1)
                                         (-9.3,+10.1)     9               8
Bandwidth=high                           (-10.2,+10.6)    15              0
Fix LGD                                  (-9.5,+9.7)      3               5
Fix LGD, ε_{2,i} =                       (-8.5,+8.7)      0               34
Poor PD model                            (-8.8,+11.6)     46              0
No EAD in LGD model                      (-9.44,+9.73)    3               3
M = 5000; each experiment is repeated 100 times.
train, test are the numbers of examples in the train and test data sets (in 1000s).
EL = % for analytic expected loss estimate, compared to actual loss; the Monte Carlo expected loss estimate is reported in the same way.
95% CI is % difference from EL estimate.
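The "% below Q2.5%" and "% above Q97.5%" columns can be read as tail miss rates across repeated experiments: how often the actual loss fell outside the simulated interval on each side. A minimal sketch of that calculation, with a hypothetical helper name and made-up inputs, is:

```python
import numpy as np

def tail_miss_rates(actual, lower, upper):
    """Percentage of experiments with actual loss below / above the interval.

    actual, lower, upper : one entry per repeated experiment, giving the
    realised loss and the simulated 2.5% and 97.5% loss quantiles.
    """
    actual, lower, upper = (np.asarray(a, dtype=float) for a in (actual, lower, upper))
    below = 100.0 * np.mean(actual < lower)
    above = 100.0 * np.mean(actual > upper)
    return below, above

# Four made-up experiments: one miss on each side.
below, above = tail_miss_rates(actual=[1.0, 5.0, 10.0, 5.0],
                               lower=[2.0, 2.0, 2.0, 2.0],
                               upper=[8.0, 8.0, 8.0, 8.0])
print(below, above)  # 25.0 25.0
```

For a well-calibrated 95% interval each rate should be close to 2.5%; rates far above that on one side indicate a biased loss distribution.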

15 Simulation study: results
Main result: reliable and accurate predictions, but high uncertainty: +/-10%.

16 Simulation study: results
Increase sample size: more accuracy, but less reliability.

17 Simulation study: results
Poor models (due to a small training set) lead to poor reliability.

18 Simulation study: results
Accuracy is sensitive to the bandwidth in KDE: perhaps just use the empirical distribution for sampling.
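One possible mechanism for this sensitivity can be worked through under an assumption the slides do not state: a Gaussian kernel with bandwidth $h$ on the log-EAD residual. A KDE draw is then a resampled residual $\varepsilon_{3,J}$ plus independent noise $hZ$ with $Z \sim N(0,1)$, and because EAD enters the loss through an exponential, the added noise inflates the simulated exposure on average:

```latex
\tilde{\varepsilon}_{3,i} = \varepsilon_{3,J} + hZ,
\qquad
E\left[\exp(\tilde{\varepsilon}_{3,i})\right]
  = E\left[\exp(\varepsilon_{3,J})\right]\, E\left[\exp(hZ)\right]
  = E\left[\exp(\varepsilon_{3,J})\right]\, \exp\!\left(\tfrac{h^{2}}{2}\right).
```

For example, $h = 0.3$ gives a factor of $\exp(0.045) \approx 1.046$, an upward shift of roughly 4.6% in simulated EAD relative to sampling from the empirical distribution, which has $h = 0$.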

19 Simulation study: results
Using a fixed value for LGD is fine, so long as the residual for LGD is used in MC sampling. A similar result holds when using a fixed value for EAD.

20 Simulation study: results
A poor PD model (just one predictor variable) leads to poor reliability.

21 Simulation study: results
No need to include EAD as a predictor variable in the LGD model.

22 UK credit card data study
Behavioural data for UK credit cards.
Define default as 3 months of missed payments within a 12 month period.
Predictor variables include client and account ages, application data (employment status, tenure status, months at current address) and behavioural data (balance, utilization, past delinquency).
Build simple underlying models: PD using logistic regression; LGD and log-EAD using OLS linear regression.
Train / test over two different periods:
Data set   Observation date   train   test
A          July 2008
B          September
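The OLS part of this setup, and the residual pairs it produces for the Monte Carlo step, can be sketched with plain NumPy. The data below are synthetic and the helper names (`ols_fit`, `ols_predict`, `residual_pairs`) are hypothetical. Residuals are computed in-sample here for brevity, whereas the presentation estimates the residual density on a training or validation set of previous defaults. The logistic regression for PD is omitted.

```python
import numpy as np

def ols_fit(X, y):
    """Least-squares coefficients for y ~ intercept + X."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def ols_predict(beta, X):
    """Fitted values for the design matrix with a leading intercept column."""
    return np.column_stack([np.ones(len(X)), X]) @ beta

def residual_pairs(X_def, lgd, ead):
    """(LGD residual, log-EAD residual) pairs on defaulted accounts."""
    log_ead = np.log(ead)
    eps2 = lgd - ols_predict(ols_fit(X_def, lgd), X_def)
    eps3 = log_ead - ols_predict(ols_fit(X_def, log_ead), X_def)
    return np.column_stack([eps2, eps3])

# Synthetic defaulted accounts (all values are made up).
rng = np.random.default_rng(7)
n_def = 400
X_def = rng.standard_normal((n_def, 3))
lgd = 0.6 + 0.05 * X_def[:, 0] + rng.normal(0.0, 0.2, n_def)
ead = np.exp(7.0 + 0.8 * X_def[:, 1] + rng.normal(0.0, 0.4, n_def))

pairs = residual_pairs(X_def, lgd, ead)
print(pairs.shape, np.corrcoef(pairs[:, 0], pairs[:, 1])[0, 1])
```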

23 Credit card data: models and distribution of residuals
Data set A: LGD model $R^2 = 0.09$; log-EAD model $R^2 = 0.74$.
Data set B: LGD model $R^2 = 0.11$; log-EAD model $R^2 = 0.81$.
[Figure: contour maps of the density $f(\varepsilon_{2,i}, \varepsilon_{3,i} \mid Y_i = 1)$ using KDE, one per data set; axis: LGD residual $\varepsilon_{2,i}$.]

24 Credit card data study: Results
                          Data set A              Data set B
Model details             EL   95% CI             EL   95% CI
                               (-14.7,+20.0)           (-17.9,+27.8)
Bandwidth=high                 (-15.2,+20.7)           (-19.0,+29.3)
Fix LGD                        (-14.0,+18.3)           (-17.7,+28.6)
Fix LGD, ε_{2,i} =             (-12.9,+16.4)           (-15.1,+21.7)
Poor PD model                  (-15.2,+20.5)           (-20.1,+31.9)
No EAD in LGD model            (-14.4,+19.2)           (-18.5,+32.4)
M = 10000, averaged over 50 runs with different train / test splits.
EL = % for analytic expected loss estimate, compared to actual loss; the Monte Carlo expected loss estimate is reported in the same way.
95% CI is % difference from EL estimate.

25 Credit card data study: Results
Monte Carlo simulation gives accurate EL estimates, on average. However, the CI is broad (+/-20%).

26 Credit card data study: Results
Accuracy is sensitive to the bandwidth used in KDE.

27 Credit card data study: Results
Accuracy is affected by using a fixed value for LGD. Similar result for EAD. Also, a potentially bad result with a poor PD model (i.e. insufficient predictors).

28 Credit card data study: Results
No need to include EAD as a predictor in the LGD model.

29 Credit card data: LGD/EAD model residuals
When EAD is not explicitly included as a predictor in the LGD model, the correlation between the LGD and log-EAD model residuals is stronger, to compensate.
[Figure: contour maps of the density $f(\varepsilon_{2,i}, \varepsilon_{3,i} \mid Y_i = 1)$ using KDE, for data sets A and B; axis: LGD residual $\varepsilon_{2,i}$.]

30 Conclusions and future work
Monte Carlo simulation can be used to give reliable estimates of loss, and estimates of uncertainty in expected loss estimation.
But there is sensitivity to model risk: care is needed to ensure the underlying models are correctly specified.
Future work:
Test the procedure on other data (e.g. mortgages).
Extend the exercise to include dynamic components: environmental/macroeconomic conditions and forecasting.
Use reliable prediction techniques (conformal predictors) to output reliable confidence intervals, even with model.

31 Loss Estimation using Monte Carlo Simulation
Thank you! I hope you have found this presentation useful. Any questions?
Dr Tony Bellotti, Senior Lecturer in Statistics, Department of Mathematics, Imperial College London
a.bellotti@imperial.ac.uk
Part of the Statistics in Finance Research Group at Imperial College London: research, training, consultancy.
