Bank of England Centre for Central Banking Studies
CEMLA 2013
Extreme Value Theory
David G. Barr
November 21, 2013
Any views expressed are those of the author and not necessarily those of the Bank of England.
Contents

1 Outline
2 How probable are improbable events?
3 Can we use the Normal distribution to calculate stock return probabilities?
  3.1 The data exhibit excess kurtosis, i.e. fat tails
  3.2 So what do we do about it?
4 Estimating probability distributions
  4.1 Estimating a Normal distribution
5 Parametric vs. empirical distributions
6 Extreme value theory
  6.1 A quick summary
  6.2 Back to work
  6.3 Sensible curves
7 What makes these curves sensible? The theory in EVT
  7.1 The theory in general terms: Gnedenko's result
  7.2 The general EVT shape: the generalised Pareto distribution
  7.3 Do we actually fit the density function to the histogram?
  7.4 To estimate the GPD do the following
  7.5 Using the estimated parameters of the Generalised Pareto Distribution
  7.6 A simpler EVT shape: the Pareto distribution
  7.7 Application of the Pareto distribution to SP500 daily returns
8 References
1. Outline.

1. Introduction: Why is EVT useful?
2. Ingredients:
   (a) Probability distributions for EVT.
   (b) Gnedenko's results.
3. A general application of EVT.
4. A simpler, restricted, application.

2. How probable are improbable events?

The most damaging events occur rarely (fortunately). As a result we have few observations with which to estimate their probability. We can estimate it using data from normal times, asset returns for example, by assuming that the data come from a standard distribution. This gives us an estimated parametric distribution, but such a distribution is typically very bad at providing estimated probabilities for rare events. Alternatively, we can assume that the frequencies of past outcomes represent the probabilities of future ones. This gives us an empirical distribution, but this too is very bad at providing the probabilities we want (though it can be better than the parametric distribution). A third option is to use Extreme Value Theory (EVT), which provides a parametric estimate of the tails of the distribution only.
3. Can we use the Normal distribution to calculate stock return probabilities?

3.1. The data exhibit excess kurtosis, i.e. fat tails.

We hear quite a lot about stock returns being approximately lognormally distributed, i.e.

ln(1 + r) ~ N(µ, σ)   (1)

where ln(1 + r) is the log return. We will refer to this simply as the return from now on. The following figures show the frequency of daily returns for the US and UK. They demonstrate that, to the eye at least, the lognormal is a reasonable approximation.

Figure 1: SP500
Figure 2: FT100

There are several things worth noting in these charts:

1. The returns are centred very close to zero. This is because they are actual returns from one day to the next. The means are in fact slightly positive and, grossed up to annual rates, are about 15% p.a.

2. The data display fat tails. We could observe these in (at least) two ways.

(a) We could fit a more general curve than the Normal, and we would see it hover above the Normal in the tails. (We'll do this later.)

(b) We could simply observe that there are several spikes where the Normal is effectively at zero (as these figures show).
Figure 3: SP500, left tail

3. These spikes might not look important, but if we were to use a standard significance test for the returns at these points, we would reject the null hypothesis that the returns are distributed according to the Normals drawn here.

4. More dangerously in a risk-management context, using the Normal makes us feel a lot safer than we actually are.

5. There are several ways to test whether an asset's returns follow a specific distribution. The Bera-Jarque test is a test against Normality specifically, while the Kolmogorov-Smirnov test can be used to test against any specified distribution.

3.2. So what do we do about it?

One possibility is to do nothing, but that is not very interesting, and it leads to errors in VaR calculations. Another is to use a simple distribution that makes a better job of matching the data, a Student-t with low degrees of freedom, for example:
Figure 4: t (- - - -) and Normal distributions.

The t looks much like the Normal, but it has fatter tails. Unfortunately this doesn't work too well either: the mass of observations close to the mean dominates the parameter estimates.
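The tests mentioned above, and the Student-t fit, can be sketched as follows. This is an illustrative sketch only, not code from the notes: it uses simulated fat-tailed "returns" in place of the SP500 data.

```python
# Illustrative sketch (not from the notes): testing Normality and fitting a
# Student-t, using simulated fat-tailed "returns" in place of the SP500 data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
returns = stats.t.rvs(df=3, size=10_000, random_state=rng) * 0.01  # fat-tailed

# Bera-Jarque: a test against Normality specifically (via skewness/kurtosis)
bj_stat, bj_p = stats.jarque_bera(returns)

# Kolmogorov-Smirnov: a test against any fully specified distribution
ks_stat, ks_p = stats.kstest(returns, "norm", args=(returns.mean(), returns.std()))

# Fit a Student-t; a low estimated df signals fat tails
df_hat, loc_hat, scale_hat = stats.t.fit(returns)
print(bj_p < 0.05, df_hat)
```

With data this fat-tailed, the Bera-Jarque p-value is effectively zero and the fitted degrees of freedom come out low, echoing the point in the text that the Normal is rejected while a low-df t matches the tails better.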
4. Estimating probability distributions.

4.1. Estimating a Normal distribution.

The Normal has only two parameters to estimate, µ and σ. With values for ˆµ and ˆσ we can construct the full Normal distribution. However, having these estimates does not imply that the true distribution of the returns actually is Normal.

Daily returns for the SP500 from 2 January 1957 to 26 April 2012 have the following estimated moments: ˆµ = 0.0244%, ˆσ = 1.0062%, for T = 13928 observations. If the returns were Normally distributed, the 1st percentile would be found from

(r_1% - 0.0244) / 1.0062 = -2.33   (2)
r_1% = 0.0244 - 2.33 x 1.0062 = -2.32%   (3)

So we should see (1% of 13928) = 139 observations lower than -2.32%. In fact there were 220 such observations.

Conclusion: There are more observations in the lower tail of the actual distribution than we should expect if the data are generated by a Normal distribution. 1% of the observations lie below -2.71%, i.e.

VaR(1%: empirical) = -2.71%
VaR(1%: Normal) = -2.32%

Note that these conclusions about the Normal are not based on a statistical test. Would we, for example, want to reject the Normal distribution if there were 140 rather than 139 observations below -2.32%? As it happens, formal statistical tests also reject Normality in this case.
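The arithmetic above can be reproduced directly from the quoted moments. A minimal sketch (not the author's code), using the figures reported in the notes:

```python
# Sketch (not the author's code): the Normal-VaR arithmetic of Section 4.1,
# using the moments quoted in the notes for SP500 daily returns.
from scipy.stats import norm

mu, sigma, T = 0.0244, 1.0062, 13928    # per cent, per cent, observations

z_1pct = norm.ppf(0.01)                 # roughly -2.33
var_normal = mu + z_1pct * sigma        # 1st percentile under Normality
expected_tail_obs = int(0.01 * T)       # observations expected below it

actual_tail_obs = 220                   # count reported in the notes
empirical_tail_prob = actual_tail_obs / T

print(round(var_normal, 2), expected_tail_obs, round(100 * empirical_tail_prob, 2))
```

This reproduces the -2.32% Normal VaR, the 139 expected tail observations, and the 1.58% empirical tail frequency used in Section 5.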
5. Parametric vs. empirical distributions.

In the above example it is not impossible that the returns come from a N(0.0244, 1.0062) distribution, but the number of observations below -2.32% suggests that this is unlikely. So what next?

We can search for a parametric distribution that is not rejected by the data. Or we can assume that some of the empirical characteristics of the past observations will be reflected in future returns. In particular we usually use the frequency distribution (or "histogram"), as opposed to an estimated probability distribution.

The second of these makes use of the empirical distribution. For example, we would conclude that the probability of a loss in excess of 2.32% is equal to the frequency of these losses in the historical data, i.e. p = 220/13928 = 1.58%, rather than the 1% we get from the estimated Normal.

The main failing of empirical distributions is that they tell us nothing about the probability of outcomes that lie outside the sample of empirical observations.
6. Extreme value theory.

6.1. A quick summary.

EVT provides a method for fitting a sensible curve to the following observed histogram:

Figure 5: SP500, left tail, with fitted EVT and fitted Normal curves.

We have to do three things:

1. Pick a sensible curve.
2. Rearrange the histogram's data to simplify the estimation process.
3. Perform the estimation.
6.2. Back to work...

Extreme value outcomes are not necessarily more unusual than any other events: consider a uniform distribution. In financial data, however, it turns out that they are, and this makes it very difficult to assess their probabilities. In particular, extreme outcomes tend not to fit the distributions that do quite well at assessing the probabilities of ordinary outcomes; see the SP500 charts we looked at earlier. Since these events are also among the most dangerous, extreme value theory (EVT) has been adopted to deal with them.

Hull (2012) summarises EVT's role rather well: "[EVT] is a way of smoothing and extrapolating the tails of an empirical distribution." This is true of fitting most parametric distributions of course (why?), but EVT does a better job with the tails.

6.3. Sensible curves.

We have seen that fitting a Normal to SP500 returns leads to underestimation of the tail probabilities. EVT provides a theoretical distribution that fits the tails much better, although it typically makes a mess of fitting the rest of the distribution. Does this matter? No, because we are interested only in tail risk here.
Why is this better than the empirical distribution? Because it is continuous, and can be extended beyond the most extreme empirical data point, making it more accurate for risk calculations.

The core of EVT in finance is the Generalised Pareto Distribution. This will supply our sensible curve. We present it in terms of the probability that a variable x will be less than a number X, given that x is greater than a number m. Note that this is the cumulative distribution and not the density function. The numerical examples we saw earlier were presented in terms of the density function. We will have to take this change of approach into account when we perform the EVT estimation.

Generalised Pareto Distribution (GPD):

P(x < X | x > m) = GPD(X)   (5)

GPD(X) = 1 - [1 + φ(X - m)/β]^(-1/φ),   φ ≠ 0   (6)
       = 1 - exp[-(X - m)/β],           φ = 0   (7)

where β > 0; the distribution starts at x = m. Note that x is the random variable, and X represents a specific number; many texts reverse these definitions. We return to the relative merits of the GPD and its restricted (Pareto) version later. For now all we need to know is that EVT based on the GPD can describe many types
of data. For financial returns in particular, the PD restrictions (i.e. fat tails) seem to hold.

The value of φ determines which family GPD(X) becomes:

  φ > 0: Frechet
  φ = 0: Gumbel
  φ < 0: Weibull

For financial returns we expect to find φ > 0, since the Frechet exhibits fat tails. Financial data are often consistent with a further restriction, m = β/φ, and when this is applied to the GPD/Frechet we get the Pareto Distribution.

Pareto Distribution (PD): a restricted version of the GPD that displays fat tails.

P(x < X | x > m) = PD(X)   (8)

PD(X) = 1 - (m/X)^α,   x ≥ m, α > 0   (9)
      = 1 - K X^(-α),  where K = m^α   (10)

where the restriction is m = β/φ, and α = 1/φ > 0.

We will estimate a GPD and a PD in what follows.
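The GPD cdf in (6) can be written down directly. A minimal sketch (not from the notes, with illustrative parameter values) that checks the formula against scipy's `genpareto`, which uses the same shape/location/scale form:

```python
# Sketch (not from the notes): the GPD cdf of equation (6), checked against
# scipy's genpareto, which uses the same (shape, location, scale) form.
import numpy as np
from scipy.stats import genpareto

def gpd_cdf(X, m, beta, phi):
    """P(x < X | x > m) for the Generalised Pareto Distribution."""
    if phi == 0.0:
        return 1.0 - np.exp(-(X - m) / beta)
    return 1.0 - (1.0 + phi * (X - m) / beta) ** (-1.0 / phi)

m, beta, phi = 160.0, 32.5, 0.44        # illustrative values only
X = np.array([170.0, 200.0, 300.0])

ours = gpd_cdf(X, m, beta, phi)
scipys = genpareto.cdf(X, c=phi, loc=m, scale=beta)
print(np.allclose(ours, scipys))
```

The two agree exactly, so later fitting could equally be done by hand or with the library routine.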
Figure 6: Frechet distributions.

7. What makes these curves sensible? The theory in EVT.

7.1. The theory in general terms: Gnedenko's result.

The key result of the theory is that, under certain conditions, the tail of the cdf of many random variables converges on a specific shape as we move further into the tail. This shape is the familiar smooth decline of the tail probability towards zero that we see in the Normal, t, χ², etc. We can apply this result even when we do not know which distribution the variables come from.
So, in finance, even though the true distribution of asset returns may remain a mystery, we can say something about the shape of the distribution in the tails, which is what matters for Value at Risk etc. And, provided that we have enough data, EVT allows us to estimate a shape that will approximate the true distribution whatever it is (subject to some conditions). More specifically, Gnedenko's result states that for a wide class of distributions the upper tail converges on the GPD above.

7.2. The general EVT shape: the generalised Pareto distribution.

Textbooks typically present this material in terms of losses expressed as positive numbers; we do the same here. For many distributions the part to the right of a value u converges on the generalised Pareto distribution. I.e. if F_u(Y) is the probability that x lies between u and u + Y then, as u increases,

F_u(Y) → G_φ,β(Y) = 1 - [1 + φ(Y - u)/β]^(-1/φ),   φ ≠ 0   (11)
                  = 1 - exp[-(Y - u)/β],            φ = 0   (12)

I.e. as u gets larger, and we move into the tail, the distribution converges on the GPD. We use slightly different notation from (7) above to emphasize that this is an approximation to the GPD that improves as u gets larger.
In the limit, Y becomes X, and u becomes m as in (7). The parameters φ, β can then be estimated using non-linear methods, for any choice of u (or m).

7.3. Do we actually fit the density function to the histogram?

A histogram presents the observations grouped into buckets. A density function presents the probabilities for every possible individual observation, i.e. it does not group them into buckets. While the buckets approach is convenient for diagrams, it would be complicated (though not impossible) for estimation. We actually fit the probability (more precisely, the value of the density function) of getting each observation.

The cumulative distribution for the GPD is

GPD(X) = 1 - [1 + φ(X - m)/β]^(-1/φ),   φ ≠ 0   (13)

for which the density is

gpd(X) = (1/β)[1 + φ(X - m)/β]^(-1/φ - 1)   (14)

which we can think of, loosely, as the probability of observing x = X.

7.4. To estimate the GPD do the following:

Order the sample of n loss-making returns (x), expressed as positive numbers, starting with the largest. Choose u: usually this will be the empirical 95th percentile, so select the losses above this level.
If we want to find VaR(q%), for example, we will need to choose u to the left (closer to the mean) of this percentile. The 95th percentile satisfies this condition for VaR(1%). Call the number of observations beyond u n_u, so

n_u / n = 0.05   (15)

Estimate the parameters of GPD(X) using maximum likelihood. The likelihood for the sample is

L = Π_{i=1..n_u} gpd(x_i)   (16)

ln(L) = Σ_{i=1..n_u} ln[gpd(x_i)]   (17)
      = Σ_{i=1..n_u} ln[(1/β)(1 + φ(x_i - u)/β)^(-1/φ - 1)]   (18)

We have to find values of φ, β that maximise ln(L). For this we need a numerical search process, and we have to supply a couple of starting values to get the search started.

Hull (2012), p. 316, gives the following data for losses on an example portfolio (from a total sample of 500 losses):
Loss ($000s)   ln[(1/β)(1 + φ(x_i - u)/β)^(-1/φ - 1)] at φ = 0.3, β = 40
477.841        -8.97
345.435        -7.47
282.200        -6.51
...            ...
160.778        -3.71

where the second column has been calculated using initial values of φ = 0.3 and β = 40. The value of u is 160. This gives n_u = 22, and n_u/n = 22/500 = 4.4%, i.e. we are fitting the distribution to the largest 4.4% of losses. We then use the search process to find the φ, β that maximise the sum of the final column (i.e. push the sum towards zero from below). The optimising values in this case are φ = 0.436 and β = 32.532. The estimated φ > 0 confirms the presence of fat tails in the data.

7.5. Using the estimated parameters of the Generalised Pareto Distribution.

VaR(q) and expected shortfall ES(q), where q is expressed as the weight in the tail (e.g. 1%), are

VaR(q) = u + (β/φ)[(n q / n_u)^(-φ) - 1]   (19)

ES(q) = [VaR(q) + β - φu] / (1 - φ)   (20)

7.6. A simpler EVT shape: the Pareto distribution.

For a lot of financial data the Pareto distribution performs well:

PD(X) = 1 - K X^(-α)   (21)
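The maximum-likelihood step, and the VaR/ES formulas of Section 7.5 applied to Hull's published estimates, can be sketched as follows. This is not the author's code: the fitting part uses simulated excesses (Hull's 500 raw losses are not reproduced here), while the VaR/ES part uses the estimates quoted above (u = 160, φ = 0.436, β = 32.532, n = 500, n_u = 22).

```python
# Sketch (not the author's code): fitting GPD tail parameters by maximum
# likelihood, then applying Hull's VaR/ES formulas to his published estimates.
import numpy as np
from scipy.stats import genpareto

# --- Maximum likelihood on simulated excesses over a threshold ---
rng = np.random.default_rng(1)
excesses = genpareto.rvs(c=0.4, scale=35.0, size=5000, random_state=rng)
phi_hat, _, beta_hat = genpareto.fit(excesses, floc=0.0)  # fix location at 0

# --- VaR and ES from Hull's estimates for the example portfolio ---
u, phi, beta, n, n_u, q = 160.0, 0.436, 32.532, 500, 22, 0.01
var_q = u + (beta / phi) * ((n * q / n_u) ** (-phi) - 1.0)
es_q = (var_q + beta - phi * u) / (1.0 - phi)
print(round(var_q, 1), round(es_q, 1))  # roughly 227.7 and 337.8
```

The recovered phi_hat should sit close to the 0.4 used to simulate, and the VaR/ES figures match the order of magnitude one would expect from Hull's worked example.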
where

K = (β/φ)^α = m^α   (22)
α = 1/φ   (23)

The parameter α is known as the tail index. The smaller is α (equivalently, the larger is φ = 1/α), the fatter the tail; for the Normal, φ = 0, so its tail is thinner than any power law. Tail indices can be found for both parametric and empirical distributions, and the tail index can then be used to calculate the VaR for either distribution. If the data satisfy the restriction m = β/φ, it is more efficient to estimate a Pareto distribution than the GPD. And we can fit the cumulative distribution function quite easily in this case.

7.7. Application of the Pareto distribution to SP500 daily returns.

We use the same sample of SP500 data as in Section 4 above. The cumulative frequencies of deviations on both sides of the mean are shown in Table 1, in which the final column is (1 - Φ(Col 1)) x 100. We can estimate a PD equation to fit the actual column in Table 1 as follows:
Deviation          Percentage   Percentage
(below the mean)   actual       Normal distn
> 1 s.d.           10.41        15.87
> 2 s.d.            2.33         2.23
> 3 s.d.            0.71         0.14
> 4 s.d.            0.28         0.005
> 5 s.d.            0.13         2.4 x 10^-5
> 6 s.d.            0.08         1.0 x 10^-7
> 7 s.d.            0.04         1.3 x 10^-10
> 8 s.d.            0.03         0
> 9 s.d.            0.02         0
> 10 s.d.           0.01         0

Table 1: Actual and Normal one-sided frequencies.

X will be the value of x measured in standard deviations. This scaling is not important: we could choose the units of X arbitrarily. The probability that x exceeds X is K X^(-α), which corresponds to the percentages in Table 1. Take logs of this probability to produce a linear equation for estimation, i.e.

ln(Pr(x > X)) = ln(K) - α ln(X)   (24)

Construct the ln data for equation (24) using the first 7 observations from Table 1 (we keep the other 3 for out-of-sample tests); see Table 2.
1 2 3 4 X ln(x) P rop(x > X) ln[p rop(x > X)] in s.d. (Actual) = ln(col(3)) 1 0.00 0.1041-2.26 2 0.69 0.0233-3.76 3 1.10 0.0071-4.95 4 1.39 0.0028-5.89 5 1.61 0.0013-6.65 6 1.79 0.0008-7.10 7 1.95 0.0004-7.84 Table 2: Data for fitting equation (24). Plot ln(x) (column 2) against ln[p rop(x > X)] (column 4): Figure 1. These are the 2 variables for the regression. Figure 7: x axis = ln(x), y axis = ln[prop(x>x)]. 22
The plot is approximately linear, which is evidence in favour of our using the Pareto. Select the observation beyond which the actual frequencies appear most linear: observation 3 in this case. So we run the regression using observations 3 to 7. This bit is art, not science.

OLS using the log-linear observations generates

ln(K-hat) = -1.300   (25)
K-hat = 0.27   (26)
α-hat = 3.31   (27)

Most financial time series produce an α-hat between 3 and 5. From the parameter estimates we can construct the following cumulative probabilities for comparison:

Deviation   Percentage:   Percentage:    Percentage:
in s.d.     fitted        Normal distn   actual
> 2         2.72                         2.33
> 2.33      1.64                         1.98
> 3         0.71                         0.71
> 4         0.27                         0.27
> 4.50      0.19          6.8 x 10^-6
> 5         0.13                         0.13
> 8.00      0.03          0              0.03
> 9.00      0.02          0              0.02
> 10.00     0.01          0              0.01
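The log-log regression above is a one-liner in practice. A sketch (not the author's code) using observations 3 to 7 of Table 2; the estimates come out close to, but not exactly at, the notes' values because the tabled proportions are rounded:

```python
# Sketch (not the author's code): the log-log OLS fit of equation (24)
# on observations 3 to 7 of Table 2.
import numpy as np

X = np.array([3.0, 4.0, 5.0, 6.0, 7.0])                    # deviations in s.d.
prop = np.array([0.0071, 0.0028, 0.0013, 0.0008, 0.0004])  # actual frequencies

slope, intercept = np.polyfit(np.log(X), np.log(prop), 1)
alpha_hat = -slope           # tail index, roughly 3.3
K_hat = np.exp(intercept)    # roughly 0.28

# Fitted tail probability at 10 s.d. (an extrapolation beyond the fit range):
p_10sd = K_hat * 10.0 ** (-alpha_hat)
print(round(alpha_hat, 2), round(K_hat, 2), round(100 * p_10sd, 2))
```

The extrapolated probability at 10 s.d. comes out at about 0.01%, in line with the fitted column of the comparison table.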
The value for 4.5 s.d. represents an interpolation of the fitted distribution. The values for 8, 9 and 10 s.d. represent extrapolations of the fitted distribution. Figures 8 and 9 show that the power distribution fits the empirical data better than the Normal in the tails.
Figure 8: Fitted distributions: 1 to 7 sigma.

Figure 9: Fitted distributions: 4 to 15 sigma.
The extrapolated power distribution does not beat the empirical distribution in these graphs because of the Black Monday loss of 23σ. Even the Pareto underestimates the probability of a loss of this size.
8. References.

Hull's Risk Management and Financial Institutions, 3rd ed. (2012), is an excellent source for the material covered here. Danielsson's Financial Risk Forecasting (2011) also covers this material, but at a more advanced level. Both books are published by Wiley Finance. Working with both simultaneously can be confusing because their notations differ: Hull is the better place to start.