Testing approximate normality of an estimator using the estimated MSE and bias with an application to the shape parameter of the generalized Pareto distribution

J. Martin van Zyl

Abstract

In this work the normality in small samples of three often used estimators of the shape parameter of the generalized Pareto distribution is investigated. Normality implies that hypotheses can be tested with confidence in smaller samples. It is often not easy to choose between estimators on the basis of the MSE and bias estimated in simulation studies, and normality gives the added advantage that hypotheses concerning the parameter can be tested in small samples. A confidence interval for the tail index of the S&P500 index is found by applying the results to estimators of the generalized Pareto distribution.

Keywords: Estimator, Bias, Mean-squared error, normality, generalized Pareto distribution

1 Introduction

Three well-known estimators of the shape parameter of a generalized Pareto distribution (GPD) are investigated to see whether they can be used to test hypotheses assuming normality in small samples, and to get an indication of the sample sizes at which it is acceptable to assume normality. It can often be proven that an estimator is asymptotically normally distributed, but that statement is vague: for some estimators it can mean sample sizes of thousands of observations, which are often not available in practical problems. The shape parameter and its inverse, the tail index, are important in risk analysis and are used to check whether data are heavy-tailed. The probability weighted moment (PWM) estimator, the maximum likelihood (ML) estimator and the empirical Bayes estimator proposed by Zhang and Stephens (2009) will be investigated. Samples from the three-parameter GPD will be simulated and transformed to the two-parameter GPD by subtracting the location parameter, estimated as the smallest observation in a sample.
It was found that the distribution of the Zhang and Stephens (2009) estimator is closest to normal in small samples, so that in smaller samples it can be used with confidence to test hypotheses. The ML estimator is only normally distributed in larger samples of say n=5 and above and the PWM estimator is not even close to normally distributed in samples of size n=5. The Jarque-Bera test (Jarque and Bera, 1987) and the Lilliefors test for normality (Lilliefors, 1967) will both be applied. It should be noted that normality as such does not imply good results with respect to bias and mean squared error (MSE), but all three of these
estimators were shown to perform well (Zhang and Stephens, 2009; de Zea Bermudez and Kotz, 2010). Some of the estimators are asymptotically normally distributed with known asymptotic variances, but these results do not carry over to small samples (Hosking and Wallis, 1987). Using these results it is shown in an application in section 4 that, with 95% certainty, the tail index of the S&P500 is at least 2.7196, and thus that the variance is finite.

A new test, which can be applied specifically in simulation studies where the true parameter is known, is also proposed and used. The method is exact although very simple: under the hypothesis of normality it involves no approximations when performing tests. When a true null hypothesis is tested with the true parameter known, the hypothesis can still be rejected if the data are not distributed according to the distribution on which the test is based. In simulation studies where the true parameters are known, this can be used to test normality. A t-test is based on the assumption of normally distributed observations when calculating the test statistic.

Consider a general estimation problem in a simulation study for a parameter $\theta$ of a specified distribution. After $m$ samples of size $n$ each have been simulated with true parameter $\theta_0$, estimates $\hat\theta_1, \dots, \hat\theta_m$ are available with mean $\bar\theta = (1/m)\sum_{j=1}^{m}\hat\theta_j$ and estimated variance $S_{\hat\theta}^2 = (1/(m-1))\sum_{j=1}^{m}(\hat\theta_j - \bar\theta)^2$, and the estimated variance of $\bar\theta$ is $S_{\hat\theta}^2/m$. To test the normality of the estimator, the following statistic can be used:

$$t = \sqrt{m}\,(\bar\theta - \theta_0)/S_{\hat\theta}. \qquad (1)$$

If the estimates $\hat\theta_1, \dots, \hat\theta_m$ are normally distributed, the t-statistic is central t distributed with $\nu = m - 1$ degrees of freedom. If this statistic is used to test the hypothesis $H_0: \theta = \theta_0$, the hypothesis will be rejected with high probability only if the estimator is not normally distributed, because the hypothesis is true and the estimated variance correct.
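The statistic in (1) can be illustrated with a minimal sketch. It assumes the variance is computed with divisor m - 1, and the estimator simulated here (the mean of n standard-normal draws shifted by theta0) is a hypothetical stand-in chosen only because it is exactly normal; it is not one of the GPD estimators studied in this paper.

```python
import math
import random


def normality_t_statistic(estimates, theta0):
    """t-statistic of equation (1): sqrt(m) * (mean - theta0) / S."""
    m = len(estimates)
    mean = sum(estimates) / m
    # Estimated variance of the estimates, divisor m - 1.
    s2 = sum((e - mean) ** 2 for e in estimates) / (m - 1)
    return math.sqrt(m) * (mean - theta0) / math.sqrt(s2)


def simulate_estimates(m, n, theta0, rng):
    """Hypothetical estimator: sample mean of n N(theta0, 1) draws, repeated m times."""
    return [sum(rng.gauss(theta0, 1.0) for _ in range(n)) / n for _ in range(m)]


rng = random.Random(12345)
estimates = simulate_estimates(m=1000, n=10, theta0=0.5, rng=rng)
t = normality_t_statistic(estimates, theta0=0.5)
print(f"t = {t:.3f} on {len(estimates) - 1} degrees of freedom")
```

For a normal and unbiased estimator such as this one, the printed value should usually fall within the central range of a t distribution with m - 1 degrees of freedom; a large absolute value would point to non-normality or bias.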
This is exact under the hypothesis of normality, and the size of the statistic also gives an indication of the distance between the distribution of the estimator and the normal distribution. It will be shown in section 2 that this statistic can be written in terms of the estimated bias and MSE at sample size n, which means that previously published results where the MSE and bias were reported can be used to check the normality of an estimator.

An approximate variance can also be assessed at a sample size where the estimator has been shown to be normally distributed. Say it is established or known that at sample size n an estimator is normally distributed, and there is an expression for the variance (maybe an asymptotic variance), say $\hat\sigma_a^2$. If the expression for the variance is correct, and t-statistics are formed under the true null hypothesis, $t_j = (\hat\theta_j - \theta_0)/\hat\sigma_{aj}$, $j = 1, \dots, m$, the statistics should, for large n, be approximately
normally distributed with mean 0 and variance 1. If the same principle is applied, the statistic $t^* = \sqrt{m}\,\bar t/S_t$, with $\bar t = (1/m)\sum_{j=1}^{m} t_j$ and $S_t^2 = (1/(m-1))\sum_{j=1}^{m}(t_j - \bar t)^2$, should have a t distribution with $m - 1$ degrees of freedom. Since m would almost always be large in a simulation study, the statistic is approximately normally distributed and can also be compared against a standard normal critical value. If the test statistic computed with the approximate variances is not standard normally distributed, a bootstrap estimate of the variance of the estimator must be calculated to perform hypothesis testing in practical problems. Both aspects, the distribution and the variance of the statistic, are important when testing hypotheses. An application of the t-statistic to test normality using previously published results on estimators of the GPD parameters, where the MSE and bias were reported, will also be given.

2 Testing normality of estimators using the MSE and bias

It is first shown that the t-statistic can be written in terms of the estimated mean squared error (MSE) and bias calculated from the simulated estimation results. This form can be used to check the normality of an estimator using the MSE and bias. Consider again the estimators as described in section 1. The performance of an estimator is tested on the basis of m simulated samples of size n each, resulting in estimates $\hat\theta_1, \dots, \hat\theta_m$ of the true parameter $\theta_0$. It will be assumed that the estimators are asymptotically unbiased. Let $\bar\theta = \sum_{j=1}^{m}\hat\theta_j/m$ denote the sample mean of $\hat\theta_1, \dots, \hat\theta_m$, let the estimated variance of $\hat\theta$ be $S^2 = \hat\sigma^2(\hat\theta) = \sum_{j=1}^{m}(\hat\theta_j - \bar\theta)^2/(m-1)$, so that the estimated variance of $\bar\theta$ is $\hat\sigma^2(\bar\theta) = S^2/m$. The estimated bias and MSE using the true parameter are respectively $B(\hat\theta) = \bar\theta - \theta_0$ and $\mathrm{MSE} = (1/m)\sum_{j=1}^{m}(\hat\theta_j - \theta_0)^2$. It follows that

$$\mathrm{MSE} = \frac{1}{m}\sum_{j=1}^{m}(\hat\theta_j - \theta_0)^2 = \frac{\nu}{m}S^2 + (\bar\theta - \theta_0)^2 = \frac{\nu}{m}S^2 + B^2(\hat\theta), \qquad \nu = m - 1.$$
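The decomposition of the MSE follows by adding and subtracting the sample mean inside the square. A short derivation, writing m for the number of simulated samples, $\theta_0$ for the true parameter, $\nu = m - 1$, and $S^2$ for the variance of the estimates with divisor $\nu$, is:

```latex
\begin{align*}
\sum_{j=1}^{m} (\hat\theta_j - \theta_0)^2
  &= \sum_{j=1}^{m} (\hat\theta_j - \bar\theta)^2
   + 2(\bar\theta - \theta_0)\sum_{j=1}^{m} (\hat\theta_j - \bar\theta)
   + m(\bar\theta - \theta_0)^2 \\
  &= \nu S^2 + m\,B^2(\hat\theta),
\end{align*}
```

since the cross term vanishes because $\sum_{j}(\hat\theta_j - \bar\theta) = 0$. Dividing both sides by m gives $\mathrm{MSE} = (\nu/m)S^2 + B^2(\hat\theta)$.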
Thus $S^2 = (m/\nu)\,(\mathrm{MSE} - (\bar\theta - \theta_0)^2)$ and $\hat\sigma^2(\bar\theta) = \hat\sigma^2(\sum_j \hat\theta_j/m) = S^2/m = (\mathrm{MSE} - B^2(\hat\theta))/\nu$. If $\hat\theta_1, \dots, \hat\theta_m$ are normally distributed under the true hypothesis, then the statistic in (1) is exactly t distributed and can be written as

$$t = \frac{\bar\theta - \theta_0}{\hat\sigma(\bar\theta)} = \frac{B(\hat\theta)}{\sqrt{(\mathrm{MSE} - B^2(\hat\theta))/\nu}} = \frac{\sqrt{\nu}\,B(\hat\theta)}{\sqrt{\mathrm{MSE} - B^2(\hat\theta)}} \sim t_\nu. \qquad (2)$$

The number of simulated samples will mostly be large, more than say 5 samples generated, thus the t-statistic is approximately standard normally distributed, $z \sim N(0,1)$ as $\nu \to \infty$, if the estimator is normally distributed, and the test statistic is

$$z = \sqrt{\nu}\,(\bar\theta - \theta_0)/\sqrt{\mathrm{MSE} - B^2(\hat\theta)}. \qquad (3)$$

Rejection of the hypothesis $H_0: \theta = \theta_0$ will imply with high probability that the sample of estimates is not normally distributed. The approximate distribution of the statistic z when the $\hat\theta_j$ are not normally distributed and the variance is known has attracted much attention, and the best-known approximation in such a case is the Edgeworth expansion for large samples,

$$f(z) = \phi(z)\left(1 + \frac{\rho_3 H_3(z)}{6\sqrt{n}} + \frac{3\rho_4 H_4(z) + \rho_3^2 H_6(z)}{72\,n}\right) + O(n^{-3/2}),$$

where n is the sample size, $\rho_r = \kappa_r/\kappa_2^{r/2}$ are the standardized cumulants and $\kappa_r$ denotes the r-th cumulant. $H_r(z) = (-1)^r \phi^{(r)}(z)/\phi(z)$ denotes a Hermite polynomial of order r. Using these results a sample approximation of the distribution is (Kendall, Stuart and Ord, 1987)

$$f(z) \approx \phi(z)\left(1 + \frac{3\hat\rho_4 - 5\hat\rho_3^2}{24\,n}\right),$$

up to a remainder of smaller order in n. The bracketed correction is the value of the expansion at z = 0, where $H_3(0) = 0$, $H_4(0) = 3$ and $H_6(0) = -15$. Expression of the series in terms of the t-distribution function was considered in the paper of Finner and Dickhaus (2010). This can be considered as a way to calculate the power for a specific estimator for large sample sizes. It can be seen that for large sample sizes
the normalized ratio can be considered as an indication of how close the distribution is to a normal distribution, since theoretically $f(z) = \phi(z)$ if z is normal.

The GPD (Johnson et al., 1994) distribution function is

$$F(x) = 1 - \left(1 + (\xi/\sigma)(x - \mu)\right)^{-1/\xi}, \qquad x \ge \mu,\ \sigma > 0,\ \xi > 0,$$

where $\sigma$ is a scale parameter, $\xi$ the shape parameter and $\mu$ the location parameter. The shape parameter determines how heavy the tail is, and moments of order less than the index $\alpha$, where $\alpha = 1/\xi$, are finite. Using the PWM procedure means that the parameters are estimated as

$$\hat\xi = 2 - \hat a_0/(\hat a_0 - 2\hat a_1), \qquad \hat\sigma = 2\hat a_0 \hat a_1/(\hat a_0 - 2\hat a_1),$$

with

$$\hat a_0 = \bar x, \qquad \hat a_1 = \frac{1}{n}\sum_{j=1}^{n} x_{(j)}(1 - p_j), \qquad p_j = (j - 0.35)/n,$$

for a sample of size n from the two-parameter GPD, with $x_{(1)} \le \dots \le x_{(n)}$ the order statistics (de Zea Bermudez and Kotz, 2010).

This section will be in two parts: a simulation testing normality for various sample sizes and shape parameters, and a second part in which the simulation results given in the paper of Hosking and Wallis (1987) are used to show how previously published results can be used to check the normality of three estimators in small samples.

3 A simulation study to test normality

In section 3.1 three ways to test normality are used for three estimators, and in section 3.2 it is shown how normality can be tested with a t-statistic when previously published simulation studies are available in which the MSE and bias of an estimator were reported.

3.1 Checking normality of the estimators of the shape parameter of a GPD

In this section a simulation study is conducted to check the approximate normality of the PWM, Zhang and Stephens (2009) and two-parameter ML estimators of the shape
parameter ξ. Data are simulated from the three-parameter GPD and transformed to the two-parameter GPD by subtracting the minimum sample value.

Jarque Bera PWM Zhang-Stephens ML Lilliefors 5 5.3.3.5 5.95.874 5.44.496 5 5.5.489.78.7 5 5.4 5. 5.5.35.646.35 5.5.7 5 and t- statistic (.8498) (8.843) (6.8775) (4.5).4 (3.5756) (9.789) (6.493) (.69) (9.489) (6.657) (3.869) (3.38) (3.8) (.5493) (.563) Jarque Bera Lilliefors and t-statistic ξ =.5, σ =..638.546.54 (.39).39.474.769 (.778).77.783.78 (.33).45.73.748 (.97).76.679.9643 (.448) ξ =.5, σ =..663.677 (.45).6.476.7969 (.575).398.87.697 (.3966).3.495.9874 (.58).65.3.63 (.4) ξ =.75, σ =..44.34 (.75).87.88.88 (.3536).537.43. (3.).5.44.8437 (.97).8.7.834 (.95) Jarque Bera Lilliefors.4..36.34.867.764.534.644.87.766.5.54.736.48.87.374.44.63.68.9..97.54.57.5.359.788.73 and t- statistic (.8354) (8.545) (6.3). (3.853). (3.34) (9.56) (6.966) (4.4957).6 (.7493).494 (.684) (7.756) (5.6499) (5.44).336 (.93).556 (.597)

Table 1. Testing normality of three estimators of the shape parameter of a GPD, for various sample sizes and parameters. P-values are given for the Jarque-Bera and Lilliefors tests, with the t-statistic in parentheses; normality is rejected for P < 0.05. Based on m = 1000 samples generated for each sample size and set of parameters.
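The data-generation and PWM steps of this simulation design can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for Table 1: samples are drawn by inverting the GPD distribution function, the location is removed by subtracting the sample minimum, and the PWM estimates use the plotting positions (j - 0.35)/n. The function names, the seed and the parameter values are hypothetical.

```python
import random


def gpd_sample(n, xi, sigma, mu, rng):
    """Draw n values from the three-parameter GPD by inverting F."""
    # F(x) = 1 - (1 + xi*(x - mu)/sigma)**(-1/xi)
    # => x = mu + sigma*((1 - u)**(-xi) - 1)/xi for u uniform on [0, 1).
    return [mu + sigma * ((1.0 - rng.random()) ** (-xi) - 1.0) / xi for _ in range(n)]


def to_two_parameter(sample):
    """Subtract the sample minimum, used as the estimate of the location."""
    x_min = min(sample)
    return [x - x_min for x in sample]


def pwm_estimate(sample):
    """PWM estimates (xi_hat, sigma_hat) for the two-parameter GPD."""
    x = sorted(sample)
    n = len(x)
    a0 = sum(x) / n
    a1 = sum((1.0 - (j - 0.35) / n) * xj for j, xj in enumerate(x, start=1)) / n
    xi_hat = 2.0 - a0 / (a0 - 2.0 * a1)
    sigma_hat = 2.0 * a0 * a1 / (a0 - 2.0 * a1)
    return xi_hat, sigma_hat


rng = random.Random(2024)
raw = gpd_sample(n=20000, xi=0.2, sigma=1.0, mu=1.0, rng=rng)
xi_hat, sigma_hat = pwm_estimate(to_two_parameter(raw))
print(f"xi_hat = {xi_hat:.3f}, sigma_hat = {sigma_hat:.3f}")
```

Repeating the last three steps m times at a small n, and passing the resulting list of shape estimates to a normality test, would reproduce the structure of the study; the large n above is used only so that the estimate lands near the true value.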
From the above results it can be seen that the estimator of Zhang and Stephens (2009) is approximately normally distributed, especially for n>5, and that the PWM estimator is not close to normal for n=5. In Figure 1 a histogram of estimated parameters using the Zhang and Stephens (2009) method is shown. The skewness and kurtosis of the three estimators PWM, Zhang and Stephens and ML are respectively: .38, 3.4435, .67, 3.59, .4, 3.443.

Figure 1. Histogram of the estimated shape parameters (m =, n=5) from a GPD with ξ =.75, σ =., µ =. (x-axis: estimated shape parameter; y-axis: frequency).

In the next simulation study the testing of hypotheses using a bootstrap estimate of the variance is investigated. There are m = simulated statistics, each of the form $z = (\hat\xi - \xi_0)/\hat\sigma^*(\hat\xi)$, where $\hat\sigma^*(\hat\xi)$ is the bootstrap estimate of the standard deviation for each sample. The percentage of rejections at the 5% level is shown.

Testing $H_0: \xi = \xi_0$. Columns: n; mean of z (variance of z); % rejected.
ξ =., σ =. 5.475 (.996) 5. 5.4 (.) 6.4.89 (.74) 5.7 5.9 (.38) 6.8 5.4 (.45) 5.9
ξ =.4, σ =. 5.87 (.934) 4.9 5.859 (.436) 6.7.579 (.36) 5.7 5.883 (.467) 7.7 5.8 (.364) 5.8
ξ =.6, σ =. 5.66 (.77) 7.5 5.6 (.994) 8..974 (.95) 7.8 5.95 (.48) 6.9 5.698 (.33) 7.

Table 2. Rejection rate under the true hypothesis at the 5% level using the method of Zhang and Stephens and a bootstrap estimate of the variance.

3.2 Using previously published results to check normality

Hosking and Wallis (1987) conducted a simulation study, generating m =5 samples to estimate the MSE and bias of the maximum likelihood (ML), method of moments (MOM) and PWM estimators (tables and 3 in their paper). They reported the estimated bias and RMSE, where MSE is the square of RMSE. Using the equation

$$z = \sqrt{\nu}\,B(\hat\theta)/\sqrt{\mathrm{MSE} - B^2(\hat\theta)} = \sqrt{\nu}\,B(\hat\theta)/\sqrt{\mathrm{RMSE}^2 - B^2(\hat\theta)},$$

their results will be used to check for normality in small samples. Two simulation studies in which the estimator is possibly biased, and which use different numbers of generated samples, can be made comparable. Hosking and Wallis (1987) used m =5 samples, and to see whether the z values are of the same order of size as those of a simulation study with m = 1000, the statistic

$$z^* = \sqrt{999}\,B(\hat\theta)/\sqrt{\mathrm{MSE} - B^2(\hat\theta)}$$

can be calculated. Both are reported. As noted earlier, the test statistic should not increase as the number of simulated samples increases if the estimator is unbiased.

σ =, ξ =.4 N ML MOM PWM Bias RMSE * z, z Bias RMSE * z, z Bias RMSE 5.6.46 8.956 87.68.3.38.738 4.6745.8.36 5.5. 5.854 38.374.7. 7.38 43.74.7.9..5 3.88 3.65.3.3 4.544 44.74.4.4 z, z * 9.995 8.574 88.647.53 66.6667 9.48

Table 3. Test statistics for normality calculated using reported bias and MSE.
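The calculation behind Table 3 can be sketched in a few lines: z is computed from a reported bias and RMSE, and then rescaled to a different number of simulated samples. The function names and the input values below are hypothetical and are not taken from any published table.

```python
import math


def z_from_bias_rmse(bias, rmse, m):
    """z = sqrt(nu) * bias / sqrt(RMSE^2 - bias^2), with nu = m - 1."""
    nu = m - 1
    spread2 = rmse ** 2 - bias ** 2  # variance part of the MSE
    if spread2 <= 0:
        raise ValueError("RMSE must exceed the absolute bias")
    return math.sqrt(nu) * bias / math.sqrt(spread2)


def rescale_z(z, m_from, m_to):
    """Rescale z from a study with m_from samples to one with m_to samples."""
    return z * math.sqrt((m_to - 1) / (m_from - 1))


# Hypothetical reported values, for illustration only.
bias, rmse, m = 0.05, 0.25, 5000
z = z_from_bias_rmse(bias, rmse, m)
z_star = rescale_z(z, m_from=m, m_to=1000)
print(f"z = {z:.3f}, z* = {z_star:.3f}")
```

The rescaling only replaces the factor sqrt(m - 1), so z* is what the same bias and RMSE would give in a study with 1000 simulated samples.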
The estimators are not normally distributed for sample sizes less than n=. The adjusted $z^*$ can be compared to the values found in the simulation study in section 3.1, where m = 1000 samples were generated, and it will be seen that the test statistics calculated for the PWM estimators from the study of Hosking and Wallis (1987) are approximately equal.

4 A confidence interval for the log returns of the S&P500

Using the closing values of the S&P500 over the last 5 years, 2006 to 2010, 1258 log returns were calculated. A GPD was fitted to the largest 5 returns using a threshold of .6. The purpose is to calculate a 95% confidence interval for ξ using the approximate normality of the Zhang and Stephens estimator shown above and a bootstrap estimate of the variance of the estimator. The estimated values of σ using MOM, PWM and Zhang and Stephens are .46, .44, .44 respectively, and the estimated values of ξ using the three methods are .494, .75 and 0.1919. The bootstrap estimate of the standard deviation of the Zhang and Stephens estimate of ξ is 0.0897, resulting in a 95% confidence interval of [0.0161, 0.3677], with the implication that the index of the tail observations is not less than 2.7196 (= 1/0.3677). The P-P plot of the estimated distribution versus the empirical distribution function is shown in Figure 2.

Figure 2. P-P plot of the fitted GPD and empirical distribution for the 5 largest log returns from a sample of size 1258 (x-axis: fitted GPD cdf; y-axis: empirical cdf).
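The interval calculation in this section can be sketched as follows, taking the Zhang and Stephens estimate as 0.1919 and the bootstrap standard deviation as 0.0897 (values read from the text above and treated as assumptions of this sketch). The `bootstrap_sd` helper is a generic nonparametric bootstrap with a hypothetical signature; the paper does not state how its bootstrap was implemented.

```python
import random
import statistics


def bootstrap_sd(data, estimator, n_boot, rng):
    """Standard deviation of an estimator over n_boot resamples with replacement."""
    n = len(data)
    reps = [
        estimator([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    ]
    return statistics.stdev(reps)


def normal_interval(estimate, sd, z_crit=1.96):
    """Two-sided interval estimate +/- z_crit * sd, assuming approximate normality."""
    return estimate - z_crit * sd, estimate + z_crit * sd


# Assumed inputs: shape estimate and its bootstrap standard deviation.
lower, upper = normal_interval(0.1919, 0.0897)
min_tail_index = 1.0 / upper  # smallest tail index consistent with the interval
print(f"95% CI for xi: [{lower:.4f}, {upper:.4f}]")
print(f"tail index at least {min_tail_index:.2f}")
```

Because the tail index is the reciprocal of the shape parameter, the upper limit for ξ gives the lower limit for the index; an index above 2 corresponds to a finite variance.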
It can be seen that the line is close to straight and that the GPD gives a good fit.

5 Conclusions and Remarks

Normality in smaller samples of estimators of the shape parameter of a GPD was investigated, and the estimator of Zhang and Stephens (2009) was shown to be close to normally distributed even in small samples, also when excesses over a threshold are used for the estimation. Some tests are robust, especially the t-test, but knowledge of normality gives more confidence in the conclusions made. It was shown that previously published results where the estimated MSE and bias were given can be used to check the normality of estimators.

References

De Zea Bermudez, P. and Kotz, S. (2010). Parameter estimation of the generalized Pareto distribution - Part I. Journal of Statistical Planning and Inference, 140(6), 1353-1373.

Finner, H. and Dickhaus, T. (2010). Edgeworth expansions and rates of convergence for normalized sums: Chung's 1946 method revisited. Statistics and Probability Letters, 80, 1875-1880.

Hosking, J.R.M. and Wallis, J.R. (1987). Parameter and quantile estimation for the generalized Pareto distribution. Technometrics, 29(3), 339-349.

Hossain, A. and Zimmer, W. (2003). Comparison of estimation methods for Weibull parameters: Complete and censored samples. Journal of Statistical Computation and Simulation, 73(2), 145-153.

Jarque, C.M. and Bera, A.K. (1980). Efficient tests for normality, homoscedasticity and serial independence of regression residuals. Economics Letters, 6(3), 255-259.

Jarque, C.M. and Bera, A.K. (1987). A test for normality of observations and regression residuals. International Statistical Review, 55(2), 163-172.

Johnson, N.L., Kotz, S. and Balakrishnan, N. (1994). Continuous Univariate Distributions, Volume 1. Wiley, New York.

Kendall, M., Stuart, A. and Ord, J.K. (1987). Kendall's Advanced Theory of Statistics. Charles Griffin and Company, London.

Lilliefors, H. (1967). On the Kolmogorov-Smirnov test for normality with mean and variance unknown. Journal of the American Statistical Association, 62, 399-402.

Zhang, J. and Stephens, M.A. (2009). A new and efficient estimation method for the generalized Pareto distribution. Technometrics, 51(3), 316-325.