On covariate smoothing for non-stationary extremes

On covariate smoothing for non-stationary extremes

Matthew Jones a, David Randell a, Kevin Ewans b, Philip Jonathan a
a Shell Projects & Technology, Manchester M RR, United Kingdom. b Sarawak Shell Bhd., 55 Kuala Lumpur, Malaysia.

Abstract: Numerous approaches are proposed in the literature for non-stationary marginal extreme value inference, including different model parameterisations with respect to covariate and different inference schemes. The objective of this article is to compare some of these procedures critically. We generate sample realisations from generalised Pareto distributions, the parameters of which are smooth functions of a single smooth periodic covariate, specified to reflect the characteristics of actual samples of significant wave height with direction considered in the recent literature. We estimate extreme value models using Constant, Fourier, B-spline and Gaussian Process parameterisations for the functional forms of generalised Pareto shape and adjusted scale with respect to covariate, and both maximum likelihood and Bayesian inference procedures. We evaluate the relative quality of inferences by estimating return value distributions for the response corresponding to a time period of ten times the assumed period of the original sample, and compare estimated return value distributions with the truth using Kullback-Leibler, Cramer-von Mises and Kolmogorov-Smirnov statistics. We find that Spline and Gaussian Process parameterisations estimated by maximum penalised likelihood using the back-fitting algorithm, or by Markov chain Monte Carlo inference using the mMALA algorithm, perform equally well in terms of quality of inference and computational efficiency, and generally perform better than the alternatives.

Keywords: extreme, covariate, non-stationary, smoothing, non-parametric, spline, Gaussian process, mMALA, Kullback-Leibler.
Introduction
Accurate estimates of the likely extreme environmental loading on an offshore facility are vital to enable a design that ensures the facility is both structurally reliable and economic. This involves estimating the extreme value behaviour of meteorological and oceanographic (metocean) parameters that quantify the various environmental loading quantities, primarily winds, waves and currents. Examples of such parameters are the significant wave height, the mean wind speed and the mean current speed, which provide measures of wave field or sea state intensity, wind strength and ocean current strength respectively. These parameters characterise the short-term variability of the phenomenon involved, which is assumed to be stationary over temporal scales of a few hours. The long-term variability of these parameters is however non-stationary, in particular with respect to time, space and direction. From a temporal point of view, metocean parameters generally have a strong seasonal variation with an annual periodicity, and longer-term variations due to decadal or semi-decadal climate variations. At any given location, the variability of a particular parameter is also dependent on direction; for example, wind forcing is typically stronger from some directions than others, and fetch and water depth effects can strongly influence the resulting magnitude of the waves. Clearly these effects will vary with location: a more exposed location will be associated with longer fetches, resulting in a more extreme wave climate. When estimating the long-term variability of parameters such as significant wave height, the non-stationary effects associated with e.g. direction and season can be incorporated by treating direction and season as covariates. The common practice is to perform extreme value analysis of hindcast data sets, which include many years of metocean parameters along with their associated covariates.
Such data sets have all the information needed for input to covariate analysis. Numerous authors have reported the essential features of extreme value analysis (e.g. Davison and Smith [9]) and the importance of considering different aspects of covariate effects. Carter and Challenor [3] consider estimation of annual maxima from monthly data, when the distribution functions of monthly extremes are known. Coles and Walshaw [] describe directional modelling of extreme wind speeds using a Fourier parameterisation. Scotto and Guedes-Soares [3] model the long-term time series of significant wave height with non-linear threshold models. Anderson, Carter, and Cotton [] report that estimates for -year significant wave height from an extreme
(Corresponding author: philip.jonathan@shell.com. Preprint submitted to Comput. Stat. Data Anal., Elsevier, June 3, 5.)

value model ignoring seasonality are considerably smaller than those obtained using a number of different seasonal extreme value models. Chavez-Demoulin and Embrechts [5] describe smooth extreme value models in finance and insurance. Chavez-Demoulin and Davison [] provide a straightforward description of a non-homogeneous Poisson model in which occurrence rates and extreme value properties are modelled as functions of covariates. Cooley, Naveau, Jomelli, Rabatel, and Grancher [7] use a Bayesian hierarchical model to characterise extremes of lichen growth. Renard, Lang, and Bois [] consider identification of changes in peaks over threshold using Bayesian inference. Fawcett and Walshaw [] use a hierarchical model to identify location and seasonal effects in marginal densities of hourly maxima for wind speed. Mendez, Menendez, Luceno, Medina, and Graham [7] consider seasonal non-stationarity in extremes of NOAA buoy records. Randell, Feld, Ewans, and Jonathan [9] discuss estimation of return values for significant wave height in the South China Sea using a directional-seasonal extreme value model. The objective of this article is to evaluate critically different procedures for estimating non-stationary extreme value models. We quantify the extent to which extreme value analysis of samples of peaks over threshold exhibiting clear non-stationarity with respect to covariates, such as those in Figure 1 or the simulation case studies in Section 4 below, is influenced by a particular choice of model parameterisation or inference method. We generate sample realisations from generalised Pareto distributions, the parameters of which are smooth functions of a single smooth periodic covariate, specified to reflect the characteristics of actual samples considered in the recent literature.
Then we estimate extreme value models using Constant, Fourier, B-spline and Gaussian Process parameterisations for the functional forms of the generalised Pareto parameters with respect to covariate, and both maximum likelihood and Bayesian inference procedures. We evaluate the relative quality of inferences by estimating return value distributions for the response corresponding to a time period of ten times the assumed period of the original sample, and compare estimated return value distributions with the truth using Kullback-Leibler (see e.g. Perez-Cruz []), Cramer-von Mises (see e.g. Anderson []) and Kolmogorov-Smirnov statistics. We cannot hope to compare all possible parameterisations, but choose four parameterisations found useful in our experience. Similarly, there are many competing approaches to maximum likelihood and Bayesian inference, and general interest in understanding their relative characteristics. For example, Smith and Naylor [] compare maximum likelihood and Bayesian inference for the three-parameter Weibull distribution. In this work, we choose to compare frequentist penalised likelihood maximisation with two Markov chain Monte Carlo (MCMC) methods of different complexities. Non-stationary model estimation is a growing field. There is a huge literature on still further possibilities for parametric (e.g. Chebyshev, Legendre and other polynomial forms) and non-parametric (e.g. Gauss-Markov random fields and radial basis functions) model parameterisations with respect to covariates. Moreover, in extreme value analysis, pre-processing of a response to near stationarity (e.g. using a Box-Cox transformation) is sometimes preferred. The outline of the paper is as follows. Section 2 illustrates applications of non-stationary extreme value analysis in the recent literature, motivating the model forms adopted subsequently. Section 3 outlines the different model parameterisations and inference schemes under consideration.
Section 4 describes the underlying model forms used to generate samples for inference, outlines the procedure for estimation of return value distributions and their comparison, and presents the results of those comparisons. Section 5 provides discussion and conclusions.

2. Motivating applications
Randell, Zanini, Vogel, Ewans, and Jonathan [] explore the directional characteristics of hindcast storm peak significant wave height with direction for locations in the Gulf of Mexico, North-West Shelf of Australia, Northern North Sea, Southern North Sea, South Atlantic Ocean, Alaska, South China Sea and West Africa. Figure 1 illustrates the essential features of samples such as these. The rate and magnitude of occurrence of storm events varies considerably between locations, and with direction at each location. There are directional sectors with effectively no occurrences, and there is evidence of rapid changes in characteristics with direction as well as of local stationarity with direction. Any realistic model for such samples needs to be non-stationary with respect to direction. The simulation case studies introduced in Section 4 are constructed to reflect the general features of the samples in Figure 1, with the advantage that the statistical characteristics of the case studies are known exactly, allowing objective evaluation and comparison of competing methods of model parameterisation and inference. The results of this study are of course relevant to any application of non-stationary extreme value analysis.

3. Estimating non-stationary extremes
3.1. Generalised Pareto model
We consider estimation of non-stationary marginal extreme value models. We observe a sample y = {y_1, ..., y_N} of peaks over threshold drawn independently from a generalised Pareto distribution, the parameters of which are functions of corresponding observed covariate

[Figure 1: Storm peak significant wave height on direction for locations worldwide. From right to left, top to bottom: Gulf of Mexico (GOM), North-West Shelf of Australia (NWS), Northern North Sea (NNS), Southern North Sea (SNS), South Atlantic Ocean (SAO), Alaska (Als), South China Sea (SCS) and West Africa (WAf). Panel titles give the location, the sample period and the storm peak sample size.]

values Θ = {θ_1, ..., θ_N}. The sample likelihood is a product of generalised Pareto (GP) likelihoods for each of the observations

f(y | Θ, ξ, σ, µ) = ∏_{i=1}^{N} f(y_i | ξ(θ_i), σ(θ_i), µ(θ_i)) = ∏_{i=1}^{N} (1/σ(θ_i)) [1 + (ξ(θ_i)/σ(θ_i)) (y_i − µ(θ_i))]^{−1/ξ(θ_i) − 1}

where ξ(θ) and σ(θ) are the shape and scale parameters as functions of covariate. We do not attempt to estimate the threshold function µ(θ), assuming it is zero for all covariate values. We also assume that the rate of occurrence ρ(θ) of exceedances of µ varies with covariate, but that ρ(θ) is known. It is computationally advantageous (see e.g. Cox and Reid [], Chavez-Demoulin and Davison []) to transform variables from (ξ, σ) to the asymptotically independent pair (ξ, ν), where ν(θ) = σ(θ)(1 + ξ(θ)). Inference therefore amounts to estimating the smooth functions ξ(θ) and ν(θ), although we usually choose to illustrate the analysis in terms of ξ(θ) and σ(θ).

3.2. Covariate parameterisations
To accommodate non-stationarity, we parameterise ξ and ν as linear combinations of unknown parameters β_ξ and β_ν respectively, where

ν(θ) = B_ν(θ) β_ν and ξ(θ) = B_ξ(θ) β_ξ

and B_ν(θ) and B_ξ(θ) are row vectors of basis functions evaluated at θ. We consider four different forms of basis function, corresponding to Constant (stationary), Fourier, Spline and Gaussian Process parameterisations for ξ(θ) and ν(θ), as described below. For each parameterisation, we also specify roughness matrices Q_· for · = ξ, ν to regulate the roughness of ·(θ) with respect to θ during inference. This ensures that the elements of β_· weight the individual basis functions in such a way that the resulting estimate is optimally smooth in some sense. The roughness penalty term takes the form R_· = λ_· β_·^T Q_· β_· for some roughness coefficient λ_·. The penalty is incorporated directly within a penalised likelihood for maximum likelihood inference, and within a prior distribution for β_· in Bayesian inference, as described in Section 3.3.
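The GP negative log-likelihood in the (ξ, ν) parameterisation can be sketched as follows. This is a minimal illustration, not the authors' code; the function name is ours, and it assumes ξ(θ_i) ≠ 0 throughout.

```python
import numpy as np

def gp_negloglik(y, xi, nu, mu=0.0):
    """Negative log-likelihood of generalised Pareto peaks over threshold.

    y, xi, nu are arrays of equal length: observation y_i has its own shape
    xi(theta_i) and adjusted scale nu(theta_i) = sigma(theta_i) * (1 + xi(theta_i)).
    Assumes xi != 0 for every observation (illustrative sketch only).
    """
    y, xi, nu = map(np.asarray, (y, xi, nu))
    sigma = nu / (1.0 + xi)                 # recover the usual GP scale
    z = (y - mu) / sigma
    t = 1.0 + xi * z
    if np.any(sigma <= 0) or np.any(t <= 0):
        return np.inf                       # outside the GP support
    return float(np.sum(np.log(sigma) + (1.0 / xi + 1.0) * np.log(t)))
```

In an inference loop, ξ and ν at each observation would be computed from the basis matrices, e.g. `xi = B_xi @ beta_xi`, before calling this function.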
Constant (stationary) parameterisation
In the Constant parameterisation, the values of ξ and ν do not vary with respect to θ. We therefore adopt a scalar basis function which is constant across all values of covariate, so that B_ν(θ) = B_ξ(θ) = 1, and corresponding roughness matrices Q_ν = Q_ξ = 0. We do not expect the return value distributions estimated under this parameterisation to fare well in general in our comparison, since samples are generated from non-stationary distributions. Quality of fit is expected to be poor, at least in some intervals of covariate. However, many practitioners continue to use stationary extreme value models in applications, perhaps with high thresholds to mitigate non-stationarity; inclusion of a stationary parameterisation therefore provides a useful point of reference for comparison.

Spline parameterisation
Under a Spline parameterisation, the vector of basis functions for each of ν and ξ is made up of p local polynomial B-spline functions with compact support, joined at a series of knots evenly spaced in the covariate domain (see e.g. Eilers and Marx []):

B_ν(θ) = B_ξ(θ) = (b_1(θ), ..., b_p(θ)).

We specify roughness matrices Q_ν = Q_ξ = D^T D, which penalise squared differences between adjacent elements of the coefficient vectors, where D is the (p − 1) × p first-order difference matrix with rows of the form (−1, 1, 0, ..., 0), (0, −1, 1, 0, ..., 0), and so on.

Fourier parameterisation
We use basis vectors composed of sine and cosine functions of n_p different periods:

B_ν(θ) = B_ξ(θ) = (1, cos θ, cos 2θ, ..., cos n_p θ, sin θ, sin 2θ, ..., sin n_p θ).
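The Spline difference penalty and the Fourier design matrix above can be constructed in a few lines. A sketch under our own naming (`difference_penalty`, `fourier_basis` are illustrative, not from the paper):

```python
import numpy as np

def difference_penalty(p, order=1):
    """Roughness matrix Q = D^T D, where D takes order-th differences of
    adjacent spline coefficients (the Eilers-Marx P-spline penalty)."""
    D = np.diff(np.eye(p), n=order, axis=0)   # (p - order) x p difference matrix
    return D.T @ D

def fourier_basis(theta, n_p):
    """Row-per-observation Fourier design matrix with columns
    1, cos(k*theta), sin(k*theta) for k = 1..n_p, so p = 2*n_p + 1."""
    theta = np.asarray(theta, dtype=float)[:, None]
    k = np.arange(1, n_p + 1)[None, :]
    return np.hstack([np.ones_like(theta), np.cos(k * theta), np.sin(k * theta)])
```

Note that a constant coefficient vector incurs zero penalty under `difference_penalty`, which is exactly the behaviour wanted: only departures from local constancy are penalised.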

The roughness matrix is computed by imposing a condition on the squared second derivative of the resulting parameter function. If we write

·(θ) = Σ_{k=0}^{n_p} (a_k cos kθ + b_k sin kθ)

where · = ξ or ν, and a_k and b_k are the elements of β_· corresponding respectively to the cosine and sine functions of period k, the roughness criterion from Jonathan, Randell, and Ewans [] becomes

R_· = (1/π) ∫ (∂²·(θ)/∂θ²)² dθ = Σ_{k=1}^{n_p} k⁴ (a_k² + b_k²),

such that the penalty matrix can be written as

Q_· = diag(0, 1⁴, 2⁴, ..., n_p⁴, 1⁴, 2⁴, ..., n_p⁴)

for the p = 2 n_p + 1 Fourier parameters (a_0, a_1, ..., a_{n_p}, b_1, ..., b_{n_p}).

Gaussian Process parameterisation
We use a Gaussian Process parameterisation [] for a set of p knots {θ̂_1, θ̂_2, ..., θ̂_p} in the covariate domain, and relate each covariate input to a knot using the basis vectors

B_ν(θ) = B_ξ(θ) = (I_1(θ), ..., I_p(θ))

where the indicator functions I_j(·) are defined as

I_j(θ) = 1 if |θ − θ̂_j| < |θ − θ̂_k| for all k ≠ j, and 0 otherwise.

Roughness matrices Q_· (· = ν, ξ) are defined by the coefficient correlation matrix V_· via Q_· = V_·^{−1}, with the elements of V_· drawn from a periodic squared exponential covariance function []

V_jk = exp( −2 sin²((θ̂_j − θ̂_k)/2) / r_·² )

where r_· are correlation lengths for each of the parameters, fixed to likely values by comparison with the covariate functions used to generate the data. Partitioning the covariate domain in this manner, as opposed to fitting a Gaussian Process parameterisation to each of the data inputs, greatly reduces the number of parameters to estimate, and is physically reasonable. Estimating a parameter for each data point would have made the computational burden for the Gaussian Process parameterisation significantly greater than that for any of the other parameterisations.

3.3.
Inference procedures
We consider two methods for estimating parameters and return value distributions for the models and parameterisations described above, namely (a) maximum penalised likelihood estimation with bootstrapping to quantify uncertainties, and (b) two forms of Bayesian inference using Markov chain Monte Carlo (MCMC). These are discussed below.

Maximum likelihood estimation
We use an iterative back-fitting optimisation (see Appendix) to minimise the penalised negative log likelihood L*(y; β_ξ, β_ν, λ_ξ, λ_ν) with respect to β_ξ and β_ν for given roughness coefficients λ_ξ and λ_ν, where

L*(y; β_ξ, β_ν, λ_ξ, λ_ν) = L(y; β_ξ, β_ν) + R_ξ + R_ν = L(y; β_ξ, β_ν) + λ_ξ β_ξ^T Q_ξ β_ξ + λ_ν β_ν^T Q_ν β_ν.

Here, L(y; β_ξ, β_ν) is the negative log sample GP likelihood from Section 3.1 expressed as a function of ξ and ν, and R_ξ and R_ν are additive roughness penalties. The values of λ_ξ and λ_ν are selected using cross-validation to maximise the predictive performance of the estimated model, and bootstrap resampling is used to quantify the uncertainty of parameter estimates. The original sample is resampled with replacement a large number of times, and inference repeated for each resample. We use the empirical distributions of parameter estimates and return values over resamples as approximate uncertainty distributions.
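The penalised objective and the bootstrap loop can be sketched as follows. This is an illustrative skeleton, not the paper's implementation: `negloglik` stands in for the model-specific GP likelihood, and `fit` for whatever estimator is being resampled.

```python
import numpy as np

def penalised_negloglik(negloglik, beta_xi, beta_nu, lam_xi, lam_nu, Q_xi, Q_nu):
    """L* = L + lam_xi * beta_xi' Q_xi beta_xi + lam_nu * beta_nu' Q_nu beta_nu."""
    return (negloglik(beta_xi, beta_nu)
            + lam_xi * beta_xi @ Q_xi @ beta_xi
            + lam_nu * beta_nu @ Q_nu @ beta_nu)

def bootstrap(fit, y, theta, n_boot=100, rng=None):
    """Refit the model to resamples (with replacement) of (y, theta) pairs.

    The spread of the returned estimates over resamples approximates the
    uncertainty distribution of the parameter or return value of interest.
    """
    rng = np.random.default_rng(rng)
    n = len(y)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)    # resample observation indices
        estimates.append(fit(y[idx], theta[idx]))
    return np.array(estimates)
```

In practice `fit` would itself minimise `penalised_negloglik` by back-fitting, with λ_ξ, λ_ν fixed at their cross-validated values.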

Bayesian inference
From a Bayesian perspective, all of β_ξ, β_ν, λ_ξ and λ_ν are treated as parameters to be estimated. Their joint posterior distribution given sample responses y and covariates Θ can be written

f(β_ξ, β_ν, λ_ξ, λ_ν | y, Θ) ∝ f(y | Θ, ξ, σ, µ) f(β_ν | λ_ν) f(β_ξ | λ_ξ) f(λ_ν | a_ν, b_ν) f(λ_ξ | a_ξ, b_ξ)

where f(y | Θ, ξ, σ, µ) is the sample GP likelihood from Section 3.1, and the prior distributions f(β_ν | λ_ν), f(β_ξ | λ_ξ), f(λ_ν | a_ν, b_ν) and f(λ_ξ | a_ξ, b_ξ) are specified as follows. Parameter smoothness of the GP shape and modified scale functions is encoded by adopting Gaussian priors for their vectors β_· of basis coefficients (· = ξ, ν), expressed in terms of parameter roughness:

f(β_· | λ_·) ∝ λ_·^{p/2} exp( −(λ_·/2) β_·^T Q_· β_· ).

The roughness coefficient λ_· can be seen, from a Bayesian perspective, as a parameter precision for β_·. It is assigned a Gamma prior distribution, which is conjugate with the Gaussian prior distribution for β_·. The values of the hyper-parameters are set such that the Gamma priors are relatively uninformative. The Bayesian inference can be illustrated by a directed acyclic graph in which hyper-parameters (a_ν, b_ν) and (a_ξ, b_ξ) govern precisions λ_ν and λ_ξ, which in turn govern coefficient vectors β_ν and β_ξ, which together generate the data y. Estimates for β_ξ, β_ν, λ_ξ and λ_ν are obtained by sampling the posterior distribution above using MCMC. We choose to adopt a Metropolis-within-Gibbs framework, where each of the four parameters is sampled in turn conditionally on the values of the others. The full conditional distributions f(λ_ξ | β_ξ) and f(λ_ν | β_ν) of the precision parameters are Gamma by conjugacy, and are sampled exactly in a Gibbs step. Full conditional distributions for coefficients β_ξ and β_ν are not available in closed form; a more general Metropolis-Hastings (MH) scheme must therefore be used. There are a number of potential alternative strategies regarding the MH step for β_· (· = ξ, ν).
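The Gibbs step for a precision parameter follows directly from conjugacy. A sketch, assuming a Gamma(a, rate b) prior and using exponent p/2 as in the prior above (for a rank-deficient Q one would use rank(Q)/2; the function name is ours):

```python
import numpy as np

def sample_lambda(beta, Q, a, b, rng):
    """Gibbs update for the roughness coefficient lambda.

    With prior lambda ~ Gamma(a, rate=b) and smoothness prior
    beta | lambda propto lambda^{p/2} exp(-lambda/2 * beta' Q beta),
    the full conditional is
    lambda | beta ~ Gamma(a + p/2, rate = b + beta' Q beta / 2).
    """
    p = len(beta)
    shape = a + 0.5 * p
    rate = b + 0.5 * beta @ Q @ beta
    return rng.gamma(shape, 1.0 / rate)   # numpy's gamma takes a scale, not a rate
```

Because this step is exact, it never rejects; only the β_· updates require a Metropolis-Hastings accept/reject decision.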
We choose to examine two possibilities: (a) straightforward MH sampling of correlated Gaussian proposals for β_·, and (b) the mMALA algorithm of Girolami and Calderhead [3], exploiting first- and second-derivative information from the log posterior to propose candidate values for the full vector of coefficients in high-probability regions. Implementations are described in the Appendix. Henceforth we refer to these two schemes as MH and mMALA respectively for brevity. The MH approach is simple to implement, but is likely to generate MCMC chains which mix relatively poorly. The mMALA scheme is expected to explore the posterior with considerably more efficiency; however, its implementation requires knowledge of likelihood derivatives.

4. Evaluation of methods
This section describes evaluation of the relative performance of the different model parameterisations and inference schemes introduced in Sections 3.2 and 3.3. We assess performance in terms of quality of estimation of distributions of return values corresponding to long return periods, estimated under models for large numbers of replicate samples of data from pre-specified underlying models. We simulate sample realisations, each of a common size, from three different underlying models, described below and referred to henceforth as Cases 1, 2 and 3, and further simulate sample realisations of five times that size from the same triplet of underlying models, referring to these as Cases 4, 5 and 6 respectively. We next estimate extreme value models for all sample realisations, model parameterisations and inference schemes. We assume that any sample realisation for any Case corresponds to a period of T years of observation.
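The return value distributions used in this comparison are obtained by simulation under a fitted model: event counts per covariate bin are drawn from the Poisson rate, event sizes from the corresponding GP distribution, and the maximum retained. A minimal sketch under stated assumptions (covariate discretised into bins with known per-bin rate and GP parameters; function and variable names are illustrative, and ξ ≠ 0 is assumed):

```python
import numpy as np

def sample_t_year_maximum(xi, nu, rho, t_ratio, n_rep, rng, mu=0.0):
    """Simulate n_rep realisations of the maximum over a return period equal
    to t_ratio times the original observation period.

    xi, nu, rho are arrays over covariate bins: GP shape, adjusted scale
    nu = sigma * (1 + xi), and expected exceedance count per bin in the
    original period. Illustrative sketch, not the paper's code.
    """
    sigma = nu / (1.0 + xi)
    maxima = np.empty(n_rep)
    for r in range(n_rep):
        best = -np.inf
        for j in range(len(rho)):
            n_j = rng.poisson(rho[j] * t_ratio)       # events in bin j over the period
            if n_j:
                u = rng.uniform(size=n_j)
                # GP inverse cdf: y = mu + sigma/xi * (u^{-xi} - 1), for xi != 0
                y = mu + sigma[j] / xi[j] * (u ** (-xi[j]) - 1.0)
                best = max(best, y.max())
        maxima[r] = best
    return maxima
```

Directional-sector return value distributions follow by restricting the inner loop to the bins falling in the sector of interest.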
We then simulate replicates of return period realisations, each replicate consisting of observations of directional extreme values corresponding to a return period of T, and estimate the distribution of the maximum observed (the T-year maximum return value) for all model parameterisations and inference methods, by accumulation over the replicates. Return value distributions are estimated omnidirectionally (that is, including all directions) and for directional octants centred on the cardinal and semi-cardinal directions, by considering only those observations from the return period realisation with the appropriate directional characteristics. For each sample realisation from Cases 4, 5 and 6, we estimate return value distributions for all parameterisations but for mMALA inference only since, as will be discussed in Section

4.3 below, the computational effort associated with any of mMALA, MH and MLE for large numbers of bootstrap resamples for these Cases is prohibitively large. We quantify the quality of return value inference by comparing the empirical cumulative distribution function generated under the fitted model for each sample realisation with that from simulation under the known underlying Case. We quantify the discrepancy between empirical distribution functions by estimating Kullback-Leibler, Cramer-von Mises and Kolmogorov-Smirnov statistics. We visualise relative performance by plotting the empirical cumulative distribution function of the test statistic over the sample realisations, for each combination of Case, model parameterisation and inference method. We also compare performance in terms of prediction of the 37.5th percentile of the T-year return value distribution, since this is often used in metocean and coastal design applications. However, we are not only interested in quality of inference, but also in computational efficiency. This is evaluated and illustrated in terms of effective sample size, as discussed below in Section 4.3.

4.1. Case studies considered
First, we describe the model Cases 1-6 used to generate sample realisations for extreme value modelling. For each Case, potentially all of the Poisson rate ρ of threshold exceedance, and the GP shape ξ and scale σ of exceedance size, vary as functions of covariate θ. The extreme value threshold µ is fixed at zero throughout.

Case 1: For extreme value threshold µ(θ) = 0, we simulate observations with a uniform Poisson rate ρ(θ) = /3 per degree covariate, and a low order Fourier parameterisation of GP shape ξ(θ) = sin θ + cos θ + and scale σ(θ) = . + sin θ 3/.

Case 2: For extreme value threshold µ(θ) = 0 and the same Fourier parameterisation of GP shape and scale as in Case 1, a non-uniform Poisson rate ρ(θ) = max(sin θ + ., )/c_ρ, where c_ρ = ∫ max(sin θ + ., ) dθ, is used to simulate the observations.

Case 3
: For extreme value threshold µ(θ) = 0, the forms of each of ρ(θ), ξ(θ) and σ(θ) are defined by mixtures of between one and five Gaussian densities, as illustrated in Figure 2.

Cases 4, 5 and 6: These Cases are identical to Cases 1, 2 and 3 respectively, except that the Poisson rate ρ is increased by a factor of five; the sample size is therefore five times larger.

Figure 2 illustrates typical sample realisations of Cases 1, 2 and 3. Parameter variation of GP shape ξ and scale σ with direction θ is identical in Cases 1 and 2. Poisson rate ρ is constant in Case 1 only. In Cases 2 and 3, ρ is very small at θ 7, leading to a sparsity of corresponding observations. ξ is largest but negative at θ for Cases 1 and 2, leading to larger observations there. For Case 3, ξ is largest and positive at θ 3, leading to the longest tail in any of the Cases considered. Figure 3 shows parameter estimates for ξ and σ, corresponding to the sample realisation of Case shown in Figure 2, for different model parameterisations using mMALA inference. Visual inspection suggests that estimates of similar quality are obtained using all of the Spline, Fourier and Gaussian Process parameterisations, but that the Constant parameterisation is poor. It is also apparent that identification of ξ is more difficult than σ. Corresponding plots (not shown) for maximum likelihood and Metropolis-Hastings inference show broadly similar characteristics, as do plots for other realisations of the same Case, and for realisations of different Cases. Posterior cumulative distribution functions of return values based on models for the sample realisations of Case illustrated in Figures 2 and 3, corresponding to a return period of ten times the period of the original sample, are shown in Figure 4 for different model parameterisations and mMALA inference.
It can be seen omnidirectionally that the Constant parameterisation provides best agreement with the known return value distribution, despite the fact that the parameter estimates in Figure 3 do not reflect the directional non-stationarity present. Omnidirectionally, and for directional octant sectors, the non-stationary model parameterisations perform similarly. However, it is clear that the Constant parameterisation does particularly poorly for the western and north-western sectors, for which the rate of occurrence of events is relatively low, and both ξ and σ are near their minimum values. Figure 5 illustrates uncertainty over all sample realisations in the cumulative distribution function of the return value for Case, corresponding to a return period of ten times the period of the original sample, using the Spline parameterisation and mMALA inference. The median estimate for the return value distribution over all sample realisations is shown in solid grey, with the corresponding point-wise 95% uncertainty band in dashed grey. The true return value distribution is given in solid black. There is good agreement in all sectors. We explore differences in inferences for return value distributions more fully in Section 4.2.

4.2. Assessing quality of inference
The criteria used to compare distributions of return values are now described. Since, for comparison, we only have access to samples from the distributions, where necessary we project empirical distributions onto a linear grid using linear interpolation, and evaluate grid-based approximations to facilitate comparison. We then compare empirical return value distributions using each of the following three

[Figure 2: Illustrations of sample realisations from each of Cases 1 (left), 2 (centre) and 3 (right). Upper panels show parameter variation of GP shape ξ, scale σ and Poisson rate ρ with direction θ for each case. Lower panels show one realisation of the corresponding simulated samples. ξ and σ for Cases 4, 5 and 6 are identical to those of Cases 1, 2 and 3 respectively. The value of ρ for Cases 4, 5 and 6 is five times that of Cases 1, 2 and 3 respectively.]

[Figure 3: Parameter estimates for GP shape ξ (upper) and scale σ (lower) for the sample realisation of Case shown in Figure 2, for different model parameterisations (left to right: Spline, Fourier, Gaussian Process, Constant) using mMALA inference. Each panel illustrates the true parameter (solid black), and the posterior median estimate (solid grey) with 95% credible interval (dashed grey).]

[Figure 4: Posterior cumulative distribution functions of the return value for the sample realisation of Case shown in Figure 2, corresponding to a return period of ten times the period of the original sample. The left hand panel shows the omnidirectional return value distribution, and the right hand panels the corresponding directional estimates. The title for each panel gives the expected percentage of individuals in that directional sector. In each panel, estimates are given for different model parameterisations using mMALA inference. The true return value distribution is given in solid black.]

[Figure 5: Uncertainty over all sample realisations in the cumulative distribution function of the return value for Case, corresponding to a return period of ten times the period of the original sample. The left hand panel shows the omnidirectional return value distribution, and the right hand panels the corresponding directional estimates. The title for each panel gives the expected percentage of individuals in that directional sector. In each panel, the true return value distribution is given in solid black. The median estimate over realisations for the return value distribution of the Spline model parameterisation using mMALA inference is shown in solid grey, with the corresponding point-wise 95% uncertainty band in dashed grey.]

[Figure 6: Empirical cumulative distribution functions of the Kullback-Leibler divergence between return value distributions corresponding to a return period of ten times that of the original sample, estimated under the true model and under models of sample realisations with different parameterisations and mMALA inference for Case. The title for each panel gives the expected percentage of individuals in that directional sector.]

statistics. The Kolmogorov-Smirnov criterion compares two distributions in terms of the maximum vertical distance between their cumulative distribution functions:

D_ks(F_1, F_2) = sup_x |F_1(x) − F_2(x)|.

The Cramer-von Mises criterion evaluates the average squared difference of one distribution from a second, reference distribution:

D_cm(F_1, F_2) = ∫ (F_1(x) − F_2(x))² f_2(x) dx.

The Kullback-Leibler divergence compares distributions using the average log ratio of density functions:

D_kl(F_1, F_2) = ∫ log( f_1(x) / f_2(x) ) f_1(x) dx;

in this work, we use the approximation of Perez-Cruz []. The general characteristics of differences in return value inference due to model parameterisation and inference method were found to be similar for each of the three statistics. Only comparisons using the Kullback-Leibler (KL) divergence are therefore reported here. We note that perfect agreement between f_1(x) and f_2(x) yields the minimum KL divergence of zero. For illustration, Figure 6 shows empirical cumulative distribution functions of the KL divergence between return value distributions corresponding to a return period of ten times that of the original sample, estimated under the true model and under models of sample realisations with different parameterisations and mMALA inference for Case. The distributions of KL divergence for all non-stationary model parameterisations appear to be very similar, as might be expected from consideration of figures similar to Figure 4.
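The three statistics can be computed for two samples of return values along the following lines. SciPy provides two-sample versions of the first two; for the third, a simple histogram-based approximation stands in here for the Perez-Cruz nearest-neighbour KL estimator cited in the text:

```python
import numpy as np
from scipy import stats

# Two samples of T-year maxima: one under the fitted model, one under the truth
# (Gumbel draws here are purely illustrative stand-ins).
rng = np.random.default_rng(0)
x1 = rng.gumbel(loc=10.0, scale=1.0, size=5000)
x2 = rng.gumbel(loc=10.2, scale=1.1, size=5000)

# Kolmogorov-Smirnov: sup-distance between empirical cdfs.
d_ks = stats.ks_2samp(x1, x2).statistic

# Cramer-von Mises: two-sample criterion based on squared cdf differences.
d_cm = stats.cramervonmises_2samp(x1, x2).statistic

# Kullback-Leibler: average log density ratio, via histogram density
# estimates on a common grid (a crude stand-in for the Perez-Cruz estimator).
grid = np.linspace(min(x1.min(), x2.min()), max(x1.max(), x2.max()), 101)
f1, _ = np.histogram(x1, bins=grid, density=True)
f2, _ = np.histogram(x2, bins=grid, density=True)
mask = (f1 > 0) & (f2 > 0)                      # avoid log of empty bins
dx = np.diff(grid)
d_kl = np.sum(f1[mask] * np.log(f1[mask] / f2[mask]) * dx[mask])
```

In the comparison itself, each statistic would be computed once per sample realisation, yielding the distributions of statistic values shown in the figures.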
However, the Constant model yields the best performance omnidirectionally in this case, since the corresponding distribution of KL divergence is shifted towards zero. In stark contrast, the Constant model does particularly badly in the eastern, south-western, western and north-western sectors. Figure 7 gives empirical cumulative distribution functions of the KL divergence between return value distributions corresponding to a return period of ten times that of the original sample, estimated under the true return value distribution and under models of sample realisations with Spline parameterisations and different inference procedures for the same Case. There appears to be little to choose between the three inference methods for this Case. Figure 8 summarises the characteristics of the distributions of KL divergence corresponding to the omnidirectional return value distribution for all Cases, model parameterisations and inference methods considered in this work. In general, we note that all non-stationary parameterisations perform well with mMALA inference. With MH inference, performance is generally poorer, especially for the Fourier parameterisation. MLE does better than MH. We note that the Constant parameterisation generally performs well for the omnidirectional return value, but there is some erratic behaviour, notably for Case. Figure 9 is the corresponding plot for the generally sparsely populated western directional sector. The Spline parameterisation with mMALA inference performs best. We note that the Fourier parameterisation does less well using MH and MLE, and that the Constant parameterisation behaves very erratically. We also note that, somewhat surprisingly, the Gaussian Process model performs considerably less well than the Spline and Fourier parameterisations.
We surmise that this is due to mean-reversion in the absence of observations, compared with the Spline parameterisation in particular, which prefers interpolation to reduce parameter roughness. We note that the Constant parameterisation also performs badly for Case. From a practitioner's perspective, it is also interesting to quantify the performance of different model parameterisations and inference methods in estimating some central value, e.g. the mean, median, or 37.5th percentile of the distribution of the return value. We choose the 37.5th percentile, since this is commonly used in the met-ocean community. Figure 10 illustrates this comparison in terms of a box-whisker plot.

[Figure 7: Empirical cumulative distribution functions of the Kullback-Leibler divergence between return value distributions, corresponding to a return period of ten times that of the original sample, estimated under samples from the true return value distribution and those estimated under models of sample realisations with Spline parameterisations and different inference procedures (MLE, MH, mmala) for Case. The title for each panel gives the expected percentage of individuals in that directional sector.]

[Figure 8: Box-whisker comparison of samples of Kullback-Leibler (KL) divergence between omnidirectional return value distributions, corresponding to a return period of ten times that of the original sample, estimated under samples from the true return value distribution and those estimated under models of each of the sample realisations. mmala inference is reported for all six Cases (abscissa labels) and model parameterisations (columns of panels). Metropolis-Hastings (MH) inference and maximum likelihood estimation (MLE) are reported only for the smaller sample sizes. The sample of KL divergence is summarised by the median (white disc with black central dot), the interquartile range (grey rectangular box) and the (2.5%, 97.5%) interval (grey line).]

[Figure 9: Box-whisker comparison of samples of Kullback-Leibler (KL) divergence between western sector return value distributions, corresponding to a return period of ten times that of the original sample, estimated under samples from the true return value distribution and those estimated under models of each of the sample realisations. mmala inference is reported for all six Cases (abscissa labels) and model parameterisations (columns of panels). Metropolis-Hastings (MH) inference and maximum likelihood estimation (MLE) are reported only for the smaller sample sizes. The sample of KL divergence is summarised by the median (white disc with black central dot), the interquartile range (grey rectangular box) and the (2.5%, 97.5%) interval (grey line). The ordinate scale is the same as that of Figure 8 to facilitate comparison. The Constant parameterisation for Case with mmala inference yields values of KL divergence larger than.]

In each panel, the true value, estimated by simulation under the true model, is shown as a black disc. The distribution of estimates from different sample realisations of each Case is summarised by the median (white disc with black central dot), the interquartile range (grey rectangular box) and the (2.5%, 97.5%) interval (grey line). The performance of mmala and MLE in all Cases where comparison is possible is very similar. In general, MH inference tends to produce greater bias and variability in estimates of the 37.5th percentile. We surmise that this difference may be due to the fact that both mmala and MLE exploit knowledge of likelihood gradient and curvature. There is little difference between the performance of the three non-stationary parameterisations, but the Constant model again performs more erratically. Corresponding box-whisker plots for the eastern directional octant (not shown) suggest that MLE and mmala inference are again very similar, with MH estimates exhibiting greater variability.
All non-stationary models yield similar performance, but the Constant model overestimates throughout. True values of the 37.5th percentile for the western sector (see Figure 11) are considerably lower than for the eastern sector, and lower again than the omnidirectional values. In this sector, the rate of occurrence of events is generally lower in all Cases. Nevertheless, Figure 11 has many features similar to Figure 10. However, we note that MH struggles in combination with the Fourier parameterisation, probably since the latter has the whole of the covariate domain as its support; intelligent proposals like those used here in MLE and mmala are necessary. Overall, we note that MLE and mmala inference for all of the Spline, Fourier and Gaussian Process parameterisations perform relatively well, and equally well.

4.3. Assessing efficiency of inference

The effective sample size m* (see Geyer []) gives an estimate of the equivalent number of independent iterations that a Monte Carlo Markov chain represents, and is defined by m* = m / (1 + 2 Σ_{k=1}^∞ c_k), where c_k is the autocorrelation of the MCMC chain at lag k, and m is the actual chain length. The effective sample size per hour is defined by m*/T, where T is the elapsed computational time in hours for m steps of the chain. For maximum likelihood inference with bootstrap uncertainty estimation, since bootstrap resamples are independent of one another, we estimate effective sample size per hour as m_BS/T, where m_BS is the number of bootstrap resamples used and T is now the total elapsed computational time in hours to execute analysis of the m_BS bootstrap resamples. Comparison of effective sample sizes per hour for different cases, parameterisations and inference methods gives some indication of relative computational efficiency, although objective comparison is difficult.
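The effective sample size calculation can be sketched as follows; truncating the autocorrelation sum at the first non-positive empirical term is one common practical choice, not necessarily that used in this work:

```python
import numpy as np

def effective_sample_size(chain):
    """ESS m* = m / (1 + 2 * sum_k c_k), with the autocorrelation sum
    truncated at the first non-positive empirical term."""
    x = np.asarray(chain, dtype=float)
    m = len(x)
    x = x - x.mean()
    var = np.dot(x, x) / m
    acf_sum = 0.0
    for k in range(1, m // 2):
        c_k = np.dot(x[:-k], x[k:]) / (m * var)
        if c_k <= 0.0:          # autocorrelation has died out; stop summing
            break
        acf_sum += c_k
    return m / (1.0 + 2.0 * acf_sum)

rng = np.random.default_rng(0)
iid = rng.normal(size=5000)     # independent draws: ESS close to m
rho = 0.9                       # AR(1) chain: ESS roughly m*(1-rho)/(1+rho)
ar = np.empty(5000)
ar[0] = 0.0
for t in range(1, 5000):
    ar[t] = rho * ar[t - 1] + rng.normal()
print(effective_sample_size(iid))
print(effective_sample_size(ar))
```

For the strongly autocorrelated AR(1) chain the effective sample size is roughly an order of magnitude smaller than the raw chain length, which is exactly the effect the ESS/hr comparisons above are designed to expose.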
In particular, we note that software implementations in MATLAB exploiting common computational structures between different approaches have been used; these are almost certainly to the detriment of computational efficiency for some of the approaches, particularly the Constant parameterisation. For this reason, we do not report effective sample size per hour for the Constant parameterisation. Computational run-times are also, of course, critically dependent on the software and hardware resources used. We note that the focus of this work is primarily quality of inference, rather than its computational efficiency. Specific modelling choices, such as the set-up of the cross-validation strategy adopted for MLE and the choice of burn-in length and proposal step-size for MH and mmala within reasonable bounds, may not influence inferences greatly, but will obviously affect run times. Similarly, the Spline, Fourier and Gaussian Process parameterisations used were chosen to be of similar complexity, but small differences may again influence the relative computational efficiency of inference. With these caveats in mind, Figure 12 illustrates the distribution of effective sample size per hour estimates for different cases, model parameterisations and inference procedures.

[Figure 10: Box-whisker comparison of estimates for the 37.5th percentile of the omnidirectional return value distribution (in metres) for different cases, model parameterisations and inference procedures. mmala inference is reported for all six cases (abscissa labels) and model parameterisations (columns of panels). Metropolis-Hastings (MH) inference and maximum likelihood estimation (MLE) are reported only for the smaller sample sizes. In each panel, the estimate from simulation under the true model is shown as a black disc. The distribution of estimates from different sample realisations is summarised by the median value (white disc with black central dot), the interquartile range (grey rectangular box) and the (2.5%, 97.5%) interval (grey line). The Constant parameterisation for Case with mmala inference yields values larger than m.]

[Figure 11: Box-whisker comparison of estimates for the 37.5th percentile of the return value distribution (in metres) for the western directional sector (least populous for cases , 3, 5 and ), for different cases, model parameterisations and inference procedures. mmala inference is reported for all six cases (abscissa labels) and model parameterisations (columns of panels). Metropolis-Hastings (MH) inference and maximum likelihood estimation (MLE) are reported only for the smaller sample sizes. In each panel, the estimate from simulation under the true model is shown as a black disc. The distribution of estimates from different sample realisations is summarised by the median value (white disc with black central dot), the interquartile range (grey rectangular box) and the (2.5%, 97.5%) interval (grey line). The Constant parameterisation for Case with mmala inference yields values larger than 5m.]

[Figure 12: Effective sample size per hour (ESS/hr) estimates on a logarithmic scale, i.e. with log(ESS/hr) plotted on the y-axis, for different cases, non-stationary model parameterisations and inference procedures. mmala inference is reported for all six cases (abscissa labels) and model parameterisations (columns of panels). Metropolis-Hastings (MH) inference and maximum likelihood estimation (MLE) are reported only for the smaller sample sizes. The distribution of estimates from different realisations of the original sample is summarised by the median value (white disc with black central dot), the interquartile range (grey rectangular box) and the (2.5%, 97.5%) interval (grey line).]

Figure 12 shows that, for mmala inference, the effective sample size per hour (ESS/hr) is considerably lower for Cases , 5 and , indicating that inference using large sample sizes is slower. For this reason, in this work, we do not provide results for Cases , 5 and using MLE and MH. Overall, comparing non-stationary parameterisations, ESS/hr is larger for Splines and Gaussian Processes than for Fourier. Using Spline and Gaussian Process parameterisations, values of ESS/hr are higher for MH and mmala inference than for MLE. There is little difference in ESS/hr for different model parameterisations using MLE.

5. Discussion

Adequate allowance for non-stationarity is essential for realistic environmental extreme value inference. Ignoring the effects of covariates leads to unrealistic inference in general.
The applied statistics and environmental literature provides various competing approaches to modelling non-stationarity. To these authors, adoption of non- or semi-parametric functional forms for generalised Pareto shape and scale in peaks-over-threshold modelling is, in general, far preferable in real-world applications to the assumption of a less flexible parameterisation. We find that B-spline and Gaussian Process parameterisations estimated by maximum penalised likelihood using back-fitting, or Bayesian inference using mmala, perform equally well in terms of quality and computational efficiency, and generally outperform alternatives in this work. The Gaussian Process parameterisation is computationally unwieldy for larger problems, unless gridding onto the covariate domain is performed. The Fourier parameterisation, utilising bases with global support (compared to Spline and Gaussian Process basis functions, whose support is local on the covariate domain), is generally somewhat more difficult to estimate well in practice, showing greater sensitivity to choices such as the starting solution for maximum likelihood estimation. The Constant parameterisation performs surprisingly well in estimating the omnidirectional return value distribution in some cases, but is generally very poor in estimating directional variation. Various methods of inference are also available. Competing approaches include maximum penalised likelihood optimisation and Bayesian inference using Monte Carlo Markov chain sampling. It appears, however, that the major difference, in terms of practical value of inference, is not between frequentist and Bayesian paradigms, but rather the advantage gained by exploiting knowledge of likelihood

gradient and curvature. In addition, two aspects give Bayesian inference a slight further advantage: (i) it appears that inference schemes which sample from a likelihood surface randomly, rather than seeking its optimum deterministically, are more stable, and therefore more routinely implementable and useable; and (ii) the computational efficiency of gradient-based Bayesian inference (e.g. mmala) in quantifying parameter estimates and their variability, evaluated in this work in terms of effective sample size per hour, is higher than that of the corresponding frequentist maximum penalised likelihood inference with bootstrapping. Moreover, Bayesian inference gives a more consistent procedure for statistical learning, and a more intuitive framework for communication of uncertainty, particularly to a non-specialist audience.

Acknowledgement

We are grateful to Kathryn Turnbull of Durham University, UK for useful comments.

Appendix

Maximum likelihood estimation

For maximum likelihood estimation (MLE), we use a back-fitting (or iteratively re-weighted least squares, IRLS) algorithm, derived in Jonathan et al. [5], to estimate the vectors of basis coefficients β_ξ and β_ν. For a fixed value of the smoothness parameter λ, we initialise coefficients to starting value β^0, and then iterate the following step until convergence

β^{i+1} = (Bᵀ W(β^i) B + λQ)⁻¹ (Bᵀ V(β^i) + Bᵀ W(β^i) B β^i),

where the gradient V(β) = ∇ L(y|Ω) and negative curvature W(β) = −∇∇ᵀ L(y|Ω), with derivatives taken with respect to the fitted parameter values Bβ, are derived at the end of this Appendix. This algorithm is similar to the mmala algorithm (see below) used to generate proposals for the corresponding Metropolis-Hastings step in MCMC, in that both exploit first- and second-derivative information to move towards regions of high probability. The back-fitting iteration is of course deterministic, whereas the mmala step is stochastic.
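The structure of the back-fitting iteration can be sketched as follows. For illustration a Gaussian toy likelihood replaces the generalised Pareto (an assumption made only so the answer can be checked in closed form), since the update requires only the gradient V and curvature W; in this special case the iteration converges immediately to the penalised least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 8
B = rng.normal(size=(n, p))             # basis matrix (illustrative)
beta_true = rng.normal(size=p)
y = B @ beta_true + 0.1 * rng.normal(size=n)

lam = 1.0
Q = np.eye(p)                           # roughness penalty (identity for simplicity)

def V(beta):
    # gradient of the Gaussian log-likelihood w.r.t. fitted values B @ beta
    return y - B @ beta

def W(beta):
    # negative curvature w.r.t. fitted values (identity for the Gaussian case)
    return np.eye(n)

beta = np.zeros(p)
for _ in range(20):
    Wb = W(beta)
    lhs = B.T @ Wb @ B + lam * Q
    rhs = B.T @ V(beta) + B.T @ Wb @ B @ beta
    beta_new = np.linalg.solve(lhs, rhs)
    if np.max(np.abs(beta_new - beta)) < 1e-10:
        beta = beta_new
        break
    beta = beta_new

# for the Gaussian toy case, the fixed point is the penalised least-squares solution
ridge = np.linalg.solve(B.T @ B + lam * Q, B.T @ y)
print(np.max(np.abs(beta - ridge)))
```

For the generalised Pareto likelihood, V and W vary with β, so the iteration takes several steps rather than one, but the update formula is unchanged.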
MCMC sampling algorithms

Denoting the set of parameters to be estimated by Ω = {β_ξ, β_ν, λ_ξ, λ_ν}, inference proceeds by sampling from the full conditional distributions f(Ω_k | y, Θ, Ω_{−k}, Γ) for each parameter in Ω in turn, where Γ = {a_ξ, b_ξ, a_ν, b_ν} is the set of fixed hyper-parameters for prior distributions. The form of the full conditional distribution varies depending on the type of parameter being estimated, as explained below.

Full conditional distributions for basis coefficients

Each vector β ∈ {β_ξ, β_ν} has the following conditional distribution

f(β | y, Θ, Ω_{−β}, Γ) ∝ f(y | Θ, β) f(β | λ),

which is not available in closed form, and therefore cannot be sampled directly in a Gibbs step. Instead, we generate samples using the Metropolis-Hastings algorithm: given current state β^i, we propose new parameter β* from proposal distribution f(β* | β^i) and evaluate the acceptance ratio

A = [f(β* | y, Θ, Ω_{−β}, Γ) f(β^i | β*)] / [f(β^i | y, Θ, Ω_{−β}, Γ) f(β* | β^i)],

accepting the proposal with probability q = min(1, A) and setting β^{i+1} = β*. Otherwise we reject the proposal and set β^{i+1} = β^i. As outlined in Section 3.3 and detailed below, we consider two different methods for generating multivariate proposals for β. In the first approach (referred to as MH), we make an entirely stochastic Gaussian random walk proposal using a fixed covariance matrix; in the second (referred to as mmala), we make a proposal which is partly deterministic and partly stochastic, accounting for local curvature of the likelihood surface.
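A minimal sketch of the Metropolis-Hastings accept/reject step follows, using a symmetric Gaussian random walk proposal, for which the proposal densities cancel in the acceptance ratio, and a standard Gaussian stand-in for the full conditional (both illustrative assumptions, not the model above):

```python
import numpy as np

rng = np.random.default_rng(3)

def log_target(beta):
    """Stand-in log full-conditional: a standard Gaussian, for illustration."""
    return -0.5 * np.sum(beta ** 2)

def mh_step(beta, step=0.5):
    """One symmetric Gaussian random-walk Metropolis-Hastings update.
    Working in log space, accept with probability min(1, A)."""
    proposal = beta + step * rng.normal(size=beta.shape)
    log_A = log_target(proposal) - log_target(beta)
    if np.log(rng.uniform()) < log_A:
        return proposal, True
    return beta, False

beta = np.zeros(4)
accepts = 0
draws = []
for _ in range(20000):
    beta, accepted = mh_step(beta)
    accepts += accepted
    draws.append(beta.copy())
draws = np.array(draws)
print(accepts / 20000)             # empirical acceptance rate
print(draws[5000:].mean(axis=0))   # near the target mean of zero
```

In the non-symmetric case (such as the mmala proposal below) the ratio f(β^i | β*) / f(β* | β^i) must be retained in A.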

For Metropolis-Hastings (MH) inference, we generate Gaussian random walk proposals of the form

β* = β^i + ν (BᵀB + R)^{−1/2} ɛ,

where ɛ is a vector of independent standard Normal random variables, and the value of the step size ν is adjusted to achieve reasonable acceptance rates of approximately 0.25. For inference using the Riemann manifold Metropolis-adjusted Langevin algorithm (mmala), as implemented by Girolami and Calderhead [3], we propose using derivatives of the target distribution at the current sample. This promotes proposals in regions of higher probability, at the additional computational cost of computing the necessary derivatives and matrix inverses. At iteration i of the sampling algorithm, where the current sample of the coefficients is β^i, proposals are made as

β* = β^i + (ν²/2) G⁻¹(β^i) D(β^i) + ν G^{−1/2}(β^i) ɛ,

where ɛ is a vector of independent standard Normal random variables, ν is an adjustable step size, and

D(β^i) = ∇_β L(β)|_{β^i} and G(β^i) = −∇_β ∇_βᵀ L(β)|_{β^i}

are the gradient and negative Hessian of the log density

L(β) = log f(β | y, Θ, Ω_{−β}, Γ),

with ∇_β = (∂/∂β_1, ..., ∂/∂β_p)ᵀ. Computation of likelihood derivatives is described at the end of this Appendix.

Full conditional distributions for prior precisions

Each prior precision parameter λ ∈ {λ_ξ, λ_ν} has the following conditional distribution

f(λ | y, Θ, Ω_{−λ}, Γ) ∝ f(β | λ) f(λ | a, b).

By construction, since the Gamma distribution is a conjugate prior for the precision of a Gaussian distribution, we know that the full conditional distribution is also Gamma, with updated parameters â = a + p/2 and b̂ = b + (1/2) βᵀQβ.

Derivatives of the posterior distribution

Here we find the derivatives of the log posterior distribution, required for maximum likelihood and mmala inference.
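The simplified mMALA step can be sketched as follows for a bivariate Gaussian target; the target, its precision matrix and the step size ν are illustrative assumptions, chosen because the gradient and negative Hessian are then available exactly:

```python
import numpy as np

rng = np.random.default_rng(4)
S_inv = np.array([[2.0, 0.5], [0.5, 1.0]])   # target precision (illustrative)

def grad_log_target(beta):
    return -S_inv @ beta

def neg_hessian(beta):
    return S_inv                              # constant for a Gaussian target

def mmala_step(beta, nu=0.8):
    """Simplified mMALA: drift (nu^2/2) G^{-1} grad, noise scaled by nu G^{-1/2}."""
    G = neg_hessian(beta)
    G_inv = np.linalg.inv(G)
    mean_fwd = beta + 0.5 * nu**2 * G_inv @ grad_log_target(beta)
    chol = np.linalg.cholesky(nu**2 * G_inv)
    prop = mean_fwd + chol @ rng.normal(size=beta.shape)

    # Metropolis-Hastings correction: the proposal is not symmetric
    def log_q(to, frm):
        m = frm + 0.5 * nu**2 * G_inv @ grad_log_target(frm)
        d = to - m
        return -0.5 * d @ (G / nu**2) @ d

    def log_pi(b):
        return -0.5 * b @ S_inv @ b

    log_A = log_pi(prop) + log_q(beta, prop) - log_pi(beta) - log_q(prop, beta)
    if np.log(rng.uniform()) < log_A:
        return prop
    return beta

beta = np.array([3.0, -3.0])
draws = []
for _ in range(20000):
    beta = mmala_step(beta)
    draws.append(beta.copy())
draws = np.array(draws[5000:])
print(draws.mean(axis=0))       # near zero
print(np.cov(draws.T))          # near the inverse of S_inv
```

For the extreme value model the gradient D and negative Hessian G are not constant and must be recomputed at each iteration from the likelihood derivatives given below.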
The log likelihood of the observed data under the generalised Pareto distribution is

L(y|Ω) = Σ_{i=1}^{N} [ −log ν_i − (1 + 1/ξ_i) log(1 + (ξ_i/ν_i) y_i) ]  for ξ_i ≠ 0,
L(y|Ω) = Σ_{i=1}^{N} [ −log ν_i − y_i/ν_i ]  for ξ_i = 0.

The log conditional distribution for each vector of basis coefficients β ∈ {β_ξ, β_ν} is then the sum of this likelihood plus a contribution from the prior distribution

L(β) = log f(β | y, Θ, Ω_{−β}, Γ) = L(y|Ω) − (λ/2) βᵀQβ.

We note the obvious similarity between this expression and the penalised negative log likelihood used for maximum likelihood inference. The gradient of the log conditional distribution for the vector of coefficients β is

∇_β L(β) = ∇_β L(y|Ω) − λQβ.

Using the chain rule, the likelihood gradient can be computed as

∇_β L(y|Ω) = (∂(Bβ)/∂β)ᵀ ∇_{Bβ} L(y|Ω) = Bᵀ ∇_{Bβ} L(y|Ω).
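The generalised Pareto log likelihood and its penalised form can be sketched as follows; this is illustrative code (function names and the small numerical threshold standing in for the exact ξ = 0 case are assumptions), not the MATLAB implementation used in this work:

```python
import numpy as np

def gp_log_lik(y, xi, nu):
    """Generalised Pareto log likelihood with per-observation shape xi_i and
    (adjusted) scale nu_i; the xi -> 0 exponential limit is handled explicitly."""
    y, xi, nu = map(np.asarray, (y, xi, nu))
    z = 1.0 + xi * y / nu
    if np.any(z <= 0) or np.any(nu <= 0):
        return -np.inf                     # outside the support of the distribution
    xi_safe = np.where(np.abs(xi) > 1e-9, xi, 1.0)   # avoid division by zero
    ll = np.where(np.abs(xi) > 1e-9,
                  -np.log(nu) - (1.0 + 1.0 / xi_safe) * np.log(z),
                  -np.log(nu) - y / nu)
    return ll.sum()

def penalised_log_lik(y, B, beta_xi, beta_nu, lam_xi, lam_nu, Q):
    """L(beta) = L(y|Omega) - (lam/2) beta' Q beta, one penalty per coefficient vector."""
    xi, nu = B @ beta_xi, B @ beta_nu
    penalty = (0.5 * lam_xi * beta_xi @ Q @ beta_xi
               + 0.5 * lam_nu * beta_nu @ Q @ beta_nu)
    return gp_log_lik(y, xi, nu) - penalty

# sanity check against the exponential limit: xi = 0, nu = 1 gives -sum(y)
y = np.array([0.5, 1.2, 2.0])
print(gp_log_lik(y, np.zeros(3), np.ones(3)))
```

Gradients of this expression with respect to β then follow by the chain rule through ξ = Bβ_ξ and ν = Bβ_ν, as in the final equation above.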


Practical Bayesian Quantile Regression. Keming Yu University of Plymouth, UK Practical Bayesian Quantile Regression Keming Yu University of Plymouth, UK (kyu@plymouth.ac.uk) A brief summary of some recent work of us (Keming Yu, Rana Moyeed and Julian Stander). Summary We develops

More information

Adaptive Monte Carlo methods

Adaptive Monte Carlo methods Adaptive Monte Carlo methods Jean-Michel Marin Projet Select, INRIA Futurs, Université Paris-Sud joint with Randal Douc (École Polytechnique), Arnaud Guillin (Université de Marseille) and Christian Robert

More information

A short introduction to INLA and R-INLA

A short introduction to INLA and R-INLA A short introduction to INLA and R-INLA Integrated Nested Laplace Approximation Thomas Opitz, BioSP, INRA Avignon Workshop: Theory and practice of INLA and SPDE November 7, 2018 2/21 Plan for this talk

More information

Bayesian estimation of the discrepancy with misspecified parametric models

Bayesian estimation of the discrepancy with misspecified parametric models Bayesian estimation of the discrepancy with misspecified parametric models Pierpaolo De Blasi University of Torino & Collegio Carlo Alberto Bayesian Nonparametrics workshop ICERM, 17-21 September 2012

More information

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix Labor-Supply Shifts and Economic Fluctuations Technical Appendix Yongsung Chang Department of Economics University of Pennsylvania Frank Schorfheide Department of Economics University of Pennsylvania January

More information

Bayesian model selection for computer model validation via mixture model estimation

Bayesian model selection for computer model validation via mixture model estimation Bayesian model selection for computer model validation via mixture model estimation Kaniav Kamary ATER, CNAM Joint work with É. Parent, P. Barbillon, M. Keller and N. Bousquet Outline Computer model validation

More information

A variational radial basis function approximation for diffusion processes

A variational radial basis function approximation for diffusion processes A variational radial basis function approximation for diffusion processes Michail D. Vrettas, Dan Cornford and Yuan Shen Aston University - Neural Computing Research Group Aston Triangle, Birmingham B4

More information

Stochastic Proximal Gradient Algorithm

Stochastic Proximal Gradient Algorithm Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters

More information

Pattern Recognition and Machine Learning. Bishop Chapter 11: Sampling Methods

Pattern Recognition and Machine Learning. Bishop Chapter 11: Sampling Methods Pattern Recognition and Machine Learning Chapter 11: Sampling Methods Elise Arnaud Jakob Verbeek May 22, 2008 Outline of the chapter 11.1 Basic Sampling Algorithms 11.2 Markov Chain Monte Carlo 11.3 Gibbs

More information

Review. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

Review. DS GA 1002 Statistical and Mathematical Models.   Carlos Fernandez-Granda Review DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda Probability and statistics Probability: Framework for dealing with

More information

Nonparametric Bayesian Methods - Lecture I

Nonparametric Bayesian Methods - Lecture I Nonparametric Bayesian Methods - Lecture I Harry van Zanten Korteweg-de Vries Institute for Mathematics CRiSM Masterclass, April 4-6, 2016 Overview of the lectures I Intro to nonparametric Bayesian statistics

More information

Lecture 2 APPLICATION OF EXREME VALUE THEORY TO CLIMATE CHANGE. Rick Katz

Lecture 2 APPLICATION OF EXREME VALUE THEORY TO CLIMATE CHANGE. Rick Katz 1 Lecture 2 APPLICATION OF EXREME VALUE THEORY TO CLIMATE CHANGE Rick Katz Institute for Study of Society and Environment National Center for Atmospheric Research Boulder, CO USA email: rwk@ucar.edu Home

More information

Statistical Inference for Stochastic Epidemic Models

Statistical Inference for Stochastic Epidemic Models Statistical Inference for Stochastic Epidemic Models George Streftaris 1 and Gavin J. Gibson 1 1 Department of Actuarial Mathematics & Statistics, Heriot-Watt University, Riccarton, Edinburgh EH14 4AS,

More information

Generalized additive modelling of hydrological sample extremes

Generalized additive modelling of hydrological sample extremes Generalized additive modelling of hydrological sample extremes Valérie Chavez-Demoulin 1 Joint work with A.C. Davison (EPFL) and Marius Hofert (ETHZ) 1 Faculty of Business and Economics, University of

More information

Marginal Specifications and a Gaussian Copula Estimation

Marginal Specifications and a Gaussian Copula Estimation Marginal Specifications and a Gaussian Copula Estimation Kazim Azam Abstract Multivariate analysis involving random variables of different type like count, continuous or mixture of both is frequently required

More information

Distributions of return values for ocean wave characteristics in the South China Sea using directional-seasonal extreme value analysis

Distributions of return values for ocean wave characteristics in the South China Sea using directional-seasonal extreme value analysis Environmetrics - () Distributions of return values for ocean wave characteristics in the outh China ea using directional-seasonal extreme value analysis D. Randell, G. eld, K. Ewans and P. onathan bstract

More information

Advances and Applications in Perfect Sampling

Advances and Applications in Perfect Sampling and Applications in Perfect Sampling Ph.D. Dissertation Defense Ulrike Schneider advisor: Jem Corcoran May 8, 2003 Department of Applied Mathematics University of Colorado Outline Introduction (1) MCMC

More information

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model UNIVERSITY OF TEXAS AT SAN ANTONIO Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model Liang Jing April 2010 1 1 ABSTRACT In this paper, common MCMC algorithms are introduced

More information

Nonparametric Drift Estimation for Stochastic Differential Equations

Nonparametric Drift Estimation for Stochastic Differential Equations Nonparametric Drift Estimation for Stochastic Differential Equations Gareth Roberts 1 Department of Statistics University of Warwick Brazilian Bayesian meeting, March 2010 Joint work with O. Papaspiliopoulos,

More information

Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics

Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics arxiv:1603.08163v1 [stat.ml] 7 Mar 016 Farouk S. Nathoo, Keelin Greenlaw,

More information

Downloaded from:

Downloaded from: Camacho, A; Kucharski, AJ; Funk, S; Breman, J; Piot, P; Edmunds, WJ (2014) Potential for large outbreaks of Ebola virus disease. Epidemics, 9. pp. 70-8. ISSN 1755-4365 DOI: https://doi.org/10.1016/j.epidem.2014.09.003

More information

MAS2317/3317: Introduction to Bayesian Statistics

MAS2317/3317: Introduction to Bayesian Statistics MAS2317/3317: Introduction to Bayesian Statistics Case Study 2: Bayesian Modelling of Extreme Rainfall Data Semester 2, 2014 2015 Motivation Over the last 30 years or so, interest in the use of statistical

More information

Bayesian search for other Earths

Bayesian search for other Earths Bayesian search for other Earths Low-mass planets orbiting nearby M dwarfs Mikko Tuomi University of Hertfordshire, Centre for Astrophysics Research Email: mikko.tuomi@utu.fi Presentation, 19.4.2013 1

More information

Probabilistic Graphical Models

Probabilistic Graphical Models 10-708 Probabilistic Graphical Models Homework 3 (v1.1.0) Due Apr 14, 7:00 PM Rules: 1. Homework is due on the due date at 7:00 PM. The homework should be submitted via Gradescope. Solution to each problem

More information

Manifold Monte Carlo Methods

Manifold Monte Carlo Methods Manifold Monte Carlo Methods Mark Girolami Department of Statistical Science University College London Joint work with Ben Calderhead Research Section Ordinary Meeting The Royal Statistical Society October

More information

ABC methods for phase-type distributions with applications in insurance risk problems

ABC methods for phase-type distributions with applications in insurance risk problems ABC methods for phase-type with applications problems Concepcion Ausin, Department of Statistics, Universidad Carlos III de Madrid Joint work with: Pedro Galeano, Universidad Carlos III de Madrid Simon

More information

Bayesian hierarchical space time model applied to highresolution hindcast data of significant wave height

Bayesian hierarchical space time model applied to highresolution hindcast data of significant wave height Bayesian hierarchical space time model applied to highresolution hindcast data of Presented at Wave workshop 2013, Banff, Canada Erik Vanem Introduction The Bayesian hierarchical space-time models were

More information

Bagging During Markov Chain Monte Carlo for Smoother Predictions

Bagging During Markov Chain Monte Carlo for Smoother Predictions Bagging During Markov Chain Monte Carlo for Smoother Predictions Herbert K. H. Lee University of California, Santa Cruz Abstract: Making good predictions from noisy data is a challenging problem. Methods

More information

Analysing geoadditive regression data: a mixed model approach

Analysing geoadditive regression data: a mixed model approach Analysing geoadditive regression data: a mixed model approach Institut für Statistik, Ludwig-Maximilians-Universität München Joint work with Ludwig Fahrmeir & Stefan Lang 25.11.2005 Spatio-temporal regression

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

Stochastic Modeling of Wave Climate Using a Bayesian Hierarchical Space-Time Model with a Log-Transform

Stochastic Modeling of Wave Climate Using a Bayesian Hierarchical Space-Time Model with a Log-Transform Stochastic Modeling of Wave Climate Using a Bayesian Hierarchical Space-Time Model with a Log-Transform Erik Vanem Arne Bang Huseby Bent Natvig Department of Mathematics University of Oslo Oslo, Norway

More information

Comparing Non-informative Priors for Estimation and. Prediction in Spatial Models

Comparing Non-informative Priors for Estimation and. Prediction in Spatial Models Comparing Non-informative Priors for Estimation and Prediction in Spatial Models Vigre Semester Report by: Regina Wu Advisor: Cari Kaufman January 31, 2010 1 Introduction Gaussian random fields with specified

More information

Bayesian Regression Linear and Logistic Regression

Bayesian Regression Linear and Logistic Regression When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we

More information

A Statistical Input Pruning Method for Artificial Neural Networks Used in Environmental Modelling

A Statistical Input Pruning Method for Artificial Neural Networks Used in Environmental Modelling A Statistical Input Pruning Method for Artificial Neural Networks Used in Environmental Modelling G. B. Kingston, H. R. Maier and M. F. Lambert Centre for Applied Modelling in Water Engineering, School

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements

Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements Jeffrey N. Rouder Francis Tuerlinckx Paul L. Speckman Jun Lu & Pablo Gomez May 4 008 1 The Weibull regression model

More information

Covariance function estimation in Gaussian process regression

Covariance function estimation in Gaussian process regression Covariance function estimation in Gaussian process regression François Bachoc Department of Statistics and Operations Research, University of Vienna WU Research Seminar - May 2015 François Bachoc Gaussian

More information

Spatial Statistics with Image Analysis. Outline. A Statistical Approach. Johan Lindström 1. Lund October 6, 2016

Spatial Statistics with Image Analysis. Outline. A Statistical Approach. Johan Lindström 1. Lund October 6, 2016 Spatial Statistics Spatial Examples More Spatial Statistics with Image Analysis Johan Lindström 1 1 Mathematical Statistics Centre for Mathematical Sciences Lund University Lund October 6, 2016 Johan Lindström

More information

Overall Objective Priors

Overall Objective Priors Overall Objective Priors Jim Berger, Jose Bernardo and Dongchu Sun Duke University, University of Valencia and University of Missouri Recent advances in statistical inference: theory and case studies University

More information

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 EPSY 905: Intro to Bayesian and MCMC Today s Class An

More information

Flexible Regression Modeling using Bayesian Nonparametric Mixtures

Flexible Regression Modeling using Bayesian Nonparametric Mixtures Flexible Regression Modeling using Bayesian Nonparametric Mixtures Athanasios Kottas Department of Applied Mathematics and Statistics University of California, Santa Cruz Department of Statistics Brigham

More information

Variational Principal Components

Variational Principal Components Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings

More information

Multilevel Statistical Models: 3 rd edition, 2003 Contents

Multilevel Statistical Models: 3 rd edition, 2003 Contents Multilevel Statistical Models: 3 rd edition, 2003 Contents Preface Acknowledgements Notation Two and three level models. A general classification notation and diagram Glossary Chapter 1 An introduction

More information

Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method

Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method Madeleine B. Thompson Radford M. Neal Abstract The shrinking rank method is a variation of slice sampling that is efficient at

More information

Stat 451 Lecture Notes Markov Chain Monte Carlo. Ryan Martin UIC

Stat 451 Lecture Notes Markov Chain Monte Carlo. Ryan Martin UIC Stat 451 Lecture Notes 07 12 Markov Chain Monte Carlo Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapters 8 9 in Givens & Hoeting, Chapters 25 27 in Lange 2 Updated: April 4, 2016 1 / 42 Outline

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geo-statistical Datasets

Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geo-statistical Datasets Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geo-statistical Datasets Abhirup Datta 1 Sudipto Banerjee 1 Andrew O. Finley 2 Alan E. Gelfand 3 1 University of Minnesota, Minneapolis,

More information

Variational Methods in Bayesian Deconvolution

Variational Methods in Bayesian Deconvolution PHYSTAT, SLAC, Stanford, California, September 8-, Variational Methods in Bayesian Deconvolution K. Zarb Adami Cavendish Laboratory, University of Cambridge, UK This paper gives an introduction to the

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

Statistical Methods in Particle Physics Lecture 1: Bayesian methods

Statistical Methods in Particle Physics Lecture 1: Bayesian methods Statistical Methods in Particle Physics Lecture 1: Bayesian methods SUSSP65 St Andrews 16 29 August 2009 Glen Cowan Physics Department Royal Holloway, University of London g.cowan@rhul.ac.uk www.pp.rhul.ac.uk/~cowan

More information

arxiv: v1 [stat.co] 23 Apr 2018

arxiv: v1 [stat.co] 23 Apr 2018 Bayesian Updating and Uncertainty Quantification using Sequential Tempered MCMC with the Rank-One Modified Metropolis Algorithm Thomas A. Catanach and James L. Beck arxiv:1804.08738v1 [stat.co] 23 Apr

More information

Bayesian inference for multivariate extreme value distributions

Bayesian inference for multivariate extreme value distributions Bayesian inference for multivariate extreme value distributions Sebastian Engelke Clément Dombry, Marco Oesting Toronto, Fields Institute, May 4th, 2016 Main motivation For a parametric model Z F θ of

More information

Bayesian Estimation of Input Output Tables for Russia

Bayesian Estimation of Input Output Tables for Russia Bayesian Estimation of Input Output Tables for Russia Oleg Lugovoy (EDF, RANE) Andrey Polbin (RANE) Vladimir Potashnikov (RANE) WIOD Conference April 24, 2012 Groningen Outline Motivation Objectives Bayesian

More information

Supplementary Note on Bayesian analysis

Supplementary Note on Bayesian analysis Supplementary Note on Bayesian analysis Structured variability of muscle activations supports the minimal intervention principle of motor control Francisco J. Valero-Cuevas 1,2,3, Madhusudhan Venkadesan

More information