Extreme value modelling of rainfalls and


Université de Paris Sud
Master's Thesis
Extreme value modelling of rainfalls and flows
Author: Fan JIA
Supervisors: Pr Elisabeth Gassiat, Dr Elena Di Bernardino, Pr Michel Béra
A thesis submitted in fulfilment of the requirements for the degree of Master of Probabilities and Statistics in the CEDRIC-C.N.A.M
September 2013

UNIVERSITE DE PARIS SUD
Abstract
Probabilities and Statistics, Department of mathematics
Master of Probabilities and Statistics
Extreme value modelling of rainfalls and flows
by Fan JIA

The main topic of this thesis is the statistical analysis of extreme values with applications to hydrology, more specifically to rainfalls and flows. Estimating rainfall and flow frequencies is crucial given the potentially significant damage to agriculture, ecology and infrastructure systems. The main goal of this study is to find the best-fitting distributions for the rainfalls and flows measured in the Bièvre region over the 10-year study period. The block maxima method is applied when we fit the Generalized Extreme Value (GEV) distribution, and the Peak Over Threshold (POT) method is applied when we fit the Generalized Pareto (GP) distribution [1]. The parameters are estimated by maximum likelihood.

We use two different methods to treat trends and seasonal factors. In the first method, we remove the trends and seasonal factors at the outset and then fit GEV and GP distributions to the remaining stationary series. In the second method, we use a point process model, which incorporates trends and seasonal variations directly instead of deseasonalizing the data [2]. With these models, we derive estimates of T-year return levels for different values of T.

Another important goal is to evaluate the performance of the control system installed in the Bièvre region by SIAVB. We estimate the return levels of the flow in the Bièvre region without the control system, which allows us to judge whether the control system is effective. To obtain an accurate and robust prediction of the return level, we estimate the hyper-parameters of the Ridge regression and the PLS (Partial Least Squares) regression by cross-validation.


Acknowledgements

First and foremost I offer my sincerest gratitude to my supervisor, Dr. Elena Di Bernardino. Without her encouragement and effort, this thesis could not have been completed or written. One simply could not wish for a better or friendlier supervisor.

I would also like to express my deepest appreciation to Professor Elisabeth Gassiat, who has supported me throughout the second year of my Master's with her patience and knowledge. I attribute the level of my Master's degree to her help and encouragement.

My sincere thanks also go to Professor Michel Béra, who taught me everything I needed to know about robust regression methods. Without his guidance, this work would not have been possible.

I would like to thank M. Herve Cardinal from SIAVB and M. Jean-Marie Grisot from Veolia for providing the necessary data and very useful information about the project.

Last but not least, I would like to thank the Laboratory CEDRIC of C.N.A.M. for providing the support and equipment I needed to produce and complete my thesis, and for funding my studies.

Contents

Abstract
Acknowledgements
Introduction
1 Theoretical Framework
    Some definitions of hydrology
        Return period
        Probability of occurrence
        Return level
    Extreme value theory
        Extreme value theory and models
        Generalized Extreme Value distributions
    Generalized Pareto distribution
        Generalized Pareto models
    Dependence in the series
        Extremal index
        Estimation of the extremal index
            Block method
            Runs method
    Point process approach
    Linear regression
        Ridge regression
        Partial Least Squares regression
        Cross-validation
2 Case Study
    Procedure of study
    Numerical results and discussion
        Descriptive statistics for the data
        Dependence in the series
        GEV approach
        GP approach
        Point process approach
        Generalized linear regression
Conclusion
Bibliography

Introduction

Floods cause significant damage to agriculture and ecology and disrupt human activities, which makes it necessary to study how likely another heavy rainfall is in the future. Many areas are affected by heavy rainfalls and the associated floods and landslides. We focus on the Bièvre region, which faced serious flooding problems before SIAVB installed a fully automated control system. Our goal is to evaluate the performance of this control system, and to do so we fit extreme value models to the rainfalls and flows in the Bièvre region.

Our statistical analysis of extreme rainfalls and flows is based on observations from 10 stations in the Bièvre region: 4 stations measure flows and 6 stations measure rainfalls. The data was measured every 5 minutes during 10 years, from 2003 to 2013. We use two techniques to select our sample: one takes the monthly maxima of daily rainfalls and flows, the other selects exceedances over a specific threshold value. We fit the observed data with the Generalized Extreme Value (GEV) distribution and the Generalized Pareto (GP) distribution respectively. The model parameters are estimated by maximum likelihood.

We also use two different methods to treat the trends and the seasonal factors. The first is the classical approach: we remove trends and seasonal factors at the outset to obtain a stationary series, and then fit GEV and GP distributions to that stationary series. In the second method, we incorporate trends and seasonal variation in the models, which allows us to treat the non-stationary series directly. With the chosen models, we derive estimates of T-year return levels for different values of T.

A water control system is installed in the Bièvre region. It consists of several computer-operated water gates with which the flows can be controlled. It is therefore interesting to see what would happen without this system, and to this aim we simulate the flow without it. The main technique here is regression. To improve the quality of the prediction and make the model more robust, we apply Ridge regression and PLS regression.

Chapter 1 Theoretical Framework

1.1 Some definitions of hydrology

Frequency analysis provides a systematic approach for using historical data to relate the magnitude of a naturally occurring event to the probability of its occurring in a given time period, or to its recurrence interval. Frequency analysis is typically included in the study of hydrology. We present here some important notions of frequency analysis, which are applied later in the case study.

Return period

The return period (T) is the average length of time, in years, between events (e.g. flow or river level) that exceed a given magnitude. For example, if the rainfall with a 100 year return period at a given location is 80 mm, this means that a rainfall of 80 mm or greater should occur at that location on average only once every 100 years.

Probability of occurrence

The probability of occurrence (p) is the probability that an event of the specified magnitude will be equalled or exceeded during a one year period. For instance, if N is the total number of values and m is the rank of a value in a list ordered by descending magnitude ($x_1 > x_2 > x_3 > \dots > x_m$), then the exceedance probability of the m-th largest value, $x_m$, is

$$P(X \ge x_m) = \frac{m}{N}.$$
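As a minimal sketch of the rank-based estimate above, the following Python function orders a sample by descending magnitude and attaches the empirical exceedance probability m/N to each value, together with its reciprocal N/m, which is the corresponding return period in units of the sampling interval. The function name and the sample values are hypothetical and serve only as an illustration; ties are not treated specially.

```python
def exceedance_table(values):
    """Rank values in descending order and attach p = m/N and T = N/m.

    Returns a list of tuples (rank m, value x_m, p, T).
    """
    ordered = sorted(values, reverse=True)
    n = len(ordered)
    rows = []
    for m, x in enumerate(ordered, start=1):
        p = m / n                 # empirical P(X >= x_m)
        rows.append((m, x, p, 1.0 / p))  # 1/p is the return period
    return rows


# Hypothetical annual peak flows in m^3/s (illustrative values only).
annual_peaks = [0.92, 1.38, 0.55, 0.61, 0.48, 0.75, 0.44, 0.40, 0.33, 0.21]
for m, x, p, t in exceedance_table(annual_peaks):
    print(f"rank {m:2d}  x = {x:.2f}  p = {p:.2f}  T = {t:.1f}")
```

With only N observations the largest return period that can be read off this table is N itself, which is one reason for fitting a parametric extreme value model.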

A fundamental relationship exists between the flood return period (T) and the probability of occurrence (p). These two variables are inversely related to each other:

$$\frac{1}{T} = p := P(X \ge x_m) = \frac{m}{N},$$

so we have

$$T = \frac{1}{P(X \ge x_m)} = \frac{N}{m}.$$

For example, the probability of a 100 year storm occurring in a one year period is 1/100 or 0.01.

Return level

A return level with a return period T = 1/p years is a high threshold x(p) (e.g., annual peak flow of a river) whose probability of exceedance is p. For example, if p = 0.01, then the return period is T = 100 years, and $P(X \ge x) = 0.01$ means that the probability that the peak flow is greater than x is 0.01.

1.2 Extreme value theory

In this section, we deal with extreme value distributions, which are designed for the analysis of maxima. We first present the three types of extreme value distributions: the Gumbel (EV0), the Fréchet (EV1) and the Weibull (EV2) [3].

Extreme value theory and models

Let $X_1, \dots, X_n$ be a sequence of independent and identically distributed random variables with distribution function F, and let $M_n = \max(X_1, \dots, X_n)$ denote the maximum. In theory, the exact distribution of the maximum can be derived as

$$\Pr(M_n \le z) = \Pr(X_1 \le z, \dots, X_n \le z) \tag{1.1}$$

$$= \Pr(X_1 \le z) \cdots \Pr(X_n \le z) = (F(z))^n. \tag{1.2}$$

The associated indicator function $I(X_n > z)$ is a Bernoulli process with a success probability $p(z) = 1 - F(z)$ that depends on the magnitude z. The number of extreme events within n trials thus follows a binomial distribution, and the number of trials until an event occurs follows a geometric distribution with expected value and standard deviation of the same order $O(1/p(z))$.

In practice, we do not know the distribution function F, but the Fisher-Tippett-Gnedenko theorem provides the following asymptotic result.

Theorem 1.1 (see [4]). If there exist sequences of constants $a_n > 0$ and $b_n \in \mathbb{R}$ such that

$$P\{(M_n - b_n)/a_n \le z\} \to G(z) \quad \text{as } n \to \infty$$

and G is a non-degenerate distribution, then G belongs to one of the following families:

$$\text{Gumbel:} \quad G(z) = \exp\left\{-\exp\left(-\frac{z-b}{a}\right)\right\} \quad \text{for } z \in \mathbb{R}, \tag{1.3}$$

$$\text{Fréchet:} \quad G(z) = \begin{cases} 0 & z \le b, \\ \exp\left\{-\left(\frac{z-b}{a}\right)^{-\alpha}\right\} & z > b, \end{cases} \tag{1.4}$$

$$\text{Weibull:} \quad G(z) = \begin{cases} \exp\left\{-\left(-\frac{z-b}{a}\right)^{\alpha}\right\} & z < b, \\ 1 & z \ge b, \end{cases} \tag{1.5}$$

where $\alpha > 0$.

Generalized Extreme Value distributions

In probability theory and statistics, the Generalized Extreme Value (GEV) distribution is a family of continuous probability distributions that includes the Gumbel (Equation (1.3)), Fréchet (Equation (1.4)) and Weibull (Equation (1.5)) distributions. By the extreme value theorem (Theorem 1.1), the GEV distribution is the limit distribution of properly normalized maxima of a sequence of independent and identically distributed random variables. For this reason, the GEV distribution is used as an approximation to model the maxima of long (finite) sequences of random variables.

The three distributions given by Equations (1.3) to (1.5) can be written as a single cumulative distribution function (Equation (1.6)), which is why it is called the Generalized Extreme Value distribution:

$$F(x; \mu, \sigma, \xi) = \exp\left\{-\left[1 + \xi\left(\frac{x-\mu}{\sigma}\right)\right]^{-1/\xi}\right\}, \tag{1.6}$$

for $1 + \xi(x-\mu)/\sigma > 0$, where $\mu \in \mathbb{R}$ is the location parameter, $\sigma > 0$ the scale parameter and $\xi \in \mathbb{R}$ the shape parameter. For $\xi = 0$ the expression is formally undefined and is understood as a limiting case. The density function is, consequently,

$$f(x; \mu, \sigma, \xi) = \frac{1}{\sigma}\left[1 + \xi\left(\frac{x-\mu}{\sigma}\right)\right]^{(-1/\xi)-1} \exp\left\{-\left[1 + \xi\left(\frac{x-\mu}{\sigma}\right)\right]^{-1/\xi}\right\}, \tag{1.7}$$

for $1 + \xi(x-\mu)/\sigma > 0$.

Link to Fréchet, Weibull and Gumbel families

The shape parameter ξ governs the tail behaviour of the distributions. The sub-families defined by ξ = 0, ξ > 0 and ξ < 0 correspond, respectively, to the Gumbel, Fréchet and Weibull families, whose cumulative distribution functions are displayed below (see also Figure 1.1) [5].

ξ = 0: Gumbel

$$F(x; \mu, \sigma, 0) = e^{-e^{-(x-\mu)/\sigma}} \quad \text{for } x \in \mathbb{R}, \tag{1.8}$$

which is a positively skewed distribution with a light upper tail.

ξ > 0: Fréchet

$$F(x; \mu, \sigma, \xi) = \begin{cases} 0 & x \le \mu, \\ e^{-((x-\mu)/\sigma)^{-\alpha}} & x > \mu, \end{cases} \tag{1.9}$$

which is a distribution with a heavy upper tail and infinite higher order moments.

ξ < 0: Weibull

$$F(x; \mu, \sigma, \xi) = \begin{cases} e^{-(-(x-\mu)/\sigma)^{\alpha}} & x < \mu, \\ 1 & x \ge \mu, \end{cases} \tag{1.10}$$

where σ > 0. The Weibull distribution has a bounded upper tail.
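To make Equation (1.6) concrete, here is a minimal Python sketch of the GEV distribution function, together with the return level obtained by inverting it (solving 1 − F(x) = p for x). It uses only the standard library; the function names and the parameter values in the usage lines are hypothetical and are not estimates from the thesis data. The case ξ = 0 is handled as the Gumbel limit.

```python
import math


def gev_cdf(x, mu, sigma, xi):
    """GEV distribution function of Equation (1.6); xi = 0 is the Gumbel limit."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    s = (x - mu) / sigma
    if abs(xi) < 1e-12:
        return math.exp(-math.exp(-s))
    t = 1.0 + xi * s
    if t <= 0.0:
        # Outside the support: below the lower end point when xi > 0,
        # above the upper end point when xi < 0.
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))


def gev_return_level(p, mu, sigma, xi):
    """Level x exceeded with probability p, i.e. the solution of 1 - F(x) = p."""
    y = -math.log(1.0 - p)
    if abs(xi) < 1e-12:
        return mu - sigma * math.log(y)
    return mu + sigma / xi * (y ** (-xi) - 1.0)


# Hypothetical parameters for a flow series, in m^3/s.
mu, sigma, xi = 0.5, 0.2, 0.1
level = gev_return_level(0.01, mu, sigma, xi)
print(level, gev_cdf(level, mu, sigma, xi))
```

When the model is fitted to block maxima, p refers to one block, so with monthly maxima the T-year return level corresponds to p = 1/(12T).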

Figure 1.1: Different types of Generalized Extreme Value distributions.

We use the shape parameter to determine the behaviour of the tail of a distribution. This method is applied in the case study.

χ² Test for the Gumbel distribution

The likelihood ratio test, whose statistic is compared with a χ² distribution, compares the fit of two models, one of which (the null model) is a special case of the other (the alternative model). Within the extreme value distributions, the Gumbel distribution is the special case of the GEV distribution whose shape parameter is 0. So when a fitted GEV distribution has a shape parameter close to 0, it is natural to test

$$H_0: \xi = 0 \quad \text{against} \quad H_1: \xi \ne 0$$

with unknown location and scale parameters. For a given vector $x = (x_1, \dots, x_n)$, the likelihood ratio (LR) statistic is

$$T_{LR}(x) = 2 \log\left(\frac{L_1}{L_0}\right), \tag{1.11}$$

where $L_1$ denotes the maximized likelihood of the more complicated model (Fréchet or Weibull), which has 3 parameters (µ, σ, ξ), and $L_0$ denotes the maximized likelihood of the Gumbel model, which has only 2 parameters (µ, σ). Under $H_0$, the test statistic approximately follows a chi-squared distribution with 1 degree of freedom. Consequently, the p-value is

$$p_{LR}(x) = 1 - \chi^2_1(T_{LR}(x)), \tag{1.12}$$

where $\chi^2_1$ denotes the distribution function of the chi-squared distribution with 1 degree of freedom.
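Equations (1.11) and (1.12) can be evaluated without a statistics library, because the survival function of a chi-squared variable with 1 degree of freedom equals erfc(√(t/2)). The sketch below assumes the two maximized log-likelihoods have already been obtained from some fitting routine; the function name and the numbers in the usage lines are hypothetical.

```python
import math


def lr_test_gumbel(loglik_gev, loglik_gumbel, alpha=0.05):
    """Likelihood ratio test of H0: xi = 0 (Gumbel) against the full GEV model.

    Both arguments are maximized log-likelihoods, so
    T_LR = 2 * log(L1 / L0) = 2 * (loglik_gev - loglik_gumbel).
    """
    t_lr = max(2.0 * (loglik_gev - loglik_gumbel), 0.0)
    # P(chi2 with 1 d.o.f. > t) = P(|Z| > sqrt(t)) = erfc(sqrt(t / 2)).
    p_value = math.erfc(math.sqrt(t_lr / 2.0))
    return t_lr, p_value, p_value < alpha


# Hypothetical log-likelihood values, for illustration only.
t_lr, p_value, reject = lr_test_gumbel(-100.0, -100.1)
print(t_lr, p_value, reject)
```

The third returned value is True when the Gumbel hypothesis is rejected at level alpha.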

The decision rule is:
If $p_{LR}(x) > \alpha$, do not reject $H_0$;
If $p_{LR}(x) < \alpha$, reject $H_0$;
where the significance level α is a predetermined number (0.05, for instance).

1.3 Generalized Pareto distribution

This section deals with exceedances, also called Peaks-Over-Threshold (POT) values. We distinguish 3 types of Generalized Pareto (GP) distributions: the Exponential distribution, the Pareto distribution and the Beta distribution.

Generalized Pareto models

The Generalized Pareto (GP) distribution is a well known family of continuous probability distributions. It is often used to model the tail of a distribution. It is specified by three parameters: location µ, scale σ and shape ξ. Sometimes it is specified by only the scale σ and shape ξ parameters. The cumulative distribution function is

$$F_{(\xi,\mu,\sigma)}(x) = \begin{cases} 1 - \left(1 + \frac{\xi(x-\mu)}{\sigma}\right)^{-1/\xi} & \text{for } \xi \ne 0, \\ 1 - \exp\left(-\frac{x-\mu}{\sigma}\right) & \text{for } \xi = 0, \end{cases} \tag{1.13}$$

for $x \ge \mu$ when $\xi \ge 0$, and $\mu \le x \le \mu - \sigma/\xi$ when $\xi < 0$, where $\mu \in \mathbb{R}$, $\sigma > 0$ and $\xi \in \mathbb{R}$. The probability density function is

$$f_{(\xi,\mu,\sigma)}(x) = \frac{1}{\sigma}\left(1 + \frac{\xi(x-\mu)}{\sigma}\right)^{-\frac{1}{\xi}-1}. \tag{1.14}$$

Link to Exponential, Pareto and Beta families

The shape parameter ξ governs the tail behaviour of the distribution. The sub-families defined by ξ = 0, ξ > 0 and ξ < 0 correspond, respectively, to the Exponential, Pareto and Beta families (see Figure 1.2).
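As a minimal sketch of Equation (1.13), the following Python function evaluates the GP distribution function for the three cases of the shape parameter, including the bounded support when ξ < 0. The function name and the parameter values in the usage lines are hypothetical.

```python
import math


def gp_cdf(x, mu, sigma, xi):
    """Generalized Pareto distribution function of Equation (1.13)."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    if x < mu:
        return 0.0                      # below the threshold (location) mu
    s = (x - mu) / sigma
    if abs(xi) < 1e-12:
        return 1.0 - math.exp(-s)       # exponential case, xi = 0
    t = 1.0 + xi * s
    if t <= 0.0:
        return 1.0                      # xi < 0 and x beyond mu - sigma / xi
    return 1.0 - t ** (-1.0 / xi)


# Hypothetical scale and shapes; mu plays the role of the threshold.
for xi in (0.0, 0.5, -0.5):
    print(xi, gp_cdf(1.0, 0.0, 1.0, xi))
```

For the same scale, a larger ξ gives a smaller value of F at a fixed x, that is, more probability mass left in the upper tail.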

ξ = 0: Exponential family, which is a light-tailed distribution with the memoryless property.
ξ > 0: Pareto family, which is a heavy-tailed distribution.
ξ < 0: Beta family, which is a bounded distribution.

Figure 1.2: Different types of Generalized Pareto distributions.

The shape parameter is used to identify the type of the distribution and the tail behaviour. We apply this method in our case study.

1.4 Dependence in the series

Classical extreme value theory is based on the hypothesis that the variables are independent and identically distributed (i.i.d.). In our case, however, we do not know whether the extreme events are correlated. To check this condition, we calculate the extremal index. In this section, we use two different methods to estimate the extremal index, as described in [4]: the block method and the runs method. We explain these two methods with the necessary theory.

Extremal index

The extremal index is an important parameter measuring the degree of clustering of extremes in a stationary process. The extremal index, $\theta \in [0, 1]$, can be interpreted as the reciprocal of the mean cluster size. Apart from being of interest in its own right, it is a crucial parameter for determining the correlation between extremal values. In reality, extremal events often tend to occur in clusters caused by local dependence in the data. The extremal index is an indicator that allows us to characterize the dependence structure of the data and their extremal behaviour.

Definition. Suppose we have n observations from a stationary process $\{X_i, i \ge 0\}$ with distribution function F. For large n and $u_n$, it is typically the case that

$$F_n(u_n) = P(\max(X_1, \dots, X_n) \le u_n) \approx F^{n\theta}(u_n), \tag{1.15}$$

where $\theta \in [0, 1]$ is a constant of the process known as the extremal index.

Estimation of the extremal index

One of the characterizations of the extremal index [6] is that 1/θ is the mean cluster size in the point process of exceedance times over a high threshold. This suggests that a suitable way to estimate the extremal index is to identify clusters of high level exceedances and to calculate the mean size of those clusters. Suppose that we have n observations from a stationary series, and let $N_n$ denote the number of observations that exceed a predetermined high threshold $u_n$. We consider two methods of defining clusters: the block method and the runs method [7].

Block method

The block method divides the data into approximately $k_n$ blocks of length $r_n$, where $n \approx k_n r_n$. Each block is treated as one cluster. Thus, if $Z_n$ denotes the number of blocks in which there is at least one exceedance of the threshold $u_n$, we take $N_n / Z_n$ as an estimate of the mean cluster size and hence estimate θ by

$$\hat\theta_n = Z_n / N_n. \tag{1.16}$$

Runs method

Our second method is based on the idea that runs of observations below or above the threshold define clusters. More precisely, suppose we take any sequence of $r_n$ consecutive observations below the threshold as separating two clusters. An equivalent and in some ways more attractive characterization is the following. Define $W_{n,i}$ to be 1 if the i-th observation is above the threshold and 0 otherwise. Let

$$N_n = \sum_{i=1}^{n} W_{n,i}, \qquad Z_n^* = \sum_{i=1}^{n} W_{n,i}(1 - W_{n,i+1}) \cdots (1 - W_{n,i+r_n}), \tag{1.17}$$

then

$$\tilde\theta_n = Z_n^* / N_n. \tag{1.18}$$

We call it the runs estimator. The definition of $Z_n^*$ ensures that an exceedance in position i is counted if and only if the following $r_n$ observations are all below the threshold $u_n$, in other words, if it is the rightmost member of a cluster according to the runs definition.

1.5 Point process approach

This section presents an alternative way to treat the series, based on a point process. It is interesting to determine whether there is any evidence of a long term increase, so it is natural to look for a possible trend. The seasonal cycle is also an important factor that we have to take into consideration. Rather than modelling the cycle and the trend separately, we use a non-homogeneous point process model in which the parameters depend on the time t. This approach has the advantage of accounting for the uncertainty in all the parameters (see [2] and [3]). The parameters are written $(\mu_t, \sigma_t, \xi_t)$. A typical model has the form

$$\mu_t = \sum_{j=0}^{q_\mu} \mu_j x_{jt}, \qquad \log(\sigma_t) = \sum_{j=0}^{q_\sigma} \sigma_j x_{jt}, \qquad \xi_t = \xi, \tag{1.19}$$

where the $x_{jt}$, j = 0, 1, 2, ..., are the covariates, and we usually assume $x_{0t} = 1$. In our analysis, we consider models in which only $\mu_t$ depends on the covariates. The covariates taken into consideration in our case are a linear function of time t (trend component) and sinusoidal terms of the form $\sin(2\pi k t / T)$ and $\cos(2\pi k t / T)$ (seasonal component), where k = 1, 2, ... and T = 365 represents one year. In practice, we normally choose k = 1. So we can assume that the parameters have the following form [2]:

$$\mu_t = \mu_0 + \mu_1 t + \mu_2 \sin(2\pi t / T) + \mu_3 \cos(2\pi t / T), \tag{1.20}$$

$$\sigma_t = \sigma, \qquad \xi_t = \xi. \tag{1.21}$$

This approach is used in Section 2.2.5.
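The time-dependent location parameter of Equation (1.20) is simple to evaluate. The sketch below is only an illustration of the covariate structure (linear trend plus one annual harmonic with T = 365 days); the function name and the coefficient values are hypothetical and are not estimates from the thesis.

```python
import math


def location(t, mu0, mu1, mu2, mu3, period=365.0):
    """Location parameter mu_t of Equation (1.20) at day t."""
    angle = 2.0 * math.pi * t / period
    return mu0 + mu1 * t + mu2 * math.sin(angle) + mu3 * math.cos(angle)


# Hypothetical coefficients: baseline, daily trend, seasonal sine and cosine terms.
coefficients = (0.5, 1e-5, 0.05, -0.10)
for day in (0, 91, 182, 273, 365):
    print(day, round(location(day, *coefficients), 4))
```

The two seasonal coefficients together describe one sinusoid of amplitude sqrt(mu2² + mu3²), so the pair can encode both the size and the timing of the seasonal peak.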

1.6 Linear regression

Ridge regression

Ridge regression is the most commonly used method of regularization of ill-posed problems. It is also known as the constrained linear inversion method and the method of linear regularization (see [8]). When the problem $Ax = b$ is not well posed (because x does not exist or is not unique), the standard approach, which seeks to minimize the residual $\|Ax - b\|^2$ where $\|\cdot\|$ is the Euclidean norm, may not work. This may be because the system is overdetermined or underdetermined (A may be ill-conditioned or singular). To give preference to a particular solution with desirable properties, a regularization term is included in the minimization:

$$\|Ax - b\|^2 + \|\Gamma x\|^2,$$

for some suitably chosen matrix Γ. In many cases, this matrix is chosen as a multiple of the identity matrix, giving preference to solutions with smaller norms. In other cases, lowpass operators (e.g., a difference operator or a weighted Fourier operator) may be used to enforce smoothness if the underlying vector is believed to be mostly continuous. This regularization improves the conditioning of the problem, thus enabling a numerical solution. An explicit solution, denoted by $\hat{x}$, is given by

$$\hat{x} = (A^T A + \Gamma^T \Gamma)^{-1} A^T b.$$

The effect of the regularization can be varied via the scale of the matrix Γ. When Γ = 0, the solution reduces to the ordinary (unregularized) least squares solution, provided that $(A^T A)^{-1}$ exists.

Partial Least Squares regression

Partial least squares regression (PLS regression) is a statistical method that bears some relation to principal components regression. Instead of finding hyperplanes of maximum variance between the response and independent variables, it finds a linear regression model by projecting the predicted variables and the observable variables to a new space. Because both the X and Y data are projected to new spaces, the PLS family of methods are known as bilinear factor models (see [8]).

PLS regression is used to find the fundamental relations between two matrices (X and Y), i.e. it is a latent variable approach to modelling the covariance structures in these two spaces. A PLS model tries to find the multidimensional direction in the X space that explains the maximum multidimensional variance direction in the Y space. PLS regression is particularly suited to cases where the matrix of predictors has more variables than observations, and where there is multicollinearity among the X values. By contrast, standard regression fails in these cases.

Cross-validation

Cross-validation is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds. Cross-validation is important in guarding against testing hypotheses suggested by the data, especially where further samples are hazardous, costly or impossible to collect.
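The following sketch combines the ridge solution of this section with k-fold cross-validation to choose the regularization strength. It is a dependency-free illustration under simplifying assumptions, not the procedure used in the thesis: the penalty matrix is taken as Γ = √λ·I, there is no intercept, folds are formed by taking every k-th observation, and all function names, the data and the grid of λ values are hypothetical.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ridge_fit(rows, y, lam):
    """Ridge solution (A^T A + lam * I)^{-1} A^T b, i.e. Gamma = sqrt(lam) * I."""
    p = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    atb = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(ata, atb)


def cv_error(rows, y, lam, k=5):
    """Mean squared prediction error over k folds (fold = index modulo k)."""
    n = len(rows)
    total = 0.0
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        coef = ridge_fit([rows[i] for i in train], [y[i] for i in train], lam)
        for i in test:
            pred = sum(c * v for c, v in zip(coef, rows[i]))
            total += (y[i] - pred) ** 2
    return total / n


def select_lambda(rows, y, grid, k=5):
    """Return the value in grid with the smallest cross-validated error."""
    return min(grid, key=lambda lam: cv_error(rows, y, lam, k))


# Hypothetical predictors (e.g. rainfall, gate opening) and a noisy response.
rows = [[1, 0], [0, 1], [1, 1], [2, 1], [1, 2], [3, 1], [2, 3], [1, 4], [4, 1], [2, 2]]
y = [2.1, 2.9, 5.2, 6.8, 8.1, 9.0, 13.2, 13.9, 11.1, 9.8]
best = select_lambda(rows, y, [0.0, 0.1, 1.0, 10.0])
print(best, ridge_fit(rows, y, best))
```

With lam = 0 the solver returns the ordinary least squares fit and fails if A^T A is singular; any positive lam makes the system solvable, which is the conditioning benefit described above.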

Chapter 2 Case Study

2.1 Procedure of study

The analysis of rainfalls and floods has been developed around the world for many years. We mainly follow the steps used in previous studies, applying several methods to our data. Our study contains the following steps:

1. Descriptive statistics for the data
We start with descriptive statistics, which give us a lot of information, such as the maximum and minimum of the flows during the 10 years and the average rainfalls and flows. We also present the histograms and the cumulative distribution functions, which give a first idea of the shape of the tails.

2. Dependence in the series
We deal with the dependence in the series. As mentioned in Section 1.4.1, we use two estimators of the extremal index θ.

3. GEV approach
We look for the Generalized Extreme Value distribution that fits our data well. This allows us to identify which type of extreme value distribution the data belongs to; in other words, we learn the behaviour of the tails (heavy or light) of our series (see Section 1.2.2).

4. GP approach
We fit the exceedances (Peak-Over-Threshold values) with a Generalized Pareto distribution. This is another way to identify the specific behaviour of the extreme values (see Section 1.3.1).

5. Point process approach
As mentioned in Section 1.5, we use the point process to fit the data. In our study we introduce the trend and the cycle in the parameters using Equations (1.20) and (1.21).

6. Linear regression
To protect the area from floods, a control system consisting of several water gates, operated in real time by computer with automated software, was installed at several stations. To assess the usefulness and importance of this system, we simulate the flow without the water gates. To do this, we need to calibrate the influence of the water gates on the flow. We model the flow as a function of the rainfalls and the opening percentage of the water gates (see Section 1.6).

2.2 Numerical results and discussion

Descriptive statistics for the data

In our study, there are 3 datasets containing, respectively, the monthly cumulative rainfalls, the flows of the river Bièvre and the opening percentage of the water gates from 2003 to 2013. The cumulative rainfall data was collected at 6 stations in the Bièvre region, the river flows were collected at 4 stations, and the opening percentage of the water gates was recorded at 4 stations. All the data was measured every five minutes. The flows are measured in m³/s and the rainfalls in mm. Figure 2.1 displays the coordinates of all the sites concerned, to show their geographical distribution.

Figure 2.1: Coordinates of the stations (axes: longitude and latitude; legend: rainfall, flow and water gate stations; stations shown: Geneste, Arcades de Buc, Trou Salé, Bas Pres, Vauboyen, Loup Pendu, Damoiseaux, Sablons, Monseigneur, Vilgenis, Cambacérès).

The rainfalls in the original dataset are monthly cumulative values, so the recorded quantity is not continuous: it starts again from zero at the beginning of each month. The first step is to convert the cumulative data to daily rainfall by calculating the difference accumulated over each 24 hours. For the flows of the river Bièvre, we use the daily average. As an example, we present some descriptive statistics of the daily average flow at station Arcades de Buc.

min      0.01 m³/s
max      1.38 m³/s
mean     0.21 m³/s
median   0.17 m³/s
sd       0.15 m³/s

Table 2.1: Descriptive statistics for average flow of station Arcades de Buc.

Figure 2.2: Average flow from 2003 to 2013 at station Arcades de Buc.
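The conversion from monthly cumulative readings to daily rainfall described above can be sketched as follows. This is an illustration under stated assumptions rather than the thesis's actual preprocessing: the input is assumed to hold one end-of-day cumulative reading per day, sorted by date, and the first reading of each month (or of the series) is taken as the rainfall accumulated since the counter restarted. The function name and the sample readings are hypothetical.

```python
from datetime import date


def daily_from_monthly_cumulative(readings):
    """Convert (date, cumulative_mm) end-of-day readings to daily rainfall.

    The cumulative counter is assumed to restart from zero at the start of
    each month, so the baseline is 0 for the first reading of a month.
    """
    daily = []
    prev_day, prev_value = None, 0.0
    for day, value in readings:
        same_month = (
            prev_day is not None
            and (day.year, day.month) == (prev_day.year, prev_day.month)
        )
        baseline = prev_value if same_month else 0.0
        daily.append((day, value - baseline))
        prev_day, prev_value = day, value
    return daily


# Hypothetical readings around a month boundary.
readings = [
    (date(2003, 1, 30), 40.0),
    (date(2003, 1, 31), 45.5),
    (date(2003, 2, 1), 2.0),
    (date(2003, 2, 2), 2.0),
]
for day, mm in daily_from_monthly_cumulative(readings):
    print(day.isoformat(), mm)
```

Missing days are not handled here; with gaps in the record, a difference would span more than 24 hours and would need separate treatment.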

In Figure 2.2, we can observe the seasonal factors. The peaks appear during the summer every year. The maximum of the average flow is 1.38 m³/s, which occurred on 18 July. We need to remove the trend and the seasonal component to obtain a stationary process. We assume that the different components affect the time series additively:

Data = Seasonal Component + Trend + Residuals

Figure 2.3: Decomposition of the original series.

The residual part does not present any trend or seasonal effect, so it can be considered a stationary time series. We are now interested in the maxima over a certain period. Normally one uses the annual maxima. However, since we only have 10 years of data, annual maxima would give too small a sample to draw any conclusion, so we use the monthly maxima, which gives us about 120 observations.
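Extracting the monthly block maxima from a daily series can be sketched with the standard library alone. The function name and the synthetic series below are hypothetical; the only point of the example is to show that ten years of daily values grouped by calendar month give 120 blocks.

```python
from collections import defaultdict
from datetime import date, timedelta


def monthly_maxima(series):
    """Group (date, value) pairs by calendar month and keep each block's maximum.

    Returns a list of ((year, month), maximum) sorted chronologically.
    """
    blocks = defaultdict(list)
    for day, value in series:
        blocks[(day.year, day.month)].append(value)
    return sorted((key, max(vals)) for key, vals in blocks.items())


# Synthetic daily series covering 2003-01-01 to 2012-12-31 (placeholder values).
start = date(2003, 1, 1)
n_days = (date(2013, 1, 1) - start).days
series = [(start + timedelta(days=i), i % 7) for i in range(n_days)]

maxima = monthly_maxima(series)
print(len(maxima), maxima[:3])
```

Switching to annual maxima only requires changing the grouping key to the year, at the cost of reducing the sample to 10 values.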

Figure 2.4: Monthly average flow of station Arcades de Buc.

In Table 2.2, we present the same descriptive statistics for the monthly maxima of the average flow at station Arcades de Buc. We are also interested in its histogram, the empirical density and the empirical cumulative distribution function. The results are displayed in Figure 2.5, which gives us some information about the shape of the tails.

min      0.14 m³/s
max      1.38 m³/s
mean     0.51 m³/s
median   0.44 m³/s
sd       0.23 m³/s

Table 2.2: Monthly maxima of the average flow of station Arcades de Buc.

Figure 2.5: Histogram with empirical density and cumulative distribution function of monthly maximum flow at station Arcades de Buc.

The histogram and the cumulative distribution function show that the distribution is almost symmetric and that the right tail is slightly heavy. This will be tested in Section 2.2.3 with numerical methods. We can apply the same method to all 4 stations to obtain the stationary series. They give similar results. Table 2.3 shows the maximum and the minimum of the flow at the 4 stations during the past 10 years. We can see that for all 4 stations, the maximum appears in the year 2007 and [...]

Station          Max flow   Date   Min flow   Date
Arcades de Buc   /07/ /12/10
Cambacérès       /08/ /08/22
Monseigneur      /05/ /05/24
Vauboyen         /05/ /09/08

Table 2.3: Maximum flow and minimum flow of 10 years in 4 stations.

We do the same study on the daily rainfalls, using the station Geneste as an example. We present the descriptive statistics of the daily rainfall at station Geneste in Table 2.4.

Figure 2.6: Daily rainfall from 2003 to 2013 at station Geneste.

Figure 2.7: Decomposition of daily rainfall at station Geneste.

min      0 mm
max      51.6 mm
mean     1.52 mm
median   0 mm
sd       3.58 mm

Table 2.4: Descriptive statistics for daily rainfall of station Geneste.

We can see that the rainfall peaks appear every year in August or September, and that in 2007 there were many more days with heavy rainfall than in the other years. After 2008, the rainfall seems to decrease until 2010, where we observe a significant peak. We then analyze the monthly maxima of rainfalls during the 10 years. The monthly maxima of rainfalls at station Geneste are displayed in Figure 2.8.

Figure 2.8: Monthly maxima of rainfall at station Geneste.

In Table 2.5, we present the descriptive statistics of the monthly maxima of rainfalls at station Geneste. We display the histogram, the empirical density and the empirical cumulative distribution function in Figure 2.9.

min      0.8 mm
max      51.6 mm
mean     [...] mm
median   12 mm
sd       8.22 mm

Table 2.5: Monthly maxima of the rainfall at station Geneste.

Figure 2.9: Histogram, empirical density and cumulative distribution function of monthly maximum rainfall at station Geneste.

It is clear that the distribution of the rainfalls at station Geneste is not symmetric and that it has a heavier tail than the flows. This will also be verified later by estimating the shape parameters.

Dependence in the series

Classical extreme value theory is based on the hypothesis that the variables are independent and identically distributed (i.i.d.). However, in Figures 2.2 and 2.6 the peaks seem to appear together. To find out whether the extreme events are correlated in time, we calculate the extremal index. In this section, we use two different estimators of the extremal index, the block method and the runs method, as described in Chapter 1. In the tables that follow, we present the values of the extremal index θ obtained with the two estimators. The results are computed with r = 20 (the definition of r can be found in Equations (1.16) and (1.17)).

Estimations of extremal index θ for the 4 stations of flow

Station Arcades de Buc
Quantile   Threshold   θ̂ by Block Method   θ̂ by Runs Method
Table 2.6: Extremal index of average flow of station Arcades de Buc.

Station Cambacérès
Quantile   Threshold   θ̂ by Block Method   θ̂ by Runs Method
Table 2.7: Extremal index of average flow of station Cambacérès.

Station Monseigneur
Quantile   Threshold   θ̂ by Block Method   θ̂ by Runs Method
Table 2.8: Extremal index of average flow of station Monseigneur.

Station Vauboyen
Quantile   Threshold   θ̂ by Block Method   θ̂ by Runs Method
Table 2.9: Extremal index of average flow of station Vauboyen.

Estimation of extremal index θ for the 6 stations of rainfall

Station Geneste
Quantile   Threshold   θ̂ by Block Method   θ̂ by Runs Method
Table 2.10: Extremal index of daily rainfall of station Geneste.

Station Loup Pendu
Quantile   Threshold   θ̂ by Block Method   θ̂ by Runs Method
Table 2.11: Extremal index of daily rainfall of station Loup Pendu.

Station Trou Salé
Quantile   Threshold   θ̂ by Block Method   θ̂ by Runs Method
Table 2.12: Extremal index of daily rainfall of station Trou Salé.

Station Sablons
Quantile   Threshold   θ̂ by Block Method   θ̂ by Runs Method
Table 2.13: Extremal index of daily rainfall of station Sablons.

Station Vilgenis
Quantile   Threshold   θ̂ by Block Method   θ̂ by Runs Method
Table 2.14: Extremal index of daily rainfall of station Vilgenis.

Station Vauboyen
Quantile   Threshold   θ̂ by Block Method   θ̂ by Runs Method
Table 2.15: Extremal index of daily rainfall of station Vauboyen.

With the help of these tables, we see that the estimated extremal indexes θ̂ approach 1 as the threshold increases: the higher the threshold, the fewer peaks over threshold remain, and the less visible the effect of dependence becomes. However, the extremal indexes estimated on our series are smaller than 1 at the 95% quantile. For the rainfall, the extreme values tend to occur separately, since the extremal indexes are close to 1, whereas the extreme values of the flows seem to be more correlated. Indeed, the size of the cluster is between 1 and 3. The correlation is not very strong but it is still noticeable.
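The two estimators used in this section can be sketched as follows. The code assumes the usual textbook forms (Equations (1.16) and (1.17) are not reproduced here, so the exact conventions of the thesis may differ): the blocks estimator divides the number of blocks of length r containing an exceedance by the number of exceedances, and the runs estimator divides the number of clusters, separated by at least r consecutive non-exceedances, by the number of exceedances. The function names and the toy series are hypothetical.

```python
def runs_estimator(x, u, r):
    """Runs estimator: clusters are separated by at least r values <= u."""
    exceed = [i for i, v in enumerate(x) if v > u]
    if not exceed:
        raise ValueError("no exceedances above the threshold")
    clusters = 1
    for prev, cur in zip(exceed, exceed[1:]):
        # cur - prev - 1 non-exceedances lie between two exceedances
        if cur - prev > r:
            clusters += 1
    return clusters / len(exceed)


def blocks_estimator(x, u, r):
    """Blocks estimator: share of exceedances that open a new block of length r."""
    n_exceed = sum(1 for v in x if v > u)
    if n_exceed == 0:
        raise ValueError("no exceedances above the threshold")
    n_blocks = sum(
        1 for s in range(0, len(x), r) if any(v > u for v in x[s:s + r])
    )
    return n_blocks / n_exceed


# Hypothetical series with three clusters of exceedances above u = 4.
series = [0, 5, 6, 0, 0, 0, 7, 0, 0, 0, 0, 8, 9, 0]
theta_runs = runs_estimator(series, u=4, r=2)
theta_blocks = blocks_estimator(series, u=4, r=3)
print(theta_runs, 1 / theta_runs)      # extremal index and mean cluster size
print(theta_blocks, 1 / theta_blocks)
```

Since 1/θ can be read as the mean cluster size, an estimate between 1/3 and 1 corresponds to clusters of 1 to 3 exceedances.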

GEV approach

As we discussed in Section 2.2.1, the rainfall and the flows present different tail shapes. To quantify this, we fit the Generalized Extreme Value (GEV) distribution to the maxima. Because of the size of our data set, we work with monthly maxima instead of annual maxima, which gives 108 maxima over the study period starting in 2003. The whole study is based on the stationary series.

We first deal with the average flow data. We give the details of the modelling for station Arcades de Buc, and then the parameter estimates for the other stations.

We fit the GEV distribution to the monthly maxima of station Arcades de Buc. We obtain the shape parameter ξ = −0.007 and the scale parameter σ = 0.15; the estimated location parameter µ is reported in Table 2.16. Since the shape parameter ξ is negative and very close to zero, it is natural to think that the monthly maxima may be fitted by a Gumbel distribution, which is often used in hydrology to analyze such variables. Figure 2.10 displays several diagnostic plots of the fitted GEV distribution.

Figure 2.10: Quality of adjustment using different plots.

According to Figure 2.10, the corresponding Gumbel distribution fits the upper part of our data well, so we can use this parametric approach to estimate the T-year return level. The 10-year return level is 0.92 m³/s and the 100-year return level is 1.26 m³/s.
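The computation of a T-year return level from a GEV fit on monthly maxima can be sketched as below. Two points are assumptions rather than statements of the thesis: the return period is converted with 12 blocks per year, so that the T-year level is the quantile of probability 1 − 1/(12T), and the location `MU_ASSUMED = 0.20` is a placeholder chosen for illustration, not the fitted value. Only σ = 0.15 and the near-zero shape are taken from the text.

```python
import math


def gev_return_level(mu, sigma, xi, T, blocks_per_year=12):
    """T-year return level of a GEV fitted to block maxima.

    Uses z_p = mu - sigma/xi * (1 - y**(-xi)) with y = -log(1 - p),
    and the Gumbel limit z_p = mu - sigma * log(y) when xi is zero.
    """
    p = 1.0 / (T * blocks_per_year)   # exceedance probability per block
    y = -math.log(1.0 - p)
    if abs(xi) < 1e-9:                # Gumbel case
        return mu - sigma * math.log(y)
    return mu - (sigma / xi) * (1.0 - y ** (-xi))


MU_ASSUMED = 0.20   # hypothetical location (m^3/s), NOT the thesis estimate
SIGMA = 0.15        # scale reported in the text

for T in (10, 100):
    gumbel = gev_return_level(MU_ASSUMED, SIGMA, 0.0, T)
    gev = gev_return_level(MU_ASSUMED, SIGMA, -0.007, T)
    print(T, round(gumbel, 3), round(gev, 3))
```

Under these assumptions the Gumbel levels come out close to the values quoted above, and the slightly negative shape gives marginally lower levels than the Gumbel limit.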

The same method can be applied to the remaining stations. The parameter estimates are presented in Table 2.16.

Station           shape ξ   location µ   scale σ
Arcades de Buc
Cambacérès
Monseigneur
Vauboyen
Table 2.16: Estimation of the parameters in GEV distribution.

Now we focus on the daily rainfall measured at 6 stations. The main idea is the same as the one used for the average flow. The results are presented in Table 2.17.

Station           shape ξ   location µ   scale σ
Geneste
Loup Pendu
Trou Salé
Sablons
Vilgenis
Vauboyen
Table 2.17: Estimation of the parameters in GEV distribution.

The χ² test mentioned in Section 1.6 is applied here to choose the model. The numerical results can be found in Table 2.18 and Table 2.19.

Station           p-value   Final model
Arcades de Buc              Gumbel
Cambacérès                  Gumbel
Monseigneur                 Gumbel
Vauboyen          0.57      Gumbel
Table 2.18: Likelihood Ratio Test for Gumbel distribution of average flow.

Station           p-value   Final model
Geneste                     Fréchet
Loup Pendu        0.03      Fréchet
Trou Salé         0.23      Gumbel
Sablons           0.03      Fréchet
Vilgenis                    Fréchet
Vauboyen          0.01      Fréchet
Table 2.19: Likelihood Ratio Test for Gumbel distribution of rainfall.

From Table 2.16 to Table 2.19, we can see that the flows measured at the 4 stations can be fitted by the Gumbel distribution. Among them, station Cambacérès presents more risk than the other stations. Conversely, the rainfall is fitted by the Fréchet distribution except at station Trou Salé. These results confirm the discussion of the tail shapes based on the descriptive statistics. It is natural that most of the flows do not have a heavy tail, because they are controlled by the system, which reduces the risk significantly. Most of the rainfall series, however, present a heavy tail. From this phenomenon we can imagine that, without the control, the flow might present a heavy tail as the rainfall does.

GP approach

This section deals with the exceedances, also called Peaks-Over-Threshold (POT) values (see Section 1.3.1). We fit Generalized Pareto distributions (GPD) to our data, and we again use the χ² test to choose the model. The results are displayed in the following tables.

Average flow
Station           shape ξ   scale σ   AIC   p-value   Final model
Arcades de Buc                                        exponential
Cambacérès                                            exponential
Monseigneur                                           exponential
Vauboyen                                              exponential
Table 2.20: Estimation of the parameters in GPD and accepted exponential distribution model.
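For the exponential case accepted for the flow stations, the POT steps can be sketched as follows. The sketch relies on standard facts rather than on the thesis code: the maximum likelihood estimate of the exponential scale is the mean excess over the threshold, and for ξ = 0 the level exceeded once every m observations is u + σ log(m ζ), where ζ is the proportion of exceedances. The function names, the quantile convention and the toy data are hypothetical.

```python
import math


def empirical_quantile(values, q):
    """Order statistic of rank ceil(q * n), one simple quantile convention."""
    s = sorted(values)
    k = max(0, math.ceil(q * len(s)) - 1)
    return s[k]


def fit_exponential_excess(values, threshold):
    """Return (sigma, zeta): mean excess and exceedance proportion."""
    excesses = [v - threshold for v in values if v > threshold]
    if not excesses:
        raise ValueError("no exceedances above the threshold")
    sigma = sum(excesses) / len(excesses)   # MLE of the exponential scale
    zeta = len(excesses) / len(values)
    return sigma, zeta


def exponential_return_level(threshold, sigma, zeta, T, obs_per_year=365):
    """Level exceeded on average once every T years when xi = 0."""
    m = T * obs_per_year
    return threshold + sigma * math.log(m * zeta)


# Hypothetical daily values 1, 2, ..., 100 and a 95% quantile threshold.
data = [float(v) for v in range(1, 101)]
u = empirical_quantile(data, 0.95)
sigma, zeta = fit_exponential_excess(data, u)
print(u, sigma, zeta, exponential_return_level(u, sigma, zeta, T=10))
```

With a nonzero shape, the same steps apply with the GPD likelihood in place of the mean excess, which requires a numerical optimizer.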

Daily rainfall
Station           shape ξ   scale σ   AIC   p-value   Final model
Geneste                                               Pareto
Loup Pendu                                            Pareto
Trou Salé                                             exponential
Sablons                                               Pareto
Vilgenis                                              Pareto
Vauboyen                                              Pareto
Table 2.21: Estimation of the parameters for the GPD distribution.

From Table 2.20 and Table 2.21, we find that the flows measured at all four stations can be fitted by exponential models, which do not present a heavy tail. Conversely, for the rainfall, only station Trou Salé should be fitted by an exponential model; the other stations are more likely to be heavy-tailed.

Point process approach

This section illustrates an alternative way to include and model the trend and the cycle in extreme value data, using the point process approach (see [2]). As an example, this approach is applied to the daily rainfall measured at station Loup Pendu. In our case, it is interesting to determine whether there is any evidence of a long-term increase, so it is natural to look for a possible trend. The cycle is also an important factor that we have to take into consideration. We use the model mentioned in Section 1.5. The location parameter µ is described by Equation (1.20); the scale parameter σ and the shape parameter ξ are given in Equation (1.21). For ease of reading, we rewrite the two equations here:

µ_t = µ_0 + µ_1 t + µ_2 sin(2πt)/t + µ_3 cos(2πt)/t,
σ_t = σ,   ξ_t = ξ.

The threshold used here is 8.6 mm, which is the 95% quantile of our data. The 4 nested candidate models used in our case are:

(i) µ_1 = µ_2 = µ_3 = 0: the data present neither the trend nor the cycle
(ii) µ_1 ≠ 0, µ_2 = µ_3 = 0: the data present only the trend
(iii) µ_1 = 0, µ_2 ≠ 0, µ_3 ≠ 0: the data present only the cycle
(iv) µ_1 ≠ 0, µ_2 ≠ 0, µ_3 ≠ 0: the data present both the trend and the cycle

Models (i) to (iii) are three sub-models and model (iv) is the full model. If the sub-model is correct, twice the difference between the negative log-likelihoods (-LogL in Table 2.22) has an approximate χ² distribution with n degrees of freedom, where n denotes the difference between the number of parameters in the full model and in the sub-model.

Model                                 µ̂_0   µ̂_1   µ̂_2   µ̂_3   σ̂   ξ̂   -LogL
(i)   µ_1 = µ_2 = µ_3 = 0
(ii)  µ_1 ≠ 0, µ_2 = µ_3 = 0
(iii) µ_1 = 0, µ_2 ≠ 0, µ_3 ≠ 0
(iv)  µ_1 ≠ 0, µ_2 ≠ 0, µ_3 ≠ 0
Table 2.22: Estimation of the parameters of 4 models.

We use the negative log-likelihood to choose the best fitting model.

Full model                   Sub-model                    Null Hyp          p-value   Selected model
µ_1 ≠ 0, µ_2 ≠ 0, µ_3 ≠ 0    µ_1 ≠ 0, µ_2 = µ_3 = 0       µ_2 = µ_3 = 0               µ_2 ≠ 0, µ_3 ≠ 0
µ_1 ≠ 0, µ_2 ≠ 0, µ_3 ≠ 0    µ_1 = 0, µ_2 ≠ 0, µ_3 ≠ 0    µ_1 = 0                     µ_1 = 0
µ_1 = 0, µ_2 ≠ 0, µ_3 ≠ 0    µ_1 = µ_2 = µ_3 = 0          µ_2 = µ_3 = 0               µ_2 ≠ 0, µ_3 ≠ 0
Table 2.23: Model Selection.

With the threshold set at 8.6 mm, the best fitted model is the one which presents only the cycle in the parameter µ (see Equation (1.20)). We can also choose the best fitted model by using the AIC (Akaike Information Criterion). The results are presented in the following table:

Model                        AIC
µ_1 = µ_2 = µ_3 = 0
µ_1 ≠ 0, µ_2 = µ_3 = 0
µ_1 = 0, µ_2 ≠ 0, µ_3 ≠ 0
µ_1 ≠ 0, µ_2 ≠ 0, µ_3 ≠ 0
Table 2.24: Akaike Information Criterion of 4 models.
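The model selection described above can be sketched numerically as follows. The negative log-likelihood values in `NLL` are hypothetical placeholders, not the estimates of Table 2.22; the parameter counts assume that model (i) has the three parameters µ_0, σ and ξ. The χ² tail probability is written in closed form for 1 and 2 degrees of freedom, which are the only cases needed for these nested models.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability of a chi-square variable, for df = 1 or 2."""
    if x < 0:
        raise ValueError("the statistic must be non-negative")
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only df = 1 or 2 are handled in this sketch")


def deviance_test(nll_sub, nll_full, df):
    """Deviance D = 2 * (nll_sub - nll_full) and its chi-square p-value."""
    d = max(2.0 * (nll_sub - nll_full), 0.0)
    return d, chi2_sf(d, df)


def aic(nll, n_params):
    """Akaike Information Criterion from a negative log-likelihood."""
    return 2.0 * n_params + 2.0 * nll


# Hypothetical (negative log-likelihood, number of parameters) per model.
NLL = {"(i)": (500.0, 3), "(ii)": (499.6, 4), "(iii)": (492.0, 5), "(iv)": (491.8, 6)}

# Is the cycle needed? Sub-model (ii) against full model (iv), 2 df.
d_cycle, p_cycle = deviance_test(NLL["(ii)"][0], NLL["(iv)"][0], df=2)
# Is the trend needed? Sub-model (iii) against full model (iv), 1 df.
d_trend, p_trend = deviance_test(NLL["(iii)"][0], NLL["(iv)"][0], df=1)

best = min(NLL, key=lambda m: aic(*NLL[m]))
print(p_cycle, p_trend, best)
```

A small p-value rejects the sub-model in favour of the full model; with these placeholder values the cycle is kept, the trend is dropped, and the AIC points to the same model.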


The AIC indicates that the model which fits best is the one with a cycle in the location parameter, so we consider model (iii) as the best fitted one. With this alternative approach by a point process model, we find that the shape parameter is between 0 and 0.2, which is close to the result obtained by the first approach, based on removing the trend and seasonal components first (see Tables 2.17 and 2.21). The probability and quantile plots are displayed in Figure 2.11.

Figure 2.11: Probability and quantile plot of chosen model.

The same method can be applied to all the other stations, and it gives similar results.

Generalized linear regression

In order to protect the area from floods, a control system consisting of several water gates was installed at several stations by SIAVB. The system works in the following way. When a rainfall occurs, its characteristics are measured by several devices in real time, and those measures are correlated with radar maps describing the weather, thus defining the estimated amount of rain to be expected on the ground. Software relates the current rainfall to past ones from a database, yielding a flood gate strategy that is executed in real time on the ground by tele-transmission to the water gates.

It is natural to ask what would happen (e.g. floods) if such a system were not in place. To answer this question, we need to simulate the flow without the water gates, which can be done by leaving the water gates totally open in a flow model. For this, we need to know the relation between the flow and the opening percentage of the water gates. We model the flow as a function of the rainfall and the opening percentage of the water gates.

The variable that we try to explain is the flow at a given station, for example Arcades de Buc, and the explanatory variables are the rainfall and the opening percentages of the nearby water gates. In the simple linear regression, we have 10 original explanatory variables (6 rainfalls and 4 opening percentages):

flow ~ 6 rainfalls + 4 opening percentages

Since we would also like to consider non-linear effects, we add the terms of degree 2 of the combined explanatory variables:

flow ~ 6 rainfalls + 4 opening percentages + all the terms of degree 2.

To make sure our method is robust, we use the Ridge regression and the PLS regression. For the cross-validation, we separate the data into two parts: the training dataset (75% of the data) and the validation dataset (25% of the data). We repeat the modelling procedure 10 times and calculate the mean squared error (MSE) of the models for each method. We also calculate the robustness of these models (the relative error between the MSE of the training data and the MSE of the validation data, see Table 2.25). Naturally, robustness comes at a price: the models may lose some fit. It is therefore important to find the best balance between fit and robustness, and the Ridge coefficient, which is our hyper-parameter, can be used to do so. The results can be found in the table below:

Model                                  MSE of the      MSE of the        Relative error (%)
                                       training data   validation data
1. Simple regression
2. Regression with 2nd degree terms
3. Simple Ridge
4. Ridge with 2nd degree terms
5. Simple PLS
6. PLS with 2nd degree terms
Table 2.25: Mean Squared Error for different models.

Table 2.25 shows that the fit of the model improves when we increase the number of explanatory variables: the models with 2nd degree terms have smaller errors (especially model 2 and model 4). Regarding robustness, however, we find that model 2 is very unstable, which means that the regression depends strongly on how the data are split. The ideal model here is therefore model 4, the Ridge regression with 2nd degree terms.
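The validation procedure can be sketched as follows. To stay short, the sketch uses a single explanatory variable instead of the 10 variables and their 2nd degree terms, and synthetic data; the relative error is taken as |MSE_validation − MSE_training| / MSE_training, which is an assumption about the definition used in Table 2.25. All names are hypothetical.

```python
import random


def fit_ridge_1d(x, y, lam):
    """Ridge fit with one predictor; the intercept is not penalised."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / (sxx + lam)          # lam = 0 gives ordinary least squares
    return my - slope * mx, slope


def mse(x, y, intercept, slope):
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y)) / len(x)


def repeated_holdout(x, y, lam, repeats=10, train_frac=0.75, seed=0):
    """Repeat a random 75%/25% split; return (MSE train, MSE valid, rel. error)."""
    rng = random.Random(seed)
    idx = list(range(len(x)))
    results = []
    for _ in range(repeats):
        rng.shuffle(idx)
        cut = int(train_frac * len(idx))
        tr, va = idx[:cut], idx[cut:]
        b0, b1 = fit_ridge_1d([x[i] for i in tr], [y[i] for i in tr], lam)
        m_tr = mse([x[i] for i in tr], [y[i] for i in tr], b0, b1)
        m_va = mse([x[i] for i in va], [y[i] for i in va], b0, b1)
        results.append((m_tr, m_va, abs(m_va - m_tr) / m_tr))
    return results


# Synthetic data: flow as a noisy linear function of one rainfall series.
gen = random.Random(1)
rain = [gen.uniform(0.0, 20.0) for _ in range(200)]
flow = [0.2 + 0.05 * r + gen.gauss(0.0, 0.1) for r in rain]

results = repeated_holdout(rain, flow, lam=1.0)
mean_train = sum(r[0] for r in results) / len(results)
mean_rel = sum(r[2] for r in results) / len(results)
print(round(mean_train, 4), round(mean_rel, 3))
```

Running the procedure for a grid of `lam` values and comparing the mean validation MSE with the mean relative error is one way to look for the balance between fit and robustness discussed above.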


With this final model, we simulate the flow when the water gates are totally open, so that the influence of the control system is removed. We get the following results:

Figure 2.12: Prediction of the flow at station Arcades de Buc without control system.

min     max     mean    median   sd
3.11    4.65    3.82    3.82     0.06
Table 2.26: Descriptive statistics for predictive flow at station Arcades de Buc.

We also display here the monthly maxima with their histogram and cumulative distribution function:

Figure 2.13: Monthly maxima of predictive flow with chosen model.


Figure 2.14: Histogram with empirical density and cumulative distribution function of predictive monthly maximum flow at station Arcades de Buc.

From the descriptive statistics, we find that the flow increases greatly without the control system. We can also observe from Figure 2.13 that the distribution seems to present a heavy tail. We then estimate the parameters of the GEV and GP distributions of the predictive flow at station Arcades de Buc. The estimates for the GEV distribution are presented in Table 2.27.

                  location µ   scale σ   shape ξ
Estimates
Standard Errors
Table 2.27: Estimations of the parameters in GEV distribution.

With the same test procedure, the p-value is 2.574e-08, which means that the hypothesis that the shape parameter is zero should be rejected. Since the shape parameter is greater than 0, we confirm that the distribution is a Fréchet distribution, which presents a heavy tail. The parameter estimates for the GP distribution are shown in Table 2.28.

                  scale σ   shape ξ
Estimates
Standard Error
Table 2.28: Estimations of the parameters in GP distribution.

This confirms again that the distribution presents a heavy tail.

This is an interesting result. In Section 2.2.3, we concluded that the flows can be fitted by Gumbel distributions, i.e. they do not present heavy tails. However, when we simulate the flow without the control system, its distribution becomes a Fréchet distribution, which presents a heavy tail. This is evidence of the efficiency of the control system: the system not only reduces the flow in ordinary situations, but also reduces the risk in extreme hydrological events.

We estimate the return level of the flow under the condition that no control system is involved, and we compare the result with the return level based on the real data (with control system).

Return Levels (mm)        10 years   30 years   50 years   80 years   100 years
With control system
Without control system
Relative improvement      31.27%     57.38%     66.13%     72.69%     73.37%
Table 2.29: Return Levels for 10, 30, 50, 80, 100 years, at the station Arcades de Buc.

From the results above, we find that the system is efficient overall: it reduces the level of extreme flow significantly, and the relative improvement grows with the return period. This is a very satisfactory result for this control system.
