Bayesian training of artificial neural networks used for water resources modeling


WATER RESOURCES RESEARCH, VOL. 41, doi:10.1029/2005WR004152, 2005

Bayesian training of artificial neural networks used for water resources modeling

Greer B. Kingston, Martin F. Lambert, and Holger R. Maier
Centre for Applied Modeling in Water Engineering, School of Civil and Environmental Engineering, University of Adelaide, Adelaide, South Australia, Australia

Received 9 March 2005; revised 9 September 2005; accepted 13 September 2005; published 6 December 2005.

[1] Artificial neural networks (ANNs) have proven to be superior prediction models in many hydrology-related areas; however, the failure of ANN practitioners to account for uncertainty in the predictions has limited the wider use of ANNs as forecasting models. Conventional methods for quantifying parameter uncertainty are difficult to apply to ANN weights because of the complexity of these models, and the complicated methods developed for this purpose have not been adopted by water resources practitioners because of the difficulty in implementing them. This paper presents a relatively straightforward Bayesian training method that enables weight uncertainty to be accounted for in ANN predictions. The method is applied to a salinity forecasting case study, and the resulting ANN is shown to significantly outperform an ANN developed using standard approaches in a real-time forecasting scenario. Moreover, the Bayesian approach produces prediction limits that indicate the level of uncertainty in the predictions, which is extremely important if forecasts are to be used with confidence in water resources applications.

Citation: Kingston, G. B., M. F. Lambert, and H. R. Maier (2005), Bayesian training of artificial neural networks used for water resources modeling, Water Resour. Res., 41, doi:10.1029/2005WR004152.

Copyright 2005 by the American Geophysical Union.

1. Introduction

[2] Over the past 15 years, artificial neural networks (ANNs) have proven to be extremely beneficial tools for simulating, predicting and forecasting water resources variables. The predictive capability of ANNs in this field has been demonstrated in numerous studies, leading to the publication of several comprehensive reviews on the application of ANNs in hydrology, such as rainfall-runoff modeling, water quality forecasting and streamflow prediction [see ASCE Task Committee on Application of Artificial Neural Networks in Hydrology, 2000a, 2000b; Maier and Dandy, 2000; Dawson and Wilby, 2001]. ANNs are able to capture complex, nonlinear functional relationships within data without requiring an in-depth understanding of the underlying physical process or the need to prespecify a functional form of the model, thus giving them an advantage over many of the models traditionally used for modeling water resources variables. However, as noted by Maier and Dandy [2000], a major limitation of ANNs in this field is that the uncertainty in the predictions generated is seldom quantified. Failure to account for such uncertainty makes it impossible to assess the quality of ANN predictions, which severely limits their usability in real-world water resources management and design applications.

[3] A significant component of prediction uncertainty can be attributed to the uncertainty in the parameters that govern the modeled function. In an ANN these parameters are the connection and bias weights in the network, as shown in Figure 1, which displays a multilayer ANN structure typically used for hydrological prediction.
These weights have no direct physical interpretation, and therefore calibration, or training, is required to obtain estimates of their values. Traditionally, this involves iteratively adjusting the network weights to find a single optimal set of weight values that provides the best fit between the model outputs and a set of observed calibration (training) data. However, for any hydrological model, it is inappropriate to assume that point parameter estimates obtained by calibration can adequately describe the underlying hydrological relationship, and this is particularly the case for ANNs for the reasons discussed below.

[4] First, as an ANN does not incorporate any knowledge of the physical system, the resulting model is heavily dependent on information inferred from the finite data set used for training. Because of the stochastic nature of hydrological systems, each different set of training data would most likely yield a different set of weights. Thus finding the single weight vector that provides the best fit to the training data does not necessarily result in a correct model of the system. This problem becomes less significant as more data become available and the training data set becomes more representative of the population; however, the data available to describe many hydrological phenomena are often limited. Therefore, while an ANN based on a single weight vector may perform well in an interpolative context (on data similar to those contained in the training data set), it cannot be expected to extrapolate well in situations dissimilar to those previously presented to the model.

[5] Second, ANN training is a multidimensional nonlinear optimization problem. This is not a trivial task, as the nonlinearity of the problem can lead to the existence of multiple optima on the solution surface. Training algorithms may become trapped in local minima rather than converging on the global solution and, although sophisticated global optimization algorithms have been developed (e.g., the shuffled complex evolution (SCE-UA) algorithm developed by Duan et al. [1992]), there is still no algorithm that can guarantee global convergence. Furthermore, ANNs have the potential of becoming overtrained, which means that the network fits to noise in the training data rather than inferring the general underlying rule. Therefore, even if the training data sufficiently represent the data population, poor weight estimates may still result from difficulties during training.

[6] Finally, complex hydrological relationships may require complex model structures [Omlin and Reichert, 1999]. If the data are sparse in relation to the number of weights in the ANN, it becomes more difficult to properly identify a value for each of the weights. As a consequence, many combinations of weights may result in similar network performance, and there is no way to distinguish which of these best approximates the underlying relationship.

[7] In recent years, there has been an upsurge in the use of Bayesian methodology for quantifying parameter uncertainty in various scientific fields [Malakoff, 1999], including hydrological and water resources modeling [Marshall et al., 2004]. Under this paradigm, uncertainty in the model parameters is handled explicitly, as parameter distributions are estimated rather than point values. However, applications of Bayesian methods in hydrology-related areas have been limited to simple conceptual [Kuczera and Parent, 1998; Kuczera and Mroczkowski, 1998; Bates and Campbell, 2001; Thiemann et al., 2001; Vrugt et al., 2003; Marshall et al., 2004] and traditional statistical [Kuczera, 1983; Thyer et al., 2002] models, and have not been extended to ANNs. While Bayesian techniques are not inapplicable to ANNs and have, in fact, been used for ANN training (although rarely) since the early 1990s [Buntine and Weigend, 1991; MacKay, 1992; Neal, 1992], the complexity of ANNs makes it difficult to apply standard Bayesian methods to estimate the weight distributions. Consequently, the majority of Bayesian techniques applied to ANNs in the past have employed complex statistics in order to overcome any complications. Most of the available ANN software does not allow for Bayesian training, and because of the difficulty associated with programming the available complicated techniques, Bayesian ANN training has not been adopted by water resources practitioners.

[8] In this paper, a relatively simple and very accessible Bayesian training technique is presented. So as not to detract from the original attraction of ANNs, the aim of the procedure is not statistical optimality, nor optimum efficiency, but rather good results and ease of programming and application. The Bayesian training framework is applied to a real-world water resources case study in order to assess the uncertainty associated with the predictions and to investigate the relative advantages of the Bayesian approach in comparison with standard deterministic training techniques.

Figure 1. Example of a typical ANN structure used for hydrological prediction.

2. Bayesian Training

2.1. Background

[9] ANNs, like all mathematical models, work on the assumption that there is a real function underlying a system that relates a set of independent predictor variables to one or more dependent variables of interest.
The aim of ANN training is to infer an acceptable approximation of this relationship from a set of training data, so that the model can be used to produce accurate predictions when presented with new data. If y is the target variable and x is a vector of input data, it is assumed that

y = g(\mathbf{x} \mid \mathbf{w}) + \epsilon     (1)

where g(·) is the function described by the ANN, w is a vector of connection and bias weights that characterize the data-generating relationship, and ε is a random noise term with zero mean and constant variance, i.e., white noise.

[10] Using standard (deterministic) training approaches, a single optimal weight vector ŵ is sought that is most likely to reproduce the set of observed target data y = (y₁, y₂, ..., y_N), given the inputs X = (x₁, x₂, ..., x_N). The aim of Bayesian training, on the other hand, is to infer the posterior probability distribution of the weights given the observed data, P(w | y, X). This is done by updating any knowledge of the weight values held prior to obtaining the data with the information contained in the data, using Bayes' theorem:

P(\mathbf{w} \mid \mathbf{y}, X) = \frac{P(\mathbf{y} \mid \mathbf{w}, X)\,P(\mathbf{w})}{P(\mathbf{y} \mid X)} = \frac{P(\mathbf{y} \mid \mathbf{w}, X)\,P(\mathbf{w})}{\int P(\mathbf{y} \mid \mathbf{w}, X)\,P(\mathbf{w})\,d\mathbf{w}}     (2)

where P(w) is the prior weight distribution and P(y | w, X) is the likelihood function, which describes any information about w contained in the data. The likelihood function is often expressed as L(w).

2.2. Marginalization

[11] Under the Bayesian paradigm, the predictive distribution of a new datum y_{N+1} is determined by integrating the predictions made by all of the weight vectors over the posterior distribution of the weights, as follows:

P(y_{N+1} \mid \mathbf{x}_{N+1}, \mathbf{y}, X) = \int P(y_{N+1} \mid \mathbf{x}_{N+1}, \mathbf{w})\,P(\mathbf{w} \mid \mathbf{y}, X)\,d\mathbf{w}     (3)

This process is known as marginalization.
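In practice, once samples from the posterior weight distribution are available, the integral in (3) is approximated by a Monte Carlo average of the network outputs over those samples. The sketch below is a minimal illustration of this idea, not the authors' implementation (which was written in FORTRAN 90); the model function `g(x, w)` and the array `w_samples` are hypothetical placeholders.

```python
import numpy as np

def predictive_samples(g, x_new, w_samples):
    """Monte Carlo approximation of the predictive distribution in (3):
    evaluate the model output for a new input under every weight vector
    sampled from the posterior P(w | y, X)."""
    return np.array([g(x_new, w) for w in w_samples])

# Toy usage with a linear stand-in for the network function g(x | w).
g = lambda x, w: float(np.dot(x, w))
w_samples = np.random.normal(0.0, 0.1, size=(1000, 3))   # pretend posterior draws
x_new = np.array([0.2, -0.5, 1.0])
preds = predictive_samples(g, x_new, w_samples)
print(preds.mean(), np.percentile(preds, [2.5, 97.5]))    # mean and 95% limits
```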

For complex problems, the high dimensionality of this integral makes its evaluation with conventional analytical or numerical integration techniques virtually impossible. In order to overcome this problem, two main approaches to marginalization have generally been followed: Gaussian approximation of the posterior weight distribution to enable analytical integration, as introduced by MacKay [1992], and numerical integration using Markov chain Monte Carlo methods, as introduced by Neal [1992]. These approaches have since been reviewed by MacKay [1995], Bishop [1995], Neal [1996a], Lampinen and Vehtari [2001], and Titterington [2004].

Figure 2. The Metropolis method in two dimensions, displaying two proposal distributions, Q1 and Q2, that will achieve different rates of convergence to the posterior distribution.

2.3. Markov Chain Monte Carlo Integration

[12] For multilayered ANNs, the posterior weight distribution is typically very complex and multimodal, and thus the assumption of a Gaussian weight distribution is generally not a good one [Neal, 1996a]. It may be reasonable to assume that the distribution is locally Gaussian around each mode; however, this raises the question of how to properly handle the multiple modes when making predictions. Furthermore, the assumption of even a locally Gaussian distribution in the vicinity of the modes is sometimes questionable, particularly when the model is complex in comparison with the data available for training [Rasmussen, 1996]. To avoid the need to make such an approximation, Neal [1992] introduced a Markov chain Monte Carlo (MCMC) implementation to sample from the posterior weight distribution.

[13] The objective of MCMC methods is to generate samples from a continuous target density, which in this case is the posterior weight distribution. The Metropolis algorithm is a commonly used MCMC approach, which has had success in a number of applications [Bishop, 1995; Kuczera and Parent, 1998; Bates and Campbell, 2001; Marshall et al., 2004]. As it is difficult to sample from the complex target (posterior) distribution directly, this method makes use of a simpler, symmetric distribution, Q(·), known as the proposal or jumping distribution, to generate candidate weight vectors. In its simplest form, the proposal distribution depends only on the previous weight state, and therefore a random walk Markov chain is generated within the weight space. An adaptive acceptance-rejection criterion is employed such that the random walk sequence continually adapts to the posterior distribution of the weights. The algorithm proceeds as follows (as adapted from Gelman et al. [1995]):

1. Initialize the algorithm with an arbitrary set of weights w₀ for which P(w₀ | y) > 0.

2. For t = 1, 2, ...:

2.1. Generate a candidate weight vector w* from Q(w* | w_{t−1}).

2.2. Evaluate the ratio of P(w* | y, X) compared to P(w_{t−1} | y, X).

2.3. Accept w* with probability

\alpha(\mathbf{w}^* \mid \mathbf{w}_{t-1}) = \min\left[\frac{P(\mathbf{y} \mid X, \mathbf{w}^*)\,P(\mathbf{w}^*)}{P(\mathbf{y} \mid X, \mathbf{w}_{t-1})\,P(\mathbf{w}_{t-1})},\; 1\right]     (4)

and if w* is accepted, set w_t = w*; otherwise set w_t = w_{t−1}.

[14] Given sufficient iterations, the Markov chain produced by the Metropolis algorithm should converge to a stationary distribution, from which point onward it can be considered that the sampled weight vectors are generated from the posterior distribution. However, selection of an appropriate proposal distribution is a difficult task and one that has important implications for the convergence properties and efficiency of the algorithm, particularly in the case of complex models with correlated parameters.
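A minimal random-walk Metropolis sketch of the steps above is given below, assuming a user-supplied `log_posterior(w)` function (with the uniform weight priors used later in this paper it reduces to the log likelihood) and a fixed Gaussian proposal; the names and step size are illustrative, not the authors' code.

```python
import numpy as np

def metropolis(log_posterior, w0, n_iter, proposal_cov):
    """Random-walk Metropolis: propose from a symmetric Gaussian centred on
    the current state and accept with probability min(P*/P_prev, 1)."""
    d = len(w0)
    chain = np.empty((n_iter + 1, d))
    chain[0] = w0
    logp = log_posterior(w0)
    L = np.linalg.cholesky(proposal_cov)               # factor of Q's covariance
    for t in range(1, n_iter + 1):
        w_star = chain[t - 1] + L @ np.random.standard_normal(d)   # draw from Q(w*|w_{t-1})
        logp_star = log_posterior(w_star)
        if np.log(np.random.rand()) < logp_star - logp:            # accept/reject step
            chain[t], logp = w_star, logp_star
        else:
            chain[t] = chain[t - 1]
    return chain

# Toy target: a standard bivariate Gaussian "posterior".
chain = metropolis(lambda w: -0.5 * np.sum(w**2), np.zeros(2), 5000, 0.5 * np.eye(2))
```

The choice of `proposal_cov` is exactly the difficulty discussed in paragraph [14]: too large and few candidates are accepted, too small and the chain explores the posterior very slowly.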
Shown in Figure 2 is an example of a bivariate posterior distribution together with two example proposal distributions, Q1 and Q2. Using the larger proposal distribution Q1, denoted by the dashed line, it can be seen that a jump made from the current weight state w_t in almost any direction will result in a decrease in the posterior probability; as such, a large proportion of jumps will be rejected and convergence will be slow. On the other hand, if the smaller proposal distribution Q2 is used, the acceptance rate will increase; however, the algorithm will take longer to sample from the entire region of the posterior, and therefore the distribution may not be adequately represented by the samples generated within a specified number of iterations [Thyer et al., 2002]. This problem is not unique to ANNs; however, it is amplified for them because of the typically large number of parameters and the high correlations between them, which result from the interconnectivity of the nodes and a generally poor understanding of the optimum model complexity. This means that the time taken to obtain an adequate representation of the posterior distribution using the standard Metropolis algorithm can be prohibitive.

[15] In order to suppress the random walk behavior of the Metropolis algorithm and speed up convergence, Neal [1992, 1996a] used the hybrid Monte Carlo (HMC) algorithm of Duane et al. [1987] to sample from the posterior weight distribution. The HMC algorithm is a particularly elaborate version of the Metropolis algorithm that makes use of gradient information to direct the sampler into regions of high density, which ensures higher acceptance probabilities. However, while this complex MCMC implementation may improve convergence to the posterior, it may be difficult to program and to verify its correctness [Neal, 1996a; Lee, 2003].

The time and effort required to code this algorithm may indeed be the reason why it has not been more widely adopted by practitioners. Furthermore, while the complex algorithm may be quicker to converge to the target distribution, a greater number of runs using a simpler algorithm may in fact have a shorter total run time.

2.4. Adaptive Metropolis Sampling of Weights

[16] The adaptive Metropolis (AM) algorithm, developed by Haario et al. [2001], was used in this study to sample from the posterior weight distribution, as it has been found to have a number of advantages over other variants of the Metropolis algorithm in terms of efficiency and ease of use [Marshall et al., 2004]. This algorithm was developed to overcome the problems associated with selecting an appropriate covariance for the proposal distribution by setting it equal to the estimated posterior covariance of the weights, which is updated at each iteration based on all of the previously sampled weight vectors. This adaptation strategy ensures that information gained about the posterior distribution throughout the simulation is used to increase the efficiency of the algorithm and improve the convergence rate.

[17] To initialize the algorithm, an arbitrary, positive definite covariance matrix, S₀, is selected. For an initial period t₀ > 0, the covariance of the proposal is fixed at this initial covariance, after which time the adaptation strategy begins, as follows:

S_t = \begin{cases} S_0 & t \le t_0 \\ c^2 \operatorname{cov}(\mathbf{w}_0, \mathbf{w}_1, \ldots, \mathbf{w}_{t-1}) + c^2 \epsilon I_d & t > t_0 \end{cases}     (5)

where c is an adaptive scaling parameter used to maintain an appropriate acceptance rate, ε is a small constant used to ensure that S_t will not become singular, and I_d is the d-dimensional identity matrix, with d being the dimension of the weight vector. For t > t₀, calculation of S_t satisfies the following recursion formula:

S_{t+1} = \frac{t-1}{t} S_t + \frac{c^2}{t} \left[ t\,\bar{\mathbf{w}}_{t-1} \bar{\mathbf{w}}_{t-1}^T - (t+1)\,\bar{\mathbf{w}}_t \bar{\mathbf{w}}_t^T + \mathbf{w}_t \mathbf{w}_t^T + \epsilon I_d \right]     (6)

where \bar{\mathbf{w}}_t = \frac{1}{t+1} \sum_{i=0}^{t} \mathbf{w}_i. Therefore the covariance may be updated at each iteration with little additional computational cost.

[18] The choice of the initial fixed period t₀ should reflect the confidence in the initial covariance S₀. The longer this period, the more slowly the adaptation is felt and the greater the effect of the initial covariance on the simulated draws. Therefore, if the initial fixed period is short, even a poor initial choice of S₀ should only have a minor impact on the overall convergence of the algorithm. However, it is necessary to select S₀ such that the algorithm moves at least a little during the initial stage. To avoid the algorithm starting slowly, Haario et al. [2001] suggest using a priori knowledge, such as the most likely weight vector ŵ or the covariance of the weights at this mode, to assist in the choice of the initial weight state or the initial covariance.
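The adaptation rule in (5) and (6) can be sketched as a cheap per-iteration update of a running mean and covariance. The function below is an illustrative sketch under that assumption, not the authors' implementation; variable names are invented for clarity.

```python
import numpy as np

def am_covariance_update(S_t, w_bar_prev, w_t, t, c, eps, d):
    """One step of the recursion (6) for t > t0: update the proposal covariance
    from the previous covariance S_t, the previous running mean w_bar_prev and
    the newly obtained weight state w_t."""
    w_bar = (t * w_bar_prev + w_t) / (t + 1)          # running mean of w_0 ... w_t
    S_next = ((t - 1) / t) * S_t + (c**2 / t) * (
        t * np.outer(w_bar_prev, w_bar_prev)
        - (t + 1) * np.outer(w_bar, w_bar)
        + np.outer(w_t, w_t)
        + eps * np.eye(d)
    )
    return S_next, w_bar

# Example step at iteration t = 500 for a 61-dimensional weight vector,
# with c initially set to 2.4/sqrt(d) as recommended in the text.
d = 61
S, w_bar = np.eye(d) * 0.01, np.zeros(d)
S, w_bar = am_covariance_update(S, w_bar, np.random.normal(size=d), t=500,
                                c=2.4 / np.sqrt(d), eps=1e-6, d=d)
```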
2.5. Gibbs Sampling of Residual Variance

[19] In this study, a Gaussian likelihood function was used, as given by

L(\mathbf{w}) = P(\mathbf{y} \mid X, \mathbf{w}, \sigma^2) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{\left[ y_i - g(\mathbf{x}_i, \mathbf{w}) \right]^2}{2\sigma^2} \right)     (7)

This distribution makes the assumption that the model residuals (i.e., y_i − g(x_i | w)) are normally and independently distributed with zero mean and constant variance σ². The parameter σ² is sometimes referred to as a hyperparameter, as it plays an important role in estimating the values of the network weights but ultimately plays no part in the developed model. In a full Bayesian training approach, no fixed values are used for any parameters or hyperparameters [Lampinen and Vehtari, 2001]; thus σ² is also estimated from the training data. Following the approach used by Neal [1996a], in the proposed Bayesian training framework the posterior distribution of σ² is estimated using the Gibbs sampler, which is the simplest MCMC algorithm. This involves sampling from the full conditional distribution of σ², given the data and the values of the network weights sampled using the AM algorithm, given by

P(\sigma^2 \mid \mathbf{y}, X, \mathbf{w}) \propto P(\mathbf{y} \mid X, \mathbf{w}, \sigma^2)\,P(\sigma^2)     (8)

[20] Generating samples from P(σ² | y, X, w) is relatively easy if the prior distribution is chosen to be conjugate to the likelihood, which means that the posterior distribution will have the same parametric form as the prior. A natural conjugate prior for the Gaussian variance is the scaled inverse chi-square distribution Inv-χ²(ν₀, s₀²), where ν₀ and s₀² are degrees of freedom and scale parameters, respectively, chosen to express the level of prior knowledge [Lee, 1989]. The posterior distribution is then the scaled inverse chi-square distribution Inv-χ²(ν*, s*²), with

\nu^* = \nu_0 + N, \qquad s^{*2} = \frac{\nu_0 s_0^2 + N s^2}{\nu_0 + N}     (9)

where s² is equal to \sum_{i=1}^{N} \left[ y_i - g(\mathbf{x}_i, \mathbf{w}) \right]^2 / N and N is the number of training samples. Draws from the distribution given by (9) may be obtained by sampling X from the χ²_{ν*} distribution and letting σ² = ν* s*² / X.
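The conditional draw described in (9) can be sketched as follows, assuming the residuals have already been computed for the current weight vector; this is an illustrative sketch of the scaled inverse chi-square construction above, not the authors' code.

```python
import numpy as np

def gibbs_sigma2(residuals, nu0, s0_sq):
    """Draw sigma^2 from its scaled inverse chi-square conditional (9):
    sample X ~ chi^2(nu*) and return nu* * s*^2 / X."""
    N = residuals.size
    s_sq = np.sum(residuals**2) / N                      # mean squared residual
    nu_star = nu0 + N
    s_star_sq = (nu0 * s0_sq + N * s_sq) / (nu0 + N)
    chi2_draw = np.random.chisquare(nu_star)
    return nu_star * s_star_sq / chi2_draw

# Example: residuals y_i - g(x_i, w) for the current weight vector, with the
# vague prior settings (nu0 = 0.1, s0^2 = 0.01) used later in the case study.
sigma2 = gibbs_sigma2(np.random.normal(0.0, 0.1, size=200), nu0=0.1, s0_sq=0.01)
```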

2.6. Proposed Bayesian Training Approach

[21] The MCMC Bayesian training approach presented in this paper follows a two-step iterative procedure. In the first step, σ² is held constant while w is sampled from P(w_t | y, X, σ²_{t−1}) using the AM algorithm. In the second step, w is held constant while σ² is sampled from P(σ²_t | y, X, w_t) using the Gibbs sampler. This is consistent with the MCMC approach used by Neal [1996a], except that the complicated HMC algorithm has been substituted with the much simpler AM algorithm. While implementation of this procedure is relatively straightforward, there are a number of factors that need consideration to ensure successful performance of the algorithm and appropriate convergence to the posterior distribution. These factors were determined to achieve optimal efficiency while still retaining the simplicity of the approach, as discussed below.

[22] Before running the MCMC sampling algorithm, suitable prior distributions for w and σ² need to be chosen. As ANN weights have no physical interpretation, little can be known about the values of these parameters before observing the data. Therefore, to represent vague prior knowledge, it is convenient to assume wide, noninformative prior distributions, which allow the posterior distributions of w and σ² to be determined by the data without being restricted or affected by the prior distribution [Neal, 1996a; Lampinen and Vehtari, 2001]. In this study, a wide uniform prior, symmetric around zero, was assumed for each weight in order to specify an equal likelihood of positive and negative values and an otherwise complete lack of prior knowledge about the values of the weights. A noninformative prior was also assumed for σ² by setting appropriate values for ν₀ and s₀². In general, the smaller ν₀ is relative to the number of training samples N, the less informative is the scaled inverse chi-square prior distribution [Gelman et al., 1995]. The assumption of uniform priors for the network weights has the additional advantage of simplifying (4), as the prior probabilities cancel out, leaving only the likelihood ratio to be evaluated. To evaluate this ratio, it is more computationally efficient to take the (natural) logarithm of the likelihood, so that the function given by (7) becomes additive rather than multiplicative.
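Because the uniform weight priors cancel in the acceptance ratio (4), the samplers only need the log of the Gaussian likelihood (7), which is additive over the training samples. A minimal sketch, assuming a generic model function `g` (a placeholder, not the authors' network code), is:

```python
import numpy as np

def log_likelihood(y, X, w, sigma2, g):
    """Log of the Gaussian likelihood (7): sum over the training samples of
    -0.5*log(2*pi*sigma2) - (y_i - g(x_i, w))^2 / (2*sigma2)."""
    resid = y - np.array([g(x_i, w) for x_i in X])
    N = y.size
    return -0.5 * N * np.log(2.0 * np.pi * sigma2) - np.sum(resid**2) / (2.0 * sigma2)

# Toy check with a linear stand-in for the network function g(x, w).
g = lambda x, w: float(np.dot(x, w))
X = np.random.normal(size=(50, 3))
w = np.array([0.5, -0.2, 0.1])
y = np.array([g(x_i, w) for x_i in X]) + np.random.normal(0, 0.1, size=50)
print(log_likelihood(y, X, w, sigma2=0.01, g=g))
```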
[23] To initialize the training procedure, arbitrary values of w₀, σ₀² and S₀ are required. Gelman et al. [1995] and Haario et al. [2001] recommend using point estimates of w to obtain a rough estimate of the location of the posterior distribution. Not only does this increase convergence speed, but these values are easily estimated using available software and provide a useful check of the accuracy of the Bayesian training algorithm. As the posterior distribution of ANN weights is often multimodal, it is acknowledged that this initialization may bias the resulting posterior distribution if the algorithm becomes trapped in the vicinity of a local mode. Müller and Rios Insua [1998] and Neal [1996b] discuss the multimodality of ANN posterior distributions at length, as well as methods for efficiently sampling from such distributions. However, because the emphasis of the proposed Bayesian training framework is on simplicity and ease of use, instead of incorporating complicated mode-jumping schemes, it is recommended that extra care be taken in finding appropriate weights ŵ to initialize the algorithm. Thus, if the algorithm does become stuck around a local mode, there will at least be some confidence that it is a good mode (i.e., the best estimate of the maximum likelihood value given by a rigorous search algorithm that tries to thoroughly search the space). Nevertheless, to lessen the bias that may be caused by this initialization, it is recommended that, for an initial period t_{σ₀}, the variance hyperparameter be fixed at some value σ₀² ≠ σ̂², where σ̂² is estimated using the weight vector ŵ. As {ŵ, σ̂²} results in the (locally) maximum likelihood value, setting σ₀² = σ̂² would cause the acceptance rate of candidate weight vectors to be low, making it difficult to move away from the initial location. However, by selecting σ₀² such that the magnitude of the initial likelihood is somewhat reduced, the acceptance rate is increased, allowing the simulated chain to move more freely about the weight space during this period. In this study, σ₀² was set equal to the scale parameter s₀² used to define the prior distribution of σ², as this was found to provide good results in preliminary tests.

[24] The proposal distribution used in this study to generate candidate weight states was a multinormal distribution centered on the current weight state, with covariance estimated according to (5) (i.e., Q(w* | w_{t−1}) = N(w_{t−1}, S_t)). Gelman et al. [1995] give a number of recommendations for achieving optimal efficiency with this form of proposal distribution. One of these recommendations is that the initial covariance of the proposal distribution, S₀, be estimated based on the derivatives of the error function at the posterior mode. However, for a feedforward ANN, these derivatives are often ill conditioned, resulting in a (nearly) singular covariance matrix. This, in turn, can cause instability of the AM algorithm, as use of the recursion formula (6) means that the covariance is updated based on an initial ill-conditioned matrix. For the AM algorithm, the only requirements for the choice of S₀ are that it is positive definite and allows the algorithm to move at least a little in the initial fixed covariance stage. Therefore, in the proposed implementation, S₀ is defined by

S_0 = c^2 \begin{pmatrix} \sigma_{w_1}^2 & & \\ & \ddots & \\ & & \sigma_{w_d}^2 \end{pmatrix}     (10)

The parameters σ_{w_1}, ..., σ_{w_d} are the standard deviations of the weights, which are initially set equal to some arbitrary positive values, chosen to ensure that sufficient states are accepted while S₀ is fixed. To achieve an optimal acceptance rate of approximately 23% (for dimension d > 5), the scaling parameter c is initially set equal to 2.4/√d and tuned up or down at the beginning of the simulation if the acceptance rate is too high or low, respectively, as recommended by Gelman et al. [1995].

[25] The algorithm is run for sufficient iterations, K, first, to achieve convergence to the stationary distribution and, second, to sample enough weight states following convergence to provide an adequate representation of the posterior distribution. However, weight states simulated prior to the point at which convergence is apparently reached, at t = k < K, will still be influenced by the initial distribution rather than the posterior distribution. Therefore these simulations are discarded, and the remainder (or a smaller representative subset) are used as the basis for making Monte Carlo estimates from the predictive distribution of a test data set. In this implementation, the predictive distribution is then summarized by 95% prediction limits, which are useful for the visualization of prediction uncertainty, and by the mean predictions, which enable a direct comparison of predictive performance with that of a deterministic ANN (i.e., an ANN based on single-valued weight estimates).
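The summary described in paragraph [25] can be sketched as follows, assuming `preds` holds the network outputs for one test input under every retained (post-burn-in) weight vector; the names are illustrative and the 2.5th/97.5th percentiles are one common way of forming 95% limits from ranked samples.

```python
import numpy as np

def summarize_predictions(preds):
    """Summarize Monte Carlo predictions for one test sample: the mean
    prediction and the 2.5th/97.5th percentiles as 95% prediction limits."""
    lower, upper = np.percentile(preds, [2.5, 97.5])
    return preds.mean(), (lower, upper)

# preds: outputs g(x_i, w_t) for t = k+1, ..., K (here simulated values in EC units).
mean_pred, limits = summarize_predictions(np.random.normal(650.0, 40.0, size=10000))
```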

[26] Given the above considerations, the full Bayesian training procedure used in this study was carried out as follows:

1. Set w₀ = ŵ and σ₀² = s₀², evaluate log L(w₀), and initialize S₀ according to (10).

2. For t = 1, 2, ..., K:

2.1. If t ≤ t₀, let S_t = S₀.

2.2. Generate a candidate weight vector w* from Q(w* | w_{t−1}) = N(w_{t−1}, S_t).

2.3. Evaluate log L(w*) and compare it to log L(w_{t−1}).

2.4. Accept w* with probability

\alpha(\mathbf{w}^* \mid \mathbf{w}_{t-1}) = \min\left[\exp\left(\log L(\mathbf{w}^*) - \log L(\mathbf{w}_{t-1})\right),\; 1\right]

and if w* is accepted, set w_t = w*; otherwise set w_t = w_{t−1}.

2.5. If t > t_{σ₀}, generate σ²_t from P(σ²_t | y, X, w_t) = Inv-χ²(ν*, s*²).

2.6. If t = t₀, calculate

S_{t+1} = c^2 \operatorname{cov}(\mathbf{w}_0, \mathbf{w}_1, \ldots, \mathbf{w}_{t_0}) + c^2 \epsilon I_d, \quad \text{where} \quad \operatorname{cov}(\mathbf{w}_0, \ldots, \mathbf{w}_{t_0}) = \frac{1}{t_0} \left[ \sum_{i=0}^{t_0} \mathbf{w}_i \mathbf{w}_i^T - (t_0 + 1)\,\bar{\mathbf{w}}_{t_0} \bar{\mathbf{w}}_{t_0}^T \right];

else if t > t₀, calculate S_{t+1} according to (6).

3. Discard the initial samples (w₀, σ₀²), ..., (w_k, σ²_k) to diminish the effects of the initial distribution, and use the samples (w_{k+1}, σ²_{k+1}), ..., (w_K, σ²_K) for analysis.

4. For i = 1, 2, ..., N_testset:

4.1. Calculate the network predictions ŷ_{i,k+1}, ..., ŷ_{i,K} based on w_{k+1}, ..., w_K and the input vector x_i.

4.2. Rank the predictions ŷ_{i,k+1}, ..., ŷ_{i,K} in ascending order and determine the 95% simulation limits.

4.3. Calculate the mean prediction \bar{y}_i = \frac{1}{K - k} \sum_{t=k+1}^{K} \hat{y}_{i,t}.

[27] The most critical issue associated with MCMC simulations is determining when convergence to the posterior distribution has been achieved. Because of the multimodal nature of ANN weights and the correlations between them, convergence to the posterior weight distribution is usually relatively slow, as the covariance of the proposal distribution needs to be small in order to maintain a high enough acceptance rate, and because chains can become trapped for a long time around local modes. Multiple chains can be used to explore the weight space more widely and help to speed convergence to the posterior distribution; therefore it is recommended that a number of parallel chains be simulated simultaneously. In this study, the multiple chains were each initialized at ŵ to give them an equal chance of finding a good mode. However, during the initial period when σ₀² is fixed, the chains should diverge from their initial location in different directions. The use of multiple chains can also help to detect whether the algorithm has converged. The most commonly used diagnostics of convergence are trace plots of sampled MCMC values versus iteration [Kass et al., 1998]. A trace of the log posterior density, calculated by taking the mean over the multiple chains, was used in this study to diagnose convergence. As stated by Kass et al. [1998], if the log posterior density is increasing, the main mode has yet to be reached, whereas if it is decreasing, the algorithm was initialized near a tall, narrow mode and is moving toward a more representative part of the distribution. Therefore it can be assumed that convergence has been reached when this plot flattens out. However, a limitation of MCMC posterior simulation, particularly when applied to ANNs, is that it can never be guaranteed that convergence to the true posterior has been obtained, as there may still be modes that have not been discovered.
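The convergence check described in paragraph [27] can be sketched as below, assuming `log_post` is an array of log posterior density values with one row per parallel chain (a hypothetical layout, not the authors' code); the idea is simply to average over chains at each iteration and judge whether the resulting trace has flattened.

```python
import numpy as np

def mean_log_posterior_trace(log_post, window):
    """Average the log posterior density over parallel chains at each iteration
    and compare the means of the last two windows: a trace that is still rising
    (or falling) suggests the sampler has not yet converged."""
    trace = log_post.mean(axis=0)                      # mean over chains
    recent, previous = trace[-window:], trace[-2 * window:-window]
    return trace, recent.mean() - previous.mean()      # near zero once the trace is flat

# log_post: shape (n_chains, n_iterations), e.g. five parallel chains.
log_post = np.random.normal(-800.0, 5.0, size=(5, 50_000))
trace, drift = mean_log_posterior_trace(log_post, window=5_000)
```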
3. Case Study

[28] The case study used to demonstrate the advantages of the proposed Bayesian training approach is that of forecasting salinity in the River Murray at Murray Bridge, South Australia, 14 days in advance. The River Murray is Australia's largest river and is essential for irrigation and water supply purposes in South Australia, providing, on average, 45% of the total water requirements for the capital city of Adelaide [River Murray Catchment Water Management Board, 2003].

[29] It is predicted that, without intervention, the average salinity of the river at Murray Bridge will increase to 870 EC by 2050 [Murray Darling Basin Commission, 1999], which is above the 800 EC threshold for desirable drinking water. Apart from having an objectionable taste, water of such high salinity can cause reductions in crop yields, corrosion of pipes and infrastructure, and increased consumption of soap and detergents due to the hardness of the water. Murray Bridge is located within one of South Australia's major irrigation regions and is also the site of one of two offtakes that divert river water to Adelaide in order to meet the city's water demands. It is therefore important to be able to forecast salinity levels in the river at Murray Bridge several weeks in advance so that operational changes can be put into place to minimize the negative effects of high salinity.

[30] This case study was also used by Bowden et al. [2002, 2005], where an ANN was developed to produce 14-day salinity forecasts and a real-time simulation was subsequently performed using the developed model. Bowden et al. [2002] clustered the available data into groups of similar input/output patterns using a self-organizing map (SOM). It was then identified that the data set reserved to perform the real-time simulation contained two regions of data that were dissimilar to the data used to develop the model. Consequently, the ANN was required to extrapolate, and its predictive performance in these uncharacteristic regions was poor. This case study was therefore considered ideal for assessing the uncertainty associated with ANN predictions in a real-time simulation situation and for investigating the advantages of the proposed Bayesian training technique in comparison with deterministic training methods, particularly when the model is required to extrapolate.

4. Model Development

4.1. Data and Model Inputs

[31] Daily salinity, flow and river level data were available at various locations in the lower River Murray for the period 1 December 1986 to 1 April 1998. Bowden et al. [2002, 2005] used data from 1 December 1986 to 30 June 1992 to develop an ANN, while data from 1 July 1992 to 1 April 1998 were reserved to perform the real-time forecasting simulation. The same data split was also used in this study.

The ANN inputs used in this study were the same as those used by Bowden et al. [2005] and are given in Table 1. These inputs were selected from a total of 960 potentially suitable inputs (including lagged values of the available data from 1 through to 60 days) for predicting salinity levels in the River Murray at Murray Bridge 14 days in advance, using the partial mutual information (PMI) input determination method (see Bowden et al. [2005] for details). Similar to the method of Bowden et al. [2005], the input data were linearly scaled between −1 and 1, while the output data were linearly scaled between −0.8 and 0.8.

Table 1. Inputs and Outputs Used in the Salinity Model

Location        Data Type   Lag, days   Input/Output Number
Mannum          salinity    1           I1
Morgan          salinity    60          I2
Waikerie        salinity    1           I3
Waikerie        salinity    43          I4
Loxton          salinity    5           I5
Lock 7          flow        1           I6
Murray Bridge   level       1           I7
Murray Bridge   level       11          I8
Murray Bridge   level       1           I9
Murray Bridge   level       34          I10
Murray Bridge   level       57          I11
Mannum          level       57          I12
Lock 1 Upper    level       1           I13
Murray Bridge   salinity    13          O1

[32] A plot of the available salinity data at Murray Bridge is given in Figure 3, showing the data used for model development and the data reserved to perform the real-time simulation. In Figure 3 the two regions of uncharacteristic data identified by Bowden et al. [2002] are indicated by regions 1 and 2.

4.2. Deterministic ANN

[33] Standard ANN development practices were initially employed to develop a deterministic ANN model (based on ŵ) to provide a basis for comparison with the ANN trained using the proposed Bayesian training approach. To do this, the model development data (i.e., over the period 1 December 1986 to 1 April 1998) were further divided into training, testing (for cross validation) and validation subsets using the SOM data division method used by Bowden et al. [2002]. In order to compare the results of this study with those obtained by Bowden et al. [2005], the proportions of data allocated to each of these subsets were the same as those used by Bowden et al. [2005]. After accounting for the appropriate lags of the input and output variables, there were 1964 data samples available for model development. Of these, 1257 samples (64%) were allocated to the training data set, another 314 samples (16%) were allocated to the testing data set, and the final 393 samples (20%) were allocated to the validation data set.

[34] It has been shown that a one-hidden-layer multilayer perceptron (MLP) with the hyperbolic tangent (tanh) activation function on the hidden layer nodes and a linear activation function on the output layer nodes is able to approximate any continuous function arbitrarily well, given that there are a sufficient number of hidden nodes [Bishop, 1995]. Therefore an ANN with this configuration was used in this study to forecast the salinity levels at Murray Bridge. Bowden et al. [2005] used an ANN with 32 hidden nodes and 481 weights to produce the 14-day salinity forecasts, given the 13 inputs in Table 1. In order to obtain a more parsimonious model, the optimal ANN geometry was reinvestigated in this study. A trial-and-error approach was used, in which the number of hidden nodes was successively increased from 1, in increments of 1, until the addition of further hidden nodes did not result in a significant improvement of the test set error. This resulted in a network with 4 hidden nodes and 61 weights.
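The selected geometry (13 inputs, 4 tanh hidden nodes, one linear output node, 61 weights including biases) corresponds to the forward pass sketched below; the weight layout and variable names are an assumption for illustration, not the authors' FORTRAN implementation.

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """One-hidden-layer MLP g(x | w): tanh hidden layer, linear output node."""
    h = np.tanh(W1 @ x + b1)          # hidden layer activations
    return float(W2 @ h + b2)         # linear output

# 13 inputs, 4 hidden nodes, 1 output: 13*4 + 4 + 4*1 + 1 = 61 weights in total.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 13)), rng.normal(size=4)
W2, b2 = rng.normal(size=4), rng.normal()
y_hat = mlp_forward(rng.normal(size=13), W1, b1, W2, b2)
```

In the Bayesian framework described in section 2, all 61 of these weights are collected into the vector w that the AM algorithm samples.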
[35] In order to decrease the potential of becoming stuck in a local minimum, a genetic algorithm was used to train the deterministic ANN. Furthermore, the algorithm was initialized three times with different sets of random weights to increase the likelihood of obtaining a globally optimal solution. Cross validation using the testing data subset was also employed to ensure that the model did not overfit the training data. The generalization ability of the final model was validated against the validation subset before applying it to the real-time simulation data.

4.3. Bayesian ANN

[36] Using the proposed Bayesian training approach, the posterior weight distribution P(w | y, X) for the 13-input, 4-hidden-node ANN was estimated with the same training data subset used to find ŵ for the deterministic ANN. Cross validation with a test set is incompatible with Bayesian weight estimation; therefore it was important to check that the model had not overfit the training data by evaluating the predictive distribution of the validation data and assessing out-of-sample model performance. The training and testing data subsets could have been combined to form a larger calibration data set containing a greater amount of information. However, in order to perform a fair comparison with the deterministic forecasts, this was not done in this study. Rather, the predictive distribution of the testing data was evaluated and the model performance on the testing data set was compared with that of the deterministic ANN.

[37] Prior distributions were selected to include all possible values of w and σ². It was assumed that all of the weights would lie within the range [−100, 100]; therefore this was the range set for the uniform prior distributions of all of the network weights. To define a vague prior for σ², ν₀ was set equal to 0.1, while s₀² was set equal to 0.01, similar to the values used by Neal [1996a] to produce a noninformative prior distribution for this parameter.

Figure 3. Time series of salinity in the River Murray at Murray Bridge displaying the data periods used for model development and real-time simulation. Regions 1 and 2 highlight uncharacteristic regions in the real-time simulation data.

[38] To initialize S₀ according to (10), the standard deviations of the weights σ_{w_1}, ..., σ_{w_d} were each set to 0.01, as this resulted in a reasonable acceptance rate in the initial fixed covariance stage (t ≤ t₀) of the AM algorithm. As S₀ was not estimated to give an accurate representation of the distribution at the mode, a short (relative to the total number of iterations) initial fixed period of t₀ = 1,000 was chosen in order to minimize the effect of S₀ on the simulated weight states. In this study, σ₀² was held constant at 0.01 (= s₀²) for the same period of time (i.e., t_{σ₀} = t₀ = 1,000). To achieve an appropriate acceptance rate, the scaling parameter c was tuned every 200 iterations for the first 3,000 iterations, from which point on it remained fixed.

[39] Five parallel chains were simulated for a total of 500,000 iterations (K = 500,000). After inspecting the trace of the mean log posterior density, it was considered that convergence to the posterior had been reached after the first 100,000 iterations. Therefore the first 100,000 simulated draws were discarded to reduce the effects of the initial conditions (k = 100,000), and the final 400,000 iterations were used to make up the posterior weight distribution. From these, 10,000 weight vectors were randomly selected and used to evaluate the predictive distribution of each data sample in the real-time simulation data set.

5. Results and Discussion

[40] The root mean squared error (RMSE) and Akaike's information criterion (AIC) were used to assess the performance of the deterministic and Bayesian ANNs developed in this study (the performance of the Bayesian ANN was assessed based on the mean predictions). The RMSE measure was used to assess model performance by Bowden et al. [2005]; thus its use in this study enabled a direct comparison of the results obtained with those given by Bowden et al. [2005]. The AIC measure, calculated as AIC = N log(RMSE) + 2d, was also used to compare the performance of the ANNs developed in this study (4 hidden nodes) with that used by Bowden et al. [2005] (32 hidden nodes), as this measure takes into account the parsimony of the model. While complex models can often fit the data better than models with fewer free parameters, the increase in model performance may not be justifiable given the additional effort required to train the model.
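A sketch of the two performance measures is given below, where d is the number of network weights; the factor of 2 on the parameter penalty is assumed from the usual AIC form, since the printed expression is partly garbled.

```python
import numpy as np

def rmse(y_obs, y_pred):
    """Root mean squared error between observed and predicted values."""
    return float(np.sqrt(np.mean((np.asarray(y_obs) - np.asarray(y_pred))**2)))

def aic(y_obs, y_pred, d):
    """AIC as used here: N*log(RMSE) + 2*d, penalizing the d network weights."""
    N = len(y_obs)
    return N * np.log(rmse(y_obs, y_pred)) + 2 * d

# Example: scoring a 61-weight network on a small set of observations (EC units).
y_obs = np.array([640.0, 655.0, 670.0])
y_pred = np.array([642.0, 650.0, 668.0])
print(rmse(y_obs, y_pred), aic(y_obs, y_pred, d=61))
```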
[41] Table 2 presents the results obtained using the deterministic and Bayesian ANNs in comparison with the results obtained by Bowden et al. [2005]. These results show that, although the performance of all models is similar on the training, testing and validation data subsets (i.e., interpolation), the performance of the Bayesian model is significantly better in the real-time forecasting scenario. This highlights the importance of accounting for the entire range of plausible weight vectors when making predictions, rather than relying on the single weight vector that provides the best fit to the training data. By estimating the posterior weight distribution, the Bayesian ANN has achieved a more generalized mapping of the underlying relationship, which is influenced less by the minimum error of the training data and more broadly by the overall information contained in the data. As seen in the results, this enables better extrapolation ability, as hypothesized in the introduction of this paper. It is not surprising that the results on the model development data (training, testing and validation subsets) were similar, as it is well known that deterministic ANNs generally perform well at interpolation.

Table 2. Model Performance Results in Comparison With Those Obtained by Bowden et al. [2005] (RMSE and AIC on the training, testing, validation and real-time simulation data sets for the Bowden et al. [2005] ANN,^a the deterministic ANN,^b and the Bayesian ANN^b). ^a A 32-hidden-node ANN. ^b A 4-hidden-node ANN.

[42] The fact that the Bayesian ANN takes into account a range of weight vectors can be seen in Figure 4. Figures 4a–4d display, for illustration purposes, the marginal distributions of the weights between hidden layer nodes 1–4 and the output, respectively, while Figure 4e displays a scatterplot of w_{I3,H4} versus w_{H4,O1} (see Figure 1). It can be seen that the level of uncertainty associated with the weights, as indicated by the spread of the distributions, is quite varied, with some weights being very poorly identified (e.g., the range of w_{I3,H4} in Figure 4e is approximately [0, 50]) and some being reasonably well determined by the data (e.g., w_{H4,O1} has a narrow range of approximately [0, 0.5], as seen in Figures 4d and 4e). These plots also demonstrate the non-Gaussian, multimodal, correlated and ill-conditioned nature typical of ANN weights and reinforce the argument that Gaussian approximation of the posterior weight distribution may be inappropriate for ANNs.

[43] In addition to producing significantly better average forecasts for the real-time forecasting scenario, the Bayesian ANN produces 95% prediction limits that indicate the level of uncertainty in the forecasts, as shown in Figure 5, which displays the 95% prediction limits and the deterministic ANN outputs for the model development data (i.e., the combined training, testing and validation data sets, Figure 5a) and the real-time simulation data (Figure 5b). It can be seen that the 95% prediction limits are quite narrow for most of the forecasting period, which may seem somewhat surprising given the significant uncertainty in some of the ANN weights (e.g., w_{I3,H4}). However, this can be explained by considering the highly correlated nature of the weights, an example of which is shown in Figure 4e, and the possible redundancy of some of the input-hidden node connections. The prediction limits for the forecasts during the real-time forecasting period are much wider than those for the model development period (interpolation), and this is particularly noticeable for the two periods of uncharacteristic data identified by Bowden et al. [2002] (i.e., regions 1 and 2). During these periods the ANN has to extrapolate beyond the range of the training data and, similar to Bowden et al. [2005], the deterministic ANN performed poorly in these regions.

Figure 4. (a–d) Marginal posterior distributions of the weights between hidden layer nodes 1–4 and the output node, respectively. (e) Scatterplot of w_{I3,H4} versus w_{H4,O1}, displaying the correlation structure between the weights.

In comparison, using the Bayesian ANN, the resulting uncertainty in the forecasts due to the uncharacteristic data is reflected in the expanded prediction limits, indicating to the modeler that single-valued forecasts (e.g., mean predictions) should be used with caution. Salinity levels were underpredicted by the deterministic ANN in regions 1 and 2, with estimated levels below the 800 EC threshold when observed salinities were above 800 EC. This provides an excellent example of the consequences of ignoring uncertainty in the model parameters, as scheduling pumping from the river during these periods, on the basis of the relatively low predicted salinity levels, could have costly ramifications.

[44] It can be seen in Figure 5b that the 95% prediction limits failed to include all of the observed salinity data in regions 1 and 2. While this may, in part, be due to inappropriate convergence to the true posterior, uncertainty in the ANN weights is only one source of prediction uncertainty, and the fact that not all of the data in these regions were accounted for by considering this source may suggest inadequacies in the model used to forecast the salinity data, possibly due to the omission of important inputs, an inappropriate model architecture, or errors in the data. Nevertheless, the Bayesian ANN provides a significant improvement over the deterministic ANN, and apart from the two periods of uncharacteristic data, almost all data points fall within the 95% limits.

[45] Several authors in the ANN/Bayesian literature have warned against the use of straightforward MCMC approaches for sampling from the posterior distribution of ANN weights because of potential inefficiencies and the prohibitive time required for implementation [MacKay, 1995; Neal, 1996a; Müller and Rios Insua, 1998]. In this study, both the Bayesian and deterministic implementations were written in FORTRAN 90 and run on an Intel Xeon processor with 2 GB of RAM running at 2.4 GHz. The resulting computation time of the Bayesian training approach was approximately min, in comparison with the 34.8 min taken by the genetic algorithm. However, the deterministic ANN needed to be trained three times with different initial weights, which required a total computation time of approximately 98.6 min.

Figure 5. The 95% prediction limits and deterministic output for (a) the model development data (December 1986 to May 1992) and (b) the real-time simulation data (August 1992 to March 1998).

It may be argued that deterministic weight estimates are also required to initialize the Bayesian approach; nevertheless, the computation times of the two training approaches are comparable. In contrast, Neal [1996a] reported a computation time of approximately 20 hours using the complex hybrid Monte Carlo method to train a network of similar size to that used in this study. Because of the increasing power and speed of modern computers, straightforward MCMC approaches such as that presented in this paper are now both attractive and feasible.

6. Summary and Conclusions

[46] In this paper, an accessible MCMC Bayesian ANN training approach was presented, combining the simple adaptive Metropolis (AM) and Gibbs sampling algorithms. The main advantage of the proposed training approach over other MCMC Bayesian training methods, which have been developed to achieve statistical optimality and efficiency [Neal, 1996a; Müller and Rios Insua, 1998], is its ease of implementation and coding. The simplicity of the framework is particularly important for its adoption in the field of water resources modeling, as it is likely that the difficulties associated with coding the more complex Bayesian training methods have hindered their use in this field, with practitioners opting to disregard prediction uncertainty and rely on deterministic predictions rather than apply such methods.

[47] The results of the salinity forecasting case study presented in this paper highlight the importance of accounting for the uncertainty associated with ANN predictions and demonstrate the advantages of the proposed Bayesian training approach over standard deterministic training techniques. While the performance of the ANN model developed using the Bayesian training approach was similar to that of a deterministic ANN in an interpolative context, it was shown that the Bayesian ANN was more robust in a real-time forecasting scenario, particularly when the model was required to extrapolate. Not only were the average forecasts obtained using the Bayesian ANN an improvement over the single-valued forecasts obtained using the deterministic ANN, but prediction limits, indicating the quality of the forecasts, were also produced using the Bayesian approach, which was shown to be particularly important in situations where forecasts were made outside the range of the calibration data.

[48] A major challenge facing the application of any MCMC ANN training technique is that, because of the complexity of ANNs and the strong correlations between the weights, it is difficult to effectively and efficiently explore the weight space and achieve convergence to the posterior distribution within a reasonable time frame. In trying to maintain the simplicity of the Bayesian training approach, limited focus was placed on achieving optimal efficiency of the MCMC algorithm. Therefore the proposed approach may be more computationally intensive than the complex MCMC algorithms previously developed for ANN training, requiring a larger number of iterations to converge. Nevertheless, owing to the increasing power of modern computers, the efficiency of MCMC algorithms is becoming less of a concern. In the case study presented, it was observed that the Bayesian training algorithm had a computation time comparable to that of standard training techniques and also compared favorably with the results presented for more complex MCMC algorithms.
However, by exploring the weight space less efficiently, it is also recognized that the results of the proposed Bayesian training approach may be biased by the initial weights, which is not statistically optimal. Yet statistical optimality has never been the main concern of ANN practitioners and, as the results presented in this paper have demonstrated, it is still better to approximate what may be a local posterior distribution around a good mode than to rely on a single set of deterministic weight estimates.

[49] Acknowledgments. The authors would like to thank the three WRR reviewers for their helpful comments and suggestions. This project is funded by an Australian Research Council Discovery grant.

References

ASCE Task Committee on Application of Artificial Neural Networks in Hydrology (2000a), Artificial neural networks in hydrology. I: Preliminary concepts, J. Hydrol. Eng., 5(2), 115-123.
ASCE Task Committee on Application of Artificial Neural Networks in Hydrology (2000b), Artificial neural networks in hydrology. II: Hydrologic applications, J. Hydrol. Eng., 5(2), 124-137.
Bates, B. C., and E. P. Campbell (2001), A Markov chain Monte Carlo scheme for parameter estimation and inference in conceptual rainfall-runoff modeling, Water Resour. Res., 37(4), 937-947.
Bishop, C. M. (1995), Neural Networks for Pattern Recognition, Oxford Univ. Press, New York.
Bowden, G. J., H. R. Maier, and G. C. Dandy (2002), Optimal division of data for neural network models in water resources applications, Water Resour. Res., 38(2), 1010.
Bowden, G. J., H. R. Maier, and G. C. Dandy (2005), Input determination for neural network models in water resources applications. Part 2. Case study: Forecasting salinity in a river, J. Hydrol., 301(1-4).
Buntine, W. L., and A. S. Weigend (1991), Bayesian back-propagation, Complex Syst., 5(6).
Dawson, C. W., and R. L. Wilby (2001), Hydrological modelling using artificial neural networks, Prog. Phys. Geogr., 25(1).
Duan, Q., S. Sorooshian, and V. Gupta (1992), Effective and efficient global optimization for conceptual rainfall-runoff models, Water Resour. Res., 28(4), 1015-1031.
Duane, S., A. D. Kennedy, B. J. Pendleton, and D. Roweth (1987), Hybrid Monte Carlo, Phys. Lett. B, 195(2), 216-222.
Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin (1995), Bayesian Data Analysis, CRC Press, Boca Raton, Fla.
Haario, H., E. Saksman, and J. Tamminen (2001), An adaptive Metropolis algorithm, Bernoulli, 7(2), 223-242.
Kass, R. E., B. P. Carlin, A. Gelman, and R. M. Neal (1998), Markov chain Monte Carlo in practice: A roundtable discussion, Am. Stat., 52(2), 93-100.
Kuczera, G. (1983), Improved parameter inference in catchment models: 1. Evaluating parameter uncertainty, Water Resour. Res., 19(5).
Kuczera, G., and M. Mroczkowski (1998), Assessment of hydrologic parameter uncertainty and the worth of multiresponse data, Water Resour. Res., 34(6).
Kuczera, G., and E. Parent (1998), Monte Carlo assessment of parameter uncertainty in conceptual catchment models: The Metropolis algorithm, J. Hydrol., 211(1-4), 69-85.
Lampinen, J., and A. Vehtari (2001), Bayesian approach for neural networks: Review and case studies, Neural Networks, 14(3), 257-274.
Lee, H. K. H. (2003), A noninformative prior for neural networks, Mach. Learn., 50(1-2), 197-212.
Lee, P. M. (1989), Bayesian Statistics: An Introduction, Oxford Univ. Press, New York.
MacKay, D. J. C. (1992), A practical Bayesian framework for backpropagation networks, Neural Comput., 4(3), 448-472.
MacKay, D. J. C. (1995), Probable networks and plausible predictions: A review of practical Bayesian methods for supervised neural networks, Network Comput. Neural Syst., 6(3), 469-505.


Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Andrew Gordon Wilson www.cs.cmu.edu/~andrewgw Carnegie Mellon University March 18, 2015 1 / 45 Resources and Attribution Image credits,

More information

Reading Group on Deep Learning Session 1

Reading Group on Deep Learning Session 1 Reading Group on Deep Learning Session 1 Stephane Lathuiliere & Pablo Mesejo 2 June 2016 1/31 Contents Introduction to Artificial Neural Networks to understand, and to be able to efficiently use, the popular

More information

Multimodal Nested Sampling

Multimodal Nested Sampling Multimodal Nested Sampling Farhan Feroz Astrophysics Group, Cavendish Lab, Cambridge Inverse Problems & Cosmology Most obvious example: standard CMB data analysis pipeline But many others: object detection,

More information

Bayesian model selection: methodology, computation and applications

Bayesian model selection: methodology, computation and applications Bayesian model selection: methodology, computation and applications David Nott Department of Statistics and Applied Probability National University of Singapore Statistical Genomics Summer School Program

More information

Markov chain Monte Carlo methods in atmospheric remote sensing

Markov chain Monte Carlo methods in atmospheric remote sensing 1 / 45 Markov chain Monte Carlo methods in atmospheric remote sensing Johanna Tamminen johanna.tamminen@fmi.fi ESA Summer School on Earth System Monitoring and Modeling July 3 Aug 11, 212, Frascati July,

More information

Bagging During Markov Chain Monte Carlo for Smoother Predictions

Bagging During Markov Chain Monte Carlo for Smoother Predictions Bagging During Markov Chain Monte Carlo for Smoother Predictions Herbert K. H. Lee University of California, Santa Cruz Abstract: Making good predictions from noisy data is a challenging problem. Methods

More information

Neural Networks and the Back-propagation Algorithm

Neural Networks and the Back-propagation Algorithm Neural Networks and the Back-propagation Algorithm Francisco S. Melo In these notes, we provide a brief overview of the main concepts concerning neural networks and the back-propagation algorithm. We closely

More information

Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

More information

17 : Markov Chain Monte Carlo

17 : Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models, Spring 2015 17 : Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Heran Lin, Bin Deng, Yun Huang 1 Review of Monte Carlo Methods 1.1 Overview Monte Carlo

More information

arxiv: v1 [stat.co] 23 Apr 2018

arxiv: v1 [stat.co] 23 Apr 2018 Bayesian Updating and Uncertainty Quantification using Sequential Tempered MCMC with the Rank-One Modified Metropolis Algorithm Thomas A. Catanach and James L. Beck arxiv:1804.08738v1 [stat.co] 23 Apr

More information

Lecture 7 and 8: Markov Chain Monte Carlo

Lecture 7 and 8: Markov Chain Monte Carlo Lecture 7 and 8: Markov Chain Monte Carlo 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering University of Cambridge http://mlg.eng.cam.ac.uk/teaching/4f13/ Ghahramani

More information

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University Phylogenetics: Bayesian Phylogenetic Analysis COMP 571 - Spring 2015 Luay Nakhleh, Rice University Bayes Rule P(X = x Y = y) = P(X = x, Y = y) P(Y = y) = P(X = x)p(y = y X = x) P x P(X = x 0 )P(Y = y X

More information

1 Using standard errors when comparing estimated values

1 Using standard errors when comparing estimated values MLPR Assignment Part : General comments Below are comments on some recurring issues I came across when marking the second part of the assignment, which I thought it would help to explain in more detail

More information

Simulated Annealing for Constrained Global Optimization

Simulated Annealing for Constrained Global Optimization Monte Carlo Methods for Computation and Optimization Final Presentation Simulated Annealing for Constrained Global Optimization H. Edwin Romeijn & Robert L.Smith (1994) Presented by Ariel Schwartz Objective

More information

Lecture 4: Perceptrons and Multilayer Perceptrons

Lecture 4: Perceptrons and Multilayer Perceptrons Lecture 4: Perceptrons and Multilayer Perceptrons Cognitive Systems II - Machine Learning SS 2005 Part I: Basic Approaches of Concept Learning Perceptrons, Artificial Neuronal Networks Lecture 4: Perceptrons

More information

MCMC Sampling for Bayesian Inference using L1-type Priors

MCMC Sampling for Bayesian Inference using L1-type Priors MÜNSTER MCMC Sampling for Bayesian Inference using L1-type Priors (what I do whenever the ill-posedness of EEG/MEG is just not frustrating enough!) AG Imaging Seminar Felix Lucka 26.06.2012 , MÜNSTER Sampling

More information

Chapter 6 Problems with the calibration of Gaussian HMMs to annual rainfall

Chapter 6 Problems with the calibration of Gaussian HMMs to annual rainfall 115 Chapter 6 Problems with the calibration of Gaussian HMMs to annual rainfall Hidden Markov models (HMMs) were introduced in Section 3.3 as a method to incorporate climatic persistence into stochastic

More information

The Bias-Variance dilemma of the Monte Carlo. method. Technion - Israel Institute of Technology, Technion City, Haifa 32000, Israel

The Bias-Variance dilemma of the Monte Carlo. method. Technion - Israel Institute of Technology, Technion City, Haifa 32000, Israel The Bias-Variance dilemma of the Monte Carlo method Zlochin Mark 1 and Yoram Baram 1 Technion - Israel Institute of Technology, Technion City, Haifa 32000, Israel fzmark,baramg@cs.technion.ac.il Abstract.

More information

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling Christopher Jennison Department of Mathematical Sciences, University of Bath, UK http://people.bath.ac.uk/mascj Adriana Ibrahim Institute

More information

The connection of dropout and Bayesian statistics

The connection of dropout and Bayesian statistics The connection of dropout and Bayesian statistics Interpretation of dropout as approximate Bayesian modelling of NN http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf Dropout Geoffrey Hinton Google, University

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

Analysis of Fast Input Selection: Application in Time Series Prediction

Analysis of Fast Input Selection: Application in Time Series Prediction Analysis of Fast Input Selection: Application in Time Series Prediction Jarkko Tikka, Amaury Lendasse, and Jaakko Hollmén Helsinki University of Technology, Laboratory of Computer and Information Science,

More information

Bayesian Phylogenetics:

Bayesian Phylogenetics: Bayesian Phylogenetics: an introduction Marc A. Suchard msuchard@ucla.edu UCLA Who is this man? How sure are you? The one true tree? Methods we ve learned so far try to find a single tree that best describes

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Bayesian Backprop in Action: Pruning, Committees, Error Bars and an Application to Spectroscopy

Bayesian Backprop in Action: Pruning, Committees, Error Bars and an Application to Spectroscopy Bayesian Backprop in Action: Pruning, Committees, Error Bars and an Application to Spectroscopy Hans Henrik Thodberg Danish Meat Research Institute Maglegaardsvej 2, DK-4 Roskilde thodberg~nn.meatre.dk

More information

Bayesian Inference for Discretely Sampled Diffusion Processes: A New MCMC Based Approach to Inference

Bayesian Inference for Discretely Sampled Diffusion Processes: A New MCMC Based Approach to Inference Bayesian Inference for Discretely Sampled Diffusion Processes: A New MCMC Based Approach to Inference Osnat Stramer 1 and Matthew Bognar 1 Department of Statistics and Actuarial Science, University of

More information

ECO 513 Fall 2009 C. Sims HIDDEN MARKOV CHAIN MODELS

ECO 513 Fall 2009 C. Sims HIDDEN MARKOV CHAIN MODELS ECO 513 Fall 2009 C. Sims HIDDEN MARKOV CHAIN MODELS 1. THE CLASS OF MODELS y t {y s, s < t} p(y t θ t, {y s, s < t}) θ t = θ(s t ) P[S t = i S t 1 = j] = h ij. 2. WHAT S HANDY ABOUT IT Evaluating the

More information

MODULE -4 BAYEIAN LEARNING

MODULE -4 BAYEIAN LEARNING MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities

More information

Introduction to Bayesian methods in inverse problems

Introduction to Bayesian methods in inverse problems Introduction to Bayesian methods in inverse problems Ville Kolehmainen 1 1 Department of Applied Physics, University of Eastern Finland, Kuopio, Finland March 4 2013 Manchester, UK. Contents Introduction

More information

Learning Gaussian Process Models from Uncertain Data

Learning Gaussian Process Models from Uncertain Data Learning Gaussian Process Models from Uncertain Data Patrick Dallaire, Camille Besse, and Brahim Chaib-draa DAMAS Laboratory, Computer Science & Software Engineering Department, Laval University, Canada

More information

Checking up on the neighbors: Quantifying uncertainty in relative event location

Checking up on the neighbors: Quantifying uncertainty in relative event location Checking up on the neighbors: Quantifying uncertainty in relative event location The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

More information

Doing Bayesian Integrals

Doing Bayesian Integrals ASTR509-13 Doing Bayesian Integrals The Reverend Thomas Bayes (c.1702 1761) Philosopher, theologian, mathematician Presbyterian (non-conformist) minister Tunbridge Wells, UK Elected FRS, perhaps due to

More information

Choosing the Summary Statistics and the Acceptance Rate in Approximate Bayesian Computation

Choosing the Summary Statistics and the Acceptance Rate in Approximate Bayesian Computation Choosing the Summary Statistics and the Acceptance Rate in Approximate Bayesian Computation COMPSTAT 2010 Revised version; August 13, 2010 Michael G.B. Blum 1 Laboratoire TIMC-IMAG, CNRS, UJF Grenoble

More information

Parameter Estimation. William H. Jefferys University of Texas at Austin Parameter Estimation 7/26/05 1

Parameter Estimation. William H. Jefferys University of Texas at Austin Parameter Estimation 7/26/05 1 Parameter Estimation William H. Jefferys University of Texas at Austin bill@bayesrules.net Parameter Estimation 7/26/05 1 Elements of Inference Inference problems contain two indispensable elements: Data

More information

Advanced analysis and modelling tools for spatial environmental data. Case study: indoor radon data in Switzerland

Advanced analysis and modelling tools for spatial environmental data. Case study: indoor radon data in Switzerland EnviroInfo 2004 (Geneva) Sh@ring EnviroInfo 2004 Advanced analysis and modelling tools for spatial environmental data. Case study: indoor radon data in Switzerland Mikhail Kanevski 1, Michel Maignan 1

More information

Bayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007

Bayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007 Bayesian inference Fredrik Ronquist and Peter Beerli October 3, 2007 1 Introduction The last few decades has seen a growing interest in Bayesian inference, an alternative approach to statistical inference.

More information

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Lecture - 27 Multilayer Feedforward Neural networks with Sigmoidal

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

Lecture 5. G. Cowan Lectures on Statistical Data Analysis Lecture 5 page 1

Lecture 5. G. Cowan Lectures on Statistical Data Analysis Lecture 5 page 1 Lecture 5 1 Probability (90 min.) Definition, Bayes theorem, probability densities and their properties, catalogue of pdfs, Monte Carlo 2 Statistical tests (90 min.) general concepts, test statistics,

More information

Algorithmisches Lernen/Machine Learning

Algorithmisches Lernen/Machine Learning Algorithmisches Lernen/Machine Learning Part 1: Stefan Wermter Introduction Connectionist Learning (e.g. Neural Networks) Decision-Trees, Genetic Algorithms Part 2: Norman Hendrich Support-Vector Machines

More information

Bayesian networks: approximate inference

Bayesian networks: approximate inference Bayesian networks: approximate inference Machine Intelligence Thomas D. Nielsen September 2008 Approximative inference September 2008 1 / 25 Motivation Because of the (worst-case) intractability of exact

More information

ST 740: Markov Chain Monte Carlo

ST 740: Markov Chain Monte Carlo ST 740: Markov Chain Monte Carlo Alyson Wilson Department of Statistics North Carolina State University October 14, 2012 A. Wilson (NCSU Stsatistics) MCMC October 14, 2012 1 / 20 Convergence Diagnostics:

More information

On the Optimal Scaling of the Modified Metropolis-Hastings algorithm

On the Optimal Scaling of the Modified Metropolis-Hastings algorithm On the Optimal Scaling of the Modified Metropolis-Hastings algorithm K. M. Zuev & J. L. Beck Division of Engineering and Applied Science California Institute of Technology, MC 4-44, Pasadena, CA 925, USA

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters

More information

Gaussian Processes: We demand rigorously defined areas of uncertainty and doubt

Gaussian Processes: We demand rigorously defined areas of uncertainty and doubt Gaussian Processes: We demand rigorously defined areas of uncertainty and doubt ACS Spring National Meeting. COMP, March 16 th 2016 Matthew Segall, Peter Hunt, Ed Champness matt.segall@optibrium.com Optibrium,

More information

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD WHAT IS A NEURAL NETWORK? The simplest definition of a neural network, more properly referred to as an 'artificial' neural network (ANN), is provided

More information

Machine Learning Techniques for Computer Vision

Machine Learning Techniques for Computer Vision Machine Learning Techniques for Computer Vision Part 2: Unsupervised Learning Microsoft Research Cambridge x 3 1 0.5 0.2 0 0.5 0.3 0 0.5 1 ECCV 2004, Prague x 2 x 1 Overview of Part 2 Mixture models EM

More information

Regression with Input-Dependent Noise: A Bayesian Treatment

Regression with Input-Dependent Noise: A Bayesian Treatment Regression with Input-Dependent oise: A Bayesian Treatment Christopher M. Bishop C.M.BishopGaston.ac.uk Cazhaow S. Qazaz qazazcsgaston.ac.uk eural Computing Research Group Aston University, Birmingham,

More information

Markov chain Monte Carlo

Markov chain Monte Carlo Markov chain Monte Carlo Markov chain Monte Carlo (MCMC) Gibbs and Metropolis Hastings Slice sampling Practical details Iain Murray http://iainmurray.net/ Reminder Need to sample large, non-standard distributions:

More information

Part 8: GLMs and Hierarchical LMs and GLMs

Part 8: GLMs and Hierarchical LMs and GLMs Part 8: GLMs and Hierarchical LMs and GLMs 1 Example: Song sparrow reproductive success Arcese et al., (1992) provide data on a sample from a population of 52 female song sparrows studied over the course

More information

Supplementary Note on Bayesian analysis

Supplementary Note on Bayesian analysis Supplementary Note on Bayesian analysis Structured variability of muscle activations supports the minimal intervention principle of motor control Francisco J. Valero-Cuevas 1,2,3, Madhusudhan Venkadesan

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

Penalized Loss functions for Bayesian Model Choice

Penalized Loss functions for Bayesian Model Choice Penalized Loss functions for Bayesian Model Choice Martyn International Agency for Research on Cancer Lyon, France 13 November 2009 The pure approach For a Bayesian purist, all uncertainty is represented

More information

Multitask Learning of Environmental Spatial Data

Multitask Learning of Environmental Spatial Data 9th International Congress on Environmental Modelling and Software Brigham Young University BYU ScholarsArchive 6th International Congress on Environmental Modelling and Software - Leipzig, Germany - July

More information

Kernel adaptive Sequential Monte Carlo

Kernel adaptive Sequential Monte Carlo Kernel adaptive Sequential Monte Carlo Ingmar Schuster (Paris Dauphine) Heiko Strathmann (University College London) Brooks Paige (Oxford) Dino Sejdinovic (Oxford) December 7, 2015 1 / 36 Section 1 Outline

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix Labor-Supply Shifts and Economic Fluctuations Technical Appendix Yongsung Chang Department of Economics University of Pennsylvania Frank Schorfheide Department of Economics University of Pennsylvania January

More information

Simple closed form formulas for predicting groundwater flow model uncertainty in complex, heterogeneous trending media

Simple closed form formulas for predicting groundwater flow model uncertainty in complex, heterogeneous trending media WATER RESOURCES RESEARCH, VOL. 4,, doi:0.029/2005wr00443, 2005 Simple closed form formulas for predicting groundwater flow model uncertainty in complex, heterogeneous trending media Chuen-Fa Ni and Shu-Guang

More information

Markov Chain Monte Carlo (MCMC) and Model Evaluation. August 15, 2017

Markov Chain Monte Carlo (MCMC) and Model Evaluation. August 15, 2017 Markov Chain Monte Carlo (MCMC) and Model Evaluation August 15, 2017 Frequentist Linking Frequentist and Bayesian Statistics How can we estimate model parameters and what does it imply? Want to find the

More information

DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION. Alexandre Iline, Harri Valpola and Erkki Oja

DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION. Alexandre Iline, Harri Valpola and Erkki Oja DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION Alexandre Iline, Harri Valpola and Erkki Oja Laboratory of Computer and Information Science Helsinki University of Technology P.O.Box

More information

16 : Markov Chain Monte Carlo (MCMC)

16 : Markov Chain Monte Carlo (MCMC) 10-708: Probabilistic Graphical Models 10-708, Spring 2014 16 : Markov Chain Monte Carlo MCMC Lecturer: Matthew Gormley Scribes: Yining Wang, Renato Negrinho 1 Sampling from low-dimensional distributions

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Karl-Rudolf Koch Introduction to Bayesian Statistics Second Edition

Karl-Rudolf Koch Introduction to Bayesian Statistics Second Edition Karl-Rudolf Koch Introduction to Bayesian Statistics Second Edition Karl-Rudolf Koch Introduction to Bayesian Statistics Second, updated and enlarged Edition With 17 Figures Professor Dr.-Ing., Dr.-Ing.

More information

FORECASTING poor air quality events associated with

FORECASTING poor air quality events associated with A Comparison of Bayesian and Conditional Density Models in Probabilistic Ozone Forecasting Song Cai, William W. Hsieh, and Alex J. Cannon Member, INNS Abstract Probabilistic models were developed to provide

More information

Statistical Methods in Particle Physics Lecture 1: Bayesian methods

Statistical Methods in Particle Physics Lecture 1: Bayesian methods Statistical Methods in Particle Physics Lecture 1: Bayesian methods SUSSP65 St Andrews 16 29 August 2009 Glen Cowan Physics Department Royal Holloway, University of London g.cowan@rhul.ac.uk www.pp.rhul.ac.uk/~cowan

More information

STAT 425: Introduction to Bayesian Analysis

STAT 425: Introduction to Bayesian Analysis STAT 425: Introduction to Bayesian Analysis Marina Vannucci Rice University, USA Fall 2017 Marina Vannucci (Rice University, USA) Bayesian Analysis (Part 2) Fall 2017 1 / 19 Part 2: Markov chain Monte

More information

Chapter 5 Identifying hydrological persistence

Chapter 5 Identifying hydrological persistence 103 Chapter 5 Identifying hydrological persistence The previous chapter demonstrated that hydrologic data from across Australia is modulated by fluctuations in global climate modes. Various climate indices

More information

CSC 2541: Bayesian Methods for Machine Learning

CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 3 More Markov Chain Monte Carlo Methods The Metropolis algorithm isn t the only way to do MCMC. We ll

More information

Learning features by contrasting natural images with noise

Learning features by contrasting natural images with noise Learning features by contrasting natural images with noise Michael Gutmann 1 and Aapo Hyvärinen 12 1 Dept. of Computer Science and HIIT, University of Helsinki, P.O. Box 68, FIN-00014 University of Helsinki,

More information

Linear Regression Models

Linear Regression Models Linear Regression Models Model Description and Model Parameters Modelling is a central theme in these notes. The idea is to develop and continuously improve a library of predictive models for hazards,

More information

A hybrid Marquardt-Simulated Annealing method for solving the groundwater inverse problem

A hybrid Marquardt-Simulated Annealing method for solving the groundwater inverse problem Calibration and Reliability in Groundwater Modelling (Proceedings of the ModelCARE 99 Conference held at Zurich, Switzerland, September 1999). IAHS Publ. no. 265, 2000. 157 A hybrid Marquardt-Simulated

More information

Kyle Reing University of Southern California April 18, 2018

Kyle Reing University of Southern California April 18, 2018 Renormalization Group and Information Theory Kyle Reing University of Southern California April 18, 2018 Overview Renormalization Group Overview Information Theoretic Preliminaries Real Space Mutual Information

More information

A limited memory acceleration strategy for MCMC sampling in hierarchical Bayesian calibration of hydrological models

A limited memory acceleration strategy for MCMC sampling in hierarchical Bayesian calibration of hydrological models Click Here for Full Article WATER RESOURCES RESEARCH, VOL. 46,, doi:1.129/29wr8985, 21 A limited memory acceleration strategy for MCMC sampling in hierarchical Bayesian calibration of hydrological models

More information

Bayesian Inference of Noise Levels in Regression

Bayesian Inference of Noise Levels in Regression Bayesian Inference of Noise Levels in Regression Christopher M. Bishop Microsoft Research, 7 J. J. Thomson Avenue, Cambridge, CB FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop

More information

April 20th, Advanced Topics in Machine Learning California Institute of Technology. Markov Chain Monte Carlo for Machine Learning

April 20th, Advanced Topics in Machine Learning California Institute of Technology. Markov Chain Monte Carlo for Machine Learning for for Advanced Topics in California Institute of Technology April 20th, 2017 1 / 50 Table of Contents for 1 2 3 4 2 / 50 History of methods for Enrico Fermi used to calculate incredibly accurate predictions

More information

New Insights into History Matching via Sequential Monte Carlo

New Insights into History Matching via Sequential Monte Carlo New Insights into History Matching via Sequential Monte Carlo Associate Professor Chris Drovandi School of Mathematical Sciences ARC Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS)

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods Tomas McKelvey and Lennart Svensson Signal Processing Group Department of Signals and Systems Chalmers University of Technology, Sweden November 26, 2012 Today s learning

More information

Variational Methods in Bayesian Deconvolution

Variational Methods in Bayesian Deconvolution PHYSTAT, SLAC, Stanford, California, September 8-, Variational Methods in Bayesian Deconvolution K. Zarb Adami Cavendish Laboratory, University of Cambridge, UK This paper gives an introduction to the

More information

Assessing Regime Uncertainty Through Reversible Jump McMC

Assessing Regime Uncertainty Through Reversible Jump McMC Assessing Regime Uncertainty Through Reversible Jump McMC August 14, 2008 1 Introduction Background Research Question 2 The RJMcMC Method McMC RJMcMC Algorithm Dependent Proposals Independent Proposals

More information

Combination of M-Estimators and Neural Network Model to Analyze Inside/Outside Bark Tree Diameters

Combination of M-Estimators and Neural Network Model to Analyze Inside/Outside Bark Tree Diameters Combination of M-Estimators and Neural Network Model to Analyze Inside/Outside Bark Tree Diameters Kyriaki Kitikidou, Elias Milios, Lazaros Iliadis, and Minas Kaymakis Democritus University of Thrace,

More information

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model UNIVERSITY OF TEXAS AT SAN ANTONIO Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model Liang Jing April 2010 1 1 ABSTRACT In this paper, common MCMC algorithms are introduced

More information

The Jackknife-Like Method for Assessing Uncertainty of Point Estimates for Bayesian Estimation in a Finite Gaussian Mixture Model

The Jackknife-Like Method for Assessing Uncertainty of Point Estimates for Bayesian Estimation in a Finite Gaussian Mixture Model Thai Journal of Mathematics : 45 58 Special Issue: Annual Meeting in Mathematics 207 http://thaijmath.in.cmu.ac.th ISSN 686-0209 The Jackknife-Like Method for Assessing Uncertainty of Point Estimates for

More information

16 : Approximate Inference: Markov Chain Monte Carlo

16 : Approximate Inference: Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models 10-708, Spring 2017 16 : Approximate Inference: Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Yuan Yang, Chao-Ming Yen 1 Introduction As the target distribution

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information