Bayesian training of artificial neural networks used for water resources modeling


WATER RESOURCES RESEARCH, VOL. 41, doi:10.1029/2005WR004152, 2005

Bayesian training of artificial neural networks used for water resources modeling

Greer B. Kingston, Martin F. Lambert, and Holger R. Maier
Centre for Applied Modeling in Water Engineering, School of Civil and Environmental Engineering, University of Adelaide, Adelaide, South Australia, Australia

Received 9 March 2005; revised 9 September 2005; accepted 13 September 2005; published 6 December 2005.

[1] Artificial neural networks (ANNs) have proven to be superior prediction models in many hydrology-related areas; however, the failure of ANN practitioners to account for uncertainty in the predictions has limited the wider use of ANNs as forecasting models. Conventional methods for quantifying parameter uncertainty are difficult to apply to ANN weights because of the complexity of these models, and the complicated methods developed for this purpose have not been adopted by water resources practitioners because of the difficulty in implementing them. This paper presents a relatively straightforward Bayesian training method that enables weight uncertainty to be accounted for in ANN predictions. The method is applied to a salinity forecasting case study, and the resulting ANN is shown to significantly outperform an ANN developed using standard approaches in a real-time forecasting scenario. Moreover, the Bayesian approach produces prediction limits that indicate the level of uncertainty in the predictions, which is extremely important if forecasts are to be used with confidence in water resources applications.

Citation: Kingston, G. B., M. F. Lambert, and H. R. Maier (2005), Bayesian training of artificial neural networks used for water resources modeling, Water Resour. Res., 41, doi:10.1029/2005WR004152.

Copyright 2005 by the American Geophysical Union.

1. Introduction

[2] Over the past 15 years, artificial neural networks (ANNs) have proven to be extremely beneficial tools for simulating, predicting and forecasting water resources variables. The predictive capability of ANNs in this field has been demonstrated in numerous studies, leading to the publication of several comprehensive reviews on the application of ANNs in hydrology, such as rainfall-runoff modeling, water quality forecasting and streamflow prediction [see ASCE Task Committee on Application of Artificial Neural Networks in Hydrology, 2000a, 2000b; Maier and Dandy, 2000; Dawson and Wilby, 2001]. ANNs are able to capture complex, nonlinear functional relationships within data without requiring an in-depth understanding of the underlying physical process or the need to prespecify a functional form of the model, thus giving them an advantage over many of the models traditionally used for modeling water resources variables. However, as noted by Maier and Dandy [2000], a major limitation of ANNs in this field is that the uncertainty in the predictions generated is seldom quantified. Failure to account for such uncertainty makes it impossible to assess the quality of ANN predictions, which severely limits their usability in real-world water resources management and design applications.

[3] A significant component of prediction uncertainty can be attributed to the uncertainty in the parameters that govern the modeled function. In an ANN these parameters are the connection and bias weights in the network, as shown in Figure 1, which displays a multilayer ANN structure typically used for hydrological prediction.
These weights have no direct physical interpretation, and therefore calibration, or training, is required to obtain estimates of their values. Traditionally, this involves iteratively adjusting the network weights to find a single optimal set of weight values that provides the best fit between the model outputs and a set of observed calibration (training) data. However, for any hydrological model, it is inappropriate to assume that point parameter estimates obtained by calibration can adequately describe the underlying hydrological relationship, and this is particularly the case for ANNs for the reasons discussed below.

[4] First, as an ANN does not incorporate any knowledge of the physical system, the resulting model is heavily dependent on information inferred from the finite data set used for training. Because of the stochastic nature of hydrological systems, each different set of training data would most likely yield a different set of weights. Thus finding the single weight vector that provides the best fit to the training data does not necessarily result in a correct model of the system. This problem becomes less significant as more data become available and the training data set becomes more representative of the population; however, the data available to describe many hydrological phenomena are often limited. Therefore, while an ANN based on a single weight vector may perform well in an interpolative context (on data similar to those contained in the training data set), it cannot be expected to extrapolate well in situations dissimilar to those previously presented to the model.

[5] Second, ANN training is a multidimensional nonlinear optimization problem. This is not a trivial task, as the nonlinearity of the problem can lead to the existence of multiple optima on the solution surface. Training algorithms may become trapped in local minima rather than converging on the global solution and, although sophisticated global optimization algorithms have been developed (e.g., the shuffled complex evolution (SCE-UA) algorithm developed by Duan et al. [1992]), there is still no algorithm that can guarantee global convergence. Furthermore, ANNs have the potential of becoming overtrained, which means that the network fits to noise in the training data rather than inferring the general underlying rule. Therefore, even if the training data sufficiently represent the data population, poor weight estimates may still result from difficulties during training.

[6] Finally, complex hydrological relationships may require complex model structures [Omlin and Reichert, 1999]. If the data are sparse in relation to the number of weights in the ANN, it becomes more difficult to properly identify a value for each of the weights. As a consequence, many combinations of weights may result in similar network performance, and there is no way to distinguish which of these best approximates the underlying relationship.

[7] In recent years, there has been an upsurge in the use of Bayesian methodology for quantifying parameter uncertainty in various scientific fields [Malakoff, 1999], including hydrological and water resources modeling [Marshall et al., 2004]. Under this paradigm, uncertainty in the model parameters is handled explicitly, as parameter distributions are estimated rather than point values. However, applications of Bayesian methods in hydrology-related areas have been limited to simple conceptual [Kuczera and Parent, 1998; Kuczera and Mroczkowski, 1998; Bates and Campbell, 2001; Thiemann et al., 2001; Vrugt et al., 2003; Marshall et al., 2004] and traditional statistical [Kuczera, 1983; Thyer et al., 2002] models, and have not been extended to ANNs. While Bayesian techniques are not inapplicable to ANNs and have, in fact, been used for ANN training (although rarely) since the early 1990s [Buntine and Weigend, 1991; MacKay, 1992; Neal, 1992], the complexity of ANNs makes it difficult to apply standard Bayesian methods to estimate the weight distributions. Consequently, the majority of Bayesian techniques applied to ANNs in the past have employed complex statistics in order to overcome any complications. Most of the available ANN software does not allow for Bayesian training, and because of the difficulty associated with programming the available complicated techniques, Bayesian ANN training has not been adopted by water resources practitioners.

[8] In this paper, a relatively simple and very accessible Bayesian training technique is presented. So as not to detract from the original attraction of ANNs, the aim of the procedure is not statistical optimality, nor optimum efficiency, but rather good results and ease of programming and application. The Bayesian training framework is applied to a real-world water resources case study in order to assess the uncertainty associated with the predictions and to investigate the relative advantages of the Bayesian approach in comparison with standard deterministic training techniques.

Figure 1. Example of a typical ANN structure used for hydrological prediction.

2. Bayesian Training

2.1. Background

[9] ANNs, like all mathematical models, work on the assumption that there is a real function underlying a system that relates a set of independent predictor variables to one or more dependent variables of interest.
The aim of ANN training is to infer an acceptable approximation of this relationship from a set of training data, so that the model can be used to produce accurate predictions when presented with new data. If y is the target variable and x is a vector of input data, it is assumed that

y = g(\mathbf{x} \mid \mathbf{w}) + \epsilon     (1)

where g(·) is the function described by the ANN, w is a vector of connection and bias weights that characterize the data-generating relationship, and ε is a random noise term with zero mean and constant variance, i.e., white noise.

[10] Using standard (deterministic) training approaches, a single optimal weight vector ŵ is sought that is most likely to reproduce the set of observed target data y = (y₁, y₂, ..., y_N), given the inputs X = (x₁, x₂, ..., x_N). The aim of Bayesian training, on the other hand, is to infer the posterior probability distribution of the weights given the observed data, P(w | y, X). This is done by updating any knowledge of the weight values held prior to obtaining the data with the information contained in the data, using Bayes' theorem:

P(\mathbf{w} \mid \mathbf{y}, X) = \frac{P(\mathbf{y} \mid \mathbf{w}, X)\,P(\mathbf{w})}{P(\mathbf{y} \mid X)} = \frac{P(\mathbf{y} \mid \mathbf{w}, X)\,P(\mathbf{w})}{\int P(\mathbf{y} \mid \mathbf{w}, X)\,P(\mathbf{w})\,d\mathbf{w}}     (2)

where P(w) is the prior weight distribution and P(y | w, X) is the likelihood function, which describes any information about w contained in the data. The likelihood function is often expressed as L(w).

2.2. Marginalization

[11] Under the Bayesian paradigm, the predictive distribution of a new datum y_{N+1} is determined by integrating the predictions made by all of the weight vectors over the posterior distribution of the weights, as follows:

P(y_{N+1} \mid \mathbf{x}_{N+1}, \mathbf{y}, X) = \int P(y_{N+1} \mid \mathbf{x}_{N+1}, \mathbf{w})\,P(\mathbf{w} \mid \mathbf{y}, X)\,d\mathbf{w}     (3)

This process is known as marginalization.
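In practice, once samples from the posterior weight distribution are available, the integral in (3) is approximated by a Monte Carlo average of the network outputs over those samples. The sketch below is a minimal illustration of this idea, not the authors' implementation (which was written in FORTRAN 90); the model function `g(x, w)` and the array `w_samples` are hypothetical placeholders.

```python
import numpy as np

def predictive_samples(g, x_new, w_samples):
    """Monte Carlo approximation of the predictive distribution in (3):
    evaluate the model output for a new input under every weight vector
    sampled from the posterior P(w | y, X)."""
    return np.array([g(x_new, w) for w in w_samples])

# Toy usage with a linear stand-in for the network function g(x | w).
g = lambda x, w: float(np.dot(x, w))
w_samples = np.random.normal(0.0, 0.1, size=(1000, 3))   # pretend posterior draws
x_new = np.array([0.2, -0.5, 1.0])
preds = predictive_samples(g, x_new, w_samples)
print(preds.mean(), np.percentile(preds, [2.5, 97.5]))    # mean and 95% limits
```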

For complex problems, the high dimensionality of this integral makes its evaluation with conventional analytical or numerical integration techniques virtually impossible. In order to overcome this problem, two main approaches to marginalization have generally been followed: Gaussian approximation of the posterior weight distribution to enable analytical integration, as introduced by MacKay [1992], and numerical integration using Markov chain Monte Carlo methods, as introduced by Neal [1992]. These approaches have since been reviewed by MacKay [1995], Bishop [1995], Neal [1996a], Lampinen and Vehtari [2001], and Titterington [2004].

Figure 2. The Metropolis method in two dimensions, displaying two proposal distributions, Q1 and Q2, that will achieve different rates of convergence to the posterior distribution.

2.3. Markov Chain Monte Carlo Integration

[12] For multilayered ANNs, the posterior weight distribution is typically very complex and multimodal, and thus the assumption of a Gaussian weight distribution is generally not a good one [Neal, 1996a]. It may be reasonable to assume that the distribution is locally Gaussian around each mode; however, this raises the question of how to properly handle the multiple modes when making predictions. Furthermore, the assumption of even a locally Gaussian distribution in the vicinity of the modes is sometimes questionable, particularly when the model is complex in comparison with the data available for training [Rasmussen, 1996]. To avoid the need to make such an approximation, Neal [1992] introduced a Markov chain Monte Carlo (MCMC) implementation to sample from the posterior weight distribution.

[13] The objective of MCMC methods is to generate samples from a continuous target density, which in this case is the posterior weight distribution. The Metropolis algorithm is a commonly used MCMC approach, which has had success in a number of applications [Bishop, 1995; Kuczera and Parent, 1998; Bates and Campbell, 2001; Marshall et al., 2004]. As it is difficult to sample from the complex target (posterior) distribution directly, this method makes use of a simpler, symmetric distribution, Q(·), known as the proposal or jumping distribution, to generate candidate weight vectors. In its simplest form, the proposal distribution depends only on the previous weight state, and therefore a random walk Markov chain is generated within the weight space. An adaptive acceptance-rejection criterion is employed such that the random walk sequence continually adapts to the posterior distribution of the weights. The algorithm proceeds as follows (as adapted from Gelman et al. [1995]):

1. Initialize the algorithm with an arbitrary set of weights w₀ for which P(w₀ | y) > 0.

2. For t = 1, 2, ...:

2.1. Generate a candidate weight vector w* from Q(w* | w_{t−1}).

2.2. Evaluate the ratio of P(w* | y, X) compared to P(w_{t−1} | y, X).

2.3. Accept w* with probability

\alpha(\mathbf{w}^* \mid \mathbf{w}_{t-1}) = \min\left[\frac{P(\mathbf{y} \mid X, \mathbf{w}^*)\,P(\mathbf{w}^*)}{P(\mathbf{y} \mid X, \mathbf{w}_{t-1})\,P(\mathbf{w}_{t-1})},\; 1\right]     (4)

and if w* is accepted, set w_t = w*; otherwise set w_t = w_{t−1}.

[14] Given sufficient iterations, the Markov chain produced by the Metropolis algorithm should converge to a stationary distribution, from which point onward it can be considered that the sampled weight vectors are generated from the posterior distribution. However, selection of an appropriate proposal distribution is a difficult task and one that has important implications for the convergence properties and efficiency of the algorithm, particularly in the case of complex models with correlated parameters.
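A minimal random-walk Metropolis sketch of the steps above is given below, assuming a user-supplied `log_posterior(w)` function (with the uniform weight priors used later in this paper it reduces to the log likelihood) and a fixed Gaussian proposal; the names and step size are illustrative, not the authors' code.

```python
import numpy as np

def metropolis(log_posterior, w0, n_iter, proposal_cov):
    """Random-walk Metropolis: propose from a symmetric Gaussian centred on
    the current state and accept with probability min(P*/P_prev, 1)."""
    d = len(w0)
    chain = np.empty((n_iter + 1, d))
    chain[0] = w0
    logp = log_posterior(w0)
    L = np.linalg.cholesky(proposal_cov)               # factor of Q's covariance
    for t in range(1, n_iter + 1):
        w_star = chain[t - 1] + L @ np.random.standard_normal(d)   # draw from Q(w*|w_{t-1})
        logp_star = log_posterior(w_star)
        if np.log(np.random.rand()) < logp_star - logp:            # accept/reject step
            chain[t], logp = w_star, logp_star
        else:
            chain[t] = chain[t - 1]
    return chain

# Toy target: a standard bivariate Gaussian "posterior".
chain = metropolis(lambda w: -0.5 * np.sum(w**2), np.zeros(2), 5000, 0.5 * np.eye(2))
```

The choice of `proposal_cov` is exactly the difficulty discussed in paragraph [14]: too large and few candidates are accepted, too small and the chain explores the posterior very slowly.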
Shown in Figure 2 is an example of a bivariate posterior distribution together with two example proposal distributions, Q1 and Q2. Using the larger proposal distribution Q1, denoted by the dashed line, it can be seen that a jump made from the current weight state w_t in almost any direction will result in a decrease in the posterior probability; as such, a large proportion of jumps will be rejected and convergence will be slow. On the other hand, if the smaller proposal distribution Q2 is used, the acceptance rate will increase; however, the algorithm will take longer to sample from the entire region of the posterior, and therefore the distribution may not be adequately represented by the samples generated within a specified number of iterations [Thyer et al., 2002]. This problem is not unique to ANNs; however, it is amplified for them because of the typically large number of parameters and the high correlations between them, which result from the interconnectivity of the nodes and a generally poor understanding of the optimum model complexity. This means that the time taken to obtain an adequate representation of the posterior distribution using the standard Metropolis algorithm can be prohibitive.

[15] In order to suppress the random walk behavior of the Metropolis algorithm and speed up convergence, Neal [1992, 1996a] used the hybrid Monte Carlo (HMC) algorithm of Duane et al. [1987] to sample from the posterior weight distribution. The HMC algorithm is a particularly elaborate version of the Metropolis algorithm that makes use of gradient information to direct the sampler into regions of high density, which ensures higher acceptance probabilities. However, while this complex MCMC implementation may improve convergence to the posterior, it may be difficult to program and to verify its correctness [Neal, 1996a; Lee, 2003].

The time and effort required to code this algorithm may indeed be the reason why it has not been more widely adopted by practitioners. Furthermore, while the complex algorithm may be quicker to converge to the target distribution, a greater number of runs using a simpler algorithm may in fact have a shorter total run time.

2.4. Adaptive Metropolis Sampling of Weights

[16] The adaptive Metropolis (AM) algorithm, developed by Haario et al. [2001], was used in this study to sample from the posterior weight distribution, as it has been found to have a number of advantages over other variants of the Metropolis algorithm in terms of efficiency and ease of use [Marshall et al., 2004]. This algorithm was developed to overcome the problems associated with selecting an appropriate covariance for the proposal distribution by setting it equal to the estimated posterior covariance of the weights, which is updated at each iteration based on all of the previously sampled weight vectors. This adaptation strategy ensures that information gained about the posterior distribution throughout the simulation is used to increase the efficiency of the algorithm and improve the convergence rate.

[17] To initialize the algorithm, an arbitrary, positive definite covariance matrix, S₀, is selected. For an initial period t₀ > 0, the covariance of the proposal is fixed at this initial covariance, after which time the adaptation strategy begins, as follows:

S_t = \begin{cases} S_0 & t \le t_0 \\ c^2 \operatorname{cov}(\mathbf{w}_0, \mathbf{w}_1, \ldots, \mathbf{w}_{t-1}) + c^2 \epsilon I_d & t > t_0 \end{cases}     (5)

where c is an adaptive scaling parameter used to maintain an appropriate acceptance rate, ε is a small constant used to ensure that S_t will not become singular, and I_d is the d-dimensional identity matrix, with d being the dimension of the weight vector. For t > t₀, calculation of S_t satisfies the following recursion formula:

S_{t+1} = \frac{t-1}{t} S_t + \frac{c^2}{t} \left[ t\,\bar{\mathbf{w}}_{t-1} \bar{\mathbf{w}}_{t-1}^T - (t+1)\,\bar{\mathbf{w}}_t \bar{\mathbf{w}}_t^T + \mathbf{w}_t \mathbf{w}_t^T + \epsilon I_d \right]     (6)

where \bar{\mathbf{w}}_t = \frac{1}{t+1} \sum_{i=0}^{t} \mathbf{w}_i. Therefore the covariance may be updated at each iteration with little additional computational cost.

[18] The choice of the initial fixed period t₀ should reflect the confidence in the initial covariance S₀. The longer this period, the more slowly the adaptation is felt and the greater the effect of the initial covariance on the simulated draws. Therefore, if the initial fixed period is short, even a poor initial choice of S₀ should only have a minor impact on the overall convergence of the algorithm. However, it is necessary to select S₀ such that the algorithm moves at least a little during the initial stage. To avoid the algorithm starting slowly, Haario et al. [2001] suggest using a priori knowledge, such as the most likely weight vector ŵ or the covariance of the weights at this mode, to assist in the choice of the initial weight state or the initial covariance.
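The adaptation rule in (5) and (6) can be sketched as a cheap per-iteration update of a running mean and covariance. The function below is an illustrative sketch under that assumption, not the authors' implementation; variable names are invented for clarity.

```python
import numpy as np

def am_covariance_update(S_t, w_bar_prev, w_t, t, c, eps, d):
    """One step of the recursion (6) for t > t0: update the proposal covariance
    from the previous covariance S_t, the previous running mean w_bar_prev and
    the newly obtained weight state w_t."""
    w_bar = (t * w_bar_prev + w_t) / (t + 1)          # running mean of w_0 ... w_t
    S_next = ((t - 1) / t) * S_t + (c**2 / t) * (
        t * np.outer(w_bar_prev, w_bar_prev)
        - (t + 1) * np.outer(w_bar, w_bar)
        + np.outer(w_t, w_t)
        + eps * np.eye(d)
    )
    return S_next, w_bar

# Example step at iteration t = 500 for a 61-dimensional weight vector,
# with c initially set to 2.4/sqrt(d) as recommended in the text.
d = 61
S, w_bar = np.eye(d) * 0.01, np.zeros(d)
S, w_bar = am_covariance_update(S, w_bar, np.random.normal(size=d), t=500,
                                c=2.4 / np.sqrt(d), eps=1e-6, d=d)
```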
2.5. Gibbs Sampling of Residual Variance

[19] In this study, a Gaussian likelihood function was used, as given by

L(\mathbf{w}) = P(\mathbf{y} \mid X, \mathbf{w}, \sigma^2) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{\left[ y_i - g(\mathbf{x}_i, \mathbf{w}) \right]^2}{2\sigma^2} \right)     (7)

This distribution makes the assumption that the model residuals (i.e., y_i − g(x_i | w)) are normally and independently distributed with zero mean and constant variance σ². The parameter σ² is sometimes referred to as a hyperparameter, as it plays an important role in estimating the values of the network weights but ultimately plays no part in the developed model. In a full Bayesian training approach, no fixed values are used for any parameters or hyperparameters [Lampinen and Vehtari, 2001]; thus σ² is also estimated from the training data. Following the approach used by Neal [1996a], in the proposed Bayesian training framework the posterior distribution of σ² is estimated using the Gibbs sampler, which is the simplest MCMC algorithm. This involves sampling from the full conditional distribution of σ², given the data and the values of the network weights sampled using the AM algorithm, given by

P(\sigma^2 \mid \mathbf{y}, X, \mathbf{w}) \propto P(\mathbf{y} \mid X, \mathbf{w}, \sigma^2)\,P(\sigma^2)     (8)

[20] Generating samples from P(σ² | y, X, w) is relatively easy if the prior distribution is chosen to be conjugate to the likelihood, which means that the posterior distribution will have the same parametric form as the prior. A natural conjugate prior for the Gaussian variance is the scaled inverse chi-square distribution Inv-χ²(ν₀, s₀²), where ν₀ and s₀² are degrees of freedom and scale parameters, respectively, chosen to express the level of prior knowledge [Lee, 1989]. The posterior distribution is then the scaled inverse chi-square distribution Inv-χ²(ν*, s*²), with

\nu^* = \nu_0 + N, \qquad s^{*2} = \frac{\nu_0 s_0^2 + N s^2}{\nu_0 + N}     (9)

where s² is equal to \sum_{i=1}^{N} \left[ y_i - g(\mathbf{x}_i, \mathbf{w}) \right]^2 / N and N is the number of training samples. Draws from the distribution given by (9) may be obtained by sampling X from the χ²_{ν*} distribution and letting σ² = ν* s*² / X.
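The conditional draw described in (9) can be sketched as follows, assuming the residuals have already been computed for the current weight vector; this is an illustrative sketch of the scaled inverse chi-square construction above, not the authors' code.

```python
import numpy as np

def gibbs_sigma2(residuals, nu0, s0_sq):
    """Draw sigma^2 from its scaled inverse chi-square conditional (9):
    sample X ~ chi^2(nu*) and return nu* * s*^2 / X."""
    N = residuals.size
    s_sq = np.sum(residuals**2) / N                      # mean squared residual
    nu_star = nu0 + N
    s_star_sq = (nu0 * s0_sq + N * s_sq) / (nu0 + N)
    chi2_draw = np.random.chisquare(nu_star)
    return nu_star * s_star_sq / chi2_draw

# Example: residuals y_i - g(x_i, w) for the current weight vector, with the
# vague prior settings (nu0 = 0.1, s0^2 = 0.01) used later in the case study.
sigma2 = gibbs_sigma2(np.random.normal(0.0, 0.1, size=200), nu0=0.1, s0_sq=0.01)
```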

2.6. Proposed Bayesian Training Approach

[21] The MCMC Bayesian training approach presented in this paper follows a two-step iterative procedure. In the first step, σ² is held constant while w is sampled from P(w_t | y, X, σ²_{t−1}) using the AM algorithm. In the second step, w is held constant while σ² is sampled from P(σ²_t | y, X, w_t) using the Gibbs sampler. This is consistent with the MCMC approach used by Neal [1996a], except that the complicated HMC algorithm has been substituted with the much simpler AM algorithm. While implementation of this procedure is relatively straightforward, there are a number of factors that need consideration to ensure successful performance of the algorithm and appropriate convergence to the posterior distribution. These factors were determined to achieve optimal efficiency while still retaining the simplicity of the approach, as discussed below.

[22] Before running the MCMC sampling algorithm, suitable prior distributions for w and σ² need to be chosen. As ANN weights have no physical interpretation, little can be known about the values of these parameters before observing the data. Therefore, to represent vague prior knowledge, it is convenient to assume wide, noninformative prior distributions, which allow the posterior distributions of w and σ² to be determined by the data without being restricted or affected by the prior distribution [Neal, 1996a; Lampinen and Vehtari, 2001]. In this study, a wide uniform prior, symmetric around zero, was assumed for each weight in order to specify an equal likelihood of positive and negative values and an otherwise complete lack of prior knowledge about the values of the weights. A noninformative prior was also assumed for σ² by setting appropriate values for ν₀ and s₀². In general, the smaller ν₀ is relative to the number of training samples N, the less informative is the scaled inverse chi-square prior distribution [Gelman et al., 1995]. The assumption of uniform priors for the network weights has the additional advantage of simplifying (4), as the prior probabilities cancel out, leaving only the likelihood ratio to be evaluated. To evaluate this ratio, it is more computationally efficient to take the (natural) logarithm of the likelihood, so that the function given by (7) becomes additive rather than multiplicative.
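Because the uniform weight priors cancel in the acceptance ratio (4), the samplers only need the log of the Gaussian likelihood (7), which is additive over the training samples. A minimal sketch, assuming a generic model function `g` (a placeholder, not the authors' network code), is:

```python
import numpy as np

def log_likelihood(y, X, w, sigma2, g):
    """Log of the Gaussian likelihood (7): sum over the training samples of
    -0.5*log(2*pi*sigma2) - (y_i - g(x_i, w))^2 / (2*sigma2)."""
    resid = y - np.array([g(x_i, w) for x_i in X])
    N = y.size
    return -0.5 * N * np.log(2.0 * np.pi * sigma2) - np.sum(resid**2) / (2.0 * sigma2)

# Toy check with a linear stand-in for the network function g(x, w).
g = lambda x, w: float(np.dot(x, w))
X = np.random.normal(size=(50, 3))
w = np.array([0.5, -0.2, 0.1])
y = np.array([g(x_i, w) for x_i in X]) + np.random.normal(0, 0.1, size=50)
print(log_likelihood(y, X, w, sigma2=0.01, g=g))
```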
[23] To initialize the training procedure, arbitrary values of w₀, σ₀² and S₀ are required. Gelman et al. [1995] and Haario et al. [2001] recommend using point estimates of w to obtain a rough estimate of the location of the posterior distribution. Not only does this increase convergence speed, but these values are easily estimated using available software and provide a useful check of the accuracy of the Bayesian training algorithm. As the posterior distribution of ANN weights is often multimodal, it is acknowledged that this initialization may bias the resulting posterior distribution if the algorithm becomes trapped in the vicinity of a local mode. Müller and Rios Insua [1998] and Neal [1996b] discuss the multimodality of ANN posterior distributions at length, as well as methods for efficiently sampling from such distributions. However, because the emphasis of the proposed Bayesian training framework is on simplicity and ease of use, instead of incorporating complicated mode-jumping schemes, it is recommended that extra care be taken in finding appropriate weights ŵ to initialize the algorithm. Thus, if the algorithm does become stuck around a local mode, there will at least be some confidence that it is a good mode (i.e., the best estimate of the maximum likelihood value given by a rigorous search algorithm that tries to thoroughly search the space). Nevertheless, to lessen the bias that may be caused by this initialization, it is recommended that, for an initial period t_{σ₀}, the variance hyperparameter be fixed at some value σ₀² ≠ σ̂², where σ̂² is estimated using the weight vector ŵ. As {ŵ, σ̂²} results in the (locally) maximum likelihood value, setting σ₀² = σ̂² would cause the acceptance rate of candidate weight vectors to be low, making it difficult to move away from the initial location. However, by selecting σ₀² such that the magnitude of the initial likelihood is somewhat reduced, the acceptance rate is increased, allowing the simulated chain to move more freely about the weight space during this period. In this study, σ₀² was set equal to the scale parameter s₀² used to define the prior distribution of σ², as this was found to provide good results in preliminary tests.

[24] The proposal distribution used in this study to generate candidate weight states was a multinormal distribution centered on the current weight state, with covariance estimated according to (5) (i.e., Q(w* | w_{t−1}) = N(w_{t−1}, S_t)). Gelman et al. [1995] give a number of recommendations for achieving optimal efficiency with this form of proposal distribution. One of these recommendations is that the initial covariance of the proposal distribution, S₀, be estimated based on the derivatives of the error function at the posterior mode. However, for a feedforward ANN, these derivatives are often ill conditioned, resulting in a (nearly) singular covariance matrix. This, in turn, can cause instability of the AM algorithm, as use of the recursion formula (6) means that the covariance is updated based on an initial ill-conditioned matrix. For the AM algorithm, the only requirements for the choice of S₀ are that it is positive definite and allows the algorithm to move at least a little in the initial fixed covariance stage. Therefore, in the proposed implementation, S₀ is defined by

S_0 = c^2 \begin{pmatrix} \sigma_{w_1}^2 & & \\ & \ddots & \\ & & \sigma_{w_d}^2 \end{pmatrix}     (10)

The parameters σ_{w_1}, ..., σ_{w_d} are the standard deviations of the weights, which are initially set equal to some arbitrary positive values, chosen to ensure that sufficient states are accepted while S₀ is fixed. To achieve an optimal acceptance rate of approximately 23% (for dimension d > 5), the scaling parameter c is initially set equal to 2.4/√d and tuned up or down at the beginning of the simulation if the acceptance rate is too high or low, respectively, as recommended by Gelman et al. [1995].

[25] The algorithm is run for sufficient iterations, K, first, to achieve convergence to the stationary distribution and, second, to sample enough weight states following convergence to provide an adequate representation of the posterior distribution. However, weight states simulated prior to the point at which convergence is apparently reached, at t = k < K, will still be influenced by the initial distribution rather than the posterior distribution. Therefore these simulations are discarded, and the remainder (or a smaller representative subset) are used as the basis for making Monte Carlo estimates from the predictive distribution of a test data set. In this implementation, the predictive distribution is then summarized by 95% prediction limits, which are useful for the visualization of prediction uncertainty, and by the mean predictions, which enable a direct comparison of predictive performance with that of a deterministic ANN (i.e., an ANN based on single-valued weight estimates).
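The summary described in paragraph [25] can be sketched as follows, assuming `preds` holds the network outputs for one test input under every retained (post-burn-in) weight vector; the names are illustrative and the 2.5th/97.5th percentiles are one common way of forming 95% limits from ranked samples.

```python
import numpy as np

def summarize_predictions(preds):
    """Summarize Monte Carlo predictions for one test sample: the mean
    prediction and the 2.5th/97.5th percentiles as 95% prediction limits."""
    lower, upper = np.percentile(preds, [2.5, 97.5])
    return preds.mean(), (lower, upper)

# preds: outputs g(x_i, w_t) for t = k+1, ..., K (here simulated values in EC units).
mean_pred, limits = summarize_predictions(np.random.normal(650.0, 40.0, size=10000))
```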

[26] Given the above considerations, the full Bayesian training procedure used in this study was carried out as follows:

1. Set w₀ = ŵ and σ₀² = s₀², evaluate log L(w₀), and initialize S₀ according to (10).

2. For t = 1, 2, ..., K:

2.1. If t ≤ t₀, let S_t = S₀.

2.2. Generate a candidate weight vector w* from Q(w* | w_{t−1}) = N(w_{t−1}, S_t).

2.3. Evaluate log L(w*) and compare it to log L(w_{t−1}).

2.4. Accept w* with probability

\alpha(\mathbf{w}^* \mid \mathbf{w}_{t-1}) = \min\left[\exp\left(\log L(\mathbf{w}^*) - \log L(\mathbf{w}_{t-1})\right),\; 1\right]

and if w* is accepted, set w_t = w*; otherwise set w_t = w_{t−1}.

2.5. If t > t_{σ₀}, generate σ²_t from P(σ²_t | y, X, w_t) = Inv-χ²(ν*, s*²).

2.6. If t = t₀, calculate

S_{t+1} = c^2 \operatorname{cov}(\mathbf{w}_0, \mathbf{w}_1, \ldots, \mathbf{w}_{t_0}) + c^2 \epsilon I_d, \quad \text{where} \quad \operatorname{cov}(\mathbf{w}_0, \ldots, \mathbf{w}_{t_0}) = \frac{1}{t_0} \left[ \sum_{i=0}^{t_0} \mathbf{w}_i \mathbf{w}_i^T - (t_0 + 1)\,\bar{\mathbf{w}}_{t_0} \bar{\mathbf{w}}_{t_0}^T \right];

else if t > t₀, calculate S_{t+1} according to (6).

3. Discard the initial samples (w₀, σ₀²), ..., (w_k, σ²_k) to diminish the effects of the initial distribution, and use the samples (w_{k+1}, σ²_{k+1}), ..., (w_K, σ²_K) for analysis.

4. For i = 1, 2, ..., N_testset:

4.1. Calculate the network predictions ŷ_{i,k+1}, ..., ŷ_{i,K} based on w_{k+1}, ..., w_K and the input vector x_i.

4.2. Rank the predictions ŷ_{i,k+1}, ..., ŷ_{i,K} in ascending order and determine the 95% simulation limits.

4.3. Calculate the mean prediction \bar{y}_i = \frac{1}{K - k} \sum_{t=k+1}^{K} \hat{y}_{i,t}.

[27] The most critical issue associated with MCMC simulations is determining when convergence to the posterior distribution has been achieved. Because of the multimodal nature of ANN weights and the correlations between them, convergence to the posterior weight distribution is usually relatively slow, as the covariance of the proposal distribution needs to be small in order to maintain a high enough acceptance rate, and because chains can become trapped for a long time around local modes. Multiple chains can be used to explore the weight space more widely and help to speed convergence to the posterior distribution; therefore it is recommended that a number of parallel chains be simulated simultaneously. In this study, the multiple chains were each initialized at ŵ to give them an equal chance of finding a good mode. However, during the initial period when σ₀² is fixed, the chains should diverge from their initial location in different directions. The use of multiple chains can also help to detect whether the algorithm has converged. The most commonly used diagnostics of convergence are trace plots of sampled MCMC values versus iteration [Kass et al., 1998]. A trace of the log posterior density, calculated by taking the mean over the multiple chains, was used in this study to diagnose convergence. As stated by Kass et al. [1998], if the log posterior density is increasing, the main mode has yet to be reached, whereas if it is decreasing, the algorithm was initialized near a tall, narrow mode and is moving toward a more representative part of the distribution. Therefore it can be assumed that convergence has been reached when this plot flattens out. However, a limitation of MCMC posterior simulation, particularly when applied to ANNs, is that it can never be guaranteed that convergence to the true posterior has been obtained, as there may still be modes that have not been discovered.
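The convergence check described in paragraph [27] can be sketched as below, assuming `log_post` is an array of log posterior density values with one row per parallel chain (a hypothetical layout, not the authors' code); the idea is simply to average over chains at each iteration and judge whether the resulting trace has flattened.

```python
import numpy as np

def mean_log_posterior_trace(log_post, window):
    """Average the log posterior density over parallel chains at each iteration
    and compare the means of the last two windows: a trace that is still rising
    (or falling) suggests the sampler has not yet converged."""
    trace = log_post.mean(axis=0)                      # mean over chains
    recent, previous = trace[-window:], trace[-2 * window:-window]
    return trace, recent.mean() - previous.mean()      # near zero once the trace is flat

# log_post: shape (n_chains, n_iterations), e.g. five parallel chains.
log_post = np.random.normal(-800.0, 5.0, size=(5, 50_000))
trace, drift = mean_log_posterior_trace(log_post, window=5_000)
```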
3. Case Study

[28] The case study used to demonstrate the advantages of the proposed Bayesian training approach is that of forecasting salinity in the River Murray at Murray Bridge, South Australia, 14 days in advance. The River Murray is Australia's largest river and is essential for irrigation and water supply purposes in South Australia, providing, on average, 45% of the total water requirements for the capital city of Adelaide [River Murray Catchment Water Management Board, 2003].

[29] It is predicted that, without intervention, the average salinity of the river at Murray Bridge will increase to 870 EC by 2050 [Murray Darling Basin Commission, 1999], which is above the 800 EC threshold for desirable drinking water. Apart from having an objectionable taste, water of such high salinity can cause reductions in crop yields, corrosion of pipes and infrastructure, and increased consumption of soap and detergents due to the hardness of the water. Murray Bridge is located within one of South Australia's major irrigation regions and is also the site of one of two offtakes that divert river water to Adelaide in order to meet the city's water demands. It is therefore important to be able to forecast salinity levels in the river at Murray Bridge several weeks in advance so that operational changes can be put into place to minimize the negative effects of high salinity.

[30] This case study was also used by Bowden et al. [2002, 2005], where an ANN was developed to produce 14-day salinity forecasts and a real-time simulation was subsequently performed using the developed model. Bowden et al. [2002] clustered the available data into groups of similar input/output patterns using a self-organizing map (SOM). It was then identified that the data set reserved to perform the real-time simulation contained two regions of data that were dissimilar to the data used to develop the model. Consequently, the ANN was required to extrapolate, and its predictive performance in these uncharacteristic regions was poor. This case study was therefore considered ideal for assessing the uncertainty associated with ANN predictions in a real-time simulation situation and for investigating the advantages of the proposed Bayesian training technique in comparison with deterministic training methods, particularly when the model is required to extrapolate.

4. Model Development

4.1. Data and Model Inputs

[31] Daily salinity, flow and river level data were available at various locations in the lower River Murray for the period 1 December 1986 to 1 April 1998. Bowden et al. [2002, 2005] used data from 1 December 1986 to 30 June 1992 to develop an ANN, while data from 1 July 1992 to 1 April 1998 were reserved to perform the real-time forecasting simulation. The same data split was also used in this study.

The ANN inputs used in this study were the same as those used by Bowden et al. [2005] and are given in Table 1. These inputs were selected from a total of 960 potentially suitable inputs (including lagged values of the available data from 1 through to 60 days) for predicting salinity levels in the River Murray at Murray Bridge 14 days in advance, using the partial mutual information (PMI) input determination method (see Bowden et al. [2005] for details). Similar to the method of Bowden et al. [2005], the input data were linearly scaled between −1 and 1, while the output data were linearly scaled between −0.8 and 0.8.

Table 1. Inputs and Outputs Used in the Salinity Model

Location        Data Type   Lag, days   Input/Output Number
Mannum          salinity    1           I1
Morgan          salinity    60          I2
Waikerie        salinity    1           I3
Waikerie        salinity    43          I4
Loxton          salinity    5           I5
Lock 7          flow        1           I6
Murray Bridge   level       1           I7
Murray Bridge   level       11          I8
Murray Bridge   level       1           I9
Murray Bridge   level       34          I10
Murray Bridge   level       57          I11
Mannum          level       57          I12
Lock 1 Upper    level       1           I13
Murray Bridge   salinity    13          O1

[32] A plot of the available salinity data at Murray Bridge is given in Figure 3, showing the data used for model development and the data reserved to perform the real-time simulation. In Figure 3 the two regions of uncharacteristic data identified by Bowden et al. [2002] are indicated by regions 1 and 2.

4.2. Deterministic ANN

[33] Standard ANN development practices were initially employed to develop a deterministic ANN model (based on ŵ) to provide a basis for comparison with the ANN trained using the proposed Bayesian training approach. To do this, the model development data (i.e., over the period 1 December 1986 to 1 April 1998) were further divided into training, testing (for cross validation) and validation subsets using the SOM data division method used by Bowden et al. [2002]. In order to compare the results of this study with those obtained by Bowden et al. [2005], the proportions of data allocated to each of these subsets were the same as those used by Bowden et al. [2005]. After accounting for the appropriate lags of the input and output variables, there were 1964 data samples available for model development. Of these, 1257 samples (64%) were allocated to the training data set, another 314 samples (16%) were allocated to the testing data set, and the final 393 samples (20%) were allocated to the validation data set.

[34] It has been shown that a one-hidden-layer multilayer perceptron (MLP) with the hyperbolic tangent (tanh) activation function on the hidden layer nodes and a linear activation function on the output layer nodes is able to approximate any continuous function arbitrarily well, given that there are a sufficient number of hidden nodes [Bishop, 1995]. Therefore an ANN with this configuration was used in this study to forecast the salinity levels at Murray Bridge. Bowden et al. [2005] used an ANN with 32 hidden nodes and 481 weights to produce the 14-day salinity forecasts, given the 13 inputs in Table 1. In order to obtain a more parsimonious model, the optimal ANN geometry was reinvestigated in this study. A trial-and-error approach was used, in which the number of hidden nodes was successively increased from 1, in increments of 1, until the addition of further hidden nodes did not result in a significant improvement of the test set error. This resulted in a network with 4 hidden nodes and 61 weights.
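The selected geometry (13 inputs, 4 tanh hidden nodes, one linear output node, 61 weights including biases) corresponds to the forward pass sketched below; the weight layout and variable names are an assumption for illustration, not the authors' FORTRAN implementation.

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """One-hidden-layer MLP g(x | w): tanh hidden layer, linear output node."""
    h = np.tanh(W1 @ x + b1)          # hidden layer activations
    return float(W2 @ h + b2)         # linear output

# 13 inputs, 4 hidden nodes, 1 output: 13*4 + 4 + 4*1 + 1 = 61 weights in total.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 13)), rng.normal(size=4)
W2, b2 = rng.normal(size=4), rng.normal()
y_hat = mlp_forward(rng.normal(size=13), W1, b1, W2, b2)
```

In the Bayesian framework described in section 2, all 61 of these weights are collected into the vector w that the AM algorithm samples.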
[35] In order to decrease the potential of becoming stuck in a local minimum, a genetic algorithm was used to train the deterministic ANN. Furthermore, the algorithm was initialized three times with different sets of random weights to increase the likelihood of obtaining a globally optimal solution. Cross validation using the testing data subset was also employed to ensure that the model did not overfit the training data. The generalization ability of the final model was validated against the validation subset before applying it to the real-time simulation data.

4.3. Bayesian ANN

[36] Using the proposed Bayesian training approach, the posterior weight distribution P(w | y, X) for the 13-input, 4-hidden-node ANN was estimated with the same training data subset used to find ŵ for the deterministic ANN. Cross validation with a test set is incompatible with Bayesian weight estimation; therefore it was important to check that the model had not overfit the training data by evaluating the predictive distribution of the validation data and assessing out-of-sample model performance. The training and testing data subsets could have been combined to form a larger calibration data set containing a greater amount of information. However, in order to perform a fair comparison with the deterministic forecasts, this was not done in this study. Rather, the predictive distribution of the testing data was evaluated and the model performance on the testing data set was compared with that of the deterministic ANN.

[37] Prior distributions were selected to include all possible values of w and σ². It was assumed that all of the weights would lie within the range [−100, 100]; therefore this was the range set for the uniform prior distributions of all of the network weights. To define a vague prior for σ², ν₀ was set equal to 0.1, while s₀² was set equal to 0.01, similar to the values used by Neal [1996a] to produce a noninformative prior distribution for this parameter.

Figure 3. Time series of salinity in the River Murray at Murray Bridge displaying the data periods used for model development and real-time simulation. Regions 1 and 2 highlight uncharacteristic regions in the real-time simulation data.

[38] To initialize S₀ according to (10), the standard deviations of the weights σ_{w_1}, ..., σ_{w_d} were each set to 0.01, as this resulted in a reasonable acceptance rate in the initial fixed covariance stage (t ≤ t₀) of the AM algorithm. As S₀ was not estimated to give an accurate representation of the distribution at the mode, a short (relative to the total number of iterations) initial fixed period of t₀ = 1,000 was chosen in order to minimize the effect of S₀ on the simulated weight states. In this study, σ₀² was held constant at 0.01 (= s₀²) for the same period of time (i.e., t_{σ₀} = t₀ = 1,000). To achieve an appropriate acceptance rate, the scaling parameter c was tuned every 200 iterations for the first 3,000 iterations, from which point on it remained fixed.

[39] Five parallel chains were simulated for a total of 500,000 iterations (K = 500,000). After inspecting the trace of the mean log posterior density, it was considered that convergence to the posterior had been reached after the first 100,000 iterations. Therefore the first 100,000 simulated draws were discarded to reduce the effects of the initial conditions (k = 100,000), and the final 400,000 iterations were used to make up the posterior weight distribution. From these, 10,000 weight vectors were randomly selected and used to evaluate the predictive distribution of each data sample in the real-time simulation data set.

5. Results and Discussion

[40] The root mean squared error (RMSE) and Akaike's information criterion (AIC) were used to assess the performance of the deterministic and Bayesian ANNs developed in this study (the performance of the Bayesian ANN was assessed based on the mean predictions). The RMSE measure was used to assess model performance by Bowden et al. [2005]; thus its use in this study enabled a direct comparison of the results obtained with those given by Bowden et al. [2005]. The AIC measure, calculated as AIC = N log(RMSE) + 2d, was also used to compare the performance of the ANNs developed in this study (4 hidden nodes) with that used by Bowden et al. [2005] (32 hidden nodes), as this measure takes into account the parsimony of the model. While complex models can often fit the data better than models with fewer free parameters, the increase in model performance may not be justifiable given the additional effort required to train the model.
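A sketch of the two performance measures is given below, where d is the number of network weights; the factor of 2 on the parameter penalty is assumed from the usual AIC form, since the printed expression is partly garbled.

```python
import numpy as np

def rmse(y_obs, y_pred):
    """Root mean squared error between observed and predicted values."""
    return float(np.sqrt(np.mean((np.asarray(y_obs) - np.asarray(y_pred))**2)))

def aic(y_obs, y_pred, d):
    """AIC as used here: N*log(RMSE) + 2*d, penalizing the d network weights."""
    N = len(y_obs)
    return N * np.log(rmse(y_obs, y_pred)) + 2 * d

# Example: scoring a 61-weight network on a small set of observations (EC units).
y_obs = np.array([640.0, 655.0, 670.0])
y_pred = np.array([642.0, 650.0, 668.0])
print(rmse(y_obs, y_pred), aic(y_obs, y_pred, d=61))
```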
[41] Table 2 presents the results obtained using the deterministic and Bayesian ANNs in comparison with the results obtained by Bowden et al. [2005]. These results show that, although the performance of all models is similar on the training, testing and validation data subsets (i.e., interpolation), the performance of the Bayesian model is significantly better in the real-time forecasting scenario. This highlights the importance of accounting for the entire range of plausible weight vectors when making predictions, rather than relying on the single weight vector that provides the best fit to the training data. By estimating the posterior weight distribution, the Bayesian ANN has achieved a more generalized mapping of the underlying relationship, which is influenced less by the minimum error of the training data and more broadly by the overall information contained in the data. As seen in the results, this enables better extrapolation ability, as hypothesized in the introduction of this paper. It is not surprising that the results on the model development data (training, testing and validation subsets) were similar, as it is well known that deterministic ANNs generally perform well at interpolation.

Table 2. Model Performance Results in Comparison With Those Obtained by Bowden et al. [2005] (RMSE and AIC on the training, testing, validation and real-time simulation data sets for the Bowden et al. [2005] ANN,^a the deterministic ANN,^b and the Bayesian ANN^b). ^a A 32-hidden-node ANN. ^b A 4-hidden-node ANN.

[42] The fact that the Bayesian ANN takes into account a range of weight vectors can be seen in Figure 4. Figures 4a–4d display, for illustration purposes, the marginal distributions of the weights between hidden layer nodes 1–4 and the output, respectively, while Figure 4e displays a scatterplot of w_{I3,H4} versus w_{H4,O1} (see Figure 1). It can be seen that the level of uncertainty associated with the weights, as indicated by the spread of the distributions, is quite varied, with some weights being very poorly identified (e.g., the range of w_{I3,H4} in Figure 4e is approximately [0, 50]) and some being reasonably well determined by the data (e.g., w_{H4,O1} has a narrow range of approximately [0, 0.5], as seen in Figures 4d and 4e). These plots also demonstrate the non-Gaussian, multimodal, correlated and ill-conditioned nature typical of ANN weights and reinforce the argument that Gaussian approximation of the posterior weight distribution may be inappropriate for ANNs.

[43] In addition to producing significantly better average forecasts for the real-time forecasting scenario, the Bayesian ANN produces 95% prediction limits that indicate the level of uncertainty in the forecasts, as shown in Figure 5, which displays the 95% prediction limits and the deterministic ANN outputs for the model development data (i.e., the combined training, testing and validation data sets, Figure 5a) and the real-time simulation data (Figure 5b). It can be seen that the 95% prediction limits are quite narrow for most of the forecasting period, which may seem somewhat surprising given the significant uncertainty in some of the ANN weights (e.g., w_{I3,H4}). However, this can be explained by considering the highly correlated nature of the weights, an example of which is shown in Figure 4e, and the possible redundancy of some of the input-hidden node connections. The prediction limits for the forecasts during the real-time forecasting period are much wider than those for the model development period (interpolation), and this is particularly noticeable for the two periods of uncharacteristic data identified by Bowden et al. [2002] (i.e., regions 1 and 2). During these periods the ANN has to extrapolate beyond the range of the training data and, similar to Bowden et al. [2005], the deterministic ANN performed poorly in these regions.

Figure 4. (a–d) Marginal posterior distributions of the weights between hidden layer nodes 1–4 and the output node, respectively. (e) Scatterplot of w_{I3,H4} versus w_{H4,O1}, displaying the correlation structure between the weights.

In comparison, using the Bayesian ANN, the resulting uncertainty in the forecasts due to the uncharacteristic data is reflected in the expanded prediction limits, indicating to the modeler that single-valued forecasts (e.g., mean predictions) should be used with caution. Salinity levels were underpredicted by the deterministic ANN in regions 1 and 2, with estimated levels below the 800 EC threshold when observed salinities were above 800 EC. This provides an excellent example of the consequences of ignoring uncertainty in the model parameters, as scheduling pumping from the river during these periods, on the basis of the relatively low predicted salinity levels, could have costly ramifications.

[44] It can be seen in Figure 5b that the 95% prediction limits failed to include all of the observed salinity data in regions 1 and 2. While this may, in part, be due to inappropriate convergence to the true posterior, uncertainty in the ANN weights is only one source of prediction uncertainty, and the fact that not all of the data in these regions were accounted for by considering this source may suggest inadequacies in the model used to forecast the salinity data, possibly due to the omission of important inputs, an inappropriate model architecture, or errors in the data. Nevertheless, the Bayesian ANN provides a significant improvement over the deterministic ANN, and apart from the two periods of uncharacteristic data, almost all data points fall within the 95% limits.

[45] Several authors in the ANN/Bayesian literature have warned against the use of straightforward MCMC approaches for sampling from the posterior distribution of ANN weights because of potential inefficiencies and the prohibitive time required for implementation [MacKay, 1995; Neal, 1996a; Müller and Rios Insua, 1998]. In this study, both the Bayesian and deterministic implementations were written in FORTRAN 90 and run on an Intel Xeon processor with 2 GB of RAM running at 2.4 GHz. The resulting computation time of the Bayesian training approach was approximately min, in comparison with the 34.8 min taken by the genetic algorithm. However, the deterministic ANN needed to be trained three times with different initial weights, which required a total computation time of approximately 98.6 min.

Figure 5. The 95% prediction limits and deterministic output for (a) the model development data (December 1986 to May 1992) and (b) the real-time simulation data (August 1992 to March 1998).

It may be argued that deterministic weight estimates are also required to initialize the Bayesian approach; nevertheless, the computation times of the two training approaches are comparable. In contrast, Neal [1996a] reported a computation time of approximately 20 hours using the complex hybrid Monte Carlo method to train a network of similar size to that used in this study. Because of the increasing power and speed of modern computers, straightforward MCMC approaches such as that presented in this paper are now both attractive and feasible.

6. Summary and Conclusions

[46] In this paper, an accessible MCMC Bayesian ANN training approach was presented, combining the simple adaptive Metropolis (AM) and Gibbs sampling algorithms. The main advantage of the proposed training approach over other MCMC Bayesian training methods, which have been developed to achieve statistical optimality and efficiency [Neal, 1996a; Müller and Rios Insua, 1998], is its ease of implementation and coding. The simplicity of the framework is particularly important for its adoption in the field of water resources modeling, as it is likely that the difficulties associated with coding the more complex Bayesian training methods have hindered their use in this field, with practitioners opting to disregard prediction uncertainty and rely on deterministic predictions rather than apply such methods.

[47] The results of the salinity forecasting case study presented in this paper highlight the importance of accounting for the uncertainty associated with ANN predictions and demonstrate the advantages of the proposed Bayesian training approach over standard deterministic training techniques. While the performance of the ANN model developed using the Bayesian training approach was similar to that of a deterministic ANN in an interpolative context, it was shown that the Bayesian ANN was more robust in a real-time forecasting scenario, particularly when the model was required to extrapolate. Not only were the average forecasts obtained using the Bayesian ANN an improvement over the single-valued forecasts obtained using the deterministic ANN, but prediction limits, indicating the quality of the forecasts, were also produced using the Bayesian approach, which was shown to be particularly important in situations where forecasts were made outside the range of the calibration data.

[48] A major challenge facing the application of any MCMC ANN training technique is that, because of the complexity of ANNs and the strong correlations between the weights, it is difficult to effectively and efficiently explore the weight space and achieve convergence to the posterior distribution within a reasonable time frame. In trying to maintain the simplicity of the Bayesian training approach, limited focus was placed on achieving optimal efficiency of the MCMC algorithm. Therefore the proposed approach may be more computationally intensive than the complex MCMC algorithms previously developed for ANN training, requiring a larger number of iterations to converge. Nevertheless, owing to the increasing power of modern computers, the efficiency of MCMC algorithms is becoming less of a concern. In the case study presented, it was observed that the Bayesian training algorithm had a computation time comparable to that of standard training techniques and also compared favorably with the results presented for more complex MCMC algorithms.
However, by exploring the weight space less efficiently, it is also recognized that the results of the proposed Bayesian training approach may be biased by the initial weights, which is not statistically optimal. Yet statistical optimality has never been the main concern of ANN practitioners and, as the results presented in this paper have demonstrated, it is still better to approximate what may be a local posterior distribution around a good mode than to rely on a single set of deterministic weight estimates.

[49] Acknowledgments. The authors would like to thank the three WRR reviewers for their helpful comments and suggestions. This project is funded by an Australian Research Council Discovery grant.

References

ASCE Task Committee on Application of Artificial Neural Networks in Hydrology (2000a), Artificial neural networks in hydrology. I: Preliminary concepts, J. Hydrol. Eng., 5(2), 115-123.
ASCE Task Committee on Application of Artificial Neural Networks in Hydrology (2000b), Artificial neural networks in hydrology. II: Hydrologic applications, J. Hydrol. Eng., 5(2), 124-137.
Bates, B. C., and E. P. Campbell (2001), A Markov chain Monte Carlo scheme for parameter estimation and inference in conceptual rainfall-runoff modeling, Water Resour. Res., 37(4), 937-947.
Bishop, C. M. (1995), Neural Networks for Pattern Recognition, Oxford Univ. Press, New York.
Bowden, G. J., H. R. Maier, and G. C. Dandy (2002), Optimal division of data for neural network models in water resources applications, Water Resour. Res., 38(2), 1010.
Bowden, G. J., H. R. Maier, and G. C. Dandy (2005), Input determination for neural network models in water resources applications. Part 2. Case study: Forecasting salinity in a river, J. Hydrol., 301(1-4).
Buntine, W. L., and A. S. Weigend (1991), Bayesian back-propagation, Complex Syst., 5(6).
Dawson, C. W., and R. L. Wilby (2001), Hydrological modelling using artificial neural networks, Prog. Phys. Geogr., 25(1).
Duan, Q., S. Sorooshian, and V. Gupta (1992), Effective and efficient global optimization for conceptual rainfall-runoff models, Water Resour. Res., 28(4), 1015-1031.
Duane, S., A. D. Kennedy, B. J. Pendleton, and D. Roweth (1987), Hybrid Monte Carlo, Phys. Lett. B, 195(2), 216-222.
Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin (1995), Bayesian Data Analysis, CRC Press, Boca Raton, Fla.
Haario, H., E. Saksman, and J. Tamminen (2001), An adaptive Metropolis algorithm, Bernoulli, 7(2), 223-242.
Kass, R. E., B. P. Carlin, A. Gelman, and R. M. Neal (1998), Markov chain Monte Carlo in practice: A roundtable discussion, Am. Stat., 52(2), 93-100.
Kuczera, G. (1983), Improved parameter inference in catchment models: 1. Evaluating parameter uncertainty, Water Resour. Res., 19(5).
Kuczera, G., and M. Mroczkowski (1998), Assessment of hydrologic parameter uncertainty and the worth of multiresponse data, Water Resour. Res., 34(6).
Kuczera, G., and E. Parent (1998), Monte Carlo assessment of parameter uncertainty in conceptual catchment models: The Metropolis algorithm, J. Hydrol., 211(1-4), 69-85.
Lampinen, J., and A. Vehtari (2001), Bayesian approach for neural networks: Review and case studies, Neural Networks, 14(3), 257-274.
Lee, H. K. H. (2003), A noninformative prior for neural networks, Mach. Learn., 50(1-2), 197-212.
Lee, P. M. (1989), Bayesian Statistics: An Introduction, Oxford Univ. Press, New York.
MacKay, D. J. C. (1992), A practical Bayesian framework for backpropagation networks, Neural Comput., 4(3), 448-472.
MacKay, D. J. C. (1995), Probable networks and plausible predictions: A review of practical Bayesian methods for supervised neural networks, Network Comput. Neural Syst., 6(3), 469-505.


Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Andrew Gordon Wilson www.cs.cmu.edu/~andrewgw Carnegie Mellon University March 18, 2015 1 / 45 Resources and Attribution Image credits,

More information

Reading Group on Deep Learning Session 1

Reading Group on Deep Learning Session 1 Reading Group on Deep Learning Session 1 Stephane Lathuiliere & Pablo Mesejo 2 June 2016 1/31 Contents Introduction to Artificial Neural Networks to understand, and to be able to efficiently use, the popular

More information

Multimodal Nested Sampling

Multimodal Nested Sampling Multimodal Nested Sampling Farhan Feroz Astrophysics Group, Cavendish Lab, Cambridge Inverse Problems & Cosmology Most obvious example: standard CMB data analysis pipeline But many others: object detection,

More information

Bayesian model selection: methodology, computation and applications

Bayesian model selection: methodology, computation and applications Bayesian model selection: methodology, computation and applications David Nott Department of Statistics and Applied Probability National University of Singapore Statistical Genomics Summer School Program

More information

Markov chain Monte Carlo methods in atmospheric remote sensing

Markov chain Monte Carlo methods in atmospheric remote sensing 1 / 45 Markov chain Monte Carlo methods in atmospheric remote sensing Johanna Tamminen johanna.tamminen@fmi.fi ESA Summer School on Earth System Monitoring and Modeling July 3 Aug 11, 212, Frascati July,

More information

Bagging During Markov Chain Monte Carlo for Smoother Predictions

Bagging During Markov Chain Monte Carlo for Smoother Predictions Bagging During Markov Chain Monte Carlo for Smoother Predictions Herbert K. H. Lee University of California, Santa Cruz Abstract: Making good predictions from noisy data is a challenging problem. Methods

More information

Neural Networks and the Back-propagation Algorithm

Neural Networks and the Back-propagation Algorithm Neural Networks and the Back-propagation Algorithm Francisco S. Melo In these notes, we provide a brief overview of the main concepts concerning neural networks and the back-propagation algorithm. We closely

More information

Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

More information

17 : Markov Chain Monte Carlo

17 : Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models, Spring 2015 17 : Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Heran Lin, Bin Deng, Yun Huang 1 Review of Monte Carlo Methods 1.1 Overview Monte Carlo

More information

arxiv: v1 [stat.co] 23 Apr 2018

arxiv: v1 [stat.co] 23 Apr 2018 Bayesian Updating and Uncertainty Quantification using Sequential Tempered MCMC with the Rank-One Modified Metropolis Algorithm Thomas A. Catanach and James L. Beck arxiv:1804.08738v1 [stat.co] 23 Apr

More information

Lecture 7 and 8: Markov Chain Monte Carlo

Lecture 7 and 8: Markov Chain Monte Carlo Lecture 7 and 8: Markov Chain Monte Carlo 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering University of Cambridge http://mlg.eng.cam.ac.uk/teaching/4f13/ Ghahramani

More information

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University Phylogenetics: Bayesian Phylogenetic Analysis COMP 571 - Spring 2015 Luay Nakhleh, Rice University Bayes Rule P(X = x Y = y) = P(X = x, Y = y) P(Y = y) = P(X = x)p(y = y X = x) P x P(X = x 0 )P(Y = y X

More information

1 Using standard errors when comparing estimated values

1 Using standard errors when comparing estimated values MLPR Assignment Part : General comments Below are comments on some recurring issues I came across when marking the second part of the assignment, which I thought it would help to explain in more detail

More information

Simulated Annealing for Constrained Global Optimization

Simulated Annealing for Constrained Global Optimization Monte Carlo Methods for Computation and Optimization Final Presentation Simulated Annealing for Constrained Global Optimization H. Edwin Romeijn & Robert L.Smith (1994) Presented by Ariel Schwartz Objective

More information

Lecture 4: Perceptrons and Multilayer Perceptrons

Lecture 4: Perceptrons and Multilayer Perceptrons Lecture 4: Perceptrons and Multilayer Perceptrons Cognitive Systems II - Machine Learning SS 2005 Part I: Basic Approaches of Concept Learning Perceptrons, Artificial Neuronal Networks Lecture 4: Perceptrons

More information

MCMC Sampling for Bayesian Inference using L1-type Priors

MCMC Sampling for Bayesian Inference using L1-type Priors MÜNSTER MCMC Sampling for Bayesian Inference using L1-type Priors (what I do whenever the ill-posedness of EEG/MEG is just not frustrating enough!) AG Imaging Seminar Felix Lucka 26.06.2012 , MÜNSTER Sampling

More information

Chapter 6 Problems with the calibration of Gaussian HMMs to annual rainfall

Chapter 6 Problems with the calibration of Gaussian HMMs to annual rainfall 115 Chapter 6 Problems with the calibration of Gaussian HMMs to annual rainfall Hidden Markov models (HMMs) were introduced in Section 3.3 as a method to incorporate climatic persistence into stochastic

More information

The Bias-Variance dilemma of the Monte Carlo. method. Technion - Israel Institute of Technology, Technion City, Haifa 32000, Israel

The Bias-Variance dilemma of the Monte Carlo. method. Technion - Israel Institute of Technology, Technion City, Haifa 32000, Israel The Bias-Variance dilemma of the Monte Carlo method Zlochin Mark 1 and Yoram Baram 1 Technion - Israel Institute of Technology, Technion City, Haifa 32000, Israel fzmark,baramg@cs.technion.ac.il Abstract.

More information

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling Christopher Jennison Department of Mathematical Sciences, University of Bath, UK http://people.bath.ac.uk/mascj Adriana Ibrahim Institute

More information

The connection of dropout and Bayesian statistics

The connection of dropout and Bayesian statistics The connection of dropout and Bayesian statistics Interpretation of dropout as approximate Bayesian modelling of NN http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf Dropout Geoffrey Hinton Google, University

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

Analysis of Fast Input Selection: Application in Time Series Prediction

Analysis of Fast Input Selection: Application in Time Series Prediction Analysis of Fast Input Selection: Application in Time Series Prediction Jarkko Tikka, Amaury Lendasse, and Jaakko Hollmén Helsinki University of Technology, Laboratory of Computer and Information Science,

More information

Bayesian Phylogenetics:

Bayesian Phylogenetics: Bayesian Phylogenetics: an introduction Marc A. Suchard msuchard@ucla.edu UCLA Who is this man? How sure are you? The one true tree? Methods we ve learned so far try to find a single tree that best describes

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Bayesian Backprop in Action: Pruning, Committees, Error Bars and an Application to Spectroscopy

Bayesian Backprop in Action: Pruning, Committees, Error Bars and an Application to Spectroscopy Bayesian Backprop in Action: Pruning, Committees, Error Bars and an Application to Spectroscopy Hans Henrik Thodberg Danish Meat Research Institute Maglegaardsvej 2, DK-4 Roskilde thodberg~nn.meatre.dk

More information

Bayesian Inference for Discretely Sampled Diffusion Processes: A New MCMC Based Approach to Inference

Bayesian Inference for Discretely Sampled Diffusion Processes: A New MCMC Based Approach to Inference Bayesian Inference for Discretely Sampled Diffusion Processes: A New MCMC Based Approach to Inference Osnat Stramer 1 and Matthew Bognar 1 Department of Statistics and Actuarial Science, University of

More information

ECO 513 Fall 2009 C. Sims HIDDEN MARKOV CHAIN MODELS

ECO 513 Fall 2009 C. Sims HIDDEN MARKOV CHAIN MODELS ECO 513 Fall 2009 C. Sims HIDDEN MARKOV CHAIN MODELS 1. THE CLASS OF MODELS y t {y s, s < t} p(y t θ t, {y s, s < t}) θ t = θ(s t ) P[S t = i S t 1 = j] = h ij. 2. WHAT S HANDY ABOUT IT Evaluating the

More information

MODULE -4 BAYEIAN LEARNING

MODULE -4 BAYEIAN LEARNING MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities

More information

Introduction to Bayesian methods in inverse problems

Introduction to Bayesian methods in inverse problems Introduction to Bayesian methods in inverse problems Ville Kolehmainen 1 1 Department of Applied Physics, University of Eastern Finland, Kuopio, Finland March 4 2013 Manchester, UK. Contents Introduction

More information

Learning Gaussian Process Models from Uncertain Data

Learning Gaussian Process Models from Uncertain Data Learning Gaussian Process Models from Uncertain Data Patrick Dallaire, Camille Besse, and Brahim Chaib-draa DAMAS Laboratory, Computer Science & Software Engineering Department, Laval University, Canada

More information

Checking up on the neighbors: Quantifying uncertainty in relative event location

Checking up on the neighbors: Quantifying uncertainty in relative event location Checking up on the neighbors: Quantifying uncertainty in relative event location The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

More information

Doing Bayesian Integrals

Doing Bayesian Integrals ASTR509-13 Doing Bayesian Integrals The Reverend Thomas Bayes (c.1702 1761) Philosopher, theologian, mathematician Presbyterian (non-conformist) minister Tunbridge Wells, UK Elected FRS, perhaps due to

More information

Choosing the Summary Statistics and the Acceptance Rate in Approximate Bayesian Computation

Choosing the Summary Statistics and the Acceptance Rate in Approximate Bayesian Computation Choosing the Summary Statistics and the Acceptance Rate in Approximate Bayesian Computation COMPSTAT 2010 Revised version; August 13, 2010 Michael G.B. Blum 1 Laboratoire TIMC-IMAG, CNRS, UJF Grenoble

More information

Parameter Estimation. William H. Jefferys University of Texas at Austin Parameter Estimation 7/26/05 1

Parameter Estimation. William H. Jefferys University of Texas at Austin Parameter Estimation 7/26/05 1 Parameter Estimation William H. Jefferys University of Texas at Austin bill@bayesrules.net Parameter Estimation 7/26/05 1 Elements of Inference Inference problems contain two indispensable elements: Data

More information

Advanced analysis and modelling tools for spatial environmental data. Case study: indoor radon data in Switzerland

Advanced analysis and modelling tools for spatial environmental data. Case study: indoor radon data in Switzerland EnviroInfo 2004 (Geneva) Sh@ring EnviroInfo 2004 Advanced analysis and modelling tools for spatial environmental data. Case study: indoor radon data in Switzerland Mikhail Kanevski 1, Michel Maignan 1

More information

Bayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007

Bayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007 Bayesian inference Fredrik Ronquist and Peter Beerli October 3, 2007 1 Introduction The last few decades has seen a growing interest in Bayesian inference, an alternative approach to statistical inference.

More information

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Lecture - 27 Multilayer Feedforward Neural networks with Sigmoidal

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

Lecture 5. G. Cowan Lectures on Statistical Data Analysis Lecture 5 page 1

Lecture 5. G. Cowan Lectures on Statistical Data Analysis Lecture 5 page 1 Lecture 5 1 Probability (90 min.) Definition, Bayes theorem, probability densities and their properties, catalogue of pdfs, Monte Carlo 2 Statistical tests (90 min.) general concepts, test statistics,

More information

Algorithmisches Lernen/Machine Learning

Algorithmisches Lernen/Machine Learning Algorithmisches Lernen/Machine Learning Part 1: Stefan Wermter Introduction Connectionist Learning (e.g. Neural Networks) Decision-Trees, Genetic Algorithms Part 2: Norman Hendrich Support-Vector Machines

More information

Bayesian networks: approximate inference

Bayesian networks: approximate inference Bayesian networks: approximate inference Machine Intelligence Thomas D. Nielsen September 2008 Approximative inference September 2008 1 / 25 Motivation Because of the (worst-case) intractability of exact

More information

ST 740: Markov Chain Monte Carlo

ST 740: Markov Chain Monte Carlo ST 740: Markov Chain Monte Carlo Alyson Wilson Department of Statistics North Carolina State University October 14, 2012 A. Wilson (NCSU Stsatistics) MCMC October 14, 2012 1 / 20 Convergence Diagnostics:

More information

On the Optimal Scaling of the Modified Metropolis-Hastings algorithm

On the Optimal Scaling of the Modified Metropolis-Hastings algorithm On the Optimal Scaling of the Modified Metropolis-Hastings algorithm K. M. Zuev & J. L. Beck Division of Engineering and Applied Science California Institute of Technology, MC 4-44, Pasadena, CA 925, USA

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters

More information

Gaussian Processes: We demand rigorously defined areas of uncertainty and doubt

Gaussian Processes: We demand rigorously defined areas of uncertainty and doubt Gaussian Processes: We demand rigorously defined areas of uncertainty and doubt ACS Spring National Meeting. COMP, March 16 th 2016 Matthew Segall, Peter Hunt, Ed Champness matt.segall@optibrium.com Optibrium,

More information

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD WHAT IS A NEURAL NETWORK? The simplest definition of a neural network, more properly referred to as an 'artificial' neural network (ANN), is provided

More information

Machine Learning Techniques for Computer Vision

Machine Learning Techniques for Computer Vision Machine Learning Techniques for Computer Vision Part 2: Unsupervised Learning Microsoft Research Cambridge x 3 1 0.5 0.2 0 0.5 0.3 0 0.5 1 ECCV 2004, Prague x 2 x 1 Overview of Part 2 Mixture models EM

More information

Regression with Input-Dependent Noise: A Bayesian Treatment

Regression with Input-Dependent Noise: A Bayesian Treatment Regression with Input-Dependent oise: A Bayesian Treatment Christopher M. Bishop C.M.BishopGaston.ac.uk Cazhaow S. Qazaz qazazcsgaston.ac.uk eural Computing Research Group Aston University, Birmingham,

More information

Markov chain Monte Carlo

Markov chain Monte Carlo Markov chain Monte Carlo Markov chain Monte Carlo (MCMC) Gibbs and Metropolis Hastings Slice sampling Practical details Iain Murray http://iainmurray.net/ Reminder Need to sample large, non-standard distributions:

More information

Part 8: GLMs and Hierarchical LMs and GLMs

Part 8: GLMs and Hierarchical LMs and GLMs Part 8: GLMs and Hierarchical LMs and GLMs 1 Example: Song sparrow reproductive success Arcese et al., (1992) provide data on a sample from a population of 52 female song sparrows studied over the course

More information

Supplementary Note on Bayesian analysis

Supplementary Note on Bayesian analysis Supplementary Note on Bayesian analysis Structured variability of muscle activations supports the minimal intervention principle of motor control Francisco J. Valero-Cuevas 1,2,3, Madhusudhan Venkadesan

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

Penalized Loss functions for Bayesian Model Choice

Penalized Loss functions for Bayesian Model Choice Penalized Loss functions for Bayesian Model Choice Martyn International Agency for Research on Cancer Lyon, France 13 November 2009 The pure approach For a Bayesian purist, all uncertainty is represented

More information

Multitask Learning of Environmental Spatial Data

Multitask Learning of Environmental Spatial Data 9th International Congress on Environmental Modelling and Software Brigham Young University BYU ScholarsArchive 6th International Congress on Environmental Modelling and Software - Leipzig, Germany - July

More information

Kernel adaptive Sequential Monte Carlo

Kernel adaptive Sequential Monte Carlo Kernel adaptive Sequential Monte Carlo Ingmar Schuster (Paris Dauphine) Heiko Strathmann (University College London) Brooks Paige (Oxford) Dino Sejdinovic (Oxford) December 7, 2015 1 / 36 Section 1 Outline

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix Labor-Supply Shifts and Economic Fluctuations Technical Appendix Yongsung Chang Department of Economics University of Pennsylvania Frank Schorfheide Department of Economics University of Pennsylvania January

More information

Simple closed form formulas for predicting groundwater flow model uncertainty in complex, heterogeneous trending media

Simple closed form formulas for predicting groundwater flow model uncertainty in complex, heterogeneous trending media WATER RESOURCES RESEARCH, VOL. 4,, doi:0.029/2005wr00443, 2005 Simple closed form formulas for predicting groundwater flow model uncertainty in complex, heterogeneous trending media Chuen-Fa Ni and Shu-Guang

More information

Markov Chain Monte Carlo (MCMC) and Model Evaluation. August 15, 2017

Markov Chain Monte Carlo (MCMC) and Model Evaluation. August 15, 2017 Markov Chain Monte Carlo (MCMC) and Model Evaluation August 15, 2017 Frequentist Linking Frequentist and Bayesian Statistics How can we estimate model parameters and what does it imply? Want to find the

More information

DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION. Alexandre Iline, Harri Valpola and Erkki Oja

DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION. Alexandre Iline, Harri Valpola and Erkki Oja DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION Alexandre Iline, Harri Valpola and Erkki Oja Laboratory of Computer and Information Science Helsinki University of Technology P.O.Box

More information

16 : Markov Chain Monte Carlo (MCMC)

16 : Markov Chain Monte Carlo (MCMC) 10-708: Probabilistic Graphical Models 10-708, Spring 2014 16 : Markov Chain Monte Carlo MCMC Lecturer: Matthew Gormley Scribes: Yining Wang, Renato Negrinho 1 Sampling from low-dimensional distributions

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Karl-Rudolf Koch Introduction to Bayesian Statistics Second Edition

Karl-Rudolf Koch Introduction to Bayesian Statistics Second Edition Karl-Rudolf Koch Introduction to Bayesian Statistics Second Edition Karl-Rudolf Koch Introduction to Bayesian Statistics Second, updated and enlarged Edition With 17 Figures Professor Dr.-Ing., Dr.-Ing.

More information

FORECASTING poor air quality events associated with

FORECASTING poor air quality events associated with A Comparison of Bayesian and Conditional Density Models in Probabilistic Ozone Forecasting Song Cai, William W. Hsieh, and Alex J. Cannon Member, INNS Abstract Probabilistic models were developed to provide

More information

Statistical Methods in Particle Physics Lecture 1: Bayesian methods

Statistical Methods in Particle Physics Lecture 1: Bayesian methods Statistical Methods in Particle Physics Lecture 1: Bayesian methods SUSSP65 St Andrews 16 29 August 2009 Glen Cowan Physics Department Royal Holloway, University of London g.cowan@rhul.ac.uk www.pp.rhul.ac.uk/~cowan

More information

STAT 425: Introduction to Bayesian Analysis

STAT 425: Introduction to Bayesian Analysis STAT 425: Introduction to Bayesian Analysis Marina Vannucci Rice University, USA Fall 2017 Marina Vannucci (Rice University, USA) Bayesian Analysis (Part 2) Fall 2017 1 / 19 Part 2: Markov chain Monte

More information

Chapter 5 Identifying hydrological persistence

Chapter 5 Identifying hydrological persistence 103 Chapter 5 Identifying hydrological persistence The previous chapter demonstrated that hydrologic data from across Australia is modulated by fluctuations in global climate modes. Various climate indices

More information

CSC 2541: Bayesian Methods for Machine Learning

CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 3 More Markov Chain Monte Carlo Methods The Metropolis algorithm isn t the only way to do MCMC. We ll

More information

Learning features by contrasting natural images with noise

Learning features by contrasting natural images with noise Learning features by contrasting natural images with noise Michael Gutmann 1 and Aapo Hyvärinen 12 1 Dept. of Computer Science and HIIT, University of Helsinki, P.O. Box 68, FIN-00014 University of Helsinki,

More information

Linear Regression Models

Linear Regression Models Linear Regression Models Model Description and Model Parameters Modelling is a central theme in these notes. The idea is to develop and continuously improve a library of predictive models for hazards,

More information

A hybrid Marquardt-Simulated Annealing method for solving the groundwater inverse problem

A hybrid Marquardt-Simulated Annealing method for solving the groundwater inverse problem Calibration and Reliability in Groundwater Modelling (Proceedings of the ModelCARE 99 Conference held at Zurich, Switzerland, September 1999). IAHS Publ. no. 265, 2000. 157 A hybrid Marquardt-Simulated

More information

Kyle Reing University of Southern California April 18, 2018

Kyle Reing University of Southern California April 18, 2018 Renormalization Group and Information Theory Kyle Reing University of Southern California April 18, 2018 Overview Renormalization Group Overview Information Theoretic Preliminaries Real Space Mutual Information

More information

A limited memory acceleration strategy for MCMC sampling in hierarchical Bayesian calibration of hydrological models

A limited memory acceleration strategy for MCMC sampling in hierarchical Bayesian calibration of hydrological models Click Here for Full Article WATER RESOURCES RESEARCH, VOL. 46,, doi:1.129/29wr8985, 21 A limited memory acceleration strategy for MCMC sampling in hierarchical Bayesian calibration of hydrological models

More information

Bayesian Inference of Noise Levels in Regression

Bayesian Inference of Noise Levels in Regression Bayesian Inference of Noise Levels in Regression Christopher M. Bishop Microsoft Research, 7 J. J. Thomson Avenue, Cambridge, CB FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop

More information

April 20th, Advanced Topics in Machine Learning California Institute of Technology. Markov Chain Monte Carlo for Machine Learning

April 20th, Advanced Topics in Machine Learning California Institute of Technology. Markov Chain Monte Carlo for Machine Learning for for Advanced Topics in California Institute of Technology April 20th, 2017 1 / 50 Table of Contents for 1 2 3 4 2 / 50 History of methods for Enrico Fermi used to calculate incredibly accurate predictions

More information

New Insights into History Matching via Sequential Monte Carlo

New Insights into History Matching via Sequential Monte Carlo New Insights into History Matching via Sequential Monte Carlo Associate Professor Chris Drovandi School of Mathematical Sciences ARC Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS)

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods Tomas McKelvey and Lennart Svensson Signal Processing Group Department of Signals and Systems Chalmers University of Technology, Sweden November 26, 2012 Today s learning

More information

Variational Methods in Bayesian Deconvolution

Variational Methods in Bayesian Deconvolution PHYSTAT, SLAC, Stanford, California, September 8-, Variational Methods in Bayesian Deconvolution K. Zarb Adami Cavendish Laboratory, University of Cambridge, UK This paper gives an introduction to the

More information

Assessing Regime Uncertainty Through Reversible Jump McMC

Assessing Regime Uncertainty Through Reversible Jump McMC Assessing Regime Uncertainty Through Reversible Jump McMC August 14, 2008 1 Introduction Background Research Question 2 The RJMcMC Method McMC RJMcMC Algorithm Dependent Proposals Independent Proposals

More information

Combination of M-Estimators and Neural Network Model to Analyze Inside/Outside Bark Tree Diameters

Combination of M-Estimators and Neural Network Model to Analyze Inside/Outside Bark Tree Diameters Combination of M-Estimators and Neural Network Model to Analyze Inside/Outside Bark Tree Diameters Kyriaki Kitikidou, Elias Milios, Lazaros Iliadis, and Minas Kaymakis Democritus University of Thrace,

More information

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model UNIVERSITY OF TEXAS AT SAN ANTONIO Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model Liang Jing April 2010 1 1 ABSTRACT In this paper, common MCMC algorithms are introduced

More information

The Jackknife-Like Method for Assessing Uncertainty of Point Estimates for Bayesian Estimation in a Finite Gaussian Mixture Model

The Jackknife-Like Method for Assessing Uncertainty of Point Estimates for Bayesian Estimation in a Finite Gaussian Mixture Model Thai Journal of Mathematics : 45 58 Special Issue: Annual Meeting in Mathematics 207 http://thaijmath.in.cmu.ac.th ISSN 686-0209 The Jackknife-Like Method for Assessing Uncertainty of Point Estimates for

More information

16 : Approximate Inference: Markov Chain Monte Carlo

16 : Approximate Inference: Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models 10-708, Spring 2017 16 : Approximate Inference: Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Yuan Yang, Chao-Ming Yen 1 Introduction As the target distribution

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information