Bayesian Sampling and Ensemble Learning in Generative Topographic Mapping


Akio Utsugi
National Institute of Bioscience and Human-Technology, Higashi, Tsukuba, Ibaraki, Japan

Abstract

Generative topographic mapping (GTM) is a statistical model for extracting a hidden smooth manifold from data, like the self-organizing map (SOM). Although a deterministic search algorithm for the hyperparameters regulating the smoothness of the manifold has been proposed previously, it is based on approximations that are valid only on abundant data. Thus, it often fails to obtain suitable estimates on small data. In this paper, to improve the hyperparameter search in GTM, we construct a Gibbs sampler on the model, which generates random sample series following the posteriors on the hyperparameters. Reliable estimates are obtained from the samples. In addition, we obtain another deterministic algorithm using ensemble learning. From the results of an experimental comparison of these algorithms, an efficient method for reliable estimation in GTM is suggested.

1 Introduction

The self-organizing map (SOM) [1] was initially proposed as a minimal model for the formation of topology-preserving maps in brains. Subsequently, it has been used as an information-processing tool for extracting a hidden smooth manifold from data. However, since the SOM is defined as a learning algorithm and has no explicit statistical model of data generation, it is difficult to develop higher-level inference on the model, such as the determination of the optimal hyperparameters and topology. Thus, several authors have proposed statistical generative models whose parameter estimation algorithms are similar to the SOM.

The elastic net [2][3][4][5] is one such generative model. Its learning rule is derived as a parameter estimation algorithm for a model consisting of a mixture of spherical Gaussian generators and a smoothing prior on the centroids of the generators. This prior is also a spherical Gaussian distribution, placed on the variations of the centroids along a predefined topology. The variances of these spherical Gaussian distributions are hyperparameters regulating the smoothness of the hidden manifold. The elastic net has been extended to have a more general Gaussian smoothing prior, and evidence for the hyperparameters and the topology, which is a Bayesian model selection criterion, has been obtained for it [6][7][8].

Generative topographic mapping (GTM) is another SOM-like generative model [9]. It is also based on a mixture of spherical Gaussian generators with a constraint on the centroids. However, the centroids are assumed to be generated as the outputs of a generalized linear network whose inputs are the nodes of a regular grid on a latent space. Evidence for the hyperparameters of GTM is obtained by the same method as used for the generalized elastic net [10]. Subsequently, the framework of GTM was expanded to include models with a general Gaussian smoothing prior on the centroids. In the present paper, we mainly deal with this type of GTM model.

Evidence for hyperparameters is strictly defined as the marginal likelihood of the hyperparameters, which is obtained by integrating the other parameters out of the joint likelihood. In GTM, we need to integrate out the centroid parameters to obtain the evidence. However, this integral is difficult to calculate exactly, so it is approximated using the Laplace method [11][12][13]. Furthermore, in order to obtain a fast search algorithm for the optimal hyperparameters, which are given by the maximizer of the evidence, we need the derivatives of the evidence with respect to the hyperparameters. These derivatives are obtained using additional approximations. The validity of these approximations depends on the data size and the signal-to-noise ratio. Indeed, a simulation experiment shows that the hyperparameter search algorithm fails to obtain suitable estimates when the data size is small and the noise level is high [6].

In this paper, we try to improve the hyperparameter search of GTM on small data using a Gibbs sampler. The Gibbs sampler [13] generates random sample series of the parameters and the hyperparameters following their posteriors. In the limit of a long series, the sample averages approach the posterior means, which are the exact Bayesian estimates. We can also evaluate the confidence of the estimates using the estimated posterior variances; in fact, we can obtain estimates of the entire posterior distributions from the histograms of the samples. While the Gibbs sampler provides reliable and rich estimation on small data, it is rather time-consuming on large data because it requires the generation of long sample series. Thus, we also need a deterministic algorithm that produces the estimates quickly.

We obtain another deterministic algorithm for the hyperparameter search of GTM using the ensemble learning method [14][15]. While this algorithm is similar to the previous deterministic algorithm, it is based on a more straightforward approximation assumption. In addition, ensemble learning is considered more stable than the previous deterministic algorithm, because it minimizes the variational free energy of the model, which gives an upper bound on the negative log evidence. To evaluate the validity of the approximations in the deterministic algorithms, we compare the estimates produced by those algorithms with the outcome of the Gibbs sampler in a simulation experiment. The experiment shows the superiority of the Gibbs sampler on small data and the validity of the deterministic algorithms on large data. From this result, a policy for algorithm selection in GTM is suggested.

2 Generative topographic mapping

GTM has two versions: an original regression version and a Gaussian process version. In this paper we focus on the latter; the former is mentioned briefly in the discussion. The Gaussian process version of GTM consists of a spherical Gaussian mixture density and a Gaussian process prior. The spherical Gaussian mixture model assumes that each data point $x_i = (x_{i1},\ldots,x_{im})' \in \mathbb{R}^m$ is generated from one of $r$ spherical Gaussian generators, which correspond to the inner units of a neural network. The density of a data set $X = \{x_1,\ldots,x_n\}$ is given by

$$f(X \mid Y, W, \beta) = \left(\frac{\beta}{2\pi}\right)^{nm/2} \prod_{i=1}^{n}\prod_{k=1}^{r} \left\{\exp\left(-\frac{\beta}{2}\|x_i - w_k\|^2\right)\right\}^{y_{ik}}, \tag{2.1}$$

where the $y_{ik}$ in $Y$ are binary membership variables with constraints $\sum_{k=1}^{r} y_{ik} = 1$, and the $w_k = (w_{k1},\ldots,w_{km})'$ in $W$ are the centroids of the spherical Gaussian generators. By integrating out $Y$ from the product of this density and a multinomial prior

$$f(Y) = \prod_{i=1}^{n}\prod_{k=1}^{r} \left(\frac{1}{r}\right)^{y_{ik}} = r^{-n}, \tag{2.2}$$

we obtain a spherical Gaussian mixture density

$$f(X \mid W, \beta) = \left(\frac{\beta}{2\pi}\right)^{nm/2} r^{-n} \prod_{i=1}^{n}\sum_{k=1}^{r} \exp\left(-\frac{\beta}{2}\|x_i - w_k\|^2\right). \tag{2.3}$$
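For concreteness, the mixture density (2.3) can be evaluated in a few lines of numpy. The sketch below is only an illustration of the formula, not the author's code; the names (`gtm_log_likelihood`, `X`, `W`, `beta`) are ours, and a log-sum-exp is used for numerical stability.

```python
import numpy as np

def gtm_log_likelihood(X, W, beta):
    """Log of the spherical Gaussian mixture density (2.3).

    X : (n, m) data matrix, W : (r, m) centroid matrix, beta : inverse variance.
    """
    n, m = X.shape
    r = W.shape[0]
    # Squared distances ||x_i - w_k||^2 for all pairs (i, k).
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)          # (n, r)
    a = -0.5 * beta * d2
    # log sum_k exp(a_ik), computed stably for each data point.
    a_max = a.max(axis=1, keepdims=True)
    lse = a_max[:, 0] + np.log(np.exp(a - a_max).sum(axis=1))
    return 0.5 * n * m * np.log(beta / (2 * np.pi)) - n * np.log(r) + lse.sum()
```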

Next, we assume that the centroids are regular samples of a smooth Gaussian process over a latent space. In other words, $W$ has a Gaussian prior

$$f(W \mid h) = (2\pi)^{-rm/2}\, |M|^{-m/2} \prod_{j=1}^{m} \exp\left(-\frac{1}{2}\, w_{(j)}' M^{-1} w_{(j)}\right), \tag{2.4}$$

where $w_{(j)} = (w_{1j},\ldots,w_{rj})'$, and the covariance in $M$ decreases as the distance between the corresponding pair of inner units over the latent space increases. In the next section, we use this general form of Gaussian process prior to obtain a Gibbs sampler for $W$. In fact, there are various ways to construct $M$ from the hyperparameters in $h$; we will choose a specific form of $M$ that yields a simple Gibbs sampler for $h$.

The Bayesian inference of $W$ is based on its posterior $f(W \mid X, h)$, where $h$ includes $\beta$. Although this posterior is given by Bayes' theorem,

$$f(W \mid X, h) \propto f(X, W \mid h) = f(X \mid W, \beta)\, f(W \mid h), \tag{2.5}$$

the posterior mean, as an estimate of $W$, is difficult to calculate. Instead, we can obtain the maximum a posteriori (MAP) estimate $\hat{W}$, which is the maximizer of the posterior, through an expectation-maximization (EM) algorithm [7][10].

The inference of $h$ is based on its evidence $f(X \mid h)$. The maximizer of the evidence is called the generalized maximum likelihood (GML) estimate of $h$. Note that the evidence is proportional to the posterior on $h$, $f(h \mid X)$, if we adopt a flat prior on $h$; in this case the GML and MAP estimates of $h$ are identical. Although the evidence is obtained by integrating out $W$ from $f(X, W \mid h)$, the integral is difficult to calculate exactly. Using the Laplace method [11][12][13], we can obtain a computable approximate expression of the evidence [7][10]. In this method, the integration is performed by quadratically approximating the logarithm of the integrand around $\hat{W}$; in other words, the integrand is approximated by the nearest Gaussian function. Furthermore, a fast hyperparameter search algorithm is obtained using approximate derivatives of the evidence with respect to $h$ [6][10]. In this approximation, the dependence of $\hat{W}$ and of the posterior selection probabilities of the inner units on $h$ is neglected.

As mentioned in the introduction, the approximations behind the hyperparameter search algorithm are valid only on abundant data. The algorithm often fails to extract a suitable structure from small data. In order to enhance the capability of GTM, the hyperparameter search is improved using a Gibbs sampler in the following section.
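To make the role of the Laplace method concrete, the generic form of the approximation is the following; this is the standard second-order expansion written in our notation, not an expression quoted from the paper:

$$\log f(X \mid h) = \log \int f(X, W \mid h)\, dW \;\approx\; \log f(X, \hat{W} \mid h) + \frac{rm}{2}\log 2\pi - \frac{1}{2}\log\bigl|A(\hat{W})\bigr|, \qquad A(W) = -\nabla_W \nabla_W \log f(X, W \mid h),$$

so the evidence is approximated by the height of the integrand at the MAP estimate times the volume of the Gaussian fitted to its curvature there, with $A$ the $rm \times rm$ Hessian. The additional approximations mentioned above then treat $\hat{W}$ and the selection probabilities as constants when this expression is differentiated with respect to $h$.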

3 Gibbs sampler in GTM

In the preceding section, we described a method that estimates the parameters and the hyperparameters by the modes of their posteriors with the help of several approximations. Another method of inference is to use random samples following the posteriors: any moment of the posteriors can be obtained precisely as an average over a sufficiently long sample series. Markov chain Monte Carlo (MCMC) [13] provides easier devices for generating such random samples than direct generation from the posteriors.

The Metropolis-Hastings algorithm is the most general MCMC method. At each step of this algorithm, a trial sample is generated from a trial distribution, and its adoption is then determined stochastically by the improvement ratio of the posterior. Since the posterior has to be calculated at each step, an efficient procedure for this calculation is required. Moreover, the efficiency of the algorithm depends on the design of a fine trial distribution.

The Gibbs sampler is another MCMC method. In contrast to the Metropolis-Hastings algorithm, it does not need the design of a trial distribution. In fact, the Gibbs sampler can be regarded as a special kind of Metropolis-Hastings algorithm whose trials are always adopted. In the Gibbs sampler, the variables of a model are divided into groups, and the conditional posterior on each group given the other variables is obtained. The algorithm is started by setting appropriate initial values in the conditions of the conditional posteriors. Then, a sample is generated from one conditional posterior and this sample is set into the conditions of the other conditional posteriors. Such sampling and setting is iterated among the groups. After the process reaches its stationary state, the samples follow their (unconditional) posteriors. In the following two subsections, we obtain all the conditional posteriors for the Gibbs sampler of GTM: $f(Y \mid X, W, h)$, $f(W \mid X, Y, h)$ and $f(h \mid X, Y, W)$.

3.1 Conditional posteriors on Y and W

The conditional posterior on $Y$ has already been obtained in the previous papers [6][7]:

$$f(Y \mid X, W, \beta) = \prod_{i=1}^{n}\prod_{k=1}^{r} p_{ik}^{\,y_{ik}}, \tag{3.1}$$

where

$$p_{ik} = \frac{\exp\left(-\frac{\beta}{2}\|x_i - w_k\|^2\right)}{\sum_{k'=1}^{r}\exp\left(-\frac{\beta}{2}\|x_i - w_{k'}\|^2\right)} \tag{3.2}$$

are the posterior selection probabilities of the inner units.
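Sampling from (3.1) amounts to drawing, for each data point, one inner unit from the categorical distribution (3.2). A minimal numpy sketch, assuming the same `X`, `W`, `beta` variables as before (our naming, not the paper's):

```python
import numpy as np

def sample_Y(X, W, beta, rng):
    """Draw Y ~ f(Y | X, W, beta) of (3.1); returns a one-hot matrix of shape (n, r)."""
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)      # ||x_i - w_k||^2
    a = -0.5 * beta * d2
    a -= a.max(axis=1, keepdims=True)                            # numerical stability
    p = np.exp(a)
    p /= p.sum(axis=1, keepdims=True)                            # selection probabilities (3.2)
    n, r = p.shape
    ks = np.array([rng.choice(r, p=p[i]) for i in range(n)])     # one unit per data point
    Y = np.zeros((n, r))
    Y[np.arange(n), ks] = 1.0
    return Y
```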

The conditional posterior on $W$ is obtained by normalizing $f(X, Y, W \mid h)$, the product of (2.1), (2.2) and (2.4). It is a product of Gaussian densities,

$$f(W \mid X, Y, h) = \prod_{j=1}^{m} N(w_{(j)} \mid \mu_{(j)}, \Sigma). \tag{3.3}$$

The common covariance matrix is

$$\Sigma = (\beta N + M^{-1})^{-1} \tag{3.4}$$

and the means are

$$\mu_{(j)} = \beta\, \Sigma\, s_{(j)}, \tag{3.5}$$

where

$$N = \operatorname{diag}(n_1,\ldots,n_r) = \sum_{i=1}^{n} \operatorname{diag}(y_{i1},\ldots,y_{ir}) \tag{3.6}$$

and

$$s_{(j)} = (s_{1j},\ldots,s_{rj})' = \sum_{i=1}^{n} x_{ij}\,(y_{i1},\ldots,y_{ir})', \tag{3.7}$$

that is, $n_k$ is the number of data points belonging to the $k$th inner unit, and $s_k = (s_{k1},\ldots,s_{km})'$ is the sum of the data points belonging to it.

3.2 Conditional posteriors on the hyperparameters

Bishop et al. [10] suggest a Gaussian process prior with a Gaussian-shaped covariance function, that is, entries of $M$ of the form

$$m_{ij} \propto \exp\left(-\frac{1}{2\lambda^2}\|u_i - u_j\|^2\right), \tag{3.8}$$

where $u_i$ is the position of the $i$th inner unit over the latent space. An MCMC algorithm for the hyperparameter $\lambda$ may be constructed using the Metropolis-Hastings method. However, it requires a heavy calculation of the posterior at each step and the design of a fine trial distribution. From the standpoint of the availability of the Gibbs sampler, we instead choose a Gaussian process prior with

$$M^{-1} = \alpha D'D + \xi E'E, \tag{3.9}$$

where $D$ is a discretized Laplacian and $E$ is a matrix whose rows are the orthonormal basis vectors of the linear null-space of $D$ [6]. For a one-dimensional latent space, the entries of $D$ are

$$d_{ij} = \begin{cases} 1 & j = i \text{ or } j = i+2,\\ -2 & j = i+1,\\ 0 & \text{otherwise}, \end{cases} \qquad (i = 1,\ldots,r-2;\; j = 1,\ldots,r), \tag{3.10}$$

and $E$ consists of a constant vector and a linear-trend vector. For a two-dimensional latent space, $D$ is constructed from the distortion of a thin plate and $E$ consists of a constant vector and two linear-trend vectors [8].
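The following sketch builds $D$ and $E$ for a one-dimensional latent space and draws $W$ from (3.3)-(3.7). It is our illustration of the formulas rather than the author's implementation; the helper names are ours, and $E$ is obtained by orthonormalizing the constant and linear-trend vectors, which span the null space of the second-difference operator as assumed in (3.9).

```python
import numpy as np

def laplacian_1d(r):
    """Second-difference matrix D of (3.10), shape (r-2, r), rows (1, -2, 1)."""
    D = np.zeros((r - 2, r))
    for i in range(r - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    return D

def null_basis_1d(r):
    """E of (3.9): orthonormal basis of the null space of D (constant + linear trend)."""
    ones = np.ones(r)
    trend = np.arange(r, dtype=float)
    Q, _ = np.linalg.qr(np.stack([ones, trend], axis=1))     # orthonormalize the columns
    return Q.T                                               # rows are basis vectors, (2, r)

def sample_W(X, Y, alpha, beta, xi, D, E, rng):
    """Draw W ~ f(W | X, Y, h), equations (3.3)-(3.7); returns an (r, m) matrix."""
    N = np.diag(Y.sum(axis=0))                  # (3.6): counts per inner unit
    S = Y.T @ X                                 # (3.7): columns are s_(j), shape (r, m)
    M_inv = alpha * D.T @ D + xi * E.T @ E      # (3.9)
    Sigma = np.linalg.inv(beta * N + M_inv)     # (3.4)
    Mu = beta * Sigma @ S                       # (3.5): one column per data dimension
    L = np.linalg.cholesky(Sigma)
    r, m = Mu.shape
    return Mu + L @ rng.standard_normal((r, m))  # each column w_(j) ~ N(mu_(j), Sigma)
```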

When the hyperparameters $\alpha$ and $\xi$ are positive, $M$ is positive definite; thus the Gaussian process prior (2.4) is proper and can be written as

$$f(W \mid \alpha, \xi) = (2\pi)^{-rm/2}\, \alpha^{lm/2}\, \xi^{(r-l)m/2}\, |D'D|_{+}^{m/2} \prod_{j=1}^{m} \exp\left\{-\frac{1}{2}\left(\alpha\|Dw_{(j)}\|^2 + \xi\|Ew_{(j)}\|^2\right)\right\}, \tag{3.11}$$

where $|D'D|_{+}$ is the product of the positive eigenvalues of $D'D$ and $l = \operatorname{rank}(D'D)$, which is given by $l = r - t - 1$ for a $t$-dimensional latent space. The logarithm of this prior corresponds to a discretized Laplacian regularizer, which is used in discretized Laplacian smoothing [7], if $\xi \to 0$. The $\xi$ term is introduced to make the prior proper. Although other methods of making the prior proper can be considered, ours has the advantage of leading to a simple gamma posterior on $\alpha$.

To obtain the posteriors on the hyperparameters, we consider hyper-priors on $\alpha$ and $\beta$:

$$f(\alpha \mid d_\alpha, s_\alpha) = G(\alpha \mid d_\alpha, s_\alpha), \tag{3.12}$$
$$f(\beta \mid d_\beta, s_\beta) = G(\beta \mid d_\beta, s_\beta), \tag{3.13}$$

where $G$ is the gamma density function

$$G(x \mid d, s) = \frac{s^{d} x^{d-1}}{\Gamma(d)} \exp(-sx), \tag{3.14}$$

whose mean is $d/s$. The other hyperparameters $\xi, d_\alpha, s_\alpha, d_\beta, s_\beta$ are fixed to appropriate values using prior knowledge. Alternatively, we can use non-informative priors given by $\xi \to 0$, $d_\alpha = d_\beta = 0$, $s_\alpha, s_\beta \to 0$. Hereafter we represent a model structure with a set of the fixed hyperparameters as $H$.

The conditional posteriors on $\alpha$ and $\beta$ are obtained by normalizing $f(X, Y, W, \alpha, \beta \mid H)$, the product of (2.1), (2.2), (3.11), (3.12) and (3.13):

$$f(\alpha \mid X, Y, W, H) = G(\alpha \mid \tilde{d}_\alpha, \tilde{s}_\alpha), \tag{3.15}$$
$$f(\beta \mid X, Y, W, H) = G(\beta \mid \tilde{d}_\beta, \tilde{s}_\beta), \tag{3.16}$$

where

$$\tilde{d}_\alpha = \frac{ml}{2} + d_\alpha, \tag{3.17}$$
$$\tilde{s}_\alpha = \frac{1}{2}\sum_{j=1}^{m}\|Dw_{(j)}\|^2 + s_\alpha, \tag{3.18}$$
$$\tilde{d}_\beta = \frac{nm}{2} + d_\beta \tag{3.19}$$

and

$$\tilde{s}_\beta = \frac{1}{2}\sum_{i=1}^{n}\sum_{k=1}^{r} y_{ik}\,\|x_i - w_k\|^2 + s_\beta. \tag{3.20}$$

We have now obtained all the conditional posteriors for the Gibbs sampler of GTM.
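Given $Y$ and $W$, drawing $\alpha$ and $\beta$ from (3.15)-(3.20) needs nothing more than numpy's gamma generator (parameterized by shape and scale, i.e. $1/s$). Below is a sketch of these draws and of one complete Gibbs sweep, reusing the hypothetical helpers `sample_Y` and `sample_W` defined above; all names and the default non-informative hyper-priors are our choices.

```python
import numpy as np

def sample_hyperparameters(X, Y, W, D, d_a=0.0, s_a=0.0, d_b=0.0, s_b=0.0, rng=None):
    """Draw alpha and beta from the gamma conditionals (3.15)-(3.20).

    The defaults correspond to the non-informative hyper-priors d = 0, s -> 0.
    """
    n, m = X.shape
    l = np.linalg.matrix_rank(D.T @ D)                          # l = rank(D'D)
    d_a_post = 0.5 * m * l + d_a                                # (3.17)
    s_a_post = 0.5 * np.sum((D @ W) ** 2) + s_a                 # (3.18)
    d_b_post = 0.5 * n * m + d_b                                # (3.19)
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)     # ||x_i - w_k||^2
    s_b_post = 0.5 * np.sum(Y * d2) + s_b                       # (3.20)
    alpha = rng.gamma(d_a_post, 1.0 / s_a_post)                 # numpy gamma uses shape, scale
    beta = rng.gamma(d_b_post, 1.0 / s_b_post)
    return alpha, beta

def gibbs_sweep(X, W, alpha, beta, xi, D, E, rng):
    """One full Gibbs iteration over (Y, W, alpha, beta)."""
    Y = sample_Y(X, W, beta, rng)                               # hypothetical helper above
    W = sample_W(X, Y, alpha, beta, xi, D, E, rng)              # hypothetical helper above
    alpha, beta = sample_hyperparameters(X, Y, W, D, rng=rng)
    return Y, W, alpha, beta
```

Running many such sweeps and, after a burn-in period, averaging the retained samples of log alpha gives the posterior-mean estimates compared in Section 5.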

3.3 Reduction of computational load

The computational bottleneck of our Gibbs sampler lies in the inversion of the $r \times r$ matrix $\beta N + M^{-1} = \beta N + \alpha D'D + \xi E'E$ in (3.4). This inversion generally needs $O(r^3)$ operations, and $r$ becomes large when fine structures are to be obtained. If we use a partially improper prior given by $\xi \to 0$, this matrix becomes a sparse matrix $K = \beta N + \alpha D'D$; in particular, for a one-dimensional latent space, $K$ is a five-banded matrix, so the inversion of $K$ needs only $O(r)$ operations. If we can fix the topology of the latent space, this strategy is recommended. However, if we wish to compare different topologies, $\xi$ must be positive, because the partially improper prior is not available for model selection [6]. In this case, we can reduce the load of the inversion using the matrix-inversion lemma:

$$(K + \xi E'E)^{-1} = K^{-1} - K^{-1}E'\,(EK^{-1}E' + \xi^{-1}I)^{-1}\,EK^{-1}. \tag{3.21}$$

Note that the matrix in parentheses on the right-hand side has the very small size $(r-l)\times(r-l)$.
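A direct transcription of (3.21) is sketched below, assuming $K$ is already cheap to solve against; the function name is ours, and the dense calls stand in for whatever banded or sparse solver is actually used.

```python
import numpy as np

def inv_K_plus_ridge(K, E, xi):
    """(K + xi * E'E)^{-1} via the matrix-inversion lemma (3.21).

    K : (r, r) symmetric part beta*N + alpha*D'D, E : (r-l, r) null-space basis.
    """
    K_inv_Et = np.linalg.solve(K, E.T)                      # K^{-1} E', shape (r, r-l)
    small = E @ K_inv_Et + np.eye(E.shape[0]) / xi          # E K^{-1} E' + xi^{-1} I, tiny
    # K is symmetric, so (K^{-1} E')' = E K^{-1}.
    correction = K_inv_Et @ np.linalg.solve(small, K_inv_Et.T)
    return np.linalg.inv(K) - correction
```

In practice one would keep $K$ in banded form and use a banded solver (for example scipy.linalg.solve_banded) for the two solves; the dense np.linalg.inv(K) here is only for brevity.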

4 Ensemble learning in GTM

Ensemble learning [14][15] is a deterministic method for obtaining estimates of the parameters and hyperparameters concurrently. We consider an approximating ensemble density $Q(Y, W, \alpha, \beta)$ and its variational free energy on a model $H$:

$$F(Q \mid H) = \int Q(Y, W, \alpha, \beta)\, \log\frac{Q(Y, W, \alpha, \beta)}{f(X, Y, W, \alpha, \beta \mid H)}\, dY\, dW\, d\alpha\, d\beta. \tag{4.1}$$

This functional is minimized by the joint posterior $Q(Y, W, \alpha, \beta) = f(Y, W, \alpha, \beta \mid X, H)$, and the minimum is equal to the negative log evidence of the model, $-\log f(X \mid H)$. This variational problem is generally difficult to solve, so an approximate solution is obtained by restricting $Q$ to a specific form. For example, if we restrict $Q$ to the factorial form $Q(Y, W, \alpha, \beta) = Q(Y)Q(W)Q(\alpha)Q(\beta)$, we obtain a straightforward algorithm for the minimization of $F$. The optimization procedure is as follows:

1. Initial densities are set for the partial ensembles $Q(Y)$, $Q(W)$, $Q(\alpha)$ and $Q(\beta)$.
2. From the present densities $Q(W)$, $Q(\alpha)$ and $Q(\beta)$, a new density $Q(Y)$ is obtained by
$$Q(Y) \propto \exp\left\{\int Q(W)Q(\alpha)Q(\beta)\, \log f(X, Y, W, \alpha, \beta \mid H)\, dW\, d\alpha\, d\beta\right\}. \tag{4.2}$$
3. Each of the other partial ensembles is updated using the same formula as (4.2), except that $Y$ and the target variable are exchanged.
4. These updates of the partial ensembles are repeated until a convergence condition is satisfied.

Each partial ensemble has the same parametric form as the corresponding conditional posterior, so the updates of the partial ensembles reduce to updates of their parameters. Using (4.2), we obtain the update formula for the partial ensemble on $Y$:

$$Q(Y) = \prod_{i=1}^{n}\prod_{k=1}^{r} \bar{p}_{ik}^{\,y_{ik}}, \tag{4.3}$$

where

$$\bar{p}_{ik} = \frac{\exp\left\{-\frac{\bar{\beta}}{2}\left(\|x_i - \bar{w}_k\|^2 + m\bar{\sigma}_k^2\right)\right\}}{\sum_{k'=1}^{r}\exp\left\{-\frac{\bar{\beta}}{2}\left(\|x_i - \bar{w}_{k'}\|^2 + m\bar{\sigma}_{k'}^2\right)\right\}}, \tag{4.4}$$

$\bar{w}_k$ and $\bar{\beta}$ are the partial-ensemble means of $w_k$ and $\beta$, respectively, and $\bar{\sigma}_k^2$ is the $k$th diagonal entry of $\bar{\Sigma}$, the partial-ensemble covariance matrix of $w_k$. The update formula for the partial ensemble on $W$ is

$$Q(W) = \prod_{j=1}^{m} N(w_{(j)} \mid \bar{w}_{(j)}, \bar{\Sigma}), \tag{4.5}$$

where

$$\bar{\Sigma} = (\bar{\beta}\bar{N} + \bar{\alpha}D'D + \xi E'E)^{-1}, \tag{4.6}$$
$$\bar{w}_{(j)} = \bar{\beta}\,\bar{\Sigma}\,\bar{s}_{(j)}, \tag{4.7}$$
$$\bar{N} = \sum_{i=1}^{n} \operatorname{diag}(\bar{p}_{i1},\ldots,\bar{p}_{ir}) \tag{4.8}$$

and

$$\bar{s}_{(j)} = \sum_{i=1}^{n} x_{ij}\,(\bar{p}_{i1},\ldots,\bar{p}_{ir})'. \tag{4.9}$$

Finally, the update formulae for the partial ensembles on the hyperparameters are

$$Q(\alpha) = G(\alpha \mid \tilde{d}_\alpha, \bar{s}_\alpha), \tag{4.10}$$
$$Q(\beta) = G(\beta \mid \tilde{d}_\beta, \bar{s}_\beta), \tag{4.11}$$

where

$$\bar{s}_\alpha = \frac{1}{2}\sum_{j=1}^{m}\|D\bar{w}_{(j)}\|^2 + \frac{m}{2}\operatorname{tr}(D'D\,\bar{\Sigma}) + s_\alpha, \tag{4.12}$$
$$\bar{s}_\beta = \frac{1}{2}\sum_{i=1}^{n}\sum_{k=1}^{r} \bar{p}_{ik}\,\|x_i - \bar{w}_k\|^2 + \frac{m}{2}\operatorname{tr}(\bar{N}\bar{\Sigma}) + s_\beta. \tag{4.13}$$

From these, the partial-ensemble means of the hyperparameters are given by

$$\bar{\alpha} = \tilde{d}_\alpha / \bar{s}_\alpha, \tag{4.14}$$
$$\bar{\beta} = \tilde{d}_\beta / \bar{s}_\beta. \tag{4.15}$$
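One pass of the updates (4.3)-(4.15) can be written directly in numpy. As before, this is a sketch in our own notation (the matrices D and E are the hypothetical ones built by laplacian_1d and null_basis_1d above), not the author's implementation; the non-informative hyper-priors are taken as defaults.

```python
import numpy as np

def ensemble_sweep(X, W_bar, Sigma_bar, alpha_bar, beta_bar, xi, D, E,
                   d_a=0.0, s_a=0.0, d_b=0.0, s_b=0.0):
    """One ensemble-learning pass over Q(Y), Q(W), Q(alpha), Q(beta), eqs (4.3)-(4.15)."""
    n, m = X.shape
    l = np.linalg.matrix_rank(D.T @ D)
    # --- Q(Y): responsibilities (4.4), with the variance correction m * sigma_k^2.
    d2 = ((X[:, None, :] - W_bar[None, :, :]) ** 2).sum(axis=2)
    a = -0.5 * beta_bar * (d2 + m * np.diag(Sigma_bar)[None, :])
    a -= a.max(axis=1, keepdims=True)
    P = np.exp(a)
    P /= P.sum(axis=1, keepdims=True)
    # --- Q(W): (4.5)-(4.9).
    N_bar = np.diag(P.sum(axis=0))                                          # (4.8)
    S_bar = P.T @ X                                                         # (4.9)
    Sigma_bar = np.linalg.inv(beta_bar * N_bar + alpha_bar * D.T @ D + xi * E.T @ E)  # (4.6)
    W_bar = beta_bar * Sigma_bar @ S_bar                                    # (4.7)
    # --- Q(alpha), Q(beta): (4.10)-(4.15).
    d_a_t = 0.5 * m * l + d_a
    d_b_t = 0.5 * n * m + d_b
    s_a_bar = 0.5 * np.sum((D @ W_bar) ** 2) + 0.5 * m * np.trace(D.T @ D @ Sigma_bar) + s_a
    d2 = ((X[:, None, :] - W_bar[None, :, :]) ** 2).sum(axis=2)
    s_b_bar = 0.5 * np.sum(P * d2) + 0.5 * m * np.trace(N_bar @ Sigma_bar) + s_b
    alpha_bar = d_a_t / s_a_bar                                             # (4.14)
    beta_bar = d_b_t / s_b_bar                                              # (4.15)
    return W_bar, Sigma_bar, alpha_bar, beta_bar, P
```

Iterating this pass until alpha_bar and beta_bar stop changing gives the deterministic estimates that are compared with the Gibbs sampler in the next section.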

5 Simulations

We compare three algorithms in simulations: the previous deterministic algorithm of [6], the ensemble learning, and the Gibbs sampler. However, since the two deterministic algorithms yield similar estimates, they are represented by the ensemble learning in the following.

Artificial data $x_i = (x_{i1}, x_{i2})'$, $i = 1,\ldots,n$, are generated from two independent standard Gaussian random series $\{e_{i1}\}$ and $\{e_{i2}\}$ by

$$x_{i1} = 4(i-1)/n + \sigma e_{i1}, \tag{5.1}$$
$$x_{i2} = \sin[\pi(i-1)/n] + \sigma e_{i2}. \tag{5.2}$$

We use three noise levels, $\sigma = 0.3, 0.4, 0.5$, and two data sizes. Under each data condition, 5 different data sets are prepared. Figure 1 shows examples of the data sets together with the extracted manifolds.
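A sketch of the data-generating process (5.1)-(5.2) follows; the index shift inside the parentheses is our reading of the formulas, so the (i-1)/n ramp below should be treated as an assumption rather than the exact recipe.

```python
import numpy as np

def make_data(n, sigma, rng):
    """Artificial data along a noisy sine curve, following (5.1)-(5.2)."""
    t = np.arange(n) / n                                      # (i - 1)/n for i = 1, ..., n
    x1 = 4.0 * t + sigma * rng.standard_normal(n)             # (5.1)
    x2 = np.sin(np.pi * t) + sigma * rng.standard_normal(n)   # (5.2)
    return np.stack([x1, x2], axis=1)                         # data matrix of shape (n, 2)
```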

Figure 1: Examples of the data sets (σ = 0.3 and σ = 0.5) and the extracted manifolds. The dots are the data points. The squares and circles are the centroids estimated by the ensemble learning and the Gibbs sampler, respectively. The centroids are linked along the manifolds.

The GTM model has $r$ inner units over a one-dimensional latent space ($t = 1$) and uses the non-informative hyper-priors mentioned in Section 3.2. The initial values of $W$ and $\beta$ are obtained by the PCA initialization method [6], and $\alpha$ is initially set to 3. The deterministic algorithms are terminated when the relative variations of $\alpha$ and $\beta$ fall below $10^{-4}$, when $\alpha$ exceeds its upper limit, or when the number of iterations exceeds a preset maximum; the average number of iterations is 565. The Gibbs sampler, on the other hand, is always run for a fixed number of iterations. In all the algorithms, $\alpha$ is restricted to an upper limit, because the centroids show no visible change for $\alpha$ beyond it. On the larger data size, the ensemble learning and the Gibbs sampler take 4 and 9 seconds per session, respectively, on a 333 MHz processor. However, the fixed number of iterations in the Gibbs sampler is chosen arbitrarily and may be too large in comparison with the ensemble learning.

Figure 2 shows the histograms of the estimates of $\log\alpha$. The estimates by the Gibbs sampler are the posterior means, estimated by the means of the Gibbs samples. The discrepancy between the deterministic algorithms and the Gibbs sampler increases as the data condition worsens. Although the estimates by the deterministic algorithms stay at values consistent with the Gibbs sampler under the good conditions, they diverge abruptly as the noise level grows or the data size is reduced. At the maximum value of $\alpha$, the estimated centroids are arranged regularly on a straight line, as shown in Figure 1; this is a bias of the deterministic algorithms toward the simplest form. The estimates by the Gibbs sampler, on the other hand, vary more smoothly as the data condition changes. Unlike $\alpha$, the estimates of $\beta$ are similar for all the methods.

We can also observe the shape of the posterior distribution using the Gibbs samples.

Figure 2: Histograms of the estimates of log α by the ensemble learning (black) and the Gibbs sampler (white), for the three noise levels (σ = 0.3, 0.4, 0.5) and the two data sizes.

Figure 3 shows the estimated posterior distributions of $\log\alpha$, obtained as histograms of the Gibbs samples. This figure suggests that the posterior mean over the restricted range is rather inappropriate as a point estimate of the hyperparameter when the distribution is broad or multimodal. Instead, we can use the posterior mode. Figure 4 shows the histograms of the estimated posterior modes. Unlike the posterior means in Figure 2, some of these histograms are multimodal. However, they are dominated by values well below the maximum, while the estimates of the deterministic algorithms reach the maximum in many cases (Figure 2).

The result of our experiment shows the limitation of the deterministic algorithms for the hyperparameter search on small data and the superiority of the Gibbs sampler. However, the deterministic algorithms are attractive for their fast convergence, particularly on large data. Figure 2 shows that when an estimate of $\alpha$ by the deterministic algorithms stays at a finite value, we can consider the estimate reliable. From this fact, an efficient method to obtain reliable estimates in GTM is suggested: we first employ a deterministic algorithm to obtain the estimates quickly, and then, if the estimate of $\alpha$ diverges, we call in the Gibbs sampler.
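Estimating the posterior mode from the Gibbs output is simply a matter of histogramming the sampled log alpha values and taking the centre of the highest bin. A minimal sketch, in which the bin width and burn-in length are arbitrary choices of ours:

```python
import numpy as np

def posterior_mode(log_alpha_samples, burn_in=1000, bin_width=0.1):
    """Histogram-based posterior mode of log(alpha) from a Gibbs sample series."""
    s = np.asarray(log_alpha_samples)[burn_in:]               # discard the transient part
    edges = np.arange(s.min(), s.max() + bin_width, bin_width)
    counts, edges = np.histogram(s, bins=edges)
    k = np.argmax(counts)
    return 0.5 * (edges[k] + edges[k + 1])                    # centre of the highest bin
```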

Figure 3: The estimated posteriors on log α for the three noise levels (σ = 0.3, 0.4, 0.5) and the two data sizes. Each distribution is obtained from the histogram of the Gibbs samples on one data set.

6 Discussion

6.1 The regression version of GTM

The original regression version of GTM becomes a full Bayesian model when a ridge prior is introduced on the regression coefficients [10]. This model has three types of smoothing hyperparameters: the number of basis functions, the width of each basis function, and a ridge coefficient. Only for the ridge coefficient can we obtain a Gibbs sampler and ensemble learning in a similar manner.

Figure 4: The histograms of the posterior modes of log α for the three noise levels (σ = 0.3, 0.4, 0.5) and the two data sizes.

One way to avoid tuning the other smoothing hyperparameters is to use cubic B-splines or thin-plate splines at every node of the inner space as basis functions. When the number of basis functions is equal to the number of nodes, the regression version of GTM can be expressed as a Gaussian process version of GTM. Furthermore, in the case of a one-dimensional inner space, the $M$ made from the splines has a form similar to the one made from the discretized Laplacian, provided an appropriate prior on the regression coefficients is employed [6]. While this spline version of GTM has a strong connection with spline smoothing, it is time-consuming except in the case of a one-dimensional inner space.

6.2 Markov chain Monte Carlo model construction

Markov chain Monte Carlo model construction (MC³) is a method for estimating a posterior over a set of model structures using MCMC [12]. In our model, $D$ is regarded as a structure variable. MCMC for $D$ has no alternative but the Metropolis-Hastings algorithm.

Thus, the design of a fine trial distribution is required. In addition, the choice of a prior on $D$ may be important for the efficiency of the algorithm. We are currently searching for a trial distribution and a prior that yield an efficient MC³.

7 Conclusion

In order to improve the hyperparameter search of GTM on small data, we constructed a Gibbs sampler on the model, which generates sample series of the hyperparameters following their posteriors. Using these series, we can obtain reliable estimates of the hyperparameters and of their posterior distributions. In addition, another deterministic algorithm for the hyperparameter search was obtained using ensemble learning. A simulation experiment showed the superiority of the Gibbs sampler on small data and the validity of the deterministic algorithms on large data. Finally, an efficient method for reliable estimation in GTM was suggested, combining the deterministic algorithms and the Gibbs sampler.

References

[1] T. Kohonen. Self-Organizing Maps. Springer, Berlin, 1995.
[2] R. Durbin and D. Willshaw. An analogue approach to the traveling salesman problem using an elastic net method. Nature, 326:689-691, 1987.
[3] R. Durbin, R. Szeliski, and A. Yuille. An analysis of the elastic net approach to the traveling salesman problem. Neural Computation, 1:348-358, 1989.
[4] P. D. Simic. Statistical mechanics as the underlying theory of elastic and neural optimisations. Network, 1:89-103, 1990.
[5] R. Durbin and G. Mitchison. A dimension reduction framework for understanding cortical maps. Nature, 343:644-647, 1990.
[6] A. Utsugi. Topology selection for self-organizing maps. Network, 7:727-740, 1996.
[7] A. Utsugi. Hyperparameter selection for self-organizing maps. Neural Computation, 9:623-635, 1997.
[8] A. Utsugi. Density estimation by mixture models with smoothing priors. Neural Computation, 10:2115-2135, 1998.

[9] C. M. Bishop, M. Svensén, and C. K. I. Williams. GTM: the generative topographic mapping. Neural Computation, 10:215-234, 1998.
[10] C. M. Bishop, M. Svensén, and C. K. I. Williams. Developments of the generative topographic mapping. Neurocomputing, 21:203-224, 1998.
[11] D. J. C. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4:448-472, 1992.
[12] R. E. Kass and A. E. Raftery. Bayes factors. Journal of the American Statistical Association, 90:773-795, 1995.
[13] M. A. Tanner. Tools for statistical inference: methods for the exploration of posterior distributions and likelihood functions. Springer-Verlag, New York, 3rd edition, 1996.
[14] S. Waterhouse, D. MacKay, and T. Robinson. Bayesian methods for mixtures of experts. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8. MIT Press, Cambridge, MA, 1996.
[15] D. J. C. MacKay. Comparison of approximate methods for handling hyperparameters. Neural Computation, 11:1035-1068, 1999.
[16] A. Buja, T. Hastie, and R. Tibshirani. Linear smoothers and additive models. The Annals of Statistics, 17, 1989.
