Bayesian Sampling and Ensemble Learning in Generative Topographic Mapping
Akio Utsugi
National Institute of Bioscience and Human-Technology, Higashi, Tsukuba, Ibaraki, Japan
March 6

Abstract

Generative topographic mapping (GTM) is a statistical model to extract a hidden smooth manifold from data, like the self-organizing map (SOM). Although a deterministic search algorithm for the hyperparameters regulating the smoothness of the manifold has been proposed previously, it is based on approximations that are valid only on abundant data. Thus, it often fails to obtain suitable estimates on small data. In this paper, to improve the hyperparameter search in GTM, we construct a Gibbs sampler on the model, which generates random sample series following the posteriors on the hyperparameters. Reliable estimates are obtained from the samples. In addition, we obtain another deterministic algorithm using ensemble learning. From the result of an experimental comparison of these algorithms, an efficient method for reliable estimation in GTM is suggested.

1 Introduction

The self-organizing map (SOM) [1] was initially proposed as a minimal model for the formation of topology-preserving maps in brains. Subsequently, it has been used as an information-processing tool to extract a hidden smooth manifold from data. However, since the SOM is defined as a learning algorithm and has no explicit statistical model for the data generation, it is difficult to develop higher-level inference on the model, such as the determination of the optimal hyperparameters and topology. Thus, several authors have proposed statistical generative models whose parameter estimation algorithms are similar to the SOM.
The elastic net [2][3][4][5] is one such generative model. Its learning rule is derived as a parameter estimation algorithm for a model consisting of a mixture of spherical Gaussian generators and a smoothing prior on the centroids of the generators. This prior is also a spherical Gaussian distribution, placed on the variations of the centroids along a predefined topology. The variances of these spherical Gaussian distributions are hyperparameters regulating the smoothness of the hidden manifold. The elastic net has been extended to have a more general Gaussian smoothing prior, and evidence for the hyperparameters and the topology, which is a Bayesian model selection criterion, has been obtained [6][7][8].

Generative topographic mapping (GTM) is another SOM-like generative model [9]. It is also based on a mixture of spherical Gaussian generators with a constraint on the centroids. However, the centroids are assumed to be generated as the outputs of a generalized linear network whose inputs are the nodes of a regular grid on a latent space. Evidence for the hyperparameters of GTM is obtained by the same method as used for the generalized elastic net [10]. Subsequently, the framework of GTM was expanded to include models with a general Gaussian smoothing prior on the centroids. In the present paper, we mainly deal with this type of GTM model.

Evidence for hyperparameters is strictly defined as the marginal likelihood of the hyperparameters, which is obtained by integrating out the other parameters from the joint likelihood. In GTM, we need to integrate out the centroid parameters to obtain the evidence. However, this integral is difficult to calculate exactly. Thus, it is approximated using the Laplace method [11][12][13]. Furthermore, in order to obtain a fast search algorithm for the optimal hyperparameters, which are given by the maximizer of the evidence, we need the derivatives of the evidence with respect to the hyperparameters.
These derivatives are obtained using additional approximations. The validity of these approximations depends on the data size and the signal-to-noise ratio. Indeed, a simulation experiment shows that the hyperparameter search algorithm fails to obtain suitable estimates when the data size is small and the noise level is high [6].

In this paper, we try to improve the hyperparameter search of GTM on small data using a Gibbs sampler. The Gibbs sampler [13] generates random sample series of the parameters and the hyperparameters following their posteriors. In the limit of a long series, their averages approach the posterior means, which are the exact Bayesian estimates. We can also evaluate the confidence of the estimates using the estimated posterior variances. In fact, we can estimate the posterior distributions themselves using histograms of the samples. While the Gibbs sampler provides reliable and rich estimation on small data, it is rather time-consuming on large data because it requires the generation of long sample series. Thus, we also need a deterministic algorithm producing
the estimates quickly. We obtain another deterministic algorithm for the hyperparameter search of GTM using the ensemble learning method [14][15]. While this algorithm is similar to the previous deterministic algorithm, it is based on a more straightforward approximation assumption. In addition, ensemble learning is considered more stable than the previous deterministic algorithm, because it minimizes the variational free energy of the model, which gives an upper bound on the negative log evidence.

To evaluate the validity of the approximations in the deterministic algorithms, we compare the estimates produced by those algorithms with the outcome of the Gibbs sampler in a simulation experiment. The experiment shows the superiority of the Gibbs sampler on small data and the validity of the deterministic algorithms on large data. From this result, a policy for algorithm selection in GTM is suggested.

2 Generative topographic mapping

GTM has two versions: the original regression version and a Gaussian process version. In this paper, we focus on the latter; the former is mentioned briefly in the discussion. The Gaussian process version of GTM consists of a spherical Gaussian mixture density and a Gaussian process prior.

The spherical Gaussian mixture model assumes that each data point $x_i = (x_{i1},\ldots,x_{im})^\top \in \mathbb{R}^m$ is generated from one of $r$ spherical Gaussian generators, which correspond to the inner units of a neural network. The density of a data set $X = \{x_1,\ldots,x_n\}$ is given by

$$f(X \mid Y, W, \beta) = \left(\frac{\beta}{2\pi}\right)^{nm/2} \prod_{i=1}^n \prod_{k=1}^r \left\{ \exp\left( -\frac{\beta}{2} \|x_i - w_k\|^2 \right) \right\}^{y_{ik}}, \tag{2.1}$$

where the $y_{ik}$ in $Y$ are binary membership variables with constraints $\sum_{k=1}^r y_{ik} = 1$, and the $w_k = (w_{k1},\ldots,w_{km})^\top$ in $W$ are the centroids of the spherical Gaussian generators. By integrating out $Y$ from the product of this density and a multinomial prior

$$f(Y) = \prod_{i=1}^n \prod_{k=1}^r \left(\frac{1}{r}\right)^{y_{ik}} = r^{-n}, \tag{2.2}$$

we obtain a spherical Gaussian mixture density

$$f(X \mid W, \beta) = \left(\frac{\beta}{2\pi}\right)^{nm/2} r^{-n} \prod_{i=1}^n \sum_{k=1}^r \exp\left( -\frac{\beta}{2} \|x_i - w_k\|^2 \right). \tag{2.3}$$
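As a concrete illustration, the log of the mixture density (2.3) can be evaluated with a log-sum-exp shift to avoid underflow. The following is a minimal sketch in plain Python; the function name and the toy data are our own illustrative choices, not part of the original model.

```python
import math

def gtm_log_likelihood(X, W, beta):
    """Log of the spherical Gaussian mixture density (2.3).

    X : list of n data points, each a list of m floats.
    W : list of r centroids, each a list of m floats.
    beta : inverse variance of the generators.
    """
    n, m, r = len(X), len(X[0]), len(W)
    # constant factors: (beta/2pi)^(nm/2) and r^(-n), in log form
    log_f = 0.5 * n * m * math.log(beta / (2 * math.pi)) - n * math.log(r)
    for x in X:
        # log-sum-exp over the r generators, shifted for numerical stability
        exponents = [-0.5 * beta * sum((xj - wj) ** 2 for xj, wj in zip(x, w))
                     for w in W]
        top = max(exponents)
        log_f += top + math.log(sum(math.exp(e - top) for e in exponents))
    return log_f

# toy check: two centroids on a line, three 2-D data points
X = [[0.0, 0.0], [1.0, 0.1], [2.0, -0.1]]
W = [[0.0, 0.0], [2.0, 0.0]]
ll = gtm_log_likelihood(X, W, beta=1.0)
```

The shift by the maximum exponent keeps the exponentials in a safe range even when beta or the squared distances are large.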
Next, we assume that the centroids are regular samples of a smooth Gaussian process over a latent space. In other words, $W$ has a Gaussian prior

$$f(W \mid h) = (2\pi)^{-rm/2} |M|^{-m/2} \prod_{j=1}^m \exp\left( -\frac{1}{2} w_{(j)}^\top M^{-1} w_{(j)} \right), \tag{2.4}$$

where $w_{(j)} = (w_{1j},\ldots,w_{rj})^\top$, and the covariance in $M$ decreases as the distance between the pair of inner units over the latent space increases. In the next section, we use this general form of Gaussian process prior to obtain a Gibbs sampler for $W$. In fact, there are various ways to construct $M$ using the hyperparameters in $h$. We will choose a specific form of $M$ to obtain a simple Gibbs sampler for $h$.

The Bayesian inference of $W$ is based on its posterior $f(W \mid X, h)$, where $h$ includes $\beta$. Although this posterior is obtained by the Bayes theorem,

$$f(W \mid X, h) \propto f(X, W \mid h) = f(X \mid W, \beta)\, f(W \mid h), \tag{2.5}$$

the posterior mean as the estimate of $W$ is difficult to calculate. Instead, we can obtain the maximum a posteriori (MAP) estimate $\hat{W}$, which is the maximizer of the posterior, through an expectation-maximization (EM) algorithm [7][10].

The inference of $h$ is based on its evidence $f(X \mid h)$. The maximizer of the evidence is called the generalized maximum likelihood (GML) estimate of $h$. Note that the evidence is proportional to the posterior on $h$, $f(h \mid X)$, if we adopt a flat prior on $h$. In this case, the GML and MAP estimates of $h$ are identical. Although the evidence is obtained by integrating out $W$ from $f(X, W \mid h)$, the integral is difficult to calculate exactly. Using the Laplace method [11][12][13], we can obtain a computable approximate expression of the evidence [7][10]. In this method, the integration is performed by quadratically approximating the logarithm of the integrand around $\hat{W}$. In other words, the integrand is approximated by the nearest Gaussian function. Furthermore, a fast hyperparameter search algorithm is obtained using the approximate derivatives of the evidence with respect to $h$ [6][10].
In this approximation, the dependence of $\hat{W}$ and the posterior selection probabilities of the inner units on $h$ is neglected. As mentioned in the introduction, the approximations in the hyperparameter search algorithm are valid only on abundant data. The algorithm often fails to extract a suitable structure from small data. In order to enhance the capability of GTM, the hyperparameter search is improved using a Gibbs sampler in the following section.
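For reference, the EM algorithm for the MAP centroids mentioned above can be sketched in plain Python, here for one-dimensional data and a prior precision matrix built from a second-difference (discretized Laplacian) operator, as introduced in the next section. The function names, the small ridge term used to keep the precision matrix invertible, and all numeric values are our own illustrative assumptions, not the paper's implementation.

```python
import math

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination (small dense systems)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [v - f * w for v, w in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

def em_map(x, r, beta, P, iters=50):
    """MAP centroids of 1-D GTM by EM.

    x : data (list of floats); P : r-by-r prior precision matrix."""
    lo, hi = min(x), max(x)
    w = [lo + (hi - lo) * k / (r - 1) for k in range(r)]  # initial centroids
    for _ in range(iters):
        # E-step: posterior selection probabilities of the inner units
        p = []
        for xi in x:
            e = [-0.5 * beta * (xi - wk) ** 2 for wk in w]
            t = max(e)
            z = sum(math.exp(v - t) for v in e)
            p.append([math.exp(v - t) / z for v in e])
        # M-step: solve (beta*N + P) w = beta*s with soft counts N and sums s
        Nk = [sum(p[i][k] for i in range(len(x))) for k in range(r)]
        s = [sum(p[i][k] * x[i] for i in range(len(x))) for k in range(r)]
        A = [[P[j][k] + (beta * Nk[k] if j == k else 0.0) for k in range(r)]
             for j in range(r)]
        w = solve(A, [beta * sk for sk in s])
    return w

# prior precision: alpha * D^T D from a second-difference D, plus a tiny ridge
r, alpha = 5, 1.0
D = [[0.0] * r for _ in range(r - 2)]
for i in range(r - 2):
    D[i][i], D[i][i + 1], D[i][i + 2] = 1.0, -2.0, 1.0
P = [[alpha * sum(D[q][j] * D[q][k] for q in range(r - 2))
      + (1e-3 if j == k else 0.0) for k in range(r)] for j in range(r)]
w_hat = em_map([0.0, 0.5, 1.0, 1.5, 2.0, 2.5], r, beta=10.0, P=P)
```

The M-step is exactly a smoothing-penalized least-squares solve, which is why the deterministic algorithms are fast when the linear system is small or banded.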
3 Gibbs sampler in GTM

In the preceding section, we mentioned a method to estimate the parameters and the hyperparameters by the modes of their posteriors with the help of several approximations. Another method to infer them is to use random samples following their posteriors. Any moment of the posteriors can be obtained precisely by an average over a long sample series. Markov chain Monte Carlo (MCMC) [13] provides easier devices to generate such random samples than direct generation from the posteriors.

The Metropolis-Hastings algorithm is the most general MCMC method. At each step of this algorithm, a trial sample is generated from a trial distribution, and its adoption is then determined stochastically by the improvement ratio of the posterior. Since the posterior has to be calculated at each step, an efficient procedure for this calculation is required. Moreover, the efficiency of this algorithm depends on the design of a fine trial distribution.

The Gibbs sampler is another MCMC method. In contrast to the Metropolis-Hastings algorithm, it does not need the design of a trial distribution. In fact, the Gibbs sampler can be regarded as a special kind of Metropolis-Hastings algorithm whose trials are always adopted. In the Gibbs sampler, the variables of a model are divided into some groups, and the conditional posterior on each group given the other variables is obtained. The algorithm is started by setting appropriate initial values in the conditions of the conditional posteriors. Then, a sample is generated from a conditional posterior, and this sample is set in the conditions of the other conditional posteriors. Such sampling and setting is iterated among the groups. After arrival at a stationary state of this process, the samples follow their (unconditional) posteriors.

In the following two subsections, we obtain all conditional posteriors for the Gibbs sampler of GTM: $f(Y \mid X, W, h)$, $f(W \mid X, Y, h)$ and $f(h \mid X, Y, W)$.
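The sampling-and-setting scheme just described can be seen in miniature on a two-variable toy model whose full conditionals are known exactly: a zero-mean bivariate Gaussian with correlation rho, where x given y is N(rho*y, 1 - rho^2) and symmetrically for y given x. This sketch (names and values ours, plain Python) is an illustration of the general scheme, not part of the paper.

```python
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Gibbs sampler for a zero-mean bivariate normal with correlation rho.

    Each full conditional is Gaussian, so every draw is exact
    (a Metropolis-Hastings trial that is always accepted)."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0                      # arbitrary initial state
    sd = (1.0 - rho * rho) ** 0.5
    samples = []
    for t in range(burn_in + n_samples):
        x = rng.gauss(rho * y, sd)       # sample from f(x | y)
        y = rng.gauss(rho * x, sd)       # sample from f(y | x)
        if t >= burn_in:                 # discard the pre-stationary part
            samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(rho=0.9, n_samples=20000)
mean_x = sum(s[0] for s in samples) / len(samples)
```

After the burn-in period, averages over the retained samples estimate posterior moments; the same loop structure, with the conditionals of the next subsections, gives the Gibbs sampler for GTM.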
3.1 Conditional posteriors on Y and W

The conditional posterior on $Y$ has already been obtained in previous papers [6][7]:

$$f(Y \mid X, W, \beta) = \prod_{i=1}^n \prod_{k=1}^r p_{ik}^{\,y_{ik}}, \tag{3.1}$$

where

$$p_{ik} = \frac{\exp\left(-\frac{\beta}{2}\|x_i - w_k\|^2\right)}{\sum_{k'=1}^r \exp\left(-\frac{\beta}{2}\|x_i - w_{k'}\|^2\right)} \tag{3.2}$$

are the posterior selection probabilities of the inner units.
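A Gibbs draw of $Y$ given the current $W$ and $\beta$ then amounts to computing the $p_{ik}$ of (3.2) for each data point and sampling one indicator vector from the resulting categorical distribution. A hedged sketch in plain Python follows; the helper name and toy values are ours.

```python
import math
import random

def sample_memberships(X, W, beta, rng):
    """One Gibbs draw of the binary memberships Y from f(Y | X, W, beta).

    Each row of Y is an indicator vector drawn from the categorical
    distribution with the probabilities p_ik of equation (3.2)."""
    Y = []
    for x in X:
        e = [-0.5 * beta * sum((xj - wj) ** 2 for xj, wj in zip(x, w))
             for w in W]
        t = max(e)                       # shift for numerical stability
        z = sum(math.exp(v - t) for v in e)
        p = [math.exp(v - t) / z for v in e]
        k = rng.choices(range(len(W)), weights=p)[0]
        Y.append([1 if j == k else 0 for j in range(len(W))])
    return Y

rng = random.Random(1)
X = [[0.0], [0.1], [2.0]]
W = [[0.0], [2.0]]
Y = sample_memberships(X, W, beta=50.0, rng=rng)
```

With a large beta the draws become nearly deterministic, assigning each point to its closest centroid.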
The conditional posterior on $W$ is obtained by normalizing $f(X, Y, W \mid h)$, the product of (2.1), (2.2) and (2.4). It is a product of Gaussian densities

$$f(W \mid X, Y, h) = \prod_{j=1}^m N(w_{(j)} \mid \mu_{(j)}, \Sigma). \tag{3.3}$$

The common covariance matrix is

$$\Sigma = (\beta N + M^{-1})^{-1} \tag{3.4}$$

and the means are

$$\mu_{(j)} = \beta \Sigma s_{(j)}, \tag{3.5}$$

where

$$N = \mathrm{diag}(n_1,\ldots,n_r) = \sum_{i=1}^n \mathrm{diag}(y_{i1},\ldots,y_{ir}) \tag{3.6}$$

and

$$s_{(j)} = (s_{1j},\ldots,s_{rj})^\top = \sum_{i=1}^n x_{ij}\, (y_{i1},\ldots,y_{ir})^\top, \tag{3.7}$$

that is, $n_k$ is the number of data points belonging to the $k$th inner unit, and $s_k = (s_{k1},\ldots,s_{km})^\top$ is the sum of the data points belonging to it.

3.2 Conditional posteriors on hyperparameters

Bishop et al. [10] suggest a Gaussian process prior with a Gaussian-shaped covariance function, that is, the entries of $M$ are

$$m_{ij} \propto \exp\left( -\frac{1}{2\lambda^2} \|u_i - u_j\|^2 \right), \tag{3.8}$$

where $u_i$ is the position of the $i$th inner unit over the latent space. An MCMC algorithm for the hyperparameter $\lambda$ may be constructed using the Metropolis-Hastings method. However, it requires heavy calculation of the posterior at each step and the design of a fine trial distribution. From the standpoint of the availability of the Gibbs sampler, we instead choose a Gaussian process prior with

$$M^{-1} = \alpha D^\top D + \xi E^\top E, \tag{3.9}$$

where $D$ is a discretized Laplacian and $E$ is a matrix whose rows are the orthonormal basis vectors of the linear null space of $D$ [6]. For a one-dimensional latent space, the entries of $D$ are

$$d_{ij} = \begin{cases} 1 & (j = i - 1 \text{ or } j = i + 1) \\ -2 & (j = i) \\ 0 & \text{otherwise} \end{cases} \qquad (i = 2,\ldots,r-1;\ j = 1,\ldots,r), \tag{3.10}$$
and $E$ consists of a constant vector and a linear-trend vector. For a two-dimensional latent space, $D$ is constructed from the distortion of a thin plate, and $E$ consists of a constant vector and two linear-trend vectors [8]. When the hyperparameters $\alpha$ and $\xi$ are positive, $M$ is positive definite; thus the Gaussian process prior (2.4) is proper and can be expressed as

$$f(W \mid \alpha, \xi) = (2\pi)^{-rm/2}\, \alpha^{lm/2}\, \xi^{(r-l)m/2}\, |D^\top D|_+^{m/2} \prod_{j=1}^m \exp\left( -\frac{1}{2}\left( \alpha \|D w_{(j)}\|^2 + \xi \|E w_{(j)}\|^2 \right) \right), \tag{3.11}$$

where $|D^\top D|_+$ is the product of the positive eigenvalues of $D^\top D$ and $l = \mathrm{rank}(D^\top D)$, which is given by $l = r - t - 1$ for a $t$-dimensional latent space. The logarithm of this prior corresponds to a discretized Laplacian regularizer, which is used in discretized Laplacian smoothing [7], if $\xi \to 0$. The term in $\xi$ is introduced to make the prior proper. Although other methods to make the prior proper can be considered, our method has the advantage of leading to a simple gamma posterior on $\alpha$.

To obtain the posteriors on the hyperparameters, we consider hyper-priors on $\alpha$ and $\beta$:

$$f(\alpha \mid d_\alpha, s_\alpha) = G(\alpha \mid d_\alpha, s_\alpha), \tag{3.12}$$
$$f(\beta \mid d_\beta, s_\beta) = G(\beta \mid d_\beta, s_\beta), \tag{3.13}$$

where $G$ is a gamma density function given by

$$G(x \mid d, s) = \frac{s^d x^{d-1}}{\Gamma(d)} \exp(-sx), \tag{3.14}$$

whose mean is $d/s$. The other hyperparameters $\xi, d_\alpha, s_\alpha, d_\beta, s_\beta$ are fixed to appropriate values using prior knowledge. Alternatively, we can use non-informative priors given by $\xi \to 0$, $d_\alpha = d_\beta = 1$, $s_\alpha, s_\beta \to 0$. Hereafter we represent a model structure with a set of the fixed hyperparameters as $H$.

The conditional posteriors on $\alpha$ and $\beta$ are obtained by normalizing $f(X, Y, W, \alpha, \beta \mid H)$, the product of (2.1), (2.2), (3.11), (3.12) and (3.13):

$$f(\alpha \mid X, Y, W, H) = G(\alpha \mid \tilde{d}_\alpha, \tilde{s}_\alpha), \tag{3.15}$$
$$f(\beta \mid X, Y, W, H) = G(\beta \mid \tilde{d}_\beta, \tilde{s}_\beta), \tag{3.16}$$

where

$$\tilde{d}_\alpha = \frac{ml}{2} + d_\alpha, \tag{3.17}$$
$$\tilde{s}_\alpha = \frac{1}{2} \sum_{j=1}^m \|D w_{(j)}\|^2 + s_\alpha, \tag{3.18}$$

$$\tilde{d}_\beta = \frac{nm}{2} + d_\beta \tag{3.19}$$

and

$$\tilde{s}_\beta = \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^r y_{ik} \|x_i - w_k\|^2 + s_\beta. \tag{3.20}$$

We have now obtained all the conditional posteriors for the Gibbs sampler of GTM.

3.3 Reduction of computational load

The computational bottleneck of our Gibbs sampler lies in the inversion of the $r \times r$ matrix $\beta N + M^{-1} = \beta N + \alpha D^\top D + \xi E^\top E$ in (3.4). This inversion generally needs $O(r^3)$ operations, and $r$ becomes large when we wish to obtain fine structures. If we use a partially improper prior given by $\xi \to 0$, this matrix becomes a sparse matrix $K = \beta N + \alpha D^\top D$. In particular, for a one-dimensional latent space, $K$ becomes a five-banded matrix. Thus, the inversion of $K$ needs only $O(r)$ operations. If we can fix the topology of the latent space, this strategy is recommended. However, if we desire to compare different topologies, $\xi$ must be positive, because the partially improper prior is not available for model selection [6]. In this case, we can reduce the load of the inversion using the matrix inversion lemma:

$$(K + \xi E^\top E)^{-1} = K^{-1} - K^{-1} E^\top (E K^{-1} E^\top + \xi^{-1} I)^{-1} E K^{-1}. \tag{3.21}$$

Note that the matrix in the parentheses on the right side has the very small size $(r - l) \times (r - l)$.

4 Ensemble learning in GTM

Ensemble learning [14][15] is a deterministic algorithm to obtain the estimates of parameters and hyperparameters concurrently. We consider an approximating ensemble density $Q(Y, W, \alpha, \beta)$ and its variational free energy on a model $H$:

$$F(Q \mid H) = \int Q(Y, W, \alpha, \beta) \log \frac{Q(Y, W, \alpha, \beta)}{f(X, Y, W, \alpha, \beta \mid H)} \, dY \, dW \, d\alpha \, d\beta. \tag{4.1}$$

This functional is minimized by the joint posterior $Q(Y, W, \alpha, \beta) = f(Y, W, \alpha, \beta \mid X, H)$, and the minimum is equal to the negative log evidence of the model, $-\log f(X \mid H)$. This variational problem is generally difficult to solve, so an approximate solution is obtained by restricting $Q$ to a
factorial form $Q(Y, W, \alpha, \beta) = Q(Y)\,Q(W)\,Q(\alpha)\,Q(\beta)$, which yields a straightforward algorithm for the minimization of $F$. The optimization procedure is given as follows:

1. Initial densities are set for the partial ensembles $Q(Y)$, $Q(W)$, $Q(\alpha)$ and $Q(\beta)$.

2. From the present densities $Q(W)$, $Q(\alpha)$ and $Q(\beta)$, a new density $Q(Y)$ is obtained by

$$Q(Y) \propto \exp\left\{ \int Q(W)\,Q(\alpha)\,Q(\beta) \log f(X, Y, W, \alpha, \beta \mid H)\, dW\, d\alpha\, d\beta \right\}. \tag{4.2}$$

3. Each of the other partial ensembles is also updated using the same formula as (4.2), except that $Y$ and the target variable are exchanged.

4. These updates of the partial ensembles are repeated until a convergence condition is satisfied.

In fact, each partial ensemble has the same parametric form as the corresponding conditional posterior, so the updates of the partial ensembles reduce to updates of their parameters. Using (4.2), we obtain the update formula for the partial ensemble on $Y$:

$$Q(Y) = \prod_{i=1}^n \prod_{k=1}^r \bar{p}_{ik}^{\,y_{ik}}, \tag{4.3}$$

where

$$\bar{p}_{ik} = \frac{\exp\left\{ -\frac{\bar{\beta}}{2} \left( \|x_i - \bar{w}_k\|^2 + m \bar{\sigma}_k \right) \right\}}{\sum_{k'=1}^r \exp\left\{ -\frac{\bar{\beta}}{2} \left( \|x_i - \bar{w}_{k'}\|^2 + m \bar{\sigma}_{k'} \right) \right\}}, \tag{4.4}$$

$\bar{w}_k$ and $\bar{\beta}$ are the partial-ensemble means of $w_k$ and $\beta$ respectively, and $\bar{\sigma}_k$ is the $k$th diagonal entry of $\bar{\Sigma}$, the partial-ensemble covariance matrix of $w_k$. The update formula for the partial ensemble on $W$ is

$$Q(W) = \prod_{j=1}^m N(w_{(j)} \mid \bar{w}_{(j)}, \bar{\Sigma}), \tag{4.5}$$

where

$$\bar{\Sigma} = (\bar{\beta} \bar{N} + \bar{\alpha} D^\top D + \xi E^\top E)^{-1}, \tag{4.6}$$

$$\bar{w}_{(j)} = \bar{\beta}\, \bar{\Sigma}\, \bar{s}_{(j)}, \tag{4.7}$$
$$\bar{N} = \mathrm{diag}(\bar{n}_1,\ldots,\bar{n}_r) = \sum_{i=1}^n \mathrm{diag}(\bar{p}_{i1},\ldots,\bar{p}_{ir}) \tag{4.8}$$

and

$$\bar{s}_{(j)} = \sum_{i=1}^n x_{ij}\, (\bar{p}_{i1},\ldots,\bar{p}_{ir})^\top. \tag{4.9}$$

Finally, the update formulae for the partial ensembles on the hyperparameters are

$$Q(\alpha) = G(\alpha \mid \tilde{d}_\alpha, \bar{s}_\alpha), \tag{4.10}$$
$$Q(\beta) = G(\beta \mid \tilde{d}_\beta, \bar{s}_\beta), \tag{4.11}$$

where

$$\bar{s}_\alpha = \frac{1}{2} \sum_{j=1}^m \|D \bar{w}_{(j)}\|^2 + \frac{m}{2} \mathrm{tr}(D^\top D\, \bar{\Sigma}) + s_\alpha, \tag{4.12}$$

$$\bar{s}_\beta = \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^r \bar{p}_{ik} \|x_i - \bar{w}_k\|^2 + \frac{m}{2} \mathrm{tr}(\bar{N} \bar{\Sigma}) + s_\beta. \tag{4.13}$$

From these, the partial-ensemble means of the hyperparameters are given by

$$\bar{\alpha} = \tilde{d}_\alpha / \bar{s}_\alpha, \tag{4.14}$$
$$\bar{\beta} = \tilde{d}_\beta / \bar{s}_\beta. \tag{4.15}$$

5 Simulations

We compare three algorithms in simulations: the previous deterministic algorithm of [6], the ensemble learning, and the Gibbs sampler. However, since the two deterministic algorithms yield similar estimates, they are represented by the ensemble learning in the following.

Artificial data $x_i = (x_{i1}, x_{i2})^\top$, $i = 1,\ldots,n$, are generated from two independent standard Gaussian random series $\{e_{i1}\}$ and $\{e_{i2}\}$ by

$$x_{i1} = 4(i-1)/n + \sigma e_{i1}, \tag{5.1}$$
$$x_{i2} = \sin[\pi(i-1)/n] + \sigma e_{i2}. \tag{5.2}$$

We use three noise levels, $\sigma = .3, .4, .5$, and two data sizes, $n = 5, $. Under each data condition, 5 different data sets are prepared. Figure 1 shows examples of the data sets with the extracted manifolds.

The GTM model has inner units ($r = $) over a one-dimensional latent space ($t = 1$) and the non-informative hyper-priors mentioned in Section 3.2. The
[Figure 1 panels: σ = .3 and σ = .5 data sets]

Figure 1: Examples of the data sets and the extracted manifolds. The dots are the data points. The squares and circles are the centroids estimated by the ensemble learning and the Gibbs sampler, respectively. The centroids are linked along the manifolds.

initial values of $W$ and $\beta$ are obtained by the PCA initialization method [6], and $\alpha$ is initially set to 3. The deterministic algorithms are terminated when the relative variations of $\alpha$ and $\beta$ fall below $10^{-4}$, when $\alpha$ exceeds $10^5$, or when the number of iterations exceeds its preset maximum. The average number of iterations is 565. The Gibbs sampler, on the other hand, is always continued for the full fixed number of iterations. In all the algorithms, $\alpha$ is restricted to be under $10^5$, because the centroids show no visible change for $\alpha$ over $10^5$. In the case of $n = $, the ensemble learning and the Gibbs sampler take 4 and 9 seconds per session, respectively, on a 333 MHz processor. However, the fixed number of iterations in the Gibbs sampler was chosen arbitrarily and may be too large in comparison with the ensemble learning.

Figure 2 shows the histograms of the estimates of $\log \alpha$. The estimates by the Gibbs sampler are the posterior means, computed as the means of the Gibbs samples. The discrepancy between the deterministic algorithms and the Gibbs sampler increases as the data condition worsens. Although the estimates by the deterministic algorithms stay at values consistent with the Gibbs sampler under good conditions, they diverge abruptly as the noise level grows or the data size shrinks. At the maximum value of $\alpha$, the estimated centroids are arranged regularly on a straight line, as shown in figure 1. This is a bias of the deterministic algorithms toward the simplest form. On the other hand, the estimates by the Gibbs sampler vary more smoothly as the data condition changes. Unlike $\alpha$, the estimates of $\beta$ are similar for all the methods.

We can also observe the shape of the posterior distribution using the Gibbs
[Figure 2 panels: histograms for σ = .3, .4, .5 at the two data sizes]

Figure 2: The histograms of the estimates of log α by the ensemble learning (black) and the Gibbs sampler (white).

sampler. Figure 3 shows the estimated posterior distributions of $\log \alpha$, given by the histograms of the Gibbs samples. This figure suggests that the posterior mean over the restricted range is rather inappropriate as a point estimate of the hyperparameter when the distribution is broad or multimodal. Instead, we can use the posterior mode. Figure 4 shows the histograms of the estimated posterior modes. Unlike the posterior means in figure 2, some of them are multimodal. However, they are dominated by values well below the maximum, while the estimates of the deterministic algorithms reach the maximum in many cases (figure 2).

The result of our experiment shows the limitation of the deterministic algorithms for the hyperparameter search on small data and the superiority of the Gibbs sampler. However, the deterministic algorithms are attractive for their fast convergence, particularly on large data. Figure 2 shows that when an estimate of $\alpha$ by the deterministic algorithms stays at a finite value, we can consider the estimate reliable. From this fact, an efficient method to obtain reliable estimates
[Figure 3 panels: posteriors for σ = .3, .4, .5 at the two data sizes]

Figure 3: The estimated posteriors of log α. Each distribution is obtained from the histogram of the Gibbs samples on one data set. The width of the bins is 0.1.

in GTM is suggested: we first employ a deterministic algorithm to obtain the estimates quickly; then, if the estimate of $\alpha$ diverges, we call in the Gibbs sampler.

6 Discussion

6.1 The regression version of GTM

The original regression version of GTM becomes a full Bayesian model by introducing a ridge prior on the regression coefficients [10]. This model has three types of smoothing hyperparameters: the number of basis functions, the width of each basis function, and a ridge coefficient. Only for the ridge coefficient can we obtain a Gibbs sampler and ensemble learning in a similar manner.
[Figure 4 panels: histograms for σ = .3, .4, .5 at the two data sizes]

Figure 4: The histograms of the posterior modes of log α.

One method to avoid tuning the other smoothing hyperparameters is to use cubic B-splines or thin-plate splines at every node of the inner space as basis functions. When the number of basis functions is equal to the number of nodes, the regression version of GTM can be expressed as a Gaussian process version of GTM. Furthermore, in the case of a one-dimensional inner space, the matrix $M$ made from the splines has a form similar to the one made from the discretized Laplacian, if an appropriate prior on the regression coefficients is employed [16]. While this spline version of GTM has a strong connection with spline smoothing, it is time-consuming except in the case of a one-dimensional inner space.

6.2 Markov chain Monte Carlo model construction

Markov chain Monte Carlo model construction (MC³) is a method to estimate a posterior over the set of model structures using MCMC [12]. In our model, $D$ is regarded as a structure variable. MCMC for $D$ has no alternative but the
Metropolis-Hastings algorithm. Thus, the design of a fine trial distribution is required. In addition, the choice of a prior on $D$ may be important for the efficiency of the algorithm. We are now searching for a trial distribution and a prior for an efficient MC³.

7 Conclusion

In order to improve the hyperparameter search of GTM on small data, we constructed a Gibbs sampler on the model, which generates sample series of the hyperparameters following their posteriors. Using the series, we can obtain reliable estimates of the hyperparameters and their posterior distributions. In addition, another deterministic algorithm for the hyperparameter search was obtained using ensemble learning. A simulation experiment showed the superiority of the Gibbs sampler on small data and the validity of the deterministic algorithms on large data. Finally, an efficient method for reliable estimation in GTM was suggested, combining the deterministic algorithms and the Gibbs sampler.

References

[1] T. Kohonen. Self-Organizing Maps. Springer, Berlin, 1995.

[2] R. Durbin and D. Willshaw. An analogue approach to the traveling salesman problem using an elastic net method. Nature, 326:689-691, 1987.

[3] R. Durbin, R. Szeliski, and A. Yuille. An analysis of the elastic net approach to the traveling salesman problem. Neural Computation, 1:348-358, 1989.

[4] P. D. Simic. Statistical mechanics as the underlying theory of elastic and neural optimisations. Network, 1:89-103, 1990.

[5] R. Durbin and G. Mitchison. A dimension reduction framework for understanding cortical maps. Nature, 343:644-647, 1990.

[6] A. Utsugi. Topology selection for self-organizing maps. Network, 7:727-740, 1996.

[7] A. Utsugi. Hyperparameter selection for self-organizing maps. Neural Computation, 9:623-635, 1997.

[8] A. Utsugi. Density estimation by mixture models with smoothing priors. Neural Computation, 10:2115-2135, 1998.
[9] C. M. Bishop, M. Svensén, and C. K. I. Williams. GTM: the generative topographic mapping. Neural Computation, 10:215-234, 1998.

[10] C. M. Bishop, M. Svensén, and C. K. I. Williams. Developments of the generative topographic mapping. Neurocomputing, 21:203-224, 1998.

[11] D. J. C. MacKay. A practical Bayesian framework for backprop networks. Neural Computation, 4:448-472, 1992.

[12] R. E. Kass and A. E. Raftery. Bayes factors. Journal of the American Statistical Association, 90:773-795, 1995.

[13] M. A. Tanner. Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions. Springer-Verlag, New York, 3rd edition, 1996.

[14] S. Waterhouse, D. MacKay, and T. Robinson. Bayesian methods for mixtures of experts. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 351-357. MIT Press, Cambridge, 1996.

[15] D. J. C. MacKay. Comparison of approximate methods for handling hyperparameters. Neural Computation, 11:1035-1068, 1999.

[16] A. Buja, T. Hastie, and R. Tibshirani. Linear smoothers and additive models. The Annals of Statistics, 17:453-555, 1989.
DD2431 Autumn, 2015 1 Maximum Likelihood Methods Maximum A Posteriori Methods Bayesian methods 2 Classification vs Clustering Heuristic Example: K-means Expectation Maximization 3 Maximum Likelihood Methods
More informationComputer Vision Group Prof. Daniel Cremers. 11. Sampling Methods: Markov Chain Monte Carlo
Group Prof. Daniel Cremers 11. Sampling Methods: Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative
More informationRecent Advances in Bayesian Inference Techniques
Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian
More informationApproximate Inference Part 1 of 2
Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ 1 Bayesian paradigm Consistent use of probability theory
More informationLarge-scale Ordinal Collaborative Filtering
Large-scale Ordinal Collaborative Filtering Ulrich Paquet, Blaise Thomson, and Ole Winther Microsoft Research Cambridge, University of Cambridge, Technical University of Denmark ulripa@microsoft.com,brmt2@cam.ac.uk,owi@imm.dtu.dk
More informationIntegrated Non-Factorized Variational Inference
Integrated Non-Factorized Variational Inference Shaobo Han, Xuejun Liao and Lawrence Carin Duke University February 27, 2014 S. Han et al. Integrated Non-Factorized Variational Inference February 27, 2014
More informationComputational statistics
Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated
More informationDefault Priors and Effcient Posterior Computation in Bayesian
Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature
More informationSupplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements
Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements Jeffrey N. Rouder Francis Tuerlinckx Paul L. Speckman Jun Lu & Pablo Gomez May 4 008 1 The Weibull regression model
More informationBayesian Regression Linear and Logistic Regression
When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we
More informationAdvanced Introduction to Machine Learning
10-715 Advanced Introduction to Machine Learning Homework 3 Due Nov 12, 10.30 am Rules 1. Homework is due on the due date at 10.30 am. Please hand over your homework at the beginning of class. Please see
More informationLecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions
DD2431 Autumn, 2014 1 2 3 Classification with Probability Distributions Estimation Theory Classification in the last lecture we assumed we new: P(y) Prior P(x y) Lielihood x2 x features y {ω 1,..., ω K
More informationABC methods for phase-type distributions with applications in insurance risk problems
ABC methods for phase-type with applications problems Concepcion Ausin, Department of Statistics, Universidad Carlos III de Madrid Joint work with: Pedro Galeano, Universidad Carlos III de Madrid Simon
More informationInfinite Latent Feature Models and the Indian Buffet Process
Infinite Latent Feature Models and the Indian Buffet Process Thomas L. Griffiths Cognitive and Linguistic Sciences Brown University, Providence RI 292 tom griffiths@brown.edu Zoubin Ghahramani Gatsby Computational
More informationeqr094: Hierarchical MCMC for Bayesian System Reliability
eqr094: Hierarchical MCMC for Bayesian System Reliability Alyson G. Wilson Statistical Sciences Group, Los Alamos National Laboratory P.O. Box 1663, MS F600 Los Alamos, NM 87545 USA Phone: 505-667-9167
More informationStudy Notes on the Latent Dirichlet Allocation
Study Notes on the Latent Dirichlet Allocation Xugang Ye 1. Model Framework A word is an element of dictionary {1,,}. A document is represented by a sequence of words: =(,, ), {1,,}. A corpus is a collection
More informationIntroduction to Bayesian methods in inverse problems
Introduction to Bayesian methods in inverse problems Ville Kolehmainen 1 1 Department of Applied Physics, University of Eastern Finland, Kuopio, Finland March 4 2013 Manchester, UK. Contents Introduction
More informationMetropolis-Hastings Algorithm
Strength of the Gibbs sampler Metropolis-Hastings Algorithm Easy algorithm to think about. Exploits the factorization properties of the joint probability distribution. No difficult choices to be made to
More informationLearning Energy-Based Models of High-Dimensional Data
Learning Energy-Based Models of High-Dimensional Data Geoffrey Hinton Max Welling Yee-Whye Teh Simon Osindero www.cs.toronto.edu/~hinton/energybasedmodelsweb.htm Discovering causal structure as a goal
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is
More informationApproximate Inference Part 1 of 2
Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ Bayesian paradigm Consistent use of probability theory
More informationECO 513 Fall 2009 C. Sims HIDDEN MARKOV CHAIN MODELS
ECO 513 Fall 2009 C. Sims HIDDEN MARKOV CHAIN MODELS 1. THE CLASS OF MODELS y t {y s, s < t} p(y t θ t, {y s, s < t}) θ t = θ(s t ) P[S t = i S t 1 = j] = h ij. 2. WHAT S HANDY ABOUT IT Evaluating the
More informationUnsupervised Learning
Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College
More informationDevelopment of Stochastic Artificial Neural Networks for Hydrological Prediction
Development of Stochastic Artificial Neural Networks for Hydrological Prediction G. B. Kingston, M. F. Lambert and H. R. Maier Centre for Applied Modelling in Water Engineering, School of Civil and Environmental
More informationMarkov Chain Monte Carlo methods
Markov Chain Monte Carlo methods By Oleg Makhnin 1 Introduction a b c M = d e f g h i 0 f(x)dx 1.1 Motivation 1.1.1 Just here Supresses numbering 1.1.2 After this 1.2 Literature 2 Method 2.1 New math As
More informationApproximate inference in Energy-Based Models
CSC 2535: 2013 Lecture 3b Approximate inference in Energy-Based Models Geoffrey Hinton Two types of density model Stochastic generative model using directed acyclic graph (e.g. Bayes Net) Energy-based
More informationHastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model
UNIVERSITY OF TEXAS AT SAN ANTONIO Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model Liang Jing April 2010 1 1 ABSTRACT In this paper, common MCMC algorithms are introduced
More informationCurve Fitting Re-visited, Bishop1.2.5
Curve Fitting Re-visited, Bishop1.2.5 Maximum Likelihood Bishop 1.2.5 Model Likelihood differentiation p(t x, w, β) = Maximum Likelihood N N ( t n y(x n, w), β 1). (1.61) n=1 As we did in the case of the
More informationIntroduction to Machine Learning CMU-10701
Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos & Aarti Singh Contents Markov Chain Monte Carlo Methods Goal & Motivation Sampling Rejection Importance Markov
More informationSTA 294: Stochastic Processes & Bayesian Nonparametrics
MARKOV CHAINS AND CONVERGENCE CONCEPTS Markov chains are among the simplest stochastic processes, just one step beyond iid sequences of random variables. Traditionally they ve been used in modelling a
More informationAn Introduction to Statistical and Probabilistic Linear Models
An Introduction to Statistical and Probabilistic Linear Models Maximilian Mozes Proseminar Data Mining Fakultät für Informatik Technische Universität München June 07, 2017 Introduction In statistical learning
More informationCSC2535: Computation in Neural Networks Lecture 7: Variational Bayesian Learning & Model Selection
CSC2535: Computation in Neural Networks Lecture 7: Variational Bayesian Learning & Model Selection (non-examinable material) Matthew J. Beal February 27, 2004 www.variational-bayes.org Bayesian Model Selection
More informationA Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait
A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling Christopher Jennison Department of Mathematical Sciences, University of Bath, UK http://people.bath.ac.uk/mascj Adriana Ibrahim Institute
More informationVariational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures
17th Europ. Conf. on Machine Learning, Berlin, Germany, 2006. Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures Shipeng Yu 1,2, Kai Yu 2, Volker Tresp 2, and Hans-Peter
More informationNonstationary spatial process modeling Part II Paul D. Sampson --- Catherine Calder Univ of Washington --- Ohio State University
Nonstationary spatial process modeling Part II Paul D. Sampson --- Catherine Calder Univ of Washington --- Ohio State University this presentation derived from that presented at the Pan-American Advanced
More informationFirst Technical Course, European Centre for Soft Computing, Mieres, Spain. 4th July 2011
First Technical Course, European Centre for Soft Computing, Mieres, Spain. 4th July 2011 Linear Given probabilities p(a), p(b), and the joint probability p(a, B), we can write the conditional probabilities
More informationComputer Vision Group Prof. Daniel Cremers. 14. Sampling Methods
Prof. Daniel Cremers 14. Sampling Methods Sampling Methods Sampling Methods are widely used in Computer Science as an approximation of a deterministic algorithm to represent uncertainty without a parametric
More informationStatistical Approaches to Learning and Discovery
Statistical Approaches to Learning and Discovery Bayesian Model Selection Zoubin Ghahramani & Teddy Seidenfeld zoubin@cs.cmu.edu & teddy@stat.cmu.edu CALD / CS / Statistics / Philosophy Carnegie Mellon
More informationBayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine
Bayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine Mike Tipping Gaussian prior Marginal prior: single α Independent α Cambridge, UK Lecture 3: Overview
More informationBayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence
Bayesian Inference in GLMs Frequentists typically base inferences on MLEs, asymptotic confidence limits, and log-likelihood ratio tests Bayesians base inferences on the posterior distribution of the unknowns
More informationMarkov Chain Monte Carlo Methods for Stochastic Optimization
Markov Chain Monte Carlo Methods for Stochastic Optimization John R. Birge The University of Chicago Booth School of Business Joint work with Nicholas Polson, Chicago Booth. JRBirge U of Toronto, MIE,
More informationMarkov Chain Monte Carlo
Markov Chain Monte Carlo Recall: To compute the expectation E ( h(y ) ) we use the approximation E(h(Y )) 1 n n h(y ) t=1 with Y (1),..., Y (n) h(y). Thus our aim is to sample Y (1),..., Y (n) from f(y).
More informationLikelihood NIPS July 30, Gaussian Process Regression with Student-t. Likelihood. Jarno Vanhatalo, Pasi Jylanki and Aki Vehtari NIPS-2009
with with July 30, 2010 with 1 2 3 Representation Representation for Distribution Inference for the Augmented Model 4 Approximate Laplacian Approximation Introduction to Laplacian Approximation Laplacian
More informationParticle Filtering Approaches for Dynamic Stochastic Optimization
Particle Filtering Approaches for Dynamic Stochastic Optimization John R. Birge The University of Chicago Booth School of Business Joint work with Nicholas Polson, Chicago Booth. JRBirge I-Sim Workshop,
More informationCOMP 551 Applied Machine Learning Lecture 20: Gaussian processes
COMP 55 Applied Machine Learning Lecture 2: Gaussian processes Instructor: Ryan Lowe (ryan.lowe@cs.mcgill.ca) Slides mostly by: (herke.vanhoof@mcgill.ca) Class web page: www.cs.mcgill.ca/~hvanho2/comp55
More informationVariational Scoring of Graphical Model Structures
Variational Scoring of Graphical Model Structures Matthew J. Beal Work with Zoubin Ghahramani & Carl Rasmussen, Toronto. 15th September 2003 Overview Bayesian model selection Approximations using Variational
More informationProbabilistic & Unsupervised Learning
Probabilistic & Unsupervised Learning Week 2: Latent Variable Models Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science University College
More informationSupplementary Note on Bayesian analysis
Supplementary Note on Bayesian analysis Structured variability of muscle activations supports the minimal intervention principle of motor control Francisco J. Valero-Cuevas 1,2,3, Madhusudhan Venkadesan
More informationMCMC Sampling for Bayesian Inference using L1-type Priors
MÜNSTER MCMC Sampling for Bayesian Inference using L1-type Priors (what I do whenever the ill-posedness of EEG/MEG is just not frustrating enough!) AG Imaging Seminar Felix Lucka 26.06.2012 , MÜNSTER Sampling
More informationA short introduction to INLA and R-INLA
A short introduction to INLA and R-INLA Integrated Nested Laplace Approximation Thomas Opitz, BioSP, INRA Avignon Workshop: Theory and practice of INLA and SPDE November 7, 2018 2/21 Plan for this talk
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables
More informationPart 8: GLMs and Hierarchical LMs and GLMs
Part 8: GLMs and Hierarchical LMs and GLMs 1 Example: Song sparrow reproductive success Arcese et al., (1992) provide data on a sample from a population of 52 female song sparrows studied over the course
More informationMonte Carlo Methods. Leon Gu CSD, CMU
Monte Carlo Methods Leon Gu CSD, CMU Approximate Inference EM: y-observed variables; x-hidden variables; θ-parameters; E-step: q(x) = p(x y, θ t 1 ) M-step: θ t = arg max E q(x) [log p(y, x θ)] θ Monte
More informationProbabilistic Graphical Models
10-708 Probabilistic Graphical Models Homework 3 (v1.1.0) Due Apr 14, 7:00 PM Rules: 1. Homework is due on the due date at 7:00 PM. The homework should be submitted via Gradescope. Solution to each problem
More informationMarkov Chain Monte Carlo Methods for Stochastic
Markov Chain Monte Carlo Methods for Stochastic Optimization i John R. Birge The University of Chicago Booth School of Business Joint work with Nicholas Polson, Chicago Booth. JRBirge U Florida, Nov 2013
More informationRegression with Input-Dependent Noise: A Bayesian Treatment
Regression with Input-Dependent oise: A Bayesian Treatment Christopher M. Bishop C.M.BishopGaston.ac.uk Cazhaow S. Qazaz qazazcsgaston.ac.uk eural Computing Research Group Aston University, Birmingham,
More informationComputer Vision Group Prof. Daniel Cremers. 2. Regression (cont.)
Prof. Daniel Cremers 2. Regression (cont.) Regression with MLE (Rep.) Assume that y is affected by Gaussian noise : t = f(x, w)+ where Thus, we have p(t x, w, )=N (t; f(x, w), 2 ) 2 Maximum A-Posteriori
More informationNONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function
More informationConfidence Estimation Methods for Neural Networks: A Practical Comparison
, 6-8 000, Confidence Estimation Methods for : A Practical Comparison G. Papadopoulos, P.J. Edwards, A.F. Murray Department of Electronics and Electrical Engineering, University of Edinburgh Abstract.
More informationThe Origin of Deep Learning. Lili Mou Jan, 2015
The Origin of Deep Learning Lili Mou Jan, 2015 Acknowledgment Most of the materials come from G. E. Hinton s online course. Outline Introduction Preliminary Boltzmann Machines and RBMs Deep Belief Nets
More informationLINEAR MODELS FOR CLASSIFICATION. J. Elder CSE 6390/PSYC 6225 Computational Modeling of Visual Perception
LINEAR MODELS FOR CLASSIFICATION Classification: Problem Statement 2 In regression, we are modeling the relationship between a continuous input variable x and a continuous target variable t. In classification,
More informationCSci 8980: Advanced Topics in Graphical Models Gaussian Processes
CSci 8980: Advanced Topics in Graphical Models Gaussian Processes Instructor: Arindam Banerjee November 15, 2007 Gaussian Processes Outline Gaussian Processes Outline Parametric Bayesian Regression Gaussian
More informationInfinite Mixtures of Gaussian Process Experts
in Advances in Neural Information Processing Systems 14, MIT Press (22). Infinite Mixtures of Gaussian Process Experts Carl Edward Rasmussen and Zoubin Ghahramani Gatsby Computational Neuroscience Unit
More information(5) Multi-parameter models - Gibbs sampling. ST440/540: Applied Bayesian Analysis
Summarizing a posterior Given the data and prior the posterior is determined Summarizing the posterior gives parameter estimates, intervals, and hypothesis tests Most of these computations are integrals
More informationNPFL108 Bayesian inference. Introduction. Filip Jurčíček. Institute of Formal and Applied Linguistics Charles University in Prague Czech Republic
NPFL108 Bayesian inference Introduction Filip Jurčíček Institute of Formal and Applied Linguistics Charles University in Prague Czech Republic Home page: http://ufal.mff.cuni.cz/~jurcicek Version: 21/02/2014
More informationDifferential Priors for Elastic Nets
Differential Priors for Elastic Nets Miguel Á Carreira-Perpiñán 1, Peter Dayan, and Geoffrey J Goodhill 3 1 Dept of Computer Science & Electrical Eng, OGI, Oregon Health & Science University miguel@cseogiedu
More informationSlice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method
Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method Madeleine B. Thompson Radford M. Neal Abstract The shrinking rank method is a variation of slice sampling that is efficient at
More informationDensity Estimation. Seungjin Choi
Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationData Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis
Data Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis Radu Horaud INRIA Grenoble Rhone-Alpes, France Radu.Horaud@inrialpes.fr http://perception.inrialpes.fr/ Outline of Lecture
More informationSequential Monte Carlo Methods for Bayesian Computation
Sequential Monte Carlo Methods for Bayesian Computation A. Doucet Kyoto Sept. 2012 A. Doucet (MLSS Sept. 2012) Sept. 2012 1 / 136 Motivating Example 1: Generic Bayesian Model Let X be a vector parameter
More informationPrinciples of Bayesian Inference
Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters
More informationRegularized Regression A Bayesian point of view
Regularized Regression A Bayesian point of view Vincent MICHEL Director : Gilles Celeux Supervisor : Bertrand Thirion Parietal Team, INRIA Saclay Ile-de-France LRI, Université Paris Sud CEA, DSV, I2BM,
More informationFactor Analysis and Kalman Filtering (11/2/04)
CS281A/Stat241A: Statistical Learning Theory Factor Analysis and Kalman Filtering (11/2/04) Lecturer: Michael I. Jordan Scribes: Byung-Gon Chun and Sunghoon Kim 1 Factor Analysis Factor analysis is used
More informationProbabilistic Graphical Models
2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector
More information