Bayesian Sampling and Ensemble Learning in Generative Topographic Mapping


Akio Utsugi
National Institute of Bioscience and Human-Technology, Higashi, Tsukuba, Ibaraki, Japan

Abstract

Generative topographic mapping (GTM) is a statistical model for extracting a hidden smooth manifold from data, like the self-organizing map (SOM). Although a deterministic search algorithm for the hyperparameters regulating the smoothness of the manifold has been proposed previously, it is based on approximations that are valid only on abundant data. Thus, it often fails to obtain suitable estimates on small data. In this paper, to improve the hyperparameter search in GTM, we construct a Gibbs sampler on the model, which generates random sample series following the posteriors on the hyperparameters. Reliable estimates are obtained from the samples. In addition, we obtain another deterministic algorithm using ensemble learning. From the results of an experimental comparison of these algorithms, an efficient method for reliable estimation in GTM is suggested.

1 Introduction

The self-organizing map (SOM) [1] was initially proposed as a minimal model for the formation of topology-preserving maps in brains. Subsequently, it has been used as an information-processing tool for extracting a hidden smooth manifold from data. However, since the SOM is defined as a learning algorithm and has no explicit statistical model of data generation, it is difficult to develop higher-level inference on the model, such as the determination of the optimal hyperparameters and topology. Thus, several authors have proposed statistical generative models whose parameter estimation algorithms are similar to the SOM.

The elastic net [2][3][4][5] is one such generative model. Its learning rule is derived as a parameter estimation algorithm for a model consisting of a mixture of spherical Gaussian generators and a smoothing prior on the centroids of the generators. This prior is also a spherical Gaussian distribution, placed on the variations of the centroids along a predefined topology. The variances of these spherical Gaussian distributions are hyperparameters regulating the smoothness of the hidden manifold. The elastic net has been extended to have a more general Gaussian smoothing prior, and evidence for the hyperparameters and the topology, which is a Bayesian model selection criterion, has been obtained for it [6][7][8].

Generative topographic mapping (GTM) is another SOM-like generative model [9]. It is also based on a mixture of spherical Gaussian generators with a constraint on the centroids. However, the centroids are assumed to be generated as the outputs of a generalized linear network whose inputs are the nodes of a regular grid on a latent space. Evidence for the hyperparameters of GTM is obtained by the same method as used for the generalized elastic net [10]. Subsequently, the framework of GTM was expanded to include models with a general Gaussian smoothing prior on the centroids. In the present paper, we mainly deal with this type of GTM model.

Evidence for hyperparameters is strictly defined as the marginal likelihood of the hyperparameters, which is obtained by integrating the other parameters out of the joint likelihood. In GTM, we need to integrate out the centroid parameters to obtain the evidence. However, this integral is difficult to calculate exactly, so it is approximated using the Laplace method [11][12][13]. Furthermore, in order to obtain a fast search algorithm for the optimal hyperparameters, which are given by the maximizer of the evidence, we need the derivatives of the evidence with respect to the hyperparameters. These derivatives are obtained using additional approximations. The validity of these approximations depends on the data size and the signal-to-noise ratio. Indeed, a simulation experiment shows that the hyperparameter search algorithm fails to obtain suitable estimates when the data size is small and the noise level is high [6].

In this paper, we try to improve the hyperparameter search of GTM on small data using a Gibbs sampler. The Gibbs sampler [13] generates random sample series of the parameters and the hyperparameters following their posteriors. In the limit of a long series, the sample averages approach the posterior means, which are the exact Bayesian estimates. We can also evaluate the confidence of the estimates using the estimated posterior variances; in fact, we can obtain estimates of the entire posterior distributions from the histograms of the samples. While the Gibbs sampler provides reliable and rich estimation on small data, it is rather time-consuming on large data because it requires the generation of long sample series. Thus, we also need a deterministic algorithm that produces the estimates quickly.

We obtain another deterministic algorithm for the hyperparameter search of GTM using the ensemble learning method [14][15]. While this algorithm is similar to the previous deterministic algorithm, it is based on a more straightforward approximation assumption. In addition, ensemble learning is considered more stable than the previous deterministic algorithm, because it minimizes the variational free energy of the model, which gives an upper bound on the negative log evidence. To evaluate the validity of the approximations in the deterministic algorithms, we compare the estimates produced by those algorithms with the outcome of the Gibbs sampler in a simulation experiment. The experiment shows the superiority of the Gibbs sampler on small data and the validity of the deterministic algorithms on large data. From this result, a policy for algorithm selection in GTM is suggested.

2 Generative topographic mapping

GTM has two versions: an original regression version and a Gaussian process version. In this paper we focus on the latter; the former is mentioned briefly in the discussion. The Gaussian process version of GTM consists of a spherical Gaussian mixture density and a Gaussian process prior. The spherical Gaussian mixture model assumes that each data point $x_i = (x_{i1},\ldots,x_{im})' \in \mathbb{R}^m$ is generated from one of $r$ spherical Gaussian generators, which correspond to the inner units of a neural network. The density of a data set $X = \{x_1,\ldots,x_n\}$ is given by

$$f(X \mid Y, W, \beta) = \left(\frac{\beta}{2\pi}\right)^{nm/2} \prod_{i=1}^{n}\prod_{k=1}^{r} \left\{\exp\left(-\frac{\beta}{2}\|x_i - w_k\|^2\right)\right\}^{y_{ik}}, \tag{2.1}$$

where the $y_{ik}$ in $Y$ are binary membership variables with constraints $\sum_{k=1}^{r} y_{ik} = 1$, and the $w_k = (w_{k1},\ldots,w_{km})'$ in $W$ are the centroids of the spherical Gaussian generators. By integrating out $Y$ from the product of this density and a multinomial prior

$$f(Y) = \prod_{i=1}^{n}\prod_{k=1}^{r} \left(\frac{1}{r}\right)^{y_{ik}} = r^{-n}, \tag{2.2}$$

we obtain a spherical Gaussian mixture density

$$f(X \mid W, \beta) = \left(\frac{\beta}{2\pi}\right)^{nm/2} r^{-n} \prod_{i=1}^{n}\sum_{k=1}^{r} \exp\left(-\frac{\beta}{2}\|x_i - w_k\|^2\right). \tag{2.3}$$
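For concreteness, the mixture density (2.3) can be evaluated in a few lines of numpy. The sketch below is only an illustration of the formula, not the author's code; the names (`gtm_log_likelihood`, `X`, `W`, `beta`) are ours, and a log-sum-exp is used for numerical stability.

```python
import numpy as np

def gtm_log_likelihood(X, W, beta):
    """Log of the spherical Gaussian mixture density (2.3).

    X : (n, m) data matrix, W : (r, m) centroid matrix, beta : inverse variance.
    """
    n, m = X.shape
    r = W.shape[0]
    # Squared distances ||x_i - w_k||^2 for all pairs (i, k).
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)          # (n, r)
    a = -0.5 * beta * d2
    # log sum_k exp(a_ik), computed stably for each data point.
    a_max = a.max(axis=1, keepdims=True)
    lse = a_max[:, 0] + np.log(np.exp(a - a_max).sum(axis=1))
    return 0.5 * n * m * np.log(beta / (2 * np.pi)) - n * np.log(r) + lse.sum()
```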

Next, we assume that the centroids are regular samples of a smooth Gaussian process over a latent space. In other words, $W$ has a Gaussian prior

$$f(W \mid h) = (2\pi)^{-rm/2}\, |M|^{-m/2} \prod_{j=1}^{m} \exp\left(-\frac{1}{2}\, w_{(j)}' M^{-1} w_{(j)}\right), \tag{2.4}$$

where $w_{(j)} = (w_{1j},\ldots,w_{rj})'$, and the covariance in $M$ decreases as the distance between the corresponding pair of inner units over the latent space increases. In the next section, we use this general form of Gaussian process prior to obtain a Gibbs sampler for $W$. In fact, there are various ways to construct $M$ from the hyperparameters in $h$; we will choose a specific form of $M$ that yields a simple Gibbs sampler for $h$.

The Bayesian inference of $W$ is based on its posterior $f(W \mid X, h)$, where $h$ includes $\beta$. Although this posterior is given by Bayes' theorem,

$$f(W \mid X, h) \propto f(X, W \mid h) = f(X \mid W, \beta)\, f(W \mid h), \tag{2.5}$$

the posterior mean, as an estimate of $W$, is difficult to calculate. Instead, we can obtain the maximum a posteriori (MAP) estimate $\hat{W}$, which is the maximizer of the posterior, through an expectation-maximization (EM) algorithm [7][10].

The inference of $h$ is based on its evidence $f(X \mid h)$. The maximizer of the evidence is called the generalized maximum likelihood (GML) estimate of $h$. Note that the evidence is proportional to the posterior on $h$, $f(h \mid X)$, if we adopt a flat prior on $h$; in this case the GML and MAP estimates of $h$ are identical. Although the evidence is obtained by integrating out $W$ from $f(X, W \mid h)$, the integral is difficult to calculate exactly. Using the Laplace method [11][12][13], we can obtain a computable approximate expression of the evidence [7][10]. In this method, the integration is performed by quadratically approximating the logarithm of the integrand around $\hat{W}$; in other words, the integrand is approximated by the nearest Gaussian function. Furthermore, a fast hyperparameter search algorithm is obtained using approximate derivatives of the evidence with respect to $h$ [6][10]. In this approximation, the dependence of $\hat{W}$ and of the posterior selection probabilities of the inner units on $h$ is neglected.

As mentioned in the introduction, the approximations behind the hyperparameter search algorithm are valid only on abundant data. The algorithm often fails to extract a suitable structure from small data. In order to enhance the capability of GTM, the hyperparameter search is improved using a Gibbs sampler in the following section.
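To make the role of the Laplace method concrete, the generic form of the approximation is the following; this is the standard second-order expansion written in our notation, not an expression quoted from the paper:

$$\log f(X \mid h) = \log \int f(X, W \mid h)\, dW \;\approx\; \log f(X, \hat{W} \mid h) + \frac{rm}{2}\log 2\pi - \frac{1}{2}\log\bigl|A(\hat{W})\bigr|, \qquad A(W) = -\nabla_W \nabla_W \log f(X, W \mid h),$$

so the evidence is approximated by the height of the integrand at the MAP estimate times the volume of the Gaussian fitted to its curvature there, with $A$ the $rm \times rm$ Hessian. The additional approximations mentioned above then treat $\hat{W}$ and the selection probabilities as constants when this expression is differentiated with respect to $h$.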

3 Gibbs sampler in GTM

In the preceding section, we described a method that estimates the parameters and the hyperparameters by the modes of their posteriors with the help of several approximations. Another method of inference is to use random samples following the posteriors: any moment of the posteriors can be obtained precisely as an average over a sufficiently long sample series. Markov chain Monte Carlo (MCMC) [13] provides easier devices for generating such random samples than direct generation from the posteriors.

The Metropolis-Hastings algorithm is the most general MCMC method. At each step of this algorithm, a trial sample is generated from a trial distribution, and its adoption is then determined stochastically by the improvement ratio of the posterior. Since the posterior has to be calculated at each step, an efficient procedure for this calculation is required. Moreover, the efficiency of the algorithm depends on the design of a fine trial distribution.

The Gibbs sampler is another MCMC method. In contrast to the Metropolis-Hastings algorithm, it does not need the design of a trial distribution. In fact, the Gibbs sampler can be regarded as a special kind of Metropolis-Hastings algorithm whose trials are always adopted. In the Gibbs sampler, the variables of a model are divided into groups, and the conditional posterior on each group given the other variables is obtained. The algorithm is started by setting appropriate initial values in the conditions of the conditional posteriors. Then, a sample is generated from one conditional posterior and this sample is set into the conditions of the other conditional posteriors. Such sampling and setting is iterated among the groups. After the process reaches its stationary state, the samples follow their (unconditional) posteriors. In the following two subsections, we obtain all the conditional posteriors for the Gibbs sampler of GTM: $f(Y \mid X, W, h)$, $f(W \mid X, Y, h)$ and $f(h \mid X, Y, W)$.

3.1 Conditional posteriors on Y and W

The conditional posterior on $Y$ has already been obtained in the previous papers [6][7]:

$$f(Y \mid X, W, \beta) = \prod_{i=1}^{n}\prod_{k=1}^{r} p_{ik}^{\,y_{ik}}, \tag{3.1}$$

where

$$p_{ik} = \frac{\exp\left(-\frac{\beta}{2}\|x_i - w_k\|^2\right)}{\sum_{k'=1}^{r}\exp\left(-\frac{\beta}{2}\|x_i - w_{k'}\|^2\right)} \tag{3.2}$$

are the posterior selection probabilities of the inner units.
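Sampling from (3.1) amounts to drawing, for each data point, one inner unit from the categorical distribution (3.2). A minimal numpy sketch, assuming the same `X`, `W`, `beta` variables as before (our naming, not the paper's):

```python
import numpy as np

def sample_Y(X, W, beta, rng):
    """Draw Y ~ f(Y | X, W, beta) of (3.1); returns a one-hot matrix of shape (n, r)."""
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)      # ||x_i - w_k||^2
    a = -0.5 * beta * d2
    a -= a.max(axis=1, keepdims=True)                            # numerical stability
    p = np.exp(a)
    p /= p.sum(axis=1, keepdims=True)                            # selection probabilities (3.2)
    n, r = p.shape
    ks = np.array([rng.choice(r, p=p[i]) for i in range(n)])     # one unit per data point
    Y = np.zeros((n, r))
    Y[np.arange(n), ks] = 1.0
    return Y
```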

The conditional posterior on $W$ is obtained by normalizing $f(X, Y, W \mid h)$, the product of (2.1), (2.2) and (2.4). It is a product of Gaussian densities,

$$f(W \mid X, Y, h) = \prod_{j=1}^{m} N(w_{(j)} \mid \mu_{(j)}, \Sigma). \tag{3.3}$$

The common covariance matrix is

$$\Sigma = (\beta N + M^{-1})^{-1} \tag{3.4}$$

and the means are

$$\mu_{(j)} = \beta\, \Sigma\, s_{(j)}, \tag{3.5}$$

where

$$N = \operatorname{diag}(n_1,\ldots,n_r) = \sum_{i=1}^{n} \operatorname{diag}(y_{i1},\ldots,y_{ir}) \tag{3.6}$$

and

$$s_{(j)} = (s_{1j},\ldots,s_{rj})' = \sum_{i=1}^{n} x_{ij}\,(y_{i1},\ldots,y_{ir})', \tag{3.7}$$

that is, $n_k$ is the number of data points belonging to the $k$th inner unit, and $s_k = (s_{k1},\ldots,s_{km})'$ is the sum of the data points belonging to it.

3.2 Conditional posteriors on the hyperparameters

Bishop et al. [10] suggest a Gaussian process prior with a Gaussian-shaped covariance function, that is, entries of $M$ of the form

$$m_{ij} \propto \exp\left(-\frac{1}{2\lambda^2}\|u_i - u_j\|^2\right), \tag{3.8}$$

where $u_i$ is the position of the $i$th inner unit over the latent space. An MCMC algorithm for the hyperparameter $\lambda$ may be constructed using the Metropolis-Hastings method. However, it requires a heavy calculation of the posterior at each step and the design of a fine trial distribution. From the standpoint of the availability of the Gibbs sampler, we instead choose a Gaussian process prior with

$$M^{-1} = \alpha D'D + \xi E'E, \tag{3.9}$$

where $D$ is a discretized Laplacian and $E$ is a matrix whose rows are the orthonormal basis vectors of the linear null-space of $D$ [6]. For a one-dimensional latent space, the entries of $D$ are

$$d_{ij} = \begin{cases} 1 & j = i \text{ or } j = i+2,\\ -2 & j = i+1,\\ 0 & \text{otherwise}, \end{cases} \qquad (i = 1,\ldots,r-2;\; j = 1,\ldots,r), \tag{3.10}$$

and $E$ consists of a constant vector and a linear-trend vector. For a two-dimensional latent space, $D$ is constructed from the distortion of a thin plate and $E$ consists of a constant vector and two linear-trend vectors [8].
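The following sketch builds $D$ and $E$ for a one-dimensional latent space and draws $W$ from (3.3)-(3.7). It is our illustration of the formulas rather than the author's implementation; the helper names are ours, and $E$ is obtained by orthonormalizing the constant and linear-trend vectors, which span the null space of the second-difference operator as assumed in (3.9).

```python
import numpy as np

def laplacian_1d(r):
    """Second-difference matrix D of (3.10), shape (r-2, r), rows (1, -2, 1)."""
    D = np.zeros((r - 2, r))
    for i in range(r - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    return D

def null_basis_1d(r):
    """E of (3.9): orthonormal basis of the null space of D (constant + linear trend)."""
    ones = np.ones(r)
    trend = np.arange(r, dtype=float)
    Q, _ = np.linalg.qr(np.stack([ones, trend], axis=1))     # orthonormalize the columns
    return Q.T                                               # rows are basis vectors, (2, r)

def sample_W(X, Y, alpha, beta, xi, D, E, rng):
    """Draw W ~ f(W | X, Y, h), equations (3.3)-(3.7); returns an (r, m) matrix."""
    N = np.diag(Y.sum(axis=0))                  # (3.6): counts per inner unit
    S = Y.T @ X                                 # (3.7): columns are s_(j), shape (r, m)
    M_inv = alpha * D.T @ D + xi * E.T @ E      # (3.9)
    Sigma = np.linalg.inv(beta * N + M_inv)     # (3.4)
    Mu = beta * Sigma @ S                       # (3.5): one column per data dimension
    L = np.linalg.cholesky(Sigma)
    r, m = Mu.shape
    return Mu + L @ rng.standard_normal((r, m))  # each column w_(j) ~ N(mu_(j), Sigma)
```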

When the hyperparameters $\alpha$ and $\xi$ are positive, $M$ is positive definite; thus the Gaussian process prior (2.4) is proper and can be written as

$$f(W \mid \alpha, \xi) = (2\pi)^{-rm/2}\, \alpha^{lm/2}\, \xi^{(r-l)m/2}\, |D'D|_{+}^{m/2} \prod_{j=1}^{m} \exp\left\{-\frac{1}{2}\left(\alpha\|Dw_{(j)}\|^2 + \xi\|Ew_{(j)}\|^2\right)\right\}, \tag{3.11}$$

where $|D'D|_{+}$ is the product of the positive eigenvalues of $D'D$ and $l = \operatorname{rank}(D'D)$, which is given by $l = r - t - 1$ for a $t$-dimensional latent space. The logarithm of this prior corresponds to a discretized Laplacian regularizer, which is used in discretized Laplacian smoothing [7], if $\xi \to 0$. The $\xi$ term is introduced to make the prior proper. Although other methods of making the prior proper can be considered, ours has the advantage of leading to a simple gamma posterior on $\alpha$.

To obtain the posteriors on the hyperparameters, we consider hyper-priors on $\alpha$ and $\beta$:

$$f(\alpha \mid d_\alpha, s_\alpha) = G(\alpha \mid d_\alpha, s_\alpha), \tag{3.12}$$
$$f(\beta \mid d_\beta, s_\beta) = G(\beta \mid d_\beta, s_\beta), \tag{3.13}$$

where $G$ is the gamma density function

$$G(x \mid d, s) = \frac{s^{d} x^{d-1}}{\Gamma(d)} \exp(-sx), \tag{3.14}$$

whose mean is $d/s$. The other hyperparameters $\xi, d_\alpha, s_\alpha, d_\beta, s_\beta$ are fixed to appropriate values using prior knowledge. Alternatively, we can use non-informative priors given by $\xi \to 0$, $d_\alpha = d_\beta = 0$, $s_\alpha, s_\beta \to 0$. Hereafter we represent a model structure with a set of the fixed hyperparameters as $H$.

The conditional posteriors on $\alpha$ and $\beta$ are obtained by normalizing $f(X, Y, W, \alpha, \beta \mid H)$, the product of (2.1), (2.2), (3.11), (3.12) and (3.13):

$$f(\alpha \mid X, Y, W, H) = G(\alpha \mid \tilde{d}_\alpha, \tilde{s}_\alpha), \tag{3.15}$$
$$f(\beta \mid X, Y, W, H) = G(\beta \mid \tilde{d}_\beta, \tilde{s}_\beta), \tag{3.16}$$

where

$$\tilde{d}_\alpha = \frac{ml}{2} + d_\alpha, \tag{3.17}$$
$$\tilde{s}_\alpha = \frac{1}{2}\sum_{j=1}^{m}\|Dw_{(j)}\|^2 + s_\alpha, \tag{3.18}$$
$$\tilde{d}_\beta = \frac{nm}{2} + d_\beta \tag{3.19}$$

and

$$\tilde{s}_\beta = \frac{1}{2}\sum_{i=1}^{n}\sum_{k=1}^{r} y_{ik}\,\|x_i - w_k\|^2 + s_\beta. \tag{3.20}$$

We have now obtained all the conditional posteriors for the Gibbs sampler of GTM.
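Given $Y$ and $W$, drawing $\alpha$ and $\beta$ from (3.15)-(3.20) needs nothing more than numpy's gamma generator (parameterized by shape and scale, i.e. $1/s$). Below is a sketch of these draws and of one complete Gibbs sweep, reusing the hypothetical helpers `sample_Y` and `sample_W` defined above; all names and the default non-informative hyper-priors are our choices.

```python
import numpy as np

def sample_hyperparameters(X, Y, W, D, d_a=0.0, s_a=0.0, d_b=0.0, s_b=0.0, rng=None):
    """Draw alpha and beta from the gamma conditionals (3.15)-(3.20).

    The defaults correspond to the non-informative hyper-priors d = 0, s -> 0.
    """
    n, m = X.shape
    l = np.linalg.matrix_rank(D.T @ D)                          # l = rank(D'D)
    d_a_post = 0.5 * m * l + d_a                                # (3.17)
    s_a_post = 0.5 * np.sum((D @ W) ** 2) + s_a                 # (3.18)
    d_b_post = 0.5 * n * m + d_b                                # (3.19)
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)     # ||x_i - w_k||^2
    s_b_post = 0.5 * np.sum(Y * d2) + s_b                       # (3.20)
    alpha = rng.gamma(d_a_post, 1.0 / s_a_post)                 # numpy gamma uses shape, scale
    beta = rng.gamma(d_b_post, 1.0 / s_b_post)
    return alpha, beta

def gibbs_sweep(X, W, alpha, beta, xi, D, E, rng):
    """One full Gibbs iteration over (Y, W, alpha, beta)."""
    Y = sample_Y(X, W, beta, rng)                               # hypothetical helper above
    W = sample_W(X, Y, alpha, beta, xi, D, E, rng)              # hypothetical helper above
    alpha, beta = sample_hyperparameters(X, Y, W, D, rng=rng)
    return Y, W, alpha, beta
```

Running many such sweeps and, after a burn-in period, averaging the retained samples of log alpha gives the posterior-mean estimates compared in Section 5.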

3.3 Reduction of computational load

The computational bottleneck of our Gibbs sampler lies in the inversion of the $r \times r$ matrix $\beta N + M^{-1} = \beta N + \alpha D'D + \xi E'E$ in (3.4). This inversion generally needs $O(r^3)$ operations, and $r$ becomes large when fine structures are to be obtained. If we use a partially improper prior given by $\xi \to 0$, this matrix becomes a sparse matrix $K = \beta N + \alpha D'D$; in particular, for a one-dimensional latent space, $K$ is a five-banded matrix, so the inversion of $K$ needs only $O(r)$ operations. If we can fix the topology of the latent space, this strategy is recommended. However, if we wish to compare different topologies, $\xi$ must be positive, because the partially improper prior is not available for model selection [6]. In this case, we can reduce the load of the inversion using the matrix-inversion lemma:

$$(K + \xi E'E)^{-1} = K^{-1} - K^{-1}E'\,(EK^{-1}E' + \xi^{-1}I)^{-1}\,EK^{-1}. \tag{3.21}$$

Note that the matrix in parentheses on the right-hand side has the very small size $(r-l)\times(r-l)$.
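A direct transcription of (3.21) is sketched below, assuming $K$ is already cheap to solve against; the function name is ours, and the dense calls stand in for whatever banded or sparse solver is actually used.

```python
import numpy as np

def inv_K_plus_ridge(K, E, xi):
    """(K + xi * E'E)^{-1} via the matrix-inversion lemma (3.21).

    K : (r, r) symmetric part beta*N + alpha*D'D, E : (r-l, r) null-space basis.
    """
    K_inv_Et = np.linalg.solve(K, E.T)                      # K^{-1} E', shape (r, r-l)
    small = E @ K_inv_Et + np.eye(E.shape[0]) / xi          # E K^{-1} E' + xi^{-1} I, tiny
    # K is symmetric, so (K^{-1} E')' = E K^{-1}.
    correction = K_inv_Et @ np.linalg.solve(small, K_inv_Et.T)
    return np.linalg.inv(K) - correction
```

In practice one would keep $K$ in banded form and use a banded solver (for example scipy.linalg.solve_banded) for the two solves; the dense np.linalg.inv(K) here is only for brevity.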

4 Ensemble learning in GTM

Ensemble learning [14][15] is a deterministic method for obtaining estimates of the parameters and hyperparameters concurrently. We consider an approximating ensemble density $Q(Y, W, \alpha, \beta)$ and its variational free energy on a model $H$:

$$F(Q \mid H) = \int Q(Y, W, \alpha, \beta)\, \log\frac{Q(Y, W, \alpha, \beta)}{f(X, Y, W, \alpha, \beta \mid H)}\, dY\, dW\, d\alpha\, d\beta. \tag{4.1}$$

This functional is minimized by the joint posterior $Q(Y, W, \alpha, \beta) = f(Y, W, \alpha, \beta \mid X, H)$, and the minimum is equal to the negative log evidence of the model, $-\log f(X \mid H)$. This variational problem is generally difficult to solve, so an approximate solution is obtained by restricting $Q$ to a specific form. For example, if we restrict $Q$ to the factorial form $Q(Y, W, \alpha, \beta) = Q(Y)Q(W)Q(\alpha)Q(\beta)$, we obtain a straightforward algorithm for the minimization of $F$. The optimization procedure is as follows:

1. Initial densities are set for the partial ensembles $Q(Y)$, $Q(W)$, $Q(\alpha)$ and $Q(\beta)$.
2. From the present densities $Q(W)$, $Q(\alpha)$ and $Q(\beta)$, a new density $Q(Y)$ is obtained by
$$Q(Y) \propto \exp\left\{\int Q(W)Q(\alpha)Q(\beta)\, \log f(X, Y, W, \alpha, \beta \mid H)\, dW\, d\alpha\, d\beta\right\}. \tag{4.2}$$
3. Each of the other partial ensembles is updated using the same formula as (4.2), except that $Y$ and the target variable are exchanged.
4. These updates of the partial ensembles are repeated until a convergence condition is satisfied.

Each partial ensemble has the same parametric form as the corresponding conditional posterior, so the updates of the partial ensembles reduce to updates of their parameters. Using (4.2), we obtain the update formula for the partial ensemble on $Y$:

$$Q(Y) = \prod_{i=1}^{n}\prod_{k=1}^{r} \bar{p}_{ik}^{\,y_{ik}}, \tag{4.3}$$

where

$$\bar{p}_{ik} = \frac{\exp\left\{-\frac{\bar{\beta}}{2}\left(\|x_i - \bar{w}_k\|^2 + m\bar{\sigma}_k^2\right)\right\}}{\sum_{k'=1}^{r}\exp\left\{-\frac{\bar{\beta}}{2}\left(\|x_i - \bar{w}_{k'}\|^2 + m\bar{\sigma}_{k'}^2\right)\right\}}, \tag{4.4}$$

$\bar{w}_k$ and $\bar{\beta}$ are the partial-ensemble means of $w_k$ and $\beta$, respectively, and $\bar{\sigma}_k^2$ is the $k$th diagonal entry of $\bar{\Sigma}$, the partial-ensemble covariance matrix of $w_k$. The update formula for the partial ensemble on $W$ is

$$Q(W) = \prod_{j=1}^{m} N(w_{(j)} \mid \bar{w}_{(j)}, \bar{\Sigma}), \tag{4.5}$$

where

$$\bar{\Sigma} = (\bar{\beta}\bar{N} + \bar{\alpha}D'D + \xi E'E)^{-1}, \tag{4.6}$$
$$\bar{w}_{(j)} = \bar{\beta}\,\bar{\Sigma}\,\bar{s}_{(j)}, \tag{4.7}$$
$$\bar{N} = \sum_{i=1}^{n} \operatorname{diag}(\bar{p}_{i1},\ldots,\bar{p}_{ir}) \tag{4.8}$$

and

$$\bar{s}_{(j)} = \sum_{i=1}^{n} x_{ij}\,(\bar{p}_{i1},\ldots,\bar{p}_{ir})'. \tag{4.9}$$

Finally, the update formulae for the partial ensembles on the hyperparameters are

$$Q(\alpha) = G(\alpha \mid \tilde{d}_\alpha, \bar{s}_\alpha), \tag{4.10}$$
$$Q(\beta) = G(\beta \mid \tilde{d}_\beta, \bar{s}_\beta), \tag{4.11}$$

where

$$\bar{s}_\alpha = \frac{1}{2}\sum_{j=1}^{m}\|D\bar{w}_{(j)}\|^2 + \frac{m}{2}\operatorname{tr}(D'D\,\bar{\Sigma}) + s_\alpha, \tag{4.12}$$
$$\bar{s}_\beta = \frac{1}{2}\sum_{i=1}^{n}\sum_{k=1}^{r} \bar{p}_{ik}\,\|x_i - \bar{w}_k\|^2 + \frac{m}{2}\operatorname{tr}(\bar{N}\bar{\Sigma}) + s_\beta. \tag{4.13}$$

From these, the partial-ensemble means of the hyperparameters are given by

$$\bar{\alpha} = \tilde{d}_\alpha / \bar{s}_\alpha, \tag{4.14}$$
$$\bar{\beta} = \tilde{d}_\beta / \bar{s}_\beta. \tag{4.15}$$
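One pass of the updates (4.3)-(4.15) can be written directly in numpy. As before, this is a sketch in our own notation (the matrices D and E are the hypothetical ones built by laplacian_1d and null_basis_1d above), not the author's implementation; the non-informative hyper-priors are taken as defaults.

```python
import numpy as np

def ensemble_sweep(X, W_bar, Sigma_bar, alpha_bar, beta_bar, xi, D, E,
                   d_a=0.0, s_a=0.0, d_b=0.0, s_b=0.0):
    """One ensemble-learning pass over Q(Y), Q(W), Q(alpha), Q(beta), eqs (4.3)-(4.15)."""
    n, m = X.shape
    l = np.linalg.matrix_rank(D.T @ D)
    # --- Q(Y): responsibilities (4.4), with the variance correction m * sigma_k^2.
    d2 = ((X[:, None, :] - W_bar[None, :, :]) ** 2).sum(axis=2)
    a = -0.5 * beta_bar * (d2 + m * np.diag(Sigma_bar)[None, :])
    a -= a.max(axis=1, keepdims=True)
    P = np.exp(a)
    P /= P.sum(axis=1, keepdims=True)
    # --- Q(W): (4.5)-(4.9).
    N_bar = np.diag(P.sum(axis=0))                                          # (4.8)
    S_bar = P.T @ X                                                         # (4.9)
    Sigma_bar = np.linalg.inv(beta_bar * N_bar + alpha_bar * D.T @ D + xi * E.T @ E)  # (4.6)
    W_bar = beta_bar * Sigma_bar @ S_bar                                    # (4.7)
    # --- Q(alpha), Q(beta): (4.10)-(4.15).
    d_a_t = 0.5 * m * l + d_a
    d_b_t = 0.5 * n * m + d_b
    s_a_bar = 0.5 * np.sum((D @ W_bar) ** 2) + 0.5 * m * np.trace(D.T @ D @ Sigma_bar) + s_a
    d2 = ((X[:, None, :] - W_bar[None, :, :]) ** 2).sum(axis=2)
    s_b_bar = 0.5 * np.sum(P * d2) + 0.5 * m * np.trace(N_bar @ Sigma_bar) + s_b
    alpha_bar = d_a_t / s_a_bar                                             # (4.14)
    beta_bar = d_b_t / s_b_bar                                              # (4.15)
    return W_bar, Sigma_bar, alpha_bar, beta_bar, P
```

Iterating this pass until alpha_bar and beta_bar stop changing gives the deterministic estimates that are compared with the Gibbs sampler in the next section.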

5 Simulations

We compare three algorithms in simulations: the previous deterministic algorithm of [6], the ensemble learning, and the Gibbs sampler. However, since the two deterministic algorithms yield similar estimates, they are represented by the ensemble learning in the following.

Artificial data $x_i = (x_{i1}, x_{i2})'$, $i = 1,\ldots,n$, are generated from two independent standard Gaussian random series $\{e_{i1}\}$ and $\{e_{i2}\}$ by

$$x_{i1} = 4(i-1)/n + \sigma e_{i1}, \tag{5.1}$$
$$x_{i2} = \sin[\pi(i-1)/n] + \sigma e_{i2}. \tag{5.2}$$

We use three noise levels, $\sigma = 0.3, 0.4, 0.5$, and two data sizes. Under each data condition, 5 different data sets are prepared. Figure 1 shows examples of the data sets together with the extracted manifolds.
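A sketch of the data-generating process (5.1)-(5.2) follows; the index shift inside the parentheses is our reading of the formulas, so the (i-1)/n ramp below should be treated as an assumption rather than the exact recipe.

```python
import numpy as np

def make_data(n, sigma, rng):
    """Artificial data along a noisy sine curve, following (5.1)-(5.2)."""
    t = np.arange(n) / n                                      # (i - 1)/n for i = 1, ..., n
    x1 = 4.0 * t + sigma * rng.standard_normal(n)             # (5.1)
    x2 = np.sin(np.pi * t) + sigma * rng.standard_normal(n)   # (5.2)
    return np.stack([x1, x2], axis=1)                         # data matrix of shape (n, 2)
```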

Figure 1: Examples of the data sets (σ = 0.3 and σ = 0.5) and the extracted manifolds. The dots are the data points. The squares and circles are the centroids estimated by the ensemble learning and the Gibbs sampler, respectively. The centroids are linked along the manifolds.

The GTM model has $r$ inner units over a one-dimensional latent space ($t = 1$) and uses the non-informative hyper-priors mentioned in Section 3.2. The initial values of $W$ and $\beta$ are obtained by the PCA initialization method [6], and $\alpha$ is initially set to 3. The deterministic algorithms are terminated when the relative variations of $\alpha$ and $\beta$ fall below $10^{-4}$, when $\alpha$ exceeds its upper limit, or when the number of iterations exceeds a preset maximum; the average number of iterations is 565. The Gibbs sampler, on the other hand, is always run for a fixed number of iterations. In all the algorithms, $\alpha$ is restricted to an upper limit, because the centroids show no visible change for $\alpha$ beyond it. On the larger data size, the ensemble learning and the Gibbs sampler take 4 and 9 seconds per session, respectively, on a 333 MHz processor. However, the fixed number of iterations in the Gibbs sampler is chosen arbitrarily and may be too large in comparison with the ensemble learning.

Figure 2 shows the histograms of the estimates of $\log\alpha$. The estimates by the Gibbs sampler are the posterior means, estimated by the means of the Gibbs samples. The discrepancy between the deterministic algorithms and the Gibbs sampler increases as the data condition worsens. Although the estimates by the deterministic algorithms stay at values consistent with the Gibbs sampler under the good conditions, they diverge abruptly as the noise level grows or the data size is reduced. At the maximum value of $\alpha$, the estimated centroids are arranged regularly on a straight line, as shown in Figure 1; this is a bias of the deterministic algorithms toward the simplest form. The estimates by the Gibbs sampler, on the other hand, vary more smoothly as the data condition changes. Unlike $\alpha$, the estimates of $\beta$ are similar for all the methods.

We can also observe the shape of the posterior distribution using the Gibbs samples.

Figure 2: Histograms of the estimates of log α by the ensemble learning (black) and the Gibbs sampler (white), for the three noise levels (σ = 0.3, 0.4, 0.5) and the two data sizes.

Figure 3 shows the estimated posterior distributions of $\log\alpha$, obtained as histograms of the Gibbs samples. This figure suggests that the posterior mean over the restricted range is rather inappropriate as a point estimate of the hyperparameter when the distribution is broad or multimodal. Instead, we can use the posterior mode. Figure 4 shows the histograms of the estimated posterior modes. Unlike the posterior means in Figure 2, some of these histograms are multimodal. However, they are dominated by values well below the maximum, while the estimates of the deterministic algorithms reach the maximum in many cases (Figure 2).

The result of our experiment shows the limitation of the deterministic algorithms for the hyperparameter search on small data and the superiority of the Gibbs sampler. However, the deterministic algorithms are attractive for their fast convergence, particularly on large data. Figure 2 shows that when an estimate of $\alpha$ by the deterministic algorithms stays at a finite value, we can consider the estimate reliable. From this fact, an efficient method to obtain reliable estimates in GTM is suggested: we first employ a deterministic algorithm to obtain the estimates quickly, and then, if the estimate of $\alpha$ diverges, we call in the Gibbs sampler.
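Estimating the posterior mode from the Gibbs output is simply a matter of histogramming the sampled log alpha values and taking the centre of the highest bin. A minimal sketch, in which the bin width and burn-in length are arbitrary choices of ours:

```python
import numpy as np

def posterior_mode(log_alpha_samples, burn_in=1000, bin_width=0.1):
    """Histogram-based posterior mode of log(alpha) from a Gibbs sample series."""
    s = np.asarray(log_alpha_samples)[burn_in:]               # discard the transient part
    edges = np.arange(s.min(), s.max() + bin_width, bin_width)
    counts, edges = np.histogram(s, bins=edges)
    k = np.argmax(counts)
    return 0.5 * (edges[k] + edges[k + 1])                    # centre of the highest bin
```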

Figure 3: The estimated posteriors on log α for the three noise levels (σ = 0.3, 0.4, 0.5) and the two data sizes. Each distribution is obtained from the histogram of the Gibbs samples on one data set.

6 Discussion

6.1 The regression version of GTM

The original regression version of GTM becomes a full Bayesian model when a ridge prior is introduced on the regression coefficients [10]. This model has three types of smoothing hyperparameters: the number of basis functions, the width of each basis function, and a ridge coefficient. Only for the ridge coefficient can we obtain a Gibbs sampler and ensemble learning in a similar manner.

Figure 4: The histograms of the posterior modes of log α for the three noise levels (σ = 0.3, 0.4, 0.5) and the two data sizes.

One way to avoid tuning the other smoothing hyperparameters is to use cubic B-splines or thin-plate splines at every node of the inner space as basis functions. When the number of basis functions is equal to the number of nodes, the regression version of GTM can be expressed as a Gaussian process version of GTM. Furthermore, in the case of a one-dimensional inner space, the $M$ made from the splines has a form similar to the one made from the discretized Laplacian, provided an appropriate prior on the regression coefficients is employed [6]. While this spline version of GTM has a strong connection with spline smoothing, it is time-consuming except in the case of a one-dimensional inner space.

6.2 Markov chain Monte Carlo model construction

Markov chain Monte Carlo model construction (MC³) is a method for estimating a posterior over a set of model structures using MCMC [12]. In our model, $D$ is regarded as a structure variable. MCMC for $D$ has no alternative but the Metropolis-Hastings algorithm.

Thus, the design of a fine trial distribution is required. In addition, the choice of a prior on $D$ may be important for the efficiency of the algorithm. We are currently searching for a trial distribution and a prior that yield an efficient MC³.

7 Conclusion

In order to improve the hyperparameter search of GTM on small data, we constructed a Gibbs sampler on the model, which generates sample series of the hyperparameters following their posteriors. Using these series, we can obtain reliable estimates of the hyperparameters and of their posterior distributions. In addition, another deterministic algorithm for the hyperparameter search was obtained using ensemble learning. A simulation experiment showed the superiority of the Gibbs sampler on small data and the validity of the deterministic algorithms on large data. Finally, an efficient method for reliable estimation in GTM was suggested, combining the deterministic algorithms and the Gibbs sampler.

References

[1] T. Kohonen. Self-Organizing Maps. Springer, Berlin, 1995.
[2] R. Durbin and D. Willshaw. An analogue approach to the traveling salesman problem using an elastic net method. Nature, 326:689-691, 1987.
[3] R. Durbin, R. Szeliski, and A. Yuille. An analysis of the elastic net approach to the traveling salesman problem. Neural Computation, 1:348-358, 1989.
[4] P. D. Simic. Statistical mechanics as the underlying theory of elastic and neural optimisations. Network, 1:89-103, 1990.
[5] R. Durbin and G. Mitchison. A dimension reduction framework for understanding cortical maps. Nature, 343:644-647, 1990.
[6] A. Utsugi. Topology selection for self-organizing maps. Network, 7:727-740, 1996.
[7] A. Utsugi. Hyperparameter selection for self-organizing maps. Neural Computation, 9:623-635, 1997.
[8] A. Utsugi. Density estimation by mixture models with smoothing priors. Neural Computation, 10:2115-2135, 1998.

[9] C. M. Bishop, M. Svensén, and C. K. I. Williams. GTM: the generative topographic mapping. Neural Computation, 10:215-234, 1998.
[10] C. M. Bishop, M. Svensén, and C. K. I. Williams. Developments of the generative topographic mapping. Neurocomputing, 21:203-224, 1998.
[11] D. J. C. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4:448-472, 1992.
[12] R. E. Kass and A. E. Raftery. Bayes factors. Journal of the American Statistical Association, 90:773-795, 1995.
[13] M. A. Tanner. Tools for statistical inference: methods for the exploration of posterior distributions and likelihood functions. Springer-Verlag, New York, 3rd edition, 1996.
[14] S. Waterhouse, D. MacKay, and T. Robinson. Bayesian methods for mixtures of experts. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8. MIT Press, Cambridge, MA, 1996.
[15] D. J. C. MacKay. Comparison of approximate methods for handling hyperparameters. Neural Computation, 11:1035-1068, 1999.
[16] A. Buja, T. Hastie, and R. Tibshirani. Linear smoothers and additive models. The Annals of Statistics, 17, 1989.
