Two Further Gradient BYY Learning Rules for Gaussian Mixture with Automated Model Selection


Jinwen Ma, Bin Gao, Yang Wang, and Qiansheng Cheng
Department of Information Science, School of Mathematical Sciences and LMAM, Peking University, Beijing, China
(This work was supported by the Natural Science Foundation of China.)

Abstract. Under the Bayesian Ying-Yang (BYY) harmony learning theory, a harmony function has been developed for the Gaussian mixture model with an important feature: via its maximization through a gradient learning rule, model selection can be made automatically during parameter learning on a set of sample data from a Gaussian mixture. This paper proposes two further gradient learning rules, called the conjugate and natural gradient learning rules respectively, to implement the maximization of the harmony function on the Gaussian mixture more efficiently. It is demonstrated by simulation experiments that these two new gradient learning rules not only work well, but also converge more quickly than the general gradient ones.

1 Introduction

As a powerful statistical model, the Gaussian mixture has been widely applied to data analysis, and several statistical methods exist for its modelling (e.g., the expectation-maximization (EM) algorithm [1] and the k-means algorithm [2]). These methods usually assume that the number of Gaussians in the mixture is known in advance. In many instances, however, this key information is not available, and an appropriate number of Gaussians must be selected together with the estimation of the parameters, which is rather difficult [3]. The traditional approach is to choose a best number k of Gaussians via some selection criterion, and many heuristic criteria have been proposed in the statistical literature (e.g., [4]-[5]). However, evaluating such a criterion incurs a large computational cost, since the entire parameter estimation process must be repeated for a number of different values of k.

Recently, a new approach has been developed from the Bayesian Ying-Yang (BYY) harmony learning theory [6], with the feature that model selection can be made automatically during parameter learning. In fact, it was shown in [7] that this Gaussian mixture modelling problem is equivalent to the maximization of a harmony function on a specific architecture of the BYY system related to the Gaussian mixture model, and a gradient learning rule for maximizing this harmony function was also established.

The simulation experiments in [7] showed that an appropriate number of Gaussians can be allocated automatically for the sample data set, with the mixing proportions of the extra Gaussians attenuating to zero. Moreover, an adaptive gradient learning rule was further proposed and analyzed for the general finite mixture model, and demonstrated on a sample data set from a Gaussian mixture [8].

In this paper, we propose two further gradient learning rules to implement the maximization of the harmony function in the Gaussian mixture setting more efficiently. The first learning rule is constructed from the conjugate gradient of the harmony function, while the second is derived from Amari and Nagaoka's natural gradient theory [9]. Simulation experiments demonstrate that the two new gradient learning rules not only make model selection automatically during parameter learning, but also converge more quickly than the general gradient ones. In the sequel, the conjugate and natural gradient learning rules are derived in Section 2. In Section 3, they are demonstrated by simulation experiments, and a brief conclusion is given in Section 4.

2 Conjugate and Natural Gradient Learning Rules

In this section, we first introduce the harmony function on the Gaussian mixture model and then derive the conjugate and natural gradient learning rules from it.

2.1 The Harmony Function

Under the BYY harmony learning principle, we obtain the following harmony function on a sample data set D_x = {x_t}_{t=1}^{N} from a Gaussian mixture model (refer to [6] or [7] for the derivation):

J(\Theta_k) = \frac{1}{N} \sum_{t=1}^{N} \sum_{j=1}^{k} \frac{\alpha_j q(x_t \mid m_j, \Sigma_j)}{\sum_{i=1}^{k} \alpha_i q(x_t \mid m_i, \Sigma_i)} \ln[\alpha_j q(x_t \mid m_j, \Sigma_j)],   (1)

where q(x | m_j, Σ_j) is a Gaussian density given by

q(x \mid m_j, \Sigma_j) = \frac{1}{(2\pi)^{n/2} |\Sigma_j|^{1/2}} e^{-\frac{1}{2}(x - m_j)^T \Sigma_j^{-1} (x - m_j)},   (2)

m_j is the mean vector, Σ_j is the covariance matrix (assumed positive definite), and α_j is the mixing proportion. Here Θ_k = {α_j, m_j, Σ_j}_{j=1}^{k}, and q(x, Θ_k) = Σ_{j=1}^{k} α_j q(x | m_j, Σ_j) is just the Gaussian mixture density. According to the best harmony learning principle of the BYY system [6], as well as the experimental results obtained in [7]-[8], the maximization of J(Θ_k) realizes parameter learning with automated model selection on a sample data set from a Gaussian mixture.
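For illustration, Eq. (1) can be evaluated directly with a few lines of NumPy/SciPy code. The sketch below is not part of the original work; the function name harmony_J and the argument layout are chosen here only for exposition, and the computation simply combines the posterior weights with the logarithm of α_j q(x_t | m_j, Σ_j).

```python
import numpy as np
from scipy.stats import multivariate_normal

def harmony_J(X, alphas, means, covs):
    """Harmony function J(Theta_k) of Eq. (1) for data X of shape (N, n).

    alphas: length-k mixing proportions; means[j]: mean vector m_j;
    covs[j]: covariance matrix Sigma_j (assumed positive definite).
    """
    N, k = X.shape[0], len(alphas)
    # log[alpha_j * q(x_t | m_j, Sigma_j)], arranged as an (N, k) matrix
    log_aq = np.column_stack([
        np.log(alphas[j]) + multivariate_normal.logpdf(X, mean=means[j], cov=covs[j])
        for j in range(k)
    ])
    # posterior weights p(j | x_t) = alpha_j q_j / sum_i alpha_i q_i, computed stably
    post = np.exp(log_aq - np.logaddexp.reduce(log_aq, axis=1, keepdims=True))
    return float(np.sum(post * log_aq) / N)
```

Maximizing this quantity over {α_j, m_j, Σ_j} is what the learning rules derived below implement.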

For convenience of analysis, we let α_j = e^{β_j} / Σ_{i=1}^{k} e^{β_i} and Σ_j = B_j B_j^T for j = 1, 2, ..., k, where -∞ < β_1, ..., β_k < +∞ and B_j is a nonsingular square matrix. Under these transformations, the parameters in J(Θ_k) turn into {β_j, m_j, B_j}_{j=1}^{k}.

2.2 Conjugate Gradient Learning Rule

We first give the derivatives of J(Θ_k) with respect to β_j, m_j and B_j (refer to [7] for the derivation):

\frac{\partial J(\Theta_k)}{\partial \beta_j} = \frac{\alpha_j}{N} \sum_{i=1}^{k} \sum_{t=1}^{N} h(i \mid x_t)\, U(i \mid x_t)\, (\delta_{ij} - \alpha_i),   (3)

\frac{\partial J(\Theta_k)}{\partial m_j} = \frac{\alpha_j}{N} \sum_{t=1}^{N} h(j \mid x_t)\, U(j \mid x_t)\, \Sigma_j^{-1} (x_t - m_j),   (4)

\frac{\partial J(\Theta_k)}{\partial B_j} = \frac{\partial (B_j B_j^T)}{\partial B_j} \frac{\partial J(\Theta_k)}{\partial \Sigma_j},   (5)

where δ_{ij} is the Kronecker delta, and

U(i \mid x_t) = \sum_{r=1}^{k} (\delta_{ri} - p(r \mid x_t)) \ln[\alpha_r q(x_t \mid m_r, \Sigma_r)] + 1,

h(i \mid x_t) = \frac{q(x_t \mid m_i, \Sigma_i)}{\sum_{r=1}^{k} \alpha_r q(x_t \mid m_r, \Sigma_r)}, \qquad p(i \mid x_t) = \alpha_i h(i \mid x_t),

\frac{\partial J(\Theta_k)}{\partial \Sigma_j} = \frac{\alpha_j}{N} \sum_{t=1}^{N} h(j \mid x_t)\, U(j \mid x_t)\, \Sigma_j^{-1} [(x_t - m_j)(x_t - m_j)^T - \Sigma_j] \Sigma_j^{-1}.

In Eq. (5), ∂J(Θ_k)/∂B_j and ∂J(Θ_k)/∂Σ_j are considered in their vector forms, i.e., vec[∂J(Θ_k)/∂B_j] and vec[∂J(Θ_k)/∂Σ_j], respectively, and ∂(B_j B_j^T)/∂B_j is an n^2-order square matrix which can be easily computed. Combining these β_j, m_j and B_j into a vector θ_k, we have the conjugate gradient learning rule:

\theta_k^{i+1} = \theta_k^{i} + \eta \hat{s}_i,   (6)

where η is the learning rate and the search direction \hat{s}_i is computed from the following iteration of conjugate vectors:

s_1 = \nabla J(\theta_k^1), \qquad \hat{s}_1 = \frac{\nabla J(\theta_k^1)}{\|\nabla J(\theta_k^1)\|},

s_i = \nabla J(\theta_k^i) + v_{i-1} s_{i-1}, \qquad \hat{s}_i = \frac{s_i}{\|s_i\|}, \qquad v_{i-1} = \frac{\|\nabla J(\theta_k^i)\|^2}{\|\nabla J(\theta_k^{i-1})\|^2},

where ∇J(θ_k) is the ordinary gradient vector of J(θ_k) = J(Θ_k) and ‖·‖ is the Euclidean norm.
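The update of Eq. (6) can be written as a short routine. The code below is a minimal sketch under the assumption that the harmony function and its gradient (Eqs. (3)-(5), stacked over all {β_j, m_j, B_j} into one vector) are available as callables J and grad_J; these helpers are hypothetical and not given here.

```python
import numpy as np

def conjugate_gradient_ascent(theta0, J, grad_J, eta=0.1, tol=1e-5, max_iter=1000):
    """Maximize J(theta) with the conjugate gradient rule of Eq. (6),
    using Fletcher-Reeves coefficients and normalized search directions."""
    theta = np.asarray(theta0, dtype=float)
    g = grad_J(theta)                 # gradient at theta^1
    s = g.copy()                      # s_1 = grad J(theta^1)
    J_old = J(theta)
    for _ in range(max_iter):
        s_hat = s / np.linalg.norm(s)         # normalized search direction
        theta = theta + eta * s_hat           # theta^{i+1} = theta^i + eta * s_hat_i
        J_new = J(theta)
        if abs(J_new - J_old) < tol:          # stop when J no longer increases appreciably
            break
        J_old = J_new
        g_new = grad_J(theta)
        v = g_new.dot(g_new) / g.dot(g)       # v_{i-1} = ||grad J_i||^2 / ||grad J_{i-1}||^2
        s = g_new + v * s                     # s_i = grad J(theta^i) + v_{i-1} s_{i-1}
        g = g_new
    return theta
```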

2.3 Natural Gradient Learning Rule

In order to obtain the natural gradient of J(Θ_k), we let θ_j = (m_j, Σ_j) so that Θ_k = {α_j, θ_j}_{j=1}^{k}, which can be considered as a point in a Riemannian space. Then we can construct a k(n^2 + n + 1)-dimensional statistical model S = {q(x, Θ_k) : Θ_k ∈ Ξ}. The Fisher information matrix of S at a point Θ_k is G(Θ_k) = [g_{ij}(Θ_k)], where g_{ij}(Θ_k) is given by

g_{ij}(\Theta_k) = \int \partial_i l(x, \Theta_k)\, \partial_j l(x, \Theta_k)\, q(x, \Theta_k)\, dx,   (7)

where ∂_i denotes the partial derivative with respect to the i-th component of Θ_k and l(x, Θ_k) = ln q(x, Θ_k). Using the following derivatives:

\frac{\partial q(x_t, \Theta_k)}{\partial \beta_j} = \alpha_j \sum_{i=1}^{k} q(x_t \mid \theta_i)(\delta_{ij} - \alpha_i),   (8)

\frac{\partial q(x_t, \Theta_k)}{\partial m_j} = \alpha_j q(x_t \mid \theta_j)\, \Sigma_j^{-1} (x_t - m_j),   (9)

\frac{\partial q(x_t, \Theta_k)}{\partial B_j} = \alpha_j q(x_t \mid \theta_j)\, \Sigma_j^{-1} [(x_t - m_j)(x_t - m_j)^T - \Sigma_j] \Sigma_j^{-1} B_j,   (10)

we can easily estimate G(Θ_k) on a sample data set through the law of large numbers. According to Amari and Nagaoka's natural gradient theory [9], we then have the following natural gradient learning rule:

\Theta_k(m+1) = \Theta_k(m) + \eta\, G^{-1}(\Theta_k(m)) \frac{\partial J(\Theta_k(m))}{\partial \Theta_k},   (11)

where η is the learning rate.
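To make the sample-based estimation of G(Θ_k) concrete, the sketch below builds the Fisher information matrix as the average of the outer products of the score vectors and then applies one update of Eq. (11). It is an illustrative outline, not the authors' implementation: score(x, theta) is assumed to return ∂ ln q(x, Θ_k)/∂θ (Eqs. (8)-(10) divided by q(x, Θ_k) and stacked into a vector), grad_J(theta) the gradient of the harmony function, and the small ridge term is added only to keep the linear solve well posed.

```python
import numpy as np

def natural_gradient_step(theta, X, score, grad_J, eta=0.1, ridge=1e-8):
    """One natural gradient update of Eq. (11):
    theta <- theta + eta * G^{-1}(theta) * grad J(theta)."""
    d = theta.size
    G = np.zeros((d, d))
    for x in X:                           # law of large numbers: G ~ (1/N) sum_t s_t s_t^T
        s = score(x, theta)
        G += np.outer(s, s)
    G /= X.shape[0]
    G += ridge * np.eye(d)                # tiny regularizer so G stays invertible
    return theta + eta * np.linalg.solve(G, grad_J(theta))
```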

3 Simulation Experiments

We conducted experiments on seven sets (a)-(g) of samples drawn from mixtures of three or four bivariate Gaussian densities (i.e., n = 2). As shown in Figure 1, each sample data set consists of three or four Gaussian clusters with a certain degree of overlap.

Fig. 1. Seven sets of sample data used in the experiments.

Using k* to denote the number of Gaussians in the original mixture, we implemented the conjugate and natural gradient learning rules on these seven sample data sets, always with k* ≤ k ≤ 3k* and η = 0.1. The other parameters were initialized randomly within certain intervals. In all the experiments, the learning was stopped when the increase |J(Θ_k^{new}) - J(Θ_k^{old})| fell below a small preset threshold.

The experimental results of the conjugate and natural gradient learning rules on data sets (c) and (d) are given in Figures 2 and 3, respectively, for the case k = 8 and k* = 4. We can observe that four Gaussians are finally located accurately, while the mixing proportions of the other four Gaussians were reduced to below 0.01; that is, these Gaussians are extra and can be discarded. In other words, the correct number of clusters has been detected on these data sets.

Fig. 2. The experimental result of the conjugate gradient learning rule on data set (c) (stopped after 63 iterations).

Fig. 3. The experimental result of the natural gradient learning rule on data set (d) (stopped after 149 iterations).

Moreover, the two learning rules were run on the other sample data sets in various settings and showed similar results on automated model selection. For example, the natural gradient learning rule was implemented on data set (f) with k = 8 and k* = 4. As shown in Figure 4, even though each cluster contains only a small number of samples, the correct number of clusters can still be detected, with the mixing proportions of the four extra Gaussians again reduced to negligible values.

Fig. 4. The experimental result of the natural gradient learning rule on data set (f) (stopped after 173 iterations).

In addition to detecting the correct number of Gaussians, we further compared the converged parameter values (after discarding the extra Gaussians) with the parameters of the mixture from which the samples were drawn. Checking the results of these experiments, we found that the conjugate and natural gradient learning rules converge with a low average error between the estimated parameters and the true parameters.
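The model-selection readout described above amounts to recovering the mixing proportions from the β_j parameterization and discarding the components whose proportions have been driven towards zero. A minimal sketch (the 0.01 threshold follows the experiments; the function name is chosen here for illustration):

```python
import numpy as np

def select_components(betas, threshold=0.01):
    """Recover alpha_j = exp(beta_j) / sum_i exp(beta_i) and report which
    Gaussians survive: components whose mixing proportion stays above the
    threshold are kept, the rest are treated as extra and discarded."""
    betas = np.asarray(betas, dtype=float)
    alphas = np.exp(betas - betas.max())   # subtract the max for numerical stability
    alphas /= alphas.sum()
    kept = np.flatnonzero(alphas >= threshold)
    return alphas, kept
```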

In comparison with the simulation results of the batch and adaptive gradient learning rules [7]-[8] on these seven sets of sample data, we found that the conjugate and natural gradient learning rules converge more quickly than the two general gradient ones. In most cases, the simulation experiments showed that the number of iterations required by each of the two new rules is only about one tenth to one quarter of the number required by either the batch or the adaptive gradient learning rule. Compared with each other, the conjugate gradient learning rule converges more quickly than the natural gradient learning rule, but the natural gradient learning rule obtains a more accurate parameter estimate.

4 Conclusion

We have proposed the conjugate and natural gradient learning rules for BYY harmony learning on the Gaussian mixture with automated model selection. They are derived from the conjugate gradient method and from Amari and Nagaoka's natural gradient theory, respectively, for the maximization of the harmony function defined on the Gaussian mixture model. The simulation experiments have demonstrated that both the conjugate and natural gradient learning rules lead to the correct selection of the number of actual Gaussians as well as a good estimate of the parameters of the original Gaussian mixture. Moreover, they converge more quickly than the general gradient ones.

References

1. R. A. Redner and H. F. Walker, Mixture densities, maximum likelihood and the EM algorithm, SIAM Review, vol. 26, no. 2.
2. A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Englewood Cliffs, NJ: Prentice Hall.
3. J. A. Hartigan, Distribution problems in clustering, in Classification and Clustering, J. Van Ryzin, Ed., New York: Academic Press.
4. H. Akaike, A new look at the statistical model identification, IEEE Transactions on Automatic Control, vol. AC-19.
5. G. W. Milligan and M. C. Cooper, An examination of procedures for determining the number of clusters in a data set, Psychometrika, vol. 46.
6. L. Xu, Best harmony, unified RPCL and automated model selection for unsupervised and supervised learning on Gaussian mixtures, three-layer nets and ME-RBF-SVM models, International Journal of Neural Systems, vol. 11, no. 1.
7. J. Ma, T. Wang and L. Xu, A gradient BYY harmony learning rule on Gaussian mixture with automated model selection, Neurocomputing, vol. 56.
8. J. Ma, T. Wang and L. Xu, An adaptive BYY harmony learning algorithm and its relation to rewarding and penalizing competitive learning mechanism, Proc. of ICSP'02, Aug. 2002, vol. 2.
9. S. Amari and H. Nagaoka, Methods of Information Geometry, AMS and Oxford University Press, 2000.
