Model-based Clustering by Probabilistic Self-organizing Maps


IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. XX, NO. X, 1

Model-based Clustering by Probabilistic Self-organizing Maps

Shih-Sian Cheng, Hsin-Chia Fu, Member, IEEE, and Hsin-Min Wang, Member, IEEE

Abstract—In this paper, we consider the learning process of a probabilistic self-organizing map (PbSOM) as a model-based data clustering procedure that preserves the topological relationships between data clusters in a neural network. Based on this concept, we develop a coupling-likelihood mixture model for the PbSOM that extends the reference vectors in Kohonen's SOM to multivariate Gaussian distributions. We also derive three EM-type algorithms, called the SOCEM, SOEM, and SODAEM algorithms, for learning the model (PbSOM) based on the maximum likelihood criterion. SOCEM is derived by using the classification EM (CEM) algorithm to maximize the classification likelihood; SOEM is derived by using the EM algorithm to maximize the mixture likelihood; and SODAEM is a deterministic annealing (DA) variant of SOCEM and SOEM. Moreover, by shrinking the neighborhood size, SOCEM and SOEM can be interpreted, respectively, as DA variants of the CEM and EM algorithms for Gaussian model-based clustering. The experiment results show that the proposed PbSOM learning algorithms achieve comparable data clustering performance to that of the deterministic annealing EM (DAEM) approach, while maintaining the topology-preserving property.

Index Terms—model-based clustering, self-organizing map (SOM), probabilistic self-organizing map (PbSOM), EM algorithm, DAEM algorithm, CEM algorithm.

I. INTRODUCTION

In model-based clustering, data samples are grouped by learning a mixture model (usually a Gaussian mixture model) in which each mixture component represents a group or cluster. There are two major learning methods for model-based clustering: the mixture likelihood approach, where the likelihood of each data sample is a mixture of all the component likelihoods of the data sample; and the classification likelihood approach, where the likelihood of each data sample is generated by its winning component only [1], [2], [3], [4], [5], [6], [7], [8], [9].
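To make the contrast between the two criteria concrete, the following sketch (our own illustrative code, not from the paper; all function names are ours) computes both log-likelihoods for a toy 1-D Gaussian mixture. The mixture criterion sums over all components per sample, while the classification criterion credits each sample only to its winning component:

```python
import numpy as np

def gauss_pdf(x, mu, var):
    # 1-D Gaussian density N(x; mu, var)
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def mixture_loglik(X, w, mu, var):
    # Mixture likelihood: each sample's likelihood is a weighted sum
    # over ALL components.
    comp = w[None, :] * gauss_pdf(X[:, None], mu[None, :], var[None, :])
    return np.sum(np.log(comp.sum(axis=1)))

def classification_loglik(X, w, mu, var):
    # Classification likelihood: each sample is credited only to its
    # winning (maximum-posterior) component.
    comp = w[None, :] * gauss_pdf(X[:, None], mu[None, :], var[None, :])
    return np.sum(np.log(comp.max(axis=1)))

# Arbitrary toy setup: two well-separated unit-variance components.
X = np.array([-5.2, -4.9, 4.8, 5.1])
w = np.array([0.5, 0.5])
mu = np.array([-5.0, 5.0])
var = np.array([1.0, 1.0])
# A sum over components is at least its maximum term, so the mixture
# log-likelihood always upper-bounds the classification one.
print(mixture_loglik(X, w, mu, var) >= classification_loglik(X, w, mu, var))
```

For well-separated clusters the two criteria nearly coincide, since the winning component dominates each sample's mixture sum.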
In both approaches, when the globally optimal estimation of the model parameters cannot be obtained analytically, iterative learning algorithms that only guarantee obtaining locally optimal solutions are usually employed. The expectation-maximization (EM) algorithm for mixture likelihood learning [10], [11] and the classification EM (CEM) algorithm for classification likelihood learning [8] are two such algorithms. However, a critical aspect of the EM and CEM algorithms is that their learning performance is very sensitive to the initial conditions of the model's parameters. To address this issue, Ueda and Nakano [12] proposed a deterministic annealing EM (DAEM) algorithm that tackles the initialization issue via a deterministic annealing process, which performs robust optimization based on an analogy to the cooling of a system in statistical physics. Some heuristic-like learning algorithms have also been proposed. For example, in [13], the authors propose an algorithm that finds the appropriate initial conditions for EM learning by using split and merge operations. Another method, proposed in [14], overcomes the initialization issue of EM by iteratively splitting the mixture components using the Bayesian Information Criterion as the splitting validity measure.

Manuscript received November 2007; revised November 2008. This work was supported in part by the National Science Council of Taiwan, R.O.C., under Grants NSC E--24-Y3 and NSC96-33-H. S.-S. Cheng is with the Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan, R.O.C., and also with the Institute of Information Science, Academia Sinica, Taipei, Taiwan, R.O.C. (e-mail: sscheng@iis.sinica.edu.tw). H.-C. Fu is with the Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan, R.O.C. (e-mail: hcfu@csie.nctu.edu.tw). H.-M. Wang is with the Institute of Information Science, Academia Sinica, Taipei, Taiwan, R.O.C. (e-mail: whm@iis.sinica.edu.tw).
In addition to the initialization issue of the learning algorithms, conventional model-based clustering suffers from another limitation in that it cannot preserve the topological relationships among clusters after the clustering procedure. To overcome this shortcoming, the clustering task can be performed by using Kohonen's self-organizing map (SOM) [15], [16], a well-known neural network model for data clustering and visualization. After the clustering procedure, the topological relationships among data clusters can be preserved (or visualized) on the network, which is usually a two-dimensional lattice. Kohonen's sequential and batch SOM learning algorithms have proved successful in many practical applications [15], [16]. However, they also suffer from some shortcomings, such as the lack of an objective (cost) function, a general proof of convergence, and a probabilistic framework [17]. The following are some related works that have addressed these issues. In [18], [19], the behavior of Kohonen's sequential learning algorithm was studied in terms of energy functions, based on which Cheng [20] proposed an energy function for SOM whose parameters can be learned by a K-means type algorithm. Luttrell [21], [22] proposed a noisy vector quantization model called the topographic vector quantizer (TVQ), whose training process coincides with the learning process of SOM. The cost function of TVQ represents the topographic distortion between the input data and the output code vectors in terms of Euclidean distance. Graepel et al. [23], [24] derived a soft topographic vector quantization (STVQ) algorithm by applying a deterministic annealing process to the optimization of TVQ's cost function. Based on the topographic distortion concept, Heskes [25] applied a different DA implementation from that of STVQ, and obtained an algorithm identical to STVQ when the quantization error is

expressed in terms of Euclidean distance. In [26], the authors proposed an on-line algorithm for STVQ; later, motivated by STVQ, they proposed a data visualization method that integrates SOM and multi-dimensional scaling [27]. Based on the Bayesian analysis of SOMs in [28], Anouar et al. [29] proposed a probabilistic formalism for SOM, where the parameters are learned by a K-means type algorithm. To help users select the correct model complexity for SOM by probabilistic assessment, Lampinen and Kostiainen [30] developed a generative model in which the SOM is trained by Kohonen's algorithm. Meanwhile, Van Hulle [31] developed a kernel-based topographic formation in which the parameters are adjusted to maximize the joint entropy of the kernel outputs. He subsequently developed a new algorithm with heteroscedastic Gaussian mixtures that allows for a unified account of vector quantization, log-likelihood, and Kullback-Leibler divergence [32]. Another probabilistic formulation is proposed in [33], whereby a normalized neighborhood function of SOM is used as the posterior distribution in the E-step of the EM algorithm for a mixture model to enforce the self-organizing of the mixture components. Sum et al. [34] interpreted Kohonen's sequential learning algorithm in terms of maximizing the local correlations (coupling energies) between neurons and their neighborhoods for the given input data. They then proposed an energy function for SOM that reveals the correlations, and a gradient ascent learning algorithm for the energy function.

In Kohonen's SOM architecture, neurons in the network associate with reference vectors in the data space. This contrasts with a SOM whose neurons associate with reference models that represent probability distributions, such as the isotropic Gaussians used in [33] and the heteroscedastic Gaussians used in [29], [32]. In this paper, we call the latter a probabilistic SOM (PbSOM). Motivated by the coupling energy concept in Sum et al.
's work [34], we develop a coupling-likelihood mixture model for the PbSOM that uses multivariate Gaussian distributions as the reference models. In the proposed model, local coupling energies between neurons and their neighborhoods are expressed in terms of probabilistic likelihoods, and each mixture component expresses the local coupling-likelihood between one neuron and its neighborhood. Based on this model, we develop CEM, EM, and DAEM algorithms for learning PbSOMs, namely the SOCEM, SOEM, and SODAEM algorithms, respectively. Because they inherit the properties of the EM and CEM algorithms, the proposed algorithms are characterized by reliable convergence, low cost per iteration, economy of storage, and ease of programming. From our experiments on the self-organizing property, we observe that SOEM is less sensitive to the initialization of the parameters than SOCEM when using a small, fixed neighborhood, while SODAEM overcomes the initialization problem of SOCEM and SOEM through an annealing process. Furthermore, we show that SOCEM and SOEM can be interpreted, respectively, as deterministic annealing variants of the CEM and EM algorithms for Gaussian model-based clustering, where the neighborhood shrinking is interpreted as an annealing process. We conducted experiments on data sets from the UCI Machine Learning Database Repository [35]. The experiment results show that the proposed PbSOM learning algorithms achieve comparable data clustering performance to the DAEM algorithm, while maintaining the topology-preserving property.

The remainder of the paper is organized as follows. In Sec. II, we review the EM, CEM, and DAEM algorithms for model-based clustering. Then, the proposed coupling-likelihood mixture model and the SOCEM, SOEM, and SODAEM algorithms are described in Sec. III. The experiment results are detailed in Sec. IV. The differences and relations between the proposed algorithms and other related algorithms are discussed in Sec. V. We then present our conclusions in Sec. VI.

II. THE EM, CEM, AND DAEM ALGORITHMS FOR MODEL-BASED CLUSTERING

A.
The mixture likelihood approach and the EM algorithm

In the mixture likelihood approach for model-based clustering, it is assumed that the given data set X = {x_1, x_2, …, x_N} ⊂ R^d is generated by a set of independently and identically distributed (i.i.d.) random vectors from a mixture model:

  p(x_i; Θ) = Σ_{k=1}^M w(k) p(x_i; θ_k),   (1)

where w(k) is the mixing weight of the mixture component p(x_i; θ_k), subject to w(k) ≥ 0 for k = 1, 2, …, M and Σ_{k=1}^M w(k) = 1; and θ_k denotes the parameter set of p(x_i; θ_k). The maximum likelihood estimate of the parameter set of the mixture model, Θ̂ = {ŵ(1), ŵ(2), …, ŵ(M), θ̂_1, θ̂_2, …, θ̂_M}, can be obtained by maximizing the following log-likelihood function:

  L(Θ; X) = Σ_{i=1}^N log p(x_i; Θ) = Σ_{i=1}^N log( Σ_{k=1}^M w(k) p(x_i; θ_k) ).   (2)

This is usually achieved by using the expectation-maximization (EM) algorithm [10], [11]. After learning the mixture model, we derive a partition of X, Ĉ = {Ĉ_1, Ĉ_2, …, Ĉ_M}, by assigning each x_i ∈ X to the mixture component that has the largest posterior probability for x_i, i.e., x_i ∈ Ĉ_j if j = arg max_k p(θ̂_k | x_i; Θ̂).

1) The EM algorithm for mixture models: If the maximum likelihood estimation of the parameters cannot be accomplished analytically, the EM algorithm is normally used as an alternative approach when the given data is incomplete or contains hidden information. In the case of the mixture model, suppose that Θ^(t) denotes the current estimate of the parameter set, and k is the hidden variable that indicates the mixture component from which the observation is generated. The E-step of the EM algorithm then computes the following so-called auxiliary function:

  Q(Θ; Θ^(t)) = Σ_{i=1}^N Σ_{k=1}^M p(k | x_i; Θ^(t)) log p(x_i, k; Θ),   (3)

where

  p(x_i, k; Θ) = w(k) p(x_i; θ_k),   (4)

and

  p(k | x_i; Θ^(t)) = w(k)^(t) p(x_i; θ_k^(t)) / Σ_{j=1}^M w(j)^(t) p(x_i; θ_j^(t))   (5)

denotes the posterior probability of the kth mixture component for x_i with the given Θ^(t). Then, in the following M-step, the Θ^(t+1) that satisfies

  Q(Θ^(t+1); Θ^(t)) = max_Θ Q(Θ; Θ^(t))   (6)

is chosen as the new estimate of the parameter set. By iteratively creating the auxiliary function in Eq. (3) and performing the subsequent maximization step, the EM algorithm is guaranteed to converge to a local maximum of the log-likelihood function in Eq. (2). When Q(Θ; Θ^(t)) cannot be maximized analytically, the M-step is modified to find some Θ^(t+1) such that Q(Θ^(t+1); Θ^(t)) > Q(Θ^(t); Θ^(t)). This type of algorithm, called Generalized EM (GEM), is also guaranteed to converge to a local maximum [10], [11].

B. The classification likelihood approach and the CEM algorithm

In the classification likelihood approach for model-based clustering [6], [7], [8], instead of maximizing the log-likelihood function of the mixture model in Eq. (2), the objective is to find the partition Ĉ = {Ĉ_1, Ĉ_2, …, Ĉ_M} of X and the model parameters that maximize

  L_1(C, {θ_1, θ_2, …, θ_M}; X) = Σ_{k=1}^M Σ_{x_i ∈ C_k} log p(x_i; θ_k),   (7)

or

  L_2(C, Θ; X) = Σ_{k=1}^M Σ_{x_i ∈ C_k} log( w(k) p(x_i; θ_k) ).   (8)

The relation between L_1 and L_2 is

  L_2(C, Θ; X) = L_1(C, {θ_1, θ_2, …, θ_M}; X) + Σ_{k=1}^M |C_k| log w(k),   (9)

where |C_k| denotes the number of samples in C_k. If all the mixture components are equally weighted, Σ_{k=1}^M |C_k| log w(k) becomes a constant, such that L_1 and L_2 are equivalent.

1) The CEM algorithm for mixture models: Celeux and Govaert [8] proposed the Classification EM (CEM) algorithm for estimating the parameter set Θ and partition C. Like the EM algorithm, the CEM algorithm is also an iterative learning approach. In each iteration, CEM inserts a classification step (C-step) between the E-step and M-step of the EM algorithm. In the E-step, the posterior probability of each mixture component is calculated for each data sample.
In the C-step, to obtain the partition Ĉ of the data samples, each sample is assigned to the mixture component that yields the largest posterior probability for that sample. In the M-step, the maximization process is applied to Ĉ_k individually for k = 1, 2, …, M. For example, if a multivariate Gaussian is used as the mixture component, the re-estimated mean vector and covariance matrix are the mean vector and the covariance matrix of the data samples in Ĉ_k, respectively; while the re-estimated mixture weight is |Ĉ_k|/N. From a practical point of view, CEM is a K-means type algorithm that represents the prototypes with probability distributions [8].

C. The DAEM algorithm

In the DAEM algorithm for learning a mixture model [12], the objective is to minimize the following system energy function during the annealing process:

  F_β(Θ; X) = −(1/β) Σ_{i=1}^N log( Σ_{k=1}^M (w(k) p(x_i; θ_k))^β ),   (10)

where 1/β corresponds to the temperature that controls the annealing process. The auxiliary function in this case is

  U_β(Θ; Θ^(t)) = −Σ_{i=1}^N Σ_{k=1}^M f(k | x_i; Θ^(t)) log p(x_i, k; Θ),   (11)

where

  f(k | x_i; Θ^(t)) = (w(k)^(t) p(x_i; θ_k^(t)))^β / Σ_{j=1}^M (w(j)^(t) p(x_i; θ_j^(t)))^β   (12)

is the posterior probability derived by using the maximum entropy principle. Ueda and Nakano [12] showed that F_β(Θ; X) can be iteratively minimized by iteratively minimizing U_β(Θ; Θ^(t)). When using DAEM to learn a mixture model, β is initialized with a small value (less than 1) such that the energy function itself is simple enough to be optimized. Then, the value of β is gradually increased to 1. During the learning process, the parameters learned in the current learning phase are used as the initial parameters of the next phase. In the case of β = 1, F_β(Θ; X) and U_β(Θ; Θ^(t)) are the negatives of the log-likelihood function in Eq. (2) and the Q-function in Eq. (3), respectively; thus, minimizing F_β(Θ; X) is equivalent to maximizing the log-likelihood function. According to [12], Eq.
(10) can be rewritten as

  F_β(Θ; X) = U_β(Θ) − (1/β) S_β(Θ),   (13)

where

  U_β(Θ) = −Σ_{i=1}^N Σ_{k=1}^M f(k | x_i; Θ) log p(x_i, k; Θ),   (14)

and

  S_β(Θ) = −Σ_{i=1}^N Σ_{k=1}^M f(k | x_i; Θ) log f(k | x_i; Θ)   (15)

is the entropy of the posterior distribution. When β → ∞, the rational function f(k | x_i; Θ) approximates to a zero-one function; thus, the entropy term S_β(Θ) → 0. In this case, F_β(Θ; X) is equivalent to the negative of the objective function for CEM in Eq. (8). Therefore, DAEM can be viewed as a DA variant of CEM. Each β value corresponds to a learning phase. The algorithm proceeds to the next phase after it converges in the current phase.
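The effect of the temperature 1/β on the tempered posterior of Eq. (12) can be sketched numerically (this is our own illustrative code with arbitrary component parameters, not the authors' implementation):

```python
import numpy as np

def daem_posterior(x, w, mu, var, beta):
    # f(k|x) ∝ (w_k p(x; θ_k))^β, as in Eq. (12): the posterior derived
    # from the maximum entropy principle, tempered by β.
    p = w * np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    g = p ** beta
    return g / g.sum()

# Arbitrary toy mixture: two unit-variance components at -5 and 5.
w = np.array([0.5, 0.5])
mu = np.array([-5.0, 5.0])
var = np.array([1.0, 1.0])
x = 1.0  # a sample closer to the second component

# Small beta (high temperature): the posterior is nearly uniform,
# so the energy surface is smooth and easy to optimize.
hot = daem_posterior(x, w, mu, var, beta=0.01)
# beta = 1: the ordinary EM posterior of Eq. (5).
em = daem_posterior(x, w, mu, var, beta=1.0)
# Large beta (temperature -> 0): approaches the zero-one,
# winner-take-all assignment used by CEM.
cold = daem_posterior(x, w, mu, var, beta=50.0)
print(hot, em, cold)
```

Gradually increasing β thus interpolates from a near-uniform soft assignment toward the hard assignment of the classification likelihood approach.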

III. THE EM-TYPE ALGORITHMS FOR LEARNING PbSOM

A. Formulation of the coupling-likelihood mixture model

In this paper, we define a PbSOM as a SOM that consists of M neurons, R = {r_1, r_2, …, r_M}, in a network with a neighborhood function h_{kl} that defines the strength of lateral interaction between two neurons, r_k and r_l, for k, l ∈ {1, 2, …, M}; and each neuron r_k associates with a reference model θ_k that represents some probability distribution in the data space. Sum et al. [34] interpreted Kohonen's sequential learning algorithm in terms of maximizing the local correlations (coupling energies) between the neurons and their neighborhoods with the given input data. Given a data sample x_i ∈ X = {x_1, x_2, …, x_N}, the local coupling energy between r_k and its neighborhood is defined as

  E_{x_i,k} = Σ_{l=1}^M h_{kl} r_k(x_i; θ_k) r_l(x_i; θ_l) = r_k(x_i; θ_k) Σ_{l=1}^M h_{kl} r_l(x_i; θ_l),   (16)

where r_k(x_i; θ_k) denotes the response of neuron r_k to x_i, which is modeled by an isotropic Gaussian density. Then, the coupling energy over the network for x_i is defined as

  E_{x_i} = Σ_{k=1}^M E_{x_i,k},   (17)

and the energy function to be maximized is

  E = Σ_{i=1}^N log E_{x_i}.   (18)

In Eq. (16), the term Σ_l h_{kl} r_l(x_i; θ_l) can be considered as the neighborhood response of r_k, where the conjunction between the neuron responses is implemented using the summing operation. In this study, we express the neuron response r_l(x_i; θ_l) as a multivariate Gaussian distribution as follows:

  r_l(x_i; θ_l) = (2π)^{−d/2} |Σ_l|^{−1/2} exp( −(1/2)(x_i − μ_l)^T Σ_l^{−1} (x_i − μ_l) )   (19)

for l = 1, 2, …, M; and formulate the neighborhood response of r_k as

  Π_{l=1}^M r_l(x_i; θ_l)^{h_{kl}},   (20)

where the conjunction between the neuron responses in the neighborhood of r_k is implemented using the multiplicative operation. Then, for a given x_i, we define the local coupling energy between r_k and its neighborhood as the following coupling-likelihood:

  p_s(x_i | k; Θ, h) = r_k(x_i; θ_k)^{h_{kk}} Π_{l≠k} r_l(x_i; θ_l)^{h_{kl}} = Π_{l=1}^M r_l(x_i; θ_l)^{h_{kl}} = exp( Σ_{l=1}^M h_{kl} log r_l(x_i; θ_l) ),   (21)

where Θ is the set of reference models, and h denotes the given neighborhood function.²
Then, we define the coupling-likelihood of x_i over the network as the following (unnormalized) mixture likelihood:

  p_s(x_i; Θ, h) = Σ_{k=1}^M w_s(k) p_s(x_i | k; Θ, h),   (22)

where w_s(k), for k = 1, 2, …, M, is fixed at 1/M. Note that, theoretically, the mixture weights can be learned automatically. When maximizing the local coupling-likelihood p_s(x_i | k; Θ, h) for each neuron r_k, k = 1, 2, …, M, the topological order between neuron r_k and its neighborhood for the given data sample x_i is learned in the learning process; therefore, we use equal mixture weights in the mixture model to take account of the topological order learning induced by the neurons faithfully (with equal prior importance). In fact, this is important for learning an ordered map. From our experimental analysis, if the mixture weights are updated in the learning process, the learning of topological order is frequently dominated by some particular mixture components, which makes it difficult to obtain an ordered map. For details, one can refer to the Appendix after reading the main body of the paper.

Comparing the network structure of the proposed coupling-likelihood mixture model in Eq. (22) with that of the Gaussian mixture model (GMM), as shown in Fig. 1, the proposed model inserts a coupling-likelihood layer between the Gaussian likelihood layer and the mixture likelihood layer to take account of the coupling between the neurons and their neighborhoods. When the neighborhood size is reduced to zero (i.e., h_{kl} = δ_{kl}), the coupling-likelihood mixture model becomes a GMM with equal mixture weights. Note that other probability distributions are possible for r_l(x_i; θ_l) in the formulation of the coupling-likelihood mixture model, although we use the multivariate Gaussian distribution in this paper.

B. The SOCEM algorithm

The self-organizing process of PbSOM can be described as a model-based data clustering procedure that preserves the spatial relationships between the clusters in a network. Based on the classification likelihood criterion for data clustering

²From Eq.
(21), it is obvious that, in our formulation, the coupling between r_k and its neighboring neurons is considered jointly, whereas Sum et al.'s formulation considers it in a pairwise manner, as shown in Eq. (16). Note that we use the term coupling-likelihood instead of coupling energy for two reasons: 1) Eq. (21) is a coupling of Gaussian likelihoods; and 2) using coupling-likelihood can help describe the link between our proposed approaches and model-based clustering.
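A small numerical sketch may clarify the construction of the coupling-likelihood in Eq. (21). This is our own illustrative code (a 1-D three-neuron chain with arbitrary parameters), showing that the neighbors' responses are conjoined multiplicatively, i.e., additively in the log domain, and that the model collapses to plain per-neuron Gaussian likelihoods when the neighborhood shrinks to zero:

```python
import numpy as np

def log_gauss(x, mu, var):
    # log of the 1-D Gaussian response r_l(x; θ_l)
    return -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)

def coupling_loglik(x, mu, var, H):
    # log p_s(x | k) = Σ_l h_{kl} log r_l(x; θ_l), as in Eq. (21)
    return H @ log_gauss(x, mu, var)

# Three neurons on a 1-D lattice at positions 0, 1, 2, with a
# Gaussian neighborhood kernel (sigma = 1).
pos = np.array([0.0, 1.0, 2.0])
H = np.exp(-(pos[:, None] - pos[None, :]) ** 2 / 2.0)

# Arbitrary reference models: unit-variance Gaussians at -2, 0, 2.
mu = np.array([-2.0, 0.0, 2.0])
var = np.ones(3)
x = 2.1

ll = coupling_loglik(x, mu, var, H)   # one coupling-log-likelihood per neuron k
# Zero-size neighborhood (H = identity, h_kl = delta_kl): the
# coupling-likelihood collapses to each neuron's own Gaussian likelihood.
ll0 = coupling_loglik(x, mu, var, np.eye(3))
print(ll, ll0)
```

Here the sample near 2 is best explained by the third neuron together with its neighborhood, so that component attains the largest coupling-log-likelihood.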

[8], the computation of the coupling-likelihood of a data sample is restricted to its winning neuron. Thus, the goal is to estimate the partition of X, Ĉ = {Ĉ_1, Ĉ_2, …, Ĉ_M}, and the set of reference models, Θ̂, so as to maximize the accumulated classification log-likelihood over all the data samples as follows:

  L_s(C, Θ; X, h) = Σ_{k=1}^M Σ_{x_i ∈ C_k} log( w_s(k) p_s(x_i | k; Θ, h) ) = Σ_{k=1}^M Σ_{x_i ∈ C_k} log( w_s(k) exp( Σ_{l=1}^M h_{kl} log r_l(x_i; θ_l) ) ).   (23)

Fig. 1. (a) The network structure of a Gaussian mixture model, and (b) the proposed coupling-likelihood mixture model. Here, r_l(x_i; θ_l) denotes the multivariate Gaussian distribution described in Eq. (19).

As w_s(k), for k = 1, 2, …, M, is fixed at 1/M, the objective function can be rewritten as

  L_s(C, Θ; X, h) = Σ_{k=1}^M Σ_{x_i ∈ C_k} Σ_{l=1}^M h_{kl} log r_l(x_i; θ_l) + Const.   (24)

Similar to the derivation of the classification EM (CEM) algorithm for model-based clustering in [8], the CEM algorithm for the proposed PbSOM, i.e., the SOCEM algorithm, is derived as follows.

E-step: Given the current reference model set, Θ^(t), compute the posterior probability of each mixture component of p_s(x_i; Θ^(t), h) for each x_i as follows:

  γ_{k|i}^(t) = p_s(k | x_i; Θ^(t), h) = p_s(x_i, k; Θ^(t), h) / p_s(x_i; Θ^(t), h) = exp( Σ_{l=1}^M h_{kl} log r_l(x_i; θ_l^(t)) ) / Σ_{j=1}^M exp( Σ_{l=1}^M h_{jl} log r_l(x_i; θ_l^(t)) ),   (25)

for k = 1, 2, …, M, and i = 1, 2, …, N.

C-step: Assign each x_i to the cluster whose corresponding mixture component has the largest posterior probability for x_i, i.e., x_i ∈ Ĉ_j^(t) if j = arg max_k γ_{k|i}^(t).

M-step: After the C-step, the partition of X (i.e., Ĉ^(t)) is formed, and the objective function L_s defined in Eq. (24) becomes

  L_s(Θ; Ĉ^(t), X, h) = Σ_{k=1}^M Σ_{x_i ∈ Ĉ_k^(t)} Σ_{l=1}^M h_{kl} log r_l(x_i; θ_l) + Const.   (26)

Similar to the derivation of the M-step of the EM algorithm for learning a Gaussian mixture model [11], we can obtain the re-estimation formulae for the mean vectors and covariance matrices by substituting Eq. (19) into Eq.
(26), taking the derivative of L_s with respect to individual parameters, and then setting it to zero. The re-estimation formulae are as follows:

  μ_l^(t+1) = ( Σ_{k=1}^M Σ_{x_i ∈ Ĉ_k^(t)} h_{kl} x_i ) / ( Σ_{k=1}^M |Ĉ_k^(t)| h_{kl} ),   (27)

  Σ_l^(t+1) = ( Σ_{k=1}^M Σ_{x_i ∈ Ĉ_k^(t)} h_{kl} (x_i − μ_l^(t+1))(x_i − μ_l^(t+1))^T ) / ( Σ_{k=1}^M |Ĉ_k^(t)| h_{kl} )   (28)

for l = 1, 2, …, M. When the neighborhood size is reduced to zero (i.e., h_{kl} = δ_{kl}), SOCEM reduces to the CEM algorithm for learning GMMs with equal mixture weights.

1) SOCEM - a DA variant of CEM for GMMs: Similar to Kohonen's sequential or batch SOM algorithm, the SOCEM algorithm is applied in two stages. First, it is applied with a large neighborhood to form an ordered map near the center of the data samples. Then, the reference models are adapted to fit the distribution of the data samples by gradually shrinking the neighborhood.
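One SOCEM iteration (the E-step of Eq. (25), the C-step, and the mean update of Eq. (27)) can be sketched in a few lines. The following is our own simplified code for 1-D data with fixed unit variances, not the authors' implementation:

```python
import numpy as np

def socem_step(X, mu, H):
    # E-step, Eq. (25): coupling-log-likelihood score of each component.
    # With unit-variance 1-D Gaussians, the constant terms cancel in the
    # winner comparison, so only the quadratic term is needed.
    log_r = -0.5 * (X[:, None] - mu[None, :]) ** 2   # shape (N, M)
    score = log_r @ H.T                              # Σ_l h_{kl} log r_l
    # C-step: winner-take-all assignment to the best component.
    win = np.argmax(score, axis=1)                   # shape (N,)
    # M-step, Eq. (27): neighborhood-weighted means over the partition.
    W = H[win]                                       # weight h_{win_i, l}
    return (W.T @ X) / W.sum(axis=0), win

# Two neurons on a 1-D lattice; a large sigma (strong coupling) pulls
# all means toward the center of the data, as in the annealing phase.
pos = np.array([0.0, 1.0])
sigma = 2.0
H = np.exp(-(pos[:, None] - pos[None, :]) ** 2 / (2 * sigma ** 2))

X = np.array([-5.0, -4.5, 4.5, 5.0])   # arbitrary toy data
mu = np.array([-1.0, 1.0])             # arbitrary initial means
mu_new, win = socem_step(X, mu, H)
print(mu_new, win)
```

With this wide neighborhood the updated means stay close to the data center rather than jumping to the cluster centroids, illustrating the high-temperature behavior described above.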

Without loss of generality, we suppose the neighborhood function is the widely adopted (unnormalized) Gaussian kernel

  h_{kl} = exp( −‖r_k − r_l‖² / (2σ²) ),   (29)

where ‖r_k − r_l‖ is the Euclidean distance between two neurons r_k and r_l in the network. Initially, SOCEM is applied with a large σ value, which is reduced after the algorithm converges. Then, we use the new σ value and the learned parameters as the initial condition of the next learning phase. This process is repeated until the value of σ is reduced to the pre-defined minimum value σ_min. The above shrinking of the neighborhood (reduction of the σ value) can be interpreted as an annealing process, where a large σ value corresponds to a high temperature. Table I lists the learning rules of the DAEM algorithm for learning GMMs with equal mixture weights [12] and the SOCEM algorithm. To facilitate the interpretation, we rewrite the objective function and re-estimation formulae of SOCEM in Eq. (24) and Eqs. (27)-(28), respectively, with the new variable win_i, which denotes the index of the winning neuron of x_i. For simplicity, we only list the re-estimation formulae of the mean vectors of the Gaussian components. By analyzing these two algorithms carefully, one may view h_{win_i^(t), l} as a kind of posterior probability of θ_l^(t) for x_i in the network domain. More precisely, x_i is initially projected onto r_{win_i^(t)} in the network domain; then, r_{win_i^(t)} is applied to Eq. (29) as an observation of the Gaussian kernel centered at r_l to obtain the value of h_{win_i^(t), l}. In both the DAEM and SOCEM algorithms, when the temperature (1/β or σ) is high, the posterior distribution becomes almost uniform; hence, all the reference models will be moved to locations near the center of the data samples in this learning phase. By gradually reducing the temperature, the influence of each x_i becomes more localized, and the reference models gradually spread out to fit the distribution of the data samples.
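The annealing analogy, with σ in the Gaussian kernel of Eq. (29) acting as a temperature, can be seen numerically. The following is our own sketch on a three-neuron chain (the lattice positions are arbitrary):

```python
import numpy as np

def neighborhood(pos, sigma):
    # Gaussian neighborhood kernel of Eq. (29) on the lattice
    d2 = (pos[:, None] - pos[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

pos = np.arange(3.0)  # neurons at lattice positions 0, 1, 2

# High temperature (large sigma): every neuron couples strongly with
# every other one, so each kernel row is nearly flat (uniform).
H_hot = neighborhood(pos, sigma=10.0)
# Low temperature (small sigma): the coupling localizes and the kernel
# approaches the identity, h_kl -> delta_kl.
H_cold = neighborhood(pos, sigma=0.1)
print(H_hot.round(3))
print(H_cold.round(3))
```

In the near-uniform regime every data sample influences every reference model almost equally, which is why all models gather near the data center; as σ shrinks, the kernel tends to δ_{kl} and the algorithm approaches plain CEM.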
When the temperature approaches zero, the probabilistic assignment strategy for the data samples becomes the winner-take-all strategy, and the objective functions and learning rules of DAEM and SOCEM are equivalent to those of CEM. The major difference between DAEM and SOCEM seems to be that the posterior distribution in SOCEM is constrained by the network topology, but DAEM does not have this property. To visualize the transition of the objective function, we show a simulation on a simple one-dimensional, two-component Gaussian mixture problem in Fig. 2.³ The training data contains observations drawn from

  p(x; {m_1, v_1²}, {m_2, v_2²}) = (1/2)(1/(√(2π) v_1)) exp( −(x − m_1)²/(2v_1²) ) + (1/2)(1/(√(2π) v_2)) exp( −(x − m_2)²/(2v_2²) ),   (30)

where the Gaussian means are (m_1, m_2) = (−5, 5), and the Gaussian variances are (v_1², v_2²) = (1, 1).⁴ The PbSOM network structure is a 1 × 2 lattice in [0, 1]. The two reference models are θ_1 = {μ_1, Σ_1} and θ_2 = {μ_2, Σ_2}, where Σ_1 = Σ_2 = 1. The objective function in Eq. (23) is calculated with different setups for (μ_1, μ_2) to form the log-likelihood surface. From Fig. 2, we observe that a larger σ for h_{kl} yields a simpler objective function for optimization. The log-likelihood surface is symmetric along μ_1 = μ_2 because of the symmetric lattice structure and the equal weighting of the reference models. For the case of the largest σ, the log-likelihood value is close to the global maximum of the surface when both μ_1 and μ_2 are close to the center of the data (2.39 in this case). With the reduction in the value of σ, the location of (μ_1, μ_2) for the global maximum moves toward (m_1, m_2) and (m_2, m_1).

2) Relation to Kohonen's batch algorithm: There are two differences between the SOCEM algorithm and Kohonen's batch algorithm. First, SOCEM considers the neighborhood information when selecting the winning neuron, but Kohonen's algorithm does not.

³Visualization of how deterministic annealing EM/CEM works for function optimization is illustrated in detail in [12].
⁴The data is generated using the function gmmsamp.m in the Netlab software.
Second, SOCEM extends the reference vectors in Kohonen's algorithm to multivariate Gaussians. In other words, if we set γ_{k|i}^(t) in SOCEM to r_k(x_i; θ_k^(t)) / Σ_{j=1}^M r_j(x_i; θ_j^(t)), instead of the setting in Eq. (25), we obtain a probabilistic variant of Kohonen's batch algorithm (denoted as Kohonen-Gaussian), where Kohonen's winner selection strategy is applied and the reference vectors are replaced with multivariate Gaussians. Thus, we may view Kohonen-Gaussian as an approximate implementation of SOCEM that optimizes SOCEM's objective function. Moreover, if we set the covariance matrices in Kohonen-Gaussian to be diagonal with small, identical variances, Kohonen-Gaussian is equivalent to Kohonen's batch algorithm. Therefore, we can interpret the neighborhood shrinking of Kohonen's algorithms as a deterministic annealing process, and thereby explain why they need to start with a large neighborhood size. Recently, Zhong and Ghosh [3] interpreted the neighborhood size of the SOM algorithms that apply Kohonen's winner selection strategy as a temperature parameter in a deterministic annealing process. However, their interpretations were not based on the optimization of an objective function, which is the essential part of DA-based optimization. In contrast, in SOCEM, the neighborhood shrinking leads to the transition of the objective function from a simpler one to a more complex one, as illustrated in Fig. 2.

3) Computational cost: It is clear from Table I that the computational cost of DAEM is O(MNT), where M, N, and T are the numbers of reference models, data samples, and learning iterations, respectively. Compared to DAEM, SOCEM needs additional O(M²N) multiplication and addition operations for winner selection in each iteration, while Kohonen-Gaussian needs additional O(MN) multiplications and additions.

C. The SOEM algorithm

As is obvious from Eq. (23), in the formulation of the objective function of the SOCEM algorithm, only the local coupling-likelihoods associated with the winning neurons are considered.
Alternatively, we can compute the coupling-likelihood of x_i using the mixture likelihood defined in Eq.

TABLE I
THE DAEM ALGORITHM FOR LEARNING GMMS WITH EQUAL MIXTURE WEIGHTS, AND THE SOCEM ALGORITHM.

  Algorithm             DAEM                                                  SOCEM
  Objective function    F_β(Θ; X) in Eq. (13), where                          Σ_{i=1}^N Σ_{l=1}^M h_{win_i,l} log r_l(x_i; θ_l) + Const.
                        p(x_i, l; Θ) = (1/M) r_l(x_i; θ_l)
  Posterior             f(l | x_i; Θ^(t)) =                                   h_{win_i^(t), l} = exp( −‖r_{win_i^(t)} − r_l‖²/(2σ²) ),
  distribution          r_l(x_i; θ_l^(t))^β / Σ_j r_j(x_i; θ_j^(t))^β,        l = 1, 2, …, M
                        l = 1, 2, …, M
  Temperature           1/β                                                   σ
  Re-estimation         μ_l^(t+1) = Σ_i f(l | x_i; Θ^(t)) x_i /               μ_l^(t+1) = Σ_i h_{win_i^(t), l} x_i /
  formulae              Σ_i f(l | x_i; Θ^(t)), l = 1, 2, …, M                 Σ_i h_{win_i^(t), l}, l = 1, 2, …, M

[Fig. 2 shows four log-likelihood surfaces over (μ_1, μ_2), panels (a)-(d), for decreasing values of σ; in panel (d) the neighborhood size is zero (i.e., h_{kl} = δ_{kl}).]
Fig. 2. SOCEM's objective function becomes more complex with the reduction of neighborhood size (σ in h_{kl}).

(22) and apply the EM algorithm to maximize the objective log-likelihood function

  L_s(Θ; X, h) = Σ_{i=1}^N log( Σ_{k=1}^M w_s(k) p_s(x_i | k; Θ, h) ).   (31)

The steps of the EM algorithm for the proposed PbSOM, i.e., the SOEM algorithm, are as follows.

E-step: With the mixture model in Eq. (22), we form the auxiliary function as

  Q_s(Θ; Θ^(t)) = Σ_{i=1}^N Σ_{k=1}^M γ_{k|i}^(t) log p_s(x_i, k; Θ, h),   (32)

where γ_{k|i}^(t) is the same as Eq. (25). Since p_s(x_i, k; Θ, h) = w_s(k) p_s(x_i | k; Θ, h), Eq. (32) can be rewritten as

  Q_s(Θ; Θ^(t)) = Σ_{i=1}^N Σ_{k=1}^M γ_{k|i}^(t) log( w_s(k) p_s(x_i | k; Θ, h) ).   (33)

As w_s(k), for k = 1, 2, …, M, is fixed at 1/M, by substituting

Eq. (21) into Eq. (33), the auxiliary function can be rewritten as

  Q_s(Θ; Θ^(t)) = Σ_{i=1}^N Σ_{k=1}^M γ_{k|i}^(t) Σ_{l=1}^M h_{kl} log r_l(x_i; θ_l) + Const. = Σ_{l=1}^M Σ_{i=1}^N ( Σ_{k=1}^M γ_{k|i}^(t) h_{kl} ) log r_l(x_i; θ_l) + Const.   (34)

M-step: By replacing the response r_l(x_i; θ_l) in Eq. (34) with the multivariate Gaussian density in Eq. (19) and setting the derivative of Q_s with respect to individual mean vectors and covariance matrices to zero, we obtain the following re-estimation formulae:

  μ_l^(t+1) = Σ_{i=1}^N ( Σ_{k=1}^M γ_{k|i}^(t) h_{kl} ) x_i / Σ_{i=1}^N ( Σ_{k=1}^M γ_{k|i}^(t) h_{kl} ),   (35)

  Σ_l^(t+1) = Σ_{i=1}^N ( Σ_{k=1}^M γ_{k|i}^(t) h_{kl} ) (x_i − μ_l^(t+1))(x_i − μ_l^(t+1))^T / Σ_{i=1}^N ( Σ_{k=1}^M γ_{k|i}^(t) h_{kl} )   (36)

for l = 1, 2, …, M. When the neighborhood size is reduced to zero (i.e., h_{kl} = δ_{kl}), SOEM reduces to the EM algorithm for learning GMMs with equal mixture weights.

There are two major differences between the SOCEM and SOEM algorithms. First, they learn maps based on the classification likelihood criterion and the mixture likelihood criterion, respectively. Second, SOEM adapts the reference models in a more global way than SOCEM. To explain this perspective, we can consider the learning of SOCEM and SOEM in the sense of sequential learning. As illustrated in Fig. 3, in the SOCEM algorithm (cf. Eqs. (27)-(28)), each data sample x_i only contributes to the adaptation of the winning reference model and its neighborhood (i.e., x_i only contributes to the learning of the topological order between the winning reference model and its neighborhood). However, in the SOEM algorithm (cf. Eqs. (35)-(36)), each data sample x_i contributes proportionally to the adaptation of each reference model and its neighborhood according to the posterior probabilities γ_{k|i}^(t) for k = 1, 2, …, M.

1) SOEM - a DA variant of EM for GMMs: As with the SOCEM algorithm, we can apply SOEM with a large neighborhood and obtain different map configurations by gradually reducing the neighborhood size. The term Σ_{k=1}^M γ_{k|i}^(t) h_{kl} in Eqs.
(35)-(36) can be considered as a kind of posterior probability, π(l | x_i; Θ^(t), h), of the reference model θ_l^(t) for x_i, which is also constrained by the neighborhood function. With a large σ value in h_{kl} (Eq. (29)), π(l | x_i; Θ^(t), h), for l = 1, 2, …, M, will be nearly a uniform distribution, due to the small variation in the values of γ_{k|i}^(t) for k = 1, 2, …, M and the small variation in the values of h_{kl} for k = 1, 2, …, M, for each case of l. Hence, all the reference models will be moved to locations near the center of the data samples. When the neighborhood size is reduced to zero (i.e., h_{kl} = δ_{kl}), the SOEM algorithm becomes the EM algorithm for learning GMMs with equal mixture weights. As with the annealing interpretation of SOCEM, SOEM can be viewed as a topology-constrained deterministic annealing variant of the EM algorithm for learning GMMs with equal mixture weights.⁵

[Fig. 3 panels: (a) SOCEM, winner selection; (b) SOEM, weighted winners.]
Fig. 3. For each data sample x_i, the adaptation of the reference models in SOCEM is restricted to the winning reference model and its neighborhood. However, in SOEM, the winner is relaxed to the weighted winners by the posterior probabilities γ_{k|i}^(t), for k = 1, 2, …, M. Each data sample x_i contributes proportionally to the adaptation of each reference model and its neighborhood according to the posterior probabilities.

2) Computational cost: Comparing Eqs. (34)-(36) to Eqs. (26)-(28), we can see that, in each learning iteration, SOEM and SOCEM have a similar computational cost in the E-step, but the former needs additional O(MN) multiplication and addition operations for updating the model parameters in the M-step.

D. The SODAEM algorithm

Similar to the derivation of the deterministic annealing EM (DAEM) algorithm for learning GMMs [12], we developed a DAEM algorithm for the proposed PbSOM, called the SODAEM algorithm. With the mixture likelihood defined in Eq. (22), SODAEM first derives the posterior density in the E-step using the principle of maximum entropy.
Following the derivation of the posterior probability in [12] with the current

^5 SOEM yielded a similar result on the one-dimensional, two-component Gaussian mixture problem in Fig. 2; however, we do not present it here to avoid redundancy.

model's parameter set $\Theta^{(t)}$, we obtain the posterior probability of the $k$th mixture component for $\mathbf{x}_i$ as follows:

$$\tau_{k|i}^{(t)} = \frac{p_s(\mathbf{x}_i\,|\,k; \Theta^{(t)}, h)^{\beta}}{\sum_{j} p_s(\mathbf{x}_i\,|\,j; \Theta^{(t)}, h)^{\beta}} = \frac{\exp\big(\beta \sum_{l} h_{kl} \log r_l(\mathbf{x}_i; \theta_l^{(t)})\big)}{\sum_{j} \exp\big(\beta \sum_{l} h_{jl} \log r_l(\mathbf{x}_i; \theta_l^{(t)})\big)}. \quad (37)$$

Then, the auxiliary function to be maximized is

$$Q_{s\beta}(\Theta; \Theta^{(t)}) = \sum_{i=1}^{N} \sum_{k=1}^{K} \tau_{k|i}^{(t)} \log p_s(\mathbf{x}_i, k; \Theta, h), \quad (38)$$

and the re-estimation formulae for the mean vectors and covariance matrices are

$$\mu_l^{(t+1)} = \frac{\sum_{i=1}^{N} \big(\sum_{k} \tau_{k|i}^{(t)} h_{kl}\big)\, \mathbf{x}_i}{\sum_{i=1}^{N} \big(\sum_{k} \tau_{k|i}^{(t)} h_{kl}\big)}, \quad (39)$$

$$\Sigma_l^{(t+1)} = \frac{\sum_{i=1}^{N} \big(\sum_{k} \tau_{k|i}^{(t)} h_{kl}\big) (\mathbf{x}_i - \mu_l^{(t+1)})(\mathbf{x}_i - \mu_l^{(t+1)})^{T}}{\sum_{i=1}^{N} \big(\sum_{k} \tau_{k|i}^{(t)} h_{kl}\big)} \quad (40)$$

for $l = 1, 2, \ldots, K$. Note that the re-estimation formulae for SODAEM are the same as those for SOEM, except that $\gamma_{k|i}^{(t)}$ is replaced by $\tau_{k|i}^{(t)}$. Here, $1/\beta$ corresponds to the temperature that controls the annealing process: a high temperature is applied initially, and the system is then cooled down by gradually reducing the temperature. When $1/\beta = 1$, the SODAEM algorithm becomes the SOEM algorithm; when $1/\beta \to 0$, it is equivalent to the SOCEM algorithm. In other words, SODAEM can be viewed as a deterministic annealing variant of SOCEM and SOEM.

By considering certain cases and approximations of SODAEM, SOCEM, and SOEM, we summarize the family of EM-based approaches for Gaussian model-based clustering discussed in this section in Fig. 4. Both EM under the mixture-likelihood criterion and CEM under the classification-likelihood criterion are widely used model-based data clustering methods. SOEM (SOCEM) can be applied instead of EM (CEM) in model-based clustering if we want to preserve the spatial relationships between the resulting data clusters on a network. Since SODAEM is a DA variant of SOCEM and SOEM, it can be applied in model-based data clustering under both the mixture-likelihood and classification-likelihood criteria.

1) Computational cost: Comparing Eqs. (39)-(40) to Eqs. (35)-(36), we can see that SODAEM and SOEM have similar computational costs in each learning iteration.

V. EXPERIMENT RESULTS

A.
Experiments on the self-organizing property

Data set description: We conducted experiments on two types of data: a synthetic data set and a real-world data set. The synthetic data set consisted of 500 points uniformly distributed in a unit square. For the real-world data set, we used the training set of one class in the Pen-Based Recognition of Handwritten Digits database (denoted as PenRecDigits) from the UCI Machine Learning Database Repository [35]; the data are 16-dimensional vectors. To demonstrate the map-learning process, we used the first two dimensions of the feature vectors as data for the simulations. As a pre-processing step, we scaled down each element of the vectors in PenRecDigits to avoid numerical traps.

Fig. 4. The family of Gaussian model-based clustering algorithms derived from the SODAEM, SOCEM, and SOEM algorithms: with h_kl = δ_kl they reduce to DAEM, CEM, and EM, respectively, and SODAEM reduces to SOEM (1/β = 1) and SOCEM (1/β → 0) via topology-constrained annealing. δ_kl = 1 if k = l; otherwise, δ_kl = 0.

Experiment setup: In the experiments, an 8 × 8 equally spaced square lattice in a unit square was used as the structure of the network. For the neighborhood function, we used the Gaussian kernel h_kl in Eq. (29). We evaluated SOCEM, SOEM, SODAEM, and Kohonen-Gaussian (Kohonen's batch algorithm using Gaussian reference models) in 20 independent random-initialization trials and two setups for σ in h_kl. For each trial, data samples were randomly selected from the data set as the initial mean vectors, μ_1, μ_2, ..., μ_K, of the reference models, which were multivariate Gaussians with full covariance matrices. The initial covariance matrix Σ_l was set as ρI, where ρ = min_{k≠l} ||μ_l − μ_k||, for l = 1, 2, ..., K. To avoid the singularity problem, we applied the variance-limiting step to the covariance matrices during the learning process: if the value of any element of the covariance matrix fell below a small floor value, it was reset to that floor.
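A Gaussian neighborhood kernel of the kind denoted h_kl in Eq. (29) can be sketched as follows, assuming the neurons have fixed coordinates on the lattice; the exact form and normalization used in the paper may differ.

```python
import numpy as np

def neighborhood(grid, sigma):
    """Gaussian neighborhood kernel between neurons k and l on the lattice:
    h_kl = exp(-||z_k - z_l||^2 / (2 * sigma^2)).
    grid: (K, 2) array of neuron coordinates; sigma: neighborhood size.
    Returned matrix is symmetric with ones on the diagonal (unnormalized)."""
    d2 = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))
```

Shrinking sigma toward zero drives the off-diagonal entries toward zero, i.e., h_kl approaches the Kronecker delta δ_kl used throughout the text.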
1) Results using the synthetic data: We first demonstrate the map-learning processes of SOCEM, SOEM, and SODAEM using one of the 20 random initializations by showing the configurations of the Gaussian means on the maps, and then summarize the overall results of all the initializations.

Simulations using SOCEM: Fig. 5 shows two simulations using the SOCEM algorithm. In the first simulation, SOCEM is run with the random initialization in Fig. 5(a) and a fixed σ of 1 in h_kl. As shown in Fig. 5(b), the algorithm's learning converges to an unordered map. In the second simulation, SOCEM starts with the same random initialization as that in Fig. 5(a), but with a larger σ of 10. When it converges at the

current σ value, σ is reduced by 1. Then, the algorithm is applied again with the new σ value and the reference models obtained in the previous phase. This process continues until SOCEM converges at σ = 0. Figs. 5(c), (d), (e), and (f) depict the maps obtained when σ = 10, 5, 1, and 0, respectively. We can explain the second simulation in terms of annealing (cf. Sec. III-B): when using SOCEM, we start with a larger σ value (a higher temperature) so that the objective function is simple enough to be optimized. Then, we obtain the target map configuration by gradually reducing the value of σ (the temperature). Though the reduction in σ produces a more complex objective function for optimization, SOCEM can still learn well because the reference models obtained at the larger σ value provide a sound initialization for the next learning phase at the smaller σ value.

Simulations using SOEM: We conducted two similar simulations using the SOEM algorithm. In the first simulation, SOEM was run with the random initialization in Fig. 6(a) (the same as that in Fig. 5(a)) and a fixed σ of 1. As shown in Fig. 6(b), the learning of SOEM converged to an unordered map. In the second simulation, SOEM started with the random initialization in Fig. 6(a) and a larger σ of 10; the value of σ was then gradually reduced to 0 in decrements of 1. Figs. 6(c), (d), (e), and (f) depict the maps obtained when SOEM converges at σ = 10, 5, 1, and 0, respectively. As with SOCEM, we can interpret the reduction of σ in SOEM as an annealing process (cf. Sec. III-C) that overcomes the initialization issue. Comparing Figs. 6(c)-(d) to Figs. 5(c)-(d), we observe that the map obtained by SOEM is more concentrated than that obtained by SOCEM for the same σ value. This may be because SOEM learns the map in a more global manner than SOCEM, as noted in Sec. III-C; in other words, each data sample contributes to all the neurons in a more global manner in SOEM than in SOCEM.

Simulations using SODAEM: Fig. 7 depicts the simulations using the SODAEM algorithm with the same random initialization as that in Fig. 5(a) and Fig. 6(a).
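The neighborhood-weighted M-step of Eqs. (35)-(36), through which every sample adapts every reference model in SOEM, can be sketched as follows. This is a minimal illustration with our own array names, not the authors' code.

```python
import numpy as np

def soem_m_step(X, gamma, H):
    """One SOEM-style M-step (cf. Eqs. (35)-(36)): means and covariances are
    re-estimated with responsibilities smoothed by the neighborhood function.
    X: (N, d) samples; gamma: (N, K) posteriors; H: (K, K) neighborhood weights."""
    W = gamma @ H                        # W[i, l] = sum_k gamma[i, k] * h[k, l]
    totals = W.sum(axis=0)               # denominators of Eqs. (35)-(36)
    mu = (W.T @ X) / totals[:, None]     # Eq. (35): neighborhood-weighted means
    K, d = H.shape[0], X.shape[1]
    sigma = np.empty((K, d, d))
    for l in range(K):                   # Eq. (36): weighted scatter per model
        diff = X - mu[l]
        sigma[l] = (W[:, l, None] * diff).T @ diff / totals[l]
    return mu, sigma
```

With H equal to the identity (h_kl = δ_kl), the update reduces to the standard EM M-step for a GMM with equal mixture weights, exactly as stated in the text.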
The value of σ is fixed at 1, and the initial value of β is set to 0.1. When SODAEM converges at a β value, it is applied again with β_new = β × 1.6 and the reference models obtained in the previous phase. We stop the learning process at β = 17.592; in our experience, it is appropriate to set the maximum value of β within the range 10 to 20 for practical applications. When β = 0.1, the temperature is high enough to ensure a smooth objective function. Therefore, according to the parameter update rules of SODAEM, the reference models form a compact ordered map via lateral interactions near the center of the data samples, even though the neighborhood size is small (σ = 1 in this case). When β = 1.049 and β = 17.592, SODAEM is almost equivalent to SOEM and SOCEM, respectively; in these two cases, SODAEM converges to the ordered maps in Fig. 7(f) and Fig. 7(i), respectively. However, as shown in Figs. 5(a)-(b) and Figs. 6(a)-(b), SOCEM and SOEM do not converge to an ordered map when σ = 1, which demonstrates that the annealing process of SODAEM overcomes the initialization problem of SOCEM and SOEM when σ = 1. Note that SODAEM may not be able to obtain any ordered map during the annealing process if the value of σ is too small to form an ordered map at a small β value.

Discussion: The experiment results obtained by the three proposed algorithms and Kohonen-Gaussian for the 20 random initializations are summarized in Table I. Several conclusions can be drawn from the results. First, SOEM often converges to an ordered map even at a small, fixed σ value (σ = 1 in the experiments), but Kohonen-Gaussian and SOCEM seldom do so. This may be because SOEM learns the map in a more global way, as noted in Sec. III-C; hence, it is less sensitive to the initialization of the parameters when σ is small. The results for Kohonen-Gaussian and SOCEM are similar, perhaps because they differ only in the winner selection strategy. Second, the initialization issue of Kohonen-Gaussian, SOCEM, and SOEM can be overcome by using a larger σ value (10 in the experiments) initially, and then gradually reducing it to the target σ value (0 in the experiments).
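The geometric β schedule described above can be sketched as follows; the constants are the values reconstructed from the text and should be treated as illustrative.

```python
def beta_schedule(beta0=0.1, factor=1.6, n_phases=12):
    """Geometric inverse-temperature schedule for SODAEM-style annealing:
    start at beta0, multiply by `factor` after each converged phase.
    With the defaults this runs 0.1, 0.16, ..., up to roughly 17.592."""
    return [beta0 * factor ** i for i in range(n_phases)]
```

Each phase is run to convergence at the current β before the temperature is lowered (β raised), so the reference models of one phase seed the next.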
The reduction of σ can be interpreted as an annealing process (cf. Sec. III-B, Sec. III-C, and Sec. III-D). Third, the experiment results show that SODAEM overcomes the initialization issue of SOCEM and SOEM at a small σ value (1 in the experiments) using the annealing process, which is controlled by the temperature parameter β.

TABLE I
RESULTS OF MAP LEARNING BY KOHONEN-GAUSSIAN, SOCEM, SOEM, AND SODAEM IN 20 INDEPENDENT RANDOM INITIALIZATION TRIALS ON THE SYNTHETIC DATA. THE ALGORITHMS WERE RUN WITH TWO SETUPS FOR σ IN h_kl; EACH ENTRY GIVES THE NUMBER OF TRIALS CONVERGING TO AN ORDERED MAP VERSUS AN UNORDERED MAP. WHEN σ = 1, KOHONEN-GAUSSIAN SUCCEEDED IN CONVERGING TO AN ORDERED MAP IN ONE RANDOM INITIALIZATION CASE, BUT FAILED IN THE REMAINING CASES (1:19).

Setup for σ        | σ = 1 | σ = 10 initially, reduced to 0 in decrements of 1
Kohonen-Gaussian   | 1:19  | 20:0
SOCEM              | 1:19  | 20:0
SOEM               | 15:5  | 20:0
SODAEM             | 20:0  | -

2) Results using PenRecDigits: We also conducted experiments on the real-world data using the setups for the neighborhood function described in Sec. V-A. Table II summarizes the results obtained by the four PbSOM learning algorithms; from these results, we can draw the same conclusions as those made for the synthetic data. Figs. 8, 9, and 10 demonstrate, respectively, the map-learning processes of SOCEM, SOEM, and SODAEM using one of the 20 random initializations. Comparing Figs. 8, 9, and 10, we observe that the three algorithms obtain rather different results. SOCEM and SOEM usually obtain different maps because they learn the maps based on different clustering criteria (classification likelihood vs. mixture likelihood). SODAEM and SOCEM (or SOEM) usually obtain different results because SODAEM's annealing is achieved by increasing the β value, while SOCEM's (or SOEM's) annealing is achieved by decreasing the σ value. Comparing Figs. 9(f) and 10(f), although SODAEM becomes equivalent to SOEM when the value of β is increased to 1.049, their search paths on the objective-function surface are different because they have rather different seed models (Fig. 10(e) vs. Fig.

9(e)). Therefore, they converge to different local maxima of the objective function and obtain different maps. Likewise, although SODAEM becomes equivalent to SOCEM when the value of β is increased to 17.592, they converge to different local maxima of the objective function and obtain different maps (Fig. 10(i) vs. Fig. 8(f)).

Fig. 5. The map-learning process obtained by running the SOCEM algorithm on the synthetic data. Simulation 1 ((a)-(b)): when SOCEM is run with the random initialization in (a) and σ = 1, it converges to the unordered map in (b). Simulation 2 ((a) and (c)-(f)): SOCEM starts with σ = 10 and the random initialization in (a); the value of σ is then reduced to 0 in decrements of 1.

Fig. 6. The map-learning process obtained by running the SOEM algorithm on the synthetic data. Simulation 1 ((a)-(b)): when SOEM is run with the random initialization in (a) and σ = 1, it converges to the unordered map in (b). Simulation 2 ((a) and (c)-(f)): SOEM starts with σ = 10 and the random initialization in (a); the value of σ is then reduced to 0 in decrements of 1.
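The σ-shrinking procedure used in the second simulations above (run the learner to convergence at each σ, then warm-start the next phase from the resulting reference models) can be sketched generically as:

```python
def anneal(fit_phase, params, sigmas):
    """Neighborhood-shrinking annealing loop described for SOCEM/SOEM:
    `fit_phase(params, sigma)` stands in for one run of the learner to
    convergence at a fixed sigma; the returned reference models seed the
    next, smaller-sigma phase."""
    for sigma in sigmas:
        params = fit_phase(params, sigma)   # warm start from previous phase
    return params
```

The same driver applies to Kohonen-Gaussian and, with β in place of σ, to SODAEM; only the per-phase learner changes.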

Fig. 7. The map-learning process obtained by running the SODAEM algorithm on the synthetic data. The value of σ is fixed at 1, while the value of β is initialized at 0.1 and increased in multiples of 1.6 up to 17.592.

TABLE II
RESULTS OF MAP LEARNING BY KOHONEN-GAUSSIAN, SOCEM, SOEM, AND SODAEM IN 20 INDEPENDENT RANDOM INITIALIZATION TRIALS ON PENRECDIGITS. THE ALGORITHMS WERE RUN WITH TWO SETUPS FOR σ IN h_kl; EACH ENTRY GIVES THE NUMBER OF TRIALS CONVERGING TO AN ORDERED MAP VERSUS AN UNORDERED MAP. WHEN σ = 1, KOHONEN-GAUSSIAN SUCCEEDED IN CONVERGING TO AN ORDERED MAP IN ONE RANDOM INITIALIZATION CASE, BUT FAILED IN THE REMAINING CASES (1:19).

Setup for σ        | σ = 1 | σ = 10 initially, reduced to 0 in decrements of 1
Kohonen-Gaussian   | 1:19  | 20:0
SOCEM              | 2:18  | 20:0
SOEM               | 4:16  | 20:0
SODAEM             | 20:0  | -

B. Experiments to evaluate the performance of data clustering and visualization

Data set description: In this section, we evaluate the data clustering and visualization performance of the proposed algorithms on two data sets from the UCI Machine Learning Database Repository [35]: the test set of the image segmentation database (denoted as ImgSeg), which consists of 2,100 19-dimensional feature vectors, and the Ecoli data set (denoted as Ecoli). Here, we used the full vectors, rather than only two dimensions, in the experiments. As a pre-processing step, we scaled down each element of the data vectors in ImgSeg to avoid numerical traps.

Experiment setup: To avoid the singularity problem that often occurs when using EM or CEM to learn full-covariance GMMs, we used diagonal-covariance Gaussians in the experiments. We also applied the variance-limiting step, in which a minimum value was imposed on each variance. For the PbSOM learning algorithms, we used five configurations for the network structure: 3×3, 4×4, 5×5, 6×6, and 7×7 lattices equally spaced in a unit square. We used the Gaussian kernel h_kl in Eq. (29) as the neighborhood function. To avoid ambiguity, when the SODAEM and DAEM algorithms are applied in data clustering based on the

classification-likelihood criterion, they are denoted as SODAEM-CL and DAEM-CL; when applied in data clustering based on the mixture-likelihood criterion, they are denoted as SODAEM-ML and DAEM-ML. All the algorithms discussed here were run with random initializations generated in the same way as described in Sec. V-A.

Fig. 8. The map-learning process obtained by running the SOCEM algorithm on PenRecDigits. Simulation 1 ((a)-(b)): when SOCEM is run with the random initialization in (a) and σ = 1, it converges to the unordered map in (b). Simulation 2 ((a) and (c)-(f)): SOCEM starts with σ = 10 and the random initialization in (a); the value of σ is then reduced to 0 in decrements of 1.

Fig. 9. The map-learning process obtained by running the SOEM algorithm on PenRecDigits. Simulation 1 ((a)-(b)): when SOEM is run with the random initialization in (a) and σ = 1, it converges to the unordered map in (b). Simulation 2 ((a) and (c)-(f)): SOEM starts with σ = 10 and the random initialization in (a); the value of σ is then reduced to 0 in decrements of 1.

Fig. 10. The map-learning process obtained by running the SODAEM algorithm on PenRecDigits. The value of σ is fixed at 1, while the value of β is initialized at 0.1 and increased in multiples of 1.6 up to 17.592.

1) Experiments on ImgSeg by using SOCEM and SODAEM-CL: First, we evaluated the data clustering performance of Kohonen-Gaussian, SOCEM, and SODAEM-CL in terms of the classification log-likelihood defined in Eq. (7). The performance was compared with that of CEM and DAEM-CL. The setting for each algorithm was as follows:

DAEM-CL: The value of β was set at 1 initially, and increased to 10 by the formula β_new = β × 1.2.
SOCEM: The value of σ in h_kl was set at 1 initially, and reduced to 0 (i.e., h_kl = δ_kl) in 0.2 decrements.
SODAEM-CL: The values of β and σ in h_kl were both set at 1 initially. To perform data clustering using the classification-likelihood criterion, the value of β was first increased to 10 by the formula β_new = β × 1.2; then, the value of σ was reduced to 0 in 0.2 decrements.
Kohonen-Gaussian: The value of σ in h_kl was set at 1 initially, and reduced to 0 in 0.2 decrements every 30 learning iterations^6.

We ran all the algorithms except CEM in 20 independent trials using 9, 16, 25, 36, and 49 Gaussian components. To conduct a fair comparison of CEM and the proposed approaches, we ran CEM for as many trials as could be accumulated in an execution time close to that of one SOCEM trial. The means and standard deviations (error bars) of the classification log-likelihood values over the trials for each algorithm, together with the best results of CEM (denoted as CEM-best), are shown in Fig. 11. Note that, in the figure, we slightly separate the results associated with a specific number of Gaussian components in order to distinguish between them. From the figure, we observe that the clustering performance of SOCEM, SODAEM-CL, and Kohonen-Gaussian is close to that of DAEM-CL. Moreover, they obtain larger and more stable classification log-likelihoods than CEM.
These

^6 In our implementations of SOCEM, SOEM, and SODAEM, the phase transition occurs when the likelihood increase falls below a threshold or the number of learning iterations in the current phase exceeds 30. Kohonen-Gaussian, however, does not have the convergence property; we therefore ran 30 iterations for each phase of that algorithm.

results are rational, since SOCEM is a topology-constrained DA variant of the CEM algorithm and, with the settings for β and σ used here, SODAEM-CL is an annealing variant of SOCEM.

Fig. 11. The data clustering performance of CEM, DAEM-CL, SOCEM, SODAEM-CL, and Kohonen-Gaussian on ImgSeg in terms of the classification log-likelihood.

Next, we evaluated the data visualization ability of Kohonen-Gaussian, SOCEM, and SODAEM-CL. To visualize the data clusters on the network, each data sample was assigned to its winning reference model and then randomly plotted within the neuron associated with that reference model [36]. Here, the winner selection strategy for SODAEM-CL was the same as that of SOCEM (i.e., the C-step of SOCEM). Fig. 12 shows the projections of the data samples onto 7×7 lattices obtained by the different algorithms. The ImgSeg data set comprises seven classes, namely brickface, sky, foliage, cement, window, path, and grass; each class consists of 300 data samples. Fig. 12(a) depicts the initial mapping of the data obtained with a random initialization of the reference models. As we can see from the figure, the data clusters are randomly projected onto the neurons (lattice nodes), and the network does not preserve the topological (spatial) relationships among the clusters. Figs. 12(b)-(f) show the results of the three PbSOM learning algorithms obtained with the random initialization in Fig. 12(a). We see that they can preserve the topological relationships among the data clusters on the network. Moreover, the data samples of several classes are more distinguishable and better grouped on the network than those of the other classes. In particular, from Figs. 12(b), (c), and (d), we see that only one class is separated from the other classes by empty nodes; thus, we may infer that the separability between this class and the other classes is higher than that between the remaining classes. For SOCEM, as shown in Figs. 12(c) and (d), the network contains fewer empty nodes at σ = 0 than at σ = 0.6.
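The winner-assignment step used for plotting (assign each sample to the neuron maximizing its neighborhood-smoothed log-likelihood, i.e., SOCEM's C-step) can be sketched as follows; the names are illustrative.

```python
import numpy as np

def assign_winners(log_r, H):
    """Winner selection for visualization: for each sample, pick the neuron k
    that maximizes sum_l h[k, l] * log r_l(x_i).
    log_r: (N, K) per-component log-likelihoods; H: (K, K) neighborhood."""
    return np.argmax(log_r @ H.T, axis=1)
```

With h_kl = δ_kl this reduces to plain maximum-likelihood assignment; with a wider neighborhood, a neuron can win on behalf of its lattice neighbors.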
This may be because, in the former case, the lateral interactions have vanished; thus, the reference models are adapted to fit the data distribution more closely than in the latter case. Comparing Fig. 12(b) to Fig. 12(d), we see that the data projection results of Kohonen-Gaussian and SOCEM are rather different, although they obtain similar classification log-likelihoods in Fig. 11. Nevertheless, we can draw similar observations from the two figures; for example, in both maps the data samples of a given class lie closer to the same neighboring classes. Figs. 12(e) and (f) show the results obtained by SODAEM-CL. We see that the result in Fig. 12(f) is rather different from that in Fig. 12(d), although SODAEM-CL has become equivalent to SOCEM when σ = 0. This may be because the two approaches search the objective-function surface along different paths and converge to different local maxima, as in the explanation of the difference between Figs. 9(f) and 10(f) in Sec. V-A2.

2) Experiments on ImgSeg by using SOEM and SODAEM-ML: First, we evaluated the performance of SOEM and SODAEM-ML in learning a Gaussian mixture model with equal mixture weights. The objective function was the log mixture-likelihood function in Eq. (2) with equal mixture weights. We compared the performance with that of EM and DAEM-ML. The setting for each algorithm was as follows:

DAEM-ML: The value of β was set at 0.1 initially, and increased to 1 by the formula β_new = β × 1.2.
SOEM: The value of σ in h_kl was set at 1 initially, and reduced to 0 (i.e., h_kl = δ_kl) in 0.2 decrements.
SODAEM-ML: The values of β and σ in h_kl were set at 0.1 and 1, respectively, initially. To perform data clustering using the mixture-likelihood criterion, the value of β was first increased to 1 by the formula β_new = β × 1.2; then, the value of σ was reduced to 0 in 0.2 decrements.

We ran DAEM-ML, SOEM, and SODAEM-ML with 20 independent random-initialization trials. As in the experiments with CEM, we ran EM for as many trials as could be accumulated in an execution time close to that of one SOEM trial.
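The phase-transition test described in footnote 6 (end a phase when the likelihood gain falls below a threshold or when an iteration cap is reached) can be sketched as follows; the cap of 30 follows the footnote as reconstructed, and the threshold is a tuning choice.

```python
def phase_converged(loglik_history, tol, max_iters=30):
    """Decide whether the current annealing phase should end: either the
    most recent likelihood improvement is below `tol`, or the number of
    iterations recorded in this phase has reached `max_iters`."""
    if len(loglik_history) >= max_iters:
        return True
    if len(loglik_history) < 2:
        return False
    return (loglik_history[-1] - loglik_history[-2]) < tol
```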
The means and standard deviations (error bars) of the log mixture-likelihood values over the trials for each algorithm, together with the best results of EM (denoted as EM-best), are shown in Fig. 13. From the figure, it is clear that DAEM-ML, SOEM, and SODAEM-ML achieve similar performance. Moreover, they obtain larger and more stable log mixture-likelihoods than EM. These results are rational, since SOEM is a topology-constrained DA variant of the EM algorithm and, with the settings for β and σ used here, SODAEM-ML is an annealing variant of SOEM.

Next, we evaluated the data visualization ability of SOEM and SODAEM-ML. We ran these two algorithms with a 7×7 lattice and the initial reference models used in Sec. V-B1 for evaluating SOCEM; therefore, the initial projection of the data was the same as that shown in Fig. 12(a). When clustering the data samples, each sample was assigned to its winning reference model using SOCEM's winner selection strategy. From Fig. 14, we observe that these two algorithms can preserve the topological relationships among the data clusters (samples). Similar to the results revealed by Fig. 12, the data samples of some classes are more distinguishable than those of the other classes.

Fig. 12. Data visualization for ImgSeg by running Kohonen-Gaussian ((b)), SOCEM ((c), (d)), and SODAEM-CL ((e), (f)) with the random initialization in (a). The network structure is a 7×7 equally spaced square lattice in a unit square.

Comparing Fig. 14(b) to Figs. 12(b) and (d), it is clear that SOEM produces fewer empty nodes than Kohonen-Gaussian and SOCEM when the value of σ is reduced to zero. This may be explained as follows. For Kohonen-Gaussian and SOCEM, in the case of σ = 0, they become the CEM (K-means-type) algorithm, in which each data sample adapts only its winner. However, when σ = 0, SOEM becomes the EM algorithm, in which each data sample adapts all the reference models

according to their posterior probabilities; thus, its models fit the data more closely than the models of the other two algorithms.

the clusters are spherical and of equal volume. In this case, the SOCEM algorithm is equivalent to the STVQ algorithm in [22], which was developed for noisy vector quantization. It is also equivalent to the batch learning algorithm described in [20], which employs an energy function in the learning phase of a SOM. However, SOCEM was developed from a different perspective: we consider the learning of a PbSOM as a model-based clustering process. From this perspective, a coupling-likelihood mixture model is developed first, and an objective function is then formulated based on the classification-likelihood criterion. Moreover, the connection between the coupling-likelihood mixture model and the Gaussian mixture model helps interpret SOCEM as a topology-constrained DA variant of the CEM algorithm for GMMs.

Fig. 13. Learning a Gaussian mixture model by applying EM, DAEM-ML, SOEM, and SODAEM-ML to ImgSeg.

3) Experiments on Ecoli: We conducted experiments on Ecoli using the algorithms applied to ImgSeg in Secs. V-B1 and V-B2. Figs. 15(a) and (b) show the data clustering performance of each algorithm in terms of the classification log-likelihood and the log mixture-likelihood, respectively. Similar to the results on ImgSeg, the PbSOM learning algorithms also achieve decent data clustering performance on Ecoli. In Fig. 16, for each algorithm, we show the result at the σ value at which the class separability can be best visualized on the network. The Ecoli data set comprises eight classes, namely cp, im, pp, imU, om, omL, imL, and imS; the numbers of data samples are 143, 77, 52, 35, 20, 5, 2, and 2, respectively. From the figure, we can see that the topological relationships among the data clusters are preserved well and the data classes can be roughly separated on the network.

VI. RELATION TO OTHER ALGORITHMS

In this section, we explore the differences and relations between the proposed algorithms and other related algorithms.

A.
For SOCEM

In [37], Ambroise and Govaert proposed a topology-preserving EM (TPEM) algorithm that introduces topological constraints into the EM algorithm. If Kohonen's winner selection strategy is applied, SOCEM is equivalent to TPEM with the mixture weights fixed to be equal. In SOCEM, the covariance matrix of a Gaussian component, Σ_l, can have different parameterizations for different geometric interpretations [1]. When Σ_l = λI for l = 1, 2, ..., K (where λ is a small positive constant and I denotes the identity matrix),

B. For SOEM and SODAEM

In SODAEM, when Σ_l = λI for l = 1, 2, ..., K, SODAEM is equivalent to the STVQ algorithm [23], which learns the parameters by maximizing their density function predicted by the maximum entropy principle. In STVQ, the inverse temperature, β, is the Lagrange multiplier introduced for the constrained optimization induced by the maximum entropy principle. Heskes [25] extends STVQ's cost function to an expected quantization error; an objective function is then obtained by weighting the quantization error with the inverse temperature β and adding an entropy term that introduces the annealing process. With the resulting objective function, Heskes obtained an algorithm identical to STVQ. Implementations of deterministic annealing in STVQ and Heskes' algorithm can also be found in [38], [39], where DA is applied to vector quantization. SODAEM differs from Graepel et al.'s STVQ and Heskes' algorithm in the following ways. First, the deterministic annealing processes are implemented differently: SODAEM is a DAEM algorithm developed to learn the mixture models with a deterministic annealing process, implemented by predicting the posterior distribution in the E-step using the maximum entropy principle. Second, the case of β = 1 was not well addressed in Graepel et al.'s and Heskes' papers, perhaps because their original goal was to develop DA learning for STVQ. When β is fixed at 1, however, SODAEM becomes the SOEM algorithm.
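The Σ_l = λI special case can be made concrete: with spherical, equal-volume components, the Gaussian log-density reduces to a scaled negative squared distance plus a constant, which is the quantization-error form underlying the STVQ equivalence. A sketch under that assumption:

```python
import numpy as np

def isotropic_log_r(X, mu, lam):
    """Gaussian log-density under Sigma_l = lam * I: for each sample x_i and
    mean mu_l, log r_l(x_i) = -||x_i - mu_l||^2 / (2*lam) - (d/2) log(2*pi*lam),
    so ranking components by likelihood equals ranking by squared distance.
    X: (N, d) samples; mu: (K, d) means; lam: common spherical variance."""
    d = X.shape[1]
    sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)  # ||x_i - mu_l||^2
    return -0.5 * sq / lam - 0.5 * d * np.log(2 * np.pi * lam)
```

Because the constant term is shared by all components, the winner (and the tempered posterior) depends only on the squared distances, which is why the classification criterion degenerates to a vector-quantization criterion here.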
Moreover, the connection between the proposed coupling-likelihood mixture model and the Gaussian mixture model helps interpret SOEM as a topology-constrained DA variant of the EM algorithm for GMMs.

VII. CONCLUSION

Considering the learning of a probabilistic self-organizing map (PbSOM) as a model-based clustering process, we developed a coupling-likelihood mixture model for the PbSOM and derived three EM-type learning algorithms, namely the SOCEM, SOEM, and SODAEM algorithms, for learning the model (PbSOM). The proposed algorithms improve on Kohonen's learning algorithms by including a cost function, an EM-based convergence property, and a probabilistic framework. In addition, the proposed algorithms provide some insights into the choice of neighborhood size that would ensure topographic ordering. From the experiment results, we observe that

Fig. 14. Data visualization for ImgSeg by running SOEM ((a), (b)) and SODAEM-ML ((c), (d)) with the random initialization in Fig. 12(a). The network structure is a 7×7 equally spaced square lattice in a unit square.

Fig. 15. The data clustering performance on Ecoli in terms of (a) the classification log-likelihood and (b) the log mixture-likelihood.

Fig. 16. Data visualization for Ecoli by running (b) Kohonen-Gaussian (σ = 0.6), (c) SOCEM (σ = 0.6), (d) SOEM (σ = 0.8), (e) SODAEM-CL, and (f) SODAEM-ML with the random initialization in (a). The network structure is a 7×7 equally spaced square lattice in a unit square.

the learning performance of SOCEM is very sensitive to the initial setting of the reference models when the neighborhood is small. Conversely, it is not sensitive to the initial condition when the neighborhood is sufficiently large. To deal with the initialization problem, we first run SOCEM with a large neighborhood, and then gradually reduce the neighborhood size until the learning converges to the desired map. When using a small neighborhood, SOEM is less sensitive to the initialization than SOCEM; however, to learn an ordered map, SOEM still needs to start with a large neighborhood. In both SOCEM and SOEM, the neighborhood shrinking can be interpreted as an annealing process that overcomes the initialization issue. Alternatively, we can apply SODAEM, which is a deterministic annealing variant of SOCEM and SOEM, to learn a map. In our experiments, SODAEM overcomes the initialization issue of SOCEM and SOEM via the annealing process controlled by the temperature parameter. Moreover, through the comparison of SOCEM and Kohonen's batch algorithm, we can also apply the DA interpretation of neighborhood shrinking to Kohonen's algorithms to explain why they need to start with a large neighborhood size. We have also shown that the SOEM and SOCEM algorithms can be interpreted, respectively, as topology-constrained deterministic annealing variants of the EM and CEM algorithms for Gaussian model-based clustering. The experiment results show that our proposed PbSOM learning algorithms achieve effective data clustering performance while maintaining the topology-preserving property.

APPENDIX

Theoretically, the mixture weights of the coupling-likelihood mixture model in Eq. (22) can be learned automatically. Following the derivations of the SOCEM, SOEM, and SODAEM algorithms in Secs. III-B, III-C, and III-D, the learning rules for the mixture weights are derived as follows.

Posterior distribution: For SOCEM and SOEM,

$$\gamma_{k|i}^{(t)} = \frac{w_s(k)^{(t)} \exp\big(\sum_{l} h_{kl} \log r_l(\mathbf{x}_i; \theta_l^{(t)})\big)}{\sum_{j} w_s(j)^{(t)} \exp\big(\sum_{l} h_{jl} \log r_l(\mathbf{x}_i; \theta_l^{(t)})\big)}. \quad (41)$$

For SODAEM,
$$\tau_{k|i}^{(t)} = \frac{\big(w_s(k)^{(t)} \exp\big(\sum_{l} h_{kl} \log r_l(\mathbf{x}_i; \theta_l^{(t)})\big)\big)^{\beta}}{\sum_{j} \big(w_s(j)^{(t)} \exp\big(\sum_{l} h_{jl} \log r_l(\mathbf{x}_i; \theta_l^{(t)})\big)\big)^{\beta}}. \quad (42)$$

Re-estimation formulae: For SOCEM,

$$w_s(k)^{(t+1)} = \frac{\hat{N}_k^{(t)}}{N}. \quad (43)$$

For SOEM,

$$w_s(k)^{(t+1)} = \frac{1}{N} \sum_{i=1}^{N} \gamma_{k|i}^{(t)}. \quad (44)$$

For SODAEM,

$$w_s(k)^{(t+1)} = \frac{1}{N} \sum_{i=1}^{N} \tau_{k|i}^{(t)}. \quad (45)$$

The mean vectors and covariance matrices in the SOCEM, SOEM, and SODAEM algorithms are updated using Eqs. (27)-(28), Eqs. (35)-(36), and Eqs. (39)-(40), respectively, where $\gamma_{k|i}^{(t)}$ and $\tau_{k|i}^{(t)}$ are computed by Eqs. (41) and (42), respectively. However, in our experience, if the mixture weights are learned in the three algorithms, the learning of topological order is frequently dominated by some particular mixture components, which makes it difficult to obtain an ordered map. As an example, we applied SOCEM to the synthetic data set, which consisted of 500 points uniformly distributed in a unit square. The network structure was a 4×4 equally spaced square lattice in a unit square. All the mixture weights were set at 1/16 initially, and the value of σ in the neighborhood function (i.e., Eq. (29)) was set at 1. The results are shown in Figs. 17(a)-(e). From the figures, we observe that the map shrinks to nearly a line after the algorithm converges (after 80 iterations). This phenomenon can be verified by inspecting the values of the mixture weights during the learning process. As shown in Table IV, after the algorithm converges, most of the mixture weights become zero, and the learning only maximizes the local coupling-likelihoods of neurons 4 and 13, whose mixture weights are 0.04 and 0.96, respectively. In contrast, as shown in Fig. 17(f), if the mixture weights are fixed to be equal at 1/16 throughout the learning process, SOCEM converges to an ordered map. For SOEM and SODAEM, we obtained similar results.

REFERENCES

[1] C. Fraley and A. E. Raftery, "How many clusters? Which clustering method? Answers via model-based cluster analysis," Computer Journal, vol. 41, pp. 578-588, 1998.
[2] C. Fraley and A. E. Raftery, "Model-based clustering, discriminant analysis, and density estimation," Journal of the American Statistical Association, vol. 97, no. 458, pp. 611-631, 2002.
[3] S.
Zhong and J. Ghosh, "A unified framework for model-based clustering," Journal of Machine Learning Research, vol. 4, pp. 1001-1037, 2003.
[4] C. Fraley and A. E. Raftery, "Bayesian regularization for normal mixture estimation and model-based clustering," Journal of Classification, vol. 24, no. 2, pp. 155-181, 2007.
[5] M.-S. Oh and A. E. Raftery, "Model-based clustering with dissimilarities: A Bayesian approach," Journal of Computational and Graphical Statistics, vol. 16, no. 3, 2007.
[6] M. J. Symons, "Clustering criteria and multivariate normal mixtures," Biometrics, vol. 37, pp. 35-43, 1981.
[7] S. Ganesalingam, "Classification and mixture approaches to clustering via maximum likelihood," Applied Statistics, vol. 38, no. 3, 1989.
[8] G. Celeux and G. Govaert, "A classification EM algorithm for clustering and two stochastic versions," Computational Statistics & Data Analysis, vol. 14, no. 3, pp. 315-332, 1992.
[9] J. D. Banfield and A. E. Raftery, "Model-based Gaussian and non-Gaussian clustering," Biometrics, vol. 49, no. 3, pp. 803-821, 1993.
[10] J. A. Bilmes, "A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models," International Computer Science Institute, Tech. Rep. TR-97-021, April 1998.
[11] G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions. New York: John Wiley, 1997.
[12] N. Ueda and R. Nakano, "Deterministic annealing EM algorithm," Neural Networks, vol. 11, no. 2, pp. 271-282, 1998.

Fig. 7. The map-learning process obtained by running the SOCEM algorithm on the synthetic data, with an ordered initialization in (a). Panels (b)-(e) show the map after successive iterations with the mixture weights updated; panel (f) shows the map learned with fixed equal weights. Simulation 1 ((a)-(e)): the mixture weights are initialized at 1/16 and updated during the learning process; the algorithm starts with the initialization in (a) and converges to the unordered map in (e). Simulation 2 ((a) and (f)): SOCEM is performed with equal mixture weights throughout the learning process; the algorithm starts with the initialization in (a) and converges to the ordered map in (f). The network structure is a 4 x 4 square lattice; the value of \sigma is held fixed.

TABLE V. THE MIXTURE WEIGHTS LEARNED BY SOCEM WITH THE INITIALIZATION IN FIG. 7(a). THE MIXTURE WEIGHTS ARE INITIALIZED AT 1/16. (Columns: weight index; initial value; values at successive iterations.)

[13] N. Ueda, R. Nakano, Z. Ghahramani, and G. E. Hinton, "SMEM algorithm for mixture models," Neural Computation, vol. 12, no. 9, pp. 2109-2128, 2000.
[14] S.-S. Cheng, H.-M. Wang, and H.-C. Fu, "A model-selection-based self-splitting Gaussian mixture learning with application to speaker identification," EURASIP Journal on Applied Signal Processing, vol. 2004, no. 17, pp. 2626-2639, 2004.
[15] T. Kohonen, Self-Organizing Maps, Springer, 2001.
[16] T. Kohonen, "The self-organizing map," Neurocomputing, vol. 21, pp. 1-6, 1998.
[17] C. M. Bishop, M. Svensén, and C. K. I. Williams, "The generative topographic mapping," Neural Computation, vol. 10, no. 1, pp. 215-234, 1998.
[18] V. V. Tolat, "An analysis of Kohonen's self-organizing maps using a system of energy functions," Biological Cybernetics, vol. 64, no. 2, pp. 155-164, 1990.
[19] E. Erwin, K. Obermayer, and K. Schulten, "Self-organizing maps: ordering, convergence properties and energy functions," Biological Cybernetics, vol. 67, no. 1, pp. 47-55, 1992.
[20] Y. Cheng, "Convergence and ordering of Kohonen's batch map," Neural Computation, vol. 9, no. 8, pp. 1667-1676, 1997.
[21] S. P. Luttrell, "Self-organization: A derivation from first principles of a class of learning algorithms," in Proc. IEEE Int. Joint Conf. Neural Networks, 1989.
[22] S. P. Luttrell, "Code vector density in topographic mappings: Scalar case," IEEE Trans. Neural Networks, vol. 2, no. 4, pp. 427-436, 1991.
[23] T. Graepel, M. Burger, and K. Obermayer, "Phase transitions in stochastic self-organizing maps," Physical Review E, vol. 56, no. 4, 1997.
[24] T. Graepel, M. Burger, and K. Obermayer, "Self-organizing maps: Generalizations and new optimization techniques," Neurocomputing, vol. 21, pp. 173-190, 1998.
[25] T. Heskes, "Self-organizing maps, vector quantization, and mixture modeling," IEEE Trans. Neural Networks, vol. 12, no. 6, pp. 1299-1305, 2001.
[26] T. W. S. Chow and S. Wu, "An online cellular probabilistic self-organizing map for static and dynamic data sets," IEEE Trans. Circuits and Systems I, vol. 51, no. 4, 2004.
[27] S. Wu and T. W. S. Chow, "PRSOM: A new visualization method by hybridizing multidimensional scaling and self-organizing map," IEEE Trans. Neural Networks, vol. 16, no. 6, 2005.
[28] S. P. Luttrell, "A Bayesian analysis of self-organizing maps," Neural Computation, vol. 6, no. 5, 1994.
[29] F. Anouar, F. Badran, and S. Thiria, "Probabilistic self-organizing map and radial basis function networks," Neurocomputing, vol. 20, pp. 83-96, 1998.
[30] J. Lampinen and T. Kostiainen, "Generative probability density model in the self-organizing map," in U. Seiffert and L. C. Jain, Eds., Self-Organizing Neural Networks: Recent Advances and Applications, Physica-Verlag, 2002.
[31] M. M. Van Hulle, "Joint entropy maximization in kernel-based topographic maps," Neural Computation, vol. 14, no. 8, pp. 1887-1906, 2002.
[32] M. M. Van Hulle, "Maximum likelihood topographic map formation," Neural Computation, vol. 17, no. 3, pp. 503-513, 2005.

[33] J. J. Verbeek, N. Vlassis, and B. J. A. Kröse, "Self-organizing mixture models," Neurocomputing, vol. 63, pp. 99-123, 2005.
[34] J. Sum, C.-S. Leung, L.-W. Chan, and L. Xu, "Yet another algorithm which can generate topography map," IEEE Trans. Neural Networks, vol. 8, no. 5, 1997.
[35] UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html.
[36] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer, 2001.
[37] C. Ambroise and G. Govaert, "Constrained clustering and Kohonen self-organizing maps," Journal of Classification, vol. 13, no. 2, pp. 299-313, 1996.
[38] K. Rose, E. Gurewitz, and G. C. Fox, "Vector quantization by deterministic annealing," IEEE Trans. Inform. Theory, vol. 38, no. 4, pp. 1249-1257, 1992.
[39] K. Rose, "Deterministic annealing for clustering, compression, classification, regression, and related optimization problems," Proceedings of the IEEE, vol. 86, no. 11, pp. 2210-2239, 1998.

Hsin-Min Wang received the B.S. and Ph.D. degrees in electrical engineering from National Taiwan University, Taipei, Taiwan, in 1989 and 1995, respectively. In October 1995, he joined the Institute of Information Science, Academia Sinica, Taipei, Taiwan, as a Postdoctoral Fellow. He was promoted to Assistant Research Fellow and then Associate Research Fellow in 1996 and 2002, respectively. He was an adjunct associate professor with National Taipei University of Technology and National Chengchi University. He was a board member and chair of the academic council of ACLCLP. He currently serves as secretary-general of ACLCLP and as an editorial board member of the International Journal of Computational Linguistics and Chinese Language Processing. His major research interests include speech processing, natural language processing, spoken dialogue processing, multimedia information retrieval, and pattern recognition. Dr. Wang was a recipient of the Chinese Institute of Engineers (CIE) Technical Paper Award in 1995. He is a life member of ACLCLP and a member of ISCA.

Shih-Sian Cheng received the B.S. degree in mathematics from National Kaohsiung Normal University, Kaohsiung, Taiwan, R.O.C., in 1999 and the M.S. degree in computer science from National Chiao Tung University, Hsinchu, Taiwan, in 2002. He is currently pursuing the Ph.D. degree in the Department of Computer Science, National Chiao Tung University, Taiwan. In 2002, he joined the Spoken Language Group, Chinese Information Processing Laboratory, Institute of Information Science, Academia Sinica, Taipei, Taiwan, as a Research Assistant. His research interests include machine learning, pattern recognition, speech processing, and neural networks.

Hsin-Chia Fu received the B.S. degree from National Chiao-Tung University in electrical and communication engineering in 1972, and the M.S. and Ph.D. degrees from New Mexico State University, both in electrical and computer engineering, in 1975 and 1981, respectively. From 1981 to 1983, he was a Member of the Technical Staff at Bell Laboratories. Since 1983, he has been on the faculty of the Department of Computer Science and Information Engineering at National Chiao-Tung University, Taiwan, ROC, and he has been Taiwan's representative to an international consortium since 2003. From 1987 to 1988, he served as the director of the department of information management at the Research Development and Evaluation Commission of the Executive Yuan, ROC. He was also a visiting scholar at Princeton University. From 1989 to 1991, he served as the chairman of the Department of Computer Science and Information Engineering. From September to December of 1994, he was a visiting scientist at the Fraunhofer-Institut for Production Systems and Design Technology (IPK), Berlin, Germany. His research interests include digital signal/image processing, multimedia information processing, and neural networks. Dr. Fu was the co-recipient of the 1992 and 1993 Long-Term Best Thesis Award, with Koun-Tem Sun and Cheng-Chin Chiang, and the recipient of the 1996 Xerox OA Paper Award. He has served as a founding member, Program co-chair (1993), and General co-chair (1995) of the International Symposium on Artificial Neural Networks. He was a member of the Technical Committee on Neural Networks for Signal Processing of the IEEE Signal Processing Society from 1997 to 2000. He has authored more than 100 technical papers and two textbooks, PC/XT BIOS Analysis and Introduction to Neural Networks, published by Sun-Kung Book Co. and Third Wave Publishing Co., respectively. Dr. Fu is a member of the IEEE Signal Processing and Computer Societies, Phi Tau Phi, and the Eta Kappa Nu Electrical Engineering Honor Society.
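As a minimal illustration of the EM-style mixture-weight re-estimate in Eq. (44), where each weight becomes the average responsibility of its component, the following Python sketch applies one such update to a 4 x 4 lattice of unit-variance Gaussian components over points drawn uniformly from the unit square. It is a simplified sketch under stated assumptions, not the paper's implementation: the neighborhood coupling h_{kr} of the PbSOM and the mean/covariance updates are deliberately omitted.

```python
import math
import random

def responsibilities(x, means, weights):
    """Posterior gamma_{k|i} for one sample x, with unit-variance
    isotropic Gaussian components (an assumption made for brevity)."""
    logs = [math.log(w) - 0.5 * sum((a - b) ** 2 for a, b in zip(x, m))
            for w, m in zip(weights, means)]
    mx = max(logs)                            # log-sum-exp for stability
    unnorm = [math.exp(v - mx) for v in logs]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def update_weights(data, means, weights):
    """EM re-estimate (Eq. (44)): average responsibility per component."""
    n, k = len(data), len(weights)
    totals = [0.0] * k
    for x in data:
        for j, g in enumerate(responsibilities(x, means, weights)):
            totals[j] += g
    return [t / n for t in totals]

random.seed(0)
data = [(random.random(), random.random()) for _ in range(500)]  # unit square
grid = [0.1 + 0.8 * i / 3 for i in range(4)]
means = [(gx, gy) for gx in grid for gy in grid]  # 4 x 4 lattice of neurons
w = [1.0 / 16.0] * 16                             # equal initial weights

w_new = update_weights(data, means, w)
print(round(sum(w_new), 6))  # re-estimated weights still sum to 1, prints 1.0
```

In the coupled PbSOM setting, the experiments above show that repeatedly applying updates of this kind lets a few weights dominate and the rest shrink toward zero, which is why the experiments instead fix all weights at 1/16.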


More information

Utilization of Chemical Structure Information for Analysis of Spectra Composites

Utilization of Chemical Structure Information for Analysis of Spectra Composites ESANN 214 proceedings, European Symposium on Artificia Neura Networks, Computationa Inteigence and Machine Learning Bruges (Begium), 23-25 Apri 214, i6doccom pub, ISBN 978-28741995-7 Avaiabe from http://wwwi6doccom/fr/ivre/?gcoi=281143244

More information

Do Schools Matter for High Math Achievement? Evidence from the American Mathematics Competitions Glenn Ellison and Ashley Swanson Online Appendix

Do Schools Matter for High Math Achievement? Evidence from the American Mathematics Competitions Glenn Ellison and Ashley Swanson Online Appendix VOL. NO. DO SCHOOLS MATTER FOR HIGH MATH ACHIEVEMENT? 43 Do Schoos Matter for High Math Achievement? Evidence from the American Mathematics Competitions Genn Eison and Ashey Swanson Onine Appendix Appendix

More information

A simple reliability block diagram method for safety integrity verification

A simple reliability block diagram method for safety integrity verification Reiabiity Engineering and System Safety 92 (2007) 1267 1273 www.esevier.com/ocate/ress A simpe reiabiity bock diagram method for safety integrity verification Haitao Guo, Xianhui Yang epartment of Automation,

More information

Sequential Decoding of Polar Codes with Arbitrary Binary Kernel

Sequential Decoding of Polar Codes with Arbitrary Binary Kernel Sequentia Decoding of Poar Codes with Arbitrary Binary Kerne Vera Miosavskaya, Peter Trifonov Saint-Petersburg State Poytechnic University Emai: veram,petert}@dcn.icc.spbstu.ru Abstract The probem of efficient

More information

Power Control and Transmission Scheduling for Network Utility Maximization in Wireless Networks

Power Control and Transmission Scheduling for Network Utility Maximization in Wireless Networks ower Contro and Transmission Scheduing for Network Utiity Maximization in Wireess Networks Min Cao, Vivek Raghunathan, Stephen Hany, Vinod Sharma and. R. Kumar Abstract We consider a joint power contro

More information

An approximate method for solving the inverse scattering problem with fixed-energy data

An approximate method for solving the inverse scattering problem with fixed-energy data J. Inv. I-Posed Probems, Vo. 7, No. 6, pp. 561 571 (1999) c VSP 1999 An approximate method for soving the inverse scattering probem with fixed-energy data A. G. Ramm and W. Scheid Received May 12, 1999

More information

Interactive Fuzzy Programming for Two-level Nonlinear Integer Programming Problems through Genetic Algorithms

Interactive Fuzzy Programming for Two-level Nonlinear Integer Programming Problems through Genetic Algorithms Md. Abu Kaam Azad et a./asia Paciic Management Review (5) (), 7-77 Interactive Fuzzy Programming or Two-eve Noninear Integer Programming Probems through Genetic Agorithms Abstract Md. Abu Kaam Azad a,*,

More information

Akaike Information Criterion for ANOVA Model with a Simple Order Restriction

Akaike Information Criterion for ANOVA Model with a Simple Order Restriction Akaike Information Criterion for ANOVA Mode with a Simpe Order Restriction Yu Inatsu * Department of Mathematics, Graduate Schoo of Science, Hiroshima University ABSTRACT In this paper, we consider Akaike

More information

Random maps and attractors in random Boolean networks

Random maps and attractors in random Boolean networks LU TP 04-43 Rom maps attractors in rom Booean networks Björn Samuesson Car Troein Compex Systems Division, Department of Theoretica Physics Lund University, Sövegatan 4A, S-3 6 Lund, Sweden Dated: 005-05-07)

More information

Control Chart For Monitoring Nonparametric Profiles With Arbitrary Design

Control Chart For Monitoring Nonparametric Profiles With Arbitrary Design Contro Chart For Monitoring Nonparametric Profies With Arbitrary Design Peihua Qiu 1 and Changiang Zou 2 1 Schoo of Statistics, University of Minnesota, USA 2 LPMC and Department of Statistics, Nankai

More information

An Approximate Fisher Scoring Algorithm for Finite Mixtures of Multinomials

An Approximate Fisher Scoring Algorithm for Finite Mixtures of Multinomials An Approximate Fisher Scoring Agorithm for Finite Mixtures of Mutinomias Andrew M. Raim, Mingei Liu, Nagaraj K. Neercha and Jorge G. More Abstract Finite mixture distributions arise naturay in many appications

More information

arxiv:hep-ph/ v1 15 Jan 2001

arxiv:hep-ph/ v1 15 Jan 2001 BOSE-EINSTEIN CORRELATIONS IN CASCADE PROCESSES AND NON-EXTENSIVE STATISTICS O.V.UTYUZH AND G.WILK The Andrzej So tan Institute for Nucear Studies; Hoża 69; 00-689 Warsaw, Poand E-mai: utyuzh@fuw.edu.p

More information

Introduction. Figure 1 W8LC Line Array, box and horn element. Highlighted section modelled.

Introduction. Figure 1 W8LC Line Array, box and horn element. Highlighted section modelled. imuation of the acoustic fied produced by cavities using the Boundary Eement Rayeigh Integra Method () and its appication to a horn oudspeaer. tephen Kirup East Lancashire Institute, Due treet, Bacburn,

More information

Stochastic Automata Networks (SAN) - Modelling. and Evaluation. Paulo Fernandes 1. Brigitte Plateau 2. May 29, 1997

Stochastic Automata Networks (SAN) - Modelling. and Evaluation. Paulo Fernandes 1. Brigitte Plateau 2. May 29, 1997 Stochastic utomata etworks (S) - Modeing and Evauation Pauo Fernandes rigitte Pateau 2 May 29, 997 Institut ationa Poytechnique de Grenobe { IPG Ecoe ationae Superieure d'informatique et de Mathematiques

More information

Tracking Control of Multiple Mobile Robots

Tracking Control of Multiple Mobile Robots Proceedings of the 2001 IEEE Internationa Conference on Robotics & Automation Seou, Korea May 21-26, 2001 Tracking Contro of Mutipe Mobie Robots A Case Study of Inter-Robot Coision-Free Probem Jurachart

More information

8 Digifl'.11 Cth:uits and devices

8 Digifl'.11 Cth:uits and devices 8 Digif'. Cth:uits and devices 8. Introduction In anaog eectronics, votage is a continuous variabe. This is usefu because most physica quantities we encounter are continuous: sound eves, ight intensity,

More information

A unified framework for Regularization Networks and Support Vector Machines. Theodoros Evgeniou, Massimiliano Pontil, Tomaso Poggio

A unified framework for Regularization Networks and Support Vector Machines. Theodoros Evgeniou, Massimiliano Pontil, Tomaso Poggio MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY and CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES A.I. Memo No. 1654 March23, 1999

More information

Uniprocessor Feasibility of Sporadic Tasks with Constrained Deadlines is Strongly conp-complete

Uniprocessor Feasibility of Sporadic Tasks with Constrained Deadlines is Strongly conp-complete Uniprocessor Feasibiity of Sporadic Tasks with Constrained Deadines is Strongy conp-compete Pontus Ekberg and Wang Yi Uppsaa University, Sweden Emai: {pontus.ekberg yi}@it.uu.se Abstract Deciding the feasibiity

More information

SVM-based Supervised and Unsupervised Classification Schemes

SVM-based Supervised and Unsupervised Classification Schemes SVM-based Supervised and Unsupervised Cassification Schemes LUMINITA STATE University of Pitesti Facuty of Mathematics and Computer Science 1 Targu din Vae St., Pitesti 110040 ROMANIA state@cicknet.ro

More information

Radar/ESM Tracking of Constant Velocity Target : Comparison of Batch (MLE) and EKF Performance

Radar/ESM Tracking of Constant Velocity Target : Comparison of Batch (MLE) and EKF Performance adar/ racing of Constant Veocity arget : Comparison of Batch (LE) and EKF Performance I. Leibowicz homson-csf Deteis/IISA La cef de Saint-Pierre 1 Bd Jean ouin 7885 Eancourt Cede France Isabee.Leibowicz

More information

8 APPENDIX. E[m M] = (n S )(1 exp( exp(s min + c M))) (19) E[m M] n exp(s min + c M) (20) 8.1 EMPIRICAL EVALUATION OF SAMPLING

8 APPENDIX. E[m M] = (n S )(1 exp( exp(s min + c M))) (19) E[m M] n exp(s min + c M) (20) 8.1 EMPIRICAL EVALUATION OF SAMPLING 8 APPENDIX 8.1 EMPIRICAL EVALUATION OF SAMPLING We wish to evauate the empirica accuracy of our samping technique on concrete exampes. We do this in two ways. First, we can sort the eements by probabiity

More information

VI.G Exact free energy of the Square Lattice Ising model

VI.G Exact free energy of the Square Lattice Ising model VI.G Exact free energy of the Square Lattice Ising mode As indicated in eq.(vi.35), the Ising partition function is reated to a sum S, over coections of paths on the attice. The aowed graphs for a square

More information

Two view learning: SVM-2K, Theory and Practice

Two view learning: SVM-2K, Theory and Practice Two view earning: SVM-2K, Theory and Practice Jason D.R. Farquhar jdrf99r@ecs.soton.ac.uk Hongying Meng hongying@cs.york.ac.uk David R. Hardoon drh@ecs.soton.ac.uk John Shawe-Tayor jst@ecs.soton.ac.uk

More information

A Sparse Covariance Function for Exact Gaussian Process Inference in Large Datasets

A Sparse Covariance Function for Exact Gaussian Process Inference in Large Datasets A Covariance Function for Exact Gaussian Process Inference in Large Datasets Arman ekumyan Austraian Centre for Fied Robotics The University of Sydney NSW 26, Austraia a.mekumyan@acfr.usyd.edu.au Fabio

More information

Mode in Output Participation Factors for Linear Systems

Mode in Output Participation Factors for Linear Systems 2010 American ontro onference Marriott Waterfront, Batimore, MD, USA June 30-Juy 02, 2010 WeB05.5 Mode in Output Participation Factors for Linear Systems Li Sheng, yad H. Abed, Munther A. Hassouneh, Huizhong

More information

A Bayesian Framework for Learning Rule Sets for Interpretable Classification

A Bayesian Framework for Learning Rule Sets for Interpretable Classification Journa of Machine Learning Research 18 (2017) 1-37 Submitted 1/16; Revised 2/17; Pubished 8/17 A Bayesian Framework for Learning Rue Sets for Interpretabe Cassification Tong Wang Cynthia Rudin Finae Doshi-Veez

More information