Particle Swarm Optimization of Hidden Markov Models: a comparative study
D. Novák, Department of Cybernetics, Czech Technical University in Prague, Czech Republic, xnovakd@labe.felk.cvut.cz
M. Macaš, Department of Cybernetics, Czech Technical University in Prague, Czech Republic, mmacas@seznam.cz

Abstract — In recent years, Hidden Markov Models (HMM) have been increasingly applied in data mining applications; however, most authors have used the classical Expectation-Maximization (EM) optimization scheme. A new method of HMM learning based on Particle Swarm Optimization (PSO) has been developed. Along with other global approaches, Simulated Annealing (SIM) and Genetic Algorithms (GA), the following local gradient methods have also been compared: the classical Expectation-Maximization algorithm, the Maximum A Posteriori approach (MAP) and variational Bayes learning (VAR). The methods are evaluated on a synthetic data set using different evaluation criteria, including a classification problem. The most reliable optimization approach in terms of performance, numerical stability and speed is VAR learning, followed by the PSO approach.

I. INTRODUCTION

Will the classical EM algorithm for HMM optimization stand up? For several decades the EM algorithm has been the gold standard for HMM optimization. Owing to the increasing popularity of the HMM modelling technique, HMMs have been applied in many areas such as speech processing, signal processing, dynamic systems, robotics, handwriting recognition, economics and molecular biology. In all of these applications the authors used the EM algorithm. In this paper we ask whether other optimization techniques could bring further improvements. We present a comparative study of several different methods for continuous HMMs and introduce a new global technique based on Particle Swarm Optimization (PSO) [1]. The techniques can be divided into two groups: (i) hill-climbing algorithms (the EM, MAP and VAR approaches) and (ii) global search algorithms (PSO, the genetic approach and simulated annealing). The first group depends quite strongly on the initial estimate of the model parameters; an arbitrary initial estimate will usually lead to a sub-optimal model in practice. The second group is able to escape from the initial guess and find better solutions thanks to its global search capability.

The paper is organized as follows. First, HMM theory is introduced. Then, in Sect. II-B, the optimization techniques are described. In Sect. II-C several criteria for evaluating the methods are defined. Finally, results are presented in Sect. III and a discussion with concluding remarks follows in Sect. IV.

II. METHOD

A. Hidden Markov Models

An HMM is a stochastic finite state automaton characterized by the following [2]:

1) N, the number of states in the model. Although the states are hidden, for many practical applications they often have some physical significance. We denote the individual states as S = (S_1, S_2, ..., S_N), and the state at time t as q_t.

2) M, the number of distinct observation symbols per state, or the number of mixtures in the Gaussian pdf.

3) The state transition probability distribution A = {a_{ij}}, of size N × N, which defines the probability of a transition from state i at time t to state j at time t+1:

a_{ij} = P(q_{t+1} = S_j | q_t = S_i),   1 ≤ i, j ≤ N.   (1)

4) The initial state distribution π = {π_i}, which defines the probability of any given state being the initial state of the given sequences:

π_i = P(q_1 = S_i),   1 ≤ i ≤ N.   (2)
5) The emission probability, which can be further divided into two categories depending on whether the observation sequence is discrete or continuous. In the present paper the continuous model is used: the continuous emission probability B = {b_j(O_t)}, where O = O_1, O_2, ..., O_T, and the emission probability density function of each state is defined by a finite multivariate Gaussian mixture:

b_j(O_t) = Σ_{m=1}^{M} d_{jm} N(O_t, µ_{jm}, C_{jm}),   1 ≤ j ≤ N,   (3)

where O_t is the feature vector of the sequence being modelled, d_{jm} is the mixture coefficient of the m-th mixture in state j, and N is a Gaussian density with mean vector µ_{jm} and covariance matrix C_{jm} for the m-th mixture component in state j.
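To make Eq. (3) concrete, the following sketch (not part of the original paper; a minimal NumPy illustration with hypothetical variable names) evaluates the Gaussian-mixture emission density b_j(O_t) for a single state j.

```python
import numpy as np

def gaussian_pdf(o, mu, cov):
    """Multivariate Gaussian density N(o; mu, cov)."""
    d = len(mu)
    diff = o - mu
    norm = np.sqrt(((2.0 * np.pi) ** d) * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm)

def emission_prob(o_t, d_j, mu_j, cov_j):
    """b_j(O_t) of Eq. (3): a finite mixture of M Gaussians for one state j.

    d_j   : (M,)      mixture coefficients d_{jm}, summing to one
    mu_j  : (M, D)    mean vectors mu_{jm}
    cov_j : (M, D, D) covariance matrices C_{jm}
    """
    return sum(d_j[m] * gaussian_pdf(o_t, mu_j[m], cov_j[m])
               for m in range(len(d_j)))
```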
We will refer to these models as continuous HMMs (CHMM). A complete specification of an HMM requires the specification of two model parameters (N and M) and of the three probability measures A, B and π. We will refer to the complete parameter set as λ = {A, B, π}. Each model can be used to compute the probability P(O|λ) of observing an input sequence O = O_1, ..., O_T, to find the state sequence S that maximizes the probability P(S|O, λ) of the input sequence, and to induce the model λ̄ that maximizes the probability of a given sequence, P(O|λ̄) > P(O|λ). These tasks are known as the three problems of an HMM: evaluation, generation, and training.

B. HMM optimization techniques

1) Expectation-Maximization [7], or Baum-Welch: The EM algorithm is a general method for finding the maximum-likelihood (ML) estimate of the parameters of an underlying distribution from a given data set when the data are incomplete or have missing values. Let us have a density function P(O|λ) governed by the parameter set λ and observations O = {O_1, O_2, ..., O_N}. We assume that these data are independent and identically distributed (iid) with distribution P. The resulting density for the samples is therefore

P(O|λ) = Π_{i=1}^{N} P(O_i|λ) = L(λ|O).   (4)

In the maximum likelihood problem, our goal is to find the λ that maximizes L. That is, we wish to find λ* where

λ* = argmax_λ L(λ|O).   (5)

The EM algorithm first finds the expected value of the complete-data likelihood P(O, S|λ) with respect to the unknown hidden states S, given the observed data O and the current parameter estimates λ^(i). The evaluation of this expectation is called the E-step of the algorithm. The second step, called the M-step, maximizes the expectation computed in the first step.

2) Maximum A Posteriori approach (MAP) [3]: The difference between MAP and ML estimation lies in the assumption of an appropriate prior distribution of the parameters to be estimated from the observation sequence O with probability density function P(O|λ). If P(λ) is the prior density function of λ, then the MAP estimate is defined as

λ_MAP = argmax_λ P(λ|O) = argmax_λ P(O|λ) P(λ),   (6)

where we used Bayes' theorem. Regarding the MAP estimate, it follows the same iterative procedure as the EM algorithm described in the previous paragraph. Nevertheless, the EM algorithm can be applied to the MAP estimation problem if the prior density P(λ) belongs to the conjugate family of the complete-data density. For the initial and transition probabilities, a Dirichlet density was used for the initial probability vector π and for each row of the transition probability matrix A. For the mean vector and covariance matrix of the Gaussian mixture, the conjugate densities are a Normal density for the mean and a normal-Wishart density for the covariance [3].
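To illustrate the ML/MAP distinction, the sketch below (a standard conjugate-prior update shown for illustration under assumed array shapes, not the authors' implementation) re-estimates the transition matrix A from the expected transition counts produced by the E-step. Without a prior it is the ordinary Baum-Welch (ML) M-step; with a Dirichlet prior per row it becomes the MAP M-step, which simply adds pseudo-counts before normalization.

```python
import numpy as np

def reestimate_transitions(expected_counts, dirichlet_prior=None):
    """M-step for the transition matrix A from expected counts.

    expected_counts : (N, N) array; entry [i, j] is the expected number of
                      i -> j transitions accumulated during the E-step.
    dirichlet_prior : optional (N, N) array of Dirichlet hyperparameters
                      (one Dirichlet per row of A), as used in MAP learning.
    """
    counts = np.asarray(expected_counts, dtype=float).copy()
    if dirichlet_prior is not None:
        # MAP: the mode of the Dirichlet posterior adds (prior - 1) pseudo-counts.
        counts += np.asarray(dirichlet_prior, dtype=float) - 1.0
    # ML (Baum-Welch) and MAP alike: normalize each row to sum to one.
    return counts / counts.sum(axis=1, keepdims=True)
```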
3) Variational Bayes learning: We wish to approximate the conditional probability P(S|O), because exact algorithms might not provide a satisfactory solution to the inference and learning problems due to their time or space complexity. We introduce an approximating family of conditional probability distributions, Q(S|O, ν), where ν are variational parameters. From this family we choose a particular distribution by minimizing the Kullback-Leibler (KL) divergence, D(Q‖P), with respect to the variational parameters [4]:

ν* = argmin_ν D[Q(S|O, ν) ‖ P(S|O)],   (7)

where the KL divergence between any probability distributions Q(V) and P(V), with V = {S, O}, is defined as

D(Q‖P) = Σ_{V} Q(V) ln [Q(V) / P(V)].   (8)

The minimizing values of the variational parameters, ν*, define a particular distribution, Q(S|O, ν*), that we treat as the best approximation of P(S|O) within the family Q(S|O, ν). Another important remark is that the optimization procedure can be cast in the framework of the EM algorithm, as in MAP learning. We used the same approximating family as in the MAP approach, i.e. Dirichlet, Normal and normal-Wishart densities.

Fig. 1 shows the typical development of the density functions during a particular training period. Note that the model priors were set to be as flat as possible in this case. The mean priors, which follow a Normal cdf, are very flat, see Fig. 1(b), first column. The covariance priors, which follow a Wishart cdf, were set to the total training covariance matrix. As the training process proceeds, the refinement of the means and covariances can be observed. A similar development can also be observed for the MAP approach.

Fig. 1. Sub-figure (a): Development of the first two Dirichlet cdf states of model (13) throughout variational Bayes learning; the total number of iterations was 38. Sub-figure (b): Development of the mean and variance cdf of model (13) throughout variational Bayes learning.

4) Simulated annealing: Simulated annealing is a well-known general heuristic approach to combinatorial optimization. Given the observation sequence O, a state sequence S is generated at random and the logarithm of the probability P(O|λ) of generating O is considered to be the objective value f(S) to be minimized. The solution structure is based on the choice of a state trajectory. The various building blocks were chosen as follows [5]: (i) The initial solution is obtained simply by generating a random state trajectory.
(ii) The initial temperature should be high enough to allow virtually all transitions to be accepted. Thereafter, the temperature is decreased at each iteration by a factor of 0.98. (iii) The number of trials at each temperature progressively increases as the temperature decreases (in our case by a constant factor). (iv) A move from one solution to the next is obtained by choosing at random a state at a randomly chosen instant and reassigning it randomly to another state. (v) The objective function to be minimized is the overall probability of the observation sequence.

5) Genetic algorithm: Thanks to the global search capability and the problem-independent nature of GAs, a GA for HMM training can find the optimal model parameters. Generally speaking, each GA consists of several components: an encoding mechanism, a fitness evaluation, a selection mechanism, and a replacement mechanism. Below we briefly describe our algorithm, which is a modification of the approach proposed in [6], with a selection sketch given after this paragraph. In the encoding mechanism, each chromosome represents one HMM and each gene expresses one HMM parameter. The likelihood P(O|λ) is used in the fitness function to determine the quality of a chromosome. The selection mechanism is one of the most common, roulette-wheel selection. Finally, steady-state reproduction is used as the replacement strategy. To speed up the GA, we also used a hybrid operator: after every ten generations, classical EM estimation (with only 8 HMM iterations) is applied to all chromosomes in the population. Regarding the algorithm setup, the number of generations was N_gen = 6, the number of chromosomes in the population N_pop = 6, and the number of offspring N_child = 6.
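The sketch below is a rough, NumPy-based illustration (hypothetical helper names; not the authors' code) of the encoding and roulette-wheel selection steps just described. Each chromosome is the HMM parameter set flattened into a vector of genes, and parents are drawn with probability proportional to a positive-shifted version of their log-likelihood fitness. The steady-state replacement and the hybrid EM operator are omitted.

```python
import numpy as np

def encode_hmm(A, pi, means, variances):
    """Flatten one HMM parameter set into a chromosome; each gene is one parameter."""
    return np.concatenate([A.ravel(), pi.ravel(),
                           means.ravel(), variances.ravel()])

def roulette_select(population, fitnesses, rng):
    """Roulette-wheel selection over a list of chromosomes.

    fitnesses are log-likelihoods log P(O | lambda); they are shifted to be
    positive so that selection probabilities are proportional to fitness.
    """
    f = np.asarray(fitnesses, dtype=float)
    f = f - f.min() + 1e-9
    probs = f / f.sum()
    idx = rng.choice(len(population), p=probs)
    return population[idx]

# Toy usage: select a parent chromosome from a small random population.
rng = np.random.default_rng(0)
population = [rng.random(20) for _ in range(6)]
fitnesses = [-120.0, -95.5, -101.2, -88.7, -130.4, -99.9]  # illustrative values only
parent = roulette_select(population, fitnesses, rng)
```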
6) Particle Swarm Optimization: The PSO method is an optimization method developed for finding the global optimum of a nonlinear function [1]. It is inspired by the social behaviour of birds and fish. The method uses a group of problem solutions. Each solution consists of a set of parameters and represents a point in a multidimensional space. A solution is called a particle and the group of particles (the population) is called a swarm. Two kinds of information are available to the particles. The first is their own experience: they have tried the choices and know which state has been best so far and how good it was. The second is social knowledge: the particles know how the other individuals in their neighbourhood have performed.

Each particle i is represented as a D-dimensional position vector x_i(t) and has a corresponding instantaneous velocity vector v_i(t). Furthermore, it remembers its individual best value of the fitness function and the position p_i at which that value was achieved. During each iteration t, the velocity update rule (9) is applied to each particle in the swarm, where p_g is the best position of the entire swarm and represents the social knowledge:

v_i(t) = α v_i(t−1) + Φ_1 (p_i − x_i(t−1)) + Φ_2 (p_g − x_i(t−1)).   (9)

The parameter α is called the inertia weight and decreases linearly from α_start to α_end over the iterations. The symbols Φ_1 and Φ_2 are computed according to

Φ_j = ϕ_j [r_{j1}, r_{j2}, ..., r_{jD}]^T,   j = 1, 2,   (10)

where the parameters ϕ_j are constants that weight the influence of the particle's own experience and of the social knowledge, and r_{jk}, k = 1, ..., D, are random numbers drawn from a uniform distribution between 0 and 1. In our experiments, fixed values of ϕ_1, ϕ_2, α_start and α_end were used. Next, the position update rule

x_i(t) = x_i(t−1) + v_i(t)   (11)

is applied. If any component of v_i is less than −V_max or greater than +V_max, the corresponding value is replaced by −V_max or +V_max, respectively. V_max is the maximum velocity parameter, whose setting depends on the range of the HMM parameters. The update formulas (9) and (11) are applied during each iteration, and the p_i and p_g values are updated simultaneously. The algorithm stops when the maximum number of iterations is reached.
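The velocity and position updates (9)-(11), including the ±V_max clamping, can be sketched as follows (a minimal NumPy illustration under assumed array shapes, not the authors' implementation; decoding a particle back into the HMM parameter set λ and evaluating its log-likelihood fitness are left out).

```python
import numpy as np

def pso_step(x, v, p_best, g_best, alpha, phi1, phi2, v_max, rng):
    """One swarm update according to Eqs. (9)-(11).

    x, v    : (P, D) positions and velocities of the P particles
    p_best  : (P, D) best position found so far by each particle
    g_best  : (D,)   best position found so far by the whole swarm
    alpha   : inertia weight (decreased linearly from alpha_start to alpha_end)
    phi1/2  : constants weighting own experience and social knowledge
    v_max   : maximum absolute velocity per component
    """
    r1 = rng.random(x.shape)        # uniform [0, 1] draws r_{1k}
    r2 = rng.random(x.shape)        # uniform [0, 1] draws r_{2k}
    v = alpha * v + phi1 * r1 * (p_best - x) + phi2 * r2 * (g_best - x)
    v = np.clip(v, -v_max, v_max)   # clamp every component to [-Vmax, +Vmax]
    x = x + v                       # position update, Eq. (11)
    return x, v
```

In a complete training loop, each particle would then be decoded into λ = {A, B, π}, its fitness log P(O|λ) evaluated, and p_i and p_g refreshed before the next iteration, until the maximum number of iterations is reached.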
C. Evaluation criteria

We evaluate the quality of the learned HMMs using the following criteria:

Data likelihood (Lik). The data likelihood measures the log-likelihood of the data for a given HMM.

Distance measure (DM). We can define a distance measure D(λ_1, λ_2) between two Markov models, λ_1 (generating model) and λ_2 (derived model), as [2]

D(λ_1, λ_2) = (1/T) [log P(O|λ_1) − log P(O|λ_2)],   (12)

where O = O_1, O_2, ..., O_T is a sequence of observations generated by model λ_1.

Classification experiment (Clas). The last benchmark is the classification rate (in %) on a synthetic data set generated from three very similar HMMs (13)-(15).

Time. Duration of the classification task in minutes. The following computational framework was used in all experiments: Intel Pentium, 3 GHz, Windows Vista, Matlab R7 edition.

III. RESULTS AND DISCUSSION

To verify the effectiveness of the different initialization and learning methods, we performed ten experiment runs in total (N_run = 10) to obtain statistically significant results. Regarding the HMM parameter setup, we used a continuous HMM with only one mixture component per state (M = 1) and a diagonal covariance matrix in B. We constructed three HMM models (13)-(15) for generating the synthetic data set. The models are full-transition models with four states. The data set consists of one observation sequence of length T, generated by a probabilistic walk through the HMM. In the first part of the experiment, the likelihood and the distance measure were evaluated on the first model (13). In the second part, the classification experiment, the HMM generating models used for each class are those shown in (13)-(15), differing only slightly in the transition matrix A and the observation matrix B; in total, the three HMMs are quite similar to each other.

(Models (13)-(15): the generating models λ_1, λ_2, λ_3 are four-state CHMMs specified by initial distributions π_k, full transition matrices A_k and single-Gaussian emission parameters, means µ and variances σ²; their numerical values are not reproduced here.)

A summary of the evaluation criteria is shown in Table I. The mean and the variance (in parentheses) of each criterion across the ten runs are given. For the local methods (EM, VAR, MAP), k-means initialization was used, and the stopping criterion combined a small tolerance on the likelihood improvement with a maximum number of iterations. The GA and PSO population sizes were held constant across the runs.

TABLE I
METHODS COMPARISON: LIKELIHOOD (LIK), DISTANCE MEASURE (DM), CLASSIFICATION (%) AND TIME (MINS)

Method   Lik           DM          Clas          Time
EM       99.6 (5.5)    . (.7)      85.  (.6)     .5
MAP      6.   (3.)     5.9 (3.)    83.6 (.5)     .
VAR      8.   (.6)     6.  (3.)    8.   (.3)     3.
SIM      .9   (5.6)    7.3 (5.)    8.   (3.9)    .
GA       .    (3.)     .   (.8)    83.  (5.)     9
PSO      7.   (5.9)    .8  (.)     9.7  (.7)     5.

The best classification performance is achieved by PSO and the classical HMM approach, while the MAP, VAR, GA and SIM methods yield similar results. Not surprisingly, all global optimization methods are several times slower than the local gradient approaches.
In particular, the time cost of GA and PSO is caused by their population size. The most stable method from the group of local optimization techniques was variational Bayes: unlike its counterparts (EM and MAP), the covariance matrix of the Gaussian density B did not collapse into singular points so frequently during optimization, see Fig. 1, where the density functions of the HMM parameters are depicted over the training iterations.

Fig. 2. Log-likelihood comparison of the algorithms (EM, MAP, VAR, SIM, GA, PSO) as a function of the iteration number.

In Fig. 2, the log-likelihood curves are compared for one run of each algorithm. In this case PSO outperformed classical EM in terms of likelihood; however, the final values of the global techniques are close to each other.

IV. CONCLUSION

All three local-search learning techniques follow the same EM optimization framework. Apart from the main drawback of the EM algorithm, its sensitivity to initialization, the EM algorithm also produced meaningless parameter estimates several times, when EM converged to the boundary of the parameter space. There the likelihood is unbounded [7], and the computation had to be either restarted completely, or it was sufficient to reset only the covariance diagonals. Using global strategies such as the PSO, SIM and GA approaches, these problems were overcome. On the other hand, we have traded time-efficient optimization for numerical stability. This is an important drawback, because even when using a hybrid combination of local and global approaches, these algorithms were still at least ten times slower than the local gradient approaches. In terms of performance on the evaluation criteria, no large differences between the local and global techniques were observed. To sum up, the most suitable choice is to apply a local gradient algorithm: the classical EM approach due to its speed, or the variational Bayes approach due to its numerical stability and insensitivity to initialization. However, if time is not the most limiting factor, then Particle Swarm Optimization yielded the best performance.

ACKNOWLEDGMENT

The project was supported by the Ministry of Education, Youth and Sport of the Czech Republic under grant No. MSM6877, "Transdisciplinary Research in Biomedical Engineering II".

REFERENCES

[1] R. Eberhart, Y. Shi, and J. Kennedy, Swarm Intelligence. Morgan Kaufmann, 2001.
[2] L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, 1989.
[3] J. Gauvain and C. Lee, "Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains," IEEE Transactions on Speech and Audio Processing, vol. 2, 1994.
[4] I. Rezek and S. Roberts, "Learning ensemble hidden Markov models for biosignal analysis," in International Conference on Digital Signal Processing, Greece, 2002.
[5] Y. Hamam and T. Al-Ani, "Simulated annealing approach for training hidden Markov models," in Working Conference on Optimization-Based Computer-Aided Modeling and Design, ESIEE, France, 1996.
[6] S. Kwong, C. Chau, K. Man, and K. Tang, "Optimisation of HMM topology and its model parameters by genetic algorithms," Pattern Recognition, vol. 34, 2001.
[7] G. McLachlan and T. Krishnan, The EM Algorithm and Extensions. John Wiley & Sons, 1997.