MODIFIED K-MEANS CLUSTERING METHOD OF HMM STATES FOR INITIALIZATION OF BAUM-WELCH TRAINING ALGORITHM

Pauline Larue 1, Pierre Jallon 1, Bertrand Rivet 2

1 CEA LETI - MINATEC Campus, Grenoble, France, email: pierre.jallon@cea.fr
2 GIPSA-lab, CNRS-UMR5216, Grenoble University, Grenoble Cedex, France, email: bertrand.rivet@gipsa-lab.grenoble-inp.fr

ABSTRACT

Hidden Markov models are widely used for recognition algorithms (speech, writing, gesture, ...). In this paper, a classical set of models is considered: the state space of the hidden variable is discrete and the observation probabilities are modeled as Gaussian distributions. The model parameters are generally estimated from training sequences with the Baum-Welch algorithm, i.e. an expectation-maximization algorithm. However, this kind of algorithm is well known to be sensitive to its initialization point. The choice of this initialization point is addressed in this paper: a model with a very large number of states, which describes the training sequences accurately, is first constructed. The number of states is then reduced using a k-means algorithm on the states. This algorithm is compared, with numerical simulations, to other methods based on a k-means algorithm on the data.

1. INTRODUCTION

Hidden Markov models (HMM) are widely used for the statistical modeling of signals having a temporal structure. They are based on two stochastic processes: the state process and the observation process. The first one, X, is the so-called state variable and is a Markov chain, assumed discrete in this paper. It is not observed, but it is fully characterized by its transition and initialization probabilities. In many recognition algorithms (e.g., speech [1], writing recognition [2], gesture recognition [3], ...), this variable is used to describe the temporal structure of the signals to model. In particular, if these signals can be described as sequences of (shorter) stationary signals, a very particular set of models can be used: the left-right models, i.e. models whose transition probabilities satisfy the constraints

$$\forall (n_1, n_2),\; n_2 < n_1: \quad p(X_t = n_2 \mid X_{t-1} = n_1) = 0,$$

which means that the state variable can only be a non-decreasing sequence. The second process, Y, is the so-called observation process and is assumed to be independent conditionally on X [1]. For each state n in {0, ..., N-1} a probability density function (p.d.f.) p(Y | X = n) is defined, where N is the number of different values taken by X, i.e. the number of hidden states. Y is assumed to be a continuous variable and its p.d.f. is modeled for each state as a Gaussian distribution. The following notations are used in the rest of this paper: the initial probability of state n is denoted π_n = p(X_0 = n), n in {0, ..., N-1}, and the transition probability a_{n1,n2} between states n1 and n2 is defined by a_{n1,n2} = p(X_t = n2 | X_{t-1} = n1), (n1, n2) in {0, ..., N-1} x {0, ..., N-1}. The observation probabilities are described, for state #n, by two variables: µ_n (mean vector) and Σ_n (covariance matrix). Finally, the whole set of parameters of an HMM with N states is denoted λ:

$$\lambda = \left( \{\pi_n\}_n,\; \{a_{n_1,n_2}\}_{n_1,n_2},\; \{\mu_n, \Sigma_n\}_n \right).$$

In general, λ is estimated using training sequences. Given K observation sequences of length T_k, Y_{k,0:T_k} = {Y_{k,0}, ..., Y_{k,T_k}}, the optimal set of parameters is defined as

$$\hat{\lambda} = \underset{\lambda}{\mathrm{argmax}} \prod_{k=1}^{K} p\left(Y_{k,0:T_k} \mid \lambda\right).$$

Without additional assumptions, this problem cannot be solved analytically. The most widely used technique to estimate λ is the expectation-maximization (EM) algorithm [4], through the forward-backward method: the so-called Baum-Welch algorithm [5]. Starting from a preliminary set of parameters λ^(0), the algorithm estimates the set of parameters iteratively, denoted λ^(s) at step #s, such that the likelihood increases with respect to s. λ̂ is then estimated as

$$\hat{\lambda} = \lim_{s \to \infty} \lambda^{(s)}.$$
However, convergence to a global maximum is not ensured, and this algorithm is well known to be very sensitive to its initialization point λ^(0) [6]. Several initialization methods have hence been proposed in the literature. A k-means algorithm on the data can be used to cluster the observations Y_{k,0:T_k} [1, 7], or several sets of parameters can be used to perform the training [8]. For left-right models, constrained clustering techniques can also be used with a k-means algorithm on the observation data [9].
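To make the sensitivity to λ^(0) concrete, the following minimal sketch (not from the paper) runs the same Baum-Welch training from a user-supplied initialization and returns the reached log-likelihood; it assumes the third-party hmmlearn package, and all function names and parameter values are ours, for illustration only.

```python
# Minimal sketch (assumes hmmlearn): run Baum-Welch (EM) from a supplied
# lambda^(0) and report the log-likelihood reached at convergence.
import numpy as np
from hmmlearn import hmm

def fit_from(init_means, X, lengths, n_states=4, n_iter=50):
    """X: concatenated sequences, shape (sum(lengths), d)."""
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="full",
                            n_iter=n_iter, init_params="")  # keep our lambda^(0)
    # Unconstrained-model initialization used in the paper (Section 2):
    # 0.8 on the diagonal, 0.2/(N-1) elsewhere, uniform initial probabilities.
    model.startprob_ = np.full(n_states, 1.0 / n_states)
    model.transmat_ = np.full((n_states, n_states), 0.2 / (n_states - 1))
    np.fill_diagonal(model.transmat_, 0.8)
    model.means_ = np.asarray(init_means, dtype=float)        # (N, d)
    model.covars_ = np.tile(np.eye(X.shape[1]), (n_states, 1, 1))
    model.fit(X, lengths)            # Baum-Welch / EM iterations
    return model.score(X, lengths)   # final log-likelihood
```

Calling fit_from with two different sets of initial mean vectors on the same data typically reaches two different local maxima, which is exactly the behavior the proposed initialization aims to mitigate.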

In this paper, an alternative approach is proposed to estimate a first set λ^(0). A set of parameters is first estimated with a very large number of states (M >> N) to describe all training sequences in a very accurate way. However, this set of parameters cannot be used in practice, due to over-learning problems and computational time issues. As a consequence, the number of states has to be reduced to a much smaller number N to overcome these difficulties. An unsupervised clustering algorithm (based on the k-means algorithm) on the observation p.d.f.s is hence proposed to perform this reduction while keeping the accuracy of the training signals' description. Transition and initialization probabilities remain to be estimated with the Baum-Welch algorithm. The proposed method is derived in two cases: unconstrained and left-right models.

The paper is structured as follows. Section 2 describes the main lines of the algorithm. The k-means algorithm operations are detailed in Section 3. The performance of the proposed method is compared to that of other methods through simulations in Section 4. Finally, Section 5 concludes this paper.

2. INITIALIZATION ALGORITHM DESCRIPTION

The aim of the algorithm is to provide an initialization set of parameters λ^(0) close enough to λ̂ to ensure a good convergence of the learning algorithm. It only focuses on the observation probability parameters, i.e. it aims at finding a set of observation probability parameters {µ^(c)_n, Σ^(c)_n}, n in {1, ..., N}, i.e. Gaussian distributions, that describes the training sequences accurately. It is worth noting that the transition probabilities a_{n1,n2} and initialization probabilities π_n are not estimated (although constrained for the left-right models), this operation being done by the Baum-Welch algorithm. The following values are hence used:

1. for unconstrained models, transition probabilities are set to a_{n,n} = 0.8, n in {0, ..., N-1}, and, for n1 != n2, a_{n1,n2} = 0.2/(N-1); initialization probabilities are π_n = 1/N, n in {0, ..., N-1};
2. for left-right models, transition probabilities are set to a_{n,n} = 0.8 for n in {0, ..., N-2}, a_{N-1,N-1} = 1, a_{n,n+1} = 0.2, and all other values are set to 0; initialization probabilities are π_0 = 1 and π_n = 0 for n in {1, ..., N-1}.

Concerning the p.d.f.s of the hidden states, the estimation is performed in two steps: first, M p.d.f.s which accurately describe the signal are estimated (Subsection 2.1). These M distributions are then reduced to N (Subsection 2.2).

2.1 Initialization step

Given a set of K training sequences Y_{k,0:T_k}, k in {1, ..., K}, M Gaussian distributions which accurately describe the data can be estimated as follows. Each training sequence Y_k is split into several segments of length P, overlapping or not, covering its time support. Each segment is then modeled as a Gaussian signal whose parameters µ_{k,j}, Σ_{k,j} are estimated using classical methods: µ_{k,j} is the mean of the related signal segment and Σ_{k,j} its covariance matrix. Without loss of generality, it is assumed that the distributions are sorted with respect to the delay of the Y_k segments. For instance, with non-overlapping segments, each couple of values (k in {1, ..., K}, 0 <= j < T_k/P) can be estimated as

$$\mu_{k,j} = \frac{1}{P} \sum_{p=0}^{P-1} Y_{k,jP+p}, \qquad \Sigma_{k,j} = \frac{1}{P} \sum_{p=0}^{P-1} \left(Y_{k,jP+p} - \mu_{k,j}\right) \left(Y_{k,jP+p} - \mu_{k,j}\right)^T,$$

where (.)^T is the transpose operator. P should be chosen small enough so that each signal segment can be considered stationary, and large enough to ensure a good estimation of the mean vector and covariance matrix. After this step, all these values can be collected to build the initial set of parameters λ_M^(0). It is worth noting that this initial set λ_M^(0) describes the training sequences in a very accurate way: indeed, each training signal segment is described by one of the M states. However, as already mentioned in the introduction, this model cannot be used in practice because of over-learning problems and computational time issues.
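As a concrete illustration of Subsection 2.1, the sketch below (plain NumPy; function and variable names are ours, not the paper's) computes the per-segment Gaussian parameters for one training sequence, using non-overlapping segments of length P and the biased 1/P covariance estimate given above.

```python
# Initialization step (Subsection 2.1): cut one training sequence into
# non-overlapping segments of length P and fit one Gaussian per segment.
import numpy as np

def segment_gaussians(Y, P):
    """Y: one training sequence, array of shape (T, d).
    Returns the lists of mu_{k,j} and Sigma_{k,j} for j = 0 .. T//P - 1."""
    means, covs = [], []
    for j in range(len(Y) // P):
        seg = Y[j * P:(j + 1) * P]               # samples Y_{k,jP+p}, p = 0..P-1
        mu = seg.mean(axis=0)
        centered = seg - mu
        means.append(mu)
        covs.append(centered.T @ centered / P)   # biased 1/P estimate, as above
    return means, covs
```

Applying this to each of the K training sequences and pooling the results yields the M Gaussian distributions collected in λ_M^(0).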
2.2 Clustering step

The second step of the algorithm is hence to approximate these M Gaussian distributions with N ones (N << M). This operation is performed using a k-means algorithm on these Gaussian distributions. Let N(µ_{k,j}, Σ_{k,j}) denote the Gaussian p.d.f. with mean vector µ_{k,j} and covariance matrix Σ_{k,j}. Moreover, let us refer to the distribution which characterizes a cluster as its center, whose parameters are labelled by (c): µ^(c)_n, Σ^(c)_n. The k-means algorithm works iteratively as follows. Given an initial set of centers {µ^(c)_{n,0}, Σ^(c)_{n,0}}_n, the following two steps are performed at each iteration #s:

1. Association: each distribution N(µ_{k,j}, Σ_{k,j}) is associated with a center in the set {µ^(c)_{n,s}, Σ^(c)_{n,s}}_n;
2. Center update: based on the previous association, updated centers {µ^(c)_{n,s+1}, Σ^(c)_{n,s+1}}_n are estimated.

The centers are finally estimated as

$$\left(\mu_n^{(c)}, \Sigma_n^{(c)}\right) = \lim_{s \to \infty} \left(\mu_{n,s}^{(c)}, \Sigma_{n,s}^{(c)}\right).$$

The initial set of centers depends on the type of HMM (unconstrained or left-right model). For unconstrained models, the M distributions are sorted according to the first component of their mean vector; the set of M distributions is split into N consecutive subsets, and a center is estimated for each subset according to the second step of the k-means algorithm, as described above. For left-right models, the distributions of each training sequence are split into N disjoint segments: the first segment is then associated with center #0, the second one with center #1, and so on. Initialization centers are again estimated using the second step of the k-means algorithm. It is worth noting that the association step requires adapting the Euclidean distance to associate each distribution with a cluster. This point is discussed in Section 3.1 for unconstrained models and in Section 3.2 for left-right models. The center update step is described in Section 3.3.
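The alternation of the two steps can be written compactly. The sketch below is our own illustration: `distance` and `update_center` stand for the Section 3 operations defined next, and distributions and centers are (mu, Sigma) pairs.

```python
# Clustering step (Subsection 2.2): alternate association and center update
# until the assignment of distributions to centers no longer changes.
def kmeans_on_gaussians(dists, centers, distance, update_center, max_iter=100):
    assign = None
    for _ in range(max_iter):
        # 1. Association: each distribution goes to its closest center.
        new_assign = [min(range(len(centers)),
                          key=lambda n: distance(d, centers[n]))
                      for d in dists]
        if new_assign == assign:   # association unchanged: converged
            break
        assign = new_assign
        # 2. Center update: re-estimate each center from its members
        #    (an empty cluster keeps its previous center).
        centers = [update_center([d for d, a in zip(dists, assign) if a == n]
                                 or [centers[n]])
                   for n in range(len(centers))]
    return centers, assign
```

Convergence is declared when the association step leaves the assignment unchanged, which in turn leaves the centers fixed.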

3. MODIFIED K-MEANS ALGORITHM

This section details the practical considerations of the proposed modified k-means clustering algorithm.

3.1 State-to-center distance computation for unconstrained models

The unconstrained model case is considered first: each Gaussian distribution is associated with a center in an independent way. As both the distribution and the center are Gaussian distributions, the Kullback-Leibler divergence is proposed to estimate the distance between the two p.d.f.s. This divergence is defined as

$$D\left(p_1 \,\|\, p_2\right) = \int p_1(u) \log \frac{p_1(u)}{p_2(u)} \, du.$$

For Gaussian distributions, this latter expression is proportional to

$$d\left(\left(\mu_{k,j}, \Sigma_{k,j}\right), \left(\mu_n^{(c)}, \Sigma_n^{(c)}\right)\right) = \ln \frac{\det \Sigma_n^{(c)}}{\det \Sigma_{k,j}} + \mathrm{Tr}\left(\left(\Sigma_n^{(c)}\right)^{-1} \Sigma_{k,j}\right) + \left(\mu_n^{(c)} - \mu_{k,j}\right)^T \left(\Sigma_n^{(c)}\right)^{-1} \left(\mu_n^{(c)} - \mu_{k,j}\right), \quad (1)$$

where Tr is the trace operator and det is, classically, the determinant. Finally, for each (k, j), the related distribution (µ_{k,j}, Σ_{k,j}) is associated with the closest center #n̂, which satisfies

$$\hat{n} = \underset{n}{\mathrm{argmin}} \; d\left(\left(\mu_{k,j}, \Sigma_{k,j}\right), \left(\mu_n^{(c)}, \Sigma_n^{(c)}\right)\right).$$

3.2 State-to-center distance computation for left-right models

Compared to the previous unconstrained model, the left-right assumption sets an additional constraint. For all k, if distribution (µ_{k,j0}, Σ_{k,j0}) is associated with center #n0, then for all j1 > j0, (µ_{k,j1}, Σ_{k,j1}) cannot be associated with any center #n, n < n0. In other words, the distribution-center association has to be done jointly for each training sequence (rather than independently). For sequence #k, consider therefore the set of N+1 break indices I_k(0), ..., I_k(N) defining the states. Due to the left-right constraint, this set must satisfy I_k(0) = 0, I_k(N) = M_k (the number of distributions of sequence #k), and, for two states n1 < n2, I_k(n1) < I_k(n2). Finally, for each (k, j), the related distribution (µ_{k,j}, Σ_{k,j}) is associated with the center #n which satisfies I_k(n) <= j < I_k(n+1). The set of break indices is previously estimated as

$$\left(\hat{I}_k(1), \ldots, \hat{I}_k(N-1)\right) = \underset{I_k(1), \ldots, I_k(N-1)}{\mathrm{argmin}} \; D\left(I_k(1), \ldots, I_k(N-1)\right),$$

where

$$D\left(I_k(1), \ldots, I_k(N-1)\right) = \sum_{n=0}^{N-1} \; \sum_{j \in \mathcal{I}_k(n)} d\left(\left(\mu_{k,j}, \Sigma_{k,j}\right), \left(\mu_n^{(c)}, \Sigma_n^{(c)}\right)\right),$$

with the index sets 𝓘_k(n) = {I_k(n), ..., I_k(n+1) - 1} and d(., .) defined by (1). Note that the two extrema I_k(0) and I_k(N) do not need to be optimized, since I_k(0) = 0 and I_k(N) = M_k.

3.3 Center estimation

The third and last issue to solve is the computation of the updated center parameters once each distribution has been associated with one cluster. Consider therefore center #n and 𝓘^(c)_n the associated set of distributions. A random variable Z in this set can be written as

$$Z = \sum_{(k,j)} \delta\big(H - (k,j)\big) \, Z_{k,j},$$

where Z_{k,j} is a Gaussian random variable with mean and covariance {µ_{k,j}, Σ_{k,j}}, δ(u) is the Dirac delta function, and H is a hidden variable equal to (k, j) if Z shares the distribution of Z_{k,j}. The n-th center parameters µ^(c)_n, Σ^(c)_n are estimated as the mean and covariance of Z. It is straightforward to check that

$$\mu_n^{(c)} = \frac{1}{\mathrm{card}\, \mathcal{I}_n^{(c)}} \sum_{(k,j) \in \mathcal{I}_n^{(c)}} \mu_{k,j}$$

and

$$\Sigma_n^{(c)} = \frac{1}{\mathrm{card}\, \mathcal{I}_n^{(c)}} \sum_{(k,j) \in \mathcal{I}_n^{(c)}} \left(\Sigma_{k,j} + \mu_{k,j} \, \mu_{k,j}^T\right) - \mu_n^{(c)} \left(\mu_n^{(c)}\right)^T,$$

where card 𝓘^(c)_n is the number of elements in the set 𝓘^(c)_n.
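The two Section 3 building blocks translate directly into code. The sketch below (plain NumPy; function names are ours) implements the distance of Eq. (1), dropping the constant terms that do not affect the argmin, and the moment-matching center update of Subsection 3.3; these are the `distance` and `update_center` callables assumed by the loop sketched in Section 2.2.

```python
# Section 3 building blocks: KL-based distance (Eq. (1), constants dropped)
# and moment-matching center update (Subsection 3.3).
import numpy as np

def kl_distance(dist, center):
    """d((mu, Sigma), (mu_c, Sigma_c)) of Eq. (1)."""
    (mu, S), (mu_c, S_c) = dist, center
    S_c_inv = np.linalg.inv(S_c)
    diff = mu_c - mu
    return (np.log(np.linalg.det(S_c) / np.linalg.det(S))
            + np.trace(S_c_inv @ S)
            + diff @ S_c_inv @ diff)

def update_center(members):
    """Mean and covariance of Z, the mixture of the cluster's Gaussians:
    members is a list of (mu, Sigma) pairs."""
    mu_c = np.mean([mu for mu, _ in members], axis=0)
    second = sum(S + np.outer(mu, mu) for mu, S in members) / len(members)
    return mu_c, second - np.outer(mu_c, mu_c)
```

For left-right models, the same kl_distance is accumulated over each candidate segmentation when searching the break indices Î_k(.), e.g. by enumerating the admissible index sets.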

Table 1: Simulation scenarios — mean vectors µ_0, ..., µ_3 and covariance matrices Σ_0, ..., Σ_3 of the four states for scenarios #1, #2 and #3. [Numeric entries lost in transcription.]

4.1 Evaluation method

The performance of the proposed algorithm (denoted "k-means on Gaussian laws") has been estimated using numerical simulations and is compared to the performance of the classical k-means algorithm applied directly to the data (denoted "k-means on data"). Moreover, the influence of the left-right constraint is investigated (referred to as "unconstrained" or "left-right"). Nevertheless, the constrained k-means on data for left-right models has not been tested, because the number of sets of break indices to test is prohibitive, leading to an excessive computational time. As mentioned in the introduction, the aim of these algorithms is to estimate a set of parameters λ̂^(0) such that the convergence of the Baum-Welch algorithm is improved. This latter algorithm computes, in an iterative manner, the set of parameters λ^(s) at step #s such that the likelihood increases with respect to s. For a given set of training sequences Y_{k,0:T_k}, the sets of parameters λ̂^(0) estimated with the different initialization algorithms are hence compared through the reached value and the convergence rate of the log-likelihood function

$$\mathcal{L}(s) = \log \prod_{k=1}^{K} p\left(Y_{k,0:T_k} \mid \lambda^{(s)}\right).$$

4.2 Toy example

Signals have been generated as the concatenation of 4 signal segments, so that Y_{0:T} = {Y^(0)_{0:T_0}, Y^(1)_{0:T_1}, Y^(2)_{0:T_2}, Y^(3)_{0:T_3}}. Each of these four signals is generated using a Gaussian distribution: for the p-th signal, with mean vector µ_p and covariance matrix Σ_p. The time lengths T_p are randomly chosen using a truncated Gaussian distribution (keeping only positive values). Three sets of parameters {µ_p, Σ_p}_p have been tested, corresponding to the 3 scenarios summarized in Table 1. These three scenarios differ in difficulty. Indeed, scenario #1 is the easiest, since the 4 states are well separated. Scenario #2 is a little more complex than scenario #1, since states 1 and 2, and states 3 and 4, overlap. Finally, scenario #3 is the most confusing model, since all the states largely overlap. For each scenario a set of realizations has been generated, and for each realization 5 sequences have been used for the initialization and training algorithms. P = 20 has been used for the estimation of the M Gaussian distributions prior to the k-means algorithm.
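To reproduce the flavour of this toy setup, a generator along the following lines can be used. This is our own sketch: the truncated-Gaussian segment-length parameters MEAN_T and STD_T are hypothetical placeholders, the paper's exact values being lost in this transcription, as are the entries of Table 1.

```python
# Toy-signal generator (Subsection 4.2): one Gaussian segment per state,
# with a random positive length per segment.
import numpy as np

MEAN_T, STD_T = 150, 50  # hypothetical segment-length parameters, not the paper's

def toy_sequence(means, covs, rng=None):
    """Concatenate one Gaussian segment per state: Y_{0:T} = {Y^(0), ..., Y^(3)}."""
    rng = rng or np.random.default_rng()
    parts = []
    for mu, Sigma in zip(means, covs):
        T = 0
        while T <= 0:  # truncated Gaussian: keep only positive lengths
            T = int(rng.normal(MEAN_T, STD_T))
        parts.append(rng.multivariate_normal(mu, Sigma, size=T))
    return np.concatenate(parts)
```

Each call returns one realization Y_{0:T}; stacking five such sequences per realization gives the training material described above.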

[Figure 1: Performance evaluation achieved on the toy example — averaged log-likelihood values versus number of iterations for (a) Scenario #1, (b) Scenario #2 and (c) Scenario #3, comparing "unconstrained k-means on Gaussian laws", "left-right k-means on Gaussian laws" and "unconstrained k-means on data".]

Fig. 1 presents the achieved results as the values of L(s) averaged over the realizations. For the first scenario (Fig. 1(a)), the 3 tested algorithms ("left-right k-means on Gaussian distributions", "unconstrained k-means on Gaussian distributions" and "k-means on data") converge to the same value, but the k-means on Gaussian distributions algorithms converge much faster than the classical k-means initialization. For the second scenario (Fig. 1(b)), the two proposed k-means on Gaussian distributions algorithms (left-right and unconstrained) converge to the same point within a few iterations, whereas the classical k-means on data algorithm falls into a local maximum. Finally, for the most complex scenario (Fig. 1(c)), the constrained (left-right) proposed k-means algorithm on Gaussian distributions performs better than both other algorithms. The proposed method hence improves the estimation of the set of parameters used to initiate the Baum-Welch algorithm: a better and faster convergence is shown by the simulations.

4.3 Recognition problem

In a second set of numerical experiments (Tab. 2), the classification accuracy (CA) of the proposed method is compared to the CA achieved by the reference dynamic time warping (DTW) method [10]. Two configurations are compared, and for each of them bi-dimensional observations (Y_k in R^2) are generated from 25 different models. Each model is composed of the concatenation of 2 to 4 states (Configuration #1) or of 4 to 8 states (Configuration #2). In the whole experiment (configurations #1 and #2), a 5-state left-right HMM with the proposed initialization procedure is considered (referred to as HMM).

Table 2: Classification accuracy in percentage [%] of DTW and HMM for configurations #1 and #2. [Numeric entries lost in transcription.]

As one can see, the proposed method outperforms the classical DTW in classification accuracy, computed with a 10-fold cross-validation procedure: i.e., the observation sequences are partitioned into 10 sets, and each of them is sequentially used as the test database while the other 9 sets are used to train the HMM parameters. It is worth noting that this good behavior of the proposed method is achieved without optimization of the number of states. As a consequence, it is not surprising that the best CA is obtained on the simplest configuration (Conf. #1); however, the gap between the two detection methods remains even with the more complex configuration (Conf. #2).

5. CONCLUSION

In this study, an initialization of the Baum-Welch training algorithm based on a modified k-means clustering of HMM states is presented. The proposed procedure differs from classical implementations by clustering the states rather than the training data. The simulation results have shown that the proposed method improves the initialization of the Baum-Welch algorithm, since the value of the log-likelihood achieved by our method is higher than the value achieved by the classical initialization. Moreover, in a recognition problem, the proposed method outperforms the reference dynamic time warping method, showing its good behavior. Future work will include a deeper analysis of this method, as well as an automatic procedure to jointly adjust the number of hidden states.

REFERENCES

[1] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, February 1989.
[2] Jianying Hu, M. K. Brown, and W. Turin, "HMM based on-line handwriting recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 10, pp. 1039-1045, October 1996.
[3] M. Elmezain, A. Al-Hamadi, J. Appenrodt, and B. Michaelis, "A hidden Markov model-based continuous gesture recognition system for hand motion trajectory," in Proc. 19th International Conference on Pattern Recognition (ICPR), December 2008.
[4] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Royal Statist. Soc. Ser. B, vol. 39, pp. 1-38, 1977.
[5] Olivier Cappé, Eric Moulines, and Tobias Rydén, Eds., Inference in Hidden Markov Models, Springer Series in Statistics, 2005.
[6] L. R. Rabiner, B. H. Juang, S. E. Levinson, and M. M. Sondhi, "Some properties of continuous hidden Markov model representations," AT&T Technical Journal, 1985.
[7] K. Nathan, A. Senior, and J. Subrahmonia, "Initialization of hidden Markov models for unconstrained on-line handwriting recognition," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), May 1996, vol. 6.
[8] Md. Huda, Ranadhir Ghosh, and John Yearwood, "A variable initialization approach to the EM algorithm for better estimation of the parameters of hidden Markov model based acoustic modeling of speech signals," in Advances in Data Mining, Lecture Notes in Computer Science.
[9] S. Huda, J. Yearwood, and R. Togneri, "A constraint-based evolutionary learning approach to the expectation maximization for optimal estimation of the hidden Markov model for speech signal modeling," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 39, no. 1, February 2009.
[10] H. Sakoe and S. Chiba, "Dynamic programming algorithm optimization for spoken word recognition," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 26, no. 1, pp. 43-49, 1978.
