An EM Algorithm for Asynchronous Input/Output Hidden Markov Models

Samy Bengio (†), Yoshua Bengio (‡)
(†) INRS-Telecommunications, 16, Place du Commerce, Ile-des-Soeurs, Qc, H3E 1H6, CANADA
(‡) Dept. IRO, Université de Montréal, Montréal, Qc, H3C 3J7, CANADA

Abstract

In learning tasks in which input sequences are mapped to output sequences, it is often the case that the input and output sequences are not synchronous. For example, in speech recognition, acoustic sequences are longer than phoneme sequences. Input/Output Hidden Markov Models have already been proposed to represent the distribution of an output sequence given an input sequence of the same length. We extend this model here to the case of asynchronous sequences, and present an Expectation-Maximization algorithm for training such models.

1 Introduction

Supervised learning algorithms for sequential data minimize a training criterion that depends on pairs of input and output sequences. It is often assumed that input and output sequences are synchronized, i.e., that each input sequence has the same length as the corresponding output sequence. For instance, recurrent networks [Rumelhart et al., 1986] can be used to map input sequences to output sequences, for example by minimizing at each time step the squared difference between the actual output and the desired output. Another example is a recently proposed recurrent mixture-of-experts connectionist architecture which has an interpretation as a probabilistic model, called the Input/Output Hidden Markov Model (IOHMM) [Bengio and Frasconi, 1996, Bengio and Frasconi, 1995]. This model represents the distribution of an output sequence given an input sequence of the same length, using a hidden state variable and a Markovian independence assumption, as in Hidden Markov Models (HMMs) [Levinson et al., 1983, Rabiner, 1989], in order to simplify the distribution. IOHMMs are a form of probabilistic transducer [Pereira et al., 1994, Singer, 1996], with input and output variables that can be discrete as well as continuous-valued.

However, in many sequential problems where one tries to map an input sequence to an output sequence, the lengths of the input and output sequences may not be equal. Input and output sequences may evolve at different time scales. For example, in a speech recognition problem where one wants to map an acoustic signal to a phoneme sequence, each phoneme approximately corresponds to a subsequence of the acoustic signal; the input acoustic sequence is therefore generally longer than the output phoneme sequence, and the alignment between inputs and outputs is often not available. In comparison with HMMs, the emission and transition probabilities of IOHMMs vary with time as a function of an input sequence. Unlike HMMs, IOHMMs with discrete outputs are discriminant models. Furthermore, the transition and emission probabilities are generally better matched, which reduces a problem observed in speech recognition HMMs: because outputs lie in a much higher dimensional space than transitions, the dynamic range of transition probabilities is much smaller than that of emission probabilities, so the choice between different paths during recognition is mostly influenced by the emission rather than the transition probabilities.

In this paper, we present an extension of IOHMMs to the asynchronous case. We first present the probabilistic model, then derive an exact Expectation-Maximization (EM) algorithm for training asynchronous IOHMMs.
For complex distributions (e.g., using artificial neural networks to represent the transition and emission distributions), a Generalized EM algorithm or gradient ascent in the likelihood can be used. Finally, a recognition algorithm similar to the Viterbi algorithm is presented, to map given input sequences to likely output sequences.

2 The Model

Let us write u_1^T for input sequences u_1, u_2, ..., u_T, and similarly y_1^S for output sequences y_1, y_2, ..., y_S. In this paper we consider the case in which the output sequences are shorter than the input sequences. The more general case is a straightforward extension of this model (using "empty" transitions that do not take any time) and will be discussed elsewhere. As in HMMs and IOHMMs, we introduce a discrete hidden state variable, x_t, which will allow us to simplify the distribution P(y_1^S | u_1^T) by using Markovian independence assumptions. The state sequence x_1^T is taken to be synchronous with the input sequence u_1^T. In order to produce output sequences shorter than input sequences, we will have states that do not emit an output as well as states that do emit an output. When, at time t, the system is in a non-emitting state, no output can be produced. Therefore, there exist many sequences of states, corresponding to different (shorter) output sequences.

When conceived as a generative model of the output (given the input), an asynchronous IOHMM works as follows. At time t = 1, an initial state x_1 is chosen according to the distribution P(x_1), and the length s of the output sequence is initialized (to 1 if x_1 is an emitting state, which then emits y_1, and to 0 otherwise). At subsequent time steps t > 1, a state x_t is first picked according to the transition distribution P(x_t | x_{t-1}, u_t), using the state x_{t-1} at the previous time step and the current input u_t. If x_t is an emitting state, the length of the output sequence is increased from s-1 to s, and the s-th output y_s is sampled from the emission distribution P(y_s | x_t, u_t). The parameters of the model are thus the initial state probabilities, \pi(i) = P(x_1 = i), and the parameters of the output and transition conditional distribution models, P(y_s | x_t, u_t) and P(x_t | x_{t-1}, u_t). Since the input and output sequences are of different lengths, we introduce another hidden variable, \tau_t, specifically to represent the alignment between inputs and outputs, with \tau_t = s meaning that s outputs have been emitted by time t.
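To make this generative story concrete, here is a minimal sketch in Python. It assumes a tabular (multinomial) model over discrete inputs and outputs; the names pi, w, v, emit and the array layouts are ours, not the paper's, and the behaviour of an emitting initial state is our reading of the description above.

```python
import numpy as np

# Illustrative tabular parameterization (hypothetical names):
#   pi[i]      = P(x_1 = i)
#   w[i, j, k] = P(x_t = i | x_{t-1} = j, u_t = k)
#   v[i, l, k] = P(y_s = l | x_t = i,  u_t = k)
#   emit[i]    = True iff state i emits (P(e_i) is 0 or 1 and known)

def sample_output(u, pi, w, v, emit, rng):
    """Sample an output sequence y (possibly shorter than u) given the input u."""
    y = []
    x = rng.choice(len(pi), p=pi)                 # t = 1: x_1 ~ P(x_1)
    if emit[x]:                                   # an emitting initial state emits y_1
        y.append(rng.choice(v.shape[1], p=v[x, :, u[0]]))
    for t in range(1, len(u)):                    # t = 2, ..., T
        x = rng.choice(len(pi), p=w[:, x, u[t]])  # x_t ~ P(x_t | x_{t-1}, u_t)
        if emit[x]:                               # emitting state: draw the next output
            y.append(rng.choice(v.shape[1], p=v[x, :, u[t]]))
    return y

# Tiny toy instance: two states (only state 1 emits), binary inputs and outputs.
rng  = np.random.default_rng(0)
pi   = np.array([0.7, 0.3])
emit = np.array([False, True])
w    = np.full((2, 2, 2), 0.5)                    # uniform transition tables
v    = np.full((2, 2, 2), 0.5)                    # uniform emission tables
print(sample_output([0, 1, 1, 0], pi, w, v, emit, rng))
```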
Let us first formalize the independence assumptions and the form of the conditional distribution represented by the model. The conditional probability P(y_1^S | u_1^T) can be written as a sum of terms P(y_1^S, x_1^T | u_1^T) over all state sequences x_1^T such that the number of emitting states in the sequence is S (the length of the output sequence):

P(y_1^S | u_1^T) = \sum_{x_1^T : \tau_T = S} P(y_1^S, x_1^T | u_1^T)    (1)

All S outputs must have been emitted by time T, so \tau_T = S. The hidden state x_t takes discrete values in a finite set. Each term P(y_1^S, x_1^T | u_1^T) corresponds to a particular sequence of states and a corresponding alignment: this probability can be written as the initial state probability P(x_1) times a product of factors over all time steps t: if state x_t = i is an emitting state, that factor is P(x_t | x_{t-1}, u_t) P(y_s | x_t, u_t); otherwise that factor is simply P(x_t | x_{t-1}, u_t), where s is the position in the output sequence of the output emitted at time t (when an output is emitted at time t). Table 1 summarizes the notation introduced so far and defines additional notation used in this paper.

Table 1: Notation used in the paper
  S = length of the output sequence.
  T = length of the input sequence.
  N = number of states in the IOHMM.
  a(i, j, t) = output of the module that computes P(x_t = i | x_{t-1} = j, u_t).
  b(i, s, t) = output of the module that computes P(y_s | x_t = i, u_t).
  \pi(i) = P(x_1 = i), initial probability of state i.
  z_{i,t} = 1 if x_t = i, 0 otherwise. These indicator variables give the state sequence.
  m_{s,t} = 1 if the system emits the s-th output at time t, 0 otherwise. These indicator variables give the input/output alignment.
  e_i is true if state i emits (so P(e_i) = 1), and false otherwise.
  \tau_t = s means that the first s outputs have been emitted by time t.
  \xi_{t,k} = 1 if the t-th input symbol is k, 0 otherwise.
  \zeta_{s,k} = 1 if the s-th output symbol is k, 0 otherwise.
  pred(i) is the set of predecessor states of state i.
  succ(i) is the set of successor states of state i.

The Markovian conditional independence assumptions in this model mean that the state variable x_t summarizes the past of the sequence sufficiently well, so that

P(x_t | x_1^{t-1}, u_1^t) = P(x_t | x_{t-1}, u_t)    (2)

and

P(y_s | x_1^t, u_1^t) = P(y_s | x_t, u_t).    (3)

These assumptions are analogous to the Markovian independence assumptions used in HMMs and are the same as in synchronous IOHMMs. Based on these two assumptions, the conditional probability can be efficiently represented and computed recursively, using an intermediate variable

\alpha(i, s, t) = P(x_t = i, \tau_t = s, y_1^s | u_1^t).    (4)

The conditional probability of an output sequence can then be expressed in terms of this variable:

L = P(y_1^S | u_1^T) = \sum_{i \in F} \alpha(i, S, T)    (5)

where F is the set of final states.

These \alpha's can be computed recursively, in a way analogous to the forward pass of the Baum-Welch algorithm for HMMs, but with a different treatment for emitting (e_i true) and non-emitting (e_i false) states:

\alpha(i, s, t) = P(e_i = 1) \, b(i, s, t) \sum_{j \in pred(i)} a(i, j, t) \, \alpha(j, s-1, t-1)
              + P(e_i = 0) \sum_{j \in pred(i)} a(i, j, t) \, \alpha(j, s, t-1)    (6)

where a(i, j, t) = P(x_t = i | x_{t-1} = j, u_t), b(i, s, t) = P(y_s | x_t = i, u_t), and pred(i) is the set of states with an outgoing transition to state i. The derivation of this recursion from the independence assumptions can be found in [Bengio and Bengio, 1996].
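The recursion (6), together with the termination (5), maps directly onto a dynamic program over (time, number of outputs emitted so far, state). The sketch below continues the hypothetical tabular parameterization of the sampling sketch above (pi, w, v, emit are illustrative names); the initialization at t = 1 and the default choice of final states are our assumptions, since the paper defers the full derivation to the technical report.

```python
import numpy as np

def forward(u, y, pi, w, v, emit, final_states=None):
    """Forward pass for equations (4)-(6):
    alpha[t, s, i] = P(x_t = i, tau_t = s, y_1^s | u_1^t)."""
    T, S, N = len(u), len(y), len(pi)
    alpha = np.zeros((T, S + 1, N))                  # s counts outputs emitted so far (0..S)
    for i in range(N):                               # t = 1 (assumed initialization)
        if emit[i]:
            alpha[0, 1, i] = pi[i] * v[i, y[0], u[0]]
        else:
            alpha[0, 0, i] = pi[i]
    for t in range(1, T):                            # t = 2, ..., T: equation (6)
        for i in range(N):
            a_i = w[i, :, u[t]]                      # a(i, j, t) over all predecessors j
            if emit[i]:
                for s in range(1, S + 1):            # emitting state: accounts for y_s
                    alpha[t, s, i] = v[i, y[s - 1], u[t]] * (a_i @ alpha[t - 1, s - 1])
            else:
                for s in range(S + 1):               # non-emitting: alignment unchanged
                    alpha[t, s, i] = a_i @ alpha[t - 1, s]
    final = range(N) if final_states is None else final_states
    L = sum(alpha[T - 1, S, i] for i in final)       # equation (5)
    return alpha, L
```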
3 An EM Algorithm for Asynchronous IOHMMs

The learning algorithm we propose is based on the maximum likelihood principle, i.e., here, maximizing the conditional likelihood of the training data. The algorithm could be generalized to one maximizing the posterior probability of the parameters given the data, by taking into account priors on those parameters. Let the training data D be a set of P input/output sequence pairs, independently sampled from the same distribution. Let T_p and S_p be the lengths of the p-th input and output sequence respectively:

D = { (u_1^{T_p}(p), y_1^{S_p}(p)), p = 1, ..., P }    (7)

Let \theta be the set of all the parameters of the model. Because each sequence pair is sampled independently, we can write the likelihood function as follows (omitting sequence indexes to simplify the notation):

L(\theta; D) = \prod_{p=1}^{P} P(y_1^{S_p} | u_1^{T_p})    (8)

According to the maximum likelihood principle, the optimal parameters are obtained when L(\theta; D) is maximized. We show here how an iterative optimization to a local maximum can be achieved using the Expectation-Maximization (EM) algorithm. EM is an iterative procedure for maximum likelihood estimation, originally formalized in [Dempster et al., 1977]. Each iteration is composed of two steps: an estimation step and a maximization step. The basic idea is to introduce an additional variable X which, if it were known, would greatly simplify the optimization problem. This additional variable is known as the missing or hidden data. A joint model of X with the observed variables must be set up. The set D_c, which includes the data set D and the values of X for each of the examples, is known as the complete data set. Correspondingly, L_c(\theta; D_c) is referred to as the complete data likelihood. Since X is not observed, L_c is a random variable and cannot be maximized directly. The EM algorithm is based on averaging log L_c(\theta; D_c) over the distribution of X, given the known data D and the previous value of the parameters \hat{\theta}. This expected log likelihood is called the auxiliary function:

Q(\theta, \hat{\theta}) = E [ log L_c(\theta; D_c) | D, \hat{\theta} ]    (9)

In short, each EM iteration first computes Q (E-step), then updates \theta so as to maximize Q (M-step). To apply EM to asynchronous IOHMMs, we need to choose hidden variables whose knowledge would simplify the learning problem. Let x be a hidden variable representing the hidden state of the Markov model (such that x_t = i means the system is in state i at time t). Knowledge of the state sequence x_1^T would make the estimation of the parameters trivial (simple counting would suffice). Because some states emit and some do not, we introduce an additional hidden variable, \tau, representing the alignment between inputs (and states) and outputs, such that \tau_t = s means that s outputs have been emitted by time t.

Note that in the current implementation of the model, we assume a deterministic relation between the value of the state sequence x_1^T and that of the alignment sequence \tau_1^T, because P(e_i), the probability that state i emits, is assumed to be 0 or 1 (and known). Here, the complete data can be written as follows:

D_c = { (u_1^{T_p}(p), y_1^{S_p}(p), x_1^{T_p}(p), \tau_1^{T_p}(p)), p = 1, ..., P }    (10)

The corresponding complete data likelihood is (again dropping the (p) indices):

L_c(\theta; D_c) = \prod_{p=1}^{P} P(y_1^{S_p}, x_1^{T_p}, \tau_1^{T_p} | u_1^{T_p})    (11)

Let z_{i,t} be an indicator variable such that z_{i,t} = 1 if x_t = i, and z_{i,t} = 0 otherwise. Let m_{s,t} be an indicator variable such that m_{s,t} = 1 if the s-th output is emitted at time t, and m_{s,t} = 0 otherwise. Using these indicator variables and factorizing the likelihood equation, we obtain:

L_c(\theta; D_c) = \prod_{p=1}^{P} \prod_{t=1}^{T_p} \prod_{i=1}^{N} \left( \prod_{s=1}^{S_p} P(y_s | x_t = i, u_t)^{z_{i,t} m_{s,t}} \right) \left( \prod_{j=1}^{N} P(x_t = i | x_{t-1} = j, u_t)^{z_{i,t} z_{j,t-1}} \right)    (12)

Taking the logarithm, we obtain the following expression for the complete data log-likelihood:

log L_c(\theta; D_c) = \sum_{p=1}^{P} \sum_{t=1}^{T_p} \sum_{i=1}^{N} \left( \sum_{s=1}^{S_p} z_{i,t} m_{s,t} \log P(y_s | x_t = i, u_t) + \sum_{j=1}^{N} z_{i,t} z_{j,t-1} \log P(x_t = i | x_{t-1} = j, u_t) \right)    (13)

3.1 The Estimation Step

Let us define the auxiliary function Q(\theta, \hat{\theta}) as the expected value of log L_c(\theta; D_c) with respect to the hidden variables x and \tau, given the data D and the previous set of parameters \hat{\theta}:

Q(\theta, \hat{\theta}) = E [ log L_c(\theta; D_c) | D, \hat{\theta} ]    (14)

Q(\theta, \hat{\theta}) = \sum_{p=1}^{P} \sum_{t=1}^{T_p} \sum_{i=1}^{N} \left( \sum_{s=1}^{S_p} \hat{g}_{i,s,t} \log P(y_s | x_t = i, u_t) + \sum_{j=1}^{N} \hat{h}_{i,j,t} \log P(x_t = i | x_{t-1} = j, u_t) \right)    (15)

where, by definition,

\hat{g}_{i,s,t} = E [ x_t = i, \tau_t = s, e_i = 1 | u_1^T, y_1^S ]    (16)

and

\hat{h}_{i,j,t} = E [ x_t = i, x_{t-1} = j | u_1^T, y_1^S ]    (17)

The hats on \hat{g} and \hat{h} denote that these variables are computed using the previous value of the parameters, \hat{\theta}. In order to compute \hat{g}_{i,s,t} and \hat{h}_{i,j,t}, we use the already defined \alpha(i, s, t) (equations 4 and 6) and introduce a new variable, \beta(i, s, t), borrowing the notation from the HMM and IOHMM literature:

\beta(i, s, t) = P(y_{s+1}^S | x_t = i, \tau_t = s, u_{t+1}^T)    (18)

Like \alpha, \beta can be computed recursively, but going backwards in time:

\beta(i, s, t) = \sum_{j \in succ(i)} P(e_j = 1) \, a(j, i, t+1) \, b(j, s+1, t+1) \, \beta(j, s+1, t+1)
             + \sum_{j \in succ(i)} P(e_j = 0) \, a(j, i, t+1) \, \beta(j, s, t+1)    (19)

where succ(i) is the set of successor states of state i, a(j, i, t) is the conditional transition probability from state i to state j at time t, and b(i, s, t) is the conditional probability of emitting the s-th output at time t in state i. The proof of correctness of this recursion (using the Markovian independence assumptions) is given in [Bengio and Bengio, 1996]. We can now express g_{i,s,t} and h_{i,j,t} in terms of \alpha(i, s, t) and \beta(i, s, t) (see derivations in [Bengio and Bengio, 1996]):

g_{i,s,t} = P(e_i = 1) \, \frac{\alpha(i, s, t) \, \beta(i, s, t)}{L}    (20)

and

h_{i,j,t} = \frac{a(i, j, t)}{L} \left( P(e_i = 1) \sum_s \alpha(j, s-1, t-1) \, b(i, s, t) \, \beta(i, s, t) + P(e_i = 0) \sum_s \alpha(j, s, t-1) \, \beta(i, s, t) \right)    (21)
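The E-step quantities can be sketched in the same style as the forward pass above. The code below is illustrative only: it keeps the same hypothetical array names (w, v, emit), treats every state as a possible successor and final state, and leaves the loops unvectorized for readability.

```python
import numpy as np

def backward(u, y, w, v, emit, final_states=None):
    """Backward pass for equation (19):
    beta[t, s, i] = P(y_{s+1}^S | x_t = i, tau_t = s, u_{t+1}^T)."""
    T, S, N = len(u), len(y), len(emit)
    beta = np.zeros((T, S + 1, N))
    final = range(N) if final_states is None else final_states
    for i in final:
        beta[T - 1, S, i] = 1.0                      # all S outputs emitted by time T
    for t in range(T - 2, -1, -1):
        for i in range(N):
            for s in range(S + 1):
                total = 0.0
                for j in range(N):                   # successors of i (all states here)
                    a_ji = w[j, i, u[t + 1]]         # a(j, i, t+1)
                    if emit[j]:
                        if s < S:                    # successor emits output s+1
                            total += a_ji * v[j, y[s], u[t + 1]] * beta[t + 1, s + 1, j]
                    else:
                        total += a_ji * beta[t + 1, s, j]
                beta[t, s, i] = total
    return beta

def posteriors(u, y, alpha, beta, w, v, emit, L):
    """Posteriors of equations (20)-(21): g[i, s, t] and h[i, j, t]."""
    T, S, N = len(u), len(y), len(emit)
    g = np.zeros((N, S + 1, T))
    h = np.zeros((N, N, T))
    for t in range(T):
        for i in range(N):
            if emit[i]:
                for s in range(1, S + 1):            # equation (20)
                    g[i, s, t] = alpha[t, s, i] * beta[t, s, i] / L
            if t == 0:
                continue
            for j in range(N):                       # equation (21)
                acc = 0.0
                for s in range(S + 1):
                    if emit[i] and s >= 1:
                        acc += alpha[t - 1, s - 1, j] * v[i, y[s - 1], u[t]] * beta[t, s, i]
                    elif not emit[i]:
                        acc += alpha[t - 1, s, j] * beta[t, s, i]
                h[i, j, t] = w[i, j, u[t]] * acc / L
    return g, h
```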

3.2 The Maximization Step

After each estimation step, one has to maximize Q(\theta, \hat{\theta}). If the conditional probability distributions (for transitions as well as emissions) have a simple enough form (e.g., multinomial, generalized linear models, or mixtures of these), then one can maximize Q analytically, i.e., solve

\partial Q(\theta, \hat{\theta}) / \partial \theta = 0    (22)

for \theta. Otherwise, if for instance the conditional probabilities are implemented with non-linear systems (such as neural networks), the maximization of Q cannot be done in one step. In this case, we can apply a Generalized EM (GEM) algorithm, which simply requires an increase of Q at each optimization step, for example using gradient ascent in Q, or directly perform gradient ascent in L(\theta; D).

3.3 Multinomial Distributions

We describe here the maximization procedure for multinomial conditional probabilities (for transitions and emissions), i.e., distributions that can be implemented as lookup tables. This applies to problems with discrete inputs and outputs. Let M_i be the number of input symbols, M_o the number of output classes, \xi_{t,k} = 1 when the t-th input symbol is k, and \xi_{t,k} = 0 otherwise. Also, let \zeta_{s,k} = 1 when the s-th output symbol is k, and \zeta_{s,k} = 0 otherwise. For transition probabilities, let w_{i,j,k} = P(x_t = i | x_{t-1} = j, u_t = k). The solution of equation (22) for w_{i,j,k}, with the constraint that these probabilities sum to 1 (over i), yields the following reestimation formula:

w_{i,j,k} = \frac{\sum_{t=1}^{T} \xi_{t,k} \, \hat{h}_{i,j,t}}{\sum_{l=1}^{N} \sum_{t=1}^{T} \xi_{t,k} \, \hat{h}_{l,j,t}}    (23)

For emission probabilities, let v_{i,l,k} = P(y_s = l | x_t = i, u_t = k); the reestimation formula is then the following:

v_{i,l,k} = \frac{\sum_{t=1}^{T} \sum_{s=1}^{S} \xi_{t,k} \, \zeta_{s,l} \, \hat{g}_{i,s,t}}{\sum_{m=1}^{M_o} \sum_{t=1}^{T} \sum_{s=1}^{S} \xi_{t,k} \, \zeta_{s,m} \, \hat{g}_{i,s,t}}    (24)

3.4 Neural Networks or Other Complex Distributions

In the more general case where one cannot maximize Q analytically, we can compute the gradient for each parameter and apply gradient ascent in Q, yielding a GEM algorithm (or, alternatively, directly maximize L by gradient ascent). For transition probability models a(j, i, t) = P(x_t = j | x_{t-1} = i, u_t; w_i) for state i, with parameters w_i, the gradient of Q with respect to w_i is

\frac{\partial Q(\theta, \hat{\theta})}{\partial w_i} = \sum_{t=1}^{T} \sum_{j \in succ(i)} \frac{\hat{h}_{j,i,t}}{a(j, i, t)} \frac{\partial a(j, i, t)}{\partial w_i}    (25)

where \partial a(j, i, t) / \partial w_i can be computed by back-propagation. Similarly, for emission probability models b(i, s, t) = P(y_s | x_t = i, u_t; v_i) for state i, with parameters v_i, the gradient is

\frac{\partial Q(\theta, \hat{\theta})}{\partial v_i} = \sum_{t=1}^{T} \sum_{s=1}^{S} \frac{\hat{g}_{i,s,t}}{b(i, s, t)} \frac{\partial b(i, s, t)}{\partial v_i}    (26)

where, again, \partial b(i, s, t) / \partial v_i can be computed by back-propagation.

Table 2: Overview of the learning algorithm for asynchronous IOHMMs.
1. Estimation Step: for each training sequence (u_1^T, y_1^S) do
   (a) for each state j = 1, ..., N, compute a(i, j, t) and b(j, s, t) according to the chosen distribution models;
   (b) for each state i = 1, ..., N, compute \alpha(i, s, t), \beta(i, s, t), and L using the current value of the parameters \hat{\theta} (equations 6, 5, and 19), then compute the posterior probabilities \hat{h}_{i,j,t} and \hat{g}_{i,s,t} (equations 21 and 20).
2. Maximization Step: for each state j = 1, ..., N do
   (a) adjust the transition probability parameters of state j using reestimation formulae such as equation (23) (or gradient ascent for non-linear modules, equation 25);
   (b) adjust the emission probability parameters of state j using reestimation formulae such as equation (24) (or gradient ascent for non-linear modules, equation 26).
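Following the overview in Table 2, the multinomial M-step of equations (23) and (24) amounts to accumulating the E-step posteriors against input/output symbol indicators and normalizing. A sketch, with our hypothetical array layouts matching the E-step sketches above (accumulation over training sequences is made explicit here, as it is implicit in the sums of the paper's formulas):

```python
import numpy as np

def m_step_multinomial(data, posts, N, Mi, Mo):
    """Reestimation formulae (23)-(24) for tabular transition and emission models.

    data  : list of (u, y) pairs of integer-coded sequences
    posts : list of matching (g, h) arrays from the E-step sketch above
    """
    num_w = np.zeros((N, N, Mi))                      # numerators of equation (23)
    num_v = np.zeros((N, Mo, Mi))                     # numerators of equation (24)
    for (u, y), (g, h) in zip(data, posts):
        for t, k in enumerate(u):                     # xi_{t,k} selects input symbol k
            num_w[:, :, k] += h[:, :, t]
            for s, l in enumerate(y):                 # zeta_{s,l} selects output symbol l
                num_v[:, l, k] += g[:, s + 1, t]
    # Normalize so that w[:, j, k] sums to 1 over next states i, and
    # v[i, :, k] sums to 1 over output symbols l (the tiny floor avoids 0/0
    # for (state, symbol) combinations never seen in the data).
    w = num_w / np.maximum(num_w.sum(axis=0, keepdims=True), 1e-12)
    v = num_v / np.maximum(num_v.sum(axis=1, keepdims=True), 1e-12)
    return w, v
```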

4 A Recognition Algorithm for Asynchronous IOHMMs

Given a trained asynchronous IOHMM, we want to recognize new sequences, i.e., given an input sequence, choose an output sequence according to the model. Ideally, we would like to pick the output sequence that is most likely given the input sequence. However, this would require an exponential number of computations (with respect to the sequence length). Instead, as in the Viterbi algorithm for HMMs, we consider the complete data model and look for the jointly most likely values of states and outputs. Thanks to a dynamic programming recurrence, we can compute the most likely state and output sequence in time proportional to the sequence length times the number of transitions (even though the number of such sequences is exponential in the sequence length). Let us define V(i, t) as the probability of the best state and output subsequence ending in state i at time t:

V(i, t) = \max_{s, y_1^s, x_1^{t-1}} P(y_1^s, x_1^{t-1}, x_t = i, \tau_t = s | u_1^t)    (27)

where the maximum is taken over all possible lengths s of output sequences y_1^s. This variable can be computed recursively by dynamic programming (the derivation is given in [Bengio and Bengio, 1996]):

V(i, t) = P(e_i = 1) \max_{y_s} b(i, s, t) \max_j \left( a(i, j, t) V(j, t-1) \right)
        + P(e_i = 0) \max_j \left( a(i, j, t) V(j, t-1) \right)    (28)

At the end of the sequence, the best final state i*, maximizing V(i, T), is picked within the set of final states F. If the argmax in the above recurrence is kept, then the best predecessor j and best output y_s for each (i, t) can be used to trace back the optimal state and output sequence from i*, as in the Viterbi algorithm.
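A direct sketch of the recurrence (27)-(28) with backtracking, again under the hypothetical tabular parameterization of the earlier sketches; the per-state maximum over output symbols plays the role of max over y_s of b(i, s, t), and treating all states as final is our simplifying assumption.

```python
import numpy as np

def recognize(u, pi, w, v, emit, final_states=None):
    """Viterbi-like recognition of equations (27)-(28): jointly most likely
    state and output sequence for the input sequence u."""
    T, N = len(u), len(pi)
    V = np.zeros((T, N))
    back = [[None] * N for _ in range(T)]            # (best predecessor, emitted output or None)
    for i in range(N):                               # t = 1
        out = int(np.argmax(v[i, :, u[0]])) if emit[i] else None
        V[0, i] = pi[i] * (v[i, out, u[0]] if emit[i] else 1.0)
        back[0][i] = (None, out)
    for t in range(1, T):                            # equation (28)
        for i in range(N):
            scores = w[i, :, u[t]] * V[t - 1]        # a(i, j, t) * V(j, t-1) over j
            j = int(np.argmax(scores))
            out = int(np.argmax(v[i, :, u[t]])) if emit[i] else None
            V[t, i] = scores[j] * (v[i, out, u[t]] if emit[i] else 1.0)
            back[t][i] = (j, out)
    final = range(N) if final_states is None else final_states
    i = max(final, key=lambda k: V[T - 1, k])        # best final state
    states, outputs = [], []
    for t in range(T - 1, -1, -1):                   # trace back, collecting emitted outputs
        j, out = back[t][i]
        states.append(i)
        if out is not None:
            outputs.append(out)
        i = j
    return states[::-1], outputs[::-1]
```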
5 Conclusion

We have presented a novel model and training algorithm for representing conditional distributions of output sequences given input sequences of a different length. The distribution is simplified by introducing hidden variables for the state and for the alignment of inputs and outputs, similarly to HMMs. The output sequence distribution is decomposed into conditional emission distributions for individual outputs (given a state and an input at time t) and conditional transition distributions (given the previous state and an input at time t). This is an extension of the previously proposed IOHMMs [Bengio and Frasconi, 1996, Bengio and Frasconi, 1995] that allows input and output sequences to be asynchronous. The parameters of the model can be estimated with an EM or GEM algorithm (depending on the form of the emission and transition distributions). Both the E-step and the M-step can be performed in time at worst proportional to the product of the lengths of the input and output sequences, times the number of transitions. A recognition algorithm similar to the Viterbi algorithm for HMMs has also been presented, which takes in the worst case time proportional to the length of the input sequence times the number of transitions. In practice (especially when the number of states is large), both training and recognition can be sped up by using search algorithms (such as beam search) in the space of state sequences.

References

[Bengio and Bengio, 1996] Bengio, Y. and Bengio, S. (1996). Training asynchronous input/output hidden Markov models. Technical Report #3, Département d'informatique et de recherche opérationnelle, Université de Montréal, Montréal (QC), Canada.

[Bengio and Frasconi, 1995] Bengio, Y. and Frasconi, P. (1995). An input/output HMM architecture. In Tesauro, G., Touretzky, D., and Leen, T., editors, Advances in Neural Information Processing Systems 7, pages 427-434. MIT Press, Cambridge, MA.

[Bengio and Frasconi, 1996] Bengio, Y. and Frasconi, P. (1996). Input/output HMMs for sequence processing. To appear in IEEE Transactions on Neural Networks.

[Dempster et al., 1977] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1-38.

[Levinson et al., 1983] Levinson, S., Rabiner, L., and Sondhi, M. (1983). An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. Bell System Technical Journal, 62(4):1035-1074.

[Pereira et al., 1994] Pereira, F., Riley, M., and Sproat, R. (1994). Weighted rational transductions and their application to human language processing. In ARPA Natural Language Processing Workshop.

[Rabiner, 1989] Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286.

[Rumelhart et al., 1986] Rumelhart, D., Hinton, G., and Williams, R. (1986). Learning internal representations by error propagation. In Rumelhart, D. and McClelland, J., editors, Parallel Distributed Processing, volume 1, chapter 8, pages 318-362. MIT Press, Cambridge.

[Singer, 1996] Singer, Y. (1996). Adaptive mixtures of probabilistic transducers. In Mozer, M., Touretzky, D., and Perrone, M., editors, Advances in Neural Information Processing Systems 8. MIT Press, Cambridge, MA.