An EM Algorithm for Asynchronous Input/Output Hidden Markov Models

Samy Bengio†, Yoshua Bengio‡
† INRS-Telecommunications, 16, Place du Commerce, Ile-des-Soeurs, Qc, H3E 1H6, CANADA
‡ Dept. IRO, Universite de Montreal, Montreal, Qc, H3C 3J7, CANADA

Abstract

In learning tasks in which input sequences are mapped to output sequences, it is often the case that the input and output sequences are not synchronous. For example, in speech recognition, acoustic sequences are longer than phoneme sequences. Input/Output Hidden Markov Models have already been proposed to represent the distribution of an output sequence given an input sequence of the same length. We extend this model here to the case of asynchronous sequences, and show an Expectation-Maximization algorithm for training such models.

1 Introduction

Supervised learning algorithms for sequential data minimize a training criterion that depends on pairs of input and output sequences. It is often assumed that input and output sequences are synchronized, i.e., that each input sequence has the same length as the corresponding output sequence. For instance, recurrent networks [Rumelhart et al., 1986] can be used to map input sequences to output sequences, for example by minimizing at each time step the squared difference between the actual output and the desired output. Another example is a recently proposed recurrent mixture of experts connectionist architecture which has an interpretation as a probabilistic model, called the Input/Output Hidden Markov Model (IOHMM) [Bengio and Frasconi, 1996, Bengio and Frasconi, 1995]. This model represents the distribution of an output sequence given an input sequence of the same length, using a hidden state variable and a Markovian independence assumption, as in Hidden Markov Models (HMMs) [Levinson et al., 1983, Rabiner, 1989], in order to simplify the distribution. IOHMMs are a form of probabilistic transducer [Pereira et al., 1994, Singer, 1996], with input and output variables that can be discrete as well as continuous-valued.

However, in many sequential problems where one tries to map an input sequence to an output sequence, the lengths of the input and output sequences may not be equal. Input and output sequences can evolve at different time scales. For example, in a speech recognition problem where one wants to map an acoustic signal to a phoneme sequence, each phoneme approximately corresponds to a subsequence of the acoustic signal; the input acoustic sequence is therefore generally longer than the output phoneme sequence, and the alignment between inputs and outputs is often not available.

In comparison with HMMs, emission and transition probabilities in IOHMMs vary with time, as a function of an input sequence. Unlike HMMs, IOHMMs with discrete outputs are discriminant models. Furthermore, the transition probabilities and emission probabilities are generally better matched, which reduces a problem observed in speech recognition HMMs: because outputs are in a much higher dimensional space than transitions, the dynamic range of transition probabilities is much smaller than that of emission probabilities, so the choice between different paths (during recognition) is mostly influenced by emission rather than transition probabilities.

In this paper, we present an extension of IOHMMs to the asynchronous case. We first present the probabilistic model, then derive an exact Expectation-Maximization (EM) algorithm for training asynchronous IOHMMs.
For complex distributions (e.g., using artificial neural networks to represent transition and emission distributions), a Generalized EM algorithm or gradient ascent in the likelihood can be used. Finally, a recognition algorithm similar to the Viterbi algorithm is presented, to map given input sequences to likely output sequences.

2 The Model

Let us note u_1^T for input sequences u_1, u_2, ..., u_T, and similarly y_1^S for output sequences y_1, y_2, ..., y_S. In this paper we consider the case in which the output sequences are shorter than the input sequences. The more general case is a straightforward extension of this model (using "empty" transitions that do not take any time) and will be discussed elsewhere. As in HMMs and IOHMMs, we introduce a discrete hidden state variable, x_t, which will allow us to simplify the distribution P(y_1^S | u_1^T) by using Markovian independence assumptions. The state sequence x_1^T is taken to be synchronous with the input sequence u_1^T. In order to produce output sequences shorter than input sequences, we will have states that do not emit an output as well as states that do emit an output. When, at time t, the system is in a non-emitting state, no output is produced. Therefore, there exist many sequences of states corresponding to different (shorter) output sequences.

When conceived as a generative model of the output (given the input), an asynchronous IOHMM works as follows. At time t = 1, an initial state x_1 is chosen according to the distribution P(x_1), and the length of the output sequence, s, is initialized to 0. At each later time step t > 1, a state x_t is first picked according to the transition distribution P(x_t | x_{t-1}, u_t), using the state at the previous time step x_{t-1} and the current input u_t. If x_t is an emitting state, the length of the output sequence is increased from s-1 to s, and the s-th output y_s is sampled from the emission distribution P(y_s | x_t, u_t). The parameters of the model are thus the initial state probabilities, π(i) = P(x_1 = i), and the parameters of the output and transition conditional distribution models, P(y_s | x_t, u_t) and P(x_t | x_{t-1}, u_t). Since the input and output sequences are of different lengths, we will introduce another hidden variable, τ_t, specifically to represent the alignment between inputs and outputs, with τ_t = s meaning that s outputs have been emitted at time t.

Let us first formalize the independence assumptions and the form of the conditional distribution represented by the model. The conditional probability P(y_1^S | u_1^T) can be written as a sum of terms P(y_1^S, x_1^T | u_1^T) over all possible state sequences x_1^T such that the number of emitting states in each of these sequences is S (the length of the output sequence):

    P(y_1^S | u_1^T) = Σ_{x_1^T : τ_T = S} P(y_1^S, x_1^T | u_1^T)    (1)

All S outputs must have been emitted by time T, so τ_T = S. The hidden state x_t takes discrete values in a finite set. Each of the terms P(y_1^S, x_1^T | u_1^T) corresponds to a particular sequence of states, and a corresponding alignment: this probability can be written as the initial state probability P(x_1) times a product of factors over all the time steps t: if state x_t = i is an emitting state, that factor is P(x_t | x_{t-1}, u_t) P(y_s | x_t, u_t), otherwise that factor is simply P(x_t | x_{t-1}, u_t), where s is the position in the output sequence of the output emitted at time t (when an output is emitted at time t). We summarize in Table 1 the notation we have introduced and define additional notation used in this paper.

Table 1: Notation used in the paper.
  S = size of the output sequence.
  T = size of the input sequence.
  N = number of states in the IOHMM.
  a(i, j, t) = output of the module that computes P(x_t = i | x_{t-1} = j, u_t).
  b(i, s, t) = output of the module that computes P(y_s | x_t = i, u_t).
  π(i) = P(x_1 = i), initial probability of state i.
  z_{i,t} = 1 if x_t = i, 0 otherwise. These indicator variables give the state sequence.
  m_{s,t} = 1 if the system emits the s-th output at time t, 0 otherwise. These indicator variables give the input/output alignment.
  e_i is true if state i emits (so P(e_i) = 1), e_i is false otherwise.
  τ_t = s means that the first s outputs have been emitted at time t.
  u_{t,k} = 1 if the t-th input symbol is k, 0 otherwise.
  y_{s,k} = 1 if the s-th output symbol is k, 0 otherwise.
  pred(i) is the set of all the predecessor states of state i.
  succ(i) is the set of all the successor states of state i.

The Markovian conditional independence assumptions in this model mean that the state variable x_t summarizes sufficiently the past of the sequence, so

    P(x_t | x_1^{t-1}, u_1^t) = P(x_t | x_{t-1}, u_t)    (2)

and

    P(y_s | x_1^t, u_1^t) = P(y_s | x_t, u_t).    (3)

These assumptions are analogous to the Markovian independence assumptions used in HMMs and are the same as in synchronous IOHMMs.
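To make the generative story concrete, the following minimal Python/NumPy sketch samples an output sequence given an input sequence. The `transition` and `emission` callables stand in for the model's conditional distribution modules (hypothetical names, not from the paper), and since the text does not fully specify whether the initial state itself emits, the sketch assumes the same emission rule applies at t = 1:

```python
import numpy as np

def sample_output(u, pi, transition, emission, emits, rng=None):
    """Sample an output sequence y given an input sequence u (generative view).

    pi         : (N,) initial state distribution P(x_1)
    transition : transition(j, u_t) -> (N,) vector P(x_t = . | x_{t-1} = j, u_t)
    emission   : emission(i, u_t)   -> (M_o,) vector P(y = . | x_t = i, u_t)
    emits      : (N,) booleans, emits[i] == e_i (state i is an emitting state)
    """
    rng = rng or np.random.default_rng()
    y = []
    x = rng.choice(len(pi), p=pi)          # x_1 ~ P(x_1); output length s starts at 0
    if emits[x]:                           # assumption: emission rule also applies at t = 1
        probs = emission(x, u[0])
        y.append(rng.choice(len(probs), p=probs))
    for t in range(1, len(u)):
        x = rng.choice(len(pi), p=transition(x, u[t]))   # x_t ~ P(x_t | x_{t-1}, u_t)
        if emits[x]:                       # emitting state: s grows by 1, sample y_s
            probs = emission(x, u[t])
            y.append(rng.choice(len(probs), p=probs))
    return y
```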
Based on these two assumptions, the conditional probability can be efficiently represented and computed recursively, using an intermediate variable

    α(i, s, t) = P(x_t = i, τ_t = s, y_1^s | u_1^t).    (4)

The conditional probability of an output sequence can then be expressed in terms of this variable:

    L = P(y_1^S | u_1^T) = Σ_{i∈F} α(i, S, T)    (5)

where F is the set of final states. These α's can be computed recursively in a way that is analogous to the forward pass of the Baum-Welch algorithm for HMMs, but with a different treatment for emitting (e_i is true) and non-emitting (e_i is false) states:

    α(i, s, t) = P(e_i = 1) b(i, s, t) Σ_{j∈pred(i)} a(i, j, t) α(j, s-1, t-1)
               + P(e_i = 0) Σ_{j∈pred(i)} a(i, j, t) α(j, s, t-1)    (6)

where a(i, j, t) = P(x_t = i | x_{t-1} = j, u_t), b(i, s, t) = P(y_s | x_t = i, u_t), and pred(i) is the set of states with an outgoing transition to state i. The derivation of this recursion using the independence assumptions can be found in [Bengio and Bengio, 1996].
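As a concrete illustration, here is a dense NumPy sketch of the forward recursion, under two stated assumptions: every state is treated as a potential predecessor of every other (so pred(i) is all states), and the t = 1 initialization, which the text does not spell out, is chosen to be consistent with the generative description (an emitting initial state emits the first output):

```python
import numpy as np

def forward(a, b, pi, emits, final_states, S):
    """alpha(i, s, t) of equations (4)-(6) and the likelihood L of equation (5).

    a  : (T, N, N), a[t, i, j] = P(x_t = i | x_{t-1} = j, u_t)   (a[0] unused)
    b  : (T, N, S+1), b[t, i, s] = P(y_s | x_t = i, u_t)         (column s = 0 unused)
    pi : (N,) initial state probabilities pi(i) = P(x_1 = i)
    emits : (N,) booleans, emits[i] == e_i
    final_states : the set F of final states
    """
    T, N = a.shape[0], len(pi)
    alpha = np.zeros((T, N, S + 1))
    for i in range(N):                    # t = 1: assumed initialization (see lead-in)
        if emits[i]:
            alpha[0, i, 1] = pi[i] * b[0, i, 1]
        else:
            alpha[0, i, 0] = pi[i]
    for t in range(1, T):
        for i in range(N):
            if emits[i]:                  # emitting state: advances the alignment by one
                for s in range(1, S + 1):
                    alpha[t, i, s] = b[t, i, s] * np.dot(a[t, i, :], alpha[t - 1, :, s - 1])
            else:                         # non-emitting state: alignment s is unchanged
                for s in range(S + 1):
                    alpha[t, i, s] = np.dot(a[t, i, :], alpha[t - 1, :, s])
    L = sum(alpha[T - 1, i, S] for i in final_states)
    return alpha, L
```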
3 An EM Algorithm for Asynchronous IOHMMs

The learning algorithm we propose is based on the maximum likelihood principle, i.e., here, maximizing the conditional likelihood of the training data. The algorithm could be generalized to one maximizing the likelihood of the parameters given the data, by taking into account priors on those parameters. Let the training data, D, be a set of P input/output sequences independently sampled from the same distribution. Let T_p and S_p be the lengths of the p-th input and output sequences respectively:

    D = {(u_1^{T_p}(p), y_1^{S_p}(p)), p = 1...P}    (7)

Let θ be the set of all the parameters of the model. Because each sequence is sampled independently, we can write the likelihood function as follows, omitting sequence indexes to simplify:

    L(θ; D) = Π_{p=1}^P P(y_1^{S_p} | u_1^{T_p})    (8)

According to the maximum likelihood principle, the optimal parameters are obtained when L(θ; D) is maximized. We will show here how an iterative optimization to a local maximum can be achieved using the Expectation-Maximization (EM) algorithm. EM is an iterative procedure for maximum likelihood estimation, originally formalized in [Dempster et al., 1977]. Each iteration is composed of two steps: an estimation step and a maximization step. The basic idea is to introduce an additional variable which, if it were known, would greatly simplify the optimization problem. This additional variable is known as the missing or hidden data. A joint model of this variable with the observed variables must be set up. The set D_c, which includes the data set D and the values of the hidden variable for each of the examples, is known as the complete data set. Correspondingly, L_c(θ; D_c) is referred to as the complete data likelihood. Since the hidden data are not observed, L_c is a random variable and cannot be maximized directly. The EM algorithm is based on averaging log L_c(θ; D_c) over the distribution of the hidden data, given the known data D and the previous value of the parameters θ̂. This expected log likelihood is called the auxiliary function:

    Q(θ, θ̂) = E[log L_c(θ; D_c) | D, θ̂]    (9)

In short, for each EM iteration, one first computes Q (E-step), then updates θ such that it maximizes Q (M-step).

To apply EM to asynchronous IOHMMs, we need to choose hidden variables such that knowledge of these variables would simplify the learning problem. Let x_t be a hidden variable representing the hidden state of the Markov model (such that x_t = i means the system is in state i at time t). Knowledge of the state sequence x_1^T would make the estimation of the parameters trivial (simple counting would suffice). Because some states emit and some don't, we introduce an additional hidden variable, τ_t, representing the alignment between inputs (and states) and outputs, such that τ_t = s implies that s outputs have been emitted at time t.

Note that in the current implementation of the model, we are assuming that there is a deterministic relation between the value of the state sequence x_1^T and of the alignment sequence τ_1^T, because P(e_i), the probability that state i emits, is assumed to be 0 or 1 (and known). Here, the complete data can be written as follows:

    D_c = {(u_1^{T_p}(p), y_1^{S_p}(p), x_1^{T_p}(p), τ_1^{T_p}(p)), p = 1...P}    (10)

The corresponding complete data likelihood is (again dropping the (p) indices):

    L_c(θ; D_c) = Π_{p=1}^P P(y_1^{S_p}, x_1^{T_p}, τ_1^{T_p} | u_1^{T_p})    (11)

Let z_{i,t} be an indicator variable such that z_{i,t} = 1 if x_t = i, and z_{i,t} = 0 otherwise. Let m_{s,t} be an indicator variable such that m_{s,t} = 1 if the s-th output is emitted at time t, and m_{s,t} = 0 otherwise. Using these indicator variables and factorizing the likelihood equation, we obtain:

    L_c(θ; D_c) = Π_{p=1}^P Π_{t=1}^{T_p} Π_{i=1}^N ( Π_{s=1}^{S_p} P(y_s | x_t=i, u_t)^{z_{i,t} m_{s,t}} ) ( Π_{j=1}^N P(x_t=i | x_{t-1}=j, u_t)^{z_{i,t} z_{j,t-1}} )    (12)
Taking the logarithm, we obtain the following expression for the complete data log likelihood:

    log L_c(θ; D_c) = Σ_{p=1}^P Σ_{t=1}^{T_p} Σ_{i=1}^N ( Σ_{s=1}^{S_p} z_{i,t} m_{s,t} log P(y_s | x_t=i, u_t) + Σ_{j=1}^N z_{i,t} z_{j,t-1} log P(x_t=i | x_{t-1}=j, u_t) )    (13)

3.1 The Estimation Step

Let us define the auxiliary function Q(θ, θ̂) as the expected value of log L_c(θ; D_c) with respect to the hidden variables x and τ, given the data D and the previous set of parameters θ̂:

    Q(θ, θ̂) = E[log L_c(θ; D_c) | D, θ̂]    (14)

    Q(θ, θ̂) = Σ_{p=1}^P Σ_{t=1}^{T_p} Σ_{i=1}^N ( Σ_{s=1}^{S_p} ĝ_{i,s,t} log P(y_s | x_t=i, u_t) + Σ_{j=1}^N ĥ_{i,j,t} log P(x_t=i | x_{t-1}=j, u_t) )    (15)

where, by definition,

    ĝ_{i,s,t} = E[x_t=i, τ_t=s, e_i=1 | u_1^T, y_1^S]    (16)

and

    ĥ_{i,j,t} = E[x_t=i, x_{t-1}=j | u_1^T, y_1^S].    (17)

The hats on ĝ and ĥ denote that these variables are computed using the previous value of the parameters, θ̂. In order to compute ĝ_{i,s,t} and ĥ_{i,j,t}, we will use the already defined α(i, s, t) (equations 4 and 6), and introduce a new variable, β(i, s, t), borrowing the notation from the HMM and IOHMM literature:

    β(i, s, t) = P(y_{s+1}^S | x_t=i, τ_t=s, u_{t+1}^T)    (18)

Like α, β can be computed recursively, but going backwards in time:

    β(i, s, t) = Σ_{j∈succ(i)} P(e_j = 1) a(j, i, t+1) b(j, s+1, t+1) β(j, s+1, t+1)
               + Σ_{j∈succ(i)} P(e_j = 0) a(j, i, t+1) β(j, s, t+1)    (19)

where succ(i) is the set of successor states of state i, a(j, i, t) is the conditional transition probability from state i to state j at time t, and b(i, s, t) is the conditional probability of emitting the s-th output at time t in state i. The proof of correctness of this recursion (using the Markovian independence assumptions) is given in [Bengio and Bengio, 1996]. We can now express g_{i,s,t} and h_{i,j,t} in terms of α(i, s, t) and β(i, s, t) (see the derivations in [Bengio and Bengio, 1996]):

    g_{i,s,t} = P(e_i = 1) α(i, s, t) β(i, s, t) / L    (20)

and

    h_{i,j,t} = (a(i, j, t) / L) ( P(e_i = 1) Σ_s α(j, s-1, t-1) b(i, s, t) β(i, s, t)
              + P(e_i = 0) Σ_s α(j, s, t-1) β(i, s, t) )    (21)
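A corresponding sketch of the backward recursion and of the E-step posteriors, with the same dense array conventions as the forward pass above. The boundary condition β(i, S, T) = 1 for final states (and 0 elsewhere) is an assumption chosen to be consistent with equation (5):

```python
import numpy as np

def backward(a, b, emits, final_states, S):
    """beta(i, s, t) of equations (18)-(19), same conventions as forward()."""
    T, N = a.shape[0], a.shape[1]
    beta = np.zeros((T, N, S + 1))
    for i in final_states:                # assumed boundary condition (see lead-in)
        beta[T - 1, i, S] = 1.0
    for t in range(T - 2, -1, -1):
        for i in range(N):
            for s in range(S + 1):
                acc = 0.0
                for j in range(N):        # successors of i (dense sketch: all states)
                    if emits[j] and s + 1 <= S:
                        acc += a[t + 1, j, i] * b[t + 1, j, s + 1] * beta[t + 1, j, s + 1]
                    elif not emits[j]:
                        acc += a[t + 1, j, i] * beta[t + 1, j, s]
                beta[t, i, s] = acc
    return beta

def posteriors(alpha, beta, a, b, emits, L):
    """E-step posteriors g (equation 20) and h (equation 21)."""
    T, N, S1 = alpha.shape                # S1 == S + 1
    g = np.zeros((N, S1, T))
    h = np.zeros((N, N, T))
    for t in range(T):
        for i in range(N):
            if emits[i]:                  # g is nonzero only for emitting states
                g[i, :, t] = alpha[t, i, :] * beta[t, i, :] / L
    for t in range(1, T):                 # h needs a predecessor, so it starts at t = 2
        for i in range(N):
            for j in range(N):
                if emits[i]:
                    acc = sum(alpha[t - 1, j, s - 1] * b[t, i, s] * beta[t, i, s]
                              for s in range(1, S1))
                else:
                    acc = sum(alpha[t - 1, j, s] * beta[t, i, s] for s in range(S1))
                h[i, j, t] = a[t, i, j] * acc / L
    return g, h
```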
3.2 The Maximization Step

After each estimation step, one has to maximize Q(θ, θ̂). If the conditional probability distributions (for transitions as well as for emissions) have a simple enough form (e.g., multinomial, generalized linear models, or mixtures of these), then one can maximize Q analytically, i.e., solve

    ∂Q(θ, θ̂)/∂θ = 0    (22)

for θ. Otherwise, if for instance the conditional probabilities are implemented with non-linear systems (such as neural networks), then the maximization of Q cannot be done in one step. In this case, we can apply a Generalized EM (GEM) algorithm, which simply requires an increase in Q at each optimization step, for example using gradient ascent in Q, or directly perform gradient ascent in L(θ; D).

3.3 Multinomial Distributions

We describe here the maximization procedure for multinomial conditional probabilities (for transitions and emissions), i.e., distributions which can be implemented as lookup tables. This applies to problems with discrete inputs and outputs. Let M_i be the number of input symbols and M_o the number of output classes, and recall that u_{t,k} = 1 when the t-th input symbol is k (0 otherwise), and y_{s,k} = 1 when the s-th output symbol is k (0 otherwise). For transition probabilities, let w_{i,j,k} = P(x_t=i | x_{t-1}=j, u_{t,k}=1). The solution of equation (22) for w_{i,j,k}, with the constraint that the probabilities must sum to 1, yields the following reestimation formula:

    w_{i,j,k} = Σ_t u_{t,k} ĥ_{i,j,t} / Σ_{l=1}^N Σ_t u_{t,k} ĥ_{l,j,t}    (23)

For emission probabilities, let v_{i,l,k} = P(y_s = l | x_t=i, u_{t,k}=1); the reestimation formula is then:

    v_{i,l,k} = Σ_t Σ_s u_{t,k} y_{s,l} ĝ_{i,s,t} / Σ_{m=1}^{M_o} Σ_t Σ_s u_{t,k} y_{s,m} ĝ_{i,s,t}    (24)

3.4 Neural Networks or Other Complex Distributions

In the more general case where one cannot maximize Q analytically, we can compute the gradient of Q with respect to each parameter and apply gradient ascent in Q, yielding a GEM algorithm (or, alternatively, directly maximize L by gradient ascent). For the transition probability model of state i, a(j, i, t) = P(x_t=j | x_{t-1}=i, u_t; w_i), with parameters w_i, the gradient of Q with respect to w_i is

    ∂Q/∂w_i = Σ_t Σ_{j∈succ(i)} (ĥ_{j,i,t} / a(j, i, t)) ∂a(j, i, t)/∂w_i    (25)

where ∂a(j, i, t)/∂w_i can be computed by back-propagation. Similarly, for the emission probability model of state i, b(i, s, t) = P(y_s | x_t=i, u_t; v_i), with parameters v_i, the gradient is

    ∂Q/∂v_i = Σ_t Σ_s (ĝ_{i,s,t} / b(i, s, t)) ∂b(i, s, t)/∂v_i    (26)

where ∂b(i, s, t)/∂v_i can be computed by back-propagation.

Table 2: Overview of the learning algorithm for asynchronous IOHMMs.
1. Estimation Step: for each training sequence (u_1^T, y_1^S) do
   (a) for each state j = 1...N, compute a(i, j, t) and b(j, s, t) according to the chosen distribution models;
   (b) for each state i = 1...N, compute α(i, s, t), β(i, s, t), and L using the current value of the parameters θ̂ (equations 6, 5 and 19), then compute the posterior probabilities ĥ_{i,j,t} and ĝ_{i,s,t} (equations 21 and 20).
2. Maximization Step: for each state j = 1...N do
   (a) adjust the transition probability parameters of state j using reestimation formulae such as equation 23 (or gradient ascent for non-linear modules, equation 25);
   (b) adjust the emission probability parameters of state j using reestimation formulae such as equation 24 (or gradient ascent for non-linear modules, equation 26).
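As a concrete illustration of the M-step of Table 2 in the lookup-table case, the following sketch accumulates the posteriors of equations (23)-(24) for a single training sequence and normalizes; with several training sequences, the numerators and denominators would be accumulated across all sequences before normalizing. Array shapes follow the earlier sketches:

```python
import numpy as np

def reestimate_multinomial(h, g, inputs, outputs, N, Mi, Mo):
    """M-step reestimation formulae (23)-(24) for lookup-table distributions.

    h : (N, N, T) posteriors h[i, j, t];  g : (N, S+1, T) posteriors g[i, s, t]
    inputs  : length-T list of input symbol indices (one training sequence)
    outputs : length-S list of output symbol indices
    """
    w = np.zeros((N, N, Mi))              # w[i, j, k] = P(x_t = i | x_{t-1} = j, u_t = k)
    v = np.zeros((N, Mo, Mi))             # v[i, l, k] = P(y_s = l | x_t = i, u_t = k)
    for t, k in enumerate(inputs):
        w[:, :, k] += h[:, :, t]          # numerator of equation (23)
        for s, l in enumerate(outputs, start=1):
            v[:, l, k] += g[:, s, t]      # numerator of equation (24)
    w /= np.maximum(w.sum(axis=0, keepdims=True), 1e-12)   # normalize over i (eq. 23)
    v /= np.maximum(v.sum(axis=1, keepdims=True), 1e-12)   # normalize over l (eq. 24)
    return w, v
```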
4 A Recognition Algorithm for Asynchronous IOHMMs

Given a trained asynchronous IOHMM, we want to recognize new sequences, i.e., given an input sequence, choose an output sequence according to the model. Ideally, we would like to pick the output sequence that is most likely given the input sequence. However, this would require an exponential number of computations (with respect to the sequence length). Instead, as in the Viterbi algorithm for HMMs, we will consider the complete data model and look for the joint values of states and outputs that are most likely. Thanks to a dynamic programming recurrence, we can compute the most likely state and output sequence in time proportional to the sequence length times the number of transitions (even though the number of such sequences is exponential in the sequence length). Let us define V(i, t) as the probability of the best state and output subsequence ending in state i at time t:

    V(i, t) = max_{s, y_1^s, x_1^{t-1}} P(y_1^s, x_1^{t-1}, x_t=i, τ_t=s | u_1^t)    (27)

where the maximum is taken over all possible lengths s of output sequences y_1^s. This variable can be computed recursively by dynamic programming (the derivation is given in [Bengio and Bengio, 1996]):

    V(i, t) = P(e_i = 1) max_{y_s} b(i, s, t) max_j (a(i, j, t) V(j, t-1))
            + P(e_i = 0) max_j (a(i, j, t) V(j, t-1))    (28)

At the end of the sequence, the best final state i*, which maximizes V(i, T), is picked within the set of final states F. If the argmax of the above recurrence is kept, then the best predecessor j and the best output y_s for each (i, t) can be used to trace back the optimal state and output sequence from i*, as in the Viterbi algorithm.
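A sketch of this recognition procedure, again dense over states; `emis[t, i, y]` plays the role of P(y | x_t = i, u_t), so the inner maximization over the output value y_s in equation (28) becomes an argmax over output symbols, and the t = 1 initialization is an assumption mirroring the forward pass:

```python
import numpy as np

def recognize(a, emis, pi, emits, final_states):
    """Most likely joint state/output sequence (equations 27-28).

    a    : (T, N, N), a[t, i, j] = P(x_t = i | x_{t-1} = j, u_t)   (a[0] unused)
    emis : (T, N, Mo), emis[t, i, y] = P(y | x_t = i, u_t)
    """
    T, N, _ = emis.shape
    V = np.zeros((T, N))
    back = np.zeros((T, N), dtype=int)    # best predecessor j for each (i, t)
    out = np.full((T, N), -1, dtype=int)  # best output symbol, -1 for non-emitting states
    for i in range(N):                    # t = 1: assumed initialization (see lead-in)
        V[0, i] = pi[i]
        if emits[i]:
            out[0, i] = int(emis[0, i].argmax())
            V[0, i] *= emis[0, i, out[0, i]]
    for t in range(1, T):
        for i in range(N):
            scores = a[t, i, :] * V[t - 1, :]        # max_j a(i, j, t) V(j, t-1)
            back[t, i] = int(scores.argmax())
            V[t, i] = scores[back[t, i]]
            if emits[i]:                  # emitting state: also pick the best output
                out[t, i] = int(emis[t, i].argmax())
                V[t, i] *= emis[t, i, out[t, i]]
    best = max(final_states, key=lambda f: V[T - 1, f])   # best final state within F
    path = [best]
    for t in range(T - 1, 0, -1):         # trace back the optimal state sequence
        path.append(back[t, path[-1]])
    path.reverse()
    y = [out[t, path[t]] for t in range(T) if out[t, path[t]] >= 0]
    return path, y
```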
5 Conclusion

We have presented a novel model and training algorithm for representing conditional distributions of output sequences given input sequences of a different length. The distribution is simplified by introducing hidden variables for the state and for the alignment of inputs and outputs, similarly to HMMs. The output sequence distribution is decomposed into conditional emission distributions for individual outputs (given a state and an input at time t) and conditional transition distributions (given a previous state and an input at time t). This is an extension of the previously proposed IOHMMs [Bengio and Frasconi, 1996, Bengio and Frasconi, 1995] that allows input and output sequences to be asynchronous. The parameters of the model can be estimated with an EM or GEM algorithm (depending on the form of the emission and transition distributions). Both the E-step and the M-step can be performed in time at worst proportional to the product of the lengths of the input and output sequences, times the number of transitions. A recognition algorithm similar to the Viterbi algorithm for HMMs has also been presented, which takes, in the worst case, time proportional to the length of the input sequence times the number of transitions. In practice (especially when the number of states is large), both training and recognition can be sped up by using search algorithms (such as beam search) in the space of state sequences.

References

[Bengio and Bengio, 1996] Bengio, Y. and Bengio, S. (1996). Training asynchronous input/output hidden Markov models. Technical Report #3, Departement d'informatique et de Recherche Operationnelle, Universite de Montreal, Montreal (QC), Canada.

[Bengio and Frasconi, 1995] Bengio, Y. and Frasconi, P. (1995). An input/output HMM architecture. In Tesauro, G., Touretzky, D., and Leen, T., editors, Advances in Neural Information Processing Systems 7, pages 427-434. MIT Press, Cambridge, MA.

[Bengio and Frasconi, 1996] Bengio, Y. and Frasconi, P. (1996). Input/output HMMs for sequence processing. To appear in IEEE Transactions on Neural Networks.

[Dempster et al., 1977] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1-38.

[Levinson et al., 1983] Levinson, S., Rabiner, L., and Sondhi, M. (1983). An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. Bell System Technical Journal, 62(4):1035-1074.

[Pereira et al., 1994] Pereira, F., Riley, M., and Sproat, R. (1994). Weighted rational transductions and their application to human language processing. In ARPA Natural Language Processing Workshop.

[Rabiner, 1989] Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286.

[Rumelhart et al., 1986] Rumelhart, D., Hinton, G., and Williams, R. (1986). Learning internal representations by error propagation. In Rumelhart, D. and McClelland, J., editors, Parallel Distributed Processing, volume 1, chapter 8, pages 318-362. MIT Press, Cambridge.

[Singer, 1996] Singer, Y. (1996). Adaptive mixtures of probabilistic transducers. In Mozer, M., Touretzky, D., and Perrone, M., editors, Advances in Neural Information Processing Systems 8. MIT Press, Cambridge, MA.