Investigating Mixed Discrete/Continuous Dynamic Bayesian Networks with Application to Automatic Speech Recognition


Bertrand Mesot
IDIAP Research Institute
P.O. Box 59, CH-9 Martigny, Switzerland

Thesis Proposal
September 5

Submitted to: The Swiss Federal Institute of Technology of Lausanne (EPFL), School of Engineering Sciences (STI), Signal Processing Institute
Thesis Director: Prof. Hervé Bourlard
Supervisor: Dr David Barber

ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE

Abbreviations

HMM       Hidden Markov Model
Seg-HMM   Segmental HMM
SAR-HMM   Switching Autoregressive HMM
SLDS      Switching Linear Dynamical System
DBN       Dynamical Bayesian Network
PDF       Probability Distribution Function
EM        Expectation Maximisation
ML        Maximum Likelihood

Notation

s_t        The state of a discrete (switch) hidden variable at time t
h_t        The state of a continuous hidden variable at time t
o_t        A feature vector at time t
v_t        A sample of the speech signal at time t
x_{1:T}    Shorthand for x_1, x_2, ..., x_T
φ          A particular setting of the HMM parameters

1 Introduction

Speech is fundamentally a mixture of discrete and continuous effects. On the one hand, we have the words or sub-word units like phonemes; on the other hand, there is the waveform, which comes from the oscillations of the vocal cords and is modulated by the vocal tract. Current HMM-based recognition systems are mainly centred on the modelling of the discrete part and assume that the information carried by the continuous component can be compressed into a set of features which encompass static as well as dynamic information. While this approach has been very successful, from a formal point of view it has the drawback of making the model inconsistent. This is mainly because the introduction of dynamical features induces correlation between the observations at various time steps, while an HMM assumes that there is no correlation between observations coming from the same state.

Two models that address this issue are the Segmental HMM (Seg-HMM) and the Switching Autoregressive HMM (SAR-HMM). The Seg-HMM tries to capture the short-term correlations that exist between consecutive features, while the SAR-HMM does not use features at all and rather models the raw speech signal by means of a set of autoregressive processes. Both models can be seen as a step towards a better modelling of the continuous component of the speech signal, but they still do not address the issue completely. Indeed, in a Seg-HMM continuity is broken at segment boundaries and, in a SAR-HMM, the AR processes are defined on the noisy signal, which makes the model quite sensitive to noise.

The goal of this thesis is to go a step further in the modelling of the continuous component of the speech signal by proposing a more general model that belongs to the class of Switching Linear Dynamical Systems (SLDSs). SLDSs are particularly suited for the modelling of the speech signal because they combine both continuous and discrete variables in a single consistent model. Contrary to the SAR-HMM, the continuity of the signal is directly inherited from the continuity of the hidden variable. This approach has the benefit of clearly separating the underlying dynamics of the modelled signal from the noise. Furthermore, compared to the Seg-HMM, the continuous hidden variable is not integrated out over a segment, but is shared by all time steps. Therefore segments are no longer required and the problem of discontinuities at segment boundaries disappears. Another advantage of a SLDS over an HMM is that interactions between discrete and continuous hidden variables can be easily handled. This is particularly useful for modelling state duration, for example, since one can condition the discrete transition probability on the state of the continuous variable in order to constrain state transitions to occur only at a particular moment. Finally, the use of a continuous hidden variable makes the introduction of knowledge about the structure of the signal straightforward whilst maintaining the consistency of the model. As we will show, the characteristic shape and dynamics of the harmonic stacking structures observed in speech utterances can be efficiently encoded into an SLDS by considering a Fourier type representation. This approach is particularly interesting because it allows knowledge about the structure of the signal to be integrated in a more natural way than with an AR process and also considerably reduces the number of parameters that need to be trained.
For example, the inherent structure of a phoneme does not change whether it is spoken by a male or a female speaker; the pitch, however, changes and must be considered as a parameter.

Ultimately our model will represent the speech waveform as a dynamical system which is controlled by a set of parameters whose values depend on the state of a discrete switch variable, which itself represents a sub-word unit such as a phoneme. With this structure it is also possible to integrate higher level knowledge, like a word model, in a hierarchical way.

The following section starts by briefly presenting the basic principles of speech recognition; it then describes in more detail the Seg-HMM and the SAR-HMM and shows what we believe are their deficits. We then introduce SLDSs and explain how we plan to apply them to speech recognition. Section 4 presents preparatory work that we carried out in order to demonstrate the validity of our model. It discusses the difficulties of doing inference in SLDSs and shows that, even though inference in SLDSs is intractable, it can be efficiently approximated by an algorithm that we developed. In this section we also compare the robustness to noise of a SAR-HMM and a SLDS. Finally, Section 6 presents a detailed research plan.

2 Background and Motivation

2.1 General Framework

Current state-of-the-art automatic speech recognition (ASR) systems are based on the principle of statistical pattern matching. A speech waveform is first converted into a sequence of acoustic (or feature) vectors o_{1:T} and the corresponding sentence (or utterance) is split into a sequence of words w_{1:N}. The job of the ASR system is to find the most likely sequence of words \hat{w}_{1:N}, given the observed sequence of acoustic vectors:

    \hat{w}_{1:N} = \arg\max_{w_{1:N}} p(w_{1:N} \mid o_{1:T}).    (1)

To do so, each word is split into simple units called phones and each of those phones is modelled by an HMM. The HMMs are then concatenated according to the position of the phones in the utterance to form the corresponding model; if a phone appears more than once, the parameters of the HMM are shared between all the instances. The probability of the observed sequence of acoustic vectors given the sequence of words, p(o_{1:T} \mid w_{1:N}), is then computed and Bayes' rule is used to calculate p(w_{1:N} \mid o_{1:T}):

    p(w_{1:N} \mid o_{1:T}) = \frac{p(o_{1:T} \mid w_{1:N}) p(w_{1:N})}{p(o_{1:T})}    (2)

where the a priori probability of the sequence of words p(w_{1:N}) is given by a language model. This procedure is repeated for all possible sequences of words and the most likely one is considered to be the actual transcription of the speech waveform.
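As a minimal sketch of the decision rule in Equations (1) and (2), consider the following Python fragment. The hypothesis scorers acoustic_loglik and language_logprob are illustrative placeholders rather than part of the proposal, and since p(o_{1:T}) is common to every hypothesis it is simply dropped.

```python
import numpy as np

def best_word_sequence(candidates, acoustic_loglik, language_logprob):
    """Pick the most likely word sequence, Equations (1) and (2).

    candidates       : list of candidate word sequences w_{1:N}
    acoustic_loglik  : function w -> log p(o_{1:T} | w_{1:N})   (assumed given)
    language_logprob : function w -> log p(w_{1:N})             (language model)

    p(o_{1:T}) is identical for every hypothesis, so it cancels and the
    comparison can be carried out in the log domain.
    """
    scores = [acoustic_loglik(w) + language_logprob(w) for w in candidates]
    return candidates[int(np.argmax(scores))]
```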

The underlying behaviour of the recognition process is actually much more complicated than what has been presented so far. The speech signal is generally pre-emphasised before being cut into small, usually 5 ms long, overlapping windows on which a Fourier analysis is carried out to obtain slices of the short-term spectrogram. Various transformations are then applied on those slices in order to generate acoustic vectors. Each phone is then usually modelled by a hidden Markov chain of three states. Phone models are combined to form word models, which are combined together to form a single composite HMM representing all possible utterances. Finding the most likely word sequence \hat{w}_{1:N} is then reduced to finding the most likely state sequence \hat{s}_{1:T}, given the observations:

    \hat{s}_{1:T} = \arg\max_{s_{1:T}} p(s_{1:T} \mid o_{1:T}).    (3)

This problem can be efficiently solved by using the Viterbi algorithm, while finding the most likely word sequence, as given by Equation (1), is intractable.

Before being used for recognition, the parameters of an HMM must be adjusted in order to fit the model to the observations. Usually one wants to find the parameter setting \hat{\phi} that maximises the log-likelihood of the observations, given the word sequence:

    \hat{\phi} = \arg\max_{\phi} \log p(o_{1:T} \mid w_{1:N}, \phi).    (4)

This training procedure is carried out by the Baum-Welch algorithm, which is a specialisation of the expectation maximisation (EM) algorithm to the HMM. The Baum-Welch algorithm itself relies on the Forward-Backward algorithm, which computes for each time step the posterior probability of the state given the whole sequence of observations, p(s_t \mid o_{1:T}). This procedure is called inference and its complexity is directly related to the structure of the model. An HMM, for example, can be represented by the Dynamical Bayesian Network (DBN) of Figure 1 and defines the following joint probability:

    p(s_{1:T}, o_{1:T}) = p(o_1 \mid s_1) p(s_1) \prod_{t=2}^{T} p(o_t \mid s_t) p(s_t \mid s_{t-1})    (5)

where p(s_1) is the state prior, p(s_t \mid s_{t-1}) the transition probability and p(o_t \mid s_t) the emission probability. The actual values of those probabilities are the parameters of the HMM. (The interested reader may refer to [] for a detailed explanation.)

Figure 1: Dynamical Bayesian network representation of an HMM. The time is symbolised by t, the discrete state variable by s and the continuous visible variable by o.
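As a concrete illustration of the inference step just described, here is a minimal sketch (not the proposal's implementation) of the Forward-Backward computation of p(s_t | o_{1:T}) for the HMM of Equation (5); the array layout is an assumption made for the example.

```python
import numpy as np

def forward_backward(prior, trans, emit_loglik):
    """Posterior state marginals p(s_t | o_{1:T}) for the HMM of Equation (5).

    prior       : (S,)   state prior p(s_1)
    trans       : (S, S) transition matrix, trans[i, j] = p(s_t = j | s_{t-1} = i)
    emit_loglik : (T, S) log emission probabilities log p(o_t | s_t)
    """
    T, S = emit_loglik.shape
    # Per-time rescaling of the emissions; the constant cancels after normalisation.
    emit = np.exp(emit_loglik - emit_loglik.max(axis=1, keepdims=True))
    alpha = np.zeros((T, S))               # scaled forward messages
    beta = np.zeros((T, S))                # scaled backward messages
    alpha[0] = prior * emit[0]
    alpha[0] /= alpha[0].sum()             # rescale to avoid numerical underflow
    for t in range(1, T):
        alpha[t] = emit[t] * (alpha[t - 1] @ trans)
        alpha[t] /= alpha[t].sum()
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (emit[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    posterior = alpha * beta
    return posterior / posterior.sum(axis=1, keepdims=True)
```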

HMM training based on maximum log-likelihood (ML) may not be adequate since the discriminative power of the model is not enforced. For this reason, people have tried various alternative approaches like, for example, conditional log-likelihood maximisation of the state sequence p(s_{1:T} \mid o_{1:T}) [7], maximum mutual information (MMI) [, 9] or minimum word/phone error (MWE/MPE) []. Discriminative training is more expensive than ML training and approximations are usually required. It has nevertheless been successfully used for connected digit recognition [] as well as large vocabulary speech recognition [8, ].

2.2 Segmental HMMs

The idea behind the Seg-HMM [9, ] is that recognition performance would be improved if it were possible to model more accurately the relation that exists between consecutive feature vectors. To do so, the observation sequence o_{1:T} is assumed to be the concatenation of an a priori unknown number of segments, where each one belongs to a certain a priori unknown state. The length of the segments as well as the state to which they belong are discrete hidden variables that need to be inferred. Formally speaking, the segmental HMM is just a plain HMM where the probability distribution function (PDF) of a single acoustic vector has been replaced by the PDF of a sequence of vectors. Segments are thus assumed to be independent and, given a state sequence s_{1:N} of length N and the duration of each segment l_{1:N}, the probability of the entire sequence of observations o_{1:T} is given by the product of the probabilities of each segment:

    p(o_{1:T} \mid l_{1:N}, s_{1:N}) = \prod_{i=1}^{N} p(o_{t_i : t_i + l_i} \mid l_i, s_i)    (6)

where t_i is the starting time of the ith segment and l_i the length of that segment. The probability of the observations given the state sequence only is obtained by summing over all possible segment lengths:

    p(o_{1:T} \mid s_{1:N}) = \sum_{l_{1:N}} p(o_{1:T} \mid l_{1:N}, s_{1:N}) p(l_{1:N} \mid s_{1:N})    (7)

with

    p(l_{1:N} \mid s_{1:N}) = p(l_1 \mid s_1) \prod_{i=2}^{N} p(l_i \mid s_i, l_{i-1}, s_{i-1}),    (8)

where the lengths of the segments l_i are assumed to be conditionally independent. The introduction of the additional variable l makes the Seg-HMM slightly more complex than a plain HMM, since this variable must be inferred as well.
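To make Equations (6) and (8) concrete, here is a minimal sketch that scores one fixed segmentation. The per-segment density segment_loglik and the duration model duration_logprob are hypothetical callables, and for simplicity each length is conditioned on its own state only.

```python
import numpy as np

def segmentation_loglik(obs, states, lengths, segment_loglik, duration_logprob):
    """log p(o_{1:T}, l_{1:N} | s_{1:N}) for one segmentation, Equations (6) and (8).

    obs              : (T, D) sequence of feature vectors
    states           : list of N segment states s_i
    lengths          : list of N segment lengths l_i (summing to T)
    segment_loglik   : function (segment, state, length) -> log p(o_seg | l, s)
    duration_logprob : function (length, state) -> log p(l | s)
    """
    total, t = 0.0, 0
    for s, l in zip(states, lengths):
        # One independent segment: its emission term times its duration term.
        total += segment_loglik(obs[t:t + l], s, l) + duration_logprob(l, s)
        t += l
    return total
```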

Various approaches exist for modelling observation dependency inside a segment. They can be classified into two families, the stationary and non-stationary models []. We will concentrate on the non-stationary family, because it contains all the stationary models as a special case and it is the closest to our ultimate model. The basic idea is to condition the PDF of the visible variable on an additional hidden variable h which encodes some prior knowledge that one has about the dynamics of the observed segment. For example, Digalakis [] models the evolution of the feature vectors with a stochastic linear dynamical system of the form:

    h_t = A(s_i) h_{t-1} + \eta^h(s_i)    with    \eta^h(s_i) \sim \mathcal{N}(\mu^h(s_i), \Sigma^h(s_i))
    o_t = B(s_i) h_t + \eta^v(s_i)        with    \eta^v(s_i) \sim \mathcal{N}(\mu^v(s_i), \Sigma^v(s_i))    (9)

where A(s_i) is a transition matrix that encodes the dynamics and B(s_i) is a projection matrix. The DBN representation of such a non-stationary Seg-HMM is depicted in Figure 2.

Figure 2: DBN representation of a segmental HMM. The segment number is symbolised by i and its length by l_i. s is a discrete hidden variable, h is a continuous hidden variable and o is a continuous visible variable.

Note that since the continuous hidden variable is local to a segment, the emission probability is easily obtained by integrating out the hidden variable:

    p(o_{t_i : t_i + l_i} \mid l_i, s_i) = \int_{h_{1:l_i}} p(h_1 \mid l_i, s_i) \prod_{t=1}^{l_i} p(o_{t_i + t} \mid h_t, l_i, s_i) p(h_t \mid h_{t-1}, l_i, s_i).    (10)

There are two drawbacks to such an approach. First, although continuity is efficiently modelled inside a segment, discontinuities are still present at segment boundaries. In this respect the Seg-HMM represents only a step towards a (in our view) more natural model, for which the continuous hidden variable, instead of being integrated out, is kept across segment boundaries. Second, the use of feature vectors complicates the encoding of knowledge that we have about the dynamics of the observations. We will see in Section 3 that structures like the harmonic stacking can be very easily encoded into the model if the speech waveform is used instead of feature vectors. Note that a similar approach has also been envisaged in [], where a Seg-HMM is used for pitch tracking, voiced/unvoiced detection and timescale modification of the speech signal.
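Before moving on, a small sketch of the segment-level dynamics of Equation (9): it simply draws one segment of feature vectors for a given state s_i, with parameter names chosen for the example rather than taken from the proposal.

```python
import numpy as np

def sample_segment(A, B, mu_h, Sigma_h, mu_v, Sigma_v, h0, length, rng):
    """Draw one segment of feature vectors from the dynamics of Equation (9)."""
    h, observations = h0, []
    for _ in range(length):
        h = A @ h + rng.multivariate_normal(mu_h, Sigma_h)                 # hidden trajectory
        observations.append(B @ h + rng.multivariate_normal(mu_v, Sigma_v))  # noisy projection
    return np.stack(observations)

# Example usage with arbitrary, purely illustrative parameters:
# rng = np.random.default_rng(0)
# segment = sample_segment(np.eye(2), np.eye(2), np.zeros(2), 0.1 * np.eye(2),
#                          np.zeros(2), 0.01 * np.eye(2), np.zeros(2), 20, rng)
```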

2.3 Switching Autoregressive HMMs

The autoregressive HMM [] and its recent extension, the switching autoregressive HMM (SAR-HMM) [6], are two well-known attempts to model the speech waveform directly. The latter model assumes that the sequence of samples can be represented by a switching autoregressive process, i.e., the sample v_t at each time step can be predicted by a stochastic linear equation of the form:

    v_t = \sum_{r=1}^{R} c_r(s_t) v_{t-r} + \eta(s_t)    with    \eta(s_t) \sim \mathcal{N}(0, \sigma^2(s_t))    (11)

where c_r(s_t) is an autoregressive coefficient and \sigma^2(s_t) the variance of the innovation \eta(s_t). The DBN representation of such a model is shown in Figure 3.

Figure 3: DBN representation of an AR-HMM. The time is symbolised by t, the discrete state variable by s and the continuous visible variable by v.

As can be seen by comparing Figures 1 and 3, the only difference between an HMM and a SAR-HMM is the dependency between the observations at various time steps. The interesting thing about the SAR-HMM approach is that the parameters of the AR process are controlled by the discrete variable. Contrary to the HMM, where no continuity exists between the observations, the dynamics produced by the SAR-HMM is still influenced by the discrete variable, but smoothness is conserved. In this respect the SAR-HMM is considerably more powerful than the Seg-HMM, where the continuity is lost when the state changes.

A drawback of Equation (11) is that the predicted observation v_t depends on the previous noisy observations. This is unfortunate since, in situations where considerable noise is present in the observations, the prediction will be poor. An alternative way to encourage observation continuity and smoothness is to use a continuous hidden variable h, forming an autoregressive process in the hidden space. The sample v_t can then be made to depend only on the cleaner hidden variable. Predictions in noisy environments should thus be inherently more robust, since the observation is seen as a corrupted version of a clean, constrained and controlled hidden dynamics. This can be modelled and indeed motivates our interest in the SLDS, as described in Section 3.
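A minimal sketch of drawing a waveform from the switching AR process of Equation (11), given a fixed switch path; the array shapes are assumptions made for the example.

```python
import numpy as np

def sample_sar(switch_path, coeffs, variances, rng):
    """Draw a waveform from the switching AR process of Equation (11).

    switch_path : (T,)   discrete switch states s_t
    coeffs      : (S, R) AR coefficients c_r(s), coeffs[s, r-1] = c_r(s)
    variances   : (S,)   innovation variances sigma^2(s)
    """
    R = coeffs.shape[1]
    v = np.zeros(len(switch_path) + R)       # R leading zeros as initial context
    for t, s in enumerate(switch_path, start=R):
        past = v[t - R:t][::-1]              # v_{t-1}, ..., v_{t-R}
        v[t] = coeffs[s] @ past + rng.normal(0.0, np.sqrt(variances[s]))
    return v[R:]
```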

2.4 Explicit State Duration

Another weakness of HMMs is their failure to effectively model phone duration. Indeed, due to the Markov assumption, the probability of staying in the same phone for n consecutive time steps is given by a geometric distribution, whereas the empirical phone duration distribution is rather Poisson or Gaussian [, 5, 6]. (We are assuming here that no extra time-counting hidden variables are introduced; it is possible to obtain more realistic phone durations using temporal counters, but the vast increase in the number of hidden states is prohibitive.) State-of-the-art ASR systems partially resolve this limitation by modelling a phone with two or three states with identical emission PDF. Although this works quite well in practice, this is a clear limitation of the model and various solutions have been proposed [7,, 5, 6]. Explicit state duration is not used in state-of-the-art ASR systems, possibly because it generally requires an additional hidden variable and the computational overhead introduced is not deemed to be worth the potential performance improvement. Explicit state duration is nevertheless a mandatory component of the Seg-HMM, since the length of the segment emitted by a state must be known a priori: each state is associated with a hidden variable which conditions the length of the generated sequence. This approach has the benefit of being completely generic, since any arbitrary PDF can be used to model duration. A drawback, however, is that the length of a segment is given a priori and thus cannot be influenced by the hidden variable. For example, if the hidden variable is used to model a linear trajectory, to preserve continuity it could be useful to allow segment boundaries to occur only when the trajectory is at a specific position. This approach has been used for example in [] to model speech waveforms with a Seg-HMM. To reduce the number of possible segmentations, they only consider those where segments begin and end on zero crossings of the signal.

3 Proposed Solution

The previous section showed how the SAR-HMM and the Seg-HMM are limited in the way they can model the discrete and continuous nature of speech at the same time. We want to go a step further by formulating a model which combines the continuous hidden variable found in the non-stationary Seg-HMM with the discrete switching process found in the SAR-HMM. Contrary to the Seg-HMM, we prefer to model the speech waveform directly since this maintains the consistency of the model and allows straightforward encoding of structures like the harmonic stacking. The proposed model belongs to the class of Switching Linear Dynamical Systems (SLDSs), whose DBN representation is shown in Figure 4.

Figure 4: DBN representation of an SLDS. The time is symbolised by t, the discrete state variable by s, the continuous state variable by h and the continuous visible variable by v.

This model is fundamentally different from the previously discussed SAR-HMM in the sense that the knowledge that one has about the dynamics of the signal is encoded into a continuous hidden variable and not as a correlation between the samples.

The resulting model should thus be far less sensitive to noise than the SAR-HMM. It is also different from the Seg-HMM since the continuous hidden variable is not local to a segment, but is shared between all time steps. Consequently, the continuous hidden variable cannot be simply integrated out like in a Seg-HMM, but needs to be inferred along with the discrete switch variable. This has the negative effect of making inference intractable. Fortunately, inference can still be quite efficiently approximated thanks to various approximation algorithms that we will present in Section 4.1.

Formally speaking, the SLDS, as shown in Figure 4, defines the following joint distribution:

    p(v_{1:T}, s_{1:T}, h_{1:T}) = \prod_t p(v_t \mid s_t, h_t) p(h_t \mid h_{t-1}, s_t, s_{t-1}) p(s_t \mid s_{t-1}, h_{t-1}).    (12)

The continuous transition PDF p(h_t \mid h_{t-1}, s_t, s_{t-1}) is the main novelty introduced by the SLDS. This term actually contains the knowledge that we have of the dynamics of the signal, by assuming that h_t follows a stochastic linear equation of the form:

    h_t = A(s_t, s_{t-1}) h_{t-1} + \eta^h(s_t, s_{t-1})    (13)

with

    \eta^h(s_t, s_{t-1}) \sim \mathcal{N}(\mu^h(s_t, s_{t-1}), \Sigma^h(s_t, s_{t-1}))    (14)

where the transition matrix A(s_t, s_{t-1}) encodes the dynamics and \eta^h(s_t, s_{t-1}) is a source of innovation. For example, a single cosine wave with frequency \omega can be obtained by setting

    A(s_t, s_{t-1}) \equiv \Theta(\omega) = \begin{pmatrix} \cos(\omega) & \sin(\omega) \\ -\sin(\omega) & \cos(\omega) \end{pmatrix}    (15)

in Equation (13). The continuous hidden variable can then be interpreted as a two-dimensional rotating vector. This Fourier type representation is particularly attractive, because it is very handy for encoding prior knowledge that we have about the speech signal, like for example the harmonic stacking structure of a phoneme. Those structures contain a certain number of particularly active frequencies whose dynamics carry information about what is actually spoken. Therefore, being able to efficiently track those frequencies may significantly improve the recognition accuracy. For that we can define a block diagonal transition matrix of the form:

    A(s_t) = \begin{pmatrix} \rho_1(s_t)\Theta(s_t, \omega_1) & & & \\ & \rho_2(s_t)\Theta(s_t, \omega_2) & & \\ & & \ddots & \\ & & & \rho_N(s_t)\Theta(s_t, \omega_N) \end{pmatrix}    (16)

where \omega_1, ..., \omega_N are N frequencies that we want to track and \rho_1, ..., \rho_N are N damping factors. The harmonic stacking effect can then be achieved by constraining the frequencies \omega_2, ..., \omega_N to be pre-defined multiples of the base frequency \omega_1.
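A minimal sketch, assuming the plain harmonic series \omega_k = k\omega_1, of how the rotation blocks of Equation (15) and the block-diagonal matrix of Equation (16) could be assembled; it is an illustration, not the proposal's code.

```python
import numpy as np
from scipy.linalg import block_diag

def rotation_block(omega):
    """Theta(omega) of Equation (15): a 2x2 rotation advancing the phase by omega."""
    c, s = np.cos(omega), np.sin(omega)
    return np.array([[c, s], [-s, c]])

def harmonic_transition(omega1, dampings):
    """Block-diagonal A of Equation (16) for the harmonics k * omega1, k = 1..N.

    omega1   : base (fundamental) angular frequency, in radians per sample
    dampings : (N,) damping factors rho_1, ..., rho_N, one per harmonic
    """
    blocks = [rho * rotation_block(k * omega1)
              for k, rho in enumerate(dampings, start=1)]
    return block_diag(*blocks)
```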

If we furthermore assume that some relation exists between the damping factors \rho_i(s_t) (for example, in the case of a vibrating string, the damping factors of the harmonics scale approximately geometrically with respect to that of the fundamental frequency [9]), then the number of free parameters (those which will be learned) is considerably reduced. By making these free parameters dependent on the switch state s_t, it is then possible to represent different kinds of dynamics which nevertheless possess the same underlying harmonic stacking structure. For example, there is a difference between the fundamental frequencies of a male and a female speaker, or between the characteristic frequencies of different phonemes. That kind of approach has been successfully used in music transcription [, ] and, we think, may be useful for speech recognition as well.

The emission distribution p(v_t \mid s_t, h_t) specifies the transformation that must be applied to the continuous hidden variable in order to generate the waveform. Similarly to Equation (13), v_t is assumed to follow a stochastic linear equation of the form:

    v_t = B(s_t) h_t + \eta^v(s_t)    with    \eta^v(s_t) \sim \mathcal{N}(\mu^v(s_t), \Sigma^v(s_t))    (17)

where B(s_t) is the projection matrix and \eta^v(s_t) a source of noise. The projection in our case is straightforward since it reduces to summing all the cosine components of the hidden variable. This is achieved by the following 1 x 2N matrix:

    B = ( 1  0  1  0  \cdots  1  0 ).    (18)

Contrary to the SAR-HMM, where the innovation and the noise are mixed together, a SLDS clearly separates the innovation process from the noise process. The benefit of this approach is considerable when the environmental conditions are not well controlled, because it is possible, by adapting \eta^v, to choose the level of noise that is filtered out of the signal before the latter is processed by the hidden layer.

The discrete transition distribution p(s_t \mid s_{t-1}, h_{t-1}) specifies the dynamics of the model parameters, i.e., the hidden continuous dynamics can switch between various regimes whose parameters depend on the value of the discrete switch variable s. The state of this switch variable is what we are ultimately interested in, because it represents a particular thing that we want to recognise, like for example a phoneme or a gender. A novelty of our model is the conditioning of the discrete transition distribution on the state of the continuous hidden variable. This additional link is particularly useful for modelling state duration, for example by forbidding transitions as long as h is not close to zero.

Finally, similarly to the HMM, recognition is achieved by finding the most likely sequence of states \hat{s}_{1:T}, given the sequence of samples v_{1:T}:

    \hat{s}_{1:T} = \arg\max_{s_{1:T}} p(s_{1:T} \mid v_{1:T})    (19)

where p(s_{1:T} \mid v_{1:T}) is obtained by integrating over all possible hidden trajectories:

    p(s_{1:T} \mid v_{1:T}) = \int_{h_{1:T}} p(s_{1:T}, h_{1:T} \mid v_{1:T}).    (20)
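To tie Equations (12), (13), (17) and (18) together, here is a minimal generative sketch. It simplifies the switch transition to p(s_t | s_{t-1}) (ignoring the conditioning on h_{t-1}) and uses isotropic noise, so it illustrates the structure rather than reproducing the proposed model.

```python
import numpy as np

def sample_slds(T, A, B, switch_trans, switch_prior, innov_std, obs_std, rng):
    """Generate v_{1:T} from an SLDS in the spirit of Equation (12).

    A            : list (or dict) mapping switch state s -> transition matrix A(s)
    B            : (1, H) projection matrix of Equation (18)
    switch_trans : (S, S) switch transition matrix p(s_t | s_{t-1})
    switch_prior : (S,)   switch prior p(s_1)
    """
    S = len(switch_prior)
    H = B.shape[1]
    s = rng.choice(S, p=switch_prior)
    h = rng.normal(size=H)                                   # arbitrary initial hidden state
    samples, switches = [], []
    for _ in range(T):
        h = A[s] @ h + rng.normal(0.0, innov_std, size=H)    # Equation (13), isotropic innovation
        v = (B @ h).item() + rng.normal(0.0, obs_std)        # Equation (17), scalar waveform sample
        samples.append(v)
        switches.append(s)
        s = rng.choice(S, p=switch_trans[s])                 # move to the next regime
    return np.array(samples), np.array(switches)
```

The transition matrices A[s] could, for instance, be built with the harmonic_transition sketch above, and B would then be the row vector [1, 0, 1, 0, ...] of Equation (18).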

It is important to note that, although exact decoding and training are formally possible, the number of computations required grows exponentially with the number of samples considered; this is explained in Section 4.1.1. A second problem is that current approximation techniques tend to be numerically unstable, which prevents their use on long sequences (more than a thousand time steps). Fortunately, work done during the preparation of this proposal led to the development of the expectation correction (EC) algorithm [], a new stable approximation algorithm for doing inference in SLDSs. Furthermore, training of SLDSs with the maximum likelihood algorithm has already been tested on real data []. We also wish to apply discriminative training methods to SLDSs, since they may be expected to improve recognition performance.

4 Preparatory Work

The complexity of a model is measured by the complexity of the corresponding inference process. Doing inference on a SAR-HMM is not more difficult than with a plain HMM. In a Seg-HMM the inference process is slightly more complicated since there are two additional hidden variables. It nevertheless stays tractable, because the additional continuous variable is integrated out, while the second one, the length of the segment, is discrete. With a SLDS, however, the complexity of the inference process is exponential in the number of samples considered. Given the fundamental role played by the inference process during training, it is therefore extremely important to have an accurate and stable approximation algorithm. The following section describes a new approximation technique that we developed and compares its performance to other well-known approximation techniques.

An important aspect of our proposed model is that, contrary to the SAR-HMM, it makes a clear distinction between the actual process that generates the signal and the noise that gets added to it. In Section 3 we claimed that this approach is particularly useful when the environmental conditions change, because the noise is filtered out from the signal before it reaches the hidden layer. To demonstrate that point, we implemented the SAR-HMM and compared its performance with that of an autoregressive SLDS (AR-SLDS). Details about the experiments that we carried out as well as the results are given in Section 4.2.

4.1 Approximate Inference in SLDSs

4.1.1 The Problem

Inference in SLDSs, like in the SAR-HMM and the Seg-HMM, is generally done in two steps. The first one proceeds forward in time and computes, for each time t, a message \rho(s_t, h_t) that carries information coming from the past. The second one proceeds backward in time and computes, for each time t, a message \lambda(s_t, h_t) that carries information coming from the future. The forward and backward messages are then combined to obtain the joint posterior distribution of the hidden variables:

    p(s_t, h_t \mid v_{1:T}) = \rho(s_t, h_t) \lambda(s_t, h_t).    (21)

The forward message, also called the filtered estimate, is computed by the following recursive equation:

    \rho(s_t, h_t) \propto p(v_t \mid s_t, h_t) \sum_{s_{t-1}} \int_{h_{t-1}} p(h_t \mid h_{t-1}, s_t) p(s_t \mid s_{t-1}) \rho(s_{t-1}, h_{t-1}).    (22)

A problem with this expression is that, since the initial message \rho(s_1, h_1) is a mixture of S Gaussians (S is the number of switch states), due to the sum over s_{t-1} the next message will be a mixture of S^2 Gaussians. One then easily deduces that the number of mixture components at time t will be S^t and therefore the number of computations that need to be carried out at each time step will grow exponentially. A solution to this problem is to collapse the mixture obtained at each time step to a mixture with fewer components. This corresponds to the so-called Gaussian Sum Approximation [], which is a form of Assumed Density Filtering (ADF) [].

Contrary to the forward message, which is equivalent to p(s_t, h_t \mid v_{1:t}), the backward message does not correspond to a probability distribution, and it is unclear how one should perform the collapse. The Expectation Propagation (EP) algorithm [] addresses this issue by defining the backward message as the division of the posterior distribution by the forward message:

    \lambda(s_t, h_t) = \frac{\sum_{s_{t+1}} \int_{h_{t+1}} p(s_t, h_t, s_{t+1}, h_{t+1} \mid v_{1:T})}{\rho(s_t, h_t)}.    (23)

The main drawback of such an approach is that, because we do not know how to divide by a mixture of Gaussians, this latter expression can only be computed if the forward message is collapsed to a single Gaussian. Note that a similar formula is used to find the forward message and thus implies the same constraint on the backward message. Another weakness of EP is its sensitivity to numerical instabilities. In order to improve on that point we devised a new formulation of the backward pass which, thanks to an auxiliary-variable trick that will not be detailed here, allows the backward message to be considered as a probability. As we will see later, the new derivation will prove to be far more stable than the original one.

Being forced to restrict ourselves to approximating messages by single Gaussians is a bit frustrating, because it seems intuitively clear that, in the cases where a high dimensional continuous hidden variable gets projected to a comparatively low dimensional visible variable, multi-modalities in the distribution of h are likely to occur. Indeed, knowing the value of the visible variable may not provide enough information to precisely infer the value of the hidden variable. To cope with that problem we developed a new approximation algorithm called Expectation Correction, where the backward pass directly computes the posterior distribution p(s_t, h_t \mid v_{1:T}) by correcting the forward estimation, hence the name of the algorithm. The recursion for the backward message is given by:

    \lambda(s_t, h_t) \propto \sum_{s_{t+1}} \int_{h_{t+1}} p(s_t, h_t, s_{t+1}, h_{t+1} \mid v_{1:T})    (24)
                      \approx \sum_{s_{t+1}} \int_{h_{t+1}} p(h_t \mid s_t, s_{t+1}, v_{1:T}) p(s_t \mid s_{t+1}, h_{t+1}, v_{1:t}) \lambda(s_{t+1}, h_{t+1}).

(The interested reader may refer to [6] for a detailed explanation.)
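Both the forward collapse of Equation (22) and the backward recursions above rely on reducing a Gaussian mixture to fewer components. A minimal moment-matching sketch of the single-Gaussian case, written here as a generic illustration of the ADF-style collapse rather than as the proposal's implementation, is:

```python
import numpy as np

def collapse_to_single_gaussian(weights, means, covs):
    """Moment-match a Gaussian mixture to a single Gaussian.

    weights : (K,)      mixture weights
    means   : (K, D)    component means
    covs    : (K, D, D) component covariances
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    mean = np.einsum('k,kd->d', w, means)
    diff = means - mean
    # Total covariance = average within-component covariance + between-component spread.
    cov = np.einsum('k,kij->ij', w, covs) + np.einsum('k,ki,kj->ij', w, diff, diff)
    return mean, cov
```

Collapsing to a mixture of more than one Gaussian can be done along the same lines, for example by retaining the heaviest components and moment-matching the remainder.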

Similarly to Equation (22), this expression defines a mixture of Gaussians whose number of components grows exponentially with time and which can be collapsed to a mixture with a fewer number of components. In order to evaluate this latter expression two other approximations are required: (a) dropping of a dependence, p(h_{t+1} \mid s_t, s_{t+1}, v_{1:T}) \approx p(h_{t+1} \mid s_{t+1}, v_{1:T}); (b) approximation of the integral of p(s_t \mid s_{t+1}, h_{t+1}, v_{1:t}) with respect to \lambda(s_{t+1}, h_{t+1}). Compared to the Generalised Pseudo Bayes (GPB) method, which discards all future information by replacing (b) with p(s_t \mid s_{t+1}, v_{1:t}) \lambda(s_{t+1}, h_{t+1}), the approach taken by EC is more conservative. Indeed, as we will see, simply evaluating the distribution p(s_t \mid s_{t+1}, h_{t+1}, v_{1:t}) at the mean of \lambda(h_{t+1} \mid s_{t+1}) is sufficient to improve on GPB.

Other popular approaches to filtering and smoothing are based on sequential Monte Carlo [5]. Whilst potentially powerful, these non-analytic methods typically suffer in high-dimensional latent/hidden spaces since they are often based on naive importance sampling, which restricts their practical use. Implementations of Rao-Blackwellisation (see for example []) may not help in difficult problems where the continuous posterior is highly non-Gaussian, and we are unaware of methods that have addressed this.

4.1.2 Experiments

We would like to test our Expectation Correction smoothing method in a problem with a reasonably long temporal sequence. An obvious difficulty arises here in that, since the exact computation is exponential in the number T of samples considered, a formally exact evaluation of the method is infeasible. A reasonable approach under these circumstances is to suppose that the generated switch states will be close to the most probable state of the true posterior p(s_t \mid v_{1:T}). That is, we sample a hidden state s_1 and h_1 from the prior, and then a visible observation v_1. Then, sequentially, we generate hidden states and visible observations for the next time steps. The task for smoothing inference is, given only the parameters of the model and the visible observations (but not any of the hidden states h_{1:T} and s_{1:T}), to infer p(h_t \mid v_{1:T}) and p(s_t \mid v_{1:T}). A simple performance measure is to assume that the original sampled states s_{1:T} are the correct inferences, and to check how our most probable posterior smoothed estimates \arg\max_{s_t} p(s_t \mid v_{1:T}) compare with the assumed correct s_t. The reader should bear in mind, of course, that this is just a tractable surrogate for comparing our estimate of p(s_t \mid v_{1:T}) with the exact value of p(s_t \mid v_{1:T}).

We look at two sets of experiments, one relatively easy and the other relatively hard, both on time series of length T with S switch states. In both sets of experiments, we compared methods using a single Gaussian and methods using multiple Gaussians. The number of Gaussians used was set to S throughout. From the viewpoint of classical signal processing, both experiments are extremely difficult in the sense that they cannot be solved by short-time Fourier methods, since changes occur in the dynamics at a much higher rate than the typical frequencies in the signal; see Figure 5. In the easy experiments, we used a small hidden dimension H, with a moderate amount of transition and observation noise. As can be seen from Figure 6a, Particle Filtering performs reasonably well, although its performance is enhanced by Rao-Blackwellisation (RBPF). Assumed Density Filtering using a single Gaussian and with a mixture performed roughly the same as RBPF, as did the methods

based on Generalised Pseudo Bayes, using either the ADF single Gaussian results or the Gaussian mixture results. A standard implementation of Expectation Propagation, even in this case, suffers from many numerical stability problems, but is improved somewhat by our own more stable implementation. The Expectation Correction method using a single Gaussian dramatically improves on the ADF single Gaussian filtered estimate. Using a small number of mixture components in Expectation Correction improves the situation further.

Figure 5: Results on a typical example from our hard problem for the methods of Expectation Propagation (EP), Assumed Density Filtering using a mixture of four Gaussians (ADFM), Expectation Correction using a Single Gaussian (ECS), and Expectation Correction using a mixture of four Gaussians (ECM). Plotted is the one dimensional visible signal, with a marker coloured by the most probable posterior estimated switch variable, which can be one of two states. A cross indicates a switch variable inference error. Only methods which have mixture representations of the posterior succeed; indeed, the ECM method gives no errors.

In the hard case, Figure 6b, we used a larger hidden dimension H, with a small amount of transition noise and a large amount of observation noise. We chose these parameters since this will most likely result in highly multi-modal posteriors. In this case, only those methods that used a mixture of Gaussians performed well; otherwise, the methods were little better than random guessing. Expectation Correction with a small number of mixture components gives dramatically better, almost perfect performance, apart from a small number of errors. Readers interested in Particle Filters may wonder why Rao-Blackwellisation does not seem to perform well. Our explanation is that the standard implementation we used [] still makes the assumption that a single Gaussian is adequate to describe the posterior filtered estimate p(h_t \mid s_t, v_{1:t}). In our hard experiment, any method which does not deal with multi-modality of the posterior is doomed.

Thanks to EC, inference in SLDSs can now be efficiently and correctly carried out, thus greatly improving the quality of training and decoding as well. Indeed, stability and accuracy are essential if one wants to apply a SLDS to real data, since speech utterances, even downsampled to 8kHz, contain more than ten thousand samples.

Figure 6: Histograms of the number of errors over experiments, i.e., the x-axis gives the number of errors and the y-axis the number of experiments that resulted in a certain number of errors. Particle Filter (PF), Rao-Blackwellised Particle Filter (RBPF), Assumed Density Filtering with a Single Gaussian (ADFS), Generalised Pseudo Bayes (GPB), Assumed Density Filtering with Multiple Gaussians (ADFM), Generalised Pseudo Bayes with Multiple Gaussians (GPBM), Expectation Propagation (EP), our implementation of Expectation Propagation (EP), Expectation Correction Smoothing with a Single Gaussian (ECSS), Expectation Correction Smoothing with Multiple Gaussians (ECSM).

Unfortunately, training a SLDS is still difficult, partly because of the high number of parameters involved and the sensitivity to initial conditions, but mainly because of numerical instabilities that are still present in the update process. One of the goals of this thesis is to improve the training of SLDSs.

4.2 Experiments on Noise Robustness

Of the two models presented in Section 2, the SAR-HMM is the one which is the closest to a SLDS. Indeed, by getting rid of the feature vectors, the SAR-HMM is, in our view, the most elaborate attempt so far to build a consistent model that tries to capture the inherent continuity of the speech signal while retaining the idea of an underlying discrete switching process. However, as mentioned in Section 2.3, in a SAR-HMM the predicted observation depends exclusively on the previous noisy observations. Therefore, if the noise level in the test set is different from that of the training set, the performance may drop significantly. On the contrary, the performance of a SLDS should not be much affected by the level of noise because this is effectively filtered out before it reaches the hidden layer.

In order to demonstrate this point we first implemented the SAR-HMM described in [6] and trained one model for each of the digits of the TI-DIGITS database. The training set for each digit was composed of utterances downsampled to 8kHz, the models were composed of ten states with a left-right transition matrix, and the AR processes were of fixed order. The models were then evaluated on a test set composed of utterances of each digit, and recognition was simply achieved by selecting the model with the highest likelihood. The models were initialised using uniform segmentation; Viterbi alignment has also been tested, but did not bring any improvement.

As mentioned in the previous section, efficient training of SLDSs is still a subject of research. For this reason, instead of training a SLDS, we simply embedded the trained SAR-HMM into the hidden layer of a SLDS. The resulting model, called AR-SLDS, has the benefit of separating the part of the signal that comes from the innovation from the part that comes from the noise. Indeed, the hidden variance (which in this case comes from the trained SAR-HMM) represents the amount of innovation that is introduced into the system at each time step, while the visible variance represents the amount of noise that gets added onto the resulting clean hidden signal. We therefore expect that, by setting the visible variance to the actual noise variance, the performance of the AR-SLDS will be roughly the same as that of the SAR-HMM tested on the clean utterances. For the purpose of the demonstration we set the noise variance by hand in our experiments, although, ultimately, we intend to learn the noise level automatically.

In order to compare our results with those obtained by a standard, feature-based HMM, we trained an HMM on the same training set that we used for the SAR-HMM. We chose the same setup as that used to get the baseline performance on the Aurora database, namely 8 states, a left-right transition matrix, mixtures of Gaussians and 9 MFCC features including first and second temporal derivatives as well as energy.
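For reference, a minimal sketch of how a noisy test condition like those below could be produced. The actual experiments used the NIST stnr program to estimate the SNR; the simple empirical ratio computed here is only an illustrative stand-in.

```python
import numpy as np

def corrupt(clean, noise_variance, rng):
    """Add zero-mean Gaussian noise of a given variance to a clean waveform and
    return the noisy signal together with a simple empirical SNR estimate in dB."""
    noise = rng.normal(0.0, np.sqrt(noise_variance), size=clean.shape)
    snr_db = 10.0 * np.log10(np.mean(clean ** 2) / np.mean(noise ** 2))
    return clean + noise, snr_db
```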

The recognition accuracies achieved by the three models are shown in Table 1.

Table 1: Comparison of the recognition accuracies of the HMM, the SAR-HMM and the AR-SLDS for various levels of Gaussian noise. The SNR was estimated with the NIST stnr program.

Since we used a generic SLDS code for those experiments, the time required to evaluate the whole test set with this model was particularly long; we therefore built a different test set for each level of noise by randomly selecting utterances per digit. The HMM is quite robust, since it is still able to reach 95% with added noise of variance 6, while the performance of the SAR-HMM is already close to that of a random guess, i.e., 9.%. However, as soon as the noise variance gets above 6 the performance of the HMM drops very quickly. Indeed, the robustness of the HMM mainly depends on the efficiency of the feature extraction, which is done regardless of the model. Once the features have been computed, the HMM only serves as a means to calculate the likelihood of the corresponding sequence of observations. With the AR-SLDS, the robustness comes from the additional continuous hidden variable which encodes the knowledge that we have about the clean speech signal, and the most likely filtered signal is computed along with the most likely state sequence. As expected, the AR-SLDS is less sensitive to changes in noise conditions compared to the SAR-HMM and is even more robust than the HMM at high variances.

It is important to notice that, due to the embedding of the SAR-HMM, the performance of the AR-SLDS is bounded above by that of the SAR-HMM. Indeed, an SAR-HMM accuracy of 98.6% is reported in [6], but we have not been able to reproduce this performance. Although we are lacking results to support this, we expect that a better SAR-HMM performance would improve the accuracy of the AR-SLDS as well.

Another interesting fact concerning the AR-SLDS is that, for some values of the noise variance, the accuracy is significantly improved compared to that obtained on the clean utterances. This improvement can be explained by looking at Figure 7, where the noise correlation of the original signal is compared to that of some corrupted signals. As one can see, as soon as the variance of the additional noise is of the same order as that of the original correlated noise, the resulting noise looks completely uncorrelated. This effect is particularly useful when an AR-SLDS is used because that uncorrelated noise is implicitly and efficiently removed from the signal. Therefore the signal seen by the hidden layer is cleaner than that

seen when a lower level of noise is used, and hence the better performance.

Figure 7: Correlation of two consecutive samples of a silence utterance which has been corrupted by different levels of noise.

Of course, when the level of noise is too high, the accuracy decreases because the filtering throws away parts of the signal that are useful for recognition. To illustrate that effect, we compare in Figures 8 and 9 the power spectra of the corrupted signal with those of the filtered signal, i.e., the signal seen by the hidden layer. For the lowest noise variances the correlated noise is still dominant and the SAR-HMM performs as well as the AR-SLDS. When the noise variance reaches 9, the noise correlation is lower, but still quite close to what it is in the training set; the performance of the SAR-HMM is still fine, while that of the AR-SLDS increases because the noise starts to be filtered out. With a noise variance of 8, the signal is clearly degraded and a significant drop in the SAR-HMM performance is seen. On the contrary, the performance of the AR-SLDS is even better, because the uncorrelated noise can be efficiently filtered out, leaving a clean filtered signal which is easily recognised by the embedded SAR-HMM. Finally, as the noise variance further increases, the performance of the AR-SLDS decreases because it becomes more and more difficult to remove the noise and to reconstruct the signal. The accuracy of the AR-SLDS is nevertheless still quite high compared to that of the SAR-HMM.

These experiments demonstrate that using a continuous hidden variable is profitable from the point of view of noise robustness since, although the SLDS was not trained directly, a simple embedding of a trained SAR-HMM into an

SLDS gave better results than the SAR-HMM alone and, for high noise variance, better than the HMM. This demonstrates that, once the model has been trained, the variance of the visible noise can be automatically adapted to match new environmental conditions that did not exist in the training set. This is particularly useful because, instead of having to train a model on a lot of different noise conditions, we have a versatile model that can adapt itself to unknown situations. The accuracy of the model could be further improved by introducing more specific knowledge that we have about the structure of the speech signal. Indeed, the SAR-HMM is a rather generic model for which encoding of structures like harmonic stacking is cumbersome. A Fourier type representation is more adequate in this case, and motivates the model proposed in Section 3.

5 Summary

The work done during the thesis will hopefully be beneficial to at least two fields: speech recognition and machine learning. We are aware that our proposed approach to speech recognition is quite different to what is generally found in the literature, but we nevertheless think that it is an approach worth trying, for the following reasons. Current HMM-based speech recognition systems, by including feature vectors which encompass time derivatives, are inconsistent. Although it has been recently suggested that this can be corrected by explicitly modelling the relationship between static and dynamic features [8, 7, 8], we think that this approach makes the modelling of the continuous component of the speech signal more difficult than it actually is. We also think that, in order to improve speech recognition, it is necessary to introduce as much knowledge as we can about the dynamics of the signal. For this reason, we choose to model the speech waveform directly by means of a Fourier type representation, which makes the encoding of structures like the harmonic stacking straightforward. Although this approach has already been used for music transcription, its application to speech recognition is, as far as we know, totally new. We tried to show with the preparatory work that evidence of a potential improvement in recognition accuracy exists. The ultimate goal of the thesis is indeed to demonstrate that state-of-the-art speech recognition can be improved by a better understanding and modelling of the continuous component of speech.

The thesis will also be beneficial to the field of machine learning. For example, the recent development of the EC algorithm is a major step towards more stable and accurate training algorithms for SLDSs. In that respect, the work done during the preparation of this proposal should lead to the publication of at least one journal paper and is already described in various research reports [, 5, 6]. More generally, the research done throughout the thesis will be mainly centred around the improvement of state-of-the-art techniques in speech recognition as well as in machine learning. It is thus expected to lead to various publications in the best journals and conferences in those fields.

Figure 8: (a) Original power spectra of a "one" utterance corrupted by various levels of noise, and (b) power spectra of the corresponding filtered signal.

Figure 9: (a) Original power spectra of a "one" utterance corrupted by various levels of noise, and (b) power spectra of the corresponding filtered signal.

6 Research Plan

This section describes a number of tasks that have to be carried out during this thesis.

6.1 Task 1: AR-HMM vs SLDS

The goal of this task is to compare the recognition performance of the SAR-HMM [6] and an SLDS on single-digit recognition. Contrary to what was presented in Section 4.2, the SLDS will not be limited to an AR-SLDS and will be fully trained. Its structure will be similar to the DBN shown in Figure 4, but augmented with a discrete variable to model the deterministic counter used in the SAR-HMM. The SLDS will first be trained and tested on the single digit utterances of the TI-DIGITS database in order to compare the results with those obtained by Ephraim and Roberts. As previously mentioned, our implementation of the SAR-HMM does not reach the same accuracy as that reported in [6], and we will thus also have to analyse the reasons for this deficiency. Other tests will also be done on the whole set of utterances of the TI-DIGITS database and on the Numbers database as well. The results are expected to demonstrate that modelling dynamics at the hidden space level is more robust than what is done in a SAR-HMM. If this assumption is confirmed, noise robustness will eventually be tested on the AURORA task.

6.2 Task 2: Explicit State Duration Modelling

The goal of this task is to evaluate various explicit state duration models. The deterministic counter used by the SLDS of Task 1 will be replaced by an explicit state duration distribution to model explicitly temporal dependencies over potentially several hundred milliseconds. The new SLDS will then be trained and tested on the same databases as in Task 1. The results should show whether the additional computational overhead caused by the use of a stochastic state duration model is worth the performance improvement.

6.3 Task 3: Discriminative Training

Tasks 1 and 2 assume that an efficient training procedure exists for the proposed SLDS. Maximum likelihood is an obvious objective function and will be used as a baseline. As an alternative we will formulate a discriminative training algorithm for the SLDS. Different approximations will be considered: (a) log-likelihood ratio and (b) conditional likelihood maximisation. In (b) novel inference and learning techniques will need to be developed. The goal of this task is to explore the possibility of applying discriminative training procedures to the SLDS and to evaluate the performance of the training algorithm on Tasks 1 and 2. In that respect, correspondence with the authors is ongoing.


Hidden Markov models

Hidden Markov models Hidden Markov models Charles Elkan November 26, 2012 Important: These lecture notes are based on notes written by Lawrence Saul. Also, these typeset notes lack illustrations. See the classroom lectures

More information

Sound Recognition in Mixtures

Sound Recognition in Mixtures Sound Recognition in Mixtures Juhan Nam, Gautham J. Mysore 2, and Paris Smaragdis 2,3 Center for Computer Research in Music and Acoustics, Stanford University, 2 Advanced Technology Labs, Adobe Systems

More information

HMM and IOHMM Modeling of EEG Rhythms for Asynchronous BCI Systems

HMM and IOHMM Modeling of EEG Rhythms for Asynchronous BCI Systems HMM and IOHMM Modeling of EEG Rhythms for Asynchronous BCI Systems Silvia Chiappa and Samy Bengio {chiappa,bengio}@idiap.ch IDIAP, P.O. Box 592, CH-1920 Martigny, Switzerland Abstract. We compare the use

More information

Expectation Correction for Smoothed Inference in Switching Linear Dynamical Systems

Expectation Correction for Smoothed Inference in Switching Linear Dynamical Systems Expectation CorrectionBarber Expectation Correction for Smoothed Inference in Switching Linear Dynamical Systems David Barber IDIAP Research Institute Rue du Simplon 4 CH-1920 Martigny Switzerland david.barber@idiap.ch

More information

COMP90051 Statistical Machine Learning

COMP90051 Statistical Machine Learning COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Trevor Cohn 24. Hidden Markov Models & message passing Looking back Representation of joint distributions Conditional/marginal independence

More information

Hidden Markov Models and Gaussian Mixture Models

Hidden Markov Models and Gaussian Mixture Models Hidden Markov Models and Gaussian Mixture Models Hiroshi Shimodaira and Steve Renals Automatic Speech Recognition ASR Lectures 4&5 25&29 January 2018 ASR Lectures 4&5 Hidden Markov Models and Gaussian

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Slides mostly from Mitch Marcus and Eric Fosler (with lots of modifications). Have you seen HMMs? Have you seen Kalman filters? Have you seen dynamic programming? HMMs are dynamic

More information

How to do backpropagation in a brain

How to do backpropagation in a brain How to do backpropagation in a brain Geoffrey Hinton Canadian Institute for Advanced Research & University of Toronto & Google Inc. Prelude I will start with three slides explaining a popular type of deep

More information

CS 5522: Artificial Intelligence II

CS 5522: Artificial Intelligence II CS 5522: Artificial Intelligence II Particle Filters and Applications of HMMs Instructor: Wei Xu Ohio State University [These slides were adapted from CS188 Intro to AI at UC Berkeley.] Recap: Reasoning

More information

ECE521 Lecture 19 HMM cont. Inference in HMM

ECE521 Lecture 19 HMM cont. Inference in HMM ECE521 Lecture 19 HMM cont. Inference in HMM Outline Hidden Markov models Model definitions and notations Inference in HMMs Learning in HMMs 2 Formally, a hidden Markov model defines a generative process

More information

MODULE -4 BAYEIAN LEARNING

MODULE -4 BAYEIAN LEARNING MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities

More information

PHONEME CLASSIFICATION OVER THE RECONSTRUCTED PHASE SPACE USING PRINCIPAL COMPONENT ANALYSIS

PHONEME CLASSIFICATION OVER THE RECONSTRUCTED PHASE SPACE USING PRINCIPAL COMPONENT ANALYSIS PHONEME CLASSIFICATION OVER THE RECONSTRUCTED PHASE SPACE USING PRINCIPAL COMPONENT ANALYSIS Jinjin Ye jinjin.ye@mu.edu Michael T. Johnson mike.johnson@mu.edu Richard J. Povinelli richard.povinelli@mu.edu

More information

CS 5522: Artificial Intelligence II

CS 5522: Artificial Intelligence II CS 5522: Artificial Intelligence II Particle Filters and Applications of HMMs Instructor: Alan Ritter Ohio State University [These slides were adapted from CS188 Intro to AI at UC Berkeley. All materials

More information

Hidden Markov Models in Language Processing

Hidden Markov Models in Language Processing Hidden Markov Models in Language Processing Dustin Hillard Lecture notes courtesy of Prof. Mari Ostendorf Outline Review of Markov models What is an HMM? Examples General idea of hidden variables: implications

More information

Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi

Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 41 Pulse Code Modulation (PCM) So, if you remember we have been talking

More information

Augmented Statistical Models for Classifying Sequence Data

Augmented Statistical Models for Classifying Sequence Data Augmented Statistical Models for Classifying Sequence Data Martin Layton Corpus Christi College University of Cambridge September 2006 Dissertation submitted to the University of Cambridge for the degree

More information

Hidden Markov Models Part 2: Algorithms

Hidden Markov Models Part 2: Algorithms Hidden Markov Models Part 2: Algorithms CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Hidden Markov Model An HMM consists of:

More information

Hidden Markov Models. By Parisa Abedi. Slides courtesy: Eric Xing

Hidden Markov Models. By Parisa Abedi. Slides courtesy: Eric Xing Hidden Markov Models By Parisa Abedi Slides courtesy: Eric Xing i.i.d to sequential data So far we assumed independent, identically distributed data Sequential (non i.i.d.) data Time-series data E.g. Speech

More information

Hidden Markov Models and other Finite State Automata for Sequence Processing

Hidden Markov Models and other Finite State Automata for Sequence Processing To appear in The Handbook of Brain Theory and Neural Networks, Second edition, (M.A. Arbib, Ed.), Cambridge, MA: The MIT Press, 2002. http://mitpress.mit.edu The MIT Press Hidden Markov Models and other

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Proc. of NCC 2010, Chennai, India

Proc. of NCC 2010, Chennai, India Proc. of NCC 2010, Chennai, India Trajectory and surface modeling of LSF for low rate speech coding M. Deepak and Preeti Rao Department of Electrical Engineering Indian Institute of Technology, Bombay

More information

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels? Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity

More information

Lecture 3: ASR: HMMs, Forward, Viterbi

Lecture 3: ASR: HMMs, Forward, Viterbi Original slides by Dan Jurafsky CS 224S / LINGUIST 285 Spoken Language Processing Andrew Maas Stanford University Spring 2017 Lecture 3: ASR: HMMs, Forward, Viterbi Fun informative read on phonetics The

More information

An Evolutionary Programming Based Algorithm for HMM training

An Evolutionary Programming Based Algorithm for HMM training An Evolutionary Programming Based Algorithm for HMM training Ewa Figielska,Wlodzimierz Kasprzak Institute of Control and Computation Engineering, Warsaw University of Technology ul. Nowowiejska 15/19,

More information

The Noisy Channel Model. Statistical NLP Spring Mel Freq. Cepstral Coefficients. Frame Extraction ... Lecture 10: Acoustic Models

The Noisy Channel Model. Statistical NLP Spring Mel Freq. Cepstral Coefficients. Frame Extraction ... Lecture 10: Acoustic Models Statistical NLP Spring 2009 The Noisy Channel Model Lecture 10: Acoustic Models Dan Klein UC Berkeley Search through space of all possible sentences. Pick the one that is most probable given the waveform.

More information

Statistical NLP Spring The Noisy Channel Model

Statistical NLP Spring The Noisy Channel Model Statistical NLP Spring 2009 Lecture 10: Acoustic Models Dan Klein UC Berkeley The Noisy Channel Model Search through space of all possible sentences. Pick the one that is most probable given the waveform.

More information

The Noisy Channel Model. CS 294-5: Statistical Natural Language Processing. Speech Recognition Architecture. Digitizing Speech

The Noisy Channel Model. CS 294-5: Statistical Natural Language Processing. Speech Recognition Architecture. Digitizing Speech CS 294-5: Statistical Natural Language Processing The Noisy Channel Model Speech Recognition II Lecture 21: 11/29/05 Search through space of all possible sentences. Pick the one that is most probable given

More information

9 Multi-Model State Estimation

9 Multi-Model State Estimation Technion Israel Institute of Technology, Department of Electrical Engineering Estimation and Identification in Dynamical Systems (048825) Lecture Notes, Fall 2009, Prof. N. Shimkin 9 Multi-Model State

More information

Expectation Correction for Smoothed Inference in Switching Linear Dynamical Systems

Expectation Correction for Smoothed Inference in Switching Linear Dynamical Systems Expectation Correction Barber Expectation Correction for Smoothed Inference in Switching Linear Dynamical Systems David Barber IDIAP Research Institute Rue du Simplon 4 CH-1920 Martigny Switzerland david.barber@idiap.ch

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Hidden Markov Models Barnabás Póczos & Aarti Singh Slides courtesy: Eric Xing i.i.d to sequential data So far we assumed independent, identically distributed

More information

Engineering Part IIB: Module 4F11 Speech and Language Processing Lectures 4/5 : Speech Recognition Basics

Engineering Part IIB: Module 4F11 Speech and Language Processing Lectures 4/5 : Speech Recognition Basics Engineering Part IIB: Module 4F11 Speech and Language Processing Lectures 4/5 : Speech Recognition Basics Phil Woodland: pcw@eng.cam.ac.uk Lent 2013 Engineering Part IIB: Module 4F11 What is Speech Recognition?

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 13: SEQUENTIAL DATA

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 13: SEQUENTIAL DATA PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 13: SEQUENTIAL DATA Contents in latter part Linear Dynamical Systems What is different from HMM? Kalman filter Its strength and limitation Particle Filter

More information

Lecture 21: Spectral Learning for Graphical Models

Lecture 21: Spectral Learning for Graphical Models 10-708: Probabilistic Graphical Models 10-708, Spring 2016 Lecture 21: Spectral Learning for Graphical Models Lecturer: Eric P. Xing Scribes: Maruan Al-Shedivat, Wei-Cheng Chang, Frederick Liu 1 Motivation

More information

1. Markov models. 1.1 Markov-chain

1. Markov models. 1.1 Markov-chain 1. Markov models 1.1 Markov-chain Let X be a random variable X = (X 1,..., X t ) taking values in some set S = {s 1,..., s N }. The sequence is Markov chain if it has the following properties: 1. Limited

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Dr Philip Jackson Centre for Vision, Speech & Signal Processing University of Surrey, UK 1 3 2 http://www.ee.surrey.ac.uk/personal/p.jackson/isspr/ Outline 1. Recognizing patterns

More information

Machine Learning Overview

Machine Learning Overview Machine Learning Overview Sargur N. Srihari University at Buffalo, State University of New York USA 1 Outline 1. What is Machine Learning (ML)? 2. Types of Information Processing Problems Solved 1. Regression

More information

Doctoral Course in Speech Recognition. May 2007 Kjell Elenius

Doctoral Course in Speech Recognition. May 2007 Kjell Elenius Doctoral Course in Speech Recognition May 2007 Kjell Elenius CHAPTER 12 BASIC SEARCH ALGORITHMS State-based search paradigm Triplet S, O, G S, set of initial states O, set of operators applied on a state

More information

Deep Learning for Speech Recognition. Hung-yi Lee

Deep Learning for Speech Recognition. Hung-yi Lee Deep Learning for Speech Recognition Hung-yi Lee Outline Conventional Speech Recognition How to use Deep Learning in acoustic modeling? Why Deep Learning? Speaker Adaptation Multi-task Deep Learning New

More information

An Asynchronous Hidden Markov Model for Audio-Visual Speech Recognition

An Asynchronous Hidden Markov Model for Audio-Visual Speech Recognition An Asynchronous Hidden Markov Model for Audio-Visual Speech Recognition Samy Bengio Dalle Molle Institute for Perceptual Artificial Intelligence (IDIAP) CP 592, rue du Simplon 4, 1920 Martigny, Switzerland

More information

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated

More information

Dynamic Bayesian Networks with Deterministic Latent Tables

Dynamic Bayesian Networks with Deterministic Latent Tables Dynamic Bayesian Networks with Deterministic Latent Tables David Barber Institute for Adaptive and Neural Computation Edinburgh University 5 Forrest Hill, Edinburgh, EH1 QL, U.K. dbarber@anc.ed.ac.uk Abstract

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Roman Barták Department of Theoretical Computer Science and Mathematical Logic Summary of last lecture We know how to do probabilistic reasoning over time transition model P(X t

More information

HMM part 1. Dr Philip Jackson

HMM part 1. Dr Philip Jackson Centre for Vision Speech & Signal Processing University of Surrey, Guildford GU2 7XH. HMM part 1 Dr Philip Jackson Probability fundamentals Markov models State topology diagrams Hidden Markov models -

More information

CS 136a Lecture 7 Speech Recognition Architecture: Training models with the Forward backward algorithm

CS 136a Lecture 7 Speech Recognition Architecture: Training models with the Forward backward algorithm + September13, 2016 Professor Meteer CS 136a Lecture 7 Speech Recognition Architecture: Training models with the Forward backward algorithm Thanks to Dan Jurafsky for these slides + ASR components n Feature

More information

Data-Intensive Computing with MapReduce

Data-Intensive Computing with MapReduce Data-Intensive Computing with MapReduce Session 8: Sequence Labeling Jimmy Lin University of Maryland Thursday, March 14, 2013 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

Natural Language Processing Prof. Pushpak Bhattacharyya Department of Computer Science & Engineering, Indian Institute of Technology, Bombay

Natural Language Processing Prof. Pushpak Bhattacharyya Department of Computer Science & Engineering, Indian Institute of Technology, Bombay Natural Language Processing Prof. Pushpak Bhattacharyya Department of Computer Science & Engineering, Indian Institute of Technology, Bombay Lecture - 21 HMM, Forward and Backward Algorithms, Baum Welch

More information

Discriminative models for speech recognition

Discriminative models for speech recognition Discriminative models for speech recognition Anton Ragni Peterhouse University of Cambridge A thesis submitted for the degree of Doctor of Philosophy 2013 Declaration This dissertation is the result of

More information

A TWO-LAYER NON-NEGATIVE MATRIX FACTORIZATION MODEL FOR VOCABULARY DISCOVERY. MengSun,HugoVanhamme

A TWO-LAYER NON-NEGATIVE MATRIX FACTORIZATION MODEL FOR VOCABULARY DISCOVERY. MengSun,HugoVanhamme A TWO-LAYER NON-NEGATIVE MATRIX FACTORIZATION MODEL FOR VOCABULARY DISCOVERY MengSun,HugoVanhamme Department of Electrical Engineering-ESAT, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, Bus

More information

Stochastic Structural Dynamics Prof. Dr. C. S. Manohar Department of Civil Engineering Indian Institute of Science, Bangalore

Stochastic Structural Dynamics Prof. Dr. C. S. Manohar Department of Civil Engineering Indian Institute of Science, Bangalore Stochastic Structural Dynamics Prof. Dr. C. S. Manohar Department of Civil Engineering Indian Institute of Science, Bangalore Lecture No. # 33 Probabilistic methods in earthquake engineering-2 So, we have

More information

A New OCR System Similar to ASR System

A New OCR System Similar to ASR System A ew OCR System Similar to ASR System Abstract Optical character recognition (OCR) system is created using the concepts of automatic speech recognition where the hidden Markov Model is widely used. Results

More information

CS229 Project: Musical Alignment Discovery

CS229 Project: Musical Alignment Discovery S A A V S N N R R S CS229 Project: Musical Alignment iscovery Woodley Packard ecember 16, 2005 Introduction Logical representations of musical data are widely available in varying forms (for instance,

More information

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore

More information

Bayesian Networks BY: MOHAMAD ALSABBAGH

Bayesian Networks BY: MOHAMAD ALSABBAGH Bayesian Networks BY: MOHAMAD ALSABBAGH Outlines Introduction Bayes Rule Bayesian Networks (BN) Representation Size of a Bayesian Network Inference via BN BN Learning Dynamic BN Introduction Conditional

More information

CEPSTRAL analysis has been widely used in signal processing

CEPSTRAL analysis has been widely used in signal processing 162 IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 7, NO. 2, MARCH 1999 On Second-Order Statistics and Linear Estimation of Cepstral Coefficients Yariv Ephraim, Fellow, IEEE, and Mazin Rahim, Senior

More information

Robust Speaker Identification

Robust Speaker Identification Robust Speaker Identification by Smarajit Bose Interdisciplinary Statistical Research Unit Indian Statistical Institute, Kolkata Joint work with Amita Pal and Ayanendranath Basu Overview } } } } } } }

More information

LEARNING DYNAMIC SYSTEMS: MARKOV MODELS

LEARNING DYNAMIC SYSTEMS: MARKOV MODELS LEARNING DYNAMIC SYSTEMS: MARKOV MODELS Markov Process and Markov Chains Hidden Markov Models Kalman Filters Types of dynamic systems Problem of future state prediction Predictability Observability Easily

More information

Machine Recognition of Sounds in Mixtures

Machine Recognition of Sounds in Mixtures Machine Recognition of Sounds in Mixtures Outline 1 2 3 4 Computational Auditory Scene Analysis Speech Recognition as Source Formation Sound Fragment Decoding Results & Conclusions Dan Ellis

More information

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang Chapter 4 Dynamic Bayesian Networks 2016 Fall Jin Gu, Michael Zhang Reviews: BN Representation Basic steps for BN representations Define variables Define the preliminary relations between variables Check

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

Hidden Markov Models: All the Glorious Gory Details

Hidden Markov Models: All the Glorious Gory Details Hidden Markov Models: All the Glorious Gory Details Noah A. Smith Department of Computer Science Johns Hopkins University nasmith@cs.jhu.edu 18 October 2004 1 Introduction Hidden Markov models (HMMs, hereafter)

More information

On the Influence of the Delta Coefficients in a HMM-based Speech Recognition System

On the Influence of the Delta Coefficients in a HMM-based Speech Recognition System On the Influence of the Delta Coefficients in a HMM-based Speech Recognition System Fabrice Lefèvre, Claude Montacié and Marie-José Caraty Laboratoire d'informatique de Paris VI 4, place Jussieu 755 PARIS

More information

Independent Component Analysis and Unsupervised Learning

Independent Component Analysis and Unsupervised Learning Independent Component Analysis and Unsupervised Learning Jen-Tzung Chien National Cheng Kung University TABLE OF CONTENTS 1. Independent Component Analysis 2. Case Study I: Speech Recognition Independent

More information

Evaluation of the modified group delay feature for isolated word recognition

Evaluation of the modified group delay feature for isolated word recognition Evaluation of the modified group delay feature for isolated word recognition Author Alsteris, Leigh, Paliwal, Kuldip Published 25 Conference Title The 8th International Symposium on Signal Processing and

More information

Approximate Inference

Approximate Inference Approximate Inference Simulation has a name: sampling Sampling is a hot topic in machine learning, and it s really simple Basic idea: Draw N samples from a sampling distribution S Compute an approximate

More information

Inference and estimation in probabilistic time series models

Inference and estimation in probabilistic time series models 1 Inference and estimation in probabilistic time series models David Barber, A Taylan Cemgil and Silvia Chiappa 11 Time series The term time series refers to data that can be represented as a sequence

More information

Independent Component Analysis and Unsupervised Learning. Jen-Tzung Chien

Independent Component Analysis and Unsupervised Learning. Jen-Tzung Chien Independent Component Analysis and Unsupervised Learning Jen-Tzung Chien TABLE OF CONTENTS 1. Independent Component Analysis 2. Case Study I: Speech Recognition Independent voices Nonparametric likelihood

More information

Development of Stochastic Artificial Neural Networks for Hydrological Prediction

Development of Stochastic Artificial Neural Networks for Hydrological Prediction Development of Stochastic Artificial Neural Networks for Hydrological Prediction G. B. Kingston, M. F. Lambert and H. R. Maier Centre for Applied Modelling in Water Engineering, School of Civil and Environmental

More information

Machine Learning & Data Mining Caltech CS/CNS/EE 155 Hidden Markov Models Last Updated: Feb 7th, 2017

Machine Learning & Data Mining Caltech CS/CNS/EE 155 Hidden Markov Models Last Updated: Feb 7th, 2017 1 Introduction Let x = (x 1,..., x M ) denote a sequence (e.g. a sequence of words), and let y = (y 1,..., y M ) denote a corresponding hidden sequence that we believe explains or influences x somehow

More information

Improving the Multi-Stack Decoding Algorithm in a Segment-based Speech Recognizer

Improving the Multi-Stack Decoding Algorithm in a Segment-based Speech Recognizer Improving the Multi-Stack Decoding Algorithm in a Segment-based Speech Recognizer Gábor Gosztolya, András Kocsor Research Group on Artificial Intelligence of the Hungarian Academy of Sciences and University

More information

Human-Oriented Robotics. Temporal Reasoning. Kai Arras Social Robotics Lab, University of Freiburg

Human-Oriented Robotics. Temporal Reasoning. Kai Arras Social Robotics Lab, University of Freiburg Temporal Reasoning Kai Arras, University of Freiburg 1 Temporal Reasoning Contents Introduction Temporal Reasoning Hidden Markov Models Linear Dynamical Systems (LDS) Kalman Filter 2 Temporal Reasoning

More information

Why do we care? Examples. Bayes Rule. What room am I in? Handling uncertainty over time: predicting, estimating, recognizing, learning

Why do we care? Examples. Bayes Rule. What room am I in? Handling uncertainty over time: predicting, estimating, recognizing, learning Handling uncertainty over time: predicting, estimating, recognizing, learning Chris Atkeson 004 Why do we care? Speech recognition makes use of dependence of words and phonemes across time. Knowing where

More information

DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION. Alexandre Iline, Harri Valpola and Erkki Oja

DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION. Alexandre Iline, Harri Valpola and Erkki Oja DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION Alexandre Iline, Harri Valpola and Erkki Oja Laboratory of Computer and Information Science Helsinki University of Technology P.O.Box

More information

Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information