MAXIMUM A POSTERIORI ESTIMATION OF COUPLED HIDDEN MARKOV MODELS


Iead Rezek, Michael Gibbs and Stephen J. Roberts
Robotics Research Group, Department of Engineering Science, University of Oxford, UK.
irezek@robots.ox.ac.uk

Abstract. Coupled Hidden Markov Models (CHMMs) are a tool which models interactions between variables in state space rather than in observation space. They may therefore reveal coupling in cases where classical tools, such as correlation, fail. In this paper we derive the maximum a posteriori equations for the Expectation Maximization algorithm. The use of the models is demonstrated on simulated data, as well as in biomedical signal analysis.

INTRODUCTION

Analysis of complex systems (such as physiological systems) frequently involves studying the interactions between them. Indeed, the interactions themselves can give us clues, for example, about the state of health of a patient. For instance, interactions between cardio-respiratory variables are known to be abnormal in autonomic neuropathy [1]. Conversely, if the type of interaction is known, a departure from this state is, by definition, an anomaly. For instance, in multichannel EEG analysis we would expect coupling between the channels; if the states of the channels are divergent one may suspect some recording error (e.g. electrode failure).

Traditionally, such interactions would be measured using correlation or, more generally, mutual information measures [2]. There are, however, major limitations associated with correlation, namely its linearity and stationarity assumptions. Mutual information, on the other hand, can handle non-linearity; however, it is unable to infer anti-correlation and it also assumes stationarity (for the density estimation). Recent work has focused on Hidden Markov Models (HMMs) in order to avoid some of the stationarity assumptions and to model the dynamics explicitly [3]. HMMs capture the dynamics of a time series through a discrete state space, the dynamics being encoded in the state transition probabilities from one time step to the next. HMMs, however, typically have a single observation model and a single observation sequence and thus, by default, cannot handle the composite modelling of multiple time series. This can be overcome through the use of multivariate observation models, such as multivariate autoregressive models [4]. The latter implies that interactions are modelled in observation space and not in state space. This is unsatisfactory in some biomedical applications where the signals have completely different forms and thus spectra/bandwidths, e.g. heart rate and respiration. Another approach models the interactions in state space by coupling the state spaces of each variate [5]. This allows interaction analysis to proceed on arbitrary multi-channel sequences.

Figure 1: Hidden Markov Model Topologies. (a) Standard (single-chain) HMM; (b) coupled HMM with lag 1; (c) coupled HMM with lag 2.

HMMS: MARKOV CONDITION AND COUPLING

Classical HMMs

A standard HMM is graphically represented as an unfolded mixture model whose states at time $t$ are conditioned on those at time $t-1$ (see Figure 1[a]), also known as the first-order Markov property. Estimation is often performed in the maximum likelihood framework [6, 7], where the hidden states are iteratively re-estimated by their expected values and, using these, the model parameters are re-estimated. HMMs may be extended by changing the dimensions of the observation space and the state space, or by changing their distributions. In this paper, however, we are interested in systems with different topologies and, in particular, in extensions to more than one state variable.

Coupled HMMs

One possible extension to multi-dimensional systems is the coupled hidden Markov model (CHMM), as shown in Figure 1[b]. Here the current state of a chain depends on the state of its own chain and on that of the neighbouring chain at the previous time step. These models have been used in computer vision [5] and digital communication [8].
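To make the coupling concrete, the following sketch samples from a lag-1 CHMM with two chains and Gaussian observations, in the spirit of Figure 1[b]. The code is illustrative only (the paper gives no implementation); the state-space sizes, parameter values and variable names are assumptions, and NumPy is used merely for convenience.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes; the paper does not specify these values.
KX, KY, T = 2, 2, 200

# Transition tensors: P_x[i, j, k] = p(X_t = k | X_{t-1} = i, Y_{t-1} = j)
#                     P_y[j, i, l] = p(Y_t = l | Y_{t-1} = j, X_{t-1} = i)
P_x = rng.dirichlet(np.ones(KX), size=(KX, KY))
P_y = rng.dirichlet(np.ones(KY), size=(KY, KX))
pi_x = rng.dirichlet(np.ones(KX))   # initial state probabilities, chain X
pi_y = rng.dirichlet(np.ones(KY))   # initial state probabilities, chain Y

# One-dimensional Gaussian observation models (one mean per state, per chain).
mu_x, sd_x = np.array([-1.0, 1.0]), 0.5
mu_y, sd_y = np.array([-2.0, 2.0]), 0.5

x = np.empty(T, dtype=int)
y = np.empty(T, dtype=int)
x[0] = rng.choice(KX, p=pi_x)
y[0] = rng.choice(KY, p=pi_y)
for t in range(1, T):
    # Coupling: each chain conditions on both chains' states at the previous step.
    x[t] = rng.choice(KX, p=P_x[x[t - 1], y[t - 1]])
    y[t] = rng.choice(KY, p=P_y[y[t - 1], x[t - 1]])

a = rng.normal(mu_x[x], sd_x)   # observations of chain X
b = rng.normal(mu_y[y], sd_y)   # observations of chain Y
```

Each chain's next state is drawn conditionally on both chains' previous states; this is precisely the coupling that a single-chain HMM cannot express without moving the interaction into the observation model.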

Several points are of note. The model, though still a DAG (directed acyclic graph), contains loops, and these loops constitute the essential difficulty in estimating the parameters of a CHMM. One approach to this problem is clustering [9]: all nodes of a loop are merged to form a single higher-dimensional node. Though this may be a feasible approach for the simple model shown in Figure 1[b] (the two state variables at each time step would be merged), it becomes infeasible once the lag-link (the link from the neighbouring chain) is moved further forward in time. For instance, the DAG shown in Figure 1[c] requires merging state variables across several time steps and thus causes an immediate explosion of node dimensionality. Another approach, that of cut-sets [9], is also impractical for dynamic models, since the number of cut-set variables increases with the chain length. To illustrate this briefly, the same CHMM is redrawn in Figure 2 as a nested sequence of loops: to cut open each loop, the state values along it must be clamped, thus increasing the number of cut-set variables linearly with the length of the chain. Finally, there are two chains with possibly different observation models. Unless the observation models are identical, the chains cannot simply be transformed into a single HMM by forming a Cartesian product of the observation spaces.

Figure 2: Illustration of the nested loops in a CHMM.

In this paper we consider CHMMs with lag 1 and, to keep close to the original model, employ node merging. In the following sections we derive the full forward-backward equations and the update equations for the maximum a posteriori estimators of the model parameters.

MAXIMUM A POSTERIORI PARAMETER ESTIMATION

Notation

Variables and probability densities: $A = \{A_1, \dots, A_T\}$ denotes the observation variables of the first chain and $B = \{B_1, \dots, B_T\}$ those of the second chain; $a_t$ and $b_t$ are the actual observed (given) values at time step $t$. $X = \{X_1, \dots, X_T\}$ and $Y = \{Y_1, \dots, Y_T\}$ are the hidden state variables of the first and second chain, respectively. At each time step the state variable $X_t$ of the first chain can take one of $K_X$ discrete values and the state variable $Y_t$ of the second chain one of $K_Y$ discrete values; $K_X$ and $K_Y$ are the state-space dimensions. $p(X_t \mid X_{t-1}, Y_{t-1})$ denotes the state transition probability of the first chain, i.e. the probability of $X_t$ conditioned on $X_{t-1}$ and $Y_{t-1}$ at the previous time step; likewise, $p(Y_t \mid Y_{t-1}, X_{t-1})$ is the second chain's transition probability. $p(A_t \mid X_t)$ and $p(B_t \mid Y_t)$ are the probabilities of the observations conditioned on the respective states, and $p(X_1)$ and $p(Y_1)$ are the initial state probabilities of the first and second chain.

Model parameters: The parameter vector $\theta$ contains the parameters of the transition probabilities, the initial state probabilities, and the parameters of the observation models. The transition probability of the first chain follows a Multinomial density, with one parameter vector for each possible joint state of the conditioning variables $X_{t-1}$ and $Y_{t-1}$; likewise, the transition probability of the second chain follows a Multinomial density with one parameter vector per joint parent state. The initial states of the two chains follow Multinomial densities with parameters $\pi^X$ and $\pi^Y$. The observation densities $p(A_t \mid X_t)$ and $p(B_t \mid Y_t)$ are assumed to be multivariate Gaussian, with one mean vector and one covariance matrix per state for each channel.

Parameter priors: We choose conjugate priors for each model parameter: Gaussian priors for the observation means, Wishart priors for the observation precisions, Dirichlet priors for the transition probabilities of each chain, and Dirichlet priors for the initial state probabilities. All hyper-priors were chosen such that the priors were vague or flat [10].

The Model

In the maximum a posteriori (MAP) form of the Expectation Maximization algorithm we aim to maximize the function [11] $\log p(A, B \mid \theta) + \log p(\theta)$, where $p(A, B \mid \theta)$ and $p(\theta)$ are probability densities. The likelihood function encodes the model and, in the case of a CHMM with lag 1, is given as

$p(A, B, X, Y \mid \theta) = p(X_1)\, p(Y_1)\, p(A_1 \mid X_1)\, p(B_1 \mid Y_1) \prod_{t=2}^{T} p(X_t \mid X_{t-1}, Y_{t-1})\, p(Y_t \mid Y_{t-1}, X_{t-1})\, p(A_t \mid X_t)\, p(B_t \mid Y_t).$   (1)

The EM algorithm then iterates two steps:

E-Step: The expectation step, or E-step, estimates the hidden variables such that (1) is maximized. This is done by filling in the missing values with their expectations conditioned on the data and the model parameters of the preceding iteration. For HMMs, this amounts to computing the forward and backward recursions as shown in [7].

M-Step: The maximization step, or M-step, maximizes (1) with respect to $\theta$, conditioned on the expected values of the hidden states. The maximum likelihood solution for CHMMs is identical to that for the standard HMM case [4, 12]. Here, however, we derive the maximum a posteriori solution for CHMMs.
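To fix ideas, the container below collects the CHMM parameters and the vague conjugate hyperparameters described above, and provides a random normalized starting point for the EM iterations. The layout, names and numeric hyperparameter values are editorial assumptions for illustration, not the paper's.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class CHMMParams:
    """Parameters of a lag-1 CHMM with Gaussian observations (illustrative layout)."""
    P_x: np.ndarray    # (KX, KY, KX): p(X_t = k | X_{t-1} = i, Y_{t-1} = j)
    P_y: np.ndarray    # (KY, KX, KY): p(Y_t = l | Y_{t-1} = j, X_{t-1} = i)
    pi_x: np.ndarray   # (KX,): initial state probabilities, first chain
    pi_y: np.ndarray   # (KY,): initial state probabilities, second chain
    mu_x: np.ndarray   # (KX, D): observation means, first chain
    mu_y: np.ndarray   # (KY, D): observation means, second chain
    cov_x: np.ndarray  # (KX, D, D): observation covariances, first chain
    cov_y: np.ndarray  # (KY, D, D): observation covariances, second chain

@dataclass
class CHMMPriors:
    """Vague conjugate hyperparameters (numeric values are assumptions, not the paper's)."""
    dir_trans: float = 1.0       # Dirichlet pseudo-counts on every transition row
    dir_init: float = 1.0        # Dirichlet pseudo-counts on the initial state probabilities
    mean_prec: float = 1e-3      # precision of the Gaussian prior on each observation mean
    wishart_scale: float = 1e-3  # scale of the Wishart prior acting on each precision matrix

def init_params(KX: int, KY: int, D: int, seed: int = 0) -> CHMMParams:
    """Random, properly normalized starting point for the EM iterations."""
    rng = np.random.default_rng(seed)
    return CHMMParams(
        P_x=rng.dirichlet(np.ones(KX), size=(KX, KY)),
        P_y=rng.dirichlet(np.ones(KY), size=(KY, KX)),
        pi_x=rng.dirichlet(np.ones(KX)),
        pi_y=rng.dirichlet(np.ones(KY)),
        mu_x=rng.normal(size=(KX, D)),
        mu_y=rng.normal(size=(KY, D)),
        cov_x=np.tile(np.eye(D), (KX, 1, 1)),
        cov_y=np.tile(np.eye(D), (KY, 1, 1)),
    )
```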

The Expectation Step

Clearly, for CHMMs, the non-standard part is the belief propagation in the E-step, i.e. computing the expected values of the hidden variables given the data and the parameters. For clarity, we first demonstrate the derivation of the forward-backward recursions for a general CHMM, i.e. not necessarily one with a discrete state space; the discrete case is recovered simply by replacing the integrals with sums. For standard HMMs we have the forward variable $\alpha_t(x_t) = p(o_{1:t}, x_t)$ and, by conditional independence,

$\alpha_{t+1}(x_{t+1}) = p(o_{t+1} \mid x_{t+1}) \int p(x_{t+1} \mid x_t)\, \alpha_t(x_t)\, \mathrm{d}x_t.$

For CHMMs, we define one such variable for each chain, one over $X_t$ and one over $Y_t$. At $t = 1$ these follow directly from the initial state densities and the first observations. At later time steps, $X_{t+1}$ is independent of the earlier states and observations given $X_t$ and $Y_t$, so its prediction involves only the variables of the previous time slice. Repeating the steps for the second chain and for arbitrary $t$ gives the forward recursions (2) and (3). Thus, there are two un-coupled prediction equations and no node clustering is, as yet, required.

The backward recursions for standard HMMs are defined as $\beta_t(x_t) = p(o_{t+1:T} \mid x_t)$, which can be solved inductively: $\beta_T(x_T) = 1$ and, for any $t < T$,

$\beta_t(x_t) = \int p(x_{t+1} \mid x_t)\, p(o_{t+1} \mid x_{t+1})\, \beta_{t+1}(x_{t+1})\, \mathrm{d}x_{t+1}.$

The loop in the CHMM prevents a split of the backward recursions into two uncoupled recursions. We are thus forced to perform a clustering operation on all the nodes in a loop, namely the pair $(X_t, Y_t)$ at each time step. Because the backward recursions cannot be decoupled, we are left with no choice but to merge the chains' state variables, which simply leads to the joint forward and backward recursions over the merged state $(X_t, Y_t)$:

$\alpha_t(x, y) = p(a_{1:t}, b_{1:t}, X_t = x, Y_t = y),$   (4)

$\alpha_{t+1}(x', y') = p(a_{t+1} \mid x')\, p(b_{t+1} \mid y') \sum_{x, y} p(x' \mid x, y)\, p(y' \mid y, x)\, \alpha_t(x, y),$   (5)

$\beta_t(x, y) = \sum_{x', y'} p(x' \mid x, y)\, p(y' \mid y, x)\, p(a_{t+1} \mid x')\, p(b_{t+1} \mid y')\, \beta_{t+1}(x', y'), \quad \beta_T(x, y) = 1.$   (6)

This also demonstrates that it quickly becomes infeasible to explore different coupling topologies, as they will increase the dimensionality of the clustered loop exponentially.
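A minimal sketch of the joint (merged-node) forward-backward recursions (4)-(6) is given below, including the smoothed one- and two-slice posteriors needed later for the M-step. The variable names, the precomputation of the observation likelihoods, and the scaling scheme used for numerical stability are implementation choices assumed here, not taken from the paper.

```python
import numpy as np

def forward_backward(P_x, P_y, pi_x, pi_y, like_a, like_b):
    """Joint forward-backward over the merged state (X_t, Y_t) of a lag-1 CHMM.

    P_x[i, j, k] = p(X_t=k | X_{t-1}=i, Y_{t-1}=j); P_y[j, i, l] = p(Y_t=l | Y_{t-1}=j, X_{t-1}=i).
    like_a[t, i] = p(a_t | X_t=i) and like_b[t, j] = p(b_t | Y_t=j) are precomputed likelihoods.
    Returns gamma[t, i, j] = p(X_t=i, Y_t=j | data),
            xi[t, i, j, k, l] = p(X_t=i, Y_t=j, X_{t+1}=k, Y_{t+1}=l | data),
            and the data log-likelihood as a by-product.
    """
    T, KX = like_a.shape
    KY = like_b.shape[1]
    # Joint transition over merged states: M[i, j, k, l] = p(X_{t+1}=k | i, j) p(Y_{t+1}=l | j, i).
    M = P_x[:, :, :, None] * P_y.transpose(1, 0, 2)[:, :, None, :]

    alpha = np.zeros((T, KX, KY))
    scale = np.zeros(T)
    alpha[0] = np.outer(pi_x * like_a[0], pi_y * like_b[0])
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        pred = np.einsum('ij,ijkl->kl', alpha[t - 1], M)   # prediction step
        alpha[t] = pred * np.outer(like_a[t], like_b[t])   # correction step
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    beta = np.zeros((T, KX, KY))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        obs = np.outer(like_a[t + 1], like_b[t + 1])
        beta[t] = np.einsum('ijkl,kl->ij', M, obs * beta[t + 1]) / scale[t + 1]

    gamma = alpha * beta
    gamma /= gamma.sum(axis=(1, 2), keepdims=True)         # renormalize for numerical safety

    xi = np.zeros((T - 1, KX, KY, KX, KY))
    for t in range(T - 1):
        obs = np.outer(like_a[t + 1], like_b[t + 1])
        xi[t] = alpha[t][:, :, None, None] * M * (obs * beta[t + 1])[None, None, :, :]
        xi[t] /= xi[t].sum()
    return gamma, xi, np.log(scale).sum()
```

The per-step scaling plays the role of the usual HMM normalization and yields the log-likelihood of the data for free; the merged state space has size $K_X K_Y$, which is why more elaborate coupling topologies quickly become intractable.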

Finally, for the M-step we require, for each chain, the smoothed single-slice posterior of its state and the smoothed joint posterior of its state together with both parent states at the previous time step (7)-(10); all of these follow directly from the joint forward and backward variables.

The Maximization Step

Having computed the expectations of the hidden states, we can proceed with the estimation of the model parameters. Expanding the first term in equation (1) under these posteriors, we obtain a Q-function (11) consisting of four sets of terms: in the first line, the terms for the initial state probabilities of each channel; in the second line, the terms for the state transition probabilities of the X-channel; in the third, the terms for the state transition probabilities of the Y-channel; and finally, in the fourth line, the terms for the observation models of each channel with their Gaussian parameters (12).

The update formulas are computed by maximizing the Q-functions above with respect to the parameters.

The means. To derive the update rules for the means of the first chain, the required Q-function is the expected sum of the observation terms, which is just the expectation of the logarithm of a multivariate Normal density, plus the log-prior on the means. Maximizing with respect to the mean vectors gives the update equation for the means of the first chain (13); likewise, maximizing the corresponding Q-function for the second chain gives the update rule for its means (14). In both cases the update is a posterior-weighted average of the observations, shrunk towards the prior mean.

The covariance matrices. The Q-functions to maximize for the two chains are again those of the observation terms, now maximized with respect to the covariance (equivalently, precision) matrices, where the prior is a multivariate Wishart density over each precision matrix. Maximization gives the updates for the covariance matrices (15) and (16): posterior-weighted scatter matrices around the updated means, regularized by the Wishart hyperparameters.

The initial state probabilities. The relevant Q-function for the initial state probabilities consists of the expected log initial-state terms plus a Dirichlet log-prior, under the constraint that the probabilities sum to one. Maximizing gives the update equations for the initial state probabilities (17)-(19): the smoothed posterior of the first time slice plus the Dirichlet pseudo-counts, renormalized.

The state transition probabilities. Finally, we compute the update equations for the state transition probabilities. The relevant Q-function consists of the expected log transition terms plus a Dirichlet log-prior, again with a sum-to-one constraint on every row of the transition table. Maximizing leads to the update equations (20) and (21): for each possible parent state, the expected transition counts plus the Dirichlet pseudo-counts, renormalized.
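The sketch below implements MAP M-step updates of the kind just described, using the posteriors gamma and xi from the forward-backward sketch. The hyperparameter treatment is a deliberate simplification (plain Dirichlet pseudo-counts and a ridge-style covariance regularizer standing in for the full Normal-Wishart algebra), so it should not be read as the paper's equations (13)-(21).

```python
import numpy as np

def m_step_map(gamma, xi, obs_a, obs_b, dir_trans=1.0, dir_init=1.0, cov_reg=1e-3):
    """Approximate MAP M-step for a lag-1 CHMM with Gaussian observations.

    gamma[t, i, j] and xi[t, i, j, k, l] are the smoothed posteriors from the E-step;
    obs_a[t, :] and obs_b[t, :] are the observations of the two chains.
    """
    T, KX, KY = gamma.shape
    g_x = gamma.sum(axis=2)                      # (T, KX) marginal posterior, chain X
    g_y = gamma.sum(axis=1)                      # (T, KY) marginal posterior, chain Y

    # Initial state probabilities: first-slice posterior plus pseudo-counts, renormalized.
    pi_x = g_x[0] + dir_init
    pi_y = g_y[0] + dir_init
    pi_x /= pi_x.sum()
    pi_y /= pi_y.sum()

    # Transition probabilities: expected counts plus pseudo-counts, per parent state (i, j).
    counts = xi.sum(axis=0)                      # (KX, KY, KX, KY)
    P_x = counts.sum(axis=3) + dir_trans         # counts of (i, j) -> k, any l
    P_y = counts.sum(axis=2) + dir_trans         # counts of (i, j) -> l, any k
    P_x /= P_x.sum(axis=2, keepdims=True)
    P_y = (P_y / P_y.sum(axis=2, keepdims=True)).transpose(1, 0, 2)  # reorder to [j, i, l]

    # Gaussian observation models: posterior-weighted means and covariances.
    def gaussian_updates(g, obs):
        w = g / (g.sum(axis=0) + 1e-12)          # (T, K) normalized responsibilities
        mu = w.T @ obs                           # (K, D) weighted means
        D = obs.shape[1]
        cov = np.empty((g.shape[1], D, D))
        for k in range(g.shape[1]):
            diff = obs - mu[k]
            cov[k] = (w[:, k, None] * diff).T @ diff + cov_reg * np.eye(D)
        return mu, cov

    mu_x, cov_x = gaussian_updates(g_x, obs_a)
    mu_y, cov_y = gaussian_updates(g_y, obs_b)
    return P_x, P_y, pi_x, pi_y, mu_x, cov_x, mu_y, cov_y
```

Alternating this step with the forward-backward E-step gives the MAP-EM procedure whose behaviour is examined in the experiments below.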

EXPERIMENTS

Synthetic Data

Synthetic data was generated by drawing samples from a CHMM. Figure 3 shows the input state sequence together with the estimated joint and marginal state sequences. The lower two traces in the figure depict the individual state sequences, estimated by marginalising the joint transition probabilities and feeding these into a standard single-chain HMM.

Figure 3: CHMM state sequences (from top to bottom trace): input, estimated joint, and marginal Viterbi paths.
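One way the joint Viterbi path above might be computed is sketched below, running the standard log-domain Viterbi recursion on the merged state space; a collapse of the joint model onto a single chain, as used for the marginal paths, is sketched after the movement-planning experiment. All names are illustrative and the construction is an assumption, since the paper does not give implementation detail.

```python
import numpy as np

def joint_viterbi(P_x, P_y, pi_x, pi_y, like_a, like_b):
    """Most probable joint state sequence (x_t, y_t) of a lag-1 CHMM (log-domain Viterbi)."""
    T, KX = like_a.shape
    KY = like_b.shape[1]
    # Joint log-transition over merged states and joint log-likelihoods per time step.
    logM = np.log(P_x[:, :, :, None] * P_y.transpose(1, 0, 2)[:, :, None, :]).reshape(KX * KY, KX * KY)
    logL = (np.log(like_a)[:, :, None] + np.log(like_b)[:, None, :]).reshape(T, KX * KY)

    delta = np.log(np.outer(pi_x, pi_y)).ravel() + logL[0]
    back = np.zeros((T, KX * KY), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + logM             # rows: previous merged state, cols: current
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + logL[t]

    path = np.empty(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return np.unravel_index(path, (KX, KY))      # per-chain state sequences
```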

Cheyne Stokes Data

We applied the CHMM to features extracted from a section of Cheyne-Stokes data (a breathing disorder in which bursts of fast respiration are interspersed with periods of breathing absence), which consisted of one EEG recording and a simultaneous respiration recording. The feature, the fractional spectral radius (FSR) measure [13], was computed from consecutive non-overlapping windows of two seconds length. The respiration FSR then formed one channel and the EEG FSR the second channel. The data has clearly visible states which can be used to verify the results of the estimated model. Figure 4 shows the FSR traces superimposed with the joint Viterbi path sequence. The number of states of each CHMM chain was set to 2. Essentially, of the four joint states, the CHMM splits the sequence such that two states correspond to respiration and EEG being in phase (both on, 1/1, or both off, 0/0) and two correspond to the transition states, i.e. respiration on with EEG off and vice versa (1/0 and 0/1).

Figure 4: CHMM modelling of Cheyne-Stokes data (joint Viterbi path with the two observed sequences): there are two states corresponding to EEG and respiration being both on or both off (0/0 and 1/1) and two states corresponding to EEG being on/off while respiration is off/on (i.e. transition states).

Movement Planning Experiment Data

The aim of this experiment was to produce a model that predicts the onset of movement in humans by examining changes in electroencephalography (EEG) signals measured above the motor cortex. HMMs have been successfully used to predict finger movement using EEG signals [14, 15]. Using a separate source of information, here the electromyogram (EMG), we attempted to improve the prediction quality of the model. EEG recordings were taken from 5 subjects, 3 cm behind the C3 and C4 electrode positions (according to the 10-20 system), which will be referred to here as C3 and C4. The subjects were asked to move their left or right hand when given cues. EMG data was also recorded from the arms to mark the onset of movement. Both the EEG and EMG signals were sampled at 125 Hz. Reflection coefficients [16] of an autoregressive model were then calculated from the difference between the EEG signals at C3 and C4, in addition to the reflection coefficients computed from the EMG signals. The reflection coefficients were computed using short sliding windows shifted by a fixed step. The final signal for training was composed of 6-second-long sections centred around the actual movement.

We trained a CHMM and a standard HMM, both with Gaussian observation models, for comparison. The CHMM was trained using EEG and EMG data. For testing, however, the model's EMG channel was marginalised out, thus transforming the CHMM into a standard HMM which is then tested on EEG alone, i.e. training was performed on all available data while prediction was done on EEG alone. Figure 5 (top trace) shows the Viterbi path obtained by testing the marginalised CHMM on EEG data alone. The bottom trace shows the Viterbi path of the marginalised CHMM-EMG channel, which indicates the exact timing of the movements.

Figure 5: Viterbi paths of the hand movement data. The top trace is computed from the marginalised CHMM using only EEG data; the lower trace indicates the timing of the hand movements and is from the marginalised CHMM using only EMG data.

If no information transfer between the channels existed, the CHMM would degenerate into a product of two single-chain HMMs. Inspection of the transition probabilities and the observation models reveals that information from the EMG channel is used to change both the EEG observation model and the inter-chain transition probabilities. The use of the EMG during training focuses the model on the actual movement-planning processes in the brain. A single HMM, without further information from other sources, will inevitably latch onto many other brain processes, which leads to reduced prediction quality.
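One way to realise the "marginalising out" of a chain described above is sketched here. The paper does not spell out the exact construction, so the collapse below (averaging the coupled transition of the retained chain over an estimate of the other chain's stationary distribution, and keeping the retained chain's observation model) is an assumption, not the authors' procedure.

```python
import numpy as np

def collapse_to_hmm(P_x, P_y, pi_x, pi_y, n_iter=200):
    """Collapse a trained lag-1 CHMM onto its first chain (e.g. the EEG chain).

    Returns a single-chain transition matrix p(X_t | X_{t-1}) and the initial
    distribution; the first chain's observation model is reused unchanged.
    """
    KX, KY = P_x.shape[0], P_y.shape[0]
    # Estimate the stationary joint distribution of (X, Y) by iterating the joint transition.
    M = (P_x[:, :, :, None] * P_y.transpose(1, 0, 2)[:, :, None, :]).reshape(KX * KY, KX * KY)
    p = np.full(KX * KY, 1.0 / (KX * KY))
    for _ in range(n_iter):
        p = p @ M
    p_y_given_x = p.reshape(KX, KY)
    p_y_given_x /= p_y_given_x.sum(axis=1, keepdims=True)   # p(Y | X) under the stationary law

    # Marginal transition for chain X: average the coupling over the co-occurring Y states.
    A_x = np.einsum('ij,ijk->ik', p_y_given_x, P_x)
    A_x /= A_x.sum(axis=1, keepdims=True)
    return A_x, pi_x
```

The resulting single-chain model can then be decoded with a standard HMM Viterbi pass on the EEG features alone, which is how the top trace of Figure 5 is described as being produced.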

Sleep EEG Data

In a final example, we fitted a Bayesian autoregressive (AR) model [16] to a full-night EEG recording. The reflection coefficients were extracted from two channels, namely C3 and C4, using non-overlapping sliding windows of fixed length. The coefficients then formed the observation sequences for a CHMM. The estimation was performed in a completely unsupervised manner. For each of the two chains the number of hidden states was set to 3, corresponding to three sleep stages [17]: NREM, REM and Wake, a decision based on prior knowledge of the structure in the data. The input data were labelled by a human expert as deep sleep (sleep stage 4, S4), REM and Wake. Figure 6[a] depicts the hypnogram (human labels) together with the estimated state sequence, and Figure 6[b] shows the co-occurrences of the Viterbi labels with those of the hypnogram. Roughly speaking, of the states with the longest duration, one joint state corresponds to Wake, one to REM sleep, and two further states correspond to most non-REM sleep. Generally, a CHMM estimated using the maximum a posteriori procedure does make good sense of the data, but the correspondence between the human labels and the CHMM labels is not unique.

Figure 6: Comparison between estimated labels and hypnogram labels for sleep EEG. (a) CHMM state sequence (3-point median filtered joint Viterbi path) and hypnogram (human expert scoring); (b) co-occurrences between the Viterbi labels and the hypnogram labels.

CONCLUSIONS & FUTURE WORK

The maximum a posteriori EM procedure is very fast and stable, indeed much more so than a simple maximum likelihood estimator (for stability) [18] or sampling-based methods (for speed) [19]. Moreover, the smoothness and stability of the state sequence is much higher than that obtained using single-chain HMMs [20]. The results presented here are only qualitative in that we have not assessed, for example, the classification performance quantitatively. This is because the paper presents a step towards a fully Bayesian estimation of CHMM parameters using variational methods. These will then also permit the estimation of the true model, in particular the hidden state-space size and the lag. The CHMMs themselves may be expanded to more than just two coupled chains and may have continuous hidden state spaces or other observation models. One other possible direction is that of trimming unnecessary transition probabilities with the use of entropic priors [21]. More important for us, however, is the question of lag estimation. Future work will focus on model-lag selection, for which we envisage ensemble learning methods [22].

REFERENCES

[1] R. Bannister and C.J. Mathias. Autonomic Failure: A Textbook of Clinical Disorders of the Autonomic Nervous System. Oxford University Press, Oxford, 3rd edition.
[2] R.K. Kushwaha, W.J. Williams, and H. Shevrin. An Information Flow Technique for Category Related Evoked Potentials. IEEE Transactions on Biomedical Engineering, 39(2).
[3] M. Azzouzi and I.T. Nabney. Time Delay Estimation with Hidden Markov Models. In Ninth International Conference on Artificial Neural Networks. IEE, London.
[4] W.D. Penny and S.J. Roberts. Gaussian Observation Hidden Markov Models for EEG Analysis. Technical Report TR-98-12, Imperial College London.
[5] M. Brand. Coupled Hidden Markov Models for Modeling Interacting Processes. Technical Report 405, MIT Media Lab Perceptual Computing, June.
[6] S. Roweis and Z. Ghahramani. A Unifying Review of Linear Gaussian Models. Computation and Neural Systems, California Institute of Technology.
[7] L.R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2):257-286, 1989.


[8] B.J. Frey. Graphical Models for Machine Learning and Digital Communication. MIT Press.
[9] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, 2nd edition.
[10] P.M. Lee. Bayesian Statistics: An Introduction. Edward Arnold.
[11] G.J. McLachlan and T. Krishnan. The EM Algorithm and Extensions. John Wiley & Sons, New York, NY.
[12] S. Roweis and Z. Ghahramani. A Unifying Review of Linear Gaussian Models. Neural Computation, 11(2).
[13] I. Rezek and S.J. Roberts. Stochastic Complexity Measures for Physiological Signal Analysis. IEEE Transactions on Biomedical Engineering, 44(9).
[14] J.R. Wolpaw, N. Birbaumer, W.J. Heetderks, D.J. McFarland, P. Hunter Peckham, G. Schalk, E. Donchin, L.A. Quatrano, C.J. Robinson, and T.M. Vaughan. Brain-Computer Interface Technology: A Review of the First International Meeting. IEEE Transactions on Rehabilitation Engineering, 8(2).
[15] W.D. Penny, S.J. Roberts, E. Curran, and M. Stokes. EEG-Based Communication: A Pattern Recognition Approach. IEEE Transactions on Rehabilitation Engineering, 8(2).
[16] P. Sykacek, S. Roberts, A. Flexer, and G. Dorffner. A Probabilistic Approach to High Resolution Sleep Analysis. In G. Dorffner, editor, Proceedings of the International Conference on Artificial Neural Networks (ICANN), Springer Verlag, Wien/Berlin/New York, to appear.
[17] S.J. Roberts. Analysis of the Human Sleep Electroencephalogram using a Self-Organising Neural Network. PhD thesis, University of Oxford.
[18] I. Rezek and S.J. Roberts. Estimation of Coupled Hidden Markov Models with Application to Biosignal Interaction Modelling. In NNSP International Workshop on Neural Networks for Signal Processing.
[19] I. Rezek, P. Sykacek, and S.J. Roberts. Learning Interaction Dynamics with Coupled Hidden Markov Models. IEE Proceedings - Science, Measurement and Technology, 147(6), November.
[20] A. Flexer, P. Sykacek, I. Rezek, and G. Dorffner. Using Hidden Markov Models to Build an Automatic, Continuous and Probabilistic Sleep Stager for the SIESTA Project. In European Medical and Biological Engineering Conference.
[21] M. Brand. An Entropic Estimator for Structure Discovery. In Neural Information Processing Systems, NIPS 11.
[22] T.S. Jaakkola. Tutorial on Variational Approximation Methods.

Acknowledgements

The authors would like to thank Will Penny and Peter Sykacek for very helpful discussions and software. I.R. and M.G. are supported by the UK EPSRC, and M.G. is also supported by BBC Research and Development, to whom we are most grateful.
