A Higher-Order Interactive Hidden Markov Model and Its Applications
Wai-Ki Ching
Department of Mathematics, The University of Hong Kong

Abstract: In this talk, a higher-order Interactive Hidden Markov Model (IHMM) will be introduced. In the proposed higher-order IHMM, the hidden states depend on the observable states, and vice versa, so that the feedback effect of the observable states is taken into account in the process. Efficient procedures are given to estimate the model parameters. The model is then used in the detection of machine failure. Numerical examples are given to demonstrate that the proposed higher-order IHMM significantly outperforms the traditional HMM.
Joint work with Dong-Mei Zhu, Robert J. Elliott and Tak-Kuen Siu.

Outline
(1) A Brief Introduction to HMM.
(2) Interactive Hidden Markov Model (IHMM).
(3) The Higher-Order Interactive Hidden Markov Model (IHMM).
(4) Application of Higher-order IHMM to Machine Failure Detection.
(5) Concluding Remarks.

1. A Brief Introduction to HMM
Hidden Markov Models (HMMs) are widely used in many areas:
- Speech Recognition. L. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE, 77 (1989) 257-286.
- Computer Vision. H. Bunke and T. Caelli (Editors), Hidden Markov Models: Applications in Computer Vision, World Scientific, Singapore (2001).
- Bioinformatics. T. Koski, Hidden Markov Models for Bioinformatics, Kluwer Academic Publishers, Dordrecht (2001).
- Finance. Rogemar S. Mamon and Robert J. Elliott, Hidden Markov Models in Finance, Springer, New York (2007).

In an HMM, there are two types of states: observable states and hidden states. The hidden states follow a Markov chain and the observable states are driven by the hidden states.
To define an HMM, one has to specify the number of states of each type. One also needs to define or estimate the transition probabilities of the hidden states and the probability distribution of the observable states.
The major problem is to determine the transition probabilities of the hidden states, because the transitions among the hidden states are assumed to be unobservable.

1.1 The Basic Idea of HMM Through an Example.
We consider the process of choosing a die, throwing it and recording the number of dots. Suppose we have two six-faced dice A and B (faces 1, 2, 3, 4, 5, 6) such that Die A is fair and Die B is biased. The probability distributions of the dots obtained by throwing Die A and Die B are given in the table below.

         1      2      3      4      5      6
Die A   1/6    1/6    1/6    1/6    1/6    1/6
Die B   1/6    1/6    1/6    1/12   1/12   1/3

Table 1. Probability distributions of the dots for Die A and Die B.

Each time a die is chosen, Die B is chosen with probability $\alpha$ given that Die A was chosen last time, and Die A is chosen with probability $\beta$ given that Die B was chosen last time. This is a 2-state Markov chain with transition probability matrix (rows and columns ordered Die A, Die B):
$$\begin{pmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{pmatrix}.$$
Here the selection process is hidden because we assume that we do not know which die is actually being chosen. The chosen die is then thrown and the number of dots obtained (this is observable) is recorded. The following is a possible realization of the process:

Hidden (die chosen):      A  -> A  -> B  -> B  -> A
Transition probability:      1-α    α    1-β    β
Observable (dots):        5     4     6     5     6
Emission probability:    1/6   1/6   1/3   1/12  1/6
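The realization above can be reproduced by simulation. The following is a minimal sketch, not part of the talk: the function name `simulate`, the choice of Die A as the starting die, and the values passed for `alpha` and `beta` are all assumptions made for illustration.

```python
import random

FACES = [1, 2, 3, 4, 5, 6]
# Emission probabilities taken from Table 1.
EMISSION = {
    "A": [1/6, 1/6, 1/6, 1/6, 1/6, 1/6],
    "B": [1/6, 1/6, 1/6, 1/12, 1/12, 1/3],
}


def simulate(T, alpha, beta, seed=0):
    """Simulate T throws; returns (hidden dice, observed dots).

    Assumption: the process starts with Die A.
    """
    rng = random.Random(seed)
    die = "A"
    hidden, observed = [], []
    for _ in range(T):
        hidden.append(die)
        # Throw the currently chosen die.
        observed.append(rng.choices(FACES, weights=EMISSION[die])[0])
        # Switch dice with probability alpha (from A) or beta (from B).
        switch_prob = alpha if die == "A" else beta
        if rng.random() < switch_prob:
            die = "B" if die == "A" else "A"
    return hidden, observed


hidden, observed = simulate(10, alpha=0.3, beta=0.4, seed=42)
print(hidden)
print(observed)
```

Only `observed` would be available to an analyst; `hidden` is returned here purely so that estimates can be compared with the truth.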

The following are the usual model parameters of an HMM.
$N$: number of hidden states
$S = \{S_1, \ldots, S_N\}$: the set of hidden states
$M$: number of distinct observable states
$V = \{v_1, \ldots, v_M\}$: the set of observable states
$T$: length of the observation period
$K$: number of observable sequences (usually $K = 1$)
$\pi$: initial state distribution
$O^j = (o^j_1, o^j_2, \ldots, o^j_T)$: the $j$th observation sequence
$w_j$: the weighting of the $j$th observation sequence
$Q = (q_0, q_1, q_2, \ldots, q_T)$: the sequence of hidden states
$q_t$: hidden state at time $t$
$a_{ij}$: transition probability from hidden State $i$ to hidden State $j$
$b^k_j(v)$: the probability of symbol $v \in V$ being observed in State $j$ in the $k$-th sequence
$\lambda = (A, B, \pi)$: the model training parameters ($N$ and $M$ are fixed)

1.2 Model Parameter Estimation.
To construct an HMM, one has to solve three standard problems.
Problem (I): To efficiently compute $P(O \mid \lambda)$, the likelihood of a given observation sequence, when we are given the model parameters $\lambda = (A, B, \pi)$ and the observation sequence $O = O_1 O_2 \ldots O_T$. (Forward and Backward Algorithm (Baum, 1972).)
Problem (II): To find the most likely hidden sequence. (Viterbi algorithm (Viterbi, 1967), a DP approach.)
Problem (III): To adjust the parameters $\lambda = (A, B, \pi)$ of the model so as to maximize $P(O \mid \lambda)$. (Baum-Welch algorithm (Caelli, 2001), an EM-like algorithm.)
Once the model parameters are obtained, one can simulate the model easily. See the tutorial paper by Rabiner (1989).

(I) The Likelihood of an Observed Sequence
Suppose that $q = q_1 q_2 \ldots q_T$ is a sequence of hidden states, with $q_1$ as the initial state. Then we have
$$\mathrm{Prob}(q) = \pi_{q_1} a_{q_1 q_2} a_{q_2 q_3} \cdots a_{q_{T-1} q_T}.$$
The probability of observing the sequence $O = O_1 O_2 \ldots O_T$, given the sequence of hidden states $q$, is
$$\mathrm{Prob}(O \mid q) = b_{q_1}(O_1)\, b_{q_2}(O_2) \cdots b_{q_T}(O_T).$$
Hence we have
$$\mathrm{Prob}(O) = \sum_{\text{all possible } q} \pi_{q_1} b_{q_1}(O_1)\, a_{q_1 q_2} b_{q_2}(O_2) \cdots a_{q_{T-1} q_T} b_{q_T}(O_T).$$
It is not difficult to show that the computational cost of obtaining $\mathrm{Prob}(O)$ is $O(T N^T)$ if we compute each term in the last summation and sum them up.

To speed up the computation, the following forward and backward procedures (Baum, 1972) have been used. Define
$$\alpha_t(i) = \mathrm{Prob}(O_1 O_2 \ldots O_t,\ q_t = S_i).$$
The procedure of the forward algorithm is as follows:
(Step F1) Initialization: $\alpha_1(i) = \pi_i b_i(O_1)$ for $1 \le i \le N$.
(Step F2) Recursion: $\alpha_t(j) = b_j(O_t) \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}$ for $2 \le t \le T$ and $1 \le j \le N$.
(Step F3) Termination: $\mathrm{Prob}(O) = \sum_{i=1}^{N} \alpha_T(i)$.
The computational cost of the forward algorithm is $O(T N^2)$, which is much lower than $O(T N^T)$ as $T$ is usually large.
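Steps F1 to F3 can be sketched for the two-dice example as follows. This is an illustrative sketch only: the values of `ALPHA`, `BETA` and the uniform initial distribution `PI` are assumptions, and the brute-force sum over all $N^T$ hidden paths is included only to check the forward recursion on a short sequence.

```python
from itertools import product

# Hidden states: 0 = Die A (fair), 1 = Die B (biased).
ALPHA, BETA = 0.3, 0.4                      # assumed switching probabilities
A = [[1 - ALPHA, ALPHA], [BETA, 1 - BETA]]  # a_ij
B = [[1/6] * 6, [1/6, 1/6, 1/6, 1/12, 1/12, 1/3]]  # b_j(v) for faces 1..6
PI = [0.5, 0.5]                             # assumed initial distribution


def forward(obs, A, B, pi):
    """Return Prob(O) via Steps F1-F3; obs holds 0-based symbol indices."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]            # Step F1
    for o in obs[1:]:                                           # Step F2
        alpha = [B[j][o] * sum(alpha[i] * A[i][j] for i in range(n))
                 for j in range(n)]
    return sum(alpha)                                           # Step F3


def brute_force(obs, A, B, pi):
    """Sum the joint probability over every hidden path (cost O(T N^T))."""
    n, total = len(pi), 0.0
    for q in product(range(n), repeat=len(obs)):
        p = pi[q[0]] * B[q[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[q[t - 1]][q[t]] * B[q[t]][obs[t]]
        total += p
    return total


# Observed dots 5, 4, 6, 5, 6 from the realization in Section 1.1.
obs = [face - 1 for face in (5, 4, 6, 5, 6)]
print(forward(obs, A, B, PI), brute_force(obs, A, B, PI))
```

The two printed values should agree up to rounding, while the forward pass touches only $N^2$ products per time step.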

To define the backward algorithm, let
$$\beta_t(i) = \mathrm{Prob}(O_{t+1} O_{t+2} \ldots O_T \mid q_t = S_i).$$
Then the procedure of the backward algorithm is as follows:
(Step B1) Initialization: $\beta_T(i) = 1$ for $1 \le i \le N$.
(Step B2) Recursion: $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)$ for $1 \le t \le T-1$ and $1 \le i \le N$.
(Step B3) Termination: $\mathrm{Prob}(O) = \sum_{i=1}^{N} \beta_1(i)\, \pi_i\, b_i(O_1)$.
The computational cost of the backward algorithm is also $O(T N^2)$.
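A corresponding sketch of Steps B1 to B3 is given below, again for the two-dice example. The numerical values of `ALPHA`, `BETA` and `PI` are assumed for illustration and are not taken from the talk.

```python
# Hidden states: 0 = Die A (fair), 1 = Die B (biased).
ALPHA, BETA = 0.3, 0.4                      # assumed switching probabilities
A = [[1 - ALPHA, ALPHA], [BETA, 1 - BETA]]
B = [[1/6] * 6, [1/6, 1/6, 1/6, 1/12, 1/12, 1/3]]
PI = [0.5, 0.5]                             # assumed initial distribution


def backward(obs, A, B, pi):
    """Return Prob(O) via Steps B1-B3; obs holds 0-based symbol indices."""
    n = len(pi)
    beta = [1.0] * n                                            # Step B1
    # Step B2: walk backwards over O_T, O_{T-1}, ..., O_2.
    for o_next in reversed(obs[1:]):
        beta = [sum(A[i][j] * B[j][o_next] * beta[j] for j in range(n))
                for i in range(n)]
    # Step B3: combine beta_1 with the initial distribution and O_1.
    return sum(beta[i] * pi[i] * B[i][obs[0]] for i in range(n))


obs = [face - 1 for face in (5, 4, 6, 5, 6)]
print(backward(obs, A, B, PI))
```

On any sequence the value returned here should coincide with the one produced by the forward procedure, since both compute the same likelihood.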

(II) Estimation for the Most Likely Hidden States
The Viterbi algorithm (Viterbi, 1967) can be used to find the most likely sequence of hidden states given the HMM model parameters and an observed sequence. Define
$$\delta_t(i) = \max_{q_1, q_2, \ldots, q_{t-1}} \mathrm{Prob}(q_1 q_2 \ldots q_t,\ O_1 O_2 \ldots O_t;\ q_t = S_i),$$
which is the highest probability along a single path up to time $t$. We note that
$$\delta_t(j) = b_j(O_t) \max_i \{\delta_{t-1}(i)\, a_{ij}\}.$$
A Dynamic Programming (DP) approach can then be applied to solve the problem. To obtain the optimal solution, we need to keep track of the solution of the above maximization problem via the variable $\theta_t(i)$ below.

The DP procedure is given as follows:
(Step V1) Initialization: $\delta_1(i) = \pi_i b_i(O_1)$ and $\theta_1(i) = 0$ for $1 \le i \le N$.
(Step V2) Recursion: for $2 \le t \le T$ and $1 \le j \le N$,
$$\delta_t(j) = \max_{1 \le i \le N} \{\delta_{t-1}(i)\, a_{ij}\}\, b_j(O_t), \qquad \theta_t(j) = \arg\max_{1 \le i \le N} \{\delta_{t-1}(i)\, a_{ij}\}.$$
(Step V3) Termination: $P^* = \max_{1 \le i \le N} \delta_T(i)$ and $q^*_T = \arg\max_{1 \le i \le N} \delta_T(i)$. Here $P^*$ is the optimal likelihood, and $q^*_T$ is the most likely hidden state at time $T$.
(Step V4) Backtracking: $q^*_t = \theta_{t+1}(q^*_{t+1})$ for $t = T-1, T-2, \ldots, 2, 1$. Here $q^*_t$ is the most likely hidden state at time $t$.
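Steps V1 to V4 can be sketched as follows for the two-dice example. The parameter values `ALPHA`, `BETA` and `PI` are assumed for illustration; `path_probability` is a hypothetical helper added only so that the result can be checked against the joint probability of the returned path.

```python
# Hidden states: 0 = Die A (fair), 1 = Die B (biased).
ALPHA, BETA = 0.1, 0.2                      # assumed switching probabilities
A = [[1 - ALPHA, ALPHA], [BETA, 1 - BETA]]
B = [[1/6] * 6, [1/6, 1/6, 1/6, 1/12, 1/12, 1/3]]
PI = [0.5, 0.5]                             # assumed initial distribution


def viterbi(obs, A, B, pi):
    """Return (P*, most likely hidden path) following Steps V1-V4."""
    n, T = len(pi), len(obs)
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]            # Step V1
    theta = [[0] * n]                                           # theta_1(i) = 0
    for t in range(1, T):                                       # Step V2
        back = [max(range(n), key=lambda i: delta[i] * A[i][j])
                for j in range(n)]
        delta = [delta[back[j]] * A[back[j]][j] * B[j][obs[t]]
                 for j in range(n)]
        theta.append(back)
    path = [max(range(n), key=lambda i: delta[i])]              # Step V3
    for t in range(T - 1, 0, -1):                               # Step V4
        path.append(theta[t][path[-1]])
    path.reverse()
    return max(delta), path


def path_probability(q, obs, A, B, pi):
    """Joint probability of a hidden path q and the observations."""
    p = pi[q[0]] * B[q[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[q[t - 1]][q[t]] * B[q[t]][obs[t]]
    return p


obs = [face - 1 for face in (5, 4, 6, 5, 6)]
p_star, path = viterbi(obs, A, B, PI)
print(p_star, ["AB"[s] for s in path])
```

The list `theta` plays the role of $\theta_t(i)$: it stores, for each time and state, the predecessor that achieved the maximum, which is all that backtracking needs.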

(III) The Baum-Welch (BW) Algorithm (Caelli, 2001)
Define
$$\xi_t(i, j) = \mathrm{Prob}(q_t = S_i,\ q_{t+1} = S_j \mid O, A, B, \pi),$$
the probability of being in state $S_i$ at time $t$ and having a transition to state $S_j$ at time $t+1$, given the observed sequence and the model. We note that
$$\xi_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, \beta_{t+1}(j)\, b_j(O_{t+1})}{\sum_i \sum_j \alpha_t(i)\, a_{ij}\, \beta_{t+1}(j)\, b_j(O_{t+1})}.$$
We define
$$\gamma_t(i) = \mathrm{Prob}(q_t = S_i \mid O, A, B, \pi),$$
which is the probability of being in state $S_i$ at time $t$ given the observed sequence and the model. Then we have
$$\gamma_t(i) = \sum_j \xi_t(i, j).$$

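The Baum-Welch (BW) algorithm is as follows:
(Step BW1) Choose a set of initial parameters $\lambda = \{A, B, \pi\}$.
(Step BW2) Re-estimate the parameters by
$$\bar{\pi}_i = \gamma_1(i) \quad \text{for } 1 \le i \le N,$$
$$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \quad \text{for } 1 \le i \le N \text{ and } 1 \le j \le N,$$
$$\bar{b}_j(k) = \frac{\sum_{t=1}^{T} \gamma_t(j)\, I_{O_t = k}}{\sum_{t=1}^{T} \gamma_t(j)} \quad \text{for } 1 \le j \le N \text{ and } 1 \le k \le M,$$
where
$$I_{O_t = k} = \begin{cases} 1 & \text{if } O_t = k; \\ 0 & \text{otherwise.} \end{cases}$$
(Step BW3) Let $\bar{A} = \{\bar{a}_{ij}\}_{ij}$, $\bar{B} = \{\bar{b}_j(k)\}_{jk}$ and $\bar{\pi} = \{\bar{\pi}_i\}$.
(Step BW4) Set $\bar{\lambda} = \{\bar{A}, \bar{B}, \bar{\pi}\}$.
(Step BW5) If $\bar{\lambda} = \lambda$, terminate the procedure; otherwise, set $\lambda$ to be $\bar{\lambda}$ and return to (Step BW2).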
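One pass of Step BW2 can be sketched as below. This is a minimal illustration under stated assumptions: it handles a single short observation sequence without scaling (so it would underflow on long sequences), the initial parameters `A0`, `B0`, `PI0` and the sequence `obs` are made up, and $\gamma_t(i)$ is computed as $\alpha_t(i)\beta_t(i)/\mathrm{Prob}(O)$, which agrees with $\sum_j \xi_t(i,j)$ for $t < T$ and is also defined at $t = T$.

```python
def forward_backward(obs, A, B, pi):
    """Return the full tables alpha[t][i] and beta[t][i] (0-based t)."""
    n, T = len(pi), len(obs)
    al = [[0.0] * n for _ in range(T)]
    be = [[1.0] * n for _ in range(T)]
    for i in range(n):
        al[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):
        for j in range(n):
            al[t][j] = B[j][obs[t]] * sum(al[t - 1][i] * A[i][j] for i in range(n))
    for t in range(T - 2, -1, -1):
        for i in range(n):
            be[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * be[t + 1][j] for j in range(n))
    return al, be


def baum_welch_step(obs, A, B, pi):
    """One re-estimation (Step BW2); also returns Prob(O) under the input model."""
    n, m, T = len(pi), len(B[0]), len(obs)
    al, be = forward_backward(obs, A, B, pi)
    like = sum(al[T - 1])
    xi = [[[al[t][i] * A[i][j] * B[j][obs[t + 1]] * be[t + 1][j] / like
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    gamma = [[al[t][i] * be[t][i] / like for i in range(n)] for t in range(T)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T))
              for k in range(m)] for j in range(n)]
    return new_A, new_B, new_pi, like


# Assumed starting guess and a made-up observation sequence (0-based faces).
A0 = [[0.7, 0.3], [0.4, 0.6]]
B0 = [[1/6] * 6, [1/6, 1/6, 1/6, 1/12, 1/12, 1/3]]
PI0 = [0.6, 0.4]
obs = [4, 3, 5, 4, 5, 5, 0, 1, 5, 5, 2, 5]

A1, B1, PI1, like0 = baum_welch_step(obs, A0, B0, PI0)
_, _, _, like1 = baum_welch_step(obs, A1, B1, PI1)
print(like0, like1)
```

Repeating the step until the parameters stop changing corresponds to the loop in Steps BW2 to BW5; in practice the exact equality test in BW5 is usually replaced by a tolerance.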

2. Interactive Hidden Markov Model (IHMM)
Suppose we are given a categorical data sequence of six possible (observable) sales volumes (1, 2, 3, 4, 5, 6) of a certain product as follows:
1, 2, 1, 2, 1, 2, 2, 4, 2, 5, 6, 2, 1, ....
Here 1 = very high, 2 = high, 3 = moderate high, 4 = moderate low, 5 = low, 6 = very low.
Suppose there are two hidden states: Bad Market Situation (A) and Good Market Situation (B).
In the good market situation, the probability distribution of the sales volume is assumed to follow the distribution
$$\left(\tfrac14, \tfrac14, \tfrac14, \tfrac14, 0, 0\right),$$
while in the bad market situation, the probability distribution of the sales volume is assumed to follow the distribution
$$\left(0, 0, \tfrac14, \tfrac14, \tfrac14, \tfrac14\right).$$

In our model (Ching et al. 2009), we assume that the market situation and the sales volume interact with each other: the sales volume (observable state) can be used to infer the market situation. In the Markov chain, the states are A, B, 1, 2, 3, 4, 5 and 6. We assume that when the observable state is $i$, the probabilities that the hidden state is A and B in the next time step are $\alpha_i$ and $1 - \alpha_i$, respectively. The transition probability matrix governing the Markov chain (rows and columns ordered A, B, 1, ..., 6) is
$$P_2 = \begin{pmatrix}
0 & 0 & \frac14 & \frac14 & \frac14 & \frac14 & 0 & 0 \\
0 & 0 & 0 & 0 & \frac14 & \frac14 & \frac14 & \frac14 \\
\alpha_1 & 1-\alpha_1 & 0 & 0 & 0 & 0 & 0 & 0 \\
\alpha_2 & 1-\alpha_2 & 0 & 0 & 0 & 0 & 0 & 0 \\
\alpha_3 & 1-\alpha_3 & 0 & 0 & 0 & 0 & 0 & 0 \\
\alpha_4 & 1-\alpha_4 & 0 & 0 & 0 & 0 & 0 & 0 \\
\alpha_5 & 1-\alpha_5 & 0 & 0 & 0 & 0 & 0 & 0 \\
\alpha_6 & 1-\alpha_6 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}.$$

In order to define the IHMM, one has to estimate $\alpha = (\alpha_1, \alpha_2, \alpha_3, \alpha_4, \alpha_5, \alpha_6)$ from an observed data sequence. We consider the two-step transition probability matrix, which is block diagonal:
$$P_2^2 = \begin{pmatrix} R & 0 \\ 0 & \tilde{P}_2 \end{pmatrix}, \qquad
R = \begin{pmatrix}
\frac{\alpha_1+\alpha_2+\alpha_3+\alpha_4}{4} & 1 - \frac{\alpha_1+\alpha_2+\alpha_3+\alpha_4}{4} \\
\frac{\alpha_3+\alpha_4+\alpha_5+\alpha_6}{4} & 1 - \frac{\alpha_3+\alpha_4+\alpha_5+\alpha_6}{4}
\end{pmatrix},$$
where the $i$-th row of the $6 \times 6$ block $\tilde{P}_2$ is
$$\left(\tfrac{\alpha_i}{4},\ \tfrac{\alpha_i}{4},\ \tfrac14,\ \tfrac14,\ \tfrac{1-\alpha_i}{4},\ \tfrac{1-\alpha_i}{4}\right), \quad i = 1, \ldots, 6.$$

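We then extract the one-step transition probability matrix of the observable states from the matrix $P_2^2$ as follows:
$$\tilde{P}_2 = \begin{pmatrix}
\frac{\alpha_1}{4} & \frac{\alpha_1}{4} & \frac14 & \frac14 & \frac{1-\alpha_1}{4} & \frac{1-\alpha_1}{4} \\
\frac{\alpha_2}{4} & \frac{\alpha_2}{4} & \frac14 & \frac14 & \frac{1-\alpha_2}{4} & \frac{1-\alpha_2}{4} \\
\frac{\alpha_3}{4} & \frac{\alpha_3}{4} & \frac14 & \frac14 & \frac{1-\alpha_3}{4} & \frac{1-\alpha_3}{4} \\
\frac{\alpha_4}{4} & \frac{\alpha_4}{4} & \frac14 & \frac14 & \frac{1-\alpha_4}{4} & \frac{1-\alpha_4}{4} \\
\frac{\alpha_5}{4} & \frac{\alpha_5}{4} & \frac14 & \frac14 & \frac{1-\alpha_5}{4} & \frac{1-\alpha_5}{4} \\
\frac{\alpha_6}{4} & \frac{\alpha_6}{4} & \frac14 & \frac14 & \frac{1-\alpha_6}{4} & \frac{1-\alpha_6}{4}
\end{pmatrix}.$$
The advantage of looking at the matrix $\tilde{P}_2$ is that it gives the one-step transition probabilities from one observable state to another observable state. In this case, however, we do not have a closed-form solution for the stationary distribution of the process. The six parameters $\alpha_1, \ldots, \alpha_6$ remain to be estimated.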
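The block structure of the two-step matrix can be checked directly. The sketch below builds the 8-state matrix with exact rational arithmetic, squares it, and extracts the block for the observable states; the numerical values chosen for `alpha` are hypothetical and serve only as an illustration.

```python
from fractions import Fraction as F


def ihmm_matrix(alpha):
    """8x8 transition matrix with states ordered A, B, 1, ..., 6."""
    q = F(1, 4)
    rows = [[0, 0, q, q, q, q, 0, 0],        # hidden state A
            [0, 0, 0, 0, q, q, q, q]]        # hidden state B
    rows += [[a, 1 - a, 0, 0, 0, 0, 0, 0] for a in alpha]
    return rows


def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


# Hypothetical values of alpha_1, ..., alpha_6.
alpha = [F(9, 10), F(4, 5), F(1, 2), F(1, 2), F(1, 5), F(1, 10)]
P2 = ihmm_matrix(alpha)
P2sq = matmul(P2, P2)

# One-step transition matrix among the observable states 1..6.
obs_block = [row[2:] for row in P2sq[2:]]
for row in obs_block:
    print([str(x) for x in row])
```

Because a hidden state is always followed by an observable state and vice versa, the two off-diagonal blocks of the squared matrix vanish, which is why the observable-to-observable block is itself a transition matrix.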

2.1 Estimation Method
To estimate the parameters $\alpha_i$, we first estimate the one-step transition probability matrix from the given observed sequence. This can be done by counting the transition frequencies of the states in the observed sequence. Suppose the estimate for this example is given by $\hat{P}_2$. We expect $\tilde{P}_2 \approx \hat{P}_2$ and, hence, $\alpha_i$ can be obtained by solving the following minimization problem:
$$\min_{\alpha_i} \left\| \tilde{P}_2 - \hat{P}_2 \right\|_F^2 \quad \text{subject to } 0 \le \alpha_i \le 1.$$
Here $\|\cdot\|_F$ is the Frobenius norm, i.e.
$$\|A\|_F^2 = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij}^2.$$
We remark that other matrix norms can also be used as the objective function.
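For the sales example this minimization separates by rows, because $\alpha_i$ appears only in row $i$ of the observable-state transition matrix, whose entries are $(\alpha_i/4, \alpha_i/4, 1/4, 1/4, (1-\alpha_i)/4, (1-\alpha_i)/4)$. Setting the derivative of the row-wise squared error to zero gives $\alpha_i = \tfrac12 + \hat{q}_{i1} + \hat{q}_{i2} - \hat{q}_{i5} - \hat{q}_{i6}$, where $\hat{q}_{ij}$ are the estimated transition frequencies, and clipping to $[0, 1]$ enforces the constraint. The sketch below applies this to the visible prefix of the data sequence only (the full sequence is not given), so the numbers are illustrative; the function names are hypothetical.

```python
def transition_estimate(seq, n_states=6):
    """Row-normalized transition counts; a row is None if its state never occurs
    as a predecessor in the sequence."""
    counts = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(seq, seq[1:]):
        counts[a - 1][b - 1] += 1
    est = []
    for row in counts:
        total = sum(row)
        est.append([c / total for c in row] if total else None)
    return est


def fit_alpha(est):
    """Least-squares alpha_i per row, clipped to [0, 1]."""
    alphas = []
    for q in est:
        if q is None:
            alphas.append(None)      # no data for this state
            continue
        a = 0.5 + q[0] + q[1] - q[4] - q[5]
        alphas.append(min(1.0, max(0.0, a)))
    return alphas


# Visible prefix of the sales-volume sequence (the remainder is elided in the talk).
prefix = [1, 2, 1, 2, 1, 2, 2, 4, 2, 5, 6, 2, 1]
print(fit_alpha(transition_estimate(prefix)))
```

With such a short prefix several estimates hit the bounds 0 or 1 and state 3 has no data at all, which illustrates why a reasonably long observed sequence is needed before the estimates become meaningful.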

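For a general situation, we define the following transition probability matrices:
$$P = \begin{pmatrix}
p_{11} & p_{12} & \cdots & p_{1n} \\
p_{21} & p_{22} & \cdots & p_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
p_{m1} & p_{m2} & \cdots & p_{mn}
\end{pmatrix} \qquad (1)$$
and
$$\alpha = \begin{pmatrix}
\alpha_{11} & \alpha_{12} & \cdots & \alpha_{1m} \\
\alpha_{21} & \alpha_{22} & \cdots & \alpha_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
\alpha_{n1} & \alpha_{n2} & \cdots & \alpha_{nm}
\end{pmatrix}. \qquad (2)$$
Here the matrix $\alpha$ is unknown, but the matrix $P$ can be unknown or given.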

Case I: P is Known.
The one-step transition probability matrix of the observable states is given by $\tilde{P}_2 = \alpha P$, i.e.
$$[\tilde{P}_2]_{ij} = \sum_{k=1}^{m} \alpha_{ik}\, p_{kj}, \quad i, j = 1, 2, \ldots, n.$$
Here the $\alpha_{ij}$ are unknown but the probabilities $p_{ij}$ are given. Suppose $Q = \{[Q]_{ij}\}$ is the one-step transition probability matrix estimated from the observed sequence. Then, for each fixed $i$, $\alpha_{ik}$, $k = 1, 2, \ldots, m$, can be obtained by solving the following constrained least-squares problem:
$$\min_{\alpha_{ik}} \sum_{j=1}^{n} \left( \sum_{k=1}^{m} \alpha_{ik}\, p_{kj} - [Q]_{ij} \right)^2$$
subject to
$$\sum_{k=1}^{m} \alpha_{ik} = 1 \quad \text{and} \quad \alpha_{ik} \ge 0.$$
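The talk does not say which solver is used for this constrained least-squares problem. One simple option, sketched below as an assumption rather than as the authors' method, is projected gradient descent: take a gradient step on the squared error and project back onto the probability simplex with the standard sort-based Euclidean projection. The function names, step size rule and iteration count are illustrative choices.

```python
def project_simplex(v):
    """Euclidean projection of v onto {w : w_k >= 0, sum_k w_k = 1}."""
    u = sorted(v, reverse=True)
    css, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        css += uj
        t = (css - 1.0) / j
        if uj - t > 0:          # holds for a prefix of the sorted values
            theta = t
    return [max(x - theta, 0.0) for x in v]


def fit_row(q, P, iters=2000):
    """Minimize sum_j (sum_k a_k P[k][j] - q[j])^2 over the simplex."""
    m, n = len(P), len(q)
    # Step 1/L with L = 2 * ||P||_F^2, an upper bound on the gradient's
    # Lipschitz constant.
    step = 1.0 / (2.0 * sum(p * p for row in P for p in row))
    a = [1.0 / m] * m
    for _ in range(iters):
        resid = [sum(a[k] * P[k][j] for k in range(m)) - q[j] for j in range(n)]
        grad = [2.0 * sum(resid[j] * P[k][j] for j in range(n)) for k in range(m)]
        a = project_simplex([a[k] - step * grad[k] for k in range(m)])
    return a


# Known P from the sales example (rows: hidden states A and B).
P = [[0.25, 0.25, 0.25, 0.25, 0.0, 0.0],
     [0.0, 0.0, 0.25, 0.25, 0.25, 0.25]]
true_row = [0.3, 0.7]                       # hypothetical alpha_i row
q = [sum(true_row[k] * P[k][j] for k in range(2)) for j in range(6)]
print(fit_row(q, P))
```

Each row $i$ of $\alpha$ is fitted independently by calling `fit_row` with the $i$-th row of $Q$, which mirrors the "for each fixed $i$" structure of the problem.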

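Case II: P is Unknown.
Suppose all the probabilities $p_{ij}$ are also unknown. We adopt a bi-level programming technique that alternates between two subproblems.

($\alpha$-subproblem) Given $p^{(h-1)}$, solve for $\alpha^{(h)}_{ik}$:
$$\min_{\alpha^{(h)}_{ik}} \sum_{j=1}^{n} \left( \sum_{k=1}^{m} \alpha^{(h)}_{ik}\, p^{(h-1)}_{kj} - [Q]_{ij} \right)^2 \quad \text{subject to } \sum_{k=1}^{m} \alpha^{(h)}_{ik} = 1 \text{ and } \alpha^{(h)}_{ik} \ge 0.$$
($p$-subproblem) Given $\alpha^{(h)}$, solve for $p^{(h)}_{ik}$:
$$\min_{p^{(h)}_{ik}} \sum_{j=1}^{n} \left( \sum_{k=1}^{m} \alpha^{(h)}_{ik}\, p^{(h)}_{kj} - [Q]_{ij} \right)^2 \quad \text{subject to } \sum_{k=1}^{n} p^{(h)}_{ik} = 1 \text{ and } p^{(h)}_{ik} \ge 0.$$

The procedure is:
Initialize $p^{(0)}_{ij}$; $e = 1$; $h = 1$;
Solve the $\alpha$-subproblem and then the $p$-subproblem for $h = 1$;
While $e >$ tolerance,
  $h := h + 1$;
  Solve the $\alpha$-subproblem for $\alpha^{(h)}_{ik}$;
  Solve the $p$-subproblem for $p^{(h)}_{ik}$;
  $e := \left( \|\alpha^{(h)} - \alpha^{(h-1)}\|_2^2 + \|P^{(h)} - P^{(h-1)}\|_2^2 \right) / N$;
end.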

3. The Higher-Order Interactive Hidden Markov Model (IHMM)
$x_t$: the probability vector of the hidden sequence at time $t$.
$y_t$: the probability vector of the observable sequence at time $t$.
Relationship:
$$x_t = \sum_{i=1}^{h} \lambda_{t-i}\, P_{t-i}\, y_{t-i}, \qquad y_t = \sum_{i=1}^{k} \mu_{t-i+1}\, M_{t-i+1}\, x_{t-i+1},$$
$$\sum_i \lambda_i = \sum_i \mu_i = 1, \qquad \lambda_i, \mu_i \ge 0.$$
For simplicity of discussion, we consider a second-order model. Setting $k = h = 2$, the model becomes
$$\begin{cases} x_t = \lambda P y_{t-1} + (1-\lambda) Q y_{t-2}, \\ y_t = \mu M x_t + (1-\mu) N x_{t-1}. \end{cases}$$

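The dynamics of $y_t$:
$$y_t = \lambda\mu MP\, y_{t-1} + [(1-\lambda)\mu MQ + \lambda(1-\mu) NP]\, y_{t-2} + (1-\lambda)(1-\mu) NQ\, y_{t-3}.$$
We write
$$z_t = \begin{pmatrix} y_t \\ y_{t-1} \\ y_{t-2} \end{pmatrix}
= \begin{pmatrix} C_1 & C_2 & C_3 \\ I & 0 & 0 \\ 0 & I & 0 \end{pmatrix}
\begin{pmatrix} y_{t-1} \\ y_{t-2} \\ y_{t-3} \end{pmatrix} = C z_{t-1},$$
where
$$\lambda\mu MP = C_1, \qquad (1-\lambda)\mu MQ + \lambda(1-\mu) NP = C_2, \qquad (1-\lambda)(1-\mu) NQ = C_3.$$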
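The substitution that leads to $C_1$, $C_2$ and $C_3$ can be checked numerically. The sketch below uses small hypothetical column-stochastic matrices `P`, `Q`, `M`, `N` and assumed weights `LAM`, `MU` (none of these values come from the talk), advances $y_t$ once through the two-stage model and once through the combined recursion, and compares the results.

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def matvec(X, v):
    return [sum(x * vi for x, vi in zip(row, v)) for row in X]


def scale(c, X):
    return [[c * x for x in row] for row in X]


def madd(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def vlin(a, u, b, v):
    """Return a*u + b*v for vectors u, v."""
    return [a * x + b * y for x, y in zip(u, v)]


# Hypothetical parameters; every matrix has columns summing to 1.
LAM, MU = 0.6, 0.7
P = [[0.9, 0.2], [0.1, 0.8]]
Q = [[0.5, 0.3], [0.5, 0.7]]
M = [[0.8, 0.4], [0.2, 0.6]]
N = [[0.6, 0.1], [0.4, 0.9]]

C1 = scale(LAM * MU, matmul(M, P))
C2 = madd(scale((1 - LAM) * MU, matmul(M, Q)), scale(LAM * (1 - MU), matmul(N, P)))
C3 = scale((1 - LAM) * (1 - MU), matmul(N, Q))


def step_two_stage(y1, y2, y3):
    """y_t from y_{t-1}, y_{t-2}, y_{t-3} via x_t and x_{t-1}."""
    x_t = vlin(LAM, matvec(P, y1), 1 - LAM, matvec(Q, y2))
    x_prev = vlin(LAM, matvec(P, y2), 1 - LAM, matvec(Q, y3))
    return vlin(MU, matvec(M, x_t), 1 - MU, matvec(N, x_prev))


def step_combined(y1, y2, y3):
    """y_t = C1 y_{t-1} + C2 y_{t-2} + C3 y_{t-3}."""
    a, b, c = matvec(C1, y1), matvec(C2, y2), matvec(C3, y3)
    return [a[i] + b[i] + c[i] for i in range(len(a))]


y1, y2, y3 = [0.2, 0.8], [0.5, 0.5], [1.0, 0.0]
print(step_two_stage(y1, y2, y3), step_combined(y1, y2, y3))
```

Since the products of column-stochastic matrices are column-stochastic, every column of `C1` sums to $\lambda\mu$, and similarly for `C2` and `C3`; this is the property that the column-sum constraints in the estimation problem rely on.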

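Parameter estimation:
$$\min_{C, \gamma_i} \sum_{t=3}^{T} \left\| z_t - C z_{t-1} \right\|_F^2$$
subject to
$$0 \le C \le 1, \qquad \sum_{j=1}^{n} c^i_{j1} = \sum_{j=1}^{n} c^i_{j2} = \cdots = \sum_{j=1}^{n} c^i_{jn} = \gamma_i \ (i = 1, 2, 3), \qquad \gamma_1 + \gamma_2 + \gamma_3 = 1.$$
Here $\|A\|_F = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij}^2}$ is the Frobenius norm, $\sum_{j=1}^{n} c^i_{jk}$ is the $k$-th column sum of the matrix $C_i$, and
$$\lambda\mu = \gamma_1, \qquad (1-\lambda)\mu + \lambda(1-\mu) = \gamma_2, \qquad (1-\lambda)(1-\mu) = \gamma_3.$$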

Based on non-negative matrix factorization (NMF) and the idea of bi-level optimization, the parameters $M$, $N$, $P$ and $Q$ can be estimated. To avoid being trapped in a local optimum, we choose the initial guess randomly a number of times (say 100) and take the best result. The tolerance is set to 0.001 in our numerical examples.
Step 1: Initialize $M^{(0)}$, $N^{(0)}$; set $h = 1$.
Step 2: With the sub-problem algorithm from Lin, C.J. (2007), solve for $P^{(h)}$ and $Q^{(h)}$ by minimizing $\| \lambda\mu M^{(h-1)} P^{(h)} - C_1 \|_F^2$ and $\| (1-\lambda)(1-\mu) N^{(h-1)} Q^{(h)} - C_3 \|_F^2$.
Step 3: Solve for $M^{(h)}$ and $N^{(h)}$ by minimizing $\| C_2 - (1-\lambda)\mu M^{(h)} Q^{(h)} - \lambda(1-\mu) N^{(h)} P^{(h)} \|_F^2$ subject to the column sums of $M^{(h)}$ and $N^{(h)}$ each being 1 and $0 \le M^{(h)}, N^{(h)} \le 1$.
Step 4: If $\left( \| M^{(h)} - M^{(h-1)} \|_F^2 + \| N^{(h)} - N^{(h-1)} \|_F^2 \right) / W <$ tolerance, stop; otherwise set $h = h + 1$ and go back to Step 2.

4. Application of Higher-order IHMM to Machine Failure Detection
We consider a production system of $n$ independent, indistinguishable production units, each of which can produce items of the same product (Tai, Ching and Chan 2009). A production unit is either in the normal state $w_1$ or in the subnormal state $w_2$. We assume the following:
(i) While a production unit is producing an item, the state of the production unit does not change.
(ii) If a production unit is in state $w_1$, immediately after an item is produced it will deteriorate to state $w_2$ with probability $p$ and remain in state $w_1$ with probability $1 - p$.

(iii) If a production unit is in state $w_2$, it will remain in this state until a perfect maintenance action is carried out, which will bring the unit back to state $w_1$.
(iv) An item produced can be classified as either conforming or nonconforming.
(v) When a production unit is in state $w_i$, it will produce a conforming item with probability $r_i$ ($i = 1, 2$) and a nonconforming item with probability $1 - r_i$ ($i = 1, 2$), where $r_1 > r_2$.

We assume that during production the states of the production units are unobservable, i.e., $w_1$ and $w_2$ are hidden states. The states of the production units are known only if production is stopped and a full inspection of the system is carried out, which is expensive.
Furthermore, we assume that all items produced are inspected, and that the inspection is perfect, so that it correctly indicates whether an item is conforming or nonconforming. The number of nonconforming items is observable.
The production system is said to be in State $i$ if $i$ production units are in state $w_2$ and the other $(n - i)$ units are in state $w_1$.

The transition probability matrix of the hidden states is an $(n+1) \times (n+1)$ matrix $A_{n+1} = \{a_{ij}\}$, where
$$a_{ij} = C^{\,n-i+1}_{\,j-i}\, (1-p)^{n-j+1}\, p^{j-i} \quad (i, j = 1, \ldots, n+1).$$
Suppose that $0 \le k \le n$. We wish to calculate the probability $b_i(k)$ of getting $k$ nonconforming products when the production system is in State $i$. The probability that the $(n-i)$ production units in state $w_1$ produce $(k-l)$ nonconforming items and the $i$ production units in state $w_2$ produce $l$ nonconforming items, one item from each production unit, can be shown to be
$$p(k-l, l) = C^{\,n-i}_{\,k-l}\, r_1^{(n-i)-(k-l)} (1-r_1)^{k-l}\; C^{\,i}_{\,l}\, r_2^{\,i-l} (1-r_2)^{l}.$$

[Figure: Production of a total of $k$ nonconforming items. Of the $n - i$ items produced by the $n - i$ production units in state $w_1$, $k - l$ are nonconforming; of the $i$ items produced by the $i$ production units in state $w_2$, $l$ are nonconforming.]

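The probability distribution matrix for the observable states of the $n$ items produced from the production system, one from each of the $n$ production units, is an $(n+1) \times (n+1)$ matrix $B_{n+1} = \{b_i(k)\}$. We have
$$b_i(k) = \begin{cases}
\displaystyle\sum_{l=0}^{k} p(k-l, l), & k \le i,\ k \le n-i, \\[2ex]
\displaystyle\sum_{l=k-(n-i)}^{k} p(k-l, l), & k \le i,\ k > n-i, \\[2ex]
\displaystyle\sum_{l=0}^{i} p(k-l, l), & k > i,\ k \le n-i, \\[2ex]
\displaystyle\sum_{l=k-(n-i)}^{i} p(k-l, l), & k > i,\ k > n-i.
\end{cases}$$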
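The matrices $A_{n+1}$ and $B_{n+1}$ can be tabulated directly from these formulas. The sketch below indexes states by the number $s$ of subnormal units ($s = 0, \ldots, n$), so that $s = i - 1$ relative to the 1-based index in the formula for $a_{ij}$; the four cases for $b_i(k)$ are merged into the single range $\max(0, k-(n-i)) \le l \le \min(k, i)$. The function names are hypothetical, and the parameter values are illustrative ones ($n = 2$, $r_1 = 0.95$, $r_2 = 0.5$, $p = 0.02$).

```python
from math import comb


def hidden_transition(n, p):
    """A[s][s2]: probability of moving from s to s2 subnormal units."""
    return [[comb(n - s, s2 - s) * (1 - p) ** (n - s2) * p ** (s2 - s)
             if s2 >= s else 0.0
             for s2 in range(n + 1)] for s in range(n + 1)]


def emission(n, r1, r2):
    """B[i][k]: probability of k nonconforming items with i subnormal units."""
    def p_kl(i, k, l):
        # (k - l) nonconforming items from the n - i normal units,
        # l nonconforming items from the i subnormal units.
        return (comb(n - i, k - l) * r1 ** ((n - i) - (k - l)) * (1 - r1) ** (k - l)
                * comb(i, l) * r2 ** (i - l) * (1 - r2) ** l)

    return [[sum(p_kl(i, k, l) for l in range(max(0, k - (n - i)), min(k, i) + 1))
             for k in range(n + 1)] for i in range(n + 1)]


A = hidden_transition(2, 0.02)
B = emission(2, 0.95, 0.5)
for row in A:
    print([round(x, 4) for x in row])
for row in B:
    print([round(x, 4) for x in row])
```

The matrix `A` is upper triangular because a unit that has become subnormal stays subnormal until maintenance, so the number of subnormal units can only increase between maintenance actions.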

4.1 Numerical Experiment
Parameters: $n = 2$, $r_1 = 0.95$, $r_2 = 0.5$ and $p = 0.02$. Initial state: $\pi = (1, 0, 0)$.
Using the algorithm in (Tai, Ching & Chan, 2009), these parameters are used to simulate:
(i) a time series of hidden states (number of subnormal units);
(ii) a time series of observable states (number of nonconforming items).
The simulation is repeated 50 times with the same parameters. We then apply the higher-order IHMM to our problem; here we consider a second-order model.
The computational cost of enumerating all possible combinations of hidden states is $O(T n^T)$, which becomes very large if both the length $T$ of the data sequence and the number $n$ of production units are large. We have developed an efficient $O(T n^2)$ algorithm, similar to the Forward-Backward algorithm, for computing the probability.

We also developed an efficient algorithm, similar to the Viterbi algorithm, for estimating the most likely hidden sequence, and we use it to make predictions.
Step 1: Initialization: $\delta_3(i) = P(\xi_3 = i \mid \eta_1, \eta_2)$ and $\theta_1(i) = \theta_2(i) = \theta_3(i) = 0$ for $0 \le i \le n$; $q^*_1 = q^*_2 = 0$, which means that the two production units are in their normal states at times 1 and 2.
Step 2: Recursion: for $3 < t \le T$ and $0 \le i \le n$,
$$\delta_t(i) = \max_{0 \le j \le n,\ j \le i} \delta_{t-1}(j)\, P(\xi_t = i \mid \eta_{t-1}, \eta_{t-2}), \qquad \theta_t(i) = \arg\max_{0 \le j \le n,\ j \le i} \{\delta_{t-1}(j)\, P(\xi_t = i \mid \eta_{t-1}, \eta_{t-2})\}.$$
Step 3: Termination: $q^*_T = \arg\max_{0 \le i \le n} \delta_T(i)$; here $q^*_T$ is the most likely hidden state at time $T$.
Step 4: Backtracking: $q^*_t = \theta_{t+1}(q^*_{t+1})$, where $q^*_t$ is the most likely hidden state at time $t$.

Table 5. The simulation results of the second-order IHMM

                                               Exact   Early   Late
First deterioration    Number of occurrences     23       9     18
                       Mean number of steps       0     2.3    4.5
                       Standard deviation         0     6.4   20.6
Second deterioration   Number of occurrences     29      18      3
                       Mean number of steps       0     3.7   10.7
                       Standard deviation         0    10.0   19.0

Table 6. The simulation results of the HMM

                                               Exact   Early   Late
First deterioration    Number of occurrences      1      46      3
                       Mean number of steps       0     1.9   11.3
                       Standard deviation         0     3.8    6.9
Second deterioration   Number of occurrences      3      41      6
                       Mean number of steps       0     2.5    9.5
                       Standard deviation         0     3.9   16.3

5. Concluding Remarks
We have introduced a new class of higher-order IHMMs and provided efficient estimation methods, based on NMF, for the model parameters.
An application to the detection of machine failure is given to demonstrate the effectiveness of the proposed model and method.
In future work, we shall study the problem of determining the order of an IHMM and establish convergence theory for the proposed algorithms.
We shall also consider potential applications in other areas such as biology, economics and finance.

6. References
Ching, W. and Ng, M. (2006) Markov Chains: Models, Algorithms and Applications. International Series on Operations Research and Management Science, Springer, New York.
Ching, W., Fung, E., Ng, M., Siu, T. and Li, W. (2007) Interactive Hidden Markov Models and Their Applications. IMA Journal of Management Mathematics, 18, 85-97.
Tai, A. H., Ching, W. K. and Chan, L. Y. (2009) Detection of machine failure: hidden Markov model approach. Computers & Industrial Engineering, 57, 608-619.
Ching, W., Siu, T., Li, L., Li, T. and Li, W. (2009) Modeling Default Data via an Interactive Hidden Markov Model. Computational Economics, 34, 1-19.

Baum, L. (1972) An inequality and associated maximization techniques in statistical estimation for probabilistic function of Markov processes. Inequalities, 3, 1-8.
Lin, C. (2007) Projected gradient methods for non-negative matrix factorization. Neural Computation, 19, 2756-2779.
Rabiner, L. (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77, 257-286.
Viterbi, A. (1967) Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13, 260-269.