Learning Dynamic Audio/Visual Mapping with Input-Output Hidden Markov Models

Learning Dynamic Audio/Visual Mapping with Input-Output Hidden Markov Models

Yan Li and Heung-Yeung Shum
Microsoft Research China, Beijing, P.R. China

Abstract

In this paper we formulate the problem of synthesizing facial animation from an input audio sequence (a.k.a. video rewrite, voice puppetry) as dynamic audio/visual mapping. We propose that audio/visual mapping should be modeled with an input-output hidden Markov model, or IOHMM. An IOHMM is an HMM for which the emission and transition probabilities are conditional on the input sequence. We train IOHMMs using the expectation-maximization (EM) algorithm, with a novel architecture that explicitly models the relationship between each transition probability and the input using a neural network. Given an input sequence, an output sequence is synthesized by maximum likelihood estimation. Experimental results demonstrate that IOHMMs can generate good-quality and natural facial animation sequences from input audio.

1 Introduction

Dynamic audio/visual mapping (or vocal/facial mapping) has recently received much attention as a powerful alternative to traditional facial animation techniques [6, 5, 7, 4, 9]. Instead of directly animating facial expression, a sequence of audio is used to drive the facial motion. While voice is generated by the vocal cords and facial expressions are formed by facial skin and muscles, there exists a great deal of mutual information between audio and visual signals. Representative recent projects from the graphics community on learning dynamic audio/visual mappings include video rewrite and voice puppetry. A good survey of the importance and difficulties of audio/visual mapping can be found in Section 2 of [4].

Sequential dynamics of audio or visual signals can be modeled effectively using hidden Markov models (HMMs) [10, 12]. HMMs have been widely used for speech recognition and for gesture recognition from video sequences. They employ hidden states to carry contextual information forward and backward in time, and model the contextual information by state observation probabilities and state transition probabilities. HMMs can be useful for representing dynamic audio/visual mappings as well, as advocated by [5] and [4]. For instance, co-articulation, i.e., the context before and after the current frame, can be taken into account in the HMMs for dynamic mapping.

Unfortunately, different audio signals can correspond to a single facial expression, while many facial expressions correspond to the same audio track. To deal with this many-to-many mapping between audio signals and visual signals, previous approaches assume that either the audio or the video can be modeled by a hidden Markov model (HMM). For example, the video rewrite technique [5] recognizes different phonemes from the input audio signal. Animation is generated by re-ordering captured video frames which share similar phonemes with the training video. Co-articulation from the preceding and the subsequent phonemes is considered by the use of a triphone model. But the training video may not have enough triphone samples. Moreover, long-range co-articulation effects are not captured in triphone models. On the other hand, the voice puppetry technique [4] trains an HMM model for the visual signal.

Then the corresponding audio signal is analyzed with respect to the learnt visual models (i.e., facial gestures). A remapping process is employed to give each state a dual mapping into both audio and visual signals. Given a novel audio signal, another analysis step is needed to combine the audio signal with the HMM model and the remapping results to generate the most likely visual state sequence. Finally, animation is generated from the most likely visual states and the learnt visual output distribution to obtain an optimal trajectory in the visual configuration space.

There is a reason why the cumbersome remapping and analysis steps are needed in voice puppetry. Although the HMM has been shown to be a powerful tool for modeling dynamic processes, it is quite sub-optimal for synthesis. Traditionally, for recognition, an HMM aims to model the dynamics of one kind of signal. For synthesis, we need to explore the mapping relationship between different signals, each of which might have a different probabilistic model. Moreover, the model parameters in conventional HMMs are fixed after training, which results in a homogeneous Markov chain. On the other hand, when our observations are two related input and output sequences, and the output sequence conditionally depends on the input sequence, the expected model should be inhomogeneous, i.e., able to adapt to the input.

In this paper, we propose that dynamic audio/visual mapping should be learnt by an input-output hidden Markov model, or IOHMM. The IOHMM, a.k.a. conditional HMM, originally introduced by Bengio [3, 1] for sequence processing, can be stated as follows: an IOHMM is an HMM for which the emission and transition distributions are conditional on the input sequence. Specifically, in this paper we present novel algorithms to tackle the following two problems: learning IOHMMs for dynamic vocal/facial mapping from synchronized audio and visual signals, and synthesizing facial expressions from input audio and the learnt IOHMMs.

Our approach is inspired by the voice puppetry work. While the voice puppetry work highlights a new learning algorithm which determines the compact structure of HMMs, our work emphasizes the essence of learning dynamic input/output mapping with IOHMMs.

Our approach is much less complex because the input and the output are put together for training and synthesis, thus eliminating the need for the remapping and analysis steps used in voice puppetry. As a consequence, we can generate good-quality and natural synthesis results.

The remainder of this paper is organized as follows. The IOHMM model is introduced in Section 2, where we explain why the HMM needs to be augmented to an IOHMM for the synthesis task. The audio/visual mapping is studied in Section 3. Experimental results are presented in Section 4. Finally, we conclude our paper in Section 5.

2 IOHMM for synthesis

2.1 HMM

HMMs are statistical models of sequential data that have been used successfully in many applications, e.g., speech recognition. A Bayesian network [11] graphically representing the independence assumptions of an HMM is shown in Figure 1(a). The relationship between the observed (output) sequence y_1^t = (y_1, ..., y_t) and the hidden state sequence q_1^t = (q_1, ..., q_t) satisfies the following conditional first-order independence assumptions:

P(y_t \mid q_1^t, y_1^{t-1}) = P(y_t \mid q_t)    (1)

P(q_{t+1} \mid q_1^t, y_1^t) = P(q_{t+1} \mid q_t)    (2)

Therefore, the joint distribution of the hidden and observed variables can be simplified as

P(y_1^T, q_1^T) = P(q_1) \prod_{t=1}^{T-1} P(q_{t+1} \mid q_t) \prod_{t=1}^{T} P(y_t \mid q_t)    (3)

The joint distribution is therefore completely specified by the initial state probabilities \pi = P(q_1), the transition probabilities A = P(q_t \mid q_{t-1}) and the emission probabilities B = P(y_t \mid q_t). Thus, a compact HMM model is \lambda = (A, B, \pi). Details of solving the three fundamental problems of HMMs, including the Viterbi algorithm and the Baum-Welch algorithm, can be found in the literature such as the HMM tutorial paper [12].
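As a concrete illustration of Equation (3) and of the forward computation underlying the Baum-Welch machinery, the sketch below evaluates the joint likelihood of a state/observation pair and the marginal likelihood P(y_1^T) for a small discrete HMM λ = (A, B, π). This is our own minimal NumPy example with made-up dimensions and random parameters, not code from the paper.

```python
import numpy as np

# A toy discrete HMM lambda = (A, B, pi): N hidden states, M output symbols.
# All numbers here are illustrative only.
N, M = 3, 4
rng = np.random.default_rng(0)
A = rng.dirichlet(np.ones(N), size=N)      # A[i, j] = P(q_{t+1}=j | q_t=i)
B = rng.dirichlet(np.ones(M), size=N)      # B[i, k] = P(y_t=k | q_t=i)
pi = rng.dirichlet(np.ones(N))             # pi[i]  = P(q_1=i)

def joint_log_likelihood(y, q):
    """log P(y_1^T, q_1^T) as factored in Equation (3)."""
    ll = np.log(pi[q[0]]) + np.log(B[q[0], y[0]])
    for t in range(1, len(y)):
        ll += np.log(A[q[t - 1], q[t]]) + np.log(B[q[t], y[t]])
    return ll

def forward_likelihood(y):
    """Marginal P(y_1^T) by the forward recursion (sums over all state paths)."""
    alpha = pi * B[:, y[0]]
    for t in range(1, len(y)):
        alpha = (alpha @ A) * B[:, y[t]]
    return alpha.sum()

y = [0, 2, 1, 3]
q = [0, 1, 1, 2]
print(joint_log_likelihood(y, q), forward_likelihood(y))
```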

The conventional HMM can be extended for the purpose of dynamic input/output mapping. The Bayesian network shown in Figure 1(b) illustrates that the HMM learnt from the observed output sequence can be remapped to the input sequence (in dotted lines). This is exactly the approach adopted in [4] for synthesizing facial gestures from voice. Although compelling results have been shown, this technique has two problems.

First, at the synthesis step, vocal signals are only used to generate the most likely state sequence. Animation is generated by solving for a global trajectory in the visual state space, which obliterates the relationship between the vocal and visual signals. This problem can be partially addressed by enforcing the local input/output relationship, i.e., adding a direct arc from input to output, as shown in Figure 1(c). At the synthesis stage, from the state sequence and the input signal, we can generate the output sequence with the help of the local input/output mapping. Introducing a local model into the HMM is necessary for the synthesis problem because we expect to obtain a continuous output, not to classify the input into a specific state (as expected in a recognition problem). If we have no prior knowledge of the relationship between input and output, the local mapping model can be obtained by some regression method such as a neural network. Generally speaking, a more explicit and compact distribution of the output can be learnt by introducing some prior knowledge or assumptions about the input and output signals.

Second, and more significantly, a remapping process is required to map the occupancy matrix (obtained from the HMM model for the output sequence) to the synchronized input so that each state has a dual mapping with both input and output. The underlying assumption made in the remapping process is that the input sequence shares the dynamic behavior exhibited in the HMM trained from the output. As a result, the learnt model is homogeneous for all input sequences.

These problems are addressed in the Bayesian network shown in Figure 1(d), where input and output are put together for training. The model proposed in Figure 1(d) is called an IOHMM or conditional HMM because the model configuration is conditionally dependent on the input sequence. This is illustrated by the arc from the input to the state (x_t to q_t) in Figure 1(d) having a direction reverse from that in Figure 1(b). It indicates the causal effect from the input to the output.

Figure 1: Bayesian networks for several hidden Markov models. (a) Conventional HMM where no input is in the network. (b) HMM remapped (with dotted lines) to the input sequence as well. (c) Remapped HMM (b) plus a direct connection between input and output. (d) Input-output HMM. Dotted lines from q_t to x_t in (b)(c) indicate a remapping process, not a causal effect. The solid line from x_t to q_t in (d) shows that q_t and the transition from q_t to q_{t+1} are conditional on x_t.

For reference, the notation used for IOHMMs is summarized in Appendix A.

2.2 IOHMM

The main difference between standard HMMs and IOHMMs is that the former represents the distribution P(y_1^T) of output sequences, whereas the latter represents the conditional distribution P(y_1^T \mid x_1^T) of the output sequence given the input sequence x_1^T = (x_1, x_2, ..., x_T). IOHMMs are trained by maximizing the conditional likelihood P(y_1^T \mid x_1^T). This is a supervised learning problem, since the output y_1^T plays the role of a desired output in response to the input x_1^T. The Bayesian network for HMMs (Figure 1(a)) can be obtained by simply removing the input nodes and arcs from the IOHMM in Figure 1(d). The arc from x_t to y_t in Figure 1(d) indicates that IOHMMs represent a conditional distribution of a (desired) output sequence when an (observed) input sequence is given. And the arc from x_t to q_t implies that in an IOHMM, transition probabilities are conditional on the input and thus depend on time, resulting in an inhomogeneous Markov chain. In comparison, standard HMMs are based on homogeneous Markov chains. Therefore, IOHMMs are better suited than HMMs for learning to represent long-range context. These properties make IOHMMs more suitable than traditional HMMs for synthesis.
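To make the distinction concrete, the sketch below evaluates the conditional likelihood P(y_1^T | x_1^T) for a toy IOHMM in which both the transition matrix and the Gaussian emission mean are functions of the current scalar input. The helper names and the particular functions of x are ours, chosen only for illustration; any parameterization of the input-conditioned distributions could be substituted.

```python
import numpy as np

N = 2  # number of hidden states

def transition(x):
    """A(x)[i, j] = P(q_t=j | q_{t-1}=i, x_t); here an arbitrary smooth function of x."""
    logits = np.array([[x, -x], [-x, x]])
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def emission(y, x):
    """b_i(y | x): Gaussian density whose mean depends on the input x."""
    means = np.array([x, -x])            # one input-dependent mean per state
    return np.exp(-0.5 * (y - means) ** 2) / np.sqrt(2 * np.pi)

def conditional_likelihood(xs, ys, pi=np.array([0.5, 0.5])):
    """P(y_1^T | x_1^T) via an input-conditioned forward recursion."""
    alpha = pi * emission(ys[0], xs[0])
    for x, y in zip(xs[1:], ys[1:]):
        alpha = (alpha @ transition(x)) * emission(y, x)
    return alpha.sum()

xs = [0.2, 1.0, -0.5, 0.7]
ys = [0.1, 0.9, -0.4, 0.8]
print(conditional_likelihood(xs, ys))
```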

2.3 An example

We illustrate the difference between HMMs and IOHMMs in training and synthesis with the toy problem below.

2.3.1 Problem description

The input and output sequences shown in Figures 2(a) and (b) have the following properties. At any time instant, the input signal is assumed to move along one of the two concentric circles: clockwise along the outer circle, but counterclockwise along the inner one, as indicated by the circles with arrows. Gaussian noise proportional to the circle radius is further added to the point positions. The output signal is synchronous with the input signal and moves along one of the two corresponding diamonds: clockwise along the outer diamond, counterclockwise along the inner one. The input can only jump to the adjacent circle from point J_1 to J_2, or vice versa, as shown in Figure 2(a); the output jumps accordingly.

Figure 2: A toy problem to map circles to squares. (a) Distribution of the input signal; (b) distribution of the output signal. Solid lines and curves are the paths along which the data move; dots are the actual samples perturbed by noise.

Our objective is to learn the dynamic mapping between the input and the output. Furthermore, given a new input sequence, we would like to synthesize the most likely output sequence that best fits the learnt model.
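For reference, a data set with the properties just described can be generated along the following lines. This is our own sketch; the radii, angular step, noise level, jump probability, and the way the junction is modeled are made-up parameters, not the paper's.

```python
import numpy as np

def make_toy_sequence(T=500, radii=(1.0, 2.0), step=0.1, noise=0.05,
                      jump_prob=0.02, seed=0):
    """Input: noisy points on two concentric circles (inner CCW, outer CW).
    Output: the synchronous points on the two corresponding diamonds.
    Jumps between circles are only allowed near a junction angle (theta ~ 0)."""
    rng = np.random.default_rng(seed)
    theta, circle = 0.0, 0          # circle 0 = inner (CCW), 1 = outer (CW)
    xs, ys = [], []
    for _ in range(T):
        # jump to the adjacent circle only near the junction point
        if abs(np.sin(theta)) < step and rng.random() < jump_prob:
            circle = 1 - circle
        r = radii[circle]
        # input point on the circle, noise proportional to the radius
        x = r * np.array([np.cos(theta), np.sin(theta)])
        x += rng.normal(scale=noise * r, size=2)
        # synchronous output point on the diamond |u| + |v| = r at the same angle
        d = np.array([np.cos(theta), np.sin(theta)])
        y = r * d / np.abs(d).sum()
        xs.append(x)
        ys.append(y)
        theta += step if circle == 0 else -step   # CCW on inner, CW on outer
    return np.array(xs), np.array(ys)

X, Y = make_toy_sequence()
print(X.shape, Y.shape)
```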

2.3.2 HMM

To simplify the training problem, we assume four states in our HMM. It has been shown in [4] that the minimum entropy principle can be used to learn the number of states and the structure of HMMs. Applying the standard HMM to the output sequence, we obtain the four states shown in Figure 3(a), each of which represents the data distribution along a specific side of the diamond.

Figure 3: The learnt HMM from the output sequence of Figure 2(b). (a) Four states of the HMM; (b) fixed transitions between HMM states. Solid lines in (a) indicate that the model parameters are fully specified. Different shapes at the four states represent different distributions.

HMMs (with remapping from the output to the input as shown in Figure 1(b)) are inappropriate for synthesis for the following two reasons.

First, HMMs do not represent any dynamics at a finer scale than a state. This causes blurring and muting of the output, and eliminates the fine-scale noise and texture that are expected for synthesis. For example, we might be able to recognize that the output is in state 0 in Figure 3(a), but we cannot determine whether it is on the inner or the outer diamond. Although the expressive power of the model can be improved by adding more states, e.g., using 8 or 16 states for this toy problem, the complexity of the state machine increases (imagine that we have 100 concentric diamonds of different sizes for the output).

Second, as shown in Figure 3(b), the transition probabilities of an HMM are fixed after training. A transition probability represents an average transition behavior between two states. The amount of uncertainty in the transition is, however, not modeled. Therefore, an HMM cannot distinguish whether a transition probability is highly volatile or fixed. With the fixed transition probability matrix, the HMM in Figure 3(a) cannot synthesize the correct change from one diamond boundary to another.

To apply HMMs for synthesis, the emission and transition probabilities must therefore depend on the input. The importance of input-output dependency can be shown by another example of synthesizing visual speech from audio. Suppose that we have obtained two states A and B that represent the closed mouth and the open mouth, respectively. The transition probability from A to B, a_AB, is equal to that from B to A, a_BA. However, a_AB should be much larger than a_BA when the energy of the input vocal signal is increasing (e.g., when starting to speak).

2.3.3 IOHMM

For the IOHMM, we again use four states to model the output data distribution, as shown in Figure 4(a). Dotted lines for the states in Figure 4(a) indicate that the specific forms of the emission probability and the transition probability are not fully determined unless an input is given.

Figure 4: The learnt IOHMM to map circles to diamonds. (a) Four states of the IOHMM; (b) four points from the input; (c)-(f) the corresponding transition matrices for the four points A, B, C, D shown in (b). Dotted lines in (a) indicate that the model parameters are not fully specified unless the input is given.

Training an IOHMM is much more complex than training an HMM because the emission and transition probabilities are conditional on the input. In particular, for each entry in the transition matrix, its conditional distribution given the input may not have an analytical form. Therefore, the mapping from the input to the transition matrix should be trained by neural networks, as suggested by Bengio [2]. In general, the emission probabilities can be learnt using neural networks as well, but often they can be modeled by radial basis functions (RBFs) such as Gaussian distributions, given some prior knowledge of the input/output relationship. Obviously, if we add more prior knowledge, we can obtain more compact and explicit output distributions. At the extreme, the training process degenerates to a regression problem between the input and the output.

In this experiment, we simplify the emission distribution at each state to a Gaussian whose mean and variance are determined by the input data. At each state S_i, the emission probability is given by

b_i = G(\mu_i \varphi(x_t), \Sigma_i \varphi^2(x_t))    (4)

where \mu_i is a vector, \Sigma_i is a matrix, and \varphi(x_t) is the distance of the input signal from the origin. Learning the emission probability is then simplified to determining the values of \mu_i and \Sigma_i.
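A minimal sketch of the emission model in Equation (4), assuming a 2-D output and taking φ(x_t) to be the distance of the input from the origin as stated above; the state parameters used in the example call are illustrative, not values from the paper.

```python
import numpy as np

def emission_prob(y_t, x_t, mu_i, Sigma_i):
    """b_i(y_t | x_t) = G(mu_i * phi(x_t), Sigma_i * phi(x_t)^2), Equation (4),
    where phi(x_t) is the distance of the input from the origin."""
    phi = np.linalg.norm(x_t)
    mean = mu_i * phi
    cov = Sigma_i * phi ** 2
    diff = y_t - mean
    d = len(y_t)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

# Illustrative state parameters: a unit-scale direction and a small isotropic covariance.
mu_0 = np.array([0.5, 0.5])
Sigma_0 = 0.01 * np.eye(2)
print(emission_prob(np.array([1.0, 1.0]), np.array([1.4, 1.4]), mu_0, Sigma_0))
```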

We have developed a training algorithm for IOHMMs. We follow Bengio's approach [3] to train IOHMMs under the EM framework. What is novel in our algorithm is the process of training the transition matrix with neural networks. Each entry in the transition matrix is trained with an independent neural network, after the M-step at each iteration. For this toy problem, each network has a single hidden layer with six nodes, two input nodes (the 2-D coordinates of the input data), and one output node (the transition probability from state S_i to state S_j). A bias node is further added to the input and hidden layers. Details of our training algorithm can be found in Appendix B.

The learnt transition probabilities are clearly dependent on the input, as shown in Figure 4(c)-(f) for four different points. For example, point A, which is located at the boundary of state S_0 and state S_1, has the transition matrix shown in Figure 4(c). It shows that the next output can stay at either state S_0 or S_1, but not at S_2 or S_3, because a_22 = a_33 = 0. In other words, there is a strong tendency to transit to these two states no matter what the current state is, as long as the input falls at location A. As we move from A to B, it becomes more likely to transit to state S_1 than to S_0, as shown in Figure 4(d). When the input point C is at the mean (of the local observation distribution) of state S_1, the transition matrix simplifies to a zero matrix except for the column corresponding to state S_1, whose entries are all equal to 1, as shown in Figure 4(e). This implies that the next state must be S_1 after the transition. Figure 4(f) shows the transition matrix when the input is at point D; it has a similar structure to Figure 4(d), except that the next output is more likely to be in state S_0.

Because of the significant constraint imposed by the transition matrix, the synthesis will most likely yield correct state transitions even if the output is sampled from a wrong state at some time instant. From the input data in Figure 5(a), we obtain the synthesis result shown in Figure 5(b). As expected, the output is distributed around one of the two diamonds, similar to the training data. Moreover, transitions between different states are correct, as shown by the arrows in Figure 5(b). Depending on whether the input is on the inner or the outer circle, the output samples form two Gaussian distributions that belong to the same state S_2 (the dotted blue circle in Figure 5(b)). Because temporal information is not used in training and only four states are used, the synthesized output trajectory does not follow the two diamonds exactly.

Figure 5: The synthesis results using the IOHMM. (a) The input sequence: the dots are actual samples, curly lines with arrows show the moving trajectory, and the dotted line indicates a jump. (b) The output: the dots are the sampled output signal, solid ellipses are the local distributions fully specified given the input, the dotted circle indicates state 2, to which two local distributions belong, and arrows show the transitions between states and local distributions.
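The per-entry transition network described in Section 2.3.3 (one hidden layer with six nodes, two inputs for the 2-D point, one output, with bias nodes on the input and hidden layers) can be sketched as follows. This is our own NumPy forward pass with a sigmoid output, not the authors' code; the row-wise renormalization across the N x N network outputs is an assumption we add so the assembled matrix is stochastic, and training against the normalized targets of Appendix B is not shown.

```python
import numpy as np

class TransitionNet:
    """One small network per transition entry a_ij: 2 inputs -> 6 hidden -> 1 output,
    with bias terms on the input and hidden layers and a sigmoid output unit."""
    def __init__(self, rng):
        self.W1 = rng.normal(scale=0.5, size=(6, 3))   # hidden weights (incl. input bias)
        self.W2 = rng.normal(scale=0.5, size=(1, 7))   # output weights (incl. hidden bias)

    def __call__(self, x):
        h = np.tanh(self.W1 @ np.append(x, 1.0))       # hidden layer
        z = self.W2 @ np.append(h, 1.0)                # output layer
        return 1.0 / (1.0 + np.exp(-z[0]))             # sigmoid -> (0, 1)

def transition_matrix(nets, x_t):
    """Assemble A(x_t) from the N x N per-entry networks and renormalize each row."""
    N = len(nets)
    A = np.array([[nets[i][j](x_t) for j in range(N)] for i in range(N)])
    return A / A.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N = 4
nets = [[TransitionNet(rng) for _ in range(N)] for _ in range(N)]
print(transition_matrix(nets, np.array([0.7, -0.7])))
```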

3 Synthesizing facial animation from audio

We apply IOHMMs to synthesize facial animation from audio.

3.1 Audio-visual signal representation

In our system, we use the trajectories of 3D points on the face of an actor, together with his voice, as the training data. In total, 150 points are tracked. We use principal component analysis (PCA) to compress the 450-dimensional feature vector into a 15-dimensional feature vector that covers 97% of the variance. Figure 6 shows the data points at different poses.

We use an 18-dimensional feature vector to represent the vocal signal. Instead of the traditional phoneme-viseme mapping, we use low-level acoustic features such as MFCCs and energy as the input. The input audio sequence is blocked into frames of the same size as the captured video frames. In order to capture more of the dynamics in the vocal features, we also calculate the delta parameters for the MFCCs and energy. Speech energy is an important vocal feature because it plays an important role in controlling facial expressions.
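The 450-to-15 dimensional compression described above is a standard variance-thresholded PCA; a generic sketch is given below with random stand-in data (the function name and data are ours, not the authors' pipeline).

```python
import numpy as np

def pca_compress(V, var_target=0.97):
    """Project frames V (T x D) onto the fewest principal components
    that together cover at least var_target of the variance."""
    mean = V.mean(axis=0)
    Vc = V - mean
    U, s, Wt = np.linalg.svd(Vc, full_matrices=False)
    var = s ** 2 / (s ** 2).sum()
    k = int(np.searchsorted(np.cumsum(var), var_target)) + 1
    coeffs = Vc @ Wt[:k].T            # T x k low-dimensional features
    return coeffs, Wt[:k], mean, k

# Stand-in data: T frames of 150 tracked 3-D points flattened to 450 dimensions.
V = np.random.default_rng(0).normal(size=(1000, 450))
coeffs, basis, mean, k = pca_compress(V)
reconstruction = coeffs @ basis + mean
print(k, coeffs.shape, reconstruction.shape)
```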

Figure 6: Training data used in our experiments. (Courtesy of Dr. Brian Guenter at Microsoft.) (a) An original image with marked feature points; (b)(c) tracked 3D points at two different poses.

3.2 Training

In our application, both the input (vocal feature vector) and the output (facial expression) are continuous high-dimensional random variables. Furthermore, we have no prior knowledge about the mapping relation between the input and the output. Therefore, training such an IOHMM is much more complex than in the toy problem of the preceding section. In Bengio's work [3], the local mapping model and the state transition probabilities are modeled as neural networks. Although this architecture can be trained using the generalized EM (GEM) algorithm [8], training so many neural networks is non-trivial.

To simplify the training process, we first quantize the input into K classes, each of which has its own mean and variance. A new audio frame a is classified by computing the Mahalanobis distance:

a \in \text{class } m    (5)

m = \arg\min_i (a - \mu_{a_i})^T \Sigma_{a_i}^{-1} (a - \mu_{a_i}), \quad i = 1, ..., K    (6)

where \mu_{a_i} and \Sigma_{a_i} are the mean and variance for class i. For each state, the conditional output distribution is then modeled by K Gaussians, each of which corresponds to a specific audio class.
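Equations (5)-(6) amount to a nearest-class rule under the Mahalanobis distance. The sketch below is our own illustration: the class means and covariances are estimated from stand-in data with a crude random partition standing in for proper clustering, which the paper does not detail.

```python
import numpy as np

def fit_audio_classes(frames, K, seed=0):
    """Crude K-class quantization of audio frames: random assignment followed by
    per-class mean/covariance estimates (a placeholder for a real clustering step)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=len(frames))
    mus = np.array([frames[labels == k].mean(axis=0) for k in range(K)])
    sigmas = np.array([np.cov(frames[labels == k].T) + 1e-6 * np.eye(frames.shape[1])
                       for k in range(K)])
    return mus, sigmas

def classify(a, mus, sigmas):
    """Equation (6): assign frame a to the class with minimum Mahalanobis distance."""
    d = [(a - mu) @ np.linalg.solve(sig, a - mu) for mu, sig in zip(mus, sigmas)]
    return int(np.argmin(d))

# Stand-in 18-dimensional vocal feature frames (MFCCs + energy + their deltas).
frames = np.random.default_rng(1).normal(size=(2000, 18))
mus, sigmas = fit_audio_classes(frames, K=15)
print(classify(frames[0], mus, sigmas))
```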

Figure 7: Training an IOHMM from input audio signals and corresponding visual signals. The model consists of three parts: a state machine, an emission probability conditional on the input, and a transition probability matrix conditional on the input.

The emission probability for state i can then be represented by

b_i(t) = G(\mu_{v_{ik}}, \Sigma_{v_{ik}})    (7)

if a_t belongs to class k. Note that \mu_{v_{ik}} and \Sigma_{v_{ik}} are the mean and variance of the output distribution for class k at state i; these parameters need to be learnt. In our system, the transition probabilities are modeled by N x N neural networks similar to those used in the toy problem of the last section.

The training algorithm given in Appendix B can be applied to our system with little modification. In the E-step, the emission probability should be computed by Equation (7). In the M-step, the emission probability parameters are updated by

\mu_{v_{ik}} = \frac{\sum_{t=1,\, a_t \in \text{class } k}^{T} \gamma_t(i)\, v_t}{\sum_{t=1,\, a_t \in \text{class } k}^{T} \gamma_t(i)}    (8)

\Sigma_{v_{ik}} = \frac{\sum_{t=1,\, a_t \in \text{class } k}^{T} \gamma_t(i)\, (v_t - \mu_{v_{ik}})(v_t - \mu_{v_{ik}})^T}{\sum_{t=1,\, a_t \in \text{class } k}^{T} \gamma_t(i)}    (9)

Figure 7 shows our training algorithm. The trained model consists of three parts: a state machine, an emission probability for each state conditional on the input, and a transition matrix conditional on the input.
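A sketch of the per-class emission updates in Equations (8)-(9), assuming the occupancy matrix gamma (T x N) and the frame-level audio class labels are already available; all array shapes and the stand-in data are ours.

```python
import numpy as np

def update_emissions(V, gamma, audio_class, N, K, eps=1e-10):
    """Equations (8)-(9): class- and state-specific output means and covariances.
    V: T x d visual frames, gamma: T x N occupancies, audio_class: T class labels."""
    T, d = V.shape
    mu = np.zeros((N, K, d))
    Sigma = np.zeros((N, K, d, d))
    for i in range(N):
        for k in range(K):
            mask = (audio_class == k)
            w = gamma[mask, i]                      # gamma_t(i) for frames of class k
            Vm = V[mask]
            denom = w.sum() + eps
            mu[i, k] = (w[:, None] * Vm).sum(axis=0) / denom
            diff = Vm - mu[i, k]
            Sigma[i, k] = (w[:, None, None] * diff[:, :, None] * diff[:, None, :]).sum(axis=0) / denom
    return mu, Sigma

# Stand-in numbers: 200 frames, 4 states, 5 audio classes, 15-D visual features.
T, N, K, d = 200, 4, 5, 15
rng = np.random.default_rng(0)
V = rng.normal(size=(T, d))
gamma = rng.dirichlet(np.ones(N), size=T)
audio_class = rng.integers(K, size=T)
mu, Sigma = update_emissions(V, gamma, audio_class, N, K)
print(mu.shape, Sigma.shape)
```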

Figure 8: Synthesizing visual signals from an IOHMM and input audio signals. Steps 1 and 2 are the initialization. Steps 3, 4 and 5 are iterated until convergence.

3.3 Synthesis

Given a new audio sequence, we can apply the model to synthesize the most likely visual sequence. In an IOHMM, the state output probabilities and transition probabilities are conditionally dependent on the input. Therefore, the synthesized sequence is the most likely one satisfying

V^* = \arg\max_V P(V \mid A, \lambda)    (10)

where V is the visual sequence and A the audio sequence. There are three steps in the synthesis process, as shown in Figure 8:

Initialization. At time 1 we choose q_1 according to the model's prior state probabilities \pi. Then at time t we choose (randomly sample) q_t according to P(q_t \mid x_t, q_{t-1}) (where the right-hand side is known), and we randomly sample y_t according to P(y_t \mid x_t, q_t). We obtain an initial estimate of the output sequence by repeating this process, V_1 = (v_{11}, v_{12}, ..., v_{1T}).

Iteration. The observation is complete after we obtain the initial output sequence. In each iteration we run a forward-backward process, after which an occupancy matrix γ_t(i) can be obtained; γ_t(i) represents the probability of being in state i at time t, given the input/output sequence and the model. The synthesized output is then updated by Equation (23).

Termination. Given the fixed model parameters and input sequence, the most likely output sequence is obtained when the change in the likelihood falls below a threshold.

We prove in Appendix C that the above iterative algorithm converges to an optimal solution under the EM framework. In our experiments, we found that the synthesized sequence tends to converge to the means of the states. This can be explained by the blurring and muting effects in HMMs. However, since we have K distributions for a given state, corresponding to the K audio classes, the expressive power is sufficient. In fact, the fine details expected in the synthesis are supplied mainly by the local mapping (one output distribution for each class at each state).
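Taken together, the three steps amount to the loop sketched below: ancestral sampling of an initial output sequence, then repeated forward-backward passes followed by the Equation (23) update. This is a minimal sketch under simplifying assumptions of our own (generic callables for the input-conditioned transition matrix and per-state Gaussian parameters, no scaling in the forward-backward pass, a fixed iteration count instead of a likelihood threshold), not the authors' implementation.

```python
import numpy as np

def synthesize(xs, pi, transition, emission_params, n_iter=20, seed=0):
    """Figure 8 in outline: sample an initial output sequence, then repeatedly run
    forward-backward on (x, v) and update v_t by Equation (23).
    transition(x) -> N x N matrix, emission_params(i, x) -> (mean, cov) for state i."""
    rng = np.random.default_rng(seed)
    N, T = len(pi), len(xs)

    def gauss(v, mean, cov):
        diff = v - mean
        return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / \
               np.sqrt((2 * np.pi) ** len(v) * np.linalg.det(cov))

    # Initialization: ancestral sampling of states and outputs.
    q = rng.choice(N, p=pi)
    V = []
    for t, x in enumerate(xs):
        if t > 0:
            q = rng.choice(N, p=transition(x)[q])
        mean, cov = emission_params(q, x)
        V.append(rng.multivariate_normal(mean, cov))
    V = np.array(V)

    # Iteration: forward-backward to obtain gamma, then the Equation (23) update.
    for _ in range(n_iter):
        B = np.array([[gauss(V[t], *emission_params(i, xs[t])) for i in range(N)]
                      for t in range(T)])
        alpha = np.zeros((T, N))
        beta = np.ones((T, N))
        alpha[0] = pi * B[0]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ transition(xs[t])) * B[t]
        for t in range(T - 2, -1, -1):
            beta[t] = transition(xs[t + 1]) @ (B[t + 1] * beta[t + 1])
        gamma = alpha * beta
        gamma /= gamma.sum(axis=1, keepdims=True)
        means = np.array([[emission_params(i, xs[t])[0] for i in range(N)]
                          for t in range(T)])                       # T x N x d
        V = (gamma[:, :, None] * means).sum(axis=1)                 # Equation (23)
    return V

# Tiny stand-in model: 2 states, 1-D input, 2-D output.
pi = np.array([0.5, 0.5])
transition = lambda x: (np.array([[0.8, 0.2], [0.3, 0.7]]) if x < 0.5
                        else np.array([[0.4, 0.6], [0.1, 0.9]]))
emission_params = lambda i, x: (np.array([x, float(i)]), 0.05 * np.eye(2))
print(synthesize(np.linspace(0, 1, 6), pi, transition, emission_params).shape)
```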

4 Experimental results

In our experiment, the training video consists of 189 short sentences. The input audio is clustered into 15 classes. The training process takes about 5 minutes to converge on a mid-level PC.

We first apply the learnt IOHMM to synthesize facial expressions from the audio in the training set. The audio sequence used for comparison is not used for training the IOHMM. Table 1 shows the comparison with ground truth. To compute the reconstruction error, we calculate the minimum distance of each feature point to the face model reconstructed after PCA (with 97% of the variance covered). The error shown in the table is the summed error normalized by the number of feature points. We can conclude from the table that the reconstruction quality is good because the reconstruction errors (with the two different initialization schemes) are of the same order as the error between the original and the PCA-reconstructed model.

Table 1: A comparison between the synthesized result and ground truth (errors for S1-B, S2-B and A-B). A is the ground truth, B is the PCA-reconstructed result (also ground truth), S1 is the synthesis result initialized by B, and S2 is the synthesis result initialized by random sampling.

Figure 9 shows the result of animating a single picture using our model. Several frames from the synthesized sequence of Dr. Martin Luther King's famous speech, "I have a dream", are shown in the figure. The sequence shows significant facial movement.

Figure 9: A few frames from a synthesized video sequence of Dr. King's speech. The synthesis uses a single picture.

Using several cartoon templates with different poses and expressions, we can also animate a long cartoon sequence. Several animated cartoon frames are shown in Figure 10. Because we use a 3D model with 150 feature points, we clearly observe facial expressions over the whole face in the animation sequences (please see the accompanying video for the complete demo).^1

Figure 10: Several frames of a synthesized cartoon video sequence. Several cartoon template images are used in the animation.

^1 We apologize to the reviewers that the video can only be played in Microsoft Media Player.
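The error metric reported in Table 1 can be computed along the following lines. This is our own simplified sketch: the per-point distance here is the distance to the nearest reconstructed point rather than a true point-to-surface distance, and the stand-in data and function name are not from the paper.

```python
import numpy as np

def reconstruction_error(points, recon_points):
    """Summed distance of each feature point to its nearest point in the reconstructed
    model, normalized by the number of feature points (a simplified stand-in for the
    point-to-model distance used in Table 1)."""
    d = np.linalg.norm(points[:, None, :] - recon_points[None, :, :], axis=2)
    return d.min(axis=1).sum() / len(points)

# Stand-in data: 150 tracked 3-D feature points and a perturbed PCA reconstruction.
rng = np.random.default_rng(0)
points = rng.normal(size=(150, 3))
recon_points = points + rng.normal(scale=0.01, size=(150, 3))
print(reconstruction_error(points, recon_points))
```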

We have encountered some difficulties when using the IOHMM for synthesis. The most difficult problem is synthesizing facial expression when the character is silent: there is no clear mapping from silence to facial expression, and some high-level knowledge must be applied to tackle this problem. Similarly, we have difficulty synthesizing facial expressions that do not correspond to vocal signals, such as frowning.

5 Conclusion and future work

In this paper we have studied the problem of dynamic audio/visual mapping, specifically by formulating audio/visual mapping as an IOHMM problem. A key observation is that the IOHMM is better suited than the conventional HMM for synthesis because it can synthesize structures that are finer than states. Moreover, because the transition probabilities in an IOHMM are conditional on the input, it is more likely that the synthesized state sequence will be correct. The IOHMM is trained under the EM framework, where each transition probability is modeled by a single neural network and updated at each iteration. Given the input audio signal, a facial animation sequence is generated by the maximum likelihood principle. Our experimental results, from a single image and from a sequence of cartoon template images, demonstrate that our synthesis results are of good quality.

An interesting idea is to drive facial animation by emotions. While we have studied synthesizing facial expressions from audio in this paper, the very idea of the IOHMM is also applicable to other dynamic input/output mappings. We plan to build a complete cartoon video rewrite system by combining cartoon animations from different poses/emotions.

References

[1] Y. Bengio. Markovian models for sequential data. Neural Computing Surveys, 2, 1999.

[2] Y. Bengio. Personal communication.

[3] Y. Bengio and P. Frasconi. Input/output HMMs for sequence processing. IEEE Transactions on Neural Networks, 7(5), 1996.

[4] M. Brand. Voice puppetry. In Proc. SIGGRAPH 99, pages 21-28, 1999.

[5] C. Bregler, M. Covell, and M. Slaney. Video rewrite: Driving visual speech with audio. In Proc. SIGGRAPH 97, 1997.

[6] T. Chen and R. Rao. Audio-visual integration in multimodal communication. Proceedings of the IEEE, May 1998.

[7] K. H. Choi and J. N. Hwang. Baum-Welch hidden Markov model inversion for reliable audio-to-visual conversion. In IEEE 3rd Workshop on Multimedia Signal Processing, 1999.

[8] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1-38, 1977.

[9] B. Guenter, C. Grimm, D. Wood, H. Malvar, and F. Pighin. Making faces. In Proc. SIGGRAPH 98, pages 55-66, 1998.

[10] S. Levinson, L. Rabiner, and M. Sondhi. An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. Bell System Technical Journal, 64(4), 1983.

[11] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[12] L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286, February 1989.

Appendix

A. A list of symbols used in the IOHMM

\lambda = (A, B, \pi): the model parameters.

A = {a_{ij}(t)} = {P(q_t = S_j \mid q_{t-1} = S_i, x_t)}: the state transition probabilities; a distribution conditional on x_t.

B = {b_i(t)} = {P(y_t \mid q_t = S_i, x_t)}: the emission probability for state i; a distribution conditional on x_t.

\pi_i = P(q_0 = S_i), 1 \le i \le N: the initial probability of state S_i.

S = {S_1, S_2, ..., S_N}: the set of N states.

X = x_1, x_2, ..., x_T: the input sequence.

Y = y_1, y_2, ..., y_T: the output sequence.

N: the number of states in the IOHMM.

T: the length of the synchronized input/output sequence.

\alpha_t(i) = P(y_1^t, q_t = S_i \mid x_1^t, \lambda): the probability of the joint event that y_1 y_2 ... y_t are observed and the state at time t is S_i, given the input sequence x_1 x_2 ... x_t and the model \lambda.

\beta_t(i) = P(y_{t+1}^T \mid q_t = S_i, x_{t+1}^T, \lambda): the probability of the partial observation sequence from t+1 to the end, given state S_i at time t, the input sequence x_{t+1} x_{t+2} ... x_T and the model \lambda.

\gamma_t(i) = P(q_t = S_i \mid x_1^T, y_1^T, \lambda): the probability of being in S_i at time t, given the input/output sequence and the model \lambda.

\xi_t(i, j) = P(q_t = S_i, q_{t+1} = S_j \mid x_1^T, y_1^T, \lambda): the probability of being in S_i at time t and S_j at time t+1, given the input/output sequence and the model \lambda.

B. Training algorithm for the IOHMM

In our experiments, we use back-propagation neural networks to model the transition probabilities. The transition probability from state S_i to state S_j given the current input x_t is computed by running the corresponding network:

a_{ij}(t) = N_{ij}(x_t)    (11)

Under this configuration, the IOHMM can be trained under the EM framework.

Initialization. We randomly initialize the values of \mu_i and \Sigma_i for each state S_i. The initial state probabilities and transition probabilities are initialized uniformly. The N x N neural networks for the transition matrix are also initialized.

E-step. In the E-step, we calculate the forward variables \alpha_t(i) and backward variables \beta_t(i):

\alpha_t(i) = P(y_1^t, q_t = S_i \mid x_1^t, \lambda)    (12)
            = \left[ \sum_{j=1}^{N} \alpha_{t-1}(j) a_{ji}(t) \right] b_i(y_t \mid x_t)    (13)

\beta_t(i) = P(y_{t+1}^T \mid q_t = S_i, x_{t+1}^T, \lambda)    (14)
           = \sum_{j=1}^{N} a_{ij}(t) b_j(y_{t+1} \mid x_{t+1}) \beta_{t+1}(j)    (15)

We also compute \xi_t(i, j), the probability of being in state S_i at time t and state S_j at time t+1, given the model and the input and output signals. It is calculated by

\xi_t(i, j) = \frac{\alpha_t(i) a_{ij}(t) b_j(y_{t+1} \mid x_{t+1}) \beta_{t+1}(j)}{P(y_1^T \mid x_1^T, \lambda)}    (16)

where

P(y_1^T \mid x_1^T, \lambda) = \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i) a_{ij}(t) b_j(y_{t+1} \mid x_{t+1}) \beta_{t+1}(j)    (17)

Let \gamma_t(i) be the probability of being in state S_i at time t, given the input and output sequences and the model \lambda. We can then form an occupancy matrix by calculating

\gamma_t(i) = P(q_t = S_i \mid y_1^T, x_1^T, \lambda) = \frac{\alpha_t(i) \beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i) \beta_t(i)}    (18)

It can be seen that the forward-backward procedure of the Baum-Welch algorithm used in conventional HMMs can be applied to IOHMM training as well. However, the emission probability and transition probability at time t must be updated according to the input at that time.
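A sketch of the E-step recursions (Equations 12-18), assuming the per-frame transition matrices A_t = A(x_t) and the emission values b_i(y_t | x_t) have already been evaluated. This is our own illustration: scaling is omitted, so it is only suitable for short sequences, and we adopt the convention that A_t[t] governs the transition into time t (matching Equation 13), so the backward pass and xi index the transition matrix at t+1.

```python
import numpy as np

def e_step(A_t, B_t, pi):
    """Unscaled forward-backward pass for an IOHMM.
    A_t: T x N x N input-conditioned transition matrices, B_t: T x N emission values,
    pi: initial state probabilities. Returns gamma (T x N) and xi (T-1 x N x N)."""
    T, N = B_t.shape
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B_t[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A_t[t]) * B_t[t]          # Eq. (13)
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A_t[t + 1] @ (B_t[t + 1] * beta[t + 1])    # Eq. (15)
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)                # Eq. (18)
    xi = alpha[:-1, :, None] * A_t[1:] * (B_t[1:] * beta[1:])[:, None, :]
    xi /= xi.sum(axis=(1, 2), keepdims=True)                 # Eq. (16)
    return gamma, xi

# Stand-in model: 8 frames, 4 states, random row-stochastic transition matrices.
T, N = 8, 4
rng = np.random.default_rng(0)
A_t = rng.dirichlet(np.ones(N), size=(T, N))
B_t = rng.random((T, N))
gamma, xi = e_step(A_t, B_t, rng.dirichlet(np.ones(N)))
print(gamma.shape, xi.shape)
```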

M-step. The M-step in IOHMM training should adapt to the representation of the emission and transition probabilities. In this case, the parameters are updated as follows:

\hat{\mu}_i = \frac{\sum_{t=1}^{T} \gamma_t(i)\, (y_t / \varphi(x_t))}{\sum_{t=1}^{T} \gamma_t(i)}    (19)

\hat{\Sigma}_i = \frac{\sum_{t=1}^{T} \gamma_t(i)\, ((y_t - \hat{\mu}_i)(y_t - \hat{\mu}_i)^T / \varphi^2(x_t))}{\sum_{t=1}^{T} \gamma_t(i)}    (20)

NN-step. In order to train the transition matrix (each entry of which is a neural network), we first normalize \xi_t(i, j) by

\hat{\xi}_t(i, j) = \frac{\xi_t(i, j)}{\sum_{j=1}^{N} \xi_t(i, j)}, \quad i, j = 1, ..., N    (21)

This normalization is necessary to ensure that each row of \hat{\xi}_t(i, j) sums to one, i.e., \sum_{j=1}^{N} \hat{\xi}_t(i, j) = 1, even though \xi_t(i, j) has already been normalized over the whole matrix in Equation (16) in the E-step. Each network N_{ij} can then be updated given the T training samples {x_t, \hat{\xi}_t(i, j)}, t = 1, ..., T. The networks are initialized randomly in the first iteration. At each new step, training begins with the node weights converged to in the previous iteration. The whole model converges after several steps.
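A sketch of the NN-step target preparation in Equation (21): row-normalizing xi_t so that the N x N per-entry networks are regressed toward properly normalized transition probabilities. The per-entry networks themselves are of the kind sketched after Section 2.3.3; their back-propagation training is not shown here, and the stand-in xi values below are random.

```python
import numpy as np

def nn_step_targets(xi, eps=1e-12):
    """Equation (21): renormalize each row of xi_t(i, .) to sum to one, producing
    the targets {x_t -> xi_hat_t(i, j)} used to retrain the N x N transition networks."""
    return xi / (xi.sum(axis=2, keepdims=True) + eps)

rng = np.random.default_rng(0)
xi = rng.random((9, 4, 4))                 # stand-in xi_t(i, j), t = 1..T-1
xi_hat = nn_step_targets(xi)
print(np.allclose(xi_hat.sum(axis=2), 1.0))
```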

C. Proof of the synthesis algorithm

In synthesis, we seek the optimal visual sequence V given the input audio A. Using the EM algorithm, the optimal visual sequence can be obtained by maximizing the auxiliary function Q(V' \mid V), i.e.,

V' = \arg\max_{V'} Q(V' \mid V)    (22)

where V and V' denote the visual sequence before and after each iteration, respectively. We have

Q(V' \mid V) = E[\ln P(V', S \mid A, \lambda)]
             = \sum_S P(V, S \mid A, \lambda) \ln P(V', S \mid A, \lambda)
             = \sum_S P(V, S \mid A, \lambda) \left( \ln \pi_{q_0} + \sum_{t=1}^{T-1} \ln a_{q_t q_{t+1}} + \sum_{t=1}^{T} \ln b_{q_t}(v'_t) \right)

where S is the state sequence and \lambda is the learnt model parameter. Setting the derivative of the above function with respect to v'_t to zero, and noting that the state observation is modeled as a Gaussian, we have

\frac{\partial Q(V' \mid V)}{\partial v'_t} = \sum_S P(V, S \mid A, \lambda) \frac{\partial}{\partial v'_t} \ln b_{q_t}(v'_t)
 = \sum_{i=1}^{N} P(V, q_t = S_i \mid A, \lambda) \frac{\partial}{\partial v'_t} \ln b_{S_i}(v'_t)
 = \sum_{i=1}^{N} P(V, q_t = S_i \mid A, \lambda) \Sigma_i^{-1} (v'_t - \mu_i)

where \mu_i and \Sigma_i are the mean vector and covariance matrix for state S_i. The ML estimate of the visual sequence is then obtained by

v'_t = \frac{\sum_{i=1}^{N} \gamma_t(i) \mu_i}{\sum_{i=1}^{N} \gamma_t(i)}    (23)

where \gamma_t(i) = P(V, q_t = S_i \mid A, \lambda) can be computed by the forward-backward process.

It should be noted that a similar algorithm, named inverted HMM, has been proposed in [7]. Here, however, we assume that the visual signal conditionally depends on the audio input, instead of modeling them by a joint distribution as in [7].


HIDDEN MARKOV MODELS IN SPEECH RECOGNITION HIDDEN MARKOV MODELS IN SPEECH RECOGNITION Wayne Ward Carnegie Mellon University Pittsburgh, PA 1 Acknowledgements Much of this talk is derived from the paper "An Introduction to Hidden Markov Models",

More information

We Live in Exciting Times. CSCI-567: Machine Learning (Spring 2019) Outline. Outline. ACM (an international computing research society) has named

We Live in Exciting Times. CSCI-567: Machine Learning (Spring 2019) Outline. Outline. ACM (an international computing research society) has named We Live in Exciting Times ACM (an international computing research society) has named CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Apr. 2, 2019 Yoshua Bengio,

More information

Hidden Markov Models The three basic HMM problems (note: change in notation) Mitch Marcus CSE 391

Hidden Markov Models The three basic HMM problems (note: change in notation) Mitch Marcus CSE 391 Hidden Markov Models The three basic HMM problems (note: change in notation) Mitch Marcus CSE 391 Parameters of an HMM States: A set of states S=s 1, s n Transition probabilities: A= a 1,1, a 1,2,, a n,n

More information

CS 188: Artificial Intelligence Fall 2011

CS 188: Artificial Intelligence Fall 2011 CS 188: Artificial Intelligence Fall 2011 Lecture 20: HMMs / Speech / ML 11/8/2011 Dan Klein UC Berkeley Today HMMs Demo bonanza! Most likely explanation queries Speech recognition A massive HMM! Details

More information

We Prediction of Geological Characteristic Using Gaussian Mixture Model

We Prediction of Geological Characteristic Using Gaussian Mixture Model We-07-06 Prediction of Geological Characteristic Using Gaussian Mixture Model L. Li* (BGP,CNPC), Z.H. Wan (BGP,CNPC), S.F. Zhan (BGP,CNPC), C.F. Tao (BGP,CNPC) & X.H. Ran (BGP,CNPC) SUMMARY The multi-attribute

More information

Statistical NLP: Hidden Markov Models. Updated 12/15

Statistical NLP: Hidden Markov Models. Updated 12/15 Statistical NLP: Hidden Markov Models Updated 12/15 Markov Models Markov models are statistical tools that are useful for NLP because they can be used for part-of-speech-tagging applications Their first

More information

Hidden Markov Models (HMMs)

Hidden Markov Models (HMMs) Hidden Markov Models (HMMs) Reading Assignments R. Duda, P. Hart, and D. Stork, Pattern Classification, John-Wiley, 2nd edition, 2001 (section 3.10, hard-copy). L. Rabiner, "A tutorial on HMMs and selected

More information

COMS 4771 Probabilistic Reasoning via Graphical Models. Nakul Verma

COMS 4771 Probabilistic Reasoning via Graphical Models. Nakul Verma COMS 4771 Probabilistic Reasoning via Graphical Models Nakul Verma Last time Dimensionality Reduction Linear vs non-linear Dimensionality Reduction Principal Component Analysis (PCA) Non-linear methods

More information

Hidden Markov Models Part 2: Algorithms

Hidden Markov Models Part 2: Algorithms Hidden Markov Models Part 2: Algorithms CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Hidden Markov Model An HMM consists of:

More information

Hidden Markov models

Hidden Markov models Hidden Markov models Charles Elkan November 26, 2012 Important: These lecture notes are based on notes written by Lawrence Saul. Also, these typeset notes lack illustrations. See the classroom lectures

More information

Design and Implementation of Speech Recognition Systems

Design and Implementation of Speech Recognition Systems Design and Implementation of Speech Recognition Systems Spring 2013 Class 7: Templates to HMMs 13 Feb 2013 1 Recap Thus far, we have looked at dynamic programming for string matching, And derived DTW from

More information

Bayesian Networks Inference with Probabilistic Graphical Models

Bayesian Networks Inference with Probabilistic Graphical Models 4190.408 2016-Spring Bayesian Networks Inference with Probabilistic Graphical Models Byoung-Tak Zhang intelligence Lab Seoul National University 4190.408 Artificial (2016-Spring) 1 Machine Learning? Learning

More information

Template-Based Representations. Sargur Srihari

Template-Based Representations. Sargur Srihari Template-Based Representations Sargur srihari@cedar.buffalo.edu 1 Topics Variable-based vs Template-based Temporal Models Basic Assumptions Dynamic Bayesian Networks Hidden Markov Models Linear Dynamical

More information

Mixtures of Gaussians with Sparse Structure

Mixtures of Gaussians with Sparse Structure Mixtures of Gaussians with Sparse Structure Costas Boulis 1 Abstract When fitting a mixture of Gaussians to training data there are usually two choices for the type of Gaussians used. Either diagonal or

More information

Modeling Timing Structure in Multimedia Signals

Modeling Timing Structure in Multimedia Signals Modeling Timing Structure in Multimedia Signals Hiroaki Kawashima, Kimitaka Tsutsumi, and Takashi Matsuyama Kyoto University, Yoshida-Honmachi Sakyo, Kyoto 6068501, JAPAN, {kawashima,tm}@i.kyoto-u.ac.jp,

More information

Hidden Markov Models

Hidden Markov Models 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Hidden Markov Models Matt Gormley Lecture 19 Nov. 5, 2018 1 Reminders Homework

More information

Hidden Markov Models Part 1: Introduction

Hidden Markov Models Part 1: Introduction Hidden Markov Models Part 1: Introduction CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Modeling Sequential Data Suppose that

More information

STATS 306B: Unsupervised Learning Spring Lecture 5 April 14

STATS 306B: Unsupervised Learning Spring Lecture 5 April 14 STATS 306B: Unsupervised Learning Spring 2014 Lecture 5 April 14 Lecturer: Lester Mackey Scribe: Brian Do and Robin Jia 5.1 Discrete Hidden Markov Models 5.1.1 Recap In the last lecture, we introduced

More information

Conditional Random Fields: An Introduction

Conditional Random Fields: An Introduction University of Pennsylvania ScholarlyCommons Technical Reports (CIS) Department of Computer & Information Science 2-24-2004 Conditional Random Fields: An Introduction Hanna M. Wallach University of Pennsylvania

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

CS534 Machine Learning - Spring Final Exam

CS534 Machine Learning - Spring Final Exam CS534 Machine Learning - Spring 2013 Final Exam Name: You have 110 minutes. There are 6 questions (8 pages including cover page). If you get stuck on one question, move on to others and come back to the

More information

CS Machine Learning Qualifying Exam

CS Machine Learning Qualifying Exam CS Machine Learning Qualifying Exam Georgia Institute of Technology March 30, 2017 The exam is divided into four areas: Core, Statistical Methods and Models, Learning Theory, and Decision Processes. There

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014 Exam policy: This exam allows two one-page, two-sided cheat sheets (i.e. 4 sides); No other materials. Time: 2 hours. Be sure to write

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

Independent Component Analysis and Unsupervised Learning. Jen-Tzung Chien

Independent Component Analysis and Unsupervised Learning. Jen-Tzung Chien Independent Component Analysis and Unsupervised Learning Jen-Tzung Chien TABLE OF CONTENTS 1. Independent Component Analysis 2. Case Study I: Speech Recognition Independent voices Nonparametric likelihood

More information

p(d θ ) l(θ ) 1.2 x x x

p(d θ ) l(θ ) 1.2 x x x p(d θ ).2 x 0-7 0.8 x 0-7 0.4 x 0-7 l(θ ) -20-40 -60-80 -00 2 3 4 5 6 7 θ ˆ 2 3 4 5 6 7 θ ˆ 2 3 4 5 6 7 θ θ x FIGURE 3.. The top graph shows several training points in one dimension, known or assumed to

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 2950-P, Spring 2013 Prof. Erik Sudderth Lecture 13: Learning in Gaussian Graphical Models, Non-Gaussian Inference, Monte Carlo Methods Some figures

More information

Learning Gaussian Process Models from Uncertain Data

Learning Gaussian Process Models from Uncertain Data Learning Gaussian Process Models from Uncertain Data Patrick Dallaire, Camille Besse, and Brahim Chaib-draa DAMAS Laboratory, Computer Science & Software Engineering Department, Laval University, Canada

More information

Christian Mohr

Christian Mohr Christian Mohr 20.12.2011 Recurrent Networks Networks in which units may have connections to units in the same or preceding layers Also connections to the unit itself possible Already covered: Hopfield

More information

PHONEME CLASSIFICATION OVER THE RECONSTRUCTED PHASE SPACE USING PRINCIPAL COMPONENT ANALYSIS

PHONEME CLASSIFICATION OVER THE RECONSTRUCTED PHASE SPACE USING PRINCIPAL COMPONENT ANALYSIS PHONEME CLASSIFICATION OVER THE RECONSTRUCTED PHASE SPACE USING PRINCIPAL COMPONENT ANALYSIS Jinjin Ye jinjin.ye@mu.edu Michael T. Johnson mike.johnson@mu.edu Richard J. Povinelli richard.povinelli@mu.edu

More information

Hidden Markov Models and other Finite State Automata for Sequence Processing

Hidden Markov Models and other Finite State Automata for Sequence Processing To appear in The Handbook of Brain Theory and Neural Networks, Second edition, (M.A. Arbib, Ed.), Cambridge, MA: The MIT Press, 2002. http://mitpress.mit.edu The MIT Press Hidden Markov Models and other

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

Lecture 5: GMM Acoustic Modeling and Feature Extraction

Lecture 5: GMM Acoustic Modeling and Feature Extraction CS 224S / LINGUIST 285 Spoken Language Processing Andrew Maas Stanford University Spring 2017 Lecture 5: GMM Acoustic Modeling and Feature Extraction Original slides by Dan Jurafsky Outline for Today Acoustic

More information

ASR using Hidden Markov Model : A tutorial

ASR using Hidden Markov Model : A tutorial ASR using Hidden Markov Model : A tutorial Samudravijaya K Workshop on ASR @BAMU; 14-OCT-11 samudravijaya@gmail.com Tata Institute of Fundamental Research Samudravijaya K Workshop on ASR @BAMU; 14-OCT-11

More information