ENHANCEMENTS OF MAXIMUM LIKELIHOOD EIGEN-DECOMPOSITION USING FUZZY LOGIC CONTROL FOR EIGENVOICE-BASED SPEAKER ADAPTATION.

International Journal of Innovative Computing, Information and Control, ICIC International, Volume 7, Number 7(B), July 2011, pp. 4207-4222

Ing-Jr Ding
Department of Electrical Engineering, National Formosa University
No. 64, Wunhua Rd., Huwei Township, Yunlin County 632, Taiwan
ingjr@nfu.edu.tw

Received May 2010; revised November 2010

Abstract. This paper presents a fuzzy logic control (FLC) mechanism for the popular eigenvoice-based speaker adaptation scheme. The proposed mechanism regulates the influence of maximum likelihood eigen-decomposition (MLED) when the training data from a new speaker is inadequate. The FLC-MLED method works by accounting for the amount of adaptation data when estimating the linear combination coefficients for eigenvector decomposition, which keeps speaker adaptation robust against data scarcity. The proposed mechanism is conceptually simple, effective and computationally inexpensive. Experimental results indicate that FLC-MLED outperforms conventional MLED, especially when adaptation data is insufficient, and performs better than maximum a posteriori eigen-decomposition (MAPED) at a much lower computing cost.

Keywords: Speech recognition, Speaker adaptation, Takagi-Sugeno fuzzy logic controller, Maximum likelihood eigen-decomposition, Maximum a posteriori eigen-decomposition

1. Introduction. Computing techniques for automatic speech recognition (ASR) have existed for years [1-4]. As they have matured, these techniques have found more and more applications in everyday life [5]. Nevertheless, the recognition performance of any speech recognition system ever built remains inferior to that of a human listener [6]. During recognition, the system encounters speech variations that are either unknown to it or represented only poorly in its models. These variations often cause a mismatch between the pre-established reference templates and the testing template, compromising recognition performance. Speaker adaptation (SA), sometimes referred to as model-based adaptation, can reduce this mismatch. Speaker adaptation is the process of transforming a speaker-independent (SI) speech recognition system into a speaker-dependent (SD) system. It achieves SD-like performance by adjusting the acoustic parameters of the SI speech model, typically in the form of hidden Markov models (HMMs), with speech samples acquired from a target speaker.

There are three major types of speaker-adaptive techniques: Bayesian-based adaptation, transformation-based adaptation and speaker-clustering-based adaptation. Bayesian-based model adaptation directly re-estimates the acoustic model parameters using maximum a posteriori (MAP) adaptation [7,8]; the Bayesian reasoning framework is an example of this approach. Transformation-based model adaptation, such as maximum likelihood linear regression (MLLR) [9] and maximum a posteriori linear regression (MAPLR) [10,11], derives appropriate transformations from a set of adaptation utterances from a new speaker and then applies them to clusters of HMM parameters.

Figure 1. Eigenvoice-based adaptation

Eigenvoice-based adaptation [12-23] is a relatively new member of the speaker adaptation family, first appearing around 2000 [12]. This approach is also known as speaker-clustering-based adaptation because it creates an SD speech model for every member in a group of speakers. The method extracts feature vectors, called eigenvoices, from these models to build an eigenvoice speech model (an eigenvoice vector space) for a new speaker. Adaptation of the speech model can then be carried out when adaptation data is available, as Figure 1 shows.

The basic concept of eigenvoice-based adaptation is to employ a priori knowledge about inter-speaker variation obtained by analyzing the training speakers. The method applies principal component analysis (PCA) to supervectors derived from the SD speaker models to construct the eigenvoice space [12]. These principal components are then used to build speaker-adaptive models for a new speaker through maximum likelihood eigen-decomposition (MLED), in which the linear combination coefficients for eigenvector decomposition are estimated via the maximum likelihood (ML) criterion. The MLED adaptation scheme proposed by Kuhn et al. [12] plays a key role in this type of eigenvoice adaptation technique and has proven effective in many speech recognition applications. However, given insufficient adaptation utterances from a new speaker, the performance of MLED is questionable because the linear combination coefficients are estimated inaccurately. In this case, the recognition rate may fall below the baseline, i.e., be worse than no adaptation at all (as the experiments in this study show).

After Kuhn et al., researchers proposed a series of variants of the MLED scheme in an attempt to improve the quality of the estimated linear combination coefficients given insufficient adaptation data. However, these approaches for enhancing the robustness of MLED are complicated and computationally time consuming, preventing on-line adaptation applications. For example, the MAPED scheme [23], which estimates the linear combination coefficients by maximizing the posterior density using maximum a posteriori (MAP) theory [7,8], is a classic variant of MLED, but it spends much more time estimating the linear combination coefficients than the MLED scheme.

This study proposes a fuzzy control mechanism to tackle the unreliable MLED estimation of the linear combination coefficients under insufficient training data, without incurring the high cost of MAPED-like approaches. Based on the amount of adaptation utterances available, the MLED approach is regulated so as to exploit the speed of MLED in calculating the linear combination coefficients when the amount of training data allows, while simultaneously alleviating the undesired effect of poorly estimated coefficients. The resulting implementation is called FLC-MLED, where FLC stands for fuzzy logic control and indicates the fuzzy mechanism incorporated into the conventional MLED. The use of an FLC mechanism in estimating acoustic parameters for eigenvoice-based speaker adaptation has rarely been attempted. However, an adaptation method supported by FLC has several advantages over those without: better performance in ordinary cases, robustness against the scarcity of training data, and less computation in parameter estimation than other MLED-enhancement methods (e.g., the typical MAPED). Fuzzy control has been applied to a wide range of applications with great success [24], including speech recognition. The Takagi-Sugeno (T-S) fuzzy model is conceptually simple and straightforward [25,26], and has appeared in the control of systems as complicated as an electric power plant [27-29]. Therefore, this study employs the T-S fuzzy model in eigenvoice speaker adaptation.

The rest of the paper is organized as follows. Section 2 briefly describes the theoretical formulation of MLED and MAPED, introduces the concept of incremental MLED eigenvoice adaptation under fuzzy regulation, and then formulates the T-S fuzzy mechanism for model adaptation in this study. Section 2 also presents a complexity analysis of the proposed scheme and describes possible future improvements of the FLC-MLED approach. Section 3 presents the experimental results, which compare the effectiveness and performance of FLC-MLED with conventional MLED and MAPED. Finally, Section 4 provides some concluding remarks.

2. FLC-MLED Eigenvoice Adaptation. The basic idea of eigenvoice adaptation is to build a number of speaker clusters in advance, and then represent the model of the current speaker as an interpolated form, i.e., a weighted sum, of the speaker clusters. Kuhn et al. [12] first proposed eigenvoice adaptation, in which a priori knowledge concerning the variations among all training speakers is represented as a set of SD model parameters in the form of eigenvectors named eigenvoices. A new speaker model is then expressed as a linear combination of the set of eigenvoices. The eigenvoice approach greatly reduces the number of parameters to be estimated while remaining capable of capturing the variation between speakers. The eigenvoice approach involves two tasks: eigenvoice construction (the training phase) and coefficient estimation (the adaptation phase). As Figure 1 shows, in the eigenvoice construction phase a set of N well-trained SD models from N speakers must be established first. The model parameters of each SD model are then vectorized, forming a set of N supervectors.
Space dimension reduction techniques, such as PCA, are then applied to the set of N supervectors to obtain N eigenvectors of dimension D, also called eigenvoices. In general, the higher-order eigenvoices are discarded and only the first K (K < N) eigenvoices are kept. These eigenvoices are significant because they retain most of the information in the speech data and are thus capable of representing the variations under consideration. Finally, using these K eigenvoices, an accurate speaker space, the K-space, can be spanned.
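As a concrete illustration of the eigenvoice construction phase just described, the following Python sketch (a minimal illustration under assumed data layouts, not the author's implementation) stacks the vectorized SD models into supervectors, runs PCA, and keeps the first K eigenvoices:

    import numpy as np

    def build_eigenvoice_space(sd_supervectors, K):
        """PCA on N speaker-dependent supervectors of dimension D.

        sd_supervectors: array of shape (N, D), one vectorized SD model per row.
        Returns the mean supervector e(0) and the first K eigenvoices e(1..K).
        """
        X = np.asarray(sd_supervectors, dtype=float)
        e0 = X.mean(axis=0)                       # e(0): mean of the N supervectors
        # PCA via SVD of the centered supervector matrix; the rows of Vt are the
        # principal directions ordered by decreasing variance.
        _, _, Vt = np.linalg.svd(X - e0, full_matrices=False)
        return e0, Vt[:K]                         # keep only the first K eigenvoices

    def synthesize_supervector(e0, eigenvoices, w):
        # New-speaker supervector as in Equation (1): mu = e(0) + sum_k w(k) e(k)
        return e0 + w @ eigenvoices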

The coefficient estimation phase then performs adaptation to determine the location of a new speaker in K-space. Let the supervector \mu of the new speaker be constructed in K-space as

\mu = e(0) + w(1)e(1) + \cdots + w(K)e(K) = e(0) + \sum_{k=1}^{K} w(k)\,e(k),    (1)

where e(0) is the mean vector of the N supervectors. The problem here is to estimate the weights {w(k), k = 1, 2, ..., K} that correspond to the K eigenvectors {e(k), k = 1, 2, ..., K}, i.e., to find a weighted combination of eigenvoices. In general, a classical eigen-decomposition scheme, such as MLED or MAPED [12,23], can derive the set of weight coefficients {w(k), k = 1, 2, ..., K} using speaker-specific adaptation data X. The following subsection briefly describes the theoretical formulations of MLED and MAPED.

2.1. MLED and MAPED. The MLED method estimates the weight coefficients by solving [12]

\hat{w}_{ML} = \arg\max_{w} P(X \mid w).    (2)

\hat{w}_{ML} in Equation (2) can be obtained by the expectation-maximization (E-M) algorithm [30]. In the E-step, the expectation is determined as

Q(\hat{w} \mid w) = E\left[\log P(X, S, M \mid \hat{w}) \mid X, w\right] \propto -\frac{1}{2}\sum_{s}\sum_{m}\sum_{t}\gamma_{m}^{(s)}(t)\left[n\log(2\pi) + \log\left|C_{m}^{(s)}\right| + h(x_{t}, s, m)\right],    (3)

where \gamma_{m}^{(s)}(t) = P(s_{t} = s, m_{t} = m \mid X, w) is the occupation probability that the observation x_{t} stays at state s and mixture m, and

h(x_{t}, s, m) = \left(\hat{\mu}_{m}^{(s)} - x_{t}\right)^{T}\left(C_{m}^{(s)}\right)^{-1}\left(\hat{\mu}_{m}^{(s)} - x_{t}\right).    (4)

\hat{\mu}_{m}^{(s)} in Equation (4) can be replaced with the corresponding linear combination of eigenvoices:

\hat{\mu}_{m}^{(s)} = \sum_{k=1}^{K}\hat{w}(k)\,e_{m}^{(s)}(k).    (5)

The M-step then maximizes Q(\hat{w} \mid w). To do so, set \partial Q(\hat{w} \mid w)/\partial\hat{w}(j) = 0 for j = 1, 2, ..., K. For each j, one obtains

\sum_{s}\sum_{m}\sum_{t}\gamma_{m}^{(s)}(t)\left(e_{m}^{(s)}(j)\right)^{T}\left(C_{m}^{(s)}\right)^{-1}x_{t} = \sum_{s}\sum_{m}\sum_{t}\gamma_{m}^{(s)}(t)\left\{\sum_{k=1}^{K}\hat{w}(k)\left(e_{m}^{(s)}(k)\right)^{T}\left(C_{m}^{(s)}\right)^{-1}e_{m}^{(s)}(j)\right\}.    (6)

There are K such equations to solve for the K unknown weights \hat{w}(1), \hat{w}(2), ..., \hat{w}(K).
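To make the M-step concrete, the sketch below (an illustrative Python example under assumed data structures, not the paper's code) accumulates the two sides of Equation (6) into a K-dimensional linear system and solves it for the MLED weights:

    import numpy as np

    def mled_weights(stats, eigenvoice_means, inv_covs):
        """Solve the K normal equations of Equation (6) for the MLED weights.

        stats: iterable of (gamma, x, key) triples, one per frame and Gaussian,
               where gamma is the occupation probability gamma_m^(s)(t), x is the
               observation vector x_t, and key identifies the pair (s, m).
        eigenvoice_means[key]: array of shape (K, n), the sub-vectors e_m^(s)(k).
        inv_covs[key]: array of shape (n, n), the inverse covariance of (s, m).
        """
        K = next(iter(eigenvoice_means.values())).shape[0]
        A = np.zeros((K, K))   # coefficients of the unknown weights w_hat(k)
        b = np.zeros(K)        # left-hand side of Equation (6)
        for gamma, x, key in stats:
            E = eigenvoice_means[key]      # (K, n)
            P = E @ inv_covs[key]          # rows: e(j)^T C^{-1}
            b += gamma * (P @ x)           # gamma * e(j)^T C^{-1} x_t
            A += gamma * (P @ E.T)         # gamma * e(j)^T C^{-1} e(k)
        return np.linalg.solve(A, b)       # w_hat(1), ..., w_hat(K)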

Given minimal adaptation data, the combination coefficients estimated by MLED will be inaccurate. Huang et al. presented the MAPED technique to make the estimate of the combination coefficients robust against insufficient adaptation data [23]. MAPED takes a prior density into account in the estimation of the combination coefficients using a MAP criterion:

\hat{w}_{MAP} = \arg\max_{\hat{w}} R(\hat{w} \mid w).    (7)

The auxiliary function R(\hat{w} \mid w) in Equation (7) is defined as

R(\hat{w} \mid w) \propto -\frac{1}{2}\left\{\sum_{s=1}^{S}\sum_{m=1}^{M_{s}}\sum_{t=1}^{T}\gamma_{m}^{(s)}(t)\left[n\log(2\pi) + \log\left|C_{m}^{(s)}\right| + h(x_{t}, s, m)\right] + \sum_{j=1}^{K}\left[\log(2\pi) + 2\log\sigma_{w(j)} + \frac{(\hat{w}(j) - \mu_{w(j)})^{2}}{\sigma_{w(j)}^{2}}\right]\right\},    (8)

where the coefficient w(j) is modeled by a Gaussian distribution with mean \mu_{w(j)} and variance \sigma_{w(j)}^{2}. To maximize R(\hat{w} \mid w), set \partial R(\hat{w} \mid w)/\partial\hat{w}(j) = 0 for j = 1, 2, ..., K. The combination coefficients \hat{w}(1), \hat{w}(2), ..., \hat{w}(K) are then derived from the following K equations:

\frac{\mu_{w(j)}}{\sigma_{w(j)}^{2}} + \sum_{s=1}^{S}\sum_{m=1}^{M_{s}}\sum_{t=1}^{T}\gamma_{m}^{(s)}(t)\left(e_{m}^{(s)}(j)\right)^{T}\left(C_{m}^{(s)}\right)^{-1}x_{t} = \sum_{k=1}^{K}\hat{w}(k)\left\{\sum_{s=1}^{S}\sum_{m=1}^{M_{s}}\sum_{t=1}^{T}\gamma_{m}^{(s)}(t)\left(e_{m}^{(s)}(k)\right)^{T}\left(C_{m}^{(s)}\right)^{-1}e_{m}^{(s)}(j) + \frac{\delta(k-j)}{\sigma_{w(j)}^{2}}\right\}.    (9)

Solving Equation (9) for the combination coefficients is obviously more time consuming than using MLED because of the additional parameters {\mu_{w(j)}, \sigma_{w(j)}^{2}} of the prior distribution [23]. Though it enhances the robustness of MLED, the MAPED scheme is more complicated and requires much more computation than the MLED scheme.

2.2. Incremental MLED eigenvoice adaptation. The coefficient estimation phase performs eigenvoice speaker adaptation using an eigen-decomposition algorithm such as MLED or MAPED to estimate a set of weights, i.e., a weighted combination of eigenvoices, for the new speaker. Given sufficient adaptation data, the eigenvoice adaptation method is effective. Given insufficient adaptation data, however, the accuracy of the estimated combination coefficients, especially those derived by the MLED approach, is dubious. Poor estimation of the combination coefficients in turn leads to incorrect positioning in the speaker space. The problem of scarce adaptation data can be alleviated by the MAPED scheme if heavy computation is permissible.

Given insufficient training data, it is necessary to be more conservative in using the combination coefficients thus derived. In other words, the effect of the adaptation should be restricted so that the adapted mean vector does not rely too heavily on combination coefficients derived from insufficient training data. Therefore, this study proposes the following incremental MLED eigenvoice adaptation approach [7,8]:

\hat{\mu}_{m}^{(s)} = \sum_{k=1}^{K}\left[\lambda\,w(k) + (1-\lambda)\,\mu_{w(k)}\right]e_{m}^{(s)}(k), \qquad 0 \le \lambda \le 1,    (10)

where w(k) is the combination coefficient calculated by MLED and \mu_{w(k)} is the prior mean of the combination coefficient. The linear combination coefficients for eigenvector decomposition are thus not taken directly from the maximum likelihood criterion; instead, this approach uses a weighted sum of the maximum likelihood estimate and the prior mean of the combination coefficient. The form of incremental MLED eigenvoice adaptation in Equation (10) is very similar to MAP adaptation [7,8], and is essentially a MAP-like adaptation. The weight parameter λ governs the balance between w(k) and \mu_{w(k)}, mimicking the role of the adaptation speed parameter in MAP adaptation [7,8]. With an appropriate weighting λ, satisfactory adaptation performance can be achieved even when only a small amount of training data is available for eigen-decomposition.
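As a small illustration of Equation (10), the following Python sketch (a hypothetical helper, not taken from the paper) blends the MLED estimate with the prior mean of each coefficient before rebuilding the adapted mean vector of a Gaussian component:

    import numpy as np

    def incremental_mled_mean(w_mled, w_prior_mean, eigenvoice_means, lam):
        """Adapted mean vector of Equation (10) for one Gaussian (s, m).

        w_mled:           MLED estimates w(k), shape (K,)
        w_prior_mean:     prior means mu_w(k), shape (K,)
        eigenvoice_means: e_m^(s)(k) stacked row-wise, shape (K, n)
        lam:              weight lambda in [0, 1]; a small lambda trusts the prior,
                          a large lambda trusts the MLED estimate
        """
        blended = lam * np.asarray(w_mled) + (1.0 - lam) * np.asarray(w_prior_mean)
        return blended @ eigenvoice_means   # sum_k [lam*w(k) + (1-lam)*mu_w(k)] e_m^(s)(k)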

Note that the weight λ varies depending on how much confidence one has in the combination coefficient derived from MLED. An estimate of the combination coefficient that is likely to be poor because of insufficient adaptation data should preferably go with λ approaching 0, so that the biased estimate of w(k) is restricted. Conversely, a λ approaching 1 can take full advantage of fast adaptation from sufficient adaptation data.

2.3. Fuzzy model and eigenvoice speaker adaptation. This section presents the FLC-MLED approach, which performs the incremental MLED estimate of the combination coefficient using fuzzy logic control. Depending on the adaptation data size, the FLC-MLED method adjusts the weight parameter λ, moving the coefficient for eigen-decomposition closer to the side of w(k) or to the side of \mu_{w(k)} when estimating the adapted mean vector of a new speaker in the speaker space. When the combination coefficient w(k) is reliable as a result of abundant adaptation samples, λ should be large. Conversely, λ should be smaller when the quality of w(k) is in doubt as a result of few adaptation samples. To fulfill these requirements, a rule base with three implications governs the regulation of λ given N training samples (in terms of acoustic frames) observed for all Gaussian mixture components:

Rule 1: If N is small, then λ is set to small,
Rule 2: If N is medium, then λ is set to medium,
Rule 3: If N is large, then λ is set to large.

Fuzzy techniques are naturally suitable for translating such linguistic statements into quantitative expressions for computation. This study employs the specific type of fuzzy logic control mechanism proposed by Takagi and Sugeno (T-S hereafter) [25].

2.3.1. T-S fuzzy control mechanism. The T-S fuzzy design procedure presents a systematic framework of fuzzy modeling for a complex system [25]. The system comprises a set of subsystems whose local behaviors are identified by expressing the input-output mapping in terms of a fuzzy implication (or rule), where the inputs are specified in the antecedent part and the output is a linear combination of the associated inputs. The overall system output is then a function of the subsystem outputs, which could be as simple as a linear combination, where the coefficient handling takes care of the fuzziness of the system behaviors, or of a more elaborate form. In the T-S fuzzy model, a generic system is formulated as a set of fuzzy implications together with a system output determined by the consequences in the set of implications. For a system of n inputs and l implications, the representation adopts the form

Rule 1: IF x(1) is A_1^1 and ... and x(n) is A_n^1, THEN y^1 = a_0^1 + a_1^1 x(1) + ... + a_n^1 x(n),
...
Rule i: IF x(1) is A_1^i and ... and x(n) is A_n^i, THEN y^i = a_0^i + a_1^i x(1) + ... + a_n^i x(n),    (11)
...
Rule l: IF x(1) is A_1^l and ... and x(n) is A_n^l, THEN y^l = a_0^l + a_1^l x(1) + ... + a_n^l x(n),

with the system output

y = \frac{\sum_{i=1}^{l} w^{i} y^{i}}{\sum_{i=1}^{l} w^{i}}, \qquad \text{where} \quad w^{i} = \prod_{p=1}^{n} A_{p}^{i}(x(p)).    (12)

Note that A_p^i, p = 1, ..., n, are fuzzy sets and A_p^i(x(p)) denotes the fuzzy value of the membership function associated with A_p^i for the input x(p); a_p^i, p = 0, 1, ..., n, are consequent parameters through which the i-th consequence y^i is expressed as a linear combination of the n inputs.
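The following Python sketch (an illustrative toy implementation of Equations (11) and (12); the product form of the rule firing strength follows Equation (12)) shows the generic T-S computation: each rule fires with the product of its membership values, and the output is the firing-strength-weighted average of the linear consequents:

    import numpy as np

    def ts_output(x, memberships, consequents):
        """Generic Takagi-Sugeno system output, Equations (11) and (12).

        x:            input vector of length n
        memberships:  list over rules; memberships[i][p] is the membership
                      function A^i_p, a callable returning a value in [0, 1]
        consequents:  array of shape (l, n + 1); row i holds (a^i_0, a^i_1, ..., a^i_n)
        """
        x = np.asarray(x, dtype=float)
        firing, outputs = [], []
        for A_i, a_i in zip(memberships, np.asarray(consequents, dtype=float)):
            w_i = np.prod([A_ip(xp) for A_ip, xp in zip(A_i, x)])  # rule firing strength
            y_i = a_i[0] + a_i[1:] @ x                             # linear consequent
            firing.append(w_i)
            outputs.append(y_i)
        firing = np.asarray(firing)
        return float(firing @ np.asarray(outputs) / firing.sum())  # weighted average of rule outputs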

2.3.2. FLC-MLED formulation. For the specific problem in this study, the simple rules governing the regulation of λ, given N adaptation samples observed for all Gaussian mixture components, can be formulated as follows:

Rule 1: If N is small, then λ is small,
Rule 2: If N is medium, then λ is medium,
Rule 3: If N is large, then λ is large.

Let M_1(N), M_2(N) and M_3(N) be membership functions associated respectively with small, medium and large amounts of training data available for adaptation, as Figure 2 shows, and let λ_S, λ_M and λ_L be the small, medium and large values of λ determined respectively by the functions f_1(N), f_2(N) and f_3(N) in each of the three cases. The previous set of rules can then be further clarified as:

Rule 1: If N is M_1(N), then λ_S = f_1(N),
Rule 2: If N is M_2(N), then λ_M = f_2(N),
Rule 3: If N is M_3(N), then λ_L = f_3(N),

where

M_1(N) = \begin{cases} 1, & N \le N_1, \\ \frac{N_2 - N}{N_2 - N_1}, & N_1 \le N \le N_2, \\ 0, & N \ge N_2, \end{cases} \qquad
M_2(N) = \begin{cases} 0, & N \le N_1 \ \text{or} \ N \ge N_3, \\ \frac{N - N_1}{N_2 - N_1}, & N_1 < N \le N_2, \\ \frac{N_3 - N}{N_3 - N_2}, & N_2 \le N < N_3, \end{cases} \qquad
M_3(N) = \begin{cases} 0, & N \le N_2, \\ \frac{N - N_2}{N_3 - N_2}, & N_2 < N < N_3, \\ 1, & N \ge N_3, \end{cases}

together with the implication functions

f_1(N) = a_1 N + b_1, \qquad f_2(N) = a_2 N + b_2, \qquad f_3(N) = a_3 N + b_3,

and the final system output [25]

\lambda = \frac{\sum_{i=1}^{3} M_i(N)\,f_i(N)}{\sum_{i=1}^{3} M_i(N)}.    (13)

Equation (13) shows that for N < N_1, λ is solely determined by f_1(N), i.e., λ = λ_S, whereas for N > N_3, λ is determined by f_3(N) alone. When N is approximately N_2, λ is essentially determined by f_2(N), since M_2(N) is then much greater than M_1(N) and M_3(N). The system has nine hyperparameters (a_1, a_2, a_3, b_1, b_2, b_3, N_1, N_2 and N_3) to be fixed, for which the following iterative process is developed:

Figure 2. Membership functions of the FLC for FLC-MLED

Step 1: Let N_1 : N_2 : N_3 = 1 : 2 : 3 and initialize N_1.

Step 2: Estimate the parameters a_1 and b_1 under the condition N < N_1, wherein M_1(N) = 1, M_2(N) = M_3(N) = 0 and

\lambda = \frac{M_1(N) f_1(N)}{M_1(N)} = f_1(N) = a_1 N + b_1.

The procedure for fixing a_1 and b_1 is shown in Figure 3 as a pseudo-code sequence.

Step 3: Estimate the parameters a_3 and b_3 under the condition N > N_3, wherein M_1(N) = M_2(N) = 0, M_3(N) = 1 and

\lambda = \frac{M_3(N) f_3(N)}{M_3(N)} = f_3(N) = a_3 N + b_3.

The determination of a_3 and b_3 follows the same process as for a_1 and b_1, with the initial condition R_0 = R_q taken from Step 2.

Step 4: Estimate the parameters a_2 and b_2 under the condition N_1 ≤ N ≤ N_2, wherein M_1(N) = (N_2 - N)/(N_2 - N_1), M_2(N) = (N - N_1)/(N_2 - N_1), M_3(N) = 0 and

\lambda = \frac{M_1(N) f_1(N) + M_2(N) f_2(N)}{M_1(N) + M_2(N)} = \frac{(N_2 - N)(a_1 N + b_1) + (N - N_1)(a_2 N + b_2)}{N_2 - N_1}.

With a_1 and b_1 already obtained in Step 2, the parameters a_2 and b_2 are determined through the same tuning process as in Step 2, with the initial condition R_0 = R_q taken from Step 3, again tuning for the best recognition rate R_q.

Figure 3. A procedure to fix the FLC hyperparameters a_1 and b_1

Step 5: Re-estimate the parameter N_3 under the condition N_2 ≤ N ≤ N_3, wherein M_1(N) = 0, M_2(N) = (N_3 - N)/(N_3 - N_2), M_3(N) = (N - N_2)/(N_3 - N_2) and

\lambda = \frac{M_2(N) f_2(N) + M_3(N) f_3(N)}{M_2(N) + M_3(N)} = \frac{(N_3 - N)(a_2 N + b_2) + (N - N_2)(a_3 N + b_3)}{N_3 - N_2}.

Since a_2 and b_2, together with a_3 and b_3, have already been determined in Steps 4 and 3 respectively, a new value of N_3 can now be obtained by tuning for a higher R_q value than in Step 4.

Step 6: Given the new estimate of N_3 from Step 5, update N_1 and N_2 such that N_1 : N_2 : N_3 = 1 : 2 : 3, compute

\delta = \frac{|R_q - R|}{R},

where R is the desired recognition rate, set R_0 = R_q, and repeat from Step 2 until δ is less than a predefined threshold.
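Putting Equation (13) and the nine tuned hyperparameters together, the sketch below (illustrative only; the function name and the final clamping of λ to [0, 1] are assumptions, with the triangular membership shapes following Figure 2 as described above) computes λ from the observed frame count N:

    def flc_lambda(N, a, b, N1, N2, N3):
        """Takagi-Sugeno output of Equation (13) for the FLC-MLED weight lambda.

        a, b:          consequent slopes (a1, a2, a3) and offsets (b1, b2, b3)
        N1 < N2 < N3:  breakpoints of the triangular membership functions of Figure 2
        """
        a1, a2, a3 = a
        b1, b2, b3 = b
        # Triangular membership functions M1, M2, M3
        M1 = 1.0 if N <= N1 else (N2 - N) / (N2 - N1) if N <= N2 else 0.0
        M3 = 0.0 if N <= N2 else (N - N2) / (N3 - N2) if N < N3 else 1.0
        if N <= N1 or N >= N3:
            M2 = 0.0
        elif N <= N2:
            M2 = (N - N1) / (N2 - N1)
        else:
            M2 = (N3 - N) / (N3 - N2)
        # Linear rule consequents f_i(N) and the weighted average of Equation (13)
        f = (a1 * N + b1, a2 * N + b2, a3 * N + b3)
        M = (M1, M2, M3)
        lam = sum(Mi * fi for Mi, fi in zip(M, f)) / sum(M)
        return min(max(lam, 0.0), 1.0)   # keep lambda within [0, 1], as required by Equation (10)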

2.4. Time complexity analysis of FLC-MLED. Compared with conventional MLED, the computation overhead of FLC-MLED adaptation for calculating λ is relatively minor, since at most 4 extra multiplications are required. The analysis is straightforward. For N < N_1, λ = a_1 N + b_1, which takes only 1 multiplication, as does the case N > N_3, where λ = a_3 N + b_3. For the case N_1 ≤ N ≤ N_2,

\lambda = \frac{M_1(N) f_1(N) + M_2(N) f_2(N)}{M_1(N) + M_2(N)} = \frac{N^2 (a_2 - a_1) + N (a_1 N_2 - a_2 N_1 + b_2 - b_1) + b_1 N_2 - b_2 N_1}{N_2 - N_1} = p\,(c_1 N^2 + c_2 N + c_3),

the computation of which involves 4 multiplications, as does the case N_2 ≤ N ≤ N_3, where

\lambda = \frac{M_2(N) f_2(N) + M_3(N) f_3(N)}{M_2(N) + M_3(N)} = \frac{N^2 (a_3 - a_2) + N (a_2 N_3 - a_3 N_2 + b_3 - b_2) + b_2 N_3 - b_3 N_2}{N_3 - N_2} = q\,(d_1 N^2 + d_2 N + d_3).

Thus, calculating λ does not increase the time complexity, and the computation of Equation (10) is of the same order as that of Equation (5). Therefore, computing FLC-MLED is much less expensive than computing MAPED.

2.5. Improvements and future directions for FLC-MLED. This study develops a complete concept of a fuzzy logic control mechanism for eigenvoice speaker adaptation applications. Nevertheless, several improvements to the proposed FLC-MLED approach are possible before it covers all needs, and there are several ways in which the FLC mechanism could be extended in future work. Eigenvoice speaker adaptation based on neural networks (NN), support vector machines (SVM) and genetic algorithms (GA) could incorporate FLC mechanisms wherever plausible. Given insufficient training data from a new speaker, the proposed FLC-MLED regulates the influence of maximum likelihood eigen-decomposition by considering the amount of adaptation data when estimating the linear combination coefficients for eigenvector decomposition, which keeps speaker adaptation robust against data scarcity. However, the effectiveness and performance of speaker adaptation also depend strongly on the quality, not just the quantity, of the adaptation data. A previous study [31] presents a hybrid scheme of a support vector machine and fuzzy logic control incorporated into MAP speaker adaptation to address this point in speaker adaptation design. How to incorporate this hybrid SVM and FLC mechanism into the eigenvoice process is a challenging issue, and an SVM-FLC-MLED would seem to be a promising subject for future research. The adaptation capability of eigenvoice speaker adaptation under such a hybrid SVM-FLC framework would be further strengthened, especially in its robustness against scarce or improper adaptation data. Another key issue for future research is to enhance the FLC design of FLC-MLED. The FLC design must account for potential variations in the process itself, making the use of time-variant parameters in the FLC design unavoidable. In other words, the FLC of FLC-MLED should adapt to the time-varying process. Such adaptation can be performed by modifying the rule sets or the fuzzy sets, resulting in two classes of FLCs: the self-organizing and the self-tuning FLC [32].

As a final remark, the T-S FLC mechanism is only one choice among many fuzzy formulations for control by computation. For example, the Mamdani (linguistic) fuzzy model [33] is an alternative that could be used in place of the T-S FLC in the proposed FLC-MLED.

3. Experiments. This study presents experiments with FLC-MLED adaptation to compare its recognition performance with MLED and MAPED adaptation under different amounts of adaptation data. The following subsections present the experimental settings and results of the proposed FLC-MLED adaptation algorithm.

3.1. Databases and experimental design. Experiments on the recognition of 30 world-famous city names in Mandarin were run in three parts: (1) establishing the initial SI models and the eigenspace, (2) the training phase for fixing the FLC hyperparameters, and (3) the recognition phase for evaluating the performance of the FLC-based tuning of the weight λ (FLC-MLED). An 8 kHz sampling rate was used for speech signal acquisition. The analysis frames were 30 ms wide with a 15 ms overlap. A 24-dimensional feature vector was extracted for each frame, made up of a 12-dimensional mel-cepstral vector and a 12-dimensional delta-mel-cepstral vector.

The MAT400 sub-database DB3 [34] was used to train the initial SI models as a set of HMM parameters. This study adopted Initial/Final HMMs. A syllable in Mandarin comprises an initial part and a final part. The modeling of Mandarin syllables assumes that the initial part is right-dependent on the beginning phone of the following final part, while the final part is context independent [35]. A Mandarin utterance consists of one to several syllables. The HMM of a syllable comprises an HMM with 3 states for the initial part and an HMM with 6 states for the final part; the HMM of an utterance consists of the HMMs of all its constituent syllables. Each state has 4 Gaussian mixture components. An SD model was generated for each training speaker in the database by adjusting the SI model, and the resulting SD models were then used to build the eigenspace bases.

The training phase collected training data for tuning the hyperparameters of the FLC from 15 speakers. Each of the 15 speakers uttered 10 city names (picked from the 30 cities) to generate the adaptation data, and then uttered 60 names (two utterances for each city) to generate FLC parameter tuning data (used in the follow-up observations); all utterances were recorded with an ordinary microphone. The training phase experiment procedure is described in the pseudo-code sequence below.

    R_0 = baseline recognition rate;
    t = 0;
    Repeat {
        t++;
        R_t^2  = 2_utterances_training(eigenvectors, hyperparameters);
        R_t^4  = 4_utterances_training(eigenvectors, hyperparameters);
        R_t^6  = 6_utterances_training(eigenvectors, hyperparameters);
        R_t^8  = 8_utterances_training(eigenvectors, hyperparameters);
        R_t^10 = 10_utterances_training(eigenvectors, hyperparameters);
        R_t = (sum over i = 1..5 of R_t^{2i}) / 5;
        delta_R_t = |R_t - R_{t-1}|;
    } until delta_R_t < threshold;

Here, 2i_utterances_training(·), i = 1, 2, 3, 4, 5, is the procedure that uses 2i adaptation utterances from the 15 speakers to fix the 9 FLC hyperparameters defined in Section 2.3.2, and thus returns a better-than-baseline overall recognition rate R_t^{2i} for the 15 training speakers, as explained in the code-like sequence below.

    2i_utterances_training(eigenvectors, hyperparameters)   // i = 1, 2, 3, 4, 5
    {
        k = 0;
        R_{2i}^0 = baseline recognition rate;
        Repeat {
            k++;
            R_{(2i)1}^k  = speaker_training(eigenvectors, test_data_1,  hyperparameters, 2i_utterances_1);
            ...
            R_{(2i)j}^k  = speaker_training(eigenvectors, test_data_j,  hyperparameters, 2i_utterances_j);
            ...
            R_{(2i)15}^k = speaker_training(eigenvectors, test_data_15, hyperparameters, 2i_utterances_15);
            R_{2i}^k = (sum over j = 1..15 of R_{(2i)j}^k) / 15;
            delta_R_{2i}^k = |R_{2i}^k - R_{2i}^{k-1}|;
        } until delta_R_{2i}^k < threshold_1;
        return R_{2i}^k;
    }

where 2i_utterances_j and test_data_j denote, respectively, the adaptation utterances (2, 4, 6, 8 or 10 of them) and the 60 test utterances from the j-th speaker, 1 ≤ j ≤ 15, used for tuning the 9 hyperparameters of the proposed FLC mechanism. speaker_training(·) is the procedure that incrementally performs adaptation by appropriate settings of the T-S FLC hyperparameters, as already described in Section 2.3.2, such that the adaptation does not jeopardize the recognition rate given 2i utterances.

    speaker_training(eigenvectors, test_data_j, hyperparameters, 2i_utterances_j)   // j = 1, 2, ..., 15
    {
        Derive w(k) from 2i_utterances_j;
            // w(k), k = 1, 2, ..., K, denoting the combination coefficients from MLED
        R_{(2i)j} = Iterative_process(w(k), eigenvectors, test_data_j, hyperparameters);
            // the iterative process of Section 2.3.2, maximizing the recognition rate R_{(2i)j}
        return R_{(2i)j};
    }

As a result, a set of FLC hyperparameters {a_1, a_2, a_3, b_1, b_2, b_3, N_1, N_2, N_3} was determined.

The recognition phase involved a new group of 15 speakers. Each speaker was asked to generate 10 and 60 utterances for adaptation and recognition, respectively. The weight λ was calculated using the hyperparameters acquired in the training stage. For the recognition experiment with FLC-MLED adaptation, five adapted models were constructed using 2, 4, 6, 8 and 10 adaptation utterances from each of the 15 speakers, and the λ for each of the 5 adaptations was calculated by Equation (13) with N_utterances = 2, 4, 6, 8 or 10 and the FLC hyperparameters already determined in the training phase.
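To show how the pieces of the recognition-phase setup fit together, a hypothetical driver in Python (reusing the illustrative helpers sketched earlier; names such as mled_weights, flc_lambda and incremental_mled_mean are assumptions, not the paper's code) might look like:

    def adapt_new_speaker(adapt_stats, n_frames, eigenvoices_per_gaussian,
                          inv_covs, prior_w_mean, flc_params):
        """Illustrative FLC-MLED adaptation for one new speaker.

        adapt_stats: sufficient statistics collected from the adaptation utterances
        n_frames:    total number of adaptation frames N observed for all mixtures
        flc_params:  ((a1, a2, a3), (b1, b2, b3), N1, N2, N3) from the training phase
        """
        a, b, N1, N2, N3 = flc_params
        w_mled = mled_weights(adapt_stats, eigenvoices_per_gaussian, inv_covs)  # Equation (6)
        lam = flc_lambda(n_frames, a, b, N1, N2, N3)                            # Equation (13)
        adapted_means = {
            key: incremental_mled_mean(w_mled, prior_w_mean, E, lam)            # Equation (10)
            for key, E in eigenvoices_per_gaussian.items()
        }
        return adapted_means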

For performance comparison, 5 MLED-adapted and 5 MAPED-adapted models, using 2, 4, 6, 8 and 10 adaptation utterances respectively, were also constructed. Then, 60 utterances from each of the 15 speakers were fed into the five adapted models to evaluate their recognition rates.

Figure 4. The curve of the training values of λ

3.2. Experiment results. The training phase produced some interesting experimental results and observations. The weight λ increased as the number of adaptation utterances increased. As Figure 4 shows, λ rose noticeably when the number of utterances increased from 2 to 6, and then ascended gradually and more or less stabilized as the number of utterances increased further. The λ curve exhibited the same tendency as that intended by the precursory fuzzy rule base design (Section 2.3.2).

This study also used various numbers of adaptation utterances to compare the recognition performance of the proposed FLC-MLED utilizing a T-S FLC, the MLED without reference to any prior knowledge, and the MAPED. As Figure 5 shows, the recognition rate improved as the number of adaptation utterances increased for all three adaptations. With limited adaptation utterances, the performance of the MLED and MAPED methods fell below the baseline recognition rate, which indicates the potential inaccuracy or unreliability of MLED- and MAPED-adapted models built from insufficient adaptation data. The performance of the FLC-MLED method remained above the baseline even when only 2 utterances were available for adaptation. In all testing cases, the proposed FLC-MLED adaptation achieved the best recognition, followed by MAPED adaptation and then MLED adaptation. FLC-MLED performed better than MLED and MAPED especially when the training data was quite limited; note that MAPED tends to catch up with FLC-MLED as the amount of training data increases.

Finally, Figures 6 and 7 show the effects of varying λ on the recognition performance of MLED under two extreme cases of training data availability. Figure 6 shows that when the training data is scarce, e.g., 2 utterances, the performance falls below the baseline once λ exceeds 0.3, because the model adaptation is then largely determined by the combination coefficients w(k) derived from MLED, which are very likely poorly estimated. With a small value of λ, the influence of w(k) on the adaptation is reduced and the recognition rate stays above the baseline. Conversely, given sufficient training data, 10 utterances for instance, full advantage of adaptation by w(k) should be exploited by using a large λ value for good performance, as Figure 7 shows.

Figure 5. The performance curves of FLC-MLED, conventional MAPED and conventional MLED in the recognition testing experiments

Figure 6. Number of adaptation utterances = 2 (MLED testing experiments)

Figure 7. Number of adaptation utterances = 10 (MLED testing experiments)

4. Conclusions. This paper presents an FLC-MLED scheme with a weight control parameter λ determined by a fuzzy logic controller. The fuzzy mechanism regulates λ according to the amount of adaptation data. The proposed FLC-MLED enhances the eigen-decomposition of eigenvoice speaker adaptation and accurately identifies the HMM acoustic parameters of a new speaker. Experimental results show that FLC-MLED outperforms MLED and even MAPED in recognition performance, regardless of the amount of adaptation data available.

The behavior of λ with respect to the variation in available adaptation data follows the requirements of the FLC design. FLC-MLED is an adaptive learning method that is more robust against data insufficiency than conventional MLED and incurs a much lower computation cost than MAPED.

Acknowledgment. This research is partially supported by the National Science Council (NSC) in Taiwan under grant NSC E. The author also gratefully acknowledges the helpful comments and suggestions of the reviewers, which have improved the presentation.

REFERENCES

[1] B. H. Juang and L. R. Rabiner, Automatic speech recognition - A brief history of the technology development, Encyclopedia of Language and Linguistics, 2nd Edition, Elsevier.
[2] M. Nakayama and S. Ishimitsu, Speech support system using body-conducted speech recognition for disorders, International Journal of Innovative Computing, Information and Control, vol.5, no.11(B).
[3] X. Wang, J. Lin, Y. Sun, H. Gan and L. Yao, Applying feature extraction of speech recognition on VoIP auditing, International Journal of Innovative Computing, Information and Control, vol.5, no.7.
[4] T. Guan and Q. Gong, A study on the effects of spectral information encoding in Mandarin speech recognition in white noise, ICIC Express Letters, vol.3, no.3(A).
[5] L. R. Rabiner, The power of speech, Science, vol.301.
[6] R. P. Lippmann, Speech recognition by machines and humans, Speech Communication, vol.22, pp.1-15.
[7] C. H. Lee, C. H. Lin and B. H. Juang, A study on speaker adaptation of the parameters of continuous density hidden Markov models, IEEE Trans. on Acoustics, Speech and Signal Processing, vol.39.
[8] J. L. Gauvain and C. H. Lee, Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains, IEEE Trans. on Speech and Audio Processing, vol.2, no.2.
[9] C. J. Leggetter and P. C. Woodland, Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models, Computer Speech and Language, vol.9.
[10] C. Chesta, O. Siohan and C. H. Lee, Maximum a posteriori linear regression for hidden Markov model adaptation, Proc. of the European Conference on Speech Communication and Technology (EUROSPEECH).
[11] W. Chou, Maximum a posteriori linear regression with elliptically symmetric matrix priors, Proc. of the European Conference on Speech Communication and Technology (EUROSPEECH), pp.1-4.
[12] R. Kuhn, J.-C. Junqua, P. Nguyen and N. Niedzielski, Rapid speaker adaptation in eigenvoice space, IEEE Trans. on Speech and Audio Processing, vol.8, no.6.
[13] K. T. Chen, W. W. Liau, H. M. Wang and L. S. Lee, Fast speaker adaptation using eigenspace-based maximum likelihood linear regression, Proc. of the International Conference on Spoken Language Processing.
[14] K. T. Chen and H. M. Wang, Eigenspace-based maximum a posteriori linear regression for rapid speaker adaptation, Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing.
[15] B. Mak, S. Ho and J. T. Kwok, Speedup of kernel eigenvoice speaker adaptation by embedded kernel PCA, Proc. of the International Conference on Spoken Language Processing.
[16] B. Mak and R. Hsiao, Improving eigenspace-based MLLR adaptation by kernel PCA, Proc. of the International Conference on Spoken Language Processing, pp.13-16.
[17] R. Hsiao and B. Mak, Kernel eigenspace-based MLLR adaptation using multiple regression classes, Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing.
[18] B. Mak, J. T. Kwok and S. Ho, Kernel eigenvoice speaker adaptation, IEEE Trans. on Speech and Audio Processing, vol.13, no.5.
[19] B. Zhou and J. Hansen, Rapid discriminative acoustic model based on eigenspace mapping for fast speaker adaptation, IEEE Trans. on Speech and Audio Processing, vol.13, no.4, 2005.
[20] B. Mak and S. Ho, Various reference speakers determination methods for embedded kernel eigenvoice speaker adaptation, Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing.
[21] B. Mak, R. Hsiao, S. Ho and J. T. Kwok, Embedded kernel eigenvoice speaker adaptation and its implication to reference speaker weighting, IEEE Trans. on Audio, Speech, and Language Processing, vol.14, no.4.
[22] B. Mak and R. Hsiao, Kernel eigenspace-based MLLR adaptation, IEEE Trans. on Audio, Speech, and Language Processing, vol.15, no.3.
[23] C.-H. Huang, J.-T. Chien and H.-M. Wang, A new eigenvoice approach to speaker adaptation, Proc. of the IEEE International Symposium on Chinese Spoken Language Processing.
[24] R. Yager and D. Filev, Essentials of Fuzzy Modeling and Control, Wiley, New York.
[25] T. Takagi and M. Sugeno, Fuzzy identification of systems and its applications to modeling and control, IEEE Trans. on Systems, Man, and Cybernetics, vol.15.
[26] C. Li, J. Yi and D. Zhao, Design of interval type-2 fuzzy logic system using sampled data and prior knowledge, ICIC Express Letters, vol.3, no.3(B).
[27] J. Yen, R. Langari and L. A. Zadeh, Industrial Applications of Fuzzy Logic and Intelligent Systems, IEEE Press, New York.
[28] S. Kermiche, M. L. Saidi, H. A. Abbassi and H. Ghodbane, Takagi-Sugeno based controller for mobile robot navigation, Journal of Applied Science, vol.6, no.8.
[29] M. C. M. Teixeira, G. S. Deaecto, R. Gaino, E. Assunção, A. A. Carvalho and U. C. Farias, Design of a fuzzy Takagi-Sugeno controller to vary the joint knee angle of paraplegic patients, Proc. of the International Conference on Neural Information Processing.
[30] A. P. Dempster, N. M. Laird and D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, vol.39, pp.1-38.
[31] I.-J. Ding, MAP speaker adaptation by hybrid SVM-FLC for speech recognition, ICIC Express Letters, vol.5, no.2.
[32] H.-J. Zimmermann, Fuzzy Set Theory and Its Applications, 3rd Edition, Kluwer Academic.
[33] E. H. Mamdani, Application of fuzzy logic to approximate reasoning using linguistic systems, Fuzzy Sets and Systems, vol.26.
[34] H. C. Wang, MAT - A project to collect Mandarin speech data through telephone networks in Taiwan, Comput. Linguist. Chinese Lang. Process., vol.2, pp.73-89.
[35] C. H. Lin, C. H. Wu, P. Y. Ting and H. M. Wang, Frameworks for recognition of Mandarin syllables with tones using sub-syllabic units, Speech Communication, vol.18, no.2, 1996.


More information

ABSTRACT INTRODUCTION

ABSTRACT INTRODUCTION ABSTRACT Presented in this paper is an approach to fault diagnosis based on a unifying review of linear Gaussian models. The unifying review draws together different algorithms such as PCA, factor analysis,

More information

A FUZZY TIME SERIES-MARKOV CHAIN MODEL WITH AN APPLICATION TO FORECAST THE EXCHANGE RATE BETWEEN THE TAIWAN AND US DOLLAR.

A FUZZY TIME SERIES-MARKOV CHAIN MODEL WITH AN APPLICATION TO FORECAST THE EXCHANGE RATE BETWEEN THE TAIWAN AND US DOLLAR. International Journal of Innovative Computing, Information and Control ICIC International c 2012 ISSN 1349-4198 Volume 8, Number 7(B), July 2012 pp. 4931 4942 A FUZZY TIME SERIES-MARKOV CHAIN MODEL WITH

More information

Forward algorithm vs. particle filtering

Forward algorithm vs. particle filtering Particle Filtering ØSometimes X is too big to use exact inference X may be too big to even store B(X) E.g. X is continuous X 2 may be too big to do updates ØSolution: approximate inference Track samples

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Gaussian Mixture Model Uncertainty Learning (GMMUL) Version 1.0 User Guide

Gaussian Mixture Model Uncertainty Learning (GMMUL) Version 1.0 User Guide Gaussian Mixture Model Uncertainty Learning (GMMUL) Version 1. User Guide Alexey Ozerov 1, Mathieu Lagrange and Emmanuel Vincent 1 1 INRIA, Centre de Rennes - Bretagne Atlantique Campus de Beaulieu, 3

More information

Session Variability Compensation in Automatic Speaker Recognition

Session Variability Compensation in Automatic Speaker Recognition Session Variability Compensation in Automatic Speaker Recognition Javier González Domínguez VII Jornadas MAVIR Universidad Autónoma de Madrid November 2012 Outline 1. The Inter-session Variability Problem

More information

Speaker Representation and Verification Part II. by Vasileios Vasilakakis

Speaker Representation and Verification Part II. by Vasileios Vasilakakis Speaker Representation and Verification Part II by Vasileios Vasilakakis Outline -Approaches of Neural Networks in Speaker/Speech Recognition -Feed-Forward Neural Networks -Training with Back-propagation

More information

Improving the Effectiveness of Speaker Verification Domain Adaptation With Inadequate In-Domain Data

Improving the Effectiveness of Speaker Verification Domain Adaptation With Inadequate In-Domain Data Distribution A: Public Release Improving the Effectiveness of Speaker Verification Domain Adaptation With Inadequate In-Domain Data Bengt J. Borgström Elliot Singer Douglas Reynolds and Omid Sadjadi 2

More information

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH Hoang Trang 1, Tran Hoang Loc 1 1 Ho Chi Minh City University of Technology-VNU HCM, Ho Chi

More information

Estimating Gaussian Mixture Densities with EM A Tutorial

Estimating Gaussian Mixture Densities with EM A Tutorial Estimating Gaussian Mixture Densities with EM A Tutorial Carlo Tomasi Due University Expectation Maximization (EM) [4, 3, 6] is a numerical algorithm for the maximization of functions of several variables

More information

Introduction: MLE, MAP, Bayesian reasoning (28/8/13)

Introduction: MLE, MAP, Bayesian reasoning (28/8/13) STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this

More information

Discriminative training of GMM-HMM acoustic model by RPCL type Bayesian Ying-Yang harmony learning

Discriminative training of GMM-HMM acoustic model by RPCL type Bayesian Ying-Yang harmony learning Discriminative training of GMM-HMM acoustic model by RPCL type Bayesian Ying-Yang harmony learning Zaihu Pang 1, Xihong Wu 1, and Lei Xu 1,2 1 Speech and Hearing Research Center, Key Laboratory of Machine

More information

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

Eigenvoice Modeling With Sparse Training Data

Eigenvoice Modeling With Sparse Training Data IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 13, NO. 3, MAY 2005 345 Eigenvoice Modeling With Sparse Training Data Patrick Kenny, Member, IEEE, Gilles Boulianne, Member, IEEE, and Pierre Dumouchel,

More information

Intelligent Systems and Control Prof. Laxmidhar Behera Indian Institute of Technology, Kanpur

Intelligent Systems and Control Prof. Laxmidhar Behera Indian Institute of Technology, Kanpur Intelligent Systems and Control Prof. Laxmidhar Behera Indian Institute of Technology, Kanpur Module - 2 Lecture - 4 Introduction to Fuzzy Logic Control In this lecture today, we will be discussing fuzzy

More information

Environmental Sound Classification in Realistic Situations

Environmental Sound Classification in Realistic Situations Environmental Sound Classification in Realistic Situations K. Haddad, W. Song Brüel & Kjær Sound and Vibration Measurement A/S, Skodsborgvej 307, 2850 Nærum, Denmark. X. Valero La Salle, Universistat Ramon

More information

Short-Time ICA for Blind Separation of Noisy Speech

Short-Time ICA for Blind Separation of Noisy Speech Short-Time ICA for Blind Separation of Noisy Speech Jing Zhang, P.C. Ching Department of Electronic Engineering The Chinese University of Hong Kong, Hong Kong jzhang@ee.cuhk.edu.hk, pcching@ee.cuhk.edu.hk

More information

Dynamic Time-Alignment Kernel in Support Vector Machine

Dynamic Time-Alignment Kernel in Support Vector Machine Dynamic Time-Alignment Kernel in Support Vector Machine Hiroshi Shimodaira School of Information Science, Japan Advanced Institute of Science and Technology sim@jaist.ac.jp Mitsuru Nakai School of Information

More information

A Generative Model Based Kernel for SVM Classification in Multimedia Applications

A Generative Model Based Kernel for SVM Classification in Multimedia Applications Appears in Neural Information Processing Systems, Vancouver, Canada, 2003. A Generative Model Based Kernel for SVM Classification in Multimedia Applications Pedro J. Moreno Purdy P. Ho Hewlett-Packard

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Jorge Silva and Shrikanth Narayanan, Senior Member, IEEE. 1 is the probability measure induced by the probability density function

Jorge Silva and Shrikanth Narayanan, Senior Member, IEEE. 1 is the probability measure induced by the probability density function 890 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 3, MAY 2006 Average Divergence Distance as a Statistical Discrimination Measure for Hidden Markov Models Jorge Silva and Shrikanth

More information

FEATURE PRUNING IN LIKELIHOOD EVALUATION OF HMM-BASED SPEECH RECOGNITION. Xiao Li and Jeff Bilmes

FEATURE PRUNING IN LIKELIHOOD EVALUATION OF HMM-BASED SPEECH RECOGNITION. Xiao Li and Jeff Bilmes FEATURE PRUNING IN LIKELIHOOD EVALUATION OF HMM-BASED SPEECH RECOGNITION Xiao Li and Jeff Bilmes Department of Electrical Engineering University. of Washington, Seattle {lixiao, bilmes}@ee.washington.edu

More information

Automated Segmentation of Low Light Level Imagery using Poisson MAP- MRF Labelling

Automated Segmentation of Low Light Level Imagery using Poisson MAP- MRF Labelling Automated Segmentation of Low Light Level Imagery using Poisson MAP- MRF Labelling Abstract An automated unsupervised technique, based upon a Bayesian framework, for the segmentation of low light level

More information

CSC411: Final Review. James Lucas & David Madras. December 3, 2018

CSC411: Final Review. James Lucas & David Madras. December 3, 2018 CSC411: Final Review James Lucas & David Madras December 3, 2018 Agenda 1. A brief overview 2. Some sample questions Basic ML Terminology The final exam will be on the entire course; however, it will be

More information

IBM Research Report. Training Universal Background Models for Speaker Recognition

IBM Research Report. Training Universal Background Models for Speaker Recognition RC24953 (W1003-002) March 1, 2010 Other IBM Research Report Training Universal Bacground Models for Speaer Recognition Mohamed Kamal Omar, Jason Pelecanos IBM Research Division Thomas J. Watson Research

More information

A TWO-LAYER NON-NEGATIVE MATRIX FACTORIZATION MODEL FOR VOCABULARY DISCOVERY. MengSun,HugoVanhamme

A TWO-LAYER NON-NEGATIVE MATRIX FACTORIZATION MODEL FOR VOCABULARY DISCOVERY. MengSun,HugoVanhamme A TWO-LAYER NON-NEGATIVE MATRIX FACTORIZATION MODEL FOR VOCABULARY DISCOVERY MengSun,HugoVanhamme Department of Electrical Engineering-ESAT, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, Bus

More information

A Novel Low-Complexity HMM Similarity Measure

A Novel Low-Complexity HMM Similarity Measure A Novel Low-Complexity HMM Similarity Measure Sayed Mohammad Ebrahim Sahraeian, Student Member, IEEE, and Byung-Jun Yoon, Member, IEEE Abstract In this letter, we propose a novel similarity measure for

More information

STA 414/2104: Machine Learning

STA 414/2104: Machine Learning STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far

More information

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models Jeff A. Bilmes (bilmes@cs.berkeley.edu) International Computer Science Institute

More information

THE presence of missing values in a dataset often makes

THE presence of missing values in a dataset often makes 1 Efficient EM Training of Gaussian Mixtures with Missing Data Olivier Delalleau, Aaron Courville, and Yoshua Bengio arxiv:1209.0521v1 [cs.lg] 4 Sep 2012 Abstract In data-mining applications, we are frequently

More information

Comparing linear and non-linear transformation of speech

Comparing linear and non-linear transformation of speech Comparing linear and non-linear transformation of speech Larbi Mesbahi, Vincent Barreaud and Olivier Boeffard IRISA / ENSSAT - University of Rennes 1 6, rue de Kerampont, Lannion, France {lmesbahi, vincent.barreaud,

More information

Hierarchical Clustering of Dynamical Systems based on Eigenvalue Constraints

Hierarchical Clustering of Dynamical Systems based on Eigenvalue Constraints Proc. 3rd International Conference on Advances in Pattern Recognition (S. Singh et al. (Eds.): ICAPR 2005, LNCS 3686, Springer), pp. 229-238, 2005 Hierarchical Clustering of Dynamical Systems based on

More information

Self Supervised Boosting

Self Supervised Boosting Self Supervised Boosting Max Welling, Richard S. Zemel, and Geoffrey E. Hinton Department of omputer Science University of Toronto 1 King s ollege Road Toronto, M5S 3G5 anada Abstract Boosting algorithms

More information

Boundary Contraction Training for Acoustic Models based on Discrete Deep Neural Networks

Boundary Contraction Training for Acoustic Models based on Discrete Deep Neural Networks INTERSPEECH 2014 Boundary Contraction Training for Acoustic Models based on Discrete Deep Neural Networks Ryu Takeda, Naoyuki Kanda, and Nobuo Nukaga Central Research Laboratory, Hitachi Ltd., 1-280, Kokubunji-shi,

More information

A Systematic and Simple Approach for Designing Takagi-Sugeno Fuzzy Controller with Reduced Data

A Systematic and Simple Approach for Designing Takagi-Sugeno Fuzzy Controller with Reduced Data A Systematic and Simple Approach for Designing Takagi-Sugeno Fuzzy Controller with Reduced Data Yadollah Farzaneh, Ali Akbarzadeh Tootoonchi Department of Mechanical Engineering Ferdowsi University of

More information

Weighted Finite-State Transducers in Computational Biology

Weighted Finite-State Transducers in Computational Biology Weighted Finite-State Transducers in Computational Biology Mehryar Mohri Courant Institute of Mathematical Sciences mohri@cims.nyu.edu Joint work with Corinna Cortes (Google Research). 1 This Tutorial

More information

Noise Compensation for Subspace Gaussian Mixture Models

Noise Compensation for Subspace Gaussian Mixture Models Noise ompensation for ubspace Gaussian Mixture Models Liang Lu University of Edinburgh Joint work with KK hin, A. Ghoshal and. enals Liang Lu, Interspeech, eptember, 2012 Outline Motivation ubspace GMM

More information

The effect of speaking rate and vowel context on the perception of consonants. in babble noise

The effect of speaking rate and vowel context on the perception of consonants. in babble noise The effect of speaking rate and vowel context on the perception of consonants in babble noise Anirudh Raju Department of Electrical Engineering, University of California, Los Angeles, California, USA anirudh90@ucla.edu

More information

A New OCR System Similar to ASR System

A New OCR System Similar to ASR System A ew OCR System Similar to ASR System Abstract Optical character recognition (OCR) system is created using the concepts of automatic speech recognition where the hidden Markov Model is widely used. Results

More information

Hidden Markov Modelling

Hidden Markov Modelling Hidden Markov Modelling Introduction Problem formulation Forward-Backward algorithm Viterbi search Baum-Welch parameter estimation Other considerations Multiple observation sequences Phone-based models

More information

TinySR. Peter Schmidt-Nielsen. August 27, 2014

TinySR. Peter Schmidt-Nielsen. August 27, 2014 TinySR Peter Schmidt-Nielsen August 27, 2014 Abstract TinySR is a light weight real-time small vocabulary speech recognizer written entirely in portable C. The library fits in a single file (plus header),

More information

A Survey on Voice Activity Detection Methods

A Survey on Voice Activity Detection Methods e-issn 2455 1392 Volume 2 Issue 4, April 2016 pp. 668-675 Scientific Journal Impact Factor : 3.468 http://www.ijcter.com A Survey on Voice Activity Detection Methods Shabeeba T. K. 1, Anand Pavithran 2

More information

SEC: Stochastic ensemble consensus approach to unsupervised SAR sea-ice segmentation

SEC: Stochastic ensemble consensus approach to unsupervised SAR sea-ice segmentation 2009 Canadian Conference on Computer and Robot Vision SEC: Stochastic ensemble consensus approach to unsupervised SAR sea-ice segmentation Alexander Wong, David A. Clausi, and Paul Fieguth Vision and Image

More information

Rule-Based Fuzzy Model

Rule-Based Fuzzy Model In rule-based fuzzy systems, the relationships between variables are represented by means of fuzzy if then rules of the following general form: Ifantecedent proposition then consequent proposition The

More information

Overview of Statistical Tools. Statistical Inference. Bayesian Framework. Modeling. Very simple case. Things are usually more complicated

Overview of Statistical Tools. Statistical Inference. Bayesian Framework. Modeling. Very simple case. Things are usually more complicated Fall 3 Computer Vision Overview of Statistical Tools Statistical Inference Haibin Ling Observation inference Decision Prior knowledge http://www.dabi.temple.edu/~hbling/teaching/3f_5543/index.html Bayesian

More information

Dynamic Data Modeling, Recognition, and Synthesis. Rui Zhao Thesis Defense Advisor: Professor Qiang Ji

Dynamic Data Modeling, Recognition, and Synthesis. Rui Zhao Thesis Defense Advisor: Professor Qiang Ji Dynamic Data Modeling, Recognition, and Synthesis Rui Zhao Thesis Defense Advisor: Professor Qiang Ji Contents Introduction Related Work Dynamic Data Modeling & Analysis Temporal localization Insufficient

More information

A Variance Modeling Framework Based on Variational Autoencoders for Speech Enhancement

A Variance Modeling Framework Based on Variational Autoencoders for Speech Enhancement A Variance Modeling Framework Based on Variational Autoencoders for Speech Enhancement Simon Leglaive 1 Laurent Girin 1,2 Radu Horaud 1 1: Inria Grenoble Rhône-Alpes 2: Univ. Grenoble Alpes, Grenoble INP,

More information

Acoustic Modeling for Speech Recognition

Acoustic Modeling for Speech Recognition Acoustic Modeling for Speech Recognition Berlin Chen 2004 References:. X. Huang et. al. Spoken Language Processing. Chapter 8 2. S. Young. The HTK Book (HTK Version 3.2) Introduction For the given acoustic

More information