Data-driven clustering of channels in corpora


Data-driven clustering of channels in corpora

Mattias Nilsson

Master Thesis in Speech Technology
Supervisor: Daniel Neiberg
Examiner: Mats Blomberg

Centre for Speech Technology
Department of Speech, Music and Hearing
Royal Institute of Technology, Stockholm
22nd June


Master Thesis in Speech Technology: Data-driven clustering of channels in speech databases
Mattias Nilsson
Examiner: Mats Blomberg
Supervisor: Daniel Neiberg


Abstract

The performance of a speaker verification system is reduced by the channel if the speech is distorted by a telephone network. This thesis examines a channel compensation method based on the fact that the channel can be seen as an additive bias in the cepstral domain. A channel estimate is therefore subtracted from the speech as channel compensation. To compensate for non-linear distortion, the idea was to use an unsupervised clustering method on a large amount of training data from 10 different channels in HTIMIT. The cluster centroids were then used as channel estimates. Vector quantization was used for clustering the data. Experimental results show that a trained vector quantizer managed to classify 81% of the test data to the correct channel when the bias for the sex of the speaker was subtracted. The Minimum Description Length principle was used to examine the correct number of clusters in a data set.

Acknowledgments

I would like to thank the following persons for helping me with this thesis. My supervisor Daniel Neiberg, for helping me in my daily work and for providing a lot of help during the writing of this thesis report. My examiner Mats Blomberg and all the rest of the people at the Centre for Speech Technology, for providing an inspiring working environment. My fellow thesis workers, both from the old place and the new fresh place, for mind-expanding discussions. My good friend and lunch company Simon Hensing, for listening to my problems and providing me with some of his own.


List of Figures

2.1 The acoustic environment
2.2 Simplified acoustic environment
4.1 The MDL value for 1-14 channels with p_k calculated by the VQ and m = K(D + D(D + 1)/2)
4.2 The MDL value for 1-14 channels with p_k = 1/K and m = K(D + D(D + 1)/2)
4.3 The MDL value for 1-14 channels with p_k = 1/K, a global covariance matrix and m = KD


List of Tables

3.1 The different microphones in HTIMIT
4.1 Results from a VQ classifying the test data. The training was performed by taking the mean of each channel's training data. Correctly classified data from each channel are in bold style.
4.2 Results from a VQ classifying the test data. The training was performed by the LBG algorithm. Correctly classified data from each channel are in bold style.
4.3 Results from a VQ classifying the test data. The speech files were processed by a speech detector. The training was performed by taking the mean of each channel's training data. Correctly classified data from each channel are in bold style.
4.4 Results from a VQ classifying the test data. The training was performed by the LBG algorithm. All data was processed by a speech detector. Correctly classified data from each channel are in bold style.
4.5 Results from a trained VQ classifying test data separately for each sex. The training was performed by the LBG algorithm on separate training sets for each sex. Correctly classified data from each channel are in bold style.
4.6 Results from a trained VQ classifying test data after the bias vector for sex has been subtracted. Correctly classified data from each channel are in bold style.
4.7 Results from a VQ classifying test data after the bias vector for dialect has been subtracted. Correctly classified data from each channel are in bold style.

List of Abbreviations

CMS    Cepstral Mean Subtraction
DFT    Discrete Fourier Transform
GIVES  General Identity Verification System
GMM    Gaussian Mixture Model
iid    independent identically distributed
LBG    Linde-Buzo-Gray algorithm
MDL    Minimum Description Length
MFCC   Mel-frequency Cepstral Coefficients
PDF    probability density function
SNR    Signal-to-noise ratio
VQ     Vector Quantizer

Contents

1 Introduction
  1.1 Background
  1.2 Problem formulation
  1.3 Different Channel Compensation methods
  1.4 Approach
  1.5 Method

2 Theory
  2.1 Acoustic environment
  2.2 MFCC
  2.3 Channel compensation
  2.4 Channel estimation
  2.5 Vector quantization
    2.5.1 Introduction
    2.5.2 Coding of the test data
    2.5.3 Designing the codebook
    2.5.4 The LBG algorithm
  2.6 Bias from Sex and Dialect
  2.7 Determining the number of channels

3 Implementation and Experiments
  3.1 Feature Extraction
  3.2 Clustering
  3.3 Classification of the test data
  3.4 Compensation for sex and dialect
  3.5 MDL principle
  3.6 The Corpus

4 Results
  4.1 Ideal codebook
  4.2 A trained VQ
  4.3 Speech detection
  4.4 Separate codebooks for male and female
  4.5 Compensation for the bias of sex and dialect
  4.6 Determining the number of different channels

5 Discussion
  5.1 Performance of the Vector Quantizer
  5.2 Determining the number of clusters with the MDL principle

6 Conclusions
  6.1 Outlook

Chapter 1 Introduction

1.1 Background

Speaker verification is the task of verifying the identity of a speaker. The speaker claims an identity and the speaker verification system either accepts or rejects the claim, depending on whether the speech is similar to the speech of the claimed identity. It can be difficult to set the threshold for when the speaker is to be accepted. A perfect speaker verification system would always accept the correct speaker and always reject a false speaker. Depending on how the threshold is set, the system will sometimes reject the true speaker or accept a false speaker. When speaker verification is to be performed on recorded speech, some kind of feature extraction is performed on the speech in order to extract the desired information from the original data. One kind of feature extraction [1] often used is Mel-Frequency Cepstrum Coefficients (MFCC) [2].

1.2 Problem formulation

When speech is transferred through a telephone network it gets corrupted by the channel and external noise. There will be both additive and convolutional noise corrupting the original speech signal. Depending on the kind of telephone handset and channel, the signal will be corrupted in different ways. This channel variability is a problem when speaker verification [3, 4] is to be done. Therefore it is desirable to compensate for the channel in some way.

1.3 Different Channel Compensation methods

In order to compensate for the channel, some channel compensation method has to be used. These methods can be divided into two major groups: methods that classify the data to a channel and methods that do not. Among the channel compensation methods that classify the data are feature mapping [5] and nonlinear filtering [6]. Feature mapping is based on detecting the most probable channel model and then mapping the feature vectors to a channel-independent space. Nonlinear filtering fits three filters to each channel: a linear pre-filter, a nonlinear filter and a linear post-filter. Methods for channel compensation that do not depend on classification of the data include Cepstral Mean Subtraction (CMS) [7], RASTA processing [8] and stochastic matching [9]. CMS subtracts a mean feature vector over a long utterance. RASTA processing designs an appropriate band-pass filter that removes components that are unlikely to be related to phonetic properties of the speech. Stochastic matching is based on a maximum-likelihood matching approach that decreases the mismatch between a test utterance and a given set of speech models. If a channel estimate is made in the MFCC space, it can be seen as an additive bias to the speech. If the channel bias could be found in some way, one channel compensation would then be to subtract the bias. One problem that many channel compensation methods share is that they cannot compensate for the non-linear distortion components that appear in carbon-button microphones.

1.4 Approach

The main idea of this thesis was to see if different channels could be separated by using some unsupervised clustering method on a training set consisting of a corpus (a speech database) with speech sent through several different channels. The clustering would take place in the MFCC space, and the cluster centroids would then be channel estimates. Depending on the structure of the training set, the channel estimates would be more or less similar to the actual channels. Ideally each cluster would consist of data from one channel only. The idea was also to see if the effect of non-linear components [10, 6] could be compensated for by using a large amount of training data for the channel estimate. The question of finding the correct number of channels from a training sequence with an unknown number of channels was to be examined using an optimization algorithm. Data that came from certain common groups of speakers, such as sex and dialect, showed some common characteristics. These characteristics influence the channel estimates and can be seen as additive biases. If these characteristics could be compensated for, it would result in better channel estimates.

1.5 Method

The choice of channel estimation method is not a trivial question. In this thesis an unsupervised clustering method, vector quantization, was used on the available corpus to produce channel estimates. Different methods for designing the codebook were examined. In order to measure the performance of the different algorithms, the test data was labeled with the correct physical channel. The exact number of data correctly labeled by the vector quantizer could therefore be derived. In order to find the optimal number of channel clusters from a training sequence with an unknown number of channels, the Minimum Description Length (MDL) principle was used.

Chapter 2 Theory

2.1 Acoustic environment

[Figure 2.1: The acoustic environment]

Speech that is transferred through the telephone network is corrupted by the channel, and noise is added at different stages during the transfer. Compared to high-quality recordings, the difference can be large. The acoustic environment in a telephone network can be described as [11]

    y(t) = [ (s_{Stress,Lombard}(t) + n_{background}(t)) * h_{mic}(t) + n_{chan}(t) ] * h_{chan}(t) + n_{receiver}(t)    (2.1)

where y(t) is the corrupted output signal, s_{Stress,Lombard}(t) is the speech signal from the speaker (affected by stress and the Lombard effect), n_{background}(t) is the background noise, h_{mic}(t) is the impulse response of the microphone, n_{chan}(t) is the additive noise of the transmission channel, h_{chan}(t) is the impulse response of the transmission channel, and n_{receiver}(t) is the additive noise from the receiver. The impulse responses can be combined into one impulse response h_{system} for the whole system. The background noise also affects the signal by convolution, besides being additive, since a speaker often gets stressed or tries to articulate in a different way (the Lombard effect [12]). If all the additive noise sources are combined into one and all the convolutional noise sources into one, then

    y(t) = h_{system}(t) * [ s(t) + n_{system}(t) ]    (2.2)

describes the system.

[Figure 2.2: Simplified acoustic environment]

If this system is transformed with the discrete Fourier transform

    y(n) = (1/N) Σ_{k=0}^{N-1} y(t_k) e^{-j2πkn/N},  n = 0, ..., N-1    (2.3)

and the power y(n)y*(n) is then taken, this yields

    y(n)y*(n) = |h(n)s(n)|^2 + |n(n)|^2 + 2|n(n)||h(n)||s(n)| cos(θ)    (2.4)

where θ is the angle between the speech vector s(n) and the noise vector n(n). Some assumptions that are commonly made are that h(n) is constant over time and independent of input level [11]. It is also usually assumed that the signal is smoothed over sufficiently many bins, so that the last, mixed term in (2.4) can be ignored. If the signal-to-noise ratio, SNR, is high enough, the noise can be ignored. This yields

    y(n)y*(n) = |h(n)s(n)|^2    (2.5)

2.2 MFCC

The choice of feature extraction method used to process the speech signal fell upon Mel-Frequency Cepstrum Coefficients (MFCC) [2]. The computation of the MFCC involves a number of mathematical operations on the signal. First the signal is divided into frames by multiplication with overlapping Hamming windows. The signal is then transformed by taking the Discrete Fourier Transform (DFT) and the power, y(n)y*(n). If the environment is relatively noise-free, as described in (2.5), this gives

    y(n)y*(n) = |h(n)s(n)|^2    (2.6)

The signal then passes through a logarithmic triangular filter bank [13], which summarizes the DFT powers. This transforms the frequency scale into a mel scale, i.e. the centre frequencies of the filters are spaced equally on a linear scale up to 1 kHz and equally on a logarithmic scale above 1 kHz. This reduces the number of coefficients from 128 to 24 mel-frequency bins. Each coefficient now corresponds to the mean of the frequencies covered by each triangular window. The amplitude scale is then transformed to a logarithmic scale. The system described in (2.6) now becomes

    log(ŷ(n)ŷ*(n)) = 2 log|ĥ(n)| + 2 log|ŝ(n)|    (2.7)

The channel is now an additive bias to the signal. The last step is to cosine transform the system. Usually the components in the cosine transform above a certain number are not used, since they represent such fine spectral detail that it is of lesser phonetic significance. Thus a reduction of the number of dimensions [2] is achieved.
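As an illustration of the processing chain above, the following is a minimal sketch in Python/NumPy, not the GIVES implementation used in this thesis; the helper names (mel_filterbank, mfcc) and the details of the filter spacing are simplifying assumptions, while the parameter values (10 ms frames, 25.6 ms windows, 24 filters between 300 and 3400 Hz, 12 cepstrum coefficients) follow Chapter 3.

    import numpy as np

    def mel_filterbank(n_filters, win_len, fs, f_lo=300.0, f_hi=3400.0):
        # Triangular filters with centre frequencies equally spaced on the
        # mel scale (roughly linear below 1 kHz, logarithmic above).
        mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
        imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
        edges = imel(np.linspace(mel(f_lo), mel(f_hi), n_filters + 2))
        bins = np.floor((win_len + 1) * edges / fs).astype(int)
        fb = np.zeros((n_filters, win_len // 2 + 1))
        for j in range(n_filters):
            lo, c, hi = bins[j], bins[j + 1], bins[j + 2]
            fb[j, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
            fb[j, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
        return fb

    def mfcc(signal, fs=8000, frame_ms=10.0, win_ms=25.6,
             n_filters=24, n_ceps=12):
        win_len = int(fs * win_ms / 1000.0)
        step = int(fs * frame_ms / 1000.0)
        window = np.hamming(win_len)
        fb = mel_filterbank(n_filters, win_len, fs)
        j = np.arange(n_filters)
        frames = []
        for start in range(0, len(signal) - win_len + 1, step):
            frame = signal[start:start + win_len] * window
            power = np.abs(np.fft.rfft(frame)) ** 2    # y(n)y*(n), cf. (2.6)
            logmel = np.log(fb @ power + 1e-10)        # log filter-bank output
            # Cosine transform; the channel is an additive bias here, cf. (2.7).
            ceps = [np.sum(logmel * np.cos(i * (j + 0.5) * np.pi / n_filters))
                    for i in range(1, n_ceps + 1)]
            frames.append(ceps)
        return np.array(frames)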

2.3 Channel compensation

Consider one of the log-log filter bank components j of (2.7):

    LLFB_j = 2 log|ĥ_j(n)| + 2 log|ŝ_j(n)|    (2.8)

The cosine-projected cepstra then become

    MFCC_i = K Σ_{j=1}^{J} [ (log|ĥ_j| + log|ŝ_j(n)|) cos(i(j − 1/2)π/J) ],  i = 1, ..., I    (2.9)

where J is the number of log-log filter banks, I is the number of cepstrum coefficients and K is a constant. The zeroth component MFCC_0 can be seen as the energy of the signal. One method of channel compensation that is often used is Cepstral Mean Subtraction, CMS, which subtracts the arithmetic mean from the whole utterance. A problem with CMS is that the arithmetic mean, E(s), besides the channel also contains information about both the speaker and the context, i.e. what sounds the speaker makes. Therefore, when CMS is applied to a system, a speaker normalization is performed [14]. This can be a problem in speaker verification, since part of the speaker information is removed.

2.4 Channel estimation

Since CMS causes problems in speaker verification by normalizing the speaker (see section 2.3), another approach would be to see if some other method could approximate the channel more independently of the speaker. One method would be to take the arithmetic mean of several different speakers speaking through the same channel. Thus the influence of each speaker on the channel estimate would decrease. It is also possible that there are some non-linear effects. The influence of these non-linear effects would decrease if the channel estimate were based on several speakers, compared to CMS where the estimate is based on one speaker. This would be a supervised method, since each speaker would have to be labeled with the corresponding channel. If the channel spoken through is unknown, an unsupervised method, for example clustering, has to be used. One such clustering method is vector quantization.

2.5 Vector quantization

2.5.1 Introduction

Vector quantization [15] is a form of lossy data compression [16], which means that it is impossible to reconstruct the original data after the compression. A vector quantizer maps k-dimensional vectors in the vector space R^k into a finite set of vectors C = {c_m, m = 1, 2, ..., M}. These vectors are called code vectors or codewords. A vector quantizer based on the nearest neighbor method, i.e. where a vector is classified as belonging to the code vector to which it has the smallest distance, is called a Nearest Neighbor or, if the Euclidean metric is used, a Voronoi vector quantizer. The set of all code vectors is called the codebook. The optimal choice of distance measure depends on the data to be compressed. For example, the Euclidean distance measure would be

    d(x, c) = ||x − c||^2 = Σ_{i=1}^{k} (x_i − c_i)^2    (2.10)

The region where all data belong to a certain code vector is called the encoding region of that code vector.

2.5.2 Coding of the test data

The test data will be classified as belonging to one of the clusters, represented by the corresponding code vector. The classification is based on the smallest distance d(z, c_i) from test data z to code vector c_i, i = 1, ..., M.

2.5.3 Designing the codebook

The design of a vector quantizer can be described as finding the optimal codebook, given a certain distortion measure and a certain number of code vectors. For an unknown set of training data, the number of code vectors to be chosen is not trivial. The optimal solution satisfies two criteria.

1. Nearest Neighbor Condition

    S_m = { x : ||x − c_m||^2 ≤ ||x − c_{m'}||^2, m' = 1, 2, ..., M }    (2.11)

This means that the encoding region S_m consists of all vectors that are closer to the code vector c_m than to any other code vector.

2. Centroid Condition

    c_m = ( Σ_{x_n ∈ S_m} x_n ) / ( Σ_{x_n ∈ S_m} 1 ),  m = 1, 2, ..., M    (2.12)

The code vector c_m is the average of the training vectors in S_m. This solution minimizes the average distortion

    D_Average = (1/Nk) Σ_{n=1}^{N} ||x_n − Q(x_n)||^2    (2.13)

where Q(x_n) = c_m if x_n ∈ S_m.

The method of designing a codebook is an open research area. One could for example choose the M first training vectors as the initial codebook. It is also possible to choose the initial code vectors randomly from the training set. If the training data is not independent identically distributed (iid), a sample might be influenced by other training data in, for example, its neighborhood. This could mean that the M first training vectors come from the same natural cluster in the training set. A disadvantage with a random initial codebook is that it is difficult to compare results, since the results vary so much depending on the starting codebook. After the initial codebook is chosen, each training sample is classified as belonging to one of the clusters. The new code vectors are then computed according to the centroid condition. This process of finding the correct cluster for each training sample and then computing a new codebook is called the training of the vector quantizer. When the codebook no longer changes, the average distortion has reached a local minimum and the training stops.
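The two conditions translate directly into this alternating training loop. Below is a minimal sketch of such a trainer under a Euclidean distortion measure; the function name is hypothetical, and the thesis implementation (in MATLAB, with a Mahalanobis metric, see Chapter 3) differs in detail.

    import numpy as np

    def train_vq(data, codebook, max_iter=100):
        # Alternate the Nearest Neighbor Condition (2.11) and the
        # Centroid Condition (2.12) until the codebook stops changing.
        for _ in range(max_iter):
            # Assign each training vector to its closest code vector.
            dists = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            # Each code vector becomes the mean of its encoding region.
            new_codebook = np.array([
                data[labels == m].mean(axis=0) if np.any(labels == m)
                else codebook[m] for m in range(len(codebook))])
            if np.allclose(new_codebook, codebook):
                break  # local minimum of the average distortion (2.13)
            codebook = new_codebook
        return codebook, labels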

2.5.4 The LBG algorithm

Another method of designing the codebook is the Linde-Buzo-Gray (LBG) algorithm [17]. The LBG algorithm is based on splitting the code vectors until an optimal solution is found, based on some criterion. The initial codebook thus consists of one code vector, which is the mean of the training data. This code vector is then split into two by adding and subtracting a small vector ε. One possible choice of ε is the eigenvector corresponding to the largest eigenvalue of the training data belonging to S_m. Then the vector quantizer is run on these two code vectors until a solution is found. The final two code vectors are then split into four. This process is repeated iteratively until the codebook no longer changes. Then a local minimum is reached and the training stops.

1. Starting. Given the training data Y = {y_n; n = 1, 2, ..., N} and some optimality requirement, set ε > 0 to a small vector. Let M = 1 and set the first code vector to

    c_1 = (1/N) Σ_{n=1}^{N} y_n

2. Splitting. For j = 1, ..., M set

    c_j^0 = (1 + ε) c_j
    c_{M+j}^0 = (1 − ε) c_j

Set M = 2M.

3. Iteration. Set the iteration index i = 0.

(a) For n = 1, 2, ..., N, minimize the chosen distance measure d(c_m^i, y_n) over all m = 1, ..., M. Let m* be the index which minimizes the distance. Then label y_n with Q(y_n) = c_{m*}^i.

(b) For m = 1, ..., M, update the code vector

    c_m^{i+1} = ( Σ_{Q(y_n) = c_m^i} y_n ) / ( Σ_{Q(y_n) = c_m^i} 1 )    (2.14)

(c) Set i = i + 1.

(d) If the code vectors are constant, c^i = c^{i−1}, or some other stop criterion is met, stop; else go back to (a).

4. If the size of the codebook matches the desired size, stop; else go back to 2.
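A compact sketch of the splitting procedure, reusing train_vq from the previous sketch; as in Chapter 3, ε is here taken proportional to the per-cluster standard deviation, which is one of several possible choices. Note that plain doubling only reaches powers of two; Section 3.2 describes how ten code vectors were obtained by doubling to eight and then splitting the two clusters with the largest variance.

    import numpy as np

    def lbg(data, target_size, eps_scale=0.01):
        # 1. Start from a single code vector: the mean of the training data.
        codebook = data.mean(axis=0, keepdims=True)
        codebook, labels = train_vq(data, codebook)
        while len(codebook) < target_size:
            # 2. Split every code vector by adding/subtracting a small vector.
            split = []
            for m, c in enumerate(codebook):
                members = data[labels == m]
                eps = eps_scale * (members.std(axis=0) if len(members) else 1.0)
                split.extend([c + eps, c - eps])
            # 3. Retrain until the code vectors are constant; M has doubled.
            codebook, labels = train_vq(data, np.array(split))
        return codebook, labels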

2.6 Bias from Sex and Dialect

If the signals of several speakers are compared, it can be seen that they may have some common partial vectors. These partial vectors could correspond to the sex or dialect of the speaker, or to the context of the speech. The signal in (2.7) can thus be seen as being made up of several partial vectors:

    s(n) = s_sex(n) + s_dialect(n) + s_context(n)    (2.15)

There would for example be two vectors for sex, one male and one female. If these partial vectors could be subtracted from the signal, it might be possible to get better channel estimates. Of course, in order to subtract these vectors, the corresponding sex or dialect of the speaker has to be known or identified in some manner.

2.7 Determining the number of channels

In an unsupervised clustering algorithm, some stop criterion has to be used to find the optimal number of clusters. Without knowledge about the training sequence, it can be difficult to know whether all of the training data come from the same channel, or whether each training sample represents a unique channel. This partitioning of data into subgroups is known as cluster analysis [18]. Vector quantization with the LBG algorithm is a divisive, hierarchical clustering method. This means that the algorithm produces a sequence of partitions of the training data corresponding to an increasing number of clusters, since the clusters are split in each stage of the algorithm. One way of deciding the optimal number of clusters is the Minimum Description Length (MDL) principle [19, 20]. The MDL principle assumes that the training data can be described by a set of probability density functions (PDFs). When cluster analysis is done on a training set, the set is often described by a Gaussian Mixture Model (GMM). Each cluster will then be a Gaussian distribution. The Gaussian mixture distribution can be computed after the VQ has finished its training. The Gaussian mixture distribution will be

    f(x_i | θ) = Σ_{k=1}^{K} p_k N(x_i | µ_k, Σ_k)    (2.16)

for the clusters, where p_k, with Σ_{k=1}^{K} p_k = 1, is the probability that a data point is classified by the VQ as belonging to a certain cluster k, and N(µ_k, Σ_k) is a Gaussian density with mean vector µ_k and covariance matrix Σ_k. The VQ's training algorithm approximates the EM algorithm [21], which estimates the mixture parameters so that the following log-likelihood function is maximized:

    L = Σ_{i=1}^{N} log f(x_i | θ)    (2.17)

If several models are equally likely a priori, then L is proportional to the a posteriori probability that the data conform to the model θ = (p_1, ..., p_K; µ_1, ..., µ_K; Σ_1, ..., Σ_K). Since L will increase with the complexity of the model, a term m is added that penalizes more complex models. This penalizing term m is based on the complexity of the model and is the number of free parameters subject to estimation. For a GMM the penalizing term will be

    m = (K − 1) + K(D + D(D + 1)/2)    (2.18)

where D is the number of features and K is the number of clusters. The MDL function is maximized for the optimal solution:

    MDL = L − (1/2) m log N    (2.19)

where N is the number of data samples. Since both L and m increase with the complexity of the model, there will be a maximum where the increase in likelihood no longer balances the penalizing term.
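Given a trained VQ's partition of the data, the MDL value for a candidate K can be computed as in the sketch below. This is an illustration of (2.16)-(2.19) with hypothetical helper names, assuming each cluster holds enough data for a non-singular covariance matrix; it is not the exact thesis implementation.

    import numpy as np
    from scipy.stats import multivariate_normal

    def mdl_value(data, labels, K):
        N, D = data.shape
        p = np.array([np.mean(labels == k) for k in range(K)])  # p_k
        comps = [multivariate_normal(data[labels == k].mean(axis=0),
                                     np.cov(data[labels == k].T))
                 for k in range(K)]
        # Log-likelihood (2.17) of the Gaussian mixture (2.16).
        mix = sum(p[k] * comps[k].pdf(data) for k in range(K))
        L = np.log(mix).sum()
        # Number of free parameters (2.18): weights, means, covariances.
        m = (K - 1) + K * (D + D * (D + 1) / 2)
        return L - 0.5 * m * np.log(N)  # MDL function (2.19)

The optimal number of clusters is then the K, over a range of candidate codebook sizes, for which mdl_value is largest.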


Chapter 3 Implementation and Experiments

The different theories described in Chapter 2 are here used to describe the method that was used to get channel estimates through vector quantization. The equipment and implemented theories are described in detail.

3.1 Feature Extraction

The MFCC feature extraction was performed using GIVES (General Identity Verification System), a software package built for speaker verification. First the speech file was separated into frames by multiplication with overlapping Hamming windows. The frame duration was set to 10 ms and the window length was 25.6 ms. Then the speech file was transformed with the DFT and sent to a triangular filter bank with 24 components between 300 Hz and 3400 Hz. This was done to remove redundant information, since the guaranteed bandwidth in the telephone network is in that interval. The cosine transform was then used to reduce the dimensions of the feature vectors from 24 to 12. Cepstral liftering was finally applied to the feature vectors. The result was a series of 12-dimensional vectors from each speech file. Since the speech files were of different lengths, the final series were also of different lengths. Since the sentences in HTIMIT consist of several words, there may be silent intervals between the words. One question that came to mind was therefore whether this silence influenced the channel estimates in any way. To examine this, a speech detector was used to construct files without silence.

3.2 Clustering

The clustering of the MFCC feature-extracted data was done by a vector quantizer in MATLAB. First the starting codebook was picked randomly from the training sequence (see Chapter 2). However, this resulted in rather dramatic differences in the performance of the vector quantizer, from poor to excellent. Therefore, the codebook was constructed by the LBG algorithm, which gave good results each time it was run. The small vector ε, used for splitting the code vectors, was chosen proportional to the standard deviation of the training data belonging to the cluster that was split. Since the number of channels in HTIMIT was known, this information was first used to set the final number of code vectors to M = 10. The size of the codebook was therefore doubled until it was of size 8. Then the two clusters with the largest variance were split, to get the final 10 code vectors. The choice of distance measure fell upon the Mahalanobis metric, which in this case is defined as

    d_Mahalanobis(y, c_i) = (y − c_i) Σ^{−1} (y − c_i)^T    (3.1)

where Σ^{−1} is the inverted covariance matrix of the training set.

3.3 Classification of the test data

As described in section 2.5.2, the test data was classified as belonging to the channel corresponding to the closest code vector.

3.4 Compensation for sex and dialect

In order to compensate for redundant information, such as sex and dialect, which is of no use for the clustering of the channels, the corresponding vectors had to be found [8]. These vectors were found using the arithmetic mean of all the training data and the arithmetic mean of the training data from the respective class of sex or dialect. For example, if the vector corresponding to the characteristics of all male speakers were to be found, the arithmetic mean of all speakers would be subtracted from the arithmetic mean of all male speakers. This vector c_male would then be subtracted from the data belonging to all male speakers:

    c_male = ( Σ_{y_j ∈ S_male} y_j ) / ( Σ_{y_j ∈ S_male} 1 ) − (1/N) Σ_{i=1}^{N} y_i    (3.2)

where S_male is the set of all male speakers and N is the number of speakers in the training set. The same method was applied to get the different dialect vectors and the female vector. Another method of compensating for common characteristics such as dialect or sex would be to have a separate codebook for each sex and dialect region. This was also examined by having different codebooks for females and males.
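As a sketch, the Mahalanobis distance (3.1) and the bias-vector subtraction (3.2) can be written as below; feats, classes and the function names are hypothetical, and the same helper serves for both the sex classes and the dialect classes.

    import numpy as np

    def mahalanobis(y, c, inv_cov):
        # (y - c) Σ^{-1} (y - c)^T with Σ the training-set covariance, cf. (3.1).
        d = y - c
        return d @ inv_cov @ d

    def bias_vector(feats, mask):
        # Class mean minus the global mean, cf. (3.2).
        return feats[mask].mean(axis=0) - feats.mean(axis=0)

    def subtract_class_bias(feats, classes):
        # Remove the additive bias of each class (sex or dialect region).
        out = feats.copy()
        for c in np.unique(classes):
            out[classes == c] -= bias_vector(feats, classes == c)
        return out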

3.5 MDL principle

The MDL principle was first implemented as described in section 2.7. To be able to use the MDL principle, a GMM had to be constructed from the trained VQ. The Gaussian mixture distribution will be

    f(x_i | θ) = Σ_{k=1}^{K} p_k N(x_i | µ_k, Σ_k)    (3.3)

as described in section 2.7. The probability p_k that a data point belongs to a certain class was found by dividing the number of data points assigned by the VQ to class k by the total number of data points. The mean µ_k was the mean value of the data belonging to class k. The covariance matrix Σ_k is calculated by

    Σ_k = (1/(n − 1)) Σ_{i=1}^{n} (y_i − µ_k)(y_i − µ_k)^t    (3.4)

where n is the number of data points y_i that belong to class k. Since the MDL principle was originally constructed for use with the EM algorithm [21], two additional adjustments were tried in addition to the original method described in section 2.7. The first adjustment was that the probability that a data point belongs to a certain cluster was set equal for all clusters, p_k = 1/K, and the penalizing term was set to m = K(D + D(D + 1)/2), as the number of free parameters was reduced. The classification of the training data into clusters will then be the same as nearest neighbor classification with the Mahalanobis distance measure, which is the method used in this thesis. The only difference between classification with the GMM and classification with a VQ using the Mahalanobis distance measure will then be the Σ_k^{−1} in the Gaussian distribution, where Σ_k^{−1} is the inverted covariance matrix for each cluster k. Finally, a second adjustment was made, setting Σ_k^{−1} = Σ_Global^{−1}, i.e. the inverted global covariance matrix for the whole training sequence. The classification will then be the same as nearest neighbor classification with the Mahalanobis distance measure. The penalizing term was now set to m = KD. The reason for making these changes to the algorithm is that the number of free parameters in the model should match the number of free parameters in the VQ. The three different versions of the MDL principle resulted in three different MDL plots.

3.6 The Corpus

In order to test the channel compensation technique previously described (see Chapter 2), that is, to use the cluster centroid as a channel estimate and subtract it from the speech, a training sequence and test data were necessary. Both the training sequence and the test data were chosen from HTIMIT, an American speech database. The data used was in the form of two different sentences, SA1 and SA2, spoken through ten different channels. The complete database consisted of 192 male and 192 female speakers. The different channels included electret and carbon-button microphones, and also cordless transmission. HTIMIT also provided information about the sex and dialect region of the speaker. The division between training and test data provided in HTIMIT was used, i.e. the training set consisted of 278 speakers and the test set consisted of 106 speakers. The total number of training data was therefore 5560 sentences, since each speaker spoke two different sentences that were sent through all ten channels. The total number of test data was 2120, by the same reasoning.

    Transducer  Description
    senh        Sennheiser head-mounted microphone
    pt1         Sony portable (cordless) telephone
    el1         Northern Telecom Unity electret (3-line grill)
    el2         Northern Telecom Unity Noisy-Environment electret (2-line grill)
    el3         Unknown-manufacture electret (64-hole grill)
    el4         Radio Shack Chronophone-255 electret telephone
    cb1         Northern Telecom G-type carbon-button (center-hole membrane transducer)
    cb2         Northern Telecom G-type carbon-button (6-hole metal transducer)
    cb3         Northern Telecom G-type carbon-button (6-hole membrane transducer)
    cb4         ITT carbon-button (6-hole membrane/attached transducer)

    Table 3.1: The different microphones in HTIMIT


Chapter 4 Results

To measure how correct the different VQs' codebooks were, test data was used. The test data was labeled with the channel, so it could be seen whether the VQ classified the test data to the correct channel. The sample set consisted of ten different channels. The division of test and training data provided in HTIMIT was used. When the codebooks resulted from training, the LBG algorithm and the Mahalanobis distance measure were used. Due to rounding, the columns in the tables will not always sum to exactly 100%.

4.1 Ideal codebook

To get some results to compare with, an ideal codebook was created with the code vectors set as the mean of the training data corresponding to each channel. That is, the mean of the training vectors belonging to channel cb1 became one of the code vectors, and so on. In some sense this would be the result of a perfect training of a VQ, since all the vectors clustered together would come from the same channel. This would show how much the clusters overlap and whether some of the channels are difficult to separate.

    [Table 4.1: Results from a VQ classifying the test data. The training was performed by taking the mean of each channel's training data. Correctly classified data from each channel are in bold style.]

The VQ with ideal code vectors managed to classify 89% of the test set to the correct channel. The results for each channel can be seen in Table 4.1.
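The per-channel percentages reported in the tables of this chapter are columns of a confusion matrix. A sketch of how such a table can be derived from the labeled test data (array names hypothetical):

    import numpy as np

    def confusion_percent(true_channels, predicted, channels):
        # Column j: how the test data from channel j was classified, in percent.
        table = np.zeros((len(channels), len(channels)))
        for j, true_ch in enumerate(channels):
            mask = true_channels == true_ch
            for i, pred_ch in enumerate(channels):
                table[i, j] = 100.0 * np.mean(predicted[mask] == pred_ch)
        return table

The overall accuracy quoted in the text is simply np.mean(predicted == true_channels).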

4.2 A trained VQ

It now remained to see how well a trained VQ would perform compared to the VQ with the ideal codebook. To see this, a VQ was trained using the LBG algorithm.

    [Table 4.2: Results from a VQ classifying the test data. The training was performed by the LBG algorithm. Correctly classified data from each channel are in bold style.]

The LBG-trained VQ performed worse than the VQ with the ideal codebook. It classified 55% of the test data to the correct channel. The results for each channel for the LBG-trained VQ can be seen in Table 4.2.

4.3 Speech detection

A speech detector was used in order to produce speech files with no silence in them. The speech files were then feature extracted as previously described (see section 2.2). To compare the speech-detected channel estimates with the normal ones, both an ideal speech-detection codebook (as in section 4.1) and a clustered codebook (as in section 4.2) were created.

    [Table 4.3: Results from a VQ classifying the test data. The speech files were processed by a speech detector. The training was performed by taking the mean of each channel's training data. Correctly classified data from each channel are in bold style.]

The VQ with the ideal codebook based on speech files processed by a speech detector managed to classify 82% of the test data correctly. The results for each channel can be seen in Table 4.3.

The LBG-trained VQ managed to classify 60% of the speech-detected test data correctly. The results for each channel can be seen in Table 4.4.

    [Table 4.4: Results from a VQ classifying the test data. The training was performed by the LBG algorithm. All data was processed by a speech detector. Correctly classified data from each channel are in bold style.]

4.4 Separate codebooks for male and female

The training set was separated into two sets, one for female speakers and one for male speakers. Then two different codebooks were created by training a VQ with the LBG algorithm, as in section 4.2, for each training set. The classification was then made separately for male and female, i.e. the female test data was tested on the female codebook, and vice versa. In order for this method to work, the speaker's sex somehow has to be detected. This was assumed to be possible.

    [Table 4.5: Results from a trained VQ classifying test data separately for each sex. The training was performed by the LBG algorithm on separate training sets for each sex. Correctly classified data from each channel are in bold style.]

As can be seen, this resulted in much better performance than when the training set was of mixed sexes. The trained VQ managed to classify 80% of the test data correctly. The results for each channel can be seen in Table 4.5.

4.5 Compensation for the bias of sex and dialect

The common characteristics of the speakers, such as sex and dialect region, can be seen as additive biases, as described in section 3.4. When these biases were subtracted before the clustering, the result was more separated clusters. Previous results show that it is possible to classify sex or dialect region to some extent [22, 4].

    [Table 4.6: Results from a trained VQ classifying test data after the bias vector for sex has been subtracted. Correctly classified data from each channel are in bold style.]

As can be seen, the result increased from 55% of the test data clustered correctly for non-compensated clustering (see section 4.2) to 81% when the bias for sex was subtracted. An interesting result was also that this method performed slightly better than having separate codebooks. The results for each channel can be seen in Table 4.6.

    [Table 4.7: Results from a VQ classifying test data after the bias vector for dialect has been subtracted. Correctly classified data from each channel are in bold style.]

When the bias for dialect was subtracted from the training and test data, the VQ seemed to perform no better than the original VQ in section 4.2. It classified 55% of the test data correctly, exactly the same number as the VQ in section 4.2. There were differences for the individual channels though, as can be seen if Table 4.7 is compared to Table 4.2.

4.6 Determining the number of different channels

In order to determine the number of channels in an unlabeled training set, some algorithm has to be used. In this thesis the choice fell upon the MDL principle (see section 2.7). Since it was known that HTIMIT consisted of ten different channels, it was interesting to see whether the MDL algorithm would find its optimal solution at ten channels. Initially p_k was found by dividing the number of data points assigned by the VQ to class k by the total number of data points. The penalizing term was initially set to m = K(D + D(D + 1)/2) as described in section 2.7. The resulting plot of the MDL value showed that the MDL found its optimal solution at 9 channels (see Figure 4.1).

    [Figure 4.1: The MDL value for 1-14 channels with p_k calculated by the VQ and m = K(D + D(D + 1)/2)]

With p_k = 1/K and m = K(D + D(D + 1)/2), the MDL value instead peaked at 8 channels (see Figure 4.2). The covariance matrix Σ_k was finally set to Σ_Global, along with p_k = 1/K and m = KD. This resulted in a peak at 9 channels (see Figure 4.3). It should be noted that for the last two MDL plots the log-likelihood function L is not strictly increasing.

    [Figure 4.2: The MDL value for 1-14 channels with p_k = 1/K and m = K(D + D(D + 1)/2)]

    [Figure 4.3: The MDL value for 1-14 channels with p_k = 1/K, a global covariance matrix and m = KD]

Chapter 5 Discussion

The idea of this thesis work was to see if different channels could be separated using unsupervised clustering. The number of clusters in a data set was also examined using the MDL principle. The results presented in Chapter 4 are analyzed here, and the reasons for the success or failure of the different experiments are discussed.

5.1 Performance of the Vector Quantizer

The performance of the VQ seems to depend heavily on how much of the different biases, besides the channel, can be subtracted from the speech before clustering. When compared to the ideal VQ (see Table 4.1), the trained VQ performed rather poorly (see Table 4.2). The difference in performance is from 89% of the test data classified correctly for the ideal VQ to 55% for the trained VQ. This is probably due to the fact that the training data is not naturally structured as separate clusters, but rather as overlapping PDFs. The data from the different channels seemed to overlap, and therefore the cluster centroids could be placed far from the ideal ones by the VQ. If the VQ with the optimal codebook is studied (see Table 4.1), it can be seen that the two channels most difficult to separate from the others are the two first carbon-button microphones, cb1 and cb2. This could be due to the non-linear distortion known to exist in carbon-button microphones [10, 6], so that the data from those two channels are spread over a large volume of the MFCC space. It could also be because the two microphones are similar, so that their respective cluster centroids lie close to each other. The trained VQ had the same difficulties as the ideal one in separating cb1 from cb2, but cb3, pt1 and el3 also proved difficult to separate, especially cb1 and el3, which it practically failed completely to classify correctly. This is probably because the code vectors for those two channels were far from the corresponding ideal ones. This shows one of the difficulties of unsupervised clustering when the data is not structured in non-overlapping clusters. The speech files were processed by a speech detector to remove silence in them. This proved to make little difference for the VQ. The separation of cb1 and cb2 worked better, but the overall result was only slightly better than when the original data was used. One of the things that influenced the performance of the VQ was whether the bias for sex could be subtracted. When this was done, it resulted in a large improvement in the performance of the VQ. This improvement did not seem to depend on whether the bias was subtracted before clustering or the method with separate codebooks was chosen (see section 3.4). The removal of the dialect bias did not have any significant influence on the performance of the VQ. The two most difficult channels for the VQ to estimate were without doubt cb1 and el3.

The reason that cb1 was difficult was probably that it was not clearly separated from cb2. El3 seemed to be evenly classified among the neighboring channels, which indicates that it was a difficult cluster for the VQ to find. Since HTIMIT contained the same number of data from each of the 10 channels, one way of improving the performance might be to set the condition that all clusters must contain the same number of data during the training of the VQ. This was however not tried, since it would use information that would not be available in most cases. It would not be a strictly unsupervised clustering method, since that condition would be a kind of supervision.

5.2 Determining the number of clusters with the MDL principle

The MDL principle found its maximum value at 9 channels instead of the 10 actual channels. The reason for this is that the increase in the log-likelihood function L is so small between nine and ten channels. This could be because one of the channels is difficult to distinguish from another (for example cb1 from cb2), or perhaps because one of the channels is hard for the VQ to find, since it does not form a clearly separated cluster in the MFCC space (for example el3). When the MDL principle was modified to match the number of free parameters in the VQ, the log-likelihood function L did not increase strictly, as it would have if the EM algorithm had been used. This explains the rather strange MDL plots (see Figures 4.2, 4.3). An interesting phenomenon is that the MDL value decreases when the number of channels increases from 3 to 4.

Chapter 6 Conclusions

It seems to be possible to cluster channels with a VQ. In order to get a good channel estimate from the clustering, it is however necessary to subtract as much bias from the speech as possible before the training. Especially the subtraction of the bias from the sex of the speaker seems to make the training of the VQ much easier. The bias from the dialect region did not seem to be of as much importance. Silence in the speech did not seem to have much influence on the performance of the VQ. What then about the channels that get falsely classified by the VQ? Since test data that gets classified to the wrong channel usually gets classified to a neighboring cluster, the misclassification is usually no catastrophe. As long as the cluster centroid that the data gets classified to is not a large distance from the actual channel in the MFCC space, it is no major problem. Compared to CMS, which subtracts the arithmetic mean as channel compensation, it can however result in worse channel compensation for some test data. The MDL principle seemed to find approximately the correct number of channels. When the MDL formula was changed to match the number of free parameters of the VQ, it showed some interesting results; however, it still found the maximum at 8 or 9 clusters.

6.1 Outlook

Possible further work would be to implement the channel compensation technique described here in some kind of speaker verification system. Then it would be possible to see if this alternative channel compensation method would perform better than CMS. It would be interesting to see if other clustering methods, perhaps GMM-based, would perform better. It would also be interesting to see how other methods [23] for determining the number of clusters would perform on the HTIMIT data set.


Bibliography

[1] A. Webb. Statistical Pattern Recognition. Newnes.
[2] S. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-28(4), August 1980.
[3] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn. Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10:19-41, 2000.
[4] H. Melin. Speaker verification in telecommunication. Department of Speech, Music and Hearing, KTH. Available from: melin/publications.html.
[5] Douglas A. Reynolds. Channel robust speaker verification via feature mapping. MIT Lincoln Laboratory, Lexington, MA, USA.
[6] T. F. Quatieri, D. A. Reynolds, and G. C. O'Leary. Estimation of handset nonlinearity with application to speaker recognition. IEEE Transactions on Speech and Audio Processing, 8(5), 2000.
[7] B. S. Atal. Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification. Journal of the Acoustical Society of America, 1974.
[8] H. Hermansky and N. Morgan. RASTA processing of speech. IEEE Transactions on Speech and Audio Processing, 2(4), October 1994.
[9] A. Sankar and Chin-Hui Lee. A maximum-likelihood approach to stochastic matching for robust speech recognition. IEEE Transactions on Speech and Audio Processing, 4(3), May 1996.
[10] D. A. Reynolds, M. A. Zissman, T. F. Quatieri, and G. C. O'Leary. The effects of telephone transmission degradations on speaker recognition performance. In Proc. of ICASSP95, 1995.
[11] M. J. F. Gales. "NICE" model-based compensation schemes for robust speech recognition. ESCA/NATO Tutorial and Research Workshop on Robust Speech Recognition for Unknown Communication Channels.
[12] A. Wakao, K. Takeda, and F. Itakura. Variability of Lombard effects under different noise conditions. In Proc. ICSLP 96, volume 4, Philadelphia, PA, 1996.
[13] Monson H. Hayes. Statistical Digital Signal Processing and Modeling. Wiley.


More information

A Low-Cost Robust Front-end for Embedded ASR System

A Low-Cost Robust Front-end for Embedded ASR System A Low-Cost Robust Front-end for Embedded ASR System Lihui Guo 1, Xin He 2, Yue Lu 1, and Yaxin Zhang 2 1 Department of Computer Science and Technology, East China Normal University, Shanghai 200062 2 Motorola

More information

A Generative Model Based Kernel for SVM Classification in Multimedia Applications

A Generative Model Based Kernel for SVM Classification in Multimedia Applications Appears in Neural Information Processing Systems, Vancouver, Canada, 2003. A Generative Model Based Kernel for SVM Classification in Multimedia Applications Pedro J. Moreno Purdy P. Ho Hewlett-Packard

More information

The Noisy Channel Model. Statistical NLP Spring Mel Freq. Cepstral Coefficients. Frame Extraction ... Lecture 10: Acoustic Models

The Noisy Channel Model. Statistical NLP Spring Mel Freq. Cepstral Coefficients. Frame Extraction ... Lecture 10: Acoustic Models Statistical NLP Spring 2009 The Noisy Channel Model Lecture 10: Acoustic Models Dan Klein UC Berkeley Search through space of all possible sentences. Pick the one that is most probable given the waveform.

More information

Statistical NLP Spring The Noisy Channel Model

Statistical NLP Spring The Noisy Channel Model Statistical NLP Spring 2009 Lecture 10: Acoustic Models Dan Klein UC Berkeley The Noisy Channel Model Search through space of all possible sentences. Pick the one that is most probable given the waveform.

More information

Correspondence. Pulse Doppler Radar Target Recognition using a Two-Stage SVM Procedure

Correspondence. Pulse Doppler Radar Target Recognition using a Two-Stage SVM Procedure Correspondence Pulse Doppler Radar Target Recognition using a Two-Stage SVM Procedure It is possible to detect and classify moving and stationary targets using ground surveillance pulse-doppler radars

More information

Allpass Modeling of LP Residual for Speaker Recognition

Allpass Modeling of LP Residual for Speaker Recognition Allpass Modeling of LP Residual for Speaker Recognition K. Sri Rama Murty, Vivek Boominathan and Karthika Vijayan Department of Electrical Engineering, Indian Institute of Technology Hyderabad, India email:

More information

The Secrets of Quantization. Nimrod Peleg Update: Sept. 2009

The Secrets of Quantization. Nimrod Peleg Update: Sept. 2009 The Secrets of Quantization Nimrod Peleg Update: Sept. 2009 What is Quantization Representation of a large set of elements with a much smaller set is called quantization. The number of elements in the

More information

Model-based unsupervised segmentation of birdcalls from field recordings

Model-based unsupervised segmentation of birdcalls from field recordings Model-based unsupervised segmentation of birdcalls from field recordings Anshul Thakur School of Computing and Electrical Engineering Indian Institute of Technology Mandi Himachal Pradesh, India Email:

More information

Estimation of Cepstral Coefficients for Robust Speech Recognition

Estimation of Cepstral Coefficients for Robust Speech Recognition Estimation of Cepstral Coefficients for Robust Speech Recognition by Kevin M. Indrebo, B.S., M.S. A Dissertation submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment

More information

ECE 661: Homework 10 Fall 2014

ECE 661: Homework 10 Fall 2014 ECE 661: Homework 10 Fall 2014 This homework consists of the following two parts: (1) Face recognition with PCA and LDA for dimensionality reduction and the nearest-neighborhood rule for classification;

More information

Forecasting Wind Ramps

Forecasting Wind Ramps Forecasting Wind Ramps Erin Summers and Anand Subramanian Jan 5, 20 Introduction The recent increase in the number of wind power producers has necessitated changes in the methods power system operators

More information

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore

More information

Pattern Classification

Pattern Classification Pattern Classification Introduction Parametric classifiers Semi-parametric classifiers Dimensionality reduction Significance testing 6345 Automatic Speech Recognition Semi-Parametric Classifiers 1 Semi-Parametric

More information

Spectral and Textural Feature-Based System for Automatic Detection of Fricatives and Affricates

Spectral and Textural Feature-Based System for Automatic Detection of Fricatives and Affricates Spectral and Textural Feature-Based System for Automatic Detection of Fricatives and Affricates Dima Ruinskiy Niv Dadush Yizhar Lavner Department of Computer Science, Tel-Hai College, Israel Outline Phoneme

More information

Harmonic Structure Transform for Speaker Recognition

Harmonic Structure Transform for Speaker Recognition Harmonic Structure Transform for Speaker Recognition Kornel Laskowski & Qin Jin Carnegie Mellon University, Pittsburgh PA, USA KTH Speech Music & Hearing, Stockholm, Sweden 29 August, 2011 Laskowski &

More information

speaker recognition using gmm-ubm semester project presentation

speaker recognition using gmm-ubm semester project presentation speaker recognition using gmm-ubm semester project presentation OBJECTIVES OF THE PROJECT study the GMM-UBM speaker recognition system implement this system with matlab document the code and how it interfaces

More information

Cepstral Deconvolution Method for Measurement of Absorption and Scattering Coefficients of Materials

Cepstral Deconvolution Method for Measurement of Absorption and Scattering Coefficients of Materials Cepstral Deconvolution Method for Measurement of Absorption and Scattering Coefficients of Materials Mehmet ÇALIŞKAN a) Middle East Technical University, Department of Mechanical Engineering, Ankara, 06800,

More information

The effect of speaking rate and vowel context on the perception of consonants. in babble noise

The effect of speaking rate and vowel context on the perception of consonants. in babble noise The effect of speaking rate and vowel context on the perception of consonants in babble noise Anirudh Raju Department of Electrical Engineering, University of California, Los Angeles, California, USA anirudh90@ucla.edu

More information

An Investigation of Spectral Subband Centroids for Speaker Authentication

An Investigation of Spectral Subband Centroids for Speaker Authentication R E S E A R C H R E P O R T I D I A P An Investigation of Spectral Subband Centroids for Speaker Authentication Norman Poh Hoon Thian a Conrad Sanderson a Samy Bengio a IDIAP RR 3-62 December 3 published

More information

Feature extraction 1

Feature extraction 1 Centre for Vision Speech & Signal Processing University of Surrey, Guildford GU2 7XH. Feature extraction 1 Dr Philip Jackson Cepstral analysis - Real & complex cepstra - Homomorphic decomposition Filter

More information

Modeling Prosody for Speaker Recognition: Why Estimating Pitch May Be a Red Herring

Modeling Prosody for Speaker Recognition: Why Estimating Pitch May Be a Red Herring Modeling Prosody for Speaker Recognition: Why Estimating Pitch May Be a Red Herring Kornel Laskowski & Qin Jin Carnegie Mellon University Pittsburgh PA, USA 28 June, 2010 Laskowski & Jin ODYSSEY 2010,

More information

FEATURE SELECTION USING FISHER S RATIO TECHNIQUE FOR AUTOMATIC SPEECH RECOGNITION

FEATURE SELECTION USING FISHER S RATIO TECHNIQUE FOR AUTOMATIC SPEECH RECOGNITION FEATURE SELECTION USING FISHER S RATIO TECHNIQUE FOR AUTOMATIC SPEECH RECOGNITION Sarika Hegde 1, K. K. Achary 2 and Surendra Shetty 3 1 Department of Computer Applications, NMAM.I.T., Nitte, Karkala Taluk,

More information

Detection of Overlapping Acoustic Events Based on NMF with Shared Basis Vectors

Detection of Overlapping Acoustic Events Based on NMF with Shared Basis Vectors Detection of Overlapping Acoustic Events Based on NMF with Shared Basis Vectors Kazumasa Yamamoto Department of Computer Science Chubu University Kasugai, Aichi, Japan Email: yamamoto@cs.chubu.ac.jp Chikara

More information

Session Variability Compensation in Automatic Speaker Recognition

Session Variability Compensation in Automatic Speaker Recognition Session Variability Compensation in Automatic Speaker Recognition Javier González Domínguez VII Jornadas MAVIR Universidad Autónoma de Madrid November 2012 Outline 1. The Inter-session Variability Problem

More information

Unsupervised Learning: K- Means & PCA

Unsupervised Learning: K- Means & PCA Unsupervised Learning: K- Means & PCA Unsupervised Learning Supervised learning used labeled data pairs (x, y) to learn a func>on f : X Y But, what if we don t have labels? No labels = unsupervised learning

More information

Lecture 5: GMM Acoustic Modeling and Feature Extraction

Lecture 5: GMM Acoustic Modeling and Feature Extraction CS 224S / LINGUIST 285 Spoken Language Processing Andrew Maas Stanford University Spring 2017 Lecture 5: GMM Acoustic Modeling and Feature Extraction Original slides by Dan Jurafsky Outline for Today Acoustic

More information

The Noisy Channel Model. CS 294-5: Statistical Natural Language Processing. Speech Recognition Architecture. Digitizing Speech

The Noisy Channel Model. CS 294-5: Statistical Natural Language Processing. Speech Recognition Architecture. Digitizing Speech CS 294-5: Statistical Natural Language Processing The Noisy Channel Model Speech Recognition II Lecture 21: 11/29/05 Search through space of all possible sentences. Pick the one that is most probable given

More information

On Compression Encrypted Data part 2. Prof. Ja-Ling Wu The Graduate Institute of Networking and Multimedia National Taiwan University

On Compression Encrypted Data part 2. Prof. Ja-Ling Wu The Graduate Institute of Networking and Multimedia National Taiwan University On Compression Encrypted Data part 2 Prof. Ja-Ling Wu The Graduate Institute of Networking and Multimedia National Taiwan University 1 Brief Summary of Information-theoretic Prescription At a functional

More information

Frog Sound Identification System for Frog Species Recognition

Frog Sound Identification System for Frog Species Recognition Frog Sound Identification System for Frog Species Recognition Clifford Loh Ting Yuan and Dzati Athiar Ramli Intelligent Biometric Research Group (IBG), School of Electrical and Electronic Engineering,

More information

REAL-TIME TIME-FREQUENCY BASED BLIND SOURCE SEPARATION. Scott Rickard, Radu Balan, Justinian Rosca. Siemens Corporate Research Princeton, NJ 08540

REAL-TIME TIME-FREQUENCY BASED BLIND SOURCE SEPARATION. Scott Rickard, Radu Balan, Justinian Rosca. Siemens Corporate Research Princeton, NJ 08540 REAL-TIME TIME-FREQUENCY BASED BLIND SOURCE SEPARATION Scott Rickard, Radu Balan, Justinian Rosca Siemens Corporate Research Princeton, NJ 84 fscott.rickard,radu.balan,justinian.roscag@scr.siemens.com

More information

Maximum Likelihood Estimation. only training data is available to design a classifier

Maximum Likelihood Estimation. only training data is available to design a classifier Introduction to Pattern Recognition [ Part 5 ] Mahdi Vasighi Introduction Bayesian Decision Theory shows that we could design an optimal classifier if we knew: P( i ) : priors p(x i ) : class-conditional

More information

SPEECH ENHANCEMENT USING PCA AND VARIANCE OF THE RECONSTRUCTION ERROR IN DISTRIBUTED SPEECH RECOGNITION

SPEECH ENHANCEMENT USING PCA AND VARIANCE OF THE RECONSTRUCTION ERROR IN DISTRIBUTED SPEECH RECOGNITION SPEECH ENHANCEMENT USING PCA AND VARIANCE OF THE RECONSTRUCTION ERROR IN DISTRIBUTED SPEECH RECOGNITION Amin Haji Abolhassani 1, Sid-Ahmed Selouani 2, Douglas O Shaughnessy 1 1 INRS-Energie-Matériaux-Télécommunications,

More information

HARMONIC VECTOR QUANTIZATION

HARMONIC VECTOR QUANTIZATION HARMONIC VECTOR QUANTIZATION Volodya Grancharov, Sigurdur Sverrisson, Erik Norvell, Tomas Toftgård, Jonas Svedberg, and Harald Pobloth SMN, Ericsson Research, Ericsson AB 64 8, Stockholm, Sweden ABSTRACT

More information

Independent Component Analysis and Unsupervised Learning

Independent Component Analysis and Unsupervised Learning Independent Component Analysis and Unsupervised Learning Jen-Tzung Chien National Cheng Kung University TABLE OF CONTENTS 1. Independent Component Analysis 2. Case Study I: Speech Recognition Independent

More information

Eigenvoice Speaker Adaptation via Composite Kernel PCA

Eigenvoice Speaker Adaptation via Composite Kernel PCA Eigenvoice Speaker Adaptation via Composite Kernel PCA James T. Kwok, Brian Mak and Simon Ho Department of Computer Science Hong Kong University of Science and Technology Clear Water Bay, Hong Kong [jamesk,mak,csho]@cs.ust.hk

More information

Speech Enhancement with Applications in Speech Recognition

Speech Enhancement with Applications in Speech Recognition Speech Enhancement with Applications in Speech Recognition A First Year Report Submitted to the School of Computer Engineering of the Nanyang Technological University by Xiao Xiong for the Confirmation

More information

Independent Component Analysis and Unsupervised Learning. Jen-Tzung Chien

Independent Component Analysis and Unsupervised Learning. Jen-Tzung Chien Independent Component Analysis and Unsupervised Learning Jen-Tzung Chien TABLE OF CONTENTS 1. Independent Component Analysis 2. Case Study I: Speech Recognition Independent voices Nonparametric likelihood

More information

EEL 851: Biometrics. An Overview of Statistical Pattern Recognition EEL 851 1

EEL 851: Biometrics. An Overview of Statistical Pattern Recognition EEL 851 1 EEL 851: Biometrics An Overview of Statistical Pattern Recognition EEL 851 1 Outline Introduction Pattern Feature Noise Example Problem Analysis Segmentation Feature Extraction Classification Design Cycle

More information

CEPSTRAL ANALYSIS SYNTHESIS ON THE MEL FREQUENCY SCALE, AND AN ADAPTATIVE ALGORITHM FOR IT.

CEPSTRAL ANALYSIS SYNTHESIS ON THE MEL FREQUENCY SCALE, AND AN ADAPTATIVE ALGORITHM FOR IT. CEPSTRAL ANALYSIS SYNTHESIS ON THE EL FREQUENCY SCALE, AND AN ADAPTATIVE ALGORITH FOR IT. Summarized overview of the IEEE-publicated papers Cepstral analysis synthesis on the mel frequency scale by Satochi

More information

Vector Quantization Encoder Decoder Original Form image Minimize distortion Table Channel Image Vectors Look-up (X, X i ) X may be a block of l

Vector Quantization Encoder Decoder Original Form image Minimize distortion Table Channel Image Vectors Look-up (X, X i ) X may be a block of l Vector Quantization Encoder Decoder Original Image Form image Vectors X Minimize distortion k k Table X^ k Channel d(x, X^ Look-up i ) X may be a block of l m image or X=( r, g, b ), or a block of DCT

More information

Exemplar-based voice conversion using non-negative spectrogram deconvolution

Exemplar-based voice conversion using non-negative spectrogram deconvolution Exemplar-based voice conversion using non-negative spectrogram deconvolution Zhizheng Wu 1, Tuomas Virtanen 2, Tomi Kinnunen 3, Eng Siong Chng 1, Haizhou Li 1,4 1 Nanyang Technological University, Singapore

More information

Gaussian Mixture Model Uncertainty Learning (GMMUL) Version 1.0 User Guide

Gaussian Mixture Model Uncertainty Learning (GMMUL) Version 1.0 User Guide Gaussian Mixture Model Uncertainty Learning (GMMUL) Version 1. User Guide Alexey Ozerov 1, Mathieu Lagrange and Emmanuel Vincent 1 1 INRIA, Centre de Rennes - Bretagne Atlantique Campus de Beaulieu, 3

More information

An Evolutionary Programming Based Algorithm for HMM training

An Evolutionary Programming Based Algorithm for HMM training An Evolutionary Programming Based Algorithm for HMM training Ewa Figielska,Wlodzimierz Kasprzak Institute of Control and Computation Engineering, Warsaw University of Technology ul. Nowowiejska 15/19,

More information

Front-End Factor Analysis For Speaker Verification

Front-End Factor Analysis For Speaker Verification IEEE TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING Front-End Factor Analysis For Speaker Verification Najim Dehak, Patrick Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, Abstract This

More information

SEPARATION OF ACOUSTIC SIGNALS USING SELF-ORGANIZING NEURAL NETWORKS. Temujin Gautama & Marc M. Van Hulle

SEPARATION OF ACOUSTIC SIGNALS USING SELF-ORGANIZING NEURAL NETWORKS. Temujin Gautama & Marc M. Van Hulle SEPARATION OF ACOUSTIC SIGNALS USING SELF-ORGANIZING NEURAL NETWORKS Temujin Gautama & Marc M. Van Hulle K.U.Leuven, Laboratorium voor Neuro- en Psychofysiologie Campus Gasthuisberg, Herestraat 49, B-3000

More information

Noise Compensation for Subspace Gaussian Mixture Models

Noise Compensation for Subspace Gaussian Mixture Models Noise ompensation for ubspace Gaussian Mixture Models Liang Lu University of Edinburgh Joint work with KK hin, A. Ghoshal and. enals Liang Lu, Interspeech, eptember, 2012 Outline Motivation ubspace GMM

More information

Vector Quantization and Subband Coding

Vector Quantization and Subband Coding Vector Quantization and Subband Coding 18-796 ultimedia Communications: Coding, Systems, and Networking Prof. Tsuhan Chen tsuhan@ece.cmu.edu Vector Quantization 1 Vector Quantization (VQ) Each image block

More information

ON THE USE OF MLP-DISTANCE TO ESTIMATE POSTERIOR PROBABILITIES BY KNN FOR SPEECH RECOGNITION

ON THE USE OF MLP-DISTANCE TO ESTIMATE POSTERIOR PROBABILITIES BY KNN FOR SPEECH RECOGNITION Zaragoza Del 8 al 1 de Noviembre de 26 ON THE USE OF MLP-DISTANCE TO ESTIMATE POSTERIOR PROBABILITIES BY KNN FOR SPEECH RECOGNITION Ana I. García Moral, Carmen Peláez Moreno EPS-Universidad Carlos III

More information

Text Independent Speaker Identification Using Imfcc Integrated With Ica

Text Independent Speaker Identification Using Imfcc Integrated With Ica IOSR Journal of Electronics and Communication Engineering (IOSR-JECE) e-issn: 2278-2834,p- ISSN: 2278-8735. Volume 7, Issue 5 (Sep. - Oct. 2013), PP 22-27 ext Independent Speaker Identification Using Imfcc

More information

Data Preprocessing. Cluster Similarity

Data Preprocessing. Cluster Similarity 1 Cluster Similarity Similarity is most often measured with the help of a distance function. The smaller the distance, the more similar the data objects (points). A function d: M M R is a distance on M

More information

3. ESTIMATION OF SIGNALS USING A LEAST SQUARES TECHNIQUE

3. ESTIMATION OF SIGNALS USING A LEAST SQUARES TECHNIQUE 3. ESTIMATION OF SIGNALS USING A LEAST SQUARES TECHNIQUE 3.0 INTRODUCTION The purpose of this chapter is to introduce estimators shortly. More elaborated courses on System Identification, which are given

More information

Approximating the Covariance Matrix with Low-rank Perturbations

Approximating the Covariance Matrix with Low-rank Perturbations Approximating the Covariance Matrix with Low-rank Perturbations Malik Magdon-Ismail and Jonathan T. Purnell Department of Computer Science Rensselaer Polytechnic Institute Troy, NY 12180 {magdon,purnej}@cs.rpi.edu

More information

A SPECTRAL SUBTRACTION RULE FOR REAL-TIME DSP IMPLEMENTATION OF NOISE REDUCTION IN SPEECH SIGNALS

A SPECTRAL SUBTRACTION RULE FOR REAL-TIME DSP IMPLEMENTATION OF NOISE REDUCTION IN SPEECH SIGNALS Proc. of the 1 th Int. Conference on Digital Audio Effects (DAFx-9), Como, Italy, September 1-4, 9 A SPECTRAL SUBTRACTION RULE FOR REAL-TIME DSP IMPLEMENTATION OF NOISE REDUCTION IN SPEECH SIGNALS Matteo

More information

A comparative study of time-delay estimation techniques for convolutive speech mixtures

A comparative study of time-delay estimation techniques for convolutive speech mixtures A comparative study of time-delay estimation techniques for convolutive speech mixtures COSME LLERENA AGUILAR University of Alcala Signal Theory and Communications 28805 Alcalá de Henares SPAIN cosme.llerena@uah.es

More information

ESTIMATION OF RELATIVE TRANSFER FUNCTION IN THE PRESENCE OF STATIONARY NOISE BASED ON SEGMENTAL POWER SPECTRAL DENSITY MATRIX SUBTRACTION

ESTIMATION OF RELATIVE TRANSFER FUNCTION IN THE PRESENCE OF STATIONARY NOISE BASED ON SEGMENTAL POWER SPECTRAL DENSITY MATRIX SUBTRACTION ESTIMATION OF RELATIVE TRANSFER FUNCTION IN THE PRESENCE OF STATIONARY NOISE BASED ON SEGMENTAL POWER SPECTRAL DENSITY MATRIX SUBTRACTION Xiaofei Li 1, Laurent Girin 1,, Radu Horaud 1 1 INRIA Grenoble

More information

The Comparison of Vector Quantization Algoritms in Fish Species Acoustic Voice Recognition Using Hidden Markov Model

The Comparison of Vector Quantization Algoritms in Fish Species Acoustic Voice Recognition Using Hidden Markov Model The Comparison Vector Quantization Algoritms in Fish Species Acoustic Voice Recognition Using Hidden Markov Model Diponegoro A.D 1). and Fawwaz Al Maki. W 1) 1) Department Electrical Enginering, University

More information

Improving the Effectiveness of Speaker Verification Domain Adaptation With Inadequate In-Domain Data

Improving the Effectiveness of Speaker Verification Domain Adaptation With Inadequate In-Domain Data Distribution A: Public Release Improving the Effectiveness of Speaker Verification Domain Adaptation With Inadequate In-Domain Data Bengt J. Borgström Elliot Singer Douglas Reynolds and Omid Sadjadi 2

More information

A Generalized Subspace Approach for Enhancing Speech Corrupted by Colored Noise

A Generalized Subspace Approach for Enhancing Speech Corrupted by Colored Noise 334 IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL 11, NO 4, JULY 2003 A Generalized Subspace Approach for Enhancing Speech Corrupted by Colored Noise Yi Hu, Student Member, IEEE, and Philipos C

More information

SYMBOL RECOGNITION IN HANDWRITTEN MATHEMATI- CAL FORMULAS

SYMBOL RECOGNITION IN HANDWRITTEN MATHEMATI- CAL FORMULAS SYMBOL RECOGNITION IN HANDWRITTEN MATHEMATI- CAL FORMULAS Hans-Jürgen Winkler ABSTRACT In this paper an efficient on-line recognition system for handwritten mathematical formulas is proposed. After formula

More information

Design Criteria for the Quadratically Interpolated FFT Method (I): Bias due to Interpolation

Design Criteria for the Quadratically Interpolated FFT Method (I): Bias due to Interpolation CENTER FOR COMPUTER RESEARCH IN MUSIC AND ACOUSTICS DEPARTMENT OF MUSIC, STANFORD UNIVERSITY REPORT NO. STAN-M-4 Design Criteria for the Quadratically Interpolated FFT Method (I): Bias due to Interpolation

More information

PHONEME CLASSIFICATION OVER THE RECONSTRUCTED PHASE SPACE USING PRINCIPAL COMPONENT ANALYSIS

PHONEME CLASSIFICATION OVER THE RECONSTRUCTED PHASE SPACE USING PRINCIPAL COMPONENT ANALYSIS PHONEME CLASSIFICATION OVER THE RECONSTRUCTED PHASE SPACE USING PRINCIPAL COMPONENT ANALYSIS Jinjin Ye jinjin.ye@mu.edu Michael T. Johnson mike.johnson@mu.edu Richard J. Povinelli richard.povinelli@mu.edu

More information

LOW COMPLEXITY WIDEBAND LSF QUANTIZATION USING GMM OF UNCORRELATED GAUSSIAN MIXTURES

LOW COMPLEXITY WIDEBAND LSF QUANTIZATION USING GMM OF UNCORRELATED GAUSSIAN MIXTURES LOW COMPLEXITY WIDEBAND LSF QUANTIZATION USING GMM OF UNCORRELATED GAUSSIAN MIXTURES Saikat Chatterjee and T.V. Sreenivas Department of Electrical Communication Engineering Indian Institute of Science,

More information

ISOLATED WORD RECOGNITION FOR ENGLISH LANGUAGE USING LPC,VQ AND HMM

ISOLATED WORD RECOGNITION FOR ENGLISH LANGUAGE USING LPC,VQ AND HMM ISOLATED WORD RECOGNITION FOR ENGLISH LANGUAGE USING LPC,VQ AND HMM Mayukh Bhaowal and Kunal Chawla (Students)Indian Institute of Information Technology, Allahabad, India Abstract: Key words: Speech recognition

More information

Environmental Sound Classification in Realistic Situations

Environmental Sound Classification in Realistic Situations Environmental Sound Classification in Realistic Situations K. Haddad, W. Song Brüel & Kjær Sound and Vibration Measurement A/S, Skodsborgvej 307, 2850 Nærum, Denmark. X. Valero La Salle, Universistat Ramon

More information

Analysis of polyphonic audio using source-filter model and non-negative matrix factorization

Analysis of polyphonic audio using source-filter model and non-negative matrix factorization Analysis of polyphonic audio using source-filter model and non-negative matrix factorization Tuomas Virtanen and Anssi Klapuri Tampere University of Technology, Institute of Signal Processing Korkeakoulunkatu

More information