Data-driven clustering of channels in corpora


Data-driven clustering of channels in corpora

Mattias Nilsson

Master Thesis in Speech Technology
Supervisor: Daniel Neiberg
Examiner: Mats Blomberg

Centre for Speech Technology
Department of Speech, Music and Hearing
Royal Institute of Technology, Stockholm
22nd June


Master Thesis in Speech Technology: Data-driven clustering of channels in speech databases
Mattias Nilsson
Examiner: Mats Blomberg
Supervisor: Daniel Neiberg


Abstract

The performance of a speaker verification system is reduced by the channel if the speech is distorted by a telephone network. This thesis examines a channel compensation method based on the fact that the channel can be seen as an additive bias in the cepstral domain. A channel estimate is therefore subtracted from the speech as channel compensation. To compensate for non-linear distortion, the idea was to use an unsupervised clustering method on a large amount of training data from 10 different channels in HTIMIT. The cluster centroids were then used as channel estimates. Vector quantization was used for clustering the data. Experimental results show that a trained vector quantizer managed to classify 81% of the test data to the correct channel when the bias for the sex of the speaker was subtracted. The Minimum Description Length principle was used to examine the correct number of clusters in a data set.

Acknowledgments

I would like to thank the following persons for helping me with this thesis. My supervisor Daniel Neiberg, for helping me in my daily work and for providing a lot of help during the writing of this thesis report. My examiner Mats Blomberg and all the rest of the people at the Centre for Speech Technology, for providing an inspiring working environment. My fellow thesis workers, both from the old place and the new fresh place, for mind-expanding discussions. My good friend and lunch company Simon Hensing, for listening to my problems and providing me with some of his own.


List of Figures

2.1 The acoustic environment
2.2 Simplified acoustic environment
4.1 The MDL value for 1-14 channels with p_k calculated by the VQ and m = K(D + D(D + 1)/2)
4.2 The MDL value for 1-14 channels with p_k = 1/K and m = K(D + D(D + 1)/2)
4.3 The MDL value for 1-14 channels with p_k = 1/K, a global covariance matrix and m = KD


List of Tables

3.1 The different microphones in HTIMIT
4.1 Results from a VQ classifying the test data. The training was performed by taking the mean of each channel's training data. Correctly classified data from each channel are in bold style.
4.2 Results from a VQ classifying the test data. The training was performed by the LBG algorithm. Correctly classified data from each channel are in bold style.
4.3 Results from a VQ classifying the test data. The speech files were processed by a speech detector. The training was performed by taking the mean of each channel's training data. Correctly classified data from each channel are in bold style.
4.4 Results from a VQ classifying the test data. The training was performed by the LBG algorithm. All data was processed by a speech detector. Correctly classified data from each channel are in bold style.
4.5 Results from a trained VQ classifying test data separately for each sex. The training was performed by the LBG algorithm on separate training sets for each sex. Correctly classified data from each channel are in bold style.
4.6 Results from a trained VQ classifying test data after the bias vector for sex has been subtracted. Correctly classified data from each channel are in bold style.
4.7 Results from a VQ classifying test data after the bias vector for dialect has been subtracted. Correctly classified data from each channel are in bold style.

List of Abbreviations

CMS    Cepstral Mean Subtraction
DFT    Discrete Fourier Transform
GIVES  General Identity Verification System
GMM    Gaussian Mixture Model
iid    independent identically distributed
LBG    Linde-Buzo-Gray algorithm
MDL    Minimum Description Length
MFCC   Mel-frequency Cepstral Coefficients
PDF    probability density function
SNR    Signal-to-noise ratio
VQ     Vector Quantizer

Contents

1 Introduction
  1.1 Background
  1.2 Problem formulation
  1.3 Different Channel Compensation methods
  1.4 Approach
  1.5 Method

2 Theory
  2.1 Acoustic environment
  2.2 MFCC
  2.3 Channel compensation
  2.4 Channel estimation
  2.5 Vector quantization
    2.5.1 Introduction
    2.5.2 Coding of the test data
    2.5.3 Designing the codebook
    2.5.4 The LBG algorithm
  2.6 Bias from Sex and Dialect
  2.7 Determining the number of channels

3 Implementation and Experiments
  3.1 Feature Extraction
  3.2 Clustering
  3.3 Classification of the test data
  3.4 Compensation for sex and dialect
  3.5 MDL principle
  3.6 The Corpus

4 Results
  4.1 Ideal codebook
  4.2 A trained VQ
  4.3 Speech detection
  4.4 Separate codebooks for male and female
  4.5 Compensation for the bias of sex and dialect
  4.6 Determining the number of different channels

5 Discussion
  5.1 Performance of the Vector Quantizer
  5.2 Determining the number of clusters with the MDL principle

6 Conclusions
  6.1 Outlook

Chapter 1 Introduction

1.1 Background

Speaker verification is the task of verifying the identity of a speaker. The speaker claims an identity and the speaker verification system either accepts or rejects the claim, depending on whether the speech is similar to the speech of the claimed identity. It can be difficult to set the threshold for when the speaker is to be accepted. A perfect speaker verification system would always accept the correct speaker and always reject a false speaker. Depending on how the threshold is set, the system will sometimes reject the true speaker or accept a false speaker. When speaker verification is to be performed on recorded speech, some kind of feature extraction is performed on the speech in order to extract the desired information from the original data. One kind of feature extraction [1] often used is Mel-Frequency Cepstrum Coefficients (MFCC) [2].

1.2 Problem formulation

When speech is transferred through a telephone network it gets corrupted by the channel and external noise. There will be both additive and convolutional noise corrupting the original speech signal. Depending on the kind of telephone handset and channel, the signal will be corrupted in different ways. This channel variability is a problem when speaker verification [3, 4] is to be done. Therefore it is desirable to compensate for the channel in some way.

1.3 Different Channel Compensation methods

In order to compensate for the channel, some channel compensation method has to be used. These methods can be divided into two major groups: methods that classify the data to a channel and methods that do not. Among the channel compensation methods that classify the data are feature mapping [5] and nonlinear filtering [6]. Feature mapping is based on detecting the most probable channel model and then mapping the feature vectors to a channel-independent space. Nonlinear filtering fits three filters to each channel: a linear pre-filter, a nonlinear filter and a linear post-filter. Methods for channel compensation that do not depend on classification of the data include Cepstral Mean Subtraction (CMS) [7], RASTA processing [8] and stochastic matching [9]. CMS subtracts a mean feature vector over a long utterance. RASTA processing designs an appropriate band-pass filter that removes components that are unlikely to be related to phonetic properties of the speech. Stochastic matching is based on a maximum-likelihood matching approach that decreases the mismatch between a test utterance and a given set of speech models. If a channel estimate is made in the MFCC space, it can be seen as an additive bias to the speech. If the channel bias could be found in some way, one channel compensation would then be to subtract the bias. One problem that many channel compensation methods share is that they cannot compensate for the non-linear distortion components that appear in carbon-button microphones.

1.4 Approach

The main idea of this thesis was to see if different channels could be separated by using some unsupervised clustering method on a training set consisting of a corpus (a speech database) with speech sent through several different channels. The clustering would take place in the MFCC space, and the cluster centroids would then be channel estimates. Depending on the structure of the training set, the channel estimates would be more or less similar to the actual channels. Ideally each cluster would consist of data from one channel only. The idea was also to see if the effect of non-linear components [10, 6] could be compensated for by using a large amount of training data for the channel estimate. The question of finding the correct number of channels from a training sequence with an unknown number of channels was to be examined using an optimization algorithm. Data that came from certain common groups of speakers, such as sex and dialect, showed some common characteristics. These characteristics influence the channel estimates and can be seen as additive biases. If these characteristics could be compensated for, it would result in better channel estimates.

1.5 Method

The choice of channel estimation method is not a trivial question. In this thesis an unsupervised clustering method, vector quantization, was used on the available corpus to produce channel estimates. Different methods for designing the codebook were examined. In order to measure the performance of the different algorithms, the test data was labeled with the correct physical channel. The exact number of data correctly labeled by the vector quantizer could therefore be derived. In order to find the optimal number of channel clusters from a training sequence with an unknown number of channels, the Minimum Description Length (MDL) principle was used.

Chapter 2 Theory

2.1 Acoustic environment

[Figure 2.1: The acoustic environment]

Speech that is transferred through the telephone network is corrupted by the channel, and noise is added at different stages during the transfer. Compared to high-quality recordings, the difference can be large. The acoustic environment in a telephone network can be described as [11]

    y(t) = [ (s_{Stress,Lombard}(t) + n_{background}(t)) * h_{mic}(t) + n_{chan}(t) ] * h_{chan}(t) + n_{receiver}(t)    (2.1)

where y(t) is the corrupted output signal, s_{Stress,Lombard}(t) is the speech signal from the speaker (affected by stress and the Lombard effect), n_{background}(t) is the background noise, h_{mic}(t) is the impulse response of the microphone, n_{chan}(t) is the additive noise of the transmission channel, h_{chan}(t) is the impulse response of the transmission channel, and n_{receiver}(t) is the additive noise from the receiver. The impulse responses can be combined into one impulse response h_{system} for the whole system. The background noise also affects the signal by convolution, besides being additive, since a speaker often gets stressed or tries to articulate in a different way (the Lombard effect [12]). If all the additive noise sources are combined into one and all the convolutional noise sources into one, then

    y(t) = h_{system}(t) * [ s(t) + n_{system}(t) ]    (2.2)

describes the system.

[Figure 2.2: Simplified acoustic environment]

If this system is transformed with the discrete Fourier transform

    y(n) = (1/N) Σ_{k=0}^{N-1} y(t_k) e^{-j2πkn/N},  n = 0, ..., N-1    (2.3)

and the power y(n)y*(n) is then taken, this yields

    y(n)y*(n) = |h(n)s(n)|^2 + |n(n)|^2 + 2|n(n)||h(n)||s(n)| cos(θ)    (2.4)

where θ is the angle between the speech vector s(n) and the noise vector n(n). Some assumptions that are commonly made are that h(n) is constant over time and independent of input level [11]. It is also usually assumed that the signal is smoothed over sufficiently many bins, so that the last, mixed term in (2.4) can be ignored. If the signal-to-noise ratio, SNR, is high enough, the noise can be ignored. This yields

    y(n)y*(n) = |h(n)s(n)|^2    (2.5)

2.2 MFCC

The choice of feature extraction method used to process the speech signal fell upon Mel-Frequency Cepstrum Coefficients (MFCC) [2]. The computation of the MFCC involves a number of mathematical operations on the signal. First the signal is divided into frames by multiplication with overlapping Hamming windows. The signal is then transformed by taking the Discrete Fourier Transform (DFT) and the power, y(n)y*(n). If the environment is relatively noise-free, as described in (2.5), this gives

    y(n)y*(n) = |h(n)s(n)|^2    (2.6)

The signal then passes through a logarithmic triangular filter bank [13], which summarizes the DFT powers. This transforms the frequency scale into a mel scale, i.e. the centre frequencies of the filters are spaced equally on a linear scale up to 1 kHz and equally on a logarithmic scale above 1 kHz. This reduces the number of coefficients from 128 to 24 mel-frequency bins. Each coefficient now corresponds to the mean of the frequencies covered by each triangular window. The amplitude scale is then transformed to a logarithmic scale. The system described in (2.6) now becomes

    log(ŷ(n)ŷ*(n)) = 2 log|ĥ(n)| + 2 log|ŝ(n)|    (2.7)

The channel is now an additive bias to the signal. The last step is to cosine transform the system. Usually the components in the cosine transform above a certain number are not used, since they represent such fine spectral detail that it is of lesser phonetic significance. Thus a reduction of the number of dimensions [2] is achieved.
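As an illustration of the processing chain above, the following is a minimal sketch in Python/NumPy, not the GIVES implementation used in this thesis; the helper names (mel_filterbank, mfcc) and the details of the filter spacing are simplifying assumptions, while the parameter values (10 ms frames, 25.6 ms windows, 24 filters between 300 and 3400 Hz, 12 cepstrum coefficients) follow Chapter 3.

    import numpy as np

    def mel_filterbank(n_filters, win_len, fs, f_lo=300.0, f_hi=3400.0):
        # Triangular filters with centre frequencies equally spaced on the
        # mel scale (roughly linear below 1 kHz, logarithmic above).
        mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
        imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
        edges = imel(np.linspace(mel(f_lo), mel(f_hi), n_filters + 2))
        bins = np.floor((win_len + 1) * edges / fs).astype(int)
        fb = np.zeros((n_filters, win_len // 2 + 1))
        for j in range(n_filters):
            lo, c, hi = bins[j], bins[j + 1], bins[j + 2]
            fb[j, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
            fb[j, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
        return fb

    def mfcc(signal, fs=8000, frame_ms=10.0, win_ms=25.6,
             n_filters=24, n_ceps=12):
        win_len = int(fs * win_ms / 1000.0)
        step = int(fs * frame_ms / 1000.0)
        window = np.hamming(win_len)
        fb = mel_filterbank(n_filters, win_len, fs)
        j = np.arange(n_filters)
        frames = []
        for start in range(0, len(signal) - win_len + 1, step):
            frame = signal[start:start + win_len] * window
            power = np.abs(np.fft.rfft(frame)) ** 2    # y(n)y*(n), cf. (2.6)
            logmel = np.log(fb @ power + 1e-10)        # log filter-bank output
            # Cosine transform; the channel is an additive bias here, cf. (2.7).
            ceps = [np.sum(logmel * np.cos(i * (j + 0.5) * np.pi / n_filters))
                    for i in range(1, n_ceps + 1)]
            frames.append(ceps)
        return np.array(frames)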

2.3 Channel compensation

Consider one of the log-log filter bank components j of (2.7):

    LLFB_j = 2 log|ĥ_j(n)| + 2 log|ŝ_j(n)|    (2.8)

The cosine-projected cepstra then become

    MFCC_i = K Σ_{j=1}^{J} [ (log|ĥ_j| + log|ŝ_j(n)|) cos(i(j − 1/2)π/J) ],  i = 1, ..., I    (2.9)

where J is the number of log-log filter banks, I is the number of cepstrum coefficients and K is a constant. The zeroth component MFCC_0 can be seen as the energy of the signal. One method of channel compensation that is often used is Cepstral Mean Subtraction, CMS, which subtracts the arithmetic mean from the whole utterance. A problem with CMS is that the arithmetic mean, E(s), besides the channel also contains information about both the speaker and the context, i.e. what sounds the speaker makes. Therefore, when CMS is applied to a system, a speaker normalization is performed [14]. This can be a problem in speaker verification, since part of the speaker information is removed.

2.4 Channel estimation

Since CMS causes problems in speaker verification by normalizing the speaker (see section 2.3), another approach would be to see if some other method could approximate the channel more independently of the speaker. One method would be to take the arithmetic mean of several different speakers speaking through the same channel. Thus the influence of each speaker on the channel estimate would decrease. It is also possible that there are some non-linear effects. The influence of these non-linear effects would decrease if the channel estimate were based on several speakers, compared to CMS where the estimate is based on one speaker. This would be a supervised method, since each speaker would have to be labeled with the corresponding channel. If the channel spoken through is unknown, an unsupervised method, for example clustering, has to be used. One such clustering method is vector quantization.

2.5 Vector quantization

2.5.1 Introduction

Vector quantization [15] is a form of lossy data compression [16], which means that it is impossible to reconstruct the original data after the compression. A vector quantizer maps k-dimensional vectors in the vector space R^k into a finite set of vectors C = {c_m, m = 1, 2, ..., M}. These vectors are called code vectors or codewords. A vector quantizer based on the nearest neighbor method, i.e. where a vector is classified as belonging to the code vector to which it has the smallest distance, is called a Nearest Neighbor or, if the Euclidean metric is used, a Voronoi vector quantizer. The set of all code vectors is called the codebook. The optimal choice of distance measure depends on the data to be compressed. For example, the Euclidean distance measure would be

    d(x, c) = ||x − c||^2 = Σ_{i=1}^{k} (x_i − c_i)^2    (2.10)

The region where all data belong to a certain code vector is called the encoding region of that code vector.

2.5.2 Coding of the test data

The test data will be classified as belonging to one of the clusters, represented by the corresponding code vector. The classification is based on the smallest distance d(z, c_i) from test data z to code vector c_i, i = 1, ..., M.

2.5.3 Designing the codebook

The design of a vector quantizer can be described as finding the optimal codebook, given a certain distortion measure and a certain number of code vectors. For an unknown set of training data, the number of code vectors to be chosen is not trivial. The optimal solution satisfies two criteria.

1. Nearest Neighbor Condition

    S_m = { x : ||x − c_m||^2 ≤ ||x − c_{m'}||^2, m' = 1, 2, ..., M }    (2.11)

This means that the encoding region S_m consists of all vectors that are closer to the code vector c_m than to any other code vector.

2. Centroid Condition

    c_m = ( Σ_{x_n ∈ S_m} x_n ) / ( Σ_{x_n ∈ S_m} 1 ),  m = 1, 2, ..., M    (2.12)

The code vector c_m is the average of the training vectors in S_m. This solution minimizes the average distortion

    D_Average = (1/Nk) Σ_{n=1}^{N} ||x_n − Q(x_n)||^2    (2.13)

where Q(x_n) = c_m if x_n ∈ S_m.

The method of designing a codebook is an open research area. One could for example choose the M first training vectors as the initial codebook. It is also possible to choose the initial code vectors randomly from the training set. If the training data is not independent identically distributed (iid), a sample might be influenced by other training data in, for example, its neighborhood. This could mean that the M first training vectors come from the same natural cluster in the training set. A disadvantage with a random initial codebook is that it is difficult to compare results, since the results vary so much depending on the starting codebook. After the initial codebook is chosen, each training sample is classified as belonging to one of the clusters. The new code vectors are then computed according to the centroid condition. This process of finding the correct cluster for each training sample and then computing a new codebook is called the training of the vector quantizer. When the codebook no longer changes, the average distortion has reached a local minimum and the training stops.
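The two conditions translate directly into this alternating training loop. Below is a minimal sketch of such a trainer under a Euclidean distortion measure; the function name is hypothetical, and the thesis implementation (in MATLAB, with a Mahalanobis metric, see Chapter 3) differs in detail.

    import numpy as np

    def train_vq(data, codebook, max_iter=100):
        # Alternate the Nearest Neighbor Condition (2.11) and the
        # Centroid Condition (2.12) until the codebook stops changing.
        for _ in range(max_iter):
            # Assign each training vector to its closest code vector.
            dists = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            # Each code vector becomes the mean of its encoding region.
            new_codebook = np.array([
                data[labels == m].mean(axis=0) if np.any(labels == m)
                else codebook[m] for m in range(len(codebook))])
            if np.allclose(new_codebook, codebook):
                break  # local minimum of the average distortion (2.13)
            codebook = new_codebook
        return codebook, labels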

2.5.4 The LBG algorithm

Another method of designing the codebook is the Linde-Buzo-Gray (LBG) algorithm [17]. The LBG algorithm is based on splitting the code vectors until an optimal solution is found, based on some criterion. The initial codebook thus consists of one code vector, which is the mean of the training data. This code vector is then split into two by adding and subtracting a small vector ε. One possible choice of ε is the eigenvector corresponding to the largest eigenvalue of the training data belonging to S_m. Then the vector quantizer is run on these two code vectors until a solution is found. The final two code vectors are then split into four. This process is repeated iteratively until the codebook no longer changes. Then a local minimum is reached and the training stops.

1. Starting. Given the training data Y = {y_n; n = 1, 2, ..., N} and some optimality requirement, set ε > 0 to a small vector. Let M = 1 and set the first code vector to

    c_1 = (1/N) Σ_{n=1}^{N} y_n

2. Splitting. For j = 1, ..., M set

    c_j^0 = (1 + ε) c_j
    c_{M+j}^0 = (1 − ε) c_j

Set M = 2M.

3. Iteration. Set the iteration index i = 0.

(a) For n = 1, 2, ..., N, minimize the chosen distance measure d(c_m^i, y_n) over all m = 1, ..., M. Let m* be the index which minimizes the distance. Then label y_n with Q(y_n) = c_{m*}^i.

(b) For m = 1, ..., M, update the code vector

    c_m^{i+1} = ( Σ_{Q(y_n) = c_m^i} y_n ) / ( Σ_{Q(y_n) = c_m^i} 1 )    (2.14)

(c) Set i = i + 1.

(d) If the code vectors are constant, c^i = c^{i−1}, or some other stop criterion is met, stop; else go back to (a).

4. If the size of the codebook matches the desired size, stop; else go back to 2.
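A compact sketch of the splitting procedure, reusing train_vq from the previous sketch; as in Chapter 3, ε is here taken proportional to the per-cluster standard deviation, which is one of several possible choices. Note that plain doubling only reaches powers of two; Section 3.2 describes how ten code vectors were obtained by doubling to eight and then splitting the two clusters with the largest variance.

    import numpy as np

    def lbg(data, target_size, eps_scale=0.01):
        # 1. Start from a single code vector: the mean of the training data.
        codebook = data.mean(axis=0, keepdims=True)
        codebook, labels = train_vq(data, codebook)
        while len(codebook) < target_size:
            # 2. Split every code vector by adding/subtracting a small vector.
            split = []
            for m, c in enumerate(codebook):
                members = data[labels == m]
                eps = eps_scale * (members.std(axis=0) if len(members) else 1.0)
                split.extend([c + eps, c - eps])
            # 3. Retrain until the code vectors are constant; M has doubled.
            codebook, labels = train_vq(data, np.array(split))
        return codebook, labels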

2.6 Bias from Sex and Dialect

If the signals of several speakers are compared, it can be seen that they may have some common partial vectors. These partial vectors could correspond to the sex or dialect of the speaker, or to the context of the speech. The signal in (2.7) can thus be seen as being made up of several partial vectors:

    s(n) = s_sex(n) + s_dialect(n) + s_context(n)    (2.15)

There would for example be two vectors for sex, one male and one female. If these partial vectors could be subtracted from the signal, it might be possible to get better channel estimates. Of course, in order to subtract these vectors, the corresponding sex or dialect of the speaker has to be known or identified in some manner.

2.7 Determining the number of channels

In an unsupervised clustering algorithm, some stop criterion has to be used to find the optimal number of clusters. Without knowledge about the training sequence, it can be difficult to know whether all of the training data come from the same channel, or whether each training sample represents a unique channel. This partitioning of data into subgroups is known as cluster analysis [18]. Vector quantization with the LBG algorithm is a divisive, hierarchical clustering method. This means that the algorithm produces a sequence of partitions of the training data corresponding to an increasing number of clusters, since the clusters are split in each stage of the algorithm. One way of deciding the optimal number of clusters is the Minimum Description Length (MDL) principle [19, 20]. The MDL principle assumes that the training data can be described by a set of probability density functions (PDFs). When cluster analysis is done on a training set, the set is often described by a Gaussian Mixture Model (GMM). Each cluster will then be a Gaussian distribution. The Gaussian mixture distribution can be computed after the VQ has finished its training. The Gaussian mixture distribution will be

    f(x_i | θ) = Σ_{k=1}^{K} p_k N(x_i | µ_k, Σ_k)    (2.16)

for the clusters, where p_k, with Σ_{k=1}^{K} p_k = 1, is the probability that a data point is classified by the VQ as belonging to a certain cluster k, and N(µ_k, Σ_k) is a Gaussian density with mean vector µ_k and covariance matrix Σ_k. The VQ's training algorithm approximates the EM algorithm [21], which estimates the mixture parameters so that the following log-likelihood function is maximized:

    L = Σ_{i=1}^{N} log f(x_i | θ)    (2.17)

If several models are equally likely a priori, then L is proportional to the a posteriori probability that the data conform to the model θ = (p_1, ..., p_K; µ_1, ..., µ_K; Σ_1, ..., Σ_K). Since L will increase with the complexity of the model, a term m is added that penalizes more complex models. This penalizing term m is based on the complexity of the model and is the number of free parameters subject to estimation. For a GMM the penalizing term will be

    m = (K − 1) + K(D + D(D + 1)/2)    (2.18)

where D is the number of features and K is the number of clusters. The MDL function is maximized for the optimal solution:

    MDL = L − (1/2) m log N    (2.19)

where N is the number of data samples. Since both L and m increase with the complexity of the model, there will be a maximum where the increase in likelihood no longer balances the penalizing term.
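Given a trained VQ's partition of the data, the MDL value for a candidate K can be computed as in the sketch below. This is an illustration of (2.16)-(2.19) with hypothetical helper names, assuming each cluster holds enough data for a non-singular covariance matrix; it is not the exact thesis implementation.

    import numpy as np
    from scipy.stats import multivariate_normal

    def mdl_value(data, labels, K):
        N, D = data.shape
        p = np.array([np.mean(labels == k) for k in range(K)])  # p_k
        comps = [multivariate_normal(data[labels == k].mean(axis=0),
                                     np.cov(data[labels == k].T))
                 for k in range(K)]
        # Log-likelihood (2.17) of the Gaussian mixture (2.16).
        mix = sum(p[k] * comps[k].pdf(data) for k in range(K))
        L = np.log(mix).sum()
        # Number of free parameters (2.18): weights, means, covariances.
        m = (K - 1) + K * (D + D * (D + 1) / 2)
        return L - 0.5 * m * np.log(N)  # MDL function (2.19)

The optimal number of clusters is then the K, over a range of candidate codebook sizes, for which mdl_value is largest.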


Chapter 3 Implementation and Experiments

The different theories described in Chapter 2 are here used to describe the method that was used to get channel estimates through vector quantization. The equipment and implemented theories are described in detail.

3.1 Feature Extraction

The MFCC feature extraction was performed using GIVES (General Identity Verification System), a software package built for speaker verification. First the speech file was separated into frames by multiplication with overlapping Hamming windows. The frame duration was set to 10 ms and the window length was 25.6 ms. Then the speech file was transformed with the DFT and sent to a triangular filter bank with 24 components between 300 Hz and 3400 Hz. This was done to remove redundant information, since the guaranteed bandwidth in the telephone network is in that interval. The cosine transform was then used to reduce the dimensions of the feature vectors from 24 to 12. Cepstral liftering was finally applied to the feature vectors. The result was a series of 12-dimensional vectors from each speech file. Since the speech files were of different lengths, the final series were also of different lengths. Since the sentences in HTIMIT consist of several words, there may be silent intervals between the words. One question that came to mind was therefore whether this silence influenced the channel estimates in any way. To examine this, a speech detector was used to construct files without silence.

3.2 Clustering

The clustering of the MFCC feature-extracted data was done by a vector quantizer in MATLAB. First the starting codebook was picked randomly from the training sequence (see Chapter 2). However, this resulted in rather dramatic differences in the performance of the vector quantizer, from poor to excellent. Therefore, the codebook was constructed by the LBG algorithm, which gave good results each time it was run. The small vector ε, used for splitting the code vectors, was chosen proportional to the standard deviation of the training data belonging to the cluster that was split. Since the number of channels in HTIMIT was known, this information was first used to set the final number of code vectors to M = 10. The size of the codebook was therefore doubled until it was of size 8. Then the two clusters with the largest variance were split, to get the final 10 code vectors. The choice of distance measure fell upon the Mahalanobis metric, which in this case is defined as

    d_Mahalanobis(y, c_i) = (y − c_i) Σ^{−1} (y − c_i)^T    (3.1)

where Σ^{−1} is the inverted covariance matrix of the training set.

3.3 Classification of the test data

As described in section 2.5.2, the test data was classified as belonging to the channel corresponding to the closest code vector.

3.4 Compensation for sex and dialect

In order to compensate for redundant information, such as sex and dialect, which is of no use for the clustering of the channels, the corresponding vectors had to be found [8]. These vectors were found using the arithmetic mean of all the training data and the arithmetic mean of the training data from the respective class of sex or dialect. For example, if the vector corresponding to the characteristics of all male speakers were to be found, the arithmetic mean of all speakers would be subtracted from the arithmetic mean of all male speakers. This vector c_male would then be subtracted from the data belonging to all male speakers:

    c_male = ( Σ_{y_j ∈ S_male} y_j ) / ( Σ_{y_j ∈ S_male} 1 ) − (1/N) Σ_{i=1}^{N} y_i    (3.2)

where S_male is the set of all male speakers and N is the number of speakers in the training set. The same method was applied to get the different dialect vectors and the female vector. Another method of compensating for common characteristics such as dialect or sex would be to have a separate codebook for each sex and dialect region. This was also examined by having different codebooks for females and males.
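As a sketch, the Mahalanobis distance (3.1) and the bias-vector subtraction (3.2) can be written as below; feats, classes and the function names are hypothetical, and the same helper serves for both the sex classes and the dialect classes.

    import numpy as np

    def mahalanobis(y, c, inv_cov):
        # (y - c) Σ^{-1} (y - c)^T with Σ the training-set covariance, cf. (3.1).
        d = y - c
        return d @ inv_cov @ d

    def bias_vector(feats, mask):
        # Class mean minus the global mean, cf. (3.2).
        return feats[mask].mean(axis=0) - feats.mean(axis=0)

    def subtract_class_bias(feats, classes):
        # Remove the additive bias of each class (sex or dialect region).
        out = feats.copy()
        for c in np.unique(classes):
            out[classes == c] -= bias_vector(feats, classes == c)
        return out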

3.5 MDL principle

The MDL principle was first implemented as described in section 2.7. To be able to use the MDL principle, a GMM had to be constructed from the trained VQ. The Gaussian mixture distribution will be

    f(x_i | θ) = Σ_{k=1}^{K} p_k N(x_i | µ_k, Σ_k)    (3.3)

as described in section 2.7. The probability p_k that a data point belongs to a certain class was found by dividing the number of data points assigned by the VQ to class k by the total number of data points. The mean µ_k was the mean value of the data belonging to class k. The covariance matrix Σ_k is calculated by

    Σ_k = (1/(n − 1)) Σ_{i=1}^{n} (y_i − µ_k)(y_i − µ_k)^t    (3.4)

where n is the number of data points y_i that belong to class k. Since the MDL principle was originally constructed for use with the EM algorithm [21], two additional adjustments were tried in addition to the original method described in section 2.7. The first adjustment was that the probability that a data point belongs to a certain cluster was set equal for all clusters, p_k = 1/K, and the penalizing term was set to m = K(D + D(D + 1)/2), as the number of free parameters was reduced. The classification of the training data into clusters will then be the same as nearest neighbor classification with the Mahalanobis distance measure, which is the method used in this thesis. The only difference between classification with the GMM and classification with a VQ using the Mahalanobis distance measure will then be the Σ_k^{−1} in the Gaussian distribution, where Σ_k^{−1} is the inverted covariance matrix for each cluster k. Finally, a second adjustment was made, setting Σ_k^{−1} = Σ_Global^{−1}, i.e. the inverted global covariance matrix for the whole training sequence. The classification will then be the same as nearest neighbor classification with the Mahalanobis distance measure. The penalizing term was now set to m = KD. The reason for making these changes to the algorithm is that the number of free parameters in the model should match the number of free parameters in the VQ. The three different versions of the MDL principle resulted in three different MDL plots.

3.6 The Corpus

In order to test the channel compensation technique previously described (see Chapter 2), that is, to use the cluster centroid as a channel estimate and subtract it from the speech, a training sequence and test data were necessary. Both the training sequence and the test data were chosen from HTIMIT, an American speech database. The data used was in the form of two different sentences, SA1 and SA2, spoken through ten different channels. The complete database consisted of 192 male and 192 female speakers. The different channels included electret and carbon-button microphones, and also cordless transmission. HTIMIT also provided information about the sex and dialect region of the speaker. The division between training and test data provided in HTIMIT was used, i.e. the training set consisted of 278 speakers and the test set consisted of 106 speakers. The total number of training data was therefore 5560 sentences, since each speaker spoke two different sentences that were sent through all ten channels. The total number of test data was 2120, by the same reasoning.

    Transducer  Description
    senh        Sennheiser head-mounted microphone
    pt1         Sony portable (cordless) telephone
    el1         Northern Telecom Unity electret (3-line grill)
    el2         Northern Telecom Unity Noisy-Environment electret (2-line grill)
    el3         Unknown-manufacture electret (64-hole grill)
    el4         Radio Shack Chronophone-255 electret telephone
    cb1         Northern Telecom G-type carbon-button (center-hole membrane transducer)
    cb2         Northern Telecom G-type carbon-button (6-hole metal transducer)
    cb3         Northern Telecom G-type carbon-button (6-hole membrane transducer)
    cb4         ITT carbon-button (6-hole membrane/attached transducer)

    Table 3.1: The different microphones in HTIMIT


Chapter 4 Results

To measure how correct the different VQs' codebooks were, test data was used. The test data was labeled with the channel, so it could be seen whether the VQ classified the test data to the correct channel. The sample set consisted of ten different channels. The division of test and training data provided in HTIMIT was used. When the codebooks resulted from training, the LBG algorithm and the Mahalanobis distance measure were used. Due to rounding, the columns in the tables will not always sum to exactly 100%.

4.1 Ideal codebook

To get some results to compare with, an ideal codebook was created with the code vectors set as the mean of the training data corresponding to each channel. That is, the mean of the training vectors belonging to channel cb1 became one of the code vectors, and so on. In some sense this would be the result of a perfect training of a VQ, since all the vectors clustered together would come from the same channel. This would show how much the clusters overlap and whether some of the channels are difficult to separate.

    [Table 4.1: Results from a VQ classifying the test data. The training was performed by taking the mean of each channel's training data. Correctly classified data from each channel are in bold style.]

The VQ with ideal code vectors managed to classify 89% of the test set to the correct channel. The results for each channel can be seen in Table 4.1.
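The per-channel percentages reported in the tables of this chapter are columns of a confusion matrix. A sketch of how such a table can be derived from the labeled test data (array names hypothetical):

    import numpy as np

    def confusion_percent(true_channels, predicted, channels):
        # Column j: how the test data from channel j was classified, in percent.
        table = np.zeros((len(channels), len(channels)))
        for j, true_ch in enumerate(channels):
            mask = true_channels == true_ch
            for i, pred_ch in enumerate(channels):
                table[i, j] = 100.0 * np.mean(predicted[mask] == pred_ch)
        return table

The overall accuracy quoted in the text is simply np.mean(predicted == true_channels).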

4.2 A trained VQ

It now remained to see how well a trained VQ would perform compared to the VQ with the ideal codebook. To see this, a VQ was trained using the LBG algorithm.

    [Table 4.2: Results from a VQ classifying the test data. The training was performed by the LBG algorithm. Correctly classified data from each channel are in bold style.]

The LBG-trained VQ performed worse than the VQ with the ideal codebook. It classified 55% of the test data to the correct channel. The results for each channel for the LBG-trained VQ can be seen in Table 4.2.

4.3 Speech detection

A speech detector was used in order to produce speech files with no silence in them. The speech files were then feature extracted as previously described (see section 2.2). To compare the speech-detected channel estimates with the normal ones, both an ideal speech-detection codebook (as in section 4.1) and a clustered codebook (as in section 4.2) were created.

    [Table 4.3: Results from a VQ classifying the test data. The speech files were processed by a speech detector. The training was performed by taking the mean of each channel's training data. Correctly classified data from each channel are in bold style.]

The VQ with the ideal codebook based on speech files processed by a speech detector managed to classify 82% of the test data correctly. The results for each channel can be seen in Table 4.3.

The LBG-trained VQ managed to classify 60% of the speech-detected test data correctly. The results for each channel can be seen in Table 4.4.

    [Table 4.4: Results from a VQ classifying the test data. The training was performed by the LBG algorithm. All data was processed by a speech detector. Correctly classified data from each channel are in bold style.]

4.4 Separate codebooks for male and female

The training set was separated into two sets, one for female speakers and one for male speakers. Then two different codebooks were created by training a VQ with the LBG algorithm, as in section 4.2, for each training set. The classification was then made separately for male and female, i.e. the female test data was tested on the female codebook, and vice versa. In order for this method to work, the speaker's sex somehow has to be detected. This was assumed to be possible.

    [Table 4.5: Results from a trained VQ classifying test data separately for each sex. The training was performed by the LBG algorithm on separate training sets for each sex. Correctly classified data from each channel are in bold style.]

As can be seen, this resulted in much better performance than when the training set was of mixed sexes. The trained VQ managed to classify 80% of the test data correctly. The results for each channel can be seen in Table 4.5.

4.5 Compensation for the bias of sex and dialect

The common characteristics of the speakers, such as sex and dialect region, can be seen as additive biases, as described in section 3.4. When these biases were subtracted before the clustering, the result was more separated clusters. Previous results show that it is possible to classify sex or dialect region to some extent [22, 4].

    [Table 4.6: Results from a trained VQ classifying test data after the bias vector for sex has been subtracted. Correctly classified data from each channel are in bold style.]

As can be seen, the result increased from 55% of the test data clustered correctly for non-compensated clustering (see section 4.2) to 81% when the bias for sex was subtracted. An interesting result was also that this method performed slightly better than having separate codebooks. The results for each channel can be seen in Table 4.6.

    [Table 4.7: Results from a VQ classifying test data after the bias vector for dialect has been subtracted. Correctly classified data from each channel are in bold style.]

When the bias for dialect was subtracted from the training and test data, the VQ seemed to perform no better than the original VQ in section 4.2. It classified 55% of the test data correctly, exactly the same number as the VQ in section 4.2. There were differences for the individual channels though, as can be seen if Table 4.7 is compared to Table 4.2.

4.6 Determining the number of different channels

In order to determine the number of channels in an unlabeled training set, some algorithm has to be used. In this thesis the choice fell upon the MDL principle (see section 2.7). Since it was known that HTIMIT consisted of ten different channels, it was interesting to see whether the MDL algorithm would find its optimal solution at ten channels. Initially p_k was found by dividing the number of data points assigned by the VQ to class k by the total number of data points. The penalizing term was initially set to m = K(D + D(D + 1)/2) as described in section 2.7. The resulting plot of the MDL value showed that the MDL found its optimal solution at 9 channels (see Figure 4.1).

    [Figure 4.1: The MDL value for 1-14 channels with p_k calculated by the VQ and m = K(D + D(D + 1)/2)]

With p_k = 1/K and m = K(D + D(D + 1)/2), the MDL value instead peaked at 8 channels (see Figure 4.2). The covariance matrix Σ_k was finally set to Σ_Global, along with p_k = 1/K and m = KD. This resulted in a peak at 9 channels (see Figure 4.3). It should be noted that for the last two MDL plots the log-likelihood function L is not strictly increasing.

    [Figure 4.2: The MDL value for 1-14 channels with p_k = 1/K and m = K(D + D(D + 1)/2)]

    [Figure 4.3: The MDL value for 1-14 channels with p_k = 1/K, a global covariance matrix and m = KD]

Chapter 5 Discussion

The idea of this thesis work was to see if different channels could be separated using unsupervised clustering. The number of clusters in a data set was also examined using the MDL principle. The results presented in Chapter 4 are analyzed here, and the reasons for the success or failure of the different experiments are discussed.

5.1 Performance of the Vector Quantizer

The performance of the VQ seems to depend heavily on how much of the different biases, besides the channel, can be subtracted from the speech before clustering. When compared to the ideal VQ (see Table 4.1), the trained VQ performed rather poorly (see Table 4.2). The difference in performance is from 89% of the test data classified correctly for the ideal VQ to 55% for the trained VQ. This is probably due to the fact that the training data is not naturally structured as separate clusters, but rather as overlapping PDFs. The data from the different channels seemed to overlap, and therefore the cluster centroids could be placed far from the ideal ones by the VQ. If the VQ with the optimal codebook is studied (see Table 4.1), it can be seen that the two channels most difficult to separate from the others are the two first carbon-button microphones, cb1 and cb2. This could be due to the non-linear distortion known to exist in carbon-button microphones [10, 6], so that the data from those two channels are spread over a large volume of the MFCC space. It could also be because the two microphones are similar, so that their respective cluster centroids lie close to each other. The trained VQ had the same difficulties as the ideal one in separating cb1 from cb2, but cb3, pt1 and el3 also proved difficult to separate, especially cb1 and el3, which it practically failed completely to classify correctly. This is probably because the code vectors for those two channels were far from the corresponding ideal ones. This shows one of the difficulties of unsupervised clustering when the data is not structured in non-overlapping clusters. The speech files were processed by a speech detector to remove silence in them. This proved to make little difference for the VQ. The separation of cb1 and cb2 worked better, but the overall result was only slightly better than when the original data was used. One of the things that influenced the performance of the VQ was whether the bias for sex could be subtracted. When this was done, it resulted in a large improvement in the performance of the VQ. This improvement did not seem to depend on whether the bias was subtracted before clustering or the method with separate codebooks was chosen (see section 3.4). The removal of the dialect bias did not have any significant influence on the performance of the VQ. The two most difficult channels for the VQ to estimate were without doubt cb1 and el3.

The reason that cb1 was difficult was probably that it was not clearly separated from cb2. El3 seemed to be evenly classified among the neighboring channels, which indicates that it was a difficult cluster for the VQ to find. Since HTIMIT contained the same number of data from each of the 10 channels, one way of improving the performance might be to set the condition that all clusters must contain the same number of data during the training of the VQ. This was however not tried, since it would use information that would not be available in most cases. It would not be a strictly unsupervised clustering method, since that condition would be a kind of supervision.

5.2 Determining the number of clusters with the MDL principle

The MDL principle found its maximum value at 9 channels instead of the 10 actual channels. The reason for this is that the increase in the log-likelihood function L is so small between nine and ten channels. This could be because one of the channels is difficult to distinguish from another (for example cb1 from cb2), or perhaps because one of the channels is hard for the VQ to find, since it does not form a clearly separated cluster in the MFCC space (for example el3). When the MDL principle was modified to match the number of free parameters in the VQ, the log-likelihood function L did not increase strictly, as it would have if the EM algorithm had been used. This explains the rather strange MDL plots (see Figures 4.2, 4.3). An interesting phenomenon is that the MDL value decreases when the number of channels increases from 3 to 4.

Chapter 6 Conclusions

It seems to be possible to cluster channels with a VQ. In order to get a good channel estimate from the clustering, it is however necessary to subtract as much bias from the speech as possible before the training. Especially the subtraction of the bias from the sex of the speaker seems to make the training of the VQ much easier. The bias from the dialect region did not seem to be of as much importance. Silence in the speech did not seem to have much influence on the performance of the VQ. What then about the channels that get falsely classified by the VQ? Since test data that gets classified to the wrong channel usually gets classified to a neighboring cluster, the misclassification is usually no catastrophe. As long as the cluster centroid that the data gets classified to is not a large distance from the actual channel in the MFCC space, it is no major problem. Compared to CMS, which subtracts the arithmetic mean as channel compensation, it can however result in worse channel compensation for some test data. The MDL principle seemed to find approximately the correct number of channels. When the MDL formula was changed to match the number of free parameters of the VQ, it showed some interesting results; however, it still found the maximum at 8 or 9 clusters.

6.1 Outlook

Possible further work would be to implement the channel compensation technique described here in some kind of speaker verification system. Then it would be possible to see if this alternative channel compensation method would perform better than CMS. It would be interesting to see if other clustering methods, perhaps GMM-based, would perform better. It would also be interesting to see how other methods [23] for determining the number of clusters would perform on the HTIMIT data set.


Bibliography

[1] A. Webb. Statistical Pattern Recognition. Newnes.
[2] S. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-28(4), August 1980.
[3] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn. Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10:19-41, 2000.
[4] H. Melin. Speaker verification in telecommunication. Department of Speech, Music and Hearing, KTH. Available from: melin/publications.html.
[5] Douglas A. Reynolds. Channel robust speaker verification via feature mapping. MIT Lincoln Laboratory, Lexington, MA, USA.
[6] T. F. Quatieri, D. A. Reynolds, and G. C. O'Leary. Estimation of handset nonlinearity with application to speaker recognition. IEEE Transactions on Speech and Audio Processing, 8(5), 2000.
[7] B. S. Atal. Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification. Journal of the Acoustical Society of America, 1974.
[8] H. Hermansky and N. Morgan. RASTA processing of speech. IEEE Transactions on Speech and Audio Processing, 2(4), October 1994.
[9] A. Sankar and Chin-Hui Lee. A maximum-likelihood approach to stochastic matching for robust speech recognition. IEEE Transactions on Speech and Audio Processing, 4(3), May 1996.
[10] D. A. Reynolds, M. A. Zissman, T. F. Quatieri, and G. C. O'Leary. The effects of telephone transmission degradations on speaker recognition performance. In Proc. of ICASSP95, 1995.
[11] M. J. F. Gales. "NICE" model-based compensation schemes for robust speech recognition. ESCA/NATO Tutorial and Research Workshop on Robust Speech Recognition for Unknown Communication Channels.
[12] A. Wakao, K. Takeda, and F. Itakura. Variability of Lombard effects under different noise conditions. In Proc. ICSLP 96, volume 4, Philadelphia, PA, 1996.
[13] Monson H. Hayes. Statistical Digital Signal Processing and Modeling. Wiley.


More information

A Low-Cost Robust Front-end for Embedded ASR System

A Low-Cost Robust Front-end for Embedded ASR System A Low-Cost Robust Front-end for Embedded ASR System Lihui Guo 1, Xin He 2, Yue Lu 1, and Yaxin Zhang 2 1 Department of Computer Science and Technology, East China Normal University, Shanghai 200062 2 Motorola

More information

A Generative Model Based Kernel for SVM Classification in Multimedia Applications

A Generative Model Based Kernel for SVM Classification in Multimedia Applications Appears in Neural Information Processing Systems, Vancouver, Canada, 2003. A Generative Model Based Kernel for SVM Classification in Multimedia Applications Pedro J. Moreno Purdy P. Ho Hewlett-Packard

More information

The Noisy Channel Model. Statistical NLP Spring Mel Freq. Cepstral Coefficients. Frame Extraction ... Lecture 10: Acoustic Models

The Noisy Channel Model. Statistical NLP Spring Mel Freq. Cepstral Coefficients. Frame Extraction ... Lecture 10: Acoustic Models Statistical NLP Spring 2009 The Noisy Channel Model Lecture 10: Acoustic Models Dan Klein UC Berkeley Search through space of all possible sentences. Pick the one that is most probable given the waveform.

More information

Statistical NLP Spring The Noisy Channel Model

Statistical NLP Spring The Noisy Channel Model Statistical NLP Spring 2009 Lecture 10: Acoustic Models Dan Klein UC Berkeley The Noisy Channel Model Search through space of all possible sentences. Pick the one that is most probable given the waveform.

More information

Correspondence. Pulse Doppler Radar Target Recognition using a Two-Stage SVM Procedure

Correspondence. Pulse Doppler Radar Target Recognition using a Two-Stage SVM Procedure Correspondence Pulse Doppler Radar Target Recognition using a Two-Stage SVM Procedure It is possible to detect and classify moving and stationary targets using ground surveillance pulse-doppler radars

More information

Allpass Modeling of LP Residual for Speaker Recognition

Allpass Modeling of LP Residual for Speaker Recognition Allpass Modeling of LP Residual for Speaker Recognition K. Sri Rama Murty, Vivek Boominathan and Karthika Vijayan Department of Electrical Engineering, Indian Institute of Technology Hyderabad, India email:

More information

The Secrets of Quantization. Nimrod Peleg Update: Sept. 2009

The Secrets of Quantization. Nimrod Peleg Update: Sept. 2009 The Secrets of Quantization Nimrod Peleg Update: Sept. 2009 What is Quantization Representation of a large set of elements with a much smaller set is called quantization. The number of elements in the

More information

Model-based unsupervised segmentation of birdcalls from field recordings

Model-based unsupervised segmentation of birdcalls from field recordings Model-based unsupervised segmentation of birdcalls from field recordings Anshul Thakur School of Computing and Electrical Engineering Indian Institute of Technology Mandi Himachal Pradesh, India Email:

More information

Estimation of Cepstral Coefficients for Robust Speech Recognition

Estimation of Cepstral Coefficients for Robust Speech Recognition Estimation of Cepstral Coefficients for Robust Speech Recognition by Kevin M. Indrebo, B.S., M.S. A Dissertation submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment

More information

ECE 661: Homework 10 Fall 2014

ECE 661: Homework 10 Fall 2014 ECE 661: Homework 10 Fall 2014 This homework consists of the following two parts: (1) Face recognition with PCA and LDA for dimensionality reduction and the nearest-neighborhood rule for classification;

More information

Forecasting Wind Ramps

Forecasting Wind Ramps Forecasting Wind Ramps Erin Summers and Anand Subramanian Jan 5, 20 Introduction The recent increase in the number of wind power producers has necessitated changes in the methods power system operators

More information

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore

More information

Pattern Classification

Pattern Classification Pattern Classification Introduction Parametric classifiers Semi-parametric classifiers Dimensionality reduction Significance testing 6345 Automatic Speech Recognition Semi-Parametric Classifiers 1 Semi-Parametric

More information

Spectral and Textural Feature-Based System for Automatic Detection of Fricatives and Affricates

Spectral and Textural Feature-Based System for Automatic Detection of Fricatives and Affricates Spectral and Textural Feature-Based System for Automatic Detection of Fricatives and Affricates Dima Ruinskiy Niv Dadush Yizhar Lavner Department of Computer Science, Tel-Hai College, Israel Outline Phoneme

More information

Harmonic Structure Transform for Speaker Recognition

Harmonic Structure Transform for Speaker Recognition Harmonic Structure Transform for Speaker Recognition Kornel Laskowski & Qin Jin Carnegie Mellon University, Pittsburgh PA, USA KTH Speech Music & Hearing, Stockholm, Sweden 29 August, 2011 Laskowski &

More information

speaker recognition using gmm-ubm semester project presentation

speaker recognition using gmm-ubm semester project presentation speaker recognition using gmm-ubm semester project presentation OBJECTIVES OF THE PROJECT study the GMM-UBM speaker recognition system implement this system with matlab document the code and how it interfaces

More information

Cepstral Deconvolution Method for Measurement of Absorption and Scattering Coefficients of Materials

Cepstral Deconvolution Method for Measurement of Absorption and Scattering Coefficients of Materials Cepstral Deconvolution Method for Measurement of Absorption and Scattering Coefficients of Materials Mehmet ÇALIŞKAN a) Middle East Technical University, Department of Mechanical Engineering, Ankara, 06800,

More information

The effect of speaking rate and vowel context on the perception of consonants. in babble noise

The effect of speaking rate and vowel context on the perception of consonants. in babble noise The effect of speaking rate and vowel context on the perception of consonants in babble noise Anirudh Raju Department of Electrical Engineering, University of California, Los Angeles, California, USA anirudh90@ucla.edu

More information

An Investigation of Spectral Subband Centroids for Speaker Authentication

An Investigation of Spectral Subband Centroids for Speaker Authentication R E S E A R C H R E P O R T I D I A P An Investigation of Spectral Subband Centroids for Speaker Authentication Norman Poh Hoon Thian a Conrad Sanderson a Samy Bengio a IDIAP RR 3-62 December 3 published

More information

Feature extraction 1

Feature extraction 1 Centre for Vision Speech & Signal Processing University of Surrey, Guildford GU2 7XH. Feature extraction 1 Dr Philip Jackson Cepstral analysis - Real & complex cepstra - Homomorphic decomposition Filter

More information

Modeling Prosody for Speaker Recognition: Why Estimating Pitch May Be a Red Herring

Modeling Prosody for Speaker Recognition: Why Estimating Pitch May Be a Red Herring Modeling Prosody for Speaker Recognition: Why Estimating Pitch May Be a Red Herring Kornel Laskowski & Qin Jin Carnegie Mellon University Pittsburgh PA, USA 28 June, 2010 Laskowski & Jin ODYSSEY 2010,

More information

FEATURE SELECTION USING FISHER S RATIO TECHNIQUE FOR AUTOMATIC SPEECH RECOGNITION

FEATURE SELECTION USING FISHER S RATIO TECHNIQUE FOR AUTOMATIC SPEECH RECOGNITION FEATURE SELECTION USING FISHER S RATIO TECHNIQUE FOR AUTOMATIC SPEECH RECOGNITION Sarika Hegde 1, K. K. Achary 2 and Surendra Shetty 3 1 Department of Computer Applications, NMAM.I.T., Nitte, Karkala Taluk,

More information

Detection of Overlapping Acoustic Events Based on NMF with Shared Basis Vectors

Detection of Overlapping Acoustic Events Based on NMF with Shared Basis Vectors Detection of Overlapping Acoustic Events Based on NMF with Shared Basis Vectors Kazumasa Yamamoto Department of Computer Science Chubu University Kasugai, Aichi, Japan Email: yamamoto@cs.chubu.ac.jp Chikara

More information

Session Variability Compensation in Automatic Speaker Recognition

Session Variability Compensation in Automatic Speaker Recognition Session Variability Compensation in Automatic Speaker Recognition Javier González Domínguez VII Jornadas MAVIR Universidad Autónoma de Madrid November 2012 Outline 1. The Inter-session Variability Problem

More information

Unsupervised Learning: K- Means & PCA

Unsupervised Learning: K- Means & PCA Unsupervised Learning: K- Means & PCA Unsupervised Learning Supervised learning used labeled data pairs (x, y) to learn a func>on f : X Y But, what if we don t have labels? No labels = unsupervised learning

More information

Lecture 5: GMM Acoustic Modeling and Feature Extraction

Lecture 5: GMM Acoustic Modeling and Feature Extraction CS 224S / LINGUIST 285 Spoken Language Processing Andrew Maas Stanford University Spring 2017 Lecture 5: GMM Acoustic Modeling and Feature Extraction Original slides by Dan Jurafsky Outline for Today Acoustic

More information

The Noisy Channel Model. CS 294-5: Statistical Natural Language Processing. Speech Recognition Architecture. Digitizing Speech

The Noisy Channel Model. CS 294-5: Statistical Natural Language Processing. Speech Recognition Architecture. Digitizing Speech CS 294-5: Statistical Natural Language Processing The Noisy Channel Model Speech Recognition II Lecture 21: 11/29/05 Search through space of all possible sentences. Pick the one that is most probable given

More information

On Compression Encrypted Data part 2. Prof. Ja-Ling Wu The Graduate Institute of Networking and Multimedia National Taiwan University

On Compression Encrypted Data part 2. Prof. Ja-Ling Wu The Graduate Institute of Networking and Multimedia National Taiwan University On Compression Encrypted Data part 2 Prof. Ja-Ling Wu The Graduate Institute of Networking and Multimedia National Taiwan University 1 Brief Summary of Information-theoretic Prescription At a functional

More information

Frog Sound Identification System for Frog Species Recognition

Frog Sound Identification System for Frog Species Recognition Frog Sound Identification System for Frog Species Recognition Clifford Loh Ting Yuan and Dzati Athiar Ramli Intelligent Biometric Research Group (IBG), School of Electrical and Electronic Engineering,

More information

REAL-TIME TIME-FREQUENCY BASED BLIND SOURCE SEPARATION. Scott Rickard, Radu Balan, Justinian Rosca. Siemens Corporate Research Princeton, NJ 08540

REAL-TIME TIME-FREQUENCY BASED BLIND SOURCE SEPARATION. Scott Rickard, Radu Balan, Justinian Rosca. Siemens Corporate Research Princeton, NJ 08540 REAL-TIME TIME-FREQUENCY BASED BLIND SOURCE SEPARATION Scott Rickard, Radu Balan, Justinian Rosca Siemens Corporate Research Princeton, NJ 84 fscott.rickard,radu.balan,justinian.roscag@scr.siemens.com

More information

Maximum Likelihood Estimation. only training data is available to design a classifier

Maximum Likelihood Estimation. only training data is available to design a classifier Introduction to Pattern Recognition [ Part 5 ] Mahdi Vasighi Introduction Bayesian Decision Theory shows that we could design an optimal classifier if we knew: P( i ) : priors p(x i ) : class-conditional

More information

SPEECH ENHANCEMENT USING PCA AND VARIANCE OF THE RECONSTRUCTION ERROR IN DISTRIBUTED SPEECH RECOGNITION

SPEECH ENHANCEMENT USING PCA AND VARIANCE OF THE RECONSTRUCTION ERROR IN DISTRIBUTED SPEECH RECOGNITION SPEECH ENHANCEMENT USING PCA AND VARIANCE OF THE RECONSTRUCTION ERROR IN DISTRIBUTED SPEECH RECOGNITION Amin Haji Abolhassani 1, Sid-Ahmed Selouani 2, Douglas O Shaughnessy 1 1 INRS-Energie-Matériaux-Télécommunications,

More information

HARMONIC VECTOR QUANTIZATION

HARMONIC VECTOR QUANTIZATION HARMONIC VECTOR QUANTIZATION Volodya Grancharov, Sigurdur Sverrisson, Erik Norvell, Tomas Toftgård, Jonas Svedberg, and Harald Pobloth SMN, Ericsson Research, Ericsson AB 64 8, Stockholm, Sweden ABSTRACT

More information

Independent Component Analysis and Unsupervised Learning

Independent Component Analysis and Unsupervised Learning Independent Component Analysis and Unsupervised Learning Jen-Tzung Chien National Cheng Kung University TABLE OF CONTENTS 1. Independent Component Analysis 2. Case Study I: Speech Recognition Independent

More information

Eigenvoice Speaker Adaptation via Composite Kernel PCA

Eigenvoice Speaker Adaptation via Composite Kernel PCA Eigenvoice Speaker Adaptation via Composite Kernel PCA James T. Kwok, Brian Mak and Simon Ho Department of Computer Science Hong Kong University of Science and Technology Clear Water Bay, Hong Kong [jamesk,mak,csho]@cs.ust.hk

More information

Speech Enhancement with Applications in Speech Recognition

Speech Enhancement with Applications in Speech Recognition Speech Enhancement with Applications in Speech Recognition A First Year Report Submitted to the School of Computer Engineering of the Nanyang Technological University by Xiao Xiong for the Confirmation

More information

Independent Component Analysis and Unsupervised Learning. Jen-Tzung Chien

Independent Component Analysis and Unsupervised Learning. Jen-Tzung Chien Independent Component Analysis and Unsupervised Learning Jen-Tzung Chien TABLE OF CONTENTS 1. Independent Component Analysis 2. Case Study I: Speech Recognition Independent voices Nonparametric likelihood

More information

EEL 851: Biometrics. An Overview of Statistical Pattern Recognition EEL 851 1

EEL 851: Biometrics. An Overview of Statistical Pattern Recognition EEL 851 1 EEL 851: Biometrics An Overview of Statistical Pattern Recognition EEL 851 1 Outline Introduction Pattern Feature Noise Example Problem Analysis Segmentation Feature Extraction Classification Design Cycle

More information

CEPSTRAL ANALYSIS SYNTHESIS ON THE MEL FREQUENCY SCALE, AND AN ADAPTATIVE ALGORITHM FOR IT.

CEPSTRAL ANALYSIS SYNTHESIS ON THE MEL FREQUENCY SCALE, AND AN ADAPTATIVE ALGORITHM FOR IT. CEPSTRAL ANALYSIS SYNTHESIS ON THE EL FREQUENCY SCALE, AND AN ADAPTATIVE ALGORITH FOR IT. Summarized overview of the IEEE-publicated papers Cepstral analysis synthesis on the mel frequency scale by Satochi

More information

Vector Quantization Encoder Decoder Original Form image Minimize distortion Table Channel Image Vectors Look-up (X, X i ) X may be a block of l

Vector Quantization Encoder Decoder Original Form image Minimize distortion Table Channel Image Vectors Look-up (X, X i ) X may be a block of l Vector Quantization Encoder Decoder Original Image Form image Vectors X Minimize distortion k k Table X^ k Channel d(x, X^ Look-up i ) X may be a block of l m image or X=( r, g, b ), or a block of DCT

More information

Exemplar-based voice conversion using non-negative spectrogram deconvolution

Exemplar-based voice conversion using non-negative spectrogram deconvolution Exemplar-based voice conversion using non-negative spectrogram deconvolution Zhizheng Wu 1, Tuomas Virtanen 2, Tomi Kinnunen 3, Eng Siong Chng 1, Haizhou Li 1,4 1 Nanyang Technological University, Singapore

More information

Gaussian Mixture Model Uncertainty Learning (GMMUL) Version 1.0 User Guide

Gaussian Mixture Model Uncertainty Learning (GMMUL) Version 1.0 User Guide Gaussian Mixture Model Uncertainty Learning (GMMUL) Version 1. User Guide Alexey Ozerov 1, Mathieu Lagrange and Emmanuel Vincent 1 1 INRIA, Centre de Rennes - Bretagne Atlantique Campus de Beaulieu, 3

More information

An Evolutionary Programming Based Algorithm for HMM training

An Evolutionary Programming Based Algorithm for HMM training An Evolutionary Programming Based Algorithm for HMM training Ewa Figielska,Wlodzimierz Kasprzak Institute of Control and Computation Engineering, Warsaw University of Technology ul. Nowowiejska 15/19,

More information

Front-End Factor Analysis For Speaker Verification

Front-End Factor Analysis For Speaker Verification IEEE TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING Front-End Factor Analysis For Speaker Verification Najim Dehak, Patrick Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, Abstract This

More information

SEPARATION OF ACOUSTIC SIGNALS USING SELF-ORGANIZING NEURAL NETWORKS. Temujin Gautama & Marc M. Van Hulle

SEPARATION OF ACOUSTIC SIGNALS USING SELF-ORGANIZING NEURAL NETWORKS. Temujin Gautama & Marc M. Van Hulle SEPARATION OF ACOUSTIC SIGNALS USING SELF-ORGANIZING NEURAL NETWORKS Temujin Gautama & Marc M. Van Hulle K.U.Leuven, Laboratorium voor Neuro- en Psychofysiologie Campus Gasthuisberg, Herestraat 49, B-3000

More information

Noise Compensation for Subspace Gaussian Mixture Models

Noise Compensation for Subspace Gaussian Mixture Models Noise ompensation for ubspace Gaussian Mixture Models Liang Lu University of Edinburgh Joint work with KK hin, A. Ghoshal and. enals Liang Lu, Interspeech, eptember, 2012 Outline Motivation ubspace GMM

More information

Vector Quantization and Subband Coding

Vector Quantization and Subband Coding Vector Quantization and Subband Coding 18-796 ultimedia Communications: Coding, Systems, and Networking Prof. Tsuhan Chen tsuhan@ece.cmu.edu Vector Quantization 1 Vector Quantization (VQ) Each image block

More information

ON THE USE OF MLP-DISTANCE TO ESTIMATE POSTERIOR PROBABILITIES BY KNN FOR SPEECH RECOGNITION

ON THE USE OF MLP-DISTANCE TO ESTIMATE POSTERIOR PROBABILITIES BY KNN FOR SPEECH RECOGNITION Zaragoza Del 8 al 1 de Noviembre de 26 ON THE USE OF MLP-DISTANCE TO ESTIMATE POSTERIOR PROBABILITIES BY KNN FOR SPEECH RECOGNITION Ana I. García Moral, Carmen Peláez Moreno EPS-Universidad Carlos III

More information

Text Independent Speaker Identification Using Imfcc Integrated With Ica

Text Independent Speaker Identification Using Imfcc Integrated With Ica IOSR Journal of Electronics and Communication Engineering (IOSR-JECE) e-issn: 2278-2834,p- ISSN: 2278-8735. Volume 7, Issue 5 (Sep. - Oct. 2013), PP 22-27 ext Independent Speaker Identification Using Imfcc

More information

Data Preprocessing. Cluster Similarity

Data Preprocessing. Cluster Similarity 1 Cluster Similarity Similarity is most often measured with the help of a distance function. The smaller the distance, the more similar the data objects (points). A function d: M M R is a distance on M

More information

3. ESTIMATION OF SIGNALS USING A LEAST SQUARES TECHNIQUE

3. ESTIMATION OF SIGNALS USING A LEAST SQUARES TECHNIQUE 3. ESTIMATION OF SIGNALS USING A LEAST SQUARES TECHNIQUE 3.0 INTRODUCTION The purpose of this chapter is to introduce estimators shortly. More elaborated courses on System Identification, which are given

More information

Approximating the Covariance Matrix with Low-rank Perturbations

Approximating the Covariance Matrix with Low-rank Perturbations Approximating the Covariance Matrix with Low-rank Perturbations Malik Magdon-Ismail and Jonathan T. Purnell Department of Computer Science Rensselaer Polytechnic Institute Troy, NY 12180 {magdon,purnej}@cs.rpi.edu

More information

A SPECTRAL SUBTRACTION RULE FOR REAL-TIME DSP IMPLEMENTATION OF NOISE REDUCTION IN SPEECH SIGNALS

A SPECTRAL SUBTRACTION RULE FOR REAL-TIME DSP IMPLEMENTATION OF NOISE REDUCTION IN SPEECH SIGNALS Proc. of the 1 th Int. Conference on Digital Audio Effects (DAFx-9), Como, Italy, September 1-4, 9 A SPECTRAL SUBTRACTION RULE FOR REAL-TIME DSP IMPLEMENTATION OF NOISE REDUCTION IN SPEECH SIGNALS Matteo

More information

A comparative study of time-delay estimation techniques for convolutive speech mixtures

A comparative study of time-delay estimation techniques for convolutive speech mixtures A comparative study of time-delay estimation techniques for convolutive speech mixtures COSME LLERENA AGUILAR University of Alcala Signal Theory and Communications 28805 Alcalá de Henares SPAIN cosme.llerena@uah.es

More information

ESTIMATION OF RELATIVE TRANSFER FUNCTION IN THE PRESENCE OF STATIONARY NOISE BASED ON SEGMENTAL POWER SPECTRAL DENSITY MATRIX SUBTRACTION

ESTIMATION OF RELATIVE TRANSFER FUNCTION IN THE PRESENCE OF STATIONARY NOISE BASED ON SEGMENTAL POWER SPECTRAL DENSITY MATRIX SUBTRACTION ESTIMATION OF RELATIVE TRANSFER FUNCTION IN THE PRESENCE OF STATIONARY NOISE BASED ON SEGMENTAL POWER SPECTRAL DENSITY MATRIX SUBTRACTION Xiaofei Li 1, Laurent Girin 1,, Radu Horaud 1 1 INRIA Grenoble

More information

The Comparison of Vector Quantization Algoritms in Fish Species Acoustic Voice Recognition Using Hidden Markov Model

The Comparison of Vector Quantization Algoritms in Fish Species Acoustic Voice Recognition Using Hidden Markov Model The Comparison Vector Quantization Algoritms in Fish Species Acoustic Voice Recognition Using Hidden Markov Model Diponegoro A.D 1). and Fawwaz Al Maki. W 1) 1) Department Electrical Enginering, University

More information

Improving the Effectiveness of Speaker Verification Domain Adaptation With Inadequate In-Domain Data

Improving the Effectiveness of Speaker Verification Domain Adaptation With Inadequate In-Domain Data Distribution A: Public Release Improving the Effectiveness of Speaker Verification Domain Adaptation With Inadequate In-Domain Data Bengt J. Borgström Elliot Singer Douglas Reynolds and Omid Sadjadi 2

More information

A Generalized Subspace Approach for Enhancing Speech Corrupted by Colored Noise

A Generalized Subspace Approach for Enhancing Speech Corrupted by Colored Noise 334 IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL 11, NO 4, JULY 2003 A Generalized Subspace Approach for Enhancing Speech Corrupted by Colored Noise Yi Hu, Student Member, IEEE, and Philipos C

More information

SYMBOL RECOGNITION IN HANDWRITTEN MATHEMATI- CAL FORMULAS

SYMBOL RECOGNITION IN HANDWRITTEN MATHEMATI- CAL FORMULAS SYMBOL RECOGNITION IN HANDWRITTEN MATHEMATI- CAL FORMULAS Hans-Jürgen Winkler ABSTRACT In this paper an efficient on-line recognition system for handwritten mathematical formulas is proposed. After formula

More information

Design Criteria for the Quadratically Interpolated FFT Method (I): Bias due to Interpolation

Design Criteria for the Quadratically Interpolated FFT Method (I): Bias due to Interpolation CENTER FOR COMPUTER RESEARCH IN MUSIC AND ACOUSTICS DEPARTMENT OF MUSIC, STANFORD UNIVERSITY REPORT NO. STAN-M-4 Design Criteria for the Quadratically Interpolated FFT Method (I): Bias due to Interpolation

More information

PHONEME CLASSIFICATION OVER THE RECONSTRUCTED PHASE SPACE USING PRINCIPAL COMPONENT ANALYSIS

PHONEME CLASSIFICATION OVER THE RECONSTRUCTED PHASE SPACE USING PRINCIPAL COMPONENT ANALYSIS PHONEME CLASSIFICATION OVER THE RECONSTRUCTED PHASE SPACE USING PRINCIPAL COMPONENT ANALYSIS Jinjin Ye jinjin.ye@mu.edu Michael T. Johnson mike.johnson@mu.edu Richard J. Povinelli richard.povinelli@mu.edu

More information

LOW COMPLEXITY WIDEBAND LSF QUANTIZATION USING GMM OF UNCORRELATED GAUSSIAN MIXTURES

LOW COMPLEXITY WIDEBAND LSF QUANTIZATION USING GMM OF UNCORRELATED GAUSSIAN MIXTURES LOW COMPLEXITY WIDEBAND LSF QUANTIZATION USING GMM OF UNCORRELATED GAUSSIAN MIXTURES Saikat Chatterjee and T.V. Sreenivas Department of Electrical Communication Engineering Indian Institute of Science,

More information

ISOLATED WORD RECOGNITION FOR ENGLISH LANGUAGE USING LPC,VQ AND HMM

ISOLATED WORD RECOGNITION FOR ENGLISH LANGUAGE USING LPC,VQ AND HMM ISOLATED WORD RECOGNITION FOR ENGLISH LANGUAGE USING LPC,VQ AND HMM Mayukh Bhaowal and Kunal Chawla (Students)Indian Institute of Information Technology, Allahabad, India Abstract: Key words: Speech recognition

More information

Environmental Sound Classification in Realistic Situations

Environmental Sound Classification in Realistic Situations Environmental Sound Classification in Realistic Situations K. Haddad, W. Song Brüel & Kjær Sound and Vibration Measurement A/S, Skodsborgvej 307, 2850 Nærum, Denmark. X. Valero La Salle, Universistat Ramon

More information

Analysis of polyphonic audio using source-filter model and non-negative matrix factorization

Analysis of polyphonic audio using source-filter model and non-negative matrix factorization Analysis of polyphonic audio using source-filter model and non-negative matrix factorization Tuomas Virtanen and Anssi Klapuri Tampere University of Technology, Institute of Signal Processing Korkeakoulunkatu

More information