Proc. of NCC 2010, Chennai, India


Trajectory and surface modeling of LSF for low rate speech coding

M. Deepak and Preeti Rao
Department of Electrical Engineering, Indian Institute of Technology, Bombay, Powai, Mumbai

Abstract: In low rate speech coders, frame-based speech spectral parameters, represented by line spectral frequencies (LSF), are typically encoded by vector quantization without explicitly exploiting the temporal correlation between frames. Recently, however, methods have been proposed that model the temporal trajectory of each LSF over a speech segment by a low-order polynomial function. In the present work, we propose a method that exploits the joint spectro-temporal correlation across a speech segment. The LSF values are treated as points on a smooth surface obtained by fitting a low-order bivariate polynomial function of two variables, the frame index and the LSF index. The issues that arise in incorporating such joint modeling in a low bit rate coder are identified, and some comparative performance results are presented.

I. INTRODUCTION

Speech coders aim at reducing bit rate while maintaining good quality of reconstructed speech. Most low bit rate coders are model based: they represent speech by a fixed set of parameters per frame, typically at a frame rate of 50 frames per second. This limits the achievable compression, with parameter transmission needed at 20 ms intervals. Speech, however, is known to be relatively stationary over much longer intervals depending on the underlying phone class. This redundancy can be exploited by encoding larger segments of speech within which the speech parameters vary only slowly across frames, facilitating more efficient quantization. In the present work, we investigate methods for segment-level encoding of spectral parameters in the context of the Multiband Excitation (MBE) speech model.
The MBE model of speech is a powerful representation of the speech signal, providing compactness, perceptually superior quality and robustness to acoustic background noise [1]. Each 20 ms interval of speech is represented by a pitch, harmonic spectral amplitudes and voicing band decisions. The spectral amplitudes are further represented by a spectral envelope parameterized by LSFs. Efficient quantization of the LSFs is therefore crucial to achieving low bit rate coding while maintaining good reconstructed speech quality. The LSF vector (typically 10-dimensional) is generated in every frame interval (20 ms). However, the temporal evolution of the LSFs across frames is smooth, leading to inter-frame redundancy. Modeling the temporal evolution of the LSFs by trajectory functions can help exploit this inter-frame redundancy. Polynomial functions have been used to model the pitch, LSF and gain parameters of the 2400 bps standard MELPe coder to achieve a rate of 1533 bps with nearly the same speech quality as the standard coder [2]. In [3], discrete cosine transform coefficients are used to represent the trajectory of individual LSFs, and a bit rate reduction of nearly 50% is achieved. The above methods model the trajectory of each LSF independently. However, it is well known that frame-level spectral parameters exhibit significant joint spectro-temporal correlation. Such correlation exists for LSFs as well as for decorrelated spectral representations such as MFCCs when one considers time segments on the order of phone durations. The physiological basis for the spectro-temporal correlation is the nature of the articulatory dynamics. Joint modeling in the spectro-temporal domain could therefore contribute further coding efficiencies over the independent steps of frame-level VQ of spectral coefficients and coding of their temporal trajectories. This new idea is proposed and developed in the present work.
Joint spectro-temporal modeling can be viewed as modeling a surface with a low-order polynomial over the time-frequency index plane. In the case of the MBE coder, the time index is the frame number while the frequency index is the LSF index. The value of the LSF (in Hz) is the surface variable whose dependence on the two indices is to be modeled. One method of representing a smooth surface is by a low-order bivariate polynomial function. In this work, we explore modeling the joint variation of LSFs by bivariate polynomial fitting and compare the performance, in terms of quality of reconstructed speech, with that obtained with temporal trajectory modeling alone. The surface modeling is carried out over short homogeneous segments of speech. The organization of the paper is as follows. Automatic segmentation of the speech signal is presented in Section II. The trajectory and surface modeling of the frame-level LSFs over the speech segments by polynomial functions is presented in Section III. A variable bit rate coder

incorporating trajectory modeling of LSFs is presented in Section IV.

II. SEGMENTATION OF SPEECH

A segment is a collection of a fixed or variable number of frames. With variable-length segments within each of which the speech parameters vary smoothly, efficient segment-based quantization methods can exploit most of the redundancy present in speech. In the present work, we consider variable-length segmentation implemented as described below. For segmentation, the input speech is first divided into voiced and unvoiced sections based on the frame voicing decisions obtained from the MBE analysis. Within each voiced or unvoiced section, the Euclidean distance between adjacent frames' LSF vectors is computed, and a segment boundary is marked when this distance exceeds a fixed threshold. This two-step process is more effective in delineating homogeneous regions than direct Euclidean distance based segmentation. With this method, the maximum segment duration can become quite large, which creates problems for the quantization of parameters. The maximum segment size is therefore limited based on the following considerations. In vector quantization, the values of a speech parameter over the frames of a segment are treated as the elements of a vector; that vector is compared with the available codebook vectors and the index of the best-matched vector is transmitted to the decoder. A separate codebook is required for each speech parameter and each input vector dimension, so a large maximum segment size implies a large number of codebooks (large memory). Another important factor is the transmission of the segment size, since the decoder must select the proper codebook before fetching a parameter vector by its codebook index. Considering these two factors, the initial upper limit on segment size is fixed at eight (3 bits).
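As a concrete illustration, the two-step segmentation described above can be sketched as follows. This is an assumed reconstruction, not the authors' code: the function name `segment_lsfs` and the boundary bookkeeping are illustrative, and the paper's threshold of 0.2 presupposes a particular scaling of the LSF vectors.

```python
import numpy as np

def segment_lsfs(lsf_frames, voiced_flags, threshold=0.2, max_len=6):
    """Split per-frame LSF vectors into homogeneous segments (sketch).

    Step 1: cut at voiced/unvoiced transitions (MBE voicing decisions).
    Step 2: cut where the Euclidean distance between adjacent frames'
    LSF vectors exceeds `threshold`. Segments are capped at `max_len`
    frames. Returns half-open (start, end) frame-index pairs.
    """
    segments = []
    start = 0
    for i in range(1, len(lsf_frames)):
        dist = np.linalg.norm(lsf_frames[i] - lsf_frames[i - 1])
        if (voiced_flags[i] != voiced_flags[i - 1]   # voicing transition
                or dist > threshold                  # spectral change
                or i - start >= max_len):            # cap segment size
            segments.append((start, i))
            start = i
    segments.append((start, len(lsf_frames)))
    return segments
```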
The other important parameter in the segmentation process is the threshold value, which is empirically selected to trade off two desirable aspects: maximizing segment duration and avoiding over-segmentation. A good trade-off involves all three performance issues: bit rate achieved, speech quality and memory. The bits required for LSF quantization are a major share of the total bits and thus strongly impact the bit rate of the coder. Experiments are carried out on the training speech database (details provided at the end of the paper) to identify the best combination of maximum segment size and threshold, considering the LSF bit rate and the resulting speech quality (in spectral distortion, dB). Of course, the LSF bit rate depends on the quantization scheme. In the present work, we propose to model LSFs with trajectory functions before quantization is done. With such modeling, only a reduced set of LSF vectors need be quantized, reducing the bit rate. In the present work, a 50% reduction in bit rate is targeted with respect to frame-based quantization. Each combination of maximum segment size and threshold is tested on the training database and segments are extracted. The segments are modeled with trajectory functions, and spectral distortion is measured. For bit rate calculations, it is assumed that each LSF vector is quantized to 12 bits. The bit rate and speech quality for all combinations are measured. The performance of maximum segment sizes 6 and 8 is found to be close; considering the memory constraint, the maximum segment size is selected to be 6. A spectral distortion of 1 dB is generally considered acceptable for LSF quantization. Since quantization is not done at the segmentation stage, the threshold at which the spectral distortion with modeling is 0.5 to 0.6 dB is selected for maximum segment size 6. This gives a threshold of 0.2.
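Speech quality above is reported as spectral distortion in dB. Below is a minimal sketch of the standard RMS log-spectral distortion between two LPC envelopes; the paper does not specify its exact SD computation, so the dense FFT grid and the function name are assumptions.

```python
import numpy as np

def spectral_distortion_db(lpc_a, lpc_b, n_fft=256):
    """RMS log-spectral distortion in dB between the envelopes 1/|A(w)|
    of two LPC filters, evaluated on a dense uniform frequency grid.
    lpc_a, lpc_b: LPC polynomial coefficients [1, a1, ..., aM]."""
    def log_env_db(a):
        # A(e^{jw}) sampled at n_fft+1 points on [0, pi]
        A = np.fft.rfft(np.asarray(a, dtype=float), 2 * n_fft)
        return -20.0 * np.log10(np.abs(A))  # envelope gain in dB
    diff = log_env_db(lpc_a) - log_env_db(lpc_b)
    return float(np.sqrt(np.mean(diff ** 2)))
```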
With the above parameter settings, the division of segments in the training database is as follows: segments with 1 frame (32%), 2 frames (14%), 3 frames (9%), 4 frames (8%), 5 frames (6%), and 6 frames (32%). The modeling of the segment LSFs by polynomial functions is presented in the following section.

III. MODELING BY POLYNOMIAL FUNCTIONS

In this section, modeling by univariate polynomial functions is discussed first, followed by the extension to bivariate polynomial modeling.

A. Univariate polynomial functions

In modeling LSFs by univariate polynomial functions, the temporal trajectory of each LSF over a given segment is modeled by a low order polynomial function [2][3]. A polynomial function of order P is given by:

f_P(t) = a_P t^P + a_(P-1) t^(P-1) + ... + a_1 t + a_0 ;  (1)

where t represents the frame index. To find the coefficients of the polynomial fit on a segment of N frames, N equations are obtained and solved by least-squares optimization to obtain the coefficients {a_P, a_(P-1), ..., a_1, a_0}. For the given M LSFs, M unique polynomial functions are constructed, one representing each LSF trajectory. In order to achieve efficient coding, it is necessary to encode the trajectory in some way. While the polynomial coefficients themselves could be quantized, past experience with such an exercise is very limited, in contrast to the vast research results on quantization of LSFs. Therefore, quantization is carried out on LSFs obtained by sampling the trajectories rather than on the polynomial coefficients directly. On sampling each curve at P+1 points, P+1 feature vectors are derived. These feature vectors are encoded and transmitted to the decoder instead of the original N LSF vectors. Compression, or gain in bit rate, is obtained if P+1 < N. The gain in bit rate depends on the values of P and N. For a given segment of N frames, the value of P is selected according to the compression required.
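The fit-and-resample scheme of Equation (1) can be sketched with a least-squares polynomial fit. The names `encode_trajectory` and `decode_trajectory` are illustrative, and the choice of roughly equally spaced sample points including the segment endpoints anticipates the rule given later in this section.

```python
import numpy as np

def encode_trajectory(lsf_track, P):
    """Fit an order-P polynomial to one LSF's trajectory over a segment
    of N frames and return its values at P+1 sample points (the feature
    values that would be quantized instead of the N original values)."""
    N = len(lsf_track)
    coeffs = np.polyfit(np.arange(N), lsf_track, P)   # least-squares fit
    # sample points roughly equally spaced, including both endpoints
    sample_t = np.round(np.linspace(0, N - 1, P + 1)).astype(int)
    return sample_t, np.polyval(coeffs, sample_t)

def decode_trajectory(sample_t, samples, N, P):
    """Rebuild the order-P curve from the P+1 transmitted samples and
    evaluate it at all N original frame positions."""
    coeffs = np.polyfit(sample_t, samples, P)  # exact through P+1 points
    return np.polyval(coeffs, np.arange(N))
```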
For the 10-dimensional LSF vectors, 10 polynomial curves of order P

will be formed. From each polynomial curve, which approximates one LSF's values over the N frames, P+1 points on the curve are taken. These form the P+1 feature vectors, which are encoded and transmitted to the decoder. At the decoder, the P+1 feature vectors are extracted, the polynomial curves of order P are reconstructed from them, and the curves are sampled at all the original points, recovering all the original LSFs. In the present implementation, the segment size ranges from 1 to 6. Segments with a single frame are directly quantized (with a 4096-entry codebook for the 10-dimensional LSF vector) without any use of polynomial functions. For segments of two to six frames, polynomial modeling is applied. The polynomial order P is chosen based on segment size, targeting close to 50% compression. In general, for an even segment size, P+1 is half the segment size; for an odd segment size, P+1 is floor(segment size/2 + 1). Given the polynomial order for the segment, the order-P curve is constructed and sampled at P+1 points to extract P+1 feature vectors, each of which is quantized with a codebook of 4096 entries (12 bits). The indices are transmitted to the decoder, where the feature LSF vectors are decoded. The polynomial curves of order P are reconstructed from the P+1 feature LSF vectors and sampled at as many points as the segment size, so all LSFs of the segment are recovered. In this way, all LSF vectors are obtained while quantizing only a few feature LSF vectors, giving a gain in bit rate. One important point to consider is the positions of the P+1 sampling points. As noted above, the polynomial curve of order P fitted over N frames is sampled at P+1 points. Theoretically, the sampling points can be any P+1 points, but in practice the results are not the same. The problem is best explained with an example. Consider a segment of six frames numbered {1, 2, 3, 4, 5, 6}.
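The order-selection rule just stated (P+1 = N/2 for even N, floor(N/2 + 1) for odd N) reduces to a one-line computation; this helper is only a sketch of that rule.

```python
def poly_order(segment_size):
    """Polynomial order P for a segment of N frames, per the rule above:
    the number of transmitted samples P+1 is N/2 for even N and
    floor(N/2 + 1) for odd N. Single-frame segments are quantized
    directly, so no order is defined for them."""
    if segment_size < 2:
        raise ValueError("single-frame segments are quantized directly")
    if segment_size % 2 == 0:
        n_samples = segment_size // 2
    else:
        n_samples = (segment_size + 1) // 2  # == floor(N/2 + 1)
    return n_samples - 1
```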
Second-order polynomial curves are constructed, and each curve is to be sampled at 3 points to obtain the feature LSF vectors for quantization. Let the sampled points be {1, 2, 3}; these feature vectors are quantized and the indices transmitted. At the decoder, after reconstructing the polynomial curve, the curve is sampled at all six points. It is then observed that spurious values occur at points {4, 5, 6}. This is due to the quantization step; without quantization, any three points would give appropriate values at all six points. To avoid this problem, the sampling points should include the start and end points of the segment, and it is best to choose them as equally spaced as possible. For segment size 6, one such sampling set is {1, 3, 6}. This method is followed for selecting the sampling points of all segment sizes. The points chosen are: segment size 2: {1}; segment size 3: {1, 3}; segment size 4: {1, 4}; segment size 5: {1, 3, 5}. The LSF vectors obtained for these segments are quantized with a codebook of size 4096 (12 bits).

Table I. PERFORMANCE EVALUATION
  Parameter                 Result
  Average bit rate (bps)    340
  ASD (dB)                  1.1
  PESQ score                3.82
  Outliers > 1 dB (%)
  Outliers > 2 dB (%)       2.53
  Outliers > 5 dB (%)       0.00

The performance evaluation with these univariate polynomial functions is given in Table I. PESQ MOS scores [5], computed with respect to the LP-modeled version, and ASD (dB) are taken as measures. For comparison, the method of Dusan [2] is also implemented, in which a fixed segment size of 10 frames is modeled with fourth-order polynomial functions. Its ASD is 1.7 dB at a bit rate of 300 bps, and the percentage of outliers > 5 dB is 3.7% (quite high). It can be observed that the proposed segmentation method improves speech quality.
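The spurious-value problem above can be made quantitative: a quantization error at one transmitted sample point is amplified by the corresponding Lagrange basis polynomial when the curve is re-evaluated at the other frames. The sketch below (illustrative, with 0-based frame indices) compares the clustered choice {1, 2, 3} against the spread choice {1, 3, 6} from the example.

```python
import numpy as np

def error_amplification(sample_t, err_idx, t_eval, P=2):
    """Factor by which a unit error at transmitted sample `err_idx` is
    amplified when the order-P curve is re-evaluated at frame t_eval.
    This equals the Lagrange basis polynomial of that sample point."""
    unit_err = np.zeros(len(sample_t))
    unit_err[err_idx] = 1.0
    coeffs = np.polyfit(sample_t, unit_err, P)  # curve through the error
    return np.polyval(coeffs, t_eval)

# Clustered samples {1, 2, 3} force extrapolation out to frame 6, while
# spread samples {1, 3, 6} keep every frame inside the interpolation
# interval (0-based: {0, 1, 2} vs {0, 2, 5}, evaluated at frames 6 and 4).
amp_clustered = error_amplification([0, 1, 2], err_idx=1, t_eval=5)
amp_spread = error_amplification([0, 2, 5], err_idx=1, t_eval=3)
```

The clustered choice amplifies a mid-point quantization error 15-fold at the last frame, while the spread choice does not amplify it at all, which is why the endpoints are included.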
B. Bivariate polynomial functions

The univariate polynomial modeling described above captures the temporal correlation of each LSF independently of the others. If we can capture the joint time-frequency characteristics of the LSFs and represent them with a few coefficients or values, then possibly a further reduction in bit rate can be achieved. Bivariate polynomial functions can model the characteristics of the LSFs along both the time and LSF index axes. A surface is formed from the LSF values across both time and LSF index, and the surface coefficients are extracted and transmitted to the decoder. At the decoder, the surface is reconstructed from the coefficients and the LSF information is recovered from the surface. The basic details of surface construction and the extraction of surface coefficients are explained below. Using a bivariate polynomial function, a surface can be defined by a function f(x, y) fitted to the given points ((x_1, y_1), z_1), ((x_2, y_2), z_2), ..., ((x_n, y_n), z_n). The variables x and y represent the coordinates (frame index and LSF index) and z the value (LSF value) at that point. The surface can be formed by a second or third order polynomial function (still higher orders are also possible). The second order bivariate polynomial function is given as:

f(x, y) = c_0 + c_1 x + c_2 y + c_3 xy + c_4 x^2 + c_5 y^2 ;  (2)

In Equation (2), {c_0, c_1, c_2, c_3, c_4, c_5} are the surface (bivariate polynomial) coefficients, estimated by least-squares fitting of the given data points. The surface modeling is implemented selectively on homogeneous segments (where the sound variation is smooth), typically of 3 to 7 frames. The homogeneous segments are obtained by manual marking using the spectrogram of the speech and cover about 70% of the frames. For segments of one or two frames, the minimum (3 x 3) matrix of input data required for the fit is not available.
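The least-squares fit of Equation (2) over a segment's frame-by-LSF-index grid can be sketched as follows; the function names and the explicit design matrix are illustrative choices, not the authors' implementation.

```python
import numpy as np

def fit_lsf_surface(lsf_grid):
    """Least-squares fit of the second-order surface of Equation (2),
    f(x, y) = c0 + c1*x + c2*y + c3*x*y + c4*x^2 + c5*y^2,
    to a segment's LSF values; x = frame index, y = LSF index."""
    n_frames, n_lsfs = lsf_grid.shape
    x, y = np.meshgrid(np.arange(n_frames, dtype=float),
                       np.arange(n_lsfs, dtype=float), indexing="ij")
    x, y = x.ravel(), y.ravel()
    z = np.asarray(lsf_grid, dtype=float).ravel()
    A = np.column_stack([np.ones_like(x), x, y, x * y, x**2, y**2])
    coeffs, *_ = np.linalg.lstsq(A, z, rcond=None)
    return coeffs

def eval_lsf_surface(coeffs, n_frames, n_lsfs):
    """Evaluate the fitted surface back on the full frame x LSF-index grid."""
    x, y = np.meshgrid(np.arange(n_frames, dtype=float),
                       np.arange(n_lsfs, dtype=float), indexing="ij")
    basis = np.stack([np.ones_like(x), x, y, x * y, x**2, y**2], axis=-1)
    return basis @ coeffs
```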

The surface modeling functions are applied in different ways. In the first case, using the second order bivariate polynomial function, a single surface is formed for the entire segment; the surface coefficients are used at the decoder to reconstruct the surface, which is sampled to recover all the original LSFs. In this case, the results are very poor, because the six coefficients of the 2nd order surface cannot adequately represent the LSF information of 3 to 4 frames (30 to 40 values for 10th order LP modeling). The ASD is observed to be above 1.6 dB and the PESQ score around 2.5 (out of a maximum of 4.5) in all cases, which is unacceptable. Detailed results for one speech file are as follows: total number of speech frames 433; number of segments marked 86; total number of frames in segments 287 (66% of the total); coefficients required for second order bivariate polynomial functions 86 x 6 = 516. The average spectral distortion (ASD) obtained with this modeling is 1.9 dB and the PESQ score is 2.3. These ASD values are for modeling alone; if the coefficients are also quantized, there will be a further reduction in speech quality. An ASD of around 1 dB is acceptable when quantization is also done; for modeling alone, an ASD of around 0.5 dB can be treated as acceptable. The problem just described is due to the attempt to fit a single smooth surface to a large region. It can be addressed by dividing the LSFs of the segment into two splits: the lower split contains the first five LSFs and the upper split the other five. Each split is modeled with its own second order bivariate polynomial surface. The LSF information is thus represented with more surface coefficients, leading to better reconstruction of the LSFs at the decoder. This gives better results than the previous case, but still not good enough. For the same speech file considered earlier, the coefficients required are 86 x 12 = 1032 for second order surface modeling.
The ASD observed is 1.0 dB and the PESQ score is 3.4. In the third case, we increase the number of surfaces further: the LSFs of the segment are divided into three splits, the first with three LSFs, the second with the next three, and the last with the remaining four. Each split is modeled with a second order bivariate polynomial surface. The results are considerably improved over the earlier two cases. For the same speech file mentioned in case 1, the coefficients required are 86 x 18 = 1548, the ASD is 0.5 dB and the PESQ score is 4.0. Note that from the first case to the third, the number of coefficients increases as {x, 2x, 3x}; if the surface coefficients are to be quantized, the number of bits required increases correspondingly. The results are summarized in Table II, comparing the coefficients required, ASD (dB) and PESQ score for the three cases.

Table II. COMPARISON OF DIFFERENT CASES OF BIVARIATE POLYNOMIAL MODELING
  Surfaces per segment    Coefficients required   ASD (dB)   PESQ score (w.r.t. before modeling)
  One                     516                     1.9        2.3
  Two                     1032                    1.0        3.4
  Three                   1548                    0.5        4.0
  Ten (univariate case)

To complete the full LSF quantization in this approach, the remaining problem is the quantization of the surface coefficients. The surface coefficients are a new and untried set of parameters and their quantization properties are unknown. Quantization has therefore not been incorporated in the present work, but some suggestions are provided next. Indirect quantization of the surface coefficients is possible via LSF quantization as follows. Consider the first case, where the whole segment is modeled by a single second order bivariate surface, and let the segment have 4 frames (40 LSF values for 10th LP order). With a second order bivariate polynomial function, the six surface coefficients obtained should satisfy Equation (2) for each LSF value z at its time (x) and LSF index (y) points. As mentioned, the coefficients obtained cannot be quantized directly. However, if a (3 x 3) matrix of LSF points of the segment is transmitted to the decoder, the decoder can recover the surface coefficients from this (3 x 3) matrix and hence the other LSF points. Methods for the transmission of LSF points (quantization of LSFs) are well known, so in this way a quantization process using bivariate polynomial functions is feasible. The point to note is that in the quantization of the (3 x 3) matrix of LSF points, the LSF values will be somewhat modified, and then the surface coefficients (obtained from the quantized (3 x 3) matrix) and hence the other LSFs will also be modified. If the LSFs recovered at the decoder are within an acceptable range, then bivariate polynomial functions can be used for LSF quantization.

IV. VARIABLE BIT RATE CODER

In this section, we describe a fully quantized coder that uses univariate polynomial modeling of LSF trajectories as presented in Section III A. The parameters from MBE analysis are voicing decisions, LSFs, gain and pitch. The LSF quantization is done by univariate polynomial functions as explained in Section III A. The quantization methods of the other parameters, viz. voicing, gain and pitch, are briefly discussed below.

Quantization of voicing decisions: From the MBE analysis, 12 band voicing decisions are obtained for each frame. The band voicing decisions are converted into a 3-bit frame voicing index. To exploit the inter-frame redundancy, a study is conducted of the frequency of occurrence of particular voicing index patterns (consecutive frame voicing indices within a segment). A few patterns are found to occur much more frequently than the others, so a codebook of these patterns is sufficient for encoding the frame voicing indices of an input segment. This study is conducted for all segment sizes and codebooks are designed accordingly. These codebooks are used in

encoding and decoding the voicing information of input segments. For selection of the proper codebook at the decoder, the segment size is also transmitted; this is the overhead of the method.

Gain quantization: The gain values are highly correlated across the frames of a segment, so vector quantization is implemented. Log gain values are taken to reduce the dynamic range. Codebooks are generated for all segment sizes and gain quantization is implemented.

Pitch quantization: Vector quantization (VQ) is used for the pitch parameter, with log pitch taken to reduce the dynamic range. In implementing VQ, it is observed that the pitch percentage error for segment size 6 is higher with smaller codebooks (of size 1024 or less). To reduce this error, for segment size six only, the average pitch of the segment and the normalized vector obtained by subtracting the average pitch from the pitch values of the segment are quantized. This reduces the error and also improves speech quality.

Postfiltering: Postfiltering is applied to reduce the coding noise introduced during the quantization process, based on the fact that noise in the spectral valleys is more audible than that near the peaks. A pseudo-cepstrum based postfilter has been found to perform well and is incorporated in the decoder [4]. The overall average bit rate of the variable rate coder is given in Table III. The average PESQ MOS score is measured to be 2.7.

V. DISCUSSION

Temporal trajectory modeling of LSFs across automatically selected homogeneous segments of speech has been studied in the context of the MBE speech coder. As an extension to temporal trajectory modeling, we present a method to capture the joint spectro-temporal correlation via bivariate surface modeling of the LSFs. The issues that arise in the modeling and its further quantization are discussed.
Since full quantization has not yet been implemented in the surface modeling approach, any meaningful comparison with univariate trajectory modeling is possible only at the output of the polynomial modeling stage. Table II shows a comparison of the two approaches on the same speech data. We observe that univariate modeling actually slightly surpasses the best-performing bivariate surface modeling, in terms of both speech quality and compression efficiency. This indicates that more research on the precise realization of the bivariate modeling concept is in order. Further investigation of the specific speech segments that may benefit more from one or the other type of segment modeling would also be useful. Finally, a complete variable rate coder based on univariate modeling of the LSF trajectories is presented. The average bit rate of the coder is 700 bps and it obtains an average PESQ MOS score of 2.7.

Appendix: Training database for LSF quantization

The database comprises 418 English sentences spoken by 49 speakers (both male and female). The speech of the native English speakers was taken directly from the TIMIT database and the rest was recorded in our lab. This database contains a total of frames of 20 ms duration.

Table III. BIT ALLOCATION TABLE
  Sr. No.   Parameter      Average bit rate (bps)
  1         Voicing
  2         LSFs
  3         Gain
  4         Pitch
  5         Segment size   45
            Total          700

REFERENCES
[1] D. W. Griffin and J. S. Lim, "Multi-band excitation vocoder," IEEE Trans. Acoustics, Speech and Signal Processing, vol. 36, no. 8, August.
[2] S. Dusan, J. Flanagan, A. Karve and M. Balaraman, "Speech coding using trajectory compression and multiple sensors," Proc. Int. Conf. on Spoken Language Processing, Jeju, Korea.
[3] L. Girin, "Long-term quantization of LSF parameters," Proc. IEEE ICASSP 2007, Honolulu, Hawaii, USA.
[4] R. S. Kumar, N. Tamrakar and P. Rao, "Segment based MBE speech coding at 1000 bps," in Proc. National Conference on Communications, February.
[5] ITU-T Rec. P.862, "Perceptual evaluation of speech quality (PESQ): an objective assessment of narrowband networks and speech codecs," ITU, 2002.


More information

Multimedia Networking ECE 599

Multimedia Networking ECE 599 Multimedia Networking ECE 599 Prof. Thinh Nguyen School of Electrical Engineering and Computer Science Based on lectures from B. Lee, B. Girod, and A. Mukherjee 1 Outline Digital Signal Representation

More information

Limited error based event localizing decomposition and its application to rate speech coding. Nguyen, Phu Chien; Akagi, Masato; Ng Author(s) Phu

Limited error based event localizing decomposition and its application to rate speech coding. Nguyen, Phu Chien; Akagi, Masato; Ng Author(s) Phu JAIST Reposi https://dspace.j Title Limited error based event localizing decomposition and its application to rate speech coding Nguyen, Phu Chien; Akagi, Masato; Ng Author(s) Phu Citation Speech Communication,

More information

PARAMETRIC coding has proven to be very effective

PARAMETRIC coding has proven to be very effective 966 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 3, MARCH 2007 High-Resolution Spherical Quantization of Sinusoidal Parameters Pim Korten, Jesper Jensen, and Richard Heusdens

More information

L7: Linear prediction of speech

L7: Linear prediction of speech L7: Linear prediction of speech Introduction Linear prediction Finding the linear prediction coefficients Alternative representations This lecture is based on [Dutoit and Marques, 2009, ch1; Taylor, 2009,

More information

Quantization of LSF Parameters Using A Trellis Modeling

Quantization of LSF Parameters Using A Trellis Modeling 1 Quantization of LSF Parameters Using A Trellis Modeling Farshad Lahouti, Amir K. Khandani Coding and Signal Transmission Lab. Dept. of E&CE, University of Waterloo, Waterloo, ON, N2L 3G1, Canada (farshad,

More information

Re-estimation of Linear Predictive Parameters in Sparse Linear Prediction

Re-estimation of Linear Predictive Parameters in Sparse Linear Prediction Downloaded from vbnaaudk on: januar 12, 2019 Aalborg Universitet Re-estimation of Linear Predictive Parameters in Sparse Linear Prediction Giacobello, Daniele; Murthi, Manohar N; Christensen, Mads Græsbøll;

More information

Optimal Speech Enhancement Under Signal Presence Uncertainty Using Log-Spectral Amplitude Estimator

Optimal Speech Enhancement Under Signal Presence Uncertainty Using Log-Spectral Amplitude Estimator 1 Optimal Speech Enhancement Under Signal Presence Uncertainty Using Log-Spectral Amplitude Estimator Israel Cohen Lamar Signal Processing Ltd. P.O.Box 573, Yokneam Ilit 20692, Israel E-mail: icohen@lamar.co.il

More information

Log Likelihood Spectral Distance, Entropy Rate Power, and Mutual Information with Applications to Speech Coding

Log Likelihood Spectral Distance, Entropy Rate Power, and Mutual Information with Applications to Speech Coding entropy Article Log Likelihood Spectral Distance, Entropy Rate Power, and Mutual Information with Applications to Speech Coding Jerry D. Gibson * and Preethi Mahadevan Department of Electrical and Computer

More information

On Compression Encrypted Data part 2. Prof. Ja-Ling Wu The Graduate Institute of Networking and Multimedia National Taiwan University

On Compression Encrypted Data part 2. Prof. Ja-Ling Wu The Graduate Institute of Networking and Multimedia National Taiwan University On Compression Encrypted Data part 2 Prof. Ja-Ling Wu The Graduate Institute of Networking and Multimedia National Taiwan University 1 Brief Summary of Information-theoretic Prescription At a functional

More information

GAUSSIANIZATION METHOD FOR IDENTIFICATION OF MEMORYLESS NONLINEAR AUDIO SYSTEMS

GAUSSIANIZATION METHOD FOR IDENTIFICATION OF MEMORYLESS NONLINEAR AUDIO SYSTEMS GAUSSIANIATION METHOD FOR IDENTIFICATION OF MEMORYLESS NONLINEAR AUDIO SYSTEMS I. Marrakchi-Mezghani (1),G. Mahé (2), M. Jaïdane-Saïdane (1), S. Djaziri-Larbi (1), M. Turki-Hadj Alouane (1) (1) Unité Signaux

More information

A comparative study of LPC parameter representations and quantisation schemes for wideband speech coding

A comparative study of LPC parameter representations and quantisation schemes for wideband speech coding Digital Signal Processing 17 (2007) 114 137 www.elsevier.com/locate/dsp A comparative study of LPC parameter representations and quantisation schemes for wideband speech coding Stephen So a,, Kuldip K.

More information

Improved noise power spectral density tracking by a MAP-based postprocessor

Improved noise power spectral density tracking by a MAP-based postprocessor Improved noise power spectral density tracking by a MAP-based postprocessor Aleksej Chinaev, Alexander Krueger, Dang Hai Tran Vu, Reinhold Haeb-Umbach University of Paderborn, Germany March 8th, 01 Computer

More information

Time-domain representations

Time-domain representations Time-domain representations Speech Processing Tom Bäckström Aalto University Fall 2016 Basics of Signal Processing in the Time-domain Time-domain signals Before we can describe speech signals or modelling

More information

2D Spectrogram Filter for Single Channel Speech Enhancement

2D Spectrogram Filter for Single Channel Speech Enhancement Proceedings of the 7th WSEAS International Conference on Signal, Speech and Image Processing, Beijing, China, September 15-17, 007 89 D Spectrogram Filter for Single Channel Speech Enhancement HUIJUN DING,

More information

Pulse-Code Modulation (PCM) :

Pulse-Code Modulation (PCM) : PCM & DPCM & DM 1 Pulse-Code Modulation (PCM) : In PCM each sample of the signal is quantized to one of the amplitude levels, where B is the number of bits used to represent each sample. The rate from

More information

Single Channel Signal Separation Using MAP-based Subspace Decomposition

Single Channel Signal Separation Using MAP-based Subspace Decomposition Single Channel Signal Separation Using MAP-based Subspace Decomposition Gil-Jin Jang, Te-Won Lee, and Yung-Hwan Oh 1 Spoken Language Laboratory, Department of Computer Science, KAIST 373-1 Gusong-dong,

More information

Joint Filtering and Factorization for Recovering Latent Structure from Noisy Speech Data

Joint Filtering and Factorization for Recovering Latent Structure from Noisy Speech Data Joint Filtering and Factorization for Recovering Latent Structure from Noisy Speech Data Colin Vaz, Vikram Ramanarayanan, and Shrikanth Narayanan Ming Hsieh Department of Electrical Engineering University

More information

AN INVERTIBLE DISCRETE AUDITORY TRANSFORM

AN INVERTIBLE DISCRETE AUDITORY TRANSFORM COMM. MATH. SCI. Vol. 3, No. 1, pp. 47 56 c 25 International Press AN INVERTIBLE DISCRETE AUDITORY TRANSFORM JACK XIN AND YINGYONG QI Abstract. A discrete auditory transform (DAT) from sound signal to

More information

Automatic Speech Recognition (CS753)

Automatic Speech Recognition (CS753) Automatic Speech Recognition (CS753) Lecture 12: Acoustic Feature Extraction for ASR Instructor: Preethi Jyothi Feb 13, 2017 Speech Signal Analysis Generate discrete samples A frame Need to focus on short

More information

Nearly Perfect Detection of Continuous F 0 Contour and Frame Classification for TTS Synthesis. Thomas Ewender

Nearly Perfect Detection of Continuous F 0 Contour and Frame Classification for TTS Synthesis. Thomas Ewender Nearly Perfect Detection of Continuous F 0 Contour and Frame Classification for TTS Synthesis Thomas Ewender Outline Motivation Detection algorithm of continuous F 0 contour Frame classification algorithm

More information

MANY digital speech communication applications, e.g.,

MANY digital speech communication applications, e.g., 406 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 2, FEBRUARY 2007 An MMSE Estimator for Speech Enhancement Under a Combined Stochastic Deterministic Speech Model Richard C.

More information

Design of a CELP coder and analysis of various quantization techniques

Design of a CELP coder and analysis of various quantization techniques EECS 65 Project Report Design of a CELP coder and analysis of various quantization techniques Prof. David L. Neuhoff By: Awais M. Kamboh Krispian C. Lawrence Aditya M. Thomas Philip I. Tsai Winter 005

More information

Selective Use Of Multiple Entropy Models In Audio Coding

Selective Use Of Multiple Entropy Models In Audio Coding Selective Use Of Multiple Entropy Models In Audio Coding Sanjeev Mehrotra, Wei-ge Chen Microsoft Corporation One Microsoft Way, Redmond, WA 98052 {sanjeevm,wchen}@microsoft.com Abstract The use of multiple

More information

Vector Quantization and Subband Coding

Vector Quantization and Subband Coding Vector Quantization and Subband Coding 18-796 ultimedia Communications: Coding, Systems, and Networking Prof. Tsuhan Chen tsuhan@ece.cmu.edu Vector Quantization 1 Vector Quantization (VQ) Each image block

More information

The effect of speaking rate and vowel context on the perception of consonants. in babble noise

The effect of speaking rate and vowel context on the perception of consonants. in babble noise The effect of speaking rate and vowel context on the perception of consonants in babble noise Anirudh Raju Department of Electrical Engineering, University of California, Los Angeles, California, USA anirudh90@ucla.edu

More information

Compression and Coding

Compression and Coding Compression and Coding Theory and Applications Part 1: Fundamentals Gloria Menegaz 1 Transmitter (Encoder) What is the problem? Receiver (Decoder) Transformation information unit Channel Ordering (significance)

More information

SINGLE CHANNEL SPEECH MUSIC SEPARATION USING NONNEGATIVE MATRIX FACTORIZATION AND SPECTRAL MASKS. Emad M. Grais and Hakan Erdogan

SINGLE CHANNEL SPEECH MUSIC SEPARATION USING NONNEGATIVE MATRIX FACTORIZATION AND SPECTRAL MASKS. Emad M. Grais and Hakan Erdogan SINGLE CHANNEL SPEECH MUSIC SEPARATION USING NONNEGATIVE MATRIX FACTORIZATION AND SPECTRAL MASKS Emad M. Grais and Hakan Erdogan Faculty of Engineering and Natural Sciences, Sabanci University, Orhanli

More information

The Choice of MPEG-4 AAC encoding parameters as a direct function of the perceptual entropy of the audio signal

The Choice of MPEG-4 AAC encoding parameters as a direct function of the perceptual entropy of the audio signal The Choice of MPEG-4 AAC encoding parameters as a direct function of the perceptual entropy of the audio signal Claus Bauer, Mark Vinton Abstract This paper proposes a new procedure of lowcomplexity to

More information

ECE533 Digital Image Processing. Embedded Zerotree Wavelet Image Codec

ECE533 Digital Image Processing. Embedded Zerotree Wavelet Image Codec University of Wisconsin Madison Electrical Computer Engineering ECE533 Digital Image Processing Embedded Zerotree Wavelet Image Codec Team members Hongyu Sun Yi Zhang December 12, 2003 Table of Contents

More information

A Study of Source Controlled Channel Decoding for GSM AMR Vocoder

A Study of Source Controlled Channel Decoding for GSM AMR Vocoder A Study of Source Controlled Channel Decoding for GSM AMR Vocoder K.S.M. Phanindra Girish A Redekar David Koilpillai Department of Electrical Engineering, IIT Madras, Chennai-6000036. phanindra@midascomm.com,

More information

The Equivalence of ADPCM and CELP Coding

The Equivalence of ADPCM and CELP Coding The Equivalence of ADPCM and CELP Coding Peter Kabal Department of Electrical & Computer Engineering McGill University Montreal, Canada Version.2 March 20 c 20 Peter Kabal 20/03/ You are free: to Share

More information

Digital Signal Processing

Digital Signal Processing Digital Signal Processing 0 (010) 157 1578 Contents lists available at ScienceDirect Digital Signal Processing www.elsevier.com/locate/dsp Improved minima controlled recursive averaging technique using

More information

Class of waveform coders can be represented in this manner

Class of waveform coders can be represented in this manner Digital Speech Processing Lecture 15 Speech Coding Methods Based on Speech Waveform Representations ti and Speech Models Uniform and Non- Uniform Coding Methods 1 Analog-to-Digital Conversion (Sampling

More information

Multimedia Systems Giorgio Leonardi A.A Lecture 4 -> 6 : Quantization

Multimedia Systems Giorgio Leonardi A.A Lecture 4 -> 6 : Quantization Multimedia Systems Giorgio Leonardi A.A.2014-2015 Lecture 4 -> 6 : Quantization Overview Course page (D.I.R.): https://disit.dir.unipmn.it/course/view.php?id=639 Consulting: Office hours by appointment:

More information

Improved Method for Epoch Extraction in High Pass Filtered Speech

Improved Method for Epoch Extraction in High Pass Filtered Speech Improved Method for Epoch Extraction in High Pass Filtered Speech D. Govind Center for Computational Engineering & Networking Amrita Vishwa Vidyapeetham (University) Coimbatore, Tamilnadu 642 Email: d

More information

On Optimal Coding of Hidden Markov Sources

On Optimal Coding of Hidden Markov Sources 2014 Data Compression Conference On Optimal Coding of Hidden Markov Sources Mehdi Salehifar, Emrah Akyol, Kumar Viswanatha, and Kenneth Rose Department of Electrical and Computer Engineering University

More information

Rate-Constrained Multihypothesis Prediction for Motion-Compensated Video Compression

Rate-Constrained Multihypothesis Prediction for Motion-Compensated Video Compression IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL 12, NO 11, NOVEMBER 2002 957 Rate-Constrained Multihypothesis Prediction for Motion-Compensated Video Compression Markus Flierl, Student

More information

Stress detection through emotional speech analysis

Stress detection through emotional speech analysis Stress detection through emotional speech analysis INMA MOHINO inmaculada.mohino@uah.edu.es ROBERTO GIL-PITA roberto.gil@uah.es LORENA ÁLVAREZ PÉREZ loreduna88@hotmail Abstract: Stress is a reaction or

More information

LECTURE NOTES IN AUDIO ANALYSIS: PITCH ESTIMATION FOR DUMMIES

LECTURE NOTES IN AUDIO ANALYSIS: PITCH ESTIMATION FOR DUMMIES LECTURE NOTES IN AUDIO ANALYSIS: PITCH ESTIMATION FOR DUMMIES Abstract March, 3 Mads Græsbøll Christensen Audio Analysis Lab, AD:MT Aalborg University This document contains a brief introduction to pitch

More information

CEPSTRAL ANALYSIS SYNTHESIS ON THE MEL FREQUENCY SCALE, AND AN ADAPTATIVE ALGORITHM FOR IT.

CEPSTRAL ANALYSIS SYNTHESIS ON THE MEL FREQUENCY SCALE, AND AN ADAPTATIVE ALGORITHM FOR IT. CEPSTRAL ANALYSIS SYNTHESIS ON THE EL FREQUENCY SCALE, AND AN ADAPTATIVE ALGORITH FOR IT. Summarized overview of the IEEE-publicated papers Cepstral analysis synthesis on the mel frequency scale by Satochi

More information

Gaussian Mixture Model-based Quantization of Line Spectral Frequencies for Adaptive Multirate Speech Codec

Gaussian Mixture Model-based Quantization of Line Spectral Frequencies for Adaptive Multirate Speech Codec Journal of Computing and Information Technology - CIT 19, 2011, 2, 113 126 doi:10.2498/cit.1001767 113 Gaussian Mixture Model-based Quantization of Line Spectral Frequencies for Adaptive Multirate Speech

More information

Basic Principles of Video Coding

Basic Principles of Video Coding Basic Principles of Video Coding Introduction Categories of Video Coding Schemes Information Theory Overview of Video Coding Techniques Predictive coding Transform coding Quantization Entropy coding Motion

More information

Singer Identification using MFCC and LPC and its comparison for ANN and Naïve Bayes Classifiers

Singer Identification using MFCC and LPC and its comparison for ANN and Naïve Bayes Classifiers Singer Identification using MFCC and LPC and its comparison for ANN and Naïve Bayes Classifiers Kumari Rambha Ranjan, Kartik Mahto, Dipti Kumari,S.S.Solanki Dept. of Electronics and Communication Birla

More information

Modeling the creaky excitation for parametric speech synthesis.

Modeling the creaky excitation for parametric speech synthesis. Modeling the creaky excitation for parametric speech synthesis. 1 Thomas Drugman, 2 John Kane, 2 Christer Gobl September 11th, 2012 Interspeech Portland, Oregon, USA 1 University of Mons, Belgium 2 Trinity

More information

Estimation of Relative Operating Characteristics of Text Independent Speaker Verification

Estimation of Relative Operating Characteristics of Text Independent Speaker Verification International Journal of Engineering Science Invention Volume 1 Issue 1 December. 2012 PP.18-23 Estimation of Relative Operating Characteristics of Text Independent Speaker Verification Palivela Hema 1,

More information

Exemplar-based voice conversion using non-negative spectrogram deconvolution

Exemplar-based voice conversion using non-negative spectrogram deconvolution Exemplar-based voice conversion using non-negative spectrogram deconvolution Zhizheng Wu 1, Tuomas Virtanen 2, Tomi Kinnunen 3, Eng Siong Chng 1, Haizhou Li 1,4 1 Nanyang Technological University, Singapore

More information

representation of speech

representation of speech Digital Speech Processing Lectures 7-8 Time Domain Methods in Speech Processing 1 General Synthesis Model voiced sound amplitude Log Areas, Reflection Coefficients, Formants, Vocal Tract Polynomial, l

More information

Vector Quantization. Institut Mines-Telecom. Marco Cagnazzo, MN910 Advanced Compression

Vector Quantization. Institut Mines-Telecom. Marco Cagnazzo, MN910 Advanced Compression Institut Mines-Telecom Vector Quantization Marco Cagnazzo, cagnazzo@telecom-paristech.fr MN910 Advanced Compression 2/66 19.01.18 Institut Mines-Telecom Vector Quantization Outline Gain-shape VQ 3/66 19.01.18

More information

Adapting Wavenet for Speech Enhancement DARIO RETHAGE JULY 12, 2017

Adapting Wavenet for Speech Enhancement DARIO RETHAGE JULY 12, 2017 Adapting Wavenet for Speech Enhancement DARIO RETHAGE JULY 12, 2017 I am v Master Student v 6 months @ Music Technology Group, Universitat Pompeu Fabra v Deep learning for acoustic source separation v

More information

REAL-TIME TIME-FREQUENCY BASED BLIND SOURCE SEPARATION. Scott Rickard, Radu Balan, Justinian Rosca. Siemens Corporate Research Princeton, NJ 08540

REAL-TIME TIME-FREQUENCY BASED BLIND SOURCE SEPARATION. Scott Rickard, Radu Balan, Justinian Rosca. Siemens Corporate Research Princeton, NJ 08540 REAL-TIME TIME-FREQUENCY BASED BLIND SOURCE SEPARATION Scott Rickard, Radu Balan, Justinian Rosca Siemens Corporate Research Princeton, NJ 84 fscott.rickard,radu.balan,justinian.roscag@scr.siemens.com

More information

A Variance Modeling Framework Based on Variational Autoencoders for Speech Enhancement

A Variance Modeling Framework Based on Variational Autoencoders for Speech Enhancement A Variance Modeling Framework Based on Variational Autoencoders for Speech Enhancement Simon Leglaive 1 Laurent Girin 1,2 Radu Horaud 1 1: Inria Grenoble Rhône-Alpes 2: Univ. Grenoble Alpes, Grenoble INP,

More information

Noise Robust Isolated Words Recognition Problem Solving Based on Simultaneous Perturbation Stochastic Approximation Algorithm

Noise Robust Isolated Words Recognition Problem Solving Based on Simultaneous Perturbation Stochastic Approximation Algorithm EngOpt 2008 - International Conference on Engineering Optimization Rio de Janeiro, Brazil, 0-05 June 2008. Noise Robust Isolated Words Recognition Problem Solving Based on Simultaneous Perturbation Stochastic

More information

Module 5 EMBEDDED WAVELET CODING. Version 2 ECE IIT, Kharagpur

Module 5 EMBEDDED WAVELET CODING. Version 2 ECE IIT, Kharagpur Module 5 EMBEDDED WAVELET CODING Lesson 13 Zerotree Approach. Instructional Objectives At the end of this lesson, the students should be able to: 1. Explain the principle of embedded coding. 2. Show the

More information

Can the sample being transmitted be used to refine its own PDF estimate?

Can the sample being transmitted be used to refine its own PDF estimate? Can the sample being transmitted be used to refine its own PDF estimate? Dinei A. Florêncio and Patrice Simard Microsoft Research One Microsoft Way, Redmond, WA 98052 {dinei, patrice}@microsoft.com Abstract

More information

on a per-coecient basis in large images is computationally expensive. Further, the algorithm in [CR95] needs to be rerun, every time a new rate of com

on a per-coecient basis in large images is computationally expensive. Further, the algorithm in [CR95] needs to be rerun, every time a new rate of com Extending RD-OPT with Global Thresholding for JPEG Optimization Viresh Ratnakar University of Wisconsin-Madison Computer Sciences Department Madison, WI 53706 Phone: (608) 262-6627 Email: ratnakar@cs.wisc.edu

More information

EE368B Image and Video Compression

EE368B Image and Video Compression EE368B Image and Video Compression Homework Set #2 due Friday, October 20, 2000, 9 a.m. Introduction The Lloyd-Max quantizer is a scalar quantizer which can be seen as a special case of a vector quantizer

More information

PHONEME CLASSIFICATION OVER THE RECONSTRUCTED PHASE SPACE USING PRINCIPAL COMPONENT ANALYSIS

PHONEME CLASSIFICATION OVER THE RECONSTRUCTED PHASE SPACE USING PRINCIPAL COMPONENT ANALYSIS PHONEME CLASSIFICATION OVER THE RECONSTRUCTED PHASE SPACE USING PRINCIPAL COMPONENT ANALYSIS Jinjin Ye jinjin.ye@mu.edu Michael T. Johnson mike.johnson@mu.edu Richard J. Povinelli richard.povinelli@mu.edu

More information

SIGNAL COMPRESSION. 8. Lossy image compression: Principle of embedding

SIGNAL COMPRESSION. 8. Lossy image compression: Principle of embedding SIGNAL COMPRESSION 8. Lossy image compression: Principle of embedding 8.1 Lossy compression 8.2 Embedded Zerotree Coder 161 8.1 Lossy compression - many degrees of freedom and many viewpoints The fundamental

More information

Frequency Domain Speech Analysis

Frequency Domain Speech Analysis Frequency Domain Speech Analysis Short Time Fourier Analysis Cepstral Analysis Windowed (short time) Fourier Transform Spectrogram of speech signals Filter bank implementation* (Real) cepstrum and complex

More information

Modifying Voice Activity Detection in Low SNR by correction factors

Modifying Voice Activity Detection in Low SNR by correction factors Modifying Voice Activity Detection in Low SNR by correction factors H. Farsi, M. A. Mozaffarian, H.Rahmani Department of Electrical Engineering University of Birjand P.O. Box: +98-9775-376 IRAN hfarsi@birjand.ac.ir

More information

Mel-Generalized Cepstral Representation of Speech A Unified Approach to Speech Spectral Estimation. Keiichi Tokuda

Mel-Generalized Cepstral Representation of Speech A Unified Approach to Speech Spectral Estimation. Keiichi Tokuda Mel-Generalized Cepstral Representation of Speech A Unified Approach to Speech Spectral Estimation Keiichi Tokuda Nagoya Institute of Technology Carnegie Mellon University Tamkang University March 13,

More information

Design Criteria for the Quadratically Interpolated FFT Method (I): Bias due to Interpolation

Design Criteria for the Quadratically Interpolated FFT Method (I): Bias due to Interpolation CENTER FOR COMPUTER RESEARCH IN MUSIC AND ACOUSTICS DEPARTMENT OF MUSIC, STANFORD UNIVERSITY REPORT NO. STAN-M-4 Design Criteria for the Quadratically Interpolated FFT Method (I): Bias due to Interpolation

More information

SINGLE-CHANNEL SPEECH PRESENCE PROBABILITY ESTIMATION USING INTER-FRAME AND INTER-BAND CORRELATIONS

SINGLE-CHANNEL SPEECH PRESENCE PROBABILITY ESTIMATION USING INTER-FRAME AND INTER-BAND CORRELATIONS 204 IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP) SINGLE-CHANNEL SPEECH PRESENCE PROBABILITY ESTIMATION USING INTER-FRAME AND INTER-BAND CORRELATIONS Hajar Momeni,2,,

More information

The information loss in quantization

The information loss in quantization The information loss in quantization The rough meaning of quantization in the frame of coding is representing numerical quantities with a finite set of symbols. The mapping between numbers, which are normally

More information

Zeros of z-transform(zzt) representation and chirp group delay processing for analysis of source and filter characteristics of speech signals

Zeros of z-transform(zzt) representation and chirp group delay processing for analysis of source and filter characteristics of speech signals Zeros of z-transformzzt representation and chirp group delay processing for analysis of source and filter characteristics of speech signals Baris Bozkurt 1 Collaboration with LIMSI-CNRS, France 07/03/2017

More information

+ (50% contribution by each member)

+ (50% contribution by each member) Image Coding using EZW and QM coder ECE 533 Project Report Ahuja, Alok + Singh, Aarti + + (50% contribution by each member) Abstract This project involves Matlab implementation of the Embedded Zerotree

More information

Harmonic Structure Transform for Speaker Recognition

Harmonic Structure Transform for Speaker Recognition Harmonic Structure Transform for Speaker Recognition Kornel Laskowski & Qin Jin Carnegie Mellon University, Pittsburgh PA, USA KTH Speech Music & Hearing, Stockholm, Sweden 29 August, 2011 Laskowski &

More information

Detection-Based Speech Recognition with Sparse Point Process Models

Detection-Based Speech Recognition with Sparse Point Process Models Detection-Based Speech Recognition with Sparse Point Process Models Aren Jansen Partha Niyogi Human Language Technology Center of Excellence Departments of Computer Science and Statistics ICASSP 2010 Dallas,

More information

Improved MU-MIMO Performance for Future Systems Using Differential Feedback

Improved MU-MIMO Performance for Future Systems Using Differential Feedback Improved MU-MIMO Performance for Future 80. Systems Using Differential Feedback Ron Porat, Eric Ojard, Nihar Jindal, Matthew Fischer, Vinko Erceg Broadcom Corp. {rporat, eo, njindal, mfischer, verceg}@broadcom.com

More information

(Classical) Information Theory III: Noisy channel coding

(Classical) Information Theory III: Noisy channel coding (Classical) Information Theory III: Noisy channel coding Sibasish Ghosh The Institute of Mathematical Sciences CIT Campus, Taramani, Chennai 600 113, India. p. 1 Abstract What is the best possible way

More information

Source modeling (block processing)

Source modeling (block processing) Digital Speech Processing Lecture 17 Speech Coding Methods Based on Speech Models 1 Waveform Coding versus Block Waveform coding Processing sample-by-sample matching of waveforms coding gquality measured

More information

Sound Recognition in Mixtures

Sound Recognition in Mixtures Sound Recognition in Mixtures Juhan Nam, Gautham J. Mysore 2, and Paris Smaragdis 2,3 Center for Computer Research in Music and Acoustics, Stanford University, 2 Advanced Technology Labs, Adobe Systems

More information

Voice Activity Detection Using Pitch Feature

Voice Activity Detection Using Pitch Feature Voice Activity Detection Using Pitch Feature Presented by: Shay Perera 1 CONTENTS Introduction Related work Proposed Improvement References Questions 2 PROBLEM speech Non speech Speech Region Non Speech

More information

Model-based unsupervised segmentation of birdcalls from field recordings

Model-based unsupervised segmentation of birdcalls from field recordings Model-based unsupervised segmentation of birdcalls from field recordings Anshul Thakur School of Computing and Electrical Engineering Indian Institute of Technology Mandi Himachal Pradesh, India Email:

More information

Short-Time ICA for Blind Separation of Noisy Speech

Short-Time ICA for Blind Separation of Noisy Speech Short-Time ICA for Blind Separation of Noisy Speech Jing Zhang, P.C. Ching Department of Electronic Engineering The Chinese University of Hong Kong, Hong Kong jzhang@ee.cuhk.edu.hk, pcching@ee.cuhk.edu.hk

More information

ACCESS Linnaeus Center, Electrical Engineering. Stockholm, Sweden

ACCESS Linnaeus Center, Electrical Engineering. Stockholm, Sweden FLEXIBLE QUANTIZATION OF AUDIO AND SPEECH BASED ON THE AUTOREGRESSIVE MODEL Alexey Ozerov and W. Bastiaan Kleijn ACCESS Linnaeus Center, Electrical Engineering KTH - Royal Institute of Technology Stockholm,

More information

Lecture 9: Speech Recognition. Recognizing Speech

Lecture 9: Speech Recognition. Recognizing Speech EE E68: Speech & Audio Processing & Recognition Lecture 9: Speech Recognition 3 4 Recognizing Speech Feature Calculation Sequence Recognition Hidden Markov Models Dan Ellis http://www.ee.columbia.edu/~dpwe/e68/

More information

Lecture 9: Speech Recognition

Lecture 9: Speech Recognition EE E682: Speech & Audio Processing & Recognition Lecture 9: Speech Recognition 1 2 3 4 Recognizing Speech Feature Calculation Sequence Recognition Hidden Markov Models Dan Ellis

More information

Scalar and Vector Quantization. National Chiao Tung University Chun-Jen Tsai 11/06/2014

Scalar and Vector Quantization. National Chiao Tung University Chun-Jen Tsai 11/06/2014 Scalar and Vector Quantization National Chiao Tung University Chun-Jen Tsai 11/06/014 Basic Concept of Quantization Quantization is the process of representing a large, possibly infinite, set of values

More information