CEPSTRAL ANALYSIS SYNTHESIS ON THE MEL FREQUENCY SCALE, AND AN ADAPTATIVE ALGORITHM FOR IT.

Similar documents
Mel-Generalized Cepstral Representation of Speech A Unified Approach to Speech Spectral Estimation. Keiichi Tokuda

Feature extraction 2

Speech Signal Representations

Chapter 9. Linear Predictive Analysis of Speech Signals 语音信号的线性预测分析

Feature extraction 1

Signal Modeling Techniques in Speech Recognition. Hassan A. Kingravi

Signal representations: Cepstrum

Voiced Speech. Unvoiced Speech

SPEECH ANALYSIS AND SYNTHESIS

L7: Linear prediction of speech

Linear Prediction 1 / 41

LECTURE NOTES IN AUDIO ANALYSIS: PITCH ESTIMATION FOR DUMMIES

Design Criteria for the Quadratically Interpolated FFT Method (I): Bias due to Interpolation

Noise Robust Isolated Words Recognition Problem Solving Based on Simultaneous Perturbation Stochastic Approximation Algorithm

Oversampling Converters

Finite Word Length Effects and Quantisation Noise. Professors A G Constantinides & L R Arnaut

Sinusoidal Modeling. Yannis Stylianou SPCC University of Crete, Computer Science Dept., Greece,

Automatic Speech Recognition (CS753)

ELEN E4810: Digital Signal Processing Topic 11: Continuous Signals. 1. Sampling and Reconstruction 2. Quantization

Chapter 10 Applications in Communications

Signals, Instruments, and Systems W5. Introduction to Signal Processing Sampling, Reconstruction, and Filters

L8: Source estimation

DISCRETE-TIME SIGNAL PROCESSING

Robust Speaker Identification

Proc. of NCC 2010, Chennai, India

representation of speech

Lecture 7: Feature Extraction

CS578- Speech Signal Processing

200Pa 10million. Overview. Acoustics of Speech and Hearing. Loudness. Terms to describe sound. Matching Pressure to Loudness. Loudness vs.

Time-domain representations

Frequency Domain Speech Analysis

Allpass Modeling of LP Residual for Speaker Recognition

TinySR. Peter Schmidt-Nielsen. August 27, 2014

Phase-Correlation Motion Estimation Yi Liang

EE 602 TERM PAPER PRESENTATION Richa Tripathi Mounika Boppudi FOURIER SERIES BASED MODEL FOR STATISTICAL SIGNAL PROCESSING - CHONG YUNG CHI

Zeros of z-transform(zzt) representation and chirp group delay processing for analysis of source and filter characteristics of speech signals

Text-to-speech synthesizer based on combination of composite wavelet and hidden Markov models

8/19/16. Fourier Analysis. Fourier analysis: the dial tone phone. Fourier analysis: the dial tone phone

Second and Higher-Order Delta-Sigma Modulators

A Spectral-Flatness Measure for Studying the Autocorrelation Method of Linear Prediction of Speech Analysis

E : Lecture 1 Introduction

sine wave fit algorithm

Lab 9a. Linear Predictive Coding for Speech Processing

Design of a CELP coder and analysis of various quantization techniques

Fourier Analysis. David-Alexander Robinson ; Daniel Tanner; Jack Denning th October Abstract 2. 2 Introduction & Theory 2

Analysis of polyphonic audio using source-filter model and non-negative matrix factorization

arxiv: v1 [cs.sd] 25 Oct 2014

Research Article Identification of Premature Ventricular Cycles of Electrocardiogram Using Discrete Cosine Transform-Teager Energy Operator Model

Time-Frequency Analysis

Estimation of Relative Operating Characteristics of Text Independent Speaker Verification

Reverberation Impulse Response Analysis

ω 0 = 2π/T 0 is called the fundamental angular frequency and ω 2 = 2ω 0 is called the

Department of Electrical and Computer Engineering Digital Speech Processing Homework No. 6 Solutions

Lecture 3 - Design of Digital Filters

Basics on 2-D 2 D Random Signal

Linear Prediction Coding. Nimrod Peleg Update: Aug. 2007

GAUSSIANIZATION METHOD FOR IDENTIFICATION OF MEMORYLESS NONLINEAR AUDIO SYSTEMS

Digital Signal Processing

Topic 6. Timbre Representations

Chapter 12 Variable Phase Interpolation

Elec4621 Advanced Digital Signal Processing Chapter 11: Time-Frequency Analysis

Methods for Synthesizing Very High Q Parametrically Well Behaved Two Pole Filters

Stress detection through emotional speech analysis

INTRODUCTION TO DELTA-SIGMA ADCS

THE RELATION BETWEEN PERCEIVED HYPERNASALITY OF CLEFT PALATE SPEECH AND ITS HOARSENESS

Optimal Design of Real and Complex Minimum Phase Digital FIR Filters

Evaluation of the modified group delay feature for isolated word recognition

Filter structures ELEC-E5410

VID3: Sampling and Quantization

A perception- and PDE-based nonlinear transformation for processing spoken words

LAB 6: FIR Filter Design Summer 2011

Submitted to Electronics Letters. Indexing terms: Signal Processing, Adaptive Filters. The Combined LMS/F Algorithm Shao-Jen Lim and John G. Harris Co

Vocoding approaches for statistical parametric speech synthesis


EE 123: Digital Signal Processing Spring Lecture 23 April 10

Multimedia Systems Giorgio Leonardi A.A Lecture 4 -> 6 : Quantization

CMPT 889: Lecture 3 Fundamentals of Digital Audio, Discrete-Time Signals

A Library of Functions

Exam 2. Average: 85.6 Median: 87.0 Maximum: Minimum: 55.0 Standard Deviation: Numerical Methods Fall 2011 Lecture 20

1. Probability density function for speech samples. Gamma. Laplacian. 2. Coding paradigms. =(2X max /2 B ) for a B-bit quantizer Δ Δ Δ Δ Δ

Signal Modeling Techniques In Speech Recognition

3.4 Linear Least-Squares Filter

Analysis of Finite Wordlength Effects

Cepstral Deconvolution Method for Measurement of Absorption and Scattering Coefficients of Materials

Speech Coding. Speech Processing. Tom Bäckström. October Aalto University

CEPSTRAL analysis has been widely used in signal processing

New Statistical Model for the Enhancement of Noisy Speech

ECE4270 Fundamentals of DSP Lecture 20. Fixed-Point Arithmetic in FIR and IIR Filters (part I) Overview of Lecture. Overflow. FIR Digital Filter

Digital Signal Processing Lecture 8 - Filter Design - IIR

Signal Processing COS 323

Part III Spectrum Estimation

Statistical and Adaptive Signal Processing

Analysis of methods for speech signals quantization

Stability Condition in Terms of the Pole Locations

Music Synthesis. synthesis. 1. NCTU/CSIE/ DSP Copyright 1996 C.M. LIU

NUMERICAL METHODS. x n+1 = 2x n x 2 n. In particular: which of them gives faster convergence, and why? [Work to four decimal places.

Comparing linear and non-linear transformation of speech

Review of Fundamentals of Digital Signal Processing

MULTI-RESOLUTION SIGNAL DECOMPOSITION WITH TIME-DOMAIN SPECTROGRAM FACTORIZATION. Hirokazu Kameoka

EE 230 Lecture 43. Data Converters

Transcription:

CEPSTRAL ANALYSIS SYNTHESIS ON THE EL FREQUENCY SCALE, AND AN ADAPTATIVE ALGORITH FOR IT. Summarized overview of the IEEE-publicated papers Cepstral analysis synthesis on the mel frequency scale by Satochi IAI (Japan, 1983), and An adaptative algorithm for mel-cepstral analysis of speech by Toshiaki FUKADA, Keiichi TOKUDA, Takao KOBAYASHI and Satoshi IAI (Japan, 1992). Cecilia CARUNCHO LLAGUNO, Graz (Austria), April 2008. What is cepstral analysis? Ceptral analysis is a modelation of speech based on the use of cepstrum, which is defined as the inverse Fourier transform of the logarithm of the Fourier transform module. cepstrum of signal = F{ log [F -1 (signal) + j 2π m]} It has therefore good characteristics for parametric representation of speech. Some other interesting features are: 1. It represents the log spectral envelope of speech accurately and efficiently, 2. the LA filter can be used for a high quality speech direct synthesis, 3. both the sensitivity and the cepstrum quantization noise are small, and 4. the spectral distortion is also small. Note that it will be better to use el frequency scale rather than linear frequency scale, since the representation of the spectral envelope of speech is more effective. What is the el scale? Psychophysical studies have shown that human perception of the frequency content of sounds, either for pure tones or for speech signals, does not follow a linear scale. This research has led to the idea of defining subjective pitch of pure tones. Thus for each tone with an actual frequency, f, measured in Hz, a subjective pitch is measured on a scale called ''el'' scale. As a reference point, the pitch of a 1kHz tone, 40 db above the perceptual hearing threshold, is defined as 1000 mels. Other subjective pitch values are obtained by adjusting the frequency of a tone such that it is half or twice the perceived pitch of a reference tone (with a known mel frequency). el cepstral analysis synthesis system.

Input speech Frequency Spectral F Log F -1 wrapping envelope extractor F -1 Peak picking Pitch parameter extractor Linear transform Filter parameter Excitation Signal Generator LSA Synthetised speech Note that the LSA filter is used as the speech synthetizer. The filter coefficients are obtained through a simple linear transform from the mel cepstrum which is defined as the Fourier cosine coefficients of the true spectral envelope of the mel log spectrum. Spectral envelope extraction The mel scale can be approximated by the phase characteristics of a first order all-pass filter, with phase β(ω): =tan 1 1 2 sin 1 2 cos 2 And the mel frequency scale ω is represented by ω= β(ω). Thus the mel log spectrum can be defined as a function of ω. We will assume that this is a smooth function represented by a trigonometric polimomial of order : G = c m cos m It is also very important to extract the spectral envelope properly. The cepstral method estimates a spectral envelope minimizing the mean square error, so that poles and zeros can be expressed with te same accuracy. The present cepstral method can automatically and stably extract the spectral envelope satisfactory without being affected by the fine structure in either voiced or invoiced sounds. el log spectrum approximation In order to optimize the quality of the synthetized speech,, the mean spare approximation to the desired log spectral envelope must be minimum. The best results are achieved with the LSA filter, whose coefficients are obtained by a simple linear transform from the mel spectrum. This filter has two main advantages: i) very low coefficient sensitivities and ii) good coefficient quantization characteristics. The ideal form of the LSA filter's transfer function is the so-called basic filter. H o z = e F z If the basic flilter is stable, then the LSA will be stable and of minimum phase., where F α (z) is

F z = c m z m ln H o e j = c m cos m note that is the frequency onmel scale If the filter parameter c(m) is chosen as the mel cepstrum for the spectral envelope, the log magnitude on the mel frequency scale is identical to the mel log spectral envelope. However, the ideal LSA filter's transfer function is not realizable, so a Padé 1 approximation must be used. Thus, the transfer function is rewritten as: F z = b 0 z 1 1 b m z m 1 m01 b... recursive filter parameter The filter coefficients b α (m) decay almos in the same orther as the mel ceppstrum parameters c α (m) do, which is actually rather fast. This means that the filter coefficients can be roughly quantized. oreover, since the b α (m) have almost the same statistical properties as those of the mel cepstrum, they can be used as a parametric representation of speech. Data rate. The filter coefficients are used as the spectral envelope parameter. It is shown from experimental results that the difference between the maximal and minimal values of each filter coefficient b α (m) is bounded. The filter coefficient sensitivities of the mel log spectrum are uniform for every order m. Consequently, the filter coefficients can be digitized by using a quantizer having the same quantization width q ( < 1). After experimental observations, the data amount is given by: b s = 2 2 [log 2 q] 3 if 9 2 2 [log 2 q] 14 if 9 If the spectral envelope and pitch parameter are represented by b s and b p bit per frame transmitted every T seconds, the averall bit rate B of this system is given by: 1 A Padé approximant approximates a function in one variable. Given a function f and two integers m 0 and n 0, the Padé approximant of order (m, n) is the rational function: which agrees with f(x) to the highest possible order. Equivalently, if R(x) is expanded in a Taylor series at 0, its first m + n + 1 terms would cancel the first m + n + 1 terms of f(x), and as such:

B = b s b p T The effect of varying T, q, or b parameters has been studied, with the results shown in the following table. T (ms) q Bp (bit) B (kbits/s) Speech quality 15 11 0.25 7 4 Very high 20 8 0.5 7 2 Fairly good 25 5 0.5 6 1.2 Still good Spectral distortion The spectral distortions are caused by the interpolation of the spectral envelope parameters of two succesive frames and by the quzntization of the spectral envelope parameter. The approximate r.m.s. value of the spectral distortion caused by the quantization is referred to as D Q (db) and that caused by the interpolation is referred to as D T (db). They are given by the following expressions: D Q = q 1 5 ; D T = 65T Now, a method in which they apply the criterion used in the unbiased estimation of log spectrum to the spectral model represented by the mel-cepstral coefficients is proposed. To solve the nonlinear minimization problem involved in the method, an iterative algorithm whose convergence is guaranteed is given. Furthermore, the authors derive an adaptive algorithm for the mel-cepstral analysis by introducing an instantaneous estimate for gradient of the criterion. The adaptive melcepstral analysis system is implemented with an IIR adaptive filter which has an exponential transfer function, and whose stability is guaranteed. Spectral estimation based on mel-cepstral representation The model spectrum H z = exp c m z m can be expressed as follows: H z = exp b m m z = K D z where m z = 1 if ; 1 2 z 1 1 z 1 z 1 1 z 1 m 1 and K = expb 0 ; D z = exp b m m z m=1 if m 1

and where the cepstral (c(m)) and filter (b(m)) parameters are related through: c m = b m if m= and b m b m 1 if 0 m In order to obtain an unbiased estimate, ε must be minimized, being: = 1 2 I N D e j 2 d This minimization issue can be solved by means of the Newton-Raphson ethod 2. For the i- th result b (i), solving a set of linear equations H b i = b=b i where H... Hessianmatrix H = 2 / b b T, and the gradient is given by = 2 r, where r m = 1 I N 2 D e j e j 2 m d we have the values: b i = [ b i 1, b i 2,..., b i ] T, so b i 1 = b i b i Typically, few iterations are needed to obtain the solution, which makes this method quick and computationally efficient. Adaptative mel-cepstral analysis algorithm Replacing H with the unit matrix, the next result b (i+1) is given from the i-th result b (i) : b i 1 = b i b=b i where... adaptation step size If we call e(n) to the output of the inverse filter 1/D(z) driven by x(n), we can interpret 2 The Newton Raphson method is the best known method for finding successively better approximations to the zeros (roots) of a real-valued function f(x):

= E [e 2 n ], so that = 2 E [e n e n ], where e n = [ e 1 n,e 2 n,...,e n ] T, e m n... outputof the filter m z. To derive an adaptative algorithm, an instantaneous estimate is introduced: n = 2e n e n and, to suppress fluctuation of b, ε is estimated using an exponential window: n = 2 1 n i= n i e i e i = n 1 2 1 e n e n, 0 1. When the gain of the signal x(n) is time-varying, μ is normalized as: n = a n, 0 a 1 where n...estimate of at time n : n = n 1 1 e 2 n, 0 1 So the coefficient vector b(n) at time n can be updated: b n 1 = b n n n. And, since K = min, the mel cepstral coefficients [ c m ] can be obtained. Conclusion The LSA is a simple, great idea to be applied on low bit rate cepstral analysis synthesis systems since its parameters can be rougly quantized, and it has good stathistical features. oreover, this system has fairly small spectral distortions. In addition to this, the use of the above described algorithm will make the calculation of the cepstral coefficients so much easier and efficient, which has been awarded by experimental results.