Feature extraction 1

Similar documents
Feature extraction 2

Signal representations: Cepstrum

Speech Signal Representations

Voiced Speech. Unvoiced Speech

Automatic Speech Recognition (CS753)

Linear Prediction 1 / 41

Mel-Generalized Cepstral Representation of Speech A Unified Approach to Speech Spectral Estimation. Keiichi Tokuda

Frequency Domain Speech Analysis

Signal Modeling Techniques in Speech Recognition. Hassan A. Kingravi

Topic 6. Timbre Representations

SPEECH COMMUNICATION 6.541J J-HST710J Spring 2004

Estimation of Cepstral Coefficients for Robust Speech Recognition

Lecture 7: Feature Extraction

Evaluation of the modified group delay feature for isolated word recognition

200Pa 10million. Overview. Acoustics of Speech and Hearing. Loudness. Terms to describe sound. Matching Pressure to Loudness. Loudness vs.

2A1H Time-Frequency Analysis II

Chapter 9. Linear Predictive Analysis of Speech Signals 语音信号的线性预测分析

Sound 2: frequency analysis

TinySR. Peter Schmidt-Nielsen. August 27, 2014

Homework 4. May An LTI system has an input, x(t) and output y(t) related through the equation y(t) = t e (t t ) x(t 2)dt

CEPSTRAL ANALYSIS SYNTHESIS ON THE MEL FREQUENCY SCALE, AND AN ADAPTATIVE ALGORITHM FOR IT.

SPEECH ANALYSIS AND SYNTHESIS

Zeros of z-transform(zzt) representation and chirp group delay processing for analysis of source and filter characteristics of speech signals

L8: Source estimation

2A1H Time-Frequency Analysis II Bugs/queries to HT 2011 For hints and answers visit dwm/courses/2tf

L7: Linear prediction of speech

Timbral, Scale, Pitch modifications

arxiv: v1 [cs.sd] 25 Oct 2014

THE APPLICATION OF BLIND SOURCE SEPARATION TO FEATURE DECORRELATION AND NORMALIZATION. by Manuel Laura BS, University of Kansas, 2003

SNR Features for Automatic Speech Recognition

Analytical Review of Feature Extraction Techniques for Automatic Speech Recognition

Robust Speaker Identification

LECTURE NOTES IN AUDIO ANALYSIS: PITCH ESTIMATION FOR DUMMIES

HMM part 1. Dr Philip Jackson

Cepstral Deconvolution Method for Measurement of Absorption and Scattering Coefficients of Materials

David Weenink. First semester 2007

Lecture 5: GMM Acoustic Modeling and Feature Extraction

[Omer* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116

Bich Ngoc Do. Neural Networks for Automatic Speaker, Language and Sex Identification

Sinusoidal Modeling. Yannis Stylianou SPCC University of Crete, Computer Science Dept., Greece,

Correlator I. Basics. Chapter Introduction. 8.2 Digitization Sampling. D. Anish Roshi

COMP 546, Winter 2018 lecture 19 - sound 2

Signal Modeling Techniques In Speech Recognition

Noise Reduction. Two Stage Mel-Warped Weiner Filter Approach

MULTISCALE SCATTERING FOR AUDIO CLASSIFICATION

Nearly Perfect Detection of Continuous F 0 Contour and Frame Classification for TTS Synthesis. Thomas Ewender

BLIND DEREVERBERATION USING SHORT TIME CEPSTRUM FRAME SUBTRACTION. J.S. van Eeghem (l), T. Koike (2) and M.Tohyama (1)

MVA Processing of Speech Features. Chia-Ping Chen, Jeff Bilmes

University of Colorado at Boulder ECEN 4/5532. Lab 2 Lab report due on February 16, 2015

ECE 8440 Unit Applica,ons (of Homomorphic Deconvolu,on) to Speech Processing

Spectral and Textural Feature-Based System for Automatic Detection of Fricatives and Affricates

AN OPERATIONAL ANALYSIS TECHNIQUE FOR THE MIMO SITUATION BASED ON CEPSTRAL TECHNIQUES

Noise Robust Isolated Words Recognition Problem Solving Based on Simultaneous Perturbation Stochastic Approximation Algorithm

CEPSTRAL analysis has been widely used in signal processing

Voice Activity Detection Using Pitch Feature

Department of Electrical and Computer Engineering Digital Speech Processing Homework No. 6 Solutions

Modeling Prosody for Speaker Recognition: Why Estimating Pitch May Be a Red Herring

Lecture 9: Speech Recognition. Recognizing Speech

Deep Learning for Automatic Speech Recognition Part I

Lab 9a. Linear Predictive Coding for Speech Processing

Lecture 9: Speech Recognition

ω 0 = 2π/T 0 is called the fundamental angular frequency and ω 2 = 2ω 0 is called the

Stress detection through emotional speech analysis

L6: Short-time Fourier analysis and synthesis

Allpass Modeling of LP Residual for Speaker Recognition

Last time: small acoustics

COMP 546. Lecture 21. Cochlea to brain, Source Localization. Tues. April 3, 2018

A perception- and PDE-based nonlinear transformation for processing spoken words

Frog Sound Identification System for Frog Species Recognition

SPEECH RECOGNITION USING TIME DOMAIN FEATURES FROM PHASE SPACE RECONSTRUCTIONS

EECS E Speech Recognition Lecture 2

Vocoding approaches for statistical parametric speech synthesis

Research Article Identification of Premature Ventricular Cycles of Electrocardiogram Using Discrete Cosine Transform-Teager Energy Operator Model

Lecture 14 1/38 Phys 220. Final Exam. Wednesday, August 6 th 10:30 am 12:30 pm Phys multiple choice problems (15 points each 300 total)

Lecture 5: Jan 19, 2005

Nonparametric Deconvolution of Seismic Depth Phases

Signal Processing COS 323

Signals, Instruments, and Systems W5. Introduction to Signal Processing Sampling, Reconstruction, and Filters

PART 1. Review of DSP. f (t)e iωt dt. F(ω) = f (t) = 1 2π. F(ω)e iωt dω. f (t) F (ω) The Fourier Transform. Fourier Transform.

representation of speech

Improved Method for Epoch Extraction in High Pass Filtered Speech

WORD BOUNDARY DETECTION IN STATIONARY NOISES USING CEPSTRAL MATRICES

Speaker Identification Based On Discriminative Vector Quantization And Data Fusion

ECE 598: The Speech Chain. Lecture 5: Room Acoustics; Filters

Chirp Decomposition of Speech Signals for Glottal Source Estimation

16.584: Random (Stochastic) Processes

ABSTRACT 1. INTRODUCTION

Estimation of Relative Operating Characteristics of Text Independent Speaker Verification

Glottal Source Estimation using an Automatic Chirp Decomposition

AN INVERTIBLE DISCRETE AUDITORY TRANSFORM

Speech Spectra and Spectrograms

Review of Fourier Transform

Estimation and tracking of fundamental, 2nd and 3d harmonic frequencies for spectrogram normalization in speech recognition

3 THE P HYSICS PHYSICS OF SOUND

Improved system blind identification based on second-order cyclostationary statistics: A group delay approach

Cochlear modeling and its role in human speech recognition

JOINT TIME-FREQUENCY SCATTERING FOR AUDIO CLASSIFICATION

Source/Filter Model. Markus Flohberger. Acoustic Tube Models Linear Prediction Formant Synthesizer.

Characterization of phonemes by means of correlation dimension

Time-Series Analysis for Ear-Related and Psychoacoustic Metrics

Transcription:

Centre for Vision Speech & Signal Processing University of Surrey, Guildford GU2 7XH. Feature extraction 1 Dr Philip Jackson Cepstral analysis - Real & complex cepstra - Homomorphic decomposition Filter bank energies Mel-frequency cepstral coefficients www.ee.surrey.ac.uk/teaching/courses/eem.ssr H.1

Cepstral analysis (1) Cepstral analysis, or homomorphic decomposition, is designed to separate convolved signal components by transforming the signal s to a domain where components are additive and in distinct regions. A source signal x passed through a filter with impulse reponse h, where denotes convolution, yields s(t) = h(t) x(t) (1) Taking Fourier transforms of both sides, we get S(ω) = H(ω) X(ω) (2) where ω denotes the angular frequency, ω = 2πf. H.2

Cepstral analysis (2) So, the spectrum of the signal can be written S(ω) = H(ω) X(ω) and taking natural logarithms of both sides gives ln S(ω) = ln H(ω) + ln X(ω) (3) Thus, a convolution in time has been transformed into a sum of log components in the frequency domain. Finally, applying an inverse Fourier transform to the log spectrum gives F 1 {ln S(ω)} = F 1 {ln H(ω)} + F 1 {ln X(ω)} (4) where F { } denotes the Fourier transform (FT), and F 1 its inverse (IFT). H.3

Cepstral analysis (3) A block diagram of the process for calculating cepstral coefficients shows the three steps: s(t) FT S( ω) c *( τ) ln. IFT s This last transform takes the function back into the time domain, but it is not the same as the time of the original signal. In fact, it is a measure of the rate of change of the spectral envelope, as viewed in decibels (db). This domain is called the cepstrum, and the time axis is often referred to as the lag or quefrency axis. H.4

Complex cepstrum In this case, the output c s(τ) is the complex cepstrum of signal s, since the logarithm is applied to a complex number. Hence, we can re-write eq. 4 as where, for example, c s(τ) = c h (τ) + c x(τ) (5) c s(τ) = F 1 {ln S(ω)} (6) and c h (τ) and c x(τ) are the complex cepstra of h and x, respectively. H.5

Real cepstrum The real cepstrum is derived from the log magnitude of the spectrum, and these are also are superposed: c s (τ) = F 1 {ln S(ω) } (7) c s (τ) = c h (τ) + c x (τ) (8) The diagram shows magnitude and logarithm operations: s(t) S( ω) c ( τ) FT ln. IFT s The real cepstrum is a real and an even function of the lag, or quefrency. So, the IFT can be replaced by the discrete cosine transform (DCT). Phase information from the original signal has been lost in the process of calculating the real cepstrum. H.6

Real cepstrum applied to speech Cepstral analysis has found many applications in such areas as seismic exploration and speech processing, for which an example is given. The sequence of plots below shows the cepstral analysis procedure applied to two frames of voiced speech data: a vowel [a] (vowels are high in amplitude and have strong periodicity), a voiced fricative consonant [z] (fricatives tend to have a strong high-frequency noise component). H.7

Examples of cepstral analysis, [a] and [z] 5 Mid vowel [a] 5 Mid fricative (voiced) [z] Pressure (Pa) 5 Pressure (Pa) 5 1 2 3 4 Time (ms) 1 1 2 3 4 Time (ms) 1 PSD (db/hz) 5 PSD (db/hz) 5 2 4 6 8 Frequency (khz) 2 2 4 6 8 Frequency (khz).4 Autocorrelation 1 Autocorrelation.2 Real Cepstrum 1 5 1 15 2 Time (ms).6.4.2.2 5 1 15 2 Time (ms) Real Cepstrum.2 5 1 15 2 Time (ms) 1.5.5 5 1 15 2 Time (ms) H.8

Homomorphic decomposition Decomposition by cepstral liftering Provided that the spectral properties of the two signals, h(t) and x(t), are distinct, then c h (τ) and c x (τ) will occupy distinct regions of the quefrency domain. Using a suitable cepstral filter (or lifter), the components may be separated from each other, and then they can be transformed back into log-magnitudes or magnitudes in the frequency domain, as required. H.9

Source-filter theory of speech It is assumed that the recorded speech signal is the output from a linear system which consists of a source of filter excitation (a series of periodic pulses) convolved with the impulse response of a filter (Fant 196). The filter represents the acoustic effect of the vocal tract, which depends on the positions of the articulators (jaw, tongue, lips, etc.) and corresponds to the uttered vowel ([i] from the word linear ). Cepstral analysis allows both an accurate estimation of the periodicity of the excitation and the extraction of the frequency response, and hence the impulse response of the vocal-tract filter. H.1

Example spectral envelope of [i] in linear Amplitude Log. magnitude (db) Amplitude.2.1.1.2 5 1 15 2 25 3 Time (ms) 2 2 4 1 2 3 4 5 6 7 8 Frequency (khz) 2 1 1 2 2 4 6 8 1 12 14 16 Quefrency (ms) H.11

Example pitch extraction from [i] in linear Amplitude Log. magnitude (db) Amplitude 2 1 1 2 2 4 6 8 1 12 14 16 Quefrency (ms) 2 2 4 1 2 3 4 5 6 7 8 Frequency (khz).2.1.1.2 5 1 15 2 25 3 Time (ms) H.12

Extraction of acoustic features Filter banks measure local spectral shape reduce effects of pitch characteristic frequencies of nerve fibres filters with critical-band reponses gain freq. H.13

Mel-frequency scale Human hearing does not perceive frequencies over 1 khz in a linear fashion. The frequency resolution of the ear, measured in terms of the critical bandwidth, gets broader as frequency increases. Psychoacoustic experiments of this phenomenon have provided us with a normalised frequency scale, measured in Mels: ( f Mel = 1127.148 ln 1 + f ) Hz (9) 7 4 Pitch (Mel) 3 2 1 5 1 15 2 Frequency (khz) H.14

Mel-frequency cepstral coefficients (MFCCs) MFCC features are typically computed in the front end of an automatic speech recognition system: s(n) DFT S(k). ln. DCT c s( ν) 1 freq m 1... m j... MELSPEC m P Energy in Each Band Figure 5.3. Mel-scale filter bank, from (Young et al, 1997). H.15

Feature extraction 1 summary Cepstral analysis Calculating the complex cepstrum Calculating the real cepstrum Real cepstrum Examples of cepstra computed from speech Homomorphic decomposition of speech Spectral envelope Pitch tracking Mel-frequency cepstral coefficients Mel-frequency warping As a front end for automatic speech recognition H.16

Homework from week 3 Complete the assignment for Lab 1 Work through the exercises on the web page H.17