Lecture 7: Feature Extraction

1 Lecture 7: Feature Extraction
Kai Yu
SpeechLab, Department of Computer Science & Engineering
Shanghai Jiao Tong University
Autumn 2014
Kai Yu Lecture 7: Feature Extraction SJTU Speech Lab 1 / 28

2 Table of Contents
- Acoustic features for speech recognition
- Dynamic features
- Feature projection: LDA & HLDA

3 Why Do We Need Feature Extraction?
The raw spectrum has:
- Rich information: the computational cost is high
- Redundant (irrelevant or disturbing) information for ASR: not effective
We need to derive more compact and effective features from the spectrum for ASR.

4 Commonly Used Features for Speech Recognition
Short-term spectra are used to describe speech signals. Useful features extracted from short-term spectra include:
- Linear Prediction Coefficients (LPC)
- Mel Frequency Cepstral Coefficients (MFCC)
- Perceptual Linear Prediction Coefficients (PLP)

5 Linear Prediction Coefficient (LPC)
Given the signal $x = [x_1, \dots, x_T]^\top$, a linear predictor of order $n$ predicts the sample at time $t$ as a weighted linear combination of its $n$ preceding samples:
$$\hat{x}_t = \sum_{i=1}^{n} a_i x_{t-i}, \qquad \hat{x} = M a$$
where
$$M = \begin{bmatrix} x_0 & x_{-1} & x_{-2} & \cdots & x_{-n+1} \\ x_1 & x_0 & x_{-1} & \cdots & x_{-n+2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{T-1} & x_{T-2} & x_{T-3} & \cdots & x_{-n+T} \end{bmatrix}, \qquad a = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}$$
The $a_i$, $1 \le i \le n$, are known as linear prediction coefficients, and $M$ is known as a Toeplitz matrix.

6 Linear Prediction Coefficient (LPC)
Finding the Linear Prediction Coefficients
The coefficients are chosen to minimise the mean squared prediction error:
$$L_{mse} = \frac{1}{T}\sum_{t=1}^{T}(\hat{x}_t - x_t)^2 = \frac{1}{T}(\hat{x} - x)^\top(\hat{x} - x)$$
Note that
$$\hat{x}_t = \sum_{i=1}^{n} a_i x_{t-i}, \qquad 1 \le t \le T$$
where $a_i$, $1 \le i \le n$, are the linear prediction coefficients.
Direct solution:
$$a = (M^\top M)^{-1} M^\top x$$
Do we want the direct solution? Not really.
The Levinson-Durbin algorithm is an auto-correlation method that makes use of the Toeplitz property of $M$.
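
The Levinson-Durbin recursion mentioned above can be sketched as follows. This is a minimal illustration rather than the lecture's own implementation: the function names `autocorrelation` and `levinson_durbin` are hypothetical, and the sketch assumes the autocorrelation formulation, in which the normal equations $\sum_i a_i r_{|j-i|} = r_j$ are solved order by order instead of inverting a matrix.

```python
def autocorrelation(x, order):
    """Return r[0..order], where r[k] = sum over t of x[t] * x[t - k]."""
    return [sum(x[t] * x[t - k] for t in range(k, len(x)))
            for k in range(order + 1)]


def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations for the predictor coefficients.

    Returns (a, err): a[i - 1] is a_i in x_hat[t] = sum_i a_i * x[t - i],
    and err is the remaining prediction error energy.
    """
    a = []       # coefficients a_1..a_m of the current order-m predictor
    err = r[0]   # error of the order-0 predictor (assumed non-zero)
    for m in range(1, order + 1):
        # Reflection coefficient used to grow the predictor to order m.
        acc = r[m] - sum(a[i] * r[m - 1 - i] for i in range(m - 1))
        k = acc / err
        # Update the existing coefficients, then append the new one.
        a = [a[i] - k * a[m - 2 - i] for i in range(m - 1)] + [k]
        err *= 1.0 - k * k
    return a, err
```

For example, `levinson_durbin(autocorrelation(samples, 10), 10)` would give an order-10 predictor for one frame of `samples`. Each order costs O(m) operations, so the whole recursion is O(n^2) rather than the O(n^3) of a general matrix inverse.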

7 Linear Prediction Coefficient (LPC)
Spectral Envelopes from LPC
[Figure: three panels]
- TOP: waveform of the sound aa
- MIDDLE & BOTTOM: spectral magnitude plotted on a log scale; the spectral envelope is plotted as a smooth red line
- MIDDLE: order-10 LP
- BOTTOM: order-25 LP
Higher-order LP tracks the spectral magnitude more precisely.
Envelope peaks can be used to determine formant locations.

8 Filter
A filter is a device or process that removes some unwanted component or feature from a signal. It is usually (though not necessarily) applied in the frequency domain:
$$Y(\omega) = H(\omega) X(\omega)$$

9 Filter Bank Coefficients
- The spectral magnitude from the STFT contains too much information.
- A filter bank is a series (bank) of bandpass filters (typically triangular filters).
- Each bandpass filter produces one coefficient, corresponding to the sum of the bandpassed signal.

10 Mel Scale
The Mel scale is a perceptual scale of pitches judged by listeners to be equal in distance from one another. The Mel function is a non-linear mapping between the frequency and Mel scales:
$$\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$
Note that points spaced equally on the Mel scale give a higher resolution at the lower frequencies.
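
As a small sketch of the Mel mapping above, the following converts between Hz and Mel and places points equally on the Mel scale. The function names and the helper `mel_spaced_frequencies` are hypothetical; the inverse mapping is obtained by solving the formula for $f$.

```python
import math


def hz_to_mel(f):
    """Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel, obtained by solving the formula for f."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_spaced_frequencies(f_low, f_high, count):
    """Return `count` frequencies in Hz that are equally spaced in Mel."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (count - 1)
    return [mel_to_hz(lo + i * step) for i in range(count)]
```

Calling `mel_spaced_frequencies(0.0, 4000.0, 10)` gives points whose spacing in Hz grows with frequency, which is the higher low-frequency resolution noted on the slide.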

11 Mel Scale Filter Bank Coefficients
$$m_i = \sum_{k=f_i}^{F_i} s(k)\, T_i(k)$$
where $m_i$ is the $i$-th filter bank coefficient, $f_i$ and $F_i$ are the start and end frequencies of the triangular filter, $s(k)$ is the spectral power (sometimes magnitude) at frequency bin $k$, and $T_i(k)$ is the triangular filter bank value.
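
The sum above can be sketched in a few lines. This is an illustration under stated assumptions, not the lecture's code: the names `triangular_filter` and `filter_bank_coefficients` are hypothetical, the filter edges are given directly as strictly increasing FFT bin indices (in practice they would be placed at Mel-spaced frequencies), and adjacent triangles overlap by half, so filter $i$ spans `edges[i]` to `edges[i + 2]`.

```python
def triangular_filter(k, start, centre, end):
    """Weight T_i(k): rises from start to centre, then falls to end."""
    if start < k <= centre:
        return (k - start) / (centre - start)
    if centre < k < end:
        return (end - k) / (end - centre)
    return 0.0


def filter_bank_coefficients(spectrum, edges):
    """Compute m_i = sum_k s(k) * T_i(k) for each triangular filter.

    `spectrum` holds s(k) per bin; `edges` holds strictly increasing
    bin indices, and filter i spans edges[i]..edges[i + 2].
    """
    coeffs = []
    for i in range(len(edges) - 2):
        start, centre, end = edges[i], edges[i + 1], edges[i + 2]
        coeffs.append(sum(spectrum[k] * triangular_filter(k, start, centre, end)
                          for k in range(start, end + 1)))
    return coeffs
```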

12 Mel Frequency Cepstral Coefficient (MFCC)
MFCCs are widely used in many speech processing techniques. They are derived from the Mel filter bank coefficients by:
1. Taking the logarithm of the $N_{fb}$ filter bank coefficients
2. Computing cepstral coefficients using the Discrete Cosine Transform (DCT)
$$c_n = \sqrt{\frac{2}{N_{fb}}} \sum_{j=1}^{N_{fb}} \log(m_j) \cos\left(\frac{\pi n}{N_{fb}}(j - 0.5)\right), \qquad n = 1, 2, \dots, N_{mfcc}$$
where $c_n$ is the $n$-th MFCC coefficient, $m_j$ is the $j$-th Mel scale filter bank coefficient, and $N_{fb}$ and $N_{mfcc}$ are the number of filter bank and final MFCC coefficients respectively (usually $N_{mfcc} = 12$; $N_{fb}$ varies from 20 to 30).
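
The two steps above (log, then DCT) can be sketched directly from the formula. The function name `mfcc_from_filter_bank` is hypothetical, and the sketch assumes the scaling factor is $\sqrt{2/N_{fb}}$, as in the HTK convention; a different normalisation would only rescale the coefficients.

```python
import math


def mfcc_from_filter_bank(m, n_mfcc):
    """Return c_1..c_{n_mfcc} from filter bank coefficients m_1..m_{N_fb}."""
    n_fb = len(m)
    log_m = [math.log(v) for v in m]   # step 1: logarithm
    scale = math.sqrt(2.0 / n_fb)      # assumed normalisation
    # Step 2: DCT of the log filter bank coefficients.
    return [scale * sum(log_m[j - 1] * math.cos(math.pi * n / n_fb * (j - 0.5))
                        for j in range(1, n_fb + 1))
            for n in range(1, n_mfcc + 1)]
```

With a completely flat filter bank output every coefficient $c_n$ for $n \ge 1$ is zero, since the cosine basis functions sum to zero; only the spectral shape contributes.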

13 From Spectrum to Cepstrum
The vector of spectral energies is not used directly because:
- Speech power spectra are not Gaussian
- All coefficients are sensitive to loudness
- Neighboring coefficients are highly correlated
The Discrete Cosine Transform can effectively remove the dependency between coefficients.

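14 Discrete Cosine Transform (DCT)
The DCT is a linear transform:
$$\begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_{N_{mfcc}} \end{bmatrix} = \sqrt{\frac{2.0}{N_{fb}}} \begin{bmatrix} \cos\left(\frac{\pi (0.5)}{N_{fb}}\right) & \cdots & \cos\left(\frac{\pi (N_{fb} - 0.5)}{N_{fb}}\right) \\ \vdots & \ddots & \vdots \\ \cos\left(\frac{\pi N_{mfcc} (0.5)}{N_{fb}}\right) & \cdots & \cos\left(\frac{\pi N_{mfcc} (N_{fb} - 0.5)}{N_{fb}}\right) \end{bmatrix} \begin{bmatrix} \log(m_1) \\ \log(m_2) \\ \vdots \\ \log(m_{N_{fb}}) \end{bmatrix}$$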

15 Basis Functions of DCT
[Figure: basis functions of the DCT]

16 Perceptual Linear Prediction (PLP)
Perceptual linear prediction (PLP) is a well-known and widely used feature extraction technique that incorporates perception into the front-end. It is believed to be more robust to noise. The PLP coefficients can be derived from the filter bank coefficients as follows:
- Apply the equal-loudness pre-emphasis curve and compression:
$$\hat{m}_k = (L_k m_k)^{\beta}, \qquad L_k = \left(\frac{f_k^2}{f_k^2 + 1.6\mathrm{e}5}\right)^2 \left(\frac{f_k^2 + 1.44\mathrm{e}6}{f_k^2 + 9.61\mathrm{e}6}\right)$$
- Apply the inverse DFT to the pre-emphasised filter bank coefficients to yield auto-correlation coefficients
- Apply the Levinson-Durbin algorithm to get the LP coefficients
- Convert the LP coefficients to cepstral coefficients
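
The first PLP step (equal-loudness weighting followed by compression) can be sketched as below. Several things here are assumptions rather than statements from the slide: the constants 1.6e5, 1.44e6 and 9.61e6 follow the commonly used equal-loudness approximation, the default `beta=0.33` is a typical cube-root-like compression factor, and the function names are hypothetical.

```python
def equal_loudness_weight(f_hz):
    """L_k for a filter centred at f_hz (assumed standard approximation)."""
    f2 = f_hz * f_hz
    return (f2 / (f2 + 1.6e5)) ** 2 * (f2 + 1.44e6) / (f2 + 9.61e6)


def compress(filter_bank, centre_freqs_hz, beta=0.33):
    """Return m_hat_k = (L_k * m_k) ** beta for each filter bank channel."""
    return [(equal_loudness_weight(f) * m) ** beta
            for f, m in zip(centre_freqs_hz, filter_bank)]
```

The remaining steps (inverse DFT, Levinson-Durbin, LP-to-cepstrum conversion) operate on the output of `compress` and are not shown.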

17 Different Energy Terms
Energy information is very important for speech recognition.
- Spectral energy:
$$E = \frac{1}{N}\sum_{n=1}^{N} x_n^2 \qquad \text{or} \qquad E = \frac{1}{N}\sum_{n=1}^{N} |x_n|$$
- 0th cepstral coefficient $C_0$:
  - MFCC: $\sqrt{\frac{2.0}{N_{fb}}} \sum_{k=1}^{N_{fb}} \log(m_k)$
  - PLP: $\log(\text{LPC gain})$
Energy normalization is important due to energy variations over different channels.

18 Dynamic Features in Speech Recognition: Concept
MFCC or PLP features describe the instantaneous speech signal spectrum, but cannot describe the signal dynamics. This can be improved by including differentials of the coefficients in the feature vector:
$$o = \begin{bmatrix} c \\ \Delta c \\ \Delta^2 c \end{bmatrix}$$

19 Dynamic Features in Speech Recognition: Calculation
Simple differential coefficients can be calculated as
$$\Delta_n = \frac{c_{n+\delta} - c_{n-\delta}}{2\delta}$$
A more robust estimate uses regression coefficients, which fit the best straight line through a number of frames (here $2\delta + 1$):
$$\Delta_n = \frac{\sum_{i=1}^{\delta} i\,(c_{n+i} - c_{n-i})}{2\sum_{i=1}^{\delta} i^2}$$
Higher-order differential coefficients can be obtained by applying the above recursively:
$$\Delta^2_n = \frac{\sum_{i=1}^{\delta} i\,(\Delta_{n+i} - \Delta_{n-i})}{2\sum_{i=1}^{\delta} i^2}$$
Typically, differentials up to the 2nd or 3rd order are used.
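
The regression formula above can be sketched as follows. The function name `delta` is hypothetical, and one detail is an assumption not covered by the slide: frames beyond either end of the utterance are replaced by copies of the first or last frame so that every frame gets a value.

```python
def delta(frames, window=2):
    """Regression-based differentials of a sequence of feature vectors.

    `frames` is a list of equal-length lists; `window` is delta in the
    formula, so 2 * window + 1 frames contribute to each output frame.
    """
    denom = 2.0 * sum(i * i for i in range(1, window + 1))
    last = len(frames) - 1
    out = []
    for n in range(len(frames)):
        vec = [0.0] * len(frames[n])
        for i in range(1, window + 1):
            ahead = frames[min(n + i, last)]   # assumed: repeat last frame
            behind = frames[max(n - i, 0)]     # assumed: repeat first frame
            for d in range(len(vec)):
                vec[d] += i * (ahead[d] - behind[d])
        out.append([v / denom for v in vec])
    return out
```

Second-order differentials follow by applying the same function again, for example `delta(delta(c))`, and the full observation for frame `n` is then the concatenation `c[n] + delta(c)[n] + delta(delta(c))[n]`.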

20 Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA) is a linear projection scheme that finds a matrix of dimensions $p \times n$, where $n$ is the original vector size and $p \le n$, which maps onto the feature space that is best for discrimination.

21 Fisher's Linear Discriminant: Recap
Criterion:
$$L(w) = \frac{w^\top S_B w}{w^\top S_w w}$$
where the between-class and within-class covariances are
$$S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^\top, \qquad S_w = \Sigma_1 + \Sigma_2$$
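
For two classes, the criterion above is maximised (up to scale) by $w \propto S_w^{-1}(\mu_1 - \mu_2)$, which is a standard result not stated on the slide. The sketch below works this out for 2-dimensional points using a closed-form 2x2 inverse; all function names are hypothetical.

```python
def mean(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]


def covariance(points, mu):
    """2x2 covariance matrix of the points around the mean mu."""
    n = len(points)
    sxx = sum((p[0] - mu[0]) ** 2 for p in points) / n
    syy = sum((p[1] - mu[1]) ** 2 for p in points) / n
    sxy = sum((p[0] - mu[0]) * (p[1] - mu[1]) for p in points) / n
    return [[sxx, sxy], [sxy, syy]]


def scatter_terms(class1, class2):
    """Return (mu_1 - mu_2, S_w) with S_w = Sigma_1 + Sigma_2."""
    mu1, mu2 = mean(class1), mean(class2)
    c1, c2 = covariance(class1, mu1), covariance(class2, mu2)
    sw = [[c1[r][c] + c2[r][c] for c in range(2)] for r in range(2)]
    return [mu1[0] - mu2[0], mu1[1] - mu2[1]], sw


def fisher_direction(class1, class2):
    """w = S_w^{-1} (mu_1 - mu_2), using the closed-form 2x2 inverse."""
    d, sw = scatter_terms(class1, class2)
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    return [(sw[1][1] * d[0] - sw[0][1] * d[1]) / det,
            (-sw[1][0] * d[0] + sw[0][0] * d[1]) / det]


def fisher_criterion(w, class1, class2):
    """L(w) = (w' S_B w) / (w' S_w w), with S_B = d d' for d = mu_1 - mu_2."""
    d, sw = scatter_terms(class1, class2)
    between = (w[0] * d[0] + w[1] * d[1]) ** 2
    within = sum(w[r] * sw[r][c] * w[c] for r in range(2) for c in range(2))
    return between / within
```

When one dimension has much larger within-class variance than the other, the Fisher direction down-weights it, so it generally scores higher on the criterion than simply projecting onto the difference of the means.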

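22 Criterion for LDA
$$L(A_{[p]}) = \frac{\left|\mathrm{diag}\left(A_{[p]} B A_{[p]}^\top\right)\right|}{\left|\mathrm{diag}\left(A_{[p]} W A_{[p]}^\top\right)\right|}$$
Between-class matrix:
$$B = \frac{\sum_{m,t} \gamma_m(t)\,(\mu^{(m)} - \mu^{(g)})(\mu^{(m)} - \mu^{(g)})^\top}{\sum_{m,t} \gamma_m(t)}$$
Global within-class matrix:
$$W = \frac{\sum_{m,t} \gamma_m(t)\,(o(t) - \mu^{(m)})(o(t) - \mu^{(m)})^\top}{\sum_{m,t} \gamma_m(t)}$$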

23 Linear Discriminant Analysis (LDA)
A simple 2-dimensional example
The two classes have the same covariance matrix.

24 Heteroscedastic LDA (HLDA)
An extended version of LDA in which each class has its own covariance matrix.

25 Comparison Between LDA and HLDA
LDA: global within-class matrix
$$W_{LDA} = \frac{\sum_{m,t} \gamma_m(t)\,(o(t) - \mu^{(m)})(o(t) - \mu^{(m)})^\top}{\sum_{m,t} \gamma_m(t)}$$
HLDA: local within-class matrix
$$W_{HLDA}^{(m)} = \frac{\sum_{t} \gamma_m(t)\,(o(t) - \mu^{(m)})(o(t) - \mu^{(m)})^\top}{\sum_{t} \gamma_m(t)}$$

26 HTK - Feature Extraction (HCopy)
    HCopy -C mfcc.cfg -S digit.wav2mfcc.scp

27 HTK - Supported Feature Types
- LPC: Linear Prediction Coefficient
- MELSPEC: Mel-frequency spectral magnitude
- FBANK: Log filter bank coefficient
- MFCC: Mel Frequency Cepstral Coefficient
- PLP: Perceptual Linear Prediction coefficient
Additional qualifiers:

28 HTK - View Feature Files (HList)
    HList -h data.mfcc
Actual format:
Note: HTK stores data in big-endian format. NATURALREADORDER and NATURALWRITEORDER can be specified to override the defaults.
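
To illustrate the big-endian note, the sketch below decodes the header of an HTK feature file. The layout is taken from the HTK Book rather than from these slides and should be treated as an assumption here: a 12-byte header holding the number of frames (4 bytes), the frame period in 100 ns units (4 bytes), the bytes per frame (2 bytes) and the parameter kind code (2 bytes). The function name is hypothetical, only the feature types listed above are mapped, and the data is assumed to be uncompressed 4-byte floats.

```python
import struct

# Base kinds (low 6 bits of the parameter kind) for the types listed above.
BASE_KINDS = {1: "LPC", 6: "MFCC", 7: "FBANK", 8: "MELSPEC", 11: "PLP"}
# Qualifier bit flags (octal), per the assumed HTK Book layout.
QUALIFIERS = [("_E", 0o100), ("_N", 0o200), ("_0", 0o20000), ("_D", 0o400),
              ("_A", 0o1000), ("_C", 0o2000), ("_Z", 0o4000), ("_K", 0o10000)]


def parse_htk_header(data):
    """Decode the first 12 bytes of an HTK parameter file."""
    # ">" forces big-endian byte order regardless of the host machine.
    n_frames, period, frame_bytes, kind = struct.unpack(">iihH", data[:12])
    name = BASE_KINDS.get(kind & 0o77, "UNKNOWN")
    name += "".join(tag for tag, bit in QUALIFIERS if kind & bit)
    return {
        "n_frames": n_frames,
        "frame_shift_ms": period / 10000.0,  # 100 ns units to milliseconds
        "dim": frame_bytes // 4,             # assumes uncompressed floats
        "kind": name,
    }
```

A file could be inspected with `parse_htk_header(open("data.mfcc", "rb").read(12))`; reading the same bytes with a little-endian format string would give nonsensical frame counts, which is the usual symptom of a byte-order mismatch.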
