
Automatic Speech Recognition (CS753)
Lecture 12: Acoustic Feature Extraction for ASR
Instructor: Preethi Jyothi
Feb 13, 2017

Speech Signal Analysis
Generate discrete samples from the speech waveform. We need to focus on short segments of speech (speech frames) that more or less correspond to a subphone and are stationary. Each speech frame is typically 20-50 ms long. Use overlapping frames with a frame shift of around 10 ms, and generate acoustic features corresponding to each speech frame.

Frame-wise processing (figure): frame size 25 ms, frame shift 10 ms
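A minimal NumPy sketch of this framing step (the 25 ms frame size and 10 ms shift are from the slides; the 16 kHz sampling rate and the function name are illustrative assumptions):

```python
import numpy as np

def frame_signal(x, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a 1-D signal into overlapping frames (frame size 25 ms, shift 10 ms)."""
    frame_len = int(sample_rate * frame_ms / 1000)    # e.g. 400 samples at 16 kHz
    frame_shift = int(sample_rate * shift_ms / 1000)  # e.g. 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(x) - frame_len) // frame_shift)
    return np.stack([x[i * frame_shift : i * frame_shift + frame_len]
                     for i in range(n_frames)])

# Example: one second of (random stand-in) audio at 16 kHz -> 98 frames of 400 samples
frames = frame_signal(np.random.randn(16000))
print(frames.shape)  # (98, 400)
```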

Acoustic feature extraction for ASR
Desirable feature characteristics:
- Capture essential information about the underlying phones
- Compress the information into a compact form
- Factor out information that's not relevant to recognition, e.g. speaker-specific information such as vocal-tract length, channel characteristics, etc.
- Be well-modelled by known distributions (Gaussian models, for example)
The features most widely used in ASR: Mel-frequency Cepstral Coefficients (MFCCs)

MFCC Extraction (block diagram)
Sampled speech signal x(j) → Pre-emphasis → Windowing → DFT → Mel Filterbank → log → iDFT → cepstral coefficients y_t(j) and frame energy e_t → time derivatives → (y_t(j), e_t, Δy_t(j), Δe_t, Δ²y_t(j), Δ²e_t)

Pre-emphasis
Pre-emphasis increases the amount of energy in the high frequencies compared with the lower frequencies. Why? Because of spectral tilt: in voiced speech, the signal has more energy at low frequencies, due to the glottal source. Boosting the high-frequency energy improves phone detection accuracy. Image credit: Jurafsky & Martin, Figure 9.9
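The slides don't give the filter itself, but pre-emphasis is conventionally a first-order high-pass filter y[n] = x[n] − αx[n−1]; a minimal sketch, with the common (assumed) choice α = 0.97:

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """Boost high frequencies: y[n] = x[n] - alpha * x[n-1] (alpha ~ 0.97 is conventional)."""
    return np.append(x[0], x[1:] - alpha * x[:-1])
```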


Windowing
The speech signal is modelled as a sequence of frames (assumption: the signal is stationary across each frame). Windowing: multiply the value of the signal at time n, s[n], by the value of the window at time n, w[n]: y[n] = w[n]s[n]
Rectangular: w[n] = 1 for 0 ≤ n ≤ L−1, and 0 otherwise
Hamming: w[n] = 0.54 − 0.46 cos(2πn/L) for 0 ≤ n ≤ L−1, and 0 otherwise
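Both windows are straightforward to generate and apply per frame; a sketch using the slide's definitions (note that NumPy's built-in np.hamming uses L−1 in the denominator, a minor convention difference):

```python
import numpy as np

L = 400  # frame length in samples (25 ms at an assumed 16 kHz)
n = np.arange(L)

rectangular = np.ones(L)                             # w[n] = 1
hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / L)    # the slide's Hamming definition

# Windowing is a sample-wise product: y[n] = w[n] * s[n]
frame = np.random.randn(L)        # stand-in for one speech frame
windowed = hamming * frame
```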

Windowing: Illustration (figure: the same frame under a rectangular window vs. a Hamming window)


Discrete Fourier Transform (DFT)
Extract spectral information from the windowed signal: compute the DFT of the sampled signal
X[k] = Σ_{n=0}^{N−1} x[n] e^{−j(2π/N)kn}
Input: windowed signal x[1], …, x[N]
Output: complex number X[k] giving the magnitude and phase of the kth frequency component
Image credit: Jurafsky & Martin, Figure 9.12
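In practice the DFT of each real-valued windowed frame is computed with an FFT; a minimal sketch (the 512-point FFT size is an assumed, typical choice):

```python
import numpy as np

def frame_spectrum(windowed_frame, n_fft=512):
    """DFT of a real frame: rfft zero-pads to n_fft and returns X[k] for k = 0..n_fft/2."""
    X = np.fft.rfft(windowed_frame, n=n_fft)
    return np.abs(X), np.angle(X)  # magnitude and phase per frequency component

mag, phase = frame_spectrum(np.random.randn(400))
print(mag.shape)  # (257,)
```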


Mel Filter Bank
The DFT gives the energy at each frequency band. However, human hearing is not equally sensitive at all frequencies: it is less sensitive at higher frequencies. So, warp the DFT output to the mel scale: the mel is a unit of pitch such that sounds which are perceptually equidistant in pitch are separated by the same number of mels.

Mels vs Hertz (figure: mel value as a function of frequency in hertz)

Mel filterbank
The mel filterbank is inspired by speech perception. Mel frequency can be computed from the raw frequency f as: mel(f) = 1127 ln(1 + f/700)
10 filters are spaced linearly below 1 kHz and the remaining filters are spread logarithmically above 1 kHz. [Figure: triangular mel filters (amplitude vs. frequency, 0-4000 Hz) and the resulting mel spectrum. Image credit: Jurafsky & Martin, Figure 9.13]
Take the log of each mel spectrum value, because 1) human sensitivity to signal energy is logarithmic, and 2) the log makes the features robust to input variations.
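A sketch of the mel conversion from the formula above, together with its inverse (the inverse is not on the slides, but it is what's used to place the triangular filters uniformly on the mel scale):

```python
import numpy as np

def hz_to_mel(f):
    """mel(f) = 1127 ln(1 + f/700)"""
    return 1127.0 * np.log(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of the above, used to place filter centres spaced uniformly in mels."""
    return 700.0 * (np.exp(m / 1127.0) - 1.0)

print(hz_to_mel(1000))  # ~1000.0 -- 1 kHz maps to ~1000 mels by design
```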


Cepstrum: Inverse DFT
Recall that speech signals are created when a glottal source of a particular fundamental frequency passes through the vocal tract. The most useful information for phone detection is the vocal-tract filter (and not the glottal source). How do we deconvolve the source and filter to retrieve information about the vocal-tract filter? The cepstrum.

Cepstrum
Cepstrum: the spectrum of the log of the spectrum. [Figure panels: magnitude spectrum → log magnitude spectrum → cepstrum. Image credit: Jurafsky & Martin, Figure 9.14]

Cepstrum
For MFCC extraction, we use the first 12 cepstral values. The different cepstral coefficients tend to be uncorrelated, a useful property when modelling with GMMs in the acoustic model: diagonal covariance matrices will suffice. The cepstrum is formally defined as the inverse DFT of the log magnitude of the DFT of a signal:
c[n] = Σ_{k=0}^{N−1} log( |Σ_{m=0}^{N−1} x[m] e^{−j(2π/N)km}| ) e^{j(2π/N)kn}
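A minimal sketch of this definition, computing the real cepstrum as the inverse DFT of the log magnitude spectrum (the epsilon guarding against log(0) is my addition):

```python
import numpy as np

def real_cepstrum(x, n_fft=512, eps=1e-10):
    """c[n] = IDFT( log |DFT(x)| ) -- low quefrencies capture the vocal-tract filter."""
    log_mag = np.log(np.abs(np.fft.fft(x, n=n_fft)) + eps)
    return np.real(np.fft.ifft(log_mag))

c = real_cepstrum(np.random.randn(400))
# For MFCCs, keep only the first 12 coefficients (computed on the mel-warped spectrum)
```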

MFCC Extraction (block diagram, revisited): the iDFT stage is shown here as a DCT, which is how it is implemented in practice; the pipeline is otherwise unchanged (Sampled speech signal x(j) → Pre-emphasis → Windowing → DFT → Mel Filterbank → log → DCT → time derivatives).

Deltas and double-deltas
From the cepstrum, use 12 cepstral coefficients for each frame. A 13th feature represents the energy of the frame, computed as the sum of the power of the samples in the frame. Also add features related to the change in cepstral features over time, to capture speech dynamics:
Δ_t = Σ_{n=1}^{N} n (c_{t+n} − c_{t−n}) / (2 Σ_{n=1}^{N} n²)
A typical value for N is 2; c_{t+n} and c_{t−n} are the static cepstral coefficients. Add 13 delta features (Δ_t) and 13 double-delta features (Δ²_t).
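A sketch of the delta formula with N = 2; double-deltas are the same operation applied to the delta features (handling the edges by repeating the boundary frames is an assumed convention, not specified on the slides):

```python
import numpy as np

def deltas(c, N=2):
    """delta_t = sum_{n=1..N} n*(c[t+n] - c[t-n]) / (2 * sum_{n=1..N} n^2).
    c has shape (T, D); boundary frames are repeated so output keeps shape (T, D)."""
    padded = np.pad(c, ((N, N), (0, 0)), mode='edge')
    denom = 2 * sum(n * n for n in range(1, N + 1))
    T = len(c)
    return sum(n * (padded[N + n : N + n + T] - padded[N - n : N - n + T])
               for n in range(1, N + 1)) / denom

# double_deltas = deltas(deltas(c))
```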

Recap: MFCCs
Motivated by human speech perception and speech production. For each speech frame: compute the frequency spectrum and apply mel binning; compute the cepstrum using an inverse DFT on the log of the mel-warped spectrum. This gives a 39-dimensional MFCC feature vector: the first 12 cepstral coefficients + energy + 13 deltas + 13 double-deltas.
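For reference, libraries such as librosa package this entire recipe; a hedged usage sketch (librosa's first coefficient c0 plays the role of the energy term, so its defaults differ slightly from the slides' exact recipe):

```python
import numpy as np
import librosa

# y: waveform, sr: sampling rate (here a synthetic 1-second stand-in signal)
y, sr = np.random.randn(16000).astype(np.float32), 16000

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # 13 static coefficients
d1 = librosa.feature.delta(mfcc)                     # 13 deltas
d2 = librosa.feature.delta(mfcc, order=2)            # 13 double-deltas
features = np.vstack([mfcc, d1, d2])                 # 39-dimensional feature vectors
print(features.shape)  # (39, num_frames)
```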

Other features
Neural network-based: bottleneck features (seen in Lecture 10).
- Train a deep NN using conventional acoustic features
- Introduce a narrow hidden layer (e.g. 40 hidden units), referred to as the bottleneck layer
- This forces the neural network to encode the relevant information in the bottleneck layer
- Use the hidden-unit activations in the bottleneck layer as features (a sketch follows)
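A minimal PyTorch-style sketch of such a bottleneck network; all layer sizes here (a spliced 429-dim input, 1024-unit hidden layers, 48 phone classes) are illustrative assumptions, not the lecture's:

```python
import torch.nn as nn

# Assumed sizes: 39-dim MFCCs with +/-5 frames of context spliced in (11 * 39 = 429),
# a 40-unit bottleneck as on the slide, and 48 phone classes as the training targets.
bottleneck_net = nn.Sequential(
    nn.Linear(429, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 40), nn.ReLU(),   # narrow bottleneck layer
    nn.Linear(40, 1024), nn.ReLU(),
    nn.Linear(1024, 48),              # phone-classification output layer
)
# After training on the classification task, the 40-dim activations at the
# bottleneck layer are extracted and used as acoustic features for the ASR system.
```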