Signal Modeling Techniques in Speech Recognition. Hassan A. Kingravi


Outline: Introduction, Spectral Shaping, Spectral Analysis, Parameter Transforms, Statistical Modeling, Discussion, Conclusions.

1: Introduction Signal Modeling: the main focus of the paper. Signal modeling is the process of converting a speech sample to an observation vector in a probability space. Goals: parameters that are perceptually meaningful, invariant, and that capture temporal correlation.

2: Spectral Shaping Analog-to-Digital Conversion: create a data representation of the signal with as high an SNR as possible. Digital Filtering: once conversion is finished, the signal is filtered to emphasize important frequencies, using Finite Impulse Response (FIR) filters. For example, hearing is most sensitive in the 1-kHz region of the spectrum, so this region is emphasized.
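
One common FIR shaping stage (an illustration, not necessarily the exact filter the paper uses) is single-tap pre-emphasis. A minimal Python sketch; the coefficient 0.95 is a conventional choice, not a value from the slides:

    import numpy as np

    def preemphasize(x, alpha=0.95):
        """First-order FIR pre-emphasis: y[n] = x[n] - alpha * x[n-1].

        Boosts high frequencies, offsetting the spectral roll-off of
        voiced speech before further analysis.
        """
        y = np.empty_like(x, dtype=float)
        y[0] = x[0]
        y[1:] = x[1:] - alpha * x[:-1]
        return y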

3: Spectral Analysis Step 1: extraction of fundamental frequency and power. Fundamental frequency is measured via different methods: a) Gold-Rabiner: takes multiple measures of periodicity in the signal and votes among them. b) NSA: average magnitude difference function and DA of multiple voicing measures. c) Dynamic Programming: evaluates several measures of correlation and spectral change in the signal, and computes an optimal fundamental frequency and voicing pattern. d) Cepstrum (sketched below).
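
Of these, the cepstrum method is the easiest to sketch. A minimal, hypothetical version; the 50-400 Hz search range is an assumption, not a value from the paper:

    import numpy as np

    def cepstral_pitch(frame, fs, fmin=50.0, fmax=400.0):
        """Estimate F0 from the peak of the real cepstrum of one frame."""
        spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
        cepstrum = np.fft.irfft(np.log(np.abs(spectrum) + 1e-12))
        # Voiced excitation appears as a peak at the pitch period (quefrency).
        qmin, qmax = int(fs / fmax), int(fs / fmin)
        peak = qmin + np.argmax(cepstrum[qmin:qmax])
        return fs / peak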

3: Spectral Analysis Signal Power: the average of the signal's energy (energy can be thought of as the area underneath the curve). The formulation of power uses a weighting function called a window to favor samples in the center of the window in a frame-by-frame traversal. This, combined with the redundancy of overlapping frames, creates a smoothing effect.

3: Spectral Analysis Hanning window example (figure). Smoothing is important because in some regions the power of a signal varies rapidly.
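
A sketch of the frame-by-frame power computation; the 25 ms frame and 10 ms hop at 16 kHz are illustrative values, not details from the paper. The window favors center samples and the frame overlap supplies the redundancy mentioned above:

    import numpy as np

    def short_time_power(x, frame_len=400, hop=160):
        """Windowed power per frame; overlapping frames smooth the estimate."""
        w = np.hanning(frame_len)
        n_frames = 1 + (len(x) - frame_len) // hop
        power = np.empty(n_frames)
        for i in range(n_frames):
            frame = x[i * hop : i * hop + frame_len] * w
            power[i] = np.mean(frame ** 2)
        return power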

3: Spectral Analysis Step 2: spectral analysis proper. The candidate representations: Digital Filter Bank; Fourier Transform Filter Bank; Cepstral Coefficients; Linear Prediction Coefficients; LP-Derived Filter Bank Amplitudes; LP-Derived Cepstral Coefficients.

Digital Filter Bank Motivations: physiological, i.e. the place theory of hearing, and perceptual, i.e. frequencies of complex sounds within a certain bandwidth cannot be individually identified. Critical Bandwidth: the human auditory system cannot distinguish between frequencies close to one another; the higher the frequency, the larger the interval; this is the critical band for a frequency. Filter Bank: a collection of linear phase FIR bandpass filters arranged linearly across the Bark or mel scales, used in systems that attempt to emulate human auditory processing. Merit: certain filter outputs correlate with certain speech sounds.

Digital Filter Bank (figure).
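
A sketch of a mel-spaced filter bank. The paper describes linear-phase FIR bandpass filters; the triangular-weights-on-FFT-magnitudes construction below is the common shortcut, so treat it as an illustration rather than the paper's method:

    import numpy as np

    def mel_filterbank(n_filters, n_fft, fs):
        """Triangular filters with centers uniformly spaced on the mel scale."""
        mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
        inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
        # Filter edges: uniform in mel, mapped back to Hz, then to FFT bins.
        edges = inv_mel(np.linspace(0.0, mel(fs / 2), n_filters + 2))
        bins = np.floor((n_fft + 1) * edges / fs).astype(int)
        fb = np.zeros((n_filters, n_fft // 2 + 1))
        for i in range(n_filters):
            lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
            fb[i, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
            fb[i, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
        return fb  # apply as fb @ np.abs(np.fft.rfft(frame, n_fft))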

Fourier Transform Filter Bank Motivations: filter banks have non-uniformly spaced frequency samples. If we perform a DFT of the signal, we move to the frequency domain and can evaluate the spectrum at any frequency. The DFT (or the FFT) also oversamples the spectrum; the value for each filter in the bank is calculated as an average over several frequencies, resulting in smoother estimates. Method: sample the spectrum at the filter-bank frequencies given earlier; often, low-amplitude regions are discarded using a dynamic range threshold.
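
The dynamic range threshold is a one-liner once the log spectrum is available. A sketch, with an assumed (not paper-specified) 50 dB floor:

    import numpy as np

    def thresholded_log_spectrum(frame, n_fft=512, dyn_range_db=50.0):
        """Log magnitude spectrum with low-amplitude bins clipped to a floor."""
        mag_db = 20.0 * np.log10(np.abs(np.fft.rfft(frame, n_fft)) + 1e-12)
        floor = mag_db.max() - dyn_range_db
        return np.maximum(mag_db, floor)  # detail below the floor is discarded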

Cepstral Coefficients Motivations: the process is nonlinear. Theory: waves with high frequency (excitation) and waves with low frequency (vocal tract) are superimposed in the spectrum. We need homomorphic techniques obeying a generalized rule of superposition, logarithmic in nature; this separates the excitation and vocal tract shape components. Method: we take the inverse DFT of the log spectral magnitudes, calling this the cepstrum. Low-order terms of the cepstrum represent vocal tract shape, and high-order terms represent excitation.

Cepstral Coefficients (figure).
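
The cepstrum itself is only a few lines; keeping the first dozen or so coefficients (13 is a typical count, not a number from the paper) retains the vocal tract envelope:

    import numpy as np

    def real_cepstrum(frame, n_fft=512):
        """Inverse DFT of the log spectral magnitudes, as described above."""
        log_mag = np.log(np.abs(np.fft.rfft(frame, n_fft)) + 1e-12)
        return np.fft.irfft(log_mag)

    # Low-order terms (vocal tract shape) vs. high-order terms (excitation):
    # envelope_coeffs = real_cepstrum(frame)[:13]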

Linear Prediction Coefficients Method: an autoregressive process that parametrically fits the spectrum; you model the signal as a linear combination of its previous samples, s(n) ~ a_1 s(n-1) + ... + a_p s(n-p), and minimize the prediction error to find the p coefficients a_k, the LPC coefficients. The greater the value of p, the better the model. Ignoring Low-Amplitude Regions: dynamic thresholding using autocorrelation is used; we add a small amount of noise to the signal, which prevents the LP model from modeling sharp nulls in the spectrum.
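
A minimal sketch of the autocorrelation (Yule-Walker) route to the a_k; order p = 12 is a typical choice, and the noise-addition trick from the slide is omitted:

    import numpy as np
    from scipy.linalg import solve_toeplitz

    def lpc(frame, p=12):
        """Solve the Yule-Walker equations R a = r for the p coefficients a_k."""
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        a = solve_toeplitz(r[:p], r[1:p + 1])  # autocorrelation method
        return a  # model: s(n) ~ a[0] s(n-1) + ... + a[p-1] s(n-p)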

LP-Derived Filter Banks Motivations: similar to the other filter banks, except we sample the LP model at the given frequencies instead of the spectrum. This is done because the LP model gives more robust spectral estimates. Caveat: as DSP technology has developed, the difference between the approaches is no longer so great.

LP-Derived Cepstral Coefficients Method: recall that the cepstral coefficients are calculated from the inverse FFT of the log spectral magnitudes. The LP-derived cepstral coefficients are the LP equivalent, where we take the logarithm of the inverse filter (remember, the LP equation is a filter). The number of cepstral coefficients is usually equal to the number of LP coefficients. Difficulties: the coefficients reflect a linear frequency scale; work needs to be done to make it nonlinear.
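
The slide does not show the conversion, but the standard recursion from LP coefficients to the cepstrum of the LP model 1/A(z) is well known; a sketch consistent with the lpc() sketch above:

    import numpy as np

    def lpc_to_cepstrum(a, n_ceps):
        """c_n = a_n + sum_{k=1}^{n-1} (k/n) c_k a_{n-k}, with a_n = 0 for n > p."""
        p = len(a)
        c = np.zeros(n_ceps + 1)  # c[0] (the gain term) is left at zero here
        for n in range(1, n_ceps + 1):
            acc = a[n - 1] if n <= p else 0.0
            for k in range(1, n):
                if n - k <= p:
                    acc += (k / n) * c[k] * a[n - k - 1]
            c[n] = acc
        return c[1:]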

4: Parameter Transforms Motivations: the previous process gives us absolute measurements; now we generate signal parameters from the measurements via differentiation and concatenation. Output: a parameter vector with raw estimates of the signal. Differentiation: using the concept of a gradient, we take approximate derivatives of the measurements to highlight changes in variation (see the sketch below). Concatenation: take as input a large matrix of all the measurements, called X. Using auxiliary matrices, we perform all the necessary filtering operations discussed, including weighting, differentiation, and averaging; this results in a single parameter vector per frame that contains all desired signal parameters.
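
In practice the differentiation step is usually a short regression over neighboring frames ("delta" features); the window width below and the regression form are conventional choices, not details from the paper:

    import numpy as np

    def deltas(X, width=2):
        """Regression-based first differences over +/- `width` frames.

        X holds one row of measurements per frame; edges wrap here for
        brevity, where a real system would pad instead.
        """
        num = sum(k * (np.roll(X, -k, axis=0) - np.roll(X, k, axis=0))
                  for k in range(1, width + 1))
        return num / (2 * sum(k * k for k in range(1, width + 1)))

    # Concatenation then stacks measurements and their derivatives per frame:
    # frames = np.hstack([X, deltas(X)])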

5: Statistical Modeling Motivations: assume the signal parameters are generated by some underlying multivariate probability distribution; the job is to find the model, treating the data as signal observations. Uses topics from: Multivariate Statistical Models; Distance Measures.

Multivariate Statistical Models Motivations: quantities with different numerical scales that are correlated are mixed with each other, so there is a lot of redundancy. Scales can be re-adjusted; correlation is more difficult to deal with. The Whitening Transform: assuming that the process we are modeling is Gaussian in nature, we can use a prewhitening transform to decorrelate the parameters. This involves computing the eigenvalues and eigenvectors of the covariance matrix. The eigenvalues can further be used to determine which eigenvectors are the most significant, leading to a reduction in features. Note: data is needed for the mean and covariance calculations.
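
A sketch of the whitening transform via the eigendecomposition of the sample covariance; dropping eigenvectors with small eigenvalues gives the feature reduction mentioned above:

    import numpy as np

    def whiten(X, n_keep=None):
        """Decorrelate observations (rows of X); keep the top n_keep directions."""
        Xc = X - X.mean(axis=0)
        w, V = np.linalg.eigh(np.cov(Xc, rowvar=False))  # ascending eigenvalues
        order = np.argsort(w)[::-1][:n_keep]  # most significant first
        return (Xc @ V[:, order]) / np.sqrt(w[order] + 1e-12)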

Multivariate Statistical Models Vector Quantization: if the Gaussian model is not appropriate, we need a non-parametric representation: hypothesize a discrete distribution and try to fit it. Vector quantizers compress the distribution; physiological precedents include the assumption that a finite set of stationary vocal tract shapes exists. We need to estimate a vector quantizing codebook Q, which contains a finite set of ideal vectors that will replace an input vector, and then maximize P(y | Q). The latter is simple; the former is done using the K-means algorithm.

Multivariate Statistical Models K-Means Algorithm: given the training data and a number of clusters, say k, place the k points in the space as far apart as possible. Assign each element in the data to the cluster closest to it, then recompute the centroids of the clusters. Repeat until the centroids no longer move. This results in the k groups we need, and the centroids become the components of Q.
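
A plain k-means sketch. For brevity it uses random initialization rather than the farthest-apart placement the slide describes:

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        """Return the codebook Q: k centroids fit to the training vectors X."""
        rng = np.random.default_rng(seed)
        Q = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Assign each vector to its nearest centroid (Euclidean distance).
            d = np.linalg.norm(X[:, None, :] - Q[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            newQ = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                             else Q[j] for j in range(k)])
            if np.allclose(newQ, Q):
                break
            Q = newQ
        return Q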

Distance Measures Motivation: the k-means algorithm requires the definition of a distance measure on the vector space, without which it cannot proceed. Properties: distance measures must satisfy nonnegativity, symmetry, and the triangle inequality. Examples: the most famous is the Euclidean distance. An interesting approximation that is presented is the dot product form of the Euclidean distance, which is very cheap to evaluate: since ||x - q||^2 = ||x||^2 + ||q||^2 - 2 x.q, minimizing the distance over codewords q reduces to maximizing a dot product.
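
A sketch of the resulting nearest-codeword search; with precomputed codeword norms, each query costs only dot products:

    import numpy as np

    def nearest_codeword(x, Q):
        """argmin_q ||x - q||^2  ==  argmax_q (x . q - ||q||^2 / 2)."""
        return int(np.argmax(Q @ x - 0.5 * np.sum(Q * Q, axis=1)))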

Discussion Overview: the author gives a table of methods used in the community as of 1993; not really important. Comments: neural network based classification systems employ filter banks (tying in with cognition and the auditory system); cepstral coefficients are the dominant acoustic measurement; FFT-derived cepstral coefficients are more common than LP-derived ones; the FFT remains popular because of its immunity to noise.

Summary Overview: Presented the various processes involved in the modeling of a signal and its conversion to a meaningful representation, one necessary for analysis.