L7: Linear prediction of speech


Outline: Introduction. Linear prediction. Finding the linear prediction coefficients. Alternative representations.

This lecture is based on [Dutoit and Marques, 2009, ch. 1; Taylor, 2009, ch. 12; Rabiner and Schafer, 2007, ch. 6].

Introduction to Speech Processing, Ricardo Gutierrez-Osuna, CSE@TAMU

Review of speech production

Speech is produced by an excitation signal generated in the throat, which is modified by resonances due to the shape of the vocal, nasal and pharyngeal tracts. The excitation signal can be:
- glottal pulses created by the periodic opening and closing of the vocal folds (voiced speech); these periodic components are characterized by their fundamental frequency (F0), whose perceptual correlate is the pitch,
- continuous air flow pushed by the lungs (unvoiced speech), or
- a combination of the two.
Resonances in the vocal, nasal and pharyngeal tracts are called formants.

On a spectral plot for a speech frame, pitch appears as narrow peaks at the fundamental frequency and its harmonics, while formants appear as wide peaks in the spectral envelope [Dutoit and Marques, 2009].

The source-filter model

Originally proposed by Gunnar Fant in 1960 as a linear model of speech production in which the glottis and the vocal tract are fully uncoupled. According to the model, the speech signal is the output y[n] of an all-pole filter 1/A_p(z) excited by x[n]:

Y(z) = X(z) · 1 / (1 − Σ_{k=1}^{p} a_k z^{−k}) = X(z) / A_p(z)

where Y(z) and X(z) are the z-transforms of the speech and excitation signals, respectively, and p is the prediction order. The filter 1/A_p(z) is known as the synthesis filter, and A_p(z) is called the inverse filter. As discussed before, the excitation signal is either:
- a sequence of regularly spaced pulses, whose period T0 and amplitude σ can be adjusted, or
- white Gaussian noise, whose variance σ² can be adjusted.

[Figure: the source-filter model of speech production; Dutoit and Marques, 2009]

The above equation implicitly introduces the concept of linear predictability, which gives the model its name. Taking the inverse z-transform, the speech signal can be expressed as

y[n] = x[n] + Σ_{k=1}^{p} a_k y[n−k]

which states that each speech sample can be modeled as a weighted sum of the p previous samples plus an excitation contribution. In linear prediction, the term x[n] is usually referred to as the error (or residual) and is often written as e[n] to reflect this.
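The recursion above is easy to run directly. The lecture's own examples are MATLAB scripts (ex7p1.m); the following is a minimal Python sketch with an illustrative one-pole filter and a unit-pulse excitation, not values from the lecture.

```python
# Sketch of the LP synthesis recursion y[n] = x[n] + sum_k a_k * y[n-k].
# The coefficients `a` and excitation `x` below are toy illustrative values.

def lp_synthesize(a, x):
    """Run the all-pole synthesis filter 1/A_p(z) over an excitation x."""
    p = len(a)
    y = []
    for n in range(len(x)):
        # weighted sum of the p previous output samples (zero initial state)
        past = sum(a[k] * y[n - 1 - k] for k in range(p) if n - 1 - k >= 0)
        y.append(x[n] + past)
    return y

# Example: a single unit pulse excites a one-pole filter with a_1 = 0.5
print(lp_synthesize([0.5], [1.0, 0.0, 0.0, 0.0]))  # [1.0, 0.5, 0.25, 0.125]
```

For voiced speech the excitation would instead be a pulse train at the pitch period, and for unvoiced speech white Gaussian noise, exactly as the model prescribes.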

Inverse filter

For a given speech signal y[n], and given the LP parameters a_i, the residual e[n] can be estimated as

e[n] = y[n] − Σ_{k=1}^{p} a_k y[n−k]

which is simply the output of the inverse filter A_p(z) excited by the speech signal (see figure below). Hence, the LP model also allows us to obtain an estimate of the excitation signal that led to the speech signal. One then expects e[n] to approximate a sequence of pulses (for voiced speech) or white Gaussian noise (for unvoiced speech).

[Figure: block diagram of the inverse filter A_p(z) mapping y[n] to e[n]; Dutoit and Marques, 2009]
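Inverse filtering is a one-line difference equation. As a hedged illustration (toy coefficients, not the lecture's MATLAB code), the sketch below recovers a pulse-like residual from a signal that was generated by a one-pole model with a_1 = 0.5:

```python
# Sketch of the inverse (analysis) filter e[n] = y[n] - sum_k a_k * y[n-k].

def lp_residual(a, y):
    """Apply the inverse filter A_p(z) to speech y[n] to get the residual e[n]."""
    p = len(a)
    e = []
    for n in range(len(y)):
        # linear prediction of the current sample from the p previous ones
        pred = sum(a[k] * y[n - 1 - k] for k in range(p) if n - 1 - k >= 0)
        e.append(y[n] - pred)
    return e

# The impulse response of 1/(1 - 0.5 z^-1) inverse-filters back to a pulse
print(lp_residual([0.5], [1.0, 0.5, 0.25, 0.125]))  # [1.0, 0.0, 0.0, 0.0]
```

Note that the analysis filter exactly undoes the synthesis filter when the model coefficients match, which is the whole point of the source-filter decomposition.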

Finding the LP coefficients

How do we estimate the LP parameters? We seek the model parameters a_i that minimize the expected residual energy:

a_i^{opt} = arg min E[e²[n]]

Two closely related techniques are commonly used:
- the covariance method
- the autocorrelation method

The covariance method

Using E to denote the sum of squared errors over a frame of N samples, we can state

E = Σ_{n=0}^{N−1} e²[n] = Σ_{n=0}^{N−1} ( y[n] − Σ_{k=1}^{p} a_k y[n−k] )²

We can then find the minimum of E by differentiating with respect to each coefficient a_j and setting the result to zero:

∂E/∂a_j = −2 Σ_{n=0}^{N−1} ( y[n] − Σ_{k=1}^{p} a_k y[n−k] ) y[n−j] = 0,  j = 1, 2, …, p

which gives

Σ_{n=0}^{N−1} y[n] y[n−j] = Σ_{k=1}^{p} a_k Σ_{n=0}^{N−1} y[n−k] y[n−j]

Defining φ(j, k) as

φ(j, k) = Σ_{n=0}^{N−1} y[n−j] y[n−k]

this expression can be written more succinctly as

φ(j, 0) = Σ_{k=1}^{p} φ(j, k) a_k

or, in matrix notation, as

[ φ(1,0) ]   [ φ(1,1) φ(1,2) … φ(1,p) ] [ a_1 ]
[ φ(2,0) ] = [ φ(2,1) φ(2,2) … φ(2,p) ] [ a_2 ]
[   ⋮    ]   [   ⋮      ⋮         ⋮   ] [  ⋮  ]
[ φ(p,0) ]   [ φ(p,1) φ(p,2) … φ(p,p) ] [ a_p ]

or even more compactly as ψ = Φa. Since Φ is symmetric, this system of equations can be solved efficiently using a Cholesky decomposition, in O(p³) operations.

NOTES

- This method is known as the covariance method (for somewhat unclear historical reasons).
- The method computes the error over the region 0 ≤ n ≤ N−1, but to do so uses speech samples from the region −p ≤ n ≤ N−1: to compute the error at n = 0, one needs samples back to y[−p].
- No special windowing function is needed for this method.
- If the signal truly follows an all-pole model, the covariance method can produce an exact solution.
- In contrast, the method we will see next is suboptimal, but leads to more efficient and stable solutions.

The autocorrelation method

The autocorrelation function of a signal can be defined as

R(j) = Σ_{m=−∞}^{∞} y[m] y[m−j]

This expression is similar to that of φ(j, k) in the covariance method, but the sum extends over ±∞ rather than over the range 0 ≤ n ≤ N−1:

φ(j, k) = Σ_n y[n−j] y[n−k]

To perform the calculation over ±∞, we window the speech signal (e.g., with a Hann window), which sets to zero all values outside 0 ≤ n ≤ N−1. The error e[n] is then zero before the window and beyond p samples after it, so the calculation over ±∞ can be rewritten as

φ(j, k) = Σ_{n=0}^{N−1+p} y[n−j] y[n−k]

which in turn can be rewritten as

φ(j, k) = Σ_{n=0}^{N−1−(j−k)} y[n] y[n+j−k]

thus

φ(j, k) = R(j − k)

which allows us to write φ(j, 0) = Σ_{k=1}^{p} φ(j, k) a_k as

R(j) = Σ_{k=1}^{p} R(j − k) a_k

The resulting matrix equation

[ R(1) ]   [ R(0)   R(1)   … R(p−1) ] [ a_1 ]
[ R(2) ] = [ R(1)   R(0)   … R(p−2) ] [ a_2 ]
[  ⋮   ]   [  ⋮      ⋮         ⋮    ] [  ⋮  ]
[ R(p) ]   [ R(p−1) R(p−2) … R(0)   ] [ a_p ]

now involves a Toeplitz matrix (symmetric, with all elements on each diagonal identical), which is significantly easier to invert. In particular, the Levinson-Durbin recursion provides a solution in O(p²) operations.
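The Levinson-Durbin recursion itself is short. Below is a sketch in Python (the toy autocorrelation sequence, corresponding to a first-order process with a_1 = 0.5, is my example, not the lecture's):

```python
def levinson_durbin(R, p):
    """Solve the Toeplitz normal equations R(j) = sum_k a_k R(j-k) in O(p^2).

    R is the autocorrelation sequence R(0)..R(p).
    Returns (a, E): predictor coefficients a_1..a_p and residual energy E.
    """
    a = [0.0] * p
    E = R[0]
    for i in range(1, p + 1):
        # reflection coefficient for order i
        k = (R[i] - sum(a[j] * R[i - 1 - j] for j in range(i - 1))) / E
        a_new = a[:]
        a_new[i - 1] = k
        for j in range(i - 1):
            # update the lower-order coefficients
            a_new[j] = a[j] - k * a[i - 2 - j]
        a = a_new
        E *= (1 - k * k)          # residual energy shrinks at each order
    return a, E

# Toy autocorrelation of a first-order process: R(j) proportional to 0.5^j
coeffs, err = levinson_durbin([1.0, 0.5, 0.25], 2)
```

As a by-product, the recursion produces one reflection coefficient k_i per order, which connects directly to the PARCOR representation discussed later in this lecture.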

Speech spectral envelope and the LP filter

The frequency response of the LP filter can be found by evaluating the transfer function on the unit circle at angles 2πf/f_s, that is,

H(e^{j2πf/f_s}) = G / ( 1 − Σ_{k=1}^{p} a_k e^{−j2πkf/f_s} )

Remember that this all-pole filter models the resonances of the vocal tract, and that the glottal excitation is captured in the residual e[n]. Therefore, the frequency response of 1/A_p(z) is smooth and free of pitch harmonics. This response is generally referred to as the spectral envelope.
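Evaluating this transfer function at a given frequency is a direct translation of the formula. A minimal Python sketch (one-pole toy filter and unit gain are my assumptions for illustration):

```python
import cmath

def lp_envelope(a, G, f, fs):
    """|H(e^{j 2 pi f / fs})| = G / |1 - sum_k a_k e^{-j 2 pi k f / fs}|."""
    w = 2 * cmath.pi * f / fs
    # A_p evaluated on the unit circle at angle w
    A = 1 - sum(ak * cmath.exp(-1j * w * (k + 1)) for k, ak in enumerate(a))
    return G / abs(A)

# One-pole filter with a_1 = 0.5, G = 1, fs = 8000 Hz:
# gain is 1/(1-0.5) = 2 at DC and 1/(1+0.5) = 2/3 at the Nyquist frequency
dc_gain = lp_envelope([0.5], 1.0, 0.0, 8000.0)
nyq_gain = lp_envelope([0.5], 1.0, 4000.0, 8000.0)
```

Sweeping f over a dense grid and plotting the result in dB reproduces the smooth, harmonic-free spectral envelope described above.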

How many LP parameters should be used?

The next slide shows the spectral envelope for p = 12 and p = 40, and the reduction in mean-squared error over a range of model orders.

- At p = 12 the spectral envelope captures the broad spectral peaks (i.e., the formants), whereas at p = 40 the envelope also begins to capture the harmonic structure.
- Notice also that the MSE curve flattens out above about p = 12, decreasing only modestly thereafter.

Also consider the various factors that contribute to the speech spectrum:
- a resonance structure comprising roughly one resonance per 1 kHz of bandwidth, each resonance requiring one complex pole pair, and
- a low-pass glottal pulse spectrum and a high-pass radiation characteristic at the lips, which can be modeled by 1-2 additional complex pole pairs.

This leads to a rule of thumb of p = 4 + f_s/1000, or about 10-12 LP coefficients for a sampling rate of f_s = 8 kHz.

[Figure: spectral envelopes for different model orders and prediction error vs. p; Rabiner and Schafer, 2007]

[Figure: http://www.phys.unsw.edu.au/jw/graphics/voice3.gif]

Examples

ex7p1.m
- Computing linear prediction coefficients
- Estimating the spectral envelope as a function of the number of LP coefficients
- Inverse filtering with LP filters
- Speech synthesis with simple excitation models (white noise and pulse trains)

ex7p2.m
- Repeats the above at the sentence level

Alternative representations

A variety of equivalent representations can be obtained from the parameters of the LP model. This matters because the LP coefficients a_i are hard to interpret and overly sensitive to numerical precision. Here we review some of these alternative representations and how they can be derived from the LP model:
- Root pairs
- Line spectral frequencies
- Reflection coefficients
- Log-area ratio coefficients
Additional representations (e.g., cepstrum, perceptual linear prediction) will be discussed in a different lecture.

Root pairs

The polynomial A_p(z) can be factored into complex-conjugate root pairs, each of which represents a resonance in the model. These roots (the poles of the LP transfer function) are relatively stable and numerically well behaved. The example in the next slide shows the roots of a 12th-order model:
- eight of the roots (4 pairs) lie close to the unit circle, indicating that they model formant frequencies, while
- the remaining four roots lie well within the unit circle, meaning they only provide the overall spectral shaping due to glottal and radiation influences.
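Factoring A_p(z) and reading off candidate formants is straightforward with a polynomial root finder. A hedged Python/NumPy sketch follows; the second-order toy filter (a single pole pair at radius 0.9 and 1 kHz) is my illustrative example:

```python
import numpy as np

def lp_root_pairs(a, fs):
    """Roots of A_p(z) = 1 - sum_k a_k z^{-k}.

    Returns (frequency in Hz, radius) for one root of each conjugate pair,
    sorted by frequency. Roots near the unit circle suggest formants.
    """
    poly = np.concatenate(([1.0], -np.asarray(a)))   # coefficients of z^0..z^{-p}
    out = []
    for r in np.roots(poly):
        if r.imag > 0:                               # keep one root per conjugate pair
            out.append((np.angle(r) * fs / (2 * np.pi), abs(r)))
    return sorted(out)

# A pole pair at radius 0.9 and angle pi/4 gives
# A(z) = 1 - 2*0.9*cos(pi/4) z^-1 + 0.81 z^-2, i.e. a resonance at 1 kHz for fs = 8 kHz
pairs = lp_root_pairs([2 * 0.9 * np.cos(np.pi / 4), -0.81], 8000.0)
```

The radius of each root determines the resonance bandwidth: radii close to 1 give the narrow formant peaks, while smaller radii give only broad spectral shaping, matching the description above.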

[Figure: roots of A_p(z), P(z) and Q(z) in the z-plane; McLoughlin and Chance, 1997; Rabiner and Schafer, 2007]

Line spectral frequencies (LSF)

A more desirable alternative to quantizing the roots of A_p(z) is based on the so-called line spectrum pair polynomials

P(z) = A(z) + z^{−(p+1)} A(z^{−1})
Q(z) = A(z) − z^{−(p+1)} A(z^{−1})

whose average, (P(z) + Q(z))/2, recovers the original A_p(z). The roots of P(z), Q(z) and A_p(z) are shown in the previous slide. All the roots of P(z) and Q(z) lie on the unit circle, and their angles in the z-plane are known as the line spectral frequencies. The LSFs are close together when the roots of A_p(z) are close to the unit circle; in other words, the presence of two close LSFs is indicative of a strong resonance (see previous slide). LSFs are not overly sensitive to quantization noise and preserve filter stability, so they are widely used for quantizing LP filters.
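The LSFs can be computed by forming P(z) and Q(z) from the coefficient vector of A(z) and taking the angles of their roots. A Python/NumPy sketch (the first-order toy filter and the tolerance used to discard the trivial roots at z = ±1 are my assumptions):

```python
import numpy as np

def line_spectral_frequencies(a):
    """LSFs (in radians) of A_p(z) = 1 - sum_k a_k z^{-k}.

    Forms P(z) = A(z) + z^{-(p+1)} A(z^{-1}) and
          Q(z) = A(z) - z^{-(p+1)} A(z^{-1}),
    then returns the sorted angles of their upper-half-plane roots.
    """
    A = np.concatenate(([1.0], -np.asarray(a)))        # A(z) in powers of z^{-1}
    # z^{-(p+1)} A(z^{-1}) has the reversed coefficients, shifted by one
    P = np.concatenate((A, [0.0])) + np.concatenate(([0.0], A[::-1]))
    Q = np.concatenate((A, [0.0])) - np.concatenate(([0.0], A[::-1]))
    angles = []
    for poly in (P, Q):
        for r in np.roots(poly):
            ang = np.angle(r)
            if 1e-9 < ang < np.pi - 1e-9:              # drop trivial roots at z = +/-1
                angles.append(ang)
    return sorted(angles)

# First-order example: A(z) = 1 - 0.5 z^-1 gives P(z) = z^-2 - z^-1 + 1,
# whose upper-half-plane root sits at angle pi/3
lsfs = line_spectral_frequencies([0.5])
```

Multiplying each angle by f_s/(2π) converts the LSFs to Hz, which is the form typically quantized in speech coders.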

Reflection coefficients (a.k.a. PARCOR)

The reflection coefficients represent the fraction of energy reflected at each section of a non-uniform tube model of the vocal tract. They are a popular choice of LP representation for several reasons:
- they are easily computed as a by-product of the Levinson-Durbin recursion,
- they are robust to quantization error, and
- they have a physical interpretation, making them amenable to interpolation.

Reflection coefficients may be obtained from the predictor coefficients through the following backward recursion, initialized with a_j^{(p)} = a_j:

r_i = a_i^{(i)}
a_j^{(i−1)} = ( a_j^{(i)} + r_i a_{i−j}^{(i)} ) / ( 1 − r_i² ),  1 ≤ j < i

for i = p, …, 1.
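The backward (step-down) recursion translates directly to code. A Python sketch (the toy coefficient vector, built from reflection coefficients 0.5 and 0.2, is my example):

```python
def reflection_coefficients(a):
    """Step-down recursion from predictor coefficients a_1..a_p to
    reflection coefficients r_1..r_p."""
    a = list(a)
    p = len(a)
    r = [0.0] * p
    for i in range(p, 0, -1):
        ki = a[i - 1]                  # r_i = a_i^{(i)}
        r[i - 1] = ki
        if i > 1:
            # a_j^{(i-1)} = (a_j^{(i)} + r_i a_{i-j}^{(i)}) / (1 - r_i^2)
            a = [(a[j] + ki * a[i - 2 - j]) / (1 - ki * ki) for j in range(i - 1)]
    return r

# Forward Levinson with k_1 = 0.5, k_2 = 0.2 yields a = [0.4, 0.2];
# the step-down recursion recovers the reflection coefficients
r = reflection_coefficients([0.4, 0.2])
```

Since stability of 1/A_p(z) is equivalent to |r_i| < 1 for all i, this recursion also doubles as a cheap stability check for a quantized LP filter.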

Log-area ratios

Log-area ratio (LAR) coefficients are the natural logarithm of the ratio of the areas of adjacent sections of a lossless tube equivalent to the vocal tract (i.e., one having the same transfer function). While it is possible to estimate the ratio of adjacent section areas, it is not possible to recover the absolute areas themselves. Log-area ratios can be found from the reflection coefficients as

g_k = ln( (1 − r_k) / (1 + r_k) )

where g_k is the LAR and r_k is the corresponding reflection coefficient.
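The mapping from reflection coefficients to LARs is a one-liner; a brief Python sketch (the sample reflection values are illustrative):

```python
import math

def log_area_ratios(r):
    """LAR g_k = ln((1 - r_k) / (1 + r_k)) from reflection coefficients r_k.

    Valid for |r_k| < 1, which holds for any stable LP filter.
    """
    return [math.log((1 - rk) / (1 + rk)) for rk in r]

# r = 0 (uniform tube section) maps to g = 0; r = 0.5 maps to ln(1/3)
g = log_area_ratios([0.0, 0.5])
```

The logarithm warps the reflection coefficients so that quantization error is spread more evenly near |r_k| = 1, which is one reason LARs were used in early standardized coders.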