Speech Coding Speech Processing Tom Bäckström Aalto University October 2015
Introduction Speech coding is the digital compression of speech signals for telecommunication (and storage) applications. There are over 7.6 billion active mobile subscriptions and more than 3.7 billion unique subscribers (https://gsmaintelligence.com/): more mobile subscriptions than people, and more than half the population owns a mobile phone. Speech coding is the biggest speech processing application. This demonstrates that speech coding is important and that it works with sufficient quality, and also that further improvements will have a major impact.
Introduction Generations in mobile networks: 1978 1G First widely deployed analog cellular system, AMPS. 1991 2G First digital systems, GSM and CDMA. 2001 3G EDGE and HSPA systems with increased data rates. 2009 4G LTE systems with even faster data. 2014 More mobile phones than people. 201? 5G Internet of things?
Introduction Some speech coding standards

Name     Year  Bit-rate              Bandwidth
NMT      1981  Analog                3.5 kHz
GSM      1987  13 kbit/s             3.5 kHz
EFR      1996  12.2 kbit/s           3.5 kHz
AMR      1999  4.75...12.2 kbit/s    3.5 kHz
AMR-WB   2001  6.6...23.85 kbit/s    7 kHz
AMR-WB+  2001  5.2...48 kbit/s       7 kHz
EVS      2014  5.9...128 kbit/s      3.5...48 kHz

AMR has been the most successful codec of all time and is still widely used. Deployment of AMR-WB is progressing (started 2006, but still not finished). EVS is the first codec with native support for IP networks (LTE); deployment could start in 2016.
Quality comparison as a function of bitrate Clean speech (figure: quality vs. bitrate for bandwidths NB 3.5 kHz, WB 7 kHz, SWB 16 kHz, FB 22 kHz)
Quality comparison as a function of bitrate Noisy speech (figure: quality vs. bitrate for bandwidths NB 3.5 kHz, WB 7 kHz, SWB 16 kHz, FB 22 kHz)
Quality comparison as a function of bitrate Mixed content (music and speech) (figure: quality vs. bitrate for bandwidths NB 3.5 kHz, WB 7 kHz, SWB 16 kHz, FB 22 kHz)
Speech Production Modelling (Block diagram: impulse train, long-time predictor F_0(z), linear predictor A(z), residual E(z), output X(z).) In the speech production modelling lecture we already presented the source-filter model (above), where a linear predictor A(z) models the acoustic effect of the vocal tract, a long-time predictor F_0(z) models the periodic structure of the excitation (the fundamental frequency), and the residual (excitation) E(z) is modelled with a noise codebook. The approach is known as Code-Excited Linear Prediction (CELP).
Speech Production Modelling To be accurate, the long-time predictor is applied as a filter, whereby it is not an additive but a cascaded element.¹ (Block diagram: residual codebook -> long-time predictor -> linear predictor -> speech.) The entire model can then be written as X(z) = A^{-1}(z) F_0^{-1}(z) E(z), or equivalently x_n = h_n * f_n * e_n, where h_n is the impulse response of A^{-1}(z) = H(z). Next, each of these components is presented in detail. ¹ The source-filter model is a good model and it is used in some applications. However, although it is often advertised that speech coding is also based on this approach, that is not entirely accurate.
Linear prediction Linear prediction is a model of the spectral envelope of a speech signal: it models the overall shape of the spectrum. It can be interpreted as a tube model of the vocal tract, but estimating the parameters of a tube model is difficult. Besides, by modelling everything in the spectral envelope, we do not need separate models for other effects such as the glottal excitation. Estimation of linear predictors was already discussed in a previous lecture. The residual A(z)X(z) = F_0^{-1}(z)E(z) of linear prediction has only a harmonic structure and/or noise.
Linear prediction Quantization and Coding Our task is to quantize the parameters of the linear predictive filter A(z) and encode them. In that process, the aim is to minimize the perceptually weighted error between the original filter A^{-1}(z) and the quantized filter Â^{-1}(z), min ‖W(z)(A^{-1}(z) - Â^{-1}(z))‖², where W(z) is the perceptual weighting filter. Such an objective ensures that the perceptual effect of quantization is minimized, whereby the receiver can reconstruct a speech signal that maximally resembles the original signal.
Linear prediction Quantization and Coding A linear predictor is defined as A(z) = 1 + Σ_{k=1}^{m} α_k z^{-k}, whereby its parameters are the m scalars α_k. These parameters entirely describe the linear predictor, whereby our objective is to quantize them. Unfortunately, the α_k are sensitive to errors: small errors in α_k can have a big effect on the output (the problem is highly non-linear), and the stability of the predictor cannot be guaranteed (the magnitude of predicted values can grow without bound). Direct quantization of the α_k is not feasible.
Linear prediction Quantization and Coding The best available method for quantization of linear predictors is based on a transform known as the line spectral pair (LSP) decomposition. The predictor is split into two parts, one symmetric and the other antisymmetric: P(z) = A(z) + z^{-m-1} A(z^{-1}) and Q(z) = A(z) - z^{-m-1} A(z^{-1}), where z^{-m-1} A(z^{-1}) is merely the backwards polynomial, with the coefficients α_k in reverse order. It follows that the original predictor can be reconstructed as A(z) = ½[P(z) + Q(z)]. The line spectral polynomials P(z) and Q(z) thus contain all the information of A(z).
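The decomposition and the reconstruction A(z) = ½[P(z) + Q(z)] can be verified numerically; a minimal sketch with numpy, using an illustrative second-order predictor:

```python
import numpy as np

# Illustrative predictor A(z) = 1 - 1.2 z^-1 + 0.5 z^-2 (coefficients assumed).
a = np.array([1.0, -1.2, 0.5])           # order m = 2

# z^-(m+1) A(z^-1) is the coefficient vector reversed, after appending a zero
# to make room for the extra delay.
a_ext = np.concatenate([a, [0.0]])
p = a_ext + a_ext[::-1]                  # P(z), symmetric coefficients
q = a_ext - a_ext[::-1]                  # Q(z), antisymmetric coefficients

# Reconstruction A(z) = (P(z) + Q(z)) / 2 recovers the original predictor.
a_rec = ((p + q) / 2)[:-1]
```

The symmetry of P(z) and antisymmetry of Q(z) are what force their roots onto the unit circle.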
Linear prediction Quantization and Coding The line spectral polynomials P(z) and Q(z) have the following properties: if A(z) is stable, then the roots of P(z) and Q(z) are alternating (interlaced) on the unit circle; conversely, if the roots of P(z) and Q(z) are alternating on the unit circle, then the reconstructed A(z) is stable. Small errors in the locations of the roots of P(z) and Q(z) cause only small errors in the output. It follows that the angles of the roots, the line spectral frequencies, entirely describe the predictor and are robust to quantization errors: a perfect domain for quantization and coding.
Quantization and Coding LSF (Figure: z-plane with the roots of A(z), P(z) and Q(z); the roots of P(z) and Q(z) are interlaced on the unit circle.)
Quantization and Coding LSF (Figure: magnitude spectrum (dB) of A(z) over 0...6 kHz, with the line spectral frequencies of P(z) and Q(z) marked.)
Quantization and Coding Background Given a predictor of order M, we obtain M line spectral frequencies which exactly describe the predictor and thus the spectral envelope. We then need to quantize and code the frequencies such that the envelope can be transmitted with as few bits and as high accuracy as possible. In general, spectra can have a very complicated structure, whereby straightforward methods such as direct quantization with entropy coding are suboptimal. Vector quantization and coding gives (under some loose assumptions) always optimal performance, at the cost of higher complexity.
Quantization and Coding Idea Vector coding is based on the idea of a codebook representation of the signal. If x is an input vector and the vectors c_k with k ∈ S form the codebook, then we can find the best match k* = arg min_k d(x, c_k), (1) whereby c_{k*} ≈ x. We then need to transmit only the codebook index k*.
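The search in Eq. (1) is a plain nearest-neighbour search; a minimal sketch with numpy, with the squared Euclidean distance as d(·) and an illustrative codebook:

```python
import numpy as np

def vq_index(x, codebook):
    # d(x, c_k) = squared Euclidean distance; return the best index k*.
    return int(((codebook - x) ** 2).sum(axis=1).argmin())

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])  # assumed codebook
k_star = vq_index(np.array([0.9, 1.2]), codebook)          # nearest: [1, 1]
```

Only `k_star` is transmitted; the decoder holds the same codebook and looks up c_{k*}.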
Quantization and Coding Complexity An inherent problem of vector coding is complexity. If the codebook has N elements, then we need to calculate the distance between the input vector x and N codebook vectors. If we have 30 bits for the codebook, then N = 2^30; with dimension M = 16 we then need about 2^34 operations to determine k*. Infeasible! In practice, we can use a layered structure: we start by quantizing x roughly with a small codebook (e.g. N = 2^9 = 512), and then proceed to quantize the estimation error with further small codebooks (multi-stage VQ). With three stages, each with N = 512, we reduce the complexity to about 3 · 2^13 operations, which is feasible. A multi-stage VQ is sub-optimal, but the reduction in accuracy is reasonably small.
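The multi-stage idea can be sketched as follows: each stage quantizes the previous stage's error with its own small codebook. The codebooks here are random, purely for illustration; trained codebooks would be used in practice.

```python
import numpy as np

def vq(x, codebook):
    return int(((codebook - x) ** 2).sum(axis=1).argmin())

def msvq_encode(x, stages):
    """Quantize x stage by stage; transmit one index per stage."""
    indices, residual = [], x.astype(float).copy()
    for cb in stages:
        k = vq(residual, cb)
        indices.append(k)
        residual -= cb[k]          # the next stage codes the remaining error
    return indices

def msvq_decode(indices, stages):
    return sum(cb[k] for k, cb in zip(indices, stages))

rng = np.random.default_rng(0)
# Three stages of 512 vectors: 3 * 512 distance evaluations instead of 512^3.
stages = [rng.normal(0.0, 2.0 ** -i, (512, 16)) for i in range(3)]
x = rng.normal(0.0, 1.0, 16)
idx = msvq_encode(x, stages)
x_hat = msvq_decode(idx, stages)
```

The decreasing stage variances mimic the shrinking magnitude of the error from stage to stage.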
Quantization and Coding Training The question then remains: how do we choose the codebook c_k? If the input data were easy to model, we could design such a model and obtain the same performance with lower complexity, but envelopes have a complicated structure. We must therefore train the codebook from data. We would like to find the solution to {c_k} = arg min_{c_k} E_n[min_k d(x_n, c_k)], (2) where E_n[·] is the expectation over all input vectors x_n and d(·) is a distance measure. This is a complicated minimization problem!
Quantization and Coding Training/EM Direct iterative estimation, Algorithm 1 (k-means, or expectation maximisation): Define an initial-guess codebook {c_k} as, for example, K randomly chosen vectors from the database of x_n. Repeat: find the best-matching codebook vector for each x_n; update each codebook vector c_k to be the mean (centroid) of all vectors x_n assigned to it; until converged. Demo!
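Algorithm 1 can be sketched in a few lines of numpy; the two-cluster toy data is illustrative only:

```python
import numpy as np

def train_codebook(data, K, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    # Initial guess: K randomly chosen vectors from the database.
    codebook = data[rng.choice(len(data), K, replace=False)].copy()
    for _ in range(iters):
        # Assignment step: best-matching codevector for every input vector.
        labels = ((data[:, None] - codebook[None]) ** 2).sum(-1).argmin(1)
        # Update step: each codevector becomes the centroid of its cell.
        for k in range(K):
            members = data[labels == k]
            if len(members):
                codebook[k] = members.mean(0)
    return codebook

# Toy database: two well-separated clusters.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(3, 0.1, (50, 2))])
codebook = train_codebook(data, K=2)
```

On this data the two codevectors converge to the two cluster centroids, which is exactly the minimizer of Eq. (2) for the squared-error distance.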
Quantization and Coding Training/Split-VQ An algorithm with better convergence, Algorithm 2 (split vector quantization): Define an initial-guess codebook {c_k} as, for example, two randomly chosen vectors from the database of x_n, and apply the k-means algorithm on this codebook. Repeat: split each codevector c_k into two vectors c_k - ε and c_k + ε; apply the k-means algorithm on the enlarged codebook; until the codebook is full. Demo!
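A sketch of the splitting procedure; here the codebook is started from the global mean rather than two random vectors, and ε and the toy data are illustrative choices:

```python
import numpy as np

def kmeans_step(data, codebook, iters=10):
    for _ in range(iters):
        labels = ((data[:, None] - codebook[None]) ** 2).sum(-1).argmin(1)
        for k in range(len(codebook)):
            members = data[labels == k]
            if len(members):
                codebook[k] = members.mean(0)
    return codebook

def split_vq(data, K_target, eps=1e-2):
    codebook = data.mean(0, keepdims=True)        # single initial centroid
    while len(codebook) < K_target:
        # Split each codevector c_k into c_k - eps and c_k + eps, then refine.
        codebook = np.vstack([codebook - eps, codebook + eps])
        codebook = kmeans_step(data, codebook)
    return codebook

rng = np.random.default_rng(2)
data = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(3, 0.1, (50, 2))])
codebook = split_vq(data, K_target=2)
```

Because each split starts from an already-refined smaller codebook, the method is much less sensitive to the initial guess than plain k-means.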
Quantization and Coding Training/Split-VQ Many improved training algorithms exist; vector quantization was a very active field of research up to the 90s. A classic book is Gersho and Gray, Vector Quantization and Signal Compression (1992). VQ remains the optimal approach in terms of efficiency. Decorrelation is a newer alternative approach which attempts to extract orthogonal directions such that they can be modelled and quantized independently; it is almost optimal but has low complexity. It is not yet in wide-spread use.
Linear prediction Summary Linear prediction can be used to model the spectral envelope of speech signals. Line spectral frequencies are the most common and effective representation of linear predictors for quantization. Vector quantization and coding is applied to the line spectral frequencies.
Long-time Prediction Assumptions and Objectives Voiced sounds have a quasi-periodic structure, caused by the oscillations of the vocal folds. The fundamental frequency is assumed to be slowly changing; however, the rate of change is much faster than in most other types of audio: usually less than 10 octaves/second, but sometimes up to 15 octaves/second can be observed (≈ one semitone during a 5 ms sub-frame). The pitch range is usually something like 85 to 400 Hz, and the perceptual pitch resolution is roughly 2 Hz. The objective is to model a feature of speech production, the pitch, to enable coding with high efficiency. Experience has shown that long-time prediction is a very efficient tool for source modelling (it has a huge impact on SNR).
Long-time Prediction Vocabulary The fundamental frequency model is known by many names: long-time prediction (LTP), adaptive codebook (ACBK), fundamental frequency (F0) model, impulse train.
Long-time Prediction Definition A long-time predictor (LTP) can be defined by the pitch lag T and the gain factor γ_P as F_0(z) = 1 - γ_P z^{-T}. In the time domain we can predict a future sample of x_n as x̂_n = γ_P x_{n-T}. With a vector x_k = [x_{kN}, x_{kN+1}, ..., x_{kN+N-1}]^T we thus obtain x̂_k = γ_P x_{k-T}, where x_{k-T} is the same vector delayed by T samples.
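For a perfectly periodic signal the predictor x̂_n = γ_P x_{n-T} is exact; a small numerical check (the sampling rate, tone and lag are illustrative):

```python
import numpy as np

fs = 8000                                  # sampling rate (assumed)
n = np.arange(400)
x = np.sin(2 * np.pi * 100 * n / fs)       # 100 Hz tone -> pitch lag T = 80

T, gamma_p = 80, 1.0
x_hat = gamma_p * x[:-T]                   # predict x_n from x_{n-T}
err = x[T:] - x_hat                        # near zero: prediction is exact here
```

For real voiced speech the periodicity is only approximate, so γ_P < 1 and the prediction error is small but nonzero.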
Long-time Prediction Codebook A long-time predictor can thus be interpreted as a vector codebook, where the codebook entries are the delayed vectors x_{k-T} with different delays T. Since the past signal keeps changing from frame to frame, the codebook is signal-adaptive. This explains why long-time predictors are often known as the adaptive codebook.
Long-time Prediction Optimization We want to optimize the quality of the output signal x_k = H e_k, where x_k is a vector of the output signal and H is the convolution matrix corresponding to linear prediction. Perceptual weighting can be applied by multiplying with a weighting matrix W, such that our optimization task is min ‖B(e_k - ê_k)‖², where ê_k is the quantized residual and B = WH. We want to model the residual with long-time prediction, whereby ê_k = γ_P e_{k-T} and min ‖B(e_k - γ_P e_{k-T})‖², which gives the optimal T and γ_P.
Long-time Prediction Optimization (advanced topic) The objective function has a multiplicative term γ_P e_{k-T}, whereby it has no simple analytic solution; we must use a trick to find the optimal parameters. The optimal value of γ_P can be found by setting the derivative to zero, 0 = (∂/∂γ_P) ‖B(e_k - γ_P e_{k-T})‖² = -2 e_{k-T}^T B^T B(e_k - γ_P e_{k-T}), whereby γ_P = (e_{k-T}^T B^T B e_k) / ‖B e_{k-T}‖².
Long-time Prediction Optimization (advanced topic) Given the optimal γ_P, we can modify the objective function by inserting the value of γ_P. Through simple manipulations we find that arg min ‖B(e_k - γ_P e_{k-T})‖² = arg max (e_{k-T}^T B^T B e_k)² / ‖B e_{k-T}‖². Optimizing this function gives us the optimal e_{k-T}, under the assumption that γ_P has its optimal value. Note that the above expression is actually the (squared) normalized correlation between B e_k and B e_{k-T}; that is, we are trying to find the vector e_{k-T} which is closest to the direction of e_k.
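The search over T can be sketched as follows, with the simplifying assumption B = I (no perceptual weighting); the frame layout, lag range and test signal are illustrative:

```python
import numpy as np

def ltp_search(e, start, N, Tmin, Tmax):
    """Find the lag maximizing (e_{k-T}^T e_k)^2 / ||e_{k-T}||^2, then gamma_P."""
    target = e[start:start + N]
    best_T, best_score = None, -np.inf
    for T in range(Tmin, Tmax + 1):
        past = e[start - T:start - T + N]
        score = (past @ target) ** 2 / (past @ past)   # normalized correlation
        if score > best_score:
            best_T, best_score = T, score
    past = e[start - best_T:start - best_T + N]
    gamma_p = (past @ target) / (past @ past)          # optimal gain for that lag
    return best_T, gamma_p

rng = np.random.default_rng(3)
e = np.tile(rng.normal(size=40), 10)                   # residual with period 40
T, gamma_p = ltp_search(e, start=200, N=40, Tmin=20, Tmax=100)
```

On this exactly periodic residual the search finds the true period (T = 40) with γ_P = 1; multiples of the period score equally well, which is why real codecs favour the shortest candidate lag.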
Long-time Prediction Summary Long-time prediction is used to model the fundamental frequency. It can be implemented as a vector codebook, although it actually is a filter. The optimal e_{k-T} is found by maximizing the normalized correlation between e_k and e_{k-T}; the optimal γ_P can then be quantized directly. The long-time predictor thus requires the transmission of two parameters, the lag T and the quantized gain γ_P.
Residual coding Once the spectral envelope has been modelled with the linear predictor A(z) and the fundamental frequency with the long-time predictor F_0(z), we are left with a residual E(z) = X(z)A(z)F_0(z). The residual contains everything that the two predictors were unable to model; it is basically white noise. Noise has no structure, right? So how do we model noise? How can we model the structure of something that has none? What do we know about noise?
Residual coding Let ε_n be an uncorrelated, zero-mean white-noise signal, E[ε_n] = 0 and E[ε_n ε_k] = 0 for n ≠ k. There is nothing to model here, but the energy (= variance) of the signal is σ² = E[ε_n²], and we can encode the energy of the noise! If e_k is a vector of the residual, we can encode its energy γ_C² = ‖e_k‖². The model is thus e_k = γ_C ẽ_k, where ‖ẽ_k‖² = 1. The remaining problem then reduces to encoding a vector ẽ_k which has ‖ẽ_k‖² = 1.
Residual coding Distribution (advanced topic) We happen to know that speech signals are distributed more or less according to the Laplace distribution; this holds also for the residual vector e_k. The probability distribution of a Laplace-distributed vector is defined as p(e_k) = C exp(-‖e_k‖₁ / b). Since the length of the vector e_k is encoded separately, it suffices to encode only vectors of a fixed length ‖e_k‖₁ = p. The probability of such vectors is p(e_k | ‖e_k‖₁ = p) = C exp(-p/b) = constant. All vectors of the same length have the same probability, so we can use a codebook which contains all vectors of that length.
Residual coding Algebraic coding Consider a vector with ‖e_k‖₁ = 1 which is quantized to integer values. The possible vectors are

Index  ε₀  ε₁  ε₂  ε₃  ε₄  ε₅ ...
0      +1   0   0   0   0   0 ...
1      -1   0   0   0   0   0 ...
2       0  +1   0   0   0   0 ...
3       0  -1   0   0   0   0 ...
4       0   0  +1   0   0   0 ...
5       0   0  -1   0   0   0 ...

Here, clearly, we can encode the vectors with index = 2k + s, where k is the position of the pulse and s = 0 for a plus sign and s = 1 for a minus sign.
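The rule index = 2k + s is trivially invertible; a minimal sketch:

```python
def encode_pulse(k, s):
    """Single unit pulse at position k with sign bit s (0 -> +1, 1 -> -1)."""
    return 2 * k + s

def decode_pulse(index, N):
    """Recover the length-N codevector from its algebraic index."""
    k, s = divmod(index, 2)
    vector = [0] * N
    vector[k] = 1 - 2 * s            # s = 0 -> +1, s = 1 -> -1
    return vector
```

For example, `decode_pulse(encode_pulse(2, 1), N=6)` reproduces row 5 of the table, a -1 pulse at position 2.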
Residual coding Algebraic coding With ‖e_k‖₁ = 2 we can have either two separate pulses or two pulses at the same position, for example the vectors

ε₀  ε₁  ε₂  ε₃  ε₄  ε₅ ...
+1   0   0   0  -1   0 ...
-1   0   0  -1   0   0 ...
 0  +2   0   0   0   0 ...
 0  -2   0   0   0   0 ...
 0   0   0  +1  -1   0 ...

Here it is a bit more challenging to generate indices for the above vectors.
Residual coding Algebraic coding In the case ‖e_k‖₁ = 2 we can use index = 2N k₁ + 2k₂ + s₁, where the k_n are the positions of the pulses, s₁ is the sign of the first pulse and N is the length of the vector. We can then deduce s₂ from the positions: if k₁ ≤ k₂ then s₂ = s₁, otherwise s₂ is the opposite sign. Example with N = 2:

Index  k₁  k₂  s₁  s₂  ε₀  ε₁
0       0   0   0   0  +2   0
1       0   0   1   1  -2   0
2       0   1   0   0  +1  +1
3       0   1   1   1  -1  -1
4       1   0   0   1  -1  +1
5       1   0   1   0  +1  -1
6       1   1   0   0   0  +2
7       1   1   1   1   0  -2
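The two-pulse rule, including the implied sign s₂, can be sketched and checked against the table above:

```python
def encode_2pulse(k1, k2, s1, N):
    # index = 2*N*k1 + 2*k2 + s1; s2 is implied by the position order.
    return 2 * N * k1 + 2 * k2 + s1

def decode_2pulse(index, N):
    k1, rem = divmod(index, 2 * N)
    k2, s1 = divmod(rem, 2)
    s2 = s1 if k1 <= k2 else 1 - s1   # sign of the second pulse is implied
    vector = [0] * N
    vector[k1] += 1 - 2 * s1          # pulses at the same position add up
    vector[k2] += 1 - 2 * s2
    return vector

# Reproduce the N = 2 table: indices 0..7 enumerate all 8 codevectors.
table = [decode_2pulse(i, N=2) for i in range(8)]
```

Decoding indices 0..7 yields exactly the eight codevectors of the table, with no duplicates.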
Residual coding Algebraic coding With N = 2 and ‖e_k‖₁ = 2 we thus have 8 different codebook vectors, which can be encoded with log₂ 8 = 3 bits. If we encoded the same vector directly, using one bit for each position k₁ and k₂ and one bit for each sign s₁ and s₂, we would need 4 bits. The vector [+1, +1] could then be either k₁ = 0, k₂ = 1, s₁ = s₂ = 0 or k₁ = 1, k₂ = 0, s₁ = s₂ = 0: two descriptions for one vector, which is inefficient. With the algebraic coding rule we get the smallest possible bit consumption.
Residual coding Algebraic coding The same approach can be extended to arbitrary N and p. Algebraic coding can then be used to describe noise vectors e_k of any length p with a minimum number of bits. Since the method is based on an algebraic rule (= an algorithm), the noise codebook does not need to be stored: no storage is needed. And since the codevectors are sparse (if N is large and p is low, then e_k is mostly zeros), computations with e_k are simple to perform (low complexity).
Residual coding Analysis by synthesis To find the optimal quantization ê_k, we use the same optimization as for the LTP, arg max (ê_k^T B^T B e_k)² / ‖B ê_k‖², where ê_k is the quantized residual. Note that H ê_k corresponds to the synthesized output signal, and B ê_k = WH ê_k is thus the perceptually weighted, synthesized output signal. The above optimization thus evaluates (analyses) the quality of the synthesized output signal; consequently, this method is known as the analysis-by-synthesis method. To find the optimum, we have to evaluate every possible quantization of ê_k: this is a brute-force method.
Residual coding Analysis by synthesis The final output is obtained as the sum of the contributions of the fundamental frequency model and the residual codebook, x_k = H(γ_P ê_{k-T} + γ_C ê_k), or equivalently as a time-domain convolution, x_n = h_n * (γ_P ê_{n-T} + γ_C ê_n). Here h_n corresponds to the impulse response of the linear predictor, which is excited (filtered) by the codebook vectors ê_{k-T} and ê_k. Hence the name code-excited linear prediction (CELP), and, when using the algebraic residual codebook, algebraic code-excited linear prediction (ACELP).
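The synthesis step can be sketched directly from the equation above; the predictor coefficients, gains and excitation vectors are illustrative, and the filter 1/A(z) is implemented as a direct recursion from zero initial state:

```python
import numpy as np

def synthesis_filter(a, excitation):
    """Filter the excitation through H(z) = 1/A(z), a = [1, alpha_1, ..., alpha_m]."""
    m = len(a) - 1
    x = np.zeros(len(excitation) + m)      # m leading zeros = filter state
    for n in range(len(excitation)):
        acc = excitation[n]
        for k in range(1, m + 1):
            acc -= a[k] * x[n + m - k]     # x_n = e_n - sum_k alpha_k x_{n-k}
        x[n + m] = acc
    return x[m:]

# Excitation = adaptive (pitch) contribution + fixed (noise) contribution.
gamma_p, gamma_c = 0.8, 0.5
e_adaptive = np.array([1.0, 0.0, 0.0, 0.0])   # illustrative e_{k-T}
e_fixed = np.array([0.0, 0.0, 1.0, 0.0])      # illustrative algebraic vector
a = [1.0, -0.5]                               # illustrative predictor A(z)
x = synthesis_filter(a, gamma_p * e_adaptive + gamma_c * e_fixed)
```

In a real decoder the adaptive contribution comes from the past excitation buffer and the state of 1/A(z) carries over between frames; both are omitted here for brevity.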
Residual coding Summary The residual after modelling with linear prediction and long-time prediction is modelled with a noise codebook. We first encode the gain (energy) of the noise vector, and secondly encode the fixed-length residual with an algebraic codebook. The algebraic codebook generates vectors with an algorithm, such that no storage is required. The best quantization is found by a brute-force search, also known as the analysis-by-synthesis method.
Conclusion Speech coding is the digital compression of speech for transmission and storage. It is the most important speech processing application, with over 7 billion users. Most phones still use the old AMR standard, partly superseded by the newer AMR-WB, but the newest standard, EVS, will hopefully soon replace both. Code-excited linear prediction (CELP) is the standard approach, used in all mainstream standards. It is based on source-filter modelling, where the spectral envelope is modelled with linear prediction, the fundamental frequency with a long-time predictor, and the residual with a noise codebook. The parameters are optimized with a brute-force analysis-by-synthesis method. Speech coding remains an active, math-intensive field of research.