Sinusoidal Modeling Yannis Stylianou University of Crete, Computer Science Dept., Greece, yannis@csd.uoc.gr SPCC 2016
1 Speech Production 2 Modulators 3 Sinusoidal Modeling Sinusoidal Models Voiced Speech Unvoiced Speech Synthesis Amplitude and Phase Interpolation 4 Harmonic Modeling HNM Speech Models 5 References yannis@csd.uoc.gr Sinusoidal Modeling 2/44
A Simple view Used after permission: [1] yannis@csd.uoc.gr Sinusoidal Modeling 3/44
Downward-looking into the larynx: Vocal folds Left: Voicing, Right: Breathing Used after permission: [1] yannis@csd.uoc.gr Sinusoidal Modeling 4/44
Glottal airflow velocity Used after permission: [1] yannis@csd.uoc.gr Sinusoidal Modeling 5/44
Vocat tract shapes Used after permission: [1] yannis@csd.uoc.gr Sinusoidal Modeling 6/44
Example Let the excitation of vocal tract, h[n], be: u[n] = g[n] p[n] then, the output speech, x[n, τ], is given by: x[n, τ] = w[n, τ]{h[n] (g[n] p[n])} and X (ω, τ) = 1 P H(ω k )G(ω k )W (ω ω k, τ) k= yannis@csd.uoc.gr Sinusoidal Modeling 7/44
Harmonics and Formants Used after permission: [1] yannis@csd.uoc.gr Sinusoidal Modeling 8/44
Sinusoidal Models Sinusoidal Model: K x(t) = γ k e j2πf kt k= K Special case, Harmonic Model: K x(t) = γ k e j2πkf 0t k= K Estimation of parameters (Linear approach): γ = F fk s Model Evaluation, Mean-Squared Error (MSE): ɛ = s(t) x(t) 2 dt w yannis@csd.uoc.gr Sinusoidal Modeling 9/44
Sinusoidal Models Sinusoidal Model: K x(t) = γ k e j2πf kt k= K Special case, Harmonic Model: K x(t) = γ k e j2πkf 0t k= K Estimation of parameters (Linear approach): γ = F fk s Model Evaluation, Mean-Squared Error (MSE): ɛ = s(t) x(t) 2 dt w yannis@csd.uoc.gr Sinusoidal Modeling 9/44
Sinusoidal Models Sinusoidal Model: K x(t) = γ k e j2πf kt k= K Special case, Harmonic Model: K x(t) = γ k e j2πkf 0t k= K Estimation of parameters (Linear approach): γ = F fk s Model Evaluation, Mean-Squared Error (MSE): ɛ = s(t) x(t) 2 dt w yannis@csd.uoc.gr Sinusoidal Modeling 9/44
Sinusoidal Models Sinusoidal Model: K x(t) = γ k e j2πf kt k= K Special case, Harmonic Model: K x(t) = γ k e j2πkf 0t k= K Estimation of parameters (Linear approach): γ = F fk s Model Evaluation, Mean-Squared Error (MSE): ɛ = s(t) x(t) 2 dt w yannis@csd.uoc.gr Sinusoidal Modeling 9/44
Source-Filter with Sinusoids Source: where: u(t) = φ k (t) = K(t) k= K(t) t 0 α k (t)e jφ k(t) Ω k (σ)dσ + φ k Filter: h(t, τ) with Fourier Transform (FT): H(t, Ω) = M(t, Ω)e jφ(t,ω) yannis@csd.uoc.gr Sinusoidal Modeling 10/44
Source-Filter with Sinusoids Source: where: u(t) = φ k (t) = K(t) k= K(t) t 0 α k (t)e jφ k(t) Ω k (σ)dσ + φ k Filter: h(t, τ) with Fourier Transform (FT): H(t, Ω) = M(t, Ω)e jφ(t,ω) yannis@csd.uoc.gr Sinusoidal Modeling 10/44
Source-Filter with Sinusoids where: s(t) = K(t) k=1 K(t) A k (t)e jθ k(t) A k (t) = α k (t)m [t, Ω k (t)] θ k (t) = φ k (t) + Φ [t, Ω k (t)] = t 0 Ω k (σ)dσ + Φ [t, Ω k (t)] + φ k yannis@csd.uoc.gr Sinusoidal Modeling 11/44
Sinusoidal Analysis: Stationarity Assumption We assume stationarity inside the analysis window: which leads to: and to: s(t) = In discrete time: s[n] = K k= K K k= K A k (t) = A k Ω k (t) = Ω k K(t) = K θ k (t) = Ω k (t t l ) + θ k A k e jθ k e jω k(t t l ) γ k e jω kn N w 1 2 t l T 2 t t l + T 2 n N w 1 2 yannis@csd.uoc.gr Sinusoidal Modeling 12/44
Sinusoidal Analysis: Stationarity Assumption We assume stationarity inside the analysis window: which leads to: and to: s(t) = In discrete time: s[n] = K k= K K k= K A k (t) = A k Ω k (t) = Ω k K(t) = K θ k (t) = Ω k (t t l ) + θ k A k e jθ k e jω k(t t l ) γ k e jω kn N w 1 2 t l T 2 t t l + T 2 n N w 1 2 yannis@csd.uoc.gr Sinusoidal Modeling 12/44
Sinusoidal Analysis: Stationarity Assumption We assume stationarity inside the analysis window: which leads to: and to: s(t) = In discrete time: s[n] = K k= K K k= K A k (t) = A k Ω k (t) = Ω k K(t) = K θ k (t) = Ω k (t t l ) + θ k A k e jθ k e jω k(t t l ) γ k e jω kn N w 1 2 t l T 2 t t l + T 2 n N w 1 2 yannis@csd.uoc.gr Sinusoidal Modeling 12/44
Sinusoidal Analysis: Stationarity Assumption We assume stationarity inside the analysis window: which leads to: and to: s(t) = In discrete time: s[n] = K k= K K k= K A k (t) = A k Ω k (t) = Ω k K(t) = K θ k (t) = Ω k (t t l ) + θ k A k e jθ k e jω k(t t l ) γ k e jω kn N w 1 2 t l T 2 t t l + T 2 n N w 1 2 yannis@csd.uoc.gr Sinusoidal Modeling 12/44
Mean-Squared Error Given the original measured waveform, y[n] and the synthetic speech waveform, s[n], estimate the unknown parameters A l k, ωl k, and θk l by minimizing the MSE criterion: n= (N w 1)/2 ɛ l = n=(n w 1)/2 n= (N w 1)/2 y[n] s[n] 2 which can be written as: n=(n w 1)/2 K l ( Y ) ɛ l = y[n] 2 + N w (ω l k ) γl k 2 Y (ωk l ) 2 which can be reduced further to: ɛ l = n=(n w 1)/2 n= (N w 1)/2 k=1 y[n] 2 N w Y (ωk l ) 2 K l k=1 yannis@csd.uoc.gr Sinusoidal Modeling 13/44
Mean-Squared Error Given the original measured waveform, y[n] and the synthetic speech waveform, s[n], estimate the unknown parameters A l k, ωl k, and θk l by minimizing the MSE criterion: n= (N w 1)/2 ɛ l = n=(n w 1)/2 n= (N w 1)/2 y[n] s[n] 2 which can be written as: n=(n w 1)/2 K l ( Y ) ɛ l = y[n] 2 + N w (ω l k ) γl k 2 Y (ωk l ) 2 which can be reduced further to: ɛ l = n=(n w 1)/2 n= (N w 1)/2 k=1 y[n] 2 N w Y (ωk l ) 2 K l k=1 yannis@csd.uoc.gr Sinusoidal Modeling 13/44
Mean-Squared Error Given the original measured waveform, y[n] and the synthetic speech waveform, s[n], estimate the unknown parameters A l k, ωl k, and θk l by minimizing the MSE criterion: n= (N w 1)/2 ɛ l = n=(n w 1)/2 n= (N w 1)/2 y[n] s[n] 2 which can be written as: n=(n w 1)/2 K l ( Y ) ɛ l = y[n] 2 + N w (ω l k ) γl k 2 Y (ωk l ) 2 which can be reduced further to: ɛ l = n=(n w 1)/2 n= (N w 1)/2 k=1 y[n] 2 N w Y (ωk l ) 2 K l k=1 yannis@csd.uoc.gr Sinusoidal Modeling 13/44
Karhunen-Loève Expansion Karhunen-Loève expansion allows constructing a random process from harmonic sinusoids with uncorrelated complex amplitudes. Estimated power spectrum should not vary too much over consecutive frequencies. Following the above necessary constraints, for unvoiced speech, and for a window width to be at least 20ms, an 100 Hz harmonic structure provides good results. yannis@csd.uoc.gr Sinusoidal Modeling 14/44
Karhunen-Loève Expansion Karhunen-Loève expansion allows constructing a random process from harmonic sinusoids with uncorrelated complex amplitudes. Estimated power spectrum should not vary too much over consecutive frequencies. Following the above necessary constraints, for unvoiced speech, and for a window width to be at least 20ms, an 100 Hz harmonic structure provides good results. yannis@csd.uoc.gr Sinusoidal Modeling 14/44
Karhunen-Loève Expansion Karhunen-Loève expansion allows constructing a random process from harmonic sinusoids with uncorrelated complex amplitudes. Estimated power spectrum should not vary too much over consecutive frequencies. Following the above necessary constraints, for unvoiced speech, and for a window width to be at least 20ms, an 100 Hz harmonic structure provides good results. yannis@csd.uoc.gr Sinusoidal Modeling 14/44
Karhunen-Loève Expansion Karhunen-Loève expansion allows constructing a random process from harmonic sinusoids with uncorrelated complex amplitudes. Estimated power spectrum should not vary too much over consecutive frequencies. Following the above necessary constraints, for unvoiced speech, and for a window width to be at least 20ms, an 100 Hz harmonic structure provides good results. yannis@csd.uoc.gr Sinusoidal Modeling 14/44
Sinusoidal Analysis: Peak picking Used after permission: [1] yannis@csd.uoc.gr Sinusoidal Modeling 15/44
Sinusoidal Synthesis: OLA, a simple solution yannis@csd.uoc.gr Sinusoidal Modeling 16/44
Problem of frequency matching yannis@csd.uoc.gr Sinusoidal Modeling 17/44
Frame-to-Frame Peak Matching yannis@csd.uoc.gr Sinusoidal Modeling 18/44
The birth/death process yannis@csd.uoc.gr Sinusoidal Modeling 19/44
A birth/death process in speech yannis@csd.uoc.gr Sinusoidal Modeling 20/44
Amplitude and Phase Interpolation Linear amplitude interpolation model: ( ( n ) A l k [n] = Al k + A l+1 k Ak) l n = 0, 1, 2,, L 1 L Cubic Phase interpolation model θ(t) = ζ + γt + αt 2 + βt 3 yannis@csd.uoc.gr Sinusoidal Modeling 21/44
Block diagram of the Synthesis System yannis@csd.uoc.gr Sinusoidal Modeling 22/44
Synthesis: Reconstruction Example Used after permission: [1] yannis@csd.uoc.gr Sinusoidal Modeling 23/44
Harmonic model Sinusoidal Model: K x(t) = γ k e j2πf kt k= K Special case, Harmonic Model: K x(t) = γ k e j2πkf 0t k= K Estimation of parameters (Linear approach): γ = F fk s Model Evaluation, Mean-Squared Error (MSE): ɛ = s(t) x(t) 2 dt w yannis@csd.uoc.gr Sinusoidal Modeling 24/44
Brief overview of HNM HNM is a pitch-synchronous harmonic plus noise representation of the speech signal. Speech spectrum is divided into a low and a high band delimited by the so-called maximum voiced frequency. The low band of the spectrum (below the maximum voiced frequency) is represented solely by harmonically related sine waves. The upper band is modeled as a noise component modulated by a time-domain amplitude envelope. HNM allows high-quality copy synthesis and prosodic modifications. yannis@csd.uoc.gr Sinusoidal Modeling 25/44
HNM in equations Harmonic part: h(t) = L(t) k= L(t) A k (t)e j kω 0(t) t Noise part: Speech: n(t) = e(t) [v(τ, t) b(t)] s(t) = h(t) + n(t) yannis@csd.uoc.gr Sinusoidal Modeling 26/44
Periodic part HNM 1 : Sum of exponential functions without slope h 1 [n] = L(n i a) k= L(n i a) a k (n i a)e j2πkf 0(n i a)(n n i a) HNM 2 : Sum of exponential function with complex slope L(n a i ) h 2 [n] = R A k (n) exp j2πkf 0(na i )(n ni a ) where k=1 A k (n) = a k (n i a) + (n n i a)b k (n i a) with a k (n i a), b k (n i a) to be complex numbers (amplitude and slope respectively). R denotes taking the real part. yannis@csd.uoc.gr Sinusoidal Modeling 27/44
Periodic part HNM 1 : Sum of exponential functions without slope h 1 [n] = L(n i a) k= L(n i a) a k (n i a)e j2πkf 0(n i a)(n n i a) HNM 2 : Sum of exponential function with complex slope L(n a i ) h 2 [n] = R A k (n) exp j2πkf 0(na i )(n ni a ) where k=1 A k (n) = a k (n i a) + (n n i a)b k (n i a) with a k (n i a), b k (n i a) to be complex numbers (amplitude and slope respectively). R denotes taking the real part. yannis@csd.uoc.gr Sinusoidal Modeling 27/44
Periodic part continuing HNM 3 : Sum of sinusoids with time-varying real amplitudes where h 3 [n] = L(n i a) k=0 a k (n)cos(ϕ k (n)) a k (n) = c k0 + c k1 (n na) i 1 + + c kp (n na) i p(n) ϕ k (n) = ɛ k + 2π kζ (n na) i where p(n) is the order of the amplitude polynomial, which is, in general, a time-varying parameter. yannis@csd.uoc.gr Sinusoidal Modeling 28/44
Residual (Noise) part The non-periodic part is just the residual signal obtained by subtracting the periodic-part (harmonic part) from the original speech signal in the time-domain r[n] = s[n] h[n] where h[n] is either h 1 [n], h 2 [n], or h 3 [n]. yannis@csd.uoc.gr Sinusoidal Modeling 29/44
Amplitudes and phases estimation Having f 0 estimated for voiced frames, amplitudes and phases are estimated by minimizing the criterion: ɛ = n i a+n n=n i a N w 2 [n](s[n] ĥ[n]) 2 where na i = na i 1 + P(na i 1 ), and P(na i 1 ) denotes the pitch period at na i 1. for HNM 1 and HNM 2, this criterion has a quadratic form and is solved by inverting an over-determined system of linear equations. For HNM 3, however,a non-linear system of equations has to be solved. The residual signal r[n] is estimated by ˆr[n] = s[n] ĥ[n] yannis@csd.uoc.gr Sinusoidal Modeling 30/44
Amplitudes and phases estimation Having f 0 estimated for voiced frames, amplitudes and phases are estimated by minimizing the criterion: ɛ = n i a+n n=n i a N w 2 [n](s[n] ĥ[n]) 2 where na i = na i 1 + P(na i 1 ), and P(na i 1 ) denotes the pitch period at na i 1. for HNM 1 and HNM 2, this criterion has a quadratic form and is solved by inverting an over-determined system of linear equations. For HNM 3, however,a non-linear system of equations has to be solved. The residual signal r[n] is estimated by ˆr[n] = s[n] ĥ[n] yannis@csd.uoc.gr Sinusoidal Modeling 30/44
Time domain characteristics of ˆr[n] 1.5 x Original speech signal 104 1 0.5 0 0.5 1 0 100 200 (a) Number of samples 3000 2000 1000 0 1000 2000 Error signal using the HNM2 3000 0 100 200 (c) Number of samples 3000 2000 1000 0 1000 2000 Error signal using HNM1 3000 0 100 200 (a) Number of samples 3000 2000 1000 0 1000 2000 Error signal using HNM3 3000 0 100 200 (d) Number of samples yannis@csd.uoc.gr Sinusoidal Modeling 31/44
Variance of the residual signal The variance of the residual signal is given as: E(rr h ) = I WP(P h W h WP) 1 P h W h 1 Variance using the three harmonic models 0.9 0.8 0.7 0.6 Variance 0.5 0.4 0.3 0.2 0.1 0 0 10 20 30 40 50 60 70 80 90 Time in samples yannis@csd.uoc.gr Sinusoidal Modeling 32/44
Modeling error 3 x 104 (a) Original signal Error in db 2 1 0 1 2 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Time in sec 0 10 20 30 40 50 60 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Time in sec yannis@csd.uoc.gr Sinusoidal Modeling 33/44
Modeling the residual signal Full bandwidth representation using a low-order (10th) AR filter Time-domain characteristics of the residual signal can be modeled using various functions yannis@csd.uoc.gr Sinusoidal Modeling 34/44
Time domain energy modulation 0.4 0.2 0 0.2 0.4 0 200 400 600 800 1000 1200 1400 1600 1800 2000 Time in samples 0.1 0.05 0 0.05 0.1 0 200 400 600 800 1000 1200 1400 1600 1800 2000 Time in samples yannis@csd.uoc.gr Sinusoidal Modeling 35/44
Triangular envelope yannis@csd.uoc.gr Sinusoidal Modeling 36/44
Signal envelope There are many ways to obtain the envelope of a signal, as: Hilbert Transform (analytic signal) Low-pass local energy (energy envelope): e[n] = 1 2N + 1 N k= N where r[n] denotes the residual signal. r[n k] yannis@csd.uoc.gr Sinusoidal Modeling 37/44
Signal envelope There are many ways to obtain the envelope of a signal, as: Hilbert Transform (analytic signal) Low-pass local energy (energy envelope): e[n] = 1 2N + 1 N k= N where r[n] denotes the residual signal. r[n k] yannis@csd.uoc.gr Sinusoidal Modeling 37/44
Example of energy envelope Example of Energy Envelope, with N = 7 0.1 0.05 0 0.05 0.1 0 200 400 600 800 1000 1200 1400 1600 1800 2000 Time in samples The energy envelope can be efficiently parameterized with a few Fourier coefficients: ê[n] = where L e is set to be 3 to 4 L e k= L e A k e j2πk(f0/fs)n yannis@csd.uoc.gr Sinusoidal Modeling 38/44
Comparing time-domain modulations 0.1 0 0.1 0.1 200 400 600 800 1000 1200 1400 1600 1800 2000 Noise Part 0 0.1 0.1 200 400 600 800 1000 1200 1400 1600 1800 2000 Triangular Envelope 0 0.1 0.1 200 400 600 800 1000 1200 1400 1600 1800 2000 Hilbert Envelope 0 0.1 200 400 600 800 1000 1200 1400 1600 1800 2000 Energy Envelope yannis@csd.uoc.gr Sinusoidal Modeling 39/44
Sound Examples using HNM 1 Examples with HNM 1 and Maximum Voiced frequency fixed at 4000 Hz Male Male Female Female Original Triangular Hilbert Energy yannis@csd.uoc.gr Sinusoidal Modeling 40/44
Sound Examples using HNM 2 Examples with HNM 2 and Maximum Voiced frequency fixed at 4000 Hz Male Male Female Female Original Triangular Hilbert Energy yannis@csd.uoc.gr Sinusoidal Modeling 41/44
Sensitivity on f 0 100 Maginute (db) 80 60 40 20 0 500 1000 1500 2000 2500 3000 3500 4000 (a) Frequency in Hz 100 Maginute (db) 80 60 40 20 0 500 1000 1500 2000 2500 3000 3500 4000 (b) Frequency in Hz yannis@csd.uoc.gr Sinusoidal Modeling 42/44
T. F. Quatieri, Discrete-Time Speech Signal Processing. Engewood Cliffs, NJ: Prentice Hall, 2002. yannis@csd.uoc.gr Sinusoidal Modeling 43/44
THANK YOU for your attention on this part yannis@csd.uoc.gr Sinusoidal Modeling 43/44