Gaussian Processes for Audio Feature Extraction Dr. Richard E. Turner (ret26@cam.ac.uk) Computational and Biological Learning Lab Department of Engineering University of Cambridge
Machine hearing pipeline
signal (T samples) → time-frequency (TF) analysis (non-linear; T′ × D > T coefficients: short-time Fourier transform, spectrogram, wavelet, filter bank) → probabilistic model (HMM: speech recognition; NMF: source separation, denoising; ICA: source separation, denoising)
Problems with the conventional pipeline
- noise (source mixtures) is hard to model in the TF domain: hard to propagate uncertainty about noise/missing data from the signal to the TF domain
- hard to enforce/learn the dependencies intrinsic to the TF analysis (the image of the mapping; the mapping is injective)
- learning based on the time-frequency representation ignores the Jacobian of the (non-linear) transform
- hard to adapt both the top and bottom layers together
Goal of this talk
- probabilise time-frequency analysis: construct a generative model in which inference corresponds to classical time-frequency analysis
- build a hierarchical model that incorporates the downstream processing module
- connect classical signal processing with machine learning
A typical audio pipeline
signal → window → Fourier transform → short-time Fourier transform → magnitude spectrogram
[figure: waveform (time /s) and spectrogram (frequency /kHz)]
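The first stage of the pipeline can be sketched in a few lines. A minimal NumPy STFT (Hann window, fixed hop, FFT per frame); the window length, hop, and test tone are illustrative, not the settings used in the talk:

```python
import numpy as np

def stft(y, win_len=256, hop=128):
    """Short-time Fourier transform: slide a window along the signal,
    then FFT each windowed frame. Returns (n_frames, win_len//2 + 1)."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(y) - win_len) // hop
    frames = np.stack([y[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

# Toy signal: a 1 kHz sinusoid sampled at 8 kHz.
fs = 8000
t = np.arange(fs) / fs
y = np.sin(2 * np.pi * 1000 * t)

S = stft(y)
spectrogram = np.abs(S)                 # magnitude spectrogram
peak_bin = spectrogram[0].argmax()
print(peak_bin * fs / 256)              # peak at 1000 Hz, as expected
```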
A typical audio pipeline
signal → short-time Fourier transform → magnitude spectrogram → NMF
What form of generative model corresponds to the STFT?
- desire: the expected value of the latent time-frequency coefficients s_{d,1:T} equals the STFT
- assume y is formed by a (weighted) superposition of band-limited signals s_{d,1:T}
- linearity of inference can be assured by setting the distributions of each s_{d,1:T} and the noise to be Gaussian
- time-invariance ⇒ statistically stationary generative model ⇒ GP prior over the STFT coefficients,
  p(s_{d,1:T}) = N(s_{d,1:T}; 0, Γ), with stationary Γ_{t,t′} = Σ_{k=1}^{T} FT⁻¹_{t,k} γ_k FT_{k,t′}, where FT_{k,t} = e^{−2πi(k−1)(t−1)/T}
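The stationary covariance above can be verified numerically. A sketch that builds Γ = FT⁻¹ diag(γ) FT for a small T and checks that it is real, symmetric, and stationary (circulant); the spectrum γ here is an arbitrary symmetric bump, purely for illustration:

```python
import numpy as np

T = 64
k = np.arange(T)
# Illustrative power spectrum: a band centred at bin 8, mirrored at T-8
# so that gamma_k = gamma_{T-k} and the covariance comes out real.
gamma = (np.exp(-0.5 * ((k - 8) / 2.0) ** 2)
         + np.exp(-0.5 * ((k - (T - 8)) / 2.0) ** 2))

F = np.fft.fft(np.eye(T))                 # DFT matrix, F[k,t] = e^{-2pi i k t / T}
# FT^{-1} = F^H / T, so Gamma = FT^{-1} diag(gamma) FT:
Gamma = (F.conj().T @ np.diag(gamma) @ F).real / T

# Stationarity: Gamma[t, t'] depends only on t - t' (circulant matrix).
print(np.allclose(Gamma[0, 1], Gamma[10, 11]))   # True
```

Because γ is symmetric, the imaginary part vanishes and every diagonal of Γ is constant, which is exactly the stationarity the slide requires.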
Time-frequency analysis as inference
generation: the signal is a sum of complex sinusoids with time-varying (complex) coefficients; inference: recover the coefficients from the signal
Time-frequency analysis as inference
inference: the most probable coefficients given the signal are the STFT coefficients; the STFT window corresponds to the (frequency-shifted) prior covariance combined with the inverse signal covariance
Time-frequency analysis as inference
- the inferred coefficients split into a part that depends on the signal and a part that is independent of it
- with observation noise, the posterior mean is a Wiener filter
- this yields a probabilistic filter bank / probabilistic STFT: STFT window = (frequency-shifted) prior covariance combined with the inverse signal covariance
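The noisy-observation case reduces, bin by bin, to the familiar Wiener gain γ_k / (γ_k + noise power). A toy sketch, assuming the signal power spectrum is known (here taken from a synthetic narrowband source; all settings are illustrative, not the talk's model):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1024
t = np.arange(T)
clean = np.sin(2 * np.pi * 50 * t / T)          # narrowband stationary source
sigma = 0.5
noisy = clean + sigma * rng.standard_normal(T)  # observed signal

# Per-bin Wiener filter: E[S_k | Y_k] = gamma_k / (gamma_k + E|N_k|^2) * Y_k.
# gamma_k (signal power spectrum) is assumed known; in the talk's model it
# comes from the GP prior over the time-frequency coefficients.
gamma = np.abs(np.fft.fft(clean)) ** 2          # per-bin signal power
noise_power = T * sigma ** 2                    # per-bin noise power E|N_k|^2
gain = gamma / (gamma + noise_power)
denoised = np.fft.ifft(gain * np.fft.fft(noisy)).real

mse = lambda a, b: np.mean((a - b) ** 2)
print(mse(denoised, clean) < mse(noisy, clean))  # True: the filter helps
```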
Time-frequency analysis as inference
- probabilistic models in which inference recovers STFT, filter-bank, and wavelet analyses
- unifies a number of existing probabilistic time-series models and connects to traditional signal processing
- can learn the window of the STFT and the frequencies (equivalently, the filter properties)
- the frequency-shift relationship mimics the classical relationship between these time-frequency representations
- hops/down-sampling and the finite window correspond to FITC (uniformly spaced pseudo-points) and sparse-covariance approximations: rediscover Nyquist in the context of approximate GPs
Probabilistic audio processing pipeline
- signal = sum over channels of envelopes × carriers
- carriers = bandpass Gaussian noise (mean spectrum)
- envelopes = slow Gaussian process (envelope patterns, mean spectrum)
[figure: per-channel envelopes and carriers (frequency /kHz) and the resulting signal (time /ms)]
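The generative model on this slide can be sketched directly: carriers of bandpass Gaussian noise, modulated by slow positive envelopes, summed across channels. Band edges and length-scales below are illustrative, and the smoothed-and-exponentiated noise envelope is a stand-in for the talk's slow Gaussian-process envelope:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 2000

def slow_envelope(T, length_scale, rng):
    """Positive, slowly varying envelope: smooth Gaussian noise, exponentiated."""
    z = rng.standard_normal(T)
    x = np.arange(-3 * length_scale, 3 * length_scale + 1)
    kernel = np.exp(-0.5 * (x / length_scale) ** 2)
    z = np.convolve(z, kernel / kernel.sum(), mode="same")
    return np.exp(z)

def bandpass_noise(T, lo, hi, rng):
    """Carrier: white Gaussian noise restricted to one frequency band."""
    X = np.fft.rfft(rng.standard_normal(T))
    f = np.arange(len(X))
    X[(f < lo) | (f > hi)] = 0
    return np.fft.irfft(X, T)

bands = [(50, 100), (200, 300), (500, 700)]   # illustrative band edges (FFT bins)
signal = sum(slow_envelope(T, 100, rng) * bandpass_noise(T, lo, hi, rng)
             for lo, hi in bands)
print(signal.shape)                           # (2000,)
```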
Key Observation: fix the envelopes — the posterior over the carriers is Gaussian, and the posterior mean is given by an (adaptive) filter.
Inference and Learning
- leads to MAP estimation of the envelopes (or HMCMC); let z_{lt} = log h_{lt}:
  Z_MAP = argmax_Z p(Z|Y),  p(Z|Y) = (1/Z) p(Z,Y) = (1/Z) ∫ dX p(Z,Y,X) = (1/Z) p(Z) ∫ dX p(Y|A,X) p(X)
- compute the integral efficiently using the chain-structured approximation and Kalman smoothing
- leads to gradient-based optimisation of the transformed amplitudes
- learning: approximate maximum likelihood, θ = argmax_θ p(Y|θ)
- NMF: zero-temperature EM, one E-step, initialise with constant envelopes
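A stripped-down version of this MAP scheme for a single band: i.i.d. unit-variance carriers (so the carrier integral is available in closed form), a squared-difference smoothness penalty standing in for the talk's GP prior and Kalman smoothing, and plain gradient ascent on the transformed amplitudes z = log h. All settings are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
T, sigma, lam, step = 500, 0.3, 50.0, 1e-3

# Simulate: slow log-envelope z_true, i.i.d. carrier, observation noise.
z_true = np.convolve(rng.standard_normal(T), np.ones(50) / 50, mode="same")
y = np.exp(z_true) * rng.standard_normal(T) + sigma * rng.standard_normal(T)

def objective_and_grad(z):
    # Carriers marginalised out: y_t | z_t ~ N(0, exp(2 z_t) + sigma^2).
    v = np.exp(2 * z) + sigma ** 2
    loglik = -0.5 * np.sum(np.log(v) + y ** 2 / v)
    logprior = -lam * np.sum(np.diff(z) ** 2)          # smoothness penalty
    grad = (y ** 2 / v ** 2 - 1.0 / v) * np.exp(2 * z)  # d loglik / dz
    grad[:-1] += 2 * lam * np.diff(z)                   # d logprior / dz
    grad[1:] -= 2 * lam * np.diff(z)
    return loglik + logprior, grad

z = np.zeros(T)
obj0, _ = objective_and_grad(z)
for _ in range(500):
    obj, g = objective_and_grad(z)
    z += step * g                                       # gradient ascent
print(obj > obj0)                                       # MAP objective improved
```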
Audio modelling
Audio modelling
[figure: spectrograms (frequency /kHz vs time /s) of natural sound textures: fire, stream, wind, rain, footstep, tent-zip] (Turner, 2010)
Statistical texture synthesis
- old approach: build detailed physical models (e.g. of rain drops)
- new approach: train the model on your favourite texture; sample from the prior, and then from the likelihood
- the waveform is unique, but statistically matched to the original
- often perceptually indistinguishable
Audio denoising
[figure: SNR improvement /dB, PESQ improvement, and log-spectrum SNR improvement /dB versus input quality, comparing NMF, tNMF, GTF, and GTFtNMF (adapted and unadapted filters) against Wiener filtering, spectral subtraction, and block thresholding; example waveforms y_t before and after denoising, time /ms]
Audio missing data imputation
[figure: SNR /dB, PESQ, and log-spectrum SNR /dB versus the duration of the missing region /ms for tNMF, GTF, and GTFtNMF with adapted and unadapted filters; example waveforms y_t with imputed regions, time /ms]
Unifying classical and probabilistic audio signal processing
[diagram: probabilistic signal processing contributes robustness, adaptation, and important variables; classical signal processing contributes fast methods]
Probabilistic vs classical signal processing
[diagram: connections between probabilistic models (Cemgil & Godsill; Qi & Minka) and classical analyses — frequency-shift relationships linking filter bank & Hilbert analysis, the STFT, amplitude estimation, and the spectrogram]
Additional slides