Gaussian Processes for Audio Feature Extraction


Gaussian Processes for Audio Feature Extraction Dr. Richard E. Turner (ret26@cam.ac.uk) Computational and Biological Learning Lab Department of Engineering University of Cambridge

Machine hearing pipeline: a signal of T samples is passed through a (non-linear) time-frequency (TF) analysis (short-time Fourier transform, spectrogram, wavelet, or filter bank), producing a representation of T' x D > T samples across frequency, which is then fed to a probabilistic model: an HMM (speech recognition), NMF (source separation, denoising), or ICA (source separation, denoising).

Problems with the conventional pipeline:
- noise (source mixtures) is hard to model in the TF domain, and it is hard to propagate uncertainty (noise/missing data) from the signal to the TF domain
- it is hard to enforce or learn the dependencies intrinsic to the TF analysis (the image of the mapping, which is injective)
- learning based on the time-frequency representation ignores the Jacobian
- it is hard to adapt both the top and bottom layers

Goal of this talk: probabilise time-frequency analysis (construct a generative model in which inference corresponds to classical time-frequency analysis), and build a hierarchical model that incorporates the downstream processing module, so that the classical signal-processing stage and the machine-learning stage are handled in one framework.

A typical audio pipeline: the signal y (plotted over 0 to 2 s) is windowed and Fourier transformed (the short-time Fourier transform) to give a magnitude spectrogram (frequency axis from roughly 0.2 to 6 kHz), which is then decomposed by NMF.
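As a concrete illustration of this front end, here is a minimal numpy sketch of the window, FFT, magnitude chain (the sample rate, test tone, window length, and hop are hypothetical choices, not values from the talk):

```python
import numpy as np

# Minimal sketch of the classical front end: window the signal, Fourier
# transform each frame, keep squared magnitudes to form a spectrogram.
fs = 8000                                  # sample rate /Hz (assumed)
t = np.arange(2 * fs) / fs                 # a 2 s test signal
rng = np.random.default_rng(0)
y = np.sin(2 * np.pi * 440 * t) + 0.1 * rng.standard_normal(t.size)

win, hop = 256, 128
window = np.hanning(win)
frames = np.stack([y[i:i + win] * window
                   for i in range(0, y.size - win + 1, hop)])
spectrogram = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # frames x frequencies
```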

What form of generative model corresponds to the STFT? Desire: the expected value of the latent time-frequency coefficients s_{d,1:T} given the signal equals the STFT. Assume y is formed by a (weighted) superposition of band-limited signals s_{d,1:T}. Linearity of inference can be assured by setting the distributions of each s_{d,1:T} and the noise to be Gaussian. Time-invariance means the generative model is statistically stationary, i.e. a GP prior over the STFT coefficients, p(s_{d,1:T}) = G(s_{d,1:T}; 0, \Gamma), with stationary \Gamma_{t,t'} = \sum_{k=1}^{T} \mathrm{FT}^{-1}_{t,k} \gamma_k \mathrm{FT}_{k,t'}, where \mathrm{FT}_{k,t} = e^{-2\pi i (k-1)(t-1)/T}.
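A brief numpy check of this construction (the spectral weights gamma_k below are a hypothetical smooth spectrum, not values from the talk): conjugating a diagonal spectrum by the DFT produces a circulant, and hence stationary, covariance Gamma.

```python
import numpy as np

# Gamma = FT^{-1} diag(gamma) FT with FT[k, t] = exp(-2j*pi*k*t/T) (0-indexed).
# gamma_k is a hypothetical smooth, positive, symmetric spectrum.
T = 64
k = np.arange(T)
gamma = 1.0 / (1.0 + 0.1 * np.minimum(k, T - k) ** 2)

FT = np.exp(-2j * np.pi * np.outer(k, k) / T)   # DFT matrix
FTinv = np.conj(FT).T / T                       # inverse DFT matrix

Gamma = (FTinv @ np.diag(gamma) @ FT).real      # real since gamma is symmetric

# Stationarity: Gamma[t, t'] depends only on t - t' (Gamma is circulant).
assert np.allclose(Gamma[1, :], np.roll(Gamma[0, :], 1))
```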

Time-frequency analysis as inference. Generation: complex sinusoids multiplied by time-varying (complex) coefficients. Inference: the most probable coefficients given the signal are exactly the STFT coefficients; the STFT window corresponds to the frequency-shifted prior covariance multiplied by the inverse signal covariance.


Adding observation noise splits the posterior mean into a term that depends on the signal and a term that is independent of it: inference becomes a Wiener filter, and the model yields a probabilistic filter bank and a probabilistic STFT. As in the noiseless case, the STFT window corresponds to the frequency-shifted prior covariance multiplied by the inverse signal covariance.
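A minimal numpy sketch of the Wiener-filter posterior mean (the stationary prior covariance and noise level below are hypothetical): for y = s + e with s ~ N(0, Gamma) and e ~ N(0, sigma^2 I), the posterior mean E[s | y] = Gamma (Gamma + sigma^2 I)^{-1} y depends on both the signal and noise covariances.

```python
import numpy as np

# Wiener filtering as Gaussian posterior inference (hypothetical prior/noise).
rng = np.random.default_rng(0)
T = 256
lags = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
# band-limited stationary prior: damped-cosine covariance
Gamma = np.cos(2 * np.pi * 0.05 * lags) * np.exp(-lags / 40.0)
noise_var = 0.5

s = rng.multivariate_normal(np.zeros(T), Gamma + 1e-8 * np.eye(T))
y = s + np.sqrt(noise_var) * rng.standard_normal(T)

# posterior mean E[s | y] = Gamma (Gamma + noise_var I)^{-1} y: the Wiener filter
s_hat = Gamma @ np.linalg.solve(Gamma + noise_var * np.eye(T), y)
```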

Time-frequency analysis as inference: summary.
- Probabilistic models in which inference recovers STFT, filter-bank, and wavelet analyses.
- Unifies a number of existing probabilistic time-series models and connects them to traditional signal processing.
- Can learn the window of the STFT and the frequencies (equivalently, the filter properties).
- The frequency-shift relationship mimics the classical relationship between these time-frequency representations.
- Hops/down-sampling and the finite window used correspond to the FITC (uniformly spaced pseudo-points) and sparse-covariance approximations.
- Rediscovers the Nyquist criterion in the context of approximate GPs.

Probabilistic audio processing pipeline: slow Gaussian-process envelopes (shaped by a mean spectrum and envelope patterns) multiply band-pass Gaussian-noise carriers (also shaped by the mean spectrum) to produce the signal. [Panels: envelopes and carriers across roughly 0.1 to 2.6 kHz; signal waveform over 0 to 180 ms.]
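A small numpy sketch of this generative model (channel centre frequencies, bandwidths, and smoothing lengths are hypothetical): slow positive envelopes multiply band-pass Gaussian-noise carriers, and the channels sum to give the signal.

```python
import numpy as np

# Envelopes x carriers generative model (all settings hypothetical).
rng = np.random.default_rng(1)
fs, T = 8000.0, 4000                      # sample rate /Hz, samples
freqs = np.fft.rfftfreq(T, 1 / fs)
centres = [200.0, 500.0, 1300.0]          # channel centre frequencies /Hz

signal = np.zeros(T)
for fc in centres:
    # carrier: white Gaussian noise band-passed by a Gaussian bump in frequency
    spec = np.fft.rfft(rng.standard_normal(T))
    carrier = np.fft.irfft(spec * np.exp(-0.5 * ((freqs - fc) / 50.0) ** 2), T)
    # envelope: a slow (smoothed) Gaussian process made positive with exp()
    slow = np.convolve(rng.standard_normal(T), np.hanning(801), mode="same")
    envelope = np.exp(slow / np.abs(slow).max())
    signal += envelope * carrier
```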

Key observation (inference and learning): fix the envelopes; then the posterior over the carriers is Gaussian, and the posterior mean is given by an (adaptive) filter. This leads to MAP estimation of the envelopes (or Hamiltonian MCMC). Let z_{lt} = log h_{lt}; then
Z_MAP = argmax_Z p(Z | Y), where p(Z | Y) = p(Z, Y) / p(Y) = (1/p(Y)) \int dX p(Z, Y, X) = (1/p(Y)) p(Z) \int dX p(Y | A, X) p(X), with the amplitudes A determined by Z (A = exp(Z)).
The integral is computed efficiently using a chain-structured approximation and Kalman smoothing, leading to gradient-based optimisation of the transformed amplitudes. Learning: approximate maximum likelihood, theta* = argmax_theta p(Y | theta). NMF: zero-temperature EM with a single E-step, initialised with constant envelopes.
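A toy numpy illustration of the key observation (sizes and the i.i.d. per-sample carrier prior are simplifying assumptions; the talk uses band-pass GP carriers and Kalman smoothing): with the envelopes A held fixed, y_t = sum_d A[d,t] X[d,t] + noise is linear-Gaussian in the carriers, so the carrier posterior is Gaussian with a closed-form (adaptive-filter) mean.

```python
import numpy as np

# Fixed envelopes => Gaussian carrier posterior (toy i.i.d. prior per sample).
rng = np.random.default_rng(2)
D, T, noise_var = 2, 300, 0.1
A = np.abs(rng.standard_normal((D, T))) + 0.1       # fixed positive envelopes

X = rng.standard_normal((D, T))                     # carriers, unit Gaussian prior
y = (A * X).sum(axis=0) + np.sqrt(noise_var) * rng.standard_normal(T)

# Per-sample posterior mean (via Sherman-Morrison):
# E[X[:, t] | y_t] = A[:, t] * y_t / (||A[:, t]||^2 + noise_var)
X_hat = A * y / ((A ** 2).sum(axis=0) + noise_var)
```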

Audio modelling. [Spectrograms, frequency /kHz against time /s, of natural sound textures: fire, stream, wind, rain, footsteps, tent-zip (Turner, 2010).]

Statistical texture synthesis. Old approach: build detailed physical models (e.g. of rain drops). New approach: train the model on your favourite texture, then sample from the prior and then from the likelihood. The resulting waveform is unique but statistically matched to the original, and often perceptually indistinguishable from it.

Audio denoising. [Figure: SNR improvement /dB against SNR before /dB, PESQ improvement against PESQ before, and log-spectral SNR improvement /dB against log-spectral SNR before /dB, for NMF, tNMF, GTF, and GTFtNMF with adapted and unadapted filters, compared against Wiener filtering, spectral subtraction, and block thresholding; accompanying waveform plots of y_t over time /ms.]

Audio missing data imputation. [Figure: SNR /dB, PESQ, and log-spectral SNR /dB against the length of the missing region /ms, for tNMF, GTF, and GTFtNMF with unadapted and adapted filters; accompanying waveform plots of y_t over time /ms.]

Unifying classical and probabilistic audio signal processing: probabilistic signal processing contributes robustness and adaptation; classical signal processing contributes fast methods and identifies the important variables.

[Diagram: probabilistic counterparts of classical methods. Cemgil & Godsill's model corresponds to STFT estimation, and Qi & Minka's model to filter-bank & Hilbert analysis, the two related by a frequency shift; estimating the amplitudes corresponds to the spectrogram.]

Additional slides