Generative models for amplitude modulation: Gaussian Modulation Cascade Processes
1 Generative models for amplitude modulation: Gaussian Modulation Cascade Processes. Richard Turner and Maneesh Sahani, Gatsby Computational Neuroscience Unit, 31/01/2007
2–3 Motivation [figures: sound waveforms; axis: time /s]
4–5 Motivation: Traditional AM [figure: signal = carrier × envelope; axis: time /s]
6 Motivation: Demodulate the Modulator [figure: the envelope is itself demodulated; axis: time /s]
7 Motivation: Demodulation Cascade [figure: repeated demodulation gives a cascade of envelopes; axis: time /s]
8 Goal and structure of the talk. Eventual GOAL: construct a latent variable model for sounds in order to understand auditory processing in the brain. Amplitude modulation is an important statistical dimension of sounds, and the auditory system listens attentively to amplitude modulation. First half of talk: statistical models for amplitude demodulation. Second half of talk: a generative model for sounds.
9 Applications for Amplitude Demodulation. Our focus: natural scene statistics and auditory processing. Also: psychophysics (manipulation of speech); signal processing applications; bearing-failure prediction (bearings account for half of all motor failures); financial models.
10–12 Traditional Demodulation Algorithms: Hilbert transform [figures: signal y with recovered envelope and carrier x; axis: time /s]
13–15 Traditional Demodulation Algorithms: square and lowpass [figures: signal y with recovered envelope and carrier x; axis: time /s]
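The two traditional algorithms above are easy to state concretely. A minimal sketch, assuming an illustrative test signal, sample rate, and lowpass cutoff (none of these come from the talk):

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt

fs = 16000                                       # assumed sample rate (Hz)
t = np.arange(fs) / fs                           # one second of signal
y = np.sin(2 * np.pi * 1000 * t) * (1 + 0.8 * np.sin(2 * np.pi * 4 * t))

# 1) Hilbert transform: envelope = magnitude of the analytic signal.
#    No tunable parameter; the recovered carrier has roughly constant variance.
env_h = np.abs(hilbert(y))
carrier_h = y / np.maximum(env_h, 1e-12)

# 2) Square and lowpass: smooth y**2 with a lowpass filter, take the root.
#    The cutoff (10 Hz here) is the tunable parameter.
b, a = butter(4, 10 / (fs / 2), btype="low")
env_s = np.sqrt(np.maximum(filtfilt(b, a, y ** 2), 0.0))
carrier_s = y / np.maximum(env_s, 1e-12)
```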
16 Traditional Demodulation Algorithms are not sufficient. Hilbert transform: recovers a carrier of constant variance; recovers a modulator that may not correspond to the envelope we want; has no tunable parameter. Square and lowpass: recovers a good-looking smooth envelope; recovers a carrier of high variance, i.e. not demodulated; has a tunable parameter.
17 Advantages of the probabilistic approach. Representing a signal as a quickly varying carrier times a slowly varying positive envelope is fundamentally an ill-posed problem: we need prior information, and therefore the probabilistic setting is the natural one. It makes assumptions explicit, allowing them to be criticised and improved. It allows us to put error bars on our inferences. We can develop a family of related algorithms: cheap and cheerful, slow and Bayesian, with(out) hand-tunable parameters. We can tap into existing theory and methods, e.g. to deal with missing data, model comparison, etc.
18–21 A simple generative model for AM: $z^{(2)}_t \sim \mathrm{Norm}(\lambda z^{(2)}_{t-1}, \sigma^2)$, $x^{(2)}_t = f_a(z^{(2)}_t)$, $x^{(1)}_t \sim \mathrm{Norm}(0, 1)$, $y_t = x^{(1)}_t x^{(2)}_t$; the effective length-scale of the envelope is $l_{\mathrm{eff}} = -1/\log(\lambda)$. [diagrams: envelope $x^{(2)}_{1:T}$, carrier $x^{(1)}_{1:T}$, and observed signal $y_{1:T}$]
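Ancestral sampling from this model is direct. A minimal sketch; the positive link $f_a$ is not pinned down on the slides, so a softplus with gain $a$ is assumed here purely for illustration:

```python
import numpy as np

def sample_am(T=10000, lam=0.999, sigma=0.05, a=4.0, seed=0):
    rng = np.random.default_rng(seed)
    z = np.zeros(T)
    for t in range(1, T):                    # AR(1): z_t ~ Norm(lam * z_{t-1}, sigma^2)
        z[t] = lam * z[t - 1] + sigma * rng.standard_normal()
    x2 = np.log1p(np.exp(a * z)) / a         # f_a: assumed softplus link -> positive envelope
    x1 = rng.standard_normal(T)              # white Gaussian carrier, x1_t ~ Norm(0, 1)
    return x1 * x2, x1, x2                   # observation y_t = x1_t * x2_t

y, x1, x2 = sample_am()
print(-1.0 / np.log(0.999))                  # l_eff = -1/log(lam) ~ 1000 samples
```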
22–26 A cheap and cheerful learning algorithm: maximise over the envelope and other parameters.
$p(y_{1:T}, x^{(1)}_{1:T}, x^{(2)}_{1:T}, \lambda \mid \sigma^2, a) = p(\lambda)\, p(y_{1:T} \mid x^{(1)}_{1:T}, x^{(2)}_{1:T})\, p(x^{(1)}_{1:T})\, p(x^{(2)}_{1:T})$, where $p(x^{(2)}_{1:T}) = p(z_{1:T}) \prod_{t=1}^{T} \left| \frac{dz_t}{dx^{(2)}_t} \right|$.
$p(y_{1:T}, x^{(2)}_{1:T}, \lambda \mid \sigma^2, a) = \int dx^{(1)}_{1:T}\; p(y_{1:T}, x^{(1)}_{1:T}, x^{(2)}_{1:T}, \lambda \mid \sigma^2, a)$
$p(y_{1:T}, x^{(2)}_{1:T} \mid \sigma^2, a) = \int d\lambda\, dx^{(1)}_{1:T}\; p(y_{1:T}, x^{(1)}_{1:T}, x^{(2)}_{1:T}, \lambda \mid \sigma^2, a)$
$\arg\max_{x^{(2)}_{1:T}, \sigma^2, a}\; p(y_{1:T}, x^{(2)}_{1:T} \mid \sigma^2, a)$
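Since $y_t = x^{(1)}_t x^{(2)}_t$ with $x^{(1)}_t \sim \mathrm{Norm}(0,1)$, the carrier integrates out analytically to give $y_t \mid x^{(2)}_t \sim \mathrm{Norm}(0, (x^{(2)}_t)^2)$, and the remaining maximisation can be handed to a generic optimiser. A sketch under simplifying assumptions ($\lambda$ held fixed rather than marginalised, and the link taken to be $\exp$), not the slide's exact recipe:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_joint(z, y, lam=0.99, sigma=0.1):
    x2 = np.exp(z)                                        # assumed link: envelope = exp(z)
    ll = -0.5 * np.sum(np.log(x2 ** 2) + (y / x2) ** 2)   # y_t | x2_t ~ Norm(0, x2_t^2)
    dz = z[1:] - lam * z[:-1]                             # AR(1) prior on z, lam fixed
    return -(ll - 0.5 * np.sum(dz ** 2) / sigma ** 2)

def map_envelope(y):
    z0 = np.log(np.abs(y) + 1e-3)            # crude initialisation from |y|
    res = minimize(neg_log_joint, z0, args=(y,), method="L-BFGS-B")
    return np.exp(res.x)                     # MAP estimate of the envelope
```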
27–29 Results [figures: signal y with inferred envelope and carrier x; time axis ×10^4 samples]
30 Desirable features of a new generative model. We could use this algorithm to collect low-level statistics of amplitude modulation in the multidimensional setting. The alternative is to build a model which attempts to learn this structure in a higher-level prior.
31 Desirable features of a new generative model 1. Sparse outputs. 2. Explicit temporal dimension, smooth latent variables. 3. Hierarchical prior that captures the AM statistics of sounds at different time scales: cascade of modulatory processes, with slowly varying processes at the top modulating more rapidly varying signals towards the bottom. 4. Learnable; and we would like to preserve information about the uncertainty, and possibly correlations, in our inferences.
32 [cascade diagram] $K_2 = 1$, $K_1 = 1$: $y_t = x^{(1)}_{1,t}\, x^{(2)}_{1,t}$
33 [cascade diagram] $K_2 = 1$, $K_1 = 2$, weights $g_{11}, g_{21}$: $y_t = (g_{11} x^{(1)}_{1,t} + g_{21} x^{(1)}_{2,t})\, x^{(2)}_{1,t}$
34 [cascade diagram] as above with unit weights: $y_t = (x^{(1)}_{1,t} + x^{(1)}_{2,t})\, x^{(2)}_{1,t}$
35 [cascade diagram] $K_3 = 1$, $K_2 = 2$, $K_1 = 3$: $y_t = x^{(3)}_{t,1} \big[ x^{(2)}_{t,1} (g_{111} x^{(1)}_{t,1} + g_{211} x^{(1)}_{t,2} + g_{311} x^{(1)}_{t,3}) + x^{(2)}_{t,2} (g_{121} x^{(1)}_{t,1} + g_{221} x^{(1)}_{t,2} + g_{321} x^{(1)}_{t,3}) \big]$
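The emission on slide 35 is a full tensor contraction: one weight index per level, and the output is the weighted product of the latent vectors at each time step. A minimal numpy sketch for the three-level case; all array values are random stand-ins for learned parameters and inferred latents:

```python
import numpy as np

T, K1, K2, K3 = 100, 3, 2, 1
rng = np.random.default_rng(0)
x1 = rng.standard_normal((T, K1))        # fastest latents, x^(1)_{t,k1}
x2 = rng.standard_normal((T, K2))        # intermediate latents
x3 = rng.standard_normal((T, K3))        # slowest latents
g = rng.standard_normal((K1, K2, K3))    # weight tensor g_{k1 k2 k3}

# y_t = sum_{k1,k2,k3} g_{k1 k2 k3} x1_{t,k1} x2_{t,k2} x3_{t,k3}
y = np.einsum("abc,ta,tb,tc->t", g, x1, x2, x3)
```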
36 [figure: draw from a model with $M = 3$, $K_1 = 3$, $K_2 = 2$, $K_3 = 1$, $D = 80$]
37 Inference and learning. Inference and learning by EM has an intractable E-step, but due to the structure of the non-linearity variational EM can be used. Key point: if we freeze all the latent time-series bar one, the distribution over the unfrozen chain is Gaussian. This leads to a family of efficient variational approximations, $p(x \mid Y) \approx \prod_m q(x^{(m)}_{1:T,1:K_m})$, where each approximating distribution is a Gaussian.
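To see why freezing all chains but one helps: with the other chains fixed, their products collapse into a known time-varying weight $c_t$ on the remaining chain, leaving a linear-Gaussian state-space model whose posterior is exactly Gaussian. A sketch for a single AR(1) chain with illustrative parameters (this is the standard Kalman recursion, not the talk's full variational update):

```python
import numpy as np

def kalman_filter(y, c, lam=0.99, q=0.01, r=0.1):
    """Filter x_t = lam*x_{t-1} + Norm(0, q), observed as y_t = c_t*x_t + Norm(0, r)."""
    T = len(y)
    mu, v = np.zeros(T), np.zeros(T)
    m, p = 0.0, 1.0                                  # prior mean/variance at t = 0
    for t in range(T):
        m_pred, p_pred = lam * m, lam ** 2 * p + q   # predict one step ahead
        k = p_pred * c[t] / (c[t] ** 2 * p_pred + r) # Kalman gain
        m = m_pred + k * (y[t] - c[t] * m_pred)      # correct the mean
        p = (1.0 - k * c[t]) * p_pred                # correct the variance
        mu[t], v[t] = m, p
    return mu, v                                     # exact Gaussian posterior marginals
```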
38 Inference and Learning: proof of concept. Model: $K_2 = 1$, $K_1 = 2$, $y_t = (g_{11} x^{(1)}_{1,t} + g_{21} x^{(1)}_{2,t})\, x^{(2)}_{1,t}$ [cascade diagram]
39 Inference and Learning: proof of concept [figure: true and inferred $X^{(2)}_{1,t}$, $X^{(1)}_{1,t}$, $X^{(1)}_{2,t}$ and data $Y$ against time]
40 Current work and future directions. Apply to natural sounds (requires good initialisation) and single-microphone source separation. Generalise the model to have: non-local features, to capture sweeps; correlations in the prior, to capture the mutual exclusivity of voiced and unvoiced sections of speech. Representation: real sounds live on a hyper-plane in STFT space: can we project our model onto this manifold? How do we map posteriors to spike trains: $p(\text{spike trains} \mid y) = f[q(x \mid y)]$?
41 Extra slides...
42 AM: a candidate organising principle in the auditory system? The auditory system listens attentively to amplitude modulation. Examples include comodulation masking release in psychophysics, and a possible topographic mapping of AM in the IC from electrophysiology. Main goal: discover computational principles underpinning auditory processing. But armed with a generative model you can also do source segregation, fill in missing data, remove artifacts, denoise sounds, etc.
43–46 What's up with current generative models? [table: each model scored on whether its latents are sparse, whether they share power, whether they are slowly varying, and whether the model is learnable]
SC: assumes latent variables are sparse.
GSM: assumes latents are sparse and share power; $x = \lambda u$, with $\lambda \geq 0$ a positive scalar random variable and $u \sim G(0, Q)$.
SFA: assumes latents are slow.
Bubbles: assumes latents are sparse, slow (and share power).
47 What's the point in a generative model for sounds? Theoretical neuroscience: understand neural receptive fields: if a latent variable model provides a computationally effective representation of sounds, neurons might encode $p(\text{latent} \mid \text{sound})$. Psychophysics: play sounds to subjects (possibly drawn from the forward model), and compare their inferences with inference in the model. Machine learning: fill in missing data, e.g. to fill in an intermittent recording or remove artifacts; denoise sounds; stream segregation; compression.
48 Outline. Natural auditory scene statistics - amplitude modulation is a key component. Amplitude modulation in the auditory system - a candidate organising principle (the auditory system is short on such things). Previous statistical models of natural scenes - Gaussian Scale Mixture models and AM. A new model for sounds: the Gaussian Modulation Cascade Process. Ongoing issues...
49 Natural Auditory Scene statistics: acoustic ecology. The marginal distribution is very sparse (more so than in vision). WHY? Lots of soft sounds, e.g. the pauses between utterances in speech, and rare, highly structured, localised events that carry substantial parts of the stimulus energy.
50 Statistics of Amplitude Modulation (Attias & Schreiner). Methods: pass the sound through a filterbank; Hilbert transform to find the envelope $a(t, \omega)$; take logs and transform to zero mean and unit variance, giving $\hat{a}(t, \omega)$. Marginal statistics: $p[\hat{a}(t,\omega)] \propto \frac{\exp[-\gamma \hat{a}(t,\omega)]}{(\beta^2 + \hat{a}(t,\omega)^2)^{\alpha/2}}$. Spectrum: $FT[\hat{a}(t,\omega)] \propto \frac{1}{(\omega_0^2 + \omega^2)^{\alpha/2}}$.
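A sketch of this measurement pipeline; Butterworth bandpass filters and the band edges stand in for whatever filterbank Attias & Schreiner actually used:

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def am_statistics(y, fs, bands=((200, 400), (400, 800), (800, 1600))):
    a_hat = []
    for lo, hi in bands:                                 # crude stand-in filterbank
        b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        env = np.abs(hilbert(filtfilt(b, a, y)))         # envelope a(t, w)
        log_env = np.log(env + 1e-12)                    # take logs
        a_hat.append((log_env - log_env.mean()) / log_env.std())  # zero mean, unit variance
    return np.array(a_hat)    # rows are a_hat(t, w): histogram/FFT these per band
```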
51 Results summary. The marginal distribution is very sparse (finite probability of arbitrarily soft sounds), independent of filter centre frequency, filter bandwidth, and time resolution. If AM statistics were uncorrelated over time and frequency, the CLT would predict that increasing the filter bandwidth/time resolution would make the distribution more Gaussian. The spectrum of the amplitude modulations is independent of filter centre frequency; it follows a modified power law, indicating long temporal correlations (scale invariance); and it is independent of filter bandwidth (cf. Voss and Clarke's 1/f). Implications: 1. A good generative model of sounds should capture long correlations in AM across frequency bands and time (> 100 ms). 2. Each location on the cochlea sees the same AM stats.
52 AM in the Auditory System - highlights from lots of work. Psychophysics: comodulation masking release. Task: detect a tone in noise; alter the bandwidth of the noise and measure the detection threshold; repeat, but amplitude-modulate the masker.
53 Electrophysiological data. Type I AN fibres phase-lock to AM (implying a temporal code); as we move up the neuraxis the tuning moves from temporal to rate. There is evidence in the IC for topographic AM tuning (Schreiner and Langner, 1988). Cortex: AM processing seems filter-independent (a modulation filter bank?). The jury is still out, but AM may be a fundamental organising principle.
54 What's wrong with statistical models for natural scenes? ICA and sparse coding: $p(x) = \prod_{i=1}^{I} p(x_i)$ with $p(x_i)$ sparse; $p_{ICA}(y \mid x) = \delta(Gx - y)$; $p_{SC}(y \mid x) = \mathrm{Norm}(Gx, \sigma^2 I)$; recognition weights: $p_{ICA}(x \mid y) = \delta(x - Ry)$. [figures: learned features for images and sounds] Problems: 1. Extracted latents are not independent: there are correlations in their power. 2. No explicit temporal dimension - not a true generative model for sounds or movies.
55 Problem 1: empirical distribution of the latents - power correlations [figure: joint and conditional distributions of latent pairs]
56 Caption for the previous figure. The expected joint distribution of latents was starry (top left). Empirically the joint is found to be elliptical: many more instances of two latents being high than expected (top right). Another way of seeing this is to look at the conditionals (bottom): if one latent has high power then nearby latents also tend to have high power. How do we improve the model? 1. Fix the bottom level $p(y \mid x) = \delta(y - Gx)$. 2. This fixes the recognition distribution $p(x \mid y) = \delta(x - Ry)$. 3. $p(x) = \int dy\, p(x \mid y) p(y) \approx \frac{1}{N} \sum_n p(x \mid y_n)$. 4. These are exactly the distributions plotted on the previous page. 5. No matter what $R$ is, we get similar empirical distributions. 6. So choose a new prior to match them...
57 Gaussian Scale Mixtures (GSMs): $x = \lambda u$, with $\lambda \geq 0$ a scalar random variable, $u \sim G(0, Q)$, and $\lambda$ and $u$ independent. The density of these semi-parametric models can be expressed as an integral: $p(x) = \int p(x \mid \lambda)\, p(\lambda)\, d\lambda = \int |2\pi\lambda^2 Q|^{-1/2} \exp\!\left(-\frac{x^T Q^{-1} x}{2\lambda^2}\right) p(\lambda)\, d\lambda$. E.g. $p(\lambda)$ discrete gives a MOG (components zero-mean); $p(\lambda)$ Gamma gives a Student-t. Imagine generalising this to the temporal setting: $x(t) = \lambda(t) u(t)$ = positive envelope × carrier. Are we seeing the hallmarks of AM in a non-temporal setting?
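A quick way to see the sparsity a GSM produces is to sample one. A sketch of the Gamma-to-Student-t case mentioned above, drawing a Gamma precision and using its inverse square root as the multiplier; the shape parameter is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, a = 100_000, 3.0
lam = rng.gamma(shape=a, scale=1.0 / a, size=n) ** -0.5  # positive scalar multiplier
u = rng.standard_normal(n)                               # Gaussian carrier
x = lam * u                                              # Student-t marginal (6 dof here)

print(np.mean(x ** 4) / np.mean(x ** 2) ** 2)            # kurtosis > 3: sparser than Gaussian
```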
58 Learning a neighbourhood (Karklin and Lewicki 2003, 05, 06). The statistical dependencies between filters depend on their separation (in space, scale, and orientation); we'd like to learn these neighbourhoods of dependence. Solution: share multipliers in a linear fashion using a GSM with a generalised log-normal over the variances: $p(z_j) = \mathrm{Norm}(0, 1)$; $\lambda_i^2 = \exp\!\left(\sum_j h_{ij} z_j\right)$; $p(x_i \mid z, B) = \mathrm{Norm}(0, \lambda_i^2)$; $p(y \mid G, x) = \mathrm{Norm}(Gx, \sigma^2 I)$.
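A sketch of ancestral sampling in this construction: higher-level Gaussians $z$ set per-coefficient variances through the neighbourhoods $h$, and the coefficients $x$ then generate the data. The shapes and the random $h$ and $G$ are illustrative stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
J, I, D, sigma_y = 5, 20, 30, 0.1
h = 0.5 * rng.standard_normal((I, J))        # neighbourhoods of dependence (learned in reality)
G = rng.standard_normal((D, I))              # generative weights (learned in reality)

z = rng.standard_normal(J)                   # p(z_j) = Norm(0, 1)
lam2 = np.exp(h @ z)                         # lambda_i^2 = exp(sum_j h_ij z_j)
x = np.sqrt(lam2) * rng.standard_normal(I)   # p(x_i | z) = Norm(0, lambda_i^2)
y = G @ x + sigma_y * rng.standard_normal(D) # p(y | G, x) = Norm(Gx, sigma^2 I)
```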
59 A temporal GSM: the proof-of-concept Bubbles model. $p(z_{i,t})$ = sparse point process; $\lambda_{i,t}^2 = f\!\left(\sum_j h_{ij}\, \Phi(t) * z_j(t)\right)$; $p(x_{i,t} \mid \lambda_{i,t}) = \mathrm{Norm}(0, \lambda_{i,t}^2)$; $p(y_t \mid G, x_t) = \delta(Gx_t - y_t)$. Temporal correlations between the multipliers are captured by the moving average $\Phi(t) * z_j(t)$ (creating a SLOW ENVELOPE). Bubbles of activity in latent space (both in space and time). Columns of $h_{ij}$ are fixed and change smoothly to induce topographic structure - computationally useful.
60 Learning. It is common to learn the parameters using zero-temperature EM: $q(x) = \delta(x - X_{MAP}) \approx p(x \mid Y)$, so uncertainty and correlational information is lost. 1. This affects learning. 2. To compare to neural data we need to specify a mapping: $p(\text{spike trains} \mid y) = f[q(x \mid y)] = f(X_{MAP})$ for this approximation. 3. BUT we believe neural populations will represent uncertainty and correlations in latent variables. We'd like to retain variance and correlational information, both for learning and for comparison to biology.
61 Motivations for the GMCP 1. Sparse outputs. 2. Explicit temporal dimension, smooth latent variables. 3. Hierarchical prior that captures the AM statistics of sounds at different time scales: cascade of modulatory processes, with slowly varying processes at the top modulating more rapidly varying signals towards the bottom. 4. Learnable; and we would like to preserve information about the uncertainty, and possibly correlations, in our inferences.
62 Dynamics: The Gaussian Modulation Cascade Process. $p(x^{(m)}_{k,t} \mid x^{(m)}_{k,t-1:t-\tau_m}, \lambda^{(m)}_{k,1:\tau_m}, \sigma^2_{m,k}) = \mathrm{Norm}\!\left(\sum_{t'=1}^{\tau_m} \lambda^{(m)}_{k,t'}\, x^{(m)}_{k,t-t'},\; \sigma^2_{m,k}\right)$. Emission distribution: $p(y_t \mid x^{(1:M)}_t, g_{k_1:k_M}, \sigma^2_y) = \mathrm{Norm}\!\left(\sum_{k_1:k_M} g_{k_1:k_M} \prod_{m=1}^{M} x^{(m)}_{k_m,t},\; \sigma^2_y I\right)$. Time-frequency representation: $y_t$ = filter bank outputs.
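A sketch of ancestral sampling from the GMCP, simplified to AR(1) dynamics ($\tau_m = 1$) with slower coefficients at higher levels; the sizes, AR coefficients, and noise level are illustrative, not the talk's values:

```python
import numpy as np

rng = np.random.default_rng(0)
T, Ks, lams = 500, (3, 2, 1), (0.9, 0.99, 0.999)   # K_m and AR coefficient per level

X = []
for K, lam in zip(Ks, lams):                       # level m = 1 (fast) ... M (slow)
    x = np.zeros((T, K))
    for t in range(1, T):                          # AR(1) chains with unit stationary variance
        x[t] = lam * x[t - 1] + np.sqrt(1 - lam ** 2) * rng.standard_normal(K)
    X.append(x)

g = rng.standard_normal(Ks)                        # weight tensor g_{k1 k2 k3}
y = np.einsum("abc,ta,tb,tc->t", g, *X)            # sum_k g * prod_m x^(m)_{km,t}
y += 0.01 * rng.standard_normal(T)                 # emission noise (D = 1 for clarity)
```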
63 E.g. $M = 2$, $K_1 = 2$, $K_2 = 1$, $D = 1$: $y_t = (g_{11} x^{(1)}_{1,t} + g_{21} x^{(1)}_{2,t})\, x^{(2)}_{1,t}$ [cascade diagram with weights $g_{11}$, $g_{21}$]
64 E.g. $M = 4$, $K_1 = 4$, $K_2 = 2$, $K_3 = 2$, $K_4 = 1$, $D = 80$ [figure: draw from the model]
65 Comments on the model. A GSM: latent = rectified Gaussian × Gaussian. The emission distribution is related to Grimes & Rao and Tenenbaum & Freeman (for $M = 2$) and Vasilescu & Terzopoulos for general $M$, but the temporal priors allow us to be fully unsupervised. As $p(x^{(m)}_{1:K,1:T} \mid x^{(1:M \setminus m)}_{1:K,1:T}, y_{1:T}, \theta)$ is Gaussian, there is a family of variational EM algorithms we can use to learn the model.
66 Inference and Learning: proof of concept [figure: true and inferred $X^{(2)}_{1,t}$, $X^{(1)}_{1,t}$, $X^{(1)}_{2,t}$ and data $Y$ against time]
67 Future directions and current issues. DIRECTIONS: correlated prior: speech is usually periodic (voiced) or unvoiced - mutually exclusive; non-local features: make $G$ temporal, making sweeps easier to represent. ISSUES: Representation: real sounds live on a hyper-plane in filter-bank space - can we project our model onto this manifold? Initialisation: the free energy has lots of local minima; use slow-modulation analysis to initialise: $\arg\min_{g_n} \left\langle \left[\frac{d\, a(g_n^T y)}{dt}\right]^2 \right\rangle$ such that $\langle a(g_n^T y)\, a(g_m^T y) \rangle = \delta_{mn}$. How do we map posteriors to spike trains: $p(\text{spike trains} \mid y) = f[q(x \mid y)]$?
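The slow-modulation initialisation above is, in its linear form, ordinary slow feature analysis, which reduces to a generalised eigenproblem. A sketch that drops the nonlinearity $a(\cdot)$ as a simplification:

```python
import numpy as np
from scipy.linalg import eigh

def sfa_init(Y, n_components=3):
    """Y: (T, D) filterbank outputs. Returns the n slowest projections g_n as columns."""
    Yc = Y - Y.mean(axis=0)
    dY = np.diff(Yc, axis=0)           # temporal derivatives
    A = dY.T @ dY / len(dY)            # <dy dy^T>: slowness objective
    B = Yc.T @ Yc / len(Yc)            # <y y^T>: unit-variance/decorrelation constraint
    evals, evecs = eigh(A, B)          # generalised eigenproblem, eigenvalues ascending
    return evecs[:, :n_components]     # slowest directions first
```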
68–71 What's up with current generative models? The assumptions, one model at a time: ICA and SC assume latent variables are sparse; the GSM assumes latents are sparse and share power ($x^{(1)} = \lambda u$, $\lambda \geq 0$ a positive scalar random variable, $u \sim G(0, Q)$); SFA assumes latents are slow; Bubbles assumes latents are sparse, slow (and share power). Summary table:

Model   | p(x^(2))      | p(x^(1) | x^(2))                                    | p(y | x^(1))
ICA     | N/A           | sparse                                              | δ(y − Gx^(1))
SC      | N/A           | sparse                                              | Norm(Gx^(1), σ²_y I)
GSM     | Norm(0, I)    | Norm(0, λ_i²), λ_i² = exp(h_iᵀ x^(2))               | Norm(Gx^(1), σ²_y I)
SFA     | N/A           | Norm(γ x_{t−1}, σ²)                                 | δ(y − Gx^(1))
Bubbles | point-process | Norm(0, λ²_{i,t}), λ²_{i,t} = f(h_iᵀ x^(2)_t ∗ φ_t) | δ(y − Gx^(1))