From Bases to Exemplars, from Separation to Understanding
Paris Smaragdis
1 From Bases to Exemplars, from Separation to Understanding
Paris Smaragdis
CS & ECE Depts., University of Illinois at Urbana-Champaign
2 Some Motivation
- Why do we separate signals? I don't really know.
- Is there an all-conquering algorithm? I suspect not, really.
- So why are you working on both of the above, Paris? It's a good exercise for what's to come.
3 Outline
- Low-rank models: learning to listen
- Nearest subspace approaches: using data, not fancy algorithms
- Taking advantage of semantic information: explaining mixtures, not decomposing them
4 How do we deal with mixtures?
- We find coherent structure
- We can mimic humans, or we can use...
[Figure: spectrogram, frequency vs. time]
5 Learning what to separate
- We cannot make up data; we need something to learn from.
[Figure: unknown data (original speech, original interference), observed data (mixture), and training data (training speech, training interference)]
6 Describing sounds
- A probabilistic interpretation of the spectrum
- Why? We don't care for scale and phase, and it allows us to perform sophisticated reasoning.
[Figure: probability vs. frequency bin]
7 The space we deal with
- These distributions live in a simplex.
[Figure: simplex with vertices {1,0,0}, {0,1,0}, {0,0,1}]
8 The space we deal with
- These distributions live in a simplex.
[Figure: points on the simplex with axes P(f1), P(f2), P(f3), e.g. [1, 0, 0], [1/2, 1/2, 0], [0.1, 0.8, 0.1]]
9 Modeling one sound
- Use a dictionary representation: P_t(f) = Σ_z P(f|z) P_t(z)
  (input spectra = dictionary elements × weights)
- z is the index of the dictionary element
- Everything is a distribution
- We can estimate the dictionary/weights using EM
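As a rough illustration of this model, here is a minimal PLCA-style EM sketch; the data sizes and the toy spectrogram are hypothetical, not the talk's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def plca(V, n_components, n_iter=500):
    """Estimate P_t(f) ~= sum_z P(f|z) P_t(z) with EM-style
    multiplicative updates.

    V : (n_freq, n_frames) spectrogram whose columns sum to 1.
    Returns W (columns are P(f|z)) and H (columns are P_t(z)).
    """
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, n_components))
    H = rng.random((n_components, n_frames))
    W /= W.sum(axis=0, keepdims=True)
    H /= H.sum(axis=0, keepdims=True)
    for _ in range(n_iter):
        R = W @ H + 1e-12                  # current reconstruction
        W *= (V / R) @ H.T                 # reweight dictionary by posterior evidence
        W /= W.sum(axis=0, keepdims=True)  # keep each P(f|z) a distribution
        R = W @ H + 1e-12
        H *= W.T @ (V / R)                 # same for the weights P_t(z)
        H /= H.sum(axis=0, keepdims=True)
    return W, H

# Toy demo: a spectrogram that truly is a 3-element dictionary mix.
W_true = rng.random((40, 3)); W_true /= W_true.sum(axis=0, keepdims=True)
H_true = rng.random((3, 60)); H_true /= H_true.sum(axis=0, keepdims=True)
V = W_true @ H_true
W, H = plca(V, 3)
```

Because everything stays normalized, the learned columns of W and H remain distributions, which is the point of the probabilistic reading of the spectrum.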
10 For the matrix inclined
- It's a linear transform: P_t(f) = Σ_z P(f|z) P_t(z)
- In matrix form: [f × t] = [f × z] · [z × t] (input spectra = dictionary elements × weights)
11 Huh?
[Figure: spectrogram, frequency bin vs. time frame]
12 Represented as frequency distributions
- Each column is normalized; each column is now P_t(f)
[Figure: normalized spectra, frequency bin vs. time frame]
13 A 2-element dictionary approximation
[Figure: P_t(f) approximated by two dictionary elements P(f|z) and their weights P_t(z)]
14 Complex sounds = large dictionaries
- Frequency distributions capture spectral character
15 Or we can see this as...
- Different areas of the simplex are different sounds
- Learned elements form convex hulls around them
[Figure: simplex with sources a and b and the dictionary elements for each]
16 Modeling mixtures
- A mix of two normed spectra lies on the connecting subspace
[Figure: two-source mix on the simplex, with and without dictionaries]
17 Huh again?
- Learn a speech dictionary and a chime dictionary from training data; keep them fixed
- On the mixture of speech and chimes, learn only the weights
[Figure: mixture of speech and chimes, extracted speech, extracted chimes]
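A sketch of this recipe, with hypothetical toy dictionaries standing in for the learned speech and chime dictionaries; the Wiener-style mask used for the final split is an assumption, since the slide does not spell out that step:

```python
import numpy as np

rng = np.random.default_rng(1)

def learn_weights(V, W, n_iter=300):
    """With the dictionaries W held fixed, learn only the weights H."""
    H = rng.random((W.shape[1], V.shape[1]))
    H /= H.sum(axis=0, keepdims=True)
    for _ in range(n_iter):
        R = W @ H + 1e-12
        H *= W.T @ (V / R)
        H /= H.sum(axis=0, keepdims=True)
    return H

# Hypothetical pre-learned dictionaries for two sources ("speech", "chimes").
k = 4
Wa = rng.random((30, k)); Wa /= Wa.sum(axis=0, keepdims=True)
Wb = rng.random((30, k)); Wb /= Wb.sum(axis=0, keepdims=True)

# A mixture of the two sources (columns still sum to 1).
Ha = rng.random((k, 50)); Ha /= 2 * Ha.sum(axis=0, keepdims=True)
Hb = rng.random((k, 50)); Hb /= 2 * Hb.sum(axis=0, keepdims=True)
V = Wa @ Ha + Wb @ Hb

# Explain the mixture, then split it with a Wiener-style mask.
H = learn_weights(V, np.hstack([Wa, Wb]))
Sa, Sb = Wa @ H[:k], Wb @ H[k:]
est_a = V * Sa / (Sa + Sb + 1e-12)
est_b = V * Sb / (Sa + Sb + 1e-12)
```

Because only H is estimated while W stays fixed, this inner problem is much better behaved than learning both factors at once.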
18 A problem
- Convex hulls are a bad idea; sounds can overlap
[Figure: overlapping convex hulls for sources A and B, the mixture, the estimates for A and B, and the approximation of the mixture]
19 Nearest subspace search
- Search over all possible solutions given the training data, i.e. exemplars
[Figure: training exemplars for sources x and y, and mixture m]
20 The bad news
- Very high computational complexity: M^N searches per query, for N sources and M training data points
- 8 min of audio, 5 sources, 206,719 training data points: 206,719^5 = 377,486,980,238,462,848,824,329,599 searches, for each input spectrum!
- Approximate algorithms: somewhat faster search, but unrealistic memory requirements (a few petabytes)
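The count on the slide can be checked directly:

```python
# An exhaustive nearest-subspace search over N sources with M exemplars
# each scores M**N candidate combinations for every input spectrum.
M, N = 206_719, 5
candidates = M ** N
print(f"{candidates:,}")  # 377,486,980,238,462,848,824,329,599
```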
21 Avoiding the search
- We can still use the previous model: P_t(f) = Σ_z P(f|z) P_t(z)
  (input spectra = consolidated training data × sparse weights)
- If we force the weights P_t(z) to be sparse, we approximate the nearest subspace search
22 Enforcing sparsity
- The hard way: entropic priors. We can tune each weight vector's entropy; for sparse P_t(z) we minimize its entropy. A pain to work with.
- The easy way: maximum l2-norm. Since 0 ≤ P_t(z) ≤ 1, maximizing the l2-norm results in sparsity; this corresponds to Simpson's diversity index.
- Both plug seamlessly into EM estimation.
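A tiny numeric illustration of why maximizing the l2 norm on the simplex favors sparsity (the example distributions are made up):

```python
import numpy as np

# On the simplex (0 <= p, sum(p) = 1) the l2 norm is smallest for the
# uniform distribution (1/sqrt(K)) and largest (1) when all mass sits
# on one element, so maximizing it pushes the weights toward sparsity.
def l2_norm(p):
    return float(np.sqrt(np.sum(np.asarray(p, dtype=float) ** 2)))

uniform = np.full(4, 0.25)               # maximally spread weights
mixed = np.array([0.7, 0.2, 0.1, 0.0])   # somewhat sparse
onehot = np.array([1.0, 0.0, 0.0, 0.0])  # one active dictionary element

# sum(p**2) is the quantity behind Simpson's diversity index,
# which is the connection the slide alludes to.
```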
23 Computation gains
- The proposed method is substantially faster for a realistic number of training data (> 1,000)
[Figure: computation required (FLOPS) vs. number of training samples per source, for 2- to 5-source searches]
24 How this looks
- Finds the points whose connecting subspace passes closest to the observed mixture point
[Figure: simplex plots using learned dictionaries (convex hulls) vs. using exemplars; sources A and B, mixture, estimates for A and B, and approximation of the mixture]
25 And some results
- TIMIT speech-on-speech mixes: ~20 dB SIR on average, ~30 dB SIR with post-processing
- Exemplars beat dictionaries, by a lot!
[Figure: signal-to-interference and signal-to-distortion ratios; extracted female and male speech]
26 A practical extension
- We can't know all sources, but we usually know one (target or interference)
- All mixing problems are binary: target vs. all else
- We need to learn extra bases that describe all that we don't know; a straightforward extension
[Figure: target dictionary P(f|z), observed mixtures P_t(f), implied competing sources' subspace]
27 In practice
- Selective parameter updates:
  P_t(f) = P_t(s1) Σ_z P(f|z,s1) P_t(z|s1) + P_t(s2) Σ_z P(f|z,s2) P_t(z|s2)
- Keep the known source's elements fixed and learn the weights; learn extra frequency elements that explain the other sounds
[Figure: target extraction and noise removal examples; mixture recording, target or interference example, extracted source]
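A simplified sketch of these selective updates, as a flat variant without the explicit source priors P_t(s); the data and sizes are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

def explain_mixture(V, W_target, n_extra, n_iter=500):
    """Selective updates: keep the target dictionary fixed, and learn
    extra bases for the competing sounds plus all of the weights."""
    n_freq, n_frames = V.shape
    k = W_target.shape[1]
    W_extra = rng.random((n_freq, n_extra))
    W_extra /= W_extra.sum(axis=0, keepdims=True)
    H = rng.random((k + n_extra, n_frames))
    H /= H.sum(axis=0, keepdims=True)
    for _ in range(n_iter):
        W = np.hstack([W_target, W_extra])
        R = W @ H + 1e-12
        H *= W.T @ (V / R)                    # weights: always updated
        H /= H.sum(axis=0, keepdims=True)
        R = W @ H + 1e-12
        W_extra *= (V / R) @ H[k:].T          # only the extra bases move
        W_extra /= W_extra.sum(axis=0, keepdims=True)
    return W_extra, H

# Toy setup: a known target dictionary plus unknown interference.
Wt = rng.random((30, 4)); Wt /= Wt.sum(axis=0, keepdims=True)
Wi = rng.random((30, 3)); Wi /= Wi.sum(axis=0, keepdims=True)
Ht = rng.random((4, 40)); Ht /= 2 * Ht.sum(axis=0, keepdims=True)
Hi = rng.random((3, 40)); Hi /= 2 * Hi.sum(axis=0, keepdims=True)
V = Wt @ Ht + Wi @ Hi
W_extra, H = explain_mixture(V, Wt, 3)
```

The extra bases only need to soak up "all else"; their individual accuracy does not matter, which is exactly the point of the binary target-vs-rest framing.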
28 Fun things to do
[Audio demos: original drum loop and extracted layers (no tambourine, no congas, congas!); a remixer separating music and voice layers; selective pitch shifting of a piano + soprano recording into piano and soprano layers, then remixed]
29 More fun things
30 But the objective is not to separate!!
- Source separation is a useless pursuit; there is almost never a reason to separate
- The real holy grail: understand mixtures, don't separate them
- A harder proposition, and rather unexplored
31 Making Direct Use of Exemplars
- Polyphonic pitch tracking: a difficult mixture problem
- Some observations: it can be learned, it shouldn't be user-specified; we can adapt what we've done to do so; and we should avoid separating!
32 Mono pitch tracking by example
[Figure: training data, spectrogram f_x1(t) with pitch track p_x1(t); data to pitch track, spectrogram f_y1(t) with unknown pitch]
33 Representation and matching
- Normalized warped spectrograms: provide gain invariance, clarify harmonic structure
- Nearest-neighbor match: find the closest training spectrum, use the neighbor's pitch tag
[Figure: training spectra with training pitch tags; input spectrum matched to its nearest neighbor]
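A minimal nearest-neighbor sketch of this matching step, with random toy vectors standing in for the normalized warped spectrograms and made-up pitch tags:

```python
import numpy as np

rng = np.random.default_rng(2)

def nn_pitch_track(train_spectra, train_pitch, inputs):
    """Tag every input frame with the pitch of its nearest training frame.

    train_spectra : (n_freq, n_train) normalized training spectra
    train_pitch   : (n_train,) pitch tag per training frame
    inputs        : (n_freq, n_in) normalized input spectra
    """
    # all pairwise squared Euclidean distances, input frames x training frames
    d = ((inputs ** 2).sum(axis=0)[:, None]
         - 2.0 * inputs.T @ train_spectra
         + (train_spectra ** 2).sum(axis=0)[None, :])
    return train_pitch[np.argmin(d, axis=1)]

# Toy data: random "spectra" with random pitch tags.
train = rng.random((64, 200)); train /= train.sum(axis=0, keepdims=True)
pitch = rng.uniform(80.0, 400.0, 200)

# Inputs are lightly perturbed copies of three training frames.
idx = np.array([5, 40, 123])
test = np.abs(train[:, idx] + 1e-4 * rng.standard_normal((64, 3)))
tracked = nn_pitch_track(train, pitch, test)
```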
34 How well does that work?
- Proposed pitch tracking accuracy: error mean μ = 0.02 Hz, error deviation σ = 1.1 Hz
- With popular pitch trackers (Praat): error mean μ = 0.1 Hz, error deviation σ = 1.2 Hz
- So we're on to something, but...
35 The polyphonic case
- Nearest neighbors are insensitive to additivity, and therefore can't resolve mixture sounds
- For mixtures we have to search for the nearest subspaces. Aha! We know how to do that!
[Figure: nearest neighbor search vs. nearest subspace search, untagged input against pitch-tagged data]
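For two sources, the exhaustive nearest subspace search can be sketched as follows; unconstrained least squares is used for brevity (an assumption here, since the model would keep the combination weights non-negative), and the exemplars are toy data:

```python
import numpy as np

rng = np.random.default_rng(3)

def nearest_subspace_pair(m, A, B):
    """Exhaustive nearest-subspace search for a two-source mixture:
    for every exemplar pair (A[:, i], B[:, j]) find the least-squares
    combination closest to m, and keep the best-fitting pair."""
    best_r, best_ij = np.inf, None
    for i in range(A.shape[1]):
        for j in range(B.shape[1]):
            P = np.stack([A[:, i], B[:, j]], axis=1)
            w, *_ = np.linalg.lstsq(P, m, rcond=None)  # project m onto span
            r = np.linalg.norm(m - P @ w)              # distance to subspace
            if r < best_r:
                best_r, best_ij = r, (i, j)
    return best_ij, best_r

# Toy exemplars for two sources, and a mixture built from one known pair.
A = rng.random((20, 15))
B = rng.random((20, 15))
m = 0.6 * A[:, 4] + 0.4 * B[:, 9]
pair, resid = nearest_subspace_pair(m, A, B)
```

The two nested loops make the M^N blow-up of the earlier slide concrete: even this two-source toy already scores every exemplar pair for a single input vector.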
36 Dealing with a duet
- Training on two instruments
[Figure: training spectrograms f_x1(t), f_x2(t) with pitch tracks p_x1(t), p_x2(t); mixture f_y1(t) + f_y2(t) to pitch track]
37 Duet results
- Still works well. Pitch error stats: μ = ? Hz, σ = 2.05 Hz
[Figure: estimated pitch tracks p̂_y1(t), p̂_y2(t) vs. true pitch and Praat estimates, with misses marked]
38 A more beefy example
- Wind quintet recording: bassoon, clarinet, flute, horn, oboe
- Training data: 7m41s per source, 198,535 training vectors; removing unpitched vectors leaves 50,000 training vectors
- Test data: 1m10s of simultaneous performance, 6,000 input spectra
- Data tested as duet, trio, quartet and quintet
39 Ratio of true vs. estimated pitch
[Figure: ratio p_y / p̂_y and run time per case. Duet: σ = 89.5 Hz, t = 37 sec. Trio: t = 49 sec. Quartet: σ = 1.6 Hz, t = 60 sec. Quintet: t = 114 sec.]
40 Zooming in
- Most errors are human
- Raw method problems: occasional confusion with other instruments
- Fixes: median filtering and note averaging (correct over the majority of a note's samples)
[Figure: true vs. estimated pitch with misses, before and after smoothing]
41 What happened here?
- Input: some listening experience, and a mixture of five sounds
- Output: pitch values for each instrument (elements used), kind of instrument (elements again), amplitude of each source (presence of these elements)
- What more is there to do? No need to separate
42 Human-ish side effects
- Graceful degradation with an increasing number of sources: duets easier than trios, easier than quartets, ...
- Can pitch track pitch-less sounds: inharmonic, quasi-periodic, etc.
- The more you know, the better you do
43 A more realistic take
- Just as before, we can't know everything, but we know something
- Semi-exemplar learning: mix the exemplar model with a basis decomposition
- Applies to target/background cases, which are most of the interesting cases anyway
44 Step 1. Learn the target source
- Like before, each exemplar comes with feature labels; in this case pitch, but it can also be phoneme, stress, etc.
- We also learn temporal dynamics, i.e. how exemplars/bases appear: use a transition matrix for z, P(z_{t+1}|z_t). The same can apply to features.
[Figure: target dictionary P(f|z) and transition matrix P(z_{t+1}|z_t)]
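A toy illustration of such a transition matrix and of how it moves a distribution over dictionary elements forward in time (the values are hypothetical):

```python
import numpy as np

# P(z_{t+1} | z_t) over three dictionary elements: rows are z_t,
# columns are z_{t+1}, and every row is a distribution.
T = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])

def propagate(p, T, n_steps=1):
    """Push a distribution over elements forward through the chain."""
    p = np.asarray(p, dtype=float)
    for _ in range(n_steps):
        p = p @ T
    return p

one_step = propagate([1.0, 0.0, 0.0], T)     # leaving element 0
steady = propagate([1.0, 0.0, 0.0], T, 500)  # long-run distribution
```

In estimation, such a matrix penalizes weight trajectories that jump between elements in ways the training data never did.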
45 Step 2. Learn the rest from a mixture
- Keep the target exemplars fixed; adapt a new set of bases, while obeying the transitions
- Explain the mixture as target + rest; we don't care about the accuracy of the rest
- Use the pitch from the exemplars, same as before
[Figure: target dictionary P(f|z), transition matrix P(z_{t+1}|z_t), observed mixtures P_t(f), implied competing sources' subspace]
46 Example pitch tracking
- Results in very accurate target following, in a challenging and highly correlated case
[Figure: training data, mixture, estimated weights, and expected pitch (Hz) over time (sec)]
47 Delving deeper into temporal dynamics
- The previous model was a linear predictor of sorts: short-term effects, minimal structure
- Extending this idea to stricter models: a hidden Markov model formulation
- Can come in many flavors: Markov model selection, non-negative HMMs
48 One last example
- Structured speech mixtures: each speaker follows a language model, i.e. we hear words in sentences that make sense
- Use an HMM, of course, to encode domain structure knowledge; the structured model replaces the exemplars
49 The non-negative HMM
- A temporal model using exemplars/bases
[Figure: two HMM states with dictionaries P(f|z_1), P(f|z_2) and P(f|z_3), P(f|z_4), transitions T_{1,1}, T_{1,2}, T_{2,2}, and observed spectra X_τ(f)]
50 Non-factorial learning
- State model additivity results in decoupled chains; fast state estimation doesn't require a factorial model
- The previous extensions apply
[Figure: mixture, consolidated state dictionaries, estimated state probabilities]
51 Results on the Speech Separation Challenge
- Yes, we can separate, but we don't have to! The HMM state paths transcribe the speech
- Results are quite competitive
[Figure: results at 0 dB, -3 dB, -6 dB, -9 dB and clean conditions vs. baseline, for speakers 1 and 2]
52 My parting messages
- Don't separate! Separation algorithms are laying the foundation for mixed signal processing and analysis; treat them as such!
53 My parting messages
- Don't separate! Separation algorithms are laying the foundation for mixed signal processing and analysis; treat them as such!
- Keep separating! We're learning a ton of new things, and that's great! :)
54 References
- Smaragdis, P. Approximate nearest subspace for sound mixtures. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Prague, Czech Republic, May 2011.
- Mysore, G., P. Smaragdis, and B. Raj. Non-negative hidden Markov modeling of audio with application to source separation. In 9th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA). St. Malo, France, September 2010.
- Smaragdis, P. and B. Raj. The Markov selection model for concurrent speech recognition. In IEEE International Workshop on Machine Learning for Signal Processing (MLSP). Kittilä, Finland, August 2010.
- Smaragdis, P., M. Shashanka, and B. Raj. A sparse non-parametric approach for single channel separation of known sounds. In Neural Information Processing Systems (NIPS). Vancouver, BC, Canada, December 2009.
- Smaragdis, P. User guided audio selection from complex sound mixtures. In the 22nd ACM Symposium on User Interface Software and Technology (UIST '09). Victoria, BC, Canada, October 2009.
- Shashanka, M.V., B. Raj, and P. Smaragdis. Probabilistic latent variable models as non-negative factorizations. In special issue on Advances in Non-negative Matrix and Tensor Factorization, Computational Intelligence and Neuroscience Journal, May 2008.
- Shashanka, M.V., B. Raj, and P. Smaragdis. Sparse overcomplete latent variable decomposition of counts data. In Neural Information Processing Systems (NIPS). Vancouver, BC, Canada, December 2007.
- Smaragdis, P., B. Raj, and M.V. Shashanka. A probabilistic latent variable model for acoustic modeling. Advances in Models for Acoustic Processing Workshop, NIPS 2006.
- Smaragdis, P. Component based techniques for monophonic speech separation and recognition. In S. Makino, T-W. Lee and H. Sawada (eds.), Blind Speech Separation, Springer.
STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far
More informationSingle Channel Signal Separation Using MAP-based Subspace Decomposition
Single Channel Signal Separation Using MAP-based Subspace Decomposition Gil-Jin Jang, Te-Won Lee, and Yung-Hwan Oh 1 Spoken Language Laboratory, Department of Computer Science, KAIST 373-1 Gusong-dong,
More informationBlind Machine Separation Te-Won Lee
Blind Machine Separation Te-Won Lee University of California, San Diego Institute for Neural Computation Blind Machine Separation Problem we want to solve: Single microphone blind source separation & deconvolution
More informationLatent Variable Framework for Modeling and Separating Single-Channel Acoustic Sources
COGNITIVE AND NEURAL SYSTEMS, BOSTON UNIVERSITY Boston, Massachusetts Latent Variable Framework for Modeling and Separating Single-Channel Acoustic Sources Committee Chair: Readers: Prof. Daniel Bullock
More informationInteres'ng- Phrase Mining for Ad- Hoc Text Analy'cs
Interes'ng- Phrase Mining for Ad- Hoc Text Analy'cs Klaus Berberich (kberberi@mpi- inf.mpg.de) Joint work with: Srikanta Bedathur, Jens DiIrich, Nikos Mamoulis, and Gerhard Weikum (originally presented
More informationExercises The Origin of Sound (page 515) 26.2 Sound in Air (pages ) 26.3 Media That Transmit Sound (page 517)
Exercises 26.1 The Origin of (page 515) Match each sound source with the part that vibrates. Source Vibrating Part 1. violin a. strings 2. your voice b. reed 3. saxophone c. column of air at the mouthpiece
More informationBLSTM-HMM HYBRID SYSTEM COMBINED WITH SOUND ACTIVITY DETECTION NETWORK FOR POLYPHONIC SOUND EVENT DETECTION
BLSTM-HMM HYBRID SYSTEM COMBINED WITH SOUND ACTIVITY DETECTION NETWORK FOR POLYPHONIC SOUND EVENT DETECTION Tomoki Hayashi 1, Shinji Watanabe 2, Tomoki Toda 1, Takaaki Hori 2, Jonathan Le Roux 2, Kazuya
More informationCOMS 4721: Machine Learning for Data Science Lecture 20, 4/11/2017
COMS 4721: Machine Learning for Data Science Lecture 20, 4/11/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University SEQUENTIAL DATA So far, when thinking
More informationOVERLAPPING ANIMAL SOUND CLASSIFICATION USING SPARSE REPRESENTATION
OVERLAPPING ANIMAL SOUND CLASSIFICATION USING SPARSE REPRESENTATION Na Lin, Haixin Sun Xiamen University Key Laboratory of Underwater Acoustic Communication and Marine Information Technology, Ministry
More informationQuan&fying Uncertainty. Sai Ravela Massachuse7s Ins&tute of Technology
Quan&fying Uncertainty Sai Ravela Massachuse7s Ins&tute of Technology 1 the many sources of uncertainty! 2 Two days ago 3 Quan&fying Indefinite Delay 4 Finally 5 Quan&fying Indefinite Delay P(X=delay M=
More informationAuto-Encoders & Variants
Auto-Encoders & Variants 113 Auto-Encoders MLP whose target output = input Reconstruc7on=decoder(encoder(input)), input e.g. x code= latent features h encoder decoder reconstruc7on r(x) With bo?leneck,
More informationNONNEGATIVE MATRIX FACTORIZATION WITH TRANSFORM LEARNING. Dylan Fagot, Herwig Wendt and Cédric Févotte
NONNEGATIVE MATRIX FACTORIZATION WITH TRANSFORM LEARNING Dylan Fagot, Herwig Wendt and Cédric Févotte IRIT, Université de Toulouse, CNRS, Toulouse, France firstname.lastname@irit.fr ABSTRACT Traditional
More informationCOMP 562: Introduction to Machine Learning
COMP 562: Introduction to Machine Learning Lecture 20 : Support Vector Machines, Kernels Mahmoud Mostapha 1 Department of Computer Science University of North Carolina at Chapel Hill mahmoudm@cs.unc.edu
More informationDept. Electronics and Electrical Engineering, Keio University, Japan. NTT Communication Science Laboratories, NTT Corporation, Japan.
JOINT SEPARATION AND DEREVERBERATION OF REVERBERANT MIXTURES WITH DETERMINED MULTICHANNEL NON-NEGATIVE MATRIX FACTORIZATION Hideaki Kagami, Hirokazu Kameoka, Masahiro Yukawa Dept. Electronics and Electrical
More informationProc. of NCC 2010, Chennai, India
Proc. of NCC 2010, Chennai, India Trajectory and surface modeling of LSF for low rate speech coding M. Deepak and Preeti Rao Department of Electrical Engineering Indian Institute of Technology, Bombay
More informationMatrix Factorization for Speech Enhancement
Matrix Factorization for Speech Enhancement Peter Li Peter.Li@nyu.edu Yijun Xiao ryjxiao@nyu.edu 1 Introduction In this report, we explore techniques for speech enhancement using matrix factorization.
More informationCS 5522: Artificial Intelligence II
CS 5522: Artificial Intelligence II Particle Filters and Applications of HMMs Instructor: Wei Xu Ohio State University [These slides were adapted from CS188 Intro to AI at UC Berkeley.] Recap: Reasoning
More informationDeep Learning for Speech Recognition. Hung-yi Lee
Deep Learning for Speech Recognition Hung-yi Lee Outline Conventional Speech Recognition How to use Deep Learning in acoustic modeling? Why Deep Learning? Speaker Adaptation Multi-task Deep Learning New
More informationSupervised and Unsupervised Speech Enhancement Using Nonnegative Matrix Factorization
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING Supervised and Unsupervised Speech Enhancement Using Nonnegative Matrix Factorization Nasser Mohammadiha*, Student Member, IEEE, Paris Smaragdis,
More informationCompressed Sensing and Neural Networks
and Jan Vybíral (Charles University & Czech Technical University Prague, Czech Republic) NOMAD Summer Berlin, September 25-29, 2017 1 / 31 Outline Lasso & Introduction Notation Training the network Applications
More informationSparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation.
ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Previous lectures: Sparse vectors recap How to represent
More informationANLP Lecture 22 Lexical Semantics with Dense Vectors
ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Henry S. Thompson ANLP Lecture 22 5 November 2018 Previous
More informationCS 6140: Machine Learning Spring 2017
CS 6140: Machine Learning Spring 2017 Instructor: Lu Wang College of Computer and Informa@on Science Northeastern University Webpage: www.ccs.neu.edu/home/luwang Email: luwang@ccs.neu.edu Logis@cs Assignment
More informationCS 188: Artificial Intelligence Spring Announcements
CS 188: Artificial Intelligence Spring 2010 Lecture 22: Nearest Neighbors, Kernels 4/18/2011 Pieter Abbeel UC Berkeley Slides adapted from Dan Klein Announcements On-going: contest (optional and FUN!)
More informationNatural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur
Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 18 Maximum Entropy Models I Welcome back for the 3rd module
More informationMULTI-RESOLUTION SIGNAL DECOMPOSITION WITH TIME-DOMAIN SPECTROGRAM FACTORIZATION. Hirokazu Kameoka
MULTI-RESOLUTION SIGNAL DECOMPOSITION WITH TIME-DOMAIN SPECTROGRAM FACTORIZATION Hiroazu Kameoa The University of Toyo / Nippon Telegraph and Telephone Corporation ABSTRACT This paper proposes a novel
More informationCS 343: Artificial Intelligence
CS 343: Artificial Intelligence Particle Filters and Applications of HMMs Prof. Scott Niekum The University of Texas at Austin [These slides based on those of Dan Klein and Pieter Abbeel for CS188 Intro
More informationFounda'ons of Large- Scale Mul'media Informa'on Management and Retrieval. Lecture #3 Machine Learning. Edward Chang
Founda'ons of Large- Scale Mul'media Informa'on Management and Retrieval Lecture #3 Machine Learning Edward Y. Chang Edward Chang Founda'ons of LSMM 1 Edward Chang Foundations of LSMM 2 Machine Learning
More information