Monaural speech separation using source-adapted models


1 Monaural speech separation using source-adapted models
Ron Weiss, Dan Ellis
LabROSA, Department of Electrical Engineering, Columbia University
2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics
Ron Weiss, Dan Ellis (Columbia University), Monaural speech separation using source-adapted models, WASPAA 2007

2 Monaural speech separation
- Given single channel recording of multiple talkers
- Infer the original source signals from mixture
- Under-determined: more unknowns (sources) than observations

3 Speech separation challenge [Cooke and Lee, 2006]
- Single channel, two-talker mixtures of utterances from 34 speakers
- Constrained grammar: command(4) color(4) preposition(4) letter(25) digit(10) adverb(4)
- Task: determine letter and digit for source that said "white"
- Target-to-masker ratio (TMR) from -9 to 6 dB

4 Model-based separation
[Figure: model means; axes: state index vs. frequency (kHz)]
- Use constraints from prior signal models to guide separation
  - HMM, log spectral features
- Factorial model inference
  - Explain each frame of mixed signal as combination of model states
- e.g. Iroquois [Kristjansson et al., 2006]
  - Speaker-dependent models
  - Acoustic dynamics and grammar constraints
  - Superhuman performance
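As a rough illustration of "explain each frame of mixed signal as combination of model states", the sketch below searches all state pairs for the one whose combined means best match a single mixture frame. It assumes the common log-max approximation (the log spectrum of a mixture is roughly the elementwise maximum of the source log spectra) and unit-variance Gaussians; the slides do not state either choice, and the function name and toy values are hypothetical.

```python
import numpy as np

def best_state_pair(frame, means_a, means_b):
    """Return (i, j): the state pair whose combined means best match one frame."""
    # Assumption: log-max mixing. Combined mean for every state pair,
    # shape (states_a, states_b, bins).
    combined = np.maximum(means_a[:, None, :], means_b[None, :, :])
    # Squared error stands in for a negative unit-variance Gaussian
    # log-likelihood (up to a constant).
    errors = ((combined - frame) ** 2).sum(axis=-1)
    i, j = np.unravel_index(np.argmin(errors), errors.shape)
    return int(i), int(j)

# Toy log-spectral state means for two hypothetical sources (3 frequency bins).
means_a = np.array([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]])
means_b = np.array([[0.0, 0.0, 4.0], [1.0, 1.0, 1.0]])

# A mixture frame that looks like source A in state 1 plus source B in state 0.
frame = np.array([0.0, 5.0, 4.0])
pair = best_state_pair(frame, means_a, means_b)
print(pair)
```

The exhaustive search costs one comparison per state pair, which is why factorial models grow expensive as the per-source state count increases.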

5 Model-based separation - Limitations
- Rely on speaker-dependent models to disambiguate sources
- What if the task isn't so well defined?
  - No a priori knowledge of speaker identities or grammar
- Adapt speaker-independent source model [Ozerov et al., 2005]
- Problems
  1. Want to adapt to a single utterance, not enough data for MLLR
     - Use PCA to reduce number of adaptation parameters: Eigenvoices
  2. Only observation is mixed signal
     - Iterative separation/adaptation algorithm

6 Eigenvoices [Kuhn et al., 2000]
- Train N speaker-dependent models: priors on space of speaker variation
- Pack model parameters (Gaussian means) into speaker supervector
- Principal component analysis to find orthonormal bases
- Speaker model is a linear combination of bases:
  $\mu = \bar{\mu} + w U + g$
  where $\mu$ is the adapted model, $\bar{\mu}$ the mean voice, $w$ the weights, $U$ the eigenvoice bases, and $g$ the gain
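The eigenvoice construction can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' code: supervectors are stored one per row, the bases are stored one per column (so the weighted sum is computed as `bases @ weights`), the gain is a scalar offset, and the data are random toy values. The function names are hypothetical.

```python
import numpy as np

def train_eigenvoices(supervectors, num_bases):
    """PCA over speaker supervectors (one row per speaker)."""
    mean_voice = supervectors.mean(axis=0)
    centered = supervectors - mean_voice
    # Rows of vt are orthonormal principal directions, strongest first.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    bases = vt[:num_bases].T  # shape: (supervector_dim, num_bases)
    return mean_voice, bases

def adapted_mean(mean_voice, bases, weights, gain=0.0):
    """Speaker model: mean voice plus weighted eigenvoices plus gain."""
    return mean_voice + bases @ weights + gain

# Toy data: 10 hypothetical speakers, supervectors of length 12.
rng = np.random.default_rng(0)
speakers = rng.normal(size=(10, 12))
mean_voice, U = train_eigenvoices(speakers, num_bases=3)

# Project one training speaker onto the bases and rebuild an approximation.
w = U.T @ (speakers[0] - mean_voice)
approx = adapted_mean(mean_voice, U, w)
```

Because the bases are orthonormal, projecting a supervector onto them is a single matrix product, and only `num_bases` weights need to be estimated per speaker.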

7 Eigenvoice example
[Figure: four panels (mean voice, eigenvoice dimension 1, eigenvoice dimension 2, eigenvoice dimension 3), each plotting frequency (kHz) against phone states: b d g p t k jh ch s z f th v dh m n l r w y iy ih eh ey ae aa aw ay ah ao ow uw ax]

8 Separation algorithm - Signal separation
- Compose factorial HMM from adapted models
- Find maximum likelihood path using Viterbi algorithm
- Reconstruct source signals from Viterbi path
[Figure: factorial HMM trellis; states of model 1 and model 2 against observations / time]
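The three steps above can be sketched as a Viterbi search over the product state space of two small HMMs. This is a simplified sketch under assumptions not stated on the slide: log-max mixing of log spectra, unit-variance Gaussian observations, a uniform initial state distribution, and reconstruction by reading off the state means along the path. All names and toy values are hypothetical.

```python
import numpy as np

def factorial_viterbi(frames, means_a, means_b, log_trans_a, log_trans_b):
    """Most likely joint state path [(a_t, b_t), ...] for two mixed sources."""
    n_a, n_b = len(means_a), len(means_b)
    # Assumption: log-max mixing of the two sources' state means.
    combined = np.maximum(means_a[:, None, :], means_b[None, :, :])
    # Unit-variance Gaussian log-likelihood (up to a constant): (T, n_a, n_b).
    loglik = -0.5 * ((frames[:, None, None, :] - combined) ** 2).sum(axis=-1)
    # Sources move independently:
    # trans[i, j, k, l] = log P(a: i -> k) + log P(b: j -> l).
    trans = log_trans_a[:, None, :, None] + log_trans_b[None, :, None, :]
    score = loglik[0].copy()  # uniform initial distribution assumed
    back = []
    for t in range(1, len(frames)):
        cand = score[:, :, None, None] + trans
        flat = cand.reshape(n_a * n_b, n_a, n_b)  # previous pair flattened
        back.append(flat.argmax(axis=0))
        score = flat.max(axis=0) + loglik[t]
    # Backtrace from the best final state pair.
    k, l = np.unravel_index(score.argmax(), score.shape)
    path = [(int(k), int(l))]
    for pointers in reversed(back):
        k, l = divmod(int(pointers[k, l]), n_b)
        path.append((k, l))
    return path[::-1]

means_a = np.array([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]])
means_b = np.array([[0.0, 0.0, 4.0], [1.0, 1.0, 1.0]])
uniform = np.log(np.full((2, 2), 0.5))
frames = np.array([[5.0, 0.0, 4.0], [0.0, 5.0, 4.0], [1.0, 5.0, 1.0]])

path = factorial_viterbi(frames, means_a, means_b, uniform, uniform)
# Crude reconstruction: state means of each source along the path.
recon_a = means_a[[a for a, _ in path]]
recon_b = means_b[[b for _, b in path]]
```

The joint transition table has (n_a * n_b)^2 entries, so this exhaustive form is only practical for small models; larger systems need pruning or approximate inference.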

9 Separation algorithm - Model adaptation
- Find projection of reconstructed source signals onto eigenvoice bases
- But state sequence is hidden, need EM
  - E-step: HMM forward-backward
  - M-step: for each possible state sequence, project signal frames onto corresponding sequence of states from each eigenvoice basis vector
- Iterate...
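One way to read the M-step is as a weighted least-squares fit of the eigenvoice weights, with the E-step state posteriors as the weights on each frame and state. The sketch below assumes identity covariances, no gain term, and posteriors that are already computed; the slide gives no formula, so this is an illustration of the idea and not the authors' exact update. All names are hypothetical.

```python
import numpy as np

def estimate_weights(frames, posteriors, mean_voice, bases):
    """Weighted least-squares eigenvoice weights.

    frames:     (T, D) reconstructed source frames
    posteriors: (T, S) state occupation probabilities from the E-step
    mean_voice: (S, D) per-state mean voice
    bases:      (S, D, K) per-state slices of the K eigenvoice bases
    """
    num_bases = bases.shape[2]
    lhs = np.zeros((num_bases, num_bases))
    rhs = np.zeros(num_bases)
    occupancy = posteriors.sum(axis=0)  # expected frame count per state
    for s in range(bases.shape[0]):
        u = bases[s]  # (D, K)
        lhs += occupancy[s] * (u.T @ u)
        # Posterior-weighted sum of residuals from the mean voice: (D,)
        residual = posteriors[:, s] @ (frames - mean_voice[s])
        rhs += u.T @ residual
    return np.linalg.solve(lhs, rhs)

# Toy check: frames generated exactly from known weights are recovered.
rng = np.random.default_rng(1)
bases = rng.normal(size=(2, 4, 2))
mean_voice = rng.normal(size=(2, 4))
true_w = np.array([0.5, -1.0])
states = np.array([0, 1, 1, 0, 1])
posteriors = np.eye(2)[states]  # hard (one-hot) posteriors for simplicity
frames = mean_voice[states] + bases[states] @ true_w

w_hat = estimate_weights(frames, posteriors, mean_voice, bases)
```

With soft posteriors from forward-backward, the same normal equations apply; each frame simply contributes to several states in proportion to its occupation probability.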

10 Separation example
Mixture: t3_swila_m1_sbar9n
[Figure: spectrograms, time (sec) vs. frequency (kHz); panels: adaptation iteration 1, adaptation iteration 3, a further adaptation iteration, and SD model separation]

11 Performance
[Figure: accuracy vs. iteration for Diff Gender, Same Gender, and Same Talker mixtures]
- Letter-digit accuracy averaged across all TMRs
- Adaptation improves separation
- Same talker case hard: source permutations

12 Performance - Adapted vs. source-dependent models
[Figure: accuracy vs. TMR (6 dB, 3 dB, 0 dB, -3 dB, -6 dB, -9 dB) in three panels (Diff Gender, Same Gender, Same Talker), comparing SD, SA, SI, and Baseline]

13 Performance - Held out speakers
[Figure: accuracy vs. number of training speakers for SA and SD; panels: Same Gender, Diff Gender]
- Trained models on subset of speakers
- Tested on mixtures from held out speakers
- Performance suffers for both systems
- Relative decrease significantly bigger for SD than SA
- Open question: scale

14 Summary
- Limitations of model-based source separation
- Algorithm for model adaptation from mixed signal
- Significant improvement over speaker-independent models
- Source-dependent models better on matched training/testing data
- Adaptation helps generalize better to held out speakers

15 References
Cooke, M. and Lee, T. W. (2006). The speech separation challenge.
Kristjansson, T., Hershey, J., Olsen, P., Rennie, S., and Gopinath, R. (2006). Super-human multi-talker speech recognition: The IBM 2006 speech separation challenge system. In Proceedings of Interspeech.
Kuhn, R., Junqua, J., Nguyen, P., and Niedzielski, N. (2000). Rapid speaker adaptation in eigenvoice space. IEEE Transactions on Speech and Audio Processing.
Ozerov, A., Philippe, P., Gribonval, R., and Bimbot, F. (2005). One microphone singing voice separation using source-adapted models. In Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics.

16 Separation algorithm - Initialization
- Fast convergence needs good initialization
- Want to differentiate source models to get best separation
- Get initial coefficient for each eigenvoice dimension independently
  - Coarsely quantize eigenvoice weights
  - Find most likely combination in mixture
[Figure: eigenvoice weights vs. speaker gender; w1 against w2 for male and female speakers]
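The initialization idea can be sketched as a small grid search along one eigenvoice dimension: try every pair of coarsely quantized weights, build the two candidate source models, and keep the pair that best explains the mixture. The scoring function here is a stand-in (log-max mixing, squared error, no temporal dynamics), and the quantization levels and toy models are hypothetical; the slide does not specify any of these.

```python
import itertools
import numpy as np

def mixture_score(frames, means_a, means_b):
    """Negative squared error of the best state pair per frame, summed."""
    combined = np.maximum(means_a[:, None, :], means_b[None, :, :])
    err = ((frames[:, None, None, :] - combined) ** 2).sum(axis=-1)
    return -err.reshape(len(frames), -1).min(axis=1).sum()

def init_weights(frames, mean_voice, basis, levels):
    """Best pair of quantized weights along a single eigenvoice dimension."""
    best = None
    for w_a, w_b in itertools.product(levels, repeat=2):
        score = mixture_score(frames,
                              mean_voice + w_a * basis,
                              mean_voice + w_b * basis)
        if best is None or score > best[0]:
            best = (score, w_a, w_b)
    return best[1], best[2]

# Toy model: 2 states, 2 frequency bins, one eigenvoice dimension.
mean_voice = np.array([[0.0, 0.0], [3.0, 3.0]])
basis = np.array([[1.0, -1.0], [1.0, -1.0]])
levels = (-1.0, 0.0, 1.0)

# Mixture of a source with weight +1 and a source with weight -1.
frames = np.array([[1.0, 1.0], [4.0, 4.0]])
w_a, w_b = init_weights(frames, mean_voice, basis, levels)
```

Swapping the two weights gives the same score, so the search can only recover the pair up to a permutation of the sources; this mirrors the source-permutation difficulty noted for same-talker mixtures.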
