Comparing Recent Waveform Generation and Acoustic Modeling Methods for Neural-network-based Speech Synthesis


1 ICASSP 2018, Calgary, Canada
Comparing Recent Waveform Generation and Acoustic Modeling Methods for Neural-network-based Speech Synthesis
Xin WANG, Jaime Lorenzo-Trueba, Shinji TAKAKI, Lauri Juvela, Junichi YAMAGISHI
National Institute of Informatics, Japan & Aalto University, Finland
Contact: we welcome critical comments, suggestions, and discussion

2 OVERVIEW
• Motivation: better modules for the statistical parametric speech synthesis (SPSS) framework?
• Method: plug and test new acoustic models and waveform generators
• Results: best combination = WaveNet-based vocoder + autoregressive (AR) acoustic models; quality as good as vocoded speech (at 16 kHz)

3 CONTENTS
• Introduction
• Models and modules
• Experiments
• Summary

4 INTRODUCTION: Background
• Conventional TTS pipeline [1]: Text → Front-end → Linguistic features → Back-end → Speech
• SPSS back-end [2,3]: Linguistic features → Acoustic models → spectral features (MGC & BAP) and F0 → Waveform generators → Speech
MGC: mel-generalized cepstral coefficients [4]; BAP: band aperiodicity
[1] T. Dutoit. An Introduction to Text-to-Speech Synthesis. Kluwer Academic Publishers, Norwell, MA, USA, 1997.
[2] K. Tokuda et al. Speech synthesis based on hidden Markov models. Proceedings of the IEEE, 101(5), 2013.
[3] H. Zen et al. Statistical parametric speech synthesis. Speech Communication, 51, 2009.
[4] K. Tokuda, T. Kobayashi, T. Masuko, and S. Imai. Mel-generalized cepstral analysis: a unified approach. In Proc. ICSLP, 1994.

5 INTRODUCTION: Topic of this work
• Better modules for the SPSS back-end?
  - Acoustic models: recurrent neural networks (RNNs), autoregressive (AR) models, generative adversarial networks (GANs)
  - Waveform generators: WORLD vocoder (+ phase recovery), WaveNet-based vocoder
MGC: mel-generalized cepstral coefficients; BAP: band aperiodicity

6 CONTENTS
• Introduction
• Models and modules
• Experiments
• Summary

7 MODELS & METHODS
Pipeline: Linguistic features → Acoustic models (RNN) → MGC & BAP & F0 → Waveform generators (WORLD) → Speech waveforms

8 MODELS & METHODS: Acoustic models
• Baseline RNN: maps the sequence of linguistic features $\{x_1, \dots, x_t, \dots\}$ to the sequence of generated acoustic features $\{\hat{o}_1, \dots, \hat{o}_t, \dots\}$

9 MODELS & METHODS: Acoustic models
• Baseline RNN [5]: a probabilistic model $\mathcal{M}_t$ per frame, computed by a neural network $\mathcal{H}^{(RNN)}$:
$$p(o_{1:T} \mid x_{1:T}; \Theta) = \prod_{t=1}^{T} p(o_t \mid x_{1:T}; \Theta) = \prod_{t=1}^{T} \mathcal{N}(o_t; \mu_t, I)$$
where $\mathcal{M}_t = \{\mu_t\}$, $\mu_t = \mathcal{H}^{(RNN)}(x_{1:T}, t)$, and generation uses $\hat{o}_t = \mu_t$ (see the sketch below).
[5] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
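Since the covariance is fixed to $I$, the maximum-likelihood criterion above reduces to a frame-wise mean squared error. A minimal NumPy sketch of that reduction (an illustration, not the authors' implementation):

```python
import numpy as np

def neg_log_likelihood(o, mu):
    """o, mu: (T, D) natural features and RNN-predicted means."""
    T, D = o.shape
    const = 0.5 * T * D * np.log(2.0 * np.pi)
    # With identity covariance, -log prod_t N(o_t; mu_t, I) is a sum of
    # squared errors plus a constant, i.e. plain MSE training.
    return 0.5 * np.sum((o - mu) ** 2) + const

# Generation simply copies the means: o_hat_t = mu_t.
```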

10 MODELS & METHODS: Acoustic models
• Baseline RNN limitations:
  1. Conditional independence across frames
  2. Maximum-likelihood training
→ addressed by AR models and GANs, respectively
$$p(o_{1:T} \mid x_{1:T}; \Theta) = \prod_{t=1}^{T} p(o_t \mid x_{1:T}; \Theta) = \prod_{t=1}^{T} \mathcal{N}(o_t; \mu_t, I)$$

11 MODELS & METHODS: Acoustic models
• Shallow AR (SAR) [6], e.g. with $K = 2$ (sketch below):
$$p(o_{1:T} \mid x_{1:T}; \Theta, \Phi) = \prod_{t=1}^{T} p(o_t \mid o_{t-K:t-1}, x_{1:T}; \Theta) = \prod_{t=1}^{T} \mathcal{N}\bigl(o_t; \mu_t + f_{\Phi}(o_{t-K:t-1}), I\bigr)$$
• Alternative interpretation: trainable filter + RNN
[6] X. Wang, S. Takaki, and J. Yamagishi. An autoregressive recurrent mixture density network for parametric speech synthesis. In Proc. ICASSP, 2017.
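A hedged sketch of the SAR mean shift, assuming the simple linear form $f_\Phi(o_{t-K:t-1}) = \sum_k a_k \cdot o_{t-k}$ per feature dimension; the names are illustrative, not from the released code:

```python
import numpy as np

def sar_shifted_means(mu, o_prev, a):
    """mu: (T, D) RNN means; o_prev: (T, D) past targets (teacher forcing);
    a: (K, D) trainable filter coefficients, one set per dimension."""
    T, D = mu.shape
    K = a.shape[0]
    shifted = mu.copy()
    for t in range(T):
        for k in range(1, K + 1):
            if t - k >= 0:
                shifted[t] += a[k - 1] * o_prev[t - k]
    return shifted  # training: p(o_t | ...) = N(o_t; shifted[t], I)

# At generation time the same shift is applied, but feeding back the
# *generated* o_hat_{t-k} instead of natural frames.
```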

12 MODELS & METHODS: Acoustic models
• Deep AR (DAR) [7]:
$$p(o_{1:T} \mid x_{1:T}; \Theta) = \prod_{t=1}^{T} p(o_t \mid o_{1:t-1}, x_{1:T}; \Theta)$$
• Used only for (quantized) F0 modeling
[7] X. Wang, S. Takaki, and J. Yamagishi. Autoregressive neural F0 model for statistical parametric speech synthesis. IEEE/ACM Transactions on Audio, Speech and Language Processing (accepted).
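A sketch of DAR-style generation for quantized F0; `rnn_step` is a hypothetical stand-in for the network in [7], returning a new recurrent state and class logits, so the model conditions on the full history $o_{1:t-1}$ through the feedback link:

```python
import numpy as np

def dar_generate(x, rnn_step, h0, n_classes, rng):
    """x: (T, Dx) linguistic features; returns quantized F0 class indices."""
    h, prev = h0, 0                          # dummy initial class
    out = []
    for t in range(x.shape[0]):
        h, logits = rnn_step(h, x[t], prev)  # feedback of o_{t-1}
        p = np.exp(logits - logits.max())
        p /= p.sum()                         # softmax
        prev = int(rng.choice(n_classes, p=p))   # or np.argmax(p)
        out.append(prev)
    return np.array(out)
```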

13 MODELS & METHODS: Acoustic models
• GAN: GAN-based post-filter [8]
  - A residual generator takes the acoustic model's generated features plus noise and adds a residual
  - A discriminator classifies natural vs. refined acoustic features (0/1)
[8] T. Kaneko, H. Kameoka, N. Hojo, Y. Ijima, K. Hiramatsu, and K. Kashino. Generative adversarial network-based postfilter for statistical parametric speech synthesis. In Proc. ICASSP, 2017.
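A rough sketch of the adversarial objective behind such a post-filter, assuming hypothetical callables `G` (residual generator) and `D` (discriminator) that map feature arrays to arrays; this is the standard non-saturating GAN loss, not necessarily the exact recipe of [8]:

```python
import numpy as np

def gan_postfilter_losses(G, D, gen_feat, nat_feat, noise):
    """gen_feat: features from the acoustic model; nat_feat: natural ones."""
    refined = gen_feat + G(gen_feat, noise)        # residual generator
    eps = 1e-7                                     # numerical safety
    d_loss = -np.mean(np.log(D(nat_feat) + eps)
                      + np.log(1.0 - D(refined) + eps))
    g_loss = -np.mean(np.log(D(refined) + eps))    # non-saturating G loss
    return d_loss, g_loss
```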

14 MODELS & METHODS: Acoustic models
Overview: Linguistic features → Acoustic models (RNN, SAR, DAR, GAN) → F0 & MGC → Waveform generators (WORLD) → Speech waveforms (BAP is not shown)

15 MODELS & METHODS: Acoustic models
WORLD-based systems: RNN-Wo, SAR-Wo, RGA-Wo (RNN+GAN), SGA-Wo (SAR+GAN); F0 is modeled by DAR (BAP is not shown)

16 MODELS & METHODS: Waveform generators
• Deterministic approaches
  - WORLD [9]: binary voicing decision; minimum phase
  - A log-domain pulse model (PML) [10]: source-filter model, additive in the log domain; binary noise mask
  - WORLD + phase recovery: generated waveform → STFT amplitude → phase recovery [11] → inverse STFT → phase-recovered waveform (see the sketch below)
[9] M. Morise et al. WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Trans. on Information and Systems, 99(7), 2016.
[10] G. Degottex et al. A log domain pulse model for parametric speech synthesis. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018.
[11] D. Griffin and J. Lim. Signal estimation from modified short-time Fourier transform. IEEE Trans. ASSP, 32(2), 1984.
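A minimal Griffin-Lim loop [11] with SciPy, illustrating the "WORLD + phase recovery" idea; the window sizes and iteration count are placeholder values, not the paper's configuration:

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=50, fs=16000, nperseg=1024, noverlap=768):
    """mag: target |STFT|, shaped like scipy.signal.stft output."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    for _ in range(n_iter):
        _, y = istft(mag * phase, fs=fs, nperseg=nperseg, noverlap=noverlap)
        _, _, spec = stft(y, fs=fs, nperseg=nperseg, noverlap=noverlap)
        phase = np.exp(1j * np.angle(spec))  # keep phase, re-impose magnitude
    _, y = istft(mag * phase, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return y
```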

17 MODELS & METHODS: Waveform generators
• WaveNet-based vocoder [12, 13]
  How to generate (search) a waveform:
  1. Exploration: random sampling
  2. Exploitation: picking the one-best
[12] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
[13] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda. Speaker-dependent WaveNet vocoder. In Proc. Interspeech, 2017.

18 MODELS & METHODS: Waveform generators
• WaveNet-based vocoder: mixed generation (sketch below)
  - Random sampling in unvoiced regions; picking the one-best in voiced regions
  - Less distortion of harmonics (see appendix & paper)
[Figure: waveform level vs. sampling point, with the unvoiced-segment probability and the voiced region marked]
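A sketch of the mixed generation strategy, assuming a hypothetical `wavenet_step` that returns the softmax distribution over the μ-law levels for the next sample given the history and conditioning:

```python
import numpy as np

def generate_mixed(wavenet_step, cond, voiced_mask, n_levels=1024, seed=0):
    """cond: per-sample conditioning; voiced_mask: bool flag per sample."""
    rng = np.random.default_rng(seed)
    samples = []
    for t in range(len(voiced_mask)):
        p = wavenet_step(samples, cond[t])   # softmax over mu-law levels
        if voiced_mask[t]:
            samples.append(int(np.argmax(p)))               # exploitation
        else:
            samples.append(int(rng.choice(n_levels, p=p)))  # exploration
    return np.array(samples)
```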

19 MODELS & METHODS: All combinations
• Acoustic models: RNN, SAR, DAR (F0), GAN, driven by linguistic features
• Waveform generators: WORLD, PML (minimum phase), phase recovery, WaveNet
• Systems: RNN-Wo, RGA-Wo, SGA-Wo, SAR-Wo, SAR-Pm, SAR-Pr, SAR-Wa

20 CONTENTS
• Introduction
• Models and modules
• Experiments
• Summary

21 EXPERIMENTS: Configuration
• Data: ATR Ximera F009 voice [14]
  - ~30,000 utterances, 48 hours; recording period: over 1 year
  - Sampling rate: 48 kHz; Japanese, neutral style, reading
• Features
  - Linguistic: phone identity, prosodic tags, ... (~390 dimensions)
  - Acoustic: MGC (60), BAP (25), F0 (1)
• Front-end: Open JTalk [15]
[14] H. Kawai, T. Toda, J. Ni, M. Tsuzaki, and K. Tokuda. Ximera: A new TTS from ATR based on corpus-based technologies. In Proc. SSW5, 2004.
[15] HTS Working Group. The Japanese TTS System Open JTalk, 2015.

22 EXPERIMENTS: Configuration
• Listening test
  - Quality: MOS (1-5); similarity: rated 1-5 against a natural 48 kHz reference
  - Participants: 235 native Japanese listeners; 1,500 sets of results
• Systems
  - Common network configuration (cf. the paper)
  - Without Δ, ΔΔ, nor formant enhancement
  - Sampling rate: 48 kHz & 16 kHz, except SAR-Wa at 16 kHz (10-bit μ-law)

  System   | Acoustic model   | Waveform generator
  ---------|------------------|-----------------------
  Natural  | (natural speech) |
  Abs-Pm   | natural features | PML
  Abs-Wo   | natural features | WORLD
  SAR-Wa   | SAR              | WaveNet
  SAR-Pr   | SAR              | WORLD + phase recovery
  SAR-Pm   | SAR              | PML
  SAR-Wo   | SAR              | WORLD
  SGA-Wo   | SAR + GAN        | WORLD
  RGA-Wo   | RNN + GAN        | WORLD
  RNN-Wo   | RNN              | WORLD

23 EXPERIMENTS
[Figure: the ten systems (Natural, Abs-Pm, Abs-Wo, SAR-Wa, SAR-Pr, SAR-Pm, SAR-Wo, SGA-Wo, RGA-Wo, RNN-Wo) evaluated at 48 kHz and 16 kHz]

24 EXPERIMENTS: Quality scores
[Figure: MOS quality scores for Natural, Abs-Pm, Abs-Wo, SAR-Wa, SAR-Pr, SAR-Pm, SAR-Wo, SGA-Wo, RGA-Wo, and RNN-Wo; the scores are not recoverable from the extracted text]

25 EXPERIMENTS: Similarity scores
[Figure: similarity scores for the same ten systems; the scores are not recoverable from the extracted text]

26 CONTENTS
• Introduction
• Models and methods
• Experiments
• Summary

27 SUMMARY: Plug and test
• SAR
  - Avoids the conditional-independence assumption
  - Alleviates the over-smoothing effect (appendix & paper)
• WaveNet
  - Generation method: one-best generation + random sampling
  - Less distortion of harmonics (appendix & paper)
• WaveNet vocoder + SAR & DAR
  - Better than the other combinations
  - Worse than natural speech, but close to vocoded speech

28 FURTHER IMPROVEMENT?
Recent work (appendix)
• SAR: a special case of a volume-preserving normalizing flow [16]; extended SAR = time-variant filter + RNN
• WaveNet vocoders: training based on generated conditional features
• Annotated linguistic features: reduce the gap between natural & synthetic speech
Future work?
• Reduce the variability of the recordings
• Use complex-valued neural models
[16] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In Proc. ICML, 2015.

29 MESSAGE
Code, recipes, slides:
• Acoustic models & WaveNet (CUDA/C++)
• A simple explanation of WaveNet and the acoustic models
tonywangx.github.io  wangxin@nii.ac.jp

30 Thank you for your attention. Q & A
tonywangx.github.io  wangxin@nii.ac.jp

31 APPENDIX - WAVENET
Contents: conditional network; WaveNet back-end; learning curve; generation method; generation with other acoustic models; training based on generated features
No cherry picking: all samples are based on generated acoustic features or automatically inferred linguistic features.
Tutorial slides:

32 APPENDIX - WAVENET: Structure
• WaveNet part (clock rate: 16 kHz): models $P(o_t \mid o_{t-R:t-1}, c_{1:N})$; a linear input layer feeds WaveNet blocks 1..M, whose skip outputs $s_t^{(1)}, \dots, s_t^{(M)}$ are summed and passed through a post-processing network (linear + tanh, linear + tanh, softmax)
• Conditional feature network (clock rate: 200 Hz, frame shift = 5 ms): F0 and $c_{1:N}$ → Bi-LSTM → CNN → concatenation → linear → up-sampling to the waveform clock, producing $l_t$
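The clock-rate conversion at the bottom of this diagram amounts to up-sampling frame-level features (200 Hz) to the waveform rate (16 kHz); a minimal sketch, assuming simple repetition as the up-sampling operator:

```python
import numpy as np

def upsample_conditions(frames, frame_shift_ms=5, sample_rate=16000):
    """frames: (N, D) frame-level features -> (N * 80, D) at 16 kHz."""
    samples_per_frame = sample_rate * frame_shift_ms // 1000  # 80
    return np.repeat(frames, samples_per_frame, axis=0)
```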

33 APPENDIX - WAVENET: Conditional network
• Architecture trials for the conditional network (inputs: F0 and MGC $c_{1:N}$; output $l_{1:N}$):
  - Trial 1: linear layer only
  - Trial 2: Bi-LSTM → CNN → linear
  - Trial 3: Bi-LSTM → CNN → concatenation with a skip F0 connection → linear

34 APPENDIX - WAVENET: Conditional network
• Experiments on the WaveNet vocoder, given generated MGC/F0 (audio samples omitted):
  - Trial 1: none (linear only)
  - Trial 2: LSTM+CNN
  - Trial 3: LSTM+CNN+skip-F0

35 APPENDIX - WAVENET: WaveNet back-end
• Architecture: text → text analyzer → textual features → F0 model (DAR) → WaveNet back-end → waveform
• The WaveNet back-end only uses random sampling (audio samples comparing natural, WaveNet vocoder, and WaveNet back-end omitted)

36 APPENDIX - WAVENET: Learning curves
[Figure: negative log-likelihood on the training and validation sets per epoch, for the WaveNet vocoder and the WaveNet back-end]
We trained the WaveNet back-end for more than 100 epochs.

37 APPENDIX - WAVENET: Generation method
[Figure: per-sample output probability over the 1024 μ-law waveform levels vs. sampling point]

38 APPENDIX - WAVENET: Generation method
• Experiments on the WaveNet vocoder
[Figure: waveform level vs. sampling point for the two generation strategies]

39 APPENDIX - WAVENET: Generation method
• One-best + random sampling, given generated MGC/F0 (audio samples omitted):
  - Mix: one-best in voiced regions; random sampling in unvoiced regions
  - Random: random sampling all the time

40 [Audio demo: natural speech vs. "voiced: one-best / unvoiced: sampling" vs. "voiced: sampling / unvoiced: sampling"]

41 APPENDIX - WAVENET: Generation method
• One-best + random sampling:
  - Mix: voiced regions one-best; unvoiced regions random sampling
  - Mix2: one-best for 75% of voiced frames; random sampling elsewhere
  - Mix4: one-best for 25% of voiced frames; random sampling elsewhere

42 INVESTIGATION: Generation method
• One-best + random sampling (audio samples omitted): Mix (100%), Mix2 (75%), Mix3 (50%), Mix4 (20%), Random (0% one-best in voiced frames)

43 APPENDIX - WAVENET: Generation method
• One-best + random sampling (audio samples omitted): natural, WaveNet back-end (Mix, Random), WaveNet vocoder (Mix)

44 [Audio demo: natural vs. WaveNet back-end (Mix) vs. WaveNet back-end (Random)]

45 APPENDIX - WAVENET: WaveNet vocoder + other acoustic models
• Audio samples (omitted) for: SAR + DAR, SGA + DAR, RGA + DAR, RNN + DAR, extended SAR + DAR

46 APPENDIX - WAVENET: WaveNet vocoder trained on generated MGC
• Audio samples (omitted): natural; WaveNet back-end; WaveNet vocoder trained on natural MGC; WaveNet vocoder trained on generated MGC at epochs 35, 45, and 55

47 APPENDIX - ACOUSTIC MODELS
Contents: general comparison; SAR & DAR; SAR extension
More details:

48 APPENDIX - ACOUSTIC MODELS: General comparison
• Generated trajectories
[Figure: generated MGC trajectories (1st and 5th dimensions) vs. frame index for Natural, RNN, RNN+GAN (RGA), SAR, and SAR+GAN (SGA); utterance ATR Ximera F009 AOZORAR T01]

49 APPENDIX - ACOUSTIC MODELS: General comparison
• Generated trajectories
[Figure: generated MGC trajectories (15th and 31st dimensions) vs. frame index for Natural, RNN, and SAR; same utterance]

50 APPENDIX - ACOUSTIC MODELS: General comparison
• Generated trajectories
[Figure: generated MGC trajectories (15th and 31st dimensions) vs. frame index for Natural, RNN, RNN+GAN (RGA), SAR, and SAR+GAN (SGA); same utterance]

51 APPENDIX - ACOUSTIC MODELS: General comparison
• Global variance (GV)
[Figure: utterance-level GV of MGC per dimension for Natural, SAR, SAR+GAN (SGA), RNN+GAN (RGA), and RNN; and the modulation spectrum of the 31st MGC dimension]

52 APPENDIX - ACOUSTIC MODELS: SAR & DAR
• Toy SAR example: SAR versus an RMDN with a recurrent output layer [11]
  - Both have a hidden layer $h_t$ driven by $x_t$ and an output layer producing $\mu_t$; the RMDN feeds back the previous mean $\mu_{t-1}$ through a weight $w_\mu$, while SAR feeds back the previous output $o_{t-1}$ through a coefficient $a$
  - Assume $o_t \in \mathbb{R}$, $\sigma_t = 1$, and linear activation functions
[11] H. Zen and H. Sak. Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis. In Proc. ICASSP, 2015.

53 APPENDIX - ACOUSTIC MODELS: SAR & DAR
• Toy SAR example
  - RMDN: $p(o_{1:2}) = \mathcal{N}(o_1; \mu_1, 1)\,\mathcal{N}(o_2; \mu_2 + w_\mu \mu_1, 1)$, with $\mu_1 = w^\top h_1 + b$ and $\mu_2 = w^\top h_2 + b$ (so the full frame-2 mean is $w^\top h_2 + b + w_\mu \mu_1 = \mu_2 + w_\mu \mu_1$)
  - SAR: $p(o_{1:2}) = \mathcal{N}(o_1; \mu_1, 1)\,\mathcal{N}(o_2; \mu_2 + a o_1, 1)$, with $\mu_1 = w^\top h_1 + b$ and $\mu_2 = w^\top h_2 + b$

54 APPENDIX - ACOUSTIC MODELS: SAR & DAR
• Toy SAR example: dependency between the $\mu_t$ or between the $o_t$?
  - RMDN: $p(o_{1:2}) = \mathcal{N}(o_1; \mu_1, 1)\,\mathcal{N}(o_2; \mu_2 + w_\mu \mu_1, 1) = \frac{1}{2\pi}\exp\bigl(-\tfrac{1}{2}(o-\mu)^\top \Sigma^{-1} (o-\mu)\bigr)$, with $o = [o_1, o_2]^\top$, $\mu = [\mu_1, \mu_2 + w_\mu \mu_1]^\top$, $\Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$
  - SAR: $p(o_{1:2}) = \mathcal{N}(o_1; \mu_1, 1)\,\mathcal{N}(o_2; \mu_2 + a o_1, 1) = \frac{1}{2\pi}\exp\bigl(-\tfrac{1}{2}(o-\mu)^\top \Sigma^{-1} (o-\mu)\bigr)$, with $o = [o_1, o_2]^\top$, $\mu = [\mu_1, \mu_2 + a \mu_1]^\top$, $\Sigma = \begin{bmatrix} 1 & a \\ a & 1+a^2 \end{bmatrix}$

55 APPENDIX - ACOUSTIC MODELS: SAR & DAR
• Toy SAR example, via the change of variables $c = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} = \begin{bmatrix} o_1 \\ o_2 - a o_1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ -a & 1 \end{bmatrix}\begin{bmatrix} o_1 \\ o_2 \end{bmatrix} = A o$:
  - RMDN view: $p(o) = p(c) = \mathcal{N}(c; \mu_c, \Sigma_c)$, with $\mu_c = [\mu_1, \mu_2]^\top$ and $\Sigma_c = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$
  - SAR view: $p(o) = \mathcal{N}(o; \mu_o, \Sigma_o)$, with $o = [o_1, o_2]^\top$, $\mu_o = [\mu_1, \mu_2 + a \mu_1]^\top$, $\Sigma_o = \begin{bmatrix} 1 & a \\ a & 1+a^2 \end{bmatrix}$
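A quick numeric check of the claim above: with $c = Ao$ and $c \sim \mathcal{N}(\mu_c, I)$, the implied covariance of $o$ is $A^{-1}A^{-\top}$, which equals the SAR covariance on this slide:

```python
import numpy as np

a = 0.7
A = np.array([[1.0, 0.0], [-a, 1.0]])            # c = A o
Sigma_o = np.linalg.inv(A) @ np.linalg.inv(A).T  # covariance of o
print(Sigma_o)                                   # [[1.   0.7 ] [0.7  1.49]]
assert np.allclose(Sigma_o, [[1.0, a], [a, 1.0 + a * a]])
```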

56 APPENDIX - ACOUSTIC MODELS: SAR & DAR
• SAR as an invertible linear feature/model transformation. For $o_{1:T} \in \mathbb{R}^{D \times T}$, SAR is equivalent to transforming each feature dimension with its own matrix: $c_{1:T,d} = A^{(d)} o_{1:T,d}$ for $d = 1, \dots, D$
  - Training: transform $o_{1:T}$ to $c_{1:T}$ with $A^{(1)}, \dots, A^{(D)}$ and maximize $\prod_{t=1}^{T} p(c_t; \mathcal{M}_t)$
  - Generation: draw $\hat{c}_{1:T}$ from $\prod_{t=1}^{T} p(c_t; \mathcal{M}_t)$ given $x_{1:T}$, then de-transform with $A^{(1)^{-1}}, \dots, A^{(D)^{-1}}$ to get $\hat{o}_{1:T}$

57 APPENDIX - ACOUSTIC MODELS: SAR & DAR
• SAR as an invertible linear feature/model transformation, in filter form: dimension $d$ of $o_{1:T}$ is filtered by $A_d(z)$ (sketch below)
  - Training: $o_{1:T}$ → filters $A_1(z), \dots, A_D(z)$ → $c_{1:T}$, maximizing $\prod_{t=1}^{T} p(c_t; \mathcal{M}_t)$
  - Generation: $\hat{c}_{1:T}$ → inverse filters $1/A_1(z), \dots, 1/A_D(z)$ → $\hat{o}_{1:T}$
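The filter view can be checked directly with `scipy.signal.lfilter`: applying $A(z)$ as an FIR filter and then $1/A(z)$ as the matching IIR filter is an exact round trip. A sketch with arbitrary example coefficients:

```python
import numpy as np
from scipy.signal import lfilter

a = np.array([0.5, -0.2])           # example AR coefficients, K = 2
b = np.concatenate(([1.0], -a))     # A(z) = 1 - a1 z^-1 - a2 z^-2

o = np.random.default_rng(0).standard_normal(100)
c = lfilter(b, [1.0], o)            # analysis:  c = A(z) o   (training)
o_rec = lfilter([1.0], b, c)        # synthesis: o = c / A(z) (generation)
assert np.allclose(o, o_rec)        # exact round trip
```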

58 APPENDIX - ACOUSTIC MODELS: SAR & DAR
[Figure: magnitude responses (dB) of $A_1(z)$ and of the inverse filter $1/A_1(z)$ vs. frequency bin]
• Is the effect only due to $1/A_d(z)$? No: it is due to the pair $\{A_d(z), 1/A_d(z)\}$, which gives less mismatch between $c_{1:T}$ and the RMDN

59 APPENDIX - ACOUSTIC MODELS: SAR extension: normalizing flow
• Basic idea [13, 14]: transform $c_{1:T} = f_\Phi(o_{1:T})$, train by maximizing $\prod_{t=1}^{T} p(c_t; \mathcal{M}_t)$ given $x_{1:T}$, and generate via $\hat{o}_{1:T} = f_\Phi^{-1}(\hat{c}_{1:T})$
  - $f_\Phi(\cdot)$ must be invertible, and the Jacobian must be simple:
$$p_o(o_{1:T} \mid x_{1:T}) = p_c(c_{1:T} \mid x_{1:T}) \left| \det \frac{\partial c_{1:T}}{\partial o_{1:T}} \right|$$
[13] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In Proc. ICML, 2015.
[14] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In Proc. NIPS, 2016.

60 APPENDIX - ACOUSTIC MODELS: SAR extension: normalizing flow
• Basic idea: $p_o(o_{1:T} \mid x_{1:T}) = p_c(c_{1:T} \mid x_{1:T}) \left| \det \frac{\partial c_{1:T}}{\partial o_{1:T}} \right|$
  - SAR transform: $c_t = o_t - \sum_{k=1}^{K} a_k\, o_{t-k}$; de-transform: $\hat{o}_t = \hat{c}_t + \sum_{k=1}^{K} a_k\, \hat{o}_{t-k}$
  - AR flow transform: $c_t = o_t - \sum_{k=1}^{K} f^{(k)}_\Phi(o_{1:t-k})\, o_{t-k}$; de-transform: $\hat{o}_t = \hat{c}_t + \sum_{k=1}^{K} f^{(k)}_\Phi(\hat{o}_{1:t-k})\, \hat{o}_{t-k}$ (see the sketch below)
  - For SAR the Jacobian is simple: $\left| \det \frac{\partial c_{1:T}}{\partial o_{1:T}} \right| = 1$ (volume preserving); in the extension, the time-variant coefficients $f^{(k)}_\Phi$ are computed by an RNN from the past outputs (extended SAR = time-variant filter + RNN)
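A sketch of the time-variant AR transform and de-transform above; `f` is a hypothetical stand-in for the coefficient network $f^{(k)}_\Phi$, and the de-transform must run sequentially because each output feeds back into later coefficients:

```python
import numpy as np

def ar_flow_transform(o, f, K):
    """o -> c; f(k, history) returns the k-th time-variant coefficient."""
    c = o.astype(float).copy()
    for t in range(len(o)):
        c[t] -= sum(f(k, o[:t - k + 1]) * o[t - k]
                    for k in range(1, K + 1) if t - k >= 0)
    return c

def ar_flow_detransform(c, f, K):
    """c -> o_hat; exact inverse of the transform, run sequentially."""
    o = np.zeros_like(c, dtype=float)
    for t in range(len(c)):
        o[t] = c[t] + sum(f(k, o[:t - k + 1]) * o[t - k]
                          for k in range(1, K + 1) if t - k >= 0)
    return o
```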

61 APPENDIX - ACOUSTIC MODELS: SAR extension
• SAR can be extended. More details:

62 APPENDIX - ACOUSTIC MODELS: DAR
• Same autoregressive principle as SAR, but with a non-invertible, nonlinear feedback
[Figure: MOS scores for NAT, DAR, SAR, RMDN, and RNN, with a table of pairwise p-values (all differences from NAT are significant, p < 1e-30); and utterance-level GV of F0 (Hz) for the same systems. Exact values are not recoverable from the extracted text.]
[7] X. Wang, S. Takaki, and J. Yamagishi. Autoregressive neural F0 model for statistical parametric speech synthesis. IEEE/ACM Transactions on Audio, Speech and Language Processing (accepted).
