Comparing Recent Waveform Generation and Acoustic Modeling Methods for Neural-network-based Speech Synthesis
1 ICASSP 2018, Calgary, Canada. Comparing Recent Waveform Generation and Acoustic Modeling Methods for Neural-network-based Speech Synthesis. Xin Wang, Jaime Lorenzo-Trueba, Shinji Takaki, Lauri Juvela, Junichi Yamagishi. National Institute of Informatics, Japan & Aalto University, Finland. contact: we welcome critical comments, suggestions, and discussion
2 OVERVIEW. Motivation: better modules for the statistical parametric speech synthesis (SPSS) framework? Method: plug in and test new acoustic models and waveform generators. Results: the best combination is a WaveNet-based vocoder with autoregressive (AR) acoustic models; quality is as good as vocoded speech (at 16 kHz).
3 CONTENTS: Introduction / Models and modules / Experiments / Summary
4 INTRODUCTION - Background. Conventional TTS pipeline [1]: text → front-end → linguistic features → back-end → speech. SPSS back-end [2,3]: linguistic features → acoustic models → spectral features (MGC & BAP) and F0 → waveform generators → speech. MGC: Mel-generalized cepstral coefficients [4]; BAP: band-aperiodicity. [1] T. Dutoit. An Introduction to Text-to-Speech Synthesis. Kluwer Academic Publishers, Norwell, MA, USA. [2] Tokuda, K., et al. (2013). Speech synthesis based on hidden Markov models. Proceedings of the IEEE, 101(5). [3] Zen, H., et al. (2009). Statistical parametric speech synthesis. Speech Communication, 51. [4] Tokuda, K., Kobayashi, T., Masuko, T., and Imai, S. (1994). Mel-generalized cepstral analysis: a unified approach. In Proc. ICSLP.
5 INTRODUCTION - Topic of this work. Better modules for the SPSS back-end? Acoustic models: recurrent neural networks (RNNs), autoregressive (AR) models, generative adversarial networks (GANs). Waveform generators: WORLD vocoder (+ phase recovery), WaveNet-based vocoder. Pipeline: linguistic features → acoustic models → MGC & BAP, F0 → waveform generators → speech. MGC: Mel-generalized cepstral coefficients; BAP: band-aperiodicity.
6 CONTENTS: Introduction / Models and modules / Experiments / Summary
7 MODELS & METHODS. Pipeline diagram: linguistic features → acoustic models (RNN) → MGC & BAP & F0 → waveform generators (WORLD) → speech waveforms.
8 MODELS & METHODS - Acoustic models. Baseline RNN: maps a sequence of linguistic features {x_1, ..., x_t, ...} to a sequence of generated acoustic features {ô_1, ..., ô_t, ...}.
9 MODELS & METHODS - Acoustic models. Baseline RNN as a probabilistic model [5]: p(o_{1:T} | x_{1:T}; θ) = ∏_{t=1}^{T} p(o_t | x_{1:T}; θ) = ∏_{t=1}^{T} N(o_t; μ_t, I), where M_t = {μ_t} and μ_t = H^{(RNN)}(x_{1:T}, t); the generated feature is ô_t = μ_t. [5] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press.
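Since the covariance is fixed to the identity, maximizing this likelihood is equivalent to minimizing the mean-squared error between the predicted means and the natural features. A minimal numerical sketch (NumPy; the toy shapes and values are illustrative, not from the paper):

```python
import numpy as np

def gaussian_nll(o, mu):
    """Per-frame negative log-likelihood of N(o_t; mu_t, I).

    With identity covariance this is 0.5 * ||o_t - mu_t||^2 plus a
    constant, so maximum-likelihood training of the baseline RNN is
    ordinary mean-squared-error regression on the acoustic features.
    """
    d = o.shape[-1]
    const = 0.5 * d * np.log(2.0 * np.pi)
    return 0.5 * np.sum((o - mu) ** 2, axis=-1) + const

# Toy sequence: T = 4 frames of D = 3 acoustic features.
rng = np.random.default_rng(0)
o = rng.normal(size=(4, 3))   # "natural" acoustic features o_{1:T}
mu = np.zeros((4, 3))         # RNN-predicted means mu_{1:T}

# The sequence likelihood factorizes over frames (conditional
# independence), so the total NLL is just a sum of per-frame terms.
total_nll = gaussian_nll(o, mu).sum()
```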
10 MODELS & METHODS - Acoustic models. Limitations of the baseline RNN, p(o_{1:T} | x_{1:T}; θ) = ∏_{t=1}^{T} p(o_t | x_{1:T}; θ) = ∏_{t=1}^{T} N(o_t; μ_t, I): 1. conditional independence across frames; 2. maximum-likelihood training. Remedies: AR models (for 1) and GAN (for 2).
11 MODELS & METHODS - Acoustic models. Shallow AR (SAR), with K = 2 in the illustration: p(o_{1:T} | x_{1:T}; θ, Φ) = ∏_{t=1}^{T} p(o_t | o_{t-K:t-1}, x_{1:T}) = ∏_{t=1}^{T} N(o_t; μ_t + f_Φ(o_{t-K:t-1}), I). Alternative interpretation: trainable filter + RNN. [6] X. Wang, S. Takaki, and J. Yamagishi. An autoregressive recurrent mixture density network for parametric speech synthesis. In Proc. ICASSP.
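The SAR mean shift f_Φ(o_{t-K:t-1}) can be sketched with the simplest possible choice, a per-dimension linear filter over the K previous output frames (the coefficient array `a` is a hypothetical stand-in for the trainable parameters Φ):

```python
import numpy as np

K = 2  # AR order, as in the slide's illustration

def sar_mean(mu_t, o_past, a):
    """Shallow-AR mean: mu_t + f_Phi(o_{t-K:t-1}).

    Here f_Phi is a per-dimension linear filter with coefficients `a`
    (shape K x D), a hypothetical stand-in for the trainable Phi.
    """
    return mu_t + np.sum(a * o_past, axis=0)

rng = np.random.default_rng(1)
D = 3
a = rng.normal(scale=0.1, size=(K, D))   # "trainable" filter coefficients
mu_t = np.zeros(D)                       # RNN-predicted mean for frame t
o_past = rng.normal(size=(K, D))         # previous outputs o_{t-K:t-1}
shifted = sar_mean(mu_t, o_past, a)      # mean of N(o_t; ., I) under SAR
```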
12 MODELS & METHODS - Acoustic models. Deep AR (DAR): p(o_{1:T} | x_{1:T}; θ) = ∏_{t=1}^{T} p(o_t | o_{1:t-1}, x_{1:T}; θ). Used only for (quantized) F0 modeling. [7] X. Wang, S. Takaki, and J. Yamagishi. Autoregressive neural F0 model for statistical parametric speech synthesis. IEEE Transactions on Audio, Speech and Language Processing. (Accepted)
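DAR generation is a step-by-step loop: sample a quantized F0 class, feed it back into the model, and repeat. A minimal sketch with a stub in place of the real RNN (the `toy_logits` model and all sizes are illustrative assumptions, not the paper's network):

```python
import numpy as np

def dar_generate(T, n_classes, logits_fn, rng):
    """Deep-AR generation: sample each quantized-F0 class o_t from
    p(o_t | o_{1:t-1}, x_{1:T}) and feed it back into the model.

    `logits_fn(history)` stands in for the RNN: it returns unnormalized
    log-probabilities over the F0 classes given all past samples.
    """
    history = []
    for _ in range(T):
        logits = logits_fn(history)
        p = np.exp(logits - logits.max())   # numerically stable softmax
        p /= p.sum()
        history.append(int(rng.choice(n_classes, p=p)))
    return np.array(history)

# Toy model: strongly prefer staying on the previous class (smooth F0).
def toy_logits(history, n_classes=8):
    logits = np.zeros(n_classes)
    if history:
        logits[history[-1]] = 5.0
    return logits

f0_classes = dar_generate(T=50, n_classes=8, logits_fn=toy_logits,
                          rng=np.random.default_rng(2))
```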
13 MODELS & METHODS - Acoustic models. GAN: a GAN-based post-filter [8]. A residual generator takes the generated acoustic features (plus noise) and the linguistic features; a discriminator judges natural vs. generated acoustic features (0/1). [8] T. Kaneko, H. Kameoka, N. Hojo, Y. Ijima, K. Hiramatsu, and K. Kashino. Generative adversarial network-based postfilter for statistical parametric speech synthesis. In Proc. ICASSP.
14 MODELS & METHODS - Acoustic models. Pipeline with the acoustic models in place: linguistic features → acoustic models (RNN, SAR, DAR, GAN) → MGC and F0 → waveform generators (WORLD) → speech waveforms. (BAP is not shown.)
15 MODELS & METHODS - Acoustic models. Acoustic models (RNN, SAR, DAR, GAN on MGC and F0) combined with the WORLD vocoder, yielding the systems RNN-Wo, RGA-Wo, SGA-Wo, SAR-Wo. (BAP is not shown.)
16 MODELS & METHODS - Waveform generators. Deterministic approaches: WORLD [9] (binary voicing decision, minimum phase); a log-domain pulse model (PML) [10] (source-filter model, additive in the log domain, binary noise mask); WORLD + phase recovery (generated waveform → STFT amplitude → phase recovery [11] → inverse STFT → phase-recovered waveform). [9] M. Morise, et al. WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Trans. on Information and Systems, 99(7). [10] G. Degottex, et al. A log domain pulse model for parametric speech synthesis. IEEE/ACM Transactions on Audio, Speech, and Language Processing. [11] D. Griffin and J. Lim. Signal estimation from modified short-time Fourier transform. IEEE Trans. ASSP, 32(2).
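The "WORLD + phase recovery" path can be sketched with a plain Griffin-Lim iteration [11] over a magnitude spectrogram, here using SciPy's STFT/ISTFT (the window length, overlap, and demo signal are illustrative assumptions, not the paper's settings):

```python
import numpy as np
from scipy.signal import istft, stft

def griffin_lim(mag, n_iter=32, nperseg=256, noverlap=192):
    """Iteratively re-estimate a phase for the magnitude spectrogram
    `mag`: inverse-STFT the current estimate, STFT the result, keep
    its phase, and reimpose the target magnitude."""
    rng = np.random.default_rng(0)
    spec = mag * np.exp(2j * np.pi * rng.random(mag.shape))  # random phase
    for _ in range(n_iter):
        _, x = istft(spec, nperseg=nperseg, noverlap=noverlap)
        _, _, spec = stft(x, nperseg=nperseg, noverlap=noverlap)
        spec = mag * np.exp(1j * np.angle(spec))
    _, x = istft(spec, nperseg=nperseg, noverlap=noverlap)
    return x

# Demo: recover a 440 Hz sinusoid from its magnitude spectrogram alone.
t = np.arange(4096) / 16000.0
ref = np.sin(2 * np.pi * 440.0 * t)
_, _, S = stft(ref, nperseg=256, noverlap=192)
rec = griffin_lim(np.abs(S))
```

The 75% overlap keeps the Hann window inside SciPy's NOLA constraint, so the ISTFT is well defined at every iteration.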
17 MODELS & METHODS - Waveform generators. WaveNet-based vocoder [12, 13]. How to generate (search) a waveform: 1. exploration: random sampling; 2. exploitation: picking the one-best output. [12] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint. [13] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda. Speaker-dependent WaveNet vocoder. In Proc. Interspeech.
18 MODELS & METHODS - Waveform generators. WaveNet-based vocoder generation strategy: random sampling in unvoiced regions, picking the one-best in voiced regions; this gives less distortion of harmonics (appendix & paper). (Figure: waveform level vs. sampling point, with the unvoiced-segment probability marked.)
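The voiced/unvoiced switching rule can be sketched as a per-sample decision over the softmax output (the 1024-level output distribution and the peaked toy logits are illustrative assumptions):

```python
import numpy as np

def generate_sample(logits, voiced, rng):
    """Slide's output strategy for the WaveNet vocoder: pick the
    one-best (argmax) waveform level in voiced regions to keep the
    harmonics clean, and draw a random sample in unvoiced regions to
    keep the noise-like character."""
    if voiced:
        return int(np.argmax(logits))            # exploitation
    p = np.exp(logits - logits.max())            # stable softmax
    p /= p.sum()
    return int(rng.choice(len(logits), p=p))     # exploration

rng = np.random.default_rng(3)
logits = np.zeros(1024)   # 10-bit mu-law output distribution
logits[512] = 10.0        # peaked softmax, e.g. inside a voiced segment
voiced_level = generate_sample(logits, voiced=True, rng=rng)
unvoiced_level = generate_sample(logits, voiced=False, rng=rng)
```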
19 MODELS & METHODS. Full system grid: acoustic models (RNN, SAR, DAR, GAN on MGC and F0) combined with waveform generators (WORLD, PML with minimum phase, phase recovery, WaveNet), yielding the systems RNN-Wo, RGA-Wo, SGA-Wo, SAR-Wo, SAR-Pm, SAR-Pr, SAR-Wa.
20 CONTENTS: Introduction / Models and modules / Experiments / Summary
21 EXPERIMENTS - Configuration. Data: ATR Ximera F009 voice [14], ~30,000 utterances (48 hours), recording period over 1 year; 48 kHz sampling rate; Japanese, neutral style, read speech. Features: linguistic (phone identity, prosodic tags, ...; ~390 dimensions); acoustic: MGC (60), BAP (25), F0 (1). Front-end: Open JTalk [15]. [14] Kawai, H., Toda, T., Ni, J., Tsuzaki, M., and Tokuda, K. (2004). Ximera: A new TTS from ATR based on corpus-based technologies. In Proc. SSW5. [15] HTS Working Group. The Japanese TTS System Open JTalk, 2015.
22 EXPERIMENTS - Configuration. Listening test: quality rated by MOS (1-5); similarity rated 1-5 against a natural 48 kHz reference; 235 native Japanese listeners, 1500 sets of results. Systems: common network configuration (cf. the paper); no dynamic features or formant enhancement; sampling rates 48 kHz and 16 kHz, except SAR-Wa, which runs at 16 kHz (10 bits, μ-law). System list: Natural, Abs-Pm and Abs-Wo (PML and WORLD synthesis from natural features), SAR-Wa (SAR + WaveNet), SAR-Pr (SAR + phase recovery), SAR-Pm (SAR + PML), SAR-Wo, SGA-Wo, RGA-Wo, RNN-Wo.
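The "10 bits, μ-law" output encoding mentioned above can be sketched as standard μ-law companding to 1024 levels (a generic textbook formulation, not code from the paper):

```python
import numpy as np

def mulaw_encode(x, bits=10):
    """mu-law companding of x in [-1, 1] to 2**bits integer levels."""
    mu = 2 ** bits - 1
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1.0) / 2.0 * mu + 0.5).astype(np.int64)  # round to [0, mu]

def mulaw_decode(q, bits=10):
    """Inverse companding back to a waveform in [-1, 1]."""
    mu = 2 ** bits - 1
    y = 2.0 * q.astype(np.float64) / mu - 1.0
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu

x = np.linspace(-1.0, 1.0, 101)   # toy waveform values
q = mulaw_encode(x)               # integers in [0, 1023]
x_hat = mulaw_decode(q)           # small quantization error only
```

The logarithmic companding allocates more levels near zero amplitude, which is why 10 bits suffice for the WaveNet softmax output.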
23 EXPERIMENTS. Results at 48 kHz and 16 kHz for the systems Natural, Abs-Pm, Abs-Wo, SAR-Wa, SAR-Pr, SAR-Pm, SAR-Wo, SGA-Wo, RGA-Wo, RNN-Wo. (Scores are shown as bar plots in the slides.)
24 EXPERIMENTS. Quality scores (MOS) for Natural, Abs-Pm, Abs-Wo, SAR-Wa, SAR-Pr, SAR-Pm, SAR-Wo, SGA-Wo, RGA-Wo, RNN-Wo. (Bar plot in the original slide.)
25 EXPERIMENTS. Similarity scores for Natural, Abs-Pm, Abs-Wo, SAR-Wa, SAR-Pr, SAR-Pm, SAR-Wo, SGA-Wo, RGA-Wo, RNN-Wo. (Bar plot in the original slide.)
26 CONTENTS: Introduction / Models and methods / Experiments / Summary
27 SUMMARY - Plug and test. SAR: avoids the conditional-independence assumption and alleviates the over-smoothing effect (appendix & paper). WaveNet: a generation method combining one-best selection and random sampling gives less distortion of harmonics (appendix & paper). WaveNet vocoder + SAR & DAR: better than the other combinations, worse than natural speech, close to vocoded speech.
28 FURTHER IMPROVEMENT? Recent work (appendix): SAR is a special case of a volume-preserving normalizing flow [16]; extended SAR = time-variant filter + RNN. WaveNet vocoders: training based on generated conditional features. Annotated linguistic features: reduce the gap between natural and synthetic speech. Future work: reduce the variability of the recordings; use complex-valued neural models. [16] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In Proc. ICML.
29 MESSAGE. Code, recipes, and slides: acoustic models & WaveNet (CUDA/C++); a simple explanation of WaveNet and the acoustic models. tonywangx.github.io, wangxin@nii.ac.jp
30 Thank you for your attention Q & A tonywangx.github.io wangxin@nii.ac.jp 30
31 APPENDIX - WAVENET Conditional network WaveNet-Backend Learning curve No cherry picking All samples based on generated acoustic features or automatically inferred linguistic features Generation method Generation with other acoustic models Training based on generated features Tutorial slides: 31
32 APPENDIX - WAVENET. Structure: the WaveNet stack models P(o_t | o_{t-R:t-1}, c_{1:N}) with M WaveNet blocks and a post-processing network (linear + tanh layers feeding a softmax), running at a 16 kHz clock rate; a conditional feature network (bi-LSTM and CNN over c_{1:N}, concatenation with F0, linear layer, up-sampling) runs at a 200 Hz clock rate (frame shift = 5 ms).
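The frame-to-sample clock-rate conversion (200 Hz conditional features driving a 16 kHz WaveNet) can be sketched with naive repetition; the slide's network uses a learned up-sampling layer, so this is only a minimal stand-in:

```python
import numpy as np

def upsample_condition(frames, rate_out=16000, rate_frame=200):
    """Repeat each frame-level conditional feature vector so that the
    200 Hz frame clock (5 ms shift) matches the 16 kHz sample clock:
    16000 / 200 = 80 copies per frame."""
    reps = rate_out // rate_frame
    return np.repeat(frames, reps, axis=0)

frames = np.arange(10, dtype=np.float64).reshape(10, 1)  # 10 frames, 1 feature
per_sample = upsample_condition(frames)                  # 800 samples
```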
33 APPENDIX - WAVENET - Conditional network. Architecture trials for mapping the acoustic features c_{1:N} (MGC, F0) to the conditional input l_{1:N}: Trial 1: linear layer only; Trial 2: bi-LSTM + CNN + linear; Trial 3: bi-LSTM + CNN, concatenated with a skip connection for F0, then linear.
34 APPENDIX - WAVENET - Conditional network. Experiments on the WaveNet vocoder, given generated MGC/F0: audio samples (sample1-3) for Natural, Trial 1 (none), Trial 2 (LSTM+CNN), Trial 3 (LSTM+CNN+skip-F0).
35 APPENDIX - WAVENET - WaveNet backend. Architecture: text → text analyzer → textual features → F0 model (DAR) → WaveNet backend → waveform. The WaveNet backend only uses random sampling. Audio samples (sample1-5) for Natural, WaveNet vocoder, WaveNet backend.
36 APPENDIX - WAVENET. Learning curves (negative log-likelihood vs. epoch, train and validation sets) for the WaveNet vocoder and the WaveNet backend; the WaveNet backend was trained for more than 100 epochs.
37 APPENDIX - WAVENET. Generation method. (Figure: the softmax output probability over the 1024 μ-law waveform levels at each sampling point.)
38 APPENDIX - WAVENET. Generation method: experiments on the WaveNet vocoder. (Figures: waveform level vs. sampling point.)
39 APPENDIX - WAVENET. Generation method: one-best + random sampling, given generated MGC/F0. Mix: one-best in voiced regions, random sampling in unvoiced regions. Random: random sampling all the time. Audio samples (sample1-3) for Natural, Mix, Random.
40 Audio samples: Natural; voiced: one-best / unvoiced: sampling; voiced: sampling / unvoiced: sampling.
41 APPENDIX - WAVENET. Generation method: one-best + random sampling. Mix: one-best in voiced regions, random sampling in unvoiced regions. Mix2: one-best for 75% of voiced frames, random sampling otherwise. Mix4: one-best for 25% of voiced frames, random sampling otherwise.
42 APPENDIX - WAVENET. Generation method: one-best + random sampling. Audio samples (sample1-3) for Natural and for Mix (100% one-best in voiced frames), Mix2 (75%), Mix3 (50%), Mix4 (20%), Random (0%).
43 APPENDIX - WAVENET. Generation method: one-best + random sampling. Audio samples (sample1-3) for Natural, the WaveNet backend (Mix), and the WaveNet vocoder (Mix, Random).
44 Natural WaveNet-backend Mix Mix WaveNet-backend Random Random 44
45 APPENDIX - WAVENET. WaveNet vocoder + other acoustic models: audio samples (sample1-3) for Natural, SAR+DAR, SGA+DAR, RGA+DAR, RNN+DAR, extended SAR+DAR.
46 APPENDIX - WAVENET. WaveNet vocoder: training using generated MGC. Audio samples (sample1-3) for Natural, the WaveNet backend, a WaveNet vocoder trained on natural MGC, and WaveNet vocoders trained on generated MGC (epochs 35, 45, 55).
47 APPENDIX ACOUSTIC MODELS General comparison SAR & DAR SAR extension More details: 47
48 APPENDIX ACOUSTIC MODELS - General comparison. Generated trajectories of the 1st and 5th MGC dimensions for Natural, RNN, RNN+GAN (RGA), SAR, SAR+GAN (SGA), plotted over the frame index (utterance ATR Ximera F009 AOZORAR T01).
49 APPENDIX ACOUSTIC MODELS - General comparison. Generated trajectories of the 15th and 31st MGC dimensions for Natural, RNN, SAR, plotted over the frame index (utterance ATR Ximera F009 AOZORAR T01).
50 APPENDIX ACOUSTIC MODELS - General comparison. Generated trajectories of the 15th and 31st MGC dimensions for Natural, RNN, RNN+GAN (RGA), SAR, SAR+GAN (SGA), plotted over the frame index (utterance ATR Ximera F009 AOZORAR T01).
51 APPENDIX ACOUSTIC MODELS - General comparison. Utterance-level global variance (GV) of MGC, per MGC dimension, for Natural, SAR, SAR+GAN (SGA), RNN+GAN (RGA), RNN; also the modulation spectrum of the 31st MGC dimension.
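Utterance-level GV is just the per-dimension variance of a feature trajectory; over-smoothing shows up as GV below that of natural speech. A toy sketch (random data standing in for real MGC trajectories):

```python
import numpy as np

def utterance_gv(feats):
    """Utterance-level global variance: per-dimension variance of an
    acoustic feature trajectory (frames x dimensions)."""
    return np.var(feats, axis=0)

rng = np.random.default_rng(5)
natural = rng.normal(scale=1.0, size=(500, 60))   # stand-in for natural MGC
smoothed = rng.normal(scale=0.5, size=(500, 60))  # stand-in for over-smoothed MGC
gv_nat = utterance_gv(natural)
gv_smo = utterance_gv(smoothed)                   # systematically lower GV
```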
52 APPENDIX ACOUSTIC MODELS - SAR & DAR. Toy SAR example: SAR versus an RMDN with a recurrent output layer [11]. In the SAR graph, o_2 depends on o_1 through the coefficient a; in the RMDN graph, μ_2 depends on μ_1 through the weight w_μ. Assume o_t ∈ R, σ_t = 1, and a linear activation function. [11] H. Zen and H. Sak. Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis. In Proc. ICASSP.
53 APPENDIX ACOUSTIC MODELS - SAR & DAR. Toy SAR example. RMDN: p(o_{1:2}) = N(o_1; μ_1, 1) N(o_2; μ_2 + w_μ μ_1, 1), where μ_t = w^T h_t + b, so the mean at t = 2 is μ_2 + w_μ μ_1 (dependency between the means). SAR: p(o_{1:2}) = N(o_1; μ_1, 1) N(o_2; μ_2 + a o_1, 1), where μ_t = w^T h_t + b (dependency between the outputs). [11] H. Zen and H. Sak. Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis. In Proc. ICASSP.
54 APPENDIX ACOUSTIC MODELS - SAR & DAR. Toy SAR example: dependency between the μ_t or between the o_t? RMDN: p(o_{1:2}) = N(o_1; μ_1, 1) N(o_2; μ_2 + w_μ μ_1, 1) = (1/(2π)) exp(-(1/2)(o - μ)^T Σ^{-1} (o - μ)), with o = [o_1, o_2]^T, μ = [μ_1, μ_2 + w_μ μ_1]^T, Σ = I. SAR: p(o_{1:2}) = N(o_1; μ_1, 1) N(o_2; μ_2 + a o_1, 1) = (1/(2π)) exp(-(1/2)(o - μ)^T Σ^{-1} (o - μ)), with o = [o_1, o_2]^T, μ = [μ_1, μ_2 + a μ_1]^T, Σ = [[1, a], [a, 1 + a^2]].
55 APPENDIX ACOUSTIC MODELS - SAR & DAR. Toy SAR example. RMDN: p(o) = p(c) = N(c; μ_c, Σ_c) with μ_c = [μ_1, μ_2]^T and Σ_c = I. SAR: with c = [c_1, c_2]^T = [o_1, o_2 - a o_1]^T = A o, where A = [[1, 0], [-a, 1]], the output o = [o_1, o_2]^T is distributed as p(o) = N(o; μ_o, Σ_o) with μ_o = [μ_1, μ_2 + a μ_1]^T and Σ_o = A^{-1} A^{-T} = [[1, a], [a, 1 + a^2]].
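The closed-form covariance on this slide can be checked numerically: if c = A o with c ~ N(μ_c, I), then o has covariance A^{-1} A^{-T} (the coefficient value below is a toy assumption):

```python
import numpy as np

a = 0.7  # toy SAR coefficient

# c = A o with A = [[1, 0], [-a, 1]].  Since c ~ N(mu_c, I) and
# o = A^{-1} c, the output covariance is A^{-1} A^{-T}.
A = np.array([[1.0, 0.0], [-a, 1.0]])
A_inv = np.linalg.inv(A)
Sigma_o = A_inv @ A_inv.T

# Closed form from the slide: [[1, a], [a, 1 + a^2]].
Sigma_slide = np.array([[1.0, a], [a, 1.0 + a ** 2]])
```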
56 APPENDIX ACOUSTIC MODELS - SAR & DAR. SAR as an invertible linear feature/model transformation. For o_{1:T} ∈ R^{D×T}, stack the per-dimension trajectories o_{1:T,1}, ..., o_{1:T,D}; SAR is equivalent to transforming each trajectory by a matrix A^{(d)}. Training: o_{1:T} → (A^{(1)}, ..., A^{(D)}) → c_{1:T}, modeled by ∏_{t=1}^{T} p(c_t; M_t). Generation: ô_{1:T} ← (A^{(1)-1}, ..., A^{(D)-1}) ← ĉ_{1:T} drawn from ∏_{t=1}^{T} p(c_t; M_t), conditioned on x_{1:T}.
57 APPENDIX ACOUSTIC MODELS - SAR & DAR. SAR as an invertible linear feature/model transformation, filter view: each per-dimension matrix A^{(d)} is a filter A_d(z). Training: o_{1:T} → filters A_1(z), ..., A_D(z) → c_{1:T}, modeled by ∏_{t=1}^{T} p(c_t; M_t). Generation: ô_{1:T} ← inverse filters 1/A_1(z), ..., 1/A_D(z) ← ĉ_{1:T}.
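The training/generation filter pair is an exact round trip: the FIR filter A(z) whitens each trajectory, and the IIR filter 1/A(z) undoes it. A sketch with illustrative (not trained) coefficients:

```python
import numpy as np
from scipy.signal import lfilter

# Per-dimension SAR filter A(z) = 1 - a1*z^-1 - a2*z^-2 (K = 2);
# the coefficients are illustrative assumptions, not trained values.
a1, a2 = 0.5, -0.2
b = [1.0, -a1, -a2]

rng = np.random.default_rng(4)
o = rng.normal(size=200)        # one feature trajectory o_{1:T,d}

# Training direction: c = A(z) o  (FIR filtering).
c = lfilter(b, [1.0], o)

# Generation direction: o_hat = (1 / A(z)) c  (IIR, the exact inverse).
o_hat = lfilter([1.0], b, c)
```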
58 APPENDIX ACOUSTIC MODELS - SAR & DAR. SAR as an invertible linear feature/model transformation. (Figures: magnitude responses, in dB vs. frequency bin, of A_1(z) and 1/A_1(z).) Is the effect only due to 1/A_d(z)? No: it is due to the pair {A_d(z), 1/A_d(z)}, which gives less mismatch between c_{1:T} and the RMDN.
59 APPENDIX ACOUSTIC MODELS - SAR extension: normalizing flow. Basic idea: transform c_{1:T} = f_Φ(o_{1:T}), model ∏_{t=1}^{T} p(c_t; M_t) given x_{1:T}, and generate via ô_{1:T} = f_Φ^{-1}(ĉ_{1:T}). The Jacobian must be simple and f_Φ(.) must be invertible: p_o(o_{1:T} | x_{1:T}) = p_c(c_{1:T} | x_{1:T}) |det J_{1:T}|. [13] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In Proc. ICML. [14] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In Proc. NIPS.
60 APPENDIX ACOUSTIC MODELS - SAR extension: normalizing flow. Basic idea: p_o(o_{1:T} | x_{1:T}) = p_c(c_{1:T} | x_{1:T}) |det J_{1:T}|. SAR transform: c_t = o_t - Σ_{k=1}^{K} a_k o_{t-k}; de-transform: ô_t = ĉ_t + Σ_{k=1}^{K} a_k ô_{t-k}. AR-flow transform: c_t = o_t - Σ_{k=1}^{K} f^{(k)}_Φ(o_{1:t-k}) o_{t-k}; de-transform: ô_t = ĉ_t + Σ_{k=1}^{K} f^{(k)}_Φ(ô_{1:t-k}) ô_{t-k}. For SAR the determinant term is simple: |det J_{1:T}| = 1 (volume-preserving), with μ_t = RNN(o_{1:t-1}).
61 APPENDIX ACOUSTIC MODELS SAR extension q SAR can be extended More details: 61
62 APPENDIX ACOUSTIC MODELS - DAR. DAR uses the same autoregressive principle, but a noninvertible, nonlinear transformation. MOS scores for NAT, DAR, SAR, RMDN, RNN with a pairwise p-value table (NAT vs. every other system: p < 1e-30; the remaining entries are garbled in the transcription). Utterance-level GV of F0 (Hz) for NAT, RNN, RMDN, SAR, DAR. [7] X. Wang, S. Takaki, and J. Yamagishi. Autoregressive neural F0 model for statistical parametric speech synthesis. IEEE Transactions on Audio, Speech and Language Processing. (Accepted)
JHU CLSP Summer School Pattern Recognition Applied to Music Signals 2 3 4 5 Music Content Analysis Classification and Features Statistical Pattern Recognition Gaussian Mixtures and Neural Nets Singing
More informationCEPSTRAL ANALYSIS SYNTHESIS ON THE MEL FREQUENCY SCALE, AND AN ADAPTATIVE ALGORITHM FOR IT.
CEPSTRAL ANALYSIS SYNTHESIS ON THE EL FREQUENCY SCALE, AND AN ADAPTATIVE ALGORITH FOR IT. Summarized overview of the IEEE-publicated papers Cepstral analysis synthesis on the mel frequency scale by Satochi
More informationGaussian Processes for Audio Feature Extraction
Gaussian Processes for Audio Feature Extraction Dr. Richard E. Turner (ret26@cam.ac.uk) Computational and Biological Learning Lab Department of Engineering University of Cambridge Machine hearing pipeline
More informationSINGLE CHANNEL SPEECH MUSIC SEPARATION USING NONNEGATIVE MATRIX FACTORIZATION AND SPECTRAL MASKS. Emad M. Grais and Hakan Erdogan
SINGLE CHANNEL SPEECH MUSIC SEPARATION USING NONNEGATIVE MATRIX FACTORIZATION AND SPECTRAL MASKS Emad M. Grais and Hakan Erdogan Faculty of Engineering and Natural Sciences, Sabanci University, Orhanli
More informationAn exploration of dropout with LSTMs
An exploration of out with LSTMs Gaofeng Cheng 1,3, Vijayaditya Peddinti 4,5, Daniel Povey 4,5, Vimal Manohar 4,5, Sanjeev Khudanpur 4,5,Yonghong Yan 1,2,3 1 Key Laboratory of Speech Acoustics and Content
More informationRecurrent Neural Network
Recurrent Neural Network Xiaogang Wang xgwang@ee..edu.hk March 2, 2017 Xiaogang Wang (linux) Recurrent Neural Network March 2, 2017 1 / 48 Outline 1 Recurrent neural networks Recurrent neural networks
More informationGate Activation Signal Analysis for Gated Recurrent Neural Networks and Its Correlation with Phoneme Boundaries
INTERSPEECH 2017 August 20 24, 2017, Stockholm, Sweden Gate Activation Signal Analysis for Gated Recurrent Neural Networks and Its Correlation with Phoneme Boundaries Yu-Hsuan Wang, Cheng-Tao Chung, Hung-yi
More informationRecurrent Autoregressive Networks for Online Multi-Object Tracking. Presented By: Ishan Gupta
Recurrent Autoregressive Networks for Online Multi-Object Tracking Presented By: Ishan Gupta Outline Multi Object Tracking Recurrent Autoregressive Networks (RANs) RANs for Online Tracking Other State
More informationarxiv: v2 [cs.sd] 18 Apr 2018
TASNET: TIME-DOMAIN AUDIO SEPARATION NETWORK FOR REAL-TIME, SINGLE-CHANNEL SPEECH SEPARATION Yi Luo Nima Mesgarani Department of Electrical Engineering, Columbia University, New York, NY arxiv:1711.00541v2
More informationUsually the estimation of the partition function is intractable and it becomes exponentially hard when the complexity of the model increases. However,
Odyssey 2012 The Speaker and Language Recognition Workshop 25-28 June 2012, Singapore First attempt of Boltzmann Machines for Speaker Verification Mohammed Senoussaoui 1,2, Najim Dehak 3, Patrick Kenny
More informationNeural Architectures for Image, Language, and Speech Processing
Neural Architectures for Image, Language, and Speech Processing Karl Stratos June 26, 2018 1 / 31 Overview Feedforward Networks Need for Specialized Architectures Convolutional Neural Networks (CNNs) Recurrent
More informationGlobal SNR Estimation of Speech Signals using Entropy and Uncertainty Estimates from Dropout Networks
Interspeech 2018 2-6 September 2018, Hyderabad Global SNR Estimation of Speech Signals using Entropy and Uncertainty Estimates from Dropout Networks Rohith Aralikatti, Dilip Kumar Margam, Tanay Sharma,
More informationResidual LSTM: Design of a Deep Recurrent Architecture for Distant Speech Recognition
INTERSPEECH 017 August 0 4, 017, Stockholm, Sweden Residual LSTM: Design of a Deep Recurrent Architecture for Distant Speech Recognition Jaeyoung Kim 1, Mostafa El-Khamy 1, Jungwon Lee 1 1 Samsung Semiconductor,
More informationCS 188: Artificial Intelligence Fall 2011
CS 188: Artificial Intelligence Fall 2011 Lecture 20: HMMs / Speech / ML 11/8/2011 Dan Klein UC Berkeley Today HMMs Demo bonanza! Most likely explanation queries Speech recognition A massive HMM! Details
More informationEvaluation of the modified group delay feature for isolated word recognition
Evaluation of the modified group delay feature for isolated word recognition Author Alsteris, Leigh, Paliwal, Kuldip Published 25 Conference Title The 8th International Symposium on Signal Processing and
More informationSequential Neural Methods for Likelihood-free Inference
Sequential Neural Methods for Likelihood-free Inference Conor Durkan University of Edinburgh conor.durkan@ed.ac.uk George Papamakarios University of Edinburgh g.papamakarios@ed.ac.uk Iain Murray University
More informationGenerative Adversarial Networks
Generative Adversarial Networks SIBGRAPI 2017 Tutorial Everything you wanted to know about Deep Learning for Computer Vision but were afraid to ask Presentation content inspired by Ian Goodfellow s tutorial
More informationarxiv: v4 [cs.cl] 5 Jun 2017
Multitask Learning with CTC and Segmental CRF for Speech Recognition Liang Lu, Lingpeng Kong, Chris Dyer, and Noah A Smith Toyota Technological Institute at Chicago, USA School of Computer Science, Carnegie
More informationTimbral, Scale, Pitch modifications
Introduction Timbral, Scale, Pitch modifications M2 Mathématiques / Vision / Apprentissage Audio signal analysis, indexing and transformation Page 1 / 40 Page 2 / 40 Modification of playback speed Modifications
More informationHigh-Order Sequence Modeling Using Speaker-Dependent Recurrent Temporal Restricted Boltzmann Machines for Voice Conversion
INTERSPEECH 2014 High-Order Sequence Modeling Using Speaker-Dependent Recurrent Temporal Restricted Boltzmann Machines for Voice Conversion Toru Nakashika 1, Tetsuya Takiguchi 2, Yasuo Ariki 2 1 Graduate
More informationA Neural Parametric Singing Synthesizer
INTERSPEECH 17 August 4, 17, Stockholm, Sweden A Neural Parametric Singing Synthesizer Merlijn Blaauw, Jordi Bonada Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain merlijn.blaauw@up.edu,
More informationImproved Learning through Augmenting the Loss
Improved Learning through Augmenting the Loss Hakan Inan inanh@stanford.edu Khashayar Khosravi khosravi@stanford.edu Abstract We present two improvements to the well-known Recurrent Neural Network Language
More informationMaximum Likelihood and Maximum A Posteriori Adaptation for Distributed Speaker Recognition Systems
Maximum Likelihood and Maximum A Posteriori Adaptation for Distributed Speaker Recognition Systems Chin-Hung Sit 1, Man-Wai Mak 1, and Sun-Yuan Kung 2 1 Center for Multimedia Signal Processing Dept. of
More informationSignal Modeling Techniques in Speech Recognition. Hassan A. Kingravi
Signal Modeling Techniques in Speech Recognition Hassan A. Kingravi Outline Introduction Spectral Shaping Spectral Analysis Parameter Transforms Statistical Modeling Discussion Conclusions 1: Introduction
More informationModifying Voice Activity Detection in Low SNR by correction factors
Modifying Voice Activity Detection in Low SNR by correction factors H. Farsi, M. A. Mozaffarian, H.Rahmani Department of Electrical Engineering University of Birjand P.O. Box: +98-9775-376 IRAN hfarsi@birjand.ac.ir
More informationMel-Generalized Cepstral Representation of Speech A Unified Approach to Speech Spectral Estimation. Keiichi Tokuda
Mel-Generalized Cepstral Representation of Speech A Unified Approach to Speech Spectral Estimation Keiichi Tokuda Nagoya Institute of Technology Carnegie Mellon University Tamkang University March 13,
More informationSinger Identification using MFCC and LPC and its comparison for ANN and Naïve Bayes Classifiers
Singer Identification using MFCC and LPC and its comparison for ANN and Naïve Bayes Classifiers Kumari Rambha Ranjan, Kartik Mahto, Dipti Kumari,S.S.Solanki Dept. of Electronics and Communication Birla
More informationTinySR. Peter Schmidt-Nielsen. August 27, 2014
TinySR Peter Schmidt-Nielsen August 27, 2014 Abstract TinySR is a light weight real-time small vocabulary speech recognizer written entirely in portable C. The library fits in a single file (plus header),
More informationHighway-LSTM and Recurrent Highway Networks for Speech Recognition
Highway-LSTM and Recurrent Highway Networks for Speech Recognition Golan Pundak, Tara N. Sainath Google Inc., New York, NY, USA {golan, tsainath}@google.com Abstract Recently, very deep networks, with
More informationTensor-Train Long Short-Term Memory for Monaural Speech Enhancement
1 Tensor-Train Long Short-Term Memory for Monaural Speech Enhancement Suman Samui, Indrajit Chakrabarti, and Soumya K. Ghosh, arxiv:1812.10095v1 [cs.sd] 25 Dec 2018 Abstract In recent years, Long Short-Term
More informationa) b) (Natural Language Processing; NLP) (Deep Learning) Bag of words White House RGB [1] IBM
c 1. (Natural Language Processing; NLP) (Deep Learning) RGB IBM 135 8511 5 6 52 yutat@jp.ibm.com a) b) 2. 1 0 2 1 Bag of words White House 2 [1] 2015 4 Copyright c by ORSJ. Unauthorized reproduction of
More informationA Survey on Voice Activity Detection Methods
e-issn 2455 1392 Volume 2 Issue 4, April 2016 pp. 668-675 Scientific Journal Impact Factor : 3.468 http://www.ijcter.com A Survey on Voice Activity Detection Methods Shabeeba T. K. 1, Anand Pavithran 2
More informationA Low-Cost Robust Front-end for Embedded ASR System
A Low-Cost Robust Front-end for Embedded ASR System Lihui Guo 1, Xin He 2, Yue Lu 1, and Yaxin Zhang 2 1 Department of Computer Science and Technology, East China Normal University, Shanghai 200062 2 Motorola
More informationExpectation Propagation in Dynamical Systems
Expectation Propagation in Dynamical Systems Marc Peter Deisenroth Joint Work with Shakir Mohamed (UBC) August 10, 2012 Marc Deisenroth (TU Darmstadt) EP in Dynamical Systems 1 Motivation Figure : Complex
More informationCS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning
CS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning Lei Lei Ruoxuan Xiong December 16, 2017 1 Introduction Deep Neural Network
More informationModel-based unsupervised segmentation of birdcalls from field recordings
Model-based unsupervised segmentation of birdcalls from field recordings Anshul Thakur School of Computing and Electrical Engineering Indian Institute of Technology Mandi Himachal Pradesh, India Email:
More informationAutomatic Speech Recognition (CS753)
Automatic Speech Recognition (CS753) Lecture 12: Acoustic Feature Extraction for ASR Instructor: Preethi Jyothi Feb 13, 2017 Speech Signal Analysis Generate discrete samples A frame Need to focus on short
More informationA Generative Model Based Kernel for SVM Classification in Multimedia Applications
Appears in Neural Information Processing Systems, Vancouver, Canada, 2003. A Generative Model Based Kernel for SVM Classification in Multimedia Applications Pedro J. Moreno Purdy P. Ho Hewlett-Packard
More informationOptimal Speech Enhancement Under Signal Presence Uncertainty Using Log-Spectral Amplitude Estimator
1 Optimal Speech Enhancement Under Signal Presence Uncertainty Using Log-Spectral Amplitude Estimator Israel Cohen Lamar Signal Processing Ltd. P.O.Box 573, Yokneam Ilit 20692, Israel E-mail: icohen@lamar.co.il
More informationIntroduction to Neural Networks
Introduction to Neural Networks Steve Renals Automatic Speech Recognition ASR Lecture 10 24 February 2014 ASR Lecture 10 Introduction to Neural Networks 1 Neural networks for speech recognition Introduction
More informationarxiv: v1 [cs.lg] 15 Jun 2016
Improving Variational Inference with Inverse Autoregressive Flow arxiv:1606.04934v1 [cs.lg] 15 Jun 2016 Diederik P. Kingma, Tim Salimans and Max Welling OpenAI, San Francisco University of Amsterdam, University
More informationSpeaker Representation and Verification Part II. by Vasileios Vasilakakis
Speaker Representation and Verification Part II by Vasileios Vasilakakis Outline -Approaches of Neural Networks in Speaker/Speech Recognition -Feed-Forward Neural Networks -Training with Back-propagation
More informationRecurrent Neural Networks
Recurrent Neural Networks Datamining Seminar Kaspar Märtens Karl-Oskar Masing Today's Topics Modeling sequences: a brief overview Training RNNs with back propagation A toy example of training an RNN Why
More informationImproved Method for Epoch Extraction in High Pass Filtered Speech
Improved Method for Epoch Extraction in High Pass Filtered Speech D. Govind Center for Computational Engineering & Networking Amrita Vishwa Vidyapeetham (University) Coimbatore, Tamilnadu 642 Email: d
More information