Generative Model-Based Text-to-Speech Synthesis. Heiga Zen (Google London) February


1 Generative Model-Based Text-to-Speech Synthesis Heiga Zen (Google London) February

2 Outline Generative TTS Generative acoustic models for parametric TTS Hidden Markov models (HMMs) Neural networks Beyond parametric TTS Learned features WaveNet End-to-end Conclusion & future topics

4 Text-to-speech as sequence-to-sequence mapping. Automatic speech recognition (ASR): speech → "Hello my name is Heiga Zen". Machine translation (MT): "Hello my name is Heiga Zen" → "Ich heiße Heiga Zen". Text-to-speech synthesis (TTS): "Hello my name is Heiga Zen" → speech.

5 Speech production process: text (concept) → modulation of a carrier wave by speech information → speech. [Diagram: air flow → sound source (voiced: pulse, unvoiced: noise; fundamental frequency, voiced/unvoiced) → frequency transfer characteristics of the vocal tract → speech.]

6 Typical flow of a TTS system: TEXT → Text analysis (NLP frontend; discrete → discrete): sentence segmentation, word segmentation, text normalization, part-of-speech tagging, pronunciation → Speech synthesis (speech backend; discrete → continuous): prosody prediction, waveform generation → SYNTHESIZED SPEECH.

10 Speech synthesis approaches: Rule-based, formant synthesis [1]. Sample-based, concatenative synthesis [2]. Model-based, generative synthesis: p(speech | text = "Hello, my name is Heiga Zen.").

12 Probabilistic formulation of TTS

Random variables:
X: speech waveforms (data), observed
W: transcriptions (data), observed
w: given text, observed
x: synthesized speech, unobserved

Synthesis: estimate the posterior predictive distribution p(x | w, X, W), then sample x from it.

13 Probabilistic formulation. Introduce auxiliary variables (representations) + factorize dependencies:

p(x | w, X, W) = Σ_∀l Σ_∀L ∫∫∫ p(x | o) p(o | l, λ) p(l | w) p(X | O) p(O | L, λ) p(λ) p(L | W) / p(X) dO do dλ

where O, o: acoustic features; L, l: linguistic features; λ: model.

14 Approximation (1). Approximate the sums & integrals by best point estimates (like MAP) [3]:

p(x | w, X, W) ≈ p(x | ô)

where {ô, l̂, Ô, L̂, λ̂} = argmax_{o,l,O,L,λ} p(x | o) p(o | l, λ) p(l | w) p(X | O) p(O | L, λ) p(λ) p(L | W)

22 Approximation (2). Joint → step-by-step maximization [3]:

Ô = argmax_O p(X | O)   Extract acoustic features
L̂ = argmax_L p(L | W)   Extract linguistic features
λ̂ = argmax_λ p(Ô | L̂, λ) p(λ)   Learn mapping
l̂ = argmax_l p(l | w)   Predict linguistic features
ô = argmax_o p(o | l̂, λ̂)   Predict acoustic features
x ∼ f_x(ô) = p(x | ô)   Synthesize waveform

Representations: acoustic features, linguistic features, mapping
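
To make the shape of this pipeline concrete, here is a deliberately crude, runnable Python sketch (an editorial illustration, not from the talk): frame log-energy stands in for the acoustic features O, character codes for the linguistic features L, a least-squares linear map for the learned mapping λ, and noise shaping for the vocoder. All function names are hypothetical.

    import numpy as np

    def extract_acoustic_features(x, frame=160):          # Ô = argmax_O p(X | O)
        n = len(x) // frame
        return np.log(1e-6 + np.square(x[:n * frame]).reshape(n, frame).mean(1))[:, None]

    def extract_linguistic_features(text, n_frames):      # L̂ = argmax_L p(L | W)
        codes = np.frombuffer(text.encode(), np.uint8).astype(float)
        idx = np.linspace(0, len(codes) - 1, n_frames).astype(int)  # crude "duration model"
        return np.stack([codes[idx], np.ones(n_frames)], 1)

    def learn_mapping(L, O):                              # λ̂ = argmax_λ p(Ô | L̂, λ) p(λ)
        return np.linalg.lstsq(L, O, rcond=None)[0]       # linear-Gaussian maximum likelihood

    def synthesize_waveform(O, frame=160):                # x ∼ p(x | ô): toy "vocoder"
        rng = np.random.default_rng(0)
        return np.concatenate([np.exp(0.5 * o) * rng.standard_normal(frame) for o in O[:, 0]])

    X = np.sin(np.linspace(0, 200, 16000)) * np.hanning(16000)   # one "recorded" utterance
    O = extract_acoustic_features(X)
    L = extract_linguistic_features("hello my name is heiga zen", len(O))
    lam = learn_mapping(L, O)                             # training: learn the mapping

    l_new = extract_linguistic_features("hello world", 50)
    x_new = synthesize_waveform(l_new @ lam)              # synthesis: text → features → waveform
    print(x_new.shape)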

24 Representation Linguistic features

Hierarchical description of "Hello, world.":
Sentence (length, ...) → Phrase (intonation, ...) → Word: hello world (POS, grammar, ...) → Syllable: h-e2 l-ou1 w-er1-l-d (stress, tone, ...) → Phone: h e l ou w er l d (voicing, manner, ...)

→ Based on knowledge about spoken language: lexicon, letter-to-sound rules; tokenizer, tagger, parser; phonology rules

26 Representation Acoustic features

Piece-wise stationary, source-filter generative model p(x | o):
Vocal source: pulse train (voiced) / white noise (unvoiced); fundamental frequency, aperiodicity, voicing, ...
Vocal tract filter: cepstrum, LPC, ...
Speech: x(n) = h(n) * e(n), reassembled frame by frame with overlap/shift windowing

→ Needs to solve an inverse problem: estimate the parameters from the signal, then use the estimated parameters (e.g., cepstrum) as acoustic features
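
As a runnable illustration of this source-filter model (an editorial sketch, not from the talk), the following Python generates e(n) as a pulse train or white noise and filters it with a made-up all-pole vocal-tract filter; real systems estimate the filter from speech via cepstral or LPC analysis.

    import numpy as np
    from scipy.signal import lfilter

    fs = 16000
    f0 = 120.0                                    # fundamental frequency [Hz]

    def excitation(n_samples, voiced, rng=np.random.default_rng(0)):
        if voiced:                                # pulse train with period fs/f0
            e = np.zeros(n_samples)
            e[::int(fs / f0)] = 1.0
            return e
        return rng.standard_normal(n_samples)     # white noise for unvoiced frames

    # Toy all-pole filter 1/A(z) standing in for the vocal tract: one resonance near 800 Hz.
    a = [1.0, -1.8 * np.cos(2 * np.pi * 800 / fs), 0.81]

    voiced = lfilter([1.0], a, excitation(4000, voiced=True))     # x(n) = h(n) * e(n)
    unvoiced = lfilter([1.0], a, excitation(4000, voiced=False))
    print(voiced[:5], unvoiced[:5])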

28 Representation Mapping. Rule-based, formant synthesis [1]:

Ô = argmax_O p(X | O)   Vocoder analysis
L̂ = argmax_L p(L | W)   Text analysis
λ̂ = argmax_λ p(Ô | L̂, λ) p(λ)   Extract rules
l̂ = argmax_l p(l | w)   Text analysis
ô = argmax_o p(o | l̂, λ̂)   Apply rules
x ∼ f_x(ô) = p(x | ô)   Vocoder synthesis

→ Hand-crafted rules on knowledge-based features

30 Representation Mapping. HMM-based [4], statistical parametric synthesis [5]:

Ô = argmax_O p(X | O)   Vocoder analysis
L̂ = argmax_L p(L | W)   Text analysis
λ̂ = argmax_λ p(Ô | L̂, λ) p(λ)   Train HMMs
l̂ = argmax_l p(l | w)   Text analysis
ô = argmax_o p(o | l̂, λ̂)   Parameter generation
x ∼ f_x(ô) = p(x | ô)   Vocoder synthesis

→ Replace the rules by an HMM-based generative acoustic model

31 Outline Generative TTS Generative acoustic models for parametric TTS Hidden Markov models (HMMs) Neural networks Beyond parametric TTS Learned features WaveNet End-to-end Conclusion & future topics

32 HMM-based generative acoustic model for TTS

Context-dependent subword HMMs; decision trees to cluster & tie HMM states → interpretable
l_1 ... l_N: linguistic features (discrete); o_1 ... o_T: acoustic features (continuous); q_t: hidden state at t

p(o | l, λ) = Σ_∀q Π_{t=1}^{T} p(o_t | q_t, λ) P(q | l, λ)
            = Σ_∀q Π_{t=1}^{T} N(o_t; μ_{q_t}, Σ_{q_t}) P(q | l, λ)
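
The state-sequence sum above is exactly what the forward algorithm computes. A minimal numpy sketch with a toy 3-state left-to-right HMM and 1-D Gaussian emissions (all parameter values invented for illustration):

    import numpy as np

    mu = np.array([0.0, 2.0, -1.0])          # per-state emission means
    var = np.array([1.0, 0.5, 1.0])          # per-state emission variances
    A = np.array([[0.6, 0.4, 0.0],           # left-to-right transition matrix
                  [0.0, 0.7, 0.3],
                  [0.0, 0.0, 1.0]])
    pi = np.array([1.0, 0.0, 0.0])           # always start in state 0

    def gauss(o, mu, var):
        return np.exp(-0.5 * (o - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

    def likelihood(o_seq):
        alpha = pi * gauss(o_seq[0], mu, var)         # forward recursion over states
        for o in o_seq[1:]:
            alpha = (alpha @ A) * gauss(o, mu, var)
        return alpha.sum()                            # p(o | l, λ)

    print(likelihood(np.array([0.1, 0.3, 1.8, 2.2, -0.9])))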

33 HMM-based generative acoustic model for TTS

Non-smooth, step-wise statistics → smoothing is essential
Difficult to use high-dimensional acoustic features (e.g., raw spectra) → use low-dimensional features (e.g., cepstra)
Data fragmentation → ineffective, local representation

A lot of research work has been done to address these issues.

35 Alternative acoustic model

HMM: handles variable length & alignment
Decision tree: maps linguistic → acoustic; a regression tree maps linguistic features l to statistics of acoustic features o

Replace the tree with a general-purpose regression model → artificial neural network

36 FFNN-based acoustic model for TTS [6]

Input: frame-level linguistic features l_t; target: frame-level acoustic features o_t

h_t = g(W_hl l_t + b_h)
ô_t = W_oh h_t + b_o
λ = {W_hl, W_oh, b_h, b_o}
λ̂ = argmin_λ Σ_t ||o_t − ô_t||²
ô_t ≈ E[o_t | l_t]

→ Replaces the decision trees & Gaussian distributions
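
A direct numpy transcription of this model and its training criterion, on synthetic data (dimensions and learning rate are arbitrary choices for the sketch):

    import numpy as np

    rng = np.random.default_rng(0)
    L = rng.standard_normal((200, 10))            # frame-level linguistic features l_t
    O = L @ rng.standard_normal((10, 3))          # synthetic acoustic feature targets o_t

    W_hl, b_h = 0.1 * rng.standard_normal((10, 16)), np.zeros(16)
    W_oh, b_o = 0.1 * rng.standard_normal((16, 3)), np.zeros(3)
    lr = 1e-3

    for step in range(500):
        H = np.tanh(L @ W_hl + b_h)               # h_t = g(W_hl l_t + b_h)
        O_hat = H @ W_oh + b_o                    # ô_t = W_oh h_t + b_o
        err = O_hat - O                           # gradient of (1/2) Σ_t ||o_t − ô_t||²
        dH = (err @ W_oh.T) * (1 - H ** 2)        # backprop through tanh
        W_oh -= lr * H.T @ err;  b_o -= lr * err.sum(0)
        W_hl -= lr * L.T @ dH;   b_h -= lr * dH.sum(0)

    print(float(np.mean(err ** 2)))               # falls as ô_t approaches E[o_t | l_t]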

37 RNN-based acoustic model for TTS [7]

Input: frame-level linguistic features l_t; target: frame-level acoustic features o_t; recurrent connections

h_t = g(W_hl l_t + W_hh h_{t−1} + b_h)
ô_t = W_oh h_t + b_o
λ = {W_hl, W_hh, W_oh, b_h, b_o}
λ̂ = argmin_λ Σ_t ||o_t − ô_t||²

FFNN: ô_t ≈ E[o_t | l_t]; RNN: ô_t ≈ E[o_t | l_1, ..., l_t]
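
The only change from the FFNN sketch is the W_hh h_{t−1} term, which makes ô_t depend on the whole left context l_1, ..., l_t. A forward-pass-only sketch with made-up dimensions:

    import numpy as np

    rng = np.random.default_rng(0)
    W_hl = 0.1 * rng.standard_normal((10, 16))
    W_hh = 0.1 * rng.standard_normal((16, 16))
    W_oh = 0.1 * rng.standard_normal((16, 3))
    b_h, b_o = np.zeros(16), np.zeros(3)

    def rnn_forward(l_seq):
        h = np.zeros(16)
        outputs = []
        for l_t in l_seq:                              # one step per frame
            h = np.tanh(l_t @ W_hl + h @ W_hh + b_h)   # h_t = g(W_hl l_t + W_hh h_{t−1} + b_h)
            outputs.append(h @ W_oh + b_o)             # ô_t = W_oh h_t + b_o
        return np.stack(outputs)

    print(rnn_forward(rng.standard_normal((5, 10))).shape)   # (5, 3)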

39 NN-based generative acoustic model for TTS

Non-smooth, step-wise statistics → RNN predicts smoothly varying acoustic features [7, 8]
Difficult to use high-dimensional acoustic features (e.g., raw spectra) → layered architecture can handle high-dimensional features [9]
Data fragmentation → distributed representation [10]

The NN-based approach is now mainstream in research & products
Models: FFNN [6], MDN [11], RNN [7], Highway network [12], GAN [13]
Products: e.g., Google [14]

40 Outline Generative TTS Generative acoustic models for parametric TTS Hidden Markov models (HMMs) Neural networks Beyond parametric TTS Learned features WaveNet End-to-end Conclusion & future topics

41 NN-based generative model for TTS

[Speech production diagram, as on slide 5.]

Text → Linguistic → (Articulatory) → Acoustic → Waveform

42 Knowledge-based features → learned features

Unsupervised feature learning:
Speech: auto-encoder on raw FFT spectra x(t) → acoustic features o(t) [9, 15] → positive results
Text: 1-hot word representations w(n) → linguistic features l(n); word [16], phone & syllable [17] → less positive
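
As a toy version of the speech side (not the cited systems), a tied-weight autoencoder that compresses a stand-in FFT magnitude spectrum into a low-dimensional learned feature and reconstructs it:

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.abs(rng.standard_normal((500, 257)))      # stand-in FFT magnitude spectra
    W = 0.01 * rng.standard_normal((257, 32))        # 257-dim spectrum → 32-dim code

    for step in range(200):
        H = np.tanh(X @ W)                           # encoder: learned acoustic feature o(t)
        X_hat = H @ W.T                              # tied-weight decoder
        err = X_hat - X
        dH = (err @ W) * (1 - H ** 2)
        W -= 1e-4 * (X.T @ dH + err.T @ H)           # gradient of ||X − X_hat||² w.r.t. tied W

    print(float(np.mean(err ** 2)))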

43 Relax approximation. Joint acoustic feature extraction & model training

Two-step optimization → joint optimization:

Ô = argmax_O p(X | O)
λ̂ = argmax_λ p(Ô | L̂, λ) p(λ)
↓
{λ̂, Ô} = argmax_{λ,O} p(X | O) p(O | L̂, λ) p(λ)

Joint source-filter & acoustic model optimization: HMM [18, 19, 20], NN [21, 22]

44 Relax approximation. Joint acoustic feature extraction & model training

Mixed-phase cepstral analysis + LSTM-RNN [21]
[Diagram: a pulse train and the speech signal pass through cepstrum-based filters; the resulting derivatives of the voiced/unvoiced targets couple the vocoder analysis with forward and back propagation through an LSTM-RNN driven by linguistic features l_t.]

45 Relax approximation. Direct mapping from linguistic features to waveform

No explicit acoustic features:

{λ̂, Ô} = argmax_{λ,O} p(X | O) p(O | L̂, λ) p(λ)
↓
λ̂ = argmax_λ p(X | L̂, λ) p(λ)

Generative models for raw audio: LPC [23], WaveNet [24], SampleRNN [25]

48 WaveNet: A generative model for raw audio

Autoregressive (AR) modelling of speech signals, x = {x_0, x_1, ..., x_{N−1}}: raw waveform

p(x | λ) = p(x_0, x_1, ..., x_{N−1} | λ) = Π_{n=0}^{N−1} p(x_n | x_0, ..., x_{n−1}, λ)

WaveNet [24] → p(x_n | x_0, ..., x_{n−1}, λ) is modeled by a convolutional NN

Key components:
Causal dilated convolution: captures long-term dependency
Gated convolution + residual + skip connections: powerful non-linearity
Softmax at output: classification rather than regression
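
The autoregressive factorization can be exercised independently of the network. In the sketch below, a toy stand-in distribution (here just preferring values near the previous sample) plays the role of WaveNet's deep convolutional net; generation is the same sample-by-sample loop either way:

    import numpy as np

    rng = np.random.default_rng(0)
    K = 256                                       # quantization levels (8-bit)

    def toy_model(history):                       # placeholder for the conv net
        prev = history[-1] if history else K // 2
        logits = -0.1 * (np.arange(K) - prev) ** 2
        p = np.exp(logits - logits.max())
        return p / p.sum()                        # p(x_n | x_0, ..., x_{n−1})

    x = []
    for n in range(100):                          # sample-by-sample generation
        x.append(rng.choice(K, p=toy_model(x)))
    print(x[:10])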

49 WaveNet Causal dilated convolution

100ms at 16kHz sampling = 1,600 time steps → too long to be captured by a normal RNN/LSTM
Dilated convolution: exponentially increases the receptive field size w.r.t. the number of layers

[Figure: inputs x_{n−16}, ..., x_{n−1} pass through three hidden layers of causal dilated convolutions to produce the output p(x_n | x_{n−16}, ..., x_{n−1}).]
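
A numpy sketch of causal dilated convolution with kernel size 2 and dilations 1, 2, 4, 8: four layers already give a receptive field of 16 samples, and the real model stacks many more layers to cover thousands:

    import numpy as np

    def causal_dilated_conv(x, w, dilation):
        # y[n] = w[0] * x[n - dilation] + w[1] * x[n], zero-padded on the left (causal)
        pad = np.concatenate([np.zeros(dilation), x])
        return w[0] * pad[:len(x)] + w[1] * x

    x = np.random.default_rng(0).standard_normal(32)
    for d in (1, 2, 4, 8):
        x = np.tanh(causal_dilated_conv(x, np.array([0.5, 0.5]), d))

    receptive_field = 1 + sum((2 - 1) * d for d in (1, 2, 4, 8))
    print(receptive_field)    # 16: output x_n depends on inputs x_{n−15}, ..., x_n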

50 WaveNet Non-linearity

[Figure: a stack of residual blocks (labels in the figure: 30 residual blocks, 512 residual channels, 256 skip channels). Each block: 2x1 dilated convolution → gated activation → 1x1 convolution, with a residual connection back to the block input and a skip connection out; the summed skips pass through ReLU → 1x1 → ReLU → 1x1 → softmax to give p(x_n | x_0, ..., x_{n−1}).]
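
One residual block following this structure, sketched in numpy with illustrative channel sizes (the 1x1 projections are left as identities for brevity):

    import numpy as np

    rng = np.random.default_rng(0)
    C, T = 8, 64                                       # channels, time steps

    def dilated(x, w, d):                              # causal dilated conv, kernel 2
        pad = np.pad(x, ((0, 0), (d, 0)))[:, :T]
        return w[0] @ pad + w[1] @ x

    W_f, W_g = rng.standard_normal((2, 2, C, C)) * 0.1 # filter & gate conv weights
    W_res, W_skip = np.eye(C), np.eye(C)               # 1x1 convs (toy: identity)

    def residual_block(x, d):
        gate = 1 / (1 + np.exp(-dilated(x, W_g, d)))   # sigmoid gate
        z = np.tanh(dilated(x, W_f, d)) * gate         # gated activation unit
        return x + W_res @ z, W_skip @ z               # (residual out, skip out)

    x = rng.standard_normal((C, T))
    skips = 0
    for d in (1, 2, 4, 8):
        x, s = residual_block(x, d)
        skips = skips + s                              # summed skips feed ReLU → 1x1 → softmax
    print(skips.shape)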

53 WaveNet Softmax

[Figures: an analog audio signal (amplitude vs. time) is sampled & quantized, and each sample becomes a category index; the network's softmax output is a categorical distribution over these amplitude levels.]
A histogram over quantized amplitudes can represent unimodal, multimodal, or skewed distributions.
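
The WaveNet paper quantizes amplitudes to 256 categories with 8-bit µ-law companding; a runnable encode/decode pair:

    import numpy as np

    MU = 255

    def mulaw_encode(x):                     # x in [-1, 1] → integer in [0, 255]
        y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
        return ((y + 1) / 2 * MU + 0.5).astype(np.int32)

    def mulaw_decode(q):                     # integer in [0, 255] → x in [-1, 1]
        y = 2 * (q.astype(np.float64) / MU) - 1
        return np.sign(y) * ((1 + MU) ** np.abs(y) - 1) / MU

    x = np.sin(np.linspace(0, 20, 1000))
    print(np.max(np.abs(x - mulaw_decode(mulaw_encode(x)))))   # small quantization error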

54 WaveNet Conditional modelling

[Figure: linguistic features l are embedded as h_n at each time step and added, through 1x1 convolutions, into the gated activation of every residual block, so that the network models p(x_n | x_0, ..., x_{n−1}, l).]

55 WaveNet vs conventional audio generative models

Assumptions in conventional audio generative models [23, 26, 27]:
Stationary process w/ fixed-length analysis window → model estimated within a windowed frame w/ shift
Linear, time-invariant filter within a frame → but the relationship between samples can be non-linear
Gaussian process → assumes speech signals are normally distributed

WaveNet:
Sample-by-sample, non-linear, capable of taking additional inputs
Arbitrary-shaped signal distribution

State-of-the-art subjective naturalness with WaveNet-based TTS [24]
[MOS bar chart: HMM, LSTM, concatenative, WaveNet — WaveNet highest.]

56 Relax approximation. Towards Bayesian end-to-end TTS

Integrated end-to-end:

L̂ = argmax_L p(L | W)
λ̂ = argmax_λ p(X | L̂, λ) p(λ)
↓
λ̂ = argmax_λ p(X | W, λ) p(λ)

Text analysis is integrated into the model

57 Relax approximation. Towards Bayesian end-to-end TTS

Bayesian end-to-end:

λ̂ = argmax_λ p(X | W, λ) p(λ),  x ∼ f_x(w, λ̂) = p(x | w, λ̂)
↓
x ∼ f_x(w, X, W) = p(x | w, X, W) = ∫ p(x | w, λ) p(λ | X, W) dλ ≈ (1/K) Σ_{k=1}^{K} p(x | w, λ̂_k)   (ensemble)

Marginalize model parameters & architecture
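
A sketch of the ensemble approximation in the last line: averaging the predictive distributions of K independently trained models (toy categorical predictors stand in for p(x | w, λ̂_k)):

    import numpy as np

    def predict_k(k, levels=256):                # stand-in for p(x | w, λ̂_k)
        rng = np.random.default_rng(k)           # each "model" k gives a different distribution
        logits = rng.standard_normal(levels)
        p = np.exp(logits - logits.max())
        return p / p.sum()

    K = 4
    p = np.mean([predict_k(k) for k in range(K)], axis=0)   # (1/K) Σ_k p(x | w, λ̂_k)
    print(float(p.sum()))                        # still a proper distribution: 1.0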

58 Outline Generative TTS Generative acoustic models for parametric TTS Hidden Markov models (HMMs) Neural networks Beyond parametric TTS Learned features WaveNet End-to-end Conclusion & future topics

59 Generative model-based text-to-speech synthesis

Bayes formulation + factorization + approximations
Representation: acoustic features, linguistic features, mapping
Mapping: rules → HMM → NN
Features: engineered → unsupervised, learned
Fewer approximations: joint training, direct waveform modelling; moving towards integrated & Bayesian end-to-end TTS
Naturalness: concatenative ≤ generative
Flexibility: concatenative < generative (e.g., multiple speakers)

61 Beyond text-to-speech synthesis

TTS on conversational assistants: texts aren't fully self-contained → need more context
Location, to resolve homographs
User query, to put the right emphasis

We need a representation that can organize the world's information & make it accessible & useful from TTS generative models.

63 Beyond generative TTS

Generative model-based TTS: the model represents the process behind speech production, trained to minimize error against human-produced speech; learned model → speaker

Speech is for communication: the goal is to maximize the amount of information received
The listener is missing → include a listener in training / in the model itself?

64 Thanks!

65 References I
[1] D. Klatt. Real-time speech synthesis by rule. Journal of the Acoustical Society of America, 68(S1):S18, 1980.
[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373-376, 1996.
[3] K. Tokuda. Speech synthesis as a statistical machine learning problem. Invited talk given at ASRU, 2011.
[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. IEICE Trans. Inf. Syst., J83-D-II, 2000. (in Japanese).
[5] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Communication, 51(11):1039-1064, 2009.
[6] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962-7966, 2013.
[7] Y. Fan, Y. Qian, F.-L. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, pages 1964-1968, 2014.

66 References II
[8] H. Zen. Acoustic modeling for speech synthesis: from HMM to RNN. Invited talk given at ASRU, 2015.
[9] S. Takaki and J. Yamagishi. A deep auto-encoder based low-dimensional feature extraction from FFT spectral envelopes for statistical parametric speech synthesis. In Proc. ICASSP, 2016.
[10] G. Hinton, J. McClelland, and D. Rumelhart. Distributed representations. In D. Rumelhart, J. McClelland, and the PDP Research Group, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, 1986.
[11] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, 2014.
[12] X. Wang, S. Takaki, and J. Yamagishi. Investigating very deep highway networks for parametric speech synthesis. In Proc. ISCA SSW, 2016.
[13] Y. Saito, S. Takamichi, and H. Saruwatari. Training algorithm to deceive anti-spoofing verification for DNN-based speech synthesis. In Proc. ICASSP, 2017.
[14] H. Zen, Y. Agiomyrgiannakis, N. Egberts, F. Henderson, and P. Szczepaniak. Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices. In Proc. Interspeech, 2016.

67 References III
[15] P. Muthukumar and A. Black. A deep learning approach to data-driven parameterizations for statistical parametric speech synthesis. arXiv:1409.8558, 2014.
[16] P. Wang, Y. Qian, F. Soong, L. He, and H. Zhao. Word embedding for recurrent neural network based TTS synthesis. In Proc. ICASSP, 2015.
[17] X. Wang, S. Takaki, and J. Yamagishi. Investigation of using continuous representation of various linguistic units in neural network-based text-to-speech synthesis. IEICE Trans. Inf. Syst., E99-D, 2016.
[18] T. Toda and K. Tokuda. Statistical approach to vocal tract transfer function estimation based on factor analyzed trajectory HMM. In Proc. ICASSP, 2008.
[19] Y.-J. Wu and K. Tokuda. Minimum generation error training with direct log spectral distortion on LSPs for HMM-based speech synthesis. In Proc. Interspeech, 2008.
[20] R. Maia, H. Zen, and M. Gales. Statistical parametric speech synthesis with joint estimation of acoustic and excitation model parameters. In Proc. ISCA SSW, 2010.
[21] K. Tokuda and H. Zen. Directly modeling speech waveforms by neural networks for statistical parametric speech synthesis. In Proc. ICASSP, 2015.
[22] K. Tokuda and H. Zen. Directly modeling voiced and unvoiced components in speech waveforms by neural networks. In Proc. ICASSP, 2016.

68 References IV
[23] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A, 1970.
[24] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv:1609.03499, 2016.
[25] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. arXiv:1612.07837, 2016.
[26] S. Imai and C. Furuichi. Unbiased estimation of log spectrum. In Proc. EURASIP, 1988.
[27] H. Kameoka, Y. Ohishi, D. Mochihashi, and J. Le Roux. Speech analysis with multi-kernel linear prediction. In Proc. Spring Conference of ASJ. (in Japanese).

69 [Summary diagram: graphical models over X, W (data), w (text), O, L, o, l (auxiliary representations), λ (model), and x (synthesized speech) for each formulation:]
(1) Bayesian
(2) Auxiliary variables + factorization
(3) Joint maximization
(4) Step-by-step maximization (e.g., statistical parametric TTS)
(5) Joint acoustic feature extraction + model training
(6) Conditional WaveNet-based TTS
(7) Integrated end-to-end
(8) Bayesian end-to-end
