Generative Model-Based Text-to-Speech Synthesis. Heiga Zen (Google London) February
2 Outline
- Generative TTS
- Generative acoustic models for parametric TTS: hidden Markov models (HMMs); neural networks
- Beyond parametric TTS: learned features; WaveNet; end-to-end
- Conclusion & future topics
4 Text-to-speech as sequence-to-sequence mapping
- Automatic speech recognition (ASR): speech → "Hello my name is Heiga Zen" (text)
- Machine translation (MT): "Hello my name is Heiga Zen" → "Ich heiße Heiga Zen"
- Text-to-speech synthesis (TTS): "Hello my name is Heiga Zen" (text) → speech
5 Speech production process (figure): starting from text (a concept), a sound source — pulses for voiced sounds, noise for unvoiced sounds, at a fundamental frequency — is shaped by the frequency transfer characteristics of the vocal tract; speech information thus modulates a carrier wave, producing the speech air flow.
6 Typical flow of a TTS system
- Text analysis (NLP frontend): sentence segmentation, word segmentation, text normalization, part-of-speech tagging, pronunciation — TEXT → linguistic features (discrete → discrete)
- Speech synthesis (backend): prosody prediction, waveform generation — linguistic features → SYNTHESIZED SPEECH (discrete → continuous)
7 Speech synthesis approaches
- Rule-based, formant synthesis [1]
- Sample-based, concatenative synthesis [2]
- Model-based, generative synthesis: p(speech = x | text = "Hello, my name is Heiga Zen.")
11 Probabilistic formulation of TTS
Random variables:
- X: speech waveforms (data) — observed
- W: transcriptions (data) — observed
- w: given text — observed
- x: synthesized speech — unobserved
Synthesis: estimate the posterior predictive distribution p(x | w, X, W), then sample x from it.
13 Probabilistic formulation — introduce auxiliary variables (representations) and factorize the dependency:

p(x | w, X, W) = Σ_{∀l} Σ_{∀L} ∫∫∫ p(x | o) p(o | l, λ) p(l | w) p(X | O) p(O | L, λ) p(λ) p(L | W) do dO dλ

where O, o: acoustic features; L, l: linguistic features; λ: model.
14 Approximation (1): approximate the sums and integrals by best point estimates (like MAP) []:

p(x | w, X, W) ≈ p(x | ô)

where {ô, l̂, Ô, L̂, λ̂} = arg max over o, l, O, L, λ of p(x | o) p(o | l, λ) p(l | w) p(X | O) p(O | L, λ) p(λ) p(L | W)
15 Approximation (2): joint → step-by-step maximization []

Ô = arg max_O p(X | O)              Extract acoustic features
L̂ = arg max_L p(L | W)              Extract linguistic features
λ̂ = arg max_λ p(Ô | L̂, λ) p(λ)     Learn mapping
l̂ = arg max_l p(l | w)              Predict linguistic features
ô = arg max_o p(o | l̂, λ̂)          Predict acoustic features
x ~ f_x(ô) = p(x | ô)               Synthesize waveform

Representations: acoustic features, linguistic features, mapping.
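The step-by-step chain above can be sketched as a pipeline of functions. Everything here is a hypothetical stand-in (a toy lexicon and lookup-table "model", not a real text analyzer, acoustic model, or vocoder); the point is only the order of the stages.

```python
# A minimal sketch of the step-by-step TTS pipeline. All models are
# hypothetical stand-ins (lookup tables), not real frontends or vocoders.

def extract_linguistic_features(text):
    # l-hat = arg max_l p(l | w): e.g., grapheme-to-phoneme lookup.
    lexicon = {"hello": ["h", "e", "l", "ou"]}
    return [p for word in text.lower().split() for p in lexicon.get(word, ["?"])]

def predict_acoustic_features(phones, model):
    # o-hat = arg max_o p(o | l-hat, lambda-hat): map each phone to a mean vector.
    return [model.get(p, [0.0]) for p in phones]

def synthesize_waveform(acoustic):
    # x ~ p(x | o-hat): placeholder "vocoder" that just flattens the features.
    return [v for frame in acoustic for v in frame]

model = {"h": [0.1], "e": [0.2], "l": [0.3], "ou": [0.4]}
phones = extract_linguistic_features("hello")
waveform = synthesize_waveform(predict_acoustic_features(phones, model))
```

Each function corresponds to one arg-max stage; in a real system each would be a trained model.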
23 Representation — linguistic features
Example: "Hello, world."
- Sentence: length, ...
- Phrase: intonation, ...
- Word: POS, grammar, ... (hello, world)
- Syllable: stress, tone, ... (h-e2 l-ou1, w-er1-l-d)
- Phone: voicing, manner, ... (h e l ou, w er l d)
→ Based on knowledge about spoken language: lexicon, letter-to-sound rules; tokenizer, tagger, parser; phonology rules.
25 Representation — acoustic features
Piece-wise stationary, source-filter generative model p(x | o): a vocal source e(n) — a pulse train (voiced) with fundamental frequency, aperiodicity and voicing, or white noise (unvoiced) — excites a vocal tract filter h(n) (cepstrum, LPC, ...), so x(n) = h(n) * e(n), computed frame-by-frame over overlapping, shifted windows.
→ Needs to solve an inverse problem: estimate the parameters from the signal, then use the estimated parameters (e.g., cepstrum) as acoustic features.
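The frame-by-frame analysis described above can be sketched with numpy: window the signal, take the log-magnitude spectrum, and invert it to a real cepstrum. Frame length, shift, and cepstral order are illustrative choices, not the talk's settings.

```python
# A sketch of frame-based acoustic feature extraction under the
# source-filter view: overlapping windows -> log-magnitude spectrum ->
# real cepstrum (low quefrencies approximate the filter h(n)).
import numpy as np

def cepstrum_frames(x, frame_len=256, shift=128, n_ceps=24):
    frames = []
    window = np.hanning(frame_len)
    for start in range(0, len(x) - frame_len + 1, shift):
        frame = x[start:start + frame_len] * window
        spectrum = np.fft.rfft(frame)
        log_mag = np.log(np.abs(spectrum) + 1e-10)  # avoid log(0)
        ceps = np.fft.irfft(log_mag)                # real cepstrum
        frames.append(ceps[:n_ceps])                # keep low quefrencies
    return np.array(frames)

# Toy input: a 1 kHz sine, 0.1 s at 16 kHz sampling.
t = np.arange(1600) / 16000.0
feats = cepstrum_frames(np.sin(2 * np.pi * 1000 * t))
```

Real parametric TTS systems use more careful spectral estimators (e.g., mel-cepstral analysis), but the windowed, frame-wise structure is the same.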
27 Representation — mapping: rule-based, formant synthesis [1]

Ô = arg max_O p(X | O)              Vocoder analysis
L̂ = arg max_L p(L | W)              Text analysis
λ̂ = arg max_λ p(Ô | L̂, λ) p(λ)     Extract rules
l̂ = arg max_l p(l | w)              Text analysis
ô = arg max_o p(o | l̂, λ̂)          Apply rules
x ~ f_x(ô) = p(x | ô)               Vocoder synthesis

→ Hand-crafted rules on knowledge-based features.
29 Representation — mapping: HMM-based [4], statistical parametric synthesis [5]

Ô = arg max_O p(X | O)              Vocoder analysis
L̂ = arg max_L p(L | W)              Text analysis
λ̂ = arg max_λ p(Ô | L̂, λ) p(λ)     Train HMMs
l̂ = arg max_l p(l | w)              Text analysis
ô = arg max_o p(o | l̂, λ̂)          Parameter generation
x ~ f_x(ô) = p(x | ô)               Vocoder synthesis

→ Replace the rules by an HMM-based generative acoustic model.
31 Outline
- Generative TTS
- Generative acoustic models for parametric TTS: hidden Markov models (HMMs); neural networks
- Beyond parametric TTS: learned features; WaveNet; end-to-end
- Conclusion & future topics
32 HMM-based generative acoustic model for TTS
Context-dependent subword HMMs; decision trees cluster & tie HMM states → interpretable.
Linguistic features l_1 ... l_N and hidden states q_1 ... q_T are discrete; acoustic features o_1 ... o_T are continuous:

p(o | l, λ) = Σ_{∀q} Π_{t=1}^{T} p(o_t | q_t, λ) P(q | l, λ)
            = Σ_{∀q} Π_{t=1}^{T} N(o_t; μ_{q_t}, Σ_{q_t}) P(q | l, λ)

where q_t is the hidden state at time t.
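The sum over all state sequences in p(o | l, λ) is exponential if done naively, but the forward algorithm computes it in O(T·S²). A toy sketch with 1-D observations and a 2-state left-to-right model (all numbers made up for illustration):

```python
# Toy HMM likelihood p(o | l, lambda) = sum_q prod_t N(o_t; mu_{q_t}, var_{q_t}) P(q | l, lambda),
# computed with the forward algorithm instead of enumerating state sequences.
import math

def gaussian(o, mu, var):
    return math.exp(-(o - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def forward_likelihood(obs, pi, trans, mus, variances):
    n_states = len(pi)
    # alpha[j] = p(o_1..o_t, q_t = j)
    alpha = [pi[j] * gaussian(obs[0], mus[j], variances[j]) for j in range(n_states)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * trans[i][j] for i in range(n_states))
                 * gaussian(o, mus[j], variances[j]) for j in range(n_states)]
    return sum(alpha)

pi = [1.0, 0.0]                    # always start in state 0
trans = [[0.6, 0.4], [0.0, 1.0]]   # left-to-right topology
lik = forward_likelihood([0.1, 0.2, 1.9], pi, trans, [0.0, 2.0], [1.0, 1.0])
```

In a real system the states would be tied by the decision trees and the observations would be multi-dimensional acoustic feature vectors.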
33 HMM-based generative acoustic model for TTS — issues
- Non-smooth, step-wise statistics → smoothing is essential
- Difficult to use high-dimensional acoustic features (e.g., raw spectra) → use low-dimensional features (e.g., cepstra)
- Data fragmentation → ineffective, local representation
A lot of research work has been done to address these issues.
34 Alternative acoustic model
- HMM: handles variable length & alignment
- Decision tree: maps linguistic → acoustic; a regression tree maps linguistic features l (via yes/no questions) to statistics of acoustic features o
Replace the tree with a general-purpose regression model → an artificial neural network.
36 FFNN-based acoustic model for TTS [6]
Input: frame-level linguistic feature l_t; target: frame-level acoustic feature o_t.

h_t = g(W_hl l_t + b_h)
ô_t = W_oh h_t + b_o
λ = {W_hl, W_oh, b_h, b_o},  λ̂ = arg min_λ Σ_t ||o_t − ô_t||²
ô_t ≈ E[o_t | l_t]

→ Replaces the decision trees & Gaussian distributions.
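The two equations above are a single hidden layer and a linear output layer; a numpy sketch (dimensions are illustrative and the weights random, so the output is not a trained prediction):

```python
# FFNN acoustic model forward pass: h_t = g(W_hl l_t + b_h), o-hat_t = W_oh h_t + b_o.
import numpy as np

rng = np.random.default_rng(0)
dim_l, dim_h, dim_o = 10, 16, 4          # linguistic, hidden, acoustic dims

W_hl = rng.standard_normal((dim_h, dim_l)) * 0.1
b_h = np.zeros(dim_h)
W_oh = rng.standard_normal((dim_o, dim_h)) * 0.1
b_o = np.zeros(dim_o)

def ffnn(l_t):
    h_t = np.tanh(W_hl @ l_t + b_h)      # g = tanh
    return W_oh @ h_t + b_o              # o-hat_t

o_hat = ffnn(rng.standard_normal(dim_l))
```

Training would minimize Σ_t ||o_t − ô_t||² over {W_hl, W_oh, b_h, b_o} by gradient descent.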
37 RNN-based acoustic model for TTS [7]
Input: frame-level linguistic feature l_t; target: frame-level acoustic feature o_t; recurrent connections carry h_{t−1} forward.

h_t = g(W_hl l_t + W_hh h_{t−1} + b_h)
ô_t = W_oh h_t + b_o
λ = {W_hl, W_hh, W_oh, b_h, b_o},  λ̂ = arg min_λ Σ_t ||o_t − ô_t||²

FFNN: ô_t ≈ E[o_t | l_t];  RNN: ô_t ≈ E[o_t | l_1, ..., l_t].
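The only change from the FFNN is the W_hh h_{t−1} term, which makes ô_t depend on the whole input history l_1..l_t. A numpy sketch with random weights and illustrative dimensions:

```python
# RNN acoustic model forward pass over a sequence of linguistic feature frames.
import numpy as np

rng = np.random.default_rng(1)
dim_l, dim_h, dim_o = 10, 16, 4

W_hl = rng.standard_normal((dim_h, dim_l)) * 0.1
W_hh = rng.standard_normal((dim_h, dim_h)) * 0.1
W_oh = rng.standard_normal((dim_o, dim_h)) * 0.1
b_h, b_o = np.zeros(dim_h), np.zeros(dim_o)

def rnn(linguistic_seq):
    h = np.zeros(dim_h)                      # h_0
    outputs = []
    for l_t in linguistic_seq:
        h = np.tanh(W_hl @ l_t + W_hh @ h + b_h)   # recurrent update
        outputs.append(W_oh @ h + b_o)             # o-hat_t
    return np.array(outputs)

out = rnn(rng.standard_normal((5, dim_l)))   # 5 frames of linguistic features
```

LSTM cells replace the plain tanh update in practice, but the sequence-level dependence is the same.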
38 NN-based generative acoustic model for TTS
- Non-smooth, step-wise statistics → RNN predicts smoothly varying acoustic features [7, 8]
- Difficult to use high-dimensional acoustic features (e.g., raw spectra) → layered architecture can handle high-dimensional features [9]
- Data fragmentation → distributed representation [10]
The NN-based approach is now mainstream in research & products.
- Models: FFNN [6], MDN [11], RNN [7], highway network [12], GAN [13]
- Products: e.g., Google [14]
40 Outline
- Generative TTS
- Generative acoustic models for parametric TTS: hidden Markov models (HMMs); neural networks
- Beyond parametric TTS: learned features; WaveNet; end-to-end
- Conclusion & future topics
41 NN-based generative model for TTS
Pipeline over representations, mirroring the speech production process of slide 5: Text → Linguistic → (Articulatory) → Acoustic → Waveform.
42 Knowledge-based features → learned features
Unsupervised feature learning:
- Speech: auto-encoder on raw FFT spectra x(t) → acoustic feature o(t) [9, 15] → positive results
- Text: 1-hot word representation w(n) → linguistic feature l(n); word embeddings [16], phone & syllable embeddings [17] → less positive
43 Relax approximation — joint acoustic feature extraction & model training
Two-step optimization → joint optimization:

Ô = arg max_O p(X | O);  λ̂ = arg max_λ p(Ô | L̂, λ) p(λ)
→ {λ̂, Ô} = arg max_{λ, O} p(X | O) p(O | L̂, λ) p(λ)

Joint source-filter & acoustic model optimization: HMM [18, 19, 20]; NN [21, 22].
44 Relax approximation — joint acoustic feature extraction & model training
Mixed-phase cepstral analysis + LSTM-RNN []: the cepstral synthesis filter is embedded in the network, so forward propagation maps linguistic features l_t through the LSTM-RNN and the filter to the speech signal, and back-propagation of derivatives jointly updates the cepstral analysis and the acoustic model.
45 Relax approximation — direct mapping from linguistic features to waveform
No explicit acoustic features:

{λ̂, Ô} = arg max_{λ, O} p(X | O) p(O | L̂, λ) p(λ)  →  λ̂ = arg max_λ p(X | L̂, λ) p(λ)

Generative models for raw audio: LPC [23], WaveNet [24], SampleRNN [25].
46 WaveNet: a generative model for raw audio
Autoregressive (AR) modelling of speech signals, with x = {x_0, x_1, ..., x_{N−1}} the raw waveform:

p(x | λ) = p(x_0, x_1, ..., x_{N−1} | λ) = Π_{n=0}^{N−1} p(x_n | x_0, ..., x_{n−1}, λ)

WaveNet [24] → p(x_n | x_0, ..., x_{n−1}, λ) is modeled by a convolutional NN.
Key components:
- Causal dilated convolution: captures long-term dependency
- Gated convolution + residual + skip connections: powerful non-linearity
- Softmax at the output: classification rather than regression
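The autoregressive factorization implies a simple generation loop: draw each sample from a predictive distribution conditioned on everything generated so far, then feed it back in. The "model" below is a trivial AR(1)-style Gaussian predictor standing in for the network, not a WaveNet:

```python
# Sketch of autoregressive generation under p(x) = prod_n p(x_n | x_0..x_{n-1}).
import random

def predict_distribution(history):
    # Hypothetical stand-in for the network: mean = 0.9 * last sample.
    mean = 0.9 * history[-1] if history else 0.0
    return mean, 0.01                       # (mean, std) of a Gaussian

def generate(n_samples, seed=0):
    random.seed(seed)
    x = []
    for _ in range(n_samples):
        mean, std = predict_distribution(x)
        x.append(random.gauss(mean, std))   # sample x_n, feed back as input
    return x

wave = generate(100)
```

In WaveNet the predictive distribution is a 256-way categorical output of the convolutional network rather than a Gaussian, but the sample-by-sample loop is identical.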
49 WaveNet — causal dilated convolution
A span of hundreds of milliseconds of waveform corresponds to thousands of time steps at speech sampling rates — too long to be captured by a normal RNN/LSTM. Dilated convolution increases the receptive field size exponentially with the number of layers (figure: a stack with three hidden layers over inputs x_{n−3} ... x_n whose output models p(x_n | x_{n−16}, ..., x_{n−1})).
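The receptive-field arithmetic behind the figure: with kernel size 2 and dilations 1, 2, 4, ..., each layer adds (kernel − 1) · dilation samples of context, so the field grows exponentially with depth, versus linearly for ordinary kernel-2 causal convolutions.

```python
# Receptive field of a stack of causal dilated convolutions with
# dilations 1, 2, 4, ... (doubling per layer).

def receptive_field(n_layers, kernel=2):
    field = 1
    for layer in range(n_layers):
        dilation = 2 ** layer
        field += (kernel - 1) * dilation
    return field
```

Four levels of doubling dilation give a field of 16 samples, matching the p(x_n | x_{n−16}, ..., x_{n−1}) stack in the figure; ten levels already cover 1024 samples.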
50 WaveNet — non-linearity
Residual blocks with skip connections: each block applies a gated 2x1 dilated convolution followed by a 1x1 convolution, adds a residual connection feeding the next block, and emits a skip connection; the summed skip outputs pass through ReLU and 1x1 convolutions into a 256-way softmax over p(x_n | x_0, ..., x_{n−1}).
51 WaveNet — softmax output
The analog audio signal is sampled & quantized into amplitude categories (e.g., 256), and the network predicts a categorical distribution (a histogram) over these categories — unimodal, multimodal, skewed, or any other shape, with no parametric assumption on the signal distribution.
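The WaveNet paper quantizes amplitudes into 256 categories after µ-law companding, which spends the bins more evenly across perceived loudness than uniform quantization. A minimal version of that encode/decode pair:

```python
# Mu-law companding + 256-way quantization, as used before WaveNet's softmax.
import math

def mu_law_encode(x, mu=255):
    # x in [-1, 1] -> integer category in [0, 255]
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((y + 1) / 2 * mu + 0.5)

def mu_law_decode(c, mu=255):
    # integer category -> approximate amplitude in [-1, 1]
    y = 2 * c / mu - 1
    return math.copysign((math.exp(abs(y) * math.log1p(mu)) - 1) / mu, y)
```

The softmax then predicts a distribution over the 256 categories, and generation decodes the sampled category back to an amplitude.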
54 WaveNet — conditional modelling
Linguistic features l are embedded at each time n (h_n) and injected through 1x1 convolutions as conditioning inputs to the gated activation of every residual block, giving p(x_n | x_0, ..., x_{n−1}, l).
55 WaveNet vs conventional audio generative models
Assumptions in conventional audio generative models [23, 26, 27]:
- Stationary process within a fixed-length analysis window → the model is estimated within a short window with a fixed shift
- Linear, time-invariant filter within a frame → but the relationship between samples can be non-linear
- Gaussian process → assumes speech signals are normally distributed
WaveNet:
- Sample-by-sample, non-linear, and can take additional conditioning inputs
- Arbitrary-shaped signal distribution
- State-of-the-art subjective naturalness with WaveNet-based TTS [24], ahead of HMM, LSTM, and concatenative baselines
56 Relax approximation — towards Bayesian end-to-end TTS
Integrated end-to-end:

L̂ = arg max_L p(L | W);  λ̂ = arg max_λ p(X | L̂, λ) p(λ)
→ λ̂ = arg max_λ p(X | W, λ) p(λ)

Text analysis is integrated into the model.
57 Relax approximation — towards Bayesian end-to-end TTS
Bayesian end-to-end:

λ̂ = arg max_λ p(X | W, λ) p(λ);  x ~ f_x(w, λ̂) = p(x | w, λ̂)
→ x ~ f_x(w, X, W) = p(x | w, X, W) = ∫ p(x | w, λ) p(λ | X, W) dλ ≈ (1/K) Σ_{k=1}^{K} p(x | w, λ̂_k)  (ensemble)

Marginalize over model parameters & architecture.
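The ensemble approximation above replaces the integral over λ by an average of K trained models' predictive densities. A toy sketch where the K "models" are just Gaussians with different means, standing in for K independently trained networks:

```python
# Ensemble approximation of the posterior predictive:
# p(x | w, X, W) ~ (1/K) * sum_k p(x | w, lambda_k).
import math

def gaussian_pdf(x, mu, var=1.0):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def ensemble_density(x, member_means):
    # Each member mean stands in for one trained model lambda_k.
    return sum(gaussian_pdf(x, mu) for mu in member_means) / len(member_means)

p = ensemble_density(0.0, [-0.5, 0.0, 0.5])
```

Averaging densities (rather than parameters) is what makes the ensemble a crude Monte Carlo estimate of the marginalization over λ.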
58 Outline
- Generative TTS
- Generative acoustic models for parametric TTS: hidden Markov models (HMMs); neural networks
- Beyond parametric TTS: learned features; WaveNet; end-to-end
- Conclusion & future topics
59 Generative model-based text-to-speech synthesis — conclusion
- Bayes formulation + factorization + approximations
- Representations: acoustic features, linguistic features, mapping
- Mapping: rules → HMM → NN
- Features: engineered → unsupervised, learned
- Fewer approximations: joint training, direct waveform modelling; moving towards integrated & Bayesian end-to-end TTS
- Naturalness: concatenative ≤ generative
- Flexibility: concatenative < generative (e.g., multiple speakers)
60 Beyond "text"-to-speech synthesis
TTS on conversational assistants: texts aren't fully self-contained, so we need more context —
- Location, to resolve homographs
- The user query, to put the right emphasis
We need a representation that can organize the world's information & make it accessible & useful from TTS generative models.
62 Beyond generative TTS
Generative model-based TTS: the model represents the process behind speech production and is trained to minimize the error against human-produced speech, so the learned model ≈ a speaker.
But speech is for communication: the goal is to maximize the amount of information received. The listener is missing — should we include a listener in training, or in the model itself?
64 Thanks!
65 References I
[1] D. Klatt. Real-time speech synthesis by rule. Journal of the Acoustical Society of America, 68(S1), 1980.
[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, 1996.
[3] K. Tokuda. Speech synthesis as a statistical machine learning problem. Invited talk given at ASRU.
[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. IEICE Trans. Inf. Syst. (in Japanese).
[5] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Communication, 2009.
[6] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, 2013.
[7] Y. Fan, Y. Qian, F.-L. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014.
66 References II
[8] H. Zen. Acoustic modeling for speech synthesis: from HMM to RNN. Invited talk given at ASRU.
[9] S. Takaki and J. Yamagishi. A deep auto-encoder based low-dimensional feature extraction from FFT spectral envelopes for statistical parametric speech synthesis. In Proc. ICASSP, 2016.
[10] G. Hinton, J. McClelland, and D. Rumelhart. Distributed representation. In D. Rumelhart, J. McClelland, and the PDP Research Group, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, 1986.
[11] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, 2014.
[12] X. Wang, S. Takaki, and J. Yamagishi. Investigating very deep highway networks for parametric speech synthesis. In Proc. ISCA SSW, 2016.
[13] Y. Saito, S. Takamichi, and H. Saruwatari. Training algorithm to deceive anti-spoofing verification for DNN-based speech synthesis. In Proc. ICASSP, 2017.
[14] H. Zen, Y. Agiomyrgiannakis, N. Egberts, F. Henderson, and P. Szczepaniak. Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices. In Proc. Interspeech, 2016.
67 References III
[15] P. Muthukumar and A. Black. A deep learning approach to data-driven parameterizations for statistical parametric speech synthesis. arXiv preprint.
[16] P. Wang, Y. Qian, F. Soong, L. He, and H. Zhao. Word embedding for recurrent neural network based TTS synthesis. In Proc. ICASSP, 2015.
[17] X. Wang, S. Takaki, and J. Yamagishi. Investigation of using continuous representation of various linguistic units in neural network-based text-to-speech synthesis. IEICE Trans. Inf. Syst., 2016.
[18] T. Toda and K. Tokuda. Statistical approach to vocal tract transfer function estimation based on factor analyzed trajectory HMM. In Proc. ICASSP, 2008.
[19] Y.-J. Wu and K. Tokuda. Minimum generation error training with direct log spectral distortion on LSPs for HMM-based speech synthesis. In Proc. Interspeech, 2008.
[20] R. Maia, H. Zen, and M. Gales. Statistical parametric speech synthesis with joint estimation of acoustic and excitation model parameters. In Proc. ISCA SSW, 2010.
[21] K. Tokuda and H. Zen. Directly modeling speech waveforms by neural networks for statistical parametric speech synthesis. In Proc. ICASSP, 2015.
[22] K. Tokuda and H. Zen. Directly modeling voiced and unvoiced components in speech waveforms by neural networks. In Proc. ICASSP, 2016.
68 References IV
[23] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE.
[24] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv:1609.03499, 2016.
[25] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. arXiv:1612.07837, 2016.
[26] S. Imai and C. Furuichi. Unbiased estimation of log spectrum. In Proc. EURASIP, 1988.
[27] H. Kameoka, Y. Ohishi, D. Mochihashi, and J. Le Roux. Speech analysis with multi-kernel linear prediction. In Proc. Spring Conference of ASJ. (in Japanese)
69 Summary figure: eight formulations of generative TTS
(1) Bayesian; (2) auxiliary variables + factorization; (3) joint maximization; (4) step-by-step maximization (e.g., statistical parametric TTS); (5) joint acoustic feature extraction + model training; (6) conditional WaveNet-based TTS; (7) integrated end-to-end; (8) Bayesian end-to-end.
More informationUnfolded Recurrent Neural Networks for Speech Recognition
INTERSPEECH 2014 Unfolded Recurrent Neural Networks for Speech Recognition George Saon, Hagen Soltau, Ahmad Emami and Michael Picheny IBM T. J. Watson Research Center, Yorktown Heights, NY, 10598 gsaon@us.ibm.com
More informationThe Noisy Channel Model. Statistical NLP Spring Mel Freq. Cepstral Coefficients. Frame Extraction ... Lecture 10: Acoustic Models
Statistical NLP Spring 2009 The Noisy Channel Model Lecture 10: Acoustic Models Dan Klein UC Berkeley Search through space of all possible sentences. Pick the one that is most probable given the waveform.
More informationStatistical NLP Spring The Noisy Channel Model
Statistical NLP Spring 2009 Lecture 10: Acoustic Models Dan Klein UC Berkeley The Noisy Channel Model Search through space of all possible sentences. Pick the one that is most probable given the waveform.
More informationVoiced Speech. Unvoiced Speech
Digital Speech Processing Lecture 2 Homomorphic Speech Processing General Discrete-Time Model of Speech Production p [ n] = p[ n] h [ n] Voiced Speech L h [ n] = A g[ n] v[ n] r[ n] V V V p [ n ] = u [
More informationConditional Image Generation with PixelCNN Decoders
Conditional Image Generation with PixelCNN Decoders Aaron van den Oord 1 Nal Kalchbrenner 1 Oriol Vinyals 1 Lasse Espeholt 1 Alex Graves 1 Koray Kavukcuoglu 1 1 Google DeepMind NIPS, 2016 Presenter: Beilun
More informationRecurrent Neural Networks (Part - 2) Sumit Chopra Facebook
Recurrent Neural Networks (Part - 2) Sumit Chopra Facebook Recap Standard RNNs Training: Backpropagation Through Time (BPTT) Application to sequence modeling Language modeling Applications: Automatic speech
More informationarxiv: v1 [cs.lg] 28 Dec 2017
PixelSNAIL: An Improved Autoregressive Generative Model arxiv:1712.09763v1 [cs.lg] 28 Dec 2017 Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, Pieter Abbeel Embodied Intelligence UC Berkeley, Department of
More informationNeural Networks 2. 2 Receptive fields and dealing with image inputs
CS 446 Machine Learning Fall 2016 Oct 04, 2016 Neural Networks 2 Professor: Dan Roth Scribe: C. Cheng, C. Cervantes Overview Convolutional Neural Networks Recurrent Neural Networks 1 Introduction There
More informationHidden Markov Model and Speech Recognition
1 Dec,2006 Outline Introduction 1 Introduction 2 3 4 5 Introduction What is Speech Recognition? Understanding what is being said Mapping speech data to textual information Speech Recognition is indeed
More informationAcoustic Modeling for Speech Recognition
Acoustic Modeling for Speech Recognition Berlin Chen 2004 References:. X. Huang et. al. Spoken Language Processing. Chapter 8 2. S. Young. The HTK Book (HTK Version 3.2) Introduction For the given acoustic
More informationModeling Time-Frequency Patterns with LSTM vs. Convolutional Architectures for LVCSR Tasks
Modeling Time-Frequency Patterns with LSTM vs Convolutional Architectures for LVCSR Tasks Tara N Sainath, Bo Li Google, Inc New York, NY, USA {tsainath, boboli}@googlecom Abstract Various neural network
More informationBoundary Contraction Training for Acoustic Models based on Discrete Deep Neural Networks
INTERSPEECH 2014 Boundary Contraction Training for Acoustic Models based on Discrete Deep Neural Networks Ryu Takeda, Naoyuki Kanda, and Nobuo Nukaga Central Research Laboratory, Hitachi Ltd., 1-280, Kokubunji-shi,
More informationNeural Architectures for Image, Language, and Speech Processing
Neural Architectures for Image, Language, and Speech Processing Karl Stratos June 26, 2018 1 / 31 Overview Feedforward Networks Need for Specialized Architectures Convolutional Neural Networks (CNNs) Recurrent
More informationLecture 3: ASR: HMMs, Forward, Viterbi
Original slides by Dan Jurafsky CS 224S / LINGUIST 285 Spoken Language Processing Andrew Maas Stanford University Spring 2017 Lecture 3: ASR: HMMs, Forward, Viterbi Fun informative read on phonetics The
More informationFrequency Domain Speech Analysis
Frequency Domain Speech Analysis Short Time Fourier Analysis Cepstral Analysis Windowed (short time) Fourier Transform Spectrogram of speech signals Filter bank implementation* (Real) cepstrum and complex
More informationIndependent Component Analysis and Unsupervised Learning. Jen-Tzung Chien
Independent Component Analysis and Unsupervised Learning Jen-Tzung Chien TABLE OF CONTENTS 1. Independent Component Analysis 2. Case Study I: Speech Recognition Independent voices Nonparametric likelihood
More informationApplied Natural Language Processing
Applied Natural Language Processing Info 256 Lecture 20: Sequence labeling (April 9, 2019) David Bamman, UC Berkeley POS tagging NNP Labeling the tag that s correct for the context. IN JJ FW SYM IN JJ
More informationFrom perceptrons to word embeddings. Simon Šuster University of Groningen
From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written
More informationIndependent Component Analysis and Unsupervised Learning
Independent Component Analysis and Unsupervised Learning Jen-Tzung Chien National Cheng Kung University TABLE OF CONTENTS 1. Independent Component Analysis 2. Case Study I: Speech Recognition Independent
More informationIntroduction to Deep Neural Networks
Introduction to Deep Neural Networks Presenter: Chunyuan Li Pattern Classification and Recognition (ECE 681.01) Duke University April, 2016 Outline 1 Background and Preliminaries Why DNNs? Model: Logistic
More informationIntelligent Systems (AI-2)
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 24, 2016 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,
More informationMultimodal context analysis and prediction
Multimodal context analysis and prediction Valeria Tomaselli (valeria.tomaselli@st.com) Sebastiano Battiato Giovanni Maria Farinella Tiziana Rotondo (PhD student) Outline 2 Context analysis vs prediction
More informationLecture 9: Speech Recognition. Recognizing Speech
EE E68: Speech & Audio Processing & Recognition Lecture 9: Speech Recognition 3 4 Recognizing Speech Feature Calculation Sequence Recognition Hidden Markov Models Dan Ellis http://www.ee.columbia.edu/~dpwe/e68/
More informationLecture 9: Speech Recognition
EE E682: Speech & Audio Processing & Recognition Lecture 9: Speech Recognition 1 2 3 4 Recognizing Speech Feature Calculation Sequence Recognition Hidden Markov Models Dan Ellis
More informationToday s Lecture. Dropout
Today s Lecture so far we discussed RNNs as encoder and decoder we discussed some architecture variants: RNN vs. GRU vs. LSTM attention mechanisms Machine Translation 1: Advanced Neural Machine Translation
More informationGenerating Sequences with Recurrent Neural Networks
Generating Sequences with Recurrent Neural Networks Alex Graves University of Toronto & Google DeepMind Presented by Zhe Gan, Duke University May 15, 2015 1 / 23 Outline Deep recurrent neural network based
More informationWhy DNN Works for Acoustic Modeling in Speech Recognition?
Why DNN Works for Acoustic Modeling in Speech Recognition? Prof. Hui Jiang Department of Computer Science and Engineering York University, Toronto, Ont. M3J 1P3, CANADA Joint work with Y. Bao, J. Pan,
More informationDeep Learning Recurrent Networks 2/28/2018
Deep Learning Recurrent Networks /8/8 Recap: Recurrent networks can be incredibly effective Story so far Y(t+) Stock vector X(t) X(t+) X(t+) X(t+) X(t+) X(t+5) X(t+) X(t+7) Iterated structures are good
More informationMultimodal Machine Learning
Multimodal Machine Learning Louis-Philippe (LP) Morency CMU Multimodal Communication and Machine Learning Laboratory [MultiComp Lab] 1 CMU Course 11-777: Multimodal Machine Learning 2 Lecture Objectives
More informationIntelligent Systems (AI-2)
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 23, 2015 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,
More informationDeep Learning for Speech Processing An NST Perspective
Deep Learning for Speech Processing An NST Perspective Mark Gales University of Cambridge June 2016 September 2016 Natural Speech Technology (NST) EPSRC (UK Government) Programme Grant: collaboration significantly
More informationDeep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści
Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, 2017 Spis treści Website Acknowledgments Notation xiii xv xix 1 Introduction 1 1.1 Who Should Read This Book?
More informationAutomatic Speech Recognition (CS753)
Automatic Speech Recognition (CS753) Lecture 21: Speaker Adaptation Instructor: Preethi Jyothi Oct 23, 2017 Speaker variations Major cause of variability in speech is the differences between speakers Speaking
More informationLab 9a. Linear Predictive Coding for Speech Processing
EE275Lab October 27, 2007 Lab 9a. Linear Predictive Coding for Speech Processing Pitch Period Impulse Train Generator Voiced/Unvoiced Speech Switch Vocal Tract Parameters Time-Varying Digital Filter H(z)
More informationCSC321 Lecture 20: Reversible and Autoregressive Models
CSC321 Lecture 20: Reversible and Autoregressive Models Roger Grosse Roger Grosse CSC321 Lecture 20: Reversible and Autoregressive Models 1 / 23 Overview Four modern approaches to generative modeling:
More informationNatural Language Processing and Recurrent Neural Networks
Natural Language Processing and Recurrent Neural Networks Pranay Tarafdar October 19 th, 2018 Outline Introduction to NLP Word2vec RNN GRU LSTM Demo What is NLP? Natural Language? : Huge amount of information
More informationMachine Learning Lecture 12
Machine Learning Lecture 12 Neural Networks 30.11.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory Probability
More informationModelling Time Series with Neural Networks. Volker Tresp Summer 2017
Modelling Time Series with Neural Networks Volker Tresp Summer 2017 1 Modelling of Time Series The next figure shows a time series (DAX) Other interesting time-series: energy prize, energy consumption,
More informationEE-559 Deep learning Recurrent Neural Networks
EE-559 Deep learning 11.1. Recurrent Neural Networks François Fleuret https://fleuret.org/ee559/ Sun Feb 24 20:33:31 UTC 2019 Inference from sequences François Fleuret EE-559 Deep learning / 11.1. Recurrent
More informationNon-Negative Matrix Factorization And Its Application to Audio. Tuomas Virtanen Tampere University of Technology
Non-Negative Matrix Factorization And Its Application to Audio Tuomas Virtanen Tampere University of Technology tuomas.virtanen@tut.fi 2 Contents Introduction to audio signals Spectrogram representation
More informationMachine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6
Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)
More informationSPEECH RECOGNITION USING TIME DOMAIN FEATURES FROM PHASE SPACE RECONSTRUCTIONS
SPEECH RECOGNITION USING TIME DOMAIN FEATURES FROM PHASE SPACE RECONSTRUCTIONS by Jinjin Ye, B.S. A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL IN PARTIAL FULFILLMENT OF THE REQUIREMENTS for
More informationLecture 11 Recurrent Neural Networks I
Lecture 11 Recurrent Neural Networks I CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago May 01, 2017 Introduction Sequence Learning with Neural Networks Some Sequence Tasks
More informationEnd-to-end Automatic Speech Recognition
End-to-end Automatic Speech Recognition Markus Nussbaum-Thom IBM Thomas J. Watson Research Center Yorktown Heights, NY 10598, USA Markus Nussbaum-Thom. February 22, 2017 Nussbaum-Thom: IBM Thomas J. Watson
More informationSpeech Signal Representations
Speech Signal Representations Berlin Chen 2003 References: 1. X. Huang et. al., Spoken Language Processing, Chapters 5, 6 2. J. R. Deller et. al., Discrete-Time Processing of Speech Signals, Chapters 4-6
More informationDeep Learning Sequence to Sequence models: Attention Models. 17 March 2018
Deep Learning Sequence to Sequence models: Attention Models 17 March 2018 1 Sequence-to-sequence modelling Problem: E.g. A sequence X 1 X N goes in A different sequence Y 1 Y M comes out Speech recognition:
More informationHidden Markov Models Hamid R. Rabiee
Hidden Markov Models Hamid R. Rabiee 1 Hidden Markov Models (HMMs) In the previous slides, we have seen that in many cases the underlying behavior of nature could be modeled as a Markov process. However
More informationL7: Linear prediction of speech
L7: Linear prediction of speech Introduction Linear prediction Finding the linear prediction coefficients Alternative representations This lecture is based on [Dutoit and Marques, 2009, ch1; Taylor, 2009,
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project
More informationLecture 11 Recurrent Neural Networks I
Lecture 11 Recurrent Neural Networks I CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor niversity of Chicago May 01, 2017 Introduction Sequence Learning with Neural Networks Some Sequence Tasks
More informationGraphical Models for Automatic Speech Recognition
Graphical Models for Automatic Speech Recognition Advanced Signal Processing SE 2, SS05 Stefan Petrik Signal Processing and Speech Communication Laboratory Graz University of Technology GMs for Automatic
More informationGaussian Processes for Audio Feature Extraction
Gaussian Processes for Audio Feature Extraction Dr. Richard E. Turner (ret26@cam.ac.uk) Computational and Biological Learning Lab Department of Engineering University of Cambridge Machine hearing pipeline
More informationHidden Markov Models and Gaussian Mixture Models
Hidden Markov Models and Gaussian Mixture Models Hiroshi Shimodaira and Steve Renals Automatic Speech Recognition ASR Lectures 4&5 23&27 January 2014 ASR Lectures 4&5 Hidden Markov Models and Gaussian
More informationBLSTM-HMM HYBRID SYSTEM COMBINED WITH SOUND ACTIVITY DETECTION NETWORK FOR POLYPHONIC SOUND EVENT DETECTION
BLSTM-HMM HYBRID SYSTEM COMBINED WITH SOUND ACTIVITY DETECTION NETWORK FOR POLYPHONIC SOUND EVENT DETECTION Tomoki Hayashi 1, Shinji Watanabe 2, Tomoki Toda 1, Takaaki Hori 2, Jonathan Le Roux 2, Kazuya
More informationNEURAL LANGUAGE MODELS
COMP90042 LECTURE 14 NEURAL LANGUAGE MODELS LANGUAGE MODELS Assign a probability to a sequence of words Framed as sliding a window over the sentence, predicting each word from finite context to left E.g.,
More informationGenerative Models for Sentences
Generative Models for Sentences Amjad Almahairi PhD student August 16 th 2014 Outline 1. Motivation Language modelling Full Sentence Embeddings 2. Approach Bayesian Networks Variational Autoencoders (VAE)
More informationLinear Prediction 1 / 41
Linear Prediction 1 / 41 A map of speech signal processing Natural signals Models Artificial signals Inference Speech synthesis Hidden Markov Inference Homomorphic processing Dereverberation, Deconvolution
More informationBIDIRECTIONAL LSTM-HMM HYBRID SYSTEM FOR POLYPHONIC SOUND EVENT DETECTION
BIDIRECTIONAL LSTM-HMM HYBRID SYSTEM FOR POLYPHONIC SOUND EVENT DETECTION Tomoki Hayashi 1, Shinji Watanabe 2, Tomoki Toda 1, Takaaki Hori 2, Jonathan Le Roux 2, Kazuya Takeda 1 1 Nagoya University, Furo-cho,
More informationDeep Learning for Automatic Speech Recognition Part II
Deep Learning for Automatic Speech Recognition Part II Xiaodong Cui IBM T. J. Watson Research Center Yorktown Heights, NY 10598 Fall, 2018 Outline A brief revisit of sampling, pitch/formant and MFCC DNN-HMM
More informationProbabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov
Probabilistic Graphical Models: MRFs and CRFs CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Why PGMs? PGMs can model joint probabilities of many events. many techniques commonly
More informationDynamic Approaches: The Hidden Markov Model
Dynamic Approaches: The Hidden Markov Model Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Machine Learning: Neural Networks and Advanced Models (AA2) Inference as Message
More information