Autoregressive Neural Models for Statistical Parametric Speech Synthesis


Autoregressive Neural Models for Statistical Parametric Speech Synthesis
Xin WANG
2018-01-11
Contact: wangxin@nii.ac.jp
We welcome critical comments, suggestions, and discussion.

https://www.slideshare.net/kotarotanahashi/deep-learning-library-coyotecnn
https://www.techemergence.com/deepminds-nando-de-freitas-why-deep-learning-is-like-building-with-legos/

CONTENTS
- Introduction
- Models and methods
- Summary

INTRODUCTION

Text-to-speech (TTS) pipeline [1,2]:
Text -> Front-end -> Linguistic features -> Back-end -> Speech

Statistical parametric speech synthesizer (SPSS) [3,4]: acoustic models map linguistic features to acoustic features (spectral features and fundamental frequency, F0), which a vocoder turns into speech.

[1] Taylor, P. (2009). Text-to-Speech Synthesis.
[2] Dutoit, T. (1997). An Introduction to Text-to-Speech Synthesis.
[3] Tokuda, K., et al. (2013). Speech synthesis based on hidden Markov models. Proceedings of the IEEE, 101(5), 1234-1252.
[4] Zen, H., et al. (2009). Statistical parametric speech synthesis. Speech Communication, 51, 1039-1064.

INTRODUCTION

Topic of this talk: the acoustic model that maps linguistic features x_{1:T} = {x_1, ..., x_T} to acoustic features over T frames, i.e., p(o_{1:T} | x_{1:T}; Θ), with x_t ∈ R^{D_x} and o_t ∈ R^{D_o}; the generated features are denoted ô_{1:T} = {ô_1, ..., ô_T}.

INTRODUCTION

Roadmap: along the axis of time-dependency modeling, from the naïve feedforward network model (FNN), to the recurrent network (RNN), to autoregressive (AR) neural models, towards a perfect model.

CONTENTS
- Introduction
- Models and methods
  - Baseline models
- Summary

BASELINE MODELS: FNN

Computation flow: each frame is processed independently,
ô_t = H^(FNN)(x_t), with x_t ∈ R^{D_x}, o_t ∈ R^{D_o}.

BASELINE MODELS: FNN

As a probabilistic model, the FNN becomes a mixture density network (MDN) [6]: the output layer turns H^(FNN)(x_t) into distribution parameters M_t,

p(o_{1:T} | x_{1:T}; Θ) = ∏_{t=1}^{T} p(o_t; M_t) = ∏_{t=1}^{T} N(o_t; μ_t, I),

where M_t = {μ_t}, μ_t = H^(FNN)(x_t), and mean-based generation uses ô_t = μ_t.

[6] C. M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag, New York, 2006.
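The factorized Gaussian likelihood above is easy to write down as a training loss; a minimal numpy sketch, where the function name and array shapes are assumptions for illustration and `mu` stands in for the per-frame network outputs:

```python
import numpy as np

def fnn_mdn_nll(o, mu):
    """Negative log-likelihood of the factorized Gaussian MDN:
    p(o_{1:T} | x_{1:T}) = prod_t N(o_t; mu_t, I),
    with mu_t = H_FNN(x_t) the network output for frame t.
    o, mu: arrays of shape (T, D)."""
    T, D = o.shape
    # log N(o_t; mu_t, I) = -0.5 * (D * log(2*pi) + ||o_t - mu_t||^2)
    sq = np.sum((o - mu) ** 2, axis=1)
    ll = -0.5 * (D * np.log(2 * np.pi) + sq)
    return -np.sum(ll)
```

Mean-based generation ô_t = μ_t is the minimizer of this loss for each frame, which is why such models output the conditional mean.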

BASELINE MODELS: RNN

As a recurrent MDN (RMDN) [7]:

p(o_{1:T} | x_{1:T}; Θ) = ∏_{t=1}^{T} p(o_t; M_t) = ∏_{t=1}^{T} N(o_t; μ_t, I),

where M_t = {μ_t}, μ_t = H^(RNN)(x_{1:T}, t), and ô_t = μ_t. Unlike the FNN, the hidden state summarizes the input across time, but the output distribution still factorizes over frames.

[7] M. Schuster. Better generative models for sequential data problems: Bidirectional recurrent mixture density networks. In Proc. NIPS, pages 589-595, 1999.

BASELINE MODELS: Model assumption

(Conditional) independence of o_{1:T}:
p(o_{1:T} | x_{1:T}; Θ) = p(o_{1:T}; M_{1:T}) = ∏_{t=1}^{T} p(o_t; M_t)

Verifying the model assumption [8]: compare mean-based generation ô_t = μ_t with sampling ô_t ~ p(o_t; M_t).

[8] M. Shannon. Probabilistic acoustic modelling for parametric speech synthesis. PhD thesis, 2014.
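The two generation modes can be sketched in a few lines; the function name and the scalar-variance choice are assumptions for illustration. Under the independence assumption, each sampled frame draws its own noise, so a sampled trajectory jitters frame-to-frame even when the mean trajectory is smooth:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate(mu, sigma=1.0, sample=False, rng=rng):
    """Mean-based vs. sampling-based generation from the factorized model.
    mu: (T, D) per-frame means. With sample=True each frame is drawn
    independently from N(mu_t, sigma^2 I)."""
    if sample:
        return mu + sigma * rng.standard_normal(mu.shape)
    return mu.copy()
```

Comparing the two outputs on a smooth mean trajectory is exactly the diagnostic used on the next slide: the mean output stays smooth, the sample does not.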

BASELINE MODELS: Model assumption

Example: F0 modeling. [Figure: p(o_{1:T}; M_{1:T}) modeled by the RMDN, showing the RMDN mean with a ±1 std range, and a sampled F0 trajectory against the natural F0 (utterance BC2011 nancy APDC2-166-00).]

CONTENTS
- Introduction
- Models and methods
  - Baseline models
  - AR models
- Summary

AR MODELS: Improving the baseline models

Directed graphical models -- AR models [8]:
p(o_{1:T}) = ∏_{t=1}^{T} p(o_t | o_{1:t-1})

Undirected graphical models -- trajectory model [9] (static features o plus dynamic features Δo, Δ²o, with MLPG [10]):
p(o_{1:T}) = ∏_k f_k(o_{1:T}) / Z

[8] M. Shannon. Probabilistic acoustic modelling for parametric speech synthesis. PhD thesis, 2014.
[9] H. Zen, K. Tokuda, and T. Kitamura. Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic feature vector sequences. Computer Speech & Language, 21(1):153-173, 2007.
[10] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 936-939, 2000.

AR MODELS: Improving the baseline models

Directed graphical models -- AR models [8]:
p(o_{1:T}) = ∏_{t=1}^{T} p(o_t | o_{1:t-1})

Three instances:
1. Shallow AR model (SAR)
2. AR flow
3. Deep AR model (DAR)

SHALLOW AR MODEL: Definition

p(o_{1:T}; M_{1:T}) = ∏_{t=1}^{T} p(o_t | o_{t-K:t-1}; M_t)

SHALLOW AR MODEL: Implementation

p(o_{1:T}; M_{1:T}) = ∏_{t=1}^{T} p(o_t | o_{t-K:t-1}; M_t) = ∏_{t=1}^{T} N(o_t; μ_t + F(o_{t-K:t-1}), Σ_t),

with a linear, time-invariant filter
F(o_{t-K:t-1}) = Σ_{k=1}^{K} a_k o_{t-k} + b,

where K is a hyper-parameter (K = 2 in the figure) and {a_k, b} are trainable parameters. This is the shallow AR (SAR) model.
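Mean-based generation from this model is a simple sequential recursion; a sketch for a 1-D feature stream, where the function name and scalar coefficients are assumptions for illustration:

```python
import numpy as np

def sar_generate(mu, a, b=0.0):
    """Mean-based generation from a shallow AR (SAR) model of order K:
    o_t = mu_t + sum_{k=1}^{K} a_k * o_{t-k} + b.
    mu: (T,) network-predicted means; a: (K,) time-invariant, trainable
    AR coefficients (scalars here, one per lag)."""
    K = len(a)
    o = np.zeros(len(mu))
    for t in range(len(mu)):
        # sum over available past outputs (fewer than K at the start)
        past = sum(a[k] * o[t - 1 - k] for k in range(K) if t - 1 - k >= 0)
        o[t] = mu[t] + past + b
    return o
```

With a_1 = 0 the model degenerates to the RMDN's per-frame mean; with a_1 close to 1 the previous output strongly shapes the current one, which is what smooths the generated trajectory.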

SHALLOW AR MODEL: Theory

SAR versus an RMDN with a recurrent output layer [11]. Assume a scalar output o_t ∈ R, unit variance σ_t = 1, and a linear activation function in the output layer.

[11] H. Zen and H. Sak. Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis. In Proc. ICASSP, pages 4470-4474, 2015.

SHALLOW AR MODEL: Theory

RMDN (recurrent output layer, feedback weight w_μ on the previous mean):
p(o_{1:2}) = N(o_1; μ_1, 1) N(o_2; μ_2 + w_μ μ_1, 1),
with μ_1 = w^T h_1 + b, μ_2 = w^T h_2 + b.

SAR (feedback weight a on the previous output):
p(o_{1:2}) = N(o_1; μ_1, 1) N(o_2; μ_2 + a o_1, 1),
with μ_1 = w^T h_1 + b, μ_2 = w^T h_2 + b.

SHALLOW AR MODEL: Theory

RMDN:
p(o_{1:2}) = N(o_1; μ_1, 1) N(o_2; μ_2 + w_μ μ_1, 1)
           = (1 / 2π) exp(-(1/2) (o - μ)^T Σ^{-1} (o - μ)),
with o = [o_1, o_2]^T, μ = [μ_1, μ_2 + w_μ μ_1]^T, Σ = [[1, 0], [0, 1]].

SAR:
p(o_{1:2}) = N(o_1; μ_1, 1) N(o_2; μ_2 + a o_1, 1)
           = (1 / 2π) exp(-(1/2) (o - μ)^T Σ^{-1} (o - μ)),
with o = [o_1, o_2]^T, μ = [μ_1, μ_2 + a μ_1]^T, Σ = [[1, a], [a, 1 + a²]].

Dependency between the μ_t or between the o_t?
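The SAR covariance on this slide follows in two lines from the generative recursion; a short derivation under the slide's unit-variance assumption:

```latex
\begin{align}
o_1 &= \mu_1 + \epsilon_1, \qquad
o_2 = \mu_2 + a\,o_1 + \epsilon_2, \qquad
\epsilon_1, \epsilon_2 \sim \mathcal{N}(0, 1) \text{ i.i.d.} \\
\operatorname{Var}(o_1) &= 1, \qquad
\operatorname{Cov}(o_1, o_2) = a \operatorname{Var}(o_1) = a, \qquad
\operatorname{Var}(o_2) = a^2 \operatorname{Var}(o_1) + 1 = 1 + a^2 \\
\Sigma_{\mathrm{SAR}} &=
\begin{bmatrix} 1 & a \\ a & 1 + a^2 \end{bmatrix},
\qquad \det \Sigma_{\mathrm{SAR}} = (1)(1 + a^2) - a^2 = 1 .
\end{align}
```

In the RMDN, by contrast, the feedback w_μ μ_1 is a deterministic shift of the mean, so no covariance between o_1 and o_2 is introduced and Σ stays the identity. The unit determinant of Σ_SAR also previews the AR flow's unit Jacobian determinant.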

SHALLOW AR MODEL: Experiments

Data:
- Corpus: Blizzard Challenge 2011
- x_t: similar to HTS English [12]
- o_t: MGC, interpolated F0, U/V, BAP

Networks:
- RNN: feedforward -> feedforward -> Bi-LSTM -> Bi-LSTM
- RMDN: same architecture with an MDN output
- SAR: K = 1, 2, 0 for MGC, F0, and BAP
- RNN+MLPG: RNN with o, Δo, Δ²o and MLPG

http://tonywangx.github.io/#sar

[12] K. Tokuda, H. Zen, and A. W. Black. An HMM-based speech synthesis system applied to English. In Proc. SSW, pages 227-230, Sept 2002.

SHALLOW AR MODEL: Experiments -- Results

[Figure: generated trajectories of the 1st and 15th dimensions of MGC for NAT, RNN, RNN+MLPG, RMDN, and SAR (labeled AR-RMDN), frames 100-1000 of utterance BC2011_nancy_APDC2-166-00.]

SHALLOW AR MODEL: Experiments -- Results

[Figure: the 30th dimension of MGC, and F0 after U/V classification, for NAT, RNN, RNN+MLPG, RMDN, and SAR (utterance BC2011_nancy_APDC2-166-00).]

SHALLOW AR MODEL: Experiments -- Results

[Figure: global variance (GV) of generated MGC over MGC orders 1-60, for NAT, RNN, RNN+MLPG, RMDN, and SAR (AR-RMDN).]

SHALLOW AR MODEL: Experiments -- Samples

Audio samples: RNN, RNN+MLPG, RMDN, SAR, and natural speech, without and with formant enhancement.

SHALLOW AR MODEL: Interpretation -- Feature transformation

SAR can be viewed as modeling a transformed feature c = Ao:
c = [c_1, c_2]^T = [o_1, o_2 - a o_1]^T = [[1, 0], [-a, 1]] [o_1, o_2]^T = A o,
so that p(o) = p(c) = N(c; μ_c, Σ_c) with μ_c = [μ_1, μ_2]^T and Σ_c = [[1, 0], [0, 1]] (an RMDN-style independent model in the c domain), while in the original domain p(o) = N(o; μ_o, Σ_o) with μ_o = [μ_1, μ_2 + a μ_1]^T and Σ_o = [[1, a], [a, 1 + a²]].

SHALLOW AR MODEL: Interpretation -- Feature transformation

For o_{1:T} ∈ R^{D×T}, each dimension has its own transform: o_{1:T,d} is mapped by A^(d) to c_{1:T,d}, for d = 1, ..., D.

SAR is equivalent to:
- Training: transform o_{1:T} to c_{1:T} via A^(1), ..., A^(D), and model c_{1:T} with ∏_{t=1}^{T} p(c_t; M_t) given x_{1:T}.
- Generation: draw ĉ_{1:T} from ∏_{t=1}^{T} p(c_t; M_t) and de-transform via (A^(1))^{-1}, ..., (A^(D))^{-1} to get ô_{1:T}.
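The per-dimension transform matrix is easy to construct and check; a sketch for one dimension with K = 1 (the function name is an assumption for illustration):

```python
import numpy as np

def sar_matrix(T, a):
    """Lower-triangular transform A with c_{1:T} = A o_{1:T} for a
    1-D SAR stream of order K = 1: c_t = o_t - a * o_{t-1}.
    A has unit diagonal, so det(A) = 1 and A is always invertible."""
    A = np.eye(T)
    for t in range(1, T):
        A[t, t - 1] = -a
    return A
```

Applying A^{-1} to a sequence ĉ reproduces exactly the sequential de-transform ô_t = ĉ_t + a ô_{t-1}, which is why training in the c domain plus inverse filtering at generation time is the same model as SAR.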

SHALLOW AR MODEL: Interpretation -- Signals and filters

Equivalently, each A^(d) is a filter A_d(z). SAR is then:
- Training: filter o_{1:T} with A_1(z), ..., A_D(z) to get c_{1:T}, modeled by ∏_{t=1}^{T} p(c_t; M_t) given x_{1:T}.
- Generation: draw ĉ_{1:T} from ∏_{t=1}^{T} p(c_t; M_t) and filter with 1/A_1(z), ..., 1/A_D(z) to get ô_{1:T}.

SHALLOW AR MODEL: Interpretation -- The secret of SAR

[Figure: magnitude responses of A_1(z) and 1/A_1(z) over frequency bins.]

Is the benefit only due to the synthesis filters 1/A_d(z)? No: it is due to the analysis/synthesis pairs {A_d(z), 1/A_d(z)}, which leave less mismatch between c_{1:T} and the RMDN.

http://tonywangx.github.io/#sar

CONTENTS
- Introduction
- Models and methods
  - Baseline models
  - Shallow AR model
  - AR flow
- Summary

AR FLOW: Is SAR good enough?

Random sampling on SAR: [Figure: randomly sampled F0 from SAR versus an RMDN sample and natural F0 (utterance BC2011 nancy APDC2-166-00).]

Reason for the remaining gap: the filter coefficients a_k are linear and time-invariant.

AR FLOW: Theory -- Change of random variable

Training: o_{1:T} -> c = Ao -> model c_{1:T} with ∏_{t=1}^{T} p(c_t; M_t) given x_{1:T}.
Generation: draw ĉ_{1:T} from ∏_{t=1}^{T} p(c_t; M_t), then ô = A^{-1} ĉ.

AR FLOW: Theory -- Change of random variable

Generalize the linear transform to c_{1:T} = f(o_{1:T}), with generation ô_{1:T} = f^{-1}(ĉ_{1:T}). Then

p_o(o_{1:T} | x_{1:T}) = p_c(c_{1:T} | x_{1:T}) |det(∂c_{1:T} / ∂o_{1:T})|,

which requires f(.) to be invertible and its Jacobian determinant to be simple [13,14].

[13] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In Proc. ICML, pages 1530-1538, 2015.
[14] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In Proc. NIPS, pages 4743-4751, 2016.

AR FLOW: Basic idea

p_o(o_{1:T} | x_{1:T}) = p_c(c_{1:T} | x_{1:T}) |det(∂c_{1:T} / ∂o_{1:T})|

Transform:
- SAR: c_t = o_t - Σ_{k=1}^{K} a_k o_{t-k}
- AR flow: c_t = o_t - Σ_{k=1}^{K} f^(k)(o_{1:t-k}) o_{t-k}, with the coefficients predicted by a network, e.g., μ_t = RNN(o_{1:t-1})

De-transform:
- SAR: ô_t = c_t + Σ_{k=1}^{K} a_k ô_{t-k}
- AR flow: ô_t = c_t + Σ_{k=1}^{K} f^(k)(ô_{1:t-k}) ô_{t-k}

Very simple Jacobian: det(∂c_{1:T} / ∂o_{1:T}) = 1.
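The unit-determinant transform and its sequential inverse can be sketched generically; here `mu_fn` stands in for the recurrent predictor μ_t = RNN(o_{1:t-1}) and can be any causal function of the past (the function names are assumptions for illustration):

```python
import numpy as np

def flow_forward(o, mu_fn):
    """Transform c_t = o_t - mu_t, mu_t = mu_fn(o_{1:t-1}).
    Since dc_t/do_t = 1 and c_t depends only on o_{1:t}, the Jacobian
    is lower triangular with unit diagonal, so |det| = 1."""
    c = np.empty_like(o)
    for t in range(len(o)):
        c[t] = o[t] - mu_fn(o[:t])
    return c

def flow_inverse(c, mu_fn):
    """De-transform o_t = c_t + mu_fn(o_{1:t-1}); necessarily sequential,
    because each o_t needs the already-reconstructed past."""
    o = np.empty_like(c)
    for t in range(len(c)):
        o[t] = c[t] + mu_fn(o[:t])
    return o
```

Note the asymmetry: the forward pass could be parallelized over t at training time (all of o is known), while generation must run frame by frame.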

AR FLOW: Implementation -- Training

The recurrent predictor gives μ_t = RNN(o_{1:t-1}), with μ_1 = 0; the transform is c_t = o_t - μ_t, and each c_t is modeled by p(c_t; M_t) conditioned on x_{1:T}.

AR FLOW: Implementation -- Generation

Draw ĉ_t from p(c_t; M_t) and de-transform sequentially: ô_t = ĉ_t + μ_t, where μ_t = RNN(ô_{1:t-1}).

AR FLOW: Experiments on MGC

Data:
- Japanese corpus
- Neither MLPG nor formant enhancement

Networks:
- RNN
- SAR
- AR flow: RNN + AR flow (uni-directional LSTM + feedforward)

AR FLOW: Experiments on MGC -- Results

[Figure: generated trajectories of the 1st and 30th dimensions of MGC.]

AR FLOW: Experiments on MGC -- Results

[Figure: GV of generated MGC, and modulation spectrum of the 30th MGC dimension, for natural speech, RNN, AR flow, and SAR.]

AR FLOW: Experiments on MGC -- Samples

Audio samples: RNN, SAR, AR flow, and natural speech, using natural duration and F0 from the RNN. Experiments on F0 are still to be done.

CONTENTS
- Introduction
- Models and methods
  - Baseline models
  - Shallow AR model
  - AR flow
  - Deep AR model
- Summary

DEEP AR MODEL: Random sampling from the AR flow?

It does not work: the transform c_{1:T} = f(o_{1:T}) must keep the Jacobian ∂c_{1:T}/∂o_{1:T} simple, which limits the dependency the flow can capture.

One way to go: a network with AR dependency built in, modeling o_{1:T} directly given x_{1:T}.

DEEP AR MODEL: Definition

Each M_t is produced by a network that receives x_t and feeds the previous outputs o_{1:t-1} back through its recurrent state:

p(o_{1:T}; M_{1:T}) = ∏_{t=1}^{T} p(o_t | o_{1:t-1}; M_t)

http://tonywangx.github.io/#dar
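A minimal sampling loop for such a model, assuming a plain tanh-RNN cell with Gaussian output; all weight names and the architecture are illustrative assumptions, not the talk's actual network:

```python
import numpy as np

rng = np.random.default_rng(1)

def dar_sample(x, h0, Wx, Wo, Wh, Wmu, sigma=0.1, rng=rng):
    """Sampling from a deep AR (DAR) model sketch: the previous output
    o_{t-1} is fed back into the hidden state, so M_t depends on o_{1:t-1}
    through the network rather than through a fixed linear filter.
    x: (T, Dx) linguistic features; returns (T, Do) sampled outputs."""
    h = h0
    o_prev = np.zeros(Wo.shape[1])
    out = []
    for t in range(len(x)):
        # hidden state sees input, previous OUTPUT, and previous state
        h = np.tanh(Wx @ x[t] + Wo @ o_prev + Wh @ h)
        mu = Wmu @ h
        # o_t ~ N(mu_t, sigma^2 I); sigma=0 gives mean-based generation
        o_prev = mu + sigma * rng.standard_normal(mu.shape)
        out.append(o_prev)
    return np.stack(out)
```

The feedback term Wo @ o_prev is the difference from a plain RNN/RMDN: with it, a sampled o_t perturbs every later M_t, which is what lets random sampling produce coherent trajectories.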

DEEP AR MODEL: Experiments

On MGC: no better results yet.

On F0: audio samples for RNN, RMDN, SAR, DAR, and natural speech. The network models only F0, given natural MGC (and duration).

DEEP AR MODEL: Experiments on F0

MOS scores (scale shown 3.00-4.25): NAT highest, then DAR; SAR, RMDN, and RNN close together. Pairwise p-values:

        NAT      DAR      SAR      RMDN     RNN
NAT     -        <1e-30   <1e-30   <1e-30   <1e-30
DAR     <1e-30   -        1.6e-28  6.3e-19  2.4e-30
SAR     <1e-30   1.6e-28  -        0.015    0.949
RMDN    <1e-30   6.3e-19  0.015    -        0.014
RNN     <1e-30   2.4e-30  0.949    0.014    -

[Figure: utterance-level GV of F0 (Hz, 20-120) for NAT, RNN, RMDN, SAR, and DAR.]

DEEP AR MODEL: Random sampling on F0 -- Japanese data

[Figure: sampled F0 trajectories versus natural F0 (utterance ATR Ximera F009 NIKKEIR 03362 T01): RMDN and SAR samples in the top panel, a DAR sample in the bottom panel.]

DEEP AR MODEL: Random sampling on F0 -- Japanese data

[Figure: three different DAR samples against natural F0 for the same utterance (ATR Ximera F009 NIKKEIR 03362 T01).]

DEEP AR MODEL: Random sampling on F0 -- English data

[Figure: three sampled F0 trajectories against natural F0, frames 100-1000.]

http://tonywangx.github.io/#dar

CONTENTS
- Introduction
- Models and methods
  - Baseline models
  - Shallow AR model
  - AR flow
  - Deep AR model
- Summary

SUMMARY

Random sampling serves as a diagnostic tool.

- FNN, RNN: no AR dependency
- SAR: linear & invertible transform, c_{1:T} = A o_{1:T}
- AR flow: non-linear & invertible transform, c_{1:T} = f(o_{1:T})
- DAR: non-linear & non-invertible, with AR dependency in the network

SUMMARY: Message

The RMDN with a recurrent output layer places the dependency between the means μ_t (via w_μ); SAR places it between the outputs o_t (via a). The question to ask of a model: should the dependency be between the μ_t or between the o_t?

Thank you for your attention. Q & A.

Toolkit, scripts, slides: tonywangx.github.io