arXiv: v1 [stat.ML] 24 May 2016

Sequential Neural Models with Stochastic Layers

Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet*, Ole Winther
Technical University of Denmark; University of Copenhagen, Denmark

Abstract

How can we efficiently propagate uncertainty in a latent state representation with recurrent neural networks? This paper introduces stochastic recurrent neural networks which glue a deterministic recurrent neural network and a state space model together to form a stochastic and sequential neural generative model. The clear separation of deterministic and stochastic layers allows a structured variational inference network to track the factorization of the model's posterior distribution. By retaining both the nonlinear recursive structure of a recurrent neural network and averaging over the uncertainty in a latent path, like a state space model, we improve the state of the art results on the Blizzard and TIMIT speech modeling data sets by a large margin, while achieving comparable performances to competing methods on polyphonic music modeling.

1 Introduction

Recurrent neural networks (RNNs) are able to represent long-term dependencies in sequential data, by adapting and propagating a deterministic hidden (or latent) state [6, 17]. There is recent evidence that when complex sequences such as speech and music are modeled, the performances of RNNs can be dramatically improved when uncertainty is included in their hidden states [3, 4, 8, 12, 13, 16]. In this paper we add a new direction to the explorer's map of treating the hidden RNN states as uncertain paths, by including the world of state space models (SSMs) as an RNN layer. By cleanly delineating an SSM layer, certain independence properties of variables arise, which are beneficial for making efficient posterior inferences. The result is a generative model for sequential data, with a matching inference network that has its roots in variational auto-encoders (VAEs).

SSMs can be viewed as a probabilistic extension of RNNs, where the hidden states are assumed to be random variables.
Although SSMs have an illustrious history [27], their stochasticity has limited their widespread use in the deep learning community, as inference can only be exact for two relatively simple classes of SSMs, namely hidden Markov models and linear Gaussian models, neither of which is well-suited to modeling long-term dependencies and complex probability distributions over high-dimensional sequences. On the other hand, modern RNNs rely on gated nonlinearities such as long short-term memory (LSTM) [17] cells or gated recurrent units (GRUs) [7], that let the deterministic hidden state of the RNN act as an internal memory for the model. This internal memory seems fundamental to capturing complex relationships in the data through a statistical model.

This paper introduces the stochastic recurrent neural network (SRNN) in Section 3. SRNNs combine the gated activation mechanism of RNNs with the stochastic states of SSMs, and are formed by stacking an RNN and a nonlinear SSM. The state transitions of the SSM are nonlinear and are parameterized by a neural network that also depends on the corresponding RNN hidden state. The SSM can therefore utilize long-term information captured by the RNN.

We use recent advances in variational inference to efficiently approximate the intractable posterior distribution over the latent states with an inference network [21, 26]. The form of our variational approximation is inspired by the independence properties of the true posterior distribution over the latent states of the model, and allows us to improve inference by conveniently using the information coming from the whole sequence at each time step. The posterior distribution over the latent states of the SRNN is highly non-stationary while we are learning the parameters of the model. To further improve the variational approximation, we show that we can construct the inference network so that it only needs to learn how to compute the mean of the variational approximation at each time step given the mean of the predictive prior distribution.

In Section 4 we test the performances of SRNN on speech and polyphonic music modeling tasks. SRNN improves the state of the art results on the Blizzard and TIMIT speech data sets by a large margin, and performs comparably to competing models on polyphonic music modeling. Finally, other models that extend RNNs by adding stochastic units will be reviewed and compared to SRNN in Section 5.

* Now at Google DeepMind.

Figure 1: Graphical models to generate $x_{1:T}$ with a recurrent neural network (RNN) and a state space model (SSM). Panels: (a) RNN, (b) SSM. Diamond-shaped units are used for deterministic states, while circles are used for stochastic ones. For sequence generation, like in a language model, one can use $u_t = x_{t-1}$.

2 Recurrent Neural Networks and State Space Models

Recurrent neural networks and state space models are widely used to model temporal sequences of vectors $x_{1:T} = (x_1, x_2, \ldots, x_T)$ that possibly depend on inputs $u_{1:T} = (u_1, u_2, \ldots, u_T)$. Both models rest on the assumption that the sequence $x_{1:t}$ of observations up to time $t$ can be summarized by a latent state $d_t$ or $z_t$, which is deterministically determined ($d_t$ in an RNN) or treated as a random variable which is averaged away ($z_t$ in an SSM). The difference in treatment of the latent state has traditionally led to vastly different models: RNNs recursively compute $d_t = f(d_{t-1}, u_t)$ using a parameterized nonlinear function $f$, like an LSTM cell or a GRU.
The RNN observation probabilities $p(x_t \mid d_t)$ are equally modeled with nonlinear functions. SSMs, like linear Gaussian or hidden Markov models, explicitly model uncertainty in the latent process through $z_{1:T}$. Parameter inference in an SSM requires $z_{1:T}$ to be averaged out, and hence $p(z_t \mid z_{t-1}, u_t)$ and $p(x_t \mid z_t)$ are often restricted to the exponential family of distributions to make many existing approximate inference algorithms applicable. On the other hand, averaging a function over the deterministic path $d_{1:T}$ in an RNN is a trivial operation. The striking similarity in factorization between these models is illustrated in Figures 1a and 1b.

Can we combine the best of both worlds, and make the stochastic state transitions of SSMs nonlinear whilst keeping the gated activation mechanism of RNNs? Below, we show that a more expressive model can be created by stacking an SSM on top of an RNN, and that by keeping them layered, the functional form of the true posterior distribution over $z_{1:T}$ guides the design of a backwards-recursive structured variational approximation.

3 Stochastic Recurrent Neural Networks

We define an SRNN as a generative model $p_\theta$ by temporally interlocking an SSM with an RNN, as illustrated in Figure 2a. The joint probability of a single sequence and its latent states, assuming knowledge of the starting states $z_0 = 0$ and $d_0 = 0$, and inputs $u_{1:T}$, factorizes as

$$p_\theta(x_{1:T}, z_{1:T}, d_{1:T} \mid u_{1:T}, z_0, d_0) = p_{\theta_x}(x_{1:T} \mid z_{1:T}, d_{1:T}) \, p_{\theta_z}(z_{1:T} \mid d_{1:T}, z_0) \, p_{\theta_d}(d_{1:T} \mid u_{1:T}, d_0) = \prod_{t=1}^{T} p_{\theta_x}(x_t \mid z_t, d_t) \, p_{\theta_z}(z_t \mid z_{t-1}, d_t) \, p_{\theta_d}(d_t \mid d_{t-1}, u_t). \tag{1}$$

Figure 2: An SRNN as a generative model $p_\theta$ for a sequence $x_{1:T}$. Panels: (a) generative model $p_\theta$, (b) inference network $q_\phi$. Posterior inference of $z_{1:T}$ and $d_{1:T}$ is done through an inference network $q_\phi$, which uses a backwards-recurrent state $a_t$ to approximate the nonlinear dependence of $z_t$ on future observations $x_{t:T}$ and states $d_{t:T}$; see Equation (7).

The SSM and RNN are further tied with skip-connections from $d_t$ to $x_t$. The joint density in (1) is parameterized by $\theta = \{\theta_x, \theta_z, \theta_d\}$, which will be adapted together with parameters $\phi$ of a so-called "inference network" $q_\phi$ to best model $N$ independently observed data sequences $\{x^i_{1:T_i}\}_{i=1}^{N}$ that are described by the log marginal likelihood or evidence

$$\mathcal{L}(\theta) = \log p_\theta\big(\{x^i_{1:T_i}\} \mid \{u^i_{1:T_i}, z^i_0, d^i_0\}_{i=1}^{N}\big) = \sum_i \log p_\theta(x^i_{1:T_i} \mid u^i_{1:T_i}, z^i_0, d^i_0) = \sum_i \mathcal{L}_i(\theta). \tag{2}$$

Throughout the paper, we omit superscript $i$ when only one sequence is referred to, or when it is clear from the context. In each log likelihood term $\mathcal{L}_i(\theta)$ in (2), the latent states $z_{1:T}$ and $d_{1:T}$ were averaged out of (1). Integrating out $d_{1:T}$ is done by simply substituting its deterministically obtained value, but $z_{1:T}$ requires more care, and we return to it in Section 3.2. Following Figure 2a, the states $d_{1:T}$ are determined from $d_0$ and $u_{1:T}$ through the recursion $d_t = f_{\theta_d}(d_{t-1}, u_t)$. In our implementation $f_{\theta_d}$ is a GRU network with parameters $\theta_d$. For later convenience we denote the value of $d_{1:T}$, as computed by application of $f_{\theta_d}$, by $\tilde{d}_{1:T}$. Therefore $p_{\theta_d}(d_t \mid d_{t-1}, u_t) = \delta(d_t - \tilde{d}_t)$, i.e. $d_{1:T}$ follows a delta distribution centered at $\tilde{d}_{1:T}$.

Unlike the VRNN [8], $z_t$ directly depends on $z_{t-1}$, as it does in an SSM, via $p_{\theta_z}(z_t \mid z_{t-1}, d_t)$.
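The factorization in (1) can be read as a two-pass sampling procedure: first run the deterministic recursion for $d_{1:T}$, then sample $z_t$ and $x_t$ step by step. The sketch below illustrates that reading with scalar states. The functions `rnn_step`, `prior_params` and `emission_mean`, the Gaussian emission with fixed `obs_std`, and all coefficients are hypothetical stand-ins (the paper uses a GRU and feed-forward networks); only the dependency structure and the Gaussian transition follow the text.

```python
import math
import random

def rnn_step(d_prev, u):
    # Hypothetical stand-in for the GRU f_{theta_d}: a deterministic map of (d_{t-1}, u_t).
    return math.tanh(0.5 * d_prev + u)

def prior_params(z_prev, d):
    # Toy stand-ins for the two prior networks: mean and log-variance of p(z_t | z_{t-1}, d_t).
    mu = math.tanh(0.8 * z_prev + 0.3 * d)
    log_v = -1.0 + 0.1 * d
    return mu, log_v

def emission_mean(z, d):
    # Toy emission; the dependence on d models the skip connection from d_t to x_t.
    return z + 0.5 * d

def sample_srnn(u_seq, rng, z0=0.0, d0=0.0, obs_std=0.1):
    # Pass 1: the deterministic layer. d_{1:T} never depends on sampled z.
    d_seq, d = [], d0
    for u in u_seq:
        d = rnn_step(d, u)
        d_seq.append(d)
    # Pass 2: the stochastic layer, z_t ~ N(mu_t, v_t), followed by x_t given (z_t, d_t).
    z, z_seq, x_seq = z0, [], []
    for d in d_seq:
        mu, log_v = prior_params(z, d)
        z = mu + math.exp(0.5 * log_v) * rng.gauss(0.0, 1.0)
        z_seq.append(z)
        x_seq.append(emission_mean(z, d) + obs_std * rng.gauss(0.0, 1.0))
    return d_seq, z_seq, x_seq

d_seq, z_seq, x_seq = sample_srnn([0.1, -0.2, 0.3], random.Random(0))
```

Because the first pass does not touch the random number generator, two runs with different seeds share the same `d_seq` and differ only in `z_seq` and `x_seq`, which is the separation of deterministic and stochastic layers that the model relies on.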
This split makes a clear separation between the deterministic and stochastic parts of $p_\theta$; the RNN remains entirely deterministic and its recurrent units do not depend on noisy samples of $z_t$, while the prior over $z_t$ follows the Markov structure of SSMs. The split allows us to later mimic the structure of the posterior distribution over $z_{1:T}$ and $d_{1:T}$ in its approximation $q_\phi$. We let the prior transition distribution $p_{\theta_z}(z_t \mid z_{t-1}, d_t) = \mathcal{N}(z_t; \mu_t^{(p)}, v_t^{(p)})$ be a Gaussian with a diagonal covariance matrix, whose mean and log-variance are parameterized by neural networks that depend on $z_{t-1}$ and $d_t$,

$$\mu_t^{(p)} = \mathrm{NN}_1^{(p)}(z_{t-1}, d_t), \qquad \log v_t^{(p)} = \mathrm{NN}_2^{(p)}(z_{t-1}, d_t), \tag{3}$$

where NN denotes a neural network. Parameters $\theta_z$ denote all weights of $\mathrm{NN}_1^{(p)}$ and $\mathrm{NN}_2^{(p)}$, which are two-layer feed-forward networks in our implementation. Similarly, the parameters of the emission distribution $p_{\theta_x}(x_t \mid z_t, d_t)$ depend on $z_t$ and $d_t$ through a similar neural network that is parameterized by $\theta_x$.

3.1 Variational inference for the SRNN

The stochastic variables $z_{1:T}$ of the nonlinear SSM cannot be analytically integrated out to obtain $\mathcal{L}(\theta)$ in (2). Instead of maximizing $\mathcal{L}$ with respect to $\theta$, we maximize a variational evidence lower bound (ELBO) $\mathcal{F}(\theta, \phi) = \sum_i \mathcal{F}_i(\theta, \phi) \le \mathcal{L}(\theta)$ with respect to both $\theta$ and the variational parameters $\phi$ [18]. The ELBO is a sum of lower bounds $\mathcal{F}_i(\theta, \phi) \le \mathcal{L}_i(\theta)$, one for each sequence $i$,

$$\mathcal{F}_i(\theta, \phi) = \iint q_\phi(z_{1:T}, d_{1:T} \mid x_{1:T}, A) \, \log \frac{p_\theta(x_{1:T}, z_{1:T}, d_{1:T} \mid A)}{q_\phi(z_{1:T}, d_{1:T} \mid x_{1:T}, A)} \, dz_{1:T} \, dd_{1:T}, \tag{4}$$

where $A = \{u_{1:T}, z_0, d_0\}$ is a notational shorthand. Each sequence's approximation $q_\phi$ shares parameters $\phi$ with all others, to form the auto-encoding variational Bayes inference network or variational auto-encoder (VAE) [21, 26] shown in Figure 2b. Maximizing $\mathcal{F}(\theta, \phi)$, which we call "training" the neural network architecture with parameters $\theta$ and $\phi$, is done by stochastic gradient ascent, and in doing so, both the posterior and its approximation $q_\phi$ change simultaneously. All the intractable expectations in (4) would typically be approximated by sampling, using the reparameterization trick [21, 26] or control variates [24] to obtain low-variance estimators of its gradients. We use the reparameterization trick in our implementation. Iteratively maximizing $\mathcal{F}$ over $\theta$ and $\phi$ separately would yield an expectation maximization-type algorithm, which has formed a backbone of statistical modeling for many decades [9]. The tightness of the bound depends on how well we can approximate the $i = 1, \ldots, N$ factors $p_\theta(z^i_{1:T_i}, d^i_{1:T_i} \mid x^i_{1:T_i}, A^i)$ that constitute the true posterior over all latent variables with their corresponding factors $q_\phi(z^i_{1:T_i}, d^i_{1:T_i} \mid x^i_{1:T_i}, A^i)$. In what follows, we show how $q_\phi$ could be judiciously structured to match the posterior factors.

We add initial structure to $q_\phi$ by noticing that the prior $p_{\theta_d}(d_{1:T} \mid u_{1:T}, d_0)$ in the generative model is a delta function over $\tilde{d}_{1:T}$, and so is the posterior $p_\theta(d_{1:T} \mid x_{1:T}, u_{1:T}, d_0)$. Consequently, we let the inference network use exactly the same deterministic state setting $\tilde{d}_{1:T}$ as that of the generative model, and we decompose it as

$$q_\phi(z_{1:T}, d_{1:T} \mid x_{1:T}, u_{1:T}, z_0, d_0) = q_\phi(z_{1:T} \mid d_{1:T}, x_{1:T}, z_0) \, \underbrace{q(d_{1:T} \mid x_{1:T}, u_{1:T}, d_0)}_{= \, p_{\theta_d}(d_{1:T} \mid u_{1:T}, d_0)}. \tag{5}$$

This choice exactly approximates one delta-function by itself, and simplifies the ELBO by letting them cancel out. By further taking the outer average in (4), one obtains

$$\mathcal{F}_i(\theta, \phi) = \mathbb{E}_{q_\phi}\big[\log p_\theta(x_{1:T} \mid z_{1:T}, \tilde{d}_{1:T})\big] - \mathrm{KL}\big(q_\phi(z_{1:T} \mid \tilde{d}_{1:T}, x_{1:T}, z_0) \,\|\, p_\theta(z_{1:T} \mid \tilde{d}_{1:T}, z_0)\big), \tag{6}$$

which still depends on $\theta_d$, $u_{1:T}$ and $d_0$ via $\tilde{d}_{1:T}$. The first term is an expected log likelihood under $q_\phi(z_{1:T} \mid \tilde{d}_{1:T}, x_{1:T}, z_0)$, while KL denotes the Kullback-Leibler divergence between two distributions. Having stated the second factor in (5), we now turn our attention to parameterizing the first factor in (5) to resemble its posterior equivalent, by exploiting the temporal structure of $p_\theta$.

3.2 Exploiting the temporal structure

The true posterior distribution of the stochastic states $z_{1:T}$, given both the data and the deterministic states $d_{1:T}$, factorizes as $p_\theta(z_{1:T} \mid d_{1:T}, x_{1:T}, u_{1:T}, z_0) = \prod_t p_\theta(z_t \mid z_{t-1}, d_{t:T}, x_{t:T})$. This can be verified by considering the conditional independence properties of the graphical model in Figure 2a using d-separation [14]. This shows that, knowing $z_{t-1}$, the posterior distribution of $z_t$ does not depend on the past outputs and deterministic states, but only on the present and future ones; this was also noted in [22]. Instead of factorizing $q_\phi$ as a mean-field approximation across time steps, we keep the structured form of the posterior factors, including $z_t$'s dependence on $z_{t-1}$, in the variational approximation

$$q_\phi(z_{1:T} \mid \tilde{d}_{1:T}, x_{1:T}, z_0) = \prod_t q_\phi(z_t \mid z_{t-1}, \tilde{d}_{t:T}, x_{t:T}) = \prod_t q_{\phi_z}\big(z_t \mid z_{t-1}, a_t = g_{\phi_a}(a_{t+1}, [\tilde{d}_t, x_t])\big), \tag{7}$$

where $[\tilde{d}_t, x_t]$ is the concatenation of the vectors $\tilde{d}_t$ and $x_t$. The graphical model for the inference network is shown in Figure 2b. Apart from the direct dependence of the posterior approximation at time $t$ on both $\tilde{d}_{t:T}$ and $x_{t:T}$, the distribution also depends on $\tilde{d}_{1:t-1}$ and $x_{1:t-1}$ through $z_{t-1}$. We mimic each posterior factor's nonlinear long-term dependence on $\tilde{d}_{t:T}$ and $x_{t:T}$ through a backwards-recurrent function $g_{\phi_a}$, shown in (7), which we will return to in greater detail in Section 3.3.
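The KL term in (6) compares the approximate posterior with the prior. When both distributions are diagonal Gaussians, as the prior in (3) is and as the paper later chooses for $q_\phi$ in Section 3.3, the divergence has a closed form and needs no sampling. The following is a minimal sketch of that closed form; the function name and the list-based interface are illustrative choices, not taken from the paper's code.

```python
import math

def kl_diag_gauss(mu_q, log_v_q, mu_p, log_v_p):
    """KL( N(mu_q, diag(v_q)) || N(mu_p, diag(v_p)) ), summed over dimensions.

    All arguments are equal-length lists; variances are passed as log-variances,
    matching the parameterization used for the prior in (3).
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, log_v_q, mu_p, log_v_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total

# Identical distributions have zero divergence; shifting the mean by one
# standard deviation at unit variance costs 0.5 nats per dimension.
same = kl_diag_gauss([0.3], [-1.0], [0.3], [-1.0])
shifted = kl_diag_gauss([1.0], [0.0], [0.0], [0.0])
```

Note that the means enter only through the difference `mq - mp`, a property that becomes relevant for the residual parameterization discussed in Section 3.3.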
The inference network in Figure 2b is therefore parameterized by $\phi = \{\phi_z, \phi_a\}$ and $\theta_d$. In (7) all time steps are taken into account when constructing the variational approximation at time $t$; this can therefore be seen as a smoothing problem. In our experiments we also consider filtering, where only the information up to time $t$ is used to define $q_\phi(z_t \mid z_{t-1}, \tilde{d}_t, x_t)$. As the parameters $\phi$ are shared across time steps, we can easily handle sequences with variable length in both cases.

As both the generative model and inference network factorize over time steps in (1) and (7), the ELBO in (6) separates as a sum over the time steps

$$\mathcal{F}_i(\theta, \phi) = \sum_t \mathbb{E}_{q^*_\phi(z_{t-1})} \Big[ \mathbb{E}_{q_\phi(z_t \mid z_{t-1}, \tilde{d}_{t:T}, x_{t:T})}\big[\log p_\theta(x_t \mid z_t, \tilde{d}_t)\big] - \mathrm{KL}\big(q_\phi(z_t \mid z_{t-1}, \tilde{d}_{t:T}, x_{t:T}) \,\|\, p_\theta(z_t \mid z_{t-1}, \tilde{d}_t)\big) \Big], \tag{8}$$

where $q^*_\phi(z_{t-1})$ denotes the marginal distribution of $z_{t-1}$ in the variational approximation to the posterior $q_\phi(z_{1:t-1} \mid \tilde{d}_{1:T}, x_{1:T}, z_0)$, given by

$$q^*_\phi(z_{t-1}) = \int q_\phi(z_{1:t-1} \mid \tilde{d}_{1:T}, x_{1:T}, z_0) \, dz_{1:t-2} = \mathbb{E}_{q^*_\phi(z_{t-2})}\big[q_\phi(z_{t-1} \mid z_{t-2}, \tilde{d}_{t-1:T}, x_{t-1:T})\big]. \tag{9}$$

We can interpret (9) as having a VAE at each time step, with the VAE being conditioned on the past through the stochastic variable $z_{t-1}$. To compute (8), the dependence on $z_{t-1}$ needs to be integrated out, using our posterior knowledge at time $t-1$ which is given by $q^*_\phi(z_{t-1})$. We approximate the outer expectation in (8) using a Monte Carlo estimate, as samples from $q^*_\phi(z_{t-1})$ can be efficiently obtained by ancestral sampling. The sequential formulation of the inference model in (7) allows such samples to be drawn and reused, as given a sample $z^{(s)}_{t-2}$ from $q^*_\phi(z_{t-2})$, a sample $z^{(s)}_{t-1}$ from $q_\phi(z_{t-1} \mid z^{(s)}_{t-2}, \tilde{d}_{t-1:T}, x_{t-1:T})$ will be distributed according to $q^*_\phi(z_{t-1})$.

3.3 Parameterization of the inference network

The variational distribution $q_\phi(z_t \mid z_{t-1}, \tilde{d}_{t:T}, x_{t:T})$ needs to approximate the dependence of the true posterior $p_\theta(z_t \mid z_{t-1}, \tilde{d}_{t:T}, x_{t:T})$ on $\tilde{d}_{t:T}$ and $x_{t:T}$, and as alluded to in (7), this is done by running an RNN with inputs $\tilde{d}_{t:T}$ and $x_{t:T}$ backwards in time. Specifically, we initialize the hidden state of the backwards-recursive RNN in Figure 2b as $a_{T+1} = 0$, and recursively compute $a_t = g_{\phi_a}(a_{t+1}, [\tilde{d}_t, x_t])$. The function $g_{\phi_a}$ represents a recurrent neural network with, for example, LSTM or GRU units. Each sequence's variational approximation factorizes over time with $q_\phi(z_{1:T} \mid \tilde{d}_{1:T}, x_{1:T}, z_0) = \prod_t q_{\phi_z}(z_t \mid z_{t-1}, a_t)$, as shown in (7).
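The backward recursion for $a_t$ can be sketched in a few lines. The version below uses scalar states and a toy `tanh` cell as a hypothetical stand-in for $g_{\phi_a}$ (the paper uses a GRU running backwards); only the initialization $a_{T+1} = 0$ and the direction of the loop are taken from the text.

```python
import math

def toy_cell(a_next, d, x):
    # Hypothetical stand-in for g_{phi_a}; the coefficients are arbitrary.
    return math.tanh(0.5 * a_next + 0.3 * d + 0.7 * x)

def backward_states(d_seq, x_seq, cell=toy_cell):
    """Compute a_1..a_T with a_t = g(a_{t+1}, [d_t, x_t]) and a_{T+1} = 0."""
    a_next = 0.0                      # a_{T+1} = 0
    a_seq = [0.0] * len(d_seq)
    for t in reversed(range(len(d_seq))):
        a_next = cell(a_next, d_seq[t], x_seq[t])
        a_seq[t] = a_next             # a_t summarizes d_{t:T} and x_{t:T}
    return a_seq

a_seq = backward_states([0.1, 0.2, 0.3], [1.0, 2.0, 3.0])
```

Changing the first observation alters only `a_seq[0]`, whereas changing the last observation alters every entry: each $a_t$ carries information about the present and the future, never the past, which is what the posterior factor in (7) requires.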
We let $q_{\phi_z}(z_t \mid z_{t-1}, a_t)$ be a Gaussian with diagonal covariance, whose mean and log-variance are parameterized with $\phi_z$ as

$$\mu_t^{(q)} = \mathrm{NN}_1^{(q)}(z_{t-1}, a_t), \qquad \log v_t^{(q)} = \mathrm{NN}_2^{(q)}(z_{t-1}, a_t). \tag{10}$$

Instead of smoothing, we can also do filtering by using a neural network to approximate the dependence of the true posterior $p_\theta(z_t \mid z_{t-1}, \tilde{d}_t, x_t)$ on $\tilde{d}_t$ and $x_t$, through for instance $a_t = \mathrm{NN}^{(a)}(\tilde{d}_t, x_t)$.

Improving the posterior approximation. In our experiments we found that during training, the parameterization introduced in (10) can lead to small values of the KL term $\mathrm{KL}(q_\phi(z_t \mid z_{t-1}, a_t) \,\|\, p_\theta(z_t \mid z_{t-1}, \tilde{d}_t))$ in the ELBO in (8). This happens when $g_\phi$ in the inference network does not rely on the information propagated back from future outputs in $a_t$, but it is mostly using the hidden state $\tilde{d}_t$ to imitate the behavior of the prior. The inference network could therefore get stuck by trying to optimize the ELBO through sampling from the prior of the model, making the variational approximation to the posterior useless. To overcome this issue, we directly include some knowledge of the predictive prior dynamics in the parameterization of the inference network, using our approximation of the posterior distribution $q^*_\phi(z_{t-1})$ over the previous latent states. In the spirit of sequential Monte Carlo methods [11], we improve the parameterization of $q_\phi(z_t \mid z_{t-1}, a_t)$ by using $q^*_\phi(z_{t-1})$ from (9). As we are constructing the variational distribution sequentially, we approximate the predictive prior mean, i.e. our "best guess" on the prior dynamics of $z_t$, as

$$\hat{\mu}_t^{(p)} = \int \mathrm{NN}_1^{(p)}(z_{t-1}, \tilde{d}_t) \, p(z_{t-1} \mid x_{1:T}) \, dz_{t-1} \approx \int \mathrm{NN}_1^{(p)}(z_{t-1}, \tilde{d}_t) \, q^*_\phi(z_{t-1}) \, dz_{t-1}, \tag{11}$$

where we used the parameterization of the prior distribution in (3). We estimate the integral required to compute $\hat{\mu}_t^{(p)}$ by reusing the samples that were needed for the Monte Carlo estimate of the ELBO in (8). This predictive prior mean can then be used in the parameterization of the mean of the variational approximation $q_\phi(z_t \mid z_{t-1}, a_t)$,

$$\mu_t^{(q)} = \hat{\mu}_t^{(p)} + \mathrm{NN}_1^{(q)}(z_{t-1}, a_t), \tag{12}$$

and we refer to this parameterization as Res$_q$ in the results in Section 4. Rather than directly learning $\mu_t^{(q)}$, we learn the residual between $\hat{\mu}_t^{(p)}$ and $\mu_t^{(q)}$. It is straightforward to show that with this parameterization the KL-term in (8) will not depend on $\hat{\mu}_t^{(p)}$, but only on $\mathrm{NN}_1^{(q)}(z_{t-1}, a_t)$. Learning the residual improves inference, making it seemingly easier for the inference network to track changes in the generative model while the model is trained, as it will only have to learn how to "correct" the predictive prior dynamics by using the information coming from $\tilde{d}_{t:T}$ and $x_{t:T}$. We did not see any improvement in results by parameterizing $\log v_t^{(q)}$ in a similar way. The inference procedure of SRNN with Res$_q$ parameterization for one sequence is summarized in Algorithm 1.

Algorithm 1: Inference of SRNN with Res$_q$ parameterization from (12).
    inputs: $\tilde{d}_{1:T}$ and $a_{1:T}$
    initialize $z_0$
    for $t = 1$ to $T$ do
        $\hat{\mu}_t^{(p)} = \mathrm{NN}_1^{(p)}(z_{t-1}, \tilde{d}_t)$
        $\mu_t^{(q)} = \hat{\mu}_t^{(p)} + \mathrm{NN}_1^{(q)}(z_{t-1}, a_t)$
        $\log v_t^{(q)} = \mathrm{NN}_2^{(q)}(z_{t-1}, a_t)$
        $z_t \sim \mathcal{N}(z_t; \mu_t^{(q)}, v_t^{(q)})$
    end for

4 Results

In this section the SRNN is evaluated on the modeling of speech and polyphonic music data, as they have shown to be difficult to model without a good representation of the uncertainty in the latent states [3, 8, 12, 13, 16]. We test SRNN on the Blizzard [19] and TIMIT raw audio data sets (Table 1) used in [8]. The preprocessing of the data sets and the testing performance measures are identical to those reported in [8]. Blizzard is a dataset of 300 hours of English, spoken by a single female speaker. TIMIT is a dataset of 6300 English sentences read by 630 speakers. As done in [8], for Blizzard we report the average log-likelihood for half-second sequences and for TIMIT we report the average log-likelihood per sequence for the test set sequences. Note that the sequences in the TIMIT test set are on average 3.1 s long, and therefore 6 times longer than those in Blizzard.
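Returning briefly to Algorithm 1 from Section 3.3, its loop can be sketched as follows with scalar latent states and a single sample per step. The three `nn_*` functions are toy, hypothetical stand-ins for the networks $\mathrm{NN}_1^{(p)}$, $\mathrm{NN}_1^{(q)}$ and $\mathrm{NN}_2^{(q)}$, and the sample is drawn with the reparameterization $z = \mu + \sqrt{v}\,\epsilon$; this illustrates the control flow only, not the authors' implementation.

```python
import math
import random

def nn_p1(z_prev, d):
    # Toy stand-in for the prior mean network.
    return math.tanh(0.8 * z_prev + 0.3 * d)

def nn_q1(z_prev, a):
    # Toy stand-in for the residual network: the correction to the prior mean.
    return 0.5 * math.tanh(0.2 * z_prev + a)

def nn_q2(z_prev, a):
    # Toy stand-in for the posterior log-variance network.
    return -2.0 + 0.1 * a

def infer_res_q(d_seq, a_seq, rng, z0=0.0):
    """One pass of Algorithm 1; returns (z_t, mu_p_t, mu_q_t, log_v_q_t) per step."""
    z, steps = z0, []
    for d, a in zip(d_seq, a_seq):
        mu_p = nn_p1(z, d)                      # predictive prior mean from the previous sample
        mu_q = mu_p + nn_q1(z, a)               # Res_q: prior mean plus learned residual
        log_v_q = nn_q2(z, a)
        z = mu_q + math.exp(0.5 * log_v_q) * rng.gauss(0.0, 1.0)   # z_t ~ N(mu_q, v_q)
        steps.append((z, mu_p, mu_q, log_v_q))
    return steps

steps = infer_res_q([0.1, 0.2], [0.3, -0.4], random.Random(0))
```

With a single sample, the integral in (11) reduces to evaluating the prior mean network at the previously sampled `z`, which is why the same sample serves both the ELBO estimate and the predictive prior mean.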
For the raw audio datasets we use a fully factorized Gaussian output distribution. Additionally, we test SRNN for modeling sequences of polyphonic music (Table 2), using the four data sets of MIDI songs introduced in [4]. Each data set contains more than 7 hours of polyphonic music of varying complexity: folk tunes (Nottingham data set), the four-part chorales by J. S. Bach (JSB chorales), orchestral music (MuseData) and classical piano music (Piano-midi.de). For polyphonic music we use a Bernoulli output distribution to model the binary sequences of piano notes.

All models were implemented using Theano [2], Lasagne [10] and Parmesan (see footnote 2). Training using an NVIDIA Titan X GPU took around 1.5 hours for TIMIT, 18 hours for Blizzard, less than 15 minutes for the JSB chorales and Piano-midi.de data sets, and around 30 minutes for the Nottingham and MuseData data sets. To reduce the computational requirements we use only 1 sample to approximate all the intractable expectations in the ELBO (notice that the KL term can be computed analytically). Further implementation and experimental details can be found in the Supplementary Material.

Blizzard and TIMIT. Table 1 compares the average log-likelihood per test sequence of SRNN to the results from [8]. For RNNs and VRNNs the authors of [8] test two different output distributions, namely a Gaussian distribution (Gauss) and a Gaussian Mixture Model (GMM). VRNN-I differs from the VRNN in that the prior over the latent variables is independent across time steps, and it is therefore similar to STORN [3]. For SRNN we compare the smoothing and filtering performance (denoted as smooth and filt in Table 1), both with the residual term in (12) and without it in (10) (denoted as Res$_q$ if present). We prefer to only report the more conservative evidence lower bound for SRNN, as the approximation of the log-likelihood using standard importance sampling is known to be difficult to compute accurately in the sequential setting [11]. We see from Table 1 that SRNN outperforms all the competing methods for speech modeling.
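The two output distributions used in these experiments correspond to two simple log-likelihood terms for $\log p_\theta(x_t \mid z_t, d_t)$. The sketch below writes them out for one time step, assuming the emission network has already produced the Gaussian mean and log-variance, or the Bernoulli probabilities; the function names and the clamping constant `eps` are illustrative choices, not from the paper.

```python
import math

def gaussian_log_lik(x, mu, log_v):
    """Log-density of x under a fully factorized Gaussian (one term per dimension)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi) + lv + (xi - m) ** 2 / math.exp(lv))
        for xi, m, lv in zip(x, mu, log_v)
    )

def bernoulli_log_lik(x, probs, eps=1e-7):
    """Log-probability of a binary vector x, e.g. which piano notes are active."""
    total = 0.0
    for xi, p in zip(x, probs):
        p = min(max(p, eps), 1.0 - eps)   # keep log() finite at p = 0 or p = 1
        total += xi * math.log(p) + (1 - xi) * math.log(1.0 - p)
    return total

audio_term = gaussian_log_lik([0.1, -0.2], [0.0, 0.0], [0.0, 0.0])
music_term = bernoulli_log_lik([1, 0, 0], [0.9, 0.2, 0.1])
```

With a single sample of $z_t$, one of these terms minus the analytic KL gives the per-step contribution to the ELBO estimate.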
As the test sequences in TIMIT are on average more than 6 times longer than the ones for Blizzard, the results obtained with SRNN for TIMIT are in line with those obtained for Blizzard. The VRNN, which performs well when the voice of the single speaker from Blizzard is modeled, seems to encounter difficulties when modeling the 630 speakers in the TIMIT data set. As expected, for SRNN the variational approximation that is obtained when future information is also used (smoothing) is better than the one obtained by filtering. Learning the residual between the prior mean and the mean of the variational approximation, given in (12), further improves the performance in 3 out of 4 cases.

Footnote 2: https://github.com/casperkaae/parmesan. The code for SRNN will be made available online.

Table 1: Average log-likelihood per sequence on the test sets. For TIMIT the average test set length is 3.1 s, while the Blizzard sequences are all 0.5 s long. The non-SRNN results are reported as in [8]. Smooth: $g_{\phi_a}$ is a GRU running backwards; filt: $g_{\phi_a}$ is a feed-forward network; Res$_q$: parameterization with residual in (12).
Columns: Models, Blizzard, TIMIT. Rows: SRNN (smooth+Res$_q$), SRNN (smooth), SRNN (filt+Res$_q$), SRNN (filt), VRNN-GMM, VRNN-Gauss, VRNN-I-Gauss, RNN-GMM, RNN-Gauss. (Numeric entries not recoverable.)

In the first two lines of Figure 3 we plot two raw signals from the Blizzard test set and the average KL term between the variational approximation and the prior distribution. We see that the KL term increases whenever there is a transition in the raw audio signal, meaning that the inference network is using the information coming from the output symbols to improve inference. Finally, the reconstructions of the output mean and log-variance in the last two lines of Figure 3 look consistent with the original signal.

Figure 3: Visualization of the average KL term and reconstructions of the output mean and log-variance for two examples from the Blizzard test set. Rows: raw signal, avg. $\mathrm{KL}(q \,\|\, p)$, recon. $\mu$, recon. $\log \sigma^2$; columns: Example 1, Example 2; horizontal axis: seconds.

Table 2: Average log-likelihood on the test sets. The TSBN results are from [13], NASMC from [16], STORN from [3], RNN-NADE and RNN from [4].
Columns: Models, Nottingham, JSB chorales, MuseData, Piano-midi.de. Rows: SRNN (smooth+Res$_q$), TSBN, NASMC, STORN, RNN-NADE, RNN. (Numeric entries not recoverable.)

Polyphonic music. Table 2 compares the average log-likelihood on the test sets obtained with SRNN and the models introduced in [3, 4, 13, 16].
As done for the speech data, we prefer to report the more conservative estimate of the ELBO in Table 2, rather than approximating the log-likelihood with importance sampling as some of the other methods do. We see that SRNN performs comparably to other state of the art methods in all four data sets. We report the results using smoothing and learning the residual between the mean of the predictive prior and the one of the variational approximation, but the performances using filtering and learning directly the mean of the variational approximation are now similar. We believe that this is due to the small amount of data and the fact that modeling MIDI music is much simpler than modeling raw speech signals.

5 Related work

A number of works have extended RNNs with stochastic units to model motion capture, speech and music data [3, 8, 12, 13, 16]. The performances of these models are highly dependent on how the dependence among stochastic units is modeled over time, on the type of interaction between stochastic units and deterministic ones, and on the procedure that is used to evaluate the typically intractable log likelihood. Figure 4 highlights how SRNN differs from some of these works.

Figure 4: Generative models of $x_{1:T}$ that are related to SRNN. Panels: (a) STORN, (b) VRNN, (c) Deep Kalman Filter.

In STORN [3] (Figure 4a) and DRAW [15] the stochastic units at each time step have an isotropic Gaussian prior and are independent between time steps. The stochastic units are used as an input to the deterministic units in an RNN. As in our work, the reparameterization trick [21, 26] is used to optimize an ELBO.

The authors of the VRNN [8] (Figure 4b) note that it is beneficial to add information coming from the past states to the prior over latent variables $z_t$. The VRNN lets the prior $p_{\theta_z}(z_t \mid d_t)$ over the stochastic units depend on the deterministic units $d_t$, which in turn depend on both the deterministic and the stochastic units at the previous time step through the recursion $d_t = f(d_{t-1}, z_{t-1}, u_t)$. The SRNN differs by clearly separating the deterministic and stochastic part, as shown in Figure 2a. The separation of deterministic and stochastic units allows us to improve the posterior approximation by doing smoothing, as the stochastic units still depend on each other when we condition on $d_{1:T}$. In the VRNN, on the other hand, the stochastic units are conditionally independent given the states $d_{1:T}$. Because the inference and generative networks in the VRNN share the deterministic units, the variational approximation would not improve by making it dependent on the future through $a_t$, when calculated with a backward GRU, as we do in our model. Unlike STORN, DRAW and VRNN, the SRNN separates the "noisy" stochastic units from the deterministic ones, forming an entire layer of interconnected stochastic units.
We found in practice that this gave better performance and was easier to train. The works by [1, 22] (Figure 4c) show that it is possible to improve inference in SSMs by using ideas from VAEs, similar to what is done in the stochastic part (the top layer) of SRNN. Towards the periphery of related works, [16] approximates the log likelihood of an SSM with sequential Monte Carlo, by learning flexible proposal distributions parameterized by deep networks, while [13] uses a recurrent model with discrete stochastic units that is optimized using the NVIL algorithm [23].

6 Conclusion

This work has shown how to extend the modeling capabilities of recurrent neural networks by combining them with nonlinear state space models. Inspired by the independence properties of the intractable true posterior distribution over the latent states, we designed an inference network in a principled way. The variational approximation for the stochastic layer was improved by using the information coming from the whole sequence and by using the Res$_q$ parameterization to help the inference network to track the non-stationary posterior. SRNN achieves state of the art performances on the Blizzard and TIMIT speech data sets, and performs comparably to competing methods for polyphonic music modeling.

Acknowledgements

We thank Casper Kaae Sønderby and Lars Maaløe for many fruitful discussions, and NVIDIA Corporation for the donation of TITAN X and Tesla K40 GPUs. Marco Fraccaro is supported by Microsoft Research through its PhD Scholarship Programme.

References

[1] E. Archer, I. M. Park, L. Buesing, J. Cunningham, and L. Paninski. Black box variational inference for state space models. arXiv.
[2] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. Goodfellow, A. Bergeron, N. Bouchard, D. Warde-Farley, and Y. Bengio. Theano: new features and speed improvements. arXiv.
[3] J. Bayer and C. Osendorfer. Learning stochastic recurrent networks. arXiv.
[4] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. arXiv.
[5] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio. Generating sentences from a continuous space. arXiv.
[6] K. Cho, B. Van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.
[7] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv.
[8] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio. A recurrent latent variable model for sequential data. In NIPS.
[9] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1).
[10] S. Dieleman, J. Schlüter, C. Raffel, E. Olson, S. K. Sønderby, D. Nouri, E. Battenberg, and A. van den Oord. Lasagne: First release.
[11] A. Doucet, N. de Freitas, and N. Gordon. An introduction to sequential Monte Carlo methods. In Sequential Monte Carlo Methods in Practice, Statistics for Engineering and Information Science.
[12] O. Fabius and J. R. van Amersfoort. Variational recurrent auto-encoders. arXiv.
[13] Z. Gan, C. Li, R. Henao, D. E. Carlson, and L. Carin. Deep temporal sigmoid belief networks for sequence modeling. In NIPS.
[14] D. Geiger, T. Verma, and J. Pearl. Identifying independence in Bayesian networks. Networks, 20.
[15] K. Gregor, I. Danihelka, A. Graves, and D. Wierstra. DRAW: A recurrent neural network for image generation. In ICML.
[16] S. Gu, Z. Ghahramani, and R. E. Turner. Neural adaptive sequential Monte Carlo. In NIPS.
[17] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8), Nov.
[18] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2).
[19] S. King and V. Karaiskos. The Blizzard challenge. In The Ninth Annual Blizzard Challenge.
[20] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv.
[21] D. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR.
[22] R. G. Krishnan, U. Shalit, and D. Sontag. Deep Kalman filters. arXiv.
[23] A. Mnih and K. Gregor. Neural variational inference and learning in belief networks. arXiv.
[24] J. W. Paisley, D. M. Blei, and M. I. Jordan. Variational Bayesian inference with stochastic search. In ICML.

[25] T. Raiko, H. Valpola, M. Harva, and J. Karhunen. Building blocks for variational Bayesian learning of latent variable models. Journal of Machine Learning Research, 8.
[26] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML.
[27] S. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11(2):305-45.
[28] C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther. How to train deep variational autoencoders and probabilistic ladder networks. arXiv.

A Experimental setup

A.1 Blizzard and TIMIT

The sampling rate is 16 kHz and the raw audio signal is normalized using the global mean and standard deviation of the training set. We split the raw audio signals in chunks of 2 seconds. The waveforms are then divided into non-overlapping vectors of size 200. The RNN thus runs for 160 steps (see footnote 3). The model is trained to predict the next vector ($x_t$) given the current one ($u_t$). During training we use backpropagation through time (BPTT) for 0.5 seconds, i.e. we have 4 updates for each 2 seconds of audio. For the first 0.5 seconds we initialize hidden units with zeros and for the subsequent 3 chunks we use the previous hidden states as initialization.

For Blizzard we split the data using 90% for training, 5% for validation and 5% for testing. For testing we report the average log-likelihood per 0.5 s sequence. For TIMIT we use the predefined test set for testing and split the rest of the data into 95% for training and 5% for validation. The training and testing setups are identical to the ones for Blizzard. For TIMIT the test sequences have variable length and are on average 3.1 s, i.e. more than 6 times longer than Blizzard.

We model the output using a fully factorized Gaussian distribution for $p_{\theta_x}(x_t \mid z_t, d_t)$. The deterministic RNNs use GRUs [7], with 2048 units for Blizzard and 1024 units for TIMIT. In both cases, $z_t$ is a 256-dimensional vector.
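The framing arithmetic above can be checked with a short sketch: 2 seconds at 16 kHz give 32000 samples, which split into 160 vectors of size 200, and BPTT over 0.5 seconds groups them into 4 segments of 40 steps. The helper names below are hypothetical, and the all-zero chunk stands in for real audio.

```python
def frame_audio(samples, frame_size=200):
    """Split a waveform into non-overlapping vectors of `frame_size` samples."""
    n_frames = len(samples) // frame_size
    return [samples[i * frame_size:(i + 1) * frame_size] for i in range(n_frames)]

def bptt_segments(frames, steps_per_segment):
    """Group consecutive frames into the segments used for truncated BPTT."""
    return [frames[i:i + steps_per_segment]
            for i in range(0, len(frames), steps_per_segment)]

sample_rate = 16000                                   # 16 kHz
chunk = [0.0] * (2 * sample_rate)                     # one 2-second chunk
frames = frame_audio(chunk)                           # RNN time steps
steps_per_segment = (sample_rate // 2) // 200         # 0.5 s of audio per update
segments = bptt_segments(frames, steps_per_segment)

# Inputs and targets for next-vector prediction: u_t is the current vector, x_t the next one.
pairs = list(zip(frames[:-1], frames[1:]))
```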
All the neural networks have 2 layers, with 1024 units for Blizzard and 512 for TIMIT, and use leaky rectified nonlinearities with leakiness 1/3 and clipped at ±3. In both generative and inference models we share a neural network to extract features from the raw audio signal. The sizes of the models were chosen to roughly match the number of parameters used in [8]. In all experiments it was fundamental to gradually introduce the KL term in the ELBO, as shown in [5, 28, 25]. We therefore multiply the KL term by a temperature $\beta$, i.e. $\beta\,\mathrm{KL}$, and linearly increase $\beta$ from 0.2 to 1 in the beginning of training (for Blizzard we increase it by [...] after each update, while for TIMIT by [...]). In both data sets we used the ADAM optimizer [20]. For Blizzard we use a learning rate of [...] and batch size of 128, for TIMIT they are [...] and 64 respectively.

A.2 Polyphonic music

We use the same model architecture as in Section 4, except for the output Bernoulli variables used to model the active notes. We reduced the number of parameters in the model to 300 deterministic hidden units for the GRU networks, and 100 stochastic units whose distributions are parameterized with neural networks with 1 layer of 500 units.

Footnote 3: 2 s · 16 kHz / 200 = 160.
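The KL warm-up described in A.1 amounts to a clipped linear schedule for $\beta$. Below is a minimal sketch; the per-update `increment` is a free, hypothetical parameter because the values used for Blizzard and TIMIT are not preserved in this text, and `annealed_objective` simply applies $\beta$ to the KL term of a per-sequence ELBO estimate.

```python
def kl_weight(update, increment, beta_start=0.2, beta_end=1.0):
    """Temperature beta after `update` parameter updates, increased linearly and capped."""
    return min(beta_end, beta_start + increment * update)

def annealed_objective(log_lik, kl, beta):
    """ELBO estimate with the KL term down-weighted by beta."""
    return log_lik - beta * kl

# Example with an assumed increment of 0.01 per update (illustrative only).
betas = [kl_weight(step, increment=0.01) for step in (0, 40, 80, 1000)]
objective = annealed_objective(log_lik=-10.0, kl=4.0, beta=betas[0])
```

Once $\beta$ reaches 1 the objective coincides with the ELBO, so the schedule only affects the early phase of training.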
