Appendix for Stochastic Gradient Monomial Gamma Sampler

A The Main Theorem

We provide the following theorem to characterize the stationary distribution of the stochastic process governed by the SDEs in (3).

Theorem 3. The stochastic process generated from the SDEs (3) converges to a stationary distribution $p(\Gamma) \propto \exp(-H(\Gamma))$, where $H(\Gamma)$ is the Hamiltonian defined in the main text.

Proof. We first show that the Fokker-Planck equation holds for the proposed SDE and probability density $p(\Gamma)$:
$$\nabla \cdot \big[p(\Gamma)\, V(\Gamma)\big] = \nabla\nabla^T : \big[p(\Gamma)\, D(\Gamma)\big].$$
Here $\nabla \triangleq (\partial/\partial\theta, \partial/\partial p, \partial/\partial\xi)$, $\cdot$ denotes the vector inner product, and $:$ denotes the matrix double dot product, i.e., $X : Y = \mathrm{Tr}(X^T Y)$. To show that the Fokker-Planck equation holds, we examine both sides. The left-hand side can be written as
$$\nabla \cdot \big[p(\Gamma)\, V(\Gamma)\big] = \Big\{\sigma_\theta\big[(\nabla_\theta U(\theta))^2 - \nabla_\theta^2 U(\theta)\big] + \sigma_p\big[(\nabla_p K(p))^2 - \nabla_p^2 K(p)\big] + \sigma_\xi\big[(\nabla_\xi F(\xi))^2 - \nabla_\xi^2 F(\xi)\big]\Big\}\, p(\Gamma).$$
For the right-hand side,
$$\nabla\nabla^T : \big[p(\Gamma)\, D(\Gamma)\big] = \sigma_\theta\, \nabla_\theta^2\, p(\Gamma) + \sigma_p\, \nabla_p^2\, p(\Gamma) + \sigma_\xi\, \nabla_\xi^2\, p(\Gamma),$$
which expands to the same expression as the left-hand side. For a stationary distribution, $\partial p(\Gamma, t)/\partial t = 0$; as a result, the equality above holds and the distribution $p(\Gamma)$ is preserved by the dynamics defined in (3). Alternatively, one can leverage the recipe from Ma et al. (2015) to recover the same conclusion, by setting the positive semi-definite matrix $D = \mathrm{Diag}([\sigma_\theta, \sigma_p, \sigma_\xi])$ and a skew-symmetric $Q$ whose nonzero off-diagonal blocks are $\pm I$ (coupling $\theta$ and $p$) and $\pm\nabla K(p)$ (coupling $p$ and $\xi$).

B SGMGT/SGMGT-D with Euler integrator

Algorithm 1 SGMGT/SGMGT-D with Euler integrator
Input: monomial parameter $a$, variances $\{\sigma_\theta, \sigma_p, \sigma_\xi\}$, thermostat variance, resampling frequencies $\{T_p, T_\xi\}$, and learning rate $h$.
Initialize $\theta$, $p$ and $\xi$ (all to zero).
for $t = 1, 2, \dots$ do
  Evaluate the stochastic gradient $\nabla\tilde U(\theta)$ on a mini-batch.
  After each $T_p$ iterations, resample the momentum $p_t$ from $\propto \exp(-K_c(p))$ using rejection sampling.
  After each $T_\xi$ iterations, resample the thermostat $\xi_t$ from its Gaussian marginal stationary distribution.
  Generate Gaussian random variables $\zeta_\theta, \zeta_p, \zeta_\xi \sim \mathcal{N}(0, I)$.
  Compute the first- and second-order derivatives of the kinetics $K_c(p)$.
  Perform one Euler step of the SDEs (3): update $\theta_t$ using $\nabla K_c(p)$, the $\sigma_\theta$-scaled stochastic gradient and injected noise; update $p_t$ using the stochastic gradient, the friction term $\xi\,\nabla K_c(p)$ and injected noise scaled by $\sigma_p$; and update $\xi_t$ using $\nabla K_c(p)\odot\nabla K_c(p) - \nabla^2 K_c(p)$ and injected noise scaled by $\sigma_\xi$, all with stepsize $h$.

Note that under the softened kinetics, $K_c(p)$ is twice differentiable and $\nabla K_c(p)$ is Lipschitz continuous. Thus the Fokker-Planck equation holds, leading to a stationary distribution invariant to the target distribution. Another remark is that the resampling steps for $p$ and $\xi$ still lead to the same invariant distribution $p(\Gamma)$, since resampling draws directly from the corresponding marginal distribution. Finally, it can be proved that the Itô diffusion corresponding to our algorithm in (3) is non-reversible. This speeds up convergence to equilibrium, because a reversible process is known to converge more slowly than its non-reversible counterpart (Hwang et al., 2005).
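For concreteness, the Euler update of Algorithm 1 can be sketched in Python. This is a minimal sketch only, assuming the $a = 1$ (tanh-softened) kinetics and simplified noise and thermostat scalings; the function names and constants are placeholders rather than the paper's exact update rules.

```python
import numpy as np

def grad_Kc_a1(p, m=1.0, c=10.0):
    """Softened (a = 1) kinetics gradient: tanh(c*p/m)/m smooths sign(p)/m."""
    return np.tanh(c * p / m) / m

def hess_Kc_a1(p, m=1.0, c=10.0):
    """Diagonal of the second derivative of the softened a = 1 kinetics."""
    return (c / m ** 2) * (1.0 - np.tanh(c * p / m) ** 2)

def sgmgt_euler_step(theta, p, xi, grad_U_hat, h, sigma_p=1.0, sigma_xi=1.0,
                     m=1.0, c=10.0):
    """One Euler step of a thermostatted SG-MCMC update with softened kinetics.

    grad_U_hat(theta) returns a stochastic gradient of U on a mini-batch.
    The noise scalings below are illustrative assumptions.
    """
    gK = grad_Kc_a1(p, m, c)
    theta = theta + gK * h
    p = (p - grad_U_hat(theta) * h - xi * gK * h
         + np.sqrt(2.0 * sigma_p * h) * np.random.randn(*p.shape))
    # Thermostat: drives the empirical value of (K_c'(p))^2 - K_c''(p) toward zero.
    xi = (xi + (gK * gK - hess_Kc_a1(p, m, c)).mean() * h
          + np.sqrt(2.0 * sigma_xi * h) * np.random.randn())
    return theta, p, xi
```

The momentum and thermostat resampling steps (every $T_p$ and $T_\xi$ iterations) would simply wrap around calls to this function.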

C Details for the softened kinetics

We provide the details of the derivation of the softened kinetics. Note that in the SDEs (3), only $\nabla K_c(p)$ and $\nabla^2 K_c(p)$ are involved. For $a = 1$, we consider
$$K_c(p) = g(p) + \tfrac{2}{c}\log\big(1 + e^{-c\,g(p)}\big), \qquad g(p) = p/m, \qquad (4)$$
which gives
$$\nabla K_c(p) = \frac{\sigma(g(p))}{m}, \qquad \nabla^2 K_c(p) = \frac{\sigma'(g(p))}{m^2},$$
where $\sigma(x) = \frac{e^{cx}-1}{e^{cx}+1}$ is the (rescaled) hyperbolic tangent with softening parameter $c$, and $\sigma'(x) = \frac{2c\,e^{cx}}{(e^{cx}+1)^2}$. For $a = 2$, we consider
$$K_c(p) = g(p) + \tfrac{2}{c}\log\big(1 + e^{-c\,g(p)}\big), \qquad g(p) = |p|^{1/2}/m, \qquad (5)$$
which gives
$$\nabla K_c(p) = \frac{\mathrm{sign}(p)\,\sigma(g(p))}{2m\,|p|^{1/2}}, \qquad \nabla^2 K_c(p) = \frac{\sigma'(g(p))}{4m^2\,|p|} - \frac{\sigma(g(p))}{4m\,|p|^{3/2}}.$$
In general, for arbitrary $a$, we consider setting
$$\nabla K_c(p) = \frac{\mathrm{sign}(p)\,\sigma(g(p))\,|p|^{1/a-1}}{a\,m}.$$
Such a specification yields a differentiable softened kinetic function by computing the integral, which is tractable for positive values of $a$. In practice, however, as suggested by Zhang et al. (2016), the optimal $a$ usually lies in a moderate range; we suggest using $a = 1$ or $a = 2$ for general inference tasks.
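The softened gradients above can be collected into a single routine. The sketch below follows the $\sigma(\cdot)$ softening of Section C for general $a$; the exact constant factors and the numerical guard near $p = 0$ are assumptions made for illustration.

```python
import numpy as np

def sigma_soft(x, c=10.0):
    """Softened sign: (e^{cx} - 1) / (e^{cx} + 1), i.e. tanh(c * x / 2)."""
    return np.tanh(0.5 * c * x)

def grad_Kc(p, a=1.0, m=1.0, c=10.0, eps=1e-12):
    """Softened monomial-Gamma kinetics gradient, a sketch for a >= 1.

    g(p) = |p|^{1/a} / m; the hard sign(p) |p|^{1/a - 1} / (a m) is smoothed
    by sigma_soft near p = 0. Constant factors are illustrative assumptions.
    """
    abs_p = np.maximum(np.abs(p), eps)   # avoid division by zero at p = 0
    g = abs_p ** (1.0 / a) / m
    return np.sign(p) * sigma_soft(g, c) * abs_p ** (1.0 / a - 1.0) / (a * m)
```

For $a = 1$ this reduces to $\sigma(g(p))/m$ (since $\sigma$ is odd), and for $a = 2$ to $\mathrm{sign}(p)\,\sigma(g(p))/(2m|p|^{1/2})$; because $\mathrm{sign}(0) = 0$ in NumPy, the gradient vanishes at the origin, consistent with the smoothed kinetics.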

D Synthetic multi-well potential problem

The five-well potential $U(\theta)$ is defined as the exponential of a fixed weighted sum of sinusoids of $\theta$, where $c$ is a fixed coefficient vector and $c_i$ denotes its $i$-th element.

E Symmetric splitting integrators for SGMGT

The first-order Euler integrator incurs high discretization error in the Hamiltonian dynamics updates of HMC. In Chen et al. (2016), a symmetric splitting scheme is leveraged to reduce the numerical error; we apply it with the softened kinetics $K_c(p)$. In this symmetric splitting scheme, the Hamiltonian is split into sub-components, and for each sub-component an individual sub-SDE is solved. The resulting discretization is symplectic and second-order:
$$A:\; d\theta = \nabla K_c(p)\,dt,\;\; d\xi = f(\Gamma)\,dt; \qquad B:\; dp = -\xi\,\nabla K_c(p)\,dt; \qquad O:\; dp = -\nabla\tilde U(\theta)\,dt + D(\Gamma)\,dW.$$
Here we denote $f(\Gamma) \triangleq \nabla K_c(p)\odot\nabla K_c(p) - \nabla^2 K_c(p)$ for clarity. The sub-SDE in step B is analytically solvable. Following Chen et al. (2015), for $a \neq 1/2$ the updating procedure follows an ABOBA scheme: a half A step, a B half-step, the O step (which injects the stochastic gradient and noise), another B half-step, and a final half A step. When $a = 1/2$, this reduces to the splitting scheme of the standard SGNHT (Chen et al., 2015).

F Experimental setups for DRBM

The hyper-parameter setups for the DRBM experiments are as follows. We select the hyperparameters based on performance on a validation set, and the algorithm is early-stopped if the validation error starts to increase. The selection is based on a grid search: the variances $\sigma_\theta$, $\sigma_p$ and $\sigma_\xi$ are selected from a small grid of candidate values, the softening parameter $c$ is selected from {3, 5, 8}, and the mass $m$ and the thermostat variance are fixed. The stepsize is chosen from a grid spanning roughly 1e-5 to 5e-4, and the resampling periods $T_p$ and $T_\xi$ are fixed. For SGLD, we use a stepsize on the order of 1e-5.

Table 4. Experimental setup for the discriminative RBM. For each algorithm (SGNHT, SGNHT-D, and SGMGT-D with the two settings of $a$), the table lists the selected variances, the softening parameter $c$ and the stepsize $h$; the selected stepsizes are on the order of 1e-4 for the SGNHT variants and 1e-5 to 5e-5 for the SGMGT-D variants.

Table 5. Experimental setup for RNNs. For each algorithm (SGNHT, SGNHT-D, and SGMGT/SGMGT-D with the two settings of $a$), the table lists the selected mass $m$, variances $\sigma_p$ and $\sigma_\xi$, and softening parameter $c$.

G Experimental setups for RNN

The hyper-parameter setups for the RNN experiments are similar to those of the DRBM experiments. The variances $\sigma_\theta$, $\sigma_p$ and $\sigma_\xi$ are selected from a small grid of candidate values, the softening parameter $c$ is selected from {3, 5, 8}, and the mass $m$ and the thermostat variance are fixed. The stepsize of SGMGT/SGMGT-D is chosen from a grid on the order of 1e-3 to 5e-3, and the resampling periods $T_p$ and $T_\xi$ are fixed. We also incorporate a decay scheme for the stepsize, i.e., the stepsize is divided by a fixed decay factor after each scan of the dataset (i.e., each epoch). The gradient estimated on a subset of data is clipped to a maximum absolute value of 5 per dimension, as in Chen et al. (2016), to prevent a large gradient value from blowing up the objective loss. For JSB we use a stepsize on the order of 1e-3 for SGMGT; for the other three datasets (Piano, Muse, Nott) we use a stepsize of 3e-3. For SGLD we use a stepsize on the order of 1e-3, and for SGNHT the stepsize is set to 5e-5. The other hyperparameters are provided in Table 5.

H Additional figure for RNN experiments

We provide the traceplot of one parameter, chosen at random, from the RNN experiment on the JSB dataset. Generally, SGMGT with one setting of $a$ exhibits noticeably more random-walk behavior than with the other.

Figure 6. Traceplot for the RNN experiments: parameter value versus number of iterations for SGMGT and SGMGT-D under the two settings of $a$.

I Additional results for RNN experiments

Here we provide the results of several optimization methods; the results are taken from Chen et al. (2016).

Table 6. Test negative log-likelihood on the polyphonic music datasets (Piano, Nott, Muse, JSB) using RNNs, for the optimization baselines Adam, RMSprop, SGD-M, SGD and HF.

J Convergence property

Proof. This follows the proof for general SG-MCMC algorithms. Specifically, in SGMGT the generator of the corresponding SDE is defined as
$$\mathcal{L}f(x) \triangleq F(x)\cdot\nabla f(x) + D : \nabla\nabla^T f(x), \qquad x = (\theta, p, \xi),$$
where $F(x)$ is the drift of the SDEs (3): its $\theta$-component is $\nabla K_c(p)$, its $p$-component is $-\nabla\tilde U(\theta)$ minus a friction term proportional to $\nabla K_c(p)$, and its $\xi$-component is $\nabla K_c(p)\odot\nabla K_c(p) - \nabla^2 K_c(p)$; and $D = \mathrm{Diag}([\sigma_\theta, \sigma_p, \sigma_\xi])$. After introducing stochastic gradients, in each iteration $t$ the generator is perturbed by
$$\Delta V_t = \big(\nabla\tilde U(\theta) - \nabla U(\theta)\big)\cdot\nabla_p,$$
such that $\tilde{\mathcal{L}}_t = \mathcal{L} + \Delta V_t$, where $\tilde{\mathcal{L}}_t$ is the local generator of the SDE at iteration $t$. With this notation, we follow the proofs of the bias and MSE theorems in Chen et al. (2015).

The proof for the bias: Following Chen et al. (2015), in the decreasing-stepsize setting the split flow can be written as
$$\mathbb{E}\big[\psi(X_{lh_l})\big] = \big(I + h_l\,\tilde{\mathcal{L}}_l\big)\,\psi\big(X_{(l-1)h_l}\big) + \sum_{k=2}^{K}\frac{h_l^k}{k!}\,\tilde{\mathcal{L}}_l^k\,\psi\big(X_{(l-1)h_l}\big) + O\big(h_l^{K+1}\big).$$

Similarly, the expected difference between $\hat\psi$ and $\bar\psi$ can be simplified using the stepsize sequence $(h_l)$: a telescoping argument leaves the boundary term $\big(\mathbb{E}[\psi(X_{S_L})] - \psi(X_0)\big)/S_L$ plus remainder terms involving $\tilde{\mathcal{L}}_l^k\,\psi(X_{(l-1)h_l})$ for $k = 2, \dots, K$, where $S_L \triangleq \sum_l h_l$. Similar to the derivation in Chen et al. (2015), these remainder terms can be bounded, and collecting low-order terms gives
$$\big|\mathbb{E}\hat\psi - \bar\psi\big| = \Big|\frac{\mathbb{E}\big[\psi(X_{S_L})\big] - \psi(X_0)}{S_L}\Big| + O\Big(\frac{\sum_l h_l^{K+1}}{S_L}\Big) = O\Big(\frac{1}{S_L} + \frac{\sum_l h_l^{K+1}}{S_L}\Big).$$
Taking $L \to \infty$, both terms go to zero by assumption.

The proof for the MSE: Following similar derivations as in Chen et al. (2015), we expand $\mathbb{E}[\psi(X_{lh_l})]$ as above while keeping the perturbation $\Delta V_l$ explicit, substitute the Poisson equation, and divide both sides by $S_L$. As a result, there exists some positive constant $C$ such that $\mathbb{E}(\hat\psi - \bar\psi)^2$ is bounded by the sum of three terms: $A_1$, involving $\mathbb{E}\big(\psi(X_0) - \mathbb{E}[\psi(X_{S_L})]\big)^2$; $A_2$, involving the increments $\mathbb{E}\big(\psi(X_{lh_l}) - \psi(X_{(l-1)h_l})\big)^2$; and $A_3$, involving the perturbations $\Delta V_l$ and the higher-order operator terms. $A_1$ is bounded by the assumptions, and $A_2$ is bounded using the fact that $\mathbb{E}\big|\psi(X_{lh_l}) - \psi(X_{(l-1)h_l})\big| = O(\sqrt{h_l})$ from Chen et al. (2015). Furthermore, similar to the proof in Chen et al. (2015), the expectation of $A_3$ can also be bounded using the identity $\mathbb{E}[X^2] = (\mathbb{E}X)^2 + \mathbb{E}[(X - \mathbb{E}X)^2]$; the resulting terms are of higher order than the others and can be ignored in the expression below. After some simplification, the MSE is bounded by
$$\mathbb{E}\big(\hat\psi - \bar\psi\big)^2 \le C\,\Big(\sum_l \frac{h_l^2}{S_L^2}\,\mathbb{E}\|\Delta V_l\|^2 + \frac{1}{S_L} + \frac{\big(\sum_l h_l^{K+1}\big)^2}{S_L^2}\Big)$$
for some $C > 0$, which completes the first part of the theorem.

According to the assumptions, the last two terms in the bound above approach 0 as $L \to \infty$. If we further assume $\sum_l h_l^2 / S_L^2 \to 0$, then the first term also approaches 0, because
$$\sum_l \frac{h_l^2}{S_L^2}\,\mathbb{E}\|\Delta V_l\|^2 \le \Big(\sup_l \mathbb{E}\|\Delta V_l\|^2\Big)\frac{\sum_l h_l^2}{S_L^2}.$$
As a result, we have $\lim_{L\to\infty}\mathbb{E}(\hat\psi - \bar\psi)^2 = 0$.

Let the number of samples in the $j$-th resampling period be $T_j$, and denote the total $T \triangleq \sum_j T_j$. Further denote the sample average in the $j$-th resampling period by
$$\hat\psi_{T_j} \triangleq \frac{1}{T_j}\sum_t \psi\big(x_t^{(j)}\big),$$
where $\{x_t^{(j)}\}$ denotes the samples in the $j$-th resampling period. The final sample average is defined as $\hat\psi_T \triangleq \frac{1}{T}\sum_j T_j\,\hat\psi_{T_j}$. As a result, the bias can be bounded by the weighted combination of the per-period biases,
$$\big|\mathbb{E}\hat\psi_T - \bar\psi\big| = \Big|\sum_j \frac{T_j}{T}\big(\mathbb{E}\hat\psi_{T_j} - \bar\psi\big)\Big| \le \sum_j \frac{T_j}{T}\,\big|\mathbb{E}\hat\psi_{T_j} - \bar\psi\big|,$$
where each per-period bias satisfies the bound derived above.

K Proof for Lemma

To prove the lemma, we first introduce the following result from Geyer.

Lemma 3 (Geyer). Suppose $\mu$ is a probability distribution and for each $z$ in the domain of $\mu$ there is a Markov kernel $P_z$ preserving a common distribution $\pi$, i.e., $\pi = \pi P_z$, and suppose that the map $(z, x) \mapsto P_z(x, A)$ is jointly measurable for each $A$. Then
$$Q(x, A) = \int \mu(dz)\, P_z(x, A)$$
defines a kernel $Q$ that is Markov and satisfies $\pi = \pi Q$.

Proof. A detailed proof can be found in Chapter 3 of the reference by Geyer.

Now we are ready to prove the lemma.

Proof of the lemma. First, we note that the momentum (and any other auxiliary variable) is resampled from the stationary distribution of the Itô diffusion; hence each resampling move corresponds to a Markov kernel that leaves the stationary distribution invariant (for the thermostat, this is its Gaussian marginal). According to Lemma 3, the composition of the numerical integrator in SGMGT and the resampling forms a Markov kernel $Q(\Gamma, A)$ such that $\rho_h = \rho_h Q$. This means that $\rho_h$ is also the stationary distribution of the Markov kernel $Q$, which completes the proof.

L Proof for Lemma

Proof. First, the optimal bias and MSE bounds from the proposition are
$$\text{Bias: } \big|\mathbb{E}\hat\psi_T - \bar\psi\big| = O\big(T^{-1/2}\big), \qquad \text{MSE: } \mathbb{E}\big(\hat\psi - \bar\psi\big)^2 = O\big(T^{-2/3}\big).$$
Optimizing over $h$, we have
$$\big|\mathbb{E}\hat\psi_T - \bar\psi\big| \le \sum_j \frac{T_j}{T}\,O\big(T_j^{-1/2}\big) = O\Big(\frac{\sum_j T_j^{1/2}}{T}\Big) = O\big(T^{-1/2}\big),$$
which is the same as the optimal bias bound for SGMGT. The proof for the optimal MSE bound follows similarly.

M Stochastic slice sampling

In this section we leverage the connection between slice sampling and HMC (Zhang et al., 2016) to investigate an approach for performing slice sampling with a subset of data. Slice sampling (Neal, 2003) augments the density $p(\theta)/C$ (where $C > 0$ is a normalization constant) with a slice variable $u$, such that the joint distribution is $p(\theta, u) = 1/C$, subject to $0 < u < p(\theta)$. To sample from the target distribution, slice sampling is performed in a Gibbs sampling manner.

That is, one alternates between uniformly sampling the slice variable $u$ (slice sampling step) and uniformly generating a new sample $\theta$ (conditional sampling step) from the restricted domain on which the (unnormalized) density exceeds the sampled slice variable $u$. Slice sampling allows moves that adaptively fit the scale of the local density structure, thus yielding rapid mixing. When the dataset is large, however, full-data density evaluations can be very expensive. One recent attempt to use subsets of data for slice sampling incorporates a hypothesis-test sub-procedure in the conditional sampling step (DuBois et al., 2014). However, the rejection rate can be large if the mini-batch size is small, and samples from that algorithm are biased due to the hypothesis-test step.

One straightforward approach to stochastic slice sampling is to evaluate the likelihood on a subset of data during the conditional sampling step while otherwise performing standard slice sampling. This approach, detailed in the SM, is referred to as naïve stochastic slice sampling (naïve stochastic SS). As shown in Figure 7 in the SM, applying this naïve implementation to a Bayesian linear regression problem yields over-dispersed samples. The reason why naïve stochastic slice sampling fails can be explained by following the logic of Zhang et al. (2016), Chen et al. (2014) and Betancourt (2015). In Zhang et al. (2016), the authors demonstrate the connection between slice sampling and Hamiltonian Monte Carlo revealed by the Hamilton-Jacobi equation. As a result, slice sampling can be equivalently realized in an HMC formulation. We consider mapping naïve stochastic slice sampling to its equivalent HMC space, parameterized by the model parameter $\theta$ and momentum $p$ as in (7) (with the monomial parameter $a = 1$, using the notation of Zhang et al. (2016)). This results in an HMC formulation that is equivalent to the naïve stochastic gradient HMC of Chen et al. (2014), but with a different kinetic function, as in (3) with $a = 1$. Similar to Chen et al. (2014), the entropy of the joint distribution of $(\theta, p)$ always increases due to the stochastic noise, which explains the over-dispersed distribution observed in Figure 7 in the SM.

Fortunately, one can leverage the connection between slice sampling and HMC from Zhang et al. (2016) to perform an improved stochastic slice sampling. This is done by adopting the SDE of SGHMC and substituting the Gaussian kinetics with the softened Laplace kinetics (i.e., $K_c(p)$ with $a = 1$). A friction term $A\,\nabla K_c(p)$ is incorporated to offset the stochastic noise, resulting in
$$d\theta = \nabla K_c(p)\,dt, \qquad dp = -\big[\nabla\tilde U(\theta) + A\,\nabla K_c(p)\big]\,dt + \sqrt{2\big(A I - \hat B(\theta)\big)}\,dW. \qquad (7)$$
The resulting stochastic Laplace HMC algorithm (detailed in the SM), derived from (7), is (asymptotically) invariant to the target distribution, and performs equivalently to a correct stochastic slice sampler in one-dimensional cases as $c \to \infty$.
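A discretized step of (7) can be sketched as follows. This is a minimal sketch assuming the tanh-softened Laplace kinetics, a plain Euler-style update in place of a full leap-frog scheme, and $\hat B(\theta) \approx 0$ so that the injected noise has variance $2Ah$; names and constants are placeholders.

```python
import numpy as np

def laplace_sghmc_step(theta, p, grad_U_hat, h, A=1.0, m=1.0, c=10.0):
    """One discretized step of the friction-corrected SDE (7), a sketch.

    grad_U_hat(theta) is a stochastic gradient on a mini-batch; the Gaussian
    kinetics of SGHMC is replaced by the softened Laplace kinetics, and the
    noise estimate B_hat is assumed to be 0 here.
    """
    grad_K = np.tanh(c * p / m) / m      # softened sign(p)/m
    theta = theta + grad_K * h
    p = (p - (grad_U_hat(theta) + A * grad_K) * h
         + np.sqrt(2.0 * A * h) * np.random.randn(*np.shape(theta)))
    return theta, p
```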
As shown in Figure 7, the stochastic Laplace HMC sampler ameliorates the over-dispersion of the sampled posterior distribution relative to naïve stochastic slice sampling.

N Naïve stochastic slice sampling and stochastic Laplacian HMC

Naïve stochastic slice sampling is described in Algorithm 2.

Algorithm 2 Naïve stochastic SS
Input: initial parameter $\theta_0$.
for $t = 1, 2, \dots$ do
  Sample a mini-batch $x_t$.
  Evaluate the stochastic negative log-density $\tilde U_{x_t}(\theta_t) \triangleq -\big[\log p(\theta_t) + \frac{N}{|x_t|}\sum_{x\in x_t}\log p(x \mid \theta_t)\big]$.
  Uniformly sample $u_t$ from $\big(0, \exp[-\tilde U_{x_t}(\theta_t)]\big)$.
  Sample $\theta_{t+1}$ from $\{\theta : \tilde U_{x_t}(\theta) < -\log(u_t)\}$ using doubling and shrinking (Neal, 2003).

According to Zhang et al. (2016), Algorithm 2 has a deep connection to Algorithm 3 in the HMC formulation, in univariate scenarios.

Algorithm 3 Naïve stochastic SS in HMC space
Input: initial parameter $\theta_0$.
for $t = 1, 2, \dots$ do
  Sample a mini-batch $x_t$.
  Sample each momentum component independently (for each dimension) from a Laplace distribution $\mathcal{L}(m)$, where $m > 0$ is the mass parameter.
  for $s = 1, 2, \dots$ do
    Evaluate the stochastic gradient $\nabla\tilde U(\theta)$ from (5) on the mini-batch $x_t$.
    Perform leap-frog updates using (4), substituting $\nabla U(\theta)$ with $\nabla\tilde U(\theta)$ and $\nabla K(p)$ with $\mathrm{sign}(p)/m$.

By adding a friction term as in Chen et al. (2014), we obtain the corrected SG-MCMC method, Algorithm 4, which corresponds to stochastic slice sampling.
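Algorithm 2 can be sketched compactly for a one-dimensional parameter. In the sketch below, U_hat is the stochastic negative log-density built from the mini-batch drawn for the current iteration; Neal's doubling/stepping-out procedure is replaced by a fixed-width bracket that is shrunk toward the current point, which is a simplifying assumption, as are the bracket width and iteration cap.

```python
import numpy as np

def naive_stochastic_ss_step(theta, U_hat, width=5.0, max_shrink=50):
    """One iteration of naive stochastic slice sampling (1-D sketch).

    U_hat(theta) returns the stochastic negative log-density on the current
    mini-batch. A fixed-width bracket around theta is shrunk instead of
    Neal's full doubling procedure (a simplification).
    """
    log_u = -U_hat(theta) + np.log(np.random.rand())  # log u, u ~ U(0, exp(-U_hat(theta)))
    lo = theta - width * np.random.rand()             # random bracket containing theta
    hi = lo + width
    for _ in range(max_shrink):
        prop = np.random.uniform(lo, hi)
        if -U_hat(prop) > log_u:                      # proposal lies inside the stochastic slice
            return prop
        if prop < theta:                              # shrink the bracket toward theta
            lo = prop
        else:
            hi = prop
    return theta                                      # fall back to the current value
```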

Figure 7. Naïve stochastic slice sampling vs. stochastic Laplacian HMC: probability density of the sampled posterior for each method, compared against the oracle density.

Algorithm 4 Stochastic Laplacian HMC
Input: initial parameter $\theta_0$.
for $t = 1, 2, \dots$ do
  Sample a momentum $p$ from the distribution $\propto \exp(-K_c(p))$, where $K_c(p)$ is the softened Laplace kinetics and $c$ is the softening parameter.
  for $s = 1, 2, \dots$ do
    Sample a mini-batch $x_t$.
    Evaluate the stochastic gradient $\nabla\tilde U(\theta)$ from (5) on the mini-batch $x_t$.
    Perform leap-frog updates using the SDE in (7).

We provide the empirical densities drawn by naïve stochastic slice sampling and stochastic Laplacian HMC on a synthetic Bayesian linear regression problem with one feature dimension. For each instance $i$, $y_i \sim \mathcal{N}(\theta x_i, 1)$, and we estimate the posterior of the single parameter $\theta$. Both methods are run on the same synthetic training set with the same small mini-batch size and the same number of Monte Carlo iterations; for stochastic Laplacian HMC, the stepsize, the diffusion parameter $A$ and the softening parameter $c$ are fixed. From Figure 7, stochastic Laplacian HMC recovers the target distribution better.
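The mini-batch gradient used in this synthetic example can be sketched as below; the Gaussian prior on $\theta$ (with variance prior_var), the batch size and sampling without replacement are illustrative assumptions rather than the exact experimental settings.

```python
import numpy as np

def make_grad_U_hat(x, y, batch_size=30, prior_var=1.0):
    """Mini-batch gradient of the negative log-posterior for y_i ~ N(theta * x_i, 1).

    Returns a closure that draws a fresh mini-batch on every call and rescales
    the likelihood gradient by N / batch_size.
    """
    x, y = np.asarray(x), np.asarray(y)
    N = len(x)
    def grad_U_hat(theta):
        idx = np.random.choice(N, batch_size, replace=False)
        resid = y[idx] - theta * x[idx]
        return -(N / batch_size) * np.sum(resid * x[idx]) + theta / prior_var
    return grad_U_hat
```

With this, one outer iteration of Algorithm 4 amounts to resampling $p$ from the softened Laplace kinetics and then repeatedly applying an update like the laplace_sghmc_step sketch given after (7); collecting the visited $\theta$ values yields the empirical posterior densities compared in Figure 7.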
