Supplementary Material of High-Order Stochastic Gradient Thermostats for Bayesian Learning of Deep Models
Chunyuan Li 1, Changyou Chen 1, Kai Fan 2 and Lawrence Carin 1
1 Department of Electrical and Computer Engineering, Duke University
2 Computational Biology and Bioinformatics, Duke University
chunyuan.li@duke.edu, cchangyou@gmail.com, kai.fan@duke.edu, lcarin@duke.edu

A The proof of Lemma 1

Proof. In mSGNHT, $X_t = (\theta_t, p_t, \xi_t)$. The update equations with a Euler integrator are:

$$\theta_{t+1} = \theta_t + p_t h,$$
$$p_{t+1} = p_t - \nabla_\theta \tilde{U}_t(\theta_{t+1})\, h - \mathrm{diag}(\xi_t)\, p_t h + \sqrt{2Dh}\, \zeta_{t+1},$$
$$\xi_{t+1} = \xi_t + (p_{t+1} \odot p_{t+1} - 1)\, h.$$

Based on the update equations, it is easily seen that the corresponding Kolmogorov operator $P_h^l$ for mSGNHT is

$$P_h^l = e^{h L_1}\, e^{h L_2}\, e^{h L_3}, \qquad (1)$$

where $L_1 \triangleq (p \odot p - 1) \cdot \nabla_\xi$, $L_2 \triangleq -\mathrm{diag}(\xi)\, p_t \cdot \nabla_p - \nabla_\theta \tilde{U}_l(\theta) \cdot \nabla_p + 2D\, I : \nabla_p \nabla_p^\top$, and $L_3 \triangleq p \cdot \nabla_\theta$. Using the BCH formula, we have

$$P_h^l = e^{h (L_1 + L_2 + L_3)} + O(h^2). \qquad (2)$$

On the other hand, the local generator of mSGNHT at the $t$-th iteration can be seen to be

$$\tilde{L}_t = L_1 + \tilde{L}_2 + L_3, \qquad (3)$$

where $L_1$ and $L_3$ are defined previously, and $\tilde{L}_2 = -\mathrm{diag}(\xi)\, p \cdot \nabla_p - \nabla_\theta \tilde{U}_l(\theta) \cdot \nabla_p + 2D\, I : \nabla_p \nabla_p^\top$. According to Kolmogorov's backward equation, we have

$$\mathbb{E}[f(X_{t+1})] = e^{h \tilde{L}_t} f(X_t). \qquad (4)$$

Substituting (4) into (2) and using the fact that $p = p_t + O(h)$ by Taylor expansion, it is easily seen that

$$P_h^l = e^{h \tilde{L}_t + O(h^2)} + O(h^2) = e^{h \tilde{L}_t} + O(h^2).$$

This completes the proof.

B The proof of Lemma 2

Proof. In the symmetric splitting scheme for mSGNHT, according to the splitting in (4) of the main text, the generator $\tilde{L}_l$ is split into the following sub-generators, each of which can be solved analytically: $\tilde{L}_l = L_A + L_B + L_{O_l}$, where

$$A:\ L_A = p \cdot \nabla_\theta + (p \odot p - 1) \cdot \nabla_\xi,$$
$$B:\ L_B = -\mathrm{diag}(\xi)\, p \cdot \nabla_p,$$
$$O_l:\ L_{O_l} = -\nabla_\theta \tilde{U}_l(\theta) \cdot \nabla_p + 2D\, I : \nabla_p \nabla_p^\top.$$

The corresponding Kolmogorov operator $P_h^l$ for the splitting integrator can be seen to be

$$P_h^l \triangleq e^{\frac{h}{2} L_A}\, e^{\frac{h}{2} L_B}\, e^{h L_{O_l}}\, e^{\frac{h}{2} L_B}\, e^{\frac{h}{2} L_A}.$$

In the following we use the Baker–Campbell–Hausdorff (BCH) formula (Rossmann 2002) to show that $P_h^l$ is a 2nd-order integrator. Write $A \triangleq L_A$, $B \triangleq L_B$, and $O_l \triangleq L_{O_l}$. The BCH formula gives

$$e^{\frac{h}{2} A} e^{\frac{h}{2} B} = e^{\frac{h}{2} A + \frac{h}{2} B + \frac{h^2}{8}[A,B] + \frac{h^3}{96}\left([A,[A,B]] + [B,[B,A]]\right) + \cdots} \qquad (5)$$
$$= e^{\frac{h}{2} A + \frac{h}{2} B + \frac{h^2}{8}[A,B]} + O(h^3), \qquad (6)$$

where $[X, Y] \triangleq XY - YX$ is the commutator of $X$ and $Y$; (5) follows from the BCH formula, and (6) follows by moving the high-order terms $O(h^3)$ out of the exponential map using a Taylor expansion. Composing the factors of $P_h^l$ from the right and applying (5)–(6) repeatedly, we have

$$e^{h O_l}\, e^{\frac{h}{2} B}\, e^{\frac{h}{2} A} = e^{h O_l}\left(e^{\frac{h}{2} A + \frac{h}{2} B + \frac{h^2}{8}[B,A]} + O(h^3)\right) = e^{h O_l + \frac{h}{2} A + \frac{h}{2} B + \frac{h^2}{8}[B,A] + \frac{h^2}{4}[O_l,A] + \frac{h^2}{4}[O_l,B]} + O(h^3),$$

then

$$e^{\frac{h}{2} B}\, e^{h O_l}\, e^{\frac{h}{2} B}\, e^{\frac{h}{2} A} = e^{h O_l + \frac{h}{2} A + h B + \frac{h^2}{4}[B,A] + \frac{h^2}{4}[O_l,A]} + O(h^3),$$

and finally

$$P_h^l = e^{\frac{h}{2} A}\left(e^{h O_l + \frac{h}{2} A + h B + \frac{h^2}{4}[B,A] + \frac{h^2}{4}[O_l,A]} + O(h^3)\right) = e^{h(L_A + L_B + L_{O_l})} + O(h^3),$$

where all the $O(h^2)$ commutator terms cancel by the symmetry of the composition. As a result,

$$P_h^l = e^{h(L + \Delta V_l)} + O(h^3) = e^{h \tilde{L}_l} + O(h^3),$$

where $L$ is the generator with the exact gradient and $\Delta V_l \triangleq (\nabla_\theta \tilde{U}_l - \nabla_\theta U) \cdot \nabla_p$ accounts for the stochastic-gradient noise (see (7) below). This completes the proof.
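To make the splitting concrete, below is a minimal NumPy sketch of a single update under the $A$–$B$–$O_l$ decomposition above: the $A$ and $B$ sub-dynamics are solved exactly, while the $O_l$ part uses the stochastic gradient. The function name and the `grad_u_tilde` callback are illustrative choices of ours, not code released with the paper.

```python
import numpy as np

def msgnht_splitting_step(theta, p, xi, grad_u_tilde, h, D, rng=np.random.default_rng()):
    """One mSGNHT update with the symmetric A-B-O_l-B-A splitting.

    theta, p, xi : parameter, momentum, and thermostat arrays of equal shape.
    grad_u_tilde : callable returning the stochastic gradient of U at theta.
    h, D         : step size and diffusion factor.
    """
    # A (half step): d(theta) = p dt and d(xi) = (p*p - 1) dt, with p fixed.
    theta = theta + p * (h / 2)
    xi = xi + (p * p - 1.0) * (h / 2)
    # B (half step): d(p) = -diag(xi) p dt has the exact solution below.
    p = np.exp(-xi * (h / 2)) * p
    # O_l (full step): stochastic-gradient force plus injected noise.
    p = p - grad_u_tilde(theta) * h + np.sqrt(2.0 * D * h) * rng.standard_normal(p.shape)
    # B and A again (half steps), mirroring the first two sub-steps.
    p = np.exp(-xi * (h / 2)) * p
    theta = theta + p * (h / 2)
    xi = xi + (p * p - 1.0) * (h / 2)
    return theta, p, xi
```

Because the $B$ sub-step is solved exactly rather than by an Euler approximation, the palindromic composition above is what yields the $O(h^3)$ local error established in the proof.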
C More details on Lemma 3

Our justification of the symmetric splitting integrator is based on Lemma 3, which is a simplification of the main theorems in (Chen, Ding, and Carin 2015). For completeness, we give details of their main theorems in this section. To recap notation, for an Itô diffusion with an invariant measure $\rho(x)$, the posterior average is defined as $\bar{\phi} \triangleq \int_{\mathcal{X}} \phi(x)\, \rho(x)\, dx$ for some test function $\phi(x)$ of interest. Given samples $(x_t)_{t=1}^{T}$ from an SG-MCMC algorithm, we use the sample average $\hat{\phi} = \frac{1}{T} \sum_{t=1}^{T} \phi(x_t)$ to approximate $\bar{\phi}$. In addition, we define an operator for the $t$-th iteration as

$$\Delta V_t \triangleq (\nabla_\theta \tilde{U}_t - \nabla_\theta U) \cdot \nabla_p. \qquad (7)$$

Theorem 1 and Theorem 2 summarize the convergence of an SG-MCMC algorithm with a $K$-th order integrator with respect to the bias and MSE, under certain assumptions. Please refer to (Chen, Ding, and Carin 2015) for detailed proofs.

Theorem 1. Let $\|\cdot\|$ be the operator norm. The bias of an SG-MCMC algorithm with a $K$-th order integrator at time $\mathcal{T} = Th$ can be bounded as:

$$\left|\mathbb{E}\hat{\phi} - \bar{\phi}\right| = O\!\left(\frac{1}{Th} + \frac{\sum_t \|\mathbb{E}\Delta V_t\|}{T} + h^K\right).$$

Theorem 2. For a smooth test function $\phi$, the MSE of an SG-MCMC algorithm with a $K$-th order integrator at time $\mathcal{T} = Th$ is bounded, for some $C > 0$ independent of $(T, h)$, as

$$\mathbb{E}\left(\hat{\phi} - \bar{\phi}\right)^2 \le C\left(\frac{\sum_t \mathbb{E}\|\Delta V_t\|^2}{T^2} + \frac{1}{Th} + h^{2K}\right).$$

We simplify Theorem 1 and Theorem 2 to Lemma 3, where the functions $B_{\mathrm{bias}} \triangleq O\!\big(\frac{1}{Th} + \frac{\sum_t \|\mathbb{E}\Delta V_t\|}{T}\big)$ and $B_{\mathrm{mse}} \triangleq C\big(\frac{\sum_t \mathbb{E}\|\Delta V_t\|^2}{T^2} + \frac{1}{Th}\big)$ are independent of the order of the integrator. In addition, from Theorem 1 and Theorem 2 we can obtain the optimal convergence rate of an SG-MCMC algorithm with respect to the bias and MSE by optimizing the bounds. Specifically, the optimal convergence rate of the bias for the Euler integrator is $T^{-1/2}$, attained with optimal stepsize $h \propto T^{-1/2}$, while it is $T^{-2/3}$ for the symmetric splitting integrator, with optimal stepsize $h \propto T^{-1/3}$. For the MSE, the rate for the Euler integrator is $T^{-2/3}$ with optimal stepsize $h \propto T^{-1/3}$, compared to $T^{-4/5}$ with optimal stepsize $h \propto T^{-1/5}$ for the symmetric splitting integrator.

D Latent Dirichlet Allocation

Following (Ding et al. 2014), this section describes the semi-collapsed posterior of the LDA model and the Expanded-Natural representation of the probability simplexes used in (Patterson and Teh 2013). Let $W = \{w_{jv}\}$ be the observed words and $Z = \{z_{jv}\}$ the topic indicator variables, where $j$ indexes the documents and $v$ indexes the words. Let $\pi = (\pi_{kw})$ be the topic-word distributions, and $n_{jkw}$ the number of occurrences of word $w$ in document $j$ allocated to topic $k$; a dot denotes a marginal sum, i.e., $n_{jk\cdot} = \sum_w n_{jkw}$. The semi-collapsed posterior of the LDA model is

$$p(W, Z, \pi \mid \alpha, \tau) = p(\pi \mid \tau) \prod_{j=1}^{J} p(w_j, z_j \mid \alpha, \pi), \qquad (8)$$

where $J$ is the number of documents, $\alpha$ is the parameter of the Dirichlet prior on the topic distribution of each document, $\tau$ is the parameter of the prior on $\pi$, and

$$p(w_j, z_j \mid \alpha, \pi) = \prod_{k=1}^{K} \frac{\Gamma(\alpha + n_{jk\cdot})}{\Gamma(\alpha)} \prod_{w=1}^{W} \pi_{kw}^{n_{jkw}}. \qquad (9)$$

The Expanded-Natural representation of the simplexes $\pi_k$ from (Patterson and Teh 2013) is used, where

$$\pi_{kw} = \frac{e^{\theta_{kw}}}{\sum_{w'} e^{\theta_{kw'}}}. \qquad (10)$$

Following (Ding et al. 2014), a Gaussian prior on $\theta_{kw}$ is adopted, $p(\theta_{kw} \mid \tau = \{\beta, \sigma\}) = \mathcal{N}(\theta_{kw}; \beta, \sigma^2)$. The stochastic gradient of the log-posterior with respect to $\theta_{kw}$ on a mini-batch $S_t$ becomes

$$\frac{\partial \tilde{U}(\theta)}{\partial \theta_{kw}} = -\frac{\partial \log p(\theta \mid W, \tau, \alpha)}{\partial \theta_{kw}} = \frac{\theta_{kw} - \beta}{\sigma^2} - \frac{J}{|S_t|} \sum_{j \in S_t} \mathbb{E}_{z_j \mid w_j, \theta, \alpha}\left[n_{jkw} - \pi_{kw}\, n_{jk\cdot}\right] \qquad (11)$$

for the $t$-th iteration, where $S_t \subset \{1, 2, \ldots, J\}$ and $|\cdot|$ is the cardinality of a set. When $\sigma = 1$, we obtain the same stochastic gradient as in (Patterson and Teh 2013), derived there using the Riemannian manifold. To calculate the expectation term, we use the same Gibbs sampling method as in (Patterson and Teh 2013),

$$p(z_{jv} = k \mid w_j, \theta, \alpha) = \frac{\left(\alpha + n_{jk\cdot}^{\backslash v}\right) e^{\theta_{k w_{jv}}}}{\sum_{k'} \left(\alpha + n_{jk'\cdot}^{\backslash v}\right) e^{\theta_{k' w_{jv}}}}, \qquad (12)$$

where $\backslash v$ denotes the count excluding the $v$-th topic assignment variable. The expectation is estimated from the samples.
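To illustrate how (10) and (11) combine in practice, here is a minimal NumPy sketch of the mini-batch gradient, assuming the expectations $\mathbb{E}[n_{jkw}]$ and $\mathbb{E}[n_{jk\cdot}]$ have already been estimated from Gibbs samples drawn according to (12); the function name and argument layout are ours, for illustration only.

```python
import numpy as np

def lda_theta_grad(theta, beta, sigma2, E_njkw, E_njk, J):
    """Stochastic gradient of U w.r.t. theta (K x V), following Eq. (11).

    E_njkw : (batch, K, V) expected counts E[n_jkw] from Gibbs samples, Eq. (12).
    E_njk  : (batch, K)    expected marginal counts E[n_jk.].
    J      : total number of documents (the batch term is rescaled by J / |S_t|).
    """
    # Eq. (10): row-wise softmax, stabilized by subtracting the row maximum.
    pi = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    # Data term of Eq. (11), summed over the mini-batch.
    data_term = (E_njkw - E_njk[:, :, None] * pi[None, :, :]).sum(axis=0)
    batch_size = E_njkw.shape[0]
    return (theta - beta) / sigma2 - (J / batch_size) * data_term
```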
Running time. For the results of LDA on the ICML dataset reported in the main text, mSGNHT, SGNHT, and Gibbs sampling are run for the same number of iterations; the running times are 6.99, 8.55 and 56.2 seconds, respectively.

E Logistic Regression

For each datum $\{x, y\}$, $x \in \mathbb{R}^P$ is the input and $y \in \{0, 1\}$ is the label. Logistic regression models the probability

$$p(y = 1 \mid x) \triangleq g_\theta(x) = \frac{1}{1 + \exp\left(-(Wx + c)\right)}. \qquad (13)$$

A Gaussian prior $\mathcal{N}(0, \sigma^2 I)$ is placed on the model parameters $\theta = \{W, c\}$; we set $\sigma^2 = 1$ in our experiments. We study the effectiveness of mSGNHT for different $h$ and $D$. Learning curves of test errors and training log-likelihoods are shown in Fig. 1. Generally, the performance of the proposed mSGNHT-S is consistently more stable than that of mSGNHT-E and SGHMC, across varying $h$ and $D$.
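As a worked example of the potential used here, the following sketch computes the mini-batch gradient of $U$ (the negative log-posterior) for the model (13); the names are ours, and the gradient is written for a weight vector $W \in \mathbb{R}^P$ and a scalar bias $c$.

```python
import numpy as np

def logreg_grad_u(W, c, X, y, sigma2, J):
    """Mini-batch gradient of U = -log posterior for Eq. (13).

    X : (n, P) mini-batch inputs; y : (n,) labels in {0, 1};
    J : full dataset size, so the likelihood term is rescaled by J / n.
    """
    prob = 1.0 / (1.0 + np.exp(-(X @ W + c)))   # g_theta(x) for each row of X
    err = prob - y                              # gradient of -log p(y|x) w.r.t. the logit
    grad_W = (J / len(y)) * (X.T @ err) + W / sigma2   # likelihood term + Gaussian prior
    grad_c = (J / len(y)) * err.sum() + c / sigma2
    return grad_W, grad_c
```

Gradients of this form are what the sampler sketched after the proof of Lemma 2 consumes as `grad_u_tilde`, with $(W, c)$ flattened into a single parameter vector.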
Figure 1: Learning curves for logistic regression on the a9a dataset for varying $h$ and $D$. Column 1 shares the same $h$; columns 2–4 share the same $D$. Top row: test error (%); bottom row: training log-likelihood.

Figure 2: Learning curves for FNN (ReLU link) on the MNIST dataset for varying stepsize $h \in \{2 \times 10^{-4}, 10^{-4}, 5 \times 10^{-5}\}$. Step size decreases top-down.

Figure 3: Learning curves for FNN (sigmoid link) on the MNIST dataset for varying network depth. Depth increases top-down.

Furthermore, mSGNHT-S converges faster than mSGNHT-E, especially at the beginning of learning. Comparing within column 1 of Fig. 1 (fixing $h$, varying $D$), mSGNHT-S is more robust to the choice of the diffusion factor $D$. Comparing columns 2–4 (fixing $D$, varying $h$), larger $h$ potentially brings larger gradient-estimation errors and numerical errors; mSGNHT-S significantly outperforms the other methods when $h$ is large.

F Deep Neural Networks

F.1 Sensitivity to Step-size

For feedforward neural networks (FNN) with a 2-layer model of 400-400 ReLU units, we study the performance of mSGNHT for different $h$, with $D = 5$. We test a wide range of $h$, and show in Fig. 2 the learning curves of test accuracy and training negative log-likelihood for $h = 2 \times 10^{-4}$, $10^{-4}$, and $5 \times 10^{-5}$. mSGNHT-S is less sensitive to the step size: it maintains fast convergence when $h$ is large, while mSGNHT-E significantly slows down.
Figure 4: Learning curves (test accuracy, %) of CNN for different step sizes.

F.2 Sigmoid Activation Function

We compare the different methods in the case of the sigmoid link. Similar to the main paper, we test FNNs of varying depth, e.g., $\{1, 2, 3\}$ layers. We fix the diffusion factor $D$ and set $h = 10^{-3}$. Fig. 3 displays the learning curves of test accuracy and training negative log-likelihood. The gap between mSGNHT-S and mSGNHT-E becomes larger in deeper models. When a 3-layer network is employed, mSGNHT-E fails in the first 5 epochs, whilst mSGNHT-S converges well. Moreover, mSGNHT-S converges more accurately, yielding an accuracy of 97.36%, while mSGNHT-E gives 96.86%. This manifests the importance of numerical accuracy in SG-MCMC for practical applications, and mSGNHT-S is desirable in this regard.

F.3 Comparison with Other Methods

Based on a network of size 400-400 with ReLU units, we also compare with a recent state-of-the-art inference method for neural networks, Bayes by Backprop (BBB) (Blundell et al. 2015), using $h = 2 \times 10^{-4}$ and $D = 6$. The comparison is given in Table 1; the BBB and SGD results are taken from (Blundell et al. 2015).

Table 1: Classification accuracy on MNIST.

Method   Accuracy (%)
BBB      98.18
SGD      98.17

F.4 Convolutional Neural Networks

We also perform a comparison with a standard convolutional neural network, LeNet (Jarrett et al. 2009), on the MNIST dataset. It is a 2-layer convolutional network followed by a 2-layer fully-connected FNN, each fully-connected layer containing 200 hidden nodes with ReLU activations. Both convolutional layers use a 5×5 filter size with 32 and 64 channels, respectively, and 2×2 max pooling is used after each convolutional layer. 40 epochs are used, and $L$ is set to 2. In Fig. 4, we test the step sizes $h = 10^{-4}$ and $h = 5 \times 10^{-4}$ for mSGNHT-S and mSGNHT-E, with $D = 5$. Again, under the same network architecture, the CNN trained with mSGNHT-S converges faster than with mSGNHT-E. In particular, when a large step size is used, mSGNHT-S yields a significant improvement over mSGNHT-E.

G Deep Poisson Factor Analysis

G.1 Model Specification

We first provide the model details of Deep Poisson Factor Analysis (DPFA) (Gan et al. 2015). Given a discrete matrix $W \in \mathbb{Z}_+^{V \times J}$ containing counts from $J$ documents and $V$ words, Poisson factor analysis (PFA) (Zhou et al. 2012) assumes the entries of $W$ are summations of $K$ latent counts, each produced by a latent factor (in the case of topic modeling, a hidden topic). For $W$, the generative process of DPFA with an $L$-layer Sigmoid Belief Network (SBN) is as follows:

$$p\big(h_{k_L}^{(L)} = 1\big) = g\big(b_{k_L}^{(L)}\big), \qquad (14)$$
$$p\big(h_n^{(l)} = 1 \mid h_n^{(l+1)}\big) = g\big((w^{(l)})^\top h_n^{(l+1)} + c^{(l)}\big), \qquad (15)$$
$$W \sim \mathrm{Pois}\big(\Phi (\Psi \odot h^{(1)})\big), \qquad (16)$$

where $g(\cdot)$ is the sigmoid link. Equations (14) and (15) define the Deep Sigmoid Belief Network (DSBN). $\Phi$ is the factor loading matrix; each column of $\Phi$, $\phi_k \in \Delta_V$, encodes the relative importance of each word in topic $k$, with $\Delta_V$ representing the $V$-dimensional simplex. $\Psi \in \mathbb{R}_+^{K \times J}$ is the factor score matrix; each column $\psi_j$ contains the relative topic intensities specific to document $j$. $h^{(l)} \in \{0, 1\}^K$ is a latent binary feature vector, which defines whether certain topics are associated with a document. The factor scores for document $j$ at the bottom layer are the element-wise product $\psi_j \odot h_j^{(1)}$. DPFA is constructed by placing Dirichlet priors on $\phi_k$ and gamma priors on $\psi_j$. This is summarized as

$$x_{vj} = \sum_{k=1}^{K} x_{vjk}, \qquad x_{vjk} \sim \mathrm{Pois}\big(\phi_{vk}\, \psi_{kj}\, h_{kj}\big), \qquad (17)$$

with the priors specified as $\phi_k \sim \mathrm{Dir}(a_\phi, \ldots, a_\phi)$, $\psi_{kj} \sim \mathrm{Gamma}\big(r_k,\, p_j / (1 - p_j)\big)$, $r_k \sim \mathrm{Gamma}(\gamma_0, 1/c_0)$, and $\gamma_0 \sim \mathrm{Gamma}(e_0, 1/f_0)$.
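The generative process (14)–(15) is straightforward to state in code; below is a minimal sketch that draws the bottom-layer binary features $h^{(1)}$ from the DSBN prior. The weight shapes follow our reading of (15), with $w^{(l)}$ of shape $(K_{l+1}, K_l)$; all names are illustrative.

```python
import numpy as np

def sample_dsbn(Ws, cs, b_top, rng=np.random.default_rng()):
    """Draw h^(1) from the DSBN prior, Eqs. (14)-(15).

    Ws    : weight matrices [w^(1), ..., w^(L-1)], w^(l) of shape (K_{l+1}, K_l).
    cs    : bias vectors    [c^(1), ..., c^(L-1)], c^(l) of shape (K_l,).
    b_top : top-layer bias b^(L) of shape (K_L,).
    """
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    # Eq. (14): the top layer has no parent, only the bias b^(L).
    h = (rng.random(b_top.shape) < sigmoid(b_top)).astype(float)
    # Eq. (15): propagate down through layers L-1, ..., 1.
    for w, c in zip(reversed(Ws), reversed(cs)):
        h = (rng.random(c.shape) < sigmoid(w.T @ h + c)).astype(float)
    return h
```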
G.2 Model Inference

Following (Gan et al. 2015), we use stochastic gradient Riemannian Langevin dynamics (SGRLD) (Patterson and Teh 2013) to sample the topic-word distributions $\{\phi_k\}$; mSGNHT is used to sample the parameters of the DSBN, i.e., $\theta = (w^{(l)}, c^{(l)}, b^{(L)})$, where $l = 1, \ldots, L$. Specifically, the stochastic gradients with respect to $w^{(l)}$ and $c^{(l)}$, evaluated on a mini-batch of data (denote $S$ as the index set of the mini-batch), are calculated as

$$\frac{\partial \tilde{U}}{\partial w^{(l)}} = \frac{J}{|S|} \sum_{i \in S} \mathbb{E}_{h_i^{(l)}, h_i^{(l+1)}}\left[\big(\sigma_i^{(l)} - h_i^{(l)}\big)\big(h_i^{(l+1)}\big)^\top\right], \qquad (18)$$
$$\frac{\partial \tilde{U}}{\partial c^{(l)}} = \frac{J}{|S|} \sum_{i \in S} \mathbb{E}_{h_i^{(l)}, h_i^{(l+1)}}\left[\sigma_i^{(l)} - h_i^{(l)}\right], \qquad (19)$$

where $\sigma_i^{(l)} = \sigma\big((w^{(l)})^\top h_i^{(l+1)} + c^{(l)}\big)$, and the expectation is taken over the posterior. Monte Carlo integration is used to approximate this quantity.
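A Monte Carlo estimate of (18)–(19) can be sketched as follows, with posterior samples of $h_i^{(l)}$ and $h_i^{(l+1)}$ assumed to be given (e.g., from the Gibbs conditionals of the DSBN); the gradient for $w^{(l)}$ is returned in the $(K_{l+1}, K_l)$ storage layout used in the sketch after Section G.1, and all names are ours.

```python
import numpy as np

def sbn_grads(w, c, h_l, h_lp1, J):
    """Monte Carlo estimate of Eqs. (18)-(19) for one DSBN layer l.

    w     : (K_{l+1}, K_l) weights; c : (K_l,) biases.
    h_l   : (n, M, K_l)     M posterior samples of h_i^(l) for n mini-batch items.
    h_lp1 : (n, M, K_{l+1}) matching samples of h_i^(l+1).
    """
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    sig = sigmoid(h_lp1 @ w + c)        # sigma_i^(l) per sample, shape (n, M, K_l)
    diff = sig - h_l                    # integrand of Eqs. (18)-(19)
    n, M = h_l.shape[0], h_l.shape[1]
    # Average over the M samples, sum over the batch, rescale by J / |S|.
    grad_w = (J / n) * np.einsum('imk,imj->jk', diff, h_lp1) / M
    grad_c = (J / n) * diff.mean(axis=1).sum(axis=0)
    return grad_w, grad_c
```

With these gradients for the DSBN block and the SGRLD updates for $\{\phi_k\}$, the full DPFA sampler alternates between the two parameter blocks.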
References

Blundell, C.; Cornebise, J.; Kavukcuoglu, K.; and Wierstra, D. 2015. Weight uncertainty in neural networks. In ICML.

Chen, C.; Ding, N.; and Carin, L. 2015. On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In NIPS.

Ding, N.; Fang, Y.; Babbush, R.; Chen, C.; Skeel, R. D.; and Neven, H. 2014. Bayesian sampling using stochastic gradient thermostats. In NIPS.

Gan, Z.; Chen, C.; Henao, R.; Carlson, D.; and Carin, L. 2015. Scalable deep Poisson factor analysis for topic modeling. In ICML.

Jarrett, K.; Kavukcuoglu, K.; Ranzato, M.; and LeCun, Y. 2009. What is the best multi-stage architecture for object recognition? In ICCV.

Patterson, S., and Teh, Y. W. 2013. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In NIPS.

Rossmann, W. 2002. Lie Groups: An Introduction Through Linear Groups. Oxford Graduate Texts in Mathematics, Oxford Science Publications.

Zhou, M.; Hannah, L.; Dunson, D. B.; and Carin, L. 2012. Beta-negative binomial process and Poisson factor analysis. In AISTATS.