Supplementary Material: Learning Structured Weight Uncertainty in Bayesian Neural Networks


Shengyang Sun (Tsinghua University), Changyou Chen (Duke University), Lawrence Carin (Duke University)

A  When f is from the Prior

Suppose $f$ is a prior term and $\mathcal{N}(w; m, V)$ is our current multivariate normal posterior approximation. According to Bayes' rule, the updated posterior $s(w)$ is
$$
s(w) = Z^{-1} f(w)\, \mathcal{N}(w; m, V).
$$
Because $f(w)$ can be very complicated, computing $s(w)$ exactly can be hard or even impossible, so we approximate it with another multivariate Gaussian. Since a multivariate Gaussian has only two parameters, $m_{\mathrm{new}}$ and $V_{\mathrm{new}}$, we only need the mean and covariance matrix of the new posterior $s(w)$. Following Minka (2001), these are
$$
m_{\mathrm{new}} = m + V\, \nabla_m \log Z, \qquad
V_{\mathrm{new}} = V - V \big( \nabla_m \log Z\, \nabla_m \log Z^{\top} - 2\, \nabla_V \log Z \big) V.
$$
Therefore, if we can compute the normalizer $Z$, we can update the multivariate Gaussian analytically. Take $\{m_l, \Sigma_l\}$ for example. The prior is $\mathcal{N}(v_l; 0, \phi^{-1} I_{V_l})$, so
$$
\begin{aligned}
Z &= \int \mathcal{N}(v_l; 0, \phi^{-1} I_{V_l})\, q(\Theta)\, d\Theta \\
  &= \iint \mathcal{N}(v_l; 0, \phi^{-1} I_{V_l})\, \mathcal{N}(v_l; m_l, \Sigma_l)\, \mathrm{Gam}(\phi; \alpha_\phi, \beta_\phi)\, dv_l\, d\phi \\
  &= \int \mathrm{St}\!\Big(v_l;\, 0,\, \tfrac{\beta_\phi}{\alpha_\phi} I_{V_l},\, 2\alpha_\phi\Big)\, \mathcal{N}(v_l; m_l, \Sigma_l)\, dv_l \\
  &\approx \int \mathcal{N}\!\Big(v_l;\, 0,\, \tfrac{\beta_\phi}{\alpha_\phi - 1} I_{V_l}\Big)\, \mathcal{N}(v_l; m_l, \Sigma_l)\, dv_l \\
  &= \mathcal{N}\!\Big(m_l;\, 0,\, \tfrac{\beta_\phi}{\alpha_\phi - 1} I_{V_l} + \Sigma_l\Big).
\end{aligned}
$$
Here $\mathrm{St}(x; \mu, \Sigma, \nu)$ denotes the multivariate Student-t distribution; in the fourth line it is approximated by a multivariate Gaussian with the same mean and covariance. The third line uses the identity
$$
\int \mathrm{Gam}(\lambda; \alpha, \beta)\, \mathcal{N}(x; \mu, \lambda^{-1}\Sigma)\, d\lambda = \mathrm{St}\!\Big(x;\, \mu,\, \tfrac{\beta}{\alpha}\Sigma,\, 2\alpha\Big),
$$
and the last line applies the formula
$$
\int \mathcal{N}(x; \mu_f, \Sigma_f)\, \mathcal{N}(x; \mu_g, \Sigma_g)\, dx = \mathcal{N}(\mu_f; \mu_g, \Sigma_f + \Sigma_g).
$$
The multivariate formulas above also specialize to the one-dimensional Gaussian case. Inserting this result for $Z$ into the update equations above, we can update the mean and variance of any Gaussian-distributed variable in the model.

B  When f is from the Likelihood

When $f$ comes from the likelihood, for a Gaussian variational distribution we need to compute the normalizer
$$
Z = \int \mathcal{N}\big(y_n;\, f(x_n, W, P, Q),\, \gamma^{-1} I_{V_L}\big)\, q(\Theta)\, d\Theta.
$$
Assume $z_{nL} = f(x_n, W, P, Q)$ has an approximate Gaussian distribution with mean $m_{z_{nL}}$ and variance vector $v_{z_{nL}}$. Then the normalizer can be computed as
$$
\begin{aligned}
Z &= \iint \mathcal{N}\big(y_n;\, z_{nL},\, \gamma^{-1} I_{V_L}\big)\, \mathcal{N}\big(z_{nL};\, m_{z_{nL}},\, \mathrm{diag}(v_{z_{nL}})\big)\, \mathrm{Gam}(\gamma; \alpha_\gamma, \beta_\gamma)\, dz_{nL}\, d\gamma \\
  &= \int \mathrm{St}\!\Big(y_n;\, z_{nL},\, \tfrac{\beta_\gamma}{\alpha_\gamma},\, 2\alpha_\gamma\Big)\, \mathcal{N}\big(z_{nL};\, m_{z_{nL}},\, v_{z_{nL}}\big)\, dz_{nL} \\
  &\approx \mathcal{N}\!\Big(y_n;\, m_{z_{nL}},\, \tfrac{\beta_\gamma}{\alpha_\gamma - 1} + v_{z_{nL}}\Big),
\end{aligned}
$$
where the Student-t is again replaced by a Gaussian with matched mean and covariance. At this point we still do not know the mean $m_{z_{nL}}$ and variance vector $v_{z_{nL}}$; they are obtained from the feed-forward structure of the network, as described below.
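For concreteness, the moment-matching update of Section A can be written as a short NumPy sketch. It assumes the gradients of $\log Z$ with respect to $m$ and $V$ are supplied by the caller (computed analytically as above, or via automatic differentiation); the function and variable names are illustrative only, not the released implementation.

```python
# Minimal sketch of the Gaussian moment-matching update (Section A):
#   m_new = m + V * d(log Z)/dm
#   V_new = V - V * ( d(log Z)/dm d(log Z)/dm^T - 2 d(log Z)/dV ) * V
import numpy as np

def update_gaussian_posterior(m, V, grad_m_logZ, grad_V_logZ):
    """Update N(w; m, V) after absorbing one factor with normalizer Z."""
    g = grad_m_logZ                                   # (d,) gradient w.r.t. the mean
    m_new = m + V @ g
    V_new = V - V @ (np.outer(g, g) - 2.0 * grad_V_logZ) @ V
    return m_new, V_new
```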

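Similarly, once the output moments $m_{z_{nL}}$ and $v_{z_{nL}}$ are available, the likelihood-factor normalizer of Section B reduces to evaluating a Gaussian density for each output dimension. A minimal sketch, assuming a diagonal predictive covariance; all names are illustrative:

```python
# Sketch of the likelihood-factor normalizer of Section B:
#   Z ~= N(y_n; m_zL, beta_gamma/(alpha_gamma - 1) + v_zL), per output dimension.
import numpy as np

def log_normalizer_likelihood(y, m_zL, v_zL, alpha_gamma, beta_gamma):
    var = beta_gamma / (alpha_gamma - 1.0) + v_zL     # approximate predictive variance
    return np.sum(-0.5 * np.log(2.0 * np.pi * var) - 0.5 * (y - m_zL) ** 2 / var)
```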
To obtain these moments, we propagate the distributions forward through the neural network and, where necessary, approximate an intermediate distribution with a Gaussian. We assume the output $z_{nl}$ of layer $l$ follows a diagonal Gaussian distribution with mean $m_{z_{nl}}$ and variance vector $v_{z_{nl}}$. Note that assuming independence across neurons does not interfere with the dependence among the parameters. In each layer of our model, the information passes through three stages: multiplication by $Q_l^{\top}$, multiplication by $B_l$, and multiplication by $P_l$.

After the first stage, $z^{Q}_{nl} = Q_l^{\top} z_{n,l-1}$, where $Q_l = I - \frac{2 v v^{\top}}{v^{\top} v}$ and $v \sim \mathcal{N}(m_l, \Sigma_l)$. Since $Q_l$ is an orthogonal matrix, the covariance matrix of $z^{Q}_{nl}$ equals that of $z_{n,l-1}$. As we assume $z_{n,l-1}$ is a multivariate Gaussian random variable with a diagonal covariance matrix, $z^{Q}_{nl}$ is still a diagonal multivariate Gaussian with variance vector $v_{z_{n,l-1}}$. Thus the mean and variance vector of $z^{Q}_{nl}$ are
$$
m_{z^{Q}_{nl}} = \mathbb{E}\Big[I - \tfrac{2 v v^{\top}}{v^{\top} v}\Big]\, m_{z_{n,l-1}}, \qquad
v_{z^{Q}_{nl}} = v_{z_{n,l-1}},
$$
where the expectation is approximated by Monte Carlo sampling. Since $\Sigma$ is positive definite, we perform a Cholesky decomposition $\Sigma = K K^{\top}$. Setting $k = K^{-1}(v - m)$, so that $k \sim \mathcal{N}(0, I)$, the expectation becomes
$$
\mathbb{E}\bigg[\frac{(K k + m)(K k + m)^{\top}}{(K k + m)^{\top}(K k + m)}\bigg]
\approx \frac{1}{M} \sum_{i=1}^{M} \frac{(K k_i + m)(K k_i + m)^{\top}}{(K k_i + m)^{\top}(K k_i + m)}.
$$
Based on this, it is convenient to calculate derivatives with respect to $m$ and $K$. Note that we need the derivatives of $Z$ with respect to $m$ and $\Sigma$ in order to obtain the updated mean and covariance matrix; the reparameterization trick provides an efficient way to do this. Specifically, because $\Sigma = K K^{\top}$, the derivative with respect to $\Sigma$ is
$$
\nabla_{\Sigma} \log Z = \tfrac{1}{2}\, \nabla_{K} \log Z\, K^{-1}.
$$
At the second stage, $z^{W}_{nl} = W_l z^{Q}_{nl} / \sqrt{V_{l-1} + 1}$. The mean and variance vector of $z^{W}_{nl}$ are
$$
m_{z^{W}_{nl}} = M_l\, m_{z^{Q}_{nl}} / \sqrt{V_{l-1} + 1}, \qquad
v_{z^{W}_{nl}} = \big[ (M_l \circ M_l)\, v_{z^{Q}_{nl}} + \Sigma_l (m_{z^{Q}_{nl}} \circ m_{z^{Q}_{nl}}) + \Sigma_l\, v_{z^{Q}_{nl}} \big] / (V_{l-1} + 1),
$$
where $M_l$ and $\Sigma_l$ are $V_l \times (V_{l-1}+1)$ matrices whose entries are given by $m_{ij}$ and $\lambda_{ij}$, respectively, and $\circ$ denotes the Hadamard (elementwise) product.

The last stage is similar to the first, as $P_l$ and $Q_l$ have the same kind of distribution. Let $a_{nl} = P_l z^{W}_{nl}$. Since $P_l = I - \frac{2 v v^{\top}}{v^{\top} v}$ with $v \sim \mathcal{N}(m, \Sigma)$, the mean and variance vector of $a_{nl}$ are
$$
m_{a_{nl}} = \mathbb{E}\Big[I - \tfrac{2 v v^{\top}}{v^{\top} v}\Big]\, m_{z^{W}_{nl}}, \qquad
v_{a_{nl}} = v_{z^{W}_{nl}}.
$$
After these multiplications by the parameter matrices, the information in $a_{nl}$ passes through the neurons of layer $l$. Let $b_{nl} = \mathrm{ReLU}(a_{nl})$; the mean and variance of the $i$-th entry of $b_{nl}$ are
$$
m_{b_{nl},i} = \Phi(\alpha_i)\, v'_i, \qquad
v_{b_{nl},i} = m_{b_{nl},i}\, v'_i\, \Phi(-\alpha_i) + \Phi(\alpha_i)\, v_{a_{nl},i}\,\big(1 - \gamma_i(\gamma_i + \alpha_i)\big),
$$
where $v'_i = m_{a_{nl},i} + \sqrt{v_{a_{nl},i}}\, \gamma_i$, $\alpha_i = m_{a_{nl},i} / \sqrt{v_{a_{nl},i}}$, $\gamma_i = \phi(-\alpha_i)/\Phi(\alpha_i)$, and $\Phi$ and $\phi$ are the CDF and the density function of a standard Gaussian distribution. When $\alpha_i$ is very large and negative, this definition of $\gamma_i$ is not numerically stable; in that regime we instead use the approximation $\gamma_i = -\alpha_i - \alpha_i^{-1} + 2\alpha_i^{-3}$.

The output $z_{nl}$ of the $l$-th layer is obtained by concatenating $b_{nl}$ with a constant 1 for the bias. We therefore approximate the distribution of $z_{nl}$ as Gaussian with marginal mean and variance
$$
m_{z_{nl}} = [\, m_{b_{nl}} ;\, 1\, ], \qquad v_{z_{nl}} = [\, v_{b_{nl}} ;\, 0\, ].
$$
This gives the approximating distribution of $z_{nl}$; performing this sequential update along the neural network yields the final distribution parameters $m_{z_{nL}}$ and $v_{z_{nL}}$.
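A minimal sketch of the Monte Carlo estimate of $\mathbb{E}[I - 2vv^{\top}/(v^{\top}v)]$ with the Cholesky reparameterization $v = Kk + m$ described above; the sample count and all names are assumptions for illustration:

```python
# Sketch: Monte Carlo estimate of E[I - 2 v v^T / (v^T v)] for v ~ N(m, Sigma),
# using the reparameterization v = K k + m with Sigma = K K^T and k ~ N(0, I).
import numpy as np

def expected_householder(m, Sigma, num_samples=100, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    d = m.shape[0]
    K = np.linalg.cholesky(Sigma)                 # Sigma = K K^T
    acc = np.zeros((d, d))
    for _ in range(num_samples):
        v = K @ rng.standard_normal(d) + m        # one reparameterized sample of v
        acc += np.outer(v, v) / (v @ v)           # v v^T / (v^T v)
    return np.eye(d) - 2.0 * acc / num_samples
```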

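The deterministic part of the forward pass, pushing a diagonal Gaussian through the $W$ stage and the ReLU, can be sketched in a few lines. The formulas mirror the expressions above; the SciPy dependency, the numerical cutoff for the asymptotic form of $\gamma_i$, and all names are illustrative assumptions:

```python
# Sketch of per-layer moment propagation: the W stage and the ReLU moments.
import numpy as np
from scipy.stats import norm

def linear_moments(m_z, v_z, M, Lam):
    """Moments of z^W = W z / sqrt(V_prev + 1), where W_ij ~ N(M_ij, Lam_ij)
    and z is diagonal Gaussian with mean m_z and variance vector v_z."""
    scale = M.shape[1]                            # V_{l-1} + 1 (input width incl. bias)
    m_out = M @ m_z / np.sqrt(scale)
    v_out = ((M * M) @ v_z + Lam @ (m_z * m_z) + Lam @ v_z) / scale
    return m_out, v_out

def relu_moments(m_a, v_a, alpha_cutoff=-30.0):
    """Entrywise mean/variance of b = max(0, a) for a ~ N(m_a, diag(v_a))."""
    s = np.sqrt(v_a)
    alpha = m_a / s
    # gamma = phi(-alpha) / Phi(alpha); switch to the asymptotic form when alpha
    # is very negative and the ratio becomes unstable (cutoff is illustrative).
    gamma = np.where(alpha > alpha_cutoff,
                     norm.pdf(alpha) / np.clip(norm.cdf(alpha), 1e-300, None),
                     -alpha - 1.0 / alpha + 2.0 / alpha ** 3)
    v_prime = m_a + s * gamma
    m_b = norm.cdf(alpha) * v_prime
    v_b = (m_b * v_prime * norm.cdf(-alpha)
           + norm.cdf(alpha) * v_a * (1.0 - gamma * (gamma + alpha)))
    return m_b, v_b
```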
C  Extra Experimental Results

C.1  Extra results for nonlinear regression

We show an extra toy regression experiment that follows the setup in Louizos and Welling (2016): data points $x_n$ are generated at random in a fixed interval, and the output for each data point is modeled as $y_n = x_n^3 + \epsilon_n$ with $\epsilon_n \sim \mathcal{N}(0, 9)$. We fit an FNN with one hidden layer to this data; in our method, Monte Carlo samples are drawn to approximate the expectation described in Section B. The resulting fits for PBP and our model are shown in Figure 5. For the real datasets, we plot all the learning curves for the nonlinear regression task in Figures 6 and 7, in terms of RMSE and log-likelihood, respectively.

C.2  Bayesian DNN for classification

Extending the original PBP framework (Hernández-Lobato and Adams, 2015), we provide some preliminary results on using PBP_MV for classification; the extension of PBP_MV to classification is described in the main text. We train a three-layer DNN on the MNIST dataset. During training, we use PBP to update the weights with MVG priors in the lower layers of the DNN, and use Adam to optimize the weights of the top (softmax) layer. The final error rate is better than that of purely training a DNN with stochastic optimization algorithms (Su et al.), but worse than the state-of-the-art Bayesian DNN model (Louizos and Welling, 2016). Note that the comparison with the latter is not entirely fair, because it uses minibatches to update parameters in each iteration, whereas in PBP_MV the minibatch size is effectively one. This motivates extending PBP_MV with minibatch updating, an interesting direction for future work.

Figure 5: The toy regression experiment for (a) PBP and (b) our model PBP_MV. The observations are shown as black dots; the blue line represents the true data-generating function, and the mean predictions are shown as a green line. The light gray shaded area is the ± standard-deviation confidence interval.

Figure 6: Learning curves in terms of RMSE vs. running time on the 10 datasets. From top down and left to right, the datasets are: Boston housing, concrete, energy, kin8nm, Naval, CCPP, wine quality, yacht, protein, and YearPrediction.

Figure 7: Learning curves in terms of log-likelihood vs. running time on the 10 datasets. From top down and left to right, the datasets are: Boston housing, concrete, energy, kin8nm, Naval, CCPP, wine quality, yacht, protein, and YearPrediction.