Supplementary Material: Learning Structured Weight Uncertainty in Bayesian Neural Networks


Shengyang Sun (Tsinghua University), Changyou Chen (Duke University), Lawrence Carin (Duke University)

A  When f is from the Prior

Suppose $f$ is a prior term and $\mathcal{N}(w; m, V)$ is our current multivariate normal posterior approximation. According to Bayes' rule, the updated posterior $s(w)$ is
$$
s(w) = Z^{-1} f(w)\, \mathcal{N}(w; m, V).
$$
Because $f(w)$ can be very complicated, computing $s(w)$ exactly can be hard or even impossible, so we approximate it with another multivariate Gaussian. Since a multivariate Gaussian has only two parameters, $m_{\mathrm{new}}$ and $V_{\mathrm{new}}$, we only need the mean and covariance matrix of the new posterior $s(w)$. Following Minka (2001), these are
$$
m_{\mathrm{new}} = m + V\, \nabla_m \log Z, \qquad
V_{\mathrm{new}} = V - V \big( \nabla_m \log Z\, \nabla_m \log Z^{\top} - 2\, \nabla_V \log Z \big) V.
$$
Therefore, if we can compute the normalizer $Z$, we can update the multivariate Gaussian analytically. Take $\{m_l, \Sigma_l\}$ for example. The prior is $\mathcal{N}(v_l; 0, \phi^{-1} I_{V_l})$, so
$$
\begin{aligned}
Z &= \int \mathcal{N}(v_l; 0, \phi^{-1} I_{V_l})\, q(\Theta)\, d\Theta \\
  &= \iint \mathcal{N}(v_l; 0, \phi^{-1} I_{V_l})\, \mathcal{N}(v_l; m_l, \Sigma_l)\, \mathrm{Gam}(\phi; \alpha_\phi, \beta_\phi)\, dv_l\, d\phi \\
  &= \int \mathrm{St}\!\Big(v_l;\, 0,\, \tfrac{\beta_\phi}{\alpha_\phi} I_{V_l},\, 2\alpha_\phi\Big)\, \mathcal{N}(v_l; m_l, \Sigma_l)\, dv_l \\
  &\approx \int \mathcal{N}\!\Big(v_l;\, 0,\, \tfrac{\beta_\phi}{\alpha_\phi - 1} I_{V_l}\Big)\, \mathcal{N}(v_l; m_l, \Sigma_l)\, dv_l \\
  &= \mathcal{N}\!\Big(m_l;\, 0,\, \tfrac{\beta_\phi}{\alpha_\phi - 1} I_{V_l} + \Sigma_l\Big).
\end{aligned}
$$
Here $\mathrm{St}(x; \mu, \Sigma, \nu)$ denotes the multivariate Student-t distribution; in the fourth line it is approximated by a multivariate Gaussian with the same mean and covariance. The third line uses the identity
$$
\int \mathrm{Gam}(\lambda; \alpha, \beta)\, \mathcal{N}(x; \mu, \lambda^{-1}\Sigma)\, d\lambda = \mathrm{St}\!\Big(x;\, \mu,\, \tfrac{\beta}{\alpha}\Sigma,\, 2\alpha\Big),
$$
and the last line applies the formula
$$
\int \mathcal{N}(x; \mu_f, \Sigma_f)\, \mathcal{N}(x; \mu_g, \Sigma_g)\, dx = \mathcal{N}(\mu_f; \mu_g, \Sigma_f + \Sigma_g).
$$
The multivariate formulas above also specialize to the one-dimensional Gaussian case. Inserting this result for $Z$ into the update equations above, we can update the mean and variance of any Gaussian-distributed variable in the model.

B  When f is from the Likelihood

When $f$ comes from the likelihood, for a Gaussian variational distribution we need to compute the normalizer
$$
Z = \int \mathcal{N}\big(y_n;\, f(x_n, W, P, Q),\, \gamma^{-1} I_{V_L}\big)\, q(\Theta)\, d\Theta.
$$
Assume $z_{nL} = f(x_n, W, P, Q)$ has an approximate Gaussian distribution with mean $m_{z_{nL}}$ and variance vector $v_{z_{nL}}$. Then the normalizer can be computed as
$$
\begin{aligned}
Z &= \iint \mathcal{N}\big(y_n;\, z_{nL},\, \gamma^{-1} I_{V_L}\big)\, \mathcal{N}\big(z_{nL};\, m_{z_{nL}},\, \mathrm{diag}(v_{z_{nL}})\big)\, \mathrm{Gam}(\gamma; \alpha_\gamma, \beta_\gamma)\, dz_{nL}\, d\gamma \\
  &= \int \mathrm{St}\!\Big(y_n;\, z_{nL},\, \tfrac{\beta_\gamma}{\alpha_\gamma},\, 2\alpha_\gamma\Big)\, \mathcal{N}\big(z_{nL};\, m_{z_{nL}},\, v_{z_{nL}}\big)\, dz_{nL} \\
  &\approx \mathcal{N}\!\Big(y_n;\, m_{z_{nL}},\, \tfrac{\beta_\gamma}{\alpha_\gamma - 1} + v_{z_{nL}}\Big),
\end{aligned}
$$
where the Student-t is again replaced by a Gaussian with matched mean and covariance. At this point we still do not know the mean $m_{z_{nL}}$ and variance vector $v_{z_{nL}}$; they are obtained from the feed-forward structure of the network, as described below.
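For concreteness, the moment-matching update of Section A can be written as a short NumPy sketch. It assumes the gradients of $\log Z$ with respect to $m$ and $V$ are supplied by the caller (computed analytically as above, or via automatic differentiation); the function and variable names are illustrative only, not the released implementation.

```python
# Minimal sketch of the Gaussian moment-matching update (Section A):
#   m_new = m + V * d(log Z)/dm
#   V_new = V - V * ( d(log Z)/dm d(log Z)/dm^T - 2 d(log Z)/dV ) * V
import numpy as np

def update_gaussian_posterior(m, V, grad_m_logZ, grad_V_logZ):
    """Update N(w; m, V) after absorbing one factor with normalizer Z."""
    g = grad_m_logZ                                   # (d,) gradient w.r.t. the mean
    m_new = m + V @ g
    V_new = V - V @ (np.outer(g, g) - 2.0 * grad_V_logZ) @ V
    return m_new, V_new
```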

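Similarly, once the output moments $m_{z_{nL}}$ and $v_{z_{nL}}$ are available, the likelihood-factor normalizer of Section B reduces to evaluating a Gaussian density for each output dimension. A minimal sketch, assuming a diagonal predictive covariance; all names are illustrative:

```python
# Sketch of the likelihood-factor normalizer of Section B:
#   Z ~= N(y_n; m_zL, beta_gamma/(alpha_gamma - 1) + v_zL), per output dimension.
import numpy as np

def log_normalizer_likelihood(y, m_zL, v_zL, alpha_gamma, beta_gamma):
    var = beta_gamma / (alpha_gamma - 1.0) + v_zL     # approximate predictive variance
    return np.sum(-0.5 * np.log(2.0 * np.pi * var) - 0.5 * (y - m_zL) ** 2 / var)
```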
To obtain these moments, we propagate the distributions forward through the neural network and, where necessary, approximate an intermediate distribution with a Gaussian. We assume the output $z_{nl}$ of layer $l$ follows a diagonal Gaussian distribution with mean $m_{z_{nl}}$ and variance vector $v_{z_{nl}}$. Note that assuming independence across neurons does not interfere with the dependence among the parameters. In each layer of our model, the information passes through three stages: multiplication by $Q_l^{\top}$, multiplication by $B_l$, and multiplication by $P_l$.

After the first stage, $z^{Q}_{nl} = Q_l^{\top} z_{n,l-1}$, where $Q_l = I - \frac{2 v v^{\top}}{v^{\top} v}$ and $v \sim \mathcal{N}(m_l, \Sigma_l)$. Since $Q_l$ is an orthogonal matrix, the covariance matrix of $z^{Q}_{nl}$ equals that of $z_{n,l-1}$. As we assume $z_{n,l-1}$ is a multivariate Gaussian random variable with a diagonal covariance matrix, $z^{Q}_{nl}$ is still a diagonal multivariate Gaussian with variance vector $v_{z_{n,l-1}}$. Thus the mean and variance vector of $z^{Q}_{nl}$ are
$$
m_{z^{Q}_{nl}} = \mathbb{E}\Big[I - \tfrac{2 v v^{\top}}{v^{\top} v}\Big]\, m_{z_{n,l-1}}, \qquad
v_{z^{Q}_{nl}} = v_{z_{n,l-1}},
$$
where the expectation is approximated by Monte Carlo sampling. Since $\Sigma$ is positive definite, we perform a Cholesky decomposition $\Sigma = K K^{\top}$. Setting $k = K^{-1}(v - m)$, so that $k \sim \mathcal{N}(0, I)$, the expectation becomes
$$
\mathbb{E}\bigg[\frac{(K k + m)(K k + m)^{\top}}{(K k + m)^{\top}(K k + m)}\bigg]
\approx \frac{1}{M} \sum_{i=1}^{M} \frac{(K k_i + m)(K k_i + m)^{\top}}{(K k_i + m)^{\top}(K k_i + m)}.
$$
Based on this, it is convenient to calculate derivatives with respect to $m$ and $K$. Note that we need the derivatives of $Z$ with respect to $m$ and $\Sigma$ in order to obtain the updated mean and covariance matrix; the reparameterization trick provides an efficient way to do this. Specifically, because $\Sigma = K K^{\top}$, the derivative with respect to $\Sigma$ is
$$
\nabla_{\Sigma} \log Z = \tfrac{1}{2}\, \nabla_{K} \log Z\, K^{-1}.
$$
At the second stage, $z^{W}_{nl} = W_l z^{Q}_{nl} / \sqrt{V_{l-1} + 1}$. The mean and variance vector of $z^{W}_{nl}$ are
$$
m_{z^{W}_{nl}} = M_l\, m_{z^{Q}_{nl}} / \sqrt{V_{l-1} + 1}, \qquad
v_{z^{W}_{nl}} = \big[ (M_l \circ M_l)\, v_{z^{Q}_{nl}} + \Sigma_l (m_{z^{Q}_{nl}} \circ m_{z^{Q}_{nl}}) + \Sigma_l\, v_{z^{Q}_{nl}} \big] / (V_{l-1} + 1),
$$
where $M_l$ and $\Sigma_l$ are $V_l \times (V_{l-1}+1)$ matrices whose entries are given by $m_{ij}$ and $\lambda_{ij}$, respectively, and $\circ$ denotes the Hadamard (elementwise) product.

The last stage is similar to the first, as $P_l$ and $Q_l$ have the same kind of distribution. Let $a_{nl} = P_l z^{W}_{nl}$. Since $P_l = I - \frac{2 v v^{\top}}{v^{\top} v}$ with $v \sim \mathcal{N}(m, \Sigma)$, the mean and variance vector of $a_{nl}$ are
$$
m_{a_{nl}} = \mathbb{E}\Big[I - \tfrac{2 v v^{\top}}{v^{\top} v}\Big]\, m_{z^{W}_{nl}}, \qquad
v_{a_{nl}} = v_{z^{W}_{nl}}.
$$
After these multiplications by the parameter matrices, the information in $a_{nl}$ passes through the neurons of layer $l$. Let $b_{nl} = \mathrm{ReLU}(a_{nl})$; the mean and variance of the $i$-th entry of $b_{nl}$ are
$$
m_{b_{nl},i} = \Phi(\alpha_i)\, v'_i, \qquad
v_{b_{nl},i} = m_{b_{nl},i}\, v'_i\, \Phi(-\alpha_i) + \Phi(\alpha_i)\, v_{a_{nl},i}\,\big(1 - \gamma_i(\gamma_i + \alpha_i)\big),
$$
where $v'_i = m_{a_{nl},i} + \sqrt{v_{a_{nl},i}}\, \gamma_i$, $\alpha_i = m_{a_{nl},i} / \sqrt{v_{a_{nl},i}}$, $\gamma_i = \phi(-\alpha_i)/\Phi(\alpha_i)$, and $\Phi$ and $\phi$ are the CDF and the density function of a standard Gaussian distribution. When $\alpha_i$ is very large and negative, this definition of $\gamma_i$ is not numerically stable; in that regime we instead use the approximation $\gamma_i = -\alpha_i - \alpha_i^{-1} + 2\alpha_i^{-3}$.

The output $z_{nl}$ of the $l$-th layer is obtained by concatenating $b_{nl}$ with a constant 1 for the bias. We therefore approximate the distribution of $z_{nl}$ as Gaussian with marginal mean and variance
$$
m_{z_{nl}} = [\, m_{b_{nl}} ;\, 1\, ], \qquad v_{z_{nl}} = [\, v_{b_{nl}} ;\, 0\, ].
$$
This gives the approximating distribution of $z_{nl}$; performing this sequential update along the neural network yields the final distribution parameters $m_{z_{nL}}$ and $v_{z_{nL}}$.
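A minimal sketch of the Monte Carlo estimate of $\mathbb{E}[I - 2vv^{\top}/(v^{\top}v)]$ with the Cholesky reparameterization $v = Kk + m$ described above; the sample count and all names are assumptions for illustration:

```python
# Sketch: Monte Carlo estimate of E[I - 2 v v^T / (v^T v)] for v ~ N(m, Sigma),
# using the reparameterization v = K k + m with Sigma = K K^T and k ~ N(0, I).
import numpy as np

def expected_householder(m, Sigma, num_samples=100, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    d = m.shape[0]
    K = np.linalg.cholesky(Sigma)                 # Sigma = K K^T
    acc = np.zeros((d, d))
    for _ in range(num_samples):
        v = K @ rng.standard_normal(d) + m        # one reparameterized sample of v
        acc += np.outer(v, v) / (v @ v)           # v v^T / (v^T v)
    return np.eye(d) - 2.0 * acc / num_samples
```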

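The deterministic part of the forward pass, pushing a diagonal Gaussian through the $W$ stage and the ReLU, can be sketched in a few lines. The formulas mirror the expressions above; the SciPy dependency, the numerical cutoff for the asymptotic form of $\gamma_i$, and all names are illustrative assumptions:

```python
# Sketch of per-layer moment propagation: the W stage and the ReLU moments.
import numpy as np
from scipy.stats import norm

def linear_moments(m_z, v_z, M, Lam):
    """Moments of z^W = W z / sqrt(V_prev + 1), where W_ij ~ N(M_ij, Lam_ij)
    and z is diagonal Gaussian with mean m_z and variance vector v_z."""
    scale = M.shape[1]                            # V_{l-1} + 1 (input width incl. bias)
    m_out = M @ m_z / np.sqrt(scale)
    v_out = ((M * M) @ v_z + Lam @ (m_z * m_z) + Lam @ v_z) / scale
    return m_out, v_out

def relu_moments(m_a, v_a, alpha_cutoff=-30.0):
    """Entrywise mean/variance of b = max(0, a) for a ~ N(m_a, diag(v_a))."""
    s = np.sqrt(v_a)
    alpha = m_a / s
    # gamma = phi(-alpha) / Phi(alpha); switch to the asymptotic form when alpha
    # is very negative and the ratio becomes unstable (cutoff is illustrative).
    gamma = np.where(alpha > alpha_cutoff,
                     norm.pdf(alpha) / np.clip(norm.cdf(alpha), 1e-300, None),
                     -alpha - 1.0 / alpha + 2.0 / alpha ** 3)
    v_prime = m_a + s * gamma
    m_b = norm.cdf(alpha) * v_prime
    v_b = (m_b * v_prime * norm.cdf(-alpha)
           + norm.cdf(alpha) * v_a * (1.0 - gamma * (gamma + alpha)))
    return m_b, v_b
```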
C  Extra Experimental Results

C.1  Extra results for nonlinear regression

We show an extra toy regression experiment that follows the setup in Louizos and Welling (2016): data points $x_n$ are generated at random in a fixed interval, and the output for each data point is modeled as $y_n = x_n^3 + \epsilon_n$ with $\epsilon_n \sim \mathcal{N}(0, 9)$. We fit an FNN with one hidden layer to this data; in our method, Monte Carlo samples are drawn to approximate the expectation described in Section B. The resulting fits for PBP and our model are shown in Figure 5. For the real datasets, we plot all the learning curves for the nonlinear regression task in Figures 6 and 7, in terms of RMSE and log-likelihood, respectively.

C.2  Bayesian DNN for classification

Extending the original PBP framework (Hernández-Lobato and Adams, 2015), we provide some preliminary results on using PBP_MV for classification; the extension of PBP_MV to classification is described in the main text. We train a three-layer DNN on the MNIST dataset. During training, we use PBP to update the weights with MVG priors in the lower layers of the DNN, and use Adam to optimize the weights of the top (softmax) layer. The final error rate is better than that of purely training a DNN with stochastic optimization algorithms (Su et al.), but worse than the state-of-the-art Bayesian DNN model (Louizos and Welling, 2016). Note that the comparison with the latter is not entirely fair, because it uses minibatches to update parameters in each iteration, whereas in PBP_MV the minibatch size is effectively one. This motivates extending PBP_MV with minibatch updating, an interesting direction for future work.

Figure 5: The toy regression experiment for (a) PBP and (b) our model PBP_MV. The observations are shown as black dots; the blue line represents the true data-generating function, and the mean predictions are shown as a green line. The light gray shaded area is the ± standard-deviation confidence interval.

Figure 6: Learning curves in terms of RMSE vs. running time on the 10 datasets. From top down and left to right, the datasets are: Boston housing, concrete, energy, kin8nm, Naval, CCPP, wine quality, yacht, protein, and YearPrediction.

Figure 7: Learning curves in terms of log-likelihood vs. running time on the 10 datasets. From top down and left to right, the datasets are: Boston housing, concrete, energy, kin8nm, Naval, CCPP, wine quality, yacht, protein, and YearPrediction.