Riemannian Stein Variational Gradient Descent for Bayesian Inference

Size: px

Start display at page:

Download "Riemannian Stein Variational Gradient Descent for Bayesian Inference"

Francis Norris
5 years ago
Views:

1 Riemannian Stein Variational Gradient Descent for Bayesian Inference Chang Liu, Jun Zhu 1 Dept. of Comp. Sci. & Tech., TNList Lab; Center for Bio-Inspired Computing Research State Key Lab for Intell. Tech. & Systems, Tsinghua University, Beijing, China {chang-li14@mails., dcszj@}tsinghua.edu.cn AAAI New Orleans 1 Corresponding author. Chang Liu and Jun Zhu (THU) Riemannian Stein Variational Gradient Descent for Bayesian Inference 1 / 24

2 Introduction 1 Introduction 2 Preliminaries 3 Riemannian SVGD Derivation of the Directional Derivative Expression in the Embedded Space 4 Experiments Bayesian Logistic Regression Spherical Admixture Model Chang Liu and Jun Zhu (THU) Riemannian Stein Variational Gradient Descent for Bayesian Inference 2 / 24

3 Introduction Introduction Bayesian inference: given a dataset D and a Bayesian model p(x, D), estimate the posterior of the latent variable p(x D). Comparison of current inference methods: model-based variational inference methods (M-VIs), Monte Carlo methods (MCs) and particle-based variational inference methods (P-VIs) Methods M-VIs MCs P-VIs Asymptotic Accuracy No Yes Promising Approximation Flexibility Limited Unlimited Unlimited Iteration Effectiveness Yes Weak Strong Particle Efficiency (do not apply) Weak Strong Stein Variational Gradient Descent (SVGD) [7]: a P-VI with minimal assumption and impressive performance. Chang Liu and Jun Zhu (THU) Riemannian Stein Variational Gradient Descent for Bayesian Inference 3 / 24

4 Introduction Introduction In this work: Generalize SVGD to the Riemann manifold settings, so that we can: Purpose 1 Adapt SVGD to tasks on Riemann manifold and introduce the first P-VI to the Riemannian world. Purpose 2 Improve SVGD efficiency for usual tasks (ones on Euclidean space) by exploring information geometry. Chang Liu and Jun Zhu (THU) Riemannian Stein Variational Gradient Descent for Bayesian Inference 4 / 24

5 Preliminaries 1 Introduction 2 Preliminaries 3 Riemannian SVGD Derivation of the Directional Derivative Expression in the Embedded Space 4 Experiments Bayesian Logistic Regression Spherical Admixture Model Chang Liu and Jun Zhu (THU) Riemannian Stein Variational Gradient Descent for Bayesian Inference 5 / 24

6 Preliminaries Stein Variational Gradient Descent (SVGD) The idea of SVGD: A deterministic continuous-time dynamics d dtx(t) = φ(x(t)) on M = R m (where φ : R m R m ) will induce a continuously evolving distribution q t on M. At some instant t, for a fixed dynamics φ, find the decreasing rate of KL(q t p), i.e. the Directional Derivative d dt KL(q t p) in the direction of φ. Find φ that maximizes the directional derivative, i.e. the Functional Gradient φ (the steepest ascending direction ). For close-form solution, φ is chosen from H m, where H is the reproducing kernel Hilbert space (RKHS) of some kernel. Apply the dynamics φ to samples {x (s) } S s=1 of q t: {x (s) + εφ (x (s) )} S s=1 forms a set of samples of q t+ε. Chang Liu and Jun Zhu (THU) Riemannian Stein Variational Gradient Descent for Bayesian Inference 6 / 24

7 Riemannian SVGD 1 Introduction 2 Preliminaries 3 Riemannian SVGD Derivation of the Directional Derivative Expression in the Embedded Space 4 Experiments Bayesian Logistic Regression Spherical Admixture Model Chang Liu and Jun Zhu (THU) Riemannian Stein Variational Gradient Descent for Bayesian Inference 7 / 24

8 Riemannian SVGD Roadmap For a general Riemann manifold M, Any deterministic continuous-time dynamics on M is described by a vector field X on M. It induces a continuously evolving distribution on M with density q t (w.r.t. Riemann volume form). Derive the Directional Derivative d dt KL(q t p) under dynamics X. Derive the Functional Gradient X := (max arg max) X X =1 d dt KL(q t p). Moreover, for Purpose 1, express X in the Embedded Space of M when M has no global coordinate systems (c.s.), e.g. hyperspheres. Finally, simulate the dynamics X for a small time step ε to update samples. Chang Liu and Jun Zhu (THU) Riemannian Stein Variational Gradient Descent for Bayesian Inference 8 / 24

9 Riemannian SVGD Derivation of the Directional Derivative Derivation of the Directional Derivative Let q t be the evolving density under dynamics X. Lemma (Continuity Equation on Riemann Manifold) q t t = div(q tx) = X[q t ] q t div(x). X[q t ]: the action of the vector field X on the smooth function q t. In any c.s., X[q t ] = X i i q t. div(x): the divergence of vector field X. In any c.s., div(x) = i ( G X i )/ G, where G is the matrix expression under the c.s. of the Riemann metric of M. Theorem (Directional Derivative) Let p be a fixed distribution. Then the directional derivative is d dt KL(q t p)=e qt [div(px)/p]=e qt [ X[log p]+div(x) ]. Chang Liu and Jun Zhu (THU) Riemannian Stein Variational Gradient Descent for Bayesian Inference 9 / 24

10 Riemannian SVGD The task now: X := (max arg max) X X, X X =1 J (X) := E q [ X[log p] + div(x) ], where X is some subspace of the space of vector fields on M, such that the requirements are met: Requirements on X, thus on X R1: X is a valid vector field on M; R2: X is coordinate invariant; R3: X can be expressed in closed form, where q appears only in terms of expectation. Chang Liu and Jun Zhu (THU) Riemannian Stein Variational Gradient Descent for Bayesian Inference 10 / 24

11 Riemannian SVGD R1: X is a valid vector field on M. Why needed: deductions are based on valid vector fields. Note: non-trivial to guarantee! Example (Vector fields on hyperspheres) Vector fields on an even-dimensional hypersphere must have one zero-vector-valued point (critical point) due to the hairy ball theorem ([1], Theorem ). The choice in SVGD X = H m cannot guarantee R1. Chang Liu and Jun Zhu (THU) Riemannian Stein Variational Gradient Descent for Bayesian Inference 11 / 24

12 Riemannian SVGD R2: X is coordinate invariant. Concept: the expression of an object on M in any c.s. is the same. E.g. vector field, gradient and divergence. Why needed: necessary to avoid ambiguity or arbitrariness of the solution. The vector field X should be independent of the choice of c.s. in which it is expressed. Note: the choice in SVGD X = H m cannot guarantee R2. R3: X can be expressed in closed form, where q appears only in terms of expectation. Why needed: for tractable implementation, and for avoiding making restrictive assumptions on q. Chang Liu and Jun Zhu (THU) Riemannian Stein Variational Gradient Descent for Bayesian Inference 12 / 24

13 Riemannian SVGD Our Solution X = {grad f f H}, where H is the RKHS of some kernel. grad f is the gradient of the smooth function f. In any c.s., (grad f) j = g ij i f, where g ij is the entry of G 1 under the c.s. Theorem For Gaussian RKHS, X is isometrically isomorphic to H, thus it is a Hilbert space. Our solution guarantees all the requirements: The gradient is a well-defined object on M and it is guaranteed to be a valid vector field and coordinate invariant (see paper for detailed interpretation). Close-form solution can be derived (see next). Chang Liu and Jun Zhu (THU) Riemannian Stein Variational Gradient Descent for Bayesian Inference 13 / 24

14 Riemannian SVGD The close-form solution: Theorem (Functional Gradient) X = grad f, f = E q [ (grad K)[log p] + K ], where notations with prime take x as argument while others take x and K takes both, and f := div(grad f). In any c.s., [ X i (g = g ij je ab q a log(p G ) + a g ab) ] b K + g ab a b K. Chang Liu and Jun Zhu (THU) Riemannian Stein Variational Gradient Descent for Bayesian Inference 14 / 24

15 Riemannian SVGD Purpose 2 Improve efficiency for the usual inference tasks on Euclidean space R m. Apply the idea of information geometry [3, 2]: for a Bayesian model with prior p(x) and likelihood p(d x), take M = {p( x) : x R m } and treat x as the coordinate of p( x). In this global c.s., G(x) is the Fisher information matrix of p( x) (and typically subtract by the Hessian of log p(x)). Calculate the tangent vector at each sample using the c.s. expression [ X i (g = g ij je ab q a log(p G ) + a g ab) ] b K + g ab a b K, where the target distribution p = p(x D) p(x)p(d x) and the expectation is estimated by averaging over samples. Simulate the dynamics for a small time step ε to update samples: x (s) x (s) + εx (x (s) ). Chang Liu and Jun Zhu (THU) Riemannian Stein Variational Gradient Descent for Bayesian Inference 15 / 24

16 Riemannian SVGD Expression in the Embedded Space Expression in the Embedded Space Purpose 1 Enable applicability to inference tasks on non-linear Riemann manifolds. In the coordinate space of M: Some manifolds have no global c.s., e.g. hypersphere S n 1 := {x R n : x = 1} and Stiefel manifold [5]. Cumbersome switch among local c.s. G would be singular near the edge of coordinate space. In the embedded space of M: M can be expressed globally, and is natural for S n 1 and Stiefel manifold. No singularity problems. Requires exponential map and density w.r.t. Hausdorff measure, which are available for S n 1 and Stiefel manifold. Chang Liu and Jun Zhu (THU) Riemannian Stein Variational Gradient Descent for Bayesian Inference 16 / 24

17 Riemannian SVGD Expression in the Embedded Space Expression in the Embedded Space Proposition (Functional Gradient in the Embedded Space) Let m-dim Riemann manifold M isometrically embedded in R n (with orthonormal basis {y α } n α=1 )) via Ξ : M Rn. Let p be the density w.r.t. the Hausdorff measure on Ξ(M). Then X = (I n N N ) f, [( f = E q log ( p G )) (I n NN ) ( K) + K ( tr N ( K)N ) + ( (M ) (G 1 M ) ) ] ( K), where I n R n n is the identity matrix, = ( y 1,..., y n), M R n m : M αi = yα, N R n (n m) is the set of orthonormal basis x i of the orthogonal complement of Ξ (T x M), and tr( ) is the trace. Simulating the dynamics requires the exponential map Exp of M: y (s) Exp y (s)(εx (y (s) )). Exp y (v): moves y on Ξ(M) straightly along the direction of v. Chang Liu and Jun Zhu (THU) Riemannian Stein Variational Gradient Descent for Bayesian Inference 17 / 24

18 Riemannian SVGD Expression in the Embedded Space Expression in the Embedded Space Proposition (Functional Gradient for Embedded Hyperspheres) For S n 1 isometrically embedded in R n with orthonormal basis {y α } n α=1, we have X = (I n y y ) f, where f = E q [( log p ) ( K) + K y ( K ) y (y log p + n 1)y K ]. Exponential map on S n 1 : Exp y (v) = y cos( v ) + (v/ v ) sin( v ). Chang Liu and Jun Zhu (THU) Riemannian Stein Variational Gradient Descent for Bayesian Inference 18 / 24

19 Experiments 1 Introduction 2 Preliminaries 3 Riemannian SVGD Derivation of the Directional Derivative Expression in the Embedded Space 4 Experiments Bayesian Logistic Regression Spherical Admixture Model Chang Liu and Jun Zhu (THU) Riemannian Stein Variational Gradient Descent for Bayesian Inference 19 / 24

20 Experiments Bayesian Logistic Regression BLR: for Purpose 2 Model: Bayesian Logistic Regression (BLR) w N (0, αi m ), y d Bern(σ(w x d )), where σ(x) = 1/(1 + e x ). Euclidean task: w R m. Posterior: p(w {(x d, y d )}), log-density gradient known. Riemann metric tensor G: FisherInfo Hessian, known in close form. Kernel: Gaussian kernel in the coordinate space. Baselines: vanilla SVGD. Evaluation: averaged test accuracy. Chang Liu and Jun Zhu (THU) Riemannian Stein Variational Gradient Descent for Bayesian Inference 20 / 24

21 Experiments Bayesian Logistic Regression BLR: for Purpose 2 Results: accuracy accuracy SVGD RSVGD (Ours) SVGD RSVGD (Ours) iteration iteration (a) On Splice19 dataset (b) On Covertype dataset Figure: Test accuracy along iteration for BLR. Both methods are run 20 times on Splice19 and 10 times on Covertype. Each run on Covertype uses a random train(80%)-test(20%) split as in [7]. Chang Liu and Jun Zhu (THU) Riemannian Stein Variational Gradient Descent for Bayesian Inference 21 / 24

22 Experiments Spherical Admixture Model SAM: for Purpose 1 Model: Spherical Admixture Model (SAM) [8] Observed var.: tf-idf representation of documents: v d S V 1. Latent var.: spherical topics: β t S V 1. Non-linear Riemann manifold task: β (S V 1 ) T. Posterior: p(β v) (w.r.t. the Hausdorff measure), log-density gradient can be estimated [6]. Kernel: von-mises Fisher (vmf) kernel K(y, y ) = exp(κy y ), the restriction of Gaussian kernel in R n on S n 1. Baselines: Variational Inference (VI) [8]: the vanilla inference method of SAM. Geodesic Monte Carlo (GMC) [4]: MCMC for RM in the embed. sp. Stochastic Gradient GMC (SGGMC) [6]: SG-MCMC for RM in the embeded space. (-b: mini-batch grad. est. -f: full-batch grad. est.) For MCMCs, -seq: samples from one chain. -par: newest samples from multiple chains. Evaluation: log-perplexity (negative log-likelihood of test dataset under the trained model) [6]. Chang Liu and Jun Zhu (THU) Riemannian Stein Variational Gradient Descent for Bayesian Inference 22 / 24

23 Experiments Spherical Admixture Model SAM: for Purpose 1 Results: log perplexity VI GMC seq GMC par SGGMCb seq SGGMCb par SGGMCf seq SGGMCf par RSVGD (Ours) log perplexity GMC seq GMC par SGGMCb seq SGGMCb par SGGMCf seq SGGMCf par RSVGD (Ours) epochs (a) Results with 100 particles number of particles (b) Results at 200 epochs Figure: Results on the SAM inference task on 20News-different dataset, in log-perplexity. We run SGGMCf for full batch and SGGMCb for a mini-batch size of 50. Chang Liu and Jun Zhu (THU) Riemannian Stein Variational Gradient Descent for Bayesian Inference 23 / 24

24 Thank you! Chang Liu and Jun Zhu (THU) Riemannian Stein Variational Gradient Descent for Bayesian Inference 24 / 24

25 References Ralph Abraham, Jerrold E Marsden, and Tudor Ratiu. Manifolds, tensor analysis, and applications, volume 75. Springer Science & Business Media, Shun-Ichi Amari. Information geometry and its applications. Springer, Shun-Ichi Amari and Hiroshi Nagaoka. Methods of information geometry, volume 191. American Mathematical Soc., Simon Byrne and Mark Girolami. Geodesic monte carlo on embedded manifolds. Scandinavian Journal of Statistics, 40(4): , I. M. James. The topology of Stiefel manifolds, volume 24. Cambridge University Press, Chang Liu, Jun Zhu, and Yang Song. Stochastic gradient geodesic mcmc methods. In Advances In Neural Information Processing Systems, pages , Chang Liu and Jun Zhu (THU) Riemannian Stein Variational Gradient Descent for Bayesian Inference 24 / 24

26 References Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances in Neural Information Processing Systems, pages , Joseph Reisinger, Austin Waters, Bryan Silverthorn, and Raymond J. Mooney. Spherical topic models. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages , Chang Liu and Jun Zhu (THU) Riemannian Stein Variational Gradient Descent for Bayesian Inference 24 / 24

Riemannian Stein Variational Gradient Descent for Bayesian Inference

The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18) Riemannian Stein Variational Gradient Descent for Bayesian Inference Chang Liu, Jun Zhu Dept. of Comp. Sci. & Tech., TNList Lab; Center