Evaluating the Variance of Likelihood-Ratio Gradient Estimators


1 Evaluating the Variance of Likelihood-Ratio Gradient Estimators. Seiya Tokui (1, 2), Issei Sato (2, 3). (1) Preferred Networks, (2) The University of Tokyo, (3) RIKEN. ICML 2017, Sydney.

2 Task: gradient estimation for stochastic computational graphs. We want to compute the gradient $\nabla_\phi \mathbb{E}_{q_\phi(z|x)}[f(x, z)]$ for a computational graph $x \to z \to f$. If there is no stochasticity in $z$ ($q$ is a delta distribution), backprop suffices. If $z$ is stochastic (a stochastic computational graph), we need more techniques.

3 Example: variational inference in deep directed graphical models. Generative model: $p_\theta(x, z) = p_\theta(x \mid z_1)\, p_\theta(z_1 \mid z_2)\, p_\theta(z_2 \mid z_3)\, p_\theta(z_3)$, where each factor is given by a neural network. Approximate posterior: $q_\phi(z \mid x) = q_\phi(z_1 \mid x)\, q_\phi(z_2 \mid z_1)\, q_\phi(z_3 \mid z_2)$, where each factor is also given by a neural network. Objective function (variational bound): $\mathcal{L}(\phi, \theta) = \mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] =: \mathbb{E}_{q_\phi(z|x)}[f(x, z)]$. We want to compute $\nabla_\phi \mathcal{L}$ to optimize $\mathcal{L}$ with a gradient method.

4 Overview of unbiased estimators. Likelihood-ratio estimators: $z$ can be continuous or discrete, $f$ can be non-continuous; they tend to have high variance, and many (heuristic) variance-reduction techniques exist. Reparameterization trick: $z$ must be continuous and $f$ must be differentiable; it tends to have low variance in practice (but this is not guaranteed). Our finding: when there are $M$ random variables, likelihood-ratio estimators can also be formulated with reparameterization for the $M$ variables, giving a unified framework of gradient estimators.
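To make the variance contrast concrete, here is a minimal sketch (a toy example of my own, not from the slides): both estimators target $\nabla_\phi \mathbb{E}_{z \sim \mathcal{N}(\phi, 1)}[z^2] = 2\phi$, and both are unbiased, but the likelihood-ratio samples have far higher variance.

```python
import numpy as np

rng = np.random.default_rng(0)
phi, n = 1.5, 100_000            # true gradient of E[z^2] under N(phi, 1) is 2 * phi

# Likelihood-ratio (REINFORCE): f(z) * d/dphi log N(z; phi, 1) = z^2 * (z - phi)
z = rng.normal(phi, 1.0, size=n)
lr = z**2 * (z - phi)

# Reparameterization: z = phi + eps with eps ~ N(0, 1), so d/dphi f(z) = 2 * z
eps = rng.normal(0.0, 1.0, size=n)
rp = 2.0 * (phi + eps)

print(f"true gradient     : {2 * phi:.3f}")
print(f"LR      mean / var: {lr.mean():.3f} / {lr.var():.1f}")
print(f"reparam mean / var: {rp.mean():.3f} / {rp.var():.1f}")
```

Both sample means approach $2\phi = 3$, but the reparameterization variance here is $4$ while the likelihood-ratio variance is roughly an order of magnitude larger.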

5 A unified framework of gradient estimators. Let $z = (z_1, \dots, z_M)$ and $q_\phi(z \mid x) = \prod_{i=1}^{M} q_{\phi_i}(z_i \mid \mathrm{pa}_i)$, where $\mathrm{pa}_i$ is the set of parents of $z_i$. Suppose we have a reparameterization formula: $z_i \sim q_{\phi_i}(z_i \mid \mathrm{pa}_i) \iff z_i = g_{\phi_i}(\mathrm{pa}_i, \varepsilon_i),\ \varepsilon_i \sim p(\varepsilon_i)$, where $\varepsilon_i$ is the noise variable that generates $z_i$. Exchange $\nabla$ and $\mathbb{E}$ partially for each $i$: $\nabla_{\phi_i} \mathbb{E}_{q_\phi(z|x)}[f(x, z)] = \nabla_{\phi_i} \mathbb{E}_{\varepsilon}[f(x, g_\phi(x, \varepsilon))] = \mathbb{E}_{\varepsilon_{\setminus i}}\!\big[\nabla_{\phi_i} \mathbb{E}_{\varepsilon_i}[f(x, g_\phi(x, \varepsilon))]\big]$. Reparameterization: [Williams, 1992] [Kingma & Welling, 2014] [Rezende+, 2014] [Titsias & Lázaro-Gredilla, 2014]. Local marginalization: [Titsias & Lázaro-Gredilla, 2015]. The inner expectation is differentiable even if $g_{\phi_i}$ is non-continuous ($z_i$ is discrete).
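To spell out the last claim: for a discrete $z_i$, the expectation over $\varepsilon_i$ marginalizes $z_i$ out, so the inner expectation is a finite mixture that is smooth in $\phi_i$. A one-unit sketch for binary $z_i$, with $z_{\setminus i}$ fixed by $\varepsilon_{\setminus i}$:

```latex
\nabla_{\phi_i} \mathbb{E}_{\varepsilon_i}\!\left[ f\!\left(x, g_\phi(x, \varepsilon)\right) \right]
  = \nabla_{\phi_i} \sum_{z_i \in \{0, 1\}} q_{\phi_i}(z_i \mid \mathrm{pa}_i)\, f(x, z_i, z_{\setminus i})
  = \sum_{z_i \in \{0, 1\}} f(x, z_i, z_{\setminus i})\, \nabla_{\phi_i} q_{\phi_i}(z_i \mid \mathrm{pa}_i).
```

The sum is differentiable in $\phi_i$ even though $g_{\phi_i}$ itself is a step function of $\phi_i$.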

6 A unified framework of gradient estimators. $\nabla_{\phi_i} \mathbb{E}_{q_\phi(z|x)}[f(x, z)] = \mathbb{E}_{\varepsilon_{\setminus i}}\!\big[\nabla_{\phi_i} \mathbb{E}_{\varepsilon_i}[f(x, g_\phi(x, \varepsilon))]\big]$. The inner term $\nabla_{\phi_i} \mathbb{E}_{\varepsilon_i}[\cdot]$ is the local gradient. Each method differs in how it estimates the local gradient. Likelihood-ratio estimator: uses the log-derivative trick. Reparameterization estimator: uses the reparameterization trick. Optimal estimator: computes the inner expectation exactly (or numerically).

7 Likelihood-ratio estimator under the framework. $\nabla_{\phi_i} \mathbb{E}_{\varepsilon_i}[f(x, z)] = \mathbb{E}_{\varepsilon_i}\!\big[(f(x, z) - b_i(x, \varepsilon))\, \nabla_{\phi_i} \log q_{\phi_i}(z_i \mid \mathrm{pa}_i)\big] + C_i(x, \varepsilon_{\setminus i})$, where $b_i$ is the baseline and $C_i$ is the residual. The hierarchy of baselines:
- Constant: $b_i$ is a constant with respect to $x$ and $\varepsilon$; $C_i = 0$. Example: running average of sampled $f$.
- Independent: $b_i(x, \varepsilon_{\setminus i})$ is a constant with respect to $\varepsilon_i$; $C_i = 0$. Examples: input-dependent baseline, local signal [Mnih & Gregor, 2014].
- Linear: $b_i(x, \varepsilon)$ is linear in $z_i$. Example: MuProp [Gu+, 2016].
- Fully-informed: $b_i(x, \varepsilon)$ may be nonlinear in $z_i$. Example: the optimal estimator (general).
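A minimal sketch of the first baseline level (my own toy example: one sigmoid-Bernoulli unit, objective $f(z) = (z - 0.2)^2$), comparing the plain likelihood-ratio estimate with a running-average constant baseline:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(z):                                       # toy objective on one binary unit
    return (z - 0.2) ** 2

phi = 0.5
mu = 1.0 / (1.0 + np.exp(-phi))                 # Bernoulli mean = sigmoid(phi)
true_grad = mu * (1 - mu) * (f(1.0) - f(0.0))   # exact d/dphi E[f(z)]

n = 100_000
z = (rng.random(n) < mu).astype(float)
score = z - mu                                  # d/dphi log Bernoulli(z; sigmoid(phi))
fz = f(z)

plain = fz * score                              # LR estimate, no baseline

b = np.zeros(n)                                 # running average of *past* samples,
b[1:] = np.cumsum(fz)[:-1] / np.arange(1, n)    # so b stays independent of the current z
baselined = (fz - b) * score

print(f"true gradient          : {true_grad:.4f}")
print(f"no baseline  mean / var: {plain.mean():.4f} / {plain.var():.4f}")
print(f"running avg  mean / var: {baselined.mean():.4f} / {baselined.var():.4f}")
```

Both estimates are unbiased ($C_i = 0$ at this level of the hierarchy), and subtracting the running average of $f$ noticeably reduces the variance.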

8 Reparameterization estimator under the framework. Apply the reparameterization trick to the local gradient: $\mathbb{E}_{\varepsilon_i}\!\big[\nabla_{\phi_i} f(x, g_\phi(x, \varepsilon))\big]$. If $g_{\phi_i}$ is not continuous, this equality does not hold (in other words, Monte Carlo estimation of the right-hand side has infinite variance). Otherwise, the reparameterization trick can be used.
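A toy illustration (my own construction, not from the slides) of why naive pathwise gradients fail for the 0/1 thresholding reparameterization of a Bernoulli unit:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
f = lambda z: (z - 0.2) ** 2

phi = 0.5
mu = sigmoid(phi)
true_grad = mu * (1 - mu) * (f(1.0) - f(0.0))

# Thresholding reparameterization: z = 1[eps < sigmoid(phi)], eps ~ U(0, 1).
# g is a step function of phi, so the exact pathwise derivative of f(g(phi, eps))
# is 0 for almost every eps; a finite-difference surrogate shows what goes wrong:
eps = rng.random(100_000)
h = 1e-4
fd = (f((eps < sigmoid(phi + h)).astype(float))
      - f((eps < sigmoid(phi - h)).astype(float))) / (2 * h)

print(f"true gradient          : {true_grad:.4f}")
print(f"nonzero finite diffs   : {(fd != 0).sum()} of {eps.size}")  # only a handful
print(f"finite-diff mean / var : {fd.mean():.4f} / {fd.var():.1f}")
```

The surviving terms are rare spikes of size $O(1/h)$; as $h \to 0$ the per-sample derivative becomes 0 almost surely while the variance diverges, matching the infinite-variance remark on this slide.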

9–29 (animation frames) Repeated display of the integrand of the optimal local gradient, $f(x, z)\, \nabla_{\phi_i} q_{\phi_i}(z_i \mid \mathrm{pa}_i)$.
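The repeated expression above is what the optimal estimator sums over the possible values of $z_i$: for a discrete unit, the exact local gradient is $\sum_{z_i} f(x, z)\, \nabla_{\phi_i} q_{\phi_i}(z_i \mid \mathrm{pa}_i)$. A minimal sketch for one sigmoid-Bernoulli unit (the names `optimal_local_grad` and `z_rest` are mine, for illustration):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def optimal_local_grad(f, phi, z_rest):
    """Exact local marginalization for one sigmoid-Bernoulli unit z_i:
    sum over k in {0, 1} of f(k, z_rest) * d q(z_i = k) / d phi,
    where q(z_i = 1) = sigmoid(phi) and z_rest stands for all other
    variables, already fixed by their noise draws."""
    mu = sigmoid(phi)
    dq1 = mu * (1 - mu)          # d q(z_i = 1) / d phi; d q(z_i = 0) / d phi = -dq1
    return f(1.0, z_rest) * dq1 + f(0.0, z_rest) * (-dq1)

# toy f coupling z_i with one other (fixed) variable
f = lambda zi, zr: (zi + zr - 1.2) ** 2
print(optimal_local_grad(f, phi=0.5, z_rest=1.0))
```

Because the sum is computed exactly, the only remaining randomness comes from the other noise variables $\varepsilon_{\setminus i}$.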

30 Evaluating the variance of estimators within the framework. Theorem 1: The optimal estimator achieves the minimum variance among all estimators within the framework (by the property of Rao-Blackwellization). Theorem 2: When $z_i$ is a Bernoulli variable, there is an independent baseline $b_i$ with which the likelihood-ratio estimator achieves the optimal variance. (I.e., LR with an independent baseline can be the optimal estimator.)
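A single-unit sketch of the construction behind Theorem 2 (my own derivation, not the paper's proof): for a sigmoid-Bernoulli unit with mean $\mu = \sigma(\phi_i)$, the score is $\nabla_{\phi_i} \log q_{\phi_i}(z_i) = z_i - \mu$, and the baseline $b_i^{*} = (1 - \mu)\, f(1, z_{\setminus i}) + \mu\, f(0, z_{\setminus i})$, which depends on $\varepsilon$ only through $\varepsilon_{\setminus i}$ and is therefore independent, makes the LR estimate constant in $z_i$:

```latex
\big(f(z_i) - b_i^{*}\big)(z_i - \mu) = \mu (1 - \mu)\,\big(f(1) - f(0)\big)
\qquad \text{for both } z_i = 0 \text{ and } z_i = 1.
```

This constant equals the exact local gradient $\nabla_{\phi_i}\big[\mu f(1) + (1 - \mu) f(0)\big]$, so the variance contributed by $\varepsilon_i$ is zero, matching the optimal estimator.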

31 Experiment: variational learning of sigmoid belief networks (latent layers $z_1, z_2, z_3$ with generative model $p_\theta$ and approximate posterior $q_\phi$). Datasets: MNIST and Omniglot, with 784-dimensional binary (0/1) inputs. Each latent variable follows a Bernoulli distribution with the logit given by the net input (i.e., a sigmoid-Bernoulli unit). In the optimal estimator, the Bernoulli unit is reparameterized by 0/1 thresholding at $\varepsilon \sim U(0, 1)$.
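A two-line check of that thresholding reparameterization (a sketch, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 0.3                          # Bernoulli mean (sigmoid of the net input)
eps = rng.random(1_000_000)       # eps ~ U(0, 1)
z = (eps < mu).astype(float)      # 0/1 thresholding reparameterization
print(z.mean())                   # ~0.3, so z is indeed Bernoulli(mu)
```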

32 Experimental results: variational learning of sigmoid belief nets (figure slide).

33 Conclusion. We proposed a framework of gradient estimators for stochastic computational graphs based on reparameterization and local marginalization. We formulated a hierarchy of baseline techniques for likelihood-ratio estimators and showed the relationship between this hierarchy and the optimal estimator. The experimental results show that the variance of gradient estimation for binary discrete variables is approaching the optimum with recent advances, yet a non-negligible gap remains, indicating the possibility of further improvements.

34 (end)

35 Appendix: results with shallow networks

36 Appendix: training curve of deep networks

37 Appendix: training curve of shallow networks

38 Appendix: final performance on test sets. VB stands for variational bound; LL stands for log likelihood, which is approximated by a Monte Carlo objective using a sample of size 5,000.
