Density estimation. Computing, and avoiding, partition functions. Iain Murray


1 Density estimation: computing, and avoiding, partition functions. Iain Murray, School of Informatics, University of Edinburgh. Includes work with Ruslan Salakhutdinov and Hugo Larochelle. Roadmap: motivation (density estimation); understanding annealing/tempering; NADE.

2 Probabilistic model $H$. Predict new images: $P(x \mid H)$.

3 High density can be boring: $P(x \mid H)$.

4 Image reconstruction. Observation model: $P(y \mid x)$. Underlying image: $P(x \mid y) \propto P(y \mid x)\, P(x)$ (e.g., Zoran and Weiss, 2011; Lucas Theis's work).

5 Roadmap. Unsupervised learning and $P(x \mid H)$. Evaluating $P(x \mid H)$: Salakhutdinov and Murray (2008); Murray and Salakhutdinov (2009); Wallach, Murray, Salakhutdinov & Mimno (2009). NADE, density estimation put first: Larochelle and Murray (2011).

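6 Restricted Boltzmann Machines.
$$P(x, h \mid \theta) = \frac{1}{Z(\theta)} \exp\Big[\sum_{ij} W_{ij} x_i h_j + \sum_i b^x_i x_i + \sum_j b^h_j h_j\Big]$$
$$P(x \mid \theta) = \sum_h P(x, h \mid \theta) = \frac{1}{Z(\theta)} \underbrace{\sum_h \exp[\,\cdots\,]}_{f(x;\,\theta),\ \text{tractable}}$$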
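As a minimal sketch of why $f(x; \theta)$ is tractable while $Z(\theta)$ is not: the sum over $h$ factorizes into a product over hidden units, whereas $Z$ still needs a sum over all $2^D$ visible states. The function names, toy sizes and random parameters below are illustrative assumptions, not taken from the talk.

```python
import itertools
import numpy as np

def log_f(x, W, bx, bh):
    """log f(x; theta): hidden units summed out analytically.

    f(x) = exp(bx.x) * prod_j (1 + exp(bh_j + sum_i W_ij x_i)).
    Accepts one state of shape (D,) or a batch of shape (N, D).
    """
    return x @ bx + np.logaddexp(0.0, bh + x @ W).sum(axis=-1)

def log_Z_brute_force(W, bx, bh):
    """Exact log Z(theta) by enumerating all 2^D visible states (tiny D only)."""
    D = W.shape[0]
    X = np.array(list(itertools.product([0.0, 1.0], repeat=D)))
    return np.logaddexp.reduce(log_f(X, W, bx, bh))

rng = np.random.default_rng(0)
D, H = 4, 3  # toy sizes, chosen only so that enumeration is cheap
W = rng.normal(size=(D, H))
bx = rng.normal(size=D)
bh = rng.normal(size=H)

x = np.array([1.0, 0.0, 1.0, 1.0])
log_p_x = log_f(x, W, bx, bh) - log_Z_brute_force(W, bx, bh)
print(log_p_x)  # log P(x | theta) for this toy RBM
```

The cost of `log_f` is O(DH) per state, while `log_Z_brute_force` is exponential in D, which is why the following slides turn to estimating $Z$.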

7 $Z$: a normalizer. Model: $p(x) = f(x)/Z$. [Figure: points $x$ under a uniform distribution and under the model.]

8 Annealing / Tempering. E.g. $P(x; \beta) \propto P(x)^{\beta}\, \pi(x)^{(1-\beta)}$, where $1/\beta$ is the "temperature". [Figure panels: $\beta = 0$, $0.01$, $0.1$, $0.25$, $0.5$, $1$.]

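9 Annealed Importance Sampling. [Diagram: $\mathcal{Q}(X)$ draws $x_0 \sim p_0(x)$ and applies $T_1, T_2, \dots, T_K$ in turn to produce $x_1, x_2, \dots, x_K$; $\mathcal{P}(X)$ draws $x_K \sim p_{K+1}(x)$ and runs the same sequence in reverse, from $x_K$ back to $x_0$.]
$$\mathcal{P}(X) = \frac{P^*(x_K)}{Z} \prod_{k=1}^{K} \widetilde{T}_k(x_{k-1}; x_k), \qquad \mathcal{Q}(X) = \pi(x_0) \prod_{k=1}^{K} T_k(x_k; x_{k-1})$$
Here $P^*$ is the unnormalized target, $P = P^*/Z$, and $\widetilde{T}_k$ is the reverse of $T_k$ (equal to $T_k$ when $T_k$ is reversible). Standard importance sampling of $\mathcal{P}(X) = \mathcal{P}^*(X)/Z$ with $\mathcal{Q}(X)$.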

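10 Annealed Importance Sampling.
$$Z \approx \frac{1}{S} \sum_{s=1}^{S} \frac{\mathcal{P}^*(X^{(s)})}{\mathcal{Q}(X^{(s)})}, \qquad X^{(s)} \sim \mathcal{Q}(X)$$
where $\mathcal{P}^*(X) = Z\, \mathcal{P}(X)$ is the unnormalized joint target. [Figure: estimated $\log Z$ and true $\log Z$ against the number of AIS runs, with run times labelled from seconds through 3.3 min, 17 min and 33 min to 5.5 hrs, and an annotation "Large Variance".]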
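A minimal AIS sketch for a toy RBM, under several assumptions that are not from the talk: the base distribution $\pi$ is uniform over binary vectors (so its normalizer is $2^D$), the intermediate targets are $f(x)^{\beta}$, and the transitions are single-bit Metropolis flips rather than the Gibbs updates one would normally use for an RBM. All names and sizes are illustrative.

```python
import numpy as np

def log_f(x, W, bx, bh):
    """Unnormalized RBM log-marginal, log f(x; theta), for a batch x of shape (S, D)."""
    return x @ bx + np.logaddexp(0.0, bh + x @ W).sum(axis=-1)

def ais_log_Z(W, bx, bh, betas, S, rng):
    """Estimate log Z from S independent AIS runs over the schedule `betas` (0 -> 1)."""
    D = W.shape[0]
    x = rng.integers(0, 2, size=(S, D)).astype(float)  # x_0 ~ uniform base pi
    log_w = np.zeros(S)
    for b_prev, b in zip(betas[:-1], betas[1:]):
        # Importance weight update: p*_k(x_{k-1}) / p*_{k-1}(x_{k-1}).
        log_w += (b - b_prev) * log_f(x, W, bx, bh)
        # Transition T_k: one sweep of Metropolis bit flips leaving f(x)^b invariant.
        for i in range(D):
            x_new = x.copy()
            x_new[:, i] = 1.0 - x_new[:, i]
            log_ratio = b * (log_f(x_new, W, bx, bh) - log_f(x, W, bx, bh))
            accept = rng.random(S) < np.exp(np.minimum(0.0, log_ratio))
            x[accept] = x_new[accept]
    # log Z ~= log Z_0 + log mean(w), with Z_0 = 2^D for the uniform base.
    return D * np.log(2.0) + np.logaddexp.reduce(log_w) - np.log(S)

rng = np.random.default_rng(0)
D, H = 5, 3  # toy sizes so the answer can be checked by enumeration
W = 0.5 * rng.normal(size=(D, H))
bx = 0.5 * rng.normal(size=D)
bh = 0.5 * rng.normal(size=H)
betas = np.linspace(0.0, 1.0, 200)
log_Z_hat = ais_log_Z(W, bx, bh, betas, S=200, rng=rng)
print(log_Z_hat)
```

With a single temperature step (`betas = [0, 1]`) this reduces to plain importance sampling from the uniform distribution; the finer schedule is what keeps the variance of the weights manageable.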

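11 Parallel tempering. Standard MCMC: transitions plus swap proposals on the joint $P(X) = \prod_{\beta} P(x_\beta; \beta)$, with one state $x_\beta$ per temperature. [Diagram: one chain per temperature, $P(x)$ with operator $T_1$ and $P_{\beta_1}(x)$, $P_{\beta_2}(x)$, $P_{\beta_3}(x)$ with operators $T_{\beta_1}$, $T_{\beta_2}$, $T_{\beta_3}$, run side by side.] Larger system; information from low $\beta$ diffuses up by a slow random walk.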
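A minimal sketch of the swap move, assuming tempered targets of the form $P(x; \beta) \propto P^*(x)^{\beta}$ (a uniform base distribution) on a small discrete state space. The toy bimodal target, the $\pm 1$ local moves and all function names are illustrative assumptions.

```python
import numpy as np

def swap_accept_prob(logp_i, logp_j, beta_i, beta_j):
    """Metropolis probability of swapping states x_i and x_j between two temperatures.

    logp_i, logp_j are log P*(x_i), log P*(x_j). The joint is proportional to
    P*(x_i)^beta_i * P*(x_j)^beta_j, so the swap ratio is
    exp((beta_i - beta_j) * (logp_j - logp_i)).
    """
    return float(np.exp(min(0.0, (beta_i - beta_j) * (logp_j - logp_i))))

def parallel_tempering(log_p_star, betas, n_steps, rng):
    """Run one chain per beta on states 0..N-1; return the trajectory of the betas[0] chain."""
    n_states = len(log_p_star)
    x = rng.integers(0, n_states, size=len(betas))
    trace = np.empty(n_steps, dtype=int)
    for t in range(n_steps):
        # Local Metropolis move (+/-1 with wrap-around) within each temperature.
        for c, beta in enumerate(betas):
            prop = (x[c] + rng.choice([-1, 1])) % n_states
            log_ratio = beta * (log_p_star[prop] - log_p_star[x[c]])
            if rng.random() < np.exp(min(0.0, log_ratio)):
                x[c] = prop
        # Propose swapping the states of one random adjacent pair of temperatures.
        c = rng.integers(0, len(betas) - 1)
        a = swap_accept_prob(log_p_star[x[c]], log_p_star[x[c + 1]], betas[c], betas[c + 1])
        if rng.random() < a:
            x[c], x[c + 1] = x[c + 1], x[c]
        trace[t] = x[0]
    return trace

# Toy bimodal target on 30 states with a deep valley between the modes.
states = np.arange(30)
log_p_star = np.logaddexp(-0.5 * ((states - 5) / 1.5) ** 2,
                          -0.5 * ((states - 24) / 1.5) ** 2)
betas = [1.0, 0.3, 0.1, 0.03, 0.0]
trace = parallel_tempering(log_p_star, betas, n_steps=2000, rng=np.random.default_rng(0))
```

The $\beta = 1$ chain can only change mode when a state that crossed the valley at low $\beta$ is passed up through every intermediate swap, which is the slow random walk the slide refers to.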

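12 Tempered transitions. Drive temperature up... and back down. [Diagram: starting from $\hat{x}_0 \sim P(x)$, operators $\hat{T}_{\beta_1}, \hat{T}_{\beta_2}, \dots, \hat{T}_{\beta_K}$ produce $\hat{x}_1, \hat{x}_2, \dots, \hat{x}_{K-1}, x_K$; operators $\check{T}_{\beta_K}, \dots, \check{T}_{\beta_2}, \check{T}_{\beta_1}$ then produce $\check{x}_{K-1}, \dots, \check{x}_1, \check{x}_0$.] Proposal: swap order, final point $\check{x}_0$ putatively $\sim P(x)$. Acceptance probability:
$$\min\left[1,\ \frac{P_{\beta_1}(\hat{x}_0)}{P(\hat{x}_0)} \cdots \frac{P_{\beta_K}(\hat{x}_{K-1})}{P_{\beta_{K-1}}(\hat{x}_{K-1})} \cdot \frac{P_{\beta_{K-1}}(\check{x}_{K-1})}{P_{\beta_K}(\check{x}_{K-1})} \cdots \frac{P(\check{x}_0)}{P_{\beta_1}(\check{x}_0)}\right]$$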

13 Summary on $Z$. A whirlwind tour of annealing / tempering. Must be able to get anywhere in the distribution. Methods to use generally, for the hardest problems.

14 An experiment. Take 60,000 binarized MNIST digits [examples shown]. Train an RBM using CD (and then find $Z$). Train a mixture of multivariate Bernoullis with EM. Compare samples and test-set log-probs.

15 A comparison: which is which? Samples from: a mixture of Bernoullis, -143 nats/test digit; an RBM, -106 nats/test digit.

16 A better fitted RBM. [Figure: RBM samples alongside training set examples.] Test log-prob now 20 nats better (-86 nats/digit).

17 Dependent latent variables. [Diagrams: Deep Belief Net; lateral connections; directed model.] $P(x) = \frac{1}{Z} \sum_h P^*(x, h)$ is not available.

18 Chib-style estimates. Bayes' rule: $P(h \mid x) = \dfrac{P(h, x)}{P(x)}$. For any special state $h^*$: $P(x) = \dfrac{P(h^*, x)}{P(h^* \mid x)}$, where the denominator $P(h^* \mid x)$ must be estimated. Murray and Salakhutdinov (2009).
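The identity can be checked numerically on a model small enough to enumerate. The sketch below uses a toy RBM, where $P(h \mid x)$ happens to be available in closed form; in the models the slide has in mind that term is exactly what has to be estimated. Sizes, parameters and function names are illustrative assumptions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
D, H = 3, 2  # toy sizes so that everything can be enumerated
W = rng.normal(size=(D, H))
bx = rng.normal(size=D)
bh = rng.normal(size=H)

def log_joint_star(x, h):
    """Unnormalized log P*(x, h) of the toy RBM."""
    return x @ W @ h + bx @ x + bh @ h

xs = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=D)]
hs = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=H)]
log_Z = np.logaddexp.reduce([log_joint_star(x, h) for x in xs for h in hs])

def log_P_x_direct(x):
    """log P(x) by summing the joint over every h."""
    return np.logaddexp.reduce([log_joint_star(x, h) for h in hs]) - log_Z

def log_P_x_chib(x, h_star):
    """log P(x) = log P(h*, x) - log P(h* | x), for any choice of h*."""
    log_joint = log_joint_star(x, h_star) - log_Z
    a = bh + x @ W  # RBM posterior factorizes: P(h_j = 1 | x) = sigmoid(a_j)
    log_post = np.sum(h_star * a - np.logaddexp(0.0, a))
    return log_joint - log_post

x = xs[5]
print(log_P_x_direct(x), [log_P_x_chib(x, h) for h in hs])
```

Every choice of $h^*$ gives the same exact answer; the choice only matters once $P(h^* \mid x)$ is replaced by a noisy estimate.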

19 Variational approach. [Diagram: layers $h^{(2)}$, $h^{(1)}$, $x$.]
$$\log P(x) = \log \sum_h \frac{1}{Z} P^*(x, h) \;\ge\; \sum_h Q(h) \log P^*(x, h) - \log Z + H[Q(h)]$$
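A small numerical check of the bound, assuming a toy RBM and a factorial $Q(h)$ with $Q(h_j = 1) = q_j$; both choices are illustrative and not from the talk. For this model the bound is tight when $Q$ equals the exact posterior.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
D, H = 3, 4  # toy sizes so that log Z can be enumerated
W = rng.normal(size=(D, H))
bx = rng.normal(size=D)
bh = rng.normal(size=H)

def log_f(x):
    """log sum_h P*(x, h) for the toy RBM."""
    return x @ bx + np.logaddexp(0.0, bh + x @ W).sum()

X = np.array(list(itertools.product([0.0, 1.0], repeat=D)))
log_Z = np.logaddexp.reduce([log_f(x) for x in X])

def lower_bound(x, q):
    """sum_h Q(h) log P*(x, h) - log Z + H[Q], for factorial Q with 0 < q_j < 1."""
    a = bh + x @ W
    expected_log_joint = x @ bx + q @ a  # E_Q[log P*(x, h)]
    entropy = -np.sum(q * np.log(q) + (1.0 - q) * np.log(1.0 - q))
    return expected_log_joint - log_Z + entropy

x = X[5]
log_p_x = log_f(x) - log_Z
q_random = rng.uniform(0.05, 0.95, size=H)
q_posterior = 1.0 / (1.0 + np.exp(-(bh + x @ W)))
print(log_p_x, lower_bound(x, q_random), lower_bound(x, q_posterior))
```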

20 Results: MNIST. [Figure: estimated test log probability against number of Markov chain steps. Curves: AIS estimator; our proposed estimator; estimate of variational lower bound.]

21 Results: Natural Scenes. [Figure: estimated test log probability against number of Markov chain steps. Curves: AIS estimator; our proposed estimator; estimate of variational lower bound.]

22 $P(x \mid H)$ taught me: RBMs are state-of-the-art for binary distributions. Deep nets are only very slightly better on MNIST. Some Gaussian RBMs are really bad... and going deep won't help. Most topic-model $P(x \mid H)$ estimates are wrong.

23 Roadmap. Unsupervised learning and $P(x \mid H)$. Evaluating $P(x \mid H)$: Salakhutdinov and Murray (2008); Murray and Salakhutdinov (2009); Wallach, Murray, Salakhutdinov & Mimno (2009). NADE, density estimation put first: Larochelle and Murray (2011).

24 Decompose into scalars. $P(x) = P(x_1)\, P(x_2 \mid x_1)\, P(x_3 \mid x_1, x_2) \cdots = \prod_k P(x_k \mid x_{<k})$. [Figure: fully connected directed graphical model over $x_1, x_2, x_3, x_4$.]

25 FVSBN: Fully Visible Sigmoid Belief Net. Logistic regression for the conditionals $p(x_k = 1 \mid x_{<k})$. $p(x)$ is tractable. FVSBNs beat mixtures, but not RBMs.
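A minimal sketch of an FVSBN log-probability, assuming the usual logistic-regression form $p(x_k = 1 \mid x_{<k}) = \sigma(c_k + \sum_{j<k} W_{kj} x_j)$. The parameter names `W` and `c`, and the random toy values, are assumptions for illustration.

```python
import itertools
import numpy as np

def fvsbn_log_prob(x, W, c):
    """log p(x) = sum_k log p(x_k | x_<k) for binary x of shape (D,)."""
    # Strictly lower-triangular weights: row k only sees x_1 .. x_{k-1}.
    logits = c + np.tril(W, k=-1) @ x
    # Bernoulli log-likelihood with logit l: x * l - log(1 + exp(l)).
    return np.sum(x * logits - np.logaddexp(0.0, logits))

rng = np.random.default_rng(0)
D = 4  # toy dimension so normalization can be checked by enumeration
W = rng.normal(size=(D, D))
c = rng.normal(size=D)

X = np.array(list(itertools.product([0.0, 1.0], repeat=D)))
total = sum(np.exp(fvsbn_log_prob(x, W, c)) for x in X)
print(total)  # sums to 1 up to rounding: no partition function is needed
```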

26 Approximate the RBM's $P(x_k \mid x_{<k})$ using MCMC or mean field. (Requires a fitted RBM. Creates a new model.)

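27 One mean-field step.
$$\hat{x}_k = \sigma\big(b^x_k + W_{k,\cdot}\, h^{(k)}\big), \qquad h^{(k)} = \sigma\big(b^h + W_{\cdot,<k}\, x_{<k}\big)$$
[Diagram: inputs $x$, weights $W$, hidden vectors $h^{(1)}, \dots, h^{(4)}$, weights $V$, outputs $\hat{x}$.]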

28 NADE: Neural Autoregressive Distribution Estimator. Fit as a new model: $P(x \mid H) = \prod_k \hat{p}(x_k)$. Tractable, $O(DH)$. [Diagram: inputs $x$, weights $W$, hidden vectors $h^{(1)}, \dots, h^{(4)}$, weights $V$, outputs $\hat{x}$.]
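A sketch of how the $O(DH)$ cost can arise: the hidden pre-activation for step $k+1$ differs from that of step $k$ by one column of the input weights, so it can be updated in $O(H)$ instead of being recomputed. The sketch assumes separate input weights `W` (shape H x D) and output weights `V` (shape D x H), matching the two labels in the diagram; tying them (`V = W.T`) would give the mean-field form. Names and toy parameters are illustrative.

```python
import itertools
import numpy as np

def nade_log_prob(x, W, V, bx, bh):
    """log P(x) = sum_k log p(x_k | x_<k), computed in O(D * H)."""
    a = bh.copy()  # running pre-activation: bh + W[:, :k] @ x[:k]
    log_p = 0.0
    for k in range(len(x)):
        h = 1.0 / (1.0 + np.exp(-a))            # h^(k), depends only on x_<k
        logit = bx[k] + V[k] @ h                # logit of p(x_k = 1 | x_<k)
        log_p += x[k] * logit - np.logaddexp(0.0, logit)
        a += W[:, k] * x[k]                     # O(H) update for the next step
    return log_p

def nade_log_prob_naive(x, W, V, bx, bh):
    """Same quantity, recomputing each hidden vector from scratch: O(D^2 * H)."""
    log_p = 0.0
    for k in range(len(x)):
        h = 1.0 / (1.0 + np.exp(-(bh + W[:, :k] @ x[:k])))
        logit = bx[k] + V[k] @ h
        log_p += x[k] * logit - np.logaddexp(0.0, logit)
    return log_p

rng = np.random.default_rng(0)
D, H = 4, 3  # toy sizes so normalization can be checked by enumeration
W = rng.normal(size=(H, D))
V = rng.normal(size=(D, H))
bx = rng.normal(size=D)
bh = rng.normal(size=H)

X = np.array(list(itertools.product([0.0, 1.0], repeat=D)))
total = sum(np.exp(nade_log_prob(x, W, V, bx, bh)) for x in X)
print(total)  # sums to 1 up to rounding
```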

29 NADE results. Experiments. Annotation: "Normalization".

Model      adult    connect-4  dna      mushrooms  nips-0-12  ocr-letters  rcv1     web
MoB        ± 0.10   ± 0.04     ± 0.53   ± 0.10     ± 1.12     ± 0.32       ± 0.11   ± 0.23
RBM        ± 0.06   ± 0.02     ± 0.48   ± 0.09     ± 1.07     ± 0.30       ± 0.11   ± 0.20
RBM mult.  ± 0.06   ± 0.03     ± 0.40   ± 0.05     ± 1.06     ± 0.29       ± 0.11   ± 0.21
RBForest   ± 0.06   ± 0.02     ± 0.49   ± 0.07     ± 1.07     ± 0.28       ± 0.11   ± 0.21
FVSBN      ± 0.04   ± 0.01     ± 0.50   ± 0.05     ± 0.98     ± 0.23       ± 0.11   ± 0.20
NADE       ± 0.05   ± 0.01     ± 0.57   ± 0.04     ± 1.11     ± 0.21       ± 0.11   ± 0.20

Little variation when changing input ordering: DNA = ± , MUSHROOMS = ± , NIPS-0-12 = ± 0.15

30 NADE results. Experiments on a binarized version of MNIST. Table columns: Model, Log. Like. Rows: MoB, RBM (CD1), RBM (CD3), RBM (CD25), FVSBN, NADE; a brace marks a group of rows as "Intractable". *: taken from Salakhutdinov and Murray (2008). [Figure: samples.]

31 Talking points. When should we learn $P(x \mid H)$? Monte Carlo methods. Autoregressive models (see also Lucas Theis). A longer talk on NADE:

32 Appendix slides

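33 Markov chain estimation. Stationary condition for the Markov chain:
$$P(h^* \mid x) = \sum_h T(h^* \leftarrow h)\, P(h \mid x) \approx \frac{1}{S} \sum_{s=1}^{S} T(h^* \leftarrow h^{(s)}) \equiv \hat{p}, \qquad h^{(s)} \sim \mathcal{P}(H)$$
$\mathcal{P}(H)$ draws a sequence from an equilibrium Markov chain: $h^{(1)} \sim P(h \mid x)$, then $h^{(1)} \to h^{(2)} \to h^{(3)} \to h^{(4)} \to h^{(5)}$ using $T$.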
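The stationary condition can be verified exactly on a toy posterior. The sketch below assumes $T$ is one systematic Gibbs sweep over the bits of $h$, and uses an arbitrary random table as the unnormalized posterior for a fixed $x$; both are illustrative assumptions, as are the names.

```python
import itertools
import numpy as np

def gibbs_sweep_prob(h_to, h_from, log_p):
    """T(h_to <- h_from): probability that one Gibbs sweep maps h_from to h_to.

    log_p is an array of shape (2,)*M holding unnormalized log P*(h, x) for fixed x.
    """
    h = list(h_from)
    prob = 1.0
    for j in range(len(h)):
        h[j] = 0
        lp0 = log_p[tuple(h)]
        h[j] = 1
        lp1 = log_p[tuple(h)]
        p1 = 1.0 / (1.0 + np.exp(lp0 - lp1))  # P(h_j = 1 | other bits, x)
        prob *= p1 if h_to[j] == 1 else 1.0 - p1
        h[j] = h_to[j]                         # later bits condition on the new value
    return prob

rng = np.random.default_rng(3)
M = 3  # number of latent bits, small enough to enumerate
log_p = rng.normal(size=(2,) * M)
post = np.exp(log_p - np.logaddexp.reduce(log_p.ravel()))  # exact P(h | x)

states = list(itertools.product([0, 1], repeat=M))
h_star = (1, 0, 1)
rhs = sum(gibbs_sweep_prob(h_star, h, log_p) * post[h] for h in states)
print(post[h_star], rhs)  # the two sides of the stationary condition
```

Replacing the exact sum over $h$ with an average over states drawn from the chain gives the estimator $\hat{p}$ on the slide.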

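34 Bias in answer.
$$P(x) = \frac{P(h^*, x)}{P(h^* \mid x)} = \frac{P(h^*, x)}{E[\hat{p}]} \le E\left[\frac{P(h^*, x)}{\hat{p}}\right]$$
by Jensen's inequality, so plugging in $\hat{p}$ overestimates $P(x)$ on average. Idea: bias the Markov chain by starting at $h^*$, so that $\frac{1}{S}\sum_{s=1}^{S} T(h^* \leftarrow h^{(s)})$ will often overestimate $P(h^* \mid x)$. [Diagram: the chain $h^{(1)} \to h^{(2)} \to \dots \to h^{(5)}$ under $T$, with $h^{(1)}$ generated from $h^*$.]
$$\mathcal{Q}(H) = \frac{\widetilde{T}(h^{(1)} \leftarrow h^*)}{P(h^{(1)} \mid x)}\, \mathcal{P}(H)$$
where $\widetilde{T}$ is the reverse of $T$ (equal to $T$ when $T$ is reversible) and $\mathcal{P}(H)$ is the distribution over sequences from an equilibrium Markov chain.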

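35 New estimator. We actually need a slightly more complicated $\mathcal{Q}$. [Diagram: the chain $h^{(1)}, \dots, h^{(5)}$ linked by $T$ and attached to $h^*$.]
$$\mathcal{Q}(H) = \frac{1}{S} \sum_{s=1}^{S} \frac{\widetilde{T}(h^{(s)} \leftarrow h^*)}{P(h^{(s)} \mid x)}\, \mathcal{P}(H)$$
$$E_{\mathcal{Q}(H)}\left[1 \Big/ \frac{1}{S} \sum_{s=1}^{S} T(h^* \leftarrow h^{(s)})\right] = \frac{1}{P(h^* \mid x)}$$
Here $\widetilde{T}$ is the reverse of $T$ (equal to $T$ when $T$ is reversible). $\hat{P}(x) = \dfrac{P(x, h^*)}{\hat{P}(h^* \mid x)}$ is unbiased, and gives a stochastic lower bound on $\log P(x)$.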
