Annealing Between Distributions by Averaging Moments


Chris J. Maddison (Dept. of Computer Science, University of Toronto), Roger Grosse (CSAIL, MIT), Ruslan Salakhutdinov (University of Toronto)

Partition Functions

We usually specify distributions only up to a normalizing constant, p(y) = f(y)/Z:

         MRFs                Posteriors
  y      x                   θ
  f      exp(−E(x, θ))       p(x | θ) p(θ)
  Z      Z(θ)                p(x)

For Markov random fields (MRFs), the partition function Z(θ) = Σ_x exp(−E(x, θ)) is intractable.

Goal: estimate log Z(θ).

Estimating Partition Functions

Variational approximations and bounds on log Z (Yedidia et al., 2005; Wainwright et al., 2005). We want our models to reflect a highly dependent world, but this hurts variational approaches, which assume more and more independence. The independence assumption is less costly for posterior inference over parameters.

Sampling methods, such as path sampling (Gelman and Meng, 1998), sequential Monte Carlo (e.g. del Moral et al., 2006), simple importance sampling, and annealed importance sampling (Neal, 2002). These are slow, finicky, and hard to diagnose, but in principle they can deal with multimodality.

Simple Importance Sampling (SIS)

Two distributions p_a(x) and p_b(x) over X:
- p_a(x) = f_a(x)/Z_a: tractable Z, easy to sample
- p_b(x) = f_b(x)/Z_b: intractable Z, hard to sample

Then

\frac{Z_b}{Z_a} = \int \frac{f_b(x)}{f_a(x)} \, p_a(x) \, dx \approx \frac{1}{M} \sum_{i=1}^{M} \frac{f_b(x^{(i)})}{f_a(x^{(i)})}, \qquad x^{(i)} \sim p_a(x).

Variance is high (sometimes infinite) if p_a ≪ p_b in some regions.
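A minimal numerical illustration (not from the talk): SIS between two 1-D Gaussian kernels, chosen so the true ratio Z_b/Z_a is known in closed form. The densities and sample size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized densities (Gaussian kernels without their normalizers):
# p_a = N(0, 1), easy to sample; p_b = N(1, 0.5^2), pretend Z_b is unknown.
f_a = lambda x: np.exp(-0.5 * x**2)
f_b = lambda x: np.exp(-0.5 * ((x - 1.0) / 0.5)**2)

# Ground truth: Z = sqrt(2*pi) * sigma, hence Z_b / Z_a = 0.5.
M = 100_000
x = rng.standard_normal(M)      # x^(i) ~ p_a
w = f_b(x) / f_a(x)             # importance weights f_b / f_a
print("estimate:", w.mean(), " truth:", 0.5)
```

Reversing the roles (a broad target under a narrow proposal) puts p_a ≪ p_b over much of the space, and the same estimator becomes dominated by a few huge weights.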

An Intuition

Move gradually from a hotter p_a to a colder p_b — think of the interpolation parameter as an inverse temperature (if f_0 is the uniform distribution, the intermediate distributions are powers of f_1, which explains the term annealing). Reduce variance by chaining importance samplers through intermediate distributions f_0 = f_a, f_1, ..., f_K = f_b:

\frac{Z_K}{Z_0} = \frac{Z_1}{Z_0} \frac{Z_2}{Z_1} \cdots \frac{Z_K}{Z_{K-1}} \approx \frac{1}{M} \sum_{i=1}^{M} \frac{f_1(x_0^{(i)})}{f_0(x_0^{(i)})} \frac{f_2(x_1^{(i)})}{f_1(x_1^{(i)})} \cdots \frac{f_K(x_{K-1}^{(i)})}{f_{K-1}(x_{K-1}^{(i)})}

where we independently draw x_k^{(i)} ∼ p_k(x) — hard to do.

Annealed Importance Sampling (AIS)

We can still do this with certain dependent samplers! (Neal, 2002) Use transition operators T_k(x | x_{k-1}) that leave p_k(x) invariant.

For chain i:

x_0 ∼ p_0,  x_1 ∼ T_1(x | x_0),  x_2 ∼ T_2(x | x_1),  ...,  x_{K-1} ∼ T_{K-1}(x | x_{K-2})

w^{(i)} = \frac{f_1(x_0^{(i)})}{f_0(x_0^{(i)})} \frac{f_2(x_1^{(i)})}{f_1(x_1^{(i)})} \frac{f_3(x_2^{(i)})}{f_2(x_2^{(i)})} \cdots \frac{f_K(x_{K-1}^{(i)})}{f_{K-1}(x_{K-1}^{(i)})}

\frac{Z_K}{Z_0} \approx \frac{1}{M} \sum_{i=1}^{M} w^{(i)}

Intuition: SIS on an extended state space — remarkably, unbiased!
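A minimal runnable sketch of AIS as just described (not the authors' code): a geometric path between two 1-D Gaussian kernels, one Metropolis step per intermediate distribution, so the true answer log(Z_K/Z_0) = log 0.5 is known.

```python
import numpy as np

rng = np.random.default_rng(0)

# Geometric path between two unnormalized Gaussian kernels:
# f_a = N(0, 1) kernel (Z_a = sqrt(2*pi)), f_b = N(3, 0.5^2) kernel.
log_f_a = lambda x: -0.5 * x**2
log_f_b = lambda x: -0.5 * ((x - 3.0) / 0.5)**2
log_f = lambda x, b: (1 - b) * log_f_a(x) + b * log_f_b(x)

K, M = 1000, 200
betas = np.linspace(0.0, 1.0, K + 1)
x = rng.standard_normal(M)              # x_0^(i) ~ p_0
log_w = np.zeros(M)

for k in range(1, K + 1):
    # Accumulate log f_k(x_{k-1}) - log f_{k-1}(x_{k-1}).
    log_w += log_f(x, betas[k]) - log_f(x, betas[k - 1])
    # One Metropolis step leaving p_k invariant.
    prop = x + 0.5 * rng.standard_normal(M)
    accept = np.log(rng.random(M)) < log_f(prop, betas[k]) - log_f(x, betas[k])
    x = np.where(accept, prop, x)

# log(Z_K / Z_0) via log-mean-exp of the weights; truth is log 0.5.
est = np.logaddexp.reduce(log_w) - np.log(M)
print("estimate:", est, " truth:", np.log(0.5))
```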

Analyzing AIS Paths

Virtually the only scheme used is geometric averages,

f_k(x) = f_a(x)^{1 - \beta_k} f_b(x)^{\beta_k}

Let P be a family of distributions parameterized by θ. Define a path γ : [0, 1] → P and a schedule of points 0 = β_0 < β_1 < ... < β_K = 1, with p_k = p_{θ(β_k)}.

[Figure: a path on the manifold P from p_a to p_b, with the intermediate p_k spaced along it.]

Analyzing AIS Paths

Assume perfect transitions, T_k(x | x_{k-1}) = p_k(x). Then

E[\log w^{(i)}] = \sum_{k=1}^{K} E_{p_{k-1}}[\log f_k(x) - \log f_{k-1}(x)]

Each term is a finite-difference approximation; with assumptions plus some math, the sum approaches

\int_0^1 \frac{d \log Z(\theta(\beta))}{d\beta} \, d\beta = \log \frac{Z_K}{Z_0}

Intuition: the log w^{(i)}'s accumulate finite differences of log Z.
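For reference, the one-line identity behind the bias expression on the next slide (standard algebra, not verbatim from the talk): writing f_k = Z_k p_k,

\mathbb{E}_{p_{k-1}}\!\left[\log \frac{f_k(x)}{f_{k-1}(x)}\right]
= \mathbb{E}_{p_{k-1}}\!\left[\log \frac{Z_k \, p_k(x)}{Z_{k-1} \, p_{k-1}(x)}\right]
= \log \frac{Z_k}{Z_{k-1}} - D_{KL}(p_{k-1} \,\|\, p_k).

Summing over k telescopes the log Z terms to log(Z_K/Z_0) and leaves the sum of KL divergences as the bias.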

Analyzing AIS Paths

What is the error for a fixed number of intermediate distributions? E[log w^{(i)}] is biased, unlike E[w^{(i)}]:

E[\log w^{(i)}] = \sum_{k=1}^{K} E_{p_{k-1}}[\log f_k(x) - \log f_{k-1}(x)]
= \log \frac{Z_K}{Z_0} - \underbrace{\sum_{k=1}^{K} D_{KL}(p_{k-1} \| p_k)}_{\text{bias } \delta(\log w^{(i)})}

With perfect transitions, the following are all monotonic in the sum of KL divergences:

δ(log w^{(i)}),  var[log w^{(i)}],  var[w^{(i)}]

Goal: minimize the sum of KL divergences.

Analyzing AIS Paths

Approach: approximate the sum of KLs with a functional. Let γ be a path and {β_k} a linearly spaced schedule. Then, for large K,

\sum_{k=1}^{K} D_{KL}(p_{k-1} \| p_k) \approx \frac{F(\gamma)}{K}, \qquad F(\gamma) = \frac{1}{2} \int_0^1 \dot\theta(\beta)^T G_{\theta(\beta)} \dot\theta(\beta) \, d\beta,

where G_θ = cov_{p_θ}(∇_θ log p_θ(x)) is the Fisher information — a metric on the manifold — and \dot\theta(\beta) = dθ(β)/dβ. Ties to information geometry; an analogous functional appears in path sampling (Gelman and Meng, 1998).

Analyzing AIS Paths

Under the optimal schedule, the value of the functional is F(γ) = ℓ(γ)²/2, where

\ell(\gamma) = \int_0^1 \sqrt{\dot\theta(\beta)^T G_{\theta(\beta)} \dot\theta(\beta)} \, d\beta

is the path length on the Riemannian manifold defined by G_θ(β). Intuition: spend more time on segments with high curvature.

Paths for Exponential Family Distributions

Let us restrict ourselves to the exponential family,

p(x) = \frac{1}{Z(\eta)} h(x) \exp(\eta^T g(x)).

Either the natural parameters η or the moments s = E[g(x)] completely specify a distribution; there is a one-to-one correspondence between η and s.

Old — geometric averages path: γ_GA(β) is the distribution with
η(β) = (1 − β) η(0) + β η(1)

New — moment averages path: γ_MA(β) is the distribution with
s(β) = (1 − β) s(0) + β s(1)

The Picture for Gaussians

Often unintuitive...

[Figure: intermediate distributions of both paths for two Gaussians.]
- GA path: γ_GA places mass only where both pdfs agree — veto effects.
- MA path: γ_MA interpolates the means and covariances, then stretches the covariance.
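A small numerical sketch of the two path definitions for 1-D Gaussians (my notation; the endpoint parameters are illustrative), using the closed-form conversions between (μ, σ²), natural parameters, and moments:

```python
import numpy as np

def natural(mu, var):
    # Natural parameters of N(mu, var): (mu/var, -1/(2 var)).
    return np.array([mu / var, -0.5 / var])

def from_natural(eta):
    var = -0.5 / eta[1]
    return eta[0] * var, var            # (mu, var)

def moments(mu, var):
    # Moments s = (E[x], E[x^2]).
    return np.array([mu, var + mu**2])

def from_moments(s):
    return s[0], s[1] - s[0]**2         # (mu, var)

a, b = (-10.0, 1.0), (10.0, 1.0)        # endpoint (mu, var) pairs
for beta in [0.25, 0.5, 0.75]:
    ga = from_natural((1 - beta) * natural(*a) + beta * natural(*b))
    ma = from_moments((1 - beta) * moments(*a) + beta * moments(*b))
    print(f"beta={beta}: GA (mu, var)={ga}  MA (mu, var)={ma}")
```

At β = 0.5 this gives GA (μ, var) = (0, 1) but MA (μ, var) = (0, 101): GA slides a narrow mode across the gap, while MA stretches the variance to cover both endpoints.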

Path Properties: Variational Interpretation

For geometric averages, the intermediate distribution minimizes a weighted sum of KLs:

p_\beta^{(GA)} = \arg\min_p \, (1 - \beta) D_{KL}(p \| p_a) + \beta D_{KL}(p \| p_b)

For moment averages, the same but with the reverse KLs:

p_\beta^{(MA)} = \arg\min_p \, (1 - \beta) D_{KL}(p_a \| p) + \beta D_{KL}(p_b \| p)

[Figure: GA path vs MA path intermediates.]

Path Properties: Cost Functional

For the exponential family we can evaluate F(γ) — importantly, this assumes linear schedules. Both γ_GA and γ_MA have the same functional value:

F(\gamma_{GA}) = F(\gamma_{MA}) = \frac{1}{2} (s(1) - s(0))^T (\eta(1) - \eta(0))

If we partition the path at distributions p_j into piecewise-linear segments, we can optimally allocate distributions from a total budget K = Σ_j K_j, with

K_j \propto \sqrt{(\eta_{j+1} - \eta_j)^T (s_{j+1} - s_j)}

The biggest effect is the choice of path, not the schedule.
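A minimal sketch of that budget-allocation rule for a piecewise-linear path (the 1-D Gaussian knots here are made up for illustration). Minimizing Σ_j F_j/K_j subject to Σ_j K_j = K gives K_j ∝ √F_j:

```python
import numpy as np

def natural(mu, var):
    return np.array([mu / var, -0.5 / var])

def moments(mu, var):
    return np.array([mu, var + mu**2])

# Hypothetical knots (mu, var) along a piecewise-linear path.
knots = [(-10.0, 1.0), (0.0, 26.0), (10.0, 1.0)]

# Per-segment cost F_j = (1/2) (s_{j+1} - s_j)^T (eta_{j+1} - eta_j).
F = []
for (m0, v0), (m1, v1) in zip(knots[:-1], knots[1:]):
    F.append(0.5 * (moments(m1, v1) - moments(m0, v0))
                 @ (natural(m1, v1) - natural(m0, v0)))

# Allocate a total budget of K intermediate distributions: K_j ∝ sqrt(F_j).
K = 1000
alloc = K * np.sqrt(F) / np.sqrt(F).sum()
print("segment costs:", F, " allocation:", np.round(alloc))
```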

Gaussians

[Figure: intermediate distributions of the GA and MA paths between two 2-D Gaussians with means (−10, 0) and (10, 0); covariance entries lost in transcription.]

Gaussians

Number of intermediate distributions required to anneal between two Gaussians with means μ_1 and μ_2 and standard deviation σ, as a function of d = (μ_2 − μ_1)/σ:

  GA, linear schedule                     O(d²)
  MA, linear schedule                     O(d²)
  GA, optimal schedule                    O(d²)
  MA, optimal schedule                    O((log d)²)
  Optimal path (Gelman and Meng, 1998)    O((log d)²)

MA is within a constant factor of the optimal path under its optimal schedule — no general proof yet. Mixing issues dominate performance in practice, which is where MA shines.
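Why MA stretches the covariance — a quick derivation from the moment-averaging definition for equal-variance 1-D endpoints (my algebra, not from the slides):

\begin{aligned}
\mu(\beta) &= (1-\beta)\mu_1 + \beta\mu_2, \\
\sigma^2(\beta) &= \big[(1-\beta)(\sigma^2 + \mu_1^2) + \beta(\sigma^2 + \mu_2^2)\big] - \mu(\beta)^2
= \sigma^2 + \beta(1-\beta)(\mu_2 - \mu_1)^2.
\end{aligned}

At β = 1/2 the intermediate variance is σ²(1 + d²/4), so even widely separated endpoints are bridged by broad intermediates that overlap their neighbors — the intuition behind the O((log d)²) rate under an optimal schedule.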

Restricted Boltzmann Machines (RBMs)

RBMs are Markov random fields of coupled visible and hidden binary variables x = (v, h) ∈ {0,1}^D × {0,1}^F with a special bipartite structure: stochastic binary visible variables are connected to stochastic binary hidden variables, with no connections within a layer. The energy of a joint configuration is

E(v, h, \theta) = \underbrace{-v^T W h}_{\text{binary potentials}} \; \underbrace{- \, v^T c - h^T b}_{\text{unary potentials}}

where θ = (W, c, b) are the model parameters. The partition function

Z(\theta) = \sum_v \sum_h \exp(-E(v, h, \theta))

is generally intractable, except with a small number (< 25) of hidden units or few visible units.
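A brute-force check of that remark (a hypothetical tiny RBM with random parameters, not a model from the talk): enumerate the 2^F hidden configurations and sum the visibles out analytically, since Z = Σ_h exp(b^T h) Π_i (1 + exp(c_i + (W h)_i)).

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
D, F = 10, 4                      # visible and hidden counts (tiny on purpose)
W = 0.1 * rng.standard_normal((D, F))
c = 0.1 * rng.standard_normal(D)  # visible biases
b = 0.1 * rng.standard_normal(F)  # hidden biases

# Enumerate hidden configurations; log(1 + e^x) via logaddexp(0, x).
logZ_terms = []
for h in itertools.product([0, 1], repeat=F):
    h = np.array(h, dtype=float)
    logZ_terms.append(b @ h + np.sum(np.logaddexp(0.0, c + W @ h)))
print("log Z =", np.logaddexp.reduce(np.array(logZ_terms)))
```

The loop is exponential in F, which is exactly why this only works for a small number of hidden units.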

Restricted Boltzmann Machines (RBMs)

[Figure: two different paths for an RBM trained on the MNIST digit dataset (60,000 black-and-white images of digits).]

Restricted Boltzmann Machines (RBMs)

The MA path is infeasible for most RBMs: we must estimate moments and then solve for natural parameters,

\underbrace{E[vh^T]_\beta}_{\text{solve for natural parameters}} = (1 - \beta)\,\underbrace{E[vh^T]_0}_{\text{estimate moments}} + \beta\, E[vh^T]_1

We can do this approximately for a few intermediate models.

Moment Averaged Spline path γ_MAS: knots are moment matched; anneal between them with GA.

[Figure: spline path from p_a to p_b through moment-matched knots.]

Estimating Partition Functions of RBMs

[Figure: estimating the partition function of an RBM with 20 hidden units trained on MNIST with PCD.]

Estimating Partition Functions of RBMs

[Figure: estimating the partition function of an RBM with 500 hidden units trained on MNIST with CD1.]

Since p(x) = exp(−E(x, θ))/Z(θ), overestimating Z(θ) means underestimating p(x), and vice versa. Misestimating by 20 nats is the difference between a log probability of −130 and −110 on the MNIST test set!

Conclusions

- Theoretical foundations for studying AIS under perfect mixing
- A new annealing scheme for exponential family distributions, with practical approximations
- Improved estimates of partition functions for RBMs

Ongoing work:
- Diagnostics for AIS
- Extending this work to models that are harder for AIS
- MA intermediate distributions for other tempering-based methods, such as learning MRFs with MA parallel tempering
- Using the MA path in marginal likelihood estimation for directed models

Thanks!
