Expectation Propagation performs smooth gradient descent. Guillaume Dehaene.

In a nutshell. Problem: posteriors are uncomputable. Solution: parametric approximations. But which one should we choose? Laplace? Variational Bayes? Expectation Propagation? I will unite these three methods.

Outline. I. Gaussian approximation methods: A. Laplace; B. Variational Bayes; C. Expectation Propagation. II. Smooth gradient methods: A. Fixed-point conditions; B. Reformulating gradient descent; C. Gaussian smoothing; D. Using the factor structure. III. Some consequences.

I. Gaussian approximation methods. We have collected some data D_1, ..., D_n. We have a great IID model: a prior p(θ) and a conditional p(D_i | θ). The posterior: p(θ | D_1, ..., D_n) = (1 / Z(D_1, ..., D_n)) p(θ) ∏_i p(D_i | θ).

Uncomputability. Usually, the posterior is uncomputable: θ is high-dimensional, and the likelihoods have a complicated structure. Two solutions: sampling methods and approximation methods. We are going to focus on Gaussian approximations.
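As a concrete running example (illustrative, not from the talk), here is a 1D Bayesian logistic regression whose posterior has no closed form; all it exposes is an unnormalized log-posterior ψ(θ). The later sketches use even simpler toy targets, but any ψ of this kind could be plugged into them.

```python
# Hypothetical example: 1D Bayesian logistic regression.
# psi(theta) = log p(theta) + sum_i log p(D_i | theta), up to the unknown log Z.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)                                                    # covariates
y = (rng.uniform(size=50) < 1.0 / (1.0 + np.exp(-1.5 * x))).astype(float)  # binary labels

def psi(theta):
    """Unnormalized log-posterior: standard normal prior + Bernoulli likelihoods."""
    log_prior = -0.5 * theta ** 2
    logits = theta * x
    log_lik = np.sum(y * logits - np.logaddexp(0.0, logits))
    return log_prior + log_lik
```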

A. The Laplace approximation. The problem: finding a Gaussian approximation q(θ) of p(θ), i.e. a quadratic approximation of log p(θ) = ψ(θ). The most basic quadratic approximation is the second-order Taylor expansion.

A. The Laplace approximation. We center at the global maximum of p(θ), θ_MAP: log p(θ) = ψ(θ) ≈ ψ(θ_MAP) + ((θ − θ_MAP)² / 2) Hψ(θ_MAP). The Laplace approximation is a Gaussian, centered at θ_MAP, with inverse-variance −Hψ(θ_MAP).
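A minimal numerical sketch of the Laplace approximation in one dimension, assuming ψ is available as a black-box function; the helper name, the finite-difference Hessian, and the test target are illustrative choices, not the speaker's code.

```python
import numpy as np
from scipy.optimize import minimize

def laplace_approximation(psi, theta0=0.0, eps=1e-5):
    """Return (mean, variance) of the Laplace approximation of exp(psi)."""
    # theta_MAP: maximize psi, i.e. minimize -psi.
    res = minimize(lambda t: -psi(t[0]), x0=[theta0])
    theta_map = res.x[0]
    # Second derivative of psi at theta_MAP, by central finite differences.
    h_psi = (psi(theta_map + eps) - 2.0 * psi(theta_map) + psi(theta_map - eps)) / eps ** 2
    return theta_map, -1.0 / h_psi   # inverse-variance is -Hpsi(theta_MAP)

# Example with a Gaussian target N(2, 0.5^2), where Laplace is exact:
mean, var = laplace_approximation(lambda t: -0.5 * (t - 2.0) ** 2 / 0.25)
print(mean, var)   # approximately 2.0 and 0.25
```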

B. The Variational Bayes approximation. Laplace is fine but not very principled. Instead, let's find the Gaussian which minimizes a "sensible notion of distance" to p(θ): q_VB = argmin_q KL(q, p). Distance = reverse Kullback-Leibler divergence: KL(q, p) = ∫ q(θ) log(q(θ) / p(θ)) dθ.

B. The Variational Bayes approximation. Parameterize Gaussians with their mean and std (μ, σ): η_{μ,σ} = μ + ση₀ with η₀ ~ N(0, 1), so that q(θ | μ, σ) = (1 / (√(2π) σ)) exp(−(θ − μ)² / (2σ²)). Then KL(q, p) = −E[ψ(μ + ση₀)] − log σ + const. Gradient w.r.t. μ: ∂_μ KL = −E[∇ψ(μ + ση₀)]. Gradient w.r.t. σ: ∂_σ KL = −σ E[Hψ(μ + ση₀)] − 1/σ.

B. The Variational Bayes approximation. Gradient w.r.t. μ: ∂_μ KL = −E[∇ψ(μ + ση₀)]. Gradient w.r.t. σ: ∂_σ KL = −σ E[Hψ(μ + ση₀)] − 1/σ. We can use stochastic gradient descent to optimize the reverse KL divergence, using samples from η₀.
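A minimal sketch of this stochastic gradient scheme, assuming the gradient and Hessian of ψ can be evaluated pointwise; the learning rate, iteration count, and function names are illustrative assumptions.

```python
import numpy as np

def vb_gaussian(grad_psi, hess_psi, mu=0.0, sigma=1.0, lr=0.01, n_iters=5000, seed=0):
    """Stochastic gradient descent on KL(q, p) for q = N(mu, sigma^2)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_iters):
        eta0 = rng.normal()                                   # sample eta_0 ~ N(0, 1)
        theta = mu + sigma * eta0                             # reparameterization
        grad_mu = -grad_psi(theta)                            # d KL / d mu
        grad_sigma = -sigma * hess_psi(theta) - 1.0 / sigma   # d KL / d sigma
        mu -= lr * grad_mu
        sigma = max(sigma - lr * grad_sigma, 1e-6)            # keep sigma positive
    return mu, sigma

# Example: target N(2, 0.5^2), so grad_psi(t) = -(t - 2)/0.25 and Hpsi(t) = -4.
mu, sigma = vb_gaussian(lambda t: -(t - 2.0) / 0.25, lambda t: -4.0)
print(mu, sigma)   # should approach 2.0 and 0.5
```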

C. Expectation Propagation. First key idea: instead of a global approximation q(θ) ≈ p(θ), we use the factor structure of p(θ), p(θ) = ∏_{i=1}^{n} f_i(θ), and compute n local approximations g_i(θ) ≈ f_i(θ). We can recover a global approximation as q_EP(θ) = ∏_i g_i(θ).

C. Expectation Propagation. First key idea: compute n local approximations g_i(θ). Second key idea: iteratively refine the approximation g_i(θ) ≈ f_i(θ), using the other approximations g_j(θ), j ≠ i, as context.

C. Expectation Propagation. To update g_i(θ): define the "hybrid" h_i(θ) ∝ f_i(θ) ∏_{j≠i} g_j(θ); compute the mean and variance of h_i(θ), and the Gaussian q(θ) which has that same mean and variance; new approximation: g_i(θ) = q(θ) / ∏_{j≠i} g_j(θ).

C. Expectation Propagation. First key idea: compute n local approximations g_i(θ). Second key idea: iteratively refine the approximation g_i(θ) ≈ f_i(θ). New approximation: g_i(θ) = Gauss(f_i(θ) ∏_{j≠i} g_j(θ)) / ∏_{j≠i} g_j(θ), where Gauss(·) is the Gaussian with the same mean and variance as its argument.

C. Expectation Propagation. Computing the mean and the variance of the hybrid requires model-specific work: analytic formulas, quadrature methods, pre-compiled approximations, or sampling. EP can also use non-Gaussian approximations.
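A sketch of the classic EP iteration for a 1D target with Gaussian sites, computing the hybrid's mean and variance by brute-force quadrature on a grid (one of the options listed above); the site parameterization, the grid, and the toy factors are illustrative assumptions, not the speaker's implementation.

```python
# Illustrative 1D EP with Gaussian sites g_i(t) = exp(a_i*t - 0.5*b_i*t^2) (unnormalized);
# hybrid moments are computed by quadrature on a dense grid.
import numpy as np

def ep_1d(log_factors, n_sweeps=20, grid=np.linspace(-10, 10, 4001)):
    n = len(log_factors)
    a = np.zeros(n)          # site natural means
    b = np.zeros(n)          # site precisions
    b[0] = 1.0               # assume factor 0 is a N(0, 1) prior: match it exactly
    for _ in range(n_sweeps):
        for i in range(1, n):
            a_cav, b_cav = a.sum() - a[i], b.sum() - b[i]   # cavity = prod_{j != i} g_j
            log_h = log_factors[i](grid) + a_cav * grid - 0.5 * b_cav * grid ** 2
            w = np.exp(log_h - log_h.max())                 # unnormalized hybrid h_i
            w /= w.sum()
            mean = np.sum(w * grid)
            var = np.sum(w * (grid - mean) ** 2)
            a[i] = mean / var - a_cav                       # moment matching:
            b[i] = 1.0 / var - b_cav                        # q = Gauss(h_i), g_i = q / cavity
    B = b.sum()
    return a.sum() / B, 1.0 / B                             # global mean and variance

# Example: N(0, 1) prior (factor 0) and two logistic-likelihood factors.
factors = [lambda t: -0.5 * t ** 2,
           lambda t: -np.logaddexp(0.0, -t),        # log sigmoid(t)
           lambda t: -np.logaddexp(0.0, -2.0 * t)]  # log sigmoid(2t)
print(ep_1d(factors))
```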

Summary. To deal with an uncomputable target distribution p(θ) = ∏_{i=1}^{n} f_i(θ): the Laplace approximation does GD on ψ(θ); the VB approximation does SGD on KL(q, p); EP runs the EP iteration. These methods couldn't seem further apart!

Summary. Gradient descent is well understood and intuitive: it follows the dynamics of an object sliding down a slope. Stochastic optimization of KL(q, p) and the EP iteration are unintuitive. This makes them hard to use.

II. Smooth gradient methods. I will now unite these three methods under a single framework: smooth gradient methods. These iterate on Gaussian approximations to p(θ). They are closely related to GD, hence intuitive. The three methods correspond to special cases.

A. Fixed-point conditions. The methods can be united because their fixed-point conditions are extremely similar. Laplace: the gradient is 0: ∇ψ(θ_MAP) = 0.

A. Fixed-point conditions. Variational Bayes: recall the gradients: ∂_μ KL = −E[∇ψ(μ + ση₀)], ∂_σ KL = −σ E[Hψ(μ + ση₀)] − 1/σ. The optimal Gaussian q_VB must respect: E_{q_VB}[∇ψ(θ)] = 0 and −E_{q_VB}[Hψ(θ)] = Cov(q_VB)⁻¹.

A. Fixed-point conditions. Variational Bayes: E_{q_VB}[∇ψ(θ)] = 0 and −E_{q_VB}[Hψ(θ)] = Cov(q_VB)⁻¹. The smoothed gradient must be 0; the covariance is related to the peakedness of log p.

A. Fixed-point conditions. Expectation Propagation: define the "hybrid" h_i(θ) ∝ f_i(θ) ∏_{j≠i} g_j(θ); compute the mean and variance of h_i(θ), and the Gaussian q(θ) which has that same mean and variance; new approximation: g_i(θ) = q(θ) / ∏_{j≠i} g_j(θ).

A. Fixed-point conditions. Expectation Propagation: an easy fixed-point condition: all the hybrids h_i(θ) and the global Gaussian approximation q_EP(θ) have the same mean and variance. Using this, and a little bit of math (Dehaene 2016): Σ_i E_{h_i}[∇ log f_i(θ)] = 0 and −Σ_i Cov(h_i)⁻¹ E_{h_i}[(θ − μ_i) ∇ log f_i(θ)] = Cov(q_EP)⁻¹.

A. Fixed-point conditions. Laplace: ∇ψ(θ_MAP) = 0. VB: E_{q_VB}[∇ψ(θ)] = 0; −E_{q_VB}[Hψ(θ)] = Cov(q_VB)⁻¹. EP: Σ_i E_{h_i}[∇ log f_i(θ)] = 0; −Σ_i Cov(h_i)⁻¹ E_{h_i}[(θ − μ_EP) ∇ log f_i(θ)] = Cov(q_EP)⁻¹.

A. Fixed-point conditions. Laplace: ∇ψ(θ_MAP) = 0. VB: E_{q_VB}[∇ψ(θ)] = 0; −Cov(q_VB)⁻¹ E_{q_VB}[(θ − μ_VB) ∇ψ(θ)] = Cov(q_VB)⁻¹. EP: Σ_i E_{h_i}[∇ log f_i(θ)] = 0; −Σ_i Cov(h_i)⁻¹ E_{h_i}[(θ − μ_EP) ∇ log f_i(θ)] = Cov(q_EP)⁻¹.

B. Reformulating gradient descent. The first step: reframing gradient descent as iterating over Gaussian approximations to p(θ), with the Laplace approximation as the fixed point. Key idea: GD corresponds to using a linear approximation of ψ = log p.

B. Reformulating gradient descent. Interpretation of GD: θ_{n+1} = θ_n + λ ∇ψ(θ_n), i.e. "please find the maximizer of" (θ − θ_n) ∇ψ(θ_n) − (1/(2λ)) (θ − θ_n)². This is the same as the mean of the Gaussian q_{n+1}(θ) ∝ exp((θ − θ_n) ∇ψ(θ_n) − (1/(2λ)) (θ − θ_n)²).

B. Reformulating gradient descent. A trivial reformulation of GD. Iterate: q_{n+1}(θ) ∝ exp((θ − θ_n) ∇ψ(θ_n) − (1/(2λ)) (θ − θ_n)²), θ_{n+1} = E_{q_{n+1}}[θ]. This iterates Gaussian approximations to p(θ), but the fixed point isn't the Laplace approximation: its variance is just the step size λ, not the curvature at θ_MAP.

B. Reformulating gradient descent. We need an optimization algorithm which uses a quadratic approximation of ψ. Newton's method: θ_{n+1} = θ_n − Hψ(θ_n)⁻¹ ∇ψ(θ_n), i.e. "please find the maximizer of" (θ − θ_n) ∇ψ(θ_n) + (1/2) Hψ(θ_n) (θ − θ_n)².

B. Reformulating gradient descent. A trivial reformulation of Newton's method. Iterate: q_{n+1}(θ) ∝ exp((θ − θ_n) ∇ψ(θ_n) + (Hψ(θ_n)/2) (θ − θ_n)²), θ_{n+1} = E_{q_{n+1}}[θ]. We iterate Gaussian approximations of p(θ) until we find the Laplace approximation.

Algorithm 1: disguised gradient descent (Newton's method). [Diagram: DGD builds a quadratic approximation of ψ; since p = exp(ψ), exponentiating that quadratic gives a Gaussian approximation of p.]
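A sketch of Algorithm 1, Newton's method phrased as an iteration over Gaussian approximations q_n of p, with finite-difference derivatives of ψ; the function name and the fixed iteration count are illustrative.

```python
import numpy as np

def disguised_gradient_descent(psi, theta0=0.0, n_iters=20, eps=1e-5):
    """Newton's method on psi, phrased as iterating Gaussian approximations q_n of p."""
    theta = theta0
    for _ in range(n_iters):
        grad = (psi(theta + eps) - psi(theta - eps)) / (2 * eps)
        hess = (psi(theta + eps) - 2 * psi(theta) + psi(theta - eps)) / eps ** 2
        # q_{n+1}(t) is proportional to exp( (t - theta)*grad + 0.5*hess*(t - theta)^2 )
        mean = theta - grad / hess          # mean of q_{n+1}
        var = -1.0 / hess                   # variance of q_{n+1}
        theta = mean                        # theta_{n+1} = E_{q_{n+1}}[theta]
    return mean, var                        # fixed point = Laplace approximation

# Example: psi of a N(2, 0.5^2) target; Newton's method converges in one step here.
print(disguised_gradient_descent(lambda t: -0.5 * (t - 2.0) ** 2 / 0.25))
```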

C. Gaussian smoothing. The Laplace approximation is a point approximation. We could improve the algorithm: smooth the objective function, ψ̃ = ψ ∗ exp(−θ²/(2σ²)), and run the algorithm on ψ̃.

C. Gaussian smoothing. How should we choose the smoothing bandwidth σ? We could choose it once for all steps, OR, on each step, we could use the current Gaussian approximation q_n(θ) to smooth ψ.

Algorithm 2: smoothed gradient descent.
- Initialize with any Gaussian q_0
- Loop:
  μ_n = E_{q_n}[θ]
  r = E_{q_n}[∇ψ(θ)]
  β = E_{q_n}[Hψ(θ)]
  q_{n+1}(θ) ∝ exp(r (θ − μ_n) + (β/2) (θ − μ_n)²)
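A sketch of Algorithm 2 in one dimension, estimating the smoothed gradient r and curvature β by Monte Carlo under q_n; the sample size and the way ∇ψ and Hψ are passed in are illustrative assumptions.

```python
import numpy as np

def smoothed_gradient_descent(grad_psi, hess_psi, mu=0.0, var=1.0,
                              n_iters=100, n_samples=2000, seed=0):
    """Algorithm 2: iterate Gaussians q_n, smoothing the gradient and Hessian under q_n."""
    rng = np.random.default_rng(seed)
    for _ in range(n_iters):
        theta = mu + np.sqrt(var) * rng.normal(size=n_samples)   # samples from q_n
        r = grad_psi(theta).mean()          # r    = E_{q_n}[ grad psi(theta) ]
        beta = hess_psi(theta).mean()       # beta = E_{q_n}[ H psi(theta) ]  (negative)
        # q_{n+1}(t) is proportional to exp( r*(t - mu) + 0.5*beta*(t - mu)^2 )
        mu, var = mu - r / beta, -1.0 / beta
    return mu, var

# Example: N(2, 0.5^2) target; the fixed point has mean 2 and variance 0.25.
print(smoothed_gradient_descent(lambda t: -(t - 2.0) / 0.25,
                                lambda t: -4.0 + 0.0 * t))
```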

D. Using the factor structure. In order to use EP, p(θ) needs to have a nice factor structure: p(θ) = ∏_{i=1}^{n} f_i(θ). For ψ: ψ(θ) = Σ_{i=1}^{n} φ_i(θ), with φ_i = log f_i.

D. Using the factor structure. Crazy idea: we could use the VB algorithm, with non-Gaussian smoothing, on each component φ_i of ψ.

D. Using the factor structure. The Algorithm 2 update has two equivalent forms: β = E_{q_n}[Hψ(θ)] OR β = Cov(q_n)⁻¹ E_{q_n}[(θ − μ_n) ∇ψ(θ)]. When we replace q_n by h_i, we have a choice to make: which form should we use?
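A quick numerical sanity check (illustrative, not from the talk) that the two forms agree when q_n is Gaussian; this is just Stein's identity, and the test function ψ below is arbitrary.

```python
# Check, by Monte Carlo, that for a Gaussian q:
#   E_q[ H psi(theta) ]  ==  Cov(q)^{-1} E_q[ (theta - mu) * grad psi(theta) ].
import numpy as np

rng = np.random.default_rng(1)
mu, var = 0.7, 0.3
theta = mu + np.sqrt(var) * rng.normal(size=2_000_000)

grad_psi = lambda t: -np.tanh(t)            # psi(t) = -log cosh(t), so grad psi = -tanh(t)
hess_psi = lambda t: -1.0 / np.cosh(t) ** 2 # and H psi = -sech(t)^2

form_1 = hess_psi(theta).mean()
form_2 = ((theta - mu) * grad_psi(theta)).mean() / var
print(form_1, form_2)                       # the two estimates agree up to Monte Carlo error
```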

Algorithm 3: smooth EP.
- Initialize with any Gaussians g_1, g_2, ..., g_n
- Loop over i:
  h_i ∝ f_i ∏_{j≠i} g_j
  μ_i = E_{h_i}[θ]
  r_i = E_{h_i}[∇φ_i(θ)]
  β_i = Var(h_i)⁻¹ E_{h_i}[(θ − μ_i) ∇φ_i(θ)]
  g_i(θ) ∝ exp(r_i (θ − μ_i) + (β_i/2) (θ − μ_i)²)
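A sketch of Algorithm 3 in one dimension, with hybrid expectations computed by grid quadrature as in the earlier EP sketch; the site parameterization in natural form and the toy factors are illustrative assumptions. On this toy model it should return, up to quadrature error, the same global mean and variance as the classic EP sketch, in line with the equivalence claimed on the next slide.

```python
import numpy as np

def smooth_ep_1d(log_factors, grad_log_factors, n_sweeps=30,
                 grid=np.linspace(-10, 10, 4001)):
    """Algorithm 3: smooth EP with sites stored as g_i(t) = exp(a_i*t - 0.5*b_i*t^2)."""
    n = len(log_factors)
    a, b = np.zeros(n), np.zeros(n)
    b[0] = 1.0                                   # assume factor 0 is a N(0, 1) prior
    for _ in range(n_sweeps):
        for i in range(1, n):
            a_cav, b_cav = a.sum() - a[i], b.sum() - b[i]
            # hybrid h_i is proportional to f_i * prod_{j != i} g_j
            log_h = log_factors[i](grid) + a_cav * grid - 0.5 * b_cav * grid ** 2
            w = np.exp(log_h - log_h.max())
            w /= w.sum()
            mu_i = np.sum(w * grid)
            var_i = np.sum(w * (grid - mu_i) ** 2)
            g = grad_log_factors[i](grid)                      # grad phi_i(theta)
            r_i = np.sum(w * g)                                # E_{h_i}[grad phi_i]
            beta_i = np.sum(w * (grid - mu_i) * g) / var_i     # Var^-1 E[(t - mu_i) grad phi_i]
            # g_i(t) proportional to exp( r_i*(t - mu_i) + 0.5*beta_i*(t - mu_i)^2 )
            b[i] = -beta_i
            a[i] = r_i - beta_i * mu_i
    B = b.sum()
    return a.sum() / B, 1.0 / B

# Same toy model as before: N(0, 1) prior plus two logistic factors.
logf = [lambda t: -0.5 * t ** 2,
        lambda t: -np.logaddexp(0.0, -t),
        lambda t: -np.logaddexp(0.0, -2.0 * t)]
dlogf = [lambda t: -t,
         lambda t: 1.0 / (1.0 + np.exp(t)),          # d/dt log sigmoid(t)
         lambda t: 2.0 / (1.0 + np.exp(2.0 * t))]    # d/dt log sigmoid(2t)
print(smooth_ep_1d(logf, dlogf))
```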

D. Using the factor structure. Smooth EP is actually exactly equivalent to EP! This ties EP to a much more intuitive algorithm: Newton's method. However, we have lost the most important feature: the explicit objective function.

Summary. I have presented a family of algorithms which iterate Gaussian approximations to p(θ) by computing smoothed quadratic approximations of log p(θ) (or parts of it). Different smoothings correspond to different known methods: Laplace = no smoothing; VB = Gaussian smoothing; EP = various hybrid smoothings.

III. Consequences. Key observation: EP and Algorithm 2 are closely related to Newton's method. They must behave in a similar fashion!

III. Consequences. Newton's method has two striking features: very fast convergence near its fixed points, and possible oscillations (overshooting). Solution: complement it with line-search methods. EP is also known for its oscillations! Are these also overshoots of the target?

III. Consequences. EP still needs improvements. We could import good ideas from Newton's method and use them on EP: line search (but how?); handling a non-SPD second-order term β_i (but how?). Finally, the smooth-Newton view of EP might be useful on its own.

III. Consequences. Since Laplace, VB and EP are closely related, we can ask whether they behave similarly in some situations. Laplace = no smoothing; VB = Gaussian smoothing; EP = various hybrid smoothings. Whenever the smoothing distributions are similar, the methods are similar!

III. Consequences. If all hybrids are almost equal to the global approximation, h_i ≈ q_n, i.e. the correction f_i/g_i is negligible compared to q_n, then EP and VB have the same smoothing and behave similarly.

III. Consequences. Furthermore, if q_n is almost a Dirac distribution, i.e. its width is negligible compared to the variations of ψ and/or the φ_i functions, then Algorithm 2 behaves similarly to Algorithm 1. Thus, the VB approximation behaves similarly to the Laplace approximation.

Conclusion. Various approximation methods are closely related: they can be obtained through smooth Newton methods. Laplace = no smoothing; VB = Gaussian smoothing; EP = hybrid smoothings. Corollary: VB and EP are very closely related, and EP behaves similarly to Newton's method.

Speculation. Smooth Newton variants might be computationally useful for VB and EP. VB and EP should give better approximations than Laplace. Can this give a path towards understanding or improving the convergence of EP and VB?

References.
Minka (2001), Expectation Propagation for Approximate Bayesian Inference.
Seeger (2007), Expectation Propagation for Exponential Families.
Dehaene and Barthelmé (2017), Expectation Propagation in the Large-Data Limit.
Dehaene (2016), Expectation Propagation Performs a Smoothed Gradient Descent.