Expectation Propagation performs smooth gradient descent. Guillaume Dehaene.

In a nutshell. Problem: posteriors are uncomputable. Solution: parametric approximations. But which one should we choose? Laplace? Variational Bayes? Expectation Propagation? I will unite these three methods.

Outline. I. Gaussian approximation methods: A. Laplace; B. Variational Bayes; C. Expectation Propagation. II. Smooth gradient methods: A. Fixed-point conditions; B. Reformulating gradient descent; C. Gaussian smoothing; D. Using the factor structure. III. Some consequences.

I. Gaussian approximation methods. We have collected some data D_1, ..., D_n. We have a great IID model: a prior p(θ) and a conditional p(D_i | θ). The posterior: p(θ | D_1, ..., D_n) = (1 / Z(D_1, ..., D_n)) p(θ) ∏_i p(D_i | θ).

Uncomputability. Usually, the posterior is uncomputable: θ is high-dimensional, and the likelihoods have a complicated structure. Two solutions: sampling methods and approximation methods. We are going to focus on Gaussian approximations.
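As a concrete running example (illustrative, not from the talk), here is a 1D Bayesian logistic regression whose posterior has no closed form; all it exposes is an unnormalized log-posterior ψ(θ). The later sketches use even simpler toy targets, but any ψ of this kind could be plugged into them.

```python
# Hypothetical example: 1D Bayesian logistic regression.
# psi(theta) = log p(theta) + sum_i log p(D_i | theta), up to the unknown log Z.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)                                                    # covariates
y = (rng.uniform(size=50) < 1.0 / (1.0 + np.exp(-1.5 * x))).astype(float)  # binary labels

def psi(theta):
    """Unnormalized log-posterior: standard normal prior + Bernoulli likelihoods."""
    log_prior = -0.5 * theta ** 2
    logits = theta * x
    log_lik = np.sum(y * logits - np.logaddexp(0.0, logits))
    return log_prior + log_lik
```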

A. The Laplace approximation. The problem: finding a Gaussian approximation q(θ) of p(θ), i.e. a quadratic approximation of log p(θ) = ψ(θ). The most basic quadratic approximation is the second-order Taylor expansion.

A. The Laplace approximation. We center at the global maximum of p(θ), θ_MAP: log p(θ) = ψ(θ) ≈ ψ(θ_MAP) + ((θ − θ_MAP)² / 2) Hψ(θ_MAP). The Laplace approximation is a Gaussian, centered at θ_MAP, with inverse-variance −Hψ(θ_MAP).
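A minimal numerical sketch of the Laplace approximation in one dimension, assuming ψ is available as a black-box function; the helper name, the finite-difference Hessian, and the test target are illustrative choices, not the speaker's code.

```python
import numpy as np
from scipy.optimize import minimize

def laplace_approximation(psi, theta0=0.0, eps=1e-5):
    """Return (mean, variance) of the Laplace approximation of exp(psi)."""
    # theta_MAP: maximize psi, i.e. minimize -psi.
    res = minimize(lambda t: -psi(t[0]), x0=[theta0])
    theta_map = res.x[0]
    # Second derivative of psi at theta_MAP, by central finite differences.
    h_psi = (psi(theta_map + eps) - 2.0 * psi(theta_map) + psi(theta_map - eps)) / eps ** 2
    return theta_map, -1.0 / h_psi   # inverse-variance is -Hpsi(theta_MAP)

# Example with a Gaussian target N(2, 0.5^2), where Laplace is exact:
mean, var = laplace_approximation(lambda t: -0.5 * (t - 2.0) ** 2 / 0.25)
print(mean, var)   # approximately 2.0 and 0.25
```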

B. The Variational Bayes approximation. Laplace is fine but not very principled. Instead, let's find the Gaussian which minimizes a "sensible notion of distance" to p(θ): q_VB = argmin_q KL(q, p). Distance = reverse Kullback-Leibler divergence: KL(q, p) = ∫ q(θ) log(q(θ) / p(θ)) dθ.

B. The Variational Bayes approximation. Parameterize Gaussians with their mean and std (μ, σ): η_{μ,σ} = μ + ση₀ with η₀ ~ N(0, 1), so that q(θ | μ, σ) = (1 / (√(2π) σ)) exp(−(θ − μ)² / (2σ²)). Then KL(q, p) = −E[ψ(μ + ση₀)] − log σ + const. Gradient w.r.t. μ: ∂_μ KL = −E[∇ψ(μ + ση₀)]. Gradient w.r.t. σ: ∂_σ KL = −σ E[Hψ(μ + ση₀)] − 1/σ.

B. The Variational Bayes approximation. Gradient w.r.t. μ: ∂_μ KL = −E[∇ψ(μ + ση₀)]. Gradient w.r.t. σ: ∂_σ KL = −σ E[Hψ(μ + ση₀)] − 1/σ. We can use stochastic gradient descent to optimize the reverse KL divergence, using samples from η₀.
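A minimal sketch of this stochastic gradient scheme, assuming the gradient and Hessian of ψ can be evaluated pointwise; the learning rate, iteration count, and function names are illustrative assumptions.

```python
import numpy as np

def vb_gaussian(grad_psi, hess_psi, mu=0.0, sigma=1.0, lr=0.01, n_iters=5000, seed=0):
    """Stochastic gradient descent on KL(q, p) for q = N(mu, sigma^2)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_iters):
        eta0 = rng.normal()                                   # sample eta_0 ~ N(0, 1)
        theta = mu + sigma * eta0                             # reparameterization
        grad_mu = -grad_psi(theta)                            # d KL / d mu
        grad_sigma = -sigma * hess_psi(theta) - 1.0 / sigma   # d KL / d sigma
        mu -= lr * grad_mu
        sigma = max(sigma - lr * grad_sigma, 1e-6)            # keep sigma positive
    return mu, sigma

# Example: target N(2, 0.5^2), so grad_psi(t) = -(t - 2)/0.25 and Hpsi(t) = -4.
mu, sigma = vb_gaussian(lambda t: -(t - 2.0) / 0.25, lambda t: -4.0)
print(mu, sigma)   # should approach 2.0 and 0.5
```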

C. Expectation Propagation. First key idea: instead of a global approximation q(θ) ≈ p(θ), we use the factor structure of p(θ), p(θ) = ∏_{i=1}^{n} f_i(θ), and compute n local approximations g_i(θ) ≈ f_i(θ). We can recover a global approximation as q_EP(θ) = ∏_i g_i(θ).

C. Expectation Propagation. First key idea: compute n local approximations g_i(θ). Second key idea: iteratively refine the approximation g_i(θ) ≈ f_i(θ), using the other approximations g_j(θ), j ≠ i, as context.

C. Expectation Propagation. To update g_i(θ): define the "hybrid" h_i(θ) ∝ f_i(θ) ∏_{j≠i} g_j(θ); compute the mean and variance of h_i(θ), and the Gaussian q(θ) which has that same mean and variance; new approximation: g_i(θ) = q(θ) / ∏_{j≠i} g_j(θ).

C. Expectation Propagation. First key idea: compute n local approximations g_i(θ). Second key idea: iteratively refine the approximation g_i(θ) ≈ f_i(θ). New approximation: g_i(θ) = Gauss(f_i(θ) ∏_{j≠i} g_j(θ)) / ∏_{j≠i} g_j(θ), where Gauss(·) is the Gaussian with the same mean and variance as its argument.

C. Expectation Propagation. Computing the mean and the variance of the hybrid requires model-specific work: analytic formulas, quadrature methods, pre-compiled approximations, or sampling. EP can also use non-Gaussian approximations.
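A sketch of the classic EP iteration for a 1D target with Gaussian sites, computing the hybrid's mean and variance by brute-force quadrature on a grid (one of the options listed above); the site parameterization, the grid, and the toy factors are illustrative assumptions, not the speaker's implementation.

```python
# Illustrative 1D EP with Gaussian sites g_i(t) = exp(a_i*t - 0.5*b_i*t^2) (unnormalized);
# hybrid moments are computed by quadrature on a dense grid.
import numpy as np

def ep_1d(log_factors, n_sweeps=20, grid=np.linspace(-10, 10, 4001)):
    n = len(log_factors)
    a = np.zeros(n)          # site natural means
    b = np.zeros(n)          # site precisions
    b[0] = 1.0               # assume factor 0 is a N(0, 1) prior: match it exactly
    for _ in range(n_sweeps):
        for i in range(1, n):
            a_cav, b_cav = a.sum() - a[i], b.sum() - b[i]   # cavity = prod_{j != i} g_j
            log_h = log_factors[i](grid) + a_cav * grid - 0.5 * b_cav * grid ** 2
            w = np.exp(log_h - log_h.max())                 # unnormalized hybrid h_i
            w /= w.sum()
            mean = np.sum(w * grid)
            var = np.sum(w * (grid - mean) ** 2)
            a[i] = mean / var - a_cav                       # moment matching:
            b[i] = 1.0 / var - b_cav                        # q = Gauss(h_i), g_i = q / cavity
    B = b.sum()
    return a.sum() / B, 1.0 / B                             # global mean and variance

# Example: N(0, 1) prior (factor 0) and two logistic-likelihood factors.
factors = [lambda t: -0.5 * t ** 2,
           lambda t: -np.logaddexp(0.0, -t),        # log sigmoid(t)
           lambda t: -np.logaddexp(0.0, -2.0 * t)]  # log sigmoid(2t)
print(ep_1d(factors))
```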

Summary. To deal with an uncomputable target distribution p(θ) = ∏_{i=1}^{n} f_i(θ): the Laplace approximation does GD on ψ(θ); the VB approximation does SGD on KL(q, p); EP runs the EP iteration. These methods couldn't seem further apart!

Summary. Gradient descent is well understood and intuitive: it follows the dynamics of an object sliding down a slope. Stochastic optimization of KL(q, p) and the EP iteration are unintuitive. This makes them hard to use.

II. Smooth gradient methods. I will now unite these three methods under a single framework: smooth gradient methods. These iterate on Gaussian approximations to p(θ). They are closely related to GD, hence intuitive. The three methods correspond to special cases.

A. Fixed-point conditions. The methods can be united because their fixed-point conditions are extremely similar. Laplace: the gradient is 0: ∇ψ(θ_MAP) = 0.

A. Fixed-point conditions. Variational Bayes: recall the gradients: ∂_μ KL = −E[∇ψ(μ + ση₀)], ∂_σ KL = −σ E[Hψ(μ + ση₀)] − 1/σ. The optimal Gaussian q_VB must respect: E_{q_VB}[∇ψ(θ)] = 0 and −E_{q_VB}[Hψ(θ)] = Cov(q_VB)⁻¹.

A. Fixed-point conditions. Variational Bayes: E_{q_VB}[∇ψ(θ)] = 0 and −E_{q_VB}[Hψ(θ)] = Cov(q_VB)⁻¹. The smoothed gradient must be 0; the covariance is related to the peakedness of log p.

A. Fixed-point conditions. Expectation Propagation: define the "hybrid" h_i(θ) ∝ f_i(θ) ∏_{j≠i} g_j(θ); compute the mean and variance of h_i(θ), and the Gaussian q(θ) which has that same mean and variance; new approximation: g_i(θ) = q(θ) / ∏_{j≠i} g_j(θ).

A. Fixed-point conditions. Expectation Propagation: an easy fixed-point condition: all the hybrids h_i(θ) and the global Gaussian approximation q_EP(θ) have the same mean and variance. Using this, and a little bit of math (Dehaene 2016): Σ_i E_{h_i}[∇ log f_i(θ)] = 0 and −Σ_i Cov(h_i)⁻¹ E_{h_i}[(θ − μ_i) ∇ log f_i(θ)] = Cov(q_EP)⁻¹.

A. Fixed-point conditions. Laplace: ∇ψ(θ_MAP) = 0. VB: E_{q_VB}[∇ψ(θ)] = 0; −E_{q_VB}[Hψ(θ)] = Cov(q_VB)⁻¹. EP: Σ_i E_{h_i}[∇ log f_i(θ)] = 0; −Σ_i Cov(h_i)⁻¹ E_{h_i}[(θ − μ_EP) ∇ log f_i(θ)] = Cov(q_EP)⁻¹.

A. Fixed-point conditions. Laplace: ∇ψ(θ_MAP) = 0. VB: E_{q_VB}[∇ψ(θ)] = 0; −Cov(q_VB)⁻¹ E_{q_VB}[(θ − μ_VB) ∇ψ(θ)] = Cov(q_VB)⁻¹. EP: Σ_i E_{h_i}[∇ log f_i(θ)] = 0; −Σ_i Cov(h_i)⁻¹ E_{h_i}[(θ − μ_EP) ∇ log f_i(θ)] = Cov(q_EP)⁻¹.

B. Reformulating gradient descent. The first step: reframing gradient descent as iterating over Gaussian approximations to p(θ), with the Laplace approximation as the fixed point. Key idea: GD corresponds to using a linear approximation of ψ = log p.

B. Reformulating gradient descent. Interpretation of GD: θ_{n+1} = θ_n + λ ∇ψ(θ_n), i.e. "please find the maximizer of" (θ − θ_n) ∇ψ(θ_n) − (1/(2λ)) (θ − θ_n)². This is the same as the mean of the Gaussian q_{n+1}(θ) ∝ exp((θ − θ_n) ∇ψ(θ_n) − (1/(2λ)) (θ − θ_n)²).

B. Reformulating gradient descent. A trivial reformulation of GD. Iterate: q_{n+1}(θ) ∝ exp((θ − θ_n) ∇ψ(θ_n) − (1/(2λ)) (θ − θ_n)²), θ_{n+1} = E_{q_{n+1}}[θ]. This iterates Gaussian approximations to p(θ), but the fixed point isn't the Laplace approximation: its variance is just the step size λ, not the curvature at θ_MAP.

B. Reformulating gradient descent. We need an optimization algorithm which uses a quadratic approximation of ψ. Newton's method: θ_{n+1} = θ_n − Hψ(θ_n)⁻¹ ∇ψ(θ_n), i.e. "please find the maximizer of" (θ − θ_n) ∇ψ(θ_n) + (1/2) Hψ(θ_n) (θ − θ_n)².

B. Reformulating gradient descent. A trivial reformulation of Newton's method. Iterate: q_{n+1}(θ) ∝ exp((θ − θ_n) ∇ψ(θ_n) + (Hψ(θ_n)/2) (θ − θ_n)²), θ_{n+1} = E_{q_{n+1}}[θ]. We iterate Gaussian approximations of p(θ) until we find the Laplace approximation.

Algorithm 1: disguised gradient descent (Newton's method). [Diagram: DGD builds a quadratic approximation of ψ; since p = exp(ψ), exponentiating that quadratic gives a Gaussian approximation of p.]
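A sketch of Algorithm 1, Newton's method phrased as an iteration over Gaussian approximations q_n of p, with finite-difference derivatives of ψ; the function name and the fixed iteration count are illustrative.

```python
import numpy as np

def disguised_gradient_descent(psi, theta0=0.0, n_iters=20, eps=1e-5):
    """Newton's method on psi, phrased as iterating Gaussian approximations q_n of p."""
    theta = theta0
    for _ in range(n_iters):
        grad = (psi(theta + eps) - psi(theta - eps)) / (2 * eps)
        hess = (psi(theta + eps) - 2 * psi(theta) + psi(theta - eps)) / eps ** 2
        # q_{n+1}(t) is proportional to exp( (t - theta)*grad + 0.5*hess*(t - theta)^2 )
        mean = theta - grad / hess          # mean of q_{n+1}
        var = -1.0 / hess                   # variance of q_{n+1}
        theta = mean                        # theta_{n+1} = E_{q_{n+1}}[theta]
    return mean, var                        # fixed point = Laplace approximation

# Example: psi of a N(2, 0.5^2) target; Newton's method converges in one step here.
print(disguised_gradient_descent(lambda t: -0.5 * (t - 2.0) ** 2 / 0.25))
```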

C. Gaussian smoothing. The Laplace approximation is a point approximation. We could improve the algorithm: smooth the objective function, ψ̃ = ψ ∗ exp(−θ²/(2σ²)), and run the algorithm on ψ̃.

C. Gaussian smoothing. How should we choose the smoothing bandwidth σ? We could choose it once for all steps, OR, on each step, we could use the current Gaussian approximation q_n(θ) to smooth ψ.

Algorithm 2: smoothed gradient descent.
- Initialize with any Gaussian q_0
- Loop:
  μ_n = E_{q_n}[θ]
  r = E_{q_n}[∇ψ(θ)]
  β = E_{q_n}[Hψ(θ)]
  q_{n+1}(θ) ∝ exp(r (θ − μ_n) + (β/2) (θ − μ_n)²)
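A sketch of Algorithm 2 in one dimension, estimating the smoothed gradient r and curvature β by Monte Carlo under q_n; the sample size and the way ∇ψ and Hψ are passed in are illustrative assumptions.

```python
import numpy as np

def smoothed_gradient_descent(grad_psi, hess_psi, mu=0.0, var=1.0,
                              n_iters=100, n_samples=2000, seed=0):
    """Algorithm 2: iterate Gaussians q_n, smoothing the gradient and Hessian under q_n."""
    rng = np.random.default_rng(seed)
    for _ in range(n_iters):
        theta = mu + np.sqrt(var) * rng.normal(size=n_samples)   # samples from q_n
        r = grad_psi(theta).mean()          # r    = E_{q_n}[ grad psi(theta) ]
        beta = hess_psi(theta).mean()       # beta = E_{q_n}[ H psi(theta) ]  (negative)
        # q_{n+1}(t) is proportional to exp( r*(t - mu) + 0.5*beta*(t - mu)^2 )
        mu, var = mu - r / beta, -1.0 / beta
    return mu, var

# Example: N(2, 0.5^2) target; the fixed point has mean 2 and variance 0.25.
print(smoothed_gradient_descent(lambda t: -(t - 2.0) / 0.25,
                                lambda t: -4.0 + 0.0 * t))
```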

D. Using the factor structure. In order to use EP, p(θ) needs to have a nice factor structure: p(θ) = ∏_{i=1}^{n} f_i(θ). For ψ: ψ(θ) = Σ_{i=1}^{n} φ_i(θ), with φ_i = log f_i.

D. Using the factor structure. Crazy idea: we could use the VB algorithm, with non-Gaussian smoothing, on each component φ_i of ψ.

D. Using the factor structure. The Algorithm 2 update has two equivalent forms: β = E_{q_n}[Hψ(θ)] OR β = Cov(q_n)⁻¹ E_{q_n}[(θ − μ_n) ∇ψ(θ)]. When we replace q_n by h_i, we have a choice to make: which form should we use?
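A quick numerical sanity check (illustrative, not from the talk) that the two forms agree when q_n is Gaussian; this is just Stein's identity, and the test function ψ below is arbitrary.

```python
# Check, by Monte Carlo, that for a Gaussian q:
#   E_q[ H psi(theta) ]  ==  Cov(q)^{-1} E_q[ (theta - mu) * grad psi(theta) ].
import numpy as np

rng = np.random.default_rng(1)
mu, var = 0.7, 0.3
theta = mu + np.sqrt(var) * rng.normal(size=2_000_000)

grad_psi = lambda t: -np.tanh(t)            # psi(t) = -log cosh(t), so grad psi = -tanh(t)
hess_psi = lambda t: -1.0 / np.cosh(t) ** 2 # and H psi = -sech(t)^2

form_1 = hess_psi(theta).mean()
form_2 = ((theta - mu) * grad_psi(theta)).mean() / var
print(form_1, form_2)                       # the two estimates agree up to Monte Carlo error
```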

Algorithm 3: smooth EP.
- Initialize with any Gaussians g_1, g_2, ..., g_n
- Loop over i:
  h_i ∝ f_i ∏_{j≠i} g_j
  μ_i = E_{h_i}[θ]
  r_i = E_{h_i}[∇φ_i(θ)]
  β_i = Var(h_i)⁻¹ E_{h_i}[(θ − μ_i) ∇φ_i(θ)]
  g_i(θ) ∝ exp(r_i (θ − μ_i) + (β_i/2) (θ − μ_i)²)
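A sketch of Algorithm 3 in one dimension, with hybrid expectations computed by grid quadrature as in the earlier EP sketch; the site parameterization in natural form and the toy factors are illustrative assumptions. On this toy model it should return, up to quadrature error, the same global mean and variance as the classic EP sketch, in line with the equivalence claimed on the next slide.

```python
import numpy as np

def smooth_ep_1d(log_factors, grad_log_factors, n_sweeps=30,
                 grid=np.linspace(-10, 10, 4001)):
    """Algorithm 3: smooth EP with sites stored as g_i(t) = exp(a_i*t - 0.5*b_i*t^2)."""
    n = len(log_factors)
    a, b = np.zeros(n), np.zeros(n)
    b[0] = 1.0                                   # assume factor 0 is a N(0, 1) prior
    for _ in range(n_sweeps):
        for i in range(1, n):
            a_cav, b_cav = a.sum() - a[i], b.sum() - b[i]
            # hybrid h_i is proportional to f_i * prod_{j != i} g_j
            log_h = log_factors[i](grid) + a_cav * grid - 0.5 * b_cav * grid ** 2
            w = np.exp(log_h - log_h.max())
            w /= w.sum()
            mu_i = np.sum(w * grid)
            var_i = np.sum(w * (grid - mu_i) ** 2)
            g = grad_log_factors[i](grid)                      # grad phi_i(theta)
            r_i = np.sum(w * g)                                # E_{h_i}[grad phi_i]
            beta_i = np.sum(w * (grid - mu_i) * g) / var_i     # Var^-1 E[(t - mu_i) grad phi_i]
            # g_i(t) proportional to exp( r_i*(t - mu_i) + 0.5*beta_i*(t - mu_i)^2 )
            b[i] = -beta_i
            a[i] = r_i - beta_i * mu_i
    B = b.sum()
    return a.sum() / B, 1.0 / B

# Same toy model as before: N(0, 1) prior plus two logistic factors.
logf = [lambda t: -0.5 * t ** 2,
        lambda t: -np.logaddexp(0.0, -t),
        lambda t: -np.logaddexp(0.0, -2.0 * t)]
dlogf = [lambda t: -t,
         lambda t: 1.0 / (1.0 + np.exp(t)),          # d/dt log sigmoid(t)
         lambda t: 2.0 / (1.0 + np.exp(2.0 * t))]    # d/dt log sigmoid(2t)
print(smooth_ep_1d(logf, dlogf))
```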

D. Using the factor structure. Smooth EP is actually exactly equivalent to EP! This ties EP to a much more intuitive algorithm: Newton's method. However, we have lost the most important feature: the explicit objective function.

Summary. I have presented a family of algorithms which iterate Gaussian approximations to p(θ) by computing smoothed quadratic approximations of log p(θ) (or parts of it). Different smoothings correspond to different known methods: Laplace = no smoothing; VB = Gaussian smoothing; EP = various hybrid smoothings.

III. Consequences. Key observation: EP and Algorithm 2 are closely related to Newton's method. They must behave in a similar fashion!

III. Consequences. Newton's method has two striking features: very fast convergence near its fixed points, and possible oscillations (overshooting). Solution: complement it with line-search methods. EP is also known for its oscillations! Are these also overshoots of the target?

III. Consequences. EP still needs improvements. We could import good ideas from Newton's method and use them on EP: line search (but how?); handling a non-SPD second-order term β_i (but how?). Finally, the smooth-Newton view of EP might be useful on its own.

III. Consequences. Since Laplace, VB and EP are closely related, we can ask whether they behave similarly in some situations. Laplace = no smoothing; VB = Gaussian smoothing; EP = various hybrid smoothings. Whenever the smoothing distributions are similar, the methods are similar!

III. Consequences. If all hybrids are almost equal to the global approximation, h_i ≈ q_n, i.e. the correction f_i/g_i is negligible compared to q_n, then EP and VB have the same smoothing and behave similarly.

III. Consequences. Furthermore, if q_n is almost a Dirac distribution, i.e. its width is negligible compared to the variations of ψ and/or the φ_i functions, then Algorithm 2 behaves similarly to Algorithm 1. Thus, the VB approximation behaves similarly to the Laplace approximation.

Conclusion. Various approximation methods are closely related: they can be obtained through smooth Newton methods. Laplace = no smoothing; VB = Gaussian smoothing; EP = hybrid smoothings. Corollary: VB and EP are very closely related, and EP behaves similarly to Newton's method.

Speculation. Smooth Newton variants might be computationally useful for VB and EP. VB and EP should give better approximations than Laplace. Can this give a path towards understanding or improving the convergence of EP and VB?

References.
Minka (2001), Expectation Propagation for Approximate Bayesian Inference.
Seeger (2007), Expectation Propagation for Exponential Families.
Dehaene and Barthelmé (2017), Expectation Propagation in the Large-Data Limit.
Dehaene (2016), Expectation Propagation Performs a Smoothed Gradient Descent.