Bristol Machine Learning Reading Group

1 Bristol Machine Learning Reading Group Introduction to Variational Inference Carl Henrik Ek - carlhenrik.ek@bristol.ac.uk November 25,

2 Introduction

3 Ronald Aylmer Fisher

4 TODAY
$$p(Y) = \int p(Y, X)\,\mathrm{d}X \qquad p(X \mid Y) = \frac{p(Y, X)}{p(Y)}$$
"Being Bayesian" implies not making a point estimate; only deductively impossible scenarios should be given zero probability.
Learning: maximise the evidence of the data.
Decision/Reasoning: the posterior distribution.
The evidence is the key quantity in machine learning, as it includes all possible knowledge.

5 In practice
$$p(Y) = \int p(Y, X)\,\mathrm{d}X \qquad p(X \mid Y) = \frac{p(Y \mid X)\,p(X)}{p(Y)}$$
We can usually formulate the joint distribution, most commonly as likelihood times prior.
Reaching the posterior is hard, as the evidence is challenging to compute.
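To make the two formulas concrete, here is a minimal sketch on a hypothetical one-dimensional model (x ~ N(0, 1), y | x ~ N(x, 0.25), one observation y = 1.2; none of these values come from the slides). In one dimension the evidence integral can be brute-forced on a grid, and the conjugate model gives a closed form to compare against. A grid of this kind grows exponentially with the dimension of X, which is why the integral is considered hard in general.

```python
import numpy as np


def gauss_pdf(v, mean, var):
    """Density of a univariate Gaussian N(v | mean, var)."""
    return np.exp(-0.5 * (v - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


# Hypothetical toy model: x ~ N(0, 1), y | x ~ N(x, noise_var).
y_obs, noise_var = 1.2, 0.25

# Joint p(y, x) = likelihood * prior, evaluated on a dense grid over x.
x = np.linspace(-10.0, 10.0, 20001)
dx = x[1] - x[0]
joint = gauss_pdf(y_obs, x, noise_var) * gauss_pdf(x, 0.0, 1.0)

# Evidence p(y) = integral of p(y, x) dx (Riemann sum), then Bayes' rule.
evidence = np.sum(joint) * dx
posterior = joint / evidence
posterior_mean = np.sum(x * posterior) * dx

# Closed form, available only because this toy model is conjugate.
evidence_exact = gauss_pdf(y_obs, 0.0, 1.0 + noise_var)
posterior_mean_exact = y_obs / (1.0 + noise_var)

print(evidence, evidence_exact)
print(posterior_mean, posterior_mean_exact)
```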

6 Laplace quote
"Nature laughs at the difficulties of integration" (Pierre-Simon Laplace)

7 Pachinko (YouTube video)

8 Two paths
Stochastic:
$$p(Y) \approx \sum_i p(Y, X_i), \qquad X_i \sim p(X)$$
+ correct in the limit
- no evidence of how good the approximation is
Deterministic:
$$\log p(Y) = \mathcal{L}(q(X)) + D(q(X)), \qquad q(X) \approx p(X \mid Y)$$
+ know how good the approximation is
- will never be correct
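The stochastic path can be sketched with a simple Monte Carlo estimator. This is an assumption about the intent of the slide, not its exact formula: the code averages the likelihood p(y | X_i) over prior samples, which is the standard unbiased estimator of the evidence. The toy model (x ~ N(0, 1), y | x ~ N(x, 0.25), y = 1.2) and the seed are hypothetical.

```python
import numpy as np


def gauss_pdf(v, mean, var):
    """Density of a univariate Gaussian N(v | mean, var)."""
    return np.exp(-0.5 * (v - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


# Hypothetical toy model: x ~ N(0, 1), y | x ~ N(x, noise_var).
y_obs, noise_var = 1.2, 0.25
rng = np.random.default_rng(0)


def mc_evidence(n_samples):
    """Estimate p(y) as the average likelihood over samples from the prior."""
    x = rng.standard_normal(n_samples)  # X_i ~ p(X) = N(0, 1)
    return np.mean(gauss_pdf(y_obs, x, noise_var))


# The estimate fluctuates for small sample sizes and settles as n grows.
estimates = {n: mc_evidence(n) for n in (100, 10_000, 200_000)}
evidence_exact = gauss_pdf(y_obs, 0.0, 1.0 + noise_var)

for n, value in estimates.items():
    print(n, value, abs(value - evidence_exact))
```

The estimator is correct in the limit of many samples, but a single run gives no guarantee about how far it is from the true value, which matches the trade-off listed above.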

9 Variational Inference

10 Formalise
$$\log p(Y) = \log \int p(Y, X)\,\mathrm{d}X = \log \int p(X \mid Y)\,p(Y)\,\mathrm{d}X = \log \int \frac{q(X)}{q(X)}\,p(X \mid Y)\,p(Y)\,\mathrm{d}X$$

11 Jensen Inequality
Convex function:
$$\lambda f(x_0) + (1 - \lambda) f(x_1) \geq f(\lambda x_0 + (1 - \lambda) x_1)$$
$$x \in [x_{\min}, x_{\max}], \qquad \lambda \in [0, 1]$$

12 Jensen Inequality
$$\mathbb{E}[f(x)] \geq f(\mathbb{E}[x])$$
$$\int f(x)\,p(x)\,\mathrm{d}x \geq f\left(\int x\,p(x)\,\mathrm{d}x\right)$$

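13 Jensen Inequality in Variational Bayes
$$\int \log(x)\,p(x)\,\mathrm{d}x \leq \log\left(\int x\,p(x)\,\mathrm{d}x\right)$$
The logarithm is concave, so the inequality is reversed: moving the log inside the integral gives a lower bound on the log of the integral.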
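A quick numerical check of the inequality, using a hypothetical three-point discrete distribution (the values and probabilities are made up for illustration). The gap is strictly positive unless all the mass sits on a single value.

```python
import numpy as np

# Hypothetical discrete distribution over positive values.
values = np.array([0.5, 1.0, 4.0])
probs = np.array([0.2, 0.5, 0.3])

expected_log = np.sum(probs * np.log(values))  # E[log x]
log_expected = np.log(np.sum(probs * values))  # log E[x]
gap = log_expected - expected_log              # non-negative by Jensen

# With a point mass the two sides coincide and the bound is tight.
point_mass = np.array([0.0, 1.0, 0.0])
gap_point_mass = (np.log(np.sum(point_mass * values))
                  - np.sum(point_mass * np.log(values)))

print(expected_log, log_expected, gap, gap_point_mass)
```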

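15 Variational Bayes cont.
$$\log p(Y) = \log \int \frac{q(X)}{q(X)}\,p(X \mid Y)\,p(Y)\,\mathrm{d}X$$
$$\geq \int q(X) \log \frac{p(X \mid Y)\,p(Y)}{q(X)}\,\mathrm{d}X$$
$$= \int q(X) \log \frac{p(X \mid Y)}{q(X)}\,\mathrm{d}X + \int q(X)\,\mathrm{d}X\,\log p(Y)$$
$$= -\mathrm{KL}(q(X) \,\|\, p(X \mid Y)) + \log p(Y)$$
If q(X) is the true posterior we have an equality, therefore match the distributions, i.e. $\operatorname{argmin}_q \mathrm{KL}(q(X) \,\|\, p(X \mid Y))$.
Variational distributions are approximations to intractable posteriors.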

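16 ELBO
$$\mathrm{KL}(q(X) \,\|\, p(X \mid Y)) = \int q(X) \log \frac{q(X)}{p(X \mid Y)}\,\mathrm{d}X$$
$$= \int q(X) \log \frac{q(X)}{p(X, Y)}\,\mathrm{d}X + \log p(Y)$$
$$= H(q(X)) - \mathbb{E}_{q(X)}[\log p(X, Y)] + \log p(Y)$$
Here $H(q(X)) = \int q(X) \log q(X)\,\mathrm{d}X$, i.e. the negative of the usual differential entropy.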

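17 ELBO
$$\log p(Y) = \mathrm{KL}(q(X) \,\|\, p(X \mid Y)) + \underbrace{\mathbb{E}_{q(X)}[\log p(X, Y)] - H(q(X))}_{\text{ELBO}}$$
$$\mathbb{E}_{q(X)}[\log p(X, Y)] - H(q(X)) = \mathcal{L}(q(X))$$
Evidence Lower BOund, with $H(q(X)) = \int q(X) \log q(X)\,\mathrm{d}X$.
If we maximise the ELBO we find an approximate posterior and get an approximation to the marginal likelihood.
Maximising p(Y) is learning.
Finding $p(X \mid Y) \approx q(X)$ is prediction.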
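The decomposition can be checked numerically on a hypothetical conjugate model (x ~ N(0, 1), y | x ~ N(x, 0.25), y = 1.2, with q(x) = N(m, v); all values are illustrative, not from the slides). Every term is available in closed form here, so the sketch verifies that ELBO plus KL equals the log evidence for any choice of m and v. Note that the code uses the usual (positive) entropy of q, which is the negative of the H written on the slide.

```python
import numpy as np

# Hypothetical toy model: x ~ N(0, 1), y | x ~ N(x, s2), observed y.
y, s2 = 1.2, 0.25


def elbo(m, v):
    """ELBO for q(x) = N(m, v): E_q[log p(y, x)] plus the entropy of q."""
    exp_log_lik = -0.5 * np.log(2 * np.pi * s2) - 0.5 * ((y - m) ** 2 + v) / s2
    exp_log_prior = -0.5 * np.log(2 * np.pi) - 0.5 * (m ** 2 + v)
    entropy = 0.5 * np.log(2 * np.pi * np.e * v)
    return exp_log_lik + exp_log_prior + entropy


def kl_to_posterior(m, v):
    """KL(q || p(x | y)) between two univariate Gaussians."""
    post_var = s2 / (1.0 + s2)
    post_mean = y / (1.0 + s2)
    return 0.5 * (np.log(post_var / v) + (v + (m - post_mean) ** 2) / post_var - 1.0)


log_evidence = -0.5 * np.log(2 * np.pi * (1.0 + s2)) - 0.5 * y ** 2 / (1.0 + s2)

# A few arbitrary variational parameters, ending with the exact posterior.
settings = [(0.0, 1.0), (0.5, 0.3), (y / (1.0 + s2), s2 / (1.0 + s2))]
for m, v in settings:
    print(m, v, elbo(m, v), kl_to_posterior(m, v), log_evidence)
```

For the last setting q is the exact posterior, so the KL term vanishes and the bound is tight.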

18 ELBO
[Figure: graphical models. A node X with a directed edge to a node Y; a second pair of nodes X and Y with an additional node θ.]

19 Why is this useful?
Why is this a sensible thing to do?
Taking the expectation of a log is usually easier than taking the log of an expectation.
We are allowed to choose the distribution to take the expectation over.
Ryan Adams in Talking Machines [1]
[1] Talking Machines, Season Two, Episode Five

20 Approximate Distribution
Mean Field Approximation
$$q(X) = \prod_i q_i(X_i)$$
Introduced in statistical physics [2]
Approximates the marginals of the posterior
[2] Peterson, C., and Anderson, J. R. (1987). A mean field theory learning algorithm for neural networks.
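A minimal sketch of what the factorisation does, assuming a known correlated bivariate Gaussian stands in for the posterior (the mean and covariance below are hypothetical). For a Gaussian target the optimal factors are Gaussian with closed-form coordinate updates (the standard textbook result, e.g. Bishop, PRML, section 10.1.2); this is an illustration, not the procedure used in the talk.

```python
import numpy as np

# Hypothetical target p(x) = N(mu, Sigma) standing in for a posterior.
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
Lam = np.linalg.inv(Sigma)  # precision matrix

# Mean-field q(x) = q1(x1) q2(x2); each factor is Gaussian with
# precision Lam[i, i]. Update the factor means by coordinate ascent.
m = np.zeros(2)
for _ in range(100):
    m[0] = mu[0] - Lam[0, 1] / Lam[0, 0] * (m[1] - mu[1])
    m[1] = mu[1] - Lam[1, 0] / Lam[1, 1] * (m[0] - mu[0])

q_var = 1.0 / np.diag(Lam)   # variances of the mean-field factors
true_var = np.diag(Sigma)    # variances of the true marginals

print("factor means:", m, "true means:", mu)
print("factor variances:", q_var, "true marginal variances:", true_var)
```

In this example the factor means converge to the true means, while the factor variances (0.36) are smaller than the true marginal variances (1.0), so the marginals are approximated but their spread is underestimated.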

21 Examples

22 Ising Model

23 Gaussian Process
Why?
Not directly applicable to Variational Bayes
Introduces variational compression by augmentation
Exemplifies well what VB is in practice

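24 Gaussian Process
Gaussian Process 101
$$\begin{bmatrix} f \\ f_* \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} K(X, X) & k(X, x_*) \\ k(x_*, X) & k(x_*, x_*) \end{bmatrix}\right)$$
$$p(f_* \mid x_*, X, f, \theta) = \mathcal{N}\left(k(x_*, X)^{\mathrm{T}} K(X, X)^{-1} f,\; k(x_*, x_*) - k(x_*, X)^{\mathrm{T}} K(X, X)^{-1} k(X, x_*)\right)$$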
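The conditional above can be sketched in a few lines of NumPy. The squared-exponential kernel, its default parameters, the jitter term and the toy data are assumptions made for the example; the slide does not fix a kernel at this point.

```python
import numpy as np


def rbf(a, b, variance=1.0, lengthscale=1.0):
    """Squared-exponential covariance between 1-D input vectors a and b."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale ** 2)


def gp_predict(X, f, X_star, jitter=1e-10):
    """Mean and covariance of p(f_* | x_*, X, f) for noise-free values f."""
    K = rbf(X, X) + jitter * np.eye(len(X))  # K(X, X), jitter for stability
    K_s = rbf(X, X_star)                     # k(X, x_*), shape (N, M)
    K_ss = rbf(X_star, X_star)               # k(x_*, x_*)
    mean = K_s.T @ np.linalg.solve(K, f)
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, cov


# Hypothetical training data and test inputs.
X = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
f = np.sin(X)
X_star = np.array([-1.0, 0.5, 10.0])

mean, cov = gp_predict(X, f, X_star)
print(mean)
print(np.diag(cov))
```

At a training input the prediction reproduces the observed value with near-zero variance; far from the data it reverts to the prior (zero mean, unit variance).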

25 Gaussian Process
Joint Distribution
$$p(Y, F, X) = p(Y \mid F)\,p(F \mid X)\,p(X) = p(X) \prod_{j=1}^{d} p(y_{:,j} \mid f_{:,j})\,p(f_{:,j} \mid X)$$
Learning Task
$$p(Y) = \int p(Y \mid F)\,p(F \mid X)\,p(X)\,\mathrm{d}X\,\mathrm{d}F$$
We can analytically integrate out F, but X appears non-linearly w.r.t. Y, rendering this intractable.

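26 Gaussian Process
$$\mathcal{L} = \int_{X,F} q(X) \log \left(\frac{p(Y \mid F)\,p(F \mid X)\,p(X)}{q(X)}\right)$$
$$= \int_{F,X} q(X) \log p(Y \mid F)\,p(F \mid X) - \int_{X} q(X) \log \frac{q(X)}{p(X)}$$
$$= \tilde{\mathcal{L}} - \mathrm{KL}(q(X) \,\|\, p(X))$$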

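27 Gaussian Process
$$\tilde{\mathcal{L}} = \int_{F,X} q(X) \log p(Y \mid F)\,p(F \mid X)$$
$$f \sim \mathcal{GP}(0, k(\cdot, \cdot))$$
$$p(F \mid X) = \prod_{j=1}^{d} \mathcal{N}(f_{:,j} \mid 0, K)$$
$$k(x_{:,i}, x_{:,j}) = \sigma \exp\left(-\frac{1}{2} \sum_{q=1}^{Q} w_q (x_{q,i} - x_{q,j})^2\right)$$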
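The covariance function on this slide can be written directly in NumPy. The sketch below follows the slide's indexing (rows are latent dimensions q, columns are points); the function name, the random inputs and the parameter values are hypothetical. It also shows the effect of the per-dimension weights: a dimension with w_q = 0 has no influence on the covariance.

```python
import numpy as np


def ard_kernel(X, sigma, w):
    """k(x_i, x_j) = sigma * exp(-0.5 * sum_q w_q * (x_qi - x_qj)^2).

    X has shape (Q, N): one column per point, matching x_{:,i} on the slide.
    """
    diff = X[:, :, None] - X[:, None, :]             # shape (Q, N, N)
    weighted = np.einsum("q,qij->ij", w, diff ** 2)  # sum over dimensions q
    return sigma * np.exp(-0.5 * weighted)


rng = np.random.default_rng(0)
X = rng.standard_normal((3, 6))   # Q = 3 latent dimensions, N = 6 points
w = np.array([1.0, 0.5, 0.0])     # the third dimension is switched off
K = ard_kernel(X, sigma=2.0, w=w)

# Changing the zero-weight dimension leaves the covariance unchanged.
X_changed = X.copy()
X_changed[2] = rng.standard_normal(6)
K_changed = ard_kernel(X_changed, sigma=2.0, w=w)

print(K.shape, np.allclose(K, K_changed))
```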

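28 Gaussian Process
Add another set of samples from the same prior
$$p(U \mid Z) = \prod_{j=1}^{d} \mathcal{N}(u_{:,j} \mid 0, K_{uu})$$
Conditional distribution
$$p(f_{:,j}, u_{:,j} \mid X, Z) = p(f_{:,j} \mid u_{:,j}, X, Z)\,p(u_{:,j} \mid Z)$$
$$= \mathcal{N}\left(f_{:,j} \mid K_{fu} K_{uu}^{-1} u_{:,j},\; K_{ff} - K_{fu} K_{uu}^{-1} K_{uf}\right)\,\mathcal{N}(u_{:,j} \mid 0, K_{uu})$$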

29 Gaussian Process
New Augmented Model
$$p(Y, F, U, X \mid Z) = p(X) \prod_{j=1}^{d} p(y_{:,j} \mid f_{:,j})\,p(f_{:,j} \mid u_{:,j}, X, Z)\,p(u_{:,j} \mid Z)$$
We have done nothing to the model, just added hallucinated observations.
However, U and $X_u$ are not random but variational parameters.
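The claim that the augmentation leaves the model unchanged can be checked numerically: integrating U out of p(f | u) p(u) must give back the original prior covariance K_ff. The sketch below does this with the law of total covariance, assuming a unit squared-exponential kernel and hypothetical one-dimensional inputs X and inducing inputs Z.

```python
import numpy as np


def rbf(a, b):
    """Unit-variance, unit-lengthscale squared-exponential covariance."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)


X = np.linspace(-3.0, 3.0, 8)  # hypothetical inputs
Z = np.linspace(-3.0, 3.0, 4)  # hypothetical inducing inputs

K_ff = rbf(X, X)
K_fu = rbf(X, Z)
K_uu = rbf(Z, Z) + 1e-10 * np.eye(len(Z))  # jitter for numerical stability

A = np.linalg.solve(K_uu, K_fu.T).T  # K_fu K_uu^{-1}
cond_cov = K_ff - A @ K_fu.T         # Cov[f | u]
# Cov[f] = E[Cov[f | u]] + Cov[E[f | u]], with E[f | u] = A u, u ~ N(0, K_uu).
marginal_cov = cond_cov + A @ K_uu @ A.T

print(np.max(np.abs(marginal_cov - K_ff)))
```

The marginal covariance of f matches K_ff up to round-off, whatever Z is chosen; Z only starts to matter once a variational distribution over U is introduced.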

30 Gaussian Process
Variational distributions are approximations to intractable posteriors:
$$q(U) \approx p(U \mid Y, X, Z, F)$$
$$q(F) \approx p(F \mid U, X, Z, Y)$$
$$q(X) \approx p(X \mid Y)$$
If U is a sufficient statistic for F, this means
$$p(F \mid U, X, Z, Y) = p(F \mid U, X, Z)$$

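32 Gaussian Process
$$\tilde{\mathcal{L}} = \int_{X,F,U} q(F)\,q(U)\,q(X) \log \frac{p(Y, F, U \mid X, Z)}{q(F)\,q(U)}$$
$$= \int_{X,F,U} q(F)\,q(U)\,q(X) \log \frac{\prod_{j=1}^{d} p(y_{:,j} \mid f_{:,j})\,p(f_{:,j} \mid u_{:,j}, X, Z)\,p(u_{:,j} \mid Z)}{q(F)\,q(U)}$$
Assume that U is a sufficient statistic for F:
$$q(F)\,q(U)\,q(X) = p(F \mid U, X, Z)\,q(U)\,q(X)$$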

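33 Gaussian Process
$$\tilde{\mathcal{L}} = \int_{X,F,U} q(X) \prod_{j=1}^{d} p(f_{:,j} \mid u_{:,j}, X, Z)\,q(u_{:,j}) \log \frac{\prod_{j=1}^{d} p(y_{:,j} \mid f_{:,j})\,p(f_{:,j} \mid u_{:,j}, X, Z)\,p(u_{:,j} \mid Z)}{\prod_{j=1}^{d} p(f_{:,j} \mid u_{:,j}, X, Z)\,q(u_{:,j})}$$
$$= \int_{X,F,U} q(X) \prod_{j=1}^{d} p(f_{:,j} \mid u_{:,j}, X, Z)\,q(u_{:,j}) \log \frac{\prod_{j=1}^{d} p(y_{:,j} \mid f_{:,j})\,p(u_{:,j} \mid Z)}{\prod_{j=1}^{d} q(u_{:,j})}$$
$$= \mathbb{E}_{q(F),q(X),q(U)}[\log p(Y \mid F)] - \mathrm{KL}(q(U) \,\|\, p(U \mid Z))$$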

34 Gaussian Process
Summary
$$\mathbb{E}_{q(F),q(X),q(U)}[\log p(Y \mid F)] - \mathrm{KL}(q(U) \,\|\, p(U \mid Z)) - \mathrm{KL}(q(X) \,\|\, p(X))$$
Expectation tractable
Can be computed for certain priors
Reduces to expectations over covariance functions known as Ψ statistics
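As an illustration of what an expectation over a covariance function looks like, the sketch below computes E_{q(x)}[k(x, z)] for a one-dimensional Gaussian q(x) = N(mu, s) and the exponentiated-quadratic kernel k(x, z) = sigma exp(-w (x - z)^2 / 2). In the Bayesian GP-LVM literature this quantity is usually called the Ψ1 statistic; treating it as the statistic meant on the slide is an assumption, and all parameter values are hypothetical. The closed form follows from the Gaussian integral and is checked against brute-force quadrature.

```python
import numpy as np

# Hypothetical kernel parameters, variational q(x) = N(mu, s) (s is a
# variance) and inducing inputs z.
sigma, w = 1.5, 2.0
mu, s = 0.3, 0.4
z = np.array([-1.0, 0.0, 0.5, 2.0])


def psi1_closed_form(mu, s, z):
    """E_{N(x | mu, s)}[sigma * exp(-0.5 * w * (x - z)^2)] in closed form."""
    scale = 1.0 + w * s
    return sigma / np.sqrt(scale) * np.exp(-0.5 * w * (mu - z) ** 2 / scale)


def psi1_quadrature(mu, s, z, n=40001):
    """The same expectation by a Riemann sum over a wide grid."""
    half_width = 12.0 * np.sqrt(s)
    x = np.linspace(mu - half_width, mu + half_width, n)
    dx = x[1] - x[0]
    q = np.exp(-0.5 * (x - mu) ** 2 / s) / np.sqrt(2.0 * np.pi * s)
    k = sigma * np.exp(-0.5 * w * (x[:, None] - z[None, :]) ** 2)
    return np.sum(q[:, None] * k, axis=0) * dx


psi1_exact = psi1_closed_form(mu, s, z)
psi1_numeric = psi1_quadrature(mu, s, z)
print(psi1_exact)
print(psi1_numeric)
```

Closed forms of this kind exist only for particular combinations of kernel and q(X), which is the sense in which the bound is tractable for certain choices only.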

35 Conclusion

36 Variational Inference
Summary
Often efficient
Not stochastic
Provides you with a posterior and a bound on the marginal likelihood
It's fun
A lot of the work relates to multivariate calculus tricks and substitutions

37 Bristol Machine Learning Reading Group

38 Bristol Machine Learning Reading Group
Berkeley Tea Talk Style
everyone reads the paper
someone introduces the paper and leads the discussion
+ constant workload on everyone
- requires everyone to take this seriously

39 Bristol Machine Learning Reading Group
Seminar Style
everyone skims the paper
someone is responsible for presenting the paper
+ will work
- very uneven workload

40 Bristol Machine Learning Reading Group
Choosing the paper
Presenter picks freely
Presenter picks from agreed pool
Curator chooses papers
Topics
several papers on one topic
cover lots of single topics

42 Source blocks

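43
import numpy as np
import matplotlib.pyplot as plt

def save_figure(path):
    # The function wrapper and its name are assumptions; the slide shows
    # only the statements below, including a bare "return path".
    plt.xkcd()
    # (plotting commands are not preserved in the slide)
    plt.savefig(path)
    return path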
