Introduction to Bayesian inference

Thomas Alexander Brouwer, University of Cambridge, tab43@cam.ac.uk, 17 November 2015

Probabilistic models
Describe how the data was generated using probability distributions: the generative process.
Data $D$, parameters $\theta$.
We want to find the best parameters $\theta$ for the data $D$: this is inference.

Topic modelling
Documents $D_1, \ldots, D_D$.
Documents cover topics, with a distribution $\theta_d = (t_1, \ldots, t_T)$.
Words in a document: $D_d = \{w_{d,1}, \ldots, w_{d,n}\}$.
Some words are more prevalent in some topics.
Topics have a word distribution $\phi_t = (w_1, \ldots, w_V)$.
The data is the words in documents $D_1, \ldots, D_D$; the parameters are $\theta_d$ and $\phi_t$.

From Blei's ICML-2012 tutorial

Overview
Probabilistic models
Probability theory
Latent variable models
Bayesian inference
Bayes theorem
Latent Dirichlet Allocation
Conjugacy
Graphical models
Gibbs sampling
Variational Bayesian inference

Probability primer
Random variable $X$.
Probability distribution $p(X)$.
Discrete distribution, e.g. a coin flip or dice roll.
Continuous distribution, e.g. a height distribution.

Multiple RVs
Joint distribution $p(X, Y)$.
Conditional distribution $p(X \mid Y)$.

Probability rules
Chain rule:
$$p(X \mid Y) = \frac{p(X, Y)}{p(Y)} \quad \text{or} \quad p(X, Y) = p(X \mid Y)\, p(Y)$$
Marginal rule:
$$p(X) = \sum_Y p(X, Y) = \sum_Y p(X \mid Y)\, p(Y)$$
For continuous variables:
$$p(x) = \int_y p(x, y)\, dy = \int_y p(x \mid y)\, p(y)\, dy$$
We can add more conditional random variables if we want, so e.g. $p(X, Y \mid Z) = p(X \mid Y, Z)\, p(Y \mid Z)$.

Independence
$X$ and $Y$ are independent if $p(X, Y) = p(X)\, p(Y)$.
Equivalently, if $p(Y \mid X) = p(Y)$.

Expectation and variance
Expectation:
$$E[X] = \sum_x x\, p(X = x) \quad \text{or} \quad E[X] = \int_x x\, p(x)\, dx$$
Variance:
$$V[X] = E\left[(X - E[X])^2\right] = E[X^2] - E[X]^2$$
where
$$E[X^2] = \sum_x x^2\, p(X = x) \quad \text{or} \quad E[X^2] = \int_x x^2\, p(x)\, dx$$

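Dice roll:
$$E[X] = 1 \cdot \tfrac{1}{6} + 2 \cdot \tfrac{1}{6} + 3 \cdot \tfrac{1}{6} + 4 \cdot \tfrac{1}{6} + 5 \cdot \tfrac{1}{6} + 6 \cdot \tfrac{1}{6} = \tfrac{7}{2}$$
$$E[X^2] = 1^2 \cdot \tfrac{1}{6} + 2^2 \cdot \tfrac{1}{6} + 3^2 \cdot \tfrac{1}{6} + 4^2 \cdot \tfrac{1}{6} + 5^2 \cdot \tfrac{1}{6} + 6^2 \cdot \tfrac{1}{6} = \tfrac{91}{6}$$
So
$$V[X] = \tfrac{91}{6} - \left(\tfrac{7}{2}\right)^2 = \tfrac{35}{12}$$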
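
The dice-roll arithmetic can be checked mechanically. The following is a minimal sketch using exact fractions; the variable names are illustrative and not part of the slides.

```python
from fractions import Fraction

# Fair six-sided dice: every outcome has probability 1/6.
outcomes = range(1, 7)
p = Fraction(1, 6)

mean = sum(x * p for x in outcomes)               # E[X]
second_moment = sum(x * x * p for x in outcomes)  # E[X^2]
variance = second_moment - mean ** 2              # V[X] = E[X^2] - E[X]^2

print(mean, second_moment, variance)  # 7/2 91/6 35/12
```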

Latent variable models
Manifest, or observed, variables.
Latent, or unobserved, variables.
Models that contain latent variables are latent variable models.

Probability distributions
Categorical distribution
$N$ possible outcomes, with probabilities $(p_1, \ldots, p_N)$.
Draw a single value, e.g. throw a dice once.
Parameters $\theta = (p_1, \ldots, p_N)$.
Discrete distribution, $p(X = i) = p_i$.
For the indicator of outcome $i$, the expectation is $p_i$ and the variance is $p_i (1 - p_i)$.

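Probability distributions
Dirichlet distribution
Draws are vectors $x = (x_1, \ldots, x_N)$ s.t. $\sum_i x_i = 1$.
In other words, draws are probability vectors: the parameter to the categorical distribution.
Parameters $\theta = (\alpha_1, \ldots, \alpha_N) = \alpha$.
Continuous distribution,
$$p(x) = \frac{1}{B(\alpha)} \prod_i x_i^{\alpha_i - 1}$$
where
$$B(\alpha) = \frac{\prod_i \Gamma(\alpha_i)}{\Gamma\left(\sum_i \alpha_i\right)} \quad \text{and} \quad \Gamma(\alpha_i) = \int_0^\infty y^{\alpha_i - 1} e^{-y}\, dy$$
Expectation for the $i$th element $x_i$ is
$$E[x_i] = \frac{\alpha_i}{\sum_j \alpha_j}$$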
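
To make the Dirichlet concrete, here is a small sketch that draws probability vectors and compares their empirical mean with $\alpha_i / \sum_j \alpha_j$. It relies on the standard construction of normalising independent Gamma draws. The helper name `sample_dirichlet` and the example $\alpha$ are assumptions for illustration only.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dir(alpha) by normalising Gamma(alpha_i, 1) draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]

rng = random.Random(0)
alpha = [2.0, 3.0, 5.0]  # illustrative parameters, not from the slides
draws = [sample_dirichlet(alpha, rng) for _ in range(20000)]

# Empirical mean of each component versus the analytic expectation.
empirical_mean = [sum(d[i] for d in draws) / len(draws) for i in range(len(alpha))]
expected_mean = [a / sum(alpha) for a in alpha]
print(empirical_mean)
print(expected_mean)
```

Every draw sums to 1, and with this many draws the two printed vectors should agree to roughly two decimal places.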

Unfair dice
A dice with unknown distribution, $p = (p_1, p_2, p_3, p_4, p_5, p_6)$.
We observe some throws and want to estimate $p$.
Say we observe 4, 6, 6, 4, 6, 3.
Perhaps $p = (0, 0, \tfrac{1}{6}, \tfrac{2}{6}, 0, \tfrac{3}{6})$?

Maximum likelihood
Maximum likelihood solution: $\theta_{ML} = \arg\max_\theta p(D \mid \theta)$.
This easily leads to overfitting.
We want to incorporate some prior belief or knowledge about our parameters.
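
For a categorical distribution the maximum likelihood estimate is simply the relative frequency of each outcome. The sketch below applies this to the six observed throws of the unfair dice (4, 6, 6, 4, 6, 3) and shows the overfitting: faces that were never observed get probability zero. Variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction

throws = [4, 6, 6, 4, 6, 3]  # the observed throws of the unfair dice
counts = Counter(throws)

# ML estimate for a categorical distribution: count of each face / number of throws.
p_ml = [Fraction(counts[face], len(throws)) for face in range(1, 7)]

print([str(p) for p in p_ml])  # ['0', '0', '1/6', '1/3', '0', '1/2']
```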

Bayes theorem
Theorem: for any two random variables $X$ and $Y$,
$$p(X \mid Y) = \frac{p(Y \mid X)\, p(X)}{p(Y)}$$
Proof: from the chain rule, $p(X, Y) = p(Y \mid X)\, p(X) = p(X \mid Y)\, p(Y)$. Divide both sides by $p(Y)$.

Disease test
A test for a disease has 99% accuracy.
1 in 1000 people have the disease.
You tested positive. What is the probability that you have the disease?

Disease test
Let $X$ = disease, and $Y$ = positive.
We want to know $p(X \mid Y)$: the probability of disease given a positive test.
From Bayes,
$$p(X \mid Y) = \frac{p(Y \mid X)\, p(X)}{p(Y)}$$
$p(Y \mid X) = 0.99$, $p(X) = 0.001$
$$p(Y) = p(Y, X) + p(Y, \neg X) = p(Y \mid X)\, p(X) + p(Y \mid \neg X)\, p(\neg X) = 0.99 \times 0.001 + 0.01 \times 0.999 = 0.01098$$
So
$$p(X \mid Y) = \frac{0.99 \times 0.001}{0.01098} = 0.09016393442$$
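
The same calculation as a short script. As in the slide, "99% accuracy" is taken to mean that both the true positive rate and the true negative rate are 0.99; the variable names are illustrative.

```python
p_disease = 0.001            # prior p(X): 1 in 1000 people have the disease
p_pos_given_disease = 0.99   # p(Y | X)
p_pos_given_healthy = 0.01   # p(Y | not X), assuming 99% accuracy on healthy people too

# Marginal rule: p(Y) = p(Y | X) p(X) + p(Y | not X) p(not X)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes theorem: p(X | Y) = p(Y | X) p(X) / p(Y)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(round(p_pos, 5), round(p_disease_given_pos, 4))  # 0.01098 0.0902
```

Despite the accurate test, a positive result only raises the probability of disease to about 9%, because the disease is rare.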

Bayes theorem for inference
We want to find the best parameters $\theta$ for our model after observing the data $D$.
ML overfits by using $p(D \mid \theta)$.
We need some way of using prior belief about the parameters.
Consider $p(\theta \mid D)$: our belief about the parameters after observing the data.

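Bayesian inference
Using Bayes theorem,
$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$
Prior $p(\theta)$.
Likelihood $p(D \mid \theta)$.
Posterior $p(\theta \mid D)$.
Maximum A Posteriori (MAP):
$$\theta_{MAP} = \arg\max_\theta p(\theta \mid D) = \arg\max_\theta p(D \mid \theta)\, p(\theta)$$
Bayesian inference: find the full posterior distribution $p(\theta \mid D)$.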

Intractability
In our model we define the prior $p(\theta)$ and likelihood $p(D \mid \theta)$.
How do we find $p(D)$?
$$p(D) = \int_\theta p(D, \theta)\, d\theta = \int_\theta p(D \mid \theta)\, p(\theta)\, d\theta$$
BUT: the space of possible values for $\theta$ is huge!
Approximate Bayesian inference.

Latent Dirichlet Allocation
Generative process:
Draw document-to-topic distributions, $\theta_d \sim \text{Dir}(\alpha)$ ($d = 1, \ldots, D$).
Draw topic-to-word distributions, $\phi_t \sim \text{Dir}(\beta)$ ($t = 1, \ldots, T$).
For each of the $N$ words in each of the $D$ documents:
Draw a topic from the document's topic distribution, $z_{dn} \sim \text{Multinomial}(\theta_d)$.
Draw a word from the topic's word distribution, $w_{dn} \sim \text{Multinomial}(\phi_{z_{dn}})$.
Note that our model's data is the words $w_{dn}$ we observe, and the parameters are the $\theta_d$, $\phi_t$. We have placed Dirichlet priors over the parameters, with their own parameters $\alpha$, $\beta$.
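
The generative process can be sketched as a short sampler. This is an illustration under stated assumptions, not the author's implementation: the priors are symmetric (a single scalar `alpha` and `beta` repeated for every component), every document has the same length `N`, and the function names `sample_dirichlet` and `generate_corpus` are hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dir(alpha) by normalising Gamma draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]

def generate_corpus(D, T, V, N, alpha, beta, rng):
    """Sample D documents of N words each from the LDA generative process."""
    # Document-to-topic distributions theta_d ~ Dir(alpha), one per document.
    theta = [sample_dirichlet([alpha] * T, rng) for _ in range(D)]
    # Topic-to-word distributions phi_t ~ Dir(beta), one per topic.
    phi = [sample_dirichlet([beta] * V, rng) for _ in range(T)]
    z, w = [], []
    for d in range(D):
        # Topic assignment for each word position, drawn from theta_d.
        z_d = rng.choices(range(T), weights=theta[d], k=N)
        # Each word is drawn from the word distribution of its assigned topic.
        w_d = [rng.choices(range(V), weights=phi[t], k=1)[0] for t in z_d]
        z.append(z_d)
        w.append(w_d)
    return theta, phi, z, w

theta, phi, z, w = generate_corpus(D=3, T=2, V=5, N=10, alpha=0.5, beta=0.5,
                                   rng=random.Random(0))
print(w[0])  # word indices of the first document
```

Only `w` would be observed in practice; `theta`, `phi` and `z` are what inference has to recover.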

Hyperparameters
In our model we have:
Random variables: observed ones, like the words; and latent ones, like the topics.
Parameters: document-to-topic distributions $\theta_d$ and topic-to-word distributions $\phi_t$.
Hyperparameters: these are parameters to the prior distributions over our parameters, so $\alpha$ and $\beta$.

Conjugacy
For a specific parameter $\theta_i$, the prior $p(\theta_i)$ is conjugate to the likelihood $p(D \mid \theta_i)$ if the posterior of the parameter, $p(\theta_i \mid D)$, is of the same family as the prior.
E.g. the Dirichlet distribution is the conjugate prior for the categorical distribution.
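
For the Dirichlet and categorical pair, the conjugate update has a simple form: a $\text{Dir}(\alpha)$ prior combined with observed counts $n_i$ gives a $\text{Dir}(\alpha_1 + n_1, \ldots, \alpha_N + n_N)$ posterior. The sketch below applies this to the unfair dice throws (4, 6, 6, 4, 6, 3). The uniform prior $\alpha = (1, \ldots, 1)$ is an assumption chosen for illustration, not a value from the slides.

```python
from collections import Counter
from fractions import Fraction

throws = [4, 6, 6, 4, 6, 3]   # observed throws of the unfair dice
prior_alpha = [1] * 6         # assumed uniform Dirichlet prior over the six faces

counts = Counter(throws)
# Conjugate update: add the observed count of each face to its prior parameter.
posterior_alpha = [a + counts[face] for face, a in enumerate(prior_alpha, start=1)]

# Posterior mean of each p_i is alpha_i / sum_j alpha_j.
total = sum(posterior_alpha)
posterior_mean = [Fraction(a, total) for a in posterior_alpha]

print(posterior_alpha)                    # [1, 1, 2, 3, 1, 4]
print([str(m) for m in posterior_mean])   # ['1/12', '1/12', '1/6', '1/4', '1/12', '1/3']
```

Unlike the maximum likelihood estimate, the posterior mean keeps non-zero probability on faces that were never observed.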

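Bayesian network
Nodes are random variables (latent or observed).
Arrows indicate dependencies.
The distribution of a node only depends on its parents (and things further down the network).
Plates indicate repetition of variables.
[Figure: Bayesian network over nodes A, B, C, D; the equations imply that A and B are parents of C, and C is the parent of D.]
$p(D \mid A, B, C) = p(D \mid C)$, BUT: $p(C \mid A, B, D) \neq p(C \mid A, B)$.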

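[Figure: Bayesian network over nodes A, B, C, D.]
Recall Bayes:
$$p(X \mid Y, Z) = \frac{p(Y \mid X, Z)\, p(X \mid Z)}{p(Y \mid Z)}$$
Then
$$\begin{aligned}
p(C \mid A, B, D) &= \frac{p(D \mid A, B, C)\, p(C \mid A, B)}{p(D \mid A, B)} \\
&= \frac{p(D \mid C)\, p(C \mid A, B)}{\int_C p(C, D \mid A, B)\, dC} \\
&= \frac{p(D \mid C)\, p(C \mid A, B)}{\int_C p(D \mid A, B, C)\, p(C \mid A, B)\, dC} \\
&= \frac{p(D \mid C)\, p(C \mid A, B)}{\int_C p(D \mid C)\, p(C \mid A, B)\, dC}
\end{aligned}$$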

Latent Dirichlet Allocation From http://parkcu.com/blog/

Gibbs sampling
We want to approximate $p(\theta \mid D)$ for parameters $\theta = (\theta_1, \ldots, \theta_N)$.
We cannot compute this exactly, but maybe we can draw samples from it.
We can then use these samples to estimate the distribution, or estimate the expectation and variance.

Gibbs sampling
For each parameter $\theta_i$, write down its distribution conditional on the data and the values of the other parameters, $p(\theta_i \mid \theta_{-i}, D)$.
If our model is conjugate, this gives closed-form expressions (meaning this distribution is of a known form, e.g. Dirichlet, so we can draw from it).
Drawing new values for the parameters $\theta_i$ in turn will eventually converge to give draws from the true posterior, $p(\theta \mid D)$.
Burn-in: discard the early draws, made before the sampler has converged.
Thinning: keep only every $k$th draw, to reduce the correlation between successive samples.
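
The loop structure, including burn-in and thinning, can be sketched on a toy target. The target here is an assumption chosen purely because its conditionals are known in closed form: a standard bivariate normal with correlation `rho`, for which $x \mid y \sim N(\rho y, 1 - \rho^2)$ and likewise for $y \mid x$. It is not a model from the slides, and the function name and settings are illustrative.

```python
import random

def gibbs_bivariate_normal(rho, n_iter, burn_in, thin, rng):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    x, y = 0.0, 0.0                       # arbitrary starting point
    cond_sd = (1.0 - rho ** 2) ** 0.5     # standard deviation of each conditional
    samples = []
    for it in range(n_iter):
        # Draw each variable in turn from its conditional given the other.
        x = rng.gauss(rho * y, cond_sd)
        y = rng.gauss(rho * x, cond_sd)
        # Skip the burn-in iterations, then keep every `thin`-th draw.
        if it >= burn_in and (it - burn_in) % thin == 0:
            samples.append((x, y))
    return samples

rng = random.Random(0)
samples = gibbs_bivariate_normal(rho=0.8, n_iter=30000, burn_in=1000, thin=5, rng=rng)

# Use the retained samples to estimate the mean and the correlation.
n = len(samples)
mean_x = sum(x for x, _ in samples) / n
mean_y = sum(y for _, y in samples) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in samples) / n
var_x = sum((x - mean_x) ** 2 for x, _ in samples) / n
var_y = sum((y - mean_y) ** 2 for _, y in samples) / n
corr = cov_xy / (var_x * var_y) ** 0.5
print(n, mean_x, corr)
```

The estimated mean should be close to 0 and the estimated correlation close to 0.8, even though the sampler only ever draws from one-dimensional conditionals.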

Latent Dirichlet Allocation
We want to draw samples from $p(\theta, \phi, z \mid w)$, where
$w = \{w_{d,n}\}_{d=1..D,\, n=1..N}$
$z = \{z_{d,n}\}_{d=1..D,\, n=1..N}$
$\theta = \{\theta_d\}_{d=1..D}$
$\phi = \{\phi_t\}_{t=1..T}$

Latent Dirichlet Allocation
For Gibbs sampling, we need the distributions:
$p(\theta_d \mid \theta_{-d}, \phi, z, w)$
$p(\phi_t \mid \theta, \phi_{-t}, z, w)$
$p(z_{d,n} \mid \theta, \phi, z_{-(d,n)}, w)$

Latent Dirichlet Allocation
These are relatively straightforward to derive. For example:
$$\begin{aligned}
p(z_{d,n} \mid \theta, \phi, z_{-(d,n)}, w) &= \frac{p(w \mid \theta, \phi, z)\, p(z_{d,n} \mid \theta, \phi, z_{-(d,n)})}{p(w \mid \theta, \phi, z_{-(d,n)})} \\
&\propto p(w_{d,n} \mid \theta, \phi, z_{d,n})\, p(z_{d,n} \mid \theta, \phi, z_{-(d,n)}) \\
&= p(w_{d,n} \mid z_{d,n}, \phi_{z_{d,n}})\, p(z_{d,n} \mid \theta_d) \\
&= \phi_{z_{d,n}, w_{d,n}}\, \theta_{d, z_{d,n}}
\end{aligned}$$
The first step follows from Bayes theorem, the second from the fact that some terms do not depend on $z_{d,n}$, the third from independence in our Bayesian network, and the fourth from our model's definition of those distributions.
We then simply compute these probabilities for all values of $z_{d,n}$, normalise them to sum to 1, and draw a new value with those probabilities!
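
The final "compute, normalise, draw" step can be sketched as follows. The function name `resample_topic` and the small example values of $\theta_d$ and $\phi$ are hypothetical; `phi[t][v]` holds the probability of word `v` under topic `t`.

```python
import random

def resample_topic(word, theta_d, phi, rng):
    """Draw a new topic for one word from p(z_dn | theta, phi, w)."""
    T = len(theta_d)
    # Unnormalised probability of each topic t: phi[t][word] * theta_d[t].
    weights = [phi[t][word] * theta_d[t] for t in range(T)]
    total = sum(weights)
    probs = [wt / total for wt in weights]  # normalise to sum to 1
    new_topic = rng.choices(range(T), weights=probs, k=1)[0]
    return new_topic, probs

# Hypothetical example: 2 topics, vocabulary of 3 words.
theta_d = [0.7, 0.3]
phi = [[0.5, 0.4, 0.1],
       [0.1, 0.1, 0.8]]
topic, probs = resample_topic(word=2, theta_d=theta_d, phi=phi, rng=random.Random(0))
print(probs)
```

Here word 2 is much more likely under topic 1, so topic 1 gets most of the probability mass (0.24 / 0.31) even though the document prefers topic 0.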

Collapsed Gibbs sampler
In practice we actually want to find $p(z \mid w)$, as we can estimate the $\theta_d$, $\phi_t$ from the topic assignments.
We integrate out the other parameters.
This is called a collapsed Gibbs sampler.

Variational Bayesian inference
We want to approximate $p(\theta \mid D)$ for parameters $\theta = (\theta_1, \ldots, \theta_N)$.
We cannot compute this exactly, but maybe we can approximate it.
Introduce a new distribution $q(\theta)$ over the parameters, called the variational distribution.
We can choose the exact form of $q$ ourselves, giving us a set of variational parameters $\nu$, i.e. we have $q(\theta \mid \nu)$.
We then tweak $\nu$ so that $q$ is as similar to $p$ as possible!
We want $q$ to be easier to compute. We normally do this by assuming each of the parameters $\theta_i$ is independent in the posterior (the mean-field assumption):
$$q(\theta \mid \nu) = \prod_i q(\theta_i \mid \nu_i)$$

KL-divergence
We need some way of measuring similarity between distributions.
We use the KL-divergence between distributions $q$ and $p$:
$$D_{KL}(q \,\|\, p) = \int_\theta q(\theta) \log \frac{q(\theta)}{p(\theta \mid D)}\, d\theta$$
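
A small sketch of the KL-divergence for discrete distributions, where the integral becomes a sum. The discrete setting, the function name `kl_divergence` and the example vectors are assumptions for illustration; the code also assumes $p_i > 0$ wherever $q_i > 0$.

```python
import math

def kl_divergence(q, p):
    """D_KL(q || p) for two discrete distributions given as probability lists."""
    # Terms with q_i = 0 contribute nothing, by the convention 0 * log 0 = 0.
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q = [0.5, 0.5]
p = [0.9, 0.1]
kl_qp = kl_divergence(q, p)
kl_pq = kl_divergence(p, q)
print(kl_qp, kl_pq, kl_divergence(q, q))
```

The divergence is zero when the two distributions are identical, positive otherwise, and not symmetric: swapping $q$ and $p$ gives a different value.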

ELBO
We can show that minimising $D_{KL}(q \,\|\, p)$ is equivalent to maximising something called the Evidence Lower Bound (ELBO), $\mathcal{L}$:
$$\mathcal{L} = \int_\theta q(\theta) \log p(\theta, D)\, d\theta - \int_\theta q(\theta) \log q(\theta)\, d\theta = E_q[\log p(\theta, D)] - E_q[\log q(\theta)]$$
Once we choose the precise distribution for $q$, we can write down this expression.
We then optimise by taking the derivative w.r.t. $\nu$ and setting it to zero, which gives the variational parameter updates.
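
The equivalence can be seen in a few lines, using only $p(\theta, D) = p(\theta \mid D)\, p(D)$ and the fact that $q$ integrates to 1. This is a sketch of the standard argument, not a derivation taken from the slides.

```latex
\begin{aligned}
\log p(D) &= \int_\theta q(\theta) \log p(D)\, d\theta
           = \int_\theta q(\theta) \log \frac{p(\theta, D)}{p(\theta \mid D)}\, d\theta \\
          &= \int_\theta q(\theta) \log \frac{p(\theta, D)}{q(\theta)}\, d\theta
           + \int_\theta q(\theta) \log \frac{q(\theta)}{p(\theta \mid D)}\, d\theta \\
          &= \mathcal{L} + D_{KL}(q \,\|\, p)
\end{aligned}
```

Since $\log p(D)$ does not depend on $q$, increasing $\mathcal{L}$ must decrease $D_{KL}(q \,\|\, p)$ by the same amount. Because the KL-divergence is never negative, $\mathcal{L} \leq \log p(D)$, which is why it is called a lower bound on the evidence.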

Convergence
We update the variational parameters $\nu$ in turn, and alternate updates until the value of the ELBO converges.
After convergence, our estimate of the posterior distribution of a parameter $\theta_i$ is $q(\theta_i \mid \nu_i)$.

Choosing q
Our choice of $q$ determines how good our approximation to $p$ is.
If our model has conjugacy, we simply choose the same family of distribution for $q(\theta_i)$ as the conditional we used for Gibbs sampling, $p(\theta_i \mid \theta_{-i}, D)$.
We then obtain very nice updates.
In non-conjugate models we need to use gradient-based optimisation of the ELBO!
