10th March 2015
Quantitative Biology II
Lecture 4: Variational Methods
Gurinder Singh Mickey Atwal
Center for Quantitative Biology, Cold Spring Harbor Laboratory
Image credit: Mike West
Summary
- Approximate Bayesian Inference
- Kullback-Leibler Divergence
- Variational Principle
- Bayesian Variational Inference
Bayesian Inference

$$\underbrace{P(z \mid x)}_{\text{Posterior}} = \frac{\overbrace{P(x \mid z)}^{\text{Likelihood}}\,\overbrace{P(z)}^{\text{Prior}}}{\underbrace{P(x)}_{\text{Marginal Likelihood (Model Evidence)}}}$$

Z: parameters that we want to learn
X: data that we observe
Both Z and X can be high-dimensional, e.g. inference of haplotypes from genotypes.
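As a concrete instance of the formula, here is a minimal sketch of Bayes' rule for a discrete latent variable; the two-state setup and the numerical prior/likelihood values are illustrative assumptions, not from the lecture.

```python
# Bayes' rule on a discrete latent variable z (illustrative values).
import numpy as np

prior = np.array([0.7, 0.3])       # p(z) over two hypothetical states
likelihood = np.array([0.2, 0.9])  # p(x | z) for one observed x

evidence = np.sum(likelihood * prior)      # p(x), the marginal likelihood
posterior = likelihood * prior / evidence  # p(z | x)
print(posterior)                           # approx [0.341, 0.659]
```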
Approximate Bayesian Inference
Evaluating the posterior exactly can be very difficult because:
i) analytical solutions are not available
ii) numerical integration is too expensive

"An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem." (John W. Tukey)
Approximate Bayesian Inference
Method 1 (Markov Chain Monte Carlo): draw random samples from the posterior p(z|x).
- Asymptotically exact
- Computationally very burdensome
Method 2 (Variational Methods): find an analytical approximation.
- Very fast
Big Idea in Variational Inference
Let's try to find a simpler distribution q(z) that approximates the posterior p(z|x). Optimize q(z) over a class of distributions to find the best fit.
[Figure: a complicated p(z|x) approximated by a simple q(z), which is varied to improve the fit.]
How do we quantify distance between distributions?
The Kullback-Leibler Divergence (D_KL), also known as relative entropy, quantifies the difference between two distributions P(x) and Q(x):

$$D_{KL}(P \,\|\, Q) = \sum_x P(x) \ln \frac{P(x)}{Q(x)} \quad \text{(discrete)}$$

$$D_{KL}(P \,\|\, Q) = \int P(x) \ln \frac{P(x)}{Q(x)}\, dx \quad \text{(continuous)}$$

- Non-symmetric measure
- $D_{KL}(P \,\|\, Q) \ge 0$, with $D_{KL}(P \,\|\, Q) = 0$ if and only if $P = Q$
- Invariant to reparameterization of x
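For concreteness, here is a minimal sketch of the discrete form; the helper name kl_divergence and the guard for zero-probability states are my additions, and the example values anticipate the coin-flip slide that follows.

```python
# Discrete KL divergence (in nats); assumes Q(x) > 0 wherever P(x) > 0.
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions p, q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                 # terms with P(x) = 0 contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.54, 0.46]                 # observed coin-flip frequencies
q = [0.50, 0.50]                 # fair coin
print(kl_divergence(p, q), kl_divergence(q, p))  # note the asymmetry
```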
Kullback-Leibler Divergence: D_KL >= 0
Proof, using Jensen's inequality: for a concave function f(x), $\langle f(x) \rangle \le f(\langle x \rangle)$, e.g. $\langle \ln x \rangle \le \ln \langle x \rangle$ (for a concave function, every chord lies below the function).

$$D_{KL}(P \,\|\, Q) = \sum_x P(x) \ln \frac{P(x)}{Q(x)} = -\sum_x P(x) \ln \frac{Q(x)}{P(x)} = -\left\langle \ln \frac{Q(x)}{P(x)} \right\rangle_P$$

$$\ge -\ln \left\langle \frac{Q(x)}{P(x)} \right\rangle_P = -\ln \sum_x P(x)\, \frac{Q(x)}{P(x)} = -\ln \sum_x Q(x) = -\ln 1 = 0$$

Hence $D_{KL}(P \,\|\, Q) \ge 0$.
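A quick numerical sanity check of Jensen's inequality for the concave ln; the sample values and weights below are arbitrary assumptions.

```python
# Jensen's inequality for ln: E[ln x] <= ln E[x].
import numpy as np

x = np.array([0.5, 1.0, 2.0, 4.0])
w = np.array([0.1, 0.4, 0.3, 0.2])  # probabilities summing to 1
print(np.sum(w * np.log(x)), "<=", np.log(np.sum(w * x)))
```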
Kullback-Leibler Divergence, Motivation 1: Counting Statistics
Flip a fair coin N times, i.e., q_H = q_T = 0.5.
E.g. N = 50: observe 27 heads and 23 tails. What is the probability of observing this?
[Figure: bar charts of the observed distribution P(x) = {p_H = 0.54, p_T = 0.46} and the actual distribution Q(x) = {q_H = 0.50, q_T = 0.50} over heads/tails.]
Kullback-Leibler Divergence, Motivation 1: Counting Statistics

$$P(n_H, n_T) = \frac{N!}{n_H!\, n_T!}\, q_H^{n_H} q_T^{n_T} \quad \text{(binomial distribution)}$$

$$\approx \exp\left(-N p_H \ln \frac{p_H}{q_H} - N p_T \ln \frac{p_T}{q_T}\right) = \exp\left(-N\, D_{KL}[P \,\|\, Q]\right) \quad \text{(for large } N\text{)}$$

- The probability of observing the counts depends on i) N and ii) how much the observed distribution differs from the true distribution.
- D_KL emerges from the large-N limit of a binomial (multinomial) distribution.
- D_KL quantifies how much the observed distribution diverges from the true underlying distribution.
- If D_KL > 1/N then the distributions are very different.
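As a check of the large-N claim, this sketch compares the exact binomial probability of the slide's 27-heads outcome against exp(-N D_KL[P||Q]); the Stirling prefactor is included in the comparison, since exp(-N D_KL) alone captures only the exponential decay.

```python
# Exact binomial probability vs the large-N approximation exp(-N * D_KL).
from math import comb, exp, log, pi

N, n_H = 50, 27
p_H, q_H = n_H / N, 0.5

exact = comb(N, n_H) * q_H**n_H * (1 - q_H)**(N - n_H)
d_kl = p_H * log(p_H / q_H) + (1 - p_H) * log((1 - p_H) / (1 - q_H))
prefactor = 1.0 / (2 * pi * N * p_H * (1 - p_H)) ** 0.5  # Stirling correction
print(exact, prefactor * exp(-N * d_kl))  # approx 0.096 vs approx 0.096
```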
Kullback-Leibler Divergence, Motivation 2: Information Theory
How many extra bits, on average, do we need to code samples from P(x) using a code optimized for Q(x)?

D_KL(P||Q) = avg no. of bits using bad code - avg no. of bits using optimal code:

$$D_{KL}(P \,\|\, Q) = \left(-\sum_x P(x) \log_2 Q(x)\right) - \left(-\sum_x P(x) \log_2 P(x)\right) = \sum_x P(x) \log_2 \frac{P(x)}{Q(x)}$$
Kullback-Leibler Divergence, Motivation 2: Information Theory

| Symbol | Probability of symbol, P(x) | Bad code, but optimal for Q(x) | Optimal code for P(x) |
|--------|-----------------------------|--------------------------------|-----------------------|
| A      | 1/2                         | 00                             | 0                     |
| C      | 1/4                         | 01                             | 10                    |
| T      | 1/8                         | 10                             | 110                   |
| G      | 1/8                         | 11                             | 111                   |

P(x) = {1/2, 1/4, 1/8, 1/8}; Q(x) = {1/4, 1/4, 1/4, 1/4}
Avg length (bad code) = 2 bits; avg length (optimal code) = 1.75 bits.
Entropy of the symbol distribution $= -\sum_x P(x) \log_2 P(x) = 1.75$ bits; this equals the optimal code's average length, which is thus optimal.
D_KL(P||Q) = 2 - 1.75 = 0.25, i.e. there is an additional overhead of 0.25 bits per symbol if we use the bad code {A=00; C=01; T=10; G=11} instead of the optimal code.
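A short sketch reproducing the slide's arithmetic; the code-length vectors simply transcribe the table above.

```python
# Expected code lengths under P for the two codes in the table.
import numpy as np

p = np.array([1/2, 1/4, 1/8, 1/8])        # P(x) over A, C, T, G
len_bad = np.array([2, 2, 2, 2])          # code optimal for uniform Q(x)
len_opt = np.array([1, 2, 3, 3])          # code optimal for P(x)

entropy = -np.sum(p * np.log2(p))         # 1.75 bits
print(p @ len_bad, p @ len_opt, entropy)  # 2.0, 1.75, 1.75
print(p @ len_bad - p @ len_opt)          # 0.25 bits = D_KL(P||Q)
```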
Mutual Information
You can use D_KL to measure distances between joint distributions, for example P = h(x,y) and Q = h(x)h(y):

$$D_{KL}[h(x, y) \,\|\, h(x)h(y)] = \sum_{x,y} h(x, y) \log_2 \frac{h(x, y)}{h(x)h(y)} = I(x, y) \quad \text{(Mutual Information)}$$

The divergence between h(x,y) and h(x)h(y) is just the mutual information I[x,y], quantifying how non-independent the x and y variables are.
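A minimal sketch computing mutual information as this KL divergence; the 2x2 joint distribution is a hypothetical example, not from the lecture.

```python
# Mutual information as D_KL between a joint and the product of marginals.
import numpy as np

h = np.array([[0.30, 0.10],
              [0.15, 0.45]])             # hypothetical joint h(x, y)
hx = h.sum(axis=1, keepdims=True)        # marginal h(x)
hy = h.sum(axis=0, keepdims=True)        # marginal h(y)

mi = np.sum(h * np.log2(h / (hx * hy)))  # D_KL[h(x,y) || h(x)h(y)]
print(mi)                                # 0 iff x and y are independent
```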
D_KL[P||Q] versus D_KL[Q||P]
In estimating the Bayesian posterior we use D_KL as a penalty for the difference between the posterior p(z|x) and the approximate distribution q(z). Do we use D_KL[P||Q] or D_KL[Q||P]?
[Figure: bar charts of two distributions P and Q over three states.]
In this example, D_KL[P||Q] < D_KL[Q||P]. So if we use D_KL[Q||P] as the objective function to be minimized, then Q is forced to avoid regions where P is small.
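The slide's bar heights are not recoverable, so the sketch below uses assumed three-state distributions, chosen so that Q places mass where P is small, which makes the asymmetry obvious.

```python
# Forward vs reverse KL for assumed three-state distributions.
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

P = np.array([0.98, 0.01, 0.01])
Q = np.array([0.60, 0.20, 0.20])
print(kl(P, Q), kl(Q, P))  # ~0.42 vs ~0.90: Q's mass where P is small
                           # is penalized heavily under D_KL[Q||P]
```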
Minimizing Divergence
[Figure: green, a correlated Gaussian p(z); red, the approximating distribution q(z). Left panel: minimizing D_KL[q||p]; right panel: minimizing D_KL[p||q]. Image credit: Bishop.]
Note that minimizing D_KL[q||p] forces q to avoid regions where p is small, so q will underestimate the support of p.
Minimizing Divergence
[Figure: blue, a mixture of two Gaussians p(z); red, the approximating distribution, a single Gaussian q(z). One panel shows the result of minimizing D_KL(p||q); the others show different minima of D_KL(q||p). Image credit: Bishop.]
Free Energy
The log model evidence ln p(x) can be decomposed as follows:

$$\ln p(x) = \ln \frac{p(x, z)}{p(z \mid x)} = \int q(z) \ln \frac{p(x, z)}{p(z \mid x)}\, dz = \int q(z) \ln \left[\frac{p(x, z)}{q(z)} \cdot \frac{q(z)}{p(z \mid x)}\right] dz$$

$$= \underbrace{\int q(z) \ln \frac{q(z)}{p(z \mid x)}\, dz}_{D_{KL}[q \,\|\, p]} + \underbrace{\int q(z) \ln \frac{p(x, z)}{q(z)}\, dz}_{F(q, x)\ \text{(free energy)}}$$
Free Energy
The log model evidence can be expressed as

$$\ln p(x) = D_{KL}[q \,\|\, p] + F(q, x)$$

The free energy F(q,x) is easy to evaluate for a given q. Maximizing F(q,x) is equivalent to:
1. minimizing D_KL[q||p]
2. tightening F(q,x) as a lower bound on the log model evidence
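A numerical check of this decomposition on a toy discrete model; the joint p(x,z) values and the variational q(z) are assumptions for illustration.

```python
# Verify ln p(x) = D_KL[q || p] + F(q, x) for a discrete latent z.
import numpy as np

p_xz = np.array([0.30, 0.10, 0.15])         # p(x, z) at the observed x, z = 0,1,2
q = np.array([0.50, 0.25, 0.25])            # any valid q(z)

p_x = p_xz.sum()                            # model evidence p(x)
post = p_xz / p_x                           # exact posterior p(z | x)

kl = np.sum(q * np.log(q / post))           # D_KL[q || p]
free_energy = np.sum(q * np.log(p_xz / q))  # F(q,x) = <ln p(x,z)>_q + H[q]
print(np.log(p_x), kl + free_energy)        # identical up to rounding
```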
Free Energy Optimization
[Figure: the fixed total ln p(x) shown split into D_KL[q||p] and F(q,x) at initialization and at convergence; by convergence, F(q,x) has grown to fill most of ln p(x) while D_KL[q||p] has shrunk toward 0.]
Free Energy

$$F(q, x) = \int q(z) \ln \frac{p(x, z)}{q(z)}\, dz = \int q(z) \ln p(x, z)\, dz - \int q(z) \ln q(z)\, dz = \underbrace{\langle \ln p(x, z) \rangle_q}_{\text{Average Energy}} + \underbrace{H[q]}_{\text{Entropy}}$$

Here the energy and free energy are the negatives of the usual quantities in physics. Recall, from statistical physics: Free Energy F = U - TS = Average Energy - Temperature * Entropy.
Variational Calculus
Standard calculus (Newton, Leibniz, ...): take derivatives of a function f(x), e.g. maximize the posterior p(z|x) w.r.t. z.
Variational calculus (Euler, Lagrange): take derivatives of a functional F[f], e.g. maximize the entropy H[p] w.r.t. a probability distribution p.
Variational Principle in Science
Try to find a function (or distribution) that maximizes or minimizes a cost function.
Examples in physics: Fermat's Principle, the Action Principle.
Variational Inference Example
Imagine N data samples drawn from a one-dimensional Gaussian. We want to infer the mean μ and precision τ.
We assume a variational distribution q(μ, τ), and further assume that it factorizes (the mean-field approximation): q(μ, τ) = q(μ)q(τ).
q(μ) follows a Gaussian distribution; q(τ) follows a gamma distribution.
Variational Inference Example
Maximizing the free energy yields a coupled set of equations for q(μ) and q(τ), which we update iteratively until convergence:
1. Make an initial guess for the parameters of q(μ) and q(τ).
2. Compute q(μ) and evaluate the first two moments of μ.
3. Plug these into q(τ) and evaluate the first two moments of τ.
4. Repeat steps 2-3 until convergence.
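Below is a minimal sketch of this scheme for the univariate Gaussian, following the standard conjugate setup (as in Bishop, ch. 10): priors p(μ|τ) = N(μ0, (λ0 τ)^(-1)) and p(τ) = Gamma(a0, b0). The hyperparameter values and the synthetic data are assumptions.

```python
# Mean-field VB for a 1-D Gaussian with unknown mean mu and precision tau.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=200)  # synthetic data (assumed)
N, xbar = len(x), x.mean()

mu0, lam0, a0, b0 = 0.0, 1.0, 1e-3, 1e-3      # broad priors (assumed)
E_tau = 1.0                                    # initial guess for <tau>

for _ in range(50):
    # q(mu) = N(mu_N, 1/lam_N), given the current <tau>
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    E_mu, E_mu2 = mu_N, mu_N**2 + 1.0 / lam_N  # first two moments of mu

    # q(tau) = Gamma(a_N, b_N), given the current moments of mu
    a_N = a0 + (N + 1) / 2.0
    b_N = b0 + 0.5 * (np.sum(x**2) - 2 * E_mu * np.sum(x) + N * E_mu2
                      + lam0 * (E_mu2 - 2 * mu0 * E_mu + mu0**2))
    E_tau = a_N / b_N                          # first moment of tau

print(mu_N, 1.0 / np.sqrt(E_tau))  # approx posterior mean and data s.d.
```

Note that mu_N here does not depend on E_tau, a special property of this model; in general each update feeds the other, which is why the scheme iterates.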
Variational Inference Example
[Figure: contours of q(μ)q(τ) compared with the exact posterior at four stages: initialization, after evaluating q(μ), after evaluating q(τ), and at convergence. Image credit: Bishop.]
Strategies in Variational Inference

|                                       | No mean-field assumption                | Mean-field assumption q(z) = Π_i q(z_i)       |
|---------------------------------------|-----------------------------------------|-----------------------------------------------|
| No parametric assumptions             | variational inference = exact inference | iterative free-form variational optimization  |
| Parametric assumptions q(z) = f(z; θ) | fixed-form optimization of moments      | iterative fixed-form variational optimization |
Variational Bayes Inference of Population Structure (2014)
Observed data (x) = whole-genome SNP genotypes from many individuals.
Latent variables (z) = ancestral population assignments, e.g. Japanese, Basque, etc.