Quantitative Biology II Lecture 4: Variational Methods


10th March 2015. Quantitative Biology II, Lecture 4: Variational Methods. Gurinder Singh Mickey Atwal, Center for Quantitative Biology, Cold Spring Harbor Laboratory

Image credit: Mike West

Summary: Approximate Bayesian Inference; Kullback-Leibler Divergence; Variational Principle; Bayesian Variational Inference

Bayesian Inference. Bayes' theorem: Posterior P(z|x) = P(x|z) P(z) / P(x), where P(x|z) is the likelihood, P(z) the prior, and P(x) the marginal likelihood (model evidence). Z are the parameters that we want to learn and X is the data that we observe; both Z and X can be high-dimensional, e.g. inference of haplotypes from genotypes.
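
As a concrete illustration of the theorem above, here is a minimal numerical sketch (the two-state prior and likelihood values are made up for illustration, not from the lecture) showing the posterior obtained by reweighting the prior by the likelihood and normalizing by the evidence:

    import numpy as np

    # Hypothetical two-state example (numbers are illustrative only)
    prior = np.array([0.7, 0.3])        # P(z)
    likelihood = np.array([0.2, 0.9])   # P(x|z) for the single observed x

    evidence = np.sum(likelihood * prior)       # P(x), the marginal likelihood
    posterior = likelihood * prior / evidence   # P(z|x), by Bayes' theorem

    print(posterior)         # approx. [0.34, 0.66]
    print(posterior.sum())   # 1.0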

Approximate Bayesian Inference. Evaluating the posterior exactly can be very difficult because (i) analytical solutions are not available and (ii) numerical integration is too expensive. "An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem." (John W. Tukey)

Approximate Bayesian Inference. Method 1 (Markov Chain Monte Carlo): draw random samples from the posterior p(z|x); asymptotically exact, but computationally very burdensome. Method 2 (Variational Methods): find an analytical approximation; very fast.

Big Idea in Variational Inference. Let's try to find a simpler distribution q(z) that approximates the posterior p(z|x): optimize q(z) over a class of distributions, varying q(z) until the simple q(z) best fits the complicated p(z|x).

How do we quantify the distance between distributions? The Kullback-Leibler divergence (D_KL), also known as relative entropy, quantifies the difference between two distributions P(x) and Q(x):

D_KL(P||Q) = Σ_x P(x) ln[P(x)/Q(x)]   (discrete)
           = ∫ P(x) ln[P(x)/Q(x)] dx   (continuous)

It is a non-symmetric measure; D_KL(P||Q) ≥ 0, with D_KL(P||Q) = 0 if and only if P = Q; and it is invariant to reparameterization of x.
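
The definition above translates directly into a few lines of code. A minimal sketch for discrete distributions (the helper name kl_divergence and the example numbers are our own, not the lecture's), which also illustrates the non-symmetry and the "zero iff P = Q" property:

    import numpy as np

    def kl_divergence(p, q):
        """D_KL(P||Q) = sum_x P(x) ln[P(x)/Q(x)], in nats, for discrete P and Q."""
        p, q = np.asarray(p, float), np.asarray(q, float)
        mask = p > 0  # terms with P(x) = 0 contribute zero by convention
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))

    p = [0.54, 0.46]                 # observed coin-flip distribution (next slides)
    q = [0.50, 0.50]                 # fair coin
    print(kl_divergence(p, q))       # small but nonzero
    print(kl_divergence(q, p))       # a different value: D_KL is not symmetric
    print(kl_divergence(p, p))       # 0.0: zero if and only if P = Q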

Kullback-Leibler Divergence: D_KL ≥ 0. Proof: use Jensen's inequality. For a concave function f, ⟨f(x)⟩ ≤ f(⟨x⟩); e.g. ln x is concave (every chord lies below the function), so ⟨ln x⟩ ≤ ln⟨x⟩. Then

D_KL(P||Q) = Σ_x P(x) ln[P(x)/Q(x)]
           = -Σ_x P(x) ln[Q(x)/P(x)]
           = -⟨ln[Q(x)/P(x)]⟩_P
           ≥ -ln⟨Q(x)/P(x)⟩_P
           = -ln Σ_x P(x) Q(x)/P(x)
           = -ln Σ_x Q(x) = -ln 1 = 0,

so D_KL(P||Q) ≥ 0.

Kullback-Leibler Divergence, Motivation 1: Counting Statistics. Flip a fair coin N times, i.e. q_H = q_T = 0.5. E.g. with N = 50 we observe 27 heads and 23 tails. What is the probability of observing this? (Bar charts compare the observed and actual distributions over heads and tails.) Observed distribution P(x) = {p_H, p_T} = {0.54, 0.46}; actual distribution Q(x) = {q_H, q_T} = {0.50, 0.50}.

Kullback-Leibler Divergence, Motivation 1: Counting Statistics.

P(n_H, n_T) = [N! / (n_H! n_T!)] q_H^{n_H} q_T^{n_T}   (binomial distribution)
            ≈ exp(-N p_H ln(p_H/q_H) - N p_T ln(p_T/q_T))   (for large N)
            = exp(-N D_KL[P||Q])

The probability of observing the counts depends on (i) N and (ii) how much the observed distribution differs from the true distribution. D_KL emerges from the large-N limit of the binomial (multinomial) distribution, and it quantifies how much the observed distribution diverges from the true underlying distribution. If D_KL > 1/N then the distributions are very different.
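
To put the large-N statement in numbers, here is a small check (a sketch only; note that exp(-N·D_KL) captures the exponential decay rate, while the multinomial prefactor of order 1/sqrt(N) is dropped):

    import numpy as np
    from math import comb, log

    # The slide's example: fair coin, N = 50 flips, 27 heads and 23 tails observed.
    N, n_H, n_T = 50, 27, 23
    q_H = q_T = 0.5
    p_H, p_T = n_H / N, n_T / N          # observed distribution P = {0.54, 0.46}

    # Exact binomial probability of these counts under Q
    exact = comb(N, n_H) * q_H**n_H * q_T**n_T

    # Large-N approximation from the slide: exp(-N * D_KL[P||Q])
    d_kl = p_H * log(p_H / q_H) + p_T * log(p_T / q_T)
    approx = np.exp(-N * d_kl)

    # The two agree in their exponential dependence on N and D_KL; the exact
    # value also carries a polynomial prefactor that the approximation ignores.
    print(exact, approx, N * d_kl)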

Kullback-Leibler Divergence, Motivation 2: Information Theory. How many extra bits, on average, do we need to code samples from P(x) using a code optimized for Q(x)?

D_KL(P||Q) = (avg. no. of bits using the bad code) - (avg. no. of bits using the optimal code)
           = [-Σ_x P(x) log2 Q(x)] - [-Σ_x P(x) log2 P(x)]
           = Σ_x P(x) log2[P(x)/Q(x)]

Kullback-Leibler Divergence, Motivation 2: Information Theory.

Symbol   P(x)   Bad code (optimal for Q)   Optimal code for P
A        1/2    00                         0
C        1/4    01                         10
T        1/8    10                         110
G        1/8    11                         111

Here P(x) = {1/2, 1/4, 1/8, 1/8} and Q(x) = {1/4, 1/4, 1/4, 1/4}. The average length of the bad code is 2 bits; the average length of the optimal code is 1.75 bits. The entropy of the symbol distribution, -Σ_x P(x) log2 P(x) = 1.75 bits, equals the average length of the optimal code, so that code is indeed optimal. D_KL(P||Q) = 2 - 1.75 = 0.25, i.e. there is an additional overhead of 0.25 bits per symbol if we use the bad code {A=00, C=01, T=10, G=11} instead of the optimal code.
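
The numbers in this table are easy to verify; a short sketch computing the average code lengths and the divergence directly from the definitions above:

    import numpy as np

    P = np.array([1/2, 1/4, 1/8, 1/8])   # true symbol distribution (A, C, T, G)
    Q = np.array([1/4, 1/4, 1/4, 1/4])   # distribution the "bad" code is optimal for

    len_bad = -np.log2(Q)                # bad-code lengths: 2, 2, 2, 2 bits
    len_opt = -np.log2(P)                # optimal-code lengths: 1, 2, 3, 3 bits

    avg_bad = np.sum(P * len_bad)        # 2.00 bits per symbol
    avg_opt = np.sum(P * len_opt)        # 1.75 bits per symbol = entropy of P
    d_kl = np.sum(P * np.log2(P / Q))    # 0.25 bits: the extra cost of the bad code

    print(avg_bad, avg_opt, d_kl, avg_bad - avg_opt)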

Mutual Information. You can use D_KL to measure distances between joint distributions, for example P = h(x,y) and Q = h(x)h(y):

D_KL[h(x,y) || h(x)h(y)] = Σ_{x,y} h(x,y) log2[ h(x,y) / (h(x)h(y)) ] = I(x,y)   (mutual information)

The divergence between h(x,y) and h(x)h(y) is just the mutual information I[x,y], which quantifies how non-independent the variables x and y are.
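
A quick numerical sketch of mutual information computed as exactly this divergence (the joint distribution below is made up for illustration):

    import numpy as np

    # Hypothetical joint distribution h(x, y) over two binary variables
    h = np.array([[0.30, 0.10],
                  [0.15, 0.45]])
    hx = h.sum(axis=1, keepdims=True)    # marginal h(x)
    hy = h.sum(axis=0, keepdims=True)    # marginal h(y)

    # I(x, y) = D_KL[ h(x, y) || h(x) h(y) ], in bits
    mi = np.sum(h * np.log2(h / (hx * hy)))
    print(mi)   # positive here; it would be 0 if h(x, y) = h(x) h(y)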

D_KL[P||Q] versus D_KL[Q||P]. In estimating the Bayesian posterior we use D_KL as a penalty for the difference between the posterior p(z|x) and the approximate distribution q(z). Do we use D_KL[P||Q] or D_KL[Q||P]? (Bar charts show example distributions P and Q over three states.) In this example D_KL[P||Q] < D_KL[Q||P], so if we use D_KL[Q||P] as the objective function to be minimized, then Q is forced to avoid regions where P is small.

Minimizing Divergence. (Figure: green, a correlated Gaussian p(z); red, the approximating distribution q(z); the panels contrast minimizing D_KL[q||p] with minimizing D_KL[p||q]. Image credit: Bishop.) Note that minimizing D_KL[q||p] forces q to avoid regions where p is small, so q will underestimate the support of p.

Minimizing Divergence. (Figure: blue, a mixture of two Gaussians p(z); red, the approximating distribution q(z), a single Gaussian. One panel shows the result of minimizing D_KL(p||q); the others show different local minima of D_KL(q||p). Image credit: Bishop.)
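
The qualitative difference in these figures can be reproduced with a crude grid search (a sketch only; the particular mixture, grid, and brute-force optimizer are our own choices, not the lecture's): minimizing D_KL(p||q) yields a broad q covering both modes, while minimizing D_KL(q||p) locks q onto a single mode.

    import numpy as np

    def gauss(x, m, s):
        return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

    x = np.linspace(-6, 6, 2001)
    dx = x[1] - x[0]
    p = 0.5 * gauss(x, -2, 0.5) + 0.5 * gauss(x, 2, 0.5)   # two-mode "blue" mixture

    def kl(a, b):
        """Discretized D_KL(a||b) on the grid; terms where a(x) = 0 contribute zero."""
        b = np.maximum(b, 1e-300)        # guard against underflow inside the log
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask])) * dx

    best_fwd = best_rev = (np.inf, None, None)
    for m in np.linspace(-3, 3, 61):                 # candidate means of q
        for s in np.linspace(0.2, 3.0, 57):          # candidate std devs of q
            q = gauss(x, m, s)
            fwd, rev = kl(p, q), kl(q, p)
            if fwd < best_fwd[0]:
                best_fwd = (fwd, m, s)
            if rev < best_rev[0]:
                best_rev = (rev, m, s)

    print("argmin D_KL(p||q):", best_fwd)   # mean near 0, large std: covers both modes
    print("argmin D_KL(q||p):", best_rev)   # mean near +/-2, std near 0.5: one mode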

Log Model Evidence and Free Energy.

ln p(x) = ln[ p(x,z) / p(z|x) ]
        = ∫ q(z) ln[ p(x,z) / p(z|x) ] dz
        = ∫ q(z) ln[ (p(x,z)/q(z)) (q(z)/p(z|x)) ] dz
        = ∫ q(z) ln[ q(z) / p(z|x) ] dz + ∫ q(z) ln[ p(x,z) / q(z) ] dz
        = D_KL[q||p] + F(q,x)

The second term, F(q,x), is the free energy.

Free Energy. The log model evidence can be expressed as ln p(x) = D_KL[q||p] + F(q,x). The free energy F(q,x) is easy to evaluate for a given q, and maximizing F(q,x) is equivalent to (1) minimizing D_KL[q||p] and (2) tightening F(q,x) as a lower bound on the log model evidence.
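
Since everything in this identity is computable for a small discrete model, it can be checked numerically. A minimal sketch with a three-state latent variable (the prior, likelihood, and q values are arbitrary illustrations):

    import numpy as np

    # Toy discrete model: latent z in {0, 1, 2}, one observed data point x
    prior = np.array([0.5, 0.3, 0.2])     # p(z)
    lik = np.array([0.10, 0.60, 0.30])    # p(x|z) at the observed x

    joint = lik * prior                   # p(x, z)
    evidence = joint.sum()                # p(x)
    posterior = joint / evidence          # p(z|x)

    q = np.array([0.2, 0.5, 0.3])         # any valid q(z), not the true posterior

    free_energy = np.sum(q * np.log(joint / q))   # F(q,x) = <ln p(x,z)/q(z)>_q
    kl = np.sum(q * np.log(q / posterior))        # D_KL[q || p(z|x)]

    print(np.log(evidence), kl + free_energy)     # equal: ln p(x) = D_KL + F
    print(free_energy <= np.log(evidence))        # True: F lower-bounds ln p(x)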

Free Energy Optimization. (Diagram: the fixed quantity ln p(x) is split into D_KL[q||p] and F(q,x); at initialization D_KL[q||p] is large, and at convergence F(q,x) has grown to nearly all of ln p(x) while D_KL[q||p] has shrunk.)

Free Energy.

F(q,x) = ∫ q(z) ln[ p(x,z) / q(z) ] dz
       = ∫ q(z) ln p(x,z) dz - ∫ q(z) ln q(z) dz
       = ⟨ln p(x,z)⟩_q + H[q]
       = average energy + entropy

(The energy and free energy here are the negatives of the usual quantities in physics.) Recall from statistical physics: free energy F = U - TS = average energy - temperature × entropy.

Variational Calculus. Standard calculus (Newton, Leibniz, ...): take derivatives of a function f(x), e.g. maximize the posterior p(z|x) with respect to z. Variational calculus (Euler, Lagrange, ...): take derivatives of a functional F[f], e.g. maximize the entropy H[p] with respect to a probability distribution p.

Variational Principle in Science. Try to find a function (or distribution) that maximizes or minimizes a cost function. Examples in physics: Fermat's principle and the action principle.

Variational Inference Example. Imagine N data samples drawn from a one-dimensional Gaussian. We want to infer the mean u and the precision t. We assume a variational distribution q(u,t) and further assume that it factorizes (the mean-field approximation): q(u,t) = q(u) q(t), where q(u) follows a Gaussian distribution and q(t) follows a Gamma distribution.

Variational Inference Example. Maximizing the free energy gives a coupled set of equations for q(u) and q(t), which we update iteratively until convergence: make an initial guess for the parameters of q(u) and q(t); compute q(u) and evaluate the first two moments of u; plug these into q(t) and evaluate the first two moments of t; repeat until convergence (see the sketch below).
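
Below is a minimal sketch of these coordinate updates for the Gaussian/Gamma factorization described above, following the standard conjugate-exponential form (e.g. Bishop, chapter 10); the priors, hyperparameter values, and simulated data are assumptions for illustration, not the lecture's:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=1.0, scale=2.0, size=200)   # simulated 1-D Gaussian data
    N, xbar = len(x), x.mean()

    # Assumed conjugate priors: u ~ Normal(mu0, 1/(lam0*t)), t ~ Gamma(a0, b0)
    mu0, lam0, a0, b0 = 0.0, 1e-3, 1e-3, 1e-3

    E_t = 1.0                                      # initial guess for <t> under q(t)
    for _ in range(50):
        # q(u) is Gaussian with mean mu_N and precision lam_N
        mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
        lam_N = (lam0 + N) * E_t
        E_u, E_u2 = mu_N, mu_N**2 + 1.0 / lam_N    # first two moments of u

        # q(t) is Gamma(a_N, b_N); the moments of u enter its update
        a_N = a0 + (N + 1) / 2.0
        b_N = b0 + 0.5 * (np.sum(x**2) - 2 * E_u * np.sum(x) + N * E_u2
                          + lam0 * (E_u2 - 2 * mu0 * E_u + mu0**2))
        E_t = a_N / b_N                            # first moment of t feeds back into q(u)

    print("E[u] ~", mu_N)    # close to the sample mean
    print("E[t] ~", E_t)     # close to 1/variance of the simulated data (0.25)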

Variational Inference Example. (Figure panels: Initialize → Evaluate q(u) → Evaluate q(t) → Convergence. Image credit: Bishop.)

Strategies in variational inference fall into a 2×2 grid. With no parametric assumptions: making no mean-field assumption means variational inference reduces to exact inference, while the mean-field assumption q(z) = Π_i q(z_i) gives iterative free-form variational optimization. With parametric assumptions q(z) = f(z|θ): no mean-field assumption gives fixed-form optimization of moments, while adding the mean-field assumption gives iterative fixed-form variational optimization.

Variational Bayes Inference of Population Structure (2014). Observed data (x): whole-genome SNP genotypes from many individuals. Latent variables (z): ancestral populations, e.g. Japanese, Basque, etc.