Introduction to Bayesian inference
Thomas Alexander Brouwer, University of Cambridge, tab43@cam.ac.uk
17 November 2015

Probabilistic models. Describe how the data was generated using probability distributions: a generative process. Data D, parameters θ. We want to find the best parameters θ for the data D; this is inference.

Topic modelling. Documents D_1, ..., D_D. Documents cover topics, with a distribution θ_d = (t_1, ..., t_T). The words in document d are D_d = {w_{d,1}, ..., w_{d,N}}. Some words are more prevalent in some topics: each topic has a word distribution φ_t = (w_1, ..., w_V). The data are the words in documents D_1, ..., D_D; the parameters are the θ_d and φ_t.

(Figure: illustration of topic modelling, from Blei's ICML-2012 tutorial.)

Overview: probabilistic models, probability theory, latent variable models, Bayesian inference, Bayes' theorem, Latent Dirichlet Allocation, conjugacy, graphical models, Gibbs sampling, variational Bayesian inference.

Probability primer. A random variable X has a probability distribution p(X). Discrete distributions, e.g. a coin flip or dice roll; continuous distributions, e.g. a height distribution.

Multiple random variables: the joint distribution p(X, Y) and the conditional distribution p(X | Y).

Probability rules.
Chain rule: p(X | Y) = p(X, Y) / p(Y), or equivalently p(X, Y) = p(X | Y) p(Y).
Marginal rule: p(X) = Σ_y p(X, y) = Σ_y p(X | y) p(y); for continuous variables, p(X) = ∫ p(X, y) dy = ∫ p(X | y) p(y) dy.
We can add more conditional random variables if we want, e.g. p(X, Y | Z) = p(X | Y, Z) p(Y | Z).
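
As a sanity check, here is a tiny Python sketch that verifies the chain and marginal rules on a made-up discrete joint distribution (the table values are purely illustrative):

    # A made-up joint distribution p(X, Y) over X in {0, 1}, Y in {0, 1}.
    p_xy = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

    # Marginal rule: p(X) = sum_y p(X, y), and similarly for p(Y)
    p_x = {x: sum(p for (xx, y), p in p_xy.items() if xx == x) for x in (0, 1)}
    p_y = {y: sum(p for (x, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

    # Chain rule: p(X | Y) = p(X, Y) / p(Y)
    p_x_given_y = {(x, y): p_xy[(x, y)] / p_y[y] for (x, y) in p_xy}

    # Check that p(X | Y) p(Y) recovers the joint p(X, Y)
    assert all(abs(p_x_given_y[(x, y)] * p_y[y] - p_xy[(x, y)]) < 1e-12 for (x, y) in p_xy)
    print(p_x, p_y)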

Independence. X and Y are independent if p(X, Y) = p(X) p(Y), or equivalently if p(Y | X) = p(Y).

Expectation and variance.
Expectation: E[X] = Σ_x x p(X = x) (discrete) or E[X] = ∫ x p(x) dx (continuous).
Variance: V[X] = E[(X − E[X])²] = E[X²] − E[X]², where E[X²] = Σ_x x² p(X = x) or E[X²] = ∫ x² p(x) dx.

Dice roll:
E[X] = 1·(1/6) + 2·(1/6) + 3·(1/6) + 4·(1/6) + 5·(1/6) + 6·(1/6) = 7/2
E[X²] = 1²·(1/6) + 2²·(1/6) + 3²·(1/6) + 4²·(1/6) + 5²·(1/6) + 6²·(1/6) = 91/6
So V[X] = 91/6 − (7/2)² = 35/12.
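
The same computation in a few lines of Python, using exact fractions (just a restatement of the arithmetic above):

    from fractions import Fraction

    # Fair six-sided die: outcomes 1..6, each with probability 1/6.
    outcomes = range(1, 7)
    p = Fraction(1, 6)

    E_X = sum(x * p for x in outcomes)          # 7/2
    E_X2 = sum(x * x * p for x in outcomes)     # 91/6
    Var_X = E_X2 - E_X ** 2                     # 35/12
    print(E_X, E_X2, Var_X)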

Latent variable models. Variables are either manifest (observed) or latent (unobserved); a latent variable model is one that includes latent variables.

Probability distributions: the categorical distribution. N possible outcomes, with probabilities (p_1, ..., p_N). A draw is a single value, e.g. one throw of a dice. Parameters θ = (p_1, ..., p_N). Discrete distribution with p(X = i) = p_i. For outcome i, the expectation (of its indicator) is p_i and the variance is p_i(1 − p_i).

Probability distributions: the Dirichlet distribution.
Draws are vectors x = (x_1, ..., x_N) such that Σ_i x_i = 1. In other words, draws are probability vectors, i.e. the parameter of a categorical distribution.
Parameters θ = (α_1, ..., α_N) = α.
Continuous distribution with density p(x) = (1 / B(α)) ∏_i x_i^(α_i − 1), where B(α) = ∏_i Γ(α_i) / Γ(Σ_i α_i) and Γ(α_i) = ∫_0^∞ y^(α_i − 1) e^(−y) dy.
The expectation of the i-th element is E[x_i] = α_i / Σ_j α_j.
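
A small Python sketch of what Dirichlet draws look like, using NumPy's built-in sampler (the parameter values are illustrative, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    alpha = np.array([2.0, 5.0, 3.0])            # Dirichlet parameters (illustrative values)

    samples = rng.dirichlet(alpha, size=100000)  # each row is one draw
    print(samples[0], samples[0].sum())          # a draw is a probability vector (sums to 1)
    print(samples.mean(axis=0))                  # close to alpha / alpha.sum() = [0.2, 0.5, 0.3]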

Unfair dice. A dice with unknown distribution p = (p_1, p_2, p_3, p_4, p_5, p_6). We observe some throws and want to estimate p. Say we observe 4, 6, 6, 4, 6, 3. Perhaps p = (0, 0, 1/6, 2/6, 0, 3/6)?

Maximum likelihood. The maximum likelihood solution is θ_ML = argmax_θ p(D | θ). This easily leads to overfitting: for the dice above, faces we did not observe get probability 0. We want to incorporate some prior belief or knowledge about our parameters.

Bayes' theorem. For any two random variables X and Y,
p(X | Y) = p(Y | X) p(X) / p(Y).
Proof: from the chain rule, p(X, Y) = p(Y | X) p(X) = p(X | Y) p(Y); divide both sides by p(Y).

Disease test. A test for a disease is 99% accurate, and 1 in 1000 people have the disease. You tested positive. What is the probability that you have the disease?

Disease test. Let X = "you have the disease" and Y = "you test positive". We want p(X | Y), the probability of the disease given a positive test. From Bayes' theorem, p(X | Y) = p(Y | X) p(X) / p(Y), with p(Y | X) = 0.99 and p(X) = 0.001.
p(Y) = p(Y, X) + p(Y, ¬X) = p(Y | X) p(X) + p(Y | ¬X) p(¬X) = 0.99 · 0.001 + 0.01 · 0.999 = 0.01098.
So p(X | Y) = 0.99 · 0.001 / 0.01098 ≈ 0.0902, i.e. only about 9%.
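
The same calculation as a short Python sketch (the variable names are mine; the numbers are the ones above):

    # Bayes' theorem for the disease test: X = "has the disease", Y = "tests positive".
    p_x = 0.001             # prior: 1 in 1000 people have the disease
    p_y_given_x = 0.99      # probability of a positive test given the disease
    p_y_given_not_x = 0.01  # false positive rate (the test is 99% accurate on healthy people too)

    p_y = p_y_given_x * p_x + p_y_given_not_x * (1 - p_x)  # marginal rule
    p_x_given_y = p_y_given_x * p_x / p_y                  # Bayes' theorem
    print(p_y, p_x_given_y)                                # 0.01098, ~0.0902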

Bayes' theorem for inference. We want to find the best parameters θ for our model after observing the data D. ML overfits by using only p(D | θ); we need some way of using prior belief about the parameters. Consider p(θ | D): our belief about the parameters after observing the data.

Bayesian inference. Using Bayes' theorem,
p(θ | D) = p(D | θ) p(θ) / p(D).
Prior p(θ), likelihood p(D | θ), posterior p(θ | D).
Maximum a posteriori (MAP): θ_MAP = argmax_θ p(θ | D) = argmax_θ p(D | θ) p(θ).
Bayesian inference: find the full posterior distribution p(θ | D).
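
A minimal sketch contrasting the three estimates on a toy coin-flip (Beta-Bernoulli) model rather than the dice; the data and the Beta(2, 2) prior are illustrative choices, not from the slides:

    import numpy as np
    from scipy.stats import beta

    data = np.array([1, 1, 1, 0, 1])   # coin flips: 4 heads, 1 tail (made-up data)
    n_heads, n_tails = data.sum(), len(data) - data.sum()

    # Maximum likelihood: the empirical frequency (can overfit small samples)
    theta_ml = n_heads / len(data)

    # Beta(a0, b0) prior; with a Bernoulli likelihood the posterior is Beta again (conjugacy)
    a0, b0 = 2.0, 2.0
    a_post, b_post = a0 + n_heads, b0 + n_tails

    theta_map = (a_post - 1) / (a_post + b_post - 2)   # mode of the Beta posterior
    posterior_mean = a_post / (a_post + b_post)
    print(theta_ml, theta_map, posterior_mean)
    print(beta(a_post, b_post).interval(0.95))         # a 95% credible interval from the full posterior

With only five flips the ML estimate jumps to 0.8, while the prior pulls the MAP estimate and the posterior mean back towards 0.5, and the full posterior also tells us how uncertain we still are.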

Intractability. In our model we define the prior p(θ) and the likelihood p(D | θ). How do we find p(D)?
p(D) = ∫_θ p(D, θ) dθ = ∫_θ p(D | θ) p(θ) dθ
BUT: the space of possible values for θ is huge! Hence approximate Bayesian inference.

Latent Dirichlet Allocation. Generative process:
Draw document-to-topic distributions θ_d ~ Dir(α) (d = 1, ..., D).
Draw topic-to-word distributions φ_t ~ Dir(β) (t = 1, ..., T).
For each of the N words in each of the D documents: draw a topic from the document's topic distribution, z_{d,n} ~ Multinomial(θ_d); then draw a word from that topic's word distribution, w_{d,n} ~ Multinomial(φ_{z_{d,n}}).
Note that our model's data is the words w_{d,n} we observe, and the parameters are the θ_d and φ_t. We have placed Dirichlet priors over the parameters, with their own parameters α and β.
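
A minimal NumPy sketch of this generative process; the corpus sizes and the symmetric hyperparameter values are illustrative, not from the slides:

    import numpy as np

    rng = np.random.default_rng(0)
    D, T, V, N = 5, 3, 20, 50          # documents, topics, vocabulary size, words per document
    alpha, beta_ = 0.5, 0.1            # symmetric Dirichlet hyperparameters (illustrative)

    theta = rng.dirichlet(np.full(T, alpha), size=D)   # document-to-topic distributions
    phi = rng.dirichlet(np.full(V, beta_), size=T)     # topic-to-word distributions

    docs = np.empty((D, N), dtype=int)
    topics = np.empty((D, N), dtype=int)
    for d in range(D):
        for n in range(N):
            z = rng.choice(T, p=theta[d])    # draw a topic from the document's topic distribution
            w = rng.choice(V, p=phi[z])      # draw a word from that topic's word distribution
            topics[d, n], docs[d, n] = z, w
    print(docs[0])                           # the (synthetic) words of the first document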

Hyperparameters. In our model we have: random variables, both observed ones (like the words) and latent ones (like the topic assignments); parameters, the document-to-topic distributions θ_d and topic-to-word distributions φ_t; and hyperparameters, the parameters of the prior distributions over our parameters, namely α and β.

Conjugacy. For a specific parameter θ_i, the prior p(θ_i) is conjugate to the likelihood p(D | θ_i) if the posterior of the parameter, p(θ_i | D), is of the same family as the prior. E.g. the Dirichlet distribution is the conjugate prior for the categorical distribution.
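
For the dice example above, a Dirichlet prior plus the observed counts gives the Dirichlet posterior directly. A short sketch (the uniform Dirichlet(1, ..., 1) prior is an illustrative choice):

    import numpy as np

    alpha_prior = np.ones(6)                      # Dirichlet(1, ..., 1) prior over the six faces
    throws = [4, 6, 6, 4, 6, 3]                   # the observed throws from the unfair-dice slide
    counts = np.bincount(np.array(throws) - 1, minlength=6)

    alpha_post = alpha_prior + counts             # conjugacy: posterior is Dirichlet(alpha + counts)
    posterior_mean = alpha_post / alpha_post.sum()
    print(alpha_post)        # [1. 1. 2. 3. 1. 4.]
    print(posterior_mean)    # pulled towards uniform compared to the ML estimate (0, 0, 1/6, 2/6, 0, 3/6)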

Bayesian network. Nodes are random variables (latent or observed); arrows indicate dependencies; the distribution of a node only depends on its parents (and on things further down the network, i.e. its descendants); plates indicate repetition of variables.
(Figure: a small network over nodes A, B, C, D; from the statements below, A and B are parents of C, and D is a child of C.)
Here p(D | A, B, C) = p(D | C), BUT p(C | A, B, D) ≠ p(C | A, B).

(The same network: A and B are parents of C, D is a child of C.) Recall Bayes' theorem with an extra conditioning variable:
p(X | Y, Z) = p(Y | X, Z) p(X | Z) / p(Y | Z)
so
p(C | A, B, D) = p(D | A, B, C) p(C | A, B) / p(D | A, B)
             = p(D | A, B, C) p(C | A, B) / ∫_C p(D | A, B, C) p(C | A, B) dC
             = p(D | C) p(C | A, B) / ∫_C p(D | C) p(C | A, B) dC
using p(D | A, B) = ∫_C p(C, D | A, B) dC and the network independence p(D | A, B, C) = p(D | C).

Latent Dirichlet Allocation. Generative process:
Draw document-to-topic distributions θ_d ~ Dir(α) (d = 1, ..., D).
Draw topic-to-word distributions φ_t ~ Dir(β) (t = 1, ..., T).
For each of the N words in each of the D documents: draw a topic from the document's topic distribution, z_{d,n} ~ Multinomial(θ_d); then draw a word from that topic's word distribution, w_{d,n} ~ Multinomial(φ_{z_{d,n}}).

Latent Dirichlet Allocation. (Figure: the LDA plate diagram, from http://parkcu.com/blog/.)

Gibbs sampling. We want to approximate p(θ | D) for parameters θ = (θ_1, ..., θ_N). We cannot compute this exactly, but maybe we can draw samples from it. We can then use these samples to estimate the distribution, or to estimate its expectation and variance.

Gibbs sampling. For each parameter θ_i, write down its distribution conditional on the data and the values of all the other parameters, p(θ_i | θ_{−i}, D). If our model is conjugate, this gives closed-form expressions (meaning this distribution is of a known form, e.g. Dirichlet, so we can draw from it). Drawing new values for the parameters θ_i in turn will eventually converge to give draws from the true posterior p(θ | D). In practice we discard the first draws (burn-in) and keep only every k-th draw (thinning).
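
To make the recipe concrete, here is a Gibbs sampler for a deliberately simple toy model (a normal likelihood with conjugate priors on its mean and precision), not for LDA; the data and hyperparameters are made up:

    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.normal(loc=3.0, scale=2.0, size=100)    # made-up data from N(mean=3, sd=2)
    n = len(y)

    # Toy model: y_i ~ N(mu, 1/tau), with priors mu ~ N(mu0, 1/tau0), tau ~ Gamma(a0, b0).
    mu0, tau0, a0, b0 = 0.0, 0.01, 1.0, 1.0         # illustrative hyperparameters

    mu, tau = 0.0, 1.0                              # initial values
    samples = []
    for it in range(5000):
        # Full conditional of mu given tau (Normal, by conjugacy)
        prec = tau0 + n * tau
        mean = (tau0 * mu0 + tau * y.sum()) / prec
        mu = rng.normal(mean, 1.0 / np.sqrt(prec))

        # Full conditional of tau given mu (Gamma, by conjugacy); NumPy uses scale = 1/rate
        shape = a0 + n / 2.0
        rate = b0 + 0.5 * np.sum((y - mu) ** 2)
        tau = rng.gamma(shape, 1.0 / rate)

        if it >= 1000 and it % 5 == 0:              # burn-in and thinning
            samples.append((mu, tau))

    samples = np.array(samples)
    print(samples[:, 0].mean(), 1.0 / np.sqrt(samples[:, 1].mean()))  # roughly 3 and roughly 2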

Latent Dirichlet Allocation. We want to draw samples from p(θ, φ, z | w), where
w = {w_{d,n}}, d = 1..D, n = 1..N
z = {z_{d,n}}, d = 1..D, n = 1..N
θ = {θ_d}, d = 1..D
φ = {φ_t}, t = 1..T

Latent Dirichlet Allocation. For Gibbs sampling, we need the full conditional distributions:
p(θ_d | θ_{−d}, φ, z, w)
p(φ_t | θ, φ_{−t}, z, w)
p(z_{d,n} | θ, φ, z_{−(d,n)}, w)

Latent Dirichlet Allocation. These are relatively straightforward to derive. For example:
p(z_{d,n} | θ, φ, z_{−(d,n)}, w) = p(w | θ, φ, z) p(z_{d,n} | θ, φ, z_{−(d,n)}) / p(w | θ, φ, z_{−(d,n)})
                               ∝ p(w_{d,n} | θ, φ, z) p(z_{d,n} | θ, φ, z_{−(d,n)})
                               = p(w_{d,n} | z_{d,n}, φ_{z_{d,n}}) p(z_{d,n} | θ_d)
                               = φ_{z_{d,n}, w_{d,n}} · θ_{d, z_{d,n}}
The first step follows from Bayes' theorem, the second from the fact that the remaining terms do not depend on z_{d,n}, the third from the independences in our Bayesian network, and the fourth from our model's definition of those distributions. We then simply compute these (unnormalised) probabilities for all possible values of z_{d,n}, normalise them to sum to 1, and draw a new value with those probabilities!
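
The "compute, normalise, draw" step for a single z_{d,n} might look like the sketch below (an uncollapsed sampler using explicit θ and φ arrays; the names and sizes are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    def resample_topic(d, w_dn, theta, phi, rng):
        """Draw a new z_{d,n} from its full conditional, phi[t, w_dn] * theta[d, t] up to normalisation."""
        p = phi[:, w_dn] * theta[d, :]    # unnormalised probability for each topic t
        p = p / p.sum()                   # normalise to sum to 1
        return rng.choice(len(p), p=p)    # draw the new topic assignment

    # Illustrative distributions: T = 3 topics, V = 20 words, D = 5 documents
    theta = rng.dirichlet(np.full(3, 0.5), size=5)
    phi = rng.dirichlet(np.full(20, 0.1), size=3)
    print(resample_topic(d=0, w_dn=7, theta=theta, phi=phi, rng=rng))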

Collapsed Gibbs sampler. In practice we actually want to find p(z | w), since we can estimate the θ_d and φ_t from the topic assignments. We integrate out the other parameters; this is called a collapsed Gibbs sampler.

Variational Bayesian inference. We want to approximate p(θ | D) for parameters θ = (θ_1, ..., θ_N). We cannot compute this exactly, but maybe we can approximate it. Introduce a new distribution q(θ) over the parameters, called the variational distribution. We choose the exact form of q ourselves, giving us a set of variational parameters ν, i.e. we have q(θ | ν). We then tweak ν so that q is as similar to p as possible. We want q to be easier to compute than p; we normally achieve this by assuming each of the parameters θ_i is independent in the posterior, the mean-field assumption: q(θ | ν) = ∏_i q(θ_i | ν_i).

KL divergence. We need some way of measuring the similarity between distributions. We use the KL divergence between q and the posterior p:
D_KL(q ‖ p) = ∫_θ q(θ) log [ q(θ) / p(θ | D) ] dθ
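
A quick numerical illustration of the KL divergence for two small discrete distributions (made-up values); note that it is non-negative, zero only when the distributions are equal, and not symmetric:

    import numpy as np

    def kl(q, p):
        """D_KL(q || p) for discrete distributions given as probability vectors."""
        q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
        return float(np.sum(q * np.log(q / p)))

    q = [0.2, 0.5, 0.3]
    p = [0.1, 0.6, 0.3]
    print(kl(q, p), kl(p, q))   # two different values: the divergence is not symmetric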

ELBO. We can show that minimising D_KL(q ‖ p) is equivalent to maximising something called the Evidence Lower BOund (ELBO) L:
L = ∫_θ q(θ) log p(θ, D) dθ − ∫_θ q(θ) log q(θ) dθ = E_q[log p(θ, D)] − E_q[log q(θ)]
If we choose the precise distribution for q, we can write down this expression, and then optimise it by taking the derivative w.r.t. ν and setting it to 0, giving the variational parameter updates.
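
The equivalence follows from a short identity, written out here in LaTeX since the slide only states the result: log p(D) splits into the ELBO plus the KL divergence, and log p(D) does not depend on q, so maximising the first term is the same as minimising the second.

    \log p(D)
      = \int q(\theta) \log p(D)\, \mathrm{d}\theta
      = \int q(\theta) \log \frac{p(\theta, D)}{p(\theta \mid D)}\, \mathrm{d}\theta
      = \underbrace{\int q(\theta) \log \frac{p(\theta, D)}{q(\theta)}\, \mathrm{d}\theta}_{\mathcal{L}}
      + \underbrace{\int q(\theta) \log \frac{q(\theta)}{p(\theta \mid D)}\, \mathrm{d}\theta}_{D_{\mathrm{KL}}(q \,\|\, p)}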

Convergence. We update the variational parameters ν in turn, alternating updates until the value of the ELBO converges. After convergence, our estimate of the posterior distribution of a parameter θ_i is q(θ_i | ν_i).

Choosing q. Our choice of q determines how good our approximation to p is. If our model has conjugacy, we simply choose for q(θ_i) the same family of distribution as the full conditional we used for Gibbs sampling, p(θ_i | θ_{−i}, D); we then obtain very nice closed-form updates. In non-conjugate models we need to use gradient descent to optimise the ELBO!
