Patterns of Scalable Bayesian Inference Background (Session 1)


Jerónimo Arenas-García
Universidad Carlos III de Madrid
jeronimo.arenas@gmail.com
June 14, 2017

Motivation: Bayesian learning principles

The main property of Bayesian methods is that they account for uncertainty through the posterior

    p(θ | x) = p(x | θ) p(θ) / p(x)

Typically we have to average over the posterior. Two kinds of approximation methods:
- Monte Carlo sampling
- Variational methods

Bayesian methods in the big data scenario

As the available training data becomes increasingly large, p(θ | x) concentrates around the maximum of p(x | θ), so that θ̂_MAP ≈ θ̂_ML and

    ∫ f(θ) p(θ | x) dθ ≈ f(θ̂_MAP)

This will not be the case in many big data scenarios, where the number of parameters and the model complexity itself also grow with the available data.

When resorting to traditional Bayesian methods in big data applications we need to be aware of:
- Computation and memory issues: sequential and parallel methods
- It may be better to give up some desired properties (such as asymptotic unbiasedness) in favor of better scalability
- The tradeoff between scalability and computational complexity

The book deals with scalable methods for Monte Carlo sampling and variational methods.

Outline
1. Exponential families
2. Markov Chain Monte Carlo sampling
3. Mean field variational inference
4. Stochastic gradient optimization

The exponential families of distributions (I)

x is a random vector, x ∈ X ⊆ R^d
θ is a (random) parameter vector, θ ∈ Θ ⊆ R^m

Definition: {p(· | θ)} is an exponential family if the probability density function of the family can be written as

    p(x | θ) = h(x) exp[⟨η(θ), t(x)⟩ − log Z(η(θ))],   with ⟨η(θ), t(x)⟩ = Σ_{k=1}^K η_k(θ) t_k(x)

- η_k(θ) are the natural parameters
- t_k(x) are the natural statistics
- t_k(x) are sufficient statistics, i.e., θ ⊥ x | t_k(x)
- A(θ) = log Z(η(θ)) ensures normalization for every θ

The exponential families of distributions (II)

We assume that X does not depend on θ.
Regular family: Θ is an open set.
The family is minimal if there is no a ≠ 0 such that ⟨a, t(x)⟩ is constant.

Exercise 1: Natural parameter form for the Bernoulli distribution
Exercise 2: Natural parameter form for the Gaussian distribution
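As a hint for Exercise 1, one standard route is to rewrite the Bernoulli probability mass function in the exponential form above:

    p(x | θ) = θ^x (1 − θ)^{1−x} = exp[ x log(θ/(1−θ)) + log(1−θ) ],   x ∈ {0, 1},

so h(x) = 1, t(x) = x, η(θ) = log(θ/(1−θ)) (the log-odds), and A = log Z(η) = −log(1−θ) = log(1 + e^η). Exercise 2 proceeds analogously with t(x) = (x, x²).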

The exponential families of distributions (III)

Sampling of an exponential family distribution

If x follows an exponential family distribution with natural parameters η_k(θ) and natural statistics t_k(x), then a collection of i.i.d. samples from the distribution, x = (x_1, x_2, ..., x_n), follows an exponential family distribution with the same natural parameters and natural statistics

    t_k(x) = Σ_{i=1}^n t_k(x_i)

Again, t_k(x) are sufficient statistics for the joint distribution. This is an important property for big data scenarios.
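A minimal sketch of why this matters at scale, using the Gaussian with t(x) = (x, x²) and made-up parameter values: the data can be streamed in chunks and only the running sums of the sufficient statistics need to be kept in memory.

import numpy as np

# Stream data in chunks and accumulate the Gaussian sufficient statistics;
# their sums summarize the whole data set.
rng = np.random.default_rng(0)
n, sum_x, sum_x2 = 0, 0.0, 0.0

for _ in range(100):                          # 100 chunks, processed one at a time
    chunk = rng.normal(loc=2.0, scale=3.0, size=10_000)
    n += chunk.size
    sum_x += chunk.sum()
    sum_x2 += (chunk ** 2).sum()

# Maximum-likelihood estimates recovered from the sufficient statistics alone
mean_hat = sum_x / n
var_hat = sum_x2 / n - mean_hat ** 2
print(mean_hat, var_hat)                      # close to 2.0 and 9.0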

The role of Z(η(θ))

To ensure proper normalization of the distribution:

    ∫ p(x | θ) dx = exp[−A(η)] ∫ h(x) exp[η⊤ t(x)] dx = 1

Since A(η) = log Z(η), we have

    Z(η) = ∫ h(x) exp[η⊤ t(x)] dx,

i.e., for t(x) = x, we would have the Laplace transform of h(x).

Derivatives of A(η(θ))

    A(η) = log Z(η) = log ∫ h(x) exp[η⊤ t(x)] dx

Exercise: Show that ∇_η A(η) = E{t(x)}
Exercise: Verify that the previous result holds for the Gaussian distribution
Proposition: ∇²_η A(η) = Cov{t(x)}
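A quick numerical sanity check of ∇_η A(η) = E{t(x)}, here for the Poisson distribution, where η = log λ, t(x) = x and A(η) = e^η (a sketch with an illustrative value of η):

import numpy as np

# Poisson in natural form: eta = log(lambda), t(x) = x, A(eta) = exp(eta)
eta = 0.7
A = lambda e: np.exp(e)

# Finite-difference derivative of A at eta
eps = 1e-5
dA = (A(eta + eps) - A(eta - eps)) / (2 * eps)

# Monte Carlo estimate of E[t(x)] = E[x] under Poisson(lambda = exp(eta))
rng = np.random.default_rng(0)
samples = rng.poisson(lam=np.exp(eta), size=1_000_000)
print(dA, samples.mean())                     # both close to exp(0.7) ≈ 2.01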

Moment generating function of exponential families

For a general random variable x, its moment generating function is defined as M_x(s) = E{e^{sx}}. If it exists, it allows an easy calculation of moments, since

    E{e^{sx}} = 1 + E{sx} + E{(sx)²}/2! + E{(sx)³}/3! + ...

Therefore, ∂ᵏ_s M_x(s)|_{s=0} = E{xᵏ}.

For an exponential family, cumulants of t(x) can be obtained from

    M_T(s) = E{e^{s⊤ t(x)}} = e^{A(η+s) − A(η)}
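The last identity follows in one line from the exponential form of the density, whenever η + s is still a valid natural parameter:

    M_T(s) = E{e^{s⊤ t(x)}} = ∫ h(x) exp[(η + s)⊤ t(x) − A(η)] dx = exp[A(η + s) − A(η)],

since the integral equals Z(η + s) = exp[A(η + s)]. Hence log M_T(s) = A(η + s) − A(η) is the cumulant generating function of t(x), and its first two derivatives at s = 0 recover ∇_η A(η) = E{t(x)} and ∇²_η A(η) = Cov{t(x)}.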

Other properties

For a regular exponential family, we define the score with respect to the natural parameters as:

    v(x, η) = ∇_η log p(x | η) = ∇_η [⟨η, t(x)⟩ − log Z(η)] = t(x) − ∇_η log Z(η) = t(x) − E{t(x)}

For a regular exponential family, the Fisher information with respect to the natural parameters is

    I(η) = E{v(x, η) v⊤(x, η)} = E{(t(x) − E{t(x)})(t(x) − E{t(x)})⊤} = Cov[t(x)] = ∇²_η log Z(η)

Conjugate priors in Bayesian statistics

    p(θ | x, α′) = p(θ | α) p(x | θ) / p(x)

If p(θ | α) and p(θ | x, α′) are parametric forms of the same family, then we say that the prior p(θ | α) is conjugate with the likelihood function p(x | θ). In general, α′ = α′(α, x).

The definition is general... but if p(x | θ) is from a regular exponential family, then it is always possible to find a conjugate prior (which is also an exponential family).
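A sketch of the standard construction: for an exponential-family likelihood p(x | θ) = h(x) exp[⟨η(θ), t(x)⟩ − A(θ)], one can take (assuming it normalizes) the prior

    p(θ | α, ν) ∝ exp[⟨α, η(θ)⟩ − ν A(θ)],

so that the posterior after observing x is proportional to exp[⟨α + t(x), η(θ)⟩ − (ν + 1) A(θ)], i.e., it belongs to the same family with updated hyperparameters α′ = α + t(x) and ν′ = ν + 1.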

Conjugate priors in Bayesian statistics: an example

Dirichlet prior:

    p(θ | α) = [Γ(Σ_i α_i) / Π_i Γ(α_i)] Π_i θ_i^{α_i − 1}

Multinomial likelihood:

    p(x | θ) = [(Σ_i x_i)! / (x_1! x_2! ... x_n!)] Π_i θ_i^{x_i}

Posterior:

    p(θ | x, α) ∝ Π_i θ_i^{x_i + α_i − 1}

Therefore, the posterior is Dirichlet with parameters α_i + x_i, implying that the Dirichlet and the Multinomial are conjugate.
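A minimal numerical illustration of this update (the counts and hyperparameters below are made up for the example):

import numpy as np

alpha = np.array([1.0, 1.0, 1.0])     # Dirichlet prior hyperparameters
counts = np.array([12, 7, 3])         # observed multinomial counts x

alpha_post = alpha + counts           # conjugate update: posterior is Dirichlet(alpha + x)

# Posterior mean of theta and a few posterior samples
posterior_mean = alpha_post / alpha_post.sum()
rng = np.random.default_rng(0)
theta_samples = rng.dirichlet(alpha_post, size=5)
print(posterior_mean)
print(theta_samples)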

Conjugate pairs