Principles of Bayesian Inference


Principles of Bayesian Inference
Sudipto Banerjee (Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A.) and Andrew O. Finley (Department of Forestry & Department of Geography, Michigan State University, Lansing, Michigan, U.S.A.)
July 25, 2014

Bayesian principles

Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters as random, and thus having distributions (just like the data).

A Bayesian writes down a prior guess for the parameter(s) θ, say p(θ), and then combines it with the information provided by the observed data y to obtain the posterior distribution of θ, which we denote by p(θ | y). All statistical inferences (point and interval estimates, hypothesis tests) then follow from posterior summaries. For example, the posterior mean/median/mode offers a point estimate of θ, while the quantiles yield credible intervals.

Bayesian principles

The key to Bayesian inference is learning, or updating, of prior beliefs. Thus, posterior information ≥ prior information.

Is the classical approach wrong? That may be a controversial statement, but it is certainly fair to say that the classical approach is limited in scope. The Bayesian approach expands the class of models and easily handles repeated measures, unbalanced or missing data, nonhomogeneous variances, multivariate data, and many other settings that are precluded (or much more complicated) in classical settings.

Basics of Bayesian inference

We start with a model (likelihood) f(y | θ) for the observed data y = (y_1, ..., y_n) given unknown parameters θ (perhaps a collection of several parameters).

Add a prior distribution p(θ | λ), where λ is a vector of hyperparameters.

The posterior distribution of θ is given by
$$p(\theta \mid y, \lambda) = \frac{p(\theta \mid \lambda)\, f(y \mid \theta)}{p(y \mid \lambda)} = \frac{p(\theta \mid \lambda)\, f(y \mid \theta)}{\int f(y \mid \theta)\, p(\theta \mid \lambda)\, d\theta}.$$
We refer to this formula as Bayes' Theorem.

Basics of Bayesian inference

Calculations (numerical and algebraic) are usually required only up to a proportionality constant. We therefore write the posterior as
$$p(\theta \mid y, \lambda) \propto p(\theta \mid \lambda)\, f(y \mid \theta).$$

If λ is known/fixed, the above represents the desired posterior. If, however, λ is unknown, we assign it a prior p(λ) and seek
$$p(\theta, \lambda \mid y) \propto p(\lambda)\, p(\theta \mid \lambda)\, f(y \mid \theta).$$
The proportionality constant does not depend upon θ or λ:
$$\frac{1}{p(y)} = \frac{1}{\int\!\!\int p(\lambda)\, p(\theta \mid \lambda)\, f(y \mid \theta)\, d\lambda\, d\theta}.$$
The above represents a joint posterior from a hierarchical model. The marginal posterior distribution for θ is
$$p(\theta \mid y) = \int p(\theta, \lambda \mid y)\, d\lambda \propto \int p(\lambda)\, p(\theta \mid \lambda)\, f(y \mid \theta)\, d\lambda.$$
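
Since the posterior is only needed up to its normalizing constant, it can be checked numerically with a simple grid approximation: evaluate prior × likelihood over a grid of θ values and normalize. The sketch below is not part of the original slides; it assumes a single normal observation with a normal prior (anticipating the example later in the deck), and the values of y, σ, µ, and τ are arbitrary.

# Grid approximation of p(theta | y), working only up to a proportionality constant
y     <- 1.5      # observed data point (assumed)
sigma <- 1        # known data standard deviation (assumed)
mu    <- 0        # prior mean (assumed)
tau   <- 2        # prior standard deviation (assumed)

theta.grid <- seq(-4, 6, length.out = 2000)
step       <- diff(theta.grid)[1]
unnorm     <- dnorm(theta.grid, mu, tau) * dnorm(y, theta.grid, sigma)  # prior x likelihood
post       <- unnorm / sum(unnorm * step)                               # normalize numerically

sum(theta.grid * post * step)                   # numerical posterior mean
(sigma^2 * mu + tau^2 * y) / (sigma^2 + tau^2)  # closed-form posterior mean (derived on later slides)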

Bayesian inference: point estimation

Point estimation is easy: simply choose an appropriate summary of the posterior distribution, such as its mean, median or mode.

Mode: sometimes easy to compute (no integration, simply optimization), but it often misrepresents the "middle" of the distribution, especially for one-tailed distributions.

Mean: easy to compute, but it has the opposite problem of the mode: it chases tails.

Median: probably the best compromise in being robust to tail behaviour, although it may be awkward to compute, as it solves
$$\int_{-\infty}^{\theta_{\mathrm{median}}} p(\theta \mid y)\, d\theta = \frac{1}{2}.$$
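
A minimal sketch of these summaries computed from posterior draws (the draws vector here is simulated purely for illustration; the density-peak estimate of the mode is one convenient approximation among several):

# Posterior summaries from a vector of posterior draws
draws <- rbeta(10000, 3, 7)     # stand-in posterior samples (assumed example)
mean(draws)                     # posterior mean
median(draws)                   # posterior median
d <- density(draws)             # kernel density estimate
d$x[which.max(d$y)]             # approximate posterior mode (density peak)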

Bayesian inference: interval estimation

The most popular method of inference in practical Bayesian modelling is interval estimation using credible sets. A 100(1 − α)% credible set C for θ is a set that satisfies
$$P(\theta \in C \mid y) = \int_{C} p(\theta \mid y)\, d\theta \ge 1 - \alpha.$$
The most popular credible set is the simple equal-tail interval estimate (q_L, q_U) such that
$$\int_{-\infty}^{q_L} p(\theta \mid y)\, d\theta = \frac{\alpha}{2} = \int_{q_U}^{\infty} p(\theta \mid y)\, d\theta.$$
Then clearly P(θ ∈ (q_L, q_U) | y) = 1 − α. This interval is relatively easy to compute and has a direct interpretation: the probability that θ lies between q_L and q_U is 1 − α. The frequentist interpretation is extremely convoluted.
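
From posterior draws, the equal-tail interval is just a pair of sample quantiles. A small self-contained sketch (the draws are again simulated stand-ins):

# Equal-tail credible interval from posterior draws
draws <- rbeta(10000, 3, 7)                            # stand-in posterior samples (assumed)
alpha <- 0.05
quantile(draws, probs = c(alpha / 2, 1 - alpha / 2))   # 95% equal-tail credible interval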

A simple example: Normal data and normal priors

Example: consider a single data point y from a normal distribution, y ~ N(θ, σ²), with σ known:
$$f(y \mid \theta) = N(y \mid \theta, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{1}{2\sigma^2}(y - \theta)^2\right).$$
Now set the prior θ ~ N(µ, τ²), i.e. p(θ) = N(θ | µ, τ²), where µ and τ² are known.

The posterior distribution of θ is
$$p(\theta \mid y) \propto N(\theta \mid \mu, \tau^2)\, N(y \mid \theta, \sigma^2),$$
which gives
$$p(\theta \mid y) = N\!\left(\theta \,\middle|\, \frac{\frac{1}{\tau^2}\mu + \frac{1}{\sigma^2}y}{\frac{1}{\tau^2} + \frac{1}{\sigma^2}},\; \left(\frac{1}{\tau^2} + \frac{1}{\sigma^2}\right)^{-1}\right)
= N\!\left(\theta \,\middle|\, \frac{\sigma^2}{\sigma^2 + \tau^2}\mu + \frac{\tau^2}{\sigma^2 + \tau^2}y,\; \frac{\sigma^2\tau^2}{\sigma^2 + \tau^2}\right).$$

A simple example: Normal data and normal priors

Interpretation: the posterior mean is a weighted mean of the prior mean and the data point; the direct estimate is shrunk towards the prior.

What if you had n observations instead of one in the earlier setup? Say y = (y_1, ..., y_n), iid, where y_i ~ N(θ, σ²). Then ȳ is a sufficient statistic for θ, and ȳ ~ N(θ, σ²/n).

The posterior distribution of θ is
$$p(\theta \mid y) \propto N(\theta \mid \mu, \tau^2)\, N\!\left(\bar{y} \,\middle|\, \theta, \frac{\sigma^2}{n}\right),$$
which gives
$$p(\theta \mid y) = N\!\left(\theta \,\middle|\, \frac{\frac{1}{\tau^2}\mu + \frac{n}{\sigma^2}\bar{y}}{\frac{1}{\tau^2} + \frac{n}{\sigma^2}},\; \left(\frac{1}{\tau^2} + \frac{n}{\sigma^2}\right)^{-1}\right)
= N\!\left(\theta \,\middle|\, \frac{\sigma^2}{\sigma^2 + n\tau^2}\mu + \frac{n\tau^2}{\sigma^2 + n\tau^2}\bar{y},\; \frac{\sigma^2\tau^2}{\sigma^2 + n\tau^2}\right).$$
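
A small numerical sketch of this closed form in R; the data are simulated and µ, τ, σ are arbitrary assumed values, chosen only to show the shrinkage of ȳ towards the prior mean:

# Closed-form normal-normal posterior with n observations
set.seed(1)
n     <- 20
sigma <- 2                          # known data sd (assumed)
mu    <- 0;  tau <- 1               # prior mean and sd (assumed)
y     <- rnorm(n, mean = 1.5, sd = sigma)
ybar  <- mean(y)

w         <- (n * tau^2) / (sigma^2 + n * tau^2)      # weight on the data
post.mean <- (1 - w) * mu + w * ybar                   # ybar shrunk towards mu
post.var  <- (sigma^2 * tau^2) / (sigma^2 + n * tau^2)
c(ybar = ybar, post.mean = post.mean, post.sd = sqrt(post.var))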

Another simple example: the Beta-Binomial model

Example: let Y be the number of successes in n independent trials, so
$$P(Y = y \mid \theta) = f(y \mid \theta) = \binom{n}{y}\, \theta^{y} (1 - \theta)^{n - y}.$$
Prior: p(θ) = Beta(θ | a, b), i.e. p(θ) ∝ θ^(a−1) (1 − θ)^(b−1), with prior mean µ = a/(a + b) and prior variance ab/((a + b)²(a + b + 1)).

The posterior distribution of θ is
$$p(\theta \mid y) = \mathrm{Beta}(\theta \mid a + y,\; b + n - y).$$
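
A quick numerical sketch of this update in R; the prior parameters and the data are assumed values chosen only for illustration:

# Beta-Binomial posterior: Beta(a + y, b + n - y)
a <- 1; b <- 1                          # assumed Beta(1, 1) prior
n <- 20; y <- 14                        # assumed data: 14 successes in 20 trials
a.post <- a + y; b.post <- b + n - y

a.post / (a.post + b.post)              # posterior mean
qbeta(c(0.025, 0.975), a.post, b.post)  # 95% equal-tail credible interval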

Sampling-based inference

In practice, we will compute the posterior distribution p(θ | y) by drawing samples from it. This replaces numerical integration (quadrature) by Monte Carlo integration. One important advantage: we only need to know p(θ | y) up to the proportionality constant.

Suppose θ = (θ_1, θ_2) and we know how to sample from the marginal posterior distribution p(θ_2 | y) and the conditional distribution p(θ_1 | θ_2, y). How do we draw samples from the joint distribution p(θ_1, θ_2 | y)?

Sampling-based inference

We do this in two stages using composition sampling: first draw $\theta_2^{(j)} \sim p(\theta_2 \mid y)$ for $j = 1, \ldots, M$; next draw $\theta_1^{(j)} \sim p(\theta_1 \mid \theta_2^{(j)}, y)$.

This sampling scheme produces exact samples $\{\theta_1^{(j)}, \theta_2^{(j)}\}_{j=1}^{M}$ from the posterior distribution p(θ_1, θ_2 | y). Gelfand and Smith (JASA, 1990) demonstrated automatic marginalization: $\{\theta_1^{(j)}\}_{j=1}^{M}$ are samples from p(θ_1 | y) and (of course!) $\{\theta_2^{(j)}\}_{j=1}^{M}$ are samples from p(θ_2 | y). In effect, composition sampling has performed the integration
$$p(\theta_1 \mid y) = \int p(\theta_1 \mid \theta_2, y)\, p(\theta_2 \mid y)\, d\theta_2.$$
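
A minimal sketch of composition sampling in R. The model is an illustrative assumption (it is not the example in the slides): y_i ~ N(θ, σ²) with the standard noninformative prior p(θ, σ²) ∝ 1/σ², under which σ² | y is scaled inverse-chi-squared and θ | σ², y ~ N(ȳ, σ²/n):

# Composition sampling: sigma2 from its marginal posterior, then theta given sigma2
set.seed(2)
y  <- rnorm(30, mean = 5, sd = 2)       # simulated data for illustration
n  <- length(y); ybar <- mean(y); s2 <- var(y)
M  <- 5000

sigma2 <- (n - 1) * s2 / rchisq(M, df = n - 1)          # sigma2^(j) ~ p(sigma2 | y)
theta  <- rnorm(M, mean = ybar, sd = sqrt(sigma2 / n))  # theta^(j) ~ p(theta | sigma2^(j), y)

# (theta, sigma2) pairs are exact joint draws; each vector alone is a marginal posterior sample
quantile(theta,  c(0.025, 0.5, 0.975))
quantile(sigma2, c(0.025, 0.5, 0.975))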

Bayesian predictions

Suppose we want to predict new observations, say ỹ, based upon the observed data y. Bayesian predictions follow from the posterior predictive distribution, which averages the conditional predictive distribution over the posterior of θ:
$$p(\tilde{y} \mid y) = \int p(\tilde{y} \mid y, \theta)\, p(\theta \mid y)\, d\theta.$$
This can be evaluated using composition sampling: first obtain $\theta^{(j)} \sim p(\theta \mid y)$ for $j = 1, \ldots, M$; then, for $j = 1, \ldots, M$, sample $\tilde{y}^{(j)} \sim p(\tilde{y} \mid y, \theta^{(j)})$. The $\{\tilde{y}^{(j)}\}_{j=1}^{M}$ are samples from the posterior predictive distribution p(ỹ | y).
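
A short sketch of posterior predictive sampling for the Beta-Binomial example above; the prior values, the observed data, and the number of new trials are assumed for illustration:

# Posterior predictive draws for the Beta-Binomial model via composition sampling
a <- 1; b <- 1                       # assumed Beta(1, 1) prior
n <- 20; y <- 14                     # assumed data: 14 successes in 20 trials
n.new <- 10                          # number of new trials to predict (assumed)
M <- 5000

theta <- rbeta(M, a + y, b + n - y)              # theta^(j) ~ p(theta | y)
y.new <- rbinom(M, size = n.new, prob = theta)   # ynew^(j) ~ p(ynew | theta^(j))

table(y.new) / M                     # Monte Carlo estimate of the predictive distribution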

Some remarks on sampling-based inference

Direct Monte Carlo: some algorithms (e.g. composition sampling) can generate independent samples exactly from the posterior distribution. In these situations there are NO convergence problems or issues; the sampling is called exact.

Markov chain Monte Carlo (MCMC): in general, exact sampling may not be possible or feasible. MCMC is a far more versatile set of algorithms that can be invoked to fit more general models. Note: wherever direct Monte Carlo applies, MCMC will provide excellent results too.

Some remarks on sampling-based inference

Convergence issues: there is no free lunch! The power of MCMC comes at a cost. The initial samples do not necessarily come from the desired posterior distribution; rather, the chain needs to converge to the true posterior distribution. Therefore, one needs to assess convergence, discard the output produced before convergence, and retain only the post-convergence samples. The initial period before convergence is called the burn-in.

Diagnosing convergence: usually a few parallel chains are run from rather different starting points. The sampled values are plotted against iteration (trace plots) for each chain, and the time the chains take to mix together is taken as the time to convergence.

Good news: many modeling frameworks are automated in freely available software, so, as users, we need only concern ourselves with specifying good Bayesian models.
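
A compact sketch of these ideas: a random-walk Metropolis sampler for the Beta-Binomial posterior used earlier, run as three parallel chains from dispersed starting points, with base-R trace plots. The sampler, proposal scale, chain settings, and burn-in length are illustrative assumptions, not part of the original slides.

# Random-walk Metropolis for p(theta | y) ∝ theta^(a+y-1) * (1-theta)^(b+n-y-1)
log.post <- function(theta, y = 14, n = 20, a = 1, b = 1) {
  if (theta <= 0 || theta >= 1) return(-Inf)
  (a + y - 1) * log(theta) + (b + n - y - 1) * log(1 - theta)
}

run.chain <- function(start, iters = 2000, prop.sd = 0.1) {
  theta <- numeric(iters); theta[1] <- start
  for (t in 2:iters) {
    cand  <- rnorm(1, theta[t - 1], prop.sd)              # random-walk proposal
    log.r <- log.post(cand) - log.post(theta[t - 1])      # log acceptance ratio
    theta[t] <- if (log(runif(1)) < log.r) cand else theta[t - 1]
  }
  theta
}

set.seed(3)
chains <- sapply(c(0.05, 0.5, 0.95), run.chain)  # three chains from dispersed starts
matplot(chains, type = "l", lty = 1,
        xlab = "iteration", ylab = expression(theta))     # trace plots
burn <- 500                                      # discard burn-in, pool the rest
quantile(chains[-(1:burn), ], c(0.025, 0.5, 0.975))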

Some remarks on sampling-based inference

Find a wide variety of R packages dealing with Bayesian inference in the CRAN task view: http://cran.r-project.org/web/views/Bayesian.html
Here's a nice rant on "Why I love R": http://www.sr.bham.ac.uk/~ajrs/talks/why_i_love_r.pdf