Principles of Bayesian Inference

Sudipto Banerjee, University of Minnesota, July 20th, 2008

Bayesian Principles

Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters as random, and thus as having distributions (just like the data). We can therefore think about unknowns for which no reliable frequentist experiment exists, e.g. $\theta$ = proportion of US men with untreated prostate cancer. A Bayesian writes down a prior guess for the parameter(s) $\theta$, say $p(\theta)$, and then combines this with the information provided by the observed data $y$ to obtain the posterior distribution of $\theta$, which we denote by $p(\theta \mid y)$. All statistical inferences (point and interval estimates, hypothesis tests) then follow from posterior summaries. For example, the posterior mean/median/mode offers a point estimate of $\theta$, while the posterior quantiles yield credible intervals.

Bayesian Principles

The key to Bayesian inference is the learning, or updating, of prior beliefs; thus, posterior information $\geq$ prior information. Is the classical approach wrong? That may be a controversial statement, but it is certainly fair to say that the classical approach is limited in scope. The Bayesian approach expands the class of models and easily handles:

repeated measures
unbalanced or missing data
nonhomogeneous variances
multivariate data

and many other settings that are precluded (or much more complicated) in classical settings.

Basics of Bayesian Inference

We start with a model (likelihood) $f(y \mid \theta)$ for the observed data $y = (y_1, \ldots, y_n)$ given unknown parameters $\theta$ (perhaps a collection of several parameters). Add a prior distribution $p(\theta \mid \lambda)$, where $\lambda$ is a vector of hyper-parameters. The posterior distribution of $\theta$ is given by
$$p(\theta \mid y, \lambda) = \frac{p(\theta \mid \lambda)\, f(y \mid \theta)}{p(y \mid \lambda)} = \frac{p(\theta \mid \lambda)\, f(y \mid \theta)}{\int f(y \mid \theta)\, p(\theta \mid \lambda)\, d\theta}.$$
We refer to this formula as Bayes' Theorem.

Basics of Bayesian Inference

Calculations (numerical and algebraic) are usually required only up to a proportionality constant. We therefore write the posterior as
$$p(\theta \mid y, \lambda) \propto p(\theta \mid \lambda)\, f(y \mid \theta).$$
If $\lambda$ is known/fixed, the above represents the desired posterior. If, however, $\lambda$ is unknown, we assign it a prior, $p(\lambda)$, and seek
$$p(\theta, \lambda \mid y) \propto p(\lambda)\, p(\theta \mid \lambda)\, f(y \mid \theta).$$
The proportionality constant does not depend upon $\theta$ or $\lambda$:
$$\frac{1}{p(y)} = \frac{1}{\int\!\!\int p(\lambda)\, p(\theta \mid \lambda)\, f(y \mid \theta)\, d\lambda\, d\theta}.$$
The above represents a joint posterior from a hierarchical model. The marginal posterior distribution for $\theta$ is
$$p(\theta \mid y) \propto \int p(\lambda)\, p(\theta \mid \lambda)\, f(y \mid \theta)\, d\lambda.$$
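The fact that the normalizing constant is never needed analytically can be checked with a small grid approximation. The sketch below is a hypothetical example, not from the slides: a $N(0,1)$ prior with a $N(\theta, 1)$ likelihood and a single observation $y = 2$. We tabulate only the unnormalized product prior $\times$ likelihood and recover the constant numerically.

```python
import math

def norm_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

y = 2.0
grid = [i * 0.01 for i in range(-500, 901)]   # theta values from -5 to 9

# Work only with the unnormalized product p(theta) * f(y | theta)
unnorm = [norm_pdf(th, 0.0, 1.0) * norm_pdf(y, th, 1.0) for th in grid]
Z = sum(unnorm) * 0.01                        # numerical estimate of p(y)
post = [u / Z for u in unnorm]                # normalized posterior on the grid
post_mean = sum(th * p for th, p in zip(grid, post)) * 0.01
```

With these illustrative choices the exact posterior is $N(1, 1/2)$, so the grid-based posterior mean comes out very close to 1, and $Z$ matches the exact marginal density $N(2 \mid 0, 2)$.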

Bayesian Inference: Point Estimation

Point estimation is easy: simply choose an appropriate summary of the posterior distribution: its mean, median, or mode.

Mode: sometimes easy to compute (no integration, simply optimization), but it often misrepresents the middle of the distribution, especially for one-tailed distributions.

Mean: easy to compute, but it has the opposite problem of the mode: it chases tails.

Median: probably the best compromise, being robust to tail behaviour, although it may be awkward to compute as it solves
$$\int_{-\infty}^{\theta_{\mathrm{median}}} p(\theta \mid y)\, d\theta = \frac{1}{2}.$$

Bayesian Inference: Interval Estimation

The most popular method of inference in practical Bayesian modelling is interval estimation using credible sets. A $100(1-\alpha)\%$ credible set $C$ for $\theta$ is a set that satisfies
$$P(\theta \in C \mid y) = \int_C p(\theta \mid y)\, d\theta \geq 1 - \alpha.$$
The most popular credible set is the simple equal-tail interval estimate $(q_L, q_U)$ such that
$$\int_{-\infty}^{q_L} p(\theta \mid y)\, d\theta = \frac{\alpha}{2} = \int_{q_U}^{\infty} p(\theta \mid y)\, d\theta.$$
Then clearly $P(\theta \in (q_L, q_U) \mid y) = 1 - \alpha$. This interval is relatively easy to compute and has a direct interpretation: the probability that $\theta$ lies between $q_L$ and $q_U$ is $1 - \alpha$. The corresponding frequentist interpretation is extremely convoluted.
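When the posterior is represented by Monte Carlo draws (as in the sampling-based methods later in these notes), the equal-tail interval is just a pair of empirical quantiles. A minimal sketch, assuming we already have exact draws from a hypothetical $N(4, 1)$ posterior:

```python
import random

random.seed(1)
# Hypothetical running example: exact draws from a N(4, 1) posterior
samples = sorted(random.gauss(4.0, 1.0) for _ in range(100000))

def equal_tail_interval(sorted_samples, alpha):
    """Equal-tail 100(1 - alpha)% interval from sorted posterior draws."""
    n = len(sorted_samples)
    q_L = sorted_samples[int(alpha / 2 * n)]
    q_U = sorted_samples[int((1 - alpha / 2) * n) - 1]
    return q_L, q_U

q_L, q_U = equal_tail_interval(samples, 0.05)
# For N(4, 1) the exact 95% equal-tail interval is (4 - 1.96, 4 + 1.96)
```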

A simple example: Normal data and normal priors

Example: Consider a single data point $y$ from a Normal distribution: $y \sim N(\theta, \sigma^2)$; assume $\sigma$ is known, so that
$$f(y \mid \theta) = N(y \mid \theta, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{1}{2\sigma^2}(y - \theta)^2\right).$$
Take the prior $\theta \sim N(\mu, \tau^2)$, i.e. $p(\theta) = N(\theta \mid \mu, \tau^2)$, with $\mu$ and $\tau^2$ known. The posterior distribution of $\theta$ is
$$p(\theta \mid y) \propto N(\theta \mid \mu, \tau^2) \times N(y \mid \theta, \sigma^2)$$
$$= N\!\left(\theta \,\middle|\, \frac{\frac{1}{\tau^2}\mu + \frac{1}{\sigma^2}y}{\frac{1}{\sigma^2} + \frac{1}{\tau^2}},\ \left(\frac{1}{\sigma^2} + \frac{1}{\tau^2}\right)^{-1}\right)$$
$$= N\!\left(\theta \,\middle|\, \frac{\sigma^2}{\sigma^2 + \tau^2}\mu + \frac{\tau^2}{\sigma^2 + \tau^2}y,\ \frac{\sigma^2\tau^2}{\sigma^2 + \tau^2}\right).$$

A simple example: Normal data and normal priors

Interpretation: the posterior mean is a weighted mean of the prior mean and the data point; the direct estimate is shrunk towards the prior.

What if you had $n$ observations instead of one in the earlier set-up? Say $y = (y_1, \ldots, y_n)$ iid, where $y_i \sim N(\theta, \sigma^2)$. Then $\bar{y}$ is a sufficient statistic for $\theta$, with $\bar{y} \sim N\!\left(\theta, \frac{\sigma^2}{n}\right)$. The posterior distribution of $\theta$ is
$$p(\theta \mid y) \propto N(\theta \mid \mu, \tau^2) \times N\!\left(\bar{y} \,\middle|\, \theta, \frac{\sigma^2}{n}\right)$$
$$= N\!\left(\theta \,\middle|\, \frac{\frac{1}{\tau^2}\mu + \frac{n}{\sigma^2}\bar{y}}{\frac{n}{\sigma^2} + \frac{1}{\tau^2}},\ \left(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\right)^{-1}\right)$$
$$= N\!\left(\theta \,\middle|\, \frac{\sigma^2}{\sigma^2 + n\tau^2}\mu + \frac{n\tau^2}{\sigma^2 + n\tau^2}\bar{y},\ \frac{\sigma^2\tau^2}{\sigma^2 + n\tau^2}\right).$$

Prior to posterior learning

[Figure: prior density together with posterior densities for n = 1 and n = 10.]

Example: $\mu = 2$, $\bar{y} = 6$, $\tau^2 = \sigma^2 = 1$. When $n = 1$, the prior and the likelihood receive equal weight, so the posterior mean is $\frac{2 + 6}{2} = 4$. When $n = 10$, the data dominate the prior and the posterior mean approaches $\bar{y}$. The posterior variance also shrinks as $n$ gets larger; the posterior eventually collapses to a point mass at $\bar{y}$.
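The conjugate update is one line of algebra to code. The sketch below (the helper name `normal_posterior` is mine, not from the slides) reproduces the weighted-average form of the posterior mean and the example numbers above:

```python
def normal_posterior(ybar, n, sigma2, mu, tau2):
    """Posterior of theta when y_1..y_n ~ N(theta, sigma2) and theta ~ N(mu, tau2)."""
    w = n * tau2 / (sigma2 + n * tau2)           # weight on the data mean ybar
    post_mean = (1 - w) * mu + w * ybar          # ybar is shrunk towards the prior mean mu
    post_var = sigma2 * tau2 / (sigma2 + n * tau2)
    return post_mean, post_var

# The slides' example: mu = 2, ybar = 6, sigma2 = tau2 = 1
m1, v1 = normal_posterior(6.0, 1, 1.0, 2.0, 1.0)     # n = 1: equal weights, mean 4
m10, v10 = normal_posterior(6.0, 10, 1.0, 2.0, 1.0)  # n = 10: data dominate, smaller variance
```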

A simple exercise

Consider the problem of estimating the current weight of a group of people. A sample of 10 people was taken and their average weight was calculated as $\bar{y} = 176$ lbs. Assume that the population standard deviation is known to be $\sigma = 3$. Assuming that the data $y_1, \ldots, y_{10}$ came from a $N(\theta, \sigma^2)$ population, perform the following:

Obtain a 95% confidence interval for $\theta$ using classical methods.

Assume a prior distribution for $\theta$ of the form $N(\mu, \tau^2)$. Obtain 95% posterior credible intervals for $\theta$ for each of the cases: (a) $\mu = 176$, $\tau = 8$; (b) $\mu = 176$, $\tau = 1000$; (c) $\mu = 0$, $\tau = 1000$. Which case gives results closest to those obtained with the classical method?
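A short sketch one could use to check the answers numerically, reusing the closed-form normal-normal posterior from the previous slides (plain Python; the loop labels a/b/c match the cases above):

```python
import math

ybar, n, sigma = 176.0, 10, 3.0
z = 1.959963984540054              # 97.5th percentile of the standard normal

# Classical 95% confidence interval: ybar +/- z * sigma / sqrt(n)
classical = (ybar - z * sigma / math.sqrt(n), ybar + z * sigma / math.sqrt(n))

# 95% posterior credible intervals for the three prior settings
intervals = {}
for label, mu, tau in [("a", 176.0, 8.0), ("b", 176.0, 1000.0), ("c", 0.0, 1000.0)]:
    prec = n / sigma**2 + 1 / tau**2                 # posterior precision
    m = (n * ybar / sigma**2 + mu / tau**2) / prec   # posterior mean
    s = math.sqrt(1 / prec)                          # posterior standard deviation
    intervals[label] = (m - z * s, m + z * s)
```

With the very diffuse priors (b) and (c), the credible intervals essentially coincide with the classical interval, while the informative prior (a) gives a slightly narrower interval.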

Another simple example: the Beta-Binomial model

Example: Let $Y$ be the number of successes in $n$ independent trials, so
$$P(Y = y \mid \theta) = f(y \mid \theta) = \binom{n}{y} \theta^y (1-\theta)^{n-y}.$$
Prior: $p(\theta) = Beta(\theta \mid a, b)$, i.e. $p(\theta) \propto \theta^{a-1}(1-\theta)^{b-1}$, with prior mean $\mu = a/(a+b)$ and variance $ab/((a+b)^2(a+b+1))$. The posterior distribution of $\theta$ is
$$p(\theta \mid y) = Beta(\theta \mid a + y,\ b + n - y).$$
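Because the Beta prior is conjugate to the binomial likelihood, the posterior update is pure bookkeeping. A minimal sketch (the data of 7 successes in 10 trials and the uniform $Beta(1,1)$ prior are illustrative choices, not from the slides):

```python
def beta_binomial_update(a, b, y, n):
    """Beta(a, b) prior + y successes in n trials -> Beta(a + y, b + n - y) posterior."""
    return a + y, b + n - y

a_post, b_post = beta_binomial_update(1, 1, 7, 10)  # uniform prior, 7 successes in 10 trials
post_mean = a_post / (a_post + b_post)              # posterior mean a/(a + b)
```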

Sampling-based inference

We will compute the posterior distribution $p(\theta \mid y)$ by drawing samples from it. This replaces numerical integration (quadrature) by Monte Carlo integration. One important advantage: we only need to know $p(\theta \mid y)$ up to the proportionality constant.

Suppose $\theta = (\theta_1, \theta_2)$ and we know how to sample from the marginal posterior distribution $p(\theta_2 \mid y)$ and the conditional distribution $p(\theta_1 \mid \theta_2, y)$. How do we draw samples from the joint distribution $p(\theta_1, \theta_2 \mid y)$?

Sampling-based inference

We do this in two stages using composition sampling:

First draw $\theta_2^{(j)} \sim p(\theta_2 \mid y)$, for $j = 1, \ldots, M$.
Next draw $\theta_1^{(j)} \sim p(\theta_1 \mid \theta_2^{(j)}, y)$.

This sampling scheme produces exact samples $\{\theta_1^{(j)}, \theta_2^{(j)}\}_{j=1}^{M}$ from the posterior distribution $p(\theta_1, \theta_2 \mid y)$. Gelfand and Smith (JASA, 1990) demonstrated automatic marginalization: $\{\theta_1^{(j)}\}_{j=1}^{M}$ are samples from $p(\theta_1 \mid y)$ and (of course!) $\{\theta_2^{(j)}\}_{j=1}^{M}$ are samples from $p(\theta_2 \mid y)$. In effect, composition sampling has performed the integration
$$p(\theta_1 \mid y) = \int p(\theta_1 \mid \theta_2, y)\, p(\theta_2 \mid y)\, d\theta_2.$$
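A toy numerical check of composition sampling and automatic marginalization, using hypothetical stand-ins $p(\theta_2 \mid y) = N(0, 1)$ and $p(\theta_1 \mid \theta_2, y) = N(\theta_2, 1)$, so that the implied marginal $p(\theta_1 \mid y)$ is $N(0, 2)$:

```python
import random
import statistics

random.seed(42)
M = 200000
theta2 = [random.gauss(0.0, 1.0) for _ in range(M)]  # stage 1: theta2^(j) ~ p(theta2 | y)
theta1 = [random.gauss(t2, 1.0) for t2 in theta2]    # stage 2: theta1^(j) ~ p(theta1 | theta2^(j), y)

# Automatic marginalization: the theta1 draws should follow N(0, 2)
mean1 = statistics.fmean(theta1)
var1 = statistics.pvariance(theta1)
```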

Bayesian predictions

Suppose we want to predict new observations, say $\tilde{y}$, based upon the observed data $y$. We specify a joint probability model $p(\tilde{y}, y, \theta)$, which defines the conditional predictive distribution
$$p(\tilde{y} \mid y, \theta) = \frac{p(\tilde{y}, y, \theta)}{p(y, \theta)}.$$
Bayesian predictions follow from the posterior predictive distribution, which averages $\theta$ out of the conditional predictive distribution with respect to the posterior:
$$p(\tilde{y} \mid y) = \int p(\tilde{y} \mid y, \theta)\, p(\theta \mid y)\, d\theta.$$
This can be evaluated using composition sampling:

First obtain $\theta^{(j)} \sim p(\theta \mid y)$, for $j = 1, \ldots, M$.
For $j = 1, \ldots, M$, sample $\tilde{y}^{(j)} \sim p(\tilde{y} \mid y, \theta^{(j)})$.

The $\{\tilde{y}^{(j)}\}_{j=1}^{M}$ are samples from the posterior predictive distribution $p(\tilde{y} \mid y)$.
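Continuing the normal-normal example, a sketch of posterior predictive sampling by composition. The values of `m`, `v` and `sigma2` below are illustrative choices for the posterior mean, posterior variance and likelihood variance; the exact predictive distribution is then $N(m, v + \sigma^2)$.

```python
import random
import statistics

random.seed(0)
M = 200000
m, v, sigma2 = 4.0, 0.5, 1.0                             # illustrative values

theta = [random.gauss(m, v ** 0.5) for _ in range(M)]    # theta^(j) ~ p(theta | y)
ypred = [random.gauss(t, sigma2 ** 0.5) for t in theta]  # ytilde^(j) ~ p(ytilde | theta^(j))

pred_mean = statistics.fmean(ypred)                      # should be close to m
pred_var = statistics.pvariance(ypred)                   # should be close to v + sigma2
```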

The Gibbs sampler

Suppose $\theta = (\theta_1, \theta_2)$ and we seek the posterior distribution $p(\theta_1, \theta_2 \mid y)$. For many interesting hierarchical models, we have access to the full conditional distributions $p(\theta_1 \mid \theta_2, y)$ and $p(\theta_2 \mid \theta_1, y)$. The Gibbs sampler proposes the following sampling scheme:

Set starting values $\theta^{(0)} = (\theta_1^{(0)}, \theta_2^{(0)})$.
For $j = 1, \ldots, M$:
draw $\theta_1^{(j)} \sim p(\theta_1 \mid \theta_2^{(j-1)}, y)$;
draw $\theta_2^{(j)} \sim p(\theta_2 \mid \theta_1^{(j)}, y)$.

This constructs a Markov chain and, after an initial burn-in period while the chains find their way, the algorithm guarantees that $\{\theta_1^{(j)}, \theta_2^{(j)}\}_{j=M_0+1}^{M}$ are samples from $p(\theta_1, \theta_2 \mid y)$, where $M_0$ is the burn-in period.
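A minimal two-block Gibbs sampler for a target where both full conditionals are known in closed form: a standard bivariate normal with correlation $\rho$, for which $\theta_1 \mid \theta_2 \sim N(\rho\theta_2, 1-\rho^2)$ and $\theta_2 \mid \theta_1 \sim N(\rho\theta_1, 1-\rho^2)$. (This target is chosen purely for illustration; it is not a posterior from the slides.)

```python
import random
import statistics

random.seed(7)
rho = 0.6
M, burn = 50000, 1000
t1, t2 = 0.0, 0.0                 # starting values theta^(0)
draws1 = []
for j in range(M):
    t1 = random.gauss(rho * t2, (1.0 - rho ** 2) ** 0.5)  # draw theta1 | theta2
    t2 = random.gauss(rho * t1, (1.0 - rho ** 2) ** 0.5)  # draw theta2 | theta1
    if j >= burn:                 # discard the burn-in period
        draws1.append(t1)

# The retained theta1 draws approximate its N(0, 1) marginal
mean1 = statistics.fmean(draws1)
var1 = statistics.pvariance(draws1)
```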

The Gibbs sampler

More generally, if $\theta = (\theta_1, \ldots, \theta_p)$ are the parameters in our model, we provide a set of initial values $\theta^{(0)} = (\theta_1^{(0)}, \ldots, \theta_p^{(0)})$ and then perform the $j$-th iteration, for $j = 1, \ldots, M$, by updating successively from the full conditional distributions:
$$\theta_1^{(j)} \sim p(\theta_1 \mid \theta_2^{(j-1)}, \ldots, \theta_p^{(j-1)}, y)$$
$$\theta_2^{(j)} \sim p(\theta_2 \mid \theta_1^{(j)}, \theta_3^{(j-1)}, \ldots, \theta_p^{(j-1)}, y)$$
$$\vdots$$
$$\theta_k^{(j)} \sim p(\theta_k \mid \theta_1^{(j)}, \ldots, \theta_{k-1}^{(j)}, \theta_{k+1}^{(j-1)}, \ldots, \theta_p^{(j-1)}, y) \quad \text{(the generic $k$-th element)}$$
$$\theta_p^{(j)} \sim p(\theta_p \mid \theta_1^{(j)}, \ldots, \theta_{p-1}^{(j)}, y)$$

The Gibbs sampler

In principle, the Gibbs sampler will work for extremely complex hierarchical models. The only issue is sampling from the full conditionals: when these are not available in closed form, they may not be amenable to easy sampling. A more general, extremely powerful, and often easier-to-code algorithm is the Metropolis-Hastings (MH) algorithm. This algorithm also constructs a Markov chain, but does not necessarily care about full conditionals.

The Metropolis-Hastings algorithm

The Metropolis-Hastings algorithm:

Start with an initial value $\theta^{(0)}$.
Select a candidate or proposal distribution $q$ from which to propose a value $\theta^*$ at the $j$-th iteration: $\theta^* \sim q(\theta^{(j-1)}, \nu)$. For example, $q(\theta^{(j-1)}, \nu) = N(\theta^{(j-1)}, \nu)$ with $\nu$ fixed.
Compute
$$r = \frac{p(\theta^* \mid y)\, q(\theta^{(j-1)} \mid \theta^*, \nu)}{p(\theta^{(j-1)} \mid y)\, q(\theta^* \mid \theta^{(j-1)}, \nu)}.$$
If $r \geq 1$, set $\theta^{(j)} = \theta^*$. If $r < 1$, draw $U \sim U(0, 1)$: if $U \leq r$, set $\theta^{(j)} = \theta^*$; otherwise, set $\theta^{(j)} = \theta^{(j-1)}$.
Repeat for $j = 1, \ldots, M$.

This yields $\theta^{(1)}, \ldots, \theta^{(M)}$, which, after a burn-in period, will be samples from the true posterior distribution. It is important to monitor the acceptance rate of the sampler through the iterations. Rough recommendations: around 20% for vector updates, around 40% for scalar updates. This can be controlled by tuning $\nu$.

Popular approach: embed Metropolis steps within Gibbs to draw from full conditionals that cannot be generated from directly.
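A random-walk Metropolis sketch in plain Python. Because the proposal is symmetric, the $q$ terms in $r$ cancel, so only the unnormalized (log) posterior is needed; the standard-normal target and the tuning value $\nu = 2.4$ are illustrative choices, not from the slides.

```python
import math
import random
import statistics

random.seed(3)

def log_post(theta):
    """Unnormalized log posterior; here a standard normal target."""
    return -0.5 * theta ** 2

nu = 2.4                          # proposal standard deviation: the tuning parameter
M, burn = 50000, 1000
theta, accepted, draws = 0.0, 0, []
for j in range(M):
    prop = random.gauss(theta, nu)                # propose from N(theta^(j-1), nu^2)
    # Accept with probability min(1, r); on the log scale, log U < log r covers both cases
    if math.log(random.random()) < log_post(prop) - log_post(theta):
        theta, accepted = prop, accepted + 1
    if j >= burn:
        draws.append(theta)

rate = accepted / M
post_mean = statistics.fmean(draws)
post_var = statistics.pvariance(draws)
```

With this $\nu$ the acceptance rate should land in the 30-50% range, consistent with the rough scalar-update recommendation above, and the retained draws approximate the $N(0, 1)$ target.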

Some remarks on sampling-based inference

Direct Monte Carlo: some algorithms (e.g. composition sampling) can generate independent samples exactly from the posterior distribution. In these situations there are NO convergence problems or issues; sampling is called exact.

Markov chain Monte Carlo (MCMC): in general, exact sampling may not be possible or feasible. MCMC is a far more versatile set of algorithms that can be invoked to fit more general models. Note: wherever direct Monte Carlo applies, MCMC will provide excellent results too.

Convergence issues: there is no free lunch! The power of MCMC comes at a cost: the initial samples do not necessarily come from the desired posterior distribution; rather, the chain needs to converge to the true posterior distribution. Therefore, one needs to assess convergence, discard output from before convergence, and retain only post-convergence samples. The time to convergence is called burn-in.

Diagnosing convergence: usually a few parallel chains are run from rather different starting points. The sampled values are plotted (trace-plots) for each of the chains, and the time for the chains to mix together is taken as the time to convergence.

Good news! All this is automated in WinBUGS. So, as users, we need only learn how to specify good Bayesian models and implement them in WinBUGS.