Principles of Bayesian Inference
Sudipto Banerjee 1 and Andrew O. Finley 2
1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A.
2 Department of Forestry & Department of Geography, Michigan State University, Lansing, Michigan, U.S.A.
March 3, 2010

Bayesian principles
Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters as random, and thus having distributions (just like the data). We can thus think about unknowns for which no reliable frequentist experiment exists, e.g. θ = proportion of US men with untreated prostate cancer. A Bayesian writes down a prior guess for the parameter(s) θ, say p(θ), and then combines this with the information provided by the observed data y to obtain the posterior distribution of θ, which we denote by p(θ | y). All statistical inferences (point and interval estimates, hypothesis tests) then follow from posterior summaries. For example, the posterior mean/median/mode offers a point estimate of θ, while the quantiles yield credible intervals.

Bayesian principles
The key to Bayesian inference is learning, or updating, of prior beliefs. Thus, posterior information ≥ prior information. Is the classical approach wrong? That may be a controversial statement, but it certainly is fair to say that the classical approach is limited in scope. The Bayesian approach expands the class of models and easily handles: repeated measures, unbalanced or missing data, nonhomogeneous variances, multivariate data, and many other settings that are precluded (or much more complicated) in classical settings.

Basics of Bayesian inference
We start with a model (likelihood) f(y | θ) for the observed data y = (y_1, ..., y_n) given unknown parameters θ (perhaps a collection of several parameters). Add a prior distribution p(θ | λ), where λ is a vector of hyper-parameters. The posterior distribution of θ is given by:
p(θ | y, λ) = p(θ | λ) f(y | θ) / p(y | λ) = p(θ | λ) f(y | θ) / ∫ f(y | θ) p(θ | λ) dθ.
We refer to this formula as Bayes' Theorem.
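As a quick numerical illustration of Bayes' Theorem (not part of the original slides; the coin example and all numbers below are made up), suppose θ can take only three values and we observe y successes in n Bernoulli trials:

```python
# Hypothetical example: a coin's success probability theta is one of three
# values; we observe y = 7 successes in n = 10 trials and update a uniform prior.
import numpy as np
from scipy.stats import binom

theta = np.array([0.25, 0.50, 0.75])       # candidate parameter values
prior = np.array([1/3, 1/3, 1/3])          # p(theta)
y, n = 7, 10                               # observed data

likelihood = binom.pmf(y, n, theta)        # f(y | theta)
evidence = np.sum(likelihood * prior)      # p(y) = sum over theta of f(y|theta) p(theta)
posterior = likelihood * prior / evidence  # Bayes' Theorem

for t, p in zip(theta, posterior):
    print(f"p(theta = {t:.2f} | y) = {p:.3f}")
```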

Basics of Bayesian inference
Calculations (numerical and algebraic) are usually required only up to a proportionality constant. We therefore write the posterior as:
p(θ | y, λ) ∝ p(θ | λ) f(y | θ).
If λ is known/fixed, then the above represents the desired posterior. If, however, λ is unknown, we assign a prior p(λ) and seek:
p(θ, λ | y) ∝ p(λ) p(θ | λ) f(y | θ).
The proportionality constant does not depend upon θ or λ:
p(y) = ∫∫ p(λ) p(θ | λ) f(y | θ) dλ dθ.
The above represents a joint posterior from a hierarchical model. The marginal posterior distribution for θ is:
p(θ | y) ∝ ∫ p(λ) p(θ | λ) f(y | θ) dλ.
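The hierarchical calculation above can be mimicked on a grid. The sketch below is illustrative only; the particular normal model for y, θ and λ is assumed, not taken from the slides. It evaluates p(λ) p(θ | λ) f(y | θ), normalizes numerically (the constant p(y)), and integrates out λ:

```python
# Assumed toy model: y | theta ~ N(theta, 1), theta | lambda ~ N(lambda, 1),
# lambda ~ N(0, 2^2). Grid approximation of the joint and marginal posteriors.
import numpy as np
from scipy.stats import norm

y = 1.5
theta = np.linspace(-6, 6, 601)
lam = np.linspace(-6, 6, 601)
T, L = np.meshgrid(theta, lam, indexing="ij")

unnormalized = norm.pdf(L, 0, 2) * norm.pdf(T, L, 1) * norm.pdf(y, T, 1)
d_theta = theta[1] - theta[0]
d_lam = lam[1] - lam[0]
p_y = unnormalized.sum() * d_theta * d_lam            # p(y), the normalizing constant
joint_posterior = unnormalized / p_y                  # p(theta, lambda | y)
marginal_theta = joint_posterior.sum(axis=1) * d_lam  # p(theta | y), lambda integrated out

print("posterior mean of theta:", np.sum(theta * marginal_theta) * d_theta)
```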

Bayesian inference: point estimation
Point estimation is easy: simply choose an appropriate summary of the posterior distribution: the posterior mean, median or mode. The mode is sometimes easy to compute (no integration, simply optimization), but often misrepresents the middle of the distribution, especially for one-tailed distributions. The mean is easy to compute, but has the opposite problem: it chases tails. The median is probably the best compromise, being robust to tail behaviour, although it may be awkward to compute since it solves:
∫_{−∞}^{θ_median} p(θ | y) dθ = 1/2.
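Given Monte Carlo draws from a posterior, each summary is one line of code. The Gamma draws below merely stand in for samples from some right-skewed posterior p(θ | y); they come from no model in these notes:

```python
# Posterior point estimates from (stand-in) posterior draws.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.gamma(shape=2.0, scale=1.0, size=100_000)  # stand-in right-skewed posterior

post_mean = samples.mean()        # chases the long right tail
post_median = np.median(samples)  # robust compromise
# crude mode estimate: midpoint of the tallest histogram bin
counts, edges = np.histogram(samples, bins=200)
post_mode = 0.5 * (edges[np.argmax(counts)] + edges[np.argmax(counts) + 1])

print(post_mean, post_median, post_mode)  # mode < median < mean for this skewed case
```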

Bayesian inference: interval estimation
The most popular method of inference in practical Bayesian modelling is interval estimation using credible sets. A 100(1 − α)% credible set C for θ is a set that satisfies:
P(θ ∈ C | y) = ∫_C p(θ | y) dθ ≥ 1 − α.
The most popular credible set is the simple equal-tail interval estimate (q_L, q_U) such that:
∫_{−∞}^{q_L} p(θ | y) dθ = α/2 = ∫_{q_U}^{∞} p(θ | y) dθ.
Then clearly P(θ ∈ (q_L, q_U) | y) = 1 − α. This interval is relatively easy to compute and has a direct interpretation: the probability that θ lies between q_L and q_U is 1 − α. The frequentist interpretation is extremely convoluted.
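Given posterior draws, the equal-tail interval is just a pair of sample quantiles (again using stand-in Gamma draws, not a model from the slides):

```python
# Equal-tail 95% credible interval (alpha = 0.05) from posterior draws.
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.05
samples = rng.gamma(shape=2.0, scale=1.0, size=100_000)  # stand-in posterior draws

q_L, q_U = np.quantile(samples, [alpha / 2, 1 - alpha / 2])
print(f"P({q_L:.2f} < theta < {q_U:.2f} | y) is approximately {1 - alpha}")
```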

A simple example: Normal data and normal priors
Example: Consider a single data point y from a Normal distribution: y ~ N(θ, σ²); assume σ is known.
f(y | θ) = N(y | θ, σ²) = (1/(σ√(2π))) exp(−(y − θ)²/(2σ²)).
Prior: θ ~ N(µ, τ²), i.e. p(θ) = N(θ | µ, τ²); µ and τ² are known.
Posterior distribution of θ:
p(θ | y) ∝ N(θ | µ, τ²) × N(y | θ, σ²)
= N(θ | (µ/τ² + y/σ²) / (1/τ² + 1/σ²), 1/(1/τ² + 1/σ²))
= N(θ | (σ²/(σ² + τ²)) µ + (τ²/(σ² + τ²)) y, σ²τ²/(σ² + τ²)).

A simple example: Normal data and normal priors
Interpretation: the posterior mean is a weighted average of the prior mean and the data point; the direct estimate is shrunk towards the prior. What if you had n observations instead of one in the earlier set-up? Say y = (y_1, ..., y_n), iid, where y_i ~ N(θ, σ²). Then ȳ is a sufficient statistic for θ, and ȳ ~ N(θ, σ²/n).
Posterior distribution of θ:
p(θ | y) ∝ N(θ | µ, τ²) × N(ȳ | θ, σ²/n)
= N(θ | (µ/τ² + nȳ/σ²) / (1/τ² + n/σ²), 1/(1/τ² + n/σ²))
= N(θ | (σ²/(σ² + nτ²)) µ + (nτ²/(σ² + nτ²)) ȳ, σ²τ²/(σ² + nτ²)).
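A small numerical check of the n-observation formula (the simulated data and the values of µ, τ² and σ² below are arbitrary illustrative choices):

```python
# Conjugate normal updating for n observations, following the formula above.
import numpy as np

rng = np.random.default_rng(1)
mu, tau2 = 0.0, 4.0          # prior: theta ~ N(mu, tau2)
sigma2 = 1.0                 # known data variance
n = 25
y = rng.normal(2.0, np.sqrt(sigma2), size=n)   # illustrative simulated data
ybar = y.mean()

post_var = sigma2 * tau2 / (sigma2 + n * tau2)
post_mean = (sigma2 / (sigma2 + n * tau2)) * mu + (n * tau2 / (sigma2 + n * tau2)) * ybar
print(post_mean, post_var)   # posterior mean is ybar shrunk slightly toward mu
```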

Another simple example: The Beta-Binomial model
Example: Let Y be the number of successes in n independent trials:
P(Y = y | θ) = f(y | θ) = (n choose y) θ^y (1 − θ)^(n−y).
Prior: p(θ) = Beta(θ | a, b): p(θ) ∝ θ^(a−1) (1 − θ)^(b−1).
Prior mean: µ = a/(a + b); variance: ab/((a + b)²(a + b + 1)).
Posterior distribution of θ:
p(θ | y) = Beta(θ | a + y, b + n − y).
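A numerical version of this update (the prior and data values are illustrative only):

```python
# Beta-Binomial updating: prior Beta(a, b), y successes in n trials gives
# posterior Beta(a + y, b + n - y).
from scipy.stats import beta

a, b = 2, 2          # prior hyperparameters
y, n = 7, 10         # observed successes / trials

post = beta(a + y, b + n - y)
print("posterior mean:", post.mean())
print("95% equal-tail interval:", post.ppf([0.025, 0.975]))
```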

Sampling-based inference
We will compute the posterior distribution p(θ | y) by drawing samples from it. This replaces numerical integration (quadrature) by Monte Carlo integration. One important advantage: we only need to know p(θ | y) up to the proportionality constant. Suppose θ = (θ_1, θ_2) and we know how to sample from the marginal posterior distribution p(θ_2 | y) and the conditional distribution p(θ_1 | θ_2, y). How do we draw samples from the joint distribution p(θ_1, θ_2 | y)?

Sampling-based inference
We do this in two stages using composition sampling:
First draw θ_2^(j) ~ p(θ_2 | y), j = 1, ..., M.
Next draw θ_1^(j) ~ p(θ_1 | θ_2^(j), y).
This sampling scheme produces exact samples {θ_1^(j), θ_2^(j)}, j = 1, ..., M, from the posterior distribution p(θ_1, θ_2 | y). Gelfand and Smith (JASA, 1990) demonstrated automatic marginalization: {θ_1^(j)} are samples from p(θ_1 | y) and (of course!) {θ_2^(j)} are samples from p(θ_2 | y). In effect, composition sampling has performed the integration:
p(θ_1 | y) = ∫ p(θ_1 | θ_2, y) p(θ_2 | y) dθ_2.
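As a concrete sketch (this particular model is not in the slides; it is just a familiar case where both required distributions have known forms), consider the normal model with unknown mean µ and variance σ² under the standard noninformative prior p(µ, σ²) ∝ 1/σ², so that σ² | y follows a scaled inverse chi-square distribution and µ | σ², y is normal:

```python
# Composition sampling for the normal model with unknown mean and variance,
# noninformative prior p(mu, sigma^2) proportional to 1/sigma^2.
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(5.0, 2.0, size=30)                 # illustrative data
n, ybar, s2 = len(y), y.mean(), y.var(ddof=1)
M = 10_000

# Stage 1: sigma^2 ~ p(sigma^2 | y), a scaled inverse chi-square draw
sigma2 = (n - 1) * s2 / rng.chisquare(n - 1, size=M)
# Stage 2: mu ~ p(mu | sigma^2, y) = N(ybar, sigma^2 / n) for each sigma^2 draw
mu = rng.normal(ybar, np.sqrt(sigma2 / n))

# {mu} are exact draws from p(mu | y); {sigma2} are exact draws from p(sigma^2 | y)
print(np.mean(mu), np.quantile(mu, [0.025, 0.975]))
```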

Bayesian predictions
Suppose we want to predict new observations, say ỹ, based upon the observed data y. We specify a joint probability model p(ỹ, y, θ), which defines the conditional predictive distribution:
p(ỹ | y, θ) = p(ỹ, y, θ) / p(y, θ).
Bayesian predictions follow from the posterior predictive distribution, which averages θ out of the conditional predictive distribution with respect to the posterior:
p(ỹ | y) = ∫ p(ỹ | y, θ) p(θ | y) dθ.
This can be evaluated using composition sampling:
First obtain θ^(j) ~ p(θ | y), j = 1, ..., M.
For j = 1, ..., M, sample ỹ^(j) ~ p(ỹ | y, θ^(j)).
The {ỹ^(j)}, j = 1, ..., M, are samples from the posterior predictive distribution p(ỹ | y).
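Continuing the Beta-Binomial example, posterior predictive draws for a future experiment of n_new trials are obtained exactly by composition sampling. (In that model ỹ is independent of y given θ, so p(ỹ | y, θ) = p(ỹ | θ); all numbers below are illustrative.)

```python
# Posterior predictive draws by composition sampling for the Beta-Binomial model.
import numpy as np

rng = np.random.default_rng(3)
a, b, y, n = 2, 2, 7, 10        # prior and observed data (illustrative)
n_new = 10                      # size of the future experiment
M = 10_000

theta = rng.beta(a + y, b + n - y, size=M)   # theta^(j) ~ p(theta | y)
y_new = rng.binomial(n_new, theta)           # ytilde^(j) ~ p(ytilde | theta^(j))

# {y_new} are draws from the posterior predictive p(ytilde | y)
print(np.bincount(y_new, minlength=n_new + 1) / M)
```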

Some remarks on sampling-based inference
Direct Monte Carlo: Some algorithms (e.g. composition sampling) can generate independent samples exactly from the posterior distribution. In these situations there are NO convergence problems or issues; the sampling is called exact.
Markov chain Monte Carlo (MCMC): In general, exact sampling may not be possible or feasible. MCMC is a far more versatile set of algorithms that can be invoked to fit more general models. Note: wherever direct Monte Carlo applies, MCMC will provide excellent results too.
Convergence issues: There is no free lunch! The power of MCMC comes at a cost. The initial samples do not necessarily come from the desired posterior distribution; rather, they need to converge to the true posterior distribution. Therefore, one needs to assess convergence, discard the output obtained before convergence, and retain only the post-convergence samples. The period before convergence is called burn-in.
Diagnosing convergence: Usually a few parallel chains are run from rather different starting points. The sampled values are plotted (trace plots) for each of the chains, and the time for the chains to mix together is taken as the time to convergence.
Good news! All this is automated in WinBUGS. So, as users, we only need to specify good Bayesian models and implement them in WinBUGS.
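For concreteness, here is a minimal random-walk Metropolis sketch (plain Python rather than WinBUGS) targeting the Beta(a + y, b + n − y) posterior from the Beta-Binomial slide, run as three parallel chains from dispersed starting values with a burn-in discarded:

```python
# Random-walk Metropolis targeting a Beta(9, 5) posterior (i.e. a=2, b=2, y=7, n=10),
# with three dispersed chains and burn-in removal.
import numpy as np

rng = np.random.default_rng(4)
a_post, b_post = 9, 5

def log_post(theta):
    if theta <= 0 or theta >= 1:
        return -np.inf                      # outside the support
    return (a_post - 1) * np.log(theta) + (b_post - 1) * np.log(1 - theta)

n_iter, burn_in = 5_000, 1_000
chains = []
for start in (0.05, 0.5, 0.95):             # rather different starting points
    theta, draws = start, []
    for _ in range(n_iter):
        prop = theta + rng.normal(0, 0.1)   # random-walk proposal
        if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
            theta = prop                    # accept; otherwise keep current value
        draws.append(theta)
    chains.append(np.array(draws[burn_in:]))  # retain only post-convergence samples

print([c.mean() for c in chains])           # should agree across chains (truth is 9/14)
```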