
Principles of Bayesian Inference
Sudipto Banerjee (1) and Andrew O. Finley (2)
(1) Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A.
(2) Department of Forestry & Department of Geography, Michigan State University, Lansing, Michigan, U.S.A.
October 30, 2013

Basic probability
We like to think of probability formally as a function that assigns a real number to an event. Let E and F be any events that might occur under an experimental setup H. Then a probability function P(·) satisfies:
P1: 0 ≤ P(E) ≤ 1 for all E.
P2: P(H) = 1.
P3: P(E ∪ F) = P(E) + P(F) whenever E and F cannot both occur. We usually write E ∩ F = ∅ and say the events are mutually exclusive.

Basic probability
If E is an event, we denote its complement (NOT E) by Ē or E^c. Since E ∩ Ē = ∅, we have from P2 and P3: P(Ē) = 1 − P(E).
Conditional probability of E given F: P(E | F) = P(E ∩ F) / P(F). Sometimes we write EF for E ∩ F.
Compound probability rule: rewrite the above as P(E | F) P(F) = P(EF).

Basic probability
Independent events: E and F are said to be independent if the occurrence of one does not change the probability of the other. Then P(E | F) = P(E) and we have the multiplication rule: P(EF) = P(E) P(F).
If P(E | F) = P(E), then P(F | E) = P(F).
Marginalization: we can express P(E) by marginalizing over the event F:
P(E) = P(EF) + P(EF̄) = P(F) P(E | F) + P(F̄) P(E | F̄).

Bayes Theorem
Observe that P(EF) = P(E | F) P(F) = P(F | E) P(E), so that
P(F | E) = P(F) P(E | F) / P(E).
This is Bayes Theorem, named after the Reverend Thomas Bayes, an English clergyman with a passion for gambling! Often it is written with the denominator expanded by marginalization:
P(F | E) = P(F) P(E | F) / [ P(F) P(E | F) + P(F̄) P(E | F̄) ].

Bayesian principles
Two hypotheses: H0: the excess relative risk for thrombosis among women taking a particular pill exceeds 2; H1: it is under 2.
Data collected from a controlled trial show a relative risk of x = 3.6.
The probability, or likelihood, of the data under each hypothesis is P(x | H), where H is H0 or H1.

Bayesian principles
Bayes Theorem updates the probability of each hypothesis:
P(H | x) = P(H) P(x | H) / P(x), for H ∈ {H0, H1}.
Marginal probability: P(x) = P(H0) P(x | H0) + P(H1) P(x | H1).
Re-expressed: P(H | x) ∝ P(H) P(x | H), for H ∈ {H0, H1}.
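A minimal R sketch of this update (the prior weights and likelihood values below are hypothetical placeholders for illustration only; the slides do not give numerical values):

# hypothetical prior probabilities for H0 and H1
prior <- c(H0 = 0.5, H1 = 0.5)
# hypothetical likelihood values P(x | H0) and P(x | H1) for the observed x = 3.6
lik <- c(H0 = 0.10, H1 = 0.25)
# Bayes Theorem: posterior proportional to prior * likelihood, normalized by P(x)
post <- prior * lik / sum(prior * lik)
post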

Bayesian principles: Likelihood and Prior
Bayes Theorem in English:
posterior distribution = (prior × likelihood) / Σ (prior × likelihood).
The denominator is summed (or integrated) over all possible parameter values. It is a fixed normalizing factor that is (usually) extremely difficult to evaluate: the curse of dimensionality. Markov chain Monte Carlo to the rescue!
WinBUGS software: www.mrc-bsu.cam.ac.uk/bugs/welcome.shtml

Bayesian principles: An Example
A clinician is interested in π, the proportion of children between ages 5 and 9 in a particular population having asthma symptoms.
The clinician has prior beliefs about π, summarized as a prior support (a set of plausible values) and prior weights.
Data: a random sample of 15 children shows 2 having asthma symptoms.
The likelihood is obtained from the Binomial distribution: (15 choose 2) π^2 (1 − π)^13.
Note: (15 choose 2) is a constant and can be ignored in the computations (though it is accounted for in the next table).

Bayesian principles

Prior support   Prior weight   Likelihood   Prior × Likelihood   Posterior
0.10            0.10           0.267        0.027                0.098
0.12            0.15           0.287        0.043                0.157
0.14            0.25           0.290        0.072                0.265
0.16            0.25           0.279        0.070                0.255
0.18            0.15           0.258        0.039                0.141
0.20            0.10           0.231        0.023                0.084
Total           1.00                        0.274                1.000

Posterior: obtained by dividing Prior × Likelihood by the normalizing constant 0.274.
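The table can be reproduced in a few lines of R; this is only a sketch, using the prior support and weights above and the Binomial(15, π) likelihood for y = 2 (dbinom includes the (15 choose 2) constant, which cancels after normalization):

support <- c(0.10, 0.12, 0.14, 0.16, 0.18, 0.20)   # prior support
prior   <- c(0.10, 0.15, 0.25, 0.25, 0.15, 0.10)   # prior weights
lik     <- dbinom(2, size = 15, prob = support)    # Binomial likelihood at each support point
prior_lik <- prior * lik
post    <- prior_lik / sum(prior_lik)              # normalize by sum(prior_lik) = 0.274
round(cbind(support, prior, lik, prior_lik, post), 3)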

Bayesian principles
Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters as random and, thus, as having distributions (just like the data). We can therefore think about unknowns for which no reliable frequentist experiment exists, e.g. θ = proportion of US men with untreated prostate cancer.
A Bayesian writes down a prior guess for the parameter(s) θ, say p(θ), and then combines this with the information provided by the observed data y to obtain the posterior distribution of θ, which we denote by p(θ | y).
All statistical inferences (point and interval estimates, hypothesis tests) then follow from posterior summaries. For example, the posterior mean/median/mode offer point estimates of θ, while the quantiles yield credible intervals.

Bayesian principles
The key to Bayesian inference is learning, or updating, of prior beliefs. Thus, posterior information ≥ prior information.
Is the classical approach wrong? That may be a controversial statement, but it certainly is fair to say that the classical approach is limited in scope.
The Bayesian approach expands the class of models and easily handles repeated measures, unbalanced or missing data, nonhomogeneous variances, multivariate data, and many other settings that are precluded (or much more complicated) in classical settings.

Basics of Bayesian inference
We start with a model (likelihood) f(y | θ) for the observed data y = (y1, ..., yn) given unknown parameters θ (perhaps a collection of several parameters).
Add a prior distribution p(θ | λ), where λ is a vector of hyperparameters.
The posterior distribution of θ is given by:
p(θ | y, λ) = p(θ | λ) f(y | θ) / p(y | λ) = p(θ | λ) f(y | θ) / ∫ f(y | θ) p(θ | λ) dθ.
We refer to this formula as Bayes Theorem.

Basics of Bayesian inference
Calculations (numerical and algebraic) are usually required only up to a proportionality constant. We therefore write the posterior as:
p(θ | y, λ) ∝ p(θ | λ) f(y | θ).
If λ is known/fixed, the above represents the desired posterior. If, however, λ is unknown, we assign it a prior, p(λ), and seek:
p(θ, λ | y) ∝ p(λ) p(θ | λ) f(y | θ).
The proportionality constant does not depend upon θ or λ:
1 / p(y) = 1 / ∫∫ p(λ) p(θ | λ) f(y | θ) dλ dθ.
The above represents a joint posterior from a hierarchical model. The marginal posterior distribution for θ is obtained by integrating out λ:
p(θ | y) ∝ ∫ p(λ) p(θ | λ) f(y | θ) dλ.

A simple example: Normal data and normal priors
Example: consider a single data point y from a Normal distribution: y ~ N(θ, σ^2); assume σ is known.
f(y | θ) = N(y | θ, σ^2) = (1 / (σ√(2π))) exp(−(y − θ)^2 / (2σ^2)).
Prior: θ ~ N(µ, τ^2), i.e. p(θ) = N(θ | µ, τ^2); µ and τ^2 are known.
Posterior distribution of θ:
p(θ | y) ∝ N(θ | µ, τ^2) × N(y | θ, σ^2)
= N(θ | ((1/τ^2) µ + (1/σ^2) y) / (1/σ^2 + 1/τ^2), 1 / (1/σ^2 + 1/τ^2))
= N(θ | (σ^2 / (σ^2 + τ^2)) µ + (τ^2 / (σ^2 + τ^2)) y, σ^2 τ^2 / (σ^2 + τ^2)).

A simple example: Normal data and normal priors
Interpretation: the posterior mean is a weighted average of the prior mean and the data point; the direct estimate is shrunk towards the prior.
What if you had n observations instead of one in the earlier setup? Say y = (y1, ..., yn) iid, where yi ~ N(θ, σ^2).
ȳ is a sufficient statistic for θ; ȳ ~ N(θ, σ^2/n).
Posterior distribution of θ:
p(θ | y) ∝ N(θ | µ, τ^2) × N(ȳ | θ, σ^2/n)
= N(θ | ((1/τ^2) µ + (n/σ^2) ȳ) / (n/σ^2 + 1/τ^2), 1 / (n/σ^2 + 1/τ^2))
= N(θ | (σ^2 / (σ^2 + nτ^2)) µ + (nτ^2 / (σ^2 + nτ^2)) ȳ, σ^2 τ^2 / (σ^2 + nτ^2)).
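A small R sketch of this posterior, wrapped as a function (the numbers in the example call are arbitrary placeholders):

# posterior for theta given ybar from n iid N(theta, sigma^2) observations
# and a N(mu, tau^2) prior, following the formula above
normal_post <- function(ybar, n, sigma, mu, tau) {
  v <- 1 / (n / sigma^2 + 1 / tau^2)           # posterior variance
  m <- v * (mu / tau^2 + n * ybar / sigma^2)   # posterior mean (weighted average)
  c(mean = m, var = v)
}
normal_post(ybar = 5, n = 20, sigma = 2, mu = 0, tau = 10)   # placeholder inputs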

A simple exercise using R
Consider the problem of estimating the current weight of a group of people. A sample of 10 people was taken and their average weight was calculated as ȳ = 176 lbs. Assume that the population standard deviation is known to be σ = 3. Assuming that the data y1, ..., y10 came from a N(θ, σ^2) population, perform the following:
Obtain a 95% confidence interval for θ using classical methods.
Assume a prior distribution for θ of the form N(µ, τ^2). Obtain 95% posterior credible intervals for θ for each of the cases: (a) µ = 176, τ = 8; (b) µ = 176, τ = 1000; (c) µ = 0, τ = 1000. Which case gives results closest to those obtained with the classical method?
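One possible R sketch of this exercise (the classical interval uses the usual normal formula since σ is known; the three prior cases are those listed above):

ybar <- 176; n <- 10; sigma <- 3
# classical 95% confidence interval for theta (sigma known)
ybar + c(-1, 1) * qnorm(0.975) * sigma / sqrt(n)
# 95% equal-tail posterior credible interval under a N(mu, tau^2) prior
cred_int <- function(mu, tau) {
  v <- 1 / (n / sigma^2 + 1 / tau^2)           # posterior variance
  m <- v * (mu / tau^2 + n * ybar / sigma^2)   # posterior mean
  qnorm(c(0.025, 0.975), mean = m, sd = sqrt(v))
}
cred_int(176, 8); cred_int(176, 1000); cred_int(0, 1000)   # cases (a), (b), (c)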

Another simple example: The Beta-Binomial model
Example: let Y be the number of successes in n independent trials.
P(Y = y | θ) = f(y | θ) = (n choose y) θ^y (1 − θ)^(n − y).
Prior: p(θ) = Beta(θ | a, b): p(θ) ∝ θ^(a − 1) (1 − θ)^(b − 1).
Prior mean: µ = a / (a + b); variance: ab / ((a + b)^2 (a + b + 1)).
Posterior distribution of θ:
p(θ | y) = Beta(θ | a + y, b + n − y).
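A minimal R sketch of the Beta-Binomial update, reusing the asthma data from earlier (y = 2 symptomatic children out of n = 15) with an illustrative uniform Beta(1, 1) prior (the slides do not fix a and b):

a <- 1; b <- 1                       # illustrative Beta(1, 1) prior
y <- 2; n <- 15                      # data: 2 successes in 15 trials
a_post <- a + y                      # posterior is Beta(a + y, b + n - y)
b_post <- b + n - y
a_post / (a_post + b_post)           # posterior mean
qbeta(c(0.025, 0.975), a_post, b_post)   # 95% equal-tail credible interval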

Bayesian inference: point estimation
Point estimation is easy: simply choose an appropriate summary of the posterior distribution: the posterior mean, median, or mode.
Mode: sometimes easy to compute (no integration, simply optimization), but it often misrepresents the middle of the distribution, especially for one-tailed distributions.
Mean: easy to compute, but it has the opposite problem of the mode: it chases the tails.
Median: probably the best compromise in being robust to tail behaviour, although it may be awkward to compute, as it requires solving:
∫_{−∞}^{θ_median} p(θ | y) dθ = 1/2.
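Continuing the Beta-Binomial sketch above, the posterior there is Beta(3, 14), and the three point estimates are one line each in R (the closed-form mode assumes both shape parameters exceed 1):

a <- 3; b <- 14                          # posterior Beta(a + y, b + n - y) from the sketch above
post_mean   <- a / (a + b)               # mean
post_median <- qbeta(0.5, a, b)          # median: solves the integral equation numerically
post_mode   <- (a - 1) / (a + b - 2)     # mode (valid here since a > 1 and b > 1)
c(mean = post_mean, median = post_median, mode = post_mode)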

Bayesian inference: interval estimation
The most popular method of inference in practical Bayesian modelling is interval estimation using credible sets. A 100(1 − α)% credible set C for θ is a set that satisfies:
P(θ ∈ C | y) = ∫_C p(θ | y) dθ ≥ 1 − α.
The most popular credible set is the simple equal-tail interval estimate (q_L, q_U) such that:
∫_{−∞}^{q_L} p(θ | y) dθ = α/2 = ∫_{q_U}^{∞} p(θ | y) dθ.
Then clearly P(θ ∈ (q_L, q_U) | y) = 1 − α. This interval is relatively easy to compute and has a direct interpretation: the probability that θ lies between q_L and q_U is 1 − α. The frequentist interpretation is extremely convoluted.
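In R, the equal-tail interval is just the α/2 and 1 − α/2 posterior quantiles; for the Beta(3, 14) posterior used in the sketches above this is a single call (again under the same illustrative prior):

alpha <- 0.05
qbeta(c(alpha / 2, 1 - alpha / 2), 3, 14)   # (q_L, q_U): 95% equal-tail credible interval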

Sampling-based inference
The previous example used direct evaluation of the posterior probabilities, which is feasible only for simpler problems.
Modern Bayesian analysis: explore a complete posterior density, say p(θ | y), by drawing samples from that density. The samples are of the parameters themselves, or of functions of them.
If θ_1, ..., θ_M are samples from p(θ | y), then density estimates are created by feeding them into a density plotter. Similarly, samples from f(θ), for some function f, are obtained by simply feeding the θ_i's to f(·).
In principle M can be arbitrarily large: it comes from the computer and depends only upon the time we have for the analysis. Do not confuse M with the data sample size n, which is limited by experimental constraints.
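A minimal sketch of sampling-based inference in R, again using the Beta(3, 14) posterior from the earlier sketches as a stand-in for p(θ | y): draw M samples, estimate the density, and push the samples through a function f(θ):

M <- 10000
theta <- rbeta(M, 3, 14)                 # M draws from the posterior p(theta | y)
plot(density(theta))                     # density estimate built from the samples
odds <- theta / (1 - theta)              # samples of f(theta) = theta / (1 - theta)
quantile(odds, c(0.025, 0.5, 0.975))     # posterior summaries of f(theta)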