Bayesian inference
Rasmus Waagepetersen, Department of Mathematics, Aalborg University, Denmark
April 10, 2017

Outline for today
- A genetic example
- Bayes theorem
- Examples
- Priors
- Posterior summaries

Bayes theorem
Bayes theorem for events A, B:
P(A | B) = P(B | A) P(A) / P(B).
It combines the marginal probability of A with the conditional probability of B given A to obtain the conditional probability of A given B.
Bayes theorem for random variables X and Y:
f(x | y) = f(y | x) f(x) / f(y) ∝ f(y | x) f(x).
NB: c = f(y) is the normalizing constant for the unnormalized density h(x) = f(y | x) f(x).

Example: forensic statistics
Population of n individuals, each with blood type a or ā (not a). Population: {x_1, x_2, ..., x_n} where x_i = (i, t_i) and t_i is either a or ā.
Stochastic variables G and B: G = i means the ith person is guilty, and B is the blood type of the guilty person (given G = i, B = t_i).
Prior distribution for G: P(G = i) = p_i.
Suppose we know B = a. Then
P(G = i | B = a) = P(B = a | G = i) P(G = i) / P(B = a).
Note P(B = a | G = l) = 1 if t_l = a and zero otherwise. Hence, if t_i = a,
P(G = i | B = a) = p_i / Σ_{l: t_l = a} p_l.
Note P(B = a) may differ from the proportion of the population with blood type a!
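As a numerical illustration (not from the slides), a minimal Python sketch of the posterior probabilities of guilt, using a made-up population of four individuals and a uniform prior:

```python
# Made-up population of four individuals with blood types t_i and uniform prior p_i
blood_types = ["a", "not_a", "a", "a"]
prior = [0.25, 0.25, 0.25, 0.25]

# Likelihood P(B = a | G = i) is 1 if t_i = a and 0 otherwise
likelihood = [1.0 if t == "a" else 0.0 for t in blood_types]

# Bayes theorem: posterior proportional to likelihood * prior, then normalize
unnormalized = [li * pi for li, pi in zip(likelihood, prior)]
posterior = [u / sum(unnormalized) for u in unnormalized]
print(posterior)  # [1/3, 0, 1/3, 1/3]: guilt shared equally among the type-a individuals
```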

The idea of Bayesian inference
Idea: in order to infer an unknown quantity θ we should combine the information in the data with prior information (e.g. past experience).
Formal approach: prior information is expressed using a probability density p(θ) and the information in the data is quantified using the likelihood function. Inference given data is obtained via the posterior distribution (Bayes theorem):
p(θ | y) = f(y | θ) p(θ) / f(y) ∝ f(y | θ) p(θ).

NB: Bayesian inference mimics our everyday handling of uncertainty, where we implicitly combine sources of data with prior knowledge.
Advantage: enables the use of prior information when this is available.
Disadvantage: requires the use of prior information. This may be hard to obtain, or different people may hold different prior opinions.

Example: beta-binomial
Suppose we observe X ~ b(n, θ). Use a beta prior
p(θ) = [Γ(α + β) / (Γ(α)Γ(β))] θ^(α−1) (1 − θ)^(β−1).
Posterior:
p(θ | x) ∝ θ^x (1 − θ)^(n−x) θ^(α−1) (1 − θ)^(β−1) = θ^(x+α−1) (1 − θ)^(n−x+β−1).
Hence the posterior p(θ | x) is beta-distributed too: Beta(x + α, n − x + β)!
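A minimal sketch of this conjugate update in Python (the values X = 3 and n = 10 match the plots on the following slides; scipy is assumed available):

```python
from scipy.stats import beta

x, n = 3, 10
a, b = 1.0, 1.0                      # Beta(1, 1), i.e. a uniform prior
posterior = beta(x + a, n - x + b)   # conjugate update: Beta(x + alpha, n - x + beta)

print(posterior.mean())              # posterior mean (x + a)/(n + a + b) = 1/3
print(posterior.interval(0.95))      # central 95% posterior interval for theta
```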

Plots of prior, likelihood and posterior when X = 3 and n = 10 for different choices of (α, β): (1, 1) (uniform/flat) and (0.5, 0.5) (symmetric). [Figure: prior, likelihood and posterior densities as functions of theta.]

Corresponding plots for (α, β) = (8, 2) and (2, 8). [Figure: prior, likelihood and posterior densities as functions of theta.]
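The four panels can be reproduced with a short script along these lines (a sketch, assuming numpy, scipy and matplotlib; the likelihood is rescaled to integrate to one so the three curves are comparable):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta, binom

x, n = 3, 10
theta = np.linspace(0.001, 0.999, 400)

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax, (a, b) in zip(axes.flat, [(1, 1), (0.5, 0.5), (8, 2), (2, 8)]):
    lik = binom.pmf(x, n, theta)
    lik /= lik.sum() * (theta[1] - theta[0])              # rescale likelihood for plotting
    ax.plot(theta, lik, label="likelihood")
    ax.plot(theta, beta.pdf(theta, a, b), label="prior")
    ax.plot(theta, beta.pdf(theta, x + a, n - x + b), label="posterior")
    ax.set_title(f"(alpha, beta) = ({a}, {b})")
    ax.set_xlabel("theta")
    ax.legend()
plt.tight_layout()
plt.show()
```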

Conjugate prior distributions
The beta distribution is an example of a prior which is conjugate for the binomial likelihood: the posterior distribution is beta too!
Other examples: the gamma distribution is conjugate for the Poisson likelihood, and the normal/scaled inverse χ² is conjugate for the linear normal model.
Conjugate priors are only available in simple situations.

Poisson-gamma
Suppose Y_1, ..., Y_n | λ are independent Poisson with mean λ and we choose a Γ(α, β) prior (shape α, scale β) for λ. Posterior:
p(λ | y) ∝ λ^(y_•) exp(−nλ) λ^(α−1) exp(−λ/β) = λ^(y_•+α−1) exp(−λ/[β/(1 + nβ)]),
where y_• = Σ_i y_i. Hence the posterior for λ is Γ(y_• + α, β/(1 + nβ)).
Expressions for posterior means and variances for the binomial-beta and Poisson-gamma models can be found in Chapter 6 of M & T.
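A small sketch of this update in Python, with made-up data and a hypothetical Γ(2, 3) prior (prior mean 6, prior variance 18); scipy's gamma uses the same shape/scale parameterization as above:

```python
from scipy.stats import gamma

y = [3, 5, 2, 6, 4]                  # made-up Poisson counts
n, y_sum = len(y), sum(y)
alpha, beta_ = 2.0, 3.0              # hypothetical Gamma(shape = 2, scale = 3) prior

posterior = gamma(a=y_sum + alpha, scale=beta_ / (1 + n * beta_))
print(posterior.mean(), posterior.var())   # posterior mean and variance of lambda
print(posterior.interval(0.95))            # central 95% posterior interval
```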

Linear normal model
Y | β, σ² ~ N(Xβ, σ²I). Priors: β | σ² ~ N(0, φI) and σ² ~ S·χ⁻²(f) (scaled inverse χ²).
We already know from our treatment of linear mixed models that
β | σ², y ~ N( ((σ²/φ)I + XᵀX)⁻¹ XᵀY, σ² ((σ²/φ)I + XᵀX)⁻¹ ).
Note this converges to the proper limit N(β̂, σ²(XᵀX)⁻¹) as φ → ∞. Note the formal similarity with the frequentist result for the MLE β̂.
We can also show that σ² | y is scaled inverse χ². For convenience we do this in the improper prior case p(β | σ²) ∝ 1 on the next slide.
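A sketch (not from the slides) evaluating this conditional posterior on simulated data; the data and the values of σ² and φ are made up for illustration. The posterior mean has a ridge-type form and approaches the least-squares estimate β̂ as φ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

sigma2, phi = 0.25, 10.0                          # assumed sigma^2 and prior variance phi
A = (sigma2 / phi) * np.eye(p) + X.T @ X
post_mean = np.linalg.solve(A, X.T @ y)           # E(beta | sigma^2, y)
post_cov = sigma2 * np.linalg.inv(A)              # Var(beta | sigma^2, y)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)      # least-squares estimate, the phi -> infinity limit
print(post_mean, beta_hat, sep="\n")
```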

With p(β, σ²) ∝ (σ²)^(−f/2−1) exp(−S/(2σ²)) and using Pythagoras we obtain
p(β, σ² | y) ∝ exp( −(β − β̂)ᵀXᵀX(β − β̂)/(2σ²) ) (σ²)^(−(f+n)/2−1) exp( −(S + RSS)/(2σ²) ),
where RSS = ||Y − Xβ̂||² is the sum of squared residuals. From this we obtain
β | σ², y ~ N(β̂, σ²(XᵀX)⁻¹) and σ² | y ~ (RSS + S)·χ⁻²(f + n − p).
Hence, provided RSS > 0 and n − p > 0, the posterior is also proper with the improper prior p(β, σ²) ∝ 1/σ² (i.e. S = f = 0).

With R = (β − β̂)/σ we obtain R | σ², y ~ N(0, (XᵀX)⁻¹). Thus R and σ² are independent given y.
With s² = RSS/(n − p) and p(β, σ²) ∝ 1/σ²:
(β − β̂)/s = R·(σ/s), where R | y ~ N(0, (XᵀX)⁻¹) and σ²/s² | y ~ (n − p)·χ⁻²(n − p).
The product of an independent N(0, (XᵀX)⁻¹) vector and the square root of an (n − p)·χ⁻²(n − p) variable gives a p-dimensional t distribution with n − p degrees of freedom. Thus
(β − β̂)/s | y ~ t(p, (XᵀX)⁻¹, n − p).
With v_i the ith diagonal element of (XᵀX)⁻¹ we obtain
(β_i − β̂_i)/√(v_i s²) | y ~ t(n − p).
Note again the formal similarity with the frequentist t-statistic!
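Under the improper prior p(β, σ²) ∝ 1/σ², a 95% posterior interval for β_i can thus be computed directly from the marginal t(n − p) result. A sketch (the function name is made up); numerically it coincides with the usual frequentist confidence interval:

```python
import numpy as np
from scipy.stats import t

def posterior_interval(X, y, i, level=0.95):
    """Central posterior interval for beta_i under the improper 1/sigma^2 prior."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    s2 = np.sum((y - X @ beta_hat) ** 2) / (n - p)   # s^2 = RSS / (n - p)
    se = np.sqrt(XtX_inv[i, i] * s2)                 # sqrt(v_i s^2)
    q = t.ppf(0.5 + level / 2, df=n - p)
    return beta_hat[i] - q * se, beta_hat[i] + q * se
```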

Improper priors
The priors p(β) ∝ 1, β ∈ R^p, and p(σ²) ∝ 1/σ², σ² > 0, are improper (they do not integrate to one). In the case of a normal likelihood the posterior is nevertheless proper (these priors are limiting cases of the normal and scaled inverse χ² priors).
In complex models it may be hard to check that the posterior is proper when improper priors are used.
Note: with large datasets, posterior results are less sensitive to the choice of prior (the likelihood dominates).

Non-informative priors
Consider a flat prior for θ ∈ [0, 1]. The induced priors for the odds and the log odds are not flat! [Figure: induced prior densities for the odds o and the log odds l.]
Hence whether a prior is non-informative depends on the scale.
Rule of thumb: use non-informative priors on the scale on which we wish to draw inference.

The priors for the odds and log odds are obtained using the transformation theorem: suppose X ~ f_X and Y = h(X) for a differentiable and injective function h. Then the density of Y is
f_Y(y) = f_X(x) / |dy/dx|, where x = h⁻¹(y).
This is also valid in the multivariate case: then |·| is the determinant and dy/dx = [dy_i/dx_j]_{ij} is the Jacobian matrix of partial derivatives.
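A quick numerical check of the transformation theorem (a sketch, not from the slides) for the log odds l = log(θ/(1 − θ)) with a flat prior on θ: the induced density works out to exp(l)/(1 + exp(l))², which is clearly not flat.

```python
import numpy as np

l = np.linspace(-10, 10, 401)
theta = np.exp(l) / (1 + np.exp(l))        # theta = h^{-1}(l)
dl_dtheta = 1 / (theta * (1 - theta))      # derivative dl/dtheta of the log odds
prior_l = 1.0 / dl_dtheta                  # flat f(theta) = 1 divided by |dl/dtheta|
print(np.sum(prior_l) * (l[1] - l[0]))     # integrates to approximately one
```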

Summarizing the posterior
For a vector (θ_1, ..., θ_n) posterior summaries are often computed for the components separately. Hence for θ_i we may compute the posterior mean or median and express posterior uncertainty in terms of the posterior variance (not so useful if the posterior is far from normal).
Posterior 95% credibility interval: an interval [l, u] (depending on the data) such that P(θ_i ∈ [l, u] | y) = 95%. Often a central interval is used: P(θ_i < l | y) = P(θ_i > u | y) = 0.025.
95% highest posterior density (HPD) region: a set H chosen so that P(θ ∈ H | y) = 0.95 and p(θ | y) > p(θ̃ | y) whenever θ is inside H and θ̃ is outside.
More sophisticated possibilities: e.g. the posterior probability that θ_1 > θ_2, or looking at the ranks of the components of θ (e.g. which treatment is best?).
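For a unimodal posterior the 95% HPD interval is the shortest interval with 95% posterior probability. A grid-search sketch (not from the slides) contrasting it with the central interval for a skewed Beta(2, 8) posterior:

```python
import numpy as np
from scipy.stats import beta

post = beta(2, 8)                       # a skewed example posterior
central = post.interval(0.95)           # equal 2.5% tail probabilities

# HPD: among all intervals [ppf(a), ppf(a + 0.95)] pick the shortest one
a_grid = np.linspace(0, 0.05, 1001)
widths = post.ppf(a_grid + 0.95) - post.ppf(a_grid)
a_best = a_grid[np.argmin(widths)]
hpd = (post.ppf(a_best), post.ppf(a_best + 0.95))
print(central, hpd)                     # the HPD interval is shorter and shifted towards the mode
```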

Confidence intervals versus posterior intervals
95% confidence interval: a random interval which in 95% of future hypothetical repetitions of the experiment would contain the (fixed) unknown parameter θ (frequentist interpretation).
95% posterior interval: given the data y the interval is fixed while θ is random. The 95% probability associated with the posterior interval is the probability that θ is in the interval given the data. No reference to hypothetical repetitions of the experiment.

Exercises
1. Consider n iid binomial observations with common probability parameter θ. Compute the posterior distribution of θ when a beta prior is used for θ.
2. Suppose y | λ is Poisson(λ) and λ is Γ(α, β). Show that y marginally has a negative binomial distribution.
3. Compute the prior for p when logit(p) = log(p/(1 − p)) = β and the prior for β is N(0, τ²). What happens if τ² → ∞ (try to plot the prior for large τ²)?
4. Consider the linear normal model Y_i ~ N(β, σ²) (i.e. the design matrix X is a column of 1's) and use the prior p(β, σ²) ∝ 1/σ².
   4.1 Compute a 95% posterior credibility interval for β.
   4.2 Compare with the frequentist 95% confidence interval. What are the interpretations of the two intervals and how do they differ?

5. Suppose 4, 6, 6, 7, 3, 5, 3, 11, 10, 5 are observations of iid Poisson random variables with mean λ. Use a gamma prior with mean 6 and variance 10. Compute the posterior mean, the posterior variance, and a central 95% posterior interval for λ.

A few results needed for the exercises
The density of the Γ(α, β) distribution with shape α and scale β is
f(x; α, β) = x^(α−1) exp(−x/β) / (Γ(α) β^α), x > 0,
where Γ(·) is the gamma function. The mean and variance are αβ and αβ².
If β is instead interpreted as the rate (inverse scale), then
f(x; α, β) = (β^α / Γ(α)) x^(α−1) exp(−βx), x > 0.
The density of a negative binomial distribution with parameters α and β is
f(y) = [Γ(y + α) / (y! Γ(α))] (1/(1 + β))^α (β/(1 + β))^y, y = 0, 1, 2, ...